
Written by Geoffrey R. Norman, PhD (McMaster University, Hamilton, Ontario) and David L. Streiner, PhD (University of Toronto)
This book translates the biostatistics of the health sciences literature into plain language, with clarity and irreverence. Students and practitioners alike applaud Biostatistics as the practical guide that exposes them to every statistical test they are likely to encounter, with careful conceptual explanations and a minimum of algebra.
The new Bare Essentials reflects recent advances in statistics as well as time-honored methods. For example, the authors describe hierarchical linear modeling, which first appeared in psychology journals and is only now making its way into the medical literature. Also new are a chapter on testing for equivalence and non-inferiority and a chapter on getting started with the statistics program SPSS.
Free of calculations and jargon, Bare Essentials speaks so plainly that you won’t need a technical dictionary. The objective is to enable you to determine whether research results are applicable to your own patients.
Throughout the guide, you’ll find highlights of areas in which researchers frequently misuse or misinterpret statistical tests. The authors have labeled these “C.R.A.P. Detectors” (Convoluted Reasoning and Anti-intellectual Pomposity), and they’ll help you to identify faulty methodology and misuse of statistics.

BIOSTATISTICS

The Bare Essentials Third Edition

A NOTE ON THE FRONT COVER The cover depicts the famous “Study of Human Proportion in the Manner of Vitruvius” by Leonardo da Vinci, drawn about 1490, and done to death 500 years later in 2000. Those with a classical bent may wish to know the origin of the idea. According to Renaissance notions, the “Perfect Man” was based on geometric principles. The arms outstretched, the top of the head, and the tip of the feet deﬁned a square, and the tips of the arms and legs outstretched in a fanlike position inscribed a circle centered on the navel. What da Vinci failed to notice is that the legs ﬁt precisely on a normal curve, with the mean between the two heels and the apex at the crotch, one standard deviation falling exactly on the two kneecaps, and the asymptotes at the corners of the inscribed square. The centers of the two feet, at the point where they intersect the arc of the circle, then determine the conventional criterion for statistical signiﬁcance at ± two standard deviations from the mean. Leonardo da Vinci can be forgiven, however. Statistics hadn’t been invented yet in 1492.

BIOSTATISTICS

The Bare Essentials

Third Edition

**Geoffrey R. Norman, PhD**

Professor, Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada

**David L. Streiner, PhD**

Assistant Vice-President, Research, and Director, Kunin-Lunenfeld Applied Research Unit, Baycrest Centre; Professor, Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada

with 122 illustrations

2008 B.C. Decker Inc Hamilton

BC Decker Inc, P.O. Box 620, L.C.D. 1, Hamilton, Ontario L8N 3K7
Tel: 905-522-7017; 800-568-7281  Fax: 905-522-7839; 888-311-4987
E-mail: info@bcdecker.com  www.bcdecker.com

© 2008 Geoffrey R. Norman and David L. Streiner. All rights reserved. Without limiting the rights under copyright reserved above, no part of this publication may be reproduced, stored in or introduced into a retrieval system, or transmitted, in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), without the prior written permission of the copyright holder.

08 09 10/IPP/9 8 7 6 5 4 3 2 1
ISBN 978-1-55009-347-6
Printed in India

Production Editor: Devorah Abrams Farmer; Typesetter: Norm Reid; Cover Designer: Mary McKeon


Notice: The authors and publisher have made every effort to ensure that the patient care recommended herein, including choice of drugs and drug dosages, is in accord with the accepted standard and practice at the time of publication. However, since research and regulation constantly change clinical standards, the reader is urged to check the product information sheet included in the package of each drug, which includes recommended doses, warnings, and contraindications. This is particularly important with new or infrequently used drugs. Any treatment regimen, particularly one involving medication, involves inherent risk that must be weighed on a case-by-case basis against the beneﬁts anticipated. The reader is cautioned that the purpose of this book is to inform and enlighten; the information contained herein is not intended as, and should not be employed as, a substitute for individual diagnosis and treatment.

To two people whose hard work, patience, diligence, and, most important, unﬂagging good humor, have made it possible: Geoff R. Norman and David L. Streiner

Too many people confuse being serious with being solemn. John Cleese

One of the ﬁrst symptoms of an approaching nervous breakdown is the belief that one’s work is terribly important. Bertrand Russell

Most researchers use statistics the way a drunkard uses a lamppost—more for support than illumination. Winifred Castle

PREFACE TO THE THIRD EDITION

Well, we're back, yet again. In the preface to the second edition, we wrote that the half-life of statistical knowledge is comparable to the life span of an elephant. Now, maybe it's just a sign of our incipient decrepitude, but it seems that the pace of developments in statistics has been increasing of late. Techniques that were just a glimmer in our eye when we worked on the previous edition seven years ago, such as hierarchical linear modeling, are now found in nearly all psychology journals and are starting to creep into biomedical ones. Also, when we first wrote this book, we thought it would be used by people as a sort of underground resource, to be read behind the backs of their instructors, that would help explain the "real," "grown-up" statistics books. We've been pleasantly surprised and delighted by the number of e-mails we get from people that indicate that this is the only stats book they have on their shelves, or at least the only one they bother to look at.

So, the effect of these two forces has led to this third edition. In response to the first issue, we've taken topics such as growth curve analysis, which had been a section in the chapter on measuring change, and given it its own chapter, under the more up-to-date heading of hierarchical linear modeling. As for making this more of a textbook, two additions are obvious: one new chapter on testing for equivalence and non-inferiority, and another on how to get started using the computer statistics program SPSS, because most of the chapters end with a section on using SPSS to run the tests. Other changes are less obvious, because they're buried within the chapters. For example, with an increasing number of journals, it's not sufficient to run a test of significance; they now require authors to supplement statistical tests with confidence intervals and some indication of effect size, so we've added these whenever possible. We've added a lot more to the chapters on regression and ANOVA (analysis of variance), and, in short, turned an excellent book into a fantastic one.

One thing hasn't changed (we hope), and that's our belief that having the word statistics in the same sentence as funny, irreverent, and possibly obscene doesn't constitute an oxymoron. We still regard as our highest compliment the fact that one student had to buy another copy of the book, because, while reading her first copy in the bathtub (don't ask), she laughed so hard that she dropped it. So, learn and laugh along the way.

DLS GRN

PREFACE TO THE SECOND EDITION

We have been extremely pleased by the positive comments we have received about the first edition of Biostatistics: The Bare Essentials.1 It is very gratifying to get e-mail messages out of the blue telling us that, for the first time, people really understand what statistics are all about and are having fun learning it—almost as gratifying as getting royalty checks in the snail-mail.

We debated for a long time whether we should write a second edition. Our hesitation was due to two considerations. First, if the half-life of medical knowledge is about 5 years, then the half-life of statistical knowledge must be longer than the life span of an elephant. After all, we are still using the correlation coefficient that was proposed by Galton (and he died in 1911), and the work done by Ronald Fisher in the early 1900s continues to provide the basic core of statistics. Second, there were other things we wanted to do with our lives, such as eating, sleeping, and seeing our families.

So, what made us decide to do a second edition? For one thing, statistics have changed. Path analysis and structural equation modeling have been around for about a quarter of a century, but with the recent introduction of programs that do these easily on a desktop computer, their use has proliferated over the past few years. It is almost impossible to read any journal in fields such as psychology without seeing at least one example in each issue. The same is true for other computer-intensive techniques, such as logistic regression and multivariate analysis of variance; so, we have added chapters on all of these subjects. Now, after nearly a half century of debate, we are starting to reach some consensus about the best way to measure change, and this deserves its own chapter. Writing a second edition has also allowed us to correct the mistakes that we and others have discovered over the years. However, writing new chapters has also offered us the opportunity to make new ones, so keep your eyes open and the e-mails coming.

In closing, we would like to thank three people who have been especially diligent in pointing out our mistakes and in reading drafts of some of the new chapters: Bill Marks from Villanova University, Kathleen Wyrwich from St. Louis University, and Jose Luis Saiz from Universidad de la Frontera.

1. And even more pleased by the reaction to the back cover. In response to many inquiries about it, yes, Geoff really does have four arms.

GRN DLS

PREFACE TO THE FIRST EDITION

Are congratulations in order? Have you finally overcome those years of denial about your ignorance of statistics, those many embarrassing incidents at scientific meetings, those offhand comments at drug company receptions when someone dropped tidbits like "analysis of covariance" into the conversation and you had to admit your bewilderment? Are you prepared to recognize your condition and deal with your problem? Face it, you are a photonumerophobic!1 Now that you have come out of the closet (clinic), we are here to help.

To begin, it would be useful for you to understand that all statisticians are not created equal, and as a result all statistics books are not equal.2 An analogy with home renovation might help. Three basic types of folks are involved in home renovation. First there are architects, who design houses that no one except dermatologists can afford—they worry about concepts, esthetics, and design at the theory level. Next there are carpenters who do home renovations, are highly specialized and skilled,3 and have a special language consisting of terms such as plates, sills, rafters, sheathing, R28, and the like that describe goings on at the practical level.4 Finally, there are the do-it-yourselfers (DIYers), who have the temerity to sally forth in blissful ignorance and make their own additions. Now, the fact of the matter is that it isn't all that difficult to put a nail into a 2 × 4, or to do anything else related to foundations, walls, ceilings, plumbing, and wiring. But a frustration for accomplished DIYers is that the books on do-it-yourselfing are written either by the architects, or by carpenters, but not by really good DIYers, and they all miss the mark. So, you either get pieces about the esthetic considerations involved in a $200,000 bathroom renovation, or a DIY book that starts and stops with "How to change a fuse."

Unfortunately, the same conventions hold in statistics. There are the architects of statistics—card-carrying PhDs who contribute to the theory of statistics and publish journal articles in Biometrika or little monographs to be read only by other members of this closed community. Then there are the carpenters—the most common species. They usually have a PhD in statistics, but they don't actually contribute to the discipline base of statistics—they just do statistics. They don't usually publish articles in statistics journals, beyond the cookbook recipes. Then there are the DIYers—folks like us who have arrived at statistics by the back door through disciplines such as psychology or education. With the advent of modern statistical packages and PCs, nearly anyone can be a do-it-yourself statistician—even you. Note that we are assuming in this book, unlike many other statistics books, that you will not actually do statistics. No one except students in statistics courses has done an analysis of variance for 20 years. If God had meant people to do statistics, He wouldn't have invented computers.

This description reveals two problems with the present state of affairs. First, doing statistics really is easier now than doing plumbing, but unfortunately errors are much better hidden—there is no statistical equivalent of a leaky pipe. Also, there is no building inspector or building code in statistics, although journal editors wish there were. Secondly, most Do-It-Yourself stats books are written by tradesmen (oops, that should be "tradespersons"). They are a possessive lot and likely feel a little guilty that they, too, don't publish in Biometrika.5 So, they commit two fundamental errors. First, they cannot resist dazzling you with the mysteries of the game and subliminally impressing you with the incredible intelligence that they must have had to master the field. This is achieved by sprinkling technical lingo throughout the book, doing lots and lots of derivations and algebra to make it look like science, and, above all, writing in a stilted, formal, and ultimately unreadably boring prose, as if this is a prerequisite for credibility. That is one type of statistics book—until recently, in the majority.

There is a second strategy, however. Recognizing that no one in possession of his or her senses would actually lay out hard-earned cash to buy such a book,6 a number of carpenters have begun to publish little thin books, with lively prose and with a sincere hope of demystifying the field and making good royalties. The only problem is that they usually presume that the really contemporary stuff of statistics is much too complicated for the average DIYer to comprehend. As a result, these books begin, and end, with statistical methods that were popular around the turn of the last century. An argument used to justify such books goes like, "We have carefully surveyed the biomedical literature, and contemporary and powerful methods like factor analysis are used only rarely, so we are just teaching methods that appear commonly." The circular nature of this argument somehow escapes them.7

We have news for you. Contemporary statistics are not all that complicated; in fact, now that computers are around to do all the dirty work, it's much less painful than in yesteryear. Certainly compared to physiology or physics, it's pain free. But an author has to approach it with a genuine desire to try very hard to explain it.

Let us just return to the DIY analogy one last time. There are really two types of activities that accomplished DIYers get involved in. For some chores on the house, they want to be sufficiently informed that they can hire a professional and feel confident that they will recognize when it is done well or poorly. That is, they know they can't do it all on their own, but they know enough to be able to tell shoddy workmanship when they see it. Other tasks they may decide to complete themselves. Again, for the biomedical researcher confronted with statistics, both avenues are open. On the one hand, it is a prerequisite, in examining the analyses conducted by others, to be able to understand when it was done well or poorly, even though one may choose to not do it oneself. On the other hand, with the flexibility and ease of many contemporary statistics packages, just about anyone can now get involved in the doing of statistics.

Our first book, PDQ Statistics (Norman and Streiner, 1986), was written to satisfy consumers of statistics. We found that it was possible to explain most of contemporary statistics at the conceptual level, with little recourse to algebra and proofs. However, it does take somewhat more knowledge and skill to do something—plumbing, wiring, or statistics—than it does to recognize when others are doing it well or poorly. That, then, is the intent of this book. If you never intend to do statistics, save a few bucks and buy PDQ. However, if you are actually involved in research, or if you have had your appetite whetted by PDQ or some other introductory book, pay the salesperson for this book and carry on.

Some comments about the format of the book. A perusal of the contents reveals that it is laid out much as any other traditional stats book. We contemplated doing it in problem-based fashion, both because we come from a problem-based medical school and also because it would sound contemporary and sell more books (we never said we were in it for altruism). But this would constitute, in our view, a debasement of the meaning of problem-based learning (PBL). This book is a resource, not a curriculum. By all means, we urge the reader to consult it when there is a statistical problem around, thereby doing PBL. But PBL does not dictate the format of the resources—all medical students, wherever they are, still engorge Harrison and the Merck Manual. We felt that we could better explain the conceptual underpinnings by following the traditional sequence.

Some differences go beyond style. Most chapters begin with an example to set the stage.

Usually the examples were dreamt up in our fertile imaginations and are, we hope, entertaining. Occasionally we reverted to real-world data, simply because sometimes the real world is at least as bizarre as anything imagination could invent. Although many reviews of statistics books praise the users of real examples and castigate others, we are unapologetic in our decision for several reasons: (1) the book is aimed at all types of health professionals, and we didn't want to waste your time and ours explaining the intricacies of podiatry for others; (2) the real world is a messy place, and it is difficult, or well nigh impossible, to locate real examples that illustrate the pedagogic points simply;8 and (3) we happen to believe, and can cite good psychological evidence to back it up, that memorable (read "bizarre") examples are a potent ally in learning and remembering concepts.

There are far more equations here than in PDQ, although we have still tried to keep these to a minimum. Our excuse is simply that this is the language of statistics; if we try to avoid it altogether, we end up with such convoluted prose that the message gets lost in the medium. But we continue to try very hard to explain the underlying concept, instead of simply dropping a formula in your lap.

There are a few other distinctive features. We have retained the idea of C.R.A.P. Detectors9 from PDQ as a way to help you see the errors of others' (and your own) ways. We have included computer notes at the end of most chapters10 to help you with one of the more common and powerful statistical programs—SPSS (Statistical Package for the Social Sciences). Finally, we acknowledge that many clinical investigators use most of their skills to get grants so that they can hire someone else to do statistics. Also, it is impossible to squeeze money out of most federal, state, or provincial agencies without an impressive sample size calculation.11 That means, of course, that the only analysis many biomedical researchers do is the sample size calculations in their grant proposals. Recognizing this harsh reality, every chapter has a section devoted to sample size calculations (when these are available) so you will be as good as the next person at befuddling the grant reviewers.

On the issue of format, you will already have noticed that the book has an excessively wide outside margin. This is not a publisher's error or an attempt to salvage the pulp and paper industry. Instead, it accomplishes two things: (1) we can use the margin for rubrics,12 expanding on things of slightly peripheral interest, or inflicting our base humor on the reader; and (2) you can use it to make your own notes if you don't like ours.

Finally, on the issue of style. You might have already noticed that we have cultivated a somewhat irreverent tone, which we will proceed to apply as we see fit to all folks who have the misfortune to appear in these pages—statisticians, physicians, administrators, nurses, physiotherapists, psychologists, and social workers.


We recognize that we run a certain risk of offending the "allied"13 health professionals, who have historically felt somewhat downtrodden, with good reason, by folks with MD after their name. However, we felt the risk was greater if we omitted them altogether. Fear no evil, all ye downtrodden—our intent is not racist, sexist, or otherwise prejudiced. We will attempt, as much as possible, to insult all professions equally.14

**Notes on the Computer Notes**
We are of the firm belief that our mothers didn't raise us to waste our time doing calculations by hand; that's why we have computers and computerized statistical packages. However, learning the arcane code words demanded by many of these programs can be as intimidating as learning statistics itself. So, in our never-ending quest to be as helpful as possible, we've supplied the commands necessary to make one of these programs bow to your wishes.

A few years ago, it would have been a simple job to choose which programs to include; because there were only three or four that could be run on desktop computers, we could have included all of them and been seen as comprehensive and erudite. Now, though, it seems as if a new, "better," package is introduced every month, forcing us to make some choices.15 When we wrote the first edition, there were a bunch of popular and powerful programs which stood out from the rest—SPSS, BMDP, SAS, and Minitab. So, we obligingly included some hints on how to run three of the four. Well, things have changed considerably in the past decade or so. SPSS (Statistical Package for the Social Sciences) has done to statistical software what Microsoft did to operating systems—it swallowed them whole for breakfast.

While you can still buy SAS and Minitab, SPSS bought out BMDP and then let it wither on the vine. "Real" statisticians still use SAS, but you'll need a separate bookcase just to house all of its manuals. The reality is that wherever you look in the social and medical sciences, folks are running SPSS. It's never been the best at everything, but it's good at many things, and it's pretty well created a monopoly. Since no manual or Help directory can ever compete with a knowledgeable friend, and friends knowledgeable in SPSS are far more common than friends of the other ilks,16 it makes no sense for us to buck a trend. Accordingly, this time around we've only included instructions for the Windows version of SPSS (Version 9 point something or other). Good luck (and don't call us if your machine blows up).


Acknowledgments

Many of our students have waded through early drafts of this book, giving us valuable advice about where we were going astray. Unfortunately, they are too numerous to mention (and we have forgotten most of their names). However, special thanks are due to Dr. Marilyn Craven, who patiently (and sometimes painfully) helped us with our logic and English. So, any mistakes you ﬁnd should be blamed on them; we humbly accept any praise as due to our own efforts. On a serious note (which we hope will be the last), we would like to express our thanks to Brian C. Decker, who dreamt up the idea of this book and who encouraged us from the beginning. GRN DLS

**Notes**

1. Photonumerophobia: fear that one's fear of numbers will come to light (thanks to Dave Sackett).
2. Most statisticians who write statistics books don't understand this distinction, which is why most statistics books are so boring.
3. Always the optimists, aren't we?
4. Damn fools. If they had the good sense to put Graeco-Latin names on these things they could have tripled their salaries. Admit it, you can charge more for making a diagnosis of acute nasopharyngitis than for snotty nose.
5. Norman can sympathize. He has a PhD in physics, which he never used. He was recently introduced at a meeting as a "fallen physicist," a term which Streiner calls a redundancy.
6. Unless, of course, it was assigned reading in a course taught by another statistical carpenter.
7. This is an argument for maintaining the status quo despite much discussion of the inadequacy of reporting statistics in the biomedical literature. It's analogous to saying that we have studied primary care clinics and we found that most visits (about 80%) are related to acute respiratory infections, hypertension, depression, and chronic pulmonary disease, so that is all we will teach our medical students.
8. Every time we get on an airplane, we are grateful that the pilots practiced landing the 747 with both starboard engines blown on a simulator so (a) they would know what to do if it happened, and (b) they wouldn't have to practice on us.
9. Lest we be accused of profane language, this stands for "Convoluted Reasoning and Anti-intellectual Pomposity Detectors." Ernest Hemingway likely thought so too—he coined the phrase.
10. See the note at the end of this preface.
11. Most sample size calculations are based on exact analysis of impossibly wild guesses, resulting in an illusion of precision. As Alfred North Whitehead said, "Some of the greatest disasters of mankind were inflicted by the narrowness of men with a sound methodology."
12. No doubt you wonder what a rubric is. Literally, it is the note written in red in the margin of the Book of Common Prayer telling the preacher what to do next. That's why these are red.
13. We don't like the term either, but it's shorter than spelling out all the allies.
14. We forget whether it was Lenny Bruce or Mort Sahl who ended every routine with the line, "Is there anyone in the audience whom I haven't insulted yet?" In either case, he was our inspiration.
15. And thereby resulting in some people castigating us for not including the best statistical package (i.e., the one they have on their machine). Such are the perils of authorship.
16. Not that we recommend, "Hi there. Do you know SPSS?" as an ice-breaker at a singles bar. Chapter 8 to the contrary notwithstanding, sex and stats make poor bedfellows.

CONTENTS

SECTION THE FIRST: THE NATURE OF DATA AND STATISTICS

1 The Basics
2 Looking at the Data: A First Look at Graphing Data
3 Describing the Data with Numbers: Measures of Central Tendency and Dispersion
4 The Normal Distribution
5 Probability
6 Elements of Statistical Inference
C.R.A.P. DETECTORS

SECTION THE SECOND: ANALYSIS OF VARIANCE

7 Comparing Two Groups: The t-Test
8 More than Two Groups: One-Way ANOVA
9 Factorial ANOVA
10 Two Repeated Observations: The Paired t-Test and Alternatives
11 Repeated-Measures ANOVA
12 Multivariate ANOVA (MANOVA)
C.R.A.P. DETECTORS

SECTION THE THIRD: REGRESSION AND CORRELATION

13 Simple Regression and Correlation
14 Multiple Regression
15 Logistic Regression
16 Advanced Topics in Regression and ANOVA
17 Measuring Change
18 Analysis of Longitudinal Data: Hierarchical Linear Models
19 Principal Components and Factor Analysis: Fooling Around with Factors
20 Path Analysis and Structural Equation Modeling
C.R.A.P. DETECTORS

SECTION THE FOURTH: NONPARAMETRIC STATISTICS

21 Tests of Significance for Categorical Frequency Data
22 Measures of Association for Categorical Data
23 Tests of Significance for Ranked Data
24 Measures of Association for Ranked Data
25 Life-Table (Survival) Analysis
C.R.A.P. DETECTORS

SECTION THE FIFTH: REPRISE

26 Equivalence and Non-Inferiority Testing
27 Screwups, Oddballs, and Other Vagaries of Science: Locating Outliers, Handling Missing Data, and Transformations
28 Putting It All Together
29 Getting Started with SPSS

Test Yourself (Being a Compendium of Questions and Answers)
Answers to Chapter Exercises
References and Further Reading
Unabashed Glossary
Appendix
Index

SECTION THE FIRST: THE NATURE OF DATA AND STATISTICS

this perfectly describes the person writing this section.CHAPTER THE FIRST In this chapter. Statistics wouldn’t be needed if everybody in the world were exactly like everyone else. 172 cm tall. we wouldn’t need statistics (or anything else) in about 70 years. The Basics 1We also wouldn’t need dating services because it would be futile to look for the perfect mate. this is not the case. Thus he is trying to make an inference about a larger group of subjects from the small group he is studying. let’s continue with some more deﬁnitions. for now. he is not interested in just those 40 kids. attar of eggplant. “Why do we need it?” Leaving aside the unworthy answer that it is required for you to get your degree. because what’s the use? But that’s another story. He wants to know whether all kids with acne will respond to this treatment. By the same token. as well as in thousands of other ways. or whether or not a new drug was effective in eliminating your dandruff. or which political party you’d vote for in the next election (assuming that the parties ﬁnally gave you a meaningful choice. Inferential statistics allow us to generalize from our sample of data to a larger group of subjects. We can’t look in the mirror. Descriptive statistics are concerned with the presentation. interval. So statistics can be used in two ways: to describe data. “Self. and to make inferences from them. if there were no differences and we knew your life expectancy.2 this description would ﬁt every other person. if everybody in the world were male (or female). DESCRIPTIVE AND INFERENTIAL STATISTICS It is because of this variability among people. “a few” to a statistician can mean over 400. The downside of all this variability is that it makes it more difficult to determine how a person will respond to some newfangled treatment regimen or react in some situation. organization. and compares them with 20 adolescents who remain untreated (and presumably unloved). For instance.000 people. that statistics were born. and were incredibly good looking. STATISTICS: SO WHO NEEDS IT? The ﬁrst question most beginning students of statistics ask is. which is doubtful). to see how well that description ﬁts or doesn’t ﬁt other people. includes various methods of organizing and graphing the data to get an idea of what they show. Fortunately. So much for the scientiﬁc use of language. 3Mind you. ask ourselves. The bulk of the book is devoted to inferential stats. to 20 adolescents whose chances for true love have been jeopardized by acne. 2Coincidently. another. statistics allow us to describe the “average” person. Descriptive statistics also include various indices that summarize the data with just a few key numbers. it would mean the end of extramarital affairs. and summarization of data. how do you feel about the newest brand of toothpaste?” and assume everyone will feel the same way. We’ll get into the basics of inferential statistics in Chapter 6. had brown eyes and hair.3 Similarly. we will introduce you to the concepts of variables and to the different types of data: nominal.1 if you were male. The reason is that the world is full of variation. when a dermatologist gives a new cream. as in the Salk polio vaccine trial. and ratio. which we cover in this section. ordinal. he or she would be just like the person sitting next to you. and even within any one person from one time to . 4As we’ll see later. As we hope to show as you wade through this tome. then we would know this for all people. The realm of descriptive statistics. 
and to see how much we can generalize our ﬁndings from studying a few people4 to the population as a whole. people are different in all of these areas. we have to address the issue of how learning the arcane methods and jargon of this ﬁeld will make you a better person and leave you feeling fulﬁlled in ways that were previously unimaginable. and sometimes it’s hard to tell real differences from natural variation.
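
To make the distinction concrete, here is a minimal sketch in Python rather than the SPSS used in the computer notes elsewhere in the book; treat it purely as an illustration, since every number in it is invented. Summarizing the dermatologist's 40 kids is descriptive statistics; asking whether the apparent benefit of attar of eggplant would hold for all kids with acne is the inferential question taken up in Chapter 6.

```python
from statistics import mean

# Invented data: change in number of acne lesions after one month, for the
# 20 adolescents given attar of eggplant and the 20 untreated controls.
treated = [-6, -4, -7, -3, -5, -8, -2, -6, -5, -4,
           -7, -5, -3, -6, -4, -5, -6, -2, -5, -4]
untreated = [-1, 0, -2, 1, -1, -3, 0, -2, -1, 0,
             -1, 1, 0, -2, -1, 0, -1, -2, 0, -1]

# Descriptive statistics: organize and summarize the data we actually have.
print("mean change, treated:  ", mean(treated))
print("mean change, untreated:", mean(untreated))

# Inferential statistics ask a further question: is a difference this size
# likely to generalize to all kids with acne, or is it just the natural
# variation among these particular 40?  That machinery starts in Chapter 6.
```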

and a somewhat shorter person would measure in at 171 cm. within a deﬁned range. we could get really silly about the whole affair and use a laser to measure the person’s height to the nearest thousandth of a millimeter. a person slightly taller would be 173 cm. A physicist can do even better. Discrete variables can have only one of a limited set of values. who said that. for example.THE BASICS 3 VARIABLES In the ﬁrst few paragraphs. this would include variables such as gender. play a “continuous” instrument and are able to make a ﬁne distinction between these two notes. the number of decayed. Another example of a discrete variable is a number total. black.13 children—kids come in discrete quantities. though. Using our previous examples.6 Sounds straightforward. What we’ve manipulated is the treatment (attar of eggplant). The point is that height. and this is our independent variable. and life expectancy.7 Many of the statistical techniques you’ll be learning about don’t really care if the data are discrete or continuous. is only an arbitrary division. it’s obvious that they are different from one another with respect to the type and number of values they can assume. so let’s return to those acned adolescents. The situation is different for continuous variables. though. and the divisions we make are arbitrary to meet our measurement needs. For instance. Once we get out of the realm of experiments. Discrete data have values that can assume only whole numbers. though. they will likely be different if we could measure to the nearest tenth of a millimeter. hair and eye color. or missing. political preference. One way to differentiate between types of variables is to decide whether the values are discrete or continuous.770 oscillations of a cesium atom. So. It has only 88 keys. cut time into 1⁄100-second intervals. this is usually limited to either male or female. which should change in response to some intervention. we’re saying that vocabulary is dependent on age. is artiﬁcial. while a dependable variable is always the same. There are instances. responsiveness to treatment. A piano is a “discrete” instrument. or what is being manipulated. The dependent variable is the outcome of interest. we mentioned a number of ways that people differ: gender. serum rhubarb. If they’re still the same. unbroken progression. blood pressure. the distinction between dependent and independent variables gets a bit hairier. In fact. The measurement. which we hope will change in response to treatment. but if you can afford a Patek. If we used one with ﬁner gradations. such as how many times a person has been admitted to hospital. Despite what the demographers tell us. we may be able to measure in 1⁄2 cm increments. a number to them is just a number. is measured in discrete units: someone is 172 cm tall. In the statistical parlance you’ll be learning. gray. The outcome (acne) is the dependent variable. these factors are referred to as variables.” A variable is simply what is being observed or measured. height. even though it isn’t an intervention and we’re not manipulating it. doesn’t it? That’s a dead giveaway that it’s too simple.” TYPES OF DATA Discrete versus Continuous Data Although we referred to both gender and height as variables. political preference. if two people appear to have the same blood pressure when measured to the nearest millimeter of mercury. like weight. That is. Indeed. Variables come in two ﬂavors: independent and dependent. It may seem at ﬁrst that something such as height. 
is really continuous. Even this. red. we say that the dependent variable is the one that changes in response to the independent variable. Only the hospital administrator. it’s impossible to have 2. Both dependent and independent variables can take one of a number of speciﬁc values: for gender. The independent variable is the intervention. 5Formerly referred to as “sex. hair color can be brown. Violinists (“ﬁddlers” to y’all south of the Mason-Dixon line). and many other variables. hair and eye color. time. we can measure with even ﬁner gradations until a difference ﬁnally appears. the number of different words would be the dependent variable and age the independent one. and those of us who struggled long and hard to murder Paganini learnt that A-sharp was the same note as B-ﬂat.192.5 age. the escapement mechanism makes the second hand jump. able to buy a Patek Phillipe analogue chronometer. though. you’ll ignore this. blonde. sees time as it actually is: as a smooth. though. or ﬁlled teeth. are different from the deﬁnitions offered by one of our students. after all. really cheap digital watches display only 4 digits and cut time into 1minute chunks. and the number of children. and a variable such as height can range between about 25 to 40 cm for premature infants to about 200 cm for basketball players and coauthors of statistics books. Rest assured that we will point these out to you at the appropriate times. missing. if we wanted to look at the growth of vocabulary as a kid grows up. artiﬁcial. 7Actually. and which treatment a person received. when the distinction is important. if one variable changes in response to another. Similarly. the limitation is imposed by our measuring stick. dividing each second into 9. more generally. We can illustrate this difference between discrete and continuous variables with two other examples. in addition to storing telephone numbers and your bank balance. The easiest way to start to think of them is in an experiment. Razzle-dazzle watches. “An undependable variable keeps changing its value. 6These Continuous data may take any value. We want to see if the degree of acne depends on whether or not the kids got attar of eggplant.631. .
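
A tiny sketch may help fix the vocabulary. It echoes the vocabulary-and-age example in the text, but the code is only an illustration in Python (not part of the SPSS computer notes), and every name and number in it is made up: the independent variable is what we think drives the change, the dependent variable is the outcome we expect to respond, counts are discrete, and a continuous measurement is only as fine as the measuring stick.

```python
# Independent variable: the child's age.  Dependent variable: vocabulary size.
# Vocabulary and hospital admissions are discrete counts; height is continuous.
children = [
    {"age_years": 2, "vocabulary": 250,  "admissions": 1, "height_cm": 86.4},
    {"age_years": 4, "vocabulary": 1500, "admissions": 0, "height_cm": 102.9},
    {"age_years": 6, "vocabulary": 5000, "admissions": 2, "height_cm": 115.2},
]

for c in children:
    assert isinstance(c["vocabulary"], int)   # counts come only in whole numbers
    assert isinstance(c["admissions"], int)
    nearest_cm = round(c["height_cm"])        # a coarse measuring stick
    nearest_mm = round(c["height_cm"], 1)     # a finer one; still not "true" height
    print(c["age_years"], c["vocabulary"], nearest_cm, nearest_mm)
```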

and so on. but a large one. The next world conference of IQ experts can just as arbitrarily decide that from now on. If the distance between values is constant. Patients are often rated as Much improved/Somewhat improved/Same/ Worse/Dead. and. of course. The simplest nominal categories are what Feinstein (1977) calls “existential” variables—a property either exists or it doesn’t exist. Most laboratory test values are ratio variables. with the third trailing by 10 seconds. An intelligence score is a different matter. they can have any number of categories.” This is seen more clearly with letter grades. but by the same token. Because computers handle numbers far more easily than they do letters. and Ratio Data We can think about different types of variables in another way. we’ll subtract 10 kilos from everything we weigh and say that something that previously weighed 11 kilos now weighs 1 kilo. Ordinal. The ordering is arbitrary. and they should not be thought of as having any quantitative value. then the differences between numbers are meaningful. This is like the results of a horse race. then the ratios between numbers are also meaningful. the numerals are really no more than alternative names.” which in turn is better than “Unsatisfactory. amounting to a ruined summer. and her medical problem into one of a few hundred diagnostic categories. or Emergent/Urgent/Elective. In these cases. We haven’t gained anything. simply by adding 400 to all scores. her eye color into Black/Brown/Blue/Green/Mixed (ﬁve categories9). It differs from a variable such as hair color in that there is an ordering of these values: “Excellent” is better than “Satisfactory. with no implied order among the categories. and the one who showed came in third. we would have to say it weighed –5 kilos after the conversion—an obvious impossibility. credit card numbers. an IQ of 100 is not twice as high as an IQ of 50. Because the intervals are equal. What the phrase means is that the zero point isn’t meaningful and therefore can be changed. most existential of all. Don’t be deceived by this use of numbers. where the zero is meaningful.8 we can list them by putting male ﬁrst or female ﬁrst without losing any information. social insurance or social security numbers. The important point is that you can’t say brown eyes are “better” or “worse” than blue. it’s still an ordinal scale. there is only a small division between a B+ and a B. and the conclusions we draw will be identical (assuming. . A variable such as gender can take only two values: male and female. if something weighed 5 kilos before. this time. we know that the horse who won ran faster than the horse who placed. Nominal. researchers commonly code nominal data by assigning a number to each value: Female could be coded as 1 and Male as 2. So letter grades and the order of ﬁnishing a race are called ordinal variables. the difference between an IQ of 70 and an IQ of 80 is the same as the difference between 120 and 130.4 8Although THE NATURE OF DATA AND STATISTICS male chauvinist pigs and radical feminists would disagree. We say that the average IQ is 100. Again. 9“Bloodshot” is usually only a temporary condition and so is not coded. the only necessary change is that we now have to readjust our previously learned standards of what is average. A person has cancer of the liver or doesn’t have it. But there could have been only a 1-second difference between the ﬁrst two horses. A nominal variable consists of named categories. 
it puts a limitation on the types of statements we can make about interval variables. Now let’s see what the implications of this are.10 A student evaluation rating consisting of Excellent/Satisfactory/Unsatisfactory has three categories. we can change the coding by letting Male = 1 and Female = 2. let’s contrast intelligence. with the numbers (Roman. However. someone has received the new treatment or didn’t receive it. and politicians’ IQs. the subject is either alive or dead. It’s more than a matter of semantics.12 We can’t suddenly decide that from now on. as in Stage I through Stage IV cancer. We can classify a person’s marital status as Single/Married/Separated/Widowed/Divorced/Common-Law (six categories). Many of the variables encountered in the health care ﬁeld are ordinal in nature. 11This A ratio variable has equal intervals between values and a meaningful zero point. measured by some IQ test. we haven’t lost anything. Married = 2. The point is that if the zero point is artiﬁcial and moveable.11 Sometimes numbers are used.” and what does it mean? We added it because. as are physical characteristics such as height and weight. 10Other examples of numbers really being nominal variables and not reﬂecting measured quantities would be telephone numbers. albeit for opposite reasons. This is called a nominal variable. is similar to the scheme used to evaluate employees: Walks on water/ Keeps head above water under stress/ Washes with water/ Drinks water/ Passes water in emergencies. Interval. and we are dealing with (not surprisingly) a ratio variable. that we remember which way we coded the data).” However. but that’s only by convention. but the zero point is arbitrary. A person who weighs 100 kilos is twice as heavy as 12It’s a state aspired to by “highfashion” models. An ordinal variable consists of ordered categories. We all know what zero weight is. where the differences between categories cannot be considered to be equal. to add a bit of class) really representing nothing more than ordered categories. with something such as weight. as we’ll see. An interval variable has equal distances between values. the difference in performance between “Excellent” and “Satisfactory” cannot be assumed to be the same difference as exists between “Satisfactory” and “Unsatisfactory. and no information is gained or lost by changing the order. or Single = 1. Use the difference test: Is the difference in disease severity between Stage I and Stage II cancer the same as exists between Stages II and III or between III and IV? If the answer is No. Why did we add that tag on the end. Nominal variables don’t have to be dichotomous. To illustrate this. One value isn’t “higher” or “better” than the other. If the zero point is meaningful. we’ll make the average 500. between a D– and an F+. but the ratios between them are not. “the zero point is arbitrary. the scale is ordinal. we’ve graduated to what is called an interval variable.
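
The claim that numeric codes for nominal categories are nothing more than alternative names is easy to demonstrate. The snippet below is a minimal Python sketch of our own (the data are invented; only the five eye-colour categories come from the text): recoding the categories leaves the category counts untouched, while the "average code" shifts around, which is exactly why arithmetic on nominal codes is meaningless.

```python
from collections import Counter
from statistics import mean

eye_colour = ["brown", "blue", "brown", "green", "blue", "brown", "mixed"]

# Two equally legitimate (and equally arbitrary) codings of the same variable.
coding_a = {"black": 1, "brown": 2, "blue": 3, "green": 4, "mixed": 5}
coding_b = {"mixed": 1, "green": 2, "blue": 3, "brown": 4, "black": 5}

codes_a = [coding_a[c] for c in eye_colour]
codes_b = [coding_b[c] for c in eye_colour]

# The only meaningful summary, how many people fall in each category,
# is identical under either coding.
print(Counter(eye_colour))

# The mean of the codes changes with the coding, so it tells us nothing.
print(mean(codes_a), mean(codes_b))
```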

strictly speaking. that is. Same as ordinal plus equal intervals. this time) found that bus drivers had higher morbidity rates of coronary heart disease than did conductors. strictly speaking. the distinctions among nominal. But. State which of the following variables are discrete and which are continuous. For example. intelligence is measured in IQ units. if you have nominal or ordinal data. we’ll try to clear the air. there’s still some confusion. from the viewpoint of a statistician. The number of hairs you’ve lost in the same time. The relationship between hypocholesterolemia and cancer. a rate is a fraction that also has a time component.g. with the average person having an IQ of 100. they have not been arrested for doing so. Percentages are a form of proportions. Although the distinctions among nominal. meat. But. that’s a rate. The IV is ____ The DV is ____ c. indicate which of the variables are dependent (DVs). A proportion is a type of fraction in which the numerator is a subset of the denominator. One study (a real one. The IV is ____ The DV is ____ 2. ordinal. First. we’ll later encounter other fractions (e.000 people will develop photonumerophobia this year. and we’re talking about one of them. The amount of weight you’ve put on in the last year. this assumes you know French. if we say that 1 out of every 1. That is. most people treat IQ and many other such variables as if they were interval variables. to the sloppy English used by some statisticians. We’ll get into what these obscure terms mean later in the book. when we write 1⁄3. By contrast. The time since the last patient was grateful for what you did. Even though this is stuff we learned in grade school. where the denominator is jigged to equal 100. Despite this. We know that members of religious groups that ban drugs. ASA is compared against placebo to see if it leads to a reduction in coronary events. Same as interval plus meaningful zero. Your anticipated before-taxes income the year after you graduate. though. odds) where the numerator is not part of the denominator. being purists. because we’re specifying a time interval.” But. owing. and second (here’s where statisticians often screw up). d. For the following studies. that’s a proportion. The number of hair-transplant sessions undergone in the past year. though. c. we mean that there are three objects. interval. Strictly speaking. independent (IVs). There are two reasons. Of course. course grade. at least in part. As we’ll see in the later chapters.THE BASICS 5 a person weighing 50 kilos. but is it worth it? How do they compare with us on a test of quality of life? The IV is ____ The DV is ____ d. even when we convert kilos to pounds. IQ most likely is an ordinal variable. Anglophones will just have to memorize the order. or neither. alcohol. Your anticipated after-taxes income in the same year. they can be treated and analyzed the same way. ordinal. f. b. certain types of graphs and what are called “parametric tests” can be used with interval and ratio data but not with nominal or ordinal data. and sex (because it may lead to dancing) live longer than the rest of us poor mortals. our discussion of types of numbers has dealt with single numbers—blood pressure. If we say that 23% of children have blue eyes (a ﬁgure we just made up on the spot). This may seem so elementary that you may wonder why we bother to mention it. That’s about enough for the difference between interval and ratio data. 
we have no assurance that the difference between an IQ of 80 and one of 100 means the same as the difference between 120 and 140. the ratio stays the same: 220 pounds to 110 pounds. The IV is ____ The DV is ____ b. and ratio data appear straightforward on paper. we deal with fractions. as we’ve said. with that as background. In the real world outside of textbooks. a. Same as nominal plus ordered categories. PROPORTIONS AND RATES So far. interval. on to statistics! 13A good mnemonic for remembering the order of the categories is the French word NOIR. e. a.. people sometimes call a proportion a “rate. you are. smoking. tistical tests we can use with them. and ratio are important to keep in mind because they dictate to some degree the types of sta- EXERCISES 1. So. the lines between them occasionally get a bit fuzzy. Sometimes. or counts. nor has the sky fallen on their heads. As far as we know. The fact of the matter is that. Notice that each step up the hierarchy from ordinal data to ratio data takes the assumptions of the step below it and then adds another restriction:13 Variable type Nominal Ordinal Interval Ratio Assumptions Named categories. restricted to “nonparametric” statistics. .
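
The proportion/rate distinction lends itself to a two-line calculation. Here is a short Python sketch, given only as an illustration; the numbers simply echo the made-up examples in the text. A proportion is a fraction whose numerator is a subset of its denominator, and a rate is the same kind of fraction with a time interval attached.

```python
# Proportion: 23% of children have blue eyes (no time component involved).
children_with_blue_eyes = 23
children_examined = 100
proportion = children_with_blue_eyes / children_examined       # 0.23
percentage = 100 * proportion                                   # 23%

# Rate: 1 out of every 1,000 people will develop photonumerophobia this year.
new_cases = 1
people_at_risk = 1_000
follow_up_years = 1
rate_per_1000_py = 1_000 * new_cases / (people_at_risk * follow_up_years)

print(f"proportion = {proportion:.2f} ({percentage:.0f}%)")
print(f"rate       = {rate_per_1000_py:.1f} per 1,000 person-years")
```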

or ratio. The ratio of the number of women who have breast cancer to the total number of women in the population. in mm Hg. d. c. i. l. Indicate whether the following are proportions or rates: a. c. b. A list of the different specialties in your profession. 0296 = Depression. ST depression. f. and so on. b. ST depression on the ECG. Diastolic blood pressure. measured as “1” ± 1 mm. II. A score of 13 out of 17 on the Schmedlap Anxiety Scale. ICD-9 classiﬁcations: 0295 = Organic psychosis. The ratio of males to females. measured in millimeters. j. Your income (assuming it’s more than $0). The ranking of specialties with regard to income. a. or IV. 4. “2” = 1 to 5 mm. Pain measurement on a seven-point scale. Staging of breast cancer as Type I. ordinal. The ratio of new cases of breast cancer last month to the total number of women in the population. . The increase in the price of household good last year. A range of motion in degrees. Bo Derek was described as a “10. k. Indicate whether the following variables are nominal. and “3” ≤ 5 mm. d.” What type of variable was the scale? e. interval. III. g. h.6 THE NATURE OF DATA AND STATISTICS 3.

Looking at the Data A First Look at Graphing Data WHY BOTHER TO LOOK AT DATA? Now that you’ve suffered through all these pages of jargon. some tests require the data to ﬁt a given shape. a researcher may use a code number such as 99 or 999 to indicate a missing value for some variable. For example. To illustrate it. popularized by Albert Einstein. The largest number in the table is 42. The ﬁrst step is to choose an appropriate length for the Y-axis. and how not to plot data. or that a plot of two variables follow a straight line.CHAPTER THE SECOND Here we look at different ways of graphing data. HISTOGRAMS.” then there should be one. as in most areas of our lives (especially those that are enjoyable). we did the study in the ﬁrst place to get some results that we could publish and prove to the Dean that we’re doing something. there is a great temptation to jump right in and start analyzing the bejezus out of any set of data we get. As a result. let’s actually do something useful: Learn how to look at data. how to make the graphs look both accurate and esthetic. We can then tabulate the data as is shown in Table 2–1. AND VARIATIONS ON A THEME The Basic Theme: The Bar Chart Perhaps the most familiar types of graphs to most people are bar charts and histograms (we’ll tell you what the difference is in a little bit). Although there are speciﬁc tests of these assumptions. You do not look at the data just in case there are errors. so we will choose some number somewhat larger than this for the top of the axis. and your job is to try to ﬁnd as many as you can. Course Number of students TABLE 2–1 Responses of 100 students to the question. let’s conduct a “gedanken experiment. With the ready availability of computers on every desk. In essence. A quick look often gives you a better sense of the data than a bunch of numbers. Graphing the data beforehand may well save you from one of life’s embarrassing little moments. they are there. A second purpose for looking at the data is to see if they can be analyzed by the statistical tests you’re planning to use. “What was your most boring introductory course?” Sociology Economics History Psychology Calculus 25 42 8 13 13 7 .” It is used here simply for purposes of pretentiousness. Because we’ll label the tick points every 1This is a German term. Sometimes the problem isn’t an error as such. However. very often. BAR CHARTS. where we’ll plot (at least for now) the number of people who chose each alternative. we must learn to control our temptations in order to become better people. It is difficult to overemphasize the importance and usefulness of getting a “feel for the data” before starting to play with them. the power of the “calibrated eyeball test” should not be underestimated. If there isn’t a Murphy’s Law to the effect that “There will be errors in your data. meaning “thought experiment. you may ﬁnd that some people in his study are a few years older than Methuselah. they consist of a bar whose length is proportional to the number of cases.”1 Imagine we do a study in which we survey 100 students and ask them what their most boring course was in college. and then forget to tell you this little detail when he asks you to analyze his data. After all.

some research (Cleveland. as shown in Figure 2–4. Variation 1: Dot Plots Another variant of the bar chart that is particularly useful when there are many categories is the dot plot. this doesn’t look too bad! However. starting at 1 and ending at 64. Instead of a bar. or every even number. At ﬁrst glance. a tick mark inside the axis may obscure the data point. things have been turned on their ear—literally. When the data fall near the Y-axis. 1984) has shown that people get a more accurate grasp of the relative sizes of the bars if they are placed horizontally. which would make the axis look too cluttered. See what we mean? 10 units. History FIGURE 2–4 Figure 2–3 redrawn as a point graph. This is the way most bar charts of nominal data looked until recently. Also. So. it’s often better to put the tick marks outside the axes rather than in. so we can change the categories around without losing anything. or vice versa. . though. Adding this twist (pun intended).) Making these two changes gives us Figure 2–2. 0 10 20 30 40 50 0 10 20 30 40 50 Number of students Number of students 2Fast! Count by sevens. 50 would be a good choice.8 50 THE NATURE OF DATA AND STATISTICS 50 40 Number of students Number of students 40 30 30 FIGURE 2–2 Figure 2–1 redrawn so that the categories are in order of preference and the tick marks are outside the axes. just a heavy dot is placed where the end of the bar would be. we’ll end up with Figure 2–3. we would have had to label the axis either every 7 units (which are somewhat bizarre numbers2). our graph would look like Figure 2–1. we can make it look even better. 20 20 FIGURE 2–1 Bar chart of the ﬁve least popular courses 10 10 0 Sociology Economics History Psychology Calculus 0 Economics Sociology Psychology Calculus History Course Course Economics Sociology Psychology Calculus Economics Sociology Psychology Calculus History FIGURE 2–3 Figure 2–2 redrawn so that the bars are horizontal. If we had used the number 42. If the names of the categories are long. It’s obvious that the data are nominal. (As a minor point. When there are many labels. things can look pretty cluttered down there on the bottom. the order is arbitrary. Within recent years. Now the relative standing of the courses is more readily apparent. In fact. smaller dots that extend back to the labeled axis are often used to make the chart easier to read. we gain something if we rank the courses so that the highest count is ﬁrst and the lowest one is last.

the ﬁrst thing we should do is to put the data in rank order. it would confuse more than help if you put them in the order: Satisfactory/Excellent/Unsatisfactory just because most students were in the ﬁrst category. 5. far too many to graph when letting each bar stand for a unique number. Let’s use an example. it can be used with all four types.4 starting with the smallest number and ending with the highest. and many others will be only one or two units high. Two notes are in order. If we took 100 fourth-year nursing students and asked them how many bedpans they emptied in the last month. we’d get 100 answers. (This difference is called the range. usually called SORT. and no pattern emerges. Third. First. and many Student Data TABLE 2–2 37 34 9 28 15 7 46 34 25 17 22 36 38 37 27 13 54 38 47 22 33 51 25 16 18 43 14 12 1 35 16 51 49 35 43 61 31 33 25 54 Number of bedpans emptied by 100 fourth-year nursing students in the past month 1–5 6–10 11–15 16–20 21–25 26–30 31–35 36–40 41–45 46–50 51–55 56–60 61–65 66–70 71–75 76–80 81–85 86–90 91–95 96–100 43 41 14 31 35 36 45 57 32 58 32 42 52 21 37 7 38 26 34 26 45 24 29 24 36 27 32 16 30 26 17 28 28 11 41 20 38 24 12 17 16 11 55 24 14 42 42 27 26 19 31 26 15 66 11 56 39 43 52 7 TABLE 2–3 1 7 7 7 9 11 11 11 12 12 13 14 14 14 15 15 16 16 16 16 17 17 17 18 19 20 21 22 22 24 24 24 24 25 25 25 26 26 26 26 26 27 27 27 28 28 28 29 30 31 31 31 32 32 33 33 33 34 34 34 35 35 35 36 36 36 37 37 37 38 38 38 38 39 41 41 42 42 42 43 43 43 43 45 46 46 47 49 51 51 51 52 52 54 56 56 57 58 61 66 Data from Table 2–2 put in rank order . whereas a width of 2 would result in 33 boxes (which is still too many). This leads to the second problem. In fact. If possible. and ratio data. such as Table 2–4. these don’t yield multiples that are easily comprehended. we make each bar represent a range of numbers. the X-axis is going to get awfully cluttered.3 To do this.) If we have one bar for each value. it really is called “rank” order. as in Table 2–2. you can’t blithely move the categories around simply to make the graph look prettier. it makes no difference. Let’s say we have some data on the number of tissues dispensed each day by a group of 75 social workers. 4No Graphing Interval and Ratio Data A few other factors have to be considered in graphing interval and ratio data. we have more possible values than data points. First. We also see that the range (66 – 1) = 65.LOOKING AT THE DATA 9 Graphing Ordinal Data The use of histograms isn’t limited to nominal data. If you were graphing the number of students who received Excellent/Satisfactory/Unsatisfactory ratings. even when the data aren’t as smelly. interval. Second. to do the job for you. We look at our data. we try to end up with between 10 and 20 bars on the axis. Once we do this. which would seem obvious. A width interval of 5 yields 14 boxes (which is just right). We’re overwhelmed by the sheer mass of numbers. who’s been working like a Trojan and who’s been gooﬁng off. a few other considerations should be kept in mind when using them with ordinal. First. not too many between 1 and 10 or between 60 and 70. The difference between the highest and lowest value is 107. in that it will be hard to discern any pattern by eyeballing the data. there are two extra columns. An interval width of 10 would give us 7 boxes (not quite enough for our esthetic sense). 10. The ﬁrst. To make our lives (and all of the next steps) easier. To help us in drawing the graph. pun is intended. For these reasons. in the 20s and 30s. 
so some bars will have a “height” of zero units. or 20 points. we’ll run into a few problems. most computers have a simple routine. we could make up a summary table. it’s very hard even to ﬁgure out what the highest and lowest numbers are. The main thing a table like this tells us is that it’s next to impossible to make sense of a table like this. we’ll end up with Table 2–3. We’ll deﬁne it a bit more formally later in the next chapter. With this table we can immediately see the highest and lowest values and get at least a rough feel for how the numbers are distributed. use a width that most people are comfortable with: 2. and we ﬁnd that the lowest number is 10 and the highest is 117. what we refer to as the interval width. However. There are a few things to notice about this table. one labeled Mid- 3Note that this dictum is based on esthetics. is that because the values are ordered. you can go from highest to lowest if you wish. Even though a width of 6 or 7 may give you an esthetically beautiful picture. which gives the interval and the number of subjects in that interval. not statistics.
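A quick way to audition candidate interval widths is to let the computer count how many bars each width would produce. Here is a rough sketch (ours, in Python), using the lowest and highest bedpan counts of 1 and 66 from Table 2–3.

import math

lowest, highest = 1, 66          # from Table 2-3
data_range = highest - lowest    # the range: 65

# How many bars would each "comfortable" width give us?
for width in (2, 5, 10, 20):
    n_bars = math.ceil((data_range + 1) / width)
    print(f"width {width:2d} -> about {n_bars} bars")

# A width of 5 lands in the 10-to-20-bar zone the chapter recommends.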

the second from 5 to 9. midpoints. their midpoints.10 THE NATURE OF DATA AND STATISTICS 5Is “esthetic sense” and oxymoron? point and the other labeled Cumulative Total. the axis would look too cluttered. STEM-LEAF PLOTS AND RELATED FLORA All these variants of histograms and bar charts are the traditional ways of taking a mess of data such as we found in Table 2–2 and transforming them into a graph such as Figure 2–5. 5. and 4. is simply a running sum of the number of cases. we can see that the maximum number of cases in any 1 interval is 15. the Cumulative Total. Notice that on the X-axis. called a Stem-and-Leaf Plot. The resulting diagram. 3. we see that 4 people emptied between 5 and 9 pans. but we don’t know exactly how many. In Figure 2–5. Make a new table consisting of the intervals. Choose and appropriate width to yield about 10 to 20 intervals. the midpoint cuts down on the clutter and (for reasons we’ll explore further in the next chapter) is the best single summary of the interval. Rank order the data. 20 FIGURE 2–5 Histogram showing the number of bedpans emptied during the past month by each of 100 nursing students. that’s actually 6 digits. it would fall halfway between the 1 and 2. the more information is lost. 1. We can tell from Table 2–4 that 1 person emptied between 0 and 4 bedpans. the midpoint is 2. but again we’re not sure precisely how many future nurses dumped what number of bedpans. This is very handy because. A good choice would be 20. and so on. we’ve labeled the middle of the interval. counts. say 0. and 3. 6. So. the count. 1. We would therefore want the Y-axis to extend from 0 to some number over 15. Don’t fall into the trap of saying an interval width of 5 covers the numbers 0 to 5. The ﬁrst is just what the name implies: It is the middle of the interval. consisting of the exact values. 2. showing the intervals. The ﬁrst one goes from 0 to 4. though: how to label the two axes. The other added column. Tukey (1977) devised a way to eliminate steps 1 and 6 and to combine 4 and 5 into one step. Choose an appropriate width to yield about 10 to 20 intervals. then the midpoint would again be in the middle. Number of nurses 15 10 5 0 2 7 12 17 22 27 32 37 42 47 52 57 62 67 Number of bedpans . 3. because this would allow us to label every ﬁfth tick mark. the ﬁrst interval had 1 case. 3. 4. and the second 4. 2. Now we can ﬁnally tell you the difference between bar charts and histograms: Bar charts: There are spaces between the bars. Our end product would look like Figure 2–5. the data are continuous. Histograms: The bars touch each other. and a cumulative total. If we labeled every possible number. Because the ﬁrst interval consists of the numbers 0. Another point to notice is that we’ve paid a price for grouping the data to make it more readable. and Interval Midpoint Count Cumulative total TABLE 2–4 A summary of Table 2–3. Looking at the count column in Table 2–4. If there were an even number of numbers. In the next interval. Make a new table that looks like a histogram and preserves the original data. This ﬁgure differs from Figure 2–2 in a subtle way. thus consists of only three steps: 1. In the earlier ﬁgure. if we didn’t end up with 100 at the bottom. Find the range (the highest value minus the lowest). There’s one last consideration. we’re almost ready to start drawing the graph. Turn this into a histogram.5. Lose some information along the way. 
and cumulative total 0–4 5–9 10–14 15–19 20–24 25–29 30–34 35–39 40–44 45–49 50–54 55–59 60–64 65–69 2 7 12 17 22 27 32 37 42 47 52 57 62 67 1 4 9 11 8 15 12 14 9 5 6 4 1 1 1 5 14 25 33 48 60 74 83 88 94 98 99 100 that price is the loss of some information. because each category was different from every other one. The wider the interval. This time. and we would label it 1. we would know that we messed up the addition somewhere along the line. though. The other point to notice is the interval. 2. The 9 cases in the third interval then produce a cumulative total of (5 + 9) = 14. The steps were as follows: 1. so it makes both statistical as well as esthetic sense5 to have each bar abutting its neighbors. with these points in mind. 2. so the cumulative total at the second interval is (1 + 4) = 5. we left a bit of a gap between bars. Find the range.
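The bookkeeping behind Table 2–4 (midpoints, counts, and a running total that must end at 100) is easy to hand off to the computer; a small sketch, ours rather than the book's, follows.

from itertools import accumulate

width = 5
counts = [1, 4, 9, 11, 8, 15, 12, 14, 9, 5, 6, 4, 1, 1]   # Table 2-4

lows = [i * width for i in range(len(counts))]              # 0, 5, 10, ...
midpoints = [low + (width - 1) / 2 for low in lows]         # 2, 7, 12, ...
cumulative = list(accumulate(counts))                       # running sum

for low, mid, n, cum in zip(lows, midpoints, counts, cumulative):
    print(f"{low:2d}-{low + width - 1:2d}  mid {mid:4.1f}  "
          f"count {n:2d}  cumulative {cum:3d}")

assert cumulative[-1] == 100   # if not, we messed up the addition somewhere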

Most journals still prefer histograms or bar charts rather than stem-leaf plots. we’ve put a dot at the midpoint of the interval and then connected the dots with straight lines. you are not seeing double. at the same time explaining these somewhat odd-sounding terms. 13. we would make the “16” the stem. the second for the numbers 15 to 19. is 45 so we put a 5 next to the second 4. each leaf would be put in a separate and adjacent horizontal box. Table 2–5 really does have two 0s. 12. for reasons that will be readily apparent if you’ll just be patient). vertically. Now. There are a few other differences between histograms and frequency polygons. For example. However. we can see that the actual numbers were 11. polygons should not be used with nominal or ordinal data because joining the dots makes the assumption that there is a smooth transition from one datum point to another. you’ll see it has exactly the same shape as Figure 2–5. it’s simple to go from the plot to the more traditional forms. because we’ve chosen an interval width of 5. 11. the original data are preserved. and Table 2–7 is the stemand-leaf plot of all 100 numbers. for the number 94. at a midpoint of 20. the 0 is the stem of the numbers 00 (zero) to 04 (four). Let’s take the third line down. and 12. If you turn Table 2–7 sideways. Using the data from Table 2–3 and the same reasoning we did for the histogram. Computer programs that product stem-leaf plots (see the end of this chapter) do this for you automatically. So. The “leaf” consists of the least signiﬁcant digit of the number. we would again opt for an interval width of 5. If you did what we told you to earlier. as in Figure 2–7. so we put a 3 (the leaf) next to the ﬁrst 4. 14. Table 2–6 shows a plot of the ﬁrst 10 numbers. We then write the stems we need. the ﬁrst 0 will contain the numbers 0 to 4. 14. and used graph paper. 14. Strictly speaking. No. the leaf is “4” and the stem is “9.” If our data included numbers such as 167. we can actually rank order the numbers within each stem. because this stem contains the intervals 45 to 49. Now. Let’s start off by looking at one. and then we’ll describe it. Reading across. two 1s. 0 0 1 1 2 2 3 3 4 4 5 5 6 6 1 9 1 6 4 9 3 7 3 5 1 5 1 6 7 4 6 4 5 4 5 1 5 1 7 7 4 5 4 8 1 6 2 6 2 8 7 4 8 2 7 2 6 3 9 4 6 2 6 1 7 4 5 2 7 2 1 9 0 6 2 6 2 4 1 7 4 5 0 8 1 3 7 2 6 2 7 3 2 6 5 7 8 6 8 7 6 5 6 1 1 3 4 5 7 8 8 9 8 3 . This shows the same data as Figure 2–5. and the “stem” is the most signiﬁcant. as in Table 2–5 (it’s best to do this on graph paper. the ﬁrst number in Table 2–2 is 43. the ﬁrst 1 is the stem for the numbers 10 to 14. For example. instead of a bar that spans each interval. The ﬁrst point. the ﬁrst stem with a 1.LOOKING AT THE DATA 11 Let’s take a look and see how this is done. imagine that we have a polygon with just two points. Moreover. The second number. If we want to be a bit fancier. but this is slowly changing. 11. The reason is that. The second interval covers the numbers 5 (05) to 9 (09). as we’ve said. look at Figure 2–6. reading across. and so on. Stem Leaf TABLE 2–5 First step in constructing a Stem-and-Leaf Plot: Writing the stems 0 0 1 1 2 2 3 3 4 4 5 5 6 6 Stem Leaf TABLE 2–6 Stem-and-Leaf Plot of the ﬁrst 10 items of Table 2–2 0 0 1 1 2 2 3 3 4 4 5 5 6 6 1 6 4 3 4 7 3 1 5 1 Stem Leaf TABLE 2–7 Stem-and-Leaf Plot of all the data in Table 2–2 FREQUENCY POLYGONS Another way of representing interval or ratio types of data is called a frequency polygon. and so on. First. In any case. 
we go back to our original data and write the leaf of each number next to the appropriate stem.
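If you would rather have the computer grow a stem-and-leaf plot for you (outside of SPSS, which we get to at the end of the chapter), a bare-bones Python sketch might look like the one below. The ten values are a handful of the bedpan counts from Table 2–2, and each stem appears twice because we have kept the interval width of 5.

from collections import defaultdict

data = [43, 45, 14, 31, 35, 7, 26, 57, 32, 9]   # ten of the bedpan counts

# stem = tens digit; each stem gets two rows, one for leaves 0-4, one for 5-9
rows = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)
    rows[(stem, leaf >= 5)].append(leaf)

for stem in range(0, 7):                 # stems 0 through 6, low and high halves
    for high in (False, True):
        leaves = "".join(str(leaf) for leaf in rows[(stem, high)])
        print(f"{stem} | {leaves}")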

by convention. such as blood pressure. Figure 2–9 then shows the same data with a polygon. For instance. and the second point. frequency polygons begin and end with the line touching the X-axis. This is a closer representation of what we actually do in statistics. we couldn’t make this assumption. shows 100 units on the Y-axis. if you have more than two groups. At the low end. so this isn’t an option. with each group represented by a different line. A second difference is that bar charts seem to imply that the data are spread equally over the interval. To accomplish this. it doesn’t make sense in this case to add another interval because it would cover the numbers –1 to –5. and whatever your plotting package can manage. with a histogram. A third difference is that. it really doesn’t matter. you’re limited to a histogram. we usually use some midpoint as an approximation. they should be noticeably distinct from one another—different symbols representing the data points and different types of lines joining the points. they would correspond to 105 units (where the dot is). The advantage is that all the data for any one group are joined.12 20 THE NATURE OF DATA AND STATISTICS 110 15 Number of nurses (105) 10 FIGURE 2–6 Frequency polygon showing the same data as in Figure 2–5. however. so we just continue the line to the origin. you can also use different colors. it’s more a matter of personal preference. you don’t have a choice. we assume all the cases had the value of the midpoint. we’ve added an extra interval at the upper end. a cumulative frequency polygon. If you’re showing the graph at a meeting. However. 2 at 22. esthetics. and so on. 0 0 7 17 27 37 47 57 67 77 0 20 (25) 30 Number of bedpans 40 Administrators Physicians Nurses 30 20 10 0 20 30 40 50 60 70 80 Hours worked per week FIGURE 2–8 Data for three groups displayed as bar graphs. or height. 2 at 21. In this case. which falls at a midpoint of 30. 6Our publisher is a very generous guy and doesn’t mind doing things in color. as they are with ordinal data. We’ve shown an example of this in Figure 2–8. the values for one group are often broken up by the bars for the other groups. When you’re plotting two or more lines. If we were plotting data that did not include a value of zero. Even though we may not have gathered any data that correspond to an X-axis value of 25.6 Percent CUMULATIVE FREQUENCY POLYGONS Before leaving the topic of graphing for a while. Cast your mind back. IQ. which we feel is easier to follow. We can make this assumption only because we’re using an interval or ratio level of data. we assume they fall on the line. when do we use a histogram and when a polygon? For nominal and ordinal data. most publications are in black and white. and 10 cases were in that interval. if we had an interval width of 5 units spanning the numbers 20 through 24. halfway between 20 and 30. If you’re dealing with interval or ratio data and are showing the data for only one or two groups. if you will. So. we would have added an extra “empty” interval at the lower end. it would appear (and we would assume) that 2 cases fell at 20. 5 100 0 FIGURE 2–7 The assumption of a smooth transition from point to point in frequency polygons. then it’s often better to use frequency polygons. if the distances between intervals are variable or unknown. shows 110 units. With a frequency polygon. which had a frequency count of zero. we’ll mention one more variant. to our dis- . if we don’t know the exact value of some variable.
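To draw something like Figure 2–9 yourself, with one line per group and each line carrying its own marker and line style so the groups stay distinct in black and white, a matplotlib sketch along these lines would do it. The counts here are invented placeholders, not the data behind Figures 2–8 and 2–9.

import matplotlib.pyplot as plt

midpoints = [20, 30, 40, 50, 60, 70, 80]      # hours worked per week

# Hypothetical counts for three groups (placeholders, not the book's data)
groups = {
    "Administrators": [5, 12, 20, 10, 3, 1, 0],
    "Physicians":     [0, 2, 8, 15, 18, 10, 4],
    "Nurses":         [2, 10, 25, 15, 5, 1, 0],
}
styles = [("o", "-"), ("s", "--"), ("^", ":")]

fig, ax = plt.subplots()
for (label, counts), (marker, line) in zip(groups.items(), styles):
    ax.plot(midpoints, counts, marker=marker, linestyle=line, label=label)

ax.set_xlabel("Hours worked per week")
ax.set_ylabel("Number of people")
ax.legend()
plt.show()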

she can look at a graph appropriate for age and sex and determine in what percentile this particular kid is. to tell you that? We can convey the same information in one sentence. then dropped a vertical line to the X-axis. Take a look at Fig- . So. The difference may be small. The Case of the Missing Zero Dr. slide presentation program. though. should he be promoted? Not if this graph is any indication of the quality of his work. we put the mark at the upper end of the interval. so we don’t know exactly where within the interval the raw data actually occurred. From the picture. we don’t have to waste 30 minutes drawing a ﬁgure. 40 30 20 10 0 20 30 40 50 60 70 80 FIGURE 2–9 The same data as in Figure 2–8. In other words. In Figure 2–10 we’ve drawn a horizontal line at 50%. let’s decide whether a graph is even needed. Then. each cumulative total is also the percent. symbol type. In our example. how many cases there were.000 to a paltry $15. In this section.LOOKING AT THE DATA 13 Administrators Physicians Nurses cussion of the emptying of bedpans. When we drew up Table 2–4. ure 2–11. we’ll discuss some very useless and misleading (albeit very pretty) ways of presenting data. with cumulative polygons. With them. it looks as though there has been almost a threefold increase in his funding (the actual value is about 275%). to show that the amount of grant money he has received has risen dramatically in the past year. (e. Now we’ll mention another purpose. Instead 8We’re using a pseudonym to protect his identity. This is the reason the data are plotted at the end of the interval. The problem is with the Y-axis. starting at the Y-axis and extending to the curve. This shows us that 50% corresponds to 31 bedpans. and line type. and see what percent of people dumped more or fewer). It shows the number of males and females in some study. 7Even when working with inaccurate data. weight.” The good news is that every spreadsheet program. but displayed as frequency polygons. he submits a graph. for reasons that will soon be apparent. it was at the midpoint.. HOW NOT TO GRAPH As the old joke goes. but you’ll rarely be in the fortunate position of having exactly 100 subjects. (Even though you haven’t gotten too far into this book yet. You can also convert the cumulative total at each interval into a percentage of the total count and plot the cumulative percents. and statistics program now can make graphs for you at the press of a button. The reality is that it went from a measly $11. His real name is Dr. and most default options are just plain wrong. Do You Really Need a Graph? Before we begin to discuss bad graphs. Many of the choices are worse than useless. we plot not the raw count within each interval. it helps us draw cumulative frequency polygons. The bad news is that. but statisticians pride themselves on being accurate. To support his petition. they do it extremely badly.g. which takes about 15 seconds to write and 2 seconds to read. after the doc takes the kid off the scale. “We have some good news and some bad news. Y. We can also draw lines at other percentages. The lines are differentiated by color. something that takes up about 1⁄4 of a page. Percent Hours worked per week 100 Cumulative percent 80 60 40 20 0 0 15 30 45 60 75 Number of bedpans FIGURE 2–10 Cumulative frequency polygon of data in Figure 2–6.000. but the cumulative count. we added another column. and mentioned that one reason for using it was as a check on our addition. as we’ve done in Figure 2–10. 
half of the people emptied fewer than 31 and half emptied more. and other vital statistics. or even work backward. X8 wants to be considered for early promotion. because the total number of data points was 100. labeled the Cumulative Total. that is. rather than at the midpoint. it conveys one bit of information—the proportion of males is 54%. say 40 bedpans. especially for kids— height. an increase of only 37%. up to and including everyone within the interval. As we’ve mentioned. We do know. we bet you can ﬁgure out that the proportion of females is 46%. but this time as a cumulative polygon. draw a vertical line up from.) Do you need a graph. Figure 2–10 again shows the data in Table 2–4. you simply have to enter the data. head circumference. The only difference in drawing a regular frequency polygon and a cumulative one is where we put the point: in the former case. almost without exception.7 Graphs of this sort are very common in plotting all sorts of anthropometric features. not to report numbers. Use graphs to show relationships. shown in Figure 2–12. we have lost some information by grouping the data.
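Reading percentiles off a cumulative polygon, as we just did for the 50% line, is also easy to do numerically. Here is a sketch (ours, using the Table 2–4 counts) that interpolates along the cumulative percents the same way the eye walks along Figure 2–10; remember that the cumulative points sit at the end of each interval.

import numpy as np

width = 5
counts = np.array([1, 4, 9, 11, 8, 15, 12, 14, 9, 5, 6, 4, 1, 1])  # Table 2-4
upper_edges = np.arange(1, len(counts) + 1) * width                # 5, 10, ..., 70
cum_percent = 100 * counts.cumsum() / counts.sum()

median = np.interp(50, cum_percent, upper_edges)
print(f"About half the students emptied {median:.0f} bedpans or fewer")

# Work backward, too: what percent emptied 40 or fewer?
pct_below_40 = np.interp(40, upper_edges, cum_percent)
print(f"Roughly {pct_below_40:.0f}% emptied 40 or fewer")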

000. it will just be confusing.” we hear you say. 40 30 20 10 0 Male Female Number of students 40 30 20 10 0 FIGURE 2–11 The proportion of males and females in a study. but is useless for comparing groups. But if the sizes of those segments are different. and then follow another imaginary line to where the legend is—a process that’s prone to error at every step. it would not be nice. because the axis doesn’t start at zero.42. it looks as though the temperature or the stock market is ﬂuctuating wildly. For bars farther from the left side of the graph. 4 5 6 true value is actually indicated by the back edge of the bar. We see examples of this every day on TV or in the newspapers. make a turn. we have to follow an imaginary line to the Y-axis. Dr.14 60 THE NATURE OF DATA AND STATISTICS 50 50 Percent FIGURE 2–13 A 3-D version of Figure 2–2. you now have to look at a segment of pie two. Figure 2–15 is a stacked bar graph showing 9Although heaven knows we can think of quite a few. Sounds like an impossible task. Wouldn’t it be nice if we jazzed them up a bit by making them look three dimensional. That is the Question The bar charts and histograms that we’ve shown you so far look pretty drab and ordinary. we’re talking about graphs. X.05 (Beattie and Jones. The greater the 3-D effect. how many students said Economics? You’re excused if you said 39. Remember. much beloved by newspapers and magazines. That’s a tad higher than the recommended value of the GDI. 1992). which is simply: GDI = Size of the effect in the graph –1 Size of the effect in the data (2–1) In this case. lose the 3-D. and that will remove any ambiguity. A pie chart may be good for showing data for one group. X! 3-D or Not 3-D. or reporting data. where different values of a variable are placed on top of one another. which is where your eye is drawn. the greater the confusion. Stacked Graphs For a change. The problem is that the leading edge of the bar. One way to check on this distortion is to use the Graph Discrepancy Index (GDI). “That’s hard to say. A graph is ideal for giving the reader a very quick grasp of relationships that exist in the data. Figure 2–13 looks hot! But quickly now. we’re not making some sort of sexist joke. because we’re nice guys. keep the angle constant as you rotate it until it’s at the same starting place as the corresponding segment of pie one. Not in a Graph of starting at zero. So. as in Figure 2–14. “But. it begins at $10. Pie in the Sky. The Now let’s take the same data to make a pie chart and use it to compare two groups. Economics Sociology Psychology Calculus Course History 15000 14000 Grant Money ($) 13000 12000 11000 10000 0 1 2 3 Year FIGURE 2–12 Grant money per year for Dr. Gotcha. or converted them to pie charts? No. and judge the relative angles. is there a trend over time. is just below the 40. “you can simply put numbers inside or next to the wedges. .9 Rather.” You can relatively easily compare the ﬁrst segment of the two pies.” Let’s keep in mind the difference between a table and a graph. or if we added shading. it’s (275/37) – 1 = 7. which confuses both the eye and its owner. the only place for a pie chart is at a baker’s convention. or used fancier objects instead of just rectangles. so that small differences are magniﬁed. Golly gee. or does one group differ from another? If the precise numbers are important. which is 0. use a table. Let’s take Figure 2–2 and make it look sexy by adding some of the features we’ve just mentioned. 
Don’t mix up these two functions: communicating a picture. the bottom line is. because they both start at 12 o’clock. Are the numbers of people saying Sociology the same in both groups? Yet again we’ll excuse you if you answer. and it is. You’d be wrong—the real answer is 42—but we’ll excuse you.
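Written out as a fraction, the Graph Discrepancy Index is simply the apparent effect divided by the real one, minus 1. Plugging in the percentages quoted above for Dr. X's graph (an apparent jump of about 275% against a real increase of 37%), the index works out, by our arithmetic, to roughly 6.4, which is wildly above the recommended ceiling of 0.05:

\mathrm{GDI} \;=\; \frac{\text{size of the effect in the graph}}{\text{size of the effect in the data}} \;-\; 1
            \;=\; \frac{275\%}{37\%} \;-\; 1 \;\approx\; 6.4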

and ﬁll in the blanks. we have to try to keep the height of the segment in our mind while shifting the bases until they all line up.g.000 30 20 10 Program A Program B 0 1970 Program C FIGURE 2–16 A stacked line graph. 8. or using separate bars for each category of marital status.992 11.776.000 820.860 2.000 17. and the numbers are given with as much accuracy as possible. These data would either be better presented in a table.438. because they have a common axis (the top or bottom of the graph). or when we want the reader to see the actual numbers. .823 186.000 8. not an easy task by any means. In Figure 2–16. and then see if the heights are comparable.400 27. when only a table of numbers will do—when we have many variables to show at the same time.221 354. 4.925 912. we have no trouble comparing the groups with respect to the proportion married or single.138.031.850 3.098. Conclusion We’ll close with a beautiful quote from Howard Wainer (1990): “Although I shudder to consider it.506.275. but it’s terrible for looking at the contributions of each.313. what about those who are widowed? To compare the groups.970 406.058 20 36 27 21 27 31 31 36 29 27 17 30 9 13 7 6 7 7 12 6 8 5 10 6 32 123 67 18 54 63 49 49 69 40 34 38 = infant mortality rate/1. I sure hope not much.582 8.339 283. This type of graph is ﬁne if we want to see what’s happening to the total cost of the three.050 28. 4.752 1.000 110.020 2.000 2. Indeed.000 2. isn’t it? The bottom line—don’t use it.180 1. 50 40 Cost in $000.000 IMR* TABLE 2–8 Demographic characteristics of the countries in South America Argentina Bolivia Brazil Chile Colombia Ecuador Guyana Paraguay Peru Suriname Uruguay Venezuela *IMR 2.600. three-dimensional pie charts that clutter the pages of USA Today. Which program is growing the fastest? The reality is that Programs A and B are increasing geometrically each decade (e. we show the annual cost of three programs over time in a stacked line chart.000 14.561 214. whereas C is only increasing arithmetically (2.661 1..000 5. 16).190 1. It may seem at ﬁrst glance as if tables were the simplest thing in the world to construct: just write the names of the variables as the headings of the columns.134 600 2.000 Deaths/ 1. 6. Table 2–8 is such a table.354. 60 0 Divorced A B Group C FIGURE 2–15 A stacked bar chart.LOOKING AT THE DATA History Calculus Calculus Economics History Economics 15 100 75 Percent Psychology Sociology Sociology Psychology 50 Single Common-Law Group 1 Group 2 25 Widowed Married FIGURE 2–14 Comparing two groups using pie charts. 1980 Year 1990 2000 MAKING BETTER TABLES So far. as if this were the only way that data can be portrayed. perhaps there is something to be learned from the success enjoyed by the multi-colored.946 1.979 1. There are times.663 756. though. 8). the subjects along the left to indicate the rows.434 1. and it is typical of many you’ll see.040 324 1. and Newsweek. graphs are excellent for displaying one or two variables at a time. As with a pie chart. we’ve been showing you different ways of presenting data in graphs.285. Hard to tell. The countries are listed alphabetically. 2.000 live births. But.215 142. the marital status in three groups.899. Time.973.098.520.736 2.” Country Area (km2) Population Per capita GNP (US$) Births/ 1.

If the population increases by 3% a year. we probably asked people their age to the nearest year. If you want to relate it primarily to other health indices. If the exact numbers are important for archival purposes then. 10Don’t bother to check. Now let’s try the same exercise again: Which is the largest country? The smallest? The one with the highest GNP? The lowest IMR? That was much easier. there appears to be a group of countries with a gradation of similar IMRs. with an IMR that is quite a bit higher than that of the next country.000 Deaths/ 1. Bolivia seems to be in its own class. with rounding introduced Country Area Population (100. For most purposes. The moral of this story is to round. out they go (or into an appendix). ask yourself. it was out of date almost as soon as it was recorded. 11All of whom squeeze into one stadium every Sunday to watch the soccer match.13 Those divisions are totally arbitrary. recorded by the central government. as in Table 2–10 in which information is arranged logically for you.8 3 17 0. then keep them. But if there is one major point you want to make. . and then round again—keep enough digits to highlight important differences.02 years.000 IMR* Argentina Bolivia Brazil Chile Colombia Ecuador Guyana Paraguay Peru Suriname Uruguay Venezuela 28 11 85 8 11 3 2 4 13 1 2 9 28 6 110 11 28 8 0.992” has gone by the board. ranging from the one with the highest IMR to the lowest (or vice versa). only in its place. Between the time the census was taken (and don’t forget it was probably taken over a period of weeks or months). followed by the country’s per capita income and size. however.11 By the time you ﬁnish reading that number. it would be even better to list the countries in order. Don’t get us wrong.000) Per capita GNP (US$100) Births/ 1. such as focusing on the IMR. not that there are too many columns but that we have “unnecessary” accuracy. you can put one in.000 2) km (1. like a child. reproduced in the atlas. then there are nearly seven additional people every minute. if IMR is the most important point of the table. Making the problem worse. Now. but that ﬁnal “90.992. have you ever tasted Bolivian wine? We rest our case. This also means that it may be worthwhile to reorder the columns. and most likely a myth in developing ones). but stick the table in an appendix.000 a day. you’ll remember that it was somewhat over 110 million. the population of Brazil is given as 110. wasn’t it? Getting rid of unnecessary digits made the table much easier to comprehend. we’ve done some rounding. On the other hand. for instance.01 years represents less than four days. if you feel that there should be a break between Paraguay and Suriname. or for a number of different purposes. and Chile is by itself. maintain as many signiﬁcant digits as you can come up with. Are the other columns really necessary? If you want to relate the IMR to the size of the country or to other indices of health such as the birth and death rates. Then. 12Isn’t it somewhat passing strange that people will lie about their age. otherwise. “Inaccurate precision” can be found all over the place. then the birth and death rates are next. If we report that the average age of one group is 43. For example. there are 55 of them. ﬁne. quickly—which is the largest country? The smallest? The one with the highest GNP? The lowest infant mortality rate (IMR)? If you think that was hard. we’re ﬂexible. is that last decimal place really meaningful? Bear in mind that . reported in an official document. 
with the lowest IMR. Finally. use spaces to highlight clusters. list it ﬁrst. Finally. without glancing up the page or at the table. and read by you. For example. so they were introducing a loss of accuracy from the very outset (assuming that they didn’t lie about their age12).3 3 14 21 6 24 20 12 10 3 12 19 30 27 21 20 36 27 21 27 31 31 36 29 27 17 30 9 13 7 6 7 7 12 6 8 5 10 6 32 123 67 18 54 63 49 49 69 40 34 38 *IMR = infant mortality rate/1.16 THE NATURE OF DATA AND STATISTICS TABLE 2–9 Data from Table 2–8. accuracy is good but. That number is no longer correct—if it ever was to begin with—but the last three digits give the illusion of precision.000 live births. years may have elapsed. Keeping the countries in alphabetical order makes sense if this table is referred to often. imagine how hard it would be if we had listed all of the countries in Africa.000. Even assuming that the census was correct when it was taken (a dubious assumption at best in developed countries. so many digits give an illusion of accuracy that is often misleading. Then.76 years. In Table 2–9. and no more. if you’re like most people. do you remember the exact population of Brazil? Probably not. we can go even further. However.098. it’s already wrong. or almost 10. but not about their year of birth? 13It’s probably no coincidence that they also produce the ﬁnest wine in South America.10 Why was such a seemingly easy task so hard? The main reason is that there are too many numbers. and for another group is 44.
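If your table already lives in a spreadsheet or a statistics package, the reordering and rounding take a line or two each. A small pandas sketch (ours, carrying only a few of the IMR values from the tables above) shows the idea; the same two moves apply to the full table.

import pandas as pd

# Infant mortality rate per 1,000 live births (values as rounded in Table 2-9)
imr = pd.DataFrame({
    "Country": ["Argentina", "Bolivia", "Brazil", "Chile", "Peru"],
    "IMR": [32, 123, 67, 18, 69],
})

# Put the most important column first and order the rows by it
imr = imr.sort_values("IMR", ascending=False).reset_index(drop=True)
print(imr)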

This time. and click the arrow next to Category Axis • You can name the axes by clicking the button marked • Click OK Frequency Polygons • From Graphs. histogram. that’s where we got all these good ideas.000. 5. reordered and with spaces Bolivia Peru Brazil Ecuador Colombia Guyana Paraguay Suriname Venezuela Uruguay Argentina Chile *IMR 123 69 67 63 54 49 49 40 38 34 32 18 36 29 27 31 27 31 36 27 30 17 20 21 13 8 7 7 7 12 6 5 6 10 9 6 6 19 24 10 12 3 12 30 21 27 21 20 6 17 110 8 28 0. 6.000 live births. choose Line • The default is Simple. frequency polygon. 4.LOOKING AT THE DATA 17 Country IMR* Births/ 1. Schmedlap Anxiety Inventory scores for 128 people. 2. though. look at the article and book by Wainer listed in the “To Read Further” section.” before you begin. 3.000 km2) TABLE 2–10 Data from Table 2–9. Time since the last patient indicated his/her gratitude. If you’ve never used SPSS before. The number of patients with 0.000) Area (100. keep it and then click the Define button EXERCISES Let’s take another look at some of the variables we used in the exercises for Chapter 1.3 14 3 28 11 11 13 85 3 11 2 4 1 9 2 28 8 = infant mortality rate/1.000 Deaths/ 1. • Click the name of the variable you want to graph from the list on the left.8 3 0. choose Bar • The default is Simple. “Getting Started with SPSS. 1.000 Per capita GNP (US$100) Population (1. Before-taxes income. or 2+ vessels with > 75% stenosis. 7. Just to keep you on your toes. If you really want to learn how to make good tables. It’s a basic tutorial on getting started Histograms • From Graphs. there is sometimes more than one correct answer. 1. Income for the different specialties in your profession. Number of hair transplant sessions per person. indicate what type of graph you’d use to present the data (bar chart. Range of wrist motion for 100 patients. you may want to look at Chapter 29. How to Get the Computer to Do the Work for You Note: Many chapters have a section on the end showing how to use SPSS to run the analyses mentioned in the chapter. or something else). keep it and then click the Define button Titles . as well as a few others to minimize boredom.

and click the arrow next to Dependent List • Click OK • In the Line Represents area. n of cases • Click the name of the variable you want to graph from the list on the left.18 THE NATURE OF DATA AND STATISTICS • Click the name of the variable you want to graph from the list on the left. and click the arrow next to Category Axis • You can name the axes by clicking the button marked • Click OK Titles Stem-and-Leaf Plots • From Analyze. choose Descriptive Statistics ¨ Explore • Click the Plots button • Choose Stem-and-Leaf and click the Continue button Cumulative Frequency Polygons • From Graphs. and click the arrow next to Category Axis • You can name the axes by clicking the button marked • Click OK Titles . keep it and then click the Define button • Click the name of the variable you want to graph from the list on the left. choose Line • The default is Simple. check Cum.

CHAPTER THE THIRD

Describing the Data with Numbers
Measures of Central Tendency and Dispersion

In this chapter, we discuss how to summarize the data with just a few numbers: measures of central tendency (such as the mode, median, and mean) and measures of dispersion (such as the range and standard deviation). (We will later discuss two other indices, called "skewness" and "kurtosis," which we'll get to later in this chapter.)

Graphing the data is a necessary first step in data analysis, but it has two limitations. First, if someone asks you to describe the essence of what you found, there's not much you can do with the results except show them: find a spare napkin (preferably unused) and draw a graph. It would be helpful if we could summarize the results with just a few numbers.¹ Second, we can't easily compare the results of two or more different groups or see if they differ in important ways. Not surprisingly, those numbers exist. The two most important are measures of central tendency and of dispersion. However, before we introduce these two terms, a brief diversion is in order to introduce some of the shorthand notation that is used in statistics, and even show you some more Greek.

A SLIGHT DIGRESSION INTO NOTATION

A specific data point, that is, the value of a variable for one subject, is represented by the capital letter X. If we want to refer to a particular subject, we use subscript notation: Xi is the value of X for subject i, X3 the value of X for subject 3, and so on. In Table 2–2, for subject 1, X = 43. The small letter x is used to denote something different, which we'll discuss shortly.

We denote the mean (see below for the definition) of a variable by putting a bar over the capital letter X: X̄. When speaking to another statistician, we can say either the "mean" or the "X bar."² In recent years, we've seen a new abbreviation for the mean: M. This is because most word processing packages can't handle X̄, so we have to adapt our practices to accommodate the needs of the computer (does that strike you as bizarre as it does us?).

The number of subjects in the sample is represented by N. There is no convention on whether to use uppercase or lowercase; take your pick and you'll find someone who'll support your choice. Most books, though, use a lowercase n to indicate the sample size for a group when there are two or more groups, and the uppercase N to show the entire sample. If there is only one group, N is all we need. If there are two or more groups, how do we tell which one the n refers to? Whenever we want to differentiate between numbers, be they sample sizes, data points, or whatever, we put a subscript after the letter to let us know what it refers to: n1 would be the sample size for group 1, and so on. That is, nj is the number of subjects (sample size) in group j, and N is the total sample size, summed over all groups.

To indicate adding up a series of numbers, we use the symbol ∑, which is the uppercase Greek letter sigma. (The lowercase sigma, σ, has a completely different meaning, which we'll get to later; it is not the name of a drinking place for divorced statisticians. See the glossary at the end of the book.) ∑ means to sum. If there is any possible ambiguity about the summation, we can show explicitly which numbers are being added, using the subscript notation:

    \sum_{i=1}^{N} X_i    (3-1)

We read this as, "Sum over X-sub-i, as i goes from 1 to N." This is just a fancy way of saying "Add all the Xs," one for each of the N subjects. Later in this book, we'll get even fancier. But for now, that's enough background and we're ready to return to the main feature.

¹Even more important, there wouldn't be any work for statisticians, and they'd have to find an honest profession.
²X refers to a single data point; X̄ ("X bar") is the arithmetic mean (AM).
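If the Greek makes your eyes glaze over, it may help to see that ∑ is nothing more than a loop. A two-line translation of equation 3-1 and the mean into Python (our sketch, not the book's notation), using a few of the bedpan counts from Table 2–2:

X = [43, 41, 14, 31, 35]        # a few of the bedpan counts; X[i] is X-sub-i
N = len(X)

total = sum(X)                   # this is the capital-sigma part
mean = total / N                 # X-bar: divide the sum by the sample size
print(total, mean)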

MEASURES OF CENTRAL TENDENCY

A measure of central tendency is the "typical" value for the data.

The Mean

Just to break the monotony, let's begin by discussing interval and ratio data and work our way down through ordinal to nominal. Take a look at Figure 3–1, where we've added a second group to the bedpan data from the previous chapter. The shape of its distribution is the same as the first group's, but it's been shifted over by 15 units. This immediately tells us that the second group worked harder than the first (or had more patients who needed this necessary service). Is there any way to capture this fact with a number?³

FIGURE 3–1  Graphs of two groups, with the second shifted to the right by 15 units.

One obvious way is to add up the total number of bedpans emptied by each group.⁴ For the first group, this comes to 3,083, and the total for the second group is 4,583. However, we're not always in the position where both groups have exactly the same number of subjects. If the students in the second group worked just as hard but numbered only 50, their total would be only 2,291 or so (MMCCXCI, for those of you who don't calculate in Latin). It's obvious that a better way would be to divide the total by the number of data points, so that we can directly compare two or more groups even when they comprise different numbers of subjects. So, dividing each total by 100, we get 30.83 for the first group and 45.83 for the second. What we've done is to calculate the average number of bedpans emptied by each person: 30.83 bedpans in the case of Group 1 and 45.83 for Group 2. In statistical parlance, this is called the arithmetic mean (AM) or, for short, the mean. The reason we distinguish it by calling it the arithmetic mean is that there are other means, such as the harmonic mean and the geometric mean, both of which we'll touch on (very briefly) at the end of this chapter. From now on, when the term mean is used without an adjective, it refers to the AM. Not surprisingly, using the notation we've just learned, the formula for the mean is:

    \bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}    (3-2)

We spelled out the equation using this formidable notation for didactic purposes. Because there is no ambiguity regarding what values of X we're summing over, we can simplify this to:

    \bar{X} = \frac{\sum X}{N}    (3-3)

We'll use the conceptually simpler form in the text unless there is any room for confusion (and there's always room for confusion in this field).

The mean is the measure of central tendency for interval and ratio data.

One of the ironies of statistics is that the most "typical" value very often never appears in the original data. If you go back to Table 2–2, you won't find anybody who dumped 30.83 bedpans, yet this value is the most representative of the group as a whole.

The Median

What can we do with ordinal data? It's obvious (at least to us) that, because they consist of ordered categories, you can't simply add them up and divide by the number of scores. Even if the categories are represented by numbers, such as Stage I through Stage IV of cancer, this approach is logically inconsistent,⁵ and the "mean" is meaningless.⁶ In this case, we use a measure of central tendency called the median.

The median is that value such that half of the data points fall above it and half below it.

Let's start off with a simple example. We have the following 9 numbers: 1, 3, 3, 4, 6, 13, 14, 14, and 18. Note that we have already done the first step, which is to put the values in rank order (it is immaterial whether they are in ascending or descending order). Because there is an odd number of values, the middle one, 6 in this case, is the median: four values are lower and four are higher. If we added one more value, say 17, we'd have an even number of data points, and the median would be the AM of the two middle ones. Here the middle values would be 6 and 13, whose mean is (6 + 13) ÷ 2 = 9.5; half of the values are at or below 9.5 and half are located at or above. (We're calculating the median because we're not supposed to use the mean with ordinal data; if that's the case, how can we then turn around and calculate this mean of the middle values? Strictly speaking, we can't, yet we do. If the median number occurs more than once, as in the sequence 5 6 7 7 7 10 10 11, some purists calculate a median that is dependent on the number of values above and below the dividing line, e.g., there are two 7s below and one above, and this would then be taken as the median. Not only is this a pain to figure out, but also the result rarely differs from our "impure" method by more than a few decimal places.)

As we've said, the median is used primarily when we have ordinal data. But there are times when it's used with interval and ratio data, too, in preference to the mean. If the data aren't distributed symmetrically, then the median gives a more representative picture of what's going on than the mean. We'll discuss this in a bit more depth at the end of this chapter, so be patient.

The Mode

Even the median can't be used with nominal data. The data are usually named categories and, as we said earlier, we can mix up the order of the categories and not lose anything, so the concept of a "middle" value just doesn't make sense. The measure of central tendency for nominal data is the mode.

The mode is the most frequently occurring category.

If we go back to Table 2–1, the subject that was endorsed most often was Economics, so it would be the mode. If two categories were endorsed with the same, or almost the same,⁷ frequency, the data are called bimodal. This happened in one course I had in differential equations: if you understood what was being done, the course was a breeze; if you didn't, no amount of studying helped. The final marks looked like those in Figure 3–2, mainly As and Ds, with a sprinkling of Bs, Cs, and Fs. If there were three humps in the data, we could use the term trimodal, but it's unusual to see it in print because statisticians have trouble counting above two. However, you'll sometimes see the term multimodal to refer to data with a lot of humps⁸ of almost equal height.

FIGURE 3–2  A bimodal distribution of course grades (grades A+ through F on the X-axis, number of students on the Y-axis).

MEASURES OF DISPERSION

So far we've seen that distributions of data can differ with regard to their central tendency, but there are other ways they can differ (i.e., in their shape). To see this, take a look at Figure 3–3. The two curves have the same means, marked X̄, yet they obviously do not have identical shapes: the data points in Group 2 cluster closer to the mean than those in Group 1, so there is less dispersion in the second group. We'll get back to these two groups after we introduce you to some more jargon.

FIGURE 3–3  Two groups, differing in the degree of dispersion.

³By now, you should have learned that we never ask a question unless we know beforehand what the answer will be.
⁴If you don't believe us, you can add up the numbers in Table 2–2!
⁵This is like the advice to a nonswimmer to never cross a stream just because its average depth is four feet.
⁶It also seems ridiculous to write that the mean stage is II.5.
⁷The quantity "almost the same" is mathematically determined by turning to your neighbor and asking, "Does it look almost the same to you?"
⁸Another technical statistical term.
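Python's standard library happens to carry all three measures; a quick sketch (ours, not anything from the book) applied to the nine-number example above, plus the Table 2–1 favourite, shows them side by side.

import statistics
from collections import Counter

values = [1, 3, 3, 4, 6, 13, 14, 14, 18]          # the nine-number example
print(statistics.mean(values))                     # arithmetic mean: about 8.44
print(statistics.median(values))                   # median: 6

courses = ["Sociology"] * 25 + ["Economics"] * 42 + ["History"] * 8 \
        + ["Psychology"] * 13 + ["Calculus"] * 12  # Table 2-1
print(Counter(courses).most_common(1))             # mode: Economics, 42 votes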

A measure of dispersion refers to how closely the data cluster around the measure of central tendency.

As before, we'll work through the types of data, although this time we'll begin with nominal data and work through to interval and ratio data.

The Index of Dispersion

The simplest measure of dispersion for nominal data is simply to state how many categories were used. However, this is a fixed number in many situations (there are only two sexes,⁹ a few political parties, and so on) and in most cases all of them are used, so there's precious little we can do with it. There is, though, a better index of dispersion for nominal and ordinal data, called, simply, the index of dispersion. It is defined as follows:

    D = \frac{k\,(N^2 - \sum f_i^2)}{N^2\,(k - 1)}    (3-4)

where k is the number of categories, fi the number of ratings in each category, and N the total number of ratings. If all of the ratings fall into one category, then D is zero; if the ratings were equally divided among the k categories, D would be equal to 1. To illustrate how it's calculated, let's go back to the data in Table 2–1. Here, k = 5 (Sociology, Economics, History, Psychology, and Calculus), f1 = 25, f2 = 42, and so on; ∑fi² = 2,766; and N = 100. So,

    D = \frac{5\,(100^2 - 2766)}{100^2\,(4)} = \frac{36{,}170}{40{,}000} = 0.904    (3-5)

However, if the eight people who rated History as their most boring course changed their responses to Calculus, meaning that only four of the five categories were used, then D would drop to about 0.88.

The Range

Having dispensed with nominal data, let's move on to ordinal data. When ordinal data comprise named, ordered categories, they are treated like nominal data. However, if the ordinal data are numeric, such as the rank order of students within a graduating class, we can use the range as a measure of dispersion.

The range is the difference between the highest and lowest values.

If we had the numbers 102, 109, 110, 117, and 120, then the range would be (120 − 102) = 18. The range is always one number; do not show your ignorance by saying "the range is 102 to 120," even though we're sure you've seen it in even the best journals.

The main advantage of this measure is that it's simple to calculate. Unfortunately, that's about the only advantage it has, and it's offset by several disadvantages. The first is that, by definition, the range depends on only the two most extreme scores, those few poor souls who are out in the wings (the midgets and the basketball players), which means the range is unstable: its value can change drastically with more data or when a study is repeated. All it takes is one midget or one stilt in the sample, and the range can double. The second problem is that the range is dependent on the sample size. The larger the number of observations, the better are the chances of finding one of these folks, and so the larger the range; that means that if we add new subjects, the range will likely increase, especially with large sample sizes.

However, the range isn't a totally useless number. It comes in quite handy when we're describing some data, especially when we want to alert the reader that our data have (or perhaps don't have) some oddball values. For instance, if we say that the mean length of stay on a particular unit is 32 days, it makes a difference if the range is 10 as opposed to 100; in the latter case, we'd immediately know that there were some people with very long stays.

The Interquartile Range

Because of these problems with the range, especially its instability from one sample to another or when new subjects are added, another index of dispersion is sometimes used with ordinal data, called the interquartile range (sometimes referred to as the midspread); you can use this technique with ordinal, interval, and ratio data. To illustrate, we'll use some real data for a change. Table 3–1 shows the length, width, breadth, and gonad grade for 35 littleneck clams, Protothaca staminea, harvested in Garrison Bay.¹⁰ These data were taken from a book by Andrews and Herzberg (1985).¹¹ For this part, we'll focus on the data for the width (not the gonads), for reasons we'll go into shortly; because the widths are measured in millimeters, we know the data are ratio.

Remember that the median divides the scores into two equal sections, an upper half and a lower half. In Table 3–1, the widths have been put in rank order. There are 35 numbers in Table 3–1, so the median will be the eighteenth number, which is 407. Now let's find the median of the lower half: it's the ninth number, which is 340, and this is the lower quartile, symbolized as QL. In the same way, the upper quartile, QU, is the median of the upper half of the data; here, QU is 433. What we've done is divide the data into four equal parts (hence the name quartile).

The interquartile range is the difference between QL and QU and comprises the middle 50% of the data.

⁹Except in Italy and Israel, where the number of parties is variable and equal to one more than the sum of the total population.
¹⁰Although our book is intended as family reading, we had to include the data on the gonad grade of these clams because we will be using them later on in this section. If any reader is under 16 years of age, please read the remainder of this section with your eyes closed.
¹¹These data are kosher, although the subject matter isn't; we couldn't find any data on hole sizes in bagels or the degree of heartburn following Mother's Friday night meal.
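Equation 3-4 is easy enough to check by hand, but here is a small sketch (ours) that computes D for the Table 2–1 counts and confirms the 0.904 worked out above. Try replacing the counts with [100, 0, 0, 0, 0] or [20, 20, 20, 20, 20] to watch D hit its two extremes.

def index_of_dispersion(counts):
    k = len(counts)                        # number of categories
    n = sum(counts)                        # total number of ratings
    sum_f_sq = sum(f * f for f in counts)  # sum of squared frequencies
    return k * (n**2 - sum_f_sq) / (n**2 * (k - 1))

print(index_of_dispersion([25, 42, 8, 13, 12]))   # Table 2-1: about 0.904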

signify the sum of the differences between each value and the mean. as in Column 4.g. N. by deﬁnition. We can get around this problem by taking the absolute value of the deviation.. For example. Dividing this by N. 12Judging from the numbers. the middle 30% (D6 – D4). or deciles that break the data into 10 groups. by ignoring the sign.13 As you remember from high school. Therefore. 10. 13Erasing the minus sign is not considered to be good mathematical technique. making it a more useful measure. Clam number Length (mm) Width (mm) Breadth (mm) Gonad grade TABLE 3–1 Vital statistics on 35 littleneck clams Variations on a Range The interquartile range. which is 10. when we deal with another way of presenting data. there must be something wrong. the median. yields a mean of 9. absolute values. Mathematicians view the use of absolute values with the same sense of horror . Dividing this by the sample size. If we left it at this. Column 2 shows the results of taking the difference between each individual value and 9. called box plots. The sum of the absolute deviations is 42. The symbols at the bottom of Column 2. wider ranges (e. The choice depends on what information we want to have: narrower intervals (e. any number times itself must result in a positive value. the average of the absolute deviations. So. when taking the absolute value of a number is indicated by putting the number between the vertical bars: |+ 3| = 3. which divides the numbers into quarters. The Variance and Standard Deviation But all is not lost. clearly. This yields a number called the variance. symbolized by ∑X. D6 – D4) contain less of the data but fall closer to the median. so we divide by the number of differences. that is. is perhaps the best known of the ranges. and in fact there is. The problem is the same as with the mode. but there’s no law that states that we have to split the numbers into four parts. D7 – D3) encompass more of the data but are less accurate estimates of the median. the sum of the deviations of any set of numbers around its mean is zero. can’t be manipulated algebraically. and the range. is some measure of the average deviation of the individual values. Adding up these 10 deviations results in—a big zero. What we want. for example. that is. for various arcane reasons that aren’t worth getting into here. which is denoted by the symbol s2. 50% (D7 – D3). we can use quintiles that divide the numbers into ﬁve equally sized groups. two negative numbers multiplied by each other yield a positive number: –4 –3 = +12. then. ∑(X – X). We’ll meet up with this statistic later in this chapter. To summarize the calculation: Mean deviation (MD) = ∑|X X| ∑|x| = N N (3–6) 31 30 29 28 23 18 26 16 24 25 27 15 11 9 12 14 22 5 7 13 4 21 10 17 6 8 34 20 35 33 19 3 2 1 32 91 169 305 330 420 335 393 394 402 410 389 455 459 452 449 471 465 487 485 472 512 474 468 475 481 479 509 486 511 519 508 505 517 530 537 77 141 264 268 282 288 333 338 340 349 356 385 394 395 397 401 402 407 408 408 413 414 417 422 427 430 433 436 447 456 464 471 477 494 498 42 81 172 188 265 193 209 253 216 253 249 269 282 282 278 271 299 286 298 281 302 317 272 287 315 314 284 275 285 312 298 338 334 337 345 0 3 3 3 2 3 QL = 340 3 3 3 2 3 3 2 2 2 3 3 2 3 3 3 3 2 3 3 3 3 3 3 3 3 3 Median = 407 QU = 433 and scorn with which politicians view making an unretractable statement. obviously civil servants. There is another way to get rid of negative values: by squaring each value. we can specify a range that includes. This looks so good.g. 
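For the coffee-break numbers in Table 3–2, the mean deviation of equation 3-6 comes out to 4.2; a minimal sketch (ours, in Python) of the same arithmetic:

X = [1, 3, 4, 7, 9, 9, 11, 12, 16, 18]     # Table 3-2, column 1
mean = sum(X) / len(X)                      # 9.0 coffee breaks

mean_deviation = sum(abs(x - mean) for x in X) / len(X)
print(mean, mean_deviation)                 # 9.0 and 4.2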
The Variance and Standard Deviation

But all is not lost. There is another way to get rid of negative values: rather than taking the absolute value of each deviation, we can square it, because any number times itself must result in a positive value (as you remember from high school, two negative numbers multiplied by each other yield a positive number). This is done in Column 4 of Table 3–2.

TABLE 3–2  Calculation of the mean deviation

  Column 1            Column 2          Column 3            Column 4
  Number of coffee    Raw deviation     Absolute deviation  Squared deviation
  breaks, X           (X − X̄)           |X − X̄|             (X − X̄)²
       1                  −8                 8                  64
       3                  −6                 6                  36
       4                  −5                 5                  25
       7                  −2                 2                   4
       9                   0                 0                   0
       9                   0                 0                   0
      11                   2                 2                   4
      12                   3                 3                   9
      16                   7                 7                  49
      18                   9                 9                  81
  ∑X = 90           ∑(X − X̄) = 0      ∑|X − X̄| = 42       ∑(X − X̄)² = 272

To recap the arithmetic: the mean of the 10 numbers in Column 1 is 9; the sum of the absolute deviations in Column 3 is 42, which, divided by N, gives the mean deviation of 4.2. Now for the squared deviations. If we simply left their sum at 272, the result would be larger as our sample size grows, so again we divide by the number of values. This yields a number called the variance, which is denoted by the symbol s²:

    s^2 = \frac{\sum (X - \bar{X})^2}{N} = \frac{\sum x^2}{N}    (3-7)

This is more like what we want, but there's still one remaining difficulty. The mean of the 10 numbers in Column 1 is 9.0 coffee breaks per day, and the variance is 27.2 squared coffee breaks. But what the #&$! is a squared coffee break? The problem is that we squared each number to eliminate the negative signs. So, to get back to the original units, we simply take the square root of the whole thing and call it the standard deviation, abbreviated as either SD or s:

    s = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} = \sqrt{\frac{\sum x^2}{N}}    (3-8)

The result, which is 5.22 (the square root of 27.2), looks more like the right answer. In other words, the SD is the square root of the average of the squared deviations of each number from the mean of all the numbers. The closer the numbers cluster around the mean, the smaller s will be; going back to Figure 3–3, Group 1 would have a larger SD than Group 2. And, unlike the variance, the SD is expressed in the same units as the original measurement, which is good.

Two cautions are in order. First, do NOT use the above equation to actually calculate the SD. To begin with, you have to go through the data three times: once to calculate the mean, a second time to subtract the mean from each value, and a third time to square and add the numbers. Moreover, because the mean is often a decimal that has to be rounded, each subtraction leads to some rounding error, which is then magnified when the difference is squared. Computers use a different equation that minimizes these errors. Second, for various arcane reasons that aren't worth getting into here, this equation is appropriate only in the highly unlikely event that we have data from every possible person in the group in which we're interested (e.g., all males in the world with hypertension). After we distinguish between this situation and the far more common one in which we have only a sample of people (see Chapter 6), we'll show you the equation that's actually used.

Let's look for a moment at some of the properties of the variance and SD. Say we took a string of numbers, such as the ones in Table 3–2, and added 10 to each one. It's obvious that the mean will similarly increase by 10, but what will happen to s and s²? The answer is, absolutely nothing. If we add a constant to each value, the variance (and hence the SD) does not change. If we multiply every value by a constant, however, both X̄ and SD will increase.

The Coefficient of Variation

One measure much beloved by people in fields as diverse as lab medicine and industrial/occupational psychology is the coefficient of variation (CV, or V). It is defined simply as follows:

    CV = \frac{SD}{\bar{X}}    (3-9)

Because both SD and X̄ are expressed in the same units of measurement, the units cancel out and we're left with a pure number, independent from any scale (Simpson et al., 1960). This makes it easy to compare a bunch of measurements, say from different labs, to see if they're equivalent in their spread of scores.

However, there are a couple of limitations of the CV. First, CV is sensitive to the scale of measurement (Bedeian and Mossholder, 2000). If we add a constant to every number, X̄ will increase but, as we said in the previous section, SD will not; consequently, CV will decrease as the mean value increases. (If we multiply every value by a constant, both X̄ and SD increase, leaving CV unchanged.) Second, although the SD enters into nearly every statistical test that we'll discuss, in one form or another, we can't incorporate CV into any of them. The bottom line, then, is that CV may be a useful index for ratio-level data, where you cannot indiscriminately add a constant number, but it should definitely not be used with interval-level data.
is that CV may be a useful index for ratio-level data where you cannot indiscriminately add a constant number.24 THE NATURE OF DATA AND STATISTICS TABLE 3–2 Calculation of the mean deviation Column 1 number of coffee breaks X Column 2 raw deviation X X Column 3 absolute deviation |X X| Column 4 squared deviation (X X)2 1 3 4 7 9 9 11 12 16 18 –– ∑X = 90 –8 –6 –5 –2 0 0 2 3 7 9 –– ∑(X – X) = 0 8 6 5 2 0 0 2 3 7 9 –– ∑ | X – X | = 42 64 36 25 4 0 0 4 9 49 81 —– 2 = 272 ∑(X – X ) s2 ∑(X N X)2 ∑ x2 N (3–7) This is more like what we want. It’s obvious that the mean will similarly increase by 10. 5. To begin with. which is then magniﬁed when the difference is squared. because the mean is often a decimal that has to be rounded. there are a couple of limitations of the CV. If we add a constant to every number. X will increase but.
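The variance, SD, and CV are just as easy to verify. The sketch below uses the formulas of Equations 3–7 to 3–9 (dividing by N) purely for illustration; as noted above, real statistical packages use a numerically safer algorithm. It also shows that adding a constant to every score leaves the variance untouched.

# Variance, SD, and CV for the coffee-break data (Equations 3-7 to 3-9).
# Illustration only; statistical packages use a safer computing formula.
data = [1, 3, 4, 7, 9, 9, 11, 12, 16, 18]
n = len(data)
mean = sum(data) / n                                 # 9.0

variance = sum((x - mean) ** 2 for x in data) / n    # 27.2 "squared coffee breaks"
sd = variance ** 0.5                                 # about 5.22, back in the original units
cv = sd / mean                                       # about 0.58, a unitless number

# Adding a constant shifts the mean but leaves the variance (and SD) alone:
shifted = [x + 10 for x in data]
s_mean = sum(shifted) / n                            # 19.0
s_var = sum((x - s_mean) ** 2 for x in shifted) / n  # still 27.2
print(variance, sd, cv, s_var)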

or to have a positive skew. If the SD increases as the mean does. and in how closely the individual values cluster around this typical value (dispersion). or has a negative skew. for data that can have only positive numbers (e. it also affects other parts of the curve. but failing it deﬁnitely shows a problem. shows the classical “bell curve. and a negative number reﬂects negative. However. First. we refer to this distribution as leptokurtic. They differ from those in Figure 3–3 in one important respect. Again. Distributions that are leptokurtic also have heavier tails.15 The three curves in Figure 3–5 are symmetric (i. most lab results. Although kurtosis is usually deﬁned simply in terms of the ﬂatness or peakedness of the distribution. it’s probably easier to see what these terms mean ﬁrst. scores on paper-and-pencil tests). then 16Usually Kurtosis refers to how ﬂat or peaked the curve is. would be of interest only to those who believe that wading through statistical text books makes them better people. FIGURE 3–5 Three distributions differing in terms of kurtosis. and we’ve listed the necessary commands for one of them at the end of this chapter. The curves in Figure 3–3 were symmetric. Curve A. we know that the normal curve extends beyond two SDs on either side of the mean. something we do only as a last resort. As usual. the “direction” of the skew refers to the direction of the longer tail.g. we can use two other measures to describe the distribution. A value of 0 indicates no skew. so take a look at the graphs in Figure 3–4. each with a different mean value of the variable. in statistical parlance) is longer than the other. But how do we know? A good place to start is just looking at a plot of the data. With interval and ratio data. there’s some skewness present. The distributions in this ﬁgure are said to be skewed. kurtosis doesn’t affect the variance of the distribution. whereas the ones in Figure 3–4 are not. one with positive and one with negative skew. one end (or tail. with positive numbers reﬂecting leptokurtosis. Then we can use a couple of tricks suggested by Altman and Bland (1996). platykurtosis.” a term we’ll deﬁne in a short while. so that it ends up with a value of 0. Most statistical computer packages do it for you. Curve A Curve B SKEWNESS AND KURTOSIS We’ve seen that distributions can differ from each other in two ways. Curve A Curve C 14A deﬁnition that excludes statisticians. . or left.e. many computer programs subtract 3 from this. such people are probably related to those who buy Playboy just for the articles. whereas platykurtic curves tend to have lighter tails. Curve A is said to be skewed right. as for skew. By contrast.” or “normal distribution. The statistical term for this is mesokurtic. 15At least some things in statistics make sense. However. only some summary information. Curve C is ﬂatter than the normal one. if the mean is less than twice the SD. So. The terminology of skewness can be a bit confusing. it’s called platykurtic. The middle line. in their “typical” value (the measure of central tendency). Curve B is skewed left. demographic information. but they differ with respect to how ﬂat or peaked they are. and negative numbers. Curve B Skew refers to the symmetry of the curve. So.DESCRIBING THE DATA WITH NUMBERS 25 where the zero is arbitrary and constants don’t change anything. their skew is 0). when everything else has failed. Passing the test doesn’t guarantee normality. a property known as kurtosis. 
The normal distribution (which is mesokurtic) has a kurtosis of 3. We’re not going to give you the formula for computing skew because we are unaware of any rational human being14 who has ever calculated it by hand in the last 25 years. FIGURE 3–4 Two curves. Are they symmetrically distributed. a positive number shows positive skew (to the right). We can use the second trick if there are a number of groups. SO WHO’S NORMAL? Many of the tests we’ll describe in this book are based on the assumption that the data are normally distributed. Curve B is more peaked.. You’ll have to check in the program manual16 to ﬁnd out what yours does. or is there a long tail to one side or the other? Sometimes we’re reading an article and don’t have access to the raw data. skew. not to where the bulk of the data are located. most statistical computer packages ﬁgure out kurtosis for you. skewness and kurtosis. The formula for calculating kurtosis.
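If you want to see the numbers rather than take our word for it, the following sketch assumes the SciPy library is available (it is not part of the book's SPSS instructions). Note that scipy.stats.kurtosis subtracts 3 by default, so a normal-looking sample comes out near 0 rather than 3—exactly the convention issue just mentioned.

# Skewness and kurtosis, assuming SciPy is installed (a sketch, not the book's commands).
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)
normal_ish = rng.normal(size=10_000)         # roughly mesokurtic, skew near 0
right_skewed = rng.exponential(size=10_000)  # long right tail: positive skew

print(skew(normal_ish), kurtosis(normal_ish))      # both close to 0
print(skew(right_skewed), kurtosis(right_skewed))  # positive skew, heavier tail
# kurtosis() uses the "excess" convention (3 already subtracted);
# pass fisher=False if you want the normal curve to come out as 3.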

but when we use box plots to compare two or more groups. A more formal way to check for normality is to look at the tests for skewness and kurtosis that we discussed in the previous section.19 In Figure 3–6. and four measures of dispersion (the range. he has also done more to confuse people than did Abbott and Costello doing “Who’s on First. However. Because the tests are looking at deviation from normality. not a calendar). For simplicity’s sake. outlier. there is one outlier and one far. but rest assured. So which one do we use? That’s extremely simple to answer—whichever one your computer program deigns to give you. skewed. However. it’s 3.e. Figure 3–6 shows the data for the width of those delightful littleneck clams we ﬁrst encountered in Table 3–1. WHEN DO WE USE WHAT (AND WHY) Now that we have three measures of central tendency (the mode.5 box lengths. Looking at measures of dispersion. A step is 1. we’re going to have to introduce a bit more of Tukey’s jargon. For example. and the SD). The “+” in the middle represents the median of the distribution. Most computer packages that produce box plots differentiate between them. Just this central part of the box plot yields a lot of information.0. and lets you evaluate the data against a number of different distributions. we’ll invoke four criteria that are applied to evaluating how well any statistical test—not just descriptive stats—works: Sufficiency. Notice that we’ve drawn it vertically rather than horizontally. it’s fairly safe to assume that the data are reasonably normal. which is not usually drawn. and the mean). let’s talk about the upper whisker ﬁrst. and then describe what we see. Let’s start off with the easy parts. simply because no datum point is near the step. if it is closer to the lower quartile. we’ll try to use the more familiar terms. So much for computers simplifying our lives. others an asterisk. and 99% are encompassed by the outer fences. The same thing is done for the lower whisker. because it uses all of the data. is two steps beyond the quartile. the number that’s produced doesn’t mean much in its own right. when do we use what? To help us decide. But. the median. then the fence is drawn to the largest observed value that is still less than one step away from QU. The long lines coming out the sides are called whiskers.0 times the interquartile range.. If a lot of data are about and the distribution is roughly symmetrical. Any data points that fall between the fences are called outliers. far) outliers. If both are less than 2. only a statistician would think of doing them by hand.5 and 3 times the IQR. then 95% of the data points would fall within the range deﬁned by the inner fences. It can be drawn either way. a better way would be a direct test to see if the data deviate from normality. In both cases. we’ll brieﬂy return to the realm of descriptive statistics and talk about one more type of graph. The Anderson-Darling Normality Test is a bit more ﬂexible. One of the most powerful graphing techniques.” by making up new terms for old concepts. because there’s little ambiguity. again the data are skewed.17 Again. the range just the two extreme values. We can see the variability of the data from the length of the box. both falling at the lower end of the distribution. then you should think of one of the transformations described in Chapter 27 to normalize them. if a datum point doesn’t happen to be there. If they’re your own data you’re looking at. 80 160 240 320 Width (mm) + 400 480 . 
If the data are normally distributed. we just look at the p level. the best way to begin is to look at one (a box plot. The outer fence. if the data points are relatively sparse on one side. there’s no ﬁxed convention for this. they’re probably easier to read in the vertical orientation. BOX PLOTS Now that we’ve introduced the concepts of SD and standard error (SE). and the placement of the median tells us whether or not the data are skewed. or extreme. we won’t spell out the formulae for them. Tukey refers to something almost like the upper and lower quartiles as “hinges. 18Actually. as is the case with these numbers. assesses the data you have against a normal population. then the inner fence is drawn one step above the upper quartile. called the box plot. that is. the interquartile range. whereas the mode uses very little. the data are negatively 19 For example. The end of the whisker (which may or may not have that small perpendicular line at the end of it) corresponds to the inner fence. then both whiskers will be about the same length. If an actual datum point falls exactly at one step. A logical question that arises (or should arise. As with the computations for skewness and kurtosis. Just to pull things together. that. FIGURE 3–6 Box plot of widths of littleneck clams. The Wilks-Shapiro Normality Test. SPSS/PC uses an O for outliers and an E for extreme (i.05. we want the p level to be greater than . it means there’s too much difference. it’s possible that one whisker may be considerably shorter than the other. comes from the fertile brain of John Tukey (1977). These values actually make a lot of sense. the mean is very sufﬁcient. QU and QL. they are positively skewed. using different symbols for near and far outliers. and any beyond the outer fence are called far outliers. this really doesn’t matter too much. if it’s less. How much of the data is used? For measures of central tendency. Some computer programs use a plus sign. Remember that the interquartile range (IQR) was deﬁned as QU – QL.5 times this value. 1. the SD uses all the data. Figure 3–7 labels the various parts of a box plot. To fully understand them and their usefulness. so the middle 50% of the cases fall within the range of scores deﬁned by the box. There are actually a few different statistics for this. the index of dispersion. if you’re paying attention) is why the fences are chosen to be 1. If the median is closer to the upper quartile.18 The ends of the box fall at the upper and lower quartiles. However. Tukey himself drew a solid line across the width of the box. as the name implies. Minitab uses an asterisk (*) for near outliers and an O for far outliers.26 THE NATURE OF DATA AND STATISTICS 17Unfortunately. the median gives us an estimate of central tendency.” As much as possible. who has done as much for exploring the beauty of data as Marilyn Monroe has done for the calendar.
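To make the jargon concrete, here is a sketch that computes the quartiles, the step, and the inner and outer fences for a small made-up batch of numbers, and then runs the Shapiro–Wilk test (the test referred to above as Wilks–Shapiro; in SciPy it is scipy.stats.shapiro). SciPy is assumed to be installed.

# Fences for a box plot plus a formal normality check (a sketch; data are hypothetical).
import numpy as np
from scipy.stats import shapiro

values = np.array([3, 5, 7, 8, 12, 13, 14, 18, 21, 110])  # 110 is an obvious outlier

q_l, median, q_u = np.percentile(values, [25, 50, 75])
step = 1.5 * (q_u - q_l)                  # 1.5 times the interquartile range

inner = (q_l - step, q_u + step)          # whiskers stop at the last point inside these
outer = (q_l - 2 * step, q_u + 2 * step)  # beyond here lie the "far" outliers
outliers = values[(values < inner[0]) | (values > inner[1])]

w, p = shapiro(values)                    # we want p greater than .05 (no detectable departure)
print(inner, outer, outliers, p)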

let’s promptly break it. Another way data can become skewed is shown in Figure 3–10. Type of data Measure of central tendency Measure of dispersion TABLE 3–3 Guidelines for use of central tendency and measure of dispersion Nominal Ordinal Interval Ratio Mode Median Mode Mean Median Mode Mean Median Mode Index of dispersion Range Interquartile range SD* Range Interquartile range SD Range Interquartile range *SD—standard deviation. the estimate would be biased. correctly or incorrectly. Again drawing a large number of samples. and the mean would be even further out on the tail. Applying these criteria. The mean is the measure of central tendency of choice for interval and ratio data when the data are symmetrically distributed around the mean. and mode in a symmetric distribution. we can use the guidelines shown in Table 3–3. we can do more statistically with the mean (and its SD) than with the median or mode.” To use the terminology we just introduced. This isn’t true for skewed distributions. If we now add that eighteenth fellow. but not when things are wildly asymmetric. and the mean is 3. . we are assuming. The mean is an unbiased estimator of the population value. However. like the SD. the median is offset to the right of the mode. then the mean. If we have interval data.DESCRIBING THE DATA WITH NUMBERS 27 FIGURE 3–7 Anatomy of a box plot. then our choice would be the mean and SD. most people off on the right are a bit odd. in that it would systematically underestimate the population value. We’ll explain why this is so in Chapter 7. Whenever possible. and mode all have the same value. Let’s see why. If we ignore the oddball off to the right. as is the SD where the denominator is N – 1. especially if the sample size is small. All these estimates of central tendency are fairly consistent with one another and intuitively seem to describe the data fairly well. and the mean is even further to the right than the median. the further apart these three measures of central tendency will be from one another. Efficiency. most of statistics (like most of science) is about generalizing from things you studied (the sample) to the rest of the world (the population). mode FIGURE 3–8 The mean. though. Robustness. we’d say the mean is not a robust statistic.88. To what degree are the statistics affected by outliers or extreme scores? The median is much more robust than the mean. When we do a study on a bunch of patients with multiple sclerosis. the picture would be reversed: the mode (by deﬁnition) would fall at the highest point on the curve. Mean. the Box 20From our political perspective. Having stated this rule. the median would be to the left of it.20 both the mode and the median of the 17 data points are 4. median. as in Figure 3–8. and we can do more with the median (and the range) than with the mode. As you can see. that the sample we studied is representative of all MS patients (the population). median. If we draw an inﬁnite number of samples. If the data are symmetrically distributed around the mean. multiplying the highest number in a series by 10 won’t affect the median at all. cluster more closely than does the range or IQR. We’re further assuming that the estimates we compute are unbiased estimates of the same variables in the population (the parameters). like physicians’ incomes. a synonym is “if the data are highly skewed. does the average of the estimates approximate the parameter we’re interested in. but will grossly distort the mean. median. For each listing. or is it biased in some way? 
As we’ll see when we get to Chapter 6. If the data were skewed left. we try to use the statistics that are most appropriate for that level of measurement. Far outliers Outlier Whisker QU Inner fence Median QL Whisker Inner fence Unbiasedness. where the denominator is just N. Figure 3–9 shows some data with a positive skew. if we use Equation 3–8. how closely do the estimates cluster around the population value? Efficient statistics. The more skewed the data. the most appropriate measures are listed ﬁrst.
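The robustness point is easy to demonstrate with a few hypothetical numbers: multiply the largest value by 10 and the mean lurches while the median doesn't budge.

# Robustness: multiply the largest value by 10 and watch what moves (made-up data).
from statistics import mean, median

scores = [2, 3, 4, 4, 4, 5, 6]           # hypothetical, roughly symmetric
wild = scores[:-1] + [scores[-1] * 10]   # the 6 becomes 60

print(mean(scores), median(scores))      # 4.0 and 4
print(mean(wild), median(wild))          # mean jumps to about 11.7; median is still 4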

As you can see in the graph. the range shoots up to 42 and the SD up to 9. So: i 1 ∑ Xi 3 X1 X2 X3 (3–11) 3 Xi i 1 X1 X2 X3 (3–12) __ Then. median. The Greek letter π (pi) doesn’t mean 3. if we were to plot them. Similarly. 525 5 4 3 2 Mode Arithmetic mean 350 175 Geometric mean FIGURE 3–10 Histogram of highly skewed data. we take the square root. such as population growth. the curve would rise more steeply as we move out to the right. the median more accurately reﬂects where the bulk of the numbers lie than does the mean. when the data are highly skewed. If the value of X8 is 138.28 THE NATURE OF DATA AND STATISTICS mode and median both stay at 4. After adding that one discrepant value. then this lack of sensitivity is a disadvantage. GM n i 1 Xi (3–10) This looks pretty formidable..14159. The formula for the geometric mean is: The Geometric Mean n OTHER MEASURES OF THE MEAN Although the arithmetic mean is the most useful measure of central tendency. the n to the left of the root sign ( ) means that if we’re dealing with two numbers. then the AM is (138 + 522) ÷ 2 = 330.41. FIGURE 3–11 The difference between the arithmetic and geometric means.32. show what is called exponential growth. but it’s not really that bad. and their SD is 1. the geometric mean is a better estimator than is the AM. Let’s assume we know the value for X8 and X10 and want to estimate what it is at X9. the Mean Median FIGURE 3–9 The mean. If the data are relatively well behaved (i. without too much skew). in this context. this overestimates the real value. as in Figure 3–11. the range of the 17 data points on the left is 5. we saw that it’s less than ideal when the data aren’t normally distributed. The conclusion is that when you’ve got exponential or growth-type data. if there are three numbers. So the median and the mode are untouched. but the mean increases to 6. The moral of the story is that the median is much less sensitive to extreme values than is the mean.e. and it is 522 for X10. we’ll touch on some variants of the mean and see how they get around the problem. the dot labeled Geometric mean seems almost dead on. but the mean value is now higher than 17 of the 18 values. However. and mode in a skewed distribution. it becomes an advantage. On the other hand. that is. it means the product of all those Xs. for skewed-up data. 1 0 0 5 10 15 20 25 30 35 40 45 0 0 2 4 6 8 10 .06. The Geometric Mean Some data. In this section.
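The 138-and-522 example can be checked in a couple of lines; statistics.geometric_mean needs Python 3.8 or later, and the second version does the same job with logarithms.

# Arithmetic versus geometric mean for growth-type data (the X8 = 138, X10 = 522 example).
from statistics import mean, geometric_mean
from math import log, exp

am = mean([138, 522])             # 330 -- overshoots when growth is exponential
gm = geometric_mean([138, 522])   # about 268.4, the better guess at X9

# The same answer, done with logs:
gm_logs = exp((log(138) + log(522)) / 2)
print(am, gm, gm_logs)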

so you don’t need a calculator): 4 8 6 3 4 a. which. and the AM is always larger than the GM. Coming from a school advocating the superiority (moral and otherwise) of the SG-PBL approach (that stands for Small Group—Problem-Based Learning and is pronounced “skg-pble”). The range is _____. c.49 (3–15) EXERCISES 1. 10. 19 10 10 10 10 10. in both sections combined). the harmonic mean of 138 and 522 is: 2 HM = __________ = 218. and (2) if an odd number of values are negative. Despite its name. e.97 9. if you’re really good at this sort of stuff. we measure the following variables. The Harmonic Mean Another mean that we sometimes run across is the harmonic mean. there were only two numbers. it is rarely used by musicians (and only occasionally by statisticians). its formula is: HM n n i 1 Data Arithmetic mean Geometric mean Harmonic mean TABLE 3–4 Different results for arithmetic. For each. 10. would it make the following parameter estimates larger. As the variability of the numbers increases. the differences among the three means also increase. A study of 100 subjects unfortunately contains 5 people with missing data. smaller. and hence the GM will be zero. then the product is zero. it gives the smallest number of the three means. The mode is _____. the number of articles each person had rejected by journals for inappropriate data analysis. the three means are all the same. 10 9. a.DESCRIBING THE DATA WITH NUMBERS 29 cube root. The SD is _____. 10. ﬁgure out the following statistics for this data set (we deliberately made the numbers easy. or stay the same? . Scores on a ﬁnal stats exam.0 Minimum = 16 SD = 5.57 2. d. This was coded as “99” in the computer. Just to give yourself some practice. 10. 10.00 9. 21Although why you’d want to be more conservative in this (or any) regard escapes us. The mean is _____. Assume that the true values for the variables are: X = 45. In the example we used. c. 3. as we can see in Table 3–4. geometric.4 (3–13) Most calculators have trouble with anything other than square roots. b. cluster. So you can use either a computer or. d. Time to complete the ﬁnal exam (there was no time limit). and the statistical tests are a bit more conservative.e. owing to the fact that all of the numbers are multiplied together and then the root is extracted: (1) if any number is zero.. then the computer will have an infarct when it tries to take the root of a negative number. b. the formula using logs is: GM = antilog 1 n i 1 ∑ log Xi (3–14) n Be aware of two possible pitfalls when using the GM.31 5.00 9. 10. logarithms. At the end. 15 1. each of which has a different number of subjects. The reason for this is that. and harmonic means 1 Xi 10. The median is _____.29 1 1 + 138 522 (3–16) 138 522 = 72036 = 268. irrespective of the magnitude of the other numbers.21 When all of the numbers are the same. Usually. If you are so inclined.11 10. and so on. The type of headache (migraine. in turn. the only time it is used is when we want to ﬁgure out the average sample size of a number of groups. 18. 2. give the best measure of central tendency and measure of dispersion. or tension) developed by all of the students during class (i. 11 5. 2. randomizing half of the stats students into SG-PBL classes and half into the traditional lecture approach.95 8.6 Maximum = 65 If the statistician went ahead and analyzed the data as if the 99s were real data. Based on a 5-year follow-up. so the geometric mean is: GM = 2 So. we do a study. is always larger than the HM.
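And the harmonic mean, using the same pair of numbers plus a made-up set of group sizes (the one situation where the HM usually earns its keep):

# Harmonic mean, and the ordering AM >= GM >= HM for the same two numbers.
from statistics import mean, geometric_mean, harmonic_mean

pair = [138, 522]
print(mean(pair))            # 330.0  (arithmetic)
print(geometric_mean(pair))  # about 268.4 (geometric)
print(harmonic_mean(pair))   # about 218.3 (harmonic -- always the smallest)

# The usual statistical use: the "average" n of unequally sized groups.
group_sizes = [10, 40, 50]   # hypothetical group sizes
print(harmonic_mean(group_sizes))  # about 20.7, well below the arithmetic mean of 33.3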

choose Descriptive Statistics ¨ Explore… • Choose the variables you want to analyze and click the arrow button next to the Dependent List box • Click the Statistics button and check the box called M-estimators • Press Continue and then OK . Skewness. The standard deviation e. Standard Deviation. and Kurtosis) • Click OK Testing for Normality • From Analyze. then • Click OK . and you’ll get the 5% trimmed mean mality plots with tests. The median c.30 a. choose Descriptive Statistics ¨ Descriptives… • Click on the variables you want to graph. The range THE NATURE OF DATA AND STATISTICS Box Plots • From Graphs. The mean d. choose Boxplot… • Simple is the default. click the How to Get the Computer to Do the Work for You Descriptive Statistics • From Analyze. keep it • In the Data in Chart Are box. and click the arrow button next to the Variable(s) box • Click the Options button • Check the boxes you want (will likely include Mean. The mode b. choose Descriptive Statistics ¨ Explore… • Choose the variables you want to analyze and click the arrow button next to the Dependent List box • Click the Plots button and check NorContinue button for Summaries of separate variables and click the Define button • Choose the variables you want to plot and click the arrow button next to the Boxes Represent: box • Click OK Trimmed Mean • From Analyze.

The Normal Distribution SETTING THE SCENE A survey of contraceptive practices found that the most widely used method is the phrase. in some sense. This isn’t true for many other types of distributions. Lippman (in Wainer and Thissen. such as a bell curve or a Gaussian distribution. we would have only one name to remember.000 people. we discuss what it is. 2Although 3A 4Thus 31 . the mean and variance aren’t dependent on each other. dear. with normally distributed data. blood pressure. when lying on his back. weight.” On an empirical level. and how to use it. If he had discovered this curve. Each measure. Unfortunately. he said. pity Alexander Graham Bell spent all his time on the phone. rumor has it that. why it’s useful. I’ve got a headache. but all of the curves would be roughly symmetric around their means and resemble that general shape. you’ll need to have some more information. nor “abnormal” about other types. WHY WE CARE ABOUT THE NORMAL DISTRIBUTION There are several reasons why the normal curve is important. “Everybody believes in the theory of errors (the normal distribution). The experimenters because they think it is a mathematical theorem. the standard term doesn’t make sense. Second. or between 106 and 112 times annually? B efore you can answer these important questions. Micceri (1989) looked at the distributions of scores from well over 400 widely used psychological measures. Now the moment of truth has come.” uttered by one or the other partner. it was found to be used an average of 100 times a year. and the ﬁrst thing he or she will do is draw bell curve.2. you can say that. with a standard deviation (SD) of 15. Karl Friedrich Gauss himself resembled a Gaussian curve. there’s nothing inherently “normal” about this distribution. starting with what we mean by a “normal distribution. Can we determine what proportion of the public uses this reason at least 115 times a year. The term “bell curve” comes from its shape.1 “Gaussian” from its discoverer. “Not tonight. The only ﬂy in the ointment is that the resemblance may be more illusory than real. if we were to measure the height. First.000) and make frequency polygons of our ﬁndings. and we’ll tell you what is meant by a normal distribution and why you really want to know about it. they would each approximate the normal curve. Third. 1976) put it well. normal curves are abnormal. and found that distributions that were strictly normal were as rare as hen’s teeth. its variance should remain the same. It’s often referred to by a couple of other names.3 So the alternative terms make sense and reﬂect attributes of the curve—its shape and history. such as achievement and aptitude tests.CHAPTER THE FOURTH The normal distribution is ubiquitous in statistics. many of the statistical tests we’ll be discussing in this book assume that the data come from a normal distribution.” We’ve made passing mention of it in the earlier chapters without really deﬁning what it is. although it wasn’t explicitly labeled as such. or fewer than 70 times a year. would have a different mean. The mathematicians because they think it is an experimental fact. such as Figure 3–6. Here. if we increase the mean of a normal distribution. or urine dehydroepiandrosterone level in a larger number of people (“large” meaning at least 1. naturally.4 1And has led to the “gong phenomenon” —ask a statistician any question. it’s held that many natural phenomena are in fact approximately normally distributed. The normal curve has appeared in several previous ﬁgures. 
Based on a survey of 2,000 people.

. we have to make a minor detour.e. the results are shown in Figure 4–2. The Central Limit Theorem states if we draw equally sized samples from a non-normal distribution. no matter how much it deviates from normal. the distribution of the means of these samples will still be normal. By the time we’ve rolled it 8 times. The sums could range from a minimum of 2 to a maximum of 12. added the numbers. we would expect that each number would appear one-sixth of the time. if we drew a large number of samples of reasonable size (we’ll deﬁne “reasonable” shortly). and 8 times. but two ways to roll a 3 (roll a 1 followed by a 2. So statisticians have found a way to transform all normal distributions so that they (the distributions. because there are more ways to get the numbers in the middle of the range. each with its own mean and SD. 50 0 0 1 2 3 4 5 6 Number on die We can illustrate this with another gedanken experiment. To play it safe. If the shape of the population is pretty close to normal. Consequently. and divided by 2 (i. we usually say that anything over 30 is enough under almost all circumstances. the Central Limit Theorem guarantees that. and we recorded the number of times each face appeared. There’s only one way to get a 2 (roll a 1 on each throw) or a 12 (roll a 6 each time). Now. added the numbers. then “large” can be as small as 2. then it “rolled” the die four times.. We did a computer simulation of this. This obviously is not a normal distribution. and we would get a graph that looks like Figure 4–1. and divided by 4 (the mean for a sample size of 4) for 600 trials. So. or a 2 followed by a 1). But this time. then the distribution of the means of those samples will always be normally distributed. we call the results a standard score. If the population is markedly different from normal. the distribution of means has lost its rectangular shape and has begun to look more normal. So. we’d need hundreds of tables to give us the necessary speciﬁcations of the distributions. How large is “large”? Again. Now for the real heart of the matter—the data don’t have to be normally distributed for this to be true because of what’s called the Central Limit Theorem. then 10 to 20 may be large enough. though. each done 600 times. This would make publishers of these tables ecstatic. not the statisticians) use the same scale. is a way of expressing any raw score in terms of SD units. if we take enough even moderately sized samples (“enough” is usually over 30). let’s roll the die twice and add up the two numbers. no face would be expected to appear more often than any other.32 THE NATURE OF DATA AND STATISTICS The fourth reason that the normal distribution is important is that. Number of tries STANDARD SCORES Before we get into the intricacies of the normal distribution. Notice that rolling the die even twice. the resemblance is quite marked. whatever the distribution of the data. The computer “rolled” the die twice. it’s referred to as a rectangular distribution. because of its shape. took the mean for a sample size of 2) 600 times. and again rolled the die 8 times and divided by 8. Number of tries 2 4 8 100 50 0 0 1 2 A standard score. abbreviated as z or Z. the means will approximate a normal distribution. If hundreds of variables were normally distributed. we wouldn’t expect each number to show up with the same frequency. as long as the samples are large enough. When we transform a raw score in this manner. and ﬁve ways to roll a 6. 
we expect that they will show up more often than those at the extremes. This works with any underlying distribution. If the die wasn’t loaded (and neither were we). 3 Mean 4 5 6 200 Number of rolls per try 150 FIGURE 4–2 Computer simulation of averaging the sum of rolling the die 2. it all depends. but everyone else mildly perturbed. 200 150 100 FIGURE 4–1 Theoretical distribution from rolling a die 600 times. 4. This tendency becomes more and more pronounced as we roll the die more and more times. Imagine that we had a die that we rolled 600 times. The idea is to specify how far away an individual value is from the mean by describing its location in standard deviation (SD) units.
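The die-rolling simulation is easy to re-run yourself. The sketch below is ours, not the program used for Figure 4–2, but it tells the same story: the spread of the sample means shrinks and their distribution piles up around 3.5 as the number of rolls per try grows.

# A quick re-run of the die-rolling thought experiment (a sketch, not the book's program).
import random
from statistics import mean, pstdev

random.seed(1)

def sample_means(rolls_per_try, tries=600):
    """Roll a fair die rolls_per_try times, average the faces, repeat tries times."""
    return [mean(random.randint(1, 6) for _ in range(rolls_per_try))
            for _ in range(tries)]

for k in (1, 2, 4, 8):                 # k = 1 reproduces the flat distribution of Figure 4-1
    m = sample_means(k)
    print(k, round(mean(m), 2), round(pstdev(m), 2))
# The average of the means stays near 3.5, but their spread shrinks and the shape
# piles up in the middle -- the Central Limit Theorem at work.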

the result will be 1. Isn’t science wonderful? There are a few points to note about standard scores that we can illustrate using the data in Table 4–1. They allow us to compare scores derived from various tests or measures.1 10. for the SDS score of 68: So.72 Mean 11. if we add up the z-scores. or 1.52 (4–3) Similarly. For example. the SD is 7.7 1.38 0. 6A nonfatal disorder that makes people long and green and turns their hair red.5 1.7) = 11. Second. This will always be the case if we use the mean and SD from the sample to transform the raw scores into z-scores.7. z-scores also have other uses. we get 22. they will always have a mean of 0. which corresponds to the mean. We do this when we compare laboratory test results of patients against the general (presumably healthy) population.53 –1.53 SD units.15 –0. and from z-scores back to raw scores. 1965). we don’t have to use the mean and SD of the sample from which we got the data. For the BDI score of 23: z 23 11. such as the Beck Depression Inventory (BDI. even if we transform the raw scores into SD units (or any other units).5 × 7. Just to try this out.34 1.7 10.6. and these are presented in Table 4–1.3 SD 7. with a SD of 5.0. that is. For instance. Beck et al. They each correspond to z-scores of about 1. What we can now do is to transform each of these raw scores into a z-score. A third point about standard scores is that if you take all the numbers in the column marked z in Table 4–1 and ﬁgured out their SDs. however. First. the average deviation of scores about their mean is 0. has a z-score of 0. not every set of data contains a score exactly equal to the mean. 1961) and the Self-Rating Depression Scale (SDS.3 7. this is reassuring.22. Canada is one country divided by two languages. the raw score of 9.53 (4–2) that is. so 11⁄2 SD units is (1. we can go from raw scores to z-scores.38 0 0 0.51 (4–4) The standard score z (X s X) x s (4–1) Adding a bit to the confusion. for rounding error). we can take them from another sample.0 and a standard deviation of 1. In the case of the BDI.0. A raw score of 1 coffee break a day corresponds to: z (1 9) 5. In addition to allowing us to compare against just one table of the normal distribution instead of having to cope with a few hundred tables.0 (plus or minus a fudge factor. The only problem is that the BDI is a 21-item scale. . But if we used the mean and SD derived from a 5This further conﬁrms Churchill’s statement that the United States and Britain are two countries separated by a common language. these transformations tell us that the scores are equivalent.57 1. By deﬁnition. Let’s just check these calculations. if we know the mean and SD of both scales. let’s go back to the data in Table 3–2. How can we compare a score of.11⁄2 SD units above the mean. a raw score of 23. We can do the same thing with all of the other numbers.1 z 68 52.”5 A standard score is calculated by subtracting the mean of the distribution from the raw score and dividing by the SD. we would expect the sum of the z-scores to be 0.3. When we add this to the mean of 11.0.22 1.5. if we took serum rhubarb levels from 100 patients suffering from hyperrhubarbemia6 and transformed their raw scores into z-scores using the mean and SD of those 100 scores.9.” whereas Brits and Canadians say “zed score. several different scales measure the degree of depression. Of course. because it indicates that it doesn’t deviate from the mean. we found that civil servants took an average of 9. 
their sum is 0 (plus or minus a bit of rounding error). when you convert any group of numbers into z-scores. say. with scores varying from a minimum of 0 to a maximum of 63.5 52. to check your calculations. which is (within rounding error) what we started off with. –1. whereas the SDS is a 20-item scale with a possible total score between 25 and 100.53 SD units below the mean.96 –0. To save you the trouble of looking these up. It is the same reason that the mean deviation is always 0. This also shows that if we know the mean and SD.0 coffee breaks per day. any score that is close to the mean should have a z-score close to 0. However. we’ve graciously provided you with the information in Table 4–2.THE NORMAL DISTRIBUTION 33 X z TABLE 4–1 Data in Table 3–2 transformed into standard scores Beck Depression Inventory Self-Rating Depression Scale TABLE 4–2 Means and standard deviations of two depression scales 1 3 4 7 9 9 11 12 16 18 –1. 23 on the BDI with a score of 68 on the SDS? It’s a piece of cake. or from the population. Zung. Americans pronounce this as “zee scores.
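Here is a small sketch that standardizes the coffee-break data (confirming that z-scores always end up with a mean of 0 and an SD of 1) and then applies Equation 4–1 to the two depression scales, using the means and SDs quoted in Table 4–2.

# Standard (z) scores: the coffee-break data, and the BDI-versus-SDS comparison.
from statistics import mean, pstdev

data = [1, 3, 4, 7, 9, 9, 11, 12, 16, 18]
m, sd = mean(data), pstdev(data)               # 9.0 and about 5.2
z = [(x - m) / sd for x in data]

print(round(sum(z), 10), round(pstdev(z), 3))  # sum is 0 and SD is 1 -- always

# Equation 4-1 applied to two different depression scales (means and SDs from Table 4-2):
z_bdi = (23 - 11.3) / 7.7                      # about 1.5
z_sds = (68 - 52.1) / 10.6                     # also about 1.5 -- equivalent scores
print(round(z_bdi, 2), round(z_sds, 2))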

60 0. 47.34 THE NATURE OF DATA AND STATISTICS 34.1915 .30 0. therefore. titled Area of the Normal Curve. The vast majority—95% of them—emptied between (30.9 Let’s now take a look at the numbers inside the curve. which is shown in Figure 4–3. we’ve used µ (the Greek mu) for the mean and σ (lower-case sigma) for the SD.10 1. 10That’s 11Another These properties are true for theoretical normal curves. we know that 68% of the nurses emptied between (30. “normal” means healthy. the skew is 0. The curve is symmetric around the mean.1% 34.90 1.4641 . the distribution wasn’t quite normal.90 2. and mode all have the same value. median.11 We calculated the mean to be 30.2580 .83 + [2 × 14.30 1.08) and (30. Anyone who dumped fewer was really slacking off.40 1.2% 13.08]). So. although you’ll have to take our word for this. so the discrepancy between theoretical and real normal curves bothers only the purists. 13. you should know that “purist” is one term that will never be assigned to us.1 + 13.70 0.3159 . who really cares about the area under this odd-looking curve. In mathematical jargon.83. putting this together with the numbers in Figure 4–3.4772 9By group of normal7 subjects.50 1.00 0. most of the action takes place between the lines labeled –3σ and +3σ. and if you go through the calculations. The tails of the curve get closer and closer to the X-axis as you move away from the mean.4554 . roughly two-thirds of the area (actually 68.00 1. that is. if our data are relatively close to being normally distributed. Going a bit further. you’ll ﬁnd that the SD = 14.60 1. though. Reality deviates from this to some degree. any set of real numbers will show a slight degree of skew and kurtosis.40 0.4332 . or between 16. it follows that another 34. to misquote Albert Einstein.75 and 44.1% 4. one of those precise statistical terms.6% 13. The mean.1% falls between µ and –1σ.4452 . Most importantly.08]) and (30.3413 . and the mean.1% of the area under the normal curve falls between the mean (µ) and one SD above the mean (+1σ).2257 .8 For all intents and purposes. the properties of the normal curve apply to our data.80 0. 2. For reasons we’ll discuss in Chapter 6. 3. we’ll return to those intrepid nurses and their never-empty bed pans.08.1554 . Where those numbers came from is a bit more diffi- . Notice a few properties: 1. That’s easy.2%) is between +1σ and –1σ. So. but it’s close enough.1% 2. unless we have an inﬁnite set of data points. The second question is about where those numbers came from. The important point is that.83 – 14. 5. and those who cleaned 60 or more were working harder than about 97% of their mates. how did we get these numbers? To answer the ﬁrst question. no matter how far out you go.1179 .” now.4713 .20 1. –4σ –3σ –2σ –1σ µ 1σ 2σ 3σ 4σ TABLE 4–3 A portion of the table of the normal curve z Area below 7Here.10 0.3413 . the curve approaches the X-axis asymptotically.3643 . So the normal curve can give us information about the data we’ve collected.83 – [2 × 14. the curve will eventually touch the X-axis.50 0.00 .1% FIGURE 4–3 The normal curve.80 1.2% 0.6% 0. not bell-shaped. those which exist only in the imaginations and dreams of statisticians. “There are only two things that are inﬁnite—the universe and human stupidity— and I’m not sure about the universe. If you remember Figure 2–5.0000 . 0. 34. and mode will not be exactly the same. then it’s possible that all of the patients’ z-scores would be positive. we’re ready to look at the normal curve itself. but they never quite reach it. 2. 
and second.4032 . look at Table A in the back of the book.08). median. or between about 3 and 59 pans. 8However.3849 . THE NORMAL CURVE Now armed with all this knowledge.00 1.91 (let’s say 17 and 45) bedpans.7% of the area is between the µ and +2σ. not just about some theoretical line on a piece of paper.6% of the area is between +1σ and +2σ (and between –1σ and –2σ).70 1.0398 .83 + 14. All this raises two questions: ﬁrst. because the curve is symmetric. What they tell us is that 34.6 for those of you whose calculator batteries died.0793 .20 0.10 and just slightly over 95% of the curve falls between +2σ and –2σ. The kurtosis is also 0.2881 .4192 .
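If you have Python 3.8 or later handy, statistics.NormalDist will reproduce these areas without a printed table; the bedpan figures below use the mean of 30.83 and SD of 14.08 quoted above. This is our sketch, not anything the book asks you to run.

# Areas under the normal curve without a printed table (standard library only).
from statistics import NormalDist

z = NormalDist()                     # mean 0, SD 1
print(z.cdf(1) - z.cdf(-1))          # about 0.68: two-thirds of the area within 1 SD
print(z.cdf(2) - z.cdf(-2))          # just over 0.95 within 2 SD

# The bedpan example: mean 30.83, SD 14.08
pans = NormalDist(mu=30.83, sigma=14.08)
print(pans.cdf(30.83 + 2 * 14.08) - pans.cdf(30.83 - 2 * 14.08))  # about 0.95
print(1 - pans.cdf(60))              # roughly the top 2%: harder-working than about 97-98% of their mates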

we’ve reproduced a part of it in Table 4–3. showing the percent of the area between µ and +1σ. or however.00.00) and the value of z (or σ. But this isn’t the area we’re interested in. They all mean the same thing. Table A starts at 0. How many people said this fewer than 70 times in 1 year? Again.80.0000. that the total area under the curve is 1. 2. it helps clarify in our mind the portion that we’re interested in. 3. How many people used this excuse up to 115 times? First. because the curve is symmetric. and also between the mean and –2.00 is 0.5000 rather than 0. “Not tonight.13 2. much less own.4772. There is an equation. or x/σ. This isn’t just for neophytes. be careful reading tables of the normal curve in other books. Notice that the number next to a value of z of 1. as in Figure 4–4.S. Tables in other books may refer to it as x/σ or as σ. or . So be sure to check which type of table you are using. so that an area of .3413.” 13So 14Perhaps .00 to –4.40 is .40 and +0.0000 units. Table 4–3 tells us that the area between the mean and +. us oldtimers prefer “us oldtimers. Because the total area between the extreme left and the mean is half the area of the curve. Table 4–3 has two columns.” There are a few things to notice about the table: ﬁrst.00 (we’ve given only a few values up to 2. not coincidentally. which we won’t bother you with.341 is 34. which in this case are +0. it’s the same number as in Figure 4–3.3413.80 is .00 SD is . us oldtimers do it all the time. How many people use this phrase between 106 and 112 times a year? As usual. We have also been told that the correct phrase is “we oldtimers. Last. –2 0 12Although Table 4–3 tells us that the area of the curve between the mean and +1.00 in Table 4–3). let’s start using it. any other statistics book. the area to the left of the shaded portion is (0.1% of the total area. we have to transform 115 to a z-score. Second. To really give the normal curve a good workout. This means that 34.14 Just one more for practice. so we’ll have to add the 50% of the area that falls below the mean.13% of the people use this delightful phrase between 100 and 115 times.5000. let’s return to the problem posed in Setting the Scene.00.28% of the people. we want to know the area below 70.00 (4–6) As we mentioned. we begin by changing the raw numbers into z-scores. 0.683 people. how we got the number. 1.0228. it doesn’t make sense to waste ink and paper going from 0. or 0. giving the area between mean (z = 0. but keep it in our minds.5000 – . What we do is ignore the sign. Many show the curve the same way it is here. Finally. using the format of Equation 4–1: z 115 15 100 1.12 Now. and the mean and +. It corresponds to the shaded area in Figure 4–5. how to read it.00. We “simply” solved this a few hundred times and put the results in the table.1554. and making a rough sketch (Figure 4–6). What this also shows is that it is very helpful to draw a rough sketch of the normal curve and the area that the table shows. Now.00 in the table.000.00 is . because the sign was negative. We’re interested in the area between we couldn’t begin to imagine why you would want to look at. We’ll show you how to deal with negative z values in a minute.” But.13% of 2. that is. Looking up 2. This shows ﬁrst.” has been used. birthrate is falling! a reﬂection of our increasing decreptide. as it is here.2881. we start off by converting this to a z-score: z 70 100 15 2. So the answer is 84. I’ve got a headache.00. or 1. Other books give the area to the left of z. 
the z is in SD units. it’s labeled).1 is one-tenth of a SD.00.THE NORMAL DISTRIBUTION 35 cult. +1 FIGURE 4–5 The area between the mean and a z of 2. a few tables give the area to the right of z. This is the area between the mean and +2.4772). we ﬁnd . these are easy to spot because the area equivalent to z = 0.00 (4–5) FIGURE 4–4 The area below a z of 1. dear. we use this latter ﬁgure. the table does not include negative z-scores. that can be solved to give the area between the mean and any value of σ. that’s why the U. But we’re interested in all of the people who said it 115 times and less. one labeled “z” and one labeled “Area below.00 and goes up to 4. and try to determine how many times the phrase. and second. To simplify your life yet again.
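The same few lines answer all three of the "Not tonight, dear" questions from the start of the chapter, using the reported mean of 100 and SD of 15 (again assuming Python 3.8 or later):

# The three "Not tonight, dear" questions, for a normal curve with mean 100 and SD 15.
from statistics import NormalDist

excuse = NormalDist(mu=100, sigma=15)

p_up_to_115 = excuse.cdf(115)                     # about 0.8413 -> roughly 1,683 of 2,000 people
p_under_70 = excuse.cdf(70)                       # about 0.0228 -> roughly 2.3% of people
p_106_to_112 = excuse.cdf(112) - excuse.cdf(106)  # about 0.1327 -> roughly 13.3%

print(p_up_to_115, p_under_70, p_106_to_112)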

4 . unless you plan on doing some very fancy stuff with statistics. and the like. The results were: Males Females Mean SD N 60 12 138 40 10 97 FIGURE 4–6 The area between a z of +. Based on these data. they’ll have the same names as the originals. and contain the z-transformed values.3% of the 2.80. However. First..8 How to Get the Computer to Do the Work for You Calculating z-Scores • From Analyze. What proportion of women get scores between 30 and 45? 5. Unlike the students. choose Descriptive Statistics ¨ Descriptives . Gompertz. If a male gets a score of 70. It is not the only one used in statistics. we’re not going to discuss them for two reasons. the normal curve will get you through almost everything. This ﬁnishes our discussion of the normal distribution. but if you go back and look at the data spreadsheet. often referred to as the NoSE). so why should you? EXERCISES The entire ﬁrst-year class in Billing Practices 101 takes the Norman-Streiner Test of Real or Imagined Licentiousness (the NoSTRIL.40 and +. what’s his z-score? 2. exponential.. What proportion of men get scores over 68? 6. the scores were fairly normal for both men and women. What’s the z-score for a female with a score of 35? 3. Second. we don’t know how to use them. What score demarcates the upper 10% of women? .000 people. • Select the variables you want and click the button to move them into the Variable(s) box • Click the box on the bottom that says Save standardized values as variables • Click OK You won’t see anything in the output window except summary statistics of the variables.1327. the difference is . or 13. What score for females is equivalent to a male’s score of 78? 4. . but be preceded by a Z. there are many others with names such as Poisson.36 THE NATURE OF DATA AND STATISTICS these. you’ll see new variables. ﬁgure out the following: 1.

The answer given by most students to problems such as this is. But. their Wallet. and the relationship between the binomial and the normal distributions. not each. they got the right answer. or those who had been the Body class now have moved to the Desperate). holds true now and in the future only under similar circumstances. whom we get the term masochism. “Who cares?” To us. and our Desperation. The Empirical Way Each of us. you’d have seen exactly 5 white ones and 7 black ones? A DISCLAIMER Open up just about any other book on statistics.. What is the probability that. Probability deals with the relative likelihood that a certain event will or will not occur. It usually consists of examples such as the one above in Setting the Scene. if you took out 12 marbles (replacing each one after you took it out). rather than getting bogged down in philosophical discussions. the probability of survival for a cancer patient is based on the known survival 1That’s combined. we put down how we did (allowing for some degree of poetic license). Much of what’s covered in such chapters is undoubtedly of great interest to those who are so inclined. 37 . The key point is that the probability. under the inﬂuence of the same drugs). the binomial theorem. We can derive probabilities in one of two ways: empirically or theoretically. based on past performance. has probably asked out for a date a number of people of the opposite (or the same) sex. let’s say we can categorize our askees into four mutually exclusive types based on what it was that ﬁrst attracted us to them: their Body. Instead. The classic example of empirically derived probabilities is the tout sheet sold for horse races.CHAPTER THE FIFTH This chapter introduces the basics of probability theory. To keep things nonsexist and simple. The two of us have been messing around with statistics for a total of about 70 years now. and you’ll ﬁnd a long section on probability theory. For instance. ridden by the same jockey (and these days.g. assuming nothing has changed. now we want to look back and see how we’ve done. trying to solve problems in probability theory. in our youth (or second childhood). possibly as a guide to trying out our old skills. under the same track conditions. What the Percent success column tells us is how well we did in the past with each of these four types and gives us the probability of success in the future. Almost all of the probabilities we encounter in the health ﬁeld are derived empirically. 2From What Do We Mean by “Probability”? This is not as easy to answer as it sounds. we haven’t been able to see our toes for the past decade. The odds they give (another way of expressing probabilities) are based on how well the horse did in races of the same length. we’ll rely for now on your intuitive understanding. their Mind. If the circumstances have changed (e. but is of little direct value to the clinician. this chapter gives you what we think are the necessary survival skills to understand and deal with probabilities in situations you’re likely to encounter. We’ve been accepted by some and rejected by others. then the probabilities no longer apply. relative to some other events. anybody who wants to ﬁgure out the correct answer to this and other such problems should be reading another book (we would recommend anything by the Count Sacher von Masoch2).1 and we can’t remember when we’ve ever had to ﬁgure out a problem like this—except when we were wading through statistics texts. 
Probability SETTING THE SCENE Imagine you have an urn with 73 white marbles and 136 black marbles. In Table 5–1.

30 60. conditional on the person having being chosen for his or her body (X). or of rolling a 3 at roulette. if you have one.000 times to ﬁnd this out. If a person has some chest pain. then the probability that the sum will be 5 is one in six. or hitting a certain number. the probability of rolling a 3 on this one toss of the die is one in six. was . our overall. respiratory acidosis and alkalosis are mutually exclusive events. 3 and 2. it is necessary to differentiate between two types of events: those that Mutually Exclusive Events and the Additive Rule To illustrate the difference between mutually exclusive and conditionally probable events. we’ve all heard that the life expectancy of a person is somewhere around 74 years. such as gender. Turning to more mundane examples. and the ﬁrst die comes up a 1.” MUTUALLY EXCLUSIVE AND CONDITIONALLY DEPENDENT EVENTS To understand probability theory. the probability of a 5. a free bit of information. That is. race.1667.00 Two events. On the other hand. However. heads and tails are mutually exclusive in that. let’s assume . The Theoretical Way When we’ve gotten tired of losing our money on the nags and want to lose it some other way. lotteries pay. 3What do they tell medical students in Kenya. that is. Two events.6 All of these calculations are based on our knowledge of the likelihood of occurrence of various chance events. However.30. of drawing an inside straight. these empirical probabilities go out the window. 5As 6And 7Except in Chicago. such as admitting patients at an earlier stage or changing what we do to them. Consequently. and 4 and 1). it doesn’t necessarily mean that the person can’t have reﬂux at the same time. the tail side won’t. cardiac disease and esophageal reﬂux are not mutually exclusive. or X depends on Y. four of which yield a 5 (1 and 4.1667. Body Mind Wallet Desperation TOTALS 10 12 5 23 ___ 50 3 5 1 21 ___ 30 30. the probability of throwing a 5 with a single toss of two dice is 11. The difference between mutually exclusive and conditionally dependent events is important because we have to ﬁgure out probabilities differently for each of them.1 years for white females born in 1980 as opposed to 70. it’s more likely (or probable) that the patient has a more common disease than a rare one. The empirical way is also the basis for the old diagnostic dictum that if you hear hoof beats. Women live longer than men. 2 and 3. and that’s what we’ll be concerning ourselves with in this chapter. if we throw the dice one after the other.5. or 60%. Returning to the gaming tables.7 years for white males born in the same year. are mutually exclusive if the occurrence of one precludes the occurrence of the other. We want to emphasize that all of this is to make you better clinical researchers. and vote often. or 16. 78.00 41. conditional on the ﬁrst die being a 1. we can always shoot craps at Las Vegas or Atlantic City. stating the pay-offs for various throws. But this doesn’t tell the whole story. it’s more likely to be coming from a horse than from a zebra. has any casino owner ever gone bankrupt? the second bit of free information: The payoff at the tables is light-years better than in state or provincial lotteries. the voter may have voted for one party for some offices and for another party for other offices. where they adhere to the motto. or .00 91. Closer to home. how a person voted in the last election for all candidates may not be mutually exclusive. you can’t simultaneously have the other. “Vote early. 
The simplest example of this is ﬂipping a coin.67 20.67%. “It’s more likely coming from a zebra than a horse”? knowledge of such matters is derived solely from movies and the reports of others. they’re ﬁgured out based on the theory of probability. Which party a person votes for in an election for a speciﬁc office is a mutually exclusive event.4 The odds given for rolling a 7 or 11 on the ﬁrst throw.3 Analogously. 4Our rates of similar patients who have the same stage of disease and have undergone the same treatment regimen. and year of birth. or 30%. We can get even fancier and calculate the probability of getting a 10 with one roll of two dice. Casinos pay an average of 50 to 80 cents on the dollar. Our deﬁnition also assumes that if any of these factors change. With the exception of Donald Trump.7 However. X and Y. Using the example in Table 5–1. is .38 THE NATURE OF DATA AND STATISTICS TABLE 5–1 Our batting average Attraction Number asked Number accepted Percent success are mutually exclusive and those that are conditionally dependent. you’ll have noticed that the craps tables are covered with a green baize cloth. no to lead you down the road of corruption by making you better gamblers. So the probability of a person living to be 80 is conditional on several factors. we don’t have to do this experiment 1. not from personal experience. If you’ve ever been there. are conditionally dependent if the outcome of Y depends on X. X and Y. at most. our hit rate with Bodies was 30%. horses are more common here than zebras. the payoff is always less than the calculated odds. are not based on the experience of the croupier.11%—there are 36 possible combinations. which is the essence of probability theory. 10 cents on the dollar. our success (Y). and all the ﬁgures are about 3 years more than for people born in 1970. success rate was 30 out of 50. and vice versa. Black people’s life expectancies are about 7 years less than this. Each of the six sides has an equal likelihood of ending up on top. To take a simple example. and the ECG conﬁrms the presence of an infarct. if the head side appears. and which one actually appears is a purely random event. let’s roll a die. or unconditional.

However. then the probability that both will occur is the probability of X times the probability of Y. One way to answer this question is to redraw the table. If there were a total of 100 patients on our ward.. For obvious reasons. That is. there is a 40% probability that the person has either CTP or IHS. say. but this will work with any number. In statistical parlance. these conditions don’t occur with equal frequency.60 × . 80% of them are male. a male with ISD) is . 30 would suffer from IHS.80. and we don’t have to make up a table such as Table 5–3. or 48%. looking at the patients from the other perspective. we know from experience that ISD is more common in males.10 .30. We’ve based this on having 100 patients so that we’re working with whole numbers. as we’ve done in Table 5–3. as opposed to 30% from IHS and 60% from ISD. multiplying them or taking their square roots)? It may help if we think for a moment in terms of bodies instead of proportions. Needless to say. of the 60 patients who have ISD. a person with ISD can be either male or female. Now. and. by looking only at the row and column labeled Total. and.60. So. and only 10% of our patients suffer from this. . the next person through the door has either CTP or IHS) if he or she were 1 of the 10 from the ﬁrst group or 1 of the 30 from the second. the proportion of males and females is different for each disorder. and 30% of IHS patients are men..30 . The probability of having ISD is 60/100. this is called the additive law. So the condition would be satisﬁed (i. a 60% probability that he or she has ISD. Thus the probability is . by extension. then the probability of X or Y is the probability of X plus the probability of Y.60 50/50 30/70 80/20 Disorder Males Females Total TABLE 5–3 Actual number of patients with the three disorders Cryptogenic tinea pedis Idiopathic hangnail syndrome Iatrogenic systemic degeneration TOTALS 5 9 48 62 5 21 12 38 10 30 60 100 If X and Y are mutually exclusive events. so if the patient has one. is 48/60. and 60 from ISD. So. CTP is relatively rare.PROBABILITY 39 that the unit we work on admits only those patients with one of three mutually exclusive disorders: cryptogenic tinea pedis (CTP). We see that 48 of the 100 patients on our ward are males with ISD. the same law holds with as many mutually exclusive events as we want. these are given in the third column of Table 5–2. Using this technique of multiplying probabilities means that we can ﬁgure out the conditional probabilities by simply knowing the individual probabilities that certain events will happen. which is exactly what we got before. given X has occurred.e. “cryptogenic tinea pedis” means “athlete’s foot of unknown origin. or . and 60 would not. and iatrogenic systemic decompensation (ISD). in other words. We can get at the answer another way.” If X and Y are conditionally probable. This is a case of conditional probabilities because the probability that the patient has the diagnosis is conditional on the probability that the patient is a male (and vice versa). We multiplied in this case because we’re looking at a part of a part.8 idiopathic hangnail syndrome (IHS). We know from Table 5–2 that 80% of patients with ISD are male. that is. the probability that both events occur together (i. giving the number of males and females with each of the diagnoses. However. what is the probability that the next person through the door has either CTP or IHS? These are mutually exclusive events. Why do we add the probabilities (rather than. 
10 would have CTP. some of the people are male (the others are female). we’re not limited to just two events. Conditionally Probable Events and the Multiplicative Law Now.10 plus . 50% of patients with CTP are male.80. Put into formal jargonese: Pr (X or Y ) Pr(X ) Pr(Y ) (5–1) where “Pr” is statistical shorthand for probability. he or she can’t have the other. so the answer is that there is a 48% chance that the next patient admitted to our ward will be a male and have ISD. whereas the probability of being a male.e. So this rule reads: 8For those of you who are not ﬂuent in Latin and Greek. We can summarize what we’ve said by the additive rule: Disorder Relative frequency Male/female ratio TABLE 5–2 Relative frequencies and gender differences for three disorders Cryptogenic tinea pedis Idiopathic hangnail syndrome Iatrogenic systemic degeneration . or . 40 of the 100 patients would satisfy the condition. some of the total have ISD (the remainder have the other disorders). let’s change the question a bit. Moreover. we say that we’re looking at the marginals. if the diagnosis ISD. What’s the probability that the next patient will be a male and have ISD? These are not mutually exclusive events.
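If you prefer to let the computer keep the bookkeeping straight, here is a small Python sketch of the same calculations (our addition; the dictionary names are made up for illustration). It applies the additive law to the mutually exclusive disorders and the multiplicative law to the conditional male/female splits from Tables 5-2 and 5-3.

# Relative frequency of each (mutually exclusive) disorder on the ward,
# and the proportion of males within each disorder (Tables 5-2 and 5-3).
p_disorder = {"CTP": 0.10, "IHS": 0.30, "ISD": 0.60}
p_male_given = {"CTP": 0.50, "IHS": 0.30, "ISD": 0.80}

# Additive law (mutually exclusive events): Pr(CTP or IHS) = Pr(CTP) + Pr(IHS).
p_ctp_or_ihs = p_disorder["CTP"] + p_disorder["IHS"]          # 0.40

# Multiplicative law (conditional events): Pr(ISD and male) = Pr(ISD) x Pr(male | ISD).
p_male_isd = p_disorder["ISD"] * p_male_given["ISD"]          # 0.48

# Pr(CTP and female) = Pr(CTP) x Pr(female | CTP).
p_female_ctp = p_disorder["CTP"] * (1 - p_male_given["CTP"])  # 0.05

print(p_ctp_or_ihs, p_male_isd, p_female_ctp)

The printed values, 0.40, 0.48, and 0.05, match the figures worked out by hand in the text.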

What’s the probability that the next number will be black? The “gambler’s fallacy” is thinking that the sixth roll is conditional on the previous ﬁve. given that the patient has ISD. Our result (311. given a diagnosis of CTP (.e. Let’s say you’re back in the casino. for example.” which is deﬁned as: n! = n × (n – 1) × (n – 2) × … × 1 (5–4) Independent Events Many events are neither mutually exclusive nor conditionally probable. However.875.875.” Just for practice. given that ISD is present [Pr(Y | X)]. and we’re considering all of them to be the same. how many different hands can be dealt? For the ﬁrst card. You’ve seen that the last ﬁve numbers have all been red. we quickly realized that this is more likely an example of mutually exclusive events. but GRN is an only child. rather than permutations.” in this case. So.50). it’s been rumored that casino owners’ dreams are ﬁlled with fantasies of having a room full of people who believe in the gambler’s fallacy. thus. which in our example is: P552 = 52! = 52! = 311.11 and all of the other possible arrangements of these particular ﬁve cards. that after a long run of red the probability of a black is higher. which is n!. 3♥. there are then 51 possibilities left for the second card. there are far fewer combinations than permutations. from our perspective as a gambler. it means “n factorial. the outcome is not conditional on the previous run. which is written in statisticalese as: Pr(X and Y ) Pr(X) Pr(Y | X) (5–2) for the ﬁfth.40 THE NATURE OF DATA AND STATISTICS It goes without saying that this is referred to as the multiplicative law. and the probability of black is 50% (ignoring the 0 and 00 slots for the moment). same. the total number of different hands is: 52 × 51 × 50 × 49 × 48 = 311. which in this case is 52. and 48 As you can see. red and black have the same probability of appearing. and r is the number of objects that we’re looking for. was how it was deﬁned by the sister of one of the authors. then 50 for the third. you’ll see that 2. or 5%.200 (52–5)! 47! (5–6) 10Which 11“The which. assuming the wheel is honest. which is 5. and so on. 5♥ followed by J♠. we won’t say which one. and K♣. We’ll also need another term. you know that. J♠. let’s run through a few other examples. The probability that the patient is a female with CTP is the probability of CTP (.. 2♠.598. “The probability that the patient has ISD [X] and is a male [Y] is the probability of having ISD [Pr(X)] times the probability of being male. so half the numbers should turn up red and half black.875. thank the powers that be. After this card is gone.200) is the number of different ways ﬁve cards can be dealt. the number of cards. exactly what it would be if the previous ﬁve rolls had also been black. Now. it doesn’t “know” what the previous results were. standing over the roulette wheel. J♠. being dealt 5♥. The probability of a female with IHS is 21%. they are independent of one another. rather than with images of girls from the chorus line. 49 for the fourth. However. The reason is that there are 5! (i. That is.875. the probability of being a male. 5♥.200 (5–3) where the symbol Pr(Y | X) means the probability of Y given X. we’re dealing with combinations. The symbol n! does not mean emphatically n10. ignoring the order. So. If we’re playing stud poker. 120) ways a particular hand can be dealt.200. and 2♠ is the same to us as 3♥. So. To avoid embarrassment. meaning a terrible hand.598. translating this equation from statistics into English. 
so as to make the overall proportion of reds and blacks closer to 50%. you ﬁgure it out for yourself. But. let’s deﬁne a few terms and symbols. First. When the order doesn’t matter. in our example. how many different hands can be dealt? The answer is: Crn = n! = 52! = 2.10) times the probability of being female. Thus. To make our life a bit easier.9 A problem arises when events that are independent of one another are mistakenly assumed by some people to be conditional. is the same as what we got in Equation 5–3. they are independent events. If you want to do the math. we don’t care about the order in which the cards were dealt. and discuss the general rules: n is the total number of “objects” in the set. So the number of permutations of n objects taken r at a time is: Prn = n × (n – 1) × (n – 2) × … × (n – r + 1) = (5–5) 9We were going to use the example of ﬁtness for office and actually being elected to office. K♣. it reads. the ball does not have a memory and has never studied probability theory. hence that term r! in the denominator of the equation for combinations (Equation 5–7) that doesn’t appear in the equation for permutations (Equation 5–5). where ﬁve cards are dealt to each person.960 r!(n–r)! 5! 47! (5–7) Combinations and Permutations As long as we’re in a casino. that term permutation. Time now for a couple of deﬁnitions. let’s see what other valuable lessons in life we can learn. there are 52 possibilities. to recapitulate: .960 is 1⁄120th of 311. However.
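For those who would rather not multiply five numbers together by hand, the standard library will do the counting. This is our own sketch (assuming Python 3.8 or later, where math.perm and math.comb exist); it reproduces the poker-hand figures and the relationship between permutations and combinations.

import math

# Permutations: ordered 5-card deals from a 52-card deck, 52!/(52 - 5)!
perms = math.perm(52, 5)    # 311,875,200

# Combinations: unordered 5-card hands, 52!/(5! * 47!)
combs = math.comb(52, 5)    # 2,598,960

# The two differ by exactly 5! = 120, the number of orders in which
# any particular 5-card hand could have been dealt.
assert perms == combs * math.factorial(5)
print(perms, combs)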

the probability of at least one false positive out of 12 is: 1– .” Answers: a. They all express the probabilities of dichotomous events. given false-positive rates of 1%. So. If those six numbers could occur in any sequence. if you order an SMA 12 on a completely healthy person.045125 .95) (. a lab test from a perfectly normal person comes back labeled “abnormal”. c.857375 The eight possible outcomes.95) The Law of “At Least One” Let’s assume that 5% of the time.95) (.4 0. returning to our SMA 12 test examples. Bet you’ll never play again!12 Test 1 Test 2 Test 3 Probability TABLE 5–4 = = = = = = = = .0 minus the probability of the last line (N-N-N).95) (. They’re all correct. 1979) have in common? Circle the correct answer: “Any wire cut to length will be too short. see later why this is a fairly safe assumption to make.95) (. just for fun. and 10%.05) (. and the numbers had to be in a speciﬁc sequence. each of which has a 5% chance of yielding a false-positive result. you’ve also learned a lesson in clinical care: Don’t order more tests than you really need! 12Here. What we have done is turn things around.95) (. not factorial.9512 = 45.8 0.002375 . ﬁgure out the odds of winning. Just to drive the point home. 5%. if we tested the serum rhubarb level of 100 eurhubarbic individuals.000125 .05) (. it will be the issue that contained the article. d.0. Consequently. All of the above. with combinations.0 × . or installment you were most anxious to read. it’s always uphill and against the wind.045125 .05) (. these various alternatives. it increases with the number of tests done.PROBABILITY 41 With permutations. at least one of the 12 test results will be false positive? To make things simpler. Number of tests .05) (.or hyporhubarbemia. the probability of any test or combination of tests being positive includes all but the last line (N-N-N).95) (. of three tests with false-positive rates of 5% P P P P N N N N P P N N P P N N P N P N P N P N (. we are interested in the order in which the events happen. 13We’ll THE BINOMIAL DISTRIBUTION Question: What do these statements (taken from Bloch.0 0. We are saying that the probability of “at least one event” is the complement of the probability of “no events”. The Ontario 6/49 is the latter.05) (. that is. We can add up all of the lines that have a P. and then subtract this number from 1.13 What is the probability that.6 0.95) (. the probability that there is at least one P is 1. and 10%.002375 . are given in Table 5–4. You can see that changing the false-positive rate moves the curve up or down.05) (. Figure 5–1 shows the probability that at least one test will be abnormal in a perfectly healthy individual.05) (.” “If you miss one issue of any magazine.95) (.05) (. and their probabilities. The sum of all eight probabilities has to be 1. ﬁve people would have results that indicate either hyper.” “Any error in any calculation will be in the direction of most harm.2 0. They’re all cynical. the “!” means emphasis. in addition to learning some stats. if each of the 12 component tests has a false-positive rate of 5%. Just to recapitulate: To ﬁgure out the probability of at least one event occurring.96% (5–9) Probability of at least one positive test False-positive rate .95) (. b.953) = .10 . Now.” “For a bike rider.002375 . We’ve shown this for three false-positive rates: 1%. we’d be dealing with a permutation. but there’s an easier way.01 For your ediﬁcation and amazement.045125 . because all of the other lines include at least one P. 
it would be a combination.95) (. 5%. the order is irrelevant.05) (.05) (. 1.0 0 5 10 15 20 25 So.05) (.05 . As you can see. let’s consider the case of a healthy person who has been given three different lab tests. that is.95) (. or (1. we ﬁrst determine the probability of no events occurring. the probability is 100% that one of these combinations will occur. with the probability that each will occur.1426.05) (. but the basic relation- FIGURE 5–1 Probability that at least one test will be positive in a healthy individual. Eight combinations of positive (P) and negative (N) results are possible. if we were participating in a lottery in which we had to select 6 numbers out of 49. that is: Pr(At Least One) [1 Pr(None)] (5–8) ship between the number of tests and the probability of at least one being abnormal stays the same. story.
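The "at least one" calculation is easy to automate. The short Python sketch below is ours (the function name p_at_least_one is just for illustration); it reproduces the 45.96% figure for the SMA-12 example and generates the kind of curves shown in Figure 5-1 for different false-positive rates.

def p_at_least_one(fp_rate, n_tests):
    """Probability of at least one false positive among n independent tests,
    each with the given false-positive rate: 1 - Pr(no false positives)."""
    return 1 - (1 - fp_rate) ** n_tests

# The SMA-12 example: 12 tests, each with a 5% false-positive rate.
print(p_at_least_one(0.05, 12))    # about 0.4596, i.e., 45.96%

# The relationship plotted in Figure 5-1, for three false-positive rates.
for rate in (0.01, 0.05, 0.10):
    print(rate, [round(p_at_least_one(rate, n), 3) for n in (5, 10, 15, 20, 25)])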

there are 210. On each trial.50 = . If we let the kid try to put his shoes on once. one of which is the combination WW. Now. Let’s start off with the easiest case. or it won’t be. The only difficult part is calculating the factorials. so that by the time we reach 10 trials. Consequently. is ﬂipping a coin and seeing how many times it comes up heads in 10 ﬂips. It’s easy enough in this instance to ﬁgure out the probability of getting it wrong both times: there are four equally possible outcomes.50 × . this one comes in quite handy. For each of these outcomes. the probability of getting it wrong on the second try. which is called the binomial expansion. It doesn’t apply if the kid does know. as it is on the second (i.25. You would expect that if they didn’t know right from left. the normal curve describes how a continuous variable (such as blood pressure or IQ) would be distributed if we measured it in a large number of people. The usual example. many pocket calculators can do it for you. The other way to ﬁgure it out is to use the multiplicative law: the probability of W on the ﬁrst try is . and the attempts are truly random. or it won’t be too short. right or wrong. We can also write Equation 5–10 as: n r (n pq r r) (5–11) writing: n because the term r n! r!(n – r)! is simply a shorthand way of (5–12) 15This assumes the kid really doesn’t know right from left. This is not what happened. Although we’re trying to avoid equations as much as possible. . or 1. we’ll avoid that example assiduously and stick with a kid putting on his shoes. but actually they’re not hard to handle.15 If there are two attempts at getting shod. and (4) W on the ﬁrst and R on the second. that is. but does it wrong to get you annoyed.14 What is the Binomial Distribution? As you no doubt recall. such as a diastolic pressure 95. there’s an easier way to ﬁgure things out. but that can be used to both describe and give us the probabilities for dichotomous events. We’ll have to deﬁne a few more terms in addition to the ones we use in discussing combinations and permutations: p is the probability on each try of the outcome of interest (0. but nowadays. and q is (1 – p) Now.5 in this example) occuring.53 = . The ﬁrst part should look familiar. where each of the two values is equally likely. However. yielding the four different patterns we just discussed. conditional that the ﬁrst try was wrong). 10. if you prefer. we could ask the question: If a kid puts his shoes on 10 times. the correct answer is d. We could do the same thing for 3. “All of the above. there are two possible results for the second try—again. they’d get it wrong only half the time. These equations may look fairly scary.42 THE NATURE OF DATA AND STATISTICS 14Or. we have such an animal. the formula for the binomial expansion is: n! p r qn r! (n r) ! r (5–10) The binomial distribution shows the probabilities of different outcomes for a series of random events. is there some way to tell how often this deviation from chance would be expected to occur? Again. so bear with us.1172 (5–13) So. we’d quickly get bogged down. and put their shoes on at random. Putting the numbers from our example into Equation 5–10 gives us: 10! 7)! 7! (10 10! 7! 3! .” I was ﬁrst introduced to this apparent breakdown of the laws of probability when my kids were small and learning to put on their shoes. the probability of W on both trials is . of course there is. what’s the probability that he will get it wrong on exactly 7 of those tries? 
If we tried to solve this by making a table of the possible outcomes.5(10 7) = . (2) W on both tries. (3) R on the ﬁrst and W on the second. it’s called the binomial distribution. What we would like to have is something equivalent to the normal distribution.50. but these methods are laborious.. Not surprisingly. it seemed that they put their left shoes on their right feet at least 89% of the time. it doesn’t apply about 97% of the time. if he were really putting the shoes on at random. used in every other textbook. each of which can have only one of two values. On the ﬁrst try. then the possible outcomes are (1) R on both tries. the examples we just gave are not continuous. For that reason. which is what we got before.57 .024 possibilities. it’s the formula for the number of combinations of n objects taken r at a time (Equation 5–7). there are two possible outcomes: right (R) or wrong (W). but have only two possible outcomes: The wire either will be too short. What we’re dealing with here is called the binomial distribution. there are two possible results—right or wrong.57 . so the chances are 1 in 4. In case you didn’t know. and so on. contrary children —your choice. However. the number of possibilities doubles. For example. a give-away question.e. the probability is just under 12% that the kid would get it wrong 7 times out of 10. each of which should occur 50% of the time. or 100 tries. the missing issue either will be the one containing the last installment of the mystery story. The curve can also be used to give us the probability of a given event.
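Equation 5-10 is easier to trust once you have checked it by machine. Here is a minimal Python version (our sketch; binom_pmf is a name we made up) that reproduces the shoe-on-the-wrong-foot probabilities.

import math

def binom_pmf(r, n, p):
    """Binomial probability of exactly r 'successes' in n independent trials,
    each with success probability p (Equation 5-10)."""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

# A kid who truly can't tell left from right: p = 0.5 on each of 10 tries.
print(binom_pmf(7, 10, 0.5))    # about 0.1172, just under 12%

# Probability of getting it wrong 7 or more times out of 10: add up the tail.
print(sum(binom_pmf(r, 10, 0.5) for r in range(7, 11)))    # about 0.1719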

In Figure 5–2.2 but increase n from 15 to 30. we’ve dealt with situations that have a 50:50 chance of happening.0098 (5–15) and for 10 out of 10 10! 10! 0! . or just over 17%.1719. Putting these into the equation gives us: 15! 5! (15 – 5)! . then r = 1. We already ﬁgured out the probability of 7 out of 10.3. each patient can be thought of as one trial)? In Figure 5–3.20.00 0 5 10 15 FIGURE 5–3 Changing p from . given 15 patients and an incidence of .10 0. .5(10 9) 0.8(15 5) 0. You would expect that the average number on the ward at any one time would increase (30% of 15 = 4. 8 out of 10 looks like: 10! 8! (10 8)! .2 Probability 0.25 n = 15 p = . 8. It also looks as if the data are spread out some more. 3 infected patients would be on the unit simultaneously (i. most of the time.20 Probability 0.80. what’s the probability that 5 of them will develop an infection from the hospital? In this case. we would again expect a shift to the right.32%. the binomial expansion has allowed us to ﬁgure out that the kid has a 12% chance of putting his shoes on wrong in 7 out of 10 tries and a 17% chance that he’ll get it wrong 7 or more times out of 10.50 = .05 0. Number of nosocomial patients = .2 and n = 15.16 the data behave just as we predicted. What’s the probability that he does it wrong at least 7 times out of 10 (instead of exactly 7 out of 10)? This means getting it wrong 7.0439 (5–14) 9 out of 10 is: 10! 9! (10 9)! . Notice that the distribution isn’t quite symmetrical. What we’ve learned in this section is how to extend the binomial expansion beyond the case where each alternative has a 50% chance of occurring to the more general situation where the two outcomes have different probabilities. and there’s a bit less skew.2.3 0. So.2 to 0.1032 (5–17) So the probability that 5 of the 15 patients will develop a hospital-acquired infection is 10.3.15 0.5).25 0. it’s skewed somewhat toward the right. 9. For example. up through r = 15.58 .2 to . we would expect that. This was done using Equation 5–5 by setting r = 0.15 0. but we’re not limited to this.59 . If we have 15 of these hapless abdominal surgery patients on our wards. let’s say that the bug committee at the hospital has really been effective and has knocked the incidence of nosocomial infections down to 20% following abdominal surgery. we’ve kept n at 15. If we keep p at . Learning a Bit More About the Binomial Distribution Staying with this example for a minute. To calculate the cumulative probability of any of these outcomes. and sure enough the graph has shifted to the right a bit. So far. p = . 16Virgil.510 . p = . how many people with nosocomial infections would we expect to see on our 15-bed unit? It is almost intuitive that. 20% of 15).00 0 5 10 15 FIGURE 5–2 The binomial distribution for n = 15.25 .PROBABILITY 43 Now. with an expected average of 6 (Figure 5–4). we ﬁgure out the individual probabilities and then add them up.0010 (5–16) Adding these up gives us .05 = . or 10 times out of 10 trials.10 0. shows the binomial distribution for p = . let’s get a bit fancier.. and r going from 0 to 15.5 (10 8) 0. Mirabile dictu. and q = . there seems to be a greater spread in the scores.20. Next. 17 BC (personal communication). then. This ﬁgure. Number of nosocomial patients = . What happens when we change the probability and the number of trials (in this case. but we changed p from 0. n = 15. r = 5.e. and again.20 n = 15 p = . we’ve plotted the probabilities of having anywhere between 0 and 15 nosocomial patients on the ward at the same time.
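The same expansion handles the nosocomial-infection example, where p is 0.2 rather than 0.5. The sketch below is ours (not the book's); it reproduces the 10.32% figure and shows how the peak of the distribution tracks n times p as p or n changes, which is what Figures 5-2 through 5-4 illustrate.

import math

def binom_pmf(r, n, p):
    # Probability of exactly r events in n trials (Equation 5-10).
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

# 15 abdominal-surgery patients, 20% nosocomial infection rate:
# probability that exactly 5 are infected at the same time.
print(binom_pmf(5, 15, 0.2))    # about 0.1032

# The distributions behind the figures: raising p (0.2 -> 0.3) or
# n (15 -> 30) shifts the peak to the right, toward n * p.
for n, p in ((15, 0.2), (15, 0.3), (30, 0.2)):
    dist = [binom_pmf(r, n, p) for r in range(n + 1)]
    print(n, p, "expected =", n * p, "most likely count =", dist.index(max(dist)))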

In Figure 5–5. and n = 20.5 (we haven’t shown that. n = 10. We learned in the previous chapters how to ﬁgure out the mean.5. let’s stick with FIGURE 5–5 The binomial distribution for n = 5.15 0. As we would expect from the graphs. Let’s pursue this a bit further.00 0 5 10 15 Number of nosocomial patients So let’s summarize what we’ve seen so far.5.5. To illustrate how we can use the normal distribution to approximate the binomial one. The Binomial and Normal Distributions 0.5. the graph becomes more symmetric. n=5 n = 10 n = 20 . Third.20 0. the graph is perfectly symmetric. we can use the normal curve when n is as low as 10.2 Mean = np (5–18) Probability Variance = npq (5–19) SD = npq (5–20) FIGURE 5–4 Keeping p at . we don’t have to worry about using Equation 5–10 to ﬁgure out probabilities.5.5. We can do the same for binomial data. In fact. SD. the distribution is skewed to the right. if we’re dealing with binomial distributions where n is 30 or more. but trust us). it’s skewed left when p is greater than . the greater the variability in the scores. there isn’t just one “binomial distribution”. What we have. your eyes don’t deceive you. then. The left graph is for n = 5. First. Second. which you remember is 1 – p).5. there’s a different one for every combination of n and p. however.25 0. When p is less than .2. the worse the approximation to the normal distribution. and the right part shows n = 20. as n increases. When it is exactly equal to . and changing n from 15 to 30.5. What this means is that.44 0. the binomial distribution looks more and more like the normal distribution. by the time n = 30. it looks as though increasing the sample size with the same value of p makes the graph seem more normally distributed. so using the normal curve only when n is at least 30 is fairly safe. with p = q = .10 0. when p = . the closer p is to . as p gets closer to . the middle shows n = 10. and thus numerically describe the properties of the binomial distribution that we just saw graphically. and variance of continuous data. As you can see. the ﬁgure is virtually indistinguishable from the normal distribution. we can approximate the binomial distribution by using the normal curve. is: If we go back and compare Figure 5–2 with Figure 5–4. these properties depend on n and p (and therefore also on q. the more p deviates from . the graph looks more and more normal as n increases.05 THE NATURE OF DATA AND STATISTICS Properties of the Binomial Distribution n = 30 p = . Yet again. we show a binomial distribution with p = q = .

(By the way.0 1.5 3.2 = 3.17 The next step is to convert these two numbers (4. 2 people.1% for pulling a black one.5 to 5. and 10% are wheat germ addicts. and one with exercise. p 12! 7! 5! . 3 people? 4. we have to consider the discrete value of 5 people as actually covering the exact limits of 4. then: a. Consequently. we get: z1 5.5) to standard scores.2 . If we choose three people at random.0 1.55 1.3340. the probability of drawing a white one is 73/209 = 34%.3495 .PROBABILITY 45 17We’ll the example of patients who leave the operating room minus an appendix but with an infection.61 = . and is 65. which is easier to use. the mean is 15 .97 = . what is probability that it will rain every day? 5. p deviates from .5 or 5. 1990). we’ve looked at the nature of probability. 2 people.5 people and simply remind you that we did the same thing in Chapter 3 when we were discussing the median. it will require about 30 space shuttle ﬂights to build a proposed space station. If you discard your other two cards. Following that ﬁasco with bridge. . and you’re holding three aces. What proportion both jog and eat health food (call them Type 1)? What proportion jog but don’t eat health food (Type 2)? Eat health food but don’t jog (Type 3)? Neither jog nor eat health food (Type 4)? b.5 quite a bit and n is less than 30.5 and 5.1123. and We just thought you’d like to know. They state that. especially considering that in this case. Plugging these values into the equation. Two health trends have swept the country over the past few years: one concerned with diet. in more depth than you cared to go. If you go camping for 3 days. and its standard deviation is n p q . and you can’t sit next to your spouse)? b. If 40% of people exercise.55.4463. Because there are 73 white balls and 136 black ones. How many different sets of four people can there be? c. using the formula we encountered in Chapter 4. with the stipulation that the ﬁrst person of the pair chosen plays black? .8 = 1.97 z0. Now. the binomial distribution shades into the normal one. one difference between the normal and binomial distributions is that the former is intended to be used with continuous variables (those which can assume any value between the highest and lowest ones). meaning that the probability of ﬁnding ﬁve nosocomial patients on the ward at the same time is about 11%. The difference between them is .6517 (5–22) ignore the fact that no one but a gross anatomist has ever seen 4. but you have only one deck of cards. and the latter with discrete variables. a.61 z2 4.0. According to the weather report. What is the probability of ﬁnding either Type 1s or Type 3s in a sample of 1 person. This means that in our case. What is the probability that none of the three people will be addicted to these behaviors? d. How many ways can the six of you be arranged around the table (ignoring “rules” that say men and women must alternate. This approximation isn’t bad.5. So.) EXERCISES 1. people decide to pair off to play chess. We also saw that when n is over 30.1032. How did they get this ﬁgure? 2. 18Possibly (5–21) We look these two numbers up in a table of the normal distribution and ﬁnd that z1.55 0.32%. in that people who keep a healthy diet are no more or less likely to exercise than those who don’t eat raw ﬁsh and Granola bars. people want to play bridge. and the SD is 15 . What is the probability of ﬁnding only Type 2s in a sample of 1 person. 3 people? e. How many pairs can there be. There’s a pot on the table of $750. 
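Here is the normal-approximation calculation done in Python (our sketch, using Python 3.8's statistics.NormalDist and the continuity-corrected limits of 4.5 and 5.5 described above); small differences from the printed figures are only rounding in the published z table.

import math
from statistics import NormalDist

n, p = 15, 0.2
q = 1 - p
mean = n * p                  # 3.0, from Equation 5-18
sd = math.sqrt(n * p * q)     # about 1.55, from Equation 5-20

# Normal approximation to Pr(exactly 5 infected patients): treat the
# discrete value 5 as the interval from 4.5 to 5.5 (continuity correction).
z_hi = (5.5 - mean) / sd      # about 1.61
z_lo = (4.5 - mean) / sd      # about 0.97
approx = NormalDist().cdf(z_hi) - NormalDist().cdf(z_lo)
print(round(approx, 3))       # about 0.11, versus the exact binomial 0.1032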
what’s the probability of drawing that fourth ace? 3. and we’ll ﬁgure out how likely it will be that we’d have ﬁve such people on our unit at one time. that should read “beliefs”) are independent. Assume that these two fads (oops. the probability of rain is 10% each day for the next 7 days. the answer to the problem of 7 black and 5 white balls drawn from the urn is 20. Remember that the mean for a binomial distribution is np. After dinner. and explored18 ﬁguring out probabilities of events with two outcomes. According to the Office of Technology Assessment.5 3. what is the probability that all will be health food addicts? c. even if the reliability of the shuttle could be increased to 98% there is an 8in-9 chance that a shuttle will fail while building the station (Friedman. You and your signiﬁcant other are having two other couples over for dinner. it’s fairly close to what we found before. RECAP In this chapter.

whereas Norman is 6’5”. modern pharmacology has come to the rescue yet again. or look for an association between two variables. unless you take the role of chance into account. statistics are the inﬂation rate and the Dow Jones averages. a bit on the short side. life tables. you can reach new lengths of satisfaction. we guarantee that your estimate will be in trouble. What we spend the most time on is the stuff of inferential statistics: t-tests. the only statistic of interest is 900 – 600 – 900 (36 – 24 – 36 before metric). the answer is simple. with all the libido in the world. It would be difficult and unfundable to try to measure all of us. If you pick us both. BASIC CONCEPTS When you approach the average man on the street and ask what statistics means to him. if you got Norman. even though they are drawn at random from the population (more on this later) of all individuals of interest. Scientiﬁc studies have proved it. in research. you’d be too high. As if that wasn’t bad enough. If he is less than 30. However. If you wanted to 46 . Inferential statistics are used to determine the probability (or likelihood) that a conclusion based on analysis of data from a sample is true. it’s vital statistics and mortality rates that count. One man’s pleasure … As male baby boomers age. uh. some differences or association will be present purely by chance. Any measurement based on a sample of people. So. The ﬂy in the ointment that leads to all sorts of false conclusions and keeps all us statisticians employed is random error. “apparatus. they have every reason to become increasingly preoccupied with their. the type we discussed in Chapter 3. If you were unlucky enough to get one of your esteemed authors in your sample (a good possibility with a sixpack for bait). so you would likely sample us somehow. and over 60. and their ilk. the fastest growing stalk on earth. and narrow pill derived from natural ingredients found in bamboo. increasing exposure to various parts of the male anatomy in the ever-more permissive media is a constant reminder that. through the simple expedient of a little. long. your estimate would be too low. we discuss the problem of comparing a sample to a population of values with a known mean and standard deviation. To explore how chance wrecks things. The basic goal of these statistics is not to describe the data—that’s what the previous statistics do—but to determine the likelihood that any conclusion drawn from the data is correct. As a result. chi-squares. will differ from the true value by some amount as a result of random processes. a basketball reject. perhaps by sending a letter to the department heads at some northwestern colleges or steering delegates at the annual statisticians’ conference into your booth with offers of beer and pizza. these descriptive statistics. Streiner is about 5’8”. That doesn’t matter too much unless you want to make an inference that the height you measured is an accurate reﬂection of all statisticians. you’ll likely be about right. If you got Streiner. Elements of Statistical Inference SETTING THE SCENE 1Ironic in this context that Halcyon is a sleeping pill. every experiment will conclude that one treatment was better or worse than another. whenever you compare two treatments. between ages 30 and 60. You see.CHAPTER THE SIXTH In this chapter. imagine trying to determine the average height of all statisticians. However. ANOVAs. 
count for little.” Various drug companies regularly remind them that their sexual prowess could once again return to the halcyon1 days of yore through the ingestion of little pills of various colors. they still might just not measure up (so to speak).

Ask your doctor about Leptostatin. we would probably be saying to ourselves something along the line of. we’ll ﬁnd that the mean of our treated group is indeed a bit better. that’s one heck of a big difference. from the population is at the root of most experimentation and all inferential statistics. “50% of all Canadians are at increased risk of heart disease from high cholesterol. It depends on only two variables—the extent to which individual values differ from the average. if we make an honest attempt to reach all individuals ﬁtting our criteria. particularly ethnography. and population parameters are labeled with Greek letters. if we have a large sample size. become lower and lower. half the time the experimental group would look better. statisticians. Also. estimate how far you might be away from it? But it really isn’t all that mysterious. a long time ago. But as the difference gets bigger and bigger. if for no other reason than because some of those to whom the results will hopefully apply have not actually been born yet. the likelihood this could arise by chance is 50%.” And the little “p < . and the sample size. As one strategy to keep things straight. given the vagaries of random ﬂuctuation. if all we ask is that there is a difference in favor of the treatment.001. sample SDs and population SDs. try PDQ Epidemiology (Streiner and Norman. the probability that we would observe a difference in favor of the treatment of any size is exactly 50%. if the therapy were completely ineffective. as we start inferring. if we did the experiment a bunch of times with an elixir that did nothing. The sample describes those individuals who are in the study. the probability that an effect could arise purely by chance. “the difference was 22. if the difference in favor of the treated group were huge. Sure seems pretty unlikely that you could get such a humungous difference just by chance if there were no real effect of treatment. and half the time it would look worse. We now have sample means and population means.001” says that. as we just did. The goal of inferential statistics is to be highly speciﬁc about these chances. the chance that this could arise by chance. hopefully at random.1% That is. and then drawing a sample. without knowledge of the true value. Also. many more of them are too long a plane trip away. and our estimate will be close to truth. So what? Well. the population describes the hypothetical (and usually) inﬁnite number of people to whom you wish to generalize. How can you.2 On the other hand. But. But the trouble is that. And if the gods shine upon us and our elixir. ﬂip a coin to get them into two groups.ELEMENTS OF STATISTICAL INFERENCE 47 generalize from the sample to the population of statisticians. then calculate the means of both groups after one got the treatment. it is obvious that no one has ever made a truly random sample from a list of everyone of interest. Sample values are labeled with the usual Roman letters. from the previous paragraph. putting a number on trivial versus humungous. with arcane math and many tables. Instead of saying. there is a good chance that your estimate may be too high or too low just as a result of the operation of chance in determining who walks through the door of the hospitality suite. we have suddenly doubled the number of variables we have on hand. It is clear that if we conﬁne our interest to only those patients who are in the hospital at the time of the study. . often expressed as a standard deviation (SD). 
That’s why statistics can empower you. in the most simple of scientiﬁc worlds. That seems like a truly magical feat. there is a popular TV commercial running that states boldly. based on the author’s knowledge of inferential statistics and/or the money he spent on a PhD. just by chance. created two sets of labels.001.7%. and (2) have different manifestations of illness and thus were not referred to the particular clinicians at our hospital.” In other words. or less that 0. inferential statistics is entirely directed at working out. Note that some of the methods of sociology. 1996). Of course. “Jeepers. we want to do just like Gallup and state that “the true height will lie within plus or minus 2 inches of what we measured 95% of the time. Undoubtedly this was a good idea back in those wondrous days of yore when every school person had to survive courses in Latin and we write. all the differences in individual values will tend to cancel themselves out. 2As A Bit More of Nomenclature Of course. and so on. this generalization is strengthened by the methods of sampling. We’re often in a situation where we want to claim that something we did to our sample—our treatment— actually worked. sample variances and population variances. by a process of random sampling. The Meaning of the Magical p Value So samples are on average a little bit different from populations. he can state with conﬁdence that the chance this difference. the chances that the generalization will be successful are enhanced. the notion of deﬁning a population consisting of all folks of interest to you in the particular experiment. we might pick a sample from somewhere. as we have been doing all along.” Yes. You’ve seen the number if you read the journals. That’s what all of statistics is really about. we will miss all those who (1) have less severe illness and were not referred to the hospital.” SAMPLES AND POPULATIONS The Difference between Them In part. and 50% are at decreased risk from low cholesterol. could arise by chance is less than . Nevertheless. are deliberately not intended to generalize beyond the situation under study. regardless of the variation. that there is some chance that the estimate will be a bit off. For more details about this idea. it is likely that the sample mean will lie fairly close to the true value. do something to one group leaving the other alone. Putting it another way. That is. p < . or one larger. so quite a bit. If relatively little variation is found about the mean of the sample.

500 400 Number 300 200 100 0 Mean = 161. the survey has gathered data on ﬂaccid and erect lengths from over 3. . does nothing to reassure the aging male that he’s not merely on the small side of normal. However.0 mm).com/result. and all.4 What one does is use the calculated sample statistic—the mean or standard deviation calculated from the sample to estimate the population parameter. so all will now be enlightened. the only people who know Greek are Greek scholars.” While the temptation is powerful. So. .5. most stand for the same quantity in the sample and the population. or mu (µ) .100 subjects. not so small that you would walk away from it altogether. when they say that the average gas mileage for 1990 Yugos is 23. No estimation of error exists. and their names: Greek letter Name Roman letter Statistical term α β δ π µ σ alpha beta delta pi mu sigma a b D p M s Type I error (see below) Type II error (see below) difference proportion mean standard deviation 4This isn’t entirely correct. movies. You may actually have access to the population. with the trade name of Mangro. Manitoba. you decide to take a break and go Web surﬁng. the days of free love and free spirits (and other more leafy intoxicants). In despair. you don’t. as a budding clinical scientist. In Figure 6–2. Researchers in a small company situated 234 km northwest of Port Moresby. cough up three monthly payments of $19.5 and standard deviation of 31. the most useful data for our purposes is the distribution of erect penile length of the whole sample. But for the moment. and displayed our sample mean. And there you ﬁnd it—the deﬁnitive penis size survey results (www. and shot it to the biggest drug company on the planet. the increasing barrage of explicit images of wellendowed males on late night TV.48 THE NATURE OF DATA AND STATISTICS 3The other author will be happy to furnish Hebrew equivalents on request. Nightly television ads are a constant reminder that you have a disease—“erectile dysfunction syndrome”—that is cured by a little blue pill. and Greeks. most of their libido went with it.100 subjects is 161. decide to put the claim to the test. 40% bigger. Looking at the distribution. To make matters worse.95. the issue is no longer restricted to inner thoughts. So. if only you screw up your courage and ’fess up to your doctor. As yet. Now in its sixth edition. Just call 1-800-1234567 in Brandon. . So our guys.5 mm (6.sizesurvey.5 Sample Mean = 170 5Who FIGURE 6–1 Distribution of erect penile length for 3. 60 90 120 150 180 210 Length (mm) 240 270 However. ELEMENTS OF STATISTICAL INFERENCE Times are tough for male boomers. who would want to be in the control group? You recognize that you could possibly do the study by just administering the real drug to a group of subjects.100 subjects. And somehow. you. they may well mean just that. Since it’s a naturally occurring compound. have gained a total of 8.5 SD = 31. add the stipulation that they all be included in the trial. that puts the study participants up around the 60th percentile. are long gone. swing it with ease past the ethics committee composed of older male scientists5. Greek.4 inches). so the convention confuses. the little squiggles aren’t all that mysterious. The distribution looks remarkably bell shaped.4 mpg. Below is a small sprinkling of Greek and Roman letters. New Guinea. pharmacologic help has come from an unlikely source. Greek fraternity members. but assigned to the Mangro condition. because only God has access to the entire population. 
and satisfaction is yours. You design a randomized trial of Mangro versus placebo.5 mm (1. with a standard deviation of 31. many years ago noted that the local men were apparently remarkably well endowed. we haven’t said anything about how one goes about calculating these mystical quantities.0 mm. everyone drops out. For example. Six weeks later. population means begin with Greek m. just as Pﬁzer’s little blue pill solved the ﬁrst problem.5 mm. and have reﬁned it into a small red pill. Overall.htm). Yugo Motors has access to the entire population of 1993 Yugos. The report contains a wealth of interesting data that those who are interested can pursue at their leisure. Sample means begin with M. the results are in: an average erect length of 6. at least until they are both sold. one of us had the beneﬁt of a Greek fraternity3 (but thankfully no Greek course). we’ve replaced the original data with a normal curve with a mean of 161. But still. the mean of all 3. if only you could ﬁnd good comparison data on penile length of males who aren’t taking the drug. and put up signs by the elevators. and not something you (or your signiﬁcant other) would notice. since with the availability of an enhancer off the Web for a paltry $60. about a third of an inch. no research was required. Not quite up to the claim. but after you explain the study. In fact. The label says that it’s “Guaranteed to add half an inch after six weeks or your money refunded. So you press on and enlist 100 anxious males in the study. and so on. Nowadays.25 inches).7 inches (170. it has now become medicalized. You’re immediately inundated with calls. 60% smaller. Those carefree sixties. Added to that. and has carefully compiled the data. shown in Figure 6–1. with an average length of 170. They managed to identify the ingredient in bamboo that causes it to be the fastest growing stalk in the vegetable kingdom. and inferential statistics are not required. and on further inquiry found that they engaged in a nightly ritual of rubbing a potion derived from the boiled oil of bamboo stalks on the area.

It would be the same as if we went out on the street. just by chance? That. Of course. But. people like us couldn’t make all the big bucks. That means that our study is simply repeating the original survey. And as the sample size gets bigger and bigger. that is. However. say by taking two guys at a time. That’s because. is all about working out the chances that a difference could come about by chance. so that the mean of the means will just be the overall mean. what’s the probability that you could get a difference between your sample of 100 people and the population in the deﬁnitive survey of a third of an inch.5 Sample Mean = 170 FIGURE 6–2 Normal distribution of the data in Figure 6–1. However. The big ones cancel out the little ones. so we have to put a veneer of scientiﬁc credibility on it. It states that the penile length of treated males is bigger than untreated males. the extremes will cancel out.5. the question we’re asking as scientists is not whether a particular penile length could be 1⁄3 of an inch bigger than the mean by chance. and doing this again and again. Suppose the treatment doesn’t work.ELEMENTS OF STATISTICAL INFERENCE 49 The obvious question is. 8We’re . then the chances of cashing in on the drug are non-existent and our chances of retiring early on the proﬁts are equally small. that the null hypothesis is true. One thing is clear. to either scientist or consumer. If we did the study with a very small sample size. when we’re plotting the average of 100 folks. ﬁve. and one that says they’re different—either bigger or smaller. just a little bit bigger than normal?” In short. called a null hypothesis. So. but whether the mean of a sample of 100 erect phalli could be bigger than the population by 1⁄3”. another thing is clear—the distribution of the sample means will be much tighter around 161. only with 100 guys.5 than each alone. and it’s a very big however. called the alternative hypothesis. they’ll be a bit closer to 161.5 than was the distribution of the individual lengths. sometimes they would both be on the small side.5. what we’re really worried about are the chances that a sample mean could be equal to or greater than 170. it can also take different forms. but as we’ll see in Chapter 26. We ﬁnd ourselves in the unenviable position of having to do a study where we go and grab (not literally) 100 guys at a time.5 than any individual observation. the null hypothesis does state that nothing is going on (i. If we increase the sample size to. the nil hypothesis). so the mean will get closer and closer to the population value of 161.” In the majority of cases.000. And from our discussion above. plotting the mean for each of our samples.5.. did the drug work. we’re now no longer displaying the original observations. The sample means will fall symmetrically on either side of 161. All the stuff that ﬁlls the next 200 pages and all the other pages in all the other stats books. but most of the time one small guy is going to be balanced by one large guy. 7Actually. If we now do this a number of times.7 THE STANDARD ERROR AND THE STANDARD DEVIATION OF THE MEAN If you look at Figure 6–1. null means “the hypothesis to be nulliﬁed”.0 mm if the sample came from a distribution of means with mean 161. not 3.5. when we’re plotting the individual measurements. all the exams that have struck terror in the hearts of other stats students (but not you!). sometimes they would both be on the big size of average. if ordinary people realized that was all there is. 
in order to ﬁgure out 6In this case. if this is true. not telling which street. dear reader.5. If this doesn’t seem obvious. is what all of inferential statistics is about—ﬁguring out the chances that a difference could come about by chance. so the sample mean should fall much closer to 161. and put it on a piece of graph paper.6 Frequency SD = 31. 205 220 145 160 175 190 Length (mm) H0: There is no difference between the penile length of males treated with Mangro and untreated males. Every time we sample 100 guys. “Is this a real difference. say. we get a few who are big. Of course. some who are small. all in the cause of ﬁnding out how the means for sample size 100 are distributed.8 sampled 100 guys. think about the relation with sample size. We turn the whole thing into a scientific hypothesis. Of course. we’ll ﬁnd big guys and small ones. and we hope the cops don’t ﬁnd out. and send them packing. as we’ve said. there’s an increasing chance that extreme values will cancel out. this whole exercise is a bit bizarre. A third of an inch just doesn’t look very impressive. So there is also a parallel hypothesis that the drug did work. Putting it another way. We’ll explain all this later. we’re displaying means based on a sample size of 100 each. measure their members.e. and the mean will be closer still to 161. or could the difference you found be simply explained by chance. and labeled H1. or did you just happen to locate 100 guys who were. 161. measured them up. and a bunch who are about average. then there’s a pretty good chance that the ﬁve guys will be spread out across the size range. it doesn’t mean “nothing. However. you’re likely thinking to yourself that this is one study where the null hypothesis wins hands down. on average. we would start to build up a distribution like Figure 6–1.5 Mean = 161. computed their mean. So if we average their two observations. the big ones will average out with the little ones.5. by the sample you ended up with? Putting it slightly differently again. there are two kinds of alternative hypotheses—one that says they’re bigger. we have to work out how the means for sample size 100 would be distributed by chance if the true mean was 161. so we actually start with a hypothesis that the drug didn’t work. statisticians are a cynical lot.

15. the null hyupothesis still holds. it means . or anything larger. and examine in detail the ﬁgure on the top of the left page. it’s not really all that complicated. of course. how close the means of repeated samples will come to the population mean. But to use Table A. and the dotted line above it is the original distribution of individual values. We’ve already ﬁgured out that the bigger the sample size. What’s left over. and σ is “sigma. how spread out the means will be for a given value of σ or n. how likely it is to ﬁnd a mean of 170. we usually don’t know σ. so it’s 8.0 – 161.5 Mean = 161.000 or so that we began with.s. And that’s just about the way it turns out. all this is assuming that the repeat- Original Distribution Distribution of Means ( n = 100) FIGURE 6–3 The distribution of means of n = 100 from the population in Figure 6–1. In this case.50 THE NATURE OF DATA AND STATISTICS 9Pun intended. so the area to the right of its mean is 0. We suspect that you’ve been hanging around this research game long enough to have seen a bunch of “p < . Now the sample mean of 170.4965. The solid line now is the distribution of means. what we need is just how far away from the population mean. But the total area of the normal curve is 1. That’s not hard. Putting what we did into a formula. you’ll ﬁnd a z of 2.5).70 SD units away. we expect that under the null hypothesis. We also now know that this distribution will have a standard deviation (the standard error of the mean) of 3.70 in the third column on the right side.05” is a very good thing. could arise by chance is just the teeny weeny9 bit of the distribution to the right of 170. the SE(M) is 31. the bit left in the tail.5 / √100 = 3. under the null hypothesis that they all come from the normal untreated population.0 – 161. which is the probability of ﬁnding a z score of 2. and is the mean of the “population” distribution of 3. and the SE(M) reﬂects. where we have enlarged the X axis. is directly proportional to the standard deviation. and inversely related to the sample size. µ is “mu. called the Standard Error of the Mean [which is abbreviated as SE(M) or SEM]. because we very rarely have access to the entire population. So with our calculated z value.5 mm from the null hypothesis mean) really looks pretty unlikely. and beside it is “. with a true mean of 161. normal males. This distribution is portrayed in Figure 6–3. we don’t have to do that. we don’t have to repeat the study hundreds of times to ﬁnd out. The number.50 – 0. as you will see in Table A in the Appendix.50..D. So. we estimate it with the sample SD. For this case. it looks like this: z= X – µ 170.5 / 3. for a given sample size. the SD reﬂects how close individual observations come to the sample mean. the SD is 3.5 √ 100 (6–2) Standard Error of the Mean SE(M) = σ √ Sample Size = σ √n (6–1) Of course.4965 = . Fortunately. the mean of the treated sample is.” the standard deviation of this distribution.5 Sample Mean = 170 160 165 Length (mm) 170 175 Just a reminder. They work out for us just how those means will be distributed theoretically. And so it is.” What’s going on here? It’s supposed to be a teeny weeny number. under the null hypothesis. All that remains is to actually compute this little area.15 mm. This formula tells us.5 = 2.05”s sprinkled throughout the papers. based on a single sample. And the subsequent prose certainly leads one to believe that “p < . The likelihood that it. .15. Fortunately for us (and for the measurees).70. we run to the back of the book. 
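You can watch the standard error emerge by brute force. The simulation below is our own illustration (the constants are the survey's mean and SD); it draws repeated samples of 100 from the untreated population and shows that the sample means cluster around 161.5 with a spread of about 31.5 divided by the square root of 100, which is 3.15, just as Equation 6-1 predicts.

import random
import statistics

random.seed(1)
POP_MEAN, POP_SD, SAMPLE_N = 161.5, 31.5, 100

# Repeat the (hypothetical) survey many times: draw 100 lengths at a time
# from the untreated population and record each sample's mean.
sample_means = [
    statistics.mean(random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_N))
    for _ in range(2000)
]

print(statistics.mean(sample_means))    # close to 161.5
print(statistics.stdev(sample_means))   # close to 31.5 / sqrt(100) = 3.15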
If you go down the columns. the smaller the standard deviation).0 or more on a single study. But here’s where statistics students get their Ph.0. in standard deviation units. In other words. is therefore 0. WORKING OUT THE PROBABILITY OF THE DIFFERENCE: THE z-TEST All the preceding discussion suggests that Figure 6–2 is working from the wrong distribution.5.0 (a difference of 8. It’s also not too great a leap of faith to ﬁgure out that the less dispersed the original observations (i.5) = 8. Of course. is the area from the mean out to the value of 2.5 = 2.70 or larger. Well. it is if you look at it right. What we really need to know is the distribution of means for sample size 100. 155 SE(M) = 31.e. so in most cases.70 = σ/ √ n 31. the tighter the distribution (the closer the individual means will be to the population mean of 161.4965. the distribution of means for a given sample size will be directly related to the original standard deviation.0035. The width of the distribution of means.” the Greek for the mean. the tighter the means will be to the population mean. it’s (170.5 mm above the mean. wiser persons have done it for us. and ﬁnd a really messy table. and inversely proportional to the square root of the sample size: ed samples are random samples of the population we began with.
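The whole z-test fits in a few lines of Python (our sketch, not the book's); it reproduces the z of 2.70 and the one-tailed probability of about .0035 without a trip to Table A.

import math
from statistics import NormalDist

pop_mean, pop_sd, n = 161.5, 31.5, 100
sample_mean = 170.0

sem = pop_sd / math.sqrt(n)              # standard error of the mean, 3.15
z = (sample_mean - pop_mean) / sem       # 2.70, as in Equation 6-2

# One-tailed probability of a sample mean this large or larger under the
# null hypothesis: the area in the right-hand tail of the normal curve.
p = 1 - NormalDist().cdf(z)
print(round(z, 2), round(p, 4))          # 2.7, about 0.0035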

11 = 3. and the corresponding p value is 0. still with the same standard error of 3.0.5 mm (p = . for the test of the null hypothesis: z= 165. is . greater than the usual criterion) We fail to reject the alternative hypothesis. if we go back through the calculations.59 = σ/ √ n 31. because the normal curve is symmetrical.9”.0 under the alternative hypothesis that the true treated mean is 175.05.0. Yeah. and not consistent with a true treatment effect of 0. “What’s the likelihood that we could get a sample mean of 170. with a new population mean centered on 175. to save space. Since the probability was small (. but as we’ll see. But.134. what? Sounds more like a lawyer than a scientist. and all we got was 8. H1. It was claiming a difference of 1⁄2 inch.5/√100 (6–4) 10Why .5/ √100 (6–3) FIGURE 6–4 Testing if our result differs from a mean of 175.0 – 161. We fail to reject the null hypothesis. we get a slightly different result in terms of the numbers. now. chance is still operating. surprise).0035). So although it’s a bit unlikely that the sample mean could be as small as 170. H0). it’s not so unlikely that it meets our criterion of rejection of p = . assuming that the alternative hypothesis is true. we gave the values only for positive values of z. we can look up 1. and the makers of Mangro.054. under the alternative hypothesis. that there really is a difference between the Mangro group and the population. And we have a new z-test: z= X – µ 170 – 175 = –1.05? Stay tuned! 11Note Back to the back of the book we and discover that the probability of a sample mean this small or smaller. Sample x = 170. So we can conclude. but instead comes from a different population: guys who beneﬁted from Mangro.ELEMENTS OF STATISTICAL INFERENCE 51 that because the likelihood of getting a result that big or bigger by chance under the null hypothesis is less than .05) We reject the null hypothesis. a hugely different result in terms of the conclusions.5 mm (1⁄2”). For the moment. The result we observed is consistent with a true treatment effect of 1 ⁄2”. based on the study we did. Using the same logic as before. We just have to remember which side of the curve we’re dealing with. we are now effectively saying that the sample we examined no longer comes from the population of guys with untreated phalli.10 we’ll go out on a small limb and reject the null hypothesis that this difference could have arisen by chance.5 = 1.11 that there are no negative values of z in the table. Suppose it had worked out just a little bit different. which will eventually pave the way to an explication of the whole arcane logic of statistical inference. Now. we can now ask.12 The makers owe us big time.5 mm.5 3.05. In doing so.15 31. We can also carry through the prior logic and calculate the likelihood of observing a sample mean this .0035.0 MORE ON THE ALTERNATIVE HYPOTHESIS. Circuitous logic. In short.0 mm based on 100 people if the treatment was ineffective (the null hypothesis. or 1 in 20.5”. which is 175 mm?” Now the picture is like Figure 6–4.0 mm is exact.0 under the null hypothesis. We. Should we sue and get our money back? Well.59. First. So there is a good possibility that we could have observed a sample mean as large as 165. what’s left is the alternative hypothesis. Because we’ve rejected the null hypothesis. let’s consider some alternative scenarios.15. less than the usual criterion of .5 mm. H1 So we have now done the statistical “laying on of hands” and worked out the chance that we could get a sample mean of 170. 
that’s another long tradition in medical science (and anything to do with sex). really were pretty lucky that it worked out the way it did— we just managed to get a low enough p value to reject the null hypothesis and conclude that the stuff works. we conclude that the treatment worked.054. but Mangro didn’t exactly deliver on its promises. that: a) It is unlikely that we could observe a sample mean this large or larger under the null hypothesis that the treatment was ineffective (p = . and the population mean for the treated guys is 6.0 or less. either. or 6. 12Of Now we go to the back of the book. called H1 (surprise. and a sample mean of 170. Let’s turn things around and ask a different question. But then. H0 go.0. 150 160 170 Length (mm) 180 b) It is not unlikely that we could observe a sample mean this small or smaller under the alternative hypothesis that the true treatment effect was 12. we have every reason to believe that Mangro delivered on its promise. Suppose that the sample mean was 165. let’s not be too hasty. and we just managed to get a high enough p value to conclude that the observed mean could have resulted from a true treatment effect of 12. because. We’ll return to the philosophy in due course. course it helps that we cooked the data. not 170. or 12.4” + 0. so there’s no reason to assume that the sample mean of 170. After all.

but for commiseration. which means that if Mangro did work.e. we’ll declare the result signiﬁcant and reject the null hypothesis. has changed everything about the experiment. it is much less likely that the same size of effect could occur simply by chance.0008. we might stop there. if the probability calculated from the sample mean is less than this critical number. since it is showing explicitly the operation of chance. We begin with the null hypothesis. There is another way that things could have come out differently.5 = 1. if the probability is greater.0 = –3.0 – 175. Once this decision is made. So. if the sample mean is closer to the population mean than this. so that’s rarely even an explicit judgment. as we saw in the last example. One way or another. we reject the null hypothesis. we still close down the lab and go to the pub.0 if the treatment mean was really 175. the corresponding p value is . that is. with a large sample.0 to a sample mean of 165. we get to the pub.14 However. H0.” if we don’t reject the null hypothesis. with a standard deviation equal to the original standard deviation divided by the square root of the sample size we used in the study. we fail to reject the null hypothesis. which is really saying that. Which brings us to the next step … c) The alternative hypothesis distribution Generally speaking. and the true treatment effect was actually 1⁄2”. concluding there is a signiﬁcant difference) when in fact it came from the H0 distribution. formally. Suppose we did the study with a sample size of 25. and declare that there was no difference. there’s still a high probability. that’s what all of statistics . Similarly. we compare a sample “statistic. is to decide what probability we’ll use to decide whether we’ll accept or reject the null hypothesis. So the chances that we could get a sample mean of 165. instead of celebration.17 = 3. even if we had exactly the same experimental result.05. it is now quite possible to get a sample mean this large purely by chance. we have identiﬁed a “critical value”. if we don’t reject the null.5/√100 (6–5) is about. you can get seemingly large effects purely by chance. But it actually makes sense..0 –10.0 is vanishingly small. Putting it another way. A relatively small change in the outcome. Our next task is to just formalize these ideas a bit more. with the smaller sample size. Normally. Remember that all these calculations depend on sample size.0 – 175. With a small sample. but do it with a smaller sample size. This is centered on the population mean.3 31. we close down the lab and go to the pub. which is the same as saying that we assume the treatment was ineffective.415. With a z of 1. All this because we did the study with a smaller sample. In the simplest case. that the sample mean would be small enough that we would miss it. and we reject the alternative hypothesis. b) The alpha level and the critical value Now.5 –5. we can determine the distribution of sample means under the null hypothesis.15 31.) This probability is called alpha. to a population “parameter”—the mean of the population µ.” the basic formalism of statistics.13 Believe it or not. as explained in the next section. if we do exactly the same thing. Now the two formulae look like: z= 170. if we compute the probability of getting a sample mean this small if the treatment really worked based on: z= 170.0 (about 1/5”). the lower the likelihood that you can get an extreme difference between sample mean and population mean under the null hypothesis.089. 
we don’t just stop.5/√25 (6–7) Alpha (α) is the probability of concluding that the sample came from the H1 distribution (i. This is: z= 165. and get the same results.35. a) The distribution under the null hypothesis Using the relationship between the standard deviation and the standard error of the mean (Equation 6–1). if we do reject the null hypothesis. amounts to setting up a decision situation. if farther away. and this probability is . because the larger the sample size.5 mm. with exactly the same mean.” in this case. the calculated sample mean. which is the one we worked through in the above example. we’ll accept the null hypothesis. THE FORMAL LOGIC OF STATISTICAL INFERENCE “Statistical inference. The next thing we do. it might be because we simply didn’t have a large enough sample size to detect a reasonable size treatment. from a sample mean of 170.79 = 6. Too large a sample and you can prove anything. here’s where the logic gets more obtuse.5/√25 (6–6) 13As one of our mentors said.0 = –0.5 8. or 12. more than 40%.3 31. Certainly. we accept the null hypothesis. we virtually always use .0 – 161. “Too small a sample and you can prove nothing. medical researchers are not so adventurous as to make claims about the magnitude of treatment effects—“guaranteed to increase your 14And And the probability is 0.52 THE NATURE OF DATA AND STATISTICS small under the alternative hypothesis that the stuff was really good for 1⁄2 ”. we might be led to a different conclusion to not reject H0 simply because the sample was so small that the difference could have arisen by chance. the assumption that the sample mean was computed from a sample drawn from this population. (By convention and history.35 = 6.

(b) it works. unlike Mangro.” while the rows show what we found in our study. All this is shown in Figure 6–5. The columns show “reality. Cell A describes the situation in which there really is a difference. statisticians call this a Type II error. 16This table doesn’t show the all-toofrequent Type III error—getting the correct answer to a question no one is really asking. That is. but our study concluded that it didn’t. but we concluded that it did. (c) it doesn’t do anything. but our study rejected the null hypothesis. This is called a Type I error. The ﬁrst step is to make a plausible guess at how big a difference could reasonably be expected. This is something that only an omnipotent being would know. that leaves two boxes to be ﬁlled. d) Beta and power We can now determine how likely it is that we would accept the null hypothesis when the alternative hypothesis was true. If we don’t see a signiﬁcant difference. Cell D shows another place where we’ve come to the correct answer—there really is no difference because of the intervention. it may well be.. even if one has used Mangro. whose tail now extends past the critical value to the left.ELEMENTS OF STATISTICAL INFERENCE 53 lifespan by 6 months or your money refunded. Finally.05. we’ve come to the right conclusion. rarely arises. we can also put a probability on the likelihood that we would be able to detect this difference. This locates the center of the distribution of sample means under the alternative hypothesis. which is 1 – β. it is “truth” in the Platonic sense15 that we can get glimpses of through our studies. Power (1 – β) is the probability of concluding that the sample came from the H1 distribution (i. it will happen 5% of the time. Cell C shows the opposite situation—the stuff really works. there is no difference (i.16 and its probability of occurring is denoted by β.e. if we took a totally useless preparation. then. then by deﬁnition. when in fact it came from the H1 distribution (there really is a difference). Similarly. and repeated this 100 times. Unfortunately. Truth Study Results Difference No Difference TABLE 6–1 The results of a study and the types of errors Difference (1 – β) A B Type I Error α (1 – α) No Difference Type II Error C D β .e.20 (meaning that the power of the study was between 80 and 15That refers to the Greek philosopher. Congratulations are in order. The width of the distribution is the same as the H0 distribution—just the standard error of the mean. Beta (β) is the probability of concluding that the sample came from the H0 distribution (i... as in Table 6–1. By tradition. as in the sample size 25 experiment. concluding there is no signiﬁcant difference). In general. the intervention doesn’t work). it is not describing a chaste relationship between individuals. e) Type I and Type II errors If you’ve been able to stay awake through all of this. In Cell B. we’d like β to be . which is the area of the H1 distribution to the right of the critical value.” “guaranteed to reduce your risk of stroke by 20%. and that’s also what our study said. could be disputed—at least in this world. it does occur under a different guise. How often does it occur? If we use the commonly accepted α level of . that we didn’t have a large enough sample. but never know deﬁnitely. So the situation that arose in the example. Accept H0 Reject H0 Critical Valve FIGURE 6–5 The null and alternate hypotheses.15 to . gave it to half of the subjects. the analogy to the 1⁄2” of our example. and that’s what our study said. 
and is called power. and gave the other half a different totally useless preparation. but our study failed to reject H0. beyond our ken.e. just the area in the tail of the H1 distribution to the left of the critical value. where we actually tested the sample mean against the claimed effect to see whether it was consistent. and our study showed that it worked. We can actually extend the calculations of the example to put a number on the likelihood that we could detect a difference that was there—the power of the study. We can show the four conclusions in a two-way table.” which is strange since none of these claims. in fact. when it really did come from the H1 distribution (there is a difference). we’d ﬁnd a statistically signiﬁcant difference between the groups in ﬁve of the studies. With great originality. you may have noticed that we’ve been discussing four different types of conclusions: (a) the stuff really works. this box reﬂects the power of the study—the probability of ﬁnding a difference when it’s there. However. The ﬁrst part of each of the four phrases— it works or it doesn’t work—is. we will have a second distribution to the right of the H0 distribution. and (d) it doesn’t do anything. concluding there is a signiﬁcant difference). As you can see in the table. This is called beta. and that’s also what our study showed.
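Alpha, beta, and power boil down to one short calculation: fix alpha, find the critical value on the H0 distribution of sample means, then ask how much of the H1 distribution lies beyond it. The function below is a generic sketch (Python; the numbers in the example call simply reuse the Mangro figures as an illustration, not a calculation taken from the text).

from math import sqrt
from scipy.stats import norm

def power_one_sample_z(mu0, mu1, sd, n, alpha=0.05):
    """Power of a one-tailed, one-sample z-test of H0: mu = mu0 against H1: mu = mu1 (> mu0)."""
    se = sd / sqrt(n)
    critical_value = mu0 + norm.ppf(1 - alpha) * se     # reject H0 above this sample mean
    beta = norm.cdf(critical_value, loc=mu1, scale=se)  # H1 area to the left of the critical value
    return 1 - beta

# Illustration: the chance of detecting a true treated mean of 175 mm with n = 100
print(power_one_sample_z(mu0=161.5, mu1=175.0, sd=31.5, n=100))   # roughly 0.99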

only a smaller sample size. there is a world of difference. they say “Not proven. 1956) which mirrors what the famous philosopher and skeptic David Hume said way back in 1748: “A wise man proportions his belief to the evidence.05 doesn’t suddenly disappear at p = . if it looks like something’s going on. many will start becoming suspicious after four heads in a row. three times is an enemy conspiracy. “Surely. which is just over 3%. there really was a difference between the groups.05 or not? We’d love to say. Where Did That 5% Come From? We promised you we’d explain why the magic number for journal editors is a p level of . you obviously had sufficient power. When we did the study with a sample size of 100. All along. it keeps coming up heads. or 25%. The Scots have a better solution. and there was not sufficient evidence to prove him guilty. Hence. the most likely culprit is that there were too few subjects. How many tosses before your friends think the game is rigged? If we were doing it with our friends. but they didn’t reach statistical signiﬁcance. so we rejected H0 and concluded the treatment worked.05. As our kids would say. the more power (and the smaller is β). if you did a study and the results look promising.18 To be a bit more formal about it. It really is illustrated with the example. who said. 18Also analogous to the verdict of the O. Simpson trial. just look at the data. or. “Let it be . Hence the logic that we “failed to reject the null hypothesis” acknowledges that there may well be a treatment effect. and we came to the opposite conclusion.” That is. let’s take much of it back. may very well show a difference. and by treating them as synonyms we have offended them terribly. God loves the . The lack of evidence is not the same as the evidence of a lack.25%. the factor that we have the most control over is the sample size. Sir Ronnie was probably right. and we wouldn’t have anything to write about.” which can never be proven correct because you can never assemble all swans to be sure. 5% does seem to correspond to our gut-level feeling of the dividing between chance and something “real” going on!19 Try this out with some of your friends.” So where does that leave us? Should we test for p ≤ . three heads is 1 in 23. he said need not always be the “nil” hypothesis of nothing going on). “Duh!” 85%). where he was found “not guilty. do we need the same level of evidence to believe that non-steroidal anti-inﬂammatory drugs help with arthritis as the claim that extract of shark cartilage cures cancer? There have been roughly 10 papers by every rheumatologist in the world demonstrating the ﬁrst. we used the wrong measures. in one of the early books in the series. First. But the fact is. or 50%. he was off by one or two events. to serious statisticians and philosophers of science (both of which categories exclude present company).” and was found “innocent” (which then led on to a civil suit). which is better executed. the larger the study.” Bond said.00. all we can do in the face of negative ﬁndings is to say that we haven’t disproven the null hypothesis. There are many factors that affect β—the magnitude of the difference between the groups. 19James . It’s logically analogous to the old philosophical analysis of the statement. it’s 1 in 22. Finally.” But at this point we can’t.05.05.” As we’ll see. he rejects hypotheses.” and it was . Fisher did in fact talk about a null hypothesis (which.” this isn’t quite right. and the probability of ﬁve heads in a row is 1 in 25. 
Insisting that the same p level of . sayeth unto the multitudes. mirabile dictu. for two reasons: You’ll never get a paper published that way. This was best expressed by Rosnow and Rosenthal (1989).” But. Let’s work out the probabilities. and in all circumstances. However. bear in mind that an effect that exists at p = . But. Two heads.051. Our study may not have found an effect because it was badly designed. and the remainder after ﬁve. you’ll pay each of them $1. he said that we should go through the rigamarole of running statistical tests and ﬁguring out probabilities only if we know very little about the problem we’re dealing with (Gigerenzer. the probability of a head is 1 in 2. we found that there was a signiﬁcant difference. But at another level. we’ve used the phrases “accept the null hypothesis” and “fail to reject the null hypothesis” more or less interchangeably. 2004). “All swans are white. Ritual and Myth of p < .06 nearly as much as the . as with most historical “truths.00.17 On “Proving” the Null Hypothesis One ﬁnal bit of philosophical logic. more likely. and all criminal trials.” and Carl Sagan turned into his battle cry of “Extraordinary claims require extraordinary proof. “Once is happenstance. You keep tossing and. so here we go. twice is luck. “Look at the data. you can never prove the non-existence of something (although we’ll try to in Chapter 29).5%. the variability within the groups. The next study. The historical reason is that the granddaddy of statistics. For one head (and no cheating). he was actually the author of the heretical words: no scientiﬁc worker has a ﬁxed level of signiﬁcance at which from year to year. whereas if it comes up tails. So. So the trial failed to reject the hypothesis of innocence. We’ll see in a bit that. yet. and not a single randomized trial showing the second. with a less cynical audience. four heads is 1 in 24.05 or less. the sample size was too small (and hence the power was low). Sir Ronald Fisher. the smaller the study the less power. stop right there and use statistics and p levels only when there’s some doubt.05 Having just fed you the party line about the history and sanctity of p < . but never talked about “accepting” or “rejecting” it. Exactly the same data. Second. since to the average reader they differ only in pomposity. otherwise. he rather gives his mind to each particular case in the light of his evidence and his ideas (Fisher. So 5% falls nicely in between. and α. Finally. That means that he was presumed innocent until proven guilty.05 applies in both situations is somewhat ridiculous.05.J. he simply said we should work out its exact probability. they would say “One or fewer.54 THE NATURE OF DATA AND STATISTICS 17It doesn’t make any sense to calculate the power of a study if you found a signiﬁcant difference—if you found signiﬁcance. it’s just that we couldn’t ﬁnd it. or 12. but we have insufficient evidence to prove it. Tell them you’ll play a game—you’ll ﬂip a coin and if it comes up heads. by the way. or 6. each of them will pay you $1.
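The coin-tossing intuition takes one line to verify: the probability of k heads in a row with a fair coin is (1/2) raised to the power k.

for k in range(1, 6):
    print(f"{k} head(s) in a row: p = {0.5 ** k:.4f}")
# Four in a row gives p = 0.0625 and five gives p = 0.0312;
# the conventional .05 cutoff sits between the two.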

we were really not interested in a treatment that made things . Conversely. if it is sufficiently high. Now. the stereo world has undergone yet another revolution. Noise Time STATISTICAL INFERENCE AND THE SIGNAL-TO-NOISE RATIO The essence of the z-test (and as we will eventually see. Two of these are correct decisions (2 and 4).ELEMENTS OF STATISTICAL INFERENCE 55 Signal 1.7 FIGURE 6–6 Spectra of radio signal and noise. then it is reasonable to conclude that no association exists. two decades of sound technology can be boiled down to a quest for higher and higher signal-to-noise ratios so worse and worse music can be played louder and louder without distortion. If the signal does not rise above the noise level. we get a distribution of signals and noises remarkably like what we have already been seeing. there are always four possibilities: (1) concluding we heard a signal when there was none. If the signal—the difference—is large enough as compared with the noise within the group. which is the variability in the measure between individuals within the group. TWO TAILS VERSUS ONE TAIL You might have noticed that. and we falsely conclude that no signal was present. sunspots. The signals come from a distribution with an average height about +1. Nearly all statistical tests are based on a signal-to-noise ratio. This was magically removed by digitizing the signal and implanting it as a bit string on the CD. where the signal is the important relationship and the noise is a measure of individual variation. we were preoccupied with the right side of the H0 distribution. Although we are referring to music. The last one in recent memory was the audio cassette. all through the phallus example. As the local electronics shops and our resident adolescents continue to remind us. we do not hear it at all above the noise. The end result looks like Figure 6–6. All that hissing and wowing was noise. magnetic ﬁelds. Our problem is to determine which our decision is. we’ll make a brief diversion into home audio. Of course. If we now imagine detecting a blip in our receiver and trying to decide if it is a signal or just a random squeak. if we project these waves onto the Y-axis. signal-to-noise ratio of the radio receiver is not just an issue of entertainment value. The cost of all this miniaturization was lots and lots of hiss that no amount of Dolbyizing would resolve. or whatever.1 microvolts (µV). To bring home the concept of signal-to-noise ratio. You might imagine the signals from Voyager 2 whistling through the ether as a “blip” from space. That is. and the noises around another distribution at +0. and a noise. and also would continue to blare music out of our BMWs without skipping a beat as we rounded corners at excessive speed. For obvious reasons. But now we have CDs— compact discs—which deliver all that rap noise at a zillion decibels. based on some observed difference between groups. (2) concluding there was a signal when there was. letting the signal—the original music (or rap noise or heavy metal noise)— come booming on through. we are simply using this as one example of a small signal detected above a sea of noise. When it comes to receiving the radio signal from Voyager 2 as it rounds the bend at Uranus. we then conclude that it is deﬁnitely unlikely to have occurred by chance. we can see that it may come from either distribution. then it is reasonable to conclude that the signal has some effect. and (4) concluding there was no signal when there was none.7 µV. 
brought about by scratches and dents on the album or random magnetization on the little tape. completely distortion-free. The basis of all inferential statistics is to attach a probability to this ratio. This is superimposed on the random noise of cosmic rays.1 0. if it is very low. the essence of all statistical tests). (3) concluding no signal when there was one. In short. which had the advantage of portability so it would ﬁt into the Walkmen (Walkpersons?) of us on-the-move yuppies. is the notion of a signal. it’s a measure of whether any information will be detected and whether all those NASA bucks are being well spent. and two are wrong ones (1 and 3).

Clearly the two-tailed test is a bit more stringent. “Well. As a consequence. except for the circumstance where you are testing two equivalent treatments against each other. We know what they’re supposed to make of it. If this occurs only on one side of the distribution. but that’s what happened. it’s all over. with astounding logic. we must make it to 1. a 10% improvement. but unanticipated. and we got a treatment effect of 123. not grow.45.05. in either direction. On the other hand. if it is a two-tailed test. we wouldn’t be thrilled to ﬁnd that folks with high support are more depressed. Taking the one-tail philosophy to heart. Nobody expected pure oxygen to produce blindness in neonates. whose pronouncements clearly show he has lots of independent judgment. which corresponds to a z value of 1. One-tailed tests are used to test a directional hypothesis. is not as farfetched as it may sound. we don’t even have the right to analyze whether this difference was statistically signiﬁcant. we might ask whether it is different from the population. and two-tailed tests are used when you are indifferent as to the direction of the difference. But remember that the signiﬁcance or nonsigniﬁcance of the test is predicated on the probability of reaching some conventionally small criterion (usually 0. then from Table A in the book Appendix. then the probability one side is 0. Strictly speaking. we have to think about the H0 distribution with a critical value on both sides. given the dubious provenance of the therapy. So. That is. This kind of test. there is a strong argument against the use of one-tailed tests. so that it is 80% worse. we might reframe the alternative hypothesis. we don’t usually care to prove that the drug is worse than the placebo. It is a different kind of logic from what we’ve seen until now. Instead of focusing on two hypothetical means and bobbing back and forth.645 (i. A two-tailed test is a test of any difference between groups. For a two-tailed test. where a 10% difference in the other direction was. whatever opinions they express are simply the results of the latest poll. and we expect. And every poll somewhere contains the cryptic phrase. Except that everybody uses twotailed tests all the time. is called. and say. In fact. with the notable exception of “Dubya” Bush. But.645. CONFIDENCE INTERVALS More and more.96.05).” Or. (6–8) H1: µA ≥ µB (6–9) or.20 If we want to investigate the effects of high versus low social support. side effects. we would have to say it resulted from chance.56 THE NATURE OF DATA AND STATISTICS shorter. it is not immediately evident what difference all this makes. We may well begin a study hoping to show that our drug is better than a placebo. it is difficult to ﬁnd circumstances where a researcher isn’t cheering for one side over the other. You would think that one-tailed tests would be the order of the day. all of it bad. Consequently. there exists the possibility. If so. because we’re dealing with a sample. regardless of the direction of the difference. putting it another way. not the population. However. ever so slight.22 Polling is now a regular feature of most daily newspapers. we need only achieve a z of 1.e. We did the study. Aside from the philosophy. By contrast. if for no other reason than to fend off lawsuits. so that instead of asking whether the Mangro group is larger than the population. or that cloﬁbrate would kill more people with high cholesterol than it saved.” Often we wonder what mere mortals make of that bit of convoluted prose. 
that it might make them shrink. which is called. We know that this isn’t 100% correct. then we are equally concerned about Type I errors in either direction.025. we go from the opposite direction. If we do this. the null and alternate hypotheses are: H0: µA = µB (6–12) H1: µA ≠ µB (6–13) 22Regrettably. as an obvious extension a two-tailed test.. for a one-tailed test. “This poll is accurate to ± 2 percentage points 95% of the time. we’re pretty conﬁdent that the truth lies around there somewhere. focusing only on one side. we’re 95% conﬁdent that the mean lies between 121 and 125. Oops!21 So that is the basic idea. we see that this probability occurs at a z value of 1. if we did the study a thou- where A and B refer to the two groups. if the direction is the other way around: H0: µA ≥ µB (6–10) H1: µA ≤ µB (6–11) 21This SDs from the mean). it’s what is called the 95% conﬁdence interval (CI or CI95). When we test a drug against a placebo. a one-tailed test: A one-tailed test speciﬁes the direction of the difference of interest in advance. In fact. we worried only about committing a Type I error on the high side. to achieve signiﬁcance with a one-tailed test.96. the two hypotheses are: H0: µA < µB 20And trying too hard to prove this is a sureﬁre way to cut oneself off from the ﬁlthy lucre of the drug companies. for the sake of argument. politicians are showing that they have no judgment of their own. in fact. 1. albeit for very different reasons. Now we are in the awkward situation of concluding that an 80% difference in this direction is not signiﬁcant. unfortunately.645 . we would surely be interested in such a consequence. imagine our embarrassment when the drug turns out to have lethal. if we want the total probability on both sides to equal 0. If we frame it formally.
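Computationally, the only difference between the two kinds of test is where the 5% goes. A two-line check with scipy:

from scipy.stats import norm

alpha = 0.05
z_one_tailed = norm.ppf(1 - alpha)      # 1.645: all 5% in one tail
z_two_tailed = norm.ppf(1 - alpha / 2)  # 1.960: 2.5% in each tail
print(z_one_tailed, z_two_tailed)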

0 ± 1. and we do not reject the alternative hypothesis.96 × 31. the actual conﬁdence interval will be written as: CI95 = 170.83—ends up with a tail probability of 2. the sample doesn’t differ signiﬁcantly from it.ELEMENTS OF STATISTICAL INFERENCE 57 Sample Mean = 170 FIGURE 6–7 A 95% CI about a sample mean of 170. Now the question is how can we ﬁgure out what the bounds of 95% probability would be. showing the distributions corresponding to the upper and lower bounds.17 = 163.15 mm.17) (6–14) To formalize all this into an equation. if the population mean falls within the CI. Since the lower bound is actually larger than the original untreated population mean (161. as in this case. All this is shown in Figure 6–8. which shows that the left-hand curve—the distribution of sample means for a population mean of 163.17 180 sand times. 176. meaning that if the true population mean were 176. The smaller the sample size or the larger the SD. there is a 2.0. And considering the relationship between the CI and the null hypothesis.0 ± 6.83.17. it is evident that a relationship exists between the CI and the sample size and SD. We want to ﬁnd out where the population mean would have to be located in order that 2. a 2.5 = 170.17 √100 = (163. but the upper bound is beyond (to the right) of the alternative hypothesis line at 175.96 × 3.17—has 2. we’ll go into more detail on using the CIs from two groups to do “eyeball” tests of statistical signiﬁcance. Let’s go back to Mangro and work it out. So.15. .5% above 170.5% chance of seeing a sample mean of 170 or less. There is also a consistency between this and the results of our hypothesis-testing exercise.15) = 176.5.96.5% probability corresponds to a z value of 1.83 155 160 170 165 Penile Length (mm) 175 Upper 176. and the right-hand curve—the distribution of sample means with a population mean of 176. so the probability of observing a sample mean of 170 with a population mean of 175 is greater than . on 950 occasions the means would lie between 121 and 125. the smaller the CI. and one-tailed as we ﬁrst calculated it). there is a 2. Conversely.83.5% chance of observing a sample mean greater than 170. the larger the CI. The SE for a sample size 100 is 3.025. What we now want to do is establish an upper and lower bound so that we’re 95% conﬁdent that the true mean of the treated population lies between these bounds. where the lower bound of the CI is to the right of the null hypothesis line at 161. In the next chapter.5% [1⁄2 of (100 – 95)] of the sample means for n = 100 would be greater than 170.17. Using a similar logic.025. the lower bound must be (170 – 1. if the true population mean was 163. we calculate the upper bound as (170 + 1.15) = 170 – 6. the upper bound of 176. and the standard error of the mean was 3.83. Recall that the calculated mean of the study sample was 170 mm. 155 160 170 165 Penile Length (mm) 175 180 FIGURE 6–8 Another way of showing the 95% CI. Let’s look at the lower bound ﬁrst. Putting it all together. Therefore. the 95% conﬁdence interval is: CI = X ± zα/2 s = X ± (zα/2 × SEM) √n (6–15) From this equation.5). All this is evident from Figure 6–7. From the normal distribution. the larger the sample size or the smaller the SD.17 is actually greater than our alternative hypothesis of 175.5% below the sample mean of 170. so the result is signiﬁcant (both two-tailed. it must be the case that the probability of seeing a sample mean of 170 is lower than . Lower 163.96 × 3.
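Equation 6–15 is simple enough to check in a few lines. The Python sketch below (our own variable names) rebuilds the 95% CI around the Mangro sample mean of 170 mm and lands on the bounds of roughly 163.8 and 176.2 quoted above.

from math import sqrt
from scipy.stats import norm

sample_mean, sd, n, alpha = 170.0, 31.5, 100, 0.05

sem = sd / sqrt(n)                  # 3.15 mm
z_crit = norm.ppf(1 - alpha / 2)    # 1.96
lower = sample_mean - z_crit * sem  # about 163.8
upper = sample_mean + z_crit * sem  # about 176.2

print(f"CI95 = ({lower:.2f}, {upper:.2f})")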

even tiddly little differences may be statistically signiﬁcant. dumb. The importance of the difference. if you read the ﬁne print once again. there is a need for multicenter trials. it is a necessary precondition for clinical signiﬁcance.” How big a difference is this? We begin by noting that IQ tests are designed to have a mean of 100 and SD of 15. If a difference is not statistically signiﬁcant. By and large (and small)..” Suppose the brochure even contains relatively legitimate research data to support its claims that the product was demonstrated to raise IQs by an amount signiﬁcant “at the .58 THE NATURE OF DATA AND STATISTICS STATISTICAL SIGNIFICANCE VERSUS CLINICAL IMPORTANCE It may have dawned on you by now that statistical signiﬁcance is all wrapped up in issues of probability and in tables at the ends of books. to investigate how large the sample was on which the study was performed. Suppose we did a study with 100 RLKs (rotten little kids) who took the test. it might as well be in the opposite direction. resulting in a need for international collaborative meetings in exotic locales. a difference of only 3 points would produce a statistically signiﬁcant difference.98 Statistical signiﬁcance is a necessary precondition for a consideration of clinical importance but says nothing about the actual magnitude of the effect. They were followed for 7 to 10 years. Although statistical signiﬁcance makes no claims to the importance of a difference.5 (6–17) so δ = (1. if the study deals with measured quantities such as blood sugar. or. it might as well be zero. this is a very profound observation. often called clinical signiﬁcance or clinical importance. we know the distribution of scores in the population if there is no effect. become stockbrokers or surgeons. death can be relatively rare. it looks like Table 6–2. any difference worth worrying about can be attained with about 30 to 50 subjects in each group. do you . because it really is addressing a pretty mundane idea. By contrast. to make clinicians feel that there is a role for them just about the time that they are totally intimidated by the whole thing. Charlie. supported by rich and adoring offspring. With large samples. If the sample size is small. the ﬁrst large-scale sample of cholesterol-lowering drugs screened 300. As our wise old prof once said.47 0. Of course. before ﬁnding that little cottage in the Florida swampland.5) = 2. Trying to argue that a difference that is not statistically signiﬁcant (i.” As one example. 25Presumably Now. But in a follow-up period sufficiently short that the investigators themselves have some certainty of survival. Under the null hypothesis. This is what the insurance companies call “Future Planning.80 5. for that matter. like everything else. TABLE 6–2 Relation between sample size and the size of a difference needed to reach statistical signiﬁcance when SD = 15 Sample size Difference 4 9 25 64 100 400 900 14. “large” and “small” in terms of sample size are relative terms.68 2. not by any whiz-bang mathematics. 24There is an up side. EFFECT SIZES Now that we’ve told you that there’s a difference between statistical signiﬁcance and clinical importance. There were 38 heart-related deaths in the control group and 30 in the treatment group—just signiﬁcant at the . imagine a mail-order brochure offering to make your rotten little offspring smarter so they can go to Ivy League colleges. with a large sample size. “Too large a difference and you are doomed to statistical signiﬁcance. 
and support you in a manner to which you would desperately like to become accustomed. is simply an issue of the probability or likelihood that there was a difference—any difference of any size. even huge differences may remain non. clinical ratings.88 3. Statistical signiﬁcance. or depression scores. It would seem important.96 × 1. “OK. we know. Statistical signiﬁcance says nothing about the actual magnitude or the importance of the difference. As we have shown (we hope). This is not the thing of which carefree retirement. So.05 level. aptitude tests. the question arises.000 who ﬁt the inclusion criteria.000 men to get 4. death has a 100% prevalence. How would the means of a sample size of 100 be distributed? The SE is equal to: SE M SD N 15 100 1.5 (6–16) 23Yes. statistical signiﬁcance simply addresses the likelihood that the observed difference is. then analyzed.96 1. By the same token. the z value corresponding to a probability of 0. may be equal to zero) is still clinically important is illogical and. is made! Working the formula out for a few more sample sizes. that the two concepts are not unrelated. however.(not in-) signiﬁcant. then: δ = 1. in truth. with relatively rare events such as death.94 1. That is. Whatever actual differences were observed were left far behind. Just like the earlier example. frankly. It would seem important to clearly outline the difference between statistical signiﬁcance and clinical importance. Note. if the difference between the RLK mean and 100 is δ. and it can be decided only by judgment.96. our sample of RLKs would be expected to have a mean of 100 and an SD and 15.05 (two-tailed. It’s a pity that statistical signiﬁcance has assumed such magical properties. how do you determine if a result is clinically important? One way is to simply turn to your colleague and ask. of course) is 1.25 is a separate issue.94 IQ points. not actually zero.70 9. for N = 100. Indeed.05 level.e.23 it may take depressingly large samples.24 For example.
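Table 6–2 falls straight out of the relationship between the standard error and the sample size: with an SD of 15, the smallest difference that just reaches two-tailed significance at the .05 level is 1.96 times 15 divided by the square root of n. A short Python sketch that regenerates the table:

from math import sqrt

sd, z_crit = 15.0, 1.96
for n in (4, 9, 25, 64, 100, 400, 900):
    smallest_detectable = z_crit * sd / sqrt(n)
    print(f"n = {n:3d}: difference needed = {smallest_detectable:.2f}")
# 14.70, 9.80, 5.88, 3.68, 2.94, 1.47, 0.98:
# with enough subjects, even trivially small differences become "significant."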

How big a sample would Encyclopedia Newfoundlandia (E. at 100 points. SEM SEM β α/2 Figure 6–9 Mean IQ of sample of RLKs against the null and alternative hypotheses. Essentially. We will call this distance zα. we must keep in mind that the normal curves we have drawn in the ﬁgure correspond to the distribution of means for repeated experiments. For r type ESs. we get rid of CV. and the second as r type. by the skin on its chin. the upper limit is 1. Second. 110 CV 95 100 H0 105 H1 SAMPLE SIZE ESTIMATION As we already indicated. 27Don’t Similarly: (105 CV) n s/ 15 zβ 1.ELEMENTS OF STATISTICAL INFERENCE 59 think this is clinically important?” There are a couple of problems with this approach. and those that evaluate the proportion of variance in one variable that is explained by another variable.) need to prove that its books will raise IQ levels by 5 points? Now the picture is like Figure 6–9. there may not be much difficulty if the outcome is in units we can understand. that the minimum difference in IQ we would shell out for is 5 points. Adding the two equations together. suppose we decide. we are dealing with scores on a scale. For now. a lot of clinical research is horrendously expensive. By analogy. we know how wide the normal curves are—they correspond to an SD of 15 ÷ n. had no effect) or 105 (if E. There is no upper limit for d type ESs. is—a 5point gain. the area of the H1 curve to the left of this point.27 For historical reasons. or decreases in weight for people on a diet. (105 100) = zα + zβ = 3.28 SEs to the left of the alternative hypothesis mean. this. then. this begins with the clinicians guessing the amount of the minimum clinically signiﬁcant difference worth detecting. suffice it to say that ESs come in two different ﬂavors: those for tests that look at differences between groups.26 That common index is called the effect size. where the values are distributed about either 100 (if E. worry if you don’t fully understand the distinction just yet. IMPORTANT NOTE: The z-value for β is always based on a one-sided test. although in practice. it assumes that Charlie knows the answer. But very often. We can formalize this with a couple of equations: (CV 100) s/ n zα 1. Is a three-point difference between groups on a pain scale clinically important? Without any other information. puts the critical value at 1. Of course.05.05 level.0.0. we don’t know what distribution our E. Liberia.N. Returning to the example of the RLKs. or ES.N. or Myanmar. First.N. Then the critical value (CV) corresponding to this state is 1. the ﬁrst class is known as d type effect sizes. the z value on the alternative curve corresponding to the beta error. whenever possible. the z value corresponding to the alpha error.96 SEs to the right of the null mean. Suppose we decide that we will risk a beta error rate of .N.N. Either way. The challenge is to pull it all together and solve for n. we’ll spell it out when we discuss the different statistical tests. that’s the point of the experiment.24 n s/ 15 (6–20) .28 (6–19) where CV is the critical value between the H0 and H1 curves.-exposed RLKs come from. where the tail of the H1 distribution overlaps that of H0 on only one side. The reason can be seen in Figure 6–9. All will be revealed in the fullness of time. it has become de rigueur to include a sample size calculation in the grant proposal. “I dunno. it’s unusual to ﬁnd them greater than 1. 
whadda you think?” What we need is an index that can express differences between groups or relationships among variables using a common yardstick.10. We know where the mean of the null distribution is. about the time the encyclopedia salesman is shoving his foot further into the door. had an effect). Then the statistics are messed around so that this minimum clinical difference corresponds to the statistical difference at p = . at IQ 105. We know where the mean of the population of RLKs who had the dubious beneﬁt of E. about all we can say is. such as the number of deaths in the treatment and control groups. Now we have to decide how much we want to risk a Type II error. this will be called zβ.96 (6–18) 26That’s a meter stick for the 97% of the world that doesn’t live in the US. We’re not going to go into the messy details of how you calculate ESs just yet. This doesn’t contradict what we said about two-tailed tests because that applies only to the α level. To keep the cost of doing the study down. Imagine that the experiment was completed in such a way that it just achieved statistical signiﬁcance at the . but he may be as much in the dark as you are. Finally.

To save you the agony of having to work out this formula every time you want to see how many subjects you need to compare two means.” a) There are actually two things wrong with this statement. we give you some rules for how to report the results that you worked so hard to obtain. p = . The second problem is that there are precious few things with a probability of zero—reincarnation is one. σ/∆. and it tells you how big the difference is in SD units. The second is that the effect may actually have been stronger in Study B than in Study A. perhaps we should call it a foreplay. We should put this equation in big.00002357). The ﬁrst is simply trying to compare “levels” of signiﬁcance by using the relative sizes of the p values. There are even fewer than four that have a probability less than zero. The effect size is like a z-score. we then would have been able to determine exactly the probability of rejecting the null hypothesis when it was true (α. results are either signiﬁcant or they’re not.” d) “The results. REPORTING (6–23) 28Although given the topic of this chapter’s example. b) See above. bold type because it. the experiment 75 50 Score FIGURE 6–10 Showing a number of means and their 95% CIs. 25 0 A B Group C D . the Type II error). based on this study of 13 individuals.60 THE NATURE OF DATA AND STATISTICS If. are the things of which successful grant proposals are made. and there are three others mentioned in Chapter 28. are due to the people who (mis)interpret what the computer spit out. It was these values that were used in the sample size calculation. But here. we’ll give you a foretaste28 of what’s to come. c) Using the same logic. In the previous example. Obviously. and the next edition of this book will be dedicated to you. the Type I error) and the probability of rejecting the alternative hypothesis (accepting the null hypothesis) when it was true (β. then the ES is 5 ÷ 15 = . Other mistakes. which was all we could think of. we are in the position of designing a trial. What’s wrong with these statements? a) “The results were highly signiﬁcant (p < 0.0000). So the ratio in the sample size equation. the people who program them may know a lot about how to make electrons ﬂy around to give us the right answer (most of the time). First. Note that the ratio of the difference between groups to the SD is called the effect size (ES). and variations on it. For completeness. were highly signiﬁcant (p = 0. we’ll put the numbers of Figure 6–8 back in: n = [(3. use some estimate of effect size. is the inverse of the effect size.33. They is or they ain’t. too. Really. If the difference you’re looking for is 5 points and the SD is 15 points. In fact.24 × 15) ÷ 5]2 = 95 subjects (6–24) In Chapter 28. if we’re going to use the logic of null hypothesis signiﬁcance testing. we’ve given you these in Table B in the Appendix at the end of the book.058). though.” b) “The results were of borderline signiﬁcance (p = . What is the distinction between the αs and βs in this calculation and the one before? Really only one of timing. REALLY signiﬁcant. we call (105 – 100) the difference ∆. for the sake of generality.” and “really.042). p = . not p. Reserve the term “borderline” for geography and psychiatric diagnoses. and so we based our calculations on a critical value for the sample mean that corresponded to the difference required to just reject the null hypothesis. but the p level is higher because the sample size was so much smaller. 
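Equation 6–23 turns into a one-line function. The sketch below (Python; the function name is ours) plugs in the encyclopedia example, a 5-point difference with an SD of 15, two-tailed alpha of .05 and beta of .10, and arrives at the same answer of about 95 subjects.

from math import ceil
from scipy.stats import norm

def sample_size_one_sample_z(delta, sd, alpha=0.05, beta=0.10):
    """n needed for a one-sample z-test to detect a true difference delta with power 1 - beta."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-tailed alpha
    z_beta = norm.ppf(1 - beta)         # beta is always one-sided (see the note above)
    return ceil(((z_alpha + z_beta) * sd / delta) ** 2)

print(sample_size_one_sample_z(delta=5, sd=15))   # 95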
We can lay blame for many of the problems at the feet of the ubiquitous computers. If the experiment had turned out at that critical value. What we’ve done is to present N for different ratios of σ ÷ ∆. The same strategy will be used in subsequent chapters to derive sample size estimates for a variety of statistical tests. we couldn’t do this for every possible value of σ and ∆. if you can think of even one. If you want to compare results. Statisticians don’t differentiate among “signiﬁcant. but don’t know diddly-squat about logic. In this case.” They is or they ain’t. let us know.” c) “The effect was stronger in Study A (n = 273.013) than in Study B (n = 14. the algebra becomes: ∆ s/ 15 n (zα zβ) (6–21) so that n (zα ∆ zβ) s (6–22) and squaring everything up: n= (zα + zβ) s ∆ 2 was ﬁnished and did not show a difference. there are two things wrong with this statement.” “highly signiﬁcant.

In a small. d. Often included in meta-analyses are graphs called forest plots (Altman et al. The report ended with the statement that the statistical test was signiﬁcant (p < . If the drug were effective. the size of the squares reﬂects the sample size of the study. double-blind trial of a new treatment in patients with acute myocardial infarction. A word of caution. thank you. the bigger the box. we’ll show you how to do “eyeball” tests of signiﬁcance by using graphs like these. The chance that an individual patient will fail to beneﬁt is less than . two decimal places will do nicely. If you have fewer than 100 subjects. and other areas is termed “metaanalysis”: the combining of the results of many studies to arrive at an overall conclusion. in the last line. Don’t worry about the unit of measurement on the X axis. which showed the population mean. 1 2 3 4 5 6 7 Pooled GRAPHING WITH CONFIDENCE One way to enhance the visual presentation of data is to combine what we learned in Chapter 2 on graphing with what we just learned about CIs.05). the probability of the reported ﬁnding or one more extreme is less than . d. others show 1 SE (which is the 68% CI). EXERCISES 1. but the difference was not signiﬁcant. In light of this information we may conclude: a. of itself. If the drug were ineffective. We saw a simple example of this in Figure 6–8. reveal anything about the importance of the observed difference.2 . c. usually about the efficacy of some intervention. as in Figure 6–11. the results com- the authors didn’t indicate which one was used. c.5 In Favour of Drug A 1 2 5 10 Log OR In Favour of Drug 8 FIGURE 6–11 A forest plot. you don’t know how else they may have screwed up. some people plot the SD.05. and still others the 95% CI. social work. for reporting both the p level and the estimates of parameters. The z-test. 29If SUMMARY You can use a z-test to determine the statistical signiﬁcance of the difference between a sample and a population with known mean and SD. noted that the new drug gave a higher proportion of successes than the placebo. each with its CI. There’s no convention about drawing error bars in graphs.05. A report of a clinical trial of a new anticocaine drug. if you’re reading the graph. The treatment is useless. be sure to check which one the authors used. we’ll get into that later. There is no point in continuing to develop the treatment. the more subjects. Snortstop. but only if your sample size is larger than 1. as in Figure 6–10. The power of the test exceeds 0. the probability of the reported ﬁnding or one more extreme is less than 1 in 20. We can extend this by plotting a number of means. randomized. psychology. bined (pooled is the jargon term) over all of the studies. e. be sure to indicate which one you’re using. The notion of statistical signiﬁcance is embodied in this probability. Making the graph even more informative. You can go up to three decimal places. We should keep adding cases to the trial until the Normal test for comparison of two proportions is signiﬁcant. Fewer than 1 patient in 20 will fail to beneﬁt from the drug. We should carry out a new trial of much greater size.000 or so. though.29 . These show the results of each study and its CI. We can conclude that: a.ELEMENTS OF STATISTICAL INFERENCE 61 Study d) Do you really have the sample size to justify this much “accuracy”? We didn’t think so either. like all statistical tests. versus a placebo. . b. The reduction in mortality is so great that we should introduce the treatment immediately. 
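If you want to draw a figure in the style of Figure 6–10 yourself, the matplotlib sketch below will do it; the group means and interval half-widths are invented numbers, there purely to show the mechanics. Whatever error bars you plot, say in the caption whether they are SDs, SEs, or 95% CIs.

import matplotlib.pyplot as plt

means = [52, 61, 48, 66]      # illustrative group means
half_ci95 = [6, 5, 8, 4]      # illustrative half-widths of the 95% CIs
x = range(len(means))

plt.errorbar(x, means, yerr=half_ci95, fmt="o", capsize=4)
plt.xticks(x, ["A", "B", "C", "D"])
plt.xlabel("Group")
plt.ylabel("Score")
plt.show()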
But statistical signiﬁcance does not. If you’re drawing the ﬁgure. Once we’ve introduced you to the niceties of signiﬁcance testing. then throw the paper away.95. b.1 . and usually. 2000). Forest Plots The new growth industry in medicine. 2. e. plus a sample mean and its 95% CI. relates the magnitude of an observed difference to the probability that such a difference might occur by chance alone. the mortality in the treated group was half that in the control group.

What effect will this have on: a. the CPQ (Couch Potato Questionnaire) with 5 subscales (Emotional Function. Suppose the true beneﬁt was 10 cm. Social Function. What effect will this have on the power to detect a true difference between the two groups on the Eating Attitudes subscale? a.05 as a critical value)? . Self-Esteem.62 THE NATURE OF DATA AND STATISTICS 3.0 (3. Michael Jackson. c. researchers used a quality-of-life instrument. the investigators used a Bonferroni correction.0) seconds for the placebo group and 16. and Michael Jordan. depression. spin around faster. Answer the following questions regarding the expected results of the second study: Stay the same Can’t tell from the data Larger Smaller SD SE of mean Statistical test p-value _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ _____ 4.” Both studies used the same populations and experimental design. 6. Because of concern about the use of multiple tests. In a two-group design comparing the effects of diet restriction and exercise on quality of life of obese patients. How large a sample would you need to have a 90% power of detecting this difference (using alpha = . Testing the ﬁrst part only. Consider two randomized trials of the effect of anabolic steroids on commuters’ times in the “100 meter train dash. Insufficient data to tell.01 instead of the usual . and time spent in closets. whereas the second used 100 per group. Power. What is the probability that this difference could have occurred by chance? b. For the ﬁrst study. What is the power of the study to detect this difference? c. Because of concerns about using multiple t-tests. Population data gathered by university phys-ed coaches across the country show a normal jump height of 50 cm.0) seconds for the group that received anabolic steroids. the alpha level (probability of declaring a difference under the null hypothesis) was set at . The Type I error rate. You have just completed a study of a patent medicine for basketball players. a. Degrees of freedom. the means (SDs) of the two groups were 12. the biggest concern of every nubile adolescent in the 1990s is “Quality of Life. designed to make them jump higher. mirror avoidance. Eating Attitudes). It’s called MJ3 Elixir and is endorsed by Magic Johnson.” So the local teener’s health office developed a questionnaire to assess satisfaction with social interactions. c. Second only to terminal zits. Stay the same d. so only p levels less than . 5. SD 15 cm. Physical Function. The Type II error rate.05. b. Increase power. b.01 were considered signiﬁcant. and fool the opposition by looking like they’re going backwards and forwards at the same time. α was divided by 5. self-esteem. d. The only difference is that the ﬁrst study used a total of 10 office workers per group. Decrease power.0 (3. you ﬁnd that a sample of 16 collegiate players fed the elixir for 2 weeks can jump an average height of 56 cm.

P. even mentally. so why start now? The third reason. to justify their request for a large pay increase. this means that we can’t calculate ratios. The problem is the missing zero. detector I–3). though. Also.SECTION THE FIRST C. 63 . Hospital administrators used a graph like the one shown in Figure I–1. they never needed any justiﬁcation in the past to award themselves increases.1 Can they use this to justify a 500% increase in their salary? 38 1We won’t ask the unworthy question of what they were doing prior to 2000. unless there are compelling reasons why it should not (see C. disinterested scientists. but at some arbitrary point (in this case. DETECTORS No.A.A.A. which shows the number of hours worked each week between 1990 and 2005.R. 36 34 32 30 1990 1995 Year 2000 2005 FIGURE I–1 Number of hours worked per week between 1990 and 2005 by administrators.R. they already get paid too much.P. 30 hours per week). The GDI for this graph is about 400. this graph distorts the data. They argued that this graph showed their workload jumped about 500% between 2000 and 2005. First. is that from our perspective as unbiased. and the GDI should be around zero. for three reasons.P. DETECTOR I–1 The Y-axis should start at 0. from the graph. The Y-axis does not start at zero. so that increases look magniﬁed. as presented by them. C. Number of hours per week I–1.R. this is equivalent to taking ratio data and making it into interval data. Second.


Note that we statisticians. begin at zero. the zero should be missing because the bottom 80% of the graph is blank. FIGURE I–4 How the administrators presented statisticians’ hours.” When starting at zero would result in the bottom 75%.P. the contrite administrators2 show up at the next board meeting with a graph showing the hours of work per week for the epidemiologists and statisticians. DETECTORS 65 Number of hours per week I–3. thus all the action takes place in the upper 20%. when is not starting at zero a cardinal sin. DETECTOR I–3 The Y-axis should not start at 0 if it means that most of the graph is blank or if it visually distorts the data. Have we been falsely maligned? Of course! The problem here is the converse of the missing zero. “It all depends. as in Figure I–4. C. This lets the reader know that the graph shouldn’t be read as reﬂecting ratio data.P. A better way of presenting these data would be as in Figure I–5. So.C. or so of the graph being blank.R. 100 80 60 40 20 0 1990 1995 Year 2000 2005 2Yet another oxymoron. and when is including it an offense? The clear.A. unambiguous answer is. The effect of this is to squeeze any changes into a very small range. otherwise. showed that the Y-axis did not start at zero by breaking the axis and putting in those two short lines. making it look as if nothing is happening. They maintain that Figure I–4. and so should not get any increase at all. pure of heart. Claiming that they have repented.R. which starts at zero and is arranged horizontally. 100 Number of hours per week 80 60 0 1990 1995 Year 2000 2005 Figure I–5 How the data should have been presented. . shows that these people have barely increased what they do since 1990. it’s best to start somewhere else.A.
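In most plotting packages the distorted version is the default you have to override, so it is worth seeing how little code the honest version takes. A matplotlib sketch (the hours-worked values are invented stand-ins, not the administrators' actual numbers) in the spirit of Figures I–3 and I–5: anchor the Y-axis at zero, or, if you must truncate it, make the break obvious to the reader.

import matplotlib.pyplot as plt

years = [1990, 1995, 2000, 2005]
hours = [36.5, 37.0, 37.5, 38.5]   # invented values, for illustration only

plt.plot(years, hours, marker="o")
plt.ylim(0, 40)                    # start the Y-axis at zero (C.R.A.P. Detector I-1)
plt.xlabel("Year")
plt.ylabel("Number of hours per week")
plt.show()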

Now the administrators were faced with a problem within their own ranks. The female VPs said they were being paid less than the males,3 and presented the data in Table I–1. The CEO said that she (the one earning $692,000) disagreed; her calculations showed that the mean of both groups was exactly the same, $462,000. Do the women have a case? Yes they do. Notice that the data for females are highly skewed by the one very high number. Under these conditions, it would be better to use the median (refer to Chapter 3). This would show that the median salary for men is $460,000, and for women is $433,000, probably a more accurate representation of the bulk of the data.

TABLE I–1 Salaries (in thousands) of nine male and nine female administrators
Males:   460  450  490  460  475  440  458  450  475
Females: 437  433  433  435  692  425  438  433  433

3Why is it that the U.S., with a population of over 350,000,000 people, needs only one VP, but hospitals can’t seem to exist with fewer than one VP for every 7 employees?

C.R.A.P. DETECTOR I–4
If the data have a few outliers, or are seriously skewed, the median should be used as the estimate of central tendency rather than the mean.
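If you want to check the arithmetic yourself, a few lines of Python (standard library only) will do it, assuming the two columns of Table I–1 are keyed in as shown above.

```python
import statistics as st

males   = [460, 450, 490, 460, 475, 440, 458, 450, 475]   # salaries in thousands
females = [437, 433, 433, 435, 692, 425, 438, 433, 433]

print(st.mean(males), st.mean(females))       # both about 462 -- the CEO's argument
print(st.median(males), st.median(females))   # 460 versus 433 -- the women's argument
```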

To show you that we don’t really have to make up examples of misleading graphs, we’ll use some real data (reported in Sinha, 2003). Figure I–6 shows improvement in symptoms of irritable bowel syndrome as reported by the drug manufacturer. Looks good, no? But there are some major problems, starting with a missing zero point, but mainly involving a relative measure of improvement rather than a measure of absolute change. Figure I–7 corrects these errors. Pretty useless drug, right?

FIGURE I–6 Improvement as reported by a drug company. (Y-axis: Change from Baseline, percent; X-axis: Time of Treatment, month; curves for Drug and Placebo.)

FIGURE I–7 What actually happened. (Y-axis: Mean Pain and Discomfort Score; X-axis: Time of Treatment, month; curves for Drug and Placebo.)

C.R.A.P. DETECTOR I–5
Graphing relative differences may hide the real picture by exaggerating small ones.


SECTION THE SECOND

ANALYSIS OF VARIANCE

to try to impress with obscure words in long. normal or abnormal x-ray. and the like. better. AN OVERVIEW As we indicated in Chapter 6. assistant profs. and practice and rehearse them until they roll off their lips as if Mommy had put them there. or Ratio. 2The erhaps the most common comparison in all of statistics is between two groups—cases vs. you end up looking at some variable that was measured in those lucky folks who beneﬁted from your treatment and also in those who missed out. First. and so on. To show how this applies to the types of analyses discussed in this section. you have devised an orientation course where they learn how to use big words when little ones would do. you decide to do some research on it. meandering sentences. In the discussion below. you can tell a prof in a dark room just by the sound of his voice. pick the three longest words. when you run an experiment in biomedicine. the course planners design a randomized . Imagine if you will a course in Academish IA7 for young. We do not consider this categorical type of measurement in this section. Instead. contractually limited. usually draft dodgers with a remnant of the ﬂower child ethos. or a quality-of-life index. Profs pride themselves on their shabbiness. tenureless. then measure all the obscure words they mutter.CHAPTER THE SEVENTH The t-test is used for comparing the means of two groups and is based on the ratio of the difference between groups to the standard error of the difference. We tend. to help yourself survive in academia. we examine Interval. Reasons why this comparison is ubiquitous are numerous. But even without the dress code. The more informal among us. And. As a result. they are required to open a dictionary to a random page. consider the following example. as a group.1 Note that we have implied that we measure something about each hapless subject. placebos. we demand that you measure something more precise. girls. you usually do something to some poor souls and leave some others alone so that you can ﬁgure out what effect your ministrations may have had. long. Comparing Two Groups The t-Test SETTING THE SCENE To help young profs succeed in academia. So. drugs vs. SDs. which literally means a foot and a half. there is the matter of the dress code. tramp around in old denim stretched taut over ever-expanding derrieres.2 It’s such a common affliction that one might be led to believe that we take a course in the subject. boys vs. all of statistics comes down to a signal-to-noise ratio. and foreigners on the campus might do well to acquire a Berlitz English-Academish dictionary. Old tweed jackets that the rest of the world gardens in are paraded regularly in front of lecture theaters. you randomize half your willing profs to take the course and half to do without. how can you determine how much of the variation in the scores arose from differences between groups and how much came from variation within groups? P 1Or maybe the lucky folks who missed out. same. in contrast to doing an experiment in Grade 7 biology. a blood pressure. or worse. There are many variations on this theme: diseased or healthy. Of course. A moment’s reﬂection on the academic game reveals certain distinct features of universities that set them apart from the rest of the world. First. variables. controls. obscure word for that is sesquipedalianism. As one exercise. so that we can consider means. How can you use these data to tell if the course worked? In short. and the poor souls who “beneﬁted” from your treatment. 
not wanting to pass up on a potential publication. be it a lab test. Perhaps the most common form of measurement is the FBI criterion— dead or alive.

There is some speculation that he did most of his work during the afternoon breaks at the brewery. and then take the square root of the answer to get back to our original units of measurement. in some way. So. a ﬁrst guess at the error of the difference would be: Standard error difference SEd s1 n1 s2 n2 (7–2) 3Actually. was furnished to us (remember. so “tea” or “t” it became.0 22 25 23 29 30 28 30 33 21 29 270 27. related to the original distributions. the graduands (n1 = 10) used a mean of 35 obscure words.COMPARING TWO GROUPS 71 trial.. So.e. all Guinness employees were forbidden to publish. as we demonstrated in Chapter 6. Now the challenge is to create some method to calculate a number corresponding to the signal—the difference between those who did and did not have the course.0 EQUAL SAMPLE SIZES To illustrate the t-test. Because we are looking at a difference between two means. the sign is meaningless. that the course made no difference). Too bad Guinness doesn’t run universities. graduate students are required to attend a lecture from one of the graduands and some other prof from the control group and count all the words that could not be understood. Why it is called Student’s is actually well known. More formally: Numerator = |X1 X2| (7–1) the population. would be taken seriously by British academics. and to the noise— the variability in scores among individuals within each group. we had a survey of 3.” It is less clear why it is called the “t”-test. After the data are analyzed. It is apparent that some overlap occurs between the two distributions.3 Participants Controls TABLE 7–1 Number of incomprehensible words in treatment and control groups Sum Mean Grand mean 35 31 29 28 39 41 37 39 38 33 350 35. Perhaps because he recognized that no Irishman. who worked at the Guinness brewery in Dublin around the turn of the century. one strategy would be to simply assume that the error of the difference is the sum of the error of the two estimated means. he wrote under the pseudonym “Student. Student’s Stout test probably didn’t have the same ring about it. s ÷ n. what we end up with is: s2 1 n1 s2 2 n2 (7–3) SEd Because the sample sizes are equal (i. the sum doesn’t come out right. we are presuming that this difference arises from a distribution of differences with a mean of zero and a standard deviation (s) that is. The error in each mean is the standard error (SE). The simplest method to make this comparison is called Student’s t-test. But. let alone one who worked in a brewery. From the table. and more important. the profs who made it through Academish IA7 had a mean of 35 incomprehensible words per lecture. n1 = n2).. whereas in the case of the z-test. difference is that for the t-test. σ. The ﬁrst is that.e. where the vertical lines mean that we’re interested in the absolute value of the difference. so we were given the mean and SD of the population). This is not the case here. There are two differences between the t-test and the z-test. whereas the z-test is used mainly with one group. It was invented by a statistician named William Gossett. so the next challenge is to determine the SD of this distribution of differences between the means: the amount of variability in this estimate that we would expect by chance alone. the standard deviation is unknown. but for various arcane reasons we won’t bother to go into. the control group only 27. let’s continue to work through the example. One obvious measure of the signal is simply the difference between the groups or (35 – 27) = 8. 
we can’t add SDs. Did the course succeed? The data are tabulated in Table 7–1. its primary purpose is to compare two groups. to see if its mean is different from some arbitrary or population value. this equation simpliﬁes a bit further to: s2 1 n (7–4) s2 2 SEd . although a sizeable difference also exists between them. Under the null hypothesis (i.0 31. we can add variances. although the t-test can be used with only one group. A comparable group (n2 = 10) who didn’t take the course used a mean 27 such words in their lectures. discussed in Chapter 6. because it’s totally arbitrary which group we call 1 and which is 2.0. the SD of This is almost right.000 or so penile lengths. The second.
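To see why we add variances rather than SDs (or SEs), here is a quick check with the Table 7–1 numbers: the “sum of the SEs” guess of Equation 7–2 and the add-the-variances version of Equation 7–3 give noticeably different answers. This is just our own sanity check in Python, not anything you need to run.

```python
import numpy as np

participants = np.array([35, 31, 29, 28, 39, 41, 37, 39, 38, 33])  # took the course
controls     = np.array([22, 25, 23, 29, 30, 28, 30, 33, 21, 29])  # did not

s1, s2 = participants.std(ddof=1), controls.std(ddof=1)
n1, n2 = len(participants), len(controls)

naive_guess = s1 / np.sqrt(n1) + s2 / np.sqrt(n2)      # Equation 7-2: adding the SEs
se_diff     = np.sqrt(s1**2 / n1 + s2**2 / n2)         # Equation 7-3: adding the variances

print(naive_guess, se_diff)    # about 2.70 versus about 1.91
```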

Equations 3–5 (for the variance) and 3–6 (for the SD) are used to calculate these values for the population. if we know the value of B. as well as for every α level.67 + 16. If we know the value of A (say. the total df for a t-test is [(n1 – 1) + (n2 – 1)]. Why the difference? The answer is based on three factors. the sample data will deviate less from the sample mean than they will from the population mean. then B is ﬁxed. whereas the equivalent formulae with N – 1 in the denominator are used for samples (the usual situation). then. Finally. The t-test is then obtained by simply taking the signal-to-noise ratio: t (X1 2 s1 X2) = 8 = 4. . The principal problem is that. p < .05. which is 2. as shown in Figure 7–1. any set of numbers will deviate less from their own mean than from any other number.67 9 (22 27)2 (25 27)2 .96. If the difference is big enough (i.0)] ÷ 10 = 1. However. t converges with z—they are both equal to 1. we have introduced a dependency on the degrees of freedom. t is larger for small samples. for large samples. A and B. as we could expect if it behaved like a z-test. there is a different t value for every degree of freedom.e. or (n1 + n2 – 2). Similarly.10. then df = n – 1. s2 2 = 144 = 16. For example. the purpose of inferential statistics is to estimate the value of the parameter in the population. rather than < .05 or < . we can calculate the variances of the two groups separately. Degrees of freedom can be thought of as the number of unique pieces of information in a set of data.72 ANALYSIS OF VARIANCE In the present example. So. which is equal to 4. If the computer prints out the exact value of p. We can now look up the critical value of t for our situation (18 df) at the 0. small as it may be. and n2 of them in group 2. Dividing the squared deviations by N – 1 rather than by N compensates for this. so that the estimate of the variance and SD will be biased downward. if we have two numbers. As it turns out.05 level.05. 1.915. when we introduced the concept of variance and the SD. The problem is that.178 1.915 FIGURE 7–1 Testing if the mean difference is greater than zero.96 when α = 0. First. with an SE of 1. but haven’t yet told you what it is or how we ﬁgure it out. based on the data we have from a sample. in this case) will differ from the population value to some degree. –8 0 H0: µ1 = µ2 +8. unlike the situation with the z-test. we have two sets of numbers: n1 of them in group 1. sufficiently different from zero). it has to be (30 – 25) = 5. Putting this all together.915. then t = 1. Instead of ﬁnding that. because we have estimated both the means and the SDs. rather than just N. in particular. all of which we’ve already discussed. .0 9 (7–5) Then the denominator of the test is equal to [(20. So. if there are n numbers that are added up. In a t-test. given that we know the sum. . degrees of freedom (df). 10 1 (29 27)2 that now.05. and leads to an unbiased estimate of the population parameter.70. is wildly signiﬁcant. + (33 10 1 35)2 = 186 = 20. . we’d write t(18) = 4. we ﬁnd The Formulae for the SD If you were paying attention. then A has to be (30 – B). We can see what is happening by putting the whole thing on a graph.178. When we are presenting the results in a paper. in general terms. then only one of the numbers is unique and free to vary. use that. we expect that the sample parameter (the mean.96 to 12. so we require a larger relative difference to achieve signiﬁcance.0 .01. t can range anywhere from 1. if α = . the denominator is N – 1. 
and these are equal to: s2 = 1 (35 35)2 + (31 35)2 + . and we know that they add up to 30. you would have noticed that the equation we just used to calculate the variance (Equation 7–5) differs slightly from the equation we used in Chapter 3. Degrees of Freedom We talked about that magical quantity. The distribution of differences is centered on zero.178.915 2 s2 n (7–6) We can then look this up in Table C in the Appendix and ﬁnd a whole slew of numbers we don’t know how to handle. So our calculated value of t. 25). then we can see that it will achieve signiﬁcance. Second.. The probability of observing a sample difference large enough is the area in the right and left tails. such as the scores of n people in a group.
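For the incurably suspicious, a short script that grinds through Equations 7–5 and 7–6 for the data in Table 7–1 and then checks the answer against SciPy’s built-in t-test. The variable names are ours; nothing here is official notation.

```python
import numpy as np
from scipy import stats

participants = np.array([35, 31, 29, 28, 39, 41, 37, 39, 38, 33])
controls     = np.array([22, 25, 23, 29, 30, 28, 30, 33, 21, 29])

s1_sq = participants.var(ddof=1)       # 186/9 = 20.67  (Equation 7-5)
s2_sq = controls.var(ddof=1)           # 144/9 = 16.0
se_d  = np.sqrt(s1_sq / 10 + s2_sq / 10)                # 1.915

t_val = (participants.mean() - controls.mean()) / se_d  # about 4.18  (Equation 7-6)
df    = 10 + 10 - 2                                     # 18 degrees of freedom
p_two_tailed = 2 * stats.t.sf(abs(t_val), df)

print(t_val, df, p_two_tailed)                   # significant at any alpha you'd care to use
print(stats.ttest_ind(participants, controls))   # the built-in test agrees
```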

we take them both and create a (1 ÷ n1 + 1 ÷ n2) term. here goes.) n1s2 1 n1 n2s2 2 n2 (7–7) This is close. if n1 was 4 and n2 was 20. then it makes good sense to pool everything together to get the best estimate of the SD. when we used the two sample SDs to calculate the SE of the difference. honorable gentlemen that we are. Although this looks formidable. So. some computer packages proceed to calculate a new t-test that doesn’t weight the two estimates together. it might not work out this way. Because the samples are now receiving equal weight in terms of contributing to the overall SE. What we do is plot the means of the groups with their respective 95% CIs. like this: σ2(est. And the more general form of the t-test is: THE CALIBRATED EYEBALL We promised you in the previous chapter that we would show you how to do statistical “tests” with your eyeball. Instead of forcing a choice. Suffice it to say that if we took a very large number of independent samples from the population (about a billion or so). but rather σ2 times (n – 1) / n. which is closer to 4 than to 20. we might reasonably presume that the SD from the larger group is a better estimate of the population value. that’s done in the left side (Part A) of Figure 7–2 for the data in Table 7–1. we must again delve into the philosophy of statistics. and comes about as: nh 2 1 n1 1 n2 (7–12) In short. The reason is not all that obscure. but by now you have probably gotten into the habit of subtracting 1 every time you see an n. leaving you free. So. there is no single n. Remember that the CI is: (X1 X2) t = _____________________________________ (n1 1) s2 1 n1 (n2 1) s2 2 n2 2 1 n1 1 n2 (7–10) . the harmonic mean would be 2 ÷ (1⁄4 + 1⁄20) = 6. To understand why. the ﬁnal denominator looks like: Denominator = (n1 1) s2 1 n1 (n2 1) s2 2 n2 2 1 n1 1 n2 (7–11) This looks very much like our original form and has the advantage of simplicity. The denominator now looks like: SE s2 1 n1 s2 2 n2 (7–11) TWO GROUPS AND UNEQUAL SAMPLE SIZES: EXTENDED t-TEST If there are unequal sample sizes in the two groups. in combining the two values. rather than a different number. That’s why there’s a 1 in the denominator. At this point. Now. it makes sense that the df should reﬂect the relatively excess contribution of the smaller sample. We’re not going to go into the messy details.) (n1 1) s2 1 n1 (n2 1) s2 2 n2 2 (7–8) This is the best guess at the SD of the difference. we’ll end up with the unbiased estimate that we’re looking for. if we multiply each of the obtained values by n / (n – 1). we were actually implying that each was an equally good estimate of the population SD. as we have talked about it so far. this approach is called a pooled estimate. So the cost of the separate variance test is that the df are much lower. The trade-off is that the df are calculated differently and turn out to be much closer to the smaller sample of the two. If this is so. is that the two samples are drawn from the same population. But we actually want the SE. Pooled versus Separate Variance Estimates The whole idea of the t-test. and the relevant df is now (n1 + n2 – 2).67. That’s why we did it. This is not the place to stop. so: σ2(est. if the two samples are different sizes. we’d ﬁnd that the mean of those variances isn’t σ2 the population variance. the only conceptual change involves weighting each SD by the relevant sample size.COMPARING TWO GROUPS 73 The question is why we divide by N – 1 as a fudge factor rather than. there are two n terms. 
the formula becomes a little more complex. Thus it would be appropriate. However. the arithmetic mean would be 12. From here we proceed as before by looking up a table in the Appendix. It could be that the two SDs are wildly different. σ. the redeeming feature is that computer programs are around to deal with all these pesky speciﬁcs. one might rightly pause to question the whole basis of the analysis. which introduces yet another 1 ÷ n term. If you are desperate and decide to plow ahead. to weight the sum by the sample sizes. so. and hence have the same mean and SD. In particular. And of course. N – 2 or some other value. In this case. and it is appropriately a little harder to get statistical signiﬁcance. say. This strategy is called the harmonic mean (abbreviated as nh).
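In practice the computer handles the pooled-versus-separate decision; in SciPy it is a single argument. The two small samples below are invented solely to show the mechanics (note how the separate-variance version changes the df), and the last lines check the harmonic-mean arithmetic from Equation 7–12.

```python
from scipy import stats

# two invented samples of unequal size, purely to show the mechanics
small_group = [35, 31, 29, 28]                                           # n1 = 4
large_group = [22, 25, 23, 29, 30, 28, 30, 33, 21, 29,
               27, 26, 24, 31, 28, 30, 22, 25, 29, 26]                   # n2 = 20

print(stats.ttest_ind(small_group, large_group, equal_var=True))    # pooled-variance test
print(stats.ttest_ind(small_group, large_group, equal_var=False))   # separate-variance (Welch) test

# harmonic mean of the sample sizes (Equation 7-12): closer to the smaller n
n1, n2 = len(small_group), len(large_group)
nh = 2 / (1 / n1 + 1 / n2)
print(nh)    # 6.67, not the arithmetic mean of 12
```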

which increases the accuracy of the estimate. N is 10. There are two things to bear in mind. What we see in Figure 7–2 is that the conﬁdence intervals don’t overlap. 2005). (2) If the amount of overlap is less than half of the CI. we still rely on Cohen’s (1988) criteria: an ES of 0. Equation 7–5 tells us that s2 is 20. we’ll later encounter other indices of ESs that use different standards for small. Meyer et al. From this. before and after treatment. (1) If the error bars don’t overlap. the difference is not statistically signiﬁcant. but most people seem to use d.05 (rather than ≤ . both of these indices are referred to as the standardized mean difference. you shouldn’t use this. because that’s what the formal t-test told us. even Cohen himself said that they were just conventions. et al.14. p is ≤ . and that any overlap means nonsigniﬁcance.05.01). We can easily ﬁgure out d using the formula: d=t× 1 1 n1 + n2 (7–17) . and the ES is also referred to as Cohen’s d (Rosenthal.001. On the other hand. so the pooled estimate may be biased. so t with df = 9 is 2. and the reduction of heart attacks from the use of ASA were less than 0.25) (7–14) meaning that the ES is expressed in SD units. If there are more than two.0 ± 2. (3) If the overlap is more than about 50% of the CI. Using the data in Table 7–1.26.01. Second. and more importantly. First.55 = 35.0 = 1. Particip ants Contro ls Particip ants ls Contro showing that the groups differ by 1. the signiﬁcance level is ≤ . so do a majority of established researchers (Belia. moderate.. We conclude that the groups are signiﬁcantly different from each other. though. uses the SD only from the control group.4 Note that this works only for two independent groups. (2001) calculated ESs from over 800 studies.74 CI = X ± tα/2 × SE = X ± tα/2 × SD √N ANALYSIS OF VARIANCE EFFECT SIZE (7–13) The effect size for the difference between two means is: Effect Size for Two Means ES = X1 – X2 SD (7–15) We’re using the t distribution rather than the normal curve because the sample size is relatively small. no reputable physician would advise her patients to replace the ASA they’re taking with cigarettes on the basis that both ESs are vanishingly small and therefore unimportant. the error bars in the right side (Part B) of the ﬁgure show considerable overlap. 29. don’t be ashamed.0 – 27. in this case.37 SDs.85 (7–16) and (24. so the SD is √20.0 ± 3.25 √10 = (31.5 is moderate (of moderate practical or clinical importance). we come across naked values of t without any ES associated with it. if the ES is 1. these criteria apply only to d or ∆.67.0. Plugging those numbers into the formula gives us: CI = 35.g.75.37 5. which is reassuring. and 0. 1976). and should not be applied blindly but only in relation to what’s known in a speciﬁc ﬁeld and the importance of the ﬁnding. Glass’s ∆ (Glass. we used the pooled SDs from both groups.8 is large (of crucial practical or clinical importance). In this example. then the groups are signiﬁcant at p ≤ . it means that the difference between the groups is one standard deviation. 38. we would get: ES = 35. But. and large. The advantage of d is that it uses data from both groups. there is no consensus which is better. or if the data consist of two measurements on the same individuals (e. That is. For the Participants. That’s the reason that another estimate of the ES. Often in reading articles. those reporting the relationship between smoking and lung cancer. as we’ll discuss in a later chapter). As yet. 
The disadvantage is that the intervention may change not only the mean but also the SD.67 = 4. they don’t even come close. we can follow Cumming and Finch (2005) and establish a few rules: 4If you thought that if the bars don’t overlap. 45 Number of Words 40 35 30 25 20 A B FIGURE 7–2 Means and 95% CIs when two groups do (left side) and do not (right side) signiﬁcantly differ. 1994). For obvious reasons. one of 0. Whichever measure is used.86) for the Controls.55.26 × 4. so the difference is not statistically signiﬁcant.2 is considered to be small (of negligible practical or clinical importance).
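A few lines that reproduce the confidence intervals plotted in Figure 7–2 and an effect size for the same data; we use the SD pooled over both groups for d, as in the text. This is a sketch of the arithmetic, not a recipe you are obliged to follow.

```python
import numpy as np
from scipy import stats

participants = np.array([35, 31, 29, 28, 39, 41, 37, 39, 38, 33])
controls     = np.array([22, 25, 23, 29, 30, 28, 30, 33, 21, 29])

# 95% CI around each mean (Equation 7-13), using t with n - 1 = 9 df
t_crit = stats.t.ppf(0.975, df=9)                       # 2.26
for name, g in (("Participants", participants), ("Controls", controls)):
    half_width = t_crit * g.std(ddof=1) / np.sqrt(len(g))
    print(name, g.mean(), g.mean() - half_width, g.mean() + half_width)

# effect size (Equation 7-15), here using the SD pooled over both groups
pooled_sd = np.sqrt((participants.var(ddof=1) + controls.var(ddof=1)) / 2)
d = (participants.mean() - controls.mean()) / pooled_sd
print(d)     # about 1.9 SD units -- a very large effect by Cohen's criteria
```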

In this case. However.6.COMPARING TWO GROUPS 75 SAMPLE SIZE AND POWER Sample size estimates for the t-test closely follow the formulism developed in Chapter 6.94 (say 67) per group (7–19) is. One last word about power. for the sake of argument. you wouldn’t need to do the study. so zα = 1. Usually. you’ll ﬁnd 63 subjects per group. let’s assume that the mean extent is 42% and the SD is 15%. Use the article to ﬁnd out the sample size (if the two groups are different. Well. such as the z-test and the analyses of variance (which will be discussed in later chapters). So the new formula for the sample size requirements for a twogroup comparison looks like: n 2 (z z ) 2 (7–18) For example. and the SD (σ).28. but it’s simply another way of expressing the ES that’s in Equation 7–18. which introduces a dependency on sample size. power is . and one for d = . Because there are two groups.10.4 ÷ 15 is about . This is too low for our blood (we usually want power to be at least . and β = . 8. If it’s too small. halve the treatment effect. make them up. you can never diddle with the α level (unless you try one-tailed tests. but this should be used only as a last resort when all else fails). Table D in the Appendix gives you the sample sizes you need. diddle away. you can pick out β levels of . which is the ratio of δ ÷ σ. If it were. if the previous study was done with only 30 subjects per group.20. 6That SUMMARY The t-test is the easiest approach to the comparison of two means. looking up a two-sided α of . try to back it up with some data from the literature.05. it’s the smallest difference that you would say is clinically important.96 1. look across the row with 30 in column 1 until you get to the two-tailed α = . measured as percent of body area. for d = . the value of the test will still be correct. so zβ = 1. the difference between the means that they actually found (δ). and crank: n 2 (1. 3. though. So. The mean of the two is . put your mind at ease. For example. so we’d conclude that this study was too small and that the negative results were probably a Type II error.4 2 d. The distinction between the t-test and the z-test. The ﬁrst column is labeled .” If 67 per group is too large or too small. So.28) 15 8. double the treatment effect. There’s one column with d = . we presume it’s referred to as a “conservatively nonnormal distribution.346. which is pretty close. is that the t-test estimates both the means and the SD. If you’ve stumbled across a study that reports a nonsigniﬁcant t-test.20. What is known about the extent of psoriasis in my patient population? For the sake of argument. with an α of your choosing. but the signiﬁcance level wouldn’t be accurate.5 2. and the groups differ in sample size.80). . let’s say 20% in relative terms. for the sake of argument. the power is . though. a factor of 2 sneaks into the equation. So. one skewed to the left. are based on the assumption that the data are normally distributed.50 if you are really desperate. so for an effect size of 0. except when the ns are small.5.20 × 42 = 8.5. and you can’t justify enough funding. Table E goes the other way. so we’ll use a number half way between.497. we would proceed as follows: 1.05.4. use the harmonic mean).96. so . For d = . you can check if the groups really were equivalent or if a high probability of a Type II error existed. Despite its computational ease. How big a Type I and Type II error do you want? Unfortunately. If the sample size is more than you can manage in a year. 
Anyone who takes it more literally than that. Then. = 66. The t-test and all related statistics.648. That’s upside down from the way it appears in the formula.6.”6 the tabled signiﬁcance levels are accurate. there was only a 50% probability that the study would have found a difference if it were actually there. if it’s skewed to the right. The moral of the story is that a sample size calculation informs you about whether you need 20 or 200 people to do the study. discussed in the previous chapter. if we wanted to compare a clam juice group and a placebo group. Sawilowsky and Hillman (1992) found that even with a “radically nonnormal distribution. How big a treatment effect do I think I will get? This is never known. you wouldn’t change your practice because of it.4% in absolute terms.05. The concern is that if the data aren’t normally distributed. you can look up the power of the test. 5If these data are not available. or even . make it up. or . unless the data on which it is based are very good indeed.4. is suffering from delusions. However. note one small wrinkle. let’s say α = . For the sake of the granting agency. Now we put it all in the old sausage machine. To save you the agony of having to buy batteries for your calculator. and our dependent variable was the misery of psoriasis.05 and β of .10. the t-test is not appropriate when there are more than two groups or when individuals in one group are matched to individuals in another. Even if a smaller difference were statistically signiﬁcant. So.
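The sausage machine, in code form: Equation 7–18 applied to the psoriasis example, with SciPy’s normal distribution supplying the z values. The 15% SD and 8.4% difference are the assumed numbers from the worked example above, not universal constants.

```python
from scipy.stats import norm

alpha, beta = 0.05, 0.10
z_alpha = norm.ppf(1 - alpha / 2)       # 1.96, two-tailed
z_beta  = norm.ppf(1 - beta)            # 1.28

sigma = 15.0      # SD of extent of psoriasis, in percent of body area
delta = 8.4       # smallest difference worth detecting (20% of a mean of 42%)

n_per_group = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2     # Equation 7-18
print(n_per_group)     # about 67 per group
```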

difference between the means b.] • Click on the grouping variable from the list on the left. Maybe the sample wasn’t big enough (and you can get even more money to do a bigger and better study). The data look like this: How to Get the Computer to Do the Work for You • From Analyze.8 6. and click the arrow to move it into the box marked Grouping Variable • Click the Define Groups button • Enter the value which deﬁnes the ﬁrst group into Group 1 and press the <Tab> key • Do the same for Group 2 and press <Tab> Drug Subject Hairs Placebo Subject Hairs • Click • Click Continue OK 1 2 3 4 5 Mean SD 12 14 28 3 22 15.10? . the null hypothesis is that the means are equal b. SE of the difference c. the SEs of the means must be equal e. most patent hair restorers contained ethyl alcohol as the main active ingredient. t-test d. the last bastion of male vanity (and a personal issue with your intrepid authors). choose Compare Means ¨ Independent-Samples T Test • Click on the variable(s) to be analyzed from the list on the left. and have them rub the active drug or a placebo into the affected part for 6 weeks. randomize them to two groups. the sample sizes must be equal d.6.05 and β of ..76 ANALYSIS OF VARIANCE EXERCISES 1. not literally) observer counts hairs per cm2 on the dome.21 Calculate the following quantities: a. the treatment mean is 19.e. a.8)? b. A blind (technically. Till recently. and click the arrow to move it (them) into the box marked Test Variable(s) [Note: One t-test will be performed for each variable listed. But does it really work? We take 10 chrome-domes. Is this result signiﬁcant? 3. Answer True or False: When comparing the means of two samples using the t-test: a. How big a sample size would you need to detect a true difference of 50% with α of . the data must be normally distributed 2. the control mean is 9.8 8. Now. Let’s look at hair loss. Okay. and we calculate the means and SDs. the null hypothesis is that the means are not signiﬁcantly different c.59 5 7 8 9 10 5 10 20 2 12 9. a legitimate drug has changed all that. presumably to ease the anguish. so you tried and failed to grow hair. How much power did you have to detect a difference of 100% (i.
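For readers without SPSS handy, here is a rough Python equivalent of those point-and-click steps, applied to the hair-count data from the exercise. It is offered only as an illustration of the mechanics, not as the answer key.

```python
from scipy import stats

drug    = [12, 14, 28, 3, 22]    # hairs per cm2 after 6 weeks, drug group
placebo = [5, 10, 20, 2, 12]     # hairs per cm2 after 6 weeks, placebo group

result = stats.ttest_ind(drug, placebo)    # pooled-variance t-test, df = 8
print(result.statistic, result.pvalue)
```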

What you really want to do is select a few brands and determine if any difference overall exists among the group means. Why go further? Well. S with T. R with U. If there isn’t. Formalizing it a bit. It would be far easier to do the study with a number of different brands all at once. and T with U. which come in all the colors. approaches . etc. a problem would arise. planned. what of the hypothesis? Going in armed with the knowledge that the condoms all likely came off the same production line. we discovered a neat way to compare two means. If there is. R with T. so the overall chance of a signiﬁcant result. there are dozens of brands. with 10 subjects each. We can do only two at a time.CHAPTER THE EIGHTH One-way ANOVA deals with statistical tests on more than two groups. the t-test.2 who probably took a cue from the beer companies in ﬁnding the advantages of producing multiple brands from the same vat. as a conscientious researcher. then provide a rating on a 10-point scale. and unknown house brand (U). if we were to set about comparing the means5 with a t-test. More than Two Groups One-Way ANOVA SETTING THE SCENE To further the goal of “Safe Sex for Sinners. Sheiks (S). his kids and grandkids.—think of all the extra effort our subjects would have to put in and all the extra pleasure they would have to put up with. one study for Brand A versus Brand B. as it were.05 chance of being signiﬁcant by chance. The remainder are made by Ortho. Interestingly. we might really be interested in whether any difference is discernible among the brands. randomize them to various brands (all delivered for experimental purposes in plain brown wrappers).30. now every Grade 5 student knows all the arcane details. so we view this as an opportunity to bring the adults up to speed. 3What 77 . You are rapidly discouraged by the challenge.6 In any case. even when no difference exists.” 2Actually Now. get a bunch of willing volunteers (which shouldn’t be too difficult). the null hypothesis is: H0: µR = µS = µT = µU (8–1) 1When and our alternative hypothesis is simply: H1: Not all the µs are equal. almost all are made by Julius Schmid. all promising to lift you to new heights of erotic pleasure. However.4 Now. S with U. and sizes under the sun and are apparently only dispensed in men’s rooms of sleazy bars. we couldn’t consider them. which clearly likes to cover both bases. each of which has a . We create a sum of squares representing the differences between individual group means and a second sum of squares representing variation within groups. we really don’t care about the speciﬁc differences in the ﬁrst round. dispensed by every drugstore in the land. (8–2) PDQ Statistics was written. a second study for A versus C.3 Suppose we test four brands.1 Leaving aside the exotica. do IT. we would stop right there. So much for practicing what you preach. or posthoc comparisons) to examine speciﬁc comparisons among individual means. so we end up comparing R with S. Those of us with an empirical bent might wish to put the promise to the test and determine if there really was any difference in pleasure derived from different brands. consider condoms. There are also methods (called pairwise. then try to ﬁnd out what affects these differences. we (Streiner and Norman. Trojan (T). I n the last chapter. How. We certainly wouldn’t do it two brands at a time. shapes. Ramses (R). 2003) have previously called a “Bo Derek scale. 
do you deal with the problem that assaults consumers daily when they must choose among dozens of apparently identical products to deal with every aspect of life from brushing their teeth in the morning to knocking them out at night? As an example. There are 6 possible comparisons.” you decide to investigate which is the most cost-effective condom. then we might like to ﬁnd out which is best. another for A versus D. as a visit to the local pharmacy reveals an overwhelming array of choices. ponder if you will what happens when you have more than two groups.
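To see how quickly the Type I error rate inflates, here is a two-line check of the “six comparisons at α = .05” arithmetic, using the formula from footnote 6 (plain Python, no libraries required).

```python
alpha = 0.05
comparisons = 6                                    # all pairs among 4 brands
p_at_least_one = 1 - (1 - alpha) ** comparisons    # the formula from footnote 6
print(round(p_at_least_one, 3))                    # about 0.265, not 0.05
```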

. so what do we use? Another way of thinking about what we did with the t-test was that we got the grand mean of the two groups (which we abbreviate – as X . So what we have is: SS(between) = n ΣΣ(Xik – X. is 1.k – X. we in essence have four Subject 1s—one from each of the groups. 10 subjects each Subject Ramses Sheiks Trojans Unnamed 7A 1 2 3 4 5 6 7 8 9 10 Mean Grand mean = 4 4 5 5 6 3 4 4 3 4 4.0 – . + (4 – 4.) = X1 – X2 (8–3) 5The use the harmonic mean of the sample sizes.9 – 4.. but simply because they are the ones who are always whining about the intrusion. A dot replacing a subscript means that we’re talking about all of the peo– ple referred to by that subscript. THE PARTS OF THE ANALYSIS Sums of Squares With the t-test.0 – . These are then massaged into a statistical test. is the mean of all people in all groups (i. usually takes two (or more).78 4We ANALYSIS OF VARIANCE recognize that good sex.2)2 + (5 – 5. we’ll get: SS (Between) = 10[(4.1)2 + . )2 (8–4) 6Actually where n is the number of subjects in each group. our measure of the signal was the difference between the means. so it is not apparently interval level measurement. astute reader may well point out that we have no business comparing means of numbers from a rating scale. like good tangos.95k. + (3 – 3. As we saw when we were deriving the formula for the SD in Chapter 3.26. not because of any sexist leanings. for reasons we’ll explain shortly). + (3 – 4. this turns out to equal 101.375)2 + (3. This is where the complicated formula comes in.2)2 + . Again. . because it is literally the sum of the squared differences between the group means and the grand mean..2)2 + (4 – 4. this will always add up to zero. for the masochists in the crowd. though. is: SS(within) = We use the same logic with the analysis of variance (ANOVA).50. though. as we outlined in Chapter 5. the algebraic formula. Thinking in terms of signals and noises. and so on.3 7 8 7 9 6 3 2 2 2 3 4. and these are shown in Table 8–1.375)2 + (4. Indeed. means to add over all of the subjects in the group.. So. it’s not quite that. and the second summation sign means to add across all the groups.3)2 + .. what we need is a measure of the overall difference among the means of the groups and a second measure of the overall variability within the groups. As we just saw. we need some way of differentiating Subject 1’s score from Subject 2’s.2 4. the Sum of Squares (Within) is the sum of all the squared differences between individual data and the group mean within each group. .875 Similarly. X .3 – 4. and then saw how much each individual mean deviated from it: (X1 – X. It looks like: Sum of Squares (Within) = (4 – 4. the Sum of Squares (Total) is the difference between all the individual data and the grand TABLE 8–1 Satisfaction ratings with 4 brands of condoms. Then we determine a second sum of all the squared differences between the individual data and their group mean.1 Based on a 10-point scale where 1 is the pits and 10 is ecstasy. we ﬁgure out the grand mean of all of the groups. so we have one subscript (i) to indicate which subject we’re talking about. so we need another subscript (k) to reﬂect group membership. Let’s just fake up some sex satisfaction data7 to illustrate what we did. .9)2 + (2 – 3. we will assume that the ratings were made by the male partners. Finally. you would where the ﬁrst summation sign.956) = .e. there is no assurance that the distance between 9 and 10 on the scale is the same as the distance between 3 and 4. . 
as explained in Chapter 3.1)2 [40 terms] After much anguish..) – (X2 – X.1 – 4. and see how much each group mean deviates from it. We call the results the Sum of Squares (Between). and X . we can’t do that when we have more than two groups. The correct formula to calculate the overall probability of a signiﬁcant result by chance alone when there are n comparisons. . Each person’s score is represented as X. Debates have raged about this one for literally 50 years and we won’t resolve it here (although some of the key references are at the end of the book). the grand mean). We also need to take into account how many subjects there are in each group. We approach this by ﬁrst determining the sum of all the squared differences between group means and the overall mean. the one with the i under it. unless we square the results. But. If the groups have different sample sizes.9 2 1 2 3 3 4 5 4 4 3 3. in this case (1. Now.375 5 5 6 6 7 6 4 5 6 3 5. k)2 i k (8–5) ∑ (X.375)2] = 27.375)2 + (5. For the moment.k is the mean – of all people in Group k. Time to explain those funny dot subscripts. . trick known to all consenting adults.2 – 4. Using Equation 8–4. .
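Here is a short NumPy sketch that reproduces the two Sums of Squares, assuming the ratings have been keyed in as they appear in Table 8–1.

```python
import numpy as np

ratings = np.array([
    [4, 4, 5, 5, 6, 3, 4, 4, 3, 4],   # Ramses
    [5, 5, 6, 6, 7, 6, 4, 5, 6, 3],   # Sheiks
    [7, 8, 7, 9, 6, 3, 2, 2, 2, 3],   # Trojans
    [2, 1, 2, 3, 3, 4, 5, 4, 4, 3],   # Unnamed
])

grand_mean  = ratings.mean()                # 4.375
group_means = ratings.mean(axis=1)          # 4.2, 5.3, 4.9, 3.1
n = ratings.shape[1]                        # 10 subjects per group

ss_between = n * np.sum((group_means - grand_mean) ** 2)       # 27.875 (Equation 8-4)
ss_within  = np.sum((ratings - group_means[:, None]) ** 2)     # 101.5  (Equation 8-5)
ss_total   = np.sum((ratings - grand_mean) ** 2)               # 129.375 (Equation 8-6)

print(ss_between, ss_within, ss_total)      # between + within = total
```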

no. After all. More generally.292 2. First. or. + (3 – 4.375)2 + .MORE THAN TWO GROUPS 79 mean. whose sum must equal the Grand Mean. All the µs are therefore equal. the formula is: df (Total) = nk – 1 (8–9) It’s no coincidence that the df for the individual variance components (between and within) add up to the total df. and the algebraic formula is: SS(total) = ΣΣ(Xik – X.875 + 101. because summation is usually obvious from the equation.296 is just a bit greater than the published F-value for 3 and 36 df. Finally. . . The critical values of the F-test at the back of the book are listed under the df for both the numerator and the denominator. and then in turn into the Sum of Squares (Between) and the Mean Square. but we’ll let the computer worry about that wrinkle. )2 i k (8–6) There are two things to note.50 = 129. the numerator is the signal— the difference between the groups—and the denominator is the differences within the groups. where we ﬁgured out the df for each group. the df (Between) is equal to 4 – 1 = 3. If no difference truly exists between the groups. we form the ratio of the two Mean Squares.375)2 + . First we calculate the Mean Square by dividing each Sum of Between Within Total 27. . Second. 2.36. in the absence of any difference in population means.375. . there are 40 data points that again must add up to the Grand Mean. the calculated ratio turns out to be signiﬁcant because 3.) for each term.375. so df = 39. after all. from now on.375 3 36 39 9. For the Between-Groups Sum of Squares. the answer is “No. F(3. At this point. but it can’t say yet which they are. this should be equal to the sum of the Between and Within Sums of Squares.500 129. we deﬁned df as the number of unique bits of information. preparatory to calculating the Mean Squares. and last. This is always the case.375)2 + (5 – 4. As it turns out.86. we’ll get to that when we discuss post-hoc comparisons.375)2 [40 terms] = 129. .819 3. one fact becomes clear—you don’t see F-ratios anywhere near zero. Squares by its df. if you can’t afford the word processor. + (2 – 4. . Again.375)2 + .375)2 + (4 – 4. the expected Mean Square (Between) [usually abbreviated as E(MSbet)] is exactly equal to the variance (within). + (7 – 4. since the df is about the same as the number of terms in the sum. we know that at least two are different from each other. for the Total Sum of Squares. and let you work things out. for k groups: df (Between) = k – 1 (8–7) EXPECTED MEAN SQUARES AND THE DISTRIBUTION OF F If you peruse the table of F-ratios in the back. shouldn’t the numerator go to zero? Surprisingly.f. This is summarized in an ANOVA table such as Table 8–2. We can then look up the calculated F-ratio to see if it is signiﬁcant. the F-ratio. and it provides an easy check in complex designs. generally. if you consider where the F-ratio comes from. σ2err. This is then a measure of the average deviation of individual values from their respective mean (which is why it’s called a Mean Square). . but all condoms are not created equal. then: df (Within) = k (n – 1) (8–8) 8If you would rather gedanken. which is a signal-to-noise ratio of the differences between groups to the variation within groups. But in longhand: Sum of Squares (Total) = (4 – 4. To check the result. When you publish this piece (good luck!!).375)2 + (8 – 4. So. 27. Finally. Imagine8 that there really was no difference among the condoms. so df is 4 × (10 – 1) = 36. In general. 
we’ll just use a single Σ without a subscript.” The reason is because whatever variation occurred within the groups as a result of error variance would eventually ﬁnd its way into the group means. Perhaps that’s not a surprise. We follow the model for the t-test for the WithinGroups Sum of Squares. there are four groups. Would we expect the Sum of Squares (Between) to be zero? As you might have guessed. But it actually should be a bit more surprising. and then added them together.375)2 + (1 – 4. Remember that in the previous chapter. Either way. the equations look a little bit different if there aren’t the same number of subjects in each group. feel free. Source Sum of squares df Mean square F TABLE 8–2 The ANOVA summary table Mean Squares Now we can go the next.36) or F(3/36).875 101. It is the sum of SS (Between) and SS (Within).. . Here. even when two or more are called for. + (5 – 4. steps. the F-ratio would be written as F3. So Julius may have taken them all out of the same latex vat.296 . we didn’t ﬁnd that any t-values worth talking about were near zero either. each of the four groups has 10 data points. Degrees of Freedom The next step is to ﬁgure out the degrees of freedom (df or df or d.375)2 + .
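Putting the pieces of the ANOVA summary table together by machine: the mean squares and F from the sums of squares computed above, plus SciPy’s one-liner as a check. Again, a sketch rather than gospel.

```python
from scipy import stats

ss_between, ss_within = 27.875, 101.5         # from the calculation above
df_between, df_within = 4 - 1, 4 * (10 - 1)   # Equations 8-7 and 8-8

ms_between = ss_between / df_between          # 9.29
ms_within  = ss_within / df_within            # 2.82
F = ms_between / ms_within                    # 3.30
p = stats.f.sf(F, df_between, df_within)
print(F, p)                                   # F(3, 36) = 3.30, p is about .03

# the one-liner check, feeding in the raw ratings
ramses  = [4, 4, 5, 5, 6, 3, 4, 4, 3, 4]
sheiks  = [5, 5, 6, 6, 7, 6, 4, 5, 6, 3]
trojans = [7, 8, 7, 9, 6, 3, 2, 2, 2, 3]
unnamed = [2, 1, 2, 3, 3, 4, 5, 4, 4, 3]
print(stats.f_oneway(ramses, sheiks, trojans, unnamed))
```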

then. S.80 ANALYSIS OF VARIANCE Conversely.” In the present example. but as with the assumption of normality. Usually it’s in the high . That is. the σ2bet drops out and the ratio (the F-ratio) equals 1. although there may be more than two levels in the analysis. especially if the groups have equal sample sizes.” Because of sampling error. MULTIPLE COMPARISONS One could assume. if we go to a really simple design and do a one-way ANOVA on just two groups. There’s another very good reason not to worry about it. then. and the expected Mean Square (Between) = n × σ2bet. Putting it together. and. The reason is that one-way ANOVA is fairly robust to deviations from homoscedasticity (which is called. the formulae for the expected mean squares will also become hairier (to the extent that this is the last time you will ever see the beast derived exactly). You don’t really care which of the condoms resulted in the most satisfaction—or do you? There are certainly many occasions where one might. occurring out of interest after the primary analysis has rejected the null hypothesis. The assumption of homogeneity of variance. and so on. and the populations (not the samples) are normally distributed. heteroscedasticity). wish to go further after having rejected the null hypothesis to determine exactly which speciﬁc levels of the factor are leading to signiﬁcant differences. out of genuine rather than prurient interest. that ﬁnding the F-ratio concluded the analysis. and the second is that there is homogeneity of variance across all of the groups. comparisons among analgesics. especially if the ratio of the largest to the smallest variance is less than 3 (Box. which also goes by the fancy name of homoscedasticity. post-hoc comparisons are considered to be more like data-dredging. Does this mean that you’ll never see an F-ratio less than 1? Again. the means will be normally distributed. when there is no true variance between groups. they are much more common and also easier to understand. the same answer. our real interest is a comparison of Brand U (unnamed) against the average of brands R. if absolutely no variance exists within groups. As we go to hairier and hairier designs. makes two assumptions about the data. in the above experiment. if this works. If we go back to Chapter 6 and the logic of statistical inference. it’s rarely worth worrying about. especially when there are at least 30 or so observations per group. the Type I and Type II errors aren’t inﬂated if the data are skewed (especially if the skew is in the same direction for all groups).90s. the calculated F-ratio is precisely the square of the t-test. regardless of the original distribution. it is probable that the conclusions drawn from the data using an F-test will not be seriously affected” (p. but one thing will always remain true: In the absence of an effect. normality. post-hoc comparisons are for further exploration of the data after a signiﬁcant effect has been found. the null hypothesis was rejected. and deviations in kurtosis (the distribution being ﬂatter or more peaked than the normal curve) affect power only if the sample size is low. is the distribution of the means. if we were going up against Julius Schmid. 1954). 262). The ﬁrst is that they are normally distributed. which are deliberately engineered into the study before the conduct of the analysis. the previous hypothesis can be framed much more precisely than simply. From the Central Limit Theorem (Chapter 4). for obvious reasons. though. let’s see why they’re necessary. 
The ﬁrst assumption. the greater the probability of ﬁnding at least one that is signiﬁcant by chance (the problem of mul- . As you might have guessed. the probability of getting a signiﬁcant result if in fact nothing is going on is 5%. but the results are rarely used. as well as the t-test and all other ANOVAs. In fact. and planned comparisons. As we said in Chapter 6. These two situations are described as post-hoc comparisons. so we will start at the end and work forward. ASSUMPTIONS OF ANOVA One-way ANOVA. Multiplicity and Type I Error Rates Before we begin discussing the various tests. such as a group of aspirin-based analgesics that includes a placebo. and T. The more statistical tests we run. then the difference between sample means is equal to the difference between population means. what we’re actually looking at with ANOVA. we expect the relevant F-ratio to equal 1. is rarely tested formally because. is assessed more often because most computer Planned comparisons are hypotheses speciﬁed before the analysis commences. almost automatically implies two levels of interest—all analgesics against the placebo. a comparison of three or four drugs. Conversely. the expected value of the Mean Square (Within) is just the error variance. and any other statistic focused on differences between means. “Unless there is reason to suspect a fairly extreme departure from normality. and the expected value of the Mean Square (Between) is equal to the sum of the two variances: E(MSbet) = σ2err + nσ2bet (8–10) programs do it whether we want it or not. t-tests. “Not everything is equal. σ2. “No. it sometimes happens that when nothing is going on—there’s no effect of group membership—you’ll end up with an F that’s just below 1. situations also occur when. Then. The alternative hypothesis was supported. More commonly. as Ferguson and Takane (1989) state. However. and thus inferior to the elegance of planned comparisons. Just how robust is anybody’s guess.
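If the claim that the expected Mean Square (Between) equals the error variance when nothing is going on seems too convenient, a short simulation makes it concrete. This is our own toy demonstration (four groups of 10 drawn from the same normal population), not anything prescribed by the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, sigma = 4, 10, 1.0          # four groups of ten, all from the same population

ms_between, f_ratios = [], []
for _ in range(10_000):
    data = rng.normal(0.0, sigma, size=(k, n))     # H0 is true: no group effect
    msb = n * data.mean(axis=1).var(ddof=1)        # Mean Square (Between)
    msw = data.var(axis=1, ddof=1).mean()          # Mean Square (Within)
    ms_between.append(msb)
    f_ratios.append(msb / msw)

print(np.mean(ms_between))   # close to sigma**2 = 1.0: E(MS between) = error variance
print(np.mean(f_ratios))     # close to 1 (a shade above, since the F distribution's
                             # mean is df_within / (df_within - 2) = 36/34)
```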

and compare it with α/T. the probability of ﬁnding at least one signiﬁcant is (1 – .9 This is called the family-wise alpha level.49%. and also as an independent variable.MORE THAN TWO GROUPS 81 tiplicity. all will be. 10More POST-HOC COMPARISONS All the post-hoc procedures we discuss—Fisher’s LSD (Least Signiﬁcant Difference). and Dunnett’s t (and a lot more we won’t discuss)— involve two comparisons at a time. if the tests are perfectly correlated. and (2) we can use the Mean Square (Error) term as a better estimate of the within-group variance. and to control for multiplicity within families. refer back to marginal note 6. the Scheffé method. The most liberal ones. or multiple tests of signiﬁcance). it and all subsequent ps are not signiﬁcant. we showed that the probability of at least one test being signiﬁcant is 1 minus the probability that none are signiﬁcant. if we ran all six possible t-tests between the four condom groups following a signiﬁcant F-test. 11Not .05 ÷ 4 = . is not straightforward by any means. so this is called the experiment-wise alpha level. comparing experienced and novice users (assuming we could ﬁnd any of the latter). the probability of ﬁnding signiﬁcance anywhere in the study depends on all of the tests that we’ve done. and tests of the main hypothesis a different family. Because we have only a limited number of ways to look at the difference between two means (subtract one from the other and divide by a noise term). which is compared with its critical value. In the Holm procedure. but may differ from other subsets of means. If our p level is smaller than this. Fisher’s LSD. Deciding on which comparisons belong in a family. Some are only moderately successful. rather than the ﬁxed value of α/T (where T is the number of tests). We continue doing this until we ﬁnd a p value larger than the critical number. Now. If it’s signiﬁcant. They differ among themselves in two regards—how they try to control for multiplicity. but not across them (Jaccard and Guilamo-Ramos. which Bonferroni uses. since the computers don’t know many ANOVAs or other tests we’re running. This is called a Bonferroni correction. and like a middle-of-theroader rather than an arch conservative. tests to use when the variances aren’t equal). and we move on to the next p value in our list. Modiﬁcations of the Bonferroni Correction In an attempt to overcome the extremely conservative nature of the Bonferroni correction. it and all larger ps are nonsigniﬁcant. ease of use. When we ﬁnd a p level that is larger than the critical value. then it is signiﬁcant. If you have four comparisons. you may have to go back and recalculate these by hand (or by using specialized programs available on the Web). Multiple comparison tests compare each pair of means. The reason is that it should more appropriately be called the Bonferroni overcorrection because it does overcompensate. For example. proposed by Holm (1979) and Hochberg (1988). the probability is larger. the Neuman-Keuls Test. But. we wanted to look at all of these outcomes. So. If it’s not signiﬁcant. whether they are multiple comparison or range tests (some are both). Tukey’s HSD (Honestly Signiﬁcant Difference). So.10 so we won’t waste your time (or ours) going over them. To see why. and comfort level as dependent variables. they don’t talk to each other. the Scheffé Method. which is compared with α/(T – 1). Bonferroni Correction Why not just do a bunch of t-tests? Two reasons: (1) it puts us back into the swamp we began in. 
It’s probably best to group tests by logically related hypotheses based on your theory. so the number of comparisons in the ﬁrst group wouldn’t affect the α level of the second group. 9Note that this is true only if the tests are perfectly uncorrelated. When we rely solely on statistical packages to do the post-hoc comparisons described below. 2002). then the alpha level becomes . but not discuss. But. we stop. and Dunnett’s t.956) = 26. but still not in the liberal camp. but also satisfaction of the partner(s). The safest bet is to control for every test that’s done. range tests identify groups of means that don’t differ among themselves. they can’t control the experiment-wise error rate. Both of the tests start off by arranging the p levels in order. All of these tests assume equal variances for the groups (we’ll mention. At the extreme. one easy way to keep things in line is to set an alpha level that is more stringent. α/(T – 1).05 by k. we start off with the smallest (most signiﬁcant) p level. If they are correlated. all of the different levels of the independent variable of condom type. and we wanted to look not only at satisfaction of the user. In Chapter 5. a number of alternatives have been proposed. let’s proceed to the more sophisticated (relatively speaking) methods— variations of the Bonferroni correction. in this case. Tukey’s HSD. to be confused with the Goldberg Variations. but we don’t know by how much. and indicate whether they are the same or different. all tests of group differences in baseline characteristics would be one family. the NewmanKeuls. but that results in very low power for any given comparison. This does point to one of the simplest strategies devised to deal with multiple comparisons (of any type). then if one is signiﬁcant. let’s say the study was more complicated than we outlined. then so are all smaller values of p. of losing control of the α level. use critical values that change with each test. The Hochberg variation11 starts with the largest value of p and compares it with α. As with most (dys)functional families. You shouldn’t use Bonferroni if there are more than ﬁve groups (and it’s probably best not to use it at all). Recognizing that the probability of making a Type I error on any one comparison is . from the smallest (p1) to the largest (pT). with the increasing risk of a Type II error. we move on to the next value. because it pertains to a “family” of tests.05. they all end up looking a lot like a t-test. So. All you do is count up the total number of comparisons you are going to make (say k comparisons). realize that they can correct for the family-wise error rate. then divide . and whether to control for just family-wise or experiment-wise error rates.0125.
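Since neither procedure is built into most packages, here is a do-it-yourself sketch of the Holm (step-down) and Hochberg (step-up) corrections. The five p values at the bottom are hypothetical, chosen only to show the mechanics; they are not the ones from the worked example.

```python
def holm(p_values, alpha=0.05):
    """Step-down: test the smallest p against alpha/T, the next against
    alpha/(T-1), ..., and stop at the first one that fails."""
    T = len(p_values)
    order = sorted(range(T), key=lambda i: p_values[i])
    significant = [False] * T
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (T - rank):
            significant[i] = True
        else:
            break
    return significant


def hochberg(p_values, alpha=0.05):
    """Step-up: test the largest p against alpha, the next against alpha/2, ...;
    the first one that passes makes it and all smaller ps significant."""
    T = len(p_values)
    order = sorted(range(T), key=lambda i: p_values[i], reverse=True)
    significant = [False] * T
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (rank + 1):
            for j in order[rank:]:          # this p and every smaller one
                significant[j] = True
            break
    return significant


ps = [0.008, 0.011, 0.030, 0.040, 0.060]    # hypothetical p values, for illustration only
print(holm(ps))        # first two significant, the rest not
print(hochberg(ps))    # Hochberg is never stingier than Holm; here it agrees
```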

in the present example. so. but you can ﬁnd free stand-alone programs for the Holm on the Web. The closest.010.011 p3 = . However. become evident. in that the Mean Square (Within) is calculated from the differences between individual values and the group mean across all the groups. almost all the post-hoc procedures we will discuss are. So. It really represents a critical range of differences between means resulting from the error in the observations. which stands for Ryan. so we test p3 against α/3 = . All three methods are multiple comparison tests.).03. the Mean Square (Within) is the best estimate of σ2. In ANOVA. It seems that.045 Denominator 2 MSwithin n (8–12) In the Holm method.82 = 0.751 2. and tested all of the ps against α/5 = . using the Mean Square (Within) instead of a sum of variances.0167. then only p1 would have reached signiﬁcance. 40 observations) to compute the error. Assume we have the results of ﬁve tests. we should use one of these corrections (Holm or Hochberg) to keep the experiment-wise α level at .010.03).1 5. which we called q'. we would ﬁrst compute the denominator: q’ 2 2. p1 is significant. this reduces further to: This is just an ordinary t-test. we’ll call statistically signiﬁcant by the t-test. Neither the Holm nor the Hochberg methods are implemented in statistical packages. meaning that p3 through p5 are not signiﬁcant. and we move on to p2. Because it is smaller. if we ran three separate t-tests or did 10 correlations within a study. Furthermore. we must then divide by some ns to get to the SE of the difference. we would compute the range by multiplying 0.05. the denominator of the test looks like: Denominator MS within 1 n1 1 n2 (8–11) If the sample sizes are the same in both groups. we modify the degrees of freedom to take this into account. 2. You remember in the previous chapter that we spent quite a bit of time devising ways to use the estimate of σ derived from each of the two groups to give us a best guess of the overall SE of the difference. and Welsh. It too is signiﬁcant. at their core. most of this work is already done for us. and then conclude that any difference between group means which is larger than this range is statistically signiﬁcant. Sticking with our example of a ttest for the moment. someone decided early on that the factor of 2 in the square root was just unnecessary . and their p values. any difference we encounter that is larger than this.2 0. since by deﬁnition they involve multiple comparisons (between Group A and Group B. we start with p5 and compare it with α. and then the Hochberg variation. for example. This does introduce one small wrinkle. we’re going to call this quantity q'. except that we computed the denominator differently. Finally. they can be used in a variety of circumstances. when folks were devising these post-hoc tests. so we’ll elaborate a bit on this idea. In fact. For reasons which may. are: p1 = .93 (8–15) The Studentized Range Common to all the remaining methods is the use of the overall Mean Square (Within) as an estimate of the error term in the statistical test. the Q is there just to make it easier to pronounce. Because it’s signiﬁcant.751 2.030 p4 = . in order. create a t-test using this new denominator: t X1 q’ (8–13) X2 So. etc.008 p2 = . Had we used the Bonferroni correction. followed by the Holm method. is called REGWQ. they go about it another way altogether. variants on a t-test.82 ANALYSIS OF VARIANCE Let’s run through an example to see how these procedures work. 
So the calculation of the denominator starts with Mean Square (Within). too. Regrettably. they begin with a range computed from the Mean Square (Within). we could proceed with a t-test: t |3. We could. and computing the difference required to obtain a signiﬁcant result. Since we are using all the data (in this case.0125. which is found in some packages.751 10 (8–14) Then. which turns out to be 2. Then. It’s larger. With the Hochberg version. For example. then all of the other p values are. multiply it by the appropriate critical value of the particular test.040 p5 = . one more historical diversion is necessary.3| 0. turning the whole equation on its head. if we want to compare U to S (the reason for this choice will be obvious in a minute). Einot. which is compared against α/4 = . We ﬁrst take the square root to give an estimate of the SD. Group A and Group C. before we launch into the litany of post-hoc tests. Although we have been discussing these three tests in relation to ANOVA. That is. the Bonferroni correction is the most conservative.751 by the critical value of the t-test (in this case. Gabriel. In the end. as we showed already. with luck. we compare this with a critical value for the t-test on 36 df. we would compare p1 against α/5 = .
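To make Equations 8-13 to 8-15 concrete, here is a small sketch in Python. The Mean Square (Within) of 2.82, the 10 subjects per group, and the 36 df come from the running example; the two means (3.1 for Unnamed and 5.3 for Sheik) are our reading of that example, so treat them as illustrative.

import numpy as np
from scipy import stats

ms_within = 2.82        # Mean Square (Within) from the one-way ANOVA
n = 10                  # subjects per group
df_within = 36          # k(n - 1) = 4 x 9

q_prime = np.sqrt(2 * ms_within / n)       # Equation 8-14: about 0.751
t = abs(3.1 - 5.3) / q_prime               # Equation 8-15: about 2.93
t_crit = stats.t.ppf(0.975, df_within)     # two-tailed critical t, about 2.03

print(f"q' = {q_prime:.3f}, t = {t:.2f}, critical t = {t_crit:.2f}")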

who found that LSD did win him fame and glory. This involves looking up the critical value of the q statistic in Table M. k. that’s not a consideration. goodness knows how this got into the history books. the HSD is both. the HSD equals: HSD 3. or HSD: Group Subset 1 Subset 2 TABLE 8–3 Tukey’s HSD range test for the data in Table 8–1 Unnamed Ramses Trojans Sheiks 3. and it is actually nothing more than a computational device to save work. This has the formula: MSwithin n 2. and sample size. R. This isn’t at all unreasonable.750.3 . and we should avoid using LSD for anything except recreational activities. but also the number of means. After all.12 he went the other way and gambled on moral rectitude. start with a test statistic. one will never know. Second. it doesn’t take a rocket scientist to determine that q' is just √2 = 1. which we’ve called q up until now. takes all this into account. which is a multiple comparison test. Why they bothered to create a new range. the statistic q equals 3. In the present example. and M is the df for the within term.79 2. it gives a table showing subsets of homogeneous means. called the Honestly Signiﬁcant Difference. we have included critical values of q in Table M in the Appendix.52. uses the q statistic. rather. the more possible differences between group means there are. and the greater the chance that some of them will be extreme enough to ring the .9 4. when n is. coming up with the Honestly Signiﬁcant Difference (HSD). equal to k(n – 1). But. You can use this technique with any range-type post-hoc test. and now the T – U comparison is not signiﬁcant. which were either multiple comparisons (Bonferroni. One would be forgiven if there were some inner doubt surfacing about the wisdom of such strategies. k is the number of groups (4 in this case). the test statistic for the HSD and the next test. with 4 and 36 df. called the Studentized range. Hochberg.82 10 2. given the df. and S. as in Table 8–3. Rest in Peace.4141 times q. then.03 = 1.13 This time the test statistic is changed to something closer to the square root of an F statistic. the LSD also found the comparison S – U to be signiﬁcant. but the S – U one still is. It has its own table at the back of some stats books (but not this one). 12Unlike Timothy Leary.05 bell. In this case. The formula for the LSD is therefore: LSD tn 2 2 MSwithin n tn 2 q’ (8–17) where df is the degrees of freedom association with the MSWithin term.750 × 2. and T don’t differ from one another.2 4. For completeness. so a signiﬁcant t (at . The T – U comparison is still statistically signiﬁcant. as before.82 10 HSD q(k. doesn’t it? If this one is “honestly” signiﬁcant. and it is not necessary to calculate a new t for every comparison. T. So.01 (8–19) Fisher’s Least Signiﬁcant Difference (LSD) Fisher’s LSD.9 5. like the one that follows. and compute from it how big the difference between means has to be in order for it to be signiﬁcant. don’t start with a difference between means and then compute a test statistic. you wonder. Fisher’s LSD does nothing to deal with the problem of multiplicity because the critical value is set at . After all. Just compare the difference to 1. so any difference between means greater than . is on the opposite wing in terms of conservatism.52 would be signiﬁcant. all it does is save a little calculation. Tukey then creates another critical difference. since computers do the work in any case. it’s signiﬁcant. and a line is placed under the means that don’t differ from each other. 
Note that this test.2 4. we have 36 df. A neat way of displaying this is in Table 8–4.531 (8–16) And that’s where q and q' come about. After identifying means that are significantly different.79. You begin with the critical value of t. We worked out before that the denominator of the calculated t-test is . which turns out to involve not just the mean. Holm.05) is 2. Since most of the range tests use q. Unlike the previous tests. SD.03. if it’s bigger. these tests (including the LSD). This time around. the sample size in each group.52 becomes the LSD.05 for each comparison.M) q 0.1 4. The means are listed in numerical order (it doesn’t matter if they’re ascending or descending). and LSD) or range tests (the Studentized Range). the more groups there are. Tim. what does that say about his other posthoc tests? 13Makes Tukey’s Honestly Signiﬁcant Difference Perhaps because Tukey saw that LSD did not immortalize Fisher’s name. nor do the means of R.MORE THAN TWO GROUPS 83 MS(within) n (8–18) baggage. Sooo— 1. This is interpreted as showing that the means of U. but. so they created a new range.
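Both critical differences can be computed rather than looked up. The sketch below assumes a recent version of SciPy, which exposes the Studentized range distribution as scipy.stats.studentized_range; if yours does not, substitute the tabled q = 3.79 from Table M. The group means are our reading of the running example and are meant only as an illustration.

import numpy as np
from scipy import stats

ms_within, n, k, df = 2.82, 10, 4, 36

q_prime = np.sqrt(2 * ms_within / n)                 # about 0.751
lsd = stats.t.ppf(0.975, df) * q_prime               # LSD (Eq. 8-17): about 1.52

q_crit = stats.studentized_range.ppf(0.95, k, df)    # about 3.8
hsd = q_crit * np.sqrt(ms_within / n)                # HSD (Eq. 8-19): about 2.0

means = {"Unnamed": 3.1, "Ramses": 4.2, "Trojan": 4.9, "Sheik": 5.3}
for a, ma in means.items():
    for b, mb in means.items():
        if a < b:                                    # each pair once
            diff = abs(ma - mb)
            print(f"{a}-{b}: diff = {diff:.1f}, "
                  f"LSD sig = {diff > lsd}, HSD sig = {diff > hsd}")

The output makes the liberal/conservative contrast plain: several differences clear the LSD hurdle of about 1.5, but fewer clear the HSD hurdle of about 2.0.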

The big one is that it is the right test for a particular circumstance that is common to many clinical studies in which you wish to compare several treatments with a control group. and S . brand S. From the table.751 On this basis. use WSD if you want to do more than k . were tempted to call it a placebo group after the Latin meaning “I will please. Basically..82 the critical value is: Critical (Xj 2. one clear difference emerges between R. 14Thereby counting The Neuman-Keuls Test For reasons known only to real statisticians.1 1. and what the difference is between the means. Another case of perish then publish. although he didn’t play any role in its development. and 36 degrees of freedom.875.15 we have to introduce a new concept. or T. etc.54.2. use the N-K. the study is now designed to compare multiple treatments with a control group. six or more) and you want to make all possible comparisons. We compute a critical value for the difference using a test statistic called.81 2. too.U comparison is bigger than 2. the tabled value equals 3. Tukey’s Wholly Signiﬁcant Difference (WSD. the primary question must be whether the ones you pay for are really any better than the freebies. from Streiner and Norman. the critical q is 3. to discuss two two-step comparisons. to be confused with frisbees. 15In Tukey’s HSD is the way to go if you have a large number of groups (say. and be “just right. T. SPSS.9 3.
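A sketch of both ideas, again with our reading of the example means. The step-dependent critical values come from the Studentized range (same SciPy assumption as before); the Dunnett critical difference simply plugs the tabled t_d = 2.54 quoted above into Equation 8-21.

import numpy as np
from scipy import stats

ms_within, n, df = 2.82, 10, 36
se = np.sqrt(ms_within / n)                    # about 0.531

means = [3.1, 4.2, 4.9, 5.3]                   # ordered, lowest (Unnamed) to highest (Sheik)

# Newman-Keuls: the critical q grows with the number of means spanned
for r in (2, 3, 4):
    q_crit = stats.studentized_range.ppf(0.95, r, df)
    print(f"{r}-step critical difference = {q_crit * se:.2f}")
# roughly 1.53, 1.84, and 2.02; only the biggest gap (5.3 - 3.1) clears its hurdle

# Dunnett: each treatment mean against the control (here, the Unnamed brand)
t_d = 2.54                                     # tabled Dunnett value, as quoted in the text
critical = t_d * np.sqrt(2 * ms_within / n)    # Equation 8-21: about 1.91
for m in means[1:]:
    print(f"{m} vs control {means[0]}: diff = {m - means[0]:.1f}, "
          f"sig = {m - means[0] > critical}")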

05 level. among the various hypotheses or contrasts. The reason is that the comparisons overlap—(Mean1 – Mean2). we would make a comparison as shown: C 1 ⁄2XR 1 ⁄2XS 1 ⁄2XT 1 ⁄2XU (8–26) . To see if it matters. those condom connoisseurs among the readers probably know that certain. to calculate F-ratios for each test. On balance. In fact. We have not. For example. C2 is 0. if we just wanted to look at the difference between R and U. Because the one-way ANOVA is fairly robust to violations of this assumption. This value is a bit larger than any of the other methods. as usual. the critical value of F at . the REGWQ seems to be more powerful than the Newman-Keuls and able to maintain better control over the α level. We ensure that this condition is met by ﬁrst standardizing the way in which the comparison is written. the Sum of Squares (Between). To avoid this state of affairs. It would seem that most folks these days are opting for the Neuman-Keuls test when they are doing pairwise comparisons. If you’re ever faced with severe heterogeneity. we have listed a couple of readable references at the back of the book. comparing these methods. each contrast among means is written like: C w1X 1 w2X 2 w3X 3 w4X 4 (8–25) For example. wasn’t he? 20Dunnett . 2. Scheffé recommended using the tabled value for α = .82 × 1 0 0 1 ( 10 + 10 + 10 + 10 ) = 2. We do this by introducing weights on each mean. A versus C. This pretty well sums up the most popular posthoc procedures. just to compensate for this conservativeness. the Mean Square (Within). Scheffé “protects” α from them. Even if we don’t do all of these comparisons. The number of groups. as we do with post-hoc comparisons. Scheffé’s method is intended to be very versatile. The sum of squares associated with each is used as a numerator. then C1 is +1. the equation is: S = √(3)(2. or whatever) are said to be orthogonal if they do not share any common variance. it is evident that the LSD method is liberal—it is too likely to ﬁnd a difference. and so on). and a complicated little bit of coefficients.88. was a busy little boy. For a simple contrast. To conduct the procedure. A versus B.20 Tamahane’s procedure tends to be too conservative. in our present data set. The Scheffé method is too conservative. Although we haven’t discussed it. and other more architectural differences. That leaves us with Dunnett’s two tests. taking differences among means as our whims dictate. The basic strategy is to divide up the signal. To accomplish this sleight of stat. and Dunnett’s C. bother to look this up in the back of the book in the table for Student’s t-test.88) 2. in addition to a multitude of nonparametric and multivariate ones. regrettably. the details of which will be spared the reader. Putting it all together. Now taking a simple contrast.MORE THAN TWO GROUPS 85 19Don’t ly different from brand U. and (Mean3 – Mean1) are.05 for 3 and 36 degrees of freedom is 2. factors.10. brands R and T don’t make the grade. Suppose Brands R and S have one such characteristic.207 would be signiﬁcant at the . including highly weird ones. by any means. then there are four Two things (contrasts. If we just went ahead. which is the price you pay for versatility. There are about 25 parametric post-hoc tests. it’s in a special table that Dunnett created. you calculate a range. it’s not quite as bad as it looks. then the Sum of Squares associated with all the contrasts would likely add up to more than the Sum of Squares (Between).82. and we don’t ﬁnd much difference between them. 
and the Games-Howell too liberal. any comparison of means greater than 2. the comparisons of interest must be constructed in a speciﬁc way so that they are nonoverlapping. expensive classes can be found among condoms. So. spermicide present or absent. or orthogonal. Scheffé’s Method The last procedure we’ll discuss is also the oldest.df PLANNED ORTHOGONAL COMPARISONS In contrast to these bootstrap methods. and C4 is –1. One reason for its conservativeness is that it was meant to test all possible comparisons. k. although the Neuman-Keuls test is a little less conservative (so we are told). and the Mean Square (Within) as a denominator. that means they’re used in most situations. Mean Square (Within) is. The post-hoc tests we’ve mentioned are all used when we can assume homogeneity of variance among the groups. lubricated or not. planned contrasts are done with a certain élan. capitalizing on the same sources of variance. If you have the fortitude to pursue this further. allowing any type of comparison you want (for example. is 4. to some degree. In general. Dunnett’s T3. C3 is 0. rather than .207 (8–24) So. (Mean2 – Mean3). the HSD and the Neuman-Keuls methods are somewhere in the middle. A + B versus C. GamesHowell.05. it is necessary to devise the comparisons in a very particular way. and T and U do not. It looks like: S= (k 1)F MSwithin C j2 nj (8–23) post-hoc tests you can use: Tamahane’s T2. exhausted the space. Dunnett’s test is in a class by itself and should be the method of choice when comparing multiple means to a control mean. using the overall F-test.
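Equation 8-23 is easy to evaluate directly. The sketch below does so for a simple pairwise contrast on the running example; SciPy's F quantile (about 2.87) differs only trivially from the tabled 2.88.

import numpy as np
from scipy import stats

ms_within, k, n, df = 2.82, 4, 10, 36
f_crit = stats.f.ppf(0.95, k - 1, df)            # about 2.87
c = np.array([1.0, 0.0, 0.0, -1.0])              # +1 and -1 on the two brands compared
s = np.sqrt((k - 1) * f_crit * ms_within * np.sum(c**2 / n))
print(f"Scheffe critical difference = {s:.3f}")  # about 2.2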

2 And these are all supposed to sum up to the Sum of Squares (Between). From our data set. the net effect of the creation of these planned comparisons is to parcel out the total Sum of Squares (Between) into three linear contrasts.625 SS(C2) = 1.2 + 0 5. Now that we have established a set of contrasts equal to the number of df. we have to prove that contrasts 2 and 3 are orthogonal.1 – 1⁄ 2 4.2 + 1⁄2 5. Suppose there are two dimensions. If we imagine two lines. This looks like: C 1XR 1XS 0XT 0XU (8–27) 21If you really aren’t interested in all those contrasts. the product of the ﬁrst two sets equals (1⁄2 )(1) + (1⁄2 )( 1) + ( 1⁄2 )(0) + ( 1⁄2 )(0) = 0.05 16.875. calculate the sum of wi2 ÷ n.05 SS(C3) = 1.625 6. Try this. calculate the actual contrast. according to the equation: (wi wj ) = 0 (8–29) where i refers to one contrast and j to the other. How.995 2.2 = 27. they are at right angles (orthogonal) if the sum of the product of the weights is equal to zero. So.2 C2 6. So far. First.9 = –1.10 W2 = (12 + 12) ÷ 10 = 2 ÷ 10 = . How do we know that these are orthogonal? By multiplying the coefficients together. The sum of squares for each contrast is then. what do you do? Make some up to ﬁt the sum = 0 rule. TABLE 8–6 The ANOVA summary table for the orthogonal comparisons Source Sum of squares df Mean square F Between C1 C2 C3 Within Total 27. however.82 ÷ .1 – 1 4.86 ANALYSIS OF VARIANCE Brand 27.625 6.9 = 1. In a similar manner. Similarly.20 101. 5. it’s almost easy.20 W3 = (12 + 12) ÷ 10 = 2 ÷ 10 = .6 FIGURE 8–1 Parceling out the Total Sum of Squares into three linear contrasts. making the same comparison within the other category ends up looking like: C 0XR 0XS 1XT 1XU (8–28) In this case.12 ÷ . The sum of weights is (1)(0) + (–1)(0) + (0)(1) + (0)(–1) = 0.05 16.3 + 1 3.10 C3 = 0 4.3 1⁄2 3. calculate the sum of squares as below. and call it W: W1 = (1⁄2 2 + 1⁄2 2 + 1⁄2 2 + 1⁄2 2) ÷ 10 = 1 ÷ 10 = .0 C1 5. we can do a test of signiﬁcance on each contrast. we noted above that we could have only as many contrasts as there were df between groups.375 3 1 1 1 36 39 9. so good.84 1. they look like: C1 = 1⁄2 4. We’re getting tired of all this.20 2.2 – 1 5. (a1X + b1X) and (a2X + b2X).75 C2 = 1 4.146 5.20 = 16.21 We calculate the sum of squares for each contrast as follows: 1. Next. X and Y. This is illustrated in Figure 8–1.20 = 6. then ignore the result. so you can check the 1 and 3 contrast.20 3.50 129.9 = 0. which leads to an elaborate ANOVA table (Table 8–6).10 = 5. ignoring T and U. Finally. This is done by taking the ratio to the Mean Square (Within).3 0 3.1 + 0 4. Within 102 Within 102 In a similar manner we might like to compare Brand R with Brand S. you might ask.625 + 6.80 2.875 5. does this guarantee things are orthogonal? We asked the same question and decided that it was anything but self-evident. Now comes the magic.29 5.9 C3 16. by some further chicanery.746 . so the df were divided among the contrasts.05 + 16. And ﬁnally. equal to C 2 ÷ W: SS(C1) = .752 ÷ .
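Here is the same arithmetic as a sketch, using the brand means as we read them from the example. For each contrast we get C, its sum of squares C squared over W, and an F against the Mean Square (Within); each contrast carries 1 df.

import numpy as np

means = np.array([4.2, 5.3, 4.9, 3.1])           # Ramses, Sheik, Trojan, Unnamed
n, ms_within = 10, 2.82

contrasts = {
    "RS vs TU": np.array([0.5, 0.5, -0.5, -0.5]),
    "R vs S":   np.array([1.0, -1.0, 0.0, 0.0]),
    "T vs U":   np.array([0.0, 0.0, 1.0, -1.0]),
}
for name, w in contrasts.items():
    c = np.sum(w * means)                        # the contrast itself
    ss = c**2 / (np.sum(w**2) / n)               # its sum of squares (1 df)
    print(f"{name}: C = {c:+.2f}, SS = {ss:.3f}, F = {ss / ms_within:.3f}")
# the three SS values (5.625, 6.05, and 16.2) add back up to SS(Between) = 27.875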

Most computer programs allow you to look for trends. just to bring some closure to this topic. and none other. First. and we want to see if the change in means is linear or quadratic (that is.22 2 is an r-type effect size that will always yield a number between 0 and 1 and is interpreted as the proportion of the variance in the dependent variable that can be attributed to the independent variable. the granddaddy of sample size calculations. If we calculate both statistics on the ANOVA results in Table 8–2. which is usually what we’re interested in. –2. all the action is contained in the ratio of the difference between means to the SD. We discuss this concept in greater detail when we discuss correlation. it has a completely different form: ω2 = SSbetween – (k – 1) MSwithin SStotal + MSwithin (8–33) 22As a homework assignment. ω2 is smaller than 2. But. we use the coefficients –1. 0. Putting it more directly. and 1. The F-ratio tells us if the association is statistically signiﬁcant.215 and ω2 is 0.MORE THAN TWO GROUPS 87 Now the critical F-value for 1 and 36 df is between 4. and over 0. They differ because 2 reﬂects the amount of variance accounted for in the sample.2155 129. symbolized by the letter d. Planned comparisons should probably be used more. called this an effect size. To complicate things a bit. and for quadratic trend. ﬁrst. make a list of what you think those other factors may be. If you reﬂect on the way this formula works. they remain a quaint curiosity to most investigators. 3. but when it can be. because they’ll look better.146. then Grades 1 and 5 will have similar means.17. whereas ω2 is the variance accounted for in the population. So. and 1 for the groups. concern about the individual comparisons being liberal or conservative is unnecessary—they are all exactly right. In general. We can express the strength of the relationship in terms of a variable called eta-squared and written 2.08 and 4. THE STRENGTH OF RELATIONSHIP The logic behind ANOVA is that we want to see if one variable (in this case. SAMPLE SIZE AND POWER The basic idea for sample size estimation developed in the preceding two chapters is made a little more complicated when we get to one-way ANOVA. if the overall test is not signiﬁcant. resist that temptation. type of condom) is related to another one (here. about 15% of the variance in satisfaction scores can be explained by condom brand. we have to worry about sever- .06 would be moderate. then none of the individual comparisons will be either. and 5. 0. so that almost 22% of the variance in satisfaction scores can be explained by condom brand.14 would be large. The rest of the stuff. we can pull this information out from the ANOVA summary table. satisfaction).15 is large. 78% of the variance results from other factors. not a fancy “w”). we’d ﬁnd that 2 is 0. The advantage of the method is twofold. the zs and such like. Just to remind you. but it doesn’t give us any information about the strength of the relationship. As it happens. then at least one of the comparisons will be as well. For example.15 is medium.875 = 0. According to Stevens (2001). the comparisons provide direct tests of the hypotheses of interest. For this reason. ∩ or ∪ shaped). which expresses the effect of the treatment in SD units. an 2 of 0. Although it’s still a ratio of between to within variances. 2 = 27. Linear and Quadratic Trends There’s one subset of planned comparisons that can’t be used very often. and 0.06 is small. 
The rationale for the latter scheme is that if the relationship is quadratic. Cohen (1977). there’s a second effect size measure for the one-way ANOVA.06 and 0.01 would be considered small. the formula for the sample size for a t-test was: n=2 (zα + zβ)σ δ 2 (8–34) which in the case of a one-way ANOVA is: 2 = SSwithin SSbetween =1– SStotal SStotal (8–31) In our example. if the overall F-test is signiﬁcant. That situation occurs when the levels of the grouping factor have a natural order and are equally spaced. To look for a linear trend. also called the coefficient of determination. 85% of the variance results from other factors. so only the last of these individual comparisons is signiﬁcant. Second. This is always the case. for two reasons. Conversely. the coefficients are 1. That may tempt us to use 2 rather than ω2 when reporting results. the effect of the group differences is contained in this ratio. it’s extremely powerful.375 (8–32) where δ is the difference between the two groups. SSfactor 2 = SStotal (8–30) Cohen (1988) says an ω2 less than 0. between 0. are just niceties related to the arbitrary choice of α and β levels. but because they require a bit of creativity and some manual calculations (instead of simply pressing a button). But things get a bit hairier in the case of ANOVA. which are different from the mean of Grade 3. called ω2 (that’s the Greek letter omega. we may be interested in the body-mass index of children in Grades 1.
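Both effect sizes drop straight out of the summary table. A sketch, using the sums of squares from the one-way analysis:

ss_between, ss_total = 27.875, 129.375
k, ms_within = 4, 2.82

eta_sq = ss_between / ss_total
omega_sq = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)
print(f"eta-squared = {eta_sq:.3f}, omega-squared = {omega_sq:.3f}")
# about 0.215 and 0.147: 21.5% of the variance in the sample, but a more
# modest estimate of the variance explained in the population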

and another about their probable distribution. one-way ANOVA is only one way (ho ho ho) to divide up the world.5 = . One possibility arises when we have three groups: two fairly similar drugs and a placebo. it is not an alternative—it is the only way to proceed when there are more than two groups. Then. and the remaining ones are all bunched up in the middle. In essence. So. a. which shows the sample size per group. we’ve also made up a table that goes the other way. The erstwhile authors in the sample rate butt pain on a 100 mm line. Plausibly. having chosen the appropriate values of α and β.25. one about the average difference between means. and the more complex ANOVA methods that build on this formalism are a powerful and elegant way to view the world of numbers. This means that we have to make a couple of guesses. let’s call the distance between the highest and the lowest mean δ.25 1 ⁄2 (5 + 1) = 1. the two drugs might be clustered together at one end of the distribution of means and the placebo at the other. However. more speciﬁcally.88 ANALYSIS OF VARIANCE al means. k. The formulae that accompany these three patterns are: Minimum dispersion: d 1 2k (8–35) (k 1) 3 (k 1) (8–36) Maximum dispersion: (k odd): d k2 2k 1 (8–37) Maximum dispersion (k even): d 1 (8–38) Here’s how it works.25 3 (5 1) 1 ⁄2 . The f. Table I in the book’s Appendix. is: f = 1. (2) that the individual means are distributed evenly along the 10 mm difference. and N. which he called f. How big a sample size do we need to detect this distribution of differences? First. not just two. for this intermediate distribution of means. which varies depending on the distribution of mean—minimum dispersion (Figure 8–2A). based on previous research or intuition or just plain imagination. d is 10 ÷ 8 = 1. Table J gives you the power of the study for various values of f. Actually. As before. if we had a whole bunch of treatments. d is multiplied by some fancy formula. A Minimum SUMMARY We have already indicted pretty strongly the reasons for using one-way ANOVA: It provides an exact test of the hypothesis for multiple groups and. Suppose we are testing ﬁve different NSAIDs for relief of butt pain resulting from too many hours spent at the old computer cranking out books. B Intermediate C Maximum FIGURE 8–2 Some possible distribution of means in a one-way ANOVA. So. in combination with planned comparisons. let’s think about how the means can be spread out over this interval. another obviously does nothing.442 (8–39) Intermediate dispersion: d 1 ⁄2 Now. and second. is an exact (and elegant) alternative to multiple t-tests. we presume (1) a difference of 1 cm (10 mm) between the best and the worst. Our best guess is that all the drugs are all the same. Cohen (1977) then took the value of d and transformed it into the effect size for the ANOVA. a ﬁrst guess is that they would be equally scattered along the line. and the effect size (δ ÷ s) = d. and (3) the SD is 8 mm. A third variation may be that one treatment is a clear winner. the effect size. what do we do with this? We look it up in a table. or intermediate (Figure 8–2B). . But as we shall see in the next few chapters. the means can be distributed in various ways. but this is not the way to get drug company money. as we’ll explain in a bit. As usual. turn the page. of course. maximum dispersion (Figure 8–2C).
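As we read them, Cohen's three dispersion formulas look like this in code; the NSAID example (d = 10/8 and k = 5 groups) reproduces the intermediate value of about 0.44.

import math

def f_minimum(d, k):          # two extreme means, the rest at the midpoint
    return d * math.sqrt(1.0 / (2.0 * k))

def f_intermediate(d, k):     # means spread evenly between the extremes
    return (d / 2.0) * math.sqrt((k + 1.0) / (3.0 * (k - 1.0)))

def f_maximum(d, k):          # means piled up at the two extremes
    if k % 2 == 0:
        return d / 2.0
    return d * math.sqrt(k**2 - 1.0) / (2.0 * k)

d = 10.0 / 8.0                # 10 mm spread between best and worst, SD of 8 mm
print(f_intermediate(d, 5))   # about 0.44, the value used in the text

With f in hand, you go to Table I (or Table J, for power) exactly as described.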

and click the arrow to move it (them) into the box marked Dependent List [Note: One ANOVA will be performed for each variable listed. Degrees of freedom (between) f. if they remain conscious.] • Click on the grouping variable from the list on the left. choose Compare Means ¨ One-Way ANOVA • Click on the variable(s) to be analyzed from the list on the left. Degrees of freedom (within) g. It’s a slow day in the lab. “Suicide” wings in one joint don’t rate more than a “Medium” in another—or so it seems. Sum of squares (between) b. a.000 when you’re done. quadratic. Probability of F A.MORE THAN TWO GROUPS 89 EXERCISES 1. a. order the platter of “Suicide.82 7 9 10 10 9. and Tukey’s-b].0 0. Decreases with the number of subjects per group ____ F. Sum of squares (within) c. Construct an ANOVA table and see if there is really a difference in suicide ratings among roadhouses. and use the down arrow to select the degree (linear. We randomize diners to diners (so to speak) and they sally forth. click on Contrasts . Increases with the number of groups ____ D. rate ﬁre on the ubiquitous 10-point scale.0 1. click on Contrasts .0 1. Related to the size of the effect ____ B. armed to the teeth with clipboards. We locate 3 different roadhouses and 12 fearless undergraduates. . b. ﬁll in the ﬁrst coefficient and click Add . If you want another set of contrasts. Related to the random variation within each group ____ C. Mean square (within) e. Increases with the number of subjects in each group ____ E. and click the arrow to move it into the box marked Factor • Click the Post Hoc button and select those tests you want [we recommend LSD. otherwise. Mean square (between) d. Continue until there are contrasts for all of the groups. . Now is your chance to ﬂex your computational muscles. and be sure the Coefficient Total on the bottom is 0. Note that each statement may have more than one answer. late at night. so let’s put this one to the test. Where does the difference lie? Do posthoc comparisons using Scheffé and Fisher’s LSD methods. Then click Continue . cubic). and PeptoBismol. Select the answer to each of the following statements from the list below. • If you want to test for linear or quadratic 1 2 3 4 Mean SD 4 4 7 5 5. Tums.41 7 8 6 7 7. The data look like this: Roadhouse Rater A B C How to Get the Computer to Do the Work for You • From Analyze. Tukey. click click Continue Next . check the box labeled Polynomial. Decreases as the signal-to-noise ratio gets bigger ____ 2.” and then. then click Continue • Click the Options button and click on Descriptives and Homogeneity-ofVariance • To run planned comparisons. They screw up their collective courages. F-ratio h. One dilemma facing all lovers of ﬁery food is that different culinary establishments have different standards.41 trend.
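If you would rather type than click, the same one-way analysis takes a few lines of Python with SciPy; the three arrays below are made-up placeholder ratings, not data from the exercise.

from scipy import stats

group_a = [4, 5, 6, 5]
group_b = [7, 8, 9, 8]
group_c = [6, 6, 7, 7]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Levene's test plays the role of the Homogeneity-of-Variance option
w_stat, p_levene = stats.levene(group_a, group_b, group_c)
print(f"Levene W = {w_stat:.2f}, p = {p_levene:.4f}")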

when the data are examined. A better approach would be to deliberately recruit equal numbers of males of both types so that we could eventually compare 20 circumcised to 20 uncircumcised men.375)2 + (3. let alone cause for joyous celebration. what we have been talking about is one possible cause of within-group variation. is whether circumcised males have more—or less—fun than do uncircumcised males.9 – 4. But an experiment such as the one we just did would provide an opportunity to put matters to the test. This latter is called random error. it seems that uncircumcised males rate Brand T higher. there is a still better way. as well as permitting an independent test of a second hypothesis.” but we will leave this until later. but that is just a glib phrase to cover our ignorance of its cause. Surely there must be more than this? Indeed there is. uncircumcised males prefer Brand S.375)2 + (5. and thereby results in a more sensitive test of the ﬁrst hypothesis. we contrasted the variance resulting from different brands against the variance within groups. But as we shall see. So. involving multiple independent factors. Going back to our previous example.1 – 4. We could let nature take its course and examine the ratings provided by males of both types after the fact. A better term would be “unexplained variance. Also. indicating a possible interaction between the two factors. we introduced the notion of splitting up the total variance into components owing to signal and noise. Now we proceed just about as we did before. Although this does not invalidate the test. one other age-old question. we are accounting for some of this variance. The principle is the same: dividing the total Sum of Squares into components because of each factor. There is one other boon to introducing additional factors—the possibility of uncovering interaction effects. and less is left over to go into the “error” term.2 – 4.” Well. By explicitly dealing with this factor. whereas circumcised males rate Brand U higher. the Sum of Squares (Brands) is exactly the same: Sum of Squares (Brands) = 10[(4. introducing a second factor (to the extent that it does contribute to the variance in the dependent variable) reduces the magnitude of the remaining error variance. If circumcision does make a difference. then examine the effects of each singly (main effects) and in combination (interactions). such as. In Chapter 6. using a t-test. It’s difficult for any of us individually to provide evidence on the matter because few among us have had the opportunity to experience sex under both conditions. Let’s think about it a minute. We can easily introduce additional factors in the design.375)2 + (4. But nothing compels us to limit ourselves to only a single factor and single noise term. then the presence of both types of men in the groups has led to some of the withingroup variation. it is less than optimal. The data would now look like Table 9–1.CHAPTER THE NINTH This chapter explores more complex forms of Analysis of Variance. W e have now discovered one of the joys of ANOVA—we can compare multiple groups in a single test without losing track of the actual probability. In fact. It would be just another t-test.875 90 . Factorial ANOVA SETTING THE SCENE The results of the condom experiment are in question. Additional information is derived from the interaction between factors.375)2] = 27. so there may well be a large imbalance in the two groups.3 – 4. “Circumcised males prefer Brand R. When we compared the four brands. 
No account was taken of a second factor—circumcision status. Now, one other age-old question, which has been the subject of endless bits of folklore, is whether circumcised males have more—or less—fun than do uncircumcised males. It's difficult for any of us individually to provide evidence on the matter, because few among us have had the opportunity to experience sex under both conditions. But an experiment such as the one we just did would provide an opportunity to put matters to the test. We could let nature take its course and examine the ratings provided by males of both types after the fact. But the vast majority of men are circumcised, so there may well be a large imbalance in the two groups. A better approach would be to deliberately recruit equal numbers of males of both types so that we could eventually compare 20 circumcised to 20 uncircumcised men, using a t-test. But this doesn't seem such a big deal; it would be just another t-test. As we shall see, there is a still better way.

Now let’s add some more information. .8)2 + (6 – 4. We can now add this effect to the information and determine that the best estimate for the cell means in the top row is now 40% and in the bottom row is 60%.. . consisting of the difference between individual values and their group mean. with feeling. this time multiplying by the number of data in each circumcision group.375)2] = 18.6)2 + (4 – 3.375 Algebraically: SS j nI∑(X.. The Sum of Squares (Error) is conceptually the same as before. circumcised males may express a strong preference for some brands and uncircumcised males for other brands. 10% below. perhaps we can begin with some simpler data. and J is the number of columns (4). .8)2 + (5 – 4.6)2 + . Once more.70 Group mean Brand mean 4.1 As we indicated.225 4. Putting it more simply.8)2 + . However. technical intent to the terminology. with a sample size diddle factor). 5). 20: Sum of Squares (C UC) = 20[(5. there are more group means to consider.8)2 + (5 – 5. as shown in Table 9–2 under the second category.05 Group mean Circumcised 3. + (4 – 4. this time there is only a dry. I is the number of rows (in this case. Computers beat lecturers by 10%. The goal is to determine what the expected average score in each cell would be if there were no interaction.4 4. involving a difference between the two group means and the grand mean. j X .FACTORIAL ANOVA 91 Ramses Sheik Trojan Unnamed Mean TABLE 9–1 Ratings of satisfaction for different condom brands by circumcised and uncircumcised males Uncircumcised 4 4 5 5 6 4.375)2 And again the algebra looks like: SSi nJ∑(Xi. But to illustrate the point. it explores the idea that the value of the dependent variable (satisfaction) may relate in some nonadditive way to the value of both factors. though. .4 3 2 2 2 3 2. and so it consists of terms such as: Sum of Squares (Error) = (4 – 4. Imagine an experiment similar in design to the present one. small groups.2 5 5 6 6 7 5. But suppose we have a bit more information. and boys.1 5. and computers. . .2 4 5 4 4 3 4.8)2 + (5 – 4. 2).05 + (3. It is almost easier to see what an interaction is by ﬁrst considering the appearance of noninteraction.0 3. if we knew only that the average score of all subjects was 50%. The Sum of Squares (Circumcised/Uncircumcised) is exactly analogous.8)2 + (4 – 4.8 3 4 4 3 4 3.8 6 4 5 6 3 4. It’s just the squared differences between the column means and the grand mean (with a sample size diddle factor). A sample of 30 boys and 30 girls is assigned to three different educational programs to teach algebra—lectures. . the equation is: SS within ∑ (X ijk X ij ) 2 (9–3) where j is the subscript for the columns (brand). namely that girls score. 10% above the mean.)2 (9–1) This turns out to be equal to 24.2)2 [over all the top groups] + (3 – 3. then our best guess at the expected mean score in each cell is just that.3 7 8 7 9 6 7. However. and n is the sample size in each cell (in this case.0)2 [40 terms] This is the sum of the squared differences between all the individual data and their respective cell mean (with no diddle factor needed).8)2 + .6 4.0)2 + (3 – 4. This is simply the squared difference between the two means and the grand mean (again. as we show in Table 9–2 under the ﬁrst category. we have one more term in our bag of tricks—it’s called an interaction.80. and lecturers beat small groups 1Given the topic under discussion. interaction seems particularly apropos. This time. on average.8 5. + (7 – 5. X.)2 (9–2) where now i is the subscript for the rows (circumcision status). 
There are 10 boys and 10 girls in each group. Now. 50%.9 2 1 2 3 3 2. + (3 – 2.70 4.

875)2 + (3. Of course. and it is 1 for Circumcision Status. and therefore the Error df is 8 × 4 = 32.92 ANALYSIS OF VARIANCE TABLE 9–2 An example of predicting cell means from overall differences Gender Computer Lecture Group Knowing only overall mean Boys Girls 50 50 50 50 50 50 50 Knowing overall mean and row effects Boys Girls 40 60 40 60 40 60 –10 +10 50 Knowing overall mean. And on the average. row effect. .1 – 4. But so far there is still no interaction among factors.3 So we would predict (if the effects simply added together) that uncircumcised This is.375) = 2.500) = 4. each of the eight groups has ﬁve data points.675 from the mean. so df = 1. and the Girl-Group mean would be higher than 50.425)2] = 58. the Girl-Computer mean would be lower than 70.175 points. and it is a bit more complicated than before.525)2 + .375 + (3. This would constitute an interaction between gender and teaching method. The extent to which the actual cell means depart from this picture of expected means is a measure of the interaction between teaching method and gender. whereas girls did better in groups and worse on computers. so the interaction . . For Circumcision Status. As usual. For Brand. these must be multiplied by the cell sample size. So we create an interaction term. But if. The next step. 4. for example. + X .8 – 4. uncircumcised males scored Brand R at 5. Note that the marginal differences remain the same as in the third category.875.175 points..6 – 3.8. the sum of the differences between the individual cell means and what we would have expected if there were no interaction. which is (4.500 points above the overall average. so it all looks like: Sum of Squares (Interaction) = 5[(4. So. or 0. so they would be (0. The ﬁrst and second terms are based on the expected values we have already calculated. the Boy-Group mean would be lower than 30.8 – 4.525)2 2This must be a hypothetical example.675 – 0. ) 2 (9–5) The interaction between two variables is the extent to which the cell means depart from an expected value based on addition of the marginals.425. there are only two data points. Working out the logic of the df for the interaction term can weary the mind beyond repair.375 + 0. Brand R is a bit below par—4.7 – 4.925. + (4.05 versus 4.0.375 + [–0.j X i. it’s the same as before—there are four data points (the sums) so df = 3. they actually average 4. with one ﬁnal diddle factor n for good measure. uncircumcised men really do have more fun—5. as before. noncommercial view favored by academics who haven’t a ghost of a chance at making any entrepreneurial money.175).2 If we add in these effects. and its expected value (4. The df for Brand are 3.175 – 0. the sum consists of 8 terms. which means that the df for each group is 4.0 – 2. we want to show an overall interaction across all brands. and column effect Boys Girls 50 70 +10 40 60 0 30 50 –10 –10 +10 50 Including interaction terms Boys Girls 65 55 +10 40 60 0 15 65 –10 –10 +10 50 males using Brand R would be up 0. or 0.375) + (3.5 when we expected 4.6 – 3. and look like: (4. In the end. The data might look like that in Table 9–2 under the fourth category. for example. if boys did much better on computers and worse in groups. As we see.675]) = 3. the Boy-Computer mean would be higher than 50. is to determine the degrees of freedom. then.675 points better. There has never been a convincing demonstration that any curriculum approach is any better than any other.375. by 10%. For the Error Sum of Squares.2 versus 4. 
The extent to which the actual cell means depart from this picture of expected means is a measure of the interaction between teaching method and gender.375. in this case. taking the usual nonpartisan. though. the simplest way to ﬁgure it out is to multiply the dfs of the main effects that enter into it. This must be done for each Sum of Squares.875)2 + (3. and down from it 0. which is pretty close to expectation. 3Note from co-author: It is obvious that this example was construed by a member of the uncircumcised tribe. 5.525. the last of which is the squared difference between the observed value of the bottom right cell. and circumcised males averaged 2. on the average. we can suspect some suggestion of a relationship (or an interaction) between circumcision status and condom brand.475 (9–4) And of course we feel duty-bound by now to furnish the masochists with yet another algebraic equation: SS interaction n ∑ (X ij X.9 when we expected (4. or 0. But. we would guess that the expected values in the cells are as shown in Table 9–2 under the third category. we are not speciﬁcally interested in the interaction only with Brand R. Applying the logic to our present data. which is based on the difference between the observed cell means and that which we would expect based on the marginal means. The calculation of expected means is also called an additive model.
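Equations 9-1 through 9-5 can be wrapped up in one short function. The sketch below works on a data array shaped (rows, columns, observations per cell); the tiny array at the bottom is invented purely to show the mechanics, not data from the chapter.

import numpy as np

def factorial_ss(data):
    I, J, n = data.shape
    grand = data.mean()
    row_means = data.mean(axis=(1, 2))            # e.g., circumcision status
    col_means = data.mean(axis=(0, 2))            # e.g., brand
    cell_means = data.mean(axis=2)

    ss_rows = n * J * np.sum((row_means - grand) ** 2)            # Equation 9-2
    ss_cols = n * I * np.sum((col_means - grand) ** 2)            # Equation 9-1
    expected = row_means[:, None] + col_means[None, :] - grand    # the additive model
    ss_inter = n * np.sum((cell_means - expected) ** 2)           # Equations 9-4 and 9-5
    ss_within = np.sum((data - cell_means[:, :, None]) ** 2)      # Equation 9-3
    return ss_rows, ss_cols, ss_inter, ss_within

data = np.array([[[4., 5.], [6., 7.]],     # row 1: two cells, two observations each
                 [[2., 3.], [3., 4.]]])    # row 2
print(factorial_ss(data))

The four pieces add up to the total Sum of Squares, which is a handy check that the bookkeeping is right.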

38 3 1 3 32 39 9. Underlying the idea is a fundamental notion. Uncircumcised males do have more fun.29 18.8 FIGURE 9–1 Sums of squares and interactions caused by factors and interactions. weighted by an n or two here and there.99 and the probability has gone down correspondingly.99 23. it was almost straightforward: the expected mean square between groups was the sum of the variance between groups and the variance within groups. In the present situation. some authors state that the term “error” is misleading and replace the term with “within” or Brands 27. it is simply variation for which we have no ready explanation.52 25. As it turns out. the sections in the ﬁgure have an area proportional to the relevant sum of squares.78 11. although the Sum of Squares and Mean Square for brand are exactly the same as before.0001 “residual. The idea is illustrated in Figure 9–1. Error variance is not really error at all. which we mentioned in the beginning of this chapter. The total df must equal the total number of data points minus 1. Finally.23 19. the table (Table 9–3) has a few more lines in it than did the one-way table.4 Brands 27. or 39. As a result. The Expected Mean Square for a main effect or interaction of a variable contains other terms from interactions as well as the error term.” However. And the more explanatory variables that are introduced—to the extent that they really do explain variance—the smaller will be the unexplained. and the expected mean square within groups was the within-group variance.9 Within 24. or error. Last time around. see below).” SUMS OF SQUARES AND MEAN SQUARES FOR FACTORIAL DESIGNS In the last chapter.80 Total 129. it can result in a less powerful test of the remaining factors. in repeated-measures designs we describe in Chapter 11.5 Circ/uncirc 18. Because the Sums of Squares are additive. however. the interaction between the two factors is signiﬁcant (whatever that means. Finally. Note that. variance.48 Brand Circumcision Error 24. There is a difference in brands. Obviously. Because each variable costs at least one df.2 Within 83. It is subject to the law of diminishing returns. It is now evident that all the factors are signiﬁcant. “error. From our discussion we have: df (total) = 3 + 1 + 3 + 32 = 39 (9–6) Source Sum of squares df Mean square F p TABLE 9–3 ANOVA table for two factors (brand and circumcision) so the arcane logic above must be right.23 status 58. a sum of variances that together represent the expected value of the calculated mean square. For this reason.88 Circumcision 18.0001 <. after all the fooling around. Why? Because we have managed to move some of the variance that was previously contained in the error term into variance attributable to circumcision status and to the interaction between brand and circumcision. we introduced you to the notion of an Expected Mean Square. In deference to terminology. the error term has shrunk. if a variable is not accounting for a signiﬁcant proportion of the variance. we distinguish between “within subject” and “between subject” sources of variance. we have many more possible variances that could enter the sum.FACTORIAL ANOVA 93 term will have 3 × 1 = 3 df. the F-test has gone up to 11.0001 <. the conceptual rule is as follows. . we are ready to put it together into an ANOVA table.49 .15 <. we call the variance term expressing variance not resulting from any of the identiﬁed factors in the design. 
What that bit means is this: the expected mean square for the interaction between Circumcised/Uncircumcised and Brand contains σ² (Brand × Circumcised/Uncircumcised) and σ² (Error).

This is only one of several possible types of interactions. so there is no effect of brand. One version. if we take the average of the two points. σ2(Brand) or σ2(C/UC). we violated one cardinal rule of data analy- 8 Uncircumcised Circumcised Degree of pleasure 6 4 FIGURE 9–3 Interaction between brand and circumcision.94 8 ANALYSIS OF VARIANCE Uncircumcised Circumcised Degree of pleasure 6 4 FIGURE 9–2 Pleasure rating by brand and circumcision status. as well as the interaction. Using the same kind of analysis on the lower left. then they cannot say that the effect of treatment is equal to such-and-such. giving a main effect of circumcised status. Moreover. σ2 (Brands × Circumcised/ Uncircumcised) and σ2 (Error). There is a divergence of opinion about interactions. is that one should test only one hypothesis. MS(Brand × C/UC). 5It was Albert Einstein who said that “Everything should be made as simple as possible— and no simpler. but no interaction. All are multiplied by ns here and there. it is synonymous with synergy. uncircumcised males express a strong preference for the T brand and circumcised males for the U brand. the bottom right has everything going on—none of the means are the same as any other and the lines are not parallel—so there are both main effects and an interaction. Obviously the drug companies like this approach because if you are testing their drug against a placebo. If you are still having trouble conceptualizing the idea of interaction. If we just squinted at Brands R and S. as we showed above. one on top of the other. sis—ﬁrst. In the top right. so there is no effect of brand. For some unexplained reason. the mean scores for circumcised and uncircumcised are the same. but the uncircumcised are always above the circumcised. First. But the mean values of T and U present a very different picture. some of the mysteries of the analysis might have become clear. however. Everybody likes S a bit better. but displaced. the average for T and U is the same. so the test of the interaction is the ratio of this mean square to MS(Error). using obscure rules that we will avoid. the lines are parallel. σ 2 (Brand × C/UC) and the error term. the whole is greater than (or less than) the sum of the parts.5 But there are several reasons to contemplate including more than one variable. in Table 9–3 we have followed the lead of most computer programs and used MS(Error) as the denominator. But a strong interaction is in evidence because the uncircumcised strongly prefer T and the circumcised prefer U. all is as expected. The expected mean square for the main effect of Brands contains σ2 (Brands). the correct denominator for the F-ratio testing the two main effects should be MS(Brand × C/UC). The reason for this will be explained shortly. The interaction term. MS(Error) just equals the error variance. Look at Figure 9–2. Therein lies the explanation for the strong interaction term uncovered in the ANOVA. graph the data. so there is no main effect of circumcision status. a gallon of gasoline alone has little free energy. some of which are shown in Figure 9–4. σ2(error). such as “The drug works”— preferably with only two groups. Put them together. A match alone has little free energy. This approach has one and only one virtue—simplicity. There are main effects of brand and circumcision status. If we had done so. This is magniﬁed in Figure 9–3. and uncircumcised males enjoy sex more. circumcision sta- GRAPHING THE DATA (TAKE 1) In our excitement to explore the delights of factorial ANOVA. 
for T and then for U. The extent to which the lines are not parallel is an indication of the presence of an interaction. they are the same. Finally. if you can account for some of the variance with another variable—in this case. 2 0 Ramses Sheiks Trojans Unnamed 4If you can’t resist exploring the rules more (masochist!).” Circumcised/Uncircumcised) and σ2 (Error). σ2(error) and σ2(Brand × C/UC). Similarly. particularly prevalent in epidemiology. The two main effects each contain a variance component for the main effect. see Glass and Stanley (1970). 2 0 Trojan Unnamed . σ2(error). in the section Random and Fixed Factors. there is no chance that some other company’s drug may come out better. In the top left graph. Some folks hate ’em because if an interaction exists. the lines are not parallel. and suddenly you have a lot of energy (and synergy.4 The effect of all this is that different effects require different error terms. Strictly speaking. so there is an interaction. contains two terms. too). so the effect of circumcision is the same for both T and U.

As it turned out. Main effect of circumcision. which is much less plausible. not human). experts are not randomized. This time there was no effect of expertise—everybody recalled about 20%. The predictable result would have been a few more stomach ulcers and no beneﬁt for the women. But de Groot didn’t stop there.FACTORIAL ANOVA C 95 8 A 8 Uncircumcised Circumcised Degree of Pleasure Degree of Pleasure 6 6 4 4 2 2 0 Trojan B 0 Unnamed 8 Trojan Unnamed D 8 Degree of Pleasure 4 Degree of Pleasure 6 6 4 2 2 0 Trojan Unnamed 0 Trojan Unnamed FIGURE 9–4 A. if the effect had been shown to be signiﬁcant without the analysis by gender. this bias might indeed explain the results. If the researchers had analyzed the data without including male/female as a factor in the design. For example. But if the conclu- sion is based on an interaction. tus—then you can increase the power of the statistical test of the primary hypothesis. Main effect of brand and circumcision. there is the glory of interactions. no interaction. one study showed that if you take a group of patients with transient ischemic attacks. and did the same thing. Main effect of brand and circumcision. the recommendation would have been to treat everyone with aspirin. on a long voyage to America in the 1940s. himself included. the single best predictor of expertise in chess was the ability to recall a typical mid-game position. Maybe chess playing results in biochemical changes that increase memory. D. if all we had was an overall risk reduction of 10%. As an example. and age. experts could recall about 90% of the pieces. Adrian de Groot (1965) studied a group of chess masters. post-hoc hypotheses would be hanging off every tree. they would have concluded that the effect was only about 10%. Now. novices about 20%. signiﬁcant interaction. He also placed the pieces at random on the chess board. After all. Maybe experts are older. signiﬁcant interaction. After a few seconds. we must now explain why unblinding of the physicians would result in a bias in assignment to treatment for only the males. C. and most psychologists were studying rat and pigeon memory. being skeptical that aspirin could possibly work. Secondly. perhaps physicians were unblinded and. results in increased memory performance (this was the 1940s. we actually often go looking for interactions. put only the patients with a milder stroke episode on aspirin. Methodologic beneﬁt is also gained from designing interactions into the study. As a second example of a deliberately manipulated interaction. We believe that it provides much stronger information than a main effect. So he ended up . so maybe they are self-selected with better memories. it is fair to say that all psychological studies of expertise date back to the studies of a single investigator. up to a point. aspirin reduces the likelihood of a subsequent stroke by about 20%—but only for men. no effect of brand. In addition. signiﬁcant interaction. Now if he had left it at that and done a t-test on two group means. B. which in this study would have no longer been statistically signiﬁcant. Suppose we had reason to suspect a bias in the study. No main effect of brand or circumcision. In designing our experiments.

that it’s “the extent to which the cell means depart from an expected value based on addition of the marginals. we’ve graphed the data from the last part of the table. If you were to browse the shelves of the local drugstore or other sex shops. The problem is that plotting the observed means may give a misleading picture when one or more of the main effects is signiﬁcant. He then went on to theorize that experts are able to “chunk” the data. Let’s go back to the example in Table 9–2 and see what difference this makes. we won’t be able to make a statement about them. Nevertheless. If only one main effect is signiﬁcant in a factorial design. main effects and two-way interactions for three-way interactions. but you don’t lose anything by adjusting for nonsigniﬁcant effects. Score Boy s 30 RANDOM AND FIXED FACTORS Although it might not have been obvious when we began. there is a subtle difference between our two independent factors. for boys taught by computer. for boys taught by lecture.96 ANALYSIS OF VARIANCE with an interaction between expertise and real/random position. If we want to show the results so that the adjusted and unadjusted grand means are the same. and middling when lectured to. Data from Table 9–2. B. which in turn are better than group instruction. 15 0 Computer Lecture Program Group B 75 60 Girls FIGURE 9–5 A. For this reason. Once we’ve accounted for the fact that girls are better than boys in algebra. If none of the lower-order effects is signiﬁcant. and the alternate hypotheses came tumbling down. so as to reduce memory load. without adjusting for main effects. poorest when taught in a group. The problem with trying to interpret the picture in this way is that there are main effects of both gender and teaching method. showing the effects of different types of education in algebra for boys and girls. the relationship goes the other way. the adjusted cell mean is 65 – 40 – 60 + 50 = 15. and so on). if we didn’t study Rainbow Delights. and girls are better than boys with group instruction. The result is that this one paper has directed the last 50 years of research in expertise. What we should really do (and we’ll show you how in this section) is to adjust the observed cell means to take these main effects into account. and so on. and then adding in the grand mean. The ﬁgure shows us that there is an interaction (as we’ve said). The brand factor contained only a few of the possible “levels” of the factor. we over-simpliﬁed things a bit. But if we don’t ﬁnd a difference across the four we chose. we would then add the grand mean (which is 50) to each cell. GRAPHING THE DATA (TAKE 2) We have a confession to make: When we described graphing interactions. In Part A of Figure 9–5. The results are the deviations from the expected value for each cell. with boys doing best when taught by A 75 6We just love it when we can quote ourselves. a different (and more accurate!) interpretation of the interaction emerges: Gender doesn’t make a difference with lecture-based material. The same data after adjusting for main effects. The moral of the story is clear—don’t try to interpret graphs of interactions or the cell means without ﬁrst removing the lower-order effects (main effects for two-way interactions. Not exactly of course. expertise in chess resulted in better memory performance in chess. it’s not necessary to do any adjustment. Girls 60 45 computer. and graphed it in Part B of Figure 9–5. our hope is that the results can be applied to other brands. 
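In code, the adjustment is one line of array arithmetic. The numbers below are the boys-and-girls teaching example from Table 9-2; subtracting the row and column means and adding back the grand mean leaves only the interaction part, and adding the grand mean once more puts the result back on the original scale for plotting.

import numpy as np

cells = np.array([[65.0, 40.0, 15.0],     # boys:  computer, lecture, group
                  [55.0, 60.0, 65.0]])    # girls

grand = cells.mean()
row_means = cells.mean(axis=1, keepdims=True)
col_means = cells.mean(axis=0, keepdims=True)

deviations = cells - row_means - col_means + grand   # interaction part only
for_plot = deviations + grand                        # back on the original scale
print(deviations)   # boy-computer: 65 - 40 - 60 + 50 = 15, as in the text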
you would ﬁnd dozens or hundreds of other brands. and that computers are better than lectures. for girls. Clearly. So. Computer Lecture Program Group 0 The same cannot be said for the circumcised/uncircumcised factor. it is 40 – 40 – 50 + 50 = 0. but better than boys with the other two teaching formats. using memory for previous positions.”6 and subtracting the main effects from the cell means. what we’re doing is going back to the deﬁnition of an interaction. you need adjust only for that one. It is almost as if we randomly sampled the brands in the study from a population of possible brands. The graph also tells us that girls do poorer than boys with computers. and the intent is to generalize to all other levels. brand is considered a random factor. In essence. We remove these effects by subtracting the row and column means from each cell. boys are better than girls with computerized teaching. We need not generalize . 45 Score Boys 30 15 A random factor contains only a sample of the possible levels of the factor. we presume that we wouldn’t ﬁnd a signiﬁcant difference among any four brands. We’ve done this. Either a male is circumcised or he’s not.

Education interventions hopefully have some lasting effect. and the F-tests for main effects are as shown in Table 9–3. we cannot have patients both have their appendices out and keep them. crossed designs are more powerful because they create the possibility of examining interactions as well as main effects. And most surgery is a one-way street. we treat Brand as random. and so.” However. we assigned individual subjects to only one cell or level of both factors. Neither is signiﬁcant. if we treat Brand as ﬁxed. This was not absolutely necessary because we could have had circumcised men use R and S and uncircumcised men use T and U. CROSSED AND NESTED FACTORS We are not quite through with the generation of jargon yet. that is.78. for circumcised/uncircumcised it is the interaction term. treating the factor as random may still reduce power. which we chose to make nested. since the degrees of freedom for the interaction is much less (3 instead of 32 in this example). Having said all that. of course. whereas circumcised males prefer Trojans. so the Brand × Circumcision interaction becomes the error term for the test of the main effects. and often very powerful to have crossover drug trials in which a subject gets one drug for a certain period and a second drug for an alternate period. Why is this so? Well. if Brand is a random factor. Who cares about the distinction? Unfortunately. we end up with nested factors. since it amounts to adding some component of error derived from the interaction term. it would not have been possible to state whether circumcised males preferred some brands and uncircumcised males preferred other brands. It’s easy. uncircumcised males the other. A factor can also be ﬁxed if we have other levels of the factor but we do not wish to generalize to them. then the denominator for brand is the within-error term. “Uncircumcised males prefer Ramses. Two factors are crossed if each level of one factor occurs at all levels of the other factor. In particular. in general. For example. we ensured that both circumcised and uncircumcised subjects tried out every brand. the fact is that most of the time most computer programs never ask.935 for Circumcision Status. then the correct error term is Mean Square(Error) which equals 0. Two variables are nested if each variable occurs at only one level of the other variable. If. and the F-tests for the two main effects are 0. “Circumcised males prefer some brands. We could have crossed subject with brand (i. In general. especially among those who were already circumcised. and we’re left with a conclusion of the form. we call this a ﬁxed factor. then the four brands we picked are just a sample of all possible brands. the effect of treating a variable as random is to reduce the power of the test. have each subject try out all brands) but chose not to so we wouldn’t tire out the poor dears. If we did. which is quite reasonable. In the present design. Even if the difference between the two mean squares is small (which amounts to a nonsigniﬁcant interaction). this doesn’t generalize well. then race remains a ﬁxed factor. For example. whites.e. however. These are only a sample of all possible races. without bothering to tell you why it is so. if Brand is a ﬁxed factor. all to a worthwhile end.477 for Brands and 0. the effect of the drug is gone and the subject is okay. then the interaction between Brand and Circumcision Status is informative.7 Crossing and Nesting are just technical terms. For this reason. 
we could still have made a perfectly legitimate statement about the differences among brands overall (the main effect of brand) and the effect of circumcision (the main effect of C/UC). Thus the two factors are said to be crossed. and so. both circumcised and uncircumcised men sampled all the levels of the brand factor. uncircumcised males prefer other brands. The design we used to address this question was only one of a number of possibilities. we hope. A complete nesting would require that we test only two brands. a study done in the United States might include blacks. . which drives up the value of the critical F-ratio. the Mean Square(Interaction) will always be larger than Mean Square(Error). you have to when you move to more complex ANOVA designs.” That really is just error when it comes to deciding which brands are better and which are worse. in complex designs the choice of error term becomes a bit complicated. of necessity. with circumcised males using one. and the choice is further complicated by the ﬁxed versus random issue. it would be hard to have hospitals doing cost containment one month and not the next. A ﬁxed factor contains all levels of the factor of interest in the design. It comes down to the statistical notion of population. but if the results of the study are applied only to these three. straightforward. it is impossible or infeasible to cross some factors. In the present example. Generally speaking. if brand is a ﬁxed factor. In the present example. since we can now say something like. Curative drugs such as antibiotics are a one-shot affair. If we had used the other approach instead. as long as there were equal numbers. But they describe differences that have profound implications for analysis. Unfortunately. and Hispanics. Conversely.49. which equals 19. then the error term for the main effects is MS(Brand × Circumcision). similarly. 7Attempting to cross subjects with circumcision status would have led to severe problems in recruitment. a shorthand way of communicating about experimental designs. As we pointed out earlier. we would have said that C/UC was partially “nested” in brand.. One other variable in the present design is subject. This works nicely because there is a “washout” effect: After some period.FACTORIAL ANOVA 97 beyond the two levels of the factor included in the study. instead of the street deﬁnition. However.
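The arithmetic behind the choice of error term takes only a few lines. This is a hedged sketch with hypothetical mean squares and degrees of freedom (not the values from the condom study): when the factor is fixed, the F for Brand is formed against MS(within); when it is random, against MS(Brand x Circumcision), which is both larger and has far fewer degrees of freedom.

from scipy import stats

# hypothetical values, for illustration only
ms_brand, df_brand = 2.40, 3
ms_interaction, df_interaction = 1.90, 3    # Brand x Circumcision
ms_within, df_within = 0.80, 32

F_fixed = ms_brand / ms_within               # Brand treated as a fixed factor
F_random = ms_brand / ms_interaction         # Brand treated as a random factor
print(F_fixed, stats.f.sf(F_fixed, df_brand, df_within))
print(F_random, stats.f.sf(F_random, df_brand, df_interaction))
# same data, but the random-factor test uses a bigger error term and fewer error df,
# so the Brand effect has a much harder time reaching significance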

it’s unbalanced or non-orthogonal. but it matters a lot in factorial designs. 1981). Note that factorial ANOVA bears no relationship to factor analysis. but we have no way of knowing that. and people were lost at random (e. which involve using the same patient at various levels of the other factors. we are happy to report. even though this topic is also referred to as orthogonal and non-orthogonal designs. Three problems are given on each of cardiopulmonary and respiratory systems. it would total to more than the total sum of squares. obviously a sign that something is seriously wrong. and similarly. Rates of C-sections were determined for each physician in the treatment and control hospital: Hospital is nested in treatment Physician is nested in hospital and treatment 2. knowing circumcision status. If we then added up the Sums of Squares of all of the effects. Our primary example in the chapter involved only two factors. in proportion to what it would be if the design were orthogonal. We didn’t mention this at all when we were discussing one-way ANOVA. 1. we’ll just give you the bottom line. we’ll simply see if circumcised and uncircumcised men prefer designer brand or generic condoms. Because there are relatively few uncircumcised men. Here are some other examples of crossed and nested designs. and relatively few generic condoms around. In other words. Rather than bore you with the messy details of each proposed solution. different methods available. crossed with treatment 3. At this point. we’ll address the far more mundane topic of the number of subjects in each cell. the higher the degree of correlation between the main effects. BALANCED AND UNBALANCED DESIGNS No. because. so that it’s counted twice (Spinner and Gabriel. Let’s take a simple example and show why. if the number of people in each cell isn’t the same. except the similarity in names. What else can account for articles about “therapeutic touch” (which involves no touching at all) or “iridology”? Furthermore. represented as both computer and paper questions: Format (computer/paper) is crossed with system Problem (e. we are not going to lecture to you about research designs that reﬂect deranged minds. so it is called a two-way factorial ANOVA. and sometimes four. our survey ends up with the situation depicted in Table 9–4. we would end up giving it to each main effect. and computer packages handle such complicated designs with ease. and most statistical computer packages have at least three. An educational intervention involves completing several computer and paper problems on two organ systems. An intervention to convince obstetricians to reduce their rate of cesarean sections was conducted at a random sample of hospitals in the state. though.98 ANALYSIS OF VARIANCE The factor most commonly “crossed” with other factors is the subject or patient.g. the literature on the best way to handle this situation has grown astronomically. to circumcision status. All the other. you can pretty much predict what type of condom the person uses. The greater the imbalance. we’re not going to modify our example to discuss the advantages of the pill versus the condom for birth control. Instead of comparing condom brands. are treated in a randomized trial with cyclosporine. so why are we bringing it up here? The simple reason is that it doesn’t matter at all in the case of the one-way ANOVA. although Lord knows there are enough of them around. If the sample sizes are the same in all of the cells of a factorial design. by Overall and Spiegel. 
more complex designs are simply called factorial ANOVAs. Chapter 11 is devoted to analysis of such designs. we have to determine why there are unequal Ns in those cells. the major computer packages got it right. there is a 2-week washout followed by 6 weeks on the alternative therapy: Treatment is crossed with gender Patient is nested in gender. chest pain) is nested within system Format is crossed with problem As these examples illustrate. just because they involve many factors. to both. The problem facing us is what to do with this shared variance. it is easy to go from one to two or more factors. Each patient is randomized to receive either cyclosporine or steroids for 6 weeks. Patients with lupus. Factor analysis is covered in Chapter 19. they didn’t show up because of a snow storm or the research assistant lost TABLE 9–4 Sample sizes in a study of condom type and circumcision status Circumcision Staus Condom Type Circumcised Uncircumcised Total Designer Brand Generic Total 28 12 40 10 5 15 38 17 55 . or to neither? What we would like to do is give some of it to each effect. it is called balanced or orthogonal.. We won’t attempt to do the analysis for these designs because it gets very hairy very fast. The problem is that. If we blindly charge ahead and use the formulae for a balanced design. Winer (1971) covers many complex designs.. each main effect tells us something about the other effect. There are two possible reasons: we started off with a balanced design. First. knowing what type of condom is used tells us a lot about what is placed inside it. was written in 1969. Rather. do we allocate it to condom type.g. The classic article on this topic. Since that time. 50 males and 50 females.
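For the unbalanced case, Python routines mirror what the major packages do. Here is a hedged sketch that uses the cell counts from Table 9-4 but invented pleasure ratings, and asks statsmodels for Type II and Type III sums of squares (for Type III to mean what SPSS means by it, the factors need sum-to-zero coding).

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for ctype, circ, n in [("designer", "circ", 28), ("designer", "uncirc", 10),
                       ("generic", "circ", 12), ("generic", "uncirc", 5)]:
    for _ in range(n):
        rows.append({"ctype": ctype, "circ": circ, "pleasure": rng.normal(5, 1)})
df = pd.DataFrame(rows)

m = smf.ols("pleasure ~ C(ctype, Sum) * C(circ, Sum)", data=df).fit()
print(sm.stats.anova_lm(m, typ=3))   # Type III: each effect adjusted for all others
print(sm.stats.anova_lm(m, typ=2))   # Type II, for comparison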

then η2 is the way to go. because it gives the most conservative answer. pick the effect (or effects) you really care about. When we do it. or whether the shared variance shouldn’t be credited to either main effect. p That being said. and bash off the sample size.FACTORIAL ANOVA 99 the data). You create an effect size. Klein. However. as you remember from Equation 8–30. p SAMPLE SIZE CALCULATIONS FOR FACTORIAL ANOVA DESIGNS You won’t be blamed for rereading the following. You’ll have to go back to the ANOVA table and do the math yourselves to ﬁgure out what was reported. excluding the effects of the other factors from the non-error variation. but in factorial designs η2 is p always larger than classical η2. equally expert (e. The one-way ANOVA case covered in the last chapter led to all sorts of conditions and ramiﬁcations. this corresponds to what Overall and Spiegel called Method 1. in practice. it will be lower. However plausible the exercise may be in theory. classical η2 or η2? The p answer is. When the non-orthogonality reﬂects differences in group sizes in the population. in particular. If imbalance is due to the ﬁrst reason— random loss of subjects—then the consensus is fairly clear: use the default option of most computer packages. the results could be horribly misleading. Let’s start off with η2. each telling us something different. because it will eat up the shared variance. SSEffect – dfEffectMSError SSEffect + (N – dfEffect)MSError (9–9) THE STRENGTH OF RELATIONSHIP REVISITED In the previous chapter. we use exactly the same strategies for sample size calculations related to main effects as we did in the one-way ANOVA case. 0. and 0..14 to be large (Stevens. Some experts recommend sticking with Type III. and the denominator requires even more guesswork. 2004). naturally. which is p deﬁned as: 2 ηp = SSFactor SSFactor + SSError (9–8) yields the proportion of the total variance owing to that factor. The concept is straightforward enough. As with classical η2. it’s probably better to use partial ω2 (ω2) rather than η2 for the reasons we outlined in p p the previous chapter.g. Unless you have strong reasons on which to base your decision. which. Life becomes a bit more complicated with factorial ANOVAs. Others. if you can’t.g. which is called the Type III Sum of Squares. (b) stick with a Type III solution. treat it as a difference among means. then use classical η2. some versions of statistical programs got it wrong. they are the proportion of the total variance owing to the factor in the sample (η2) and the population (ω2). of the unique variation of the dependent variable. we’d consider a value of 0. 2004).01 to be small. more women than men suffer from osteoporosis). no. As it turns out. these solutions have their own problems associated with them. things start to get a bit hairy. Then you go to the table and look it up. now that we have really hairy designs. “What do you want to know?” If you’re interested in dividing up the total variance in the DV among the various factors and interactions. on the other hand. say you should use a Type I or Type II. The reason is that η2 is not a measure p . So which do you report. this time based on the difference between the cell means and an expected cell mean based on the main effects. p the formula for ω2 is: p 2 ωp = 8Just to add to the confusion. the loss of people doesn’t affect the results too much. was: 2 and the same considerations apply in choosing between ω2 and ω2 as we discussed earlier. However. 
as some of the variation can be accounted for by the other factors (or interactions). because the results more accurately reﬂect the effects in the population.8 Because it doesn’t give greater weight to cells with larger sample sizes.06 to be moderate. because there are two of each coefficient to choose from. we introduced eta-squared (η 2 ) and omega-squared (ω 2 ) as indices of the strength of the relationship. If you want to see how much each factor or interaction affects the DV in its own right. divided by an estimate of the withincell SD. 2001). but classical p η2 never can. calculating the numerator of the effect size means guessing a minimum of four cell means and four row and column means (for a simple 2 × 2 case). leaving little for the other effect. and ignores the other factors.. the situations where there is enough information available a priori are so limited that the exercise is one of futility. Block. the two η2 are identical. the sample size issue will be horrendous! Amazingly. In the one-way ANOVA. as the estimate of the population parameter. It’s quite possible for η2 to sum over 1. or the imbalance reﬂects group sizes in the population (e. and Aguinis.0 across all the factors. but a more accurate reﬂection of what (if anything) is going on. calculating η2. more boys than girls have attention deﬁcit disorder. Analogously to the expansion of the formula from η2 to η2. We don’t know why either. it’s safest to (a) avoid imbalance if at all possible and. Partial η2 (η2). Surely. = SSfactor SStotal (9–7) This is sometimes referred to as classical η2 (Pierce. You have to decide which main effect is more important than the other. Complicating our lives even more. we again reduce the comparison to a contrast between two means and use the basic formula. So. and p then changing the formula they use in later versions. Interactions are more complicated. Regardless of the design.
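The three indices can be written as one-line functions of the ANOVA-table entries, following Equations 9-7 to 9-9 above; the table values plugged in at the bottom are hypothetical, just to show the calls.

def classical_eta_sq(ss_effect, ss_total):
    return ss_effect / ss_total                              # Equation 9-7

def partial_eta_sq(ss_effect, ss_error):
    return ss_effect / (ss_effect + ss_error)                # Equation 9-8

def partial_omega_sq(ss_effect, df_effect, ms_error, n_total):
    # Equation 9-9
    return (ss_effect - df_effect * ms_error) / (ss_effect + (n_total - df_effect) * ms_error)

# hypothetical 2 x 2 design with N = 60: SS(A) = 30, SS(B) = 20, SS(AxB) = 10, SS(error) = 40
ss_total = 30 + 20 + 10 + 40
ms_error = 40 / 56
for name, ss, df in [("A", 30, 1), ("B", 20, 1), ("AxB", 10, 1)]:
    print(name, classical_eta_sq(ss, ss_total),
          partial_eta_sq(ss, 40),
          partial_omega_sq(ss, df, ms_error, 60))

Note that the classical values sum to the proportion of variance explained, whereas the partial values can add up to more than 1 across effects, which is the distinction drawn above.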

For the following studies. Now go back over the list of factors and decide which are random and which are ﬁxed effects. and the other with many unsynchronized stop lights and slow-moving rats ahead of them. ﬁgure out the design. and decide which factors are crossed and which are nested. By inspection.0 4. antacid. In the kitchen a sealed envelope tells the chef to dish out a Mild or Suicide. Patients are further subdivided into lateral (one cheek) and bilateral (both cheeks).100 EXERCISES ANALYSIS OF VARIANCE Rater Heat Roadhouse 1 2 3 4 Mean 1. 2. see if you can work out the design. identify the independent and dependent variables. then click Continue [Note: You can run these only if there are more than two groups.0 Mild 3. and click the arrow to move it into the box marked Dependent Variable • If you have any ﬁxed effect(s). they are all sacriﬁced and the size of the stomach lesions calculated. b. and one third get milk and digestive biscuits. We get a total of 24 undergraduates and send them into the assorted roadhouses. one third get antacids. a. click on it (them) from the list on the left. If you are up to it. d. SS (Within) equals 26. Groups of laboratory mice from a particular ulcer-prone strain are assigned to different mazes. Pass = 2.] • Click the Options button and click on Descriptive Statistics and Estimates of effect size. Let’s return to the roadhouse example of Chapter 8. e. Plot the data. and Tukey’s-b]. A predictive validity study examined whether undergraduate grades predicted success or failure in podiatry school.0. Tukey.0 3. One third of the mice in each group get beta-blockers. How to Get the Computer to Do the Work for You • From Analyze. 3.. c. Each student rates the platter of wings for heat on a 1-point scale. Success/failure was classiﬁed as Honors = 3. patients with acute gluteitis maximus (pain in the butt) are rated by their chiropractor as to the likelihood of a successful outcome on a scale of 1 equals never to 10 equals a complete cure. Five beer brands and ﬁve ales are each rated for quality by four engineering undergraduates on a scale from 1 equals slop to 9 equals super. Now the data look like this: Suicide A B C A B C 4 7 7 4 3 1 4 8 9 2 3 3 7 5 6 7 10 10 6 4 2 4 2 2 5. At the beginning of a course of manipulation. To ease the pain. and click the arrow to move it (them) into the box marked Fixed Factor(s) • Do the same for any random effects • Click the Post Hoc button and select those you want [we recommend LSD. we’ll tell you in advance that the error term. draw the experimental design. In addition to different heat of “Suicide” wings. After 2 weeks.” Same three roadhouses. The mice are further subdivided. choose General Linear Model ¨ Univariate. As above.0 9. Work out the ANOVA table.0 7. Suppose we extend the study to include two levels of heat—“Mild” and “Suicide. • Click on the variable to be analyzed from the list on the left. Before going further. there may be systematic differences at other levels of heat.0 2. only an additional factor is added. what do you think are the signiﬁcant effects? c.0 a. one with no barriers. by moving them to the box labeled Display Means for: • Click • Click Continue OK . What are the factors in the design? Are they crossed or nested? b. It’s also a good idea to get the Estimated Marginal Means.0 7. Fail = 1. and biscuit are tested. and two different brands of beta-blockers..
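If you would rather check the Post Hoc step outside SPSS, here is a hedged sketch of a Tukey HSD comparison in Python, with invented wing-heat ratings for three roadhouses (the roadhouse labels and numbers are ours, not the exercise data).

import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(8)
ratings = pd.DataFrame({
    "roadhouse": np.repeat(["A", "B", "C"], 8),
    "heat": np.concatenate([rng.normal(4, 1, 8), rng.normal(6, 1, 8), rng.normal(5, 1, 8)]),
})
# pairwise comparisons of the three roadhouses, with family-wise error control
print(pairwise_tukeyhsd(ratings["heat"], ratings["roadhouse"]))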

Both have approximately the same size of self-induced life preserver about the midriff. How do you proceed? A ll this stuff about randomizing folks to groups. arms. He dumps the data on your desk. Recall again that your esteemed authors differ somewhat in height. promising endless riches if you analyze it right. of which big bellies are the least. do something to make it better. such as Stomach Starers or Girth Gazers. A more likely option is some local group. and so on to see what results in minimum weight. we have found that leaving one foot off the scale works better. to pay for our pounds of ﬂesh with our pounds of cash. A much more natural experiment is to measure something. It is small consolation to the formerly petite housewife of 70 kg (154 lb) that the football alumnus and current used car salesman beside her tops in at 140 kg (308 lb). which explicitly accounts for systematic variance between subjects.CHAPTER THE TENTH One common analysis problem results from situations where individuals are measured at the beginning of a period of time (e. and then measure it again. Arguably. when we reach middle age. whereas comparable authors in a control group 1It’s not really that simple. Dr. when it comes to weight. You know you have become obsessed with the problem when you spend a few minutes each day exploring different positions of feet. 101 . Two Repeated Observations The Paired t-Test and Alternatives SETTING THE SCENE In a blatant attempt to cash in on the North American preoccupation with girth.. Suppose. To add a dash of science to the whole affair. why is change better than terminal measure? The reason is that. Somehow it seems that you must pair up the beginning and ending observations on each patient. Stretch is 6’5”. at the start of treatment) and again later (at the end of treatment). This is not simply a reﬂection that some of us are gaunt and some gross. Even if we were to enroll a bunch of these folks in an experiment where they were randomly assigned to a treatment and control group. the paired t-test. although now de rigeuer for medical research. by some miracle. And there we would go once a week. some magical transformation will take place so that the belt will move in a notch or two. stable differences between individuals are far greater than any likely difference resulting from treatment within individuals. they achieved this lofty goal. Shrimp is 5’8”. but they wouldn’t want middle-aged academics for all sorts of reasons. If we were serious about combating this growing girth. we might even consider enlisting in an experiment. Although we may derive some perverse pleasure out of comparing ourselves with other pathetic creatures in the group. no 6-year-old) in his or her right mind would simply weigh them all after the course of treatment. in view of the inevitable statistical sleight of hand to be inﬂicted on the unsuspecting data in the search for the magical p. not lost).g. we tend to get most of our exercise stepping up and down on the bathroom scales each morning. and then measure everybody only after it is all over. but why exactly is it so evidently right to measure change in weight within an individual instead of ﬁnal weight between groups of individuals? In particular. But the big guy weighs about 200 lb and the little fella about 160 lb. both could afford to lose about 15 lb. Forgive us for being so pedantic. he does a study where he weighs a bunch of chubbies before and after they indulge in the plan. 
leaving both feet off works best of all.1 The point of the exercise is to compare today’s weight with yesterday’s. not absolute weight. It seems nonsensical to do it to some folks and not to others. the comparison is based on weight loss (or more likely. to suffer public humiliation inﬂicted by the sadistic scales. no scientist (or for that matter. Our hope is that by resisting the third donut at coffee or walking to the mailroom. goes against a lot of intuition. The measure of the success or failure of this treatment is based entirely on the comparison of this week’s measurement with last week’s. This design requires a new test. For example. Actually. Casimir from Chittigong designs yet another diet plan. One possibility is Marine Basic Training at Parris Island.

reﬂecting the large stable differences among Homo sapiens. For the treatment. which is ±1 or 2 pounds.3 If this is so. the charming chap from Chittigong. but the denominator includes the variability of the differences within the groups. By contrast. literally. such as those weighing in before and after a round of dieting. the estimated SD of the differences is the calculated SD. Here the signal is the observed difference (d). about 25 kg. power and sensitivity. They undergo a second weigh-in after a month. Having explored the theoretical issues around the issues of excess avoirdupois. suicide-level vindaloos. However. Stretch lost 16 lb. the variability in this difference. using a straight t-test. of a ﬁery hot curry diet. Spicy food makes your body hotter. they consume.8 –3 –2 –7 +2 +1 –4 –4 –4 –2 +3 0 –5 –2. Note that the SD of preweights and postweights are quite large. this is equivalent to the null hypothesis that the true difference in the population is zero. curries. and Rogan Josh’s. 3.” We sure do. 2. To be more precise. Second. It is also possible to correct for baseline differences between groups. and we will eventually explore situations where you can lose. Moreover.0 62 86 118 105 91 72 81 122 95 145 132 105 101. if we follow the logic of statistics. such as may occur if randomization were inadequate or intact groups were used. in calorie loss. then. the difference between the groups would be 15 lb.08. which makes you sweat. We have also calculated the mean and SD of the prediet and postdiet weights and also the weight differences. then there may be real beneﬁt. the SD of the differences is much smaller.03. then the ﬁre in their bellies raises their body temperature.08 or greater could have occurred by chance in a sample of size 12 drawn from the population with a mean difference of 0 and an SD of 3. or repeated measures. but means. The statistical question is. Then if we looked simply at post-test weights. All the prospective clients weigh in.03 kg. as shown at the bottom of the table. The simplest example of a repeated-measures design involves two measurements on a series of subjects. It goes like this. to the best of their ability. the hotter the food gets. at which point they sweat the pounds off.03? The approach is to determine a signal-to-noise ratio. this bizarre word is printed on many scales. which absorbs heat from your ﬂesh. the same reason that mad dogs and Englishmen go out in the noonday sun and is also the origin of that classic ex-pat line “There’s nothing like a nice cuppa tea (pronounced TAY) on a hot summer day.08. is the basic idea that we pursue here.1 3.3 24. includes all the differences among individuals. Our best estimate of this difference is the calculated difference. First. which cools you off. the main advantage of these strategies is the potential gain in statistical power. if they do. In terms of the individual differences. of calculating the difference for each individual (after minus before). As we have seen.” 3It’s didn’t. We have taken the liberty. What is the likelihood that a difference of 2. designs. We all know that the closer you get to the equator. most folks can’t eat it anyway. naturally.2 24. and they may be a result of more than one factor. and the noise is the SE of the dif- TABLE 10–1 Pretest and post-test weights of 12 Casimir subjects Subject Pretest Post-test Difference 1 2 3 4 5 6 7 8 9 10 11 12 Mean SD 65 88 125 103 90 76 85 126 97 142 132 110 103. only 3.2 perhaps we can proceed to an actual example. 
We begin by examining two measurements per person. It’s a puzzlement until you apply basic physics to the issue. “have some weight. amounting to ± 20 lb. Voilá! The fat literally burns off! So enter Casper Casimir. which evaporates. the numerator is still 15 lb. The data are given in Table 10–1. in the right-hand column. with Captain Casper Casimir’s Choice Curried Calorie-Consuming Cuisine for Cold Canadian Consumers (the C11 Diet). which in turn results in a net energy loss to the environment. as well as gain. which goes into the denominator. Shrimp lost 14. But we should point out that this is not the universal panacea it would appear from our contrived example. Now. However. 2. but eventually we explore the situation where there are any number of measures. if we examine change scores. our null hypothesis is that no loss in weight has occurred. Pretesting and posttesting are only one example of these within-subject. Their counterparts across the way lost 1 and gained 1. This.03 . So the net effect is a large gain in precision and a corresponding increase in statistical power.102 ANALYSIS OF VARIANCE 2Showing off our Canadian bilingualism.

08 ÷ (3. Taking differences introduces within-subject variation twice. the obvious test is an unpaired t-test on the difference scores: .16 kg. We’ll take them in that order. the latter test will have less power than has the former.192) 12 0. all of which go against the simple paired observation design: one a design issue. it equals 2. If the outcome is mortality rates. but the SDs are about 25 kg in each group.216 Given all the previous discussion. the critical value of a one-tailed t-test with 11 df (12 data – 1 mean) at the . is equal to: t= d s/ n (10–1) In this case.TWO REPEATED OBSERVATIONS 103 ference. take a mistress who then shoots you full of holes. If it is an educational intervention. In the situation where small differences resulting from treatments are superimposed on large. If they must maintain their wicked ways. where folks are assigned to one group or another and measured at the end of the study? There are three reasons. We could then compare the treatment group after treatment to the control group with an independent sample test as we did in Chapter 8: t (24. 3. Nonetheless. you recall Chapter 6 and are a little more suspicious of onetailed tests. an alternative approach that takes advantage of the difference measure is to simply ask whether the average weight loss in the treatment group is different from the average weight loss in the control group. called a paired t-test. stable between-individual differences exist. Nor are we implying that this is something you might not have thought of yourself. Nobel prizes are not made. assume that the pretest values were instead derived from a control group of 12 who were destined to pass up the beneﬁts of the curry plan. the null hypothesis comparing treatment (T) and control (C) groups is: H0:∆T = ∆C (10–4) can forget the great grapefruit diet? Seduce the population. So the test. So why do all these randomized trials. it makes little sense to measure alive/dead at the beginning of the study. one a logistic issue.03 ÷ 12.17 26. The logistic problem is more complicated.82 2. it is likely that they would be the same as the treatment group before the treatment began. Maybe “20/20” came out with a new Baba Wawa piece on the beneﬁcial effect of kiwi fruit for dieters.05 level is equal to 1. To illustrate this point a bit more.832 105.17 t (24. most textbooks on experimental design mention this design only to dismiss it out of hand. For these reasons. we could have gone ahead with an independent sample t-test. and we have no justiﬁcation for taking all the credit. for completeness. The design problem is that a simple pretest–posttest design does not control for a zillion other variables that might explain the observed differences. Now.03 ÷ 12) = 2.02) 12 (10–2) 0. Casimir will undoubtedly proclaim to the world that the C11 diet is “scientiﬁcally proven” and cite papers to back up his claim. Or it may be far too costly to measure things at the beginning.5 If we call the weight loss ∆. there is a statistical issue. it is often dangerous to measure achievement at the beginning because the pretest measurement may be very much a part of the intervention. and one a statistical issue. we’ll go ahead and do it.17 kg and in the control group is 105. if we were intent on randomizing to two groups at all costs. each with associated error or variability. And you never gain the weight back! such things. Now the data might look like that in Table 10–2. If within-subject variation exceeds between-subject variation.80. For illustration. 
stable differences between individuals. First of all. Therein lies the power of repeated observations.08 24.384 (10–3) 4Who However. it can’t be beat. The calibrated eyeball indicates that such a test is not worth the trouble. and then measuring both groups before and after the treatment. make zillions of dollars off the suckers. Maybe the local union went on strike and the study subjects had to cut back on the food bill. In many situations a pretest is not possible or desirable. If no large. Finally.38.4 All of these are alternative “treatments” that might have contributed to the observed weight loss. let’s consider a slightly more elaborate design. and lose about 5 pounds instantly as the blood drains away. not only will you not gain ground with a paired comparison. the mean in the treatment group is 101. this is not exactly a classic randomized controlled trial. Comparing groups on the basis of only post-treatment scores introduces error from (1) within-subject variation and (2) between-subject variation. As we indicated. 5Of Having framed the question this way. and also to confront the design issue. the difficulty with the pre-post design is that any number of agents might have come into play between the ﬁrst and second measurement. but you could possibly lose statistical power. 101. The difference amounts to 4 kg. The reason is that the difference score involves two measurements. that would only measure weights after treatment and then compare treatment and control groups with an unpaired t-test. telling students what you want them to learn as well as anything you teach to them. you should not be surprised to see that this t-test is minuscule and doesn’t warrant a peek at Table C in the Appendix. One obvious way around the issue is to go back to the classical randomized experiment: randomizing folks to get and not get our ministrations. For the sake of argument. Of course.
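The two tests are easy to contrast directly. The sketch below uses the pretest and post-test weights printed in Table 10-1 and scipy; the paired test reproduces the hand calculation (t of about 2.4 on 11 df), whereas an unpaired test on the same numbers comes out far below 1 because the between-person spread swamps the 2-kg average loss.

from scipy import stats

pre  = [65, 88, 125, 103, 90, 76, 85, 126, 97, 142, 132, 110]
post = [62, 86, 118, 105, 91, 72, 81, 122, 95, 145, 132, 105]

print(stats.ttest_rel(pre, post))   # paired: uses the SD of the 12 difference scores
print(stats.ttest_ind(pre, post))   # unpaired: ignores the pairing, so between-subject variation buries the signal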

One problem with the study of teaching Academish 1A7 we described in Chapter 7 is that we’re trusting randomization to result in both groups being similar at baseline in their ability to say “hegemonic phallocentric discourse” and other such phrases without breaking into uncontrollable peals of laughter. If subjects are matched on variables that are unrelated to the outcome. subject 1 in the experimental group is.3 24.08 (3. there are other situations where the paired t-test is not only handy.1 25. Why can’t we just charge ahead and use a Student’s t-test? The reason is the same as the rationale for using the paired t-test in the C11 example. then the df is reduced with no gain in variance reduction. in much the same way that we lose power if we treat a repeated measures design as if it consisted of two different groups.0 62 86 118 105 91 72 81 122 95 145 132 105 101. EFFECT SIZE When we discussed the ES for the independent t-test in Chapter 7. such as eye color or height. as we just discussed in the C11 example. because of the matching. primarily in the form of responsiveness to change.1 3. such as age. we mentioned that there are a few choices.73 t 2. gender. and from the same academic department. But. We hope that this is more than compensated for by the reduction in between-subject variance. much more similar to subject 1 in the control group than to any other. you ever tried to have a conversation with a business major? 7Have . right up there with reliability and validity (e. thus insuring that the groups are relatively similar at the beginning.2 instead of 4.7 to 3.032 1. But. which would make the lecture hall quite crowded). Similarly. there are four ways to calculate a d-type effect size for a paired t-test. but mandatory.. there’s a caveat.145 (10–5) This is signiﬁcant at the . others have been reincarnated in a slightly different guise in clinical research (often with the authors being unaware of the index’s previous life).104 ANALYSIS OF VARIANCE Experimental Control Difference Pretest Post-test Difference TABLE 10–2 Pretest and post-test weights of 12 Casimir subjects and 12 controls Subject Pretest Post-test 1 2 3 4 5 6 7 8 9 10 11 12 Mean SD 65 88 125 103 90 76 85 126 97 142 132 110 103. We could improve the design by selecting people from the pool of controls by matching each person to someone in the experimental group of the same age. So.g. because the betweensubject SD (about 24 to 26) is much larger than the within-subject SD (1.2 24. We lose power if we ignore the similarity.05 level (t[22] = 2.03 68 122 84 95 106 71 87 147 129 136 105 99 104. The test of signiﬁcance for the difference score is considerably higher than is the t-test for the post-test scores.6 as does area of specialization7 and perhaps a couple of other factors. OTHER USES OF THE PAIRED T-TEST The example that we just used. is probably the most common way that the paired t-test is used. but we’re not allowed to discuss that in a book of this nature. but is no guarantee of group similarity for smaller studies.1 1. We also know that gender has an effect on language. Kirshner and Guyatt. 1985. but this means we have to select our matching variables carefully. the situation is closer to that of one person being tested under two different conditions than of two unrelated people in different groups.08 1.5). of testing the effect of the C11 diet by measuring people before and after an intervention.2 26. Although some of these have their roots in the standard statistical literature. 
and which one you use depends on what assumptions you’re willing (or are forced) to make (Kline.732) 12 3. Randomization works in the long run (say. 2004). but 6So does sex. which has been suggested in the quality-of-life literature as an important measurement characteristic. an inﬁnite number of people per group. randomly selected person. even though the absolute difference was smaller (3.8 –3 –2 –7 +2 +1 –4 –4 –4 –2 +3 0 –5 –2.0). We also lose degrees of freedom.07).2 +2 +1 –1 +2 0 +1 –1 +5 +2 +2 –1 +1 +1. which vary in terms of which estimate of the SD is used in the denominator.2 70 123 83 97 106 72 86 152 131 138 104 100 105.

though. As you can see. are high pretest scores associated with high post-test scores? 9Which where sp is the pooled SD from the two conditions (pre and post) or the two matched groups. the equation shows that the break-even point is a correlation of 0.9 or the two matched groups have different variances. so it may be less accurate. For that reason. they’re just different. whereas in other circumstances it is used to represent the difference between groups.. The problem is that this formula assumes homogeneity of variance. the SDDifference would just be √ sp . Similarly. you’re comparing average change with the variability among people (i. In all cases.) The relationship between the two SDs is: SDDifference = sp √ 2 (1 – r12) (10–9) 8Actually. Suffice it to say for now that it’s a measure of relationship.5. As we’ve already indicated. If you use SDDifference. and you’re way further ahead to use difference scores in the design. simply by plotting the means and 95% CIs.e. not in the SD of the original scale. The problem is that the width of this difference CI (wd) depends on the correlation11 (r) between the pretest and post-test scores: wd = wPre + wPost – 2rwPrewPost (10–10) A third possibility. You can use (a) the pooled SDDifference of both groups (which uses all the data but doesn’t recognize the confounding with treatment mentioned in rubric 9). then the SDDifference is larger than sp. In fact. ∆= Mean Difference sc (10–7) A long stare at this equation yields an important insight that goes beyond just how to compute an ES. This has been reincarnated in the quality-oflife literature by Liang (2000) and McHorney and Tarlov (1995) and is often called the Standardized Response Mean. above that you’d do better to use change scores. for the truth of the matter). you can use either the SD between people measured either before or after the intervention. we’ll show you the parallels. we’re using “mean difference” rather than d in the formulae.5 is moderate. you’re looking at average change compared with the variability of possible changes (Norman. the treatment effect is large enough to move a patient up 1 SD). we presented three rules for an eyeball test of signiﬁcance for two independent groups. and the SDDifference would be 0. then things get more the higher the correlation. where we discussed the difference between the paired and unpaired t-tests. .2 is small. applicable to the paired t-test: (4) DON’T DO IT! The reason is that we’re not interested in the CIs around the individual means. then everyone’s postintervention score would be just the baseline score plus a constant. Using the comparison with SDBetween. as we’ll see at the end of this chapter. if there were no correlation between the pretreatment and post-treatment scores. it’s probably best to report both. Just remember when you use Equation 10–8 that the ES is given in units of the SD of the difference score.8 both are defensible. What’s the relation between the two families of coefficients? Well. deﬁne correlation formally in Chapter 13. the ﬁrst we’ll look at is called Hedges’s g and is simply: G = Mean Difference sp (10–6) complicated. but rather the single CI around the difference score. is to use the SD of the difference score. 2003. recommended by Cohen (1988) for pre-post designs. then SDDifference is 0. as in the example shown in Table 10–2. et al. The second approach. since the paired t-test is inﬁnite. 
Cohen’s d = Mean Difference SDDifference (10–8) 11We’ll If you’re comparing the change in the treatment group with the change in the control group. In fact. we still use the same criteria: an ES of 0. To avoid confusion. If you use SDBetween in the denominator.8 is large. As we’ll see. If the correlation is 0. so the SD of the post-test is usually larger than the SD of the pretest (assuming treatment increases a person’s score). When you calculate an ES with paired (usually pre-post) data. 2005). Here’s the fourth rule. which has the effect of reducing the SDDifference relative to sp. and you actually lose ground by using change scores. 0.TWO REPEATED OBSERVATIONS 105 see Streiner and Norman. or it can refer to a class of ES estimates. SDDifference). there often is a substantial correlation. Because we can’t see that correlation by just eyeballing the data. The reason there’s more than one coefficient relates directly back to the previous section. sometimes it makes a great deal of sense.e. or (c) the SDDifference of the treatment group. or the SD within people (i. even that doesn’t exhaust the possible permutations. statistics is a rigorous way of eliminating confusion and ambiguity. and this may not be the case if the intervention changes the variance. but the penalty is that the estimate of the SD is based on only half the number of observations. Since it’s in the latter domain where you’re most likely to encounter these things. if the correlation is 1. called Glass’s ∆. the SD of the differences. and 0. below and you’re actually further ahead to just look at posttreatment scores. the smaller wd. 10Note THE CALIBRATED EYEBALL In Chapter 7. d can either be another symbol for the difference between groups. gets around this by using only the preintervention or the comparison group’s SD (sc). which is likely the least biased measure. is highly likely.10 This avoids one problem. comparing the CIs of the two means is meaningless (Cumming and Finch. (If the correlation were 1. Different people respond differently to treatment. While it usually makes little sense in a statistical test to use the SD between people in the measurement of an ES.. but uses only half the data. ∆ is the name of a test. 2006). (b) the SDDifference of just the control group (Guyatt’s responsiveness) which ignores the interaction with treatment. that here.
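A short sketch of the competing d-type effect sizes, again using the Table 10-1 weights. The labels follow the definitions above (pooled SD for Hedges's g, baseline SD for Glass's delta, SD of the differences for the paired Cohen's d), and the last line checks Equation 10-9; that identity is exact only when the pretest and post-test SDs are equal, so here the two numbers agree only approximately.

import numpy as np

pre  = np.array([65, 88, 125, 103, 90, 76, 85, 126, 97, 142, 132, 110], float)
post = np.array([62, 86, 118, 105, 91, 72, 81, 122, 95, 145, 132, 105], float)
diff = post - pre                                   # negative values are weight lost

sp = np.sqrt((pre.var(ddof=1) + post.var(ddof=1)) / 2)   # pooled SD of the two occasions
g      = diff.mean() / sp                                 # Hedges's g
delta  = diff.mean() / pre.std(ddof=1)                    # Glass's delta (baseline SD)
d_diff = diff.mean() / diff.std(ddof=1)                   # Cohen's d using the SD of the differences
print(g, delta, d_diff)                                   # the last is much larger in absolute value

r = np.corrcoef(pre, post)[0, 1]
print(diff.std(ddof=1), sp * np.sqrt(2 * (1 - r)))        # Equation 10-9, approximately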

1 2 3 4 5 Mean SD 12 14 28 3 22 15. or between matched subjects) is the best of both worlds—almost.59 6 7 8 9 10 5 10 20 2 12 9. Average intelligence of older and younger siblings. 2. σ is the SD of the difference. You may recall that we did a t-test on hair restorers in Chapter 7. Scores on this exercise before and after reading Chapter 10.8 8. and click the arrow to move them into the box marked Paired Variables • Click OK . How does this change the analysis? Drug Subject Hairs Placebo Subject Hairs SUMMARY The comparison of differences between treatment and control groups using an unpaired t-test on the difference scores (between initial and ﬁnal observations. We use the original sample size calculation introduced in Chapter 6: n (z z ) 2 (10–11) where δ is the hypothesized difference. then the test can result in a loss. at least three kinds of t-tests can be applied to data sets—unpaired t-tests. c. select the most appropriate. but add a piece of information: subjects were related. Let’s return to the data. rather than a gain. the before and after scores]. b.8 6. School performance of younger versus older brother/sister in two-child families. and so on are brothers. The advantage of the test exists as long as the subjects or pairs have systematic differences between them. with joint count of patients with rheumatoid arthritis. in statistical power. But look on the bright side—more room for optimistic forecasts. control group designs.g. If this is not the case.. and zα and zβ correspond to the chosen α and β levels. and unpaired t-tests on difference scores. How to Get the Computer to Do the Work for You • From Analyze. Subjects 1 and 6. School performance of only children versus children with one brother or sister. choose Compare Means ¨ Paired-Samples T Test • Click on the two variables to be analyzed from the list on the left [e. f. but also the SD of the difference within subjects—which is almost never known in advance. For the following designs. paired t-tests. and (2) 6 more weeks with fool’s gold (iron pyrites). The only small ﬂy in the ointment is that we must now estimate not only the treatment difference. Crossover trial. a. 2 and 7. reared apart and reared together.21 EXERCISES 1. each of whom undergoes (1) 6 weeks of treatment with gold.106 SAMPLE SIZE CALCULATION ANALYSIS OF VARIANCE Sample size calculations for paired t-tests are the essence of simplicity. and a variant of the strategy is useful in the more powerful pre-post. The test is used in pre-post designs. Order is randomized. The basic strategy is to use pairs of observations to eliminate between-subject variance from the denominator of the test. School performance of older brother/ sister in one-parent versus two-parent families. e. As we discussed. d.
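Equation 10-11 translates into a few lines of Python. This is a hedged sketch (the function name and the example numbers are ours, not the book's); it assumes a two-tailed alpha and solves for the number of pairs.

import math
from scipy import stats

def paired_n(delta, sd_diff, alpha=0.05, power=0.80, two_tailed=True):
    # n = (sd_diff * (z_alpha + z_beta) / delta) ** 2, rounded up
    z_alpha = stats.norm.ppf(1 - alpha / (2 if two_tailed else 1))
    z_beta = stats.norm.ppf(power)
    return math.ceil((sd_diff * (z_alpha + z_beta) / delta) ** 2)

# e.g., to detect a 2-kg average loss when the SD of the differences is 3 kg
print(paired_n(delta=2.0, sd_diff=3.0))   # about 18 pairs

The hard part, as the text says, is not the arithmetic but coming up with a defensible guess for the SD of the differences before the study is run.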

” “OK. and suitably blinded. of course. in turn. it can also represent a real occupational health risk. it has something to do with having a muscle in the head. instead of randomly assigning individuals to take one of the analgesics. we chose to mount the following study: after a meeting attended by one of the carriers. So. count to 10. so that we can use each subject again every 2 weeks. A lthough many folks think of the academic life as a carefree idyll in which you arrive late. a searing pain in the derrière resulting from overexposure to so-called “pains in the butt” at committee meetings. Faced with endless academics begging for relief. subjects would be randomly assigned to receive one of the treatments. Tylenol. which was the Latin name brieﬂy adopted for Chronic Fatigue Syndrome. He’s commonly called the one who “is a pain in the butt. Do any over-the-counter analgesics provide relief from this dreadful agony? 1Actually. spend long pub lunches in quasi-intellectual debate. because of the nature of the syndrome. the one who spends endless time playing with his electronic appointment book when it’s time to book the next meeting. One. Not to be confused with myalgic encephalitis. If this was a normal drug trial. something like the one below: 1 No relief 2 Slight relief 3 4 Moderate relief 5 6 Complete relief hemorrhoids aren’t all bad. teacher. After all. suitably blinded. It’s not just that it’s stupifyingly boring for the most part. It’s just a generalization of the paired t-test. multiple. The methods amount to inclusion of the Subject as an explicit factor in the analysis. 3I never was much good at arithmetic even back in grade school. and leave early. You know the type—the one who corrects the spelling and punctuation errors in the minutes.1 It seems that just about every academic committee has at least one member who unnecessarily drags the proceedings down to a snail’s pace worrying about minutiae. Now we will deal with multiple assessments. 2Literally |___________|___________|___________|___________|___________| However. In particular. uh. As near as we can tell. Blood clots in the leg (DVTs) and hemorrhoids from the long hours sitting motionless in a straight-backed chair are the obvious health concerns.2 a terrible affliction characterized by a searing pain through the nether regions that persists long after the offensive individual has left and the meet- ing is over. “Mr. subjects were randomized to receive one of Entrophen. but a far more debilitating risk awaits every committee member. the dreaded new syndrome “myalgic gluteatitis” (ME). Assuming that committees meet only every week or two. This amounts to directly assessing the variance inﬂammation of the muscles of the gluteus maximus. then we have a suitable washout period.CHAPTER THE ELEVENTH Analysis of variance techniques are extended to include situations where there are repeated observations on each subject— called repeatedmeasures ANOVA. in which each person was assessed twice and the difference between the two occasions calculated. Repeated-Measures ANOVA SETTING THE SCENE One of the greatest occupational risks facing academics is “myalgic gluteatitis” (ME). or Motrin to see whether any provided relief. he might appropriately be viewed as a carrier of a disease.3 but the basic idea of each subject becoming his or her own control holds true. one of the activities most feared by academics of all stripes is the dreaded committee meeting. separated by 2 weeks between meetings. no matter how seemingly mundane. 
and is apparently oblivious to the sighs and groans of the other members. we are at liberty to do the study a little differently than the usual approach. it takes two things to be a consultant— them and grey hair. uh. The grey hair makes you look distinguished and the hemorrhoids make you look concerned. we can actually ask each person to take all of the analgesics.” 107 . the one who can ﬁnd a reason to argue every point. the life of the mind does have its inherent risks.” But would it not be more correct to say that he “induces a pain in the butt?” Putting a clinical hat on. and that the pain subsides in hours. Norman. and then induced to rate the degree of pain relief on some scale. two.

and so on. with 3 levels. If we didn’t do that.33)2 + (2 2. the interaction sum of squares is: SS (Interaction) = (Xij – Xi. 1⁄3 got Tylenol ﬁrst.43)2 + (5 4. and 1⁄3 got Motrin ﬁrst.00 3. and all the people are in a long.10)2 + .67 2. there are actually three sources of variance: so the calculation is: Sum of Squares (Drug) 10 [(4. . So. leaving less variance in the error term (if all goes well). In the ordinary ANOVA designs. 1... we multiply the sum of squared differences by 3. we have to multiply the whole ruddy thing by 10. Now.80 3 4 6 4 6 2 4 5 2 3 –––– 3. Looking at it this way. so there is only one observation per cell.00 –––– 3.108 ANALYSIS OF VARIANCE owing to systematic differences between subjects.00)2 + (3 3.j – X.33 (4.2 (11–2) Remember that since this effect doesn’t include Subject. we would likely randomize the order of treatment so that 1 ⁄3 got Entrophen ﬁrst..00 6. natural healing course as they grow tolerant of the irritant and learn to tune him out. 3.33 4. There are 30 cells and 30 observations. The data may look something like Table 11–1: Conceptually. so we multiply by 10 to make it come out to 30.0 3.00 2. Let’s plow ahead.90 2 3 5 2 6 1 3 5 2 1 –––– 3.23)2 + (3 3. The important distinction in this design.90)2 + (4 4. we hope.33 3.90)2 ] 36.8 3. with the individual subjects as one factor and the pain reliever as a second factor.)2 (11–3) Sum of Squares (Subject) 3 [(3. is that there are repeated observations on each subject so that we can separate out subject variance from error variance. it might appear that the last medication worked the best. the formula is: SS (Subjects) = k Σ (Xi.90 Similarly. and it results in an expected value for the ﬁrst few cells as shown in Table 11–2. subjects are assigned at random (hopefully) to different groups. just like the other ANOVAs we have encountered to date. we’ll talk about how to analyse the data to see if there are order effects. So. then. In the ﬁrst instance. which has 10 levels. to calculate the interaction term.7 3. and the error variance is determined by the dispersion of individual values around each subject mean.)2 (11–1) REPEATED-MEASURES ANOVA (ONE FACTOR) The design we’re talking about is called a repeatedmeasures ANOVA for. 2 from 3.9)2 (3. Starting with the formula: SS (Drug) = nΣ (X. As a design aside.. we just want to determine whether there is any overall difference among the analgesics.33 4. however.9 3. For the Sum of Squares (Subjects). we see that the design is actually a two-way ANOVA.90)2 . Here. in order to ensure that order effects are not confounded with treatment effects. The approach.9)2 (3.90)2 (3.9)2] 16.8 (11–6) . the cells are now deﬁned by the factors Subject with 10 levels and Drug with 3 levels..)2 (11–5) Sum of Squares (Interaction) = [(5 4. Putting it another way.10)2] = 15. – X. Differences among subjects in the average rating of improvement (right hand column) 3. Differences among analgesics overall (at the bottom of the columns) 2. fairly obvious reasons. we are in the same situation as we were when we made the transition from an unpaired t-test to one-way ANOVA. – X. is to examine the sources of variance. Down the road. and any differences between subjects in the variable of interest ultimately ends up as error variance in the test of the effect of the grouping factors. we can take the average of all the observations on each subject as a best guess at the true value of the variable for each subject.90)2 (5.00 3. 
The subject variance is then calculated as the difference among these subject means. however.00 5.00 (11–4) TABLE 11–1 Pain relief from various analgesics Subject Entrophen Tylenol Motrin Average 1 2 3 4 5 6 7 8 9 10 –––– MEANS 5 5 5 6 6 4 4 4 4 5 –––– 4. + (1 2. it is necessary to estimate the expected values in each cell. since this effect is missing the Drug.67 4. it is of secondary importance to ﬁgure out whether 1 differs from 2. . Error variance—the extent to which an individual value in a cell is not predictable from the margins If we continue to look at it this way a bit longer. We went through the logic before. using exactly the same approach as before.67 3. there are only three terms.33 3.j + X.
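The whole hand calculation can be checked in a few lines. The ratings below are transcribed from Table 11-1 (columns ordered so that the drug means are 4.8, 3.0, and 3.9); the code partitions the total sum of squares into Drug, Subject, and the Subject x Drug residual exactly as above.

import numpy as np

# 10 subjects (rows) by 3 analgesics (columns), from Table 11-1
x = np.array([[5, 2, 3], [5, 3, 4], [5, 5, 6], [6, 2, 4], [6, 6, 6],
              [4, 1, 2], [4, 3, 4], [4, 5, 5], [4, 2, 2], [5, 1, 3]], float)

grand = x.mean()
ss_drug = x.shape[0] * ((x.mean(axis=0) - grand) ** 2).sum()
ss_subj = x.shape[1] * ((x.mean(axis=1) - grand) ** 2).sum()
ss_total = ((x - grand) ** 2).sum()
ss_error = ss_total - ss_drug - ss_subj        # the Subject x Drug term

ms_drug, ms_error = ss_drug / 2, ss_error / 18
print(ss_drug, ss_subj, ss_error)              # about 16.2, 36.7, 15.8
print(ms_drug / ms_error)                      # F for Drug, about 9.2 on 2 and 18 df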

that the data from Table 11–1 came from this design.This is signiﬁcant at the . this amounts to a Subject × Drug interaction. Like the paired t-test.944 4.2 52.878 Source Sum of squares df Mean square F p TABLE 11–4 Analysis of variance summary table for an equivalent between-groups design Drug Error TOTALS 16. at the same time. there are (10 – 1) × (3 – 1) = 18 df.7 2 27 –– 29 8. then.2 36. since we have multiple measures for each subject. We do use this term. the variance owing to subjects and the error variance are now lumped together. like the oneway ANOVA.80 ––––– 3. instead of measuring 10 people three times. Here’s a chance to address the age-old question of brand name versus generic and. in a branch of scale development called Generalizability Theory. some people call it a one-way repeated-measures ANOVA.REPEATED-MEASURES ANOVA 109 No multiplier is necessary this time since there are already 30 terms. This all totals to 18 + 9 + 2 = 29 df.005 level. just as the one-way ANOVA is an extension of the unpaired t-test. Finally. so we must have got it right.” as they are in between-subjects ANOVAs. In fact. In most cases where repeated-measures ANOVAs are used. and the basic method extends naturally to more complex ones. The strategy we used to go from one-way to two-way ANOVA. but there’s no F-test associated with it. leading to a bigger error term for the test of the drug effect.00 [4.43] 3 [3. and equals 9. The reason is that we are no longer able to extract variance owing to systematic differences between subjects. Estimation of degrees of freedom is just like before. The test of signiﬁcance. 2003). Let’s pretend.225 < . though. one less than the total number of data. The only difference is that.078 9.7 2 9 8.100 4.10] ––––– 3. can be applied to repeated observations.005 15. and to include a second factor in the design.33 4.90] ––––– 4. learn a little more about the arcane delights of repeated-measures ANOVA. when we are trying to determine the different factors that may contribute to the unreliability of a measure (Streiner and Norman. as shown in the ANOVA table (Table 11–3).22. That is. For the Subject main effect. It is a natural extension of the paired t-test. so (10 – 1) = 9 df.23] 5 2 –––––– MEANS 3 [3. As you can see. as shown in Table 11–3. and the ANOVA table would look like Table 11–4. Subject Entrophen Tylenol Motrin Average TABLE 11–2 Observed and expected values (in square brackets) for ﬁrst two subjects 5 1 [4. before we go on to bigger and better things. which usually leads to more powerful tests. repeated-measures ANOVA is able to take account of systematic differences between subjects. Our research hypothesis asks whether there is any signiﬁcant difference in relief from the various pain relievers.90 Another parallel holds. The error term for this comparison is the Subject × Drug Mean Square. Source Sum of squares df Mean square F p TABLE 11–3 Analysis of variance summary table Drug Subject Error (Subject Drug) TOTALS 16. We have described repeated-measures ANOVA in its simplest form. there are three data points and one grand mean. And ﬁnally. it is the simplest of the class of designs.05 .100 1. subjects are just “replicates. there are 10 data points and one mean. just as in an ordinary ANOVA. As we indicated at the start. The appropriate analysis would be a simple one-way ANOVA. One last run through these data.90 2 [2. This is simply captured in the main effect of Drug. for the moment. since we have only one observation for each subject.7 18 –– 29 0. 
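The same comparison can be run with canned routines. This is a hedged sketch using statsmodels' repeated-measures ANOVA for the within-subject analysis and, for contrast, a one-way ANOVA that pretends the 30 ratings came from 30 different people; the drug labels are placeholders, and the ratings are the Table 11-1 values again.

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

x = np.array([[5, 2, 3], [5, 3, 4], [5, 5, 6], [6, 2, 4], [6, 6, 6],
              [4, 1, 2], [4, 3, 4], [4, 5, 5], [4, 2, 2], [5, 1, 3]], float)
long = pd.DataFrame({"subject": np.repeat(np.arange(10), 3),
                     "drug": np.tile(["drug1", "drug2", "drug3"], 10),
                     "relief": x.ravel()})

print(AnovaRM(long, depvar="relief", subject="subject", within=["drug"]).fit())  # F about 9.2
print(stats.f_oneway(x[:, 0], x[:, 1], x[:, 2]))  # F about 4.2 when the pairing is thrown away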
Although the total Sum of Squares is the same. For the Drug main effect.4 We can now do the last step and create the Mean Squares and the ANOVA table. differences between subjects is a measure of error in the estimate of the effect. we could have done the experiment in a different way by simply randomly assigning individuals to each of the three analgesic groups. we could have used 30 people and measured each only once. for the interaction.00 3. since this reﬂects the extent to which different subjects respond differently to the different drug types. which is now not so highly signiﬁcant.167 < . the Subject term contributes to the total sum of squares.8 –––– 68. so (3 – 1) = 2 degrees of freedom. 4Note from longsuffering coauthor: “It’s about bloody time!” GENERALIZATION TO INCLUDE OTHER TRIAL FACTORS You may have noticed in passing that all the drugs we have used to date were brand names. is based on the ratio of Mean Square (Drug) to Mean Square (Subject × Drug).00] ––––– 3.33] 4 [4. That is.5 –––– 68.
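To make the arithmetic concrete, here is a minimal sketch in Python (NumPy and SciPy assumed) that carries out this partitioning on an illustrative 10-subject by 3-drug matrix; the scores are invented, so the mean squares will not reproduce Table 11–3 exactly.

import numpy as np
from scipy import stats

# Rows = subjects, columns = drugs (illustrative relief scores, not Table 11-1)
scores = np.array([
    [5, 3, 3], [5, 3, 4], [5, 6, 6], [6, 3, 4], [6, 6, 6],
    [4, 1, 2], [4, 2, 4], [4, 5, 5], [4, 3, 2], [5, 2, 3],
], dtype=float)

n_subj, k = scores.shape                       # 10 subjects, 3 drugs
grand_mean = scores.mean()

ss_drug = n_subj * ((scores.mean(axis=0) - grand_mean) ** 2).sum()
ss_subj = k * ((scores.mean(axis=1) - grand_mean) ** 2).sum()
ss_total = ((scores - grand_mean) ** 2).sum()
ss_resid = ss_total - ss_drug - ss_subj        # the Subject x Drug term

df_drug, df_subj = k - 1, n_subj - 1           # 2 and 9
df_resid = df_drug * df_subj                   # 2 x 9 = 18

f_drug = (ss_drug / df_drug) / (ss_resid / df_resid)
p_drug = stats.f.sf(f_drug, df_drug, df_resid)
print(f"F({df_drug}, {df_resid}) = {f_drug:.2f}, p = {p_drug:.4f}")

Note that the degrees of freedom add up to 2 + 9 + 18 = 29, one less than the 30 data points, just as the text requires.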

tions with subjects—Subject × Drug (whether some subjects respond differently to some drug types more than others). Such are the vagaries of science. As the table shows. only this time we capture the unsuspecting academics for six meetings. we have thrown in some additional data.8 3 3 6 3 6 1 2 5 3 2 ––– 3. brand/generic) 1 2 3 4 5 6 7 8 9 10 ––––– MEAN 5 5 5 6 6 4 4 4 4 5 ––– 4.8 3 4 6 4 6 2 4 5 2 3 ––– 3.168 1. Tylenol. and now Brand/Generic with two levels. ibuprofen is the least effective drug. there are now two factors: Drug (as before) with three levels. although ASA-based drugs are most effective.110 ANALYSIS OF VARIANCE Subject Brand Entrophen Tylenol Motrin ASA Generic AcetaIbuprofen minophen TABLE 11–5 Two-factor repeatedmeasures design (analgesic type. So. acetaminophen. The Sums of Squares are computed using the general strategy of computing differences between individual means and the grand mean for main effects.12 1.07 3. indicating that the differences between drug classes are different for brands and generics. there are three terms in the Sum of Squares.33 15.1 3 3 6 4 6 2 4 5 2 3 ––– 3. in each case multiplying by some fudge factor corresponding to the number of levels of the remaining factors. but no overall effect of Brand/Generic (F = 2. Motrin) and three with the equivalent generics (ASA.0).27 2.1 versus 4. the mean of Tylenol and acetaminophen (3.30 0. and differences between individual cell means and their expected value from the marginals for interactions. These are the two main effects. After the dust settles. the effect is smaller for the generic (4.85). all squared up and multiplied by the famous fudge factor.833).25.63 9 2 8. Subject × Brand/Generic (whether some subjects show different responses to brands and generics).35) subtracted from the grand mean (3. for the main effect of Drug.52 10.20). three with brand names (Entrophen. Again. ibuprofen).02. and the mean of Motrin and ibuprofen (3. That shows up when we inspect the table a bit closer and see that. corresponding to the mean of ASA and Entrophen (4.5 If you want a name for this design. similarly. which is now 2 × 10 = 20. As we showed in Table 11–5.37 . we found an overall main effect of Drug (F = 6.25 .48 7.001 . and Subject × Drug × Brand/Generic (we won’t even try to put this one into words). However. and they don’t appear in the table.0 4 4 5 5 5 3 4 4 3 4 ––– 4.52 6.01). p = .15 B/G = Brand/Generic.80 6. Note that we are simply using the same basic strategy to encompass a . but the generic is actually a bit more effective than the brand (3.27 18 1 1.63 18 0. it’s called a twoway repeated-measures ANOVA. df = 1/9. We now examine some interactions: Drug × Brand/Generic (whether the differences between drugs are the same for brand names and generics).168). generalizing from the ﬁrst example. The ﬁrst challenge is to work out all the possible lines in the ANOVA table.001 2.4 5Or would be the vagaries if these were real data. subjects are just replications of the study.02 .01 23. to get a result of 60 terms. since it amounts to differences in the effect for different people.03 9 2 0. unless we’re dealing with Generalizability Theory. To start. every Subject × Something interaction is the error term to test the effect.9 2 3 5 2 6 1 3 5 2 1 ––– 3. we are a bit surprised to see a Drug × Brand/Generic interaction. we have Table 11–6.4 versus 3. Again as before.37 0.8). 
and several interac- TABLE 11–6 Analysis of variance summary table of data in Table 11–5 Source Sum of squares df Mean square F p Subject Drug Error 1 (Subject Drug) Brand/ Generic Error 2 (Subject B/G) Drug B/G Error 3 (Subject Drug B/G) 76. The design is shown in Table 11–5. this means there are 3 × 2 = 6 observations on each subject. p = . df = 2/18. Our lawyer advised us to add the disclaimer that these data are completely artiﬁcial.
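If the ratings are arranged in "long" format, one row per subject per Drug by Brand/Generic cell, a packaged routine can produce a Table 11–6 style summary directly. The sketch below assumes the statsmodels package (AnovaRM) and hypothetical column names, with randomly generated relief scores standing in for real data.

import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(1)
rows = []
for subj in range(1, 11):
    for drug in ["ASA", "Acetaminophen", "Ibuprofen"]:
        for brand in ["Brand", "Generic"]:
            rows.append({"subject": subj, "drug": drug, "brand": brand,
                         "relief": rng.normal(4, 1)})   # stand-in ratings
df = pd.DataFrame(rows)

# Two within-subject factors; each Subject x Effect interaction serves as
# the error term for the corresponding effect, as described in the text.
print(AnovaRM(data=df, depvar="relief", subject="subject",
              within=["drug", "brand"]).fit())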

not both (see note 6). Between-Subjects and Within-Subjects Factors Now that we are used to the idea of including more than one repeated factor in the design. By extension. which are inevitably burdened with routine decision making necessary to keep the academic ship aﬂoat. both occur as repeated observations for each subject in the design and are.REPEATED-MEASURES ANOVA 111 design where the repeated observations arise from more than one factor. we did it just for convenience. and Drug (Do all drugs give the same relief?) and Brand/Generic (Do brand name drugs give the same relief as generic drugs?) from the top rows. We could. A between-subjects or grouping factor is one in which each subject is present at only one level of the factor. Perhaps we are concerned that ME is prevalent for only some kinds of committees. Perhaps the severity of ME is lower in research meetings.8 To begin. The one interaction with Subject that is missing is the Subject × Group term. or the effects are more transient so that relief comes more rapidly. Now. It can assume only one value for each subject. there are no data in the boxes.” and that each row corresponds to all the measurements on one subject—in this case. We have the same two within-subject factors as before. arise together in the same design. subjects are grouped under a level of the factor. Carriers tend to avoid the more creative committees like research-group meetings. which is different. we must ensure that there are different folks in each group. which will ﬁnd their way into the assorted error terms. and subjects are grouped under each level of this factor. that is. not essential to have the same number of people in each committee type. There are also some two-way interactions— Group × Brand/Generic. of course. 7It’s 8This Brand Group Subject Entrophen Tylenol Motrin ASA Generic Acetaminophen Ibuprofen TABLE 11–7 Experimental design for inclusion of between. We will. Group × Drug. Group (i. this is called a betweensubjects or grouping factor. presents an ideal situation to cook the data as we see ﬁt. there are four main effects: Group (Do folks on research committees get more relief from ME pain than those on administrative committees?) and Subject (Do some people get more or less relief than others?) from the left column. a particular person can only be a on an administrative committee or a research committee.e. and as before Drug × Brand/Generic.and within-subject factors 1 2 Research 3 . There are also several interactions with Subject. as yet. called within-subject factors. ME carriers seem to gravitate to administrative committees. What would a between-subject factor look like? Well. How would we put it to a test? We would use the same design as before. 20 . We won’t do it. we now have an additional factor.6. As a matter of course. From personal experience. let’s expand the design one last time. six. Since each patient is on 6To A within-subjects (trial) factor is one in which all levels of the factor are present for each subject. 10 11 12 Admin 13 . ensure that committee type is a true betweensubjects factor. let’s proceed to anticipate what the ANOVA table might look like. type of committee). One way to think about the distinction is to class all the factors as either within-subjects or between-subjects. The two factors that we have encountered to date. it results in repeated measures. try to ﬁnd a situation in which the same people are on both an administrative and a research committee. but then it would be another within-subject factor. 
However. This guarantees that the innermost column on the left will be “Subject. that is. Drug and Brand/Generic. since. and other factors that group subjects into classes. that is..7 The design would now look like Table 11–7. only this time we recruit 10 academics from an administrative committee and 10 different academics from a research group. therefore. we can bring up the heavy artillery and contemplate a world where both repeated observations within each subject. it is usually easier in these repeated-measures designs to put all the betweensubjects factors on the left and all the within-subjects factors on the top.

serum sulphur. 9After all we’ve put you through in this chapter. covered in later chapters. There’s an easy way to do this and to ensure that it all balances out in the end. it’s a lot like betweenand within-subjects factors. If we do this. increase the power of the test. . or whatever—usually at different points in time. Entrophen → Tylenol → Motrin.42 0.12 0. Pick some order—for the sake of argument. we should vary the order in which subjects get the different drugs. What was that again? Well. Finally. If there is an order effect.40 21. Second.33 2 16 0. the error terms amount to the associated interaction with Subjects. We need not stop here of course. So now the whole “kit and caboodle” appears as Table 11–8. restrict this to repeated measurements of the same quantity—weight. for completeness. Drug and Brand/Generic are themselves crossed factors. then we have created a second between-subjects factor.112 ANALYSIS OF VARIANCE either an administrative or a research committee. and 1⁄3 the last.12 5.20 1. and (2) reduce the corresponding error terms and. still obeys some of our fundamental rules: (1) the degrees of freedom add up to one less than the number of data points. at the outset. we say that Subject and Group are crossed with Drug and Brand/Generic.02 0. In particular. A more general way of describing this idea is to speak of crossed and nested factors. for arthritic patients. the error term is always Subject: Group. the ANOVA. “Ah heck. Do it one last time: Tylenol → Motrin → Entrophen. again. the computer takes care of all this). but of different variables.34 6.27 3. one explicit factor in all repeated-measures designs is Subject. and so on. at the end of the day. so.003 Drug:Group) Drug Group B/G Drug B/G Drug) 0. not both.730 .40 0. (2) the Sums of Squares for indi- 10We vidual terms could be summed to yield a Total Sum of Squares.15 0. This is just another way of saying that Group is a between-subjects factor. For all the repeated measures. Finally.45 0. as we showed in the previous example.13 7. each subject can be in either the administrative category or the research category. A factor is nested in another factor if each level of the ﬁrst factor occurs at only one level of the second factor.008 . joint count. 1 ⁄3 the second. In particular. There are a few things to note.020 .860 . and (4) the degrees of freedom for numerator and denominator of the F-ratio must use the right degrees of freedom (but.34 1.68 .06 14. the answer may well be “Not. since we have acetylsalicylic acid as both Brand and Generic.07 0. although complicated.02 1. We could have analysed the Group effect differently.94 3. Many studies do make repeated observations on subjects.35 0.34 51. thereby. so we would write Subject nested within Group as S:Group. we can’t estimate this interaction. We could have said to ourselves. Subject is nested within Group.53 2. Let’s just average it and do a t-test. sedimentation rate. In the end. we’ve got a bunch of data on 20 folks in two groups.53 1 8 1 1 8 2 2 16 2 25.640 Error (Subject:Group B/G = Brand/Generic. Now ensure that 1⁄3 of the subjects gets the ﬁrst sequence. there is not a single error term. except that the correct error term must be used (by the computer of course). let us take a minute to try to convince you that there really is a grand unity to the whole thing. there are error terms for each main effect and its interaction.13 2. the main effect of subjects. not both. We are limited in the number of factors only by the number of degrees of freedom. 
however. we are simply partitioning variance across multiple factors in order to (1) investigate the possible effects and interactions.110 . Nested factors are often signiﬁed with a colon (:) in the ANOVA summary table.46 .43 0. use multivariate statistics. we indicated that as good researchers.” do.20 0. For the between-subjects or Group factors. Now just move everything to the right and rotate the last one to the beginning: Motrin → Entrophen → Tylenol. However. This exhausts the possibilities. Two factors are crossed if each level of one factor occurs at all levels of the other factor. so that any variance owing to systematic differences between subjects can be removed and the power of other tests correspondingly increased. First.42 0. since both Group and Subject occur at all levels of Drug and Brand/Generic (each subject has an observation at all levels of Drug and all levels of Brand/Generic). (3) Mean Squares and F-ratios are calculated just as before. It’s called a Latin Square and it is performed as follows. Since Group is a betweensubjects factor. such as grip strength.20 8. For this situation. morning stiffness.” The TABLE 11–8 Analysis of variance summary for three-factor ANOVA Source Sum of squares df Mean square F p Group Error (Subject:Group) Brand/Generic Brand/Generic Group Error (Subject:B/G Group) Drug Drug Group Error (Subject Brand/Generic 25. it will be apparent and will show up as an Order × Drug interaction.
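The cyclic rotation just described is easy to automate. Here is a small Python sketch that builds the three rotated orders and deals them out so that each sequence is used by a third of the subjects (a basic Latin square; the drug names are the chapter's).

def latin_square(conditions):
    """Cyclic Latin square: every condition appears once in every position."""
    k = len(conditions)
    return [[conditions[(start + i) % k] for i in range(k)]
            for start in range(k)]

drugs = ["Entrophen", "Tylenol", "Motrin"]
orders = latin_square(drugs)

subjects = [f"S{i}" for i in range(1, 10)]           # 9 subjects, 3 per order
assignment = {s: orders[i % len(orders)] for i, s in enumerate(subjects)}
for s, order in assignment.items():
    print(s, "->", " -> ".join(order))

If an order effect exists, it will show up as an Order x Drug interaction rather than biasing the Drug effect itself.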

Another place where we use repeated-measures ANOVA a lot is in measurement studies. Conveniently. Day 2. the reliability is equal to: 4.19 7. The denominator of the t-test would involve differences between the subject means. 13One REPEATED-MEASURES ANOVA AND RELIABILITY OF MEASUREMENT Repeated-measures ANOVA has one other very useful application—to compute variance components for use in studies of agreement. nothing could be further from the truth. 12Usually.. exception is a Time factor that consists of very different intervals (e. If we turn things around and use the data as a measure of individual susceptibility to pain relief. as long as the design is balanced. this can be transformed into a computational formula involving Mean Squares. or months.9 between the objects of measurement—in this case. the same data that went into the Subject main effect.13 Repeated-measures ANOVA. for example. this should not be treated as another level of the time factor. For example. Nothing comes free. warlords. Then. imposes one additional constraint. However. More appropriately.54 that if a baseline measurement were made. It also can work with repeated observations at the same time. Day 10. that is. just as the case for two-way ANOVA and all other parametric tests. Reliability is usually assessed by the Reliability Coefficient. If we did a trial with two groups in which we made repeated observations. Because of the relationship between variances and mean squares. believe it or not. as we’ll see in Chapter 12. like all factorial ANOVA designs. however. such as Patient 6. The resulting t-test would be just the square root of the Group main effect. Day 1. such as Patient 10. repeated-measures designs are one special case of factorial designs and. However.” It ranges from 0 (no systematic difference between subjects) to 1 (all the variance in scores is due to systematic differences between subjects). and other unscrupulous types. it may be better to analyze the data using MANOVA. We also demand homoscedasticity—a lovely word meaning equal variances.12 which is deﬁned as the proportion of variance in the scores related to true variance . seem to think that this approach is a bit of esoterica foisted on them recently by zealous psychometricians. if we go back to Table 11–1. this was discussed in Chapter 9. the assumption of homoscedasticity may break down and. we have discussed in Chapter 6 the extent to which the tests are robust to the violation of these assumptions and.11 As we will see in Chapter 17. The good news is that. the starting point is a repeated-measures ANOVA. capitalists. ASSUMPTIONS AND LIMITATIONS OF COMPLEX ANOVA DESIGNS Are there no costs incurred in this exercise? Of course there are. We found an entire chapter devoted to the intraclass correlation in Fisher’s 1925 statistics book. get better relief across the board than others. weeks. in psychology and education.078 Reliability 4. Note that the Mean Square (Residual) is from the Subject × Drug term: Mean Square (Subj) Mean Square (Subj) Mean Square (Res) (k 1) Mean Square (Res) (11–8) 11Note Reliability In the present example. as you recall (or we recall).878 3. patients. the same data that are going into the Group main effect. then. we have the makings of a reliability coefficient in the ANOVA (Table 11–3). Day 30). 
The formal deﬁnition looks like: Reliability 2 Subj 2 2 Subj Residual (11–7) Other Applications of RepeatedMeasures ANOVA Repeated-measures ANOVA is a very useful strategy to look at the effects of interventions in situations in which the same person can receive multiple treatments. there is the assumption that the data are at least interval level and are normally distributed. Fisher called this the “interclass correlation. the ANOVA is robust with respect to assumptions about distributions. whose passion is differentiating among people. When you look at reliability or agreement. the normality assumption is unnecessary.g. it is evident that some folks.078 0. Ironically. this should be used as a covariate and the analysis should be a repeatedmeasures analysis of covariance (see Chapter 16). whether you use an intraclass correlation or Cohen’s kappa. as subjects are followed up for days. this is one of several ways to approach the general issue of measuring change. then the repeated-measures ANOVA design acts as a generalized crossover study. are really mainly interested in all the things we called “error.” Finally. Many clinicians. Any time that a washout period is feasible.REPEATED-MEASURES ANOVA 113 numerator of the t-test would involve the difference between the overall means for all the data in the administrative and research groups. and error variance is related to Mean Square (Subject × Drug). First. like factorial designs. (11–9) The coefficient is called an Intraclass Correlation Coefficient (ICC). except to selected dictators. measurement folks.878 (2)0. the Central Limit Theorem indicates that for sample sizes over 10 to 20. Subject variance is directly related to the Mean Square (Subject). this distinguishes it from the Pearson correlation that has observations from different variables. it can be used any time there are repeated measurements on the same set of individuals. the designs must be balanced or nearly so. the number of possible designs is limited only by imagination and resources. Interestingly. then we can start to ask whether the measure can reliably distinguish between those who get a lot of relief and those who get a little.10 One situation in which this occurs is when there are sequential measurements over time. we would likely analyze the data with a two-factor repeated-measures ANOVA.13 0. it has many other applications. since all observations are from the same variable or class. patch tests for suntan lotions. with treatment/control as a between-subjects factor and time as a within-subjects factor. Basically.
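Equation 11–8 is simple enough to compute by hand; the sketch below does it in Python. The mean squares plugged in are the ones that appear to come from Table 11–3 (MS Subjects of roughly 4.878, MS Residual of roughly 0.078, and k = 3 drugs), so treat the specific numbers as illustrative.

def icc_from_mean_squares(ms_subjects, ms_residual, k):
    """Equation 11-8: (MS_subj - MS_res) / (MS_subj + (k - 1) * MS_res),
    where k is the number of repeated measurements per subject."""
    return (ms_subjects - ms_residual) / (ms_subjects + (k - 1) * ms_residual)

print(round(icc_from_mean_squares(ms_subjects=4.878, ms_residual=0.078, k=3), 2))
# roughly 0.95: nearly all the variance reflects stable differences between subjects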

e. This continues until the patient has treated six headaches. Thirty spondylitis patients are treated by chiropractors on a weekly basis for 12 weeks. and has only two levels. a. The best strategy to survive the vagaries of reviewers is to take an approximate approach. SUMMARY We have considered a number of extensions to the paired t-test. or blue pills by the attending physician throwing a dart at a Stars and Stripes on the clinic wall.” then name the between-subjects and within-subjects factors. white. and use an approximate calculation based on the paired t-test. name the factor equivalent to “subjects. For the following designs. First. then the procedures outlined for the paired t-test in Chapter 10. Unless the factor is accounting for useful variance. there is a law of diminishing returns. and then go back to the paired t-test. This continues until the patient has treated 6 headaches with each color of pill. and then. unless these factors are designed into the study from the outset. d. Twelve patients suffering from chronic headaches are treated by three different headache medications. There are 20 slides in total. and we already indicated where that slippery slope leads. The upshot is that the mean square—which enters into the statistical test— actually goes up. even if you have only two levels of the factor. the dfs escalate. or blue pill. Each slide is rated by 6 pathologists. After each treatment. all described as repeated-measures designs. white. the patient rates the pain on a 10point scale. Again we convert this to a pairwise comparison. Histologic slides of lymph gland biopsies are judged by pathologists on a 5-point . and has added tremendously to the versatility of experimental research. There are 20 slides in total. at 3 levels of experience—2 ﬁrst-year residents. At the onset of a headache. If there are more than two levels. the power of analysis and interpretation obtained from factorial and repeated-measures ANOVA is often remarkable. The last grant writer who went for such long shots jumped off a building in the Crash of ’29. because the degrees of freedom have been reduced more than the Sum of Squares as a result of the addition of the factor. f. which are the essence of simplicity. Nevertheless. which he or she selects by throwing a dart at a Union Jack on the basement wall. unfortunately. 2 ﬁnal-year residents. there is no exact formula to calculate sample size for two. SAMPLE SIZE ESTIMATION For all sorts of reasons. Pick the one effect you really care about. (2) the approximate interaction between subjects and this effect. they will likely result in imbalance.or three-factor EXERCISES 1. despite the constraints imposed by the addition of more than one factor into a design. Twelve patients suffering from chronic headaches are treated by three different headache medications. An hour later. and 2 pathologists. b. are appropriate. costs one degree of freedom for the main effect and each interaction. Second. Histologic slides of lymph gland biopsies are judged by pathologists on a 5-point scale for likelihood of cancer. which. hopefully. the paradoxical situation can arise that even though the factor carries away some of the Error Sum of Squares. Histologic slides of lymph gland biopsies are judged by pathologists on a 5-point scale for likelihood of cancer. the patient rates the pain on a 10-point scale. anything more complicated forces us to estimate in advance (1) what might be the appropriate change within subjects. for a total of 18 headaches per patient. However. 
An hour after the onset of each headache. fairly common: when the effect of concern is a two-way interaction. with Subjects as an explicit factor in the design. It still requires a bit of imagination to come up with the error term. Each slide is rated by 6 pathologists. Here. is a main effect with two levels. range of motion of the SI joint is measured. an even more sweeping approximation is needed. Each factor you add. but it’s not impossible. The only exception to this approach is. the Error Mean Square term for other analyses actually increases. They amount to variations on factorial ANOVA methods.114 ANALYSIS OF VARIANCE Are there any more limitations? Indeed there are. repeated-measures designs. There are two reasons why one must not continue to add factors into a design at random. each patient selects either a red. c. Each patient is randomly assigned to be treated by red. If it is a single factor.

Seventeen Scottish lairds are assembled in the manor. yet another Japanese import!) to see if they have different lineages. Source Sum of squares df Mean square F Students (S) Patient (P) P S Observer (O) O S P O P O S 950 300 190 120 95 34 38 19 2 1 Source df F Laird (L) Night/Morn (NM) NM L 320 42 160 16 . and 2 pathologists. 2. Use a within-subject (e. Use a simpler pain scale (Present/Absent) to increase agreement. . Each slide is rated by 6 pathologists. Each work-up is observed by two staff clinicians.300 3. d. To compare 3 of the NSAIDs for the treatment of rheumatoid arthritis. 3. They rated their degree of pain at the end of 10 days.800 5. and 6 legs per bug.g. plied with a “wee dram o’ the malt” all night long. There are 20 slides in total. c. Twenty medical students are observed and rated on ﬁve different patient workups. The investigator approaches you for some suggestions for what she might do to increase the likelihood of getting p below . at 3 levels of experience—2 ﬁrstyear residents. Half the slides were from proven normal patients. Increase the number of subjects from 15 to 25 per group. using a 100-point scale. The bug freak has 20 bugs per group. Would you expect that each of the strategies listed MIGHT WORK or WOULDN’T WORK? a.. crossover) design with the same number of subjects (45). and the other 10 were from patients who eventually died of lymphoma (cancer of the lymph glands). you get to ﬁll in the blanks: a. An entomologist (bug freak) counts the number of spikes on the legs of North American and South American horned cockroaches (Stylopyga orientalis.REPEATED-MEASURES ANOVA 115 scale for likelihood of cancer. Sum of squares Mean square Source df F North American/ South American (NS) Bug (B) Leg (L) L NS L NS B 1. The results of the one-way ANOVA was: F(2. Sum of squares Mean square b. 2 ﬁnal-year residents. For the following designs and ANOVA tables.99.05. then asked to rate their state of euphoria (1) the night before and (2) the morning after.10. all derived from patients with a minimum of 10 years follow-up. b. 42) = 2.000 550 950 c.05 < p < . Increase the number of drugs from 3 to 5. 45 subjects were divided into 3 groups of 15 subjects each and given 1 of the drugs.

How to Get the Computer to Do the Work for You Because repeated-measures ANOVAs are somewhat more complex than straight factorial ANOVAs, we're going to break with tradition a bit and show the actual contents of the computer commands for the analyses we did in this chapter. We'll use the most complicated analysis, in which there are two groups (Administrative/Research), three drugs, and two types of each drug (Brand/Generic). When we entered the data, we added a variable called Group, which had the value 1 for the administrative group subjects and 2 for the research group subjects. • From Analyze, choose General Linear Model → Repeated Measures… • Enter Drug <Tab> in the Within-Subjects Factor Name • Enter 3 <Tab> in the Number of Levels box and press Add • Enter Brndgnrc <Tab> in the Within-Subjects Factor Name • Enter 2 <Tab> in the Number of Levels box and press Add • Click the Define button • Click Entrophen, Tylenol, ASA, Acetaminophen, and so on to replace the lines __?__(1,1), __?__(1,2), and so on, until the six lines in Within-Subject Variables are filled • Move Group to the Between-Subjects Factor(s) box • Click OK The results are spread out in a number of places. Any between-subjects effects, interactions not involving within-subject effects, and the error term are in a separate box called Tests of Between-Subjects Effects. This error term corresponds to the Subject effect in Table 11–3. Any repeated-measures effect, any interactions with those effects, and the associated error term are within a box labeled Tests of Within-Subjects Contrasts.
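If you later want to use the same data outside SPSS, note that SPSS expects the "wide" layout described above (one column per repeated condition plus a Group column), whereas most other routines want "long" format. A pandas sketch of the conversion, with hypothetical column names:

import pandas as pd

# Wide layout as entered in SPSS: one column per Drug x Brand/Generic cell
wide = pd.DataFrame({
    "subject": [1, 2], "group": [1, 2],      # 1 = administrative, 2 = research
    "entrophen": [5, 4], "tylenol": [3, 2], "motrin": [3, 4],
    "asa": [5, 4], "acetaminophen": [3, 3], "ibuprofen": [4, 3],
})

long = wide.melt(id_vars=["subject", "group"],
                 var_name="condition", value_name="relief")
brand_or_generic = {"entrophen": "brand", "tylenol": "brand", "motrin": "brand",
                    "asa": "generic", "acetaminophen": "generic",
                    "ibuprofen": "generic"}
long["brand"] = long["condition"].map(brand_or_generic)
# a similar mapping would recover the three-level Drug factor
print(long.head(8))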

a host of problems arises if we treat them as if they were unrelated variables. one such technique is an extension of ANOVA. so that the results of a study may be different depending on the choice of the outcome measure. and vice versa. and the outcomes were both weight and the body mass index. It is also useful in analyzing repeated measures. too. so we’ll avoid it as much as we can. Why would obtaining two measures lead to an entire chapter in a book? Why not simply do two t-tests and be done with it? This question is even more cogent when we realize that the mathematics of univariate statistics are relatively straightforward. because they’re highly correlated with one another. and takes into consideration the correlations among the dependent variables. the other must. and found that both were signiﬁcant. but since we are dealing with warm. but it is necessary to keep the analysis simple. we’ll ask both partners3 to complete a satisfaction scale that ranges from 1 (“It would have been better if I had done it myself”) to 100 (“The Earth hasn’t stopped moving yet”). called multivariate analysis of variance. and have only two groups of subjects: a group that does “it” in the missionary position. or MANOVA. and others. If the intervention were a weight reduction program. gender-neutral. had to deal with a major problem: From whose perspective do we measure satisfaction—the man’s or the woman’s? It is possible that what may be better for men is worse for women.” I n keeping with the abbreviation of this statistical test. because the two variables are likely correlated with one another (if they weren’t correlated. so the results aren’t distorted by having redundancies among the variables. but since this is your ﬁrst time. His and her satisfaction scores probably aren’t as highly correlated. 2We will not use the term “in the superior position. would it mean that the groups differed on both variables? The answer is a very deﬁnite “Yes and No. be scared off. This avoids the problem of inﬂation of the alpha level because of multiple testing.” The groups did. it would be highly unusual for them to engage in practices that satisfy one person and are a turnoff for the other.2 To avoid the problem of having to make an arbitrary decision regarding whose opinion we will assess. we can’t really conclude that they differed in two discrete areas. by using techniques that allow us to consider more than one dependent variable at a time. Of course both variables changed. You’ll have to know some of the language. whale-loving treehuggers.4 The answer is that once we have more than one dependent variable (DV). Let’s use an extreme example to illustrate this point. If we did separate t-tests on the measures. we’ll be gentle. which has its own arcane language. but. Multivariate ANOVA (MANOVA) SETTING THE SCENE The previous examples. it would be ridiculous to become ecstatic about the fact that we found change in both variables. which looked at condom brand and circumcision status. The ﬁrst problem is one of interpretation. the problem we will grapple with (in a manner of speaking) is one that has bedeviled sex therapists and missionaries alike for generations—does it matter who’s on top?1 We’ll start off easy. differ on the two outcomes. either. MANOVA takes the correlations among the DVs into account.” because that will only cause havoc among those who want to generalize the term to other situations. The ﬁrst (clean) question that could be asked is. in fact. 
1If you're not sure how this question relates to the test, check the Glossary under "MANOVA."

the situation is reversed—both physically and psychometrically. the results of the males’ satisfaction questionnaires are shown on the left side of Figure 12-1. We won’t show you the results right now. the probability of ﬁnding at least one outcome signiﬁcant by chance increases according to the formula: 1 – (1 – α)N 5Looking at the magnitudes of those evaluations. Let’s start off in the usual way. which often can be considerable. but when women do the rating. and these are reported in Table 12-1. Now. We could try to control this inﬂation of the alpha level with a Bonferroni correction or some other technique. it doesn’t look as if much of anything is happening (insofar as the data are concerned. This couldn’t be seen when we looked at each variable separately. they score higher for the male superior position than for the female superior position.05 level is: 1 – (1 – . the means are relatively close together.10.6 This doesn’t help us much either. and only someone truly desperate for a publication would look twice at the signiﬁcance levels of the t-tests. the probability is over 22%. and those of the females’ satisfaction on the right side of the ﬁgure. If you go back to Table 12-1.79 0.47 then why it can happen. the groups look to be fairly similar on the two variables. by plotting the data to help us see what’s going on.0 31. neither the estimate of how many tests would be signiﬁcant by chance. the distributions of the two groups (positions) seem to overlap quite a bit for both variables. it would be slightly over 40%. depending on who’s on top. you can skip ahead a few pages to Table 12-3 and check for yourself.05)2 = .95 44. you can see that when men do the rating.8 41.0975 (12-2) 6Which. these tend to be overly conservative. differing by less than half a standard deviation. We could even go so far as to correlate each variable with the grouping variable. nor the corrections for this would take into account the correlations among the dependent variables. Also. either.118 6 5 4 ANALYSIS OF VARIANCE 6 Man on top Woman on top 5 4 Man on top Woman on top Number FIGURE 12–1 Men’s (left side) and women’s (right side) satisfaction scores. 3 2 1 0 0 15 30 45 60 75 Number 3 2 1 0 0 15 30 45 60 75 Satisfaction score Satisfaction score The second problem is that of multiple testing. at least)5. it doesn’t look as if much is happening elsewhere. but trust us that there is now a statistically signiﬁcant difference between the groups.55 38 . with MANOVA.2 10. Using that most sensitive of tests. Now. is simply a Pearson correlation of one continuous variable (each satisfaction scale) with a dichotomous one (who’s on top?). the probability that at least one will be significant by chance at a . It’s analogous to the advantage of a factorial design over separate one-way ANOVAs: with the former.61 10. the calibrated eyeball. If we have two outcome variables (and. aren’t you sorry you doubted us? If we have ﬁve outcomes. First. by the time we have 10 outcome variables. we’ll show that this can happen and TABLE 12–1 Results of ttests between the groups for both variables Rater Who’s on top Mean SD t df p Man Male Woman Man Female Woman 34. we’ll pull another statistical test out of our bag of tricks and analyze both dependent variables at the same time using a multivariate analysis of variance (MANOVA).12. we may overlook signiﬁcant relationships that are present. we can look at inter- .3 0. have performed two tests). 
we can examine interactions among the independent variables that wouldn’t be apparent with individual tests.37 38 . As we saw in Chapter 5.73 12. (12-1) where N is the number of tests we do.7 So. Again. the correlation between the men’s ratings and position is –0. The next step is to do a couple of t-tests on the data. 7If you don’t trust us. using the point-biserial correlation. as we discussed in Chapter 8. as you no doubt remember.73 8. but. and that for the women’s is an equally unimpressive 0. A third problem is the converse of the second: rather than ﬁnding signiﬁcance where there is none. hence. though. what’s going on? Why did we ﬁnd signiﬁcant results using a multivariate test when it didn’t look as if there was anything happening when we used a whole series of univariate tests? The reason is that the pattern of the variables is different in the two groups.
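Equation 12–1 is worth verifying for yourself; a short Python check reproduces the figures quoted above.

alpha = 0.05
for n_tests in (2, 5, 10):
    familywise = 1 - (1 - alpha) ** n_tests          # Equation 12-1
    print(f"{n_tests:2d} tests: {familywise:.4f}")
# 2 tests -> 0.0975; 5 tests -> about 0.23; 10 tests -> about 0.40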

10That’s . the type of prophylactic and whether or not the person was circumcised are independent variables (IVs).” 9You’ve just been introduced to your ﬁrst term in matrix algebra. statisticians have their own meaning for the word “multivariate. However. but it’s considered a univariate procedure because it ﬁts the “single dependent variable” rule. For most of us.10 It can be thought of as the overall mean of the group for each variable. obscure though it may be: both are represented symbolically by arrows. Others prefer the term multivariable but. we would get an ellipse of points for each group. however. so that each group has two means. not actually. “heteroscedasticity” or “polychotomous”). Some of them would call multiple regression a multivariate technique. the mathematics make it a special case of MANOVA. four-. and there was only one DV. four variables would produce a four-dimensional ellipsoid.” and that “regression analysis” doesn’t involving sucking your thumb or recalling a previous incarnation as a Druid princess. but now the center of each ellipse is called the centroid. even though there is only one DV. in this case. However. that was painless. This is just the same as what we do in the univariate case. we again compare the distance between the centers. the swarm of points would look like a football (without the laces).” We already saw that there is nothing inherently normal about the “normal distribution. What this indicates is that the list of means for Group 1 (the technical term for a list of variables like this is a vector) is equal to the vector of means for Group 2. multivariate means more than one DV. Sticking with the analogy of the t-test. and the second shows the dependent variable (1 = Male Rater. Doesn’t this sound simple and straightforward? That’s a sure sign that something will go awry. the conclusions are clear: when you have more than one dependent variable. So. that the means of the two groups are equal. If we went to the bother of checking in a dictionary. or ﬁve-dimensional graph paper.8 What about repeated-measures ANOVA that we slugged through in the last chapter? That’s multiple dependent variables. The picture is similar but a bit more complicated with multivariate tests. the statistic we would use for this study would be the t-test. and so on. we compare the groups by examining how far apart the centers of the distributions are. such as looking at the effect of being at home versus on a romantic holiday. The ﬂy in the ointment is that statisticians aren’t consistent. we’ll discuss some limitations to this approach. we have two dependent variables in the current example. These can be (relatively) easily described mathematically. we would have two distributions (hopefully normal curves) on the Xaxis.MULTIVARIATE ANOVA (MANOVA) 119 actions among the dependent variables in a way that is impossible with univariate tests.9 In other words. you’re often ahead of the game if you use multivariate procedures. and we’ve even deﬁned it—more than one dependent variable—but let’s be more explicit (and confusing). this would be a case of univariate statistics— factorial ANOVA—with two IVs and one DV. as in Figure 12-2. we’re examining a total of three variables at once. this may seem to be multivariate. but for now let’s accept the fact that multivariate statistics are the best way of analyzing multivariate data. after all. To most people. that is. prowess. 
some use the term to refer to many IVs and some just to indicate that many variables—dependent and independent—are involved. we are testing two null hypotheses simultaneously: µ11 = µ21 and µ12 = µ22 (12-5) If our study involved a factorial design. At the end of this chapter.. or they use English words in their own unique and idiosyncratic way that the British would likely describe as “quaint. to a statistician. we looked at the interaction among condom brand and circumcision status on estimations of prowess. It’s multiple observations of the same dependent variable—repeated measures. So. In the same way. in two-dimensional space (Male’s rating 8A polite term for “it. and it is mathematically identical to ANOVA. “vector.” In Chapter 9. statisticians aren’t like most people. If we had three dependent variables. the group mean. 2 = Female Rater). They have their own terms that bear only a passing resemblance to English (e. as we’ve said. 2 = Woman on Top). then we would have another null hypothesis for this new main effect of Setting. In the multivariate case. In this case. your second term in matrix algebra. It would become a multivariate problem if. in addition to looking at prowess. even here. we also measured the duration of the encounter. t FOR TWO (AND MORE) If we had only one dependent variable.” It refers. wasn’t it? There actually is a link between the use of the term vector in matrix algebra and in disease epidemiology. For two variables. except that we are dealing with two or more dependent variables. isn’t it? Well. the null hypothesis is: H0: 11 12 = 21 22 (12–4) where the ﬁrst subscript after the µ indicates group membership (1 = Man on Top. When we plot the data for a t-test. that is. but are somewhat difficult to draw until someone invents three-.g. In the example we just used. to the analysis of two or more dependent variables (DVs) at the same time. where each center is represented by a single point. As it turns out. then. and a null hypothesis involving the interaction term of Position by Setting. and that’s the usage that we’ll adopt. we would ﬁnd that multivariate simply means “many variables.” See. which starts with the null hypothesis: H0: µ1 = µ2 (12-3) WHAT DO WE MEAN BY “MULTIVARIATE” We’ve used the term multivariate a few times.

statisticians would have thought to use something different. are the variances. With the one-way ANOVA. • If the sample sizes are equal. you were looking at a data matrix. like many other tests for homogeneity. it’s the same symbol we use to indicate summation. Needless to say (but we’ll say it anyway). and Female’s rating). rather than transforming the variables to standard scores ﬁrst. Testing for equivalence of the VCV matrices means that the variance of X1 is the same across all groups. the same applies in the multivariate case. After some statistical hand waving over the matrices. If p > .11 What this means is that instead of having just one number for the variance.12 Second. interpret the results of the MANOVA very cautiously The consequence of violating the assumption of homogeneity of the VCV matrices if the sample sizes are equal is a slight reduction in the power of the MANOVA. we also have to compare the differences between the means to the variances within the groups. However. and each centroid would consist of a vector of three numbers—the means of the three variables for that group. and so forth for all of the variances and all of the covariances. Any time you saw a table listing subjects down the side and variables across the top. we now have a matrix of variance and covariance terms. except that the original units of measurement are retained. Tabachnick and Fidell (2001) offer the following guidelines: Women's ratings 40 30 20 10 0 FIGURE 12–2 Scatterplot of the two DVs for both groups. However. four variances and six covariances. it’s referred to as the variance-covariance matrix (VCV) and looks like: VCV = = s2 11 s2 21 s2 12 s2 22 (12–6) 12We didn’t want to tell you at the time. we “simply” split the SSTotal into more sources of variance. then it’s safe to proceed. of course. If the sample sizes vary considerably. note that we just stuck in a third term from matrix algebra. and the p associated with M is greater than . and p is less than . it is almost impossible to just look at the data and ﬁgure out which it will be. however: the greater the distance. the terms along the main diagonal (s2 and s2 where the subscripts are the same) 11 22. the covariance between X1 and X2 is the same. and those off the diagonal (s2 and 12 2 s21. Because this depends on which matrices are the most different. so the value of s2 is the same as that 12 FROM ANOVA TO MANOVA The logic of MANOVA is very similar to that of ANOVA. because the matrices do not differ signiﬁcantly from each other. but they didn’t. we also have the covariances between the variables.13 In the VCV matrix. after adjusting for the number of subjects and groups (the Mean Squares). The consequence is that a signiﬁcant M statistic doesn’t always mean that you have to either stop with the analysis or transform the data. there would be more terms in the VCV matrix if we had more dependent variables: for three variables. you forget whether it’s sums or matrices that are printed in boldface. it’s fairly safe to proceed • If the cell sizes are unequal.001. LOOKING AT VARIANCE Comparing the means in a t-test is necessary. Just as with a correlation matrix. there’s an added level of complexity. the symbol for a matrix is usually printed in boldface.05. M is unduly sensitive to differences. we partition the total variance (the Sum of Squares Total. The usual test for homogeneity of the VCV matrices is Box’s M statistic. 
because the signiﬁcance tests are robust enough to handle any deviations from homogeneity • If sample sizes are not equal. unfortunately. in addition to the variances of each of the variables. but not sufficient. as is done with correlations. the context makes the meaning clear. for four variables. With more complicated designs. You’d think that with so many Greek letters lying around that aren’t being used in statistics. That’s right. in addition. or SSTotal) into that due to differences between the groups (SSBetween) and the error variance. Not surprisingly. the F-test looks at the ratio of the explained variance (that due to the grouping factor) to the error (or unexplained) variance. It reﬂects the variance shared by the two variables. we would have to think in three dimensions. which is simply a rectangular array of numbers. the more signiﬁcant the results (all other things being equal). such as factorial or repeated-measures ANOVAs. 13Unless.001. the abbreviation for a variance-covariance matrix is either VCV or the symbol Σ. Usually. especially when the sample size is large. and so on. Then. 0 10 20 30 40 50 60 70 Men's ratings 11A covariance is similar to a correlation between two variables. to avoid confusion.120 70 60 50 Man on top Woman on top ANALYSIS OF VARIANCE 2 for s21. First. “matrix” itself. the number is transformed into either an F-ratio or a χ2. The logic of the statistical analysis is the same. there is nothing but our “feel” for the data to tell us whether the deviation from homogeneity is worth worrying about. where the subscripts are different) are the covariances. you can ﬁnally know: you’ve been dealing with matrices throughout the book. we would have three variances and three unique covariances. the VCV is symmetrical. Not surprisingly. such as that due . but now that you’ve progressed this far. don’t worry about M. which is the Sum of Squares within the groups (SSWithin or SSError). If we had a third variable. then a signiﬁcant M can indicate that the Type I error rate may be either inﬂated or deﬂated.
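Computing a variance-covariance matrix takes one line of NumPy. The sketch below builds the VCV matrix of Equation 12–6 for each group from invented scores; Box's M is the formal comparison of the two matrices, and printing them side by side is the informal version.

import numpy as np

group_1 = np.array([[34, 25], [40, 30], [31, 22], [37, 28], [35, 26]])
group_2 = np.array([[31, 30], [36, 38], [28, 27], [33, 35], [30, 31]])

# Variances on the diagonal, covariances off it (Equation 12-6)
print(np.cov(group_1, rowvar=False))
print(np.cov(group_2, rowvar=False))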

that’s how we’ll continue to refer to the test. however. then we should stick with univariate tests. The next set of homogeneity tests belong to Levene. and get more F-ratios. the distinction isn’t as important. As we’ll see. λ is built upside down. unlike almost every other Wilks’ likelihood ratio. that’s where we’ll start.MULTIVARIATE ANOVA (MANOVA) 121 to each factor separately. Because many multivariate tests use Wilks’ Lambda (λ). we would get an output similar to that shown in Table 12-3. so that no one will ever know that our hypotheses amount to nothing. especially when we have only two groups. these are then added up to form the sum of cross-products. The ﬁrst question is. those dealing with the intercept. in fact. A small data set. That sounds somewhat formidable. and references to Hotelling’s T2 are increasingly rare. Hotelling’s trace18 and Roy’s largest root have the same value. we’ve been referring to the test as MANOVA. and nonexistent in the menus for some computer programs. In the years B. or V. you’ll see this test is also called Hotelling’s T2. Just as a t-test is an ANOVA for two groups (or. is also called the PillaiBartlett trace. we expand the Sum of Squares terms into a corresponding series of Sum of Squares and Cross-Products (SSCP) matrices: SSCP Total . and. since there were some shortcuts that could simplify the calculations.14 The cross-product is the ﬁrst value of X multiplied by the ﬁrst value of Y. so it’s safe to use the output labeled Multivariate Tests. Notice that. the sum of squares is simply the sum of each value squared. So. which were all done by hand. which tests for equivalence of the variancecovariance matrices. The p level shows that the test is not signiﬁcant. In our case. in the case of the paired t-test.E. This is equivalent to expanding the measurement of variance into a variance-covariance matrix when we looked at the assumption of homogeneity of variance. Pillai’s trace19 and Wilks’ lambda20 add up to 1.15 it made sense to have a separate test for the two-group case. To make our job easier. but let’s start off easy. and that due to measurements over time.. we will run into the ubiquitous problem of multivariate statistics—a couple of other ways to look at the ratios. SSCP Between . and with appropriate transformations. As you no doubt remember. is shown in Table 12-2. If you look at older books. (Ever get the feeling that everything is called something else in this game?) . so we don’t have any worries in this regard. there are no signiﬁcant differences in the error variances between groups. stay tuned. all of the tests end up with the same value of F. or T. or W. things aren’t quite as bad as they seem. if they are signiﬁcant. but but it’s something we do all the time in statistics. as we’ll see when we describe the Pearson correlation. For each variable.16 they are univariate tests that look at each of the DVs separately.C. doesn’t it? Before Computers. and the number of DVs. divide by the appropriate error term. So. except that we have more terms to worry about—the relationships between or among the DVs. as is all too common with multivariate statistics. “it” in this case means only running the test (now that we’ve gotten the foreplay out of the way). we arrive where we want to be—a multivariate test of the difference between the groups based on all of the DVs at once. 15That’s 16Although 17If this ever does come out as nonsigniﬁcant. 
The SSCP matrix for these numbers is therefore: SSCP 135 105 105 165 (12–7) X Y X2 Y2 (X) (Y) TABLE 12–2 Calculating sums of squares and crossproducts 3 4 5 6 7 ___ TOTALS 9 7 5 3 1 ___ 25 9 16 25 36 49 ____ 135 81 49 25 9 1 ____ 165 27 28 25 18 7 ____ 105 25 where the off-diagonal cells are the same. he lets us use them. this is the ﬁrst thing we see in the output. and SSCPWithin.k. We do exactly the same thing in MANOVA. As you can see. The F-test is now just the ratio of the SSCPBetween to the SSCPWithin. They simply tell us that something is going on. But. since X · Y is the same as Y · X. T2 is a MANOVA for two groups or two variables. Similarly. nonChristians prefer the term B. two related variables)..0. 18Also Analogously. The usual rule of thumb is that if they are not signiﬁcant. Now that computers do all the work for us. the smaller the value of λ. we don’t have just one test but. = is: 19Which SSWithin Groups SSTotal (12–9) 20A. consisting of two variables (X and Y) and ﬁve subjects. we can skip the ﬁrst four lines of the table. we should question whether we should be in this research game at all or become neopostmodern deconstructionists. for some reason no one except Wilks understands. If we (ﬁnally) run a MANOVA on the satisfaction data. These relationships don’t necessarily hold true when there are three or more groups. that due to the interaction between the factors. FINALLY DOING IT Sad to say. for Before Calculating Engines. four of them! Actually. and that the data as a whole deviate from zero. number of groups.17 Finally. the multivariate equivalent of the test for homogeneity of variance is Box’s M.C. after the usual corrections for sample size. the error). the F-ratio for the between-groups effect in an ANOVA is simply: F= MSBetween Groups MSWithin Groups (12–8) 14Makes sense. however.a. the smaller the within-groups sum of squares (that is. rather. called the Hotelling-Lawley trace. what test do we run? So far. As we mentioned. we can use the results from the MANOVA.
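You can confirm Equation 12–7 and the totals in Table 12–2 with a few lines of NumPy:

import numpy as np

X = np.array([3, 4, 5, 6, 7])
Y = np.array([9, 7, 5, 3, 1])

data = np.column_stack([X, Y])
print(data.T @ data)     # raw sums of squares and cross-products
# [[135 105]
#  [105 165]]  -- Equation 12-7; off-diagonal cells are identical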

.22 Because most people use either Wilks’ λ or Pillai’s criterion. Hence.279 Effect Value F df N df D Sig Intercept Pillai’s trace Wilks’ lambda Hotelling’s trace Roy’s largest root Pillai’s trace Wilks’ lambda Hotelling’s trace Roy’s largest root 0.49 3.625 91.341 . then use Wilks’ λ. square the values of t and you’ll end up with the Fs. If you’ve done just a superb job in designing your study. ending up with equal and large sample sizes in each cell.24 270.614 0. and managed to keep the variances and covariances equivalent across the groups.189 270. Pillai’s trace does not equal (1 – λ).001 . so you’ll have to choose one of them on which to base your decision for signiﬁcance (assuming they give different results).24 270. if you’re as human and fallible as we are.607 14.24 270.001 .4% of the variance is unexplained.932 1.53 475.4 50.159 0. use Pillai’s criterion. In the two-group case.040 Group Tests of between-subjects effects SS hyp df Mean square F Sig Intercept M F M F M F 43296. however.24 3. 22Pillai’s statistical test we’ll ever encounter.660 Male Female 0.471 . not a lousy cardboard maple leaf. When there are more than two groups. although slightly less powerful than the other tests (Stevens. If the assumptions of homogeneity of variance and covariance aren’t met.225 48.205 1 1 Multivariate tests 38 38 . which is the usual measure of variance accounted for in ANOVAs. The last part of Table 12-3 gives the univariate tests. we would rely on these.936 0. indicating . the differences among all of the test statistics are minor except when your data are really bizarre.15 1 1 1 1 38 38 43296.21 Not coincidently. If we can use the results of the multivariate analysis.137 135. we won’t bother with the other two. which means that 93. because it is more robust to vio- lations of these assumptions (Olson. However.225 48. our picture would show us holding Nobel prize medals.122 ANALYSIS OF VARIANCE TABLE 12–3 Output from a MANOVA program Box’s M = 1.696 F = . smaller values are better (more signiﬁcant) than larger ones. about 6.533 df1 = 3 Levene’s test F df 1 df 2 Sig df2 = 259920 p = .936 is also the value of Pillai’s trace. in a univariate sense. these tests tell us which variables are signiﬁcantly different.040 .001 .4 74218. or the amount of variance that is explained.545 Group Error 21If you didn’t suspect it before. 1976).2 5150. this should convince you that these are artiﬁcial data.07 547. to tell us if anything is signiﬁcant.001 . if we did any study that accounted for 94% of the variance.001 .4 50.625 3463.531 0.841 0.607 0. they’re analogous to post-hoc tests used following a signiﬁcant ANOVA.040 . These results are somewhat unusual.6% is explained. These tests are used in two ways.374 . 1979).49 3.040 .49 2 2 2 2 2 2 2 2 37 37 37 37 37 37 37 37 . trace is also equivalent to η2 (eta-squared). In actual fact.4 74218. it’s simply (1 – λ). Just a little bit of work with a calculator shows that they’re exactly the same as the univariate tests in Table 12-1. rather than on the multivariate tests.189 0.49 3. What λ shows is the amount of variance not explained by the differences between the groups.001 . In this case. in that the multivariate tests are signiﬁcant but the univariate ones aren’t.064 14.

23 the output would look very much the same as in Table 12-3. The design. in fact. then we ﬁrst have to look at the test of sphericity. 1983). as long as it doesn’t involve more than two people. it appears that when summing over Groups. then we have the same problem as with a run-of-the-mill. and ﬁnally six months later). Fortunately. this is one possible design (called a doubly repeated MANOVA). there will now be two degrees of freedom. can determine for yourself whether the scores show it’s a man or a woman doing the ratings.25 Next. The maximum value of ε is 1. In this example. the Sums of Squares and Mean Squares would be different. as long as you remember to choose this option. This is actually a univariate test. and perhaps the signiﬁcance levels would also be different (depending on how much or little the participants enjoyed themselves). indicating homogeneity. the correlation between the measures at Time 1 and Time 2 is the same as between Time 2 and Time 3 and is the same as between Time 1 and Time 3). and modify our study a bit by having only one rater. and stick with Pillai’s trace or Wilks’ lambda. overall. where k is the number of groups. treats each time point as if it were a different variable. used if Box’s M shows us we should be concerned about the assumption of equality of the covariance matrices. then we can proceed with abandon. a oneway ANOVA. we have univariate tests for the within-subject factor. the method is the same—post-hoc tests. that would require another dependent variable (and perhaps a larger bed).24 but we’ll repeat the experiment three times. The results are shown graphically on Figure 12-3. If the Group effect is signiﬁcant. The ﬁrst is that. there is an overall Trials effect and a signiﬁcant interaction. we leave it to you to ﬁgure out the meaning of this.001. then the numerator and denominator degrees of freedom for the F-tests are “adjusted” by some value. there is an assumption of sphericity—that for each DV. Most data don’t meet the criterion of sphericity. For the purposes of signiﬁcance testing.MULTIVARIATE ANOVA (MANOVA) 123 that we have to compare the patterns of variables between the groups. in which all of the trials are averaged. however. which assumes the most extreme departure from sphericity. If it is not signiﬁcant (and you have a sufficient sample size). As a hint. the variances are equal across time. although between-subjects designs are relatively robust with respect to heterogeneity of variance across groups. The ﬁrst part of the table gives the results for the between-subjects effect. 24You 25And 26As before. so we have three time points. the F is not signiﬁcant. However. which is referred to as ε (epsilon). known as Mauchly’s W. then. The same general guidelines apply that we discussed before: don’t worry about this if the sample sizes are equal or p > . There are two reasons for this. and the computer output appears in Table 12-4. position doesn’t make a difference. Repeatedmeasures MANOVA. we have more adjustments than we can use. within-subjects designs are not. we have only one within-subject factor (Trials) and one interaction term (Trials by Group). we see the multivariate tests for the withinsubjects factors and the interactions between the within-subjects and the between-subjects factors. four weeks. both of which are measured on two or more occasions and. reﬂecting the fact that there are three groups. a bed isn’t de rigueur. If we go back to Figure 12-3. As is so often the case. 
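For readers who want to see the four multivariate statistics side by side, here is a sketch assuming the statsmodels package (MANOVA.from_formula); the data are simulated, so the values will not match Table 12–3.

import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(42)
n = 20
df = pd.DataFrame({
    "position": ["man_on_top"] * n + ["woman_on_top"] * n,
    "male_rating": np.r_[rng.normal(34, 10, n), rng.normal(31, 9, n)],
    "female_rating": np.r_[rng.normal(25, 8, n), rng.normal(31, 8, n)],
})

# Both dependent variables on the left of the formula, the grouping factor
# on the right; mv_test() reports Wilks, Pillai, Hotelling-Lawley, and Roy
mv = MANOVA.from_formula("male_rating + female_rating ~ position", data=df)
print(mv.mv_test())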
The same comments apply to this output as previously with regard to the meaning of the four different tests: they all yield the same F-test when there are two groups.
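Mauchly's W and the ε adjustments are easier to trust once you've computed one yourself. Below is a minimal sketch of the Greenhouse-Geisser estimate of ε from the covariance matrix of the repeated measures. This is our own illustration, not the book's formula or the output in Table 12-4, and the toy data (40 subjects, 3 trials) are invented; it assumes NumPy is available.

```python
import numpy as np

def greenhouse_geisser_epsilon(scores):
    """scores: (n subjects) x (k occasions) array of one DV measured k times."""
    n, k = scores.shape
    S = np.cov(scores, rowvar=False)             # k x k covariance across occasions
    centre = np.eye(k) - np.full((k, k), 1.0 / k)
    D = centre @ S @ centre                      # double-centred covariance matrix
    return np.trace(D) ** 2 / ((k - 1) * np.trace(D @ D))

rng = np.random.default_rng(1)
trials = rng.normal([30, 38, 45], [5, 6, 10], size=(40, 3))   # unequal spread over time
print(greenhouse_geisser_epsilon(trials))   # 1.0 means sphericity holds; floor is 1/(k - 1)
```

Multiplying the numerator and denominator degrees of freedom of the within-subjects F-test by this value is exactly the adjustment the printout reports on the "Greenhouse-Geisser" line.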

TABLE 12–4 Output from a repeated-measures MANOVA program: Box's M; tests of between-subjects effects (Intercept, Group, Error); multivariate tests for the Trial effect and the Trial × Group interaction (Pillai's trace, Wilks' lambda, Hotelling's trace, Roy's largest root); Mauchly's test of sphericity with the Greenhouse-Geisser, Huynh-Feldt, and Lower Bound estimates of ε; and tests of within-subjects effects under each of the sphericity corrections.

Most people use the Greenhouse-Geisser value, except when the sample size is small, in which case the Huynh-Feldt value is used; the Lower Bound adjustment is rarely used.28 The caveat we added in the previous paragraph about W ("and you have a sufficient sample size") is one that holds true for every statistical test trying to prove a null hypothesis: the result may not be significant if the sample size is small, and there just isn't the power to reject the null of sphericity. The trouble is, nobody knows how much is enough,29 so if your sample size is on the low side, you'd be safer to use one of the correction factors.

ROBUSTNESS

We mentioned earlier, when discussing tests of homogeneity of the VCV matrices, that MANOVA is relatively robust to violations here. "Relatively" means two things: it is robust especially if the group sizes are equal, and, with that same proviso of equal sample size, it is also relatively robust to deviations from multivariate normality, except when the sample size is small.

28If the Lower Bound adjustment is rarely used, then why is it printed out? Probably because the programmer's brother-in-law devised it.
29At least in the area of statistics.
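Because Box's M is notoriously touchy, it never hurts to also look at homogeneity of variance one DV at a time. Here is a small sketch using SciPy's Levene test; it is our own illustration with invented data (two groups, two DVs), not the output shown in the tables above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal([30, 50], [5, 5], size=(20, 2))    # 20 subjects x 2 DVs
group_b = rng.normal([35, 55], [5, 15], size=(20, 2))   # second DV has a much bigger spread

for j, name in enumerate(["DV 1", "DV 2"]):
    w, p = stats.levene(group_a[:, j], group_b[:, j])
    print(f"{name}: Levene W = {w:.2f}, p = {p:.3f}")
```

If only one DV misbehaves, that is often easier to fix (by transforming or trimming it) than abandoning the multivariate analysis altogether.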

Is there anything we can't get away with if we have large and equal groups? There are two things we can't mess around with—random sampling and independence of the observations.

There is also one issue that has bedeviled statisticians (and us, too) ever since MANOVA was developed: if the overall F-test is significant, is it necessary to adjust the α level for the post-hoc analyses that examine which specific variables differ between or among the groups? The answer, as has been the case so often in this book, is a definitive, "Yes, no, and maybe." Some people have vehemently argued that a significant omnibus test "protects" the α levels of the post-hoc tests, so that no adjustment is necessary, and there's no need to adjust for the fact that you are doing a number of ANOVAs on the individual variables after the fact. There are just as many people on the other side, saying that multiple testing is multiple testing, and an adjustment for multiplicity is required. The problem is that both sides are right, but under different conditions. If the null hypothesis is true for all of the DVs (a situation called the complete null hypothesis), then α is indeed protected. However, if the null hypothesis is true for some of the variables but not for others (the partial null hypothesis), then α is not protected, so a Bonferroni-type adjustment is required, even based on the results of the univariate F-tests (Jaccard and Guilamo-Ramos, 2002). The trouble is that you never know which is the case, so it's safest to assume that you're dealing with the partial null hypothesis. And don't forget that if the null hypothesis is true for all DVs and something comes up significant anyway, you're dealing with a Type I error to begin with.

WHEN THINGS GO WRONG: DEALING WITH OUTLIERS

MANOVA is reasonably forgiving if the deviation from normality is due to skewness, but it can't handle data with outliers very well. This is a reminder (as if one were necessary) to plot your data before you do anything else, to make sure you don't have outliers or any other pathologic conditions. If the nonnormality is due to outliers, trim the data to get rid of the outliers,30,31 or use ranks instead of the raw data. If you're reluctant to trim the data and still want to use a multivariate procedure, an alternative exists for the one-way test: transform your raw data into ranks, so that, if there are n1 subjects in Group 1 and n2 subjects in Group 2, the ranks will range from 1 to (n1 + n2) for each variable. Then, run MANOVA on the rank scores, multiply the Pillai trace (V) by (N – 1), and check the result in a table of χ2, with df = p(k – 1), where p is the number of variables and k the number of groups (Zwick, 1985). A small sketch of this check appears below.
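Here is that rank-based check in code. It is our own sketch, assuming NumPy and SciPy are installed and using invented data for the two-group, two-DV setup from earlier in the chapter; it simply follows the recipe just described (rank, compute Pillai's V, multiply by N – 1, refer to χ2).

```python
import numpy as np
from scipy import stats

def pillai_trace(groups):
    """Pillai's V from a list of (n_g x p) arrays, one array per group."""
    all_data = np.vstack(groups)
    grand = all_data.mean(axis=0)
    p = all_data.shape[1]
    H = np.zeros((p, p)); E = np.zeros((p, p))
    for g in groups:
        d = (g.mean(axis=0) - grand)[:, None]
        H += len(g) * d @ d.T
        E += (g - g.mean(axis=0)).T @ (g - g.mean(axis=0))
    return np.trace(H @ np.linalg.inv(H + E))

rng = np.random.default_rng(3)
g1 = rng.normal([30, 50], 5, size=(20, 2))
g2 = rng.normal([35, 55], 5, size=(20, 2))

combined = np.vstack([g1, g2])                      # rank each DV over the combined sample
ranks = np.column_stack([stats.rankdata(combined[:, j]) for j in range(2)])
rg1, rg2 = ranks[:20], ranks[20:]

N, p, k = 40, 2, 2
V = pillai_trace([rg1, rg2])
chi2 = (N - 1) * V                                  # compare with chi-square on p(k - 1) df
print(V, chi2, stats.chi2.sf(chi2, p * (k - 1)))
```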
MANOVA assumes that the data come from a random sample of the underlying population, and that the scores for one person are independent of those for all other people. If you violate either of these, the computer will still blithely crunch the numbers and give you an answer, but the results will bear little relationship to what's really going on.

POWER AND SAMPLE SIZE

As we have seen, we may have less power with multivariate tests than with univariate ones.32 This can result in the anomalous situation that some or all of the univariate tests come out significant, but multivariate tests such as Pillai's trace are not; when that happens, it would be safer to trust the results of the univariate tests. There are tables for estimating sample size, based on the number of groups, the number of variables, and power (Läuter, 1978), and some of them appear in Appendix N. These tables first appeared in a very obscure journal from what was once East Germany, so we're not sure if they are legitimate or part of a conspiracy to undermine capitalist society by having the West's researchers waste their time with under- or overpowered studies. In any case, we offer them for your use.33 Appendix O (adapted from Stevens, 1980) gives the power for the two-group MANOVA (Hotelling's T2) for different numbers of variables, sample sizes, and effect sizes. Whatever else you do, we must have more subjects than variables in every cell (Tabachnick and Fidell, 1996); this is an absolute minimum requirement, and the more the better.

A CAUTIONARY NOTE

From all of the above, it may sound as if we should use MANOVA every time we have many dependent measures. In actual fact, because there are more things going on at the same time, it's more likely that we're violating something when we do a MANOVA, and the results are often harder to interpret. The assumption of homogeneity of the VCV matrices is much harder to meet than the assumption of homogeneity of variance. The outcome variables included in any one MANOVA should be related to each other on a conceptual level—don't throw everything into the pot—and it's usually a mistake to have more than six or seven at the most in any one analysis. The best advice with regard to using MANOVA is offered by Tabachnick and Fidell (1983), who are usually strong advocates of multivariate procedures. They write:

Because of the increase in complexity and ambiguity of results with MANOVA, one of the best overall recommendations is: Avoid it if you can. . . . Ask yourself why you are including correlated DVs in the same analysis. . . . Might there be some way of combining them or deleting some of them so that ANOVA can be performed? (p. 230)34

If the answer to their question is "No," then MANOVA is the way to go.

30"Trimming" things: if the degrees of freedom associated with the univariate error terms are over 20, … ; if the dfError is less than 20, … and make the adjustment.
31This is how you can get rid of data you don't want and still get published, as opposed to saying that you simply disregarded those values you didn't like.
32Proving yet again (as if further proof were needed) that there ain't no such thing as a free lunch.
33And the authors (hereinafter referred to as the Parties of the First Part) do verily state, declare, and declaim that all warranties, expressed or implied, assumed by users of these Tables (Parties of the Second Part) are hereby and forthwith null and void.
34Somewhat poignantly, they answer their own question in the next edition of their book: "In several years of working with students . . . we have been royally unsuccessful at talking our students out of MANOVA, despite its disadvantage" (Tabachnick and Fidell, 1996, p. 376).

WRAPPING UP

Taking Tabachnick and Fidell's advice to heart, we should try to design studies so that MANOVA isn't needed: we should rely on one outcome variable, or try to combine the outcome variables into a global measure. If this isn't possible, then MANOVA is the test to use. Analyzing all of the outcomes at once avoids many of the interpretive and statistical problems that would result from performing a number of separate t-tests or ANOVAs. We pay a penalty in terms of reduced power and more complicated results, but these are easier to overcome than those resulting from ignoring the correlations among the dependent variables.

EXERCISES

1. For the following designs, indicate whether a univariate or a multivariate ANOVA should be used.
a. Scores on a quality-of-life scale are compared for three groups of patients: those with rheumatoid arthritis, osteoarthritis, and chronic fatigue syndrome.
b. Same design as 1.a, but now the eight subscales of the quality-of-life scale are analyzed separately.
c. Same design as 1.a, but each of these patient groups is divided into males and females as another factor.
d. Same as 1.c, but with the subscales.
e. Same design as 1.a, but all of these groups are tested every 2 months for a year.
f. Same as 1.e, but using the eight subscales.

2. If the three groups did not differ with regard to their quality of life, and if the eight subscales were analyzed separately, what is the probability that at least one comparison will be significant at the .05 level by chance?

3. [Output from a MANOVA in which SETTING is the grouping factor: Box's M; Levene's tests for Variables A, B, and C; and the multivariate tests (Pillai's trace, Wilks' lambda, Hotelling's trace, Roy's largest root) for the Intercept and SETTING effects.]
a. Based on the results of Box's M, you should: (i) proceed with the analysis without any concern; (ii) proceed, but be somewhat concerned; or (iii) stop right now.
b. Based on the results of Levene's test, you should: (i) proceed with the analysis without any concern; (ii) proceed, but be somewhat concerned; (iii) use the results of the multivariate tests; or (iv) use the results of the univariate tests.
c. Looking at the output from the multivariate tests: Is there anything going on? Does the variable SETTING have an effect?
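Exercise 2 is easier to reason about once you see the arithmetic: if each of the eight subscale comparisons is run at α = .05 and the null hypothesis is true for all of them, the chance that at least one comes up "significant" is 1 − (1 − .05)^8. A one-liner (our own illustration, not part of the exercises):

```python
alpha, k = 0.05, 8
print(1 - (1 - alpha) ** k)   # about 0.34: roughly a one-in-three chance of a false alarm
```

The same formula, with k = 24, is worth keeping in mind for the multiple t-test example in the C.R.A.P. Detectors that follow.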

How to Get the Computer to Do the Work for You
• From Analyze, choose General Linear Model → Multivariate…
• Click the variables you want from the list on the left, and click the arrow to move them into the box labeled Dependent Variables
• Choose the grouping factor(s) from the list on the left, and click the arrow to move them into the box labeled Fixed Factor(s)
• If you have more than two groups, click the Post Hoc button and select the ones you want [good choices would be LSD, Tukey, and Tukey's-b], and then click the Continue button
• Click the Options button and check the statistics you want displayed. The least you want is Homogeneity Tests; if you haven't analyzed the data previously, you will also want Descriptive Statistics and perhaps Estimates of Effect Size
• If you have a repeated-measures design, use the instructions at the end of Chapter 11
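If SPSS isn't on hand, essentially the same analysis can be set up in a couple of lines of Python with the statsmodels package. This is a minimal sketch with invented data and column names; we're assuming statsmodels, pandas, and NumPy are installed.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "group": np.repeat(["male", "female"], 20),
    "dv1": np.concatenate([rng.normal(30, 5, 20), rng.normal(35, 5, 20)]),
    "dv2": np.concatenate([rng.normal(50, 5, 20), rng.normal(55, 5, 20)]),
})

fit = MANOVA.from_formula("dv1 + dv2 ~ group", data=df)
print(fit.mv_test())   # Wilks' lambda, Pillai's trace, Hotelling-Lawley, Roy's largest root
```

As in the SPSS output, the four multivariate statistics agree exactly when there are only two groups.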

SECTION THE SECOND

C.R.A.P. DETECTORS

II–1. A cardiovascular researcher did yet another randomized clinical trial of a new antihypertensive agent. He randomized patients into three groups: (1) captopril, (2) methyldopa, and (3) placebo. After 6 weeks he measured their blood pressures and classified patients as normotensive (diastolic blood pressure < 90 mm Hg) or hypertensive (diastolic blood pressure > 90 mm Hg). He then analyzed the 3 × 2 table (Drug × Normal/Hypertensive) with the usual chi-square test. Would you?

It's studies like this which make statisticians go bald, from all the hair tearing. There are several problems, and we'll deal with them in stages. First, and most important, never take a ratio variable such as blood pressure and categorize it into groups before analysis. The cost in sample size and power is typically a factor of 10 or so. You can do it afterward for ease of interpretation among those folks who see the world in two categories, but never categorize when you don't have to. A one-way ANOVA (Chapter 8) on the diastolic blood pressure (DBP) would be more appropriate. Second, he likely measured DBP at the beginning of the study, and unless the inclusion criteria were incredibly tight such that every patient's initial blood pressure was about the same, systematic differences probably exist among patients. So a repeated-measures ANOVA (Chapter 11) using baseline DBP, with drug as a between-subjects factor and time as a within-subjects factor, and looking for an interaction, would be more powerful still.

C.R.A.P. DETECTOR II–1
Never categorize data that start off as interval or ratio data unless the distributions are absolutely awful.

C.R.A.P. DETECTOR II–2
Baseline measures can, and generally should, be incorporated into analysis with repeated-measures ANOVA.
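The cost of chopping blood pressure into "normotensive/hypertensive" is easy to see with a small simulation. This is our own sketch, not the researcher's data: the group means, SDs, and sample sizes are invented, and it assumes NumPy and SciPy are available. It compares how often a one-way ANOVA on the raw DBP detects the drug effect with how often a chi-square on the dichotomized counts does.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
reps, n = 1000, 50
hits_anova = hits_chi2 = 0
for _ in range(reps):
    captopril  = rng.normal(86, 10, n)     # modestly lower DBP than placebo
    methyldopa = rng.normal(88, 10, n)
    placebo    = rng.normal(92, 10, n)
    hits_anova += stats.f_oneway(captopril, methyldopa, placebo).pvalue < 0.05
    table = [[np.sum(g < 90), np.sum(g >= 90)] for g in (captopril, methyldopa, placebo)]
    chi2, p, dof, _ = stats.chi2_contingency(table)
    hits_chi2 += p < 0.05
print("power, ANOVA on raw DBP:       ", hits_anova / reps)
print("power, chi-square on categories:", hits_chi2 / reps)
```

Under these (made-up) conditions the continuous analysis detects the effect far more often, which is the whole point of Detector II–1.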

II–2. Another cardiovascular researcher wanted to investigate the effect of antihypertensive agents on quality of life.1 He randomized patients to three groups that received captopril, methyldopa, and propranolol, respectively. After 24 weeks, he measured quality of life every way but Sunday with the following scales: (1) general well-being, (2) physical symptoms, (3) sexual dysfunction, (4) work performance, (5) sleep dysfunction, (6) cognitive function, (7) life satisfaction, and (8) social participation. He did t-tests comparing captopril to methyldopa to propranolol on all the measures. A total of 24 t-tests are here, and 9 are significant. What would you do?

ANOVA methods are usually misused by not being used at all. Whenever you see multiple t-tests, suspect that ANOVA would be better. At the least, he should have done a one-way ANOVA (Chapter 8) to see if there was any difference among the three groups on each variable, then pursued any differences with post-hoc contrasts. Better still would be a MANOVA (Chapter 12).

C.R.A.P. DETECTOR II–3
ANOVA methods are usually abused when they're not used.

1This example is based on Croog et al. (1986). They did the analysis exactly right.
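"At the least" translates into code as one ANOVA per scale rather than three t-tests per scale. The sketch below is our own illustration with invented scores for 3 drugs × 8 scales (not Croog's data); dividing α by the number of outcomes is our own conservative choice for handling the multiplicity, not something the example itself prescribes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
drugs, scales, n = 3, 8, 30
scores = rng.normal(50, 10, size=(drugs, scales, n))   # drug x scale x patient

alpha_adj = 0.05 / scales                               # Bonferroni across the 8 scales
for s in range(scales):
    f, p = stats.f_oneway(*[scores[d, s] for d in range(drugs)])
    flag = "significant" if p < alpha_adj else "ns"
    print(f"scale {s + 1}: F = {f:.2f}, p = {p:.3f} ({flag} at adjusted alpha)")
```

Any scale that survives can then be followed up with post-hoc contrasts among the three drugs.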

II–3. Feighner (1985) did a randomized control trial of fluoxetine versus amitriptyline with a small sample of patients. He measured three outcomes: the HAM-D (a depression scale), the Raskin Depression Inventory, and the Covi Anxiety scale, at baseline and at weeks 1, 2, 3, 4, and 5. He measured changes from baseline separately for the 2 drug groups, and reported that "the changes were statistically significant . . ." in the fluoxetine group and for several of the efficacy measurements in the amitriptyline group. He then compared the 2 groups at week 5 and found no significant difference between the two drugs. For the sake of interest, the data for the HAM-D are shown in Figure II–1. Incidentally, only 16 of 44 patients actually completed the trial anyway, but we'll pretend they were all there. Would you analyze it this way?

We sure hope not. This one is so wrong, one wonders how it made it into print. Here goes! He analyzed the data from only week 0 and week 5 and totally ignored the data from weeks 1, 2, 3, and 4. Because he had multiple measures, he should have used repeated-measures ANOVA with one grouping factor (fluoxetine/amitriptyline) and one within-subjects factor (Time). If he had simply used an assessment at time 0 and time 5, the right analysis would be an unpaired t-test on the difference scores, not a paired t-test. The separate analyses essentially ignore the control condition; if the real interest is the new drug (fluoxetine), what matters is how it compares with amitriptyline. The combined analysis at week 5, by contrast, ignores all the data gathered at baseline and along the way. They should have used a repeated-measures ANOVA to look at all the data.

FIGURE II–1 HAM-D data over 5 weeks for the 2 drug groups (fluoxetine and amitriptyline). (Modified from Feighner, JP [1985]. Journal of Clinical Psychiatry, 46, 369–372.)

C.R.A.P. DETECTOR II–4
When data are taken on repeated occasions, use repeated-measures ANOVA.

C.R.A.P. DETECTOR II–5
When you have a control group, you cannot analyze the results of the treatment and control groups separately.
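The "right analysis for two time points" mentioned above, an unpaired t-test on the change scores, looks like this in code. The HAM-D numbers below are invented stand-ins, not Feighner's data, and the sketch assumes NumPy and SciPy are installed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 22
fluox_base, fluox_wk5 = rng.normal(26, 4, n), rng.normal(14, 5, n)
amitr_base, amitr_wk5 = rng.normal(25, 4, n), rng.normal(17, 5, n)

change_fluox = fluox_wk5 - fluox_base        # change from baseline, per patient
change_amitr = amitr_wk5 - amitr_base

t, p = stats.ttest_ind(change_fluox, change_amitr)   # unpaired test on the changes
print(f"t = {t:.2f}, p = {p:.3f}")
```

Testing each group's change against zero with a paired t-test, as in the original report, answers a different and much less interesting question.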

SECTION THE THIRD

REGRESSION AND CORRELATION

CHAPTER THE THIRTEENTH

Simple Regression and Correlation

SETTING THE SCENE: You notice that many of the Yuppie patients in your physiotherapy clinic appear to suffer from a peculiar form of costochondrotendonomalaciomyalagia patella (screwed-up knee). You investigate this new syndrome further by developing an index of Yuppiness, the CHICC score, and attempting to relate it to range-of-motion (ROM) of the knee.

The previous section dealt with ANOVA methods, which are suitable when the independent variable is nominal categories and the dependent variable approximates an interval variable. However, there are many problems in which both independent and dependent variables are interval-level measurements. In these circumstances (with 1 independent variable) the appropriate method is called simple regression and is analogous to one-way ANOVA.

BASIC CONCEPTS OF REGRESSION ANALYSIS

The latest affliction keeping Beverly Hills and Palm Springs physiotherapists employed is a new disease of Yuppies.2 The accelerator and brake of the BMW Series 17 are placed in such a way that, if you try any fancy downshifting or upshifting, you are at risk of throwing your knee out—a condition that physiotherapists refer to as costochondrotendonomalaciomyalagia patella (Beemer Knee for short), apparently brought on by the peculiar shift patterns of the BMW Series 17. The cause of the disease wasn't always that well known, until an observant therapist in Sausalito noticed this new affliction among her better-heeled clients and decided to do a scientific investigation. She examined the relationship between the severity of the disease and some measure of the degree of Yuppiness of her clients. Measuring the extent of disease was simple—just get out the old protractor and measure ROM. But what about Yuppiness? She could have simply considered whether they owned a Series 17 BMW, but she decided to also pursue other sources of affluence. After studying the literature on this phenomenon of the 1980s, she decided that Yuppiness could be captured by a CHICC score,2 defined as follows:

CARS—Number of European cars + number of off-road vehicles – number of Hyundai Ponies
HEALTH—Number of memberships in tennis clubs, ski clubs, and fitness clubs
INCOME—Total income in $10,000 units
CUISINE—Total consumption of balsamic vinegar (liters) + number of types of mustard in refrigerator
CLOTHES—Total of all Gucci, Lacoste, and Saint Laurent labels in closets

2It … Chevettes.
CHICC and ROM are very nice variables; both have interval properties (actually, ROM is a true ratio variable). Thus we can go ahead and add or subtract, take means and SDs, and engage in all those arcane games that delight only statisticians. But the issue is, how do we test for a relationship between CHICC and ROM? Both are continuous variables. You could categorize one or the other into High, Medium, and Low and do an ANOVA, but this would lose information. Are there better ways?

Suppose we enlisted all the suffering Yuppies in Palm Springs.1 We find 20 of them, and measure CHICC score and ROM. Let's begin with a graph. The data might look like Figure 13–1.

FIGURE 13–1 Relation between range of motion (ROM) and CHICC score in 20 Yuppies.

At first glance, it certainly seems that some relationship exists between CHICC and ROM—the higher the CHICC, the less the ROM.

1We would likely have to go outside Palm Springs. The "Y" in Yuppie stands for young, and everybody in Palm Springs is over 80, or looks it because of the desert sun. It's the only place on earth where they memorialize you in asphalt (Fred Waring Drive, Bob Hope Drive, Frank Sinatra Drive) before you are dead.
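"Let's begin with a graph" is one line of matplotlib once the CHICC and ROM scores are in arrays. This sketch uses invented scores standing in for the 20 Yuppies of Figure 13-1; it assumes NumPy and matplotlib are installed.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(8)
chicc = rng.uniform(5, 65, 20)
rom = 56 - 0.8 * chicc + rng.normal(0, 6, 20)   # higher CHICC, lower ROM, plus noise

plt.scatter(chicc, rom)
plt.xlabel("CHICC score")
plt.ylabel("Range of motion (degrees)")
plt.show()
```

Plotting first, before any fitting, is the cheapest insurance there is against outliers and other pathologic conditions.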

To the untrained eye (yours, not ours), CHICC score also seems to follow a straight-line relationship with ROM—we can apparently capture all the relationship by drawing a straight line through the points. (Most relationships are depicted so that more of one gives more of the other; here we have broken with tradition, since more CHICC goes with less ROM. When we were students, the only remedy was to stand on your head, and this never held much appeal. Now the bad news—no wall mirror will save you.)

Before we vault into the calculations, it might be worthwhile to speculate on the reasons why we all agree3 on the existence of some relationship between the two variables. The statistics, if done right, should concur with some of our intuitions. One way to consider the question is to go to extremes and see what conditions would lead us to the conclusion that (1) no relationship or (2) a perfect relationship exists. Examine, if you will, Figure 13–2. Seemingly, the relationship depicted in the upper graph is as perfect as it gets: Y is perfectly predictable from X—if you know one, you know the other.4 The straight-line relationship explained all the variability in Y; that's why we concluded there was a perfect linear relationship in the top graph of Figure 13–2. By contrast, the graph on the bottom is depressingly common; even a sociologist would likely give up on it because of the lack of an apparent association between the two variables.

FIGURE 13–2 Graphs indicating A, a perfect relationship, and B, no relationship between two variables.

Two reasons why we might infer a relationship between two variables are that (1) the line relating the two is not horizontal (i.e., the slope is not zero), and (2) perhaps less obviously, the closer the points fall to the fitted line, the stronger the relationship. One might be driven to conclude that the more the line differs from the horizontal, the stronger the relationship; although this captures the spirit of the game, it is not quite accurate, because we need only create a new ROM, measured in tenths of degrees rather than degrees, to make the slope go up by a factor of 10. Still, both observations contain some of the essence of the relationship question. If we contrast the amount of variability captured in the departures of individual points from the fitted line with the amount of variability contained in the fitted function, then this is a relative measure of the strength of association of the two variables.

To elaborate a little more, consider Figure 13–3, where we have chosen to focus on the narrow window of CHICC scores between 30 and 70. Now the signal (there's that ugly word again!) is contained in the departure of the fitted data from the grand mean of 33.5, or the difference between the fitted points and the horizontal line through the mean of X and Y. The noise is contained in the variability of the individual data about the corresponding fitted points.

FIGURE 13–3 Relation between ROM and CHICC score (enlarged), showing the (Fitted – mean) and (Data – fitted) components.

3One good reason is that the teacher says so.
4Graphs such as the one on top are as rare as hen's teeth in biomedical research.

In other words, we could apply the same, now almost reflex, approach of calculating a Sum of Squares (Signal) based on deviations of the fitted points from the grand mean and a Sum of Squares (Noise) based on deviations of individual data from the corresponding fitted points, calculating the variances, and, in the end, creating an F-test. If this is not starting to look familiar, then you must have slept through Section II.5

To make this just a little less abstract, we have actually listed the data used in making Figure 13–1 in Table 13–1. On the left side is the calculated CHICC score for each of the afflicted; in the middle is the corresponding ROM; and on the right is the fitted value of the ROM based on the analytic approach described below (i.e., plugging the CHICC score into the equation and estimating ROM).

TABLE 13–1 CHICC scores, observed range of motion (ROM), and fitted ROM for 20 Palm Springs Yuppies.

Well, the moment of reckoning has arrived. In several locations we have referred to the fitted line rather glibly, with no indication of how one fits such a line. For openers, you must search through the dark recesses of your mind to retrieve the formula for a straight line, namely:

Y = a + bX

where a is the intercept, the value of Y when X is equal to zero, and b is the slope, or the amount of change in Y for one unit of change in X. Let's rewrite the equation to incorporate the variables of interest in the example and also change "a" and "b" to "b0" and "b1":

ROM-hat = b0 + b1 CHICC

That funny-looking thing over ROM goes by the technical name of "hat"; in today's language, we would say, "ROM hat equals . . ." A ^ over any variable signifies an estimate of it: for any given value of CHICC, the equation yields an estimate of the ROM score, rather than the original value. For reasons that bear no allegiance to Freud, we call Ŷ the fitted point on the line that corresponds to each of the original data; in other words, Ŷ is the number that results from plugging the X value of each individual into the regression equation.

In creating this equation, the first sum of squares results from the signal, analogous to the Sum of Squares (Between) in ANOVA:

SSregression = Σ(Ŷi – Ȳ)²   (13–1)

This tells us how far the predicted values differ from the overall mean. The second sum of squares reflects the difference between the original data and the fitted line, capturing the error between the estimate and the actual data, analogous to the Sum of Squares (Within) in ANOVA, and expressing the variance that remains, or residual variance:

SSresidual = Σ(Yi – Ŷi)²   (13–2)

It should be called the error sum of squares, or the within sum of squares, but because the term doesn't sound obscure and scientific enough, it's called the Sum of Squares (Residual).

Still, the issue remains of how one goes about selecting the values of b0 and b1 to best fit the line to the data. The strategy used in this analysis is to adjust the values in such a way as to maximize the variance resulting from the fitted line, or, equivalently, to minimize the variance resulting from deviations from the fitted line. The regression line is the straight line passing through the data that minimizes the sum of the squared differences between the original data and the fitted points; the method is called regression analysis,7 and the line of best fit is the regression line. A more descriptive and less obscure term is least-squares analysis, because the goal is to create a line that results in the least square sum between fitted and actual data. Unfortunately, no one uses it.

Now, although it sounds as if we are faced with the monumental task of trying some values, diddling the values a bit and recalculating, and carrying on until an optimal solution comes about, it isn't at all that bad. The right answer can be determined analytically (in other words, as a solvable equation) with calculus.8 The key to the solution resides in the magical words maximum and minimum. In calculus, to find a maximum or minimum of an equation, you take the derivative and set it equal to zero—equivalent to setting the slope equal to zero—then solve the equation. The quantity we want to minimize is the squared difference between the individual data and the corresponding fitted line; this sum is differentiated with respect to both b0 and b1, and the resulting expression is set equal to zero. This results in two equations in two unknowns, so we can solve the equations for the optimal values of the b's.

5A not uncommon experience among readers of statistics books, including ourselves. We had hoped the dirty jokes would reduce the soporific effect of this one.
7The real reason it's called "regression" is that the technique is based on a study by Francis Galton called "Regression Toward Mediocrity in Hereditary Stature," in which Galton discovered that tall people's children "regress" to the mean height of the population. (And one of the authors is delighted that persons of average height are mediocre—he always suspected it.)
8Unfortunately, no one who has completed the second year of college ever uses calculus after the course is all over, so you will have to accept that the computer knows the way to beauty and wisdom, even if you don't.
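Rather than taking the calculus on faith, you can let NumPy solve for the least-squares line and then check the two sums of squares by hand. This is our own sketch, with invented data standing in for Table 13-1; it assumes NumPy is available and uses the equation numbers above only as signposts.

```python
import numpy as np

rng = np.random.default_rng(9)
chicc = rng.uniform(5, 65, 20)
rom = 56 - 0.8 * chicc + rng.normal(0, 6, 20)

b1, b0 = np.polyfit(chicc, rom, 1)        # least-squares slope and intercept
fitted = b0 + b1 * chicc                  # the "ROM hat" values

ss_regression = np.sum((fitted - rom.mean()) ** 2)   # Equation 13-1: signal
ss_residual   = np.sum((rom - fitted) ** 2)          # Equation 13-2: noise
r_squared     = ss_regression / (ss_regression + ss_residual)
print(b0, b1, ss_regression, ss_residual, r_squared)
```

The last line anticipates the coefficient of determination discussed later in the chapter: the proportion of the variability in ROM that the fitted line accounts for.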
A more descriptive and less obscure term is least-squares analysis because the goal is to create a line that results in the least square sum between ﬁtted and actual data. or the within sum of squares. we call Y the ﬁtted point on the line that corresponds to each of the original data. the method is called regression analysis7 and the line of best ﬁt is the regression line. then solve the equation. and the resulting expression is set equal to zero. after the regression is all over. tall people’s children “regress” to the mean height of the population. no one who has completed the second year of college ever uses calculus. Now that that is out of the way. you take the derivative and set it equal to zero. you must search through the dark recesses of your mind to retrieve the formula for a straight line.3 32.3 42. (And one of the authors is delighted Galton discovered that persons of average height are mediocre. . so you will have to accept that the computer knows the way to beauty and wisdom. to ﬁnd a maximum or minimum of an equation. This results in two equations in two unknowns. For openers. now almost reﬂex. real reason it’s called “regression” is that the technique is based on a study by Francis Galton called “Regression Toward Mediocrity in Hereditary Stature. the value of Y when X is equal to zero. however. So. To get the best ﬁt line.7 33. Y is the number that results from plugging the X value of each individual into the regression equation. . The quantity we want to maximize is the squared difference between the individual data and the corresponding ﬁtted line. The strategy used in this analysis is to adjust This is capturing the error between the estimate and the actual data. One mystery remains. it isn’t at all that bad. equivalently. ity of the individual data about the corresponding ﬁtted points. in other ˆ words.8 44.0 16. then you must have slept through Section II. and b is the slope.9 42.” It means that for any given value of CHICC. On the left side is the calculated CHICC .3 41. To make this just a little less abstract. In calculus. expressing the variance that remains.4 35. equivalent to setting the slope equal to zero.5 We could apply the same.

2)2 864 (47 50. but the same idea applies.. we’d get a horizontal line. . tells us whether the regression line is signiﬁcantly different from the horizontal (i. It’s time for a little logic. in the middle is the corresponding ROM. can create an F-test. Finally. The residual should have (n – 2 – 1) or 17. so it would seem that the regression line should have 2 df. so it is a little unclear how many df to put on each line. the table contains a t-test.001 < . the remaining terms are a bit problematic. Almost. these values are presented in a separate table. One of the parameters is the intercept term. called S(Y/X) (the standard error of Y given X). but a bit more complicated. like a z-score. And the more dispersion there is in the X-values. in a completely analogous manner. Typically. reason for examining differences from the horizontal line is clear if we project the data onto the Yaxis. The second.. it turns out that the SE is related to the Mean Square (Residual).. plugging the CHICC score in the equation and estimating ROM). the Sum of Squares (Regression) has terms such as: SSreg (52.6 33. As an example of the looks of these sums of squares. Source Sum of squares df Mean square F TABLE 13–2 ANOVA table for CHICC against ROM (step 1) Regression Residual 3892 864 (13–3) Source Sum of squares df Mean square F TABLE 13–3 ANOVA table for CHICC against ROM (step 2) and the Sum of Squares (Residual) has terms such as: SSres (58 52. and we are calculating the analogue of the Sum of Squares (Between).2 33. the slope and the intercept.6)2 3892 33. yes. We have two parameters in the problem. whether a signiﬁcant relationship exists between the CHICC score and ROM).010 0. In this case. If we plotted this. but not quite. BETAS AND TESTS OF SIGNIFICANCE While the ANOVA of regression in Table 13–3 gives an overall measure of whether or not the regression line is signiﬁcant. it’s just the ratio of the coefficient to its SE.6)2 (50. as usual. we’ll not go into any more detail. then the best estimate of Y at each value of X is the mean value of Y. and this is completely equivalent to the grand mean. Since we rarely care about it. is just the standard deviation of the deviations between the original data and the ﬁtted values.01 < . the bigger this is. So the table now looks like Table 13–3. we can also go on to the calculation of the Mean Squares and. Another way to think of it is—if no relationship between X and Y existed. usually are most worried about. like Table 13–4. we have worked out the Sum of Squares (Regression) and Sum of Squares (Residual) and have (inevitably) created an ANOVA table.0)2 . For the slope. The p-value associated with the F-test. so this error term in inversely related to the standard deviation of the X-values. MSRes. Now that we have this in hand.SIMPLE REGRESSION AND CORRELATION 135 score for each of the afflicted. which is what we where Sx is the standard deviation of the X-values.843 Slope (b1) 3. the more error there is going to be in the estimate of the slope.6)2 17. The column labelled β (beta) is called a Standard Regression Coefficient. labeled Standard Error (or SE). to give the usual total of (n – 1).0 48.6)2 . or at least the ﬁrst two columns of it (Table 13–2).0 81. The square root of MSRes. We can’t count groups. in our old ANOVA notation. and on the right is the ﬁtted value of the ROM based on the analytic approach described above (i. which actually has the same form as all the other t-tests we’ve encountered.80 9. 
The idea of df is the difference between the number of data values and the number of estimated parameters. requires some further discussion.e. the better we can anchor the line.001 .1 (13–4) To save you the anguish. just as we’ve shown. However. and (n – 1 – 1) with the error term. it doesn’t actually say anything about what the line actually is—what the computed value of b0 and b1 are. The actual formula is: SE(β1) = √MSRes Sx√(n – 1) (13–5) 8The B’S.0 (17. The ﬁrst column of numbers is just the estimates of the coefficients that resulted from the minimization procedure described in the last section... the smaller the SE. The parameters were means up until now. Parameter b SE(b) β (beta) t p TABLE 13–4 Regression Coefficient and Standard Errors for CHICC against ROM Constant (b0) 56.094 –0.905 18. so only 1 df is associated with this regression. We’ll talk more about it in the next chapter. First. (17 Regression Residual 3892 864 1 18 3892. The horizontal through the mean of the Ys is just the Grand Mean.. with the usual 1/ √n relation. for that matter. losing 1 for the grand mean.e. this expresses the coefficient in standard deviation units. The SE associated with the intercept is similar. But there are a couple of other things coming into play.76 –0. the larger the sample size.

ANOVA is usually applied to experiments in which only a few variables are manipulated. So this fancy formula ends up describing a kind of double-trumpet-shaped zone around the ﬁtted line. It’s wider than the CI for the line for the same reason that the CIs are wider at the ends—sample size. while it’s tempting to think that this band might just be like a ribbon around the ﬁtted line. it’s going to sweep out something like a Japanese fan around the point on the graph corresponding to the mean of X and Y. THE REGRESSION LINE: ERRORS AND CONFIDENCE INTERVALS While we’re pursuing the idea of errors in slopes and intercepts. Fortunately. There’s more error associated with the prediction of the score for one person with a given set of values of the predictor variable(s) than for a group of people with the same values. you’ll see another CI. then. we can imagine that the ﬁtted line. this is the proportion of variance in Y explained by X. For our example. and as we move toward the extremes we have more and more error. This number is called the coefficient of determination and usually written as R2 for the case of simple regression. and an expression involving the standard deviation of the Xs. people who do regression analysis are more aware of this issue and spend more time and paper examining the size of effects than does the ANOVA crowd. signiﬁcant associations are a dime a dozen. actually looks more like a ﬁtted band. is SE(X) = S(Y|X) 1 (X – X)2 + 2 N (n – 1)Sx (13–6) THE COEFFICIENT OF DETERMINATION AND THE CORRELATION COEFFICIENT All is well. and it gets bigger proportional to the square of the distance from X to the mean. For some obscure reason. with its associated errors. Putting the two ideas together. equal to S(Y/X)/ √n . In other words. S(Y/X). where the true value of the line could be anywhere within the band around our computed best ﬁt. The conﬁdence interval is just this quantity multiplied by the t-value. However. much broader. The actual equation is a pretty complicated combination of things we’ve seen before: the SE of the residuals. the data were gathered prospectively at high cost.136 REGRESSION AND CORRELATION 95% CI idual P for Indiv ns redictio Variable 2 FIGURE 13–4 The 95% CI around the regression line (red) and the 95% CI for predicting the score of a single individual (grey). no matter how small. There is error in both the slope and intercept. the sample size. this equals . the SE x is at a minimum when X is at the mean. and our Palm Springs physiotherapist now has a glimmer of hope concerning tenure. If we think about the graph of the data in Figure 13–1. on either side of the ﬁtted line: tions 95% CI 95% C ividua I for Ind l Predic ˆ CI95% = Yi ± tn – 2. 95% CI where S2 is the variance of the X-values. around the regression line. By contrast. After all. Further. which is the (1 – α)% conﬁdence interval around the line. The standard error of the line at any point X. Regression. the test of the signiﬁcance of the regression line is the test of the signiﬁcance of the slope. The F-test from Table 13–3 is just the square of the t-test. is often applied to existing data bases containing zillions of variables. Under these circumstances. with limits as two other parallel lines above and below the ﬁtted line. or the total sum of squares. So. this isn’t quite the case. we have the best ﬁx on the line where most of the points are. and the pvalues are exactly the same. Put another way. If you go back to Figure 13–4. as shown by the dotted red lines in Figure 13–4. 
particularly multiple regression. 1 – α/2 SE(X) (13–7) Variable 1 There’s another consistency lurking in the table as well. there are a couple of other ways to think about it. and the researchers are grateful for any signiﬁcant result. and their size matters a lot. this means that as the slope varies. This is the CI for predicting the DV for a single individual. at the center. We have a simple way to determine the magnitude of the effect—simply look at the proportion of the variance explained by the regression. we have been insistent to the point of nagging that statistical signiﬁcance says nothing about the magnitude of the effect. The formula is: R2 SSreg SSreg SSres (13–8) This expression is just the ratio of the signal (the sum of the squares of Y accounted for by X) to the signal plus noise. One explanation may lie in the nature of the studies. the conﬁdence interval around the ﬁtted line is going to be at a minimum at the mean and spread out as we get to high and low values (where there are fewer subjects). it works out this way.

Y also deviates in a positive direction from its mean. as the positive and negative terms cancel each other out. For completeness.3. The correlation coefficient. every time you have a positive deviation of X from its mean. is that the df of the correlation is the number of pairs –2. looking up the signiﬁcance of the correlation in Table G in the Appendix). INTERPRETATION OF THE CORRELATION COEFFICIENT (13–11) However messy this looks. X)2] [∑(Yi Because we can write (Xi X ) as xi. The numerator is a bit different—it is a crossproduct of X deviations and Y deviations from their Because the correlation coefficient is so ubiquitous in biomedical research. when no relationship exists. So. this version was derived by another one of the ﬁeld’s granddaddies. large values of X are associated with small values of Y. one for X and one for Y. if X and Y are negatively correlated. Thus. we call the correlation positive if the slope of the line is positive (more of X gives more of Y) and negative if. First. people have developed some cultural norms about what constitutes a reasonable value for the correlation. of historical importance. so this term ends up as (–) × (–) = +. So the sum of the cross-products would likely end up close to zero. and vice versa. all square-rooted. which . each positive deviation of X from its mean would be equally likely to be paired with a positive and a negative deviation of Y. such as in the present situation. the converse also holds: the square root of a positive number9 can be positive or negative.904. Now. respective means. this is not the usual expression encountered in more hidebound stats texts. One other fact. This is of some value. Hence it is often called the Pearson Correlation Coefficient. this is completely analogous. Its full name. Usually. Each term therefore contributes a negative quantity to the sum.818. One starting point that is often forgotten is the relationship between the correlation coefficient and the proportion of variance we showed above—the square of the correlation coefficient gives the proportion of the variance in Y explained by X. the slope is negative. Of course. standardized by dividing out by the respective SDs. It can happen.. Conversely. In this case. some components are recognizable. each pair contributes a positive quantity to this sum. is the Pearson Product Moment Correlation Coefficient. We choose to remain consistent with the idea of expressing the correlation coefficient in terms of sums of squares to show how it relates to the familiar concepts of signal and noise. then. Because the square of any number. So the correlation is – √. we would have a product of the variance of X and the variance of Y. and a correlation of . explains slightly less than half the variance. this term expresses the extent that X and Y vary together. Some clariﬁcation may come from taking two extreme cases.g. imagine that X and Y are really closely related. expresses the proportion of variance in the dependent variable explained by the independent variable. small values of X and Y correspond to negative deviations from the mean. yet another way of representing it is: r= cov(X. used only at black-tie affairs. If we divide out by an N here and there. Karl Pearson. so the term is (+) × (+) = +. this is due simply to rounding error. this can also be written as: r ∑xy ∑x2 ∑y2 Y) Incidentally. it is then set equal to zero. which may be helpful at times (e. 
long before you had any statistics course—it’s the correlation coefficient: r ± SSreg SSreg SSres (13–9) Note the little ± sign.Y) var(X) var(Y) (13–12) that the coefficient of determination should not be less than zero because it is the ratio of two sums of squares. Now imagine there is no relationship between X and Y. so it is called the covariance of X and Y. we feel duty-bound to enlighten you with the full messy formula: r ∑(Xi [∑(Xi X )(Yi Y) Y)2] (13–10) The covariance of X and Y is the product of the deviations of X and Y from their respective means. (If you examine the formula for eta2 in Chapter 8. The denominator is simply made up of two sums of squares. or cov (X. which is viewed favorably by most researchers.7.Y). This name is used to distinguish it from several alternative forms. positive or negative.SIMPLE REGRESSION AND CORRELATION 137 3892 ÷ (3892 + 864). is the covariance of X and Y. So if X and Y are highly correlated (positively). the coefficient of determination.) R2. Whatever it’s called. to have an estimated sum of squares below zero. or 0. it is always abbreviated r.818 = –. 9Note The correlation coefficient is a number between –1 and +1 whose sign is the same as the slope of the line and whose magnitude is related to the degree of linear association between two variables. However. The square root of this quantity is a term familiar to all. so that when X is large (or small) Y is large (or small)—they are highly correlated. is always positive. and (Yi as yi. So a correlation of . in particular the Intraclass Correlation.

So a correlation of 0 means that the SD of Y about the line is just as big as it was when you started. Similarly.3 is really going from the sublime to the ridiculous. Unfortunately. You don’t have to track our own CVs very far back to ﬁnd instances where we were waxing ecstatic in print about pretty low correlations. In some quarters.9 .5. then the score on one variable doesn’t affect the score on the other. accounts for about 10% of the variance. and half would be below. What Table 13–5 demonstrates is that a correlation of .9. if the correlation were 1.9 D FIGURE 13–5 Scatter plots of data with correlations of A. the actual relationship is shown in Figure 13–6. B. a lot of scatter occurs about the line.5 reduces the scatter in the Ys by about only 13%. a correlation of 1 reduces the scatter about the line to zero. the cultural norms now reestablish themselves. and D.1 . What about the values in the middle—how much is the SD of Y reduced by a given correlation? We’ll tell you in Table 13–5. If we take only those people who score above the median on X. and .87 .6 is larger than one of 0. R2 for simple and multiple regression) is on ratio scale.7.3. we’ll discuss a transformation that puts r on a ratio scale. plained scatter. but it is not twice as large.0. so that an r2 of 0. 11We ‘fess up. .10 In Figure 13–5. conversely.3 REGRESSION AND CORRELATION C r = 0. where do they fall on Y? If the correlation between X and Y is 0. Variances aren’t too easy to think about. . but SDs are—they simply represent the unex- TABLE 13–5 Proportional reduction in standard deviation of Y for various values of the correlation Correlation SD (Y | X) ÷ SD (Y)* . and even a correlation of .” and means the new value of the standard deviation of Y after the X has been ﬁtted.5. .71 . . However.95 . A correlation of 0. the relationship isn’t linear between r = 0 and r = 1.43 *The expression (Y| X) is read as “Y given X.9.3 hardly merits any consideration. the coefficient of determination (r2 for a simple correlation. such as physiology and some epidemiology. In the next section.3 .9 still has an SD of the Ys that is 43% of the initial value! It should be evident that waxing ecstatic and closing the lab down for celebration because you found a signiﬁcant correlation of .3. is statistically signiﬁcant with a sample size of 40 or 20 (see Table G in the Appendix). . C. Our calibrated eyeball says that.15.11 Another way to put a meaningful interpretation on the correlation is to recognize that the coefficient is derived from the idea that X is partially explaining the variance in Y. any correlation of .5 . we have generated data sets corresponding to correlations of . 10If people took Section I seriously. Having said all that.3. is viewed with delight. Those are the circles we move in. . even at . so it is. this demonstration would not be necessary. which is statistically signiﬁcant with a sample size of about 400.138 A r = 0. However they don’t. then all of the people who are above the median on X would also be above the median on Y.7.7 .7 B r = 0. There’s a third way to get some feel for the magnitude of a correlation. we have demonstrated for you how correlations of different sizes actually appear.5 r = 0. so we’d expect that half of these people would be above the median on Y. To maintain some sanity. .9.50 is twice as large as an r2 of 0. Bear in mind when you’re trying to make sense of the relative magnitudes of two correlations that r is not scaled at an interval level.25. .995 .

96 = 1. such as Z. Would that life were that simple here. It turned out that he got it bass-ackwards—the cancer can produce hypocholesterolema.833) 1 – . but there should be. to the later embarrassment (we hope) of the investigators involved. where r = . etc. so not too surprisingly.0.6 0.494 – 1.494 which means that the two ends of the interval are: 1. if it isn’t. Lovely physiologic explanations have been made of the association—extra vascularization. amidst all the hoopla about the dangers of hypercholesterolemia.2 0. and for large correlations.9 between the number of telephones per capita and the infant mortality rate. which can lead to two problems: the CIs won’t be accurate. 100 90 Percent above median 80 70 60 50 0 0. Unfortunately. as we’ve said. Pearson’s r is not measured on an interval scale and therefore is not normally distributed. and just because this correlation is signiﬁcant at the . SDs. there isn’t such a word. one researcher found that hypocholesterolemia was associated with a higher incidence of stomach cancer and warned about lowering your triglyceride levels too much. For example. with N – 3 df. you ﬁnd a correlation of about –. such as the mean. Conﬁdence Intervals Now the 95% CI is in the same form that we’ve seen before: zα/2 √N – 3 (13–15) ‘ ‘ 12No. If there is one guiding motto in statistics. Simple as the idea is. and just because you can predict Y from X. it’s fairly straightforward to ﬁgure out the SE and then the 95% CI.96 = 1. use the table for the t-tests.0001 level. for various magnitudes of the correlation. We have to proceed in three steps: (1) transform the r so that its distribution is normal. Note that the SE is independent of anything happening in the data—the means. it continues to amaze us how often it has been ignored. SIGNIFICANCE TESTS. However. the upper bound of the CI may exceed 1. We can either use Equation 13–13 and do the calculations (it’s not really that hard with a calculator). If you compare country statistics.904 1 loge + loge (19. and (3) “untransform”12 the answers so that they’re comparable with the original correlation. in the end.494 + [ [ [ [ ‘ One last point about the interpretation of the correlation coefficient.SIMPLE REGRESSION AND CORRELATION 139 Cl95% = z ± CONFIDENCE INTERVALS. does not mean that X causes Y.904 2 2 (13–17) = 1. several studies showed an association between an “ear crease” in the earlobe and heart disease. Closer to home.969 √20 – 3 1. for which we use Equation 13–16 or Table Q in the Appendix: ‘ r= e2z – 1 e2z + 1 (13–16) So for our example. or the correlation itself.904. and the latter is a known and much more plausible risk factor. AND EFFECT SIZE Making r Normal With other parameters we’ve calculated.96 for the 95% CI if N is over 30. It is equally plausible that Y causes X or that both result from some other things.019 and √20 – 3 1. Correlation and the SE of z' is simply: SE2 = 1 √N – 3 (13–14) where N is the number of pairs of scores. most people would recognize that the underlying cause of both is degree of development. an excess of androgens. (13–18) . However much fun it is to speculate that the reason is because moms with phones can call their husbands or the taxis and get to the hospital faster. or use Table P in the Appendix: 1+r 1 z = 2 loge 1 – r (13–13) ‘ where zα/2 = 1. Now we have to take those two z'-values and turn them back into rs. it is this: CORRELATION DOES NOT EQUAL CAUSATION! Just because X and Y are correlated. z= 1 1 + . 
it turned out that both ear creases and coronary artery disease are strongly associated with obesity.8 1 FIGURE 13–6 Percentage of people above the median on one variable who are above the median on the second variable.4 0. (2) ﬁgure out the SE and CI on the transformed value. Fisher worked out the equation for the transformation. it’s called Fisher’s z'.

Effect Size Having just gone through all these calculations to test the signiﬁcance of r.30 or so).769.316 e(2 × 969) + 1 [ [ [ and r. if r2 doesn’t even reach 0. Here we fall back on the coefficient of determination we discussed earlier. we rarely test to see if two correlations are different. a more common situation. it doesn’t depend on means or SDs.95.50. with df = N – 2: t= r = r √N – 2 SEr √1 – r2 (13–22) which is equal to . However. estimation) to an art form regard this as a disadvantage because it reduces the researcher’s df. Second.675 so the 95% CI for the correlation is [0. how can we compare them? Actually. The test for statistical signiﬁcance is a run-of-the-mill t-test.019) – 1 = = 0.. whereas d has no upper limit. and then see what is signiﬁcant to build a quick post-hoc ad-hoc theory. it is reasonable to ask what sample size is necessary to detect a correlation of a particular magnitude.962 52. [ (13–19) 50. The good news is that the SEs of the distribution are dependent only on the magnitude of the correlation and the sample size. one small wrinkle makes the sample size formula a little hairier. testing whether an r is statistically different from 0. you can save yourself a lot of work by simply referring to Table G in the Appendix. We construct the normal curve for the null hypothesis.969) – 1 = = 0. the ES for t-tests. We have to do it to keep journal editors happy and off our backs. but for all other values. it’s shorter at the end nearer to 0 or 1. an r of 0.101 As with the SE for z'. but the important question is rarely if r differs from zero. correlate everything with everything.9042 20 – 2 (13–21) SE r .904 + . we determined the sample size required to determine if one mean was different from another.019) + 1 8. Of course. we should mention two things.962]. you may recall an earlier situation where we indicated that an F-value with 1 and N df was equal to the squared t-value.316 e(2 × 1. the parameter divided by the SE. and it revealed itself in Equation 13–20 earlier. r is constrained between –1 and +1 (although as an index of ES we can ignore the sign). The problem is that it’s on a different scale from d. only the correlation and the sample size. So. The sample size calculation proceeds using the basic logic of Chapter 6—as do virtually all sample size calculations involving statistical inference. almost any correlation will differ from zero.952. the situation does arise when a theory predicts a correlation and we need to know whether the data support the prediction (i. and then solve the two z equations for the critical value. This case is no exception. is to take a data base.675 e(2 × 1. which is: SE r = 1 n r2 2 (13–20) which in our case is: 1 . However. The issue is if it’s large enough to take note of. the second normal curve for the alternative hypothesis. However. it’s quite simple. the correlation is signiﬁcant). As a rough rule of thumb. in some ways. the equivalent F-value is 8. It’s symmetrical when r is 0. 0. SAMPLE SIZE ESTIMATION Hypothesis Testing In the previous chapters on ANOVA and the t-test. The situation is a little different for a correlation.769 e(2 × 1.140 REGRESSION AND CORRELATION Getting those back into rs gives us: 6. To see if the correlation is signiﬁcantly different from 0. we ﬁrst need the SE of r. First. we use the formulae that Cohen (1988) kindly gave us: d= 2r √1 – r2 (13–23) to get from r to d. r is itself a measure of effect size. For completeness. with a large enough sample size. 
so we don’t have to estimate (read “guess”) the SE. Note the CI is symmetric around the z'-score. a Type III error—getting the correct answer to a question nobody is asking.13 The bad news is that the dependence of the SE on the correlation itself means that the widths of the curves for the null and alternative hypotheses are different. but isn’t around the value of r. which is just about the value (within rounding error) that emerged from our original ANOVA (Table 13–3). these situations are built on existing data bases. When designing such a study. The net result of some creative algebra is: . don’t bother to call us with your results.e. particularly among those of us prone to data-dredging. In fact.10 (that is.101 = 8. and r= d √d2 + 4 (13–24) to get from d to r. simply square the value of 13Those of us who have developed sample size fabrication (oops.00 is. so sample size calculations are not an issue— you use what you got.

The net result of some creative algebra is:

n = [(zα + zβ √(1 − r²)) / r]²     (13–25)

To avoid any anguish putting numbers into this equation, we have put it all onto a graph (actually the two graphs in Figure 13–7), and also to reinforce the message that such calculations are approximate. To read these families of curves, first decide what the α level is going to be: .05 or .01. α = .05 puts you on the left graph; α = .01 puts you on the right. Next, pick a β level from .05 to .20, which orients you on one of the three curves on each graph. The next guess is related to how big a correlation you want to declare as significant, which puts you somewhere on the X-axis. Finally, read off the approximate sample size on the Y-axis.

FIGURE 13–7 Sample size for correlation coefficients related to magnitude of the correlation and α and β level

SUMMARY

Simple regression is a method devised to assess the relationship between a single interval-level independent variable and an interval-level dependent variable. The method involves fitting an optimal straight line based on minimizing the sum of squares of deviations from the line. The adequacy of fit can be expressed by partitioning the total variance into variance resulting from regression and residual variance, and significance tests are derived from these components of variance. The proportion of variance resulting from the independent variable is expressed as a correlation coefficient.
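If you would rather compute than squint at Figure 13–7, here is a sketch in Python (ours, not the book's approach). The first function follows the two-normal-curve logic described in the text, where the null curve has SE of about 1/√n and the alternative curve has SE of about √(1 − r²)/√n; the second is a widely used alternative based on Fisher's z', included only for comparison, and it is not the book's Equation 13–25.

import math
from scipy import stats

def n_for_correlation(r, alpha=0.05, beta=0.20):
    """Approximate n to detect a correlation r (two-tailed alpha)."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(1 - beta)
    return math.ceil(((z_a + z_b * math.sqrt(1 - r**2)) / r) ** 2)

def n_fisher(r, alpha=0.05, beta=0.20):
    """A common alternative formula based on Fisher's z' transformation."""
    z_a, z_b = stats.norm.ppf(1 - alpha / 2), stats.norm.ppf(1 - beta)
    z_prime = 0.5 * math.log((1 + r) / (1 - r))
    return math.ceil(((z_a + z_b) / z_prime) ** 2 + 3)

print(n_for_correlation(0.30), n_fisher(0.30))   # both land in the mid-80s

Either way, the answers agree with the curves in Figure 13–7 to within the precision you can read off a log-scaled graph, which is the point about such calculations being approximate.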

EXERCISES

1. Two studies are conducted to see if a relation exists between mathematics ability and income. Study 1 uses 100 males, ages 21 to 65, drawn from the local telephone book. Study 2 uses the same sample strategy but has a sample size of 800. What will be the difference between the studies in the following quantities (Study 1 vs. Study 2)?

Sum of Squares (Regression) _____
Sum of Squares (Residual) _____
Coefficient of determination _____
Correlation _____
Significance of the correlation _____
Slope _____
Intercept _____

2. Study 3 uses the same sample size as Study 2, but the men are sampled from subscribers to Financial Times. Now what will happen to these estimates (Study 3 vs. Study 2)?

Sum of Squares (Regression) _____
Sum of Squares (Residual) _____
Coefficient of determination _____
Correlation _____
Significance of the correlation _____
Slope _____
Intercept _____

3. An analysis of the relationship between income and SNOB (Streiner-Norman Obnoxious Behavior) scores among 50 randomly selected men found a Pearson correlation coefficient of 0.45. Would the following design changes result in an INCREASE, DECREASE, or NO CHANGE to the correlation coefficient:
a. Increase the sample size to 200
b. Select only upper-echelon executives
c. Select only those whose SNOB scores are more than 2 SD above or less than 2 SD below the mean

How to Get the Computer to Do the Work for You
• From Analyze, choose Correlate → Bivariate
• Click the variables you want from the list on the left, and click the arrow to move them into the box labeled Variables
• Click OK

If you want to see a scatterplot, then:
• From Graphs, choose Scatter
• If it isn't already chosen, click on the box marked Simple, then select the Define button
• Select the variable you want on the X-axis from the list on the left, and click the X Axis arrow
• Do the same for the Y-axis
• Click OK
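If you work outside SPSS, here is a rough scripted equivalent of those menu steps (our own sketch, not from the book). The data below are invented purely for illustration; the names income and snob are ours.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical data: 50 men's incomes and SNOB scores
rng = np.random.default_rng(11)
income = rng.normal(100, 20, 50)
snob = 0.45 * (income - 100) / 20 + rng.normal(0, 0.9, 50)

r, p = stats.pearsonr(income, snob)                            # the bivariate correlation
slope, intercept, r_val, p_val, se = stats.linregress(income, snob)
print(f"r = {r:.2f}, p = {p:.4f}, slope = {slope:.3f}, intercept = {intercept:.2f}")

plt.scatter(income, snob)                                      # the scatterplot
plt.xlabel("Income")
plt.ylabel("SNOB score")
plt.show()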

Multiple regression involves the linear relationship between one dependent variable and multiple (more than one) independent variables. Not surprisingly.000 units. 1Although some researchers might view this as a good thing.1 Second. beyond a certain point. Ours would look like: ˆ Y = b0 + b1CARS + b2HEALTH + b3INCOME + b4CLOTHES + b5CUISINE (14–1) This is just longer than what we had before. or minivans. our intrepid physiotherapist ventured into behavioral medicine by examining the relationship between Beemer Knee and a number of factors associated with the Yuppie lifestyle. Looking closer at the cause of the affliction. HEALTH—Number of memberships in tennis clubs. HEALTH might aggravate the condition.CHAPTER THE FOURTEENTH In this chapter. both individually and together. and analyzing all these extra data costs more. or skiing. What is the effect of stuffing extra variables in the summary score? First. CLOTHES might hurt too. and ﬁtness clubs. We want to keep track of the contribution made by individual variables while still allowing for the joint prediction of the dependent variable by all the variables (or. It may have occurred to you that she was perhaps oversimplifying things by picking ﬁve variables and then ramming them all into a single total score. How do you combine all these multiple measures into one regression analysis? I n the last chapter. all the variables contributing signiﬁcantly to the prediction). Although seemingly complex. we generalize the methods of regression analysis to cope with the situation where we have several independent variables that are all interval-level and one dependent variable. and Saint Laurent labels in the closets. reducing the sensitivity of the analysis. collecting. despite the label. not fundamentally different. CALCULATIONS FOR MULTIPLE REGRESSION The ﬁrst step in multiple regression is to create a new regression equation that involves all the independent variables of interest. You may recall that Yuppiness was codiﬁed by a CHICC score. CUISINE–Total consumption of balsamic vinegar (liters) + number of types of mustard in the fridge. squash. But INCOME and CUISINE seem to be a bit of a stretch. you decide to explore further exactly what aspects of “lifestyle” are causing the problem. it seems at ﬁrst blush that some of these variables may play a larger role in the disease than others. deﬁned as follows: CARS—Number of European cars + Number of off-road vehicles – Number of Hyundai Ponies. they are likely contributing only noise to the prediction. You want to look at all the variables in the CHICC score. coding. and shown that it is indeed a result of a decadent lifestyle. CARS is an obvious prime candidate because the disease was first recognized among Beemer drivers and appeared to be related to fast shifting or heel-andtoe braking. ski clubs. CLOTHES—Total of all Gucci. constricting the circulation in the lower extremities. Beemer Knee. it goes by the name of multiple regression. as a result of all the twisting and knee strain from tennis. INCOME—Total income in $10. Lacoste. Chevettes. if subjects are wearing skin-tight slacks too often. the method is actually a conceptually straightforward extension of simple regression to the case of multiple variables. A reasonable next step 143 . Multiple Regression SETTING THE SCENE Having described (and published about) the new syndrome. as we shall see.

two other Sums of Squares can be extracted from the data. Note that “the whole lot” consists of a series of 20 data points on this six-dimensional graph paper. and Sum of Squares (Total). SSres is the difference between individual data. which could reduce the ﬁt to a nonsigniﬁcant level while actually improving the ﬁtted Sum of Squares. Several differences are seen between the numbers in this table and the tables resulting from simple regression in the previous chapter.95 (14–5) ∑ [ROMi ROM ]2 (14–4) As you might have expected. from 3892 to 4280. no one has yet come up with six-dimensional graph paper. one goes into the intercept. This then results in a lower Fratio. Because Sum of Squares (regression) uses 5 df. the ﬁne print. this has gone up because the Sum of Squares (Regression) is larger. In the simple regression case. Sum of Squares—Although the Total Sum of Squares is the same as before. as before. particularly the bar across the top of ROM instead of the i below it. However. The computer now determines. Introducing additional variables in regression. Each datum is in turn described by six values corresponding to ROM and the ﬁve independent variables. that were we to graph the relationship between ROM and each of the independent variables individually. makes all the difference. or SStot. ANOVA. the df for the residual drops to 14. thereby increasing the Sum of Squares (Regression) and reducing the Sum of Squares (Residual) by the same amount. 3.” The cost of introducing the variables separately was to lose df. so we’ll let that one pass for the moment. This is also understandable. even though the ﬁt has improved. just as we did in the simple regression case. just as before. Signiﬁcant or not. Degrees of Freedom—Now the df resulting from regression has gone from 1 to 5.0 34. at least for now. or SSreg. this is one of many illustrations of the Protestant Work Ethic as applied to stats: “You don’t get something for nothing.17 . We have six estimated parameters. would be to graph the data.144 REGRESSION AND CORRELATION TABLE 14–1 Analysis of variance of prediction of ROM from ﬁve independent variables Source Sum of squares df Mean square F p Regression Residual Total 4280 476 4756 5 14 19 856. But the interpretation is the same. This is actually understandable. Nevertheless. as before. The overall df is still 19. the corresponding Mean Square has dropped by a factor of nearly four. In fact. How can such a little difference make such a big difference? Let’s take things in turn and ﬁnd out. or anywhere else can actually cost power unless they are individually explaining an important amount of variance. making an ANOVA table (Table 14–1). 2. the independent variables are abbreviated to conserve paper. Mean Squares and F-ratio–—Finally. ROMi. . Of course. because the overall df must still equal the number of data –1. this improves the overall fit a little. Then. with 5 df corresponding to the coefficients for each variable. Sum of Squares (Regression). where “best” is deﬁned as the combination of values that result in the minimum sum of squared deviations between ﬁtted and raw data.001 2From here on in. SS reg ∑ [R O M (b 0 + b 1 CA i + b 2 HE i + b 3 INC i + b 4 CL i + b 5 CU i )] 2 (14–3) Although this equation looks a lot like SSres. In turn.0 25. 1. the Mean Squares follow from the Sum of Squares and df. now with 5 and 14 df. The quantity that is being minimized is:2 ∑ [ROM i (b 0 + b 1 CA i + b 2 HE i + b 3 INC i + b 4 CL i + b 5 CU i )] 2 (14–2) We will call this sum. 
Here we are estimating the contribution of each variable separately so that the overall fit more directly reﬂects the predictive value of each variable. rather than two. and the ﬁtted value. the Sum of Squares resulting from regression has actually gone up a little. Note the capital R. the value of the bs corresponding to the best-ﬁt line. this is called the Multiple Correlation Coefficient to distinguish it from the simple correlation. We can then proceed to stuff the whole lot into the computer and press the “multiple regression” button. SSreg is the difference between the ﬁtted data and the overall grand mean R O M Finally.0) and the df (19) are the same. our bit for the “green revolution” and as compensation for the contribution of all our hot air to global warming. but it is still wildly signiﬁcant. we simply added up the ﬁve subscores to something we called CHICC. we can put it all together. one for each of the 20 Yuppies who were in the study. only the Total Sum of Squares (4756. we will presume. the Sum of Squares (Residual) or SSres. We can now go the last step and calculate a correlation coefficient: R SS reg SS reg SS err 4280 4280 476 . an approximately straight line would be the ﬁnal result. SStot is the difference between raw data and the grand mean: SStot And of course.

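The mechanics of Table 14–1 are easy to reproduce. The sketch below (ours; the book's 20 ROM values are not reprinted here, so simulated stand-ins are used) fits the five-predictor equation by least squares, partitions the total Sum of Squares into regression and residual pieces, and takes R as the square root of SSreg/SStot, which for the book's data is √(4280/4756), or about .95.

import numpy as np

def regression_anova(X, y):
    """Least-squares fit and the SS partition used in Table 14-1.
    X is an (n, k) matrix of predictors (CARS, HEALTH, INCOME, CLOTHES, CUISINE)."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])        # add the intercept b0
    b, *_ = np.linalg.lstsq(design, y, rcond=None)   # b0, b1, ..., bk
    fitted = design @ b
    ss_res = np.sum((y - fitted) ** 2)               # Sum of Squares (Residual)
    ss_reg = np.sum((fitted - y.mean()) ** 2)        # Sum of Squares (Regression)
    ss_tot = np.sum((y - y.mean()) ** 2)             # Sum of Squares (Total)
    F = (ss_reg / k) / (ss_res / (n - k - 1))        # df = k and n - k - 1
    return b, ss_reg, ss_res, ss_tot, F

# Simulated stand-in for the 20 Yuppies (not the book's actual data)
rng = np.random.default_rng(14)
X = rng.normal(size=(20, 5))
y = X @ np.array([3.0, 1.5, 1.0, 0.2, 0.1]) + rng.normal(size=20)
b, ss_reg, ss_res, ss_tot, F = regression_anova(X, y)
R = np.sqrt(ss_reg / ss_tot)   # the multiple correlation; about .95 for the book's data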
892 4.0 45.4 . HEALTH comes next. 864 476 3.0 4113. the Regression Sum of Squares caused by just the three signiﬁcant variables would be: SS reg (3405 1622 643) 5670 (14–6) RELATIONSHIPS AMONG INDIVIDUAL VARIABLES Let’s backtrack some and take the variables one at a time. As we already know. The numbers represent the relevant Sum of Squares. In Figure 14–1. it is also larger than the total Sum of Squares! How can this be? Not too difficult. the strategy of looking at simple correlations ﬁrst and eliminating from consideration insigniﬁcant variables is not a bad one. a graphical interpretation displays activities of the sums of squares. Not so. but at the signiﬁcant cost of df.0 174. the individual ANOVAs (with the corresponding correlation coefﬁcients) would look like Table 14–2. but it has a negative simple correlation.0 75. as discussed previously.81 .0 . as we shall see.21 237. The ability to own a Beemer and belong to exclusive tennis clubs are both related to income—the three variables are intercorrelated. you might think that we can put these individual Sums of Squares all together to do a multiple regression. However. The Not only is this larger than the Sum of Squares (Regression) we already calculated. presumably you have to be rich to afford cars and everything else that goes with a Yuppie lifestyle. INCOME is next.22 .36 214. large numbers of variables demand large samples.31 –. The disadvantage is that you can get fooled by simple correlations—in both directions. really. Last. First.85 1622.0 1351. as yet. presumably if you get enough exercise. with the multiple regression taking a bit more of the pie.58 643.5 4541. unfortunately. so it’s helpful to reduce variables early on.5 252. CUISINE and CLOTHES are not signiﬁcant.280 CHICC Simple CHICC Multiple The Multiple Correlation Coefficient (R) is derived from a multiple regression equation.95 . These data give us much more information about what is actually occurring than we had before.0 1 18 237. doing a simple regression.3 .0 3134. a bit of difference exists.0 228. You might rightly ask what the big deal is because we have not done much else than improve the ﬁt a little by estimating the coefficients singly. Although we confess to having rigged these data so that we wouldn’t have to deal with all the complications down the road. your muscles can withstand the tremendous stresses associated with Beemer Knee.0 4519. We must recognize that the three variables are not making an independent contribution to the prediction. so we can drop them from further consideration. and its square (R2) indicates the proportion of the variance in the dependent variable explained by all the speciﬁed independent variables.0 1 18 643. But CARS alone is most of the sum of squares and has the correspondingly highest simple correlation. as before. This is as it should be.5 1 18 214. So that’s it so far.0 1 18 1622. exploited the speciﬁc relationships among the variables.1 9.85 .5 2. If you permit a little poetic license.0 251. If we did. and still signiﬁcant. note that the total sum of squares is always 4756. This may suggest that income causes every- Source Sum of squares df Mean square F r TABLE 14–2 ANOVA of regression of individual variables Cars: Regression Residual Health: Regression Residual Income: Regression Residual Clothes: Regression Residual Cuisine: Regression Residual 3405.MULTIPLE REGRESSION 145 FIGURE 14–1 Proportion of variance (shaded) from simple regression of CHICC score and multiple regression of individual variables. 
As always.0 1 18 3405. advantage is that. we have shown the proportion of the Total Sum of Squares resulting from regression and residual. we have not. At ﬁrst blush. it was clinical observations about cars that got us into this mess in the ﬁrst place.

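The point that individual regression Sums of Squares do not simply add when the predictors are intercorrelated is easy to demonstrate for yourself. The sketch below (ours, using deliberately correlated simulated predictors) regresses the outcome on each predictor alone and then on all three together, so you can watch the sum of the one-at-a-time SSreg values overshoot the joint SSreg, just as 5670 overshot both 4280 and even the total of 4756 in the text.

import numpy as np

def ss_regression(X, y):
    """SS(Regression) for an intercept-plus-X least-squares fit."""
    design = np.column_stack([np.ones(len(y)), X])
    fitted = design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.sum((fitted - y.mean()) ** 2)

# Three deliberately intercorrelated predictors (think CARS, HEALTH, INCOME)
rng = np.random.default_rng(2)
base = rng.normal(size=100)
cars, health, income = (base + rng.normal(scale=0.6, size=100) for _ in range(3))
rom = cars + 0.5 * health + 0.3 * income + rng.normal(size=100)

individual = [ss_regression(x.reshape(-1, 1), rom) for x in (cars, health, income)]
joint = ss_regression(np.column_stack([cars, health, income]), rom)
print(sum(individual), joint)   # the sum of the three exceeds the joint value here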
Because all are measures of baby bigness. First.146 Simple regression REGRESSION AND CORRELATION Simple regression Simple regression Cars Health Multiple regression Income FIGURE 14–2 Proportion of variance from simple regression of Cars. one day it may be. we introduced CARS into the equation ﬁrst. and as we have taken pains to point out already. so that if.180 . adding another variable will account only for some portion of the variance that it would take up on its own. as shown in Figure 14–2. and INCOME are 3405. is only SS (Total) – SS (Err) = (4756 – 595) = 4161. the implication is that. and length. In Figure 14–3 we have added some numbers to the circles. Each variable occuMultiple regression Cars + Health + Income pies a proportion of the total area roughly proportional to its corresponding Sum of Squares (Regression).935 (14–7) 830 508 72 212 183 176 PARTIAL F-TESTS AND CORRELATIONS Partial F-Tests 595 FIGURE 14–3 Proportion of variance from multiple regression with partial sums of squares. 3This Cars + Health + Income is not from personal experience. once one variable is in the equation. as a result of putting in all three variables. this equals the sum of all the individuals areas [2180 + 830 + 212 + 183 + 508 + 72 + 176] = 4161. 1622. and Income. (Alternatively. Health. and multiple regression. But Figure 14–3 shows that the overall Sum of Squares (Regression). however.3 We are not. Note. and legroom is not an issue in the driver’s seat of a Rolls. chest circumference. Partial sums of squares indicated We can now begin identifying the unique contributions of each variable and devising a test of statistical signiﬁcance for each coefficient. they are not synonymous. incorporating HEALTH and INCOME adds only the small new moon-shaped crescents to the prediction. But once any of them is in the regression equation. From our present perspective. the individual circles overlap considerably. but then real income may lead to a Rolls. imagine predicting an infant’s weight from three measurements—head circumference. This begins to show quantitatively exactly why the Sum of Squares (Regression) for the combination of the three variables equals something considerably less than the sum of the three individual sums of squares. HEALTH. and 643. for example. respectively. with just these three variables. chances are that any one is pretty predictive of baby weight. The test of signiﬁcance is based on the unique contribution of each vari- . As you can see. consider each variable alone and express the proportion of the variance as a proportion of the total area. addition of a second and third measurement is unlikely to improve things that much. thing. only correlation. although if this book sells well. concerned about causation. We already know that the Sum of Squares (Regression) for CARS. As a possibly clearer example. the new multiple correlation. what happens when we put them all together as in the lower picture.) For thoroughness. We can also demonstrate this truth graphically. is: SS reg SS reg SS err 4161 4161 595 R 2. in any case.

Its numerator is exactly the same as for the partial correlation. divided by the number of df. the more useful variables we stick into the equation (with some rare exceptions). The partial correlation between a and z is deﬁned as: r az. we want the effect of b removed from a. the more variables in the equation. Further.05 where (variables in) means all the variables that are in the model.66 4.b 1 r az r2 ab r ab r zb 1 r2 zb (14–9) The cryptic subscript.0001 <. all of which is desirable. On the other hand (and there’s always another hand). Now we devise a test for the signiﬁcance. for HEALTH. On the one hand. bS AND βS As you may have noticed. This is also desirable. Partial and Semipartial Correlations Another way to determine the unique contribution of a speciﬁc variable. with the contribution of b removed from both of the other variables. because the contribution of any one variable is not usually independent of the contributions of others. which we already determined using Equation 14–8.b r az 1 r ab r zb r2 zb (14–10) The partial F-test is the test of the signiﬁcance of an individual variable’s contribution after all other variables are in the equation. the numerator df is always equal to 1. now equal to (19 – 3) = 16. we have not actually talked about the b coefficients. So. a partial F-test. Because we have only one coefficient. we have been dealing with everything up to now by turning them into sums of squares.18 58. the signiﬁcance of each individual b coefficient is tested with a form of the t-test. In the last chapter. tion.354 .MULTIPLE REGRESSION 147 able after all other variables are in the equation. As any of the Sums of Squares within the “regression” circles is actually variance that will be accounted for by one or another of the predictor variables. the unique variance is 2180.05 . What we require is an estimate of the true error variance. The numerator of this test is fairly obvious: the relevant Sum of Squares. creating a table like Table 14–4. is to look at the partial and semipartial (or part) correlations. and for INCOME.18 37. Let’s say we have a dependent variable.63 13. the denominator for all of the partial Ftests is 595 ÷ 16 = 37. for the contribution of CARS.18 37. raz. but the overall R is “whoppingly” signiﬁcant. the smaller the unique contribution of a particular variable (and the smaller its t-test).135 . So.0001 <. This statistic tells us the correlation between a and z. We’ll return to some of these pragmatic issues in a later section. a and b. One disadvantage is that we have lost some information in the process. where we were dealing with simple regression.0001 <.17 <. in this case equal to 595. the part correla- Variable b SE(b) t p TABLE 14–4 Coefficients and standard errors Cars Health Income . we need what is called the semipartial correlation.037 . We did discuss the basic idea in Chapter 13. the best guess at the Error Sum of Squares is the SS (Err) after all variables are in the equation. For this. it’s 508.65 3.0001 <. z.106 . In turn. but we don’t want it removed from the dependent variable. We could. end up with the paradoxical situation that none of the predictors makes a signiﬁcant unique contribution.0287 . Partial and semipartial correlations highlight one of the conceptual problems in multiple regression. but the denominator doesn’t have the SD of the partialled scores for the dependent variable: sr az. 
Its formula is as follows: SS reg (variable out) Partial F = SS reg (variables in) MS res (variables in) (14–8) Variable Numerator MS Denominator MS F p TABLE 14–3 Partial F tests for each variable Cars Health Income 2180 508 176 37. partialling out the effects of b.155 7. so we can draw pretty pictures showing what is going on. it is 176. means the correlation between a and z. The advantage of this strategy is that all the sums of squares add and subtract. for our purposes.b.0176 .245 . in fact. and R2. and the tests for each variable are in Table 14–3. which is also known by the alias. In particular.73 <. after the contributions of the other variables have been taken in account. the Mean Square is then divided by the residual df. each contribution called. and (variable out) means all of the variables in the model except the one being evaluated.0170 . and two predictor variables.18. the better the SSreg. the t-test is simply the square root of the associated partial F value.70 2. which is where we began. The denominator of the test is a bit more subtle. for fairly obvious reasons. R. However.

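Equations 14–8 through 14–10 can be wrapped up in a few lines of code. The sketch below is ours (the function names are made up), and the semipartial follows the definition in the prose: the other predictor is removed from the predictor of interest, not from the dependent variable. The last line reproduces the partial F for CARS from Table 14–3.

import math

def partial_f(ss_reg_full, ss_reg_reduced, ss_res_full, df_res_full):
    """Partial F (Equation 14-8): the variable's unique SS over the full-model MS(res)."""
    return (ss_reg_full - ss_reg_reduced) / (ss_res_full / df_res_full)

def partial_r(r_az, r_ab, r_zb):
    """Partial correlation of predictor a with DV z, with b removed from both (Eq. 14-9)."""
    return (r_az - r_ab * r_zb) / math.sqrt((1 - r_ab**2) * (1 - r_zb**2))

def semipartial_r(r_az, r_ab, r_zb):
    """Semipartial (part) correlation: b removed from predictor a only, not from z."""
    return (r_az - r_ab * r_zb) / math.sqrt(1 - r_ab**2)

# CARS in the book's example: the full model has SS(reg) = 4161 and SS(res) = 595 on 16 df;
# dropping CARS removes its unique 2180, so the reduced SS(reg) is 4161 - 2180.
print(partial_f(4161, 4161 - 2180, 595, 16))   # about 58.6, as in Table 14-3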
a change of 50 cm results in an increase in weight of 1. what are called structure coefficients (rs). But. also has some utility independent of the statistical test.e. the prediction equation from the CHICC variables could be used as a screening test to estimate the possibility of acquiring Beemer Knee. Train A is approaching from the east at 100 mph. but its shared predictive power was taken up by another IV. The b coefficients can also be interpreted directly as the amount of change in Y resulting from a change of one unit in X. then that variable may be useful in predicting the DV. The idea is this: although the b coefficients are useful for constructing the regression equation.25 kg. Because the bs are affected by the scale used to measure the variable (e. If we go back to the beginning. Previously. relatively rare) where the structure correlation is very low. a simple relationship is found between b and β. and then found that the b coefficient was . it would mean that a change in height of 1 cm results in an average change of weight of .. Another consequence of the fact that βs are partial weights is that their magnitude is related to two factors: the strength of the relationship between a variable and the dependent variable (this is good). for any given variable. So. how do we examine the correlation between ˆ a variable and Y? One way is to run the regression and save the predicted values. Actually. and β with standardized data. it’s out of the running. and differ only in magnitude. In words. 2001). in that it is related to the DV. change cost from dollars to cents and you’ll decrease the b by 100).148 REGRESSION AND CORRELATION The coefficient. Adding or dropping any of the predictors is going to change the size and possibly the signiﬁcance level of the weights. As a result. so it is clearly not something to do with samples and populations.025 kg. they reﬂect the contribution of the variable after controlling for the effect of all of the other variables in the equation. Indeed. in combination with the other variables in the equation. So. and Train B is coming from the west traveling at 80 mph. the β and rss will have the same rank ordering. they are devilishly difficult to interpret relative to each.g. This time. which do we look at—the β weights or the rss? The answer is “Yes.000 times larger than if weight is measured in kilograms and height in centimeters. then it may be a suppressor variable (Courville and Thompson. Going back to our babies. its effect after controlling for the other two) would be relatively small. it is called a standardized regression coefficient. chest circumference. which looks like this: =b x y (14–11) 4Why is it that most problems like this involve crashes or other disasters? Doe it say something about the people who were most inﬂuential during our formative years? used to be called “English” in less pretentious days.025. In other words.. Scaling this up a bit. But—and there’s always a “but”—we have to remember one thing: the βs are partial weights. the b coefﬁcient is 10. it may seem as if we should look at the relative magnitudes of the βs to see which predictors are more important. But here again. b. but rs is high. if weight is measured in grams and height in meters. and length. we can put the prediction equation together by using these estimated coefficients. “Beta?” you ask. but “unimportant” in a predictive sense because.4 If some kids are great at math but can’t read the back of a cereal box. 
then language skills5 will suppress the relationship between other predictors and scores on a math exam. β is standardized by the ratio of the SDs of x and y. if β is high but rs2 is low. When they are correlated (and they usually are). there may be problems. For example. This means that we also have to look at the correlations between each independent variable (IV) and the predicted value of the ˆ DV (i. even though it may be a good predictor in its own right. if we did a regression analysis to predict the weight of a baby in kilograms from her height in centimeters. If β is very low. but the equation as a whole has a high multiple correlation with weight. We use b with the raw data. it doesn’t matter. We might actually use the equation for prediction instead of publication. How long will it take before they crash into each other?”). the unique contribution of each (i. it doesn’t add much. at ﬁrst glance.” If the predictors are uncorrelated with one another. the question is which do we use in trying to interpret the relative importance of the variables in predicting the dependent variable (DV)? Let’s brieﬂy recap some of the properties of b and β. or 25 g. the magnitude of beta bears no resemblance to the corresponding b value. 150 miles apart. 5What Relative Importance of Variables Now that we’ve introduced you to bs and βs. “Since when did we go from samples to populations?” Drat—an exception to the rule. we used the example of predicting an infant’s weight from head circumference. and then get the correlations between them and all of the predictor vari- . Next in the printout comes a column labeled β. Y). even though everything else stayed the same. Because the three predictor variables are highly correlated with one another. There are times (thankfully. On the other hand. its β weight may be quite small and possibly nonsigniﬁcant.e. we can now directly compare the magnitude of the different βs to get some sense of which variables are contributing more or less to the regression equation. it’s possible that none of the predictors is statistically signiﬁcant. we may have a test of math skills that requires a student to read the problems (you remember the type: “Imagine that there are two trains. we have to look at both. That is. but the β weight is high. the variable may be “important” at an explanatory level.. For example. and the mix of other variables in the equation (this isn’t so good). For our above example. The result is that. This occurs if there is a “suppressor” variable that contaminates the relationships between the other predictor variables and the DV. So by converting all the variables to standard scores (which is what Equation 14–9 does).

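Both kinds of coefficient are cheap to compute once the regression has been run. The sketch below (ours, with invented data) gets the βs by rescaling the bs with the ratio of standard deviations (Equation 14–11) and gets the structure coefficients by dividing each zero-order correlation by the multiple R (Equation 14–12), which amounts to correlating each predictor with the saved predicted values.

import numpy as np

def beta_weights(X, y, b):
    """Standardized coefficients (Equation 14-11): beta = b * SD(x) / SD(y)."""
    return b * X.std(axis=0, ddof=1) / y.std(ddof=1)

def structure_coefficients(X, y, y_hat):
    """Structure coefficients (Equation 14-12): r(Y, Xi) divided by the multiple R."""
    R = np.corrcoef(y, y_hat)[0, 1]
    r_yx = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])
    return r_yx / R

# Hypothetical fit: coefs[1:] are the raw slopes, y_hat the predicted values
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, 1.0, 0.5]) + rng.normal(size=50)
design = np.column_stack([np.ones(50), X])
coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ coefs
print(beta_weights(X, y, coefs[1:]))
print(structure_coefficients(X, y, y_hat))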
41 175. but the coefficients don’t tell us much. and who was in Group 0. The bs that we get are shown in the ﬁrst line of Table 14–5.81 151. we’ll get the results shown in the second line of Table 14–6. structure coefficients play important roles in interpreting other statistics based on the general linear model (which we’ll discuss later). 1997) argue that you shouldn’t use them at all because they don’t add any new information.73 141. we’d be examining a middle-aged ailment for people who have just been born! To see how centering can help us interpret the data better. 2 = Female) and Group (1 = C11. half of whom use the diet and half of whom do not.023).e. Now.) After months of being or not being on the diet. so it’s easier to see the relationship between multiple regression and other techniques.20 167. and help alleviate some of the other problems we may encounter. But.06) and women (127. If it’s too small to pique your interest. because they are the differences in weight compared with groups that don’t exist. is the value of Beemer Knee we’d expect when all of the IVs have a value of zero. Y).. we know what it means to have no cars and no income—everyone with kids in university has experienced that. or any other coding scheme. and 138.25) is the difference between the mean weights of men (175. not any of the other IVs. This section is not a digression into “new age” psychobabble. and C11 people as –1⁄2 and controls as +1⁄2. First. Suddenly.. When the computer has ﬁnished its work. Another way. comparing 50 men and 50 women.MULTIPLE REGRESSION 149 ables. Remember that the formula we ended up with is: ˆ Y = b0 = b1 CARS + b2 HEALTH + b3 INCOME (14–13) so on for all of the other b coefficients in the equation. we can use the equation to predict the average weight gain or loss for a new individual. Rather. why go to the bother to calculate rss rather than just use r? In fact. but as we’ll explain in the next chapter. Let’s try to interpret these. where the DV is weight.06) is exactly the difference between the mean weights of the two programs (165. b0. because we’re dividing a number less than 1. Second. that may actually be easier. what does it mean to have no health? That’s a meaningless concept. For dichotomous variables.g. The intercept (151. Centering has other effects. and Men Women Average 187. Others argue that rs is more informative because it tells us the relationship between a predictor variable and the ˆ predicted value of Y (i. rs is “inﬂated” relative to r. though. and R is the multiple correlation. the numbers become meaningful. which is really what we’re interested in. half of whom were on the C11 diet The intercept. didn’t it? The bs for Program and Gender are equally uninformative. The bottom line. Now that really told us a lot.47 162. Going by what we said before. and the IVs are Gender (1 = Male. That means that structure coefficients are not affected by multicollinearity. or 0 and 1.81). GETTING CENTERED A caveat. b1 is the effect of Cars on Beemer Knee when the other variables are set equal to zero.47 for the Controls. because each rs is simply the zeroorder correlation between the IV and the DV divided by a constant (R).41 for the C11 group). So. if we code men as –1⁄2 and women as +1⁄2. is to calculate them directly: rs = rYXi / R (14–12) where rYXi is the correlation between the predictor Xi and the DV.e. we would take our raw data. Pedhazur. the rank order of the rss is the same as that of the rs. the way 99. and hit the Run button. as we see in Table 14–5. 
we weigh these 100 people and get the results shown in Table 14–5. So.43 138. Because Captain Casper Casimir’s diet looks so promising. Only if it is of sufficient magnitude should you look at the βs and rss (or rs) and decide which variables are important in helping to predict the DV. we’ll have a table of bs and βs as we described in an earlier section. too. the intercept is the weight of a person whose gender is 0. Now.0.06 127. notice that only the IV that we’re interested in is the numerator. (Yes. and that for Gender (–47. But. there was a signiﬁcant effect of Gender (p = . following the recommendations of Kraemer and Blasey (2004). The b for Program (26. we’ll do a bigger study.34% of people do it). So. It would be even worse if we threw in Age as another predictor. Also. let’s run a multiple regression. we know that this would normally be analyzed with a 2 × 2 factorial ANOVA. the ﬁrst step is to look at R2 and see if it’s worthwhile going any further. there’s enough nonsense written about that already without having us add our two cents’ worth.. let’s use centering. stop right there. use weights of +1⁄2 and –1⁄2 instead of 1 and 2. is that whether you use r or rs to ﬁgure out what’s important. 2 = Control). enter them into the computer. some people (e. regression and ANOVA are really different ways of doing the same thing. yes. but neither Program nor the interaction Controls On Diet Average TABLE 14–5 Weights of 50 men and 50 women. If we were to run a regression without thinking about what we’re doing (i. Similarly.39 114. let’s go back to the example of the C11 diet in Chapter 10. Now. With the raw data. This formula also tells us two important things about the rss. it describes a technique that may make the interpretation of the results of multiple regression easier.44) is the mean of the whole group.44 .

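Centering itself takes only a line or two of code. Here is a small sketch (ours, with a hypothetical data frame) in the spirit of the Kraemer and Blasey recommendations: dichotomies are recoded to −½ and +½, and interval variables have their mean subtracted (the median would be used for ordinal variables).

import pandas as pd

def center(df, binary_cols=(), interval_cols=()):
    """Center predictors before a multiple regression."""
    out = df.copy()
    for c in binary_cols:
        codes = sorted(out[c].unique())              # e.g., 1 = male, 2 = female
        out[c] = out[c].map({codes[0]: -0.5, codes[1]: +0.5})
    for c in interval_cols:
        out[c] = out[c] - out[c].mean()              # mean-center interval variables
    return out

# Hypothetical C11-style data: Gender (1/2), Program (1/2), and Age
df = pd.DataFrame({"gender": [1, 2, 1, 2], "program": [1, 1, 2, 2],
                   "age": [42, 55, 38, 61]})
centered = center(df, binary_cols=["gender", "program"], interval_cols=["age"])
# Any interaction term would then be built as a product of the centered columns:
centered["prog_x_gender"] = centered["program"] * centered["gender"]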
In fact. which we won’t elaborate on. Unfortunately. She might be able to get income data from the Internal Revenue department. So if she had her druthers. leaving a Sum of Squares (Error) of 1103 with 17 df. then HEALTH. CARS. the change in R2 that results from introducing the new variable. such as a credit card agency or a charity. 3405 ÷ 4756 = . So what do we do if the variable is ordinal or interval? Kraemer and Blasey (2004) recommend using the median for ordinal data. more precisely. ment of Motor Vehicles without much hassle about consent and ethics. Data about health. Hierarchical stepwise regression introduces variables. This perfectly reasonable strategy of deciding on logical or logistic grounds a priori about the order of entry is called hierarchical stepwise regression. then INCOME. Another advantage. This score results from the computer’s decision that.25 1. they argue that. the best thing it can do is remove a variable that was previously entered. she would introduce the variables into the equation one at a time. This could get messy. unless there are compelling reasons to the contrary. and the F-test for the addition of this variable is (248 ÷ 1) ÷ (1103 ÷ 17) = 3. Statistical signiﬁcance inevitably comes down to some F-test expressing the ratio of the additional variance explained by the new variable to the residual error variance. is that when multicollinearity (high multiple correlations among the predictor variables.44 23. this adds only 248 to the Sum of Squares (Regression). though. in an order assigned in advance by the researcher. and (2) clinically important. Now the multiple R2 is 3653 ÷ 4756 = . Clinical importance can be captured in the new multiple correlation coefficient. Now we add INCOME.43 1. This is conventionally called the F-to-enter because it is associated with entering the variable in the equation. nor does it change the signiﬁcance level. We have rearranged things slightly in Figure 14–4. either singly or in clusters. but she might have to fake being something legitimate. The idea is perfectly sensible—you enter the variables one at a time to see how much you are gaining with each variable.0002). and the F-test of signiﬁcance is the Mean Square (Regression) ÷ Mean Square (Error) = (3405 ÷ 1) ÷ (1351 ÷ 18) = 45. and the results are exactly the same as the simple regression of CARS on ROM.36.822. STEPWISE REGRESSION One additional wrinkle on multiple regression made possible by cheap computation is called stepwise regression.150 REGRESSION AND CORRELATION TABLE 14–6 Results of a multiple regression of the C11 study. Now we can see what happens every step of the way. We’ll get to that later. would be really hard to get without questionnaires or phone surveys.39 –47. the way she deﬁned it. with 18 df. Gender becomes even more signiﬁcant (p < . it can be easily abused. which we’ll discuss in just a bit) is present. between Program and Gender were signiﬁcant.92 26. either in combination with the other variables or alone. It has an obvious role to play if some or all of the variables are expensive or difficult to get. and the mean for interval (or ratio) variables. so our physiotherapist has good reason to see if she can reduce the cost of data acquisition. All this stuff can be easily extracted from Figure 14–3. we should always center the data. Physiotherapy research is notoriously underfunded. Thus economy is favored by reducing the number of variables to the point that little additional prediction is gained by bringing in additional variables. 
She reasons as follows: 1.06 –49. starting with CARS. or. This latter criterion (signiﬁcant simple correlation) is a useful starting point for stepwise regression because the more variables the computer has to choose from. like all good things. we have one independent variable. Hierarchical Stepwise Regression To elaborate. Once we center the data. let’s return to the CHICC example. centering doesn’t affect the b for the interaction between the variables. Because it requires some thought on the part of the researcher.6 2. it is rarely used. for a total of 3653. Information on the make of cars owned by a patient can likely be obtained from the Depart- . As you can see from Table 14–6. R2. centering reduces its ill effects. the more possibility of chewing up df and creating unreprodicible results.001). with 1 df. Because all the independent variables are interrelated. The multiple R2 is just the proportion of the Sum of Squares explained. What we want to discover in pursuing this course is whether the introduction of an additional variable in the equation is (1) statistically signiﬁcant. at the next step. and the Sum of Squares (Error) is 1351. In Step 1. The alternative is the F-toremove.43 6There is no ethical behavior on the road. which occurs in stepwise regression (discussed later).716 as before.768.44 151. with 2 df. This indicates how much additional variance was accounted for by the addition of the new variable. 3. which will also help to center our lives. The Sum of Squares (Regression) is 3405. and the effect of Program is signiﬁcant (p = . We have already discovered that Cuisine and Clothes are not signiﬁcantly related to ROM. without (line 1) and with (line 2) centering Uncentered Centered Intercept Program Gender Interaction 186.

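Each step's F-to-enter is just the gain in SS(Regression) divided by the residual Mean Square at that step. A minimal sketch (ours; the function name is made up) reproduces the numbers for the CARS, then INCOME, then HEALTH hierarchy shown in Figure 14–4.

def f_to_enter(ss_reg_new, ss_reg_old, ss_total, n, k_new):
    """F-to-enter for one added variable: the gain in SS(Regression) over the
    residual Mean Square at the new step, with df = 1 and n - k_new - 1."""
    gain = ss_reg_new - ss_reg_old
    ms_res = (ss_total - ss_reg_new) / (n - k_new - 1)
    return gain / ms_res

# The book's hierarchy (SS_total = 4756, n = 20)
print(f_to_enter(3405, 0, 4756, 20, 1))      # step 1, CARS: about 45.4
print(f_to_enter(3653, 3405, 4756, 20, 2))   # step 2, INCOME: about 3.8, not significant
print(f_to_enter(4161, 3653, 4756, 20, 3))   # step 3, HEALTH: about 13.7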
and Health.716 . This illustrates both the strength and limitations of the stepwise technique.107 45. The additional Sum of Squares resulting from INCOME was 176 instead of 248 because both CARS and HEALTH were in the equation. The numerator carries variance in addition to that already explained by previous variables.0001 ns <. determined as we did before. Addition of INCOME accounted for only another 5% of the variance. The mathematics are the same as used in hierarchical regression described above.18. these problems are ampliﬁed when we turn to the next method.9 to 37.405 (b) 508 (a) 3.66. the denominator (the Mean Square [Residual]) was reduced further from 1103 ÷ 17 = 64. Here. additional Sum of Squares from present step. The usual criterion is simply the largest value of the F-to-enter. and C. The Sum of Squares (Error) is further reduced to 595. How can this be? Recognize that both the numerator and denominator of the F-test are contingent on what has gone before. This adds 508 to the Sum of Squares (Regression) to bring it to 4161. . It can happen (with all the interactions and interrelationships among the variables) that. all three variables were in the equation. once a whole bunch of variables are in the model. However. we have yet one more wrinkle. usually an F-toenter that does not achieve signiﬁcance.82 13.405 (c) 1. “Out of steam” is also based on a statistical criterion. entering additional variables with gay abandon. we throw in HEALTH. this partial sum of squares is a little larger. The computer approaches this by determining not only what would happen if any of the variables not in the equation were entered. it is possible to examine the independent effect of each variable and use the method to eliminate variables that are adding little to the overall prediction. but also what would happen if any of the variables presently in the equation were removed. it is the partial with just the preceding variables in previous steps in the equation. A subtle but important difference exists between this partial sum of squares and the partial sum of squares for INCOME. When we examined the partial F-tests in Table 14–3. and the F-test is (508 ÷ 1) ÷ (595 ÷ 16) = 13. The sums of squares correspond to A. therein also lies a weakness.MULTIPLE REGRESSION a A Step 1 Cars 151 C c Step 3 Cars + Income + Health B b Step 2 Cars + Income (b) 3.768 . Finally. The net effect was that the partial F-test for introduction of INCOME was just signiﬁcant in the previous analysis.653 (c) 595 FIGURE 14–4 Proportion of variance from stepwise regression of Cars.66 <.875 — .875. except that. As we shall see.351 (b) 248 (a) 3. and Health Cars Income Health . Although this is not too bad (most researchers would likely be interested in variables that account for 2 to 3% of the variance). where we have also calculated the change in R2 resulting from adding each variable. the computer calculates the best next step for all the variables that are not yet in the equation. regression from the previous step. The multiple R2 is now 4161 ÷ 4756 = . until ultimately the beast runs out of steam. and the denominator carries variance that is not explained by all the variables in the equation to this time. the researcher begins by turning over all responsibility for the logical relationship among variables to the machine. Variables are selected by the machine in the order of their power to explain additional variance. 
the partials are always with all the other independent variables in the equation.36 3.0001 ns = not signiﬁcant. because the contribution of each variable can be considered only in combination with the particular set of other variables in the analysis. All of this is summarized in Table 14–7. and if this F-to-remove is the Variable Multiple R 2 Change in R 2 F p TABLE 14–7 Stepwise regression analysis of ROM against Cars. Ordinary Stepwise Regression In this method. residual Sum of Squares. with 16 df. By considering the combination of variables. The process carries on its merry way. Unfortunately. with 3 df. Income.052 .103 (c) 1. so it equalled only 176. Income. then selects the next variable to enter based on a statistical criterion. Of course. In ordinary multiple regression. B. consequently. The calculation just creates another F-ratio. which we encountered previously. this time around it is not signiﬁcant. at the end of every step. the best way to gain ground is to throw out a variable that went into the equation at an earlier stage but has now become redundant.

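To see what the machinery is doing when it chooses for itself, here is a stripped-down forward-selection loop (ours, and deliberately simplified: a real stepwise routine also checks the F-to-remove at every step, which is omitted here for brevity). It picks the candidate with the largest F-to-enter and stops when that F is no longer significant.

import numpy as np
from scipy import stats

def ss_reg(X, y):
    """SS(Regression) for an intercept-plus-X fit."""
    design = np.column_stack([np.ones(len(y)), X])
    fitted = design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.sum((fitted - y.mean()) ** 2)

def forward_select(X, y, alpha=0.05):
    """Greedy forward selection on the F-to-enter criterion."""
    ss_tot = np.sum((y - y.mean()) ** 2)
    chosen, remaining = [], list(range(X.shape[1]))
    while remaining:
        current = ss_reg(X[:, chosen], y) if chosen else 0.0
        best_j, best_f = None, -1.0
        for j in remaining:
            gain = ss_reg(X[:, chosen + [j]], y) - current
            df_res = len(y) - len(chosen) - 2
            f = gain / ((ss_tot - current - gain) / df_res)
            if f > best_f:
                best_j, best_f = j, f
        if stats.f.sf(best_f, 1, len(y) - len(chosen) - 2) > alpha:
            break                      # the beast has run out of steam
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen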
including the number of letters in the subject’s mother’s maiden name. using a statistical criterion for entry of variables. 1987.. This reduction in the magnitude of R2 on replication is called shrinkage. the program runs one regression equation for every variable that’s not included. Another way to think of shrinkage is that it’s a closer approximation of the population value of R2 than is value derived from any given sample. the probability would be . Scailfa and Games. ADJUSTED R2. Now. 1988. You should also note that this equation produces less of a discrepancy between R2 and Adj R2 than most of the other equations. inevitably. we should be “charged” one df for each predictor variable in the equation. but in fact none of the 20 are actually associated with the dependent variable (in the population). R2 increases. and it doesn’t matter from a theoretical perspective which variables are in the equation and which are out. we’d like to know by how much the value of R2 will shrink. this would be a grave mistake. AND SHRINKAGE We’ve already introduced you to the fact that R2 reflects the proportion of variance in the DV accounted for by the IVs. you’ll most likely see another term. So the computer grinds merrily away.e. we’ve lost complete control over the alpha level..05)20 = 1 – 0. because hundreds of signiﬁcance tests may have been run. so the problem is probably worse than what the computer tells you. if we take the values of the variables from the second sample and plug them into the equation we derived from the ﬁrst sample. It is always lower than R2. it will base the results on df = 3. One problem is that there are a whole slew of formulae for Adj R2. So what’s the matter with letting the machinery do the work for you? Is it just a matter of Protestant Work Ethic? Unfortunately not. buried somewhere among all the “signiﬁcant” variables are some that are present only because of a Type I error.152 REGRESSION AND CORRELATION largest. best able to predict the DV with the fewest IVs). the adjusted R2 tells us some useful information that we should pay attention to. but we pay a price for every variable we enter into the mix. should therefore be regarded primarily as an exploratory strategy to investigate possible relationships to be veriﬁed on a second set of data. However. They are unable to differentiate between two types of variance—true variability between people.8 This is what the Adjusted R2 tells us. Along the way to selecting which variable should enter into the equation at each step. so you may want to ignore it. Probably the most commonly used one is: Adj R2 = R2 – p (1 – R2) N–p–1 (14–15) 7We’re using this term metaphorically. The reason is that statistical formulae. needless to say. bad as the news may be. Other than that relatively limited situation. What is the chance of observing at least 1 signiﬁcant F-to-enter at the . the true variance will be more or less the same as for the original sample. dfRegression should be 20 (the 3 in the equation plus the 17 that didn’t get in on this step but were evaluated). At the center of the problem is the stuff of statistics: random variation. If your aim is to ﬁnd a set of predictors that are optimal (i. even if only slightly.” where p is the number of variables in the equation and N is the sample size. the computer ups and lies to us. So. “Is stepwise unwise?” is most likely “Yes. the next step in the process may well be to throw something out. Stepwise regression procedures. because it’s based on the wrong df. 
it’s usually a sign that your hard disk is about to crash. we may be tempted to throw in every variable in sight. as several authors have pointed out (e. Thus. But. Strictly speaking. remember that the answer to the subtitle to Leigh’s (1988) article. go ahead and step away. to compound the problem even further. . and screens 18 variables in the process. very few researchers do it this way. are dumb sorts of animals. because your results won’t look as good. we’ve plotted the Adj R2 for various values of R2 and p. Leigh. or Adj R2. it is: 1 – (1 – . drawing the same types of individuals and using the exact same variables. is there no use for stepwise regression outside of data dredging? In our (not so) humble opinion. If you do hear your computer grinding away. if you look at the output of a computer program.642 (14–14) R2. called the Adjusted R2. and also runs the equation taking out each of the variables that’s already in (to see if removing a variable helps).358 = . Then.05 level? As we have done before in several other contexts. But. without having to go through the hassle of rerunning the whole blinking study. there is less shrinkage as R2 increases and as N is larger. Actually. after looking at the results. every time we add a variable to a regression equation.87 that we would ﬁnd something signiﬁcant somewhere. The problem is that if we were to perfectly replicate the study. Naturally.g. As you can see. So when we begin with a large number of variables and ask the computer to seek out the most significant predictor variables. Wilkinson. If the computer is looking for the third variable to add to the mix. In Figure 14–5. the situation is worse than this (bad as it may already seem). 1979). the results won’t be nearly as good.7 and ﬁnds a set of bs and βs that will maximize the amount of variance explained by the regression line. 8Although maybe. and error variance owing to the fact that every variable is always measured with some degree of error. and the MSRegression will be overly optimistic. So. just one. but the error variance will be different for each person. you don’t really want to know. With rare exceptions. So. and the computers that execute them. However. assuming a sample size of 100. If we had 40 variables. Imagine we have 20 variables that we are anxious to stuff into a regression equation.

How do we incorporate this interaction in the model? Nothing could be simpler—we create a new variable by multiplying the stress and support variables. 1981). in the beginning.8 R 2 = 0. and visits. choose either the upper or lower category. we don’t have to imagine one. usually it’s selected on logical or theoretical grounds. So. in the presence of more stressful events. because it will have the lowest standard error. 2 = $175. We have already described the glories of systematic use of interactions in Chapter 9. This indicates that the later variables are probably contributing more error than they’re accounting for. we’d run into problems if we wanted to add a variable that captures that other aspect of “Yuppiness”—type of dwelling. if a person lives in her own home. 12There are other possible coding schemes (see Hardy.0 0 5 10 15 Number of Predictors R 2 = 0. naturally) and OTHER.000 to $199. Multiple regression makes the assumption that the dependent variable is normally distributed. . If we coded Own Home = 1. but it doesn’t make any such assumption about the predictors. R2 will continue to increase slowly.6 0. it’s good practice to select the category with the largest number of subjects. and Donald. the b and β wouldn’t tell us how much Beemer Knee would increase as we moved from one type of dwelling to another.999.8 Adj R2 0. The rule is that if the original variable has k categories.4 0. so we subtract $63. we would have a nominal variable. it really doesn’t make much difference which one is chosen. 10Actually.000 and up11). she would be coded 0 for both the CONDO and the OTHER dummy variables.999. However. an interaction exists between stress. One of the categories is selected as the reference. So. in the presence of less stress. 1. In this case. more social supports will reduce the number of visits.12 9In contrast to most parents. and Other = 3. Condo with Doorman = 2. after a number of variables are in.” 3. the equation is: Visits b0 b1STRESS SOCSUP) b2SOCSUP b3(STRESS (14–16) Finally. then it would make sense to have the paupers (Level 1) as the reference. dummy coding refers to the way we deal with predictor variables that are measured on a nominal or ordinal scale. This means that treating DWELLING as if it were a continuous variable would lead to bizarre and misleading results. and someone who lives with his mother would be coded 0 for CONDO and 1 for OTHER. DUMMY CODING DO NOT SKIP THIS SECTION JUST BECAUSE YOU’RE SMART! We are not casting aspersions on the intelligence of people who code data. 1978). If you have an ordinal variable. As an example. A predominant view is that the effect of stress is related to the accrual of several stressful events.6 R 2 = 0. Then. Following Hardy (1993). 11Minus the yearly cost of maintaining their car. If you still have a choice.0 0. such as divorce.4 20 R 2 = 0.2 0. INTERACTIONS One simple addition to the armamentarium of the regressive (oops. and the logic rubs off here as well. A condo-dwelling person would be coded 1 for CONDO and 0 for OTHER. In short. and 3 = $200. If we chose OWN HOME as the reference category. Ware. but Adj R2 will actually decrease. and see to what degree greater income leads to more Beemer Knee. don’t use a residual one such as “Other. the coding scheme would look like Table 14–8.2 FIGURE 14–5 How adjusted R2 is affected by R2 and the number of predictor variables. nobody. 2. if we had coded income into three levels (for Beemer owners. social supports. social supports are unrelated to visits. 
where we would do one analysis with only the main effects and then a second analysis with the interaction term also. Imagine a study where we measured the number of stressful events and also the number of social relationships available.000 from each. we did it (McFarlane et al. you’ll see that. that would be 1 = $150. 1993). because there are three categories. because we’d get completely different numbers with a different coding scheme. to see whether the interaction added signiﬁcant prediction.9 or a mortgage (Holmes. who gets left out in the cold? Actually. against which the other categories are compared. Mathematically. there are several decades of research into the relationship between life stress and health. both R2 and Adj R2 increase with each variable entered (although. Then. but this is the easiest. The solution is breaking the variable down into a number of dummy variables. the model postulates that social supports can buffer or protect the individual from the vagaries of stress (Williams.000 to $174. a child leaving home. In turn. Use a well-deﬁned category. Adj R2 won’t increase as much). psychologists view this event as stressful. we will make k – 1 dummies. we can change the coding scheme and not lose or gain any information.10 The theory is really saying that. That is.MULTIPLE REGRESSION 153 If you run a hierarchical or stepwise regression. we would likely test the theory using hierarchical regression. opt for the one with the largest sample size. For example. 1983). regression) analyst is the incorporation of interaction terms in the regression equation. then the two dummy variables would be CONDO (with Doorman. the three guidelines for selecting the reference level are: 1. we will create two dummy variables. If there is no reason to choose one category over another (as is the case with DWELLING). and now want to examine the relationship to doctor visits. as expected.
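Creating the dummies and the interaction term is mostly bookkeeping, and software will do it for you. Here is a small pandas sketch (ours, with invented data) that builds k − 1 dummy variables with "own home" as the reference category and forms the stress × support product term of Equation 14–16.

import pandas as pd

# Hypothetical data: dwelling type plus the stress/social-support example
df = pd.DataFrame({
    "dwelling": ["own home", "condo", "other", "own home", "condo"],
    "stress":   [3, 7, 5, 2, 8],
    "socsup":   [6, 2, 4, 7, 1],
})

# k - 1 dummies; "own home" is the reference category (coded 0 on both dummies)
dummies = pd.get_dummies(df["dwelling"], prefix="dwell").drop(columns="dwell_own home")
df = pd.concat([df, dummies.astype(int)], axis=1)

# The interaction term is simply the product of the two predictors (Equation 14-16)
df["stress_x_socsup"] = df["stress"] * df["socsup"]
print(df)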

as in the right side of the ﬁgure. You’ll remember that our original equation looking at Beemer Knee was: where the funny hat over the Y means that the equation is estimating the value of Y for each person. The points seem to be fairly evenly distributed above and below the line along its entire length. Y and Y are the same for every person. which is estimated value. which occurs as often as sightings of polka-dotted unicorns). R and R2 = 1. that is. choosing a different category to be the reference.e. Y – Y. the lower the value of R. WHAT’S LEFT OVER: LOOKING AT RESIDUALS Let’s go back to the beginning for a moment. and then above it again—a situation called heteroscedasticity (i. with some variance. as we’ve done in the graph on the right side of Figure 14–6. which is his or her actual ˆ value. One question remains: How do we know if CONDO and OTHER are signiﬁcantly different from each other? There are two ways to ﬁnd out. It’s a pain in the rear end if we had to calculate the variance at each value of the predictors. The ﬁrst method is rerunning the analysis. different from the reference category. then fall below the line. Take a look at the left side of Figure 14–6. The difference ˆ between the two. The second method is to get out our calculators and use some of the output from the computer program. unless that equation results in perfect prediction (i. The data seem to fall above the regression line at low values of the predictor. If we plot them against the predicted values. and the regression line. Because the equation should overestimate Y as often as it underestimates it. the points should appear to be randomly placed above and below a value of 0. Now take a look at the left side of Figure 14–7.154 REGRESSION AND CORRELATION ˆ Y = b0 + b1CARS = b2HEALTH + b3INCOME + b4CLOTHES + b5CUISINE Dummy variable Dwelling Condo Other TABLE 14–8 (14–19) Own home Condo Other 0 1 0 0 0 1 Dummy coding for dwelling variable Now our regression equation is: ˆ Y = b0 + b1CARS + b2HEALTH + b3INCOME + b4CLOTHES + b5CUISINE + b6CONDO + b7OTHER (14–17) where b6 tells us the increase or decrease in Beemer Knee for people who live in condos as compared with those who own a home. That means that each person will have two numbers representing the degree of Beemer Knee: Y. Another way of deﬁning R ˆ is that it’s the correlation between Y and Y. But. we can calculate a t-test: t 2 b6 b6 2 b7 b7 2 cov (b 6b 7) (14–18) where cov (b6 b7) means the covariance between the two bs. and b7 does the same for the OTHER category as compared with OWN HOME. When R ˆ = 1. 0 -1 Predicted Value . the more the values will deviate from one another. so we fall back on the granddaddy of all tests. and no pattern should be apparent. the estimate will be off for each individual to some degree. the calibrated eyeball. not homoscedastic). One of the assumptions of multiple regression is homoscedasticity. namely.e. they look like a double-jointed snake with 1 Variable 2 Residual Variable 1 FIGURE 14–6 Scatter plot and plot of residuals when the data are homoscedastic. which shows a scatter plot for one predictor and one dependent variable. the residuals should sum to zero. The computer output will tell us whether b6 and b7 are signiﬁcant. and the larger the residuals will be.. we can go a step further and look to see how the residuals are distributed. is called the residual. which means that the variance is the same at all points along the regression line. and Y.. the variances and covariances of the regression coefficients. 
With these in hand, we can calculate the t-test in Equation 14–18 and see whether the CONDO and OTHER coefficients are significantly different from each other.
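If you want to produce the residual plots of Figures 14–6 and 14–7 for your own data, here is a minimal sketch in Python (numpy, statsmodels, and matplotlib assumed — our choice of tools, not the book's). The data are generated to be heteroscedastic on purpose, so the plot should show the tell-tale fan shape rather than a random scatter between ± 2 SDs.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 150)
# the error variance grows with x, so these data are deliberately heteroscedastic
y = 3 + 2 * x + rng.normal(0, 0.5 + 0.4 * x, 150)

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid = fit.resid / fit.resid.std()          # standardized residuals

plt.scatter(fit.fittedvalues, resid, s=12)
plt.axhline(0, color="grey")
plt.axhline(2, color="grey", linestyle="--")   # rough +/- 2 SD warning lines
plt.axhline(-2, color="grey", linestyle="--")
plt.xlabel("Predicted value")
plt.ylabel("Standardized residual")
plt.show()   # a fan shape here is the calibrated eyeball's cue for heteroscedasticity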

and those that are far from the regression line but have small leverage scores. we’re going to tell you what sorts of aches and pains a regression equation may come down with.MULTIPLE REGRESSION 155 2 1 Variable 2 Residual 0 -1 -2 FIGURE 14–7 Scatter plot and plot of residuals when the data are heteroscedastic. how much ˆ its predicted value of Y (Y-hat. of a case is best thought of as the magnitude of its residual. Predicted Value Variable 1 scoliosis. This is only one of the patterns that shows up with heteroscedasticity. and inﬂuence14) and those involving the variables (mainly multicollinearity). leverage. The data for these outliers should be closely examined to see if there might have been a mistake made when recording or entering the data. with a mean of p/N (where p is the number of predictors).” Leverage Leverage has nothing to do with junk bonds or Wall Street shenanigans. to be confused with the Washington law ﬁrm of the same name. Case A doesn’t change the slope or intercept of the regression line at all. For example. and you may want to use one of the transformations outlined in Chapter 27 before going any further. or to trim it (which is a fancy way of saying “toss it out”). won’t have much inﬂuence. In fact.e. WHERE THINGS CAN GO WRONG By now. A case that has a high leverage score has the potential to affect the regression line. and how to run all sorts of diagnostic tests to ﬁnd out what the problem is. or distance. So. Case A is relatively far from the line (i. that is. all of the cases should have similar values of h.. 14Not Discrepancy The discrepancy. it would be quite unusual for a person to have a closet full of such clothes and not belong to health clubs. Case B has a high leverage score. Ideally. If there isn’t an error you can spot. It refers to how atypical the pattern of predictor scores is for a given case. but the two together may do so. it may not be unusual for a person to have a large number of clothes with designer labels. which are close to p/N. Note that the dependent variable is not considered.13 There are two main types of problems: those involving the cases (the speciﬁc disorders are discrepancy. Cases where h is greater than 2p/N should be looked on with suspicion. If you look at Figure 14–8. has no effect on the slope. but the pattern of predictors (the leverage) is similar to that of the other cases. only the independent variables. but. Case B. Case C has the largest inﬂuence score. and it can range in value from 0 to (N – 1)/N. primarily because it is near the mean of X. Inﬂuence Now let’s put things together: Inﬂuence = Distance × Leverage (14–20) 13The cure is always the same—take two t-tests and call me in the morning. whereas cases that are high on both distance and leverage may exert a lot of inﬂuence. “Can things go wrong?” but. nor would it This means that distance and leverage individually may not inﬂuence the coefficients of the regression equation. because it lies right on it (although it is quite distant from the other points). because it is relatively high on both indices. but it doesn’t have to. you have to know what’s wrong. As you can see from the . or Y) differs from the observed value. in our sample. but the discrepancy isn’t too great. with the dots further from the line as we move from left to right. rather. Sometimes the variance increases with larger values of the predictors. you’re experienced enough in statistics to know that the issue is never. before you can treat a problem. 
Cases that have a large leverage score but are close to the regression line. it’s discrepancy or distance is high). Leverage for case i is often abbreviated as hi. be unusual for a person to belong to no health clubs. any deviation from a random scattering of points between ± 2 SDs is a warning that the data are heteroscedastic. doctor. So. leading to the fan-shaped pattern. leverage relates to combinations of the predictor variables that are atypical for the group as a whole. In brief. you’ll see what we mean. and lowers the value of the intercept just a bit. We’ll postpone this decision until we deal with “inﬂuence. then you have a tough decision to make—to keep the case. When the residual scores are standardized (as they usually are). then any value over 3. “Where have things gone wrong?” However. we’ll see in a moment under what circumstances it will affect the line.0 shows a subject whose residual is larger than that of 99% of the cases.
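Here is a minimal sketch of how the distance, leverage, and influence diagnostics can be pulled out in Python with statsmodels (our choice of package; the data and the one badly behaved case are invented). The cut-offs in the last lines follow the rules of thumb discussed in this section — leverage well above its average, standardized residuals beyond 3, and Cook's distance over 1 — though the exact form of the leverage threshold is our simplification.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.normal(45, 5, 60)})
df["y"] = 10 + 1.5 * df["x"] + rng.normal(0, 3, 60)
df.loc[60] = [70.0, 40.0]     # invented case: atypical x (leverage) AND far off the line

fit = smf.ols("y ~ x", data=df).fit()
infl = fit.get_influence()

h = infl.hat_matrix_diag                   # leverage, h_i
dist = infl.resid_studentized_internal     # standardized residual (distance/discrepancy)
cooks = infl.cooks_distance[0]             # influence: roughly distance x leverage

flag = (h > 2 * h.mean()) | (np.abs(dist) > 3) | (cooks > 1.0)
print(df[flag].assign(leverage=h[flag], distance=dist[flag], cooks_d=cooks[flag]))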

All successive admissions to the pediatric gerontology unit (both of them) are there— in a data base. Problems arise. and stand back and watch the F-ratios ﬂy.. Here are a few: 1. average class performance..90 or higher. Most computer programs check for this by calculating the squared multiple correlation (SMC. CD measures how much the residuals of all of the other cases would change when a given case is deleted. subjects.00 shows that the subject’s scores are having an undue inﬂuence and it probably should be dropped. however. A value over 1. leverage (Case B). can be used as a diagnostic test for the presence of multicollinearity. because the zero-order correlations among the variables may be relatively low. This is much harder to spot. 4 C 2 B 0 0 2 4 X 6 8 Y (14–11) dashed regression line. VIF is an index of how much the variance of the coefficients are inﬂated because of multicollinearity. wax ecstatic about the theoretical reasons why a relationship might be so. as computers sprouted in every office. and kick out any that exceed some criterion (usually SMC ≥ 0. Multicollinearity refers to a more complex situation. which.01). usually without any previous good reason (i. The main effect of multicollinearity is to inﬂate the values of the standard errors of the βs.90 or tolerance ≤ 0. we’d ﬁnd that they were all correlated with each other. it’s unusual for the “independent variables” to be completely independent from one another. A first level of response by any reasonable researcher to all this wealth of data and paucity of information should be. Multiple regression is a natural for such nefarious tasks—all you need do is select a likely looking dependent variable (e.10). in which the multiple correlation between one variable and a set of others is in the range of 0. drives them away from statistical signiﬁcance. which is the reciprocal of tolerance. THE PRAGMATICS OF MULTIPLE REGRESSION One real problem with multiple regression is that. If we stick with our criterion that any R2 over 0. days to discharge from hospital. so did data bases. . Use the number you started with. The last step is to examine all the signiﬁcant coefﬁcients (usually about 1 in 20). which is always small. when we cross the threshold from the undefined state of “to some degree” to the equally undeﬁned state of “a lot. weight. which. are there— in a data base. some checks and balances must exist to aid the unsuspecting reader of such tripe. and body mass index (BMI). Another effect of multicollinearity is the seemingly bizarre situation that some of the β weights have values greater than 1. we don’t know either why this term was chosen.90 indicates trouble. All the laboratory requisitions and routine tests ordered on the last 280. This rule provides Multicollinearity Multiple regression is able to handle situations in which the predictors are correlated with each other to some degree. in most situations. Perhaps it refers to the tolerance of statisticians to use terms that are seemingly meaningless. but the multiple correlation—which is what we’re concerned about— could be high. pressures to publish or perish being what they are. then any VIF over 10 means the same thing. so now every damn fool with a lab coat has access to data bases galore. Score assigned to the personal interview for every applicant to the nursing school for the past 20 years. in fact.” We can often spot highly correlated variables just by looking at the correlation matrix.0. undergraduate GPA—almost anything that seems a bit important). 
For example. then press the button on the old “mult reg” machine. the 5% who came here and the 95% who went elsewhere or vanished altogether. Case C has an effect on (or “inﬂuences”) both the slope and the intercept. For variable i: VIF i 1 (1 2 Ri ) 6 With point C FIGURE 14–8 A depiction of cases with large values for discrepancy (Case A). 15No. but at an acceptable level.e. and inﬂuence (Case C). However. This is a good thing because. The better programs test the SMC or tolerance of each variable. the multiple correlation of height and weight on the BMI would be well over 0. or R2) between each variable and all of the others.156 8 Without point C REGRESSION AND CORRELATION A Yet another variant of the SMC that you’ll run across is the Variance Inﬂation Factor (VIF).g. and then inevitably recommend further research. We would follow Tabachnick and Fidell’s (2001) advice to override this default and use a more stringent criterion (SMC ≥ 0.000 admissions to the hospital are there—in a data base.90. it seems that few can resist the opportunity to analyze them. in turn. hypothesis).99. Some programs try to confuse us by converting the SMC into an index called tolerance. “Who cares?” But then. students) should be a minimum of 5 or 10 times the number of variables entered into the equation—not the number of variables that turn out to be signiﬁcant. if we measured a person’s height. or tolerance ≤ 0.” or CD. The number of data (patients. Inﬂuence is usually indicated in computer outputs as “Cook’s distance. Given the potential for abuse.15 which is deﬁned as (1 – R).
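A minimal sketch of the VIF-and-tolerance check in Python with statsmodels (our choice of package, not something the text requires). The height–weight–BMI example is simulated here, and it is exactly the sort of built-in redundancy among "independent" variables that sends the VIF through the roof even when the pairwise correlations look tame.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 100
height = rng.normal(170, 10, n)                      # cm
weight = 0.9 * height - 80 + rng.normal(0, 5, n)     # kg, correlated with height
bmi = weight / (height / 100) ** 2                   # nearly redundant with the other two

X = sm.add_constant(pd.DataFrame({"height": height, "weight": weight, "bmi": bmi}))
for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.1f}  tolerance = {1 / vif:.3f}")
# VIF above about 10 (tolerance below 0.10) is the usual sign of trouble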

F = 1). A questionnaire was administered. don’t believe us.R. EXERCISES 1. something signiﬁcant will result. a multiple R of about . Conversely. a. 3.001 SAMPLE SIZE CALCULATIONS For once.131 . Multiple R = . In a study of high school depression. a sample of 800 children were selected at random from city high schools. and so on. with each variable adding a little..0007 32. try an authoritative source—Kleinbaum. Multiple R = . say.22 1. So these are some ways to deal with the plethora of multiple regressions out there. or the reviewers of your grant.176 .06 ns ns ns ns = not signiﬁcant. Similarly. A researcher does a study to see if he can predict success in reﬂexology school (measured by the average number of skull bumps the student can detect on simulated plastic heads) using several admissions variables: age.01 ns ns ns = not signiﬁcant.83 4. and stepwise regression procedures should be viewed with considerable suspicion (unless they are hierarchical). look at the change in R2. Much less. and all the individual variables entered the regression equation. SUMMARY Multiple regression methods are the strategy of choice to deal with the common problem of predicting one dependent variable from several (or many) independent variables.5625 n=5 Variable b SE (b) t p Age GPA Gender . little of interest is found in the multiple regression.A. up to 5 or 6 variables and a total R2 of . variable 3. The best guarantee of this is simply that the number of data be considerably more than the number of variables.131 . and gender (M = 0. 20 to 30% of the variance. or the variable is of no consequence in the prediction. look at the patterns in the regression equation.15 R2 = . (e) socioeconomic status.129 1.017 2.32 1. including the categories (a) stress.044 .e.112 .75 R2 = .001).7). 4. nothing could be simpler.8. and Muller (1988). and it’s not saying much. (c) attitudes to parents.312 . but we place them here to provide some sense of perspective.6 to . so that variable 1 predicts.303 .012 . A gradual falloff should be seen in the prediction of each successive variable. One handy way to see if it is any good is to simply square the multiple correlation. 2. Kupper.0001 . He does a multiple regression analysis and determines the R2s and bs. when folks are doing these types of post-hoc regressions. Comment on the results shown in the several displays below. variable 2 an additional 10 to 15%. (d) social support from parents. 2. b.0225 n = 17 Variable b SE (b) t p Age GPA Gender .034 .003 . and (f) a standardized measure of depression. This should be at least a few percent. The multiple correlation was signiﬁcant (R2 = .75 2. 5 to 10% more. A regression analysis used the depression score as dependent variable. . to examine the contribution of an individual variable before you start inventing a new theory.561 . p < .01 .424 . No one could possibly work out ahead of time what a reasonable value for a particular regression coefficient might be.0225 n = 1233 Variable b SE (b) t p Age GPA Gender . Inevitably. it is about like number 2 above—not much happening here.MULTIPLE REGRESSION 157 some assurance that the estimates are stable and not simply capitalizing on chance.034 . Thus the “sample size calculation” is the essence of simplicity: Sample size = 5 (or 10) times the number of variables c. If you.176.28 . if things dribble on forever. statistically signiﬁcant or not. They reappear at the end of this section as C. Finally. 
Caution must be used in overinterpreting regression models based on relatively small samples.137 .003 . About all that can be hoped for is that the values that eventually emerge are reasonably stable and somewhere near the truth. Multiple R = . Any multiple regression worth its salt should account for about half the variance (i. If all the variance is soaked up by the ﬁrst variable.15 R2 = .383 .P.004 . Detectors. let alone its SE. (b) perceived comfort in social situations.97 . GPA.
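The point that a small multiple R² can reach statistical significance purely on the strength of sample size is easy to demonstrate. Here is a minimal sketch (Python with scipy — our choice of tools) that plugs the same multiple R of .15, with three predictors, into the usual overall F-test for R² at two different sample sizes; the formula itself is the standard one, not something special to this book.

from scipy import stats

def r2_f_test(r2, n, k):
    """Overall F-test for a multiple R-squared with k predictors and n subjects."""
    f = (r2 / k) / ((1 - r2) / (n - k - 1))
    p = stats.f.sf(f, k, n - k - 1)
    return f, p

# same multiple R of .15 (R-squared = .0225), three predictors, two sample sizes
for n in (17, 1233):
    f, p = r2_f_test(r2=0.0225, n=n, k=3)
    print(f"n = {n}: F = {f:.2f}, p = {p:.4f}")

With 17 subjects the result is nowhere near significance; with over a thousand it sails past p < .01, even though the prediction is just as feeble.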

Include family income as predictor D. also click on the down-arrow in the Method box and choose Stepwise. you simply choose different options to do the different types of analyses. select the variable(s) in the second block and click Next . until all the variables you want are selected. and click the arrow to move them into the box marked Independent ______ ______ ______ • Click OK To do a hierarchical regression. For a stepwise solution. click the Next button to the right of Block 1 of 1 after you’ve chosen the variable(s) you want to enter ﬁrst. Select only kids from private schools C. For a straight multiple regression: • From Analyze. and click the arrow to move it into the box marked Dependent ______ ______ ______ • Click on the predictor variables from the list on the left. Repeat study with kids who were depressed then had therapy ______ ______ ______ ______ ______ ______ • Click on the dependent variable from the list on the left. . and so on. choose Regression ¨ Linear A. Increase sample size to 1600 B.158 REGRESSION AND CORRELATION What effect would the following strategies have on the listed measures? R2 Signiﬁcance of R2 Beta How to Get the Computer to Do the Work for You All forms of regression are run from the same dialog box.

What combination of variables related to camel contact best predicts the likelihood of contracting the scourge? W e have had such a great time up to now collapsing some historical distinctions that we ﬁgure.CHAPTER THE FIFTEENTH Logistic regression is an extension of multiple regression methods for use where the dependent variable (DV) is dichotomous (e. which we will get to in later chapters. Although it’s an advanced nonparametric method.” which is contracted..1 For illustration. ratio) ones. once you get into it. First. probabilities don’t go in a straight line forever. One early sign is the development of a hump in the middle of the back 3 (not to be confused with widow’s hump). then the distribution of e is binomial rather than normal. The equation would then express the risk of coming down with SCF as a weighted sum of the four factors. from intimate contact with camels. The former use of nonparametric statistics. the assumption is made that the distribution of that error term is normal. the independent variables are usually continuous (but they don’t have to be). where he crouches on all fours awaiting his demise. humps in Asia. and (4) a buccal coliform count (BCC) from a mouth swab of the beasts (since it was thought that the disease is spread by bacteria residing in the camel’s mouth—leading to the horrible odor). let’s acknowledge that many of the major scourges of mankind never reach the temperate shores of Europe and North America. (3) family history of SCF (Fam). which is a dichotomous variable. That means that any signiﬁcance tests we run or conﬁdence intervals that we calculate if you’re rusty on multiple regression. ordinal) variables and continuous (interval. Logistic Regression SETTING THE SCENE A dreaded disease of North African countries is “Somaliland Camelbite Fever. dead or alive. It is used when the dependent variable is dichotomous. and would look like: z b0 b 1Year b2Herd b 3Fam b4 BCC e (15–1) 1So We can write a linear regression equation like this. which we’ll call z. The legs grow spindly. In linear regression. 2First brought to the attention of modern medicine in PDQ Statistics. Herd. they are bounded by zero and one. the breath grows more odoriferous. The method determines the predicted probability of the outcome based on the combination of predictor variables.g. if the dependent variable is dichotomous. dead/alive). the next step should be almost self-evident by now: Construct a regression equation to predict the probability of SCF. and eventually there are psychological manifestations as the hapless victim becomes progressively more bad tempered and seeks solitude in sunny corners of sand boxes. One fairly advanced nonparametric statistic is called logistic regression. (2) size of the herd (Herd). and BCC. Four potential variables were identiﬁed: (1) number of years spent herding camels (Years). You will notice that the boundaries are starting to smear. Second. Now if SCF were a continuous variable. from a linear combination of Years. as the name implies. One intrepid epidemiologist ventured forth to determine risk factors for the disease. But. One of the deadliest is Somaliland Camelbite Fever2 (SCF). which we did a while ago. “Why stop?” You may recall that at the beginning of this whole mess we made a big deal of the difference between categorical (nominal. you might want to have a fresh look at Chapter 14 before you proceed. but we shouldn’t. 3Two 159 . accompanied by water retention. 
there’s a problem with that little e hiding at the end of the equation. which results in an involvement of multiple systems. it looks an awful lot like ordinary multiple regression. Fam. and the latter use parametric statistics. for example.

so that further increases in the predictors don’t increase the probability any more. we remind ourselves about the situation we’re trying to model. Conversely. so we can go ahead and analyze it as yet another regression problem. As numbers increased. is called the logit function of y. its name comes from the fact that it was originally used to model the logistics of animal populations. this is called a logistic (sometimes referred to as a logit) transformation:4 y Pr(SCF| z) 1 1 e –z (15–2) hand side of the curve. If we had tried this estimate with multiple regression. we have found 30 herders.5 rent the Land Rover. so we now get: log (1 y) = –(b b1Years 0 b4BCC) b2Herd b3Fam y 5Normally research assistants do the data gathering. This function does some nice things. if you have a lot of risk factors. There are a couple of other things to note about Figure 15–1. though. so the negative sign goes (b 0 b 1Years b4BCC) b2Herd b 3Fam (15–5) Son of a gun! We have managed to recapture a linear equation. administering a questionnaire. but this rule is ignored when exotic travel and frequent ﬂyer points are involved. in this ﬁgure. to get around all these problems. But. the relationship disappears again. Until now. but ﬁrst we have to ﬁgure out how the computer computes all this stuff. now. it would have ended up as a straight-line relationship. so that if z is low. so probabilities would be negative on the left side. won’t be accurate.4 FIGURE 15–1 The logistic function. and goes to 1 when z is large and positive. We’ll talk later about the interpretation of this function. resulting in an increase in numbers. The goal is to estimate something about the various possible risk factors for dreaded SCF. the technical term is a logistic function. and the probability of contracting SCF should be low. less food was available to each member of the herd. This ability to capture a plausible relationship between risk and probability is not the only nice feature of the logistic function. So. After months of scouring the desert.0 REGRESSION AND CORRELATION 0. The data for the ﬁrst few are shown in Table 15–1. Then. We ﬂy off to the Sahara. beyond an upper threshold. for those so inclined. it’s best to realize that the job is far from done. there’s a linear relationship between them and the probability of an outcome. and so on. but we’ll save some of the other surprises until later. log (y/(1–y). we should invest some time describing how the computer . buying off each with a few handfuls of beads and bullets. The shape implies that there are two threshold values. the next bit of sleight of hand. What it is saying in the ﬁrst instance is that y is the probability of getting SCF for a given value of z. For the moment. once the predictors pass a certain lower threshold. Finally. Time to mess around a bit more. we’ll rearrange things to get the linear expression all by itself: (1 y y) e (b0 b1 Years b2 Herd b3 Fam b4 BCC) (15–3) Now. z can increase from –10 to about –3 without changing the probability. more food could be consumed. of course). All the predictors are set up in an ascending way.160 1. A complicated little ditty. where the probability of getting SCF should approach one. So.0 10 0 z 10 4On a historical note. you don’t have any risk factors. y is 1/(1 + e0) = 1/(1 + 1) = 0. Before we get to the output of the analysis. 0.2 0.8 Probability 0. And when z goes to –∞. for a given value of the regression equation. When z goes to inﬁnity (∞). When z = 0. As the population dropped. 
The pathologist does her bit and we have found that 18 herders had SCF and 12 didn’t. leading to increased starvation. First. The expression we just derived. the shape of the curve is not selfevidently a good thing. you’re off on the right-hand side. which it is—that’s the left- (15–4) and log (c/d) away: log y (1 y) log (d/c). by the way. that is. To put some meat on the bones. the relationship is more S-shaped. it describes a smooth curve that approaches zero for large negative values of z.6 0. There’s almost no relationship between the predictor variables and the probability of the outcome for low values of z. and go to inﬁnity on the right side—deﬁnitely not a good thing. which it does. Not too surprisingly.5. For a dichotomous outcome. since we have this linear sum of our original values (which is the good news) hopelessly entangled in the middle of a complicated expression (which is the very bad news). and swabbing their camels’ mouths. we call it multiple linear regression because we assume that the relationship between the DV and all the predictor variables is (you guessed it) linear. recruiting camel herders wherever we can ﬁnd them. it becomes 1/(1 + e∞) = 0. Here’s another deﬁnition to add to the long list. we transform things so that there’s an S-shaped outcome that ranges between zero and one. and wander from wadi to wadi and oasis to oasis. it becomes 1/(1 + e–∞) = 1. let’s conjure up some real data (real being a relative term. The way to get rid of an exponent is to take the logarithm. A graph illustrating this is shown in Figure 15–1.
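For the numerically inclined, here is a minimal sketch (Python with numpy — our choice) of the logistic function of Equation 15–2 and of the logit transformation that undoes it and hands back the linear part.

import numpy as np

def logistic(z):
    """Pr(SCF | z) = 1 / (1 + e^(-z)), as in Equation 15-2."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -3.0, 0.0, 3.0, 10.0])
y = logistic(z)
print(y)                      # flat near 0, passes through .5 at z = 0, flat near 1

# the logit, log(y / (1 - y)), recovers the linear sum z, as in Equation 15-5
print(np.log(y / (1 - y)))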

See note 6 in Chapter 13. based on Equation 15–2. What actually gets maximized is the ratio of the likelihood function for the particular set of observations conditional on all other likelihood functions for all other possible conﬁgurations of the data. then the probability of the overall function being correct associated with this person is just pi. The rest of the calculation is the essence of simplicity (yeah. and keep going until the probability is maximum (the Maximum Likelihood Estimation). compute the probability. All the computer now has to do is try a bunch of numbers. and the js include all the non-cases. can be used only when the underlying model is linear. Looking at things one case at a time. and the remaining 16 are not. What we computed above is one kind of MLE.8 However. That’s what happens conceptually. the probability of observing the data set. The standard method of computing regression by minimizing the residual Sum of Squares. 8What Unconditional versus Conditional Maximum Likelihood Estimation Unfortunately. really happens is that the process begins by differentiating the whole thing with respect to each parameter (calculus again) resulting in a set of k equations in k unknowns. and the universe is being predictable.e. That’s only one possibility. we could just stuff the data into the regression program and let the beast go on its merry way. There’s still another kind. That estimate of the probability that the poor soul will have SCF is then compared with the actual probability. and a constant. called Unconditional MLE. if pi is the computed likelihood of disease given the set of predictor variables for person i who has the disease. and goes ahead and maximizes it. and π (Greek letter pi) is a symbol meaning “product. It’s actually conceptually similar to the standard approach of minimizing the residual sum of squares in ordinary regression. or MLE. estimating the parameters and adjusting things so that the Sum of Squares (Residual) is at a minimum (see Chapter 14). called. the estimated probabilities for those poor souls who actually have SCF will be high. statisticians tell us that we cannot just crash ahead minimizing the error Sum of Squares. things aren’t quite that simple. this was at the cost of creating a dependent variable that is anything but linear. 6Which is why logistic regression was absent from many statistical packages until recently. this analysis is so hairy that even after using calculus. sure!).6 method called Maximum Likelihood Estimation. The conditional MLE compares the particular likelihood with all other possible conﬁgurations of the data. computationally intensive. 7Note that we threw in a little hat over the ps. So. and the probabilities for those who don’t will be low. The Likelihood Function considers the ﬁt between the estimated probabilities and the true state. If person j doesn’t have the disease. the other conﬁgurations are computed by keeping the predictors in a L i pi ˆ j (1 pj) ˆ (15–6) where the is include all the cases. then the appropriate probability of a correct call on the part of the computer program is (1 – pj). The whole point of logistic regression is to compute the estimated probability for each person. Instead we must use an alternative. rather than calculate it from the equation. which is of a particularly simple form: 0 (the person don’t got it) or 1 (the person does got it). if the logistic regression is doing a good job. having converted the whole thing to a linear equation. 
Although we have done our best to convert the present problem to a linear equation.”7 This is the Likelihood Function. except that we do the whole thing in terms of estimated probabilities and observed probabilities (which are either 0 or 1). Just as in ordinary regression. What in the name of heaven do we mean by “all other conﬁgurations”? Assume we reorder the data so the ﬁrst 13 are cases and the next 17 are controls. not surprisingly. Unfortunately. the set of coefficients for each of the predictor variables.LOGISTIC REGRESSION 161 does its thing. One might think that. others include the ﬁrst 12 are cases. they differ in that unconditional MLE takes the computed probability at face value. it’s still necessary for the computer to use iteration to get the solution. The residual sum of squares is the difference between the observed data and the ﬁtted values estimated from the regression equation. not the truth. called the method of least squares for (one hopes) fairly obvious reasons. calculus comes to the rescue to ﬁnd the maximum value of the function. the one after that is. ﬁddle them a bit to see if the probability increases. we multiply them together instead of adding them. So the overall probability of correct calls (i. it’s necessary to ﬁrst go back to basics. that’s still not the whole story. Now. And since the numbers of interest are probabilities.. and the dependent variable is approximately normally distributed. the next one isn’t. computed as an overall probability across all cases. given a particular value for each of the coefficients in the model) is just the product of all these ps and (1 – p)s: 1 1 multiplying the predictors. Basically. obtained by simply multiplying all the ps for the cases and (1 – p)s for the non-cases together to determine the probability that things could have turned out the way they did for a particular set of coefficients . but not actually. That is. Conditional MLE. this is the mathematical way to say that these are estimates. Subject Years of herding Number in herd Family history Buccal count SCF TABLE 15–1 Data from the ﬁrst 10 of 30 camel herders 1 2 3 4 5 6 7 8 9 10 3 5 25 14 2 16 28 19 13 33 22 3 344 28 77 34 66 100 87 45 No No Yes Yes No No Yes Yes Yes Yes 300 350 446 121 45 233 654 277 321 335 No No Yes No No Yes Yes No Yes Yes MAXIMUM LIKELIHOOD ESTIMATION PROCEDURES To understand what MLE procedures are all about.
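If you want to see the likelihood function of Equation 15–6 in action, here is a minimal sketch in Python (numpy assumed; the herders are simulated, not the ones in Table 15–1). It simply evaluates −2 log L for one candidate set of coefficients; the fitting routine's whole job is to hunt, by iteration, for the coefficients that make this number as small as possible.

import numpy as np

rng = np.random.default_rng(7)
n = 30
years = rng.integers(1, 35, n)
bcc = rng.integers(40, 700, n)
scf = rng.random(n) < 1 / (1 + np.exp(-(-4 + 0.08 * years + 0.004 * bcc)))

def neg2_log_likelihood(b, years, bcc, scf):
    """-2 log L for one candidate set of coefficients (b0, b_years, b_bcc)."""
    z = b[0] + b[1] * years + b[2] * bcc
    p = 1 / (1 + np.exp(-z))                  # estimated Pr(SCF) for each herder
    # Equation 15-6: product of p for the cases and (1 - p) for the non-cases,
    # done on the log scale so the product becomes a sum
    log_l = np.sum(np.log(np.where(scf, p, 1 - p)))
    return -2 * log_l

print(neg2_log_likelihood([-4, 0.08, 0.004], years, bcc, scf))   # a good guess
print(neg2_log_likelihood([0, 0, 0], years, bcc, scf))           # a poor one: larger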

in which we have computed the individual coefficients. in the present example. and the formula is: log [ (1 p 0(SCF) ] p 0)(SCF ) b0 (15–8) Now the ratio of p/(1 p) is the of SCF with Fam present or absent. a family history is a bad risk factor. but perhaps the “rule of 5”—5 cases per variable—is a safe bet. the b coefficient for Fam is 3. some does not.0128 .2) = 0.008 . the odds against Beetlebaum are 0. let’s stand back and let it do its thing.8) = 4. and it looks like Table 15–2. however. Imagine that the odds makers work out that the probability of Old Beetlebaum winning is 20%. When the number of variables is large compared with the number of cases.992 . the column to the right of SE.035 1. A recognizable table of b coefficients is one of the ﬁrst things we see. we can get to the interpretation of the last column: p 1/(1 p 0/(1 p1) p0) eb3 (15–10) So. except for the last column.786 × 1035! If you want to see how we got there. Now we’re working backward to that original equation. the whole thing is squared. meaning that a one-unit change in BCC increases the odds of contracting SCF by about 1%.0238 . the odds of getting SCF when you have a positive history is about 29 times what it would be otherwise.8/(1 – 0. in due course. As it turns out. and the appropriate tests of signiﬁcance. In the second form. and e3. the relative odds of getting SCF with a positive family history is 28. For continuous variables we can do the same calculation. ﬁxed order and reordering the 1s and 0s in the last column to consider all possible combinations. However. and. is familiar territory indeed.25. a few pages come spinning out of the printer.132 . the relative odds is 1. therefore. If we take the exponential of the b coefficient. Just like ordinary multiple regression. Family History and BCC are signiﬁcant at the . We press the button on the SCF data. In words: for discrete predictor variables.013.023 28. which one might expect to be the t-test of signiﬁcance for the parameter [t = b/SE(b).45 2. That is. it is just the coefficient divided by SE. this is familiar territory.162 REGRESSION AND CORRELATION TABLE 15–2 B coefficients from logistic regression analysis of SCF data Variable B SE Wald df Sig Exp(B) Family history Herd Years BCC Constant 3. As it turns out. but it requires a somewhat different interpretation. and it doesn’t look a bit like the b/SE(b) we expected. see Chapter 13] is now labelled the Wald test. the other significant predictor.0086 .” which leaves most folks no further ahead. All this would be of academic interest only. then the conditional approach is preferred. Remember.716 1. Focusing only on Fam now. or turning it around. For each one of these conﬁgurations the MLE probability is computed. naturally. Now the eb corresponds to the change in odds associated with a one-unit change in the variable of interest. and the unconditional MLE can give answers that are quite far off the mark. This is almost straightforward. . this table shows the coefficients for all the independent variables and their associated standard errors (SEs). it didn’t—it came from horse racing.79 .79.0061 4.459 0. and if you are good at diddling logs you can show that the log odds ratio is: log p 1/(1 p 0/(1 p1) p0) b3 (15–9) odds 10 10Although SAMPLE CALCULATION Now that we understand a bit about what the computer is doing. The odds ratio (OR.9 Needless to say. In the present printout. some folks might like to convince you that this came from epidemiology.547 . What on earth is Exp(B)? So glad you asked. 
and watch the electrons ﬂy.2/(1 – 0. Suppose the only variable that was predictive was Family History (Fam).36 1.48 5. and it is distributed approximately as a z test (so Table A in the Appendix is appropriate). Herd and Years are not.42 1. the regression coefficient is equal to the log odds ratio of the event for the predictor present and absent. the computation is very intensive. they chose to square it.36. How large is large? How high is up? It seems that no one knows.79. which has only two values: 1 (present) or 0 (absent). and then they’re all summed up. their SEs. Some of it looks familiar. look up the Fisher Exact Test.012 The short answer is that is stands for “exponential of B.010 . the number of possible combinations for our 30 herders is 1.0073 . always keeping the totals at 13 and 17. So looking at BCC. in this equation. sit back. Let’s deal with the familiar ﬁrst. that we began with something that had all the b coefficients in a linear equation inside an exponential.36 is 28. the probability of SCF given a family history looks like: log (1 p1(SCF) p 1)(SCF ) b0 b3 (15–7) 9For your interest (who’s still interested?).13 . The Wald statistic has two forms: in the ﬁrst form.044 1 1 1 1 . So this part of the analysis. If there is no family history then Fam = 0. Perhaps you should stay away from your parents. except that sometimes it really matters which one you pick. so it equals [b/SE(b)]2. The odds of him winning is 0. taken from SPSS. We have already pointed out that the logistic function expresses the probability of getting SCF given certain values of the predictor variables. So they say it’s 4 to 1 against Beetlebaum in the 7th. also called the relative odds) is the ratio of the odds.05 level.
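Here is a minimal sketch of the same kind of analysis scripted in Python with statsmodels (our choice of package — the book's own instructions use SPSS). The herder data below are simulated, so the coefficients will not reproduce Table 15–2, but the output columns map directly onto B, SE, the Wald-style z test, Sig, and Exp(B).

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 200
df = pd.DataFrame({
    "years": rng.integers(1, 35, n),     # years spent herding
    "herd":  rng.integers(2, 60, n),     # size of the herd
    "fam":   rng.integers(0, 2, n),      # family history of SCF (0/1)
    "bcc":   rng.integers(40, 700, n),   # buccal coliform count
})
true_logit = -5 + 3.0 * df["fam"] + 0.006 * df["bcc"]
df["scf"] = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

model = smf.logit("scf ~ years + herd + fam + bcc", data=df).fit()
print(model.summary())                      # B, SE, z (the Wald test), and Sig

exp_b = pd.DataFrame({"Exp(B)": np.exp(model.params),
                      "lower": np.exp(model.conf_int()[0]),
                      "upper": np.exp(model.conf_int()[1])})
print(exp_b)   # for fam, Exp(B) is the relative odds of SCF given a family history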

the comparable numbers are again an RR of 2. “Kush mir in tuchas” is a request to kiss my other beast of burden—a smaller one spelled with three letters. the equation ( ( a (a + b (c +c d = a (c + d) c (a + b) (15–13) . not 0 or 1. In Kushmir. 11If In Kushmir en Tuchas. is: 10 × 45 = 2. missing from the discussion so far is any overall test of ﬁt—the equivalent of the ANOVA of regression and the R2 we encountered in multiple regression. To do this. In our example.10 = 2.10. b. let’s spell out the equation for a relative risk. because to create the table. but the denominator is the number of times the event didn’t occur. To see why. it was from a B-type beast. the two are not the same. and everything from . c + d isn’t much different from d. it is not easily turned into a test of signiﬁcance.0. When the numbers in cells a and c are small. to summarize: • An RR and an OR of 1 means that nothing is going on. however. The overall agreement is 26/30 or 86. but rather in odds.20. which is the formula for the OR we saw in Equation 15–11. So. We could of course do kappas or phis on the thing (see Chapter 23)—but we won’t.50000 to 1 was set you can’t remember the difference. • When the prevalence of a disorder is low (under about 10%). After all. since the predicted value is a probability. Now tell us. and that ancient emirate.50 seems like a reasonable starting point. c. we began with a bunch of cases of SCF and non-cases. Many people interpret them as if they were relative risks. and a + b is just about equal to b. you’re twice as likely to get this debilitating disorder from the two humper as compared with the unihumper.10 Kushmir en Tuchas SCF Type of Camel Yes No Total p Bactrian Dromedary Total 40 20 60 10 30 40 50 50 100 . which is calculated as: Odds Ratio (OR) = ad bc (15–11) Somaliland SCF Type of Camel Yes No Total p TABLE 15–3 The relationship between type of camel and SCF in two regions Bactrian Dromedary Total 10 5 15 40 45 85 50 50 100 . 12For Relative Risk = Now. As the proportion of cases in those cells increases.67%. But the OR. The important point is that the OR does not describe changes in probabilities. every computed probability from 0 to . the numerator is the number of times an event happened. which they ain’t. At one level. that is. Let’s go through some examples to see the difference. We could just do the standard chi-squared on the 2 × 2 table. you’re twice as likely to get SCF from Bactrians as Dromedaries (that’s from the RR).11 Is it possible that two humps doubles the chances of SCF? In Table 15–3. and a D has one. the relative risk (RR) of SCF from Bactrians is 0. • As the prevalence increases. It’s: GOODNESS OF FIT AND OVERALL TESTS OF SIGNIFICANCE That was the easy part. when the prevalence of a disorder is very low (say. and the denominator is the number of times it could have happened. rotate the ﬁrst letters 90˚ to the left—a B has two bumps. the misinterpretation) of the OR.20/0. we must establish some cutoff above which we’ll call it a case—50% or . but the odds are 6:1 that if you got SCF. but with an OR of 6. the risk of SCF from Bactrian camels is 10/50 = 0. Since this is a categorical outcome variable. we haven’t differentiated between Bactrian camels (two humps) and Dromedaries (one hump).LOGISTIC REGRESSION 163 Interpreting Relative Risks and Odds Ratios The major problem with logistic regression is the interpretation (or rather. but this isn’t really a measure of how well we’re doing. we won’t be doing any ANOVAs on these data. under 10%).40 where a. 
Clearly. Hence. the RR and OR are nearly the same. the OR sets an upper bound for the RR.25 40 × 5 (15–12) simpliﬁes to ad/bc. While this table is a useful way to see how we’re doing overall. We could do the standard epidemiologic “shtick” and create a 2 × 2 table of observed versus predicted classiﬁcations.20 . the OR becomes larger than the RR. the outcome is equally probable in both groups. The numerator is the same in an OR.49999 was set equal to 0. and the risk from Dromedaries is 5/50 = 0.0.12 In Somaliland. goodness of ﬁt is just as easy to come by with logistic regression as with continuous data. In fact. in what other statistics book outside of Saudi Arabia can you learn about the ﬁner points of camel herding? those of you not ﬂuent in Yiddish. the RR and OR are almost identical. and d refer to the four cells. the OR exceeds the value of the RR.80 . With a probability. If we do that for the present data set. Kushmir en Tuchas. the contingency table would look like Table 15–4. as in Somaliland. we present the results from two different regions—Somaliland itself. the ﬁrst of which is a and the third of which is s. So.
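If you want to see those two tables worked through by machine, here is a minimal sketch in plain Python using the cell counts of Table 15–3. It reproduces the contrast drawn above: the same relative risk in both regions, but an odds ratio that balloons once the disorder stops being rare.

def risk_stats(a, b, c, d):
    """a, b = exposed (Bactrian) with/without SCF; c, d = unexposed (Dromedary)."""
    rr = (a / (a + b)) / (c / (c + d))      # relative risk, Equation 15-12/13
    or_ = (a * d) / (b * c)                 # odds ratio, Equation 15-11
    return rr, or_

print(risk_stats(10, 40, 5, 45))    # Somaliland (rare): RR = 2.0, OR = 2.25
print(risk_stats(40, 10, 20, 30))   # Kushmir (common): RR = 2.0, OR = 6.0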

so the critical chi-squared. For any number x less than one.86 for p < . we ﬁnd the goodness of ﬁt is 25.0). the critical chi-squared is 3. the better. the likelihood function. derived from Table F in the Appendix. That is. the higher the probability that the particular pattern could have occurred. then we’ll stop. they do a transformation on the likelihood: ˆ Goodness of Fit = –2 log L (15–14) model. since it is a direct estimate of the likelihood that the particular set of 1s and 0s could have arisen from the computed probabilities derived from the logistic function. just as you’ve gotten used to the idea that the bigger the test. Tracing down this logic.88. and further. and the worse the ﬁt. the better the ﬁt. that is). recompute the goodness of ﬁt. In the present example. the bigger the quantity (this time in the usual positive sense). which is larger but still signiﬁcant. they’re all we’ve got. Now.” which has featured prominently in the discussion. So.03. the more variables in the With df = 1. the computed Wald statistic yields a p of .89 18. provides a direct test of the goodness of ﬁt of the particular model. such as McFadden’s Pseudo-R2 and Nagelkerke’s R2. with all this talk about how well the equation does or doesn’t work. there are 4 df. so. Since the MLE is a probability (and. they don’t—that would be too sensible. if we just include Fam. Looking again at Table 15–2. That is. the log will be smaller (negative but closer to zero). It turns out that. and just as fruitlessly. we need a larger sample size. Herd. With 4 variables. For the discrepancy to disappear. Now. and none is regarded as an accurate representation of the amount of variance accounted for by the predictors. A consequence of this inversion is that. when we multiply the whole thing by “–2. and (–2 log L) will be smaller. compute the goodness of ﬁt. so we can then turn it around and use the MLE as a measure of goodness of ﬁt. can never exceed 1. since the degrees of freedom for the chi-squared is going up faster than is the chi-squared. it’s because of the magical word “approximately. so it’s entirely possible that additional predictor variables will improve the goodness of ﬁt (make it smaller). for an R2-like statistic for logistic regression.84. the smaller the goodness-of-ﬁt measure. statisticians turn things “bass-ackwards” on you. In the present example. and some of them are printed out by various computer programs. we expect that the ﬁt will improve. which we have already encountered. The p-value resulting from this stepwise analysis should be the same level of signiﬁcance as that for the B coefficient of BCC in the full model. But. the smaller the probability. We could ﬁt a partial model. There are a couple of tricks in the interpretation of this value. . creating a test of the last variable which is a chi-squared with one degree of freedom.10. the lower probability. is 14.79 (15–15) 13Kleinbaum (1994). Regrettably. One ﬁnal wrinkle. First. in the “To Read Further” section. Why the discrepancy? As near as we can ﬁgure. so there’s a signiﬁcant change by adding BCC. Just as the Crusaders searched high and low for the Holy Grail. There have been a number of contenders. The only reason to do this transformation is that some very clever statistician worked out that this value is approximately distributed as a chi-squared with degrees of freedom equal to the number of variables in the model. All of them have problems of one sort or another associated with them. like all multiple regression stuff. 
the partial test of BCC is: 2 log L (full model) ( 2 log L (partial model)) = 25. In fact. Of course. STEPWISE LOGISTIC REGRESSION AND THE PARTIAL TEST All of the above suggests a logical extension along the lines of the partial F-test in multiple regression. as we introduce more variables. this is achieved at the cost of an increase in the degrees of freedom. so the ML probability will be higher (closer to 1). the actual ﬁt may well be somewhat better than it looks from the table. the logarithm of x (log x) is negative. but we have it on good authority13 that no one seems to know how much larger it needs to be.” it all stands on its head. the worse the apparent ﬁt. also like multiple regression. the value of the goodness of ﬁt is 18. then add in another variable. the MLE is a probability—the probability that we could have obtained the particular set of observed data given the estimated probabilities computed from putting the optimal set of parameters into the equation for the logistic function (optimal in terms of maximizing the likelihood.10 = 7. its log will always carry a negative sign. therefore. So. although the programs could report this estimate. it’s highly signiﬁcant.164 REGRESSION AND CORRELATION Observed TABLE 15–4 Observed and predicted classiﬁcation Predicted Non-case Case Total Non-case Case TOTALS 9 3 12 1 17 18 10 20 30 equal to 1. why haven’t we mentioned R2? For one simple reason: it doesn’t exist for logistic regression. and subtract the two. the bigger the quantity (bigger negative that is).005. So. statisticians have looked just as diligently. you have to recall some arcane high school math. it isn’t quite. but it will be less signiﬁcant. So. Instead. and Years on the principle that it’s awfully difficult to persuade camels to “open wide” as we ram a swab down their throats.
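Here is a minimal sketch of that partial (likelihood-ratio) test done in Python with statsmodels (our choice of package). The data are simulated stand-ins for the SCF file, so the −2 log L values will not match the figures quoted above, but the mechanics are identical: fit the full and the partial models, take the difference in −2 log L, and refer it to a chi-squared with one degree of freedom.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(10)
n = 200
df = pd.DataFrame({"years": rng.integers(1, 35, n), "herd": rng.integers(2, 60, n),
                   "fam": rng.integers(0, 2, n), "bcc": rng.integers(40, 700, n)})
df["scf"] = (rng.random(n) <
             1 / (1 + np.exp(-(-5 + 3.0 * df["fam"] + 0.006 * df["bcc"])))).astype(int)

full    = smf.logit("scf ~ years + herd + fam + bcc", data=df).fit(disp=0)
partial = smf.logit("scf ~ years + herd + fam", data=df).fit(disp=0)   # drop BCC

# -2 log L for each model; the drop is a chi-squared with 1 df (one variable removed)
chi2 = (-2 * partial.llf) - (-2 * full.llf)
print(chi2, stats.chi2.sf(chi2, df=1))      # compare with the critical value of 3.84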

He administers a quality-of-life questionnaire to a group of patients who have had a cholecystectomy by (i) laparoscopic surgery or (ii) conventional surgery. and a number of additional covariates. Create a dummy variable for each of the pairs (so that. Tabachnick and Fidell state that there are problems when there are too few cases relative to the number of variables. and only one (Tabachnick and Fidell. And on and on—ad inﬁnitum.” so. and so on). For the following designs. A researcher wants to examine factors relating to divorce. 2. in this situation. estimating the ﬁtted probability associated with each observed value. “Neither does anyone else. and you’ll have to shell out for a new software package. honors. and (iii) 2 weeks postoperatively. what is the relative risk of death from the treatment? MORE COMPLEX DESIGNS Now that we have this basic approach under our belts. Tests of the individual parameters in the logistic equation proceed much like the tests for ordinary multiple regression. She also inquires about other variables—income. number of children. The questionnaire is administered (i) before surgery. and gives them a series of 10 written clinical scenarios. a. a researcher assembles a cohort of women with breast cancer and a control group without breast cancer and asks about the total use of antiperspirant deodorants (in stick-years). without the slightest evidence to back this up. A surgeon wants to determine whether laparoscopic surgery results in improvements in quality of life.5. they don’t say what’s meant by “too few cases. She tracks them down and determines who is still married and who is not. the use of an unconditional ML estimator is a really bad idea. ad nauseam. EXERCISES 1. and alcohol consumption in mL/week.LOGISTIC REGRESSION 165 Again. distinction). No big deal. 2001) even mentions the issue of sample size. 10 subjects per variable. It works by ﬁtting a logistic function to the 1s and 0s. wife’s parents married. e. when in doubt. for example. To determine correlates of clinical competence. activity level (in hours/week). we would obviously call it a “packyear” of exposure. in which each person in the treatment group is matched to another individual in the control group? Now we have a situation like a paired t-test. pair 2 has another. “Dunno”. She measures cholesterol and blood pressure. in which we are effectively computing a difference within pairs. by default. we recommend the old and familiar standby. 14If these were wolves. Just multiply Herd × Years to get an overall measure of exposure14 and stuff it into the equation. including drug/placebo as a dummy variable. What is the role of lifestyle and physiologic factors in heart disease? A researcher studies a group of postmyocardial infarction patients and a group of “normal” patients of similar age and sex. d. husbands’ parents married. She identiﬁes a cohort of couples who were married in 1990 through marriage records at the vital statistics office. To test this. Unfortunately. The test of signiﬁcance of the model is based on adjusting the parameters of the model to maximize the likelihood of the observed arising from the linear sum of the variables. SAMPLE SIZE Regarding the question of how many subjects are necessary to run a logistic regression. Want to include interactions? Fine. an educator assembles a group of practicing internists. the latest cancer scare is that antiperspirants “cause” breast cancer. Given that the cardiac death rate in the control group was 10%. 
The output from a logistic regression reported that Exp(b1) = 0. indicate the appropriate statistical test. The independent variables are (i) Year of graduation. we are led to believe that.” We looked in many textbooks about multivariate statistics and logistic regression. we can extend the method just as we did with multiple regression. the long answer is. Of course. The short answer is. (ii) Gender. b. A randomized trial of a new anticholesterol drug was analyzed with logistic regression (using cardiac death as an endpoint). (ii) 3 days postoperatively. and spend the next three paragraphs discussing the problems and possible solutions (such as collapsing categories). upon appealing to higher authorities than ourselves. The dependent variable is the proportion of the cases diagnosed correctly. c. we should proceed with the formal likelihood ratio test. Want to do a matched analysis. and (iii) Academic standing (pass. SUMMARY Logistic regression analysis is a powerful extension of multiple regression for use when the dependent variable is categorical (0 and 1). . pair 1 has a dummy variable associated with it. At the time of writing. not camels. there’s a short answer and a long one. the so-called MLE procedure. where the number of variables is going up by leaps and bounds.
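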

choose Enter
• Under Options…, check CI for exp(B)
• Click OK

you just can’t resist the temptation to analyze everything in sight with multiple regression. Advanced Topics in Regression and ANOVA SETTING THE SCENE You have been collecting data at your PMS (Pathetic Male Syndrome) Clinic for 15 years. and (3) treatment with testosterone injections appears to have some effect on the PQ. and it can be handled with a t-test or a one-way ANOVA. cubic. As Bob Dole reminds us. It may not be self-evident why one should bother to try to merge two goods things. Multiple regression tells us how to deal with straight-line relationships. 167 . Analysis of covariance (ANCOVA) combines continuous variables (covariates) and factors and is used for assessing treatment effects while controlling for baseline characteristics. how to put them together. To deal with the more systemic manifestations of masculine aging. thanks to the miracles of modern medical science. (Quadratic means a term squared. if males are given male sex hormones. quadratic.CHAPTER THE SIXTEENTH This chapter reviews several advanced analytical strategies that bring together ANOVA and regression. why else do elderly men buckle their pants somewhere around the nipple line or down around their knees? Another is the purchase of ﬂamboyant hats to cover the shrinking number of hair follicles. That is a comparison between two groups formed on the basis of a nominal variable. As far as the relationship with age goes. After all. is commonly referred to as “mid-life crisis” or “male menopause” in its acute phase. the number of times he says to his signiﬁcant other. We are now conﬁdent that you. there is some hope on the horizon. The presence of satellite dishes in the backyard to receive dirty movies is a warning signal as well. cubic is a term cubed. Dear. “Not tonight. we have to revert to more traditional therapy. Power series analysis is a regression including higher powers of the independent variable (e. we have left out one small detail—namely. Despite admonitions to the contrary. Certainly. wherein the PM (pathetic male) counts the number of wistful sighs. PMS. and quartic is to the fourth power. they seem to recover a bit. I have a backache. On the other hand. or PQ. But how does one actually measure PMS? A simple diary. it would seem that each is capable of handling a large class of complex problems. speciﬁcally testosterone injections. has only local effects down in the nether regions (albeit spectacular. As we indicated above. which to those of mathematical inclination might suggest a quadratic term. three things are evident: (1) Pathos Quotient (PQ) increases linearly with belly size. or quartic terms). ANOVA works on treatment groups. Pathos Quotient increases linearly with belly size—that’s a job for regression. One sign is a gradual movement upward or downward in the belt line—after all. Viagra.. After graphing the data.1 there is life after prostatectomy. but it has a more insidious onset than is implied by those terms. will be able to recognize this new epidemic. by all accounts).” the number of unused notches (guess which side of the buckle) on his belt.g. but how in the world will you deal with all this complexity? B y now we have given you the conceptual tools to master nearly every complexity of ANOVA and regression. which we’ll call the Pathos Quotient. But reﬂect a moment on a simple twist to the designs we have encountered thus far. makes a ratio variable (if not a rational one!).) But how can we put it all together? 1As part of his continuing efforts to commit political hara-kiri. 
The syndrome we investigate in this chapter. as a health professional. it sounds like a curve peaking at about 45 and falling off on both sides. and there are now pharmacologic means to keep more than our spirits up. and the total dollar sum of subscriptions to various lewd or semi-lewd male magazines. (2) middle-aged males have the highest PQ. however. the number of ounces of Greek Formula 18 consumed in a week. However. PMS is related to three other variables.

Having gotten this far, we might like to see the appearance of these elements on graphs. For your convenience, we have done just that: Figure 16–1 shows the PQ scores for 16 subjects in comparison to belt size, age, and treatment, based on the data of Table 16–1.

[FIGURE 16–1 Individual relationship between Pathos Quotient (PQ) and Treatment (A), Belt size (B), and Age (C); PQ (0–100) is plotted against Treatment (Testosterone/Other), Belt size (inches), and Age.]

TABLE 16–1 Data for 16 PMS patients

Subject   Age   Belt size   Treatment      PQ
1         24    46          Testosterone   12
2         26    36          Testosterone   14
3         27    40          Other          27
4         88    44          Testosterone   35
5         32    36          Testosterone   26
6         29    30          Other          21
7         70    42          Testosterone   48
8         75    35          Other          51
9         37    42          Other          62
10        65    50          Testosterone   64
11        72    45          Testosterone   60
12        55    53          Other          77
13        45    48          Other          91
14        41    38          Other          84
15        63    43          Other          55
16        40    58          Testosterone   74

It is evident from the graph that the data are pretty well linearly related to belt size. On the other hand, we do have this slightly bizarre relationship with age, indicating that the mid-life crisis is a phenomenon to be reckoned with; its effects seem to dwindle on into the 60s. Finally, looking at treatment, this is just a nominal variable with two levels, and the hormone group mean is a bit lower than the “Other” group.

It is anything but obvious how this should be analyzed. We could proceed to do a regression analysis on the data in the usual way (although we won't, yet). If we did, the ANOVA of the regression looks like Table 16–2, and the multiple R2 turns out to be .30, which is not all that great. If we wanted to determine if there was any evidence of an effect of treatment, we could simply compare the two means with a t-test; the t-value is 1.33, which is not significant.

TABLE 16–2 ANOVA of regression for PMS against belt size

Source       Sum of squares   df   Mean square   F      p
Regression   2893.5            1   2893.5        6.11   .027
Residual     6629.2           14    473.5
R2 = .303

For the sake of learning, we'll leave age out of the picture altogether for now and simply deal with the other two variables—Belt Size (a ratio variable) and Treatment with testosterone/other (a nominal variable).
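If you want to reproduce these first two passes at the data yourself, here is a minimal sketch in Python (statsmodels and scipy, not the book's own software) using the values printed in Table 16–1; expect small rounding differences from Table 16–2.

```python
# The two "separate" analyses: regression of PQ on belt size, and a t-test of PQ
# between the Testosterone and Other groups.  Data transcribed from Table 16-1;
# test = 1 for Testosterone, 0 for Other.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

pms = pd.DataFrame({
    "age":  [24, 26, 27, 88, 32, 29, 70, 75, 37, 65, 72, 55, 45, 41, 63, 40],
    "belt": [46, 36, 40, 44, 36, 30, 42, 35, 42, 50, 45, 53, 48, 38, 43, 58],
    "test": [1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1],
    "pq":   [12, 14, 27, 35, 26, 21, 48, 51, 62, 64, 60, 77, 91, 84, 55, 74],
})

# Regression of PQ on belt size alone (cf. Table 16-2)
belt_only = smf.ols("pq ~ belt", data=pms).fit()
print(belt_only.summary())          # R-squared should come out in the neighborhood of .30

# Unpaired t-test comparing mean PQ in the two treatment groups
t, p = stats.ttest_ind(pms.loc[pms["test"] == 1, "pq"],
                       pms.loc[pms["test"] == 0, "pq"])
print(f"t = {t:.2f}, p = {p:.3f}")  # not significant
```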

We then need some way to analyze the effect of the treatment variable. which results from other factors not in the study. Treatment. is reduced. We have also included the Grand Mean of all the PQs as a large 100 80 Pathos quotient 60 40 Testosterone 20 0 30 40 50 60 Other FIGURE 16–2 Relationship between PQ and Treatment and Belt size. you know by now that a ﬁrst approach is to graph the data. as well as the belt size. just as in a regression problem. or the error variance. and (2) the larger will be Sum of Squares (Belt) and Sum of Squares (Treatment) when compared with the Sum of Squares (Error). Perhaps we can be forgiven if we now stand things on their heads and do a regression to an ANCOVA problem. and error. then plot the data against belt size again.3 Suppose we forget for a moment that these are a mixture of variables and just plow ahead. so that we have essentially two distributions of PQs. Conversely. something we would naturally approach with an ANOVA. thereby increasing the sensitivity of the test. bobbitty. one for Testosterone and the other for Other. we recapture the picture of Figure 16–1A. in the last chapter we got used to the idea of ANOVAing a regression problem. Belt size (inches) black dot. we can see that the data are actually pretty tightly clustered around two lines. once again using creative acronymizing to obscure what was going on. which is simply an enlargement of Figure 16–2 around the middle of the picture. and we have thrown in a bunch of arrows (we’ll get to those in a minute). 3And . but how do they play out on the graph? Sum of Squares resulting from Treatment is related to the distance between the two parallel lines. but in fact they both come down to sums of squared differences when we look at the variance components. It might look a bit like this: PQ = b0 + b1 Treatment + b2 Belt 2Not bibbitty. if we imagine projecting all the data onto the Y-axis. we are able to reduce the scatter. from ANalysis of COVAriance. it’s not such a difficult problem. employing what is now a familiar refrain—parceling out the total Sum of Squares in PQ into components resulting from Belt. and the ubiquitous error term. You will note that we have made a big deal of putting together both nominal and interval-level data. (1) the closer the data will fall to the ﬁtted lines. And taking account of all of the variance from both sources. Conceptually. Figure 16–2 shows the updated graph. Put it together and what have you got? Analysis of covariance. so good. What we seem to need is a bit of ANOVA to handle the grouping factor and a dose of regression to deal with the continuous variable. boo—silly.ADVANCED TOPICS IN REGRESSION AND ANOVA 169 ANALYSIS OF COVARIANCE Again. Some of the variability visible in the data in Figure 16–1B was a result of the treatment variable. if you’ve learned your lessons well. After all. around the ﬁtted lines. We simply use different points for the two groups. The challenge is to ﬁgure out how to deal with both nominal and ratio independent variables. one for Testosterone and one for Other. to which no amount of mirrors will lend assistance. except that the distances are measured to one or the other line. The better the fit between the two independent variables (Treatment and Belt). To see how this comes about. both for analyzing the impact of belt size on PQ and also for determining if treatment has any effect. The effect of using both variables in the analysis is to reduce the error term in the corresponding test of signiﬁcance. 
so here we go.2 The time has now come to turn once again from words to pictures. which amounts to looking at the difference between two groups. the residual variance. Now we have a slightly different picture than before. we seem to be in the process of collapsing the distinction altogether between ANOVA and regression methods. Belt. This should result in a more powerful statistical test. Sum of Squares (Error) is the distance between the original data points and the corresponding fitted data point. Belt Size and Treatment. You may recall from Chapter 13. this is a conceptual headstand. stuffing them into a regression equation. we have the same situation as we had with multiple regression. and at least this time it really isn’t too hard to put three variables on two dimensions. the problem is dealt with by a method called ANCOVA. Historically. refer to Figure 16–3. If we look back at the relationship to Belt Size. As a result. or a t-test. showing once again that a picture is worth a few words. each of which is responsible for some of the variance in PQ. by determining two lines instead of one. We have two independent variables. So far. In fact. so this is a reasonable description of what might be the relationship to belt size. Viewed this way. so it expresses the treatment effect on PQ. The Sum of Squares resulting from Belt is the sum of all the squared vertical distances between the ﬁtted points and their corresponding group mean. which is the same thing. however. that the covariance was a product of X and Y differences that expressed the relationship between two interval-level variables. But we haven’t actually started analyzing it numerically yet. Three possible sources of variance are Treatment.
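To see that parceling in action, here is a small sketch (Python/statsmodels, not the book's own software) that fits the combined equation to the Table 16–1 data and splits the total Sum of Squares in PQ into a regression piece and an error piece, exactly as described above.

```python
# Fitting PQ = b0 + b1*Treatment + b2*Belt and partitioning the Sum of Squares.
# Same Table 16-1 data as in the earlier sketch (test = 1 for Testosterone).
import pandas as pd
import statsmodels.formula.api as smf

pms = pd.DataFrame({
    "belt": [46, 36, 40, 44, 36, 30, 42, 35, 42, 50, 45, 53, 48, 38, 43, 58],
    "test": [1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1],
    "pq":   [12, 14, 27, 35, 26, 21, 48, 51, 62, 64, 60, 77, 91, 84, 55, 74],
})

both = smf.ols("pq ~ test + belt", data=pms).fit()
fitted = both.fittedvalues
grand_mean = pms["pq"].mean()

ss_total = ((pms["pq"] - grand_mean) ** 2).sum()
ss_regression = ((fitted - grand_mean) ** 2).sum()  # fitted points vs the grand mean
ss_error = ((pms["pq"] - fitted) ** 2).sum()        # each point vs its own fitted line

print(ss_total, ss_regression, ss_error)            # ss_regression + ss_error equals ss_total
print(both.params)  # b1 ("test") is the vertical gap between the two parallel lines
```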

and the last with both in the equation. In the Testosterone group. in the Other group it is just b0. the Sum of Squares resulting from regression. Suffice to say that it. p < .170 REGRESSION AND CORRELATION 100 Grand mean Error Covariate Treatment 80 Pathos quotient 60 40 FIGURE 16–3 Relationship between PQ and Treatment and Belt size (expanded). and the ﬁtted values (again. The F-test for this variable is therefore 4135. with all the individual differences squared and summed. All that remains is to plow ahead just as with any other regression analysis and determine the value and statistical signiﬁcance of the bs. In the course of doing so we have actually done what we set out to do: determine the variance attributable to each independent variable. This way we can determine the effect of each variable above and beyond the effect of the other variables.4 ÷ 326. the additional Sum of Squares is (5274.56.8.70.65. which differ only in the intercept. is equal to: ∑[(b0 + b1 Treatment i + b2 Belt i ) PQ]2 (16–1) This takes the difference between the original data. this is just the difference between the ﬁtted point at each value of PQi (the whole equation in the parentheses) and the overall mean of PQ.4 with 1 df. although we have structured the problem as a regression problem for continuity and simplicity. The Sum of Squares (Residual) is equal to: ∑[PQi (b0 b1 Treatment i b2 Belt i )]2 (16–2) and for the treatment group it is: PQ b0 b1 1 b2 Belt [b0 b1] b2 Belt 4This is just creating dummy variables as we did in Chapter 14. and the residual term is 326. The ANOVAs for each of the models are in Table 16–3.5 – 1139. So this represents the squared differences between the original data and the ﬁtted points.1) = 4135. we must actually determine three regression equations: one with just Treatment in the equation. As it turns out. But what number do we use for Treatment? It’s a nominal variable. 20 0 0 36 38 40 42 44 46 48 Belt size (inches) That looks like a perfectly respectable regression equation. equivalent to a t of 3. so there is no particular relationship between any category and a corresponding number. b2. To test the signiﬁcance of each independent variable. what happens? Then the regression equation for the control group is: PQ = b0 b1 0 b2 Belt b0 b2 Belt Lest the algebra escape you.4 In this case.05. So b1 is just the vertical distance between the two lines in the graph (i. For Belt. 34. the intercept is (b0 + b1).. one with just Belt in the equation. for the full model. 36 … 54 inches (or the metric equivalent). is signiﬁcant. too. ANCOVA is just a special case of multiple regression. In other words.e. Well … suppose we try 0 for Other and 1 for Testosterone. all squared and added. the choice of 0 and 1 for the Treatment variable creates two regression lines with the same slope. Actually. So this is the sum of squares in PQ resulting from the combination of the independent variables.8 = 12. indicating sums of squares. We’ll let you work out the equivalent test for Treatment. the stuff in the parentheses). That is just what we want. We then proceed to determine the individual contributions. with a t of 2. When we put Belt into the equation it’s pretty clear what belt size to use—32. But we have only one little problem. PQi. the effect of treatment). if the analysis were actually run as an .
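The three-regressions recipe is easy to automate; a sketch along the same lines (Python/statsmodels assumed, Table 16–1 data again), where anova_lm carries out the extra Sum of Squares F-test for each variable above and beyond the other:

```python
# Testing each variable "above and beyond" the other: fit the reduced and the full
# models, then compare them with an extra-Sum-of-Squares F-test (cf. Tables 16-3 and 16-4).
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

pms = pd.DataFrame({
    "belt": [46, 36, 40, 44, 36, 30, 42, 35, 42, 50, 45, 53, 48, 38, 43, 58],
    "test": [1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1],
    "pq":   [12, 14, 27, 35, 26, 21, 48, 51, 62, 64, 60, 77, 91, 84, 55, 74],
})

treat_only = smf.ols("pq ~ test", data=pms).fit()
belt_only  = smf.ols("pq ~ belt", data=pms).fit()
full       = smf.ols("pq ~ test + belt", data=pms).fit()

# Does Belt add anything over Treatment alone?  (additional SS over the full model's residual MS)
print(sm.stats.anova_lm(treat_only, full))
# Does Treatment add anything over Belt alone?
print(sm.stats.anova_lm(belt_only, full))
```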

5 2 13 2637.2 1 14 2893.29 8. 6Our 7Muscle . one prof decides to try a different text this year—Bare Essentials. the situation is illustrated in Figure 16–4. it makes no more sense for an undergraduate student in health sciences heading for a clinical career to have to be able to do statistics than it does for an architect to be required to forge the I-beams in a building.11 . surprisingly few people are even aware of this potential gain in statistical power from using covariates.9 1.1 8383.189 Treatment and Belt size Regression Residual R2 = .07 <. As an example.8 8.5 4248. Let’s take the physics section of the Medical College Admissions Test (MCAT). FIGURE 16–4 Variance in PQ resulting from Belt size and Treatment.07 .005 ANCOVA for Adjusting for Baseline Differences Actually.7 6629.01 *Note that. independent of the other. replacing it with interviews and other touchyfeely stuff. however. Source Sum of squares df Mean square F p TABLE 16–3 ANOVAs of regressions for Testosterone/ other. and Both Treatment Regression Residual R2 = .4 2380.303 2893.553 5274. Of course. If you don’t believe it yet. A different picture now emerges. That’s not funny for him or us. It is clear that Bare Essentials delivered on the goods. But what can we do about it? Clearly we need some independent measure of quantitative skills. where adding variables decreases the power of the tests.120 Belt size Regression Residual R2 = . Note that a funny thing happened when both variables went in together.9 1 14 1139. the residual variance shrank.7 473. Figure 16–4 nicely illustrates one potential gain in using ANCOVA designs: the apportioning of variance resulting from covariates such as Belt can actually increase the power of the statistical test of the grouping factor(s). When each was tested individually.1 598.6 and (3) we know the dangers of historical controls and other nonrandomized designs.2 326. it can work the other way. More frequently. we get Figure 16–5.01 <. naturally.5 1 1 2 13 4135. Treatment was not signiﬁcant. Because each variable accounted for some of the variance.027 1139. He Source Sum of squares df Mean square F p TABLE 16–4 Summary ANCOVA table for Treatment and Belt size Covariate (Belt) Treatment Explained* Residual 4135. in contrast to factorial ANOVA designs.4 2380. look at Figure 16–4.8 2637.5 In an attempt to engage their humorous side. ANCOVA is used in designs such as cohort studies where intact control groups are used and the two groups differ on one or more variables that are potentially related to the outcome or dependent variable.5 4248.1% this year. In our view. So one explanation is that this class has a slightly higher incidence of cerebromyopathy7 than had the last. As with regression.90 . A little detective work reveals the fact that the admissions committee has also been messing around and dropped the GPA standard. and Belt was only marginally so. For those of you with a visual bent. consider the pitiless task of trying to drum some statistical concepts into the thick heads of a bunch of medical students. the contributions of each variable would be separately identiﬁed in the ANCOVA table (Table 16–4). (2) we already spent it recklessly on women.2 326. whereas it was 73. here the sums of squares don’t add up because there is an overlap in the explained variance.ADVANCED TOPICS IN REGRESSION AND ANOVA 171 ANCOVA program. for several reasons: (1) we’re tight-ﬁsted. heads. wives. 
The regression line for this year’s class is consistently higher than last 5Frankly our sympathies go out to any medical or other students who are reading this book to survive a statistics course.5% last year. Treatment (1139) Belt (2894) (4248) gives this class the same exam as he gave out last year and ﬁnds that the mean score on the exam is 66.01 <. Belt size. If we plot MCAT physics scores and ﬁnal grades for the two classes.8 12. so the test of signiﬁcance of both variables became highly signiﬁcant. this is true only insofar as the covariates account for some of the variance in the dependent variable.65 7. Do Norman and Streiner honor the money-back guarantee and forfeit their hard-earned cash? Not likely.5 6.8 5274.
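If you want to try the textbook-versus-cohort example yourself, here is a sketch with simulated data standing in for the two classes; every number in it is invented, so take the form of the analysis from it, not the results.

```python
# ANCOVA for a baseline difference: post-test statistics score as the outcome, cohort
# (1990 vs 1991) as the grouping factor, and MCAT physics as the covariate.  Simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1990)
n = 40
cohort = np.repeat(["1990", "1991"], n)
# The 1991 class starts with weaker MCAT physics scores (the "cerebromyopathy" cohort)
mcat = np.concatenate([rng.normal(48, 9, n), rng.normal(41, 9, n)])
# Post-test depends on baseline ability plus a (hypothetical) boost for the new textbook
post = 30 + 0.9 * mcat + np.where(cohort == "1991", 15, 0) + rng.normal(0, 8, 2 * n)
classes = pd.DataFrame({"cohort": cohort, "mcat": mcat, "post": post})

# Naive comparison of the raw post-test means (ignores the baseline difference)
print(classes.groupby("cohort")["post"].mean())

# ANCOVA: the C(cohort) coefficient is the treatment effect adjusted for MCAT
ancova = smf.ols("post ~ mcat + C(cohort)", data=classes).fit()
print(ancova.summary())
```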

diminishing returns are the order of the day. If you ever survived one of those courses in plane and solid geometry. the section “Dummy Coding” in Chapter 14 if you forget what a dummy variable is—dummy! more illuminating approach might have been to create a fourth group that gets both drugs. Note that graphically. If you return to Figure 16–1C. which is a long way from signiﬁcance and in the wrong direction anyway. ANCOVA can improve the sensitivity of the statistical test by removing variance attributable to baseline variables. this correction for baseline differences is fraught with bias. We’ll put some statistics into it (which is what we’re here for). Even when you have no reason to expect baseline differences. as we’ll talk about in the next chapter. they actually learn more from Bare Essentials.e. the effect is highly signiﬁcant (t(18) = 3. with one grouping factor and one covariate.. however. Further. You just continue to treat it as a regression problem. are courses that have proven absolutely essential to your subsequent success in life—for those of you who are physicists or pure mathematicians. practical limits. so that it peaked at about 45. First. both logistical and statistical. or optics. But what happened is that the admissions committee blew it (at least as far as stats mastery goes) by admitting a number of students with chronic cases of cerebromyopathy. with a new b for each covariate and a new dummy variable for each grouping factor. or any number of grouping factors.. and really shouldn’t be done without a lot of thought.2) 66. there will be (k – 1) dummy variables created. and we get to keep the dough. 2. We must. a whole new picture emerges. by now—just create another regression equation containing both linear (y = bx) and quadratic (y = cx2) terms. emphasize that this strategy is useful only to the extent that the covariates are related to the dependent variable and are unrelated to each other.9 The analysis proceeds to estimate a Sum of Squares. it is necessary to do three regression analyses: one con- .75— the difference between the two lines. by about 15%. but we then went on to greener pastures. to the extent that the estimate of the treatment effect changed direction. the estimated effect of 90/91 (i. as shown by the arrow. Actually.172 REGRESSION AND CORRELATION Class TABLE 16–5 Mean scores (and SD) for the classes of 1990 and 1991 on MCAT physics and post-test Test 1990 1991 MCAT Post-test 48. somewhat akin to a 10-gallon hat. as discussed earlier. This is easily handled with yet another parameter. if we bring up the heavy artillery and ANCOVA the whole thing. to the left of the graph). If we do a t-test on the post scores. although not as glaringly obvious. if there are k levels of a grouping factor. How can we crunch out the statistics of it all? Easy. take our word for it. so it is 1 if both are present. and signiﬁcance test for each variable. As we discussed in Chapter 14. so that they start off duller (i.10 Nonlinear Regression Whatever happened to age and PMS? We established that there was a slightly bizarre relationship between age and the PQ score. the result is t(18) = 0. So not only did we improve the sensitivity of the test in this analysis. but there is no theoretical limit to the extension of the basic strategy to any number of covariates.1 (28.e. But. In general.82. based on the product of the two dummy variables. The equation would look like: PQ b0 b1 Age b2 Age2 Now to look at the contribution of each term.1) 73. 
the Bare Essentials treatment effect) is now a super +19. ANCOVA can correct for the effect of these differences on the dependent variable. Suppose. a ﬂashlight reﬂector is a more appropriate. This then summarizes the potential gains resulting from using ANCOVA to account for baseline differences: So far we have considered only a simple design. and end up duller. Like all regression problems.002). The approach is simple enough. and 0 otherwise. of course. This amounts to another nominal variable in the design. and the data for the two cohorts on MCAT and post-test are shown in Table 16–5. Final grade FIGURE 16–5 Relationship between MCAT physics and posttest statistics score for the classes of 1990 and 1991 100 x 1990 Extension to More Complex Designs x 1991 80 60 40 20 0 20 8There are. we also corrected for the bias resulting from baseline differences.60. We would then likely want to examine whether there is an interaction— whether both drugs together are better than the simple additive effect. However.41. with MCAT as the covariate and 90/91 as the grouping factor. just as before.0 (16.8) 1. Time to recycle. 1990 1991 30 40 50 60 70 9See MCAT physics score (×10) 10A 11Both year’s. that we were also interested in the effects of steroids on PQ.6 (10. we must create a new coefficient to separately estimate the difference between steroid and placebo. analogy. p = . Mean Square.8 (8. for example. you will note that the relationship of PQ with age adopted a peculiar form. and testosterone and placebo.8 Any legitimate ANOVA design can be extended by the addition of covariates.0) 40. However. p = 0. this is equivalent to projecting all the data onto the Y-axis and looking at the overlap of the two resulting distributions. When randomization is not possible and differences between groups exist.11 you would immediately recognize this as the curve traced out by a function of the form y = x2. relative to their starting point.

equivalent to a t of 5.0 605.210 Age 2 Sum of squares Mean square Source df F p Regression Residual R2 = 0. the t-value is 5. such as serum drug levels. In short. and the residual term is 207. a ratio variable (Belt). and a power series in Age. exponential.06 . When the dust settles.043 412. which is signiﬁcant at the . each with a coefficient to be estimated. Age Sum of squares Mean square Source df F p TABLE 16–6 Nonlinear regression with Age and Age2 Regression Residual R2 = 0.13 and all of these analyses are now viewed as subsets of something called the General Linear Model (GLM). or falling away. the similarity between ANOVA.60 650. General Linear Model By now. We can introduce interaction terms to cover all contingencies.57.7 0. however. at least). and regression was noted.56 3. Belt size. we’ll ﬁnish off by completing the analysis of the PMS data according to the initial observations of the astute clinician. For Age.33 2.241 7. or whatever.557 1. there are other times we might want to depart from a straight-line relationship. SE t p TABLE 16–7 Analysis of Age.048 ANOVA of the regression Sum of squares Mean square Source df F p Regression Residual 7755 1768 4 11 1938. also signiﬁcant at the .6 = 27.634 ns Age + Age 2 Sum of squares Mean square Source df F p Regression Residual R2 = 0. The corresponding ANOVA tables are shown in Table 16–6.0001 ns = not signiﬁcant. logarithmic. it is apparent that we can recast just about every ANOVA problem as a regression problem by the appropriate choice of dummy variables. all these problems can be cast as a sum of terms.105 .ADVANCED TOPICS IN REGRESSION AND ANOVA 173 taining only the term in Age.14 For a sense of closure. we could easily add in a cubic or quartic term. 13A . From longsuffering coauthor” It’s “quintic.12 The general strategy of creating nonlinear terms in a regression equation is called Power Series Regression when it involves a series of terms with increasing powers of some variable (obviously).32 . but the combination of the two is quite credible. the analysis appears as in Table 16–7.6 1. quadratic. simply creating a multiple regression equation in which there are terms for Age and Age2.7 12.109 1043 8479 1 14 1043.27. If the model still didn’t look too great. and Test/Other Test/Other Age Age2 Belt size –13. We’ll ﬁt a model of the form: PQ b0 b1(Age) b2(Age)2 b4(Test/Other) b3(Belt) 12We can’t. because we don’t know how to express it in Latin. and we saw from the ﬁgure some suggestion of additional curlicues.004 .76 3. we would likely invoke negative exponential or logarithmic terms. If we then proceed to determine the individual contributions. add in a term in x5. However.8.72 . Nothing changes except the concept.0041 1. Conceptually.6 9110.007 .001 Yj b0 ∑ bi f(Xj) (16–3) where f(X) indicates some mathematical function of X—linear.50 4. one containing only Age2. In these cases. ANCOVA. the additional Sum of Squares for Age2 is (6831 – 1043) = 5788 with one degree of freedom. Clearly the addition of the quadratic term to the linear term resulted in a much better ﬁt than either alone.65 1.012 . So.49 .” similarity that we deliberately highlighted by treating ANCOVA We have a nominal variable (Test/Other). At some point in the not-too-distant past.0001 level. Clearly Age alone does a fairly crummy job of ﬁtting the data. Age2.717 6831 2692 2 13 3416. the analysis proceeded in familiar fashion.1 16. as in weight-loss programs (in theory.0 160. then proceed with business as usual. the F-test for this variable is 5788/207.6. 
but it’s great to impress people at cocktail parties with. We can also throw around nonlinear terms with gay abandon and again shove them into some regression form.71 –. and one containing both. and Age2 is even worse. something of the form: Variable b coeff.0 1 14 412. One of the most common times occurs when things are either accruing.0001 level.23 .0 207.
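Since everything is now "just" a linear model, the power-series fits of Table 16–6 and the full model of Table 16–7 are two more calls to the same routine; a sketch (Python/statsmodels again, on the Table 16–1 data) is below, and it should tell the same story as the text, give or take rounding.

```python
# Power-series regression and the full model for the PMS data (cf. Tables 16-6 and 16-7).
# I(age**2) squares Age on the fly; test is the 0/1 dummy for Testosterone vs Other.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

pms = pd.DataFrame({
    "age":  [24, 26, 27, 88, 32, 29, 70, 75, 37, 65, 72, 55, 45, 41, 63, 40],
    "belt": [46, 36, 40, 44, 36, 30, 42, 35, 42, 50, 45, 53, 48, 38, 43, 58],
    "test": [1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1],
    "pq":   [12, 14, 27, 35, 26, 21, 48, 51, 62, 64, 60, 77, 91, 84, 55, 74],
})

linear    = smf.ols("pq ~ age", data=pms).fit()
quadratic = smf.ols("pq ~ age + I(age**2)", data=pms).fit()
print(linear.rsquared, quadratic.rsquared)   # the quadratic term should improve the fit a lot
print(sm.stats.anova_lm(linear, quadratic))  # extra-Sum-of-Squares F-test for Age-squared

# The whole thing at once: nominal, linear, and power terms in one equation (Equation 16-3)
full = smf.ols("pq ~ test + age + I(age**2) + belt", data=pms).fit()
print(full.summary())                        # one b, SE, t, and p per term, as in Table 16-7
```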

So if somebody comes along and poses the question. one for each subject but the ﬁrst one. If they’re not parallel. We neatly avoided this issue by cooking our data so that we always ended up with parallel lines. in some sticky situations. the equation for the control group. it’s actually a lot more profound than that. it is equal to: PQ b0 (b 0 b1 b 2 Belt b 3 Belt b1) (b 2 b 3) Belt 16You have no idea how long it took to get data cooked right. most of which . Second. Because the whole game is to remove variance in the dependent variable attributed to the covariate. as before. they are more a matter of logic than of statistics. a steroid preparation designed to kill off muscle tissue in the cerebral cortex.. it should be almost selfevident (if we’ve been doing our jobs) why this is a good idea. and for the treatment group the line’s slope is (b2 + b3). Patients may well respond about the same to a drug. what you do is use a slightly more elaborate model. there are two good reasons to know about GLM. The old “N – 1” routine strikes again. such as multifactor ANOVAs in which some subjects had the indecency to drop out leaving you with an unbalanced design (Oh. for the control group the line has a slope of b2. Even repeated measures designs. The b coefficients for each subject express the difference between the mean of each subject (after the ﬁrst one) and the ﬁrst subject. and forgettable. After all. Although the last bit looks just like we changed the name to protect the guilty. Some other constraints on the selection of the covariates exist. 17Hardly 18Following So the treatment effect is contained in the b1 coefﬁcient as before. Note. pharmacist.” Actually. Note that most computer programs automatically test for parallelism. ANCOVA comes with some costs. marital status. hotshot. the prudent and standard action to take is to always test for an interaction ﬁrst. determining the slopes. one that explicitly includes an interaction term. In any case. If they are. it becomes a constant reminder that all of statistics—in fact. formulae for it. Here’s how. 1. and there is no longer an effect of Treatment. we rather like interactions because they can be informative. through our guidance. which is coded 0. a persuasive argument unless you designed them. Indeed. is: PQ b0 b2 Belt Assumptions of ANCOVA Unfortunately. how good is Corticomyostatin18 anyway?” you would have to concede that it depends on how smart you are to begin with. instead of drawing out the standard. and this is just another example of an interaction. if the lines are not parallel. then you don’t proceed with the ANCOVA. etc. “So. that means that the effect of treatment (the distance between the lines) is different depending on where you are situated on the X-axis. 14To REGRESSION AND CORRELATION really impress them. Second. the new equation looks like: PQ b0 b1 Treatment Treatment Belt b2 Belt b3 Now remember that the way we pulled this off was to use a dummy variable with values of 0 for Other or 1 for Testosterone. The fact that we can lump both nominal and interval-level variables into the same equation starts to collapse the distinction between the two historical classes—regression and ANOVA. many situations arise where there is no relationship And for the testosterone group. GLM approaches must be used to avoid biased estimations. But this condition does preclude the willy-nilly covarying of anything you can lay your hands on. 
Recall that the model equation before was: PQ = b 0 b1 Treatment b2 Belt If we now add in an interaction term. “That all depends. In principle. gender. And the last thing any clinician. are handled by creating whole families of dummy variables—one for each subject. The covariate should be related to the dependent variable. The two slopes are estimated separately. where you multiply the treatment dummy variable and the covariate together. in which the interval-level predictors like Belt Size are entered directly. the nominal level predictors are accommodated through the use of dummy variables.15 Including interaction terms just amounts to more terms in the equation.” That marks you as a real cognoscenti. the addition of more terms resulted in an improved ﬁt and a reduction of the error term. or snake-oil salesman worth his fee wants to be caught saying is.” That would give the game away. that individually the only signiﬁcant contributions came from the terms in Age and Belt Size. It involves an arcane and complex methodology called multiplication. between the treatment and the covariate. achieved a sense of holistic serenity about the world of statistics. In the ﬁrst place. number of dogs. regardless of the initial state of the disease (or they may not). however. If we do the same stunt here. First. now that you have. since special-purpose software to do ANOVA or regression is much simpler to use. 15Actually So. we can see that both are just special cases of the GLM. any of the designs we have discussed in the previous chapters can be put into the form of a linear equation. This is done by performing a separate regression on each line. Although folks don’t often conduct their analyses this way. and power terms and interactions can be thrown in with gay abandon. which is done as is any other regression coefficient. all of science—is really about explaining variance. and then testing whether the slopes are signiﬁcantly different. as assessed by the MCAT score. in which Subject is an explicit factor. as we pointed out in Chapter 9. You have to say “I did a Glim at work yesterday.174 with regression analysis. the shame of it!). Certainly one condition is that the lines are parallel. on our previous discussion. before proceeding with the ANCOVA. you may realize that this condition is not really too constraining. and then ﬁt a new constant. you don’t say “I did a Gee Ell Emm yesterday at work. So any difference in the slopes shows up in the test of signiﬁcance of the b3 term.16 The two reasons why the lines must be parallel are (1) because that’s what the ANCOVA packages are designed for. namely the usual raft of assumptions. such as age.17 and (2) because that is the only way you can estimate a treatment effect.
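Checking the parallel-lines assumption is just one more regression; a sketch (same assumed Python setup, Table 16–1 data) that adds the Treatment by Belt product term and inspects b3:

```python
# Checking the homogeneity-of-slopes (parallel lines) assumption before trusting the ANCOVA:
# add the Treatment x Belt interaction and see whether its coefficient (b3) differs from zero.
import pandas as pd
import statsmodels.formula.api as smf

pms = pd.DataFrame({
    "belt": [46, 36, 40, 44, 36, 30, 42, 35, 42, 50, 45, 53, 48, 38, 43, 58],
    "test": [1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1],
    "pq":   [12, 14, 27, 35, 26, 21, 48, 51, 62, 64, 60, 77, 91, 84, 55, 74],
})

# "test * belt" expands to test + belt + test:belt, so b3 is the test:belt coefficient
slopes = smf.ols("pq ~ test * belt", data=pms).fit()
print(slopes.params["test:belt"], slopes.pvalues["test:belt"])
# A non-significant b3 means the two slopes are indistinguishable and the ANCOVA may proceed.
```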

Ron Hubbard can do it. We are a less severe. The net result will be that the two groups will end up on the same regression line except that the treatment group will have moved up and to the right. and stop. in our example. 19R. indicate that the use of a covariate will add statistical power. If multiple covariates are used. If all these supposes are so. and that’s the sample size. This sounds a bit like what we were dealing with above. 3. Imagine. As another example. the other hand. ANCOVA methods combine continuous variables and grouping factors into a single regression equation. in a two-group drug trial with a covariate. The covariate should be unrelated to the treatment variable. Bare Essentials is unlikely to inﬂuence height or religion. Take the comparison you really care about and calculate a simple sample size for it. these should be unrelated to each other. the law of diminishing returns rapidly takes hold so that each new variable accounts for relatively little additional variance but costs 1 df or more. conservative statisticians demand that any covariates be measured before treatment. Do as we did in Chapter 14. for all its virtues. SUMMARY This chapter described several advanced methods of analysis based on regression analysis.30 or above.L. 1988). then the treatment will change both the post-test and the MCAT score equally. SAMPLE SIZE As you might have guessed. The situations where gains can be had from more than one or two covariates are rare indeed. For this reason. if L. So much for the inﬂuence of evidence. use ANOVA. why can’t we? 20On . reﬂecting improvement in both MCAT and post-test scores. the comparison of real interest is drug/placebo. multiply by 10 (Kleinbaum. you could use the formula for a paired t-test and again indicate that it is likely conservative.20 so these could be measured anytime (although we’re not sure why you would). There are therefore two strategies available: 1. Kupper. by the time we arrive at these complexities. It is straightforward to extend the strategy to include the analysis of multiple covariates—straightforward and usually dumb. use ANCOVA.30. Thus we would falsely conclude that treatment had no effect. using dummy variables for the latter. The reason is already familiar (we hope). Thorndike conclusively disproved this one in 1904. 2. For example. As you introduce additional variables. A good rule of thumb is that if the correlation between the covariate and the dependent variable is under 0. we’ll accept that. if it’s 0.ADVANCED TOPICS IN REGRESSION AND ANOVA 175 are virtually unrelated to everything. and Muller. 2. Use the formula for a t-test (Chapter 7). and at that point we insisted that all the little dears had to take the MCAT as a condition of getting through the course. but it’s not quite the same. Power series analysis and other nonlinear regressions are simply multiple regressions where coefficients are estimated for various functions of the X variable. then we might expect that Bare Essentials would improve not just the post-test score but also the MCAT score. but many of us were still taking Latin in the 1960s. that our statistics teaching is so very good that it acts on general mathematical skills the way that teachers of yore insisted a Latin course would act on language skills19 and that computer science teachers still insist BASIC will act on logic skills. Add up all the independent variables (not forgetting to count dummy variables as appropriate). 
any attempt to make an exact sample size calculation is akin to keeping an umbrella open in a tornado. If so. Now suppose further that we didn’t dream up the idea of using MCAT as a covariate until we found the ﬁrst conclusion from the t-test. if you wanted to measure change with ANCOVA.

(b) Reaction time. How many repeated measures are there? What is the df associated with each? right of Block 1 of 1 • Enter the other predictor variables and again press Next • Select Forward from the Method option • For regression equations that include power terms and interactions. Each time patients with low back pain come in for treatment. d. b. Fox Effect” demonstrates that a charming. they complete a 60-item multiple choice test. brilliant. dull. naturally). What variable corresponds to “Subjects”? b. A group of students are randomized to receive either (a) a wonderful. At the end of the stats course. Unbeknownst to the patient or therapist. condescending statistics book (any of the others) at the beginning of a stats course.176 EXERCISES REGRESSION AND CORRELATION How to Get the Computer to Do the Work for You • To add covariates for both factorial and repeated-measures ANOVAs. (c) I. a random device in the machine turns it on or off for a particular session. a. To further explore this phenomenon. and how many df does it have? d. measured by the total time required to remove a gallstone. a. within-subject factors. The mark in their last undergraduate math course is recorded. The effect of dress and speaker age on student ratings were explored. After 6 weeks. witty speaker can suck everybody into believing his message. move the covariate(s) into the Independent(s) box • Then press the Next button to the 1. In the following designs. c. perceptive. e. humorous. it is easiest to ﬁrst create new variables using Transform ¨ Compute . or (b) the same old boring. inarticulate. identify the between-subject factors. and witty new statistics book (this one. As one ﬁnal wrinkle. students received a series of seminars from a total of 12 speakers of varying ages. is predicted using the following variables: (a) Right-handed or left-handed. As in c above. Surgical performance. but the sample is stratiﬁed on male/female. 2. there is just one more step: simply move the desired variable(s) from the list on the left to the box labeled Covariates • To add covariates within a Regression approach. Six were dressed neatly and nattily (NN). Patients with chronic leg cramps are randomized to receive either calcium supplements or a placebo. and 6 were dressed soiled and shabbily (SS). This continues until patients have completed 12 sessions—TENS/Placebo at 6 levels. What is the covariate. The effect of transcutaneous electrical nerve stimulation (TENS) is assessed by physiotherapists. with 10 men and 10 women in the class.Q. silly). students were divided by gender. What is the “Between Subjects” factor? How many df? c. and covariates. The “Dr. they are asked to rate whether the pain has become better or worse and by how much (on a 100 mm Visual Analog Scale). they are given TENS at one of six different power levels assigned at random. (That’s where we get Presidents and Prime Ministers from.

called Robert’s Rectal Pills (RRP). do some analyses. For example. obvious. we explore a number of methods for assessing change (in this case. and ubiquitous methods (difference scores and paired t-tests) to more advanced and powerful methods. To test it. “We’re in the business of making people better. In addition. and another would get some other pill that looks like RRP.1 This step seems so intuitively correct that it couldn’t possibly be wrong. in many situations. we could bring a series of patients into the clinic and measure their joint counts or any of the dozen other things rheumatologists like to measure. Measuring Change SETTING THE SCENE Over the years. Now we have a new wonder drug. let’s just stay with joint counts (JC). but contains only the baking soda. In this chapter. the period of the moon. the time of day. which has the dubious advantage that its mode of administration is somewhat unusual. Robert’s Rectal Pills. In the course of doing so. we will begin with this simple case. you would measure everyone before and after. and charging outrageous prices considering that about 90% are just variants on aspirin. and Dodd beat us to the better organs. if we were to do a study of some new diet plan by assigning everyone to a treatment or control group. and in fact.. 1967). it contains mostly ASA (acetylsalicylic acid) with a mere soupçon of baking soda. including ANCOVA.CHAPTER THE SEVENTEENTH There are a number of approaches to examining the change in a variable over time. which we’ll talk about in due course. Norman. 1998). 2001.g. bring ’em all back. at least for an aspirin concoction. then analyze these difference scores. The simplest involves difference scores. and publish. for that matter) that is so seemingly straightforward but at the same time so controversial as measuring change. 1989. like all good experimenters. and the advancing perihelion of Mercury’s orbit. Unfortunately. For the moment. people with arthritis have fallen prey to countless over-the-counter preparations. goes in the same place. After all. 1991. no one in his or her right mind would just look at the weights after the treatment was over. then develop several other more powerful methods to assess change in more complex (and powerful) experimental situations. it is so self-evident that we do it almost without thinking about it. and is analyzed with a paired ttest. 1And T here is probably no area in statistics (and measurement theory. Obviously. which goes up and down with the weather. It has intuitive appeal to clinicians—something along the lines of. the right test is an unpaired t-test on the difference scores derived from the treatment and control groups. it is limited to the simple case of only two measurements—a pretest and a post-test.” It also seems like the proper thing to do statistically. and the controversy is still alive (e. it’s not quite right either. One group would get RRP. we randomize them into two groups. Analysis of covariance is more appropriate. then difference scores are perfectly OK. which excludes every study we’ve ever done or read about. 2The reason it is wrong or at least suboptimal to use difference scores has to do with a phenomenon called regression to the mean. Collins and Sayer. so we should measure how much better we make them. and generally leads to a more powerful test. consider brieﬂy a disease like rheumatoid arthritis. If there is no measurement error. we examine one more of the family.4 Then. To put some meat on the bones. 3Carter 177 . Harris. 
ﬁnd out how much each person gained or lost. We would wait a month.3 Like most over-the-counter medications. calculating a gain score is as simple as subtracting the preintervention score from the postintervention score. many have been (Collins and Horn. In this chapter. although it’s not exactly wrong.2 For illustration. In fact. measure their new JC. ANCOVA can also be generalized to the situation in which there are multiple occasions of measurement. change in joint counts) ranging from the simple. the if you want to practice your statistical prowess. guaranteeing immediate relief from the pain. Books can be written about it.

which we learned in Equation 7–10. So the variance in a distribution of scores looks like: σ2Observed = σ2True + σ2Error (17–1) If we imagine doing a trial where we assign folks at random to a treatment or a control group. where’s the problem? There is none.7 and 27. . 10 in the placebo group).78 1 1 8 3 2 3 3 3 6 2 2.5 6 7 7 3 3 5 7 2 4 4 4. Once again. We never observe the true score. is not signiﬁcant at the . If that’s all we did.062 4If we really did measure a whole bunch of things.062. doing something (or nothing) to them. the JC of the RRP group has dropped some to 21. then the straightforward analysis is. we’ll have to make a brief detour into measurement theory. statistics reveals itself to be somewhat rational (at least some of the time).9 12.99 (the square root of 3. if you have been taking to heart all the more complicated stuff in the last few chapters.5 89. We did that for the data in Table 17–1. The error can arise from a number of sources.85 136. and the JC of the control group has also dropped a bit to 24. 5They’re JCs for 20 patients (10 in the RRP group. The Time × Group interaction resulted in an F-test of 3.99. the denominator of the test. shown in Table 17–2 even if you’re not interested.96) with the same probability (. which has an associated p-value of . See Chapter 12.8 16 17 25 21 32 22 27 13 25 21 21. you might want to do a repeated-measures ANOVA. we should use Multivariate Analysis of Variance (MANOVA).1 32 33 42 27 22 18 16 32 25 24 27.062). After a month.25 . which came out to 1. before and after treatment. On the other hand. and it turns out to be 1.5 7. ﬂuctuations in a person’s state. if you just want to plug the numbers into the computer and were interested only if the p level is signiﬁcant or not.5 1617.6 2. but there are many if you want to understand what the numbers mean. and then computing an unpaired t-test on the ﬁnal scores. transposition errors entering the data.5. when we think about the total variation in a distribution of scores.96 with 1 and 18 degrees of freedom.9.9 12. All these calculations are shown in Table 17–2 if you’re interested.96 <. which. Reliability of Difference Scores TABLE 17–2 ANOVA of pretest and post-test scores for the RCT of RRP Source Sum of squares df Mean square F p Group Subject: Group Time Time Group Error 22.1.1 55.001 . whether it’s from a paper-andpencil test.8 1. This is equivalent to the t-test of the difference scores. it’s what would result if the person were tested an inﬁnite number of times. PROBLEMS IN MEASURING CHANGE If it’s this simple and straightforward. the observed score (XO) consists of two parts: the true score (XT) and error (XE).0 1 18 1 1 18 22. So the interaction term is equivalent to the unpaired t-test of the difference scores. a blood pressure cuff.80 11 12 13 14 15 16 17 18 19 20 MEAN SD 2 2 2 2 2 2 2 2 2 2 pretest to post-test and the control group to stay the same. just as we had hoped.7 6. Thus. it has two components. or the most expensive chemical analyzer in the lab.05 0. The consequence is that. Since we expect the treatment group to get better over the time from To understand whether there is a problem with difference scores being unreliable.1 3. such as inattention on the part of the subject. an unpaired t-test on the difference scores for each subject in the treatment and control groups.8 3. with one between-subjects factor (RRP group vs Placebo group) and one within-subjects repeated measure (Pretest/Post-test). with df = 18. 
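For the record, the difference-score analysis is only a few lines of code. A sketch (Python with scipy, using the Table 17–1 joint counts) follows; squaring the resulting t should land you close to the Time by Group F of Table 17–2, allowing for rounding.

```python
# Difference scores for the RRP trial: subtract the post-test from the pretest for each
# patient, then compare the two groups' mean change with an unpaired t-test (Table 17-1 data).
import pandas as pd
from scipy import stats

rrp = pd.DataFrame({
    "group": ["RRP"] * 10 + ["Placebo"] * 10,
    "pre":  [22, 24, 32, 24, 35, 27, 34, 15, 29, 25,
             32, 33, 42, 27, 22, 18, 16, 32, 25, 24],
    "post": [16, 17, 25, 21, 32, 22, 27, 13, 25, 21,
             31, 34, 34, 24, 24, 15, 13, 29, 19, 22],
})
rrp["change"] = rrp["pre"] - rrp["post"]   # a drop in joint count is an improvement

t, p = stats.ttest_ind(rrp.loc[rrp["group"] == "RRP", "change"],
                       rrp.loc[rrp["group"] == "Placebo", "change"])
print(f"t = {t:.2f}, p = {p:.3f}")
# Squaring this t gives the Time x Group interaction F from the repeated-measures ANOVA.
```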
and a multitude of others.05 level.7 Any score that we observe.178 REGRESSION AND CORRELATION TABLE 17–1 Data from a randomized trial of RRP versus placebo with a pretest and a 1 month post-test Subject Group Pretest of JC (0 months) Post-test of JC (1 month) Difference 1 2 3 4 5 6 7 8 9 10 MEAN SD 1 1 1 1 1 1 1 1 1 1 22 24 32 24 35 27 34 15 29 25 26. are shown in Table 17–1. has some degree of error associated with it. The two major issues affecting the interpretation of change6 scores are the reliability of difference scores and regression to the mean.8 31 34 34 24 24 15 13 29 19 22 24.63 44.5 The mean scores for the RRP and control groups are about the same initially: 26. mistakes reading a dial. this amounts to a Pre-/Post-test × Group interaction. and the F-test of the interaction is just the square of the equivalent t-test.1 7.9 7. as we already indicated. one due to real differences between people and the other due to measurement error. actually.4 136. is actually based on this sum of variances.

and yet again. a number of people in this latter group will have scores below the criterion.. we have displayed the difference in joint counts from pre. On the right side. then do an unpaired t-test on the difference scores. the cost of all this is to introduce error twice. back in Chapter 11 we deﬁned reliability as: σ2True σ2True = 2 + 2 σ True σ Error σ2Observed (17–4) Reliability = so if σ2True < σ2Error. You won’t end up as a subject in a study of RRP if you don’t have arthritis. measuring change is a good thing if the reliability is greater than 0. the more you improve. Measurement theory also provides another way of looking at the problem of regression toward the mean. If that second evaluation of your pain occurs during the post-treatment assessment. we can guarantee that next year’s performance will be worse. The measurement perspective is fairly similar. Also. this is likely the reason “treatments” such as copper or “ionized” bracelets still continue to be bought by the truckload.MEASURING CHANGE 179 6The One thing we can do to make this error term smaller is to measure everyone before the intervention and again afterward. is a ﬂuctuating condition. one explanation of regression to the mean is that people enter studies when their condition meets all the inclusion and exclusion criteria (i. it doesn’t matter if we subtract the pretest score from the posttest or vice versa. too. Even in the placebo group. There are a couple of reasons it rears its ugly head: sampling. so the actual regression line between pre. people do get better after they’ve put them on. Now the error is: σ2Change = σ2Error + σ2Error = 2σ2Error (17–2) What does this have to do with reliability? Well. in fact.0. then most likely you will also be taller than average but not quite as tall as they are. we known that no test on the face of the earth has a reliability of 1. Whose scores are above some cut-point (assuming that high scores are bad)? Those whose True and Observed scores are above the criterion. because of the error component. nor will you have any. it’s likely the pain isn’t quite as bad. going up at a 45° angle. see Streiner and Norman (2003). From the perspective of sampling. if σ2True < σ2Error.8 Similarly. this amounts to saying that the reliability is less than 0. then the test on postintervention scores will be larger than the test on difference scores. The drug and the placebo both do a world of good for severe arthritics. and gain score are synonymous.9 Let’s take a closer look a the data in Table 17–1. then neither could your parents.” Now. The only problem is that the improvement is due to regression effects. Doc. as in Equation 10–5. and measurement theory (which rears its ugly head again). What an interesting situation. who gets into a study looking at the effects of treatment? Obviously. and those whose True scores are below the cut-point but. let’s see why. You’re a participant because you went to your family physician and said.5. In other words. the worse you are to begin with. not the bangles. but closer to the mean. they wouldn’t have gotten into the study. given that the score at Time 1 (T1) was x is: E(T2|T1 = x) = ρ × x (17–5) terms difference score.” What he was referring to was that if your parents were above average in height. However. they get worse. 7For 8This 9This 10It’s . at 45˚ only if we standardize the two scores. or simplifying a bit.5. so the next time you’re seen. if we compare Equation 17–1 with Equation 17–2. or if you have it but aren’t bothered by it too much. 
you should see what Galton (1877) originally called it— “reversion to mediocrity. so the person has “improved” even in the absence of any intervention. like many other disorders. but don’t really help mild cases. which we’ve done to make the example easier. On retesting. As we pointed out in Chapter 10. as we said earlier.8 (don’t forget that we are dealing with standardized scores). the expected value at Time 2 (T2). you’ve improved! So. and normal variation in the disorder will make the follow-up assessment look good. yours will be. their Observed scores are above it. then voilà. the post-test score is less than 0. the pain is killing me. if their income was below average. When we take differences. This is a prime example of regression to the mean. these people won’t be balanced out by those whose True scores are above the mark but their Observed scores are below it. then he or she will have the same post-test score.and post-test scores is at a shallower angle.0. 0.to post-treatment. because we’ve measured everyone twice. “Help me. the argument is the same even when the scores haven’t been standardized. In fact. is also the bane of recruiting agents for sports teams. this is usually a good thing. Mathematically. voilà! Furthermore. Now. In this case. is somewhat different from the “proof” that sterility is inherited: if your grandparents couldn’t have children.8.e.10 If a person has a pretest score of. But. let’s confront the second issue. If a batter or pitcher has had an above-average year. and we’ll use them interchangeably. But. like the broken line in the ﬁgure. change score. those who are suffering from some condition. If the inequality: σ2True + σ2Error < 2σ2Error (17–3) holds. say. That means that the correlation between the pretest and post-test scores is 1. those with really bad joint counts initially seem to get quite a bit better after treatment. A close inspection reveals that it seems that in both the treatment and control groups. But not always. Let’s assume that the test that we’re using has perfect reliability. a second measure will revert (or regress) to the mean. and that’s where these variance components come in. Regression to the Mean Now that we’ve put to rest the issue that measuring change is always a good thing. regression to the mean. arthritis. shown in Figure 17–1. even if nothing has intervened. it’s fairly severe). The statistics don’t care. and neither do we. a Cook’s Tour of measurement theory.5. and a bad thing if it is less than 0. we do indeed get rid of all those systematic differences between subjects that go into σ2True. In short. If you think that term has a somewhat pejorative connotation. it’s pretty evident that we won’t always come out ahead with change scores.
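Equation 17–5 is easy to watch in action with a small simulation; the sketch below (Python, all numbers invented) gives everyone a true score, adds fresh error at each testing, and then looks only at the people who were bad enough at Time 1 to get into the study.

```python
# A small simulation of regression to the mean: nobody is treated, yet the people who
# looked worst at Time 1 look better at Time 2.  All numbers are invented.
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
true_score = rng.normal(0, 1, n)           # each person's "real" standing
time1 = true_score + rng.normal(0, 1, n)   # observed = true + error (reliability = 0.5 here)
time2 = true_score + rng.normal(0, 1, n)   # retest: same true score, fresh error

reliability = np.corrcoef(time1, time2)[0, 1]
bad_at_entry = time1 > 1.5                 # only the "severe" cases get into the study

print(f"reliability ~ {reliability:.2f}")
print(f"mean at Time 1 (selected group): {time1[bad_at_entry].mean():.2f}")
print(f"mean at Time 2 (same people):    {time2[bad_at_entry].mean():.2f}")
# The Time 2 mean sits closer to the grand mean by roughly a factor of rho, as Equation 17-5 predicts.
```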

3 1 1 17 25. the two means—for the fatties and wisps— both lie on the same 45° line relating pretest to posttest. thus controlling for soil and atmospheric conditions. when we’re dealing with cohort studies.2 691. which. In fact. with the ﬁnal score as the dependent variable and the initial state as the covariate. whereas those who had less sever scores (those below the mean) actually appeared to get worse. So. Now we ﬁnd that the effect of treatment results in an F-test of 4. That’s why in Table 17–1. planting different grains in each section.36.” However. Of course. and ANCOVA can do its magical stuff.13 and they don’t have the option of saying.0 r < 1. you have to be aware of Lord’s paradox.g. and vice versa. because of regression to the mean. background factors (male or female. for the sake of this example.14 Let’s change the example we used in the previous chapter. Sir Ronald Fisher. if people were randomly assigned to the groups.36 119.05 <. 12If 13This where the vertical line means “given that.1 Error 98. the more the score deviates from the mean. with 1 and 17 df. not discussed in this book. and avoids overcorrection. the greater the effect. occasionally. he was working then at the Rothamsted Agricultural Experiment Station with plants and grains. instead of randomly assigning men to the treatment and incidentally measuring their PQ. and plot the Regression to the Mean and ANCOVA As we’ve just said. there can be gains of a factor of two or more in power. because Fisher took a plot of land and split it. First.g. We have done this in Table 17–3. and less gain from the use of ANCOVA. However. Another variant of ANOVA. In fact. Second.57 . that there really is no difference on average. just as we see with the asterisks in Figure 17–2. the two ellipses don’t quite lie with their major axis on the 45° line. Suppose. everyone changes by the same amount except for random error. is called a split plot design. After all. if there is less measurement error. where people end up in groups because of things they may have done (e. did or didn’t use some medication). 11And REGRESSION AND CORRELATION r = 1. the less reliable the test. and measure them before and after treatment. in a cohort study). assignment. Consequently. In each group. took it for granted that there was random assignment.05 level. the use of ANCOVA can lead to error.78 4. The heart of the problem is something called Lord’s paradox. If we analyze the data by using a simple difference score (post-test minus pretest). is to do an Analysis of Covariance (ANCOVA). and this may affect the outcome. looking at the relation between Pathos Quotient (PQ) and belt size.” and ρ (the Greek letter rho) is the test’s reliability. the daddy of ANCOVA. in the regression sense of minimizing error. we can’t assume that about baseline differences. it’s more likely that the group differences are related to those factors. put them both on the hormone. socioeconomic status).001 Ski *nnies FIGURE 17–2 Analyzing the effect of testosterone in two groups using difference scores. This graph and equation tell us two other things.. I want to be in the other group. showing regression to the mean. The net effect is that the footballs aren’t quite as tilted at the 45° line. which we will discuss shortly. we can assume that the differences are due to chance. if you’re scouting for next year’s team.1 5. leaving a smaller error term and a signiﬁcant result. although.12 The ANCOVA ﬁts the optimal line to the scores. 
adjusting for baseline differences among groups with ANCOVA is deﬁnitely the way to go if the people ended up in those groups by random Post-Test PQ Post-test Fat ties TABLE 17–3 Analysis of covariance of pretest and post-test scores for the RCT of RRP Source Sum of squares df Mean square * F p 25. In situations like these. Pre-Test PQ . So.11 One solution. there is less possibility of regression to the mean. the more regression there will be. The use of ANCOVA in the design more appropriately corrected for baseline differences. pick those in the cellar. those who are below the mean the ﬁrst time aren’t quite so low on the post-test. Through randomization. The good news is that those who had a very bad year will likely improve (except if they’re with the Toronto Blue Jays). is signiﬁcant at the . the more severely impaired patients “improved” more. smoked or didn’t smoke. we form one group of broomsticks and another of gravitychallenged men. people weren’t randomly assigned (e.0 Regression { Pre-test why batters who hit exceptionally well one year will really take a tumble the next. we’ll focus on seeing if fat and skinny men’s PQ scores change to the same degree when given testosterone. the thing is. Under fairly normal circumstances. or something else. the gains from using ANCOVA instead of difference scores will be small. legacy lives on in other ways. not the stars..2 Group Pretest (covariate) 691.180 FIGURE 17–1 Relationship between pretest and post-test scores. “Sorry.
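The ANCOVA itself is a one-line model formula; a sketch (Python/statsmodels, using the Table 17–1 data) with the post-test as the dependent variable and the pretest as the covariate is below.

```python
# ANCOVA on the RRP trial: post-test joint count as the dependent variable, treatment
# group as the factor, and the pretest as the covariate (cf. Table 17-3).
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rrp = pd.DataFrame({
    "group": ["RRP"] * 10 + ["Placebo"] * 10,
    "pre":  [22, 24, 32, 24, 35, 27, 34, 15, 29, 25,
             32, 33, 42, 27, 22, 18, 16, 32, 25, 24],
    "post": [16, 17, 25, 21, 32, 22, 27, 13, 25, 21,
             31, 34, 34, 24, 24, 15, 13, 29, 19, 22],
})

ancova = smf.ols("post ~ pre + C(group)", data=rrp).fit()
print(sm.stats.anova_lm(ancova, typ=2))  # group effect adjusted for the pretest
print(ancova.params)                     # the C(group) coefficient is the adjusted treatment effect
```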

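If you want to see the two strategies side by side, here is a bare-bones sketch in Python (a language the book does not otherwise use; the data, variable names, and effect sizes are invented purely for illustration). One analysis does a t-test on the post-minus-pre difference scores; the other does the ANCOVA, with the post-test as the dependent variable and the pretest as the covariate.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(17)
n = 10                                             # people per group
pre = np.concatenate([rng.normal(28, 4, n),        # group 1 starts higher
                      rng.normal(20, 4, n)])       # group 0 starts lower
group = np.repeat([1, 0], n)
post = pre - 3 * group + rng.normal(0, 2, 2 * n)   # a modest treatment effect
df = pd.DataFrame({"pre": pre, "post": post, "group": group})

# Analysis 1: t-test on the difference scores (post minus pre)
diff = df["post"] - df["pre"]
print(stats.ttest_ind(diff[df.group == 1], diff[df.group == 0]))

# Analysis 2: ANCOVA, with the pretest as the covariate
print(smf.ols("post ~ pre + C(group)", data=df).fit().summary())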
because the same person can’t be in both groups. when the Fatty gets retested. to move upward. but he didn’t explain it. the smaller the effect of the inequality in driving the two intercepts apart. No. is a bit conservative. When we project these two lines to the Y-axis to see whether there is a differential effect of testosterone. neither of these assumptions is testable. and that this relationship holds for all people in the group. which. sits on the 45° line. But this is a consequence entirely of regression to the mean and the fact that the two groups started out differently. these two people are not the same in one very important sense. whereas the equivalent person in the Skinnies group is above the mean of his mates. and then you pays your money and you takes your chances. Now the picture is somewhat different. and analyze the data with t-tests or repeated-measures ANOVA. MULTIPLE FOLLOW-UP OBSERVATIONS: ANCOVA WITH CONTRASTS That’s ﬁne as far as it goes. We’ll get the results shown in Figure 17–2. because the people are more or less similar and interchangeable. The explanation was given by Holland and Rubin (1983). and those assumptions are untestable. What we’d ﬁnd is shown in Figure 17–3. for the reasons we outlined before. The trouble is that the ANCOVA is pretty clearly overestimating the effect in this case. it may be that the difference score. The assumption we make when we use difference scores. as we discussed earlier. as shown in Figure 17–4. However. but using ANCOVA. is more sensitive. where the person gets both the treatment and the control intervention at different times. and made comprehensible by Wainer (1991. seemingly forever and ever. So. with Fatties slightly higher than the slope = 1 line and the Skinnies lower. 15And also in a design called a cross-over. will correctly say that there is a difference between groups. where there was a small “treatment effect” so the two means won’t lie on the 45° line. whereas the Skinny of the same weight is more likely to move downward. Pre-Test PQ ﬁndings. So. because of measurement error. So. we’re effectively forcing the line of “best ﬁt” to stay at 45°. But under different circumstances. In a cohort study. but that’s the way the world is. it is rarely the case that people with chronic diseases have only one follow-up visit. or even Neither. not the one looking down from on high and keeping track of all of your statistical shenanigans. The reason for the equivocal answer is that the “real” answer depends on the assumptions we make. what’s called differential regression toward the mean. which. which analysis is correct? Should we use a ttest on the difference scores.16 14That refers to Frederick M. Let’s analyze the data again. Lord (1967. the answer is Yes. that the dependent variable won’t change between pretesting and post-testing). as we discussed. only this time showing a small treatment effect. may well miss these effects. From this. if a person of a particular weight lost 5 Pathos Quotient points in the Skinny group. The assumption with ANCOVA is that the amount of change is a linear function of the baseline. Pre-Test PQ Group Effect a b *nnies FIGURE 17–4 A similar graph. even if we could ﬁnd a person in the Fatties group who weighed the same and looked the same as someone in the Skinnies condition. or an ANCOVA taking baseline differences into account? As is often the case. The ANCOVA line for each group goes right along the major axis of the ellipse. at a slope a bit less than 1. 
to the distance between the two regression lines. Maybe. the closer the two groups are in starting values. 2004). Moreover. Of course. 16Philosophy 101 (no date). On the other hand. this time using ANCOVA to adjust for baseline differences in weight. The paradox was raised by Lord.15 This isn’t the case with cohort studies. 1969). equal of course. he’s likely. we’d conclude that there is no effect of weight. What we really want to know is. is that the amount of change is independent of the group (if one group is a placebo condition. Not very satisfying. and that both groups changed to the same degree. would he also lose 5 points if he were in the Fatties group? This is obviously an unanswerable question. the ANCOVA. the choice of which strategy to use depends on which assumption you want to make. We can get a good approximation of the answer with random assignment to groups. The person in the Fatties group is below the mean of his group. It seems a shame to . Part of the problem is not just that one test is biased or the other is conservative. where everyone.MEASURING CHANGE 181 Post-Test PQ Post-Test PQ Fat ties * * Ski Fat ties Group Effect n Ski * nies FIGURE 17–3 Analyzing the same data as in Figure 17–2. regardless of group. we ﬁnd to our amazement that there is. They come back again and again.

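Lord's paradox is easy to conjure up on a computer. In the toy simulation below (numbers and names invented; nothing here comes from the RRP trial), nobody truly changes and the two groups simply start at different levels, yet the ANCOVA happily reports a group effect while the difference-score analysis, correctly, finds nothing.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(1967)
n = 200
true_score = np.concatenate([rng.normal(30, 4, n),    # the heavier group
                             rng.normal(20, 4, n)])   # the lighter group
group = np.repeat([1, 0], n)
pre = true_score + rng.normal(0, 3, 2 * n)     # measurement error only
post = true_score + rng.normal(0, 3, 2 * n)    # nobody actually changes
df = pd.DataFrame({"pre": pre, "post": post, "group": group})

diff = df["post"] - df["pre"]
print(stats.ttest_ind(diff[df.group == 1], diff[df.group == 0]))  # about zero
print(smf.ols("post ~ pre + C(group)", data=df).fit().pvalues)    # a spurious "group" effect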
14 ANCOVA results Sum of squares Mean square Source df F p Group 141.9 11.1 ignore all these observations just because you can’t do a t-test on them.182 REGRESSION AND CORRELATION Post-test TABLE 17–4 Data from an RCT of RRP versus placebo with post-tests at 1. Those in the control group get a little better. later you’re in that state.13 3.g.17 There’s a ton of information about how patients are getting from one state to the other that is lost in the simple look at just the ﬁrst and last measurements.5 44.00 104.1 5.3 156.967 .8 20.1 6. specifying in advance that you’ll look at the 12-month follow-up) or by snooping (e.001 . On the other hand. and more or less stay at a decreased level of pain. in which we’ve thrown in some more follow-up data.9 7.21 195. Two obvious approaches come to mind.6 5.g. Examine Table 17–4. Of course this is only one possibility (and perhaps an unlikely one).9 6. The results of both analyses are shown in Table 17–5. the Time × Group interaction amounts to an expectation . If we do an ANCOVA. the response to the drug is pretty immediate— people in the treated group have less pain after 1 month. since we are expecting that there will be no difference at Time 0. If we do an ANOVA. We’ve graphed the means in Figure 17–5. which. namely the two we just did: an ANOVA with ﬁve repeated measures.17 Pretest (covariate) 1848.33 20.5 6.5 1848.22 0.31 1. which is something that statisticians really don’t like to do.71 1. in a while we’ll examine some other possibilities.1 1.97 0. For the moment.3 31 34 34 24 24 15 13 29 19 22 29 33 35 24 23 16 14 28 20 21 27 30 32 26 19 14 14 31 24 24 28 35 28 25 19 16 11 29 17 21 11 12 13 14 15 16 17 18 19 20 MEAN SD 2 2 2 2 2 2 2 2 2 2 32 33 42 27 22 18 16 32 25 24 27. 6. By contrast. the latter is fraudulent. we would expect that the effect of treatment will simply end up as a main effect. although we’ve seen both done.5 24.82 8. and more fundamentally.76 . 6. let’s think about how we can analyze the data and be true to the pattern of change.82 .8 7. and an ANCOVA.7 3.3 ANOVA results TABLE 17–5 ANOVA and ANCOVA of scores with multiple follow-ups at 1.5 7. Let’s take a closer look. and 12 months.1 7.9 339. not an interaction. Now that’s interesting! Despite the fact that we have four times as many observations of the treatment effect as before.74 3.4 5.1 16 17 25 21 32 22 27 13 25 21 17 18 24 19 29 21 27 11 22 20 13 19 26 20 31 19 24 12 21 23 14 22 22 23 27 17 22 14 19 24 21. it’s treating change in a very simplistic manner. 3.012 level.44 84.8 Subject: Group Time Time Group Time Pretest Error 300. it’s throwing away half the data you gathered. The baseline data are handled differently as a covariate.3 24. and 12 months as the repeated observations. stay a little better. What is going on here? The explanation lies in a close second look at Figure 17–2. When we do the ANOVA.7 6. 3.3 2815. kind of like a quantum change—ﬁrst you’re in this state.. What’s so bad about the ﬁrst strategy? Two things.8 20. Conversely. 6. and 12 months for the RCT of RRP Source Sum of squares df Mean square F p Group Subject: Group Time Time Group Error 151.001 <. As you can see.0 1 18 4 4 72 151. the ANOVA now shows no overall signiﬁcant effect in the main effect of Group or the Time × Group interaction. and again.55 20. 3. 6.8 24.41 . and 12 months Subject Group Pretest 0 mo 1 mo 3 mo 6 mo 12 mo 1 2 3 4 5 6 7 8 9 10 MEAN SD 1 1 1 1 1 1 1 1 1 1 22 24 32 24 35 27 34 15 29 25 26. 
looking at all the differences and picking the one time period when the treatment seems to have had the biggest impact).001 .5 1 1 17 3 3 3 51 141. Of course.012 <. The former is inefficient. Second.8 0.9 20. using the pretest as the covariate and post-tests at 1. the overall main effect of the treatment is washed out by the pretest values. either by design (e.3 30. since they occurred before the treatment took effect. but signiﬁcant differences at 1. things are much clearer. with great regularity. are close together. 3.18 6. in the ANCOVA. one approach is to pick one time interval. we might still look for a Time × Group interaction. where we had an almost signiﬁcant interaction before.3 304. the main effect of Group is now signiﬁcant at the ..56 4.8 17.79 <. the data from the follow-up times really look like a main effect of treatment.1 22. First.

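If you want to reproduce the flavour of Table 17–5 yourself, here is one way to set up the two competing analyses with a long-format data set. We use a mixed model as a rough stand-in for the repeated-measures ANOVA (a substitution of convenience, not what was run above), and Python's statsmodels rather than SPSS; the data are simulated and the column names are our own.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(175)
rows = []
for subj in range(20):
    grp = int(subj >= 10)                      # 10 treated, 10 controls
    base = rng.normal(25, 5)
    for month in [0, 1, 3, 6, 12]:
        effect = 7 * grp if month > 0 else 0   # the drug acts quickly, then holds steady
        rows.append({"subject": subj, "group": grp, "month": month,
                     "score": base - effect + rng.normal(0, 2)})
long = pd.DataFrame(rows)

# "ANOVA-style": the baseline is just another repeated measurement
anova_like = smf.mixedlm("score ~ C(month) * C(group)", long,
                         groups=long["subject"]).fit()

# "ANCOVA-style": the month-0 score becomes a covariate for the follow-ups
pretest = long.loc[long.month == 0, ["subject", "score"]].rename(columns={"score": "pretest"})
followups = long[long.month > 0].merge(pretest, on="subject")
ancova_like = smf.mixedlm("score ~ C(month) * C(group) + pretest", followups,
                          groups=followups["subject"]).fit()
print(anova_like.summary())
print(ancova_like.summary())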
In the situation where there are multiple followup observations. no. we have to explicitly account for Time. Finally. A more common scenario is one in which the treatment effect is slow to build and then has a gradually diminishing effect.1 22. 6.5 17.8 6. with pretest as the covariate and the later observations as repeated dependent measures.5 27.1 16 17 25 21 32 22 27 13 25 21 21. as usual. but demands equal spacing on the X-axis. all this happens almost at the push of a button. and 12 months. will do an orthogonal decomposition.7 4. where we have added some constants to the post-test observations to make the relationship over time somewhat more complex. the ﬁrst six lines of this horrendous mess should look familiar. since the straight ANOVA treats each X value as just nominal data—the results are the same regardless of the order in which we put the columns of data.5 7. we might even expect some interactions. In effect.MEASURING CHANGE 183 of different differences between Treatment and Control groups at different times. When this button is pushed. The numbers are different. Second. because we cooked the data some to yield a more complex relationship to time.9 7.5 8. By contrast. the control group data are still a straight horizontal line.3 6. and 12 months. It is apparent that the treatment group shows continual improvement over time but with a law of diminishing returns. SPSS. One example of such a relationship is shown in the data of Table 17–6. Believe it or not.5 15. and appropriately captures this in the main effect of treatment. just rumbles along.5 20. We have somehow to tell the analysis that it’s dealing with data at 0. Post-test Subject Group Pretest 0 mo 1 mo 3 mo 6 mo 12 mo TABLE 17–6 Data from an RCT of RRP versus placebo with post-tests at 1.2 29 33 35 24 23 16 14 28 20 21 24. the differences between treatments are now smeared out over the main effect and the interaction. of course. 3.5 16 17 20 7 27 19 25 9 20 18 17.1 7. TIME-DEPENDENT OBSERVATIONS As we indicated.3 15.5 16.8 31 34 34 24 24 15 13 29 19 22 24. we have to go one further step into the analysis and also generate some new data. and this effect happens only when you contrast the pretest values with the post-test values (which was ﬁne when there was only one post-test value). the ANCOVA gives the pretest means special status and does not try to incorporate them into an overall test of treatment. indicating that the treated group has linear and quadratic terms but that the control group doesn’t. Is this only the case when the treatment effect is relatively constant over time? As it turns out. 18In some software. the right analysis is therefore an ANCOVA. Now. In the next nine lines things get more interesting. as far as we can tell. Instead. and the control group. However. But to see this.9 6. BMDP. By asking for an orthogonal decomposition. How do we put all this into the pot? First.7 6. quadratic. it simply focuses on the relatively constant difference between treatment and control groups over the four post-test times. the computer decomposes the Sum of Squares owing to Time and 1 2 3 4 5 6 7 8 9 10 MEAN SD 1 1 1 1 1 1 1 1 1 1 22 24 32 24 35 27 34 15 29 25 26. does it in both the 2V and 5V subroutines. and higher-order terms (one less term in the power series than the number of time points).4 5.5 15. 30 25 Controls Joint count 20 RRP 15 10 0 2 4 6 Month 8 10 12 FIGURE 17–5 Joint count for the study of RRP (from Table 17–4). 
the situation in which the treatment acts almost instantly and does not change over time is likely as rare as hen’s teeth. old standby. we told the computer to pay more attention to the time axis and ﬁt the data over time to a power series 17As if there’s something simplistic about quantum mechanics. 3.18 It’s called orthogonal decomposition. They’re completely analogous to the sources of variance we found before when we did an ANCOVA on the pretest and post-test scores. 6. Here we see clearly that the relation between Time and JC in the treatment group is kind of nonlinear—the sort of thing that might require a (Time)2 term as well as a linear term in Time. The results are shown in Table 17–7. 1. The . neither of which are appropriate tests of the observed data.3 to the Time × Group interactions into linear.5 9 17 17 18 22 12 17 9 14 19 17. and linear and nonlinear changes over time MULTIPLE.9 9. it is apparent that we have to build in some kind of power series (remember Chapter 16?) in order to capture the curvilinear change over time.8 24.5 7.5 22.5 19. This might be more obvious in the graph of the data shown in Figure 17–6.3 27 30 32 26 19 14 14 31 24 24 28 35 28 25 19 16 11 29 17 21 11 12 13 14 15 16 17 18 19 20 MEAN SD 2 2 2 2 2 2 2 2 2 2 32 33 42 27 22 18 16 32 25 24 27.

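One way to chase the curvature in Table 17–6 without any special software is simply to put polynomial terms for time, and their interactions with group, into the model. This is the same idea as the orthogonal decomposition, although it uses raw rather than orthogonal polynomials, so the individual sums of squares will not match the SPSS or BMDP output; the data and names below are invented for illustration.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(176)
rows = []
for subj in range(20):
    grp = int(subj >= 10)                              # 1 = RRP, 0 = control
    base = rng.normal(25, 4)
    for month in [1, 3, 6, 12]:
        effect = grp * 8 * (1 - np.exp(-month / 3))    # fast early gain, then diminishing returns
        rows.append({"subject": subj, "group": grp, "month": month,
                     "score": base - effect + rng.normal(0, 2)})
long = pd.DataFrame(rows)

model = smf.mixedlm("score ~ (month + I(month**2)) * C(group)", long,
                    groups=long["subject"]).fit()
print(model.summary())   # the month-by-group and month-squared-by-group terms carry the story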
but for that you’ll have to wait until the next chapter.5 8. The bane of all statistical tests is measurement error.9 + 15.2 1 1 17 3 3 54 1 1 18 599. before and after 2 weeks of using Viagro. p < . called hierarchical linear modeling.9 15.5) just equals 167.62 1.05 for both ANOVA and ANCOVA b. Being the compulsive sort you are and desperate for something—anything—to prevent baldness.1.5 55. p = .41.04 2.001 <.9 2266. the counts are still in the millions and. there is a quadratic component interacting with Group (F = 5.5 70. df = 1/18. It’s the same idea that we encountered when we did orthogonal planned comparisons as an adjunct to the one-way ANOVA—decomposing the Total Sum of Squares into a series of contrasts that all sum back to the original.9 323.0 5.063.1 16. p = . persuading a bunch of patients to come back faithfully at exactly the appointed intervals. is as hard as Hades. Suppose you did a study looking at the ability of new Viagro to regenerate the hair on male scalps. What emerges is an overall linear term (F = 17. p = . almost magical technique. If you proceeded to use more advanced tests.39 . what might be the result? a.9 139.0 15.25 .001) showing that there is an overall trend downward. df = 1/18.12 . although the multiple observations over time is easy enough to come by.001 . or cubic. are preferred and yield optimal and unbiased results. p = .10 .03 11.5 8. This problem can be solved with another. and cubic terms that are orthogonal—they sum to the original.06 for ANOVA. you count every single hair on their heads. although they may yield results that are approximately correct. That is. and cubic effects of Time (139.6 44. 15 RRP 10 0 2 4 6 Month 8 10 12 regression.03).06 for both ANOVA and ANCOVA 2. the difference is not quite signiﬁcant. p = . df = 1/18. so.29 4. are highly reproducible from beginning to end.01 . quadratic. We have decomposed the effects related to Time into linear.9 19. the main effect of Time.9 Pretest (covariate) 2266. ANCOVA methods.184 30 REGRESSION AND CORRELATION 25 Controls 20 FIGURE 17–6 Joint count for the study of RRP (from Table 17–6).2 106.2 147. which signiﬁes that the slopes of the two lines differ. p < . Note that if we add up all these components. taking both lines into account. quadratic. p < . A common practice in analyzing clinical trials is to measure patients at baseline and at follow-up visits at regular intervals until the declared end of the trial.2 5. Further down. All this is quite neat (at least we think so). The simplest and most commonly used methods—difference scores and repeated-measures ANOVA—are less than optimal. using the pretest or baseline measure as a covariate.1 Subject: Group 279. Regrettably. particularly repeated-measures ANOVA and ANCOVA.012 17.5 3.001 9. the sum of the linear. quadratic.92 3.0 139.001 . with a paired t-test. Group 599.01.5 137. The interactions and error terms also sum to the Total Sums of Squares for the respective terms. so we can test whether the relationship is linear.17 36. Regrettably. You do a before/after study with a sample of 12 guys with thinning hair. an observation from the graph). we get the three lines above that express the Time main effect and the Time × Group interaction. Joint count WRAPPING UP We have explored a number of approaches to analyzing change. thereby missing the loving grandchild’s birthday party or the free trip to Las Vegas. and so on. rather.7 1 1 18 1 1 18 11.0 6.7 24. 
and all it requires is multiple observations over time and no missing data.1 71.00 <.06 5.9 19. TABLE 17–7 ANCOVA of follow-up scores with orthogonal decomposition for the RCT of RRP Source Sum of squares df Mean square F p EXERCISES 1.41 <.6 44.7 Time Time Group Error Linear (T) Linear Group Error Quadratic (T2) Quadratic Group Error Cubic (T3) Cubic Group Error 167. There is also an interaction with Group (F = 5.7 <.04) showing that the line for the treated group has some curvature to it (this is not explicit in the interaction but.05 for ANCOVA c. Although they are thinning.6 + 11.2 8.0.

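For Exercise 1, the natural analysis of a single-group before-and-after study is a paired t-test, and a minute with made-up hair counts shows why an enormously reliable measure turns even a trivial change into a tiny p value:

import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
before = rng.integers(2_000_000, 3_000_000, size=12)   # hair counts for 12 men
after = before + rng.integers(200, 800, size=12)       # tiny but consistent gains
print(stats.ttest_rel(before, after))                  # minuscule p despite a change of about 0.02%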
The investigators report that there was a signiﬁcant drop in psychotic symptoms in the treatment group (paired t = –2. the analysis is then conducted on the baseline and end-of-trial measures. If the analysis was repeated. Refer to the relevant chapters for advice if you need a refresher. repeatedmeasures ANOVA. 9. We have shown you how to do paired t-tests. Repeated-measures ANOVA on the scores at 0 and 12 months iii. but the symptoms in the control group actually increased slightly (paired t = +0. the statistical methods have been encountered before. and 12 months with time 0 as covariate How to Get the Computer to Do the Work for You For most of this chapter. Imagine a trial of a new antipsychotic drug.51. 6. involving measures of psychotic symptoms at baseline.MEASURING CHANGE 185 Frequently.05). 9.) a. n. . p < .ANCOVA on the scores at 3. and ANCOVA. and 12 months (the declared end of the trial).46. 6. Loonix. 3.s. Unpaired t-test on the difference scores from 0 to 12 months ii. Is this analysis right or wrong? b. which of the following would be most appropriate? And what would be the likely result? i.

It’s very easy to write in a grant application that patients will be seen every three months. with more occurring in the placebo group than the RRP group. If we did so. so their intermediate data points are missing. X would have the same value. and 12 months. HIERARCHICAL LINEAR MODELING What’s in a Name? A relatively recent technique that has been developed to deal with these issues is called hierarchical lin- 186 . some have forgotten their appointments and have had to be seen at different times. or the active drug may have troubling side effects that the person doesn’t feel balance the gains from the treatment. and that they can be avoided by careful planning and management of a study. subjects rarely think a study is as important as the investigators think it is. To make matters even worse. that the observations have to be independent from one another. for that matter) can all of these threats to the validity of the study be accounted for? W 1Difficult as it is to believe. then all patients seen by Dr. How on earth (or on Mars. although the patients are scheduled to be assessed at 1. inability to take time off work. It’s also quite common for there to be differential dropouts from various treatment groups. If the study focuses on the effects of an educational intervention. such as an ill child at home. patients have dropped out of the study completely. rates of morbidity and mortality differ from one sawbones to the next. For example. then the same situation exists—schools differ from each other. The problem is how to take these “higher level” effects into account. so the subject goes looking elsewhere for relief. and we know that hospitals that see more patients usually have better outcomes than those that treat a smaller number of patients annually. they are probably more the rule in real life than the exceptions. a surgical trial could enroll patients from a number of hospitals.CHAPTER THE EIGHTEENTH Techniques such as ANOVA and regression have difficulty handling missing data. and it’s possible that characteristics of the docs may affect the patients’ responses. We could have one variable indicating the number of cases seen annually by the hospital. and another variable showing the number of cases seen by the physician. e wish we could say that the problems outlined in Setting the Scene are rare anomalies. and (as we all know from often bitter experience) teachers within a school range from those whom we worship to this day to those whose names are used to threaten little children. This violates one of the major assumptions of ANOVA and regression types of statistics. Even within a single hospital. 6. Analysis of Longitudinal Data: Hierarchical Linear Modeling SETTING THE SCENE In the course of running the study comparing Robert’s Rectal Pills against placebo. differential dropouts between groups. Hierarchical linear modeling is designed to deal with just these types of situation.g. but much more difficult to actually pull this off.. but we can’t just tack these on to the records of each patient. and situations where the effect of one factor (e. The study subjects may have competing demands on their time. Finally. some patients in each group are seen by one physician and some by another. as would all patients seen in Hospital B. in larger studies. or perhaps missed entirely. Second. our fearless investigator has encountered a number of problems. owing to the incompetence of the investigative team. even more troublesome. 3. However. subjects may be recruited from various places. 
experience of the physician) affects a number of people in the group. This could be because the placebo or comparison condition isn’t having any effect. Third. Compounding this problem. the dropouts are unevenly distributed across groups. and this may have an effect on the outcome.1 so that appointments have to be rescheduled. First. or even going on a vacation. some patients have been unable to make the appointments at all.

they are 5No. With ANOVA. and (3) some metric for measuring time. it doesn’t matter when those times are. there are restrictions that hamper us in other ways. latent trajectory modeling. Because. A metric for measuring time. and 7 following an intervention. 1. . However. but would be a very shallow slope if the interval were years rather than weeks. the penalty for relaxing the restriction regarding when the data are collected is that we have to let the computer know. There are many possibilities. if it seems as if most of the subjects’ data follow a different type of curve.. HLM will easily accommodate this. But. In Figure 18–1. What Do We Need? When using HLM to look at change over time. as a starting point. and the third its tail—they’re accurate. and who knows what else. cubic terms. or even drop out of the study. If the outcome doesn’t change. The reason is that if a person’s scores on some instrument were 5. a simple linear relationship will suffice. This is similar to multiple regression. we’ll focus primarily on one aspect of HLM. the way HLM operates is to ﬁt a line that best approximates each individual’s change over time. MANOVA relaxes this assumption.g. it frees us up considerably.g. 3. it’s a good idea to try ﬁtting a straight line for each subject. membership in one group or another. but if we didn’t.g. To avoid confusion. we have thoughtfully provided you with references in the “To Read Further” section of this book. another its leg.3 2. but we don’t want the majority of curves to look too deviant.2 random effects regression. but as with ANOVA. because with the more powerful HLM programs. and another person at weeks 2.. but it’s like the blind men trying to describe an elephant. we just invented it.ANALYSIS OF LONGITUDINAL DATA: HIERARCHICAL LINEAR MODELING 187 ear modeling (HLM). don’t go running to look it up. she was much less restrictive. we’re going to use the term HLM throughout this chapter. 6. there ain’t nothin’ to model. such as Time2. So. a person can be late for the follow-up visit. the analysis of longitudinal data. 3This 4Although STEPPING THROUGH THE ANALYSIS Step 1. but it other ways. This seems so obvious that we hesitate to mention it. and we use that very powerful test for linearity. for those who want more details about HLM. we’ve plotted the improvement score for the ﬁrst six subjects in the RRP group. and 15. In this chapter. the best place to start (and often to stop) is with a straight line. The next thing to look at is the pattern of the correlations over time. it requires complete data for each person. you can’t test one person on weeks 1. except in our heads. we can be sure that someone will use HLM with gender or hair color as the dependent variable (DV). 10. there are three requirements: (1) at least three data points per person. 4. 2Which is what we called it in the second edition of this book. MANOVA. Needless to say. the eyeball check. the time points should be relatively evenly spaced. and 8 (much less 2. both of these may in fact change. 3. We’ll also dip our toes into using HLM to account for clustering of subjects. and even try to ﬁt the data with Shmedlap’s inverse hypergeonormal function. but only up to a point.5 but for the vast majority of cases. Because. not every person’s data will be well ﬁtted. An outcome that changes over time. Also. and what the determinants of the rate of change may be (e. repeatedmeasures ANOVA. we don’t have to keep track of when each individual is tested. and the like). demographic variables). miss a visit. 
mixed effects regression. there are three that are most common. 5. Many of these terms are the same. Violating this plays havoc with the assumption of equal correlations across time. getting or not getting an intervention. Note that Subject 5 missed the third follow-up visit. With all other techniques that measure change over time (e. But. when one is holding its trunk. there must be at least three data points per person. because each is describing just one aspect of the beast. but we still have enough data to include her in the analysis. Because with repeated-measures ANOVA or MANOVA we assume that everyone is tested at the same time. the slope is one of the main things we’re interested in. its usefulness in measuring change (trajectory or growth modeling). The reasons are twofold: it is probably the most widely used application. or using a model with a higher order time effect. such as every week or every three months. although we’ve also patented it. empirical Bayes models. 6. multi-level models.4 The purpose of HLM is to determine whether different people change over time. they are accurate but incomplete in that they’re highlighting only one aspect of the technique—the statistical method (e. and to go beyond it would require a book in its own right. The problem is that it’s also called by about half a dozen or so other names: growth curve analysis. and as long as there are at least three data points. we should think of either transforming the data so they become linear. But. but be aware that other people may be describing exactly the same technique. Examine the Data Although HLM is able to ﬁt almost any shape of line to the data. So. and 12). as we’ll see. in the case of hair color. but use different terms to say what they’re doing. sometimes twice a week or more. we’ll have to specify what that pattern is.. as we’ll see. (2) an outcome that changes over time. is like the difference in the rules at home versus at grandma’s house. even though we don’t have the foggiest idea what it looks like. we acknowledge that in these permissive times. ranging from the most to the least restrictive. This may seem somewhat more restrictive than the two points needed for simple gain scores. With HLM. or its ability to deal with variables at different levels (hierarchical). All of these terms make sense. we’d better get it right. as long as there are at least three points. Three data points. this would result in a very steep slope if the data points were only one week apart from one another. random or mixed effects regression). both of these approaches assume that all of the subjects are tested at equivalent times. we can throw in quadratic terms. and would even give us candy before dinner. latent growth modeling.

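The eyeball check in Step 1 amounts to fitting an ordinary least-squares line to each person's data and asking whether a straight line is a sensible summary. A small sketch (simulated data, six subjects and four visits, names our own):

import numpy as np
import pandas as pd

rng = np.random.default_rng(181)
rows = []
for subj in range(1, 7):
    intercept, slope = rng.normal(2.0, 0.5), rng.normal(1.0, 0.3)
    for visit in [1, 2, 3, 4]:
        rows.append({"subject": subj, "visit": visit,
                     "score": intercept + slope * visit + rng.normal(0, 0.4)})
long = pd.DataFrame(rows)

def straight_line(d):
    slope, intercept = np.polyfit(d["visit"], d["score"], deg=1)
    return pd.Series({"intercept": intercept, "slope": slope})

print(long.groupby("subject").apply(straight_line))   # one fitted line per person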
We hope that answers your question. though. The term mixed effects comes into play at the next stage. which is why it has been replaced by MANOVA. and that between Times 1 and 4 it is lower still. which we discussed in the chapter on multiple regression. In ordinary least squares regression (and all of its variants. since the intercept would reﬂect the person’s status when the treatment ended. so that we get two variables for each person. and this translates into more power. Didn’t we just violate one of the assumptions of linear regression in this step? Having the predictor variables consist of the same measurement over time for an individual most deﬁnitely violates the requirement that the errors are independent from one Subject 14 8 6 Score 4 2 0 FIGURE 18–2 Fitting a straight line for each person. i: a slope (π0i) and an intercept (π1i).7 A more realistic picture of what really happens over time is captured by the autoregressive model. for example. one major advantage of centering is to make the intercept more interpretable. Step 2.188 Subject 1 8 6 REGRESSION AND CORRELATION Subject 2 8 6 8 6 Subject 3 Score Score 4 2 0 4 2 0 Score 2 3 4 4 2 0 1 2 Visit 3 4 1 Visit 1 2 Visit 3 4 Subject 4 8 6 8 6 Subject 5 8 6 Subject 6 FIGURE 18–1 Checking that subjects’ data are more or less linear. And why do they? Just because.890 . random effects regression and mixed effects regression. such as logistic regression). If the time variable. So. but compound symmetry also requires them to be equivalent to the correlation between Time 1 and Time 4. in which the longer the interval. the variances are equal across time and all the covariances are equal. is the assumption underlying repeated-measures ANOVA. If there doesn’t seem to be any pattern present. we would use an unstructured correlation matrix. With compound symmetry. or in Grade 0. the unstructured pattern is the easiest to specify. they’re ﬁxed for all subjects. A regression line is ﬁtted to the data for each subject. then without centering. we would ﬁnd that the correlation between Times 1 and 3 is lower than between Times 1 and 2. but you pay a price for this. Fit Individual Regression Lines The next step is up to the computer. In HLM. it’s a good idea to consider whether you should center the data. may well ask why we’re using πs for the slope and intercept. At this point. when these random effects are themselves entered into yet another regression equation. we assume that the intercept and slopes that emerge from the equation apply to all individuals in the sample. admittedly not overly useful information.14 = 0. whereas we used βs when discussing multiple regression.800 b1. the time variable is the number of the followup visit. and unstructured. This isn’t too improbable. autoregressive. As we’re sure you remember. Guess it didn’t sound intimidating enough. that is. So. The advantage of more restrictive models is that they require fewer parameters. the correlation between the data at Time 1 and Time 2 is the same as between Time 3 and Time 4. so that they’re random effects. so it may not be worthwhile to center.8 An example for one subject is shown in Figure 18–2. In the example of the trial of the usefulness of RRP.14 = 0. is the person’s age or year in school. 7This 8You compound symmetric. It is because of this that the technique has two of its other names. where the terms—the average slope and intercept for the group—are ﬁxed. which doesn’t have this restriction. 
Score Score 4 2 0 4 2 0 Score 1 2 3 4 4 2 0 1 2 Visit 3 4 Visit 1 2 Visit 3 4 6We don’t know why they couldn’t simply have called this constant. the lower the correlation. each person has his or her unique intercept and slope. 1 2 Visit 3 4 b0. the intercept tells you the value of the DV when the person is 0 years old. The reason is because many other authors do.6 in simpler terms.

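Centering is nothing more exotic than a subtraction, but it changes what the intercept means. In the little example below (invented ages and scores), each person's age is re-expressed as years since his or her first visit, so the intercept refers to the start of follow-up rather than to a newborn:

import pandas as pd

long = pd.DataFrame({"subject": [1, 1, 1, 2, 2, 2],
                     "age":     [40, 41, 43, 55, 56, 58],
                     "score":   [5.0, 5.5, 6.5, 3.0, 3.2, 3.9]})
long["years_in_study"] = long["age"] - long.groupby("subject")["age"].transform("min")
print(long)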
one for each group. Fit the Level 1 Model At this point. For now. is anything going on? We do this by ﬁtting the simplest of models: π0i = β0 + u0i (18–3) 9Notice Step 3. In this case. it’s worthwhile to see if there’s any variance to model in the next steps. Shakespeare. Before continuing further with the analyses. the value of the variable that keeps track of time or occasion (Timeij). we can throw in another term. Step 4. For the intercept parameters. Part of what contributes to ε is indeed error. If. increasing ever faster as time went on). Equation 18–4 is the same thing for the slope (β1) and the residual difference between each person’s slope and the mean (u1i). Indeed. such as: 2 Yij = π0i + π1i Timeij + π2i Timeij + εij (18–2) but we’d need at least four data points per person. it’s a good idea to stop here and see how well the linear model ﬁts the data for each person. Fit the Level 2 Model Before we go any further. you can examine the residuals and perhaps even the value of R2 at an individual subject level. though. and if their slopes are the same (whether or not they’re changing at the same rate over time). with an intercept and a slope for each. Well. What we are interested in is whether the two groups differ over time: if their intercepts are the same (in this example. and we aren’t doing any signiﬁcance testing. in other ways. 2004). So why did we do it? Because the bible tells us we can. and ζ0i is the residual (ζ is the Greek “zeta”). but it should be true for most. this won’t hold for every person. in that a better description would be “unexplained variance. β0 is the mean intercept for the group (the ﬁxed effect). whether they are starting out at the same place after treatment).” With ANOVA. We said that ε is the error term. If the models ﬁt. we’ll omit the Time2 term. Similarly. The form of the model is: Yij = π0i + π1i Time1j + εij (18–1) π1i = β1 + u1i (18–4) This means that person i’s score at time j is a function of his or her intercept (π0i). where γ00 is the average intercept for the control subjects (γ is the Greek letter “gamma”). owing to all the frailties that human ﬂesh (and its measurement tools) is heir to (W. and then do the same for the slope. γ01 is the difference in the intercept between the control and RRP groups. we have two equations to get those four parameters. That means that we have to calculate two regression lines. but that the values of π0 and π1 vary from one person to the next. We’ve tossed all of the measurement error into the ε term. That means that these estimates are measured separately from the measurement error and they’re not affected by attenuation caused by such errors (Llabre et al. we’re ready to derive the model at the lowest level of the hierarchy. This would mean that the values for R2 should be moderate to large. hence the term random effects regression. and u0i is the difference for each person between his or her intercept and the mean for the group (the random effect). With some programs. Hence. we can take all those parameter estimates we calculated at the individual person level and ﬁgure out what’s going on at the group level.” This is an accurate reﬂection of the state of the art. The rationale is that violating the assumption of independence of the predictors plays absolute havoc with the estimates of the standard errors. as we’ll see later. so it’s fruitless to try to determine why different people change at different rates. We can even get a little bit fancy here. it doesn’t matter. 
then we might as well pack up our bags and head for home. . personal communication). we may be able to reduce ε by including other factors that may account for variation between people or within people over time (Singer and Willett. Again.” “moderate to large. and the ubiquitous error (εij). The way it’s done is to determine the intercept for the control group. There’s one very helpful implication of Equation 18–1. when we were eyeballing the data.” and “for most. slope (π1i). we try to reduce error by introducing other factors that can account for that variance. Because we’re interested only in getting the parameters—the slope and intercept—for each person. We assume that everybody has the same form of the equation. we thought that they would be better described by a quadratic equation (that is. but we’re using the estimates of the π terms when we move up to the next highest level. we have: π0i = γ00 + γ01 Groupi + ζ0i (18–5) our use of wishy-washy language: “relatively small. the bible is an excellent book on HLM by Singer and Willett (2003). but it doesn’t affect the estimates of the parameters themselves. 2003). that’s not entirely true. If our prayers are answered and the models don’t ﬁt the data well. But.ANALYSIS OF LONGITUDINAL DATA: HIERARCHICAL LINEAR MODELING 189 another. because there’s nothing going on—all of the subjects have the same slope and intercept. we actually hope that there is unexplained variance at this In Equation 18–3. and how much the treatment group’s intercept differs from it. the individual.9 stage of the game that can be reduced when we include things like group membership into the mix. The residual is simply the difference between the actual and predicted values of the DV. it resembles the Error terms in ANOVA. Groupi indicates group membership. that is. and we’d like them to be relatively small. in HLM.

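In practice you do not fit the Level 1 and Level 2 equations separately; the program estimates them in one go as a mixed-effects regression. Here is what that looks like in Python's statsmodels, one of several packages that will do it; the simulated data, the column names, and the parameter values we built into the simulation are all our own and are not the RRP results.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(18)
rows = []
for subj in range(40):
    grp = int(subj >= 20)                              # 0 = control, 1 = RRP
    u0, u1 = rng.normal(0, 0.6), rng.normal(0, 0.2)    # person-level residuals
    for time in [0, 1, 2, 3, 4]:
        score = (0.7 + 1.2 * grp + u0) + (0.4 + 0.5 * grp + u1) * time + rng.normal(0, 0.5)
        rows.append({"subject": subj, "group": grp, "time": time, "score": score})
long = pd.DataFrame(rows)

model = smf.mixedlm("score ~ time * group",        # fixed part: gamma_00, 01, 10, 11
                    data=long,
                    groups=long["subject"],        # Level 2 units (the people)
                    re_formula="~time")            # random intercept and slope per person
result = model.fit()
print(result.summary())   # fixed effects akin to Table 18-1; variance components akin to Table 18-2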
This means that we may want to go back and look for time varying predictors at the level of the individual.732 + 1.013 10. reﬂect the fact that the slopes and intercepts for individuals vary around these group estimates. in this example.481 (18–7) and for the RRP group.190 REGRESSION AND CORRELATION TABLE 18–1 Output of the ﬁxed effects from HLM Intercept. γ11 is the difference in the slope between the groups.208 0.421 0.) To be more exact.e. and the second line says that the intercept for the RRP group is (0. In Table 18–2. it’s: 0. those SEs are asymptotic estimates (i. such as the person’s compliance with taking the meds or whether the person is using other.732. and one for the RRP group. What we are interested in is not the actual values of those ζs but rather their variances. as in Table 18–1. Other possible reasons may be that people who start off with higher scores may be near the ceiling of the test or closer to a physiological limit.234 0. Whatever the reason. the major limit with regard to how many levels we can have in HLM is not mathematical. before going any further. At . they state that other texts and computer programs use different Greek letters and other subscripts. let’s see what all those funny looking squiggles mean. which is the covariance between ζ0i and ζ1i. there is a signiﬁcant change for the control group over time (γ10). because they are conditional on the predictor(s) already in the equation. and that the RPP group changed at a signiﬁcantly different rate (γ11). we’re following the convention of Singer and Willett (2003). For reasons that we’ll explain at the end of this chapter. but the fact that we’re quickly running out of Greek letters to use. they’re all statistically signiﬁcant. but to make your life more interesting.732 + Time × 0. we see that there’s still unexplained within-person variance at Level 1 (σe2). Those two residual terms. because this covariance may be due to regression toward the mean—the greater the deviation from the mean. over-the-counter drugs. σ012 lets us measure its effects. So let’s take a look at those variance components and see what they tell us. These are factors that could change from one follow-up visit to the next.10 So. and ζ1i is the residual.421) = 0. The results are shown in Figure 18–3.940 + Time × 0. the third line gives us the intercept for the slope. σ02 and σ12 are conditional residual variances. There’s also a third term. But realize that we’re going to end up with two regression lines.. ζ0i and ζ1i. 11Which where γ10 is the average slope for the control subjects. is there because it’s possible that the rate of change depends on the person’s starting level. and the fourth tells us that the slope for the RRP group is (0. it’s: 1.215 0. so some programs label them as ASE rather than SE.481 0. meaning that the control group began above 011 (γ00).481 + 0.732 1. π1i Intercept Group Intercept Group Parameter Estimate SE z γ00 γ01 γ10 γ11 0. The labels at ﬁrst seem a bit off. So. 6 5 4 Score 3 2 1 0 0 1 2 Visit 3 4 RRP Control FIGURE 18–3 Plotting the ﬁxed effects. We examined one aspect of this phenomenon in the previous chapter.268 For the slope parameters. one for the RRP group and one for the control group. they are the estimates divided by their standard errors. the equation is: π1i = γ10 + γ11 Groupi + ζ1i (18–6) 10As you can see. The covariance term. σ012. The amount of variance of ζ0i is denoted by σ02. For the control group. they tell how much variance is left over after accounting for the effect of Group. 
or that those with higher scores have a head start on those with lower scores.902.208) = 1. by looking at a typical output. The ﬁrst line tells us that the intercept for the control group is 0. usually reﬂects the fact simply that the people are alive and breathing. what we’ve done is derive one regression line for the control group.162 2.902 (18–8) Let’s plot those and see what’s going on.481. so they have less room to improve. In this case. The third line indicates that the slope for the control group is 0. π01 Slope. and that for ζ1i as σ12. again.405 5. to summarize. The z values in Table 18–1 are ﬁgured out the same way all z values are. the RRP group began at a signiﬁcantly higher level (γ01). they get more accurate as the sample size increases).239 0. they reﬂect error or unexplained variance. (In this.940. σ012.041 3. the more regression.

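Turning the fixed effects into the two lines of Figure 18–3 is simple arithmetic: add the group terms to the control group's intercept and slope. A sketch, using round numbers in the spirit of Table 18–1 rather than the exact estimates:

import numpy as np
import matplotlib.pyplot as plt

g00, g01 = 0.73, 1.21     # control intercept; how much higher the RRP intercept is
g10, g11 = 0.42, 0.48     # control slope; how much steeper the RRP slope is

visit = np.arange(0, 5)
control = g00 + g10 * visit
rrp = (g00 + g01) + (g10 + g11) * visit

plt.plot(visit, control, label="Control")
plt.plot(visit, rrp, label="RRP")
plt.xlabel("Visit")
plt.ylabel("Score")
plt.legend()
plt.show()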
893 –15.219 –3. The second part of the equation. at each measurement over time (ζ1i). if more people drop out of the study from one group rather than the other.745 20. and what’s still unexplained (εij). and (c) model how much they would have changed had they remained in the study. distant past (say 10 years ago).12 Some Advantages of HLM We already mentioned a few of the major advantages of analyzing change this way rather than with repeated-measures ANOVA or MANOVA—the ability to handle missing data points (as long as at least three times are left).13 LOCF is conservative when it’s applied to the treatment group. and if the reasons are related to the intervention (e. such as sex or age. in Equations 18–5 and 18–6. we’ve presented the two levels of the analysis—the individuals and the groups—as two sets of regression equations. both time-varying ones as well as timeinvariant ones. 13We This looks much more formidable than Equation 18–9. though. If we’re lucky. then any estimate of the relative change between the groups would be biased by this confounding. That is. but differential dropouts from the two groups. and the elimination of measurement error. HLM can (a) tell us if this is indeed happening. and the terms in the second set of parentheses do the same for the slope. reﬂects the various sources of error—around the intercept (ζ0i). with the γ symbols.851 1. and the interaction between the two. It consists of where the person started out (γ00). in that it assumes that no further change occurs. and using that value for all subsequent missing values.832 5. but in one way.. but there is. people drop out because the placebo is not having an effect (duh!). εij Intercept. which allows the groups to diverge over time. but it may be too liberal when applied to the comparison group. we mentioned that some studies enroll participants in clusters. and the method that was used in the far. there isn’t any residual variance for the intercept (σ02).g. under Imputing Data. ζ0i Slope. LOCF is a way of trying to minimize the loss of data by taking the last valid data point from a subject before he or she dropped out. we’re safe to consider it a timeinvariant effect. which is actually the way most HLM programs like them to be speciﬁed. One of the major threats to the validity of a study is not so much people dropping out (although that can affect the sample size). In Equation 18–1. That’s both an easy way to conceptualize what’s being done. ζ1i Covariance between ζ0i and ζ1i Parameter Estimate SE z σe2 σ0 σ12 σ012 2 15. but there is for the slope (σ12). Then. After being so blessed. If. we deﬁned Yij in terms of π0i and π1i. it works against rejecting the null hypothesis) than last observation carried forward (LOCF). 1974). what we get is: Yij = π0i + π1iTimeij + εij = (γ00 + γ01Groupi + ζ0i) + (γ10 + γ11Groupi + ζ1i) Timeij + εij (18–9) where the three terms inside the ﬁrst set of brackets give us the Level 1 intercept based on the Level 2 parameters. (b) retain the people in the analyses.295 0. discuss this in Chapter 27. HLM can account for this. even accounting for group membership. we deﬁned these two values of π in terms of different values of γs. it’s hard to imagine that there’s even more.ANALYSIS OF LONGITUDINAL DATA: HIERARCHICAL LINEAR MODELING 191 TABLE 18–2 Output of variance components from HLM Level 1 Level 2 Within person.024 8. We can go one further step by multiplying out the terms in the second set and rearranging the results. 
shows the inﬂuence of the measured variables on a person’s score at a given time. side effects or lack of effectiveness). of course. 12We Sidestep: Putting the Equations Together So far.252 14.e. the relaxation of the restriction about measuring at ﬁxed times for all subjects. let’s continue to follow Singer and Willett (2003) and see how we can combine the equations. family physicians were randomized to have or not have a nurse practitioner (NP) assigned to their practices.. and the outcome consisted of the prevalence of a number of “tracer . but it’s a more conservative assumption (i. then this should be reﬂected in a smaller slope for the dropouts than the remainers. the effects of Time and Group. Dealing with Clusters Earlier in this chapter. For example. and (b) the amount of change in age is relatively small over the course of the study compared with the whole of our lives.722 4. This naturally assumes that they would have continued to change at the same rate. Hence. This tells us that we should look for other factors that may explain the variance. giving us: Yij = [γ00 + γ10Timeij + γ01Groupi + γ11(Groupi × Timeij)] + [ζ0i + ζ1iTimeij + εij] (18–10) know all too well from personal experience that age actually does change—that’s why we’re using the photograph on the back cover we took for the ﬁrst edition—but (a) it’s the same change for everyone.003 5. As the name implies. as we mentioned at the beginning of the chapter. It also consists of the slopes and intercepts of the RRP and control groups. Using HLM with both groups is probably a much better way to proceed. it’s actually more informative. in the Burlington Randomized Trial of the Nurse Practitioner (Spitzer et al. with the ζs. But. The ﬁrst part of the formula.155 Level 2. for example.839 17. Putting it all together.

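For comparison, last observation carried forward is a one-liner in most packages, and the little example below (invented scores) shows how blunt it is: each dropout's record simply goes flat from the last visit attended.

import numpy as np
import pandas as pd

wide = pd.DataFrame(
    {"m0": [5.0, 6.0, 7.0], "m1": [4.5, 5.5, 6.0],
     "m3": [4.0, 5.0, np.nan], "m6": [3.5, np.nan, np.nan], "m12": [np.nan] * 3},
    index=["subj_1", "subj_2", "subj_3"])

locf = wide.ffill(axis=1)   # carry each person's last valid score forward
print(locf)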
The stretch from the previous example to this one isn’t too far. let’s restrict ourselves to one measurement of blood pressure. Perhaps the most readable is the one by Singer and Willett (2003). we expect that people within the same family share not only the same house. we can combine these by having time nested within the individual. and 100. Singer and Willett (2003) say there are too many aspects of the data and the actual ML method used to be more precise. which means that mom. HLM didn’t exist back then. the clustering is within groups of people. but also the same food. ever deﬁning what “small” is. the “philosophy” of treatment of any doc or NP—that is.16 16Without. With RRP. some wags have suggested that this was one of the largest trials with N = 2. here. or how to treat otitis media—would apply to all of his or her patients. comparable to Equation 18–1.” Those estimates are a bit narrower. their program. π0j is the average blood pressure from family j. and that they are normally distributed. But. where Yij is the blood pressure of person i in family j. called HLM (Byrk. Indeed. in fact we (or rather. but it’s fairly rough sledding to get through.15 Now. that is. Other books you can try are by Hox (2002) and Goldstein (1995). and that they inhale the same germ-laden air. as the sample size gets closer to inﬁnity. the family dog wasn’t included. they’re calculated by a procedure called maximum likelihood estimation (MLE). However. but not by much. Now. π1j is the slope for family j. Raudenbush. what’s the unit of analysis? We can look to see how many patients in each arm of the study had these different conditions. If there were more than one family practice per condition. So. conditions” among the patients. 15In fact. Similarly. including the fact that the standard errors are smaller than those derived from other methods. but these days. because many already exist. and says that 500 people are “adequate. In summary. pop. would be: Yij = π0j + π1jFamilyj + εij (18–11) where Groupj indicates whether the family is in the nurse practitioner or control group.000 may be too many. SUMMARY This chapter has barely scratched the surface of HLM. these happen asymptotically.192 REGRESSION AND CORRELATION 14As far as we know. For the most part. We could write an entire book on it. but we won’t. The standard reference is by Byrk and Raudenbush (1992). but that p levels and conﬁdence intervals derived from “small” samples should be regarded with caution. A warning. To keep us (and you) from going completely bonkers. it’s likely that people within a family are more similar to each other than they are to people in a different family. and the two kids14 can’t be treated as four separate observations. The Level 2 (family) model for the intercept is: π0j = γ00 + γ01Groupj + ζ0j (18–12) and for the slope is: π1j = γ10 + γ11Groupj + ζ1j (18–13) . So. 1996) is the standard by which others are measured. and all of the measures for the individual nested within the groups. we would have a third level. this would be an ideal situation in which to use it. though: don’t try HLM at home unless you have a knowledgeable person handy at the other end of the phone. But. and Congdon. and Familyj should be selfexplanatory. when to intervene for hypertension. but it’s limited to growth curves. though. the computer programs) don’t use least squares regression any more. SAMPLE SIZE Although we’ve presented HLM as a series of multiple regressions. similar attitudes toward health. though. 
it yields accurate estimates of the population parameters. and it would be this third level that would include the Group variable. so the patients or families within any one practice aren’t truly independent either. Does that give you enough guidance? Long (1997) recommends a minimum of 100 subjects for cross-sectional studies. and families within practices. the second level would substitute Practice for Group. Returning to our bible. we can think of the various times as being nested within an individual. patients are clustered (or nested) within families. So how many subjects do we need? Singer and Willet (2003) say that 10 is too few. the Level 1 (individual) model. MLE has many desirable properties.

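Here is a sketch of the family example in Equations 18–11 through 18–13, with people nested within families and the nurse practitioner assignment made at the family level. The data are simulated, the names are ours, and for simplicity only a random intercept for family is fitted.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1974)
rows = []
for fam in range(30):
    grp = int(fam >= 15)                       # 0 = control, 1 = nurse practitioner
    family_effect = rng.normal(0, 5)           # shared house, food, germs, and attitudes
    for _ in range(4):                         # mom, pop, and the two kids
        rows.append({"family": fam, "group": grp,
                     "bp": 130 - 4 * grp + family_effect + rng.normal(0, 6)})
df = pd.DataFrame(rows)

model = smf.mixedlm("bp ~ group", df, groups=df["family"]).fit()
print(model.summary())    # the "Group Var" line is the between-family variance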
HLM is perhaps the most complex of all procedures in SPSS because there are so many options to choose from. • From Analyze. A nice example is presented in Llabre et al. there are about 16 options. You really have to know your data. select the variable that identiﬁes the subjects and move it into the Subjects box • Select the variable that reﬂects the time and move it into Repeated • Now comes the ﬁrst of many difficult decisions. which is deciding what type of covariance structure to use. If you really can’t decide. use Unstructured . which we discuss in Chapter 20).ANALYSIS OF LONGITUDINAL DATA: HIERARCHICAL LINEAR MODELING 193 How to Get the Computer to Do the Work for You What follows is merely the barest of bones. button and click on the Covariance Structure hyperlink for fuller descriptions of each. choose Mixed Models → Linear • Because we’re almost always dealing with correlated effects. You can press the Help • Press Continue • Move the dependent variable into the Dependent Variable box • Move the independent variable(s) into the Factor(s) box • Move any covariate(s) into the Covariate(s) box • Click on the Fixed button and pick the variable(s) that reﬂect ﬁxed effects • Do the same with the random effects • Press Random button for the OK For those of you who are more comfortable with structural equation modeling (SEM. A good ﬁrst guess for data gathered at fairly regular intervals is either AR(1) or ARMA(1.1). and should look at the covariance matrix of the repeated measures. HLM can also be done using SEM programs. (2004).

CHAPTER THE NINETEENTH
Principal Components and Factor Analysis
Fooling Around with Factors

Factor analysis looks at the pattern of relationships among variables and tries to explain that pattern in terms of a smaller number of underlying hypothetical factors.

SETTING THE SCENE: You have been appointed Dean of Admissions at the Mesmer School of Health Care and Tonsorial Trades. Your contract stipulates that you will receive a bonus of $100,000 each year that the graduation rate exceeds 75%. Only after signing the contract do you find that the success rate for the last 5 years has averaged only 23.7%. You decide that the only way to increase this abysmal figure is to impose tighter admissions criteria, and you meet with the faculty to draw up a list of the desired attributes of successful students. They arrive at three: (1) the eyes of an eagle, (2) the hands of a woman, and (3) the soul of a Byzantine usurer. You devise a test battery for applicants, with five tests in each area, just to be sure you've covered the areas well. Unfortunately, the test battery takes 32.6 hours to administer, and you're still not sure that all of the tests in each area are tapping the right skills. Is there any way you can (1) make sure you're measuring these three areas and (2) eliminate tests that are either redundant or measuring something entirely different?

The technique we will cover in this chapter is called factor analysis (FA).1 It differs from techniques we discussed earlier in one important way: no distinction is made between independent and dependent variables, such as group membership or a set of dependent variables; all are treated equally, and the data come from one group of subjects. For this reason, some people2 have referred to FA as an "untargeted" statistical technique. That is, the goal of the technique is to examine the structure of the relationship among the variables, not to see how they relate to other variables.

There is a lot of confusion regarding the terminology of FA. Some people (erroneously) make it synonymous with a related technique called principal components analysis (PCA). As we'll see, the two are not identical, and you have to make sure that you're doing what you really want to do. In fact, if you want to run an FA in, for example, SPSS/PC, and don't change any of the default options, you'll actually be doing a PCA. But not to fear; all will be explained in the fullness of time.

So, let's get back to the Dean's dilemma. After searching the literature for appropriate tests to use, he comes up with the 15 listed in the box below, which he administers to the 200 applicants over a 3-day period. To jump ahead of the story a bit, our beleaguered Dean will use this procedure to: (1) explore the relationship among the variables; (2) see if the pattern of results can be explained by a smaller number of underlying constructs (sometimes called latent variables or factors); (3) test some hypotheses about the data3 (as usual, we wouldn't be asking these questions unless the answers were "yes"); and (4) reduce the number of variables to a more manageable size. This is referred to as the exploratory use of FA.

1 That's FA, not SFA, which means something else entirely ("Sweet F**k All"), although the latter term may describe the results of FA with bad data.
2 "Some people" means we forgot who, and we can't find the reference, but we think it was the late Alvan Feinstein.
3 There are better ways to test hypotheses using factor analysis, and we'll discuss them in the chapter on Structural Equation Modeling.

In other situations, he will use these techniques to confirm his hypothesis (although, as we said, it's better to use confirmatory factor analysis to do this, which we'll explain in Chapter 20).

The 15 tests chosen by the Dean of Admissions
Eyes: 1. Visual acuity; 2. Color blindness; 3. Nystagmus; 4. Attention to detail; 5. Preference for carrots
Hands: 6. Fine motor dexterity; 7. Gross motor dexterity; 8. Pond's softness test; 9. Hand tremor; 10. Ability to pick up checks
Soul: 11. Interest calculation; 12. Scrooge factor; 13. Dunning ability test; 14. Overcharging index; 15. Double billing

WHAT ARE "FACTORS?"

What he hopes to find is shown in Figure 19–1: three different attributes, labeled in the large circles on the left, and each tapped by five of the tests. [Figure 19–1. Three attributes (Eyes, Hands, and Soul), each measured by five tests.] That is, the Dean wants to know if applicants' performances on these 15 tests can be explained by the 3 underlying factors. In other situations, we may not know beforehand how many factors (if any) there are, and the object in doing the statistics is to determine this number.

Let's talk about the attributes for a moment. Strictly speaking, they don't really exist; what we see and measure are their purported consequences. You can't see or measure "Soul of a Byzantine Usurer" directly; you infer its presence from behaviors that are supposedly based on it. We expect (based on our theory of what Byzantine usurers are like) that people who have more of this attribute would charge higher interest rates, act more "Scrooge-like," overcharge more, and so on, than would people who have less of the attribute. To give another example, we can't see intelligence.4 What we measure are various manifestations of it: if our theory of intelligence is correct, people who have more of it should have a larger vocabulary, know more facts, work out puzzles faster, and complete more school than do people with less of it. What we measure, then, are the consequences of the attribute, and we say that the common thread that makes them all correlate with each other is the underlying attribute itself. In psychological jargon, we call these attributes hypothetical constructs;5 in statistics, they are called factors or latent variables. One purpose of FA is to determine if numerous measures (these could be paper-and-pencil tests, individual items on the tests themselves, physical characteristics, or whatever) can be explained on the basis of a smaller number of these factors.

Actually, Figure 19–1 oversimplifies the relationship between factors and variables quite a bit. If variables 1 through 5 were determined solely by the Eye of an Eagle factor, they would all yield identical results; the correlations among them would all be 1.00, and only one would need to be measured. We can show this in a somewhat more complicated, but accurate, picture in Figure 19–2. [Figure 19–2. Adding the unique component of each variable to Figure 19–1.] Here, the value of each variable is determined by two parts (ignoring any measurement error): (1) the degree to which it is correlated with the factor (represented by the arrow coming from the large circles), and (2) its unique contribution, that is, what variable 1 measures that variables 2 through 5 do not, and so on (shown by the arrows from the boxes labeled U in Figure 19–2).

What exactly is meant by "uniqueness"? We can best define it in terms of its converse, the communality of the variable:

4 There are some professors who maintain that it is impossible to see it in their students because it isn't there. That is patently a base canard when applied to students who read this book.
5 What exactly is meant by any of this? That's a question we'd best leave for the philosophers.

that is, how much variable 1 has in common with, and can be predicted by, all of the other variables. The communality of a variable can be approximated by its multiple correlation, R2, with all of the other variables. The uniqueness for variable 1 is then simply (1 – R1²), that portion of variable 1 that cannot be predicted by (i.e., is unrelated to) the remaining variables.

Before going on, let's complicate the picture just a bit more. Figure 19–2 assumes that Factor 1 plays a role only for variables 1 through 5, Factor 2 only for variables 6 through 10, and Factor 3 only for 11 through 15. In reality, each of the factors influences all of the variables to some degree, as in Figure 19–3. [Figure 19–3. A more accurate picture, with each factor contributing to each variable.] We've added lines showing these influences only for the contribution of the first factor on the other 10 variables; Factors 2 and 3 exert a similar influence on the variables, but putting in the lines would have complicated the picture too much. What we hope to find is that the influence of the factors represented by the dashed lines is small when compared with that of the solid lines.

HOW IT IS DONE

The Correlation Matrix

FA usually begins with a correlation matrix, as shown in Table 19–1. [Table 19–1. Correlation matrix of the 15 tests (Acuity, Color, Nystagmus, Detail, Carrots, Fine dexterity, Gross dexterity, Softness, Tremor, Checks, Interest, Scrooge, Dunning, Overcharge, Billing); the individual entries are not reproduced here.] On a technical note, it would be better to begin with a variance-covariance matrix if the variables all used a similar metric (such as when we factor analyze items on a test, each using a 0-to-7 scale). However, in our fields, the variables are each measured with very different units, so we convert all of them to standard scores; that is, we start with a correlation matrix.

If life were good to us, we'd find that all of the variables that measure one factor correlate very strongly with each other and do not correlate with the measures of the other attributes (i.e., the picture in Figure 19–2), and we'd probably not need to go any further than a correlation matrix. Needless to say, this is almost never the case. The correlations within a factor are rarely as high as we would like, and the measures are almost always correlated with "unrelated" ones to some degree (more like Figure 19–3). Thus we are left looking for patterns in a matrix of [n × (n – 1) ÷ 2] unique correlations; in our case, (15 × 14) ÷ 2, or 105 (not counting the 1.000s along the main diagonal). Trying to make sense of this just by eye is close to impossible. Before going on to the next step, it's worthwhile to do a few "diagnostic checks" on this correlation matrix.
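For readers who like to see the machinery, here is a small Python sketch that computes the correlation matrix and the SMC-based estimates of communality and uniqueness. The file name and the data frame "tests" (200 applicants by 15 tests) are our own inventions for illustration.

    # Correlation matrix, plus SMC-based communality and uniqueness estimates.
    import numpy as np
    import pandas as pd

    tests = pd.read_csv("dean_tests.csv")    # hypothetical 200 x 15 data file
    R = tests.corr()                         # the 15 x 15 correlation matrix

    R_inv = np.linalg.inv(R.values)
    smc = 1 - 1 / np.diag(R_inv)             # squared multiple correlation of each test
                                             #   with the other 14 (initial communality)
    uniqueness = 1 - smc                     # what each test does NOT share with the rest

    print(pd.DataFrame({"communality (SMC)": smc,
                        "uniqueness": uniqueness}, index=tests.columns).round(3))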

Some of the most useful "tests" do not involve any statistics at all, other than counting. Tabachnick and Fidell (2001)6 recommend nothing more sophisticated than an eyeball check of the correlation matrix: if you have only a few correlations higher than .30, save your paper and stop right there. The extreme example of this is an identity matrix, which has 1.000s along the main diagonal (because a variable is always perfectly correlated with itself) and zeros for all the off-diagonal terms. If no underlying factorial structure existed, the correlation matrix would consist of purely random numbers between –.30 and +.30 (i.e., pretty close to 0). The reason such checks matter is that computers are incredibly dumb animals: fed a matrix like that, the computer would still grind away merrily, churning out reams of paper, full of numbers and graphs, signifying nothing. So several tests, formal and otherwise, have been developed to ensure that something is around to factor analyze.

A slightly more stringent test is to look at a matrix of the partial correlations. This "test" is based on the fact that, if the variables do indeed correlate with each other because of an underlying factor structure, then the correlation between any two variables should be small (i.e., pretty close to 0) after partialing out the effects of the other variables. Some computer programs, such as BMDP, print out the partial correlation matrix. Others, such as SPSS/PC, give you its first cousin (on its mother's side), an anti-image correlation matrix. This is nothing more than a partial correlation matrix with the signs of the off-diagonal elements reversed, for some reason that surpasseth human understanding. In either case, a large number of high partial correlations indicates you shouldn't proceed.

A related diagnostic test involves looking at the communalities. You have to be careful interpreting the communalities in SPSS/PC: the first time it prints them out, they may (depending on other options we'll discuss later) all be 1.0. Later in the output there will be another column of them, the squared multiple correlations of each variable with all of the others. Because they are squared multiple correlations, they're interpreted in the opposite way from the partial correlations, reflecting the fact that the variables should be related to each other to some degree; they should be above .60 or so.

6 Their book is filled with uncommonly good wisdom and should be on the shelf of anyone doing advanced stats.
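If your package doesn't print the partial correlations, they're easy enough to get from the inverse of the correlation matrix. The fragment below continues the hypothetical sketch above (it reuses R_inv) and simply counts how many partial correlations are uncomfortably large.

    # Partial correlations of each pair of tests, controlling for the other 13,
    # derived from the inverse of the correlation matrix.
    import numpy as np

    d = np.sqrt(np.diag(R_inv))
    partials = -R_inv / np.outer(d, d)       # partial correlation matrix
    np.fill_diagonal(partials, 1.0)

    off_diag = partials[np.triu_indices_from(partials, k=1)]
    print("proportion of partial correlations above .30:",
          np.mean(np.abs(off_diag) > 0.30).round(2))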

Among the formal statistical tests, one of the oldest is the Bartlett Test of Sphericity. Without going into the details of how it's calculated, it yields a chi-square statistic. If its value is small, and the associated p level is over .05, then the correlation matrix doesn't differ significantly from an identity matrix, and you should stop right there. However, Tabachnick and Fidell (2001) state that the Bartlett test is "notoriously sensitive," especially with large sample sizes, so even if it is statistically significant, it doesn't mean that you can safely proceed. Bartlett's test is a one-sided test: if it says you shouldn't go on to the Principal Components stage, don't; but if it says you can go on, it ain't necessarily so.

Another test is the Kaiser-Meyer-Olkin Measure of Sampling Adequacy (usually referred to by its nickname, MSA), which is based on the squared partial correlations. MSA is affected by four factors: it increases with (1) the number of variables, (2) the number of subjects, (3) the overall level of the correlation, and (4) a decrease in the number of factors (Dziuban and Shirkey, 1974). Kaiser (1970) gives the following definitions for values of the MSA:

Below 0.50 – Unacceptable
0.50 to 0.59 – Miserable
0.60 to 0.69 – Mediocre
0.70 to 0.79 – Middling
0.80 to 0.89 – Meritorious
Over 0.90 – Marvelous7

You should consider eliminating variables with MSAs under 0.70. In the SPSS/PC computer package, the MSA value for each variable is printed along the main diagonal of the anti-image correlation matrix, and a summary value is also given. This allows you to check the overall adequacy of the matrix and also see which individual variables may not be pulling their full statistical weight. If, after doing this and rerunning the analysis, you find that the summary value is still low, that data set is destined for the garbage heap.

7 We don't know why Kaiser didn't use a word beginning with M for his definition of MSAs below 0.50. We can only assume he wasn't familiar with that Yiddish term of derision, "Mnyeh," which is pronounced just as it's written.

Extracting the Factors

Assuming that all has gone well in the previous steps, we now go on to extracting the factors, a procedure only slightly less painful than extracting teeth. The purpose of this is to come up with a series of linear combinations of the variables to define each factor. For Factor 1, this would look something like:

F1 = w11X1 + w12X2 + … + w1kXk     (19–1)

where the X terms are the k (in this case, 15) variables and the w's are weights. These w terms have two subscripts: the first shows that they go with Factor 1, and the second indicates with which variable they're associated. The w's for the first factor are chosen so that they express the largest amount of variance in the sample. The second factor would look like:

F2 = w21X1 + w22X2 + … + w2kXk     (19–2)

The w's in the second factor are derived to meet two criteria: (1) the second factor is uncorrelated with the first, and (2) it expresses the largest amount of variance left over after the first factor is considered. The w's in all the remaining factors are calculated in the same way, with each factor uncorrelated with and explaining less variance than the previous ones. So, if we have 15 variables, we will end up with 15 factors and, therefore, 15 equations in the form of the one above.

What FA tries to do, then, is explain the variance among a bunch of variables in terms of uncorrelated (the statistical term is orthogonal) underlying factors or latent variables. Returning to our example, the Dean hopes that the first 3 factors are responsible for most of the variance among the variables and that the remaining 12 factors will be relatively "weak" (i.e., he won't lose too much information if he ignores them). So he actually may end up with what he's looking for.

The actual results are given in Table 19–2. [Table 19–2. The 15 factors, showing the eigenvalue, percent of variance, and cumulative percent for each.] For the moment, ignore the column headed "Eigenvalue" (we get back to this cryptic word a bit later) and look at the last one, "Cumulative percent." Notice that the first factor accounts for 37.4% of the variance, the first two for over 50%, and the first five for almost 75% of the variance of the original data. If a factorial structure is present in the data, most of the variance may be explained on the basis of only the first few factors, and this is the column to look at.

Now, if we began with 15 variables and ended up with 15 factors, this may seem like a tremendous amount of effort was expended to get absolutely nowhere. So, what have we gained? Actually, quite a bit. What we've just described is the essence of FA. The way it's used now is to try to reduce the number of factors as much as possible, so as to get a more parsimonious explanation of what's going on.
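The Bartlett and KMO checks described above are available outside SPSS as well. The sketch below uses the Python factor_analyzer package on the hypothetical "tests" data frame from the earlier fragments; the package choice is our assumption, not anything the Dean actually ran.

    # Bartlett's test of sphericity and the Kaiser-Meyer-Olkin MSA.
    from factor_analyzer.factor_analyzer import (calculate_bartlett_sphericity,
                                                 calculate_kmo)

    chi_square, p_value = calculate_bartlett_sphericity(tests)
    kmo_per_variable, kmo_overall = calculate_kmo(tests)

    print(f"Bartlett chi-square = {chi_square:.1f}, p = {p_value:.4f}")
    print(f"Overall MSA = {kmo_overall:.2f}")     # hope for 'meritorious' or better
    print("Per-variable MSAs:", kmo_per_variable.round(2))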

Principal Components Analysis and Principal Axis Factoring

FA is only one way of determining the factors; BMDP has four different methods, and SPSS/PC has seven. To understand the differences among this plethora of techniques, how the results are interpreted, and when each is used, we'll have to take a bit8 of a sidestep and expand on what we've been doing. In Equations 19–1 and 19–2, we showed how we can define the factors in terms of weighted combinations of the variables. But we can also do it the other way, and define each variable as a weighted combination of the factors. For variable X1, this would look like:

X1 = [w11F1 + w12F2 + w13F3 + … + w1kFk] + [w1U1] + [e1]     (19–3)

where the first subscript of the w's means variable 1, and the second refers to the factor. That is, for the k factors we're working with, the variance of each variable, like Gaul, is divided into three parts, which we've highlighted by putting them in separate sets of brackets: the first reflects the influence of the factors (i.e., the variable's communality); the second is the variable's unique contribution (U); and the third is random error (e) that exists every time we measure something. Ignoring the error term for a moment,9 we can summarize Equation 19–3 as:

X1 = Communality + Uniqueness     (19–4)

The issue is what part of the variance we're interested in. In principal components analysis (PCA), we're interested in all of the variance (that is, the communality plus the uniqueness), whereas in principal axis (PA) factoring, we're concerned only with the variance that each variable has in common with the other variables, not the unique variance. That's the reason that PA is also referred to as common factor analysis.10

This has a number of implications regarding how the analyses are done. In PCA, we begin with a correlation matrix that has 1.0s along the main diagonal, reflecting the fact that variables correlate perfectly with themselves. This may not seem too unusual, as all correlation matrices have 1.0s there; but, from the perspective of PCA, those 1.0s mean that we are concerned about all of the variance, from whatever source. In PA, on the other hand, our interest is in only the variance due to the influence of the factors, so the initial estimate of the communality for each variable (which is what's captured by the values along the main diagonal) will be less than 1.0. But now we're in a Catch-22 situation: we use FA to determine what those communalities are, but we need some value in order to get started. What we do, in essence, is figure out the communalities in stages. As we discussed in the chapters on correlation and regression, R2 (which is the usual symbol for the squared multiple correlation, or SMC) is the amount of variance that the predictor variables have in common with the DV. So, as a first step, the best estimate of a variable's communality is its SMC with all of the other variables; to determine the SMC for variable X1, we do a multiple regression with X1 as the DV and all of the other variables as predictors. This estimate is later revised once we've determined how many factors we ultimately keep. In terms of the computer output, it means that the table listing the initial communalities will show a 1.0 for every variable with PCA, whereas with PA the values will be less than 1.0.

It also means that, once we have defined the variables in terms of a weighted sum of the factors (as in Equation 19–3), we can use those equations with PCA and perfectly recapture the original data; we haven't lost any information by deriving the factors. With PA, because we have, in essence, discarded the unique variance, we can't go back and forth between the raw data and the results coming out of Equation 19–3 and expect to have exactly the same values. We've lost some information, but, since after all, the information that was lost is information we don't care about.11 (And we have no way of determining what's error and what's real variance in any case.)

So, when do we use what? If we're trying to find the optimal weights for the variables to combine them into a single measure, then PCA is the way to go. We would do this, for example, if we were concerned that we had too many variables to analyze and wanted to combine some of them into a single index: instead of five scales tapping different aspects of adjustment, we would use the weights to come up with one number, thus reducing the number of variables by 80%. We would also use PCA if we wanted to account for the maximum amount of variance in the data with the smallest number of mutually independent underlying factors. On the other hand, if we're trying to create a new scale by eliminating variables (or items) that aren't associated with other ones or don't load on any factor, then PA (i.e., "common" FA) is the method of choice; this is the procedure that test developers would use.

After we've gone to great lengths to explain the difference between PCA and PA, the reality is that it may be much ado about nothing. FA is highly robust to the way factors are extracted, and "many of these decisions have little effect on the obtained solutions" (Watson et al., 2002). Because PCA uses a higher value for the communalities than FA, its estimates of something called factor loadings (which we'll explain more fully later on) are a tad higher than those produced by PA, but the differences tend to be minor, and the correlations between factor loadings coming from the two methods are pretty close to 1.0 (Russell, 2002).

8 And, we promise, it will be just a bit.
9 And we really don't care about the error variance.
10 "Common" because it uses only the variance that is in common with all of the variables, not that it is the commonly used version, or used only by common people.
11 Sort of like information from a presidential press secretary.
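To make the difference concrete, here is a small numpy sketch (again reusing the correlation matrix R and the SMC vector from the earlier fragments) that contrasts the two starting points: PCA eigen-decomposes the matrix as is, while a first pass of principal axis factoring replaces the 1.0s on the diagonal with the SMC estimates of the communalities.

    # PCA versus a first pass of principal axis factoring:
    # same matrix, different diagonal.
    import numpy as np

    R_pca = R.values.copy()                  # 1.0s on the diagonal: all the variance
    eig_pca = np.sort(np.linalg.eigvalsh(R_pca))[::-1]

    R_pa = R.values.copy()
    np.fill_diagonal(R_pa, smc)              # SMCs on the diagonal: common variance only
    eig_pa = np.sort(np.linalg.eigvalsh(R_pa))[::-1]

    print("PCA eigenvalues:", eig_pca.round(2))
    print("PA  eigenvalues:", eig_pa.round(2))   # smaller, since unique variance is excluded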

Some people differentiate between the results of PCA and PA by calling the first components and the second factors. We're not going to clutter this chapter (and your mind) by constantly saying "factors or components"; we'll use the term factors for both, and trust you can keep the difference in mind.

On Keeping and Discarding Factors

A few paragraphs back, we mentioned that one of the purposes of the factor extraction phase was to reduce the number of factors, so that only a few "strong" ones remain. In fact, the number of factors to retain is one of the most important decisions a factor analyst12 must make: if too many or too few factors are kept, the results from later steps may be distorted to a marked degree. But first, we have to resolve what we mean by "strong," and what criteria we apply. As with the previous phase (factor extraction) and the next one (factor rotation), the problem isn't a lack of answers, but rather a surfeit of them.

The criterion that is still the most commonly used is called the eigenvalue one test, the eigenvalue one criterion, or the Kaiser criterion, after the person who popularized it.13 It is the default (although, as we'll see, not necessarily the best) option in most computer packages. We should, in all fairness, describe what is meant by an eigenvalue. Without going into the intricacies of matrix algebra, an eigenvalue can be thought of as an index of variance: each factor yields an eigenvalue, which is the amount of the total variance explained by that factor. We said previously that the w's are chosen so that the first factor expresses the largest amount of variance; that was another way of saying that the first factor has the largest eigenvalue, the second factor has the second largest eigenvalue, and so on. So why use the criterion of 1.0 for the eigenvalue? The reason is that the first step in FA is to transform all of the variables to z-scores, so that each has a mean of 0 and a variance of 1.0. Consequently, the total amount of variance is equal to the number of variables: if you have 15 variables, then the total variance within the (z-transformed) data matrix is 15, and if we add up the eigenvalues of the 15 factors that come out of the PCA, they will sum to—that's right—15.14 So you can think of a factor with an eigenvalue of less than 1.0 as accounting for less variance than is generated by one single variable, and, ceteris paribus,15 we gain nothing by keeping factors with eigenvalues under 1.0; we are further ahead (in terms of explaining the variance with fewer latent variables) if we keep only those with eigenvalues over 1.0. Obviously, then, the more variables there are, the smaller the share of variance a given eigenvalue represents: with 20 variables, an eigenvalue of 1.0 accounts for 5% of the variance, whereas with 50 variables, the same eigenvalue means that a factor is accounting for only 2% of the variance.

Unfortunately, the Kaiser criterion has its problems. The first is that it's somewhat arbitrary: a factor with an eigenvalue of 1.01 is retained, whereas one with a value of 0.99 is rejected. This ignores the fact that eigenvalues, like any other parameter in statistics, are measured with some degree of error; on replication, these numbers will likely change to some degree, perhaps leading to a different solution. The second problem is that the Kaiser criterion often results in too many factors (factors that may not appear if we were to replicate the study) when more than about 50 variables exist, and in too few factors when fewer than 20 variables are considered (Horn and Engstrom, 1979). The third problem is that, while this criterion may be logical in PCA, where the communalities are set equal to 1.0, it doesn't make sense in PA, since the communalities (and hence the eigenvalues) are reduced once we've thrown away the variance due to the uniqueness of each variable. Unfortunately, programs such as SPSS use the same criterion for both methods (Russell, 2002).

The Lawley test tries to get around the first problem by looking at the significance of the factors. Unfortunately, it's quite sensitive to the sample size and usually results in too many factors being kept when the sample size is large enough to meet the minimal criteria (about which, more later). For this reason, we don't see it around much any more.

A somewhat better test is Cattell's Scree Test. This is another one of those very powerful statistical tests that rely on nothing more than your eyeball.16 In geology,17 "scree" is the rubble that accumulates at the foot of a hill; in FA, it's the junk after the strong factors. We start off by plotting the eigenvalues for each of the 15 factors, as in Figure 19–4. [Figure 19–4. A scree plot for the 15 factors, with the eigenvalue on the Y-axis and the factor number on the X-axis.] In many cases (but by no means all), there's a sharp break in the curve between the point where it's descending and where it levels off, that is, where the slope of the curve changes from negative to close to zero. The last "real" factor is the one before the scree (the relatively flat portion of the curve) begins. In this example, there is a break after the second factor, but it looks like the scree starts after the third factor, so we'll keep the first three. If several breaks are in the descending line, usually the first one is chosen, but this can be modified by two considerations (so the scree may actually start after the second or third break). Kaiser (1970), by the way, refers to this technique as "root staring" (because in matrix algebra, an eigenvalue is called a root of the matrix). Could this be an example of professional jealousy?

12 In psychiatric circles, it is said that one cannot become a factor analyst until one's self has been factor analyzed.
13 That's Henry F. Kaiser, not Kaiser Wilhelm.
14 If you don't believe us, add up the 15 numbers in the "Eigenvalue" column of Table 19–2. (With PA, this doesn't quite hold, but it should be fairly close.)
15 A phrase much beloved by Albert Einstein, used when he was about to hit you with something that would take 6 months to figure out. It's Latin for "all else being equal."
16 See, we told you so!
17 In what other stats book can you also get a basic grounding in geology at the same time?
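A scree plot is easy enough to produce yourself. The fragment below plots the eigenvalues of the correlation matrix R from the earlier sketches and draws in the eigenvalue-one line so you can see both criteria at once.

    # Scree plot of the eigenvalues, with the Kaiser (eigenvalue-one) line.
    import numpy as np
    import matplotlib.pyplot as plt

    eigenvalues = np.sort(np.linalg.eigvalsh(R.values))[::-1]

    plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
    plt.axhline(1.0, linestyle="--")         # Kaiser criterion: keep factors above this line
    plt.xlabel("Factor")
    plt.ylabel("Eigenvalue")
    plt.title("Scree plot")
    plt.show()

    print("Factors with eigenvalues over 1.0:", int((eigenvalues > 1.0).sum()))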

Be aware, however, that the simplicity of the scree plot can be deceiving. When Cattell drew up his rules for interpreting scree plots, he was very explicit about the scales on the axes; but, when computer programs draw them, the ratio of the scale of the X-axis to that of the Y-axis often varies from one output to the next, so what looks like a clean break using Cattell's criteria may seem smooth on the computer output. In fact, we found that even experienced factor analysts couldn't agree among themselves about how many factors were present (Streiner, 1998).

The fact that no statistical test exists for the scree test also poses a bit of a problem for computer programs, which love to deal with numbers. Almost all programs use the eigenvalue one criterion as a default when they go on to the next steps of factor analysis. If you do a scree plot and decide you won't keep all the factors that have eigenvalues over 1.0, you have to run the FA in two steps: once to produce the scree plot, and again for you to override the eigenvalue criterion. You can usually do this by specifying either the minimum eigenvalue (equal to the value of the smallest one you want to retain) or the actual number of factors to keep. It's a pain in the royal derrière to have to do it in two steps, but it can be done.

A third method, which is becoming increasingly popular, is called a parallel analysis (Horn, 1965). In essence, you repeat the factor analysis using random numbers, with the same number of variables and subjects. The number of factors to retain is indicated where the scree plot for the real data drops below that for the random data. The reason that it hasn't been used much until recently is that the commercial statistical packages haven't implemented it; however, standalone programs and sets of commands for the statistical programs do exist (see O'Connor, 2000; Reise et al., 2000), and it is probably more accurate than interpreting the scree test by eye (Russell, 2002).

Bear in mind that, no matter which criterion you use for determining the number of factors, the results should be interpreted as a suggestion, not as truth. It's usually better (or at least less bad) to over-extract than to under-extract (Wood et al., 1996). When too few factors are extracted, the estimated factors "are likely to contain considerable error. Variables that should load on unextracted factors may incorrectly show loadings on the extracted factors, and loadings for variables that genuinely load on the extracted factors may be distorted" (p. 359); the greater the degree of under-extraction, the worse the problem. Over-extraction leads to "factor splitting," and the loadings on the surplus factors have more error than on the true factors; however, clinical judgment can alert you to the fact that the variables on these surplus factors really belong with other factors. So, if after using all these guidelines you still can't decide whether to keep, say, five versus six factors, go with the higher number. What your beloved authors do is use a couple of the criteria (as long as one includes parallel analysis), and then run the FA a number of times, once with the recommended number of factors and then with one or two more and one or two fewer, and select the solution that makes the most sense clinically. The bottom line is that you should be guided by the solutions, not ruled by them.
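Parallel analysis is also simple enough to roll yourself. The sketch below generates eigenvalues from random normal data of the same size as our hypothetical 200 × 15 test matrix and compares them with the real ones.

    # Horn's parallel analysis: compare the real eigenvalues with the average
    # eigenvalues of random data of the same size.
    import numpy as np

    rng = np.random.default_rng(42)
    n_subjects, n_vars = tests.shape
    n_sims = 100

    random_eigs = np.zeros((n_sims, n_vars))
    for i in range(n_sims):
        fake = rng.standard_normal((n_subjects, n_vars))
        random_eigs[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(fake, rowvar=False)))[::-1]

    real_eigs = np.sort(np.linalg.eigvalsh(R.values))[::-1]
    threshold = random_eigs.mean(axis=0)     # or use the 95th percentile, if you prefer
    n_factors = int((real_eigs > threshold).sum())
    print("Factors to retain by parallel analysis:", n_factors)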
This use of clinical or research judgment drives some statisticians up the wall, because they like procedures that give the same result no matter who is using them. To them, we counter with the mature and sophisticated reply, "Tough, baby; live with it."

The Matrix of Factor Loadings

After we've extracted the factors and decided on how many to retain, the computer gives us a table (like Table 19–3) that is variously called the Factor Matrix, the Factor Loading Matrix, or the Factor Structure Matrix. Just to confuse things even more, it can also be called the Factor Pattern Matrix: as long as we keep the factors orthogonal to each other, the factor structure matrix and the factor pattern matrix are identical. When we relax this restriction (a topic we'll discuss a bit later), the two matrices become different. [Table 19–3. Unrotated factor loading matrix for the 15 tests on the first three factors.]

Table 19–3 tells us the correlation between each variable and the various factors. In factor analysis, we speak of the variables loading on the factors. So "Visual Acuity" loads .627 on Factor 1 (i.e., it correlates .627 with the first factor), .285 on Factor 2, and .347 on Factor 3; it is most closely associated with the first factor. As with other correlations, a higher (absolute) value means a closer relationship between the factor and the variable.

A couple of interesting and informative points about factor loadings. First, in statistical jargon, they are standardized regression coefficients (β weights), which we first ran across in multiple regression; in this case, the DV is the original variable itself, and the factors are the IVs. Second, as long as the factors are orthogonal, these regression coefficients are identical to correlation coefficients. (The reason is that, if the factors are uncorrelated, then the β weights are not dependent on one another.) This becomes important later, when we see what happens when we relax the requirement of orthogonality.
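With the number of factors settled, the loading matrix itself is a few lines of code. The sketch below uses the Python factor_analyzer package (our assumption; the book's own output came from SPSS), so the extraction method is the package's default rather than PCA, and the loadings will differ slightly.

    # An unrotated three-factor solution, analogous to Table 19-3.
    from factor_analyzer import FactorAnalyzer
    import pandas as pd

    fa = FactorAnalyzer(n_factors=3, rotation=None)   # default extraction is minres, not PCA
    fa.fit(tests)

    loadings = pd.DataFrame(fa.loadings_,
                            index=tests.columns,
                            columns=["Factor 1", "Factor 2", "Factor 3"])
    print(loadings.round(3))
    print("Communalities:", fa.get_communalities().round(3))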

Rotating the Factors

Why rotate at all? Up to now, what we've done wouldn't arouse strong emotions among most statisticians;18 the only subjective element was in selecting the number of factors to retain. It is the next step, factor rotation, that really gets the dander up among some (unenlightened) statistical folks. So, if factor rotation is still somewhat controversial (at least to those who aren't true believers like us), why do we do it? Unlike other acts that arouse strong passions, we can't explain it simply on the basis of the fact that it's fun. However, factor rotation serves some useful functions. The primary one is to help us understand what (if anything) is going on with the factors. At this point, we still don't know what the factors mean; we've simply transformed a number of variables into factors. It would make life much easier if we could understand the factors on the basis of mutually exclusive sets of variables, and this is what rotating the factors tries to do.

To simplify interpretation of the factors, the factor loading matrix should satisfy four conditions:

1. Magnitude of the loadings. The factor loadings should be close to 1.0 or 0.0. If a variable loads strongly on one factor, then its loadings on the other factors will be close to 0. (This is really a consequence of the second criterion.)
2. Factorial complexity. Each variable should load on only one factor. Whenever a variable loads strongly on two or more factors, we call it factorially complex, and factorial complexity makes it more difficult to interpret the role of the variable.
3. Unipolar factors. The factors should be unipolar (all the strong variables have the same sign). This occurs when all of the factor loadings have the same sign, so that a higher score on the factor means more of the latent variable, and a lower score simply means less of it. If some loadings were positive and others negative, then a high score on the factor would indicate more of some variables, whereas a low score would indicate more of other variables. In the interest of interpretive ease, we'd like the factor to be unipolar.
4. Distribution of variance. The variance should be fairly evenly distributed across the retained factors. From a mathematical viewpoint, nothing is wrong with most of the variance being in one factor; but the first factor is simply the one that accounts for most of the variance, and it does not necessarily reflect the first factor we want to find (such as the Eyes of an Eagle). What Factor 1 often picks up is a "general factor," which only rarely tells us something we didn't already know. This situation is extremely common and is found because consistency tends to occur in people across various measures.

So, it is easiest to interpret the results of a factor analysis if we can meet these criteria and aim for structural simplicity; it is much harder with loadings in the middle range, or with factorial complexity, or with bipolar factors.

Let's see how well the factor loading matrix in Table 19–3 meets these criteria. It doesn't fare well on factorial complexity: NYSTAGMUS loads strongly on Factors 1 and 2, INTEREST is explained by both Factor 1 and Factor 2, and CARROTS loads on all 3 factors to comparable degrees, so that the explanation of these factors must take CARROTS into account. The first factor also contains a disproportionate share of the total variance explained by the three factors. If we go back to Table 19–2, we can add up the eigenvalues of the first three factors to get the amount of variance explained by them (which is 61.2% of the total variance of 15). Of this amount, the first factor accounts for (5.6025 ÷ 9.1798), or 61.0%; the second factor for (2.0252 ÷ 9.1798), or 22.1%; and the third factor for the remaining 16.9%. We can also see this in the fact that all of the variables load strongly on the first factor (Table 19–3): 12 of the 15 have loadings over .50, and only 2 variables (NYSTAGMUS and CARROTS) load higher on another factor than they do on Factor 1.

One more point before we rotate. For each variable, the communality, which we approximated with R2 previously, can now be derived exactly:19 it is the sum of the squared factor loadings across the factors that we've kept. For ACUITY, it would be (.62684)2 + (.28525)2 + (.34653)2 = .594. We usually use the abbreviation h2 for the communality, and the uniqueness is then written as (1 – h2).

Which rotation we decide to use (assuming we don't merely accept the program's default options without question) is totally a matter of choice on the analyst's part; it is an interpretive, not a statistical, problem. The reason is that we have, literally, an infinite number of ways we can rotate the factors. Unfortunately, no one's found a way to optimize all of these criteria at once.

18 To the extent to which strong emotions can be aroused in statisticians (which is why we refer to statisticians as they, rather than as us).
19 Remember that this pertains to PCA, not PA.
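The communality arithmetic shown for ACUITY is easy to check in code. Continuing with the loadings table from the earlier sketch, and using a .40 cutoff of our own choosing to flag factorial complexity:

    # Communality (h^2), uniqueness (1 - h^2), and a crude complexity flag.
    import numpy as np

    h2 = (loadings ** 2).sum(axis=1)         # sum of squared loadings across kept factors
    uniqueness = 1 - h2
    complexity_flag = (loadings.abs() > 0.40).sum(axis=1) > 1   # loads "strongly" on 2+ factors

    summary = loadings.round(3).assign(h2=h2.round(3),
                                       complex_=complexity_flag)
    print(summary)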

A rotation that spreads the variance equally across the factors may not necessarily reduce factorial complexity, and one that reduces complexity may not produce unipolar factors. Needless to say, the result has been a profusion of rotation techniques,20 each one designed to give priority to a different criterion, and all of which yield somewhat different results. The one that's used most is called varimax, and that's what we'll go with first.

A simple example. Before returning to our original problem, we'll start off by forcing the FA to give us only two factors; we can then generalize the procedure to three or more factors, although we won't be able to visualize the results as readily. By asking for two factors, our factor loading table will have just two columns. Let's plot each variable using the loading on Factor 1 as the X coordinate and the loading on the second factor as the Y coordinate. What we'll get is Figure 19–5. [Figure 19–5. A factor plot of the two-factor solution.] Here we can see problems with all of the criteria: (1) all of the variables show some degree of loading on Factor 1; (2) most of the variables are in the middle portions of the quadrants, showing that they are loading on both of the factors; (3) the factor loadings fall in the middle range, roughly between .2 and .8 on Factor 1, and most of them are between .4 and .6 (absolute values) on Factor 2, rather than being close to 1.0 or 0.0; and (4) Factor 1 is unipolar, but Factor 2 is definitely bipolar.

Let's see how rotating the factors can help meet the four criteria and grant us our wish for simplicity. In Figure 19–6, we've rotated the axes, keeping them orthogonal (at right angles) to each other; the new axes are labeled Factor 1' and Factor 2'. [Figure 19–6. Figure 19–5, with the rotated axes superimposed.] How do our criteria fare in this picture? (1) A group of variables now show high loadings on Factor 2 but not on Factor 1, demonstrating that not all of the variables are loading on the first factor any more. (2) The variables seem to be closer to the axes than to the middle of the quadrant, indicating reduced factorial complexity. (3) Each variable is closer to the top on one factor and closer to the origin for the other factor, showing that the loadings are nearer to 1.0 or 0.0. (4) All of the variables are in or very near to the first quadrant, which means that all of the signs are positive (or those loadings that are negative are very small), resulting in unipolar factors. However, if we continue to rotate the axes clockwise until Factor 1' is horizontal, all of the Factor 2' coordinates will be negative, which is quite kosher, mathematically speaking, but it makes interpretation a bit harder. We can correct this little annoyance simply by reversing all of the signs of the Factor 2' factor loadings. We end up with Figure 19–7. [Figure 19–7. Figure 19–6, with the rotated axes turned to be horizontal and vertical.]

When we have more than 2 factors, we can plot all possible pairs of them. The only problem is that, because it's hard to draw three-dimensional patterns (one dimension for each factor), if we had as few as 5 factors, we'd have 10 graphs to wade through, and 10 factors would result in 45 graphs. Before returning to our original problem, let's use this two-factor solution to illustrate one more point.

20 What makes factor rotation almost unique in the field of statistics is that the techniques are not named after people. However, after trying to get your tongue around terms like varimax, binormamin, or oblimax, you almost wish they had been given human names.
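In software, of course, the rotation is done analytically rather than by eye. A varimax-rotated two-factor solution, plotted the same way as in the figures, might look like this (again with the factor_analyzer package, which is our assumption):

    # Force a two-factor solution, rotate it with varimax, and plot the loadings.
    from factor_analyzer import FactorAnalyzer
    import matplotlib.pyplot as plt

    fa2 = FactorAnalyzer(n_factors=2, rotation="varimax")
    fa2.fit(tests)
    load2 = fa2.loadings_

    plt.scatter(load2[:, 0], load2[:, 1])
    for name, (x, y) in zip(tests.columns, load2):
        plt.annotate(name, (x, y), fontsize=8)
    plt.axhline(0); plt.axvline(0)
    plt.xlabel("Factor 1"); plt.ylabel("Factor 2")
    plt.show()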

Orthogonal versus oblique rotations. When we rotated the axes in Figure 19–6, we said, "…keeping the axes orthogonal (at right angles) to each other." Because the angle between the axes was fixed at 90˚, we were still left with some of the variables being near the middle of the quadrant, and there was little we could do about them. However, if we relax the condition that the factors have to be orthogonal, we can draw each axis closer to the middle of each group of variables, as in Figure 19–8. [Figure 19–8. An oblique rotation of the factor plot in Figure 19–5.] We call this an oblique rotation. In actuality, although it's easier to think of Hands of a Woman as being a completely separate attribute from Eyes of an Eagle, it's likely more accurate to think of them as being correlated to some degree; in fact, having some degree of correlation among the factors is probably a better reflection of reality than having strictly independent ones. Just as there are various orthogonal rotations, there are many different types of oblique rotations. Perhaps the best one to use is called a Promax rotation. It begins by doing an orthogonal varimax rotation and then drops the constraint that the factors have to be uncorrelated with each other. Its advantage is that, if the factors really are uncorrelated, then the resulting factors will be pretty close to orthogonal (Fabrigar et al., 1999).

The advantage of an oblique rotation is that such solutions often lead to greater structural simplicity (using the criteria we listed before) than do orthogonal rotations. The tradeoff is that we now have to contend with the factors being correlated with each other to varying degrees. Instead of the relatively simple description of Figure 19–2, where the value of each variable is determined only by its "own" factor and its unique component, we have a more complicated situation (Figure 19–9). [Figure 19–9. In an oblique rotation, the factors are correlated with each other.] In this case, to understand what Factor 1 is measuring, we not only have to look at the variables that have a high loading on it, but we also have to consider any correlation between Factor 1 and the others.

The correlation among the factors leads to another issue, which we briefly mentioned earlier. As long as the factors were uncorrelated, each variable's regression coefficients for the factors were the same as the correlations between the variable and the factors; the loadings could be interpreted either as simple correlations or as β weights. However, once we introduce some correlation between the factors, this equivalence doesn't hold any more. The factor pattern matrix still consists of the loadings defined as partial regression coefficients, but now the factor structure matrix holds the simple correlations between the variables and the factors. The higher the correlation among the factors, the greater the difference between these two matrices.

So, even though oblique rotations may mirror reality more closely than orthogonal ones, most people prefer the latter. The reason is that orthogonal rotations have a number of desirable qualities. Because the factors are uncorrelated with each other (that's the mathematical meaning of "orthogonal"), any score derived from one factor will correlate 0 with scores derived from the other factors; this is a useful property if the results of a PCA or a PA are to be further analyzed with another statistical test. Also, as we've said, the interpretation of the factors is far easier if they are all independent from one another.

Back to the Dean. Before we leave the topic of rotations, let's just see how our three-factor solution fared with a varimax rotation. We'll skip the graphing stage because, in the absence of three-dimensional graph paper, we would have to look at three factor plots for the unrotated solution (Factor 1 vs 2, 1 vs 3, and 2 vs 3) and an equal number after the rotation.
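An oblique solution is usually a one-word change in the software. The sketch below asks for a Promax rotation and then checks how strongly the resulting factor scores correlate; the package and variable names are, as before, our own assumptions.

    # A Promax (oblique) three-factor solution, and the correlations among
    # the factor scores it produces.
    import numpy as np
    from factor_analyzer import FactorAnalyzer

    fa_obl = FactorAnalyzer(n_factors=3, rotation="promax")
    fa_obl.fit(tests)

    scores = fa_obl.transform(tests)         # regression-based factor scores
    print("Pattern matrix (partial regression weights):")
    print(np.round(fa_obl.loadings_, 3))
    print("Correlations among the factor scores:")
    print(np.round(np.corrcoef(scores, rowvar=False), 2))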

Instead, we'll focus on the factor matrix. The unrotated matrix was given in Table 19–3; the rotated one is in Table 19–4. [Table 19–4. Rotated (varimax) factor loading matrix for the 15 tests.] The conclusion, then, is that rotating the axes got us a lot closer to structural simplicity. If we plot the absolute magnitudes of the unrotated factor loadings, as we did in the left side of Figure 19–10, we see that most of them fall in the middle range, between about .3 and .7. The right side shows the same thing for the rotated loadings: the graph is much more bimodal, with relatively few values in the middle range, so we seemingly have succeeded in driving the loadings closer to 0.0 or 1.0. [Figure 19–10. Plot of the factor loadings for the unrotated and rotated solutions.] This is also reflected in the fact that now only five variables load strongly on Factor 1. The other criteria did just as well. In the unrotated solution, 12 of the 45 loadings were negative; in the rotated one, only 3 are, and they are relatively small. Last, only one variable, CHECKS, shows any degree of factorial complexity.

As we said previously, the total amount of variance explained by the retained factors doesn't change when we rotate; what does change is the distribution of the variance across factors. If you recall, these three factors accounted for 61.2% of the total variance. Previously, Factor 1 was responsible for about 61% of the variance that is explained, Factor 2 for about 22%, and Factor 3 for about 17%. After rotation, these numbers become roughly 37%, 33%, and 29%, obviously a much more equitable division.

INTERPRETING THE FACTORS

Now that we've got the factors, what do we do with them? The first step is to determine which variables load on each factor. To do this, we have to figure out which loadings are significant and which can be safely ignored. We know a couple of ways of doing this. One way is to adopt some minimum value, such as .30 or .40. The problem is that any number we choose is completely arbitrary and doesn't take the sample size into account: a loading of .38 may be meaningful if we had 1,000 subjects, but it may represent only a chance fluctuation from 0 with 30 subjects. A better method would be to retain only those loadings that are statistically significant. We can do this by looking up the critical value in a table for the correlation (see Table F in the Appendix). However, because of the number of tests that will be done, Stevens (2001) recommends (1) using the 1% level of significance, rather than the 5%, and then (2) doubling that value, because the SEs of factor loadings are up to twice those of ordinary correlations. When the sample size is over 100 (and we'll soon see why it had better be), a good approximation to use would be:

CV = 5.152 ÷ √(N – 2)     (19–5)
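Equation 19–5 is trivial to apply in code. The sketch below refits a varimax-rotated three-factor solution (with factor_analyzer, as before), computes the critical loading for our 200 applicants, and blanks out everything smaller.

    # Critical value for a factor loading, with the SE doubled as Stevens suggests,
    # applied to a varimax-rotated three-factor solution.
    import numpy as np
    import pandas as pd
    from factor_analyzer import FactorAnalyzer

    fa3 = FactorAnalyzer(n_factors=3, rotation="varimax")
    fa3.fit(tests)
    rot_loadings = pd.DataFrame(fa3.loadings_, index=tests.columns,
                                columns=["Factor 1", "Factor 2", "Factor 3"])

    N = 200
    cv = 5.152 / np.sqrt(N - 2)              # Equation 19-5; about .37 for N = 200
    print(f"critical loading = {cv:.3f}")
    print(rot_loadings.where(rot_loadings.abs() >= cv).round(3))   # analogous to Table 19-5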

Let's use this for our data. Because we had 200 most unwilling people21 taking the test, we would get:

CV = 5.152 ÷ √198 = 0.366     (19–6)

Where did these numbers come from? When N > 100, the normal curve is a good approximation for the correlation distribution, and 2.576 marks off the 1% level of significance. Following Stevens, we double this (hence, the 5.152) and then multiply by the SE for a correlation, which is [1 ÷ √(N – 2)], and voilà! So, if you want to use the 5% level, use 3.920 in the numerator.

If we now go back to Table 19–4 and eliminate all loadings lower than this (and round to three decimals to make the numbers easier to read), we get Table 19–5. [Table 19–5. Matrix of significant factor loadings.] Suddenly the light shines; it looks like we've pulled some degree of order out of chaos.22 Factor 1 consists of six variables: CHECKS, INTEREST, SCROOGE, DUNNING, OVERCHARGE, and BILLING. This looks very much like the postulated Soul factor, with the addition of the CHECKS variable (a point we'll return to soon). Similarly, Factor 2 corresponds to the Hands attribute, and Factor 3 to Eyes.

Before we waltz away, we should make two last checks of the factors. First, a factor should consist of at least three variables (Tabachnick and Fidell say you can get away with two, but we feel that's low); any factor that contains fewer should be discarded. Second, it's wise to go back to the original correlation matrix and see if the variables in the factor are indeed correlated with each other. Although it's unusual, situations can arise in which they're not, and again that factor should be thrown away.

Now back to CHECKS. The CHECKS variable is both factorially complex (loading on Factors 1 and 2), and its highest loading is on the "wrong" factor. So what do we do with it? We have three options:
1. We can keep the variable in both factors. However, if our aim is to achieve simplicity and end up with uncorrelated factors, this wouldn't be a good choice.
2. If the variable is one we devised (e.g., an item on a test we're writing, or an entire test we're developing), we could rewrite it. The downside of this is that we would have to repeat the whole study with a new group of subjects to see if the revised variable is better than the original. In our example, this may be a sensible option, because the Dean will have a new batch of 200 consenting adults next year.
3. We can throw that test out of the battery because it isn't tapping what we thought it would. This would be the case when the variable is quite complex, loading on a number of factors, or when it loads on some factor we didn't retain. If there are enough variables remaining in the factors (a minimum of three), this option is feasible; and if the variable isn't one we can rewrite, it may be our only alternative.

USING THE FACTORS

In many cases, the steps we've gone through are as far as researchers want to go. They've used PCA and FA to either explore the data or confirm some hypotheses about them. So we can use PCA and FA to change a large number of variables into a smaller number of factors, and also to eliminate variables that were either not too helpful or factorially complex.23 Table 19–5 can also help the Dean in another way: if he wants to make the test battery shorter, he can eliminate those tests with the lowest factor loadings. Needless to say, the reduced battery will not predict the factors as well, so yet another tradeoff has to be made.

However, we can use these procedures in another way: to reduce the number of variables. We may want to do this for a few reasons. First, it may be easier for us to understand what a pattern of (say) three factors means rather than trying to juggle 15 scores in our mind all at once. Second, in our example, each person would have 3 scores rather than 15, which, in essence, increases the subject/variable ratio by a factor of 5; subject-to-variable ratios that are too low for some multivariable procedures may still be okay in FA (see below).

What we would like to do, then, is to come up with one number for each factor, which we can then analyze with linear regression or something else. We mentioned earlier that the factor loadings are partial regression weights, so why not simply use them like a regression equation? However, there's one fly in the ointment: they were derived to predict the value of the variable from the factors. What we want to do is just the opposite, to predict the factor from the variables. So, if we want, we can command the computer to give us a factor score coefficient matrix, such as the one in Table 19–6.

21 This was just an editorial comment; their state of mind does not affect the sample size.
22 Doubters, though, would say we've created chaos out of order.
23 Tooting our own horn a bit: see Streiner and Norman (2003) for more details on scale construction.

07099 . the correlations among the variables will change drastically.08001 . This means that we’ve artiﬁcially built in a negative correlation between these two scales—if you get a score on one.04008 . True/False.26002 . Let’s say there is an item on a personality test that reads.23. there isn’t such an item.03712 .”26 If the person answers True. In this case. but there should be. the three-factor scores for subject 1 would be found by plugging her 15 standardized scores into the equations.26535)BILLING FS2 = ( .00042 . is –1/(k – 1). we have built in a correlation between the scales. you can’t get it on the other. and the expected correlations among the variables. “Sometimes I am sure that others can tell what I am thinking. it doesn’t mean that anything goes. then the correlation between the items will be 1.35157 . Comrey (1978) pointed out several problems that can arise with dichotomous data. and don’t require multivariate normality (Floyd and Widaman. It almost goes without saying that.00191 TYPES OF DATA TO USE Most of the methods used for exploratory factor analysis are fairly robust against deviations from normality. 0/1.00017 .08689 . A different problem arises when we have ipsative data. this is most true when the variables are totally uncorrelated with each other.09003 . are fair game to be factor analyzed. 1994). while a False answer means that he gets scored on the Grandiosity scale (we’ll ignore the other problem—that this is a dichotomous item).24091 . correlations with dichotomous data are often unstable and will be either artiﬁcially limited or grossly inﬂated. we’ll mention one more fact about factor scores that may actually simplify your life. and (2) the correlation among the factor scores.03297 . after all. but 95% answer True on another variable.06561 .21052 . the correlation suddenly becomes 0.27844 . A third problem27 arises when one variable is a combination of others.05618 .04558 .26677 . and they are bored. Although the factors are uncorrelated (assuming we’ve stopped at PCA or used an orthogonal rotation in FA).02083)ACUITY (. The sum of the rows or columns of an ipsative correlation matrix must be zero. set each signiﬁcant loading equal to 1.00: then all you have to do is add up the (standardized) scores. Computing factor scores is no exception. at times.21378 .09673 . For example.03030 . and the correlation may not reﬂect a relationship between the scales at a theoretic level. Just as bad as dichotomous items are nominal variables. far as we know.00017)COLOR + … + (.33097 .27844)ACUITY + (. Second.03831 . which would then read: FS1 = (. then the maximum correlation between these two variables is about ±0.26219)COLOR + … + (. the numbers don’t really mean anything. This means that ﬁve. depending on which technique we use. 26As 27Or .14363 .14349 . Where they differ is in (1) the variance of the scores.02529 .26535 .08966 . If we change the coding scheme. When more than 10 variables are loading on a factor. so any “pattern” among the variables is artifactual. and 1 person says True. a Total variable (such as the Full Scale IQ or the Verbal and Performance IQs on Wechsler IQ tests) could be the sum 24As low as they may appear. Actually. a couple of other ways are lurking around just to complicate our lives.36291 . Despite this injunction. assuming that they’re independent.31804 .07488)BILLING FS3 = (. Acuity Color Nystagmus Detail Carrots Fine dexterity Gross dexterity Softness Tremor Checks Interest Scrooge Dunning Overcharge Billing . is it the fourth? 
We’ve lost track. The reason is that. depending on the situation.29316 .07620 . the greater the possible loss in efficiency when using these unit weights.02083 . However.30845 . forget about multiplying them by the coefficients. the β weights don’t improve the predictive ability of the equation to any degree that’s worth worrying about (Cohen. with more than 10 IVs. All of them yield scores with a mean of 0.04866 . which is easier to deﬁne by starting off with an example. If you use unit weights.08967 . But. if 99 people say False to two items. we do have some standards. 1990. with the predicted factor score as the DV and the variables as the IVs.00.14474 .24 The major no-no is dichotomous data: Yes/No. merely because of the scoring scheme. 1976).25 First. they’re just names.06666 . Because their coding is arbitrary.00 (or –1. he gets scored on the Depression scale. Wainer. you can probably forget about the equations entirely. you’ll ﬁnd examples of this type of variable being factor analyzed in many journals.00.PRINCIPAL COMPONENTS AND FACTOR ANALYSIS 207 Table 19–6. Each column is a regression equation.06561)ACUITY (.03500 . 25Interestingly. he pointed them out only after he ran out of dichotomous scales on the MMPI to factor analyze and publish.00856 .00 if it’s negative) and the nonsigniﬁcant ones to 0. Present/Absent.07488 . the factor scores themselves might be.26219 . However. if about half of the people respond True on one variable. the greater the magnitude of the correlation.10321 . So. These scales would be called ipsative. often used in attitude and personality assessment.06656 . if this one person then changes her mind and also answers False. where k is the number of scales (Dunlap and Cornwell.09003)COLOR + … (. making the job of transferring the results to another program much easier. 1995).00191)BILLING Factor 1 Factor 2 Factor 3 TABLE 19–6 Factor score coefficient matrix Most computer programs can calculate the factor scores for us and then save them in a ﬁle. So. and other variables of that ilk. if we have one way to do things in FA.or seven-point scales.
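The factor-score equations are nothing more than weighted sums of a person's standardized scores, and Wainer's shortcut simply replaces the weights with 1, 0, or −1. A minimal sketch of both versions (the coefficients and z-scores below are made up, not the values in Table 19-6):

```python
import numpy as np

# z-scores for one subject on five variables (made up; the example has 15).
z = np.array([0.8, 1.2, -0.3, 0.1, -0.6])

# Factor score coefficients, one column per factor, in the spirit of
# Table 19-6 (made-up values, not the ones in the table).
coef = np.array([[0.28, 0.02],
                 [0.26, 0.00],
                 [0.01, 0.31],
                 [0.03, 0.27],
                 [0.02, 0.25]])

print(z @ coef)          # the regression-based factor scores

# Wainer's shortcut: call every "significant" weight 1 (or -1), the rest 0,
# and simply add up the z-scores.  In practice the 1/0/-1 pattern comes from
# the rotated loading matrix; the same made-up matrix is reused here for brevity.
unit = np.where(np.abs(coef) >= 0.20, np.sign(coef), 0.0)
print(z @ unit)          # unit-weighted factor scores
```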

be sure to change the scoring so that high scores on all of the items mean the same thing. ceteris paribus. in order to minimize various response biases. the communalities of the items. A second issue has to do with measurement theory. if you don’t. The aims are the same: to determine if there are a number of underlying factors that can explain the pattern of responses. you may still ﬁnd that one factor consists of the positively worded items and a second factor of the negatively worded ones. such as the number of items on each factor. What we do have are ﬁrmly held beliefs32 and some Monte Carlo simulations. compared to when we factor analyze scales. So. 2003). its reliability is going to be the pits. A third problem (or. Latin for “all else being equal. we just can’t get away from it. completely at random. those who haven’t read our book. This is related to the equally unfortunate situation that easy items (those endorsed by many people) cluster together. However. people who do these studies are emphatic that there are no ﬁxed rules regarding the subject-to-variable ratio. so needless to say. that you already know. If this is the case. we’re dealing with numbers that are relatively reliable. the number of factor analyses performed each year will drop by about 70%. yet again irrespective of content. then you should probably at least double the subject/variable ratio. can be seen as a “scale” having only one item. This means that. set of problems) involves the response formats found with some scales. and the total amount of variance explained by the factors will be lower. Those simulations tell us that the sample size depends on a host of factors (pardon the pun). rather than total scores. rather. resulting in much joy among readers of journals and much consternation within the paper manufacturing business. the computer program should have a major myocardial infarct and die on the spot. If we knew the magnitudes of all of these parameters. this advice ﬁts the usual deﬁnition of an epidemiologist—he tells you something that is absolutely correct. with the proviso that we have at least 100 subjects. Even if the scoring is ﬂipped for this latter item. we wouldn’t have to do the @#$%& study to begin with! What we are left with. and so on. the more reliable it is (Streiner and Norman. Very often. but they’re taking a big risk). said that this should suffice only if the communalities are high and there are many variables for each factor. “I feel great”) and others in a negative direction (“I do not feel great”). there are no power tables to tell exactly how many subjects to use. SAMPLE SIZE In factor analysis. if these rules are followed. and the person who proposed these guidelines. 32Often . 1988). as well as the total number of subjects analyzed. the correlation between any two items is affected not only by their content (which is good). We’ve already mentioned one issue when analyzing items: many scales use items with dichotomous responses. the reliability of a scale is directly dependent on the number of items. though. to ﬁnd items that don’t seem to belong anywhere and should be dropped. Finally. Items with similar distributions tend to correlate more highly than items with dissimilar distributions (Bernstein et al.g. 28Seems no matter how hard we try. that factor loadings may be lower. We dare say that.28 In classical test theory. 
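Recoding the negatively worded items is purely mechanical: for, say, a 1-to-7 response format, the reversed score is (1 + 7) minus the original. A minimal sketch (the items and responses are made up):

```python
# Reverse-score negatively worded items before factor analysis, so that a
# high score always means more of the trait.  For a 1-to-7 response scale
# the recode is simply (min + max) - old score.
def reverse(score: int, low: int = 1, high: int = 7) -> int:
    return (low + high) - score

responses = {"I feel great": 6, "I do not feel great": 2}   # made-up data
negatively_worded = {"I do not feel great"}

recoded = {item: reverse(s) if item in negatively_worded else s
           for item, s in responses.items()}
print(recoded)   # both items now point in the same direction
```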
Most good books about measurement theory30 recommend that about half of the items be worded so that high numbers reﬂect strong endorsement of the trait and half be worded so that low scores indicate endorsement. there’s Streiner and Norman (2003)— yet again. but be aware of the potential problems (we didn’t always follow our own advice). Before you factor analyze such a scale. However. argued with as much vehemence as two theologians debating whether angels can get dandruff (and with about the same degree of data to back them up). and. The only way around this problem is to not include variables that are sums or products of other variables in the same data matrix. 1994). Factor analysis can also be used for developing scales where the data consist of the individual items. choose just one. ANALYZING ITEMS—A FEW MORE CAUTIONS The example we used in this chapter involved the factor analysis of the total scores of a number of scales. the astrological sign under which you were born. indices such as KMO and MSA won’t be as strong. we should expect to ﬁnd. to see if some items are redundant and can be eliminated. again irrespective of content. then.. one of the granddaddies of FA. when we analyze items. there are a number of problems that arise in analyzing items that go beyond the problems encountered when we analyze total scores. If you don’t meet these two conditions. in terms of planning a study. as do hard ones (Nunnally and Bernstein. then the factors will reﬂect the scoring direction rather than items with similar content. 29That’s 30To 31Obviously. An item. is a rule-of-thumb guideline that we’ve been told we shouldn’t have: we must have an absolute minimum of ﬁve subjects per variable (some people would go as low as three. and that is totally useless. Some test developers31 try to balance the scoring by having some items worded in a positive direction (e. the average strength of the factor loadings. and factor analyzing these is a no-no. the bottom line is that you can use factor analysis to analyze individual scales (we’ve done it dozens of times). for all we know. but also by their statistical distributions (which is bad).” and is used here merely to impress you with our erudition.29 the more items a scale has. Gorsuch (1983).208 REGRESSION AND CORRELATION of two or more other variables. So when we factor analyze total scores.
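The chapter's rules of thumb for sample size are easy to encode as a rough screen; this is not a power calculation, and the thresholds below are just the guidelines given in the text (five subjects per variable, an absolute floor of 100, and a doubled ratio when the communalities are low or there are few variables per factor):

```python
def enough_subjects(n_subjects: int, n_variables: int,
                    high_communalities: bool = True) -> bool:
    """Rule-of-thumb screen from the chapter, not a power analysis."""
    ratio_needed = 5 if high_communalities else 10
    return n_subjects >= 100 and n_subjects >= ratio_needed * n_variables

# 63 subjects and 12 items clears 5:1 but fails the 100-subject floor.
print(enough_subjects(63, 12))    # False
print(enough_subjects(200, 15))   # True
```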

16 . Empathy.61 .42 0.27 .23 . or change it to Principal axis factoring to run a common FA • Click the Options button • In the Coefficient Display Format box. Acceptable. or Promax for an oblique one • Click Continue • Click the Extraction button • In the Method box. b.15 .22 . keep Unrotated factor solution and add Scree plot • In the Extract box.61 .11 .709 . change the default from Correlation matrix to Covariance matrix • In the Display box.57 .14 . there should be at least 100 subjects. how many factors are there? 3.11 1.80 .78 . 1. Initial solution is the default.09 1.07 . e.43 1.53 . and click the arrow to move them into the box marked Variables • Click the Descriptives button • In the Statistics box.828 Rotated factor loadings of the 12 items 5. 2.57 . Antiimage [to get the Measure of Sampling Adequacy]. the ratio should be 10:1. c. c.06 . Are there any items you would drop? Why (or why not. The rotated factor loading matrix is shown in the table. The subject-to-variable ratio is: a. What proportion of the variance is accounted for by the retained factors? 4. one of the authors develops a test for budding social workers called the Streiner Knowledge of Relationships.12 .03 . as the case may be)? Item Factor 1 Factor 2 Factor 3 Factor 4 1 2 3 4 5 6 7 8 9 10 11 12 EIGENVALUES .12 .03 .10 . and Warmth scale (the SKREW). since they’re only social workers. choose Coefficients [if you haven’t already run a correlation matrix].14 .10 .14 .22 . Too low.64 .380 .28 .44 .26 . and KMO and Bartlett’s test of sphericity • Click Continue • In the Analyze box. since it’s 5:1.13 . d. keep the default of Principal components to run PCA.08 .05 .PRINCIPAL COMPONENTS AND FACTOR ANALYSIS 209 EXERCISES In an attempt to gain immortality by attaching his name to a questionnaire.323 –.02 .33 . He starts off with 12 items. keep it • In the Correlation Matrix box.05 . a.03 . keep the default [Eigenvalues over 1 times the mean eigenvalue] or enter the number of factors you want to retain • Click Continue • Click the Rotation button • The default for Method is None.20 . Too low.64 . Click Varimax for an orthogonal rotation. and administers them to a validation sample of 63 already blooming SW types.17 .15 . Acceptable. What is the communality for Item 1? What is its uniqueness? What does this mean? Do you really care? Should you? How to Get the Computer to Do the Work for You • From Analyze. choose Sorted by size • Click • Click Continue OK . d.23 .04 .04 .30 . choose Data Reduction ¨ Factor • Click on the variables to be analyzed from the list on the left. Using the Kaiser criterion. which he hopes will tap these three areas. b.
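For the exercise, the arithmetic is mechanical: keep factors whose eigenvalue exceeds 1 (the Kaiser criterion), divide the sum of the retained eigenvalues by the number of items to get the proportion of variance explained, and sum an item's squared loadings on the retained factors to get its communality (uniqueness is whatever is left over). A sketch with made-up eigenvalues and loadings, deliberately not the ones in the exercise table:

```python
n_items = 12

# Made-up eigenvalues for 12 items (they sum to 12, as they should for a
# correlation-matrix PCA, but they are not the exercise's values).
eigenvalues = [3.2, 2.1, 1.4, 0.9, 0.8, 0.7, 0.6, 0.6, 0.5, 0.5, 0.4, 0.3]

retained = [e for e in eigenvalues if e > 1.0]           # Kaiser criterion
prop_variance = sum(retained) / n_items                  # variance explained
print(len(retained), round(prop_variance, 2))            # 3 factors, 0.56

# Communality of one item = sum of its squared loadings on the retained
# factors; uniqueness = 1 - communality.
item1_loadings = [0.61, 0.12, 0.08]                       # made-up loadings
communality = sum(l ** 2 for l in item1_loadings)
print(round(communality, 2), round(1 - communality, 2))   # 0.39 and 0.61
```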

multiple regression and exploratory factor analysis. only its magnitude. We’ll also assume that this scale. For reasons that will become clear soon. you may want to review those chapters ﬁrst. Let’s start off easy by using statistical tests we have already encountered and seeing what their limitations are and how we can get around them.2 he could run a mul- tiple regression equation. He is bothered. have you ever stopped to wonder why so many cheerleaders are blonde? means a line that doesn’t curve. and others are derived from scores on two or more tests. some variables are measured directly (like height and jumping ability).CHAPTER THE TWENTIETH Path analysis is an extension of multiple regression. however. If Dr. allowing us to look at the relationships among many “dependent” and “independent” variables at the same time. one area of knowledge has still eluded them: delineating what accounts for success in cheerleading.g. One question we may want to ask is. Compounding his problems. is normally distributed and has all of the other good qualities we would want. So a more accurate picture may be the one shown in Figure 20–2. we won’t comment on the direction of the correlation between IQ and cheerleading. But a little reﬂection may lead us to feel that the story is a bit more complicated than this. Teeme wanted to predict CLAP/TRAP scores from the person’s height (in inches). in explaining the movement of the continents over the face of the planet. namely. We’ll assume that success in cheerleading (the dependent variable) has been measured on a scale that goes from 1 (“Performance guaranteed to cause fans to root for the opposing team”) to 10 (“Performance results in terminal happiness among fans”). A person’s height may act directly on the coach’s evaluation. Teeme tease these effects apart. How can Dr. the interests of political correctitude. if you feel a bit shaky about them. so we’ve just halved what you have to learn. and intelligence (in IQ points). jumping ability (in inches). which model is a more accurate reﬂection of reality? We’ll begin by just looking at how the variables are correlated with each other. and in eradicating many sources of disease. But. Yeigh Teeme has tried to solve this problem by gathering a lot of data (some say too much) about men and women who were and were not successful in this demanding task. called the Cheer Leader Activity Proﬁle/Teacher Rating of Athletic Performance or CLAP/TRAP. permitting the same type of analysis with latent variables. Dr. with an arrowhead at one end. to get a feel for what’s 210 .g. but it may also inﬂuence jumping ability.. we’ve drawn a box around each of the variables and a straight arrow3 leading from each of the predictor variables to the dependent variable (DV). path analysis is “merely” a subset of SEM. while other variables may act indirectly (e.1 So. which we discussed in Chapter 13. it would look something like: CLAP/TRAP b0 b1 Height b3 Intelligence b2 Jumping (20–1) 2In 3That A diagram of what this equation does is shown in Figure 20–1. But. we’ll start off with path analysis. One form of SEM is conﬁrmatory factor analysis. it doesn’t refer to a wellscrubbed Boy Scout. PATH ANALYSIS The two techniques we will discuss in this chapter— path analysis and structural equation modeling (SEM)—are extensions of procedures we have discussed earlier. This implies that we are assuming that each of the variables acts directly on the DV. Structural equation modeling (SEM) takes this one step further. 
a “winning personality” may lead the coach to overvalue the person’s ability). take into account both measured and inferred variables. because it’s easier to understand. the ability to jump high while smiling).. by the fact that some variables may affect performance directly (e. and ﬁgure out which ones are important? 1In fact. Path Analysis and Structural Equation Modeling SETTING THE SCENE Although science has made tremendous advances in predicting which stars will become black holes instead of just ﬂickering out.
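Equation 20-1 is ordinary multiple regression, so any regression routine will do. A minimal sketch that fits it by least squares on made-up data (the variable names come from the example; the numbers and the Python code do not):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                                   # made-up sample size

height  = rng.normal(68, 3, n)            # inches
jumping = rng.normal(20, 4, n)            # inches
iq      = rng.normal(100, 15, n)

# A made-up "true" model, just so there is something to fit:
clap_trap = 1 + 0.10 * height + 0.15 * jumping - 0.02 * iq + rng.normal(0, 1, n)

# CLAP/TRAP = b0 + b1*Height + b2*Jumping + b3*IQ   (Equation 20-1)
X = np.column_stack([np.ones(n), height, jumping, iq])
b, *_ = np.linalg.lstsq(X, clap_trap, rcond=None)
print(b)                                  # b0, b1, b2, b3
```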

are the β weights. and between IQ and Height is rIQ-Height.086) + (–0. r is the sum of both the direct and indirect effects between the two variables. In this case.677. we’ll leave the calculations for the direct and indirect effects of Height to you. r.608 (20–2) Height 6That IQ Jumping CLAP/ TRAP FIGURE 20–2 Modiﬁcation of Figure 20–1 to show that Jumping Ability is affected by Height. The sum of these three terms is 0.548. we have commented on the relationship between IQ and cheerleading.433 (–. in this case. Its β weight is –0.245) + (–0.245 (which is 0.199) CLAP/ TRAP see in a bit that this is true in this model.000 .54 ( 8To Height FIGURE 20–3 Figure 20–1 with the correlations among the variables and the weights in parentheses added.199. 5The Interpreting the Numbers If we now ran a multiple regression based on the model in Figure 20–1. we can show that the variables are correlated with each other by joining the boxes with curved.199. 7 .548.7 is the zero-order correlation between Jumping and CLAP/TRAP. Adding up these three terms gives us: (–0. Through Jumping Ability.277) = –0. we would get (among many other things) three standardized regression weights (betas.548. then.433) times the effect of Jumping on CLAP/TRAP.199 = –0. . we can denote the correlation between pairs of predictors with the usual symbol for a correlation. The problem is that the model in Figure 20–1 shows only part of the picture. Jumping Ability.505 Jumping . going on. and βIQ is –0. the magnitude of the effect of IQ is the correlation between the two predictors (r = –0. or 0.6 08 24 5) –.245. or –0. the relationship between these two sets of parameters isn’t immediately obvious. so we lied.5 IQ Jumping CLAP/ TRAP FIGURE 20–1 Model of linear regression predicting CLAP/TRAP scores from Height.000 . Also. one for each of the predictors. and so –0.4 Height and Jumping are positively correlated with each other and negatively related to IQ. we can see the relationship between the correlations and the β weights.372. Similarly. but it doesn’t necessarily hold in others.505 × 0. and those above the single-headed arrows are the correlations between the predictors and the DV. Norman is 6’3”. so that between IQ –. test your understanding of these concepts (but mainly to save the batteries in our calculator). The numbers near the curved arrows are the zero-order correlations between the predictor variables. Doing the same thing for Jumping Ability shows a direct effect of 0. Let’s start by looking at the effect of IQ on CLAP/TRAP. the correlation matrix is shown in Table 20–1. Height CLAP/TRAP Height IQ Jumping TABLE 20–1 Correlations among the variables in predicting CLAP/TRAP scores CLAP/TRAP Height IQ Jumping 1. Below the arrows. which. To introduce a notation we’ll use a lot in this chapter.505 1. which is its β weight of 0. or βs).433 × –0. As a hint.433 1. Then. IQ exerts an indirect effect on CLAP/TRAP through its correlations with Jumping Ability and Height. and in parentheses. we’ve added the arrows and some numbers. an indirect effect through IQ of –0. and is strongly and negatively related to IQ. it’s not the American Indians’ equivalent of missiles with multiple warheads. and IQ. βHeight is 0.086. between Height and Jumping is rHeight-Jump. Now.678 × 0.678 .245. βJumping is 0. –. double-headed arrows. While the relative magnitudes of the β weights and their signs parallel those of the correlations.6 In Figure 20–3. the correlation between IQ and Jumping is rIQ-Jump. 
We can actually exploit this to tell us how “good” each model is.277. As we would suspect.8 Conceptually. the indirect effect of IQ through Height is –0. and an indirect effect through Height of 0.199.000 . Streiner is 5’8”. believe it or not.608 . The path between each of the predictors and the dependent variable is a β weight. CLAP/TRAP has a strong positive relationship to Height and Jumping Ability. More formally.433 × 0. negative relationship between Height and IQ should give you some clue as to the author of this chapter. means a line with an arrowhead at each end.000 7We’ll which is the correlation between IQ and CLAP/TRAP.PATH ANALYSIS AND STRUCTURAL EQUATION MODELING 211 4OK. it doesn’t take into account the correlations among the predictor variables themselves.80 8) .677 (.678 .807 1.106).677 .
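With standardized variables, the β weights can be computed directly from the correlations: β = R_xx⁻¹ r_xy, where R_xx holds the correlations among the predictors and r_xy their correlations with the DV. A sketch with made-up correlations (deliberately not the values in Table 20-1):

```python
import numpy as np

# Made-up correlations among three predictors (Height, Jumping, IQ) ...
R_xx = np.array([
    [1.00,  0.68, -0.43],
    [0.68,  1.00, -0.51],
    [-0.43, -0.51, 1.00],
])
# ... and their made-up correlations with the DV (CLAP/TRAP).
r_xy = np.array([0.68, 0.81, -0.61])

betas = np.linalg.solve(R_xx, r_xy)       # standardized regression weights
print(np.round(betas, 3))
```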

550 is augmented by its indirect effect through Jumping Ability of 0. in Figure 20–4.573 . in Figure 20–3. and you have just been introduced to the basic elements of path analysis. see the reason for this counter-intuitive path in the next section. If we follow the arrows leading out of each predictor variable. • Once you’ve gone forward along a path using one arrow. we had Height → Jumping → CLAP/TRAP.505 1.068). we traced the indirect contribution of IQ → Jumping → CLAP/TRAP and IQ → Height → CLAP/TRAP. we’ve added the “path coefficients” to the second model (the one in Figure 20–2).247 ( = 0.550 ( = 0. If you were paying attention.678 × 0.505 . we did not trace out the path IQ → Jumping → Height → CLAP/TRAP or IQ → Height → Jumping → CLAP/TRAP. even though the arrow itself actually points from Height to Jumping.807 1. strangely enough. so that its total effect is 0. The rule prohibits only paths that go forward and then backward. and between Height and CLAP/TRAP is βHeight.505 × 0.125).505 × –0. These paths enter one variable on an arrowhead from IQ.5 50 ) FIGURE 20–4 Figure 20–2 with path coefficients added. this path goes backward and then forward.573. and others that we took seemed somewhat bizarre. The . Finding Your Way through the Paths When we decomposed the correlations in Figures 20–3 and 20–4. we’ve delineated the paths through which the variables exert their effects.505 Jumping (. you can’t go back on the path using a different arrow.000 . Sewall Wright (the granddaddy of path analysis) laid out the rules of the road: • For any single path. Now we are in a better position to see what adding the arrow in Figure 20–2 does to our model. Back in 1934. you will have noticed that. Jumping → Height → CLAP/TRAP. and would then leave on an arrowhead to get to the other variable—maybe not a felony offense but deﬁnitely a misdemeanor with respect to the rules. Jumping and CLAP/TRAP is βJump. For example. the direct effect of 0. for an overall effect of –0. it has a direct effect (–0. the path in Figure 20–4 starting at Jumping and then going through Height to CLAP/TRAP doesn’t violate any of these rules.247).678) CLAP/ (. and. In other words. we are saying that Height can affect Jumping Ability (which makes sense). you can go through a given variable only once. 10We’ll What we have just done is to decompose the correlations for the predictor variables into their direct and indirect effects on the DV.212 REGRESSION AND CORRELATION IQ (–. This path is meaningless in terms of our knowledge of biology (the technical term for this is a spurious effect). Similarly. Therefore: rIQ-CLAP/TRAP= βIQ + (rIQ-Height × βHeight) + (rIQ-Jump × βJump) (20–3a) rHeight-CLAP/TRAP = βHeight + (rIQ-Height × βIQ) + (rHeight-Jump × βJump) (20–3b) rJump-CLAP/TRAP = βJump + (rIQ-Jump × βIQ) + (rHeight-Jump × βHeight) (20–3c) 9Unless.200 = –0. and a very indirect effect through Height to Jumping to CLAP/TRAP (–0. an indirect effect through Height of 0.000 . or 0. Figure 20–2 is called a path diagram.342 . Kenny’s rule is the reason why.550 = –0.000 .136.608 .678 . The main effect is that it imposes directionality on the indirect effects. not surprisingly. strange as it may seem.000 direct effects are changed only slightly (although they are changed). Height TABLE 20–2 Original correlations in the upper triangle and reproduced correlations in the lower triangle CLAP/TRAP Height IQ Jumping CLAP/TRAP Height IQ Jumping 1.10 For Height. between IQ and CLAP/TRAP is β IQ . IQ is even less straightforward. 
in this example. an indirect effect through Height to CLAP/TRAP (–0. In Figure 20–4.278).677 .505 × 0. but that Jumping Ability doesn’t affect Height (which wouldn’t make sense9). these numbers do not add up to equal the correlations in Table 20–2.678 . and its indirect effect through IQ of –0.433 1.810 . there were some paths we did not travel down. we did not have a path IQ → Jumping → Height → CLAP/TRAP. but now the indirect effects are quite different. you jumped from a high height and ﬂattened yourself— but that would result in a negative correlation between the variables.678 × 0. for a total of 0.200 and. in Figure 20–3. we see that Jumping has a direct effect of 0..373).810.593.678 × 0.200) TRAP (. We’ll show you why they don’t when we discuss how we can tell which models are better than others. • You can’t go through a double-headed curved arrow more than one time. Bizarre as it may seem. of course.200.593 . 24 7) –. Kenny (1979) added a fourth rule: • You can’t enter a variable on one arrowhead and leave it on another arrowhead.
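Equations 20-3a to 20-3c say that each predictor's correlation with the DV equals its own β (the direct effect) plus, for every other predictor, the correlation between the two predictors times that other predictor's β (the indirect effects). A sketch that checks the identity, reusing the made-up numbers from the previous block:

```python
import numpy as np

# Same made-up correlation matrix and DV correlations as the sketch above.
R_xx = np.array([
    [1.00,  0.68, -0.43],
    [0.68,  1.00, -0.51],
    [-0.43, -0.51, 1.00],
])
r_xy = np.array([0.68, 0.81, -0.61])
betas = np.linalg.solve(R_xx, r_xy)

# Equation 20-3: r(i, DV) = beta_i + sum over j != i of r(i, j) * beta_j.
# Because the diagonal of R_xx is 1, R_xx @ betas computes all three at once.
reproduced = R_xx @ betas
print(np.allclose(reproduced, r_xy))      # True: the correlations are recovered
direct = betas
indirect = reproduced - betas             # everything routed through the other predictors
print(np.round(direct, 3), np.round(indirect, 3))
```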

Height This means that exogenous variables inﬂuence or affect other variables. but whatever inﬂuences them is not included in the model. a person’s height may be inﬂuenced by genetics and diet. When we looked at Figure 20–1. this can only be supplied by our theory. CLAP/TRAP. The bottom line is that determining true causation from correlations is still akin to getting gold from lead. the hope was that we could make statements about causality. the variables on the left are referred to as exogenous variables. This illustrates why the terms “independent. Those on the left are direct models.PATH ANALYSIS AND STRUCTURAL EQUATION MODELING 213 but it is legitimate insofar as decomposition of the correlation is concerned. This further shows one of the major strengths of path analysis as compared to multiple regression. since. Jumping is a dependent variable in relation to Height but a predictor in terms of its relationship to CLAP/TRAP. this isn’t as capricious as it may ﬁrst appear. Exogenous variables have arrows emerging from them and none pointing to them. That is. Does this mean that we actually can take correlational data and have them tell us about causation? Despite the hopes of the early developers of this technique. all of the statistics would be the same. in SEM in general). in that the exogenous variables inﬂuence the endogenous ones without any intermediary steps. but it cannot tell us which way the arrow should point. regression cannot easily deal with the situation in which a variable is both an independent variable (IV) and a DV. In Figure 20–4. “everything correlates to some extent with everything else” (p. one really remembers his ﬁrst ﬁve laws. their gender will remain the same. To show what more path analysis can do. and drawing in all those arcs between pairs of variables can really make the picture look Endogenous and Exogenous Variables The models we’ve discussed so far are relatively simple ones. 12Although Any variable that has at least one arrow pointing toward it is an endogenous variable. it’s fairly safe to assume that gender (or factors correlated with gender. and the exogenous variables are not correlated with each other. is more common.13 In path analysis (and.” “predictor.14 Part B of the ﬁgure. failing to reject a causal model isn’t the same as showing that it’s correct. group improved their skills at time 2. and ﬁnd that the experimental. However. the endogenous variables have arrows pointing toward them and none pointing away from them. the two exogenous variables affect an endogenous one. in that it has arrows pointing toward it. Types of Path Models Figure 20–6 shows a number of different path models. as we’ll see.” If we had altered the model by having the path go from Jumping Ability to Height. the causality is apparent from the design of the study. as Meehl said in his famous sixth law (1990). just as failing to disprove the null hypothesis does not mean that we have proven it. some of which we have already encountered. such as educational experiences or brain structure) leads to differences in ability. No matter how much we change people’s abilities. For example. if we ﬁnd a relationship between gender and spatial or verbal ability. it is called endogenous. as in Figure 20–5. we would know from the statistics that something was amiss. a correlated path model. but our model will ignore these factors. What about Jumping Ability in Figure 20–2? It has an arrow pointing toward it as well as one emerging from it. at the same time. In fact. 
it’s equivalent to a multiple regression. 204). For example. even if the data were actually correlational. Jumping CLAP/ TRAP FIGURE 20–5 Figure 20–2 with Jumping Ability affecting Height. we have new terms for these variables. the study design. hence the name “independent. there shouldn’t be a path between IQ and Jumping Ability. Now that we have a new statistical technique. rather than from Height to Jumping Ability.12 Similarly. for reasons we’ll see shortly. we referred to the variables on the left as the predictors and CLAP/TRAP as the dependent variable. In Part A of the ﬁgure (called an independent path model).” This situation isn’t too common in research with humans. let’s look at some models it can handle and. in other words. if we speciﬁed a model that didn’t make much sense. Path analysis can also disprove a “causal” model we may postulate—or fail to disprove it. It exists because Jumping and CLAP/TRAP have a common cause—Height. this technique can tell us whether or not there should be a path between two variables. IQ Path Analysis and Causality Until recently.”11 Because we can specify paths by which we think one variable affects another. 11Or “modelling. path analysis was called “causal modeling. As long as there is at least one arrow (a path) pointing toward a variable. What we had called the dependent variable.” and “dependent” variable can be confusing in path analysis. such as having the path go from IQ to Jumping Ability. their attraction to the opposite gender may change. keeping the original terms can lead to some confusion.” the spelling doesn’t affect the results. although we often have more than two predictors. but not the control. 13Actually. introduce you to some of the arcane vocabulary. or knowledge of the literature. if we modify an educational program at time 1. is an endogenous variable in SEM terms. 14No . the answer is “No.
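The exogenous/endogenous distinction is just bookkeeping about arrowheads: no incoming straight arrow means exogenous; at least one incoming arrow means endogenous (and, as discussed later, a disturbance term). A minimal sketch using the four paths of Figure 20-2:

```python
# Straight (one-headed) arrows from Figure 20-2: predictor -> outcome.
paths = [
    ("Height", "Jumping"),
    ("Height", "CLAP/TRAP"),
    ("Jumping", "CLAP/TRAP"),
    ("IQ", "CLAP/TRAP"),
]

variables = {v for edge in paths for v in edge}
endogenous = {dest for _, dest in paths}          # at least one incoming arrow
exogenous = variables - endogenous                # arrows out, none in

print(sorted(exogenous))    # ['Height', 'IQ']
print(sorted(endogenous))   # ['CLAP/TRAP', 'Jumping']
```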

15 The two endogenous variables don’t have to be the same ones as the exogenous. 16To choose two heights. something is missing. there are two exogenous and two endogenous variables. For example. affects illness behavior immediately. variables X1 and X2 both inﬂuence Y. in Part F. Getting Disturbed The models in Figure 20–6 look fairly complete and. in fact. That is. variable X1 does not affect the endogenous variable directly. but only through its inﬂuence on X2. which in turn affects Y. The magnitude of the path coefficients tells us how strong each effect is. the same applies to Y1 and Y2. X1 and X2 were the amount of stress at time 1 and 6 months later at time 2.214 REGRESSION AND CORRELATION X1 Y X2 (A) X1 Y X2 (D) X1 Y X2 (B) X1 X2 (E) Y X1 FIGURE 20–6 Different path models. and stress at time 2. Furthermore. Jumping (X2) affects CLAP/TRAP directly and also mediates the effects of Height (X1). we wanted to see if stressful events in a person’s life led to more illness. We’re going to add some more terms. but this is a different class of model. in that we assume the predictors are correlated with one another to various degrees. but one person is 6’3” and the other “only” 5’8”. stress at time 1. In our example. This model allowed us to take into account the fact that the best predictor of future behavior is past behavior (the path between Y1 and Y2). which. not to disturb you. X2 Y2 (C) X1 Y1 (F) X2 Y2 Y1 15A more complete model would add a path between X2 and Y2. is the same as part C. What this does is to turn X2 (stress at time 2) into a mediated variable. E for (Error). the height of one’s father (X1) affects a person’s CLAP/TRAP score (Y) only by its effects on the cheerleader’s height (X2). Such is the beauty of path analysis. In Part C. a small circle with an arrow pointing toward . The picture it portrays is what we usually deal with.” since he’s exerting himself more. Now we are saying that the number of illnesses measured at time 2 is affected by three things: illness at time 1. and by affecting stress at time 2. X2 also mediates the effect of X1 on Y. we would expect that. In Part E of the ﬁgure. except that we’ve added a path between X2 and Y2. and then look at the added effect of stress at time 1 on illness at time 2. These are referred to as indirect or mediated models. from the standpoint of the statistical programs that analyze path models) is to attach disturbance terms to each endogenous variable. part F of Figure 20–6. because the variables with both types of arrows mediate the effect of the variable pointing to them on the variables to which they point. in turn.16 we would expect the second person to get more “credit. The major difference between the diagrams on the left of Figure 20–6 and those on the right are that the latter have endogenous variables with arrows pointing both toward them as well as away from them. but they can be. But. 1983). The interesting thing about this model is that X1 and X2 can be different variables. not quite at random (see note 5). The ﬁnal model we’ll show (there are inﬁnitely many others). In Part D of Figure 20–6. However. relatively complex. nor do the two sets of variables have to be measured at different times. and Y1 and Y2 the number of illnesses at these two times. or they can be the same variable measured at two different times. in one study we did (McFarlane et al. These are usually denoted by the letter D (for Disturbance). as we’ll see in a moment. messy. but to disturb the endogenous variables. 
A more accurate picture (and a necessary one. For example. if two people can jump equally high. we are saying that stress at time 1 works in two ways on illness behavior at time 2: directly (sort of a delayed reaction to stress).

In Figure 20–4. all the paths are unidirectional17. explicitly states that each variable directly affects the other—stress leads to illness. if you blithely and blindly use a computer program to replace your brain cells. Second.. For reasons that surpass human understanding. so that Figure 20–8 may be a more accurate portrayal of what is actually occurring. on the other hand. it may seem as if the safest strategy to use is to draw paths connecting everything with everything else. as in Figure 20–7. where every equation has an error term tacked on the end that captures the measurement error associated with each of the predictor variables. whether the 19For . and we may put them in diagrams for the sake of completeness. X1 Y1 X2 Y2 D 1 D 2 FIGURE 20–8 A nonrecursive path model. the two people in the world who don’t know what this means. as they reﬂect correlations among the variables. One disadvantage of the more sophisticated statistical techniques is that they are too powerful in some regard.g. not paths between them. and disturbances) you can analyze is determined by the number of observations. Stupid. In Figure 20–8. the disturbance term has a broader meaning. variances. it stands for Keep It Simple. in addition to measurement error.. either because we couldn’t measure them (e. it consists of everyone who has ever encountered the terms (with the exception of the dyslexic sadist who coined them). all of the models in Figure 20–6 are similar in two important ways. join the club. these models are referred to as recursive models.PATH ANALYSIS AND STRUCTURAL EQUATION MODELING 215 the variable. they capitalize on chance variance and. and far beyond the scope of one chapter in a book. but it may also be due to the effect of some underlying factor affecting both of them. since it has become an endogenous variable with the addition of the path from Height. but they’re optional at other times.” 17We don’t count the curved arrows.18 Note that there is a difference between connecting two variables with a curved. that is. how much the cheerleader’s parents bribed the coach for their kid to get a good performance score). First. without really meaning causation). model building should be based on theory and knowledge. So why didn’t we draw them? Since every endogenous variable must have a disturbance term. they are superﬂuous to those of us who are “au courant” with path analysis. that is. we suggest two things: (1) pour yourself a stiff drink and think again. A feedback loop.. Recursive and Nonrecursive Models Although it may not be apparent at ﬁrst. 18If K. it also reﬂects all of the other factors that affect the endogenous variable and which aren’t in our model. and to let the path coefficients tell you what’s going on. covariances. they go from “cause” to “effect” (using those terms very loosely. X1 Y1 X2 Y2 D 1 D 2 FIGURE 20–7 Addition of disturbance terms to Figure 20–6F. albeit with low p levels. and (2) if you still want to do it. The former means that we “merely” expect the two variables to covary. genetic factors) or we were too dumb to think of them at the time (e.e. statistical effects. and illness leads to stress. the use of the terms “recursive” and “nonrecursive” seems counterintuitive and confusing. Let’s return to the example we used in Figure 20–6F and modify it a bit more. Analysis and interpretation of nonrecursive models is much more difficult than with recursive ones.I. a given path model has the same number of observations. 
such as when we use two similar measures of the same thing. “Observations” here is a function of the number of variables and is not related to the number of subjects. If we were to draw complete diagrams for the examples we’ve discussed. you can be led wondrously astray. Covariance may be the result of one variable affecting the other. This would be a bad idea for two reasons. read one of the books listed under “To Read Further. If you ever do need to analyze a nonrecursive model. In path analysis (and in SEM in general).S. First. This is similar to what we have in multiple regression.19 At ﬁrst glance. we have added a path between X1 and Y1 and between X2 and Y2 to indicate that stress could affect illness concurrently. We have to draw them in when we use the computer programs. It could be just as logical that the relationship goes the other way and even more logical that the relationship was reciprocal: that stress at a given time affects illness and also that illness affects stress. if we had drawn the disturbance terms in the models. we wouldn’t draw curved arrows between the disturbances. double-headed arrow. A path diagram with this feedback loop is called a nonrecursive model. The second reason is that there are mathematical limits to how many paths you can have in any one diagram.S.g. or a circle with one of the letters inside. they would all be assumed to be independent from one another. as we will emphasize over and over again in this chapter (and indeed in much of the book). The number of parameters (i. That is. Membership isn’t exclusive. such as path coefficients. then Figure 20–3 would have a disturbance term attached to CLAP/TRAP. there would be two disturbance terms: one associated with CLAP/TRAP and one for Jumping. and joining them with two straight arrows going in opposite directions.

more generally. the signiﬁcance of the paths is the easiest to evaluate. and so forth. the covariance between IQ and Height. This is determined by the variances of the variables. Now the problem is. we are not interested in estimating the variances of endogenous variables but only those variances for variables that can vary. In Figure 20–3. Where does this leave us with respect to Figure 20–3? It’s obvious we want to estimate the 3 path coefﬁcients. To reiterate what we’ve said.32 and c = 24. We have 4 path coefficients (the 3 from the variables to CLAP/TRAP. b = –3). resulting in 9 parameters to be estimated. If we had another observation (e. meaning that there are 10 parameters to be estimated. that is. How many parameters do we actually have in this model? Another way of asking this question is. but there are a different number of arrows. for the same reason: they’re endogenous variables and thus not free to vary and covary on their own. “caused” or inﬂuenced by the other variables. the signiﬁcance is dependent on the ratio of the parameter to its standard error of estimate. In this case. we’ll see the implication of this. We are trying to ﬁnd out what affects the endogenous variables. Since the goal of SEM is to explain the variances of variables and the covariances between pairs of variables that can vary. Again like other parameters. by deﬁnition. so we have 10 observations.32. We say that the model is undeﬁned (or under-identiﬁed) in that there isn’t a unique solution. b= –3 and c = 8).e. but not in Figure 20–4? Why can’t we have more than 10 parameters? Does it make any difference that we had 10 parameters to estimate in Figure 20–3 and 9 in Figure 20–4? Is there intelligent life on Earth? If you’ve really been paying attention. then we can determine that c has to be 8. but is implied). but there would be nothing left to estimate (this is referred to as being over-identiﬁed). they are estimated with some degree of error. they are not free to vary or covary on their own but only in response to exogenous variables. For example. If there are k variables. so that adds another 6 parameters (3 variances + 3 covariances). then the model would be correct. and the model as a whole. so they’re on the list of parameters to estimate. Of the three. or that of Jumping in Figure 20–4. in SEM thinking. if we had a simple equation: a=b+c (20–5) and we know that a = 5. then: k (k 2 1) Number of Observations (20–4) because there are [k × (k – 1)/2] covariances among the variables. then what are the values of b and c? The problem is that there are an inﬁnite number of possible answers: b= 0 and c= 5. free to vary and thus inﬂuence the endogenous variable—so it’s one of those things we have to estimate. b= 1 and c = 4. we aren’t interested in the parameters of variables that are determined by outside inﬂuences. then knowing these variances would allow us to predict the person-toperson variation in the endogenous variables. we didn’t count the variance of CLAP/TRAP in either model. As Good as it Gets: Goodness-of-Fit Indicators How do we know if the model we’ve postulated is a good one? In path analysis.. and 1 from Height to Jumping). In the next section. This means that we can examine at most 10 parameters. and 2 disturbance terms (those for Jumping and CLAP/TRAP). What don’t we know that we should know? To answer this. of structural equation modeling. we knew ahead of time that a = 5. the model is now deﬁned (the technical term is just-identiﬁed).g. This is the situation in Figure 20–3. 
and the remaining questions will be answered (except perhaps the last question. we have the variance of the disturbance term itself (which isn’t in the ﬁgure. Finally. we have to elaborate a bit more on the purpose of path analysis and. there are three things we look for: the signiﬁcance of the individual paths. you will have noticed that we already answered the ﬁrst two questions (but we’ll repeat it for your sake). the variances of 2 exogenous variables (IQ and Height). Why can’t we examine more parameters than observations? The analogy is having an equation with more unknown terms than data. we have four observed variables. endogenous variables don’t enter into the count. If there were as many observations as parameters (i. Stay tuned. we don’t know the variances of the exogenous variables or the covariances among them. as is the case with all parameters. Because the endogenous variables are. There are still 4 observed variables. which has baffled scientists and philosophers for centuries). the reproduced (or implied) correlations.216 REGRESSION AND CORRELATION study used 10 subjects or 10. Consequently. plus k variances. where there are 10 observations and 10 parameters. we end up with a z-statistic: z Estimate of Path Coefficient Standard Error of Estimate (20–6) . how the exogenous variables work together (the curved paths. Note that the disturbance term attached to an endogenous variable is. Also. If you’ve followed this so far.000 subjects.. a number of questions should arise: Why didn’t we count the variance of CLAP/TRAP in Figure 20–3 as one of the parameters to be estimated? Why did we count the variance of Jumping among the parameters in Figure 20–3. so the limit of 10 parameters remains. Hence. which represent correlations or covariances). b= –19. Now let’s do the same for Figure 20–4. and which of those paths (the straight arrows) are important. If we had a perfect model. The path coefficients are parameters and.
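The bookkeeping in this section is worth writing down once: with k observed variables there are k(k + 1)/2 observations (Equation 20-4), the model's degrees of freedom are observations minus free parameters, and each path is tested with z = estimate/SE (Equation 20-6). The parameter counts below are the ones given in the text for Figures 20-3 and 20-4; the z example uses made-up numbers:

```python
def n_observations(k: int) -> int:
    """Equation 20-4: k variances plus k*(k-1)/2 covariances."""
    return k * (k + 1) // 2

obs = n_observations(4)                 # 4 observed variables -> 10
params_fig_20_3 = 3 + 3 + 3 + 1         # paths, variances, covariances, disturbance
params_fig_20_4 = 4 + 2 + 1 + 2         # the text's count for Figure 20-4
print(obs, obs - params_fig_20_3, obs - params_fig_20_4)   # 10, df = 0, df = 1

# Equation 20-6: each path coefficient is tested with z = estimate / SE.
estimate, se = 0.55, 0.21               # made-up numbers
z = estimate / se
print(round(z, 2), abs(z) >= 1.96)      # 2.62 True -> significant at the .05 level
```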

the better. In fact. when a model is fully determined (a fancy way of saying that the number of parameters and observations is the same). where we want the result to be as small as possible. the larger the discrepancy between the two sets of numbers—observed and expected—the larger the χ2.05 level. or that the next government will be more honest than the previous one? 21The What We Assume Path analysis makes certain assumptions. and c= 8. statistical term for this is the “Dolly Parton” effect. where the arrow between Jumping and Height went the “wrong” way. we have no reason to reject the model. the happier we are. Thus. In Figure 20–3. The second assumption. In the case of χ2GoF. since this inﬂuences the standard errors of the estimates. χ2GoF can tell us if we’re on the right track. does this prove that our model is the correct one? Unfortunately. we found differences between the actual and implied correlations. but it cannot provide deﬁnite proof that we’ve arrived. 1986). df = 1. the less the results deviate from the model. When crucial variables are left out. The ﬁrst assumption is that the variables are measured without error. until her return visit to the plastic surgeon. using a twotailed test. there are 10 of each. F-test. there’s nothing left to estimate. the χ2GoF is dependent on the sample size. which we’ll meet later in different contexts. there is no guarantee that there is yet another. we hope to ﬁnd a nonsigniﬁcant result.044. 20The standard error is estimated by the computer using methods that are beyond the scope of this chapter. However. Just to reinforce what we said earlier. If the sample size is low (say. to be fair. So. χ2GoF is 2. we cannot calculate a χ2GoF. only our theory can guide us in this regard. However. For the sake of completeness. which has a p level of . is that all important variables are included in the model. Thus. if we look at the reproduced correlations in the lower half of Table 20–2. which is much harder to detect. the fact that a model is asinine has never been a roadblock to its adoption. but it serves to remind us that we should try to use instruments that are as reliable as possible.044. if we ﬁnd that χ2GoF is not signiﬁcant.96 or greater. Third. For those of you under the age of 30. which are in the upper half of the matrix. we want the value of χ2 to be large. This was a model where there was a path between IQ and Jumping Ability. and (2) the model in Figure 20–3 ﬁts the data too well. model that may ﬁt the data even better. when we reproduced the correlations from the path coefficients in Figure 20–3. we can then compare how close different models come to estimating the data. You’ll remember that. When we look at the model as a whole. we are not testing our observed ﬁndings against the null hypothesis. it was known as the “Pamela Anderson” effect. untested. though. the data ﬁt the model perfectly. we were able to perfectly duplicate the correlations by tracing all the paths. because the number of parameters is equal to the number of observations. Because the χ2GoF is not statistically signiﬁcant. Furthermore. multiple regression and path analysis assume that the variables are additive. How many people believe in alien abductions (often in spaceships piloted by Elvis). as do all statistical tests. Because we usually want our results (the observed values) to be different from the null.525. the model ﬁt may be poor or yield spurious results. the χ2GoF statistics associated with Figure 20–2 and Figure 20–5 are identical. . under 100). 
The χ2GoF for this model (which also has df = 1) is 42. the indirect paths) and overestimates of the effects of the direct paths (Baron and Kenny. Like the case in which we are told ahead of time that a = b + c and that a= 5. or whatever. some of them differ quite a bit from the original correlations. When this rule is violated (as it always is). The reason is apparent if we compare the number of observations in Figure 20–3 with the number of parameters. This situation is reversed in the case of the χ2GoF. Life is usually much more interesting when we have fewer parameters than observations. but these usually become obvious from their weak path coefficients.PATH ANALYSIS AND STRUCTURAL EQUATION MODELING 217 and (yet again like other z tests). even models that deviate quite a bit from the data may not result in statistically signiﬁcant χ2s. we also discussed a different model. and there is a perfect ﬁt between the model and the data.20 A second criterion concerns the reproduced (or implied) correlation matrix. In the case of the model in Figure 20–4. it is signiﬁcant at the . Basically. the χ2GoF for the model in Fig- ure 20–5. It’s not a good idea to include variables that aren’t important. the answer is a resounding “No. it is almost impossible to ﬁnd a model that doesn’t deviate from the data to some degree. So. in that there is no discrepancy at all between the original and the reproduced correlations. The degrees of freedom (df) associated with the χ2GoF is the difference between the number of observations and the number of parameters. when we tried to reproduce the correlations in Figure 20–4.153. if it is 1. If there are interac- 22Although. Why this sudden change of heart? Let’s go over the logic of χ2 tests. which is highly signiﬁcant. in curing cancer by eating apricot pits. bigger is better21—the larger the value of a t-test. This further illustrates that. In this example. This is patently impossible in most (if not all) research.22 Also. they all reduce to the difference between what we observe and what would be expected. but rather against some hypothesized model. given that the two techniques are closely related. if the sample size is very large (over 200 or so). with “expected” meaning the values we would ﬁnd if the null hypothesis were true. indicating that our results match the model. hence. when we use χ2 to test for goodness-of-ﬁt. we have 10 observations and 9 parameters to estimate. r. and means that there is a very large discrepancy between the model and the data.” As we just showed. Many of the assumptions are the same as for multiple regression. is also 2. so this statistic doesn’t tell us about the “causality” of the relationship between the variables. even though one model makes sense and the other is patently ridiculous. This tell us two things: (1) the model in Figure 20–3 ﬁts the data better than the model in Figure 20–4. meaning that df = 0. the major statistic we use is called the goodness-of-ﬁt χ2 (χ2GoF). it results in underestimates of the effects of mediator variables (that is. Conversely. one that was so ridiculous that we were ashamed even to draw it. In most statistical tests. and we want our results to be congruent with it. No statistical test can tell us which variables have been omitted. which isn’t surprising. b = –3.
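Given the χ²GoF and its df, the p-value can be checked directly; the 2.044 with df = 1 reported for the Figure 20-4 model does give p ≈ .15. A minimal sketch, assuming SciPy is available (SciPy is not part of the book, and the 42.5 below is a made-up stand-in for the nonsensical model's much larger χ²):

```python
from scipy.stats import chi2

def gof_p_value(chi_square: float, df: int) -> float:
    """Upper-tail p for the goodness-of-fit chi-square."""
    return chi2.sf(chi_square, df)

# Figure 20-4 in the text: chi-square(GoF) = 2.044 with df = 1.
print(round(gof_p_value(2.044, 1), 3))   # about 0.153 -- do not reject the model

# A wildly discrepant model (made-up value) is soundly rejected:
print(gof_p_value(42.5, 1) < 0.001)      # True
```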

. but the parameter estimates become unreliable if the correlations are high (Klem. how many subjects do we need? Unfortunately. this reﬂects our conceptualization of latent variables—that they are the underlying causes of what we are measuring directly. we would say that these are all manifestations of the unseen factor (or latent variable) of “anxiety.23 SEM and Factor Analysis In addition to the variables of Height. the correlations were moderate. Because we represent latent variables with ovals. Also. or WOKS).724 for WOKS.218 REGRESSION AND CORRELATION tions among the variables. he had to resort to using three other tests that he felt collectively measured the same thing: one tapping extroversion (the Seller of Used Cars Scale. We don’t gain very much. his or her tachycardia. the signiﬁcance is the ratio of the parameter to the standard error (SE). Wouldn’t it be nice if we could add some variety. which may at ﬁrst glance seem backwards. In fact. rather than an exploratory one. Very brieﬂy. there is the added advantage that our diagrams can now become more varied and sexy. A drawing of this using the SEM conventions is shown as Figure 20–10. unable to ﬁnd a questionnaire that measures Cheer directly. There are a few points to note. there is no simple relationship between the number of parameters to be estimated and the sample size. and boxes are drawn to show measured variables—that is. Dr.” The primary difference between path analysis and SEM is that the former can look at relationships only among measured variables. latent variables aren’t measured directly. they have error or disturbance terms associated with them. they are inferred as the “glue” that ties together two or more observed (i. Figure 20–9 illustrates the example we just used in SEM terms.” success in this endeavor also depends on the candidate’s personality.e. the computer output is less helpful than from programs speciﬁcally designed to do EFA (if that’s what we want to do). So where does the sample size come into play? The sample size affects the signiﬁcance of the parameter estimates—the path coefficients. The correlations among these scales are shown in Table 20–3. we would ﬁnd that the factor loadings were 0.. Jumping Ability. A very rough rule of thumb is that there should be at least 10 subjects per parameter (some authors argue for 20).g. Exploratory factor analysis (EFA— the kind we discussed in Chapter 19) is now seen as a subset of SEM. however. Despite this.840 for SUCS. and IQ. as we all know. because SEM is a model testing or confirmatory technique. In all cases. Notice the direction of the arrows: they point from the latent variable to the measured ones.. CLAP/TRAP). why else would we even ask the question? This isn’t as ludicrous as it ﬁrst appears. since the ﬁrst part of the word “cheerleading” is “cheer. shows an increased heart rate in enclosed spaces (another measured variable). ones we observe directly on a physical scale (e. and is what leads to the use of pills.797 for MPI. directly measured) variables. another measuring positive outlook on life (the Mary Poppins Inventory. However. if we were to use the programs and techniques of SEM to do EFA. Yes. or MPI). which become extremely monotonous. 0.g. Having said that. First. If we now did an EFA. Finally. 1995). since the three measured variables are endogenous. For example. and a third focusing on denial of negative feelings (the We’re OK Scale. it’s not a coincidence. such as ovals? Of course the answer is “Yes”. 
circles represent error or disturbance terms. STRUCTURAL EQUATION MODELING The major limitation with path analysis is that our drawings are restricted to circles and boxes. Height) or a paperand-pencil scale (e. A Word about Sample Size The df associated with the χ2GoF test is the difference between the number of observations and the number of parameters. and the standard error. rather. and the covariances. In the drawing conventions of SEM. e2 Tachycardia Anxiety e3 Anxiolytics . but positive. Thus. and 0. both techniques can handle a moderate degree of correlation among the predictor variables (multicollinearity). is dependent on the square root of N. an appropriate interaction term should be built into the model (Klem. we were introduced to another type of variable: latent variables (which we referred to in that chapter as factors). using a least-squares method of extracting the factor. 1995. and uses anxiolytic medications (yet a third observed variable).” That’s an old didactic trick we learned to make things more palatable. or SUCS). Streiner. the latent variable (or trait or factor) of anxiety is what accounts for the person’s score on the test. however. If Figure 20–9 looks suspiciously like the pictures we drew when we were discussing factor analysis. In fact. e1 Test score FIGURE 20–9 Relationship between three measured variables and a latent variable. the disturbance or error 23Notice we said “general” and not “more difficult. 1994). if a person gets a high score on some paperand-pencil test of anxiety (a measured variable). As he suspected. path analysis (and SEM in general) is extremely greedy when it comes to sample sizes. Virginia. In the chapter on factor analysis. as long as there are at least 200 subjects. the variances. This will put us in a better position to understand how SEM works in more general cases. we’ll start off with EFA to show you how the concepts we covered in path analysis apply in this relatively simple case. Teeme also hypothesized that. whereas SEM can examine both measured and latent variables.

we can’t recommend a better book than Health Measurement Scales: A Practical Guide to Their Development and Use. one exists (i. We would have to run an EFA on these variables (and any other sets of variables we wanted to combine). the factor loadings for MPI and WOKS are 0. In fact.690 .000 that factor score in the next stage of the analysis. for each variable. 25Actually. error.PATH ANALYSIS AND STRUCTURAL EQUATION MODELING 219 term in EFA is usually labeled with the letter U (which stands for “uniqueness”). or disturbance).577 1. why make such a big deal about the difference between EFA and conﬁrmatory factor analysis (CFA— the SEM approach to factor analysis)? Leaving the math aside for the moment. when we don’t know what’s going on. and the observer may round up or down to the nearest 5 mm or make a mistake recording the number.27 One major advantage of SEM is that this is done as part of the process. if we’re not careful what we ask for). while we still get pages and pages of output (reams. The solution is to disattenuate the reliabilities. We can accomplish the same goal using other techniques. based on this. the manometer may not be perfectly calibrated. Conﬁrmatory factor analysis. A second advantage arises from the use of measurement theory..797 and 0. some error is involved.8402 + 0. if the reliabilities are “attenuated” very much. The downside is that we can end up with a set of factors that look good statistically but don’t make a whole lot of sense from a clinical or scientiﬁc perspective. and ﬁgure out what the correlation would be if both tests were totally reliable. but the route would be more roundabout. These errors result in a measure that is less than reliable.25 the major difference is a conceptual one. we are in essence saying. Any time we measure something. we’ll keep emphasizing: changes to the model to make it ﬁt the data better should be predicated on our theoretical understanding of the phenomenon we’re studying.5432 = 1. Finally.724 WOKS FIGURE 20–10 Factor loadings and uniqueness for the factor “Cheer” and its three measured variables.” That’s why it’s called “exploratory.724.577. Say we’re interested in the relationship between anxiety and the com- 24The reason for the difference is purely historical. The error can arise from a variety of sources: the person’s blood pressure may change from one moment to the next. that is. respectively. U1 . in fact.797 × 0. run-of-the-mill EFA. . for example. it can be a very powerful tool that can help us understand the interrelationships in our data.24 Second. use the output from this analysis to calculate a factor score for each person. and then use SUCS MPI WOKS TABLE 20–3 Correlations among three tests to measure “Cheer” SUCS MPI WOKS 1.g. where we see how well the model actually ﬁts the data.797 Cheer U3 . For example. the terminology is different. their product (i.724) is 0. and that which is not explained by it (the uniqueness. the square of the factor loading (which is equivalent to a path coefficient) plus the square of the uniqueness equals 1.604 MPI .. 0.669 1. the action is really at the end.608 . If we had only one scale to tap some construct (that is. we would ﬁnd exactly the same thing! So. in other words.” folks. Before we move on to discuss the steps in SEM. we’re making a “super-scale” out of three scales. In traditional. we may erroneously conclude that there is no association when. 
we are sometimes further ahead if we randomly split the scale in half and then construct a latent variable deﬁned by these two “subscales. rather than a model building one. Thus. as promised. some other points about the advantages of using latent variables (in addition to the esthetic one of having ovals in our drawings) need to be discussed.. When paper-andpencil or observer-completed tests are involved. if we ran this as if it were a structural equation model. such as biases in responding or lapses in concentration. With SEM. the product of any two factor loadings is equal to the correlation between the variables. but the concept is the same. 26And you wondered why it is called conﬁrmatory factor analysis? 27If you want more details about how to do this. even more sources of error exist.” We can then calculate the reliability28 and.e. we’re going to leave it aside as long as we can. we will almost always underestimate the relationships among the variables and. Now.840 U2 .00 (e.29 Let’s see just how powerful disattenuation can be when we’re testing a theory. In our example of the three scales to measure “Cheer. all of this can be accomplished in one step. or their correlation.543 SUCS . and the term was chosen to denote the unique contribution of that variable: the variance it doesn’t share with the other variables (see Chapter 19). we would be dealing with a measured variable rather than a latent variable deﬁned by two or more measured ones). In English. 2003) in no way prejudiced our opinion.26 So. is a model testing technique.0. whether it’s blood pressure or pain.000 . EFA was developed many years before SEM. “I don’t know how these variables are related to one another. and not based on moving arrows around to get the best goodness-of-ﬁt index. This is a point that we’ve already mentioned and.” we said that none measures it exactly but that the three together more or less capture what we want. A problem then arises when we correlate two or more variables: the observed correlation is lower than what we would ﬁnd if the tests had a reliability of 1.000 . it may not be at all obvious why the variables group together the way they do. as is the case with all variants of SEM.00). Rest assured that the fact that we wrote it (Streiner and Norman. the disattenuated reliability. Let me just throw them into the pot and see what comes out. we would be committing a Type II error). we’ve divided up the variance of the variable into two components: that explained by the factor (or latent variable). We’ll say a bit more about CFA later in the chapter.e. 0.
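Since the footnote equates this trick to split-half reliability, here is a minimal sketch of how that calculation goes. The data are simulated (200 respondents, 10 items); the only "real" part is the procedure itself: split the items at random, correlate the two half-scale totals, and step the correlation up to full length with the Spearman-Brown formula.

    import numpy as np

    rng = np.random.default_rng(1)

    # Simulated data: 200 respondents answering 10 items from one scale
    true_score = rng.standard_normal(200)
    items = 0.6 * true_score[:, None] + 0.8 * rng.standard_normal((200, 10))

    # Randomly split the items into two half-scales and correlate the totals
    order = rng.permutation(10)
    half_a = items[:, order[:5]].sum(axis=1)
    half_b = items[:, order[5:]].sum(axis=1)
    r_halves = np.corrcoef(half_a, half_b)[0, 1]

    # Spearman-Brown step-up: reliability of the full-length scale
    split_half = 2 * r_halves / (1 + r_halves)
    print(round(r_halves, 2), round(split_half, 2))

It is the stepped-up, full-length figure that plays the role of the reliability when correlations are disattenuated.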

which is measured by the three observed variables IQ. However. and perpetual happiness) are guaranteed. because the researcher has diligently digested what we’ve just said. b and c are referred to as constrained parameters. which are assigned a speciﬁc value ahead of time. We have a latent variable. Cheer.590 0. In addition.000 0. Based on what we said earlier. What she ﬁnds is shown in Table 20-4.31 it would make sense to randomly split CLAP/TRAP into two parallel forms (CLAP and TRAP) and have these deﬁne the latent variable. which is measured by our scales WOKS. ﬁnding her true love. as are the two habit scales (0. (Actually. and MPI. we made b a ﬁxed parameter. we can “mix and match.636 1.63. as long as we don’t do obviously ridiculous things. Model Speciﬁcation To a dress designer. though. usually due to items. then it cannot correct for test-retest unreliability. If the model does not include the same scale administered on two occasions.634 1. after she creates a latent variable of anxiety. This step of model building is called the measurement model. Athletic Ability.” To many men. and (3) constrained parameters. it means stating which supercharged engine to put into a sports car. all of the diagrams in Figure 20–6 can be drawn replacing the measured variables with latent ones. We could also have solved the equation if we had said that b = c. it’s the split-half reliability for those who aren’t interested. However. measured by the two A scales. “I want a woman who’s 5’ 8” tall. because they are overly optimistic regarding the utility of the tests.000 0. which we’ve called d1. we are referring to something far more mundane: explicitly stating the theoretical model that you want to test. But.86. they give a more accurate estimate of the relationships among the variables. The solution is to put constraints on some of the parameters. 29Some pulsion to read every e-mail message as soon as it pops up on the computer screen. We have already discussed much of . which is our primary interest. 5= b + c. tapped by its two H scales. measurement theoreticians do not like using disattenuated reliabilities in scale development. we have three types of parameters in structural models: (1) free parameters. In the example we used previously. Because CT is now an endogenous variable (it has arrows coming toward it from Cheer and Athletic Ability). and found what the correlation would be in the absence of measurement error.) On a technical note. we were able to solve the equation. For more details. it is often the case that. and. 2003). however. and a second latent variable. she ﬁnds the correlation between the traits is 0. the reliability is corrected only for those sources of error that are captured in the model. and has a perfect 36-24-36 ﬁgure. because we are specifying how the latent variables are measured. their number exceeds the limit of [k × (k + 1)]/2.000 0. for example. So. but now we will broaden them to include both measured variables (which can be analyzed with path analysis) and latent variables (which cannot be). such as having two latent variables deﬁne a measured one.” using both latent and measured variables in the same diagram. which is shown in Figure 20–11. we said that the number of parameters can never be larger than the number of observations. weighs 123 pounds. The usual way to test this theory is to give one anxiety inventory and one test of e-mail reading compulsion to a group of people. and see what the correlation is. that is. they should be less.64). 
once we start adding up all the variances and covariances in our pretty picture. The joy32 of SEM is ﬁguring out how many parameters to leave as free and whether the remaining ones should be ﬁxed or constrained. Height. let’s turn to the steps involved in developing structural equation models. We’ll follow the lead of Schumacker and Lomax (1996) and break the process down into ﬁve steps:30 • Model speciﬁcation • Identiﬁcation • Estimation • Testing ﬁt • Respeciﬁcation this when we were looking at path analysis.59.512 0. we have to give it a disturbance term. For example. in which case both b and c would have to be 21⁄2. and whether to get a hard or a rag top (to attract the model of the ﬁrst deﬁnition). and correlates this with the latent variable of habits. the technique has calculated the parallel forms reliability. (2) ﬁxed parameters. not bad. SUCS. see DeShon (1998). the correlations between anxiety and e-mail habits correlate only between 0. by specifying beforehand that b= –3. CT. she decides to give her 200 subjects two scales to tap anxiety (A1 and A2). used this to disattenuate the reliabilities. which can assume any value and are estimated by the structural equation model. The same concepts apply. and tenure (as well as glossy hair. Now that we have a bit of background.220 REGRESSION AND CORRELATION TABLE 20–4 Correlations among two anxiety scales (A1 and A2) and two ccales of e-mail reading habits (H1 and H2) A1 A2 H1 H2 A1 A2 H1 H2 1. Here. and Jumping. model speciﬁcation may mean.) Identiﬁcation When we were discussing path analysis. and two of e-mail habits (H1 and H2). for those who are interested (Streiner and Norman.538 1. In this case. But.51 and 0. too. ideally. Let’s use these concepts to fully develop a model of success in cheerleading. In this case. in this context.000 28This would be equivalent to the split-half reliability. The two anxiety scales are moderately correlated at 0. Some of the questions we could ask at this stage are as follows: (1) How well do the observed variables really measure the latent variable? (2) Are some observed variables better indices of the latent variable than are other variables? and (3) How reliable is each observed variable? (We will look at these issues in more depth and actually analyze this model when we discuss some worked-out examples of SEM and CFA. which are unknown (as are free ones) but are limited to be the same as the value(s) of one or more other parameters.550 0. but not the stuff of which grand theories are made. To her sorrow.
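As a rough by-hand check on what the program reports, the classical correction for attenuation can be applied directly to the Table 20–4 correlations: each trait's reliability is estimated by the correlation between its two parallel scales, and the observed anxiety-habits correlation is divided by the square root of the product of those reliabilities. Averaging the four cross-correlations is our simplification (the SEM program estimates the same quantity more formally), but the result lands right at the 0.86 quoted in the text.

    import numpy as np

    # Correlations from Table 20-4 (A1, A2 = anxiety scales; H1, H2 = habit scales)
    rel_anxiety = 0.636                      # A1 with A2: parallel-forms reliability
    rel_habits  = 0.634                      # H1 with H2
    cross = [0.550, 0.590, 0.512, 0.538]     # the four anxiety-by-habits correlations

    r_observed = np.mean(cross)

    # Correction for attenuation: r_true = r_observed / sqrt(rel_x * rel_y)
    r_disattenuated = r_observed / np.sqrt(rel_anxiety * rel_habits)
    print(round(r_observed, 2), round(r_disattenuated, 2))   # about 0.55 and 0.86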

if this is known. terms “constraining parameters” or “constraints” can refer to making them either ﬁxed or constrained. if ever. because we don’t actually measure (or observe) the error term. ain’t it? you want a more detailed explanation of the effect of choosing different values for the path coefficient. Making the model as simple as possible is often the best way of avoiding “identification problems.” or any other such problems. non-zero value.34 When a latent variable has two or more singleheaded arrows coming from it (as is the case with CLAP/TRAP. and this is what’s usually done. the rank of a matrix is the number of unique rows and columns. we indicate this by assigning the same name to the errors. But. “codependency. drug abuse. This can occur if one variable can be predicted from one or more other variables. 30As Perhaps the easiest way of constraining parameters33 is simply not to draw a path. as in the case of CLAP and TRAP. meaning that the row and column in a correlation matrix representing that variable is not unique. otherwise. gambling.e. just the estimate of the error variance. which require 12 steps. The steps we’ve just discussed are necessary but not sufficient. In programs like AMOS. simplifying the model. if you’ve gone to the trouble of counting up the number of observations. 1997). speaking. one that has reciprocal relationships). Such thinking is naive.e. least we pay attention to what we’ve said. For instance. there may be times when we believe that the error terms of two or more variables are identical (plus or minus some variation). We can also forgo the weekly meetings. split a single test in half. and reﬂects a trust in the inherent fairness of the world that is rarely.. or we give it to mothers and fathers. You can also run into difficulty if the rank of the matrix35 you’re analyzing is less than the number of variables. you will avoid identiﬁcation problems. and Athletic Ability). one of them must be set to be equal to 1 (i. you have seven variables (the six subscales you can see. For each unobserved variable with one single-headed arrow emerging from it (such as the error terms). In any case. We said that MPI loaded on Cheer and (implicitly. we’d be trying to estimate b and c at the same time. confusing. Changing which measured variable has a coefficient of 1 will alter the unstandardized regressions of all the variables related to that latent variable (because the unstandardized regressions are relative to the ﬁxed variable). but it’s best to use the variable with the highest reliability. This situation arises. the Verbal IQ score of some intelligence tests is derived by adding up the scores of a number of subscales (six on the Wechsler family of tests).PATH ANALYSIS AND STRUCTURAL EQUATION MODELING 221 ea1 IQ ea2 Height Ability CLAP ec ea3 Jumping CT ec1 WOKS d1 TRAP et ec2 SUCS Cheer ec3 MPI FIGURE 20–11 The full model for success in cheerleading. which leaves us ﬁxing the path coefficient. and constraining some of the parameters. it doesn’t make sense to assign a value to it. Cheer. see Arbuckle (1997). If you include the six subscale scores as well as the Verbal IQ score in your data matrix. Notice that we didn’t draw curved arrows between the exogenous variables.” A second method involves the latent variables. that is. assigning a different value won’t affect the overall ﬁt of the model. which is yet another reason to avoid them whenever possible.. if we’re giving the same test at two different times. 
by not having a path) did not load on Athletic Ability. such as AMOS (Arbuckle. but it does not affect the standardized regression weights. Identiﬁcation problems will almost surely arise if you have a nonrecursive model (i. Finally. The easiest thing to do is to give it the value 1. we must ﬁx either the path coefficient or the variance of the error itself to some arbitrary. then you will have problems. and these become constrained parameters. this is much less difficult than recovering from alcoholism. 33The 34If 35Roughly . 31At 32A term to be taken with a large grain of salt. justiﬁed. and most programs. since one variable can be predicted by the sum of the others. as we mentioned previously. It doesn’t really matter at all which one is chosen. This is because we assume that the exogenous variables are correlated with themselves. You might think that. build this in automatically. That’s another way of saying that the parameter for the path from MPI to Athletic Ability is ﬁxed to have a value of 0. for example. be a ﬁxed parameter). or.
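The Verbal IQ problem is easy to demonstrate: if one column of the data matrix is an exact sum of other columns, the correlation matrix is singular and its rank falls below the number of variables, which is precisely the condition that trips up the estimation. A minimal sketch with simulated data (numpy assumed):

    import numpy as np

    rng = np.random.default_rng(2)

    # Simulated data: six subscale scores plus a "Verbal IQ" that is just their sum
    subscales = rng.standard_normal((500, 6))
    verbal_iq = subscales.sum(axis=1, keepdims=True)
    data = np.hstack([subscales, verbal_iq])        # 7 columns, only 6 of them unique

    corr = np.corrcoef(data, rowvar=False)
    print(corr.shape)                               # (7, 7)
    print(np.linalg.matrix_rank(corr))              # 6 -- rank less than the number of variables

Dropping either the composite or one of its components restores full rank.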

hence. constraining two variables to have equal variances. or some other problem. even height can be measured either in inches or centimeters (or centimetres. that is. Arbuckle (1997) says. without trying to be exhaustive (and exhausting). PRELIS2. the unfortunate need to think about what we’re doing. Most of the other indices we will discuss39 are scaled to take values between 0 (no ﬁt) and 1 (perfect ﬁt). or AMOS). skewness. You just do the best you can in setting up the model. They will calculate the correct matrix. This becomes more of a problem when we use paper-andpencil tests that don’t have meaningful scales. illustrated version of War and Peace. However. If the output says that the model needs more constraints. the estimates of the parameters will change. Many programs default to the maximum likelihood (ML) method of estimating the parameters. you may want to consider a different career. If there were one approach that was clearly superior. One rule of thumb for a good ﬁt is that χ2GoF is not signiﬁcant and that χ2GoF /df should be less than two. One class of statistics is called comparative ﬁt indices. pre. one that we’ve encountered in other contexts. and hope the program runs. and consist of interval or ratio data.g. because they test the model against some other model. If this doesn’t help. Often. as we mentioned when we were discussing path analysis. or two different scales tapping the same construct. if the model is different from the null hypothesis. but it requires a very large sample size (usually three times the number of subjects whom you can enroll) to work well. “red in tooth and claw” (yes. if we listed them all. it is scale-dependent.. there’s no easy way to tell ahead of time if you’re going to have problems with underidentiﬁcation of the model. because we live in one of them. But. The unweighted least squares (ULS) method of estimating the parameters has the distinct advantage that it does not make any assumptions about the underlying distribution of the variables. transforming matrices. based on the types of variables you have. So which one do you use? If you have access to the SEM program called LISREL8. but as we just mentioned. 37Although 38The authors are allowed to say this. you’ll have to go back and take a much closer look at the data to see if you have non-normality.and post-multiplying matrices. The easy work is all of that matrix algebra—inverting matrices. Let’s go over some of the more common and useful ones to show the types of indices available. then the results of the ML solution are suspect. Bentler and Bonett. Weighted least squares (WLS) is also distribution-free and does not require multivariate normality. simply because of different measurement scales used.. If you use one of the other programs (e. if the data are extremely skewed or ordinally scaled. Testing the Fit In the previous section. Usually. This works well as long as the variables are multivariate. and all of the other techniques would exist only as historical footnotes. a value of . and that test is also used in SEM. if they differ. which is one of the reasons ULS is rarely used. we lamented the fact that there were so many different approaches. either try to ﬁx it (e. a commodity that is in short supply inside the computer. the scales are totally arbitrary.40 It takes the form of: 2 NFI null 2 2 null model (20–7) . then you have to go back to your theory and determine if there is any justiﬁcation in. This is an unfortunate property. for example. 
with transformations) or choose a method that best meets the type of data you have. Then.. Estimation 36And almost as rare outside it. somehow the thought of statisticians ﬁghting it out. Unfortunately. The continued survival of all of them. normal. While that is undoubtedly true. The most widely used index (although not necessarily the best) is the Normed Fit Index (NFI. there are no probabilities associated with these tests. then use those. then you can be relatively conﬁdent about what you’ve found. the problem pales into insigniﬁcance in comparison to the plethora of statistics used to estimate goodness-of-ﬁt. and the constant introduction of new ones. pray hard. and so forth. on estimation procedures. PROC CALIS. will only mention a few of them.36 The fact that a number of different techniques exists should act as a warning. This requires brain cells. no matter how badly your model ﬁts. that all of the variables are independent from one another (in statistical jargon. statisticians can quote Tennyson) appears somewhat oxymoronic to us. and its “front-end” program. which tests. if you live in the United Kingdom or one of the colonial backwaters. although what is deemed a “good ﬁt” is often arbitrary. You may suspect they may if they are the same scale measured at two different times. that the covariances are all zero). it is very sensitive to sample size and departures from multivariate normality. It has a distinct advantage in that. the time has come to estimate all of those parameters.g. resulting in a matrix whose rank is less than the number of variables.222 REGRESSION AND CORRELATION plus the Verbal IQ) but only six unique ones. 563). indicates that no clearly superior solution exists. 39We 40As Now that you’ve speciﬁed the model.38 This means that the same study done in Canada and the United States may come up with different results. If all of the results are consistent. the NFI and other indices that compare your model against the null “encourage you to reﬂect on the fact that. it has a test of signiﬁcance associated with it. In most of the work we do. the computer does it for us. if one or more of the indices is transformed to a different scale. unlike all of the other GoF indices.g. Because it’s easy.90 is the minimum accepted value. 1980). What’s left for us is the hard stuff— deciding the best method to use. EQS. We already mentioned the χ2GoF in the context of path analysis. then it may be worthwhile to run the model with a few different types of estimators. between height and weight) may produce problems. things could always be worse” (p. this book would be thicker than the large print. Even high correlations among variables (e. then the law of the statistical jungle37 would dictate that it would survive.

But. NFI2 df model (20–8) Respeciﬁcation R2 Respeciﬁcation is a fancy term for playing with the model to achieve a better ﬁt with the data. and EQS. The statistical tests that do exist are for the opposite type of mis-speciﬁcation errors: those due to variables that don’t belong in the model or have paths leading to the “wrong” endogenous or exogenous variables. and the other indices disagree with one another. check out Tabachnick and Fidell (1996). it’s probably better to trust one of the other indices. and you’re presenting the ﬁndings at an international conference. if the χ2GoF is signiﬁcant. Other indices resemble in that they attempt to determine the proportion of variance in the covariance matrix accounted for by the model.or z-test. there isn’t any statistical test for either the indices or the difference between two AICs or two CAICs. models with small values of N may have NFIs < . such as LISREL. These statistical tests should be used with the greatest caution. Unfortunately. All of the major SEM programs. 41 The Adjusted GFI (AGFI) is analogous to the adjusted R2 in that you’re penalized for having a lot of parameters and a small sample size. The next step is to look at the signiﬁcance of the parameters. is the Root Mean Square Error of Approximation (RMSEA). based on theoretical and empirical considerations. If you’ve been paying attention. The AGFI is a parsimony ﬁt index. fortunately for you. they should have the expected sign.06. We also . There are no computer packages that will give you a Bronx cheer and say. Keep chanting this mantra to yourself as you read this section. What do we do when they disagree? The most usual situation occurs when we get high values (i. for some reason you really. Another widely used index is Akaike’s Information Criterion (AIC) which is unusual in that smaller (closer to 0) is better:42 AIC = 2 model 41If. and the ratio of the parameter to its standard error forms a t. also called the Incremental Fit Index or IFI) tries to compensate for this by incorporating the degrees of freedom: 2 null 2 null 2 model If so. can the signiﬁcant χ2GoF be due to “too much” power? A Conﬁrmatory Factor Analysis Let’s assume that we have seven measured variables that we postulate reﬂect two latent variables: a1 through a4 are associated with the latent variable f1. then that parameter should likely be set equal to 0. Now that we’ve given you the basics. but the χ2GoF is signiﬁcant. 42To – 2df model (20–9) A slight variant of it is called the Consistent AIC. really want to see what it looks like. so we won’t bother to show it. Life is easy when all of the ﬁt indices tell us the same thing. at best. you should realize that the statistical tests play a secondary role in this.e. In all cases. and again has no probability level associated with it. As we’ve said.43 you forgot to include the person’s weight. because you’re rewarded for having a parsimonious model.” then go with that. or CAIC: CAIC 2 model 43We (log e N 1)df model (20–10) use this term only because our editor won’t let us say “Schmuck” in such a family- Because nobody knows what a “good” value of AIC or CAIC should be. can examine the effects of freeing parameters that you’ve ﬁxed (the Lagrange multiplier test) and dropping parameters from the model (the Wald statistic). A more recent index. The Normed Fit Index 2 (NFI2. there are no statistical tests that can help us in this regard.05).90. Whether to follow their advice must be based on theory and previous research. CALIS (part of SAS). 
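For the record, here is what the hand calculation of these indices looks like, with the formulas of Equations 20–7 to 20–10 collected in one place. Bear in mind that some programs define AIC and CAIC in terms of the number of free parameters rather than the model degrees of freedom; for a given data set the two versions differ only by a constant, so they rank competing models identically. The chi-squared values at the bottom are invented, purely to exercise the function.

    import numpy as np

    def fit_indices(chi2_model, df_model, chi2_null, n):
        nfi  = (chi2_null - chi2_model) / chi2_null                 # NFI, Eq. 20-7
        nfi2 = (chi2_null - chi2_model) / (chi2_null - df_model)    # NFI2 / IFI, Eq. 20-8
        aic  = chi2_model - 2 * df_model                            # AIC as in Eq. 20-9
        caic = chi2_model - (np.log(n) + 1) * df_model              # CAIC as in Eq. 20-10
        return {"chi2/df": chi2_model / df_model,
                "NFI": nfi, "NFI2": nfi2, "AIC": aic, "CAIC": caic}

    # Invented values for a fitted model and its corresponding null model
    print(fit_indices(chi2_model=16.4, df_model=13, chi2_null=180.0, n=200))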
One such index is the Goodness-of-Fit Index (GFI).90) for GFI or NFI2 and a low value for RMSEA (below .05. The easiest way to detect these is to look at the parameters. the primary role should be your understanding of the area. If the test is not signiﬁcant. Unfortunately. be consistent with our nomenclature. First. so we can’t say if one model is statistically better or insigniﬁcantly better than the other. It follows the Twiggy criterion of less being more. you may end up with a model that ﬁts the current data very well but makes little sense and may not be replicable. The major reason that a model doesn’t ﬁt is that you haven’t included some key variables—the ones that are really important. here we have to use some judgment. If they all indicate a “good ﬁt. we would call this the Twiggy criterion (also known as the “Kate Moss criterion” by the younger set). all parameters have standard errors associated with them. AMOS. all of which decrease the value of AGI proportionally to the number of parameters you have. “Fool.PATH ANALYSIS AND STRUCTURAL EQUATION MODELING 223 One disadvantage of the NFI is that it is sensitive to the sample size. otherwise. but only when it’s too late. If there are roughly 10 subjects per parameter. and whether or not to publish depends on your desperation level for another article. The one with the smaller AIC or CAIC is “better”. then we should look at a number of indices. then something is dreadfully wrong with your model. and b1 through b3 with latent variable f2. this assumes that you have a sufficient sample size. even if they ﬁt well. so that you’re not committing a Type II error.. then we have a ﬁt that’s marginal. If a parameter is positive and the theory states it should be negative. less than 0. used more and more often now. however. let’s run through a couple of examples. over .” All of your purported colleagues will be only too happy to perform this function. its formula involves a lot of matrix algebra. Of course. and ideally. these indices are use most often to compare models—to choose between two different models of the same data. RMSEA should be less than 0. There are a number of variants of this.
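The RMSEA itself is simple to compute once the program has produced the chi-squared. One common definition is the square root of (chi-squared minus df) divided by df times (N minus 1), with the numerator floored at zero; some programs divide by N rather than N minus 1. A small sketch with invented numbers:

    import numpy as np

    def rmsea(chi2_model, df_model, n):
        return np.sqrt(max(chi2_model - df_model, 0.0) / (df_model * (n - 1)))

    # Invented values: chi-squared of 16.4 on 13 df, from a sample of 200
    print(round(rmsea(16.4, 13, 200), 3))    # about 0.036 -- under the usual 0.05-0.06 cutoffs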

224 1 REGRESSION AND CORRELATION e1 a1 1 e2 1 a2 f1 e3 1 a3 e4 1 a4 e5 1 b1 1 e6 FIGURE 20–12 Input diagram for a conﬁrmatory factor analysis. not trying to develop one. and the variances of the seven error (or disturbance) terms and two latent variables.00. which are equivalent to the communality estimates in EFA. 46Note that we estimate the variances of the error terms but not their path coefficients. further conﬁrming that it isn’t correctly speciﬁed. English). The unstandardized weights for a1 and b1 are 1. the covariance between f1 and f2. Note that.229. This is based on the fact that there are seven measured variables. For each of the endogenous variables. there are 28 “sample moments” in other words (i. we goofed when it comes to variable a4. what do all those little numbers in Figure 20–13 mean? The ones over the arrows should be familiar. the covariance (and hence the correlation) between f1 . those we constrained to be 1. Now let’s turn to the printed output in Table 20–5 and see what else we learn. they’re the path coefficients or standardized regression weights (either term will do). which. it also set the path parameter for one measured variable to be 1. The numbers over the rectangles are the squared multiple correlations. based on 13 degrees of freedom. 1 b2 f2 e7 1 b3 44With the emphasis on the term “relatively. All of them are signiﬁcant (at or over 1. despite the fact that variable a4 doesn’t work too well.” 45Unfortunately. The next block of output tells us that the χ2GoF is 16. the correlation coefficient is only 0.e. think that the two latent variables may be correlated with each other.392. The other ﬁve weights have standard errors associated with them. First. has a p level of .44 so it automatically ﬁxed the parameters from all of the error terms to the measured variables to be 1. it doesn’t really seem to be caused by the latent variable f1. which is encouraging. First.46 There are 13 degrees of freedom.06.. we aren’t told if it loads more on factor f2 or it doesn’t load on either factor. and the ratio of the weight to the SE is the Critical Ratio (CR) which is interpreted as a z-test. There are two other things this ﬁgure tells us. that’s because we’re testing a model. So. Our model speciﬁes 15 parameters to be estimated: ﬁve regression weights (two others aren’t estimated because we ﬁxed them to be 1). After we push the right button. 28 observations. based on previous research. so we overrode it and selected the variable in each set that. because we set them to be equal to 1. which is the difference between the number of observations and the number of parameters. you have to read a 350-page manual to ﬁgure out which button that is. First. has the highest reliability. which are equivalent to the factor loadings in EFA. in contrast to EFA. We didn’t like the choice the program made. so there are (7 × 8) / 2 = 28 observations. since we can’t estimate both at the same time. The program is relatively smart. which we’ve summarized in Table 20–5. The sec- ond fact is that factors f1 and f2 probably aren’t correlated. which is shown in Figure 20–12.45 we get the diagram shown in Figure 20–13 and reams of output. yet again. Similarly.96) except for a4. We start by drawing a diagram of our model (if we’re using a program such as AMOS or EQS). the model as a whole ﬁts the data quite well. We then see the unstandardized and standardized regression weights.
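The bookkeeping in this example is worth doing once by hand. With seven measured variables there are 7 x 8 / 2 = 28 variances and covariances to work with; the model asks for 15 parameters (five free loadings, the factor covariance, seven error variances, and two factor variances), leaving 13 degrees of freedom; and the p-value quoted for the chi-squared of 16.392 follows directly. A few lines of Python (using scipy for the chi-squared tail probability) confirm the arithmetic:

    from scipy.stats import chi2

    k = 7                                   # measured variables
    observations = k * (k + 1) // 2         # variances + covariances = 28 "sample moments"

    # Free parameters: 5 loadings (two others fixed to 1), 1 factor covariance,
    # 7 error variances, 2 latent-variable variances
    parameters = 5 + 1 + 7 + 2
    df = observations - parameters
    print(observations, parameters, df)     # 28 15 13

    print(round(chi2.sf(16.392, df), 3))    # about 0.229, the quoted p-value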

into the trash can it goes.93 .90 . look for this in the output.89 . Our model would ﬁt even better if we dropped variable a4. We reran the model with this path included. The largest one.81 . which tell us how much the model could be improved if we speciﬁed additional paths.123. 50See . we also get the Modification Indices (MI). leaving open the question of who’s on Earth (not to be confused with who’s on ﬁrst)? note 44.80 .92 e6 b2 f2 FIGURE 20–13 Output diagram based on Figure 20–12.89 . men are from Mars and women from Venus. our ﬁt would improve. 48See 49We Comparing Two Factor Analyses We’re sometimes in a position where we want to compare two factor structures.028. Because we asked for them.04 e4 a4 . Finally. because we speciﬁed another path) drops to 10. for example. and the methods of comparing factor structures leave much to be desired. Fortunately. we’ll just ignore them. In fact.50 If we don’t have any hypotheses beforehand regarding the factor structure. our model is close to perfection. that is. is a3 b2.48 But. If we believe that the variable is a substantive one.90 .79 e1 a1 . e7 b3 47Don’t and f2 is low and has a CR of only 0. the two latent variables or factors aren’t correlated.85 . which is listed under Regression Weights. we can start by running an EFA with one group and then use the results to ﬁx the parameter estimates in a CFA for the second group. However. meaning there’s nothing more to estimate). we’ve given just a few of the myriad other GoF indices.87 . just for your beneﬁt. it isn’t there. because there is no theoretical rationale for this path (or for the other proposed modiﬁcations).06 . the path coefficient between the two is –0.90. and the independence model is the opposite (assuming nothing correlates with anything). if we drew a path from b2 to a3.PATH ANALYSIS AND STRUCTURAL EQUATION MODELING 225 . That means that. note 47. how do they differ? This can be done with EFA. The next sets of numbers show the variances we’re estimating and the squared multiple correlations (which are also given in the ﬁgure). this should be dictated by theory. but it is difficult. if not. The saturated model represents perfection (as many parameters as observations. Yet again. if we do have some idea of know the answer to the second question is “No”. but we haven’t shown the output. all of the indices are over 0. we would keep it. it’s relatively easy to do it with CFA.19 e2 a2 f1 e3 a3 .551. otherwise.79 e5 b1 . and that the nonsigniﬁcant path coefficient may be due to sampling error or a small sample size.47 and the χ2GoF (based now on df = 12. are the results for patients and controls or for men and women alike49 and. Conversely.

80. indicating that the model doesn’t ﬁt the data worth a plugged nickel.940 1. or paths between a4 and the three other variables associated with f1.683 3. such as suggesting that we include covariance terms between e4 and f1.148 0.857 12.120 0. there will not be standard errors and critical ratios given for these 6 parameters.871 0. one in which a4 actually does load on f1 but we don’t know this beforehand. the number of parameters to be estimated drops from 15 to 9.127 0.392 Degrees of freedom = 13 Probability = .100 1.854 4. shown in Figure 20–11. Second. In this case.021 0. which should run as long as the variable names in the rectangles correspond to the variable names in our data ﬁle.059 Estimate SE CR f1 f2 e1 e2 e3 e4 e5 e6 e7 Squared multiple correlations 0.665 0. based on our previous results.000 1.018 4.226 REGRESSION AND CORRELATION TABLE 20–5 Selected output for the conﬁrmatory factor analysis Number of sample moments = 28 Number of distinct parameters to be estimated = 15 Degrees of freedom = 13 Chi-squared = 16.073 0.551 f1↔f2 Variances 0. the terms vct over the disturbance terms for CLAP and TRAP tell the program that these variances should be the same. is wrong and perhaps should be closer to 0. because we’ve ﬁxed an additional 6 parameters: 5 paths from the latent variables plus the covariance. Note (yet again) that our use of the modiﬁcation indices is tempered by our knowledge and theory. and ﬁx one path from each latent variable to be 1.866 0.789 0.143 0.000 1.934 0.888 0.824 0.050 4.036 0. the largest ones involve a4 and e4 in various forms.897 0. then we can conclude that the factor loadings that we found for one group would ﬁt the second group.789 0.888 0.812 0. now based on df = 19. The χ2GoF. we’ll stay with the problem presented in Figure 20–13 and assume we drew another sample. since we are not estimating them. The results of all of this ﬁxing and constraining are shown in Figure 20–14.537 5. We would again return to our theory and hypothesize that the path coefficient. we have to ﬁx all of the paths leading to the various disturbance terms to 1. This diagram now forms the input to the program.044 0.029 1.901 0.037 0.044 Estimate 0.199 1. we can up the ante and make the comparison more stringent: are the variances of the error terms similar across samples? This type of analysis is very useful for determining the equivalence of questionnaires in different groups of subjects.107 0.091 — 0.229 Regression weights Estimate SE CR a1←f1 a2←f1 a3←f1 a4←f1 b1←f2 b2←f2 b3←f2 Standardized regression weights 1. Second. First.560 5.809 what the structure should look like.20 to be congruent with the results from the ﬁrst sample. and add what we’ve learned.166 13. we’ll put in the unstandardized regression weights from the ﬁrst sample and again.178 0. it is logical to assume that their variances are similar. Alternatively.043 0.196 0.190 0. too. First.123 0. with a few notable exceptions.078 0. If the model actually ﬁts the data. but they all point to a misspeciﬁcation involving a4.038 5.487 7.614 3. As an example. Instead of ﬁxing just one of the paths from the latent variables to the measured ones. we can set it free and see what the program does with it.920 Estimate SE CR f1↔f2 Correlations 0. None of these make sense theoretically.220 0.082 — 13.804 0. If we look at the modiﬁcation indices. The output will be very similar to that in Table 20–5.779 13. Estimate a1 a2 a3 a4 b1 b2 b3 Continued 0. is a whopping 144.086 0. 
state that the covariance between f1 and f2 is 0.202 0.080 0.036 0. we can specify it for both groups and see where it ﬁts and doesn’t ﬁt for each. which we ﬁxed at 0.317 Estimate a1←f1 a2←f1 a3←f1 a4←f1 b1←f2 b2←f2 b3←f2 Covariances 0.043 0. A Full SEM Model Now let’s return to the complete model of success in cheerleading. We indicate the fact that we’ve constrained these terms by giving the variances the same name.47.847 . because CLAP and TRAP are random halves of the same test.

60 Ability .02 . Cheer.006).081 Par change a2←b2 a3←f2 a3←b2 Model 5. MI index.63 .91 ec3 MPI FIGURE 20–15 Output based on Figure 20–14.59 . NFI Normed Fit Index.831. but perhaps we can do better with Ability and Cheer. All of this leads us to believe that the model could stand quite a bit of improvement.40 ea1 IQ –. Gratifyingly.454 0. CR critical ratio.88 . .94 . most of them don’t make too much sense from the perspective of our theory.62 .000 modiﬁcation GFI Adjusted GFI. which has an associated p-value of . The other GoF indices are equivocal: GFI and NFI are both just slightly above the cutoff point of 0.83 . To begin.70 .35 .749 GFI 0.133 AGFI NFI Your model Saturated model Independence model 0.76 Cheer .90.PATH ANALYSIS AND STRUCTURAL EQUATION MODELING 227 The output from the program is shown in Figure 20–15.57 TRAP d1 et ec2 SUCS .908 0.122 0. and CLAP/TRAP? The answer seems to be.125 0. but where? Let’s start with the measurement aspect of the model—how well are we measuring the latent variables of Athletic Ability.557. but one bears a closer look—the suggestion of adding a covariance between ea2 and ea3.603 4. let’s leave it aside for now and rerun the model adding e2 ↔ e3. If we look at the Modiﬁcation Indices.49 CLAP ec . based on 19 degrees of freedom. is highly signiﬁcant (p= . not too badly. SE standard error. the χ2GoF is 38.79 ea2 Height . GFI Goodness-of-Fit Index.225 which. thank you.88 ec1 WOKS . the χ2GoF (df = 18) drops to 22.064 0. Because Cheer as a whole seems to add little to the picture. while AGFI is only 0.000 0.957 1.94 .208.77 ea3 Jumping CT .000 0. . ea1 1 IQ ea2 1 Height 1 Ability CLAP 1 ea3 1 Jumping CT 1 ec vct ec1 1 WOKS 1 1 TRAP d1 1 et vct ec2 1 SUCS Cheer ec3 1 MPI FIGURE 20–14 Figure 20–11 with the parameters constrained.173 MI 0.408 4.157 5.968 1. Because this model is a subset of the original Covariances MI Par change TABLE 20–5 Continued e2↔e6 e3↔f2 Regression weights 5.
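The comparison between the original model and the one with the e2↔e3 covariance added is an ordinary chi-squared difference (likelihood-ratio) test, because the first model is nested in the second. Using the values reported in the output above (38.225 on 19 df before, 22.557 on 18 df after), the sketch below reproduces the improvement of roughly 15.7 on 1 df:

    from scipy.stats import chi2

    def chi2_difference(chi2_constrained, df_constrained, chi2_free, df_free):
        # The more constrained (nested) model has the larger chi-squared and more df
        delta_chi2 = chi2_constrained - chi2_free
        delta_df = df_constrained - df_free
        return delta_chi2, delta_df, chi2.sf(delta_chi2, delta_df)

    print(chi2_difference(38.225, 19, 22.557, 18))
    # a difference of about 15.7 on 1 df, p well below .01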

for example. mankind faces a crossroads. a drop in the coefficient from Cheer (from 0. Which variable(s) is (are) endogenous and which is (are) exogenous? 4.59 to 0. B.50.20. Finally. One path leads to despair and utter hopelessness. A and B. which at least makes us more cheerful. and the weight between B and the outcome is 0. the weight between A and the outcome is 0. on df 2. Also. it can easily handle multiple predictor and outcome variables and complicated models. If 2GoF is 8. path analysis. What is the correlation between B and the outcome? 3. “More than any other time in history. canonical correlation. we get χ2 (1) = 15.” one. from 0. . Let us pray we have the wisdom to choose correctly. all of the parsimony-adjusted GoF indices increase. and C and Model X has only variables A and C. despite its early name (causal analysis) it can determine causality only if the design of the study is appropriate. from a research perspective. and other tests. then: 1. if we subtract the χ2s and the dfs. conﬁrmatory factor analysis.32. meaning that there was a signiﬁcant improvement in the goodness of ﬁt.02 to 0. The other to total extinction. But there are some variables in Model Y that don’t appear in Model X. Which variable(s) should have a disturbance term? 5. Model Y has variables A. This is also reﬂected in an increase in the path coefﬁcient from Ability to CLAP/TRAP. multiple regression. does the model ﬁt the data or not? 52Perhaps SUMMARY Structural equation modeling is a relatively new and very powerful statistical method that can be seen as a general technique that includes. How many observations are there? 6.228 51Model REGRESSION AND CORRELATION X is a subset of Model Y (or is nested within Model Y) if all of the variables in Model X are also in Model Y.51 the difference between their respective χ2s is itself distributed as a χ2. p .01). it means that we don’t have to administer these three tests to all people.02. EXERCISES If the correlation between two predictor variables.73.30. However.52 we can have a simpler model if we just drop it. because Cheer doesn’t help. the statistic itself cannot assess causality from cross-sectional data. and it can accommodate mediating variables as well as variables measured over time. Although the change in the χ2GoF isn’t signiﬁcant this time around.668. Its major advantages are that it can combine measured and latent variables in the same model. and the fact that the other ﬁt indices are in an acceptable range. is 0. So. in its simpler forms. What is the correlation between A and the outcome? 2. reinforcing Woody Allen’s comment.

C. The most obvious is that he couldn’t resist the most common sin of biomedical researchers—he took perfectly respectable ratio-level variables. which deal with continuous data. DETECTOR III–2 Analyses based on extreme groups are biased and lead to a potential loss of sample size and power. In an attempt to examine the relationship among height. Would you approach things any differently? Of course you would. looking at ﬂuoxetine versus amitriptyline. C. 3. Just to remind you. 2. can no longer be interpreted.R.P. The baseline measure is not just one of six measures taken over the course of the study. analyzed data from the graduating class at Slippery State U. III–2. DETECTORS III–1. He reported that “the changes were statistically signiﬁcant…in the ﬂuoxetine group and for several of the efficacy measurements in the amitriptyline group. A better approach would be to treat the Time 0 measure as a covariate. the noted pharmacopsychoanthropologist. and bottom third of the class on height and IQ. middle. IQ. he chose to analyze the two independent variables separately.05). p < . would you do it any differently? But of course.P.P. With your new knowledge. DETECTOR III–1 Never take data that are continuous and interval or ratio and classify them into categories before analysis. More appropriate would be a joint analysis using multiple regression with two independent variables (Height and IQ) and one dependent variable. and later success. he has biased the effect of the independent variables. We will pretend there was only one dependent variable.A. by using extreme groups. following their progress in their respective careers.R. that’s why the question is here. He measured three outcomes: the HAM-D. It has several problems. DETECTOR III–3 All the independent variables should be analyzed together using ANOVA or regression methods. he threw out the middle group. Finally.” He also compared the treatment groups at the end of the study and found no signiﬁcant difference between the two drugs. height and IQ. and the corresponding test of signiﬁcance. and collapsed them into three levels.P. affecting power. Feighner (1985) did an RCT with a small sample of patients.53. The t-test for IQ was signiﬁcant (t = 2. In Section 2. we suggested a repeated-measures ANOVA.R. at baseline and at weeks 1. He administered an IQ test to all the graduates and measured their height. The most obvious is that he has lost a third of his sample. then did a t-test on the two extreme groups. Return to Question II–3 at the end of Section 2. and measured their socioeconomic status on the Blishen scale (a ratio-level scale of measurement).A. He then waited 10 years. and 5. Use all the data.R. he classiﬁed the graduates as being in the top. Third. To analyze the data. This is an absolute no-no! The solution is to retain the original data and use methods such as regression analysis. and repeated-measures ANOVA treats it like a difference score. the Raskin Depression Inventory.A. Charlie Darvon.A. and the Covi Anxiety scale. but the t-test for height was not. thus the estimate of the effect.SECTION THE THIRD C. thereby throwing away a pile of information. This has two effects. 4. Second. Dr. C. then do a repeated-measures ANCOVA with Time 229 .
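To make the recommended analysis concrete, here is a minimal sketch of the joint regression: both predictors stay continuous, and both go into the model at once. The data are simulated stand-ins (coefficients and sample size invented), and statsmodels is assumed to be available.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    n = 300

    # Simulated stand-in for the graduating-class data
    df = pd.DataFrame({
        "height": rng.normal(170, 10, n),
        "iq": rng.normal(100, 15, n),
    })
    df["ses"] = 0.1 * df["height"] + 0.3 * df["iq"] + rng.normal(0, 10, n)

    # Keep Height and IQ continuous and analyze them together
    model = smf.ols("ses ~ height + iq", data=df).fit()
    print(model.summary())

No tertile splits, no discarded middle group, and each predictor is tested while adjusting for the other.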

Regardless of the sophistication of the model. C.A. there are 11 independent variables in her regression equation and 27 subjects. about half of the rats developed nasal cancer. and the upper 95% CI has to be greater than zero.6% of the variance. She studies all 27 teachers in the system and investigates the following variables: Age.R. DETECTOR III–7 Proportions of variances add. She has not just violated. yet all the variables are additive and are good together for only 13.R. Remember that his best estimate of the risk was zero.034). Nothing coming out of this analysis is believable. DETECTOR III–6 Square the multiple correlation. multi-hit model (basically. She claims that Gender enters the equation third and explains 15% of the variance.372 = 13. 7. however. The rats were exposed to 2. There are several problems. III–3.A. C. none developed it. In the 7-ppm group. And don’t get too excited over any variable that doesn’t explain more than 5% of the variance. Cohn (1982) used data from an animal study of cancer resulting from formaldehyde exposure to extrapolate the risk to humans. the upper 95% conﬁdence limit yielded an additional (attributable) risk from UFFI of 51 parts per million.A. A sociologist is investigating discrimination in employment practices of the local school board in Sexsex County.230 REGRESSION AND CORRELATION (5 levels) as the repeated measure and Drug (2 levels) as a grouping betweensubject factor. The best estimate of risk was zero. no regression analysis should be extrapolated much beyond the original data—no model is good enough.” Counting dummy variables. III–4. which releases gaseous formaldehyde into the air (. she has cruciﬁed the old “rule of 10. The dependent variable is income. In the 15-ppm group. where each variable is shown as a logarithmic scale. expressing . 3. C. C. Finally. She ﬁnds that the combination of variables has a multiple correlation of 0. Religion (Christian.P. that Gender enters the regression equation third.).015 (. Or use individual growth curves. left. Jewish.37. Height.37. 2. A multiple correlation of . Again. If it’s not explaining about 30% or more of the variance. ambi). and Gender explains 15% of the variance. Other).A. 1. a nonlinear regression) was ﬁtted to these data and extrapolated to the excess exposure in homes containing urea formaldehyde foam insulation (UFFI). DETECTOR III–5 Watch the old “rule of 10” in all multiple regression analyses done on existing data bases. The results are shown in Figure III–1. there is a computation error. DETECTOR III–4 Baseline measures should be handled as a covariate.049 – . Handedness (right. The major problem is that he assumed he could extrapolate downwards from 15 to . Gender. this leads towards discounting the study. Would you buy a home with UFFI in it? There are two problems with the study. two orders of magnitude. The minor one is that he committed a little fraud by using the upper 95% conﬁdence estimate for his published estimates. Muslim. In the 2-ppm group. A multi-stage. Ph.034 ppm in non-UFFI homes). so add them. Hindu. after Age and Degree. is singularly unimpressive. and 15 parts per million (ppm) of formaldehyde. it’s not a very impressive analysis.6%. Do you believe her? We hope not. 2 of the 240 rats got cancer. and Degree (Bachelor.049 ppm vs .R.D. .P. More reason to reject. using ANCOVA methods.P.R.P. Master.
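The arithmetic behind these two C.R.A.P. Detectors takes only a couple of lines:

    multiple_r = 0.37
    print(round(100 * multiple_r ** 2, 1))   # 13.7 -- under 14% of the variance explained

    predictors, subjects = 11, 27
    print(10 * predictors)                   # the rule of 10 calls for roughly 110 subjects, not 27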

Some are even dumb enough to analyze their data this way. then correlated this 1 10 –1 Mortality rate 10 –2 10 –3 10 –4 10 –5 0. it is deceptive. DETECTOR III–9 Beware the SE of the mean.P. Several years ago. III–6. Folks often display data using the SE of the mean because it looks so much better. the authors were examining how physicians performed on a multiple choice test in relation to their year of graduation. which they grouped by decade of graduation. C. Only the names are forgotten to protect the guilty. was made by Mark Twain. Meedok and Hipokrit attempted to develop an instrument called the TMIADS (Trust Me. environmental and occupational health folks have institutionalized this dangerous practice. or perhaps it was the British Magical Journal.C. The correlation was about 0.A.P. Nearly anything. . Do you agree? Heck. They calculated the mean score in each decade. FIGURE III–1 Mortality rate as a function of formaldehyde level in the Cohn (1982) study. predictions were that the high birth rate would cause us to have standing-room-only on the planet by the year 2000. C. at the rate that the estimates of Pluto’s mass were decreasing. That is an average of a triﬂe over one mile and a third per year.1 1 10 100 Upper 95% C. This also explains why some predictions go seriously awry. As a result. too. It might have been Lancelot.I. no! First of all. DETECTORS 231 Unfortunately. the planet would disappear entirely in 1980.P. not the original data. but if you want to indicate what the actual data look like. In this article. in large enough doses. Therefore. They concluded a nearly perfect relationship existed between performance and year of graduation. will cause cancer in susceptible rodents. we came across an article in a reputable.R. III–5. Anyone old enough will remember that in the 1960s.R. 1At which point the birth rate would rapidly drop. not the SD (see Chapter 6 if you need reminding of the difference).01 0. DETECTOR III–8 Do not extrapolate regression equations beyond the range of the original data points. But the question is why is it that high? The answer is that they correlated the means in each category. This is perhaps useful when you want to compare means. They had scores from several hundred physicians. I’m a Doctor Scale) to measure patients’ feelings about their doc’s interpersonal skills. most of the variation of individuals was conveniently lost because the “data” for their correlation had an error equal to the SE of the mean. though. The following is a true story. And once the little beasties have it. Binzel (1990) said that. a correlation that high should tip you off to something rotten. then you draw your line down to minimal exposure and show that people will get it. Goodness knows what the true correlation was. the Lower Mississippi River was upwards of one million three hundred thousand miles long. but it was certainly a lot lower. who is not blind or idiotic.A. widely read British medical journal. any calm person. The best comment. Very few things in life are that good.R.96. can see that … just a million years ago next November. Formaldehyde level (ppm) mean score with the midpoint of the decade of graduation. That’s why we have a new carcinogen every week.1 In a similar vein. in Life on the Mississippi: In the space of one hundred and seventysix years the Lower Mississippi has shortened itself two hundred and forty-two miles.A.
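The inflation that comes from correlating group means instead of people is easy to demonstrate with a simulation. The numbers below are invented (a modest individual-level relationship between years since graduation and test score), but the pattern is the point: averaging within decade of graduation throws away the variation among individuals, and the correlation computed on the four means looks nearly perfect.

    import numpy as np

    rng = np.random.default_rng(4)

    # Invented individual data: scores decline gently with years since graduation
    years = rng.uniform(0, 40, 2000)
    score = 75 - 0.2 * years + rng.normal(0, 10, 2000)

    # Individual-level correlation: modest
    print(round(np.corrcoef(years, score)[0, 1], 2))              # around -0.2

    # Correlation computed on the decade means, as in the article
    decade = (years // 10).astype(int)
    decade_means = [score[decade == d].mean() for d in range(4)]
    midpoints = [5, 15, 25, 35]
    print(round(np.corrcoef(midpoints, decade_means)[0, 1], 2))   # close to -1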

24 . (d) have eigenvalues considerably greater than 1.26 .27 . but we wouldn’t get too excited by them. the ﬁrst factor accounted for 1.39 . C.27 . there are more problems than we can mention.36 .232 REGRESSION AND CORRELATION TABLE III–1 Rotated factor loading matrix Item Factor 1 Factor 2 Factor 3 Factor 4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 . Again drawing on our vast knowledge of arcane lore.A. you now know that you can ﬁgure them out yourself by simply squaring each loading in the column and adding them up. and 140 would be preferable (a 10:1 ratio).31 After weeding out unusable questions. and the four factors together expressed a total of (1. C.29 .21 .38 –. (b) show minimal factorial complexity. Especially with so few items.16 .0039.30 .24 . which they then administered to 50 patients. Trust. they’re all above 1.14 .18 .18 . (c) have their eigenvalues reported.19 .19 . They said that the rotated factor loading matrix. The Subject-to-Variable ratio. and we really have a three-factor solution (accounting for 26.41 . we know that the total variance is 14 because we have that many items.R.38 . and 1.0675. many of the items load about equally on two or more factors (e.11 .47 –.02 .31 . which is reproduced in Table III–1.22 . Percent of variance explained.12 .15 .17 .5737 1.37 . Empathy. Factorial complexity.33 .15 . It’s usual to report the eigenvalue for each factor at the bottom of the column. and (e) account cumulatively for at least 60% of the variance. Can you spot any problems with what they did? Actually. However.1067 1. Kildare.26 .48 .28 .R. we’d hope that the ﬁrst four factors would explain at least 60% or 70% of the variance.35 .P. Analyzing binary data.21 .5737.29 . and 13). they ended up with 14 True-False items. we’d also be too embarrassed to make them public. and Looking Like Dr.P. 2.54 .A.20 . 1.17 .08 . We would say that two items don’t constitute a factor. 5. 1.94% of the variance. DETECTOR III–10 The same as III–5 and many others: the subject-to-variable ratio should be 5:1 at a minimum.36 . items 8.P. . The authors thought they could pull a fast one on us by not giving them.A.49 .18 .28 .02 . Only 50 patients for 14 items just doesn’t cut it. and it should be closer to 10:1.39 .33 .0675 1. Even after rotation.24%. Here are some of them: 1. With 14 items. 3. This makes it hard to argue that these are independent factors. Sure enough. If our results were this bad.g.R.27 . Don’t! C..0. What we get is that the four eigenvalues are 1. 6. Number of factors. shows that the TMIADS is tapping four different areas—Openness. So.27 .31 . 12.25 .8% of the variance).52 .5737 ÷ 14 11. Eigenvalues.0039) ÷ 14 = 33. Factor 4 has only two items (11 and 12) that load higher on it than on the other factors. Meedok and Hipokrit also didn’t report how much variance each factor explained.21 .19 .1067. DETECTOR III–11 The retained factors should (a) comprise at least three items. 4.24 . DETECTOR III–12 Binary data should not be factor analyzed. there should have been an absolute minimum of 70 subjects (5 subjects per variable).
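Since the authors of the TMIADS didn't report the variance figures, here is the arithmetic, done the way the text describes: each eigenvalue divided by the 14 items' worth of total variance.

    eigenvalues = [1.5737, 1.1067, 1.0675, 1.0039]
    n_items = 14                                    # total variance = number of items

    for ev in eigenvalues:
        print(round(100 * ev / n_items, 1))         # 11.2, 7.9, 7.6, and 7.2 percent

    print(round(100 * sum(eigenvalues[:3]) / n_items, 1))   # 26.8% for three factors
    print(round(100 * sum(eigenvalues) / n_items, 1))       # 33.9% for all four together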

SECTION THE FOURTH
NONPARAMETRIC STATISTICS



Here it is a matter of life and death. and (2) it would be hard to trace all of them.2 It did start with a few suspicious cases in New Mexico and grew rapidly from there. except that probably hundreds of thousands of health food freaks are gobbling up megavitamins and all sorts of other stuff. eosinophilia-myalgia syndrome (EMS). fever. a report (Eidson et al. and our data are body counts. and ﬁnally the general case involving many factors (log-linear analysis). match them up as best you can to another group of folks who are similar in every way you can think of but exposure. paired data (McNemar’s chi-squared). or from an accident of nature (radon) or their jobs (Agent Orange). If we really thought people might die from tryptophan exposure. You scour hospital records and death certiﬁcates around the country. dyspnea. whereby subjects are assigned at random to a treatment or control group and no one knows until it’s over who was in what group. how do you perform statistics on counts of bodies? W e confess to a deviation from our tradition. which is supposed to be good for everything from insomnia to impotence. a case-control study.1 The only circumstance they appeared to have in common was that they were health food freaks and had all been imbibing large quantities of an amino acid health food called tryptophan. You scour the countryside far and wide and locate 17 other poor souls who have succumbed under mysterious circumstances. have been exposed to a substance. 3Just 235 . how do you analyze bodies. Tests of Signiﬁcance for Categorical Frequency Data SETTING THE SCENE A few years ago. But ﬁrst a small diversion into research design. That might work here.g. This is the stuff of real epidemiology. and then consider some special cases: small numbers (the Fisher Exact Test). None of this touchy-feely research based on “How do you feel on a seven-point scale?” questions. Mr. it’s unlikely (we hope) that any ethics committee would let us expose folks to the stuff just for the sake of science. nausea.3 The question is. and you eventually locate 80 people with EMS. because they don’t usually follow a normal distribution unless you pile them that way. and how will you prove it? In particular. Parrot. the chisquared (χ2) test. the story (however unlikely) happens to be true (at least true enough to end up in a law court). and it occasionally kills. The next best design is a cohort study. it has many other manifestations (e. to ﬁnd cases and controls. and (1) very few of them actually appear to have come down with eosinophilia-myalgia syndrome (EMS). like Vietnam. tachycardia). 1990) indicated that several people in New Mexico had succumbed to a rare but particularly nasty disease. In this case. weakness. in which you take a bunch of folks with the disease (the cases) and without the disease (controls). and then check the frequency of disease occurrence in both. As well as causing crippling muscle pain and high eosinophil counts.F I R S T Here we introduce statistical methods used to deal with categorical frequency data of the form. of their own volition (smoking). of course. Hercules Parrot. who were hospitalized for something else or died of some- 1Eosinophilia- myalgia syndrome (EMS) is a very nasty multisystem disease. it is about the only practical approach to looking at risk when the prevalence is very low. “The number of individuals who…” We begin with the simplest case. You may have heard that the best of all research designs is a randomized controlled trial. but we do
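The chapter builds up to the chi-squared test for exactly this kind of question. As a preview, here is a minimal sketch of the end product: a 2 x 2 table of cases and controls by exposure, handed to scipy's chi-squared routine. The counts are invented for illustration; they are not the Eidson et al. data.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Invented 2 x 2 case-control table:
    # rows = EMS cases / controls, columns = took tryptophan / did not
    table = np.array([[63, 17],     # 80 cases
                      [22, 58]])    # 80 controls

    chi2_value, p, df, expected = chi2_contingency(table)
    print(round(chi2_value, 1), df, p)   # a large chi-squared on 1 df, p far below .05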