
Statistics in Psychology
An Historical Perspective
Second Edition

Michael Cowles
York University, Toronto

LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS


2001    Mahwah, New Jersey    London

Copyright © 2001 by Lawrence Erlbaum Associates, Inc.
All rights reserved. No part of this book may be reproduced in any form, by photostat, microform, retrieval system, or any other means, without the prior written permission of the publisher.

Lawrence Erlbaum Associates, Inc., Publishers
10 Industrial Avenue
Mahwah, NJ 07430

Cover design by Kathryn Houghtaling Lacey

Library of Congress Cataloging-in-Publication Data

Cowles, Michael, 1936-
    Statistics in psychology: an historical perspective / Michael Cowles. 2nd ed.
        p. cm.
    Includes bibliographical references (p. ) and index.
    ISBN 0-8058-3509-1 (c: alk. paper) ISBN 0-8058-3510-5 (p: alk. paper)
    1. Psychology—Statistical methods—History. 2. Social sciences—Statistical methods—History. I. Title.

BF39.C67 2000
150'.7'27 dc21    00-035369

The final camera copy for this work was prepared by the author, and therefore the publisher takes no responsibility for consistency or correctness of typographical style. However, this arrangement helps to make publication of this kind of scholarship possible.

Books published by Lawrence Erlbaum Associates are printed on acid-free paper, and their bindings are chosen for strength and durability.

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
Contents

Preface ix
Acknowledgments xi

1 The Development of Statistics 1

Evolution, Biometrics, and Eugenics 1
The Definition of Statistics 6
Probability 7
The Normal Distribution 12
Biometrics 14
Statistical Criticism 18

2 Science, Psychology, and Statistics 21

Determinism 21
Probabilistic and Deterministic Models 26
Science and Induction 27
Inference 31
Statistics in Psychology 33

3 Measurement 36

In Respect of Measurement 36
Some Fundamentals 39
Error in Measurement 44

4 The Organization of Data 47

The Early Inventories 47
Political Arithmetic 48
Vital Statistics 50
Graphical Methods 53

5 Probability 56

The Early Beginnings 56
The Beginnings 57
The Meaning of Probability 60
Formal Probability Theory 66

6 Distributions 68

The Binomial Distribution 68
The Poisson Distribution 70
The Normal Distribution 72

7 Practical Inference 77

Inverse Probability and the Foundations of Inference 77
Fisherian Inference 81
Bayes or p < .05? 83

8 Sampling and Estimation 85

Randomness and Random Numbers 85
Combining Observations 88
Sampling in Theory and in Practice 95
The Theory of Estimation 98
The Battle for Randomization 101

9 Sampling Distributions 105

The Chi-Square Distribution 105
The t Distribution 114
The F Distribution 121
The Central Limit Theorem 124

10 Comparisons, Correlations, and Predictions 127

Comparing Measurements 127
Galton's Discovery of Regression 129
Galton's Measure of Co-relation 138
The Coefficient of Correlation 141
Correlation - Controversies and Character 146

11 Factor Analysis 154

Factors 154
The Beginnings 156
Rewriting the Beginnings 162
The Practitioners 164

12 The Design of Experiments 171

The Problem of Control 171
Methods of Inquiry 173
The Concept of Statistical Control 176
The Linear Model 181
The Design of Experiments 182

13 Assessing Differences and Having Confidence 186

Fisherian Statistics 186
The Analysis of Variance 187
Multiple Comparison Procedures 195
Confidence Intervals and Significance Tests 199
A Note on 'One-Tail' and 'Two-Tail' Tests 203

14 Treatments and Effects: The Rise of ANOVA 205

The Beginnings 205
The Experimental Texts 205
The Journals and the Papers 207
The Statistical Texts 212
Expected Mean Squares 213

15 The Statistical Hotpot 216

Times of Change 216
Neyman and Pearson 217
Statistics and Invective 224
Fisher versus Neyman and Pearson 228
Practical Statistics 233

References 236

Author Index 254

Subject Index 258


Preface

In this second edition I have made some corrections to errors in algebraic expressions that I missed in the first edition, and I have briefly expanded on some sections of the original where I thought such expansion would make the narrative clearer or more useful. The main change is the inclusion of two new chapters; one on factor analysis and one on the rise of the use of ANOVA in psychological research. I am still of the opinion that factor analysis deserves its own historical account, but I am persuaded that the audience for such a work would be limited were the early mathematical contortions to be fully explored. I have tried to provide a brief non-mathematical background to its arrival on the statistical scene.
I realized that my account of ANOVA in the first edition did not do justice to the story of its adoption by psychology, and, largely due to my re-reading of the work of Sandy Lovie (of the University of Liverpool, England) and Pat Lovie (of Keele University, England), who always write papers that I wish I had written, decided to try again. I hope that the Lovies will not be too disappointed by my attempt to summarize their sterling contributions to the history of both factor analysis and ANOVA.
As before, any errors and misinterpretations are my responsibility alone. I would welcome correspondence that points to alternative views.
I would like to give special thanks to the reviewers of the first edition for their kind comments and all those who have helped to bring about the revival of this work. In particular Professor Niels Waller of Vanderbilt University must be acknowledged for his insistent and encouraging remarks. I hope that I have deserved them. My colleagues and many of my students at York University, Toronto, have been very supportive. Those students, both undergraduate and graduate, who have expressed their appreciation for my inclusion of some historical background in my classes on statistics and method have given me enormous satisfaction. This relatively short account is mainly for them, and I hope it will encourage some of them to explore some of these matters further.
There seems to be a slow realization among statistical consumers in psychology that there is more to the enterprise than null hypothesis significance testing, and other controversies to exercise us. It is still my firm belief that just a little more mathematical sophistication and just a little more historical knowledge would do a great deal for the way we carry on our research business in psychology.
The editors and production people at Lawrence Erlbaum, ever sharp and efficient, get on with the job and bring their expertise and sensible advice to the project, and I very much appreciate their efforts.
My wife has sacrificed a great deal of her time and given me considerable help with the final stages of this revision, and she and my family, even yet, put up with it all. Mere thanks are not sufficient.

Michael Cowles
Acknowledgments

I wish to express my appreciation to a number of individuals, institutions, and publishers for granting permission to reproduce material that appears in this book:

Excerpts from Fisher Box, J. © 1978, R. A. Fisher: the life of a scientist, from Scheffé, H. (1959) The analysis of variance, and from Lovie, A. D., & Lovie, P. Charles Spearman, Cyril Burt, and the origins of factor analysis. Journal of the History of the Behavioral Sciences, 29, 308-321. Reprinted by permission of John Wiley & Sons Inc., the copyright holders.

Excerpts from a number of papers by R. A. Fisher. Reprinted by permission of Professor J. H. Bennett on behalf of the copyright holders, the University of Adelaide.

Excerpts, figures and tables by permission of Hafner Press, a division of Macmillan Publishing Company, from Statistical methods for research workers by R. A. Fisher. Copyright © 1970 by University of Adelaide.

Excerpts from volumes of Biometrika. Reprinted by permission of the Biometrika Trustees and from Oxford University Press.

Excerpts from MacKenzie, D. A. (1981). Statistics in Britain 1865-1930. Reprinted by permission of the Edinburgh University Press.

Two plates from Galton, F. (1885a). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246-263. Reprinted by permission of the Royal Anthropological Institute of Great Britain and Ireland.

Professor W. H. Kruskal, Professor F. Mosteller, and the International Statistical Institute for permission to reprint a quotation from Kruskal, W., & Mosteller, F. (1979). Representative sampling, IV: The history of the concept in statistics, 1895-1939. International Statistical Review, 47, 169-195.

Excerpts from Hogben, L. (1957). Statistical theory. London: Allen and Unwin; Russell, B. (1931). The scientific outlook. London: Allen and Unwin; Russell, B. (1946). History of western philosophy and its connection with political and social circumstances from the earliest times to the present day. London: Allen and Unwin; von Mises, R. (1957). Probability, statistics and truth. (Second revised English edition prepared by Hilda Geiringer.) London: Allen and Unwin. Reprinted by permission of Unwin Hyman, the copyright holders.

Excerpts from Clark, R. W. (1971). Einstein, the life and times. New York: World. Reprinted by permission of Peters, Fraser and Dunlop, Literary Agents.

Dr D. C. Yalden-Thomson for permission to reprint a passage from Hume, D. (1748). An enquiry concerning human understanding. (In D. C. Yalden-Thomson (Ed.). (1951). Hume, Theory of Knowledge. Edinburgh: Thomas Nelson.)

Excerpts from various volumes of the Journal of the American Statistical Association. Reprinted by permission of the Board of Directors of the American Statistical Association.

An excerpt reprinted from Probability, statistics, and data analysis by O. Kempthorne and L. Folks, © 1971 Iowa State University Press, Ames, Iowa 50010.

Excerpts from De Moivre, A. (1756). The doctrine of chances: or, A method of calculating the probabilities of events in play. (3rd ed.), London: A. Millar. Reprinted from the edition published by the Chelsea Publishing Co., New York, © 1967, with permission; and Kolmogorov, A. N. (1956). Foundations of the theory of probability. (N. Morrison, Trans.). Reprinted by permission of the Chelsea Publishing Co.

An excerpt reprinted with permission of Macmillan Publishing Company from An introduction to the study of experimental medicine by C. Bernard (H. C. Greene, Trans.), © 1927 (Original work published in 1865) and from Science and human behavior by B. F. Skinner, © 1953 by Macmillan Publishing Company, renewed 1981 by B. F. Skinner.

Excerpts from Galton, F. (1908). Memories of my life. Reprinted by permission of Methuen and Co.

An excerpt from Chang, W-C. (1976). Sampling theories and sampling practice. In D. B. Owen (Ed.), On the history of statistics and probability (pp. 299-315). Reprinted by permission of Marcel Dekker, Inc., New York.

An excerpt from Jacobs, J. (1885). Review of Ebbinghaus's Ueber das Gedächtnis. Mind, 10, 454-459, and from Hacking, I. (1971). Jacques Bernoulli's Art of Conjecturing. British Journal for the Philosophy of Science, 22, 209-229. Reprinted by permission of Oxford University Press and Professor Ian Hacking.

Excerpts from Lovie, A. D. (1979). The analysis of variance in experimental psychology: 1934-1945. British Journal of Mathematical and Statistical Psychology, 32, 151-178; Yule, G. U. (1921). Review of W. Brown and G. H. Thomson, The essentials of mental measurement. British Journal of Psychology, 2, 100-107; and Thomson, G. H. (1939). The factorial analysis of human ability. I. The present position and the problems confronting us. British Journal of Psychology, 30, 105-108. Reprinted by permission of the British Psychological Society.

Excerpts from Edgeworth, F. Y. (1887). Observations and statistics: An essay on the theory of errors of observation and the first principles of statistics. Transactions of the Cambridge Philosophical Society, 14, 138-169, and Neyman, J., & Pearson, E. S. (1933b). The testing of statistical hypotheses in relation to probabilities a priori. Proceedings of the Cambridge Philosophical Society, 29, 492-510. Reprinted by permission of the Cambridge University Press.

Excerpts from Cochrane, W. G. (1980). Fisher and the analysis of variance. In S. E. Fienberg & D. V. Hinckley (Eds.), R. A. Fisher: An Appreciation (pp. 17-34), and from Reid, C. (1982). Neyman - from life. Reprinted by permission of Springer-Verlag, New York.

Excerpts and plates from Philosophical Transactions of the Royal Society and Proceedings of the Royal Society of London. Reprinted by permission of the Royal Society.

Excerpts from Laplace, P. S. de (1820). A philosophical essay on probabilities. (F. W. Truscott & F. L. Emory, Trans.) Reprinted by permission of Dover Publications, New York.

Excerpts from Galton, F. (1889). Natural inheritance; Thomson, W. (Lord Kelvin). (1891). Popular lectures and addresses; and Todhunter, I. (1865). A history of the mathematical theory of probability from the time of Pascal to that of Laplace. Reprinted by permission of Macmillan and Co., London.

Excerpts from various volumes of the Journal of the Royal Statistical Society, reprinted by permission of the Royal Statistical Society.

An excerpt from Boring, E. G. (1957). When is human behavior predetermined? The Scientific Monthly, 84, 189-196. Reprinted by permission of the American Association for the Advancement of Science.

Data obtained from Rutherford, E., & Geiger, H. (1910). The probability variations in the distribution of α particles. Philosophical Magazine, 20, 698-707. Material used by permission of Taylor and Francis, Publishers, London.

Excerpts from various volumes of Nature reprinted by permission of Macmillan Magazines Ltd.

An excerpt from Forrest, D. W. (1974). Francis Galton: The life and work of a Victorian genius. London: Elek. Reprinted by permission of Grafton Books, a division of the Collins Publishing Group.

Excerpts from Fisher, R. A. (1935, 1966, 8th ed.). The design of experiments. Edinburgh: Oliver and Boyd. Reprinted by permission of the Longman Group, UK, Limited.

Excerpts from Pearson, E. S. (1966). The Neyman-Pearson story: 1926-34. Historical sidelights on an episode in Anglo-Polish collaboration. In F. N. David (Ed.), Festschrift for J. Neyman. London: Wiley. Reprinted by permission of John Wiley and Sons Ltd., Chichester.

An excerpt from Beloff, J. (1993). Parapsychology: a concise history. London: Athlone Press. Reprinted with permission.

Excerpts from Thomson, G. H. (1946). The factorial analysis of human ability. London: University of London Press. Reprinted by permission of The Athlone Press.

An excerpt from Spearman, C. (1927). The abilities of man. New York: Macmillan Company. Reprinted by permission of Simon & Schuster, the copyright holders.

An excerpt from Kelley, T. L. (1928). Crossroads in the mind of man: a study of differentiable mental abilities. Reprinted with the permission of Stanford University Press.

An excerpt from Gould, S. J. (1981). The mismeasure of man. Reprinted with the permission of W. W. Norton & Company.

An excerpt from Boring, E. G. (1957). When is human behavior predetermined? Scientific Monthly, 84, 189-196, and from Wilson, E. B. (1929). Review of The abilities of man. Science, 67, 244-248. Reprinted with permission from the American Association for the Advancement of Science.

An excerpt from Sidman, M. (1960/1988). Tactics of scientific research: Evaluating experimental data in psychology. New York: Basic Books. Boston: Authors Cooperative (reprinted). Reprinted with the permission of Dr Sidman.

An excerpt from Carroll, J. B. (1953). An analytical solution for approximating simple structure in factor analysis. Psychometrika, 18, 23-38. Reprinted with the permission of Psychometrika.

Excerpts from Garrett, H. E., & Zubin, J. (1943). The analysis of variance in psychological research. Psychological Bulletin, 40, 233-267, and from Grant, D. A. (1944). On "The analysis of variance in psychological research." Psychological Bulletin, 41, 158-166. Reprinted with the permission of the American Psychological Association.

An excerpt from Wilson, E. B. (1929). Review of Crossroads in the mind of man. Journal of General Psychology, 2, 153-169. Reprinted with the permission of the Helen Dwight Reid Educational Foundation. Published by Heldref Publications, 1319 18th St. N.W., Washington, D.C. 20036-1802.
1
The Development of Statistics

EVOLUTION, BIOMETRICS, AND EUGENICS

The central concern of the life sciences is the study of variation. To what extent does this individual or group of individuals differ from another? What are the reasons for the variability? Can the variability be controlled or manipulated? Do the similarities that exist spring from a common root? What are the effects of the variation on the life of the organisms? These are questions asked by biologists and psychologists alike.
The life-science disciplines are defined by the different emphases placed on observed variation, by the nature of the particular variables of interest, and by the ways in which the different variables contribute to the life and behavior of the subject matter. Change and diversity in nature rest on an organizing principle, the formulation of which has been said to be the single most influential scientific achievement of the 19th century: the theory of evolution by means of natural selection. The explication of the theory is attributed, rightly, to Charles Darwin (1809-1882). His book The Origin of Species was published in 1859, but a number of other scientists had written on the principle, in whole or in part, and these men were acknowledged by Darwin in later editions of his work.
Natural selection is possible because there is variation in living matter. The struggle for survival within and across species then ruthlessly favors the individuals that possess a combination of traits and characters, behavioral and physical, that allows them to cope with the total environment, exist, survive, and reproduce.
Not all sources of variability are biological. Many organisms to a greater or lesser extent reshape their environment, their experience, and therefore their behavior through learning. In human beings this reshaping of the environment has reached its most sophisticated form in what has come to be called cultural evolution. A fundamental feature of the human condition, of human nature, is our ability to process a very great deal of information. Human beings have originality and creative powers that continually expand the boundaries of knowledge. And, perhaps most important of all, our language skills, verbal and written, allow for the accumulation of knowledge and its transmission from generation to generation. The rich diversity of human civilization stems from cultural, as well as genetic, diversity.
Curiosity about diversity and variability leads to attempts to classify and to measure. The ordering of diversity and the assessment of variation have spurred the development of measurement in the biological and social sciences, and the application of statistics is one strategy for handling the numerical data obtained.
As science has progressed, it has become increasingly concerned with quantification as a means of describing events. It is felt that precise and economical descriptions of events and the relationships among them are best achieved by measurement. Measurement is the link between mathematics and science, and the apparent (at any rate to mathematicians!) clarity and order of mathematics foster the scientist's urge to measure. The central importance of measurement was vigorously expounded by Francis Galton (1822-1911): "Until the phenomena of any branch of knowledge have been submitted to measurement and number it cannot assume the status and dignity of a Science."
These words formed part of the letterhead of the Department of Applied Statistics of University College, London, an institution that received much intellectual and financial support from Galton. And it is with Galton, who first formulated the method of correlation, that the common statistical procedures of modern social science began.
The nature of variation and the nature of inheritance in organisms were much-discussed and much-confused topics in the second half of the 19th century. Galton was concerned to make the study of heredity mathematical and to bring order into the chaos.
Francis Galton was Charles Darwin's cousin. Galton's mother was the daughter of Erasmus Darwin (1731-1802) by his second wife, and Darwin's father was Erasmus's son by his first. Darwin, who was 13 years Galton's senior, had returned home from a 5-year voyage as the naturalist on board H.M.S. Beagle (an Admiralty expeditionary ship) in 1836, and by 1838 had conceived of the principle of natural selection to account for some of the observations he had made on the expedition. The careers and personalities of Galton and Darwin were quite different. Darwin painstakingly marshaled evidence and single-mindedly buttressed his theory, but remained diffident about it, apparently uncertain of its acceptance. In fact, it was only the inevitability of the announcement of the independent discovery of the principle by Alfred Russell Wallace (1823-1913) that forced Darwin to publish, some 20 years after he had formed the idea. Galton, on the other hand, though a staid and formal Victorian, was not without vanity, enjoying the fame and recognition brought to him by his many publications on a bewildering variety of topics. The steady stream of lectures, papers and books continued unabated from 1850 until shortly before his death.
The notion of correlated variation was discussed by the new biologists. Darwin observes in The Origin of Species:

Many laws regulate variation, some few of which can be dimly seen, ... I will here only allude to what may be called correlated variation. Important changes in the embryo or larva will probably entail changes in the mature animal ... Breeders believe that long limbs are almost always accompanied by an elongated head ... cats which are entirely white and have blue eyes are generally deaf ... it appears that white sheep and pigs are injured by certain plants whilst dark-coloured individuals escape ... (Darwin, 1859/1958, p. 34)

Of course, at this time, the hereditary mechanism was unknown, and, partly in an attempt to elucidate it, Galton began, in the mid-1870s, to breed sweet peas.1 The results of his study of the size of sweet pea seeds over two generations were published in 1877. When a fixed size of parent seed was compared with the mean size of the offspring seeds, Galton observed the tendency that he called then reversion and later regression to the mean. The mean offspring size is not as extreme as the parental size. Large parent seeds of a particular size produce seeds that have a mean size that is larger than average, but not as large as the parent size. The offspring of small parent seeds of a fixed size have a mean size that is smaller than average, but now this mean size is not as small as that of the fixed parent size. This phenomenon is discussed later in more detail. For the moment, suffice it to say that it is an arithmetical artifact arising from the fact that offspring sizes do not match parental sizes absolutely uniformly. In other words, the correlation is imperfect.
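The arithmetic is easy to reproduce. The sketch below (hypothetical numbers, not Galton's seed data) simulates parent and offspring sizes that correlate imperfectly and shows that the offspring of extreme parents average closer to the mean:

```python
import random

random.seed(1)

# Hypothetical parent and offspring 'sizes': same mean and spread,
# but imperfectly correlated (r is about 0.5).
r, mu, sd, n = 0.5, 100.0, 10.0, 100_000
pairs = []
for _ in range(n):
    parent = random.gauss(mu, sd)
    noise = random.gauss(0, sd * (1 - r**2) ** 0.5)
    pairs.append((parent, mu + r * (parent - mu) + noise))

# Take the largest parents (top tenth) and compare the two means.
large = [(p, o) for p, o in pairs if p > mu + 1.28 * sd]
print(sum(p for p, _ in large) / len(large))  # about 117.5
print(sum(o for _, o in large) / len(large))  # about 108.8: less extreme
```

Nothing pushes the offspring back toward mediocrity; run in reverse, the same arithmetic shows that the parents of extreme offspring are also, on average, closer to the mean.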
Galton misinterpreted this statistical phenomenon as a real trend toward a reduction in population variability. Paradoxically, however, it led to the formation of the Biometric School of heredity and thus encouraged the development of a great many statistical methods.

1 Mendel had already carried out his work with edible peas and thus begun the science of genetics. The results of his work were published in a rather obscure journal in 1866 and the wider scientific world remained oblivious of them until 1900.
Over the next several years Galton collected data on inherited human characteristics by the simple expedient of offering cash prizes for family records. From these data he arrived at the regression lines for hereditary stature. Figures showing these lines are shown in Chapter 10.
A common theme in Galton's work, and later that of Karl Pearson (1857-1936), was a particular social philosophy. Ronald Fisher (1890-1962) also subscribed to it, although, it must be admitted, it was not, as such, a direct influence on his work. These three men are the founders of what are now called classical statistics and all were eugenists. They believed that the most relevant and important variables in human affairs are inherited. One's ancestors rather than one's environmental experiences are the overriding determinants of intellectual capacity and personality as well as physical attributes. Human well-being, human personality, indeed human society, could therefore, they argued, be improved by encouraging the most able to have more children than the least able. MacKenzie (1981) and Cowan (1972, 1977) have argued that much of the early work in statistics and the controversies that arose among biologists and statisticians reflect the commitment of the founders of biometry, Pearson being the leader, to the eugenics movement.
In 1884, Galton financed and operated an anthropometric laboratory at the International Health Exhibition. For a charge of threepence, members of the public were measured. Visual and auditory acuity, weight, height, limb span, strength, and a number of other variables were recorded. Over 9,000 data sets were obtained, and, at the close of the exhibition, the equipment was transferred to the South Kensington Museum where data collection continued. Francis Galton was an avid measurer.
Karl Pearson (1930) relates that Galton's first forays into the problem of correlation involved ranking techniques, although he was aware that ranking methods could be cumbersome. How could one compare different measures of anthropometric variables? In a flash of illumination, Galton realized that characteristics measured on scales based on their own variability (we would now say standard score units) could be directly compared. This inspiration is certainly one of the most important in the early years of statistics. He recalls the occasion in Memories of my Life, published in 1908:

As these lines are being written, the circumstances under which I first clearly grasped the important generalisation that the laws of heredity were solely concerned with deviations expressed in statistical units are vividly recalled to my memory. It was in the grounds of Naworth Castle, where an invitation had been given to ramble freely. A temporary shower drove me to seek refuge in a reddish recess in the rock by the side of the pathway. There the idea flashed across me and I forgot everything else for a moment in my great delight. (Galton, 1908, p. 300)2

This incident apparently took place in 1888, and before the year was out, Co-relations and Their Measurement Chiefly From Anthropometric Data had been presented to the Royal Society. In this paper Galton defines co-relation: "Two variable organs are said to be co-related when the variation of one is accompanied on the average by more or less variation of the other, and in the same direction" (Galton, 1888, p. 135).
The last five words of the quotation indicate that the notion of negative correlation had not then been conceived, but this brief but important paper shows that Galton fully understood the importance of his statistical approach. Shortly thereafter, mathematicians entered the picture with encouragement from some, but by no means all, biologists.
Much of the basic mathematics of correlation had, in fact, already been developed by the time of Galton's paper, but the utility of the procedure itself in this context had apparently eluded everyone. It was Karl Pearson, Galton's disciple and biographer, who, in 1896, set the concept on a sound mathematical foundation and presented statistics with the solution to the problem of representing covariation by means of a numerical index, the coefficient of correlation.
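Galton's standard-unit insight is exactly what Pearson's index expresses: r is the average product of paired measurements after each variable has been rescaled by its own variability. A minimal sketch (the paired data are invented for illustration):

```python
from math import sqrt

def pearson_r(x, y):
    # Average product of deviations, each measured in units of its
    # own standard deviation ('standard score units').
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((v - mx) ** 2 for v in x) / n)
    sy = sqrt(sum((v - my) ** 2 for v in y) / n)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / n

# Hypothetical anthropometric pairs, e.g. height and arm span in inches.
height = [64, 66, 68, 70, 72]
span = [63, 67, 67, 71, 70]
print(round(pearson_r(height, span), 2))  # 0.91: strong positive co-relation
```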
From these beginnings spring the whole corpus of present-day statistical techniques. George Udny Yule (1871-1951), an influential statistician who was not a eugenist, and Pearson himself elaborated the concepts of multiple and partial correlation. The general psychology of individual differences and research into the structure of human abilities and intelligence relied heavily on correlational techniques. The first third of the 20th century saw the introduction of factor analysis through the work of Charles Spearman (1863-1945), Sir Godfrey Thomson (1881-1955), Sir Cyril Burt (1883-1971), and Louis L. Thurstone (1887-1955).
A further prolific and fundamentally important stream of development arises from the work of Sir Ronald Fisher. The technique of analysis of variance was developed directly from the method of intra-class correlation - an index of the extent to which measurements in the same category or family are related, relative to other categories or families.
2 Karl Pearson (1914-1930), in the volume published in 1924, suggested that this spot deserves a commemorative plaque. Unfortunately, it looks as though the inspiration can never be so marked, for Kenna (1973), investigating the episode, reports that: "In the grounds of Naworth Castle there are not any rocks, reddish or otherwise, which could provide a recess, ..." (p. 229), and he suggests that the location of the incident might have been Corby Castle.

Fisher studied mathematics at Cambridge but also pursued interests in biology and genetics. In 1913 he spent the summer working on a farm in Canada. He worked for a while with a City investment company and then found himself declared unfit for military service because of his extremely poor eyesight. He turned to school-teaching, for which he had no talent and which he hated. In 1919 he had the opportunity of a post at University College with Karl Pearson, then head of the Department of Applied Statistics, but chose instead to develop a statistical laboratory at the Rothamsted Experimental Station near Harpenden in England, where he developed experimental methods for agricultural research. Over the next several years, relations between Pearson and Fisher became increasingly strained. They clashed on a variety of issues. Some of their disagreements helped, and some hindered, the development of statistics. Had they been collaborators and friends, rather than adversaries and enemies, statistics might have had a quite different history. In 1933 Fisher became Galton Professor of Eugenics at University College and in 1943 moved to Cambridge, where he was Professor of Genetics. Analysis of variance, which has had such far-reaching effects on experimentation in the behavioral sciences, was developed through attempts to tackle problems posed at Rothamsted.
It may be fairly said that the majority of texts on methodology and statistics in the social sciences are the offspring (diversity and selection notwithstanding!) of Fisher's books, Statistical Methods for Research Workers,3 first published in 1925(a), and The Design of Experiments, first published in 1935(a).
In succeeding chapters these statistical concepts are examined in more detail and their development elaborated, but first the use of the term statistics is explored a little further.

3 Maurice Kendall (1963) says of this work, "It is not an easy book. Somebody once said that no student should attempt to read it unless he had read it before" (p. 2).

THE DEFINITION OF STATISTICS

In an everyday sense when we think of statistics we think of facts and figures, of numerical descriptions of political and economic states (from which the word is derived), and of inventories of the various aspects of our social organization. The history of statistical procedures in this sense goes back to the beginnings of human civilization. When trade and commerce began, when governments imposed taxes, numerical records were kept. The counting of people, goods, and chattels was regularly carried out in the Roman Empire, the Domesday Book attempted to describe the state of England for the Norman conquerors, and government agencies the world over expend a great deal of money and energy in collecting and tabulating such information in the present day. Statistics are used to describe and summarize, in numerical terms, a wide variety of situations.
But there is another more recently-developed activity subsumed under the term statistics: the practice of not only collecting and collating numerical facts, but also the process of reasoning about them. Going beyond the data, making inferences and drawing conclusions with greater or lesser degrees of certainty in an orderly and consistent fashion is the aim of modern applied statistics. In this sense statistical reasoning did not begin until fairly late in the 17th century and then only in a quite limited way. The sophisticated models now employed, backed by theoretical formulations that are often complex, are all less than 100 years old. Westergaard (1932) points to the confusions that sometimes arise because the word statistics is used to signify both collections of measurements and reasoning about them, and that in former times it referred merely to descriptions of states in both numerical and non-numerical terms.
In adopting the statistical inferential strategy the experimentalist in the life sciences is accepting the intrinsic variability of the subject matter. In recognizing a range of possibilities, the scientist comes four-square against the problem of deciding whether or not the particular set of observations he or she has collected can reasonably be expected to reflect the characteristics of the total range. This is the problem of parameter estimation, the task of estimating population values (parameters) from a consideration of the measurements made on a particular population subset - the sample statistics. A second task for inferential statistics is hypothesis testing, the process of judging whether or not a particular statistical outcome is likely or unlikely to be due to chance. The statistical inferential strategy depends on a knowledge of probabilities.
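A small sketch may make the two tasks concrete (the 'population' here is simulated, so its parameters are known only to the program):

```python
import random
from statistics import mean, stdev

random.seed(2)

# A population with mean 50 and standard deviation 8, and one sample.
population = [random.gauss(50, 8) for _ in range(100_000)]
sample = random.sample(population, 25)

# Parameter estimation: sample statistics stand in for the parameters.
print(round(mean(sample), 1), round(stdev(sample), 1))  # near 50 and 8

# Hypothesis testing (sketch): how often does chance sampling alone
# throw a sample mean at least this far from the population mean?
gap = abs(mean(sample) - 50)
hits = sum(abs(mean(random.sample(population, 25)) - 50) >= gap
           for _ in range(2_000))
print(hits / 2_000)  # a sizeable proportion: no reason to doubt chance
```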
This aspect of statistics has grown out of three activities that, at first glance, appear to be quite different but in fact have some close links. They are actuarial prediction, gambling, and error assessment. Each addresses the problems of making decisions, evaluating outcomes, and testing predictions in the face of uncertainty, and each has contributed to the development of probability theory.

PROBABILITY

Statistical operations are often thought of as practical applications of previously developed probability theory. The fact is, however, that almost all our present-day statistical techniques have arisen from attempts to answer real-life problems of prediction and error assessment, and theoretical developments have not always paralleled technical accomplishments. Box (1984) has reviewed the scientific context of a range of statistical advances and shown that the fundamental methods evolved from the work of practising scientists.
John Graunt, a London haberdasher, born in 1620, is credited with the first attempt to predict and explain a number of social phenomena from a consideration of actuarial tables. He compiled his tables from Bills of Mortality, the parish accounts of deaths that were regularly, if somewhat crudely, recorded from the beginning of the 17th century.
Graunt recognizes that the question might be asked: "To what purpose tends all this laborious buzzling, and groping? To know, 1. the number of the People? 2. How many Males, and Females? 3. How many Married, and single?" (Graunt, 1662/1975, p. 77), and says: "To this I might answer in general by saying, that those, who cannot apprehend the reason of these Enquiries, are unfit to trouble themselves to ask them." (p. 77).

Graunt reassured readers of this quite remarkable work:

The Lunaticks are also but few, viz. 158 in 229250, though I fear many more than are set down in our Bills ...
So that, this Casualty being so uncertain, I shall not force my self to make any inference from the numbers, and proportions we finde in our Bills concerning it: onely I dare ensure any man at this present, well in his Wits, for one in the thousand, that he shall not die a Lunatick in Bedlam, within these seven years, because I finde not above one in about one thousand five hundred have done so. (pp. 35-36)

Here is an inference based on numerical data and couched in terms not so very far removed from those in reports in the modern literature. Graunt's work was immediately recognized as being of great importance, and the King himself (Charles II) supported his election to the recently incorporated Royal Society.
A few years earlier the seeds of modern probability theory were being sown in France.4 At this time gambling was a popular habit in fashionable society and a range of games of chance was being played. For experienced players the odds applicable to various situations must have been appreciated, but no formal methods for calculating the chances of various outcomes were available. Antoine Gombauld, the Chevalier de Mere, a "man-about-town" and gambler with a scientific and mathematical turn of mind, consulted his friend, Blaise Pascal (1623-1662), a philosopher, scientist, and mathematician, hoping that he would be able to resolve questions on calculation of expected (probable) frequency of gains and losses, as well as on the fair division of the stakes in games that were interrupted. Consideration of these questions led to correspondence between Pascal and his fellow mathematician Pierre Fermat (1601-1665).5 No doubt their advice aided de Mere's game. More significantly, it was from this exchange that some of the foundations of probability theory and combinatorial algebra were laid.

4 But note that there are hints of probability concepts in mathematics going back at least as far as the 12th century and that Girolamo Cardano wrote Liber de Ludo Aleae (The Book on Games of Chance) a century before it was published in 1663 (see Ore, 1953). There is also no doubt that quite early in human civilization, there was an appreciation of long-run relative frequencies, randomness, and degrees of likelihood in gaming, and some quite formal concepts are to be found in Greek and Roman writings.
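The stake-division question (the 'problem of points') can be made concrete. In the Pascal-Fermat resolution the stakes are split according to each player's probability of winning, found by following out the equally likely ways the interrupted play could have continued. A sketch, with an invented score for illustration:

```python
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def p_wins(a_needs, b_needs):
    # Probability that player A wins when each further point is a
    # fair toss and A needs a_needs points, B needs b_needs.
    if a_needs == 0:
        return Fraction(1)
    if b_needs == 0:
        return Fraction(0)
    return (p_wins(a_needs - 1, b_needs) + p_wins(a_needs, b_needs - 1)) / 2

# A game to 10 points interrupted at 8-7: A needs 2 more, B needs 3.
print(p_wins(2, 3))  # 11/16 of the stakes to A, 5/16 to B
```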
Christian Huygens (1629-1695) published, in 1657, a tract On Reasoning With Games of Dice (1657/1970), which was partly based on the Pascal-Fermat correspondence, and in 1713, Jacques Bernoulli's (1654-1705) book The Art of Conjecture developed a theory of games of chance.
Pascal had connected the study of probability with the arithmetic triangle (Fig. 1.1), for which he discovered new properties, although the triangle was known in China at least five hundred years earlier. Proofs of the triangle's properties were obtained by mathematical induction or reasoning by recurrence.

FIG. 1.1 Pascal's Triangle

5 Poisson (1781-1840), writing of this episode in 1837 says, "A problem concerning games of chance, proposed by a man of the world to an austere Jansenist, was the origin of the calculus of probabilities" (quoted by Struik, 1954, p. 145). De Mere was certainly "a man of the world" and Pascal did become austere and religious, but at the time of de Mere's questions Pascal was in his so-called "worldly period" (1652-1654). I am indebted to my father-in-law, the late Professor F. T. H. Fletcher, for many insights into the life of Pascal.
Pascal's triangle, as it is known in the West, is a tabulation of the binomial coefficients that may be obtained from the expansion of (P + Q)^n where P = Q = 1/2. The expansion was developed by Sir Isaac Newton (1642-1727), and, independently, by the Scottish mathematician, James Gregory (1638-1675). Gregory discovered the rule about 1670. Newton communicated it to the Royal Society in 1676, although later that year he explained that he had first formulated it in 1664 while he was a Cambridge undergraduate. The example shown in Fig. 1.1 demonstrates that the expansion of (1/2 + 1/2)^4 generates, in the numerators of the expression, the numbers in the fifth row of Pascal's triangle. The terms of this expression also give us the five expected frequencies of outcome (0, 1, 2, 3, or 4 heads), or probabilities, when a fair coin is tossed four times. Simple experiment will demonstrate that the actual outcomes in the "real world" of coin tossing closely approximate the distribution that has been calculated from a mathematical abstraction.
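The correspondence is easily checked in a few lines (a minimal sketch of the 'simple experiment'):

```python
import random
from math import comb

random.seed(3)

n = 4  # tosses of a fair coin
# Terms of (1/2 + 1/2)^4: C(4, k) / 2^4 for k = 0, 1, 2, 3, 4 heads.
theory = [comb(n, k) / 2**n for k in range(n + 1)]
print(theory)  # [0.0625, 0.25, 0.375, 0.25, 0.0625]: numerators 1 4 6 4 1

# Simulate many sets of four tosses and tally the number of heads.
trials = 100_000
counts = [0] * (n + 1)
for _ in range(trials):
    counts[sum(random.random() < 0.5 for _ in range(n))] += 1
print([c / trials for c in counts])  # closely approximates the theory
```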
During the 18th century the theory of probability attracted the interest of many brilliant minds. Among them was a friend and admirer of Newton, Abraham De Moivre (1667-1754). De Moivre, a French Huguenot, was interned in 1685 after the revocation by Louis XIV of the Edict of Nantes, an edict which had guaranteed toleration to French Protestants. He was released in 1688, fled to England, and spent the remainder of his life in London. De Moivre published what might be described as a gambler's manual, entitled The Doctrine of Chances or a Method of Calculating the Probabilities of Events in Play. In the second edition of this work, published in 1738, and in a revised third edition published posthumously in 1756, De Moivre (1756/1967) demonstrated a method, which he had first devised in 1733, of approximating the sum of a very large number of binomial terms when n in (P + Q)^n is very large (an immensely laborious computation from the basic expansion).
It may be appreciated that as n grows larger, the number of terms in the expansion also grows larger. The graph of the distribution begins to resemble a smooth curve (Fig. 1.2), a bell-shaped symmetrical distribution that held great interest in mathematical terms but little practical utility outside of gaming.

FIG. 1.2 The Binomial Distribution for N = 12 and the Normal Distribution
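De Moivre's 1733 method is, in modern terms, the normal approximation to the binomial: for a fair coin the term for k heads in n tosses is close to the ordinate of a bell curve with mean n/2 and standard deviation sqrt(n)/2. A quick check (n chosen arbitrarily):

```python
from math import comb, exp, pi, sqrt

n = 100
mean, sd = n / 2, sqrt(n) / 2
for k in (40, 50, 60):
    exact = comb(n, k) / 2**n  # the laborious binomial term
    approx = exp(-((k - mean) ** 2) / (2 * sd**2)) / (sd * sqrt(2 * pi))
    print(k, round(exact, 4), round(approx, 4))
# The pairs agree to three or four decimal places, sparing the
# 'immensely laborious computation from the basic expansion'.
```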
It is safe to say that no other theoretical mathematical abstraction has had such an important influence on psychology and the social sciences as that bell-shaped curve now commonly known by the name that Karl Pearson decided on - the normal distribution - although he was not the first to use the term. Pierre Laplace (1749-1827) independently derived the function and brought together much of the earlier work on probability in Théorie Analytique des Probabilités, published in 1812. It was his work, as well as contributions by many others, that interpreted the curve as the Law of Error and showed that it could be applied to variable results obtained in multiple observations. One of the first applications of the distribution outside of gaming was in the assessment of errors in astronomical observations. Later the utility of the "law" in error assessment was extended to land surveying and even to range estimation problems in artillery fire. Indeed, between 1800 and 1820 the foundations of the theory of error distribution were laid.
Carl Friedrich Gauss (1777-1855), perhaps the greatest mathematician of all time, also made important contributions to work in this area. He was a consultant to the governments of Hanover and of Denmark when they undertook geodetic surveys. The function that helped to rationalize the combination of observations is sometimes called the Laplace-Gaussian distribution.
Following the work of Laplace and Gauss, the development of mathematical probability theory slowed somewhat and not a great deal of progress was made until the present century. But it was during the 19th century, through the development of life insurance companies and through the growth of statistical approaches in the social and biological sciences, that the applications of probability theory burgeoned. Augustus De Morgan (1806-1871), for example, attempted to reduce the constructs of probability to straightforward rules of thumb. His work An Essay on Probabilities and on Their Application to Life Contingencies and Insurance Offices, published in 1838, is full of practical advice and is commented on by Walker (1929).

THE NORMAL DISTRIBUTION

The normal distribution was so named because many biological variables, when measured in large groups of individuals and plotted as frequency distributions, do show close approximations to the curve. It is partly for this reason that the mathematics of the distribution are used in data assessment in the social sciences and in biology. The responsibility, as well as the credit, for this extension of the use of calculations designed to estimate error or gambling expectancies into the examination of human characteristics rests with Lambert Adolphe Quetelet (1796-1874), a Belgian astronomer.
In 1835 Quetelet described his concept of the average man - l'homme moyen. L'homme moyen is Nature's ideal, an ideal that corresponds with a middle, measured value. But Nature makes errors, and in, as it were, missing the target, produces the variability observed in human traits and physical characters. More importantly, the extent and frequency of these errors often conform to the law of frequency of error - the normal distribution.
John Venn (1834-1923), the English logician, objected to the use of the word error in this context: "When Nature presents us with a group of objects of every kind, it is using rather a bold metaphor to speak in this case also of a law of error" (Venn, 1888, p. 42), but the analogy was attractive to some.
Quetelet examined the distribution of the measurements of the chest girths of 5,738 Scottish soldiers, these data having been extracted from the 13th volume of the Edinburgh Medical Journal. There is no doubt that the measurements closely approximate to a normal curve. In another attempt to exemplify the law, Quetelet examined the heights of 100,000 French conscripts. Here he noticed a discrepancy between observed and predicted values:

The official documents would make it appear that, of the 100,000 men, 28,620 are of less height than 5 feet 2 inches: calculation gives only 26,345. Is it not a fair presumption, that the 2,275 men who constitute the difference of these numbers have been fraudulently rejected? We can readily understand that it is an easy matter to reduce one's height a half-inch, or an inch, when so great an interest is at stake as that of being rejected. (Quetelet, 1835/1849, p. 98)
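Quetelet's 'calculation' is a tail area of the fitted normal curve. A sketch of the reasoning, with invented parameters for conscript height (Quetelet fitted his own values from the data):

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    # Proportion of the normal curve lying below x.
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 64.0, 2.5   # hypothetical mean and sd of height, in inches
cutoff = 62.0           # 5 feet 2 inches

print(round(100_000 * normal_cdf(cutoff, mu, sigma)))  # about 21,186
# An observed count far above the calculated one, piled up just below
# the cut-off, is the pattern Quetelet read as fraudulent rejection.
```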

Whether or not the allegation stated here - that short (but not too short) Frenchmen have stooped so low as to avoid military service - is true is no longer an issue. A more important point is noted by Boring (1920):

While admitting the dependence of the law on experience, Quetelet proceeds in numerous cases to analyze experience by means of it. Such a double-edged sword is a peculiarly effective weapon, and it is no wonder that subsequent investigators were tempted to use it in spite of the necessary rules of scientific warfare. (Boring, 1920, p. 11)

The use of the normal curve in statistics is not, however, based solely on the fact that it can be used to describe the frequency distribution of many observed characteristics. It has a much more fundamental significance in inferential statistics, as will be seen, and the distribution and its properties appear in many parts of this book.
Galton first became aware of the distribution from his friend William Spottiswoode, who in 1862 became Secretary of the Royal Geographical Society, but it was the work of Quetelet that greatly impressed him. Many of the data sets he collected approximated to the law and he seemed, on occasion, to be almost mystically impressed with it.

I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "Law of Frequency of Error." The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement amidst the wildest confusion. The huger the mob and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along. (Galton, 1889, p. 66)
This rather theological attitude toward the distribution echoes De Moivre, who, over a century before, proclaimed in The Doctrine of Chances:

Altho' chance produces irregularities, still the Odds will be infinitely great, that in the process of Time, those irregularities will bear no proportion to the recurrency of that Order which naturally results from ORIGINAL DESIGN ... Such Laws, as well as the original Design and Purpose of their Establishment, must all be from without ... if we blind not ourselves with metaphysical dust, we shall be led, by a short and obvious way, to the acknowledgement of the great MAKER and GOVENOUR of all; Himself all-wise, all-powerful and good. (De Moivre, 1756/1967, pp. 251-252)

The ready acceptance of the normal distribution as a law of nature encouraged its wide application and also produced consternation when exceptions were observed. Quetelet himself admitted the possibility of the existence of asymmetric distributions, and Galton was at times less lyrical, for critics had objected to the use of the distribution, not as a practical tool to be used with caution where it seemed appropriate, but as a sort of divine rule:

It has been objected to some of my former work, especially in Hereditary Genius, that I pushed the application of the Law of Frequency of Error somewhat too far. I may have done so, rather by incautious phrases than in reality; ... I am satisfied to claim the Normal Law is a fair average representation of the observed Curves during nine-tenths of their course; ... (Galton, 1889, p. 56)6

BIOMETRICS

In 1890, Walter F. R. Weldon (1860-1906) was appointed to the Chair of Zoology at University College, London. He was greatly impressed and much influenced by Galton's Natural Inheritance. Not only did the book show him how the frequency of the deviations from a "type" might be measured, it opened up for him, and for other zoologists, a host of biometric problems. In two papers published in 1890 and 1892, Weldon showed that various measurements on shrimps might be assessed using the normal distribution. He also demonstrated interrelationships (correlations) between two variables within the individuals. But the critical factor in Weldon's contribution to the development of statistics was his professorial appointment, for this brought him into contact with Karl Pearson, then Professor of Applied Mathematics and Mechanics, a post Pearson had held since 1884. Weldon was attempting to remedy his weakness in mathematics so that he could extend his research, and he approached Pearson for help. His enthusiasm for the biometric approach drew Pearson away from more orthodox work.

6 Note that this quotation and the previous one from Galton are 10 pages apart in the same work!
A second important link was with Galton, who had reviewed Weldon's first paper on variation in shrimps. Galton supported and encouraged the work of these two younger men until his death, and, under the terms of his will, left £45,000 to endow a Chair of Eugenics at the University of London, together with the wish that the post might be offered first to Karl Pearson. The offer was made and accepted.
In 1904, Galton had offered the University of London £500 to establish the study of national eugenics. Pearson was a member of the Committee that the University set up, and the outcome was a decision to appoint the Galton Research Fellow at what was to be named the Eugenics Record Office. This became the Galton Laboratory for National Eugenics in 1906, and yet more financial assistance was provided by Galton. Pearson, still Professor of Applied Mathematics, was its Director as well as Head of the Biometrics Laboratory. This latter received much of its funding over many years from grants from the Worshipful Company of Drapers, which first gave money to the University in 1903.
Pearson's appointment to the Galton Chair brought applied statistics, biometrics, and eugenics together under his direction at University College. It cannot however be claimed absolutely that the day-to-day work of these units was driven by a common theme. Applied statistics and biometrics were primarily concerned with the development and application of statistical techniques to a variety of problems, including anthropometric investigations; the Eugenics Laboratory collected extensive family pedigrees and examined actuarial death rates. Of course Pearson coordinated all the work, and there was interchange and exchange among the staff that worked with him, but Magnello (1998, 1999) has argued that there was not a single unifying purpose in Pearson's research. Others, notably MacKenzie (1981), Kevles (1985), and Porter (1986), have promoted the view that eugenics was the driving force behind Pearson's statistical endeavors.
Pearson was not a formal member of the eugenics movement. He did not join the Eugenics Education Society, and apparently he tried to keep the two laboratories administratively separate, maintaining separate financial accounts, for example, but it has to be recognized that his personal views of the human condition and its future included the conviction that eugenics was of critical importance. There was an obvious and persistent intermingling of statistical results and eugenics in his pronouncements. For example, in his Huxley Lecture in 1903 (published in Biometrika in 1903 and 1904), on topics that were clearly biometric, having to do with his researches on relationships between moral and intellectual variables, he ended with a plea, if not a rallying cry, for eugenics:

The mentally better stock in the nation is not reproducing itself at the same rate as it did of old; the less able, and the less energetic, are more fertile than the better stocks. ... The only remedy, if one be possible at all, is to alter the relative fertility of the good and the bad stocks in the community. ... intelligence can be aided and be trained, but no training or education can create it. You must breed it, that is the broad result for statecraft which flows from the equality in inheritance of the psychical and the physical characters in man. (Pearson, 1904a, pp. 179-180)

Pearson's contribution was monumental, for in less than 8 years, between 1893 and 1901, he published over 30 papers on statistical methods. The first was written as a result of Weldon's discovery that the distribution of one set of measurements of the characteristics of crabs, collected at the zoological station at Naples in 1892, was "double-humped." The distribution was reduced to the sum of two normal curves. Pearson (1894) proceeded to investigate the general problem of fitting observed distributions to theoretical curves. This work was to lead directly to the formulation of the χ2 test of "goodness of fit" in 1900, one of the most important developments in the history of statistics.
Weldon approached the problem of discrepancies between theory and observation in a much more empirical way, tossing coins and dice and comparing the outcomes with the binomial model. These data helped to produce another line of development.
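Pearson's statistic answers precisely the question such tossings raise: are observed counts credibly chance fluctuations around the expectations of a model? A minimal sketch in the spirit of Weldon's dice (the observed counts are invented, not Weldon's records):

```python
from math import comb

# Twelve dice thrown 1,000 times; a 'success' is a 5 or a 6 (p = 1/3).
observed = [8, 56, 130, 208, 225, 180, 110, 52, 21, 8, 2, 0, 0]
throws = sum(observed)  # counts of 0..12 successes per throw

# Binomial expectations under the fair-dice hypothesis.
expected = [throws * comb(12, k) * (1/3)**k * (2/3)**(12 - k)
            for k in range(13)]

# Pearson's 1900 measure: sum of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 1))  # referred to the chi-square distribution, 12 df
```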
In a letter to Galton, written in 1894, Weldon asks for a comment on the results of 7,000 tossings of 12 dice collected for him by a clerk at University College:

A day or two ago Pearson wanted some records of the kind in a hurry, in order to illustrate a lecture, and I gave him the record of the clerk's 7000 tosses ... on examination he rejects them, because he thinks the deviation from the theoretically most probable result is so great as to make the record intrinsically incredible. (quoted by E. S. Pearson, 1965, p. 11)

This incident set off a good deal of correspondence between Karl Pearson, F. Y. Edgeworth (1845-1926), an economist and statistician, and Weldon, the details of which are now only of minor importance. But, as Karl Pearson remarked, "Probabilities are very slippery things" (quoted by E. S. Pearson, 1965, p. 14), and the search for criteria by which to assess the differences between observed and theoretical frequencies, and whether or not they could be reasonably attributed to chance sampling fluctuations, began. Statistical research rapidly expanded into careful examination of distributions other than the normal curve and eventually into the properties of sampling distributions, particularly through the seminal work of Ronald Fisher.
In developing his research into the properties of the probability distributions of statistics, Fisher investigated the basis of hypothesis testing and the foundations of all the well-known tests of statistical significance. Fisher's assertion that p = .05 (1 in 20) is the probability that is convenient for judging whether or not a deviation is to be considered significant (i.e. unlikely to be due to chance) has profoundly affected research in the social sciences, although it should be noted that he was not the originator of the convention (Cowles & Davis, 1982a).
Of course, the development of statistical methods does not end here, nor have all the threads been drawn together. Discussion of the important contribution of W. S. Gosset ("Student," 1876-1937) to small sample work and the refinements introduced into hypothesis testing by Karl Pearson's son, Egon S. Pearson (1895-1980), and Jerzy Neyman (1894-1981) will be found in later chapters, when the earlier details have been elaborated.

Biometrics and Genetics

The early years of the biometric school were surrounded by controversy. Pearson and Weldon held fast to the view that evolution took place by the continuous selection of variations that were favorable to organisms in their environment. The rediscovery of Mendel's work in 1900 supported the concept that heredity depends on self-reproducing particles (what we now call genes), and that inherited variation is discontinuous and saltatory. The source of the development of higher types was occasional genetic jumps or mutations. Curiously enough, this was the view of evolution that Galton had supported. His misinterpretation of the purely statistical phenomenon of regression led him to the notion that a distinction had to be made between variations from the mean that regress and what he called "sports" (a breeder's term for an animal or plant variety that appears apparently spontaneously) that will not.
A champion of the position that mutations were of critical importance in the evolutionary process was William Bateson (1861-1926), and a prolonged and bitter argument with the biometricians ensued. The Evolution Committee of the Royal Society broke down over the dispute. Biometrika was founded by Pearson and Weldon, with Galton's financial support, in 1900, after the Royal Society had allowed Bateson to publish a detailed criticism of a paper submitted by Pearson before the paper itself had been issued. Britain's important scientific journal, Nature, took the biometricians' side and would not print letters from Bateson. Pearson replied to Bateson's criticisms in Biometrika but refused to accept Bateson's rejoinders, whereupon Bateson had them privately printed by the Cambridge University Press in the format of Biometrika!
At the British Association meeting in Cambridge in 1904, Bateson, then
President of the Zoological Section, took the opportunity to deliver a bitter attack
on the biometric school. Dramatically waving aloft the published volumes of
Biometrika, he pronounced them worthless and he described Pearson's correlation
tables as "a Procrustean bed into which the biometrician fits his unanalysed
data" (quoted by Julian Huxley, 1949).
It is even said that Pearson and Bateson refused to shake hands at Weldon's
funeral. Nevertheless, after Weldon's death the controversy cooled. Pearson's
work became more concerned with the theory of statistics, although the influence
of his eugenic philosophy was still in evidence, and by 1910, when Bateson
became Director of the John Innes Horticultural Institute, the argument had died.
However, some statistical aspects of this contentious debate predated the
evolution dispute, and echoes of them - indeed, marked reverberations from
them - are still around today, although of course Mendelian and Darwinian
thinking are completely reconciled.

STATISTICAL CRITICISM

Statistics has been called the "science of averages," and this definition is not
meant in a kindly way. The great physiologist Claude Bernard (1813-1878)
maintained that the use of averages in physiology could not be countenanced:

because the true relations of phenomena disappear in the average; when dealing
with complex and variable experiments, we must study their various circumstances,
and then present our most perfect experiment as a type, which, however,
still stands for true facts.
... averages must therefore be rejected, because they confuse while aiming to
unify, and distort while aiming to simplify. (Bernard, 1865/1927, p. 135)

Now it is, of course, true that lumping measurements together may not give
us anything more than a picture of the lumping together, and the average value
may not be anything like any one individual measurement at all, but Bernard's
ideal type fails to acknowledge the reality of individual differences. A rather
memorable example of a very real confusion is given by Bernard (1865/1927):

A startling instance of this kind was invented by a physiologist who took urine from
a railway station urinal where people of all nations passed, and who believed that he
could thus present an analysis of average European urine! (pp. 134-135)

A less memorable, but just as telling, example is that of the social psychologist
who solemnly reports "mean social class."
Pearson (1906) notes that:

One of the blows to Weldon, which resulted from his biometric view of life,
was that his biological friends could not appreciate his new enthusiasms. They
could not understand how the Museum "specimen" was in the future to be
replaced by the "sample" of 500 to 1000 individuals. (p. 37)

The view is still not wholly appreciated. Many psychologists subscribe to
the position that the most pressing problems of the discipline, and certainly the
ones of most practical interest, are problems of individual behavior. A major
criticism of the effect of the use of the statistical approach in psychological
research is the failure to differentiate adequately between general propositions
that apply to most, if not all, members of a particular group and statistical
propositions that apply to some aggregated measure of the members of the
group. The latter approach discounts the exceptions to the statistical aggregate,
which not only may be the most interesting but may, on occasion, constitute a
large proportion of the group.
Controversy abounds in the field of measurement, probability, and statistics,
and the methods employed are open to criticism, revision, and downright
rejection. On the other hand, measurement and statistics play a leading role in
psychological research, and the greatest danger seems to lie in a nonawareness
of the limitations of the statistical approach and the bases of their development,
as well as the use of techniques, assisted by the high-speed computer, as recipes
for data manipulation.
Miller (1963) observed of Fisher, "Few psychologists have educated us as
rapidly, or have influenced our work as pervasively, as did this fervent, clear-headed
statistician" (p. 157).
Hogben (1957) certainly agrees that Fisher has been enormously influential,
but he objects to Fisher's confidence in his own intuitions:

This intrepid belief in what he disarmingly calls common sense ... has led Fisher
... to advance a battery of concepts for the semantic credentials of which neither
he nor his disciples offer any justification en rapport with the generally accepted
tenets of the classical theory of probability. (Hogben, 1957, p. 504)

Hogben also expresses a thought often shared by natural scientists when
they review psychological research, that:

Acceptability of a statistically significant result of an experiment on animal
behaviour in contradistinction to a result which the investigator can repeat before
a critical audience naturally promotes a high output of publication. Hence the
argument that the techniques work has a tempting appeal to young biologists.
(Hogben, 1957, p. 27)

Experimental psychologists may well agree that the tightly controlled experiment
is the apotheosis of classical scientific method, but they are not so
arrogant as to suppose that their subject matter will necessarily submit to this
form of analysis, and they turn, almost inevitably, to statistical, as opposed to
experimental, control. This is not a muddle-headed notion, but it does present
dangers if it is accepted without caution.
A balanced, but not uncritical, view of the utility of statistics can be arrived
at from a consideration of the forces that shaped the discipline and an examination
of its development. Whether or not this is an assertion that anyone, let
alone the author of this book, can justify remains to be seen.

Yet there are Writers, of a Class indeed very different from that of James Bernoulli,
who insinuate as if the Doctrine of Probabilities could have no place in any serious
Enquiry; and that Studies of this kind, trivial and easy as they be, rather disqualify
a man for reasoning on any other subject. Let the Reader chuse. (De Moivre,
1756/1967, p. 254)

2
Science, Psychology, and Statistics

DETERMINISM

It is a popular notion that if psychology is to be considered a science, then it
most certainly is not an exact science. The propositions of psychology are
considered to be inexact because no psychologist on earth would venture a
statement such as this: "All stable extraverts will, when asked, volunteer to
participate in psychological experiments."¹ The propositions of the natural
sciences are considered to be exact because all physicists would be prepared to
attest (with some few cautionary qualifications) that "fire burns" or, more
pretentiously, that "E = mc²." In short, it is felt that the order in the universe,
which nearly everyone (though for different reasons) is sure must be there, has
been more obviously demonstrated by the natural rather than the social scientists.
Order in the universe implies determinism, a most useful and a most vexing
term, for it brings those who wonder about such things into contact with the
philosophical underpinnings of the rather everyday concept of causality. No one
has stated the situation more clearly than Laplace in his Essai:

Present events are connected with preceding ones by a tie based upon the evident
principle that a thing cannot occur without a cause which produces it. This axiom,
known by the name of the principle of sufficient reason, extends even to actions
which are considered indifferent; the freest will is unable without a determinative
motive to give them birth; ...
We ought then to regard the present state of the universe as the effect of its anterior
state and as the cause of the one that is to follow. Given for one instant an intelligence
which could comprehend all the forces by which nature is animated and the
respective situations of the beings who compose it - an intelligence sufficiently vast
to submit these data to analysis - it would embrace in the same formula the
movements of the greatest bodies of the universe and those of the lightest atom; for
it nothing would be uncertain and the future, as the past, would be present to its eyes.
(Laplace, 1820/1951, pp. 3-4)

¹ Not wishing to make pronouncements on the probabilistic nature of the work of others, the writer
is making an oblique reference to work in which he and a colleague (Cowles & Davis, 1987) found that
there is an 80% chance that stable extraverts will volunteer to participate in psychological research.

The assumption of determinism is simply the apparently reasonable notion
that events are caused. Science discovers regularities in nature, formulates
descriptions of these regularities, and provides explanations, that is to say,
discovers causes. Knowledge of the past enables the future to be predicted.
This is the popular view of science, and it is also a sort of working rule of thumb
for those engaged in the scientific enterprise. Determinism is the central feature
of the development of modern science up to the first quarter of the 20th century.
The successes of the natural sciences, particularly the success of Newtonian
mechanics, urged and influenced some of the giants of psychology, particularly
in North America, to adopt a similar mechanistic approach to the study of
behavior. The rise of behaviorism promoted the view that psychology could
be a science, "like other sciences." Formulae could be devised that would allow
behavior to be predicted, and a technology could be achieved that would enable
environmental conditions to be so manipulated that behavior could be controlled.
The variability in living things was to be brought under experimental
control, a program that leads quite naturally to the notion of the stimulus control
of behavior. It follows that concepts such as will or choice or freedom of action
could be rejected by behavioral science.
In 1913, John B. Watson (1878-1958) published a paper that became the
behaviorists' manifesto. It begins, "Psychology as the behaviorist views it is a
purely objective experimental branch of natural science. Its theoretical goal is
the prediction and control of behavior" (Watson, 1913, p. 158).
Oddly enough, this pronouncement coincided with work that began to
question the assumption of determinism in physics, the undoubted leader of the
natural sciences. In 1913, a laboratory experiment in Cambridge, England,
provided spectroscopic proof of what is known as the Rutherford-Bohr model
of the atom. Ernest Rutherford (later Lord Rutherford, 1871-1937) had proposed
that the atom was like a miniature solar system with electrons orbiting a
central nucleus. Niels Bohr (1885-1962), a Danish physicist, explained that the
electrons moved from one orbit to another, emitting or absorbing energy as they
moved toward or away from the nucleus. The jumping of an electron from orbit
to orbit appeared to be unpredictable. The totality of exchanges could only be
predicted in a statistical, probabilistic fashion. That giant of modern physicists,
Albert Einstein (1879-1955), whose work had helped to start the revolution in
physics, was loath to abandon the concept of a completely causal universe and
indeed never did entirely abandon it. In the 1920s, Einstein made the statement
that has often been paraphrased as, "God does not play dice with the world."
Nevertheless, he recognized the problem. In a lecture given in 1928, Einstein
said:

Today faith in unbroken causality is threatened precisely by those whose path it had
illumined as their chief and unrestricted leader at the front, namely by the representatives
of physics ... All natural laws are therefore claimed to be, "in principle,"
of a statistical variety and our imperfect observation practices alone have cheated us
into a belief in strict causality. (quoted by Clark, 1971, pp. 347-348)

But Einstein never really accepted this proposition, believing to the end that
indeterminacy was to be equated with ignorance. Einstein may be right in
subscribing ultimately to the inflexibility of Laplace's all-seeing demon, but
another approach to indeterminacy was advanced by Werner Heisenberg
(1901-1976), a German physicist who, in 1927, formulated his famous uncertainty
principle. He examined not merely the practical limits of measurement
but the theoretical limits, and showed that the act of observation of the position
and velocity of a subatomic particle interfered with it so as to inevitably produce
errors in the measurement of one or the other. This assertion has been taken to
mean that, ultimately, the forces in our universe are random and therefore
indeterminate. Bertrand Russell (1931) disagrees:

Space and time were invented by the Greeks, and served their purpose admirably
until the present century. Einstein replaced them by a kind of centaur which he called
"space-time," and this did well enough for a couple of decades, but modern quantum
mechanics has made it evident that a more fundamental reconstruction is necessary.
The Principle of Indeterminacy is merely an illustration of this necessity, not of the
failure of physical laws to determine the course of nature. (pp. 108-109)

The important point to be aware of is that Heisenberg's principle refers to
the observer and the act of observation and not directly to the phenomena that
are being observed. This implies that the phenomena have an existence outside
their observation and description, a contention that, by itself, occupies philosophers.
Nowhere is the demonstration that technique and method shape the way
in which we conceptualize phenomena more apparent than in the physics of
light. The progress of events in physics that led to the view that light was both
wave and particle, a view that Einstein's work had promoted, began to dismay
him when it was used to suggest that physics would have to abandon strict
continuity and causality. Bohr responded to Einstein's dismay:

You, the man who introduced the idea of light as particles! If you are so concerned
with the situation in physics in which the nature of light allows for a dual interpretation,
then ask the German government to ban the use of photoelectric cells if you
think that light is waves, or the use of diffraction gratings if light is corpuscular.
(quoted by Clark, 1971, p. 253)

The parallels in experimental psychology are obvious. That eminent historian
of the discipline, Edwin Boring, describes a colloquium at Harvard when
his colleague, William McDougall, who "believed in freedom for the human
mind - in at least a little residue of freedom - believed in it and hoped for as
much as he could save from the inroads of scientific determinism," and he, a
determinist, achieved, for Boring, an understanding:

McDougall's freedom was my variance. McDougall hoped that variance would
always be found in specifying the laws of behavior, for there freedom might still
persist. I hoped then - less wise than I think I am now (it was 31 years ago) - that
science would keep pressing variance toward zero as a limit. At any rate this general
fact emerges from this example: freedom, when you believe it is operating, always
resides in an area of ignorance. If there is a known law, you do not have freedom.
(Boring, 1957, p. 190)

Boring was really unshakable in his belief in determinism, and that most
influential of psychologists, B. F. Skinner, agrees with the necessity of assuming
order in nature:

It is a working assumption which must be adopted at the very start. We cannot apply
the methods of science to a subject matter which is assumed to move about
capriciously. Science not only describes, it predicts. It deals not only with the past
but with the future ... If we are to use the methods of science in the field of human
affairs, we must assume that behavior is lawful and determined. (Skinner, 1953,
p. 6)

Carl Rogers is among those who have adopted as a fundamental position the
view that individuals are responsible, free, and spontaneous. Rogers believes
that "the individual chooses to fulfill himself by playing a responsible and
voluntary part in bringing about the destined events of the world" (Rogers, 1962,
quoted by Walker, 1970, p. 13).
The use of the word destined in this assertion somewhat spoils the impact of
what most take to be the individualistic and humanistic approach that is espoused
by Rogers. Indeed, he has maintained that the concepts of scientific determinism
and personal choice can peacefully coexist in the way in which the particle
and wave theories of light coexist. The theories are true but incompatible.


These few words do little more than suggest the problems that are faced by
the philosopher of science when he or she tackles the concept of method in both
the natural and the social sciences. What basic assumptions can we make? So
often the arguments presented by humanistic psychologists have strong moral
or even theological undertones, whereas those offering the determinist's view
point to the regularities that exist in nature - even human nature - and aver that
without such regularities, behavior would be unpredictable. Clearly, beings
whose behavior was completely unpredictable would have been unable to
achieve the degree of technological and social cooperation that marks the human
species. Indeed, individuals of all philosophical persuasions tend to agree that
someone whose behavior is generally not predictable needs some sort of
treatment. On the other hand, the notion of moral responsibility implies freedom.
If I am to be praised for my good works and blamed for my sins, a statement
that "nature is merely unfolding as it should" is unlikely to be accepted as a
defense for the latter, and unlikely to be advanced by me as a reason for the
former. One way out of the impasse, it is suggested, is to reject strict "100
percent" determinism and to accept statistical determinism. "Freedom" then
becomes part of the error term in statistical manipulations. Grünbaum (1952)
considers the arguments against both strict determinism and statistical determinism,
arguments based on the complexity of human behavior, the concept of
moral choice, and the assignment of responsibility, assertions that individuals
are unique and that therefore their actions are not generalizable in the scientific
sense, and that human beings via their goal-seeking behavior themselves
determine the future. He concludes, "Since the important arguments against
determinism which we have considered are without foundation, the psychologist
need not be deterred in his quest and can confidently use the causal hypothesis
as a principle, undaunted by the caveat of the philosophical indeterminist"
(Grünbaum, 1952, p. 676).
Feigl (1959) insists that freedom must not be confused with the absence of
causality, and causal determination must not be confused with coercion or
compulsion or constraint. "To be free means that the chooser or agent is an
essential link in the chain of causal events and that no extraneous compulsion
- be it physical, biological, or psychological - forces him to act in a direction
incompatible with his basic desires or intentions" (Feigl, 1959, p. 116).
To some extent, and many modern thinkers would say to a large extent (and
Feigl agrees with this), philosophical perplexities can be clarified, if not entirely
resolved, by examining the meaning of the terms employed in the debate rather
than arguing about reality.
Two further points might be made. The first is that variability and uncertainty
in observations in the natural as well as the social sciences require a statistical
approach in order to reveal broad regularities, and this applies to experimentation
and observation now, whatever philosophical stance is adopted. If the very
idea of regularity is rejected, then systematic approaches to the study of the
human condition are irrelevant. The second is that, at the end of the discussion,
most will smile with Dr Samuel Johnson when he said, "I know that I have free
will and there's an end on it."

PROBABILISTIC AND DETERMINISTIC MODELS

The natural sciences set great store by mathematical models. For example, in
the physics of mass and motion, potential energy is given by PE = mgh, where
m is mass, g is acceleration due to gravity, and h is height. Newton's second
law of motion states that F = ma, where F is force, m is mass, and a is
acceleration.
These mathematical functions or models may be termed deterministic models
because, given the values on the right-hand side of the equation, the construct
on the left-hand side is completely determined. Any variability that might be
observed in, for example, force for a measured mass and a given acceleration
is due only to measurement error. Increasing the precision of measurement using
accurate instrumentation and/or superior technique will reduce the error in F to
very small margins indeed.²
² For those of you who have been exposed to physics, it must be admitted that although
Newton's laws were more or less unchallenged for 200 years, the models he gave us are not definitions
but assumptions within the Newtonian system, as the great Austrian physicist, Ernst Mach, pointed
out. Einstein's theory of relativity showed that they are not universally true. Nevertheless, the
distinction between the models of the natural and the social and biological sciences is, for the
moment, a useful one.

Some psychologists, notably Clark Hull in learning theory and Raymond
B. Cattell in personality theory and measurement, approached their work with
the aim (one might say the dream!) of producing parallel models for psychological
constructs. Econometricians similarly search for models that will describe
the market with precision and reliability. For social and biological
scientists there is a persistent and enduring challenge that makes their disciplines
both fascinating and frustrating. That challenge is the search for means to assess
and understand the variability that is inherent in living systems and societies.
In univariate statistical analysis we are encompassing those procedures
where there is just one measured or dependent variable (the variable that appears
on the left-hand side of the equals sign in the function shown next) and one or
more independent or predictor variables: those that are seen on the right-hand
side of the equation:

y = β0 + β1x1 + β2x2 + β3x3 + ... + βnxn + e

The function is known as the general linear model.
The independent variables are those that are chosen and/or manipulated by
the investigator. They are assumed to be related to or to have an effect on the
dependent variable. In this model the independent variables, x1, x2, x3, ..., xn, are
assumed to be measured without error; β0, β1, β2, β3, ..., βn are unknown parameters.
β0 is a constant equivalent to the intercept in the simple linear model, and
the others are the weights of the independent variables affecting, or being used
to estimate, a given observation y. Finally, e is the random error associated with
the particular combination of circumstances: those chosen variables, treatments,
individual difference factors, and so on. But these models are probabilistic
models. They are based on samples of observations from perhaps a
variety of populations and they may be tested under the hypothesis of chance.
Even when they pass the test of significance, like other statistical outcomes they
are not necessarily completely reliable. The dependent variable can, at best, be
only partly determined by such models, and other samples from other populations,
perhaps using differently defined independent variables, may, and sometimes
do, give us different conclusions.
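A small simulation makes the contrast with the deterministic case concrete. The sketch below is an illustration only (the data are invented, and the numpy library is assumed): observations are generated from a known two-predictor linear model with a random error term, and least-squares estimates of the β weights are then recovered. Another simulated sample would give somewhat different estimates, which is just the probabilistic character described above.

# A minimal sketch of the general linear model with invented data.
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two independent variables, assumed to be measured without error.
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)

# y = b0 + b1*x1 + b2*x2 + e, where e is random error.
b0, b1, b2 = 2.0, 0.5, -1.5
y = b0 + b1 * x1 + b2 * x2 + rng.normal(0, 1, n)

# Least-squares estimates of the unknown parameters.
X = np.column_stack([np.ones(n), x1, x2])
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
print(betas)  # close to, but never exactly, (2.0, 0.5, -1.5)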

SCIENCE AND INDUCTION

It is common to trace the Western intellectual tradition to two fountainheads,
Greek philosophy, particularly Aristotelian philosophy, and Judaic/Christian
theology. Science began with the Greeks in the sense that they set out its
commonly accepted ground rules. Science proceeds systematically. It gives us
a knowledge of nature that is public and demonstrable and, most importantly,
open to correction. Science provides explanations that are rational and, in
principle, testable, rather than mystical or symbolic or theological. The cold
rationality that this implies has been tempered by the Judaic/Christian notion of
the compassionate human being as a creature constructed in the image of God,
and by the belief that the universe is the creation of God and as such deserves
the attention of the beings that inhabit it. It is these streams of thought that give
us the debate about intellectual values and scientific responsibility and sustain
the view that science cannot be metaphysically neutral nor value free.
But it is not true to say that these traditions have continuously guided Western
thought. Christianity took some time to become established. Greek philosophy
and science disappeared under the pragmatic technologists of Rome. Judaism,
weakened by the loss of its home and the persecution of its adherents, took
refuge in the refinement and interpretation of its ancient doctrines. When
Christianity did become the intellectual shelter for the thinkers of Europe, it
embraced the view that God revealed what He wished to reveal and that nature
everywhere was symbolic of a general moral truth known only to Him. These
ideas, formalized by the first great Christian philosopher, St. Augustine
(354-430), persisted up to, and beyond, the Renaissance, and simplistic versions
may be seen in the expostulations of fundamentalist preachers today. For the
best part of 1,000 years scientific thinking was not an important part of
intellectual advance.
Aristotle's rationalism was revived by "the schoolmen," of whom the greatest
was Thomas Aquinas (1225?-1274), but the influence of science and the
shaping of the modern intellectual world began in the 17th century.

The modern world, so far as mental outlook is concerned, begins in the seventeenth
century. No Italian of the Renaissance would have been unintelligible to Plato or
Aristotle; Luther would have horrified Thomas Aquinas, but would not have been
difficult for him to understand. With the seventeenth century it is different: Plato
and Aristotle, Aquinas and Occam, could not have made head or tail of Newton.
(Russell, 1946, p. 512)

This is not to deny the work of earlier scholars. Leonardo da Vinci
(1452-1519) vigorously propounded the importance of experience and observation,
and enthusiastically wrote of causality and the "certainty" of mathematics.
Francis Bacon (1561-1626) is a particular example of one who expressed the
importance of system and method in the gaining of new knowledge, and his
contribution, although often underestimated, is of great interest to psychologists.
Hearnshaw (1987) gives an account of Bacon that shows how his ideas can be
seen in the foundation and progress of experimental and inductive psychology.
He notes that "Bacon himself made few detailed contributions to general
psychology as such, he saw more clearly than anyone of his time the need for,
and the potentialities of, a psychology founded on empirical data, and capable
of being applied to 'the relief of man's estate'" (Hearnshaw, 1987, p. 55).
A turning point for modern science arrives with the work of Copernicus
(1473-1543). His account of the heliocentric theory of our planetary system
was published in the year of his death and had little impact until the 17th century.
The Copernican theory involved no new facts, nor did it contribute to mathematical
simplicity. As Ginzburg (1936) notes, Copernicus reviewed the existing
facts and came up with a simpler physical hypothesis than that of Ptolemaic
theory, which stated that the earth was the center of the universe:

The fact that PTOLEMY and his successors were led to make an affirmation in
violence to the facts as then known shows that their acceptance of the belief in the
immobility of the earth at the centre of the universe was not the result of incomplete
knowledge but rather the result of a positive prejudice emanating from non-scientific
considerations. (Ginzburg, 1936, p. 308)

This statement is one that scientists, when they are wearing their scientists'
hats, would support and elaborate upon, but an examination of the state of the
psychological sciences could not fully sustain it. Nowhere is our knowledge
more incomplete than in the study of the human condition, and nowhere are our
interpretations more open to prejudice and ideology. The study of differences
between the sexes, the nature-nurture issue in the examination of human
personality, intelligence, and aptitude, the sociobiological debate on the interpretation
of the way in which societies are organized, all are marked by
undertones of ethics and ideology that the scientific purist would see as outside
the notion of an autonomous science. Nor is this a complete list.
Science is not so much about facts as about interpretations of observation,
and interpretations as well as observations are guided and molded by preconceptions.
Ask someone to "observe" and he or she will ask what it is that is to
be observed. To suggest that the manipulation and analysis of numerical data,
in the sense of the actual methods employed, can also be guided by preconceptions
seems very odd. Nevertheless, the development of statistics was heavily
influenced by the ideological stance of its developers. The strength of statistical
analysis is that its latter-day users do not have to subscribe to the particular views
of the pioneers in order to appreciate its utility and apply it successfully.
A common view is that science proceeds through the process of induction.
Put simply, this is the view that the future will resemble the past. The occurrence
of an event A will lead us to expect an event B if past experience has shown B
always following A. The general principle that B follows A is quickly accepted.
Reasoning from particular cases to general principles is seen as the very
foundation of science.
The concepts of causality and inference come together in the process of
induction. The great Scottish philosopher David Hume (1711-1776) threw
down a challenge that still occupies the attention of philosophers:

As to past experience, it can be allowed to give direct and certain information of
those precise objects only, and that precise period of time, which fell under its
cognizance: but why this experience should be extended to future times, and to other
objects, which for aught we know, may be only in appearance similar; this is the main
question on which I would insist.
These two propositions are far from being the same, I have found that such an
object has always been attended with such an effect, and I foresee that other objects,
which are, in appearance, similar, will be attended with similar effects. I shall allow,
if you please, that the one proposition may justly be inferred from the other: I know,
in fact, that it always is inferred. But if you insist that the inference is made by a
chain of reasoning, I desire you to produce that reasoning. (Hume, 1748/1951, pp.
33-34)

The arguments against Hume's assertion that it is merely the frequent
conjunction or sequencing of two events that leads us to a belief that one causes
the other have been presented in many forms, and this account cannot examine
them all. The most obvious counter is that Hume's own assertion invokes
causality. Contiguity in time and space causes us to assume causality. Popper
(1962) notes that the idea of repetition based on similarity as the basis of a belief
in causality presents difficulties. Situations are never exactly the same. Similar
situations are interpreted as repetitions from a particular point of view - and
that point of view is a system of "expectations, anticipations, assumptions, or
interests" (Popper, 1962, p. 45).
In psychological matters there is the additional factor of volition. I wish to
pick up my pen and write. A chain of nervous and muscular and cognitive
processes ensues and I do write. The fact that human beings can and do control
their future actions leads to a situation where a general denial of causality flies
in the face of common sense. Such a denial invites mockery.
Hume's argument that experience does not justify prediction is more difficult
to counter. The course of natural events is not wholly predictable, and the
history of the scientific enterprise is littered with the ruins of theories and
explanations that subsequent experience showed to be wanting. Hume's skepticism,
if it was accepted, would lead to a situation where nothing could be
learned from experience and observation. The history of human affairs would
refute this, but the argument undoubtedly leads to a cautious approach. Predicting
the future becomes a probabilistic exercise and science is no longer able to
claim to be the way to certainty and truth.
Using the probability calculus as an aid to prediction is one thing; using it
to assess the value of a particular theory is another. Popper (1962) regards
statements about theories having a high degree of probability as misconceptions.
Theories can be invoked to explain various phenomena and good theories
are those that stand up to severe test. But, Popper argues, corroboration cannot
be equated with mathematical probability:

All theories, including the best, have the same probability, namely zero.
That an appeal to probability is incapable of solving the riddle of experience is a
conclusion first reached long ago by David Hume ...
Experience does not consist in the mechanical accumulation of observations.
Experience is creative. It is the result of free, bold and creative interpretations,
controlled by severe criticism and severe tests. (Popper, 1962, pp. 192-193)

INFERENCE

Induction is, and will continue to be, a large problem for philosophical discussion.
Inference can be narrowed down. Although the terms are sometimes used
in the same sense and with the same meaning, it is useful to reserve the term
inference for the making of explicit statements about the properties of a wider
universe that are based on a much narrower set of observations. Statistical
inference is precisely that, and the discussion just presented leads to the
argument that all inference is probabilistic and therefore all inferential statements
are statistical.
Statistical inference is a way of reasoning that presents itself as a mathematical
solution to the problem of induction. The search for rules of inference from
the time of Bernoulli and Bayes to that of Neyman and Pearson has provided
the spur for the development of mathematical probability theory. It has been
argued that the establishing of a set of recipes for data manipulation has led to
a situation where researchers in the social sciences "allow statistics to do the
thinking for them." It has been further argued that psychological questions that
do not lend themselves to the collection and manipulation of quantitative data
are neglected or ignored. These criticisms are not to be taken lightly, but they
can be answered. In the first place, statistical inference is only a part of formal
psychological investigation. An equally important component is experimental
design. It is the lasting contribution of Ronald Fisher, a mathematical statistician
and a champion of the practical researcher, that showed us how the formulation
of intelligent questions in systematic frameworks would produce data that, with
the help of statistics, could provide intelligent answers. In the second place, the
social sciences have repeatedly come up with techniques that have enabled
qualitative data to be quantified.
In experimental psychology two broad strategies have been adopted for
coping with variability. The experimental analytic approach sets out boldly to
contain or to standardize as many of the sources of variability as possible. In
the micro-universe of the Skinner box, shaping and observing the rat's behavior
depend on a knowledge of the antecedent and present conditions under which
a particular piece of behavior may be observed. The second approach is that of
statistical inference. Experimental psychologists control (in the sense of standardize
or equalize) those variables that they can control, measure what they wish
to measure with a degree of precision, assume that noncontrolled factors operate
randomly, and hope that statistical methods will tease out the "effects" from the
"error." Whatever the strategy, experimentalists will agree that the knowledge
they obtain is approximate. It has also been generally assumed that this
approximate science is an interim science. Probability is part of scientific
method but not part of knowledge. Some writers have rejected this view.
Reichenbach (1938), for example, sought to devise a formal probability logic
in which judgments of the truth or falsity of propositions are replaced by the
notion of weight. Probability belongs to a class of events. Weight refers to a
single event, and a single event can belong to many classes:

Suppose a man forty years old has tuberculosis; ... Shall we consider ... the
frequency of death within the class of men forty years old, or within the class of
tubercular people? ...
We take the narrowest class for which we have reliable statistics ... we should
take the class of tubercular men of forty ... the narrower the class the better the
determination of weight ... a cautious physician will even place the man in question
within a narrower class by making an X-ray; he will then use as the weight of the
case, the probability of death belonging to a condition of the kind observed on the
film. (Reichenbach, 1938, pp. 316-317)
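Reichenbach's prescription - take the narrowest class for which reliable statistics exist - is easy to express computationally. The sketch below is a toy illustration with invented records, not a piece of Reichenbach's own apparatus: the weight of the proposition "this man will die" is simply the relative frequency of that outcome within successively narrower reference classes.

# Invented records: (is forty years old, is tubercular, died).
records = [
    (True, True, True), (True, True, False), (True, False, False),
    (False, True, True), (True, True, True), (False, False, False),
]

def weight(records, in_class):
    """Relative frequency of death within the chosen reference class."""
    outcomes = [died for (forty, tb, died) in records if in_class(forty, tb)]
    return sum(outcomes) / len(outcomes)

# The class of men of forty, then the narrower class of tubercular men of forty.
print(weight(records, lambda forty, tb: forty))         # 0.5
print(weight(records, lambda forty, tb: forty and tb))  # about 0.67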

This is a frequentist view of probability, and it is the view that is implicit in
statistical inference. Reichenbach's thesis should have an appeal for experimental
psychologists, although it is not widely known. It reflects, in formal terms,
the way in which psychological knowledge is reported in the journals and
textbooks, although whether or not the writers and researchers recognize this
may be debated. The weight of a given proposition is relative to the state of our
knowledge, and statements about particular individuals and particular behaviors
are prone to error. It is not that we are totally ignorant, but that many of our
classes are too broad to allow for substantial weight to be placed on the evidence.
Popper (1959) takes issue with Reichenbach's attempts to extend the relative
frequency view of probability to include inductive probability. Popper, with
Hume, maintains that a theory of induction is impossible:

We shall have to get accustomed to the idea that we must not look upon science as
a "body of knowledge", but rather as a system of hypotheses; that is to say, as a
system of guesses or anticipations ... of which we are never justified in saying that
we know that they are "true" or "more or less certain" or even "probable". (Popper,
1959, p. 317)

Now Popper admits only that a system is scientific when it can be tested by
experience. Scientific statements are tested by attempts to refute or falsify them.
Theories that withstand severe tests are corroborated by the tests, but they are
not proven, nor are they even made more probable. It is difficult to gainsay
Popper's logic and Hume's skepticism. They are food for philosophical thought,
but scientists who perhaps occasionally worry about such things will put them
aside, if only because working scientists are practical people. The principle of
induction is the principle of science, and the fact that Popper and Hume can
shout from the philosophical sidelines that the "official" rules of the game are
irrational and that the "real" rules of the game are not fully appreciated will not
stop the game from being played.
Statistical inference may be defined as the use of methods based on the rules
of chance to draw conclusions from quantitative data. It may be directly
compared with exercises where numbered tickets are drawn from a bag of tickets
with a view to making statements about the composition of the bag, or where a
die or a coin is tossed with a view to making statements about its fairness.
Suppose a bag contains tickets numbered 1, 2, 3, 4, and 5. Each numeral appears
on the same, very large, number of tickets. Now suppose that 25 tickets are
drawn at random and with replacement from the bag and the sum of the numbers
is calculated. The obtained sum could be as low as 25 and as high as 125, but
the expected value of the sum will be 75, because each of the numerals should
occur on one fifth of the draws or thereabouts. The sum should be 5(1 + 2 + 3
+ 4 + 5) = 75. In practice, a given draw will have a sum that departs from this
value by an amount above or below it that can be described as chance error.
The likely size of this error is given by a statistic called the standard error, which
for the sum of n draws is readily computed from the formula σ√n, where σ is the
standard deviation of the numbers in the bag. Leaving aside, for the moment,
the problem of estimating σ when, as is usual, the contents of the bag are
unknown, all classical statistical inferential procedures stem from this sort of
exercise. The real variability in the bag is given by the standard deviation, and
the chance variability in the sums of the numbers drawn is given by the standard
error.
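The exercise is easily simulated. The standard deviation of the numbers 1 through 5 is √2, or about 1.41, so the sums of 25 draws should scatter about the expected value of 75 with a standard error of √25 × 1.41, or about 7.07. The sketch below (an illustration only) checks this:

# Draw 25 tickets, with replacement, from a bag containing the
# numerals 1 to 5 in equal proportions; repeat many times and
# examine the chance variability in the sums.
import random
import statistics

def draw_sum(n=25):
    return sum(random.choice([1, 2, 3, 4, 5]) for _ in range(n))

sums = [draw_sum() for _ in range(10_000)]
print(statistics.mean(sums))   # close to the expected value, 75
print(statistics.stdev(sums))  # close to the standard error, about 7.07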

STATISTICS IN PSYCHOLOGY

The use of quantitative methods in the study of mental processes begins with
Gustav Fechner (1801-1887), who set himself the problem of examining the
relationship between stimulus and sensation. In 1860 he published Elemente der
Psychophysik, in which he describes his invention of a psychophysical law that
describes the relationship between mind and body. He developed methods of
measuring sensation based on mathematical and statistical considerations, methods
that have their echoes in present-day experimental psychology. Fechner
made use of the normal law in his development of the method of constant stimuli,
applying it in the Gaussian sense as a way of dealing with error and uncontrolled
variation.
Fechner's basic assumptions and the conclusions he drew from his experimental
investigations have been shown to be faulty. Stevens' work in the 1950s
and the later developments of signal detection theory have overtaken the work
of the 19th-century psychophysicists, but the revolutionary nature of Fechner's
methods profoundly influenced experimental psychology. Boring (1950)
devotes a whole chapter of his book to the work of Fechner.


Investigations of mental inheritance and mental testing began at about the
same time with Galton, who took the normal law of error from Quetelet and
made it the centerpiece of his research. The error distribution of physics became
a description of the distribution of values about a value that was "most probable."
Galton laid the foundations of the method of correlation that was refined by Karl
Pearson, work that is examined in more detail later in this volume. At the turn
of the century, Charles Spearman (1863-1945) used the method to define
mental abilities as factors. When two apparently different abilities are shown to
be correlated, Spearman took this as evidence for the existence of a general
factor G, a factor of general intelligence, and factors that were specific to the
different abilities. Correlational methods in psychology were dominant for
almost the whole of the first half of this century and the techniques of factor
analysis were honed during this period. Chapter 11 provides a review of its
development, but it is worth noting here that Dodd (1928) reviewed the
considerable literature that had accumulated over the 23 years since Spearman's
original work, and Wolfle (1940) pushed this labor further. Wolfle quotes Louis
Thurstone on what he takes to be the most important use of factor analysis:

Factor analysis is useful especially in those domains where basic and fruitful concepts
are essentially lacking and where crucial experiments have been difficult to conceive.
... They enable us to make only the crudest first map of a new domain. But if we
have scientific intuition and sufficient ingenuity, the rough factorial map of a new
domain will enable us to proceed beyond the factorial stage to the more direct forms
of psychological experimentation in the laboratory. (Thurstone, 1940, pp. 189-190)

The interesting point about this statement is that it clearly sees factor analysis
as a method of data exploration rather than an experimental method. As Lovie
(1983) points out, Spearman's approach was that of an experimenter using
correlational techniques to confirm his hypothesis, but from 1940 on, that view
of the methods of factor analysis has not prevailed.
Of course, the beginnings of general descriptive techniques crept into the
psychological literature over the same period. Means and probable errors are
commonly reported, and correlation coefficients are also accompanied by
estimates of their probable error. And it was around 1940 that psychologists
started to become aware of the work of R. A. Fisher and to adopt analysis of
variance as the tool of experimental work. It can be argued that the progression
of events that led to Fisherian statistics also led to a division in empirical
psychology, a split between correlational and experimental psychology.
Cronbach (1957) chose to discuss the "two disciplines" in his APA presidential
address. He notes that in the beginning:

All experimental procedures were tests, all tests were experiments ... the statistical
comparison of treatments appeared only around 1900 ... Inference replaced
estimation: the mean and its probable error gave way to the critical ratio. The
standardized conditions and the standardized instruments remained, but the focus
shifted to the single manipulated variable, and later, following Fisher, to multivariate
manipulation. (Cronbach, 1957, p. 674)

Although there have been signs that the two disciplines can work together,
the basic situation has not changed much over 30 years. Individual differences
are error variance to the experimenter; it is the between-groups or treatment
variance that is of interest. Differential psychologists look for variations and
relationships among variables within treatment conditions. Indeed, variation in
the situation here leads to error.
It may be fairly claimed that these fundamental differences in approach have
had the most profound effect on psychology. And it may be further claimed that
the sophistication and success of the methods of analysis that are used by the
two camps have helped to formalize the divisions. Correlation and ANOVA
have led to multiple regression analysis and MANOVA, and yet the methods
are based on the same model - the general linear model. Unfortunately,
statistical consumers are frequently unaware of the fundamentals, frightened
away by the mathematics, or, bored and frustrated by the arguments on the
rationale of the probability calculus, they avoid investigation of the general
structure of the methods. When these problems have been overcome, the face
of psychology may change.
3
Measurement

IN RESPECT OF MEASUREMENT

In the late 19th century the eminent scientist William Thomson, Lord Kelvin
(1824-1907), remarked:

I often say that when you can measure what you are speaking about, and express it
in numbers, you know something about it; but when you cannot measure it, when
you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory
kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts,
advanced to the stage of science whatever the matter might be. (William Thomson,
Lord Kelvin, 1891, p. 80)

This expression of the paramount importance of measurement is part of our
scientific tradition. Many versions of the same sentiment, for example, that of
Galton, noted in chapter 1, and that of S. S. Stevens (1906-1973), whose work
is discussed later in this chapter, are frequently noted with approval. Clearly,
measurement bestows scientific respectability, a state of affairs that does scant
justice to the work of people like Harvey, Darwin, Pasteur, Freud, and James,
who, it will be noted, if they are to be labeled "scientists," are biological or
behavioral scientists. The nature of the data and the complexity of the systems
studied by these men are quite different in quality from the relatively simple
systems that were the domain of the physical scientist. This is not, of course, to
deny the difficulty of the conceptual and experimental questions of modern
physics, but the fact remains that, in this field, problems can often be dealt with
in controlled isolation.
It is perhaps comforting to observe that, in the early years, there was a
skepticism about the introduction of mathematics into the social sciences.
Kendall (1968) quotes a writer in the Saturday Review of November 11, 1871,
who stated:

If we say that G represents the confidence of Liberals in Mr Gladstone and D the
confidence of Conservatives in Mr Disraeli and x, y the number of those parties; and
infer that Mr Gladstone's tenure of office depends upon some equation involving
dG/dx, dD/dy, we have merely wrapped up a plain statement in a mysterious
collection of letters. (Kendall, 1968, p. 271)

And for George Udny Yule, the most level-headed of the early statisticians:

Measurement does not necessarily mean progress. Failing the possibility of measuring
that which you desire, the lust for measurement may, for example, merely result
in your measuring something else - and perhaps forgetting the difference - or in your
ignoring some things because they cannot be measured. (Yule, 1921, pp. 106-107)

To equate science with measurement is a mistake. Science is about systematic
and controlled observations and the attempt to verify or falsify those
observations. And if the prescription of science demanded that observations
must be quantifiable, then the natural as well as the social sciences would be
severely retarded. The doubts about the absolute utility of quantitative description
expressed so long ago could well be pondered on by today's practitioners
of experimental psychology. Nevertheless, the fact of the matter is that the early
years of the young discipline of psychology show, with some notable exceptions,
a longing for quantification and, thereby, acceptance. In 1885 Joseph
Jacobs reviewed Ebbinghaus's famous work on memory, Ueber das Gedächtnis.
He notes, "If science be measurement it must be confessed that psychology
is in a bad way" (Jacobs, 1885, p. 454).
Jacobs praises Ebbinghaus's painstaking investigations and his careful reporting
of his measurements:

May we hope to see the day when school registers will record that such and such a
lad possesses 36 British Association units of memory power or when we shall be
able to calculate how long a mind of 17 "macaulays" will take to learn Book ii of
Paradise Lost? If this be visionary, we may at least hope for much of interest and
practical utility in the comparison of the varying powers of different minds which
can now at last be laid down to scale. (Jacobs, 1885, p. 459)

The enthusiasm of the mental measurers of the first half of the 20th century
reflects the same dream, and even today, the smile that Jacobs' words might
bring to the faces of hardened test constructors and users contains a little of the
old yearning. The urge to quantify our observations and to impose sophisticated
statistical manipulations on them is a very powerful one in the social sciences.

It is of critical importance to remember that sloppy and shoddy measurement
cannot be forgiven or forgotten by presenting dazzling tables of figures, clean
and finely drawn graphs, or by statistical legerdemain.
Yule (1921), reviewing Brown and Thomson's book on mental measurement,
comments on the problem in remarks that are not untypical of the misgivings
expressed, on occasion, by statisticians and mathematicians when they see their
methods in action:

Measurement. O dear! Isn't it almost an insult to the word to term some of these
numerical data measurements? They are of the nature of estimates, most of them,
and outrageously bad estimates often at that.
And it should always be the aim of the experimenter not to revel in statistical
methods (when he does revel and not swear) but steadily to diminish, by continual
improvement of his experimental methods, the necessity for their use and the
influence they have on his conclusions. (Yule, 1921, pp. 105-106)

The general tenor of this criticism is still valid, but the combination of
experimental design and statistical method introduced a little later by Sir Ronald
Fisher provided the hope, if not the complete reality, of statistical as opposed to
strict experimental control. The modern statistical approach more readily recognizes
the intrinsic variability in living matter and its associated systems.
Furthermore, Yule's remarks were made before it became clear that all acts of
observation contain irreducible uncertainty (as noted in chap. 2).
Nearly 40 years after Yule's review, Kendall (1959) gently reminded us of
the importance of precision in observation and that statistical procedures cannot
replace it. In an acute and amusing parody, he tells the story of Hiawatha. It is
a tragic tale. Hiawatha, a "mighty hunter," was an abysmal marksman, although
he did have the advantage of having majored in applied statistics. Partly relying
on his comrades' ignorance of the subject, he attempted to show that his patently
awful performance in a shooting contest was not significantly different from
that of his fellows. Still, they took away his bow and arrows:

In a corner of the forest
Dwells alone my Hiawatha
Permanently cogitating
On the normal law of error.
Wondering in idle moments
Whether an increased precision
Might perhaps be rather better
Even at the risk of bias
If thereby one, now and then, could
Register upon the target.
(Kendall, 1959, p. 24)

SOME FUNDAMENTALS

Measurement is the application of mathematics to events. We use numbers to
designate objects and events and the relationships that obtain between them. On
occasion the objects are quite real and the relationships immediately comprehensible:
dining-room tables, for example, and their dimensions, weights,
surface areas, and so on. At other times, we may be dealing with intangibles
such as intelligence, or leadership, or self-esteem. In these cases our measurements
are descriptions of behavior that, we assume, reflects the underlying
construct. But the critical concern is the hope that measurement will provide
us with precise and economical descriptions of events in a manner that is readily
communicated to others. Whatever one's view of mathematics with regard to
its complexities and difficulty, it is generally regarded as a discipline that is
clear, orderly, and rational. The scientist attempts to add clarity, order, and
rationality to the world about us by using measurement.
Measurement has been a fundamental feature of human civilization from its
very beginnings. Division of labor, trade, and barter are aspects of our condition
that separate us from the hunters and gatherers who were our forebears. Trade
and commerce mean that accounting practices are instituted and the "worth" of
a job or an artifact has to be labeled and described. When groups of individuals
agreed that a sheep could fetch three decent-sized spears and a couple of
cooking pots, the species made a quantum leap into a world of measurement.
Counting, making a tally, represents the simplest form of measurement. Simple
though it is, it requires that we have devised an orderly and determinate number
system.
Development of early societies, like the development of children, must have
included the mastery of signs and symbols for differences and sameness and,
particularly, for oneness and twoness. Most primitive languages at least have
words for "one," "two," and "many," and modern languages, including English,
have extra words for one and two (single, sole, lone, couple, pair, and so on).
Trade, commerce, and taxation encouraged the development of more complex
number systems that required more symbols. The simple tally recorded by
a mark on a slate may be made more comprehensible by altering the mark at
convenient groupings, for example, at every five units. This system, still
followed by some primitive tribes, as well as by psychologists when constructing,
by hand, frequency tables from large amounts of data, corresponds with a
readily available and portable counting aid - the fingers of one hand.
It is likely that the familiar decimal system developed because the human
handstogether have10 digits. However, vigesimal systems, based on 20, are
known, and languageonceagain recognizes the utility of 20 with the word score
in Englishand quatre-vingt for 80 in French. Contraryto generalbelief, the
decimal systemis not theeasiestto use arithmetically, and it isunlikely that
decimal schemes will replace all other counting systems. Eggs and cakes will continue to be sold in dozens rather than tens, and the hours about the clock will still be 12. This is because 12 has more integral fractional parts than 10; that is, you can divide 12 by more numbers than you can 10 and get whole numbers rather than fractions (which everyone finds hard). Viewed in this way, the now-abandoned British monetary system of 12 pennies to the shilling and 20 shillings to the pound does not seem so odd or irrational.

Systems of number notation and counting have a base or radix. Base 5 (quinary), 10 (decimal), 12 (duodecimal), and 20 (vigesimal) have been mentioned, but any base is theoretically possible. For many scientific purposes, binary (base 2) is used because this system lies at the heart of the operations of the electronic computer. Its two symbols, 0 and 1, can readily be reproduced in the off and on modes of electrical circuitry. Octal (base 8) and hexadecimal (base 16) will also be familiar to computer users. The base of a number system corresponds to the number of symbols that it needs to express a number, provided that the system is a place system. A decimal number, say 304, means 3 x 100 plus 0 x 10 plus 4 x 1.

The symbol for zero signifies an empty place. The invention of zero, the earliest undoubted occurrence of which is in India over 1,000 years ago but which was independently used by the Mayas of Yucatan, marks an important step forward in mathematical notation and arithmetical operation. The ancient Babylonians, who developed a highly advanced mathematics some 4,000 years ago, had a system with a base of 60 (a base with many integral fractions) that did not have a zero. Their scripts did not distinguish between, say, 125 and 7,205, and which one is meant often has to be inferred from the context. The absence of a zero in Roman numerals may explain why Rome is not remembered for its mathematicians, and the relative sophistication of Greek mathematics leads some historians to believe that zero may have been invented in the Greek world and thence transmitted to India.
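To make the idea of a place system concrete, the following minimal sketch rewrites a decimal integer as a digit string in any base from 2 to 16. The function and digit symbols are illustrative additions of mine, not anything from the historical material.

    # A minimal sketch of place-value notation: rewriting a decimal integer
    # in another base. Each remainder in the loop is one place value.
    DIGITS = "0123456789ABCDEF"

    def to_base(n: int, base: int) -> str:
        """Return the digit string of a non-negative integer n in the given base."""
        if not 2 <= base <= 16:
            raise ValueError("base must be between 2 and 16")
        if n == 0:
            return "0"
        digits = []
        while n > 0:
            n, remainder = divmod(n, base)
            digits.append(DIGITS[remainder])
        return "".join(reversed(digits))

    # 304 in decimal is 3 x 100 + 0 x 10 + 4 x 1; in other bases:
    print(to_base(304, 2))   # 100110000
    print(to_base(304, 12))  # 214  (2 x 144 + 1 x 12 + 4 x 1)
    print(to_base(304, 16))  # 130

The same integer needs only three symbols in base 12 or base 16 but nine in binary, which illustrates why the choice of radix is a matter of convenience rather than necessity.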

Scales of Measurement

Using numbers to count events, to order events, and to express the relationship between events is the essence of measurement. These activities have to be carried out according to some prescribed rule. S. S. Stevens (1951) in his classic piece on mathematics and measurement defines the latter as "the assignment of numerals to objects or events according to rules" (p. 1).

This definition has been criticized on the reasonable grounds that it apparently does not exclude rules that do not help us to be informative, nor does it require rules that ensure that the same numerals are always assigned to the same events under the same conditions. Ellis (1968) has pointed out that some such rule as, "Assign the first number that comes into your head to each of the objects on the table in turn,"
must be excluded from the definition of measurement if it is to be determinative and informative. Moreover, Ellis notes that a rule of measurement must allow for different numerals, or ranges of numerals, to be assigned to different things, or to the same things under different conditions. Rules such as "Assign the number 3 to everything" are degenerate rules. Measurement must be made on a scale, and we only have a scale when we have a nondegenerate, informative, determinative rule.

For the moment, the historical narrative will be set aside in order to delineate and comment on the matter. Stevens distinguished four kinds of scales of measurement, and he believed that all practical common scales fall into one or other of his categories. These categories are worth examining and their utility in the scientific enterprise considered.

The Nominal Scale

The nominal scale, as such, does not measure quantities. It measures identity and difference. It is often said that the first stage in a systematic empirical science is the stage of classification. Like is grouped with like. Events having characteristics in common are examined together. The ancient Greeks classified the constitution of nature into earth, air, fire, and water. Animal, vegetable, or mineral are convenient groupings. Mendeleev's (1834-1907) periodic table of the elements in chemistry, plant and animal species classification in biology (the Systema Naturae of the great botanist Carl von Linné, known as Linnaeus [1707-1778]), and the many typologies that exist in psychology are further examples.

Numbers can, of course, be used to label events or categories of events. Street numbers, or house numbers, or numbers on football shirts "belong" to particular events, but there is, for example, no quantitative significance, in arithmetical terms, between player number 10 and player number 4 on a hockey team. Player number 10 is not 2.5 times player number 4. Such arithmetical rules cannot be applied to the classificatory exercise. However, it is frequently the case that a tally, a count, will follow the construction of a taxonomy.

Clearly, classifications form a large part of the data of psychology. People may be labeled Conservative, Liberal, Democrat, Republican, Socialist, and so on, on the variable of "political affiliation," or urban, suburban, rural, on the variable of "location of residence," and we could think of dozens, if not scores, of others.

The Ordinal Scale

The essential relationship that characterizes the ordinal scale is greater than (symbolized >) or less than (symbolized <). These scales of measurement have
proved to be very useful in dealing with psychological variables. It might, for example, be comparatively easy to state that, according to some specified set of criteria, Bill is more neurotic than Zoe, who is more neurotic than John, or that Mary is more musical than Sue, who is more musical than Jane, and so on, even though we are not in possession of a precise measuring instrument that will determine by how much the individuals differ on our criteria. It follows that the numbers that we assign on the ordinal scale represent only an order.

In ordering a group of 10 individuals according to judgments in terms of a particular set of criteria for leadership, we may designate the one with the most leadership ability "1" and go through to the one with the least and rank this person "10," or we may start with the one with least ability and rank him or her "1," progressing to the one with most, whom we rank "10." Either method of ordering is permissible and both give precisely the same sort of information, provided that the one doing the ordering adheres to the system being used and communicates the rule to others.

The Interval and Ratio Scales

When the gaps between equal points, in the numerical sense, on the measurement scale are truly quantitatively equal, then the scale is called an interval scale. The difference between 130 cm and 137 cm is exactly the same as the difference between 137 cm and 144 cm. When this sort of scale starts at a true zero point, as in the distance or length scale just mentioned, Stevens designates it a ratio scale. For example, an individual who is 180 cm tall is twice the height of the child of 90 cm, who, in turn, is twice the length of the baby of 45 cm. But when the scale has an arbitrary zero point - for example, the Fahrenheit and Celsius temperature scales - these ratio operations are not legitimate. It is neither meaningful nor correct to say that a temperature of 30 is three times as hot as a temperature of 10. This ratio does not represent three times any temperature-related characteristic. Perhaps a more readily appreciated example is that of calendar time. Although it is true to say that 60 years is twice 30 years, it is meaningless to say that 2000 A.D. will be twice 1000 A.D. in any description of "age." In 1000 A.D. the famous Parthenon in Athens was not twice as old as it was in 500 A.D. Why not? Because it was built in about 440 B.C. It follows from these examples that arithmetical operations on a ratio scale are conducted on the scale values themselves, but on the interval scale with its arbitrary zero such operations are conducted on the interval values.
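The point about arbitrary zeros can be demonstrated numerically. In the short sketch below (the function name and values are mine, offered only as an illustration), the ratio of two temperatures is not preserved when the scale is changed, whereas the ratio of two temperature differences is.

    # Ratios of interval-scale values depend on the arbitrary zero;
    # ratios of intervals (differences) do not.
    def celsius_to_fahrenheit(c: float) -> float:
        return c * 9 / 5 + 32

    hot, cool = 30.0, 10.0                      # degrees Celsius
    print(hot / cool)                           # 3.0 -- but meaningless
    print(celsius_to_fahrenheit(hot) / celsius_to_fahrenheit(cool))
                                                # 86/50 = 1.72: the "ratio" is not preserved
    # Differences, however, survive the change of scale:
    print((hot - cool) / (hot - 0.0))           # 0.666...
    print((celsius_to_fahrenheit(hot) - celsius_to_fahrenheit(cool))
          / (celsius_to_fahrenheit(hot) - celsius_to_fahrenheit(0.0)))  # 0.666...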
The higher-order interval and ratio scales have all the properties of the lower-order nominal and ordinal scales and may be so applied. Psychologists, however, strive where possible for interval measurement, for it has the appearance at least of greater precision. This desire sometimes leads to conceptual difficulties and statistical misunderstandings. It is easy to see that
the 3 cm by which a line A of 12 cm differs from a line B of 9 cm is the same as the 3 cm by which B exceeds the 6-cm-long line C. We are agreed on how we will define a centimeter, and measurement of length in these units is comparatively straightforward.

But what of this statement? "Albert has an Intelligence Quotient (IQ) of 120, Billy one of 110, and Colin one of 100." The meaning of the 10-point differences between these individuals, even when the measuring instrument (the test) used is specified, is not so easy to see, nor is it possible to say with complete confidence that the gap in intelligence between Albert and Billy is precisely the same amount as the gap between Billy and Colin. Suppose that Question 1 on our test, say, of numerical ability, asks the respondent to add 123 to 456; Question 2, to divide 432 by 144; and Question 3, to add the square of 123 to the square root of 144. Few would argue that these demands exhibit the same degree of difficulty, and, by the same token, few would agree on the weighting that might be assigned to the marks for each question. These weightings would, to some extent, reflect the differing concepts of numerical ability. How much more numerically able is the individual who has grasped the concept of square root than the one who understands nothing beyond addition and subtraction? Although these examples could, quite justifiably, be described as exaggerated, they are given to show how difficult, indeed, impossible, it is to construct a true interval scale for a test of intelligence.

Leaving aside for the moment the fact that no perfectly reliable instrument exists for the measurement of IQ, we also know that there is no clear agreement on the definition of IQ. In other words, statements about the magnitude of the differences in length between lines A, B, and C, or in IQ between Albert, Billy, and Colin, depend on the existence of a standard unit of measurement and on a constant unit of measurement, one that is the same over all points on the scale. In the case of the measurement of length these conditions exist, but in the measurement of IQ they do not.

Do our arithmetical manipulations of psychometric test scales, as though they were true interval scales, suggest that psychologists are prepared to ignore the imperfections in their scales for the sake of computational convenience? Certainly the situation just described emphasizes the responsibility of the measurer to report on how the scales he or she employs are used and to keep their limitations in mind.
Many, perhaps most, discussions of scales of measurement reveal a serious misconception that also arises from Stevens' examination. It is that the level of measurement, that is, the specific measurement scale used in a particular investigation, governs the application of various statistical procedures when the data come to be analyzed. Briefly, the notion is abroad that an interval scale of measurement is required for the adhibition of parametric tests: the t-ratio, F-ratio, Pearson's correlation coefficient, and so on, all of which are discussed
later in this work. Gaito (1980) is one of a number of writers who have tried to expose the fallacy, noting that it is based on a confusion between measurement theory and statistical theory. Gaito reiterates some of the criticisms made by Lord (1953), who, in a witty and brilliant discussion, tells the story of a professor who computed the means and standard deviations of test scores behind locked doors because he had taught his students that such ordinal scores should not be added. Matters came to a head when the professor, in a story that should be read in the original, and not paraphrased, discovers that even the numbers on football jerseys behave as if they were "real," that is, interval scale, numbers. In fact, "the numbers don't remember where they came from" (p. 751).

It is important to realize that the preceding remarks do not detract from Stevens' sterling contribution to the development of coherent concepts regarding scales in measurement theory. The difficulties arise when these are regarded as prescriptions for the choice and use of a wide variety of statistical techniques.

ERROR IN MEASUREMENT

All acts of measurement involve practical difficulties that are lumped together as the source of measurement error. From the statistical standpoint, measurement error increases the variability in data sets, decreasing the precision of our summaries and inferences. It follows that scientists strive for accuracy by continually refining measuring techniques and instruments. The odd fact is that this strategy proceeds even though we know that absolutely precise measurement is impossible.

It is clear that a meter rule must be made from rigid material and that measuring tapes must not be elastic. Without this the instruments lack reliability and self-consistency and, put simply, separate and independent measurements of the same event are unlikely to agree with each other. Only very rarely do paper-and-pencil tests, or two forms of the same test (say, a test of personality factors), give exactly the same results when given to an individual or group on two occasions. In a sense these tests are like elastic measuring tapes, and the discrepancy between the outcomes is an index of their reliability. Perfect reliability is indexed 1. Poor reliability would be indexed 0.3 or less, and good reliability 0.8 or more. In accepting this latter figure as good reliability, the psychologist is accepting the inevitable error in measurement, being prepared to do so because he or she feels that a quantitative description better serves to grasp the concepts under investigation, to communicate ideas about the concepts more efficiently, and to describe the relationships between these concepts and others more clearly.
Reliability is not just a function of the quality of the instrument being used. In 1796, Maskelyne, Astronomer Royal and Director of the Royal Observatory at Greenwich, near London, England, dismissed Kinnebrook, an assistant,
because Kinnebrook's timing of stellar transits differed from his, sometimes by as much as 1 sec.1 The accuracy of such measurements is, of course, crucial to the calculation of the distance of the star from Earth. Some 20 years later this incident suggested to Bessel, the astronomer at Königsberg, that such errors were the result of individual differences in observers and led to work on the personal equation. The astronomers believed that these discrepancies were due to physiological variability, and early experiments on reaction time reflect this view. But as scientific psychology has grown, the reaction-time variable has been used to study choice and discrimination, predisposition, attitudes, and the dynamics of motivation. These events mark the beginnings of the psychology of individual differences. For the immediate argument, however, they illustrate the importance of both inter-observer and intra-observer reliability.

The circumstances of an exercise in measurement must remain constant for each separate observation. If we were, say, measuring the heights of individuals in a group, we would ensure that each of them stood up straight with heels together, chin in, chest out, head erect, no slouching, no tiptoeing, and so on. Quite simply, height has to be defined, and our measurements are made in accord with that definition.
When a concept is defined in terms of the operations, manipulations, and measurements that are made in referring to it, we have an operational definition. Intelligence might be defined as the score obtained on a particular test; a character trait like generosity might be defined as the proportion of one's income that one gives away. These sorts of definition serve to increase the precision of communication. They most certainly do not imply immediate agreement among scientists about the nature of the phenomena thus defined. Very few people would maintain that the preceding definitions of intelligence and generosity are entirely satisfactory.

Operationalism is closely allied to the philosophy of the Vienna Circle, a group, formed in the late 1920s, that aimed to clarify the logic of the scientific enterprise. The members of the Vienna Circle, mainly scientists and mathematicians, came to be known as logical positivists. They hoped to rid the language of science of all ambiguity and to confine the business of science to the testing of hypotheses about the observable world. In this philosophy, meaning is equated with verifiability, and constructs that are not accessible to observation are meaningless. Psychologists will recognize this approach as closely akin to the doctrine of behaviorism. The view that theoretical constructs can be grasped via the measurement operations that describe them is an interesting one, for it challenges the contention that there can be no measurement without theory and that operations cannot be described in nontheoretical terms.
1 Kinnebrook was employed at the observatory from May 1794 to February 1796. For some more details of the incident and what happened to him, see Rowe (1983).

The logical positivist view is obviously attractive to the working scientist because it seems to be eminently "hard-nosed." The doctrine of operationalism demands an analysis of the nature of measurement and its contribution to the description of facts. To state that science is concerned with the verification of facts is to imply that there will be agreement among observers. It follows that new methods and new observational tools, as well as observer disagreement, will change the facts. The fact or reality of a tomato is quite different for the artist, the gourmet cook, and the botanist, and the botanical view changed dramatically when it became possible to view the cellular structure of the fruit through a high-power microscope.

In the broad view, this argument includes not only the idea of levels of agreement and verification, but also the validity of tools and measurements. Do they, in fact, measure what they were designed to measure? And what is meant by "designed to measure" must have some basis in a conceptual disposition if not a full-blown theory.

We must not only consider levels of measurement, but also the limits of measurement. In chapter 2 it was noted that observing a system disturbs the system, affects the act of measurement, and thus provokes an irreducible uncertainty in all scientific descriptions. This proposition is embodied in the Uncertainty Principle of Heisenberg: the more accurately one measures either the momentum or the position of a subatomic particle, the less certain becomes the measurement of the other. Ultimately the principle applies to all acts of measurement.
The world of psychological measurement is beset with system-disturbing features. Experimenter-participant interactions and the arousing and motivating properties of the setting, be it the laboratory or the natural environment, contribute to variance in the data. These factors have been variously described as experimenter effect, demand characteristics, and, more generally, as the social psychology of the psychological experiment. They are examined at length by Rosenthal and Rosnow (1969) and Miller (1972).

Despite the difficulties, logical and practical, of the procedures, scientists will continue to measure. The behavioral scientist is trying to pin down, with as much rigor as possible, the variability of the properties of living matter that arise as a result of the interaction of environmental and genetic factors. As mentioned in chapter 1, Quetelet (1835/1849) described these in terms of what we now call the normal distribution, using the analogy of Nature's "errors."

Although Quetelet and Galton and others may be criticized for the promotion of the distribution as a "law of nature," their work recognized the irreducible variance in living matter. This brings out the essence of the statistical approach. There can be no absolute accuracy in measurement but only a judgment of accuracy in terms of the inherent variation within and between individuals. Statistics are the tools for assessing the properties of these random fluctuations.
4
The Organization of Data

THE EARLY INVENTORIES

The counting of people and livestock, hearths and homes, and goods and chattels is an exercise that has a long past. The ancient civilizations of Babylon and Egypt conducted censuses for the purposes of taxation and the raising of armies, perhaps 3,000 years before the birth of Christ, and indeed the birthplace of Christ himself was determined partly by the fact that Mary and Joseph traveled to Bethlehem in order to be registered for a Roman tax. Censuses were carried out fairly regularly by the Romans, but after the fall of Rome many centuries passed before they became part of the routine of government. It can be argued that these exercises are not really important in statistical history because they were not used for the purposes of making comparisons or for drawing inferences in the modern sense (that is, by using the probability calculus), but they are early examples of the descriptions of States.

When inferences were drawn they were informal. The utility of such descriptions was recognized by Aristotle, who prepared accounts, which were largely non-numerical, of 158 states, listing details of their methods of administration, judicial systems, customs, and so on. Although almost all of these descriptions are lost, they clearly were intended to be used for comparative purposes and compiled as part of Aristotle's theory of the State. Such systematic descriptions became part of our intellectual tradition in Europe, particularly in the German States, during the 17th and 18th centuries. Staatenkunde, the comparative description of states, made for the purposes of throwing light on their organization, their power, and their weaknesses, became an important discipline promoted especially by Gottfried Achenwall (1719-1772), who was Professor at the University of Göttingen. Hans Anchersen (1700-1765), a Danish historian, introduced tables into his work, and although these early tables were largely
non-numerical, they form a bridge between the early comparative descriptions and Political Arithmetic, where we see the first inferences based on vital statistics (see Westergaard, 1932, for an account of these developments).

Some early statistical accounts were, until quite recently, thought to have been made solely for the purposes of taxation. Notable among these is Domesday Book, ordered in 1085 by William the Conqueror, "to place the government of the conquered on a written basis" (Finn, 1973, p. 1). This massive survey, which was completed in less than 2 years, is not just a tax roll, but an administrative record made to assist in government. Moreover, Galbraith (1961) notes that the Domesday Inquest made no effort to preserve the overwhelming mass of original returns: "Instead, the practical genius of the Norman king preserved in a 'fair copy' what was little more than an abstract of the total returns" (Galbraith, 1961, p. 2), so that Domesday Book is perhaps one of the earliest examples of a summary of a large amount of data.

POLITICAL ARITHMETIC

The student in statistics who develops an interest in their history will be astonished if he or she visits a good library and searches out titles on the history of statistics or old books with the word statistics or statistical in the title. Very soon, for example, The History of Statistics, published in 1918 to commemorate the 75th anniversary of The American Statistical Association, and edited by Koren, will be discovered. But there are no means or variances or correlation coefficients to be found here. This volume contains memoirs from contributors from 15 countries detailing the development of the collection and collation of economic and vital statistics, for all kinds of motives and using all kinds of methods, within the various jurisdictions. Such works are not unusual, and they relate to a variety of endeavors concerned with the use of numerical data both for comparative and inferential purposes. They reflect the layman's view of statistics in the present day.

Political arithmetic may be dated from the publication by John Graunt (1620-1674), in 1662, of Natural and Political Observations Mentioned in a Following Index and Made Upon the Bills of Mortality, although the name political arithmetic was apparently the invention of Graunt's friend William Petty (1623-1687). It has been reported that Petty was in fact the author of the Observations, but this is not so. Petty's Political Arithmetick was written in about 1672 but was not published until 1690. Petty's work was of interest to government and surely had ruffled some feathers. Petty's son, dedicating his father's book to the king, writes:

He was allowed by all, to be the Inventor of this Method of Instruction; where the perplexed and intricate ways of the World are explain'd by a very mean peice of
Science; and had not the Doctrins of this Essay offended France, they had long since seen the light, and had found Followers, as well as improvements before this time, to the advantage perhaps of Mankind. (Lord Shelborne, Dedication, in Petty, 1690)

That the work had the imprimatur of authority is given in its frontispiece:

Let this Book called Political Arithmetick, which was long since Writ by Sir William Petty deceased, be Printed.

Given at the Court at Whitehall the 7th day of Novemb. 1690.

Nottingham.

It was Karl Pearson's (1978) view, however, that Petty owed more to Graunt than Graunt to Petty. Graunt was a well-to-do haberdasher who was influential enough to be able to secure a professorship at Gresham College, London, for Petty, a man who had a variety of posts and careers. Graunt was a cultured, self-educated man and was friends with scientists, artists, and businessmen. The publication of the Observations led to Charles II himself supporting his election to the Royal Society in 1662. Graunt's work was the first attempt to interpret social behavior and biological trends from counts of the rather crude figures of births and deaths reported in London from 1604 to 1661 - the Bills of Mortality. A continuing theme in Graunt's work is the regularity of statistical summaries. He apparently accepted implicitly the notion that when a statistical ratio was not maintained then some new influence must be present, that some additional factor is there to be discovered. For example, he tried to show that the number of deaths from the plague had been under-reported,1 and by examining the number of christenings reasoned that the decrease in population of London because of the plague would be recovered in 2 years. He attempted to estimate the population distribution by sex and age and constructed a mortality table. Essentially, Graunt's work, and that of Petty, emphasizes the use of mathematics to impose order on data and, by extension, on society itself.

Petty was a founding member of the Royal Society, which was granted its charter in 1662. Although he had been a supporter of Cromwell and the Commonwealth, he was knighted in 1661 by Charles II. He was not so much concerned with the vital statistics studied by Graunt. Rather, in his early work he suggested ways in which the expenses of the state could be curtailed and the

1 The weekly accounting of burials was begun to allay public fears about the plague. However, in epidemic years when anxiety was at its highest, it appears that cases of the plague were concealed because members of families with the illness were kept together, both the sick and the well.

revenue from taxes increased, and a continuing theme in his work is the estimation of wealth.

His Political Arithmetick (Petty, 1690) compares England, France, and Holland in terms of territory, trade, shipping, housing, and so on. His interest was in practical political matters and money, topics that reflected his professional career as well as his philosophy.

In the context of the times, Graunt and Petty's work may be seen as a demonstration of their belief that quantification provided knowledge that was free from controversy and conflict. Buck (1977) discusses these efforts in the political climate of the times. The 17th century was, to say the least, a difficult time for the people of England. The Civil War, the execution of the King, the turmoil of the Commonwealth, religious conflict, the Restoration, the weakness of the monarchs of the Stuart line, the Glorious Revolution that swept James II from the throne - all contributed to civil strife and political unease. And yet the post-Restoration years saw the founding of the Royal Society, the establishment of the Royal Observatory at Greenwich, and the setting of the stage for the triumphs of Newtonian science in the 18th century. It was also the era of Restoration literature. Both John Evelyn (the diarist who gave the Society its name and its motto) and the poet Dryden were members of the Royal Society. Buck (1977) and others have argued that the philosophy underlying the contributions of Petty and Graunt differs from that of the 18th century in that it does not accept the existence of natural order in society. Indeed it could not, because the political mechanisms necessary for the maintenance of a relatively stable social system had not been established.

The practical effects of the work that was begun by Petty have their counterpart in the present-day agencies of the state that collect data for all manner of purposes. In this account of the development of statistics, however, this path will be left for now, in order to return to the enterprise that sprang from Graunt's mortality tables. Actuarial science has more to do with our perspective because it involves the early study of probability and risk and the use of these mathematics for inference.

VITAL STATISTICS

Actuarial science has been defined as the application of probability to insurance, particularly to life insurance. But to apply the rules of probability theory there have to be data, and these data are provided by mortality tables. Life expectancy tables go back as far as the 4th century, when they were used under Roman law to estimate the value of annuities, but the major impetus to the systematic examination of the mathematics of tontines and annuities is generally placed in 17th century Holland. John De Witt (1625-1672), the Grand Pensionary of Holland and West Friesland, applied probability principles to the calculation of
annuities.2 Dawson (1901/1914) states that De Witt must be considered to be the founder of actuarial science, although his work was not rediscovered until 1852. More prominent is the work of De Witt's contemporaries, Christiaan Huygens (1629-1695) and John Hudde (1628-1704). Graunt's work influenced that of Huygens and his younger brother, who debated the relative merits of estimating the mean duration of life and the probability of survival or death at a given age.

These attempts to bring rationality and order into the general area of life insurance were aided considerably by the work of Edmund Halley (1656-1742), the English astronomer and mathematician and patron of Newton. He published An Estimate of the Mortality of Mankind, Drawn From Curious Tables of the Births and Funerals at the City of Breslaw; With an Attempt to Ascertain the Price of Annuities on Lives, in Philosophical Transactions in 1693. He was a brilliant man, and his clear exposition of the problems is remarkable on two counts. The first is that it was produced in order to fulfill a promise to the Royal Society to contribute material for the Transactions, which had resumed publication after a break of several years, and was a topic that was out of the mainstream of his work. The second is that the early life offices in England did not use the material, the common opinion being that insurance was largely a game of chance. The general realization that an understanding of probabilities helped in the actual determination of gains and losses in both games of chance and life insurance was not to come until almost the middle of the 18th century. The necessity of reliable systems for the determination of premiums for all kinds of insurance became more and more important in the rise of 18th century trade and commerce.

This account of the beginnings of actuarial science is by no means complete, but the brief description highlights the utility of data organization and probability for the making of inferences and the business of practical prediction. As Karl Pearson (1978) noted in his lectures, there were two main lines of descent from Graunt: the probability-mathematicians and actuaries, and the 18th century political arithmeticians. Of the probability-mathematicians perhaps the most entertaining and enterprising was a physician, John Arbuthnot (1667-1735). He was able, he was intellectual, he was a wit, and he can be credited with the first use of an abstract mathematical proposition, the binomial theorem, to test the probability of an observed distribution, namely, the proportion of male and female births. In fact, he appears to have been the first person to assess observed data, chance, and alternative hypotheses using a statistical test. Arbuthnot had, in 1692, published a translation of Huygens' work on probability with additions of his own. In this book he observes that probability may be applied to

2 Tontines were once a popular form of annuity. Subscribers paid into a joint fund, and the income they received increased for the survivors as the members died off.
Graunt's suggestion in the Observations that the greater number of male than female births was a matter of divine providence and not chance. In An Argument for Divine Providence, Taken From the Constant Regularity Observ'd in the Births of Both Sexes, published in 1710, Arbuthnott (sometimes, as he did here, he spelled his name with two t's) uses the binomial expansion to argue that the observed distribution of male and female christenings (which he equates with births) departs so much from equality as to be impossible, "from whence it follows that it is Art, not Chance, that governs" (p. 189). An excess of males is prevented by:

the wise Oeconomy of Nature; and to judge of the wisdom of the Contrivance, we must observe that the external Accidents to which Males are subject (who must seek their Food with danger) do make a great havock of them, and that this loss exceeds far that of the other Sex, occasioned by the Diseases incident to it, as Experience convinces us. To repair that Loss, provident Nature, by the Disposal of its wise Creator, brings forth more Males than Females; and that in almost constant proportion. (Arbuthnott, 1710, p. 188)

The resulting near-equal proportions of the sexes ensures that every male may have a female of suitable age, and Arbuthnot concludes, as did Graunt, that:

Polygamy is contrary to the Law of Nature and Justice, and to the Propagation of the Human Race; for where Males and Females are in equal number, if one Man takes Twenty Wives, Nineteen Men must live in Celibacy, which is repugnant to the Design of Nature; nor is it probable that Twenty Women will be so well impregnated by one Man as by Twenty. (Arbuthnott, 1710, p. 189)

which are as ingenious statements of what we now call alternative hypotheses as one could find anywhere.
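Arbuthnot's argument can be restated in modern terms as a sign test: under the chance hypothesis, each year is equally likely to show an excess of male or of female christenings, so the probability that every year favored males is (1/2) raised to the number of years. The sketch below is my own illustration of that logic; the figure of 82 years comes from the span of London records (1629-1710) that Arbuthnot examined, not from the passage above.

    # Arbuthnot's sign test in modern dress: if male and female births were
    # equally likely, the chance that males outnumber females in every one
    # of the 82 years examined is (1/2)**82.
    from fractions import Fraction

    years = 82                        # years of London christening records
    p_all_male_excess = Fraction(1, 2) ** years
    print(p_all_male_excess)          # 1/4835703278458516698824704
    print(float(p_all_male_excess))   # about 2.07e-25: "Art, not Chance"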
The most outstanding of the probability-mathematicians of the era was Abraham De Moivre (1667-1754). He was born at Vitry in Champagne, a Protestant and the son of a poor surgeon. From an early age, although an able scholar of the humanities, he was interested in mathematics. After the repeal of the Edict of Nantes in 1685, he was interned and was faced with the choice of renouncing his religion or going into exile. He chose the latter and in 1688 went to England. By chance he met Isaac Newton and read the famous Principia, which had been published in 1687. He mastered the new infinitesimal calculus and passed into the circles of the great mathematicians, Bernoulli, Halley, Leibnitz, and Newton. Despite the efforts of his friends, he was unable to obtain a secure academic position and supported himself as a peripatetic teacher of mathematics, "and later in life sitting daily in Slaughter's Coffee House in Long Acre, at the beck and call of gamblers, who paid him a small sum for calculating odds, and of underwriters and annuity brokers who wished their values reckoned" (K. Pearson, 1978, p. 143).

De Moivre first published his treatise on annuities in 1725. Essentially, and without going into the details, and some of the defects, of his procedures, De Moivre used the summation of series to compute compound interest and annuity values. The crucial factor was in applying the relatively straightforward mathematics to a mortality table. He examined Halley's Breslau table and concluded that the number of individuals who survived to a given age from a total starting out at an earlier age could be expressed as decreasing terms in an arithmetic series. In short, De Moivre combined probabilities and the interest factor to compute annuity values in a manner that pointed the way to modern actuarial science. Thomas Simpson (1710-1761) has been described as De Moivre's younger rival. He is certainly remembered in the history of life insurance as an innovator. He appears initially to have argued with De Moivre that mortality tables should be used as they were found and not fitted to mathematical rules. Simpson is dismissed by Karl Pearson (1978) as a plagiarist who "set out to boil down De Moivre's work and sell the result at a lower price" (p. 176), whereas Dawson (1901/1914) describes his book as "an attempt to popularize the science, never a popular movement among those who hope to profit by keeping it exclusive" (p. 102).

De Moivre was undoubtedly upset by Simpson's reproduction of his ideas, but Pearson's view of Simpson's character is not shared by many historians of mathematics, who have described him as a self-taught genius.
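De Moivre's device of treating survivors as a decreasing arithmetic series lends itself to a short illustration. The hedged sketch below values an annuity of 1 per year under that linear-survival assumption; the terminal age and rate of interest are arbitrary choices of mine for the example, not figures from Halley's table or De Moivre's treatise.

    # A sketch of De Moivre's idea: assume the number of survivors declines
    # in arithmetic progression to zero at some terminal age, then value an
    # annuity by discounting the expected payments.
    TERMINAL_AGE = 86      # illustrative: no one survives past this age
    INTEREST = 0.05        # illustrative annual rate of interest

    def annuity_value(age: int) -> float:
        """Expected present value of 1 paid at the end of each year survived."""
        remaining = TERMINAL_AGE - age
        value = 0.0
        for t in range(1, remaining + 1):
            survival = (remaining - t) / remaining   # linear decrease of survivors
            value += survival / (1 + INTEREST) ** t  # discounted expected payment
        return value

    print(round(annuity_value(50), 2))  # price of a life annuity sold at age 50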
The applications of mathematical rules to tables of mortality are the earliest examples of the use of theoretical abstractions for practical purposes. They are attempts to organize and make sense of data. The development of, and rationales behind, modern statistical summaries are of immediate interest.

GRAPHICAL METHODS

As knowledge increases amongst mankind and transactions multiply, it becomes more and more desirable to abbreviate and facilitate the modes of conveying information from one person to another and from one individual to the many.

I confess I was long anxious to find out whether I was actually the first who applied the principles of geometry to matters of finance as it had long before been applied to chronology with great success. I am now satisfied upon due inquiry, that I was the first:

As the eye is the best judge of proportion being able to estimate it with more accuracy than any other of our organs, it follows, that wherever relative quantities are in question, a gradual increase or decrease of any ... value is to be stated, this mode of representing it is peculiarly applicable.

That I have succeeded in proposing and putting in practice a new and useful mode of stating accounts ... and as much information may be obtained in five minutes as would require whole days to imprint on the memory, in a lasting manner, by a table of figures. (Playfair, 1801a, pp. ix-x)

These quotations are taken from Playfair's Commercial and Political Atlas, the third edition of which was published in 1801. They provide an unequivocal statement of his view, and the view of many historians, that he was the inventor of statistical graphical methods, methods that he termed lineal arithmetic. Also in 1801, Playfair's Statistical Breviary; Shewing, on a Principle Entirely New, the Resources of Every State and Kingdom in Europe, appeared. We see, in handsome hand-colored, copper-plate engravings, line and bar graphs in the Atlas, and these together with circle graphs and pie charts in the Breviary. They illustrate revenues, imports and exports, population distributions, taxes, and other economic data, and they are accompanied by constant reminders of the utility of the graphical method for the examination of fiscal and trade matters.

William Playfair (1759-1823) was the younger brother of John Playfair, the mathematician. He had a variety of occupations, beginning as an apprentice mechanical engineer and draftsman and later as a writer and rather unsuccessful businessman. In his examination of the history of graphical methods, Funkhouser (1937) notes that the French translation of the Atlas was very well received on the continent; in fact, much more attention was paid to it there than in England. He also suggests that this might account for the greater interest shown in graphical work by the French over the next century.

As a method of summarizing data, we must give full credit to Playfair for the development of the graphical method, but there are examples of the use of graphs that predate his work by many centuries. Funkhouser (1937) mentions the use of coordinates by Egyptian surveyors and that latitudes and longitudes were used by the geographers and cartographers of ancient Greece. Nicole Oresme (c. 1323-1382) developed the essentials of analytic geometry, which is René Descartes' (1596-1650) greatest contribution to mathematics, although it is of course true that graphs could have been used, and in fact occasionally were used, without any formal knowledge of the fact that the curve of a graph can be represented by a mathematical function. Funkhouser (1936) reports on a 10th-century graph of planetary orbits as a function of time, and over the years there have been other examples of the method, perhaps the most obvious of which is musical notation.

Despite Playfair's ingenuity and his attempts at the promotion of the technique, the use of graphs was surprisingly slow to develop. We find Jevons in 1874 describing the production of "best fit" curves by eye and suggesting the procuring or preparation of "paper divided into equal rectangular spaces, a convenient size for the spaces being one-tenth of an inch square" (p. 493), and not until its 11th edition, published in 1910, did Encyclopaedia Britannica devote an entry to graphical methods. Funkhouser (1937) suggests, and the suggestion is eminently plausible, that the collection of public statistics was affected by an ongoing controversy as to whether verbal descriptions of political and economic states were superior to numerical tables. If statistical tables were
regarded with suspicion, then graphs would have been greeted with even more opprobrium. In France and Germany the use of graphs was openly criticized by some statisticians.

In the social sciences the person who most helped to promote the use of the graph as a tool in statistical analysis was Quetelet, who has already been mentioned. It has been noted that the use of the Normal Curve for the description of human characteristics begins with his work. The law of error and Quetelet's adaptation of it is of great importance.
5
Probability

A proposition that would find favor with most of us is that the business of life is concerned with attempts to do more or less the right thing at more or less the right time. It follows that a great deal of our thinking, in the sense of deliberating, about our condition is bound up with the problem of making decisions in the face of uncertainty. A traditional view of science in general, on the other hand, is that it searches for conclusions that are to be taken to be true, that the conclusions should represent an absolute certainty. If one of its conclusions fails, then the failure is put down to ignorance, or poor methodology, or primitive techniques - the truth is still thought to be out there somewhere. In chapter 2, some of the difficulties that can arise in trying to sustain this position were discussed. Probability theory provides us with a means of reaching answers and conclusions when, as is the case in anything but the most trivial of circumstances, the evidence is incomplete or, indeed, can never be complete. Put simply, what is the best bet, what are the odds on our winning, and how confident can we be that we are right? The derivation of theories and systems that may be applied to assist us in answering these questions continues to present intellectual challenges.

THE EARLY BEGINNINGS

The early beginnings are very early indeed. David (1962) tells us that many archeological digs have produced quantities of astragali, animal heel bones, with the four flatter sides marked in various ways. These bones may have been used as counting aids, as children's toys, or as the forerunners of dice in ancient games. The astragalus was certainly used in board games in Egypt about 3,500 years before the birth of Christ. David suggests that gaming may have been developed from game-playing and reports that it is said to have been introduced
in Greece just 300 years before Christ. She also suggests that gaming may have emerged from the wager, and the wager from divination and the interrogation of oracles with origins in religious ritual. We know that gaming was a common pastime in both Greece and Rome from at least 100 years before Christ. David notes that Claudius (10 B.C.-54 A.D.) wrote a book on how to win at dice, which, regrettably, has not survived. David also comments on a remark in Cicero's De Divinatione, Book I:

When the four dice produce the venus-throw [the outcome when four dice, or astragali, are tossed and the faces shown are all different] you may talk of accident: but suppose you made a hundred casts and the venus-throw appeared a hundred times; could you call that accidental? (David, 1962, p. 24)

This may be one of the earliest statements of the circumstances in which we might accept or reject the null hypothesis. Cicero quite clearly had some grasp of the concepts of randomness and chance and that rare events do occur in the long run. A variety of explanations may be advanced for the fact that these early insights did not lead to a mathematical calculus of probabilities. Greek philosophy, in the works of Plato and Aristotle, searched for order and regularity in the phenomena of the universe, and later, as Christianity spread, the Church promoted the idea of an omnipotent God who was unceasingly aware of, and responsible for, the slightest perturbation in natural affairs. Human beings were doomed to be ignorant of the succession of natural events and could achieve salvation only by submission to the will of an Almighty. A more mundane explanation may be provided by the fact that early arithmetic, at any rate in what we would now call Western culture, was hampered by the absence of a logical and efficient system of number notation. Whatever the reasons, and many others have been suggested (see, e.g., Acree, 1978; Hacking, 1975), we know that about 1,600 years elapsed before a foundation for probability theory was laid.

THE BEGINNINGS

The reader is referred to David's interesting book (1962) for a review of the contributions that emerged in Europe during the first millennium and beyond. In the 17th century, Gerolamo Cardano's book Liber de Ludo Aleae [The Book on Games of Chance] was published (in 1663, some 87 years after the death of its author). A translation of this work by S. H. Gould is to be found in Ore (1953). The treatise is full of practical advice on both odds and personality, noting that persons who are renowned for wisdom, or old and dignified by civil honor or priesthood, should not play, but that for boys, young men, and soldiers, it is less of a reproach.

Cardano also cautions that doctors and lawyers play at a disadvantage, for
"if they win, they seemto begamblers,and if they lose,perhaps theymay be


takento be asunskilful in their own art as ingaming. Men of these professions
incur the same judgmentif they wish to practice music" (Ore, 1953, pp.
187-188).
It cannot be claimed that Cardano produced a theory with this book.
However,it can bestated thathe understoodthe importanceof therelationship
between theoretical reasoning and eventsin the "real world" of gaming.
The birth of themathematical theoryof probability, thatis, "when the infant
really seesthe light of day" (Weaver, 1963/1977, p. 31), took placein 1654.In
that year, Antoine Gombauld, Chevalier de Mere (1607-1684),posed the
questionsto Blaise Pascal(1623-1662)that initiatedtheendeavor.The two may
have met in thesummerof 1652 and soon become friends, although there is
some doubt about both the time and thecircumstancesof their meeting. David,
in her book, deals rather harshly with de Mere. Shedismisseshim ashaving"a
second-rate intelligence" and indicates thathis writings do not show evidence
of mathematical ability, "being literary pieces of not very high calibre" (David,
1962,p. 85). De Mere was not amathematician, although there is no doubt that
he held a good opinionof his abilitiesin a numberof fields. What he appears
to have beenwasFrance's leading arbiter of good tasteandmanners,a man who
studiedthe ways of polite societywith care anddedication.1 His writings were
soughtafter,and hisopinions respected, by thosewho wishedto maintain good
standingin the drawing roomsof Paris (seeSt. Cyres, 1909,pp. 136-158,for
an accountof de Mere). Pascal appears to have beena willing pupil, although
toward the end of1654heunderwenthis second conversion to thereligious life.
The notion that, astonishingly, the calculusof probabilitieswas theoffspring of
a dilettanteand anascetic, which Poisson's remark (footnoted in chap. 1) has
been takento imply, is a notquite accuratereflection.
Every accountof the history of the concept of probability showshow
importantthe searchfor waysof assessing the oddsin various gaming situations
has been in shapingits development. Thereseemsto be little doubt but that
Pascalhadspent sessions at thegaming tables,and itseems unlikely that various
situationshad notbeen discussed with de Mere. What movedthe matter into
the realm of mathematicaldiscoursewas thecorrespondence between Pascal
and hisfellow mathematician Pierre Fermat, sparked by specific questions that
had been raisedby de Mere.
An old gamblinggameinvolvesthe "house"offering a gambler even money
that he or shewill throw at least 1 six in 4throws of a die. In fact, the odds are
slightly favorableto the house. De Mere's problem concerned the questionof
the odds in a situation wherethe bet wasthat a player would throw at least1
1
That David's judgmenton themattermay be inerror is likely. Michea (1938) maintains that
Descartes'bonsensandPascal'scoeurcan beequated withde Mere'sbongout, and no onecould
find that Descartesand Pascalhad second-rate intelligences.

double six in 24 throws of 2 dice. Here the odds are, in fact, slightly against the house, even though 24 is to 36 as 4 is to 6. De Méré had solved this problem but used the outcome to argue to Pascal that it showed that arithmetic was self-contradictory!2 A second problem, which de Méré was unable to solve, was the "problem of points," which concerns, for example, how the stakes or the "pot" should be divided among the players, according to the current scores of the participants, if a game is interrupted. This is an altogether more difficult question, and the exchanges between Pascal and Fermat that took place over the summer of 1654 devote much space to it. The letters3 show the beginning of the applications of combinatorial algebra and of the arithmetic triangle.
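The arithmetic behind the two wagers (set out in the footnote) is easily reproduced. The lines below are simply my restatement of that calculation, using no more than the multiplication rule for independent throws: the chance of at least one success in n throws is 1 minus the chance of n straight failures.

    # Reproducing de Mere's two calculations.
    p_six_in_4 = 1 - (5 / 6) ** 4             # at least one six in 4 throws of a die
    p_double_six_in_24 = 1 - (35 / 36) ** 24  # at least one double six in 24 throws

    print(round(p_six_in_4, 5))          # 0.51775 -- slightly favorable to the house
    print(round(p_double_six_in_24, 5))  # 0.49140 -- slightly against the house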
In 1655, Christiaan Huygens (1629-1695), a Dutch mathematician, visited Paris and heard of the Pascal-Fermat correspondence. There was enormous interest in the mathematics of chance but a seeming reluctance to announce discoveries and reveal the answers to questions under discussion, perhaps because of the material value of the findings to gamblers. Huygens returned to Holland and worked out the answers for himself. His short account De Ratiociniis in Ludo Aleae [Calculating in Games of Chance] was published in 1657 by Frans van Schooten (1615-1660), who was Professor of mathematics at Leyden. This book, the first on the probability calculus, was an important influence on James Bernoulli and De Moivre.

James (also referred to as Jacques or Jakob) Bernoulli (1654-1705) was one of an extended family of mathematicians who made many valuable contributions. It is his Ars Conjectandi [The Art of Conjecture], published in 1713 after editing by his nephew Nicholas, that is of most interest. The first part of the book is a reproduction of Huygens' work with a commentary. The second and third parts deal with the mathematics of permutations and combinations and its application to games of chance. In the fourth part of the book we find the theorem that Bernoulli called his "golden theorem," the theorem named "The Law of Large Numbers" by Poisson in 1837. In an excellent and most readable commentary, Newman (1956) declares the theorem to be "of cardinal significance to the theory of probability," for it "was the first attempt to deduce statistical measures from individual probabilities" (vol. 3, p. 1448). Hacking (1971) has examined the whole work in some considerable depth.

Bernoulli likely did not anticipate all of the debates and confusions and

2 The probability of not getting a six in a single throw of a die is 5/6, and the probability of no sixes in 4 throws is (5/6)^4. The probability of at least 1 six in 4 throws is therefore 1 - (5/6)^4, which is .51775; this is favorable to the house. The probability of not getting a double six in a single throw of 2 dice is 35/36, and the probability of at least 1 double six in 24 throws of 2 dice is 1 - (35/36)^24, which is .49140; this is slightly unfavorable to the house.

3 Translations of the Pascal-Fermat correspondence are to be found in David's (1962) book and in Smith (1929).

misconceptions, both popular and academic, that his theorem would generate, but that it was a source of intellectual puzzlement may be taken from the fact that its author states that he meditated on it for 20 years. In its simplest form, the theorem states that if an event has a probability p of occurring on a single trial, and n trials are to be made, then the proportion of occurrences of the event over the n trials is also p. As n increases, the probability that the proportion will differ from p by less than a given amount, no matter how small, also increases. It also becomes increasingly unlikely that the number of occurrences of the event will differ from pn by less than a fixed amount, however large. In a series of n tosses of a "fair" coin the probability of heads occurring close to 50% of the time (0.5n occurrences) increases as n increases, and there is a much greater probability of a difference of, say, 10 heads from what would be expected in 1,000 tosses of the coin than in 100 tosses of the coin. Kneale (1949) notes that it is often supposed that the theorem is:

A mysterious law of nature which guarantees that in a sufficiently large number of trials a probability will be "realized as a frequency."

A misunderstanding of Bernoulli's theorem is responsible for one of the commonest fallacies in the estimation of probabilities, the fallacy of the maturity of the chances. When a coin has come down heads twice in succession, gamblers sometimes say that it is more likely to come down tails next time because "by the law of averages" (whatever that may mean) the proportion of tails must be brought to right sometime. (Kneale, 1949, pp. 139-140)

The fact is, of course, that the theorem is no more and no less than a formal mathematical proposition and not a statement about realities and, as Newman (1956) puts it, "neither capable of validating 'facts' nor of being invalidated by them" (vol. 3, p. 1450). Now this statement accurately represents the situation, but it must not be passed by without comment. If reality departs severely from the law, then clearly we do not doubt reality, and nor do we doubt the law. What we might do, however, is to doubt some of our initial assumptions about the observed situation. If red were to turn up 100 times in a row at the roulette table, it might be prudent to bet on red on the 101st spin because this sequence departs so much from expectation that we might conclude that the wheel is "stuck" on red. In fact, we might use this observation as support for an assertion that the wheel is not fair. This example moves us away from probability theory and into the realm of practical statistical inference, an inference about reality that is based on probabilities indicated by an abstract mathematical proposition.
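Bernoulli's theorem is easy to watch in action. The simulation below, an illustration of the theorem's content rather than of Bernoulli's own proof, tosses a fair coin and tracks the proportion of heads as n grows: the proportion settles toward 0.5 even as the absolute excess of heads over n/2 is free to wander more widely.

    # Watching the law of large numbers: the proportion of heads converges
    # toward p = 0.5 while the absolute deviation from n/2 tends to grow.
    import random

    random.seed(1)  # any seed will do; fixed here so the run is repeatable
    heads = 0
    for n in range(1, 100_001):
        heads += random.random() < 0.5
        if n in (100, 1_000, 10_000, 100_000):
            print(f"n={n:>6}  proportion={heads / n:.4f}  excess={heads - n / 2:+.0f}")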

THE MEANING OF PROBABILITY

A problem has been raised. The difficulty inherent in any attempt to apply Bernoulli's theorem is that we assume that we know the prior probability p of a given event on a single trial - that, for example, we know that the value of p for the appearance of heads when a coin is tossed is ½. If we do know p, then predictions can be made, and, given a series of observations, inferences may be drawn from them. The problem is to state the grounds for setting p at a particular value, and this raises the question of what it is that we mean by probability. The issue was always there, but it was not really confronted until the 19th century.
The title of this section is taken from Ernest Nagel's paper presented at the annual meeting of the American Statistical Association in 1935 and printed in its journal in 1936. Nagel's clear and penetrating account does not, as he admits, resolve the difficult problems raised in attempts to examine the concept, problems that are still the source of sometimes quite heated debate among mathematicians and philosophers. What it does is present the alternative views and comment on the conclusions that adoption of one or other of the positions entails.⁴ Nagel first examines the methodological principles that can be brought to bear on the question; then he delineates the contexts in which ideas of probability may be found; and, finally, he examines the three interpretations of probability that are extant. What follows draws heavily on Nagel's review.

⁴ Acree (1978) gives an extensive review that, for the present writer, is sometimes difficult to follow.
Appeals to probabilities are found in everyday discussion, applied statistics and measurement, physical and biological theories, the comparison of competing theories, and in the mathematical probability calculus. The three important interpretations of the concept are the classical; the notion of probability as "rational belief"; and the long-run relative frequency definition.
Laplace (1749-1827) and De Morgan (1806-1871) were the chief exponents of the classical view that regards probability as, in De Morgan's words, "a state of mind with respect to an assertion" (De Morgan, 1838). The strength of our belief about any given proposition is its probability. This is what is meant in everyday discourse when we say (though perhaps not every day!) that "It is probable that gaming emerged from game-playing." In evaluating competing theories this view is also evident, as when it is said by some, for example, that, "Biologically based trait theories of personality are more probably correct than social learning theories." This interpretation is not what is meant when statistical assertions about the weather, the probability of precipitation, for example, or the outcome of an atomic disintegration are made.
John Maynard Keynes (later Lord Keynes, 1883-1946) interpreted probability as rational belief, and it has been asserted that this is the most appropriate view for most applications of the term. Evidence about the moral character of witnesses would lead to statements about the probability of the truth of one person's testimony rather than another's or, in examining the results of evidence that attempted to show, for example, that a social learning theory interpretation of achievement motivation is more probable than one based on trait theory. These statements rest on rational insights into the relationship between evidence and conclusion, not on statistical frequencies, because we do not have such information. It is Nagel's view, and the view of others, that both the classical and the Keynesian views violate an important methodological principle - that of verifiability:

On Keynes' view a degree of probability is assignable to a single proposition with respect to given evidence. But what verifiable consequences can be drawn from the statement that with respect to the evidence the proposition that on the next throw with a given pair of dice a 7 will appear, has a probability of 1/6? For on the view that it is significant to predicate a probability to the single instance there is nothing to be verified or refuted. (Nagel, 1936, p. 18)

The conclusion is that the views we have described "cannot be regarded as a satisfactory analysis of probability propositions in the sciences which claim to abide by this canon" (Nagel, 1936, p. 18).
The third interpretation of probability is the statistical notion of long-run relative frequency. Bolzano (1781-1848), Venn (1834-1923), Cournot (1801-1877), Peirce (1839-1914), and, more recently, von Mises (1883-1953) and Fisher (1890-1962) are among the scientists and mathematicians who supported this view. In a simple example, if you are told that your probability of survival and recovery after a particular surgical procedure is 80%, then this means that of 100 patients who have undergone this operation, 80 have recovered.
The majority of working scientists in psychology who rely on statistical manipulations in the examination of data implicitly or explicitly subscribe to this meaning. Whether or not it is appropriate to the endeavor and whether or not the consequences of its acceptance are fully appreciated is a question to be considered. It is important to note, and this is not meant to be facetious, that following the operation you are either alive or you are dead. In other words, the probability statement is about a relationship between events, or rather propositions concerning the occurrence of events, not about a single event. In this respect the frequentists have no quarrel with the Keynesians.
The most common logical objection (in fact, it is a class of objections) to the frequency view concerns the vague term long-run. Clearly, in order to establish a frequency ratio the run has to stop somewhere, and yet probability values are defined as the limit of an infinite sequence. Nagel's answer to the objection, which von Mises (1957) also deals with at some length, is that empirically obtained ratios can be used as hypotheses for the true values of the infinite series, and these hypotheses can be tested. Further, probability values might be arrived at by using some other theory than that obtained from a statistical series, for example, using theoretical mechanics to predict the probability of a fair coin turning up heads. Again the predicted probability can be tested empirically. Nagel also argues that Keynesian examples about the credibility of witnesses could be examined in frequency terms; for instance, "the relative frequency with which a regular church-goer tells lies on important occasions is a small number considerably less than ½" (Nagel, 1936, p. 23). It is obviously the case that there is more than one conception of probability and that each one has value in its particular context. Nagel concludes that the frequency view is satisfactory for "every-day discourse, applied statistics and measurement, and within many branches of the theoretical sciences" (p. 26).
It has been noted that psychological scientists have accepted this view. Nevertheless, the fact that probability has to do both with frequencies and with degrees of belief is the aleatory and epistemological duality that is clearly and cogently discussed by Hacking (1975). We blur the issue as we compute our statistics and speak of the confidence we have in our results. The fact that the answer to the question "Who, in practice, cares?" is "Probably very few," is based on an admittedly informal frequency analysis, but it is one in which we can believe!

Probability and the Foundations of Error Estimation

The practical spur to the development of probability theory was simply the attempt to formulate rules that, if followed assiduously, would lead to a gambler making a profit over the long run, although, of course, it was not possible to assert how long the run might have to be for profit to be assured. Probability was therefore bound up with the notion of expectancies, and the 150 years that followed the end of the 17th century saw its constructs being applied to expectancies in life insurance and the sale of annuities. John de Witt (1625-1672) and John Hudde (1628-1704), who calculated annuities based on mortality statistics and whose work was mentioned earlier, corresponded and consulted with Huygens and drew heavily on his writings.
An important figure in the history of probability, although the extent of his contribution was not fully recognized until long, long after his death, was Abraham De Moivre (1667-1754). His book The Doctrine of Chances: or A Method of Calculating the Probabilities of Events in Play was first published in 1718, but it is the second (1738) and third (1756) editions of the work that are of most interest here. De Moivre's A Treatise of Annuities on Lives, editions of which were published in 1724, 1743, and 1750, applies The Doctrine of Chances to the "valuation of annuities." The Treatise is to be found bound together with the 1756 edition of the Doctrine.
Here the problems of the gaming rooms and the questions of actuarial prediction in the mathematics of expectancies are linked. A third strand was not to emerge for over 50 years with investigations on the estimation of error.
This latter is associated most closely with the work of Laplace (1749-1827) and Gauss (1777-1855) and the derivation of "The Law of Frequency of Error," or the normal curve as it came to be known. And yet we now know that this fundamental function had, in fact, been demonstrated by De Moivre, who first published it in 1733 (the Approximatio ad Summam Terminorum Binomii (a+b)ⁿ in Seriem Expansi), and English translations of the work are to be found in the last two editions of The Doctrine of Chances. The Approximatio is examined by Todhunter (1820-1884), whose monumental History of the Mathematical Theory of Probability From the Time of Pascal to That of Laplace, published in 1865, still stands as one of the most comprehensive and one of the dullest books on probability. Todhunter devotes considerable space in his book to De Moivre's work, but, according to Karl Pearson (1926), he "misses entirely the epoch-making character of the 'Approximatio' as well as its enlargement in the 'Doctrine.' He does not say: Here is the original of Stirling's Theorem, here is the first appearance of the normal curve, here De Moivre anticipated Laplace as the latter anticipated Gauss" (Pearson, 1926, p. 552).
It is true that Todhunter does not state any of the above precisely, and it is therefore now often reported that Karl Pearson discovered the importance of the Approximatio in 1922 while he was preparing for his lecture course on the history of statistics that he gave at University College London between 1921 "and certainly continuing up till 1929" (E. S. Pearson & Kendall, 1970, p. 479). Karl Pearson published an article on the matter in 1924 (K. Pearson, 1924a) and some details are provided by Daw and E. S. Pearson (1972/1979). Quite how much credit we can give Pearson for the revelation and how justified he was in his judgment that "Todhunter fails almost entirely to catch the drift of scientific evolution" (Pearson, 1926, p. 552) can be debated. It is certainly the case that one is not carried along by the excitement of the progression of discovery as one ploughs one's way through Todhunter's book, but he does say, in a section where he refers to the pages of the third edition of the Doctrine, which give the Approximatio and show an important result:

Thus we have seen that the principal contributions to our subject from De Moivre are his investigations respecting the Duration of Play, his Theory of Recurring Series, and his extension of the value of Bernoulli's Theorem by the aid of Stirling's Theorem. Our obligations to De Moivre would have been still greater if he had not concealed the demonstrations of the important results which we have noticed ...; but it will not be doubted that the Theory of Probability owes more to him than to any other mathematician, with the sole exception of Laplace. (Todhunter, 1865/1965, pp. 192-193)

It is also worth noting that De Moivre's contribution was remarked on by the American historian of psychology, Edwin Boring, in 1920, before Pearson, and without so much fuss. In a footnote referring to the "normal law of error" he says, "The so-called 'Gaussian' curve. The mathematical propadeutics for this function were prepared as long ago as the beginning of the 18th century (De Moivre, 1718 [sic]). See Todhunter ..." (Boring, 1920, p. 8).
A more detailed discussion of the Approximatio and the properties of the normal distribution is given in the next chapter. The roles of Laplace and Gauss in the derivation and applications of probability theory are of very great importance. Laplace brought together a great deal of work that had been published separately as memoirs in his great work Theorie Analytique des Probabilites, first published in 1812, with further editions appearing in 1814 and 1820. Todhunter (1865/1965) devotes almost one quarter of his book to Laplace's work. From the standpoint of statistics as they developed in the social sciences, it is perhaps only necessary to mention two of his contributions: the derivation of the "Law of Error," and the notion of inverse probability. Laplace's theorem, the proof of which, in modern terms, may be found in David (1949), and which had been anticipated by De Moivre, shows the relationship between the binomial series and the normal curve. Put very simply, and in modern parlance, as n in (a+b)ⁿ approaches infinity, the shape of the discrete binomial distribution approaches the continuous bell-shaped normal curve. It is therefore possible to express the sum of a number of binomial probabilities by means of the area of the normal curve, and indeed the value of this sum derived from the familiar tables of proportions of area under the normal distribution is quite close to the exact value even for n of only 10. The details of this procedure are to be found in many of the elementary manuals of statistics.
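
As a rough check of that claim (the numbers below are illustrative, not taken from the manuals), the exact binomial sums for n = 10 can be compared with areas under the normal curve:

```python
import math

# A rough check of the normal approximation to binomial probabilities,
# here for n = 10 and p = 1/2 (illustrative values, not from the text).
n, p = 10, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

def normal_cdf(x):
    # area under the normal curve to the left of x
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

for k in range(n + 1):
    exact = sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))
    approx = normal_cdf(k + 0.5)  # with a continuity correction
    print(f"P(X <= {k:2d})  exact = {exact:.4f}  normal approximation = {approx:.4f}")
```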
Laplace's work gives a number of applications of probability theory to practical questions, but it is the bell-shaped curve as it is applied to the estimation of error that is of interest here. Both Gauss and Laplace investigated the question independently of each other. Errors are inescapable, even though they may not always be of critical importance, in all observations that involve measurement. Despite the most sophisticated of instruments and the most skilled users, measurements of, say, distances between stars in our galaxy at a particular time, or points on the surface of the earth, will, when repeated, not always produce exactly the same result. The assumption is made that the values that we require are definite and, to all intents and purposes, unchanging - that is to say, a true value exists. Variations from the true value are errors. Laplace assumed that every instance of an error arose because of the operation of a number of sources of error, each one of which may affect the outcome one way or the other. The argument proceeds to maintain that the mathematical abstraction that we now call the normal law represents the distribution of the resultant errors.
Gauss's approach is essentially the same as Laplace's but has a more unashamedly practical flavor. He assumed that there was an equal probability of errors of over-estimation and under-estimation and showed, by the method of least squares, that it is the arithmetic mean of the distribution of measurements that best reflects the true value of the measurement we wish to make. The distribution of measurements, the variation in which represents error on these assumptions, is precisely that of Laplace's theorem, or the normal curve, or, as it is sometimes termed, the Gaussian distribution.
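
A numerical sketch of that least-squares result, with hypothetical measurements of my own: the sum of squared deviations is smallest at the arithmetic mean.

```python
# Hypothetical repeated measurements of a single quantity.
measurements = [10.2, 9.8, 10.1, 10.4, 9.9]
mean = sum(measurements) / len(measurements)

def sum_of_squares(c):
    # sum of squared deviations of the measurements from a candidate value c
    return sum((x - c) ** 2 for x in measurements)

# The sum of squares at the mean is smaller than at nearby candidates.
for c in (mean - 0.1, mean, mean + 0.1):
    print(f"c = {c:.3f}  sum of squares = {sum_of_squares(c):.4f}")
```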

FORMAL PROBABILITY THEORY

From the 17th to the 20th century, mathematical approaches to probability have been algebraical and geometrical. The work of Pascal and Fermat led to combinatorial algebra. The invention of the calculus led to formal examinations of theoretical probability distributions. Gaussian methods are essentially geometrical. Always the probability calculus found itself in difficulties over the absence of a formal definition of probability that was universally accepted. Modern probability theory, but not, it must be noted, statistics, has escaped this difficulty by developing an arithmetical axiomatic model. The renowned German mathematician David Hilbert (1862-1943), Professor of Mathematics at Gottingen, presented a remarkable paper to the International Congress of Mathematics meeting in Paris in 1900. In it he presented mathematics with a series of no less than 23 problems that, he believed, would be the ones that mathematics should address during the 20th century.⁵ Among them was a call for a theory of probability based on axiomatic foundations.

⁵ In the actual talk Hilbert only had time to deal with 10 of the problems, but the entire manuscript, in English, can be found in the Bulletin of the American Mathematical Society (1902).

At that time, various attempts were being made to develop a theory of arithmetic based on a small number of fundamental postulates or axioms. These axioms have no basis in what is called the real world. Questions such as "What do they mean?" are excluded; ambiguity is avoided. The axioms themselves do not depend on any assumptions. For example, one of the best-known of the mathematical logicians, Giuseppe Peano (1858-1932), based his arithmetic on postulates such as "Zero is a number." This is not the place to attempt to discuss the difficulties that have been uncovered in the axiomatic approach or in set theory, with which it is closely allied. Suffice it to say that modern mathematical probability theory is largely based on the work of Andrei Kolmogorov (1903-1987), a Russian mathematician whose book Foundations of the Theory of Probability was first published in 1933. Set theory is most closely linked with the work of Georg Cantor (1845-1918), born in St. Petersburg of Danish parents but who lived most of his life in Germany. Set theory grew out of a new mathematical conception of infinity. The theory has profound philosophical as well as mathematical implications but its basis is easy to understand. It is a mathematical system that deals with defined collections, lists, or classes of objects or elements:

The theory of probability, as a mathematical discipline, can and should be developed from axioms in exactly the same way as Geometry and Algebra. This means that after we have defined the elements to be studied and their basic relations, and have stated the axioms by which these relations are to be governed, all further propositions must be based exclusively on these axioms, independent of the usual concrete meaning of these elements and their relations ...
... the concept of a field of probabilities is defined as a system of sets which satisfies certain conditions. What the elements of this set represent is of no importance in the purely mathematical development of the theory of probability. (Kolmogorov, 1933/1956, p. 1)

The system cannot be fully delineated here, but this chapter will end with an illustration of its mathematical elegance in arriving at the well-known rules of mathematical probability theory. This account follows Kolmogorov closely, although it does not use his terminology and symbols precisely.

E is a collection of elements called elementary events. F is a set of subsets of E. The elements of F are random events.
The axioms are:
1. F is a field of sets.
2. F contains the set E.
3. To each set A in F is assigned a non-negative real number p(A). This is called the probability of event A.
4. p(E) = 1.
5. If A and B have no element in common, then p(A + B) = p(A) + p(B).

In set theory, A has a complementary set A'. In the language of random events, A' means the non-occurrence of A. To say that A is impossible is to write A = 0, and to say that A must occur is to write A = E.
Because A + A' = E, and from the 4th and 5th axioms, p(A) + p(A') = 1 and p(A) = 1 − p(A'); and because E' = 0, p(E') = 0. Axiom 5 is the addition rule.
If p(A) > 0, then p(B|A) = p(AB)/p(A), which is the conditional probability of the event B under condition A. From this follows the general formula known as the multiplication rule, p(AB) = p(A)p(B|A), and, as we show in chapter 7, this leads to Bayes' theorem. Note that p(B|A) ≥ 0, p(E|A) = 1, and p{(B + C)|A} = p(B|A) + p(C|A).
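
For a finite space the rules can be checked mechanically; a minimal sketch (mine, not Kolmogorov's) with a fair die:

```python
from fractions import Fraction

# A fair die as a finite probability space; events are subsets of E.
E = frozenset({1, 2, 3, 4, 5, 6})

def p(event):
    # each elementary event equally likely
    return Fraction(len(event), len(E))

A = frozenset({2, 4, 6})   # "even"
B = frozenset({4, 5, 6})   # "greater than 3"

assert p(E) == 1                                  # axiom 4
assert p(A | (E - A)) == p(A) + p(E - A) == 1     # addition rule with A and A'

def cond(b, a):
    # p(B|A) = p(AB)/p(A)
    return p(a & b) / p(a)

assert p(A & B) == p(A) * cond(B, A)              # multiplication rule
print("p(B|A) =", cond(B, A))                     # 2/3
```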
For readers of a mathematical inclination who wish to see more of the development of this approach (it does get a little more difficult), Kolmogorov's book is concise and elegant.
6
Distributions

When Graunt and Halley and Quetelet made their inferences, they made them on the basis of their examination of frequency distributions. Tables, charts, and graphs - no matter how the information is displayed - all can be used to show a listing of data, or classifications of data, and their associated frequencies. These are frequency distributions. By extension, such depictions of the frequency of occurrence of observations can be used to assess the expectation of particular values, or classes of values, occurring in the future. Real frequency distributions can then be used as probability distributions. In general, however, the probability distributions that are familiar to the users of statistical techniques are theoretical distributions, abstractions based on a mathematical rule, that match, or approximate, distributions of events in the real world. When bodies of data are described, it is the graph and the chart that are used. But the theoretical distributions of statistics and probability theory are described by the mathematical rules or functions that define the relationships between data, both real and hypothetical, and their expected frequencies or probabilities.
Over the last 300 years or so, the characteristics of a great many theoretical distributions, all of which have been found to have some practical utility in one situation or another, have been examined. The following discussion is limited to three distributions that are familiar to users of basic statistics in psychology. An account of some fundamental sampling distributions is given later.

THE BINOMIAL DISTRIBUTION

In the years 1665-1666, when Isaac Newton was 23 and had just earned his degree, his Cambridge college (Trinity) was closed because of the plague. Newton went home to Woolsthorpe in Lincolnshire and began, in peace and leisure, a scientific revolution. These were the years in which Newton developed some of the most fundamental and important of his ideas: universal gravitation, the composition of light, the theory of fluxions (the calculus), and the binomial theorem.
The binomial coefficients for integral powers had been known for many centuries, but fractional powers were not considered until the work of John Wallis (1616-1703), Savilian Professor of Geometry at Oxford, and the most influential of Newton's immediate English predecessors. However, expansions of expressions such as (x − x²)^(1/2) were achieved by Newton early in 1665. He announced his discovery of the binomial theorem in 1676 in letters written to the Secretary of the Royal Society, although he never formally published it nor did he provide a proof. Newton proceeded from earlier work of Wallis, who published the theorem, with credit to Newton, in 1685. The problem was to find the area under the curve with ordinates (1 − x²)ⁿ. When n is zero the first two terms of the area are x − 0(x³/3), and when n is 1 they are x − 1(x³/3). Newton, using the method of interpolation employed so much by Wallis, reasoned that when n was ½ the corresponding terms should be x − ½(x³/3). He arrived at the series

x − (1/2)(x³/3) − (1/8)(x⁵/5) − (1/16)(x⁷/7) − ...

and then discovered that the same result could be obtained by deriving, and subsequently integrating, the binomial expansion of (1 − x²)^(1/2). The interesting and important point to be noted is that Newton's discovery was not made by considering the binomial coefficients of Pascal's triangle but by examining the analysis of infinite series, a discovery of much greater generality and mathematical significance. Figure 6.1 shows the binomial distribution for n = 7.
Newton's discovery of the calculus in his "golden years" at Woolsthorpe establishes him as its originator, but it was Gottfried Wilhelm Leibniz (1646-1716), the German philosopher and mathematician, who has priority of publication, and it is pretty well established that the discoveries were independent. However, a bitter quarrel developed over the claims to priority of discovery, and allegations were made that, on a visit to London in 1673, Leibniz could have seen the manuscript of Newton's De Analysi per Aequationes Numero Terminorum Infinitas, which, though written in 1669, was not published until 1711. Abraham De Moivre was among those appointed by the Royal Society in 1712 to report on the dispute. De Moivre made extensive use of the method in his own work, and it was his Approximatio, first printed and circulated to some friends in 1733, that links the binomial to what we now call the normal

FIG. 6.1 The Binomial Distribution for N = 7

distribution. The Approximatio is included in the second (1738) and third (1756) editions of the Doctrine.
It should be mentioned that a Scottish mathematician, James Gregory (1638-1675), working at the time (1664-1668) in Italy, derived the binomial expansion and produced important work on the mathematics of infinite series, discovered quite independently of Newton.

THE POISSON DISTRIBUTION

Before the structure of the normal distribution is examined, the work of Simeon-Denis Poisson (1781-1840) on a useful special case of the binomial will be described. The Ecole Polytechnique was founded in Paris in 1794. It was the model for many later technical schools, and its methods inspired the production of many student texts in mathematics and engineering which are the forerunners of present-day textbooks. Among the brilliant mathematicians of the Ecole during the earlier years of the 19th century was Poisson. His name is a familiar label in equations and constants in calculus, mechanics, and electricity. He was passionately devoted to mathematics and to teaching, and published over 400 works. Among these was Recherches sur la Probabilite des Jugements in 1837. This contains the Poisson Distribution, sometimes called Poisson's law of large numbers. It was noted earlier that as n in (P + Q)ⁿ increases, the binomial distribution tends to the normal distribution. Poisson considered the case where
as n increases toward infinity, P decreases toward zero, and nP remains constant. The resulting distribution has a remarkable application.
Data collected by insurance companies on relatively rare accidents, say, people trapping their fingers in bathroom doors, indicates that the probability of this event happening to any one individual is very low, in fact near zero. However, a certain number of such accidents (X) is reported every year, and the number of these accidents varies from year to year. Over a number of years a statistical regularity is apparent, a regularity that can be described by Poisson's distribution.
If we set X at k, an integer, then

p(X = k) = λᵏe^(−λ)/k!

where λ is any positive number, e is the constant 2.7183 ..., and k! is factorial k.
Although the distribution is not commonly to be found in the basic statistics texts in psychology, it is used in the social sciences and it does have a surprising range of applications. It has been used to fit distributions in, for example, quality control (defects per number of units produced), numbers of patients suffering from certain specific diseases, earthquakes, wrong-number telephone connections, the daily number of hits by flying bombs in London during World War II, misprints in books, and many others.
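
As a quick illustration, with a hypothetical mean rate of λ = 2 events per year (my number, not one from the text), the formula gives:

```python
import math

# Poisson probabilities p(X = k) = (lambda^k)(e^(-lambda))/k!
lam = 2.0  # hypothetical mean rate, e.g., accidents per year

for k in range(6):
    p_k = lam**k * math.exp(-lam) / math.factorial(k)
    print(f"p(X = {k}) = {p_k:.4f}")
```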

FIG. 6.2 The Poisson Distribution applied to Alpha Emissions (Rutherford & Geiger, 1910)

Poisson attempted to extend the possible utility of probability theory, for he applied it to testimony and to legal decisions. These applications received much criticism but Poisson greatly valued them. Poisson formally discussed the concepts of a random quantity and cumulative distribution functions, and these are significant theoretical contributions. But his name and work in probability does not occupy much space in the literature, perhaps because he was overshadowed by famous contemporaries such as Laplace and Gauss. Sheynin (1978) has given us a comprehensive review of his work in the area. An example of a Poisson distribution is given in Figure 6.2.

THE NORMAL DISTRIBUTION

The binomial and Poisson distributions stand apart from the normal distribution because they are applied to discrete frequency data. The invention of the calculus provided mathematics with a tool that allowed for the assessment of probabilities in continuous distributions. The first demonstration of integral approximation, to the limiting case of the binomial expansion, was given by De Moivre. In the Approximatio, De Moivre begins by acknowledging the work of James Bernoulli:

Altho' the Solution of Problems of Chance often requires that several Terms of the Binomial (a + b)ⁿ [this is modern notation] be added together, nevertheless in very high Powers the thing appears so laborious, and of so great difficulty, that few people have undertaken that Task; for besides James and Nicholas Bernoulli, two great Mathematicians, I know of no body that has attempted it; in which, tho' they have shown very great skill, and have the praise which is due to their Industry, yet some things were farther required; for what they have done is not so much an Approximation as the determining very wide limits, within which they demonstrated that the Sum of Terms was contained. (De Moivre, 1756/1967, 3rd Ed., p. 243)

De Moivre proceeds to show how he arrived at the expression of the ratio of the middle term to the sum of all the terms in the expansion of (1 + 1)ⁿ when n is a very high power. His answer was 2/(B√n), where "B represents the Number of which the Hyperbolic Logarithm is 1 − 1/12 + 1/360 − 1/1260 + 1/1680, &c." He acknowledges the help of James Stirling who had found that "B did denote the Square-root of the Circumference of a Circle whose Radius is Unity, so that if that Circumference be called c, the Ratio of the middle Term to the Sum of all the Terms will be expressed by 2/√(nc)" (De Moivre, 1756, 3rd Ed., p. 244).
De Moivre had thus obtained (in modern notation) the expression

Y₀ = 2/√(2πn), for large n, where Y₀ is the middle term.

He also gives the logarithm of the ratio of the middle term to any term distant from it by an interval l as

(m + l − ½)log(m + l − 1) + (m − l + ½)log(m − l + 1) − 2m·log(m) + log((m + l)/m),

where m = ½n, and concludes, in the first of nine corollaries numbered 1 through 6 and 8 through 10, 7 having been omitted from the numbering, that, "if m or ½n be a Quantity infinitely great, then the Logarithm of the Ratio, which a Term distant from the middle by the Interval l, has to the middle Term, is −2ll/n" (p. 245). This is merely the expression that Yₗ = Y₀e^(−2l²/n), for large n.
The second corollary obtains the "Sum of the Terms intercepted between the Middle, and that whose distance from it ... [is] denoted by l", in modern terms, the sum of Y₀ + Y₁ + Y₂ + Y₃ + ... + Yₗ, as

(2/√(nc))[l − 2l³/(1·3n) + 4l⁵/(2·5n²) − 8l⁷/(6·7n³) + ...],

which is the expansion of the integral

(2/√(nc)) ∫₀ˡ e^(−2t²/n) dt.

When l is expressed as s√n, and s interpreted as ½, the sum becomes

... which converges so fast, that by help of no more than seven or eight Terms, the Sum required may be carried to six or seven places of Decimals: Now that Sum will be found to be 0.427812, independently from the common Multiplicator 2/√(nc), and therefore to the Tabular Logarithm of 0.427812, which is 9.6312529, adding the Logarithm of 2/√c, viz. 9.9019400, the sum will be 19.5331929, to which answers the number 0.341344. (De Moivre, 1756, 3rd Ed., p. 245)

This familiar final figure is the area under the curve of the normal distribution between the mean (which is, of course, also the middle value) and an ordinate one standard deviation from the mean. In the third corollary De Moivre says:

And therefore, if it was possible to take an infinite number of Experiments, the Probability that an Event which has an equal number of Chances to happen or fail, shall neither appear more frequently than ½n + ½√n times, nor more rarely than ½n − ½√n times, will be expressed by the double Sum of the number exhibited in the second Corollary, that is, by 0.682688, and consequently the Probability of the contrary ... will be 0.317312, those two Probabilities together compleating Unity, which is the measure of Certainty. (De Moivre, 1756, 3rd Ed., p. 246)¹

¹ The value for the proportion of area between ±1σ is 0.6826894, so that De Moivre was out by one unit in the sixth decimal place.

½√n is what today we call the standard deviation. De Moivre did not name it but he did, in Corollary 6, say that √n "will be as it were the Modulus by which we are to regulate our Estimation" (De Moivre, 1756, 3rd Ed., p. 248). In fact what De Moivre does is to expand the exponential and to integrate from 0 to s√n.
In Corollary 6 De Moivre notes that if l is interpreted as √n rather than ½√n, then the series does not converge so fast and that more and more terms would be required for a reasonable approximation as l becomes a greater proportion of √n,

... for which reason I make use in this Case of the Artifice of Mechanic Quadratures, first invented by Sir Isaac Newton ...; it consists in determining the Area of a Curve nearly, from knowing a certain number of its Ordinates A, B, C, D, E, F, &c. placed at equal Intervals. (De Moivre, 1756, 3rd Ed., p. 247)

He uses just 4 ordinates for his quadrature and finds, in effect, that the area between ±2σ, or ½n ± √n, is 0.95428, and that the area in what we now call the tails is 0.04572. The true value is a little less than this but it is, nevertheless, familiar.
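
De Moivre's figures are easily compared with a modern computation of the same normal areas (a quick check of mine, not part of the original texts):

```python
import math

# Proportion of the normal curve within +/- z standard deviations of the mean.
def area_within(z):
    return math.erf(z / math.sqrt(2))

print(f"within 1 sigma: {area_within(1):.6f}")  # 0.682689; De Moivre gave 0.682688
print(f"within 2 sigma: {area_within(2):.6f}")  # 0.954500; his quadrature gave 0.95428
```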
These results can be extended to the expansion of (a + b)ⁿ where a and b are not equal.

If the Probabilities of happening and failing be in any given Ratio of inequality, the Problems relating to the Sum of the Terms of the Binomial (a + b)ⁿ will be solved with the same facility as those in which the Probabilities of happening and failing are in a Ratio of Equality. (De Moivre, 1756/1967, 3rd Ed., p. 250)

In Corollary 9, De Moivre in effect, and in modern terms, introduces √(npq), the expression we use today for the standard deviation of the normal approximation to the binomial distribution.
The sum and substance of the Approximatio is that it gives, for the first time, the function that was rediscovered much later, the function that dominates so-called classical statistical inference - the normal distribution - which in modern terminology is given by the density function

f(x) = (1/(σ√(2π)))e^(−(x−μ)²/(2σ²))

The normal distribution is shown in Figure 6.3.
De Moivre's philosophical position is revealed in the sections headed "Remark I" in the 1738 edition of the Doctrine and an additional, and much longer, "Remark II" in 1756. De Moivre sets his work in the philosophical context of an ordered determinate universe. His notion of Original Design (see the quotation in chapter 1) is a notion that persisted at least down to Quetelet. A powerful deity reveals the grand design through statistical averages and stable statistical ratios. Chance produces irregularities. As Pearson remarked:

There is much value in the idea of the ultimate laws being statistical laws, though why the fluctuations should be attributed to a Lucretian 'Chance', I cannot say. It is not an exactly dignified conception of the Deity to suppose him occupied solely with first moments and neglecting second and higher moments! (Pearson, 1978, p. 160)

and elsewhere:

The causes which led De Moivre to his "Approximatio" or Bayes to his theorem were more theological and sociological than mathematical, and until one recognizes that the post-Newtonian English mathematicians were more influenced by Newton's theology than by his mathematics, the history of science in the eighteenth century - in particular that of the scientists who were members of the Royal Society - must remain obscure. (Pearson, 1926, p. 552)

FIG. 6.3 The Normal Distribution



It is interesting that this is precisely the sort of analysis that has been brought to bear on the work of Galton and Karl Pearson himself, save that it is the philosophy of eugenics that influenced their work, rather than Christian theology. In Remark II, De Moivre takes up Arbuthnot's argument for the ratio of male to female births, which was discussed in chapter 4, defending the argument against the criticisms that had been advanced by Nicholas Bernoulli, who had noted that a chance distribution of the actual male/female birth ratio would be found if the hypothesized ratio (i.e., the ratio under what we would now call the null hypothesis) had been taken to be 18:17 rather than 1:1. But De Moivre insists:

This Ratio once discovered, and manifestly serving to a wise purpose, we conclude the Ratio itself, or if you will the Form of the Die, to be an Effect of Intelligence and Design. As if we were shewn a number of Dice, each with 18 white and 17 black faces, which is Mr. Bernoulli's supposition, we should not doubt but that those Dice had been made by some Artist; and that their form was not owing to Chance, but was adapted to the particular purpose he had in View. (De Moivre, 1756/1967, 3rd Ed., p. 253)

With the greatest respect to De Moivre, this was clearly not Arbuthnot's argument, and De Moivre's view that he might have said it is somewhat specious. Like Quetelet's use of the normal curve many years later, De Moivre's view is a prejudgment and all findings must be made to fit it. Karl Pearson (1978) makes essentially the same point, but it is apparently a point that he did not recognize in his own work.
7
Practical Inference

Some of the philosophical questions surrounding induction and inference were dealt with in chapter 2. Here the foundations of practical inference are considered.

INVERSE PROBABILITY AND THE FOUNDATIONS OF INFERENCE

The first exercises in statistical inference arose from a consideration of statistical summaries such as those found in mortality tables. The development of theories of inference from the standpoint of the implications of mathematical theory can be dated from the work of Thomas Bayes (1702-1761), an English Nonconformist clergyman whose ministry was in Tunbridge Wells. Bayes was recognized as a very good mathematician, although he published very little, and he was elected to the Royal Society in 1742 (see Barnard, 1958, for a biographical note). Curiously enough, the paper for which he is remembered was communicated to the Royal Society by his friend Richard Price¹ more than 2 years after his death, and the celebrated forms of the theorem that bear his name, although they follow from the essay, do not actually appear in the work. An Essay Towards Solving a Problem in the Doctrine of Chances (Bayes, 1763) is still the subject of much discussion and controversy both as to its contents and implications and as to how much of its import was contributed by its editor, Richard Price.

¹ Price was also a Unitarian Church minister. His church still stands in Newington Green in north London and is the oldest Nonconformist church building still being so used in London. Next door and abutting the church is a licensed betting shop. It has been noted that Bayesian analysis, resting, as it does, on the notion of conditional probability, is akin to gambling.

The problem that Bayes addressed is stated by Price in the letter accompanying his submission of the essay:

Mr De Moivre ... has ... after Bernoulli, and to a greater degree of exactness, given rules to find the probability there is, that if a very great number of trials be made concerning any event, the proportion of the number of times it will happen, to the number of times it will fail in those trials, should differ less than by small assigned limits from the proportion of the probability of its happening to the probability of its failing in one single trial. But I know of no person who has shewn how to deduce the solution of the converse problem to this; namely, "the number of times an unknown event has happened and failed being given, to find the chance that the probability of its happening should lie somewhere between any two named degrees of probability." (Bayes, 1763, pp. 372-373)

A demonstration of some simple rules of mathematical probability, using a frequency model, will help to derive and illustrate Bayes' Theorem. The probability of drawing a red card from a standard deck of playing cards is 26/52, or p(R) = 1/2. The probability of drawing a picture card from the deck is 12/52, or p(P) = 3/13. The probability of drawing a red picture card is 6/52, or p(R & P) = 3/26. What is the probability of drawing either a red card or a picture card? The answer is, of course, 32/52, or p(R or P) = 8/13. Note that:

p(R or P) = p(R) + p(P) − p(R & P), [8/13 = (1/2 + 3/13) − 3/26].

This is known as the addition rule.
Suppose that you draw a card from the deck, but you are not allowed to see it. What is the probability that it is a red picture card? We can calculate the answer to be 6/52. Now suppose that you are told that it is a red card. What now is the probability of it being a picture card? The probability of drawing a red card is 26/52. We also can figure out that if the card drawn is a red card then the probability of it also being a picture card is 6/26. In fact,

p(R & P) = p(R)[p(P|R)], [6/52 = 26/52(6/26)].

The term p(P|R) symbolizes the conditional probability of P, that is, the probability of P given that R has occurred, and the expression denotes the multiplication rule. Note that p(R & P) is the same as p(P & R), and that this is equal to p(P)[p(R|P)], or 6/52 = 12/52(6/12). From this fact we see that:

p(P|R) = p(P)[p(R|P)]/p(R),

which is the simplest form of Bayes' Theorem. In current terminology, the left-hand side of this equation is termed the posterior probability, the first term on the right-hand side the prior probability, and the ratio p(R|P)/p(R) the likelihood ratio.
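
These identities can be verified mechanically with exact fractions; a minimal sketch of my own, not from the text:

```python
from fractions import Fraction

# The card-deck probabilities above, checked with exact arithmetic.
p_R = Fraction(26, 52)   # red card
p_P = Fraction(12, 52)   # picture card
p_RP = Fraction(6, 52)   # red picture card

# Addition rule: p(R or P) = p(R) + p(P) - p(R & P)
assert p_R + p_P - p_RP == Fraction(32, 52)

# Conditional probabilities and the multiplication rule
p_P_given_R = p_RP / p_R   # 6/26
p_R_given_P = p_RP / p_P   # 6/12
assert p_RP == p_R * p_P_given_R

# The simplest form of Bayes' Theorem: p(P|R) = p(P)p(R|P)/p(R)
assert p_P_given_R == p_P * p_R_given_P / p_R
```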
From the addition rule we can also show that the probability of a red card is the sum of the probabilities of a red picture card and a red number card minus the probability of a picture number card (which does not exist!), or:

p(R) = p(R & P) + p(R & N) − p(P & N).

But p(P & N) = 0 because the events picture card and number card are mutually exclusive. So Bayes' Theorem may be written:

p(P|R) = p(P)[p(R|P)]/[p(R & P) + p(R & N)].

The fact that this formula is arithmetically correct may be checked by substituting the values we can obtain from the known distribution of cards in the standard deck. However, this does not demonstrate the alleged utility of the theorem for statistical inference. For that we must turn to another example after substituting D (for Data) in place of R, and H₁ (for Hypothesis 1) in place of P, and H₂ (for Hypothesis 2) in place of N, where H₁ and H₂ are two mutually exclusive and exhaustive hypotheses. We have:

p(H₁|D) = p(H₁)p(D|H₁)/[p(H₁)p(D|H₁) + p(H₂)p(D|H₂)].

Bayes' Theorem apparently provides a means of assessing the probability of a hypothesis given the data (or outcome), which is the inverse of assessing the probability of an outcome given a hypothesis (or rule), and of course we may envisage more than two hypotheses. In the original essay, Bayes demonstrated his construct using as an example the probability of balls coming to rest on one or other of the parts of a plane table. Laplace, who in 1774 arrived at essentially the same result as Bayes, but provided a more generalized analysis, used the problem of the probability of drawing a white ball from an urn containing unknown proportions of black and white balls given that a sampling of a particular ratio has been drawn. Traditionally, writers on Bayes make heavy use of "urn problems," and tradition is followed here with a simple example. Phillips (1973) gives a comprehensive account of Bayesian procedures and shows how they can be applied to data appraisal in the social sciences. Suppose we are presented with two identical urns W and B. W contains 70 white and 30
black balls, and B contains 40 white and 60 black balls. From one of the urns we are allowed to draw 10 balls, and we find that 6 of them are white and 4 of them are black. Is the urn that we have chosen more likely to be W or more likely to be B? Presumably most of us would opt for W. What is the probability that it was W?
p(H₁|D) is the probability of W given the data, and p(H₂|D) is the probability of B given the data. Now the probability of drawing a white ball from W is 7/10 and from B it is 4/10. The probability of the data given H₁ [p(D|H₁)] is (0.7)⁶(0.3)⁴, and p(D|H₂) is (0.4)⁶(0.6)⁴.
Now we apply Bayes' Theorem:

p(H₁|D) = p(H₁)(0.7)⁶(0.3)⁴/[p(H₁)(0.7)⁶(0.3)⁴ + p(H₂)(0.4)⁶(0.6)⁴].
In order to complete the calculation, we have to have values for p(H₁) and p(H₂) - the prior probabilities. And here is the controversy. The objectivist view is that these probabilities are unknown and unknowable because the state of the urn that we have chosen is fixed. There are no grounds for saying that we have chosen W or B. The personalist would say that we can always state some degree of belief. If we are completely ignorant about which urn was chosen, then both p(H₁) and p(H₂) can be expressed as 0.5. This is a formal statement of Laplace's Principle of Insufficient Reason, or Bayes' Postulate. If we put these values in our equation, p(H₁|D) works out to be 0.64, which matches common sense. This value could then be used to revise p(H₁) and p(H₂) and further data collected. For its proponents, the strength of Bayesian methods lies in the claim that they provide formal procedures for revising many kinds of opinion in the light of new data in a direct and straightforward way, quite unlike the inferential procedures of Fisherian statistics. The most important point that has to be accepted, however, is the justification for the assignment of prior probabilities. Although in this simple example there might seem to be little difficulty, in more complex situations, where probabilities, based on relative frequency², hunches, or on opinion, cannot be assigned precisely or readily, the picture is far less clear. Furthermore, the principle of the equal distribution of ignorance has itself come under a great deal of philosophical attack. "It is rather generally believed that he [Bayes] did not publish because he distrusted his postulate, and thought his scholium defective. If so he was correct" (Hacking, 1965, p. 201).

² Presumably it could be argued that our urn with its unknown ratio of black and white balls is one of an infinite distribution of urns with differing make-ups.
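
The arithmetic of the urn example is easy to check; a minimal sketch assuming, with the personalist, equal priors of 0.5:

```python
# The urn example: 6 white and 4 black balls drawn, equal priors of 0.5.
p_white_W, p_white_B = 0.7, 0.4   # chance of a white ball from urn W and urn B
prior_W = prior_B = 0.5           # Bayes' Postulate: equal distribution of ignorance

# Likelihoods of the sample under each hypothesis (the binomial
# coefficient is common to both terms and cancels).
like_W = p_white_W**6 * (1 - p_white_W)**4
like_B = p_white_B**6 * (1 - p_white_B)**4

posterior_W = prior_W * like_W / (prior_W * like_W + prior_B * like_B)
print(f"p(W | data) = {posterior_W:.2f}")  # about 0.64
```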

FISHERIAN INFERENCE

When the justification for the probabilistic basis of inference in the sense of revising opinion on the basis of data was thought of at all, it was the Bayesian approach that held sway until this century. Its foundations had come under attack primarily on the grounds that probability in any practical sense of the word must be based on relative frequency of observations and not on degrees of belief. Venn (1888) was perhaps the most insistent spokesman for this view. Sir Ronald Fisher (1890-1962), certainly the most influential statistician of all time, set out to replace it. In 1930 he published a paper that supposedly set out his notion of fiducial probability, claiming, "Inverse probability has, I believe, survived so long in spite of its unsatisfactory basis, because its critics have until recent times put forward nothing to replace it as a rational theory of learning by experience" (Fisher, 1930, p. 531).
There is no doubt that Fisher's methods, and the contributions of Neyman and Pearson that have been grafted on to them, have provided us with a set of inferential procedures. There seems to be considerable doubt as to whether he provided us with a coherent non-Bayesian theory. Fisher himself asserted that the concept of the likelihood function was fundamental to his new approach and distinguished it from Bayesian probability. Kendall (1963) in his obituary of Fisher says this:

It appears to me that, at this point [1922], his ideas were not very well thought out. Certainly his exposition of them was obscure. But, in retrospect, it becomes plain that he was thinking of a probability function f(x,θ) in two different ways: as the probability distribution of x for given θ, and as giving some sort of permissible range of θ for an observed x. To this latter he gave the name of 'fiducial probability distribution' (later to be known as a fiducial distribution) and in doing so began a long train of confusion; for it is not a probability distribution to anyone who rejects Bayes's approach, and indeed, may not be a distribution of anything. Fisher nevertheless manipulated it as if it were, and thereafter maintained an attitude of rather contemptuous surprise towards anyone who was perverse enough to fail in understanding his argument. ...
The position on both sides has been restated ad nauseam, without much attempt at reconciliation or, as I think, without an explicit recognition of the real point, which is that a man's attitude towards inference, like his attitude towards religion, is determined by his emotional make-up, not by reason or mathematics. (Kendall, 1963, p. 4)

Richard von Mises (1957) is baffled also:

Fisher introduces the term [likelihood] in order to denote something different from probability. As he fails to give a definition for either word, i.e., he does not indicate how the value of either is to be determined in a given case, we can only try to derive the intended meaning by considering the context in which he uses these words. ...
I do not understand the many beautiful words used by Fisher and his followers in support of the likelihood theory. The main argument, namely, that p [the probability of the hypothesis] is not a variable but an "unknown constant," does not mean anything to me. (von Mises, 1957, pp. 157-158)

It is tempting to leave it at that, but some attempt at capturing the flavor of Fisher's position must be made. Clearly, if one has knowledge of a distribution of probabilities of events, then that knowledge can be used to establish the probability of an event that has not yet been observed, for example, the probability that the next roll of two dice will produce a 7 (p = .167). What of the situation where an event has been observed - the roll did produce a 7 - can we say anything about the plausibility of an event with p = .167 having occurred? This is a decidedly odd sort of question because the event has indeed occurred! Before the draw from a lottery with 1,000,000 tickets is made, the probability of my winning is .000001. It will be difficult to convince me after the draw, as I clutch the winning ticket, that what has happened is impossible, or even very, very unlikely to have occurred by chance, and if you continue to insist I shall merely keep on showing you my ticket.
Fisher, in 1930, puts the situation in this way:

There are two different measures of rational belief appropriate to different cases. Knowing the population we can express our incomplete knowledge of, or expectation of, the sample in terms of probability; knowing the sample we can express our incomplete knowledge of the population in terms of likelihood. (Fisher, 1930, p. 532)

Likelihood then is a numerical measure of rational belief different from probability. Whether or not the logic of the situation is understood, all users of statistical methods will recognize the reasoning as crucial to both estimation and hypothesis testing. The method of maximum likelihood that Fisher propounds (although it had been put forward by Daniel Bernoulli in 1777; see Kendall, 1961) justifies the choice of population parameter. The method simply says that the best estimate of the parameter is the value that maximizes the probability of the observations, or that what has occurred is the most likely thing that could have occurred. A number of writers, for example, Hogben (1957), have stated that this assumption cannot be justified without an appeal to inverse probability, so that Fisher did not succeed in detaching inference from Bayes' Theorem. Nevertheless, the basic notion is that we can find the likelihood of a particular population parameter, say the correlation coefficient R, by defining it as a value that is proportional to the probability that from a population with that value we have obtained a sample with an observed value r.
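
A small sketch of the idea with hypothetical numbers of my own: for 6 successes observed in 10 trials, scanning candidate parameter values shows that the likelihood is maximized at the sample proportion.

```python
# Maximum likelihood in miniature: 6 successes observed in 10 trials.
def likelihood(p, successes=6, trials=10):
    # probability of the observed sample for a candidate parameter p
    return p**successes * (1 - p)**(trials - successes)

candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=likelihood)
print(f"maximum likelihood estimate: p = {best:.2f}")  # 0.60, the sample proportion
```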

There is no doubt that Fisher's argument will continue to be controversial and that many attempts to resolve the ambiguities will be made. Dempster (1964) is among those who have entered the fray, and Hacking's (1965) rationale has resulted in one well-known statistician (Bartlett, 1966) proposing that the resulting theory be renamed "the Fisher-Hacking theory of fiducial inference."

BAYES OR p < .05?

Criticism of Fisherian methods arrived almost as soon as they began to be adopted. In more recent years, criticism of null hypothesis significance testing (NHST) appears to have become an 'area' in its own right. Berkson (1938) probably led the way, when he noted that:

if the number of observations is extremely large - for instance on the order of 200,000 - the chi-square P will be small beyond any usual level of significance ... For we may assume that it is practically certain that any series of real observations does not follow a normal curve with absolute exactitude in all respects. (p. 526)

Berkson was expressing in essence what many, many other writers in the following 60 years have observed: that when H₀ is expressed as an exact null hypothesis (zero difference or no relationship) then very small deviations from this (dare one say it?) pedantic view will be declared significant!
Some of the more recently expressed doubts have been brought together by Henkel and Morrison (1970), but the papers they collected together were most certainly not the last word. Indeed, Hunter (1997) called for a ban on NHST, and reported that no less a body than the American Psychological Association has a "committee looking into whether the use of the significance test should be discouraged" (p. 6)!
The main criticisms, endlessly repeated, are easily listed. NHST does not offer any way of testing the alternative or research hypothesis; the null hypothesis is usually false, and when differences or relationships are trivial, large samples will lead to its rejection; the method discourages replication and encourages one-shot research; the inferential model depends on assumptions about hypothetical populations and data that cannot be verified; and there are more. Some of the criticisms are valid and need to be faced more carefully, even more boldly than they are, while some seem to be 'straw men' set up to be blown down. For example, the fact that we only reject H₀ and do not test H₁ is not particularly satisfying, but the notion that inference is faulty because in some way it rests on characteristics of populations not observed or not yet observed is merely stating a fact of inference! Moreover, most of the criticism would be diluted, if not eliminated, if greater attention to parameters of a test other than alpha level were considered, that is to say, effect size, sample size, and power.
Even Fisher's most supportive and ardent colleague, Frank Yates (1964), said:

The most commonly occurring weakness in the application of Fisherian methods is, I think, undue emphasis on tests of significance and failure to recognize that in many types of experimental work estimates of the treatment effects, together with estimates of the errors to which they are subject, are the quantities of primary interest. To some extent Fisher himself is to blame for this. Thus in The Design of Experiments he wrote: "Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis." (p. 320)

Yates goes on to say that the null hypothesis, as usually expressed, is "certainly untrue" and that:

such experiments [variety and fertilizer trials] are in fact undertaken with the different purpose of assessing the magnitude of the effects ... [Fisher] did not ... sufficiently emphasise the point in his writings. Scientists were thus encouraged to expect final and definitive answers ... some of them, indeed, came to regard the achievement of a significant result as an end in itself. (p. 320)

Although a number of modifications of, or alternatives to, NHST have been suggested (see, for example, confidence intervals, discussed in chapter 13, p. 199), by far the most popular of the suggested replacements (rather than salvage operations for the Fisherian model) is Bayesian analysis. The claim is usually made that Bayesian methods are concerned with the alternative hypothesis, that they encourage replication, and that, in fact, they reflect more clearly the traditional 'scientific method.' Curiously enough, it is also often claimed that despite the so-called subjectivity of Bayesian priors, a Bayesian analysis will arrive at the same conclusion as a 'classical' analysis. This leaves the journeyman psychological researcher with nothing but the tired protest that nothing has been offered as a reward for changing from the familiar routines!
8
Sampling and Estimation

RANDOMNESS AND RANDOM NUMBERS

In common parlance, to choose "at random" is to choose without bias, to make the act of choosing one without purpose even though the eventual outcome may be used for a decision. The emphasis here is on the act rather than the outcome. It is certainly not true to say that ordinary folk accept that random choice means absence of design in the outcome. Politicians, examining the polls, have been known to remark that the result was "in the cards." Primitive notions of fate, demons, guardian angels, and modern appeals to the "will of God" often lie behind the drawing of lots and the tossing of coins to "decide" if Mary loves John, or to take one course of action rather than another. It is absence of design in the manner of choosing that is important. The point must not be labored, but everyday notions of chance are still construed, even by the most sophisticated, in ways that are not too far removed from the random element in divination and sortilege practiced by the oracles and priests who were consulted by our remote, and not-so-remote, ancestors. And, as already noted, perhaps one of the reasons for the delay in the development of the probability calculus arose from a reluctance to attempt to "second-guess" the gods.
The concepts of randomness and probability are, therefore, inextricably
intertwined in statistics. The difficulties inherent in defining probability,
which were discussed earlier, are once more presented in an examination of
randomness. It is commonly thought that everyone knows what randomness is. The
great statistician Jerzy Neyman (1894-1981) states, "The method of random
sampling consists, as it is known, in taking at random elements from the
population which it is intended to study. The elements compose a sample which
is then studied" (Neyman, 1934, p. 567).
It is unlikely that this definition would be given a high grade by most
teachers of statistics. Later in this paper (and, this inadequate definition
notwithstanding, it is a paper of central and enormous importance) Neyman does
note that random sampling means that each element in the population must have
an equal chance of being chosen, but it is not uncommon for writers on the
method to ignore both this simple directive and instructions as to the means of
its implementation.
Of course, the use of the words "equal chance" in the definition brings us back
to what we mean by chance and probability. For most purposes we fall back on
the notion of long-run relative frequency, rather than leaving these constructs
as undefined ideas that make intuitive sense. Practical tests of randomness
rest on the examination of the distribution of events in long series. It is
also the case, as Kendall and Babington Smith (1938) point out, that random
sampling procedures may follow a purposive process. For example, the number π
is not a random number, but its calculation generates a series of digits that
may be random. These authors are among the earliest to set out the bases of
random sampling from a straightforward practical viewpoint, and they were
writing a mere 60 years ago. The concept of a formal random sample is a modern
one; concerns about its place and importance in scientific inference and
significance testing paralleled the development of methods of experimental
design and technical approaches to the appraisal of data. Nevertheless,
informal notions of randomness that imply lack of partiality and "choice by
chance" go back many centuries.
Stigler (1977) researched the procedure known as "the trial of the Pyx," a
sampling inspection procedure that has been in existence at the Royal Mint in
London for almost eight centuries. Over a period of time, a coin would be taken
daily from the batch that had been minted and placed in a box called the Pyx.
At intervals, sometimes separated by as much as 3 or 4 years, the Pyx was
opened and the contents checked and assayed in order to ensure that the coinage
met the specifications laid down by the Crown. Stigler quotes Oresme on the
procedure followed in 1280:

When the Master of the Mint has brought the pence, coined, blanched and made
ready, to the place of trial, e.g. the Mint, he must put them all at once on
the counter which is covered with canvas. Then, when the pence have been well
turned over and thoroughly mixed by the hands of the Master of the Mint and the
Changer, let the Changer take a handful in the middle of the heap, moving round
nine or ten times in one direction or the other, until he has taken six pounds.
He must then distribute these two or three times into four heaps, so that they
are well mixed. (Stigler, 1977, p. 495)

The Master of the Mint was allowed a margin of error called the remedy and had
to make good any deficit that was discovered. Although mathematical statistics
played no part in these tests, and if they had they would have been more
precise, the procedure itself mirrors modern practice.¹

The trial of the Pyx even in the Middle Ages consisted of a sample being drawn,
a null hypothesis (the standard) to be tested, a two-sided alternative, and a
test statistic and a critical region (the total weight of the coins and the
remedy). The problem even carried with itself a loss function which was easily
interpretable in economic terms. (Stigler, 1977, p. 499)

Random selections may be made from real, existent universes, for example, the
population of Ontario, or from hypothetical universes, for example, individuals
who, over a period of time, have taken a particular drug as part of a clinical
trial. In the latter case, to talk of "selection" in any real sense of the word
is stretching credulity, but we do use the results based on the individuals
actually examined to make inferences about the potential population that may be
given the drug. In the same way, the samples of convenience that are used in
psychological research are hardly ever selected randomly in the formal sense.
Undergraduate student volunteers are not labeled as automatically constituting
random samples, but they are often assumed to be unbiased with respect to the
dependent variables of interest, an assumption that has produced much
criticism. This latter statement emphasizes the fact that a sampling method,
whatever it is, relates to the universe under study and the particular
dependent variable or variables of interest. A questionnaire asking about
attitudes to health and fitness given to members of the audience at, say, a
symphony concert may well be generalizable to the population of the city, but
the same instrument given to the annual convention of a weight watchers' club
would not. These statements seem to be so obvious and yet, as we shall see,
overlooking possible sources of bias either by accident or design has led to
some expensive fiascos. Unfortunately, the method of sampling can never be
assuredly independent of the variable under study. Kendall and Babington Smith
(1938) note that "The assumption of independence must therefore be made with
more or less confidence on a priori grounds. It is part of the hypothesis on
which our ultimate expression of opinion is based" (p. 152).
Kendall and Babington Smith comment on the use of "random" digits in random
sampling, and it is worth examining these applications because most of the
present-day statistical cookbooks at least pay lip-service to the procedures by
including tables of random numbers and some instruction as to their use.
Individual units in the population are numbered in some convenient way, and
¹ Stigler notes that the remedy was specified on a per pound basis and that all
the coin weights were combined. This, together with the central limit theorem,
almost guarantees that the Master would not exceed the remedy.
then numbers, taken from the tables, are matched with the individuals to select
the sample. This procedure may result in a sample of individuals numbered 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, or 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, groupings that
may invite comment because they follow immediately recognizable orders but that
nevertheless could be generated by a random selection. The fact that a random
selection produces a grouping that looks to be biased or non-representative led
to a great deal of debate in the 1920s and 1930s. The sequence 1, 4, 2, 7, 9
has the appearance of being random but the sequence 1, 4, 2, 7, 9, 1, 4, 2, 7,
9 has not. Finite sequences of random numbers are therefore only locally
random. Even the famous tables of one million random digits produced by the
RAND Corporation (1965) can only, strictly speaking, be regarded as locally
random, for it may be that the sequence of one million was about to repeat
itself. Random sequences may also be "patchy." Yule (1938b), for example, after
examining Tippett's tables, which had been constructed by taking numbers at
random from census reports, gained the impression that they were rather patchy
and proceeded to apply further tests that gave some support to his view.
Tippett (1925) used his tables (not then published) to examine experimentally
his work on the distribution of the range.
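Before turning to tests of the digits themselves, here is a minimal sketch of
the matching procedure just described. The population size, sample size, and
generated digits are illustrative assumptions; a pseudo-random generator stands
in for a printed table.

```python
import random

# Units are numbered 1..100; successive pairs of random digits (a stand-in
# for a printed table) give two-digit numbers that are matched to the units.
digits = [random.randrange(10) for _ in range(1000)]

sample, i = [], 0
while len(sample) < 5 and i + 1 < len(digits):
    number = digits[i] * 10 + digits[i + 1]  # read the digits in pairs: 00-99
    i += 2
    unit = number if number != 0 else 100    # let 00 stand for unit number 100
    if unit not in sample:                   # sampling without replacement
        sample.append(unit)

print(sorted(sample))  # five distinct unit numbers between 1 and 100
```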
Simple tests for local randomness are outlined by Kendall and Babington Smith,
tests that Tippett's tables had passed, and although much more extensive
appraisals can be made today by using computers, these prescriptions illustrate
the sort of criteria that should be applied. Each digit should appear
approximately the same number of times; no digit should tend to be followed by
any particular digit; there are certain expectations in blocks of digits with
regard to the occurrence of three, four, or five digits that are all the same;
there are certain expectations regarding the gaps in the sequence between
digits that are the same. Tests of this sort do not exhaust the many that can
be applied. Whitney (1984) has recently noted that "It has been said that more
time has been spent generating and testing random numbers than using them"
(p. 129).
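A minimal sketch of the first two of these checks follows, under obvious
assumptions: generated digits stand in for a printed table, and the usual
chi-square criterion gives meaning to "approximately the same number of times."
This is an illustration of the kind of test described, not Kendall and
Babington Smith's own procedure.

```python
import random
from collections import Counter

def chi_square(observed, expected):
    """Pearson's goodness-of-fit statistic for equal expected counts."""
    return sum((obs - expected) ** 2 / expected for obs in observed)

def frequency_test(digits):
    """Each digit 0-9 should appear approximately n/10 times (9 df)."""
    counts = Counter(digits)
    return chi_square([counts.get(d, 0) for d in range(10)], len(digits) / 10)

def serial_test(digits):
    """No digit should tend to be followed by any particular digit:
    each of the 100 ordered pairs should occur about (n-1)/100 times."""
    pairs = Counter(zip(digits, digits[1:]))
    observed = [pairs.get((a, b), 0) for a in range(10) for b in range(10)]
    return chi_square(observed, (len(digits) - 1) / 100)

digits = [random.randrange(10) for _ in range(100_000)]
print(frequency_test(digits))  # refer to chi-square tables, 9 df
print(serial_test(digits))     # refer to chi-square tables, roughly 99 df
```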

COMBINING OBSERVATIONS

The notion of using a measure of the average as an adequate and convenient
summary or description of a number of data is an accepted part of everyday
discourse. We speak of average incomes and average prices, of average speeds
and average gas consumption, of average men and average women. We are referring
to some middling, nonextreme value that we take as a fair representation of our
observations. The easy use of the term does not reflect the logical problems
associated with the justification for the use of particular measures of the
average. The term itself, we find from the Oxford English Dictionary, refers,
among other things, to notions of sharing labor or risk, so old forms of the
word refer to work done by tenants for a feudal superior or to shared risks
among merchants for losses at sea. Modern conceptions of the word include the
notion of a sharing or evening out over a range of disparate values.
A large series of observations or measurements of the same phenomenon produces
a distribution of values. Given the assumption that a single true value does,
in fact, exist, the presence of different values in the series shows that there
are errors. The assumption that over-estimations are as likely as
under-estimations would provide support for the use of the middle value of the
series, the median, as representing the true value. The assumption that the
value we observe most frequently is likely the true value justifies the use of
the mode, and the "evening out" of the different values is seen in the use of
the arithmetic mean. It is this latter measure that is now the statistic of
choice when observations are combined, and there are a number of strands in our
history that have contributed to the justification for its use. The employment
of the arithmetic mean and the use of the law of error have sound mathematical
underpinnings in the Principle of Least Squares, which will be considered
later. First, however, a somewhat critical look at the use of the mean is in
order.

The Arithmetic Mean

In the 1755 Philosophical Transactions, and in a revision in Miscellaneous
Tracts of 1757, Thomas Simpson argues the case for the arithmetic mean in An
Attempt to Show the Advantage arising by Taking the Mean of a Number of
Observations in Practical Astronomy. These are valuable contributions that
discuss, for the first time, measurement errors in the context of probability
and point the way toward the idea of a law of facility of error. Simpson (1755)
complains that "some persons, of considerable note, have been of opinion, and
even publickly maintained, that one single observation, taken with due care,
was as much to be relied upon as the Mean of a great number" (pp. 82-83).
In the revision, Simpson (1757) states as axioms that positive and negative
errors are equally probable and that there are assignable limits within which
errors can be taken to fall. He also diagrams the law of error as an isosceles
triangle and shows that the mean is nearer to the true value than a single
random observation. The claim of the arithmetic mean to be the best
representation of a large body of data is often justified by appeal to the
principle of least squares and the law of error. This is high theory, and in
appealing to it there is a danger of overlooking the logic of the use of the
mean. Simply, as John Venn (1891), who was mentioned in chapter 1, puts it:

Why do we resort to averages at all? How can a single introduction of our own,
and that a fictitious one, possibly take the place of the many values which
were actually given to us? And the answer surely is, that it cannot possibly do
so; the one thing cannot take the place of the other for purposes in general,
but only for this or that specific purpose. (pp. 429-430)

This seemingly obvious statement is one that has frequently been ignored in
statistics in the social sciences. Venn points out the different kinds of
averages that can be used for different purposes and notes cases where the use
of any sort of average is misleading. Edgeworth (1887) had provided an attempt
to examine the mathematical justifications for the different averages, and Venn
and many others refer to this treatment. Edgeworth's paper also examines the
validity of the least squares principle in this context. Venn illustrates his
argument with some straightforward examples. If two people reckoned the
distance of Cambridge from London to be 50 and 60 miles, in the absence of any
information that would lead us to suspect either of the measures, one would
guess that 55 miles was the probable distance. However, if one person said that
someone they knew lived in Oxford and another that the individual lived in
Cambridge, the most probable location would not be at some place in between. In
the latter case, in the absence of any other information, one would have a
better chance of arriving at the truth by choosing one of the two places at
random.
Edgeworth's papers on the best mean represent some of his most useful work. A
particular service is rendered by his distinction between real or objective
means and fictitious or subjective means. The former arise when we use the
arithmetic mean as the true value underlying a group of measurements that are
subject to error; the latter is a description of a set.

The mean of observations is a cause, as it were the source from which diverging
errors emanate. The mean of statistics is a description, a representative of
the group, that quantity which, if we must in practice put one quantity for
many, minimises the error unavoidably attending such practice.
Observations are different copies of one original; statistics are different
originals affording one 'generic portrait.' (Edgeworth, 1887, p. 139)

This formal distinction is clear. However, because the mathematics of the
analysis of errors and the manipulations of modern statistics rest on the same
principles, the logic of inference is sometimes clouded. It is Quetelet who
brought the mathematics of error estimation in physical measurement into the
assessment of the dispersion of human characteristics. Clearly, something of
the sort had been done before in the examination of mortality tables for
insurance purposes, but we see Quetelet making a direct statement that only
partially recognizes Edgeworth's later distinction:

Everything occurs then as though there existed a type of man, from which all
other men differed more or less. Nature has placed before our eyes living
examples of what theory shows us. Each people presents its mean, and the
different variations from this mean which may be calculated a priori.
(Quetelet, 1835/1849, p. 96)

We noted in chapter 1 that it was Quetelet's view that the average value of
mental and moral, as well as physical, characteristics represented the ideal
type, l'homme moyen (the average man), for which Nature was constantly
striving. Quetelet's own pronouncements were somewhat inconsistent in that he
sometimes promoted the view of the "average being" as a universal biological
type and at other times suggested that the average differed across groups and
circumstances. Nevertheless, his methods were the forerunners of work that
attempts to establish "norms" in biology, anthropology, and the psychology of
individual differences. One of the requirements of a "good" psychometric test
is that it be accompanied by norms. These norms provide standards for
comparisons across individual test scores and, just as Quetelet's
characterization of the average as the ideal type aroused opposition and
controversy, so the establishment of national norms and sub-group norms and
racial norms produces heated debate today.

The Principle of Least Squares


In its best-known form, this famous principle states that the sum of squared
differences of observations from the mean of those observations is a minimum;
that is to say, it is smaller than the sum of squared differences from any
other reference point.
Legendre (1752-1833) announced the Principle of Least Squares in 1805, but in
Theoria Motus Corporum Coelestium, published in 1809, Gauss (1777-1855),
discussing the method, refers to his work on it in 1795 (when he was a
17-year-old student preparing for his university studies). This claim to
priority upset Legendre and led to some bitter dispute. Today the method is
most frequently associated with Gauss, who is often identified as the greatest
mathematician of all time. That the method very quickly became a topic of much
commentary and discussion may be demonstrated by the fact that in the 70 or so
years following Legendre's publication no fewer than 193 authors produced 72
books, 23 parts of books, and 313 memoirs relating to it. Merriman (1877)
provides a list of these titles together with some notes. Faced with such
sources, to say nothing of the derivations, papers, commentaries, and
monographs published in the last 110 years, the following represents perhaps
one of a dozen ways of commenting on its origins.
Using the mean to combine a set of independent observations was a technique
that had been used in the 17th century. Gauss later examined the problem of
selecting, from a number of possible ways of combining data, the one that
produced the least possible uncertainty about the "true value." Gauss noted
that in the combination of astronomical observations the mathematical treatment
depended on the method of combination. He approached the problem from the
standpoint that the method should lead to the cancellation of random errors of
measurement and that, as there was no reason to prefer one method over another,
the arithmetic mean should be chosen. Having appreciated that the "best," in
the sense of "most probable," value could not be known unless the distribution
of errors was known, he turned to an examination of the distribution, which
gave the mean as the most probable value. Gauss's approach was based on
practical considerations, and because the procedures he examined did produce
workable solutions in astronomical and geodetic observations, the method was
vindicated. In fact, the principle of least squares, as Gauss himself noted,
can be considered independently of the calculus of probabilities.
If we have n observations X₁, X₂, X₃, ..., Xₙ from a population with a mean μ,
what is the least squares estimate of μ? It is the value of μ that minimizes
Σ(Xᵢ − μ)². Now

    Σ(Xᵢ − μ)² = Σ(Xᵢ − X̄)² + n(X̄ − μ)²,

where X̄ is the mean of the n observations.
Clearly the right-hand side of the last expression is at a minimum when
X̄ = μ, which demonstrates the principle.
This easily obtained result provides a rationale for estimating the population
mean from the sample mean that is intuitively sensible. The law of error enters
the picture when we consider the arithmetic mean as the most probable result.
In this case we find that the law is in fact given by the normal distribution,
often referred to as the Laplace-Gaussian distribution. It has been stated on
more than one occasion that Laplace assumed the normal law in arriving at the
mean to provide what he described as the most advantageous combination of
observations. Laplace certainly considered the case when positive and negative
errors are equally likely, and these cases rest on the error law being what we
now call "normal" in form. The threads of the argument are not always easy to
disentangle, but one of the better accounts, for those who are prepared to
grapple with a little mathematics, was given by Glaisher as long ago as 1872.
The crucial point is that of the rationale for the two fundamental constructs
of statistics, the mean, X̄, and the variance, Σ(X − X̄)²/n. The essential fact
is easily seen. Given that a distribution of observations is normal in form,
and given that we know the mean and the variance of the distribution, then the
distribution is completely and uniquely specified. All its properties can be
ascertained given this information.
In the context of statistics in the social sciences, both the normal law and
the least squares principle are best understood through the linear model. The
model encompasses these constructs and brings together in a formal sense the
mathematics of the combination of observations developed for use in error
estimation and mathematical statistics as exemplified by analysis of variance
and regression analysis.
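The algebra above is easy to check numerically. A minimal sketch follows (the
data are invented): the sum of squared differences is smallest at the sample
mean, and the decomposition identity holds for any trial value of μ.

```python
import random

xs = [random.gauss(50, 10) for _ in range(1000)]  # invented observations
mean = sum(xs) / len(xs)

def sse(mu):
    """Sum of squared differences of the observations from mu."""
    return sum((x - mu) ** 2 for x in xs)

# sse(mu) = sse(mean) + n*(mean - mu)^2, so sse is minimized at mu = mean.
for mu in (mean - 1.0, mean, mean + 1.0):
    identity = sse(mean) + len(xs) * (mean - mu) ** 2
    print(f"mu = {mu:6.2f}   sse = {sse(mu):10.1f}   identity = {identity:10.1f}")
```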

Representation and Bias

The earliest examples of the use of samples in social statistics we have seen
in the work of Graunt, Petty, Halley, and the early actuaries. These samples
were neither random nor representative, and mistaken inferences were plentiful.
In any event, it was not until much later, when attempts were made to collect
information on populations, inferential exercises repeated, and the results
compared, that critical examinations of the techniques employed could be made.
Stephan (1948) reports that Sir Frederick Morton Eden estimated the population
of Great Britain in 1800 to be about 9,000,000. This estimate, which was based
on the number of births and the average number of inhabitants in each house (a
number that was obtained by sampling), was confirmed by the first British
census in 1801. Earlier attempts to estimate populations had been made in
France, and Laplace in 1802 made an attempt to do so that followed a scheme he
had devised and published in 1786, a scheme that included a probability measure
of the precision of the estimate. Specifically, Laplace averred that the odds
were 1,161 to 1 that the error would not reach half a million. Westergaard
(1932) provides some more details of these exercises. Elsewhere in Europe and
in the United States the 19th century saw various censuses conducted, as well
as attempts to estimate the size of the population from samples. In the United
States the Constitution provided for a census every 10 years in order to
determine Congressional representation, but a Bill introduced into the British
Parliament in 1753 to establish an annual census was defeated in the House of
Lords.
It appears that the probability mathematicians and the rising group of
political arithmeticians never joined forces in the 19th century. The latter
favored the institution of complete censuses and, generally speaking, were not
mathematicians, and the former were scientists who had many other problems to
test their mettle. Almost 100 years passed before scientific sampling
procedures were properly investigated.
Stephan (1948) lists four areas where modern sampling methods could have been
used to advantage: agricultural crop and livestock estimates, economic
statistics, social surveys and health surveys, and public opinion polls. The
last will be considered here in a little more detail because it is in this area
that the accuracy of forecasts is so often and so quickly assessed and
breakdowns in sampling procedures detected with the benefit of hindsight.
The Raleigh Star in Raleigh, North Carolina, conducted "straw votes" as early
as 1824, covering political meetings and trying to discover the "sense of the
people." By the turn of the century, many newspapers in the United States were
regularly conducting opinion polls, a common method being merely to invite
members of the public to clip out a ballot in the paper and to mail it in. The
same basic procedure was followed by all the publications. Then the
large-circulation magazines, notably Literary Digest, began to mail out ballots
to very large numbers of people, sometimes as many as 11,000,000. In 1916 this
publication correctly predicted the election of Wilson to the presidency and
from then until 1936 enjoyed consistent and much admired success. Its
predictions were very accurate. For example, in 1932 its estimate of the
popular vote for each candidate in the presidential election came to within
1.4% of the actual outcome. In 1936 came disaster. For years Literary Digest
had conducted polls on all kinds of issues, mailing out millions of ballots, at
considerable expense, to telephone subscribers and automobile owners. In 1936
the magazine predicted that Alfred M. Landon would win the presidency on the
basis of the return of over 2,300,000 replies from over 10,000,000 mailed
ballots. The record shows that Franklin D. Roosevelt won the presidency with
one of the largest majorities in American presidential history. The reasons for
this disastrous mistake are now easy to see. Prior to 1936, preference for the
two major political parties in the United States was not related to level of
income. In that year it seems that it was. The telephone subscribers and car
owners (and in 1936 these were the rather more affluent) who had received the
Literary Digest ballots were, in the main, Republicans. In 1937 the magazine
ceased publication. Crum (1931) and Robinson (1932, 1937) gave commentaries on
some of these early polls.
Fortune magazine fared no better in 1948, underestimating the vote for the
Democrats and Harry Truman by close to 12% and, of course, failing to pick the
winner. In 1936, with a much, much smaller sample than that of Literary Digest
(less than 5,000), it had forecast Roosevelt's vote to within 1%. The
explanation for its failure in 1948 rests with the swing of both decided and
undecided voters between the September poll and the November election and
failure to correct for a geographic sampling bias. Parten (1966) has some
commentary on these and other polls. One of the most successful polling
organizations, the American Institute of Public Opinion, headed by George
Gallup, began its work in 1933. But the Gallup poll also predicted a win for
Dewey over Truman in 1948, and many people have seen the Life photograph of a
victorious president holding the Chicago Daily Tribune with its famous Type I
error headline, "Dewey defeats Truman."
The result produced one of the first claims that the polls influenced the
outcome, inducing complacency in the Republicans and a small turnout of their
supporters, leading to the defeat of the Republican candidate. Today there is
much controversy over the conducting and the use of polls. They will survive
because in general they are correct much more often than not. Their success is
due to the development of refined sampling techniques.

SAMPLING IN THEORY AND IN PRACTICE

Chang (1976) gives a quite thorough review of inferential processes and
sampling theory, and it is worth sketching in some of its development in the
context of survey sampling. However, from the standpoint of statistics and
sampling in psychology, there is no doubt that the rationale of sampling
procedures for hypothesis testing rather than parameter estimation is of
greater import. The two are related.
The early political arithmeticians held to the view that statistical ratios -
for example, males to females, average number of children per family, and so
on - were approximately constant, and, as a result, proceeded to draw
inferences from figures collected in a single town or parish to whole regions
and even countries. The early 19th century saw the introduction of the law of
error by Gauss and Laplace and an awareness that variability was an important
consideration in the assessment of data. Populations are now defined by two
parameters, the mean and the variance. One of the earliest attempts to put a
measure of precision onto a sampling exercise was that of Laplace in 1802. The
sample he used was not random, although he appears to have assumed that it was.
Communes distributed across France were chosen to balance out climatic
differences. Those having mayors known for their "zeal and intelligence" were
also selected so that the data would be the most precise. Laplace also assumed
that birth rate was homogeneous across the French population, exactly the sort
of unwarranted assumption that was made by the early political arithmeticians.
Nevertheless, Laplace estimated the population total from his figures and,
appealing to the central limit theorem (which he had discussed in 1783),
approximated the distribution of estimation errors to the normal curve.
Survey sampling for all kinds of social investigations owes a great deal to Sir
Arthur Lyon Bowley (1869-1957). In his time he was recognized as a pioneer in
the definition of sampling techniques, and his methods and assumptions were the
subject of much debate. In the event, some of his approaches were found to be
defective, but his work focussed attention on the problem. In 1926 he
summarized the theory of sampling, and a short paper of 1936 outlines the
application of sampling to economic and social problems. He served on numerous
official committees that investigated the economic state of the nation, social
effects of unemployment, and so on, and worked directly on many surveys.
Maunder (1972) has written a memoir that points up Bowley's contributions,
contributions that were somewhat overshadowed by the work of his contemporaries
Pearson, Fisher, and Neyman. He was a calm and courteous man, enormously
concerned with social issues, and he occupied, at the London School of
Economics, the first Chair devoted to statistics in the social sciences.
Bowley (1936) defines the sampling problem simply:

We are here concerned ... with the investigation of the numerical structure of
an actual and limited universe, or "population" which is the better word for
our purpose. Our problems are quite definitely to infer the population from the
sample. The problem is strictly analogous to that of estimating the proportion
of the various colours of balls in a limited urn on the basis of one or more
trial draws. (pp. 474-475)

In the early years of this century, Bowley began to examine both the practice
and theory of survey sampling. His work helped to highlight the utility of
probability sampling of one form or another. Systematic selection was adopted
and advocated by A. N. Kiaer (1838-1919), Director of the Norwegian Bureau of
Statistics, in his examinations of census data, but the majority of influential
statisticians, represented by the International Statistical Institute, rejected
sampling, pressing for complete enumeration. It took almost 30 years for the
utility and benefits of the methods to be appreciated. Seng (1951) and Kruskal
and Mosteller (1979) give accounts of this most interesting period in
statistical history. The latter authors give a translation and paraphrase of
the remarks of Georg von Mayer, Professor at the University of Munich, on
Kiaer's work on the representative method, which was presented at a meeting of
the Institute in Berne in 1895:

I regard as most dangerous the point of view found in his work. I understand
that representative samples can have some value, but it is a value restricted
to terrain already illuminated by full coverage. One cannot replace by
calculation the real observation of facts. A sample provides statistics for the
units actually observed, but not true statistics for the entire terrain.
It is especially dangerous to propose representative sampling in the midst of
an assembly of statisticians. Perhaps for legislative or administrative goals
sampling may have uses - but one must never forget that it cannot replace a
complete survey. It is necessary to add that there is among us these days a
current in the minds of mathematicians that would, in many ways, have us
calculate rather than observe. We must remain firm and say: no calculations
when observations can be made. (von Mayer, quoted by Kruskal & Mosteller, 1979,
pp. 174-175)

Oddly enough, Kiaer's work is not mathematical in the sense that modern methods
of parameter estimation are mathematical. At the time those methods were
neither fully delineated nor understood. Kiaer aimed, by a variety of
techniques, to produce a miniature of the population, although he noted as
early as 1899 the necessity for the investigation of both the practical and
theoretical aspects of his methods. At a meeting in 1901 (a report of which was
published in 1903), Kiaer returned to the theme, and it was in a discussion of
his contribution that L. von Bortkiewicz suggested that the "calculus of
probabilities" could be used to test the efficacy of sampling. By establishing
how much of a difference between sample and population could be obtained
accidentally and checking whether or not an observed difference lay outside
those limits, the representativeness of the sample could be decided on.
Bortkiewicz did not, apparently, formulate all the necessary tests, and others
had employed this method, but he seems to have been the first to draw the
attention of practicing statisticians to the possibilities.
In 1903, Kiaer must have thought that the sampling argument was won, for a
subcommittee of the International Statistical Institute proposed a resolution
at the Berlin meeting:

The Committee, considering that the correct application of the representative
method, in a certain number of cases, can furnish exact and detailed
observations from which the results can be generalized, within certain limits,
recommends its use, provided that in the publication of the results the
conditions under which the selection of the observation units is made are
completely specified. The question will be kept on the agenda, so that a report
may be presented in the next session on the application of the method in
practice and on the value of the results arrived at. (quoted by Seng, 1951,
p. 230)

What is more, a discussant at the meeting, the French statistician Lucien
March, returned to the ideas that had been put forward by Bortkiewicz and
outlined some of the basics of probability sampling (see Kruskal & Mosteller,
1979, for a short summary of this presentation). The way ahead seemed clear.
In fact the question was, for all intents and purposes, shelved for more than
20 years, and it was not until 1925, at the Rome session of the Institute, that
the advantages of the sampling method were fully recognized. This was in no
small way due to the theoretical work of Bowley. Bowley had suggested in his
Presidential address to the Economic Science and Statistical Section of the
British Association as early as 1906 that a systematic approach to the problem
of sampling would bear fruit:

In general, two lines of analysis are possible: we may find an empirical
formula (with Professor Karl Pearson) which fits this class of observations
[Bowley is referring to data that may not be normally distributed], and by
evaluating the constants determine an appropriate curve of frequency, and hence
allot the chances of possible differences between our observation and the
unknown true value; or we may accept Professor Edgeworth's analysis of the
causes which would produce his generalised law of great numbers, and determine
a priori or by experiment whether this universal law may be expected or is to
be found in the case in question. (Bowley, 1906, pp. 549-550)

Edgeworth's method is based on the Central Limit Theorem, and Bowley explains
its utility clearly and simply:

If quantities are distributed according to almost any curve of frequency, ...
the average of successive groups of n of these conform to a normal curve (the
more and more closely as n is increased) whose standard deviation diminishes in
inverse ratio to the number in each sample ... If we can apply this method ...,
we are able to give not only a numerical average, but a reasoned estimate for
the real physical quantity of which the average is a local or temporary
instance. (Bowley, 1906, p. 550)

The procedure demands random sampling:

The chances are the same for all the items of the groups to be sampled, and the
way they are taken is absolutely independent of their magnitude.
It is frequently impossible to cover a whole area as the census does, ... but
it is not necessary. We can obtain as good results as we please by sampling,
and very often quite small samples are enough; the only difficulty is to ensure
that every person or thing has the same chance of inclusion in the
investigation. (Bowley, 1906, pp. 551-553)
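Bowley's description is easily verified by simulation. In the minimal sketch
below (an illustration only; the exponential parent population and the sample
sizes are arbitrary choices), the averages of successive random groups cluster
around the population mean, with a standard deviation that shrinks as 1/√n.

```python
import random
import statistics

# Parent population: Expon(1), decidedly non-normal, with mean 1 and sd 1.
for n in (4, 16, 64):
    means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
             for _ in range(10_000)]
    print(f"n={n:2d}  mean of group means={statistics.fmean(means):.3f}  "
          f"sd of group means={statistics.stdev(means):.3f}  "
          f"sigma/sqrt(n)={1 / n ** 0.5:.3f}")
```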

THE THEORY OF ESTIMATION

Over the next 20 years, Bowley and his associates completed a number of
surveys, and his theoretical researches produced The Measurement of the
Precision Attained in Sampling in 1926. This paper formed part of the report of
the International Statistical Institute, which recommended and drew attention
to the methods of random selection and purposive selection: "A number of groups
of units are selected which together yield nearly the same characteristics as
the totality" (p. 2). The report does not directly address the method of
stratified sampling, even though the technique had been in general use. This
procedure received close attention from Neyman in his paper of 1934. Bowley had
attempted to present a theory of purposive sampling in his 1926 report.
A distinctive feature of this method, according to Bowley, was that it was a
case of cluster sampling. It was assumed that the quantity under investigation
was correlated with a number of characters, called controls, and that the
regression of the cluster means of the quantity on those of each control was
linear. Clusters were to be selected in such a way that the average of each
control computed from the chosen clusters should (approximately) equal its
population mean. It was hoped that, due to the assumed correlations between
controls and the quantity under investigation, the above method of selection
would result in a representative sample with respect to the quantity under
investigation. (Chang, 1976, pp. 305-306)

Unfortunately, a practical test of the method (Gini, 1928) proved
unsatisfactory, and Neyman's analysis concluded that it was neither a
consistent nor an efficient procedure.
As Neyman pointed out, the problem of sampling is the problem of estimation.
The first forays into the establishment of a theory had been made by Fisher
(1921a, 1922b, 1925b), but the manner of sampling had received little attention
from him. The method of maximum likelihood, which rested entirely on the
properties of the distribution of observations, gave the most efficient
estimate. Any appeal to the properties of the a priori distribution - the
Bayesian approach - was rejected by Fisher. Neyman attempted to clarify the
situation:

We are interested in characteristics of a certain population, say, π ... it has
been usually assumed that the accurate solution of such a problem requires the
knowledge of probabilities a priori attached to different admissible hypotheses
concerning the values of the collective characters [the parameters] of the
population π. (Neyman, 1934, p. 561)

He then turns to Bowley's work, noting that when the population π is known,
then questions about the sort of samples that it could produce can be answered
from "the safe ground of classical theory of probability" (p. 561). The second
question involves the determination, when we know the sample, of the
probabilities a posteriori to be ascribed to hypotheses concerning the
populations. Bowley's conclusions are based:

on some quite arbitrary hypotheses concerning the probabilities a priori, and
Professor Bowley accompanies his results with the following remark: "It is to
be emphasized that the inference thus formulated is based on assumptions that
are difficult to verify and which are not applicable in all cases." (Neyman,
1934, p. 562)

Neyman then suggests that Fisher's approach (that involving the notion of
fiducial probability, although Neyman does not use the term) "removes the
difficulties involved in the lack of knowledge of the a priori probability law"
(p. 562). He further suggests that these approaches have been misunderstood,
due, he thinks, to Fisher's condensed form of explanation and difficult method
of attacking the problem:

The form of the solution consists in determining certain intervals, which I
propose to call confidence intervals ..., in which we may assume are contained
the values of the estimated characters of the population, the probability of
the error in a statement of this sort being equal to or less than 1 − ε, where
ε is any number 0 < ε < 1, chosen in advance. The number ε I call the
confidence coefficient. (Neyman, 1934, p. 562)
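In modern terms, and as a minimal sketch only (the measurements and the 95
percent coefficient are illustrative assumptions, not Neyman's example), an
interval estimate for a population mean runs as follows:

```python
import statistics

sample = [12.1, 9.8, 11.4, 10.6, 12.9, 10.2, 11.7, 9.5, 10.9, 11.3]
n = len(sample)
mean = statistics.fmean(sample)
sem = statistics.stdev(sample) / n ** 0.5   # standard error of the mean

# 2.262 is the t critical value for n - 1 = 9 degrees of freedom at 95%.
half_width = 2.262 * sem
print(f"95% confidence interval: {mean - half_width:.2f} to {mean + half_width:.2f}")
```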

Neyman's comments on Fisher's ability to explain his view produced, in the
discussion of his paper, the first (mild) reaction from Fisher. Subsequent
reactions to Neyman's work and that of his collaborator, Egon Pearson, became
increasingly vitriolic. The report reads:

Dr Fisher thought Dr Neyman must be mistaken in thinking the term fiducial
probability [Neyman had used the term "confidence coefficient"] had led to any
misunderstanding; he had not come upon any signs of it in the literature. When
Dr Neyman said "it really cannot be distinguished from the ordinary concept of
probability," Dr Fisher agreed with him ... He qualified it from the first with
the word fiducial ... Dr Neyman qualified it with the word confidence. The
meaning was evidently the same, and he did not wish to deny that confidence
could be used adjectivally. They were all too familiar with it, as Professor
Bowley had reminded them, in the phrase "confidence trick." (discussion on Dr
Neyman's paper, 1934, p. 617)

From the standpoint of the familiar statistical procedures found in our texts,
this paper is important for its treatment of confidence intervals and its
emphasis of the importance of random sampling. It extended estimation from
so-called point estimation, the use of a sample value to infer a population
value, to interval estimation, which assesses the probability of a range of
values. Neyman demonstrates the use of the Markov² method for deriving the best
linear
² Andrei Andreyevich Markov (or Markoff, 1856-1922) is best known for his
studies of the probabilities of linked chains of events. Markov chains have
been used in a variety of social and biological studies in the last 30 or 40
years. But Markov made many contributions to probability theory. If we have a
nonnegative random variable X with expected value μ, then regardless of its
distribution, for any positive number c (i.e., c > 0), the probability that X
is greater than c times its expected value does not exceed 1/c. That is,
P(X > cμ) ≤ 1/c. This is known as the Markov Inequality. Markov was a student
of Pafnuti L. Tchebycheff (sometimes spelled Chebyshev or Chebichev,
1821-1894), who formulated the Tchebycheff Inequality, which states that
P(X < μ − dσ or X > μ + dσ) ≤ 1/d², where X is a random variable with expected
value μ and variance σ², and d > 0. This result was independently arrived at by
the French mathematician I. J. Bienaymé (1796-1876). These inequalities are
important in the development of the descriptions of the properties of
probability distributions.
unbiased estimators. It also contains other important ideas, in particular, a
discussion of the methods of stratified sampling and appropriate statistical
models for it. Neyman's paper marks a new era in both the method and theory of
sampling, although, at the time, it was its treatment of the problem of
estimation that received the most attention. In a sense it complemented and
supplemented the work of Ronald Fisher that was going on at Rothamsted, but it
became evident that Fisher did not quite see it that way.

THE BATTLE FOR RANDOMIZATION

There is no doubt that the requirement that samples should be randomly drawn
was thought of by the survey-makers as a protection against selection bias. And
there is also no doubt that when sample size is large it affords such
protection, but not, it must be stressed, a guarantee.
In agricultural research, it had long been recognized that reduction of
experimental error was of critical importance. At Rothamsted two methods were
available: repeating the experiments over many years, and multiplying the
number of plots on a field. Mercer and Hall (1911) discuss the problem in
considerable detail and give suggestions for arranging the plots so that they
may be "scattered." This was the approach that was abandoned, although not
immediately, when Fisher started his important work at the Station. Eventually,
for Fisher and his coworkers the argument for randomization had a quite
different motive from that of trying to obtain a representative sample, one
that is crucial for an appreciation of the use of statistics in psychology.
Fisher, although a brilliant mathematician, was a practical statistician, and
his approach to statistics can only be understood through his work on the
design of experiments and the analysis of the resultant data. The core of
Fisher's argument rests on the contention that the value of an experiment
depends on the valid estimation of error, an argument that everyone would agree
with. But how was the estimate to be made?

In nearly all systematic arrangements of replicated plots care is taken to put
the unlike plots as close together as possible, and the like plots consequently
as far apart as possible, thus introducing a flagrant violation of the
conditions upon which a valid estimate is possible.
One way of making sure that a valid estimate of error will be obtained is to
arrange the plots deliberately at random.
The estimate of error is valid, because, if we imagine a large number of
different results obtained by different random arrangements, the ratio of the
real to the estimated error, calculated afresh for each of these arrangements,
will be actually distributed in the theoretical distribution by which the
significance of the result is tested. Whereas if a group of arrangements is
chosen such that the real errors in this group are on the whole less than those
appropriate to random arrangements, it has now been demonstrated that the
errors, as estimated, will, in such a group, be higher than is usual in random
arrangements, and that, in consequence, within such a group, the test of
significance is vitiated. (Fisher, 1926b, pp. 506-507)

The contribution of Fisher that is overwhelmingly important is the development
of the t and z tests and the general technique of analysis of variance. The
essence of these procedures is that they provide estimates of error in the
observations and the application of tests of statistical significance. These
methods were not immediately recognized as being useful for larger-scale sample
surveys, and it was partly the work of Neyman (mentioned earlier) and others in
the mid-1930s that, ironically, introduced them to this area.
Arguments about randomized versus systematic designs began in the mid-1920s.
Mostly they revolved around the issue of what to do when there was an unwanted
assignment of treatments to experimental units, that is, when the assignment
had a pattern that the researcher either knew or suspected might confound the
treatments. Fisher argued very strongly against the use of systematic designs,
on the basis of theory, but his argument was not wholly consistent. Some had
suggested that if a random design produced a pattern then it should be
discarded and another random assignment drawn. Of course, the subjectivity
introduced by this sort of procedure is precisely that of the deliberate
balancing of the design. And how many draws might one be allowed?

Most experimenters on carrying out a random assignment of plots will be shocked
to find out how far from equally the plots distribute themselves ... if the
experimenter rejects the arrangement arrived at by chance as altogether "too
bad," or in other ways "cooks" the arrangement to suit his preconceived ideas,
he will either (and most probably) increase the standard error as estimated
from the yields; or, if his luck or his judgment is bad, he will increase the
real errors while diminishing his estimate of them. (Fisher, 1926b,
pp. 509-510)

But even Fisher never quite escaped the difficulty. Savage (1962) talked with
him:

"What would you do," I had asked, "if, drawing a Latin Square at random for an
experiment, you happened to draw a Knut Vik square?" Sir Ronald said he thought
he would draw again and that, ideally, a theory explicitly excluding regular
squares should be developed. (Savage, 1962, p. 88)

Students and teachers cursing their way through statistical theory and practice
should take some comfort from the inconsistencies expressed by the master.
The debate reached its height in argument between "Student" and Fisher.
"Student" consistently advocated systematic arrangements. In a letter to Egon
Pearson (Pearson, 1939a) written shortly before his death in 1937, "Student"
comments on work by Tedin (1931) that had examined the outcomes when
systematic, as opposed to random, Latin squares were used in experimental
designs. The "Knight's move" Latin square he prefers above all others: "It is
interesting as an illustration of what actually happens when we depart from
artificial randomisation: I would Knight's move every time!" (quoted by E. S.
Pearson, 1939a, p. 248).³
Over the previous year a series of papers, letters, and letters of rebuttal had
come forth from "Student" and Fisher. "Student" was adamant to the end, and
Fisher reiterated his claim that valid error estimates cannot be computed in
arranged designs and that in such cases the test of significance is made
ineffective. Picard (1980) describes and discusses the argument and also
examines the contributions of Pearson and Yates, who had succeeded Fisher at
Rothamsted. Others entered the debate. Jeffreys (1939) is puzzled:

Reading "Student's"paper [of 1937] and Fisher's Designof Experiments I find


myself in almost complete agreement with both;and Ishould therefore have
expected
them to agreewith each other.
But it seemsto me that "Student"is wrong in regarding Fisheras anadvocateof
extremerandomness,andpossibly Fisherhas notsufficiently emphasizedthe amount
of systemin his methods. (Jeffreys, 1939,p. 5)

Jeffreys makes the point that omitting to take account of relevant information
leads to avoidable errors:

The best procedure is to design the work so as to determine it [the error] as
accurately as possible and not to leave it to chance whether it can be
determined at all ... The hypothesis is a considered proposition ... The
argument is inductive and not deductive; it is not dealt with by considering an
estimable error that has nothing to do with it. (Jeffreys, 1939, p. 5)

As ever, Jeffreys' argument is a paragon of logic, and it notes that Fisher's
advice to balance or eliminate the larger systematic effects as accurately as
possible and randomize the rest "sums up the situation very well" (p. 7). This
is the prescription that the design of experiments follows today.
E. S. Pearson (1938b) attempted to expand on and to clarify "Student's" stand,
but he clearly understood the view of the opposition. Nevertheless, he
concluded that balanced layouts could give some slight advantage.

An illustration of "Two Knight's moves"would be


DEABC
B C D EA
E A B CD
C D E AB
ABODE
Yates (1939), in a lengthy paper, also goes over the whole of "Student's" views
on the matter, but his conclusion supports the essence of Fisher's views:

The conclusion is reached that in cases where Latin square designs can be used,
and in many cases where randomized blocks have to be employed, the gain in
accuracy with systematic arrangements is not likely to be sufficiently great to
outweigh the disadvantages to which systematic designs are subject.
On the other hand, systematic arrangements may in certain cases give decidedly
greater accuracy than randomized blocks, but it appears that in such cases the
use of the modern devices of confounding, quasi-factorial designs, or
split-plot Latin squares, which are much more satisfactory statistically, are
likely to give a similar gain in accuracy. (Yates, 1939, p. 464)

This brings us to the approaches of the present day.
The realization that sampling was important in psychological research, and that
its techniques had been much neglected, was presented to the discipline by
McNemar (1940a). In an extensive discussion, he points out situations that are
still with us:

One wonders, for instance, how many psychometric test scores for policemen,
firemen, truck drivers et al. have been interpreted by the clinician in terms
of college sophomore norms.
In psychological research we are more frequently interested in making an
inference regarding the likeness or difference of two differentially defined
universes, such as two racial groups, or an experimental vs. a control group.
The writer ventures the guess that at least 90% of the research in psychology
involves such comparisons. It is not only necessary to consider the problem of
sampling in the case of experimental and control groups, but also convenient
from the viewpoint of both good experimentation and sound statistics to do so.
(McNemar, 1940a, p. 335)

This paper, which, regrettably, is not among those most widely cited in the
psychological literature, should be required reading for all those embarking on
research in any aspect of the social sciences. Its closing remarks contain a
prediction that has been most certainly fulfilled and whose content is dealt
with shortly:

The applicability in psychology of certain of Professor R. A. Fisher's designs
should be examined. Eventually, the analysis of variance will come into use in
psychological research. (McNemar, 1940a, p. 363)

For the moment, no more needs to be said.


9
Sampling Distributions

Large sets of elementary events are commonly called populations or universes in
statistics, but the set theory term sample space is perhaps more descriptive.
The term population distribution refers to the distribution of the values of
the possible observations in the sample space. Although the characteristics or
parameters of the population (e.g., the mean, μ, or the standard deviation, σ)
are of both practical and theoretical interest, these values are rarely, if
ever, known precisely. Estimates of the values are obtained from corresponding
sample values, the statistics. Clearly, for a sample of a given size drawn
randomly from a sample space, a distribution of values of a particular summary
statistic exists. This simple statement defines a sampling distribution. In
statistical practice it is the properties of these distributions that guide our
inferences about properties of populations of actual or potential observations.
In chapter 6 the binomial, the Poisson, and the normal distributions were
discussed. Now that sampling has been examined in some detail, three other
distributions and the statistical tests associated with them are reviewed.

THE CHI-SQUARE DISTRIBUTION

The development of the χ² (chi-square) test of "goodness-of-fit" represents one
of the most important breakthroughs in the history of statistics, certainly as
important as the development of the mathematical foundations of regression. The
fact that both creations are attributable to the work of one man,¹ Karl
Pearson, is impressive attestation to his role in the discipline. There are a
number of routes by which the test can be approached, but the path that has
been followed thus far is continued here. This path leads directly to the work
of Pearson and Fisher, who did not make use, and, it seems, were in general
unaware, of the earlier work on goodness-of-fit by mathematicians in Europe.
Before looking at the development of the test of goodness-of-fit, the structure
of the chi-square distribution itself is worth examining. Figure 9.1 shows two
chi-square distributions.

¹ MacKenzie (1981) gives a brief account of Arthur Black (1851-1893), a tragic
figure who, on his death, left a lengthy, and now lost, manuscript, Algebra of
Animal Evolution, which was sent to Karl Pearson. "Pearson started to read it,
but realized immediately that it discussed topics very similar to those he was
working on, and decided not to read it himself but to send it to Francis Galton
for his advice" (p. 99). Of great interest is that buried among Black's
notebooks, which have survived, is a derivation of the chi-square approximation
to the multinomial distribution.

[Fig. 9.1: Chi-square distributions for 4 and 10 degrees of freedom]

Given a normally distributed population of scores Y with a mean μ and a
variance σ², suppose that samples of size n = 1 are drawn from the distribution
and each score is converted to its corresponding standard score z. Then
z = (Y − μ)/σ, and χ² = (Y − μ)²/σ² defines the chi-square distribution with
one degree of freedom. If samples of n = 2 are drawn, then χ² is given by
(Y₁ − μ)²/σ² + (Y₂ − μ)²/σ². In fact, if n independent measures are taken
randomly from a distribution with mean μ and variance σ², χ² is defined as the
sum of the squared standardized scores:

    χ² = Σ zᵢ² = Σ (Yᵢ − μ)²/σ²
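The definition translates directly into a simulation. A minimal sketch (the
population parameters and the number of draws are arbitrary choices) builds
empirical chi-square distributions for the two cases shown in Figure 9.1:

```python
import random

mu, sigma = 100.0, 15.0  # arbitrary normal population

def chi_square_variate(n):
    """Sum of n squared standardized scores drawn from N(mu, sigma)."""
    return sum(((random.gauss(mu, sigma) - mu) / sigma) ** 2 for _ in range(n))

for df in (4, 10):  # the degrees of freedom plotted in Figure 9.1
    draws = [chi_square_variate(df) for _ in range(100_000)]
    mean = sum(draws) / len(draws)
    var = sum((x - mean) ** 2 for x in draws) / (len(draws) - 1)
    # Theory: a chi-square variable with df degrees of freedom has
    # mean df and variance 2 * df; the simulation should come close to both.
    print(f"df={df:2d}  simulated mean={mean:.2f}  simulated variance={var:.2f}")
```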
This is the essence of the distribution² that Pearson used in his formulation
of the test of goodness-of-fit. Why was such a test so necessary?
Games of chance and observational errors in astronomy and surveying were
subject to the random processes that led scientists in the 18th and early 19th
centuries to the examination of error distributions. The quest was for a sound
mathematical basis for the exercise of estimating true values. Simpson
introduced a triangular distribution and Daniel Bernoulli in 1777 suggested a
semicircular one. In the absence of empirical data, these distributions,
established on a priori grounds, were somewhat arbitrarily regarded as having
no more and no less claim to accuracy and utility. But the 19th century saw the
normal law established. It had powerful credentials because of the fame of its
two main developers, Gauss and Laplace. Starting from the assumption that the
arithmetic mean represented the true value, Gauss showed that the error
distribution was normal. Laplace, starting from the view that every individual
observation arises from a very great number of independently acting random
factors (the essence of the central limit theorem), came to the same result.
Gauss's proof of the method of least squares further established the importance
of the normal distribution, and when, in 1801, he used the initial data
collected from observations on a new planet, Ceres,³ to accurately predict
where it would be observed later in the year, these procedures, as well as
Gauss's reputation, were firmly established in astronomy.
In astronomy as well as in biology and social science, the Laplace-Gaussian
distribution was indeed law, it was indeed regarded as normal. This
prescription led to Quetelet's use of it as a "double-edged sword" (see chapter
1) and led to many astronomers using it as a reason to reject observations that
were considered to be doubtful, for example Peirce (1852). Quetelet's procedure
for establishing the "fit" of the normal curve was the same as that of the
early astronomers. The tabulation, and later the graphing, of observed and
expected frequencies led to their being compared by nothing more than visual
inspection (see, e.g., Airy, 1861).

² A history of the mathematics of the χ² distribution would include the development of the gamma function by the French mathematician I. J. Bienaymé (1796-1876), who, in the 1850s, found a statistic that is equivalent to the Pearson χ² in the context of least squares theory. Pearson was apparently not aware of his work; nor were F. R. Helmert and E. Abbe, who, in the 1860s and 1870s, also arrived at the χ² distribution for the sum of squares of independent normal variates. Long after the test had become commonly used, von Mises (1919) linked Bienaymé's work to the Pearson χ². Details of this aspect of the test and distribution's history are given by Lancaster (1966), Sheynin (1966), and Chang (1973).

³ Ceres was the first discovered "planetoid" in the asteroid belt. Gauss also determined the orbit of Pallas, another planetoid.

There was some dissent. Egon Pearson (1965) notes:

As a reaction to this view among astronomers I remember how Sir Arthur Eddington in his Cambridge lectures about 1920 on the Combination of Observations used to quote the remark that "to say that errors must obey the normal law means taking away the right of the free-born Englishman to make any error he darn well pleases!" (p. 6)

Karl Pearson's first statistical paper (1894) was on the problem of interpreting a bimodal distribution as two normal distributions, the problem that had arisen as a result of Weldon's discovery that the distribution of the relative frontal breadths of his sample of Naples crabs was a double-humped curve. This paper introduced the method of moments as a means of fitting a theoretical curve to a set of observations. As Egon Pearson (1965) states it:

The question "does a Normal curve fit this distribution and what does this mean if it does not?" was clearly prominent in their discussions. There were three obvious alternatives;
(a) The discrepancy between theory and observation is no more than might be expected to arise in random sampling.
(b) The data are heterogeneous, composed of two or more Normal distributions.
(c) The data are homogeneous, but there is real asymmetry in the distribution of the variable measured.
The conclusion (c) may have been hard to accept, such was the prestige surrounding the Normal law. (p. 9)

It appears from Yule's lecture notes⁴ that Karl Pearson probably was employing a procedure that used the ratio of an absolute deviation from expectation to its standard error to examine the record of 7,000 (actually 7,006) tosses of 12 dice made by Mr Hull, a clerk at University College. This was the record (see chapter 1) that Weldon said Karl Pearson had rejected as "intrinsically incredible." Yule's notes also contain an empirical measure of goodness of fit that Egon Pearson says may be set down roughly as R = Σ|O − T|/T, where |O − T| is the absolute value of the difference between the observed and theoretical frequency and T the total theoretical frequency, though it should be mentioned that the actual notes do not contain the formula in this form. This expression, the mean absolute error, was in use during the latter years of the 19th century, and Bowley used it in the first edition of his textbook in 1902.

⁴ These notes are reproduced in Biometrika's Miscellanea (Yule, 1938a).

Karl Pearson's second statistical paper (1895) on asymmetrical frequency curves occupied the attention of the biologists, but the question of the biological meaning of skewed distributions was not one that, at the time, was in the forefront of Pearson's thoughts. Interestingly enough, Pearson did not use the mean absolute error as a test of fit in any of his work. His preoccupation was with the development of a theory of correlation, and it was in this context that he solved the goodness-of-fit problem. The 1895 paper and two supplements that followed in 1901 and 1916b introduced a comprehensive system of frequency curves that pointed a way to sampling distributions that are central to the use of statistical tests, but it was a way that Pearson himself did not fully develop.

Pearson's (1900a) seminal paper begins, "The object of this paper is to investigate a criterion of the probability on any theory of an observed system of errors, and to apply it to the determination of goodness of fit in the case of frequency curves" (p. 157).
Pearson takes a system of deviations x₁, x₂, . . ., xₙ from the means of n variables with standard deviations σ₁, σ₂, . . ., σₙ and with correlations r₁₂, r₁₃, r₂₃, . . ., r₍ₙ₋₁₎ₙ, and derives χ² as "the equation to a generalized 'ellipsoid,' all over the surface of which the frequency of the system of errors or deviations x₁, x₂, . . ., xₙ is constant" (p. 157).
It was Pearson's derivation of the multivariate normal distribution that formed the basis of the χ² test. Large xᵢ's represent large discrepancies between theory and observation and in turn would give large values of χ². But χ² can be made to become a test statistic by examining the probability of a system of errors occurring with a frequency as great or greater than the observed system. Pearson had already obtained an expression for the multivariate normal surface, and he here gives the probability of n errors when the ellipsoid, mentioned earlier, is squeezed to become a sphere, which is the geometric representation of zero correlation in the multivariate space, and shows how one arrives at the probability for a given value of χ². When we compare Pearson's mathematics with Hamilton Dickson's examination of Galton's elliptic contours, seen in the scatterplot of two correlated variables while Galton was waiting for a train, we see how far mathematical statistics had advanced in just 15 years.
Pearson considers an (n + 1)-fold grouping with observed frequencies m′₁, m′₂, m′₃, . . ., m′ₙ, m′ₙ₊₁, and theoretical frequencies known a priori, m₁, m₂, m₃, . . ., mₙ, mₙ₊₁. Σm = Σm′ = N is the total frequency, and e = m′ − m is the error. The total error Σe (i.e., e₁ + e₂ + e₃ + . . . + eₙ₊₁) is zero. The degrees of freedom, as they are now known, follow; "Hence only n of the n + 1 errors are variables; the n + 1th is determined when the first n are known, and in using formula (ii) [Pearson's basic χ² formula] we treat only of n variables" (Pearson, 1900a, pp. 160-161).
Starting with the standard deviation for the random variation of error and the correlation between random errors, Pearson uses a complex trigonometric transformation to arrive at a result "of very great simplicity":

χ² = Σ(m′ − m)²/m

Chi-square (the statistic) is the weighted sum of squared deviations.
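The computation is trivial (a sketch; the six observed frequencies, for a hypothetical die thrown 600 times, are invented for illustration):

# Observed frequencies for a hypothetical die thrown 600 times
observed = [95, 110, 92, 108, 99, 96]
expected = [100] * 6   # theoretical frequencies known a priori

# Pearson's chi-square: the weighted sum of squared deviations
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_square)   # referred to chi-square with k - 1 = 5 df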


Pearson (1904b, 1911) extended the use of the test to contingency tables and to the two-sample case and in 1916(a) presented a somewhat simpler derivation than the one given in 1900a, a derivation that acknowledges the work of Soper (1913). The first 20 years of this century brought increasing recognition that the test was of the greatest importance, starting with Edgeworth, who wrote to Pearson, "I have to thank you for your splendid method of testing my mathematical curves of frequency. That χ² of yours is one of the most beautiful of the instruments that you have added to the Calculus" (quoted by Kendall, 1968, p. 262).

And even Fisher, who by that time had become a fierce critic: "The test of goodness of fit was devised by Pearson, to whose labours principally we now owe it, that the test may readily be applied to a great variety of questions of frequency distribution" (Fisher, 1922b, p. 597).
In the next chapter the argument that arose between Pearson and Yule on the assumption of continuous variation underlying the categories in contingency tables is discussed. When Fisher's modifications and corrections to Pearson's theory were accepted, it was Yule who helped to spread the word on the interpretation of the new idea of degrees of freedom.

The goodness-of-fit test is readily applied when the expected frequencies based on some hypothesis are known. For example, hypothesizing that the expected distribution is normal with a particular mean and standard deviation enables the expected frequency of any value to be quickly calculated. Today χ² is perhaps more often applied to contingency tables, where the expected values are computed from the observed frequencies.
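A minimal sketch of this routine procedure (the 2 × 3 table of counts is invented for illustration):

# A hypothetical 2 x 3 contingency table of observed counts
table = [[30, 20, 10],
         [20, 30, 40]]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Each expected cell frequency is computed from the observed margins:
# (row total x column total) / grand total.
chi_square = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (observed - expected) ** 2 / expected
print(chi_square)   # referred to chi-square with (r - 1)(c - 1) = 2 df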
This now routine procedure forms the basis of one of the most bitter disputes in statistics in the 1920s. In 1916 Pearson examined the question of partial contingency. The fixed n in a goodness-of-fit test imposes a constraint on the frequencies in the categories; only k − 1 categories are free to vary. Pearson realized that in the case of contingency tables additional linear constraints were placed on the group frequencies, but he argued that these constraints did not allow for a reduction in the number of variables in the case where the theoretical distribution was estimated from the observed frequencies. Other questions had also been raised.

Raymond Pearl (1879-1940), an American researcher who was at the Biometric Laboratory in the mid-1910s, pointed out some problems in the

application of χ² in 1912, noting that some hypothetical data he presents clearly show an excellent "fit" between observed and expected frequency but that the value of χ² was infinite! Pearson, of course, replied, but Pearl was unmoved.

I have earlier pointed out other objections to the χ² test ... I have never thought it necessary to make any rejoinder to Pearson's characteristically bitter reply to my criticism, nor do I yet. The χ² test leads to this absurdity. (Pearl, 1917, p. 145)

Pearl repeats the argument that essentially notes that in cases where there are small expected frequencies in the cells of the table, the value of χ² can be grossly inflated.
Karl Pearson (1917), in a reproof that illustrates his disdain for those he believed had betrayed the Biometric school, responds to Pearl: "Pearl . . . provides a striking illustration of how the capable biologist needs a long continued training in the logic of mathematics before he ventures into the field of probability" (p. 429).
Pearl had in fact raised quite legitimate questions about the application of the χ² test, but in the context of Mendelian theory, to which Pearson was steadfastly opposed. A close reading of Pearl's papers perhaps reveals that he had not followed all Pearson's mathematics, but the questions he had raised were germane. Pearson's response is the apotheosis of mathematical arrogance that, on occasion, frightens biologists and social scientists today:

Shortly Dr Pearl's method is entirely fallacious, as any trained mathematician would have informed Dr Pearl had he sought advice before publication. It is most regrettable that such extensions of biometric theory should be lightly published, without any due sense of responsibility, not solely in biological but in psychological journals. It can only bring biometry into contempt as a science if, professing a mathematical foundation, it yet shows in its manifestations most inadequate mathematical reasoning. (Pearson, 1917, p. 432)

Social scientists beware!

In 1916 Ronald Fisher, then a schoolmaster, raised his standard and made his first foray into what was to become a war. The following correspondence is to be found in E. S. Pearson (1968):

Dear Professor Pearson,

There is an article by Miss Kirstine Smith in the current issue of Biometrika which, I think, ought not to pass without comment. I enclose a short note upon it.
Miss Kirstine Smith proposes to use the minimum value of χ² as a criterion to determine the best form of the frequency curve; ... It should be observed that χ² can only be determined when the material is grouped into arrays, and that its value depends upon the manner in which it is grouped.
There is ... something exceedingly arbitrary in a criterion which depends entirely upon the manner in which the data happens to be grouped. (pp. 454-455)

Pearson replied:

Dear Mr Fisher,
I am afraid that I don't agree with your criticism of Froken K. Smith (she is a pupil of Thiele's and one of the most brilliant of the younger Danish statisticians). . . . your argument that χ² varies with the grouping is of course well known ... What we have to determine, however, is with given grouping which method gives the lowest χ². (p. 455)

Pearson asks for a defense of Fisher's argument before considering its publication. Fisher's expanded criticism received short shrift. After thanking Fisher for a copy of his paper on Mendelian inheritance (Fisher, 1918), he hopes to find time for it,⁵ but points out that he is "not a believer in cumulative Mendelian factors as being the solution of the heredity puzzle" (p. 456). He then rejects Fisher's paper.

Also I fear that I do not agree with your criticism of Dr Kirstine Smith's paper and under present pressure of circumstances must keep the little space I have in Biometrika free from controversy which can only waste what power I have for publishing original work. (p. 456)

Egon Pearson thinks that we can accept his father's reasons for these rejections; the pressure of war work, the suspension of many of his projects, the fact that he was over 60 years old with much work unfinished, and his memory of the strain that had been placed on Weldon in the row with Bateson. But if he really was trying to shun controversy by refusing to publish controversial views, then he most certainly did not succeed. Egon Pearson's defense is entirely understandable but far too charitable. Pearson had censored Fisher's work before and appears to have been trying to establish his authority over the younger man and over statistics. Even his offer to Fisher of an appointment at the Galton Laboratory might be viewed in this light. Fisher Box (1978) notes that these experiences influenced Fisher in his refusal of the offer, in the summer of 1919, recognizing that "nothing would be taught or published at the Galton laboratory without the approval of Pearson" (p. 82).

⁵ Here Pearson is dissembling. Fisher's paper was originally submitted to the Royal Society and, although it was not rejected outright, the Chairman of the Sectional Committee for Zoology "had been in communication with the communicator of the paper, who proposed to withdraw it." Karl Pearson was one of the two referees. Norton and Pearson (1976) describe the event and publish the referees' reports.


The last straw for Fisher came in 1920 when he sent a paper on the probable error of the correlation coefficient to Biometrika:

Dear Mr Fisher,
Only in passing through Town today did I find your communication of August 3rd. I am very sorry for the delay in answering ...
As there has been a delay of three weeks already, and as I fear if I could give full attention to your paper, which I cannot at the present time, I should be unlikely to publish it in its present form ... I would prefer you published elsewhere ... I am regretfully compelled to exclude all that I think erroneous on my own judgment, because I cannot afford controversy. (quoted by E. S. Pearson, 1968, p. 453)

Fisher never again submitted a paper to Biometrika, and in 1922(a) tackled the χ² problem:

This short paper with all its juvenile inadequacies, yet did something to break the ice. Any reader who feels exasperated by its tentative and piecemeal character should remember that it had to find its way to publication past critics who, in the first place, could not believe that Pearson's work stood in need of correction, and who, if this had to be admitted, were sure that they themselves had corrected it. (Fisher's comment in Bennett, 1971, p. 336)

Fisher notes that he is not criticizing the general adequacy of the χ² test but that he intends to show that:

the value of n′ with which the table should be entered is not now equal to the number of cells but to one more than the number of degrees of freedom in the distribution. Thus for a contingency table of r rows and c columns we should take n′ = (c − 1)(r − 1) + 1 instead of n′ = cr. This modification often makes a very great difference to the probability (P) that a given value of χ² should have been obtained by chance. (Fisher, 1922a, p. 88)

It should be noted that Pearson entered the tables using n′ = ν + 1, where n′ is the number of variables (i.e., categories) and ν is what we now call degrees of freedom. The modern tables are entered using ν = (c − 1)(r − 1). The use of n′ to denote sample size and n to denote degrees of freedom, even though in many writings n was also used for sample size, sometimes leads to frustrating reading in these early papers.
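In code, the difference between the two conventions amounts to this (a sketch; the table dimensions are arbitrary):

r, c = 4, 5   # rows and columns of a hypothetical contingency table

pearson_entry = r * c                  # Pearson entered with n' = rc
fisher_entry = (c - 1) * (r - 1) + 1   # Fisher: n' = (c - 1)(r - 1) + 1

# Modern tables are entered directly with the degrees of freedom.
df = (r - 1) * (c - 1)
print(pearson_entry, fisher_entry, df)   # 20 13 12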

It is clear that Pearson did not recognise that in all cases linear restrictions imposed upon the frequencies of the sampled population, by our methods of reconstructing that population, have exactly the same effect upon the distribution of χ² as have restrictions placed upon the cell contents of the sample. (Fisher, 1922a, p. 92)

Pearson was angered and contemptuously replied in the pages of Biometrika. He reiterates the fundamentals of his 1900 paper and then says:

The process of substituting sample constants for sampled population constants does not mean that we select out of possible samples of size n, those which have precisely the same values of the constants as the individual sample under discussion. ... In using the constants of the given sample to replace the constants of the sampled population, we in nowise restrict the original hypothesis of free random samples tied down only by their definite size. We certainly do not by using sample constants reduce in any way the random sampling degrees of freedom.
The above re-description of what seem to me very elementary considerations would be unnecessary had not a recent writer in the Journal of the Royal Statistical Society appeared to have wholly ignored them ... the writer has done no service to the science of statistics by giving it broad-cast circulation in the pages of the Journal of the Royal Statistical Society. (K. Pearson, 1922, p. 187)

And on and on, never referring to Fisher by name but only as "my critic" or "the writer in the Journal of the Royal Statistical Society," until the final assault when he accuses Fisher of a disregard for the nature of the probable error:

I trust my critic will pardon me for comparing him with Don Quixote tilting at the windmill; he must either destroy himself, or the whole theory of probable errors, for they are invariably based on using sample values for those of the sampled population unknown to us. For example here is an argument for Don Quixote of the simplest nature ... (K. Pearson, 1922, p. 191)

The editors of the Journal of the Royal Statistical Society turned tail and ran, refusing to publish Fisher's rejoinder. Fisher vigorously protested, but to no avail, and he resigned from the Society. There were other questions about Pearson's paper that he dealt with in Bowley's journal Economica in 1923 and in the Journal of the Royal Statistical Society (Fisher, 1924a), but the end came in 1926 when, using data tables that had been published in Biometrika, Fisher

calculated the actual average value of χ², which he had proved earlier should theoretically be unity and which Pearson still maintained should be 3. In every case the average was close to unity, in no case near to 3 . . .. There was no reply. (Fisher Box, 1978, p. 88, commenting on Fisher, 1926a)

THE t DISTRIBUTION

W. S. Gosset was a remarkable man, not the least because he managed to maintain reasonably cordial relations with both Pearson and Fisher, and at the same time. Nor did he avoid disagreeing with them on various statistical issues. He was born in 1876 at Canterbury in England, and from 1895 to 1899 was at New College, Oxford, where he took a degree in chemistry and mathematics. In 1899 Gosset became an employee of Arthur Guinness, Son and Company Ltd., the famous manufacturers of stout. He was one of the first of the scientists, trained either at Oxford or Cambridge, that the firm had begun hiring (E. S. Pearson, 1939a). His correspondence with Pearson, Fisher, and others shows him to have been a witty and generous man with a tendency to downplay his role in the development of statistics. Compared with the giants of his day he published very little, but his contribution is of critical importance. As Fisher puts it in his Statistical Methods for Research Workers:

The study of the exact distributions of statistics commences in 1908 with "Student's" paper The Probable Error of the Mean. Once the true nature of the problem was indicated, a large number of sampling problems were within reach of mathematical solution. (Fisher, 1925/1970, p. 23)

The brewery had a policy on publishing by its employees that obliged Gosset to publish his work under the nom de plume "Student." In essence, the problem that "Student" tackled was the development of a statistical test that could be applied to small samples. The nature of the process of brewing, with its variability in temperature and ingredients, means that it is not possible to take large samples over a long run. In a letter to Fisher in 1915, in which he thanks Fisher for the Biometrika paper that begins the mathematical solution to the small sample problem, "Student" says:

The agricultural (and indeed almost any) Experiments naturally required a solution of the mean/S.D. problem, and the Experimental Brewery which concerns such things as the connection between analysis of malt or hops, and the behaviour of the beer, and which takes a day to each unit of the experiment, thus limiting the numbers, demanded an answer to such questions as, "If with a small number of cases I get a value r, what is the probability that there is really a positive correlation of greater value than (say) .25?" (quoted by E. S. Pearson, 1968, p. 447)

Egon Pearson (1939a) notes that in his first few years at Guinness, Gosset was making use of Airy's Theory of Errors (1861), Lupton's Notes on Observations (1898), and Merriman's The Method of Least Squares (1884). In 1904 he presented a report to his firm that stated clearly the utility of the application of statistics to the work of the brewery and pointed up the particular difficulties that might be encountered. The report concludes:

We have been met with the difficulty that none of our books mentions the odds, which are conveniently accepted as being sufficient to establish any conclusion, and it might be of assistance to us to consult some mathematical physicist on the matter. (quoted by Pearson, 1939a, p. 215)

A meeting was in fact arranged with Pearson, and this took place in the summer of 1905. Not all of Gosset's problems were solved, but a supplement to his report and a second report in late August 1905 produced many changes in the statistics used in the brewery. The standard deviation replaced the mean error, and Pearson's correlation coefficient became an almost routine procedure in examining relationships among the many factors involved with brewing. But one feature of the work concerned Gosset: "Correlation coefficients are usually calculated from large numbers of cases, in fact I have found only one paper in Biometrika of which the cases are as few in number as those at which I have been working lately" (quoted by Pearson, 1939a, p. 217).
Gosset expressed doubt about the reliability of the probable error formula for the correlation coefficient when it was applied to small samples. He went to London in September 1906 to spend a year at the Biometric Laboratory. His first paper, published in 1907, derives Poisson's limit of the binomial distribution and applies it to the error in sampling when yeast cells are counted in a haemacytometer. But his most important work during that year was the preparation of his two papers on the probable error of the mean and of the correlation coefficient, both of which were published in 1908.

The usual method of determining that the mean of the population lies within a given distance of the mean of the sample, is to assume a normal distribution about the mean of the sample with a standard deviation equal to s/√n, where s is the standard deviation of the sample, and to use the tables of the probability integral.
But, as we decrease the value of the number of experiments, the value of the standard deviation found from the sample of experiments becomes itself subject to increasing error, until judgements reached in this way become altogether misleading. ("Student," 1908a, pp. 1-2)

"Student"setsout what the paper intendsto do:

I. The equationis determinedof the curve which represents the frequency distribu-
tion of standard deviations of samples drawnfrom a normal population.
II. Thereis shown to be nokind of correlation betweenthe meanand thestandard
deviation of sucha sample.
III. The equationis determinedof the curve representing the frequency distribution
of a quantity z, which is obtained by dividing the distance between the mean of the
sampleand themeanof the populationby the standard deviationof the sample.
IV. The curve foundin I. is discussed.
V. The curve found in III. is discussed.
VI. The two curvesare compared with some actual distributions.
VII. Tables of the curves foundin III. are given for samplesof different size.
VIII and IX. Thetablesareexplainedand some instances aregiven of their use.
X. Conclusions.
("Student," 1908, p. 2)

"Student"did not provide a proof for the distribution of z. Indeed,he first


examined this distribution,andthat of s, byactually drawing samples (of size4)
from measurements, made on 3,000 criminals, takenfrom data usedin a paper
by Macdonell (1901).The frequencydistributionsfor s and zwere thus directly
obtained,and themathematical work came later. There hasbeen comment over
the yearson thefact that "Student's"mathematical approachwas incomplete,
but this shouldnot detractfrom his achievements. Welch (1958)maintainsthat:

The final verdict of mathematical statisticians will,


I believe,be that they have lasting
value. They havethe rare quality of showing us how anexceptional man wasable
to make mathematicalprogresswithout paying too much regard to the rules. He
fortified what he knew with some tentative guessing, but this was backed by
subsequent careful testingof his results, (pp.785-786)

In fact "Student"had given to future generationsof scientists,in particular


social and biological scientists,a new andpowerful distribution.The z test,
which was to becomethe t test,6 led the way for allkindsof significance tests,
and indeed influenced Fisher as hedeveloped that mostuseful of tools, the
analysis of variance. It is also the case thatthe 1908 paperon theprobable error
of the mean ("Student," 1908a) clearly distinguishedbetween whatwe nowcall
sample statisticsand population parameters,a distinction that is absolutely
critical in modern-day statistical reasoning.
The overwhelming influence of the biometriciansof Gower Streetcan
perhaps partly account for the fact that "Student's" workwas ignored. In 1939
Fisher said that"Student's"work was receivedwith "weighty apathy," and,as
late as 1922, Gosset, writing to Fisher,and sendinga copy of his tables, said,
"You are theonly man that's everlikely to usethem!" (quotedby Fisher Box,
1981, p. 63). Pearson was, to say theleast, suspiciousof work using small
samples.It was theassumptionof normality of the samplingdistributionthatled
to problems,but thebiometricians never used small samples,and"only naughty
brewers taken so small that the difference is not of theorder of the probable
error!" (Pearson writingto Gosset, September 1912, quoted by Pearson, 1939a,
p. 218).
Some of "Student's"ideashad been anticipated by Edgeworthas early as
1883 and,as Welch (1958) notes,one might speculateas to what Gosset's
reaction would have been had hebeen awareof this work.
Gosset's paper on the probable error of the correlation coefficient ("Student," 1908b) dealt with the distribution of the r values obtained when sampling from a population in which the two variables were uncorrelated, that is, R = 0.⁷ This endeavor was again based on empirical sampling distributions constructed from the Macdonell data and a mathematical curve fitted afterwards. With characteristic flair, he says that he attempted to fit a Pearson curve to the "no correlation" distribution and came up with a Type II curve. "Working from y = y₀(1 − x²)⁰ for samples of 4, I guessed the formula y = y₀(1 − x²)^((n−4)/2) and proceeded to calculate the moments" ("Student," 1908b, p. 306).

⁶ Eisenhart (1970) concludes that the shift from z to t was due to Fisher and that Gosset chose t for the new statistic. In their correspondence Gosset used t for his own calculations and x for those of Fisher.

⁷ R was Gosset's symbol for the population correlation coefficient; ρ (the Greek letter rho) appears to have been first used for this value by Soper (1913). The important point is that a different symbol was used for sample statistic and population parameter.
He concludes his paper by hoping that his work "may serve as illustrations for the successful solver of the problem" (p. 310). And indeed it did, for Fisher's paper of 1915 showed that "Student" had guessed correctly. Fisher had been in correspondence with Gosset in 1912, sending him a proof that appealed to n-dimensional space of the frequency distribution of z. Gosset wanted Pearson to publish it.

Dear Pearson,
I am enclosing a letter which gives a proof of my formulae for the frequency distribution of z (= x̄/s), ... Would you mind looking at it for me; I don't feel at home in more than three dimensions even if I could understand it otherwise.
It seemed to me that if it's all right perhaps you might like to put the proof in a note. It's so nice and mathematical that it might appeal to some people. (quoted by E. S. Pearson, 1968, p. 446)

In fact the proof was not published then, but again approaching the mathematics through a geometrical representation, Fisher derived the sampling distribution of the correlation coefficient, and this, together with the derivation of the z distribution, was published in the 1915 paper. The sampling distribution of r was, of course, of interest to Pearson and his colleagues; after all, r was Pearson's statistic. Moreover, the distribution follows a remarkable system of curves, with a variety of shapes that depart greatly from the normal, depending on n and the value of the true unknown correlation coefficient R, or, as it is now generally known, ρ. Pearson was anxious to translate theory into numbers, and the computation of the distribution of r was commenced and published as the "co-operative study" of Soper, Young, Cave, Lee, and K. Pearson in 1917.
Although Pearson had said that he would send Fisher the proofs of the paper (the letter is quoted in E. S. Pearson, 1965), there is, apparently, no record that he in fact received them. E. S. Pearson (1965) suggests that Fisher did not know "until late in the day" of the criticism of his particular approach in a section of the study, "On the Determination of the 'Most Likely' Value of the Correlation in the Sampled Population." Egon Pearson argues that his father's criticism, for presumably the elder Pearson had taken a major role in the writing of the "co-operative study," was a misunderstanding based on Fisher's failure to adequately define what he meant by "likelihood" in 1912 and the failure to make clear that it was not based on the Bayesian principle of inverse probability. Fisher Box (1978) says, "their criticism was as unexpected as it was unjust, and it gave an impression of something less than scrupulous regard for a new and therefore vulnerable reputation" (p. 79).
Fisher's response, published in Metron in 1921(a), was the paper, mentioned earlier, that Pearson summarily rejected because he could not afford controversy. In this paper Fisher makes use of the r = tanh z transformation, a transformation that had been introduced, a little tentatively, in the 1915 paper. Its immense utility in transforming the complex system of distributions of r to a simple function of z, which is almost normally distributed, made the laborious work of the co-operative study redundant. In a paper published in 1919, Fisher examines the data on resemblance in twins that had been studied by Edward L. Thorndike in 1905. This is the earliest example of the application of Fisherian statistics to psychological data, for the traits examined were both mental and physical. Fisher looked at the question of whether there were any differences between the resemblances of twins in different traits and here uses the z transformation:

When the resemblances have been expressed in terms of the new variable, a correlation table may be constructed by picking out every pair of resemblances between the same twins in different traits. The values are now centered symmetrically about a mean at 1.28, and the correlation is found to be −.016 ± .048, negative but quite insignificant. The result entirely corroborates THORNDIKE'S conclusions as to the specialization of resemblance. (Fisher, 1919, p. 493)

These manipulations and the development of the same general approach to the distribution of the intraclass correlation coefficient in the 1921 paper are important for the fundamentals of analysis of variance.
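In modern terms Fisher's transformation is z = arctanh r (so that r = tanh z), and its use rests on the near-normality of z, with standard error approximately 1/√(n − 3). A sketch of the modern application (the correlations and sample sizes are hypothetical):

import math

r1, r2 = 0.80, 0.55   # two hypothetical sample correlations
n1, n2 = 50, 50       # their (hypothetical) sample sizes

z1, z2 = math.atanh(r1), math.atanh(r2)   # Fisher's z = arctanh r

# z is roughly normal with standard error 1 / sqrt(n - 3), so the two
# correlations can be compared on a simple, nearly normal scale.
se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
print((z1 - z2) / se)   # referred to the standard normal distribution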
From the point of view of the present-day student of statistics, Fisher's (1925a) paper on the applications of the t distribution is undoubtedly the most comprehensible. Here we see, in familiar notation and most clearly stated, the utility of the t distribution, a proof of its "exactitude ... for normal samples" (p. 92), and the formulae for testing the significance of a difference between means and the significance of regression coefficients.

Finally the probability integral with which we are concerned is of value in calculating the probability integral of a wider class of distributions which is related to "Student's" distribution in the same manner as that of χ² is related to the normal distribution. This wider class of distributions appears (i) in the study of intraclass correlations (ii) in the comparison of estimates of the variance, or of the standard deviation from normal samples (iii) in testing the goodness of fit of regression lines (iv) in testing the significance of a multiple correlation, or (v) of a correlation ratio. (Fisher, 1925a, pp. 102-103)

These monumental achievements were realized in less than 10 years after Gosset's mixture of mathematical conjecture, intuition, and the practical necessities of his work led the way to the t distribution.
From 1912, Gosset and Fisher had been in correspondence (although there were some lengthy gaps), but they did not actually meet until September 1922, when Gosset visited Fisher at Harpenden. Fisher Box (1981) describes their relationship and reproduces excerpts from their letters. At the end of World War I, in 1918, Gosset did not even know how Fisher was employed, and when he learned that Fisher had been a schoolmaster for the duration of the war and was looking for a job wrote, "I hear that Russell [the head of the Rothamsted Experimental station] intends to get a statistician soon, when he gets the money I think, and it might be worth while to keep your ears open to news from Harpenden" (quoted by Fisher Box, 1981, p. 62).
In 1922, work began on the computation of a new set of t tables using values of t = z√(n − 1) rather than z, and the tables were entered with the appropriate degrees of freedom rather than n. Fisher and Gosset both worked on the new tables, and after delays, and fits and starts, and the checking of errors, the tables were published by "Student" in Metron in 1925. Figure 9.2 compares two t distributions to the normal distribution.

[Fig. 9.2: The normal distribution and the t distribution for df = 10 and df = 4]

Fisher's "Applications" paper, mentioned earlier, appeared in the same volume, but it had, in fact, been written quite early in 1924. At that time Fisher was completing his 1925 book and needed tables. Gosset wanted to offer the tables to Pearson and expressed doubts about the copyright, because the first tables had been published in Biometrika. Fisher went ahead and computed all the tables himself, a task he completed later in the year (Fisher Box, 1981).
Fisher sent the "Applications" paper to Gosset in July 1924. He says that the note is:

larger than I had intended, and to make it at all complete should be larger still, but I shall not have time to make it so, as I am sailing for Canada on the 25th, and will not be back till September. (quoted by Fisher Box, 1981, p. 66)

The visit to Canada was made to present a paper (Fisher, 1924b) at the International Congress of Mathematics, meeting that year in Toronto. This paper discusses the interrelationships of χ², z, and t. Fisher was only 35 years old, and yet the foundations of his enormous influence on statistics were now securely laid.

THE F DISTRIBUTION

The Toronto paper did not, in fact, appear until almost 4 years after the meeting, by which time the first edition of Statistical Methods for Research Workers (1925) had been published. The first use of an analysis of variance technique was reported earlier (Fisher & Mackenzie, 1923), but this paper was not encountered by many outside the area of agricultural research. It is possible that if the mathematics of the Toronto paper had been included in Fisher's book then much of the difficulty that its first readers had with it would have been reduced or eliminated, a point that Fisher acknowledged.

After a general introduction on error curves and goodness-of-fit, Fisher examines the χ² statistic and briefly discusses his (correct) approach to degrees of freedom. He then points out that if a number of quantities x₁, . . ., xₙ are distributed independently in the normal distribution with unit standard deviation, then χ² = Σx² is distributed as "the Pearsonian measure of goodness of fit." In fact, Fisher uses S(x²) to denote the latter expression, but here, and in what follows on the commentary on this paper, the more familiar modern notation is employed. Fisher refers to "Student's" work on the error curve of the standard deviation of a small sample drawn from a normal distribution and shows its relation to χ²:

χ² = (n − 1)s²/σ²

where n − 1 is the number of degrees of freedom (one less than the number in the sample) and s² is the best estimate from the sample of the true variance σ².

For the general z distribution, Fisher first points out that s₁ and s₂ are misleading estimates of σ₁ and σ₂ when sample sizes, drawn from normal distributions, are small:

The only exact treatment is to eliminate the unknown quantities σ₁ and σ₂ from the distribution by replacing the distribution of s by that of log s, and so deriving the distribution of log s₁/s₂. Whereas the sampling errors in s₁ are proportional to σ₁, the sampling errors of log s₁ depend only upon the size of the sample from which s₁ was calculated. (Fisher, 1924b, p. 808)

If z = log s₁/s₂,

then z will be distributed about log σ₁/σ₂ as mode, in a distribution which depends wholly upon the integers n₁ and n₂. Knowing this distribution we can tell at once if an observed value of z is or is not consistent with any hypothetical value of the ratio σ₁/σ₂. (Fisher, 1924b, p. 808)

The cases for infinite and unit degrees of freedom are then considered. In the latter case the "Student" distributions are generated.

In discussing the accuracy to be ascribed to the mean of a small sample, "Student" took the revolutionary step of allowing for the random sampling variation of his estimate of the standard error. If the standard error were known with accuracy the deviation of an observed value from expectation (say zero), divided by the standard error would be distributed normally with unit standard deviation; but if for the accurate standard deviation we substitute an estimate based on n − 1 degrees of freedom we have

t = (deviation from expectation)/(estimated standard error)

consequently the distribution of t is given by putting n₁ [degrees of freedom] = 1 and substituting z = ½ log t². (Fisher, 1924b, pp. 808-809)

For the case of an infinite number of degrees of freedom, the t distribution becomes the normal distribution.

In the final sections of the paper the r = tanh z transformation is applied to the intraclass correlation, an analysis of variance summary table is shown, and z = logₑ(s₁/s₂) is the value that may be used to test the significance of the intraclass correlation. These basic methods are also shown to lead to tests of significance for multiple correlation and η (eta), the correlation ratio.
The transition from z to F was not Fisher's work. In 1934, George W. Snedecor, in the first of a number of texts designed to make the technique of analysis of variance intelligible to a wider audience, defined F as the ratio of the larger mean square to the smaller mean square taken from the summary table, using the formula z = ½ logₑ F. Figure 9.3 shows two F distributions. Snedecor was Professor of Mathematics and Director of the Statistical Laboratory at Iowa State College in the United States. This institution was largely responsible for promoting the value of the modern statistical techniques in North America.

Fisher himself avoided the use of the symbol F because it was not used by P. C. Mahalanobis, an Indian statistician who had visited Fisher at Rothamsted in 1926, and who had tabulated the values of the variance ratio in 1932, and thus established priority. Snedecor did not know of the work of Mahalanobis, a clear-thinking and enthusiastic worker who became the first Director of the Indian Statistical Institute in 1932. Today it is still occasionally the case that the ratio is referred to as "Snedecor's F" in honor of both Snedecor and Fisher.

[Fig. 9.3: The F distributions for 1,5 and 8,8 degrees of freedom]
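The relation between the two forms, and the behavior of the variance ratio itself, can be checked by simulation (a sketch; the equal-variance normal populations, the sample sizes, and the number of replications are arbitrary choices):

import math
import random
import statistics

n1, n2 = 9, 9   # giving 8 and 8 degrees of freedom, as in Fig. 9.3

def sample_var(n):
    # sample variance (n - 1 divisor) from a standard normal population
    return statistics.variance([random.gauss(0, 1) for _ in range(n)])

fs = [sample_var(n1) / sample_var(n2) for _ in range(5000)]
zs = [0.5 * math.log(f) for f in fs]   # Fisher's z = 1/2 log_e F

# With equal population variances the z values center on zero.
print(statistics.mean(zs))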

THE CENTRAL LIMIT THEOREM

The transformation of numerical data to statistics and the assessment of the probability of the outcome being a chance occurrence, or the assessment of a probable range of values within which the outcome will fall, is a reasonable general description of the statistical inferential strategy. The mathematical foundations of these procedures rest on a remarkable theorem, or rather a set of theorems, that are grouped together as the Central Limit Theorem. Much of the work that led to this theorem has already been mentioned but it is appropriate to summarize it here.

In the most general terms, the problem is to discover the probability distribution of the cumulative effect of many independently acting, and very small, random effects. The central limit theorem brings together the mathematical propositions that show that the required distribution is the normal distribution. In other words, a summary of a subset of random observations X₁, X₂, . . ., Xₙ, say their sum ΣX = X₁ + X₂ + . . . + Xₙ or their mean, (ΣX)/n, has a sampling distribution that approaches the shape of the normal distribution. Figure 9.4 is an illustration of a sampling distribution of means. This chapter has described distributions of other statistical summaries, but it will have been noted that, in one way or another, those distributions have links with the normal distribution.

[Fig. 9.4: A sampling distribution of means (n = 6); 500 draws from the numbers 1 through 10]
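A few lines reproduce the construction described in the caption of Figure 9.4 (a sketch; sampling with replacement is assumed):

import random
import statistics

# 500 draws of samples of size 6 from the numbers 1 through 10,
# as described for Fig. 9.4.
means = [statistics.mean(random.choices(range(1, 11), k=6))
         for _ in range(500)]

# The 500 sample means pile up around the population mean of 5.5 in
# the roughly normal shape that the theorem predicts.
print(statistics.mean(means), statistics.stdev(means))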

The history of the development of the theorem has been brought together in a useful and eminently readable little book by Adams (1974). The story begins with the definition of probability as long-run relative frequency, the ratio of the number of ways an event can occur to the total number of possible events, given that the events are independent and equally likely. The most important contribution of James Bernoulli's Ars Conjectandi is the first limit theorem of probability theory. The logic of Bernoulli's approach has been the subject of some debate, and we know that he himself wrestled with it. Hacking (1971) states:

No one writes dispassionately about Bernoulli. He has been fathered with the first subjective concept of probability, and with a completely objective concept of probability as relative frequency determined by trials on a chance set-up. He has been thought to favour an inductive theory of probability akin to Carnap's. Yet he is said to anticipate Neyman's confidence interval technique of statistical inference, which is quite opposed to inductive probabilities. In fact Bernoulli was, like so many of us, attracted to many of these seemingly incompatible ideas, and he was unsure where to rest his case. He left his book unpublished. (pp. 209-210)

But Bernoulli's mathematics are not in question. When the probability p of an event E is unknown and a sequence of n trials is observed and the proportion of occurrences of E is Eₙ, then Bernoulli maintained that an "instinct of nature" causes us to use Eₙ as an estimate of p. Bernoulli's Law of Large Numbers shows that, for any small assigned amount ε, the probability that |p − Eₙ| < ε increases to 1 as n increases indefinitely:

It is concluded that in 25,550 trials it is more than one thousand times more likely that the r/t [the ratio of what Bernoulli calls "fertile" events to the total, that is, the event of interest, here called Eₙ] will fall between 31/50 and 29/50 than that r/t will fall outside these limits. (Bernoulli, 1713/1966, p. 65)

The above results hold for known p. If p is 3/5, then we can be morally certain that ... the deviation ... will be less than 1/50. But Bernoulli's problem is the inverse of this. When p is unknown, can his analysis tell when we can be morally certain that some estimate of p is right? That is the problem of statistical inference. (Hacking, 1971, p. 222)
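Bernoulli's limit theorem can be exhibited numerically (a sketch; p, ε, and the trial counts are arbitrary and make no attempt to reproduce his 25,550-trial calculation):

import random

p, epsilon = 0.6, 0.02   # a known p and a small assigned amount

def prob_within(n, experiments=500):
    # Estimate P(|E_n - p| < epsilon) by repeating an n-trial experiment.
    hits = 0
    for _ in range(experiments):
        successes = sum(random.random() < p for _ in range(n))
        if abs(successes / n - p) < epsilon:
            hits += 1
    return hits / experiments

for n in (100, 1000, 5000):
    print(n, prob_within(n))   # the probability climbs toward 1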

The determination of p was of course the problem tackled by De Moivre. He used the integral of e^(−x²) to approximate it, but as Adams (1974) and others have noted, there is no direct evidence that implies that De Moivre thought of what is now called the normal distribution as a probability distribution as such. Simpson introduced the notion of a probability distribution of observations, and others, notably Joseph Lagrange (1736-1813) and Daniel Bernoulli (1700-1782), who was James's nephew, elaborated laws of error. The late 18th century saw the culmination of this work in the development of the normal law of frequency of error and its application to astronomical observations by Gauss. It was Laplace's memoir of 1810 that introduced the central limit theorem, but the nub of his discovery was described in nonmathematical language in his famous Essai published as the introduction to the third edition of Théorie Analytique des Probabilités in 1820:

The general problem consists in determining the probabilities that the values of one or several linear functions of the errors of a very great number of observations are contained within any limits. The law of the possibility of the errors of observations introduces into the expressions of these probabilities a constant, whose value seems to require the knowledge of this law, which is almost always unknown. Happily this constant can be determined from the observations.
There often exists in the observations many sources of errors: ... The analysis which I have used leads easily, whatever the number of the sources of error may be, to the system of factors which gives the most advantageous results, or those in which the same error is less probable than in any other system.
I ought to make here an important remark. The small uncertainty that the observations, when they are not numerous, leave in regard to the values of the constants ... renders a little uncertain the probabilities determined by analysis. But it almost always suffices to know if the probability, that the errors of the results obtained are comprised within narrow limits, approaches closely to unity; and when it is not, it suffices to know up to what point the observations should be multiplied, in order to obtain a probability such that no reasonable doubt remains ... The analytic formulae of probabilities satisfy perfectly this requirement; . . . They are likewise indispensable in solving a great number of problems in the natural and moral sciences. (Laplace, 1820, pp. 192-195 in the translation by Truscott & Emory, 1951)

These are elegant and clear remarks by a genius who succeeded in making the labors of many years intelligible to a wide readership.

Adams (1974) gives a brief account of the final formal development of the abstract Central Limit Theorem. The Russian mathematician Alexander Lyapunoff (1857-1918), a pupil of Tchebycheff, provided a rigorous mathematical proof of the theorem. His attention was drawn to the problem when he was preparing lectures for a course in probability theory, and his approach was a triumph. His methods and insights led, in the 1920s, to many valuable contributions and even more powerful theorems in probability mathematics.
10
Comparisons, Correlations, and Predictions

COMPARING MEASUREMENTS

Since the time of De Moivre, the variables that have been examined by workers in the field of probability have expressed measurements as multiples of a variety of basic units that reflect the dispersion of the range of possible scores. Today the chosen units are units of standard deviation, and the scores obtained are called standard scores or z scores. Karl Pearson used the term standard deviation and gave it the symbol σ (the lower case Greek letter sigma) in 1894, but the unit was known (although not in its present-day form) to De Moivre. It corresponds to that point on the abscissa of a normal distribution such that an ordinate erected from it would cut the curve at its point of inflection, or, in simple terms, the point where the curvature of the function changes from concave to convex. Sir George Airy (1801-1892) named σ√2 the modulus (although this term had been used, in passing, for √n by De Moivre as early as 1733) and described a variety of other possible measures, including the probable error, in 1875.¹ This latter was the unit chosen by Galton (whose work is discussed later), although he objected strongly to the name:

It is astonishing that mathematicians, who are the most precise and perspicacious of men, have not long since revolted against this cumbrous, slip-shod, and misleading phrase.... Moreover the term Probable Error is absurd when applied to the subjects now in hand, such as Stature, Eye-colour, Artistic Faculty, or Disease. I shall therefore usually speak of Prob. Deviation. (Galton, 1889, p. 58)

¹ The probable error is defined as one half of the quantity that encompasses the middle 50% of a normal distribution of measurements. It is equivalent to what is sometimes called the semi-interquartile range (that portion of the distribution between the first quartile, or the twenty-fifth percentile, and the third quartile, or the seventy-fifth percentile, divided by two). The probable error is 0.67449 times the standard deviation. Nowadays, everyone follows Pearson (1894), who wrote, "I have always found it more convenient to work with the standard deviation than with the probable error or the modulus, in terms of which the error function is usually tabulated" (p. 88).

This objection reflects Galton's determination, and that of his followers, to avoid the use of the concept of error in describing the variation of human characteristics. It also foreshadows the well-nigh complete replacement of probable error with standard deviation, and law of frequency of error with normal distribution, developments that reflect philosophical dispositions rather than mathematical advance. Figure 10.1 shows the relationship between standard scores and probable error.

[Fig. 10.1: The normal distribution - standard scores and the probable error]
Perhaps a word or two of elaboration and example will illustrate the utility of measurements made in units of variability. It may seem trite to make the statement that individual measurements and quantitative descriptions are made for the purpose of making comparisons. Someone who pays $1,000 for a suit has bought an expensive suit, as well as having paid a great deal more than the individual who has picked up a cheap outfit for only $50. The labels "expensive" and "cheap" are applied because the suit-buying population carries around with it some notion of the average price of suits and some notion of the range of prices of suits. This obvious fact would be made even more obvious by the reaction to an announcement that someone had just purchased a new car for $1,000. What is a lot of money for a suit suddenly becomes almost trifling for a brand new car. Again this judgment depends on a knowledge, which may not be at all precise, of the average price, and the price range, of automobiles. One can own an expensive suit and a cheap car and have paid the same absolute amount for both, although it must be admitted that such a juxtaposition of purchases is unlikely! The point is that these examples illustrate the fundamental objective of standard scores, the comparison of measurements.

If the mean price of cars is $9,000 and the standard deviation is $2,500 then our $1,000 car has an equivalent z score of ($1,000 − $9,000)/$2,500 or −3.20. If suit prices have a mean of $350 and a standard deviation of $150, then the $1,000 suit has an equivalent z score of ($1,000 − $350)/$150 or +4.33. We have a very cheap car and a very, very expensive suit. It might be added that, at the time of writing, these figures were entirely hypothetical.
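The same arithmetic in code (a trivial sketch using the chapter's hypothetical figures):

def z_score(x, mean, sd):
    # a standard score: distance from the mean in standard deviation units
    return (x - mean) / sd

print(z_score(1000, 9000, 2500))   # the car:  -3.20
print(z_score(1000, 350, 150))     # the suit: about +4.33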
These simple manipulations are well-known to users of statistics, and, of course, when they are applied in conjunction with the probability distributions of the measurements, they enable us to obtain the probability of occurrence of particular scores or particular ranges of scores. They are intuitively sensible. A more challenging problem is that of the comparison of sets of pairs of scores and of determining a quantitative description of the relationship between them. It is a problem that was solved during the second half of the 19th century.

GALTON'S DISCOVERY OF "REGRESSION"

Sir Francis Galton (his work was recognized by the award of a knighthood in 1909) has been described as a Victorian genius. If we follow Edison's oft-quoted definition² of this condition then there can be no quarrel with the designation, but Galton was a man of great flair as well as great energy and his ideas and innovations were many and varied. In a long life he produced over 300 publications, including 17 books. But, in one of those odd happenings in the history of human invention, it was Galton's discovery and subsequent misinterpretation of a statistical artifact that marks the beginning of the technique of correlation as we now know it.

1860 was the year of the famous debate on Darwin's work between Thomas Huxley (1825-1895) and Bishop Samuel Wilberforce (1805-1873). This encounter, which held the attention of the nation, took place at the meeting of the British Association (for the Advancement of Science) held that year at Oxford. Galton attended the meeting, and although we do not know what role he had at it, we do know that he later became an ardent supporter of his cousin's theories.

² In 1932, Thomas Alva Edison said that "Genius is one percent inspiration and ninety-nine per cent perspiration."

The Origin of Species, he said, "made a marked epoch on my development, as it did in that of human thought generally" (Galton, 1908, p. 287).

Cowan (1977) notes that, although Galton had made similar assertions before, the impact of The Origin must have been retrospective, describing his initial reaction to the book as "pedestrian in the extreme." She also asserts that "Galton never really understood the argument for evolution by natural selection, nor was he interested in the problem of the creation of new species" (Cowan, 1977, p. 165).
In fact, it is apparent that Galton was quite selective in using his cousin's work to support his own view of the mechanisms of heredity.
The Oxford debate was not just a debate about evolution. MacKenzie (1981) describes Darwin's book as the "basic work of Victorian scientific naturalism" (p. 54), the notion of the world, and the human species and its works, as part of rational scientific nature, needing no recourse to the supernatural to explain the mysteries of existence. Naturalism has its origins in the rise of science in the 17th and 18th centuries, and its opponents expressed their concern because of its attack, implicit and overt, on traditional authority. As one 19th century writer notes, for example:

Wider speculations as to morality inevitably occur as soon as the vision of God becomes faint; when the Almighty retires behind second causes, instead of being felt as an immediate presence, and his existence becomes the subject of logical proof. (Stephen, 1876, Vol. II, p. 2)

The return to nature espoused by writers such as Jean Jacques Rousseau (1712-1778) and his followers would have rid the world of kings and priests and aristocrats whose authority rested on tradition and instinct rather than reason, and thus, they insisted, have brought about a simple "natural" state of society. MacKenzie (1981), citing Turner (1978), adds a very practical note to this philosophy. The battle was not just about intellectual abstractions but about who should have authority and control and who should enjoy the material advantages that flow from the possession of that authority. Scientific naturalism was the weapon of the middle class in its struggle for power and authority based on intellect and merit and professional elitism, and not on patronage or nobility or religious affiliation.
These ideas, and the new biology, certainly turned Galton away from religion, as well as providing him with an abiding interest in heredity. Forrest (1974) suggests that Galton's fascination for, and work on, heredity coincided with the realization that his own marriage would be infertile. This also may have been a factor in the mental breakdown that he suffered in 1866. "Another possible precipitating factor was the loss of his religious faith which left him with no compensatory philosophy until his programme for the eugenic improvement of mankind became a future article of faith" (Forrest, 1974, p. 85).

During the years 1866 to 1869 Galton was in generally poor health, but he collected the material for, and wrote, one of his most famous books, Hereditary Genius, which was published in 1869. In this book and in English Men of Science, published in 1874, Galton expounds and expands upon his view that ability, talent, intellectual power, and accompanying eminence are innately rather than environmentally determined. The corollary of this view was, of course, that agencies of social control should be established that would encourage the "best stock" to have children, and to discourage those whom Galton described as having the smallest quantities of "civic worth" from breeding. It has been mentioned that Galton was a collector of measurements to a degree that was almost compulsive, and it is certain that he was not happy with the fact that he had had to use qualitative rather than quantitative data in support of his arguments in the two books just cited.
MacKenzie (1981) maintains that:

the needs of eugenics in large part determined the content of Galton's statistical theory. ... If the immediate problems of eugenics research were to be solved, a new theory of statistics, different from that of the previously dominant error theorists, had to be constructed. (MacKenzie, 1981, p. 52)

Galton embarked on a comparative study of the size and weight of sweet pea seeds³ over two generations, but, as he later remarked, "It was anthropological evidence that I desired, caring only for the seeds as means of throwing light on heredity in man. I tried in vain for a long and weary time to obtain it in sufficient abundance" (Galton, 1885a, p. 247).
Galton began by weighing, and measuring the diameters of, thousands of sweet pea seeds. He computed the mean and the probable error of the weights of these seeds and made up packets of 10 seeds, each of the seeds being exactly the same weight. The smallest packet contained seeds weighing the mean minus three times the probable error, the next the mean minus twice the probable error, and so on up to packets containing the largest seeds weighing the mean plus three times the probable error. Sets of the seven packets were sent to friends across the length and breadth of Britain with detailed instructions on how they were to be planted and nurtured. There were two crop failures, but the produce of seven harvests provided Galton with the data for a Royal Institution lecture, "Typical Laws of Heredity," given in 1877. Complete data for Galton's experiment are not available, but he observed what he stated to be a simple law that connected parent and offspring seeds. The offspring of each of the parental weight categories had weights that were what we would now call normally distributed, and the probable error (we would now calculate the standard deviation) was the same. However, the mean weight of each group of offspring was not as extreme as the parental weight. Large parent seeds produced larger than average seeds, but mean offspring weight was not as large as parental weight. At the other extreme small parental seeds produced, on average, smaller offspring seeds, but the mean of the offspring was found not to be as small as that of the parents. This phenomenon Galton termed reversion.

³ In Memories of My Life (1908), Galton says that he determined on experimenting with sweet peas in 1885 and that the suggestion had come to him from Sir Joseph Hooker (the botanist, 1817-1911) and Darwin. But Darwin had died in 1882 and the experiments must have been suggested and were begun in 1875. Assuming that this is not just a typographical error, perhaps Galton was recalling the date of his important paper on regression in hereditary stature.
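The reversion effect is easy to reproduce numerically. A minimal simulation sketch (the units, the reversion coefficient of one-third, and the sample sizes are illustrative assumptions, not Galton's figures):

```python
import numpy as np

rng = np.random.default_rng(1877)

mean, pe = 100.0, 10.0  # hypothetical seed-weight units; pe = probable error
parents = mean + pe * np.array([-3, -2, -1, 0, 1, 2, 3])  # the seven packets
for p in parents:
    # Offspring weight: partial resemblance to the parent plus
    # independent "family variability" around the group mean.
    offspring = mean + (p - mean) / 3 + rng.normal(0, pe, size=1000)
    print(f"parent {p:6.1f} -> offspring mean {offspring.mean():6.1f}")
```

Each group's mean falls back toward the overall mean of 100, and the spread within each group is the same: Galton's reversion.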
Seed diameters, Galton noted, are directly proportional to their weight, and show the same effect:

By family variability is meant the departure of the children of the same or similarly descended families, from the ideal mean type of all of them. Reversion is the tendency of that ideal mean filial type to depart from the parent type, "reverting" towards what may be roughly and perhaps fairly described as the average ancestral type. If family variability had been the only process in simple descent that affected the characteristics of a sample, the dispersion of the race from its mean ideal type would indefinitely increase with the number of generations; but reversion checks this increase, and brings it to a standstill. (Galton, 1877, p. 291)

In the 1877 paper, Galton gives a measure of reversion, which he symbolized r, and arrives at a number of the basic properties of what we now call regression. Some years passed before Galton returned to the topic, but during those years he devised ways of obtaining the anthropometric data he wanted. He offered prizes for the most detailed accounts of family histories of physical and mental characteristics, character and temperament, occupations and illnesses, height and appearance, and so on, and in 1884 he opened, at his own expense, an anthropometric laboratory at the International Health Exhibition.
For a small sum of money, members of the public were admitted to the laboratory. In return the visitors received a record of their various physical dimensions, measures of strength, sensory acuities, breathing capacity, color discrimination, and judgments of length. 9,337 persons were measured, of whom 4,726 were adult males and 1,657 adult females. At the end of the exhibition, Galton obtained a site for the laboratory at the South Kensington Museum, where data continued to be collected for close to 8 more years. These data formed part of a number of papers. They were, of course, not entirely free from errors due to apparatus failure, the circumstances of the data recording, and other factors that are familiar enough to experimentalists of the present day. Some artifacts might have been introduced in other ways. Galton (1884) commented, "Hardly any trouble occurred with the visitors, though on some few occasions rough persons entered the laboratory who were apparently not altogether sober" (p. 206).
FIG. 10.2 Plate from Galton's 1885a paper (Journal of the Anthropological Institute).
FIG. 10.3 Plate from Galton's 1885a paper (Journal of the Anthropological Institute).

In 1885, Galton's Presidential Address to the Anthropological Section of the British Association (1885b), meeting that year in Aberdeen, Scotland, discussed the phenomenon of what he now termed regression toward mediocrity in human hereditary stature. An extended paper (1885a) in the Journal of the Anthropological Institute gives illustrations, which are reproduced here.
First it may be noted that Galton used a measure of parental heights that he termed the height of the "mid-parent." He multiplied the mother's height by 1.08 and took the mean of the resulting value and the father's height to produce the mid-parent value.⁴ He found that a deviation from mediocrity of one unit of height in the parents was accompanied by a deviation, on average, of about only two-thirds of a unit in the children (Fig. 10.2). This outcome paralleled what he had observed in the sweet pea data.
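A minimal sketch of the mid-parent calculation (the example heights are hypothetical):

```python
def mid_parent(father, mother, k=1.08):
    """Galton's mid-parent height: the mean of the father's height and
    the mother's height scaled by the factor k = 1.08."""
    return (father + k * mother) / 2

print(mid_parent(70.0, 64.0))  # 69.56 inches
```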
When the frequencies of the (adult) children's measurements were entered into a matrix against mid-parent heights, the data being "smoothed" by computing the means of four adjacent cells, Galton noticed that values of the same frequency fell on a line that constituted an ellipse. Indeed, the data produced a series of ellipses all centered on the mean of the measurements. Straight lines drawn from this center to points on the ellipse that were maximally distant (the points of contact of the horizontal and vertical tangents - the lines YN and XM in Fig. 10.3) produce the regression lines, ON and OM, and the slopes of these lines give the regression values of about two-thirds and one-third.
The elliptic contours, which Galton said he noticed when he was pondering on his data while waiting for a train, are nothing more than the contour lines that are produced from the horizontal sections of the frequency surface generated by two normal distributions (Fig. 10.4).
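Galton's geometric construction can be checked directly. For standardized bivariate-normal variables an equal-frequency contour is the ellipse x² - 2ρxy + y² = c, and its points of vertical tangency lie exactly on the line y = ρx; a minimal sketch (the values of ρ and c are arbitrary illustrations):

```python
import math

rho, c = 0.5, 1.0
# Vertical tangency of the contour F(x, y) = x^2 - 2*rho*x*y + y^2 = c
# requires dF/dy = -2*rho*x + 2*y = 0, i.e. y = rho * x.
x_tan = math.sqrt(c / (1 - rho**2))  # substitute y = rho*x back into F = c
y_tan = rho * x_tan
print(y_tan / x_tan)  # 0.5: the slope of the tangent-point line equals rho
```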
The time had now come for some serious mathematics.

All the formulae for Conic Sections having long since gone out of my head, I went on my return to London to the Royal Institution to read them up. Professor, now Sir James, Dewar came in, and probably noticing signs of despair on my face, asked me what I was about; then said, "Why do you bother over this? My brother-in-law, J. Hamilton Dickson of Peterhouse loves problems and wants new ones. Send it to him." I did so, under the form of a problem in mechanics, and he most cordially helped me by working it out, as proposed, on the basis of the usually accepted and generally justifiable Gaussian Law of Error. (Galton, 1908, pp. 302-303)
I may be permitted to say that I never felt such a glow of loyalty and respect towards the sovereignty and magnificent sway of mathematical analysis as when his answer reached me, confirming, by purely mathematical reasoning, my various and laborious statistical conclusions with far more minuteness than I had dared to hope, for the original data ran somewhat roughly, and I had to smooth them with tender caution. (Galton, 1885b, p. 509)

⁴ Galton (1885a) maintains that this factor "differs a very little from the factors employed by other anthropologists, who, moreover, differ a trifle between themselves; anyhow it suits my data better than 1.07 or 1.09" (p. 247). Galton also maintained (and checked in his data) "that marriage selection takes little or no account of shortness or tallness ... we may therefore regard the married folk as couples picked out of the general population at haphazard" (1885a, pp. 250-251) - a statement that is not only implausible to anyone who has casually observed married couples but is also not borne out by reasonably careful investigation.

FIG. 10.4 Frequency Surfaces and Ellipses (From Yule & Kendall, 14th Ed., 1950)

Now Galton was certainly not a mathematical ignoramus, but the fact that one of statistics' founding fathers sought help for the analysis of his momentous discovery may be of some small comfort to students of the social sciences who sometimes find mathematics such a trial. Another breakthrough was to come, and again, it was to culminate in a mathematical analysis, this time by Karl Pearson, and the development of the familiar formula for the correlation coefficient.
In 1886, a paper on "Family Likeness in Stature," published in the Proceedings of the Royal Society, presents Hamilton Dickson's contribution, as well as data collected by Galton from family records. Of passing interest is his use of the symbol w for "the ratio of regression" (short-lived, as it turns out) as he details correlations between pairs of relatives.
Galton's work had led him to a method of describing the relationship between parents and offspring and between other relatives on a particular characteristic by using the regression slope. Now he applied himself to the task of quantifying the relationship between different characteristics, the sort of data collected at the anthropometric laboratory. It dawned on him in a flash of insight that if each characteristic was measured on a scale based on its own variability (in other words, in what we now call standard scores), then the regression coefficient could be applied to these data. It was noted in chapter 1 that the location of this illumination was perhaps not the place that Galton recalled in his memoirs (written when he was in his eighties), so that the commemorative tablet that Pearson said the discovery deserved will have to be sited carefully.
Before examining some of the consequences of Galton's inspiration, the phenomenon of regression is worth a further look. Arithmetically it is real enough, but, as Forrest (1974) puts it:

It is not that the offspring have been forced towards mediocrity by the pressure of their mediocre remote ancestry, but a consequence of a less than perfect correlation between the parents and their offspring. By restricting his analysis to the offspring of a selected parentage and attempting to understand their deviations from the mean Galton fails to account for the deviation of all offspring. ... Galton's conclusion is that regression is perpetual and that the only way in which evolutionary change can occur is through the occurrence of sports. (p. 206)⁵

⁵ In this context the term "sport" refers to an animal or a plant that differs strikingly from its species type. In modern parlance we would speak of "mutations." It is ironic that the Mendelians, led by William Bateson (1861-1926), used Galton's work in support of their argument that evolution was discontinuous and saltatory, and that the biometricians, led by Pearson and Weldon, who held fast to the notion that continuous variation was the fountainhead of evolution, took inspiration from the same source.

Wallis and Roberts (1956) present a delightful account of the fallacy and give examples of it in connection with company profits, family incomes, mid-term and final grades, and sales and political campaigns. As they say:

Take any set of data, arrange them in groups according to some characteristic, and then for each group compute the average of some second characteristic. Then the variability of the second characteristic will usually appear to be less than that of the first characteristic. (p. 263)

In statistics the term regression now means prediction, a point of confusion for many students unfamiliar with its history. Yule and Kendall (1950) observe:

The term "regression" is not a particularly happy one from the etymological point of view, but it is so firmly embedded in statistical literature that we make no attempt to replace it by an expression which would more suitably express its essential properties. (p. 213)

The word is now part of the statistical arsenal, and it serves to remind those of us who are involved in its application of an important episode in the history of the discipline.

GALTON'S MEASURE OF CO-RELATION

On December 20th, 1888, Galton's paper, Co-relations and Their Measurement, Chiefly from Anthropometric Data, was read before the Royal Society of London. It begins:

"Co-relation or correlation of structure" is a phrase much used in biology, and not least in that branch of it which refers to heredity, and the idea is even more frequently present than the phrase; but I am not aware of any previous attempt to define it clearly, to trace its mode of action in detail, or to show how to measure its degree. (Galton, 1888a, p. 135)

He goes on to state that the co-relation between two variable organs must be due, in part, to common causes, that if variation was wholly due to common causes then co-relation would be perfect, and if variation "were in no respect due to common causes, the co-relation would be nil" (p. 135). His aim then is to show how this co-relation may be expressed as a simple number, and he uses as an illustration the relationship between the left cubit (the distance between the elbow of the bent left arm and the tip of the middle finger) and stature, although he presents tables showing the relationships between a variety of other physical measurements. His data are drawn from measurements made on 350 adult males at the anthropometric laboratory, and then, as now, there were missing data. "The exact number of 350 is not preserved throughout, as injury to some limb or other reduced the available number by 1, 2, or 3 in different cases" (Galton, 1888a, p. 137).
After tabulating the data in order of magnitude, Galton noted the values at the first, second, and third quartiles. One half the value obtained by subtracting the value at the first from the value at the third quartile gives him Q, which, he notes, is the probable error of any single measure in the series, and the value at the second quartile is the median. For stature he obtained a median of 67.2 inches and a Q of 1.75, and for the left cubit, a median of 18.05 inches and a Q of 0.56. It should be noted that although Galton calculated the median, he refers to it as being practically the mean value, "because the series run with fair symmetry" (p. 137). Galton clearly recognized that these manipulations did not demand that the original units of measurement be the same:

It will be understood that the Q value is a universal unit applicable to the most varied measurements, such as breathing capacity, strength, memory, keenness of eyesight, and enables them to be compared together on equal terms notwithstanding their intrinsic diversity. (Galton, 1888a, p. 137)
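In modern terms Q is the semi-interquartile range. A minimal sketch of the calculation (the stature data are simulated, not Galton's, with parameters chosen to echo his median of 67.2 and Q of 1.75):

```python
import numpy as np

def galton_q(values):
    """Galton's Q: half the interquartile range, his probable error
    of a single measure in the series."""
    q1, q3 = np.percentile(values, [25, 75])
    return (q3 - q1) / 2

rng = np.random.default_rng(1888)
stature = rng.normal(67.2, 2.6, size=350)  # hypothetical sample, in inches
print(np.median(stature), galton_q(stature))
```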

Perhaps in an unconscious anticipation of the inevitable universality of the metric system, he also records his data on physical dimensions in centimeters. Figure 10.5 reproduces the data (Table III in Galton's paper) from which the closeness of the co-relation between stature and cubit was calculated.
A graph was plotted of stature, measured in deviations from Ms in units of Qs, against the mean of the corresponding left cubits, again measured as deviations, this time from Mc in units of Qc (column A against column B in the table). Between the same axes, left cubit was plotted as a deviation from Mc measured in units of Qc against the mean of corresponding statures measured as deviations from Ms in units of Qs (columns C and D in the table). A line is then drawn to represent "the general run" of the plotted points:

It is here seen to be a straight line, and it was similarly found to be straight in every other figure drawn from the different pairs of co-related variables that I have as yet tried. But the inclination of the line to the vertical differs considerably in different cases. In the present one the inclination is such that a deviation of 1 on the part of the subject [the ordinate values], whether it be stature or cubit, is accompanied by a mean deviation on the part of the relative [the values of the abscissa], whether it be cubit or stature, of 0.8. This decimal fraction is consequently the measure of the closeness of the co-relation. (Galton, 1888a, p. 140)

Galton also calculates the predicted values from the regression line. He takes what he terms the "smoothed" (i.e., read from the regression line) value for a given deviation measure in units of Qc or Qs, multiplies it by Qc or Qs, and adds the result to the mean Mc or Ms. For example, +1.30(0.56) + 18.05 = 18.8. In modern terms, he computes z'(s) + X̄. It is illuminating to recompute and to replot Galton's data and to follow his line of statistical reasoning.

FIG. 10.5 Galton's Data 1888. Proceedings of the Royal Society.
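A minimal sketch of this back-transformation, using Galton's own figures:

```python
def predict_from_smoothed(z_smoothed, q, median):
    """Turn a 'smoothed' standardized value read off the regression line
    back into raw units: multiply by Q and add the median."""
    return z_smoothed * q + median

# Galton's worked example: a smoothed cubit deviation of +1.30 Q-units.
print(predict_from_smoothed(1.30, q=0.56, median=18.05))  # 18.778, his 18.8
```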


Finally, Galton returns to his original symbol r, to represent degree of co-relation, the symbol which we use today, and redefines f = √(1 - r²) as "the Q value of the distribution of any system of x values, as x₁, x₂, x₃, &c., round the mean of all of them, which we may call X" (p. 144), which mirrors our modern-day calculation of the standard error of estimate, and which Galton had obtained in 1877.
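In modern notation, with the spread of the predicted variable written s instead of Q, this is the standard error of estimate; a minimal sketch:

```python
import math

def standard_error_of_estimate(s, r):
    """Spread of the observed values about the regression predictions:
    s * sqrt(1 - r**2)."""
    return s * math.sqrt(1 - r**2)

# With Galton's cubit figures, Q = 0.56 and r = 0.8:
print(standard_error_of_estimate(0.56, 0.8))  # 0.336
```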
This short paper is not a complete account of the measurement of correlation. Galton shows that he has not yet arrived at the notion of negative correlation nor of multiple correlation, but his concluding sentences show just how far he did go:

Let y = the deviation of the subject, whichever of the two variables may be taken in that capacity; and let x₁, x₂, x₃, &c., be the corresponding deviations of the relative, and let the mean of these be X. Then we find: (1) that y = rX for all values of y; (2) that r is the same whichever of the two variables is taken for the subject; (3) that r is always less than 1; (4) that r measures the closeness of co-relation. (Galton, 1888a, p. 145)

Chronologically, Galton's final contribution to regression and heredity is his book Natural Inheritance, published in 1889. This book was completed several months before the 1888 paper on correlation and contains none of its important findings. It was, however, an influential book that repeats a great deal of Galton's earlier work on the statistics of heredity, the sweet pea experiments, the data on stature, the records of family faculties, and so on. It was enthusiastically received by Walter Weldon (1860-1906), who was then a Fellow of St John's College, Cambridge, and University Lecturer in Invertebrate Morphology, and it pointed him toward quantitative solutions to problems in species variation that had been occupying his attention. This book linked Galton with Weldon, and Weldon with Pearson, and then Pearson with Galton, a concatenation that began the biometric movement.

THE COEFFICIENT OF CORRELATION

In June 1890, Weldon was elected a Fellow of the Royal Society, and later that year became Jodrell Professor of Zoology at University College, London. In March of the same year the Royal Society had received the first of his biometric papers, which describes the distribution of variations in a number of measurements made on shrimps. A Marine Biological Laboratory had been constructed at Plymouth 2 years earlier, and since that time Weldon had spent part of the year there collecting measurements of the physical dimensions of these creatures and their organs. The Royal Society paper had been sent to Galton for review, and with his help the statistical analyses had been reworked. This marked the beginning of Weldon's friendship with Galton and a concentration on biometric work that lasted for the rest of his life. In fact, the paper does not present the results of a correlational analysis, although apparently one had been carried out.

I have attempted to apply to the organs measured the test of correlation given by Mr. Galton ... and the result seems to show that the degree of correlation between two organs is constant in all the races examined; Mr. Galton has, in a letter to myself, predicted this result. ... A result of this kind is, however, so important to the general theory of heredity, that I prefer to postpone a discussion of it until a larger body of evidence has been collected. (Weldon, 1890, p. 453)

The year 1892 saw these analyses published. Weldon begins his paper by summarizing Galton's methods. He then describes the measurements that he made, presents extensive tables of his calculations, the degree of correlation between the pairs of organs, and the probable error of the distributions (Q√(1 - r²)). The actual calculation of the degree of correlation departs somewhat from Galton's method, although, because Galton was in close touch with Weldon, we can assume that it had his blessing. It is quite straightforward and is here quoted in full:

(1.) ... let all those individuals be chosen in which a certain organ, A, differs from its average size by a fixed amount, Y; then, in these individuals, let the deviations of a second organ, B, from its average be measured. The various individuals will exhibit deviations of B equal to x₁, x₂, x₃, ..., whose mean may be called xm. The ratio xm/Y will be constant for all values of Y.
In the same way, suppose those individuals are chosen in which the organ B has a constant deviation, X; then, in these individuals, ym, the mean deviation of the organ A, will have the same ratio to X, whatever may be the value of X.
(2.) The ratios xm/Y and ym/X are connected by an interesting relation. Let Qa represent the probable error of distribution of the organ A about its average, and Qb that of the organ B; then

(xm/Qb)/(Y/Qa) = (ym/Qa)/(X/Qb) = a constant.

So that by taking a fixed deviation of either organ, expressed in terms of its probable error, and by expressing the mean associated deviation of the second organ in terms of its probable error, a ratio may be determined, whose value becomes ±1 when a change in either organ involves an equal change in the other, and 0 when the two organs are quite independent. This constant, therefore, measures the "degree of correlation" between the two organs. (Weldon, 1892, p. 3)

In 1893, more extensive calculations were reported on data collected from two large (each of 1,000 adult females) samples of crabs, one from the Bay of Naples and the other from Plymouth Sound. In this work, Weldon (1893) computes the mean, the mean error, and the modulus. His greater mathematical sophistication in this work is evident. He states:

The probable error is given below, instead of the mean error, because it is the constant which has the smallest numerical value of any in general use. This property renders the probable error more convenient than either the mean error, the modulus, or the error of mean squares, in the determination of the degree of correlation which will be described below. (Weldon, 1893, pp. 322-323)

Weldon found that in the Naples specimens the distribution of the "frontal breadth" produced what he terms an "asymmetrical result." This finding he hoped might arise from the presence in the sample of two races of individuals. He notes that Karl Pearson had tested this supposition and found that it was likely. Pearson (1906) states in his obituary of Weldon that it was this problem that led to his (Pearson's) first paper in the Mathematical Contributions to the Theory of Evolution series, received by the Royal Society in 1893.
Weldon defined r and attempted to name it for Galton:

a measure of the degree to which abnormality in one organ is accompanied by abnormality in a second. It becomes ±1 when a change in one organ involves an equal change in the other, and 0 when the two organs are quite independent. The importance of this constant in all attempts to deal with the problems of animal variation was first pointed out by Mr. Galton ... the constant ... may fitly be known as "Galton's function." (Weldon, 1893, p. 325)

The statistics of heredity and a mutual interest in the plans for the reform of the University of London (which are described by Pearson in Weldon's obituary) drew Pearson and Weldon together, and they were close friends and colleagues until Weldon's untimely death. Weldon's primary concern was to make his discipline, particularly as it related to evolution, a more rigorous science by introducing statistical methods. He realized that his own mathematical abilities were limited, and he tried, unsuccessfully, to interest Cambridge mathematical colleagues in his endeavor. His appointment to the University College Chair brought him into contact with Pearson, but, in the meantime, he attempted to remedy his deficiencies by an extensive study of mathematical probability. Pearson (1906) writes:

Of this the writer feels sure, that his earliest contributions to biometry were the direct results of Weldon's suggestions and would never have been carried out without his inspiration and enthusiasm. Both were drawn independently by Galton's Natural Inheritance to these problems. (p. 20)

Pearson's motivation was quite different from that of Weldon. MacKenzie (1981) has provided us with a fascinating account of Pearson's background, philosophy, and political outlook. He was allied with the Fabian socialists;⁶ he held strong views on women's rights; he was convinced of the necessity of adopting rational scientific approaches to a range of social issues; and his advocacy of the interests of the professional middle class sustained his promotion of eugenics.

His originality, his real transformation rather than re-ordering of knowledge, is to be found in his work in statistical biology, where he took Galton's insights and made out of them a new science. It was the work of his maturity - he started it only in his mid-thirties - and in it can be found the flowering of most of the major concerns of his youth. (MacKenzie, 1981, pp. 87-88)

It can be clearly seen that Pearson was not merely providing a mathematical apparatus for others to use ... Pearson's point was essentially a political one: the viability, and indeed superiority to capitalism, of a socialist state with eugenically-planned reproduction. The quantitative statistical form of his argument provided him with convincing rhetorical resources. (MacKenzie, 1981, p. 91)

MacKenzie is at pains to point out that Pearson did not consciously set out to found a professional middle-class ideology. His analysis confines itself to the view that here was a match of beliefs and social interests that fostered Pearson's unique contribution, and that this sort of sociological approach may be used to assess the work of such exceptional individuals. Pearson did not seek to become the leader of a movement; indeed, the compromises necessary for such an aspiration would have been anathema to what he saw as the role of the scientist and the intellectual.
However one assesses the operation of the forces that molded Pearson's work, it is clear that, from a purely technical standpoint, the introduction, at this juncture, of an able, professional mathematician⁷ to the field of statistical methods brought about a rapid advance and a greatly elevated sophistication. The third paper in the Mathematical Contributions series was read before the Royal Society in November 1895. It is an extensive paper which deals with, among other things, the general theory of correlation. It contains a number of historical misinterpretations and inaccuracies that Pearson (1920) later attempted to rectify, but for today's users of statistical methods it is of crucial importance, for it presents the familiar deviation score formula for the coefficient of correlation.

⁶ The Fabian Society (the Webbs and George Bernard Shaw were leading members) took its name from Fabius, the Roman general who adopted a strategy of defense and harassment in Rome's war with Hannibal and avoided direct confrontations. The Fabians advocated gradual advance and reform of society, rather than revolution.

⁷ Pearson placed Third Wrangler in the Mathematical Tripos at Cambridge in 1879. The "Wranglers" were the mathematics students at Cambridge who obtained First Class Honors - the ones who most successfully "wrangle" with math problems. This method of classification was abandoned in the early years of this century.
It might be mentioned here that the latter term had been introduced for r by F. Y. Edgeworth (1845-1926) in an impossibly difficult-to-follow paper published in 1892. Edgeworth was Drummond Professor of Political Economy at Oxford from 1891 until his retirement in 1922, and it is of some interest to note that he had tried to attract Pearson to mathematical economics, but without success. Pearson (1920) says that Edgeworth was also recruited to correlation by Galton's Natural Inheritance, and he remained in close touch with the biometricians over many years.
Pearson's important paper (published in the Philosophical Transactions of the Royal Society in 1896) warrants close examination. He begins with an introduction that states the advantages and limitations of the statistical approach, pointing out that it cannot give us precise information about relationships between individuals and that nothing but means and averages and probabilities with regard to large classes can be dealt with.

On the other hand, the mathematical theory will be of assistance to the medical man by answering, inter alia, in its discussion of regression the problem as to the average effect upon the offspring of given degrees of morbid variation in the parents. It may enable the physician, in many cases, to state a belief based on a high degree of probability, if it offers no ground for dogma in individual cases. (Pearson, 1896, p. 255)

Pearson goes on to define the mean, median, and mode, the normal probability distribution, correlation, and regression, as well as various terms employed in selection and heredity. Next comes a historical section, which is examined later in this chapter. Section 4 of Pearson's paper examines the "special case of two correlated organs." He derives what he terms the "well-known Galtonian form of the frequency for two correlated variables," and says that r is the "GALTON function or coefficient of correlation" (p. 264). However, he is not satisfied that the methods used by Galton and Weldon give practically the best method of determining r, and he goes on to show by what we would now call the maximum likelihood method that S(xy)/(nσ₁σ₂) is the best value (today we replace S by Σ for summation). This expression is familiar to every beginning student in statistics. As Pearson (1896) puts it, "This value presents no practical difficulty in calculation, and therefore we shall adopt it" (p. 265). It is now well known that we have done precisely that.⁸

⁸ Pearson (1896, p. 265) notes "that S(xy) corresponds to the product-moment of dynamics, as S(x²) to the moment of inertia." This is why r is often referred to as the "product-moment coefficient of correlation."
There follows the derivation of the standard deviation of the coefficient of correlation, (1 - r²)/√n, which Pearson translates to the probable error, 0.674506(1 - r²)/√n. These statistics are then used to rework Weldon's shrimp and crab data, and the results show that Weldon was mistaken in assuming constancy of correlation in local races of the same species. Pearson's exhaustive re-examination of Galton's data on family stature also shows that some of the earlier conclusions were in error. The details of these analyses are now of narrow historical interest only, but Pearson's general approach is a model for all experimenters and users of statistics. He emphasizes the importance of sample size in reducing the probable error, mentions the importance of precision in measurement, cautions against general conclusions from biased samples, and treats his findings with admirable scientific caution.
Pearson introduces V = (σ/m)100, the coefficient of variation, as a way of comparing variation, and shows that "the significance of the mutual regressions of ... two organs are as the squares of their coefficients of variation" (p. 277). It should also be noted that, in this paper, Pearson pushes much closer to the solution of problems associated with multiple correlation.
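A minimal sketch of these two quantities (the inputs are illustrative, echoing Galton's stature figures):

```python
import math

def probable_error_of_r(r, n):
    """Pearson's 1896 probable error of the correlation coefficient."""
    return 0.674506 * (1 - r**2) / math.sqrt(n)

def coefficient_of_variation(sd, mean):
    """Pearson's V = (sd/mean) * 100: variability as a percentage of the mean."""
    return 100 * sd / mean

print(probable_error_of_r(0.8, 350))        # about 0.013
print(coefficient_of_variation(2.6, 67.2))  # about 3.9
```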
This almost completes the account of a historical perspective of the development of the Pearson coefficient of correlation as it is widely known and used. However, the record demands that Pearson's account be looked at in more detail and the interpretation of measures of association as they were seen by others, most notably George Udny Yule (1871-1951), who made valuable contributions to the topic, be examined.

CORRELATION - CONTROVERSIES AND CHARACTER

Pearson's (1896) paper includes a section on the history of the mathematical foundations of correlation. He says that the fundamental theorems were "exhaustively discussed" by Bravais in 1846. Indeed, he attributes to Bravais the invention of the GALTON function while admitting that "a single symbol is not used for it" (p. 261). He also states that S(xy)/(nσ₁σ₂) "is the value given by Bravais, but he does not show that it is the best" (p. 265). In examining the general theorem of a multiple correlation surface, Pearson refers to it as Edgeworth's Theorem. Twenty-five years later he repudiates these statements and attempts to set his record straight. He avers that:

They have been accepted by later writers, notably Mr Yule in his manual of statistics, who writes (p. 188): "Bravais introduced the product-sum, but not a single symbol for a coefficient of correlation. Sir Francis Galton developed the practical method, determining his coefficient (Galton's function as it was termed at first) graphically. Edgeworth developed the theoretical side further and Pearson introduced the product-sum formula."

Now I regret to say that nearly the whole of the above statements are hopelessly incorrect. (Pearson, 1920, p. 28)

Now clearly, it is just not the case "that nearly the whole of the above statements are hopelessly incorrect." In fact, what Pearson is trying to do is to emphasize the importance of his own and Galton's contribution and to shift the blame for the maintenance of historical inaccuracies to Yule, with whom he had come to have some serious disagreement. The nature of this disagreement is of interest and can be examined from at least three perspectives. The first is that of eugenics, the second is that of the personalities of the antagonists, and the third is that of the fundamental utility of, and the assumptions underlying, measures of association. However, before these matters are examined, some brief comment on the contributions of earlier scholars to the mathematics of correlation is in order.
The final years of the 18th century and the first 20 years of the 19th were the period in which the theoretical foundations of the mathematics of errors of observation were laid. Laplace (1749-1827) and Gauss (1777-1855) are the best known of the mathematicians who derived the law of frequency of error and described its application, but a number of other writers also made significant contributions (see Walker, 1929, for an account of these developments). These scholars all examined the question of the probability of the joint occurrence of two errors, but "None of them conceived of this as a matter which could have application outside the fields of astronomy, physics, and geodesy or gambling" (Walker, 1929, p. 94).
In fact, put simply, these workers were interested solely in the mathematics associated with the probability of the simultaneous occurrence of two errors in, say, the measurement of the position of a point in a plane or in three dimensions. They were clearly not looking for a measure of a possible relationship between the errors and certainly not considering the notion of organic relationships among directly measured variables. Indeed, astronomers and surveyors sought to make their basic measurements independent. Having said this, it is apparent that the mathematical formulations that were produced by these earlier scientists are strikingly similar to those deduced by Galton and Hamilton Dickson.
Auguste Bravais (1811-1863), who had careers as a naval officer, an astronomer, and a physicist, perhaps came closest to anticipating the correlation coefficient; indeed, he even uses the term correlation in his paper of 1846. Bravais derived the formula for the frequency surface of the bivariate normal distribution and showed that it was a series of concentric ellipses, as did Galton and Hamilton Dickson 40 years later. Pearson's acknowledgment of the work of Bravais led to the correlation coefficient sometimes being called the Bravais-Pearson coefficient.
When Pearson came, in 1920, to revise his estimation of Bravais' role, he describes his own early investigations of correlation and mentions his lectures on the topic to research students at University College. He says:

I was far too excited to stop to investigate properly what other people had done. I wanted to reach new results and apply them. Accordingly I did not examine carefully either Bravais or Edgeworth, and when I came to put my lecture notes on correlation into written form, probably asked someone who attended the lectures to examine the papers and say what was in them. Only when I now come back to the papers of Bravais and Edgeworth do I realise not only that I did grave injustice to others, but made most misleading statements which have been spread broadcast by the text-book writers. (Pearson, 1920, p. 29)

The "Theory of Normal Correlation"was one of thetopics dealt withby


Pearson whenhe startedhis lectureson the'Theoryof Statistics"at University
College in the 1894-1895session, "givingtwo hoursa week to a small but
enthusiastic class of two students- Miss Alice Lee, Demonstrator in Physicsat
the Bedford College,andmyself (Yule, 1897a,p. 457).
One of these enthusiastic students, Yule, published his famous textbook, An Introduction to the Theory of Statistics, in 1911, a book that by 1920 was in its fifth edition. Later in the 1920 paper, Pearson again deals curtly with Yule. He is discussing the utility of a theory of correlation that is not dependent on the assumptions of the bivariate normal distribution and says, "As early as 1897 Mr G. U. Yule, then my assistant, made an attempt in this direction" (Pearson, 1920, p. 45).
In fact, Yule (1897b), in a paper in the Journal of the Royal Statistical Society, had derived least squares solutions to the correlation of two, three, and four variables. This method is a compelling demonstration of the appropriateness of Pearson's formula for r under the least squares criterion. But Pearson is not impressed, or at least in 1920 he is not impressed:

Are we not making a fetish of the method of least squares as others made a fetish of the normal distribution? ... It is by no means clear therefore that Mr Yule's generalisation indicates the real line of future advance. (Pearson, 1920, p. 45)

This, to say the least, cold approach to Yule's work (and there are many other examples of Pearson's overt invective) was certainly not evident over the 10 years that span the turn of the 19th to the 20th centuries. During what Yule described as "the old days" he spent several holidays with Pearson, and even when their personal relationship had soured, Yule states that in nonintellectual matters Pearson remained courteous and friendly (Yule, 1936), although one cannot help but feel that here we have the words of an essentially kind and gentle man writing an obituary notice of one of the fathers of his chosen discipline.
What was the nature of this disagreement that so aroused Pearson's wrath? In 1900, Yule developed a measure of association for nominal variables, the frequencies of which are entered into the cells of a contingency table. Yule presents very simple criteria for such measures of association, namely, that they should be zero when there is no relationship (i.e., the variables are independent), +1 when there is complete dependence or association, and -1 when there is a complete negative relationship. The illustrative example chosen is that of a matrix formed from cells labeled as follows:

                 Vaccinated (B)    Unvaccinated (β)
Survived (A)          AB                 Aβ
Died (α)              αB                 αβ

and Yule devised a measure, Q (named, it appears, for Quetelet), that satisfies the stated criteria and is given by

Q = [(AB)(αβ) - (Aβ)(αB)] / [(AB)(αβ) + (Aβ)(αB)]
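A minimal sketch of the calculation (the frequencies are hypothetical):

```python
def yule_q(ab, a_beta, alpha_b, alpha_beta):
    """Yule's Q: the cross-product difference of the 2x2 cell frequencies
    divided by the cross-product sum."""
    num = ab * alpha_beta - a_beta * alpha_b
    den = ab * alpha_beta + a_beta * alpha_b
    return num / den

print(yule_q(90, 30, 10, 70))   # about 0.91: strong positive association
print(yule_q(50, 50, 50, 50))   # 0.0: independence
```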

Yule's paper had been "received" by the Royal Society in October 1899, and "read" in December of the same year. It was described by Pearson (1900b), in a paper that examines the same problem, as "Mr Yule's valuable memoir" (p. 1). In his paper, Pearson undertakes to investigate "the theory of the whole subject" (p. 1) and arrives at his measure of association for two-by-two contingency table frequencies, an index that he called the tetrachoric coefficient of correlation. He examines other possible measures of association, including Yule's Q, but he considers them to be merely approximations to the tetrachoric coefficient. The crucial difference between Yule's approach and that of Pearson is that Yule's criteria for a measure of association are empirical and arithmetical, whereas the fundamental assumption for Pearson was that the attributes whose frequencies were counted in fact arose from an underlying continuous bivariate normal distribution. The details of Pearson's method are somewhat complex and are not examined here. There were a number of developments from this work, including Pearson's derivation, in 1904, of the mean square contingency and the contingency coefficient. All these measures demanded the assumption of an underlying continuous distribution, even though the variables, as they were considered, were categorical. Battle commenced in late 1905 when Yule criticized Pearson's assumptions in a paper read to the Royal Society (Yule, 1906). Biometrika's readers were soon to see "Reply to Certain Criticisms of Mr G. U. Yule" (Pearson, 1907) and, after Yule's discussion of his indices appeared in the first edition of his textbook, were treated to David Heron's exhortation, "The Danger of Certain Formulae Suggested as Substitutes for the Correlation Coefficient" (Heron, 1911). These were stirring statistical times, marked by swingeing attacks as the biometricians defended their position:
If Mr Yule's views are accepted, irreparable damage will be done to the growth of modern statistical theory ... we shall term Mr Yule's latest method of approaching the problem of relationship of attributes the method of pseudo-ranks ... we ... reply to certain criticisms, not to say charges, Mr Yule has made against the work of one or both of us. (Pearson & Heron, 1913, pp. 159-160)

Articulate sniping and mathematical bombardment were the methods of attack used by the biometricians. Yule was more temperate but nevertheless quite firm in his views:

All those who have died of small-pox are all equally dead: no one of them is more dead or less dead than another, and the dead are quite distinct from the survivors.
The introduction of needless and unverifiable hypotheses does not appear to me to be a desirable proceeding in scientific work. (Yule, 1912, pp. 611-612)

Yule, and his great friend Major Greenwood, poked private fun at the opposition. Parts of a fantasy sent to Yule by Greenwood in November 1913 are reproduced here:

Extracts from The Times, April 1925
G. Udny Yule, who had been convicted of high treason on the 7th ult., was executed this morning on a scaffold outside Gower St. Station. A short but painful scene occurred on the scaffold. As the rope was being adjusted, the criminal made some observation, imperfectly heard in the press enclosure, the only audible words being "the normal coefficient is —." Yule was immediately seized by the Imperial guard and gagged.
Up to the time of going to press the warrant for the apprehension of Greenwood had not been executed, but the police have what they regard to be an important clue. During the usual morning service at St. Paul's Cathedral, which was well attended, the carlovingian creed was, in accordance with an imperial rescript, chanted by the choir. When the solemn words, "I believe in one holy and absolute coefficient of four-fold correlation" were uttered a shabbily dressed man near the North door shouted "balls." Amid a scene of indescribable excitement, the vergers armed with several volumes of Biometrika made their way to the spot. (Greenwood, quoted by MacKenzie, 1981, pp. 176-177)

The logical positions of the two sides were quite different. For Pearson it was absolutely necessary to preserve the link with interval-level measurement where the mathematics of correlation had been fully specified.

Mr Yule ... does not stop to discuss whether his attributes are really continuous or discrete, or hide under discrete terminology true continuous variates. We see under such class indices as "death" or "recovery", "employment" or "non-employment" of mother, only measures of continuous variates ... (p. 162)

The fog in Mr Yule's mind is well illustrated by his table ... (p. 226)

Mr Yule is juggling with class-names as if they represented real entities, and his statistics only a form of symbolic logic. No knowledge of a practical kind ever came out of these logical theories. (p. 301) (Pearson and Heron, 1913)

Yule is attacked on almost every one of this paper's 157 pages, but for him the issue was quite straightforward. Techniques of correlation were nothing more and nothing less than descriptions of dependence in nominal data. If different techniques gave different answers, a point that Pearson and his followers frequently raised, then so be it. The mean, median, and mode give different answers to questions about the central tendency of a distribution, but each has its utility.
Of course the controversy was never resolved. The controversy can never be resolved, for there is no absolute "right" answer. Each camp started with certain basic assumptions, and a reconciliation of their views was not possible unless one or both sides were to have abrogated those assumptions, or unless it could have been shown with scientific certainty that only one side's assumptions were viable. For Yule it was a situation that he accepted. In his obituary notice of Pearson he says, "Time will settle the question in due course" (p. 84). In this he was wrong, because there is no longer a question. The pragmatic practitioners of statistics in the present day are largely unaware that there ever even was a question.
The disagreement highlights the personality differences of the protagonists. Pearson could be described as a difficult man. He held very strong views on a variety of subjects, and he was always ready to take up his pen and write scathing attacks on those whom he perceived to be misguided or misinformed. He was not ungenerous, and he devoted immense amounts of time and energy to the work of his students and fellow researchers, but it is likely that tact was not one of his strong points and any sort of compromise would be seen as defeat.
In 1939, Yule commented on Pearson's polemics, noting that, in 1914, Pearson said that "Writers rarely ... understand the almost religious hatred which arises in the true man of science when he sees error propagated in high places" (p. 221).

Surely neither in the best type of religion nor in the best type of science should hatred enter in at all. ... In one respect only has scientific controversy perforce improved since the seventeenth century. If A disagrees with B's arguments, dislikes his personality and is annoyed by the cock of his hat, he can no longer, failing all else, resort to abuse of B's latinity. (Yule, 1939, p. 221)

Yule had respect and affection for "K.P." even though he had distanced himself from the biometricians of Gower Street (where the Biometric Laboratory was located). He clearly disliked the way in which Pearson and his followers closed ranks and prepared for combat at the sniff of criticism. In some respects we can understand Pearson's attitudes. Perhaps he did feel isolated and beset. Weldon's death in 1906 and Galton's in 1911 affected him greatly. He was the colossus of his field, and yet Oxford had twice (in 1897 and 1899) turned down his applications for chairs, and in 1901 he applied, again unsuccessfully, for the chair of Natural Philosophy at Edinburgh. He felt the pressure of monumental amounts of work and longed for greater scope to carry out his research. This was indeed to come with grants from the Drapers' Company, which supported the Biometric Laboratory, and Galton's bequest, which made him Professor of Eugenics at University College in 1911. But more controversy, and more bitter battles, this time with Fisher, were just over the horizon.
Once more, we are indebted to MacKenzie, who so ably puts together the eugenic aspects of the controversy. It is his view that this provides a much more adequate explanation than one derived from the examination of a personality clash. It is certainly unreasonable to deny that this was an important factor.
Yule appears to have been largely apolitical. He came from a family of professional administrators and civil servants, and he himself worked for the War Office and for the Ministry of Food during World War I, work for which he received the C.B.E. (Commander of the British Empire) in 1918. He was not a eugenist, and his correspondence with Greenwood shows that his attitude toward the eugenics movement was far from favorable. He was an active member of the Royal Statistical Society, a body that awarded him its highest honor, the Guy Medal in gold, in 1911. The Society attracted members who were interested in an ameliorative and environmental approach to social issues - debates on vaccination were a continuing fascination. Although Major Greenwood, Yule's close friend, was at first an enthusiastic member of the biometric school, his career, as a statistician in the field of public health and preventive medicine, drew him toward the realization that poverty and squalor were powerful factors in the status and condition of the lower classes, a view that hardly reflected eugenic philosophy.
For Pearson, eugenics and heredity shaped his approach to the question of correlation, and the notion of continuous variation was of critical importance.

His notion of correlation, as a function allowing direct prediction from one variable to another, is shown to have its roots in the task that correlation was supposed to perform in evolutionary and eugenic prediction. It was not adequate simply to know that offspring characteristics were dependent on ancestral characteristics: this dependence had to be measured in such a way as to allow the prediction of the effects of natural selection, or of conscious intervention in reproduction. To move in the direction indicated here, from prediction to potential control over evolutionary processes, required powerful and accurate predictive tools: mere statements of dependence would be inadequate. (MacKenzie, 1981, p. 169)

MacKenzie's sociohistorical analysis is both compelling and provocative, but at least two points, both of which are recognized by him, need to be made. The first is that this view of Pearson's motivations is contrary to his earlier expressed views of the positivistic nature of science (Pearson, 1892), and second, the controversy might be placed in the context of a tightly knit academic group defending its position. To the first MacKenzie says that practical considerations outweighed philosophical ones, and yet it is clear that practical demands did not lead the biometricians to form or to join political groups that might have made their aspirations reality. The second view is mundane and realistic. The discipline of psychology has seen a number of controversies in its short history. Notable among them is the connectionist (stimulus-response) versus cognitive argument in the field of learning theory. The phenomenological, the psychodynamic, the social-learning, and the trait theorists have arrayed themselves against each other in a variety of combinations in personality research. Arguments about the continuity-discontinuity of animals and humankind are exemplified perhaps by the controversy over language acquisition in the higher primates. All these debates are familiar enough to today's students of psychology, and it is not inconceivable to view the correlation debate as part of the system of academic "rows" that will remain as long as there is freedom for intellectual controversy and scientific discourse.
But the Yule-Pearson debate and its implications are not among those discussed in university classrooms across the globe. For a multitude of reasons, not the least of which was the horror of the negative eugenics espoused by the German Nazis, and the demands of the growing discipline of psychology for quantitative techniques that would help it deal with its subject matter, very few if any of today's practitioners and researchers in the social sciences think of biometrics when they think of statistics. They busily get on with their analyses and, if they give thanks to Pearson and Galton at all, they remember them for their statistical insights rather than for their eugenic philosophy.
11
Factor Analysis

FACTORS

A feature of the ideal scientific method that is always hailed as most admirable is parsimony. To reduce a mass of facts to a single underlying factor or even to just a few concepts and constructs is an ongoing scientific concern. In psychology, the statistical technique known as factor analysis is most often associated with the attempt to comprehend the structure of intelligence and the dimensions of personality. The intellectual struggles that accompany discussion of these matters are a very long way from being over.
Of all the statistical techniques discussed in this book, factor analysis may be justly claimed as psychology's own. As Lawley and Maxwell (1971) and others have noted, some of the early controversies that swirled around the methods employed arose from arguments about psychological, rather than mathematical, matters. Indeed, Lawley and Maxwell suggest that the controversies "discouraged the interest shown by mathematicians in the theoretical problems involved" (p. 1). In fact, the psychological quarrels stem from a much older philosophical debate that is most often traced to Francis Bacon, and his assertion that "the mere orderly arrangement of data would make the right hypothesis obvious" (Russell, 1961, p. 529). Russell goes on:

The thing that is achieved by the theoretical organization of science is the collection of all subordinate inductions into a few that are very comprehensive - perhaps only one. Such comprehensive inductions are confirmed by so many instances that it is thought legitimate to accept, as regards them, an induction by simple enumeration. This situation is profoundly unsatisfactory. (p. 530)

To be just a little more concrete, in psychology:

We recognize the importance of mentality in our lives and wish to characterize it, in part so that we can make the divisions and distinctions among people that our cultural and political systems dictate. We therefore give the word "intelligence" to this wondrously complex and multifaceted set of human capabilities. This shorthand symbol is then reified and intelligence achieves its dubious status as a unitary thing. (Gould, 1981, p. 24)

To be somewhat trite, it is clear that among employed persons there is a high correlation between the size of a monthly paycheck and annual salary, and it is easy to see that both variables are a measure of income - the "cause" of the correlation is clear and obvious. It is therefore tempting not only to look for the underlying bases of observed interrelationships among variables but to endow such findings with a substance that implies a causal entity, despite the protestations of those who assert that the statistical associations do not, by themselves, provide evidence for its existence. The refined search for these statistical distillations is factor analysis, and it is important to distinguish between the mathematical bases of the methods and the interpretation of the results of their application.
Now the correlation between monthly paycheck and annual salary will not be perfect. Interest payments, profits, gifts, even author's royalties, will make r less than +1, but presumably no one would argue with the proposition that if one requires a reasonable measure of income and material well-being, either variable is adequate and the other thereby redundant. Galton (1888a), in commenting on Alphonse Bertillon's work that was designed to provide an anthropometric index of identification, in particular the identification of criminals, points to the necessity of estimating the degree of interdependence among the variables employed, as well as the importance of precision in measurement and of not making measurement classifications too wide. There is little or nothing to be gained from including in the measurements a number of variables that are highly correlated, when one would suffice.
A somewhat different line of reasoning is to derive from the intercorrelations of the measured variables a set of components that are uncorrelated. The number of components thus derived would be the same as the number of variables. The components are ordered so that the initial ones account for more of the variance than the later ones that are listed in the outcome. This is the method of principal component analysis, for which Pearson (1901) is usually cited as the original inspiration and Hotelling (1933, 1935) as the developer and refiner, although there is no indication that Hotelling was influenced by Pearson's work. In fact, Edgeworth (1892, 1893) had suggested a scheme for generating a function containing uncorrelated terms that was derived from correlated measurements. Macdonell (1901) was also an early worker in the field of "criminal anthropometry." He acknowledges in his paper that Pearson had pointed out to him a method of arriving at ideal characteristics that "would be given if we calculated the seven [there were seven measures] directions of uncorrelated variables, that is, the principal axes of the correlation 'ellipsoid'" (p. 209).
The method, as Lawley and Maxwell (1971) have emphasized, although it has links with factor analysis, is not to be taken as a variety of factor analysis. The latter is a method that attempts to distill down or concentrate the covariances in a set of variables to a much smaller number of factors. Indeed, none other than Godfrey Thomson (1881-1955), a leading British factor theorist from the 1920s to the 1940s, maintained in a letter to D. F. Vincent of Britain's National Institute of Industrial Psychology that he did not regard Pearson's "lines of closest fit" as having anything to do with factor analysis (noted by Hearnshaw, 1979, p. 176). The reason for the ongoing association of Pearson's name with the birth of factor analysis is to be found in the role played by another leading British psychologist, Sir Cyril Burt (1883-1971), in the writing and the rewriting of the history of factor analysis, an episode examined later in this chapter.
In essence, the method of principal components may be visualized by imagining the variables measured - the tests - as points in space. Tests that are correlated will be close together in clusters, and tests that are not related will be further away from the clusters. Axes or vectors may be projected into this space through the clusters in a fashion that allows for as much of the variance as possible to be accounted for. Geometrically this projection can take place in only three dimensions, but algebraically an n-dimensional space can be conceptualized.
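In modern computational terms, the principal axes are the eigenvectors of the correlation matrix, each accounting for a share of the total variance given by its eigenvalue. The following is a minimal sketch of the idea; the four simulated "tests" and all numerical values are hypothetical:

```python
import numpy as np

# A minimal sketch of principal components: the principal axes are the
# eigenvectors of the correlation matrix, ordered by the variance
# (eigenvalue) that each accounts for. The data are simulated.
rng = np.random.default_rng(0)
common = rng.normal(size=500)                        # a shared source of variance
tests = np.column_stack([common + rng.normal(scale=s, size=500)
                         for s in (0.5, 0.7, 0.9, 1.2)])   # four "tests"

R = np.corrcoef(tests, rowvar=False)                 # 4 x 4 correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)                 # eigh suits symmetric matrices
order = np.argsort(eigvals)[::-1]                    # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals / eigvals.sum())   # proportion of variance per component
```

Because the four "tests" here share a common source of variance, the first component absorbs most of the total variance, which is the sense in which the initial components account for more of the variance than the later ones.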

THE BEGINNINGS
In 1904 Charles Spearman (1863-1945) published two papers (1904a, 1904b) in the same volume of the American Journal of Psychology, then the leading English language journal in psychology. The first of these, a general account of correlation methods, their strengths and weaknesses, and the sources of error that may be introduced and might "dilate" or "constrict" the results, included criticism of Pearson's work.
In his Huxley lecture (sponsored by the Anthropological Institute of Great Britain and Ireland) of 1903, Pearson reported on a great amount of data that had been collected by teachers in a number of schools. The physical variables measured included health, hair and eye color, hair curliness, athletic power, head length, breadth, and height, and a "cephalic index." The psychological characteristics were assertiveness, vivacity, popularity, introspection, conscientiousness, temper, ability, and handwriting. The average correlations (omitting athletic power) reported between the measures of brother, sister, and brother-sister pairs are extraordinarily similar, ranging from .51 to .54.

We are forced, I think literally forced, to the general conclusion that the physical and psychical characters in men are inherited within broad lines in the same manner and with the same intensity. (Pearson, 1904a, p. 156)

Spearman felt that the measurements taken could have been affected by "systematic deviations," "attenuation" produced by the suggestion that the teacher's judgments were not infallible, and by lack of independence in the judgments, finally maintaining that:

When we further consider that each of these physical and mental characteristics will have quite a different amount of such error (in the former this being probably quite insignificant) it is difficult to avoid the conclusion that the remarkable coincidences announced between physical and mental heredity can hardly be more than mere accidental coincidence. (Spearman, 1904a, p. 98)

But Pearson's statistical procedures were not under fire, for later he says:

If this work of Pearson has thus been singled out for criticism, it is certainly from no desire to undervalue it. ... My present object is only to guard against premature conclusions and to point out the urgent need of still further improving the existing methodics of correlational work. (Spearman, 1904a, p. 99)

Pearson was not pleased. When the address was reprinted in Biometrika in 1904 he included an addendum referring to Spearman's criticism that added fuel to the fire:

The formula invented by Mr Spearman for his so-called "dilation" is clearly wrong ... not only are his formulae, especially for probable errors, erroneous, but he quite misunderstands and misuses partial correlation coefficients. (p. 160)

It might be noted that neither man assumed that the correlations, whatever they were, might be due to anything other than heredity, and that Spearman largely objected to the data and possible deficiencies in its collection while Pearson largely objected to Spearman's statistics. The exchange ensured that the opponents would never even adequately agree on the location of the battlefield, let alone resolve their differences. Spearman was to become, successively, Reader and head of a psychological laboratory at University College, London, in 1907, Grote Professor of the Philosophy of Mind and Logic in 1911, and Professor of Psychology from 1928 until his retirement in 1931. He was therefore a colleague of Pearson's in the same college in the same university during almost the whole of the latter's tenure of the Chair of Eugenics. Their interests and their methods had a great deal in common, but they never collaborated and they disliked each other. They clashed on the matter of Spearman's rank-difference correlation technique, and that and Spearman's new criticism of Pearson and Pearson's reaction to it began the hostilities that continued for many years, culminating in Pearson's waspish, unsigned review (1927) of Spearman's (1927) book The Abilities of Man. Pearson dismisses the book as "distinctly written for the layman" (p. 181) and claims that "what Professor Spearman considers proofs are not proofs . . . With the failure of Chapter X, that is, 'Proof that G and S exist,' the very backbone disappears from the body of Prof. Spearman's work" (p. 183).
Nowhere in this book does Pearson's name appear! In his 1904 paper, Spearman acknowledges that Pearson had given the name the method of "product moments" to the calculation of the correlation coefficient, but, in a footnote in his book, Spearman (1927), commenting again on correlation measures, remarks that:

easily foremost is the procedure which was substantially given in a beautiful memoir of Bravais and which is now called that of "product moments." Such procedures have been further improved by Galton who invented the device - since adopted everywhere - of representing all grades of interdependence by a single number, the "coefficient," which ranges from unity for perfect correlation down to zero for entire absence of it. (p. 56)

So much for Pearson!


The second paper
of 1904hadproduced Spearman's
initial conclusion that:

The . . . observed facts indicate that all branches of intellectual activity havein
commononefundamentalfunction (or group offunctions) , whereasthe remaining
or specific elementsof the activity seemin every caseto bewholly different from that
in all the others. (Spearman, 1904b, p. 284)

This is Spearman's first general statement of the two-factor theory of intelligence, a theory that he was to vigorously expound and defend. Central to this early work was the notion of a hierarchy of intelligences. In examining data drawn from the measurement of children from a "High Class Preparatory School for Boys" on a variety of tests, Spearman starts by adjusting the correlation coefficients to eliminate irrelevant influences and errors using partial correlation techniques and then ordering the coefficients from highest to lowest. He observed a steady decrease in the values from left to right and from top to bottom in the resulting table. The sample was remarkably small (only 33), but Spearman (1904b) confidently and boldly proclaims that his method has demonstrated the existence of "General Intelligence," which, by 1914, he was labeling g. He made no attempt at this time to analyze the nature of his factor, but he notes:

An important practical consequence of this universal Unity of the Intellectual Function, the various actual forms of mental activity constitute a stable interconnected Hierarchy according to the different degrees of intellective saturation. (p. 284)

This work may be justly claimed to include the first factor analysis. The intercorrelations are shown in Figure 11.1.

              Classics  French  English   Math  Discrim  Music
  Classics        -      0.83     0.78    0.70    0.66    0.63
  French        0.83      -       0.67    0.67    0.65    0.57
  English       0.78     0.67      -      0.64    0.54    0.51
  Math          0.70     0.67     0.64     -      0.45    0.51
  Discrim       0.66     0.65     0.54    0.45     -      0.40
  Music         0.63     0.57     0.51    0.51    0.40     -

FIG 11.1 Spearman's hierarchical ordering of correlations

The hierarchical order of the correlations in the matrix is shown by the tendency for the coefficients in any pair of columns to bear the same ratio to one another all the way down those columns.
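A quick check with the values in Figure 11.1 shows what this proportionality amounts to. Taking the Classics and French columns, for example:

$$\frac{0.78}{0.67}\approx 1.16,\qquad \frac{0.70}{0.67}\approx 1.04,\qquad \frac{0.66}{0.65}\approx 1.02,\qquad \frac{0.63}{0.57}\approx 1.11.$$

The ratios are roughly, though not exactly, constant; with a sample of only 33, exact proportionality could hardly be expected.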
Of course, the technique of correlation that began with Galton (1888b) is central to all forms of factor analysis, but, more particularly, Spearman's work relied on the concept of partial correlation, which allows us to examine the relationship between, for example, two variables when a third is held constant. The method gave Spearman the means of making mathematical the notion that the correlation of a specific test with a general factor was common to all tests of intellectual functioning. The remaining portion of the variance, error excepted, is specific and unique to the test that measures the variable. This is the essence of what came to be known as the two-factor theory.
The partial correlation of variables 1 and 2 when a third, 3, is held constant is given by

$$r_{12\cdot 3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$$
If there are two variables a and b, and g is constant and the only cause of the correlation ($r_{ab}$) between a and b is g, then $r_{ab\cdot g}$ would be zero and hence

$$r_{ab} = r_{ag}\,r_{bg}$$

If a is set to equal b then

$$r_{aa} = r_{ag}^2$$

This latter represents the variance in the variable a that is accounted for by g and leads us to the communalities in a matrix of correlations.

If now we were to take four variables a, b, c, and d, and to consider the correlations of a and b with c and d, then each such correlation is a product of the kind above, so that

$$\frac{r_{ac}}{r_{bc}} = \frac{r_{ad}}{r_{bd}}$$

and therefore

$$r_{ac}\,r_{bd} - r_{bc}\,r_{ad} = 0$$

The left-hand side of this latter equation is Spearman's famous tetrad difference, and his proof of it is given in an appendix to his 1921 book. The term tetrad difference first appears in the mid-1920s (see, e.g., Spearman & Holzinger, 1925).

So, a matrix of correlations such as

$$\begin{pmatrix} r_{ac} & r_{ad} \\ r_{bc} & r_{bd} \end{pmatrix}$$

generates the tetrad difference $r_{ac}r_{bd} - r_{bc}r_{ad}$ and, without being too mathematical, this is the value of the minor determinant of order two. When all these minor determinants are zero, the matrix is of rank one. Moreover, the correlations can then be explained by one general factor.
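The point can be illustrated numerically. In the following sketch the correlations are manufactured from a single set of hypothetical loadings, $r_{jk} = r_{jg}\,r_{kg}$, and every tetrad difference then vanishes, as the rank-one argument requires:

```python
import numpy as np
from itertools import combinations

# Correlations manufactured from one general factor: r_jk = g_j * g_k.
# The loadings g_j (correlations of each test with g) are hypothetical.
g = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
R = np.outer(g, g)                       # a rank-one matrix of "correlations"

# Every tetrad difference r_ac*r_bd - r_bc*r_ad is zero (to rounding error)
for (a, b), (c, d) in combinations(combinations(range(5), 2), 2):
    assert abs(R[a, c] * R[b, d] - R[b, c] * R[a, d]) < 1e-12

print(np.linalg.matrix_rank(R))          # 1: one general factor suffices
```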
Wolfle (1940) has noted that confusion about what the tetrad difference meant or implied persisted over a good many years. It was sometimes thought that when a matrix of correlations satisfied the tetrad equation, every individual measurement of every variable could only be divided into two independent parts. Spearman insisted that what he maintained was that this division could be made, not that it was the only possible division. Nevertheless, pronouncements about the tetrad equation by Spearman were frequently followed by assertions such as, "The one part has been called the 'general factor' and denoted by the letter g . . . The second part has been called the 'specific factor' and denoted by the letter s" (Spearman, 1927, p. 75).
Moreover, Spearman's frequent references to "mental energy" gave the impression that g was "real," even though, especially when challenged, he denied that it was anything more than a useful, mathematical, explanatory construction. Spearman made ambiguous statements about g, but it seems that he was convinced that the two-factor theory was paramount, and the tetrad difference formed an important part of his argument. Not only that, Spearman was satisfied that Garnett's earlier "proof" (1920) of the two-factor theory had effectively dealt with any criticism:

There is another important limitation to the division of the variables into factors. It is that the division into general and specific factors all mutually independent can be effected in one way only; in other words, it is unique. (Spearman, 1927, p. vii)

In 1933 Brown and Stephenson attempted a comprehensive test of the theory using an initial battery of 22 tests on a sample of 300 boys aged between 10 and 10½ years. But they "purified" the battery by dropping tests that for one reason or another did not fit, and found support for the two-factor theory. Wolfle's (1940) review of factor analysis says of this work: "Whether one credits the attempt with success or not, all that it proves is this; if one removes all tetrad differences which do not satisfy the criterion, the remaining ones do satisfy it" (p. 9).
Pearson and Moul (1927), in a lengthy paper, attempt a dissection of Spearman's mathematics, in particular examining the sampling distribution of the tetrads and whether or not it can be justified as being close to the normal distribution. They conclude: "the claim of Professor Spearman to have effected a Copernican revolution in psychology seems at present premature" (p. 291). However, they do state that even though they believe the theory of general and specific factors is "too narrow a structure to form a frame for the great variety of mental abilities," they believe that it should and could be tested more adequately.
Spearman's annoyance with Pearson's view of his work was not so troubling as the pronouncements on it made by the American mathematician E. B. Wilson (1879-1964), Professor of Vital Statistics at Harvard's School of Public Health. The episode has been meticulously researched and a comprehensive account given by Lovie and Lovie (1995). Spearman met Wilson at Harvard during a visit to the United States late in 1927. Wilson had read The Abilities of Man and had formed the opinion that its author's mathematics were wanting, and had tried to explain to him that it was possible to "get the tetrad differences to vanish, one and all, in so many ways that one might suspect that the resolution into a general factor was not unique" (quoted by Lovie & Lovie, 1995, p. 241).
Wilson subsequently reviewed Spearman's book for Science. It is generally a friendly review. Wilson describes the book as "an important work," written "clearly, spiritedly, suggestively, in places even provocatively," and his words concentrate on the mathematical appendix. Wilson admits that his review is "lop-sided," but he maintains that Spearman has missed some of the logical implications of the mathematics, and in particular that his solutions are indeterminate. But he leaves a loophole for Spearman:

Do g_x, g_y, . . . whether determined or undeterminable represent the intelligence of x, y, . . . ? The author advances a deal of argument and of statistics to show that they do. This is for psychologists, not for me to assess. (Wilson, 1928, p. 246)

Wilson did not want to destroy the basic philosophy or to undermine completely the thrust of Spearman's work. Later, in 1929, Wilson published a more detailed mathematical critique (1929b) and a correspondence between him and Spearman ensued, lasting at least until 1933. The Lovies have analyzed these exchanges and show that a solution, or, as they term it, "an uneasy compromise," was reached. A "socially negotiated" solution saw Spearman accepting Wilson's critique, Garnett modifying his earlier "proof" of the two-factor theory, and Wilson offering ways in which the problems might be at least mitigated, if not overcome. More specifically, the idea of a partial indeterminacy in g was suggested.

REWRITING THE BEGINNINGS


In 1976, the world of psychology, and British psychology in particular, was shaken by allegations published in the Sunday Times by Oliver Gillie, to the effect that Sir Cyril Burt (1883-1971), Spearman's successor as Professor of Psychology at University College, had faked a large part of the data for his research on twins; among other labels in a series of articles, Gillie called Burt "a plagiarist of long standing." Burt was a prominent figure in psychology in the United Kingdom, and his work on the heritability of intelligence, bolstered by his data on its relationship among twins reared apart, was not only widely cited but influenced educational policy. Leslie Hearnshaw, Professor Emeritus of Psychology at the University of Liverpool, was, at that time, researching his biography (published in 1979) of Burt, and when it appeared he had not only examined the apparently fraudulent nature of Burt's data but he had also reviewed the contributions of Burt to the technique of factor analysis.
The "Burt Scandal"led to adebateby the British Psychological Society, the


publication of a "balancesheet"(Beloff, 1980), and anumberof attemptsto
rehabilitatehim (Fletcher, 1991; Jensen, 1992; Joynson, 1989). It must be said
that Hearnshaw's view that Burt re-wrote history in an attemptto place himself
rather than Spearman as thefounder of factor analysishas notbeen viewedas
seriously as the allegations of fraud. Nevertheless, insofaras Hearnshaw's
words have been picked over in theattemptsto atleast downplay,if not discredit,
his findings, and insofar as therecord is importantin the history and develop-
ment of factor analysis, theyare worth examininghere. Once more the credit
must go to theLevies (1993),who have provideda "side-by-side"comparison
of Spearman'sprepublication notesand commentson Burt's work and the
relevant sectionsof the paper itself.
The publication of Burt's (1909) paper on "general intelligence" was preceded by some important correspondence with Spearman, who saw the paper as offering support for his two-factor theory. In fact, Spearman reworked the paper, and large sections of Spearman's notes on it were used substantially verbatim by Burt. As the Lovies point out, this was not blatant plagiarism on Burt's part, even though the latter's acknowledgement of Spearman's role was incomplete:

but was a kind of reciprocated self-interest about the Spearman-Burt axis at this time which allowed Burt to copy from Spearman, and Spearman to view this with comparative equanimity because the article provided such strong support for the two factor theory. (p. 315)

However, of more import as far as our history is concerned is Hearnshaw's contention that Burt, in attempting to displace Spearman, used subtle and sometimes not-so-subtle commentary to place the origins of factor analysis with Pearson's (1901) article on "principal axes" and to suggest that he (Burt) knew of these methods before his exchanges with Spearman. Indeed, Hearnshaw (1979) reports on Burt's correspondence with D. F. Vincent (noted earlier) in which Burt claimed that he had learned of Pearson's work when Pearson visited Oxford and that it was then that he and McDougall (Burt's mentor) became interested in the techniques. After Spearman died, Burt's papers repeatedly emphasized the priority of Pearson and Burt's role in the elaboration of methods that became known as factor analysis. Hearnshaw notes that Burt did not mention Pearson's 1901 work until 1947 and implies that it was the publication of Thurstone's book (1947) Multiple Factor Analysis and its final chapter on "The Principal Axes" that alerted Burt. It should also be noted that Wolfle's (1940) review, although citing 11 of Burt's papers of the late 1930s, does not mention Pearson at all.
It will be some years before the "Burt scandal" becomes no more than an historical footnote. Burt's major work, The Factors of the Mind, remains a significant and important contribution, not only to the debate about the nature of intelligence, but also to the application of mathematical methods to the testing of theory. Even the most vehement critics of Burt would likely agree and, with Hearnshaw, might say, "It is lamentable that he should have blotted his record by the delinquencies of his later years." In particular, Burt's writings moved factor theorizing away from the Spearman contention that g was invariant no matter what types of tests were used - in other words, that different batteries of tests should produce the same estimates of g. The idea of a number of group factors that may be identified from sets of tests that had similar, although not identical, content - for example, verbal factors, numeracy factors, spatial factors, and so on - was central in the development of Burt's writings. Overlap among the tests within each factor did not suggest that the g values were identical. Again, Spearman did not dismiss these approaches out of hand, but it is plain that he always favored the two-factor model.

THE PRACTITIONERS

Although the distinction may be somewhat simplistic, it is possible to separate the theoreticians from the applied practitioners in these early years of factor analysis. Lovie (1983) discusses the essentially philosophical and experimental drive underlying Spearman's work, as well as that of Thomson and Thurstone. Spearman's early paper (1904a) begins with an introduction that discusses the "signs of weakness" in experimental psychology; he avers that "Wundt's disciples have failed to carry forward the work in all the positive spirit of their master," and he announces a:

"correlational psychology," for the purpose of positively determining all psychical tendencies, and in particular those which connect together the so-called "mental tests" with psychical activities of greater generality and interest. (p. 205)

The work culminated in a book that set out his fundamental, if not yet watertight and complete, conceptual framework. Burt's work, with its continuing premise of the heritability of intelligence, also shows the need to seek support for a particular construction of mental life.

On the other hand, a number of contemporary researchers, whose contributions to factor analysis were also impressive, were fundamentally applied workers. Two, Kelley and Thomson, were in fact Professors of Education, and a third, Louis Thurstone, was a giant in the field of scaling techniques and testing. Their contributions were marked by the task of developing tests of abilities and skills, measures of individual differences. Truman L. Kelley (1884-1961), Professor of Education and Psychology at Stanford University, published his Crossroads in the Mind of Man in 1928. In his preface he welcomes Spearman's work, but in his first chapter, "Boundaries of Mental Life," he makes it clear that his work indicates that several traits are included in Spearman's g, and he emphasizes these group factors throughout his book:

Mental life does not operate in a plain but in a network of canals. Though each canal may have indefinite limits in length and depth, it does not in width; though each mental trait may grow and become more and more subtle, it does not lose its character and discreteness from other traits. (p. 23)

Lovie and Lovie (1995) report on correspondence between Wilson and H. W. Holmes, Dean of the School of Education at Harvard, after Wilson had read and reviewed (1929a) Kelley's book: "If Spearman is on dangerous ground, Kelley is sitting on a volcano."
Wilson's (1929a) review of Kelley again concentrates on the mathematics and again raises, in even more acute fashion, the problem of indeterminacy. It is not an unkind review, but it turns Kelley's statement (quoted above) on its head:

Mental life in respect of its resolution into specific and general factors does not operate in a network of canals but in a continuously distributed hyperplane ... no trait has any separate or discrete existence, for each may be replaced by proper linear combinations of the others proceeding by infinitesimal gradations and it is only the complex of all that has reality. Instead of writing of crossroads in the mind of man one should speak of man's trackless mental forest or tundra or jungle - mathematically mind you, according to the analysis offered. The canals or crossroads have been put in without any indication, so far as I can see, that they have been put in any other way than we put in roads across our midwestern plains, namely to meet the convenience or to suit the fancy of the pioneers. (p. 164)

Wilson, a highly competent mathematician, had obviously had a struggle with Kelley's mathematics. He concludes:

The fact of the matter is that the author after an excellent introductory discussion of the elements of the theory of the resolution into general and specific factors ... gives up the general problem entirely, throws up his hands so to speak, and proceeds to develop special methods of examining his variables to see where there seem to be specific bonds between them. (p. 160)

Kelley, perhaps a little bemused, perhaps a little rueful, and almost certainly grateful that Wilson's review had not been more scathing, responds: "The mathematical tool is sharp and I may nick, or may already have nicked myself with it. At any rate I do still enjoy the fun of whittling and of fondling such clean-cut chips as Wilson has let fall" (p. 172).
Sir Godfrey Thomson was Professor of Education at the University of Edinburgh from 1925 to 1951. He had little or no training in psychology; indeed, his PhD was in physics. After graduate studies in Strasbourg, he returned home to Newcastle, England, where he was obliged to take up a post in education to fulfill requirements that were attached to the grants he had received as a student. His work at Edinburgh was almost wholly concerned with the development of mental tests and the refining of his approach to factor analysis. In 1939 his book The Factorial Analysis of Human Ability was published, a book that established his reputation and set him firmly outside the Spearman camp. A later edition (1946) expanded his views and introduced to a wider audience the work of Louis Thurstone (1887-1955). Thomson's name is not as well known as Spearman's, nor is his work much cited. This is probably because he did not develop a psychological theory of intelligence, although he did offer explanations for his findings. He can, however, be regarded as Spearman's main British rival, and he parts company with him on three main grounds: first, that Spearman's analysis had not and could not conclusively demonstrate the "existence" of g; second, that even if the existence of g was admitted, it was misleading and dangerous to reify it so that it became a crucial psychological entity; and third, that Spearman's hierarchical model had been overtaken by multiple factor models that were at once more sophisticated and had more explanatory power.

The Spearman school of experimenters, however, tend always to explain as much as possible by one central factor....
There are innumerable other ways of explaining these same correlations ... And the final decision between them has to be made on some other grounds. The decision may be psychological.... Or the decision may be made on the ground that we should be parsimonious in our invention of "factors," and that where one general and one group factor will serve we should not invent five group factors. (Thomson, 1946, pp. 14-15)

Thomson was not saying that Spearman's view does not make sense but that it is not inevitably the only view. Spearman had agreed that g was present in all the factors of the mind, but in different amounts or "weights," and that in some factors it was small enough to be neglected. Thomson welcomed these views, "provided that g is interpreted as a mathematical entity only, and judgement is suspended as to whether it is anything more than that" (p. 240).

And his verdict on two-factor theory? He commented that the method of two factors was an analytical device for indicating their presence, and that advances in method had been made that led to multiple factor analysis: "It was Professor Thurstone of Chicago who saw that one solution to the problem could be reached by a generalization of Spearman's idea of zero tetrad differences" (p. 20).
Thomson himself had suggested a "sampling theory" to replace Spearman's approach:

The alternative theory to explain the zero tetrad differences is that each test calls upon a sample of bonds which the mind can form, and that some of these bonds are common to two tests and cause their correlation. (p. 45)

Thomson did not give more than a very general view of what the bonds might be, although it seems that they were fundamentally neurophysiological and, from a psychological standpoint, akin to the connections in the connectionist views of learning first advanced by Thorndike:

What the "bonds" of the mind are, we do not know. But they are fairly certainly associated with the neurones or nerve cells of our brains ... Thinking is accompanied by the excitation of these neurones in patterns. The simplest patterns are instinctive, more complex ones acquired. Intelligence is possibly associated with the number and complexity of the patterns which the brain can (or could) make. (Thomson, 1946, p. 51)

And, lastly, to complete this short commentary on Thomson's efforts, his demonstration of geometrical methods for the illustration of factor theory contrasts markedly with the work of both Spearman and Burt and is much more in tune with the "rotation" methods of later contributors, notably Thurstone.
In 1939, as part of a General Meeting of the British Psychological Society, Burt, Spearman, Thomson (1939a), and Stephenson each gave papers on the factorial analysis of human ability. These papers were published in the British Journal of Psychology, together with a summing up by Thomson (1939b). Thomson points out that neither Thurstone nor any of his followers were present to defend their views and gives a very brief and favorable summary of them. He even says, citing Professor Dirac at a meeting of the Royal Society of Edinburgh:

When a mathematical physicist finds a mathematical solution or theorem which is particularly beautiful ... he can have considerable confidence that it will prove to correspond to something real in physical nature. Something of the same faith seems to lie behind Prof. Thurstone's trust in the coincidence of "Simple Structure" in the matrix of factor loadings with psychological significance in the factors thus defined. (Thomson, 1939b, p. 105)

He also noted that Burt had "expressed the hope that the 'tetrad difference' of the four symposiasts would be found to vanish!" (p. 108).
Louis L. Thurstone produced his first work on "the multiple factor problem" in 1931. The Vectors of the Mind appeared in 1935 and what he himself defined as a development and expansion of this work, Multiple Factor Analysis, in 1947. Reprints of the book were still being published long after his death. The following scheme might give some feel for Thurstone's approach. We looked earlier at the matrix of correlations that produces a tetrad difference. Now consider:

$$\begin{vmatrix} r_{ad} & r_{ae} & r_{af} \\ r_{bd} & r_{be} & r_{bf} \\ r_{cd} & r_{ce} & r_{cf} \end{vmatrix}$$

This is a minor determinant of order three, and from any two of its rows and columns a new tetrad can be formed, for example $r_{ad}r_{be} - r_{bd}r_{ae}$. If these new tetrads are zero then the determinant "vanishes." This procedure can be carried out with determinants of any order, and if we come to a stage when all the minors "vanish," the "rank" of the correlation matrix will be reduced accordingly. Thurstone found that tests could be analyzed into as many common factors as the reduced rank of the correlation matrix. The present account follows that of Thomson (1946, chap. 2), who gives a numerical example. The rank of the matrix of all the correlations may be reduced by inserting values in the diagonal of the matrix - the communalities. These values may be thought of as self-correlations, in which case they are all 1. But they may also be regarded as that part of the variance in the variable that is due to the common factors, in which case they have to be estimated in some fashion. They might even be merely guessed. Thurstone chose values that made the tetrads zero. If these methods seem not to be entirely satisfactory now, they were certainly not welcomed by psychologists in the 1920s, 1930s, and 1940s who were trying to come to grips with the new mathematical approaches of the factor theorists and their views of human intelligence.

Thurstone's centroid method bears some similarities to principal components analysis. The word centroid refers to a multivariate average, and Thurstone's "first centroid" may be thought of as an average of all the tests in a battery. Central to the outcome of the analysis is the production of factor loadings - values that express the relationship of the tests to the presumed underlying factors. The ways in which these loadings are arrived at are many and various, and Thurstone's centroid technique was just one early approach. However, it was an approach that was relatively easy to apply and (with some mathematics) relatively easy to understand. How the factors were interpreted psychologically was, and is, largely a matter for the researcher. Thus tests that were related that involved arithmetic or numerical reasoning or number relationships might be labeled numerical, or numeracy.
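A minimal sketch of the first-centroid computation, under the textbook formula in which each loading is a column sum divided by the square root of the grand sum of the matrix, shows how simple the arithmetic is; the correlation matrix here is hypothetical:

```python
import numpy as np

# Sketch of the first centroid factor: with communalities (here unities) in
# the diagonal, each test's loading is its column sum divided by the square
# root of the grand sum of the matrix. The correlations are hypothetical.
R = np.array([[1.00, 0.72, 0.63, 0.54],
              [0.72, 1.00, 0.56, 0.48],
              [0.63, 0.56, 1.00, 0.42],
              [0.54, 0.48, 0.42, 1.00]])

loadings = R.sum(axis=0) / np.sqrt(R.sum())
print(loadings.round(3))                    # first centroid loadings

# The residual matrix, from which a second centroid could be extracted
residual = R - np.outer(loadings, loadings)
```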


Thurstone maintained that complete interpretation of factors involved the technique of rotation. Loadings in a principal factors matrix reflect common factor variances in test scores. The matrices are arbitrary, for they can be manipulated to show different axes - different "frames of reference." We can only make scientific sense of the outcomes when the axes are rotated to pass through loadings that show an assumed factor that represents a "psychological reality." Such rotations, which may produce orthogonal (uncorrelated) axes or oblique (correlated) axes, were first carried out by a mixture of mathematics, intuition, and inspiration and have become a standard part of modern factor analysis. It is perhaps worth noting that Thurstone's early training was in engineering and that he taught geometry for a time at the University of Minnesota.

The configurational interpretations are evidently distasteful to Burt, for he does not have a single diagram in his text. Perhaps this is indicative of individual differences in imagery types which leads to differences in methods and interpretation among scientists. (Thurstone, 1947, p. ix)
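Rotation is nowadays done analytically rather than by eye. As an illustration only - the varimax criterion came later than the work described here, and the loadings below are hypothetical - the standard iteration can be written in a few lines:

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-8):
    """Rotate a loading matrix L toward 'simple structure' (varimax)."""
    p, k = L.shape
    R = np.eye(k)                           # accumulated orthogonal rotation
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # The standard SVD step of the varimax iteration
        u, s, vt = np.linalg.svd(
            L.T @ (Lr**3 - Lr @ np.diag((Lr**2).sum(axis=0)) / p))
        R = u @ vt
        d_old, d = d, s.sum()
        if d < d_old * (1 + tol):           # no further improvement
            break
    return L @ R

# Hypothetical unrotated loadings: six tests, two factors
L = np.array([[0.71, 0.40], [0.68, 0.35], [0.65, 0.30],
              [0.50, -0.45], [0.45, -0.50], [0.40, -0.55]])
print(varimax(L).round(2))
```

After rotation each test loads mainly on one factor, which is what "simple structure" demands of the frame of reference.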

Thurstone's work eventually produced a set of seven primary mental abilities: verbal comprehension, word fluency, numerical, spatial, memory, perceptual speed, and reasoning. This kind of scheme became widely popular among psychologists in the United States, although many were disconcerted by the fact that the number of factors tended to grow. J. P. Guilford (1897-1987) devised a theoretical model that postulated 120 factors (1967), and by 1971 the claim was made that almost 100 of them had been identified.
An ongoing problem for the early researchers was the evident subjective element in all the methods. Mathematicians stepped into the picture (Wilson was one of the early ones), often at the request of the psychologists or educationists who were not generally mathematical sophisticates (Thomson, not trained as a psychologist, was an exception), but they were remarkably quick learners. The search for analytical methods for assessing "simple structure" began. Carroll (1953) was among the earliest who tackled the problem:

A criticism of current practice in multiple factor analysis is that the transformation of the initial factor matrix F to a rotated "simple structure" matrix must apparently be accompanied by methods which allow considerable scope for subjective judgement. (p. 23)

Carroll's paper goes on to describe a mathematical method that avoids subjective decisions. Modern factor analysis had begun. From then on a great variety of methods was developed, and it is no accident that such solutions were developed coincidentally with the rise of the use of the high-speed digital computer, which removed the computational labor from the more complex procedures. Harman (1913-1976), building on earlier work (Holzinger [1892-1954] & Harman, 1941), published Modern Factor Analysis in 1960, with a second and a third (1976) edition, and the book is a classic in the field.
In research on the structure of human personality, the giants in the area were Raymond Cattell (1905-1998) and Hans Eysenck (1916-1997). Cattell had been a student of Cyril Burt, and in his early work had introduced the concepts of fluid intelligence, which is akin to g, a broad, biologically based construct of general mental ability, and crystallized intelligence, which depends on learning and environmental experience. Later, Cattell's work concentrated on the identification and measurement of the factors of human personality (1965a), and his 16PF (16 personality factors) test is widely used. His Handbook of Multivariate Experimental Psychology (1965b) is a compendium of the multivariate approach with a number of eminent contributors. Cattell himself contributed 6 of the 27 chapters and co-authored another.

Cattell's research output, over a very long career, was massive, as was that of Eysenck. The latter's method, which he termed criterion analysis, was a use of factor analysis that attempted to confirm the existence of previously hypothesized factors. Three dimensions were eventually identified by Eysenck, and his work helped to set off one of psychology's ongoing arguments about just how many personality factors there really were. This is not the place to review this debate, but it does, once again, place factor analysis at the center of sometimes quite heated discussion about the "reality" of factors and the extent to which they reflect the theoretical preconceptions of the investigators. "What comes out is no more than what goes in," is the cry of the critics. What is absolutely clear from this situation is that it is necessary to validate the factors by experimental investigation that stands aside from the methods used to identify the factors and avoids capitalizing on semantic similarities among the tests that were originally employed. It has to be acknowledged, of course, that both Cattell and Eysenck (1947; Eysenck & Eysenck, 1985) and their many collaborators have tried to do just that.
Recent years have seen an increase in the use of factor-analytic techniques in a variety of disciplines and settings, even, apparently, in the analysis of the performance of racehorses! In psychology, the sometimes heated discussions of the dangers of the reification of factors and of their treatment as psychological entities, as well as arguments over, for example, just how many personality dimensions are necessary and/or sufficient to define variations in human temperament, continue. At bottom, a large part of these problems concerns the use of the technique either as a confirmatory tool for theory or as a search for new structure in our data. Unless these two quite different views are recognized at the start of discussion, the debate will take on an exasperating futility that stifles all progress.
12
The Design of Experiments

THE PROBLEM OF CONTROL

When Ronald Fisher accepted the post of statistician at Rothamsted Experimental Station in 1919, the tasks that faced him were to make what he could of a large quantity of existing data from ongoing long-term agricultural studies (one had begun in 1843!) and to try to improve the effectiveness of future field trials. Fisher later described the first of these tasks as "raking over the muck heap"; the second he approached with great vigor and enthusiasm, laying as he did so the foundations of modern experimental design and statistical analysis.
The essential problem is the problem of control. For the chemist in the laboratory it is relatively easy to standardize and manipulate the conditions of a specific chemical reaction. Social scientists, biologists, and agricultural researchers have to contend with the fact that their experimental material (people, animals, plants) is subject to irregular variation that arises as a result of complex interactions of genetic factors and environmental conditions. These many variations, unknown and uncertain, make it very difficult to be confident that observed differences in experimental observations are due to the manipulations of the experimenter rather than to chance variation. The challenge of the psychological sciences is the sensitivity of behavior and experience to a multiplicity of factors. But in many respects the challenge has not been answered, because the unexplained variation in our observations is generally regarded as a nuisance or as irrelevant.
It is useful to distinguish between experimental control and the controlled experiment. The former is the behaviorist's ideal, the state where some consistent behavior can be set off and/or terminated by manipulating precisely specified variables. On the other hand, the controlled experiment describes a procedure in which the effect of the manipulation of the independent variable or variables is, as it were, checked against observations undertaken in the absence of the manipulation. This is the method that is employed and supported by the followers of the Fisherian tradition. The uncontrolled variables that affect the observations are assumed to operate in a random fashion, changing individual behavior in all kinds of ways so that when the data are averaged their effects are canceled out, allowing the effect of the manipulated variable to be seen. The assumption of randomness in the influence of uncontrolled variables is, of course, not one that is always easy to justify, and the relegation of important influences on variability to error may lead to erroneous inferences and disastrous conclusions.
Malaria is a disease that has been known and feared for centuries. It decimated the Roman Empire in its final years, it was quite widespread in Britain during the 17th century, and indeed it was still found there in the fen country in the 19th century. The names malaria, marsh fever, and paludism all reflect the view that the cause of the disease was the breathing of damp, noxious air in swamp lands. The relationship between swamp lands and the incidence of malaria is quite clear. The relationship between swamp lands and the presence of mosquitos is also clear. But it was not until the turn of the century that it was realized that the mosquito was responsible for the transmission of the malarial parasite, and only 20 years earlier, in 1880, was the parasite actually observed. In 1879, Sir Patrick Manson (1844-1922), a physician whose work played a role in the discovery of the malarial cycle, presented a paper in which he suggested that the disease elephantiasis was transmitted through insect bites. The paper was received with scorn and disbelief. The evidence for the life cycle of the malarial parasite in mosquitos and human beings and its being established as the cause of the illness came in a number of ways - not the least of which was the healthy survival, throughout the malarial season, of three of Manson's assistants living in a mosquito-proof hut in the middle of the Roman Campagna (Guthrie, 1946, pp. 357-358). This episode is an interesting example of a concomitant or correlated factor - the mosquito, not the swamp air - turning out to be the direct cause of the observations.
In psychological studies, some of the earliest work that used true experimental designs is that of Thorndike and Woodworth (1901) on transfer of training. They used "before-after" designs, control group designs, and correlational studies in their work. However, the now routine inclusion of control groups in experimental investigations in psychology does not appear to have been an accepted necessity until about 50 years ago. In fact, controlled experimentation in psychology more or less coincided with the introduction of Fisherian statistics, and the two quite quickly became inseparable. Of course it would be both foolish and wrong to imply that early empirical investigations in psychology were completely lacking in rigor, and mistaken conclusions rife. The point is that it was not until the 1920s and 1930s that the "rules" of controlled experimentation were spelled out and appreciated in the psychological sciences. The basic rules had been in existence for many decades, having been codified by John Stuart Mill (1806-1873) in a book, first published in 1843, that is usually referred to as the Logic (1843/1872/1973). These formulations had been preceded by an earlier British philosopher, Francis Bacon, who made recommendations for what he thought would be sound inductive procedures.

METHODS OF INQUIRY

Mill proposed four basic methods of experimental inquiry, and five Canons, the first of which is: "If two or more instances of the phenomenon under investigation have only one circumstance in common, the circumstance in which alone all the instances agree, is the cause (or effect) of the given phenomenon" (Mill, 1843/1872, 8th ed., p. 390).

If observations a, b, and c are made in circumstances A, B, and C, and observations a, d, and e in circumstances A, D, and E, then it may be concluded that A causes a. Mill commented, "As this method proceeds by comparing different instances to ascertain in what they agree, I have termed it the Method of Agreement" (p. 390).
As Mill points out, the difficulty with this method is the impossibility of ensuring that A is the only antecedent of a that is common to both instances. The second canon is the Method of Difference. The antecedent circumstances A, B, and C are followed by a, b, and c. When A is absent only b and c are observed:

If an instance in which the phenomenon under investigation occurs, and an instance in which it does not occur, have every circumstance in common save one, that one occurring only in the former; the circumstance in which alone the two instances differ, is the effect, or the cause, or an indispensable part of the cause, of the phenomenon. (p. 391)

This method contains the difficulty in practice of being unable to guarantee that it is the crucial difference that has been found. As part of the way around this difficulty, Mill introduces a joint method in his third canon:

If two or more instances in which the phenomenon occurs have only one circumstance in common, while two or more instances in which it does not occur have nothing in common save the absence of that circumstance; the circumstance in which alone the two sets of instances differ, is the effect, or the cause, or an indispensable part of the cause, of the phenomenon. (p. 396)

In 1881 Louis Pasteur (1822-1895) conducted a famous experiment that exemplifies the methods of agreement and difference. Some 30 farm animals were injected by Pasteur with a weak culture of anthrax. Later these animals and a similar number of others that had not been so "vaccinated" were given a fatal dose of anthrax. Within a few days the non-vaccinated animals were dead or dying, the vaccinated ones healthy. The conclusion that was enthusiastically drawn was that Pasteur's vaccination procedure had produced the immunity that was seen in the healthy animals. The effectiveness of vaccination is now regarded as an established fact. But it is necessary to guard against incautious logic. The health of the vaccinated animals could have been due to some other fortuitous circumstance. Because it is known that some animals infected with anthrax do recover, an experimental group composed of these resistant animals could have resulted in a spurious conclusion. It should be noted that Pasteur himself recognized this as a possibility.
Mill's Method of Residues proclaims that, having identified by the methods of agreement and difference that certain observed phenomena are the effects of certain antecedent conditions, the phenomena that remain are due to the circumstances that remain. "Subduct from any phenomenon such part as is known by previous inductions to be the effect of certain antecedents, and the residue of the phenomenon is the effect of the remaining antecedents" (1843/1872, 8th ed., p. 398).
Mill here uses a very modern argument for the use of the method in providing evidence for the debate on racial and gender differences:

Those who assert, what no one has shown any real ground for believing, that there is in one human individual, one sex, or one race of mankind over another, an inherent and inexplicable superiority in mental faculties, could only substantiate their proposition by subtracting from the differences of intellect which we in fact see, all that can be traced by known laws either to the ascertained differences of physical organization, or to the differences which have existed in the outward circumstances in which the subjects of the comparison have hitherto been placed. What these causes might fail to account for, would constitute a residual phenomenon, which and which alone would be evidence of an ulterior original distinction, and the measure of its amount. But the assertors of such supposed differences have not provided themselves with these necessary logical conditions in the establishment of their doctrine. (p. 429)

The final method and the fifth canon is the Method of Concomitant Variations: "Whatever phenomenon varies in any manner whenever another phenomenon varies in some particular manner, is either a cause or an effect of that phenomenon, or is connected with it through some fact of causation" (p. 401). This method is essentially that of the correlational study, the observation of covariation:

Let us suppose the question to be, what influence the moon exerts on the surface of the earth. We cannot try an experiment in the absence of the moon, so as to observe what terrestrial phenomenon her annihilation would put an end to; but when we find that all the variations in the positions of the moon are followed by corresponding variations in the time and place of high water, the place always being either the part of the earth which is nearest to, or that which is most remote from, the moon, we have ample evidence that the moon is, wholly or partially, the cause which determines the tides. (p. 400)

Mill maintained that these methods were in fact the rules for inductive logic, that they were both methods of discovery and methods of proof. His critics, then and now, argued against a logic of induction (see chapter 2), but it is clear that experimentalists will agree with Mill that his methods constitute the means by which they gather experimental evidence for their views of nature.

The general structure of all the experimental designs that are employed in the psychological sciences may be seen in Mill's methods. The application and withholding of experimental treatments across groups reflect the methods of agreement and difference. The use of placebos and the systematic attempts to eliminate sources of error make up the method of residues, and the method of concomitant variation is, as already noted, a complete description of the correlational study. It is worth mentioning that Mill attempted to deal with the difficulties presented by the correlational study and, in doing so, outlined the basics of multiple regression analysis, the mathematics of which were not to come for many years:

Suppose, then, that when A changes in quantity, a also changes in quantity, and in such a manner that we can trace the numerical relation which the changes of the one bear to such changes of the other as take place within the limits of our observation. We may then safely conclude that the same numerical relation will hold beyond those limits. (Mill, 1843/1872, 8th ed., p. 403)

Mill elaborates on this proposition and goes on to discuss the case where a is not wholly the effect of A but nevertheless varies with it:

It is probably a mathematical function not of A alone, but of A and something else: its changes, for example, may be such as would occur if part of it remained constant, or varied on some other principle, and the remainder varied in some numerical relation to the variations of A. (p. 403)
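In modern dress, Mill's supposition is the linear model a = b0 + b1·A + disturbance, with b0 the part that "remained constant" and b1·A the part that varies with A. A minimal sketch, with entirely simulated data, shows least squares recovering the numerical relation:

```python
import numpy as np

# Mill's supposition as a linear model: a = b0 + b1 * A + disturbance.
# The data are simulated; the true values are b0 = 2.0 and b1 = 0.5.
rng = np.random.default_rng(1)
A = rng.uniform(0, 10, size=100)
a = 2.0 + 0.5 * A + rng.normal(scale=0.3, size=100)

X = np.column_stack([np.ones_like(A), A])   # a constant part, plus A
coef, *_ = np.linalg.lstsq(X, a, rcond=None)
print(coef)                                 # approximately [2.0, 0.5]
```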

Mill's Logic is his principal work, and it may be fairly cast as the book that first describes both a justification for, and the methodology of, the social sciences. Throughout his works, the influence of the philosophers who were, in fact, the early social scientists is evident. David Hartley (1705-1757) published his Observations on Man in 1749. This book is a psychology rather than a philosophy (Hartley, 1749/1966). It systematically describes associationism in a psychological context and is the first text that deals with physiological psychology. James Mill (1773-1836), John Stuart's father, much admired Hartley, and his major work became one of the main source books that Mill the elder introduced to his son when he started his formal education at the age of 3 (when he learned Ancient Greek), although he did not get to formal logic until he was 12. Another important early influence was Jeremy Bentham (1748-1832), a reformer who preached the doctrine of Utilitarianism, the essential feature of which is the notion that the several and joint effects of pleasure and pain govern all our thoughts and actions. Later Mill rejected strict Benthamism and questioned the work of a famous and influential contemporary, Auguste Comte (1798-1857), whose work marks the foundation of positivism and of sociology. The influence of the ideas of these thinkers on early experimental psychology is strong and clear, but they will not be explored here. The main point to be made is that John Stuart Mill was an experimental psychologist's philosopher. More, he was the methodologist's philosopher. In a letter to a friend, he said:

If there is any science which I am capable of promoting, I think it is the science of science itself, the science of investigation - of method. I once heard Maurice say ... that almost all differences of opinion when analysed, were differences of method. (quoted by Robson in his textual introduction to the Logic, p. xlix)

And it is clear that all subsequent accounts of method and experimental design can be traced back to Mill.

THE CONCEPT OF STATISTICAL CONTROL

The standard design for agricultural experiments at Rothamsted in the days before Fisher was to divide a field into a number of plots. Each plot would receive a different treatment, say, a different manure or fertilizer or manure/fertilizer mixture. The plot that produced the highest yield would be taken to be the best, and the corresponding treatment considered to be the most effective. Fisher, and others, realized that soil fertility is by no means uniform across a large field and that this, as well as other factors, can affect the yields. In fact, the differences in the yields could be due to many factors other than the particular treatments, and the highest yield might be due to some chance combination of these factors. The essential problem is to estimate the magnitude of these chance factors - the errors - to eliminate, for example, the differences in soil fertility.

Some of the first data that Fisher saw at Rothamsted were the records of daily rainfall and yearly yields from plots in the famous Broadbalk wheat field. Fertilizers had been applied to these plots, using the same pattern, since 1852. Fisher used the method of orthogonal polynomials to obtain fits of the yields over time. In his paper (1921b) on these data published in 1921 he describes analysis of variance (ANOVA) for the first time.

When the variation of any quantity (variate) is produced by the action of two or more independent causes, it is known that the variance produced by all the causes simultaneously in operation is the sum of the values of the variance produced by each cause separately. ... In Table II is shown the analysis of the total variance for each plot, divided according as it may be ascribed (i) to annual causes, (ii) to slow changes other than deterioration, (iii) to deterioration; the sixth column shows the probability of larger values for the variance due to slow changes occurring fortuitously. (Fisher, 1921b, pp. 110-111)

The method of data analysis that Fisher employed was ingenious and painstaking, but he realized quickly that the data that were available suffered from deficiencies in the design of their collection. Fisher set out on a new series of field trials.

He divided a field into blocks and subdivided each block into plots. Each plot within the block was given a different treatment, and each treatment was assigned to each plot randomly. This, as Bartlett (1965) puts it, was Fisher's "vital principle."

When statistical data are collected as natural observations, the most sensible assumptions about the relevant statistical model have to be inserted. In controlled experimentation, however, randomness could be introduced deliberately into the design, so that any systematic variability other than [that] due to imposed treatments could be eliminated.

The second principle Fisher introduced naturally went with the first. With statistical analysis geared to the design, all variability not ascribed to the influence of treatments did not have to inflate the random error. With equal numbers of replications for the treatments each replication could be contained in a distinct block, and only variability among plots in the same block was a source of error - that between blocks could be removed. (Bartlett, 1965, p. 405)

The statistical analysis allowed for an even more radical break with traditional experimental methods:

No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or, ideally, one question, at a time. The writer is convinced that this view is wholly mistaken. Nature, he suggests, will best respond to a logical and carefully thought out questionnaire; indeed, if we ask her a single question, she will often refuse to answer until some other topic has been discussed. (Fisher, 1926b, p. 511)

Fisher's"carefully thoughtout questionnaire"was thefactorial design. All


possible combinationsof treatments wouldbe applied with replications. For
example,in the applicationof nitrogen (N), phosphate (P), andpotash(K) there
would be eight possible treatment combinations: no fertilizer, N, P, K, N & P,
N & K, P & K, and N & P & K. Separate compact blocks would be laid out and
these combinations would be randomly appliedto plots withineachblock. This
design allowsfor an estimation of the main effects of the basicfertilizers, the
first-order interactions (theeffect of two fertilizers in combination), and the
second-orderinteraction (theeffect of the three fertilizersin combination).The
1926(b)papersetsout Fisher'srationale for field experimentsand was,as he
noted, the precursorof his book, The Design of Experiments (1935/1966),
published9 years later.The paperis illustratedwith a diagram(Fig.12.1)of a
"complex experiment with winteroats" that had been carriedout with a
colleagueat Rothamsted (Eden & Fisher, 1927).
Here 12 treatments,including absenceof treatments- the"control" plots -
were tested.

FIG. 12.1 Fisher's Design 1926. Journal of the Ministry of Agriculture.
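The eight combinations in such a 2 × 2 × 2 factorial arrangement can be enumerated mechanically. A minimal sketch in Python (ours, not Fisher's; the labels are merely illustrative):

    from itertools import product

    # Each factor - nitrogen (N), phosphate (P), potash (K) - is absent (0) or present (1).
    factors = ["N", "P", "K"]
    for combo in product([0, 1], repeat=len(factors)):
        label = " & ".join(f for f, on in zip(factors, combo) if on) or "no fertilizer"
        print(label)
    # Prints the 2**3 = 8 treatment combinations, from "no fertilizer" to "N & P & K".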

Any general difference between sulphate and chloride, between early and late application, or ascribable to quantity of nitrogenous manure, can be based on thirty-two comparisons, each of which is affected by such soil heterogeneity as exists between plots in the same block. To make these three sets of comparisons only, with the same accuracy, by single question methods, would require 224 plots, against our 96; but in addition many other comparisons can be made with equal accuracy, for all combinations of the factors concerned have been explored. Most important of all, the conclusions drawn from the single-factor comparisons will be given, by the variation of non-essential conditions, a very much wider inductive basis than could be obtained, by single question methods, without extensive repetitions of the experiment. (Fisher, 1926b, p. 512)

The algebra and the arithmetic of the analysis are dealt with in the following chapter. The crucial point of this work is the combination of statistical analysis with experimental design. Part of the stimulus for this paper was Sir John Russell's (1926) article on field experiments, which had appeared in the same journal just months earlier. Russell's review presents the orthodox approach to field trials and advocated carefully planned, systematic layouts of the experimental plots. Sir John Russell was the Director of the Rothamsted Experimental Station, he had hired Fisher, and he was Fisher's boss, but Fisher dismissed his methodology. In a footnote in the 1926(b) paper, Fisher says:

This principle was employed in an experiment on the influence of the weather on the effectiveness of phosphates and nitrogen alluded to by Sir John Russell. The author must disclaim all responsibility for the design of this experiment, which is, however, a good example of its class. (Fisher, 1926b, p. 506)

And as Fisher Box (1978) remarks:

It is a measure of the climate of the times that Russell, an experienced research scientist who ... had had the wisdom to appoint Fisher statistician for the better analysis of the Rothamsted experiments, did not defer to the views of his statistician when he wrote on how experiments were made. Design was, in effect, regarded as an empirical exercise attempted by the experimenter; it was not yet the domain of statisticians. (p. 153)

In fact the statistical analysis, in a sense, arises from the design. Nowadays, when ANOVA is regarded as efficient and routine, the various designs that are available and widely used are dictated to us by the knowledge that the reporting of statistical outcomes and their related levels of significance is the sine qua non of scientific respectability and acceptability by the psychological establishment. Historically, the new methods of analysis came first. The confounds, defects, and confusions of traditional designs became apparent when ANOVA was used to examine the data and so new designs were undertaken.

Randomization was demanded by the logic of statistical inference. Estimates of error and valid tests of statistical significance can only be made when the assumptions that underlie the theory of sampling distributions are upheld. Put crudely, this means that "blind chance" should not be restricted in the assignment of treatments to plots, or experimental groups. It is, however, important to note that randomization does not imply that no restrictions or structuring of the arrangements within a design are possible.

Figure 12.2 shows two systematic designs: (a) a block design and (b) a Latin square design, and two randomized designs of the same type, (c) and (d). The essential difference is that chance determines the application of the various treatments applied to the plots in the latter arrangements, but the restrictions are apparent. In the randomized block and in the randomized Latin square, each block contains one replication of all the treatments.

The estimate of error is valid, because, if we imagine a large number of different results obtained by different random arrangements, the ratio of the real to the estimated error, calculated afresh for each of these arrangements, will be actually distributed in the theoretical distribution by which the significance of the result is tested. Whereas if a group of arrangements is chosen such that the real errors in this group are on the whole less than those appropriate to random arrangements, it has now been demonstrated that the errors, as estimated, will, in such a group, be higher than is usual in random arrangements, and that, in consequence, within such a group, the test of significance is vitiated. (Fisher, 1926b, p. 507)

Treatments                     A Standard Latin Square
          1  2  3  4  5
Block 1   A  B  C  D  E        A  B  C  D
Block 2   A  B  C  D  E        B  A  D  C
Block 3   A  B  C  D  E        C  D  B  A
Block 4   A  B  C  D  E        D  C  A  B
Block 5   A  B  C  D  E
        (a)                        (b)

Treatments                     A Random Latin Square
          1  2  3  4  5
Block 1   D  C  E  A  B        D  A  C  B
Block 2   A  D  B  C  E        C  B  D  A
Block 3   B  A  E  C  D        B  D  A  C
Block 4   E  D  C  A  B        A  C  B  D
Block 5   B  A  D  E  C
        (c)                        (d)

FIG. 12.2 Experimental Designs
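Such layouts are easy to generate. A hedged sketch in Python (our illustration, not Fisher's actual procedure, which involved enumerating the squares and selecting among them at random):

    import random

    treatments = ["A", "B", "C", "D", "E"]

    # Randomized blocks: each block receives every treatment once, in random order.
    for block in range(1, 6):
        layout = random.sample(treatments, k=len(treatments))
        print("Block", block, " ".join(layout))

    # A randomized 4 x 4 Latin square: randomly permute the rows and columns of a
    # cyclic square. Every row and column still contains each treatment exactly
    # once, though this shortcut does not sample uniformly from all Latin squares.
    n = 4
    base = [[(i + j) % n for j in range(n)] for i in range(n)]
    rows, cols = random.sample(range(n), n), random.sample(range(n), n)
    for r in rows:
        print(" ".join("ABCD"[base[r][c]] for c in cols))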



Fisher later examines the utility of the Latin square design, pointing out that it is by far the most efficient and economical for "those simple types of manurial trial in which every possible comparison is of equal importance" (p. 510). In 1925 and early 1926, Fisher enumerated the 5 × 5 and 6 × 6 squares, and in the 1926 paper he made an offer that undoubtedly helped to spread the name, and the fame, of the Rothamsted Station to many parts of the world:

The Statistical Laboratory at Rothamsted is prepared to supply these, or other types of randomized arrangements, to intending experimenters; this procedure is considered the more desirable since it is only too probable that new principles will, at their inception, be, in some detail or other, misunderstood and misapplied; a consequence for which their originator, who has made himself responsible for explaining them, cannot be held entirely free from blame. (Fisher, 1926b, pp. 510-511)

THE LINEAR MODEL

Fisher described ANOVA as a way of "arranging the arithmetic" (Fisher Box, 1978, p. 109), an interpretation with which not a few students would quarrel. However, the description does point to the fact that the components of variance are additive and that this property is an arithmetical one and not part of the calculus of probability and statistical inference as such.

The basic construct that marks the culmination of Fisher's work is that of specifying values of an unknown dependent variable, y, in terms of a linear set of parameters, each one of which weights the several independent variables x1, x2, x3, ..., xn, that are used for prediction, together with an error component ε that accounts for the random fluctuations in y for particular fixed values of x1, x2, x3, ..., xn. In algebraic terms,

y = β1x1 + β2x2 + β3x3 + ... + βnxn + ε
As we noted earlier, the random component in the model and the fact that it is sample-based make it a probabilistic model, and the properties of the distribution of this component, real or assumed, govern the inferences that may be made about the unknown dependent variable. Fisher's work is the crucial link between classical least squares analysis and regression analysis.
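A hedged numerical sketch of fitting such a model by least squares, in the Gauss-Legendre tradition (the data and names here are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))                 # two predictors, x1 and x2
    y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.5, size=50)

    # Append a column of ones for the constant term and solve by least squares.
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(beta)   # estimates close to [0, 1.5, -0.7]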
As Seal (1967) notes, "The linear regression model owes so much to Gauss that we believe it should bear his name" (p. 1).

However, there is little reason to suppose that this will happen. Twenty years ago Seal found that very few of the standard texts on regression, or the linear model, or ANOVA made more than a passing reference to Gauss, and the situation is little changed today. Some of the reasons for this have already been mentioned in this book. The European statisticians of the 18th and 19th centuries were concerned with vital statistics and political arithmetic, and inference and prediction in the modern sense were, generally speaking, a long way off in these fields. The mathematics of Legendre and Gauss and others on the theory of errors did not impinge on the work of the statisticians. Perhaps more strikingly, the early links between social and vital data and error theory that were made by Laplace and Quetelet were largely ignored by Karl Pearson and Ronald Fisher.

Why, then, could not the Theory of Errors be absorbed into the broader concept of statistical theory ... ? ... The original reason was Pearson's preoccupation with the multivariate normal distribution and its parameters. The predictive regression equation of his pathbreaking 'regression' paper (1896) was not seen to be identical in form and solution to Gauss's Theoria Motus (1809) model. R. A. Fisher and his associates ... were rediscovering many of the mathematical results of least squares (or error) theory, apparently agreeing with Pearson that this theory held little interest to the statistician. (Seal, 1967, p. 2)

There might be other more mundane, nonmathematical reasons. Galton and others were strongly opposed to the use of the word error in describing the variability in human characteristics, and the many treatises on the theory might thus have been avoided by the new social scientists, who were, in the main, not mathematicians.

In his 1920 paper on the history of correlation, Pearson is clearly most anxious to downplay any suggestion that Gaussian theory contributed to its development. He writes of the "innumerable treatises" (p. 27) on least squares, of the lengthy analysis, of his opinion that Gauss and Bravais "contributed nothing of real importance to the problem of correlation" (p. 82), and of his view that it is not clear that a least squares generalization "indicates the real line of future advance" (p. 45). The generalization had been introduced by Yule, who Pearson and his Gower Street colleagues clearly saw as the enemy. Pearson regarded himself as the father of correlation and regression insofar as the mathematics were concerned. Galton and Weldon were, of course, recognized as important figures, but they were not mathematicians and posed no threat to Pearson's authority. In other respects, Pearson was driven to try to show that his contributions were supreme and independent.

The historical record has been traced by Seal (1967), from the fundamental work of Legendre and Gauss at the beginning of the 19th century to Fisher over 100 years later.

THE DESIGN OF EXPERIMENTS


A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested. (Fisher, 1935/1966, 8th ed., p. 11)

With these words Fisher introduces the example that illustrated his view of the principles of experimentation. Holschuh (1980) describes it as "the somewhat artificial 'lady tasting tea' experiment" (p. 35), and indeed it is, but perhaps an American writer does not appreciate the fervor of the discussion on the best method of preparing cups of tea that still occupies the British! Fisher Box (1978) reports that an informal experiment was carried out at Rothamsted. A colleague, Dr B. Muriel Bristol, declined a cup of tea from Fisher on the grounds that she preferred one to which milk had first been added. Her insistence that the order in which milk and tea were poured into the cup made a difference led to a lighthearted test actually being carried out.

Fisher examines the design of such an experiment. Eight cups of tea are prepared. Four of them have tea added first and four milk. The subject is told that this has been done and the cups of tea are presented in a random order. The task is, of course, to divide the set of eight into two sets of four according to the method of preparation. Because there are 70 ways of choosing a set of 4 objects from 8:

A subject without any faculty of discrimination would in fact divide the 8 cups correctly into two sets of 4 in one trial out of 70, or, more properly, with a frequency which would approach 1 in 70 more and more nearly the more often the test were repeated. ... The odds could be made much higher by enlarging the experiment, while if the experiment were much smaller even the greatest possible success would give odds so low that the result might, with considerable probability, be ascribed to chance. (Fisher, 1935/1966, 8th ed., pp. 12-13)
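The arithmetic is easily checked. A minimal sketch (ours, not Fisher's):

    from math import comb

    ways = comb(8, 4)        # 70 ways of choosing 4 cups from 8
    p_perfect = 1 / ways     # chance of a perfect division with no discrimination
    print(ways, round(p_perfect, 4))   # 70 0.0143 - comfortably below .05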

Fisher goes on to say that it is "usual and convenient to take 5 per cent as a standard level of significance" (p. 13), and so an event that would occur by chance once in 70 trials is decidedly significant. The crucial point for Fisher is the act of randomization:

Apart, therefore, from the avoidable error of the experimenter himself introducing with his test treatments, or subsequently, other differences in treatment, the effects of which the experiment is not intended to study, it may be said that the simple precaution of randomisation will suffice to guarantee the validity of the test of significance, by which the result of the experiment is to be judged. (Fisher, 1935/1966, 8th ed., p. 21)

This is indeed the crucial requirement. Experimental design when variable measurements are being made, and statistical methods are to be used to tease out the information from the error, demands randomization. But there are ironies here in this, Fisher's elegant account of the lady tasting tea. It has been hailed as the model for the statistical inferential approach:

It demands of the reader the ability to follow a closely reasoned argument, but it will repay the effort by giving a vivid understanding of the richness, complexity and subtlety of modern experimental method. (Newman, 1956, Vol. 3, p. 1458)

In fact, it uses a situation and a method that Fisher repudiated elsewhere. More discussion of Fisher's objections to the Neyman-Pearson approach is given later. For the moment, it might be noted that Fisher's misgivings center on the equation of hypothesis testing with industrial quality control acceptance procedures where the population being sampled has an objective reality, and that population is repeatedly sampled. However, the tea-tasting example appears to follow this model! Kempthorne (1983) highlighted problems and indicated the difficulties that so many have had in understanding Fisher's pronouncements.

In his book Statistical Methods and Scientific Inference, first published in 1956, Fisher devotes a whole chapter to "Some Misapprehensions about Tests of Significance." Here he castigates the notion that the level of significance should be determined by "repeated sampling from the same population, evidently with no clear realization that the population in question is hypothetical" (Fisher, 1956/1973, 3rd ed., pp. 81-82).
He determines to illustrate "the more general effects of the confusion between the level of significance appropriately assigned to a specific test, with the frequency of occurrence of a specified type of decision" (Fisher, 1956/1973, 3rd ed., p. 82). He states, "In fact, as a matter of principle, the infrequency with which, in particular circumstances, decisive evidence is obtained, should not be confused with the force, or cogency of such evidence" (Fisher, 1956/1973, 3rd ed., p. 96).

Kempthorne (1983), whose perceptions of both Fisher's genius and inconsistencies are as cogent and illuminating as one would find anywhere, wonders if this book's lack of recognition of randomization arose because of Fisher's belated, but of course not admitted, recognition that it did not mesh with "fiduciating." Kempthorne quotes a "curious" statement of Fisher's and comments, "Well, well!" A slightly expanded version is given here:

Whereas in the "Theory of Games" a deliberately randomized decision (1934) may often be useful to give an unpredictable element to the strategy of play; and whereas planned randomization (1935-1966) is widely recognized as essential in the selection and allocation of experimental material, it has no useful part to play in the formation of opinion, and consequently in tests of significance designed to aid the formation of opinion in the Natural Sciences. (Fisher, 1956/1973, 3rd ed., p. 102)

Kendall (1963) wishes that Fisher had never written the book, saying, "If we had to sacrifice any of his writings, [this book] would have a strong claim to priority" (p. 6).

However he did write the book, and he used it to attack his opponents. In marshaling his arguments, he introduced inconsistencies of both logic and method that have led to confusion in lesser mortals. Karl Pearson and the biometricians used exactly the same tactics. In chapter 15 the view is presented that it is this rather sorry state of affairs that has led to the historical development of statistical procedures, as they are used in psychology and the behavioral sciences, being ignored by the texts that made them available to a wider, and undoubtedly eager, audience.
13
Assessing Differences and Having Confidence

FISHERIAN STATISTICS

Any assessment of the impact of Fisher's arrival on the statistical battlefield has to recognize that his forces did not really seek to destroy totally Pearson's work or its raison d'etre. The controversy between Yule and Pearson, discussed earlier, had a philosophical, not to say ideological, basis. If, at the height of the conflict, one or other side had "won," then it is likely that the techniques advocated by the vanquished would have been discarded and forgotten. Fisher's war was more territorial. The empire of observation and correlation had to be taken over by the manipulations of experimenters. Although he would never have openly admitted it - indeed, he continued to attack Pearson and his works to the very end of his life (which came 26 years after the end of Pearson's) - the paradigms and procedures he developed did indeed incorporate and improve on the techniques developed at Gower Street. The chi-square controversy was not a dispute about the utility of the test or its essential rationale, but a bitter disagreement over the efficiency and method of its application. For a number of reasons, which have been discussed, Fisher's views prevailed. He was right.

In the late 1920s and 1930s Fisher was at the height of his powers and vigorously forging ahead. Pearson, although still a man to be reckoned with, was nearly 30 years away from his best work, an old man facing retirement, rather isolated as he attacked all those who were not unquestioningly for him. Last, but by no means least, Fisher was the better mathematician. He had an intuitive flair that brought him to solutions of ingenuity and strength. At the same time, he was able to demonstrate to the community of biological and behavioral scientists, a community that so desperately needed a coherent system of data management and assessment, that his approach had enormous practical utility.

Pearson's work may be characterized as large sample and correlational, Fisher's as small sample and experimental. Fisher's contribution easily absorbs the best of Pearson and expands on the seminal work of "Student." Assessing the import and significance of the variation in observations across groups subject to different experimental treatments is the essence of analysis of variance, and a haphazard glance at any research journal in the field of experimental psychology attests to its impact.

THE ANALYSIS OF VARIANCE

The fundamental ideas of analysis of variance appeared in the paper that examined correlation among Mendelian factors (Fisher, 1918). At this time, eugenic research was occupying Fisher's attention. Between 1915 and 1920, he published half a dozen papers that dealt with matters relevant to this interest, an interest that continued throughout his life. The 1918 paper uses the term variance for (σ1² + σ2²), σ1 and σ2 representing two independent causes of variability, and referred to the normally distributed population.

We may now ascribe to the constituent causes fractions or percentages of the total variance which they together produce. It is desirable on the one hand that the elementary ideas at the basis of the calculus of correlations should be clearly understood, and easily expressed in ordinary language, and on the other that loose phrases about the "percentage of causation," which obscure the essential distinction between the individual and the population, should be carefully avoided. (Fisher, 1918, pp. 399-400)

Here we see Fisher already moving away from Pearsonian correlational methods as such and appealing to the Gaussian additive model. Unlike Pearson's work, it cannot be said that the particular philosophy of eugenics directly governed Fisher's approach to new statistical techniques, but it is clear that Fisher always promoted the value of the methods in genetics research (see, e.g., Fisher, 1952). What the new techniques were to achieve was a recognition of the utility of statistics in agriculture, in industry, and in the biological and behavioral sciences, to an extent that could not possibly have been foreseen before Fisher came on the scene.

The first published account of an experiment that used analysis of variance to assess the data was that of Fisher and MacKenzie (1923) on The Manurial Response of Different Potato Varieties.

Two aspects of this paper are of historical interest. At that time Fisher did not fully understand the rules of the analysis of variance - his analysis is wrong - nor the role of randomization. Secondly, although the analysis of variance is closely tied to additive models, Fisher rejects the additive model in his first analysis of variance, proceeding to a multiplicative model as more reasonable. (Cochrane, 1980, p. 17)

Cochrane points out that randomization was not used in the layout and that an attempt to minimize error used an arrangement that placed different treatments near one another. The conditions could not provide an unbiased estimate of error. Fisher then proceeds to an analysis based on a multiplicative model:

Rather surprisingly, practically all of Fisher's later work on the analysis of variance uses the additive model. Later papers give no indication as to why the product model was dropped. Perhaps Fisher found, as I did, that the additive model is a good approximation unless main effects are large, as well as being simpler to handle than the product model. (Cochrane, 1980, p. 21)

Fisher's derivation of the procedure of analysis of variance and his understanding of the importance of randomization in the planning of experiments are fully discussed in Statistical Methods for Research Workers (1925/1970), first published in 1925. This work is now examined in more detail.

Over 45 years and 14 editions, the general character of the book did not change. The 14th edition was published in 1970, using notes left by Fisher at the time of his death. Expansions, deletions, and elaborations are evident over the years. Notable are Fisher's increasing recognition of the work of others and greater attention to the historical account. Fisher's concentration on his row with the biometricians as time went by is also evident. The preface to the last edition follows earlier ones in stating that the book was a product of the research needs of Rothamsted. Further:

It was clear that the traditional machinery inculcated by the biometrical school was wholely unsuited to the needs of practical research. The futile elaboration of innumerable measures of correlation, and the evasion of the real difficulties of sampling problems under cover of a contempt for small samples, were obviously beginning to make its pretensions ridiculous. (Fisher, 1970, 14th ed., p. v)

The opening sentence of the chapter on correlation in the first edition reads:

No quantity is more characteristic of modern statistical work than the correlation coefficient, and no method has been applied successfully to such various data as the method of correlation. (Fisher, 1925, 1st ed., p. 129)

and in the 14th edition:

No quantity has been more characteristic of biometrical work than the correlation coefficient, and no method has been applied to such various data as the method of correlation. (Fisher, 1970, 14th ed., p. 177)

This not-so-subtle change is reflected in the divisions in psychology that are still evident. The two disciplines discussed by Cronbach in 1957 (see chapter 2) are those of the correlational and experimental psychologists.

In his opening chapter, Fisher sets out the scope and definition of statistics. He notes that they are essential to social studies and that it is because the methods are used there that "these studies may be raised to the rank of sciences" (p. 2). The concepts of populations and parameters, of variation and frequency distributions, of probability and likelihood, and of the characteristics of efficient statistics are outlined very clearly. A short chapter on diagrams ought to be required reading, for it points up how useful diagrams can be in the appraisal of data.¹ The chapter on distributions deals with the normal, Poisson, and binomial distributions. Of interest is the introduction of the formula:

s² = S(x − x̄)² / (n − 1)

(Fisher uses S for summation) for variance, noting that s is the best estimate of σ.

¹ Scatterplots, that so quickly identify the presence of "outliers," are critical in correlational analyses. The geometrical exploration of the fundamentals of variance analysis provides insights which cannot be matched (see, e.g., Kempthorne, 1976).
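The n − 1 divisor is what modern software calls the unbiased estimate. A minimal check (ours; the data are invented):

    import numpy as np

    x = np.array([12.0, 15.0, 11.0, 14.0, 13.0])
    s2 = np.var(x, ddof=1)   # ddof=1 gives the n - 1 divisor, Fisher's best estimate
    print(s2)                # 2.5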
Chapter 4 deals with tests of goodness-of-fit, independence, and homogeneity, giving a complete description of the application of the χ² tests, including Yates' correction for discontinuity and the procedure for what is now known as the Fisher Exact Test. Chapter 5 is on tests of significance, about which more is said later. Chapter 6 manages to discuss, quite thoroughly, the techniques of interclass correlation without mentioning Pearson by name except to acknowledge that the data of Table 31 are Pearson and Lee's. Failure to acknowledge the work of others, which was a characteristic of both Pearson and Fisher, and which, to some extent, arose out of both spite and arrogance, at least partly explains the anonymous presentation of statistical techniques that is to be found in the modern textbooks and commentaries.
And then chapter 7, two-thirds of the way through the book, introduces that most important and influential of methods - analysis of variance. Fisher describes analysis of variance as "the separation of the variance ascribable to one group of causes from the variance ascribable to other groups" (Fisher, 1925/1970, 14th ed., p. 213), but he examines the development of the technique from a consideration of the intraclass correlation. His example is clear and worth describing. Measurements from n′ pairs of brothers may be treated in two ways in a correlational analysis. The brothers may be divided into two
classes, say, the elder brother and the younger, and the usual interclass correlation on some measured variable may be calculated. When, on the other hand, the separation of the brothers into two classes is either irrelevant or impossible, then a common mean and standard deviation and an intraclass correlation may be computed.

Given pairs of measurements (x1, x′1); (x2, x′2); (x3, x′3); ... (xn′, x′n′), the following statistics may be computed:

x̄ = Σ(x + x′) / 2n′
s² = Σ[(x − x̄)² + (x′ − x̄)²] / 2n′
rᵢ = Σ(x − x̄)(x′ − x̄) / n′s²
In the preceding equations, Fisher's S has been replaced with Σ and rᵢ used to designate the intraclass correlation coefficient. The computation of rᵢ is very tedious, as the number of classes k and the number of observations in each class increases. Each pair of observations has to be considered twice, (x1, x′1) and (x′1, x1) for example. A set of k values gives k(k − 1) entries in a symmetrical table. "To obviate this difficulty Harris [1913] introduced an abbreviated method of calculation by which the value of the correlation given by the symmetrical table may be obtained directly from two distributions" (Fisher, 1925/1970, 14th ed., p. 216). In fact:

Fisher goes on to discuss the sampling errors of the intraclass correlation and refers them to his z distribution. Figure 13.1 shows the effect of the transformation of r to z.

Curves of very unequal variance are replaced by curves of equal variance, skew curves by approximately normal curves, curves of dissimilar form by curves of similar form. (Fisher, 1925/1970, 14th ed., p. 218)

The transformation is given by:

z = ½ loge[(1 + r)/(1 − r)]
Fisher provides tables of the r to z transformation.

FIG. 13.1 r to z transformation (from Fisher, Statistical Methods for Research Workers).

After giving an example of the use of the table and finding the significance of the intraclass correlation, conclusions may be drawn. Because the symmetrical table does not give the best estimate of the correlation, a negative bias is introduced into the value for z and Fisher shows how this may be corrected. Fisher then shows that intraclass correlation is an example of the analysis of variance: "A very great simplification is introduced into questions involving intraclass correlation when we recognise that in such cases the correlation merely measures the relative importance of two groups of variation" (Fisher, 1925/1970, 14th ed., p. 223). Figure 13.2 is Fisher's general summary table, showing, "in the last column, the interpretation put upon each expression in the calculation of an intraclass correlation from a symmetrical table" (p. 225).

FIG. 13.2 ANOVA summary table 1 (from Fisher's Statistical Methods for Research Workers).

A quantity made up of two independently and normally distributed parts with variances A and B respectively, has a total variance of (A + B). A sample of n′ values is taken from the first part and different samples of k values from the second part added to them. Fisher notes that in the population from which the values are drawn, the correlation between pairs of members of the same family is:

r = A / (A + B)

and the values of A and B may be estimated from the set of kn′ observations. The summary table is then presented again (Fig. 13.3) and Fisher points out that "the ratio between the sums of squares is altered in the ratio n′ : (n′ − 1),

FIG. 13.3 ANOVA summary table 2 (from Fisher, Statistical Methods for Research Workers).

which precisely eliminates the negative bias observed in z derived by the previous method" (Fisher, 1925/1970, 14th ed., p. 227).

The general class of significance tests applied here is that of testing whether an estimate of variance derived from n1 degrees of freedom is significantly greater than a second estimate derived from n2 degrees of freedom. The significance may be assessed without calculating r. The value of z may be calculated as ½ loge of the ratio of the two estimates of variance, the one based on n′ − 1 degrees of freedom (with expectation kA + B) and the other on n′(k − 1) degrees of freedom (with expectation B). Fisher provides tables of the z distribution for the 5% and 1% points. In the later editions of the book he notes that these values were calculated from the corresponding values of the variance ratio, e^2z, and refers to tables of these values prepared by Mahalanobis, using the symbol x in 1932, and Snedecor, using the symbol F in 1934. In fact:

z = ½ loge F

"The wide use in the United States of Snedecor's symbol has led to the distribution being often referred to as the distribution of F" (Fisher, 1925/1970, 14th ed., p. 229). Fisher ends the chapter by giving a number of examples of the use of the method.

It should be mentioned here that the details of the history of the relationship of ANOVA to intraclass correlation is a neglected topic in almost all discussions of the procedure. A very useful reference is Haggard (1958).
The final two chapters of the book discuss further applications of analysis of variance and statistical estimation. Of most interest here is Fisher's demonstration of the way in which the technique can be used to test the linear model and the "straightness" of the regression line. The method is the link between least squares and regression analysis. Also of importance is Fisher's discussion of Latin square designs and the analysis of covariance in improving the efficiency and precision of experiments.
Fisher Box notes that the book did not receive a single good review. An example, which reflected the opinions of many, was the reviewer for the British Medical Journal:

If he feared that he was likely to fall between two stools, to produce a book neither full enough to satisfy those interested in its statistical algebra nor sufficiently simple to please those who dislike algebra, we think Mr. Fisher's fears are justified by the result. (Anonymous, 1926, p. 815)

Yates (1951) comments on these early reviews, noting that many of them expressed dismay at the lack of formal mathematical proofs and interpreted the work as though it was only of interest to those who were involved in small sample work. Whatever its reception by the reviewers, by 1950, when the book was in its 11th edition, about 20,000 copies had been sold.

But it is fair to say that something like 10 years went by from the date of the original publication before Fisher's methods really started to have an effect on the behavioral sciences. Lovie (1979) traces its impact over the years 1934 to 1945. He mentions, as do others, that the early textbook writers contributed to its acceptance. Notable here are the works of Snedecor, published in 1934 and 1937. Lush (1972) quotes a European researcher who told him, "When you see Snedecor again, tell him that over here we say, 'Thank God for Snedecor; now we can understand Fisher' " (Lush, 1972, p. 225).

Lindquist published his Statistical Analysis in Educational Research in 1940, and this book, too, was widely used. Even then, some authorities were skeptical, to say the least. Charles C. Peters in an editorial for the Journal of Educational Research rather condescendingly agrees that Fisher's statistics are "suitable enough" for agricultural research:

And occasionally these techniques will be useful for rough preliminary exploratory research in other fields, including psychology and education. But if educationists and psychologists, out of some sort of inferiority complex, grab indiscriminately at them and employ them where they are unsuitable, education and psychology will suffer another slump in prestige such as they have often hitherto suffered in consequence of the pursuit of fads. (Peters, 1943, p. 549)

That Peters' conclusion partly reflects the situation at that time, but somewhat misses the mark, is best evidenced by Lovie's (1979) survey, which we look at in the next chapter. Fifty or sixty years on, sophisticated designs and complex analyses are common in the literature but misapprehensions and misgivings are still to be found there. The recipes are enthusiastically applied but their structure is not always appreciated.

If the early workers relied on writers like Snedecor to help them with Fisherian applications, later ones are indebted to workers like Eisenhart (1947) for assistance with the fundamentals of the method. Eisenhart sets out clearly the importance of the assumptions of analysis of variance, their critical function in inference, and the relative consequences of their not being fulfilled. Eisenhart's significant contribution has been picked up and elaborated on by subsequent writers, but his account cannot be bettered.

He delineates the two fundamentally distinct classes of analysis of variance - what are now known as the fixed and random effects models. The first of these is the most familiar to researchers in psychology. Here the task is to determine the significance of differences among treatment means: "Tests of significance employed in connection with problems of this class are simply extensions to small samples of the theory of least squares developed by Gauss and others - the extension of the theory to small samples being due principally to R. A. Fisher" (Eisenhart, 1947, pp. 3-4).

The second class Eisenhart describes as the true analysis of variance. Here the problem is one of estimating, and inferring the existence of, the components of variance, "ascribable to random deviation of the characteristics of individuals of a particular generic type from the mean value of these characteristics in the 'population' of all individuals of that generic type" (Eisenhart, 1947, p. 4).

The failure of the then current literature to adequately distinguish between the two methods is because the emphasis had been on tests of significance rather than on problems of estimation. But it would seem that, despite the best efforts of the writers of the most insightful of the now current texts (e.g., Hays, 1963, 1973), the distinction is still not fully applied in contemporary research. In other words, the emphasis is clearly on the assessment of differences among pairs of treatment means rather than on the relative and absolute size of variances.

Eisenhart's discussion of the assumptions of the techniques is a model for later writers. Random variation, additivity, normality of distribution, homogeneity of variance, and zero covariance among the variables are discussed in detail and their relative importance examined. This work can be regarded as a classic of its kind.

MULTIPLE COMPARISON PROCEDURES

In 1972, Maurice Kendall commented on how regrettable it was that during the 1940s mathematics had begun to "spoil" statistics. Nowhere is the shift in emphasis from practice, with its room for intuition and pragmatism, to theory and abstraction more evident than in the area of multiple comparison procedures. The rules for making such comparisons have been discussed ad nauseam, and they continue to be discussed. Among the more complete and illuminating accounts are those of Ryan (1959) and Petrinovich and Hardyck (1969). Davis and Gaito (1984) provide a very useful discussion of some of the historical background. In commenting on Tukey's (1949) intention to replace the intuitive approach (championed by "Student") with some hard, cold facts, and to provide simple and definite procedures for researchers, they say:

[This is] symptomatic of the transition in philosophy and orientation from the early use of statistics as a practical and rigorous aid for interpreting research results, to a highly theoretical subject predicated on the assumption that mathematical reasoning was paramount in statistical work. (Davis & Gaito, 1984, p. 5)

It is also the case that the automatic invoking, from the statistical packages, of any one of half a dozen procedures following an F test has helped to promote the emphasis on the comparison of treatment means in psychological research. No one would argue with the underlying rationale of multiple comparison procedures. Given that the task is to compare treatment means, it is evident that to carry out multiple t tests from scratch is inappropriate. Over the long run it is apparent that as larger numbers of comparisons are involved, using a procedure that assumes that the comparisons are based on independent paired data sets will increase the Type I error rate considerably when all possible comparisons in a given set of means are made. Put simply, the number of false positives will increase. As Davis and Gaito (1984) point out, with α set at the .05 level and H0 true, comparisons, using the t test, among 10 treatment means would, in the long run, lead to the difference between the largest and the smallest of them being reported as significant some 60% of the time. One of the problems here is that the range increases faster than the standard deviation as the size of the sample increases. The earliest attempts to devise methods to counteract this effect come from Tippett (1925) and "Student" (1927), and later workers referred to the "studentizing" of the range, using tables of the sampling distribution of the range/standard deviation ratio, known as the q statistic. Newman (1939) published a procedure that uses this statistic to assess the significance of multiple comparisons among treatment means.
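The inflation that Davis and Gaito describe is easy to demonstrate by simulation. A hedged sketch (ours, not their computation; the sample sizes are arbitrary):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    trials, hits = 2000, 0
    for _ in range(trials):
        # Ten groups drawn from one population, so the null hypothesis is true.
        groups = rng.normal(size=(10, 20))
        means = groups.mean(axis=1)
        lo, hi = means.argmin(), means.argmax()
        # A naive t test on the extreme pair, as if it were the only comparison made.
        t, p = stats.ttest_ind(groups[hi], groups[lo])
        hits += p < .05
    print(hits / trials)   # far above the nominal .05; on the order of .6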
In general, the earlier writers followed Fisher, who advocated performing t tests following an analysis that produced an overall z that rejected the null hypothesis, the variance estimate being provided by the error mean square and its associated degrees of freedom. Fisher's only cautionary note comes in a discussion of the procedure to be adopted when the z test fails to reject the null hypothesis:

Much caution should be used before claiming significance for special comparisons. Comparisons, which the experiment was designed to make, may, of course, be made without hesitation. It is comparisons suggested subsequently, by a scrutiny of the results themselves, that are open to suspicion; for if the variants are numerous, a comparison of the highest with the lowest observed value, picked out from the results, will often appear to be significant, even from undifferentiated material. (Fisher, 1935/1966, 8th ed., p. 59)

Fisher is here giving his blessing to planned comparisons but does not mention that these comparisons should, strictly speaking, be orthogonal or independent. What he does say is that unforeseen effects may be taken as guides to future investigations. Davis and Gaito (1984) are at pains to point out that Fisher's approach was that of the practical researcher and to contrast it with the later emphasis on fundamental logic and mathematics.

Oddly enough, a number of procedures, springing from somewhat different rationales, all appeared on the scene at about the same time. Among the best known are those of Duncan (1951, 1955), Keuls (1952), Scheffe (1953), and Tukey (1949, 1953). Ryan (1959) examines the issues. After contending that, fundamentally, the same considerations apply to both a posteriori (sometimes called post hoc) and a priori (sometimes called planned) comparisons, and drawing an analogy with debates over one-tail and two-tail tests (to be discussed briefly later), Ryan defines the problem as the control of error rate. Per comparison error rates refer to the probability that a given comparison will be wrongly judged significant. Per experiment error rates refer not to probabilities as such, but to the frequency of incorrect rejections of the null hypothesis in an experiment, over the long run of such experiments. Finally, the so-called experimentwise error rate is a probability, the probability that any one particular experiment has at least one incorrect conclusion. The various techniques that were developed have all largely concentrated on reducing, or eliminating, the effect of the latter. The exception seems to be Duncan, who attempted to introduce a test based on the error rate per independent comparison. Ryan suggests that this special procedure seems unnecessary and Scheffe (1959), a brilliant mathematical statistician, is unable to understand its justification.
Debates will continue and, meanwhile, the packages provide us with all the methods for a key press or two.

For the purposes of the present discussion, the examination of multiple comparison procedures provides a case history for the state of contemporary statistics. First, it is an example, to match all examples, of the quest for rules for decision making and statistical inference that lie outside the structure and concept of the experiment itself. Ryan argues "that comparisons decided upon a priori from some psychological theory should not affect the nature of the significance tests employed for multiple comparisons" (Ryan, 1959, p. 33). Fisher believed that research should be theory driven and that its results were always open to revision.
"Multiple comparison procedures" could easily replace, with only some
198 13. ASSESSINGDIFFERENCESAND HAVING CONFIDENCE

slight modification,the subjectof ANOVA in the following quotation:

The quick initial successof ANOVA in psychologycan beattributedto the unattrac-


tiveness of the then available methods of analysing large experiments, combined
with the appealof Fisher'swork which seemedto match, witha remarkabledegree
of exactness,the intellectual ethosof experimental psychologyof the period, with
its atheoreticalandsituationalistnatureand itswish for moreextensiveexperiments.
(Lovie, 1979,p. 175)

Fisher would have deplored this; indeed, he did deplore it. Second, it reflects the emphasis, in today's work, on the avoidance of the Type I error. Fisher would have had mixed feelings about this. On the one hand, he rejected the notion of "errors of the second kind" (to be discussed in the next chapter); only rejection or acceptance of the null hypothesis enters into his scheme of things. On the other, he would have been dismayed - indeed, he was dismayed - by a concentration on automatic acceptance or rejection of the null hypothesis as the final arbiter in assessing the outcome of an experiment.

Criticizing the ideologies of both Russia and the United States, where he felt such technological approaches were evident, he says:

How far, within such a system [Russia], personal and individual inferences from observed facts are permissible we do not know, but it may perhaps be safer ... to conceal rather than to advertise the selfish and perhaps heretical aim of understanding for oneself the scientific situation. In the U.S. also the great importance of organized technology has I think made it easy to confuse the process appropriate for drawing correct conclusions with those aimed rather at, let us say, speeding production, or saving money. (Fisher, 1955, p. 70)

Third, the multiple comparison procedure debate reflects, as has been noted, the increasingly mathematical approach to applied statistics. In the United States, a Statistical Computing Center at the State College of Agriculture at Ames, Iowa (now Iowa State University), became the first center of its kind. It was headed by George W. Snedecor, a mathematician, who suggested that Fisher be invited to lecture there during the summer session of 1931. Lush (1972) reports that academic policy at that institution was such that graduate courses in statistics were administered by the Department of Mathematics. At Berkeley, the mathematics department headed by Griffith C. Evans, who went there in 1934, was to be instrumental in making that institution a world center in statistics. Fisher visited there in the late summer of 1936 but made a very poor personal impression. Jerzy Neyman was to join the department in 1939. And, of course, Annals of Mathematical Statistics was founded at the University of Michigan in 1930. On more than one occasion during those years at the end of the 1930s, Fisher contemplated moving to the United States, and one may wonder what his influence would have been on statistical developments had he become part of that milieu.

And, finally, the technique of multiple comparisons establishes, without a backward glance, a system of statistics that is based unequivocally on a long-run relative frequency definition of probability where subjective, a priori notions at best run in parallel with the planning of an experiment, for they certainly do not affect, in the statistical context, the real decisions.

CONFIDENCE INTERVALS AND SIGNIFICANCE TESTS

Inside the laboratory, at the researcher's terminal, as the outcome of the job reveals itself, and most certainly within the pages of the journals, success means statistical significance. A very great many reviews and commentaries, some of which have been brought together by Henkel and Morrison (1970), deplore the concentration on the Type I error rate, the α level, as it is known, that this implies. To a lesser extent, but gaining in strength, is the plea for an alternative approach to the reporting of statistical outcomes, namely, the examination of confidence intervals. And, to an even lesser extent, judging from the journals, is the challenge that the statistical outcomes made in the assessment of differences should be translated into the reporting of strengths of effects. Here the wheel is turning full circle, back to the appreciation of experimental results in terms of a correlational analysis.²

² Of interest here, however, is the increasing use, and the increasing power, of regression models in the analysis of data; see, for example, Fox (1984).

Fisher himself seemed to believe that the notion of statistical significance was more or less self-evident. Even in the last edition of Statistical Methods, the words null and hypothesis do not appear in the index, and significance and tests of significance, meaning of, have one entry, which refers to the following:

From a limited experience, for example, of individuals of a species, ... we may obtain some idea of the infinite hypothetical population from which our sample has been drawn, and so of the probable nature of future samples.... If a second sample belies this expectation we infer that it is, in the language of statistics, drawn from a second population; that the treatment ... did in fact make a material difference.... Critical tests of this kind may be called tests of significance, and when such tests are available we may discover whether a second sample is or is not significantly different from the first. (Fisher, 1925/1970, 14th ed., p. 41)

A few pages later, Fisher does explain the use of the tail area of the probability interval and notes that the p = .05 level is the "convenient" limit for judging significance. He does this in the context of examples of how often deviations of a particular size occur in a given number of trials - that twice the standard deviation is exceeded about once in 22 trials, and so on. There is little wonder that researchers interpreted significance as an extension of the proportion of outcomes in a long-run repetitive process, an interpretation to which Fisher objected! In The Design of Experiments, he says:

In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result. (Fisher, 1935/1966, 8th ed., p. 14)

Here Fisher certainly seems to be advocating "rules of procedure," again a situation which elsewhere he condemns. Of more interest is the notion that experiments might be repeated to see if they fail to give significant results. This seems to be a very curious procedure, for surely experiments, if they are to be repeated, are repeated to find support for an assertion. The problems that these statements cause are based on the fact that the null hypothesis is a statement that is the negation of the effect that the experiment is trying to demonstrate, and that it is this hypothesis that is subjected to statistical test. The Neyman-Pearson approach (discussed in chapter 15) was an attempt to overcome these problems, but it was an approach that again Fisher condemned.

It appears that Fisher is responsible for the first formal statement of the .05 level as the criterion for judging significance, but the convention predates his work (Cowles & Davis, 1982a). Earlier statements about the improbability of statistical outcomes were made by Pearson in his 1900(a) paper, and "Student" (1908a) judged that three times the probable error in the normal curve would be considered significant. Wood and Stratton (1910) recommend "taking 30 to 1 as the lowest odds which can be accepted as giving practical certainty that a difference in a given direction is significant" (p. 433).
Fisher Box mentions that Fisher took a courseat Cambridgeon thetheory
of errors from Stratton duringthe academic year1912-1913.
Odds of 30 to 1representa little more than three timesthe probable error
(P.E.) referredto the normal probability curve. Because the probable erroris
equivalentto a little more than two-thirds of a standard deviation, three P.E.sis
almost two standard deviations, and, of course, reference to any table of the
"areasunderthe normal curve" shows that a z scoreof 1.96 cutsoff 5% in the
two tails of the distribution. With somelittle allowancefor rounding,the .05
probability levelis seento have enjoyedacceptancesome time before Fisher's
prescription.
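The arithmetic is easily verified. The short Python sketch below is my own
illustration (only the figures, not the code, come from the sources cited); it
reproduces the correspondence between probable errors, odds of 30 to 1, and
the 5% point:

    from scipy.stats import norm

    pe = norm.ppf(0.75)            # one probable error = 0.6745 standard deviations
    print(3 * pe)                  # three P.E.s: about 2.02, almost two SDs

    p = 1 / 31                     # odds of 30 to 1 against a chance result
    print(norm.isf(p / 2) / pe)    # the two-tailed deviate: a little over 3 P.E.s

    print(norm.isf(0.05 / 2))      # and z = 1.96 cuts off 5% in the two tails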
A test of significance is a test of the probability of a statistical outcome under
the hypothesis of chance. In post-Fisherian analyses the probability is that of
making an error in rejecting the null hypothesis, the so-called Type I error. It
is, however, not uncommon to read of the null hypothesis being rejected at the
"5% level of confidence," an odd inversion that endows the act of rejection with
a sort of statement of belief about the outcome. Interpretations of this kind
reflect, unfortunately in the worst possible way, the notions of significance tests
in the Fisherian sense and confidence intervals introduced in the 1930s by Jerzy
Neyman. Neyman (1941) states that the theory of confidence intervals was
established to give frequency interpretations of problems of estimation. The
classical frequency interpretation is best understood in the context of the
long-run relative frequency of the outcomes in, say, the rolling of a die. Actual
relative frequencies in a finite run are taken to be more or less equal to the
probabilities, and in the "infinite" run are equal to the probabilities. In his 1937
paper, Neyman considers a system of random variables x1, x2, . . . xn desig-
nated E, and a probability law p(E | θ1, θ2, . . . θl), where θ1, θ2, . . . θl are un-
known parameters. The problem is to establish:

single-valued functions of the x's, θ̲(E) and θ̄(E), having the property that, whatever
the values of the θ's, say θ′1, θ′2, . . . θ′l, the probability of θ̲(E) falling short of θ′1
and at the same time of θ̄(E) exceeding θ′1 is equal to a number α fixed in advance
so that 0 < α < 1.

It is essential to notice that in this problem the probability refers to the values of
θ̲(E) and θ̄(E) which, being single-valued functions of the x's, are random variables.
θ′1 being a constant, the left-hand side of [the above] does not represent the
probability of θ′1 falling within some fixed limits. (Neyman, 1937, p. 379)

The values θ̲(E) and θ̄(E) represent the confidence limits for θ′1 and span
the confidence interval for the confidence coefficient α. Care must be taken here
not to confuse this α with the symbol for the Type I error rate; in fact this α is
1 minus the Type I error rate. The last sentence in the quotation from Neyman
just given is very important. First, an example of a statement of the confidence
interval in more familiar terms is, perhaps, in order. Suppose that measurements
on a particular fairly large random sample have produced a mean of 100 and
that the standard error of this mean has been calculated to be 3. The routine
method of establishing the upper and lower limits of the 90% confidence interval
would be to compute 100 ± 1.65(3). What has been established? The textbooks
will commonly say that the probability that the population mean, μ, falls within
this interval is 90%, which is precisely what Neyman says is not the case.
For Neyman, the confidence limits represent the solution to the statistical
problem of estimating θ1 independent of a priori probabilities. What Neyman
is saying is that, over the long run, confidence intervals, calculated in this way,
will contain the parameter 90% of the time. It is not just being pedantic to insist
that, in Neyman's terms, to say that the one interval actually calculated contains
the parameter 90% of the time is mistaken. Nevertheless, that is the way in
which confidence intervals have sometimes come to be interpreted and used.
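The distinction is easy to demonstrate. The following minimal simulation is an
illustration of my own, with assumed values μ = 50, σ = 9, and n = 9 (so that
the standard error is 3); it shows that it is the procedure, not any single
interval, that has the 90% property:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, n, trials = 50.0, 9.0, 9, 10_000
    covered = 0
    for _ in range(trials):
        sample = rng.normal(mu, sigma, n)
        m, se = sample.mean(), sigma / np.sqrt(n)
        if m - 1.645 * se <= mu <= m + 1.645 * se:   # the routine 90% interval
            covered += 1
    print(covered / trials)   # close to 0.90

Over the long run about 90% of the intervals so computed contain μ; any one
interval either contains it or does not.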
A number of writers have vigorously propounded the benefits of confidence
intervals as opposed to significance testing:

Whenever possible, the basic statistical report should be in the form of a confidence
interval. Briefly, a confidence interval is a subset of the alternative hypotheses
computed from the experimental data in such a way that for a selected confidence
level α, the probability that the true hypothesis is included in a set so obtained is
α. Typically, an α-level confidence interval consists of those hypotheses under
which the p value for the experimental outcome is larger than 1 - α. . . . Confidence
intervals are the closest we can at present come to quantitative assessment of
hypothesis-probabilities . . . and are currently our most effective way to eliminate
hypotheses from practical consideration - if we choose to act as though none of the
hypotheses not included in a 95% confidence interval are correct, we stand only a
5% chance of error. (Rozeboom, 1960, p. 426)

Both Ronald Fisher and Jerzy Neyman would have been very unhappy with
this advice! It does, however, reflect once again the way in which researchers
in the psychological sciences prescribe and propound rules that they believe will
lead to acceptance of the findings of research. Rozeboom's paper is a thoughtful
attempt to provide alternatives to the routine null hypothesis significance test
and deals with the important aspect of degree of belief in an outcome.

One final point on confidence interval theory: it is apparent that some early
commentators (e.g., E. S. Pearson, 1939b; Welch, 1939) believed that Fisher's
"fiducial theory" and Neyman's confidence interval theory were closely related.
Neyman himself (1934) felt that his work was an extension of that of Fisher.
Fisher objected strongly to the notion that there was anything at all confusing
about fiducial distributions or probabilities and denied any relationship to the
theory of confidence intervals, which, he maintained, was itself inconsistent. In
1941 Neyman attempted to show that there is no relationship between the two
theories, and here he did not pull his punches:

The present author is inclined to think that the literature on the theory of fiducial
argument was born out of ideas similar to those underlying the theory of confidence
intervals. These ideas, however, seem to have been too vague to crystallize into a
mathematical theory. Instead they resulted in misconceptions of "fiducial prob-
ability" and "fiducial distribution of a parameter" which seem to involve intrinsic
inconsistencies. . . . In this light, the theory of fiducial inference is simply non-existent
in the same sense as, for example, a theory of numbers defined by mutually
contradictory definitions. (Neyman, 1941, p. 149)

To the confused onlooker, Neyman does seem to have been clarifying one
aspect of Fisher's approach, and perhaps for a brief moment of time there was
a hint of a rapprochement. Had it happened, there is reason to believe that
unequivocal statements from these men would have been of overriding impor-
tance in subsequent applications of statistical techniques. In fact, their quarrels
left the job to their interpreters. Debate and disagreement would, of course,
have continued, but those who like to feel safe could have turned to the
orthodoxy of the masters, a notion that is not without its attraction.

A NOTE ON "ONE-TAIL" AND "TWO-TAIL" TESTS

In the early 1950s, mainly in the pages of Psychological Bulletin and the
Psychological Review - then, as now, immensely important and influential
journals - a debate took place on the utility and desirability of one-tail versus
two-tail tests (Burke, 1953; Hick, 1952; Jones, 1952, 1954; Marks, 1951, 1953).
It had been stated that when an experimental hypothesis had a directional
component - that is, not merely that a parameter μ1 differed significantly from
a second parameter μ2, but that, for example, μ1 > μ2 - then the researcher was
permitted to use the area cut off in only one tail of the probability distribution
when the test of significance was applied. Referred to the normal distribution,
this means that the critical value becomes 1.65 rather than 1.96. It was argued
that because most assertions that appealed to theory were directional - for
example, that spaced practice was better than massed practice in learning, or that
extraverts conditioned poorly whereas introverts conditioned well - the actual
statistical test should take into account these one-sided alternatives. Arguments
against the use of one-tailed tests primarily centered on what the researcher does
when a very large difference is obtained, but in the unexpected direction. The
temptation to "cheat" is overwhelming! It was also argued that such data ought
not to be treated with the same reaction as a zero difference on scientific
grounds: "It is to be doubted whether experimental psychology, in its present
state, can afford such lofty indifference toward experimental surprises" (Burke,
1953, p. 385).
Many workers were concerned that a move toward one-tail tests represented
a loosening or a lowering of conventional standards, a sort of reprehensible
breaking of the rules, and pious pronouncements about scientific conservatism
abound. Predictably, the debate led to attempts to establish the rules for the use
of one-tail tests (Kimmel, 1957).
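The critical values at issue are quickly recovered from the normal probability
integral; a two-line Python check (mine, for illustration) gives the figures
quoted above:

    from scipy.stats import norm

    print(norm.isf(0.05))        # one-tail critical value: about 1.645
    print(norm.isf(0.05 / 2))    # two-tail critical value: about 1.960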
What is found in these discussions is the implicit assumption that formally
stated alternative hypotheses are an integral part of statistical analysis. What is
not found in these discussions is any reference to the logic of using a probability
distribution for the assessment of experimental data. Put baldly and simply,
using a one-tail test means that the researcher is using only half the probability
distribution, and it is inconceivable that this procedure would have been
acceptable to any of the founding fathers. The debate is yet another example of
the eagerness of practical researchers to codify the methods of data assessment
so that statistical significance has the maximum opportunity to reveal itself, but
in the presence of rules that discouraged, if not entirely eliminated, "fudging."

Significance levels, or p levels, are routinely accepted as part of the inter-
pretation of statistical outcomes. The statistic that is obtained is examined with
respect to a hypothesized distribution of the statistic, a distribution that can be
completely specified. What is not so readily appreciated is the notion of an
alternative model. This matter is examined in the final chapter. For now, a
summary of the process of significance testing given in one of the weightier and
more thoughtful texts might be helpful.

1. Specification of a hypothesized class of models and an alternative class of models.
2. Choice of a function of the observations T.
3. Evaluation of the significance level, i.e., SL = P(T > t), where t is the observed
value of T and where the probability is calculated for the hypothesized class of
models.

In most applied writings the significance level is designated by P, a custom which
has engendered a vast amount of confusion.
It is quite common to refer to the hypothesized class of models as the null
hypothesis and to the alternative class of models as the alternative hypothesis. We
shall omit the adjective "null" because it may be misleading. (Kempthorne & Folks,
1971, pp. 314-315)
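Kempthorne and Folks' three steps can be put directly into code. The sketch
below is a toy example of my own, not theirs: T is taken to be a difference
between group means, and SL = P(T > t) is estimated under a hypothesized
class of models in which the group labels are exchangeable:

    import numpy as np

    def significance_level(x, y, draws=10_000, seed=0):
        rng = np.random.default_rng(seed)
        t_obs = x.mean() - y.mean()                  # step 2: the function T
        pooled = np.concatenate([x, y])
        count = 0
        for _ in range(draws):                       # step 3: P(T > t) under the
            rng.shuffle(pooled)                      # hypothesized class of models
            t = pooled[:len(x)].mean() - pooled[len(x):].mean()
            count += t >= t_obs
        return count / draws

    x = np.array([5.1, 6.0, 5.8, 6.3])
    y = np.array([4.9, 5.2, 5.0, 5.5])
    print(significance_level(x, y))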
14
Treatments and Effects:
The Rise of ANOVA

THE BEGINNINGS

In the last chapter we considered the development of analysis of variance,
ANOVA. Here the incorporation of this statistical technique into psychological
methodology is examined in a little more detail.

The emergence of ANOVA as the most favored method of data appraisal in
psychology from the late 1930s to the 1960s represents a most interesting mix
of forces. The first was the continuing need for psychology to be "respectable"
in the scientific sense and therefore to seek out mathematical methods. The
correlational techniques that were the norm in the 1920s and early 1930s were
not enough, nor did they lend themselves to situations where experimental
variables might be manipulated. The second was the growth of the commenta-
tors and textbook writers who interpreted Fisher with a minimum need of
mathematics. From a trickle to a flood, the recipe books poured out over a
period of about 25 years and the flow continues unabated. And the third is the
emergence of explanations of the models of ANOVA using the concept of
expected mean squares. These explanations, which did not avoid mathematics,
but which were at a level that required little more than high school math, opened
the way for more sophisticated procedures to be both understood and applied.

THE EXPERIMENTAL TEXTS

It is clear, however, that the enthusiasm that was mounting for the new methods
is not reflected in the textbooks of experimental psychology that were published
in those years. Considering four of the books that were classics in their own
time shows that their authors largely avoided or ignored the impact of Fisher's
statistics. Osgood's Method and Theory in Experimental Psychology (1953)
mentions neither Fisher nor ANOVA, neither Pearson nor correlation. Wood-
worth and Schlosberg's Experimental Psychology (1954) ignores Pearson by
name, and the authors state, "To go thoroughly into correlational analysis lies
beyond the scope of this book" (p. 39). The writers, however, show some
recognition of the changing times, admitting that the "old standard 'rule of one
variable'" (p. 2) does not mean that no more than one factor may be varied. The
experimental design must allow, however, for the effects of single variables to
be assessed and also the effect of possible interactions. Fisher's Design of
Experiments is then cited, and Underwood's (1949) book is referenced and
recommended as a source of simple experimental designs. And that is it!
Hilgard's chapter in Stevens' Handbook of Experimental Psychology (1951)
not only gives a brief description of ANOVA in the context of the logic of the
control group and factorial design but also comments on the utility of matched
groups designs, but other commentaries, explanations, and discussions of
ANOVA methods do not appear in the book. Stevens himself recognizes the
utility of correlational techniques but admits that his colleague Frederick
Mosteller had to convince him that his claim was overly conservative and that:
"rank order correlation does not apply to ordinal scales because the derivation
of the formula for this correlation involves the assumption that the differences
between successive ranks are equal" (p. 27). An astonishing claim.
Underwood's (1949) book is a little more encouraging, in that he does
recognize the importance of statistics. His preface clearly states how important
it is to deal with both method and content. His own teaching of experimental
psychology required that statistics be an integral part of it: "I believe that the
factual subject matter can be comprehended readily without a statistical knowl-
edge, but a full appreciation of experimental design problems requires some
statistical thinking" (p. v).

But, in general, it is fair to say that many psychological researchers were not
in tune with the statistical methods that were appearing. "Statistics" seems to
have been seen as a necessary evil! Indeed, there is more than a hint of the same
mindset in today's texts. An informal survey of introductory textbooks published
in the last 10 years shows a depressingly high incidence of statistics being
relegated to an appendix and of sometimes shrill claims that they can be
understood without recourse to mathematics. In using statistics such as the
t ratio and more comprehensive analyses such as ANOVA, the necessity of
randomization is always emphasized. Researchers in the social and educational
areas of psychology realized that such a requirement when it came to assigning
participants to "treatments" was just not possible. Levels of ability, socioeco-
nomic groupings, age, sex, and so on cannot be directly manipulated. When
methodologists such as Campbell and Stanley (1963), authorities in the classi-
fication and comparison of experimental designs, showed that meaningful
analyses could be achieved through the use of "found groups" and what is often
called quasi-experimental designs, the potential of ANOVA techniques wid-
ened considerably. The argument was, and is, advanced that unless the experi-
menter can control for all relevant variables, alternative explanations for the
results other than the influence of the independent variable can be found - the
so-called correlated biases. The skeptic might argue that, in addition to the fact
that statistical appraisals are by definition probabilistic and therefore uncertain,
direct manipulation of the independent variable does not assuredly guard against
mistaken attribution of its effects.
And, very importantly, the commentaries, such as they were, on experimen-
tal methods have to be seen in the light of the dominant force, in American
psychology at least, of behaviorism. For example, Brennan's textbook History
and Systems of Psychology (1994) devotes two chapters and more than 40 pages
out of about 100 pages to Twentieth-Century Systems (omitting Eastern Tradi-
tions, Contemporary Trends and the Third Force Movement). This is not to be
taken as a criticism of Brennan: indeed, his book has run to several editions and
is an excellent and readable text.1 The point is that from Watson's (1913) paper
until well on into the 1960s, experimental psychology was, for many, the
experimental analysis of behavior - latterly within a Skinnerian framework.
Sidman's (1960) book Tactics of Scientific Research: Evaluating Experimental
Data in Psychology is most certainly not about statistical analysis!

Science is presumably dedicated to stamping out ignorance, but statistical evaluation
of data against a baseline whose characteristics are determined by unknown variables
constitutes a passive acceptance of ignorance. This is a curious negation of the
professed aims of science. More consistent with those aims is the evaluation of data
by means of experimental control. (p. 45)

In general, the experimental psychologists of this ilk eschewed statistical
approaches apart from descriptive means, frequency counts, and straightfor-
ward assessments of variability.

THE JOURNALS AND THE PAPERS

Rucci and Tweney (1980) have carried out a comprehensive analysis of the use
of ANOVA from 1925 to 1950, concentrating mainly, but not exclusively, on
American publications. They identify the earliest research to use ANOVA as a
paper by Reitz (1934). The author checked for homogeneity of variance, gave
the method for computing z, and noted its relationship with η2.

1 Perhaps one might be permitted to ask Brennan to include a chapter on the history and
influence of statistics!
An early paper that uses a factorial design is that of Baxter (1940). The author,
at the University of Minnesota, acknowledges the statistical help given to him
by Palmer Johnson, and he also credits Crutchfield (1938) with the first
application of a factorial design to a psychological study. Baxter states that his
aim is "an attempt to show how factorial design, as discussed by Fisher and
more specifically by Yates . . . , has been applied to a study of reaction time"
(p. 494). The proposed study presented for illustration deals with reaction time
and examined three factors: the hand used by the participant, the sensory
modality (auditory or visual), and discrimination (a single stimulus, two stimuli,
three stimuli). Baxter explains how the treatment combinations could be
arranged and shows how partial confounding (which results in some possible
interactions becoming untestable) is used to reduce subject fatigue. The paper
is altogether a remarkably clear account of the structure of a factorial design.
And Baxter followed through by reporting (1942) on the outcome of a study
which used this design.
Rucci and Tweney examined 6,457 papers in six American psychological
journals. They find that from 1935 to 1952 there is a steady rise in the use of
ANOVA, which is paralleled by a rise in the use of the t ratio. The rise became
a fall during the war years, and the rise became steeper after the war. It is
suggested that the younger researchers, who would have become acquainted
with the new techniques in their pre-war graduate school training, would have
been eligible for military service, and the "old-timers" who were left used older
procedures, such as the critical ratio. It is not the case that the use of correlational
techniques diminished. Rucci and Tweney's analysis indicates that the percent-
age of articles using such methods remained fairly steady throughout the period
examined. Their conclusion is that ANOVA filled the void in experimental
psychology but did not displace Cronbach's (see chapter 2, p. 35) other disci-
pline of scientific psychology. Nevertheless, the use of ANOVA was not
established until quite late and only surpassed its pre-war use in 1950.

Overall, Rucci and Tweney's analysis leads, as they themselves state, to the
view that "ANOVA was incorporated into psychology in logical and orderly
steps" (p. 179), and their introduction avers that "It took less than 15 years for
psychology to incorporate ANOVA" (p. 166). They do not claim - indeed they
specifically deny - that the introduction of these techniques constitutes a
paradigm shift in the Kuhnian sense.
An examination of the use of ANOVA from 1934 to 1945 by a British
historian of science, Lovie (1979), gives a somewhat different view from that
offered by Rucci and Tweney, who, unfortunately, do not cite this work, for it
is an altogether more insightful appraisal, which comes to somewhat different
conclusions:

The work demonstrates that incorporating the technique [ANOVA] into psychology
was a long and painful process. This was due more to the substantive implications
of the method than to the purely practical matters of increased arithmetical labour
and novelty of language. (p. 151)

Lovie also shows that ANOVA can be regarded, as he puts it, as "one of the
midwives of contemporary experimental psychology" (p. 152). A decidedly
non-Kuhnian characterization, but close enough to an indication of a paradigm
shift! Three case studies described by Lovie show the shift, over a period from
1926 to 1932, from informal, as it were, "let-me-show-you" analysis through a
mixture of informal and statistical analysis to a wholly statistical analysis. It was
the design of the experiment that occupied the experimenter as far as method
was concerned, and the meaning of the results could be considered to be "a
scientifically acceptable form of consensual bargaining between the author and
the reader on the basis of non-statistical common-sense statements about the
data" (p. 155).

When ANOVA was adopted, the old conceptualizations of the interpreta-
tions of data died hard, and the results brought forth by the method were used
selectively and as additional support for outcomes that the old verbal methods
might well have uncovered. Lovie's valuable account and his examination of
the early papers provides detailed support for the reasons why many of the first
applications of ANOVA were trivial and why more enlightened and effective
uses were slow to emerge in any quantity.
A series of papers by Carrington (1934, 1936, 1937) in the Proceedings of
the Society for Psychical Research report on results given by one-way ANOVA.
The papers deal with the quantitative analysis of trance states and show that the
pioneers in systematic research in parapsychology were among the earliest users
of the new methods. Indeed, Beloff (1993) makes this interesting point:

Statistics, after all, have always been a more critical aspect of experimental para-
psychology than they have been for psychophysics or experimental psychology
precisely because the results were more likely to be challenged. There is, indeed,
some evidence that parapsychology acted as a spur to statisticians in developing their
own discipline and in elaborating the concept of randomness. (p. 126)

It is worth noting that some of the most well-known of the early rigorous
experimental psychologists, for example, Gardner Murphy and William
McDougall, were interested in the paranormal, and it was the latter who, leaving
Harvard for Duke University, recruited J. B. Rhine, who helped to found, and
later directed, that institution's parapsychology laboratory.

Of course, there were several psychological and educational researchers who
vigorously supported the "new" methods. Garrett and Zubin (1943) observe that
ANOVA had not been used widely in psychological research. These authors
make the point that:

Even as recently as 1940, the author of a textbook [Lindquist] in which Fisher's
methods are applied to problems in educational research, found it necessary to use
artificial data in several instances because of the lack of experimental material in the
field. (p. 233)

These authors also offer the opinion that the belief that the methods deal
with small samples had influenced their acceptance - a belief, nay, a fact, that
has led others to point to it as a reason why they were welcomed! They also
worried about "the language of the farm," "soil fertility, weights of pigs,
effectiveness of manurial treatments and the like" (p. 233). Hearnshaw, the
eminent historian of psychology, also mentions the same difficulty as a reason
for his view that "Fisher's methods were slow to percolate into psychology both
in this country [Britain] and in America" (1964, p. 226). Although it is the case
that agricultural examples are cited in Fisher's books, and phrases that include
the words "blocks," "plots," and "split-plots" still seem a little odd to psycho-
logical researchers, it is not true to say that the world of agriculture permeates
all Fisher's discussions and explanations, as Lovie (1979) has pointed out.
Indeed, the early chapter in The Design of Experiments on the mathematics of
a lady tasting tea is clearly the story of an experiment in perceptual ability, which
is about as psychological as one could get.
It must be mentioned that Grant (1944) produced a detailed criticism of
Garrett and Zubin's paper, in particular taking them to task for citing examples
of reports where ANOVA is wrongly used or inappropriately applied. From the
standpoint of the basis of the method, Grant states that in the Garrett and Zubin
piece:

The impression is given . . . that the primary purpose of analysis of variance is to
divide the total variance into two or more components (the mean squares) which are
to be interpreted directly as the respective contributions to the total variance made
by the experimental variables and experimental error. (pp. 158-159)

He goes on to say that the purpose of ANOVA is to test the significance of
variation and, while agreeing that it is possible to estimate the proportional
contribution of the various components, "the process of estimation must be
clearly differentiated from the test of significance" (p. 159). These are valid
criticisms, but it must be said that such confusion as some commentaries may
bring reflects Fisher's (1934) own assertion, in remarks on a paper given by
Wishart (1934) at a meeting of the Royal Statistical Society, that ANOVA "is
not a mathematical theorem, but rather a convenient method of arranging the
arithmetic" (p. 52). In addition, the estimation of treatment effects is considered
by many researchers to be critical (and see chapter 7, p. 83).


What is most telling about Grant's remarks, made more than 50 years ago,
is that they illustrate a criticism of data appraisal using statistics that persists to
this day, that is, the notion that the primary aim of the exercise is to obtain
significant results and that reporting effect size is not a routine procedure. Also
of interest is Grant's inclusion of expressions for the expected values of the mean
squares, of which more later.
A detailed appraisal of the statistical content of the leading British journal
has not been carried out, but a review of that publication over 10-year periods
shows the rise in the use of statistical techniques of increasing variety and
sophistication. The progression is not dissimilar from that observed by Rucci
and Tweney. The first issue of the British Journal of Psychology appeared in
1904 under the distinguished co-editorships of James Ward (1843-1925) and
W. H. R. Rivers (1864-1922). Some papers in that volume made use of such
statistics as were generally available: averages, medians, mean variation, stand-
ard deviation, and the coefficient of variation. Statistics were there at the start.

By 1929-1930, correlational techniques appeared fairly regularly in the
journal. Volume 30, in 1940, contained 29 papers and included the use of χ2,
the mean, the probable error, and factor analysis. There were no reports that
used ANOVA, and nor did any of the 14 papers published in the 1950 volume
(Vol. 41). But χ2 and the t ratio still found a place. Fitt and Rogers (1950), the
authors of the paper that used the t ratio for the difference between the means,
felt it necessary to include the formula and cited Lindquist (1940) as the source.
The researchers of the 1950s who were beginning to use ANOVA followed the
appearance of a significant F value with post hoc multiple t tests. The use of
multiple comparison procedures (chapter 13, p. 195) had not yet arrived on the
data analysis scene. Six of the 36 papers published in 1960 included ANOVA,
one of which was an ANOVA by ranks. Kendall's tau, the t ratio, correlation,
partial correlation, factor analysis, the Mann-Whitney U test - all were used.
Statistics had indeed arrived, and ANOVA was in the forefront.
A third of the papers (17 out of 52, to be more precise!) in the 1970 issue
made use of ANOVA in the appraisal of the data. Also present were Pearson
correlation coefficients, the Wilcoxon T, tau, Spearman's rank difference
correlation, the Mann-Whitney U, the phi coefficient, and the t ratio.

Educational researchers figured largely in the early papers that used
ANOVA, both as commentators on the method and in applying it to their own
research. Of course, many in the field were quite familiar with the correlational
techniques that had led to factor analysis and its controversies and mathematical
complexities. There had been more than 40 years of study and discussion of
skill and ability testing and the nature of intelligence - the raison d'être of factor
analysis - and this was central to the work of those in the field. They were ready,
even eager, to come to grips with methods that promised an effective way to
use the multi-factor approach in experiments, an approach that offered the
opportunity of testing competing theories more systematically than before.
And it was from the educational researcher's standpoint that Stanley (1966)
offered an appraisal of the influence of Fisher's work "thirty years later." He
makes the interesting point that although there were experimenters with "con-
siderable 'feel' for designing experiments, . . . perhaps some of them did not
have the technical sophistication to do the corresponding analyses." This
judgment is based on the fact that 5 of the 21 volunteered papers submitted to
the Division of Educational Psychology of the American Psychological Asso-
ciation annual convention in 1965 had to be returned to have more complex
analysis carried out. This may have been due to lack of knowledge, it may have
been due to uncertainty about the applicability of the techniques, but it also
reflects perhaps the lingering appeal of straightforward and even subjective
analysis that the researchers of 20 years before favored, as we noted earlier in
the discussion of Lovie's appraisal. It is also Stanley's view that "the impact of
the Fisherian revolution in the design and analysis of experiments came slowly
to psychology" (p. 224). It surely is a matter of opinion as to whether Rucci
and Tweney's count of less than 5 percent of papers in their chosen journals in
1940 to between 15 and 20 percent in 1952 is slow growth, particularly when
it is observed that the growth of the use of ANOVA varied considerably across
the journals.2 This leads to the possibility that the editorial policy and practice
of their chosen journals could have influenced Rucci and Tweney's count.3 It
is also surely the case that if a new technique is worth anything at all, then the
growth of its use will show as a positively accelerated growth curve, because
as more people take it up, journal editors have a wider and deeper pool of
potential referees who can give knowledgeable opinions. Rucci and Tweney's
remark, quoted earlier, on the length of time it took for ANOVA to become part
of experimental psychology, has a tone that suggests that they do not believe
that the acceptance can be regarded as "slow."

2 Stanley observes that the Journal of General Psychology was late to include ANOVA and it did
not appear at all in the 1948 volume.

3 The author, many years ago, had a paper sent back with a quite kind note indicating that the
journal generally did not publish "correlational studies." Younger researchers soon learn which journals
are likely to be receptive to their approaches!

THE STATISTICAL TEXTS

The first textbooks that were written for behavioral scientists began to appear
in the late 1940s and early 1950s, although it must be noted that the much-cited
text by Lindquist was published in 1940. The impact of the first edition of this
book may have been lessened, however, by a great many errors, both typo-
graphic and computational. Quinn McNemar (1940b), in his review, observes
that the volume, "in the opinion of the reviewer and student readers, suffers from
an overdose of wordiness" (p. 746). He also notes that there are several major
slips, although acknowledging the difficulty in explaining several of the topics
covered, and declares that the book "should be particularly useful to all who are
interested in obtaining non-mathematical knowledge of the variance technique"
(p. 748).
Snedecor, an agricultural statistician, published a much-used text in 1937,
and he is often given credit for making Fisher comprehensible (see chapter 13,
p. 194). Rucci and Tweney place George Snedecor of Iowa State College,
Harold Hotelling, an economist at Columbia, and Palmer Johnson of the
University of Minnesota, all of whom had spent time with Fisher himself, as the
founders of statistical training in the United States. But any informal survey of
psychologists who were undergraduates in the 1950s and early 1960s would
likely reveal that a large number of them were reared on the texts of Edwards
(1950) of the University of Washington, Guilford (1950) of the University of
Southern California, or McNemar (1949) of Stanford University. Guilford had
published a text in 1942, but it was the second edition of 1950 and the third of
1956 that became very popular in undergraduate teaching. In Canada a some-
what later popular text was Ferguson's (1959) Statistical Analysis in Psychology
and Education, and in the United Kingdom, Yule's An Introduction to the
Theory of Statistics, first published in 1911, had a 14th edition published (with
Kendall) in 1950 that includes ANOVA.

EXPECTED MEAN SQUARES

In his well-known book, Scheffé (1959) says:

The origin of the random effects models, like that of the fixed effects models, lies in
astronomical problems; statisticians re-invented random-effects models long after
they were introduced by astronomers and then developed more complicated ones.
(p. 221)

Scheffé cites his own 1956(a) paper, which gives some historical background
to the notion of expected mean squares - E(MS). The estimation of variance
components using E(MS) was taken up by Daniels (1939) and later by Crump
(1946, 1951). Eisenhart's (1947) influential paper on the assumptions underly-
ing ANOVA models also discusses them in this context. Anderson and Bancroft
(1952) published one of the earliest texts that examines mathematical expecta-
tion - the expected value of a random variable "over the long run" - and early
in this work they state the rules used to operate with expected values.


The concept is easily understood in the context of a lottery. Suppose you
buy a $1.00 ticket for a draw in which 500 tickets are to be sold. The first and
only prize is $400.00. Your "chances" of winning are 1 in 500 and of losing 499
in 500. If Y is the random variable,

When Y = -$1.00, the probability of Y, i.e., p(Y) = 499/500.
When Y = $399.00 (you won!), the probability p(Y) = 1/500.

The expectation of gain over the long run is:

E(Y) = ΣYp(Y) = (-1)(499/500) + (399)(1/500) = -0.998 + 0.798 = -0.20

If you joined in the draw repeatedly you would "in the long run" lose 20 cents
per ticket.
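The same computation in Python, should the reader care to check it (the code
is mine; the figures are those of the lottery above):

    outcomes = [-1.00, 399.00]            # lose your dollar, or win $400 less $1
    probs = [499 / 500, 1 / 500]
    e_y = sum(y * p for y, p in zip(outcomes, probs))
    print(round(e_y, 2))                  # -0.2: a 20-cent loss in the long run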
In ANOVA we are concerned with the expected values of the various
components - the mean squares - over the long run.

Although largely ignored by the elementary texts, the slightly more ad-
vanced approaches do treat their discussions and explanations of ANOVA's
model from the standpoint of E(MS). Cornfield and Tukey (1956) examine the
statistical basis of E(MS), but Gaito (1960) claims that "there has been no
systematic presentation of this approach in a psychological text or journal which
will reach most psychologists" (p. 3), and his paper aims to rectify this. The
approach, which Gaito rightly refers to as "illuminating," offers some mathe-
matical insight into the basis of ANOVA and frees researchers from the "recipe
book" and intuitive methods that are often used. Eisenhart makes the point that
these methods have been worthwhile and that readers of the books for the
"non-mathematical" researcher have achieved sound and quite complex analy-
ses in using them. But these early commentators saw the need to go further.
Gaito's short paper offers a clear statement of the general E(MS) model, and his
contention that this approach brings psychology a tool for tackling many
statistical problems has been borne out by the fact that now all the well-known
intermediate texts that deal with ANOVA models use it.
This is not the place to detail the algebra of E(MS), but here is the statement
of the general case for a two-factor experiment, with n observations per cell:

E(MSA) = σe2 + n(1 - b/B)σαβ2 + nbσα2
E(MSB) = σe2 + n(1 - a/A)σαβ2 + naσβ2
E(MSAB) = σe2 + nσαβ2
E(MSerror) = σe2

In the fixed effects model - often referred to as Model I - inferences can be
made only about the treatments that have been applied. In the random effects
model the researcher may make inferences not only about the treatments
actually applied but about the range of possible levels. The treatments applied
are a random sample of the range of possible treatments that could have been
applied. Then inferences may be made about the effects of all levels from the
sample of factor levels. This model is often referred to as Model II. In the case
where there are two factors A and B, there are A potential levels of A but only
a levels are included in the experiment. When a = A, factor A is a fixed
factor. If the a levels included are a random sample of the potential A levels, then
factor A is a random factor. Similarly, for factor B there are B potential levels
of the factor and b levels are used in the experiment. When a = A, the ratio
a/A = 1 (and the coefficient 1 - a/A vanishes); when B is a random effect, the
levels sample size b is usually very, very much smaller than B, and b/B becomes
vanishingly small, as does n/N, where n is sample size and N is population size.
Obtaining the variance components allows us to see which components are
included in the model, be the factor fixed or random, and to generate the
appropriate F ratios to test the effects.

For fixed effects: E(MSA) = σe2 + nbσα2, E(MSB) = σe2 + naσβ2,
E(MSAB) = σe2 + nσαβ2, and the error term is σe2.

For random effects: E(MSA) = σe2 + nσαβ2 + nbσα2,
E(MSB) = σe2 + nσαβ2 + naσβ2, E(MSAB) = σe2 + nσαβ2, and the error term is σe2.

When A is fixed and B is random: E(MSA) = σe2 + nσαβ2 + nbσα2,
E(MSB) = σe2 + naσβ2, E(MSAB) = σe2 + nσαβ2, and the error term is σe2.

Given that the structure of the F ratio requires that its numerator consists of the
components for a given source and its denominator the same components save
the one associated with the source, we find, for example, that to test for A, the
F ratio for fixed effects is F = MSA/MSerror, whereas for random effects it is
F = MSA/MSAB.
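The rule just stated is mechanical enough to be coded. The sketch below is my
own illustration of the two-factor case, not taken from any of the texts
discussed: each E(MS) is recorded as the set of variance components it
contains, and the error term for a source is the mean square whose expectation
matches the numerator's components, less the one under test:

    # "e" is the error component; "A", "B", "AB" are the treatment components.
    def expected_components(model):
        if model == "fixed":
            return {"A": {"e", "A"}, "B": {"e", "B"}, "AB": {"e", "AB"}}
        if model == "random":
            return {"A": {"e", "AB", "A"}, "B": {"e", "AB", "B"}, "AB": {"e", "AB"}}
        # mixed: A fixed, B random
        return {"A": {"e", "AB", "A"}, "B": {"e", "B"}, "AB": {"e", "AB"}}

    def error_term(source, model):
        target = expected_components(model)[source] - {source}
        terms = dict(expected_components(model), error={"e"})
        return next(name for name, comps in terms.items() if comps == target)

    print(error_term("A", "fixed"))    # error -> F = MS_A / MS_error
    print(error_term("A", "random"))   # AB    -> F = MS_A / MS_AB
    print(error_term("A", "mixed"))    # AB    -> F = MS_A / MS_AB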

There is little doubt that development of this approach has enabled ANOVA to
be more closely appreciated by its practitioners.

Finally, it is only in recent years that there has been a full realization that
mathematically the ANOVA approach and the regression approach can be
brought together. Cronbach's delineation of the "two disciplines" (1957, and
see chapter 2), the neglect by the textbook writers of any meaningful account
of the general linear model, and the avoidance by the introductory text authors
of a discussion of expected mean squares have all contributed to a misunder-
standing of the unifying basis of statistical methods.
15
The Statistical Hotpot

TIMES OF CHANGE

The later years of the 1920s were watershed years for statistics. Karl Pearson
was approaching the end of his career (he retired in 1933), and some of the older
statisticians were not able to cope with the new order. The divisions had been,
and were still being, drawn. Yule retired from full-time teaching at Cambridge
in 1930, and, writing to Kendall after K.P.'s death in 1936, said, "I feel as though
the Karlovingian era has come to an end, and the Piscatorial era which succeeds
it is one in which I can play no part" (quoted by Kendall, 1952, p. 157).

Yule's text had not by then tackled the problems of small samples. The t
tests were not discussed until the 11th edition of 1937, a revision that was, in
fact, undertaken by Maurice Kendall. The changes that were taking place in
statistical methodology led to positions being adopted that were often based
more on personality and style and loyalties than rational argument on the basic
logic and utility of the approaches. Yates, who succeeded Fisher at Rothamsted
in 1933, was to write, in 1951, in a commentary on Statistical Methods for
Research Workers:

Because of the importance that correlation analysis had assumed it was natural that
the analysis of variance should be approached via correlation, but to those not trained
in the school of correlational analysis (of which I am fortunate to be able to count
myself one) this undoubtedly makes this part of the book more difficult to compre-
hend. (Yates, 1951, p. 24, emphasis added)

And Yates was a solid supporter of Fisher to the very end.

Fisher published his text in 1925, and in the same year his paper on the
applications of "Student's" distribution appeared. Egon Pearson and Jerzy
Neyman met that year. They were to introduce new features into statistics that
Fisher would vehemently oppose to the end of his life.

NEYMAN AND PEARSON

It is a curious fact that what most social scientists now take to be inferential
statistics is a mixture of procedures that, as presented in many of the current
statistical cookbooks,1 would be criticized by its innovators, Ronald A. Fisher,
Jerzy Neyman, and Egon Pearson. Fisher established the present-day paramount
importance of the rejection or acceptance of the null hypothesis in the determi-
nation of a decision on a statistical outcome - the hypothesis that the outcome
is due to chance. The new participants - they could not be described as partners
- in the union that became statistics argued for an appreciation of the probability
of an alternative hypothesis.

1 The phrase statistical cookbook has a pejorative ring, which is not wholly justified.
There are many excellent basic texts available, and there are many excellent cookery books.
Both are necessary to our well-being. The point is that conflicting recipes lead to statistical
and gastronomic confusion, and the fact that such conflicts exist has, by and large, been
ignored by consumers in the social sciences.

In 1925 Jerzy Neyman, on a Polish government fellowship and with a new
PhD from Warsaw, arrived at University College, London, to study statistics
with Karl Pearson. Neyman was born in Russia in 1894 and read mathematics
at the University of Kharkov. In 1921, he moved to the new Republic of Poland,
where he became an assistant at the University of Warsaw and where he worked
for the State Meteorological Institute. His initial months in London were
difficult, partly because of his struggles with English and partly because of
misunderstandings with "K.P.," but gradually he struck up a friendship that led
to a research collaboration with Egon Pearson. Neyman's professional progress
and his social and academic relationships have been recounted in a sympathetic
and revealing account by Constance Reid (1982).

The younger Pearson has been described by many as a somewhat diffident
and introverted man who was very much in the shadow of his father. Reid tells
us of his feelings:

Pearson had decided that if he was going to be a statistician he was going to have to
break with his father's ideas and construct his own statistical philosophy. In
retrospect, he describes what he wanted to do as "bridging the gap" between "Mark
I" statistics - a shorthand expression he uses for the statistics of K.P., which was
based on large samples obtained from natural populations - and the statistics of
Student and Fisher, which had treated small samples obtained in controlled experi-
ments - "Mark II statistics." (Reid, 1982, p. 60)

Pearson (1966) himself describes "the first steps." In 1924 papers by E. C.
Rhodes and by Karl Pearson (1924b) had explored the problem of choosing
between alternative tests for the significance of differences, tests that had the
same logical validity but that gave different levels of significance for the
outcome:

I set about exploring the multi-dimensional sample space and comparing what came
to be termed the rejection regions associated with alternative tests. Could one find
some general principle or principles appealing to intuition, which would guide one
in choosing between tests? (E. S. Pearson, 1966, p. 6)

Pearson had pondered the question: what, exactly, was the interpretation of
"Student's" test?

In large samples . . . the ratio t = (x̄ - μ)√n/s could be regarded as the best estimate
available of the desired ratio (x̄ - μ)√n/σ and, as such, referred to the normal
probability scale. If the sample was . . . small . . . a sample with a less divergent
mean, x̄1, might well provide a larger value of t than a second sample with a more
divergent mean, x̄2, simply because s1 in the first sample happened through sampling
fluctuations to be smaller than s2 in the second. To someone brought up with the
older point of view this seemed at first sight paradoxical. . . .
I realize that a reorientation of outlook must for me at any rate have been
necessary. It was a shift which I think K.P. was not able or never saw the need to
make. (E. S. Pearson, 1966, p. 6)

In 1926 Pearson put the problem to "Student," and the latter's reply shows,
once more, the fertility of his ideas. Just as they had aided Fisher's inspirations,
his comments now set the younger Pearson and later Neyman on the path that
led to the Neyman-Pearson theory. After noting what, of course, was widely
accepted, that with large samples one is able to find the chance that a given value
for the mean of the sample lies at any given distance from the mean of the
population, and that, even if the chance is very small, there is no proof that the
sample has not been randomly drawn, he says:

What it does is to show that if there is any alternative hypothesis which will explain
the occurrence of the sample with a more reasonable probability, say 0.05 (such as
that it belongs to a different population or that the sample wasn't random or whatever
will do the trick) you will be very much more inclined to consider that the original
hypothesis is not true. (quoted by Pearson, 1966, p. 7, emphasis added)

Pearson recalls that during the autumn of 1926, the problems of the specifi-
cation of the class of alternative hypotheses and their definition, the rejection
region in the sample space, and the two sources of error were discussed. At the
end of the year, Pearson was examining the likelihood ratio criterion as a way
of approaching the question as to whether the alternative, or what Fisher later
called the null, hypothesis was the more likely. Pearson always recognized that
he was a weak mathematician and that he needed help with the mathematical
formulation of the new ideas. He recalled to Reid (1982) that he approached
Neyman, the new post-doctoral student at Gower Street, perhaps because he
was "so 'fresh' to statistics" (p. 62) and because other possible collaborators
would all have preferences for either Mark I or Mark II statistics. Neyman was
not a member of either of the camps. He really knew very little of the statistical
work of the elder Pearson, nor that of "Student" or Fisher. But an immediate
and close collaboration was not possible. Although Egon Pearson and Neyman
had a good deal of social contact over the summer of 1926 and Pearson
remembered that they had touched on the new questions, Neyman was disen-
chanted with the Biometric Laboratory. It was not the frontier of mathematical
statistics that he had expected, and he determined to go to Paris (where his wife
was an art student) and press on with his work in probability and mathematics.

At the end of 1926, correspondence began between Neyman and Pearson,
and Pearson visited his colleague in Paris in the spring of 1927. Reid (1982)
reports that she gave copies of Neyman's letters to Pearson to Erich Lehmann,
an early student of Neyman's at Berkeley and an authority on the Neyman-
Pearson theory (Lehmann, 1959), for his comment. Lehmann's conclusion was
that, at any rate until early in 1927, Neyman "obviously didn't understand what
Pearson was talking about" (p. 73).
The first joint paper, in two parts, was published in Biometrika in 1928.
Neyman had returned to Poland for the academic year 1927-1928, teaching both
at Warsaw and at Krakow, and he clearly felt that he had not played as big a
role in the first paper as had Pearson. The paper (Part I) ends with Neyman's
disclaimer:

N.B. I feel it necessary to make a brief comment on the authorship of this paper. Its
origin was a matter of close co-operation, both personal and by letter, and the ground
covered included the general ideas and the illustration of these by sampling from a
normal population. A part of the results reached in common are included in Chapters
I, II and IV. Later I was much occupied with other work, and therefore unable to
co-operate. The experimental work, the calculation of tables and the development
of the theory of Chapters III and IV are due entirely to Dr Egon S. Pearson. (p. 240)

It might be, too, that at this time Neyman wanted to distance himself just a
little from the maximum likelihood criterion, which was the main theoretical
underpinning of the ideas developed in the paper. In the early correspondence
with Pearson, Neyman referred to it as "your principle" and was not convinced
that it was the only possible approach.
The 1928 (Part I) paper begins by stating the important problem of statistical
inference, that of determining whether or not a particular sample (Σ) has
been randomly drawn from a population (π). Hypothesis A is that indeed it has.
But Σ may have been drawn from some other population π′, and thus two sorts
of error may arise. The first is when A is rejected but Σ was drawn from π, and
the second is when A is accepted but Σ has been drawn from π′:

In the long run of statistical experience the frequency of the first source of error (or
in a single instance its probability) can be controlled by choosing as a discriminating
contour, one outside which the frequency of occurrence of samples from π is very
small - say, 5 in 100 or 5 in 1000.
The second source of error is more difficult to control. . . . It is not of course
possible to determine π′, . . . but . . . we may determine a "probable" or "likely" form
of it, and hence fix the contours so that in moving "inwards" across them the
difference between π and the population from which it is "most likely" that Σ has
been sampled should become less and less. This choice also implies that on moving
"outwards" across the contours, other hypotheses as to the population sampled
become more and more likely than Hypothesis A. (Neyman & Pearson, 1928, p.
177)

Hypothesis A corresponds to Σ having been drawn from π, and Hypothesis
A′ to Σ having been drawn from π′. A ratio of the probabilities of A and A′ is
a measure for their comparison. But:

Probability is a ratio of frequencies and this relative measure cannot be termed the
ratio of the probabilities of the hypotheses, unless we speak of probability a posteriori
and postulate some a priori frequency distribution of sampled populations. Fisher
has therefore introduced the term likelihood, and calls this comparative measure the
ratio of the likelihoods of the two hypotheses. (Neyman & Pearson, 1928, p. 186)

The likelihood criterion is given by the ratio, λ, of the likelihood of the sample
under Hypothesis A to the maximum of its likelihood under the admissible
alternatives. This ratio defines surfaces in a probability space such that it
decreases from 1 to 0 as a specific point moves outward and alternatives to the
statistical hypothesis become more likely:

One had then to decide at which contour H0 should be regarded as no longer tenable,
that is, where should one choose to bound the rejection region? To help in reaching
this decision it appeared that the probability of falling into the region chosen, if H0
were true, was one necessary piece of information. In taking this view it can of course
be argued that our outlook was conditioned by current statistical practice. (E. S.
Pearson, 1966, p. 10)
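For a concrete instance of the criterion, consider a normal mean with known
standard deviation. The sketch below is a toy example of my own (σ = 1 is
assumed): λ is the likelihood under the hypothesized mean divided by the
likelihood maximized over all alternative mean values, so it lies between 0 and
1 and shrinks as the sample drifts away from the hypothesis:

    import numpy as np
    from scipy.stats import norm

    def likelihood_ratio(sample, mu0, sigma=1.0):
        l0 = norm.logpdf(sample, mu0, sigma).sum()            # under the hypothesis
        l1 = norm.logpdf(sample, sample.mean(), sigma).sum()  # maximized over means
        return np.exp(l0 - l1)                                # lambda, in (0, 1]

    rng = np.random.default_rng(0)
    for true_mu in (0.0, 0.5, 1.0):
        s = rng.normal(true_mu, 1.0, 25)
        print(true_mu, round(likelihood_ratio(s, mu0=0.0), 4))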
The paper does consider other approaches and notes that the authors do not
claim that the principle advanced is necessarily the best to adopt, but it is clear
that it is favored:

We have endeavoured to connect in a logical sequence several of the most simple
tests, and in so doing have found it essential to make use of what R. A. Fisher has
termed "the principle of likelihood."
The process of reasoning, however, is necessarily an individual matter, and we
do not claim that the method which has been most helpful to ourselves will be of
greatest assistance to others. It would seem to be a case where each individual must
reason out for himself his own philosophy. (Neyman & Pearson, 1928, p. 230)

There are faint echoes here of the subjective element in Neyman's initial
reasoning, commented on by Lehmann and reported by Reid (1982) - the notion
that belief in a prior hypothesis affects the consideration of the evidence:

There is a subjective element in the theory which expresses itself in the choice of
significance level you are going to require; but it is qualitative rather than quantita-
tive.2 While it is not very satisfying and rather pragmatic, I think this reflects the
way our minds work better than the more extreme positions of either denying any
subjective element in statistics or insisting upon its complete quantification.
(Lehmann, quoted by Reid, 1982, p. 73)

Neyman, in Poland in 1929, wanted to present a joint paper at the Interna-
tional Statistical Institute meeting, scheduled to be held there that year. The
paper that Neyman was preparing dealt with the Bayesian approach to the
problem that he and Pearson had considered and was to attempt to show that it
led to essentially the same solution. Egon Pearson just could not agree to a
collaboration that admitted, in the slightest, the notion of inverse probability:

Pearson . . . pointed out to Neyman that if they published the proposed paper, with
its admission of inverse probability, they would find themselves in a disagreement
with Fisher. . . . Many years later he explained, "The conflict between K.P. and
R.A.F. left me with a curious emotional antagonism and also fear of the latter so that
it upset me a bit to see him or to hear him talk." (Reid, 1982, p. 84)

He would not put his name to the paper. The curious fact is that neither
Neyman nor Pearson ever wholeheartedly subscribed to the inverse probability
approach, but Neyman felt that it should be addressed. Pearson's attempt to
avoid a confrontation was doomed to failure, just as his father's earlier attempts
to deflect Fisher's views had been.

2 Cowles and Davis (1982b) carried out a simple experiment that supports the suggestion that the
.05 level of significance is subjectively reasonable. Sandy Lovie pointed out to the author that the same
experiment, in a slightly different context, was performed by Bilodeau (1952).


The years that spanned the turn of the 1920s to the 1930s were not particu-
larly happy ones for the collaborators. Pearson was enduring the agonies of an
unhappy romance and finding his relationship with his father increasingly
frustrating. Neyman found his work greatly constrained by economic and
political difficulties in Poland, and the elder Pearson rejected a paper he
submitted to Biometrika. But, albeit in an intermittent fashion, the collaboration
continued as they worked toward what they described as their "big paper."

This was communicated to the Royal Society by Karl Pearson in August of
1932, read in November of that year, and published in Philosophical Transac-
tions in 1933. Neyman had written to Fisher, with whom he was then on
reasonably amicable terms, about the paper, and the latter had indicated that
were it to be sent to the Royal Society, he would likely be a referee:

To Neyman it has always been a source of satisfaction and amusement that his and
Egon's fundamental paper was presented to the Royal Society by Karl Pearson, who
was hostile and skeptical of its contents, and favourably refereed by the formidable
Fisher, who was later to be highly critical of much of the Neyman-Pearson theory.
(Reid, 1982, p. 103)

Reid reports that when she wrote to the librarian of the Royal Society to discover the name of the second referee, she found that there had only been one referee, and that he was A. C. Aitken of Edinburgh (a leading innovator in the field of matrix algebra).

In fact, two papers were published in 1933. The Royal Society paper dealt with procedures for the determination of the most efficient tests of statistical hypotheses. There is no doubt that it is one of the most influential statistical papers ever written. It transformed the way in which both the reasoning behind, and the actual application of, statistical tests were perceived. Forty years later, Le Cam and Lehmann enthusiastically assessed it:

The impact of this work has been enormous. It is, for example, hard to imagine hypothesis testing without the concept of power. . . . However, the influence of the work goes far beyond. . . . By deriving tests as the solutions of clearly defined optimum problems, Neyman and Pearson established a pattern for Wald's general decision theory and for the whole field of mathematical statistics as it has developed since then. (Le Cam & Lehmann, 1974, p. viii)

The later paper (Neyman & Pearson, 1933), presented to the Cambridge Philosophical Society, again sets out the Neyman-Pearson rationale and procedures for hypothesis testing. Its stated aim is to separate hypothesis testing from problems in estimation and to examine the employment of tests independently of a priori probability laws. The authors have rejected the Bayesian approach and employed the frequentist's view of probability. A concept of central importance is that of the power of a statistical test, and the term is introduced here for the first time. A Type I error is that of rejecting a statistical hypothesis, H₀, when it is true. The Type II error occurs when H₀ is not rejected but some rival, alternative hypothesis is, in fact, true:

If now we have chosen a region w, in the sample space W, as critical region to test H₀, then the probability that the sample point Σ defined by the set of variates [x₁, x₂, . . . , xₙ] falls into w, if H₀ is true, may be written as

P_I(w) = ε.

The chance of rejecting H₀ if it is true is therefore equal to ε, and w may be termed a critical region of size ε for H₀. The second type of error will be made when some alternative Hᵢ is true, and Σ falls in w̄ = W − w. If we denote by P_I(w), P_II(w) and P(w) the chance of an error of the first kind, the chance of an error of the second kind and the total chance of error using w as critical region, then it follows that: [equation omitted] (Neyman & Pearson, 1933, p. 495)

The rule then is to set the chance of the first kind of error (the size, or what we would now call the α level) at a small value and then choose a rejection class so that the chance of the second kind of error is minimized. In fact, the procedure attempts to maximize the power of the test for a given size. The probability of rejecting the statistical hypothesis H₀ when in fact the true hypothesis is Hᵢ, that is, P(w | Hᵢ), is called the power of the critical region w with respect to Hᵢ.
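In modern notation the machinery is easily illustrated. The sketch below is the present author's illustration, not a reconstruction of Neyman and Pearson's calculations: it assumes a one-sided test of a normal mean with known standard deviation, and every number in it (the sample size, the significance level, the alternatives, and the φ weights) is invented for the purpose.

    # A sketch in modern terms, not Neyman and Pearson's own calculations.
    # One-sided test of a normal mean with known sigma; all values hypothetical.
    from scipy.stats import norm

    alpha = 0.05                     # the size (epsilon) of the critical region
    n, sigma = 25, 1.0               # assumed sample size and known sigma
    se = sigma / n ** 0.5            # standard error of the sample mean

    # Critical region w: reject H0 (mu = 0) when the sample mean exceeds c,
    # so that P(sample mean falls in w | H0 true) = alpha.
    c = norm.ppf(1 - alpha) * se

    def power(mu):
        """P(sample mean falls in w | true mean mu): power against mu."""
        return 1 - norm.cdf((c - mu) / se)

    print([round(power(mu), 2) for mu in (0.2, 0.5)])    # [0.26, 0.8]

    # "Resultant" power: the power against each admissible alternative,
    # weighted by its a priori probability (the phi's) -- the quantity that,
    # as the passage below notes, can rarely be computed in practice.
    phis, alternatives = [0.5, 0.5], [0.2, 0.5]
    print(round(sum(p * power(mu) for p, mu in zip(phis, alternatives)), 2))  # 0.53

With the power of a test in hand, Neyman and Pearson then introduce its "resultant" power: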

If we now consider the probability P_II(w) of type II errors when using a test T based on the critical region w, we may describe 1 − P_II(w) as the resultant power of the test T. . . . It is seen that while the power of a test with regard to a given alternative H₁ is independent of the probabilities a priori, and is therefore known precisely as soon as H₁ and w are specified, this is not the case with the resultant power, which is a function of the φᵢ's. (Neyman & Pearson, 1933b, p. 499)

Note here that the φ's are the probabilities a priori of the admissible alternative hypotheses. Neyman and Pearson of course fully recognized that the φ's cannot often be expressed in numerical form and that the statistician has to consider the sense, from a practical point of view, in which tests are independent of probabilities a priori, noting:

This aspect of the error problem is very evident in a number of fields where tests must be used in a routine manner, and errors of judgment lead to waste of energy or financial loss. Such is the case in sampling inspection problems in mass-production industry. (Neyman & Pearson, 1933, p. 493)

This apparently innocuous statement alludes to the features of the Neyman-Pearson theory of hypothesis testing that, over the course of the next few years, Fisher vigorously attacked, that is, (a) the notion that hypothesis testing can be regarded as a decision process akin to methods used in quality control, and reduced to a set of practical rules, and (b) the implication that "repeated sampling from the same population" - which occurs in industry when similar samples are drawn from the same production run - determines the level of significance. These suggestions were anathema to Fisher. They were, however, the very features that made statistics so welcome in psychology and the social sciences - that is, the promise of a set of rules with apparently respectable mathematical foundations that would allow decisions to be made about the meaning of noisy quantitative data. Few ventured into an examination of the logic of the methods; very few would wish to be trampled as the giants fought in the field. These matters are examined later in this chapter.

Perhaps we should spare a thought for Sir Ronald Fisher, curmudgeon that he was. He must indeed be constantly tossing in his grave as lecturers and professors across the world, if they remember him at all, refer to the content of most current curricula as Fisherian statistics.

STATISTICS AND INVECTIVE

The full force of Fisher's opposition to the Neyman-Pearson theory was not immediately felt in 1933. The circumstances of his developing fury are, however, evident. In a decision that could hardly have been less conducive to harmony in the development of statistics, the University of London split Karl Pearson's department when he retired in 1933:

Statistics was taught at no other British university, nor was there another professor of eugenics, charged with the duty and the means of research into human heredity. If Fisher were to teach at a university, it would have to be as Pearson's successor. (Fisher Box, 1978, p. 257)

Fisher became Galton Professor in the Department of Eugenics, and Egon Pearson, promoted to Reader, headed the Department of Applied Statistics. There, in Gower Street, with Fisher's department on the top floor and Egon Pearson's on the floor below, the rows began to simmer. Fisher Box (1978) reports that her father had ascertained, before accepting the appointment, that Egon Pearson was well disposed to it. Fisher corresponded with him, apparently in the hope that they might resolve conflicts in the teaching of statistics. Fisher Box (1978) learned from correspondence with Egon Pearson that Fisher had even suggested that, despite the decision to split the old department, he and Pearson should reunite it. Pearson's response was to the effect that no lectures in the theory of statistics should be given by Fisher, that the territories had been defined, and that each should stay within them.

Early in 1934 Pearson invited Jerzy Neyman to join his department, temporarily, as an assistant. Neyman accepted and, in the summer of 1934, received an appointment as lecturer.

The moods at Gower Street must have been impossible, and the intensity of the strain is difficult to imagine and reconstruct:

[Karl] Pearson was made an honorary member of the Tea Club, and when he joined them in the Common Room, it was observed that Fisher did him the unique honour of breaking out of conversation to step forward and greet him cordially. (Fisher Box, 1978, p. 260)

Or,

The Common Room was carefully shared. Pearson's group had tea at 4; and at 4:30, when they were safely out of the way, Fisher and his group trooped in. Karl Pearson had withdrawn across the college quadrangle with his young assistant Florence David. He continued to edit Biometrika; but, as far as Miss David remembers, he never again entered his old building. (Reid, 1982, p. 114)

It appears that, at first, Neyman and Fisher got on quite well and that Neyman tried to bring Fisher and Egon Pearson together. His work on estimation, rather than the joint work with Pearson on hypothesis testing, occupied his attention. His 1934 paper (discussed earlier) was, in the main, well received by Fisher, and Fisher's December 1934 paper, presented to the Royal Statistical Society (Fisher, 1935), was commented on favorably by Neyman.

But any harmony disappeared at meetings of the Industrial and Agricultural Section of the Royal Statistical Society in 1935. Neyman presented his "Statistical Problems in Agricultural Experimentation," in which he questioned the efficiency of Fisher's handling of Randomized Block and Latin Square designs, illustrating his talk with wooden models that had been prepared for him at University College.³ Fisher's response to the paper (for which he was supposed to give the vote of thanks) began by expressing the view, to put it bluntly (which he did), that he had hoped that Neyman would be speaking on something that he knew something about. His closing comments disparaged the Neyman-Pearson approach.
Frank Yates (1935) presented a paper on factorial designs to the Royal Statistical Society later that year. Neyman expressed doubts about the interpretation of interactions and main effects when the number of replications was small. The problem has been examined more recently by Traxler (1976). The details of these criticisms, and they had validity, did not concern Fisher:

[Neyman had] asserted that Fisher was wrong. This was an unforgivable offense - Fisher was never wrong, and indeed the suggestion that he might be was treated by him as a deadly assault. Anyone who did not accept Fisher's writing as the God-given truth was at best stupid and at worst evil. (Kempthorne, 1983, p. 483)

Oscar Kempthorne, quoted here, and undoubtedly an admirer of Fisher's work, had what he has described as a "partial relationship" with Fisher when he worked at Rothamsted from 1941 to 1946 and knew Fisher's fiery intransigence. It is understandable that Fisher Box (1978) presents a view of these troubled times that is more sympathetic to Fisher, couching her commentary in terms that do reflect an important aspect of the situation. Statistics was becoming more mathematical, and the shift of its intellectual power base to the United States was to make it even more mathematical. Fisher Box (1978) comments, "Fisher was a research scientist using mathematical skills, Neyman a mathematician applying mathematical concepts to experimentation" (p. 265).
She quotes Neyman's reply to Fisher, in which he explains his concerns about inconsistencies in the application of the z test. The test is deduced from the sums of two independent squares, but the restricted sampling of randomized blocks and Latin squares leads to the mutual dependence of results:

Mathematicians tended to formulate the argument in these terms, that is, in terms of normal theory, ignoring randomization. Nevertheless, in doing so they exhibited a fundamental misunderstanding of Fisher's work, for it happens to be false that the derivation of the z distribution depends on the assumptions Neyman criticized. (Fisher Box, 1978, p. 266)

Fisher Box defends this view, but it was not a view that convinced many mathematicians and statisticians outside the Fisherian camp.

³Reid (1982) relates a story told by both Neyman and Pearson. One evening they returned to the department after dinner and found the models strewn about the floor. They suspected that the angry act was Fisher's.

In 1935, Egon Pearson was promoted to Professor and Neyman appointed Reader in statistics at University College. Fisher opposed Neyman's appointment. Neyman recalled to Reid (1982) that at about this time Fisher had demanded that Neyman should lecture using only Statistical Methods for Research Workers and that, when Neyman refused, Fisher said, "'Well, if so, then from now on I shall oppose you in all my capacities.' And he enumerated - member of the Royal Society and so forth. There were quite a few. Then he left. Banged the door" (Reid, 1982, p. 126).
Fisher withdrew his support for Neyman's election to the International Statistical Institute. It was this sort of intense personal animosity that led to confusion, indeed despair, as onlookers attempted to grasp the developments in statistics. The protagonists introduced ambivalence and contradiction as they defended their positions. Debate and discussion, in any rational sense, never took place. As if this was not enough, Neyman and Egon Pearson were beginning to draw apart. Neyman certainly did not feel committed to University College. Karl Pearson died in April 1936, and very shortly thereafter Egon Pearson began work on a survey of his father's contribution (E. S. Pearson, 1938a). Neyman was working on the development of a theory of estimation using confidence intervals. In a move that seems utterly astonishing, Pearson, who had inherited the editorship of Biometrika, after a certain amount of equivocation rejected the resulting paper. Pearson thought that the paper was too long and too mathematical. It subsequently appeared in the Philosophical Transactions of the Royal Society (Neyman, 1937). Joint work between the two friends had almost ceased. Neyman visited the United States in 1937 and made a very good impression. His visit to Berkeley resulted in his being offered a post there in 1938, an offer that he accepted.
Neyman's move to the University of California, the growth in the development and applications of statistics at the State Agricultural College at Ames, Iowa, and the important influence of the University of Michigan, where the Annals of Mathematical Statistics, edited by Harry C. Carver, had been founded in 1930, moved the vanguard of mathematical statistics to America, where it has remained. The tribulations of actual and devastating warfare were on Britain's horizon. There was no one left on the statistical battlefield who wanted to fight with Fisher. He was disliked by all but his close collaborators, he was often avoided, and he was misunderstood.
Fisher enjoyed a considerable reputation as a geneticist, carrying out experiments and publishing work that had considerable impact. His study of natural selection from a mathematical point of view led to a reconciliation of the Mendelian and the biometric approaches to evolution. In 1943 he accepted the post of Arthur Balfour Professor of Genetics at Cambridge University, an appointment for which Egon Pearson must have given thanks. Not that Pearson suffered too much from the lash of Fisher's tongue, for Fisher regarded him as a lightweight and, whenever he could, coldly ignored him.
Fisher was always willing to promote the application of his work to experimentation in a wide variety of fields. He was unable to accept any criticism of his view of its mathematical foundations. Perhaps the most unfortunate episode took place at the end of Karl Pearson's life. Taking as his reason an attack Pearson (1936) had made, in a work published very shortly after his death, on a paper written by an Indian statistician, R. S. Koshal (1933), Fisher (1937) undertook "to examine frankly the status of the Pearsonian methods" (p. 303). Nearly 20 years later, in an author's note accompanying a republication of "Professor Karl Pearson and the Method of Moments," Fisher really exceeds the bounds of academic propriety:

If peevish intolerance of free opinion in others is a sign of senility, it is one which he had developed at an early age. Unscrupulous manipulation of factual material is also a striking feature of the whole corpus of Pearsonian writings, and in this matter some blame does seem to attach to Pearson's contemporaries for not exposing his arrogant pretensions. (Fisher, 1950, p. 29.302a, in Bennett, 1971)

Of course Karl Pearson was guilty of the same sort of invective. Personal criticism in the defense of their mathematical and statistical stances is a feature of the style of both men. No academic writer expects to escape criticism - indeed, lack of criticism generally indicates lack of interest - but these sorts of polemics can only damage the discipline. Ordinary, and even some extraordinary, men and women run for cover. These conflicts may well be responsible for the rather uncritical acceptance of the statistical tools that we use today, a point that is discussed further.

FISHER versus NEYMAN AND PEARSON

In an author's note preceding a republication of a 1939 paper, Fisher reiterates his opposition to the Neyman-Pearson approach, referring specifically to the Cambridge paper:

The principles brought to light [in the following paper] seem to the author essential to the theory of tests of significance in general, and to have been most unwarrantably ignored in at least one pretentious work on "Testing statistical hypotheses." Practical experimenters have not been seriously influenced by this work, but in mathematical departments, at a time when these were beginning to appreciate the part they might play as guides in the theoretical aspects of experimentation, its influence has been somewhat retrograde. (Fisher, 1950, p. 35.173a, in Bennett, 1971)

Fisher set out his objections to Neyman and Pearson's views in 1955. Statistical Methods and Scientific Induction sets out to examine the differences in logic in the two approaches. He acknowledges that Barnard had observed that:

Neyman, thinking that he was correcting and improving my own early work on tests of significance, as a means to the "improvement of natural knowledge," in fact reinterpreted them in terms of that technological and commercial apparatus which is known as an acceptance procedure. (Fisher, 1955, p. 69)

Fisher acknowledges the importance of acceptance procedures in industrial settings, noting that whenever he travels by air he gives thanks for their reliability. He objects, however, to the translation of this model to the physical and biological sciences. Whereas in a factory the population that is the product has an objective reality, such is not the case for the population from which the psychologist's or the biologist's sample has been drawn. In the latter case, he argues, there is:

a multiplicity of populations to each of which we can legitimately regard our sample as belonging; so that the phrase "repeated sampling from the same population" does not enable us to determine which population is to be used to define the probability level, for no one of them has objective reality, all being products of the statistician's imagination. (Fisher, 1955, p. 71)

Fisher maintains that significance testing in experimental science depends only on the properties of the unique sample that has been observed and that this sample should be compared only with other possibilities, that is to say, "to a population of samples in all relevant respects like that observed, neither more precise nor less precise, and which therefore we think it appropriate to select in specifying the precision of the estimate" (Fisher, 1955, p. 72).
Fisher was never able to come to terms with the critical contribution of Neyman and Pearson, namely, the notion of alternative hypotheses and errors of the second kind. Once again his objections are rooted in what he considers to be the adoption of the misleading quality control model. He says, "The phrase 'Errors of the second kind,' although apparently only a harmless piece of technical jargon, is useful as indicating the type of mental confusion in which it was coined" (Fisher, 1955, p. 73).
Fisher agrees that the frequency of wrongly rejecting the null hypothesis can be controlled but disagrees that any specification of the rate of errors of the second kind is possible. Nor is such a specification necessary or helpful. The crux of his argument rests on his objection to using hypothesis testing as a decision process:

The fashion of speaking of a null hypothesis as "accepted when false," whenever a test of significance gives us no strong reason for rejecting it, and when in fact it is in some way imperfect, shows real ignorance of the research worker's attitude, by suggesting that in such a case he has come to an irreversible decision. (Fisher, 1955, p. 73)

What really should happen when the null hypothesis is accepted? The researcher concludes that the deviation from truth of the working hypothesis is not sufficient to warrant modification. Or, says Fisher, perhaps, that the deviation, being in the expected direction, to an extent confirms the researcher's suspicion, but the data available are not sufficient to demonstrate its reality. The implication is clear. Experimental science is an ongoing process of evaluation and re-evaluation of evidence. Every conclusion is a provisional conclusion:

Acceptance is irreversible, whether the evidence for it was strong or weak. It is the result of applying mechanically rules laid down in advance; no thought is given to the particular case, and the tester's state of mind, or his capacity for learning, is inoperative. (Fisher, 1955, pp. 73-74)

Finally, Fisher launches an attack, from the same basic position, on Neyman's use of the term inductive behavior to replace the phrase inductive reasoning. It is clear that Neyman was looking for a statistical system that would provide rules. The 1933 paper shows that Neyman and Pearson believed that they were on the same track as Fisher:

In dealing with the problem of statistical estimation, R. A. Fisher has shown how, under certain conditions, what may be described as rules of behaviour can be employed which will lead to results independent of those probabilities [here Neyman is referring to probabilities a priori]; in this connection he has discussed the important conception of what he terms fiducial limits. (Neyman & Pearson, 1933, p. 492, emphasis added)

It appears that the users of statistical methods have implicitly accepted the notion that the Neyman-Pearson approach was a natural, almost an inevitable, progression from the work of Fisher. It is, however, little wonder, even leaving the personalities and polemics aside, that Fisher objected to the mechanization of the scientific endeavor. It was just not the way he looked at science. The 1955 paper almost goes out of its way to attack Abraham Wald's (1950) book on Statistical Decision Functions, objecting specifically to its characterization as a book about experimental design. Neyman (1956) rebutted the criticisms of Fisher and effectively restated his position, but it is apparent that the two are really arguing at cross-purposes. Responding to Neyman's rebuttal, Fisher (1957) begins, "If Professor Neyman were in the habit of learning from others he might profit from the quotation he gives from Yates" (Fisher, 1957, p. 179).
Of interest here is Fisher's continuing determination to defend the concept of fiducial probability at all costs. He raises the concept again and again in his writings on inference and never admits that it presents difficulties.

The expressions "fiducial probability" and "fiducial argument" are Fisher's. Nobody knows just what they mean, because Fisher repudiated his most explicit, but definitely faulty, definition and ultimately replaced it with only a few examples. (Savage, 1976, p. 466)

The Bayesian argument lends itself to decision analysis. Fisher objected to both. The fiducial argument has been characterized as an elliptical attempt to arrive at Bayesian-type posterior probabilities without invoking the Bayesian-type priors. What muddies the water is Fisher's insistence that fiducial probabilities are verifiable in the same way, for example, as the probabilities in games of chance are verifiable. What makes the situation even more vexing is that the perceived relationship between Neyman's confidence intervals and Fisher's fiducial theory would seem to make the latter easier to grasp. But Neyman's theory of estimation by confidence sets grows out of the same conceptual roots as Neyman-Pearson hypothesis testing. No rational compromise was possible.
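The repeated-sampling interpretation that underlies Neyman's confidence sets can be made concrete with a small simulation. The sketch below is the present author's, with invented numbers; it is not a reconstruction of any of Neyman's examples.

    # A sketch (all values hypothetical) of the frequentist reading of a
    # confidence interval: the parameter is fixed, the intervals vary from
    # sample to sample, and about 95 percent of them cover the parameter.
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, trials = 10.0, 2.0, 30, 10_000
    half = 1.96 * sigma / np.sqrt(n)        # known-sigma 95% half-width

    covered = 0
    for _ in range(trials):
        xbar = rng.normal(mu, sigma, n).mean()
        covered += (xbar - half <= mu <= xbar + half)

    print(covered / trials)                 # close to 0.95

The 95 percent attaches to the procedure under repeated sampling, not to the parameter itself - the very feature that ties confidence sets to the hypothesis-testing framework.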
The null hypothesis (it has been noted that the term is Fisher's and that it was not used by Neyman and Pearson) is, in Fisherian terms, the hypothesis that is tested, and it implies a particular value (most often zero) for the population parameter. When a statistical analysis produces a significant outcome, this is taken to be evidence against the null hypothesis, evidence against the stated value for the parameter, evidence against the assertion that nothing has happened, evidence against a conclusion that the experimental manipulation has had no effect. The point is that a significant outcome does not appear to be evidence for anything. The Neyman-Pearson position is that hypothesis testing demands a research hypothesis for which we can find support. Suppose that the statistical hypothesis states that in the population the correlation is zero, and indeed it is; then an obtained sample value of +0.8, given a reasonable n, would lead to a Type I error. If, on the other hand, the statistical hypothesis states that the population value is +0.8, but again it is really zero, then an obtained value of 0.8 will lead to a Type II error. It will immediately be argued that no one would set the population parameter under the null hypothesis as +0.8 unless there were evidence that it was of this order. However, this implies that experimenters take into account other evidence than that immediately to hand. No matter how formal the rules, it is plain that in real life the rules do not inevitably guide behavior and decisions, or conclusions about outcomes. As was noted earlier, my holding a winning lottery ticket that was drawn from a million tickets - an extremely improbable event - does not necessarily lead to my rejecting the hypothesis of chance.
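The two correlation cases can be put into figures. The sketch below is illustrative only: the sample size is invented, and the test based on Fisher's z transformation of r is simply a convenient modern choice.

    # A sketch (hypothetical n and r) of the two scenarios just described,
    # using Fisher's z transformation of r, which is approximately normal
    # with standard error 1 / sqrt(n - 3).
    import math

    n, r = 30, 0.8
    se = 1 / math.sqrt(n - 3)
    z = lambda rho: 0.5 * math.log((1 + rho) / (1 - rho))   # Fisher's z

    # Hypothesis: rho = 0. The observed r = +0.8 lies several standard
    # errors away, so the hypothesis is rejected; if the population value
    # really is 0, that rejection is a Type I error.
    print(round((z(r) - z(0.0)) / se, 2))   # about 5.71, far beyond 1.96

    # Hypothesis: rho = +0.8, when the population value is really 0. The
    # observed r = 0.8 is perfectly consistent with the hypothesis, so it
    # is not rejected; the hypothesis being false, that is a Type II error.
    print(round((z(r) - z(0.8)) / se, 2))   # 0.0, no reason to reject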
There are, then, difficult problems in the assessment of the rules of statistical procedures. It is small wonder that these problems are not widely appreciated, because the founding fathers themselves were not clear on the issues - or, at any rate, in public and in their writings they were not always clear on the issues. Some few examples will suffice to illustrate the point.
In 1935 Karl Pearson asserted that "tests are used to ascertain whether a reasonable graduation curve has been achieved, not to assert whether one or another hypothesis is true or false" (K. Pearson, 1935, p. 296). All he is doing here is arguing with Fisher. He himself used his own χ² test for hypothesis testing (e.g., K. Pearson, 1909).
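The distinction Pearson is drawing can be illustrated with his own statistic. In the sketch below (the counts are invented), the same χ² computation may be read either as a check that a fitted, or "graduated," curve describes the data tolerably well or as a test of a hypothesis:

    # A sketch (invented counts) of a chi-square goodness-of-fit check:
    # observed category counts against the counts a fitted curve implies.
    import numpy as np
    from scipy.stats import chisquare

    observed = np.array([18, 30, 28, 24])   # hypothetical data
    expected = np.array([20, 30, 30, 20])   # implied by the fitted curve

    stat, p = chisquare(observed, f_exp=expected)
    print(round(stat, 2), round(p, 3))      # 1.13, 0.769: the fit is adequate

Used the first way, a small χ² simply certifies the graduation; used the second, the same number becomes grounds for retaining or rejecting a hypothesis - which is how Pearson (1909) had himself employed it.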
In his famous discussion of the "tea-tasting" investigation, Fisher discusses the "sensitiveness" of an experiment. He notes that by increasing the size of the experiment:

we can render it more sensitive, meaning by this that it will allow of the detection of a lower degree of sensory discrimination, or, in other words, of a quantitatively smaller departure from the null hypothesis. Since in every case the experiment is capable of disproving, but never of proving this hypothesis, we may say that the value of the experiment is increased whenever it permits the null hypothesis to be more readily disproved.

The same result could be achieved by repeating the experiment, as originally designed, upon a number of different occasions. (Fisher, 1935/1966, 8th ed., p. 22)
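The practical content of the passage can be sketched numerically. The values below (the true departure from the null hypothesis, the standard deviation, and the significance level) are all invented for illustration.

    # A sketch (illustrative values only): for a fixed true departure from
    # the null hypothesis, enlarging the experiment raises the probability
    # of detecting it.
    from scipy.stats import norm

    effect, sigma, alpha = 0.3, 1.0, 0.05   # hypothetical departure, sd, level

    for n in (10, 25, 50, 100, 200):
        se = sigma / n ** 0.5
        crit = norm.ppf(1 - alpha) * se     # one-sided critical value
        print(n, round(1 - norm.cdf((crit - effect) / se), 2))   # rises toward 1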

In the passage quoted above, Fisher appears to be alluding to what Neyman and Pearson formalized as the power of the test - a notion that he never would accept in their context. He is also using the words "proved" and "disproved" rather loosely, and, although he might be forgiven, the slip flies in the face of the routine statements about the essentially probabilistic nature of statistical outcomes. Elsewhere Fisher presents the view that each experiment is, as it were, self-contained in the context of significance testing:

On the whole the ideas (a) that a test of significance must be regarded as one of a series of similar tests applied to a succession of similar bodies of data, and (b) that the purpose of the test is to discriminate or "decide" between two or more hypotheses, have greatly obscured their understanding, when taken not as contingent possibilities but as elements essential to their logic. (Fisher, 1956/1973, 3rd ed., pp. 45-46)

To practicing researchers, the notion that tests are not used to "decide" is incomprehensible. What else are they for? Fisher's supporters would say that what he means is that they are not used to decide finally and irreversibly, and scientists would surely agree.
In his 1955 paper, where he objects to the Neyman-Pearson specification of the two types of error and gives his views (mentioned earlier) of what the researcher will conclude when a result fails to reach statistical significance, Fisher says:

These examples show how badly the word "error" is used in describing such a situation. Moreover, it is a fallacy, so well known as to be a standard example, to conclude from a test of significance that the null hypothesis is thereby established; at most it may be said to be confirmed or strengthened. (Fisher, 1955, p. 73, emphasis added)

It might be argued that a careful reading of Fisher's voluminous works would make his position clear, and that these quotations are selective. The point is taken, but the point may also be made that it is hardly surprising that the later interpretations of, and commentaries on, his influential contribution contain difficulties and contradictions. Had the protagonists been more concerned with rational debate rather than heated argument, statistics would have had a quite different history.

PRACTICAL STATISTICS

A striking feature of the general statistics texts produced for courses in the psychological sciences is the anonymity of the prescriptions that are described. A startlingly high number of incoming graduate students in the author's department were unaware that the F ratio was named for a man called Fisher, but a cursory glance through the indexes of introductory texts reveals why. Pearson's name is sometimes mentioned as a convenient label for a correlation coefficient, distinguishing it from the rank-order method of Spearman. Neyman and Pearson Jr. are hardly ever acknowledged. "Power" is treated with caution. Controversial issues are never discussed.

Gigerenzer (1987) presents an interesting discussion of the situation: "The confusion in the statistical texts presented to the poor frustrated students was caused in part by the attempt to sell this hybrid as the sine qua non of scientific inference" (p. 20).
Gigerenzer's thesis addresses what he sees as experimental psychology's fight against subjectivity. Probabilistic models and statistical methods provided the discipline with a mechanical process that seemed to allow for objective assessment independent of the experimenter. Parallel constructs of individual differences as error and uncertainty as ignorance further promoted psychology's view of itself as an objective science.

It is Gigerenzer's view that the illusion was more or less deliberately created by the early textbooks and has been perpetuated. The general neglect of alternative theories and methods of inference, anonymous presentation, the silence on the controversies, and "institutionalization" all conspired to provide psychology with its need for objectivity. Gigerenzer makes telling and important points. But a more mundane explanation can be advanced.


Surely social scientists can be forgiven for not being mathematicians or logicians, and those who took a peek at the literature of the 1920s and 1930s must have been dismayed at the bitterness of the controversies. At the same time, the methods were being popularized by such people as Snedecor - and they seemed to be methods that worked. It is absolutely clear that psychologists wanted to construct a research methodology that would be accepted by traditional science. The controversies were ignored because experimentalists in the social sciences believed that they were sifting the methodological wheat from the polemical chaff. The principals in the arguments were ignored because of an eagerness to get on with the practical job, and in this they were supported by the master, Fisher. Moreover, the textbook writers, interpreting Fisher, can be forgiven for their lack of acknowledgment of the founding fathers - they were, after all, following the master, who rarely gave credit to anybody!

What is, perhaps, more contentious is the effect that all this has had on the discipline. If it is to be admitted that the logical foundations of psychology's most widespread method of data assessment are shaky, what are we to make of the "findings" of experimental psychology? Is the whole edifice of data and theory to be compared with the buildings in the towns of the "Wild West" - a gaudy false front, and little of substance behind? This is an unreasonable conclusion, and it is not a conclusion that is borne out by even a cursory examination of the successful predictions of behavior and the confident applications of psychology, in areas stretching from market research to clinical practice, that have a utility that is indisputable. The plain fact of the matter is that psychology is using a set of tools that leaves much to be desired. Some parts of the kit perhaps should be discarded; some of them, like blunt chisels, will let us down and we might be injured. But they seem to have been doing a job. Psychology is a success.
Now some would qualify that success. An agonizing reappraisal of the discipline by Sigmund Koch followed his struggles with his editorship of Psychology: A Study of a Science (1959-1963), at the beginning of which enterprise he had declared psychology a "disorderly matrix." Ten years after Miller, in his Presidential Address to the American Psychological Association (1969), urged that psychology should be "given away" (in the sense of pressing for more social relevance), Koch (1980) pleaded that it ought to be "taken back." Although the drift of the criticism is inescapable, to throw in the towel is, as it were, both unenlightening and unproductive.
There is no doubt that the disparate nature of its subject matter, and the sometimes conflicting pronouncements issuing from the research journals, has led to a fragmentation of the discipline, even to denials that it is, in fact, a coherent discipline. However, two themes are apparent and common to all branches of psychology. One is the classical statistical method used by the vast majority of psychological researchers, and the other is its history and, more particularly, its historic questions: the mind-body dichotomy, the mechanisms of perception and cognition, the nature-nurture issue, the individual and society, the mysteries of maturation and change, and more. This book is an attempt to promote an interest in both statistics and history.
References
Acree, M. C. (1978). Theories of statistical inference in psychological research: A historico-critical study. Unpublished doctoral dissertation, Clark University, Worcester, MA.
Adams, W. J. (1974). The life and times of the central limit theorem. New York: Kaedmon.
Airy, G. B. (1861). On the algebraical and numerical theory of errors of observations and the combination of observations (2nd ed.). London: Macmillan.
Anderson, R. L., & Bancroft, T. A. (1952). Statistical theory in research. New York: McGraw-Hill.
Anonymous. (1926). Review of Fisher's Statistical methods for research workers. British Medical Journal, 1, 578-579.
Arbuthnot, J. (1692). Of the laws of chance: Or a method of calculation of the hazards of gaming. London: Printed by B. Motte and sold by Randall Taylor.
Arbuthnott, J. (1710). An argument for divine providence, taken from the constant regularity observ'd in the births of both sexes. Philosophical Transactions of the Royal Society, 27, 186-190.
Barnard, G. A. (1958). Thomas Bayes - A biographical note. Biometrika, 45, 293-295.
Bartlett, M. S. (1965). R. A. Fisher and the last fifty years of statistical methodology. Journal of the American Statistical Association, 60, 395-409.
Bartlett, M. S. (1966). Review of Logic of statistical inference, by Ian Hacking. Biometrika, 53, 631-633.
Baxter, B. (1940). The application of a factorial design to a psychological problem. Psychological Review, 47, 494-500.
Baxter, B. (1942). A study of reaction time using factorial design. Journal of Experimental Psychology, 31, 430-437.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society, 53, 370-418. [Reprinted with a biographical note by G. A. Barnard in Biometrika, 1958, 45, 293-315.]
Beloff, H. (Ed.). (1980). A balance sheet on Burt. Supplement to the Bulletin of the British Psychological Society, 33.
Beloff, J. (1993). Parapsychology: A concise history. London: Athlone Press.
Bennett, J. H. (1971). Collected papers of R. A. Fisher. Adelaide: University of Adelaide.
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33, 526-536.
Bernard, C. (1927). An introduction to the study of experimental medicine (H. C. Greene, Trans.). New York: Macmillan. (Original work published 1865)
Bernoulli, D. (1966). Part four of the Art of Conjecturing showing the use and application of the preceding treatise in civil, moral and economic affairs (Bing Sung, Trans.). Cambridge, MA: Harvard University, Department of Statistics, Tech. Report No. 2. (Original work published 1713)
Bilodeau, E. A. (1952). Statistical versus intuitive confidence. American Journal of Psychology, 65, 211-271.

Boring, E. G. (1920). The logic of the normal law of error in mental measurement. American Journal of Psychology, 31, 1-33.
Boring, E. G. (1950). A history of experimental psychology. New York: Appleton-Century-Crofts.
Boring, E. G. (1957). When is human behavior predetermined? Scientific Monthly, 84, 189-196.
Bowley, A. L. (1902). Elements of statistics. London: P. S. King. (6th ed., 1937)
Bowley, A. L. (1906). Presidential address to the economic science and statistics section of the British Association for the Advancement of Science, York. Journal of the Royal Statistical Society, 69, 540-558.
Bowley, A. L. (1926). Measurement of the precision obtained in sampling. Bulletin of the International Statistical Institute, 22, 1-62.
Bowley, A. L. (1928). F. Y. Edgeworth's contributions to mathematical statistics. London: Royal Statistical Society.
Bowley, A. L. (1936). The application of sampling to economic and social problems. Journal of the American Statistical Association, 31, 474-480.
Box, G. E. P. (1984). The importance of practice in the development of statistics. Technometrics, 26, 1-8.
Bravais, A. (1846). Sur les probabilités des erreurs de situation d'un point [On the probability of errors in the position of a point]. Mémoires de l'Académie Royale des Sciences de l'Institut de France, 9, 255-332.
Brennan, J. F. (1994). History and systems of psychology (4th ed.). Englewood Cliffs, NJ: Prentice-Hall.
Brown, W., & Stephenson, W. (1933). A test of the theory of two factors. British Journal of Psychology, 23, 352-370.
Buck, P. (1977). Seventeenth-century political arithmetic: Civil strife and vital statistics. Isis, 68, 67-84.
Burke, C. J. (1953). A brief note on one-tailed tests. Psychological Bulletin, 50, 384-387.
Burt, C. (1909). Experimental tests of general intelligence. British Journal of Psychology, 3, 94-177.
Burt, C. (1939). The factorial analysis of human ability. III. Lines of possible reconcilement. British Journal of Psychology, 30, 84-93.
Burt, C. (1940). The factors of the mind. London: University of London Press.
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of research on teaching (pp. 171-246). Chicago: Rand McNally.
Carrington, W. (1934). The quantitative study of trance personalities. 1. Proceedings of the Society for Psychical Research, 42, 173-240.
Carrington, W. (1936). The quantitative study of trance personalities. 2. Proceedings of the Society for Psychical Research, 43, 319-361.
Carrington, W. (1937). The quantitative study of trance personalities. 3. Proceedings of the Society for Psychical Research, 44, 189-222.
Carroll, J. B. (1953). An analytical solution for approximating simple structure in factor analysis. Psychometrika, 18, 23-38.
Cattell, R. B. (1965a). The scientific analysis of personality. Baltimore, MD: Penguin Books.
Cattell, R. B. (Ed.). (1965b). Handbook of multivariate experimental psychology. Chicago: Rand McNally.

Chang, W.-C. (1973). A history of the chi-square goodness-of-fit test. Unpublished doctoral dissertation, University of Toronto, Toronto, Ontario.
Chang, W.-C. (1976). Sampling theories and sampling practice. In D. B. Owen (Ed.), On the history of statistics and probability (pp. 299-315). New York: Marcel Dekker.
Clark, R. W. (1971). Einstein, the life and times. New York: World.
Cochran, W. G. (1980). Fisher and the analysis of variance. In S. E. Fienberg & D. V. Hinkley (Eds.), R. A. Fisher: An appreciation (pp. 17-34). New York: Springer-Verlag.
Cowan, R. S. (1972). Francis Galton's statistical ideas: The influence of eugenics. Isis, 63, 509-528.
Cornfield, J., & Tukey, J. W. (1956). Average values of mean squares in factorials. Annals of Mathematical Statistics, 27, 907-949.
Cowan, R. S. (1977). Nature and nurture: The interplay of biology and politics in the work of Francis Galton. In W. Coleman & C. Limoges (Eds.), Studies in the history of biology (Vol. 1, pp. 138-208). Baltimore, MD: Johns Hopkins University Press.
Cowles, M., & Davis, C. (1982a). On the origins of the .05 level of statistical significance. American Psychologist, 37, 553-558.
Cowles, M., & Davis, C. (1982b). Is the .05 level subjectively reasonable? Canadian Journal of Behavioural Science, 14, 248-252.
Cowles, M., & Davis, C. (1987). The subject matter of psychology: Volunteers. British Journal of Social Psychology, 26, 97-102.
Cronbach, L. J. (1957). The two disciplines of scientific psychology. American Psychologist, 12, 671-684.
Crum, L. S. (1931). On analytical interpretations of straw vote samples. Journal of the American Statistical Association, 26, 243-261.
Crump, S. L. (1946). The estimation of variance components in analysis of variance. Biometric Bulletin, 2, 7-11.
Crump, S. L. (1951). The present status of variance component analysis. Biometrics, 7, 1-16.
Crutchfield, R. S. (1938). Efficient factorial design and analysis of variance illustrated in psychological experimentation. Journal of Psychology, 5, 339-346.
Daniels, H. E. (1939). The estimation of components of variance. Journal of the Royal Statistical Society, Supplement 6, 186-197.
Darwin, C. (1958). The origin of species. New York: New American Library. (Original work published 1859)
David, F. N. (1949). Probability theory for statistical methods. Cambridge, England: Cambridge University Press.
David, F. N. (1962). Games, gods and gambling. London: Charles Griffin.
Davis, C., & Gaito, J. (1984). Multiple comparison procedures within experimental research. Canadian Psychology, 25, 1-12.
Daw, R. H., & Pearson, E. S. (1979). Abraham De Moivre's 1733 derivation of the normal curve: A bibliographical note. In M. Kendall & R. L. Plackett (Eds.), Studies in the history of statistics and probability (Vol. II, pp. 63-66). New York: Macmillan. (Reprinted from Biometrika, 59, 677-680, 1972)
Dawson, M. M. (1914). The development of insurance mathematics. In L. W. Zartman (Ed.), Yale readings in insurance (W. H. Price, Revisor, pp. 95-119). New Haven, CT: Yale University Press. (Original work published 1901)
Dempster, A. P. (1964). On the difficulties inherent in Fisher's fiducial argument. Journal of the American Statistical Association, 59, 56-66.

De Moivre, A. (1967). The doctrine of chances: or, A method of calculating the probabilities of events in play (3rd ed.). New York: Chelsea. Includes a biographical note on De Moivre by Helen M. Walker. (Original work published 1756)
De Morgan, A. (1838). An essay on probabilities and on their application to life contingencies and insurance offices. In D. Lardner (Ed.), Cabinet Cyclopaedia (pp. 1-306). London: Longman, Orme, Brown, Green & Longman, & John Taylor.
Dodd, S. C. (1928). The theory of factors. Psychological Review, 35, I, 211-234; II, 261-279.
Duncan, D. B. (1951). A significance test for differences between ranked treatments in an analysis of variance. Virginia Journal of Science, 2, 171-189.
Duncan, D. B. (1955). Multiple range and multiple F tests. Biometrics, 11, 1-42.
Eden, T., & Fisher, R. A. (1927). Studies in crop variation. IV. The experimental determination of the value of top dressings with cereals. Journal of Agricultural Science, 17, 548-562.
Edgeworth, F. Y. (1887). Observations and statistics: An essay on the theory of errors of observation and the first principles of statistics. Transactions of the Cambridge Philosophical Society, 14, 138-169.
Edgeworth, F. Y. (1892). Correlated averages. Philosophical Magazine, 34, 190-204.
Edgeworth, F. Y. (1893). Note on the calculation of correlation between organs. Philosophical Magazine, 36, 350-351.
Edwards, A. E. (1950). Experimental design in psychological research. New York: Holt, Rinehart and Winston.
Eisenhart, C. (1947). The assumptions underlying the analysis of variance. Biometrics, 3, 1-21.
Eisenhart, C. (1970). On the transition from "Student's" z to "Student's" t. American Statistician, 33, 6-10.
Ellis, B. (1968). Basic concepts of measurement. Cambridge, England: Cambridge University Press.
Eysenck, H. J. (1947). The dimensions of personality. London: Routledge & Kegan Paul.
Eysenck, H. J., & Eysenck, M. W. (1985). Personality and individual differences: A natural science approach. New York: Plenum Press.
Fechner, G. (1966). Elements of psychophysics (H. Adler, Trans.). New York: Holt, Rinehart and Winston. (Original work published 1860)
Feigl, H. (1959). Philosophical embarrassments of psychology. American Psychologist, 14, 115-128.
Ferguson, G. A. (1959). Statistical analysis in psychology and education. New York: McGraw-Hill.
Finn, R. W. (1973). Domesday book: A guide. Chichester, England: Phillimore.
Fisher, R. A. (1915). Frequency distribution of values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10, 507-521.
Fisher, R. A. (1918). The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52, 399-433.
Fisher, R. A. (1919). The genesis of twins. Genetics, 4, 489-499.
Fisher, R. A. (1921a). On the "probable error" of a coefficient of correlation deduced from a small sample. Metron, 1, 3-32.
Fisher, R. A. (1921b). Studies in crop variation. I. An examination of the yield of dressed grain from Broadbalk. Journal of Agricultural Science, 11, 107-135.
Fisher, R. A. (1922a). On the interpretation of χ² from contingency tables and the calculation of P. Journal of the Royal Statistical Society, 85, 87-94.

Fisher, R. A. (1922b). The goodness of fit of regression formulae and the distribution of regression coefficients. Journal of the Royal Statistical Society, 85, 597-612.
Fisher, R. A. (1923). Statistical tests of agreement between observation and hypothesis. Economica, 3, 139-147.
Fisher, R. A. (1924a). The conditions under which χ² measures the discrepancy between observation and hypothesis. Journal of the Royal Statistical Society, 87, 442-450.
Fisher, R. A. (1924b). On a distribution yielding the error functions of several well known statistics. Proceedings of the International Congress of Mathematics, Toronto, 2, 805-813.
Fisher, R. A. (1925a). Applications of "Student's" distribution. Metron, 5, 90-104.
Fisher, R. A. (1925b). Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22, 700-725.
Fisher, R. A. (1926a). Bayes' theorem and the fourfold table. Eugenics Review, 18, 32-33.
Fisher, R. A. (1926b). The arrangement of field experiments. Journal of the Ministry of Agriculture of Great Britain, 33, 505-513.
Fisher, R. A. (1930). Inverse probability. Proceedings of the Cambridge Philosophical Society, 26, 528-535.
Fisher, R. A. (1934). Discussion of Dr Wishart's paper. Journal of the Royal Statistical Society, Supplement 1, 51-53.
Fisher, R. A. (1935). The logic of inductive inference. Journal of the Royal Statistical Society, 98, 39-54.
Fisher, R. A. (1937). Professor Karl Pearson and the method of moments. Annals of Eugenics, 7, 303-318.
Fisher, R. A. (1952). Statistical methods in genetics. Heredity, 6, 1-12.
Fisher, R. A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society, B, 17, 69-78.
Fisher, R. A. (1956). Statistical methods and scientific inference. New York: Hafner Press. (3rd ed., 1973)
Fisher, R. A. (1957). Comment on the notes by Neyman, Bartlett, and Welch in this Journal (Vol. 18, No. 2, 1956). Journal of the Royal Statistical Society, B, 19, 179.
Fisher, R. A. (1966). The design of experiments (8th ed.). Edinburgh: Oliver and Boyd. (Original work published 1935)
Fisher, R. A. (1970). Statistical methods for research workers (14th ed.). Edinburgh: Oliver and Boyd. (Original work published 1925)
Fisher, R. A. (1971). "Author's Note" preceding, Professor Karl Pearson and the method of moments. In J. H. Bennett (Ed.), Collected papers of R. A. Fisher (Vol. 4, p. 302). Adelaide: University of Adelaide. (Reprinted from Annals of Eugenics, 1939, 7, 303-318)
Fisher, R. A. (1971). "Author's Note" preceding, The comparison of samples with possibly unequal variances. In J. H. Bennett (Ed.), Collected papers of R. A. Fisher (Vol. 4, pp. 190-198). Adelaide: University of Adelaide. (Reprinted from Annals of Eugenics, 1939, 9, 174-180)
Fisher, R. A., & MacKenzie, W. A. (1923). Studies in crop variation. II. The manurial response of different potato varieties. Journal of Agricultural Science, 13, 311-320.
Fisher Box, J. (1978). R. A. Fisher, the life of a scientist. New York: Wiley.
Fisher Box, J. (1981). Gosset, Fisher, and the t distribution. American Statistician, 35, 61-66.
Fletcher, R. (1991). Science, ideology and the media: The Cyril Burt scandal. New Brunswick, NJ: Transaction Books.

Forrest, D. W. (1974). Francis Galton: The life and work of a Victorian genius. London: Elek.
Fox, J. (1984). Linear statistical models and related methods. New York: Wiley.
Funkhouser, H. G. (1936). A note on a 10th century graph. Osiris, 1, 260-262.
Funkhouser, H. G. (1937). Historical development of the graphical representation of data. Osiris, 3, 269-404.
Gaito, J. (1960). Expected mean squares in analysis of variance techniques. Psychological Reports, 7, 3-10.
Gaito, J. (1980). Measurement scales and statistics: Resurgence of an old misconception. Psychological Bulletin, 87, 564-567.
Galbraith, V. H. (1961). The making of Domesday book. Oxford: Oxford University Press.
Galton, F. (1877). Typical laws of heredity. Proceedings of the Royal Institution of Great Britain, 8, 282-301. [Also in Nature, 15, 492-495.]
Galton, F. (1884). On the anthropometric laboratory at the late International Health Exhibition. Journal of the Anthropological Institute of Great Britain and Ireland, 14, 205-221.
Galton, F. (1885a). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246-263.
Galton, F. (1885b). Opening address to the Anthropological Section of the British Association by the President of the Section. Nature, 32, 507-510.
Galton, F. (1886). Family likeness in stature. Proceedings of the Royal Society of London, 40, 42-73. (Includes an appendix by J. D. Hamilton Dickson)
Galton, F. (1888a). Co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society of London, 45, 135-145.
Galton, F. (1888b). Personal identification and description. Proceedings of the Royal Institution, 12, 346-360.
Galton, F. (1889). Natural inheritance. London: Macmillan.
Galton, F. (1907). Inquiries into human faculty and its development (3rd ed.). London: J. M. Dent. (Original work published 1883)
Galton, F. (1908). Memories of my life. London: Methuen.
Galton, F. (1962). Hereditary genius (second edition reprinted). London: Collins and New York: World. (Original work published 1869; 2nd ed., 1892)
Galton, F. (1970). English men of science. London: Frank Cass. (Original work published 1874)
Garnett, J. C. M. (1920). The single general factor in dissimilar mental measurements. British Journal of Psychology, 10, 242-258.
Garrett, H. E., & Zubin, J. (1943). The analysis of variance in psychological research. Psychological Bulletin, 40, 233-267.
Gauss, C. F. (1963). Theoria motus corporum coelestium [Theory of the motion of heavenly bodies]. New York: Dover. (Original work published 1809)
Gigerenzer, G. (1987). Probabilistic thinking and the fight against subjectivity. In L. Krüger, G. Gigerenzer, & M. Morgan (Eds.), The probabilistic revolution: Ideas in the sciences (Vol. 2, pp. 7-33). Cambridge, MA: MIT Press.
Gini, C. (1928). Une application de la méthode représentative aux matériaux du dernier recensement de la population italienne (1er décembre 1921) [An application of the representative method to material from the last census of the Italian population (1 December 1921)]. Bulletin of the International Statistical Institute, 23, 198-215.
Ginzburg, B. (1936). The scientific value of the Copernican induction. Osiris, 1, 303-313.
Glaisher, J. W. L. (1872). On the law of facility of errors of observations and the method of least squares. Royal Astronomical Society Memoirs, 39, 75-124.

Gould, S. J. (1981). The mismeasure of man. Markham: Penguin Books Canada.
Grant, D. A. (1944). On 'The analysis of variance in psychological research.' Psychological Bulletin, 41, 158-166.
Graunt, J. (1975). Natural and political observations mentioned in a following index and made upon the bills of mortality. New York: Arno Press. (Original work published 1662)
Grunbaum, A. (1952). Causality and the science of human behavior. American Scientist, 40, 665-676.
Guilford, J. P. (1936). Psychometric methods. New York: McGraw-Hill.
Guilford, J. P. (1950). Fundamental statistics in psychology and education. New York: McGraw-Hill.
Guilford, J. P. (1967). The nature of human intelligence. New York: McGraw-Hill.
Guthrie, S. C. (1946). A history of medicine. Philadelphia: J. B. Lippincott.
Hacking, I. (1965). The logic of statistical inference. Cambridge, England: Cambridge University Press.
Hacking, I. (1971). Jacques Bernoulli's Art of Conjecturing. British Journal for the Philosophy of Science, 22, 209-229.
Hacking, I. (1975). The emergence of probability. Cambridge, England: Cambridge University Press.
Haggard, E. A. (1958). Intraclass correlation and the analysis of variance. New York: Dryden Press.
Halley, E. (1693). An estimate of the mortality of mankind, drawn from curious tables of the births and funerals at the city of Breslaw; with an attempt to ascertain the price of annuities on lives. Philosophical Transactions of the Royal Society, 17, 596-610.
Harman, H. H. (1976). Modern factor analysis. Chicago: University of Chicago Press.
Harris, J. A. (1913). On the calculation of intra-class and inter-class coefficients of correlation from class moments when the number of possible combinations is large. Biometrika, 12, 22-27.
Hartley, D. (1966). Observations on man, his frame, his duty and his expectations. Gainesville, FL: Scholars' Facsimiles and Reprints. (Original work published 1749)
Hays, W. L. (1963). Statistics. New York: Holt, Rinehart and Winston.
Hays, W. L. (1973). Statistics for the social sciences. New York: Holt, Rinehart and Winston.
Hearnshaw, L. S. (1964). A short history of British psychology. New York: Barnes & Noble.
Hearnshaw, L. S. (1979). Cyril Burt, psychologist. London: Hodder and Stoughton.
Hearnshaw, L. S. (1987). The shaping of modern psychology. London: Routledge and Kegan Paul.
Heisenberg, W. (1927). The physical principles of the quantum theory. Chicago: University of Chicago Press.
Henkel, R. E., & Morrison, D. E. (1970). The significance test controversy. London: Butterworths.
Heron, D. (1911). The danger of certain formulae suggested as substitutes for the correlation coefficient. Biometrika, 8, 109-122.
Hick, W. E. (1952). A note on one-tailed and two-tailed tests. Psychological Review, 59, 316-318.
Hilbert, D. (1902). Mathematical problems. Bulletin of the American Mathematical Society, 8, 437-445; 478-479.
Hilgard, E. (1951). Methods and procedures in the study of learning. In S. S. Stevens (Ed.), Handbook of experimental psychology (pp. 517-567). New York: John Wiley.
Hogben, L. (1957). Statistical theory. London: Allen and Unwin.

Holschuh, N. (1980). Randomization and design: I. In S. E. Fienberg & D. V. Hinkley (Eds.), R. A. Fisher: An appreciation (pp. 35-45). New York: Springer-Verlag.
Holzinger, K. J., & Harman, H. H. (1941). Factor analysis: A synthesis of factorial methods. Chicago: University of Chicago Press.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417-441, 498-520.
Hotelling, H. (1935). The most predictable criterion. Journal of Educational Psychology, 26, 139-142.
Hume, D. (1951). An enquiry concerning human understanding. In D. C. Yalden-Thomson (Ed.), Hume, Theory of knowledge (pp. 3-176). Edinburgh: Thomas Nelson. (Original work published 1748)
Hunter, J. E. (1997). Needed: A ban on the significance test. Psychological Science, 8, 3-7.
Huxley, J. (1949). Half a century of genetics. (London) Sunday Times, 10th July.
Huygens, C. (1970). On reasoning in games of dice. In J. Bernoulli, The art of conjecture (F. Maseres, Ed. and Trans.). New York: Redex Microprint. (Original work published 1657)
Jacobs, J. (1885). Review of Ebbinghaus's Ueber das Gedächtnis. Mind, 10, 454-459.
Jeffreys, H. (1939). Random and systematic arrangements. Biometrika, 31, 1-8.
Jensen, A. R. (1992). Scientific fraud or false accusations? The case of Cyril Burt. In D. J. Miller & M. Hersen (Eds.), Research fraud in the behavioral and biomedical sciences (pp. 97-124). New York: John Wiley & Sons.
Jevons, W. (1874). The principles of science. London: Macmillan.
Jones, L. V. (1952). Tests of hypotheses: One-sided vs. two-sided alternatives. Psychological Bulletin, 49, 43-46.
Jones, L. V. (1954). A rejoinder on one-tailed tests. Psychological Bulletin, 51, 585-586.
Joynson, R. B. (1989). The Burt affair. London: Routledge.
Kelley, T. L. (1923). Statistical method. New York: Macmillan.
Kelley, T. L. (1928). Crossroads in the mind of man: A study of differentiable mental abilities. Stanford, CA: Stanford University Press.
Kempthorne, O. (1976). The analysis of variance and factorial design. In D. B. Owen (Ed.), On the history of statistics and probability (pp. 29-54). New York: Marcel Dekker.
Kempthorne, O. (1983). A review of R. A. Fisher: An appreciation. Journal of the American Statistical Association, 78, 482-490.
Kempthorne, O., & Folks, L. (1971). Probability, statistics, and data analysis. Ames: Iowa State University Press.
Kendall, M. G. (1952). George Udny Yule, 1871-1951. Journal of the Royal Statistical Society, 115A, 156-161.
Kendall, M. G. (1959). Hiawatha designs an experiment. American Statistician, 13, 23-24.
Kendall, M. G. (1961). Daniel Bernoulli on maximum likelihood. Biometrika, 48, 1-18. [This paper is followed by a translation by C. G. Allen of D. Bernoulli's The most probable choice between several discrepant observations and the formation therefrom of the most likely induction, 1777(?), together with a commentary by Leonard Euler (1707-1783).]
Kendall, M. G. (1963). Ronald Aylmer Fisher, 1890-1962. Biometrika, 50, 1-15.
Kendall, M. G. (1968). Studies in the history of probability and statistics, XIX. Francis Ysidro Edgeworth (1845-1926). Biometrika, 55, 269-275.
Kendall, M. G. (1972). The history and future of statistics. In T. A. Bancroft (Ed.), Statistical papers in honor of George W. Snedecor. Ames: Iowa State University Press.
Kendall, M. G., & Babington Smith, B. (1938). Randomness and random sampling numbers. Journal of the Royal Statistical Society, 101, 147-166.
Kenna, J. C. (1973). Galton's solution to the correlation problem: A false memory? Bulletin of the British Psychological Society, 26, 229-230.
Keuls, M. (1952). The use of studentized range in connection with an analysis of variance. Euphytica, 1, 112-122.
Kevles, D. (1985). In the name of eugenics. New York: Alfred A. Knopf.
Kiaer, A. N. (1899). Sur les méthodes représentatives ou typologies appliquées à la statistique (On representative methods or typologies applied to statistics). Bulletin of the International Statistical Institute, 11, 180-185.
Kiaer, A. N. (1903). Sur les méthodes représentatives ou typologies (On representative methods or typologies). Bulletin of the International Statistical Institute, 13, 66-78.
Kimmel, H. D. (1957). Three criteria for the use of one-tailed tests. Psychological Bulletin, 54, 351-353.
Kneale, W. (1949). Probability and induction. Oxford: Oxford University Press.
Koch, S. (Ed.). (1959-1963). Psychology: A study of a science (6 vols.). New York: McGraw-Hill.
Koch, S. (1980). Psychology and its human clientele: Beneficiaries or victims? In R. A. Kasschau and F. S. Kessel (Eds.), Psychology and society: In search of symbiosis (pp. 30-60). New York: Holt, Rinehart and Winston.
Kolmogorov, A. N. (1956). Foundations of the theory of probability (N. Morrison, Trans.). New York: Chelsea Publishing Co. (Original work published 1933)
Koren, J. (Ed.). (1918). The history of statistics. New York: Macmillan.
Koshal, R. S. (1933). Application of the method of maximum likelihood to the derivation of efficient statistics for fitting frequency curves. Journal of the Royal Statistical Society, 96, 303-313.
Kruskal, W., & Mosteller, F. (1979). Representative sampling, IV: The history of the concept in statistics, 1895-1939. International Statistical Review, 47, 169-195.
Lancaster, H. O. (1966). Forerunners of the Pearson χ2. Australian Journal of Statistics, 8, 117-126.
Lancaster, H. O. (1969). The chi-squared distribution. New York: Wiley.
Laplace, P. S. de (1810). Mémoire sur les approximations des formules qui sont fonctions de très grands nombres, et sur leur application aux probabilités (Memoir on approximations of formulae which are functions of very large numbers, and their application to probabilities). Mémoires de la classe des sciences mathématiques et physiques de l'Institut de France, Année 1809, 353-415; suppl. 559-565.
Laplace, P. S. de (1812). Théorie analytique des probabilités. Paris: Courcier.
Laplace, P. S. de (1951). A philosophical essay on probabilities (F. W. Truscott & F. L. Emory, Trans.). New York: Dover. (Original work published 1820)
Lawley, D. N., & Maxwell, A. E. (1971). Factor analysis as a statistical method (2nd ed.). London: Butterworth.
Le Cam, L., & Lehmann, E. L. (1974). J. Neyman. On the occasion of his 80th birthday. Annals of Statistics, 2, vii-xi.
Legendre, A. M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes [New methods for determining the orbits of comets]. Paris: Courcier.
Lehmann, E. L. (1959). Testing statistical hypotheses. New York: Wiley.
Lindquist, E. F. (1940). Statistical analysis in educational research. Boston: Houghton Mifflin.
Lord, F. M. (1953). On the statistical treatment of football numbers. American Psychologist, 8, 750-751.
Lovie, A. D. (1979). The analysis of variance in experimental psychology: 1934-1945. British Journal of Mathematical and Statistical Psychology, 32, 151-178.
Lovie, A. D. (1983). Images of man in early factor analysis - psychological and philosophical aspects. In S. M. Bem, H. Van Rappard & W. Van Hoorn (Eds.), Studies in the history of psychology and the social sciences, Proceedings of the first European meeting of CHEIRON, 1 (pp. 235-247). Leiden: Leiden University.
Lovie, P., & Lovie, A. D. (1993). Charles Spearman, Cyril Burt, and the origins of factor analysis. Journal of the History of the Behavioral Sciences, 29, 308-321.
Lovie, P., & Lovie, A. D. (1995). The cold equations: Spearman and Wilson on factor indeterminacy. British Journal of Mathematical and Statistical Psychology, 48, 237-253.
Lupton, S. (1898). Notes on observations. London: Macmillan.
Lush, J. L. (1972). Early statistics at Iowa State College. In T. A. Bancroft (Ed.), Statistical papers in honour of George W. Snedecor (pp. 211-226). Ames: Iowa State University Press.
Macdonell, W. R. (1901). On criminal anthropometry and the identification of criminals. Biometrika, 1, 177-227.
MacKenzie, D. A. (1981). Statistics in Britain 1865-1930. Edinburgh: Edinburgh University Press.
Magnello, M. E. (1998). Karl Pearson's mathematization of inheritance: From ancestral heredity to Mendelian genetics (1895-1909). Annals of Science, 55, 35-94.
Magnello, M. E. (1999). The non-correlation of biometrics and eugenics: Rival forms of laboratory work in Karl Pearson's career at University College London. History of Science, 37, Part 1, 79-106; Part 2, 123-150.
Marks, M. R. (1951). Two kinds of experiment distinguished in terms of statistical operations. Psychological Review, 58, 179-184.
Marks, M. R. (1953). One- and two-tailed tests. Psychological Review, 60, 207-208.
Maunder, W. F. (1972). Sir Arthur Lyon Bowley. An inaugural lecture delivered in the University of Exeter. In M. G. Kendall & R. L. Plackett (Eds.), Studies in the history of statistics and probability (1977, Vol. II, pp. 459-480). New York: Macmillan.
Maxwell, A. E. (1977). Multivariate analysis in behavioural research. London: Chapman and Hall.
McMullen, L. (1), & Pearson, E. S. (2). (1939). William Sealy Gosset, 1876-1937. (1) "Student" as a man; (2) "Student" as a statistician. Biometrika, 30, 205-250.
McNemar, Q. (1940a). Sampling in psychological research. Psychological Bulletin, 37, 331-365.
McNemar, Q. (1940b). Review of E. F. Lindquist, Statistical analysis in educational research. Psychological Bulletin, 37, 746-748.
McNemar, Q. (1949). Psychological statistics. New York: Wiley.
Mercer, W. B., & Hall, A. D. (1911). The experimental error of field trials. Journal of Agricultural Science, 4, 107-132.
Merriman, M. (1877). A list of writings relating to the method of least squares, with historical and critical notes. Transactions of the Connecticut Academy of Arts and Sciences, 4, 151-232.
Merriman, M. (1884). A textbook on the method of least squares (8th ed.). New York: Wiley.
Michéa, R. (1938). Les variations de la raison au XVIIe siècle; essai sur la valeur du langage employé en histoire littéraire [Differences in meaning in the 17th century; essay on the value of language employed in literary history]. Revue Philosophique de la France et de l'Étranger, 126, 183-201.
Mill, J. S. (1973). A system of logic ratiocinative and inductive. Being a connected view of the principles of evidence and the methods of scientific investigation (8th ed., 1872; J. M. Robson, Ed.). Toronto: University of Toronto Press. (Original work published 1843)
Miller, A. G. (Ed.). (1972). The social psychology of the psychological experiment. New York: Free Press.
Miller, G. A. (1963). Ronald A. Fisher: 1890-1962. American Journal of Psychology, 76, 157-158.
Miller, G. A. (1969). Psychology as a means of promoting human welfare. American Psychologist, 24, 1063-1075.
Nagel, E. (1936). The meaning of probability. Journal of the American Statistical Association, 31, 10-30. [Reprinted with a commentary in J. R. Newman (Ed.), The world of mathematics, 1956, Vol. II, pp. 1398-1414. New York: Simon and Schuster.]
Newman, D. (1939). The distribution of the range in samples from a normal population expressed in terms of an independent estimate of standard deviation. Biometrika, 31, 20-30.
Newman, J. R. (Ed.). (1956). The world of mathematics (Vols. 1-4). New York: Simon and Schuster.
Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97, 558-625.
Neyman, J. (1935). Statistical problems in agricultural experimentation. Journal of the Royal Statistical Society, Supplement, 2, 107-154.
Neyman, J. (1937). Outline of a theory of estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society, A, 236, 333-380.
Neyman, J. (1941). Fiducial argument and the theory of confidence intervals. Biometrika, 32, 128-150.
Neyman, J. (1956). Note on an article by Sir Ronald Fisher. Journal of the Royal Statistical Society, B, 18, 288-294.
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika, 20A, Part I, 175-240; Part II, 263-294.
Neyman, J., & Pearson, E. S. (1933). The testing of statistical hypotheses in relation to probabilities a priori. Proceedings of the Cambridge Philosophical Society, 29, 492-510.
Norton, B., & Pearson, E. S. (1976). A note on the background to, and refereeing of, R. A. Fisher's 1918 paper 'On the correlation between relatives on the supposition of Mendelian inheritance.' Notes and Records of the Royal Society, 31, 151-162.
Ore, O. (1953). Cardano: The gambling scholar. Princeton, NJ: Princeton University Press.
Osgood, C. E. (1953). Method and theory in experimental psychology. New York: Oxford University Press.
Parten, M. (1966). Surveys, polls, and samples: Practical procedures. New York: Cooper Square Publishers.
Pearl, R. (1917). The probable error of a Mendelian class frequency. American Naturalist, 51, 144-156.
Pearson, E. S. (1938a). Karl Pearson: An appreciation of some aspects of his life and work. Cambridge, England: Cambridge University Press.
Pearson, E. S. (1938b). Some aspects of the problem of randomization. II. An illustration of 'Student's' inquiry into the effect of balancing in agricultural experiments. Biometrika, 30, 159-171.
Pearson, E. S. (1939a). William Sealy Gosset, 1876-1937. (2) "Student" as a statistician. Biometrika, 30, 205-250.
Pearson, E. S. (1939b). Note on the inverse and direct methods of estimation in R. D. Gordon's problem. Biometrika, 31, 181-186.
Pearson, E. S. (1965). Some incidents in the early history of biometry and statistics, 1890-94. Biometrika, 52, 3-18.
Pearson, E. S. (1966). The Neyman-Pearson story: 1926-34. Historical sidelights on an episode in Anglo-Polish collaboration. In F. N. David (Ed.), Festschrift for J. Neyman (pp. 1-23). London: Wiley.
Pearson, E. S. (1968). Some early correspondence between W. S. Gosset, R. A. Fisher and Karl Pearson, with notes and comments. Biometrika, 55, 445-457.
Pearson, E. S. (1970). Karl Pearson's lectures on the history of statistics in the 17th and 18th centuries. Appendix to E. S. Pearson & M. G. Kendall (Eds.), Studies in the history of statistics and probability (pp. 479-481). London: Griffin.
Pearson, K. (1892). The grammar of science. London: Scott.
Pearson, K. (1894). Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society, A, 185, 71-110.
Pearson, K. (1895). Contributions to the mathematical theory of evolution. II. Skew variation in homogeneous material. Philosophical Transactions of the Royal Society, A, 186, 343-414.
Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society, A, 187, 253-318.
Pearson, K. (1900a). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50, 157-175.
Pearson, K. (1900b). Mathematical contributions to the theory of evolution. VII. On the correlation of characters not quantitatively measurable. Philosophical Transactions of the Royal Society, A, 195, 1-47.
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine (Series 6), 2, 559-572.
Pearson, K. (1903). On the laws of inheritance in man, I. Biometrika, 2, 357-462.
Pearson, K. (1904a). On the laws of inheritance in man, II. Biometrika, 3, 131-190.
Pearson, K. (1904b). Mathematical contributions to the theory of evolution. XIII. On the theory of contingency and its relation to association and normal correlation. Draper's Company Research Memoirs, Biometric Series 1. 35 pages.
Pearson, K. (1906). Walter Frank Raphael Weldon, 1860-1906. Biometrika, 5, 1-52.
Pearson, K. (1907). Reply to certain criticisms of Mr G. U. Yule. Biometrika, 5, 470-476.
Pearson, K. (1909). On the test of goodness of fit of observation to theory in Mendelian experiments. Biometrika, 9, 309-314.
Pearson, K. (1911). On the probability that two independent distributions of frequency are really samples from the same population. Biometrika, 8, 250-254.
Pearson, K. (1916a). On a brief proof of the fundamental formula for testing the goodness of fit of frequency distributions and of the probable error of "P." Philosophical Magazine, 31, 369-378.
Pearson, K. (1916b). Mathematical contributions to the theory of evolution. XIX. Second supplement to a memoir on skew variation. Philosophical Transactions of the Royal Society, A, 216, 429-457.
Pearson, K. (1917). The probable error of a Mendelian class frequency. Biometrika, 11, 429-432.
Pearson, K. (1920). Notes on the history of correlation. Biometrika, 13, 25-45.
Pearson, K. (1922). On the χ2 test of goodness of fit. Biometrika, 14, 186-191.
Pearson, K. (1924a). Historical note on the origin of the normal curve of errors. Biometrika, 16, 402-404.
Pearson, K. (1924b). On the difference and the doublet tests for ascertaining whether two samples have been drawn from the same population. Biometrika, 16, 249-252.
Pearson, K. (1926). Abraham de Moivre. Reply to Professor Archibald. Nature, 117, 551-552.
Pearson, K. (1927). The mathematics of intelligence. Review of C. Spearman, The abilities of man, their nature and measurement. Nature, 120, 181-183.
Pearson, K. (1914-1930). The life, letters and labours of Francis Galton. Cambridge, England: Cambridge University Press. (Vol. 1, 1914; Vol. 2, 1924; Vols. 3A and 3B, 1930)
Pearson, K. (1935). Statistical tests. Nature, 136, 296-297.
Pearson, K. (1936). Method of moments and method of maximum likelihood. Biometrika, 28, 34-59.
Pearson, K. (1978). The history of statistics in the 17th and 18th centuries against the changing background of intellectual, scientific and religious thought (E. S. Pearson, Ed.). London: Charles Griffin.
Pearson, K., & Heron, D. (1913). On theories of association. Biometrika, 9, 159-315.
Pearson, K., & Moul, M. (1927). The mathematics of intelligence. Biometrika, 19, 246-291.
Peirce, B. (1852). Criterion for the rejection of doubtful observations. The Astronomical Journal, 2, 161-163.
Peters, C. C. (1943). Misuses of the Fisher statistics. Journal of Educational Research, 36, 546-549.
Petrinovich, L. F., & Hardyck, C. D. (1969). Error rates for multiple comparison methods: Some evidence concerning the frequency of erroneous conclusions. Psychological Bulletin, 71, 43-54.
Petty, W. (1690). Political arithmetick. London: Printed for Robert Clavel and Henry Mortlock.
Phillips, L. D. (1973). Bayesian statistics for social scientists. London: Nelson.
Picard, R. (1980). Randomization and design: II. In S. E. Fienberg & D. V. Hinkley (Eds.), R. A. Fisher: An appreciation (pp. 46-58). New York: Springer-Verlag.
Playfair, W. (1801a). Commercial and political atlas (3rd ed.). London: Printed by T. Burton.
Playfair, W. (1801b). Statistical breviary; shewing on a principle entirely new, the resources of every state and kingdom in Europe. London: Printed by T. Bensley for J. Wallis, Egerton, Vernor and Hood, Black and Parry, and Tibbet and Didier.
Poisson, S.-D. (1837). Recherches sur la probabilité des jugements en matière criminelle et en matière civile, précédées des règles générales du calcul des probabilités (Research on the probability of judgments in criminal and civil matters, preceded by general rules for the calculation of probabilities). Paris: Bachelier.
Popper, K. R. (1959). The logic of scientific discovery. London: Hutchinson.
Popper, K. R. (1962). Conjectures and refutations: The growth of scientific knowledge. New York: Basic Books.
Porter, T. (1986). The rise of statistical thinking. Princeton, NJ: Princeton University Press.
Quetelet, L. A. (1849). Letters addressed to H.R.H. the Grand Duke of Saxe Coburg and Gotha on the Theory of Probabilities as applied to the moral and political sciences (O. G. Downes, Trans.). London: Charles & Edwin Layton. Authorized facsimile of the original book produced in 1975 by Xerox University Microfilms, Ann Arbor, MI. (Original work published 1835)
RAND Corporation. (1965). A million random digits with 100,000 normal deviates. New York: Free Press.
Reichenbach, H. (1938). Experience and prediction. An analysis of the foundations and structure of knowledge. Chicago: University of Chicago Press.
Reid, C. (1982). Neyman - from life. New York: Springer-Verlag.
Reitz, W. (1934). Statistical techniques for the study of institutional differences. Journal of Experimental Education, 3, 11-24.
Rhodes, E. C. (1924). On the problem whether two given samples can be supposed to have been drawn from the same population. Biometrika, 16, 239-248.
Robinson, C. (1932). Straw votes. New York: Columbia University Press.
Robinson, C. (1937). Recent developments in the straw-poll field. Public Opinion Quarterly, 1(3), 45-56, and 1(4), 42-52.
Rosenthal, R., & Rosnow, R. L. (Eds.). (1969). Artifact in behavioral research. New York: Academic Press.
Rowe, F. B. (1983). Whatever became of poor Kinnebrook? American Psychologist, 38, 851-852.
Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.
Rucci, A. J., & Tweney, R. D. (1980). Analysis of variance and the "Second Discipline" of scientific psychology: A historical account. Psychological Bulletin, 87, 166-184.
Russell, B. (1931). The scientific outlook. London: Allen and Unwin.
Russell, B. (1961). History of Western philosophy and its connection with political and social circumstances from the earliest times to the present day. London: Allen and Unwin. (Original work published 1946)
Russell, Sir John. (1926). Field experiments: How they are made and what they are. Journal of the Ministry of Agriculture of Great Britain, 32, 989-1001.
Rutherford, E., & Geiger, H. (1910). The probability variations in the distribution of α particles. Philosophical Magazine, 20, 698-707.
Ryan, T. A. (1959). Multiple comparisons in psychological research. Psychological Bulletin, 56, 26-47.
Savage, L. J. (1962). The foundations of statistical inference. New York: Wiley.
Savage, L. J. (1976). On rereading R. A. Fisher. Annals of Statistics, 4, 441-500.
Scheffé, H. (1953). A method for judging all contrasts in the analysis of variance. Biometrika, 40, 87-104.
Scheffé, H. (1956a). Alternative models for the analysis of variance. Annals of Mathematical Statistics, 27, 23-36.
Scheffé, H. (1956b). A 'mixed model' for the analysis of variance. Annals of Mathematical Statistics, 27, 251-271.
Scheffé, H. (1959). The analysis of variance. New York: Wiley.
Seal, H. L. (1967). The historical development of the Gauss linear model. Biometrika, 54, 1-24.
Seng, Y. P. (1951). Historical survey of the development of sampling theory and practice. Journal of the Royal Statistical Society, 114, 214-231.
Sheynin, O. B. (1966). Origin of the theory of error. Nature, 211, 1003-1004.
Sheynin, O. B. (1978). S.-D. Poisson's work in probability. Archives for History of Exact Sciences, 18, 245-300.
Sidman, M. (1960). Tactics of scientific research. New York: Basic Books.
Simpson, T. (1755). A letter to the Right Honourable George Earl of Macclesfield, President of the Royal Society, on the advantage arising in taking the mean of a number of observations, in practical astronomy. Philosophical Transactions of the Royal Society, 49, 82-93.
Simpson, T. (1757). On the advantage arising from taking the mean of a number of observations, in practical astronomy, wherein the odds that the result in this way is more exact than from one single observation is evinced, and the utility of the method in practise clearly made appear. In Miscellaneous tracts on some curious and very interesting subjects in mechanics, physical astronomy and speculative mathematics (pp. 64-75). London: Printed for J. Nourse.
Skinner, B. F. (1953). Science and human behavior. New York: Macmillan.
Smith, D. E. (1929). A source book in mathematics. New York: McGraw-Hill.
Smith, K. (1916). On the "best" values of the constants in frequency distributions. Biometrika, 11, 262-276.
Snedecor, G. W. (1934). Calculation and interpretation of analysis of variance and covariance. Ames, IA: Collegiate Press.
Snedecor, G. W. (1937). Statistical methods. Ames, IA: Collegiate Press.
Soper, H. E. (1913). On the probable error of the correlation coefficient to a second approximation. Biometrika, 9, 91-115.
Soper, H. E., Young, A. W., Cave, B. M., Lee, A., & Pearson, K. (1917). On the distribution of the correlation coefficient in small samples. A co-operative study. Biometrika, 11, 328-413.
Spearman, C. (1904a). General intelligence, objectively determined and measured. American Journal of Psychology, 15, 201-293.
Spearman, C. (1904b). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.
Spearman, C. (1927). The abilities of man. New York: Macmillan.
Spearman, C. (1933). The factor theory and its troubles. Journal of Educational Psychology, 24, 521-524.
Spearman, C. (1939). The factorial analysis of human ability. II. Determination of factors. British Journal of Psychology, 30, 78-83.
St. Cyres, Viscount. (1909). Pascal. London: Smith, Elder and Co.
Stanley, J. C. (1966). The influence of Fisher's "The design of experiments" on educational research thirty years later. American Educational Research Journal, 3, 223-229.
Stephan, F. F. (1948). History of the use of modern sampling procedures. Journal of the American Statistical Association, 43, 12-39.
Stephen, L. (1876). History of English thought in the eighteenth century (Vols. 1-2). London: Smith, Elder and Co.
Stephenson, W. (1939). The factorial analysis of human ability. IV. Abilities defined as non-fractional factors. British Journal of Psychology, 30, 94-104.
Stevens, S. S. (1951). Mathematics, measurement and psychophysics. In S. S. Stevens (Ed.), Handbook of experimental psychology (pp. 1-49). New York: Wiley.
Stigler, S. M. (1977). Eight centuries of sampling inspection: The trial of the Pyx. Journal of the American Statistical Association, 72, 493-500.
Struik, D. J. (1954). A concise history of mathematics. London: Bell.
"Student" (1907). On the error of counting with a haemacytometer. Biometrika, 5, 351-360.
"Student" (1908a). The probable error of a mean. Biometrika, 6, 1-25.
"Student" (1908b). Probable error of a correlation coefficient. Biometrika, 6, 302-310.
"Student" (1925). New tables for testing the significance of observations. Metron, 5, 25-32.
"Student" (1927). Errors of routine analysis. Biometrika, 19, 151-164.
Tedin, O. (1931). The influence of systematic plot arrangements upon the estimate of error in field experiments. Journal of Agricultural Science, 21, 191-208.
Thomson, G. H. (1939a). The factorial analysis of human ability. I. The present position and the problems confronting us. British Journal of Psychology, 30, 71-77.
Thomson, G. H. (1939b). The factorial analysis of human ability - agreement and disagreement in factor analysis: A summing up. British Journal of Psychology, 30, 105-108.
Thomson, G. H. (1946). The factorial analysis of human ability. London: University of London Press. (First edition 1939)
Thomson, G. H. (1969). The education of an Englishman. Edinburgh: Moray House Publications.
Thomson, W. (Lord Kelvin). (1891). Popular lectures and addresses. London: Macmillan.
Thorndike, E. L. (1905). Measurements of twins. Archives of Philosophy, Psychology, and Scientific Methods, No. 1. New York: Science Press.
Thorndike, E. L., & Woodworth, R. S. (1901). The influence of improvement in one mental function upon the efficiency of other functions. Psychological Review, 8, 247-261.
Thurstone, L. L. (1935). The vectors of the mind. Chicago: University of Chicago Press.
Thurstone, L. L. (1940). Current issues in factor analysis. Psychological Bulletin, 37, 189-236.
Thurstone, L. L. (1947). Multiple factor analysis. Chicago: University of Chicago Press.
Tippett, L. H. C. (1925). On the extreme individuals and the range of samples taken from a normal population. Biometrika, 17, 364-387.
Todhunter, I. (1865). A history of the mathematical theory of probability from the time of Pascal to that of Laplace. Cambridge and London: Macmillan. [Reprinted in 1965 by the Chelsea Publishing Co., New York.]
Traxler, R. H. (1976). A snag in the history of factorial experiments. In D. B. Owen (Ed.), On the history of probability and statistics (pp. 283-295). New York: Marcel Dekker.
Tukey, J. W. (1949). Comparing individual means in the analysis of variance. Biometrics, 5, 99-114.
Tukey, J. W. (1953). The problem of multiple comparisons. Unpublished manuscript, Princeton University, Princeton, NJ.
Turner, F. M. (1978). The Victorian conflict between science and religion: A professional dimension. Isis, 69, 356-376.
Underwood, B. J. (1949). Experimental psychology. New York: Appleton-Century-Crofts.
Venn, J. (1888). The logic of chance (3rd ed.). London: Macmillan. (Original work published 1866)
Venn, J. (1891). On the nature and uses of averages. Journal of the Royal Statistical Society, 54, 429-448.
von Mises, R. (1919). Grundlagen der Wahrscheinlichkeitsrechnung (Principles of probability theory). Mathematische Zeitschrift, 5, 52-99.
von Mises, R. (1957). Probability, statistics and truth (2nd rev. English ed. prepared by H. Geiringer). London: Allen and Unwin.
Wald, A. (1950). Statistical decision functions. New York: Wiley.
Walker, E. L. (1970). Psychology as a natural and social science. Belmont, CA: Brooks/Cole.
Walker, H. M. (1929). Studies in the history of statistical method. Baltimore, MD: Williams and Wilkins.
Wallis, W. A., & Roberts, H. V. (1956). Statistics: A new approach. New York: Free Press.
Watson, J. B. (1913). Psychology as the behaviorist views it. Psychological Review, 20, 158-177.
Weaver, W. (1977). Lady Luck: The theory of probability. Harmondsworth: Penguin Books. (Original work published 1963)
Welch, B. L. (1939). On confidence limits and sufficiency with particular reference to parameters of location. Annals of Mathematical Statistics, 10, 58-69.
Welch, B. L. (1958). "Student" and small sample theory. Journal of the American Statistical Association, 53, 777-788.
Weldon, W. F. R. (1890). The variations occurring in certain Decapod Crustacea. I. Crangon vulgaris. Proceedings of the Royal Society of London, 47, 445-453.
Weldon, W. F. R. (1892). Certain correlated variations in Crangon vulgaris. Proceedings of the Royal Society of London, 51, 2-21.
Weldon, W. F. R. (1893). On certain correlated variations in Carcinus moenas. Proceedings of the Royal Society of London, 54, 318-329.
Westergaard, H. (1932). Contributions to the history of statistics. London: P. S. King.
Whitney, C. A. (1984, October). Generating and testing pseudorandom numbers. Byte, p. 128.
Wilson, E. B. (1928). Review of The abilities of man, their nature and measurement by Charles Spearman. Science, 67, 244-248.
Wilson, E. B. (1929a). Review of Crossroads in the mind of man by T. L. Kelley. Journal of General Psychology, 2, 153-169.
Wilson, E. B. (1929b). Comment on Professor Spearman's note. Journal of Educational Psychology, 20, 217-223.
Wishart, J. (1934). Statistics in agricultural research. Journal of the Royal Statistical Society, Supplement, 1, 26-51.
Wolfle, D. (1940). Factor analysis to 1940. Chicago: University of Chicago Press.
Wood, T. B., & Stratton, F. J. M. (1910). The interpretation of experimental results. Journal of Agricultural Science, 3, 417-440.
Woodworth, R. S., & Schlosberg, H. (1954). Experimental psychology. New York: Holt, Rinehart & Winston.
Yates, F. (1935). Complex experimentation. Journal of the Royal Statistical Society, Supplement, 2, 181-247.
Yates, F. (1939). The comparative advantages of systematic and randomized arrangements in the design of agricultural and biological experiments. Biometrika, 30, 440-466.
Yates, F. (1951). The influence of Statistical Methods for Research Workers on the development of the science of statistics. Journal of the American Statistical Association, 46, 19-34.
Yates, F. (1964). Sir Ronald Fisher and the design of experiments. Biometrics, 20, 307-321.
Yule, G. U. (1897a). Notes on the teaching of the theory of statistics at University College. Journal of the Royal Statistical Society, 60, 456-458.
Yule, G. U. (1897b). On the theory of correlation. Journal of the Royal Statistical Society, 60, 812-854.
Yule, G. U. (1900). On the association of attributes in statistics. Philosophical Transactions of the Royal Society, A, 194, 257-319.
Yule, G. U. (1906). On a property which holds good for all groupings of a normal distribution of frequency for two variables. Proceedings of the Royal Society, A, 77, 324-336.
Yule, G. U. (1912). On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, 75, 579-642.
Yule, G. U. (1921). Review of W. Brown & G. H. Thomson, The essentials of mental measurement. British Journal of Psychology, 2, 100-107.
Yule, G. U. (1936). Karl Pearson, 1857-1936. Obituary Notices of Fellows of the Royal Society of London, 2, 73-104.
Yule, G. U. (1938a). Notes of Karl Pearson's lectures on Theory of Statistics. Biometrika, 30, 198-203.
Yule, G. U. (1938b). A test of Tippett's random sampling numbers. Journal of the Royal Statistical Society, 101, 167-172.
Yule, G. U. (1939). [Review of] Karl Pearson: An appreciation of some aspects of his life and work. By E. S. Pearson. Cambridge: University Press, 1938. Nature, 143, 220-222.
Yule, G. U., & Kendall, M. G. (1950). An introduction to the theory of statistics (14th ed.). London: Charles Griffin. (1st ed., Yule, 1911)
Author Index

A

Acree, M. C., 57, 61, 236
Adams, W. J., 125-126, 236
Airy, G. B., 107, 115, 236
Anderson, R. L., 213, 236
Arbuthnot, J., 51-52, 235

B

Babington Smith, B., 86-88, 243
Bancroft, T. A., 213, 235
Barnard, G. A., 77, 235
Bartlett, M. S., 83, 177, 235
Baxter, B., 208, 236
Bayes, T., 77-78, 235
Beloff, H., 163, 235
Beloff, J., 209, 235
Bennett, J. H., 113, 235
Berkson, J., 83, 236
Bernard, C., 18, 236
Bernoulli, J., 9, 52, 59, 125, 236
Bilodeau, E. A., 22, 236
Boring, E. G., 13, 24, 64, 65, 237
Bowley, A. L., 95-96, 98-99, 108, 114, 237
Box, G. E. P., 7, 237
Bravais, A., 146-147, 237
Brennan, J. F., 207, 237
Buck, P., 50, 237
Burke, C. J., 203, 237
Burt, C., 5, 156, 162-164, 167, 237

C

Campbell, D. T., 206, 237
Carrington, W., 209, 237
Carroll, J. B., 169, 237
Cattell, R. B., 170, 237
Cave, B. M., 118, 250
Chang, W.-C., 95, 99, 107, 238
Clark, R. W., 23-24, 238
Cochran, W. G., 188, 238
Cornfield, J., 214, 238
Cowan, R. S., 4, 130, 238
Cowles, M., 17, 21, 200, 221, 238
Cronbach, L. J., 34, 189, 208, 215, 238
Crum, L. S., 94, 238
Crump, S. L., 213, 238

D

Daniels, H. E., 213, 238
Darwin, C., 1, 3, 129-131, 238
David, F. N., 56, 58, 65, 238
Davis, C., 17, 21, 196-197, 200, 221, 238
Daw, R. H., 64, 238
Dawson, M. M., 51, 53, 238
Dempster, A. P., 83, 238
De Moivre, A., 10, 14, 20, 52-53, 63-65, 69, 72-76, 239
De Morgan, A., 12, 61, 239
Dodd, S. C., 34, 239
Duncan, D. B., 197, 239

E

Eden, T., 178, 239
Edgeworth, F. Y., 90, 98, 110, 117, 145-146, 155, 239
Edwards, A. E., 213, 239
Eisenhart, C., 117, 195, 213-214, 239
Ellis, B., 40-41, 239
Eysenck, H. J., 170, 239

F

Fechner, G., 33, 239
Feigl, H., 25, 239
Ferguson, G. A., 213
Finn, R. W., 48, 239
Fisher, R. A., 4-6, 38, 62, 81-82, 99, 102, 110-111, 113-115, 118-122, 171, 177-181, 183-184, 186-200, 202-203, 205-206, 210, 212-213, 216, 225, 228-230, 232-233, 239, 240
Fisher Box, J., 112, 114, 117, 119-121, 179, 181, 183, 200, 224-226, 240
Fletcher, R., 163, 240
Folks, L., 204, 243
Forrest, D. W., 130, 137, 241
Fox, J., 175, 199, 241
Funkhouser, H. G., 54, 241

G

Gaito, J., 44, 196, 197, 214, 241
Galbraith, V. H., 48, 241
Galton, F., 2, 4-5, 13-14, 46, 128-142, 145-146, 155, 159, 241
Garnett, J. C. M., 161, 241
Garrett, H. E., 209-210, 241
Gauss, C. F., 64-65, 91-92, 126, 181-182, 241
Geiger, H., 71, 249
Gigerenzer, G., 233, 241
Gini, C., 99, 241
Ginzburg, B., 28, 241
Glaisher, J. W. L., 92, 241
Gould, S. J., 155, 242
Grant, D. A., 210-211, 242
Graunt, J., 8, 48-52, 242
Grünbaum, A., 25, 242
Guilford, J. P., 169, 213, 242
Guthrie, S. C., 172, 242

H

Hacking, I., 57, 59, 63, 80, 83, 125, 242
Haggard, E. A., 193, 242
Hall, A. D., 101, 245
Halley, E., 51-53, 242
Hardyck, C. D., 196, 245
Harman, H. H., 170, 242
Harris, J. A., 190, 242
Hartley, D., 176, 242
Hays, W. L., 195, 242
Hearnshaw, L. S., 28, 156, 162-164, 210, 242
Heisenberg, W., 23, 46, 242
Henkel, R. E., 83, 199, 242
Heron, D., 149-151, 242, 245
Hick, W. E., 203, 242
Hilbert, D., 66, 242
Hilgard, E., 206, 242
Hogben, L., 82, 242
Holschuh, N., 19, 183, 243
Holzinger, K. J., 160, 170, 243
Hotelling, H., 155, 213, 243
Hume, D., 30, 243
Hunter, J. E., 83, 243
Huxley, J., 18, 129, 243
Huygens, C., 19, 51, 59, 63, 243

J

Jacobs, J., 37, 243
Jeffreys, H., 103, 243
Jensen, A. R., 163, 243
Jevons, W., 54, 243
Jones, L. V., 203, 243
Joynson, R. B., 163, 243

K

Kelley, T. L., 164-165, 243
Kempthorne, O., 184, 189, 204, 226, 243
Kendall, M. G., 6, 37-38, 64, 81-82, 86-88, 110, 138, 185, 195, 213, 216, 243
Kenna, J. C., 5, 244
Keuls, M., 197, 244
Kevles, D., 15, 244
Kiaer, A. N., 96-97, 244
Kimmel, H. D., 203, 244
Kneale, W., 60, 244
Koch, S., 234, 244
Kolmogorov, A. N., 66-67, 244
Koren, J., 48, 244
Koshal, R. S., 228, 244
Kruskal, W., 96-97, 244
L

Lancaster, H. O., 107, 244
Laplace, P. S. de, 10, 22, 61, 64-65, 79, 92-93, 95, 126, 244
Lawley, D. N., 154, 156, 244
Le Cam, L., 222, 244
Lee, A., 118, 189, 250
Legendre, A. M., 91, 244
Lehmann, E. L., 219, 221-222, 244
Lindquist, E. F., 210-211, 213, 244
Lord, F. M., 44, 244
Lovie, A. D., 34, 161-165, 194, 198, 208-210, 212, 245
Lovie, P., 161-163, 165, 245
Lupton, S., 115, 245
Lush, J. L., 194, 198, 245

M

Macdonell, W. R., 117, 155, 245
MacKenzie, D. A., 4, 15, 105, 130-131, 144, 150, 152, 245
MacKenzie, W. A., 121, 187, 240
Magnello, M. E., 15, 245
Marks, M. R., 203, 245
Maunder, W. F., 96
Maxwell, A. E., 154, 156, 244
McMullen, L., 115, 245
McNemar, Q., 104, 213, 245
Mercer, W. B., 101, 245
Merriman, M., 91, 115, 245
Michéa, R., 58, 245
Mill, J. S., 173-174, 245
Miller, A. G., 19, 46, 245
Miller, G. A., 19, 234, 245
Morrison, D. E., 83, 199, 242
Mosteller, F., 96-97, 206, 244
Moul, M., 161, 248

N

Nagel, E., 61-63, 246
Newman, D., 196, 246
Newman, J. R., 5, 59-60, 184, 246
Neyman, J., 85, 98-100, 198, 201-203, 219-225, 227, 230, 246
Norton, B., 112, 246

O

Ore, O., 8, 57-58, 246
Osgood, C. E., 205, 246

P

Parten, M., 94, 246
Pearl, R., 111, 246
Pearson, E. S., 16, 64, 103, 108, 112-113, 115-118, 202, 217-218, 220-224, 227, 230, 238, 246, 247
Pearson, K., 4, 16, 18, 44, 49, 51, 53, 64, 75-76, 102, 106, 108-114, 117-118, 127, 137, 141, 143-151, 153, 155, 157-158, 161, 182, 189, 200, 206, 217, 222, 232, 248
Peirce, B., 62, 107, 248
Peters, C. C., 194, 248
Petrinovich, L. F., 196, 248
Petty, W., 48-50, 248
Phillips, L. D., 79, 248
Picard, R., 103, 248
Playfair, W., 53-54, 248
Poisson, S.-D., 58-59, 70-72, 248
Popper, K. R., 30, 32, 248
Porter, T., 15, 248

Q

Quetelet, L. A., 46, 55, 91, 249

R

RAND Corporation, 88, 249
Reichenbach, H., 32, 249
Reid, C., 217, 219, 221-222, 225-227, 249
Reitz, W., 207, 249
Rhodes, E. C., 217, 249
Roberts, H. V., 138, 252
Robinson, C., 94, 249
Rosenthal, R., 46, 249
Rosnow, R. L., 46, 249
Rowe, F. B., 45, 249
Rozeboom, W. W., 202, 249
Rucci, A. J., 207-208, 211-213, 249
Russell, B., 23, 28, 154, 249
Russell, J., 179, 249
Rutherford, E., 71, 249
Ryan, T. A., 196-197, 249

S

Savage, L. J., 102, 231, 249
Scheffé, H., 197, 213, 249
Schlosberg, H., 206, 252
Seal, H. L., 181-182, 249
Seng, Y. P., 96-97, 250
Sheynin, O. B., 72, 107, 250
Sidman, M., 207, 250
Simpson, T., 53, 89, 125, 250
Skinner, B. F., 24, 250
Smith, D. E., 59, 250
Smith, K., 112, 250
Snedecor, G. W., 123, 193, 194, 213, 234, 250
Soper, H. E., 110, 118, 250
Spearman, C., 5, 156-164, 166, 250
St. Cyres, Viscount, 58, 250
Stanley, J. C., 206, 212, 237
Stephan, F. F., 93-94, 250
Stephen, L., 130, 250
Stephenson, W., 167, 250
Stevens, S. S., 36, 40-44, 206, 250
Stigler, S. M., 86, 87, 251
Stratton, F. J. M., 200, 252
Struik, D. J., 9, 251
"Student", 17, 102-103, 115-118, 121, 122, 196, 200, 251

T

Tedin, O., 103, 251
Thomson, G. H., 5, 156, 165-168, 251
Thomson, W. (Lord Kelvin), 36, 251
Thorndike, E. L., 119, 172, 251
Thurstone, L. L., 5, 34, 163, 166-169, 251
Tippett, L. H. C., 88, 196, 251
Todhunter, I., 64-65, 251
Traxler, R. H., 226, 251
Tukey, J. W., 196-197, 251
Turner, F. M., 130, 251
Tweney, R. D., 207-208, 211-213, 249

U

Underwood, B. J., 206, 251

V

Venn, J., 12, 62, 81, 89-90, 251
von Mises, R., 62, 81-82, 107, 251, 252

W

Wald, A., 230, 252
Walker, E. L., 24, 252
Walker, H. M., 12, 147, 252
Wallis, W. A., 138, 252
Watson, J. B., 22, 207, 252
Weaver, W., 58, 252
Welch, B. L., 117, 202, 252
Weldon, W. F. R., 14, 141-143, 146, 252
Westergaard, H., 7, 93, 252
Whitney, C. A., 88, 252
Wilson, E. B., 161-162, 165, 252
Wishart, J., 210, 252
Wolfle, D., 34, 160-161, 163, 252
Wood, T. B., 200, 252
Woodworth, R. S., 172, 206, 252

Y

Yates, F., 84, 104, 189, 194, 216, 226, 252
Young, A. W., 118, 250
Yule, G. U., 5, 37-38, 88, 108, 110, 138, 146, 148-151, 213, 253

Z

Zubin, J., 209, 210, 241
Subject Index

A

Actuarial science, 51
Analysis of variance, 5, 35, 102-104, 177-181, 187-195
Anthropometric laboratory, 132
Arithmetic mean, 18, 89-91
Average, 18, 88

B

Bayes' theorem, 77-80, 83, 231
Behaviorism, 22
Bills of mortality, 8, 49-52
Binomial distribution, 10-11, 64-65, 68-70
Biometrics, 4, 14-18

C

Causality, see also Determinism, 30
Censuses, 47, 93
Central limit theorem, 95, 107, 124-126
Centroid method, 168
Chi square,
  controversy, 110-114
  distribution, 105-114
  test, 109
Communalities, 168
Confidence intervals, 100, 199-203
Control, 20, 31, 38, 171-178
  Mill's methods of enquiry and, 173-176
  statistical, 176-181
Correlation, 2, 35, 129, 138-146, 174
  coefficient of, 5, 116, 141-146, 158
  controversies, 146-153
  distribution of coefficient, 118
  intraclass, 5, 119, 123, 189-193
  nominal variables, 148-151
  probable error, 117

D

Data organization, 47-55
Degrees of freedom, 110-114
De Méré's problems, 9, 58-59
Determinism, 21-26, 118
  in psychology, 33-35

E

Error estimation, 10, 63, 90, 101-103
Estimation, theory of, 98-101
Eugenics, 4, 15, 130-131, 152-153, 187
Evolution, 1-3, 17, 129, 143
  natural selection, 1, 129, 130, 143
Expected mean squares, 213-215
Experimental texts, 205-207
Experiments,
  design of, 31, 101, 171-185
  tea-tasting, 182-184

F

F ratio, 123, 193, 233
  distribution, 121-123
Factor analysis, 154-170
Factor rotation, 169
Free will, 24
Frequency distributions, 68
G

Gaming, 8, 56-59
General linear model, 27, 93, 181-183
Genetics, 15, 17-18
Gower Street, 151, 225
Graphical methods, 53-55

H

Heredity, 131
Hypotheses,
  alternative, 217-224, 229-233
  null, 57, 184, 217-219
  testing, 7, 222-224, 228-233

I

Induction, 27-30, 175, 230
Inference, 29, 31-33
  early, 47-50
  Fisherian, 81-83
  practical, 77-84
  statistical, 7, 31, 33, 217
Insurance and annuities, 50-53, 63
Intelligence, 158-161
Inventories, 6, 47-48

L

Law of error, 10, 12, 89
Law of large numbers, 59-60
Least squares, 89, 91-93, 148, 182
Likelihood criterion, 82, 218-220
Logical positivism, 45-46

M

Mean, see arithmetic mean
Measurement, 2, 36-38, 40
  error, 38, 44
  definition, 40
  interval and ratio scales, 42
  nominal scale, 41, 148-150
  ordinal scale, 41
Multiple comparisons, 195-199

N

Naturalism, 130
Neyman-Pearson theory, 217-224
  controversy with Fisher, 228-233
Normal Distribution, 10, 12-14, 64-66, 72-76, 107-108, 125, 128

O

One-tail and two-tail tests, 203-204
Operationalism, 45-46

P

Parameters, 7, 95
Pascal's triangle, 9, 59, 69
Personal equation, 45
Poisson distribution, 70-72
Political arithmetic, 48-50
Polling, 94-95
Population, 7, 82, 85, 87, 95-101, 105
Power, 222-224, 232-233
Prediction, 30
Primary mental abilities, 169
Principal component analysis, 155
Probabilistic and deterministic models, 26
Probability, 7-9, 32, 56
  and error estimation, 63
  and weight, 32
  beginnings of the concept, 56-60
  distributions, 68-76
  fiducial, 81-83, 99, 202, 231
  inverse, 65, 77-80, 119, 221
  mathematical theory, 66
  meaning of, 60-63
  subjective, 61, 221
Probable error, 127, 139, 200

R

Randomness, 85-88, 172
  random numbers, 85-88
  random sampling, 87
  randomization, 94, 101-104, 177-179, 188
Regression, 3, 17, 129-138
Representative sampling, 93, 96
Rothamsted Experimental Station, 6, 101, 120, 176, 188
Royal Society, 8, 17, 49-51

S

Sampling,
  distributions, 16, 105-126
  in practice, 95-98
  random, 98
  representation and bias, 93-95
Science, 21-26, 36-37, 45
Significance, 17, 19, 83, 183-184, 199-200, 218
Standard deviation, 73-75, 116, 127
Standard error, 33, 141
Standard scores, 4, 75, 127-129, 137
Statistics,
  arguments, 224-228
  criticism of, 18-20, 233-235
  definition of, 6
  Fisher v. Neyman and Pearson, 228-233
  Fisherian, 186-187, 224
  in journals and papers, 207-212
  in practice, 233
  in psychology, 33-35
  textbooks, 212-213
  parapsychology, 209
  vital, 8, 49-51

T

t ratio, 196, 218
  distribution, 114-121
Tetrad difference, 160
Trial of the Pyx, 86
Type I error, 95, 196-198, 201, 223, 231
Type II error, 223, 229, 231

U

Uncertainty principle, 23, 46

V

Variation, 1, 4, 132, 187

Z

z scores, see standard scores
