Statistics in Psychology
An Historical Perspective
Second Edition
Michael Cowles
York University, Toronto
BF39.C67 2000
150'.7'27--dc21 00-035369
Contents
Preface ix
Acknowledgments xi
Determinism 21
Probabilistic and Deterministic Models 26
Science and Induction 27
Inference 31
Statistics in Psychology 33
3 Measurement 36
In Respect of Measurement 36
Some Fundamentals 39
Error in Measurement 44
5 Probability 56
6 Distributions 68
7 Practical Inference 77
Inverse Probability and the Foundations of Inference 77
Fisherian Inference 81
Bayes or p < .05? 83
10 Comparisons, Correlations, and Predictions 127
Factors 154
The Beginnings 156
Rewriting the Beginnings 162
The Practitioners 164
Fisherian Statistics 186
The Analysis of Variance 187
Multiple Comparison Procedures 195
Confidence Intervals and Significance Tests 199
A Note on 'One-Tail' and 'Two-Tail' Tests 203
References 236
Michael Cowles
Acknowledgments
Excerpts from Fisher Box, J. (c) 1978, R. A. Fisher: The Life of a Scientist, from Scheffé, H. (1959), The Analysis of Variance, and from Lovie, A. D., & Lovie, P., Charles Spearman, Cyril Burt, and the origins of factor analysis, Journal of the History of the Behavioral Sciences, 29, 308-321. Reprinted by permission of John Wiley & Sons Inc., the copyright holders.
Excerpts from Clark, R. W. (1971). Einstein: The Life and Times. New York: World. Reprinted by permission of Peters, Fraser and Dunlop, Literary Agents.
Excerpts from various volumes of the Journal of the Royal Statistical Society, reprinted by permission of the Royal Statistical Society.
1
The Development of Statistics

The central concern of the life sciences is the study of variation. To what extent does this individual or group of individuals differ from another? What are the reasons for the variability? Can the variability be controlled or manipulated? Do the similarities that exist spring from a common root? What are the effects of the variation on the life of the organisms? These are questions asked by biologists and psychologists alike.
The life-science disciplines are defined by the different emphases placed on observed variation, by the nature of the particular variables of interest, and by the ways in which the different variables contribute to the life and behavior of the subject matter. Change and diversity in nature rest on an organizing principle, the formulation of which has been said to be the single most influential scientific achievement of the 19th century: the theory of evolution by means of natural selection. The explication of the theory is attributed, rightly, to Charles Darwin (1809-1882). His book The Origin of Species was published in 1859, but a number of other scientists had written on the principle, in whole or in part, and these men were acknowledged by Darwin in later editions of his work.
Natural selection is possible because there is variation in living matter. The struggle for survival within and across species then ruthlessly favors the individuals that possess a combination of traits and characters, behavioral and physical, that allows them to cope with the total environment, exist, survive, and reproduce.
Not all sources of variability are biological. Many organisms to a greater or lesser extent reshape their environment, their experience, and therefore their behavior through learning. In human beings this reshaping of the environment has reached its most sophisticated form in what has come to be called cultural evolution. A fundamental feature of the human condition, of human nature, is our ability to process a very great deal of information. Human beings have originality and creative powers that continually expand the boundaries of knowledge. And, perhaps most important of all, our language skills, verbal and written, allow for the accumulation of knowledge and its transmission from generation to generation. The rich diversity of human civilization stems from cultural, as well as genetic, diversity.
Curiosity about diversity and variability leads to attempts to classify and to measure. The ordering of diversity and the assessment of variation have spurred the development of measurement in the biological and social sciences, and the application of statistics is one strategy for handling the numerical data obtained.
As science has progressed, it has become increasingly concerned with quantification as a means of describing events. It is felt that precise and economical descriptions of events and the relationships among them are best achieved by measurement. Measurement is the link between mathematics and science, and the apparent (at any rate to mathematicians!) clarity and order of mathematics foster the scientist's urge to measure. The central importance of measurement was vigorously expounded by Francis Galton (1822-1911): "Until the phenomena of any branch of knowledge have been submitted to measurement and number it cannot assume the status and dignity of a Science."

These words formed part of the letterhead of the Department of Applied Statistics of University College, London, an institution that received much intellectual and financial support from Galton. And it is with Galton, who first formulated the method of correlation, that the common statistical procedures of modern social science began.
The nature of variation and the nature of inheritance in organisms were much-discussed and much-confused topics in the second half of the 19th century. Galton was concerned to make the study of heredity mathematical and to bring order into the chaos.
Francis Galton was Charles Darwin's cousin. Galton's mother was the daughter of Erasmus Darwin (1731-1802) by his second wife, and Darwin's father was Erasmus's son by his first. Darwin, who was 13 years Galton's senior, had returned home from a 5-year voyage as the naturalist on board H.M.S. Beagle (an Admiralty expeditionary ship) in 1836 and by 1838 had conceived of the principle of natural selection to account for some of the observations he had made on the expedition. The careers and personalities of Galton and Darwin were quite different. Darwin painstakingly marshaled evidence and single-mindedly buttressed his theory, but remained diffident about it, apparently uncertain of its acceptance. In fact, it was only the inevitability of the announcement of the independent discovery of the principle by Alfred Russel Wallace (1823-1913) that forced Darwin to publish, some 20 years after he had formed the idea. Galton, on the other hand, though a staid and formal Victorian, was not without vanity, enjoying the fame and recognition brought to him by his many publications on a bewildering variety of topics. The steady stream of lectures, papers, and books continued unabated from 1850 until shortly before his death.
The notion of correlated variation was discussed by the new biologists. Darwin observes in The Origin of Species:

Many laws regulate variation, some few of which can be dimly seen, ... I will here only allude to what may be called correlated variation. Important changes in the embryo or larva will probably entail changes in the mature animal. ... Breeders believe that long limbs are almost always accompanied by an elongated head ... cats which are entirely white and have blue eyes are generally deaf ... it appears that white sheep and pigs are injured by certain plants whilst dark-coloured individuals escape ... (Darwin, 1859/1958, p. 34)
1 Mendel had already carried out his work with edible peas and thus begun the science of genetics. The results of his work were published in a rather obscure journal in 1866 and the wider scientific world remained oblivious of them until 1900.
Over the next several years Galton collected data on inherited human characteristics by the simple expedient of offering cash prizes for family records. From these data he arrived at the regression lines for hereditary stature. Figures showing these lines are shown in Chapter 10.
A common theme in Galton's work, and later that of Karl Pearson (1857-1936), was a particular social philosophy. Ronald Fisher (1890-1962) also subscribed to it, although, it must be admitted, it was not, as such, a direct influence on his work. These three men are the founders of what are now called classical statistics, and all were eugenists. They believed that the most relevant and important variables in human affairs are inherited. One's ancestors rather than one's environmental experiences are the overriding determinants of intellectual capacity and personality as well as physical attributes. Human well-being, human personality, indeed human society, could therefore, they argued, be improved by encouraging the most able to have more children than the least able. MacKenzie (1981) and Cowan (1972, 1977) have argued that much of the early work in statistics and the controversies that arose among biologists and statisticians reflect the commitment of the founders of biometry, Pearson being the leader, to the eugenics movement.
In 1884, Galton financed and operated an anthropometric laboratory at the International Health Exhibition. For a charge of threepence, members of the public were measured. Visual and auditory acuity, weight, height, limb span, strength, and a number of other variables were recorded. Over 9,000 data sets were obtained, and, at the close of the exhibition, the equipment was transferred to the South Kensington Museum where data collection continued. Francis Galton was an avid measurer.
Karl Pearson (1930) relates that Galton's first forays into the problem of correlation involved ranking techniques, although he was aware that ranking methods could be cumbersome. How could one compare different measures of anthropometric variables? In a flash of illumination, Galton realized that characteristics measured on scales based on their own variability (we would now say standard score units) could be directly compared. This inspiration is certainly one of the most important in the early years of statistics. He recalls the occasion in Memories of My Life, published in 1908:
3 Maurice Kendall (1963) says of this work, "It is not an easy book. Somebody once said that no student should attempt to read it unless he had read it before" (p. 2).
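Galton's insight, expressing each measurement in units of its own variability, is what we now call standardization (standard or z-scores). A minimal sketch, using hypothetical figures (the data below are illustrative, not Galton's):

```python
import statistics

# Hypothetical anthropometric records: height (cm) and arm span (cm)
# for six people. The two raw scales are not directly comparable.
height = [162.0, 170.0, 175.0, 168.0, 181.0, 158.0]
span = [160.0, 172.0, 178.0, 165.0, 184.0, 155.0]

def standard_scores(values):
    """Express each value in units of the sample's own variability."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# As standard scores, both variables have mean 0 and standard deviation 1,
# so a person's relative standing on each can be compared directly.
z_height = standard_scores(height)
z_span = standard_scores(span)
print([round(z, 2) for z in z_height])
```

Whatever the original units, the transformed scores share a common scale, which is precisely what made Galton's cross-variable comparisons possible.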
PROBABILITY
4 But note that there are hints of probability concepts in mathematics going back at least as far as the 12th century and that Girolamo Cardano wrote Liber de Ludo Aleae (The Book on Games of Chance) a century before it was published in 1663 (see Ore, 1953). There is also no doubt that quite early in human civilization, there was an appreciation of long-run relative frequencies, randomness, and degrees of likelihood in gaming, and some quite formal concepts are to be found in Greek and Roman writings.
5 Poisson (1781-1840), writing of this episode in 1837, says, "A problem concerning games of chance, proposed by a man of the world to an austere Jansenist, was the origin of the calculus of probabilities" (quoted by Struik, 1954, p. 145). De Méré was certainly "a man of the world" and Pascal did become austere and religious, but at the time of de Méré's questions Pascal was in his so-called "worldly period" (1652-1654). I am indebted to my father-in-law, the late Professor F. T. H. Fletcher, for many insights into the life of Pascal.
Whether or not the allegation stated here - that short (but not too short) Frenchmen have stooped so low as to avoid military service - is true is no longer an issue. A more important point is noted by Boring (1920):

The use of the normal curve in statistics is not, however, based solely on the fact that it can be used to describe the frequency distribution of many observed characteristics. It has a much more fundamental significance in inferential statistics, as will be seen, and the distribution and its properties appear in many parts of this book.
Galton first became aware of the distribution from his friend William Spottiswoode, who in 1862 became Secretary of the Royal Geographical Society, but it was the work of Quetelet that greatly impressed him. Many of the data sets he collected approximated to the law and he seemed, on occasion, to be almost mystically impressed with it.
BIOMETRICS
intellectual variables, he ended with a plea, if not a rallying cry, for eugenics:

The mentally better stock in the nation is not reproducing itself at the same rate as it did of old; the less able, and the less energetic, are more fertile than the better stocks. ... The only remedy, if one be possible at all, is to alter the relative fertility of the good and the bad stocks in the community. ... intelligence can be aided and be trained, but no training or education can create it. You must breed it, that is the broad result for statecraft which flows from the equality in inheritance of the psychical and the physical characters in man. (Pearson, 1904a, pp. 179-180)
STATISTICAL CRITICISM
Now it is, of course, true that lumping measurements together may not give us anything more than a picture of the lumping together, and the average value may not be anything like any one individual measurement at all, but Bernard's ideal type fails to acknowledge the reality of individual differences. A rather memorable example of a very real confusion is given by Bernard (1865/1927):

was that his biological friends could not appreciate his new enthusiasms. They could not understand how the Museum "specimen" was in the future to be replaced by the "sample" of 500 to 1000 individuals. (p. 37)
arrogant as to suppose that their subject matter will necessarily submit to this form of analysis, and they turn, almost inevitably, to statistical, as opposed to experimental, control. This is not a muddle-headed notion, but it does present dangers if it is accepted without caution.

A balanced, but not uncritical, view of the utility of statistics can be arrived at from a consideration of the forces that shaped the discipline and an examination of its development. Whether or not this is an assertion that anyone, let alone the author of this book, can justify remains to be seen.
DETERMINISM
1 Not wishing to make pronouncements on the probabilistic nature of the work of others, the writer is making an oblique reference to work in which he and a colleague (Cowles & Davis, 1987) found that there is an 80% chance that stable extraverts will volunteer to participate in psychological research.
22 2. SCIENCE, PSYCHOLOGY, AND STATISTICS
But Einstein never really accepted this proposition, believing to the end that indeterminacy was to be equated with ignorance. Einstein may be right in subscribing ultimately to the inflexibility of Laplace's all-seeing demon, but another approach to indeterminacy was advanced by Werner Heisenberg (1901-1976), a German physicist who, in 1927, formulated his famous uncertainty principle. He examined not merely the practical limits of measurement but the theoretical limits, and showed that the act of observation of the position and velocity of a subatomic particle interfered with it so as to inevitably produce errors in the measurement of one or the other. This assertion has been taken to mean that, ultimately, the forces in our universe are random and therefore indeterminate. Bertrand Russell (1931) disagrees:

Space and time were invented by the Greeks, and served their purpose admirably until the present century. Einstein replaced them by a kind of centaur which he called "space-time," and this did well enough for a couple of decades, but modern quantum mechanics has made it evident that a more fundamental reconstruction is necessary. The Principle of Indeterminacy is merely an illustration of this necessity, not of the failure of physical laws to determine the course of nature. (pp. 108-109)
You, the man who introduced the idea of light as particles! If you are so concerned with the situation in physics in which the nature of light allows for a dual interpretation, then ask the German government to ban the use of photoelectric cells if you think that light is waves, or the use of diffraction gratings if light is corpuscular. (quoted by Clark, 1971, p. 253)
2 For those of you who have been exposed to physics, it must be admitted that although Newton's laws were more or less unchallenged for 200 years, the models he gave us are not definitions but assumptions within the Newtonian system, as the great Austrian physicist, Ernst Mach, pointed out. Einstein's theory of relativity showed that they are not universally true. Nevertheless, the distinction between the models of the natural and the social and biological sciences is, for the moment, a useful one.
SCIENCE AND INDUCTION
This is not to deny the work of earlier scholars. Leonardo da Vinci (1452-1519) vigorously propounded the importance of experience and observation, and enthusiastically wrote of causality and the "certainty" of mathematics. Francis Bacon (1561-1626) is a particular example of one who expressed the importance of system and method in the gaining of new knowledge, and his contribution, although often underestimated, is of great interest to psychologists. Hearnshaw (1987) gives an account of Bacon that shows how his ideas can be seen in the foundation and progress of experimental and inductive psychology. He notes that: "Bacon himself made few detailed contributions to general psychology as such, he saw more clearly than anyone of his time the need for, and the potentialities of, a psychology founded on empirical data, and capable of being applied to 'the relief of man's estate'" (Hearnshaw, 1987, p. 55).
A turning point for modern science arrives with the work of Copernicus (1473-1543). His account of the heliocentric theory of our planetary system was published in the year of his death and had little impact until the 17th century. The Copernican theory involved no new facts, nor did it contribute to mathematical simplicity. As Ginzburg (1936) notes, Copernicus reviewed the existing facts and came up with a simpler physical hypothesis than that of Ptolemaic theory, which stated that the earth was the center of the universe:

The fact that PTOLEMY and his successors were led to make an affirmation in violence to the facts as then known shows that their acceptance of the belief in the
This statement is one that scientists, when they are wearing their scientists' hats, would support and elaborate upon, but an examination of the state of the psychological sciences could not fully sustain it. Nowhere is our knowledge more incomplete than in the study of the human condition, and nowhere are our interpretations more open to prejudice and ideology. The study of differences between the sexes, the nature-nurture issue in the examination of human personality, intelligence, and aptitude, the sociobiological debate on the interpretation of the way in which societies are organized, all are marked by undertones of ethics and ideology that the scientific purist would see as outside the notion of an autonomous science. Nor is this a complete list.
Science is not so much about facts as about interpretations of observation, and interpretations as well as observations are guided and molded by preconceptions. Ask someone to "observe" and he or she will ask what it is that is to be observed. To suggest that the manipulation and analysis of numerical data, in the sense of the actual methods employed, can also be guided by preconceptions seems very odd. Nevertheless, the development of statistics was heavily influenced by the ideological stance of its developers. The strength of statistical analysis is that its latter-day users do not have to subscribe to the particular views of the pioneers in order to appreciate its utility and apply it successfully.
A common view is that science proceeds through the process of induction. Put simply, this is the view that the future will resemble the past. The occurrence of an event A will lead us to expect an event B if past experience has shown B always following A. The general principle that B follows A is quickly accepted. Reasoning from particular cases to general principles is seen as the very foundation of science.
The concepts of causality and inference come together in the process of induction. The great Scottish philosopher David Hume (1711-1776) threw down a challenge that still occupies the attention of philosophers:

in fact, that it always is inferred. But if you insist that the inference is made by a chain of reasoning, I desire you to produce that reasoning. (Hume, 1748/1951, pp. 33-34)
INFERENCE
method but not part of knowledge. Some writers have rejected this view. Reichenbach (1938), for example, sought to devise a formal probability logic in which judgments of the truth or falsity of propositions are replaced by the notion of weight. Probability belongs to a class of events. Weight refers to a single event, and a single event can belong to many classes:

We shall have to get accustomed to the idea that we must not look upon science as a "body of knowledge", but rather as a system of hypotheses; that is to say, as a system of guesses or anticipations ... of which we are never justified in saying that we know that they are "true" or "more or less certain" or even "probable". (Popper, 1959, p. 317)
Now Popper admits only that a system is scientific when it can be tested by experience. Scientific statements are tested by attempts to refute or falsify them. Theories that withstand severe tests are corroborated by the tests, but they are not proven, nor are they even made more probable. It is difficult to gainsay Popper's logic and Hume's skepticism. They are food for philosophical thought, but scientists who perhaps occasionally worry about such things will put them aside, if only because working scientists are practical people. The principle of induction is the principle of science, and the fact that Popper and Hume can shout from the philosophical sidelines that the "official" rules of the game are irrational and that the "real" rules of the game are not fully appreciated will not stop the game from being played.
Statistical inference may be defined as the use of methods based on the rules of chance to draw conclusions from quantitative data. It may be directly compared with exercises where numbered tickets are drawn from a bag of tickets with a view to making statements about the composition of the bag, or where a die or a coin is tossed with a view to making statements about its fairness. Suppose a bag contains tickets numbered 1, 2, 3, 4, and 5. Each numeral appears on the same, very large, number of tickets. Now suppose that 25 tickets are drawn at random and with replacement from the bag and the sum of the numbers is calculated. The obtained sum could be as low as 25 and as high as 125, but the expected value of the sum will be 75, because each of the numerals should occur on one fifth of the draws or thereabouts. The sum should be 5(1 + 2 + 3 + 4 + 5) = 75. In practice, a given draw will have a sum that departs from this value by an amount above or below it that can be described as chance error. The likely size of this error is given by a statistic called the standard error, which is readily computed from the formula σ√n, where σ is the standard deviation of the numbers in the bag. Leaving aside, for the moment, the problem of estimating σ when, as is usual, the contents of the bag are unknown, all classical statistical inferential procedures stem from this sort of exercise. The real variability in the bag is given by the standard deviation, and the chance variability in the sums of the numbers drawn is given by the standard error.
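The bag exercise can be checked numerically. The following sketch (mine, not the book's) simulates repeated draws of 25 tickets and compares the spread of the obtained sums with the theoretical standard error σ√n:

```python
import random
import statistics

# The bag: numerals 1-5, equally represented; population parameters.
numerals = [1, 2, 3, 4, 5]
mu = statistics.mean(numerals)        # 3.0
sigma = statistics.pstdev(numerals)   # sqrt(2), about 1.414

n = 25
expected_sum = n * mu                 # 75
standard_error = sigma * n ** 0.5     # sigma * sqrt(25), about 7.07

# Simulate many draws of 25 tickets with replacement; the sums should
# center on 75 and scatter with a standard deviation near 7.07.
random.seed(1)
sums = [sum(random.choice(numerals) for _ in range(n)) for _ in range(10_000)]
print(expected_sum, round(standard_error, 2))
print(round(statistics.mean(sums), 1), round(statistics.pstdev(sums), 2))
```

The simulated mean and spread of the sums agree closely with the expected value of 75 and the standard error of about 7.07.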
STATISTICS IN PSYCHOLOGY
The use of quantitative methods in the study of mental processes begins with Gustav Fechner (1801-1887), who set himself the problem of examining the relationship between stimulus and sensation. In 1860 he published Elemente der Psychophysik, in which he describes his invention of a psychophysical law that describes the relationship between mind and body. He developed methods of measuring sensation based on mathematical and statistical considerations, methods that have their echoes in present-day experimental psychology. Fechner made use of the normal law in his development of the method of constant stimuli, applying it in the Gaussian sense as a way of dealing with error and uncontrolled variation.
Fechner's basic assumptions and the conclusions he drew from his experimental investigations have been shown to be faulty. Stevens' work in the 1950s and the later developments of signal detection theory have overtaken the work of the 19th-century psychophysicists, but the revolutionary nature of Fechner's methods profoundly influenced experimental psychology. Boring (1950)
The interesting point about this statement is that it clearly sees factor analysis as a method of data exploration rather than an experimental method. As Lovie (1983) points out, Spearman's approach was that of an experimenter using correlational techniques to confirm his hypothesis, but from 1940 on, that view of the methods of factor analysis has not prevailed.
Of course, the beginnings of general descriptive techniques crept into the psychological literature over the same period. Means and probable errors are commonly reported, and correlation coefficients are also accompanied by estimates of their probable error. And it was around 1940 that psychologists started to become aware of the work of R. A. Fisher and to adopt analysis of variance as the tool of experimental work. It can be argued that the progression of events that led to Fisherian statistics also led to a division in empirical psychology, a split between correlational and experimental psychology.

Cronbach (1957) chose to discuss the "two disciplines" in his APA presidential address. He notes that in the beginning:
Although there have been signs that the two disciplines can work together, the basic situation has not changed much over 30 years. Individual differences are error variance to the experimenter; it is the between-groups or treatment variance that is of interest. Differential psychologists look for variations and relationships among variables within treatment conditions. Indeed, variation in the situation here leads to error.
It may be fairly claimed that these fundamental differences in approach have had the most profound effect on psychology. And it may be further claimed that the sophistication and success of the methods of analysis that are used by the two camps have helped to formalize the divisions. Correlation and ANOVA have led to multiple regression analysis and MANOVA, and yet the methods are based on the same model - the general linear model. Unfortunately, statistical consumers are frequently unaware of the fundamentals, frightened away by the mathematics, or, bored and frustrated by the arguments on the rationale of the probability calculus, they avoid investigation of the general structure of the methods. When these problems have been overcome, the face of psychology may change.
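The point that correlation and ANOVA rest on the same general linear model can be illustrated with a small sketch (hypothetical scores, two groups): the least-squares slope for a 0/1 group dummy equals the difference between the group means, which is exactly the quantity a two-group ANOVA examines.

```python
import statistics

# Hypothetical scores for two treatment groups.
group_a = [4.0, 5.0, 6.0, 5.5]
group_b = [7.0, 8.0, 6.5, 8.5]

# ANOVA view: compare the group means.
mean_diff = statistics.mean(group_b) - statistics.mean(group_a)

# Regression view: code group membership as a 0/1 dummy predictor
# and fit least squares; slope = cov(x, y) / var(x).
y = group_a + group_b
x = [0] * len(group_a) + [1] * len(group_b)
mx, my = statistics.mean(x), statistics.mean(y)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
    (xi - mx) ** 2 for xi in x
)

print(mean_diff, slope)  # the two analyses yield the same number
```

The equality is not a coincidence of these particular numbers; it holds for any 0/1 dummy coding, which is the sense in which the two "camps" share one model.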
3
Measurement
IN RESPECT OF MEASUREMENT
May we hope to see the day when school registers will record that such and such a lad possesses 36 British Association units of memory power or when we shall be able to calculate how long a mind of 17 "macaulays" will take to learn Book II of Paradise Lost? If this be visionary, we may at least hope for much of interest and practical utility in the comparison of the varying powers of different minds which can now at last be laid down to scale. (Jacobs, 1885, p. 459)
The enthusiasm of the mental measurers of the first half of the 20th century reflects the same dream, and even today, the smile that Jacobs' words might bring to the faces of hardened test constructors and users contains a little of the old yearning. The urge to quantify our observations and to impose sophisticated statistical manipulations on them is a very powerful one in the social sciences.
Measurement. O dear! Isn't it almost an insult to the word to term some of these numerical data measurements? They are of the nature of estimates, most of them, and outrageously bad estimates often at that.

And it should always be the aim of the experimenter not to revel in statistical methods (when he does revel and not swear) but steadily to diminish, by continual improvement of his experimental methods, the necessity for their use and the influence they have on his conclusions. (Yule, 1921, pp. 105-106)
The general tenor of this criticism is still valid, but the combination of experimental design and statistical method introduced a little later by Sir Ronald Fisher provided the hope, if not the complete reality, of statistical as opposed to strict experimental control. The modern statistical approach more readily recognizes the intrinsic variability in living matter and its associated systems. Furthermore, Yule's remarks were made before it became clear that all acts of observation contain irreducible uncertainty (as noted in chap. 2).
Nearly 40 years after Yule's review, Kendall (1959) gently reminded us of the importance of precision in observation and that statistical procedures cannot replace it. In an acute and amusing parody, he tells the story of Hiawatha. It is a tragic tale. Hiawatha, a "mighty hunter," was an abysmal marksman, although he did have the advantage of having majored in applied statistics. Partly relying on his comrades' ignorance of the subject, he attempted to show that his patently awful performance in a shooting contest was not significantly different from that of his fellows. Still, they took away his bow and arrows:
SOME FUNDAMENTALS
Scales of Measurement
later in this work. Gaito (1980) is one of a number of writers who have tried to expose the fallacy, noting that it is based on a confusion between measurement theory and statistical theory. Gaito reiterates some of the criticisms made by Lord (1953), who, in a witty and brilliant discussion, tells the story of a professor who computed the means and standard deviations of test scores behind locked doors because he had taught his students that such ordinal scores should not be added. Matters came to a head when the professor, in a story that should be read in the original, and not paraphrased, discovers that even the numbers on football jerseys behave as if they were "real," that is, interval scale, numbers. In fact, "the numbers don't remember where they came from" (p. 751).
It is important to realize that the preceding remarks do not detract from Stevens' sterling contribution to the development of coherent concepts regarding scales in measurement theory. The difficulties arise when these are regarded as prescriptions for the choice and use of a wide variety of statistical techniques.
ERROR IN MEASUREMENT
All acts of measurement involve practical difficulties that are lumped together as the source of measurement error. From the statistical standpoint, measurement error increases the variability in data sets, decreasing the precision of our summaries and inferences. It follows that scientists strive for accuracy by continually refining measuring techniques and instruments. The odd fact is that this strategy proceeds even though we know that absolutely precise measurement is impossible.
It is clear that a meter rule must be made from rigid material and that measuring tapes must not be elastic. Without this the instruments lack reliability and self-consistency and, put simply, separate and independent measurements of the same event are unlikely to agree with each other. Only very rarely do paper-and-pencil tests, or two forms of the same test (say, a test of personality factors), give exactly the same results when given to an individual or group on two occasions. In a sense these tests are like elastic measuring tapes, and the discrepancy between the outcomes is an index of their reliability. Perfect reliability is indexed 1. Poor reliability would be indexed 0.3 or less, and good reliability 0.8 or more. In accepting this latter figure as good reliability, the psychologist is accepting the inevitable error in measurement, being prepared to do so because he or she feels that a quantitative description better serves to grasp the concepts under investigation, to communicate ideas about the concepts more efficiently, and to describe the relationships between these concepts and others more clearly.
Reliability is not just a function of the quality of the instrument being used. In 1796, Maskelyne, Astronomer Royal and Director of the Royal Observatory at Greenwich, near London, England, dismissed Kinnebrook, an assistant,
POLITICAL ARITHMETIC
Science; and had not the Doctrins of this Essay offended France, they had long since
seen the light, and had found Followers, as well as improvements before this time,
to the advantage perhaps of Mankind. (Lord Shelborne, Dedication, in Petty, 1690)
Nottingham.
It was Karl Pearson's (1978) view, however, that Petty owed more to Graunt
than Graunt to Petty. Graunt was a well-to-do haberdasher who was influential
enough to be able to secure a professorship at Gresham College, London, for
Petty, a man who had a variety of posts and careers. Graunt was a cultured,
self-educated man and was friends with scientists, artists, and businessmen. The
publication of the Observations led to Charles II himself supporting his election
to the Royal Society in 1662. Graunt's work was the first attempt to interpret
social behavior and biological trends from counts of the rather crude figures of
births and deaths reported in London from 1604 to 1661 - the Bills of Mortality.
A continuing theme in Graunt's work is the regularity of statistical summaries.
He apparently accepted implicitly the notion that when a statistical ratio was
not maintained then some new influence must be present, that some additional
factor is there to be discovered. For example, he tried to show that the number1
of deaths from the plague had been under-reported and by examining the
number of christenings reasoned that the decrease in population of London
because of the plague would be recovered in 2 years. He attempted to estimate
the population distribution by sex and age and constructed a mortality table.
Essentially, Graunt's work, and that of Petty, emphasizes the use of mathematics
to impose order on data and, by extension, on society itself.
Petty was a founding member of the Royal Society, which was granted its
charter in 1662. Although he had been a supporter of Cromwell and the
Commonwealth, he was knighted in 1661 by Charles II. He was not so much
concerned with the vital statistics studied by Graunt. Rather, in his early work
he suggested ways in which the expenses of the state could be curtailed and the
1. The weekly accounting of burials was begun to allay public fears about the plague. However,
in epidemic years when anxiety was at its highest, it appears that cases of the plague were
concealed because members of families with the illness were kept together, both the sick and
the well.
VITAL STATISTICS
The resulting near-equal proportions of the sexes ensures that every male
may have a female of suitable age, and Arbuthnot concludes, as did Graunt, that:
GRAPHICAL METHODS
regarded with suspicion, then graphs would have been greeted with even more
opprobrium. In France and Germany the use of graphs was openly criticized by
some statisticians.
In the social sciences the person who most helped to promote the use of the
graph as a tool in statistical analysis was Quetelet, who has already been
mentioned. It has been noted that the use of the Normal Curve for the description
of human characteristics begins with his work. The law of error and Quetelet's
adaptation of it is of great importance.
5
Probability
A proposition that would find favor with most of us is that the business of life
is concerned with attempts to do more or less the right thing at more or less the
right time. It follows that a great deal of our thinking, in the sense of deliberating,
about our condition is bound up with the problem of making decisions in the
face of uncertainty. A traditional view of science in general, on the other hand,
is that it searches for conclusions that are to be taken to be true, that the
conclusions should represent an absolute certainty. If one of its conclusions fails,
then the failure is put down to ignorance, or poor methodology, or primitive
techniques - the truth is still thought to be out there somewhere. In chapter 2,
some of the difficulties that can arise in trying to sustain this position were
discussed. Probability theory provides us with a means of reaching answers and
conclusions when, as is the case in anything but the most trivial of circumstances,
the evidence is incomplete or, indeed, can never be complete. Put
simply, what is the best bet, what are the odds on our winning, and how confident
can we be that we are right? The derivation of theories and systems that may
be applied to assist us in answering these questions continues to present
intellectual challenges.
The early beginnings are very early indeed. David (1962) tells us that many
archeological digs have produced quantities of astragali, animal heel bones, with
the four flatter sides marked in various ways. These bones may have been used
as counting aids, as children's toys, or as the forerunners of dice in ancient
games. The astragalus was certainly used in board games in Egypt about 3,500
years before the birth of Christ. David suggests that gaming may have been
developed from game-playing and reports that it is said to have been introduced
When the four dice produce the venus-throw [the outcome when four dice, or
astragali, are tossed and the faces shown are all different] you may talk of accident:
but suppose you made a hundred casts and the venus-throw appeared a hundred
times; could you call that accidental? (David, 1962, p. 24)
THE BEGINNINGS
double six in 24 throws of 2 dice. Here the odds are, in fact, slightly against
the house, even though 24 is to 36 as 4 is to 6. De Mere had solved this problem
but used the outcome to argue to Pascal that it showed that arithmetic was
self-contradictory!2 A second problem, which de Mere was unable to solve,
was the "problem of points," which concerns, for example, how the stakes or
the "pot" should be divided among the players, according to the current scores
of the participants, if a game is interrupted. This is an altogether more difficult
question and the exchanges between Pascal and Fermat that took place over the
summer of 1654 devote much space to it. The letters3 show the beginning of
the applications of combinatorial algebra and of the arithmetic triangle.
In 1655, Christiaan Huygens (1629-1695), a Dutch mathematician, visited
Paris and heard of the Pascal-Fermat correspondence. There was enormous
interest in the mathematics of chance but a seeming reluctance to announce
discoveries and reveal the answers to questions under discussion, perhaps
because of the material value of the findings to gamblers. Huygens returned to
Holland and worked out the answers for himself. His short account De
Ratiociniis in Ludo Aleae [Calculating in Games of Chance] was published in
1657 by Frans van Schooten (1615-1660), who was Professor of mathematics
at Leyden. This book, the first on the probability calculus, was an important
influence on James Bernoulli and De Moivre.
James (also referred to as Jacques or Jakob) Bernoulli (1654-1705) was one
of an extended family of mathematicians who made many valuable contributions.
It is his Ars Conjectandi [The Art of Conjecture], published in 1713 after
editing by his nephew Nicholas, that is of most interest. The first part of the
book is a reproduction of Huygens' work with a commentary. The second and
third parts deal with the mathematics of permutations and combinations and its
application to games of chance. In the fourth part of the book we find the
theorem that Bernoulli called his "golden theorem," the theorem named "The
Law of Large Numbers" by Poisson in 1837. In an excellent and most readable
commentary, Newman (1956) declares the theorem to be "of cardinal significance
to the theory of probability," for it "was the first attempt to deduce
statistical measures from individual probabilities" (vol. 3, p. 1448). Hacking
(1971) has examined the whole work in some considerable depth.
Bernoulli likely did not anticipate all of the debates and confusions and
2. The probability of not getting a six in a single throw of a die is 5/6 and the probability of no
sixes in 4 throws is (5/6)^4. The probability of at least 1 six in 4 throws is 1 - (5/6)^4, which is
.51775, which is favorable to the house. The probability of not getting 2 sixes in a single throw
of 2 dice is 35/36 and the probability of the event of at least once getting 2 sixes in 24 throws of
2 dice is 1 - (35/36)^24, which is .49140, which is slightly unfavorable to the house.
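The footnote's arithmetic is easily checked; the snippet below is a modern restatement of the same two calculations, not anything from the original correspondence.

```python
# A check of de Mere's two games, following the footnote's arithmetic.
# Game 1: at least one six in 4 throws of a single die.
p_game1 = 1 - (5/6) ** 4
# Game 2: at least one double six in 24 throws of two dice.
p_game2 = 1 - (35/36) ** 24

print(round(p_game1, 5))  # 0.51775 -- favorable to the house
print(round(p_game2, 5))  # 0.4914  -- slightly unfavorable to the house
```

The near-miss of the second probability (just under one half) is exactly what made de Méré's losses so puzzling to him.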
3. Translations of the Pascal-Fermat correspondence are to be found in David's (1962) book and in
Smith (1929).
The fact is, of course, that the theorem is no more and no less than a formal
mathematical proposition and not a statement about realities and, as Newman
(1956) puts it, "neither capable of validating 'facts' nor of being invalidated by
them" (vol. 3, p. 1450). Now this statement accurately represents the situation,
but it must not be passed by without comment. If reality departs severely from
the law, then clearly we do not doubt reality, and nor do we doubt the law. What
we might do, however, is to doubt some of our initial assumptions about the
observed situation. If red were to turn up 100 times in a row at the roulette table,
it might be prudent to bet on red on the 101st spin because this sequence departs
so much from expectation that we might conclude that the wheel is "stuck" on
red. In fact, we might use this observation as support for an assertion that the
wheel is not fair. This example moves us away from probability theory and into
the realm of practical statistical inference, an inference about reality that is based
on probabilities indicated by an abstract mathematical proposition.
a given event on a single trial - that, for example, we know that the value of p
for the appearance of heads when a coin is tossed is ½. If we do know p, then
predictions can be made, and, given a series of observations, inferences may
be drawn from them. The problem is to state the grounds for setting p at a
particular value, and this raises the question of what it is that we mean by
probability. The issue was always there, but it was not really confronted until
the 19th century.
The title of this section is taken from Ernest Nagel's paper presented at the
annual meeting of the American Statistical Association in 1935 and printed in
its journal in 1936. Nagel's clear and penetrating account does not, as he admits,
resolve the difficult problems raised in attempts to examine the concept,
problems that are still the source of sometimes quite heated debate among
mathematicians and philosophers. What it does is present the alternative views
and comment on the conclusions that adoption of one or other of the positions
entails.4 Nagel first examines the methodological principles that can be brought
to bear on the question; then he delineates the contexts in which ideas of
probability may be found; and, finally, he examines the three interpretations of
probability that are extant. What follows draws heavily on Nagel's review.
Appeals to probabilities are found in everyday discussion, applied statistics
and measurement, physical and biological theories, the comparison of competing
theories, and in the mathematical probability calculus. The three important
interpretations of the concept are the classical; the notion of probability as
"rational belief"; and the long-run relative frequency definition.
Laplace (1749-1827) and De Morgan (1806-1871) were the chief exponents
of the classical view that regards probability as, in De Morgan's words,
"a state of mind with respect to an assertion" (De Morgan, 1838). The strength
of our belief about any given proposition is its probability. This is what is meant
in everyday discourse when we say (though perhaps not every day!) that "It is
probable that gaming emerged from game-playing." In evaluating competing
theories this view is also evident, as when it is said by some, for example, that,
"Biologically based trait theories of personality are more probably correct than
social learning theories." This interpretation is not what is meant when statistical
assertions about the weather, the probability of precipitation, for example,
or the outcome of an atomic disintegration are made.
John Maynard Keynes (later Lord Keynes, 1883-1946) interpreted probability
as rational belief, and it has been asserted that this is the most appropriate
view for most applications of the term. Evidence about the moral character of
witnesses would lead to statements about the probability of the truth of one
person's testimony rather than another's or, in examining the results of evidence
that attempted to show, for example, that a social learning theory interpretation
The practical spur to the development of probability theory was simply the
attempt to formulate rules that, if followed assiduously, would lead to a gambler
making a profit over the long run, although, of course, it was not possible to
assert how long the run might have to be for profit to be assured. Probability
was therefore bound up with the notion of expectancies, and the 150 years that
followed the end of the 17th century saw its constructs being applied to
expectancies in life insurance and the sale of annuities. John de Witt
(1625-1672) and John Hudde (1628-1704), who calculated annuities based on
mortality statistics and whose work was mentioned earlier, corresponded and
consulted with Huygens and drew heavily on his writings.
An important figure in the history of probability, although the extent of his
contribution was not fully recognized until long, long after his death, was
Abraham De Moivre (1667-1754). His book The Doctrine of Chances: or A
Method of Calculating the Probabilities of Events in Play was first published
in 1718, but it is the second (1738) and third (1756) editions of the work that
are of most interest here. De Moivre's A Treatise of Annuities on Lives, editions
of which were published in 1724, 1743, and 1750, applies The Doctrine of
Chances to the "valuation of annuities." The Treatise is to be found bound
together with the 1756 edition of the Doctrine.
Here the problems of the gaming rooms and the questions of actuarial
prediction in the mathematics of expectancies are linked. A third strand was
not to emerge for over 50 years with investigations on the estimation of error.
Thus we have seen that the principal contributions to our subject from De Moivre
are his investigations respecting the Duration of Play, his Theory of Recurring Series,
and his extension of the value of Bernoulli's Theorem by the aid of Stirling's
Theorem. Our obligations to De Moivre would have been still greater if he had not
concealed the demonstrations of the important results which we have noticed ...;
but it will not be doubted that the Theory of Probability owes more to him than to
any other mathematician, with the sole exception of Laplace. (Todhunter, 1865/1965,
pp. 192-193)
5. In the actual talk Hilbert only had time to deal with 10 of the problems, but the entire manuscript,
in English, can be found in the Bulletin of the American Mathematical Society (1902).
When Graunt and Halley and Quetelet made their inferences, they made them
on the basis of their examination of frequency distributions. Tables, charts, and
graphs - no matter how the information is displayed - all can be used to show
a listing of data, or classifications of data, and their associated frequencies.
These are frequency distributions. By extension, such depictions of the frequency
of occurrence of observations can be used to assess the expectation of
particular values, or classes of values, occurring in the future. Real frequency
distributions can then be used as probability distributions. In general, however,
the probability distributions that are familiar to the users of statistical techniques
are theoretical distributions, abstractions based on a mathematical rule, that
match, or approximate, distributions of events in the real world. When bodies
of data are described, it is the graph and the chart that are used. But the
theoretical distributions of statistics and probability theory are described by the
mathematical rules or functions that define the relationships between data, both
real and hypothetical, and their expected frequencies or probabilities.
Over the last 300 years or so, the characteristics of a great many theoretical
distributions, all of which have been found to have some practical utility in one
situation or another, have been examined. The following discussion is limited
to three distributions that are familiar to users of basic statistics in psychology.
An account of some fundamental sampling distributions is given later.
THE BINOMIAL DISTRIBUTION
the binomial expansion of (1 - x)^(1/2). The interesting and important point to be
noted is that Newton's discovery was not made by considering the binomial
coefficients of Pascal's triangle but by examining the analysis of infinite series,
a discovery of much greater generality and mathematical significance. Figure
6.1 shows the binomial distribution for n = 1.
Newton's discovery of the calculus in his "golden years" at Woolsthorpe
establishes him as its originator, but it was Gottfried Wilhelm Leibniz
(1646-1716), the German philosopher and mathematician, who has priority of
publication, and it is pretty well established that the discoveries were independent.
However, a bitter quarrel developed over the claims to priority of
discovery and allegations were made that, on a visit to London in 1673, Leibniz
could have seen the manuscript of Newton's De Analysi per Aequationes Numero
Terminorum Infinitas, which, though written in 1669, was not published until
1711. Abraham De Moivre was among those appointed by the Royal Society
in 1712 to report on the dispute. De Moivre made extensive use of the method
in his own work, and it was his Approximatio, first printed and circulated to
some friends in 1733, that links the binomial to what we now call the normal
Altho' the Solution of Problems of Chance often requires that several Terms of the
Binomial (a + b)^n [this is modern notation] be added together, nevertheless in very
high Powers the thing appears so laborious, and of so great difficulty, that few people
have undertaken that Task; for besides James and Nicholas Bernoulli, two great
Mathematicians, I know of no body that has attempted it; in which, tho' they have
shown very great skill, and have the praise which is due to their Industry, yet some
things were farther required; for what they have done is not so much an Approximation
as the determining very wide limits, within which they demonstrated that the
Sum of Terms was contained. (De Moivre, 1756/1967, 3rd Ed., p. 243)
... which converges so fast, that by help of no more than seven or eight Terms, the
Sum required may be carried to six or seven places of Decimals: Now that Sum will
be found to be 0.427812, independently from the common Multiplicator 2/√c, and
therefore to the Tabular Logarithm of 0.427812, which is 9.6312529, adding the
Logarithm of 2/√c, viz. 9.9019400, the sum will be 19.5331929, to which answers the
number 0.341344. (De Moivre, 1756, 3rd Ed., p. 245)
This familiar final figure is the area under the curve of the normal distribution
between the mean (which is, of course, also the middle value) and an ordinate
one standard deviation from the mean. In the third corollary De Moivre says:
½√n is what today we call the standard deviation. De Moivre did not name
it but he did, in Corollary 6, say that √n "will be as it were the Modulus by
which we are to regulate our Estimation" (De Moivre, 1756, 3rd Ed., p. 248).
In fact what De Moivre does is to expand the exponential and to integrate
from 0 to ½√n.
In Corollary 6 De Moivre notes that if l is interpreted as √n rather than
½√n, then the series does not converge so fast and that more and more terms
would be required for a reasonable approximation as l becomes a greater
proportion of √n,
... for which reason I make use in this Case of the Artifice of Mechanic Quadratures,
first invented by Sir Isaac Newton...; it consists in determining the Area of a Curve
nearly, from knowing a certain number of its Ordinates A, B, C, D, E, F, &c. placed
at equal Intervals, (De Moivre, 1756, 3rd Ed., p. 247)
1. The value for the proportion of area between ±1σ is 0.6826894, so that De Moivre was out
by one unit in the sixth decimal place.
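De Moivre's figure is easily reproduced with the modern error function; the check below is a present-day restatement, not part of his derivation.

```python
from math import erf, sqrt

# Area under the standard normal curve from the mean to one standard
# deviation above it: Phi(1) - Phi(0) = erf(1/sqrt(2)) / 2.
area_one_sd = erf(1 / sqrt(2)) / 2
print(round(area_one_sd, 6))      # 0.341345 (De Moivre obtained 0.341344)
print(round(2 * area_one_sd, 6))  # 0.682689, the proportion between +/- 1 sd
```

Doubling the one-sided area gives the familiar "68 percent within one standard deviation" rule mentioned in the footnote.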
There is much value in the idea of the ultimate laws being statistical laws, though
why the fluctuations should be attributed to a Lucretian 'Chance', I cannot say. It
is not an exactly dignified conception of the Deity to suppose him occupied solely
with first moments and neglecting second and higher moments! (Pearson, 1978, p.
160)
and elsewhere:
1. Price was also a Unitarian Church minister. His church still stands in Newington Green in north
London and is the oldest Nonconformist church building still being so used in London. Next door
and abutting the church is a licensed betting shop. It has been noted that Bayesian analysis, resting,
as it does, on the notion of conditional probability, is akin to gambling.
p(R or P) = p(R) + p(P) - p(R & P), [8/13 = (1/2 + 3/13) - 3/26].
hand side of this equation is termed the posterior probability, the first term on
the right-hand side the prior probability, and the ratio p(P|R)/p(P), the likelihood
ratio.
From the addition rule we can also show that the probability of a red card is
the sum of the probabilities of a red picture card and a red number card minus
the probability of a picture number card (which does not exist!), or:
But p(P & N) = 0 because the events picture card and number card are
mutually exclusive. So Bayes' Theorem may be written:
2. Presumably it could be argued that our urn with its unknown ratio of black and white balls
is one of an infinite distribution of urns with differing make-ups.
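The card-deck arithmetic in this passage can be verified by direct counting; the sketch below uses exact fractions and modern notation, not the author's.

```python
from fractions import Fraction

# A standard 52-card deck: 26 red cards, 12 picture cards (J, Q, K),
# of which 6 are both red and picture cards.
p_red = Fraction(26, 52)             # 1/2
p_picture = Fraction(12, 52)         # 3/13
p_red_and_picture = Fraction(6, 52)  # 3/26

# Addition rule: p(R or P) = p(R) + p(P) - p(R & P)
p_red_or_picture = p_red + p_picture - p_red_and_picture
print(p_red_or_picture)  # 8/13
```

Working with fractions rather than decimals reproduces the text's figure 8/13 exactly.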
FISHERIAN INFERENCE
When the justification for the probabilistic basis of inference in the sense of
revising opinion on the basis of data was thought of at all, it was the Bayesian
approach that held sway until this century. Its foundations had come under
attack primarily on the grounds that probability in any practical sense of the
word must be based on relative frequency of observations and not on degrees
of belief. Venn (1888) was perhaps the most insistent spokesman for this view.
Sir Ronald Fisher (1890-1962), certainly the most influential statistician of all
time, set out to replace it. In 1930 he published a paper that supposedly set out
his notion of fiducial probability, claiming, "Inverse probability has, I believe,
survived so long in spite of its unsatisfactory basis, because its critics have until
recent times put forward nothing to replace it as a rational theory of learning by
experience" (Fisher, 1930, p. 531).
There is no doubt that Fisher's methods, and the contributions of Neyman
and Pearson that have been grafted on to them, have provided us with a set of
inferential procedures. There seems to be considerable doubt as to whether he
provided us with a coherent non-Bayesian theory. Fisher himself asserted that
the concept of the likelihood function was fundamental to his new approach and
distinguished it from Bayesian probability. Kendall (1963) in his obituary of
Fisher says this:
It appears to me that, at this point [1922], his ideas were not very well thought out.
Certainly his exposition of them was obscure. But, in retrospect, it becomes plain
that he was thinking of a probability function f(x,θ) in two different ways: as the
probability distribution of x for given θ, and as giving some sort of permissible range
of θ for an observed x. To this latter he gave the name of 'fiducial probability
distribution' (later to be known as a fiducial distribution) and in doing so began a
long train of confusion; for it is not a probability distribution to anyone who rejects
Bayes's approach, and indeed, may not be a distribution of anything. Fisher
nevertheless manipulated it as if it were, and thereafter maintained an attitude of
rather contemptuous surprise towards anyone who was perverse enough to fail in
understanding his argument....
The position on both sides has been restated ad nauseam, without much attempt
at reconciliation or, as I think, without an explicit recognition of the real point, which
is that a man's attitude towards inference, like his attitude towards religion, is
determined by his emotional make-up, not by reason or mathematics. (Kendall,
1963, p. 4)
There are two different measures of rational belief appropriate to different cases.
Knowing the population we can express our incomplete knowledge of, or expectation
of, the sample in terms of probability; knowing the sample we can express our
incomplete knowledge of the population in terms of likelihood. (Fisher, 1930, p.
532)
Berkson was expressing in essence what many, many other writers in the
following 60 years have observed: that when H0 is expressed as an exact null
hypothesis (zero difference or no relationship) then very small deviations from
this (dare one say it?) pedantic view will be declared significant!
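Berkson's point can be illustrated without any data at all. In the sketch below (my construction, not Berkson's), a true difference of 0.01 standard deviations, utterly trivial by any practical standard, is declared significant once the sample is large enough:

```python
from math import erf, sqrt

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# One-sample z-test of H0: mu = 0 when the observed mean is 0.01 (sd = 1).
# The p-value falls from 0.92 to effectively zero as n grows.
for n in (100, 10_000, 1_000_000):
    z = 0.01 * sqrt(n)  # z = (xbar - mu0) / (sd / sqrt(n))
    print(n, round(two_sided_p(z), 4))
```

With n of a million the exact null is rejected at any conventional level, though nothing of substantive interest has changed.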
Some of the more recently expressed doubts have been brought together by
Henkel and Morrison (1970), but the papers they collected were most
certainly not the last word. Indeed, Hunter (1997) called for a ban on NHST,
and reported that no less a body than the American Psychological Association
has a "committee looking into whether the use of the significance test should
be discouraged" (p. 6)!
The main criticisms, endlessly repeated, are easily listed. NHST does not
offer any way of testing the alternative or research hypothesis; the null hypothesis
is usually false and when differences or relationships are trivial, large
samples will lead to its rejection; the method discourages replication and
encourages one-shot research; the inferential model depends on assumptions
about hypothetical populations and data that cannot be verified; and there are
more. Some of the criticisms are valid and need to be faced more carefully, even
more boldly than they are, while some seem to be 'straw men' set up to be blown
down. For example, the fact that we only reject H0 and do not test H1 is not
particularly satisfying, but the notion that inference is faulty because in
some way it rests on characteristics of populations not observed or not yet
observed is merely stating a fact of inference! Moreover, most of the criticism
When the Master of the Mint has brought the pence, coined, blanched and made
ready, to the place of trial, e.g. the Mint, he must put them all at once on the counter
which is covered with canvas. Then, when the pence have been well turned over
and thoroughly mixed by the hands of the Master of the Mint and the Changer, let
the Changer take a handful in the middle of the heap, moving round nine or ten times
in one direction or the other, until he has taken six pounds. He must then distribute
these two or three times into four heaps, so that they are well mixed. (Stigler, 1977,
p. 495)
The Master of the Mint was allowed a margin of error called the remedy and
had to make good any deficit that was discovered. Although mathematical
statistics played no part in these tests, and if they had they would have been
more precise,1 the procedure itself mirrors modern practice.
The trial of the Pyx even in the Middle Ages consisted of a sample being drawn, a
null hypothesis (the standard) to be tested, a two-sided alternative, and a test statistic
and a critical region (the total weight of the coins and the remedy). The problem
even carried with itself a loss function which was easily interpretable in economic
terms. (Stigler, 1977, p. 499)
1. Stigler notes that the remedy was specified on a per pound basis and that all the coin weights were
combined. This, together with the central limit theorem, almost guarantees that the Master would not
exceed the remedy.
COMBINING OBSERVATIONS
old forms of the word refer to work done by tenants for a feudal superior or to
shared risks among merchants for losses at sea. Modern conceptions of the word
include the notion of a sharing or evening out over a range of disparate values.
A large series of observations or measurements of the same phenomenon
produces a distribution of values. Given the assumption that a single true value
does, in fact, exist, the presence of different values in the series shows that there
are errors. The assumption that over-estimations are as likely as under-estimations
would provide support for the use of the middle value of the series, the
median, as representing the true value. The assumption that the value we observe
most frequently is likely the true value justifies the use of the mode, and the
"evening out" of the different values is seen in the use of the arithmetic mean.
It is this latter measure that is now the statistic of choice when observations are
combined, and there are a number of strands in our history that have contributed
to the justification for its use. The employment of the arithmetic mean and the
use of the law of error have sound mathematical underpinnings in the Principle
of Least Squares, which will be considered later. First, however, a somewhat
critical look at the use of the mean is in order.
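The justifications sketched above can be seen numerically: the mean minimizes the sum of squared errors, while the median minimizes the sum of absolute errors. A small grid-search illustration (the data values are invented for the purpose):

```python
# Invented observations of a single quantity, with one outlying value.
observations = [9.8, 10.1, 10.0, 10.4, 9.7, 10.0, 12.0]

def sum_sq(c):
    """Total squared error of representing every observation by c."""
    return sum((x - c) ** 2 for x in observations)

def sum_abs(c):
    """Total absolute error of representing every observation by c."""
    return sum(abs(x - c) for x in observations)

# Search a fine grid of candidate "representative" values.
candidates = [c / 100 for c in range(900, 1300)]
best_sq = min(candidates, key=sum_sq)
best_abs = min(candidates, key=sum_abs)

mean = sum(observations) / len(observations)
median = sorted(observations)[len(observations) // 2]
print(best_sq, round(mean, 2))  # 10.29 10.29 -- squared error points to the mean
print(best_abs, median)         # 10.0 10.0   -- absolute error points to the median
```

Note how the outlier pulls the least-squares choice toward itself while leaving the median untouched, the classical argument for the median under asymmetric error.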
place of the many values which were actually given to us? And the answer surely
is, that it can not possibly do so; the one thing cannot take the place of the other for
purposes in general, but only for this or that specific purpose. (pp. 429-430)
The mean of observations is a cause, as it were the source from which diverging
errors emanate. The mean of statistics is a description, a representative of the group,
that quantity which, if we must in practice put one quantity for many, minimises the
error unavoidably attending such practice.
Observations are different copies of one original; statistics are different originals
affording one 'generic portrait.' (Edgeworth, 1887, p. 139)
Everything occurs then as though there existed a type of man, from which all other
men differed more or less. Nature has placed before our eyes living examples of
the mean, X̄, and the variance, Σ(X - X̄)²/n. The essential fact is easily seen.
Given that a distribution of observations is normal in form, and given that we
know the mean and the variance of the distribution, then the distribution is
completely and uniquely specified. All its properties can be ascertained given
this information.
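In modern notation the point can be written out explicitly; the normal density contains no constants other than these two parameters (this is the standard modern statement, not the text's own formula):

```latex
f(x) \;=\; \frac{1}{\sigma\sqrt{2\pi}}\,
  \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right),
\qquad \mu = \bar{X}, \quad
\sigma^{2} = \frac{\sum (X-\bar{X})^{2}}{n}
```

Once μ and σ² are fixed, every probability, percentile, and moment of the distribution follows from this single expression.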
In the context of statistics in the social sciences, both the normal law and the
least squares principle are best understood in the context of the linear model.
The model encompasses these constructs and brings together in a formal sense
the mathematics of the combination of observations developed for use in error
estimation and mathematical statistics as exemplified by analysis of variance
and regression analysis.
The earliest examples of the use of samples in social statistics we have seen in
the work of Graunt, Petty, Halley, and the early actuaries. These samples were
neither random nor representative, and mistaken inferences were plentiful. In
any event, it was not until much later, when attempts were made to collect
information on populations, inferential exercises repeated, and the results
compared, that critical examinations of the techniques employed could be made.
Stephan (1948) reports that Sir Frederick Morton Eden estimated the population
of Great Britain in 1800 to be about 9,000,000. This estimate, which was based
on the number of births and the average number of inhabitants in each house (a
number that was obtained by sampling), was confirmed by the first British
census in 1801. Earlier attempts to estimate populations had been made in
France, and Laplace in 1802 made an attempt to do so that followed a scheme
he had devised and published in 1786, a scheme that included a probability
measure of the precision of the estimate. Specifically, Laplace averred that the
odds were 1,161 to 1 that the error would not reach half a million. Westergaard
(1932) provides some more details of these exercises. Elsewhere in Europe and
in the United States the 19th century saw various censuses conducted, as well
as attempts to estimate the size of the population from samples. In the United
States the Constitution provided for a census every 10 years in order to
determine Congressional representation, but a Bill introduced into the British
Parliament in 1753 to establish an annual census was defeated in the House of
Lords.
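The scheme described here was in essence a ratio estimate: count registered births for the whole country, sample some districts to learn the ratio of inhabitants to annual births, and multiply. A toy sketch of the idea follows; every number in it is invented for illustration and none comes from Laplace's or Eden's actual figures.

```python
# Hypothetical Laplace-style ratio estimation of a national population.
# Nationwide birth registers give the total count of annual births;
# a sample of districts gives both inhabitants and births.
national_births = 1_000_000      # invented nationwide register total
sampled_inhabitants = 2_150_000  # invented count in the sampled districts
sampled_births = 75_000          # invented annual births in those districts

ratio = sampled_inhabitants / sampled_births  # inhabitants per annual birth
population_estimate = national_births * ratio
print(round(population_estimate))  # 28666667
```

Laplace's contribution was not the multiplication itself but the attached probability statement about how far such an estimate was likely to be from the truth.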
It appears that the probability mathematicians and the rising group of
political arithmeticians never joined forces in the 19th century. The latter
favored the institution of complete censuses and, generally speaking, were not
mathematicians, and the former were scientists who had many other problems
94 8. SAMPLING AND ESTIMATION
commentary on these and other polls. One of the most successful polling organizations, the American Institute of Public Opinion, headed by George Gallup, began its work in 1933. But the Gallup poll also predicted a win for Dewey over Truman in 1948, and many people have seen the Life photograph of a victorious president holding the Chicago Daily Tribune with its famous Type I error headline, "Dewey defeats Truman."
The result produced one of the first claims that the polls influenced the outcome, inducing complacency in the Republicans and a small turnout of their supporters, leading to the defeat of the Republican candidate. Today there is much controversy over the conducting and the use of polls. They will survive because in general they are correct much more often than not. Their success is due to the development of refined sampling techniques.
In the early years of this century, Bowley began to examine both the practice and theory of survey sampling. His work helped to highlight the utility of probability sampling of one form or another. Systematic selection was adopted and advocated by A. N. Kiaer (1838-1919), Director of the Norwegian Bureau of Statistics, in his examinations of census data, but the majority of influential statisticians, represented by the International Statistical Institute, rejected sampling, pressing for complete enumeration. It took almost 30 years for the utility and benefits of the methods to be appreciated. Seng (1951) and Kruskal and Mosteller (1979) give accounts of this most interesting period in statistical history. The latter authors give a translation and paraphrase of the remarks of Georg von Mayer, Professor at the University of Munich, on Kiaer's work on the representative method, which was presented at a meeting of the Institute in Berne in 1895:
I regard as most dangerous the point of view found in his work. I understand that representative samples can have some value, but it is a value restricted to terrain already illuminated by full coverage. One cannot replace by calculation the real observation of facts. A sample provides statistics for the units actually observed, but not true statistics for the entire terrain.
It is especially dangerous to propose representative sampling in the midst of an assembly of statisticians. Perhaps for legislative or administrative goals sampling may have uses - but one must never forget that it cannot replace a complete survey. It is necessary to add that there is among us these days a current in the minds of
SAMPLING IN THEORY AND PRACTICE 97
Oddly enough, Kiaer's work is not mathematical in the sense that modern methods of parameter estimation are mathematical. At the time those methods were not fully delineated nor understood. Kiaer aimed, by a variety of techniques, to produce a miniature of the population, although he noted as early as 1899 the necessity for the investigation of both the practical and theoretical aspects of his methods. At a meeting in 1901 (a report of which was published in 1903), Kiaer returned to the theme, and it was in a discussion of his contribution that L. von Bortkiewicz suggested that the "calculus of probabilities" could be used to test the efficacy of sampling. By establishing how much of a difference between sample and population could be obtained accidentally and checking whether or not an observed difference lay outside those limits, the representativeness of the sample could be decided on. Bortkiewicz did not, apparently, formulate all the necessary tests, and others had employed this method, but he seems to have been the first to draw the attention of practicing statisticians to the possibilities.
In 1903, Kiaer must have thought that the sampling argument was won, for a subcommittee of the International Statistical Institute proposed a resolution at the Berlin meeting:
The chances are the same for all the items of the groups to be sampled, and the way they are taken is absolutely independent of their magnitude.
It is frequently impossible to cover a whole area as the census does, ... but it is not necessary. We can obtain as good results as we please by sampling, and very often quite small samples are enough; the only difficulty is to ensure that every person or thing has the same chance of inclusion in the investigation. (Bowley, 1906, pp. 551-553)
A distinctive feature of this method, according to Bowley, was that it was a case of cluster sampling. It was assumed that the quantity under investigation was correlated with a number of characters, called controls, and that the regression of the cluster means of the quantity on those of each control was linear. Clusters were to be selected in such a way that the average of each control computed from the chosen clusters should (approximately) equal its population mean. It was hoped that, due to the assumed correlations between controls and the quantity under investigation, the above method of selection would result in a representative sample with respect to the quantity under investigation. (Chang, 1976, pp. 305-306)
(p. 562). He further suggests that these approaches have been misunderstood, due, he thinks, to Fisher's condensed form of explanation and difficult method of attacking the problem:
From the standpoint of the familiar statistical procedures found in our texts, this paper is important for its treatment of confidence intervals and its emphasis of the importance of random sampling. It extended estimation from so-called point estimation, the use of a sample value to infer a population value, to interval estimation, which assesses the probability of a range of values. Neyman demonstrates the use of the Markov method for deriving the best linear
Andrei Andreyevich Markov (or Markoff, 1856-1922) is best known for his studies of the probabilities of linked chains of events. Markov chains have been used in a variety of social and biological studies in the last 30 or 40 years. But Markov made many contributions to probability theory. If we have a non-negative random variable X, then regardless of its distribution, for any positive number c (i.e., c > 0), the probability P(X ≥ cμ) that the random variable X is greater than c times its expected value μ does not exceed 1/c. That is, P(X ≥ cμ) ≤ 1/c. This is known as the Markov Inequality. Markov was a student of Pafnuti L. Tchebycheff (sometimes spelled Chebychev or Chebichev, 1821-1894), who formulated the Tchebycheff Inequality, which states that P(X ≤ μ − dσ or X ≥ μ + dσ) ≤ 1/d², where X is a random variable with expected value μ and variance σ², and d > 0. This result was independently arrived at by the French mathematician I. J. Bienaymé (1796-1876). These inequalities are important in the development of the descriptions of the properties of probability distributions.
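Both inequalities are easy to check empirically. A small simulation sketch; the exponential distribution is just an arbitrary non-negative test case (for it, mean and standard deviation are both 1):

```python
import random

random.seed(1)
xs = [random.expovariate(1.0) for _ in range(100_000)]  # exponential: mu = sigma = 1
mu = sigma = 1.0

c = d = 3.0
# Empirical tail probabilities for the two inequalities
markov = sum(x >= c * mu for x in xs) / len(xs)             # P(X >= c*mu)
cheby = sum(abs(x - mu) >= d * sigma for x in xs) / len(xs) # P(|X - mu| >= d*sigma)

assert markov <= 1 / c        # Markov bound 1/3; the actual tail is about 0.05
assert cheby <= 1 / d ** 2    # Tchebycheff bound 1/9; the actual value is smaller still
```

The bounds are loose, as distribution-free bounds must be, but they hold for any distribution with the required moments.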
THE BATTLE FOR RANDOMIZATION 101
now been demonstrated that the errors, as estimated, will, in such a group, be higher than is usual in random arrangements, and that, in consequence, within such a group, the test of significance is vitiated. (Fisher, 1926b, pp. 506-507)
"What would you do," I had asked, "if, drawing a Latin Square at random for an experiment, you happened to draw a Knut Vik square?" Sir Ronald said he thought he would draw again and that, ideally, a theory explicitly excluding regular squares should be developed. (Savage, 1962, p. 88)
The conclusion is reached that in cases where Latin square designs can be used, and in many cases where randomized blocks have to be employed, the gain in accuracy with systematic arrangements is not likely to be sufficiently great to outweigh the disadvantages to which systematic designs are subject.
On the other hand, systematic arrangements may in certain cases give decidedly greater accuracy than randomized blocks, but it appears that in such cases the use of the modern devices of confounding, quasi-factorial designs, or split-plot Latin squares, which are much more satisfactory statistically, are likely to give a similar gain in accuracy. (Yates, 1939, p. 464)
One wonders, for instance, how many psychometric test scores for policemen, firemen, truck drivers et al. have been interpreted by the clinician in terms of college sophomore norms.
In psychological research we are more frequently interested in making an inference regarding the likeness or difference of two differentially defined universes, such as two racial groups, or an experimental vs. a control group. The writer ventures the guess that at least 90% of the research in psychology involves such comparisons. It is not only necessary to consider the problem of sampling in the case of experimental and control groups, but also convenient from the viewpoint of both good experimentation and sound statistics to do so. (McNemar, 1940a, p. 335)
MacKenzie (1981) gives a brief account of Arthur Black (1851-1893), a tragic figure who, on his death, left a lengthy, and now lost, manuscript, Algebra of Animal Evolution, which was sent to Karl Pearson. "Pearson started to read it, but realized immediately that it discussed topics very similar to those he was working on, and decided not to read it himself but to send it to Francis Galton for his advice" (p. 99). Of great interest is that buried among Black's notebooks, which have survived, is a derivation of the chi-square approximation to the multinomial distribution.
106 9. SAMPLING DISTRIBUTIONS
This is the essence of the distribution that Pearson used in his formulation of the test of goodness-of-fit. Why was such a test so necessary?
Games of chance and observational errors in astronomy and surveying were subject to the random processes that led scientists in the 18th and early 19th centuries to the examination of error distributions. The quest was for a sound mathematical basis for the exercise of estimating true values. Simpson introduced a triangular distribution, and Daniel Bernoulli in 1777 suggested a semicircular one. In the absence of empirical data, these distributions, established on a priori grounds, were somewhat arbitrarily regarded as having no more and no less claim to accuracy and utility. But the 19th century saw the normal law established. It had powerful credentials because of the fame of its two main developers, Gauss and Laplace. Starting from the assumption that the arithmetic mean represented the true value, Gauss showed that the error distribution was normal. Laplace, starting from the view that every individual observation arises from a very great number of independently acting random factors (the essence of the central limit theorem), came to the same result. Gauss's proof of the method of least squares further established the importance of the normal distribution, and when, in 1801, he used the initial data collected from observations on a new planet, Ceres, to accurately predict where it would be observed later in the year, these procedures, as well as Gauss's reputation, were firmly established in astronomy.
In astronomy as well as in biology and social science, the Laplace-Gaussian distribution was indeed law; it was indeed regarded as normal. This prescription led to Quetelet's use of it as a "double-edged sword" (see chapter 1) and led to many astronomers using it as a reason to reject observations that were considered to be doubtful, for example Peirce (1852). Quetelet's procedure for establishing the "fit" of the normal curve was the same as that of the early astronomers. The tabulation, and later the graphing, of observed and expected frequencies led to their being compared by nothing more than visual inspection (see, e.g., Airy, 1861).
2. A history of the mathematics of the χ² distribution would include the development of the gamma function by the French mathematician I. J. Bienaymé (1796-1876), who, in the 1850s, found a statistic that is equivalent to the Pearson χ² in the context of least squares theory. Pearson was apparently not aware of his work; nor were F. R. Helmert and E. Abbe, who, in the 1860s and 1870s, also arrived at the χ² distribution for the sum of squares of independent normal variates. Long after the test had become commonly used, von Mises (1919) linked Bienaymé's work to the Pearson χ². Details of this aspect of the test and distribution's history are given by Lancaster (1966), Sheynin (1966), and Chang (1973).
3. Ceres was the first discovered "planetoid" in the asteroid belt. Gauss also determined the orbit of Pallas, another planetoid.
It appears from Yule's lecture notes that Karl Pearson probably was employing a procedure that used the ratio of an absolute deviation from expectation to its standard error to examine the record of 7,000 (actually 7,006) tosses of 12 dice made by Mr Hull, a clerk at University College. This was the record (see chapter 1) that Weldon said Karl Pearson had rejected as "intrinsically incredible." Yule's notes also contain an empirical measure of goodness of fit that Egon Pearson says may be set down roughly as R = Σ|O − T|/T, where |O − T| is the absolute value of the difference between the observed and theoretical frequency and T the total theoretical frequency, though it should be mentioned that the actual notes do not contain the formula in this form. This expression, the mean absolute error, was in use during the latter years of the 19th century, and Bowley used it in the first edition of his textbook in 1902.
Karl Pearson's second statistical paper (1895) on asymmetrical frequency curves occupied the attention of the biologists, but the question of the biological meaning of skewed distributions was not one that, at the time, was in the
I have earlier pointed out other objections to the χ² test ... I have never thought it necessary to make any rejoinder to Pearson's characteristically bitter reply to my criticism, nor do I yet. The χ² test leads to this absurdity. (Pearl, 1917, p. 145)
Pearson replied:
Dear Mr Fisher,
I am afraid that I don't agree with your criticism of Froken K. Smith (she is a pupil of Thiele's and one of the most brilliant of the younger Danish statisticians) ... your argument that χ² varies with the grouping is of course well known ... What we have to determine, however, is with given grouping which method gives the lowest χ². (p. 455)
Pearson asks for a defense of Fisher's argument before considering its publication. Fisher's expanded criticism received short shrift. After thanking Fisher for a copy of his paper on Mendelian inheritance (Fisher, 1918), he hopes to find time for it, but points out that he is "not a believer in cumulative Mendelian factors as being the solution of the heredity puzzle" (p. 456). He then rejects Fisher's paper.
Also I fear that I do not agree with your criticism of Dr Kirstine Smith's paper and under present pressure of circumstances must keep the little space I have in Biometrika free from controversy which can only waste what power I have for publishing original work. (p. 456)
Egon Pearson thinks that we can accept his father's reasons for these rejections; the pressure of war work, the suspension of many of his projects, the fact that he was over 60 years old with much work unfinished, and his memory of the strain that had been placed on Weldon in the row with Bateson. But if he really was trying to shun controversy by refusing to publish controversial views, then he most certainly did not succeed. Egon Pearson's defense is entirely understandable but far too charitable. Pearson had censored Fisher's work before and appears to have been trying to establish his authority over the younger man and over statistics. Even his offer to Fisher of an appointment at the Galton Laboratory might be viewed in this light. Fisher Box (1978) notes that these experiences influenced Fisher in his refusal of the offer, in the summer of 1919, recognizing that, "nothing would be taught or published at the Galton laboratory
5. Here Pearson is dissembling. Fisher's paper was originally submitted to the Royal Society and, although it was not rejected outright, the Chairman of the Sectional Committee for Zoology "had been in communication with the communicator of the paper, who proposed to withdraw it." Karl Pearson was one of the two referees. Norton and Pearson (1976) describe the event and publish the referees' reports.
CHI-SQUARE 113
Dear Mr Fisher,
Only in passing through Town today did I find your communication of August 3rd. I am very sorry for the delay in answering ...
As there has been a delay of three weeks already, and as I fear if I could give full attention to your paper, which I cannot at the present time, I should be unlikely to publish it in its present form ... I would prefer you published elsewhere ... I am regretfully compelled to exclude all that I think erroneous on my own judgment, because I cannot afford controversy. (quoted by E. S. Pearson, 1968, p. 453)
This short paper with all its juvenile inadequacies, yet did something to break the ice. Any reader who feels exasperated by its tentative and piecemeal character should remember that it had to find its way to publication past critics who, in the first place, could not believe that Pearson's work stood in need of correction, and who, if this had to be admitted, were sure that they themselves had corrected it. (Fisher's comment in Bennett, 1971, p. 336)
the value of n' with which the table should be entered is not now equal to the number of cells but to one more than the number of degrees of freedom in the distribution. Thus for a contingency table of r rows and c columns we should take n' = (c − 1)(r − 1) + 1 instead of n' = cr. This modification often makes a very great difference to the probability (P) that a given value of χ² should have been obtained by chance. (Fisher, 1922a, p. 88)
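In modern terms Fisher's correction fixes the degrees of freedom of the contingency-table test. A minimal sketch of the two conventions (the function names are ours, not from the sources):

```python
def fisher_n_prime(r, c):
    """Fisher's entry value for Pearson's chi-square tables:
    one more than the degrees of freedom, (c - 1)(r - 1) + 1."""
    return (c - 1) * (r - 1) + 1

def pearson_n_prime(r, c):
    """Pearson's original (incorrect) entry value: the cell count."""
    return r * c

# For a 2 x 2 table the difference is dramatic:
print(fisher_n_prime(2, 2), pearson_n_prime(2, 2))  # 2 4
```

Entering the tables with n' = 4 instead of n' = 2 makes the same observed χ² look far less extreme, which is why the correction "often makes a very great difference" to P.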
I trust my critic will pardon me for comparing him with Don Quixote tilting at the windmill; he must either destroy himself, or the whole theory of probable errors, for they are invariably based on using sample values for those of the sampled population unknown to us. For example here is an argument for Don Quixote of the simplest nature ... (K. Pearson, 1922, p. 191)
The editors of the Journal of the Royal Statistical Society turned tail and ran, refusing to publish Fisher's rejoinder. Fisher vigorously protested, but to no avail, and he resigned from the Society. There were other questions about Pearson's paper that he dealt with in Bowley's journal Economica in 1923 and in the Journal of the Royal Statistical Society (Fisher, 1924a), but the end came in 1926 when, using data tables that had been published in Biometrika, Fisher
calculated the actual average value of χ² which he had proved earlier should theoretically be unity and which Pearson still maintained should be 3. In every case the average was close to unity, in no case near to 3 ... There was no reply. (Fisher Box, 1978, p. 88, commenting on Fisher, 1926a)
THE t DISTRIBUTION
The agricultural (and indeed almost any) Experiments naturally required a solution of the mean/S.D. problem, and the Experimental Brewery which concerns such things as the connection between analysis of malt or hops, and the behaviour of the beer, and which takes a day to each unit of the experiment, thus limiting the numbers, demanded an answer to such questions as, "If with a small number of cases I get a value r, what is the probability that there is really a positive correlation of greater value than (say) .25?" (quoted by E. S. Pearson, 1968, p. 447)
Egon Pearson (1939a) notes that in his first few years at Guinness, Gosset was making use of Airy's Theory of Errors (1861), Lupton's Notes on Observations (1898), and Merriman's The Method of Least Squares (1884). In 1904 he presented a report to his firm that stated clearly the utility of the application of statistics to the work of the brewery and pointed up the particular difficulties that might be encountered. The report concludes:
We have been met with the difficulty that none of our books mentions the odds, which are conveniently accepted as being sufficient to establish any conclusion, and it might be of assistance to us to consult some mathematical physicist on the matter. (quoted by Pearson, 1939a, p. 215)
A meeting was in fact arranged with Pearson, and this took place in the summer of 1905. Not all of Gosset's problems were solved, but a supplement to his report and a second report in late August 1905 produced many changes in the statistics used in the brewery. The standard deviation replaced the mean error, and Pearson's correlation coefficient became an almost routine procedure in examining relationships among the many factors involved with brewing. But one feature of the work concerned Gosset: "Correlation coefficients are usually calculated from large numbers of cases, in fact I have found only one paper in Biometrika of which the cases are as few in number as those at which I have been working lately" (quoted by Pearson, 1939a, p. 217).
Gosset expressed doubt about the reliability of the probable error formula for the correlation coefficient when it was applied to small samples. He went to London in September 1906 to spend a year at the Biometric Laboratory. His first paper, published in 1907, derives Poisson's limit of the binomial distribution and applies it to the error in sampling when yeast cells are counted in a haemacytometer. But his most important work during that year was the preparation of his two papers on the probable error of the mean and of the correlation coefficient, both of which were published in 1908.
The usual method of determining that the mean of the population lies within a given distance of the mean of the sample, is to assume a normal distribution about the mean of the sample with a standard deviation equal to s/√n, where s is the standard deviation of the sample, and to use the tables of the probability integral.
But, as we decrease the value of the number of experiments, the value of the standard deviation found from the sample of experiments becomes itself subject to increasing error, until judgements reached in this way become altogether misleading. ("Student," 1908a, pp. 1-2)
I. The equation is determined of the curve which represents the frequency distribution of standard deviations of samples drawn from a normal population.
II. There is shown to be no kind of correlation between the mean and the standard deviation of such a sample.
III. The equation is determined of the curve representing the frequency distribution of a quantity z, which is obtained by dividing the distance between the mean of the sample and the mean of the population by the standard deviation of the sample.
IV. The curve found in I. is discussed.
V. The curve found in III. is discussed.
VI. The two curves are compared with some actual distributions.
VII. Tables of the curves found in III. are given for samples of different size.
VIII and IX. The tables are explained and some instances are given of their use.
X. Conclusions.
("Student," 1908, p. 2)
6. Eisenhart (1970) concludes that the shift from z to t was due to Fisher and that Gosset chose t for the new statistic. In their correspondence Gosset used t for his own calculations and x for those of Fisher.
Dear Pearson,
I am enclosing a letter which gives a proof of my formulae for the frequency distribution of z (= x/s), ... Would you mind looking at it for me; I don't feel at home in more than three dimensions even if I could understand it otherwise.
It seemed to me that if it's all right perhaps you might like to put the proof in a note. It's so nice and mathematical that it might appeal to some people. (quoted by E. S. Pearson, 1968, p. 446)
7. R was Gosset's symbol for the population correlation coefficient; ρ (the Greek letter rho) appears to have been first used for this value by Soper (1913). The important point is that a different symbol was used for sample statistic and population parameter.
When the resemblances have been expressed in terms of the new variable, a correlation table may be constructed by picking out every pair of resemblances between the same twins in different traits. The values are now centered symmetrically about a mean at 1.28, and the correlation is found to be -.016 ± .048, negative but quite insignificant. The result entirely corroborates THORNDIKE'S conclusions as to the specialization of resemblance. (Fisher, 1919, p. 493)
larger than I had intended, and to make it at all complete should be larger still, but I shall not have time to make it so, as I am sailing for Canada on the 25th, and will not be back till September. (quoted by Fisher Box, 1981, p. 66)
THE F DISTRIBUTION
The Toronto paper did not, in fact, appear until almost 4 years after the meeting, by which time the first edition of Statistical Methods for Research Workers (1925) had been published. The first use of an analysis of variance technique was reported earlier (Fisher & Mackenzie, 1923), but this paper was not encountered by many outside the area of agricultural research. It is possible that if the mathematics of the Toronto paper had been included in Fisher's book then much of the difficulty that its first readers had with it would have been reduced or eliminated, a point that Fisher acknowledged.
After a general introduction on error curves and goodness-of-fit, Fisher examines the χ² statistic and briefly discusses his (correct) approach to degrees of freedom. He then points out that if a number of quantities x1, ..., xn are distributed independently in the normal distribution with unit standard deviation, then χ² = Σx² is distributed as "the Pearsonian measure of goodness of fit." In fact, Fisher uses S(x²) to denote the latter expression, but here, and in what follows on the commentary on this paper, the more familiar modern notation is employed. Fisher refers to "Student's" work on the error curve of the standard deviation of a small sample drawn from a normal distribution and shows its relation to χ².
The only exact treatment is to eliminate the unknown quantities σ₁ and σ₂ from the distribution by replacing the distribution of s by that of log s, and so deriving the distribution of log s₁/s₂. Whereas the sampling errors in s₁ are proportional to σ₁, the sampling errors of log s₁ depend only upon the size of the sample from which s₁ was calculated. (Fisher, 1924b, p. 808)
The cases for infinite and unit degrees of freedom are then considered. In the latter case the "Student" distributions are generated.
Fig 9.3 The F Distributions for 1,5 and 8,8 Degrees of Freedom
The above results hold for known p. If p is 3/5, then we can be morally certain that ... the deviation ... will be less than 1/50. But Bernoulli's problem is the inverse of this. When p is unknown, can his analysis tell when we can be morally certain that some estimate of p is right? That is the problem of statistical inference. (Hacking, 1971, p. 222)
COMPARING MEASUREMENTS
Since the time of De Moivre, the variables that have been examined by workers in the field of probability have expressed measurements as multiples of a variety of basic units that reflect the dispersion of the range of possible scores. Today the chosen units are units of standard deviation, and the scores obtained are called standard scores or z scores. Karl Pearson used the term standard deviation and gave it the symbol σ (the lower case Greek letter sigma) in 1894, but the unit was known (although not in its present-day form) to De Moivre. It corresponds to that point on the abscissa of a normal distribution such that an ordinate erected from it would cut the curve at its point of inflection, or, in simple terms, the point where the curvature of the function changes from concave to convex. Sir George Airy (1801-1892) named σ√2 the modulus (although this term had been used, in passing, for √n by De Moivre as early as 1733) and described a variety of other possible measures, including the probable error, in 1875. This latter was the unit chosen by Galton (whose work is discussed later), although he objected strongly to the name:
1. The probable error is defined as one half of the quantity that encompasses the middle 50% of a normal distribution of measurements. It is equivalent to what is sometimes called the semi-interquartile range (that portion of the distribution between the first quartile, or the twenty-fifth percentile, and the third quartile, or the seventy-fifth percentile, divided by two). The probable error is 0.67449 times the standard deviation. Nowadays, everyone follows Pearson (1894), who wrote, "I have always found it more convenient to work with the standard deviation than with the probable error or the modulus, in terms of which the error function is usually tabulated" (p. 88).
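The constant 0.67449 is simply the 75th percentile of the standard normal distribution, which can be verified in one line with Python's standard library:

```python
from statistics import NormalDist

# Half the probability mass of a normal distribution lies within
# +/- one probable error of the mean, so the probable error is the
# 75th-percentile deviation: about 0.67449 standard deviations.
pe_in_sd_units = NormalDist().inv_cdf(0.75)
print(round(pe_in_sd_units, 5))  # 0.67449
```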
128 10. COMPARISONS, CORRELATIONS AND PREDICTIONS
prices of suits. This obvious fact would be made even more obvious by the reaction to an announcement that someone had just purchased a new car for $1,000. What is a lot of money for a suit suddenly becomes almost trifling for a brand new car. Again this judgment depends on a knowledge, which may not be at all precise, of the average price, and the price range, of automobiles. One can own an expensive suit and a cheap car and have paid the same absolute amount for both, although it must be admitted that such a juxtaposition of purchases is unlikely! The point is that these examples illustrate the fundamental objective of standard scores, the comparison of measurements.
If the mean price of cars is $9,000 and the standard deviation is $2,500, then our $1,000 car has an equivalent z score of ($1,000 − $9,000)/$2,500, or −3.20. If suit prices have a mean of $350 and a standard deviation of $150, then the $1,000 suit has an equivalent z score of ($1,000 − $350)/$150, or +4.33. We have a very cheap car and a very, very expensive suit. It might be added that, at the time of writing, these figures were entirely hypothetical.
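The two calculations above can be captured in a few lines (the figures are the chapter's hypothetical ones):

```python
def z_score(x, mean, sd):
    """Standard score: how many standard deviations x lies from the mean."""
    return (x - mean) / sd

car_z = z_score(1_000, mean=9_000, sd=2_500)   # -3.20: a very cheap car
suit_z = z_score(1_000, mean=350, sd=150)      # +4.33: a very expensive suit
print(round(car_z, 2), round(suit_z, 2))
```

The same dollar amount lands in opposite tails of the two price distributions, which is exactly the comparison standard scores were designed to make.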
These simple manipulations are well known to users of statistics, and, of course, when they are applied in conjunction with the probability distributions of the measurements, they enable us to obtain the probability of occurrence of particular scores or particular ranges of scores. They are intuitively sensible. A more challenging problem is that of the comparison of sets of pairs of scores and of determining a quantitative description of the relationship between them. It is a problem that was solved during the second half of the 19th century.
During the years 1866 to 1869 Galton was in generally poor health, but he collected the material for, and wrote, one of his most famous books, Hereditary Genius, which was published in 1869. In this book and in English Men of Science, published in 1874, Galton expounds and expands upon his view that ability, talent, intellectual power, and accompanying eminence are innately rather than environmentally determined. The corollary of this view was, of course, that agencies of social control should be established that would encourage the "best stock" to have children, and to discourage those whom Galton described as having the smallest quantities of "civic worth" from breeding. It has been mentioned that Galton was a collector of measurements to a degree that was almost compulsive, and it is certain that he was not happy with the fact that he had had to use qualitative rather than quantitative data in support of his arguments in the two books just cited.
MacKenzie (1981) maintains that:
In Memories of My Life (1908), Galton says that he determined on experimenting with sweet peas in 1885 and that the suggestion had come to him from Sir Joseph Hooker (the botanist, 1817-1911) and Darwin. But Darwin had died in 1882 and the experiments must have been suggested and were begun in 1875. Assuming that this is not just a typographical error, perhaps Galton was recalling the date of his important paper on regression in hereditary stature.
weight categorieshad weights that were what we would now call normally
distributed, and,the probable error(we would now calculate the standard
deviation)was thesame. However, the mean weightof each groupof offspring
was not asextremeas theparental weight. Large parent seeds produced larger
than average seeds, but mean offspring weight was not aslarge as parental
weight. At the other extremesmall parental seeds produced, on average, smaller
offspring seeds,but themeanof the offspring was found not to be assmall as
that of the parents. This phenomenon Galton termed reversion.
Seed diameters, Galton noted, are directly proportional to their weight, and
show the same effect:
All the formulae for Conic Sections having long since gone out of my head, I went
on my return to London to the Royal Institution to read them up. Professor, now Sir
James, Dewar, came in, and probably noticing signs of despair on my face, asked
me what I was about; then said, "Why do you bother over this? My brother-in-law,
J. Hamilton Dickson of Peterhouse loves problems and wants new ones. Send it to
him." I did so, under the form of a problem in mechanics, and he most cordially
helped me by working it out, as proposed, on the basis of the usually accepted and
generally justifiable Gaussian Law of Error. (Galton, 1908, pp. 302-303)
I may be permitted to say that I never felt such a glow of loyalty and respect towards
the sovereignty and magnificent sway of mathematical analysis as when his answer
Galton (1885a) maintains that this factor "differs a very little from the factors employed by other
anthropologists, who, moreover, differ a trifle between themselves; anyhow it suits my data better than
1.07 or 1.09" (p. 247). Galton also maintained (and checked in his data) "that marriage selection takes
little or no account of shortness or tallness ... we may therefore regard the married folk as couples picked
out of the general population at haphazard" (1885a, pp. 250-251) - a statement that is not only
implausible to anyone who has casually observed married couples but is also not borne out by reasonably
careful investigation.
FIG. 10.4 Frequency Surfaces and Ellipses
(From Yule & Kendall, 14th Ed., 1950)
GALTON'S DISCOVERY OF REGRESSION 137
Now Galton was certainly not a mathematical ignoramus, but the fact that
one of statistics' founding fathers sought help for the analysis of his momentous
discovery may be of some small comfort to students of the social sciences who
sometimes find mathematics such a trial. Another breakthrough was to come,
and again, it was to culminate in a mathematical analysis, this time by Karl
Pearson, and the development of the familiar formula for the correlation
coefficient.
In 1886, a paper on "Family Likeness in Stature," published in the Proceedings
of the Royal Society, presents Hamilton Dickson's contribution, as well as
data collected by Galton from family records. Of passing interest is his use of
the symbol w for "the ratio of regression" (short-lived, as it turns out) as he
details correlations between pairs of relatives.
Galton's work had led him to a method of describing the relationship
between parents and offspring and between other relatives on a particular
characteristic by using the regression slope. Now he applied himself to the task
of quantifying the relationship between different characteristics, the sort of data
collected at the anthropometric laboratory. It dawned on him in a flash of insight
that if each characteristic was measured on a scale based on its own variability
(in other words, in what we now call standard scores), then the regression
coefficient could be applied to these data. It was noted in chapter 1 that the
location of this illumination was perhaps not the place that Galton recalled in
his memoirs (written when he was in his eighties), so that the commemorative
tablet that Pearson said the discovery deserved will have to be sited carefully.
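Galton's insight can be checked numerically. In the sketch below (illustrative data, not Galton's), the raw regression slope depends on which variable is treated as the subject, but on standard scores the slope is the same either way: the correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(0.0, 3.0, n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 2.0, n)  # arbitrary units and slope

cov = np.cov(x, y, ddof=0)[0, 1]
slope_y_on_x = cov / x.var()  # regression slope with y as "subject"
slope_x_on_y = cov / y.var()  # with x as "subject": a different number

# Measured on standard scores (each variable scaled by its own variability),
# both regressions share the single symmetric slope r.
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
r = np.mean(zx * zy)
print(slope_y_on_x, slope_x_on_y, r)
```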
Before examining some of the consequences of Galton's inspiration, the
phenomenon of regression is worth a further look. Arithmetically it is real
enough, but, as Forrest (1974) puts it:

It is not that the offspring have been forced towards mediocrity by the pressure of
their mediocre remote ancestry, but a consequence of a less than perfect correlation
between the parents and their offspring. By restricting his analysis to the offspring
of a selected parentage and attempting to understand their deviations from the mean,
Galton fails to account for the deviation of all offspring. ... Galton's conclusion is
that regression is perpetual and that the only way in which evolutionary change can
occur is through the occurrence of sports. (p. 206)5
5 In this context the term "sport" refers to an animal or a plant that differs strikingly from its species
type. In modern parlance we would speak of "mutations." It is ironic that the Mendelians, led by William
Bateson (1861-1926), used Galton's work in support of their argument that evolution was discontinuous
and saltatory, and that the biometricians, led by Pearson and Weldon, who held fast to the notion that
continuous variation was the fountainhead of evolution, took inspiration from the same source.
Wallis and Roberts (1956) present a delightful account of the fallacy and
give examples of it in connection with company profits, family incomes,
mid-term and final grades, and sales and political campaigns. As they say:

Take any set of data, arrange them in groups according to some characteristic, and
then for each group compute the average of some second characteristic. Then the
variability of the second characteristic will usually appear to be less than that of the
first characteristic. (p. 263)
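A minimal simulation of the procedure Wallis and Roberts describe (hypothetical midterm and final grades, not their data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two imperfectly correlated characteristics, e.g. midterm and final grades.
n = 50_000
midterm = rng.normal(70.0, 10.0, n)
final = 70.0 + 0.6 * (midterm - 70.0) + rng.normal(0.0, 8.0, n)

# Group by midterm decile, then average the finals within each group.
edges = np.quantile(midterm, np.linspace(0.1, 0.9, 9))
group = np.digitize(midterm, edges)
group_mid = np.array([midterm[group == g].mean() for g in range(10)])
group_fin = np.array([final[group == g].mean() for g in range(10)])

# The group averages of the second characteristic are less spread out:
print(group_mid.std(), group_fin.std())
```

The effect is regression toward the mean in disguise: the group averages of the second characteristic are pulled toward the grand mean by the imperfect correlation.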
The term "regression" is not a particularly happy one from the etymological point
of view, but it is so firmly embedded in statistical literature that we make no attempt
to replace it by an expression which would more suitably express its essential
properties. (p. 213)
Let y = the deviation of the subject, whichever of the two variables may be taken in
that capacity; and let x_1, x_2, x_3, &c., be the corresponding deviations of the relative,
and let the mean of these be X. Then we find: (1) that y = rX for all values of y; (2)
that r is the same whichever of the two variables is taken for the subject; (3) that r is
always less than 1; (4) that r measures the closeness of co-relation. (Galton, 1888a,
p. 145)
In June 1890, Weldon was elected a Fellow of the Royal Society, and later that
year became Jodrell Professor of Zoology at University College, London. In
March of the same year the Royal Society had received the first of his biometric
papers, which describes the distribution of variations in a number of
measurements made on shrimps. A Marine Biological Laboratory had been
constructed at Plymouth 2 years earlier, and since that time Weldon had spent
part of the year there collecting measurements of the physical dimensions of
these creatures and their organs. The Royal Society paper had been sent to
Galton for review, and with his help the statistical analyses had been reworked.
This marked the
(1.) . . . let all those individuals be chosen in which a certain organ, A, differs from
its average size by a fixed amount, Y; then, in these individuals, let the deviations of
a second organ, B, from its average be measured. The various individuals will exhibit
deviations of B equal to x_1, x_2, x_3, . . . , whose mean may be called x_m. The ratio x_m/Y
will be constant for all values of Y.

In the same way, suppose those individuals are chosen in which the organ B has
a constant deviation, X; then, in these individuals, y_m, the mean deviation of the organ
A, will have the same ratio to X, whatever may be the value of X.

(2.) The ratios x_m/Y and y_m/X are connected by an interesting relation. Let Q_a
represent the probable error of distribution of the organ A about its average, and Q_b
that of the organ B; then

(x_m/Q_b) / (Y/Q_a) = (y_m/Q_a) / (X/Q_b) = a constant.

So that by taking a fixed deviation of either organ, expressed in terms of its
probable error, and by expressing the mean associated deviation of the second organ
in terms of its probable error, a ratio may be determined, whose value becomes ±1
when a change in either organ involves an equal change in the other, and 0 when the
two organs are quite independent. This constant, therefore, measures the "degree of
correlation" between the two organs. (Weldon, 1892, p. 3)
Naples and the other from Plymouth Sound. In this work, Weldon (1893)
computes the mean, the mean error, and the modulus. His greater mathematical
sophistication in this work is evident. He states:
Weldon found that in the Naples specimens the distribution of the "frontal
breadth" produced what he terms an "asymmetrical result." This finding, he
hoped, might arise from the presence in the sample of two races of individuals.
He notes that Karl Pearson had tested this supposition and found that it was
likely. Pearson (1906) states in his obituary of Weldon that it was this problem
that led to his (Pearson's) first paper in the Mathematical Contributions to the
Theory of Evolution series, received by the Royal Society in 1893.
Weldon definedr and attemptedto nameit for Galton:
7 Pearson placed Third Wrangler in the Mathematical Tripos at Cambridge in 1879. The "Wranglers"
were the mathematics students at Cambridge who obtained First Class Honors - the ones who most
successfully "wrangle" with math problems. This method of classification was abandoned in the early
years of this century.
THE COEFFICIENT OF CORRELATION 145
Pearson goes on to define the mean, median, and mode, the normal
probability distribution, correlation, and regression, as well as various terms
employed in selection and heredity. Next comes a historical section, which is
examined later in this chapter. Section 4 of Pearson's paper examines the
"special case of two correlated organs." He derives what he terms the "well-
known Galtonian form of the frequency for two correlated variables," and says
that r is the "GALTON function or coefficient of correlation" (p. 264).
However, he is not satisfied that the methods used by Galton and Weldon give
practically the best method of determining r, and he goes on to show by what
we would now call the maximum likelihood method that S(xy)/(nσ1σ2) is the
best value (today we replace S by Σ for summation). This expression is familiar
to every beginning student in statistics. As Pearson (1896) puts it, "This value
presents no practical difficulty in calculation, and therefore we shall adopt it"8
(p. 265). It is now well known that we have done precisely that.
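In modern notation the "best value" is r = Σxy / (nσ1σ2), with x and y taken as deviations from their means and σ the population standard deviation. A minimal sketch:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's product-moment formula: sum(xy) / (n * sigma1 * sigma2),
    where x and y are deviations from their means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / (len(x) * dx.std() * dy.std())

# Illustrative check against NumPy's built-in correlation.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 1.0, 4.0, 3.0, 6.0]
print(pearson_r(a, b), np.corrcoef(a, b)[0, 1])  # identical values
```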
CORRELATION - CONTROVERSIES AND CHARACTER
They have been accepted by later writers, notably Mr Yule in his manual of statistics,
who writes (p. 188): "Bravais introduced the product-sum, but not a single symbol
for a coefficient of correlation. Sir Francis Galton developed the practical method,
determining his coefficient (Galton's function as it was termed at first) graphically.
Edgeworth developed the theoretical side further and Pearson introduced the
product-sum formula."
Now clearly, it is just not the case "that nearly the whole of the above
statements are hopelessly incorrect." In fact, what Pearson is trying to do is to
emphasize the importance of his own and Galton's contribution and to shift the
blame for the maintenance of historical inaccuracies to Yule, with whom he
had come to have some serious disagreement. The nature of this disagreement
is of interest and can be examined from at least three perspectives. The first is
that of eugenics, the second is that of the personalities of the antagonists, and
the third is that of the fundamental utility of, and the assumptions underlying,
measures of association. However, before these matters are examined, some
brief comment on the contributions of earlier scholars to the mathematics of
correlation is in order.
The final years of the 18th century and the first 20 years of the 19th were the
period in which the theoretical foundations of the mathematics of errors of
observation were laid. Laplace (1749-1827) and Gauss (1777-1855) are the
best known of the mathematicians who derived the law of frequency of error
and described its application, but a number of other writers also made
significant contributions (see Walker, 1929, for an account of these
developments). These scholars all examined the question of the probability of
the joint occurrence of two errors, but "None of them conceived of this as a
matter which could have application outside the fields of astronomy, physics,
and geodesy or gambling" (Walker, 1929, p. 94).
In fact, put simply, these workers were interested solely in the mathematics
associated with the probability of the simultaneous occurrence of two errors in,
say, the measurement of the position of a point in a plane or in three dimensions.
They were clearly not looking for a measure of a possible relationship between
the errors and certainly not considering the notion of organic relationships
among directly measured variables. Indeed, astronomers and surveyors sought
to make their basic measurements independent. Having said this, it is apparent
that the mathematical formulations that were produced by these earlier
scientists are strikingly similar to those deduced by Galton and Hamilton Dickson.
Auguste Bravais (1811-1863), who had careers as a naval officer, an
astronomer, and a physicist, perhaps came closest to anticipating the correlation
coefficient; indeed, he even uses the term correlation in his paper of 1846.
Bravais derived the formula for the frequency surface of the bivariate normal
distribution and showed that it was a series of concentric ellipses, as did Galton
and Hamilton Dickson 40 years later. Pearson's acknowledgment of the work
of Bravais led to the correlation coefficient sometimes being called the
Bravais-Pearson coefficient.

When Pearson came, in 1920, to revise his estimation of Bravais' role, he
              Vaccinated    Unvaccinated
Survived         AB             Aβ
Died             αB             αβ
Yule's paper had been "received" by the Royal Society in October 1899,
and "read" in December of the same year. It was described by Pearson (1900b),
in a paper that examines the same problem, as "Mr Yule's valuable memoir"
(p. 1). In his paper, Pearson undertakes to investigate "the theory of the whole
subject" (p. 1) and arrives at his measure of association for two-by-two
contingency table frequencies, an index that he called the tetrachoric
coefficient of correlation. He examines other possible measures of association,
including Yule's Q, but he considers them to be merely approximations to the
tetrachoric coefficient. The crucial difference between Yule's approach and
that of Pearson is that Yule's criteria for a measure of association are empirical
and arithmetical, whereas the fundamental assumption for Pearson was that the
attributes whose frequencies were counted in fact arose from an underlying
continuous bivariate normal distribution. The details of Pearson's method are
somewhat complex and are not examined here. There were a number of
developments from this work, including Pearson's derivation, in 1904, of the
mean square contingency and the contingency coefficient. All these measures
demanded the assumption of an underlying continuous distribution, even
though the variables, as they were considered, were categorical. Battle
commenced in late 1905 when Yule criticized Pearson's assumptions in a paper
read to the Royal Society (Yule, 1906). Biometrika's readers were soon to see
"Reply to Certain Criticisms of Mr G. U. Yule" (Pearson, 1907) and, after
Yule's discussion of his indices appeared in the first edition of his textbook,
were treated to David Heron's exhortation, "The Danger of Certain Formulae
Suggested as Substitutes for the Correlation Coefficient" (Heron, 1911). These
were stirring statistical times, marked by swingeing attacks as the biometricians
defended their position:
All those who have died of small-pox are all equally dead: no one of them is more
dead or less dead than another, and the dead are quite distinct from the survivors.
The introduction of needless and unverifiable hypotheses does not appear to me
to be a desirable proceeding in scientific work. (Yule, 1912, pp. 611-612)
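Yule's Q, the index under attack, needs no distributional assumptions at all. For a 2 × 2 table with cell frequencies a, b, c, d it is Q = (ad - bc)/(ad + bc); the frequencies below are hypothetical, not Yule's smallpox data.

```python
def yules_q(a, b, c, d):
    """Yule's Q for the 2x2 table [[a, b], [c, d]]; ranges from -1 to +1,
    with 0 indicating independence of the two attributes."""
    return (a * d - b * c) / (a * d + b * c)

# Hypothetical table: rows survived/died, columns vaccinated/unvaccinated.
q = yules_q(90, 60, 10, 40)
print(q)  # positive: survival is associated with vaccination
```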
Yule and his great friend Major Greenwood poked private fun at the
opposition. Parts of a fantasy sent to Yule by Greenwood in November 1913
are reproduced here:

The logical positions of the two sides were quite different. For Pearson it
was absolutely necessary to preserve the link with interval-level measurement
where the mathematics of correlation had been fully specified.
Surely neither in the best type of religion nor in the best type of science should hatred
enter in at all. ... In one respect only has scientific controversy perforce improved
since the seventeenth century. If A disagrees with B's arguments, dislikes his
personality and is annoyed by the cock of his hat, he can no longer, failing all else,
resort to abuse of B's latinity. (Yule, 1939, p. 221)
Yule had respect and affection for "K.P." even though he had distanced
himself from the biometricians of Gower Street (where the Biometric
Laboratory was located). He clearly disliked the way in which Pearson and his
followers closed ranks and prepared for combat at the sniff of criticism. In some
MacKenzie's sociohistorical analysis is both compelling and provocative,
but at least two points, both of which are recognized by him, need to be made.
The first is that this view of Pearson's motivations is contrary to his earlier
expressed views of the positivistic nature of science (Pearson, 1892), and
second, the controversy might be placed in the context of a tightly knit
academic group defending its position. To the first MacKenzie says that
practical considerations outweighed philosophical ones, and yet it is clear that
practical demands did not lead the biometricians to form or to join political
groups that might have made their aspirations reality. The second view is
mundane and realistic. The discipline of psychology has seen a number of
controversies in its short history. Notable among them is the connectionist
(stimulus-response) versus cognitive argument in the field of learning theory.
The phenomenological, the psychodynamic, the social-learning, and the trait
theorists have arrayed themselves against each other in a variety of
combinations in personality research. Arguments about the continuity-
discontinuity of animals and humankind are exemplified perhaps by the
controversy over language acquisition in the higher primates. All these debates
are familiar enough to today's students of psychology, and it is not
inconceivable to view the correlation debate as part of the system of academic
"rows" that will remain as long as there is freedom for intellectual controversy
and scientific discourse.
But the Yule-Pearson debate and its implications are not among those
discussed in university classrooms across the globe. For a multitude of reasons,
not the least of which were the horror of the negative eugenics espoused by the
German Nazis and the demands of the growing discipline of psychology for
quantitative techniques that would help it deal with its subject matter, very few
if any of today's practitioners and researchers in the social sciences think of
biometrics when they think of statistics. They busily get on with their analyses
and, if they give thanks to Pearson and Galton at all, they remember them for
their statistical insights rather than for their eugenic philosophy.
11
Factor Analysis
FACTORS
the seven [there were seven measures] directions of uncorrelated variables, that
is, the principal axes of the correlation 'ellipsoid'" (p. 209).
The method, as Lawley and Maxwell (1971) have emphasized, although it
has links with factor analysis, is not to be taken as a variety of factor analysis.
The latter is a method that attempts to distill down or concentrate the
covariances in a set of variables to a much smaller number of factors. Indeed,
none other than Godfrey Thomson (1881-1955), a leading British factor
theorist from the 1920s to the 1940s, maintained in a letter to D. F. Vincent of
Britain's National Institute of Industrial Psychology that "he did not regard
Pearson's 'lines of closest fit' as anything to do with factor analysis" (noted by
Hearnshaw, 1979, p. 176). The reason for the ongoing association of Pearson's
name with the birth of factor analysis is to be found in the role played by
another leading British psychologist, Sir Cyril Burt (1883-1971), in the writing
and the rewriting of the history of factor analysis, an episode examined later in
this chapter.
In essence, the method of principal components may be visualized by
imagining the variables measured - the tests - as points in space. Tests that are
correlated will be close together in clusters, and tests that are not related will be
further away from the cluster. Axes or vectors may be projected into this space
through the clusters in a fashion that allows for as much of the variance as
possible to be accounted for. Geometrically this projection can take place in
only three dimensions, but algebraically an n-dimensional space can be
conceptualized.
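The geometry can be restated algebraically: the principal axes of the correlation "ellipsoid" are the eigenvectors of the correlation matrix, and each eigenvalue is the variance accounted for along its axis. A sketch with a small hypothetical matrix (not data from the text):

```python
import numpy as np

# Hypothetical correlation matrix for three tests.
R = np.array([
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
])

# Principal axes = eigenvectors; eigenvalues = variance along each axis.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]  # largest axis first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The first component accounts for the largest possible share of the variance.
print(eigvals, eigvals[0] / eigvals.sum())
```

The eigenvalues sum to the number of variables (the trace of R), so the ratio above is the familiar proportion of variance accounted for.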
THE BEGINNINGS
In 1904 Charles Spearman (1863-1945) published two papers (1904a, 1904b)
in the same volume of the American Journal of Psychology, then the leading
English-language journal in psychology. The first of these, a general account of
correlation methods, their strengths and weaknesses, and the sources of error
that may be introduced that might "dilate" or "constrict" the results, included
criticism of Pearson's work.
In his Huxley lecture (sponsored by the Anthropological Institute of Great
Britain and Ireland) of 1903, Pearson reported on a great amount of data that
had been collected by teachers in a number of schools. The physical variables
measured included health, hair and eye color, hair curliness, athletic power,
head length, breadth, and height, and a "cephalic index." The psychological
characteristics were assertiveness, vivacity, popularity, introspection,
conscientiousness, temper, ability, and handwriting. The average correlations
(omitting athletic power) reported between the measures of brother, sister, and
brother-sister pairs are extraordinarily similar, ranging from .51 to .54.
We are forced, I think literally forced, to the general conclusion that the physical and
It might be noted that neither man assumed that the correlations, whatever
they were, might be due to anything other than heredity, and that Spearman
largely objected to the data and possible deficiencies in its collection, and
Pearson largely objected to Spearman's statistics. The exchange ensured that
the opponents would never even adequately agree on the location of the
battlefield, let alone resolve their differences. Spearman was to become,
successively, Reader and head of a psychological laboratory at University
College, London, in 1907, Grote Professor of the Philosophy of Mind and Logic
in 1911, and Professor of Psychology in 1928 until his retirement in 1931. He
was therefore a colleague of Pearson's in the same college in the same
university during almost the whole of the latter's tenure of the Chair of
Eugenics. Their interests and their methods had a great deal in common, but
they never collaborated and they disliked each other. They clashed on the
matter of Spearman's rank-difference correlation technique, and that and
Spearman's new criticism of Pearson and Pearson's reaction to it began the
hostilities that
The . . . observed facts indicate that all branches of intellectual activity have in
common one fundamental function (or group of functions), whereas the remaining
or specific elements of the activity seem in every case to be wholly different from that
in all the others. (Spearman, 1904b, p. 284)
This work may be justly claimed to include the first factor analysis. The
intercorrelations are shown in Figure 11.1.

The hierarchical order of the correlations in the matrix is shown by the
tendency for the coefficients in any pair of columns to bear the same ratio to
one another throughout those columns.
Of course, the technique of correlation that began with Galton (1888b) is
central to all forms of factor analysis, but, more particularly, Spearman's work
relied on the concept of partial correlation, which allows us to examine the
relationships between, for example, two variables when a third is held constant.
The method gave Spearman the means of making mathematical the notion that
the correlation of a specific test with a general factor was common to all tests
of intellectual functioning. The remaining portion of the variance, error
excepted, is specific and unique to the test that measures the variable. This is
the essence of what came to be known as the two-factor theory.
The partial correlation of variables 1 and 2 when a third, 3, is held constant
is given by

r_12.3 = (r_12 - r_13 r_23) / √[(1 - r_13²)(1 - r_23²)]
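A sketch of this quantity in code; the correlations in the example are hypothetical and chosen so that the two tests are related only through the third variable, in which case the partial correlation vanishes, the situation Spearman's general-factor argument turns on.

```python
from math import sqrt

def partial_r(r12, r13, r23):
    """Correlation of variables 1 and 2 with variable 3 held constant."""
    return (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Hypothetical case: r12 = r13 * r23, i.e. 1 and 2 are related only through 3.
print(partial_r(0.48, 0.8, 0.6))  # effectively zero
```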
and had formed the opinion that its author's mathematics were wanting and had
tried to explain to him that it was possible to "get the tetrad differences to
vanish, one and all, in so many ways that one might suspect that the resolution
into a general factor was not unique" (quoted by Lovie & Lovie, 1995, p. 241).
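The tetrad difference for four tests a, b, c, d is r_ac r_bd - r_ad r_bc. Under a single general factor the correlations factor as r_ij = g_i g_j, and every tetrad vanishes; the loadings below are hypothetical, and, as Wilson warned, vanishing tetrads do not by themselves make the resolution unique.

```python
from itertools import combinations

# Hypothetical loadings on a single general factor: r_ij = g_i * g_j.
g = [0.9, 0.8, 0.7, 0.6, 0.5]
r = {(i, j): g[i] * g[j] for i in range(5) for j in range(5) if i != j}

def tetrad(a, b, c, d):
    """Tetrad difference r_ac * r_bd - r_ad * r_bc."""
    return r[a, c] * r[b, d] - r[a, d] * r[b, c]

# Every tetrad vanishes under the single-general-factor model.
print(all(abs(tetrad(*q)) < 1e-12 for q in combinations(range(5), 4)))  # True
```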
Wilson subsequently reviewed Spearman's book for Science. It is generally
a friendly review. Wilson describes the book as "an important work," written
"clearly, spiritedly, suggestively, in places even provocatively," and his words
concentrate on the mathematical appendix. Wilson admits that his review is
"lop-sided," but he maintains that Spearman has missed some of the logical
implications of the mathematics, and in particular that his solutions are
indeterminate. But he leaves a loophole for Spearman:
THE PRACTITIONERS
The work culminated in a book that set out his fundamental, if not yet
watertight and complete, conceptual framework. Burt's work, with its
continuing premise of the heritability of intelligence, also shows the need to
seek support for a particular construction of mental life.

On the other hand, a number of contemporary researchers, whose
contributions to factor analysis were also impressive, were fundamentally
applied workers. Two, Kelley and Thomson, were in fact Professors of
Education, and a third, Louis Thurstone, was a giant in the field of scaling
techniques and testing. Their contributions were marked by the task of
developing tests of abilities and skills, measures of individual differences.
Truman L. Kelley (1884-1961), Professor of Education and Psychology at
Stanford University,
Mental life does not operate in a plain but in a network of canals. Though each canal
may have indefinite limits in length and depth, it does not in width; though each
mental trait may grow and become more and more subtle, it does not lose its character
and discreteness from other traits. (p. 23)
Mental life in respect of its resolution into specific and general factors does not
operate in a network of canals but in a continuously distributed hyperplane ... no
trait has any separate or discrete existence, for each may be replaced by proper linear
combinations of the others proceeding by infinitesimal gradations, and it is only the
complex of all that has reality. Instead of writing of crossroads in the mind of man
one should speak of man's trackless mental forest or tundra or jungle - mathemati-
cally, mind you, according to the analysis offered. The canals or crossroads have been
put in without any indication, so far as I can see, that they have been put in any other
way than we put in roads across our midwestern plains, namely to meet the
convenience or to suit the fancy of the pioneers. (p. 164)
The fact of the matter is that the author, after an excellent introductory discussion of
the elements of the theory of the resolution into general and specific factors ... gives
up the general problem entirely, throws up his hands so to speak, and proceeds to
develop special methods of examining his variables to see where there seem to be
specific bonds between them. (p. 160)
Kelley, perhaps a little bemused, perhaps a little rueful, and almost certainly
grateful that Wilson's review had not been more scathing, responds: "The
mathematical tool is sharp and I may nick, or may already have nicked, myself
with it. At any rate I do still enjoy the fun of whittling and of fondling such
clean-cut chips as Wilson has let fall" (p. 172).
Sir Godfrey Thomson was Professor of Education at the University of
Thomson was not saying that Spearman's view does not make sense but that
it is not inevitably the only view. Spearman had agreed that g was present in
all the factors of the mind, but in different amounts or "weights," and that in
some factors it was small enough to be neglected. Thomson welcomed these
views, "provided that g is interpreted as a mathematical entity only, and
judgement is suspended as to whether it is anything more than that" (p. 240).

And his verdict on two-factor theory? After commenting that the method
of two factors was an analytical device for indicating their presence, and that
advances in method had been made that led to multiple factor analysis, he
added: "It was Professor Thurstone of Chicago who saw that one solution to the
problem could be reached by a generalization of Spearman's idea of zero tetrad
differences" (p. 20).
The alternative theory to explain the zero tetrad differences is that each test calls
upon a sample of bonds which the mind can form, and that some of these bonds are
common to two tests and cause their correlation. (p. 45)
Thomson did not give more than a very general view of what the bonds might
be, although it seems that they were fundamentally neurophysiological and,
from a psychological standpoint, akin to the connections in the connectionist
views of learning first advanced by Thorndike:
What the "bonds" of the mind are, we do not know. But they are fairly certainly
associated with the neurones or nerve cells of our brains ... Thinking is accompanied
by the excitation of these neurones in patterns. The simplest patterns are instinctive,
more complex ones acquired. Intelligence is possibly associated with the number
and complexity of the patterns which the brain can (or could) make. (Thomson, 1946,
p. 51)
If this new tetrad is zero then the determinant "vanishes." This procedure
can be carried out with determinants of any order, and if we come to a stage
when all the minors "vanish" the "rank" of the correlation matrix will be
reduced accordingly. Thurstone found that tests could be analyzed into as many
common factors as the reduced rank of the correlation matrix. The present
account follows that of Thomson (1946, chap. 2), who gives a numerical
example. The rank of the matrix of all the correlations may be reduced by
inserting values in the diagonal of the matrix - the communalities. These values
may be thought of as self-correlations, in which case they are all 1. But they
may also be regarded as that part of the variance in the variable that is due to
the common factors, in which case they have to be estimated in some fashion.
They might even be merely guessed. Thurstone chose values that made the
tetrads zero. If these methods seem not to be entirely satisfactory now, they
were certainly not welcomed by psychologists in the 1920s, 1930s, and 1940s
who were trying to come to grips with the new mathematical approaches of the
factor theorists and their views of human intelligence. Thurstone's centroid
method bears some similarities to principal components analysis. The word
centroid refers to a multivariate average, and Thurstone's "first centroid" may
be thought of as an average of all the tests in a battery. Central to the outcome
of the analysis is the production of factor loadings - values that express the
relationship of the tests to the presumed underlying factors. The ways in which
these loadings are arrived at are many and various, and Thurstone's centroid
technique was just one early approach. However, it was an approach that was
relatively easy to apply and (with some mathematics) relatively easy to
understand. How the factors were interpreted psychologically was, and is,
largely a matter for the researcher. Thus tests that were related that involved
arithmetic or numerical
172 12. THE DESIGN OF EXPERIMENTS
METHODS OF INQUIRY
If two or more instances in which the phenomenon occurs have only one circum-
stance in common, while two or more instances in which it does not occur have
nothing in common save the absence of that circumstance; the circumstance in which
alone the two sets of instances differ, is the effect, or the cause, or an indispensable
part of the cause, of the phenomenon. (p. 396)
were injected by Pasteur with a weak culture of anthrax virus. Later these
animals and a similar number of others that had not been so "vaccinated" were
given a fatal dose of anthrax virus. Within a few days the non-vaccinated
animals were dead or dying, the vaccinated ones healthy. The conclusion that
was enthusiastically drawn was that Pasteur's vaccination procedure had
produced the immunity that was seen in the healthy animals. The effectiveness
of vaccination is now regarded as an established fact. But it is necessary to
guard against incautious logic. The health of the vaccinated animals could have
been due to some other fortuitous circumstance. Because it is known that some
animals infected with anthrax do recover, an experimental group composed of
these resistant animals could have resulted in a spurious conclusion. It should
be noted that Pasteur himself recognized this as a possibility.
Mill's Method of Residues proclaims that having identified by the methods
of agreement and differences that certain observed phenomena are the effects
of certain antecedent conditions, the phenomena that remain are due to the
circumstances that remain. "Subduct from any phenomenon such part as is
known by previous inductions to be the effect of certain antecedents, and the
residue of the phenomenon is the effect of the remaining antecedents"
(1843/1872, 8th ed., p. 398).
Mill here uses a very modern argument for the use of the method in
providing evidence for the debate on racial and gender differences:

Those who assert, what no one has shown any real ground for believing, that there
is in one human individual, one sex, or one race of mankind over another, an inherent
and inexplicable superiority in mental faculties, could only substantiate their
proposition by subtracting from the differences of intellect which we in fact see, all
that can be traced by known laws either to the ascertained differences of physical
organization, or to the differences which have existed in the outward circumstances
in which the subjects of the comparison have hitherto been placed. What these causes
might fail to account for, would constitute a residual phenomenon, which and which
alone would be evidence of an ulterior original distinction, and the measure of its
amount. But the assertors of such supposed differences have not provided them-
selves with these necessary logical conditions in the establishment of their doctrine.
(p. 429)
The final method and the fifth canon is the Method of Concomitant
Variations: "Whatever phenomenon varies in any manner whenever another
phenomenon varies in some particular manner, is either a cause or an effect of that
phenomenon, or is connected with it through some fact of causation" (p. 401).
This method is essentially that of the correlational study, the observation of
covariation:
Mill maintained that these methods were in fact the rules for inductive logic,
that they were both methods of discovery and methods of proof. His critics, then
and now, argued against a logic of induction (see chapter 2), but it is clear that
experimentalists will agree with Mill that his methods constitute the means by
which they gather experimental evidence for their views of nature.
The general structure of all the experimental designs that are employed in
the psychological sciences may be seen in Mill's methods. The application and
withholding of experimental treatments across groups reflect the methods of
agreement and differences. The use of placebos and the systematic attempts to
eliminate sources of error make up the method of residues, and the method of
concomitant variation is, as already noted, a complete description of the corre-
lational study. It is worth mentioning that Mill attempted to deal with the
difficulties presented by the correlational study and, in doing so, outlined the
basics of multiple regression analysis, the mathematics of which were not to
come for many years:
Mill's Logic is his principal work, and it may be fairly cast as the book that
first describes both a justification for, and the methodology of, the social
sciences. Throughout his works, the influence of the philosophers who were,
in fact, the early social scientists, is evident. David Hartley (1705-1757)
published his Observations on Man in 1749. This book is a psychology rather
daily rainfall and yearly yields from plots in the famous Broadbalk wheat
field. Fertilizers had been applied to these plots, using the same pattern, since
1852. Fisher used the method of orthogonal polynomials to obtain fits of the
yields over time. In his paper (1921b) on these data published in 1921 he
describes analysis of variance (ANOVA) for the first time.
When the variation of any quantity (variate) is produced by the action of two or more
independent causes, it is known that the variance produced by all the causes
simultaneously in operation is the sum of the values of the variance produced by
each cause separately. ... In Table II is shown the analysis of the total variance for
each plot, divided according as it may be ascribed (i) to annual causes, (ii) to slow
changes other than deterioration, (iii) to deterioration; the sixth column shows the
probability of larger values for the variance due to slow changes occurring fortui-
tously. (Fisher, 1921b, pp. 110-111)
Any general difference between sulphate and chloride, between early and late
application, or ascribable to quantity of nitrogenous manure, can be based on
thirty-two comparisons, each of which is affected by such soil heterogeneity as
exists between plots in the same block. To make these three sets of comparisons
only, with the same accuracy, by single question methods, would require 224 plots,
against our 96; but in addition many other comparisons can be made with equal
accuracy, for all combinations of the factors concerned have been explored. Most
important of all, the conclusions drawn from the single-factor comparisons will be
given, by the variation of non-essential conditions, a very much wider inductive
basis than could be obtained, by single question methods, without extensive
repetitions of the experiment. (Fisher, 1926b, p. 512)
With these words Fisher introduces the example that illustrated his view of the
principles of experimentation. Holschuh (1980) describes it as "the somewhat
artificial 'lady tasting tea' experiment" (p. 35), and indeed it is, but perhaps an
American writer does not appreciate the fervor of the discussion on the best
method of preparing cups of tea that still occupies the British! Fisher Box (1978)
reports that an informal experiment was carried out at Rothamsted. A colleague,
Dr B. Muriel Bristol, declined a cup of tea from Fisher on the grounds that she
preferred one to which milk had first been added. Her insistence that the order
in which milk and tea were poured into the cup made a difference led to a
lighthearted test actually being carried out.
Fisher examines the design of such an experiment. Eight cups of tea are
prepared. Four of them have tea added first and four milk. The subject is told
that this has been done and the cups of tea are presented in a random order. The
task is, of course, to divide the set of eight into two sets of four according to the
method of preparation. Because there are 70 ways of choosing a set of 4 objects
from 8:
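The counting argument can be checked directly. The following is an illustrative sketch, not part of the original text, computing the number of possible divisions and the chance that a subject who merely guesses divides the cups perfectly:

```python
import math

# Number of ways of choosing which 4 of the 8 cups had milk added first.
ways = math.comb(8, 4)

# If the subject is simply guessing, each division is equally likely,
# so a perfect classification has probability 1 in `ways`.
p_perfect = 1 / ways
print(ways, round(p_perfect, 4))
```

A perfect performance would thus occur by chance with probability 1/70, about .014, which is the significance level Fisher's design attaches to that outcome.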
ironies here in this, Fisher's elegant account of the lady tasting tea. It has been
hailed as the model for the statistical inferential approach:
Kendall (1963) wishes that Fisher had never written the book, saying, "If
we had to sacrifice any of his writings, [this book] would have a strong claim
to priority" (p. 6).
However he did write the book, and he used it to attack his opponents. In
marshaling his arguments, he introduced inconsistencies of both logic and
method that have led to confusion in lesser mortals. Karl Pearson and the
biometricians used exactly the same tactics. In chapter 15 the view is presented
that it is this rather sorry state of affairs that has led to the historical development
of statistical procedures, as they are used in psychology and the behavioral
sciences, being ignored by the texts that made them available to a wider, and
undoubtedly eager, audience.
13
Assessing Differences
and Having Confidence
FISHERIAN STATISTICS
THE ANALYSIS OF VARIANCE
Two aspects of this paper are of historical interest. At that time Fisher did not fully
understand the rules of the analysis of variance - his analysis is wrong - nor the role
of randomization. Secondly, although the analysis of variance is closely tied to
188 13. ASSESSING DIFFERENCES AND HAVING CONFIDENCE
(Fisher uses S for summation) for variance, noting that s is the best estimate
of σ.
Chapter 4 deals with tests of goodness-of-fit, independence, and homoge-
neity, giving a complete description of the application of the χ² tests, including
Yates' correction for discontinuity and the procedure for what is now known as
the Fisher Exact Test. Chapter 5 is on tests of significance, about which more
is said later. Chapter 6 manages to discuss, quite thoroughly, the techniques of
interclass correlation without mentioning Pearson by name except to acknow-
ledge that the data of Table 31 are Pearson and Lee's. Failure to acknowledge
the work of others, which was a characteristic of both Pearson and Fisher, and
which, to some extent, arose out of both spite and arrogance, at least partly
explains the anonymous presentation of statistical techniques that is to be found
in the modern textbooks and commentaries.
And then chapter 7, two-thirds of the way through the book, introduces that
most important and influential of methods - analysis of variance. Fisher
describes analysis of variance as "the separation of the variance ascribable to
one group of causes from the variance ascribable to other groups" (Fisher,
1925/1970, 14th ed., p. 213), but he examines the development of the technique
from a consideration of the intraclass correlation. His example is clear and
worth describing. Measurements from n' pairs of brothers may be treated in
two ways in a correlational analysis. The brothers may be divided into two
of the use of the table and finding the significance of the intraclass correlation,
conclusions may be drawn. Because the symmetrical table does not give the
best estimate of the correlation, a negative bias is introduced into the value for
z and Fisher shows how this may be corrected. Fisher then shows that intraclass
correlation is an example of the analysis of variance: "A very great simplifica-
tion is introduced into questions involving intraclass correlation when we
recognise that in such cases the correlation merely measures the relative
importance of two groups of variation" (Fisher, 1925/1970, 14th ed., p. 223).
Figure 13.2 is Fisher's general summary table, showing, "in the last column,
the interpretation put upon each expression in the calculation of an intraclass
correlation from a symmetrical table" (p. 225).
z = ½ logₑF
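The relationships just described - F as a ratio of mean squares, the intraclass correlation as a function of that ratio, and z as half the natural logarithm of F - can be sketched with hypothetical values (an illustration only; the mean squares and family size below are assumed, not the book's data):

```python
import math

# Hypothetical one-way layout: k members per family (k = 2 for pairs of
# brothers), with assumed between- and within-family mean squares.
ms_between, ms_within, k = 5.0, 2.0, 2

F = ms_between / ms_within

# Intraclass correlation expressed from the mean squares: it "merely
# measures the relative importance of two groups of variation."
r_intraclass = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Fisher's z transformation of the variance ratio.
z = 0.5 * math.log(F)
print(round(F, 3), round(r_intraclass, 3), round(z, 3))
```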
Yates (1951) comments on these early reviews, noting that many of them
expressed dismay at the lack of formal mathematical proofs and interpreted the
work as though it was only of interest to those who were involved in small
sample work. Whatever its reception by the reviewers, by 1950, when the book
was in its 11th edition, about 20,000 copies had been sold.
But it is fair to say that something like 10 years went by from the date of the
original publication before Fisher's methods really started to have an effect on
the behavioral sciences. Lovie (1979) traces its impact over the years 1934 to
1945. He mentions, as do others, that the early textbook writers contributed to
its acceptance. Notable here are the works of Snedecor, published in 1934 and
1937. Lush (1972) quotes a European researcher who told him, "When you see
Snedecor again, tell him that over here we say, 'Thank God for Snedecor; now
we can understand Fisher' " (Lush, 1972, p. 225).
Lindquist published his Statistical Analysis in Educational Research in
1940, and this book, too, was widely used. Even then, some authorities were
skeptical, to say the least. Charles C. Peters in an editorial for the Journal of
Educational Research rather condescendingly agrees that Fisher's statistics are
"suitable enough" for agricultural research:
If the early workers relied on writers like Snedecor to help them with
Fisherian applications, later ones are indebted to workers like Eisenhart (1947)
for assistance with the fundamentals of the method. Eisenhart sets out clearly
the importance of the assumptions of analysis of variance, their critical function
in inference, and the relative consequences of their not being fulfilled. Eisen-
hart's significant contribution has been picked up and elaborated on by sub-
sequent writers, but his account cannot be bettered.
He delineates the two fundamentally distinct classes of analysis of variance
- what are now known as the fixed and random effects models. The first of
these is the most familiar to researchers in psychology. Here the task is to
determine the significance of differences among treatment means: "Tests of
significance employed in connection with problems of this class are simply
extensions to small samples of the theory of least squares developed by Gauss
and others - the extension of the theory to small samples being due principally
to R. A. Fisher" (Eisenhart, 1947, pp. 3-4).
The second class Eisenhart describes as the true analysis of variance. Here
the problem is one of estimating, and inferring the existence of, the components
of variance, "ascribable to random deviation of the characteristics of individuals
of a particular generic type from the mean value of these characteristics in the
'population' of all individuals of that generic type" (Eisenhart, 1947, p. 4).
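Eisenhart's second class - estimating components of variance - can be illustrated with a small simulation. This is a sketch with assumed values (the group counts and variances below are invented for illustration and appear nowhere in the original): groups are drawn at random from a population of groups, and the between-group variance component is recovered from the mean squares.

```python
import random

random.seed(1)
a, n = 30, 10                      # groups sampled, observations per group
sigma_a, sigma_e = 2.0, 1.0        # assumed true component std. deviations

# Each group has a randomly drawn effect; observations vary around it.
data = []
for _ in range(a):
    mu_i = random.gauss(0, sigma_a)
    data.append([mu_i + random.gauss(0, sigma_e) for _ in range(n)])

grand = sum(sum(g) for g in data) / (a * n)
means = [sum(g) / n for g in data]
ms_between = n * sum((m - grand) ** 2 for m in means) / (a - 1)
ms_within = sum((x - m) ** 2 for g, m in zip(data, means) for x in g) / (a * (n - 1))

# Since E(MS_between) = sigma_e^2 + n * sigma_a^2, the between-group
# variance component is estimated by:
var_a_hat = (ms_between - ms_within) / n
print(round(var_a_hat, 2), round(ms_within, 2))
```

With these assumed values the estimates should fall near the true components (σ²ₐ = 4, σ²ₑ = 1), illustrating inference about variances rather than about differences among particular treatment means.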
The failure of the then current literature to adequately distinguish between
the two methods is because the emphasis had been on tests of significance rather
than on problems of estimation. But it would seem that, despite the best efforts
of the writers of the most insightful of the now current texts (e.g., Hays, 1963,
1973), the distinction is still not fully applied in contemporary research. In other
words, the emphasis is clearly on the assessment of differences among pairs of
treatment means rather than on the relative and absolute size of variances.
Eisenhart's discussion of the assumptions of the techniques is a model for
later writers. Random variation, additivity, normality of distribution, homoge-
neity of variance, and zero covariance among the variables are discussed in
detail and their relative importance examined. This work can be regarded as a
classic of its kind.
[This is] symptomatic of the transition in philosophy and orientation from the early
use of statistics as a practical and rigorous aid for interpreting research results, to a
highly theoretical subject predicated on the assumption that mathematical reasoning
was paramount in statistical work. (Davis & Gaito, 1984, p. 5)
A few pages later, Fisher does explain the use of the tail area of the
probability interval and notes that the p = .05 level is the "convenient" limit for
judging significance. He does this in the context of examples of how often
² Of interest here, however, is the increasing use, and the increasing power, of regression
models in the analysis of data; see, for example, Fox (1984).
single-valued functions of the x's, θ(E) and θ̄(E), having the property that, whatever
the values of the θ's, say θ'1, θ'2, . . . θ'l, the probability of θ(E) falling short of θ'1
and at the same time of θ̄(E) exceeding θ'1 is equal to a number α fixed in advance
so that 0 < α < 1.
It is essential to notice that in this problem the probability refers to the values of
θ(E) and θ̄(E) which, being single-valued functions of the x's, are random variables.
θ'1 being a constant, the left-hand side of [the above] does not represent the
probability of θ'1 falling within some fixed limits. (Neyman, 1937, p. 379)
The values θ(E) and θ̄(E) represent the confidence limits for θ'1 and span
the confidence interval for the confidence coefficient α. Care must be taken here
not to confuse this α with the symbol for the Type I error rate; in fact this α is
1 minus the Type I error rate. The last sentence in the quotation from Neyman
just given is very important. First, an example of a statement of the confidence
interval in more familiar terms is, perhaps, in order. Suppose that measurements
on a particular fairly large random sample have produced a mean of 100 and
that the standard error of this mean has been calculated to be 3. The routine
method of establishing the upper and lower limits of the 90% confidence interval
would be to compute 100 ± 1.65(3). What has been established? The textbooks
will commonly say that the probability that the population mean, μ, falls within
this interval is 90%, which is precisely what Neyman says is not the case.
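The routine computation just described can be written out directly (a sketch using the values given in the text):

```python
# Mean of 100, standard error of 3, and the 90% normal critical value
# of 1.65, as in the text: limits are 100 +/- 1.65(3).
mean, se, z = 100.0, 3.0, 1.65
lower, upper = mean - z * se, mean + z * se
print(lower, upper)
```

On Neyman's reading, it is the interval (95.05, 104.95) that is the random quantity: over repeated sampling, 90% of intervals so constructed will cover μ, but no probability statement attaches to this particular pair of limits.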
For Neyman, the confidence limits represent the solution to the statistical
Both Ronald Fisher and Jerzy Neyman would have been very unhappy with
this advice! It does, however, reflect once again the way in which researchers
in the psychological sciences prescribe and propound rules that they believe will
lead to acceptance of the findings of research. Rozeboom's paper is a thoughtful
attempt to provide alternatives to the routine null hypothesis significance test
and deals with the important aspect of degree of belief in an outcome.
One final point on confidence interval theory: it is apparent that some early
commentators (e.g., E. S. Pearson, 1939b; Welch, 1939) believed that Fisher's
"fiducial theory" and Neyman's confidence interval theory were closely related.
Neyman himself (1934) felt that his work was an extension of that of Fisher.
Fisher objected strongly to the notion that there was anything at all confusing
about fiducial distributions or probabilities and denied any relationship to the
theory of confidence intervals, which, he maintained, was itself inconsistent. In
1941 Neyman attempted to show that there is no relationship between the two
theories, and here he did not pull his punches:
In the early 1950s, mainly in the pages of Psychological Bulletin and the
Psychological Review - then, as now, immensely important and influential
journals - a debate took place on the utility and desirability of one-tail versus
two-tail tests (Burke, 1953; Hick, 1952; Jones, 1952, 1954; Marks, 1951, 1953).
It had been stated that when an experimental hypothesis had a directional
component - that is, not merely that a parameter μ1 differed significantly from
a second parameter μ2, but that, for example, μ1 > μ2 - then the researcher was
permitted to use the area cut off in only one tail of the probability distribution
when the test of significance was applied. Referred to the normal distribution,
this means that the critical value becomes 1.65 rather than 1.96. It was argued
that because most assertions that appealed to theory were directional - for
example, that spaced was better than massed practice in learning, or that
extraverts conditioned poorly whereas introverts conditioned well - the actual
statistical test should take into account these one-sided alternatives. Arguments
against the use of one-tailed tests primarily centered on what the researcher does
when a very large difference is obtained, but in the unexpected direction. The
temptation to "cheat" is overwhelming! It was also argued that such data ought
not to be treated with the same reaction as a zero difference on scientific
grounds: "It is to be doubted whether experimental psychology, in its present
state, can afford such lofty indifference toward experimental surprises" (Burke,
1953, p. 385).
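The two critical values at issue can be recovered from the standard normal quantile function (a sketch, not drawn from the debate itself):

```python
from statistics import NormalDist

nd = NormalDist()

# One-tailed test at p = .05: all the rejection area in one tail.
one_tail = nd.inv_cdf(0.95)    # approximately 1.645, rounded to 1.65 in texts

# Two-tailed test at p = .05: .025 in each tail.
two_tail = nd.inv_cdf(0.975)   # approximately 1.960
print(round(one_tail, 2), round(two_tail, 2))
```

The one-tailed critical value is smaller, which is precisely why the move was read by some as a lowering of conventional standards.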
Many workers were concerned that a move toward one-tail tests represented
a loosening or a lowering of conventional standards, a sort of reprehensible
breaking of the rules, and pious pronouncements about scientific conservatism
abound. Predictably, the debate led to attempts to establish the rules for the use
of one-tail tests (Kimmel, 1957).
What is found in these discussions is the implicit assumption that formally
THE BEGINNINGS
206 14. TREATMENTS AND EFFECTS: THE RISE OF ANOVA
analyses could be achieved through the use of "found groups" and what is often
called quasi-experimental designs, the potential of ANOVA techniques wid-
ened considerably. The argument was, and is, advanced that unless the experi-
menter can control for all relevant variables, alternative explanations for the
results other than the influence of the independent variable can be found - the
so-called correlated biases. The skeptic might argue that, in addition to the fact
that statistical appraisals are by definition probabilistic and therefore uncertain,
direct manipulation of the independent variable does not assuredly guard against
mistaken attribution of its effects.
And, very importantly, the commentaries, such as they were, on experimen-
tal methods have to be seen in the light of the dominant force, in American
psychology at least, of behaviorism. For example, Brennan's textbook History
and Systems of Psychology (1994) devotes two chapters and more than 40 pages
out of about 100 pages to Twentieth-Century Systems (omitting Eastern Tradi-
tions, Contemporary Trends and the Third Force Movement). This is not to be
taken as a criticism of Brennan: indeed, his book has run to several editions and
is an excellent and readable text.¹ The point is that from Watson's (1913) paper
until well on into the 1960s, experimental psychology was, for many, the
experimental analysis of behavior - latterly within a Skinnerian framework.
Sidman's (1960) book Tactics of Scientific Research: Evaluating Experimental
Data in Psychology is most certainly not about statistical analysis!
Rucci and Tweney (1980) have carried out a comprehensive analysis of the use
of ANOVA from 1925 to 1950, concentrating mainly, but not exclusively, on
American publications. They identify the earliest research to use ANOVA as a
paper by Reitz (1934). The author checked for homogeneity of variance, gave
the method for computing z, and noted its relationship with η². An early paper
¹ Perhaps one might be permitted to ask Brennan to include a chapter on the history and
influence of statistics!
was a long and painful process. This was due more to the substantive implications
of the method than to the purely practical matters of increased arithmetical labour
and novelty of language. (p. 151)
It is worth noting that some of the most well-known of the early rigorous
experimental psychologists, for example, Gardner Murphy and William
McDougall, were interested in the paranormal, and it was the latter who, leaving
Harvard for Duke University, recruited J. B. Rhine, who helped to found, and
later directed, that institution's parapsychology laboratory.
Of course, there were several psychological and educational researchers who
vigorously supported the "new" methods. Garrett and Zubin (1943) observe that
ANOVA had not been used widely in psychological research. These authors
These authors also offer the opinion that the belief that the methods deal
with small samples had influenced their acceptance - a belief, nay, a fact, that
has led others to point to it as a reason why they were welcomed! They also
worried about "the language of the farm," "soil fertility, weights of pigs,
effectiveness of manurial treatments and the like" (p. 233). Hearnshaw, the
eminent historian of psychology, also mentions the same difficulty as a reason
for his view that "Fisher's methods were slow to percolate into psychology both
in this country [Britain] and in America" (1964, p. 226). Although it is the case
that agricultural examples are cited in Fisher's books, and phrases that include
the words "blocks," "plots," and "split-plots" still seem a little odd to psycho-
logical researchers, it is not true to say that the world of agriculture permeates
all Fisher's discussions and explanations, as Lovie (1979) has pointed out.
Indeed, the early chapter in The Design of Experiments on the mathematics of
a lady tasting tea is clearly the story of an experiment in perceptual ability, which
is about as psychological as one could get.
It must be mentioned that Grant (1944) produced a detailed criticism of
Garrett and Zubin's paper, in particular taking them to task for citing examples
of reports where ANOVA is wrongly used or inappropriately applied. From the
standpoint of the basis of the method, Grant states that in the Garrett and Zubin
piece:
even eager, to come to grips with methods that promised an effective way to
use the multi-factor approach in experiments, an approach that offered the
opportunity of testing competing theories more systematically than before.
And it was from the educational researcher's standpoint that Stanley (1966)
offered an appraisal of the influence of Fisher's work "thirty years later." He
makes the interesting point that although there were experimenters with "con-
siderable 'feel' for designing experiments, . . . perhaps some of them did not
have the technical sophistication to do the corresponding analyses." This
judgment is based on the fact that 5 of the 21 volunteered papers submitted to
the Division of Educational Psychology of the American Psychological Asso-
ciation annual convention in 1965 had to be returned to have more complex
analysis carried out. This may have been due to lack of knowledge, it may have
been due to uncertainty about the applicability of the techniques, but it also
reflects perhaps the lingering appeal of straightforward and even subjective
analysis that the researchers of 20 years before favored, as we noted earlier in
the discussion of Lovie's appraisal. It is also Stanley's view that "the impact of
the Fisherian revolution in the design and analysis of experiments came slowly
to psychology" (p. 224). It surely is a matter of opinion as to whether Rucci
and Tweney's count of less than 5 percent of papers in their chosen journals in
1940 to between 15 and 20 percent in 1952 is slow growth, particularly when
it is observed that the growth of the use of ANOVA varied considerably across
the journals.² This leads to the possibility that the editorial policy and practice
of their chosen journals could have influenced Rucci and Tweney's count.³ It
is also surely the case that if a new technique is worth anything at all, then the
growth of its use will show as a positively accelerated growth curve, because
as more people take it up, journal editors have a wider and deeper pool of
potential referees who can give knowledgeable opinions. Rucci and Tweney's
remark, quoted earlier, on the length of time it took for ANOVA to become part
of experimental psychology, has a tone that suggests that they do not believe
that the acceptance can be regarded as "slow."
³ The author, many years ago, had a paper sent back with a quite kind note indicating that the
journal generally did not publish "correlational studies." Younger researchers soon learn which journals
are likely to be receptive to their approaches!
EXPECTED MEAN SQUARES 213
text by Lindquist was published in 1940. The impact of the first edition of this
book may have been lessened, however, by a great many errors, both typo-
graphic and computational. Quinn McNemar (1940b), in his review, observes
that the volume, "in the opinion of the reviewer and student readers, suffers from
an overdose of wordiness" (p. 746). He also notes that there are several major
slips, although acknowledging the difficulty in explaining several of the topics
covered, and declares that the book "should be particularly useful to all who are
interested in obtaining non-mathematical knowledge of the variance technique"
(p. 748).
Snedecor, an agricultural statistician, published a much-used text in 1937,
and he is often given credit for making Fisher comprehensible (see chapter 13,
p. 194). Rucci and Tweney place George Snedecor of Iowa State College,
Harold Hotelling, an economist at Columbia, and Palmer Johnson of the
University of Minnesota, all of whom had spent time with Fisher himself, as the
founders of statistical training in the United States. But any informal survey of
psychologists who were undergraduates in the 1950s and early 1960s would
likely reveal that a large number of them were reared on the texts of Edwards
(1950) of the University of Washington, Guilford (1950) of the University of
Southern California, or McNemar (1949) of Stanford University. Guilford had
published a text in 1942, but it was the second edition of 1950 and the third of
1956 that became very popular in undergraduate teaching. In Canada a some-
what later popular text was Ferguson's (1959) Statistical Analysis in Psychology
and Education, and in the United Kingdom, Yule's An Introduction to the
Theory of Statistics, first published in 1911, had a 14th edition published (with
Kendall) in 1950 that includes ANOVA.
If you joined in the draw repeatedly you would "in the long run" lose 20 cents.
In ANOVA we are concerned with the expected values of the various
components - the mean squares - over the long run.
Although largely ignored by the elementary texts, the slightly more ad-
vanced approaches do treat their discussions and explanations of ANOVA's
model from the standpoint of E(MS). Cornfield and Tukey (1956) examine the
statistical basis of E(MS), but Gaito (1960) claims that "there has been no
systematic presentation of this approach in a psychological text or journal which
will reach most psychologists" (p. 3), and his paper aims to rectify this. The
approach, which Gaito rightly refers to as "illuminating," offers some mathe-
matical insight into the basis of ANOVA and frees researchers from the "recipe
book" and intuitive methods that are often used. Eisenhart makes the point that
these methods have been worthwhile and that readers of the books for the
"non-mathematical" researcher have achieved sound and quite complex analy-
ses in using them. But these early commentators saw the need to go further.
Gaito's short paper offers a clear statement of the general E(MS) model, and his
contention that this approach brings psychology a tool for tackling many
statistical problems has been borne out by the fact that now all the well-known
intermediate texts that deal with ANOVA models use it.
This is not the place to detail the algebra of E(MS), but here is the statement
of the general case for a two-factor experiment:
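The display that followed is not reproduced in this extract. As a hedged sketch only, in standard notation that may differ from the author's: for a levels of factor A, b levels of factor B, and n observations per cell, with both factors random, the expected mean squares are commonly written

```latex
\begin{aligned}
E(MS_A)       &= \sigma_e^2 + n\,\sigma_{AB}^2 + nb\,\sigma_A^2\\
E(MS_B)       &= \sigma_e^2 + n\,\sigma_{AB}^2 + na\,\sigma_B^2\\
E(MS_{AB})    &= \sigma_e^2 + n\,\sigma_{AB}^2\\
E(MS_{error}) &= \sigma_e^2
\end{aligned}
```

Each mean square estimates the error component plus the components of the effects that contribute to it, which is what dictates the appropriate denominator for each F ratio.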
TIMES OF CHANGE
The later years of the 1920s were watershed years for statistics. Karl Pearson
was approaching the end of his career (he retired in 1933), and some of the older
statisticians were not able to cope with the new order. The divisions had been,
and were still being, drawn. Yule retired from full-time teaching at Cambridge
in 1930, and, writing to Kendall after K.P.'s death in 1936, said, "I feel as though
the Karlovingian era has come to an end, and the Piscatorial era which succeeds
it is one in which I can play no part" (quoted by Kendall, 1952, p. 157).
Yule's text had not by then tackled the problems of small samples. The t
tests were not discussed until the 11th edition of 1937, a revision that was, in
fact, undertaken by Maurice Kendall. The changes that were taking place in
statistical methodology led to positions being adopted that were often based
more on personality and style and loyalties than rational argument on the basic
logic and utility of the approaches. Yates, who succeeded Fisher at Rothamsted
in 1933, was to write, in 1951, in a commentary on Statistical Methods for
Research Workers:
Because of the importance that correlation analysis had assumed it was natural that
the analysis of variance should be approached via correlation, but to those not trained
in the school of correlational analysis (of which I am fortunate to be able to count
myself one) this undoubtedly makes this part of the book more difficult to compre-
hend. (Yates, 1951, p. 24, emphasis added)
NEYMAN AND PEARSON 217
It is a curious fact that what most social scientists now take to be inferential
statistics is a mixture of procedures that, as presented in many of the current
statistical cookbooks,¹ would be criticized by its innovators, Ronald A. Fisher,
Jerzy Neyman, and Egon Pearson. Fisher established the present-day paramount
importance of the rejection or acceptance of the null hypothesis in the determi-
nation of a decision on a statistical outcome - the hypothesis that the outcome
is due to chance. The new participants - they could not be described as partners
- in the union that became statistics argued for an appreciation of the probability
of an alternative hypothesis.
In 1925 Jerzy Neyman, on a Polish government fellowship and with a new PhD from Warsaw, arrived at University College, London, to study statistics with Karl Pearson. Neyman was born in Russia in 1894 and read mathematics at the University of Kharkov. In 1921, he moved to the new Republic of Poland, where he became an assistant at the University of Warsaw and where he worked for the State Meteorological Institute. His initial months in London were difficult, partly because of his struggles with English and partly because of misunderstandings with "K.P.," but gradually he struck up a friendship that led to a research collaboration with Egon Pearson. Neyman's professional progress and his social and academic relationships have been recounted in a sympathetic and revealing account by Constance Reid (1982).
The younger Pearson has been described by many as a somewhat diffident and introverted man who was very much in the shadow of his father. Reid tells us of his feelings:
¹ The phrase statistical cookbook has a pejorative ring, which is not wholly justified. There are many excellent basic texts available, and there are many excellent cookery books. Both are necessary to our well-being. The point is that conflicting recipes lead to statistical and gastronomic confusion, and the fact that such conflicts exist has, by and large, been ignored by consumers in the social sciences.
15. THE STATISTICAL HOTPOT
What it does is to show that if there is any alternative hypothesis which will explain the occurrence of the sample with a more reasonable probability, say 0.05 (such as that it belongs to a different population or that the sample wasn't random or whatever will do the trick) you will be very much more inclined to consider that the original hypothesis is not true. (quoted by Pearson, 1966, p. 7, emphasis added)
Pearson recalls that during the autumn of 1926, the problems of the specification of the class of alternative hypotheses and their definition, the rejection region in the sample space, and the two sources of error were discussed. At the end of the year, Pearson was examining the likelihood ratio criterion as a way of approaching the question as to whether the alternative, or what Fisher later called the null, hypothesis was the more likely. Pearson always recognized that he was a weak mathematician and that he needed help with the mathematical formulation of the new ideas. He recalled to Reid (1982) that he approached Neyman, the new post-doctoral student at Gower Street, perhaps because he was "so 'fresh' to statistics" (p. 62) and because other possible collaborators would all have preferences for either Mark I or Mark II statistics. Neyman was not a member of either of the camps. He really knew very little of the statistical work of the elder Pearson, nor that of "Student" or Fisher. But an immediate and close collaboration was not possible. Although Egon Pearson and Neyman had a good deal of social contact over the summer of 1926 and Pearson remembered that they had touched on the new questions, Neyman was disenchanted with the Biometric Laboratory. It was not the frontier of mathematical statistics that he had expected, and he determined to go to Paris (where his wife was an art student) and press on with his work in probability and mathematics.
At the end of 1926, correspondence began between Neyman and Pearson, and Pearson visited his colleague in Paris in the spring of 1927. Reid (1982) reports that she gave copies of Neyman's letters to Pearson to Erich Lehmann, an early student of Neyman's at Berkeley and an authority on the Neyman-Pearson theory (Lehmann, 1959), for his comment. Lehmann's conclusion was that, at any rate until early in 1927, Neyman "obviously didn't understand what Pearson was talking about" (p. 73).
The first joint paper, in two parts, was published in Biometrika in 1928. Neyman had returned to Poland for the academic year 1927-1928, teaching both at Warsaw and at Krakow, and he clearly felt that he had not played as big a role in the first paper as had Pearson. The paper (Part I) ends with Neyman's disclaimer:

N.B. I feel it necessary to make a brief comment on the authorship of this paper. Its origin was a matter of close co-operation, both personal and by letter, and the ground covered included the general ideas and the illustration of these by sampling from a normal population. A part of the results reached in common are included in Chapters I, II and IV. Later I was much occupied with other work, and therefore unable to co-operate. The experimental work, the calculation of tables and the development of the theory of Chapters III and IV are due entirely to Dr Egon S. Pearson. (p. 240)
been randomly drawn from a population (π). Hypothesis A is that indeed it has. But Σ may have been drawn from some other population π′, and thus, two sorts of error may arise. The first is when A is rejected but Σ was drawn from π, and the second is when A is accepted but Σ has been drawn from π′:

In the long run of statistical experience the frequency of the first source of error (or in a single instance its probability) can be controlled by choosing as a discriminating contour, one outside which the frequency of occurrence of samples from π is very small, say, 5 in 100 or 5 in 1000.

The second source of error is more difficult to control. ... It is not of course possible to determine π′, ... but ... we may determine a "probable" or "likely" form of it, and hence fix the contours so that in moving "inwards" across them the difference between π and the population from which it is "most likely" that Σ has been sampled should become less and less. This choice also implies that on moving "outwards" across the contours, other hypotheses as to the population sampled become more and more likely than Hypothesis A. (Neyman & Pearson, 1928, p. 177)
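The two sorts of error can be made concrete with a small simulation. This is a sketch in modern terms, not taken from the 1928 paper: the populations (means 0 and 1, unit standard deviation), the sample size, and all the names are choices of the present illustration, with the rejection "contour" set so that samples from π land beyond it roughly 5 times in 100.

```python
# Sketch (modern notation, illustrative values): the two sorts of error.
# Hypothesis A: the sample Sigma comes from population pi (mean 0, sd 1).
# The rival population pi-prime has mean 1.  Reject A when the sample mean
# crosses a contour chosen so samples from pi do so rarely -- "5 in 100".
import random
import statistics

random.seed(1)
N, n = 10_000, 25
cutoff = 1.645 / n ** 0.5          # one-sided 5% contour for the mean of n draws

def sample_mean(mu):
    """Mean of a random sample of n observations from N(mu, 1)."""
    return statistics.fmean(random.gauss(mu, 1) for _ in range(n))

# First sort of error: A is rejected although Sigma really came from pi.
type1 = sum(sample_mean(0.0) > cutoff for _ in range(N)) / N
# Second sort of error: A is accepted although Sigma came from pi-prime.
type2 = sum(sample_mean(1.0) <= cutoff for _ in range(N)) / N

print(f"first sort ~ {type1:.3f}, second sort ~ {type2:.3f}")
```

With these particular populations the first error rate hovers near .05 by construction, while the second happens to be tiny; moving the alternative mean closer to 0 would inflate the second sort of error, which is the trade-off the quoted passage is wrestling with.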
There are faint echoes here of the subjective element in Neyman's initial reasoning, commented on by Lehmann and reported by Reid (1982), the notion that belief in a prior hypothesis affects the consideration of the evidence:
He would not put his name to the paper. The curious fact is that neither Neyman nor Pearson ever wholeheartedly subscribed to the inverse probability approach, but Neyman felt that it should be addressed. Pearson's attempt to avoid a confrontation was doomed to failure, just as his father's earlier attempts
² Cowles and Davis (1982b) carried out a simple experiment that supports the suggestion that the .05 level of significance is subjectively reasonable. Sandy Lovie pointed out to the author that the same experiment, in a slightly different context, was performed by Bilodeau (1952).
Reid reports that when she wrote to the librarian of the Royal Society to discover the name of the second referee, she found that there had only been one referee, and that he was A. C. Aitken of Edinburgh (a leading innovator in the field of matrix algebra).
In fact, two papers were published in 1933. The Royal Society paper dealt with procedures for the determination of the most efficient tests of statistical hypotheses. There is no doubt that it is one of the most influential statistical papers ever written. It transformed the way in which both the reasoning behind, and the actual application of, statistical tests were perceived. Forty years later, Le Cam and Lehmann enthusiastically assessed it:

The impact of this work has been enormous. It is, for example, hard to imagine hypothesis testing without the concept of power. ... However, the influence of the work goes far beyond. ... By deriving tests as the solutions of clearly defined optimum problems, Neyman and Pearson established a pattern for Wald's general decision theory and for the whole field of mathematical statistics as it has developed since then. (Le Cam & Lehmann, 1974, p. viii)
The later paper (Neyman & Pearson, 1933), presented to the Cambridge Philosophical Society, again sets out the Neyman-Pearson rationale and procedures for hypothesis testing. Its stated aim is to separate hypothesis testing from problems in estimation and to examine the employment of tests independently of a priori probability laws. The authors have rejected the Bayesian approach.
The rule then is to set the chance of the first kind of error (the size, or what we would now call the α level) at a small value and then choose a rejection class so that the chance of the second kind of error is minimized. In fact, the procedure attempts to maximize the power of the test for a given size. The probability of rejecting the statistical hypothesis, H₀, when in fact the true hypothesis is H₁, that is P(w | H₁), is called the power of the critical region w with respect to H₁.
as the resultant power of the test T. ... It is seen that while the power of a test with regard to a given alternative H₁ is independent of the probabilities a priori, and is therefore known precisely as soon as H₁ and w are specified, this is not the case with the resultant power, which is a function of the φᵢ's. (Neyman & Pearson, 1933b, p. 499)
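The rule can be sketched numerically. This is a minimal modern illustration, not the authors' own calculation: the test (one-sided, on a normal mean with known standard deviation), the sample size, and the function names are assumptions of the sketch. It shows the sense in which, once the size α fixes the rejection region w, the power against any simple alternative H₁ is determined.

```python
# Sketch: fix the size alpha, and the power P(w | H1) is "known precisely
# as soon as H1 and w are specified".  One-sided test of H0: mu = 0 for a
# normal mean with known sd, based on a sample of size n.
from statistics import NormalDist

def power(mu1, n=25, alpha=0.05, sd=1.0):
    """P(w | H1): chance the sample mean lands in the rejection region w."""
    z = NormalDist()
    se = sd / n ** 0.5
    cutoff = z.inv_cdf(1 - alpha) * se   # region w: sample mean > cutoff
    return 1 - z.cdf((cutoff - mu1) / se)

# At mu1 = 0 the size alpha is recovered; power rises with the alternative.
for mu1 in (0.0, 0.2, 0.5, 1.0):
    print(f"mu1 = {mu1:.1f}  power = {power(mu1):.3f}")
```

The "resultant power" of the quotation would go one step further, averaging such values over a priori weights φᵢ attached to the alternatives, which is exactly where the a priori probabilities re-enter.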
This aspect of the error problem is very evident in a number of fields where tests must be used in a routine manner, and errors of judgment lead to waste of energy or financial loss. Such is the case in sampling inspection problems in mass-production industry. (Neyman & Pearson, 1933, p. 493)
[Karl] Pearson was made an honorary member of the Tea Club, and when he joined them in the Common Room, it was observed that Fisher did him the unique honour of breaking out of conversation to step forward and greet him cordially. (Fisher Box, 1978, p. 260)
Or,
It appears that, at first, Neyman and Fisher got on quite well and that Neyman tried to bring Fisher and Egon Pearson together. His work on estimation, rather than the joint work with Pearson on hypothesis testing, occupied his attention. His 1934 paper (discussed earlier) was, in the main, well received by Fisher, and Fisher's December 1934 paper, presented to the Royal Statistical Society (Fisher, 1935), was commented on favorably by Neyman.
But any harmony disappeared at meetings of the Industrial and Agricultural Section of the Royal Statistical Society in 1935. Neyman presented his "Statistical Problems in Agricultural Experimentation" in which he questioned the efficiency of Fisher's handling of Randomized Block and Latin Square designs, illustrating his talk with wooden models that had been prepared for him at University College.³ Fisher's response to the paper (for which he was supposed to give the vote of thanks) began by expressing the view, to put it bluntly (which he did), that he had hoped that Neyman would be speaking on something that he knew something about. His closing comments disparaged the Neyman-Pearson approach.
Frank Yates (1935) presented a paper on factorial designs to the Royal Statistical Society later that year. Neyman expressed doubts about the interpretation of interactions and main effects when the number of replications was small. The problem has been examined more recently by Traxler (1976). The details of these criticisms, and they had validity, did not concern Fisher. Fisher Box defends this view, but it was not a view that convinced many mathematicians and statisticians outside the Fisherian camp.
³ Reid (1982) relates a story told by both Neyman and Pearson. One evening they returned to the department after dinner and found the models strewn about the floor. They suspected that the angry act was Fisher's.
STATISTICS AND INVECTIVE
suffered too much from the lash of Fisher's tongue, for Fisher regarded him as a lightweight and, whenever he could, coldly ignored him.

Fisher was always willing to promote the application of his work to experimentation in a wide variety of fields. He was unable to accept any criticism of his view of its mathematical foundations. Perhaps the most unfortunate episode took place at the end of Karl Pearson's life. Taking as his reason an attack Pearson (1936) had made, in a work published very shortly after his death, on a paper written by an Indian statistician, R. S. Koshal (1933), Fisher (1937) undertook "to examine frankly the status of the Pearsonian methods" (p. 303). Nearly 20 years later, in an author's note accompanying a republication of "Professor Karl Pearson and the Method of Moments," Fisher really exceeds the bounds of academic propriety:
The principles brought to light [in the following paper] seem to the author essential to the theory of tests of significance in general, and to have been most unwarrantably ignored in at least one pretentious work on "Testing Statistical Hypotheses." Practical experimenters have not been seriously influenced by this work, but in mathematical departments, at a time when these were beginning to appreciate the part they might play as guides in the theoretical aspects of experimentation, its influence has been somewhat retrograde. (Fisher, 1950, p. 35.173a in Bennett, 1971)

Of course Karl Pearson was guilty of the same sort of invective. Personal criticism in the defense of their mathematical and statistical stances is a feature of the style of both men. No academic writer expects to escape criticism (indeed, lack of criticism generally indicates lack of interest) but these sorts of polemics can only damage the discipline. Ordinary, and even some extraordinary, men and women run for cover. These conflicts may well be responsible for the rather uncritical acceptance of the statistical tools that we use today, a point that is discussed further.
procedures. It is small wonder that they are not widely appreciated, because the founding fathers themselves were not clear on the issues, or, at any rate, in public and in their writings they were not always clear. A few examples will suffice to illustrate the point.
In 1935 Karl Pearson asserted that "tests are used to ascertain whether a reasonable graduation curve has been achieved, not to assert whether one or another hypothesis is true or false" (K. Pearson, 1935, p. 296). All he is doing here is arguing with Fisher. He himself used his own χ² test for hypothesis testing (e.g., K. Pearson, 1909).
In his famous discussion of the "tea-tasting" investigation, Fisher discusses the "sensitiveness" of an experiment. He notes that by increasing the size of the experiment:

On the whole the ideas (a) that a test of significance must be regarded as one of a series of similar tests applied to a succession of similar bodies of data, and (b) that the purpose of the test is to discriminate or "decide" between two or more hypotheses, have greatly obscured their understanding, when taken not as contingent possibilities but as elements essential to their logic. (Fisher, 1956/1973, 3rd ed., pp. 45-46)
Fisher says:

These examples show how badly the word "error" is used in describing such a situation. Moreover, it is a fallacy, so well known as to be a standard example, to conclude from a test of significance that the null hypothesis is thereby established; at most it may be said to be confirmed or strengthened. (Fisher, 1955, p. 73, emphasis added)
PRACTICAL STATISTICS
A striking feature of the general statistics texts produced for courses in the psychological sciences is the anonymity of the prescriptions that are described. A startlingly high number of incoming graduate students in the author's department were unaware that the F ratio was named for a man called Fisher, but a cursory glance through the indexes of introductory texts reveals why. Pearson's name is sometimes mentioned as a convenient label for a correlation coefficient, distinguishing it from the rank-order method of Spearman. Neyman and Pearson Jr. are hardly ever acknowledged. "Power" is treated with caution. Controversial issues are never discussed.
Gigerenzer (1987) presents an interesting discussion of the situation: "The confusion in the statistical texts presented to the poor frustrated students was caused in part by the attempt to sell this hybrid as the sine qua non of scientific inference" (p. 20).
Gigerenzer's thesis addresses what he sees as experimental psychology's fight against subjectivity. Probabilistic models and statistical methods provided the discipline with a mechanical process that seemed to allow for objective assessment independent of the experimenter. Parallel constructs of individual differences as error and uncertainty as ignorance further promoted psychology's view of itself as an objective science.

It is Gigerenzer's view that the illusion was more or less deliberately created by the early textbooks and has been perpetuated. The general neglect of alternative theories and methods of inference, anonymous presentation, the silence on the controversies, and "institutionalization," all conspired to provide psychology with its need for objectivity. Gigerenzer makes telling and
REFERENCES
Forrest, D. W. (1974). Francis Galton: The life and work of a Victorian genius. London: Elek.
Fox, J. (1984). Linear statistical models and related methods. New York: Wiley.
Funkhouser, H. G. (1936). A note on a 10th century graph. Osiris, 1, 260-262.
Funkhouser, H. G. (1937). Historical development of the graphical representation of data. Osiris, 3, 269-404.
Gaito, J. (1960). Expected mean squares in analysis of variance techniques. Psychological Reports, 7, 3-10.
Gaito, J. (1980). Measurement scales and statistics: Resurgence of an old misconception. Psychological Bulletin, 87, 564-567.
Galbraith, V. H. (1961). The making of Domesday Book. Oxford: University Press.
Galton, F. (1877). Typical laws of heredity. Proceedings of the Royal Institution of Great Britain, 8, 282-301. [Also in Nature, 15, 492-495]
Galton, F. (1884). On the anthropometric laboratory at the late International Health Exhibition. Journal of the Anthropological Institute of Great Britain and Ireland, 14, 205-221.
Galton, F. (1885a). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246-263.
Galton, F. (1885b). Opening address to the Anthropological Section of the British Association by the President of the Section. Nature, 32, 507-510.
Galton, F. (1886). Family likeness in stature. Proceedings of the Royal Society of London, 40, 42-73 (includes an appendix by J. D. Hamilton Dickson).
Galton, F. (1888a). Co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society of London, 45, 135-145.
Galton, F. (1888b). Personal identification and description. Proceedings of the Royal Institution, 12, 346-360.
Galton, F. (1889). Natural inheritance. London: Macmillan.
Galton, F. (1907). Inquiries into human faculty and its development (3rd ed.). London: J. M. Dent. (Original work published 1883)
Galton, F. (1908). Memories of my life. London: Methuen.
Galton, F. (1962). Hereditary genius (second edition reprinted). London: Collins and New York: World. (Original work published 1869; 2nd ed. 1892)
Galton, F. (1970). English men of science. London: Frank Cass. (Original work published 1874)
Garnett, J. C. M. (1920). The single general factor in dissimilar mental measurements. British Journal of Psychology, 10, 242-258.
Garrett, H. E., & Zubin, J. (1943). The analysis of variance in psychological research. Psychological Bulletin, 40, 233-267.
Gauss, C. F. (1963). Theoria motus corporum coelestium (Theory of the motion of heavenly bodies). New York: Dover. (Original work published 1809)
Gigerenzer, G. (1987). Probabilistic thinking and the fight against subjectivity. In L. Krüger, G. Gigerenzer, & M. Morgan (Eds.), The probabilistic revolution: Ideas in the sciences (Vol. 2, pp. 7-33). Cambridge, MA: MIT Press.
Gini, C. (1928). Une application de la méthode représentative aux matériaux du dernier recensement de la population italienne (1er décembre 1921) [An application of the representative method to material from the last census of the Italian population (1 December 1921)]. Bulletin of the International Statistical Institute, 23, 198-215.
Ginzburg, B. (1936). The scientific value of the Copernican induction. Osiris, 1, 303-313.
Glaisher, J. W. L. (1872). On the law of facility of errors of observation and the method of least squares. Royal Astronomical Society Memoirs, 39, 75-124.
Seng, Y. P. (1951). Historical survey of the development of sampling theory and practice. Journal of the Royal Statistical Society, 114, 214-231.
Sheynin, O. B. (1966). Origin of the theory of error. Nature, 211, 1003-1004.
Sheynin, O. B. (1978). S.-D. Poisson's work in probability. Archive for History of Exact Sciences, 18, 245-300.
Sidman, M. (1960). Tactics of scientific research. New York: Basic Books.
Simpson, T. (1755). A letter to the Right Honourable George Earl of Macclesfield, President of the Royal Society, on the advantage arising in taking the mean of a number of observations, in practical astronomy. Philosophical Transactions of the Royal Society, 49, 82-93.
Simpson, T. (1757). On the advantage arising from taking the mean of a number of observations, in practical astronomy, wherein the odds that the result in this way is more exact than from one single observation is evinced and the utility of the method in practise clearly made appear. In Miscellaneous tracts on some curious and very interesting subjects in mechanics, physical astronomy and speculative mathematics (pp. 64-75). London: Printed for J. Nourse.
Skinner, B. F. (1953). Science and human behavior. New York: Macmillan.
Smith, D. E. (1929). A source book in mathematics. New York: McGraw Hill.
Smith, K. (1916). On the "best" values of the constants in frequency distributions. Biometrika, 11, 262-276.
Snedecor, G. W. (1934). Calculation and interpretation of analysis of variance and covariance. Ames, IA: Collegiate Press.
Snedecor, G. W. (1937). Statistical methods. Ames, IA: Collegiate Press.
Soper, H. E. (1913). On the probable error of the correlation coefficient to a second approximation. Biometrika, 9, 91-115.
Soper, H. E., Young, A. W., Cave, B. M., Lee, A., & Pearson, K. (1917). On the distribution of the correlation coefficient in small samples. A co-operative study. Biometrika, 11, 328-413.
Spearman, C. (1904a). General intelligence, objectively determined and measured. American Journal of Psychology, 15, 201-293.
Spearman, C. (1904b). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.
Spearman, C. (1927). The abilities of man. New York: Macmillan.
Spearman, C. (1933). The factor theory and its troubles. Journal of Educational Psychology, 24, 521-524.
Spearman, C. (1939). The factorial analysis of human ability. II. Determination of factors. British Journal of Psychology, 30, 78-83.
St. Cyres, Viscount (1909). Pascal. London: Smith, Elder and Co.
Stanley, J. C. (1966). The influence of Fisher's "The design of experiments" on educational research thirty years later. American Educational Research Journal, 3, 223-229.
Stephan, F. F. (1948). History of the use of modern sampling procedures. Journal of the American Statistical Association, 43, 12-39.
Stephen, L. (1876). History of English thought in the eighteenth century (Vols. 1-2). London: Smith, Elder and Co.
Stephenson, W. (1939). The factorial analysis of human ability. IV. Abilities defined as non-fractional factors. British Journal of Psychology, 30, 94-104.
Stevens, S. S. (1951). Mathematics, measurement and psychophysics. In S. S. Stevens (Ed.), Handbook of experimental psychology (pp. 1-49). New York: Wiley.
von Mises, R. (1957). Probability, statistics and truth (2nd rev. English ed. prepared by H. Geiringer). London: Allen and Unwin.
Wald, A. (1950). Statistical decision functions. New York: Wiley.
Walker, E. L. (1970). Psychology as a natural and social science. Belmont, CA: Brooks/Cole.
Walker, H. M. (1929). Studies in the history of statistical method. Baltimore, MD: Williams and Wilkins.
Wallis, W. A., & Roberts, H. V. (1956). Statistics: A new approach. New York: Free Press.
Watson, J. B. (1913). Psychology as the behaviorist views it. Psychological Review, 20, 158-177.
Weaver, W. (1977). Lady Luck: The theory of probability. Harmondsworth: Penguin Books. (Original work published 1963)
Welch, B. L. (1939). On confidence limits and sufficiency with particular reference to parameters of location. Annals of Mathematical Statistics, 10, 58-69.
Welch, B. L. (1958). "Student" and small sample theory. Journal of the American Statistical Association, 53, 777-788.
Weldon, W. F. R. (1890). The variations occurring in certain Decapod Crustacea. I. Crangon vulgaris. Proceedings of the Royal Society of London, 47, 445-453.
Weldon, W. F. R. (1892). Certain correlated variations in Crangon vulgaris. Proceedings of the Royal Society of London, 51, 2-21.
Weldon, W. F. R. (1893). On certain correlated variations in Carcinus moenas. Proceedings of the Royal Society of London, 54, 318-329.
Westergaard, H. (1932). Contributions to the history of statistics. London: P. S. King.
Whitney, C. A. (1984, October). Generating and testing pseudorandom numbers. Byte, p. 128.
Wilson, E. B. (1928). Review of The abilities of man, their nature and measurement by Charles Spearman. Science, 67, 244-248.
Wilson, E. B. (1929a). Review of Crossroads in the mind of man by T. L. Kelley. Journal of General Psychology, 2, 153-169.
Wilson, E. B. (1929b). Comment on Professor Spearman's note. Journal of Educational Psychology, 20, 217-223.
Wishart, J. (1934). Statistics in agricultural research. Journal of the Royal Statistical Society Supplement, 1, 26-51.
Wolfle, D. (1940). Factor analysis to 1940. Chicago: University of Chicago Press.
Wood, T. B., & Stratton, F. J. M. (1910). The interpretation of experimental results. Journal of Agricultural Science, 3, 417-440.
Woodworth, R. S., & Schlosberg, H. (1954). Experimental psychology. New York: Holt, Rinehart & Winston.
Yates, F. (1935). Complex experimentation. Journal of the Royal Statistical Society Supplement, 2, 181-247.
Yates, F. (1939). The comparative advantages of systematic and randomized arrangements in the design of agricultural and biological experiments. Biometrika, 30, 440-466.
Yates, F. (1951). The influence of Statistical Methods for Research Workers on the development of the science of statistics. Journal of the American Statistical Association, 46, 19-34.
Yates, F. (1964). Sir Ronald Fisher and the design of experiments. Biometrics, 20, 307-321.
Yule, G. U. (1897a). Notes on the teaching of the theory of statistics at University College. Journal of the Royal Statistical Society, 60, 456-458.
AUTHOR INDEX

Fechner, G., 33, 239
Feigl, H., 25, 239
Ferguson, G. A., 213
Finn, R. W., 48, 239
Fisher, R. A., 4-6, 38, 62, 81-82, 99, 102, 110-111, 113-115, 118-122, 171, 177-181, 183-184, 186-200, 202-203, 205-206, 210, 212-213, 216, 225, 228-230, 232-233, 239, 240
Fisher Box, J., 112, 114, 117, 119-121, 179, 181, 183, 200, 224-226, 240
Fletcher, R., 163, 240
Folks, L., 204, 243
Forrest, D. W., 130, 137, 241
Fox, J., 175, 199, 241
Funkhouser, H. G., 54, 241
Harman, H. H., 170, 242
Harris, J. A., 190, 242
Hartley, D., 176, 242
Hays, W. L., 195, 242
Hearnshaw, L. S., 28, 156, 162-164, 210, 242
Heisenberg, W., 23, 46, 242
Henkel, R. E., 83, 199, 242
Heron, D., 149-151, 242, 245
Hick, W. E., 203, 242
Hilbert, D., 66, 242
Hilgard, E., 206, 242
Hogben, L., 82, 242
Holschuh, N., 19, 183, 243
Holzinger, K. J., 160, 170, 243
Hotelling, H., 155, 213, 243
Hume, D., 30, 243
Hunter, J. E., 83, 243
Huxley, J., 18, 129, 243
Huygens, C., 19, 51, 59, 63, 243
SUBJECT INDEX

Actuarial science, 51
Analysis of variance, 5, 35, 102-104, 177-181, 187-195
Anthropometric laboratory, 132
Arithmetic mean, 18, 89-91
Average, 18, 88
Bayes' theorem, 77-80, 83, 231
Behaviorism, 22
Bills of mortality, 8, 49-52
Binomial distribution, 10-11, 64-65, 68-70
Biometrics, 4, 14-18
Causality, see also Determinism, 30
Censuses, 47, 93
Central limit theorem, 95, 107, 124-126
Centroid method, 168
Chi square: controversy, 110-114; distribution, 105-114; test, 109
Communalities, 168
Confidence intervals, 100, 199-203
Control, 20, 31, 38, 171-178; Mill's methods of enquiry and, 173-176; statistical, 176-181
Correlation, 2, 35, 129, 138-146, 174; coefficient of, 5, 116, 141-146, 158; controversies, 146-153; distribution of coefficient, 118; intraclass, 5, 119, 123, 189-193; nominal variables, 148-151; probable error, 117
Data organization, 47-55
Degrees of freedom, 110-114
De Méré's problems, 9, 58-59
Determinism, 21-26, 118; in psychology, 33-35
Error estimation, 10, 63, 90, 101-103
Estimation, theory of, 98-101
Eugenics, 4, 15, 130-131, 152-153, 187
Evolution, 1-3, 17, 129, 143; natural selection, 1, 129, 130, 143
Expected mean squares, 213-215
Experimental texts, 205-207
Experiments: design of, 31, 101, 171-185; tea-tasting, 182-184
F ratio, 123, 193, 233; distribution, 121-123
Factor analysis, 154-170
Factor rotation, 169
Free will, 24
Frequency distributions, 68
Induction, 27-30, 175, 230
Inference, 29, 31-33; early, 47-50; Fisherian, 81-83; practical, 77-84; statistical, 7, 31, 33, 217
Insurance and annuities, 50-53, 63
Intelligence, 158-161
Inventories, 6, 47-48
Law of error, 10, 12, 89
Law of large numbers, 59-60
Least squares, 89, 91-93, 148, 182
Likelihood criterion, 82, 218-220
Logical positivism, 45-46
Mean, see Arithmetic mean
Measurement, 2, 36-38, 40; definition, 40; error, 38, 44; interval and ratio scales, 42; nominal scale, 41, 148-150; ordinal scale, 41
Multiple comparisons, 195-199
Parameters, 7, 95
Pascal's triangle, 9, 59, 69
Personal equation, 45
Poisson distribution, 70-72
Political arithmetic, 48-50
Polling, 94-95
Population, 7, 82, 85, 87, 95-101, 105
Power, 222-224, 232-233
Prediction, 30
Primary mental abilities, 169
Principal component analysis, 155
Probabilistic and deterministic models, 26
Probability, 7-9, 32, 56; and error estimation, 63; and weight, 32; beginnings of the concept, 56-60; distributions, 68-76; fiducial, 81-83, 99, 202, 231; inverse, 65, 77-80, 119, 221; mathematical theory, 66; meaning of, 60-63; subjective, 61, 221
Probable error, 127, 139, 200
Randomness, 85-88, 172; random numbers, 85-88; random sampling, 87; randomization, 94, 101-104, 177-179, 188