


In this series:
B. COMRIE Aspect
Semantic Theory
Historical Linguistics
D. B. FRY The Physics of Speech
R. A. HUDSON Sociolinguistics
J. K. CHAMBERS and P. TRUDGILL Dialectology
A. J. ELLIOT Child Language
P. H. MATTHEWS Syntax
A. RADFORD Transformational Syntax
L. BAUER English Word-formation
S. C. LEVINSON Pragmatics
G. BROWN and G. YULE Discourse Analysis
R. LASS Phonology
R. HUDDLESTON Introduction to the Grammar of English
B. COMRIE Tense
W. KLEIN Second Language Acquisition
A. CRUTTENDEN Intonation
A. WOODS, P. FLETCHER, A. HUGHES Statistics in Language Studies

The right of the University of Cambridge to print and sell all manner
of books was granted by Henry VIII in 1534. The University has printed
and published continuously since 1584.


General Editors: B. COMRIE, C. J. FILLMORE, R. LASS, R. B. LE PAGE,


The first expression on page 123 should read
$H_0: \mu = 80$ versus $H_1: \mu < 80$


Published by the Press Syndicate of the University of Cambridge
The Pitt Building, Trumpington Street, Cambridge CB2 1RP
32 East 57th Street, New York, NY 10022, USA
10 Stamford Road, Oakleigh, Melbourne 3166, Australia

© Cambridge University Press 1986

First published 1986
Printed in Great Britain at The Bath Press, Avon

British Library cataloguing in publication data
Woods, Anthony
Statistics in language studies. -
(Cambridge textbooks in linguistics)
1. Linguistics - Research - Statistical methods
I. Title II. Fletcher, Paul III. Hughes, Arthur
410'.72 P138.5

Library of Congress cataloguing in publication data
Woods, Anthony.
Statistics in language studies.
(Cambridge textbooks in linguistics)
Bibliography: p.
Includes index.
1. Linguistics - Statistical methods. I. Fletcher,
Paul. II. Hughes, Arthur. III. Title. IV. Series.
P138.5.W66 1985 519.5 85-470
ISBN 0 521 25326 8 hard covers
ISBN 0 521 27312 9 paperback


Contents

Preface

1 Why do linguists need statistics?

2 Tables and graphs
  2.1 Categorical data
  2.2 Numerical data
  2.3 Multi-way tables
  2.4 Special cases
  Summary
  Exercises

3 Summary measures
  3.1 The median
  3.2 The arithmetic mean
  3.3 The mean and the median compared
  3.4 Means of proportions and percentages
  3.5 Variability or dispersion
  3.6 Central intervals
  3.7 The variance and the standard deviation
  3.8 Standardising test scores
  Summary
  Exercises

4 Statistical inference
  4.1 The problem
  4.2 Populations
  4.3 The theoretical solution
  4.4 The pragmatic solution
  Summary
  Exercises

5 Probability
  5.1 Probability
  5.2 Statistical independence and conditional probability
  5.3 Probability and discrete numerical random variables
  5.4 Probability and continuous random variables
  5.5 Random sampling and random number tables
  Summary
  Exercises

6 Modelling statistical populations
  6.1 A simple statistical model
  6.2 The sample mean and the importance of sample size
  6.3 A model of random variation: the normal distribution
  6.4 Using tables of the normal distribution
  Summary
  Exercises

7 Estimating from samples
  7.1 Point estimators for population parameters
  7.2 Confidence intervals
  7.3 Estimating a proportion
  7.4 Confidence intervals based on small samples
  7.5 Sample size
    7.5.1 Central Limit Theorem
    7.5.2 When the data are not independent
    7.5.3 Confidence intervals
    7.5.4 More than one level of sampling
    7.5.5 Sample size to obtain a required precision
  7.6 Different confidence levels

8 Testing hypotheses about population values
  8.1 Using the confidence interval to test a hypothesis
  8.2 The concept of a test statistic
  8.3 The classical hypothesis test and an example
  8.4 How to use statistical tests of hypotheses: is significance
      significant?
    8.4.1 The value of the test statistic is significant at the 1% level
    8.4.2 The value of the test statistic is not significant
  Summary
  Exercises

9 Testing the fit of models to data
  9.1 Testing how well a complete model fits the data
  9.2 Testing how well a type of model fits the data
  9.3 Testing the model of independence
  9.4 Problems and pitfalls of the chi-squared test
    9.4.1 Small expected frequencies
    9.4.2 The 2 x 2 contingency table
    9.4.3 Independence of the observations
    9.4.4 Testing several tables from the same study
    9.4.5 The use of percentages
  Summary
  Exercises

10 Measuring the degree of interdependence between two variables
  10.1 The concept of covariance
  10.2 The correlation coefficient
  10.3 Testing hypotheses about the correlation coefficient
  10.4 A confidence interval for a correlation coefficient
  10.5 Comparing correlations
  10.6 Interpreting the sample correlation coefficient
  10.7 Rank correlations
  Summary
  Exercises

11 Testing for differences between two populations
  11.1 Independent samples: testing for differences between means
  11.2 Independent samples: comparing two variances
  11.3 Independent samples: comparing two proportions
  11.4 Paired samples: comparing two means
  11.5 Relaxing the assumptions of normality and equal variance:
       nonparametric tests
  11.6 The power of different tests
  Summary
  Exercises

12 Analysis of variance - ANOVA
  12.1 Comparing several means simultaneously: one-way ANOVA
  12.2 Two-way ANOVA: randomised blocks
  12.3 Two-way ANOVA: factorial experiments
  12.4 ANOVA: main effects only
  12.5 ANOVA: factorial experiments
  12.6 Fixed and random effects
  12.7 Test score reliability and ANOVA
  12.8 Further comments on ANOVA
    12.8.1 Transforming the data
    12.8.2 'Within-subject' ANOVAs
  Summary
  Exercises

13 Linear regression
  13.1 The simple linear regression model
  13.2 Estimating the parameters in a linear regression
  13.3 The benefits from fitting a linear regression
  13.4 Testing the significance of a linear regression
  13.5 Confidence intervals for predicted values
  13.6 Assumptions made when fitting a linear regression
  13.7 Extrapolating from linear models
  13.8 Using more than one independent variable: multiple regression
  13.9 Deciding on the number of independent variables
  13.10 The correlation matrix and partial correlation
  13.11 Linearising relationships by transforming the data
  13.12 Generalised linear models
  Summary
  Exercises

14 Searching for groups and clusters
  14.1 Multivariate analysis
  14.2 The dissimilarity matrix
  14.3 Hierarchical cluster analysis
  14.4 General remarks about hierarchical clustering
  14.5 Non-hierarchical clustering
  14.6 Multidimensional scaling
  14.7 Further comments on multidimensional scaling
  14.8 Linear discriminant analysis
  14.9 The linear discriminant function for two groups
  14.10 Probabilities of misclassification
  Summary
  Exercises

15 Principal components analysis and factor analysis
  15.1 Reducing the dimensionality of multivariate data
  15.2 Principal components analysis
  15.3 A principal components analysis of language test scores
  15.4 Deciding on the dimensionality of the data
  15.5 Interpreting the principal components
  15.6 Principal components of the correlation matrix
  15.7 Covariance matrix or correlation matrix?
  15.8 Factor analysis
  Summary

Appendix A Statistical tables
Appendix B Statistical computation
Appendix C Answers to some of the exercises
References
Index

This book began with initial contacts between linguists (Hughes and
Fletcher) and a statistician (Woods) over specific research problems in
first and second language learning, and testing. These contacts led to an
increasing awareness of the relevance of statistics for other areas in
linguistics and applied linguistics, and of the responsibility of those
working in such areas to subject their quantitative data to the same kind
of statistical scrutiny as other researchers in the social sciences. In
time, students in linguistics made increasing use of the Advisory Service
provided by the Department of Applied Statistics at Reading. It soon
became clear that the dialogue between statistician and student linguist,
if it was to be maximally useful, required an awareness of basic
statistical concepts on the student's part. The next step, then, was the
setting up of a course in statistics for linguistics students (taught by
Woods). This is essentially the book of the course, and reflects our joint
views on what linguistics students who want to use statistics with their
data need to know.
There are two main differences between this and other introductory
textbooks in statistics for linguists. First, the portion of the book
devoted to probability and statistical inference is considerable. In order
to elucidate the sample-population relationship, we consider in some
detail basic notions of probability, of statistical modelling, and (using
the normal distribution as an example of a statistical model) of the
problem of estimating population values from sample estimates. While
these chapters (4-8) may on initial reading seem difficult, we strongly
advise readers who wish fully to understand what they are doing, when they
use the techniques elaborated later in the book, to persevere with them.
The second difference concerns the range of statistical methods we deal
with. From the second half of chapter 13 on, a number of multivariate
techniques are examined in relation to linguistic data. Multiple
regression, cluster analysis, discriminant function analysis, and
principal component and factor analysis have been applied in recent years
to a range of linguistic

July 1985                                PAUL FLETCHER
University of Reading                    ARTHUR HUGHES


1  Why do linguists need statistics?

Linguists may wonder why they need statistics. The dominant theoretical
framework in the field, that of generative grammar, has as its primary
data-source judgements about the well-formedness of sentences. These
judgements usually come from linguists themselves, are either-or
decisions, and relate to the language ability of an ideal native speaker
in a homogeneous speech community. The data simply do not call for,
or lend themselves to, the assignment of numerical values which need
to be summarised or from which inferences may be drawn. There appears
to be no place here for statistics.
Generative grammar, however, despite its great contribution to linguistic
knowledge over the past 25 years, is not the sole topic of linguistic
study. There are other areas of the subject where the observed data
positively demand statistical treatment. In this book we will scrutinise
studies from a number of these areas and show, we hope, the necessity for
statistics in each. In this brief introduction we will use a few of these
studies to illustrate the major issues with which we shall be faced.
As we will demonstrate throughout the book, statistics allows us to
summarise complex numerical data and then, if desired, to draw inferences
from them. Indeed, a distinction is sometimes made between descriptive
statistics on the one hand and inferential statistics on the other. The
need to summarise and infer comes from the fact that there is variation in
the numerical values associated with the data (i.e. the values over a set
of measurements are not identical). If there were no variation, there would
be no need for statistics.
Let us imagine that a phonetician, interested in the way that
voiced-voiceless distinctions are maintained by speakers of English,
begins by taking measurements of voice onset times (VOT), i.e. the time
between the release of the stop and the onset of voicing, in initial
stops. The first set of data consists of ten repetitions from each of 20
speakers of ten /p/-initial words. Now if there were no difference in VOT
time, either between words or between speakers, there would be no need
here for statistics; the single VOT value would simply be recorded. In
fact, of course, very few, if any, of the values will be identical. The
group of speakers may produce VOT values that are all distinct on, for
example, their first pronunciation of a particular word. Alternatively,
the VOT values of an individual speaker may be different from word to word
or, indeed, between repetitions of the same word. Thus the phonetician
could have as many as 2,000 different values. The first contribution of
statistics will be to provide the means of summarising the results in a
meaningful and readily understandable way. One common approach is to
provide a single 'typical' value to represent all of the VOT times,
together with a measure of the way in which the VOT times vary around this
value (the mean and the standard deviation - see chapter 3). In this way,
a large number of values is reduced to just two.
We shall return to the phonetician's data, but let us now look at another
example. In this case a psycholinguist is interested in the nature of
aptitude for learning foreign languages. As part of this study 100
subjects are given a language aptitude test and later, after a period of
language instruction, an achievement test in that language. One of the
things the psycholinguist will wish to know is the form of the
relationship between scores on the aptitude test and scores on the
achievement test. Looking at the two sets of scores may give some clues:
someone who scored exceptionally high on the aptitude test, for instance,
may also have done extremely well on the achievement test. But the
psycholinguist is not going to be able to assimilate all the information
conveyed by the 200 scores simply by looking at each pair of scores
separately. Although the kind of summary measures used by the phonetician
will be useful, they will not tell the psycholinguist directly about the
relationship between the two sets of scores. However, there is a
straightforward statistical technique available which will allow the
strength of the relationship to be represented in a single value (the
correlation - see chapter 10). Once again, statistics serves the purpose
of reducing complex data to manageable proportions.
The two examples given have concerned data summary, the reduction of
complex data. In both cases they have concerned the performance of a
sample of subjects. Of course, the ultimate interest of linguistic
investigators is not in the performance only of samples. They usually wish
to generalise to the performance of larger groups. The phonetician with
the sample of 20 speakers may wish to be able to say something about all
speakers of a particular accentual variety of English, or indeed about
speakers of English in general. Similarly, he or she is interested in all
/p/-initial words, not just those in the sample. Here again statistics can
help. There are techniques which allow investigators to assess how closely
the 'typical' scores offered in a sample of a particular size are likely to
approximate to those of the group of whom they wish to make the
generalisation (chapter 7), provided the sample meets certain conditions
(see §4.4 and §5.5).
Let us take the example of the phonetician further. At the same time
as the /p/-initial data were being collected, the ten subjects were also
asked to pronounce 20 /b/-initial words. Again, the phonetician could
reduce these data to two summary measures: the average VOT time for
the group and the standard deviation. Let us consider for the moment
only one of these - the average, or typical value. The phonetician would
then observe that there is a difference between the typical VOT value
for /p/-initial words and that for /b/-initial words. The typical VOT
value for /p/-initial words would be larger. The question then arises as
to whether the difference between the two values is likely to be one which
represents a real difference in VOT times in the larger group for which a
generalisation will be made, or whether it is one that has come about by
chance. (If you think about it, if the measurement is precise there is
almost certain to be some difference in the sample values.) There are
statistical techniques which allow the phonetician to give the probability
that the sample difference is indeed the manifestation of a 'real'
difference in the larger group. In the example we have given (VOT times),
the difference is in fact quite well established in the phonetic
literature (see e.g. Fry 1979: 135-7), but it should be clear that there
are potentially many claims of a similar nature which would be open to -
and would demand - similar treatment.
These two examples hint at the kind of contribution statistics can and
should make to linguistic studies, in summarising data, and in making
inferences from them. The range of studies for which statistics is
applicable is vast - in applied linguistics, language acquisition,
language variation and linguistics proper. Rather than survey each of
these fields briefly at this point, let us look at one acquisition study
in detail, to try to achieve a better understanding of the problems faced
by the investigator and the statistical issues involved. What are the
problems which investigators want to solve, what measures of linguistic
behaviour do they adopt, what is the appropriate statistical treatment for
these measures, and how reliable are their results?
We address these issues by returning to voice onset time, now with
reference to language acquisition. What precisely are the stages children
go through in acquiring, in production, the distinction between voiced
and voiceless initial stops in English? (This discussion draws heavily on
Macken & Barton 1980a; for similar studies concerning the acquisition
of voicing contrasts in other languages the reader is referred to Macken
& Barton 1980b for Spanish, Allen 1985 for French and Viana 1985 for
Portuguese.) The inquiry begins from the observation that transcriptions
of children's early pronunciations of stops often show no /p/-/b/
distinctions; generally both /p/-targets (adult words beginning with /p/)
and /b/-targets (adult words beginning with /b/) are pronounced with
initial [b], or at least this is how auditory impressionistic
transcriptions represent them. Is it possible, though, that young children
are making distinctions which adult transcribers are unable to hear? VOT
is established as a crucial perceptual cue to the voiced-voiceless
distinction for initial stops; for English there is a 'short-lag' VOT
range for voiced stops (from 0 to +30 ms for labials and apicals, 0 to
+40 ms for velars) and a 'long-lag' range for voiceless stops (+60 to
+100 ms). English speakers perceive stops with a VOT of less than +30 ms
(for labials and apicals, +50 ms for velars) as voiced; any value above
these figures tends to lead to the perception of the item in question as
voiceless. Children's productions will tend to be assigned by adult
transcribers to the phonemic categories defined by short- and long-lag
VOT. So if at some stage of development children are making a consistent
contrast using VOT, but within an adult phonemic category, it is quite
possible that adult transcribers, because of their perceptual habits, will
miss it. How is this possibility investigated? It should be apparent that
such a study involves a number of issues for those carrying it out.
(a) We require a group of children, of an appropriate age, to generate
the data. For a developmental study like this we have to decide whether
the data will be collected longitudinally (from the same children at
successive times separated by a suitable interval) or cross-sectionally
(from different groups of children, where each 'group' is of a particular
age, and the different groups span the age-range that is of interest for
us). Longitudinal data have the disadvantage that they take as long to
collect as the child takes to develop, whereas cross-sectional data can be
gathered within a brief time span. With longitudinal data, however, we can
be sure that we are charting the course of growth within individuals and
make reliable comparison between time A and time B. With cross-sectional
comparisons this is not so clear. Once we have decided on the kind of data
we want, decisions as to the size of the sample and the selection of its
elements have to be addressed. It is on the decisions made here that our
ability to generalise the results of a study will depend. The Macken &
Barton study was a longitudinal one, using four children who 'were
monolingual speakers of English with no siblings of school age ... were
producing at least some initial stop words ... showed evidence of normal
language development ... and appeared to be co-operative' (1980a: 42-3).
In addition, both parents of each child were native speakers of English,
and all the children had normal hearing. The reasons for aspects of this
subject description are transparent; general issues relating to sample
size and structure are discussed below (§4.4 and §7.5).
(b) A second issue which is common in linguistic studies is the size
of the data sample from each individual. The number of subjects in the
study we are considering is four, but the number of tokens of /ptk/
and /bdg/-initial adult targets is potentially very large. (An immediate
question that might be asked is whether it is better to have relatively
few subjects, with relatively many instances of the behaviour in which
we are interested from each subject, or many subjects and fewer tokens
- see §7.5 for some discussion.) In the VOT acquisition study the
investigators also had to decide on the related issue of frequency of
sampling and the number of tokens within each of the six categories of
initial stop target. As it happens, they chose a fortnightly sampling
interval and the number of tokens in a session ranged from a low of 25 to
a high of 414. (The goal was to obtain at least 15 tokens for each stop
consonant, but this was not always achieved in the early sessions.)
(c) Once the data are collected and the measurements made on each
token from each individual for each session, the information provided
needs to be presented in an acceptable and comprehensible form. Macken
& Barton restrict themselves, for the instrumental measurements they
make, to 15 tokens of each stop type per session. It may well be that
each of the 15 tokens within a category has a different VOT value, and
for evaluation we therefore need summary values and/or graphic displays
of the data. Macken & Barton use both tabular, numerical summaries
and graphic representations (see chapters 2 and 3 for a general discussion
of methods for data summaries).
(d) The descriptive summaries of the child VOT data suggest some
interesting conclusions concerning one stage of the development of initial
stop contrasts in some children. Recall that it is generally held that the
perceptual boundary between voiced and voiceless labial or alveolar stops
is +30 ms. At an early point in the development of alveolars by one
subject, Tessa, the average value for /d/-initial targets is +2.4 ms while
the average for /t/-initial targets is +20.50 ms. Both these values are
within the adult voiced category, and so the adult is likely to perceive
them as voiced.
But the values are rather different. Is this observed difference between
the two averages a significant difference, statistically speaking? Or,
restating the question in the terms of the investigation, is the child
making a consistent distinction in VOT for /d/-initial and /t/-initial
targets, but one which, because it is inside an adult phonemic category,
is unlikely to be perceived? The particular statistical test that is
relevant to this issue is dealt with in chapter 10, but chapters 3-8
provide a crucial preparation for understanding it.
(e) We have referred to one potentially significant difference for one
child. As investigators we are usually interested in how far we are
justified, on the basis of the sample data we have analysed, in extending
our findings to a larger group of subjects than actually took part in our
study. The answer to this question depends in large measure on how we
handled the issues raised in (b) and (d) above, and is discussed again in
chapter 4.
Much of the discussion so far has centred on phonetics - not because
we believe that is the only linguistic area in which these issues arise,
but because VOT is a readily comprehensible measure and studies employing
it lend themselves to a straightforward illustration of concerns that are
common to many areas of language study. We return to them continually
in the pages that follow with reference to a wide variety of studies.

While the use made of the information in the rest of the book will reflect
the reader's own purposes and requirements, we envisage that there will
be two major reasons for using the book.
First, readers will want to evaluate literature which employs statistical
techniques. The conclusions papers reach are of dubious worth if the
measurements are suspect, if the statistical technique is inappropriate,
or if the assumptions of the technique employed are not met. By discussing
a number of techniques and the assumptions they make, the book will
assist critical evaluation of the literature.
Second, many readers will be interested in planning their own research.
The range of techniques introduced by the book will assist this aim, partly
by way of examples from other people's work in similar areas. We should
emphasise that for research planning the book will not solve all problems.
In particular, it does not address in detail measurement (in the sense
of what and how to measure in a particular field), nor, directly,
experimental design; but it should go some of the way to assisting readers
to select an appropriate statistical framework, and will certainly enable
problems to be discussed in an informed way with a statistician.
Each chapter in the book contains some exemplification in a relevant
field of the techniques it explains. The chapter is then followed by
extensive exercises which must be worked through, to accustom the reader
to the applications of the techniques, and their empirical implications.
While the book is obviously not intended to be read through from cover to
cover, since different readers will be interested in different techniques,
we recommend that all users of the book read chapters 2-8 inclusive, since
these are central to understanding. It is here that summary measures,
probability and inference from samples to populations are dealt with. Many
readers will find chapters 4-8 difficult. This is not because they require
special knowledge or skills for their understanding. They do not, for
example, contain any mathematics beyond simple algebra and the use of a
common notation which is explained in earlier chapters. However, they do
contain arguments which introduce and explain the logic and philosophy of
statistical inference. It is possible to use in a superficial, 'cookbook'
fashion the techniques described in later chapters without understanding
the material in chapters 4-8, but a true grasp of the meaning and
limitations of those techniques will not then be possible.
The second part of the book - from chapter 9 onwards - deals with
a variety of techniques, details of which can be found in the contents
list at the beginning of the book.
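
The two kinds of data summary described in this chapter can be illustrated with a short computation. What follows is only a minimal sketch in Python: the VOT times and test scores are invented for illustration and are not taken from any study cited here, and the function names `summarise` and `pearson_r` are hypothetical. It reduces each set of VOT times to a mean and a standard deviation, and expresses the strength of the relationship between two sets of scores as a single correlation value.

```python
import math
import statistics

# Invented VOT times in milliseconds, for illustration only.
vot_p = [62.0, 71.5, 58.0, 80.0, 66.5, 74.0, 69.0, 77.0]   # /p/-initial words
vot_b = [8.0, 12.5, 5.0, 15.0, 10.5, 9.0, 13.0, 7.0]       # /b/-initial words

# Invented aptitude and achievement scores for the same six subjects.
aptitude = [35, 42, 50, 58, 63, 71]
achievement = [40, 45, 48, 60, 59, 75]


def summarise(values):
    """Reduce a list of measurements to (mean, sample standard deviation)."""
    return statistics.mean(values), statistics.stdev(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


if __name__ == "__main__":
    for label, values in (("/p/", vot_p), ("/b/", vot_b)):
        mean, sd = summarise(values)
        print(f"{label}-initial: mean VOT = {mean:.1f} ms, s.d. = {sd:.1f} ms")
    difference = statistics.mean(vot_p) - statistics.mean(vot_b)
    print(f"difference between means = {difference:.1f} ms")
    print(f"aptitude/achievement r = {pearson_r(aptitude, achievement):.2f}")
```

Summary figures of this kind describe only the sample. Whether a difference between two sample means reflects a real difference in the larger group is a question of inference, which the summaries alone cannot answer.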
Categorical data
2 (a) Frequencies ofdisorders-in a sample o/364 -liinguage-impairedm'a/ednUSA

Tables and graphs' ,,

: ;; __ .
- Phonological
- disability
Specific language ! Impaired
disOrder h(!afirig ' Total J.

57 20<) 47 51 364

(b) Relative frequencies. ofdi$0rders in a sample of;64 language-impait-ed males in US;l

Phonological Specific language Impaired
Stuttering disability disorder hearing Total
0.157 0574 0,129 0.140 1.000
When a linguistic study is carried out the investigator will be faced with
the prospect of understanding, and then explaining to others, the meaning (c) Frequencies of disorders in a sample of 364 Janguage-impalred males itl USA (figures
of the data which have been collected. An essential first step in thisprocess {'!, ,~r(],C,~f!$/'I,J,~_tJIIP#'lJ.,e jreque.w:ies (lf p_ert{!tUpge~)_
is to look for ways of summarisingthe resulkwhich bring out their most 1 , -1 ; :, :,-Phonological Specific languagt; :Impaircd
, ... S.t~~tering dis,abi.!~ty , disp~4cr : , hearing Total
obvious features. Indeed if this is''don<Hihaginatively'imd the trends1in' ' '''
57 (16) . 209. (57) . 47 (13) 51 : ( 1 4) 364 (106)
the data are clear enough, there may be no need for sophisticatedanalysis.
In this chapter we describe the types of table' and diagram most commonly'
employed for data summary.
Let .us begin by looking at typical examples of the kind of data which Table 2.1(a) itself already comprises a neat and intelligible summary
might be collected in language studies. We will consider how, by means of the data, displaying the number of times that each category was observed
of tables, diagrams and a few simple calculations, the data may be summar- out of 364 instances. This number is usually called the frequency or
ised so that their important features can be displayed concisely. The
procedure is analogous to writing a precis of an article or essay and has
similar attractions and drawbacks. The aim is to reduce detail to a minimum
while retaining sufficient information to communicate the essential
characteristics of the original. Remember always that the use of data ought
to enrich and elucidate the linguistic argument, and this can often be done
quite well by means of a simple table or diagram.

2.1 Categorical data
It quite commonly arises that we wish to classify a group of people or
responses or linguistic elements, putting each unit into one of a set of
mutually exclusive classes. The data can then be summarised by giving the
frequency with which each class was observed. Such data are often called
categorical since each element or individual of the group being studied can
be classified as belonging to one of a (usually small) number of different
categories. For example, in table 2.1(a) we have presented the kind of data
one might expect on taking a random sample of 364 males with diagnosed
speech and language difficulties in the USA (see e.g. Healey et al. 1981).
We have put these subjects into four different categories of impairment.

observed frequency of the category. However, it may be more revealing to
display the proportions of subjects falling into the different classes, and
these can be calculated simply by dividing each frequency by the total
frequency, 364. The proportions or relative frequencies obtained in this
way are displayed in table 2.1(b). Note that no more than three figures are
given, though most pocket calculators will give eight or ten. This is
deliberate. Very high accuracy is rarely required in such results, and the
ease of assimilation of a table decreases rapidly with the number of
figures used for each value. Do remember, however, that when you truncate a
number you may have to alter the last figure which you wish to include. For
example, written to three decimal places, 0.6437 becomes 0.644, while
0.31726 would be 0.317. The rule should be obvious.
A table of relative frequencies is not really informative (and can be
downright misleading) unless we are given the total number of observations
on which it is based. It should be obvious that the claim that 50% of
native English speakers display a certain linguistic behaviour is better
supported by the behaviour in question being observed in 500 of 1,000
subjects than in just two of a total of four. (This point is discussed in
detail in chapter 7.) It is best to give both frequencies and relative
frequencies, as in table 2.1(c). Note here that the relative frequencies
have been rounded

further to only two figures and quoted as percentages (to change any
decimal fraction to a percentage it is necessary only to move the decimal
point two places to the right).
It often happens that we wish to compare the way in which the frequencies
of the categories are distributed over two groups. We can present this by
means of two-way tables as in table 2.2 where the sample has been extended
to include 196 language-impaired females from the same background as the
males in table 2.1. In table 2.2(a), the first row is exactly as table
2.1(c); the second row displays the relative frequencies of females across
disorder categories, and the third row displays relative frequencies for
the two groups combined. Table 2.2(b), however, displays the relative
frequency of males to females within categories. So, for example, of the
total number of stutterers (84), 68% are male while 32% are female. In
table 2.2(c) the frequencies in parenthesis in each cell are relative to
the total number of individuals (560), and we can see, for instance, that
the proportion of the total who are male and hearing-impaired is
approximately 9% (51/560).
These tables have been constructed in a form that would be suitable if only
one of them were to be presented. The choice would of course depend on the
features of the data which we wanted to discuss. If, on the other hand,
more than one of the tables were required it would be neither necessary nor
desirable to repeat all the total frequencies in each table. It would be
preferable to present a sequence of simpler, less cluttered tables as in
table 2.3.
The tables we have introduced so far can be used as a basis for
constructing graphs or diagrams to represent the data. Such diagrams will
frequently bring out in a striking way the main features of the data.
Consider figure 2.1(a) which is based on table 2.1. This type of graph is
often called a bar chart and allows an 'at a glance' comparison of the
frequencies of the classes. Figure 2.1(b) is the same chart constructed
from the relative frequencies. Note that its appearance is identical to
figure 2.1(a); the only alteration required is a change of the scale of the
vertical axis. Since the categories have no inherent ordering, we have
chosen to present them in the chart in decreasing order of frequency, but
this is a matter of taste.

Table 2.2

(a) Frequencies of disorders in a sample of 560 language-impaired
individuals in the USA, cross-classified by sex (frequencies relative to
row totals are given in brackets as percentages)

                      Phonological  Specific language  Impaired
          Stuttering  disability    disorder           hearing   Total
Male      57 (16)     209 (57)      47 (13)            51 (14)   364 (100)
Female    27 (14)     118 (60)      31 (16)            20 (10)   196 (100)
Total     84 (15)     327 (58)      78 (14)            71 (13)   560 (100)

(b) Frequencies of disorders in a sample of 560 language-impaired
individuals in the USA, cross-classified by sex (frequencies relative to
column totals are given in brackets as percentages)

                      Phonological  Specific language  Impaired
          Stuttering  disability    disorder           hearing   Total
Male      57 (68)     209 (64)      47 (60)            51 (72)   364 (65)
Female    27 (32)     118 (36)      31 (40)            20 (28)   196 (35)
Total     84 (100)    327 (100)     78 (100)           71 (100)  560 (100)

(c) Frequencies of disorders in a sample of 560 language-impaired
individuals in the USA, cross-classified by sex (frequencies relative to
the total sample size are given in brackets as percentages)

                      Phonological  Specific language  Impaired
          Stuttering  disability    disorder           hearing   Total
Male      57 (10)     209 (37)      47 (8)             51 (9)    364 (65)
Female    27 (5)      118 (21)      31 (6)             20 (4)    196 (35)
Total     84 (15)     327 (58)      78 (14)            71 (13)   560 (100)

Table 2.3

(a) Frequencies of disorders in a sample of 560 language-impaired
individuals in the USA (figures in brackets are percentages)

            Phonological  Specific language  Impaired
Stuttering  disability    disorder           hearing   Total
84 (15)     327 (58)      78 (14)            71 (13)   560 (100)

(b) Frequencies of disorders in a sample of 560 language-impaired
individuals in the USA, cross-classified by sex (percentages in brackets
give relative frequencies of sexes within disorders)

                      Phonological  Specific language  Impaired
          Stuttering  disability    disorder           hearing   Total
Male      57 (68)     209 (64)      47 (60)            51 (72)   364 (65)
Female    27 (32)     118 (36)      31 (40)            20 (28)   196 (35)

(c) Frequencies of disorders in a sample of 560 language-impaired
individuals in the USA, cross-classified by sex (percentages in brackets
give relative frequencies of disorders within sex)

                      Phonological  Specific language  Impaired
          Stuttering  disability    disorder           hearing
Male      57 (10)     209 (37)      47 (8)             51 (9)
Female    27 (5)      118 (21)      31 (6)             20 (4)
Total     84 (15)     327 (58)      78 (14)            71 (13)
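The three sets of bracketed percentages in table 2.2 differ only in the total used as the divisor. The following Python sketch recomputes them from the raw frequencies of table 2.2; the function and variable names are our own and purely illustrative.

```python
# Frequencies taken from table 2.2. The dictionary layout and all names
# below are illustrative choices, not part of the original text.
DISORDERS = ["Stuttering", "Phonological disability",
             "Specific language disorder", "Impaired hearing"]
COUNTS = {
    "Male": [57, 209, 47, 51],
    "Female": [27, 118, 31, 20],
}


def percent(part, whole):
    """Relative frequency expressed as a whole-number percentage."""
    return round(100 * part / whole)


def row_percentages(counts):
    """Each frequency relative to the total for its sex (cf. table 2.2(a))."""
    return {sex: [percent(f, sum(row)) for f in row]
            for sex, row in counts.items()}


def column_percentages(counts):
    """Each frequency relative to the total for its disorder (cf. table 2.2(b))."""
    column_totals = [sum(column) for column in zip(*counts.values())]
    return {sex: [percent(f, t) for f, t in zip(row, column_totals)]
            for sex, row in counts.items()}


def total_percentages(counts):
    """Each frequency relative to the whole sample (cf. table 2.2(c))."""
    grand_total = sum(sum(row) for row in counts.values())
    return {sex: [percent(f, grand_total) for f in row]
            for sex, row in counts.items()}
```

For instance, `column_percentages(COUNTS)["Male"][0]` gives 68, the percentage of stutterers who are male, while `total_percentages(COUNTS)["Male"][3]` gives 9, the percentage of the whole sample who are male and hearing-impaired.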
Figure 2.1(a). Bar chart of frequencies of disorders in a sample of 364
language-impaired males in the USA (based on table 2.1).

Figure 2.2 shows how similar diagrams can be used to display the data of
table 2.2. Note that in constructing figure 2.2 we have used the
proportions relative to the total frequency: that is, we have divided the
original frequencies by 560. Whether or not this is the appropriate
procedure will depend on the point you wish to make and on how the data
were collected.

Figure 2.2. Relative frequencies of disorders in a sample of 560
language-impaired individuals, further classified by sex (based on table
2.2(c)).

Figure 2.1(b). Bar chart of relative frequencies of disorders in a sample
of 364 language-impaired males (based on table 2.1).

If the whole sample were collected without previously dividing the subjects
into male and female, figure 2.2 based on table 2.2(c) would correctly give
the proportions of a sample of individual patients who fall into different
categories, determined both by their sex and by the type of defect they
suffer. This would not be true if, say, males and females were recorded on
different registers which were sampled separately, since the numbers of
each sex in the sample would not then necessarily bear any relation to
their relative numbers overall. Figure 2.3 based on table 2.2(a), showing
the proportion of males who suffer a particular defect and, separately, the
proportion of females suffering the same defect, would be correct. This
would always be the more appropriate diagram for comparing the distribution
of various defects within the different sexes.

Figure 2.3. Relative frequencies of language disorders in samples of
language-impaired individuals, within sexes (based on table 2.2(a)).

2.2 Numerical data
The variables considered in the previous section were all classes or
categories, and the numbers we depicted arose by counting how often a
particular category occurred. It often happens that the variable we are
observing takes numerical values, for example, the number of letters in a
word or morphemes in an utterance, a student's score in a vocabulary test,
or the length of time between the release of a stop and the onset of
voicing (VOT), and so on.
If the number of different observed values of the variable is small, then
we can present it using the display methods of the previous section. In
table 2.4(a) are given the lengths, in morphemes, of 100 utterances
observed when an adult was speaking to a child aged 3 years. These have
been converted into a frequency table in table 2.4(b) and the corresponding
bar chart can be seen in figure 2.4. A major difference between this data
and the categorical data of the previous section is that the data have a
natural order and spacing. All the values between the minimum and maximum
actually observed are possible and provision must be made for all of them
on the bar chart, including any which do not actually appear in the data
set.
If the number of different values appearing in a data set is large, there
will be some difficulty in fitting them all onto a bar chart without
crushing them up and reducing the clarity of the diagram. Besides, unless
the number of observations is very large, many possible values will not
appear
Table 2.4(a). Lengths of 100 utterances (in morphemes) of an adult
addressing a child aged 3 years

7 IO 5 6 5 7 9 7 7
.J~ 1~i::~:
II 3 3 7 10 4 3 3 9 6
5 ~' 3~!,
6 8
5 8
4 ,.
7 4
8 4 8 3
7 IO 3 9 7 4 IO 4
6 6
5 '4

. I2
6 ,.3 8
9
10 8 7 4 7 7 10 5 3 7
3 2
5 10
14 9
5 3 8
IO 7
Figure 2.4. Bar chart for data in table 2.4(b). Lengths of 100 utterances
(in morphemes) of an adult addressing a child aged 3 years.


at all, causing frequent gaps in the chart, while other values will appear
rather infrequently in the data. Table 2.5(a) lists the total score of each
of 108 students taking the Cambridge Proficiency Examination (CPE) of June
1980 at a European centre. The marks range from a low of 117 to a maximum
of 255, giving a range of 255 - 117 = 138. The most frequent score, 184,
appears only five times and clearly it would be inappropriate to attempt to
construct a bar chart using the individual scores.

Table 2.5(a). Scores of 108 students in the June 1980 Cambridge
Proficiency Examination

194 184 '35 r6r r86 198 1:go 240 147 t74 ' 1 197
176 I8J "7 r6r r8s '186 {'208 '200 157 'i1~ ,i:gi
tW 192 '45 r62 r86 If8 241 I8f 201 'iOB 'Ii7
135 229 208 209 203 I-45 ~oi 204 192
'79 '79
ZZ4 209 1.79 023 I92 221 239 238 199 174 '45
226- 214 :iu ~15 176 nB t84 :z:n xg8 I96 I84
l6f :iog 142 I96 i6o 165 x66 224 z~g !84
I"f? I6J :io7 I79 197 120 255 ISO 23_3 ~88' i75
zzs rs6 2II I90 204 22.2 219 r86 x6o 189
>r8 '49 rtil> r88 224 140 ~20 149 170 197

The first step in summarising this data is to group the scores into around
ten classes (between eight and 15 is usually convenient and practical). The
first column of table 2.5(b) shows the classes decided on. The first will
contain all those marks in the interval 110-124, the second those lying in
the range 125-139, and so on.

Table 2.5(b). Frequency table of the scores of 108 students in the 1980
Cambridge Proficiency Examination

Class                                             Relative   Cumulative  Relative cumulative
interval  Tally                        Frequency  frequency  frequency   frequency
110-124   ||                            2         0.02         2         0.02
125-139   ||                            2         0.02         4         0.04
140-154   ||||/ ||||/ |                11         0.10        15         0.14
155-169   ||||/ ||||/ ||               12         0.11        27         0.25
170-184   ||||/ ||||/ ||||/ ||||       19         0.18        46         0.43
185-199   ||||/ ||||/ ||||/ ||||/ |||  23         0.21        69         0.64
200-214   ||||/ ||||/ ||||/ ||         17         0.16        86         0.80
215-229   ||||/ ||||/ ||||/            15         0.14       101         0.94
230-244   ||||/ |                       6         0.06       107         0.99
245-259   |                             1         0.01       108         1.00

Now we count the number of scores belonging to each class. The most
efficient, and accurate, method of doing so is to work through the list of
scores, crossing out each one in turn and noting it by a tally mark
opposite the corresponding class interval. Tallies are usually counted in
blocks of five, the fifth tally stroke being made diagonally to complete a
block, as in column 2 of the table. The number of tally marks is noted for
each class in column 3. These frequencies are then used to construct what
is referred to as a histogram of the data (figure 2.5). No gaps are left
between the rectangles, it being assumed by convention that each class will
contain all those scores greater than, or equal to, the lower bound of the
interval but less than the value of the lower bound of the next interval.
So, for example, a score of 125 will go into the tally for the class
125-139, as will a score of 139. Provided the class intervals are equal,
the height of each rectangle corresponds to the frequency of each class. As
in the case of the bar chart, relative frequencies may be used to construct
the histogram, this entailing only a change of scale on the vertical axis.
Great care has to be taken not to draw a misleading diagram when the class
intervals are not all of the same size - a good reason for choosing them to
be equal. However, in unusual cases it may not be appropriate to have equal
width intervals, or one may wish to draw a histogram based on data already
grouped into classes of unequal width (see exercise 2.4 and figure 5.2).

Figure 2.5. Histogram of the frequency data in table 2.5(b).
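The grouping convention just described (lower bound included, upper bound excluded) and the remaining columns of table 2.5(b) can be sketched in Python as follows. This is only an illustrative sketch: the class frequencies are those of table 2.5(b), while the function names are our own. The last function estimates the score below which a given proportion of the students fall, by straight-line interpolation between the relative cumulative frequencies plotted at the class boundaries; values read by eye from a hand-drawn curve will differ slightly.

```python
from itertools import accumulate

# Lower bounds of the ten class intervals (each 15 marks wide) and the class
# frequencies of table 2.5(b). All names here are illustrative.
LOWER_BOUNDS = [110, 125, 140, 155, 170, 185, 200, 215, 230, 245]
FREQUENCIES = [2, 2, 11, 12, 19, 23, 17, 15, 6, 1]
WIDTH = 15


def class_index(score, lower_bounds=LOWER_BOUNDS, width=WIDTH):
    """Index of the class holding `score`: lower bound included, upper excluded."""
    index = (score - lower_bounds[0]) // width
    if not 0 <= index < len(lower_bounds):
        raise ValueError(f"score {score} lies outside the class intervals")
    return index


def frequency_columns(frequencies=FREQUENCIES):
    """Rows of (frequency, relative, cumulative, relative cumulative)."""
    total = sum(frequencies)
    cumulative = accumulate(frequencies)
    return [(f, f / total, c, c / total) for f, c in zip(frequencies, cumulative)]


def percentile_score(p, lower_bounds=LOWER_BOUNDS, frequencies=FREQUENCIES,
                     width=WIDTH):
    """Estimate the score below which a proportion p (0 to 1) of scores fall.

    Each relative cumulative frequency is placed above the start of the next
    class interval and successive points are joined by straight lines.
    """
    total = sum(frequencies)
    xs = lower_bounds + [lower_bounds[-1] + width]
    ys = [0.0] + [c / total for c in accumulate(frequencies)]
    for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
        if y0 <= p <= y1 and y1 > y0:
            return x0 + (p - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("p must lie between 0 and 1")
```

With these figures `percentile_score(0.5)` comes out at a little over 190 and `percentile_score(0.1)` at a little over 149.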

The fifth column of table 2.5(b) contains the cumulative frequencies for
the scores of the 108 students. These cumulative frequencies can be
interpreted as follows: two students scored less than 125 marks, 27 scored
less than 170 marks and so on. The relative cumulative frequencies of
column 6, obtained by dividing each cumulative frequency in column 5 by the
total frequency, 108, are usually more convenient, being easily translated
into statements such as 25% of the students scored less than 170 marks,
while 20% scored at least 215 marks, and so on.

Figure 2.6. Relative cumulative frequency curve for data in table 2.5(b).

We can also make use of this table to answer questions such as 'What is the
mark that cuts off the lowest (or highest) scoring 10% of students from the
remainder?' Probably the easiest way to do this is via the relative
cumulative frequency curve of figure 2.6. To draw this curve we plot each
relative cumulative frequency against the value at the beginning of the
next class interval, i.e. plot 0.02 vertically above 125, 0.43 vertically
above 185, and so on. Then join up the points with a series of short,
straight lines connecting successive points. If we now find the position on
the vertical axis corresponding to 10% (0.1), draw a horizontal line till
it reaches the curve and then drop vertically onto the X-axis, the
corresponding score, 151, is an estimate (see chapter 7) of the score which
10% of students fail to reach. This score is called the tenth percentile
score. Any other percentile can be obtained in the same way.
Certain percentiles have a historical importance and special names. The
25th and 75th percentiles are known respectively as the first quartile and
the third quartile (being one-quarter and three-quarters of the way up
through the scores) while the 50th percentile, 'halfway' through the data,
is called the median score. These values are discussed further in the next
chapter. In this example, the median is about 190.

2.3 Multi-way tables
It is not uncommon for each observation in a linguistic study to be
cross-classified by several factors. Until now we have looked only at
one-way classification (e.g. table 2.1) and two-way classification (e.g.
table 2.2 - in this latter table the two classifying factors are sex on the
one hand and type of language disorder on the other). When subjects, or any
other kind of experimental units, are classified in several different ways
it is still possible, and helpful, to tabulate the data.
Khan (forthcoming) carried out a study of a number of phonological
variables in the spoken English of 44 subjects in Aligarh city in North
India. The subjects were classified by sex, age and social class. There
were three age groups (16-30, 31-45, over 45) and three social classes
(LM - Lower middle, UM - Upper middle, and U - Upper). Table 2.6, an
example of a three-way table, shows the distribution of the sample of
subjects over these categories.

Table 2.6. 44 Indian subjects cross-classified by sex, age, and class

                            Sex
          Male                          Female
          Class                         Class
Age       LM   UM   U    Total   Age       LM   UM   U    Total
Over 45   -
Total     10   10   8    28      Total     6    6    4    16

Khan studied several phonological variables relevant to Indian English,
including subjects' pronunciation of /d/. The subjects' speech was recorded
in four different situations and Khan recognised three different variants
of /d/ in her subjects' speech: d-1, an alveolar variant very similar to
English pronunciation of /d/; d-2, a post-alveolar variant of /d/; and d-3,
a retroflex variant which is perceived as corresponding to a 'heavy Indian
accent'. This type of study produces rather complex data since there were
12 scores for each subject: a score for each of the three variants in each
of the four situations. One way to deal with this (others are suggested in
chapters 14 and 15) is to convert the three frequencies into a single
score, which in this case would be an index of how 'Indian-like', on
average, a subject's pronunciations of /d/ are. A method suggested by Labov
(1966) is used to produce the data presented in table 2.7 (details of the
method, which need not concern us here, are to be found in 15.1). Table 2.8
gives the average phonological indices for each sex by class combination.

Table 2.7. Phonological indices for /d/ of 44 Indian speakers of English

Male                        Female
Class                       Class
Age LM UM U                 Age LM UM U
478 so.o 459 s6.9 548 57'
16-30 494 484 51.6 43' 487 52.8
485 493 533 47-2
489 sz.o 467 448
578 509 66.7 51.2- 46-s 49-4
3 1-45 51.6 31-45
594 57-8 62,1 479 433
sz.6 56.4 58.6
Over 45 44-0 6t.J 6I.J
:over4S No data
596 57' 594
5Z,I 51.6 647

Table 2.8. Average phonological index for /d/ of 44 Indian speakers of
English cross-classified by sex and social class

                            Sex
Class          Male        Female      Both sexes
Lower middle   51.9 (10)   52.2 (6)    52.0 (16)
Upper middle   53.5 (10)   48.3 (6)    51.6 (16)
Upper          57.5 (8)    50.7 (4)    55.2 (12)
All classes    54.1 (28)   50.4 (16)   52.7 (44)

Note: The figures in brackets give the number of subjects whose scores
contributed to the corresponding average score.

2.4 Special cases
Although it will usually be possible to display data using one of the basic
procedures described above, you should remain always alive to the
possibility that rather special situations may arise where you may need to
modify or extend one of those methods. You may feel that unusual types of
data require a rather special form of presentation. Remember always that
the major purpose of the table or graph is to communicate the data more
easily without distorting its general import.
Hughes (1979) tape-recorded all the English spoken to and by an adult
Spanish learner of English over a period of six months from the time that
she began to learn the language, solely through conversation with the
investigator. Hughes studied the frequency and accuracy with which the
learner produced a number of features of English, one of which was the
possessor-possessed ordering in structures like Marta's book, which is
different from that of its equivalent in Spanish, el libro de Marta. The
learner produced both sequences with English order, like Marta book,
Marta's book, and sequences which seemed to reflect possessor-possessed
order in Spanish like book Marta. The frequency of such structures during
the first 45 hours of learning, and their accuracy, is displayed in figure
2.7, reproduced from the original. The figure contains information about
both spontaneous phrases, initiated by the learner, and imitations. How was
it constructed? First, the number of relevant items per hour is indicated
by the vertical scale at the right and represented by empty bars
(imitations) and solid bars (spontaneous phrases). Second, the percentage
of possessor-possessed noun phrases correct, over time, is read from the
vertical scale at the left of the graph and represented by a continuous
line (spontaneous phrases) and a dotted line (imitations - they were all
correct in this example). It is clear that the learner is successful with
imitations from the beginning, but much slower in achieving the accurate
order in her spontaneous speech. It should be clear from earlier examples
how the bars representing frequency are constructed. But how was accuracy
defined and represented on this graph?
Hughes first scored all attempts, in order of occurrence, as correct (O) or
incorrect (X) in terms of the order of the two elements (imitations and
non-imitations being scored separately).

Response no.   1  2  3  4  5  6  7  8  9  10  11  12
               X  X  X  X  X  X  X  O  O  O   X   O

In order not to lose the benefit of the completeness of the record, the
data were treated as a series of overlapping samples. Percentages of
successful attempts at possessor-possessed noun phrases in every set of ten
successive examples were calculated. From examples 1-12 the following
percentages are derived:

Correct in responses  1-10 = 30%
                      2-11 = 30%
                      3-12 = 40%
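The overlapping-sample calculation can be sketched in a few lines of Python. The sketch below is illustrative only: the function name and the use of a string of O and X marks are our own choices, and the twelve marks are those of the response sequence given above.

```python
def moving_percentages(responses, window=10):
    """Percentage of correct ('O') responses in each run of `window`
    successive responses, moving along one response at a time."""
    return [
        100 * responses[start:start + window].count("O") / window
        for start in range(len(responses) - window + 1)
    ]


# The twelve scored responses: O = correct order, X = incorrect order.
RESPONSES = "XXXXXXXOOOXO"
```

Twelve responses yield three overlapping sets of ten (responses 1-10, 2-11 and 3-12), so `moving_percentages(RESPONSES)` returns three values.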
Each percentage was then assigned to the mid-point of the series of
responses to which they referred. Thus 30% was considered as the level of
accuracy in the period between responses numbered 5 and 6, still 30% in the
period between responses 6 and 7, and so on. The average level of accuracy
for each hour was then calculated (this kind of average is usually called a
moving average, since the sample on which it is based 'moves' through
time), and this is what is shown in figure 2.7, the points joined by an
unbroken line for non-imitations, and by a broken line for imitations.
These lines give an indication of the way in which the ability of the
speaker to express the possessor-possessed ordering correctly was
progressing through time.

Figure 2.7. Possessor-possessed noun phrases in the speech of a Spanish
learner of English (Hughes 1979).

We have certainly not exhausted the possibilities for tabular and graphical
presentation of data in this short chapter, though you should find that one
of the options proposed here will suffice for many data sets. It is worth
reiterating that the purpose of such simple summaries is to promote a rapid
understanding of the main features of the data, and it goes without saying
that no summary method should be used if it obscures the argument or covers
serious deficiencies in the data.

SUMMARY
This chapter has shown a number of ways in which data can be summarised in
tables and diagrams.
(1) A distinction was made between categorical and numerical data.
(2) The notion of relative frequency, in terms of proportions or
    percentages, was introduced.
(3) Advice was given on the construction of:
    (a) tables: one-way, two-way, and multi-way
    (b) diagrams: bar charts, histograms, and cumulative frequency curves
(4) Percentiles, quartiles and median scores were introduced.
(5) It was pointed out that there may be occasions when unusual data will
    necessitate modification of the basic procedures.
(6) It was emphasised that the function of tables and diagrams is to
    present information in a readily assimilable form and without
    distortion. In particular, it was urged that proportions and
    percentages should always be accompanied by an indication of original
    frequencies.

EXERCISES
(1) In the Quirk Report (Quirk 1972) the following figures are given as
    estimates of the number of children in different categories who might
    require speech therapy. Construct a table of relative frequencies for
    these categories.
    (i)   Pre-school age children with speech and language
          problems                                              60 000
    (ii)  Children in ordinary schools with speech and language
          problems                                             180 000
    (iii) Physically and/or mentally handicapped children       42 800
    (iv)  Special groups (language disorders, autism, etc.)      5 000
(2) Assume that the male/female split in the categories above is as
    follows:
    (i)   40 000/20 000
    (ii)  100 000/80 000
    (iii) 35 000/7 800
    (iv)  3 500/1 500
    (a) Construct a two-way table of frequencies and relative frequencies,
        cross-classified by sex (cf. table 2.2(c)).
    (b) Draw a bar chart of the frequencies in each category.
    (c) Draw a bar chart of the relative frequencies for each category.
(3) The table below gives the scores of 93 students in a test of English
    proficiency.

183 206 ISO 164 t62 zz6 189
2 33 '59 '49
tile t88 t66
205 200 t,f6 190 236 '55 io3
165 237 172 '52 t8o '4' 140 t8I 194 2d8
J8q 173 225 ,s 5 191 t68 t6"i 209 205 ,66
54 t
lJ3 171 177 193 'IJl
Jt, t67
!52 i:u Il4
56 149 t87 178 q;:6 '73 149
l:J:i !81 llt) 3 159 '53 1(18 207 205 198
    (a) Provide a frequency table (cf. table 2.5(b)).
    (b) Provide a histogram of the frequency data in the table.
    (c) Provide a relative cumulative frequency curve for the data.
    (d) What are the scores corresponding to the tenth percentile, the
        first quartile, and the third quartile?
(4) Suppose that the scores of 165 subjects in a language test are reported
    in the form of the following frequency table. Construct the
    corresponding histogram (note that the class intervals are not all the
    same size).

Class interval
of scores       50-79  80-89  90-94  95-99  100-104  105-109  110-119  120-139
Number of
subjects        16     20     26     28     22       19       23       11

3 Summary measures

We have seen that diagrams can be helpful as a means for the presentation
of a summary version of a collection of data. Often, however, we will find
it convenient to be able to talk succinctly about a set of numbers, and the
message carried by a graph may be difficult to put into words. Moreover, we
may wish to compare various sets of data, to look for important
similarities or differences. Graphs may or may not be helpful in this
respect; it depends on the specific question we want to answer and on how
clearly the answer displays itself in the data. For example, if we compare
figure 3.1 (derived from data in table 3.1) with figure 2.4 we can see
immediately that the lengths of the utterances of a mother speaking to an
18-month-old child tend to be rather shorter than those in the speech of
the same woman speaking to a child aged 3 years.

Figure 3.1. Bar chart of lengths of 100 utterances of a mother speaking to
a child aged 18 months (data in table 3.1).
Table 3.1. Frequency table of the lengths of 100 utterances (in morphemes)
of an adult addressing a child aged 18 months

Length of utterance   Number of utterances
zB
22
'4
6 II
1 -
9 0
IO 0
'4 0
IS 0
z6 0
I1 0

However, it is quite rare for the situation to be so clear. In figure
3.2(a) we have drawn the histogram of the data of exercise 2.2 (reproduced
in table 3.2), which consists of the total scores of 93 students at a Latin
American centre in the June 1980 Cambridge Proficiency in English
Examination. In figure 3.2(b) we have repeated the histogram of the scores
of the European students already discussed in the previous chapter. The two
histograms are rather alike but there are some dissimilarities. Can we make
any precise statements about such dissimilarities in the overall level of
performance of the two groups of students?

Table 3.2. Scores, ranked in ascending order, obtained by 93 candidates at
a Latin American centre in the Cambridge Proficiency Examination

II4 J2.1 u6 I29 I32 I36
'33 '33 '35 '4"
'4' I# I49 I49 '49 ISO I 52 I 52 'S3
'55 zs6 zs6 zs6 '59 '59 I6I z62 I~2 I64
z66 z66
z6s z65 t68
I70 171
I72 '73 '73
'73 7 '77
z8o z8o
z8o J8J IBI I8J 3 185 J86 ,s, z
IB7 !89 I9d '9' '9' '93 '94 '97 uj8
z98 200 203
208 208 209
237 254 265

Figure 3.2(a). Histogram of CPE scores of 93 Latin American candidates
(data in exercise 2.2).

Figure 3.2(b). Histogram of CPE scores of 108 European candidates (data in
table 2.5).

3.1 The median
We have drawn the cumulative relative frequency curve for the Latin
American group in figure 3.3(a) and again repeated that for the European
students in figure 3.3(b). In both diagrams we have marked
in the median, or 50th percentile (introduced in chapter 2). Remember that
this is the score which divides each set of scores into two nearly equal
subgroups; one of these contains all the scores less than the median, the
other all those greater than the median. We see that the median score for
the Latin American students, about 174, is somewhat lower than that for the
Europeans, about 190, and we might use this as the basis for a statement
that, in June 1980, a Latin American student at the centre of the ability
range for his group scored a little less well than the corresponding
European. It is clear from the histograms that there are many students in
both sets who obtained scores close to the relevant median. This, together
with the central position of each median in its own data set, gives rise to
the idea of a representative score for the group, expressed in phrases such
as 'a typical student' or 'utterances of average length'.
The median provides a way to specify a 'typical' or 'average' value which
can be used as a descriptor for a whole group. Although some Latin
Americans scored much higher marks than many Europeans (indeed the highest
score from the two groups was from a Latin American), we nevertheless can
feel that we know something meaningful about the relative performance of
each group as a whole when we have obtained the two median values.

Figure 3.3(a). Relative cumulative frequency curve for CPE scores of 93
Latin American subjects (data in exercise 2.2).

Figure 3.3(b). Relative cumulative frequency curve for CPE scores of 108
European subjects (data in table 2.5).

3.2 The arithmetic mean
The median is only one type of average and there are several others which
can be extracted from any set of numbers. Perhaps the most familiar and
widely used of these - many of us have learned to call it the average - is
the arithmetic mean, or just mean, which is calculated for any set of
numerical values simply by adding them together and dividing by the number
of values in the set. The mean score for the Latin American students (from
table 3.2) is given by:

    (114 + 121 + ... + 265) = 16 517
    16 517 ÷ 93 = 177.6

The mean length of the utterances (in morphemes) of a mother speaking to a
child aged 3 years (from table 2.4(a)) is:

    (2 + 7 + ... + 4) = 635
    635 ÷ 100 = 6.35

This is a convenient point for an introduction to the kind of simple
algebraic notation which we will use throughout. Although we will require
little knowledge of mathematics other than simple arithmetic, the use of
some basic mathematical notation is, in the end, an aid to clear argument
and results in a great saving in space.
When we wish to refer to a general data set, without specifying particular
numerical values, we will designate each number in the set by means of a
letter and a suffix. For example, $X_{26}$ will just mean the 26th value in
some set of data. The suffixes do not imply an ordering in the values of
the numbers; it will not be assumed that $X_1$ is greater (or smaller) than
$X_{100}$, but only that $X_1$ is the first and $X_{100}$ the 100th in some
list of numbers. For example, in table 2.4(a) the first utterance length
recorded
is 2 morphemes while the last utterance length recorded is 4 morphemes: thus $X_1 = 2$, $X_{100} = 4$.
If we wish to refer to different data sets in the same context we use different letters. For instance, $X_1, X_2, \dots, X_{28}$ and $Y_1, Y_2, \dots, Y_{28}$ would be labels for two different lists each containing 28 numbers. More generally we write $X_1, X_2, \dots, X_n$ to mean a list of an indefinite number ($n$) of numerical values, and we write $X_i$ to refer to an unspecified member of the set.
The arithmetic mean of the data set $X_1, X_2, \dots, X_n$ is usually designated as $\bar{X}$ and defined as:

$$\begin{aligned}
\bar{X} &= (\text{sum of the values}) \div (\text{number of values}) \\
&= (X_1 + X_2 + \dots + X_n) \div n \\
&= \frac{1}{n}(X_1 + X_2 + \dots + X_n) \\
&= \frac{1}{n}\sum_{i=1}^{n} X_i
\end{aligned}$$

where the symbol $\sum$ means 'add up all the Xs indicated by different values of the suffix $i$'. This may be simplified further and written:

$$\bar{X} = \frac{1}{n}\sum X_i$$

3.3 The mean and the median compared
We now have two possible measures of 'average' or 'typical', the mean and the median. They need not have the same value. For example, the 93 Latin American students have a mean score of 177.6, while the median extracted from figure 3.3(a) has a value of 174. Moreover, although the median score for the group of Latin American students happens to be similar to the mean score, this need not be the case, as we discuss below. Why should there be more than one average, and how can we reconcile differences between their values?
In order to be able to discuss the different properties of the median and the mean, we need to know how the former can be calculated rather than obtained from the cumulative frequency curve, although the graphical method is usually quite accurate enough for most purposes. In table 3.2 we have written the scores of the 93 Latin American students in ascending order, from the lowest to the highest mark. The 47th mark in this ranked list, 174, has been circled; 46 students scored less than this and 46 scored more. Hence 174 is the median score of the group. You may like to try the same procedure for the European students. There are 108 of these and you should find that when ranked in ascending order the 54th score is 189 and the 55th 190. Thus there is no mark which exactly divides the group into two equal halves. Conventionally, in such a case, we take the average of the middle pair, so that here we would say that the median score is 189.5. The same convention is adopted even when the middle pair have the same value. The utterance lengths of table 2.4(a) are written in rank order in table 3.3. Of the 100 lengths, the 50th and 51st, the middle pair, both have the value 6, and the median will therefore be: $\frac{1}{2}(6 + 6) = 6$.

Table 3.3. Lengths of utterances, ranked in ascending order, of adult speaking to child aged 3 years

 3  3  3  3  3  3  3
 3  3  3  3  3  4  4  4  4  4
 4  4  4  4  4  5  5  5  5  5
 5  5  5  5  6  6  6  6  6  6
 6  6  6  7  7  7  7  7  7  7
 7  7  7  7  7  8  8  8  8
 8  8  8  9  9  9  9  9  9
 9  9 10 10 10 10 10 10 10 10
10 10 11 12 12 12 14 14 15 17

Although the mean and the median are both measures for locating the 'centre' of a data set, they can sometimes give rather different values. It is helpful to investigate what features of the data might cause this, to help us decide when one of the values, mean or median, might be more appropriate than the other as an indicator of a typical value for the data. This is best done by means of a very simple example. Consider the set of five numbers 3, 7, 8, 9, 13. The median value is 8 and the mean is 8 ($\bar{X} = \frac{1}{5}(3 + 7 + 8 + 9 + 13) = 8$). In this case the two measures give exactly the same result. Note that the numbers are distributed exactly symmetrically about this central value: 3 and 13 are both the same distance from 8, as are 7 and 9; the mean and the median will always have the same value when the data has this kind of symmetry. If we do the same for the values 3, 7, 8, 9, 83, we find that the median is still 8, but the mean is now 22. In other words, the presence of one extreme value has increased the mean dramatically, causing it to be rather different in value from the more typical group of small values (3, 7, 8, 9), falling between them and the very large extreme value. The median, on the other hand,
is quite unaffected by the presence of a single, unusually large number and retains the value it had previously. One would say that the median is a robust indicator of the more typical values of the data, being unaffected by the occasional atypical value which might occur at one or other extreme.
When a set of data contains one or more observations which seem to have quite different values from the great bulk of the observations it may be worthwhile treating them separately by excluding the unusual value(s) before calculating the mean. For example, in this very simple case we could say that the data consists of four observations with mean 6.75 (3, 7, 8, 9) together with one very large value, 83. This is not misleading provided that the complete picture is honestly reported. Indeed, it may be a preferable way to report if there is some indication, other than just its value, that the observation is somehow unusual.
When the data values are symmetrically distributed, as in the previous example, then the mean and median will be equal. On the other hand, the difference in value between the mean and median will be quite marked if there is substantial skewness or lack of symmetry in a set of data; see figures 3.4(a) and 3.4(b). A data set is said to be skewed if the highest point of its histogram, or bar chart, is not in the centre, hence causing one of the 'tails' of the diagram to be longer than the other. The skewness is defined to be in the direction of the longer tail, so that 3.4(a) shows a histogram 'skewed to the right' and 3.4(b) is a picture of a data set which is 'skewed to the left'.

Figure 3.4(a). Histogram skewed to the right.
Figure 3.4(b). Histogram skewed to the left.

We have seen that the median has the property of robustness in the presence of an unusually extreme value and the mean does not. Nevertheless, the mean is the more commonly used for the 'centre' of a set of data. Why is this? The more important reasons will become apparent in later chapters, but we can see already that the mean can claim an advantage on grounds of convenience or ease of calculation for straightforward sets of data. In order to calculate the median we have to rank the data. At the very least we will need to group them to draw a cumulative frequency curve to use the graphical method. This can be very tedious indeed for a large data set. For the mean we need only add up the values, in whatever order. Of course, if the data are to be processed by computer both measures can be obtained easily.
However, there are situations when both the mean and the median can be quite misleading, and these are indicated by the diagrams in figures 3.5(a) and 3.5(b). In case (a) it is clear that the data are dominated by one particular value, easily the most typical value of the set. The most frequent value is called the modal value or mode. The median (3) is equal to the mode, while the mean (3.4) is larger. More typically, unless the mode is in the centre of the chart, the mean and the median will both be different from the mode, and from each other, but the median will be closer to the mode than the mean will be. We might feel that the mode is the best indicator of the typical value for this kind of data. Figure 3.5(b) shows a rather different case. Because the histogram is symmetrical, the mean and median are equal, but either is a silly choice as a 'typical value'. However, neither is there a single mode in this case.
These last two examples stand as a warning. For reasons which will later become clear, it becomes a habit to assume that, unless otherwise indicated, a data set is roughly symmetrical and bell-shaped. Important departures from this should always be clearly indicated, either by means of a histogram or similar graph or by a specific descriptive label such as 'U-shaped', 'J-shaped', 'bi-modal', 'highly skewed', and so on.

Figure 3.5(a). A hypothetical bar chart with a pronounced mode.

Table 3.4. Pronunciation of words ending '-ing' by 10 Middle working class speakers

                    (a) Interview                        (b) Wordlist
           Number of     Number of             Number of     Number of
Subject    [n] endings   tokens        %       [n] endings   tokens        %
 1              10           44       22.7          36          200       18.0
 2              87          193       45.1          48          200       24.0
 3             111          216       51.4          64          200       32.0
 4              55          103       53.4          56          200       28.0
 5             145          241       60.2          70          200       35.0
 6             116          183       63.4          61          200       30.5
 7              77          194       39.7          62          200       31.0
 8             126          218       57.8          53          200       26.5
 9             109          223       48.9          77          200       38.5
10               6           32       18.8          42          200       21.0
Total          842         1647                    569         2000
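The five-number example of section 3.3, and the case of a bar chart dominated by one value, can be checked with a short sketch. It assumes Python's standard `statistics` module; the list `dominated` is a hypothetical data set, chosen only so that its mode and median are 3 and its mean is 3.4, like the chart described in the text.

```python
from statistics import mean, median, mode

symmetric = [3, 7, 8, 9, 13]     # symmetrical about 8
with_extreme = [3, 7, 8, 9, 83]  # one unusually large value

# Symmetrical data: mean and median agree (both 8).
print(mean(symmetric), median(symmetric))

# One extreme value drags the mean up to 22; the median stays at 8.
print(mean(with_extreme), median(with_extreme))

# Reporting the unusual value separately: the remaining four have mean 6.75.
bulk = [x for x in with_extreme if x != 83]
print(mean(bulk))

# Hypothetical data dominated by a single value (an assumption for illustration).
dominated = [1, 2, 3, 3, 3, 3, 3, 3, 4, 9]
print(mode(dominated), median(dominated), mean(dominated))
```

The median only looks at the middle of the ranked list, which is why replacing 13 by 83 leaves it untouched while the mean moves.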
Figure 3.5(b). A symmetrical U-shaped histogram.

3.4 Means of proportions and percentages
It happens frequently that data are presented as percentages or proportions. For example, suppose that the problem of interest were the distribution of a linguistic variable in relation to some external variable such as social class. To simplify matters we will consider the distribution of the variable (say, the proportion of times [n] is used word-finally in words like winning, running, as opposed to [ŋ]) in one social group, which we will call Middle working class (MWC). Let us suppose that we have two conditions for collecting the data: (a) a standard interview, and (b) a wordlist of 200 items. Ten different subjects are examined under each condition, and table 3.4 shows the number of times a verb with progressive ending was pronounced with final [n], expressed as a fraction of the total number of verbs with progressive endings used by the subject in the interview. If we calculate for each subject a percentage of [n]-final forms in the interview condition, we obtain the values in the fourth column of the table. If we take the mean of the percentages in that column, we obtain a value of 46.14%. However, on looking at the data we can see that the eight subjects who provide the great bulk of the data (1571 tokens) have percentages which are close to or much higher than 46.14. The two subjects whose percentages are low are responsible for only 76 tokens, yet their inclusion in the sample mean as calculated above greatly reduces its numerical value. It is generally good practice to avoid, as far as possible, collecting data in such a way that we have less information about some subjects in the study than we have about others. It will not always be possible to achieve this, certainly not where the experimental material consists of segments of spontaneous speech or writing. The use of wordlists can remove the problem, since then each subject will provide a similar number of tokens. However, it is noteworthy that the two experimental conditions have given quite different results for many of the subjects; some of them give only half the proportion of [n] endings in the wordlist that they express in spontaneous speech.
At this point we should note that it may sometimes be inappropriate to take simple averages of percentages and proportions. Suppose that the data of table 3.4 for the interview condition (a) were not observed on ten different subjects but rather were the result of analysing ten different speech samples from a single individual for whom we wish to measure the percentage of [n] endings. Over the complete experiment this subject would have provided a total of 1,647 tokens, of which 842 were pronounced
[n]. This gives a percentage of $(842/1647) \times 100 = 51.12\%$. By adding all the tokens in this way we have reduced the effect of the two situations in which both the number of tokens and the proportion of [n] endings were untypically low. It will not always be obvious whether this is a better answer than the value obtained by averaging the ten individual percentages. You should think carefully about the meaning of the data and have a clear understanding of what it is you are trying to measure. Of course, if the base number of tokens is constant it will not matter which method of calculation is used; both methods give the same answer. For the wordlist condition (b) of table 3.4 the average of the ten percentages is 28.45% and 569 is exactly 28.45% of 2,000.
Suppose an examination consists of two parts, one oral, scored out of 10 by an observer, the other a written, multiple choice paper with 50 items. The examination score for each subject will consist of one mark out of 50 and another out of 10. For example, a subject may score 20/50 (40%) in the written paper and 8/10 (80%) in the oral test. How should his overall percentage be calculated? The crux of the matter here is the weight that the examiner wishes to give to each part of the test. If the oral is to have equal weight with the written test then the overall score would be $\frac{1}{2}(40\% + 80\%) = 60\%$. However, if the scoring system has been chosen to reflect directly the importance of each test in the complete examination, the written paper should be five times as important as the oral. The tester might calculate the overall score by the second method above:

$$\frac{20 + 8}{50 + 10} = \frac{28}{60} = 46.7\%$$

This latter score is frequently referred to as a weighted mean score since the scores for the individual parts of the examination are no longer given equal weights. It may be calculated thus:

$$\frac{50 \times 40\% + 10 \times 80\%}{60} = 46.7\%$$

where each percentage is multiplied by the required 'weight'. If the scores for the individual parts were given equal weights the mean score would be 60%, as above. It is important to be clear about the meaning of the two methods and to understand that they will often lead to different results.
It sometimes occurs that published data are presented only as proportions or percentages without the original values being given. This is a serious error in presentation since, as we will see in later chapters, it often prevents analysis of the data by another researcher. It may be difficult to interpret a percentage or a mean percentage when it is not clear what was the base quantity (i.e. total number of observations) over which the original proportion was measured, especially when the raw (i.e. original) values on which the percentages were based are not quoted.

3.5 Variability or dispersion
We introduced the mean and the median as measures which would be useful for the comparison of data sets, and it is certainly true that if we calculate the means (or medians) of two sets of scores, say, and find them to be very different, then we will have made a significant discovery. If, on the other hand, the two means are similar in value, or the medians are, this will not usually be sufficient evidence for the statement that the complete sets of scores have a similar overall shape or structure. Consider the following extreme, but artificial, example. Suppose one group of 50 subjects have all scored exactly the same mark, 35 out of 50 say, in some test, while in another group of the same size, 25 score 50 and 25 score 20. In each case the mean and median mark would be 35. However, there is a clear difference between the groups. The first is very homogeneous, containing people all of an equal, reasonably high level of ability, while the second consists of one subgroup of extremely high-scoring and another of rather low-scoring subjects. The situation will rarely be as obvious as this, but the example makes it clear that in order to make meaningful comparisons between groups it will be necessary to have some measure of how the scores in each group relate to their 'typical value' as determined by the mean or median or some other average value. It is therefore usual to measure (and report) the degree to which a group lacks homogeneity, i.e. the variability, or dispersion, of the scores about the average value.
Furthermore, we have argued in the opening chapter that it is variability between subjects or between linguistic tokens from the same subject which makes it necessary to use statistical techniques in the analysis of language data. The extent of variability, or heterogeneity if you like, in a population is the main determinant of how well we can generalise to that population the characteristics we observe in a sample. The more variability, the greater will be the sample size required to obtain a given quality of information (see chapter 7). It is essential to have some way of measuring variability.

3.6 Central intervals
We would not expect to find in practice either of the outcomes described in the previous section. For example, when we administer a test to a group of subjects we usually expect their marks to vary over a more or less wide range. Table 3.5 gives two plausible frequency tables of marks for two different groups of 200 subjects. The corresponding histograms and cumulative frequency curves are given in figures 3.6 and 3.7. We can see from the histograms that the marks of the second group are more 'spread out' and from the cumulative frequency curves that the median mark is the same for both groups (i.e. 42).

Table 3.5. Frequency tables of marks for two groups of 200 subjects

                          Group 1                               Group 2
Class                                  Relative                              Relative
intervals              Cumulative      cumulative            Cumulative      cumulative
(marks)    Frequency   frequency       frequency  Frequency  frequency       frequency
10-14          1           1             0.01         6          14            0.07
15-19          1           2             0.01         8          22            0.11
20-24          6           8             0.04         6          28            0.14
25-29         ...         ...            ...         ...         ...           ...
30-34         ...         ...            ...         ...         ...           ...
35-39         36          68             0.34        21          78            0.39
40-44         62         130             0.65        43         121            0.61
45-49         38         168             0.84        23         144            0.72
50-54         17         185             0.93        16         160            0.80
55-59          8         193             0.97         9         169            0.85
60-64          5         198             0.99         8         177            0.89
65-69          2         200             1.00         4         181            0.91
70-74                                                 6         187            0.94
75-79                                                 4         191            0.96
80-84                                                 4         195            0.98
85-89                                                 3         198            0.99
90-94                                                 2         200            1.00

Figure 3.6(a). Histogram for data of Group 1 from table 3.5.
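The artificial example of section 3.5 is easy to verify before turning to numerical measures of spread. The sketch below assumes Python's standard `statistics` module, and the variable names are illustrative; it builds the two groups of 50 marks and shows that the mean and median agree while the marks themselves are distributed very differently.

```python
from statistics import mean, median

group_one = [35] * 50              # every subject scores 35 out of 50
group_two = [50] * 25 + [20] * 25  # 25 subjects score 50 and 25 score 20

for label, marks in (("first group", group_one), ("second group", group_two)):
    # The mean and the median are 35 in both groups ...
    centre = (mean(marks), median(marks))
    # ... but the marks actually observed are quite different.
    observed = sorted(set(marks))
    print(label, centre, observed)
```

Neither average distinguishes the two groups, which is why a separate measure of variability is needed.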
Figure 3.6(b). Histogram for data of Group 2 from table 3.5.

How can we indicate numerically the difference in the spread of marks? One way is by means of intervals centred on the median value and containing a stated percentage of the observed data values. For example, $q_1$ and $q_3$, the first and third quartiles, have been marked on both cumulative frequency curves.¹ This means that generally half of all the observed values will lie between $q_1$ and $q_3$. We write $(q_1, q_3)$ to represent all the possible values between $q_1$ and $q_3$ and we would say that $(q_1, q_3)$ is a 50% central interval; 'central' because it is centred on the median, which itself is the centre point of the data set, and '50%' because it contains 50% of the values in the data set. The length of the interval, $q_3$ minus $q_1$, is called the interquartile distance (sometimes, the interquartile range).

¹ These were defined in chapter 2. One quarter of the observed values are smaller than the first quartile, while three-quarters are smaller than the third quartile. The first quartile, $q_1$, is often referred to as the 25th percentile, $P_{25}$, because 25% of the data have a value smaller than $q_1$. Similarly, $q_3$ may be called the 75th percentile, $P_{75}$.

In the example above, figure 3.7(a) shows a 50% central interval (37.5, 47.5) while for figure 3.7(b) the interval is (32.5, 51.5). The interquartile distance gives a measure of how widely dispersed are the data. If at least half the values are very close to the median, the quartiles will be close together; if the data do not group closely around the median, the quartiles will be further apart. For the two cases shown in figures 3.7(a) and 3.7(b), the interquartile distances are 10 and 19 respectively, reflecting the wider
dispersion of the values in Group 2. The quartiles do not have any special claim to exclusive use for defining a central interval. Indeed, as we shall see in later chapters, central intervals containing a higher proportion of the possible values are much more commonly used.

Figure 3.7(a). Relative cumulative frequency curve for data of Group 1 from table 3.5. (Horizontal axis: score, with $q_1$, the median $m$ and $q_3$ marked.)
Figure 3.7(b). Relative cumulative frequency curve for data of Group 2 from table 3.5. (Horizontal axis: score, with $q_1$, the median $m$ and $q_3$ marked.)

3.7 The variance and the standard deviation
Before the use of computers to manage experimental data it was common to reject the use of the interquartile distance as a measure of dispersion because of the need to rank the data values to obtain the median and quartiles. With the advent of easy-to-use computer packages like MINITAB (see Appendix B) this is no longer a constraint, and the quartiles and percentiles can provide a useful, descriptive measure of variability. However, classical statistical theory is not based on percentiles, and other measures of dispersion have been developed whose theoretical properties are better known and more convenient. The first of these that we will discuss is the variance.

Table 3.6. Calculation of the variance of number of words per sentence in a short reading text

Number of words
per sentence ($X_i$)    $d_i$ ($= X_i - \bar{X}$)    $d_i^2$
        3                    -6.75                   45.56
        7                    -2.75                    7.56
        4                    -5.75                   33.06
        9                    -0.75                    0.56
       12                     2.25                    5.06
        2                    -7.75                   60.06
       14                     4.25                   18.06
       14                     4.25                   18.06
        6                    -3.75                   14.06
        9                    -0.75                    0.56
       11                     1.25                    1.56
       17                     7.25                   52.56
       18                     8.25                   68.06
       12                     2.25                    5.06
       10                     0.25                    0.06
        8                    -1.75                    3.06

$\sum X = 156$        $\sum d_i = 0$        $\sum d_i^2 = 332.96$
$\bar{X} = 9.75$
$V = \dfrac{332.96}{n - 1} = \dfrac{332.96}{15} = 22.20$

Suppose we have a data set $X_1, X_2, \dots, X_n$ with arithmetic mean, $\bar{X}$. We could calculate the difference between each value and the mean value, as follows: $d_1 = X_1 - \bar{X}$, $d_2 = X_2 - \bar{X}$, etc. It might seem that the average of these differences would give some measure of how closely the data values cluster around the mean, $\bar{X}$. In table 3.6 we have carried out this calculation for the number of words per sentence in a short reading text. Notice that some values of the difference (second column) are positive, some are negative, and that when added together they cancel out to zero. This will happen with any set of data. However, it is still appealing to use the difference between each value and the mean value as a measure of dispersion. There is a method of doing this which will not give the
same answer (i.e. zero) for every data set, and which has other useful mathematical properties. After calculating the differences $d_i$, we square each to get the squared deviations about the mean, $d_i^2$ (third column). We next total these squared deviations ($\sum d_i^2 = 332.96$). Finally, we divide this total by $(n - 1)$ (the number of values minus 1). The result of this calculation is the variance ($V$):

$$V = \frac{\sum d_i^2}{n - 1} \quad\text{or}\quad \frac{1}{n - 1}\sum d_i^2$$

This would be the arithmetic mean of the squared deviations if the divisor were the sample size $n$. The reason for dividing by $(n - 1)$ instead is technical and will not be explained here. Of course, if $n$ is quite large, it will hardly make any difference to the result whether we divide by $n$ or by $(n - 1)$, and it is convenient, and in no way misleading, to think of the variance as the average of the squared deviations.
One inconvenient aspect of the variance is the units in which it is measured. In table 3.6 we calculated the variance of the number of words per sentence in a short reading text. Each $X_i$ is a number of words and $\bar{X}$ is therefore the mean number of words per sentence. But what are the units of $d_i^2$? If we multiply five words by five words what kind of object do we have? Twenty-five what? The variance, $V$, will have the same units as each $d_i^2$, i.e. 'square words'. This concept is unhelpful from the point of view of empirical interpretation. However, if we take the square root of the variance, this will again have the unit 'words'. This measure, which is referred to as the standard deviation and denoted by the symbol $s$, has many other useful properties and, as a result, the standard deviation is the most frequently quoted measure of variability. Note, in passing, that $s = \sqrt{V}$; thus $s^2 = V$, and the variance is frequently referred to by the symbol $s^2$.
Before electronic calculators were cheap and readily available it was the custom at this point in statistical textbooks to introduce a number of formulae to simplify the process of calculating $s$ or (equivalently) $s^2$. However, on a relatively inexpensive calculator the mean, $\bar{X}$, and the standard deviation, $s$, of any data set can be obtained simultaneously by entering the data values into the calculator in a prescribed fashion.²
The standard deviation is one of the most important statistical measures. It indicates the typical amount by which values in the data set differ from the mean, $\bar{X}$, and no data summary is complete until all relevant standard deviations have been calculated.

² Many calculators will give two possible values of the standard deviation. One of these, designated $s_{n-1}$ or $\sigma_{n-1}$, is the value calculated according to our formula and we recommend its use on all occasions. The other, $s_n$ or $\sigma_n$, has been calculated by replacing $(n - 1)$ by $n$ in our formula, and its use is best avoided. As noted above, when $n$ is reasonably large there will be very little difference between the two values.

3.8 Standardising test scores
A major application of the standard deviation is in the process known as standardising, which can be carried out on any set of ordered, numerical data. One important use of standardising in language studies is to facilitate the comparison of test scores, and an example of this will be used to demonstrate the method.
Suppose that two individuals, A and B, have been tested (each by a different test) on their language proficiency and have achieved scores of 41 and 53 respectively. Naively, one might say without further ado that B has scored 'higher' or 'better' than A. However, they have been examined by different methods and questions must be asked about the comparability of the two tests. Let us suppose that both tests were taken by large numbers of subjects at the same time as they were taken by the two individuals that interest us, and that we have no reason to believe that, overall, the subjects taking the two tests differed in any systematic way. Suppose also that the mean score on the first test was 44 and that on the second test the mean was 49. We note immediately that A has scored below the average in one test while B has scored above the average in the other, so that there is an obvious sense in which B has done better than A. The comparison will not always be so obvious.
Let us assume, as before, scores of 41 and 53 for A and B in their respective tests, but that now the mean scores for the two tests are, respectively, 49 and 58. Both individuals now have lower than average scores. A has scored 8 marks below the mean, while B has just 5 marks less than the mean for this test. Does this imply that B has achieved a relatively higher score? Not necessarily. It depends how 'spread out' the scores are for the two tests. It is just conceivable that B's mark of 53 is the lowest achieved by anyone for the test while, say, 40% of the candidates on the other test scored less than A. In that case, B could be right at the bottom of the ability range while A would be somewhere near the middle. The comparison can properly be made only by taking into account the standard deviations of the test scores as well as their means. Suppose, for example, that the two tests had standard deviations of 8 marks and 5 marks respectively. Then the distance of A's score of 41 from the mean of his test is exactly the value of the standard deviation (49 - 8 = 41).
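Since the comparison now turns on standard deviations, it may help to see how one is obtained from raw values and then used. The following is a sketch under stated assumptions rather than a prescribed procedure: it recomputes the variance and standard deviation of the sentence lengths of table 3.6 with the $(n - 1)$ divisor, checks the result against Python's standard `statistics` module, and then expresses A's and B's scores as distances from their test means in units of the standard deviation. The function name `sd_units_from_mean` is hypothetical.

```python
from math import sqrt
from statistics import stdev, variance

# Number of words per sentence, as listed in table 3.6.
words_per_sentence = [3, 7, 4, 9, 12, 2, 14, 14, 6, 9, 11, 17, 18, 12, 10, 8]

n = len(words_per_sentence)
x_bar = sum(words_per_sentence) / n                     # mean
squared_deviations = [(x - x_bar) ** 2 for x in words_per_sentence]
v = sum(squared_deviations) / (n - 1)                   # variance, divisor n - 1
s = sqrt(v)                                             # standard deviation

# The table totals squared deviations rounded to two places (332.96);
# the unrounded total is 333, so the results agree to two decimal places.
print(x_bar, round(v, 2), round(s, 2))
print(round(variance(words_per_sentence), 2), round(stdev(words_per_sentence), 2))


def sd_units_from_mean(score, test_mean, test_sd):
    """Distance of a score from its test mean, in standard deviation units."""
    return (score - test_mean) / test_sd


# Scores, means and standard deviations as given in the text for A and B.
a = sd_units_from_mean(41, 49, 8)
b = sd_units_from_mean(53, 58, 5)
print(a, b)
```

Both distances come out as one standard deviation below the mean, which is the sense in which the two candidates can be compared across different tests.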
Summary mea~ures Summary
The same is true for B. The dist~nce pf.his: scor:e o.;53; fronHh~ m~~n .. , , :; ,, ,,, ,,,:'J!.ableJ7.Anexampleofthe' ''
is exactly the value of the standard \levhition onthe test:.that hetook standardising procedure on a
(58- 5 =53) .. Both scores can be .said to be. one standard deviationcl.ess . hypothetical data set
than their respective means and, taking into account in this waythe:differ. "' ,. ;_._, ,_.l;- ~ ll~ L,.--,--,-:c,--:--:,-_ ~- i! ... " -' --:::--
X~ subtract X~ _l'~dividcbys:-;_,.X
ent pispersion of scores in the two. tests, there. is a set)se in. which bpth S:o,;_,.
A apd .B have achieved the same scores relative .to the other can\lidates 73 zo I.~z
42 -I! -o.73
who took the tests at the same time. , 36 -7 -"-"t'.u
If the complete sets of scores f0r th~ two above tests had haq exactly 5' -
2 -o,t3
63 o.66
the same mean and exactly the same standard deviation, any mark on
one test would have been directly comparable with the same mark on
X=s3 \'=o Z=o
the other. But of course this is not usually the case. Standardi~ing ;est S;.; = 15.12 Sy = Sx = Sz =I

scores facilitates comparisons between on differ~nttests. ToilhJs c: _'-, ->,

trate, suppose now that we have .the sco"es X" X2 , .-:., etc. of a se\. of . in a .area, 0ach test on a different day Their scores on the first
subjects for some test and that the mean and standard deviation of these
scores are $\bar{X}$ and $s$ respectively. Begin by changing each score, $X_i$, to
a new score, $Y_i$, by subtracting the mean:

$Y_i = X_i - \bar{X}$

Now change each of the scores, $Y_i$, into a further new score, $Z_i$, by
dividing each $Y_i$ by $s$, the standard deviation of the original X scores:

$Z_i = Y_i \div s$

In table 3.7 this procedure has been applied to a small set of hypothetical
scores to demonstrate the outcome. The original scores of the first column
have a mean of $\bar{X} = 53$ and a standard deviation of $s_X = 15.82$. The mean,
$\bar{X}$, is subtracted from each of the original scores to give a new value,
Y, in the second column. The mean of these new values is $\bar{Y} = 0$ but
the standard deviation is $s_Y = 15.82$, the same value as before. Finally,
each value in the second column is changed into a standardised value,
Z, by dividing by 15.82. The five values in column 3 now have a mean
of zero and a standard deviation of 1.

By these two steps we have changed a set of scores, $X_1$, $X_2$, etc., with
mean $\bar{X}$ and standard deviation $s$, into a new set of scores $Z_1$, $Z_2$, etc.,
with mean zero and standard deviation 1. The Z scores are the
standardised X scores. The process of standardising can be described by
the single formula:

$Z = \frac{X - \bar{X}}{s}$

Suppose two different language proficiency tests which purport to
measure the same ability are administered to all available high school
students, and that scores on the first test have a mean of 92 and a
standard deviation of 14, and on the second test a mean of 143 and a
standard deviation of 21. Student A, who for some reason or other is
prevented from taking the second test, scores 121 on the first. Student B
misses the first test but scores 177 on the second. If we wanted to compare
the performance of these two students we could standardise their scores.
The standardised score for Subject A is:

$Z_A = \frac{121 - 92}{14} = 2.07$

and for Subject B is:

$Z_B = \frac{177 - 143}{21} = 1.62$

Since $Z_A$ is greater than $Z_B$ we would say that Subject A has scored higher
than Subject B on a standardised scale. It will be seen in chapters 6-8
that quite precise statements can often be made about such comparisons
and that standardised values play an important role in statistical theory.

SUMMARY
This chapter introduces and explains various numerical quantities or
measures which can be used to summarise sets of data in just a few numbers.
(1) The median and the mean (or arithmetic mean) were defined as measures
of the 'typical value' of a set of data.
(2) The properties of the mean and median were discussed, and the median was
shown to be more robust than the mean in the presence of one or two unusually
extreme values.
(3) It was pointed out that there were two common ways (the ordinary mean

or the weighted mean) of calculating the mean proportion of a set of
proportions and the motivation and result of each type of mean were discussed.
(4) The variance, the standard deviation and the interquartile distance were
presented as measures of variability or dispersion; the concept of the central
interval was introduced.
(5) Standardised scores were explained and shown to be an appropriate tool
for comparing the scores of different subjects on different tests.

EXERCISES
(1) (a) Work out the mean, median and mode for the data in table 2.4. What
conclusions can you reach about the most suitable measure of central
tendency to apply to these data?
(b) Follow the same procedures for the data of table 3.1.
(2) The following table gives data from the June 1980 application of the CPE
to a group of Asian students. For this data, construct a table of frequencies
within appropriate class intervals, cumulative frequencies and relative
cumulative frequencies. Draw the corresponding histogram and the relative
cumulative frequency curve. Estimate the median and the interquartile distance.

123 132 154 136 121 220 106  92 127 134
127  70 111 116  70 131 136 170  74 114
 65 112  82 193 172 134 221 217 138 138
 51 113 136 108  97 146  75 188 123  92
191 195  74 173 167 159 149 115 147  85
 88  96 255  93 171 219  84 118  90 111
128  78 213 149 110 256 256 172 129 110

(3) Calculate mean and standard deviation scores for the mean length of utterance
data in tables 2.4 and 3.1.
(4) An institute which offers intensive courses in oriental languages assigns
beginning students to classes on the basis of their language aptitude test scores.
Some students will have taken Aptitude Test A, which has a mean of 120
and a standard deviation of 12; the remainder will have taken Test B, which
has a mean of 100 and a standard deviation of 15. (The means and standard
deviations for both tests were obtained from large administrations to
comparable groups of students.)

Student Test A Test B

P 132
Q 124
R 122
S 82
T 75

(a) Calculate the standardised score for each of the students listed in the
table and rank the students according to their apparent ability, putting
the best first.
(b) The institute groups students into classes C, D, E or F according to
their score on Test A as follows: those scoring at least 140 are assigned
to Class C; those with at least 120 but less than 140 are Class D; those
with at least 105 but less than 120 are Class E. The remainder are assigned
to Class F. In which classes would you place students R, T and U?
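The standardisation procedure described in this chapter, Z = (X - mean)/s, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the list `made_up_scores` are hypothetical and do not come from the book, and the sample standard deviation (divisor n - 1) is assumed. The figures for Subject A and Subject B are the ones used in the text.

```python
from statistics import mean, stdev


def z_score(x, test_mean, test_sd):
    """Standardise one score against a known mean and standard deviation."""
    return (x - test_mean) / test_sd


def standardise(scores):
    """Standardise a whole set of scores in two steps:
    subtract the mean (Y = X - mean), then divide by s (Z = Y / s)."""
    m = mean(scores)
    s = stdev(scores)  # assumption: sample standard deviation, divisor n - 1
    return [(x - m) / s for x in scores]


# Subject A: 121 on the first test (mean 92, standard deviation 14).
# Subject B: 177 on the second test (mean 143, standard deviation 21).
z_a = z_score(121, 92, 14)   # about 2.07
z_b = z_score(177, 143, 21)  # about 1.62

# Made-up scores, used only to show that the Z scores end up with
# mean zero and standard deviation 1.
made_up_scores = [40, 55, 60, 70, 85]
z_scores = standardise(made_up_scores)

print(f"Z_A = {z_a:.2f}, Z_B = {z_b:.2f}")
print("standardised set:", [round(z, 2) for z in z_scores])
```

Because both scores are expressed in standard deviation units, they can be compared directly even though the two tests have different means and spreads.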
4 Statistical inference

4.1 The problem
Until now we have been considering how to describe or summarise
a set of data considered simply as an object in its own right. Very
often we want to do more than this: we wish to use a collection of observed
values to make inferences about a larger set of potential values; we would
like to consider a particular set of data we have obtained as representing
a larger class. It turns out that to accomplish this is by no means
straightforward. What is more, an exhaustive treatment of the difficulties
involved is beyond the scope of this book. In this chapter we can only provide
the reader with a general outline of the problem of making inferences
from observed values. A full understanding of this exposition will depend
to some degree on familiarity with the content of later chapters. For this
reason we suggest that this chapter is first read to obtain a general grasp
of the problem, and returned to later for re-reading in the light of
subsequent chapters.

We will illustrate the problem of inference by introducing some of
the cases which we will analyse in greater detail in the chapters to come.
One, for example, in chapter 8, concerns the size of the comprehension
vocabulary of British children between 6 and 7 years of age. It is obviously
not possible, for practical reasons, to test all British children of this age.
We simply will not have the resources. We can only test a sample of
children. We have learned, in chapters 2 and 3, how to make an adequate
description of an observed group, by, for example, constructing a
histogram or calculating the mean and standard deviation of the vocabulary
sizes of the subset of children selected. But our interest is often broader
than this; we would like to know the mean and standard deviation which
would have been obtained by testing all children of the relevant age. How
close would these have been to the mean and standard deviation actually
observed? This will depend on the relationship we expect to hold between
the group we have selected to measure and the larger group of children
from which it has been selected. How far can we assume the characteristics
of this latter group to be similar to those of the smaller group which has
been observed? This is the classical problem of statistical inference: how
to infer from the properties of a part the likely properties of the whole.
It will turn up repeatedly from now on. It is worth emphasising at the
outset that because of the way in which samples are selected in many
studies in linguistics and applied linguistics, it is often simply not possible
to generalise beyond the samples. We will return to this difficulty.

4.2 Populations
A population is the largest class to which we can generalise
the results of an investigation based on a subclass. The population of
interest (or target population) will vary in type and magnitude depending
on the aims and circumstances of each different study or investigation.
Within the limits set by the study in question, the population, in statistical
terms, will always be considered as the set of all possible values of a variable.
We have already referred to one study which is concerned with the
vocabulary of 6-7-year-olds. The variable here is scores on a test for
comprehension vocabulary size; the population of interest is the set of all
possible values of this variable which could be derived from all 6-7-year-old
children in the country. There are two points which should be apparent here.
First, although as investigators our primary interest is in the individuals
whose behaviour we are measuring, a statistical population is to be
thought of as a set of values; a mean vocabulary size calculated from a
sample of observed values is, as we shall see in chapter 7, an estimate
of the mean vocabulary size that would be obtained from the complete
set of values which form the target population. The second point that
should be apparent is that it is often not straightforward in language studies
to define the target population. After all, the set of '6-7-year-old children
in Britain', if we take this to refer to the period between the sixth and
seventh birthdays, is changing daily; so for us to put some limit on our
statistical population (the set of values which would be available from
these children) we have to set some kind of constraint. We return to this
kind of problem below when we consider sampling frames. For the
moment let us consider further the notion of 'target' or 'intended'
population in relation to some of the other examples used later in the book.

Utterance length. If we are interested in the change in utterance length
over time in children's speech, and collect data which sample utterance
length, the statistical population in this case is composed of the length
values of the individual utterances, not the utterances themselves. Indeed


we could use the utterances of the children to derive values for many
different variables and hence to construct many different statistical
populations. If instead of measuring the length of each utterance we gave each
one a score representing the number of third person pronouns it contained,
the population of interest would then be 'third person pronoun per
utterance scores'.

Voice onset time (VOT). In the study first referred to in chapter 1, Macken
& Barton (1980a) investigated the development of children's acquisition
of initial stop contrasts in English by measuring VOTs for plosives the
children produced which were attempts at adult voiced and voiceless stop
contrasts. The statistical population here is the VOT measurements for
/p, b, t, d, k and g/ targets, not the phonological items themselves. Note
once again that it is not at all easy to conceptualise the target population.
If we do not set any limit, the population (the values of all VOTs for
word-initial plosives pronounced by English children) is infinite. It is
highly likely, however, that the target population will necessarily be more
limited than this as a result of the circumstances of the investigation from
which the sample values are derived. Deliberate constraints (for example
a sample taken only from children of a certain age) or accidental ones
(non-random sampling - see below) will either constrain the population
of interest or make any generalisation difficult or even impossible.

Tense marking. In the two examples we have just looked at, the population
values can vary over a wide range. For other studies we can envisage
large populations in which the individual elements can have one of only
a few, or even two, distinct values. In the Fletcher & Peters (1984) study
(discussed in chapter 7) one of the characteristics of the language of children
in which the investigators were interested was their marking of lexical
verbs with the various auxiliaries and/or suffixes used for tense, aspect
and mood in English. They examined frequencies of a variety of individual
verb forms (modal, past, present, do-support, etc.). However, it would
be possible to consider, for example, just past tense marking and to ask,
for children of a particular age, which verbs that referred to past events
were overtly marked for past tense, and which were not. So if we looked
at the utterances of a sample of children of 2;6, we could assign the value
1 to each verb marked for past tense, and zero to unmarked verbs. The
statistical population of interest (the values of the children's past referring
verbs) would similarly be envisaged as consisting of a large collection of
elements, each of which could only have one or the other of these two
values.

A population then, for statistical purposes, is a set of values. We have
emphasised that in linguistic studies of the kind represented in this book
it is not always easy to conceptualise the population of interest. Let us
assume for the moment, however, that by various means we succeed in
defining our target population, and return to the problem of statistical
inference from another direction. While we may be ultimately interested
in populations, the values we observe will be from samples. How can
we ensure that we have reasonable grounds for claiming that the values
from our sample are accurate estimates of the values in the population?
In other words, is it possible to construct our sample in such a way that
we can legitimately make the inference from its values to those of the
population we have determined as being of interest? This is not a question
to which we can respond in any detail here. Sampling theory is itself
the subject of many books. But we can illustrate some of the difficulties
that are likely to arise in making generalisations in the kinds of studies
that are used for exemplification in this book, which we believe are not
untypical of the field as a whole.

Common sense would suggest that a sample should be representative
of the population, that is, it should not, by overt or covert bias, have
a structure which is too different from the target population. But more
technically (remembering that the statistical population is a set of values),
we need to be sure that the values that constitute the sample somehow
reflect the target statistical population. So, for example, if the possible
range of values for length of utterance for 3-year-olds is 1 to 11 morphemes,
with larger utterances possible but very unusual, we need to ensure that
we do not introduce bias into the sample by only collecting data from
a conversational setting in which an excessive number of yes-no questions
are addressed to the child by the interlocutor. Such questions would tend
to increase the probability of utterance lengths which are very short -
minor utterances like yes, no, or short sentences like I don't know. The
difficulty is that this is only one kind of bias that might be introduced
into our sample. Suppose that the interlocutor always asked open-ended
questions, like What happened? This might increase the probability of
longer utterances by a child or children in the sample. And there must
be sources of bias that we do not even contemplate, and cannot control
for (even assuming that we can control for the ones we can contemplate).¹

¹ We have passed over here an issue which we have to postpone for the moment, but which
is of considerable importance for much of the research done in language studies. Imagine
the case where the population of interest is utterance lengths of British English-speaking
pre-school children. We have to consider whether it is better to construct a sample which
consists of many utterances from a few children, or one which consists of a small number
of utterances from each of many children. We will return to this question, and the general
issue of sample size, in chapter 7.
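The effect of the kind of bias described above can be illustrated with a small simulation. This is a sketch under loud assumptions, not data from any study: the population of utterance lengths is invented (whole numbers from 1 to 11 morphemes, all equally likely), the 'yes-no question' setting is crudely modelled as one in which only utterances of three morphemes or fewer are collected, and every variable name is hypothetical.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Invented statistical population: 10,000 utterance-length values
# (in morphemes), each between 1 and 11. Equal likelihood is an assumption.
population = [rng.randint(1, 11) for _ in range(10_000)]

# Biased collection: a setting dominated by yes-no questions is modelled
# (very crudely) as one in which only short utterances are ever recorded.
short_only = [length for length in population if length <= 3]
biased_sample = rng.sample(short_only, 200)

# Unrestricted collection: every value in the population has the same
# chance of ending up in the sample.
unrestricted_sample = rng.sample(population, 200)

population_mean = mean(population)
biased_mean = mean(biased_sample)
unrestricted_mean = mean(unrestricted_sample)

print(f"population mean:          {population_mean:.2f}")
print(f"biased sample mean:       {biased_mean:.2f}")
print(f"unrestricted sample mean: {unrestricted_mean:.2f}")
```

The biased sample can never have a mean above 3, however many utterances are collected, whereas the unrestricted sample gives a mean close to that of the population. Collecting more data does not remove a bias built into the way the data are collected.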
Fortunately there is a method of sampling, known as random sampling,
that can overcome problems of overt or covert bias. What this term means
will become clearer once we know more about probability. But it is
important to understand from the outset that 'random' here does not mean
that the events in a sample are haphazard or completely lacking in order,
but rather that they have been constructed by a procedure that allows
every element in the population a known probability of being selected
in the sample.

While we can never be entirely sure that a sample is representative
(that it has roughly the characteristics of the population relevant to our
investigation), our best defence against the introduction of experimenter
bias is to follow a procedure that ensures random samples (one such
procedure will be described in chapter 5). This can give us reasonable
confidence that our inferences from sample values to population values are
valid. Conversely, if our sample is not constructed according to a random
procedure we cannot be confident that our estimates from it are likely
to be close to the population values in which we are interested, and any
generalisation will be of a dubious nature.

How are the samples constructed in the studies we consider in this
book? Is generalisation possible from them to a target population?

4.3 The theoretical solution
It will perhaps help us to answer these questions if we introduce
the notion of a sampling frame by way of a non-linguistic example.
This will incidentally clarify some of the difficulties we saw earlier in
attempts to specify populations.

Suppose researchers are interested in the birthweights of children born
in Britain in 1984 (with a view ultimately to comparing birth weights
in that year with those of 1934). As is usual with any investigation, their
resources will only allow them to collect a subset of these measurements
- but a fairly large subset. They have to decide where and how this subset
of values is to be collected. The first decision they have to make concerns
the sources of their information. Maternity hospital records are the most
obvious choice, but this leaves out babies born at home. Let us assume
that health visitors (who are required to visit all new-born children and
their mothers) have accessible records which can be used. What is now
required is some well-motivated limits on these records, to constitute a
sampling frame within which a random sample of birth weights can be
constructed.

The most common type of sampling frame is a list (actual or notional)
of all the subjects in the group to which generalisation is intended. Here,
for example, we could extract a list of all the babies with birth-dates in
the relevant year from the records of all health visitors in Great Britain.
We could then choose a simple random sample (chapter 5) of n of these
babies and note the birthweights in their record. If n is large, the mean
weight of the sample should be very similar (chapter 7) to the mean for
all the babies born in that year. At the very least we will be able to say
how big the discrepancy is likely to be (in terms of what is known as
a 'confidence interval' - see chapter 7).

The problem with this solution is that the construction of the sampling
frame would be extremely time-consuming and costly. Other options are
available. For example, a sampling frame could be constructed in two
or more stages. The country (Britain) could be divided into large regions,
Scotland, Wales, North-East, West Midlands, etc., and a few regions
chosen from this first stage sampling frame. For each of the selected
regions a list of Health Districts can be drawn up (second stage) and a few
Health Districts chosen, at random, from each region. Then it may be
possible to look at the records of all the health visitors in each of the
chosen Districts or perhaps a sample of visitors can be chosen from the
list (third stage) of all health visitors in each district.²

² The analysis of the data gathered by such complex sampling schemes can become quite
complicated and we will not deal with it in this book. Interested readers should see texts
on sampling theory or survey design or consult an experienced survey statistician.

The major constraint is of course resources - the time and money
available for data collection and analysis. In the light of this, sensible decisions
have to be made about, for example, the number of Health Districts in
Britain to be included in the frame; or it may be necessary to limit the
inquiry to children born in four months in the year instead of a complete
year. In this example, the sampling frame mediates between the population
of interest (which is the birth weights of all children born in Britain in
1984) and the sample, and allows us to generalise from the sample values
to those in the population of interest.

If we now return to an earlier linguistic example, we can see how the
sampling frame would enable us to link our sample with a population
of interest. Take word-initial VOTs. Our interest will always be in the
individuals of a relatively large group and in the measurements we derive
from their behaviour. In the present case we are likely to be concerned
with English children between 1;6 and 2;6, because this seems to be the
time when they are learning to differentiate voiceless from voiced initial
stops using VOT as a crucial phonetic dimension. Our resources will be

limited. We should, however, at least have a sampling frame which sets
time limits (for instance, we could choose for the lower limit of our
age-range children who are 1;6 in a particular week in 1984); we would like
it to be geographically well-distributed (we might again use Health
Districts); within the sampling frame we must select a random sample of
a reasonable size.³

³ The issues raised in the first footnote crop up again here: measurements on linguistic
variables are more complex than birth weights. We could again ask whether we should
collect many word-initial plosives from few children, or few plosives from many children
(see chapter 7). A similar problem arises with the sample chosen by several stages. Is
it better to choose many regions and a few Health Districts in each region or vice versa?

That is how we might go about selecting children for such a study.
But how are language samples to be selected from a child? Changing the
example, consider the problem of selecting utterances from a young child
to measure his mean length of utterance (mlu - see chapter 13). Again
it is possible to devise a sampling frame. One method would be to attach
a radio microphone to the child, which would transmit and record every
single utterance he makes over some period of time to a tape-recorder.
Let us say we record all his utterances over a three-month period. We
could then attach a unique number to each utterance and choose a simple
random sample (chapter 5) of utterances. This is clearly neither sensible
nor feasible - it would require an unrealistic expenditure of resources.
Alternatively, and more reasonably, we could divide each month into days
and each day into hours, select a few days at random and a few hours
within each day and record all the utterances made during the selected
hours. (See Wells 1985: chapter 1 for a study of this kind.) If this method
of selection were to be used it would be better to assume that the child
is active only between, say, 7 a.m. and 8 p.m. and select hours from
that time period.

In a similar way, it will always be possible to imagine how a sampling
frame could be drawn up for any finite population if time and other
resources were unlimited. The theory which underlies all the usual
statistical methods assumes that, if the results obtained from a sample are to
be generalised to a wider population, a suitable sampling frame has been
established and the sample chosen randomly from the frame. In practice,
however, it is frequently impossible to draw up an acceptable sampling
frame - so what, then, can be done?

4.4 The pragmatic solution
In any year a large number of linguistic studies of an empirical
nature are carried out by many researchers in many different locations.
The great majority of these studies will be exploratory in nature; they
will be designed to test a new hypothesis which has occurred to the
investigator or to look at a modification of some idea already studied and reported
on by other researchers. Most investigators have very limited resources
and, in any case, it would be extravagant to carry out a large and expensive
study unless it was expected to confirm and give more detailed information
on a hypothesis which was likely to be true and whose implications had
deep scientific, social or economic significance. Of necessity, each
investigator will have to make use of the experimental material (including human
subjects) to which he can gain access easily. This will almost always
preclude the setting up of sampling frames and the random selection of
subjects.

At first sight it may look as if there is an unbridgeable gap here. Statistical
theory requires that sampling should be done in a special way before
generalisation can be made formally from a sample to a population. Most studies
do not involve samples selected in the required fashion. Does this mean
that statistical techniques will be inapplicable to these studies? Before
addressing this question directly, let us step back for a moment and ask
what it is, in the most general sense, that the discipline of statistics offers
to linguistics if its techniques are applicable.

What the techniques of statistics offer is a common ground, a common
measuring stick by which experimenters can measure and compare the
strength of evidence for one hypothesis or another that can be obtained
from a sample of subjects, language tokens, etc. This is worth striving
after. Different studies will measure quantities which are more or less
variable and will include different numbers of subjects and language tokens.
Language researchers may find several different directions from which
to tackle the same issue. Unless a common ground can be established
on which the results of different investigations can be compared using
a common yardstick it would be almost impossible to assess the quality
of the evidence contained in different studies or to judge how much weight
to give to conflicting claims.

Returning to the question of applicability, we would suggest that a
sensible way to proceed is to accept the results of each study, in the first
place, as though any sampling had been carried out in a theoretically
'correct' fashion. If these results are interesting - suggesting some new
hypothesis or contradicting a previously accepted one, for example - then is
time enough to question how the sample was obtained and whether this
is likely to have a bearing on the validity of the conclusions reached. Let
us look at an example.

In chapter 11 we discuss a study by Hughes & Lascaratou (1981) on
the gravity of errors in written English as perceived by two different groups:
native English-speaking teachers of English and Greek teachers of English.
We conclude that there seems to be a difference in the way that the two
groups judge errors, the Greek teachers tending to be more severe in
their judgements. How much does this tell us about possible differences
between the complete population of native-speaking English teachers and
Greek teachers of English? The results of the experiment would carry
over to those populations - in the sense to be explained in the following
four chapters - if the samples had been selected carefully from complete
sampling frames. This was certainly not done. Hughes and Lascaratou
had to gain the co-operation of those teachers to whom they had ready
access. The formally correct alternative would have been prohibitively
expensive. However, both samples of teachers at least contained individuals
from different institutions. If all the English teachers had come from a
single English institution and all the Greek teachers from a single Greek
school of languages then it could be argued that the difference in error
gravity scores could be due to the attitudes of the institutions rather than
the nationality of the teachers. On the other hand, all but one of the
Greek teachers worked in Athens (the English teachers came from a wide
selection of backgrounds) and we might query whether their attitudes
to errors might be different from those of their colleagues in different
parts of Greece. Without testing this argument it is impossible to refute
it, but on common sense grounds (i.e. the 'common sense' of a researcher
in the teaching of second languages) it seems unlikely.

This then seems a reasonable way to proceed. Judge the results as though
they were based on random samples and then look at the possibility that
they may be distorted by the way the sample was, in fact, obtained.
However, this imposes on researchers the inescapable duty of describing
carefully how their experimental material - including subjects - was actually
obtained. It is also good practice to attempt to foresee some of the objections
that might be made about the quality of that material and either attempt
to forestall criticism or admit openly to any serious defects.

When the subjects themselves determine to which experimental group
they belong, whether deliberately or accidentally, the sampling needs to
be done rather more carefully. An important objective of the Fletcher
& Peters (1984) study mentioned earlier was to compare the speech of
language-normal with that of language-impaired children. In this case the
investigators could not randomly assign children to one of these groups
- they had already been classified before they were selected. It is important
in this kind of study to try to avoid choosing either of the samples such
that they belong obviously to some special subgroup.

There is one type of investigation for which proper random sampling
is absolutely essential. If a reference test of some kind is to be established,
perhaps to detect lack of adequate language development in young children,
then the test must be applicable to the whole target population and not
just to a particular sample. Inaccuracies in determining cut-off levels for
detecting children who should be given special assistance have economic
implications (e.g. too great a demand on resources) and social implications
(language-impaired children not being detected). For studies of this nature,
a statistician should be recruited before any data is collected and before
a sampling frame has been established.

With this brief introduction to some of the problems of the relation
between sample and population, we now turn in chapter 5 to the concept
of probability as a crucial notion in providing a link between the properties
of a sample and the structure of its parent population. In the final section
of that chapter we outline a procedure for random sampling. Chapter
6 deals with the modelling of statistical populations, and introduces the
normal distribution, an important model for our purposes in characterising
the relation between sample and population.

SUMMARY
In this chapter the basic problem of the sample-population relationship
has been discussed.
(1) A statistical population was defined as a set of all the values which might
ever be included in a particular study. The target population is the set to
which generalisation is intended from a study based on a sample.
(2) Generalisation from a sample to a population can be made formally only if
the sample is collected randomly from a sampling frame which allows each
element of the population a known chance of being selected.
(3) The point was made that, for the majority of linguistic investigations, resource
constraints prohibit the collection of data in this way. It was argued that
statistical theory and methodology still have an important role to play in the
planning and analysis of language studies.

EXERCISES
(1) In chapter 9 we refer to a study by Ferris & Politzer (1981) in which children
with bilingual background are tested on their English ability via compositions
in response to a short film. Read the brief account of the study (pp. 139ff),

find out what measures were used, and decide what would constitute the
elements of the relevant statistical population.
(2) See if you can do the same for Brasington's (1978) study of Rennellese loan
words, also explained in chapter 9 (pp. 142-4).
(3) Review the Macken & Barton (1980a) study, detailed in chapter 1. In
considering the 'intended population' for this study, what factors do we have to
take into account?
(4) In a well-known experiment, Newport, Gleitman & Gleitman (1977) collected
conversational data from 15 mother-child pairs, when the children were aged
between 15 and 27 months and again six months later. One of the variables
of interest was the number of yes/no interrogatives used by mothers at the
first recording. If we simply consider this variable, what can we say is the
intended population? Is it infinite?

5 Probability

The link between the properties of a sample and the structure of its parent
population is provided through the concept of probability. The concept
of probability is best introduced by means of straightforward non-linguistic
examples. We will return to linguistic exemplification once the ideas are
established. The essentials can be illustrated by means of a simple game.

5.1 Probability
Suppose we have a box containing ten plastic discs, of identical shape, of which three are red, the remainder white. The discs are numbered 1 to 10, the red discs being those numbered 1, 2 and 3. The game consists of shaking the box, drawing a disc without looking inside, and noting both its number and its colour. The disc is returned to the box and the game repeated indefinitely. What proportion of the draws do you think will result in disc number 4 being drawn from the box? One draw in three? One draw in five? One draw in ten?
Surely we would all agree that the last is the most reasonable. There are ten discs. Each is as likely to be drawn as the others every time the game is played. Since there are ten discs with different numbers, we should expect that each number will be drawn about one-tenth of the time in a large number of draws, or trials as they are often called. We would say that the probability of drawing the disc numbered 4 on any one occasion is one in ten, and we will write:

P(disc number 4) = 1/10 = 0.1

Instead of determining the probability of disc 4 being drawn in the game, we could ask another question, perhaps, 'What is the probability on any occasion that the disc drawn is red?' Since three out of the ten discs are of this colour, that is, three-tenths of the discs in the box are red, then:

P(red disc) = 3/10 = 0.3

In the same way we can ask questions about the probabilities of many other outcomes. For example:

(i) the probability of drawing an even-numbered disc, since there are five of them, is:

P(even-numbered disc) = 5/10 = 0.5

(ii) the probability of drawing a disc which is both red and odd numbered, since there are two of them (1 and 3), is:

P(red, odd-numbered disc) = 2/10 = 0.2

and so on.

Table 5.1. Number of times a red disc was drawn from a box containing 3 red and 7 white discs in 100 trials by 42 students

20 21 22 24 24 25 26 26 26 26 27 27 28 28 28
29 29 29 29 29 30 30 30 30 31 31 31 32 32 33
33 33 33 34 34 35 36 36 37 37 38 39

Number of red discs    Frequency
20-22                   3
23-25                   3
26-28                   9
29-31                  12
32-34                   8
35-37                   5
38-40                   2

Mean of results of the 42 students is 29.952 per 100 draws.
Mean proportion is therefore 0.29952.
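The disc game lends itself to a quick computer check. The sketch below is only an illustration, not part of the original experiment: it first computes the exact probabilities discussed above by counting discs, then lets 42 simulated 'students' each play 100 draws with replacement. The function names and the seed are arbitrary choices made for this example, and the simulated proportions will differ from those in table 5.1.

```python
import random
from fractions import Fraction

# Box from the text: discs numbered 1 to 10, of which 1, 2 and 3 are red.
DISCS = list(range(1, 11))
RED = {1, 2, 3}


def exact_probability(event):
    """Proportion of discs in the box for which event(disc) is true."""
    favourable = sum(1 for disc in DISCS if event(disc))
    return Fraction(favourable, len(DISCS))


def red_proportion(trials, rng):
    """Draw with replacement `trials` times; return the proportion of red draws."""
    reds = sum(1 for _ in range(trials) if rng.choice(DISCS) in RED)
    return reds / trials


p_red = exact_probability(lambda d: d in RED)
p_even = exact_probability(lambda d: d % 2 == 0)
p_red_odd = exact_probability(lambda d: d in RED and d % 2 == 1)
print("exact:", float(p_red), float(p_even), float(p_red_odd))

rng = random.Random(1986)  # arbitrary seed, used only so the run is repeatable
students = [red_proportion(100, rng) for _ in range(42)]
mean_proportion = sum(students) / len(students)
print("lowest:", min(students), "highest:", max(students))
print("mean proportion:", round(mean_proportion, 4))
```

Individual simulated students should scatter around 0.3, while the mean over all 42 should lie much closer to it, which is the pattern table 5.1 shows for the real students.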
It is helpful to play such games to gain some insight into the relation between the relative frequency of particular outcomes in a series of trials and your a priori expectations based on your knowledge of the contents of the box. You could repeat the game 100 times and see what is the actual proportion of red discs which is drawn. It is even better if a number of people play the game simultaneously and compare results. Table 5.1 shows what happened on one occasion when 42 students each played the game 100 times. Although the proportion of red discs drawn by individual students varied from 0.20 to 0.39, the mean proportion of red discs drawn over the whole group was 0.2995, very close to the value one would expect, i.e. 0.3.
It should be clear that the actual number of red discs is not the important quantity here. The properties of the game depend only on the proportion of discs of that colour. For example, if the box contained 10,000 discs of which 3,000 were red it would continue to be true that P(red disc) = 0.3. This has some practical relevance. Suppose that a village contained 10,000 people of whom 30% were male, though this fact was unknown to us. If we wished to try to establish this proportion without having to identify the sex of every person in the village, we could play the above game, but this time with people instead of discs. We would repeatedly choose a person at random (a detailed method for random selection will be given in the final section of this chapter) and note whether the person chosen was male or female. If we repeated the process n times, we could use the proportion of males among these n people as an estimate of the proportion or expected relative frequency of males in the population. We saw from table 5.1 that it was possible to assess the properties of a sampling game empirically, when the proportions of different types are known. It is also possible to study the properties of sampling games theoretically. In establishing the proportion of males in the village, for instance, it is possible to find out how many persons should be sampled in order to give a reasonably accurate estimate, by using methods that are explained in chapter 7 (see especially 7.6). As we shall see, if we take a sample of 400 persons we can expect with reasonable confidence (a 5% probability of error) that the proportion of males in the sample is between 28% and 32%, and we can be almost sure (a 1% probability of error) that the proportion of males in the sample would not be less than 25% or greater than 35%. (Compare these figures to the actual proportion of 30%.)
We should point out that, in practice, it is not common to use the sampling procedure we have employed in the game, where each disc/person is replaced in the box/village after its type has been noted. It is more common to take a whole group of discs/people simultaneously and note the type of every element in the sample. This is equivalent to choosing the discs/people one at a time and then not replacing those already chosen before selecting the next. Provided only a small proportion of the total (say less than 10%) is sampled, it will not make much difference whether the sampling is done with or without replacement.

5.2 Statistical independence and conditional probability
Table 5.2 displays the numbers of individuals of either sex from two different hypothetical populations classed as monolingual or bilingual. Each population contains the same number of individuals, 10,000,
of whom 5,200 are male and 4,800 are female. However, in population A the proportion of males who are bilingual is 2080/5200 = 0.4, the same as the proportion of bilingual females (1920/4800). In population B, on the other hand, 2500/5200 males, i.e. 0.48, are bilingual while the proportion of bilingual females is only 0.31. This kind of imbalance may be observed in practice, for example, among the Totonac population of Central Mexico, where the men are more accustomed to trade with outsiders and are more likely to speak Spanish than the more isolated women. A similar effect may be encountered in first-generation immigrant communities where the men learn the tongue of their adopted country quickly at their place of employment while the women spend much more time at home, either isolated or mingling with others of their own ethnic and linguistic group.

Table 5.2. Numbers of monolingual or bilingual adults in two hypothetical populations cross-tabulated by sex

Population A
               Male    Female    Total
Bilingual      2080     1920      4000
Monolingual    3120     2880      6000
Total          5200     4800     10000

Population B
               Male    Female    Total
Bilingual      2500     1500      4000
Monolingual    2700     3300      6000
Total          5200     4800     10000

Suppose that we were to label every member of such a population with a different number and then write each number on a different plastic disc. All the discs are to be of identical shape, but if the number corresponds to a male it is written on a white disc, if female it is written on a red. The discs are placed in a large box and well shaken. Then one is chosen without looking in the box. Clearly this is one way of choosing objectively a single person, i.e. a sample of size one, from the whole population, and you should see immediately that the following probability statements can be made about such a person chosen from either of the two populations, A or B.

The probability of a male being chosen:

P(male) = 5200/10000 = 0.52

The probability of a female:

P(female) = 0.48

The probability of a monolingual:

P(monolingual) = 0.6

The probability of a bilingual:

P(bilingual) = 0.4

(Notice that when the population is partitioned in such a way that each individual belongs to one and only one category, e.g. male-female or monolingual-bilingual, the total probability over all the categories of the partition is always 1.0; 0.52 + 0.48 = 1.0 and 0.4 + 0.6 = 1.0, for either population.)

Although the two populations have identical proportions of male-female and of monolingual-bilingual individuals, they have otherwise different structures. We see this when we look at finer detail:

For population A:

P(male and bilingual) = 2080/10000 = 0.208
P(female and bilingual) = 1920/10000 = 0.192
P(male and monolingual) = 3120/10000 = 0.312
P(female and monolingual) = 2880/10000 = 0.288

For population B:

P(male and bilingual) = 2500/10000 = 0.25
P(female and bilingual) = 1500/10000 = 0.15
P(male and monolingual) = 2700/10000 = 0.27
P(female and monolingual) = 3300/10000 = 0.33
Total 1.00
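The probabilities listed above all come from dividing counts in table 5.2 by the population size. The following sketch is an illustration only; the dictionary layout and the function names `joint` and `marginal` are assumptions made for this example. It recomputes the marginal and joint probabilities for both populations.

```python
from fractions import Fraction

# Counts from table 5.2, keyed by (language category, sex).
POPULATION_A = {
    ("bilingual", "male"): 2080, ("bilingual", "female"): 1920,
    ("monolingual", "male"): 3120, ("monolingual", "female"): 2880,
}
POPULATION_B = {
    ("bilingual", "male"): 2500, ("bilingual", "female"): 1500,
    ("monolingual", "male"): 2700, ("monolingual", "female"): 3300,
}


def joint(counts):
    """Probability of each (language, sex) cell: count / population size."""
    total = sum(counts.values())
    return {cell: Fraction(n, total) for cell, n in counts.items()}


def marginal(counts, position, value):
    """Probability of one category, summing over the other variable.

    position 0 selects the language category, position 1 selects sex.
    """
    total = sum(counts.values())
    matching = sum(n for cell, n in counts.items() if cell[position] == value)
    return Fraction(matching, total)


for name, counts in (("A", POPULATION_A), ("B", POPULATION_B)):
    print("Population", name)
    print("  P(male) =", float(marginal(counts, 1, "male")))
    print("  P(bilingual) =", float(marginal(counts, 0, "bilingual")))
    for (language, sex), p in joint(counts).items():
        print(f"  P({sex} and {language}) = {float(p)}")
```

The two populations give identical marginal probabilities but different joint probabilities, which is the difference in structure noted in the text.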
Suppose that, for either population, we continue drawing discs until the very first white disc appears, that is, until the first male person is chosen. What is the probability that this first male is bilingual? Notice that this is not the same as asking for the probability that the first person chosen is male and bilingual. We keep going until we choose the first male and this deliberately excludes the females from consideration. In fact, we are restricting our attention to the subpopulation of males and then asking about the probability of an event which could occur when we impose that restriction or condition. For this reason, such a probability is called a conditional probability.
To signal the restriction that we will consider only males, we will use a standard notation, P(bilingual | male), where the vertical line indicates that a restriction is being imposed, and we say that we require the 'probability that a chosen person is bilingual given that the person is male'. Since, in population A, there are a total of 5,200 males of whom 2,080 are bilingual, the value of the conditional probability is:

P(bilingual | male) = 2080/5200 = 0.4

Note that this is exactly the same as the probability that a person chosen, irrespective of sex, will be bilingual, i.e.

P(bilingual | male) = P(bilingual) = 0.4

We can calculate the probability in a similar fashion for the likelihood that a chosen person is bilingual, given that she is female:

P(bilingual | female) = 1920/4800 = 0.4

Note that this is again the same as the probability for bilinguals, irrespective of sex:

P(bilingual) = 4000/10000 = 0.4

If we wish to determine the probability that a chosen person is male, given that the person is bilingual, the calculation is as follows:

P(male | bilingual) = 2080/4000 = 0.52

Note that this is the same probability as:

P(male | monolingual) = 3120/6000 = 0.52

and

P(male) = 5200/10000 = 0.52

From these examples we can see that, in population A, whichever restriction as to sex is imposed on selection (or indeed if there is no restriction at all), the probability of bilingualism/monolingualism remains unchanged. Similarly, whichever restriction as to language category we impose (or if there is no restriction at all), the probability of male/female remains the same.
In a population with these characteristics, the variables 'sex' and 'language category' are said to be statistically independent. In practice, this means that knowing the value of one gives no information about the other.
Population B exhibits rather different properties. We know already that P(bilingual) = 0.4, but we can see:

P(bilingual | male) = 2500/5200 = 0.48 ≠ 0.4

P(bilingual | female) = 1500/4800 = 0.31 ≠ 0.4

(Both these conditional probabilities have been rounded to two decimal places.) It is clear that in this case a person's language category will be dependent on that person's sex. That is to say, if a male is selected, then we know there is a higher chance that this person is bilingual than if a female had been chosen. In general, P(X | Y), the probability that event X occurs, given that the event Y has already occurred, can be calculated by the rule:

P(X | Y) = P(X and Y) / P(Y)

For example, in population B:

P(bilingual | male) = P(bilingual and male) / P(male) = 0.25/0.52 = 0.48
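The rule P(X | Y) = P(X and Y) / P(Y) can be checked directly against the counts of table 5.2. The sketch below is illustrative only: the helper names `prob`, `conditional`, `is_bilingual` and `is_male` are assumptions made for this example. It compares P(bilingual | male) with P(bilingual) in each population.

```python
from fractions import Fraction

# Counts from table 5.2, keyed by (language category, sex).
TABLE_5_2 = {
    "A": {("bilingual", "male"): 2080, ("bilingual", "female"): 1920,
          ("monolingual", "male"): 3120, ("monolingual", "female"): 2880},
    "B": {("bilingual", "male"): 2500, ("bilingual", "female"): 1500,
          ("monolingual", "male"): 2700, ("monolingual", "female"): 3300},
}


def prob(counts, event):
    """P(event): share of the population whose cell satisfies `event`."""
    total = sum(counts.values())
    return Fraction(sum(n for cell, n in counts.items() if event(cell)), total)


def conditional(counts, x, y):
    """P(X | Y) = P(X and Y) / P(Y)."""
    return prob(counts, lambda cell: x(cell) and y(cell)) / prob(counts, y)


def is_bilingual(cell):
    return cell[0] == "bilingual"


def is_male(cell):
    return cell[1] == "male"


for name, counts in TABLE_5_2.items():
    given_male = conditional(counts, is_bilingual, is_male)
    overall = prob(counts, is_bilingual)
    print(name, round(float(given_male), 2), float(overall), given_male == overall)
```

The final value printed on each line is True when the conditional and unconditional probabilities agree, as they do for population A, and False when they differ, as for population B.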
There is one important property of population A which results from the independence of the two variables. Consider the probability that the first person chosen is both male and bilingual. It turns out that:

P(male and bilingual) = 2080/10000 = 0.208
P(male) = 5200/10000 = 0.52
P(bilingual) = 4000/10000 = 0.4

Now, 0.208 = 0.4 × 0.52, so that we can see that the probability of a person being chosen who is both male and bilingual can be calculated as follows:

P(male and bilingual) = P(male) × P(bilingual)

This result holds only because the two variables of sex and linguistic type are independent.
In population B, on the other hand, we have P(male and bilingual) = 0.25, while, for the same population, P(male) = 0.52 and P(bilingual) = 0.4, so that:

P(male and bilingual) ≠ P(male) × P(bilingual)

This indicates the lack of independence between the two variables in this population. However, the relation:

P(male and bilingual) = P(male | bilingual) × P(bilingual)

does hold, since:

P(male | bilingual) = 2500/4000 = 0.625
P(bilingual) = 0.4
0.4 × 0.625 = 0.25

In general, for any two possible outcomes, X and Y, of a sampling experiment:

P(X and Y occur together) = P(X | Y) × P(Y) = P(Y | X) × P(X)

5.3 Probability and discrete numerical random variables
The examples we have seen so far have been concerned with rather simple categorical variables such as sex or linguistic category. However, the situation is very similar when we consider discrete numerical variables, the only difference being that the extended range of values in such variables allows us to introduce a richer variety of outcomes whose probability we might want to consider.

Table 5.3. Hypothetical family size distribution of 1,000 families

No. of children (X)    No. of families    Proportion
0                      121                0.121
1                      179                0.179
2                      263                0.263
3                      217                0.217
4                       99                0.099
5                       61                0.061
6 or more               60                0.060

Suppose that 1,000 families have given rise to the population of family sizes (i.e. number of children) summarised in table 5.3. In this population, let us choose one family at random. What is the probability that X, the number of children in this family, is 3?

Answer: P(X = 3) = 0.217

since that is the proportion of the family sizes which take the value 3. Similarly,(1)

P(X = 5) = 0.061
P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2) = 0.121 + 0.179 + 0.263 = 0.563
P(0 < X ≤ 3) = P(X = 1 or 2 or 3) = 0.659

(1) The symbol < means 'is less than', while ≤ means 'is less than or equal to'. Similarly, the symbol > means 'is greater than', while ≥ means 'is greater than or equal to'.

Table 5.3 is an example of a probability distribution of a random variable. A random variable can be thought of as any variable whose value, which cannot be predicted with certainty, is ascertained as the outcome of an experiment, usually a sampling experiment, or game of some kind. For example, if we decide to choose a family at random from the hypothetical 1,000 families, we do not know for certain what the size of that family will be until after the sampling has been done. The distribution of such a random variable is simply the list of probabilities of the different values that the variable can take. If the different possible values of the variable can be enumerated or listed, as in this case, it is called a discrete random variable. Discrete variables may be numerical, like 'family size', or categorical, like 'sex' or 'colour'. (In the previous section we saw an example of the categorical variable 'colour' which took the
two values 'red' and 'white' with probabilities 0.3 and 0.7, respectively.) The distribution of a discrete random variable can be represented by a bar chart: figure 5.1, for example, gives the bar chart corresponding to table 5.3. We have already seen similar diagrams in chapter 2 (e.g. figure 2.1), and these can be considered as approximations to the bar charts of the discrete random variables 'types of deficit' and 'length of utterance' based on samples of values of the corresponding random variables.

Figure 5.1. Bar chart corresponding to data of table 5.3 (horizontal axis: no. of children in family; vertical axis: no. of families).

For any discrete, numerical random variable, the probability that a single random observation has one of a possible range of values is just the sum of the individual probabilities for the separate values in the range (see exercise 5.3). Note that the actual size of the population is irrelevant once the proportion of elements taking a particular value is known.

5.4 Probability and continuous random variables
Suppose that a statistical population is constructed by having each of a large number of people carry out a simple linguistic task and noting the reaction time, in seconds, for each person's response. We suppose that the device used to measure the time is extremely accurate and measures to something like the nearest millionth of a second. In fact, conceptually at least, there is no limit to the accuracy with which we could measure a length of time. Ask yourself what is the next largest time interval greater than 1.643217 seconds. Whatever figure you give, it is always possible to suggest one even closer. For example, if 1.6432171 seconds is suggested, that is less close than 1.64321705 which, in turn, is not as close as 1.643217049 and so on. A variable with this property is called a continuous variable. Table 5.4 gives a hypothetical relative frequency table for the population of task times and figure 5.2 the corresponding histogram. There are several points worth noting here.

Table 5.4. Hypothetical distribution of task times

Time (in seconds)
From    To just less than    Proportion of times in this range
 0       20                  0.035
20       30                  0.031
30       40                  0.061
40       50                  0.154
50       55                  0.202
55       60                  0.230
60       70                  0.161
70       80                  0.071
80      100                  0.055

Figure 5.2. Distribution of task times of table 5.4 (histogram; horizontal axis: time in seconds).
First, the class intervals are not all of the same width and you should examine how the histogram is adjusted to account for this. In particular, the class interval 0-20 has a higher proportion of the population than its neighbouring class 20-30 and yet the corresponding rectangle in the histogram is less tall; it is the areas of the rectangles (i.e. width × height) which correspond to the relative frequencies of the classes. Second, we do not need to state the actual number of elements of the population belonging to each class since, for calculating probabilities, we need only know the relative frequencies. Thirdly, the upper bounds of the class intervals are given as 'less than 20 seconds', 'less than 30 seconds', and so on since, because of the accuracy of our measuring instrument, we are assuming it is possible to get as close to these values as we might wish.
Let us choose, at random, one of these task times and denote it by Y. What is the probability that Y takes the value 25, say? We need to think rather carefully about this. Since we are measuring to the nearest millionth of a second, the range from 0 to 100 contains 100 million different possible values, even more if our instrument for measuring times is even more accurate. The probability of getting a time of any fixed, exact value, say 25.000000 seconds or 31.163217 seconds, is very small. To all intents and purposes it is zero. This means that we will have to content ourselves with calculating probabilities for ranges of values of the variable, Y. For example:

P(Y < 20) = 0.035
P(Y < 50) = 0.035 + 0.031 + 0.061 + 0.154 = 0.281
P(40 ≤ Y < 55) = 0.154 + 0.202 = 0.356

(P(40 ≤ Y < 55) is to be interpreted as the probability that Y is equal to or greater than 40 but less than 55.)
In each case we simply identify the required range of values on the histogram and calculate the total area of the histogram encompassed by that range. Note here a very special point which was not true in the previous section:

P(Y < 20) = P(Y ≤ 20)

because, since the probability of getting any exact value is effectively zero:

P(Y ≤ 20) = P(Y < 20) + P(Y = 20)
          = P(Y < 20) + 0

This means that for continuous variables it will not matter whether we write < or ≤; the probability does not change.
Now, suppose we try to calculate P(10 < Y < 20). How can we evaluate this probability? The range does not coincide with the endpoints of class intervals as all the previous examples did. Remember that in the histogram the area defined by a particular range is the probability of obtaining a value within that range. If we shade, on the histogram, the area corresponding to 10 < Y < 20 (and refer to table 5.4 for the proportion of times that fall in this range) we see immediately that the required probability is 1/2 × 0.035 = 0.0175; see figure 5.3(a). Similarly:

P(Y > 78) = 2/10 × 0.071 + 0.055

and, see figure 5.3(b):

P(50 < Y < 87) = 0.202 + 0.230 + 0.161 + 0.071 + (7/20 × 0.055)

Figure 5.3(a). Estimating probabilities: the shaded area of the histogram corresponds to P(10 < Y < 20) (horizontal axis: time in seconds).
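The area calculations above follow one pattern: each class contributes its proportion multiplied by the fraction of its width that falls inside the required range. A minimal sketch of that pattern follows, assuming (as the histogram does) that times are spread evenly within each class. The function name `interval_probability` is invented for this example, and the boundary at 60 seconds between the classes with proportions 0.230 and 0.161 is an assumption; it does not affect the three worked examples.

```python
# (lower bound, upper bound, proportion) for each class of table 5.4.
# The 60-second boundary between the 0.230 and 0.161 classes is assumed.
CLASSES = [
    (0, 20, 0.035), (20, 30, 0.031), (30, 40, 0.061),
    (40, 50, 0.154), (50, 55, 0.202), (55, 60, 0.230),
    (60, 70, 0.161), (70, 80, 0.071), (80, 100, 0.055),
]


def interval_probability(low, high):
    """Approximate P(low < Y < high) as the histogram area over the range."""
    total = 0.0
    for lower, upper, proportion in CLASSES:
        # Length of the part of this class that lies inside (low, high).
        overlap = min(high, upper) - max(low, lower)
        if overlap > 0:
            total += proportion * overlap / (upper - lower)
    return total


print("P(10 < Y < 20) is about", round(interval_probability(10, 20), 4))
print("P(Y > 78) is about", round(interval_probability(78, 100), 4))
print("P(50 < Y < 87) is about", round(interval_probability(50, 87), 5))
```

Because the probability of any exact value is taken to be zero, the same function serves whether the range is written with < or ≤.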
However, these probabilities will be only approximately correct, since the histogram of figure 5.2 is drawn to a rather crude scale on wide class intervals. Its true shape may be something like the smooth curve of figure 6.6. It would then be more difficult to calculate the area corresponding to the interval 10 < Y < 20, but there are methods which enable it to be done to any required degree of accuracy so that tables can be produced. We will return to this idea in the next chapter.

Figure 5.3(b). Estimating probabilities: the shaded area of the histogram corresponds to P(50 < Y < 87) (horizontal axis: time in seconds).

5.5 Random sampling and random number tables
We can now return to the issue of random selection and explain in more detail what we mean by a 'random sample'. We have already indicated that 'random' does not mean 'haphazard' or 'without method'. In fact the selection of a truly random sample can be achieved only by closely following well-defined procedures. For illustration, let us suppose that we wish to select a sample of three subjects from a population of eight. (Obviously this situation would never arise in practice, but keeping the sample and population sizes small simplifies the explanation.) Random sampling or, in full, simple random sampling, is a selection process which in this case will ensure that every sample of three subjects has the same probability of being chosen. Suppose the eight subjects of the population are labelled A, B, C, D, E, F, G and H. Then there are 56 possible different samples of size 3 which could be selected.
Now suppose that 56 identical, blank discs are obtained and the three-letter code for a different sample inscribed on each disc. If the discs are now placed in a large drum or hat and well mixed, and then just one is chosen by blind selection, the three subjects corresponding to the chosen letter code would constitute a simple random sample.
The problem with this method is that, for even quite moderate sizes of sample and population, the total number of possible samples (and hence discs) becomes very large. For example, there are around a quarter-of-a-million different samples of size 4 which can be chosen from a population of just 50. It is impossible to contemplate writing each of these on a different disc just to select a sample. Fortunately there is a much quicker method to achieve the same result. Let us return, for the moment, to the example of choosing a sample of three from a population of eight. Take eight discs and write the letter A on the first disc, B on the second, etc., until there is a single disc corresponding to each of the letters A-H. Thoroughly mix the discs and choose one of them blindfold. Mix the remaining seven discs, choose a second, mix the remaining six and choose a third. It can be shown mathematically that this method of selection also gives each of the 56 possible samples exactly the same chance of being chosen. However, this is still not a practicable method. For very large populations it would require a great deal of work to prepare the discs and it may be difficult to ensure that the mixing of them is done efficiently. There is another method available which is both more practicable and more efficient.
Each member of the population is labelled with a number: 1, 2, ..., N, where N is the total population size. Tables of random numbers can then be used to select a random sample of n different numbers between
1 and N. The individuals corresponding to the chosen numbers will constitute the required random sample. A table of random numbers is given in appendix A (table A1). A portion of the table is given in table 5.5 and we will use this to demonstrate the procedure.

Table 5.5. Random numbers

44 59 62 26 82 51 04 19 45 98 03 51 50 14 28 02 12 29 88 87
85 90 22 58 52 90 22 76 95 70 02 84 74 69 06 13 98 86 06 50
44 33 29 88 90 49 07 55 69 50 20 27 59 51 97 53 57 04 22 26
47 57 22 52 75 74 53 11 76 11 21 36 12 44 31 89 16 91 47 75
03 20 54 20 70 56 77 59 95 60 19 75 29 94 11 23 59 30 14 47

Suppose we wish to select a random sample of ten subjects from a total population of 7,832. The digits in the table should be read off in non-overlapping groups of four, the same as the number of digits in the total population size. It does not matter whether they are read down columns or across rows; we read across for this example. The first 14 four-digit numbers created in that way are:

4459 6226 8251 0419 4598 0351 5014
2802 1229 8887 8590 2258 5290 2276

and would lead to the inclusion in the sample of the individuals numbered 4459, 6226, 419, 4598, 351, 5014, 2802, 1229, 2258, 2276 and 5290. The numbers 8251, 8887 and 8590 are discarded since they do not correspond to any individual of the population (which contains only 7,832 individuals). If a number is repeated it is ignored after the first time it is drawn. It is not necessary actually to write numbers against the names of every individual in the population before obtaining the sample. All that is required is that the individuals of the sample are uniquely identified by the chosen random numbers. For example, the population may be listed on computer output, each page containing 50 names. The random number 4459 then corresponds to the ninth individual on page 90 of the output.
Similarly, 25 random words could be chosen from a book of 216 pages in which each page contains 30 lines and no line contains more than 15 words, as follows. Obtain 25 random numbers between 001 and 216 (pages); for each of these obtain one random number between 01 and 30 (lines on a page) and one random number between 01 and 15 (words in the line). This time, if a page number is repeated, that page should be included both (or more) times in the sample. Only if page, line and word number all coincide with a previous selection (i.e. the very same word token is repeated) should the choice be rejected.
It is good practice not to enter the random number tables always at the same point but to use some simple rule of your own for determining the entry point, based on the date or the time, or any quantity of this kind which will not always have the same value on every occasion you use the tables. Many calculators have a facility for producing random numbers, which can be useful if tables are not available.
Simple random sampling is not the only acceptable way to obtain samples. Although it does lead to the simplest statistical theory and the most direct interpretation of the data, it may not be easy to see how to obtain a simple random sample of, say, a child's utterances or tokens of a particular phonological variant. This is discussed briefly in chapter 7. However a sample is collected, if the result is to be a simple random sample it must be true that all possible samples of that size have exactly the same chance of being selected.

SUMMARY
This chapter has introduced the concept of probability and shown how it can be measured both empirically and, in some situations, theoretically.
(1) The probability of a particular outcome to a sampling experiment was identified with the expected relative frequency of the outcome.
(2) The concept of statistical independence was discussed; to say that two events, X and Y, are independent is equivalent to the statement that P(X and Y both true) = P(X) × P(Y).
(3) The conditional probability of one event given that another has already occurred was defined as P(X | Y) = P(X and Y)/P(Y).
(4) If two events, X and Y, are independent, then P(X | Y) = P(X) and P(Y | X) = P(Y); that is, the conditional probabilities have the same values as the unconditional probabilities.
(5) The concept of a probability distribution was introduced. For discrete variables, the probability distribution can be presented as a table; for continuous variables it takes the form of a histogram and the probability that the variable lies in a certain range can be identified as the area of the corresponding part of the histogram.
(6) It was demonstrated how a simple random sample can be selected from a finite population with the help of random number tables.

EXERCISES
(1) Replicate yourself the experiment whose results are tabulated in table 5.1. Add the result from your 100 trials to table 5.1, and recalculate the mean.
(2) Using datum 41 as the entry point (you will find it in appendix A, table A1, 6th row, 4th column) and using this book as your data source, list the sample of 25 words suggested by the procedure on page 74.
(3) Using the probability distribution of family size in table 5.3, find the probability that a randomly chosen family has:
(a) more than 3 children
(b) fewer than 4 children
(c) at least 2 but no more than 5 children
(4) Estimate from figure 5.2 the following:
P(Y > 23)          P(50 ≤ Y < 60)
P(14 ≤ Y < 92)     P(91 ≤ Y < 96)
(5) Calculate from table 5.3 the following:
P(X ≤ 4)
P(0 < X ≤ 4)
(6) Calculate from the data for population B in table 5.2:
P(male | bilingual)
P(female | bilingual)

6 Modelling statistical populations

We pointed out in chapter 4 that the solution of many of our problems will depend on our ability to infer accurately from samples to populations. In chapter 5 we introduced the basic elements of probability and argued that it is by means of probability statements concerning random variables that we will be able to make inferences from samples to populations. In the present chapter we introduce the notion of a statistical model and describe one very common and important model.
We should say at the outset that the models with which we are concerned
here are not of the kind most commonly met in linguistic studies. They
are not, for instance, like the morphological models proposed by Hockett
(1954); nor do they resemble the psycholinguists' models of speech production
and perception. The models discussed in this chapter are statistical
models for the description and analysis of random variability. No special
mathematical knowledge or aptitude is required in order to understand
them.

6.1 A simple statistical model

Statistical models are best introduced by means of an example.
In chapter 1 we discussed in detail a study which looked at the voice
onset time (VOT) for word-initial plosives in the speech of children in
repeated samples over an eight-month period. For our present purpose
we will consider only the VOTs for one pair of stop targets, /t/ and /d/,
for one child at 1;8. To make our exposition easier, we will also assume
that the tokens were in the same environment (in this case preceding /u:/).
Look at table 6.1. The fictitious data displayed there are what one would
expect to see only if an individual's VOT for a particular element in a
certain environment were always precisely the same, i.e. if the population
of an individual's VOTs for that element in that environment had a single
value. Such VOTs would be like measurements of height or arm length.
Provided that the measurement is very accurate, we do not have to measure
the length of a person's arms over and over in order to know whether the left arm is longer than the right. In the same way, we would not have to take repeated measures of an individual's VOTs for /d/ and /t/ targets in a specific environment. In the case of the child, on the basis of a single accurate measurement of each, we would be able to say that the population VOT for /d/ (14.25 ms) is shorter than that for /t/ (22.3 ms) in the environment /_u:/. Put another way, it would be clear that the sample /d/ VOT and the sample /t/ VOT do not come from the same statistical population.

Table 6.1. Hypothetical sample of VOTs in absence of variation

VOT for /d/    VOT for /t/
14.25          22.3
14.25          22.3
14.25          22.3
14.25          22.3
14.25          22.3
14.25          22.3
14.25          22.3
14.25          22.3

Table 6.2. Hypothetical, but realistic, sample of VOTs from a single subject

VOT for /d/    VOT for /t/
17.05          16.81
11.70          24.32
18.09          20.17
15.78          28.31
13.94          18.27
14.52          21.03
16.74          17.94
16.16          19.37

But of course VOTs are not like that. The data in table 6.2, though again invented, are much more realistic. This time there is considerable variation amongst /d/ VOTs and amongst /t/ VOTs in the same environment. As a result, it is no longer possible to make a simple comparison between a single /d/ VOT and a single /t/ VOT and come to a straightforward decision as to whether the /d/ VOTs and the /t/ VOTs came from the same population. In order to make this decision we will have to infer the structure of the populations from the structure of the samples.

If it were possible ever to obtain the complete population of /d/ VOTs for this child we could then calculate the mean VOT for the population. Let us designate it by $\mu$. (It is customary for population values to be represented by Greek characters and for sample values to be represented by Roman characters.) Any individual value of VOT could then be understood as the sum of two elements: the mean, $\mu$, of the population plus the difference, $\varepsilon$, between the mean value and the actual, observed VOT. A sample of VOTs, $X_1, X_2, \ldots, X_n$, could then be expressed as:

$$
\begin{aligned}
X_1 &= \mu + \varepsilon_1 \\
X_2 &= \mu + \varepsilon_2 \\
&\;\;\vdots \\
X_n &= \mu + \varepsilon_n
\end{aligned}
$$

$\mu$ is often called the 'true value' and $\varepsilon_i$ the 'error in the i-th observation'.¹ Neither the word 'true' nor the word 'error' is meant to imply a value judgement. We suggest a more neutral terminology below.

¹ This is one of the many examples of statistical terminology appropriate to the context in which a concept or technique was developed being transferred to a wider context in which it is inappropriate or, at least, confusing. Scientists such as Pascal, Laplace and, particularly, Gauss in the second half of the eighteenth, and first part of the nineteenth, centuries were concerned with the problem of obtaining accurate measurements of physical quantities such as length, weight, density, temperature, etc. The instruments then available were relatively crude and repeated measurements gave noticeably different values. In that context it seemed reasonable to propose that there was a 'true value' and that a specific measurement could be described usefully in terms of its deviation from the true value. Furthermore, it seemed intuitively reasonable that the mean of an indefinitely large number of measurements would be extremely close to the true value provided the measuring device was unbiassed. By analogy, the mean of a population of measurements is often referred to as the 'true' value (cf. the 'true test score' in chapter 13) and any deviation from this mean as an 'error' or 'random error'.

Any individual (observed) value can then be seen as being made up of two elements: the true value (the mean of the population), and the distance, or deviation, of the observed value from the true value. To illustrate this, let us imagine that the mean of the population of the child's /d/ VOTs in the stated environment is 14.95. (Of course, in fact, the population mean can never be known for certain without observing the entire population, something which is impossible since there is no definite limit to the number of tokens of this VOT which the child might express.) So $\mu = 14.95$. If we take the observed /d/ VOTs in table 6.2, these can be restated as follows:

14.95 + 2.10
14.95 - 3.25
14.95 + 3.14
(etc.)
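The restatement of each observed value as population mean plus deviation can be checked numerically. The following is only a minimal sketch in Python, which is not part of the original text: it assumes the /d/ values as read from table 6.2 and the illustrative population mean of 14.95 used above, and the names `MU`, `d_vots` and `residuals` are hypothetical.

```python
# Illustrative sketch: restate each observed /d/ VOT as MU + deviation.
# MU is the population mean assumed for illustration (14.95 ms);
# d_vots holds the /d/ column of table 6.2 as read from the table.
MU = 14.95
d_vots = [17.05, 11.70, 18.09, 15.78, 13.94, 14.52, 16.74, 16.16]

# Deviation = observed value minus the assumed population mean,
# rounded to the two decimal places used in the table.
residuals = [round(x - MU, 2) for x in d_vots]

for x, e in zip(d_vots, residuals):
    sign = "+" if e >= 0 else "-"
    print(f"{x:5.2f} = {MU} {sign} {abs(e):.2f}")
```

The first three printed lines correspond to the three restatements listed above; the remaining five complete the '(etc.)'.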

The second element, the error, indicates the position of each observed value relative to the true value (or mean). It will be represented by the symbol $\varepsilon$ and its value may be either positive or negative. The division into true value and error can be extended to the population as a whole; any possible VOT for a /d/-target which might ever be pronounced by this child in this context can be represented as:

$$X = \mu + \varepsilon$$

And that is an example of a statistical model. However, this definition of the model is not complete until some way is found to describe the variation in the value of $\varepsilon$ from one VOT to another.

Returning to the child's VOTs, we use the model in this case to restate our problem. Is the mean of the population of /d/ VOTs the same as the mean of the population of /t/ VOTs? Does $\mu_d$ equal $\mu_t$?² If so, we assume that the /d/ VOTs and /t/ VOTs are members of the same population and that the child is not distinguishing between /d/ and /t/ in terms of VOT in the specified environment.

² It will be chapter 11 before we finally obtain the answer to this question. In the meantime, the reader is asked to accept that the truth will eventually be revealed and that the argumentation which will cause the delay is necessary to a proper understanding of the statistical methods used in obtaining the answer.

The example we have used in this chapter has concerned one individual providing a number of VOT values. But the model we have presented (population mean plus error) can also be applied in cases where a number of individuals have each provided a single value.

6.2 The sample mean and the importance of sample size

So far we have said that a population may be modelled by considering each of its values expressed as $\mu + \varepsilon$. In the final section of chapter 4 we discussed some of the inferences we might want to make from a sample to a population. We may often wish to extract information from the sample about the value of the population mean, $\mu$. Suppose now that we have a sample, $X_1, X_2, \ldots, X_n$, of $n$ values from some population. It seems reasonable to imagine that there will be a more or less strong resemblance between the sample mean and the population mean. In particular, it appears to be a common intuition that in a 'very large' sample the sample mean should have a value 'very close' to the population mean, $\mu$. Let us explore this intuition by considering a sample of just five observations:

$$X_1 = \mu + \varepsilon_1,\quad X_2 = \mu + \varepsilon_2,\quad X_3 = \mu + \varepsilon_3,\quad X_4 = \mu + \varepsilon_4,\quad \text{and}\quad X_5 = \mu + \varepsilon_5$$

Now, for the sample mean, $\bar{X}$, we have:

$$
\begin{aligned}
\bar{X} &= \frac{1}{5}\sum_{i=1}^{5} X_i \\
&= \frac{1}{5}\left[X_1 + X_2 + X_3 + X_4 + X_5\right] \\
&= \frac{1}{5}\left[(\mu + \varepsilon_1) + (\mu + \varepsilon_2) + (\mu + \varepsilon_3) + (\mu + \varepsilon_4) + (\mu + \varepsilon_5)\right] \\
&= \frac{1}{5}\left[5\mu + (\varepsilon_1 + \varepsilon_2 + \varepsilon_3 + \varepsilon_4 + \varepsilon_5)\right] \\
&= \mu + \bar{\varepsilon} \qquad (\bar{\varepsilon}\ \text{signifies mean error})
\end{aligned}
$$

Clearly, the value of $\bar{X}$ can also be expressed as a true value plus an error, where the true value is still $\mu$, the population mean of the original Xs, and the error is the average value of the original errors. However, the mean of several errors is likely to be smaller in size than single errors, if only because some of the original errors will be negative and some will be positive so that there will be a certain amount of cancelling out. It would seem, then, that the mean of a sample will have the same true value about which we require information and will tend to have a smaller error than the typical single measurement. The larger the sample size, the smaller the error is likely to be. This is such a central concept to the foundations of statistical inference that it is worth studying it in some detail via a simple dice-throwing experiment.

A properly manufactured dice should be in the shape of a cube with each face marked by a different one of the numbers 1, 2, 3, 4, 5 or 6, and be perfectly balanced so that no face is more likely to turn up than any other when the dice is thrown. If we were asked to predict the proportion of occurrences of any one number, say 3, in the entire population of numbers resulting from possible throws of the dice, our best prediction, given that there are six faces, would be one-sixth. We would expect each of the six numbers to occur on one-sixth of the occasions that the dice was thrown. This can be represented in a bar chart (figure 6.1), which would be a model for the population bar chart, i.e. a bar chart derived from all possible throws of a perfect dice. It is possible to calculate what would be the population mean, $\mu$, of all the possible throws of this dice. Suppose the dice is thrown a very large number of times, $N$. Each possible value will also appear a very large number of times. Suppose the value
1 appears $N_1$ times, 2 appears $N_2$ times, and so on. The total score achieved by the $N$ throws will then be:

$$N_1 \times 1 + N_2 \times 2 + N_3 \times 3 + N_4 \times 4 + N_5 \times 5 + N_6 \times 6$$

and the mean will be:

$$
\begin{aligned}
\mu &= \frac{1}{N}\left(N_1 + 2N_2 + 3N_3 + 4N_4 + 5N_5 + 6N_6\right) \\
&= \frac{N_1}{N} + 2\left(\frac{N_2}{N}\right) + 3\left(\frac{N_3}{N}\right) + 4\left(\frac{N_4}{N}\right) + 5\left(\frac{N_5}{N}\right) + 6\left(\frac{N_6}{N}\right)
\end{aligned}
$$

If $N$ is an indefinitely large number, the model of figure 6.1 implies that each possible value appears in one-sixth of throws. In other words:

$$\frac{N_1}{N} = \frac{N_2}{N} = \frac{N_3}{N} = \frac{N_4}{N} = \frac{N_5}{N} = \frac{N_6}{N} = \frac{1}{6}$$

and:

$$\mu = \frac{1}{6} + \left(2 \times \frac{1}{6}\right) + \left(3 \times \frac{1}{6}\right) + \left(4 \times \frac{1}{6}\right) + \left(5 \times \frac{1}{6}\right) + \left(6 \times \frac{1}{6}\right) = 3.5$$

With a similar kind of argument it is possible to show that the standard deviation of the population of dice scores is $\sigma = 1.71$.

Figure 6.1. Bar chart for population of single throws of an ideal dice.

A real dice was thrown 1,000 times and the results are shown in figure 6.2. As we might expect from the model shown in figure 6.1, each of the values occurred with roughly equal frequency. The mean of the 1,000 values is 3.61 and the standard deviation is 1.62. The actual outcome is similar to what would be predicted by the model we constructed for the population of throws of a perfect dice.

Figure 6.2. Typical histogram for the scores of 1000 single dice throws.

Using the model of the previous section we could express the value, $X$, of any particular throw of the dice as:

$$X = \mu_X + \varepsilon$$

where $\mu_X = 3.5$³ and $\varepsilon$ takes one of the values 2.5 (X = 6), 1.5 (X = 5), 0.5 (X = 4), -0.5 (X = 3), -1.5 (X = 2) and -2.5 (X = 1).

³ We have written $\mu_X$ here to indicate that we wish to refer to the population mean of the variable $X$. We will shortly introduce more variables and, to avoid confusion, it will be necessary to use a different symbol for the population mean of each different variable. Note here how inappropriate is the term 'true value'. Although the mean of the population of dice scores is $\mu_X = 3.5$ it is never possible to achieve such a score on a single throw.

Furthermore, from the physical properties of the dice we would expect that each of the 'errors' is equally likely to occur. This is a particularly simple model for the random error component $\varepsilon$, viz. that all errors are equally likely. The model is adequately expressed by the bar chart of figure 6.3 (which is identical to that in figure 6.1 except that the possible values of $\varepsilon$ are marked on the horizontal axis rather than the possible scores), or by the formal statement:

'$\varepsilon$ can take the six values 2.5, 1.5, 0.5, -0.5, -1.5, or -2.5'

and, for any particular throw:

$$P(\varepsilon = 2.5) = P(\varepsilon = 1.5) = P(\varepsilon = 0.5) = P(\varepsilon = -0.5) = P(\varepsilon = -1.5) = P(\varepsilon = -2.5) = \frac{1}{6}$$
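The dice model lends itself to a quick numerical check. The sketch below is an illustration in Python and is not part of the original text: it assumes that the standard `random` and `statistics` modules are an acceptable stand-in for a physical dice, the seed is arbitrary, and all names (`FACES`, `mu`, `sigma`, `residuals`, `throws`) are hypothetical. Because the throws are simulated, the sample figures will not reproduce the 3.61 and 1.62 reported for the real dice.

```python
import random
import statistics

FACES = [1, 2, 3, 4, 5, 6]

# Population model for an ideal dice: every face has probability 1/6.
mu = sum(FACES) / len(FACES)        # population mean
sigma = statistics.pstdev(FACES)    # population standard deviation

# The 'errors' of the text: each possible score minus the population mean.
residuals = [face - mu for face in FACES]

# Simulated stand-in for the real dice: 1,000 pseudo-random throws.
rng = random.Random(0)              # arbitrary seed, only for repeatability
throws = [rng.randint(1, 6) for _ in range(1000)]
sample_mean = statistics.mean(throws)
sample_sd = statistics.stdev(throws)

# The sample mean equals mu plus the mean of the individual errors.
mean_residual = statistics.mean([t - mu for t in throws])

print(f"population mean = {mu}, population sd = {sigma:.2f}")
print("possible errors:", residuals)
print(f"simulated mean = {sample_mean:.2f}, simulated sd = {sample_sd:.2f}")
print(f"mu + mean error = {mu + mean_residual:.2f}")
```

The last line illustrates, for a sample of 1,000 rather than five, the earlier result that the sample mean is the population mean plus the mean error.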

This latter description of the model is another example of a probability distribution (see §5.3), the term for a list or a formula which indicates the relative frequency of occurrence of the different values of a random variable.

Figure 6.3. Bar chart of population of 'errors' for a single throw of a dice (horizontal axis: value of error).

It is already obvious that the use of the word 'error' is hard to sustain, and we will from now on usually adopt the more neutral term residual, which indicates that $\varepsilon$ is the value remaining when $\mu$ has been subtracted ($\mu + \varepsilon - \mu = \varepsilon$).

A second experiment was carried out with the dice. This time, after every ten throws the mean was taken of the numbers occurring in these throws. Thus if the first ten throws were 2, 2, 6, 3, 4, 1, 2, 5, 2, 6, a mean score of 3.3 (33 ÷ 10) was noted. In this way, 1,000 numbers, each a mean of ten scores, were obtained. Let us call them $Y_1, Y_2, \ldots, Y_{1000}$. These numbers are a sample of a population of means, the means that would occur if the procedure were repeated indefinitely. Since there is effectively no limit to the number of times that the procedure could be repeated, the population of mean scores is infinite. The distribution, or histogram, of the population is known as the sampling distribution of the sample mean. The histogram of the sample of 1,000 mean scores is shown in figure 6.4.

Figure 6.4. Histogram of 1000 means of ten dice throws.

Note that the histogram is quite symmetrical; it is shaped rather like an inverted bell. Furthermore, the mean of these 1,000 sample means was 3.48 and their standard deviation was 0.62, which is about one-third of the standard deviation of the original population of scores of single throws. Each of the individual mean scores can be written as $Y_i = \mu_Y + \varepsilon_i$ where, as we have shown above, $\mu_Y = \mu_X = 3.5$ (since each $Y$ is the average of a sample of Xs) and the residuals are means of the residuals of single scores and thus will generally be smaller than those for single scores. This seems to be borne out by the smaller value of the standard deviation, which indicates that the $Y$ values are less spread out than the $X$ values. The histogram of figure 6.4 shows this feature quite clearly when it is compared with figure 6.2.

The whole experiment was repeated several times, using a different sample size each time. In every experiment 1,000 mean scores were obtained; the means and standard deviations of the 1,000 scores for each sample size are recorded in table 6.3. It can be seen that as the sample size is increased the standard deviation of the sample mean decreases, indicating that the larger the sample size, the closer the sample mean is likely to be to the true value. There is, indeed, a simple relationship which can be demonstrated theoretically between the standard deviation of a population of sample means and the standard deviation of the population of single scores. To obtain the standard deviation for sample means of samples of size $n$ the standard deviation of single scores should be divided by the square root of $n$. For example, in this case the population standard deviation of single scores is 1.71. For the population of sample means based on samples of size 10, the standard deviation will be $1.71 \div \sqrt{10} = 0.54$. (The sample of 1,000 such sample means had a sample standard deviation of 0.62.) Other examples appear in table 6.3.

The results of the series of experiments support the intuition that the sample mean should be 'something like' the population mean and that
the bigger the sample, the closer the sample mean will tend to be to the population mean. However, we would like to be more specific than this. For example, we would like to be able to calculate a sample mean from a single sample and then say how close we believe it to be to the true value. Alternatively, it would be useful to know what size of sample we need to attain a particular accuracy. In order to answer such questions we need a model to describe the way that the value of the residual, $\varepsilon$, might vary from one measurement to another.

Table 6.3. The mean and the standard deviation of the sample mean

Number of throws      Typical sample of 1000 scores     Population of scores
averaged for          Mean     Standard deviation       Mean (μ)   Standard deviation
each score
     1                3.61     1.62                     3.5        1.71
    10                3.47     0.62                     3.5        1.71/√10 = 0.54
    25                3.49     0.29                     3.5        1.71/√25 = 0.34
   100                3.51     0.14                     3.5        1.71/√100 = 0.17
   400                3.50     0.098                    3.5        1.71/√400 = 0.086
  1000                3.50     [illegible]              3.5        1.71/√1000 = 0.054

6.3 A model of random variation: the normal distribution

The model to be presented in this section is one for random variation. This term is in general use for those variations in repeated measurements which we seem unable to control. For example, the VOTs in table 6.2 are all different, though they purport to be several measurements of the same quantity all obtained under similar conditions, and they vary according to no recognisable or predictable pattern. It is not easy to define randomness. Its essential quality is lack of predictability. If we think of the child producing tokens of /d/ in the specified environment, there is no way in which we can predict in advance precisely what the VOT of the next token will be. In this sense, the variation in VOTs is random.⁴

⁴ Even if there were a discernible pattern, attributable perhaps to the effect of fatigue, it would still be impossible to make precise predictions about future VOTs; there would still be a random element. It is simpler at this stage of our exposition to deny ourselves the luxury of introducing a third element, such as fatigue effect, into our model of VOT populations.

Random variation can take many forms. The histogram of a population of measurements could be symmetrical with most of the values close to the mean, or skewed to the right with most values quite small but with a noticeable frequency of larger-than-average values. We have already discussed in chapter 2 the possibility that a histogram could be U-shaped or bi-modal, in which case most values would be either somewhat larger or somewhat smaller than the mean and very few will be close to the mean value. With this range of diversity, is it possible to formulate a general and useful model of random variation?

In figure 6.5 we have superimposed the histograms for several of our dice experiments. As we increase the number of throws whose mean is calculated to give a score, the histogram of the scores becomes more peaked and more bell-shaped. It is a fact that, even if the histogram of single scores had been skewed, or U-shaped, or whatever, the histograms of the means would still be symmetrical and bell-shaped for large samples. Furthermore, it can be demonstrated theoretically that, for large samples, the histogram of the sample mean will always have the same mathematical formula irrespective of the pattern of variation in the single measurements that are used to calculate the means. The formula was discovered by Gauss about two centuries ago and the corresponding general histogram (figure 6.6) is still often called the Gaussian curve, especially by engineers and physicists. During the nineteenth century the Gaussian curve was widely used, in the way that we describe below and in succeeding chapters, to analyse statistical data. Towards the end of that century other models were proposed for the analysis of special cases though the Gauss model was still used much more often than the others. Possibly as a result of this it became known as the normal curve or normal distribution and this is how we will refer to it henceforth.

We have here an example of a very stable and important statistical phenomenon. If samples of size $n$ are repeatedly drawn from any population, and the sample means (i.e. the means of each of the samples) are plotted

in a histogram, we find the following three things happen, provided that $n$, the sample size, is large: (1) the histogram is symmetrical; (2) the mean of the set of sample means is very close to that of the original population; (3) the standard deviation of the set of sample means will be very close to the original population standard deviation divided by the square root of the sample size, $n$.

Figure 6.5. Histogram of means of different numbers of dice throws (curves labelled '25 throws' and '100 throws').

Figure 6.6. The normal, or Gaussian, curve.

We can go further than this. If the sample size, $n$, is large enough, then the histogram of the means of the samples of size $n$ can always be very closely described by a single mathematical formula, irrespective of the population from which the samples are drawn. The only differences will be (a) the position of the centre of the histogram will depend on the value of the original population mean, $\mu$; and (b) the degree to which it is peaked or flat depends on $\sigma$, the standard deviation of the original population; the larger $\sigma$ is, the more spread out will be the histogram; the smaller $\sigma$ is, the higher will be the peak in the centre.

This patterning of sample means allows us to develop a statistical model for the histogram of the population of sample means from any experiment. In order to construct such a model for a particular case we need to know the mean and standard deviation of the population from which each sample is drawn. Each such model histogram, which will exhibit the shared characteristics of the histograms in figure 6.5, will closely approximate the true population histogram of figure 6.6, provided that the sample size is 'large'. (We will have more to say in succeeding chapters about the meaning of 'large' in this context.)⁵

⁵ The discussion in the last few paragraphs can be summarised by what is known as the Central Limit Theorem: 'Suppose the population of values of a variable, $X$, has mean $\mu$ and standard deviation $\sigma$. Let $\bar{X}(n)$ be the mean of a sample of $n$ values randomly chosen from the population. Then as $n$ gets larger the true histogram of the population of all the possible values of $\bar{X}(n)$ becomes more nearly like the histogram of a normal distribution with mean $\mu$, and standard deviation $\sigma/\sqrt{n}$. This result will be true whatever is the form of the original population histogram of the variable, $X$.'

The normal distribution is basic to a great deal of statistical theory which assumes that it provides a good model for the behaviour of the sample mean. It is this which will allow us to give answers to some of the various problems which we set ourselves in chapter 4. Before we can do this, however, we must learn to use the tables of the normal distribution.

6.4 Using tables of the normal distribution

We said in the previous section that the normal distribution is a good model for the statistical behaviour of the sample mean. We will use it in this way in future chapters. But this is not its only use. It turns out that any variable whose value comes about as the result of summing the values of several independent, or almost independent, components can be modelled successfully as a normal distribution. The size of a plant, for example, is likely to be determined by many factors such as the amount of light, water, nutrients and space, as well as its genetic make-up, etc. And it is indeed true that the distribution of the sizes of a number of plants of the same species grown under similar conditions can be modelled rather well by a normal distribution, i.e. a histogram representing the
sizes of plants of a certain species will have the characteristic 'normal' shape. This is true of many biological measurements.⁶ It is also often true of scores obtained on educational and psychological tests. Certainly tests can be constructed in such a way that the population of test scores will have a histogram very like that of a normal distribution. We need not bother how such tests are constructed. For the rest of this chapter we will simply assume that we have such a test.

⁶ Equally, it is not true of many variables. The distribution of income or wealth in many societies is usually skewed, with the great bulk of individuals receiving less than the mean income since the mean is inflated by a few very large incomes. A similar effect can often be seen in the distribution of the time required to learn a new task - a few individuals will take very much longer than the others to learn a new skill. It ought not to be difficult to think of other examples.

The test we have in mind is a language aptitude test, that is, a test designed to predict individuals' ability to learn foreign languages. The distribution of test scores, over all the subjects we might wish to administer the test to, can be modelled quite closely by a normal distribution with a mean of 50 marks and a standard deviation of 10 marks. Suppose that we know that a score of less than 35 indicates that the test-taker is most unlikely to be successful on one of the intensive foreign language courses. We might wish to estimate the proportion of all test-takers who would score below 35. We can do this very easily using tables of the normal distribution. In the following italicised passage we say something about how the tables are constructed and this will help explain why they can be used in the way that they are, and in particular, why we do not need to have separate tables for every possible combination of mean and standard deviation. The reader may wish to skip this passage in order to see how the tables are used, returning to this point in the text only later.

A normal distribution can have any mean or standard deviation. We therefore cannot give tables of every possible normal distribution - there are simply too many possibilities. Fortunately, one of the exceptional and extremely useful properties of the normal model is that an observation from any normal distribution can be converted to an observation from the standard normal distribution by using the standardisation procedure described in §3.9. The population mean of a standardised variable is always zero and its standard deviation is 1. Let us try to see why this is true.

Suppose a variable, $X$, comes from a population with mean $\mu$ and standard deviation $\sigma$. First we change $X$ into a new variable, $Y$, by the rule:

$$Y = X - \mu$$

When we subtract a constant quantity from all the elements in a population, the same value is subtracted from the mean. So the mean value of the new variable, $Y$, will be equal to the mean value of $X$ minus $\mu$. That is:

$$(\text{mean of } Y) = (\text{mean of } X) - \mu = 0$$

The standard deviation of $Y$ will, however, be exactly the same as that of $X$. All the values of $X$ have been reduced by the same amount; they will still have the same relative values to one another and to the new mean value. In other words, subtraction of a constant quantity from all the elements of a population will not affect the value of the standard deviation (exercise 6.3). To complete the standardisation we have to change this standard deviation so that it will have the value 1. We can do that by dividing all the $Y$ values by the number $\sigma$, the standard deviation of $Y$ (and $X$). When a variable is divided by some number, the mean is divided by the same number. So if we write $Z = Y/\sigma$ it will be true that:

$$\text{mean of } Z = \frac{\text{mean of } Y}{\sigma} = \frac{0}{\sigma} = 0$$

Furthermore (see exercise 6.4):

$$\text{standard deviation of } Z = \frac{\text{standard deviation of } Y}{\sigma} = \frac{\sigma}{\sigma} = 1$$

By these two steps we have changed $X$, a variable with population mean $\mu$ and standard deviation $\sigma$, into $Z$, a variable whose mean is zero and standard deviation is unity, 1. Remember what the two steps are. From each value $X$ we subtract $\mu$, the mean of the population of $X$ values, and then divide the result by the population standard deviation, $\sigma$. As before, we can write the complete rule in a single formula:

$$Z = \frac{X - \mu}{\sigma}$$

$Z$ is called a standardised random variable whether or not the distribution of the original scores can be modelled successfully as a normal distribution. However, it is a special property of the normal distribution that if $X$ was normally distributed then $Z$ will also have a normal distribution. We say that $Z$ has the standard normal distribution. We can exploit all this to change questions about any normally distributed random variable into
Modelling statistical populations Exercises
equivalent questions about the standard normal variable and then use tables of that variable to answer the question. In other words, we do not need tables for every different normal distribution.

We want to know the proportion of test-takers we can expect to achieve a score of less than 35. To put this another way, if we choose a test-taker at random and obtain his test score, X, we wish to know the likelihood that the inequality X < 35 will be true. In order to answer this question, we will have to alter it until it becomes an equivalent question about the corresponding standardised score. (This is because, as was explained in the italicised passage, the tables we will use relate to the standard normal distribution.) This can be done as follows:

$X < 35$ is equivalent to $X - 50 < 35 - 50$ (subtracting the mean)
is equivalent to $X - 50 < -15$
is equivalent to $\frac{X - 50}{10} < \frac{-15}{10}$ (dividing by the standard deviation)
is equivalent to $\frac{X - 50}{10} < -1.5$

What have we done? We have altered X by subtracting the population mean and dividing by the standard deviation; we have standardised X and changed it into the standard normal variable Z. Thus:

$X < 35$ is equivalent to $Z < \frac{35 - 50}{10}$, i.e. $Z < -1.5$

This is another way of saying that a subject whose test score is less than 35 will have a standardised test score less than -1.5. (Note that the minus sign is extremely important.)

Table A2 in appendix A gives the probability that Z < -1.5. Notice that the table consists of several pairs of columns. The left column of each pair gives values of Z. The right column gives the area of the standard normal histogram that lies to the left of the tabulated value of Z. (The relationship between areas in histograms and probabilities was discussed in chapter 5.) The diagram and rubric at the head of the table should be helpful.

For the example we have chosen we find P = 0.0668. Hence we can say that about 7% (6.68%) of scores will be less than 35. The accuracy of this answer will depend on how closely the distribution of the population of test results can, in fact, be described by the normal distribution with the same mean and variance as the population. If the population of test scores has a distribution which cannot be modelled by a normal distribution, it would be inappropriate to use standard scores in this way, since Z would not have a standard normal distribution.

Summary

This chapter has discussed the concept of a statistical model.

(1) A model for a single measurement, X, was proposed: $X = \mu + \varepsilon$, where $\mu$ is the 'true' value or population mean and $\varepsilon$ the error, deviation from the mean or residual.
(2) It was argued that means of samples of measurements would be less variable than individual measurements.
(3) The sampling distribution of the sample mean was introduced: for any random variable X with mean $\mu$ and standard deviation $\sigma$, the variable $\bar{X}$, calculated from a sample of size n, will have the same mean $\mu$ but a smaller standard deviation, $\sigma/\sqrt{n}$. Furthermore, if n is large, $\bar{X}$ will have a normal distribution.
(4) It was shown how to use tables of the standard normal distribution to answer questions about any normally distributed variable.

Exercises

(1) (a) Using the procedure of exercise 5.2, choose a random sample of 100 words and find the mean and standard deviation of the sample of 100 word lengths.
(b) Divide the 100 words into 25 subsamples of 4 words each and calculate the mean word length of each subsample.
(c) Calculate the mean and standard deviation of the 25 subsample means.
(d) Discuss the standard deviations obtained in (a) and (c) above.
(2) Assuming that the 'true' mean VOT for /d/ for the observed child is 14.25, calculate the residuals for the /d/ VOTs of table 6.2.
(3) (a) Calculate the standard deviation of the /d/ VOTs of table 6.2.
(b) Calculate the standard deviation of their residuals (see exercise 6.2). Discuss.
(4) (a) Calculate the standard deviation of the /t/ VOTs of table 6.2.
(b) Divide each VOT by the standard deviation calculated in (a).
(c) Calculate the standard deviation of these modified values and discuss the result.
(5) A score, Y, on a test is normally distributed with mean 120 and standard deviation 20. Find:
(a) P(Y < 100) (b) P(Y > 140)

(c) P(Y < 130) (d) P(Y > 105)
(e) P(100 < Y < 130) (f) P(135 < Y < 150)
(g) the score which will be exceeded by 95% of the population.
(Hint: You may find it helpful to begin by drawing sketches similar to figure

7 Estimating from samples

Chapter 6 introduced the normal distribution and the table associated with it. In the present chapter we will show how to make use of these to assess how well population values might be estimated by samples. We return to the question of measuring the /d/ VOT for a child (1;8) discussed in the previous chapter. We introduced there a model for a specific token, X, of VOT expressed by the child, namely:

$$X = \mu + \varepsilon$$

which says that the value of the token can be considered as a mean (or 'true') value plus a random residual. If the value of $\mu$ were known we could use this single value as the /d/ VOT for the child and go on perhaps to compare it with the /t/ VOT (i.e. the mean of the population of /t/ VOTs) of the same child to decide whether the child is distinguishing between /d/ and /t/. In the present chapter we will consider the extent to which the value of $\mu$ can be estimated from a sample of /d/ VOTs. In chapter 11 we will return to the problem of comparing two different populations of VOTs.
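
To make the model concrete, the following minimal Python sketch simulates tokens as a 'true' value plus a random residual and prints the mean of samples of increasing size. The values MU = 15 and SIGMA = 5 (in milliseconds), the assumption that the residuals are normally distributed, and the function name `simulate_vots` are all illustrative assumptions; they are not measurements from the child discussed here.

```python
import random
import statistics


def simulate_vots(mu, sigma, n, rng):
    """Draw n tokens from the model X = mu + residual.

    The residual is assumed (for illustration only) to be normally
    distributed with mean 0 and standard deviation sigma.
    """
    return [mu + rng.gauss(0.0, sigma) for _ in range(n)]


rng = random.Random(1)   # fixed seed so that the run is repeatable
MU, SIGMA = 15.0, 5.0    # hypothetical 'true' mean and spread, in ms

for n in (10, 100, 1000):
    sample = simulate_vots(MU, SIGMA, n, rng)
    # The sample mean is our estimate of MU; it varies from run to run.
    print(n, round(statistics.mean(sample), 2))
```

Each printed mean differs from MU by a small random amount. How large that amount is likely to be, and how it depends on the sample size, is the subject of the rest of the chapter.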

7.1 Point estimators for population parameters

It has to be recognised at the outset that the question we have just posed concerns population means. Clearly we do not have direct access to the population means; all we can do is estimate them from the sample values available to us. The question is 'how?'

It seems intuitively reasonable to suppose that the mean value, $\bar{X}$, of a sample of /d/ VOTs will be similar to the mean, $\mu$, of the population of VOTs that the child is capable of producing. But how accurate is this intuition? How similar will the two values be? $\bar{X}$ has two mathematical properties which sanction its use as an estimator for $\mu$. First of all, it is unbiassed. In other words, in some samples $\bar{X}$ will be smaller than $\mu$, in some it will be larger, but on average the value will be correct;

the mean of an infinitely large number of such sample means would be the mean of the population, the very quantity which we wish to estimate. Second, $\bar{X}$ is a consistent estimator. This is the technical term used to describe an estimator which is more likely to be close to the true value of the parameter it is estimating if it is based on a larger sample. We have seen that this is the case for $\bar{X}$ in figure 6.5. The larger the sample size, the more closely the values of $\bar{X}$ will cluster around the population mean, $\mu$.

In fact, it is extremely common to use $\bar{X}$ as an estimator of the population mean, not just for VOTs. The mean of any sample can be used to estimate the mean of the population from which the sample is drawn. A single number calculated from a sample of data and used to estimate a population parameter is usually referred to as a point estimator (in opposition to the interval estimators introduced below). There are many instances of a sample value being used as a point estimator of the corresponding population parameter. The proportion, $\hat{p}$, of incorrectly marked past tense verbs in a sample of speech from a 2-year-old child is an unbiassed and consistent estimator of the proportion of such errors in the child's speech as a whole. The variance, $s^2$, of a sample of values from any population is likewise an unbiassed¹ and consistent estimator of the population variance, $\sigma^2$.

¹ In chapter 3 it was stated that, in the calculation of $s^2$, the sum of squared deviations is divided not by the sample size, n, but by (n - 1). The main reason for this is to ensure that $s^2$ is an unbiassed estimator of $\sigma^2$. If n were used in the denominator the sample variance would, on average, underestimate the population variance, though the discrepancy, or bias, would be negligible in large samples.

7.2 Confidence intervals

Although a point estimator is an indicator of the possible value of the corresponding population parameter, it is of limited usefulness by itself. Its value depends on the characteristics of a single sample; a new sample from the same population will provide a different value. It is therefore preferable to provide estimates which take into account explicitly this sampling variability and state a likely range within which the population value may lie. This is the motivation behind the idea of a confidence interval. We will illustrate the concept by considering again the VOT problem discussed at the beginning of the chapter.

Suppose that we have a sample of 100 /d/-target VOTs from a single child and find that the sample has a mean value of $\bar{X}$ = 14.88 ms and a standard deviation s = 5.00 ms. Let us suppose further that the population of /d/ VOTs which could be produced by the child has mean $\mu$ and standard deviation $\sigma$, though of course we cannot know what these values are. Now we discovered in the previous chapter that, for reasonably large samples of size n, $\bar{X}$ is a random observation taken from a normally distributed population of possible sample means. The mean of that population is also $\mu$ but its standard deviation is $\sigma/\sqrt{n}$. Again, we know from the characteristics of the normal distribution discussed in chapter 6 that it follows that 95% of samples from this population will have a mean value, $\bar{X}$, in the interval $\mu \pm 1.96\sigma/\sqrt{n}$ (i.e. within 1.96 standard deviations of the true mean value). If $\sigma$ is small or n is very large, or both, the interval will be narrow, and $\bar{X}$ will then usually be 'rather close' to the population value, $\mu$. What we must do now is see just how close we can expect a sample mean of 14.88 to be to the population mean, given a sample size of 100 and an estimated $\sigma$ of 5. You may wish to skip the italicised argumentation that follows, returning to it only when you have appreciated the practical outcome.

We can first calculate the standard deviation of $\bar{X}$, the sample mean:

$$s_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{100}} = \frac{5}{10} = 0.5$$

Using the same argument as in 6.4 we know that:

$$\frac{\bar{X} - \mu}{0.5}$$

has a standard normal distribution and that², from table A3:

$$P(-1.96 < Z < 1.96) = 0.95$$

² Strictly speaking we have shown this to be true only if the population standard deviation is used, whereas here the sample standard deviation, usually called the standard error, has been used. However, it can be shown that the argument still holds, even when the sample value is used, provided the sample size is large: n = 100 ought to be large enough. Questions about sample size are discussed further in 7.5.

In other words:

$$P\left(-1.96 < \frac{\bar{X} - \mu}{0.5} < 1.96\right) = 0.95$$

But the inequality:

$$\frac{\bar{X} - \mu}{0.5} < 1.96$$

is the same as:

$$\bar{X} - \mu < (1.96 \times 0.5)$$
Similarly, the inequality:

$$-1.96 < \frac{\bar{X} - \mu}{0.5}$$

is the same as:

$$\mu < \bar{X} + (1.96 \times 0.5)$$

Therefore, in place of:

$$P(-1.96 < Z < 1.96) = 0.95$$

we can write:

$$P\{(\bar{X} - 1.96 \times 0.5) < \mu < (\bar{X} + 1.96 \times 0.5)\} = 0.95$$

Now, the value of $\mu$ is some fixed but unknown quantity. The value of $\bar{X}$ varies from one sample to another. The statement:

$$P\{(\bar{X} - 1.96 \times 0.5) < \mu < (\bar{X} + 1.96 \times 0.5)\} = 0.95$$

means that if samples of size 100 are repeatedly chosen at random from a population with $\sigma$ = 5 then for 95% of those samples the inequality will be true. In other words, the interval $\bar{X} \pm (1.96 \times 0.5)$ will contain the value $\mu$ about 95% of the time. This interval is called a 95% confidence interval for the value of $\mu$.

In the present example, the value 0.5 is just the standard deviation of $\bar{X}$ which we know to have the value $\sigma/\sqrt{n}$ in general. So (in general) we can say that:

$$\bar{X} \pm 1.96\frac{\sigma}{\sqrt{n}}$$

is a 95% confidence interval for $\mu$, the population mean. If you like, you may interpret this by saying that you are '95% certain' that $\mu$ lies inside the interval derived from a particular sample. In large samples (see 7.5) the sample standard deviation, s, can be used in place of $\sigma$ and the interval $\bar{X} \pm 1.96(s/\sqrt{n})$ will still be a 95% confidence interval for $\mu$.

In the case we have used to exemplify the procedure, we have $\bar{X}$ = 14.88 and the interval is 14.88 ± (1.96 × 0.5). Thus we are '95% sure' that the mean /d/ VOT of the child in his speech as a whole lies in the interval (13.90, 15.86) milliseconds. The sample standard deviation of the sample mean is called the standard error of the sample mean, and the 95% confidence interval is often written as: $\bar{X} \pm 1.96$ (standard error of $\bar{X}$). The term standard error is generally used to refer to the sample standard deviation of any estimated quantity.

7.3 Estimating a proportion

Another question raised in chapter 4 which we will deal with here concerned the estimating of proportions. How can we estimate from a sample of a child's speech (at a particular period in its development) the proportion of the population of tokens of correct irregular past tenses as opposed to (correct and incorrect) inflectionally marked past tenses, e.g. ran, brought, as opposed to runned, bringed, danced? In a sample of an English child's conversations during her third year, the following totals were observed:

Correct irregular past tense:       987
Inflectionally marked past tense:   372
Total:                             1359

The proportion ($\hat{p}$) of inflectionally marked past tenses in this sample is 372/1359 = 0.2737. Within what range would we expect the population proportion (p) to be? Just as with the mean ($\mu$) in the previous section, it will be possible to say that we can be 95% sure that the population proportion is within a certain range of values. Indeed, the question about proportions can be seen as one about means. Suppose that a score, Y, is provided for each verb where:

$Y_i = 1$ if the i-th verb is inflectionally marked
$Y_i = 0$ if the i-th verb is not inflectionally marked

The mean of Y is $\bar{Y} = \sum Y_i / n$ and this is just the proportion of verb tokens which are inflectionally marked.

Thus $\hat{p}$ is in fact a sample mean and we know, therefore, from the Central Limit Theorem, that the population of values of $\hat{p}$ will have a normal distribution for large samples. In order to calculate confidence intervals for p as we did in the previous section for $\mu$, we need to know the standard deviation of the population of sample proportions. As it turns out, there is a straightforward way of calculating this. At the same time, however, for a technical reason, the confidence limits for p are not calculated in quite the same way as for $\mu$. The reader may wish to avoid the explicit discussion of these complicating factors and go directly to the formula for determining confidence limits for p, which is immediately after the italicised passage.

We noted in the preceding paragraph that we need to know the standard deviation of the population of sample proportions. We could estimate the
sample variance of the sample of Y values, $s_Y^2$, and then use $s_Y/\sqrt{n}$ as the standard deviation of $\hat{p}$. We would then, for large n, have as a 95% confidence interval for the true value p, the interval $\hat{p} \pm 1.96\, s_Y/\sqrt{n}$.

In fact this procedure will do perfectly well for large values of n. On the other hand, it is the case that the value of $s_Y^2$ is always very close to $\hat{p}(1 - \hat{p})$. (It can be shown algebraically that:

$$s_Y^2 = \frac{n}{n-1}\,\hat{p}(1 - \hat{p})$$

and for large n the factor n/(n - 1) is almost exactly equal to 1 and can be ignored.) This means that we can avoid calculating $s_Y^2$ and that, as soon as we have calculated $\hat{p}$, we can write immediately that $s_Y^2 = \hat{p}(1 - \hat{p})$, so that $s_Y = \sqrt{\hat{p}(1 - \hat{p})}$ and $\bar{Y} = \hat{p}$. Thus a 95% confidence interval for p is:

$$\hat{p} \pm \left\{1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right\}$$

For our example:

$$\sum Y = 372$$

$$\bar{Y} = \frac{372}{1359} = 0.2737 = \hat{p}$$

$$s_Y^2 = \hat{p}(1 - \hat{p}) = \frac{372}{1359} \times \frac{987}{1359} = 0.1988$$

Hence, a 95% confidence interval for the true proportion of inflectionally marked past tenses in the speech of this child during this period is

$$0.2737 \pm 1.96\sqrt{0.1988 \div 1359} = 0.2737 \pm (1.96 \times 0.0121) = 0.2737 \pm 0.0237$$

i.e. (0.2500, 0.2974).

Unfortunately, for a technical reason which we will not explain here (see Fleiss 1981), this will give an interval which is a little too narrow. In other words, the probability that the true value of p lies inside the confidence interval will actually be a bit less than 95%. A minor correction ought to be used to adjust the interval to allow for this. The formula to give the correct 95% confidence interval is:

$$\hat{p} \pm \left\{1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} + \frac{1}{2n}\right\}$$

In the present case this means:

$$0.2737 \pm \left(0.0237 + \frac{1}{2718}\right)$$

i.e. (0.2496, 0.2978). This is now a more accurate 95% confidence interval for the true proportion. The correction has not made much difference here because the sample size was rather large. It would be more important in smaller samples.

7.4 Confidence intervals based on small samples

The second issue which we will deal with in this chapter concerns the estimates of population values on the basis of small samples. In chapter 4 mention was made of a study of syntactic and lexical differences between normal and language-impaired children in which Fletcher & Peters (1984) isolated a small group of children (aged 3-6 years) who failed the Stephens Oral Screening Test and who were also at least six months delayed on the Reynell Developmental Language Scales (Receptive). The children had hearing and intelligence within normal limits, were intelligible, and had not previously had speech therapy. Two hundred utterances were collected from each child in conversations under standard conditions and these were subjected to syntactic and lexical analysis. The purpose of the study was to make a preliminary identification of grammatical and lexical categories which might distinguish language-impaired children from normal. There were a number of categories examined in the study which discriminated the groups; for the purposes of this section we will consider only one, 'verb expansion'. This is a measure of the occurrence of auxiliary plus verb sequences in the set of utterances by a subject. The data for 'verb expansion' in the language-impaired group are shown in table 7.1. Given a sample mean here of 0.254, how close would we expect this to be to the mean score of the population from which the sample was drawn? How does the small sample size affect the way we go about establishing a confidence interval?

Table 7.1. Verb expansion scores of eight children

Child   Score
1       .235
2       .270
3       .265
4       .300
5       .320
6       .275
7       .105
8       .260

Because of the small sample size we cannot rely on the Central Limit Theorem, nor can we assume that the sample variance is very close to the true population variance (which is unknown). We can proceed from

this point only if we are willing to assume that the population distribution can be modelled by a normal distribution, i.e. that verb expansion scores over all the children of the target population would have a histogram close to that of a normal distribution. The validity of the procedure which follows depends on this basic assumption. Fortunately, many studies have shown that the procedure is quite robust to some degree of failure in the assumption. However, we should always remember that if the distribution of the original population were to turn out to be decidedly non-normal the method we explain here might not be applicable. In the present example there is no real problem. The score for each child is the proportion of auxiliary plus verb sequences found in a sample of 200 utterances. As we have argued above, a sample proportion is a special type of sample mean and the Central Limit Theorem allows us to feel secure that a sample mean based on 200 observations will be normally distributed. Hence the verb expansion scores will meet the criterion of normality.

The method of calculating confidence intervals for small samples is, in fact, essentially the same as for larger samples. The difference stems from the fact that for smaller samples the quantity $(\bar{X} - \mu)/(s/\sqrt{n})$ does not have a standard normal distribution because $s^2$ is not a sufficiently accurate estimator of $\sigma^2$ in small samples even when, as here, we are assuming that the population is normally distributed. This means that we cannot establish a confidence interval by the argument used above. Fortunately, the distribution of $(\bar{X} - \mu)/(s/\sqrt{n})$ is known, provided the individual X values are normally distributed, and is referred to as the t-distribution. The t-distribution has a symmetric histogram like the normal distribution, but is somewhat flatter. For large samples it is virtually indistinguishable from the standard normal distribution. In small samples, however, the t-distribution has a somewhat larger variance than the standard normal distribution.

In order to calculate the 95% confidence interval we make use of the formula $\bar{X} \pm t\,s/\sqrt{n}$, in which t is the appropriate 5% value taken from the tables of the t-distribution (table A4). This t-value varies according to sample size. You will see that on the left of the tables there is a column of figures headed degrees of freedom which run consecutively, 1, 2, 3, etc. This is a technical term from mathematics which we shall not attempt to explain here (but see chapter 9). It is important to enter the tables at the correct point in this column, i.e. with the correct number of degrees of freedom. The appropriate number is (n - 1) (one less than the number of observations in the sample). Thus, even without understanding the concept of degrees of freedom, you can see that by doing this we take into account the size of the sample and hence the shape of the distribution for that sample size. Since in the present case there are eight observations, the t-value will be based on 7 df. We enter the table at 7 df and move to the right along that row until we find the value in the column headed '5%'. The value found there (2.36) is the 5% t-value corresponding to 7 df and may be entered into the formula presented above.

So the 95% confidence interval is:

$$0.254 \pm \left(\frac{0.0653}{\sqrt{8}} \times 2.36\right) = 0.254 \pm 0.054$$

i.e. from 0.200 to 0.308; we are 95% confident that the true mean verb expansion score for the population from which the eight subjects were drawn lies between 0.200 and 0.308. Note that if we had constructed the confidence interval, incorrectly, using the standard normal tables we would have calculated the 95% confidence interval to be (0.209, 0.299), thus overstating the precision of the estimate. The smaller the sample, the greater would be this overstatement.

7.5 Sample size

At various points in the book we have already referred to 'large' and 'small' samples and will continue to do so in the remaining chapters. What do we mean by a large sample? As in any other context, 'large' is a relative term and depends on the criteria used to make the judgement. Let us consider the most important cases.

7.5.1 Central Limit Theorem

In the discussion of the theorem in chapter 6 we pointed out on several occasions that it is only in 'large' samples that we can feel secure that the sample mean will be normally distributed. The size of sample required depends on the distribution of the individual values of the variable being studied and frequently not much will be known about that. If the variable itself happens to be normally distributed then the mean of any size of sample, however small, will have a normal distribution. If the variable is not normal but has a population histogram which has a single mode and is roughly symmetrical, i.e. is no more than slightly skewed, then samples of 20 or so will probably be large enough to ensure the normality of the sample mean. Only when the variable in question is highly skewed or has a markedly bi-modal histogram will much larger samples be required. Even then, a sample of, say, 100 observations ought to be large enough.
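
The guidance above can be explored by simulation. The Python sketch below draws repeated samples from a highly skewed variable and measures how skewed the histogram of the sample means remains at several sample sizes. The exponential distribution, the 2000 repetitions, and the helper names `skewness` and `sample_means` are assumptions chosen purely for illustration; they do not come from any data set discussed in this book.

```python
import random
import statistics


def skewness(values):
    """Moment-based skewness; close to 0 for a symmetric histogram."""
    centre = statistics.mean(values)
    spread = statistics.pstdev(values)
    return sum(((v - centre) / spread) ** 3 for v in values) / len(values)


def sample_means(n, reps, rng):
    """Means of `reps` independent samples of size n from a skewed variable."""
    return [
        sum(rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(reps)
    ]


rng = random.Random(7)  # fixed seed so that the run is repeatable
for n in (1, 5, 20, 100):
    means = sample_means(n, 2000, rng)
    # Skewness of the simulated sampling distribution of the mean.
    print(n, round(skewness(means), 2))
```

The printed skewness should shrink towards zero as n grows, which is the sense in which a 'large' sample makes the sample mean approximately normal even when the variable itself is far from normal.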
7.5.2 When the data are not independent

The above comments on sample size are relevant where the observations in the sample are independent of one another, an assertion impossible to sustain for certain types of linguistic data.

For example, suppose we wish to estimate, for a single child, the proportion of verbs marked for the present perfect in his speech. If we analyse a single, long conversational extract of the subject's speech it is perfectly possible that, following a question from his interlocutor like 'What have you been doing?', he will respond with a series of utterances that contain verbs marked for present perfect. Thus, if he produces, say, five consecutive verbs marked for present perfect, it may be argued that he has made only one choice and not five. There is a sense in which he has provided a single instance of the verb form in which we are interested, not five. If, on the other hand, we admit for analysis only every twentieth verb that the subject produces, we might more reasonably hold that each tense or aspect choice appearing in the samples is independent of the others, and could consider that the tokens we have selected constitute a random sample. What implications does this view of independence have for research in applied linguistics and the efficient collection and analysis of language data, particularly when they comprise natural speech?

Information is a commodity which has to be paid for like any other; and resources (money or time) available to purchase it will always be limited. A data set consisting of n related values of a variable always contains less information for the estimation of population means and proportions than does a sample of n independent observations of the same variable. (The extent of the loss of information will depend on how correlated - see chapter 10 - the observed values are.) Thus, if it is possible to obtain n independent values at the same cost as n interdependent values it will be more efficient to do so. In particular, if the values are not independent, much bigger samples than usual will be needed to assume that the sample means are normally distributed. Suppose we have decided to transcribe and analyse 100 utterances. (For simplicity, assume that each utterance contains only one main verb. The message will still be the same even if this is not true, but the argument is more complex.) We might: (a) record and transcribe the first 100 utterances; (b) record a total of 500 utterances and transcribe every fifth utterance. The transcription costs will be roughly equal in both cases but strategy (b) will require a recording period five times as long as (a). However, if it is still possible to carry out the recording in a single session the real difference in cost may not be very large. Even if this is not possible, the extra inconvenience of two or three sessions may well be repaid by data which are more informative in the sense that the standard errors of estimated quantities are smaller - i.e. confidence intervals are narrower and hypothesis tests more sensitive.³

³ Of course, there will be many occasions when an investigator may be interested in a sequence of consecutive utterances. We simply wish to point out that information on some variables may be collected rather inefficiently in that case.

Suppose, for the sake of illustration, that the subject, throughout his speech as a whole, puts 50% of his verbs in the present perfect. The probability that a randomly chosen verb in a segment of his speech is present perfect will then be 0.5. However, if the first verb in the segment is marked in this way then the probability that the next verb he utters is also present perfect will be greater than 0.5 (because of a 'persistence effect'). The fifth or tenth verb in a sequence is much less likely to be influenced by the first than is the immediate successor. Again to illustrate, let us suppose that the probability that the next verb is the same type as its immediate predecessor is 0.9. Now suppose that 100 verbs are sampled to estimate the proportion of present perfects uttered by the subject. We could choose 100 consecutive verbs, every other verb, every fifth verb, etc. Table 7.2 shows how the standard error of the estimated proportion decreases as a bigger gap is left between the sampled verbs. (There is no simple formula available to calculate these standard errors. They have been obtained by a computer simulation of the sampling process.)

Table 7.2. Standard errors from different sampling rules (sample size 50)

Verbs sampled      Standard error (%)
Consecutive        2.17
Every second       1.41
Every third        1.21
Every fourth       1.05
Every fifth        0.99
Every tenth        0.84
Every twentieth    0.73

7.5.3 Confidence intervals

There are two criteria involved here in the definition of 'large'. One is, as before, the question of how close to normal is the distribution of the variable being studied. The second is the amount of knowledge available about the population variance. Usually the only information about this comes from the sample itself via the value of the sample variance.
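
How much it matters that the variance has to be estimated from a small sample can be checked with the verb expansion scores of table 7.1. The Python sketch below computes the interval twice: once with the 5% t-value for 7 df (2.36, as quoted in 7.4) and once with the standard normal value 1.96. The helper name `interval` is hypothetical, and the scores are typed in as read from the table; the calculation works from the unrounded mean and standard deviation, so the limits may differ in the third decimal place from a hand calculation based on rounded values.

```python
import math
import statistics

# Verb expansion scores of the eight children (table 7.1).
scores = [0.235, 0.270, 0.265, 0.300, 0.320, 0.275, 0.105, 0.260]


def interval(values, critical):
    """Return (lower, upper) for: mean +/- critical * s / sqrt(n)."""
    n = len(values)
    centre = statistics.mean(values)
    # statistics.stdev divides by (n - 1), as in the definition of s.
    half_width = critical * statistics.stdev(values) / math.sqrt(n)
    return centre - half_width, centre + half_width


t_low, t_high = interval(scores, 2.36)  # 5% t-value for 7 df
z_low, z_high = interval(scores, 1.96)  # standard normal value

print("t-based interval:     ", round(t_low, 3), round(t_high, 3))
print("normal-based interval:", round(z_low, 3), round(z_high, 3))
```

The interval based on 1.96 comes out narrower than the one based on the t-value, which is the overstatement of precision warned against in 7.4.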

If the variable in question has a normal distribution, a confidence interval can be obtained for any sample size using the t-tables and the methods of the previous section. For samples of more than 50 or so the t-distribution is virtually indistinguishable from the standard normal and the confidence interval of 7.2 would be appropriate. If, on the other hand, there is some doubt about the normality of the variable, then it will not be possible to calculate a reliable interval for small samples. Biological measurements of most kinds can be expected to be approximately normally distributed. So can test scores, if the test is constructed so that the score is the sum of a number of items or components more or less independent of one another. Apart from that, certain types of variable are known from repeated study to have approximately the correct shape of distribution. If you are studying a variable about which there is some doubt, you should always begin by constructing a histogram of the sample if the sample size makes that feasible; gross deviations from the normal distribution ought to show up then (see also chapter 9). With smaller samples and a relatively unknown type of variable only faith will suffice, though it must be remembered that the accuracy of any conclusions we make may be seriously affected if the variable has an unusual distribution. However, whatever the form of the data it should be safe to rely on a confidence interval based on a sample of 100 or more observations and calculated as in 7.2.

7.5.4 More than one level of sampling

In many cases a study will require sampling at two different levels. In the verb expansion study discussed above a sample of children was first chosen and then a sample of utterances taken from each child. How should the experimenter's effort be distributed in such cases? Is it better to take a large number of utterances from a small number of children, or vice versa? Does it matter?

There is no single or simple answer to this question. Reduction in the value of data caused by lack of independence also occurs when the data are obtained from several subjects. Consider an example. In chapter 11 we discuss the relationship between age and mean length of utterances (MLU) in young children. Suppose we wished to estimate the MLU for children aged 2;6. We might look at many utterances from a few children or a few utterances from many children - will it matter? Clearly, repeated observations of the same child are likely to be related. A child may have a tendency to make utterances which are rather shorter or rather longer than average for his peer group. It is quite easy to demonstrate that if n utterances are to be used to estimate MLU for the whole age group then the most precise answer would be obtained by sampling one utterance from each of n children of that age. However, this will also be the most expensive sampling scheme. It will inevitably be cheaper to take more utterances from fewer children. Furthermore, by reducing the number of children it will almost certainly be possible to increase the total number of utterances. We may be able to use the same resources to analyse, say, 20 utterances from each of 25 children (500 utterances in total) or 75 utterances from each of 10 children (750 utterances in total) and it may be far from obvious which option would give the best results. It depends, in part, on the question being addressed: whether interest lies principally in variations between children, in variability in the speech of individuals, or in estimating the distribution of some linguistic variable over the population as a whole.

There is no room here to give a fuller discussion of this problem whose solution, in any case, requires a fair degree of technical knowledge. Reference and text books on sampling problems tend to be difficult for the layman largely because of the considerable variety of notation and terminology they employ and the level of technical detail they include. You should consult an experienced survey statistician before collecting large quantities of observational data of this kind.

7.5.5 Sample size to obtain a required precision

Let us return to the example of 7.2. Suppose we decide that we want to estimate the true average VOT for the child and that we want to be 95% sure that our estimate differs from the population value by no more than one millisecond. What size of sample ought we to take?

Another way of stating this requirement is that the 95% confidence interval for the true mean should be of the form $\bar{X} \pm 1$. On the other hand, when we obtained the 95% confidence interval in 7.2 it was $\bar{X} \pm (1.96(\sigma/\sqrt{n}))$. Hence we require that $1.96(\sigma/\sqrt{n}) = 1$. In this example we have estimated that $\sigma$ = 5, so we have $(1.96 \times 5)/\sqrt{n} = 1$, or:

$$\sqrt{n} = 1.96 \times 5 = 9.8$$

$$n = (9.8)^2 = 96.04$$

So, to meet the tolerance that we have insisted on we should need to obtain 96 or 97 randomly sampled /d/ VOTs - we would probably round up to n = 100.

We can obtain a general formula for the sample size in exactly the same way. Suppose that, with 95% confidence, we wish to estimate a population mean with an error no greater than d. We must choose the sample size n to satisfy:

$$\frac{1.96\sigma}{\sqrt{n}} = d$$

(this corresponds to $1.96\sigma/\sqrt{n} = 1$ in the last example)

or

$$1.96\sigma = d\sqrt{n}$$

or

$$\frac{1.96\sigma}{d} = \sqrt{n}$$

Thus the formula to choose the appropriate sample size to estimate a population mean with 95% confidence is:

$$n = \left(\frac{1.96\sigma}{d}\right)^2$$

where d indicates the required precision. The value obtained from this formula will rarely be a whole number and we would choose the next largest convenient number as we did in the example.

We notice that n will be large if (a) the value of d is small - we are insisting on a high degree of accuracy; and (b) $\sigma^2$, the population variance, is large - we then have to overcome the inherent variability of the population [...]

$$\hat{p} \pm 1.96\sqrt{\frac{p(1 - p)}{n}}$$

leaving out the special correction. We then have:

$$d = 1.96\sqrt{\frac{p(1 - p)}{n}}$$

or

$$d^2 = \frac{1.96^2\{p(1 - p)\}}{n}$$

or

$$n = \frac{1.96^2\{p(1 - p)\}}{d^2}$$

This formula seems to suffer from a difficulty similar to the previous one. We will not know the population value of p. Even if we decide to use $\hat{p}$, the estimate of p, in the expression for the standard deviation as we did previously, we will not know the value of $\hat{p}$ until after the sample is taken. However, it should not take you long to convince yourself that if p is a number between zero and 1, then p(1 - p) cannot be larger than 0.25 and this will happen when p = 0.5. We can then obtain a value of n which will never be too small by using p(1 - p) = 0.25. Thus the formula
lion. The value of dis the,experinTcmteLdHbwever; ui'sa prob~. fp.cho,~se .~H?.nservatively l~r~e.s~mplesize to estimate a population pr,opor.-
!em. Usually its value wiiJ not be.known:..Otoe 'way round>thiS'is!to take \i9t;>."1ith 9s'f,o confiderqfi~: . 'i i \ ; . . ' ' " .

a fairly small sample, say zo or 30, and calculate!direqtly.thesample.variarice'. :-' \\'' )I..g6Z'
"' We can then use s in place of a in the formula for the appropriate .n:= o.25--
. d'
"ample size, in the study proper.
Thus the formula to choose the aJ>propriatc.sample size,toestimate where d.'indibatesthe required precision., . '
a population mean with 95% confidence in ignorance of the population ... Suppose; .for example, that we wish to estimate what.pereentage oLthe.
variance is: population has a given characteristic and that, with 95% confidence, we
- (I.96s)'
n- --
wish to get the answer correct to 1% either way. That is the same as
d saying that we want a 95% confidence interval for the proportion, p, of
The problem is that our estimate of the population variance based on the form p o.or (remember that a percentage is just Ioo times a propor-
a rather small sample might be quite inaccurate, but this is usually the tion); so we need:
best we can do.
If we are estimating a proportion, -p, a. different kind.of solution"is ,.,.,, ,,_~n =,o.2 5-.-.-2
. (o.o1)
available. First, let us begin by writing the simplest form for the confidence
intcrval.for p, i.e. "= 9604 (a large snrnplc!)
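As a rough check on the arithmetic of this section, the two sample-size formulas can be evaluated directly. The sketch below is only an illustration: the function names `sample_size_for_mean` and `sample_size_for_proportion` are hypothetical and do not come from the text, and the default multiplier 1.96 is the 95% value used in the worked examples.

```python
import math


def sample_size_for_mean(sigma, d, c=1.96):
    """Return n = (c * sigma / d) ** 2, the sample size needed to estimate
    a population mean to within +/- d (c = 1.96 gives 95% confidence)."""
    return (c * sigma / d) ** 2


def sample_size_for_proportion(d, c=1.96):
    """Return the conservative n = 0.25 * c**2 / d**2, which uses the
    largest possible value of p(1 - p), namely 0.25 at p = 0.5."""
    return 0.25 * c ** 2 / d ** 2


# VOT example: sigma estimated as 5, required precision 1 millisecond.
n_mean = sample_size_for_mean(sigma=5, d=1)

# Proportion example: answer correct to 1% either way.
n_prop = sample_size_for_proportion(d=0.01)

# The formula rarely gives a whole number, so round up to the next integer.
print(round(n_mean, 2), math.ceil(n_mean))
print(round(n_prop))
```

With the figures from the text this gives about 96.04 for the mean (97 when rounded up to a whole observation) and 9604 for the proportion. Passing a pilot-sample value of s as `sigma` corresponds to the variant of the formula used when the population variance is unknown.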
If we require the answer only within 10% either way, then:

$$n = 0.25 \times \frac{1.96^2}{(0.1)^2} = 96$$

If the true value of $p$ is small (<0.2) or large (>0.8) then this procedure may greatly exaggerate the sample size required. If you suspect that this may be the case and sampling is expensive or difficult, then you should consult a statistician.

7.6 Different confidence levels
There is nothing sacred about the value of 95% which we have used throughout this chapter to introduce and discuss the concept of confidence intervals, though it is very commonly used. However, one experimenter may wish to quote a range of values within which he is '99% sure' that the true population mean will lie. Another may be content to be '90% confident' of including the true value. What will be the consequence of changing the confidence level in this way?
When the idea of a 95% confidence interval was introduced in 7.2, the starting point was the expression:

$$P(-1.96 < Z < 1.96) = 0.95$$

The number 1.96 was chosen from the tables of the normal distribution to fix the probability at 0.95 or 95%. This probability can be altered to any other value by choosing the appropriate number from table A3 to replace 1.96. For example, beginning from:

$$P(-2.5758 \le Z \le 2.5758) = 0.99$$

and carrying out a sequence of calculations similar to those in 7.2, we will then arrive at a 99% confidence interval, $14.88 \pm (2.5758 \times 0.5)$ or (13.59, 16.17) for the true population mean. The length of this interval is 2.58 marks (16.17 - 13.59), longer than the 95% interval. That is to be expected. In order to be more certain of including the true value we must include more values, thus lengthening the interval. On the other hand, a 90% confidence interval would be $14.88 \pm (1.6449 \times 0.5)$ or (14.06, 15.70), shorter than the 95% interval. Table 7.3 gives a range of confidence intervals all based on the same data.

Table 7.3. Confidence intervals with different confidence levels

Confidence level (%)    C         Confidence interval    Length of interval
50                      0.6745    (14.54, 15.22)         0.67
60                      0.8416    (14.46, 15.30)         0.84
70                      1.0364    (14.36, 15.40)         1.04
80                      1.2816    (14.24, 15.52)         1.28
90                      1.6449    (14.06, 15.70)         1.64
95                      1.9600    (13.90, 15.86)         1.96
99                      2.5758    (13.59, 16.17)         2.58
99.9                    3.2905    (13.23, 16.53)         3.29

$\bar{X} = 14.88$, $\sigma = 5$, $n = 100$
Note: The number C is obtained from table A3 to give the required confidence level.

Is it better to choose a high or a low confidence level? The lower the confidence level the more chance there is that the stated interval does not include the true value. On the other hand, the higher the confidence level, the wider will be the interval and wide intervals are less informative than narrow ones. We can be virtually 100% confident that the true value lies between 1.36 and 28.40, but that is hardly a useful statement. Some compromise must be reached between the level of confidence we might like and the narrow interval we would find useful. If a researcher knows, before carrying out an experiment, what level of confidence he requires, he can estimate the sample size, using the methods of the previous section, to obtain the desired width of interval. It is, again, simply a matter of choosing the appropriate number, C, from table 7.3 or table A3. Thus the general formula for choosing sample size to estimate a population mean is:

$$n = \left(\frac{C\sigma}{d}\right)^2$$

where, as before, $d$ is the required precision. In the examples worked through in 7.2 we have used C = 1.96 corresponding to a confidence level of 95%.

SUMMARY
This chapter has addressed the problem of using samples to estimate the unknown values of population parameters such as a population mean or a population proportion.
(1) Point estimators were introduced and it was suggested that the sample mean and the sample variance and the sample proportion would be reasonable point estimators for their population counterparts; all of these estimators are unbiassed and consistent.
(2) The concept of a confidence interval was explained and it was shown how to derive a 95% confidence interval for a population mean, $\mu$. For a large sample such an interval takes the form:

$$\bar{X} \pm 1.96\left(\frac{s}{\sqrt{n}}\right)$$
where $\bar{X}$ is the sample mean, $s$ is the sample standard deviation and $n$ is the sample size.
(3) A 95% confidence interval for the population proportion, $p$, was discussed and an example calculated, using the formula:

$$\hat{p} \pm \left\{1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} + \frac{1}{2n}\right\}$$

where $\hat{p}$ is the sample proportion and $n$ the sample size.
(4) The problem of obtaining a confidence interval for $\mu$ from a small sample was discussed and it was shown how this could be done provided the sample came from a normal distribution using the tables of the t-distribution and the formula:

$$\bar{X} \pm \frac{ts}{\sqrt{n}}$$

where $t$ is the 5% point from table A4 corresponding to the appropriate number of degrees of freedom.
(5) The issue of sample size was discussed with its relation to the Central Limit Theorem and the required precision of a confidence interval.
(6) It was shown how to calculate confidence intervals with different confidence levels.

EXERCISES
(1) (a) Using the data of table 3.1, calculate a 95% confidence interval for the mean length of utterance of the observed adult speaking to her child.
    (b) Calculate a 99% confidence interval.
    (c) Explain carefully what is the meaning of these intervals.
(2) Repeat exercise 7.1 using the data of table 3.3.
(3) In table 3.4 are given the numbers of tokens of words ending -ing and the number pronounced [n] by each of ten subjects.
    (a) Calculate 95% confidence intervals for subject 6 for the proportion of [n] endings in all such words the subject might utter (a) spontaneously, (b) when reading from a wordlist.
    (b) Repeat for subject 1.
    (c) Suggest reasons for the differences in the widths of the four confidence intervals.
(4) Ten undergraduate students are chosen at random in a large university and are given a language aptitude test. Their marks are:
    62, 39, 48, 72, 81, 51, 54, 59, 67, 44
    Calculate a 95% confidence interval for the mean mark that would have been obtained if all the undergraduate students of the university had taken the test.

8 Testing hypotheses about population values

8.1 Using the confidence interval to test a hypothesis
In the previous chapter the confidence interval was introduced as a device for estimating a population parameter. The interval can also be used to assess the plausibility of a hypothesised value for the parameter. Miller (1951) cites a study of the vocabulary of children in which the average number of words recognised by children aged 6-7 years in the USA was 24,000.¹ Suppose that the same test had been carried out in the same year on 140 British children of the same age and that the mean size of vocabulary recognised by that sample was 24,800 with a sample standard deviation of 4,200 words. How plausible is the hypothesis that the population from which the sample of British children was chosen had the same mean vocabulary as the American children of the same age? Admittedly the sample of British children had a higher mean vocabulary size, but many samples of American children would also have had a mean score of more than 24,000. We need to rephrase the question. The mean of a sample of British children is 24,800, not 24,000. Is it nevertheless plausible that the mean vocabulary of the British population of children in this age range could be 24,000 words and that the apparent discrepancy is simply due to sampling variation, so that a new sample will have a mean vocabulary size closer to, perhaps less than, 24,000?

¹ This value was itself based on a sample. However, for the moment we will treat it as if it were the population value, a reasonable enough procedure if the American figure was based on a much bigger sample than the British mean. It is explained in chapter 11 how to compare two sample means directly.

Let us begin by using the data obtained on the sample to calculate a 95% confidence interval for the mean vocabulary size of the whole population from which the sample was selected. Let us denote that mean vocabulary size by $\mu$. Following the procedure of 7.2 we obtain the interval:

$$\bar{X} \pm 1.96\frac{s}{\sqrt{n}}$$

i.e.

$$24800 \pm \frac{1.96 \times 4200}{\sqrt{140}} \quad \text{or} \quad (24104,\ 25496)$$

At this point we might remind ourselves of the meaning of a 95% confidence interval. If it has been calculated properly, there is only a 5% chance that it will not contain the value of the mean of the whole population from which the sample was chosen. So, if the mean vocabulary size for British children aged 6-7 years were in fact 24,000 words, then for 95% of randomly chosen samples, that is for 19 samples out of 20, the 95% confidence interval would be expected to include the value 24,000, while one time in 20 it would not. For the particular sample tested, it turns out that the interval does not include the value 24,000, so either the true mean is some value other than 24,000 or it really is 24,000 and the sample we have chosen happens to be one of the 5% of samples which result in a 95% confidence interval failing to include the population mean. There is no way of knowing which of these two cases has occurred. There will never be any way of knowing for certain what is the truth in such situations. However, it is intuitively reasonable to suggest that when a confidence interval (based on a sample) fails to include a certain value, then we should begin to doubt that the sample has been drawn from a population having a mean with that value. In the present case, we should begin to doubt that the sample was drawn from a population having a mean of 24,000.
On the basis of what we have said so far, it is possible to develop a formal procedure for the use of observed data in assessing the plausibility of a hypothesis. In particular, let us suppose that the hypothesis we wished to test was that the average size of vocabulary of British children of 6-7 years old was the same as that of their American counterparts. We could decide to make use of a 95% confidence interval based on a sample of British children.
For the moment let us imagine that we do not yet know what that confidence interval is. There are two possible conclusions we could reach on the basis of the confidence interval, depending on whether or not it included 24,000, the mean vocabulary of the American children. If the interval does include 24,000 we could conclude that the data were consistent with the hypothesis; if the interval does not include 24,000 we would doubt the plausibility of the hypothesis. A convention has been established by which, in the latter case, we would say that we reject the hypothesis as false, while if the interval contains the hypothesised value we simply say that we have no grounds for rejecting it. (This convention and its dangers are discussed further in 8.4, but let us adopt it uncritically for the time being.) Since the hypothesis must be either true or false, there are four possible outcomes, which are displayed in table 8.1. As can be seen, two of these outcomes will lead to correct assessment of the situation, while the other two cause a mistaken conclusion.

Table 8.1. Possible outcomes from a test of hypothesis

                                       True state
Result of sample                       Population mean is 24000    Population mean is not 24000
Interval includes the value 24000      Correct                     Error 2
Interval does not include 24000        Error 1                     Correct

The first type of error, referred to as a type 1 error, is due to rejecting the value 24,000 when it is the correct value for the population mean. The probability of this type of error is exactly the 5% chance that the true population mean (24,000) accidentally falls outside the interval defined by the particular sample we have chosen.

Table 8.2. Probabilities of type 2 error using a 95% confidence interval to test whether $\mu$ = 24000

              Sample size
True mean     n = 140    n = 500       n = 1000
23000         0.19       very small    very small
23500         0.71       0.24          0.04
23600         0.80       0.39          0.15
23700         0.87       0.64          0.35
23800         0.92       0.82          0.67
23900         0.95       0.92          0.89
24100         0.95       0.92          0.89
24200         0.92       0.82          0.67
24300         0.87       0.64          0.35
24400         0.80       0.39          0.15
24500         0.71       0.24          0.04
25000         0.19       very small    very small

(In all cases s = 4200)

The second type of error, known as a type 2 error, occurs when, although the population mean is no longer 24,000, that value is still included in the confidence interval. In general, the probability of making this type of error will not be known. It will depend on the true value, $\mu$, of the population mean, the sample size, $n$, and the population variance, $\sigma^2$. Sometimes it is possible to calculate it, at least approximately, for different possible values of the population mean. That is true here and some values are given in table 8.2. These values have been calculated assuming that the standard deviation, $s$, of the sample of vocabulary scores is the correct standard deviation for the population of all British children in the relevant age group. To that extent they are approximate. When the sample size is 140 it can be seen, for example, that if the true population mean is 23,000 there is a probability of about 19% that the hypothesis that $\mu$ = 24,000 would still be found acceptable, while if the mean really is 24,500 the probability of making this error is more than 70%. Table 8.2 also demonstrates that the probability of making the second kind of error depends greatly on the size of sample used to test the hypothesis.
Of course, we could decide to use a confidence interval with a different confidence level to assess the plausibility of the hypothesis that the British population mean score was $\mu$ = 24,000. In particular, we might decide to reject that hypothesis only if the value $\mu$ = 24,000 was not included inside the 99% confidence interval. In that case, the probability of making a type 1 error would reduce to 1% since there is now a 99% probability that the true value will be included by the confidence interval, so that if the true population mean is 24,000 there is only a 1% chance that it will be excluded from the interval. On the other hand, the probability of making a type 2 error will now be increased. A 99% confidence interval will always be wider than a 95% confidence interval based on the same sample of data and therefore it will have more chance of including the value 24,000 even when the population mean has some other value. Table 8.3 gives the probability of making this second kind of error for different true values of the population mean, $\mu$, and for three sample sizes. When $n$ = 140 and the true population mean is 23,000, the probability of the second kind of error is now about 31%; when a 95% confidence interval is used this error would occur with a probability of only 19%. We seem to have reached an impasse. Any attempt to protect ourselves against one type of error will increase the chance of making the other. The conventional way to solve the dilemma is to give more importance to the type 1 error. The argument goes something like this.

Table 8.3. Probabilities of type 2 error using a 99% confidence interval to test whether $\mu$ = 24000

              Sample size
True mean     n = 140    n = 500    n = 1000
23000         0.31       0.001      very small
23500         0.82       0.37       0.08
23600         0.88       0.56       0.25
23700         0.93       0.77       0.53
23800         0.96       0.90       0.79
23900         0.98       0.96       0.94
24100         0.98       0.96       0.94
24200         0.96       0.90       0.79
24300         0.93       0.77       0.53
24400         0.88       0.56       0.25
24500         0.82       0.37       0.08
25000         0.31       0.001      very small

(In all cases s = 4200)

The onus is on the investigator to show that some expected or natural hypothesis is untrue. Evidence for this should be accepted only if it is reasonably strong. Let us consider our vocabulary size example in this light. The population mean score over a large number of American children tested is 24,000. If this vocabulary test had never been carried out on British children before we could start out from the point that, in the absence of special circumstances, the mean vocabulary size of the latter population ought to be about the same as that of the former. If we decide to use the 95% confidence interval obtained from the sample of 140 test scores to test this hypothesis, we would conclude that it was false. The probability that the conclusion is in error (a type 1 error) would be only 5%, or 1 in 20. It would seem that, on balance, the evidence suggests there is something about the education or linguistic environment of the British children which promotes earlier assimilation of vocabulary.
In some cases, the rejection of a hypothesis might lead to some costly action being taken, a change in educational procedures or extra help to some apparently disadvantaged section of the community. In such cases we might feel that a 1 in 20 chance of needlessly spending resources as the result of an incorrect conclusion is too high a probability of error. We could base our decision on a 99% confidence interval since then a conclusion, based on a sample, that a particular subpopulation had a different or unusual mean value, would have a probability of only 1 in 100 of being incorrect. If the hypothesis testing procedure is to be used in this fashion (but see 8.4) we will wish to fix our attention on the probability of wrongly rejecting a hypothesis, and it would be useful to formulate the procedure in such a way that the answer to the following question can be obtained easily: 'If I reject a certain hypothesis on the basis of data obtained from observing a random sample, what is the probability that my rejection of the hypothesis is in error?'

8.2 The concept of a test statistic
Let us recap briefly the procedure presented in 7.2 for calculating the confidence intervals we have been discussing above. We obtain a random sample of 140 vocabulary scores and calculate $\bar{X}$, the sample
mean, and $s$, the sample standard deviation. Provided the sample size is large enough - as it is here - the confidence interval then takes the form:

$$\bar{X} \pm Z\frac{s}{\sqrt{n}}$$

where the value $Z$ is chosen from tables of the standard normal distribution to give the required level of confidence. By an algebraic manipulation of this expression it can be shown that it is not strictly necessary to calculate the confidence interval in order to test a hypothesis. The algebraic argument follows below, in italics, for interested readers.
Suppose we wish to test the hypothesis that the population mean has the value $\mu$. We will reject this if $\mu$ is not contained by the confidence interval. Now, $\mu$ will be contained inside the interval provided:

$$\bar{X} - Z\frac{s}{\sqrt{n}} < \mu < \bar{X} + Z\frac{s}{\sqrt{n}}$$

The inequality 'A':

$$\bar{X} - Z\frac{s}{\sqrt{n}} < \mu$$

is the same as:

$$\bar{X} < \mu + Z\frac{s}{\sqrt{n}}$$

is the same as:

$$\bar{X} - \mu < Z\frac{s}{\sqrt{n}}$$

is the same as:

$$\frac{\bar{X} - \mu}{s/\sqrt{n}} < Z$$

Similarly, the inequality 'B':

$$\mu < \bar{X} + Z\frac{s}{\sqrt{n}}$$

is the same as:

$$\frac{\mu - \bar{X}}{s/\sqrt{n}} < Z$$

Thus $\mu$, the postulated value of the population mean, will be included by the confidence interval only if both the inequalities A and B are true. Taken together, these inequalities can be put into words, as follows. Find the absolute difference between $\bar{X}$ and $\mu$. (That is, take $\bar{X} - \mu$ if $\bar{X}$ is greater, $\mu - \bar{X}$ if $\mu$ is greater.) Divide that difference by $s/\sqrt{n}$, the standard error of $\bar{X}$. If the answer is less than the value of $Z$ needed to calculate the confidence interval then the interval will include $\mu$, otherwise it will not.
For our example, $\bar{X}$ is larger than $\mu$, and $s$ = 4,200:

$$\frac{\bar{X} - \mu}{s/\sqrt{n}} = \frac{24800 - 24000}{4200/\sqrt{140}} = \frac{800}{355} = 2.25$$

Now, to construct a 95% confidence interval, we see from table A3 that we would need to use $Z$ = 1.96; for 99%, $Z$ = 2.58. This tells us that 24,000 will be included in the 99% but not the 95% interval, since the $Z$-value corresponding to the sample is less than 2.58 but greater than 1.96. If we reject the value 24,000 as incorrect, the probability that the rejection is an error (type 1) is less than 5% (since the value falls outside the 95% confidence interval) but greater than 1% (since it does not fall outside the 99% confidence interval). Conventionally, we say that the postulated value, $\mu$ = 24,000, may be rejected at the 5% significance level but not at the 1% significance level. A common notation used to express this is 0.01 < P < 0.05, where 'P' is understood to be the probability of making a type 1 error, i.e. P is the significance level.
'At the 5% significance level' is just another way of saying that the postulated mean is not contained by the 95% confidence interval based on the sample. The 'significance level' is then just the probability that this exclusion is due to a sampling accident and not to the failure of the hypothesis. The value:

$$Z = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

is used as the criterion to assess the degree to which the sample supports, or fails to support, the hypothesis that $\mu$ is the mean of the population from which the sample has been selected. Such a value used in such a way is known as a test statistic. Every testable statistical hypothesis is judged on the basis of some test statistic derived from a sample.
Every test statistic will be a random variable because its value depends on the results of a random sampling procedure. If we were to repeat our test procedure a number of times, drawing a different sample each time,
we would obtain a random sample of values of the test statistic and could plot its histogram as we did for other variables in chapter 2. A test statistic is always chosen in such a way that a mathematical model for its histogram is known so long as the hypothesis being tested is true; in this case the distribution of our test statistic, $Z$, whenever $\mu$ is, in fact, the mean of the population from which the sample is taken, would be the standard normal distribution. It follows that the value $Z$ = 1.96 would be exceeded in only 2.5% of samples when we have postulated the correct value of $\mu$. Similarly, only 2.5% of random samples would give a value less than -1.96 when our hypothesis is true. Clearly we are again using the 95% central confidence interval and claiming that values outside this interval are in some sense unusual or interesting since they should result only from 1 random sample in 20.
You should now be ready to understand a complete and formal definition of the classical procedure for a statistical test of hypothesis.

8.3 The classical hypothesis test and an example
A hypothesis is stated about some random variable, for example, that British children of a certain age have a mean vocabulary of 24,000 words. The hypothesis which is taken as a starting point and which often it is hoped will be refuted, is commonly called the null hypothesis and designated $H_0$. We might write: let the variable $X$ with mean $\mu$ be the vocabulary size of British children aged 6-7 years. Then we wish to test $H_0$: $\mu$ = 24,000. The most important requirement to enable the test to take place is the existence of a suitable test statistic whose distribution is known when $H_0$ is true. In this case there is one, provided we take a large sample of children:

$$Z = \frac{\bar{X} - \mu}{s/\sqrt{n}}$$

where $\mu$ is the hypothesised mean and $s$ is the sample standard deviation. For the example we are discussing here:

$$Z = \frac{\bar{X} - 24000}{s/\sqrt{n}}$$

Next, we must be able to say what the value of our test statistic would have to be in order for us to reject the null hypothesis. This depends firstly on what alternative hypothesis we have in mind, i.e. what we suspect may in fact be the case if the null hypothesis is untrue. There are three obvious possibilities: (i) We believe that British children will have a lower mean vocabulary size as measured by the test used (possibly because the test, devised in the USA, has a cultural bias); (ii) We believe that British children will have a higher mean vocabulary size than American children of the same age (possibly because they start school at an earlier age); (iii) We are simply checking whether there is any difference in the mean vocabulary size and have no prior hypothesis about the direction of any difference which might exist.
A full statement of the problem will include a definition of the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$). In the present case the null hypothesis is $H_0$: $\mu$ = 24,000 and we have to choose an alternative from:

(i) $H_1: \mu < 24000$
(ii) $H_1: \mu > 24000$
(iii) $H_1: \mu \neq 24000$

Despite its vagueness, the form of $H_1$ will have a definite bearing on the outcome of a statistical hypothesis test. Consider again the test statistic:

$$Z = \frac{\bar{X} - 24000}{s/\sqrt{n}}$$

If (i) is true, the population mean is less than 24,000, so that samples from the population will generally have a sample mean less than 24,000. In that case $Z$ will have a negative value since $\bar{X} - 24000$ will be less than zero. So large, negative values of $Z$ will tend to support $H_1$ rather than $H_0$; and positive values, however large, cannot be cited as support for the hypothesised alternative. If (ii) is true the argument will be completely reversed so that large, positive values of $Z$ should lead to the rejection of $H_0$ in favour of $H_1$. If (iii) is true the value of $Z$ will tend to be either negative or positive depending on the actual value of $\mu$. All we can say in this case is that any value of $Z$ which is sufficiently different from zero, irrespective of sign, will be support for $H_1$. Up to this point in our discussion we have tacitly been considering possibility (iii) and have been prepared to reject $H_0$ if the value of $Z$ is extreme in either direction.
Next, we require tables which will indicate those values of the test statistic which seem to give support to $H_1$ over $H_0$. These values are usually referred to as the critical values of the test statistic. The tables ought to be in a form which enables us to state the significance level of the test. This is simply another name for the probability that we will make a mistake if we decide to reject $H_0$ in favour of $H_1$ on the evidence of
our random sample of values of the variable we have observed, i.e. the probability of making a type 1 error.
Care is required at this point. The correct way to calculate the significance level of a test depends on the particular form of $H_1$ which is considered relevant for the test and on the way in which the tables are presented. Some tables are presented in a form suitable for what is often called a two-tailed test: our tables A3 and A4 are of that kind. This type of test is so called because values sufficiently far out in either tail of the histogram of the test statistic will be taken as support for $H_1$. A two-tailed test is the appropriate one to use when the alternative hypothesis does not predict the direction of the difference. In the present case, therefore, a two-tailed test will be called for if our alternative hypothesis is $\mu \neq 24000$.
A one-tailed test, by contrast, is so called because values sufficiently far out in only one tail of the histogram of the test statistic will be taken as support for $H_1$. A one-tailed test is appropriate when the direction of the difference is specified in the alternative hypothesis. Thus both the following:

$H_1: \mu < 24000$
$H_1: \mu > 24000$

require a one-tailed test.
As noted above, our treatment of the vocabulary size problem has assumed the alternative hypothesis to be $\mu \neq 24000$. It follows from this that a two-tailed test has been appropriate. What we have done informally is indeed to carry out a two-tailed test, making use of the significance levels provided in table A3. We will now use another example to demonstrate the procedure formally, step by step, this time with an alternative hypothesis that specifies the direction of the difference.
Let us suppose that there is a proficiency test for students of French as a second language in which students educated to British A level standard are expected to score a mean of 80 marks. In a certain year teaching activities at some schools are disrupted by selective strikes. Ten students are chosen at random from those schools and are administered the test just before the time when they are due to sit the A level examination. (In practice,

The mean of this sample is 71.5, with a standard deviation of 13.18. Do you think their performance has been affected by the interruption of their studies? One way to answer this question is to test whether the students seem to have achieved results as good as, or worse than, those achieved by the body of students who have taken the same proficiency test in previous years at the same point in their French language education.²
The null hypothesis for this test will be that the students tested come from a population whose mean score is 80, the historical mean for students who have had the usual preparation for the A level examination. We would not expect the disruption of teaching to improve students' performance on the whole, so that a one-sided alternative will be appropriate.
We will therefore test the hypothesis that the mean test score of the population from which these mean scores are drawn is 80, against the alternative that the population mean score is less than 80. In other words, we wish to test:

$H_0: \mu = 80$ versus $H_1: \mu < 80$

How are we to carry out this test? It looks similar to the test of hypothesis on mean vocabulary size that we carried out in 8.2 - apart from the change in the alternative hypothesis - but there is one very important difference. Here the sample size is too small for the Central Limit Theorem to be invoked safely. In the vocabulary example we did have a large sample and this is what enabled us to be sure that the test statistic, $Z$, had a standard normal distribution. In small samples that will not be true.
In the last chapter we have already addressed the problem of small sample size in the context of the determination of a confidence interval. The solution which was suggested there carries over to the current problem. First, for small samples, we have to be willing to make an assumption about the distribution of the variable being sampled, namely that it has a normal distribution. Here, for example, in order to make any further progress we must assume that the proficiency test scores would be more or less normally distributed over the whole population of students with disrupted schooling. Provided that assumption is true, then the quantity:

$\bar{X} - \mu$
a much bigger sample would normally be chosen if there were r~al grounds Sf'Tn
for believing that the students' performance had been affected by the
will still make a suitable test statistic, since its distribution is known and
strikes, but we wish to demonstrate here how a small sample could be
analysed.) The scores of the ten students were: l :\ better way would be to comparc tlwm directly with students \vho will take the A level

62 71 75 56 So 87 62 g6 57 69
in tht same V(~Hr and who11c Htlldic!l wtn not disrupted. How to do that is cxplai11cd
i11 chupter 1 1.'

Testing hypotheses about population values The classical hypothesis text
can be tabulated. It is no. longer standard normal; rather it has. the t-
distribution introduced in the previous chapter. For the present example Standard normat------7'
the statistic: ~ t-distribution with
9 df

mil! have a !-distribution with 9 (= 10- I) degrees of freedom (df). 0
l'lji'pothesis tests are often referred to by the name of the test statistic Figure 8.1, Comparison of the histogram of the t-distribution with that of
they use. In this case we might say that we are 'carrying out a t-tcst' the standard normal distribution.
(cf. F-test and chi-squared test in later chapters).
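A minimal sketch in Python (standard library only) of how this statistic can be computed for the ten scores listed above. The variable names are illustrative choices, not notation used elsewhere in this book, and the sketch assumes, as the text does, that s is calculated with the (n - 1) divisor.

```python
import math
import statistics

# Scores of the ten students, as listed in the text.
scores = [62, 71, 75, 56, 80, 87, 62, 96, 57, 69]
mu_0 = 80  # population mean specified by the null hypothesis H0: mu = 80

n = len(scores)
sample_mean = statistics.mean(scores)   # X-bar
sample_sd = statistics.stdev(scores)    # s, using the (n - 1) divisor
standard_error = sample_sd / math.sqrt(n)

# Test statistic (X-bar - mu_0) / (s / sqrt(n)); under H0 it is referred
# to the t-distribution with n - 1 degrees of freedom.
t_value = (sample_mean - mu_0) / standard_error
degrees_of_freedom = n - 1

print(f"n = {n}, mean = {sample_mean:.1f}, s = {sample_sd:.2f}")
print(f"t = {t_value:.2f} on {degrees_of_freedom} df")
```

The sketch only evaluates the statistic; deciding whether the value is extreme enough still requires the tabulated percentage points of the t-distribution.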
Hence, if the population mean score really is 80 then for any sample of ten scores:

    t = (X̄ - 80)/(s/√10)

will be a random value from the t-distribution with 9 df. If μ < 80, we would expect the test statistic to have a negative value (since then X̄ will usually be less than 80), and it will no longer have a t-distribution (since the incorrect value of μ will have been used in its calculation). In other words, if the alternative hypothesis H1: μ < 80 is true, then we would expect a value of t which is negative and far out in the tail of the histogram of the t-distribution.

In the test score example we have:

    n = 10, X̄ = 71.5, s = 13.18

so that the value of the test statistic is:

    t = (71.5 - 80)/(13.18/√10) = -2.04

The value of t is negative. If it had not been so, there could be no question of the data supporting the alternative hypothesis against the null, since a sample mean greater than 80 cannot be claimed as evidence in favour of a population mean less than 80! The question is whether it is a value extreme enough to indicate that the hypothesis μ = 80 is implausible. Figure 8.1 shows the histogram of the t-distribution with 9 df, with the histogram of the standard normal superimposed. They are very similar. Both are symmetric about the value zero but the t-distribution is flatter and spreads more into the tails, reflecting the extra uncertainty caused by the small sample size. If the null hypothesis is true, then μ = 80 and most samples will have a mean of around 80 so that t = (X̄ - μ)/(s/√n) will have a value around zero. However, even if H0 is true some samples will correspond to a value of t in the left-hand tail (and so look as if they supported the alternative H1: μ < 80). A table of t-distributions, such as table A4, gives values of t which are somewhat unlikely if H0 is true. These values are often called percentage points of the t-distribution since they are the values which will be exceeded in only such and such a per cent of samples if H0 is true. However, the tables are set up to give the percentage points appropriate for a test which involves a two-tailed alternative hypothesis when an extreme value of t, whether positive or negative, could be evidence in favour of the alternative rather than the null hypothesis. For example, for the t-distribution with 9 df the value given as the 10% point is 1.83 and figure 8.2 demonstrates the meaning of the tabulated value. When the null hypothesis is true the value of t will lie in one of the tails - shaded in the figure (i.e. t > 1.83 or t < -1.83) - for a total of 10% of samples. For half of those samples (5%) the value of t will lie in the left-hand tail, for the other half in the right. Only values in the left-hand tail can support the alternative hypothesis that μ < 80. If we decide to reject the null hypothesis in favour of the alternative whenever the t-value falls in the left-hand tail in figure 8.2, i.e. whenever t < -1.83, we would reach this conclusion mistakenly in only 5% of samples when the null hypothesis is actually true. In the present example, t = -2.04 (< -1.83).

[Figure 8.2. The two-tailed 10% point of the t-distribution with 9 df; the shaded tails lie beyond t = -1.83 and t = 1.83.]

Conventionally, we could say that 'at the 5%
significance level we can reject H0: μ = 80 in favour of the alternative H1: μ < 80'. This statement is frequently shortened to 'the value of t is significant at 5%'. The shorter version is acceptable provided the null and alternative hypotheses are stated elsewhere.

It is worth repeating that the percentage points in the table are relevant for a two-tailed test. The tables say that 1.83 is the 10% value. Since the alternative used in the present example is one-sided only, one of the tails is eliminated from consideration and the use of the 'cut-off' or critical value t = -1.83 will lead to only a 5% probability of a type 1 error. Notice that if the two-sided alternative H1: μ ≠ 80 had been relevant here, then, at the 5% significance level, only values of t greater than 2.26 or less than -2.26 would have been significant so that the value obtained here, t = -2.04, would no longer be significant at the 5% level! You should find this entirely logical, perhaps after some thought. The significance level is just the probability that the null hypothesis will be (mistakenly) rejected when it is correct. The value that will cut off the 95% of 'acceptable' t-values will be different depending on whether those values are distributed between both tails or are confined to just one. Similar arguments apply to the percentage points corresponding to any other significance level. The 2% point of t with 9 df is given as 2.82 in the tables. But this is, as always, for a two-sided alternative. For a one-sided alternative 2.82 (for H1: μ > 80) or -2.82 (for H1: μ < 80) is the 1% point.

Note in passing that table A4 does not give all possible degrees of freedom. For example, after 15 df the next tabulated value is 20 df. What if you need 18 df, i.e. the sample size is n = 19? You will notice that the critical value of t at any significance level decreases as the df increase. In other words, the bigger the sample the smaller are the t-values which are found to be significant. (This reflects the extra confidence we have in s² as a measure of population variance in bigger samples.) Notice further that the critical values for 20 df are very similar to those for 15 df. For any number of degrees of freedom between 15 and 20 either of those two rows gives a very close approximation to the correct answer. It is conventional (and conservative) to use the row corresponding to the nearest number of degrees of freedom smaller than those which are required when the latter are not tabulated.

A final point to notice about the t-tables is that as the degrees of freedom increase the values become more and more similar to those of the standard normal distribution. The values in the last row of table A4 are exactly those of the standard normal, i.e. the values in the last row of table A4 correspond to the second column of table A3.

8.4 How to use statistical tests of hypotheses: is significance significant?

Every statistical test of hypothesis has a similar logic whatever the hypotheses being tested. There will be two hypotheses, one of which, the null, must be precise (e.g. μ = 80) while the other may be more or less vague (e.g. μ < 80). There must be a test statistic whose distribution is known when the null is true. Percentage points of that distribution can then be calculated and tabulated. Sometimes the tables will be appropriate to two-sided and sometimes to one-sided alternatives - the rubric will make it clear which is the case. The major constraint on the use of significance tests is that it is generally difficult to discover test statistics with known properties. Such statistics are available for only a few, standard null hypotheses. It frequently happens that a researcher wishes to address a question which is not easily formulated in terms of one of those hypotheses and it would be mistaken to try to force all investigations into this framework. The value of statistical hypothesis testing as a scientific tool has been greatly exaggerated.

In chapter 7 it was argued that a confidence interval gives a useful summary of a data set. We hope it is clear from the development of the hypothesis test from a confidence interval in 8.2 that a test statistic and the result of a test of hypothesis is simply another way of summarising a set of data. It can give a useful and succinct summary, but it is no more than that. Important decisions should not be taken simply on the basis of a statistical hypothesis test. It is a misguided strategy to abandon an otherwise attractive line of research because a statistically significant result is not obtained as the result of a single experiment, or to believe that an unexpected rejection of a null hypothesis means, by itself, that an important scientific discovery has been made. A hypothesis test simply gives an indication of the strength of evidence (in a single experiment) for or against a working hypothesis. The emphasis is always on the type 1 error, the error which arises when we incorrectly reject a true null hypothesis on the basis of this one statistical experiment. The possibility of type 2 error tends to be forgotten. If we do not find a highly significant result, this does not mean that the null is correct. Sampling error, small sample size or the natural variability of the population under study may prevent us from detecting a substantial failure in the null hypothesis. Furthermore, although we control the probability of type 1 error by demanding that it be small, say 5% or 1% or less, we will not usually know how large is the probability of making a type 2 error. Except when the sample size
is very large, the probability of a type 2 error is often rather higher than the probability of a type 1. Hence, there is often a high chance of missing important scientific effects if we rely solely on statistical tests made on small samples to detect them for us.

Hypothesis testing is set up as a conservative procedure, asking for a fairly small probability of a type 1 error before we make a serious claim that the null hypothesis has been refuted. The procedure is designed to operate from the point of view of a single researcher or research team assessing the result of a single experiment. If the experiment is repeated several times, whether by the same or by different investigators, the results need to be considered in a different light. For example, suppose the editors of a scientific journal decide, misguidedly, to publish those papers which contain a statistically significant result and reject all others. They might decide that a significance level of 5% would be required. Now let us suppose that 25 researchers have independently carried out experiments to test a null hypothesis which is, in fact, true. For any one of these individuals it is correct that there is only a 5% chance he will erroneously reject the null hypothesis. However, there is a chance of greater than 72% that at least one of the 25 researchers will find an incorrect, but statistically significant, result. If there were 100 researchers, this chance rises to more than 99%! The chance of a research report appearing in the journal would then depend largely on the popularity of the research topic, but there would be no way of assessing how likely to be true are the results in the published articles.

Another point to remember is that no null hypothesis will be exactly true. We tested the hypothesis (in 8.3) that μ = 80; we would probably not wish it to be rejected even if the true value were not 80 but 79.999 since this would not indicate any interesting difference in mean scores between the subpopulation and the wider population. On the other hand, if a large enough sample is used such small discrepancies will cause rejection of the null hypothesis. Look back to the example in 8.3. The test statistic was:

    (X̄ - μ)/(s/√n)

This can be seen to be a quotient of two numbers. The numerator is X̄ - μ; the divisor or denominator is s/√n. The test statistic will have a large value if either the numerator is large or the denominator is small. We can make the denominator as small as we please by increasing the value of n. For very large values of n, the test statistic can be significant even if the difference between X̄ and μ is trivially small. The explicit calculation of a suitable confidence interval would show immediately whether any differences from the hypothesised value were negligible, although the test had rejected the null hypothesis. If a statistical test indicates that some null hypothesis is to be rejected we should always attempt to estimate more likely parameter values to replace those in the rejected null hypothesis. We must keep in mind always the difference between statistical and scientific significance, and we should remember that the latter will frequently have to be assessed further in the light of economic considerations.

Let us consider an example. Suppose that a new method of treatment has been suggested to alleviate a dysphasia. An investigation is carried out whereby an experimental group of n patients is treated for some months by the new method while a matched control group of the same size is treated over the same period by a standard method. We could then test, using one of the tests to be introduced in chapter 11, the null hypothesis that the degree of improvement was the same under both treatment methods against the alternative that the new method caused more improvement. Let us consider the possible outcomes of such a test.

8.4.1 The value of the test statistic is significant at the 1% level

What does this tell us? In itself, very little. We have just pointed out that we never expect any null hypothesis to be exactly true. The significant value of the test statistic means that our experiment has been able to discover that. The question is, has the significant value come about because, on average, there is a large benefit from the new method or because, perhaps, a very large sample of subjects was used? If it is the former then it is still possible that this is a sampling phenomenon, that the accidental allocation of patients to the two groups has placed in the experimental group a majority of patients who would have made most improvement under the old method. However, we know, from the significance level, that there is only a one in a hundred chance of obtaining so extreme a result if the new treatment is in no way better than the standard.

Does this then mean that the new treatment should be introduced? Not at all. We must now ask about the relative costs of the two methods. If the new method costs much more than the standard method to administer then it can only be introduced if it causes improvement so much more rapidly that at least the same number of patients annually can be helped to the same level of improvement as under the standard method. It will be necessary to test the new method with a large number and variety of patients before sufficient information can be obtained to assess this properly. Although a small sample might show that there is a statistically
significant difference of an apparently interesting magnitude, more information will always be necessary to assess the economic implications of changes of this type.

8.4.2 The value of the test statistic is not significant

In itself an 'uninteresting' value of the test statistic should not be the end of the story. Never forget the possibility of type 2 errors, especially if the sample size is very small. At least you should always look at the difference in performance of the two samples and ask 'Would a difference of this magnitude be important if it were genuine?' If the answer is affirmative, then it is worth considering a repetition of the experiment with a larger sample size. You should also take a careful look at some of the details of the data. For example, it could happen that many patients do not improve much under either method but, of those who do, the improvement might be more marked under the new treatment. The average gain of the new method would then be quite small because of the inertia of the 'non-improvers' and the variability in both samples would be increased because they are really mixtures of two types of patient. Both of those conditions would increase the probability of a type 2 error.

In the light of the above comments it should be clear that to report the results of a study by saying that something was significant at the 1% level or was not significant at the 5% level is unsatisfactory. It makes much more sense to discuss the details of the data in a manner which throws as much light as possible on the problem which you intended to tackle. A formal test of hypothesis then indicates the extent to which your conclusions may have been distorted by sampling variability. The occurrence of a significant value of a test statistic should be considered as neither necessary nor sufficient as a 'seal of approval' for the statistical validity of the conclusion. The general rule introduced in chapter 3 still holds good. Any summary of a set of experimental data may be revealing and helpful. This is equally true whether the summary takes the form of a table of means and variances, a graph or the result of a hypothesis test. In all cases, as much as possible of the original data should always be given to enable readers of an article to assess how adequate the summary is and to enable them to carry out a new analysis if they wish.

SUMMARY

This chapter introduces the concepts and philosophy of statistical hypothesis testing.

(1) A confidence interval can be used to test the hypothesis that a population mean takes a particular value; type 1 errors and type 2 errors were defined.
(2) The concept of a test statistic was used to link confidence intervals with hypothesis tests.
(3) The classical hypothesis test was introduced as the test of a null hypothesis (H0) against a specific alternative hypothesis (H1) using as a criterion the value of a test statistic whose distribution is known provided H0 is true. The sample value of the test statistic is compared to a table of critical values to obtain the significance level (probability of a type 1 error) of the test. The meaning of, and need for, one-tailed and two-tailed tests was explained. To carry out a test of the null hypothesis, H0: μ = specified value, against any of the three common alternatives the relevant statistic is (X̄ - μ)/(s/√n). For small samples its value is compared with those of the t-distribution with (n - 1) degrees of freedom; for large samples it is compared with the critical values of the standard normal distribution.

EXERCISES

(1) A sample of 184 children take an articulation test. Their mean score is 48.8 with standard deviation 12.4. Show that these results are consistent with the null hypothesis that the population mean is μ = 50 against the alternative that μ ≠ 50.
(2) All the exercises at the end of chapter 7 require the calculation of confidence intervals. Take just one of those confidence intervals and reconsider its meaning in the light of the present chapter. In particular, formulate two different null hypotheses, one of which would be found plausible and the other of which would be rejected as a result of the interval. In both cases state, very precisely, the alternative hypothesis.
(3) An experimenter wants to test whether the mean test score, μ, of a population of subjects has a certain value. In particular he decides to test H0: μ = 80 versus H1: μ > 80. He obtains scores on a random sample of subjects and calculates the sample mean and standard deviation as X̄ = 84.2 and s = 14.6. He omits to report the sample size, n. Show that if n = 16, H0 would not be rejected at the 5% level, but that H0 would be rejected if n = 250.
(4) In the last example, find the smallest sample size, n, which would lead to the rejection of H0:
    (i) at the 5% significance level
    (ii) at the 1% level
(5) If X̄ = 80.1 and s = 14.6, show that H0: μ = 80 could still be rejected in favour of H1: μ > 80 at any level of significance.
(6) Discuss the implications of exercises 2, 3 and 4, above.
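The figures quoted in 8.4 for 25 and 100 independent researchers can be checked with a short calculation. If each test has a 5% chance of wrongly rejecting a true null hypothesis, and the experiments are independent, the chance that at least one of k tests does so is 1 - 0.95^k. A minimal Python sketch follows; the function name is a hypothetical one chosen for this illustration.

```python
def chance_of_false_positive(n_experiments, alpha=0.05):
    """Probability that at least one of n independent tests of a TRUE null
    hypothesis comes out 'significant' at level alpha."""
    # Each test fails to reject with probability (1 - alpha); with
    # independence, all of them fail to reject with (1 - alpha) ** n.
    return 1 - (1 - alpha) ** n_experiments


# One researcher, then the 25 and 100 researchers of the journal example.
for k in (1, 25, 100):
    print(f"{k:3d} experiments: {chance_of_false_positive(k):.4f}")
```

The independence assumption matters here: it is what allows the individual probabilities to be multiplied, and it matches the text's supposition that the researchers worked independently.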
9 Testing the fit of models to data

9.1 Testing how well a complete model fits the data

In the previous chapter we learned how to test hypotheses concerning the value of important quantities associated with a population. What we tested was whether a particular model of the chosen type could be supported or should be rejected on the basis of data observed in a random sample. There are times, however, when we might have doubts about the very form of the model, when, for instance, we are uncertain whether it is appropriate for the population in which we are interested to be modelled as a normally distributed population.

Imagine the case where a School District in the USA wishes to identify those children beginning school who should be provided with speech therapy. Rather than involve themselves in the lengthy and expensive business of constructing a new articulation test, they plan to use a test which is already available. One test which seems on the surface to be suitable for this purpose is British. It has been validated and standardised in Glasgow in such a way that for the whole population of 5-year-old children in Glasgow the scores on the test are normally distributed, with a mean score of 50 and standard deviation of 10. In order to discover whether scores on the test will have similar properties when used with 5-year-old children in its own area, the US School District administers it to a random sample of 184 of these children. The results of this are presented as a frequency table in the first two columns of table 9.1. The mean score of the US children is 48.8 and the standard deviation is 12.4. Using the methods of the previous chapter, it can readily be shown (see exercise 8.1) that the mean of this sample is indeed consistent with a population mean of 50. But this does not tell us whether the complete population of test scores (i.e. those which would be made by all 5-year-old children in the area) could be modelled adequately as a normally distributed population. This question is important to the School District. As we saw in chapter 6, if a population is normally distributed, the proportion of the population lying between any two points is known. If, therefore, the population of scores of the US children on the test is normally distributed, it will be possible to calculate what proportion of the population of children will obtain scores within a certain range. More particularly, it will be possible to calculate the proportion of children who will be given speech therapy if a certain (low) score is used as a cut-off point; i.e. only children obtaining a score lower than this will receive therapy. It is in fact quite common practice in the USA, in identifying children in need of treatment, to set that cut-off point at two standard deviations below the mean. If the test scores are normally distributed, this means that approximately 2.5% of children will be selected for treatment. But unless we know that the distribution of the US population scores can be modelled on a normal distribution, we cannot ascertain the proportion in this way.

Table 9.1. Frequency table of the test scores of 184 children

        1                  2                  3                  4               5
  Class intervals     Observed        Standardised class     Expected       Expected number
  of scores           number of       intervals              proportion     of scores
  From - less than    scores in       From - less than       in each        in each class
                      each class                             class

   0 - 30                  2             -   to -2.0          0.023             4.2
  30 - 35                 11           -2.0  to -1.5          0.044             8.1
  35 - 40                 17           -1.5  to -1.0          0.092            16.9
  40 - 45                 31           -1.0  to -0.5          0.150            27.6
  45 - 50                 32           -0.5  to  0.0          0.191            35.1
  50 - 55                 39            0.0  to  0.5          0.191            35.1
  55 - 60                 22            0.5  to  1.0          0.150            27.6
  60 - 65                 19            1.0  to  1.5          0.092            16.9
  65 - 70                  6            1.5  to  2.0          0.044             8.1
  70 - 75                  4            2.0  to  2.5          0.017             3.1
  75 or greater            1            2.5  to  -            0.006             1.1

How then do we determine whether or not the sample data obtained are consistent with a normally distributed population of test scores? If the frequency data are represented as a histogram (figure 9.1), it can be seen that they do in fact show some resemblance to a normal curve. But we must go further than this. The question that we have to answer is whether the number of scores in each class is sufficiently similar to that which would result typically when a random sample of 184 observations is taken from a normally distributed population with a mean of 50 and standard deviation of 10. The first step is to calculate the proportion of the model population which would lie in each class interval. To do this,
the endpoints of the intervals are standardised to Z-values, as in the third column of table 9.1 (Z-values calculated using the formula Z = (X - μ)/σ presented in chapter 6). In the fourth column are the expected proportions of the model population to be found in each class interval, these proportions being taken from a normal table (table A2) in the way described in chapter 6. In order to calculate the number of scores we would expect in each class if the sample of 184 scores had been taken from a population with the stated properties, we have only to multiply each expected proportion by 184, the results being given in column 5 (for example, 184 × 0.023 = 4.2).

[Figure 9.1. Histogram of the data of table 9.1. (Note that the first and last classes are not represented in the diagram.)]

Comparing columns 2 and 5 of the table, we can see that the actually observed number of scores in each class is not very different from what we would expect of a sample of 184 taken from a normal population with a mean of 50 and standard deviation of 10. So far this merely confirms the impression given by the histogram (figure 9.1). But we will now use these observed and expected frequencies to develop a formal test of the fit of the proposed model to the observed data.

Let the null hypothesis (chapter 8), H0, be that the data represent a random sample drawn from a normally distributed population with μ = 50 and σ = 10. The alternative, H1, is that the parent population (the one from which the sample has been drawn) has a distribution different from the one proposed. We must first obtain a measure of the discrepancy between the observed scores and those which would be expected if H0 were true. The most obvious thing to do is calculate for each class the difference between the observed number of scores and the expected number of scores. The expected number can simply be subtracted from the observed number. The result of doing this can be seen in column 3 of table 9.2. (The reason for combining a number of the original classes will be given below.) Some of the discrepancies are positive, the remainder negative. But it is the magnitude of the discrepancy rather than its direction which is of interest to us; the sign (plus or minus) has no importance. What is more, just as with deviations around the mean (chapter 3) the sum of these discrepancies is zero. It will be helpful to square the discrepancies here, just as we did in chapter 3 with deviations around the mean.

Table 9.2. Calculations for testing the hypothesis that the data of table 9.1 come from a normal distribution with mean 50 and standard deviation 10

        1                    2                    3               4               5
  Observed             Expected             Discrepancy
  frequency, o_i       frequency, e_i       (o_i - e_i)     (o_i - e_i)²    (o_i - e_i)²/e_i

  13 (= 2 + 11)        12.3 (= 4.2 + 8.1)       0.7             0.49           0.04
  17                   16.9                     0.1             0.01           0.0006
  31                   27.6                     3.4            11.56           0.42
  32                   35.1                    -3.1             9.61           0.27
  39                   35.1                     3.9            15.21           0.43
  22                   27.6                    -5.6            31.36           1.14
  19                   16.9                     2.1             4.41           0.26
  11 (= 6 + 4 + 1)     12.3 (= 8.1 + 3.1 + 1.1) -1.3            1.69           0.14

                                                        Total deviance = 2.70

It should be clear that it is not the absolute discrepancy between observed and expected frequencies which is important. If, for example, we expected 10 scores to fall in a given class and observed 20 (twice as many), we would regard this as a more important aberration than if we observed 110 where 100 were expected, even though the absolute difference was 10 in both cases. For this reason we calculate the relative discrepancy by dividing the square of each absolute discrepancy by the expected frequency. Thus, in the first row: 0.49 (square of discrepancy) ÷ 12.3 (expected frequency) gives a relative discrepancy of 0.04. The results of this and calculations for the remaining rows are found in column 5. The procedure that we have followed so far has given us a measure
Testing the fit of models to data A type of model
of deviance from the model for each class which will be zero when the observed frequency of scores in the class is exactly what would be predicted by H0, and which will be large and positive when the discrepancy is large compared to the expected value. By summing the deviances in column 5 we arrive at the total deviance, which is 2.70. Using the total deviance as a test statistic, we are now in a position to decide whether or not the sample scores are consistent with their being drawn from a population of normally distributed scores with μ = 50 and σ = 10. This is because, provided that all the expected frequencies within classes are large enough (in that they are all greater than 5), the distribution of the total deviance is known when H0 is true. It is called a chi-squared distribution (sometimes written χ²), though, as with the t-distribution, it is really a family of distributions, each member of the family being identified by a number of degrees of freedom. The degrees of freedom depend on the number of classes which have contributed to the total deviance. For the present case there are eight classes, some of the original classes having been grouped together. This was done in order to meet the requirement that the expected frequency in each class should be more than 5 (for example the original classes '70-less than 75' and '75 or greater' did not have sufficiently large values). There are eight classes but the expected frequencies are not all independent. Since their total has to be 184, as soon as seven expected frequencies have been calculated, then the last one is known automatically. There are therefore just 7 df. If H0 is true, the total deviance will have approximately a chi-squared distribution with 7 df. Critical values of chi-squared are to be found in table A5. We can see there that the 10% critical value for 7 df is 12.0. Since the value we have actually obtained, 2.70, is much smaller than this, there is no real case to be made against the null hypothesis. The scores made by the US children are consistent with those expected from a random sample drawn from a population of scores having a normal distribution with a mean of 50 and a standard deviation of 10. With this knowledge the School District should be able to predict with reasonable accuracy the number of 5-year-old children in their area for whom speech therapy will be indicated by the British articulation test, whatever cut-off point they might wish to select. (Of course, we must not forget the possibility of a type 2 error which would occur when the data really did come from a quite different distribution but the particular sample happened to look as though it came from the distribution specified by the null hypothesis. Given the largish sample size and the fact that the value of the test statistic is well below the critical value, it is highly unlikely that any serious error will be committed by accepting that H0 is true.)

Testing how well a type of model fits the data

In the previous section we saw how to test the fit of a model with a normal distribution and a given mean and standard deviation. It was because the mean and standard deviation were given that the model was described in the section heading as 'complete', since it was fully specified. But this information is not always available. If the US School District had decided not to use an existing test but to develop one of its own, then there would be no population mean and standard deviation which could be incorporated into the model. Nevertheless, for the reasons given in the previous section, the School District would still be concerned to know whether the population of 5-year-old children on the new test would have a normal distribution, regardless of the mean and standard deviation. How would it find out?

Table 9.3. Calculations for testing the hypothesis that the scores of 341 children come from a normal distribution

        1              2            3                4             5          6
Interval of scores  Observed   Standardised     Proportion     Expected   Deviance
From   Less than    frequency  interval         expected in    frequency
                               From  Less than  each interval
  -      45             9        -     -1.83      0.0334         11.4       0.51
 45      50            28      -1.83   -1.36      0.0536         18.3       5.14
 50      55            28      -1.36   -0.89      0.0992         33.8       1.00
 55      60            46      -0.89   -0.41      0.1548         52.8       0.88
 60      65            70      -0.41    0.06      0.1829         62.4       0.93
 65      70            53       0.06    0.53      0.1780         60.7       0.98
 70      75            41       0.53    1.00      0.1394         47.5       0.89
 75      80            38       1.00    1.47      0.0879         30.0       2.13
 80 and over           28       1.47 and over     0.0708         24.1       0.63

X̄ = 64.36, s = 10.6                                      Total deviance = 13.09

Let us imagine that such a test is produced and administered to a random sample of 341 5-year-old children in the area. The null hypothesis is that the data obtained in this way represent a random sample drawn from a normally distributed population. The alternative hypothesis is that the parent population is not normally distributed. The procedure followed to test the fit of a type of model (here one with a normal distribution) is very similar to the one elaborated in the previous section for a complete model. We begin with the sample scores summarised in the first two columns of table 9.3. Since we have not hypothesised a population mean (μ) nor a standard deviation (σ), we will take as our estimate of these


the sample mean (X̄) and standard deviation (s). In the present case X̄ = 64.36 and s = 10.6. Using these figures, the endpoints of the intervals are standardised (column 3) and the expected proportion of the model population to be found in each class interval is again calculated with the help of a normal table (table A2). Each proportion is then multiplied by 341 to give the number of scores we would expect in each class if the sample were taken from a normally distributed population (column 5). For each class, the deviance is computed by means of the formula:

$$\frac{(o_i - e_i)^2}{e_i}$$

and the deviances are summed to give a total deviance of 13.09 (column 6).

The only difference from what was done in the last section is that this time the sample mean and standard deviation have been used to estimate their population equivalents. This affects the degrees of freedom. As we have said before, the degrees of freedom can be considered in a sense as the number of independent pieces of information we have on which to base the test of a hypothesis. We began here with nine separate classes whose frequencies we wished to compare with expected frequencies. However, we really have only eight separate expected values since the ninth value will be the total frequency less the sum of the first eight expected frequencies. But the degrees of freedom have to be further reduced. In estimating the population mean and standard deviation from the sample, we have, if you like, extracted two pieces of information from the data (one to estimate the mean and another to estimate the standard deviation), reducing by two the number of pieces of information available for checking the fit of the model to the data. The degrees of freedom in this case are therefore 6.

You should not worry if you do not follow the argument in the previous paragraph. The concept of degrees of freedom is difficult to convey in ordinary language and in a book such as this we cannot hope to make it fully understood. So far we have appealed to intuition, using the notion of 'pieces of information'. From now on, however, as the reasoning in particular cases becomes more complex, we shall not always attempt to provide the rationale for the degrees of freedom in a particular example. We shall continue, of course, to make clear exactly how they should be calculated.

As we saw above, the degrees of freedom in this instance are 6 (9 - 1 - 2). If H0 is correct, the total deviance (13.09) will have approximately a chi-squared distribution with 6 df. We see in table A5 that the corresponding 5% critical value is 12.6. Since the total deviance obtained is greater than this, we reject the null hypothesis. From the evidence of the sample scores, it would seem rather unlikely that the population of scores on the test of the 5-year-old children will have a normal distribution. This means that if the School District were to use the test in its present form, it might not be possible to benefit from the known properties of the normal distribution when making decisions about the provision of speech therapy to children in the area. However, we must not forget that there is a probability of 5% that the result of the test is misleading and that the test scores really are normally distributed, or at least have a distribution sufficiently close to normal to meet the requirements of the School District authorities. On the other hand, inspection of table 9.3 suggests that the distribution of test scores is rather more spread out than the normal distribution, with higher frequencies than expected in the tails. If a cut-off point of two standard deviations below the mean is used the remedial education services may be overwhelmed by having referred to them many more children than expected.

You will realise that the application of the statistical tests elaborated in this and the previous section is not limited to articulation tests and School Boards in the USA. It should not be difficult to think of comparable examples. What you might not realise is that this idea of testing the fit of models to observed data is not limited to test scores, but can be extended to different kinds of linguistic data. For instance, an assumption underlying factor analysis and regression (chapters 15 and 13) is that the population scores on each variable are normally distributed, and this can be checked in particular cases, when the sample size is large enough, by the method described in this section. If the data do not meet the assumption of normality, the results of factor analysis or regression analysis (if carried out) should be treated with extra caution.

9.3 Testing the model of independence

In this section we will present two examples of a rather different application of the chi-squared distribution. The first example is taken from a study reported by Ferris & Politzer (1981). They wanted to compare the English composition skills of two groups of students with bilingual backgrounds. The children in Group A had been born in the USA and educated in English. Those in Group B had been born in Mexico, where they had received their early schooling in Spanish, and had later moved to the USA, where their schooling had been entirely in English. There
were 30 children in each group, all about 14 years old. Each of them wrote a composition of at least 100 words in response to a short film, and the first 100 words of each essay were then scored in several different ways. One of the measures used was the number of verb tense errors made by each child in the composition, and the results of this are shown in table 9.4(a) (such a table is referred to as a contingency table). We can see there that there are differences between the two groups in the number of tense errors that they have made. What we must ask ourselves now is whether the differences observed could be due simply to sampling variation, that is, we have two samples drawn from the same population; or whether they indicate a real difference, that is, the two samples are actually from different populations.

Table 9.4. Contingency table of number of verb tense errors in children's essays

(a) Observed frequencies
                 Number of errors in verb tense
                 0     1 error   2-6 errors   Row total
Group A          7        7         16           30
Group B         13       11          6           30
Column total    20       18         22           60

(b) Expected frequencies: (row total) × (column total) ÷ (grand total)
                 Number of errors
                 0     1 error   2-6 errors   Row total
Group A         10        9         11           30
Group B         10        9         11           30
Column total    20       18         22           60

(c) Deviances: (observed - expected)² ÷ expected
                 Number of errors
                 0     1 error   2-6 errors
Group A         0.9     0.44       2.27
Group B         0.9     0.44       2.27

Total deviance = 7.22

Reproduced from Ferris & Politzer (1981)

To answer this question, we must refer back to chapter 5 and the discussion there about the meaning of statistical independence. If the number of errors scored by an individual is independent of his early experience, then, if the experiment were to be carried out over the entire population, the proportions of individuals scoring no errors, one error, or more than one error would be the same for both subpopulations. Suppose then we set up the null hypothesis, H0, that the number of errors is independent of the early school experience. If H0 were true, we could consider that, as regards propensity to make errors in verb tense, the two groups are really a single sample from a single population. The proportion of this single sample making no errors is 20/60 (i.e. 20 of the 60 children make no errors). We consider this as an estimate of the proportion in the complete population who would make no errors of tense under the same conditions. If Group A were chosen from such a population, about how many could be expected to make no errors? Let us assume that the same proportion, 20/60, would fall into that category. Since Group A consists of 30 subjects, we would expect about (20/60) × 30 = 10 subjects of Group A to make no errors of tense. This figure is entered as an expected frequency in table 9.4(b). Proceeding in the same way, we obtain the number of subjects in Group A expected to make one verb error (9) and from two to six errors (11). When this process is repeated for Group B, the same expected frequencies (10, 9, 11) are obtained. This is because the two groups contain the same number of subjects, which will not always, or even usually, be the case. It is not necessary that the groups should be of the same size; the test of independence which we are developing here works perfectly well on groups of unequal size.

Generalising the above procedure, the expected frequencies of different numbers of errors in each group can be obtained by multiplying the total frequency of a given number of errors over the two groups (the column total) by the number of subjects in each group (the row total) and dividing the result by the grand total of subjects in the experiment. The formula:

(column total × row total) ÷ grand total

will give you the expected frequencies, however many rows and columns you have. You should check that by using it you can obtain all the expected frequencies in table 9.4(b).

Now that we have a table of observed frequencies and another of the corresponding expected frequencies, the latter being calculated by assuming the model of independence, we can test that model in the same way that we have tested models in the previous two sections. The total deviance is computed in 9.4(c), and it then only remains to check whether this value, 7.22, can be considered large enough to call for the rejection of the null hypothesis. As previously, the total deviance will have a chi-squared distribution if the model is correct. The degrees of freedom are easily calculated using the formula:

(number of columns - 1) × (number of rows - 1).
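The expected-frequency and degrees-of-freedom formulas above can be tried out in a few lines of code. What follows is only a minimal sketch in Python, not part of the original study: the function names are invented for illustration, and the observed counts are those of table 9.4(a). The upper-tail probability uses a shortcut that holds for exactly 2 df, where it equals exp(-x/2); for other degrees of freedom a printed table such as table A5 (or a statistics library) would be needed.

```python
from math import exp


def expected_frequencies(observed):
    """Expected counts under independence: row total x column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def total_deviance(observed, expected):
    """Sum over all cells of (observed - expected)^2 / expected."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def degrees_of_freedom(observed):
    """(number of columns - 1) x (number of rows - 1)."""
    return (len(observed[0]) - 1) * (len(observed) - 1)


# Table 9.4(a): rows are Group A and Group B; columns are 0, 1 and 2-6 errors.
table_9_4 = [[7, 7, 16],
             [13, 11, 6]]

expected = expected_frequencies(table_9_4)
deviance = total_deviance(table_9_4, expected)
df = degrees_of_freedom(table_9_4)

# Shortcut valid for 2 df only: the chi-squared upper-tail probability is exp(-x / 2).
p_value = exp(-deviance / 2)

print(expected)                 # compare with table 9.4(b)
print(round(deviance, 2), df)   # compare with the total deviance in table 9.4(c)
print(round(p_value, 3))
```

Because the code keeps every cell deviance unrounded, its total differs in the second decimal place from the 7.22 of table 9.4(c), which sums deviances already rounded to two places.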
In the present case this means (3 - 1) × (2 - 1) = 2. If we consult the chi-squared tables, we find that, with 2 df, the value 7.22 is significant at the 5% level. This suggests that the null hypothesis may be untenable and that the distribution of errors is different for the two populations. The data point to the population of 14-year-old bilingual children who had early schooling in Spanish in Mexico making more verb errors in compositions than those who were born in the USA and were educated there entirely in English.

Table 9.5. Some typical treatments of English loans in Rennellese

blade       buleli       half      hapu
cartridge   katalini     matches   masese
crab        kalapu       milk      meleki
cricket     kilikiti     plumber   palama
cross       kolosi       pump      pamu
engine      ininsini     rifle     laepolo
fight       paiti        rugby     laghabi
fishing     pisingi      ship      sipi
fork        poka         story     sitoli

Data from Brasington (1978)

The second example concerns the way in which English words borrowed into a Polynesian language have been modified to fit the phonological structure of the language. Brasington (1978) examined the characteristics of vowel epenthesis in loan words from English into Rennellese. This is a language spoken on the island of Rennell, at the eastern edge of the Solomon group. Table 9.5 gives some examples of typical treatments of English loans in this language. It is apparent from the examples that (a) English consonant clusters tend in the Rennellese forms to have a vowel introduced between the two elements of the cluster: the initial /kr/ of crab becomes /kal-/, the initial /bl-/ of blade becomes /bul-/, the medial /-gb-/ of rugby becomes /-ghab-/; (b) English final consonants tend to appear in Rennellese supported by a vowel: ship becomes /sipi/, half becomes /hapu/. These modifications (all referred to by Brasington as 'vowel epenthesis') can plausibly be attributed in general to the phonotactic structure of Rennellese, which exhibits a 'typically Polynesian ... simple sequential alternation of consonants and vowels' (Brasington 1978: 27). The CV syllable structure of the borrowing language modifies the CCV or VC or VCCV structures of the loaning language in obvious and predictable ways. While this may explain the fact of epenthesis, the selection of particular epenthetic vowels in specific cases remains to be accounted for. The Rennellese vowel system (transcribed as i, e, a, o, u) has three heights and (except for the low vowel) a front/back distinction. Why does the Rennellese version of English plumber, /palama/, select /a/ as the epenthetic vowel to break up the /pl/ cluster? The most straightforward explanation would be one of reduplication of the non-epenthetic vowel. Rennellese represents the /ʌ/ vowel in English plumber as /a/; the same vowel is used as the epenthesised one. The same strategy seems to be followed in the Rennellese word for crab: English /æ/ is represented as /a/, and the same vowel epenthesised. We can see similar examples for medial position: rugby /laghabi/; and for final position: ship /sipi/. There are however counterexamples. In initial position, English blade /bleid/ is realised in Rennellese as /buledi/; in final position, half /ha:f/ appears as /hapu/.

Table 9.6. Contingency table of type of epenthesis by position of vowel in English loan words in Rennellese

(a) Observed frequencies
Position of          Type of epenthesis
epenthetic vowel     Reduplicating    Non-reduplicating
Initial                   20                14
Medial                    13                 6
Final                     61               112
Total                     94               132

(b) Expected frequencies and deviances
                     Reduplicating    Non-reduplicating
Initial               14.1 (2.5)        19.9 (1.7)
Medial                 7.9 (3.3)        11.1 (2.3)
Final                 72.0 (1.7)       101.1 (1.2)

Total deviance = 12.7 on 2 df

We might ask at this point whether there is any association between the position at which epenthesis occurs and whether or not reduplication is the strategy adopted for selection of the epenthetic vowel. Our null hypothesis would be that reduplication and position are independent. To test this hypothesis of independence we tabulate the observed frequencies of each type of epenthesis, reduplicating and non-reduplicating, in each position, initial, medial and final, as in table 9.6(a). The data for this table were obtained by Brasington from Elbert (1975), a dictionary of Rennell-Bellona, and include all English loan words entered there - a total of 226. The expected frequencies are calculated in the same way as in the previous example, by multiplying column totals by row totals and dividing by the grand total. Table 9.6(b) shows expected frequencies and deviances. The total deviance of 12.7, with 2 df, exceeds the 1% critical
value, 9.21. On this basis we are likely to reject our null hypothesis, and assume that the use of the reduplicating strategy for epenthetic vowel selection is not independent of position. Inspection of the differences between observed and expected frequencies leads us to believe that reduplication is more likely in initial and medial position, and less likely in final position. This does not, of course, exhaust the search for factors relevant to the selection of specific epenthetic vowels, and for full details the reader is urged to consult Brasington (1978). We will, however, leave the example at this point.

9.4 Problems and pitfalls of the chi-squared test

9.4.1 Small expected frequencies

We have already noted that, in order for the χ² test to have satisfactory properties, all expected frequencies have to be sufficiently large (generally 5 or greater). As we saw with the frequencies in table 9.2, it is sometimes necessary to group categories together to meet this condition. A similar problem may occur with a contingency table, with one or more of the expected frequencies falling below 5. There are two possible ways of dealing with this problem when it arises. The first is to consider whether a variable has been too finely subdivided: if so, then categories can be collapsed so that all the cells of the table do have a sufficiently large expected frequency. Suppose, for example, that in table 9.4(b) we had found that we expected very few people in one of the groups to make exactly one error. Then we could combine the second and third columns and classify degree of error into 'all correct' and 'some errors', and still test whether the two groups were similar with respect to the frequency of their errors. If, on the other hand, a similar problem had arisen in table 9.6(b) - let us say that the expected frequency for reduplicating epenthetic vowels in medial position was too low - it would have been more difficult to decide how to regroup the data. Should the medial position vowels be considered alongside those occurring in initial position, or with those epenthesised finally? Because of the nature of the data, there is no obvious solution to this particular problem.

It may also happen that, in a large table, any problem cells are distributed in a rather haphazard way. In such cases, the collapsing of cells necessary to eliminate all those with small expected frequencies will remove interesting detail. The only really satisfactory approach is to collect more data. It is in fact easy to estimate roughly how much more data we may need to collect. Consider the case where one of the variables classified in the table is controlled experimentally, as in the study of the bilingual schoolchildren where the sizes of the groups were chosen by the experimenter. Suppose that, for one of the groups, some cells have too small an expected value. Let us say that the smallest such value (for a cell we clearly wish to retain in the table) is about 1.25, i.e. a quarter of the required minimum. Then increasing the size of that group to four times its original value will cause all the expected values in that row to be increased by about the same factor provided that the proportions of the new data falling in the different columns are roughly the same as in the original. However, to do this could be extremely expensive in terms of experimental effort, even supposing that the variables are under the experimenter's control. In the Rennellese loans, for example, both the type of epenthesis and its place in the word are language-contact phenomena which are not under the control of the researcher. In such circumstances it would be necessary to increase the total number of observations fourfold. Even then success, though likely, is not guaranteed, since we cannot be sure that the new observations will increase the row and column totals relevant to the cell in question as much as we had hoped. When we consider, in addition, that it is impossible in this particular case to find any further loan words from the sources used (since Brasington's data comprise all the loan words from English in Elbert 1975), and that further data will require expensive fieldwork, we are likely to conclude that there is little value in trying to augment the observations.

There is, however, a third option. We can go ahead and carry out the chi-squared test even if some expected frequencies are rather too small. It can be shown that the likely effect of this is to produce a value of the test statistic which is rather larger than it ought to be when the null hypothesis is true; that is, there is more likelihood of a type 1 error (chapter 8). On the other hand, if all the cells with small expected values have an observed frequency very similar to the expected and thus contribute relatively little to the value of the total deviance, it is unlikely that the value of the deviance has been seriously distorted, and the result of the test can be accepted, especially if we adopt the attitude to hypothesis testing suggested in 8.5.

Whenever χ² values based on expected frequencies of less than 5 are reported, the reader's attention should be drawn to this fact and any conclusions arrived at on the basis of a statistically significant outcome should be expressed in suitably tentative terms. An examination of the use of χ² in the applied linguistics literature reveals that this is not always done. Indeed, there is cause to wonder whether the authors are always aware of the failure of the data to meet the requirements of the test.
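The checks described in this subsection are simple enough to automate. The following Python fragment is a rough sketch under stated assumptions, not a recommended procedure: the helper names (small_cells, collapse_columns, growth_factor) are invented here, the minimum of 5 is the rule of thumb quoted above, and the counts are those of table 9.4(a). It flags cells whose expected frequency is too small, merges the second and third columns into 'some errors' as suggested in the text, and computes the rough multiple by which the data would have to grow.

```python
def expected_frequencies(observed):
    """Expected counts under independence: row total x column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def small_cells(expected, minimum=5.0):
    """Return (row, column, value) for every expected frequency below the minimum."""
    return [
        (i, j, e)
        for i, row in enumerate(expected)
        for j, e in enumerate(row)
        if e < minimum
    ]


def collapse_columns(observed, first, last):
    """Merge columns first..last (inclusive, counted from 0) into one column."""
    return [
        row[:first] + [sum(row[first:last + 1])] + row[last + 1:]
        for row in observed
    ]


def growth_factor(smallest_expected, minimum=5.0):
    """Rough multiple of the present data needed for the smallest cell to reach the minimum."""
    return minimum / smallest_expected


# Table 9.4(a): rows are Group A and Group B; columns are 0, 1 and 2-6 errors.
table_9_4 = [[7, 7, 16],
             [13, 11, 6]]

# In the real table no expected frequency falls below 5, so nothing is flagged.
problem_cells = small_cells(expected_frequencies(table_9_4))

# Hypothetical regrouping into 'all correct' versus 'some errors'.
collapsed = collapse_columns(table_9_4, 1, 2)

print(problem_cells)
print(collapsed)
print(growth_factor(1.25))  # smallest expected value of 1.25, as in the text
```

The growth factor is only the rough guide described above: it assumes that the new observations fall into the cells in about the same proportions as the existing ones.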


9.4.2 The 2 × 2 contingency table

Table 9.7 reproduces another section of the results from the paper of Ferris & Politzer (1981), this time comparing the essays of two groups of bilingual children for the number of errors made in pronoun agreement. The authors quite correctly report a chi-squared value, i.e. the value of the total deviance, of 3.516 with 1 df, and this fails to reach the tabulated 5% value, 3.84. If you calculate the total deviance using the method explained above, you will arrive at a value of 4.6, which would apparently lead to the rejection of the null hypothesis at the 5% significance level, the null hypothesis stating that the type of early education does not affect the number of errors in pronoun agreement. Why is there this discrepancy between the statistic derived via the analysis explained earlier, and the value quoted in Ferris & Politzer (1981)? It can be shown mathematically that if the model of independence is correct then the total deviance will have approximately a chi-squared distribution with the relevant degrees of freedom. The approximation is very close when the expected frequencies are large enough and the table is larger than 2 × 2. When the table has only two rows and two columns the total deviance tends to be rather larger than it ought to be for its distribution to be modelled well as a chi-squared variable with 1 df. Thus a correction (referred to as 'Yates' correction') is needed, and takes the following form.

Table 9.7. Errors in pronoun agreement for two groups of bilingual schoolchildren

                 Number of errors
                 0       1-3      Row total
Group A         15       15          30
Group B         23        7          30
Column total    38       22          60

chi-squared = 3.516

Data from Ferris & Politzer (1981)

As always, for each of the four cells, you must find the difference between expected and observed frequency. Then, ignoring the sign of that difference, reduce its magnitude by 0.5, square the result and divide by the expected value as before. For example, for the cell giving the number of subjects in Group A who make no errors with agreement of pronouns (table 9.7) the expected frequency is 19 while 15 instances were actually observed. The magnitude of the difference (15 - 19) is 4. We reduce this to 3.5 and calculate the deviance (3.5)² ÷ 19 = 0.645. The remainder of the individual deviances are given in table 9.8 and we see that the chi-squared value of 3.516 is correct. Note that, in this example, the unmodified value of the total deviance would have led us to conclude, incorrectly, that the result was significant at the 5% level.

Table 9.8. Corrected deviances for table 9.7, using the formula: (|observed frequency - expected frequency| - 0.5)² ÷ expected

                 Number of errors
                 0        1-3
Group A        0.645     1.113
Group B        0.645     1.113

Total = 3.516

9.4.3 Independence of the observations

It is quite common in the study of first or second language learning to analyse a series of utterances from the speech of an individual in order to ascertain the incidence of various grammatical elements and/or errors of a particular kind. Hughes (1979) provides a simple example of this. In his study of the learning of English through conversation by a Spanish speaker (referred to in chapter 3) one of the features that he investigated was noun phrases of the type possessor-possessed, e.g. Marta('s) bedroom. Over one period covered in the study the learner used a total of 81 constructions of this type. Of these, 55 showed correct English ordering (e.g. Marta('s) bedroom), while the remainder (26) reflected Spanish constituent order (e.g. bedroom Marta). On the lexical dimension, 23 of the 81 collocations involved pairings of words that were referred to as 'novel', since these particular pairings had not been used previously by the learner. The remaining 58 pairs were non-novel. An interesting question, relating to productivity in the learner's language, is whether the novel and non-novel instances manifest different error rates. The data are entered in table 9.9 which looks similar in structure to the 2 × 2 contingency table (table 9.7). However, it would be mistaken to use the chi-squared test on this data, because the separate utterances which were analysed to obtain the data for the table were not independent of each other. We have already stressed in earlier chapters that the data we use to estimate a parameter of a population, or to test a statistical hypothesis, must come from a random sample of the population. This is no less true

of the chi-squared test for association between the rows and columns of a contingency table. In particular, all the instances which have been recorded must be separate and not linked in any way. For example, in spontaneous data from language learners it is not uncommon for instances of the same structure to occur at points close in time, though not as part of the same utterance. It would be naive to suppose that the form which the structure takes on a second occasion is quite independent of the form it took in an utterance occurring a few seconds earlier (see also 7.5.2). This is particularly important in situations like the one under discussion where some cell frequencies are likely to be more affected than others by the lack of independence. Instances of novel pairings of nouns cannot, by definition, be affected by repetition since when a word pair is repeated it is no longer novel. However, instances of non-novel noun pairings could occur in successive utterances. Whenever it is not clear that the observations are completely independent, if a chi-squared value is calculated it should be treated very sceptically and attention drawn to possible defects in the sampling method.

Table 9.9. The relation between the novelty of word pairings and ordering errors in the English of a native Spanish speaker

                   Novel pairings   Non-novel pairings   Row total
Correct order            8                 47                55
Incorrect order         15                 11                26
Column total            23                 58                81

Consider the following example. A researcher is interested in the frequency with which two groups of learners of English supply the regular plural morpheme (realised as /s/, /z/, /ɪz/) in obligatory contexts. One group of learners has had formal lessons in English, the other group has not. Each learner is interviewed for one hour and the number of times he supplies and does not supply the morpheme is noted. The results are reported in table 9.10.

Table 9.10. Frequency with which two groups of learners of English supply and fail to supply regular plural morphemes in obligatory contexts during

           Group with instruction         Group without instruction
           Morpheme     Morpheme          Morpheme     Morpheme
           supplied     not supplied      supplied     not supplied
              32            28               42            36
             112            24               39            28
             106            39               31            29
              42            40               55            37
              26            24               62            55
Group
totals       318           155              229           185

The researcher would be completely mistaken to use the group totals in order to carry out a χ² test. For one thing, two members of the group that had received instruction contributed many more tokens than anyone else in the study. Their performance had undue influence on the total for their group. But even if all learners had provided an equal number of tokens, it would still not be correct to make use of the group totals in calculating χ². This is because an individual's performance with respect to the plural morpheme on one occasion cannot be seen as independent of his performance on all the other occasions within a single hour.

If we seem to have laboured the point about independence in relation to χ², there is a reason. There is evidence in applied linguistics publications that the requirement of independence is not generally recognised. The reader is therefore urged to exercise care in the use of χ² and also to be on the alert when encountering it in the literature. Whenever an individual's contribution to a contingency table is more than 'one', then there must be suspicion that the assumption of independence has not been met.

9.4.4 Testing several tables from the same study

In the Ferris & Politzer study, the essays of the 60 students were marked for six types of error in all, and for each type of error a contingency table was presented in the original paper. Each table was analysed to look for differences between groups at the 5% significance level. There are two points to watch when several tests are carried out based on data from the same subjects.

First, the risk of spurious 'significant' results (i.e. type 1 errors) increases with the number of tests. If we carry out six independent chi-squared tests at the 5% significance level then, for each test it is true that the probability of wrongly concluding that the groups are different is only 0.05. However, the probability that at least one test will lead to this wrong conclusion is 1 - (0.95)⁶ = 0.26. In fact the errors in verb tense which we analysed in table 9.4 were the only set of errors which gave a significant difference at the 5% level. From the above argument the true significance (taken in conjunction with all the other tests) could be 0.26, or 26%,
and we might conclude that there is no strong evidence that the groups are more different than might have occurred by random sampling from the same population. (We cannot be sure of the exact value since the contingency tables, being used all on the same subjects, will not be independent. The true significance level will be somewhere between 5% and 26%.)

A second problem that may occur when one analyses the same scripts or the same set of utterances for a number of different linguistic variables is that these variables may not be independent. Even though we may have, say, yes/no questions and auxiliaries as separate categories in our analysis it is unlikely that these variables are entirely unconnected. Analysing sets of dependent tables can cause the groups to appear either more or less alike than they really are, depending on the relationship between the different variables. Again, in practice one may very well carry out many tests based on a single data set, but it is important to realise that the results cannot be given as much weight as if each table were based on an independent data source.

9.5 The use of percentages
We have already seen that the data in table 9.7, when tested for independence, produced a non-significant value of chi-squared, suggesting no evidence of any association between rows and columns. Consider now table 9.11(a), which provides a chi-squared value of 6.18, significant at the 0.01 level (for 1 df). This table is, however, 'identical' to the first except that the cell frequencies have been converted to percentages of the total frequency. Table 9.11(b), where an alternative method of calculating the frequencies of table 9.7 in terms of percentages is used, gives an even more dramatic result. Here the frequencies are given as percentages of the row totals, and the resulting chi-square of 13.4 appears to be highly significant. Both examples serve to underline how misleading the conversions of raw observed frequencies into percentages can be. The effect can operate in the other direction, disguising significant results, if the true total frequency is greater than 100.

Table 9.11. Data of table 9.7 restated as percentages

(a) Percentage of total number of subjects
                 Number of errors
                 0       1-3     Row total
Group A          25      25      50
Group B          38      12      50
Column total     63      37      100

(b) Percentage of number of subjects in each group
                 Number of errors
                 0       1-3     Row total
Group A          50      50      100
Group B          76      24      100
Column total     126     74      200

SUMMARY
The previous chapter dealt with the situation where the underlying model was taken for granted (e.g. the data came from a normal distribution) or did not matter (because the sample size was so large). This chapter discussed the problem of testing whether the model itself was adequate, at least for a few special cases.

(1) It was shown how to test H₀: a sample is from a normal distribution with a specific mean and standard deviation versus H₁: the sample is not from that distribution, by constructing a frequency table and comparing the observed class frequencies, o_i, with the expected class frequencies, e_i, if H₀ were true. The test statistic was the total deviance, $X^2 = \sum_i (o_i - e_i)^2 / e_i$, which would have a chi-squared distribution with k - 1 degrees of freedom where k is the number of classes in the frequency table.
(2) A test was presented of H₀: the sample comes from a normal distribution with no specified mean and standard deviation versus H₁: it comes from some other distribution. The procedure and test statistic were the same except that the degrees of freedom were now (k - 3).
(3) The contingency table was introduced together with the chi-squared test, to test the null hypothesis H₀: the conditions specified by the rows of the table are independent of those specified by the columns or H₀: there is no association between rows and columns versus the alternative, which is the simple negation of H₀. The expected and observed frequencies are again compared using the total deviance which has a chi-squared distribution when H₀ is true. The number of degrees of freedom is obtained from the rule:
    df = (no. of rows - 1) × (no. of columns - 1)
(4) It was pointed out that the chi-squared test of contingency tables is frequently misused: all the expected frequencies must be 'reasonably large' (generally 5 or more); the 2 × 2 contingency table requires special treatment; the observations must be independent of one another (the only completely safe rule is that each subject supplies just a single token); the raw observed frequencies should be used, not the relative frequencies or percentages.
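The warning about percentages in section 9.5 can be checked numerically. The sketch below is illustrative only: the function name is ours, and it assumes that the 2 × 2 tables were tested with Yates' continuity correction. The text does not state this, but the assumption reproduces the two quoted values when the percentages of table 9.11 are (wrongly) treated as observed frequencies.

```python
def yates_chi_squared(a, b, c, d):
    """Chi-squared for the 2 x 2 table [[a, b], [c, d]] with Yates' correction.

    Hypothetical helper written for this illustration.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # |ad - bc| is reduced by n/2 (the continuity correction) before squaring.
    corrected = max(abs(a * d - b * c) - n / 2, 0.0)
    return n * corrected ** 2 / (row1 * row2 * col1 * col2)


# Table 9.11(a): percentages of the total number of subjects, treated as counts.
chi_a = yates_chi_squared(25, 25, 38, 12)
# Table 9.11(b): percentages of each row total, treated as counts.
chi_b = yates_chi_squared(50, 50, 76, 24)

print(round(chi_a, 2), round(chi_b, 1))
```

The proportions in the two tables are the same; only the apparent total changes (100 versus 200), and the statistic grows with it. This is why the raw observed frequencies must be used.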
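The risk of at least one spurious 'significant' result among several tests, discussed in section 9.4, can be computed directly. This is a minimal sketch which assumes, as the text does for the sake of argument, that the tests are independent; the function name is ours.

```python
def prob_at_least_one_type_i_error(alpha, n_tests):
    """Chance that at least one of n_tests independent tests is wrongly significant."""
    return 1 - (1 - alpha) ** n_tests


# Six independent tests, each carried out at the 5% significance level.
p = prob_at_least_one_type_i_error(0.05, 6)
print(round(p, 2))
```

With tables based on the same subjects the tests are not independent, so this figure is only an upper limit on the true significance level.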
EXERCISES

(1) Lascaratou (1984) studied the incidence of the passive in Greek texts of various kinds. Two types of text that she looked at were scientific writing and statutes. Out of a sample of 1,000 clauses from statutes, 698 were passive (incidentally, she included only those clauses where a choice was possible, eliminating active clauses which could not be passivised); out of a sample of 999 clauses of scientific writing, 642 were passive. Knowing that sampling has been very careful, what conclusions would you come to regarding the relative frequency of the passive in texts of the two types?

(2) A group of monolingual native English speakers and a group of native speakers, bilingual in Welsh and English, were shown a card of an indeterminate colour and asked to name the colour, in English, choosing between the names blue, green or grey. The responses are given below. Does it appear that a subject's choice of colour name is affected by whether or not he speaks Welsh?

                 Blue    Green   Grey
Monolingual:     28      41      16
Welsh:           40      38      29

(3) An investigator examined the relationship between oral ability in English and extroversion in Chinese students. Oral ability was measured by means of an interview. Extroversion was measured by means of a well-known personality test. On both measures, students were designated as 'high', 'middle' or 'low' scorers. The results obtained are shown separately for males and females in table 9.12. Calculate the two χ² values, state the degrees of freedom, and say whether the results are statistically significant. What conclusion would you come to on the basis of this data?

Table 9.12. Data on relationship of oral ability in English and extroversion (Chinese students)

Female subjects              Oral proficiency scores
                             High    Middle  Low
Extroversion    High         2       6       4
                Middle       3
                             1       2

Male subjects                Oral proficiency scores
                             High    Middle  Low
Extroversion    High         2       0       0
                Middle       2       7       0
                Low          3       1       2

(4) An investigator asked English-speaking learners of an exotic language whether sentences he presented to them in that language were grammatical or not. Amongst his results he reports the following: for sentences containing one kind of error, there was a 33% rejection rate; for sentences containing a related error, the rejection rate was only 13%. In both cases N is said to be 60. Is the rejection rate for one kind of error significantly different from that for the other kind?

10
Measuring the degree of interdependence between two variables

In chapters 5 and 6 we introduced a model for the description of the random variation in a single variable. In chapter 13 we will discuss the use of a linear model to describe the relationship between two random variables defined on the same underlying population elements. In the present chapter we introduce a measure of the degree of interdependence between two such variables.

10.1 The concept of covariance
A study by Hughes & Lascaratou (1981) was concerned with the evaluation of errors made by Greek learners of English at secondary school. Three groups of judges were presented with 32 sentences from essays written by such students. Each sentence contained a single error. The groups of judges were (1) ten teachers of English who were native speakers of the language, (2) ten Greek teachers of English, and (3) ten native speakers of English who had no teaching experience. Each judge was asked to rate each sentence on a 0-5 scale for the seriousness of the error it contained. A score of 0 was to indicate that the judge could find no error, while a score of 5 would indicate that in the judge's view it contained a very serious error. Total scores assigned for each sentence by the two native English-speaking groups are displayed in table 10.1. One question that can be asked of this data is the extent to which the groups agree on the relative seriousness of the errors in the sentences overall. In this case we wish to examine the degree of association between the total error scores assigned to sentences by the two native-speaker groups.

Table 10.1. Total error scores assigned by ten native English teachers (X) and ten native English non-teachers (Y) for each of 32 sentences

Sentence    English teachers (X)    English non-teachers (Y)
 1          22                      22
 2          16                      18
 3          42                      42
 4          25                      21
 5          31                      26
 6          36                      41
 7          29                      26
 8          24                      20
 9          29                      18
10          18                      15
11          23                      21
12          22                      19
13          31                      39
14          21                      23
15          27                      24
16          32                      29
17          23                      18
18          18                      16
19          30                      29
20          31                      22
21          20                      12
22          21                      26
23          29                      43
24          22                      26
25          26                      22
26          20                      19
27          29                      30
28          18                      17
29          23                      15
30          25                      15
31          27                      28
32          11                      14

X̄ = 25.03    s_X = 6.25
Ȳ = 23.63    s_Y = 8.26

As a first step in addressing this question, we construct the kind of diagram to be seen in figure 10.1, which is based on the data in table 10.1, and is referred to as a scatter diagram, or scattergram. Each point has been placed at the intersection on the graph of the total error score for a sentence by the English teachers (the X axis) and the English non-teachers (the Y axis). So, for example, the point for sentence 12 is placed at the intersection of X = 22 and Y = 19. The pattern of points on the scattergram does not appear to be haphazard. They appear to cluster roughly along a line running from the origin to the top right-hand corner of the scattergram. There are three possible ways, illustrated in figure 10.2, in which a scattergram with this feature could arise:

(a) There is an exact linear (i.e. 'straight line') relationship between X and Y, distorted only by measurement error or some other kind of random variation in the two variables.
(b) There is a non-linear relationship between X and Y which, especially in the presence of random error, can be represented quite well by a linear model.
(c) There is no degree of linear relation between X and Y and any apparent linearity in the way the points cluster in figure 10.1 is due entirely to random error.

[Figure 10.1. Scattergram of data of table 10.1. Horizontal axis: scores of teachers.]

[Figure 10.2. Three hypothetical relationships which might give rise to the data of table 10.1 and figure 10.1. Panels (a), (b) and (c).]

As usual there will be no way to decide with certainty between these different hypotheses. We must find some way of assessing the degree to which each of them is supported by the observed evidence. To do this we first of all require some measure of the extent to which a linear relationship between two variables is implied by observed data. We have already developed measures of 'average' (the mean and the median), measures of dispersion (variance and standard deviation), and now we will introduce a measure of linear correlation.

This will be done in two stages, the first of which is to define a rather general measure of the way and the extent to which two variables, X and Y, vary together. This is known as the covariance, which we will designate COV(X,Y), and is defined by:¹

\[ COV(X,Y) = \frac{1}{n-1} \sum_i (X_i - \bar{X})(Y_i - \bar{Y}) \]

¹ There is a degree of arbitrariness about the choice of the divisor; some authors may use n, others n - 2. The choice we have made simplifies later formulae. Of course, for largish samples the result will be effectively the same.

In table 10.2 we have shown what this formula would mean for the data of table 10.1. For both of the variables X, the total error score for each sentence given by the English teachers, and Y, the total error score for the same sentence given by the English non-teachers, we start as though we were about to calculate the variance (cf. 3.7). For example, we find X̄, the mean of the 32 X observations, and calculate the difference d_i(X) between each observed value, X_i, and the mean. We then carry out a similar operation on the 32 Y values to obtain the d_i(Y). We then multiply d_1(X) by d_1(Y), d_2(X) by d_2(Y) and so on. We finish by adding the 32 products together and then divide by 31 to find the 'average product of deviations from the mean'. The results of these calculations can be found in table 10.2. (Note that the column of this table giving the cross-products of X and Y is not used here, but is made use of in table 10.3 to calculate the covariance by an alternative method.)

The reason for doing all this may not immediately seem obvious. It may become clearer, however, when it is recognised that three distinct patterns in the data are translated by this process into three rather different values of the covariance.

Pattern 1
X and Y tend to increase together. In this case, when X is bigger
than the X average, Y will usually be bigger than the Y average, so that the product d_i(X)d_i(Y) will tend to be positive. Whenever X is less than the X average, Y will usually be smaller than the Y average. Both deviations d_i(X) and d_i(Y) will then be negative so that their product will again be positive. Since most of the products will be positive, their sum (and mean) will be positive and (possibly) quite large.

Pattern 2
There is a tendency for Y to decrease as X increases and vice versa. Now we will find that when X is below average, Y will usually be above average and vice versa. The two deviations d_i(X) and d_i(Y) will usually have opposite signs so that their product will be negative. The sum of the products will then be large and negative.

Pattern 3
There is no particular relation between X and Y. This implies that the sign of d_i(X) will have no influence on the sign of d_i(Y) so that for about half the subjects they will have the same sign and for the remainder one will be negative and the other positive. As a result, about half the products d_i(X)d_i(Y) will be positive and the rest negative so that the sum of products will tend to give a value close to zero.

Table 10.2. Covariance between the error scores assigned by ten native English teachers and ten native English non-teachers on 32 sentences

(1)        (2)   (3)   (4)     (5)            (6)            (7)
Sentence   X     Y     XY      d(X) = X - X̄   d(Y) = Y - Ȳ   d(X)d(Y)
 1         22    22     484     -3.03          -1.63            4.94
 2         16    18     288     -9.03          -5.63           50.84
 3         42    42    1764     16.97          18.37          311.74
 4         25    21     525     -0.03          -2.63            0.08
 5         31    26     806      5.97           2.37           14.15
 6         36    41    1476     10.97          17.37          190.55
 7         29    26     754      3.97           2.37            9.41
 8         24    20     480     -1.03          -3.63            3.74
 9         29    18     522      3.97          -5.63          -22.35
10         18    15     270     -7.03          -8.63           60.67
11         23    21     483     -2.03          -2.63            5.34
12         22    19     418     -3.03          -4.63           14.03
13         31    39    1209      5.97          15.37           91.76
14         21    23     483     -4.03          -0.63            2.54
15         27    24     648      1.97           0.37            0.73
16         32    29     928      6.97           5.37           37.43
17         23    18     414     -2.03          -5.63           11.43
18         18    16     288     -7.03          -7.63           53.64
19         30    29     870      4.97           5.37           26.69
20         31    22     682      5.97          -1.63           -9.73
21         20    12     240     -5.03         -11.63           58.50
22         21    26     546     -4.03           2.37           -9.55
23         29    43    1247      3.97          19.37           76.90
24         22    26     572     -3.03           2.37           -7.18
25         26    22     572      0.97          -1.63           -1.58
26         20    19     380     -5.03          -4.63           23.29
27         29    30     870      3.97           6.37           25.29
28         18    17     306     -7.03          -6.63           46.61
29         23    15     345     -2.03          -8.63           17.52
30         25    15     375     -0.03          -8.63            0.26
31         27    28     756      1.97           4.37            8.61
32         11    14     154    -14.03          -9.63          135.11

COV(X,Y) = 1231.49 ÷ 31 = 39.73

r = COV(X,Y) / (s_X s_Y) = 39.73 / (6.25 × 8.26) = 0.772

The covariance between two random variables is a useful and important quantity and we will make use of it directly in this and later chapters. However, its use as a descriptive variable indicating the degree of linearity in a scatter diagram is made difficult by two awkward properties it possesses.

The first we have met before when the variance was introduced (see 3.7). The units in which covariance is measured will normally be difficult to interpret. In the example of table 10.1 both d_i(X) and d_i(Y) will be a number designating assigned error score so that the product will have units of 'error score × error score' or 'error score squared'. The second is more fundamental. Look at figure 10.3 which shows the scatter diagram of height against weight of 25 male postgraduate students of a British university. As we would expect and can see from the data, taller students tend to be heavier, but it is not invariably true that the taller of two students is the heavier. In figure 10.3(a) heights are measured in metres and weight in kilos. In figure 10.3(b) the units used are, respectively, centimetres and grams. (If you think the two diagrams are identical look carefully at the scales marked on the axes.) Clearly the relationship between the two variables has not changed in the sample, but the covariances corresponding to the two diagrams are 0.275 metre-kilos and 27,500 centimetre-grams, respectively. Changing the units from metres to centimetres and kilos to grams causes the covariance to be increased by a factor of 100,000. (We would say that covariance is a scale-dependent measure.) Surely if we can plot graphs of identical shape for two sets of data, we would want any measure of that shape to give the same value both times? Fortunately, both these defects in the covariance statistic are removed by making a single alteration, which we will present in the next section.

[Figure 10.3. Scattergram of height and weight of 25 students. Panel (a): height in metres; panel (b): height in centimetres, with weight on the vertical axis in the corresponding units.]

10.2 The correlation coefficient
The standard deviation was calculated of both the heights and the weights from the data on which the scatter diagrams of figure 10.3 were based. They are (s_X and s*_X are actually the same length; the star simply indicates a change in units):

Set a: fig. 10.3(a)    s_X = 0.051 metres          s_Y = 6.2 kilos
Set b: fig. 10.3(b)    s*_X = 5.1 centimetres      s*_Y = 6200 grams

As we would expect from our knowledge of the properties of the sample standard deviation, s*_X = 100 s_X and s*_Y = 1,000 s_Y.

Now consider the product s_X s_Y. Its value for Set b is exactly 100,000 times its value for Set a (cf. the covariance) and the units of this product are exactly the same as the units of the covariance in both cases: metre-kilos and centimetre-grams respectively. This suggests that to describe, in some sense, the 'shape' of the two scatter diagrams of figure 10.3, we might try the quantity r, known as the correlation coefficient, in which we use the product s_X s_Y as a denominator:

For Set a:

\[ r(X,Y) = \frac{COV(X,Y)}{s_X s_Y} = \frac{0.275\ \text{metre-kilos}}{0.051\ \text{metres} \times 6.2\ \text{kilos}} = 0.87 \]

and for Set b:

\[ r(X,Y) = \frac{COV(X,Y)}{s^*_X s^*_Y} = \frac{27\,500\ \text{cm-grams}}{5.1\ \text{cm} \times 6200\ \text{grams}} = 0.87 \]

First note that r does not have any units. It is a dimensionless quantity like a proportion or a percentage. (The units in the numerator 'cancel out' with those in the denominator.) Second, it has the same value for both scattergrams of figure 10.3. Changing the scale in which one, or both, variables are measured does not alter the value of r. It can also be shown that the value of the numerator, ignoring the sign, can never be greater than s_X s_Y, and hence the value of r can never be greater than 1.

We can then sum up the properties of r(X,Y) as follows:

(1) r = r(X,Y) = COV(X,Y) / (s_X s_Y)
(2) The units in which the variables are measured do not affect the value of r (the correlation coefficient is scale-independent)
(3) -1 ≤ r ≤ 1 (i.e. r takes a value between plus and minus one)

The quantity r is sometimes referred to formally as the 'Pearson product-moment coefficient of linear correlation'.² There are other ways of measuring the degree to which two variables are correlated (we will meet one shortly), but we will adopt the common practice of referring to r simply

² The terminology of statistics is often obscure. Many of the terms were chosen from other branches of applied mathematics. Pearson is the name of the statistician credited with the discovery of r and its properties; many simple statistics like the mean, variance and covariance are called 'moments' by analogy with quantities in physics which are calculated in a similar fashion, e.g. moment of inertia, and 'product' refers to the multiplication of the two factors (X_i - X̄) and (Y_i - Ȳ) in the calculation of the covariance.
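The definitions of covariance and of r can be tried out on the error scores of table 10.1. The following is a minimal sketch in Python; the function names are ours. It gives a covariance of about 39.7, in line with table 10.2, and a correlation of about 0.77, close to the value reported there (small differences arise from rounding in the hand calculation). The raw heights and weights behind figure 10.3 are not given in the text, so the effect of a change of units is shown instead by rescaling the error scores by the same factors of 100 and 1,000; this rescaling is purely hypothetical.

```python
from math import sqrt

# Total error scores from table 10.1: teachers (X) and non-teachers (Y).
X = [22, 16, 42, 25, 31, 36, 29, 24, 29, 18, 23, 22, 31, 21, 27, 32,
     23, 18, 30, 31, 20, 21, 29, 22, 26, 20, 29, 18, 23, 25, 27, 11]
Y = [22, 18, 42, 21, 26, 41, 26, 20, 18, 15, 21, 19, 39, 23, 24, 29,
     18, 16, 29, 22, 12, 26, 43, 26, 22, 19, 30, 17, 15, 15, 28, 14]


def covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    # covariance(xs, xs) is the sample variance of xs.
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


cov_xy = covariance(X, Y)
r_xy = correlation(X, Y)

# Hypothetical change of units: X multiplied by 100, Y by 1,000.
X_scaled = [100 * x for x in X]
Y_scaled = [1000 * y for y in Y]
cov_scaled = covariance(X_scaled, Y_scaled)
r_scaled = correlation(X_scaled, Y_scaled)

print(round(cov_xy, 2), round(r_xy, 2))
print(round(cov_scaled / cov_xy), round(r_scaled, 2))
```

The covariance is multiplied by 100,000 under the change of units while r is unchanged, which is the scale-independence property (2) above.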
The same applies here. It is one thing to show that the hypothesis that two variables have no correlation can be rejected; it is quite another to argue that either variable contains an important amount of information about the other. In other words, it is not so much the existence of correlation between two variables that is important but rather the magnitude of that correlation. A sample correlation coefficient is a point estimator (chapter 7) of the population value, ρ, and may vary considerably from one sample to another. A confidence interval would give much more information about the possible value of ρ.

Unfortunately, there is no simple way to obtain a confidence interval for the true correlation ρ, in samples of less than 50 or so, even if both variables have a normal distribution. For larger samples and normally distributed variables a 95% confidence interval can be calculated as follows:³

\[ \frac{e^{X}-1}{e^{X}+1} \quad \text{to} \quad \frac{e^{Y}-1}{e^{Y}+1} \]

where

\[ X = 2\left(Z - \frac{1.96}{\sqrt{n-3}}\right), \qquad Y = 2\left(Z + \frac{1.96}{\sqrt{n-3}}\right) \]

and

\[ Z = \frac{1}{2}\log_e\left(\frac{1+r}{1-r}\right) \]

The sample size is n and r is the sample correlation coefficient. The quantity e has the value 2.71828... and logarithms to the base e are known as natural logarithms; sometimes log_e is written 'ln'. It is possible that some of the symbols used here are unfamiliar to you, so we will work through an example, step by step, which you should be able to follow on a suitable calculator.

Suppose, in a sample of 60 subjects, a correlation of r = 0.32 has been observed between two variables. Let us calculate the corresponding 95% confidence interval for the true correlation, ρ. First calculate:

\[ Z = \frac{1}{2}\log_e\left(\frac{1+r}{1-r}\right) = \frac{1}{2}\log_e\left(\frac{1.32}{0.68}\right) = 0.5\log_e(1.9412) = 0.5 \times 0.6633 = 0.3316 \]

Now,

\[ X = 2\left(Z - \frac{1.96}{\sqrt{n-3}}\right) = 2\left(0.3316 - \frac{1.96}{\sqrt{57}}\right) = 0.1440 \]

and

\[ Y = 2\left(0.3316 + \frac{1.96}{\sqrt{57}}\right) = 1.1824 \]

Next we have e^X = e^0.1440 = 1.155 (the relevant calculator key may be marked e^x or perhaps exp) and e^Y = e^1.1824 = 3.2622.

\[ \frac{e^{X}-1}{e^{X}+1} = \frac{0.155}{2.155} = 0.072 \qquad \frac{e^{Y}-1}{e^{Y}+1} = \frac{2.2622}{4.2622} = 0.531 \]

The value of the 95% confidence interval is then: 0.072 to 0.531.

³ This interval is obtained by relying on a device known as Fisher's Z-transformation; see, for example, Downie & Heath (1965: 156).

10.5 Comparing correlations
Using Fisher's Z-transformation it is also possible to test whether two correlation coefficients, estimated from independent samples, have come from populations with equal population correlations. Suppose the correlations r_1 and r_2 have been calculated from samples of size n_1 and n_2 respectively. For both correlations, calculate the value:

\[ Z = \frac{1}{2}\log_e\left(\frac{1+r}{1-r}\right) \]
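The transformation just defined, and the confidence interval worked through in the example before section 10.5, can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names are ours, and the multiplier 1.96 is fixed for a 95% interval as in the text.

```python
from math import exp, log, sqrt


def fisher_z(r):
    """Fisher's Z-transformation of a sample correlation r."""
    return 0.5 * log((1 + r) / (1 - r))


def correlation_ci_95(r, n):
    """Approximate 95% confidence interval for the population correlation.

    Intended for largish samples (more than 50 or so) from normal populations.
    """
    z = fisher_z(r)
    half_width = 1.96 / sqrt(n - 3)
    x = 2 * (z - half_width)
    y = 2 * (z + half_width)
    lower = (exp(x) - 1) / (exp(x) + 1)
    upper = (exp(y) - 1) / (exp(y) + 1)
    return lower, upper


# Worked example from the text: r = 0.32 observed in a sample of 60 subjects.
lower, upper = correlation_ci_95(0.32, 60)
print(round(lower, 3), round(upper, 3))
```

The width of the resulting interval shows how little a single sample correlation of this size pins down the true value ρ.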
The statistic:

\[ (Z_1 - Z_2)\sqrt{\frac{(n_1-3)(n_2-3)}{n_1+n_2-6}} \]

is approximately standard normal provided the null hypothesis ρ_1 = ρ_2 is true. For sample sizes greater than 50 the statistic has a distribution very close to that of the standard normal and the hypothesis H₀: ρ_1 = ρ_2 versus H₁: ρ_1 ≠ ρ_2 can be tested by comparing the value of the statistic to percentage points of the normal distribution. A simple example will show how the arithmetic is done.

A sample of n_1 = 62 subjects is used to estimate the correlation between two variables X and Y. In a sample of n_2 = 69 different subjects the variables X and W are measured. The first sample has a correlation r_1 = 0.83 between X and Y, the second has a correlation of r_2 = 0.74 between X and W. Test the hypothesis that the population correlations are equal (i.e. the correlation between X and Y is equal to the correlation between X and W).

\[ Z_1 = \frac{1}{2}\log_e\left(\frac{1.83}{0.17}\right) = 1.1881 \]

\[ Z_2 = \frac{1}{2}\log_e\left(\frac{1.74}{0.26}\right) = 0.9505 \]

\[ (Z_1 - Z_2)\sqrt{\frac{59 \times 66}{125}} = 1.326 \]

The 10% point of the standard normal distribution for a two-tailed test is 1.645 so that the value 1.326 is not significant. It is perfectly possible that the correlation between X and Y is the same as the correlation between X and W.

The test we have just described is relevant only if the correlations r_1 and r_2 are based on independent samples. If X, Y and W were all measured on the same n subjects the test would not be valid. Downie & Heath (1965) give a test statistic for comparing two correlations obtained on the same sample. Their test is relevant when exactly three variables are involved, X, Y and W say, and it is required to compare the correlations of two of the variables with the third. To test whether r_XW and r_YW are significantly different they suggest the statistic:

\[ t = \frac{(r_{XW} - r_{YW})\sqrt{(n-3)(1 + r_{XY})}}{\sqrt{2\left(1 - r_{XW}^2 - r_{YW}^2 - r_{XY}^2 + 2\,r_{XW}\,r_{YW}\,r_{XY}\right)}} \]

This test should not be used for small samples. For samples with n > 100 the value of t should be compared with percentage points of the standard normal distribution.

10.6 Interpreting the sample correlation coefficient
It is rather difficult at this stage to explain precisely how to interpret a statement such as 'the correlation coefficient, r, between X and Y was calculated from a sample of 32 pairs of error scores and r = 0.772'. We will be able to reveal its meaning more clearly after we have discussed the concept of linear regression in chapter 13. However, for the moment we will attempt to explain the essential idea in a somewhat simplistic way, at least for the case where both variables are normally distributed.

[Figure 10.4. Three hypothetical scattergrams with associated value of the correlation coefficient: (a) r = 1.0, (b) r = 0.8, (c) r = 0.2.]

We show three different sample scattergrams in figure 10.4. Now, let us suppose for the moment that the three samples which gave rise to
MeasU1ing interdependence oj two vanables J<..ank correlatwns

the graphs produce values of r close to the true value p for their respective although most points are distributed in apparently random fashion, there
populations. In the first diagram, figure ro.4(a), the plotted points lie are three obviously extreme points. That these would greatly influence
exactly on a straight line so that r = 1.0, i.e. the sample points are perfectly the estimated correlation can be seen from exercise ro.4. This is an ad-
correlated. Now, if this is true of the population from which the sample ditional reason for drawing a graph of- the data. We have stressed at various
was drawn then there is a sense in which either of the variables X or times the value of looking closely at the data before carrying out any analysis.
Y is redundant once we know the value of the other. Suppose that we
ha,,.e three observations (X 1,Y 1), (X2,Y2), (X 3,Y 3), with the property, for y

"""mple, that the difference X 3 - X 1 is exactly twice as large as the differ-

ence X2 - X 1 Then it will follow that the difference Y3 - Y 1 is also exactly

twice as large as the difference Y2 - Y1 We could say that variations in
Y (or X) are accounted for roo% by variations in X (or Y). X X XX X

Now consider figure ro.4(b). Because the points do not lie exactly on X
a straight line, it will not be true that variations in X will be associated xxxxxxx
with exactly predictable variation in Y. However, it is clear from the scatter- X

gram that if we know how the X values vary in a sample we ought to

be able to guess quite well how the corresponding Y values will vary in
relation to them. With the case shown in figure 10.4(c) we will not be
Figure 10.5. A hypothetical scattcrgmm. Most of the points arc dotted about
able to do that nearly so well. as though there were very little correlation- sec figure 10.4{c)- but the
The correlation coefficient is often used to describe the extent to which presence of the circled points will produce a large value of r,
knowledge of the way in which the values of one variable vary over observed
units can be used to assess how the values of the other will vary over the same units. This is frequently expressed by a phrase such as 'the observed variation in X (or Y) accounts for P% of the observed variation in Y (or X)'. The value of P is obtained by squaring the value of r. The reason for this is explained in chapter 13.
For the three examples pictured in figure 10.4 we have:
(a) $r = 1.0$   $r^2 = 1.00$ (= 100%)
(b) $r = 0.8$   $r^2 = 0.64$ (= 64%)
(c) $r = 0.2$   $r^2 = 0.04$ (= 4%)

For example, for case (b) we might say that '64% of the variation in Y is accounted for by (or 'is due to') the variation in X'.
As always, it is important to keep in mind that another random sample of values $(X_i, Y_i)$ from the same population will lead to a different value of $r$ and $r^2$, so that the 64% mentioned above is only an estimate. How good that estimate is will, as usual, depend on the sample size. An important point to realise here is that by a 'random sample' we mean a sample chosen randomly with respect to both variables.
It is worth noting that just a few unusual observations can affect greatly the value of r. This is the situation in the scattergram of figure 10.5 where,

A highish correlation apparently due to only two or three observations ought to be treated with great caution. Apart from any other reason, if both variables come from normally distributed populations it will be extremely rare for a random sample of moderate size to contain two or three unusually extreme data values - often called outliers.

10.7 Rank correlations
The statements made in the previous two sections about the meaning of r and how to test it all depend on the assumption that both the variables being observed follow a normal distribution. There will be many occasions when this assumption is not tenable. For example, in a study of foreign language learners' performance on a variety of tests, one of the variables might be the score on a multiple choice test with a maximum score of 100 (and perhaps with a distribution, over the whole population, approximately normal), while the other might be an impressionistic score out of 5 given by a native-speaking judge for oral fluency in an interview. The latter, principally because of the small number of possible categories (i.e. 0-5), will be distributed in a way very unlike a normal distribution. There will be times when the data do not consist of scores at all, for example when several judges are asked to rank a set

of texts on perceived degree of difficulty. In such cases there is an alternative method for measuring the relationship between two variables which does not assume that the variables are normally distributed. This method depends on a comparison of the rank orders rather than numerical scores.
Let us go back to the example on error gravity and change the pairs of values $(X_i, Y_i)$ into $(R_i, S_i)$ where $R_i$ gives the rank order of $X_i$ in an ordered list of all the X values, and $S_i$ is the rank order of $Y_i$ in a list of all the Y values - see table 10.4. It should be clear to you that if X and Y are perfectly positively correlated ($r = 1.0$) then $S_i = R_i$ always - see figure 10.6(a). For perfect negative correlation ($r = -1.0$), $S_i = n + 1 - R_i$, where n is the sample size - see figure 10.6(b). If X and Y vary independently of one another, there ought to be no relation between the rank of a particular $X_i$ among the Xs and the rank of the associated $Y_i$ among the Ys. It ought therefore to be possible to derive a measure of the correlation between X and Y by considering the relationship between these rankings. One such measure is $r_s$, the Spearman rank correlation coefficient.

[Figure 10.6. The relation between $(X_i, Y_i)$ and $(R_i, S_i)$ for perfectly correlated X and Y: (a) $r = 1.0$; (b) $r = -1.0$.]

Suppose a random sample of n pairs has been observed and let $R_i$ and $S_i$ be the ranks of the Xs and Ys as we have defined them above. Let $D = \sum (R_i - S_i)^2$. Then $r_s$ is calculated by the formula:

$$r_s = 1 - \frac{6D}{n(n^2 - 1)}$$

Table 10.4. Calculating the rank correlation coefficient

Sentence    X     Y    R (rank of X)   S (rank of Y)    R - S
    1      22    22        11.0            17.0          -6.0
    2      16    18         2.0             9.0          -7.0
    3      42    42        32.0            31.0           1.0
    4      25    21        17.5            14.5           3.0
    5      31    26        28.0            22.5           5.5
    6      36    41        31.0            30.0           1.0
    7      29    26        23.5            22.5           1.0
    8      24    20        16.0            13.0           3.0
    9      29    18        23.5             9.0          14.5
   10      18    15         4.0             4.0           0.0
   11      23    21        14.0            14.5          -0.5
   12      22    19        11.0            11.5          -0.5
   13      31    39        28.0            29.0          -1.0
   14      21    23         8.5            19.0         -10.5
   15      27    24        20.5            20.0           0.5
   16      32    29        30.0            26.5           3.5
   17      23    18        14.0             9.0           5.0
   18      18    16         4.0             6.0          -2.0
   19      30    29        26.0            26.5          -0.5
   20      31    22        28.0            17.0          11.0
   21      20    12         6.5             1.0           5.5
   22      21    26         8.5            22.5         -14.0
   23      29    43        23.5            32.0          -8.5
   24      22    26        11.0            22.5         -11.5
   25      26    22        19.0            17.0           2.0
   26      20    19         6.5            11.5          -5.0
   27      29    30        23.5            28.0          -4.5
   28      18    17         4.0             7.0          -3.0
   29      23    15        14.0             4.0          10.0
   30      25    15        17.5             4.0          13.5
   31      27    ...       20.5            25.0          -4.5
   32      11    14         1.0             2.0          -1.0

$\sum (R_i - S_i)^2 = 1413.5$

$$r_s = 1 - \frac{6 \times 1413.5}{32 \times (32^2 - 1)} = 0.741$$

It may look to you at first sight as though $r_s$ is simpler to calculate than r. This is not usually the case. First, except for very small samples, it is a time-consuming exercise, unless it can be done by computer, to
calculate all the ranks. Second, there is the problem of tied ranks. This arises when one of the variables has the same value in two or more of the pairs. Suppose in a sample of five observations we have:

$X_1 = 6,\ X_2 = 3,\ X_3 = 4,\ X_4 = 3,\ X_5 = 1$

How should we calculate the ranks, $R_i$?

$R_1 = 1$ ($X_1$ has the largest value)
$R_2 = ?$ (Should $X_2$ be ranked third or fourth?)
$R_3 = 2$
$R_4 = ?$
$R_5 = 5$

Clearly, $R_2$ has to have the same value as $R_4$ since $X_2 = X_4$, and its value must lie between 2 and 5. The way to deal with this is to take the average of the 'missing' ranks, in this case take the average of 3 and 4, which is 3.5. Then $R_2 = R_4 = 3.5$. We can then calculate $r_s$ in the usual way.
Unfortunately, the problem does not end here. Tied ranks cause bias in the value of $r_s$. If we use mean ranks and calculate $r_s$ by the usual formula we will tend to overestimate the true correlation. A full discussion of this problem can be found in Kendall (1970). However, unless a very high proportion of ranks is tied, the bias is likely to be very small. Siegel (1956) gives a formula to adjust the value of $r_s$ for tied ranks. He then gives an example with a number of tied ranks for which $r_s = 0.617$ if the usual formula is used and $r_s = 0.616$ if the exact (and more complex) formula is used. Our advice is to use the formula we have given even in the presence of ties, but to suspect that the apparent correlation may be very slightly exaggerated in the presence of a substantial proportion of tied ranks. The data on sentence error gravity are analysed in this way in table 10.4.
In this case, the Spearman correlation coefficient $r_s = 0.741$, quite close to the value of the Pearson correlation coefficient $r = 0.772$ calculated on the same sample. Does this mean that $r_s$ can be interpreted in much the same way as r? In particular will $r_s^2$ still indicate the extent to which one variable 'explains' the other? Unfortunately, the answer to this question is an unequivocal 'No'. For moderate-sized samples from a population in which the variables are normally distributed the Spearman coefficient can be interpreted in roughly this way. However, even then $r_s$ will tend to underestimate the true correlation and will be more variable from one sample to another than the Pearson correlation coefficient. Besides, $r_s$ was designed to cope with data sets which are not normally distributed or whose exact values are not known (e.g. when only the ranks are known in the first place) and should not be used on data which have a more or less normal distribution. Furthermore, for certain kinds of data the population value, $\rho_s$, of the Spearman coefficient is very different from that of $\rho$, the Pearson correlation coefficient, in the same population. Since $r_s$ is often calculated rather than r precisely in those situations where the underlying distribution of the data is unknown it is never safe to try to interpret $r_s$ as a measure of correlation. In fact its only legitimate use is as a test statistic for testing the hypothesis that two variables are independent of one another. Consider the following example. Two different judges are asked to rank ten texts in order of difficulty.

Judge 1:   1   2   3   4   5   6   7   8   9  10
Judge 2:   4   2   7   1   3  10   8   9   6   5

Test the hypothesis that the judges are using entirely unrelated criteria and hence that their rankings are independent.

$H_0$: There is no interdependence in the sets of judgements
$H_1$: There is interdependence in the sets of judgements

Here we have:

$D = 3^2 + 0^2 + 4^2 + 3^2 + \dots + 5^2 = 90$

$$r_s = 1 - \frac{6 \times 90}{10 \times 99} = 0.4545$$

From table A7 we can see that this value of $r_s$ is just significant at the 10% level. There is, therefore, somewhat weak evidence of some measure of agreement between the judges. However, nothing else can be said. It is not possible, for example, to calculate a confidence interval for the population value, $\rho_s$, and there is therefore no way of knowing how precise is the estimate 0.4545. This is a general deficiency of rank correlation methods.
Table A7 gives percentage points of $r_s$ for sample sizes up to n = 30. For sample sizes greater than this it is safe to use table A6 of percentage points of the Pearson correlation coefficient. In large samples and when the two variables are actually independent (so that $\rho = \rho_s = 0$) the values of $r_s$ and r will be very similar. In the case of correlated variables our
previous statement holds; there is no simple relationship, in general, between r and $r_s$, nor between the population correlation coefficients $\rho$ and $\rho_s$. There is no simple way to interpret the 'strength' of correlation implicit in a sample value of $r_s$.

Summary
(1) The covariance, $\mathrm{COV}(X, Y) = \sum (X_i - \bar{X})(Y_i - \bar{Y})/(n - 1)$, was introduced and exemplified as a measure of the degree to which two quantities vary in step with one another. The covariance was shown to be scale-dependent.
(2) The correlation coefficient (Pearson's product-moment coefficient of linear correlation) between two variables, X and Y, was defined by $r(X, Y) = \mathrm{COV}(X, Y)/(s_X s_Y)$ and was shown to be independent of the scales in which X and Y were measured. The value of $r^2$ could be interpreted as the proportion of the variability in either variable which was 'explained' by the other.
(3) The test statistic for the hypothesis $H_0: \rho = 0$ was r itself: the critical values to be found in table A6.
(4) It was shown how to calculate a confidence interval for the true correlation from large samples using Fisher's Z-transformation.
(5) A test for the hypothesis that two population correlations were equal was presented.
(6) The Spearman rank correlation coefficient, $r_s$, was defined. It was explained that it would usually be difficult to interpret $r_s$ as a measure of 'degree of interdependence' but that a test of the hypothesis $H_0$: two variables are independent could be based on $r_s$ even when the data are decidedly non-normal.

Exercises
(1) Draw a scattergram to examine the relationship between the assigned error scores in the columns headed 'X' and 'Y' of table 11.5 on page 186, that is, between the scores of the Greek teachers and the English teachers. Compare your scattergram with figure 10.1. What do you notice?
(2) (a) Now calculate the covariance of these two sets of scores, using the method of table 10.2, and then determine the correlation coefficient.
(b) Recalculate the correlation coefficient using the rapid method of table 10.3. What does this value tell you about the linear relationship between the scores of the Greek teachers and those of the English teachers?
(3) Is the correlation coefficient obtained in exercise 10.2 significant?
(4) In the original study from which these data were taken, judges were presented with some sentences which were correct. For two of these sentences English and Greek teachers were in complete agreement and assigned the following scores:

Sentence no.   Greek teachers   English teachers
     33              3                 3
     34              0                 0

Include these data in the set you used for your calculations in exercise 10.2 and recalculate the correlation coefficient. What difference does the addition of these two data points for each group make?
(5) Calculate the Spearman rank order correlation coefficient for the data you used in exercise 10.2.
(6) Twenty-five non-native speakers of English took a 50-item cloze test. The exact word method of scoring was used. Without warning, a few days later, they were presented with the same cloze passage, but this time for each blank they were given a number of possible options (i.e. multiple choice), these including all the responses they had given on the first version with the addition of the correct answer, if this had not been amongst their responses. The results were:

1st version   Multiple choice version   1st version   Multiple choice version
    33                  27                  25                  20
    31                  31                  24                  25
    32                  29                  24                  24
    30                  33                  23                  32
    29                  31                  23                  23
    29                  30                  22                  24
    28                  30                  21                  15
    29                  23                  22                  24
    27                  30                  21                  23
    28                  29                  17                  15
    27                  26                  16                  17
    26                  27                  16                  24
    25                  24

Twenty-five native speakers of English also took the first version of the test. Their scores were:

36 35 33 34 34 33 33 32 32 32 32
31 31 31 31 31 31 30 30 29 28 28
26 25 24

(a) Are the two sets of scores for the non-native speakers related? How do you interpret the correlation coefficient you have calculated?
(b) What is the relationship between non-native-speaker scores, and native-speaker scores, on the first version of the test?
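
The rank correlation calculation of section 10.7 can also be left to a computer. What follows is only a minimal sketch in Python, under our own assumptions: the function names `average_ranks` and `spearman_rs` are illustrative and do not come from the text, ranks run from the smallest value upwards unless `largest_first` is set, and tied values are given the average of the 'missing' ranks. The formula used is the one given above, $r_s = 1 - 6D/(n(n^2 - 1))$, with no correction for ties.

```python
def average_ranks(values, largest_first=False):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=largest_first)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        mean_rank = (pos + 1 + end + 1) / 2  # average of the 'missing' ranks
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_rs(x, y):
    """Return (D, r_s) using r_s = 1 - 6D / (n(n^2 - 1)), uncorrected for ties."""
    n = len(x)
    r = average_ranks(x)
    s = average_ranks(y)
    d = sum((ri - si) ** 2 for ri, si in zip(r, s))
    return d, 1 - 6 * d / (n * (n ** 2 - 1))


# Tied-rank example from the text (the largest value is ranked first there).
tied = average_ranks([6, 3, 4, 3, 1], largest_first=True)

# The two judges' rankings of ten texts.
judge1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
judge2 = [4, 2, 7, 1, 3, 10, 8, 9, 6, 5]
D, rs = spearman_rs(judge1, judge2)

print(tied)             # [1.0, 3.5, 2.0, 3.5, 5.0]
print(D, round(rs, 4))  # 90.0 0.4545
```

The direction of ranking does not affect $r_s$ provided both variables are ranked in the same direction. The significance of the result must still be judged against table A7.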

11  Testing for differences between two populations

It is often the case in language and related studies that we want to compare population means on the basis of two samples. In a well-known experiment in foreign language teaching (Scherer & Wertheimer 1964), first-year undergraduates at the University of Colorado were divided into two groups. One group was taught by the audiolingual method, the other by the more traditional grammar-translation method. The intention was to discover which method proved superior in relation to the attainment of a variety of language learning goals. The experiment continued for two years, and many measurements of ability in German were taken at various stages in the study. For our present purpose we will concentrate on just one measurement, that of speaking ability, made at the end of the experiment. Out of a possible score on this speaking test of 100, the traditional grammar-translation group obtained a mean score of 77.71 while the audiolingual group's mean score was 82.92. Thus the sample mean of the audiolingual group was higher than that of the grammar-translation group. But what we are interested in is whether this is due to an accidental assignment of the more able students to the audiolingual group or whether the higher mean is evidence that the audiolingual method is more efficient. We will address this question by way of a formal hypothesis test.

11.1 Independent samples: testing for differences between means
The notation we will use is displayed in table 11.1. We wish to test the null hypothesis:

$H_0: \mu_1 = \mu_2$

Our alternative hypothesis is:

$H_1: \mu_1 \neq \mu_2$

Table 11.1. Notation used in the formulation of a test of the hypothesis that two population means are equal

                                 First population   Second population
Population mean                  $\mu_1$            $\mu_2$
Population standard deviation    $\sigma_1$         $\sigma_2$
Sample size                      $n_1$              $n_2$
Sample mean                      $\bar{X}_1$        $\bar{X}_2$
Sample standard deviation        $s_1$              $s_2$

(At the outset of the experiment there was no compelling reason to believe that the audiolingual group would do better on this test after two years; it was not inconceivable that the grammar-translation group would do better. For this reason a non-directional alternative hypothesis is chosen.)
A test statistic can be obtained to carry out the hypothesis test provided two assumptions are made about these data:

1. The populations are both normally distributed. (As before, this will not be necessary if the sample size is large since then the Central Limit Theorem assures the validity of the test.)
2. $\sigma_1 = \sigma_2$; that is, the population standard deviations are equal. This point is discussed further below. The value of the common standard deviation is estimated by:

$$s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

This estimate is used in the calculation of the test statistic, t, as follows:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s^2}{n_1} + \dfrac{s^2}{n_2}}}$$

This statistic has a t-distribution (the same as that met in chapter 7) with $n_1 + n_2 - 2$ df whenever the null hypothesis is true. The value of t can be compared with the critical values in table A4 to determine whether the difference between the sample means is significantly large.
The data in table 11.2 are taken from table 6-5 of Scherer & Wertheimer. We first estimate the population standard deviation using the formula presented above:
Table 11.2. Group differences in speaking ability at the end of two years

Audiolingual              Grammar-translation
$\bar{X}_1$ = 82.92       $\bar{X}_2$ = 77.71
$s_1$ = 6.78              $s_2$ = 7.37
$n_1$ = 24                $n_2$ = 24

Data from Scherer & Wertheimer (1964: table 6-5)

$$s = \sqrt{\frac{(23 \times 6.78^2) + (23 \times 7.37^2)}{46}} = \sqrt{\frac{1057.27 + 1249.29}{46}} = 7.08$$

This estimate of the population standard deviation is then used to calculate t:

$$t = \frac{82.92 - 77.71}{\sqrt{\dfrac{7.08^2}{24} + \dfrac{7.08^2}{24}}} = \frac{5.21}{\sqrt{2.09 + 2.09}} = \frac{5.21}{2.04} = 2.55$$

If we enter table A4 with 46 df ($n_1 + n_2 - 2$), we discover that the t-value of 2.55 is significant at the 2% level (which is what was reported by Scherer & Wertheimer). If the two populations had the same mean, the probability of obtaining a difference between the sample means at least as large as this would be less than 2 in 100 (which is the same as 1 in 50). There then appears to be support for the belief that the audiolingual method is superior to the grammar-translation method in developing speaking ability in German. However, various difficulties encountered by Scherer and Wertheimer in conducting the experiment mean that this result should be treated with caution.
Let us look now at another piece of research where it is appropriate to test for differences between means on the basis of two samples. Macken & Barton (1980a) investigated the acquisition of voicing contrasts in four English-speaking children whose ages at the beginning of the study ranged from 1;4.28 to 1;7.9. Each child was then seen every two weeks over an eight-month period, and recordings made of conversations between the child and its mother. The focus of the instrumental analysis of this data was the measurement of voice onset time (VOT) in initial stops in tokens produced by the children. On the assumption that VOT is the major determinant of a voiced/voiceless contrast in initial stops, its values were examined, separately for each child, to identify the point at which the child became capable of making a voicing contrast. The data in table 11.3 were extracted from Macken & Barton's table 5, which presents a summary of measurements on their subject, Jane.

Table 11.3. Mean VOT values (in milliseconds), standard deviations and number of tokens for a single child at different ages

                        Age 1;8.20            Age 1;11.0
                        Consonant             Consonant
                        /d/       /t/         /d/       /t/
Mean VOT (ms)           14.25     22.30       7.67      122.13
Standard deviation      15.44     13.50       19.17     54.19
Number of tokens        8         10          15        15

Data from Macken & Barton (1980a: table 5).

In normal, adult spoken English, the mean VOT for /t/ is higher than for /d/. Children have to learn to make the distinction. For the data in table 11.3 we can test the null hypothesis that there is no detectable difference in the mean VOT against the natural alternative that the mean VOT is higher for /t/. Using the formulae given above, we find the standard deviation estimate:

$$s = \sqrt{\frac{(7 \times 15.44^2) + (9 \times 13.50^2)}{16}} = 14.38 \text{ milliseconds}$$

and evaluate the test statistic as:

$$t = \frac{14.25 - 22.30}{\sqrt{\dfrac{14.38^2}{8} + \dfrac{14.38^2}{10}}} = -1.18 \text{ with 16 df}$$

On this occasion it can be argued that the natural alternative hypothesis is directional: we may discount the possibility that the VOT for /d/ could be greater than the VOT for /t/. The 5% critical value for t with 16 df would be about 1.75 (the tabulated value for 15 df). The magnitude of the t-value we have calculated is less than this; indeed it is less than the 10% critical value (1.34). We will
therefore be unwilling to claim that the population mean VOT for /t/ is greater than that for /d/. We will not wish to claim that the child has learned to distinguish the two sounds successfully in terms of VOT, despite the difference of more than 8 ms in the sample means. This is due partly to the large standard deviations and partly to the small number of tokens observed. If we carry out a similar analysis on the VOTs observed on the same child at age 1;11.0 we find t = -7.71, which is highly significant and indicates that the child is now making a distinction between the two consonants.
Having discovered that it seems likely that a difference may exist between two means, it will usually make good sense to estimate how large that difference seems to be. A 95% confidence interval for the difference $\mu_1 - \mu_2$ is given by:

$$\bar{X}_1 - \bar{X}_2 \pm \left(\text{constant} \times \sqrt{\frac{s^2}{n_1} + \frac{s^2}{n_2}}\right)$$

where the constant is the 95% critical value of the t-distribution with ($n_1 + n_2 - 2$) df. The data imply that there is a difference between Jane's two mean VOTs at age 1;11.0. To estimate the size of this difference we first calculate:

$$s = \sqrt{\frac{(14 \times 19.17^2) + (14 \times 54.19^2)}{28}} = 40.65$$

A 95% confidence interval for the difference in the mean VOTs would then be:

$$\bar{X}_1 - \bar{X}_2 \pm \left(2.04 \times \sqrt{\frac{40.65^2}{14} + \frac{40.65^2}{14}}\right)$$

or 114.46 ± 31.34
i.e. 83.12 to 145.80 milliseconds

We calculate therefore that we can be 95% certain that the difference between the population mean VOT for /d/ and that for /t/, for this child at this age, is somewhere between 83.12 and 145.80 ms. In view of the comment in the next paragraph concerning the variances of the VOTs of the two consonants, this interval is probably not quite correct, but it is unlikely to be far wrong and certainly gives a good idea of the order of difference between the mean VOTs.
The correctness of the above analysis depends on the two assumptions we have made - that the variable, VOT, is normally distributed for this subject at the two ages when the observations were taken and that, at both ages, the VOT was equally variable for either sound.¹ Since the full data set is not published in the source paper we cannot judge whether the VOTs seemed to be normally distributed. With the small number of tokens observed it would be rather difficult to judge anyway and the investigators would have had to rely on their experience of such data as a guide to the validity of the assumption. If we had, say, 30 tokens at each age instead of 15, modest deviations from normality would be unimportant. On the other hand, turning to the second assumption, there is always information contained in the values of the two sample standard deviations about the variability of the populations. In the test related to the VOTs at age 1;11.0, the sample standard deviations of the VOTs for /d/ and /t/ were 19.17 and 54.19, the larger being almost three times the value of the former. In the light of this, were we justified in carrying out the test? In the following section we give a test for comparing two variances and it will be seen that we will reject decisively the assumption that VOT for /t/ has the same variance as VOT for /d/. This is serious and would remain so even if the sample sizes were much larger.²
There is a further problem which may arise with this kind of data where several tokens are elicited from a single subject in a short time period. Every method of statistical analysis presented in this book presupposes that the data consist of a set of observations on one (or several) simple random samples with the value of each observation quite independent of the values of others. If a linguistic experiment is carried out in such a way that several tokens of a particular linguistic item occur close together the subject may, consciously or not, cause them to be more similar than they would normally be. Such a lack of independence in the data will distort the results of any analysis, including the tests discussed in this chapter. This point has already been discussed in chapter 7.

¹ We did not question these assumptions in the case of the example above. First, though we do not have access to the complete data set, it is not unreasonable to expect the distribution of scores on a language test with many items to approximate normality. Secondly, the F-test (see below) did not indicate a significant difference in variances.
² However, provided the sample sizes are equal, as they are here, it can be shown mathematically that the test statistic will still have a t-distribution when $H_0$ is true but with fewer degrees of freedom - though never less than the number of tokens in just one of the samples, i.e. 15. The value of -7.71 is still highly significant even with only 15 df, so that the initial conclusion stands, that the population mean VOT for /t/ is larger at age 1;11.0 than the population mean VOT for /d/. If the sample variances are such that we have to reject the hypothesis of equal population variances and the sample sizes are unequal then a different test statistic with different properties has to be used. The relevant formulae, with a discussion, can be found in Wetherill (1972: 6.6).
11.2 Independent samples: comparing two variances
We have seen that it is necessary to assume that two populations have the same standard deviation or, equivalently, the same variance before a t-test for comparing their means will be appropriate. It is possible to check the validity of this assumption by testing $H_0: \sigma_1^2 = \sigma_2^2$ (see table 11.1) against $H_1: \sigma_1^2 \neq \sigma_2^2$ using the test statistic:

$$F = \frac{\text{larger sample variance}}{\text{smaller sample variance}}$$

In the example of the previous section, two samples of 15 tokens of /d/ and /t/ from a child aged 1;11.0 gave standard deviations of 19.17 and 54.19 ms respectively. The larger sample variance is therefore $54.19^2 = 2936.56$ and the smaller is $19.17^2 = 367.49$ so that:

$$F = 2936.56 \div 367.49 = 7.99$$

If the null hypothesis is true and the population variances are equal the distribution of this test statistic is known. It is called the F-distribution and it depends on the degrees of freedom of both the numerator and the denominator. In every case the relevant degrees of freedom will be those used in the divisors of the sample standard deviations. Since both samples contained 15 tokens, the numerator and denominator of the F statistic, or variance ratio statistic as it is often called, will both have 14 df, and the statistic is denoted $F_{14,14}$. In general we write $F_{m_1,m_2}$, where $m_1$ and $m_2$ are the degrees of freedom of numerator and denominator respectively. In the tables of the F-distribution (table A8) we find that no value is given for 14 df in the numerator or in the denominator. To give all possible combinations would result in an enormous table. However, the 1% significance value for $F_{12,15}$ is 3.67 and that for $F_{24,15}$ is 3.29 so the value for $F_{14,14}$ must be close to 3.6. The value obtained in the test is much larger than this so that the value is significant at the 1% level and it seems highly probable that the VOT for /t/ has a much higher variance than that for /d/. Note, however, that this test (frequently referred to as the F-test) also requires that the samples be from normally distributed populations and it is rather sensitive to failures in that assumption.

11.3 Independent samples: comparing two proportions
We have already established a test for comparing two proportions. In table 9.7 we presented, as a contingency table, data from a study by Ferris & Politzer (1981) of pronoun agreement in two groups of bilingual children. The data in that table can be presented in a different form as

            Proportion of error-free cases
Group A     15/30 = 0.5       ($\hat{p}_1$)
Group B     23/30 = 0.7667    ($\hat{p}_2$)
Overall     38/60 = 0.6333    ($\hat{p}$)

Now $\hat{p}_1$ and $\hat{p}_2$ are estimates of the true population proportions $p_1$ and $p_2$. We can test the hypothesis $H_0: p_1 = p_2$ using the test statistic,

$$Z = \frac{|\hat{p}_1 - \hat{p}_2| - \frac{1}{2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where $n_1$ and $n_2$ are the sample sizes. (Remember that $|\hat{p}_1 - \hat{p}_2|$ means the absolute magnitude of the difference between $\hat{p}_1$ and $\hat{p}_2$ and is always a positive quantity.) When the null hypothesis is true Z will have a standard normal distribution.³ Here we have:

$$Z = \frac{|0.5 - 0.7667| - \frac{1}{2}\left(\dfrac{1}{30} + \dfrac{1}{30}\right)}{\sqrt{(0.6333 \times 0.3667)\left(\dfrac{1}{30} + \dfrac{1}{30}\right)}} = \frac{0.2667 - 0.0333}{0.1244} = 1.876$$

The 5% significance value of the normal distribution for a two-sided test is 1.96 so that this result just fails to be significant at that level, exactly the conclusion reached by the chi-squared test of section 9.4.2. Furthermore, $1.876^2 = 3.519$, almost exactly the value of the chi-squared statistic in that analysis and the relevant 5% significance value of chi-squared was 3.84, which is $1.96^2$. Because of this correspondence, the two methods will always give identical results. The only real difference is that with the test we have presented in this section it is possible to consider a one-sided alternative such as $H_1: p_1 < p_2$ while the chi-squared test will always require a two-sided alternative.

³ Note that it is always the normal distribution that is referred to when testing for differences between two simple proportions - never the t-distribution. However, as when using the chi-squared test, care must be taken if there are fewer than five tokens in any of the four possible categories.

It will usually be helpful to estimate how unlike two proportions may
be whenever a test has found them to be significantly different. An approximate 95% confidence interval for the difference in two proportions is calculated by the formula:

$$|\hat{p}_1 - \hat{p}_2| \pm \left\{1.96 \times \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}\right\}$$

i.e.

$$0.2667 \pm \left(1.96 \times \sqrt{\frac{0.5 \times 0.5}{30} + \frac{0.7667 \times 0.2333}{30}}\right)$$

i.e. 0.032 to 0.501

Note that this interval does not include the value zero, although we would expect it to, since the hypothesis that $p_1 = p_2$, or $p_1 - p_2 = 0$, was not rejected at the 5% level (see chapter 7). This occurs because the confidence interval is not exact. It will not be sufficiently incorrect to mislead seriously, especially if the exact test of hypothesis is always carried out first.

11.4 Paired samples: comparing two means
Table 11.5 presents data on the error gravity scores of ten native English-speaking teachers and ten Greek teachers of English for 32 sentences which appeared in the compositions of Greek-Cypriot learners of English. (Note that the data for Greek teachers is adapted from that presented in Hughes & Lascaratou 1981, for purposes of exposition.) A summary of the data in the notation of the present chapter appears in table 11.4.

Table 11.4. Summary of data from table 10.1

                             Group 1 (English)       Group 2 (Greek)
Sample mean                  $\bar{X}_1 = 25.03$     $\bar{X}_2 = 28.28$
Sample standard deviation    $s_1 = 6.25$            $s_2 = 7.85$
Sample size                  32                      32

First let us test the assumption that the variances in the two populations are equal, $H_0: \sigma_1 = \sigma_2$, against the hypothesis $H_1: \sigma_1 \neq \sigma_2$. The test statistic is:

$$F = \frac{\text{larger } s^2}{\text{smaller } s^2} = \frac{7.85^2}{6.25^2} = \frac{61.62}{39.06} = 1.58$$

with 31 df in both numerator and denominator. From tables of the F-distribution we can see that this is not significant and we will assume that the population variances are equal.
Let us now test the hypothesis that the two groups of teachers give the same error gravity scores on average, i.e. $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \neq \mu_2$. The situation is different from the previous examples in this chapter inasmuch as it is possible to compare the scoring of Greek teachers and the English teachers in respect of each of the 32 sentences, not just in terms of their overall scoring. There is likely, on average, to be some correlation between scores awarded by different groups of judges on the same error. It is in this sense that we refer to the samples as 'correlated'. When we have correlated samples we follow a rather different procedure in tests for a difference between population means. This will be detailed below, but first of all let us see what results we would obtain if (mistakenly!) we proceeded as if, as in the above examples, the samples were independent (i.e. uncorrelated). The test statistic would be:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s^2}{32} + \dfrac{s^2}{32}}}$$

where:

$$s^2 = \frac{31 s_1^2 + 31 s_2^2}{62}$$

t = -1.83 with 62 df.

This is not significant at the 5% level. Apparently there is little evidence that errors are judged more severely by the Greek teachers. However, we have not used all the information in the experiment. The test we have carried out would be the only one possible if we had only the summary values of table 11.4. It ignores the fact that since each group of teachers assessed the same 32 sentences, and we know their scores on each sentence, we can compare their severity scores for the individual errors. In table 11.5 we present the total data set. In the last column of the table we have given the value obtained by subtracting the total score of the Greek teachers from the total score of the English teachers for each sentence individually. Some of these differences will be positive and others negative, so it is important to be consistent in the order of subtraction and evaluate the sign properly. Now, the null hypothesis we wanted to test was that, on average, the English and the Greek teachers give equal scores for the same errors, $H_0: \mu_1 = \mu_2$. This is logically equivalent to testing the hypothesis that, on average, the difference in item scores is zero. Since $d_i = X_i - Y_i$, the population mean difference, $\mu_d$, $= \mu_1 - \mu_2$.

Table 11.5. Total error gravity scores of ten native English teachers (X) and ten Greek teachers of English (Y) on 32 English sentences

Sentence    X     Y    d = X - Y
    1      22    36      -14
    2      16     9        7
    3      42    29       13
    4      25    35      -10
    5      31    34       -3
    6      36    23       13
    7      29    25        4
    8      24    31       -7
    9      29    35       -6
   10      18    21       -3
   11      23    33      -10
   12      22    13        9
   13      31    22        9
   14      21    29       -8
   15      27    25        2
   16      32    25        7
   17      23    39      -16
   18      18    19       -1
   19      30    28        2
   20      31    41      -10
   21      20    25       -5
   22      21    17        4
   23      29    26        3
   24      22    37      -15
   25      26    34       -8
   26      20    28       -8
   27      29    33       -4
   28      18    24       -6
   29      23    37      -14
   30      25    33       -8
   31      27    39      -12
   32      11    20       -9

$H_0: \mu_1 = \mu_2$ is therefore the same hypothesis as $H_0: \mu_1 - \mu_2 = 0$, which in turn can be written as $H_0: \mu_d = 0$.

We already know how to test the last hypothesis. In chapter 8 we introduced a test of the null hypothesis that a sample was drawn from a population with a given population mean. The 32 differences in the last column of table 11.5 can be considered as a sample of the differences that would arise from two randomly chosen groups of these types assessing these errors. Following the procedure of 8.3, a suitable statistic to test the null hypothesis $H_0: \mu_d = 0$ is:

$$t = \frac{\bar{d} - 0}{s/\sqrt{n}} = \bar{d} \div (s/\sqrt{n}), \quad \text{with 31 df}$$

where $\bar{d}$ is the mean of the observed differences, s is the standard deviation of the sample of 32 differences and n is the sample size, 32. We find:

$$\bar{d} = -3.25 \qquad s = 8.32$$

and therefore that t = -2.21 with 31 df. From tables of the t-distribution we find that this value is significant at the 5% level, giving some reason to believe that there is a real difference between the scores of the teachers of the different nationalities.

Let us summarise what has occurred here. In the opening section of the chapter we presented a procedure for testing the null hypothesis that two population means are equal. We began the present section by carrying out that test on the error gravity score data of table 11.5. The conclusion we reached was that the null hypothesis of equal scores for the two groups could not be rejected even at the 10% level. We then carried out another test of the same hypothesis using the same data and found we could reject it at the 5% significance level. Is there a contradiction here?

There is none. It is important to realise that the second test made use of information about the differences between individual items which was ignored by the first. Indeed the first test could have been carried out in exactly the same way even if the two groups had scored different sets of randomly chosen student errors. By matching the items for the groups we have eliminated one source of variability in the experiment and increased the sensitivity of the hypothesis test.

This paired comparison or correlated samples t-test will frequently be relevant and it is usually good experimental practice to design studies to exploit its extra sensitivity. However, it requires exactly the same assumptions as the test which uses independent samples; the two populations being compared should be approximately normally distributed and have equal variances.

A 95% confidence interval for the difference between the two population means can be calculated as:

$$|\bar{d}| \pm (\text{constant} \times s/\sqrt{n})$$

where the constant is the 5% significance value of the t-distribution with (n - 1) df. For this example we have:

$$|-3.25| \pm (2.04 \times 8.3 \div \sqrt{32}) = 3.25 \pm 3.0$$

i.e. about 0.2 to 6.3.
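The paired-samples calculation described above can be checked with a few lines of Python. This is only a sketch using the standard library: the two score lists are transcribed from table 11.5, the function name `paired_t` is illustrative, and the constant 2.04 is the tabulated 5% value of t with 31 df quoted in the text rather than something the code derives. The half-width uses s/√n with n = 32, so the last digit may differ slightly from the printed interval.

```python
import math
import statistics

# Table 11.5: total error gravity scores on the 32 sentences.
english = [22, 16, 42, 25, 31, 36, 29, 24, 29, 18, 23, 22, 31, 21, 27, 32,
           23, 18, 30, 31, 20, 21, 29, 22, 26, 20, 29, 18, 23, 25, 27, 11]
greek = [36, 9, 29, 35, 34, 23, 25, 31, 35, 21, 33, 13, 22, 29, 25, 25,
         39, 19, 28, 41, 25, 17, 26, 37, 34, 28, 33, 24, 37, 33, 39, 20]


def paired_t(x, y):
    """Return (mean difference, sd of differences, t, df) for paired data."""
    diffs = [a - b for a, b in zip(x, y)]   # d = X - Y, same order throughout
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s = statistics.stdev(diffs)             # sample sd (divisor n - 1)
    t = d_bar / (s / math.sqrt(n))
    return d_bar, s, t, n - 1


d_bar, s, t, df = paired_t(english, greek)

# 95% confidence interval for the size of the mean difference.
T_CRIT = 2.04  # assumed: tabulated 5% two-tailed value of t with 31 df
half_width = T_CRIT * s / math.sqrt(len(english))
ci = (abs(d_bar) - half_width, abs(d_bar) + half_width)

print(f"mean d = {d_bar:.2f}, s = {s:.2f}, t = {t:.2f} with {df} df")
print(f"95% CI for the difference: {ci[0]:.2f} to {ci[1]:.2f}")
```

Because only the differences enter the calculation, the same function works for any paired design in which the two lists are aligned item by item.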

11.5 Relaxing the assumptions of normality and equal variance: nonparametric tests

We have seen that experimental situations do arise (11.2) where the assumption that two populations have equal variances may be untenable, and that this will affect the validity of some of the tests introduced above. We may also have doubts about the other assumption, necessary except for large samples, that both samples are drawn from normally distributed populations. Occasions will arise when we have samples which are so small that there is a need to worry about this assumption as well. It is possible to carry out a test of the hypothesis that the samples come from two populations with similar characteristics without making any assumptions about their distribution. It will, of course, still be necessary that our data consist of proper random samples in which the values are independent observations. There is a great number of such tests, collected under the general heading of nonparametric tests - tests which require no special distributional assumptions - and we will present just two examples here. The $\chi^2$ test for association in contingency tables in chapter 9 and the test for significant rank correlation in chapter 10 are two nonparametric tests we have already presented. A larger selection can be found in Siegel (1956).

Suppose that, as part of a study like that of Macken & Barton (1980a) (see 11.1), a child aged 2;0 is observed for tokens of /g/ and /k/ in the same phonological environment, and that the VOTs in milliseconds for the observed tokens were:

for /g/: 38, 195, 56, 3, 51, 89 (six tokens)
for /k/: 125, 73, 138, 35, 51, 190, 169 (seven tokens)

Despite the small number of tokens, we can test the null hypothesis that the VOTs for the two consonants are centred on the same value by means of a two-sample Mann-Whitney rank test. We begin by putting all 13 observations into a single, ranked list, keeping note of the sample in which each observation arose, as in table 11.6. It will not matter whether the ranking is carried out in ascending or descending order. Note how we have dealt with the tied value of 51 ms (see 10.6).

Table 11.6. Ranking of VOTs for /g/ and /k/ from a single child

Value    3    35   38   51   51   56   73   89   125  138  169  190  195
Source   /g/  /k/  /g/  /k/  /g/  /g/  /k/  /g/  /k/  /k/  /k/  /k/  /g/
Rank     1    2    3    4.5  4.5  6    7    8    9    10   11   12   13

Now, sum the ranks for the smaller of the two samples. If both samples are the same size then sum the ranks of just one of them. In this case the smaller sample consists of the VOTs for the child's six tokens of /g/; T, the sum of the relevant ranks, is given by:

$$T = 1 + 3 + 4.5 + 6 + 8 + 13 = 35.5$$

If m is the size of the sample whose ranks have been summed (m = 6 in this example) and n is the size of the other sample (n = 7) we calculate two statistics, $U_1$ and $U_2$, as follows:

$$U_1 = mn + \frac{m(m+1)}{2} - T$$

$$U_2 = mn - U_1$$

Here we have:

$$U_1 = 6 \times 7 + \frac{6 \times 7}{2} - 35.5 = 27.5$$

$$U_2 = 6 \times 7 - 27.5 = 14.5$$

We then refer the smaller of these values to the corresponding value of table A9. (We have given the 5% significance values in this table; see Siegel if other critical values are required.) Since 14.5 is greater than the tabulated value of 6 we do not reject the equality of mean VOTs at the 5% significance level. The table of critical values we have supplied allows only for the situation where the larger of the two samples contains a maximum of 20 observations. For larger samples the sum of the ranks, T, can be used to create a different test statistic:

$$Z = \frac{T - \frac{m}{2}(m + n + 1)}{\sqrt{mn(m + n + 1)/12}}$$

which can then be compared with critical values of the standard normal distribution (table A3) to see whether the null hypothesis of equal population means can be rejected.

There are likewise various nonparametric tests which can be used to test hypotheses in paired samples. Suppose that 22 dysphasic patients have the extent of their disability assessed on a ten-point scale by two
different judges. Suppose that for 13 patients judge A assessed their condition to be much more serious (i.e. have a higher score) than judge B, for five patients the reverse is true and for the remaining four patients the judges agree. We can test the null hypothesis $H_0$: the judges are giving the same assessment scores, on average, versus the two-sided alternative $H_1$: on average the assessment scores of the judges are different, although it is unlikely that the assessment scores are normally distributed and the size of the sample is not large enough for us to rely on the Central Limit Theorem.

The procedure is to mark a subject with a plus sign if the first judge gives a higher score, or with a minus sign if the first judge gives a lower score. Subjects who receive the same score from both judges are left out of the analysis. Note the number of times, S, that the less frequent sign appears and the total number, T, of cases which are marked with one or other sign. Here S = 5 and T = 18.

Now enter table A10 using the values of S and T. Corresponding to S = 5 and T = 18 we find the value 0.096. These tabulated values are the significance levels corresponding to the two-tailed test. Hence we could say here that $H_0$ could be rejected at the 10% significance level but not at the 5% level (P = 0.096 = 9.6%). For the one-sided alternative we first of all have to ask whether there are any indications that the alternative is true. For example, if we had used in the above example the one-sided alternative $H_1$: judge B scores more highly, on average, there would be no point in a formal hypothesis test since judge B has actually scored fewer patients more highly than judge A. If the direction of the evidence gives some support for a one-sided alternative, then table A10 should be entered as before using the values of S and T but the significance level should be halved. We have seen this before with table A3 and table A4.

Only values of T up to T = 25 are catered for by table A10. If T is greater than 25, the test can still be carried out by calculating S and T in the same way and then using the test statistic:

$$Z = \frac{T - 2S - 1}{\sqrt{T}}$$

which should be referred to table A3, critical values of the standard normal distribution. For example, if we carry out a sign test on the error gravity scores of table 11.5 we have T = 32, S = 11 (there are 11 positive and 21 negative differences) so that $Z = (32 - 22 - 1)/\sqrt{32}$, i.e. Z = 1.59, which is not a significantly large value of the standard normal distribution (table A3).

11.6 The power of different tests

In chapter 8, where the basic concepts of statistical hypothesis testing were introduced, we discussed the notion of the two types of error which were possible as a result of a test of hypothesis: type 1 error, the incorrect rejection of a valid null hypothesis, and type 2, the failure to reject the null when it is mistaken. Let us recapitulate the results of the tests we have carried out in the current chapter on the error gravity scores of the two groups of judges (table 11.5). The paired sample t-test which we carried out in 11.4 resulted in the conclusion that there was a difference, significant at the 5% level, between the mean scores of the two groups. The sign test carried out on the same data in 11.5 could not detect a difference, not even at the 5% significance level, which is the weakest level of evidence which, by general convention, most experimenters would require in order to claim that 'the null hypothesis can be rejected'. It is important to understand why this has come about. Assessing a statistical hypothesis is similar to many other kinds of judgement. The correctness of the conclusion will depend on two things: the quantity of the information available and its quality. The two tests in question use different information.

It is an assumption of both tests that the observations were collected as an independent random sample. The fact that the observations were collected in this way can therefore be seen as a piece of information that both tests use. A second piece of information that both tests use is the direction of the difference in each pair (represented simply by plus or minus in the case of the sign test). But the paired sample t-test makes use of additional, different, information. First, it makes use of the fact that the populations of scores are normally distributed (one of its necessary assumptions). Secondly, it uses not only the direction but also the size of the difference in pairs (the information contained in the last column of table 11.5). Since the t-test is based on richer information, it is more sensitive to differences in the population means and will more readily give a significant value when such differences exist (and for this reason may be referred to as a more powerful test). In other words, for any set of data the sign test will be more likely than the t-test to cause a type 2 error. However, if the assumption about the parent populations which underlies the t-test, i.e. that they are normally distributed, is not justified, the likelihood of a type 1 error will be higher than the probability indicated by the t-value. The apparent significance of the test result can be exaggerated.

Which test then is it more appropriate to use? There is no simple answer.
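The Mann-Whitney procedure of 11.5 can be traced step by step in code. The sketch below uses only the Python standard library; the function names `midranks`, `mann_whitney` and `large_sample_z` are illustrative, and the code does not look up table A9, so the comparison with the tabulated value of 6 remains a manual step. The large-sample statistic assumes the usual standard deviation of the rank sum, the square root of mn(m + n + 1)/12.

```python
import math

# VOTs in milliseconds for one child (section 11.5).
g_tokens = [38, 195, 56, 3, 51, 89]
k_tokens = [125, 73, 138, 35, 51, 190, 169]


def midranks(values):
    """Map each value to its ascending rank; ties share the mean rank."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        i = j
    return rank_of


def mann_whitney(sample_a, sample_b):
    """Return (T, U1, U2), with T summed over the smaller sample."""
    small, large = sorted((sample_a, sample_b), key=len)
    m, n = len(small), len(large)
    rank_of = midranks(list(small) + list(large))
    t_sum = sum(rank_of[v] for v in small)
    u1 = m * n + m * (m + 1) / 2 - t_sum
    u2 = m * n - u1
    return t_sum, u1, u2


def large_sample_z(t_sum, m, n):
    """Normal approximation for samples too large for the table."""
    return (t_sum - m * (m + n + 1) / 2) / math.sqrt(m * n * (m + n + 1) / 12)


t_sum, u1, u2 = mann_whitney(g_tokens, k_tokens)
print(f"T = {t_sum}, U1 = {u1}, U2 = {u2}, smaller U = {min(u1, u2)}")
```

The approximation is included only to show the formula; with 6 and 7 tokens the tabulated critical value is the appropriate reference.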
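The sign test of 11.5 is equally easy to reproduce. The sketch below assumes that table A10 tabulates the two-tailed binomial probability of observing S or fewer of the rarer sign among T signs when plus and minus are equally likely; the printed value 0.096 for S = 5, T = 18 is consistent with that assumption. The function names are illustrative.

```python
import math


def sign_test_p(s, t):
    """Two-tailed exact probability for s rarer signs out of t (p = 1/2)."""
    tail = sum(math.comb(t, k) for k in range(s + 1)) / 2 ** t
    return min(1.0, 2 * tail)


def sign_test_z(s, t):
    """Normal approximation used when t exceeds the range of the table."""
    return (t - 2 * s - 1) / math.sqrt(t)


# Two judges rating 22 patients: 13 plus signs, 5 minus signs, 4 ties dropped.
p_judges = sign_test_p(5, 18)

# Error gravity differences of table 11.5: 11 positive, 21 negative.
z_errors = sign_test_z(11, 32)

print(f"judges: P = {p_judges:.3f}")
print(f"error gravity scores: Z = {z_errors:.2f}")
```

For a one-sided alternative supported by the direction of the evidence, the exact probability would be halved, as the text describes for the tabulated values.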
A sensible procedure might be first to use the sign test. If a significant result is thereby obtained there is really no need to go on to carry out a t-test. However, there is no way to calculate a confidence interval for the size of the difference without assuming normality anyway. If a t-test is carried out, the researcher should be aware of the consequences of the possible failure to meet the assumptions of the test.

SUMMARY

This chapter has looked at various procedures for testing for differences between two groups.

(1) The t-test for independent samples to test $H_0: \mu_1 = \mu_2$ uses the test statistic:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s^2}{n_1} + \dfrac{s^2}{n_2}}}$$

where:

$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

which has a t-distribution with $(n_1 + n_2 - 2)$ df when $H_0$ is true.

(2) To compare two proportions estimated from independent samples the test statistic:

$$Z = \frac{|\hat{p}_1 - \hat{p}_2| - \frac{1}{2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

should be referred to tables of the standard normal distribution.

(3) The F-test for comparison of two variances was explained. The test statistic was F = (larger $s^2$)/(smaller $s^2$), to be compared with critical values of the F-distribution with $(n_1 - 1, n_2 - 1)$ df.

(4) The paired samples t-test for testing for differences between two means was presented. The test is carried out by calculating $\bar{d}$ and s from the differences and then comparing $t = \bar{d}/(s/\sqrt{n})$ with tables of the t-distribution with (n - 1) df.

(5) Two nonparametric tests were explained: the Mann-Whitney test for independent samples and the sign test for paired samples.

EXERCISES

(1) In chapter 8 we discussed a sample of British children whose comprehension vocabularies were measured. The mean vocabulary for a sample of 140 children was 24,800 words with a standard deviation of 4,200. If a random sample of 108 American children has a mean vocabulary of 24,000 words with a standard deviation of 5,931, test the hypothesis that the two samples come from populations with the same mean vocabulary.

(2) Table 10.1 gives the total error gravity scores for ten native English speakers who are not teachers. In table 11.5 can be found the scores of ten Greek teachers of English on the same errors. Test the hypothesis that the two groups give the same error gravity scores, on average.

(3) Calculate a 95% confidence interval for the difference between the mean error gravity scores of the two groups in exercise 11.2.

(4) For the data of table 9.6 on vowel epenthesis in Rennellese, use the procedure of 11.3 to test whether reduplication is equally likely in initial and medial position.

(5) Using the same data as in exercise 11.2, test whether the two sets of error scores come from populations with equal variance.

(6) A sample of 14 subjects is divided randomly into two groups who are asked to learn a set of 20 vocabulary items in an unfamiliar language, the items being presented in a different format to the two groups but all subjects being allowed the same time to study the items before being tested. The numbers of errors recorded for each of the subjects are:

Format A: 3 4 11 6 8 2
Format B: 1 5 8 7 9 14 6 8

(Two of the students in the first group dropped out without taking the test.)

Test whether the average number of errors is the same under both formats.


12
Analysis of variance - ANOVA

In the last chapter we explained how it was possible to test whether two sample means were sufficiently different to allow us to conclude that the samples were probably drawn from populations with different population means. When more than two different groups or experimental conditions are involved we have to be careful how we test whether there might be differences in the corresponding population means. If all possible pairs of samples are tested using the techniques suggested in the previous chapter, the probability of type 1 errors will be greater than we expect, i.e. the 'significance' of any differences will be exaggerated. In the present chapter we intend to develop techniques which will allow us to investigate possible differences between the mean results obtained from several (i.e. more than two) samples, each referring to a different population or collected under different circumstances.

12.1 Comparing several means simultaneously: one-way ANOVA

Table 12.1. Marks in a multiple choice vocabulary test of candidates for the Cambridge Proficiency of English examination from four different regions

                                      Groups
                              1        2         3        4
                                     South     North     Far
                            Europe  America   Africa    East
                              10       33        26       26
                              19       21        25       21
                              24       25        19       25
                              17       32        31       22
                              29       16        15       11
                              37       16        25       35
                              32       20        23       18
                              29       13        32       12
                              22       23        20       22
                              31       20        15       21
Total                        250      219       231      213
Mean                        25.0     21.9      23.1     21.3
Sample standard deviation  8.138    6.607     5.915    6.897
Sample variance           66.222   43.655    34.988   47.567

Imagine that an investigator is interested in the standard of English of students coming to Britain for graduate training. In particular he wishes to discover whether there is a difference in the level of English proficiency between groups of distinct geographical origins - Europe, South America, North Africa and the Far East. As part of a pilot study he administers a multiple choice test to 40 graduate students (10 from each area) drawn at random from the complete set of such students listed on a central file. The scores obtained by these students on the test are shown in table 12.1. The means for the four samples do not have exactly the same value - we would not expect that. However, we might ask if the observed variation in the means is of the order that we could expect from four different random samples, each drawn from the same population of test scores, or are the differences sufficiently large to indicate that students from certain areas are more proficient in English, on average, than students from other areas? This is a generalisation of the problem discussed in 11.1 for the comparison of two samples to test whether they were drawn from populations with different mean values. It might seem that the solution presented there could be applied here, by comparing these groups of candidates in pairs: the Europeans with the North Africans, the South Americans with the Europeans, and so on. Unfortunately, it can be demonstrated theoretically that doing this leads to an unacceptable increase in the probability of type 1 errors. If all six different pairs are tested at the 5% significance level there will be a much bigger than 5% chance that at least one of the tests will be found to be significant even when no population differences exist. The greater the number of samples observed, the more likely it will be that the difference between the largest sample mean and the smallest sample mean will be sufficiently great - even when all the samples are chosen from the same population - to give a significant value when a test designed for just two samples is used. We need a test which will take into account the total number of comparisons we are making. Such a test can be constructed by means of an analysis of variance, usually contracted to ANOVA or ANOVAR.

As usual, there will be some assumptions that must be met in order for the test to be applied, i.e. that each sample comes from a normally
distributed population and that the four populations of candidates' scores all have the same variance, $\sigma^2$. The data in table 12.1 consist of four groups of scores from four populations. Suppose that the i-th population has mean $\mu_i$, so that Group 1 is a sample of scores from a population of scores, normally distributed with mean $\mu_1$ and variance $\sigma^2$; and so on. The null hypothesis we will test shortly is $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$ against the alternative that not all the $\mu_i$ have the same value.

Assuming that each sample comes from a population with variance $\sigma^2$, the four different sample variances are four independent estimates of the common population variance. These can be combined, or pooled, into a single estimate by multiplying each estimate by its degrees of freedom (the sample size minus one), summing the four products and dividing the total by the sum of the degrees of freedom of the four sample variances, thus:

$$\text{pooled variance estimate} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + (n_3 - 1)s_3^2 + (n_4 - 1)s_4^2}{n_1 + n_2 + n_3 + n_4 - 4}$$

(If you glance back to 11.1 you will recognise this as a direct generalisation of the method we used to estimate the common variance when we wished to compare only two samples.)

This estimate of the population variance is often called the within-samples estimate of variance since it is obtained by first calculating the variances within each sample and then combining them. We will refer to this as $s_w^2$. In the example of table 12.1, all the samples are the same size, $n_1 = n_2 = n_3 = n_4 = 10$, so that:

$$s_w^2 = \frac{(9 \times 66.222) + (9 \times 43.655) + (9 \times 34.988) + (9 \times 47.567)}{36} = 48.11$$

Now, let us suppose for the moment that the null hypothesis is true, and that the common population mean value is $\mu$. In that case (chapter 5), each of the four sample means is an observation from a normal distribution with mean $\mu$ and variance $\sigma^2/10$. (Since we know that in general the standard deviation of a mean has the value $\sigma/\sqrt{n}$ its variance will be $\sigma^2/n$.) In other words, the four sample means constitute a random sample from that distribution and the variance of this random sample of four means is an estimate of the population variance, $\sigma^2/10$. The sample means are 25.0, 21.9, 23.1, 21.3. Treating these four values as a random sample of four observations, we can calculate the sample standard deviation in the normal way. We find that it is 1.632, and hence the sample variance, $s^2$, is $1.632^2 = 2.662$. Since this is an estimate of $\sigma^2/10$, multiplying it by 10 gives a new estimate of $\sigma^2$ called the between-groups estimate of variance, $s_b^2$, since it measures the variation across the four sample means. We now have calculated $s_w^2 = 48.11$ and $s_b^2 = 10 \times 2.662 = 26.62$. If the null hypothesis is true, both of these are estimates of the same quantity, $\sigma^2$, and it can be shown that the ratio of the estimates:

$$F = s_b^2 \div s_w^2$$

has an F-distribution with 3 and 36 df. There are 3 df in the numerator since it is the sample variance of four observations, and 36 in the denominator since it is a pooled estimate from four samples each of which had 10 observations and each of which hence contributed 9 df to calculate the sample variance. The F-distribution has appeared already in 11.2 as the test statistic for comparing two variances.

If the null hypothesis is not true, $s_b^2$ will tend to be larger than $s_w^2$ because the variability in the four sample means will be inflated by the differences between the population means. Large values of the F statistic therefore throw doubt on the validity of the null hypothesis. In the case of the multiple choice test scores, we have $s_b^2 \div s_w^2 = 26.62 \div 48.11 = 0.55$. The 5% critical value of $F_{3,36}$ is just bigger than $F_{3,40}$, which is 2.84, so that the value obtained from the data is not significant and there are no grounds for claiming differences between the groups. In other words, these data do not support the view that graduate students coming to Britain differ in their command of English according to their geographical origin.

The description just given of the analysis of variance procedure is not the most usual way in which the technique is presented. ANOVA, as we will see, is a rather general technique which can be applied to the comparison of means in data with quite complex structure. It is convenient, therefore, to have a method of calculating all the required quantities which will generalise easily. For this reason we will now repeat the analysis of the multiple choice test scores using the more common and general method. The analysis is a particular example of a one-way analysis of variance - the comparison of the means of groups which are classified according to a single (hence 'one-way') criterion variable, linguistic/geographical origin in this example. During the presentation of the alternative analysis we will take the opportunity to state the problem in a completely general way.

Suppose that samples of size n have been taken from each of m populations. We will write $Y_{ij}$ for the j-th observation in the i-th group. For

example, in table 12.1, $Y_{4,7} = 18$, the score of the seventh Far Eastern (group 4) candidate. As is common when analysis of variance is presented, we write $Y_{i.}$ to mean the total of the observations of group i. That is:

$$Y_{i.} = \sum_{j=1}^{n} Y_{ij}$$

For our example:

$$Y_{1.} = \sum_{j=1}^{10} Y_{1j} = 250, \quad Y_{2.} = 219, \quad Y_{3.} = 231, \quad Y_{4.} = 213$$

The grand total of all the observations is designated $Y_{..}$ so that:

$$Y_{..} = 913$$

Since we have m samples (m = 4) each¹ of size n (n = 10) we have mn (4 × 10 = 40) observations in all. A term, usually called the correction factor or CF, is now calculated by:

$$CF = \frac{Y_{..}^2}{mn} = \frac{913^2}{40} = 20839.225$$

(It is often necessary, when calculating for an ANOVA, to keep a large number of figures in the intermediate calculations.)

We now calculate the total sum of squares, TSS, which is the sum of the between-groups sum of squares and the within-groups sum of squares. (The latter is often called the residual sum of squares (RSS) for a reason which will become apparent shortly.)

TSS (total sum of squares) $= \sum Y_{ij}^2 - CF$
    $= 10^2 + 19^2 + \ldots + 22^2 + 21^2 - CF$
    $= 22651 - 20839.225$
    $= 1811.775$

between-groups SS $= \dfrac{\sum Y_{i.}^2}{n} - CF$
    $= (250^2 + 219^2 + 231^2 + 213^2) \div 10 - CF$
    $= 79.875$

within-groups SS = total SS - between-groups SS
    $= 1811.775 - 79.875$
    $= 1731.9$

The within-groups sum of squares is the quantity left when the between-groups sum of squares is subtracted from the total sum of squares - hence the term 'residual sum of squares'. An ANOVA table is now constructed - table 12.2.

Table 12.2. ANOVA table for the data of table 12.1

Source                       df    SS          MSS      F-ratio
Between groups                3      79.875    26.62    F3,36 = 0.55
Within groups (residual)     36    1731.9      48.11
Total                        39    1811.775

The first column in the table gives the source of the sums of squares - between-groups, residual and total. The second column gives the degrees of freedom which are used to calculate the different variance estimates, i.e. 3 for between-groups and 36 (4 × 9) for the within-groups estimates, as we had in the first analysis above. Generally the between-groups degrees of freedom will be m - 1, one less than the number of groups, the total available degrees of freedom will be mn - 1, one less than the total number of observations, and the residual degrees of freedom are obtained by subtraction (see table 12.3, which is a general ANOVA table for one-way ANOVA of m samples each containing n observations). The fourth column of table 12.2, the mean sum of squares, is obtained by dividing each sum of squares by its degrees of freedom. Note that the values obtained at this stage are exactly the between-groups variance estimate and within-groups variance estimate that we calculated previously. The final column then gives, on the row corresponding to the source which is to be tested for differences (in this case between-groups) the F-ratio statistic required for the hypothesis test. It is important that you learn to interpret such tables, for two reasons. The first is that researchers often present their results in this way. The second is that, especially for complex data structures, you may perhaps not carry out the calculations by hand, leaving that to a computer package. The output from the package will usually contain an ANOVA table of some form.

Table 12.3. General ANOVA table for one-way ANOVA of m samples each containing n observations

Source                       df          SS     MSS                        F-ratio
Between groups               m - 1       BSS    $s_b^2 = BSS/(m - 1)$      $s_b^2/s_r^2$
Within groups (residual)     m(n - 1)    RSS    $s_r^2 = RSS/[m(n - 1)]$
Total                        mn - 1      TSS

¹ It is not necessary that the samples be of the same size, though experiments are often designed to make such groups equal. However, the general exposition becomes rather cumbersome if the sample sizes are different.
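The correction-factor method lends itself to a short program. The following sketch, written with the Python standard library only, rebuilds the entries of table 12.2 from the scores of table 12.1 and then confirms that the two mean sums of squares coincide with the pooled within-samples estimate and with ten times the variance of the four sample means. The function name `one_way_anova` and the dictionary keys are illustrative, and the code assumes equal sample sizes, as the exposition does.

```python
import statistics

# Table 12.1: one list of ten scores per region.
europe = [10, 19, 24, 17, 29, 37, 32, 29, 22, 31]
south_america = [33, 21, 25, 32, 16, 16, 20, 13, 23, 20]
north_africa = [26, 25, 19, 31, 15, 25, 23, 32, 20, 15]
far_east = [26, 21, 25, 22, 11, 35, 18, 12, 22, 21]
samples = [europe, south_america, north_africa, far_east]


def one_way_anova(groups):
    """One-way ANOVA for m groups of equal size n, via the correction factor."""
    m, n = len(groups), len(groups[0])
    grand_total = sum(sum(g) for g in groups)
    cf = grand_total ** 2 / (m * n)
    tss = sum(y * y for g in groups for y in g) - cf
    bss = sum(sum(g) ** 2 for g in groups) / n - cf
    rss = tss - bss                      # residual = what is left over
    df_between, df_resid = m - 1, m * (n - 1)
    ms_between, ms_resid = bss / df_between, rss / df_resid
    return {
        "CF": cf, "TSS": tss, "BSS": bss, "RSS": rss,
        "df": (df_between, df_resid, m * n - 1),
        "MSS": (ms_between, ms_resid),
        "F": ms_between / ms_resid,
    }


table = one_way_anova(samples)

# Cross-check against the first method (all groups have 9 df each).
pooled_within = statistics.mean(statistics.variance(g) for g in samples)
between = 10 * statistics.variance([statistics.mean(g) for g in samples])

for key, value in table.items():
    print(key, value)
```

The equality of the two routes holds exactly here because every group has the same size, so the pooled estimate is a plain average of the four sample variances.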

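Section 12.1 argued that testing every pair of groups separately inflates the chance of at least one type 1 error. A rough feel for the size of the effect can be had from the sketch below. It rests on a simplifying assumption that is not in the text: the pairwise tests are treated as independent, which they are not (each sample takes part in several comparisons), so the figure is an illustration and not the exact probability. The function name is illustrative.

```python
from math import comb


def familywise_rate(n_groups, alpha=0.05):
    """Number of pairs and P(at least one significant test) if all H0 hold.

    Assumes, for illustration only, that the pairwise tests are independent.
    """
    pairs = comb(n_groups, 2)
    return pairs, 1 - (1 - alpha) ** pairs


pairs, rate = familywise_rate(4)
print(f"{pairs} pairs, chance of at least one false rejection = {rate:.3f}")
```

Under this assumption the chance grows quickly with the number of groups, which is the motivation for a single overall test.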
12.2 Two-way ANOVA: randomised blocks

In chapter 11 we saw that the sensitivity of a comparison between two means could be improved by pairing the observations in the two samples. This idea can be extended to the comparison of several means. Table 12.4 repeats the data on gravity of errors analysed in chapter 11, but now extended to three groups of judges, the third group consisting of the ten English non-teachers. There are now three ways to divide up the variability: variation in scores between m groups of judges, variation in scores between the n different errors and residual, random variation. The necessary calculations and the resulting table are similar to those found in the one-way case, but with an extra item, between-errors sum of squares, added.

Table 12.4. Total error gravity scores of ten native English teachers (1), ten Greek teachers of English (2) and ten native English non-teachers (3) on 32 English sentences

Sentence       1      2      3    Total ($Y_{i.}$)
    1         22     36     22       80
    2         16      9     18       43
    3         42     29     42      113
    4         25     35     21       81
    5         31     34     26       91
    6         36     23     41      100
    7         29     25     26       80
    8         24     31     20       75
    9         29     35     18       82
   10         18     21     15       54
   11         23     33     21       77
   12         22     13     19       54
   13         31     22     39       92
   14         21     29     23       73
   15         27     25     24       76
   16         32     25     29       86
   17         23     39     18       80
   18         18     19     16       53
   19         30     28     29       87
   20         31     41     22       94
   21         20     25     12       57
   22         21     17     26       64
   23         29     26     43       98
   24         22     37     26       85
   25         26     34     22       82
   26         20     28     19       67
   27         29     33     30       92
   28         18     24     17       59
   29         23     37     15       75
   30         25     33     15       73
   31         27     39     28       94
   32         11     20     14       45
Total ($Y_{.j}$)  801    905    756     2462

We begin by calculating the totals, displayed in table 12.4:

$Y_{i.}$  the total for the i-th error
$Y_{.j}$  the total for the j-th set of judges
$Y_{..}$  the grand total

We calculate, as before, a correction factor by:

$$CF = \frac{Y_{..}^2}{mn} = \frac{2462^2}{3 \times 32} = 63140.04$$

Then the total sum of squares:

TSS $= \sum Y_{ij}^2 - CF$
    $= 22^2 + 16^2 + \ldots + 28^2 + 14^2 - CF$
    $= 68742 - 63140.04$
    $= 5601.96$

Between-errors sum of squares:

ESS $= \dfrac{\sum Y_{i.}^2}{m} - CF$
    $= (80^2 + 43^2 + \ldots + 94^2 + 45^2) \div 3 - CF$
    $= 198096 \div 3 - CF$
    $= 2891.96$

Between-groups (of judges) sum of squares:

GSS $= \dfrac{\sum Y_{.j}^2}{n} - CF$
    $= (801^2 + 905^2 + 756^2) \div 32 - CF$
    $= 365.02$

Note that the divisor in each sum of squares calculation is just the number of observations which have gone into each of the values being squared. For example, in GSS we have $(801^2 + 905^2 + 756^2) \div 32$ because each of the values 801, 905 and 756 is the sum of 32 data values. The corresponding ANOVA is presented in table 12.5.

As before, the residual sum of squares and the residual degrees of freedom are calculated by subtraction from the total sum of squares and total degrees of freedom respectively. The F-ratio for groups of judges is 4.83 with 2 and 62 df and this is significant beyond the 2.5% level, clearly indicating differences in the scores of the three sets of judges. The question

remains whether this is due to the Greek judges scoring differently from English judges (whether teachers or not), teachers (whether Greek or English) scoring differently from non-teachers, and so on. We will return to this question in 12.4.

Table 12.5. ANOVA for data of table 12.4

Source                      df    SS         MSS      F-ratio
Between errors              31    2891.96     93.29   F(31,62) = 2.47
Between groups of judges     2     365.02    182.51   F(2,62) = 4.83
Residual                    62    2344.98     37.82
Total                       95    5601.96

Note that the F-ratio for comparing errors is significant beyond the 1% level, but to investigate this was not an important part of this analysis. We have simply taken account of the variability that this causes in the scores so that we can make a more sensitive comparison between groups. This type of experimental design is often called a randomised block design.

Table 12.6. Marks of 40 subjects in a multiple choice test (the subjects are classified by geographical location and sex)

                            Geographical location
Sex           Europe (1)   South America (2)   North Africa (3)   South East Asia (4)   Total
Male (1)          10              33                  26                  26
                  19              21                  25                  21
                  24              25                  19                  25
                  17              32                  31                  22
                  29              16                  15                  11
  Subtotal        99             127                 116                 105              447
Female (2)        37              16                  25                  35
                  [four further rows not recoverable; legible fragments: 20 18 / 29 13 / 15 22]
  Subtotal       151              92                 115                 108              466
Total            250             219                 231                 213              913
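The randomised block calculation for the error gravity scores can be re-traced with a short Python sketch. This is only an illustration under stated assumptions: the function name `randomised_block_anova` and the dictionary keys are hypothetical choices, not notation from the text. The inputs are the figures quoted in the chapter: the sum of squared scores (68742), the 32 error totals and the three group totals.

```python
# Sketch: randomised block ANOVA from summary totals.
# All names are illustrative; the numbers are those quoted in the text.

def randomised_block_anova(sum_sq_obs, block_totals, group_totals):
    """Return sums of squares, degrees of freedom and F-ratios.

    sum_sq_obs   -- sum of the squared individual scores
    block_totals -- one total per block (here: per error)
    group_totals -- one total per group (here: per set of judges)
    """
    m = len(group_totals)   # scores per block, one from each group
    n = len(block_totals)   # scores per group, one from each block
    grand_total = sum(group_totals)
    assert grand_total == sum(block_totals), "marginal totals disagree"

    cf = grand_total ** 2 / (m * n)                        # correction factor
    tss = sum_sq_obs - cf                                  # total SS
    ess = sum(t ** 2 for t in block_totals) / m - cf       # between blocks
    gss = sum(t ** 2 for t in group_totals) / n - cf       # between groups
    rss = tss - ess - gss                                  # residual, by subtraction

    df_blocks, df_groups = n - 1, m - 1
    df_resid = (m * n - 1) - df_blocks - df_groups
    ms_resid = rss / df_resid
    return {
        "CF": cf, "TSS": tss, "ESS": ess, "GSS": gss, "RSS": rss,
        "df": (df_blocks, df_groups, df_resid),
        "MS_resid": ms_resid,
        "F_blocks": (ess / df_blocks) / ms_resid,
        "F_groups": (gss / df_groups) / ms_resid,
    }


error_totals = [80, 43, 113, 81, 91, 100, 80, 75, 82, 54, 77, 54, 92, 73, 76, 86,
                80, 53, 87, 94, 57, 64, 98, 85, 82, 67, 92, 59, 75, 73, 94, 45]
judge_totals = [801, 905, 756]

table = randomised_block_anova(68742, error_totals, judge_totals)
for key in ("TSS", "ESS", "GSS", "RSS", "MS_resid", "F_blocks", "F_groups"):
    print(f"{key:9s} {table[key]:10.2f}")
```

With these inputs the residual mean square comes to about 37.82 and the F-ratios to about 2.47 for errors and 4.83 for groups of judges, in line with table 12.5. Obtaining the residual by subtraction mirrors the hand calculation and avoids needing the individual scores.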

12.3 Two-way ANOVA: factorial experiments

It is often convenient and efficient to investigate several experimental variables simultaneously. The sociolinguist, for example, may be interested in both the linguistic context and the social context in which a linguistic token is used; a psycholinguist may wish to study how word recognition reaction times vary in different prose types and with subjects in different groups. Indeed it will only be possible to study the interaction between such variables if they are observed simultaneously. We will use again the multiple choice test scores of the four groups of graduate students to introduce the terminology and demonstrate the technique.

In table 12.6 we have given the same data as in table 12.1 but now cross-classified by geographical origin and sex. The style of presentation of this table is quite a common one for cross-classified data, with various totals given in the margins of the table (they are often referred to as marginal totals): total scores by sex, total scores by geographical location and subtotals by the origin by sex cross-classification. To discuss these data and describe any formulae for their analysis it is convenient to refer to them by means of three suffixes; $Y_{ijk}$ will refer to the score of the k-th subject of the j-th sex who belongs to the i-th geographical location. Generalising the use of the dot notation introduced in the previous section we write:

$Y_{ij.}$ = total score of subjects belonging to the i-th location and j-th sex (e.g. $Y_{31.} = 116$)
$Y_{i..}$ = total score of subjects at i-th location ($Y_{2..} = 219$)
$Y_{.j.}$ = total score of subjects of j-th sex ($Y_{.2.} = 466$)
$Y_{...}$ = grand total = 913

An experiment designed to give this kind of data structure is usually called a factorial experiment, the different criterion variables being called factors. These 'factors' are entirely unrelated to those of factor analysis, a technique discussed in chapter 15. Here there are two factors, sex and geographical origin. The different values of each factor are often referred to as the levels of the factor. Sex has two levels, male and female, and geographical location has four.

We can use this single set of data to test independently two different null hypotheses: whether mean scores are the same between geographical origins and whether mean scores are the same for the two sexes. The calculations required are similar to those of the example in the previous section. We begin by calculating the correction factor, CF:

\[ CF = \frac{Y_{...}^2}{40} = \frac{913^2}{40} = 20839.225 \]
and continue by obtaining the various sums of squares:

\[ \text{total SS} = \sum Y_{ijk}^2 - CF = (10^2 + 19^2 + \ldots + 22^2 + 21^2) - CF = 1811.775 \]

\[ \text{between-locations SS} = \frac{\sum Y_{i..}^2}{10} - CF = (250^2 + 219^2 + 231^2 + 213^2) \div 10 - CF = 79.875 \]

\[ \text{between-sexes SS} = \frac{\sum Y_{.j.}^2}{20} - CF = (447^2 + 466^2) \div 20 - CF = 9.025 \]

and this leads to the ANOVA in table 12.7(a) from which we conclude from the small F-ratios that there is no significant difference between geographical locations (we came to the same conclusion in 12.1), and none between sexes. However, the analysis carried out thus tests the differences between the sample means of the locations calculated over all the observations for an origin irrespective of the sex of the subject. Likewise, the sample means for the sexes are calculated over all 20 observations for each sex ignoring any difference in location. Calculated in that way, the sample mean score for males is $447 \div 20 = 22.35$ and for females it is 23.30, so that they are rather similar. However, suppose we look to see if there are differences between sexes within some of the locations (refer back to table 12.6). For the North Africa and South East Asia samples the mean scores of the two sexes are still very similar, but among European students the females have apparently done rather better, while for the South Americans the reverse is the case. These differences cancel out when we look at the sex averages over all the locations simultaneously.

Table 12.7(a). ANOVA of main effects only from data of table 12.6

Source               df    SS          MS        F-ratio
Between locations     3      79.875    26.62     F(3,35) = 0.54
Between sexes         1       9.025     9.025    F(1,35) = 0.18
Residual             35    1722.875    49.225
Total                39    1811.775

What we are possibly seeing here is an interaction between the two factors. In other words, it may be that there is a difference between the mean scores of the sexes, but the extent of the difference depends on the geographical location of the subjects. Any difference between the levels of a single factor which is independent of any other factor is referred to as a main effect. Differences which appear only when two or more factors are examined together are called interactions. As a form of shorthand, main effects are often designated by a single letter, e.g. L for the variation in mean score of students from different locations, and S for the variation between sexes. Interaction effects are designated by the use of the different main effects symbols joined by one (or more) crosses, e.g. L×S for the interaction between location and sex. Provided there is more than one observation for each combination of the levels of the main factors it is possible to test whether significant interaction effects are present. For the multiple choice test scores data of table 12.6 we have observed five scores for each of the eight combinations of the levels of sex and location. (For factorial experiments it is important that each combination has been observed the same number of times. If that has not happened it is still possible to carry out an analysis of variance but the main effects cannot then be tested independently of one another and there may be difficulties of interpretation of the ANOVA. Furthermore, the calculations become much more involved and it is not really feasible to carry them out by hand - see 13.12.) To test for a significant interaction we simply expand the ANOVA to include an interaction sum of squares calculated by:

\[ \text{interaction SS} = \frac{\sum Y_{ij.}^2}{5} - CF = (99^2 + 127^2 + \ldots + 115^2 + 108^2) \div 5 - CF = 473.775 \]
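The hand calculation for the main effects can be checked with a small Python sketch that works only from the cell subtotals of table 12.6. It is a sketch under stated assumptions: the function name `main_effects_anova` and the dictionary keys are hypothetical, and the total sum of squares (1811.775) is taken as quoted in the text rather than recomputed from the individual scores. The sketch also evaluates the expression printed above for the cell subtotals, $\sum Y_{ij.}^2 \div 5 - CF$.

```python
# Sketch: main-effects ANOVA for a balanced two-factor layout,
# using only cell subtotals. Names are illustrative.

def main_effects_anova(cells, scores_per_cell, total_ss):
    """cells[j][i] is the subtotal for sex j at location i."""
    n_sexes = len(cells)
    n_locs = len(cells[0])
    n_obs = n_sexes * n_locs * scores_per_cell

    location_totals = [sum(row[i] for row in cells) for i in range(n_locs)]
    sex_totals = [sum(row) for row in cells]
    grand_total = sum(sex_totals)

    cf = grand_total ** 2 / n_obs                                   # correction factor
    per_location = n_sexes * scores_per_cell                        # scores behind each location total
    per_sex = n_locs * scores_per_cell                              # scores behind each sex total
    ss_loc = sum(t ** 2 for t in location_totals) / per_location - cf
    ss_sex = sum(t ** 2 for t in sex_totals) / per_sex - cf
    # The expression given in the text for the cell subtotals.
    cell_term = sum(t ** 2 for row in cells for t in row) / scores_per_cell - cf

    df_loc, df_sex = n_locs - 1, n_sexes - 1
    df_resid = (n_obs - 1) - df_loc - df_sex
    ms_resid = (total_ss - ss_loc - ss_sex) / df_resid              # residual by subtraction
    return {
        "location_totals": location_totals,
        "sex_totals": sex_totals,
        "CF": cf,
        "SS_locations": ss_loc,
        "SS_sexes": ss_sex,
        "cell_subtotal_term": cell_term,
        "df_resid": df_resid,
        "MS_resid": ms_resid,
        "F_locations": (ss_loc / df_loc) / ms_resid,
        "F_sexes": (ss_sex / df_sex) / ms_resid,
    }


male_subtotals = [99, 127, 116, 105]
female_subtotals = [151, 92, 115, 108]

result = main_effects_anova([male_subtotals, female_subtotals],
                            scores_per_cell=5, total_ss=1811.775)
for key, value in result.items():
    print(key, value)
```

Each divisor is again the number of scores that went into the totals being squared: 10 for a location total, 20 for a sex total and 5 for a cell subtotal.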
Table 12.7(b). ANOVA of main effects and interaction from data of table 12.6

Source                   df    SS          MS         F-ratio
Between locations (L)     3      79.875     26.62     F(3,32) = 0.68
Between sexes (S)         1       9.025      9.025    F(1,32) = 0.23
Interaction (L×S)         3     473.775    157.925    F(3,32) = 4.05
Residual                 32    1249.100     39.03
Total                    39    1811.775

The relevant ANOVA appears in table 12.7(b). The only extra feature requiring comment is that the degrees of freedom for the interaction term are obtained by multiplying the degrees of freedom of the main effects included in the interaction (here, 3 × 1 = 3): the symbol for the interaction effect, L×S, is a useful mnemonic for this. The F-ratio for testing the interaction effect is significant at the 1% level, showing that such effects
need to be considered. The practical implication of this would be that when considering possible differences between the scores of subjects of different sex we should not leave out of consideration their geographical origin.²

We now go on to consider more generally the interpretation of main effects and interaction in ANOVA.

² In this case the division of subjects into 'male' and 'female' was entirely hypothetical, carried out to demonstrate the basic concept of 'interaction'.

12.4 ANOVA models: main effects only

Let us reconsider the first problem we discussed in this chapter - the one-way ANOVA of four independent samples of students from four locations. We wished to test the hypothesis that the population mean score of students from all geographical locations was the same, and we assumed that, at all locations, the scores were from a normal distribution with variance $\sigma^2$. All this can be summarised neatly in a simple mathematical model:

\[ Y_{ij} = \mu_i + e_{ij} \]

where $Y_{ij}$ is the j-th score observed at the i-th location, $\mu_i$ is the population mean at the i-th location, and $e_{ij}$ is the random amount by which the j-th score, randomly chosen at the i-th location, deviates from the mean score. Our earlier assumption that the scores of students from the i-th location were normally distributed with mean $\mu_i$ and variance $\sigma^2$ is equivalent to the assumption that, for each geographical location, the 'error' or 'residual', $e_{ij}$, was normally distributed with mean zero and variance $\sigma^2$. We then tested the null hypothesis that $\mu_i = \mu$, the same value, for all $i$.

Although this simple model is perfectly adequate for the one-way ANOVA problem, it does not generalise easily to more complex cases such as the factorial experiment. In order to make that possible a slight modification is needed. Suppose we ignore the existence of the four different locations. We could then consider the 40 scores as having come from a single population with mean $\mu$, say. Now write $L_i = \mu_i - \mu$. That is, $L_i$ is the difference between the mean of the overall population (that is, the grand population mean, $\mu$) and the mean score of the population of scores of students from the i-th location, $\mu_i$. Equivalently we can write $\mu_i = \mu + L_i$, and substituting this into the previous model we now have:

\[ Y_{ij} = \mu + L_i + e_{ij} \]

Some values in such a model, $Y_{ij}$ and $e_{ij}$, depend on the specific data values observed in the experiment. Others, $\mu$ and $L_i$, are assumed to be fixed for all the different samples that might be chosen; they are population, as opposed to sample, values and are referred to as the parameters of the model. $\mu$ is usually called the grand mean and $L_i$ the main effect of origin i.

The previous null hypothesis that all the $\mu_i$ had the same value, $\mu$, can now be restated as $H_0: L_i = 0$, for every value of i (that is, the main effect of geographical location is zero). This model can be generalised to cover a huge variety of situations. For example, consider the randomised block experiment of table 12.4. The observations are arranged in 32 'blocks', i.e. the errors. Within every block we have a score from each set of judges. A suitable model would be:

\[ Y_{ij} = \mu + b_i + g_j + e_{ij} \]

which says that each score, $Y_{ij}$, is composed of four components summed together, the grand mean, $\mu$, the block effect, $b_i$, the group effect, $g_j$, and the random variation $e_{ij}$ about the mean score of judges of type j scoring the error i for its gravity.

It may be easier to understand what this means if we fit the model to the observed data and estimate values for the parameters. Each parameter is estimated by the corresponding sample mean: $\mu$ is estimated by $Y_{..} \div 96$, since there is a total of 96 observations. We will use a circumflex to designate an estimate and write:

\[ \hat{\mu} = Y_{..} \div 96 = 2462 \div 96 = 25.65 \]

\[ \hat{g}_1 = (\text{total for group 1} \div \text{number of scores for group 1}) - \hat{\mu} = Y_{.1} \div 32 - 25.65 = (801 \div 32) - 25.65 = -0.62 \]

(suggesting that group 1 scores may be smaller than the overall average)

\[ \hat{b}_1 = Y_{1.} \div 3 - \hat{\mu} = (80 \div 3) - 25.65 = 26.66 - 25.65 = 1.01 \]

(each 'block' contains three scores - one from each group of judges - so that the first error may be viewed as more serious than average)

The complete set of parameter estimates is given in the margins of table 12.8. The values of these estimates are useful when discussing the data. For example, we can say that the English teachers' group gives 0.62 ($\hat{g}_1$) marks per error less than the mean ($\hat{\mu}$) while the Greek teachers' group gives 2.63 ($\hat{g}_2$) marks per error more than the average. Error number 2 receives 11.32 ($\hat{b}_2$) marks per group of judges less than the mean gravity
score ($\hat{\mu}$), but error number 11 is seen to be of about average seriousness since $\hat{b}_{11}$ is very close to zero, and so on.

Table 12.8. Total error gravity scores of ten native English teachers (1), ten Greek teachers of English (2), and ten native English non-teachers (3), on 32 English sentences

Sentence     1     2     3    Total (Y_i.)    Mean     b
 1          22    36    22        80          26.67     1.02
 2          16     9    18        43          14.33   -11.32
 3          42    29    42       113          37.67    12.02
 4          25    35    21        81          27.00     1.35
 5          31    34    26        91          30.33     4.68
 6          36    23    41       100          33.33     7.68
 7          29    25    26        80          26.67     1.02
 8          24    31    20        75          25.00    -0.65
 9          35    29    18        82          27.33     1.68
10          18    21    15        54          18.00    -7.65
11          23    33    21        77          25.67     0.02
12          22    13    19        54          18.00    -7.65
13          31    22    39        92          30.67     5.02
14          21    29    23        73          24.33    -1.32
15          27    25    24        76          25.33    -0.32
16          32    25    29        86          28.67     3.02
17          23    39    18        80          26.67     1.02
18          18    19    16        53          17.67    -7.98
19          30    28    29        87          29.00     3.35
20          31    41    22        94          31.33     5.68
21          20    25    12        57          19.00    -6.65
22          21    17    26        64          21.33    -4.32
23          29    26    43        98          32.67     7.02
24          22    37    26        85          28.33     2.68
25          26    34    22        82          27.33     1.68
26          20    28    19        67          22.33    -3.32
27          29    33    30        92          30.67     5.02
28          18    24    17        59          19.67    -5.98
29          23    37    15        75          25.00    -0.65
30          25    33    15        73          24.33    -1.32
31          27    39    28        94          31.33     5.68
32          11    20    14        45          15.00   -10.65
Total (Y_.j)  801   905   756    2462
Mean        25.03 28.28 23.63               25.65
g           -0.62  2.63 -2.03

We can also examine the degree to which a particular score is well or badly fitted by the model, by calculating the value of the random or residual component of the score, $e_{ij}$. For example, $Y_{20,2}$, the observed score on error number 20 of the Greek judges is 41, while $\hat{Y}_{20,2}$, the value obtained from the fitted model is:

\[ \hat{Y}_{20,2} = \hat{\mu} + \hat{b}_{20} + \hat{g}_2 = 25.65 + 5.68 + 2.63 = 33.96 \]

so that the so-called residual error, the difference:

\[ \hat{e}_{20,2} = Y_{20,2} - \hat{Y}_{20,2} = 41 - 33.96 \]

between the observed and fitted values is 7.04, which seems rather large, while

\[ \hat{e}_{20,1} = Y_{20,1} - \hat{Y}_{20,1} = 31 - (25.65 + 5.68 - 0.62) = 0.29 \]

which suggests a good correspondence between the observed and fitted values of the score of the English teachers on the error number 20. This brings us to the last parameter which apparently has not yet been estimated, namely $\sigma^2$, the random error variance or residual variance. A good estimate of its value is given by the residual mean square error in the ANOVA displayed in table 12.5, i.e. $\hat{\sigma}^2 = 37.82$, so that the standard deviation is estimated by $\sqrt{37.82} = 6.15$. This value is important when it comes to deciding which types of judges do seem to be giving different scores, on average, for the errors used in the study. In chapter 11 we gave the formula for a 95% confidence interval for the size of the difference between the two means:

\[ (\text{difference between sample means}) \pm \left( \text{'constant'} \times \sqrt{\frac{s^2}{n_1} + \frac{s^2}{n_2}} \right) \]

where $s$ was an estimate of the common standard deviation and $n_1$ and $n_2$ were the sample sizes. Let us use this to estimate how much difference there seems to be between the mean scores of English and Greek teachers. Applying the above formula gives the interval:

\[ \bar{Y}_{.2} - \bar{Y}_{.1} \pm \left( \text{constant} \times \sqrt{\frac{37.82}{32} + \frac{37.82}{32}} \right) \]

Each of the sample sizes is 32, since both means are based on the scores for 32 different errors. The constant used in the formula is the 5% significance value of the t-distribution with the same number of degrees of freedom as there are for the residual in the ANOVA table - 62 in this case (see table 12.5). The interval will then be:

\[ (28.28 - 25.03) \pm (2.0 \times \sqrt{2.36}) \]

or $3.25 \pm 3.07$, i.e. 0.18 to 6.32.

Similar confidence intervals for the other possible differences in means are as follows. For English teachers versus English non-teachers:

\[ 25.03 - 23.63 \pm 3.07 \]

i.e. -1.67 to 4.47

and for Greek teachers versus English non-teachers:

\[ 28.28 - 23.63 \pm 3.07 \]

i.e. 1.58 to 7.72

It might seem safe now, following the procedure of chapter 5 for carrying out tests of hypotheses using confidence intervals, to conclude that, at the 5% significance level, we can reject the two hypotheses that Greek teachers give the same scores as English teachers and non-teachers. On the other hand, there does not seem to be a significant difference between the mean scores of English teachers and English non-teachers. However, this procedure is equivalent to carrying out three pair-wise tests and, at the beginning of this chapter, we denied the validity of such an undertaking. There are theoretically correct procedures for making multiple comparisons - comparing all the possible pairs of means - but they are not simple to carry out. A frequently adopted rule of thumb is the following. Provided that the ANOVA has indicated a significant difference between a set of means, calculate the standard error $s^*$ for the comparison of any pair of means by:

\[ s^* = \sqrt{\frac{2 \times \text{residual mean square}}{n}} \]

where n is the number of observations which have been averaged when calculating each mean. Then find the difference between each pair of means. If the difference between a pair of means is greater than $2s^*$, take this as suggesting that the corresponding population means may be different. If the difference in two sample means is greater than $3s^*$, take this as reasonably convincing evidence of a real difference.

For the three groups of judges, we know (see table 12.5) that the residual mean square is 37.82 and therefore:

$s^* = \sqrt{2.36} = 1.54$, $2s^* = 3.07$ and $3s^* = 4.61$
mean of Greek teachers - mean of English teachers = 3.25
mean of English teachers - mean of English non-teachers = 1.40
mean of Greek teachers - mean of English non-teachers = 4.65

from which we might conclude that Greek and English teachers probably give different scores on average and the Greek teachers and English non-teachers almost certainly do. However, this seems a suitable moment to reiterate our comment about the difference between statistical significance and scientific importance (chapter 7). It is important always to consider the observed magnitude of the differences in the means as estimated from the data and ask whether they seem large enough to be important, whether or not they are found to be significant by a statistical hypothesis test.

12.5 ANOVA models: factorial experiments

In 12.3 we introduced the concept of a factorial experiment, using as an example vocabulary test scores classified by two factors, the sex of the subject who supplied the score and his or her geographical location. A model which we might try to fit to these data is:

\[ Y_{ijk} = \mu + L_i + S_j + e_{ijk} \]

where $Y_{ijk}$, as before, is the score of the k-th subject who is of the i-th location and is of the j-th sex, $\mu$ is the grand population mean, $L_i$ is the main effect of the i-th location, $S_j$ is the main effect of the j-th sex and $e_{ijk}$ is the random amount by which this subject's score is different from the population mean of all scores of subjects of the j-th sex of the i-th origin. As before, we assume that all the values of $e_{ijk}$ are from a normal distribution with mean zero and variance $\sigma^2$. Use of this model would lead to the analysis of table 12.6. The values of the various parameters can be estimated using exactly the same steps as in the analysis of the error gravity scores above (see exercise 3). The residual variance, $\sigma^2$, is estimated by $\hat{\sigma}^2 = 49.225$, the residual mean square of the ANOVA table 12.7(a). We have already seen that neither of the main effects is significant, i.e. there is no obvious difference in mean scores for the two sexes nor in the mean scores at the four different locations.

However, look again at the model:

\[ Y_{ijk} = \mu + L_i + S_j + e_{ijk} \]

There is an assumption here that, apart from the random variation, $e_{ijk}$, each score can be reconstructed by the addition of three parameters, the grand mean plus the effect of having origin i (assumed equal for both sexes) plus the effect of the subject being of sex j (assumed equal for all locations). We have already demonstrated in 12.3 that this model will not give an adequate description of the data. There is an additional effect to consider. There seems to be an interaction between sex and location, males scoring better, on average, in one location and females scoring better in another. The model can be expanded to cope with this as follows:

\[ Y_{ijk} = \mu + L_i + S_j + a_{ij} + e_{ijk} \]

where the parameter $a_{ij}$ is the additional correction which should be made for the interaction between the effects of the i-th origin and the j-th sex. We have already seen in table 12.7(b) that this interaction effect is significant. Table 12.9 gives the parameter estimates and the sample means for

the different mean effects and interactions. In this table, $\bar{Y}_{ij.}$ represents the sample mean score of subjects from location i and sex j, etc. When the interaction effect is significant there is not a great deal of point in examining the main effects. In this example it is quite unhelpful to say that the mean scores for the different sexes are about equal when that hides the fact that, between some origins, there seems to be an important difference of scores. It makes more sense to compare sexes within origins and origins within sex. In order to make this comparison we have to use the standard error for comparing interaction means, which has the value 3.95 (see table 12.9). For example, for origin 1 the difference in the sex means is 30.2 - 19.8 = 10.4, which is 2.6 times the relevant standard error. Using the guidelines proposed in the previous section, this suggests a real difference. In any case, an observed average difference of 10 marks in a test which was scored out of 50 is sufficiently large to merit further investigation. On the other hand, in the case of origin 4, the difference is only 0.6 which is certainly not significant compared to the standard error and is in any case hardly large enough to have any practical importance.

Table 12.9. Estimation of the model $Y_{ijk} = \mu + L_i + S_j + a_{ij} + e_{ijk}$ to the data of table 12.6

                          Geographical location
Sex    1                        2                        3                        4
1      $\bar{Y}_{11.}$ = 19.80    $\bar{Y}_{21.}$ = 25.40    $\bar{Y}_{31.}$ = 23.20    $\bar{Y}_{41.}$ = 21.00    $\bar{Y}_{.1.}$ = 22.35
       $\hat{a}_{11}$ = -4.72     $\hat{a}_{21}$ = 3.98      $\hat{a}_{31}$ = 0.58      $\hat{a}_{41}$ = 0.18      $\hat{S}_1$ = -0.48
2      $\bar{Y}_{12.}$ = 30.20    $\bar{Y}_{22.}$ = 18.40    $\bar{Y}_{32.}$ = 23.00    $\bar{Y}_{42.}$ = 21.60    $\bar{Y}_{.2.}$ = 23.30
       $\hat{a}_{12}$ = 4.72      $\hat{a}_{22}$ = -3.98     $\hat{a}_{32}$ = -0.58     $\hat{a}_{42}$ = -0.18     $\hat{S}_2$ = 0.48
       $\bar{Y}_{1..}$ = 25.0     $\bar{Y}_{2..}$ = 21.9     $\bar{Y}_{3..}$ = 23.1     $\bar{Y}_{4..}$ = 21.3     $\bar{Y}_{...}$ = 22.83
       $\hat{L}_1$ = 2.17         $\hat{L}_2$ = -0.93        $\hat{L}_3$ = 0.27         $\hat{L}_4$ = -1.53        $\hat{\mu}$ = 22.83

$\hat{\sigma}^2$ = residual mean square = 39.03 - see table 12.7(b)
standard error for comparing origin means = $\sqrt{39.03 \times 2 \div 10} = 2.79$
standard error for comparing sex means = $\sqrt{39.03 \times 2 \div 20} = 1.98$
standard error for comparing interaction means = $\sqrt{39.03 \times 2 \div 5} = 3.95$

12.6 Fixed and random effects

In the example analysed in the previous section, students had been sampled from four different locations. We reached the conclusion that there was no main effect of location. Ignoring for the moment the important interaction effects, we might conclude that the mean mark of students having one of these locations would be very similar to the mean mark of students having another. How wide is the scope of this conclusion? Does it apply only to the four locations actually observed or can we extend it to students from other locations? The analysis we have carried out above is correct only if we do not wish to extend the results, formally, beyond the four locations involved in the experiment. If we intend these locations to serve as representatives of a larger group of locations, the model has to be conceptualised differently and a different analysis will be required.

The model fitted to multiple choice test scores, ignoring interactions for the moment, was:

\[ Y_{ijk} = \mu + L_i + S_j + e_{ijk} \]

for which we tested the hypotheses $H_0: L_i = 0$ for all four locations and $H_0: S_j = 0$ for both sexes. We reached the conclusion that both these hypotheses seemed reasonable. With this formulation of the model the results will not extend to other origins. This is known as a fixed effects model.

If we wish to widen the scope of the experiment we have to construct a mechanism to relate the effect of the locations actually involved to the effects of those not included in the experiment. This is usually done by assuming that there is a very large number of possible locations each with its own location effect, L, on the mean score of students having that location. We then have to assume further that the different values of L can be modelled as a normal distribution with mean zero and some standard deviation, $\sigma_L$. The null hypothesis for a location effect will now be formulated somewhat differently, as $H_0: \sigma_L = 0$, since if there is no variation in the values of L the variance of the distribution of L values would be zero. It will now be assumed that the four locations we have chosen for the experiment have been randomly sampled from all the possible locations we could have chosen. This is an example of a random effect, the four levels of the factor 'locations' being chosen randomly from a population of possible levels.

For a one-way ANOVA (see 12.1) the calculations and the F-test are carried out exactly the same whether or not origin is viewed as a fixed or random effect. The difference lies in the conclusion we can reach and the kind of estimation possible in the model. The small F-value (table 12.2) would indicate that location effects were not important and, provided the four locations in the experiment had been randomly chosen from a large set of possible locations, this conclusion would apply to the whole population of locations. However, we have frequently indicated that, whatever the results of a hypothesis test, it is always advisable to estimate
the parameters of any model in case important effects have been missed by the statistical test or unimportant effects exaggerated. In this model an important parameter is $\sigma_L$, the standard deviation of the location effects. It is estimated by subtracting the residual mean square from the between locations mean square and taking the square root of the answer. From table 12.2 we would estimate:

\[ \hat{\sigma}_L = \sqrt{26.62 - 48.11} \]

Unfortunately the square root of a negative number does not exist, so that we cannot estimate $\sigma_L$, and this is a not infrequent outcome in random effects ANOVAs. The best we can say is that we are fairly certain that the value of $\sigma_L$ is about zero.

With higher order ANOVA (that is, two-way and more), even the F-tests will differ, depending on which effects we assume to be random or fixed, though the actual calculations of the sums of squares and mean squares will always be the same. In the previous section we carried out an ANOVA of data classified by sex and location. The table for that analysis (table 12.7(b)) is correct assuming that both sex and location are fixed effects. Clearly sex will always be a fixed effect - there are only two possible levels - but we could have chosen location as a random effect. The revised ANOVA table is given in table 12.10. You may have to look very hard before you find the only change that has occurred in the table. It is in the F-ratio column; the F-value for testing the main effect of sex is now obtained by dividing the sex mean square by the mean square for the L×S interaction and not by the residual mean square. This has caused the F-value to decrease. In this example that was unimportant but it is in general possible that the apparent significance of the effect of one factor may be removed by assuming that another factor is a random rather than a fixed effect.

Table 12.10. ANOVA of main effects and interaction from data of table 12.6 with location as a random effect

Source                   df    SS          MS         F-ratio
Between locations (L)     3      79.875     26.62     F(3,32) = 0.68
Between sexes (S)         1       9.025      9.025    F(1,3) = 0.06
L×S                       3     473.775    157.925    F(3,32) = 4.05
Residual                 32    1249.100     39.03
Total                    39    1811.775

Further discussion of this problem in general terms is beyond the scope of this book. Every different mixture of random and fixed effects gives rise to a different set of F-ratios and the greater the number of factors the greater is the number of possible combinations. Many books on experimental design or ANOVA give the details (e.g. Winer 1971). It is sensible to avoid the random effects assumption wherever possible, choosing levels of the different factors for well-considered experimental reasons rather than randomly. The fixed effects model is always easier to interpret because all the parameters can always be estimated. However, there are situations in linguistic studies where it may be difficult to avoid the use of the random effects model. It could be argued, for example, that in the analysis presented in 12.2 the 32 errors whose gravity was assessed by sets of judges are representatives of a population of possible errors and that 'error' should be considered as a random effect. This problem is discussed at length by Clark (1973), who advocates that random effects models should be used much more widely in language research.

Clark's suggestion is one way to cope with a complex and widespread problem, but it does seem a pity to lose the simplicity of the fixed effects model and replace it with a complicated variety of models containing various mixtures of fixed and random effects. There are other possible solutions. One is to claim that any differences found relate only to the particular language examples used. We could conclude that the Greek and English teachers give different scores on this particular set of errors. Though this may seem rather weak it may serve as an initial conclusion, allowing simple estimation of how large the differences seem to be and at least serving as a basis for the decision on whether further investigation is warranted. A second solution would be to identify classes of error into which all errors could be classified. If each of the 32 errors in the study were a representative of one of the, say, 32 possible classes of error, then we would be back to a fixed effects model.

There may still be a problem. Remember that an important assumption of the ANOVA model is that the variability should not depend on the levels of the factors. It may very well be that, say, different sets of judges find it easier to agree about the gravity of one type of error than the gravity of another. The variance of scores on the latter error would then be greater than on the former. If the difference is large this could have serious implications for the validity of the ANOVA (see 12.8). There is one area where random effects models occur naturally - in the assessment of the reliability of language tests.

12.7 Test score reliability and ANOVA

Language testers are quite properly interested in the 'reliability' of any test which they may administer. A completely reliable test would
be one in which an individual subject would always obtain exactly the same score if it were possible for him to repeat the test several times. How can reliability be measured? Several indices have been proposed (e.g. Ghiselli, Campbell & Zedeck 1981) but the most common is the following.
Assume that for the i-th subject in a population there is an underlying true score, $\mu_i$, for the trait measured by the test. The 'true' scores (chapter 6) will form a statistical population with mean $\mu$ and variance $\sigma_b^2$, the b signifying that the variability is measured between subjects. In fact, a subject taking the test will not usually express his true score exactly, due to random influences, such as the way he feels on a particular day and so on. The score actually observed for the i-th subject will be $Y_i = \mu_i + e_i$, where the error, $e_i$, is usually assumed to be normally distributed with mean zero and some variance, $\sigma^2$. If this subject takes the same test several times, his observed score on the j-th occasion will be $Y_{ij} = \mu_i + e_{ij}$, where $e_{ij}$ is the error in measuring the true score of the i-th student on the j-th occasion when he takes the test. This model can be written:

$Y_{ij} = \mu + a_i + e_{ij}$

where $\mu$ is the mean 'true' score of all the students in the population and $a_i$ is the difference between the 'true' score of the i-th student and the mean for all students. Since we had previously assumed that true scores were normally distributed with mean $\mu$ and variance $\sigma_b^2$, the values $a_i$ will be from a normal distribution with mean zero and variance $\sigma_b^2$. Now, if the measurement error variance is close to zero, all the variability in scores will be due to differences in the true scores of students. A common reliability coefficient is:

$\mathrm{rel} = \dfrac{\sigma_b^2}{\sigma_b^2 + \sigma^2}$

which is the proportion of the total variability which is due to true differences in the subjects. If rel = 1, there is no random error. If rel is close to zero, the measurement error is large enough to hide the true differences between students. How can we estimate rel? It can be shown that rel = $\rho$, the correlation between two repetitions of the same test over all the subjects, and rel is frequently estimated by r, the correlation between the scores of a sample of subjects each of whom takes the test twice. However, there is a problem with this. It is simply not possible to administer exactly the same test to a group of subjects on different occasions. It is much more common to administer two forms of the same test. The tester hopes that the two different versions of the test will be measuring the same trait in the same way, which is quite a large assumption. The correlational method of estimating the reliability makes no check on this assumption. If the second version of the test gave each subject exactly 10 marks (or 50 marks) more than the first version, the correlation would be 1, and the reliability would be apparently perfect, though the marks for each subject are quite different on the two applications of the test. The random effects ANOVA model provides a different method for estimating the reliability and also offers the possibility of checking the assumption that the 'parallel forms' of the test do measure the same thing in the same way.

Table 12.11. Scores of ten subjects on two parallel forms of the same test

Subject   Form 1   Form 2   Total
1           63       67      130
2           41       39       80
3           78       71      149
4           24       21       45
5           39       48       87
6           53       46       99
7           56       51      107
8           59       54      113
9           46       37       83
10          53       61      114
Total      512      495     1007

Table 12.11 shows the hypothetical marks of ten subjects on two forms of a standard test. There are two ways to tackle the analysis of this data. One is to assume that the parallel forms are equivalent so that the data can be considered as two independent observations of the same trait score on ten students. This is equivalent to assuming that any student has the same true score on both forms of the test. The observed scores can then be analysed using the model:

$Y_{ij} = \mu + a_i + e_{ij}$

where $\mu$ is the common mean of all parallel forms of the test over the whole population of students, $a_i$ is the amount by which the score of the i-th student in the sample differs from this mean, and $e_{ij}$ is the random error in measuring the score of the i-th student at the j-th test. The corresponding ANOVA is given in table 12.12(a).
The random error variance $\sigma^2$ is estimated by the residual mean square, $s^2 = 31.95$. It can be shown that the between-students mean square is an estimate of $\sigma^2 + k\sigma_b^2$, where k is the number of parallel forms used,

Table 12.12. ANOVA for data in table 12.11 allowing for differences in supposed parallel forms when assessing test reliability

(a) One-way ANOVA of data in table 12.11

Source              df     SS        MS
Between students     9   3655.05   406.12
Residual            10    319.50    31.95
Total               19   3974.55

(b) Two-way ANOVA of data in table 12.11

Source              df     SS        MS       F-ratio
Between students     9   3655.05   406.12
Between forms        1    186.05   186.05    $F_{1,9} = 12.55$
Residual             9    133.45    14.83
Total               19   3974.55

here two. The quantity $s_b^2$, an estimate of $\sigma_b^2$, can be obtained by putting $s^2 + 2s_b^2 = 406.12$, which gives $s_b^2 = 187.085$. An estimate of the reliability is given by:

$\mathrm{rel} = \dfrac{s_b^2}{s_b^2 + s^2} = 0.854$

On the other hand, the correlation between the two sets of scores is 0.93 and frequently this would have been used as an estimate of the reliability. Why is there this discrepancy? The ANOVA table 12.12(b) gives a clue. The sample correlation will be a good estimator of the reliability only if it is true that the parallel forms of the test really do measure the same trait on the same scale. To obtain this second ANOVA we have assumed the model:

$Y_{ij} = \mu + a_i + f_j + e_{ij}$

where $f_j$ is the difference between the overall mean score $\mu$ of all forms of the test over the whole population and the mean of the j-th form used in this study over the whole population, i.e. the main effect of forms (a random effect). The F-ratio corresponding to this effect is highly significant, showing that the mean marks of different forms are not the same. In the sample the means for the two forms are 51.2 and 49.5. This suggests that the sample correlation is not appropriate as an estimate of the reliability since its use in that way assumes equivalence of parallel forms. In fact the use of the correlation coefficient ignores entirely the variability caused by using different forms. From table 12.12(b) we can re-estimate $s^2 = 14.83$, $s^2 + 2s_b^2 = 406.12$, so that $s_b^2 = 195.64$. This is a new estimate of $\sigma_b^2$, which we have already estimated, above, as 187.085. These two estimates for the same quantity, both calculated from the same data, have slightly different values because of the different models assumed for their calculation. The variance in the mean scores of different parallel forms, $\sigma_f^2$, can be estimated in a similar way by:

$s^2 + k s_f^2 = \text{mean square for forms}$

where k is the number of subjects in the sample. This gives $14.83 + 10 s_f^2 = 186.05$, or $s_f^2 = 17.12$. Now, the total variance of any score will be the sum of these three variances:

$s^2 + s_b^2 + s_f^2 = 14.83 + 195.64 + 17.12 = 227.59$

Using the definition of reliability that says:

$\mathrm{rel} = \dfrac{\text{between-subjects variance}}{\text{total variance}} = \dfrac{195.64}{227.59} = 0.860$

which is very close to the estimate of 0.854 we obtained previously. However, if the variability due to parallel forms is wrongly assumed to be part of the true scores variance, we would obtain:

$\mathrm{rel} = \dfrac{195.64 + 17.12}{227.59} = 0.93$

which is the correlation between the scores on the two forms of the test! Thus the use of the correlation to estimate reliability is likely to cause its overestimation. Furthermore, the ANOVA method extends with no difficulty to the case where several parallel forms have been used. Further discussion of the meaning and dangers of reliability coefficients can be found in Krzanowski & Woods (1984).

12.8 Further comments on ANOVA
In this, already rather long, chapter we have covered only the basic elements of ANOVA models. The possible variety of models is so large, with the details being different for each different data structure or experimental design, that it is neither possible nor appropriate to attempt a complete coverage here. The general principles are always the same and the details for most designs can be found in any of several books. However, there are two special points, of some importance in linguistic research, which need to be mentioned.
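Before turning to those two points, the variance-component arithmetic of 12.7 can be retraced in a short sketch. It takes the mean squares exactly as quoted in table 12.12 rather than recomputing them from the raw scores; the function and variable names are our own illustrative choices, and the code writes n for the number of subjects where the text reuses the letter k.

```python
# Mean squares as quoted in table 12.12 (not recomputed from raw scores).
K_FORMS = 2        # number of parallel forms
N_SUBJECTS = 10    # number of subjects


def one_way_components(ms_students, ms_residual, k=K_FORMS):
    """Variance components under the model Y_ij = mu + a_i + e_ij."""
    s2 = ms_residual                  # estimates sigma^2
    sb2 = (ms_students - s2) / k      # MS_students estimates sigma^2 + k * sigma_b^2
    return s2, sb2


def two_way_components(ms_students, ms_forms, ms_residual,
                       k=K_FORMS, n=N_SUBJECTS):
    """Variance components under the model Y_ij = mu + a_i + f_j + e_ij."""
    s2 = ms_residual
    sb2 = (ms_students - s2) / k      # between-subjects variance
    sf2 = (ms_forms - s2) / n         # MS_forms estimates sigma^2 + n * sigma_f^2
    return s2, sb2, sf2


# One-way model, table 12.12(a)
s2_a, sb2_a = one_way_components(406.12, 31.95)
rel_one_way = sb2_a / (sb2_a + s2_a)

# Two-way model, table 12.12(b)
s2_b, sb2_b, sf2_b = two_way_components(406.12, 186.05, 14.83)
total_b = s2_b + sb2_b + sf2_b
rel_two_way = sb2_b / total_b
# Forms variance wrongly counted as true-score variance
rel_forms_misattributed = (sb2_b + sf2_b) / total_b
# F-ratio for the forms effect
f_forms = 186.05 / 14.83

print(f"one-way reliability:            {rel_one_way:.3f}")
print(f"two-way reliability:            {rel_two_way:.3f}")
print(f"forms variance misattributed:   {rel_forms_misattributed:.2f}")
print(f"F-ratio for forms:              {f_forms:.2f}")
```

Keeping the two models in separate functions makes it plain that the only difference between the two reliability estimates is whether a forms component is removed from the residual.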
12.8.1 Transforming the data
The first point is the possibility of transforming data which do not meet the assumptions required for ANOVA to a different form which do. There are many possibilities, depending on the specific feature of the original data which might cause problems. However, one special case which may arise fairly frequently in applied language studies is data in the form of proportions or percentages, e.g. the proportion of correct insertions in a cloze test with 20 deletions. In this example, a sample of native speakers would be expected to score higher than second language learners. In an 'easy' cloze test native speakers might achieve very high scores, many getting all or almost all items correct, with a few perhaps scoring less well. Such data would lack symmetry and could not be normally distributed. Furthermore, since most of the subjects would then have very similar scores, the sample variance would be small. A sample of second language learners might show much greater spread of ability with a lower mean. In general, with this kind of data, the nearer the mean score is to 50% correct the greater will be the variance, while the symmetry and variance of the sample scores will both decrease as the mean approaches one of the extremes of 0% or 100%. It may not be legitimate in such cases to carry out a t-test or ANOVA to investigate whether there was a significant difference in the average scores of the two groups or to estimate what the difference might be, using the methods of chapter 8.
Provided most of the subjects in the experiment obtained scores in the range 20%-80% (i.e. 4/20 to 16/20) it would probably be acceptable to analyse the raw scores directly. However, if more than one or two scores lie outside this range, in particular if any scores are smaller than 10% correct or greater than 90% correct, it will not be safe to analyse them by the methods of earlier chapters without first transforming them.
The traditional solution to this problem is to change the scores of the individual subjects into scores on a different scale in such a way that the new scores will be normally distributed and have constant variance. This is done via the arcsine transformation.³ Standard computer packages will usually include a simple instruction to enable the data to be transformed in this way. The usual ANOVA or regression analysis can then be carried out on the W-scores instead of the X-scores.
Other transformations are in common use. It may happen that for some variables the variability over a population, or subpopulation, is related to the mean value. For example, if two groups of individuals have markedly different mean vocabulary sizes the group with the higher mean will usually show more variability in the vocabulary sizes of the individuals comprising the group. If the variance of the values in the different groups seems to be roughly proportional to their means then analysing the logarithms of the original values will give more reliable results. If, instead, the standard deviations of the groups are proportional to their means, taking the square root will help. (See also 13.12.)

³ The transformed scores (W) can be obtained from the original scores (X) (where these are in percentages) by the formula: $W = \arcsin\sqrt{X/100}$. Most scientific calculators have a function key which gives the value of arcsine, otherwise written $\sin^{-1}$.

12.8.2 'Within-subject' ANOVAs
The second general point we have not discussed but which may be important in linguistic experiments is the situation where subjects are divided into groups and each subject is measured on several variables. For example, we might consider an experiment where 12 subjects of each of two different nationalities are tested for their reaction times to several different stimuli. The data would have the structure shown in table 12.13.

Table 12.13. The structure of a 'within-subject' ANOVA model

                            Stimulus
Nationality   Subject       1           2           3
1             Subject 1     $Y_{111}$   $Y_{121}$
              Subject 2     $Y_{112}$   $Y_{122}$
2             Subject 1     $Y_{211}$
              Subject 2     $Y_{212}$   $Y_{222}$

The observation $Y_{ijk}$ will be the reaction time of the k-th subject of the i-th nationality to the j-th stimulus. The comparison of stimuli can be carried out within each subject while the comparison of nationalities can only be carried out between sets of subjects. Variation in the reaction times of a single subject on different applications of the same stimulus is likely to be rather less than variation between subjects reacting to the same stimulus. Stimuli can therefore be compared more accurately than nationalities from such a design. (Standard texts on ANOVA often refer to this as a split-plot design since it typically occurs when several large agricultural plots (subjects) are treated with different fertilisers (nationalities) and then several varieties of a cereal (stimuli) are grown in each plot.) Data with this kind of structure cannot be analysed using the models we have presented in this chapter: see, e.g., Winer (1971) for details.
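Returning briefly to the arcsine transformation of 12.8.1, the formula in the footnote there, $W = \arcsin\sqrt{X/100}$, can be sketched as follows. This is a minimal illustration only: the cloze scores are invented for the example, the function name is our own, and the result is expressed in radians, which is an assumption since the text does not say whether radians or degrees are intended.

```python
import math


def arcsine_transform(percent_correct):
    """Return W = arcsin(sqrt(X / 100)) for a percentage X in [0, 100].

    The result is in radians (an assumption; degrees would differ only
    by a constant scale factor).
    """
    if not 0 <= percent_correct <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    return math.asin(math.sqrt(percent_correct / 100))


# Hypothetical scores out of 20 on a cloze test (invented for illustration).
cloze_scores = [20, 19, 19, 18, 12, 4]
percentages = [100 * score / 20 for score in cloze_scores]
w_scores = [arcsine_transform(p) for p in percentages]

for x, w in zip(percentages, w_scores):
    print(f"X = {x:5.1f}%  ->  W = {w:.3f}")
```

The W-scores, rather than the percentages, would then be passed to whatever ANOVA or regression routine is being used.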

SUMMARY

Analysis of variance (ANOVA) was introduced and a number of special cases discussed.
(1) One-way ANOVA was explained and it was stated that to compare several means simultaneously it would not be correct to carry out pair-wise t-tests.
(2) Two-way ANOVA was introduced, especially the randomised block design which is an extension to several groups of the paired t-test.
(3) The concept of a factorial experiment was explained together with the terms factor and levels of a factor; the possibility of interaction was discussed and it was explained that when significant interaction was present it did not make much sense to base the analysis on the main effects alone.
(4) A convenient rule of thumb was presented for examining the differences between the means corresponding to different experimental conditions.
(5) The difference between a fixed effect and a random effect was discussed.
(6) The reliability of tests was discussed and it was shown, via ANOVA, that reliability measures based on correlations could be misleading.
(7) It was pointed out that linguistic data may not be suitable for ANOVA and transformation may be necessary.
(8) It was stated that the data from experiments which involved repeated measures from individual subjects may need to be analysed as a within-subjects or split-plot design, though this type of ANOVA was not explained further.

Table 12.14. Scores of students from three different centres on the same language test

A B C
42 34
"
10 36 39
12 40 38
10 34 41
10 38 38
10 38 36
9 32 38
41 30
"9 35 39
35 36
35 31
"4 32 33
10 29 29
8 28
8 32 33
8 28 30
8 37 39
30 26
7 29 32
8 30 20
8 26 27
31 27
7 22 21
7 29 23
4 19 29

EXERCISES
(1) Table 12.14 represents scores on a language test by three groups of subjects from different centres. Using the method outlined in 12.1, test the hypothesis that there is no difference between the sample means overall. Use the rule of thumb procedure of 12.4 to test for differences between individual means.
(2) The values that follow are error scores (read from left to right) from a replication of the Hughes-Lascaratou study performed by non-native teachers of English who were of mixed nationalities.

Non-native teacher scores

35 10 26 40 30 24 26 31 44 14 33 15 24 34 22 24
30 20 25 40 30 15 24 35 40 25 35 26 33 30 42 10

(a) Construct a new table (see 12.4) by substituting this column of 32 numbers for the Greek teachers' scores in the original.
(b) Construct an ANOVA table for this data (see 12.2 and table 12.5).
(c) Construct a table with the new data comparable to table 12.8, and again compare the difference between observed and fitted values for $Y_{20,2}$ and $Y_{20,1}$.
(d) Do the non-native teachers behave similarly to the Greek teachers?
(3) Using the methods outlined in 12.5:
(a) Estimate the following parameters from table 12.6: the overall mean; the mean for Location 1; the mean for the male subgroup from Location 2; the mean for the female subgroup from Location 4.
(b) Using the residual variance, compare the difference between observed and fitted values for $Y_{1,1}$; $Y_{4,1}$; $Y_{3,2}$.

13
Linear regression

In chapter 9 we proposed the correlation coefficient as a measure of the degree to which two random variables may be linearly related. In the present chapter we will show how information about one variable which is easily measured or well-understood can be exploited to improve our knowledge about a less easily measured or less familiar variable. To introduce the idea of a linear model, which is crucial for this chapter, we will begin with a simple non-linguistic example.

Suppose the manager of a shop is paid entirely on a commission basis and he receives at the end of each month an amount equal to 2% of the total value of sales made in that month. The problem, and the model for its solution, can be expressed mathematically. Let Y be the commission the manager ought to receive for the month just ended. Let X be the total value of the sales in that month. Then:

$Y = 0.02X$ (Remember, 0.02 = 2%)

Figure 13.1. Graph of $Y = 0.02X$. [Horizontal axis: Sales (thousands of £), X.]

The model can be represented graphically as in figure 13.1 by a straight line passing through the origin of the graph. When the value of X, the month's total sales, is known, then the corresponding value of Y, the commission, can be read off from the graph as shown in the figure. Note that for every £1 increase in X, the commission increases by 2p or £0.02. We would say that the slope or gradient of the line is 0.02. This tells us simply how much change to expect in the value of Y corresponding to a unit change in X.

Figure 13.2. Graph of $Y = 500 + 0.01X$. [Horizontal axis: Sales (thousands of £), X.]

Suppose that the shop manager does not like the extreme fluctuations which can take place in his earnings from one month to another and he negotiates a change in the way in which he is paid so that he receives a fixed salary of £500 each month plus a smaller commission of 1% of the value of sales. Can he still find a simple mathematical model to calculate his monthly salary? Clearly he can. With X and Y having the same meanings as before, the formula:

$Y = 500 + 0.01X$

will be correct. The corresponding graph is shown in figure 13.2. Again it is a straight line. However, it does not pass through the origin, since even if there are no sales the manager still receives £500; nor does it slope so steeply, since now a unit increase in X corresponds to an increase of only 0.01 in Y. We would say in both cases that there was a linear relationship between X and Y, since in both cases the graph takes the form of a straight line. In general, any algebraic relation of the form:

$Y = \alpha + \beta X$


will have a graph which is a straight line. The quantity $\beta$ is called the slope or gradient of the line and $\alpha$ is often referred to as the intercept or intercept on the Y-axis (figure 13.3). The values $\alpha$ and $\beta$ remain fixed, irrespective of the values of X and Y.

Figure 13.3. Graph of the linear equation $Y = \alpha + \beta X$.

Figure 13.4. Relationship between age (±1 month) and mean length of utterance (mlu) in morphemes in 123 children: mlu = -0.548 + 0.103 (age). [Horizontal axis: Age in months; vertical axis: mlu.]

Table 13.1. Age and mean length of utterance for 12 hypothetical children

Child   Age in months (X)   mlu (Y)   Predicted mlu ($\hat{Y}$)   Residual
1              24             2.10           1.82                   0.28
2              23             2.16           1.73                   0.43
3              31             2.25           2.43                  -0.18
4              20             1.93           1.47                   0.46
5              43             2.64           3.49                  -0.85
6              58             5.63           4.80                   0.83
7              28             1.96           2.17                  -0.21
8              34             2.23           2.70                  -0.47
9              53             5.19           4.36                   0.83
10             46             3.45           3.75                  -0.30
11             49             3.21           4.01                  -0.80
12             36             2.84           2.87                  -0.03

COV(X, Y) = 13.881
$s_X = 12.573$   $\bar{X} = 37.081$
$s_Y = 1.243$   $\bar{Y} = 2.966$

... Miller & Chapman (1981). They calculated mean length of utterance (mlu) in morphemes for a group of 123 children between 17 months and 5 years of age. In figure 13.4, X, the age of each child, is plotted on the horizontal axis, and Y, the corresponding mlu, on the vertical axis. It is clear that these points do not fit exactly on a straight line. It is equally clear that mlu is increasing with age, and that it might be helpful to make some statement such as 'between the ages of $a_1$ and $a_2$ mlu increases by about so much for each month'. It will make it simpler to introduce and explain the concepts of the present chapter if we use the same two variables as Miller & Chapman, mlu and age, but with data from a smaller number of children. We have therefore constructed hypothetical data on 12 children and this appears in table 13.1. The values in the table are realistic and commensurate with the real data discussed in Miller & Chapman (1981). The corresponding scattergram appears in figure 13.5. The correlation, r, of mlu with age for the 12 children is 0.8882, obtained as follows:

covariance (mlu, age) = 13.881 (see 10.1)
standard deviation of ages = 12.573
standard deviation of mlu = 1.243
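The quantities used to obtain this correlation can be checked directly from the ages and mlu values of table 13.1. The sketch below is illustrative only: it assumes the sample covariance and standard deviations are computed with a divisor of n - 1 and that the correlation is the covariance divided by the product of the two standard deviations; the function and variable names are our own.

```python
import math

# (age in months, mlu) for the 12 hypothetical children of table 13.1
children = [
    (24, 2.10), (23, 2.16), (31, 2.25), (20, 1.93),
    (43, 2.64), (58, 5.63), (28, 1.96), (34, 2.23),
    (53, 5.19), (46, 3.45), (49, 3.21), (36, 2.84),
]


def sample_stats(pairs):
    """Return (covariance, sd of x, sd of y), all with divisor n - 1."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x, _ in pairs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for _, y in pairs) / (n - 1))
    return cov, sd_x, sd_y


cov_xy, sd_age, sd_mlu = sample_stats(children)
r = cov_xy / (sd_age * sd_mlu)   # correlation of mlu with age

print(f"covariance = {cov_xy:.3f}")
print(f"sd of ages = {sd_age:.3f}")
print(f"sd of mlu  = {sd_mlu:.3f}")
print(f"r          = {r:.4f}")
```

Working from the unrounded sums may give a value of r that differs slightly in the fourth decimal place from one obtained from the three rounded quantities.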