1 Fundamentals of
Quantitative Analysis

In this chapter, I follow the outline of topics used in the first chapter
of Kachigan, Multivariate Statistical Analysis, because I think that that
is a very effective presentation of these core ideas.

Increasingly, linguists handle quantitative data in their research.
Phoneticians, sociolinguists, psycholinguists, and computational linguists
deal in numbers and have for decades. Now also, phonologists, syntacticians,
and historical linguists are finding linguistic research to
involve quantitative methods. For example, Keller (2003) measured
sentence acceptability using a psychophysical technique called magnitude
estimation. Also, Boersma and Hayes (2001) employed probabilistic
reasoning in a constraint reranking algorithm for optimality theory.

Consequently, mastery of quantitative methods is increasingly
becoming a vital component of linguistic training. Yet, when I am asked
to teach a course on quantitative methods I am not happy with the available
textbooks. I hope that this book will deal adequately with the
fundamental concepts that underlie common quantitative methods, and
more than that will help students make the transition from the basics
to real research problems with explicit examples of various common
analysis techniques.

Of course, the strategies and methods of quantitative analysis are of
primary importance, but in these chapters practical aspects of handling
quantitative linguistic data will also be an important focus. We will be
concerned with how to use a particular statistical package (R) to
discover patterns in quantitative data and to test linguistic hypotheses.

This theme is very practical and assumes that it is appropriate and
useful to look at quantitative measures of language structure and usage.
We will question this assumption. Salsburg (2001) talks about
a "statistical revolution" in science in which the distributions of
measurements are the objects of study. We will, to some small extent,
consider linguistics from this point of view. Has linguistics participated
in the statistical revolution? What would a quantitative linguistics be
like? Where is this approach taking the discipline?

Table 1.1 shows a set of phonetic measurements. These VOT (voice
onset time) measurements show the duration of aspiration in voiceless
stops in Cherokee. I made these measurements from recordings of one

Table 1.1 Voice onset time measurements of a single Cherokee speaker
with a 30-year gap between recordings.
[Table body illegible in the source: the stop consonant (k or t) and the
VOT of each token in the two recordings.]
Average              113.5   [illegible]
Standard deviation    35.9   [illegible]

speaker, the Cherokee linguist Durbin Feeling, that were made in 1971
and 2001. The average VOT for voiceless stops /k/ and /t/ is shorter
in the 2001 dataset. But is the difference "significant"? Or is the
difference between VOT in 1971 and 2001 just an instance of random
variation - a consequence of randomly selecting possible utterances in the
two years that, though not identical, come from the same underlying
distribution of possible VOT values for this speaker? I think that one
of the main points to keep in mind about drawing conclusions from
data is that it is all guessing. Really. But what we are trying to do with
statistical summaries and hypothesis testing is to quantify just how
reliable our guesses are.

1.1 What We Accomplish in Quantitative Analysis

Quantitative analysis takes some time and effort, so it is important to
be clear about what you are trying to accomplish with it. Note that
"everybody seems to be doing it" is not on the list. The four main goals
of quantitative analysis are:
1 data reduction: summarize trends, capture the common aspects of
  a set of observations such as the average, standard deviation, and
  correlations among variables;
2 inference: generalize from a representative set of observations to a
  larger universe of possible observations using hypothesis tests
  such as the t-test or analysis of variance;
3 discovery of relationships: find descriptive or causal patterns in data
  which may be described in multiple regression models or in factor
  analysis;
4 exploration of processes that may have a basis in probability:
  theoretical modeling, say in information theory, or in practical
  contexts such as probabilistic sentence parsing.

1.2 How to Describe an Observation

An observation can be obtained in some elaborate way, like visiting
a monastery in Egypt to look at an ancient manuscript that hasn't
been read in a thousand years, or renting an MRI machine for an hour
of brain imaging. Or an observation can be obtained on the cheap -
asking someone where the shoes are in the department store and noting
whether the talker says the /r/s in "fourth floor."

Some observations can't be quantified in any meaningful sense. For
example, if that ancient text has an instance of a particular form and
your main question is "how old is the form?" then your result is that
the form is at least as old as the manuscript. However, if you were to
observe that the form was used 15 times in this manuscript, but only
twice in a slightly older manuscript, then these frequency counts
begin to take the shape of quantified linguistic observations that can
be analyzed with the same quantitative methods used in science and
engineering. I take that to be a good thing - linguistics as a member
of the scientific community.

Each observation will have several descriptive properties - some
will be qualitative and some will be quantitative - and descriptive
properties (variables) come in one of four types:

Nominal: Named properties - they have no meaningful order on a scale
of any type.
Examples: What language is being observed? What dialect? Which
word? What is the gender of the person being observed? Which variant
was used: going or goin'?

Ordinal: Orderable properties - they aren't observed on a measurable
scale, but this kind of property is transitive so that if a is less than b
and b is less than c then a is also less than c.
Examples: Zipf's rank frequency of words, rating scales (e.g. excellent,
good, fair, poor)?

Interval: This is a property that is measured on a scale that does not
have a true zero value. In an interval scale, the magnitude of differences
of adjacent observations can be determined (unlike the adjacent
items on an ordinal scale), but because the zero value on the scale is
arbitrary the scale cannot be interpreted in any absolute sense.
Examples: temperature (Fahrenheit or Centigrade scales), rating
scales?, magnitude estimation judgments.

Ratio: This is a property that we measure on a scale that does have an
absolute zero value. This is called a ratio scale because ratios of these
measurements are meaningful. For instance, a vowel that is 100 ms long
is twice as long as a 50 ms vowel, and 200 ms is twice 100 ms. Contrast
this with temperature - 80 degrees Fahrenheit is not twice as hot as
40 degrees.
Examples: Acoustic measures - frequency, duration, frequency
counts, reaction time.
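
The four scale types can be sketched in R, the package used in this book. This is only an illustration: the variable names and all of the values below are invented, and the point is simply which operations make sense for each kind of property.

```r
# Hypothetical observations, invented purely for illustration.
variant  <- factor(c("going", "goin'", "goin'", "going"))   # nominal
judgment <- factor(c("poor", "good", "fair", "excellent"),
                   levels = c("poor", "fair", "good", "excellent"),
                   ordered = TRUE)                          # ordinal
temp_f   <- c(40, 80)                                       # interval (degrees Fahrenheit)
dur_ms   <- c(50, 100, 200)                                 # ratio (vowel duration in ms)

table(variant)                 # nominal data can only be counted
judgment[1] < judgment[2]      # ordinal data can be ranked: poor < good
diff(temp_f)                   # interval data support differences
dur_ms[2] / dur_ms[1]          # ratio data also support ratios
```

Storing an ordinal property as an ordered factor lets R compare ranks while still refusing arithmetic on them, which mirrors the distinction drawn above.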

1.3 Frequency Distributions: A Fundamental
Building Block of Quantitative Analysis

You must get this next bit, so pay attention. Suppose we want to know
how grammatical a sentence is. We ask 36 people to score the sentence
on a grammaticality scale so that a score of 1 means that it sounds pretty
ungrammatical, and 10 sounds perfectly OK. Suppose that the ratings
in Table 1.2 result from this exercise.

Interesting, but what are we supposed to learn from this? Well,
we're going to use this set of 36 numbers to construct a frequency
distribution.

Table 1.2 Hypothetical data of grammaticality ratings for a group of
36 raters.
[Table body illegible in the source: person number and rating for each
of the 36 raters.]

Look again at Table 1.2. How many people gave the sentence a
rating of "1"? How many rated it "2"? When we answer these questions
for all of the possible ratings we have the values that make up
the frequency distribution of our sentence grammaticality ratings.
These data and some useful recodings of them are shown in Table 1.3.
You'll notice in Table 1.3 that we counted two instances of rating "1",
one instance of rating "2", six instances of rating "3", and so on.
Since there were 36 raters, each giving one score to the sentence, we
have a total of 36 observations, so we can express the frequency counts
in relative terms - as a percentage of the total number of observations.
Note that percentages (as the etymology of the word would suggest)
are commonly expressed on a scale from 0 to 100, but you could express
the same information as proportions ranging from 0 to 1.
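
As a rough sketch of how these recodings can be computed in R, the block below builds the four kinds of frequency column from a small vector of ratings. The ten ratings are invented for illustration and are not the data in Table 1.2; the column names are likewise my own.

```r
# Hypothetical ratings from ten raters (invented; not the data in Table 1.2).
ratings <- c(3, 4, 4, 5, 5, 5, 6, 6, 7, 4)

# Count how often each possible rating (1 to 9) occurs, keeping zero counts.
freq     <- table(factor(ratings, levels = 1:9))
rel_freq <- 100 * freq / sum(freq)      # percentages, on a 0 to 100 scale
cum_freq <- cumsum(freq)                # running total of the counts
rel_cum  <- 100 * cum_freq / sum(freq)  # running total as a percentage

freq_table <- data.frame(rating   = 1:9,
                         freq     = as.vector(freq),
                         rel_freq = round(as.vector(rel_freq), 1),
                         cum_freq = as.vector(cum_freq),
                         rel_cum  = round(as.vector(rel_cum), 1))
print(freq_table)
```

Dividing by `sum(freq)` without the factor of 100 would give the same information as proportions between 0 and 1.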

The frequency distribution in Table 1.3 shows that most of the
grammaticality scores are either "4" or "5," and that though the scores
span a wide range (from 1 to 9) the scores are generally clustered in
the middle of the range. This is as it should be because I selected the
set of scores from a normal (bell-shaped) frequency distribution that
centered on the average value of 4.5 - more about this later.

The set of numbers in Table 1.3 is more informative than those in
Table 1.2, but nothing beats a picture. Figure 1.1 shows the frequencies
from Table 1.3. This figure highlights, for the visually inclined, the
same points that we made regarding the numeric data in Table 1.3.

Table 1.3 Frequency distributions of the grammaticality rating data.

Rating   Frequencies   Relative      Cumulative    Relative cumulative
                       frequencies   frequencies   frequencies
1         2              5.6          2              5.6
2         1              2.8          3              8.3
3         6             16.7          9             25.0
[Rows for ratings 4 to 9 illegible in the source.]
Total    36            100.0

Figure 1.1 The frequency distribution of the grammaticality rating data
that was presented in Table 1.2.

The property that we are seeking to study with the "grammaticality
score" measure is probably a good deal more gradient than we
permit by restricting our rater to a scale of integer numbers. It may be
that not all sentences that he/she would rate as a "5" are exactly
equivalent to each other in the internal feeling of grammaticality that they
evoke. Who knows? But suppose that it is true that the internal
grammaticality response that we measure with our rating scale is actually
a continuous, gradient property. We could get at this aspect by
providing a more and more continuous type of rating scale - we'll see more
of this when we look at magnitude estimation later - but whatever scale
we use, it will have some degree of granularity or quantization to
it. This is true of all of the measurement scales that we could imagine
using in any science.

So, with a very fine-grained scale (say a grammaticality rating on a
scale with many decimal points) it doesn't make any sense to count the
number of times that a particular measurement value appears in the
data set because it is highly likely that no two ratings will be exactly
the same. In this case, then, to describe the frequency distribution of
our data we need to group the data into contiguous ranges of scores
(bins) of similar values and then count the number of observations in
each bin. For example, if we permitted ratings on the 1 to 10
grammaticality scale to have many decimal places, the frequency distribution
would look like the histogram in Figure 1.2, where we have a count
of 1 for each rating value in the data set.

Figure 1.3 shows how we can group these same data into ranges
(here ratings between 0 and 1, 1 and 2, and so on) and then count the
number of rating values in each range, just as we counted before the
number of ratings of a particular value. So, instead of counting the
number of times the rating "6" was given, now we are counting the number
of ratings that are greater than or equal to 5 and less than 6.

Figure 1.2 A histogram of the frequency distribution of grammaticality
ratings when rating values come on a continuous scale.

Figure 1.3 The same continuous rating data that was shown in Figure 1.2,
but now the frequency distribution is plotted in bins.
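
The grouping step can be sketched in R with `cut()` and `hist()`. The ratings below are invented for illustration, and the choice of bins that include their lower edge but not their upper edge is an assumption made to match the "greater than or equal to 5 and less than 6" wording above.

```r
# Hypothetical fine-grained ratings (invented for illustration).
ratings <- c(3.41, 4.05, 4.73, 5.12, 5.98, 6.00, 6.37, 2.80)

# Bins are closed on the left and open on the right: [0,1), [1,2), ..., [9,10).
bins   <- cut(ratings, breaks = 0:10, right = FALSE)
counts <- table(bins)
print(counts)

# hist() does the same grouping; plot = FALSE returns the counts without drawing.
h <- hist(ratings, breaks = 0:10, right = FALSE, plot = FALSE)
```

Note that the rating of exactly 6.00 is counted in the bin from 6 to 7, not in the bin from 5 to 6.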

OK. This process of grouping measurements on a continuous scale
is a useful, practical thing to do, but it helps us now make a serious
point about theoretical frequency distributions. This point is the
foundation of all of the hypothesis testing statistics that we will be looking
at later. So, pay attention!

Let's suppose that we could draw an infinite data set. The larger our
data set becomes the more detailed a representation of the frequency
distribution we can get. For example, suppose I keep collecting sentence
grammaticality data for the same sentence, so that instead of
ratings from 36 people I had ratings from 10,000 people. Now even
with a histogram that has 100 bars in it (Figure 1.4), we can see that
ratings near 4.5 are more common than those at the edges of the rating
scale. Now if we keep adding observations up to infinity (just play along
with me here) and keep reducing the size of the bars in the histogram
of the frequency distribution we come to a point at which the interval
between bars is vanishingly small - i.e. we end up with a continuous
curve (see Figure 1.5). "Vanishingly small" should be a tip-off
that we have entered the realm of calculus. Not to worry though, we're
not going too far.

[Figure 1.4: a frequency histogram of the 10,000 observations.]

[Figure 1.5: the probability density distribution of the 10,000
observations and the corresponding continuous curve.]

The "normal distribution" is an especially useful theoretical function.
It seems intuitively reasonable to assume that in most cases there is
some underlying property that we are trying to measure - like
grammaticality, or typical duration, or amount of processing time - and that
there is some source of random error that keeps us from getting an
exact measurement of the underlying property. If this is a good
description of the source of variability in our measurements, then we
can model this situation by assuming that the underlying property -
the uncontaminated "true" value that we seek - is at the center of the
frequency distribution that we observe in our measurements and that
the spread of the distribution is caused by error, with bigger errors being
less likely to occur than smaller errors.

These assumptions give us a bell-shaped frequency distribution
which can be described by the normal curve, an extremely useful
bell-shaped curve, which is an exponential function of the mean value (Greek
letter μ "mu") and the variance (σ², where the Greek letter σ "sigma"
is the standard deviation).

$$f_x = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2} \qquad \text{the normal distribution}$$

One useful aspect of this definition of a theoretical distribution of
data (besides that it derives from just two numbers, the mean value
and a measure of how variable the data are) is that the sum of the area
under the curve f_x is 1. So, instead of thinking in terms of a "frequency"
distribution, the normal curve gives us a way to calculate the
probability of any set of observations by finding the area under any
portion of the curve. We'll come back to this.
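
The two claims in this passage, that the curve is fixed by the mean and standard deviation and that the area under it is 1, can be checked numerically in R. This is a minimal sketch: `normal_curve` is a name I made up, and the standard deviation of 2 is an assumed value chosen only for illustration.

```r
# The normal curve written out from the formula in the text.
normal_curve <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

mu    <- 4.5   # mean, as in the rating example
sigma <- 2     # standard deviation: an assumed value for illustration

# The hand-written curve agrees with R's built-in dnorm().
same_as_dnorm <- all.equal(normal_curve(3, mu, sigma),
                           dnorm(3, mean = mu, sd = sigma))

# The total area under the curve is 1 (up to numerical error).
total_area <- integrate(normal_curve, -Inf, Inf, mu = mu, sigma = sigma)$value

# The probability of a rating between 4 and 5 is the area over that interval.
p_4_to_5 <- pnorm(5, mean = mu, sd = sigma) - pnorm(4, mean = mu, sd = sigma)
```

In practice one would use `dnorm()` and `pnorm()` directly; the hand-written function is there only to show that they implement the same formula.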

1.4 Types of Distributions

Data come in a variety of shapes of frequency distributions (Figure 1.6).
For example, if every outcome is equally likely then the distribution
is uniform. This happens for example with the six sides of a dice - each
one is (supposed to be) equally likely, so if you count up the number
of rolls that come up "1" it should be on average 1 out of every 6 rolls.

In the normal - bell-shaped - distribution, measurements tend to
congregate around a typical value and values become less and less
likely as they deviate further from this central value. As we saw in the
section above, the normal curve is defined by two parameters - what
the central tendency is (μ) and how quickly probability goes down as
you move away from the center of the distribution (σ).

If measurements are taken on a scale (like the 1-9 grammaticality
rating scale discussed above), as we approach one end of the scale the
frequency distribution is bound to be skewed because there is a limit
beyond which the data values cannot go. We most often run into skewed
frequency distributions when dealing with percentage data and reaction
time data (where negative reaction times are not meaningful).

The J-shaped distribution is a kind of skewed distribution with most
observations coming from the very end of the measurement scale. For
example, if you count speech errors per utterance you might find that
most utterances have a speech error count of 0. So in a histogram, the
number of utterances with a low error count will be very high and will
decrease dramatically as the number of errors per utterance increases.

A bimodal distribution is like a combination of two normal distributions
- there are two peaks. If you find that your data fall in a bimodal
distribution you might consider whether the data actually represent
two separate populations of measurements. For example, voice
fundamental frequency (the acoustic property most closely related to the
pitch of a person's voice) falls into a bimodal distribution when you
pool measurements from men and women because men tend to have
lower pitch than women.

Figure 1.6 Types of probability distributions.

If you ask a number of people how strongly they supported the US
invasion of Iraq you would get a very polarized distribution of results.
In this U-shaped distribution most people would be either strongly in
favor or strongly opposed with not too many in the middle.
[Boxed note illegible in the source.]

1.5 Is Normal Data, Well, Normal?

The normal distribution is a useful way to describe data. It embodies
some reasonable assumptions about how we end up with variability
in our data sets and gives us some mathematical tools to use in two
important goals of statistical analysis. In data reduction, we can
describe the whole frequency distribution with just two numbers - the
mean and the standard deviation (formal definitions of these are
just ahead). Also, the normal distribution provides a basis for drawing
inferences about the accuracy of our statistical estimates.

So, it is a good idea to know whether or not the frequency distribution
of your data is shaped like the normal distribution.
I suggested earlier that the data we deal with often fall in an approximately
normal distribution, but as discussed in section 1.4, there are some
common types of data (like percentages and rating values) that are not
normally distributed.

We're going to do two things here. First, we'll explore a couple of
ways to determine whether your data are normally distributed, and
second we'll look at a couple of transformations that you can use
to make data more normal (this may sound fishy, but transformations
are legal!).

Consider again the Cherokee data that we used to start this chapter.
We have two sets of data, thus two distributions. So, when we
plot the frequency distribution as a histogram and then compare that
observed distribution with the best-fitting normal curve we can see that
both the 2001 and the 1971 data sets are fairly similar to the normal
curve. The 2001 set (Figure 1.7) has a pretty normal looking shape, but
there are a couple of measurements at nearly 200 ms that hurt the fit.
When we remove these two, the fit between the theoretical normal curve
and the frequency distribution of our data is quite good. The 1971 set
(Figure 1.8) also looks roughly like a normally distributed data set,
though notice that there were no observations between 80 and 100 ms
in this (quite small) data set. Though if these data came from a normal
curve we would have expected several observations in this range.

Figure 1.7 The probability density distribution of the Cherokee 2001 voice
onset time data. The left panel shows the best-fitting normal curve for all of
the data points. The right panel shows the best-fitting normal curve when
the two largest VOT values are removed from the data set.

Figure 1.8 The probability density distribution of the Cherokee 1971 voice
onset time data. The best-fitting normal curve is also shown.
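
A comparison of this kind can be sketched in R as follows. The VOT values here are simulated and are not the Cherokee measurements; the sketch also assumes that "best-fitting" means the normal curve with the sample mean and sample standard deviation, which is the usual choice but is not spelled out above.

```r
# Hypothetical VOT values in ms (simulated; not the Cherokee measurements).
set.seed(42)
vot <- rnorm(26, mean = 85, sd = 36)

# The fitted normal curve uses the sample mean and standard deviation.
m <- mean(vot)
s <- sd(vot)

# Histogram on a density scale, computed without drawing it.
h <- hist(vot, plot = FALSE)

# Height of the fitted normal curve at the middle of each histogram bar.
fitted <- dnorm(h$mids, mean = m, sd = s)
round(cbind(mid = h$mids, observed = h$density, fitted = fitted), 4)
```

To draw the picture rather than tabulate it, `hist(vot, freq = FALSE)` followed by `curve(dnorm(x, mean = m, sd = s), add = TRUE)` overlays the fitted curve on the histogram.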

[R note illegible in the source.]

These frequency distribution graphs give an indication of whether
our data is distributed on a normal curve, but we are essentially
waving our hands at the graphs and saying "looks pretty normal
to me." I guess you shouldn't underestimate how important it is to look
at the data, but it would be good to be able to measure just how
"normally distributed" these data are.

To do this we measure the degree of fit between the data and the
normal curve with a quantile/quantile plot and a correlation between
the actual quantile scores and the quantile scores that are predicted
by the normal curve. The NIST Handbook of Statistical Methods (2004)
has this to say about Q-Q plots.
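
  The quantile-quantile (q-q) plot is a graphical technique for determining
  if two data sets come from populations with a common distribution.
  A q-q plot is a plot of the quantiles of the first data set against the
  quantiles of the second data set. By a quantile, we mean the fraction
  (or percent) of points below the given value. That is, the 0.3 (or 30%)
  quantile is the point at which 30% of the data fall below and 70%
  fall above that value.
  A 45-degree reference line is also plotted. If the two sets come from a
  population with the same distribution, the points should fall
  approximately along this reference line. The greater the departure from
  this reference line, the greater the evidence for the conclusion that the
  two data sets have come from populations with different distributions.
  The advantages of the q-q plot are:
  1. The sample sizes do not need to be equal.
  2. Many distributional aspects can be simultaneously tested. For
     example, shifts in location, shifts in scale, changes in symmetry, and the
     presence of outliers can all be detected from this plot. For example,
     if the two data sets come from populations whose distributions differ
     only by a shift in location, the points should lie along a straight line
     that is displaced either up or down from the 45-degree reference line.
  The q-q plot is similar to a probability plot. For a probability plot, the
  quantiles for one of the data samples are replaced with the quantiles
  of a theoretical distribution.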

Further regarding the "probability plot" the Handbook has this to say:

  The probability plot (Chambers et al. 1983) is a graphical technique for
  assessing whether or not a data set follows a given distribution such as
  the normal or Weibull.
  The data are plotted against a theoretical distribution in such a way
  that the points should form approximately a straight line. Departures
  from this straight line indicate departures from the specified distribution.

Figure 1.9 The quantiles-quantiles probability plot comparing the Cherokee
1971 data with the normal distribution.
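
The definitions quoted from the Handbook can be tried out directly in R. The numbers below are invented for illustration: the first part finds a 0.3 quantile, and the second part builds two samples that differ only by a shift in location and confirms that their paired quantiles are offset by a constant.

```r
# Hypothetical data, invented for illustration.
x <- 1:10

# The 0.3 quantile: the value below which 30% of the data fall.
q30 <- quantile(x, probs = 0.3, names = FALSE)
mean(x < q30)    # proportion of observations below q30

# Two samples that differ only by a shift in location.
a <- c(12, 15, 19, 23, 24, 30, 31, 38)
b <- a + 10

# qqplot() pairs up the sorted values; plot.it = FALSE returns them without drawing.
qq <- qqplot(a, b, plot.it = FALSE)
qq$y - qq$x      # constant offset: a line parallel to the 45-degree reference line
```

With samples of unequal size `qqplot()` interpolates the quantiles of the larger sample, which is why the Handbook notes that sample sizes need not be equal.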

As you can see in Figure 1.9 the Cherokee 1971 data are just as you
would expect them to be if they came from a normal distribution. In
fact, the data points are almost all on the line showing perfect identity
between the expected "Theoretical quantiles" and the actual "Sample
quantiles." This good fit between expected and actual quantiles is
reflected in a correlation coefficient of 0.987 - almost a perfect 1 (you'll
find more about correlation in the phonetics chapter, Chapter 3).

Contrast this excellent fit with the one between the normal
distribution and the 2001 data (Figure 1.10). Here we see that most of the
data points in the 2001 data set are just where we would expect them
to be in a normal distribution. However the two (possibly three)
largest VOT values are much larger than expected. Consequently, the
correlation between expected and observed quantiles for this data set
(r = 0.87) is lower than what we found for the 1971 data. It may be that
this distribution would look more normal if we collected more data
points, or we might find that we have a bimodal distribution such that
most data comes from a peak around 70 ms, but there are some VOTs
(perhaps in a different speaking style?) that center around a much longer
(190 ms) VOT value. We will eventually be testing the hypothesis that
this speaker's VOT was shorter in 2001 than it was in 1971 and the
outlying data values work against this hypothesis. But, even though these
two very long VOT values are inconvenient, there is no valid reason
to remove them from the data set (they are not errors of measurement,
or speech dysfluencies), so we will keep them.

Figure 1.10 The quantiles-quantiles probability plot comparing the
Cherokee 2001 data with the normal distribution.

R note. Making a quantile-quantile plot in R is easy using the
qqnorm() and qqline() functions. The function qqnorm() takes a
vector of values (the data set) as input and draws a Q-Q plot of
the data. I also captured the values used to plot the x-axis of the
graph into the vector vot71.qq for later use in the correlation
function cor(). qqline() adds the 45-degree reference line to the plot,
and cor() measures how well the points fit on the line (0 for no
fit at all and 1 for a perfect fit).

  > vot71.qq = qqnorm(vot71)$x   # make the quantile/quantile plot
  > vot01.qq = qqnorm(vot01)$x   # and keep the x axis of the plot
  > qqline(vot71)                # put the line on the plot
  > cor(vot71, vot71.qq)         # compute the correlation
  [1] 0.987
  > cor(vot01, vot01.qq)
  [1] 0.87
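
Since the vectors vot71 and vot01 are not reproduced here, the following self-contained sketch shows the same qqnorm() and cor() steps on made-up data. The values are idealized and hypothetical: `ideal` sits exactly at normal quantiles (mean 110, standard deviation 35, both assumed), and two very long values are then appended to mimic the effect of outlying tokens.

```r
# Idealized, hypothetical data: values that sit exactly at normal quantiles
# (mean 110, sd 35). These are not the Cherokee measurements.
ideal <- qnorm(ppoints(20), mean = 110, sd = 35)

qq_ideal <- qqnorm(ideal, plot.it = FALSE)   # theoretical (x) and sample (y) quantiles
r_ideal  <- cor(qq_ideal$x, qq_ideal$y)      # perfect fit

# Add two very long values, in the spirit of the two outlying 2001 tokens.
with_outliers <- c(ideal, 300, 320)
qq_out <- qqnorm(with_outliers, plot.it = FALSE)
r_out  <- cor(qq_out$x, qq_out$y)            # lower than r_ideal

c(ideal = r_ideal, with_outliers = r_out)
```

Dropping `plot.it = FALSE` draws the plots, and `qqline()` can then add the reference line as in the note above.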

Now, let's look at a non-normal distribution. We have some rating
data that are measured as proportions on a scale from 0 to 1, and in
one particular condition several of the participants gave ratings that
were very close to the bottom of the scale - near zero. So, when we
plot these data in a quantile-quantile probability plot (Figure 1.11) you
can see that as the sample quantile values approach zero, the data points
fall on a horizontal line. Even with this non-normal distribution,
though, the correlation between the expected normal distribution and
the observed data points is pretty high (r = 0.92).

One standard method that is used to make a data set fall on a more
normal distribution is to transform the data from the original
measurement scale and put it on a scale that is stretched or compressed in
helpful ways. For example, when the data are proportions it is usually
recommended that they be transformed with the arcsine transform. This
takes the original data x and converts it to the transformed data y using
the following formula:

$$y = \frac{2}{\pi}\arcsin\left(\sqrt{x}\right) \qquad \text{arcsine transformation}$$

This produces the transformation shown in Figure 1.12, in which
values that are near 0 or 1 on the x-axis are spread out on the y-axis.
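
The arcsine transformation is a one-line function in R. In this sketch the function name `arcsine_transform` and the example proportions are my own, chosen to show how values near the ends of the scale are spread out.

```r
# Arcsine transform for proportions, following the formula in the text.
arcsine_transform <- function(x) {
  stopifnot(all(x >= 0 & x <= 1))   # only defined for proportions
  (2 / pi) * asin(sqrt(x))
}

# Hypothetical proportions (invented), several of them near the ends of the scale.
p <- c(0, 0.01, 0.05, 0.5, 0.95, 0.99, 1)
y <- arcsine_transform(p)
round(y, 3)
```

The factor of 2/π rescales the result so that the transformed values still run from 0 to 1, with 0.5 left where it was.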

Figure 1.11 The Normal quantile-quantile plot for a set of data that is not
normal because the score values (which are probabilities) cannot be less
than zero.

Figure 1.12 The arcsine transformation. Values of x that are near 0 or 1 are
stretched out on the arcsine axis. Note that the transformed variable spans
a range from 0 to 1.

The correlation between the expected values from a normal frequency