1 Fundamentals of Quantitative Analysis

In this chapter, I follow the outline of topics used in the first chapter of Kachigan, Multivariate Statistical Analysis, because I think that that is a very effective presentation of these core ideas.

Increasingly, linguists handle quantitative data in their research. Phoneticians, sociolinguists, psycholinguists, and computational linguists deal in numbers and have for decades. Now also, phonologists, syntacticians, and historical linguists are finding linguistic research to involve quantitative methods. For example, Keller (200?) measured sentence acceptability using a psychophysical technique called magnitude estimation. Also, Boersma and Hayes (2001) employed probabilistic reasoning in a constraint ranking algorithm for optimality theory.

Consequently, mastery of quantitative methods is increasingly becoming a vital component of linguistic training. Yet, when I am asked to teach a course on quantitative methods I am not happy with the available textbooks. I hope that this book will deal adequately with the fundamental concepts that underlie common quantitative methods, and more than that will help students make the transition from the basics to real research problems with explicit examples of various common analysis techniques.

Of course, the strategies and methods of quantitative analysis are of primary importance, but in these chapters practical aspects of handling quantitative linguistic data will also be an important focus. We will be concerned with how to use a particular statistical package (R) to discover patterns in quantitative data and to test linguistic hypotheses. This theme is very practical and assumes that it is appropriate and useful to look at quantitative measures of language structure and usage. We will question this assumption. Salsburg (2001) talks about a "statistical revolution" in science in which the distributions of measurements are the objects of study. We will, to some small extent, consider linguistics from this point of view. Has linguistics participated in the statistical revolution? What would a quantitative linguistics be like? Where is this approach taking the discipline?
Table 1.1 shows a set of phonetic measurements. These VOT (voice onset time) measurements show the duration of aspiration in voiceless stops in Cherokee. I made these measurements from recordings of one speaker, the Cherokee linguist Durbin Feeling, that were made in 1971 and 2001. The average VOT for the voiceless stops /k/ and /t/ is shorter in the 2001 dataset. But is the difference "significant"? Or is the difference between VOT in 1971 and 2001 just an instance of random variation - a consequence of randomly selecting possible utterances in the two years that, though not identical, come from the same underlying distribution of possible VOT values for this speaker? I think that one of the main points to keep in mind about drawing conclusions from data is that it is all guessing. Really. But what we are trying to do with statistical summaries and hypothesis testing is to quantify just how reliable our guesses are.

[Table 1.1: Voice onset time measurements of a single Cherokee speaker with a 30-year gap between recordings. The individual measurements are not recoverable from the source; the only legible summary values are an average of 113.5 and a standard deviation of 35.9 in one column.]

1.1 What We Accomplish in Quantitative Analysis

Quantitative analysis takes some time and effort, so it is important to be clear about what you are trying to accomplish with it. Note that "everybody seems to be doing it" is not on the list. The four main goals of quantitative analysis are:

1. data reduction: summarize trends, capture the common aspects of a set of observations such as the average, standard deviation, and correlations among variables;
2. inference: generalize from a representative set of observations to a larger universe of possible observations using hypothesis tests such as the t-test or analysis of variance;
3. discovery of relationships: find descriptive or causal patterns in data which may be described in multiple regression models or in factor analysis;
4. exploration of processes that may have a basis in probability: theoretical modeling, say in information theory, or in practical contexts such as probabilistic sentence parsing.

1.2 How to Describe an Observation

An observation can be obtained in some elaborate way, like visiting a monastery in Egypt to look at an ancient manuscript that hasn't been read in a thousand years, or renting an MRI machine for an hour of brain imaging. Or an observation can be obtained on the cheap - asking someone where the shoes are in the department store and noting whether the talker says the /r/s in "fourth floor."

Some observations can't be quantified in any meaningful sense. For example, if that ancient text has an instance of a particular form and your main question is "how old is the form?" then your result is that the form is at least as old as the manuscript. However, if you were to observe that the form was used 15 times in this manuscript, but only twice in a slightly older manuscript, then these frequency counts begin to take the shape of quantified linguistic observations that can be analyzed with the same quantitative methods used in science and engineering. I take that to be a good thing - linguistics as a member of the scientific community.

Each observation will have several descriptive properties - some will be qualitative and some will be quantitative - and descriptive properties (variables) come in one of four types:

Nominal: Named properties - they have no meaningful order on a scale of any type. Examples: What language is being observed? What dialect? Which word? What is the gender of the person being observed? Which variant was used: going or goin'?

Ordinal: Orderable properties - they aren't observed on a measurable scale, but this kind of property is transitive, so that if a is less than b and b is less than c then a is also less than c. Examples: Zipf's rank frequency of words, rating scales (excellent, good, fair, poor)?
Interval: This is a property that is measured on a scale that does not have a true zero value. In an interval scale, the magnitude of differences of adjacent observations can be determined (unlike the adjacent items on an ordinal scale), but because the zero value on the scale is arbitrary the scale cannot be interpreted in any absolute sense. Examples: temperature (Fahrenheit or Centigrade scales), rating scales, magnitude estimation judgments.

Ratio: This is a property that we measure on a scale that does have an absolute zero value. This is called a ratio scale because ratios of these measurements are meaningful. For instance, a vowel that is 100 ms long is twice as long as a 50 ms vowel, and 200 ms is twice 100 ms. Contrast this with temperature - 80 degrees Fahrenheit is not twice as hot as 40 degrees. Examples: acoustic measures - frequency, duration, frequency counts, reaction time.

1.3 Frequency Distributions: A Fundamental Building Block of Quantitative Analysis

You must get this next bit, so pay attention. Suppose we want to know how grammatical a sentence is. We ask 36 people to score the sentence on a grammaticality scale so that a score of 1 means that it sounds pretty ungrammatical, and 10 sounds perfectly OK. Suppose that the ratings in Table 1.2 result from this exercise.

Interesting, but what are we supposed to learn from this? Well, we're going to use this set of 36 numbers to construct a frequency distribution.

[Table 1.2: Hypothetical data of grammaticality ratings for a group of 36 raters. The individual ratings are not recoverable from the source.]

Look again at Table 1.2. How many people gave the sentence a rating of "1"? How many rated it "2"? When we answer these questions for all of the possible ratings we have the values that make up the frequency distribution of our sentence grammaticality ratings. These data and some useful recodings of them are shown in Table 1.3. You'll notice in Table 1.3 that we counted two instances of rating "1", one instance of rating "2", six instances of rating "3", and so on. Since there were 36 raters, each giving one score to the sentence, we have a total of 36 observations, so we can express the frequency counts in relative terms - as a percentage of the total number of observations.
(Note that percentages, as the etymology of the word would suggest, are commonly expressed on a scale from 0 to 100, but you could express the same information as proportions ranging from 0 to 1.)

The frequency distribution in Table 1.3 shows that most of the grammaticality scores are either "4" or "5," and that though the scores span a wide range (from 1 to 9) the scores are generally clustered in the middle of the range. This is as it should be because I selected the set of scores from a normal (bell-shaped) frequency distribution that centered on the average value of 4.5 - more about this later.

The set of numbers in Table 1.3 is more informative than those in Table 1.2, but nothing beats a picture. Figure 1.1 shows the frequencies from Table 1.3. This figure highlights, for the visually inclined, the same points that we made regarding the numeric data in Table 1.3.

[Table 1.3: Frequency distributions of the grammaticality rating data. Columns: Rating, Frequencies, Relative frequencies, Cumulative frequencies, Relative cumulative frequencies. The cell values are not recoverable from the source.]

[Figure 1.1: The frequency distribution of the grammaticality rating data that was presented in Table 1.2.]

The property that we are seeking to study with the "grammaticality score" measure is probably a good deal more gradient than we permit by restricting our rater to a scale of integer numbers. It may be that not all sentences that he/she would rate as a "5" are exactly equivalent to each other in the internal feeling of grammaticality that they evoke. Who knows? But suppose that it is true that the internal grammaticality response that we measure with our rating scale is actually a continuous, gradient property. We could get at this aspect by providing a more and more continuous type of rating scale - we'll see more of this when we look at magnitude estimation later - but whatever scale we use, it will have some degree of granularity or quantization to it. This is true of all of the measurement scales that we could imagine using in any science.
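
The counting steps behind a table like Table 1.3 can be sketched in R. This is only an illustration: the 36 ratings below are invented for the example (they are not the values of Table 1.2, although the counts for ratings 1 to 3 follow the text), and the object names are arbitrary choices.

```r
# Invented ratings (NOT the values of Table 1.2): 36 scores on a 1-9 scale.
# Counts for ratings 1, 2, and 3 follow the text; the rest are made up.
counts_by_rating <- c(2, 1, 6, 8, 9, 5, 4, 0, 1)
ratings <- rep(1:9, times = counts_by_rating)

# factor() with explicit levels keeps ratings that nobody used (here, 8)
freq <- table(factor(ratings, levels = 1:9))

# Relative frequencies as percentages of the total number of observations
rel_freq <- 100 * prop.table(freq)

# Running totals, as raw counts and as percentages
cum_freq <- cumsum(freq)
rel_cum_freq <- 100 * cum_freq / length(ratings)

freq_table <- data.frame(
  rating = 1:9,
  frequency = as.vector(freq),
  relative = round(as.vector(rel_freq), 1),
  cumulative = as.vector(cum_freq),
  relative_cumulative = round(as.vector(rel_cum_freq), 1)
)
print(freq_table)
```

Dividing by 100 instead of multiplying gives the same information as proportions between 0 and 1, as noted above.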
So, with a very fine-grained scale (say a grammaticality rating on a scale with many decimal points) it doesn't make any sense to count the number of times that a particular measurement value appears in the data set, because it is highly likely that no two ratings will be exactly the same. In this case, then, to describe the frequency distribution of our data we need to group the data into contiguous ranges of scores (bins) of similar values and then count the number of observations in each bin. For example, if we permitted ratings on the 1 to 10 grammaticality scale to have many decimal places, the frequency distribution would look like the histogram in Figure 1.2, where we have a count of 1 for each rating value in the data set.

Figure 1.3 shows how we can group these same data into ranges (here ratings between 0 and 1, 1 and 2, and so on) and then count the number of rating values in each range, just as we counted before the number of ratings of a particular value. So, instead of counting the number of times the rating "6" was given, now we are counting the number of ratings that are greater than or equal to 5 and less than 6.

[Figure 1.2: A histogram of the frequency distribution of grammaticality ratings when rating values come on a continuous scale.]

[Figure 1.3: The same continuous rating data that was shown in Figure 1.2, but now the frequency distribution is plotted in bins.]

OK. This process of grouping measurements on a continuous scale is a useful, practical thing to do, but it helps us now make a serious point about theoretical frequency distributions. This point is the foundational basis of all of the hypothesis testing statistics that we will be looking at later. So, pay attention! Let's suppose that we could draw an infinite data set.
The larger our data set becomes, the more detailed a representation of the frequency distribution we can get. For example, suppose I keep collecting sentence grammaticality data for the same sentence, so that instead of ratings from 36 people I had ratings from 10,000 people. Now even with a histogram that has 100 bars in it (Figure 1.4), we can see that ratings near 4.5 are more common than those at the edges of the rating scale. Now if we keep adding observations up to infinity (just play along with me here) and keep reducing the size of the bars in the histogram of the frequency distribution, we come to a point at which the interval between bars is vanishingly small - i.e. we end up with a continuous curve (see Figure 1.5). "Vanishingly small" should be a tip-off that we have entered the realm of calculus. Not to worry though, we're not going too far.

[Figure 1.4: A frequency histogram with 100 bars, plotting frequency in 10,000 observations.]

[Figure 1.5: The probability density distribution of 10,000 observations and the theoretical probability density of a normal distribution.]

The "normal distribution" is an especially useful theoretical function. It seems intuitively reasonable to assume that in most cases there is some underlying property that we are trying to measure - like grammaticality, or typical duration, or amount of processing time - and that there is some source of random error that keeps us from getting an exact measurement of the underlying property. If this is a good description of the source of variability in our measurements, then we can model this situation by assuming that the underlying property - the uncontaminated "true" value that we seek - is at the center of the frequency distribution that we observe in our measurements, and that the spread of the distribution is caused by error, with bigger errors being less likely to occur than smaller errors.

These assumptions give us a bell-shaped frequency distribution which can be described by the normal curve, an extremely useful bell-shaped curve, which is an exponential function of the mean value (Greek letter μ, "mu") and the variance (σ², from the Greek letter σ, "sigma").
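
The step from counting observations in bins to a smooth theoretical curve can be sketched in R. This is an illustration under stated assumptions rather than the procedure used to draw Figures 1.4 and 1.5: the 10,000 "ratings" are simulated with rnorm(), the center of 4.5 comes from the text, and the standard deviation of 2 and the seed are arbitrary choices.

```r
# Simulated stand-in for 10,000 ratings; mean 4.5 follows the text,
# sd = 2 and the seed are arbitrary assumptions made for this sketch.
set.seed(1)
ratings <- rnorm(10000, mean = 4.5, sd = 2)

# 101 equally spaced edges define 100 contiguous bins spanning the data
bin_edges <- seq(min(ratings), max(ratings), length.out = 101)
h <- hist(ratings, breaks = bin_edges, plot = FALSE)

# h$counts holds the number of observations in each bin;
# h$density rescales the counts so that the bars have a total area of 1
bin_width <- diff(bin_edges)[1]
total_area <- sum(h$density * bin_width)

# Height of the theoretical normal curve at the center of every bin
theory <- dnorm(h$mids, mean = 4.5, sd = 2)
max_gap <- max(abs(h$density - theory))

# Draw the bars and lay the theoretical curve over them
plot(h, freq = FALSE, main = "Simulated ratings", xlab = "Rating")
curve(dnorm(x, mean = 4.5, sd = 2), add = TRUE)
```

Here max_gap is the largest difference between a bar height and the curve; it should shrink as the number of simulated observations grows, which is the intuition behind letting the data set go to infinity.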
$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x-\mu)^2 / 2\sigma^2} \qquad \text{the normal distribution} $$

One useful aspect of this definition of a theoretical distribution of data (besides that it derives from just two numbers, the mean value and a measure of how variable the data are) is that the sum of the area under the curve is 1. So, instead of thinking in terms of a "frequency" distribution, the normal curve gives us a way to calculate the probability of any set of observations by finding the area under any portion of the curve. We'll come back to this.

1.4 Types of Distributions

Data come in a variety of shapes of frequency distributions (Figure 1.6). For example, if every outcome is equally likely then the distribution is uniform. This happens for example with the six sides of a die - each one is (supposed to be) equally likely, so if you count up the number of rolls that come up "1" it should be on average 1 out of every 6 rolls.

In the normal - bell-shaped - distribution, measurements tend to congregate around a typical value, and values become less and less likely as they deviate further from this central value. As we saw in the section above, the normal curve is defined by two parameters - what the central tendency is (μ) and how quickly probability goes down as you move away from the center of the distribution (σ).

If measurements are taken on a scale (like the 1-9 grammaticality rating scale discussed above), as we approach one end of the scale the frequency distribution is bound to be skewed because there is a limit beyond which the data values cannot go. We most often run into skewed frequency distributions when dealing with percentage data and reaction time data (where negative reaction times are not meaningful).

The J-shaped distribution is a kind of skewed distribution with most observations coming from the very end of the measurement scale. For example, if you count speech errors per utterance you might find that most utterances have a speech error count of 0. So in a histogram, the number of utterances with a low error count will be very high and will decrease dramatically as the number of errors per utterance increases.

A bimodal distribution is like a combination of two normal distributions - there are two peaks. If you find that your data fall in a bimodal distribution you might consider whether the data actually represent two separate populations of measurements. For example, voice fundamental frequency (the acoustic property most closely related to the pitch of a person's voice) falls into a bimodal distribution when you pool measurements from men and women, because men tend to have lower pitch than women.

If you ask a number of people how strongly they supported the US invasion of Iraq you would get a very polarized distribution of results. In this U-shaped distribution most people would be either strongly in favor or strongly opposed, with not too many in the middle.

[Figure 1.6: Types of probability distributions.]

1.5 Is Normal Data, Well, Normal?

The normal distribution is a useful way to describe data. It embodies some reasonable assumptions about how we end up with variability in our data sets and gives us some mathematical tools to use in two important goals of statistical analysis. In data reduction, we can describe the whole frequency distribution with just two numbers - the mean and the standard deviation (formal definitions of these are just ahead). Also, the normal distribution provides a basis for drawing inferences about the accuracy of our statistical estimates.

So, it is a good idea to know whether or not the frequency distribution of your data is shaped like the normal distribution.
Remember that the data we deal with often fall in an approximately normal distribution, but as discussed in section 1.4, there are some common types of data (like percentages and rating values) that are not normally distributed.

We're going to do two things here. First, we'll explore a couple of ways to determine whether your data are normally distributed, and second we'll look at a couple of transformations that you can use to make data more normal (this may sound fishy, but transformations are legal!).

Consider again the Cherokee data that we used to start this chapter. We have two sets of data, thus two distributions. So, when we plot the frequency distribution as a histogram and then compare that observed distribution with the normal curve, we can see that both the 2001 and the 1971 data sets are fairly similar to the normal curve. The 2001 set (Figure 1.7) has a pretty normal-looking shape, but there are a couple of measurements at nearly 200 ms that hurt the fit. When we remove these, the fit between the theoretical normal curve and the frequency distribution of our data is quite good. The 1971 set (Figure 1.8) also looks roughly like a normally distributed data set, though notice that there were no observations in a stretch of values just below 100 ms in this (quite small) data set. If these data came from a normal curve we would have expected several observations in this range.

[Figure 1.7: The probability density distribution of the Cherokee 2001 voice onset time data. The left panel shows the best-fitting normal curve for all of the data points. The right panel shows the best-fitting normal curve when the two largest VOT values are removed from the data set.]

[Figure 1.8: The probability density distribution of the Cherokee 1971 voice onset time data. The best-fitting normal curve is also shown.]

[Boxed note following Figure 1.8: text not recoverable from the source.]
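
The comparison just described, a histogram against its best-fitting normal curve with and without the largest values, can be sketched in R. This is a minimal illustration, not the code behind Figures 1.7 and 1.8: the VOT values are invented (the Cherokee measurements are not reproduced here), the object names are arbitrary, and "best-fitting" is taken to mean a normal curve with the sample mean and standard deviation.

```r
# Invented VOT values in ms (NOT the Cherokee measurements of Table 1.1),
# including two very long values near 200 ms
vot <- c(58, 62, 67, 70, 71, 74, 76, 80, 83, 85, 88, 91, 95, 102, 110, 190, 197)

# The same data with the two largest values set aside
vot_trimmed <- sort(vot)[seq_len(length(vot) - 2)]

# A normal curve is fixed by two numbers: the mean and the standard deviation
fit_all <- c(mean = mean(vot), sd = sd(vot))
fit_trimmed <- c(mean = mean(vot_trimmed), sd = sd(vot_trimmed))

# Histogram on the density scale, so bars and curves share the same units
hist(vot, freq = FALSE, breaks = 10, xlab = "VOT (ms)",
     main = "Invented VOT data")

# Dashed line: curve fitted to all points; solid line: fitted without the two largest
curve(dnorm(x, fit_all[["mean"]], fit_all[["sd"]]), add = TRUE, lty = 2)
curve(dnorm(x, fit_trimmed[["mean"]], fit_trimmed[["sd"]]), add = TRUE, lty = 1)
```

With these invented numbers, setting aside the two long values lowers both the mean and the standard deviation, so the solid curve is narrower and sits closer to the bulk of the bars.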
These frequency distribution graphs give an indication of whether our data is distributed on a normal curve, but we are essentially waving our hands at the graphs and saying "looks pretty normal to me." I guess you shouldn't underestimate how important it is to look at the data, but it would be good to be able to measure just how "normally distributed" these data are.

To do this we measure the degree of fit between the data and the normal curve with a quantile/quantile plot and a correlation between the actual quantile scores and the quantile scores that are predicted by the normal curve. The NIST Handbook of Statistical Methods has this to say about Q-Q plots:

"The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution.

"A q-q plot is a plot of the quantiles of the first data set against the quantiles of the second data set. By a quantile, we mean the fraction (or percent) of points below the given value. That is, the 0.3 (or 30%) quantile is the point at which 30% of the data fall below and 70% fall above that value.

"A 45-degree reference line is also plotted. If the two sets come from a population with the same distribution, the points should fall approximately along this reference line. The greater the departure from this reference line, the greater the evidence for the conclusion that the two data sets have come from populations with different distributions.

"The advantages of the q-q plot are:
1. The sample sizes do not need to be equal.
2. Many distributional aspects can be simultaneously tested. For example, shifts in location, shifts in scale, changes in symmetry, and the presence of outliers can all be detected from this plot. For example, if the two data sets come from populations whose distributions differ only by a shift in location, the points should lie along a straight line that is displaced either up or down from the 45-degree reference line.

"The q-q plot is similar to a probability plot. For a probability plot, the quantiles for one of the data samples are replaced with the quantiles of a theoretical distribution."

Further regarding the "probability plot," the Handbook has this to say:

"The probability plot (Chambers et al. 1983) is a graphical technique for assessing whether or not a data set follows a given distribution such as the normal or Weibull.

"The data are plotted against a theoretical distribution in such a way that the points should form approximately a straight line. Departures from this straight line indicate departures from the specified distribution."

As you can see in Figure 1.9, the Cherokee 1971 data are just as you would expect them to be if they came from a normal distribution. In fact, the data points are almost all on the line showing perfect identity between the expected "Theoretical quantiles" and the actual "Sample quantiles." This good fit between expected and actual quantiles is reflected in a correlation coefficient of 0.987 - almost a perfect 1 (you'll find more about correlation in the phonetics chapter, Chapter 3).

Contrast this excellent fit with the one between the normal distribution and the 2001 data (Figure 1.10). Here we see that most of the data points in the 2001 data set are just where we would expect them to be in a normal distribution. However, the two (possibly three) largest VOT values are much larger than expected. Consequently, the correlation between expected and observed quantiles for this data set (0.87) is lower than what we found for the 1971 data. It may be that this distribution would look more normal if we collected more data points, or we might find that we have a bimodal distribution such that most data comes from a peak around 70 ms, but there are some VOTs (perhaps in a different speaking style?) that center around a much longer (190 ms) VOT value. We will eventually be testing the hypothesis that this speaker's VOT was shorter in 2001 than it was in 1971, and the outlying data values work against this hypothesis. But even though these two very long VOT values are inconvenient, there is no valid reason to remove them from the data set (they are not errors of measurement, or speech dysfluencies), so we will keep them.

[Figure 1.9: The quantiles-quantiles probability plot comparing the Cherokee 1971 data with the normal distribution. Axes: Theoretical quantiles, Sample quantiles.]

[Figure 1.10: The quantiles-quantiles probability plot comparing the Cherokee 2001 data with the normal distribution. Axes: Theoretical quantiles, Sample quantiles.]

R note: Making a quantile-quantile plot in R is easy using the qqnorm() and qqline() functions. The function qqnorm() takes a vector of values (the data set) as input and draws a Q-Q plot of the data. I also captured the values used to plot the x-axis of the graph into the vector vot71.qq for later use in the correlation function cor(). qqline() adds the reference line to the plot (R draws it through the first and third quartiles of the data), and cor() measures how well the points fit on the line (0 for no fit at all and 1 for a perfect fit). In the commands below, vot71 and vot01 are assumed to be numeric vectors holding the 1971 and 2001 VOT measurements.

> vot71.qq = qqnorm(vot71)$x   # make the quantile/quantile plot and keep the x-axis of the plot
> qqline(vot71)                # put the line on the plot
> cor(vot71, vot71.qq)         # compute the correlation (reported as 0.987 in the text)
> vot01.qq = qqnorm(vot01)$x   # the same steps for the 2001 data
> qqline(vot01)
> cor(vot01, vot01.qq)         # reported as 0.87 in the text

Now, let's look at a non-normal distribution. We have some rating data that are measured as proportions on a scale from 0 to 1, and in one particular condition several of the participants gave ratings that were very close to the bottom of the scale - near zero. So, when we plot these data in a quantile-quantile probability plot (Figure 1.11), you can see that, as the sample quantile values approach zero, the data points fall on a horizontal line. Even with this non-normal distribution, though, the correlation between the expected normal distribution and the observed data points is pretty high (r = 0.92).

One standard method that is used to make a data set fall on a more normal distribution is to transform the data from the original measurement scale and put it on a scale that is stretched or compressed in helpful ways. For example, when the data are proportions it is usually recommended that they be transformed with the arcsine transform. This takes the original data x and converts it to the transformed data y using the following formula:

$$ y = \frac{2}{\pi} \arcsin\left(\sqrt{x}\right) \qquad \text{arcsine transformation} $$

This produces the transformation shown in Figure 1.12, in which values that are near 0 or 1 on the x-axis are spread out on the y-axis.

[Figure 1.11: The normal quantile-quantile plot for a set of data that is not normal because the score values (which are probabilities) cannot be less than zero.]

[Figure 1.12: The arcsine transformation. Values of x that are near 0 or 1 are stretched out on the arcsine axis. Note that the transformed variable spans a range from 0 to 1.]

The correlation between the expected values from a normal frequency
