ABCs of Lexicostatistics

63 The ABC's of Lexicostatistics (Glottochronology)
SARAH C. GUDSCHINSKY

INTRODUCTION

1. Lexicostatistics is a technique which attempts to provide dates for the earlier stages of languages much as carbon 14 dating provides dates for archaeological finds. This contrasts with previous linguistic methods which, although able to reconstruct to some extent the history of language, have been unable to provide dates apart from written historical records.

2. By simple inspection of comparable word lists, for example, the fact of the relationship of closely related languages can be discovered. But no one can say on the basis of simple inspection precisely how closely related two languages are (Swadesh, 1950, pp. 157, 164).

3. By the methods of comparative linguistics, it is possible to chart the phonemic changes by which contemporary languages have developed from a common parent language, and to reconstruct some of the vocabulary of the parent language (see Paragraph 15). This method permits the investigator to decide, to some extent, the historical order of dialect differentiation. That is, he can say that languages A and B diverged from each other before such and such a phonological change, which is peculiar to language B, took place. Or he can say that the separation of languages A and B from each other must have taken place after their separation from language C, because they share phonological features which do not occur in C. The method does not, however, permit the investigator to say at what date the separation of languages A and B took place (Hockett, 1953).

4. A method for determining the chronological relationships of cultural elements to one another by use of various kinds of linguistic evidence has been suggested by Sapir (1916, pp. 434-436). The relative antiquity, for example, of the culture items bow, arrow, and spear is attested by the fact that these terms cannot be analyzed into constituent morphemes as can the morphologically transparent terms railroad or capitalist, which represent recent additions to the culture. The assumption is that sound changes and shifts of meaning over a long period of time have obscured the original morphemic content of the older terms. Similarly, the archaic -en plural of ox attests the ancient use of these animals, since it is assumed that words using archaic morphological processes, and the cultural elements to which the words refer, are of ancient origin. Although these and other linguistic clues discussed by Sapir have considerable value in determining something of the relative age of cultural items, and the chronological order in which they became a part of a given culture pattern, this method does not provide any exact dates. At best this method can provide the basis for such statements as: "This element was probably a part of the culture pattern before such and such sound changes took place in the language," or "This item probably entered the culture pattern of tribe A during a period of close contact with the culture of tribe B from whose language the terminology was borrowed."

5. Sapir also suggested (1921, pp. 217-220) that marked similarities in the basic morphological structure of otherwise dissimilar languages indicated remote common origin of the languages, since the effects of borrowing or other influence of one language on another seldom penetrate to the structural core or nucleus of the language affected. The use of this principle increases the number of languages that can be postulated as belonging to a linguistic grouping, and gives insight into relationships at deep time depths, but it cannot tell us when the languages whose relationship is postulated began to diverge from one another.

6. Such historical estimation is not sufficient for the needs of anthropologists, historical linguists, and archaeologists, who want to know at just what date linguistic changes took place, and who also want to know just how the language developments correlate with cultural changes, migrations, etc., of which there is evidence from other lines of investigation (Swadesh, 1950, p. 157). Lexicostatistics is an attempt to provide the more precise dating that is needed.

BASIC ASSUMPTIONS OF LEXICOSTATISTICS

7. The first basic assumption of lexicostatistics is that some parts of the vocabulary of any language are assumed, on empirical evidence, to be much less subject to change than other parts (Swadesh [1951a], p. 12). This basic core vocabulary includes such items as terms for pronouns, numerals, body parts, geographical features, etc. This concept is similar to Sapir's idea of a basic nucleus of morphological structure discussed in Paragraph 5. Terms for new items in the material culture, on the other hand, are frequently borrowed along with the cultural items. Such terms are also easily lost with a change in the material culture, or the borrowing of a new item, or for other reasons. The contrast between the basic core vocabulary and general vocabulary may be seen in the following illustration of French loan words in English: "As against perhaps 50 percent of borrowed correspondences between English and French in the general vocabulary, we find just 6 percent in the basic vocabulary. Residual correspondences are found to be 27 percent. Thus the archaic residuum after 5000 years turns out to be five times greater than 2000 years of accumulated borrowings" (Swadesh, [1951a], p. 13).

8. The second basic assumption of lexicostatistics is that the rate of retention of vocabulary items in the basic core of relatively stable vocabulary is constant through time. That is, given a certain number of basic words in a certain language, a certain percentage of these words will remain in the language after a thousand years of vocabulary loss; that same percentage of the residue of words will remain after a second thousand years; and after a third period of a thousand years, the same percentage of the words remaining at the end of the second period will remain; and so on. Complete empirical evidence that the rate of loss is constant through time is still lacking (Lees, 1953, pp. 121-122), since the assumption has not yet been checked for a time span greater than 2,200 years and this span does not provide adequate evidence for a constant rate of loss over a long period of time.
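The constant-rate assumption amounts to geometric decay: the same fraction of the surviving words is retained in each successive thousand years. The following is a minimal sketch only, assuming the retention value .805 that this paper uses for its illustrative material; the function name `retained_fraction` is hypothetical and does not come from the source.

```python
def retained_fraction(rate_per_millennium: float, millennia: float) -> float:
    """Fraction of the original basic vocabulary expected to survive.

    Under the second assumption, the same rate applies to whatever
    remains at the start of each thousand-year period, so the surviving
    fraction is the rate raised to the number of millennia elapsed.
    """
    return rate_per_millennium ** millennia


# Assumed input: the .805 retention value used in this paper's examples.
RATE = 0.805

for t in (1, 2, 3):
    # Print the expected surviving fraction after t thousand years.
    print(t, round(retained_fraction(RATE, t), 3))
```

The sketch covers one language only; when two languages diverge independently, both lose vocabulary, which is why the time depth formula of Paragraph 32 divides by twice the logarithm of the constant.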

9. The third basic assumption of lexicostatistics is that the rate of loss of basic vocabulary is approximately the same in all languages. This assumption has been tested in thirteen languages in which there are historical records. The results range from a retention of 86.4 % to 74.4 % per thousand years, an average of 80.5 % (Lees, 1953, pp. 118-119). This is not, however, conclusive evidence that all languages change at this rate, especially since all but two of the thirteen languages tested are Indo-European. (See also Kroeber, 1955, p. 91.)

10. The fourth assumption of lexicostatistics is a corollary of the third, namely, that if the percentage of true cognates within the core vocabulary is known for any pair of languages, the length of time that has elapsed since the two languages began to diverge from a single parent language can be computed (Lees, 1953, pp. 116-117), provided that there are no interfering factors through migrations, conquests, or other social contacts which slowed or speeded the divergence (Swadesh, 1950, pp. 158-160; Gudschinsky, 1955, p. 149).

TECHNIQUES OF LEXICOSTATISTICS

11. In applying the lexicostatistics techniques developed from the basic assumptions, the steps are: collecting of comparable word lists from the relatively stable core vocabulary (Paragraphs 12-14); determining the probable cognates (Paragraphs 15-23, 25-28); computing the time depth (Paragraphs 31-36); computing the range of error (Paragraphs 37-45); and, optionally, computing the dips (Paragraphs 50-52).

WORD LISTS

12. The first essential in making a lexicostatistical comparison of two or more languages is the collection of comparable word lists in the various languages. (Lexicostatistics provides a quick way of estimating linguistic relationships on the basis of a relatively small body of data. For this reason it is a useful tool in linguistic surveys. For a detailed description of gathering data in a number of dialects in minimum time, see Swadesh, 1954a.) A convenient list for this purpose is Swadesh's 200 word list. The use of this list has several advantages: it is made up of noncultural items that have been specifically chosen as a part of the core vocabulary. These items have been tentatively tested (see Paragraph 9) for percentage of retention in languages with written historical records. Later tests may well indicate that a different assortment of words would be more useful, but any revised list must be tested to ascertain whether or not the same rate of vocabulary loss applies. Meantime, this list has been used in a number of comparisons, and will yield results that can easily be compared with studies already made. It does not seem wise to start with a list shorter than 200 words, since the shorter the list of words used, the greater the probable error (see Paragraph 41). Furthermore, it is sometimes impossible to get the entire list in all of the languages investigated, so that the comparisons must be made with fewer items than in the original list. For these reasons it would be good if a longer list of satisfactory items could be worked out. Swadesh is at present experimenting with the use of a list of only 100 items (see Swadesh, 1955, for a detailed analysis of the 200 word list and the suggested revision to 100 words). The reasons given for eliminating some of the items (e.g., the repetition of some roots in such pairs as woman-wife, the nonuniversality of such words as ice and snow, etc.) seem valid to this author. The gain in quality of test items, however, is balanced by some loss in terms of statistical accuracy. Kroeber (1955, p. 97) has suggested that a list of 1000 items would be preferable, and doubts that deep time depths can be explored by use of a list as small as 200 words. (Anyone choosing to use Swadesh's new list of 100 items must use .86 as the "constant" in the time depth formula of Paragraph 32.)

13. In gathering the data, each English word should be translated by the most common conversational equivalent (Swadesh 1951a, p. 13). If there is an equal choice of two or more expressions, one should be chosen purely at random (by flipping a coin if necessary) to avoid any bias in the direction of choosing known cognates, since nonrandom choice could considerably skew the final results. It is essential, for statistical reasons, that the error be a random error, so that the accumulating errors tend to cancel each other out instead of compounding each other.

The same meaning of each English word should be translated in each case. For example "know" is understood as referring to facts rather than to persons. Translation from English of isolated forms in general insures that the resultant forms in each language will be comparable root stems rather than affixes or other items which are not comparable (Lees, 1953, p. 115). This is not, however, always the case, and the procedure of Paragraph 18 is used to eliminate the irrelevant material.

14. Greater time depths may be explored by the methods of lexicostatistics if the list is filled in with the reconstructed forms of the postulated common parent languages of a linguistic family or stock (Swadesh, 1953a, pp. 41-42). A comparison of Proto-Romance with Proto-Germanic, for example, might be expected to give a more accurate picture of the historical facts than a comparison of modern French with modern German. Such comparisons are dependent on preliminary comparative studies (see Paragraph 15), and are limited by the fact that reconstructed forms for the entire list are seldom available.

COGNATE COUNT

15. When the word lists have been compiled, the next step is to compare the words of the two lists in order to ascertain how many of the pairs of words are probable cognates (Swadesh, 1950, pp. 157-158). True cognates are developed from the same word in a common parent language, and only true cognates are conclusive evidence of genetic relationship. The most accurate estimate of whether or not the pairs of words in a given comparison are cognate is arrived at by the careful use of the comparative method in reconstructing the proto-language. The major assumption of the comparative method is that while the phonemes of the parent language develop differently in the different daughter languages, the development is consistent in each kind of linguistic environment within each daughter language. The investigator working on reconstruction matches the words of two (or more) languages by similarity of form and meaning. The phonemes in the same relative position in both members of a matched pair are compared, as initial consonant with initial consonant. If the two languages are related, the same pairs of phonemes will occur in many pairs of words (e.g., many words in language A beginning with til may be matched in language B by words of similar meaning which begin with l). Each such recurring pair of phonemes is assumed to represent a different phoneme or allophone of the common parent language. The investigator on the basis of his data postulates what phoneme is represented by each pair. He also postulates the phonemic system of the parent language and on this basis reconstructs the probable form of the morphemes from which the observed forms in the daughter languages have developed. A full discussion of this method is beyond the scope of this paper, but the interested student should read Bloomfield (1933, pp. 297-320) and Pike (1950). (For a listing of additional sources, see Pike, 1950, bibliography.)

16. When detailed comparative studies are not available, probable cognates can be estimated by an "inspection method," which, although cruder and subject to a greater margin of error, can be used for time depth estimates. The careful use of the following procedures will in general discover the pairs of words which may be considered as probable cognates within a margin of error not great enough to invalidate the method or render the results useless, even though in any one particular instance the conclusion might not reflect the actual historical facts. (Fairbanks [1955] has experimented with an "inspection method" [the term is his], testing the number of dissimilar cognates and similar noncognates in eight comparisons within Indo-European. His criteria were somewhat less strict than those suggested in this paper. For example he ignored vowels, he required agreement in only two consonants of each word, and he made no provision for regularly recurring correspondences [criterion d of this paper]. In his experiment two of the eight cases showed considerable skewing because of cognates which were not similar [pp. 118-119]. This does not completely invalidate the method, but it shows the need of caution, especially in deeper time depths. Both Fairbanks' experiment and Taylor's work on Arawak [see Taylor and Rouse, 1955, p. 106, in which Taylor uses criteria more strict than those presented here] imply that the skewing from the use of the inspection method rather than careful reconstruction tends to be in the direction of overestimation of time depth, since after long divergence, cognates frequently lose much of their similarity.) The procedures are based in part on the improbability of the chance occurrence of the same sequence of phonemes with the same meaning in two different languages, and in part on the assumptions of comparative linguistics discussed in Paragraph 15.

17. Procedure 1. Register as probable noncognates the words which are similar because one language has borrowed from the other, or because both have borrowed from a common source. Borrowings from a common source are recognizable if the forms are very similar to a word of the same or similar meaning in a language which is known to be unrelated, but with which there has been cultural contact. The Mexican Indian languages of Mazatec and Ixcatec, for example, are clearly not closely related to the Indo-European Spanish, but for some centuries, Spanish has been the official language of Mexico. Therefore such words as Mazatec tthIa' and Ixcatec 'a'tu"ltlut',. 'heart' are registered as noncognate because of the strong probability that they are common borrowings from Spanish aiMa rather than descendants of a native word in their common parent language.

Borrowings of related languages from each other or from a closely related common source may be more difficult to detect. In comparing the Huautla and San Miguel dialects of Mazatec, for example, the only evidence that the San Miguel word II ..... 'father' is a borrowing and not a true cognate with the Huautla word lI'IIi'1 'father' is the fact that the vowel cluster .; occurs in the San Miguel dialect only in a limited number of religious terms, whereas it is normal in the Huautla dialect (Gudschinsky, 1955, p. 148). Such clues may indicate some, though probably not all, of the borrowings from related languages or dialects.

In languages whose probability of close relationship is small, all identical or very similar words are suspect as loan words unless clearly proved otherwise (see Paragraph 20, criterion a). The apparent closeness of the dialects as ascertained by lexicostatistical methods will be greater in proportion to the number of undiscovered loans that are registered as cognates. The probability, however, is that in most cases the number of such loans will not be great enough to seriously skew the results.

18. Procedure 2. Isolate the equivalent morphemes in each pair of words. If equivalent morphemes are not isolated, the investigator may be misled by the complexity of the words he is comparing. The similarity of affixes marking person, number, class, aspect, etc., may obscure the fact that the basic stem morphemes are not true cognates. For example, the person marker -III in the forms "-III (Huautla dialect of Mazatec) and ttUJ'''I-WA (San Mateo dialect of Mazatec) 'he wants' is irrelevant to the comparison of the stems meaning 'want.' If both members of a pair of words are compounds, one pair of the constituent morphemes may be cognate even though the words as a whole are not cognate. For example, Ixcatec ''-'''1''' and Mazatec ,,'oly'" 'guts' are not cognate in spite of the very similar yill" and ,' .. since these are the morphemes meaning 'dung'; the morphemes which distinguish between 'dung' and 'guts' are ",1- 'skin' and ,,'rJ. 'rope' and are clearly not cognate. (For a further illustration of the need for isolating equivalent morphemes see Taylor and Rouse, 1955, p. 107.)

If the investigator finds it impossible to isolate all of the morphemes in the languages he is comparing, he should proceed with the best guess he can make from the data available to him, recognizing that the comparing of nonrelevant morphemes may cause him to register a number of false cognates which will tend to skew final results in the direction of lesser time depth and closer relationship than is the true historical fact. (See Paragraph 30 for an illustration of such skewing in the comparison of Ixcatec and Mazatec.) The increased margin of error from failure to identify morphemes is not so great as to invalidate the method if the results are used with caution, and not treated as absolute.

19. Procedure 3. Test the pairs of equivalent morphemes isolated by procedure 2 to determine whether or not they are sufficiently similar to be considered probable cognates. This testing is done by comparing the phonemes or phoneme clusters occurring in comparable position within the equivalent morphemes. For example, in comparing Ixcatec CUi with Mazatec toI'uy.' e is compared with e and u is compared with '; in comparing Ixcatec .. with Mazatec ~ 'and,' Ie is compared with Ie and u is compared with tID since tID occurs in the position comparable to the u; in comparing Ixcatec '/uW with Mazatec .p.u 'come,' 1 is compared with 11/ and .. is compared with N. (Tone is ignored in this example and others in this study because the discussion of the complicated tone problems is beyond the scope of this paper.)

Any pair of equivalent morphemes may be registered as probable cognates if a minimum of three pairs of comparable phonemes or phoneme clusters are found to "agree" according to one or more of the criteria given below. In cases in which one or both members of the pair of morphemes being tested is constituted of fewer than three phonemes, the pair can be considered as probably cognate only if all the phonemes or phoneme clusters of the shorter morpheme of the pair agree with the phonemes or phoneme clusters in comparable position in the other morpheme. (For different sets of criteria for determining probable cognates, see Fairbanks, 1955, and Swadesh, 1954c, p. 308.)
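The registration rule just stated can be sketched as a small decision function. This is an illustration under stated assumptions, not part of the original procedure: the name `is_probable_cognate` and its inputs are hypothetical, and the judgment of whether each pair of phonemes "agrees" (criteria a through d) is assumed to have been made already by the investigator and passed in as true/false values.

```python
from typing import Sequence


def is_probable_cognate(agreements: Sequence[bool], shorter_length: int) -> bool:
    """Apply the minimum-agreement rule of Procedure 3.

    agreements     -- one True/False per compared pair of phonemes or
                      phoneme clusters (True means the pair "agrees").
    shorter_length -- number of phonemes or clusters in the shorter
                      morpheme of the pair.
    """
    if shorter_length < 1:
        # Nothing to compare, so nothing can be registered as cognate.
        return False
    agreeing = sum(1 for agrees in agreements if agrees)
    if shorter_length < 3:
        # Short morphemes: every unit of the shorter morpheme must agree.
        return agreeing == shorter_length
    # Otherwise at least three agreeing pairs are required.
    return agreeing >= 3


# Hypothetical judgments for a four-unit morpheme pair: three agree.
print(is_probable_cognate([True, True, True, False], shorter_length=4))
```

The function only counts agreements; deciding whether a given pair agrees remains a matter of linguistic judgment under the criteria of Paragraphs 20-23.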

20. Criterion a. Identical members of a pair of phonemes occurring in comparable position in a pair of equivalent morphemes may be considered as agreeing, except that complete identity between languages whose relationship is suspected of being remote may suggest recent borrowing rather than genetic relationship. (Criterion d, Paragraph 23, may be used to determine whether or not the identity of any given pair of phonemes is in accord with a pattern in the language, or whether it is peculiar to this instance. In the latter case, the morpheme pair should be registered as probable noncognates.)

21. Criterion b. Phonetically similar members of a pair of phonemes in comparable position in a pair of equivalent morphemes may be considered as agreeing. "Phonetically similar" here means that the two phonemes of the pair must be sufficiently alike phonetically to render them suspect as possible allophones of a single phoneme if they occurred in the same language. In general, the members of a pair of phonemes are phonetically similar if they differ in such ways as: the presence or absence of vocal vibration as t and d; the speed of articulation as , (pronounced with a quick flap of the tongue) and '; a slight variation of tongue position as t and t (pronounced with the tongue tip curled back), or i (pronounced as in 'meat') and I (pronounced with the tongue slightly lower and more lax as in 'mitt'); the presence of secondary activity modifying one of the sounds as l and •• (pronounced with the lips rounded); the extent of interruption of the air stream as 9 (pronounced with partial interruption of the air stream) and , (pronounced with complete interruption of the air stream). For a fuller discussion of phonetic similarity, see Pike, 1947, pp. 69-71. (This criterion should be used with caution if it yields many agreements which are not substantiated by criterion d.)

22. Criterion c. A conditioned member of a pair of phonemes occurring in comparable position in a pair of equivalent morphemes may be considered as agreeing with a phonetically dissimilar member. That is, phonetically dissimilar phonemes agree if their environment is such that it could be considered a conditioning factor responsible for the present phonetic shape of one member of the pair of phonemes even though, arbitrarily, it has not had the same effect on the other member of the pair. For example, in comparing the forms li'ltjl (Huautla dialect of Mazatec) and lfltlq"1 (San Mateo dialect of Mazatec) 'firewood,' the i and • are considered as agreeing since it is possible that the I might have been responsible for the change from . to i (which is pronounced with the tongue closer to the palate than .) in the Huautla dialect, even though the change did not occur in the San Mateo dialect. A discussion of conditioning factors may be found in Pike (1947, pp. 84-96).

23. Criterion d. Regularly corresponding members of a pair of phonemes occurring in comparable position in equivalent morphemes may be considered as agreeing even though they are not phonetically similar. By regularly corresponding is meant that the same pair of phonemes or phoneme clusters occur in comparable position in a number of different pairs of equivalent morphemes. For example, the Ixcatec phoneme I agrees with the Mazatec phoneme 1 because this pair regularly corresponds in such pairs of morphemes as: Ixcatec "fIII'I and Mazatec I'il I 'fire,' Ixcatec IrII and Mazatec kID' 'rock.'
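Finding regularly corresponding phonemes is, at bottom, a tallying task: count in how many different morpheme pairs each pair of phonemes turns up in comparable position. The sketch below is illustrative only. The toy forms are invented (they are not Ixcatec or Mazatec data), the morphemes are assumed to be already segmented and aligned position by position, and the threshold `min_pairs=2` is an assumption, since the text says only "a number of different pairs."

```python
from collections import Counter


def correspondence_counts(aligned_morphemes):
    """Count the morpheme pairs in which each phoneme pair occurs.

    aligned_morphemes -- iterable of (form_a, form_b), each form a
    sequence of phonemes already aligned position by position.
    """
    counts = Counter()
    for form_a, form_b in aligned_morphemes:
        # A set ensures each morpheme pair is counted once per phoneme pair.
        for phoneme_pair in set(zip(form_a, form_b)):
            counts[phoneme_pair] += 1
    return counts


def regular_correspondences(aligned_morphemes, min_pairs=2):
    """Phoneme pairs recurring in at least min_pairs morpheme pairs."""
    counts = correspondence_counts(aligned_morphemes)
    return {pair for pair, n in counts.items() if n >= min_pairs}


# Invented toy data: language A "p" recurs opposite language B "f".
toy_pairs = [
    (("p", "a"), ("f", "a")),
    (("p", "i", "t"), ("f", "i", "t")),
    (("k", "u"), ("k", "o")),
]
print(regular_correspondences(toy_pairs))
```

In practice the tally would be run over the whole word list and any additional data available, and the recurring pairs inspected by hand before being accepted as agreeing under criterion d.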

24. In reading the work of specialists in this field, the reader should bear in mind that they differ in the degree of conservatism in their work. The reader can assess the conservatism and solidity of the work by the application of the criteria suggested in Paragraphs 20-23 to the pairs of cognates which the author offers as evidence. The inclusion of a quantity of comparative data which is solid in terms of these criteria indicates that the data are conservative and reliable. If, however, only reconstructed forms (marked with an asterisk) are given, without careful documentation, the reader should realize that the proposed reconstructions and the conclusions based on them may in fact be of a highly tentative nature, and should not be accepted as conclusively proved. (See also Kroeber, 1955, p. 97.)

29. In Summary. A total of 192 pairs of words in Ixcatec and Mazatec were compared in Paragraph 28. (Eight of the original list of words were lacking in one or the other of the languages.) Of these 192 pairs, the procedures of Paragraphs 17-23 give a total of 74 probable cognates and 118 probable noncognates. The time depth based on these figures is computed in Paragraphs 34-36; the range of error of the time depth is computed in Paragraphs 44-45; the Ixcatec-Mazatec lexical relationship in dips is computed in Paragraphs 50-51.

30. A careful comparative study would probably result in an estimated 78 cognates and 114 noncognates, since in the author's opinion it is likely that two of the 74 pairs registered as probable cognates are not true cognates, and it is also likely that six of the pairs registered as probable noncognates can be proved to be true cognates on the basis of reconstruction. On the other hand, an investigator completely unacquainted with both languages and unable to isolate the equivalent morphemes and without additional data beyond the 200 word list would be expected to arrive at a total of 72 probable cognates and 120 probable noncognates, since failure to isolate the equivalent morphemes would have resulted in registering four noncognates as probable cognates, but lack of additional data would have resulted in registering as probable noncognates six pairs which may well be true cognates. See Paragraphs 46-48 for a discussion of the degree to which the time depth estimate is skewed by such inaccurate registering of probable cognates.

COMPUTATION OF TIME DEPTH

31. For use in the time depth formula, the number of probable cognates ascertained by the techniques of Paragraphs 17-23 must be converted to percent of cognates. This is done by dividing the number of probable cognates by the total number of pairs of words compared (Swadesh, 1950, p. 158).

32. Time depth is computed by the formula t = log C / (2 log r) (Lees, 1953, p. 117). In this formula t stands for indicated time depth in millennia; C stands for the percent of cognates (Paragraph 31); r stands for the "constant" (also called "index" in Swadesh, 1955, p. 122), that is, the percent of cognates assumed to remain after a thousand years of diverging (Paragraph 8). (In the illustrative material in this paper the value .805 has been used for r, following Lees [1953, p. 119].) Log means "logarithm of," so that log C means the logarithm of the percent of probable cognates registered, and 2 log r means twice the logarithm of the constant.
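For readers who would rather let a machine do the arithmetic, the formula can be sketched as follows. This is a minimal illustration, not the author's procedure: the function name `time_depth_millennia` is hypothetical, natural logarithms are used (any base gives the same quotient), and the inputs are the 74 probable cognates out of 192 pairs reported for Ixcatec and Mazatec together with the constant .805.

```python
import math


def time_depth_millennia(cognate_fraction: float,
                         retention_constant: float = 0.805) -> float:
    """Indicated time depth t = log C / (2 log r), in millennia.

    cognate_fraction   -- C, probable cognates divided by pairs compared.
    retention_constant -- r, fraction assumed retained per thousand years.
    """
    if not 0 < cognate_fraction <= 1:
        raise ValueError("C must be greater than 0 and at most 1")
    if not 0 < retention_constant < 1:
        raise ValueError("r must be strictly between 0 and 1")
    # Both logarithms are negative (or zero), so the quotient is not negative.
    return math.log(cognate_fraction) / (2 * math.log(retention_constant))


# Figures from the Ixcatec-Mazatec comparison in this paper.
c = 74 / 192
t = time_depth_millennia(c)
print(round(c, 3), round(t, 2), round(t * 1000, -2))
```

Anyone using the 100 item list would pass `retention_constant=0.86` instead, as noted at the end of Paragraph 12.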

33. The formula is solved by the following steps:

(a) The logarithm of C and the logarithm of r are ascertained from Table 1. (For any who may be rusty on the use of logarithms, the following example is given. The logarithm of .38 is .968; it is found at the point where a line from .3 on the vertical scale of Table 1 meets a line from .08 on the horizontal scale. The logarithm of .39 is found at the point where a line from .3 on the vertical scale of Table 1 meets a line from .09 on the horizontal scale. The logarithm of .385 is halfway between these; half the difference between .968 and .942 subtracted from .968 gives .955, which is the logarithm of .385. Table 1 has been included in the text as more convenient to use than a full logarithmic table; it contains only those values of N that are necessary for computing the time depth.)

(b) The logarithm of r is multiplied by two.

(c) The product of the multiplication in (b) is divided into the logarithm of C.

(d) The quotient of the division in (c) is the indicated time depth in millennia. It may be changed to years by multiplying by 1,000.
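The table lookup in step (a) is a linear interpolation between neighboring entries. The sketch below reproduces the worked example for .385 and compares it with a directly computed natural logarithm. It is illustrative only: the name `interpolate_log` is hypothetical, and the two table values are entered with their negative signs, which the text leaves out of account.

```python
import math

# Two neighboring entries from Table 1 (natural logarithms, signed).
TABLE_ENTRIES = {0.38: -0.968, 0.39: -0.942}


def interpolate_log(x: float, lower: float, upper: float, table: dict) -> float:
    """Linearly interpolate a logarithm between two tabulated values."""
    fraction = (x - lower) / (upper - lower)
    return table[lower] + fraction * (table[upper] - table[lower])


approx = interpolate_log(0.385, 0.38, 0.39, TABLE_ENTRIES)
exact = math.log(0.385)

# The interpolated value should agree with the text's .955 (sign aside)
# and lie close to the directly computed logarithm.
print(round(approx, 3), round(exact, 3))
```

The small gap between the two figures is the price of reading a three-place table; it is negligible beside the range of error discussed in Paragraphs 37-41.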

COMPUTATION OF TIME DEPTH ILLUSTRATED

34. In the comparison of Ixcatec and Mazatec, 74 of the 192 pairs were registered as probable cognates (Paragraph 29). Dividing 74 by 192 gives .385 (38.5 %). This is the value to be used for C in the time depth formula.

35. The formula may now be filled in to read t = log .385/(2 log .80S). It is solved u follows: (a) The logarithm of .38S it found from Table I to be .955. The Jogarithm of .805 it .

TABLE 1. NATlJUL LOOAIUTJIMI

N .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
.1 -2.303 -2.'1JJ7 -2.120 -2.040 -1.966 ' -1.897 -1.833 -1.m -1.715 -1.661
.2 -1.609 -1.561 -1.514 -1.470 -1.427 -1.386 -1.347 -1.309 -1.273 -1.238
.3 -1.204 -1.171 -1.139 -1.1«» -1.079 -1.050 -1.022 -.994 -.968 -.942
.4 -.'16 -.892 -.868 -.844 - .• 21 -.799 -.n7 -.755 -.734 -.713
.5 -.693 -.673 -.6504 -.635 -.616 -.598 -.580 -.562 -.545 -.528
.6 -.511 -.494 -.47. -.462 -.446 -.431 -.416 -.400 --.386 -.371
.7 -.357 -.~ --.329 -.315 -.301 -.288 -.274 -.261 -.248 -~
•• -.223 -.211 -.19. -.1.6 -·.174 -.163 -.151 -.llP -.128 -;117
.9 -.105 -.094 -.083 -.073 -.062 -.051 -.041 -.030 -.020 -.010 To obcaiD !he IlAl1Inllaladtbm ttlllUlllben ... IIwI .1:

multi.,,, lbe _ber b, 10 ad .ublncr 2.303 fromSIa DltunI, IocIrithm) obcaifttd.

or mUJI!PI, ..,. 100 ad INbcnot 4.fOS from !he III • •

or mukljlly ..,. 1.000 IIIId IUbtnct 6.!NII from !he ID oba • etc.

No11: iii tile ~I_ dllcribed hi !he tal. il • .,..ble to Jeaw out ttllCCOllllt !he nepd .. ftIue of tbe 1opridI-; - ~iWl D_"'r cI1.w.d..,. ~ Ii- ._lIOIilivecl .. ot .... l.

8otIac:I: ., ~ hili 1..".,.,.". .. sr.tiniul ... "."., • ..,. Wilfrid J. OWn ad JI'nnk J. M_; Jr .• copJrip& 1"1. McGraw-Hill IIaok Camper. IDe.


confidence (i.e., the more certainly the true anawer 1iee within the range cited) the wider the range of yean. Narrowing the range of years leueDl the probability that it includes the true anawer.

found to be .217. (b) The product of 2 × .217 (that is, 2 log r) is .434. (c) The quotient of .434 (2 log r) divided into .955 (log C) is 2.200; that is, the indicated time depth, t, for Ixcatec-Mazatec is 2.2 millennia or (multiplied by 1,000) 2,200 years.

36. The indicated time depth for Ixcatec-Mazatec computed in Paragraph 35 may be stated in either of the following ways: Ixcatec and Mazatec are estimated to have been a single homogeneous language 2,200 years ago; Ixcatec and Mazatec are estimated to have begun to diverge from a common parent language about 245 B.C. (In the computations given here for illustrative purposes, the time depths, and the dates arrived at by subtracting the time depth from the present date, are not rounded off. It should be noted, however, that the range of error computed in Paragraphs 44-45 indicates that these dates must be taken at best as an approximation somewhere within a few years of correct. The dates have no significance in terms of single years or even decades.)

COMPUTATION OF RANGE OF ERROR

37. It is exceedingly improbable that any two successive random samplings of the basic vocabulary of a pair of languages would yield exactly the same percent of probable cognates. For this reason it is necessary to qualify the statement of time depth in such a way as to give an estimate of its accuracy. The usual way of qualifying a time depth statement is to state it as a range of years rather than as a specific number of years, and to state the degree of probability (or level of confidence) at which the range of years was computed. For example, the time depth for Mazatec-Ixcatec may be stated as 2,200 years ± 200 years at 7/10 confidence level (see end of Paragraph 45). (The computation of range of error is based on the assumption that all changes in the basic vocabulary are random, producing a "normal curve.")

38. Statistical methods permit computation of range of error at any level of confidence or probability. Computations are usually made, however, at one of three levels: "standard error," which is 68% confidence level (for convenience, standard error will be referred to as 7/10 confidence level) (see Paragraphs 39 and 44); 9/10 (90%) confidence level; or 5/10 (50%) confidence level, which is also called "probable error." The higher the level of ...

39. The first step in computing range of error at any level of confidence is the computation of "standard error" (7/10 confidence level). Standard error is computed by the formula $\sigma = \sqrt{C(1-C)/n}$ (Lees, 1953, p. 124, formula 11). In this formula $\sigma$ stands for standard error in terms of percent of cognates; C means the percent of cognates (see Paragraph 34; this is the same C used in working the time depth formula); n means the number of pairs of words compared. The formula is solved by the following steps: (a) C is subtracted from 1. (b) The remainder of the subtraction in (a) is multiplied by C. (c) The product of the multiplication in (b) is divided by n. (d) The square root of the quotient of the division in (c) is found. (e) The square root found in (d) is the range of error of the percent of cognates at the 7/10 confidence level.

40. Standard error in years is computed by the following steps: (a) The range of error of the percent of probable cognates (found in step (e) of Paragraph 39) is added to C (found in Paragraph 31). (b) The sum of the addition in (a) is worked through the time depth formula exactly as the original C was (Paragraph 32). (c) The new time depth obtained from (b) is subtracted from the original time depth as computed in Paragraph 32 to give the number which is added to and subtracted from the original time depth as computed in Paragraph 32 to give the range of error in years at 7/10 confidence level. (The range of error at 9/10 confidence is obtained by multiplying the standard error of the percent of cognates [found in Paragraph 39] by 1.645 [Dixon and Massey, 1951, Table 4]. The product of this multiplication is the range of error, at 9/10 confidence level, for the percent of cognates. From it, the range of error in years at the 9/10 confidence level can be computed by the same steps used for the computation of the range of error at 7/10 confidence level [Paragraph 40]. The range of error at 5/10 confidence level is obtained by the same steps, using the figure .674 instead of 1.645.)

41. Note that standard error, and therefore any range of error, is larger if the number of comparisons made is small, but decreases as the number of cases increases because there is division by the number of cases. This makes it important to use a list of words of sufficient length (Lees, 1953, p. 126).
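The dependence of standard error on list length can be sketched numerically. The following is a minimal illustration, not part of the original article: it applies the formula of Paragraph 39 to the Ixcatec-Mazatec cognate proportion (.385) for several hypothetical list lengths chosen only for comparison.

```python
import math


def standard_error(c, n):
    """Standard error of the cognate proportion: sqrt(C * (1 - C) / n)."""
    return math.sqrt(c * (1 - c) / n)


# C = 0.385 is the article's Ixcatec-Mazatec proportion of cognates.
# The list lengths below are hypothetical, chosen only for comparison.
c = 0.385
errors = {n: standard_error(c, n) for n in (50, 100, 200, 400)}

for n, se in errors.items():
    print(f"n = {n:3d}  standard error = {se:.3f}")
```

Because n sits under the square root, quadrupling the number of pairs compared only halves the standard error, so very long lists bring diminishing returns.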

42. An improved word list and more careful collection of data and ascertaining of probable cognates will reduce the actual error, but these will not show up in this method of computing the range of probable error, since the accuracy of the investigator cannot be included in the formula. Lexicostatistics operates admittedly with a wide margin of error due to inaccuracy in choice of words, mistakes in determining cognates, etc. This is the price of using the method at all, and is legitimate if one does not abuse it by relying on it for a degree of accuracy that is not basically possible.

43. In very deep time depths where the percent of cognates is small, the choice of a single false cognate or the rejection of a single true cognate may make considerable difference in the resulting date (Swadesh, 1953a, p. 41). If, for example, in a list of 200 comparisons there is only one cognate (.5%) the estimated time depth is 12.2 millennia, but if there are two cognates (1%) the time depth is 10.6 millennia. This is a difference of sixteen centuries dependent on the recognition of a single cognate.
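The sensitivity described in Paragraph 43 can be checked with a short sketch. It assumes the retention constant r = .805 that appears in the article's worked example (Paragraph 45) and uses natural logarithms; the function name is illustrative only.

```python
import math

R = 0.805  # retention constant per millennium, as used in Paragraph 45


def time_depth(c, r=R):
    """Time depth in millennia: t = log C / (2 log r)."""
    return math.log(c) / (2 * math.log(r))


one_cognate = time_depth(1 / 200)   # a single cognate in 200 comparisons
two_cognates = time_depth(2 / 200)  # two cognates in 200 comparisons
centuries_apart = (one_cognate - two_cognates) * 10

print(round(one_cognate, 1), round(two_cognates, 1), round(centuries_apart))
```

The gap depends only on the ratio of the two cognate counts, so at this depth every doubling of the count shifts the estimate by the same sixteen centuries or so.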

COMPUTATION OF RANGE OF ERROR ILLUSTRATED

44. The range of error at 7/10 confidence level can now be computed for the Ixcatec-Mazatec time depth by the formula $\sigma = \sqrt{C(1-C)/n}$ as follows (see Paragraph 39 for the steps followed here): (a) The percent of cognates computed in Paragraph 34 is .385. This number subtracted from 1.000 gives a remainder of .615 $(1-C)$. (b) .615 multiplied by .385 gives a product of .236775 $[C(1-C)]$. (c) .236775 divided by 192 (the number of pairs of words compared) gives a quotient of .0012332 $[C(1-C)/n]$. (d) The square root of .0012332 is .03511 $[\sqrt{C(1-C)/n}]$. (The simplest way to find square root is by reference to a manual of mathematical tables.) This is rounded off to give a standard error at 7/10 confidence level of .035.

45. The range of error in years, at 7/10 confidence level, is computed as follows (following the steps outlined in Paragraph 40): (a) The range of error computed in Paragraph 44 (which is the range of error of the percent of cognates) is added to the original percent of cognates computed in Paragraph 34; that is, .385 plus .035 is .42. (b) This new C is worked through the time depth formula $t = \log C/(2 \log r)$; $t = \log .42/(2 \log .805)$; $t = .868/.434$; $t = 2.000$ millennia or 2,000 years. (c) The new time depth is subtracted from the time depth computed in Paragraph 35; that is, 2,200 years minus 2,000 years is 200 years. (d) The range of error at 7/10 confidence level may now be stated in any of three ways:

Ixcatec and Mazatec were a single homogeneous language 2,200 ± 200 years ago; Ixcatec and Mazatec were a single homogeneous language between 2,000 and 2,400 years ago; Ixcatec and Mazatec began to diverge from a common parent language between 445 B.C. and 45 B.C.

From the standard error the range of error at 9/10 confidence level is computed as 2,200 ± 324 years. The range of error at 5/10 confidence level is 2,200 ± 140 years (see Paragraph 40).
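The steps of Paragraphs 44 and 45 can be sketched as a small program. This is an illustration under stated assumptions rather than the author's own procedure: it uses natural logarithms, the constant r = .805, and the multipliers 1.645 and .674 for the 9/10 and 5/10 levels, and it carries full precision instead of the table-rounded intermediate values, so its results differ from the article's figures by a few years.

```python
import math

R = 0.805           # retention constant per millennium (Paragraph 45)
C, N = 0.385, 192   # Ixcatec-Mazatec: proportion of cognates, pairs compared


def time_depth_years(c, r=R):
    """Time depth in years: 1,000 * log C / (2 log r)."""
    return 1000 * math.log(c) / (2 * math.log(r))


def range_in_years(c, n, factor):
    """Steps (a)-(c) of Paragraph 40 for a given confidence multiplier."""
    se = math.sqrt(c * (1 - c) / n)              # standard error of C
    shifted = time_depth_years(c + factor * se)  # time depth for C plus error
    return time_depth_years(c) - shifted


# Multipliers applied to the standard error at each confidence level.
levels = {"7/10": 1.0, "9/10": 1.645, "5/10": 0.674}
ranges = {name: range_in_years(C, N, f) for name, f in levels.items()}

for name, years in ranges.items():
    print(f"{name} confidence: {time_depth_years(C):.0f} ± {years:.0f} years")
```

Note that the article adds and subtracts the same number of years, although the time depth formula is not symmetric in C; the sketch follows the article in computing the margin from the upward shift in C only.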

46. The percent of cognates likely to be verified by comparative study, and the percent of probable cognates likely to be registered by a person with no knowledge of the two languages involved, are given in Paragraph 30. At this point we are ready to work these two estimates through the time depth formula and from the results to estimate the probable degree of skewing of time depth figures due to weakness in the criteria or to the inexperience of the investigator.

47. The more conservative estimate is 78 probable cognates (rather than the 74 probable cognates on which the illustration has so far been based). 78 probable cognates out of 192 comparisons is .406 (40.6%). Worked through the time depth formula (Paragraphs 32-33) this gives an estimated time depth of 2,078 years. The range of error computed at 7/10 confidence level is .035 (computed according to the steps in Paragraph 39) or 191 years (following the steps of Paragraph 40). This makes the most conservative estimate for the time of Mazatec-Ixcatec divergence 2,078 ± 191, or 1,887-2,269 years ago. Note that the figure 2,200 years (Paragraph 36) obtained by the criteria of Paragraphs 20-23 is within this range.

48. The least accurate estimation of cognates, that arrived at by the use of the criteria suggested in this paper, by an investigator without sufficient knowledge of the language to isolate the equivalent morphemes, without help from the comparative method, and without data beyond the 200 word list, is 72 probable cognates (Paragraph 30). This is .375 (37.5%) and gives a time depth of 2,260. Note that this figure also is within the range of error, at 7/10 confidence level, of the most conservative estimate (Paragraph 47).

49. On the basis of Paragraphs 47 and 48, it is evident that in this particular comparison, the results arrived at by the use of the criteria in this paper are only very slightly skewed from the results arrived at by the use of the more conservative methods. In other comparisons the skewing may be greater, but the investigator [...] estimate the [...] of the skewing, and take account of it in assessing the reliability of his results.
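The comparison in Paragraphs 47 to 49 can be reproduced with a brief sketch. It assumes natural logarithms and the constant r = .805 from the worked example; the variable names are illustrative, and unrounded arithmetic gives values a few years off the article's table-based figures.

```python
import math

R = 0.805   # retention constant per millennium (Paragraph 45)
N = 192     # pairs of words compared


def time_depth_years(c, r=R):
    """Time depth in years: 1,000 * log C / (2 log r)."""
    return 1000 * math.log(c) / (2 * math.log(r))


def standard_error(c, n):
    """Standard error of the cognate proportion: sqrt(C * (1 - C) / n)."""
    return math.sqrt(c * (1 - c) / n)


# Cognate counts from Paragraphs 47-48: conservative (78), the working
# figure of the illustration (74), and the least informed estimate (72).
depths = {count: time_depth_years(count / N) for count in (78, 74, 72)}

# Range of error at 7/10 confidence around the conservative estimate.
c78 = 78 / N
margin = depths[78] - time_depth_years(c78 + standard_error(c78, N))
low, high = depths[78] - margin, depths[78] + margin

within = {count: low <= d <= high for count, d in depths.items()}
print(within)
```

All three estimates fall inside the 7/10 range of the conservative figure, which is the sense in which the skewing in this comparison is slight.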


DIPS

50. As has been demonstrated, the dating arrived at by lexicostatistical techniques is very tentative, and can be seriously misleading to anyone who assumes that the dates are absolutes in terms of years or months, and uses them without due caution. For this reason it may be convenient to consider the data in terms of dips (i.e., degrees of lexical relationship) rather than in terms of historical dates, so that the relative lexical relationships can be discussed apart from any implication of absolute time (Gudschinsky, 1955, pp. 141-142) which may be more confusing than helpful. The dip expresses a true degree of objective lexical relationship even though borrowing or other factors have destroyed the time relationship. A knowledge of this present relationship is invaluable in practical decisions regarding homogeneity of speech areas for vernacular schools, production of literature, etc.

51. The formula for computing lexical relationship in dips is $d = 14\,(\log C / 2 \log r)$. Having once worked the time depth formula, however, the results may be converted to dips by multiplying the time in millennia by 14, or the time in years by .014. In the Ixcatec-Mazatec example used in this study, the lexical relationship expressed in Paragraph 36 as 2,200 years may be expressed as 30.8 dips.

Similarly, the range of error in years may be converted by multiplication to range of error in dips. The range of error at 7/10 confidence level is given in Paragraph 45 as 200 years. Multiplied by .014 this gives a range of error of 2.8 dips; that is to say, at 7/10 confidence level, the Ixcatec-Mazatec relationship is 30.8 ± 2.8 dips.
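The conversion to dips is a single multiplication, sketched below. The function names are illustrative, and the direct formula assumes natural logarithms and the constant r = .805 from the worked example.

```python
import math

R = 0.805  # retention constant per millennium (Paragraph 45)


def dips_from_cognates(c, r=R):
    """Dips directly from the cognate proportion: d = 14 * log C / (2 log r)."""
    return 14 * math.log(c) / (2 * math.log(r))


def dips_from_years(years):
    """Dips from a time depth, or a range of error, already stated in years."""
    return years * 0.014


depth_dips = dips_from_years(2200)   # Ixcatec-Mazatec time depth
error_dips = dips_from_years(200)    # range of error at 7/10 confidence
print(f"{depth_dips:.1f} ± {error_dips:.1f} dips")
```

Either route gives the same figure to within rounding: dips_from_cognates(0.385) also comes to about 30.8.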

52. Swadesh has suggested a classification of dialects, languages, stocks, and phylums on the basis of lexicostatistical results (1954c, p. 326), as follows:

Term           Divergence Centuries    Cognate Percent
language       0-5                     100-81
family         5-25                    81-36
stock          25-50                   36-12
microphylum    50-75                   12-4
mesophylum     75-100                  4-1
macrophylum    over 100                less than 1

(Swadesh has used .81 as the constant in determining the value in centuries of the various percents.) These labels may be defined in terms of dips as: language, 0-7 dips; family, 7-35 dips; stock, 35-70 dips; microphylum, 70-105 dips; mesophylum, 105-140 dips; macrophylum, more than 140 dips.

This particular classification is, of course, still tentative. Its empirical usefulness with a large number of languages remains to be demonstrated. But without question, the quantified data resulting from this technique makes possible a more objective classification of lexical relationships than has hitherto been possible (Swadesh, 1950, pp. 162-163).
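The dip ranges given in Paragraph 52 can be turned into a simple lookup, sketched here. The function name is illustrative, and the handling of boundary values is an assumption: the published ranges share their end points (7, 35, and so on), and this sketch assigns a boundary value to the lower class.

```python
# Upper bounds in dips for each label, taken from the ranges in Paragraph 52.
# Assumption: a value exactly on a boundary is assigned to the lower class.
LEVELS = [
    (7, "language"),
    (35, "family"),
    (70, "stock"),
    (105, "microphylum"),
    (140, "mesophylum"),
]


def classify(dips):
    """Return the classification label for a lexical relationship in dips."""
    for upper_bound, label in LEVELS:
        if dips <= upper_bound:
            return label
    return "macrophylum"  # more than 140 dips


print(classify(30.8))  # the Ixcatec-Mazatec relationship from Paragraph 51
```

On this scheme the Ixcatec-Mazatec pair, at 30.8 dips, falls in the "family" class, though the 7/10 range of error (± 2.8 dips) brings it close to the boundary with "stock."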

THE VALUE OF LEXICOSTATISTICS

53. For the anthropologist and historian, the lexicostatistical data suggest the order of the development of languages and dialects. That is, by studying a number of pairs of languages or dialects within a related group, or within a dialect area, those pairs which show greatest time depth are assumed to be representative of older splits in the dialects, and those showing lesser time depth show more recent splits, so that a progressive splitting is implied (Gudschinsky, 1955). This suggested order of splitting may help in correlating the linguistic data with known or suspected migrations, cultural developments, etc.

54. The lexicostatistical data also imply the geographical location and cultural contacts of ancient dialects, since the dialects were presumably relatively homogeneous until the time at which the evidence shows the beginning of their divergence. Then the dialects closest linguistically must have been closest geographically and longest in cultural contact. Such linguistic geographical relationships have been charted by Swadesh (1950, pp. 164-167) and Hirsch (1954). (For an extensive discussion of time depth and geographical location see Kroeber [1955]. For use of the principles of Paragraphs 53 and 54 see Taylor and Rouse, 1955.)

55. In using lexicostatistical data, it must be remembered that even when further experiment with the word list and the constant make possible a greater degree of accuracy, no individual study will be more accurate than the data available and the care used in ascertaining the probable cognates. Also, regardless of the degree of accuracy possible in determining when certain languages or dialects diverged from each other, it is not possible to determine by lexicostatistics what language was spoken by the people responsible for the artifacts found in any given place (Swadesh, 1954b; Kroeber, 1955, p. 104).

56. The archaeologist or nonlinguist who is curious to try this material is urged to do so. All that he needs beyond what is given here is the historical records or informants from which to obtain the lexical data.

REFERENCE NOTE

The problems and literature of lexicostatistics and glottochronology are discussed generally in Hymes (1960a, 1960e), in Bergsland and Vogt (1962), and in the comments to these articles (esp. Hymes, 1962b) by a variety of scholars. For recent comment, see also Hoijer (1961). For recent work of new scope, see Dyen (1962a, 1962b, 1962c) and Carroll and Dyen (1962), and cf. Elmendorf (1962b). Elmendorf (1962a), Diebold (1960), and Dyen (1962b) restate Salish relationships discussed in Swadesh (1950) and indicate the importance a well-worked body of data may acquire. For recent work on lexicostatistics, apart from glottochronology, see also Cowan (1959), Ellegård (1959), Gleason (1959), and Kroeber (1960a).

References not in the general bibliography:

CARROLL, JOHN B., and ISIDORE DYEN

1962. High Speed Computation of Lexicostatistical Indices. Lg., 38: 274-278.

COWAN, H. K. J.

1959. A Note on Statistical Methods in Comparative Linguistics. Lingua, 8: 233-246.

DIXON, WILFRID J., and FRANK J. MASSEY, JR.

1951. Introduction to Statistical Analysis. New York: Wiley.

DYEN, ISIDORE

1962a. The Lexicostatistical Classification of the Malayopolynesian Languages. Lg., 38: 38-46.

1962b. The Lexicostatistically Determined Relationship of a Language Group. IJAL, 28: 153-161.

1962c. Lexicostatistically Determined Borrowing and Taboo. Lg., 38: 60-66.

ELLEGÅRD, ALVAR

1959. Statistical Measurement of Linguistic Relationship. Lg., 35: 131-156.

ELMENDORF, W. W.

1962a. Lexical Innovation and Persistence in Four Salish Dialects. IJAL, 28: 85-96.

1962b. Lexical Relation Models as a Possible Check on Lexicostatistic Inferences. AA, 64: 760-770.

FAIRBANKS, GORDON H.

1955. A Note on Glottochronology. IJAL, 21: 116-124.

FERNÁNDEZ DE MIRANDA, MARÍA TERESA

1951. Reconstrucción del Protopopoloca. Revista Mexicana de Estudios Antropológicos, 12: 61-93.

GREENBERG, JOSEPH H., and MORRIS SWADESH

1953. Jicaque as a Hokan Language. IJAL, 19: 216-222.

GUDSCHINSKY, SARAH C.

1955. Lexico-statistical Skewing from Dialect Borrowing. IJAL, 21: 138-149.

HIRSCH, DAVID I.

1954. Glottochronology and Eskimo and Eskimo-Aleut Prehistory. AA, 56: 825-838.

HOCKETT, CHARLES F.

1953. Linguistic Time-Perspective and Its Anthropological Uses. IJAL, 19: 146-152.

LEES, ROBERT B.

1953. The Basis of Glottochronology. Lg., 29: 113-127.

SWADESH, MORRIS

1951b. Kleinschmidt Centennial III: Unaaliq and Proto Eskimo. IJAL, 17: 66-70.

1953a. Mosan I: A Problem of Remote Common Origin. IJAL, 19: 26-44.

1953b. Comment on Hockett's Critique. IJAL, 19: 152-153.

1953c. The Language of the Archaeologic Huastecs. Notes on Middle American Archaeology and Ethnology, 4: 223-227.

1954a. On the Penutian Vocabulary Survey. IJAL, 20: 123-133.

1954b. Time Depths of American Linguistic Groupings. With Comments by G. I. Quimby, H. B. Collins, E. W. Haury, G. F. Ekholm, and Fred Eggan. AA, 56: 361-377.

TAYLOR, DOUGLAS, and IRVING ROUSE

1955. Linguistic and Archaeological Time Depth in the West Indies. IJAL, 21: 105-115.

Dell Hymes (ed.), Language in Culture and Society: A Reader in Linguistics and Anthropology. New York: Harper & Row, 1964.
