Professional Documents
Culture Documents
ABSTRACT
Gray & Atkinsons (2003) application of quantitative phylo-
genetic methods to Dyen, Kruskal & Blacks (1992) Indo-
European database produced controversial divergence time
estimates. Here we test the robustness of these results using an
alternative data set of ancient Indo-European languages. We
employ two very dierent stochastic models of lexical
evolution Gray & Atkinsons (2003) nite-sites model and
a stochastic-Dollo model of word evolution introduced by
Nicholls & Gray (in press). Results of this analysis support the
ndings of Gray & Atkinson (2003). We also tested the ability
of both methods to reconstruct phylogeny and divergence
times accurately from synthetic data. The methods performed
well under a range of scenarios, including widespread and
localized borrowing.
1. INTRODUCTION
Questions about the origins of human populations hold an
enduring fascination. Often the answer lies beyond recorded
history, prior to even the oldest of ancient manuscripts or oral
traditions. In such cases, although we may not have access to any
literal description of events, we can nonetheless gather other kinds
of evidence to determine what happened when. Thomas Jeerson,
for example, in Notes on the State of Virginia (1782: 227), used an
intuitive comparison of related languages to draw historical
inferences about their age
The Philological Society 2005. Published by Blackwell Publishing,
9600 Garsington Road, Oxford OX4 2DQ and 350 Main Street, Malden, MA 02148, USA.
194 TRANSACTIONS OF THE PHILOLOGICAL SOCIETY 103, 2005
Old English ealle1 el1 ic1 scanca1 beorg1 teh1, drg3 sing1 wter1
Old High alle1 fellit1 ih1 bein2 berg1 dinsit2, ziuhit1 singit1 wazzar1
German
Old Norse allir1 fellr1 ek1 leggr3 fjall2 dregr3 syngr1 vatn1
Gothic allai1 driusi2 ik1 farguni3 atinsi2 siggwi1 wato
1
Greek atse12
p 1 rjeko14 o
pipsei3 ecx }qo14 e kjei4
00
aidei2 tdxq1
198 TRANSACTIONS OF THE PHILOLOGICAL SOCIETY 103, 2005
Cognate set 1 2 1 2 3 1 1 2 3 4 1 2 3 4 1 2 3 4 1 2 1
Old English 1 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 1 0 1 0 1
Old High German 1 0 1 0 0 1 0 1 0 0 1 0 0 0 1 1 0 0 1 0 1
Old Norse 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 1 0 1
Gothic 1 0 0 1 0 1 ? ? ? ? 0 0 1 0 0 1 0 0 1 0 1
Greek 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0 0 0 1 0 1 1
200 TRANSACTIONS OF THE PHILOLOGICAL SOCIETY 103, 2005
words must be born and survive into at least two languages in order
to be observed. This data thinning process may result in the birth
times of cognates in the data being unevenly distributed over the
tree. This eect is accounted for in the likelihood calculation for a
given tree, the details of which are given in Nicholls & Gray (in
press).
There are two obvious features of the data that this model fails to
capture. The rst is missing data. We do not account for missing
data and recode any missing cognates as absent. It is necessary to
check for biases caused by this approximation. We repeat analyses
omitting languages with a signicant amount of missing data (we
use 13% as the cut-o below). The eect of doing this on the
relatively complete data sets treated in this paper is negligible.
A more important issue is that of borrowing between languages.
If one language gains a new word by borrowing it from another, the
Dollo assumption is violated. While it is relatively simple to include
borrowing when building a model of language change, we are
currently unable to analyse such a model. In order to quantify the
magnitude of this misspecication, in section 8 we present a series
of analyses of data synthesised under models with borrowing but
analysed under the Stochastic-Dollo model.
Inference for the Stochastic-Dollo model is made within a
Bayesian framework and the data is analysed using a MCMC
algorithm implemented in Matlab by two of the authors (GN and
DW). The relevant software, called TraitLab, can be downloaded
from (aitken.math.auckland.ac.nz/nicholls/TraitLab/).
7. RESULTS
Figure 1 shows the results of a series of analyses of both data sets
using both of the models of evolution described above. Table 3
summarizes these results including the data, priors and other
conditions used for each analysis from Figure 1.
Gray & Atkinson (2003) found divergence time estimates for the
root of the Indo-European tree were robust to a wide range of
plausible rooting points, Bayesian priors, cognacy judgement
criteria, age constraints and the eect of missing information in
the data. Key results from Gray & Atkinson (2003) are summarized
in gure 1 (DF1-6). These results are consistent with a number of
subsequent analyses, using subsets of 20 languages (DF8 & 9) and
even when the data is limited to the highly conserved and
borrowing-resistant Swadesh 100 word list (DF7). Using their
stochastic-Dollo model, Nicholls & Gray (in press) found evidence
for similar ages using the Swadesh 100 word list items (DD11) and
slightly younger divergence times using the whole data set (DD10)
and a subset of 31 languages (DD12). The inferred ages are broadly
consistent with the Gray & Atkinson (2003) results.
Results from the analysis of the Ringe et al. (2002) data are
consistent with the Dyen et al. (1997) data. Varying branch-length
priors (RF2), rates across sites (RF7), and rates through time
(RF8), and the cognacy judgement criteria (RF3, RD10) had little
ATKINSON, ET AL. FROM WORDS TO DATES 207
DF1 Dyen et al. (1997) Finite-sites 8764 544 All cognacy information, uniform branch length priors,
Gamma distributed rates across sites, PL rate-smoothing.
DF2 Dyen et al. (1997) Finite-sites 9201 670 As for DF1 with conservative cognacy judgements only.
DF3 Dyen et al. (1997) Finite-sites 8794 533 As for DF1 with loose constraints.
DF4 Dyen et al. (1997) Finite-sites 8699 561 As for DF1 with missing data coded as ?
DF5 Dyen et al. (1997) Finite-sites 8829 520 As for DF3 with strict clock assumption.
DF6 Dyen et al. (1997) Finite-sites 9336 843 As for DF2 with exponential branch length priors.
DF7 Dyen et al. (1997) Finite-sites 9176 855 As for DF2 with Swadesh 100 word list terms only.
DF8 Dyen et al. (1997) Finite-sites 7900 735 As for DF2 with selection of 20 languages.
DF9 Dyen et al. (1997) Finite-sites 8590 748 As for DF8 with selection of a dierent 20 languages.
DD10 Dyen et al. (1997) Dollo 7400 218 Conservative cognacy judgements, uniform MRCA-time prior.
DD11 Dyen et al. (1997) Dollo 8714 410 As for DD10 with Swadesh 100 word list terms only.
DD12 Dyen et al. (1997) Dollo 7834 360 As for DD10 with selection of 31 languages
RF1 Ringe et al. (2002) Finite-sites 7964 645 Uniform branch length priors, gamma distributed
rates across sites, PL rate-smoothing
RF2 Ringe et al. (2002) Finite-sites 7942 702 As for RF1 with exponential branch length priors.
RF3 Ringe et al. (2002) Finite-sites 9166 671 As for RF1 with ne grained cognacy judgements for some characters.
RF4 Ringe et al. (2002) Finite-sites 9770 1150 As for RF1 with Swadesh 200 word list items only.
RF5 Ringe et al. (2002) Finite-sites 7106 671 As for RF1 with Swadesh 100 word list items only.
TRANSACTIONS OF THE PHILOLOGICAL SOCIETY
RF6 Ringe et al. (2002) Finite-sites 7932 572 As for RF1 with topology constraint in accordance
with Ringe et al. (2002).
RF7 Ringe et al. (2002) Finite-sites 7665 313 As for RF1 with equal rates across sites.
RF8 Ringe et al. (2002) Finite-sites 7956 612 As for RF1 with strict clock
RD9 Ringe et al. (2002) Dollo 7552 366 Uniform MRCA-time prior, 15 languages with over 10% sampling.
103, 2005
RD10 Ringe et al. (2002) Dollo 7671 368 As for RD9 with ne grained cognacy judgements for some characters.
RD11 Ringe et al. (2002) Dollo 7513 406 As for RD9 with Swadesh 200 word list items only.
RD12 Ringe et al. (2002) Dollo 7869 659 As for RD9 with Swadesh 100 word list items only,
17 languages with over 10% sampling.
ATKINSON, ET AL. FROM WORDS TO DATES 209
Figure 2. Mean root age and 95% condence interval for a series of
synthetic data analyses carried out using TraitLab. Data was
generated under a number of models of evolution on a tree with a
root age of 8680 years BP (indicated by the horizontal line) chosen
from the posterior distribution in RD16. Results to the left of the
vertical line are for method 1, tting the nite sites model, whilst
those to the right are for method 2, tting the stochastic-Dollo
model using TraitLab.
9. DISCUSSION
The divergence time estimates derived from the two independent
lexical data sets using two very dierent models of word evolution
are strikingly consistent. The ndings of Gray & Atkinson (2003)
seem to be robust to the choice of languages sampled, to the
meaning categories analysed, to who makes vocabulary assign-
ments and cognacy judgements, and even to the age of the
languages sampled. The fact that ancient and modern data sets have
produced similar date estimates strongly supports the notion that
the process of language evolution itself is suciently constrained
and robust to human socio-cultural change as to make date
estimates based on lexical comparison a feasible possibility. This is
also evidence against the suggestion that contemporary borrowing
could bias the age estimates.
Of course, one problem could be with the methodology. It is
therefore impressive that two dierent methods using very dierent
models give the same result. We were able to address the principal
concerns that have been raised about model misspecication using
214 TRANSACTIONS OF THE PHILOLOGICAL SOCIETY 103, 2005
10. CONCLUSION
The well-known criticisms of glottochronology have led many
researchers to reject the possibility of estimating dates from lexical
data. Any approach that attempts to turn words into dates is
dismissed as attempting the impossible trying to turn water into
wine. Here we have shown that estimating divergence time
condence intervals from lexical data is far from impossible or
miraculous. New statistical tools from evolutionary biology enable
us to estimate phylogeny and divergence times without falling
ATKINSON, ET AL. FROM WORDS TO DATES 217
References
ADAMS, D. Q., 1999. A Dictionary of Tocharian B (Leiden Studies in Indo-European
10). Amsterdam: Rodopi. Available via online database at S. Starostin and
A. Lubotsky (Eds.) Database Query to A dictionary of Tocharian B. http://
iiasnt.leidenuniv.nl/ied/index2.html.
ATKINSON, Q. D. & GRAY, R. D., in press[a]. Are accurate dates an intractable
problem for historical linguistics?, in In C. Lipo, M. OBrien, S. Shennan & M.
Collard (eds.), Mapping our Ancestry: Phylogenetic Methods in Anthropology and
Prehistory, Chicago: Aldine.
ATKINSON, Q. D. & GRAY, R. D., in press[b]). How old is the Indo-European
language family? Illumination or more moths to the ame?, in Peter Forster, Colin
Renfrew and James Clackson (eds.), Phylogenetic methods and the prehistory of
languages, Cambridge: McDonald Institute for Archaeological Research.
BERGSLAND, K., & VOGT, H., 1962. On the validity of glottochronology, Current
Anthropology 3, 115153.
BLUST, R., 2000. Why lexicostatistics doesnt work: the universal constant
hypothesis and the Austronesian languages, in Colin Renfrew, April McMahon
and Larry Trask (eds.), Time Depth in Historical Linguistics, Cambridge: The
McDonald Institute for Archaeological Research, 311332.
218 TRANSACTIONS OF THE PHILOLOGICAL SOCIETY 103, 2005
BRYANT, D., FILIMON, F. & GRAY, R.D., in press. Untangling our past: Pacic
settlement, phylogenetic trees and Austronesian languages, in R. Mace, C. Holden
and S. Shennan (eds.), The Evolution of Cultural Diversity: Phylogenetic approa-
ches, London: UCL Press.
BRYANT, D., & MOULTON, V., 2002. NeighborNet: An agglomerative method for the
construction of planar phylogenetic networks, Workshop in Algorithms for
Bioinformatics, Proceedings 2002: 375391.
CAMPBELL, LYLE, 2004. Historical Linguistics: An introduction. 2nd edition. Edin-
burgh: Edinburgh University Press.
CAVALLI-SFORZA, LUIGI LUCA, MENOZZI, PAOLO & PIAZZA, ALBERTO, 1994. The History
and Geography of Human Genes, Princeton, N.J.: Princeton University Press.
DYEN, ISIDORE, KRUSKAL, JOSEPH B. & BLACK, PAUL, 1992. An Indoeuropean
Classication: A Lexicostatistical Experiment. Philadelphia: American Philosoph-
ical Society, Transactions 82(5).
DYEN, ISIDORE, KRUSKAL, JOSEPH B. & BLACK, PAUL, 1997. FILE IE-DATA1.
Available at http://www.ntu.edu.au/education/langs/ielex/IE-DATA1.
EVANS, S. N., RINGE, DONALD & WARNOW, TANDY, in press. Inference of divergence
times as a statistical inverse problem, in James Clackson, Peter Forster and Colin
Renfrew (eds.), Phylogenetic methods and the prehistory of languages. Cambridge:
McDonald Institute for Archaeological Research.
FARRIS, J. S., 1977. Phylogentic analysis under Dollos Law, Systematic Zoology 26,
7788.
GAMKRELIDZE, T. V., & IVANOV, V.V., 1995. Indo-European and the Indo-Europeans:
A Reconstruction and Historical Analysis of a Proto-Language and Proto-Culture,
Berlin: Mouton de Gruyter.
GARRETT, ANDREW, in press. Convergence in the formation of Indo-European
subgroups: Phylogeny and chronology, in James Clackson, Peter Forster and
Colin Renfrew (eds.), Phylogenetic methods and the prehistory of languages,
Cambridge: McDonald Institute for Archaeological Research.
GEYER, C.J., 1992. Practical Markov chain Monte Carlo, Statistical Science 7, 473
511
GIMBUTAS, M., 1973a. Old Europe c. 70003500 B.C., the earliest European cultures
before the inltration of the Indo-European peoples, Journal of Indo-European
Studies 1, 120.
GIMBUTAS, M. 1973b. The beginning of the Bronze Age in Europe and the Indo-
Europeans 35002500 B.C., Journal of Indo-European Studies 1, 163214.
GRAY, R. D. & ATKINSON, Q.D., 2003. Language-tree divergence times support the
Anatolian theory of Indo-European origin, Nature 426, 4359.
GUTERBOCK, H. G. & HOFFNER, H.A., 1986. The Hittite dictionary of the Oriental
Institute of the University of Chicago. Chicago: The Institute.
HOFFNER, H. A., 1967. An English-Hittite Dictionary. New Haven: American
Oriental Society.
HOLLAND, B., & MOULTON, V., 2003. Consensus networks: a method for visualising
incompatibilities in collections of trees, in G. Benson and R. Page (eds.),
Algorithms in bioinformatics, WABI 2003, Berlin: Springer-Verlag, 165176.
HUELSENBECK, J. P., & RONQUIST, F., 2001. MRBAYES: Bayesian Inference of
Phylogeny, Bioinformatics 17, 754755.
HUELSENBECK, J. P., RONQUIST, F., NIELSEN, R. & BOLLBACK, J.P., 2001. Bayesian
Inference of Phylogeny and Its Impact on Evolutionary Biology, Science 294,
23102314.
ATKINSON, ET AL. FROM WORDS TO DATES 219