




DFG Center for Advanced Study Words, Bones, Genes, Tools

Abstract. [Sicoli and Holton, 2014] (PLOS ONE 9:3, e91722) use computational phylogenetics to
argue that linguistic data for the putative, but likely, Dene-Yeniseian macro-family are more
compatible with a homeland in Beringia (i.e. northeastern Siberia plus northwestern Alaska) than
with one in central Siberia or deeper Asia. I show that a more careful examination invalidates that
conclusion: in fact, linguistic data do not support Beringia as the homeland. In the course of showing
that, I discuss, without requiring a deep mathematical background, a number of methodological
issues concerning computational phylogenetic analyses of linguistic data and the drawing of
inferences from them. I suggest current best practices for such issues; following them would have
helped to avoid some of the problems in the Dene-Yeniseian case.

The Dene-Yeniseian language macro-family is argued to consist of the Yeniseian languages in
central Siberia and the Na-Dene languages in North America [Vajda, 2011], [Vajda, 2013]. The macro-
family is still only putative, as many open questions remain (see [Campbell, 2011], [Starostin, 2012],
as well as the reply in [Vajda, 2012]). However, the family does appear to be quite likely, and has
been widely accepted as such (for instance, [Kiparsky, 2015]).1
[Sicoli and Holton, 2014] apply computational phylogenetic methods to typological data from
Dene-Yeniseian languages in order to address the question of where their homeland was. The test
that they apply is very simple: Sicoli and Holton examine the shape of the obtained family trees or
networks, and determine whether they support a basal split into separate Yeniseian and Na-Dene
branches. They find that their analysis does not support such a split, which effectively amounts
to saying that there was no Proto-Na-Dene stage that is ancestral to all Na-Dene languages but
excludes Yeniseian. In other words, Sicoli and Holton's analysis says that the basal split was
not between Yeniseian and Na-Dene, but between some Na-Dene I and [Na-Dene II + Yeniseian].
From this inferred history of splitting, Sicoli and Holton conclude that the homeland of the Dene-
Yeniseian family must have been in Beringia rather than in Siberia, or generally in Asia excluding Beringia.
Importantly, the conclusions of [Sicoli and Holton, 2014] have been taken up as valid by special-
ists outside linguistics. The so-called Beringian standstill hypothesis is an important issue in current
studies of the peopling of the Americas that use genetic data. That hypothesis holds that there was a
single (though likely structured) human group that later rapidly colonized the Americas, and that
it had been isolated from other human populations for several thousand years even before entering
the two new continents. Such a scenario appears likely given current results from genetics. A
Date: Nov 16, 2017.
This paper has greatly benefitted from discussions with and comments by Chris Bentz, Johannes Dellert, Gerhard
Jäger, Taraka Rama, and Johannes Wahle, from help in locating some of the relevant literature by Alexei
Kassian and Elena Krjukova, and from presentations at the EVOLAEMP project group http://www.evolaemp. and the DFG Center for Advanced Study Words, Bones, Genes, Tools. Research reported
here was supported by DFG under project FOR 2237, establishing the said Center for Advanced Study, which is
hereby gratefully acknowledged.

reasonable place where that isolation period could have happened would be Beringia: Northeastern
Siberia, Western Alaska and the Bering land bridge, which is currently under water. [Watson, 2017]
provides a popular review of the history and the evidence for the Beringian standstill hypothesis.
That review cites [Sicoli and Holton, 2014] as implying that humans occupied Beringia during the
Last Glacial Maximum (LGM), a period of maximum glacier extent at around 26-20 thousand
years ago [Clark et al., 2009]. [Watson, 2017] cites a p.c. by Gary Holton, one of the authors
of [Sicoli and Holton, 2014], saying that their study supports "at least a period of occupation and
diversification within the Beringian area, and probably somewhere within the southwestern Alaskan
area". Similarly, [Hoffecker et al., 2016], a careful review examining multiple lines of evidence for
the Beringian standstill, also relies on [Sicoli and Holton, 2014] for linguistics, stating that "a
recent analysis of the Na-Dene and Yeniseian languages indicates a back-migration from Beringia
into Siberia and central Asia rather than the reverse". [Hoffecker et al., 2016] conclude their paper
saying that "many questions remain unanswered regarding the complicated movements of people
and/or genes into and out of Beringia after the LGM. Some of the answers have been documented
with archeological, linguistic, and genetic data, but others are problematic or disputed", where
"linguistic data" refers primarily to [Sicoli and Holton, 2014]'s work. In other words, Sicoli and
Holton's conclusion that linguistics firmly supports human occupation of Beringia has become very
popular outside linguistics.
While aiming to contribute linguistic evidence to the Beringian debate is commendable, unfortunately,
there are several problems with the argument of [Sicoli and Holton, 2014]. First, Sicoli
and Holton's assessment that the shape of the Dene-Yeniseian language-family tree bears
on the Beringian question is overly optimistic, as I discuss below in Section 1. Secondly, the tree
structure that Sicoli and Holton obtained in their Bayesian phylogenetic analysis is not robust to
the choice of tree priors: with a different tree prior than the one used by Sicoli and Holton, we
obtain strong evidence for the traditional phylogeny of the macro-family, where the basal split is
into the Yeniseian and the Na-Dene group. This means that the resulting shape of the tree crucially
depends on a technical choice. It also means, importantly, that the linguistic data in Sicoli and
Holton's dataset are insufficient to infer the true tree of the family: with large amounts of data,
the linguistic information should in principle override the preferences induced by the tree prior. I
discuss the general logic behind Bayesian MCMC inference of linguistic phylogenies (the compu-
tational method used by [Sicoli and Holton, 2014]) as well as the specific problem with tree priors
in Section 2. Finally, even though computational methods are of no help in this case for deciding
the general shape of the tree, there is plenty of historical-linguistic information that supports the
traditional phylogeny against Sicoli and Holton's novel proposal (Section 3). In particular, the
Yeniseian and Na-Dene language families are sufficiently different that historical linguists still express
caution regarding whether they can be considered a macro-family, see e.g. [Campbell, 2011]. The
data firmly rule out that a subclade of the Athabaskan languages within Na-Dene could be more
closely related to Yeniseian than to the rest of Athabaskan, that is, the phylogeny that Sicoli
and Holton defend.
The Dene-Yeniseian analysis by [Sicoli and Holton, 2014], despite being an innovative take on
the issue, thus suffers from three serious problems: (i) different homeland and migration hypotheses
as such are compatible with both types of linguistic phylogenies, so the latter cannot help us decide
between the former (Section 1 below); (ii) the family-tree structure inferred by Sicoli and Holton
is not robust to the choice of the tree prior: under a different reasonable prior, the alternative
tree structure is inferred, which the authors thought they could reject with certainty (Section 2);
(iii) while phylogenetic methods are of little help for deciding between the tree structures, the overall
linguistic evidence strongly points to the tree which Sicoli and Holton rejected (Section 3). Each of
these three problems alone would have made [Sicoli and Holton, 2014]'s Beringia inference invalid.
Section 4 concludes that linguistic evidence currently does not support either side in the Beringia
debate, and briefly summarizes methodological suggestions regarding obtaining and interpreting

computational phylogenetic inferences from linguistic data that would have helped to avoid the
pitfalls that the pioneering study of [Sicoli and Holton, 2014] fell victim to.
It should be particularly stressed that the technical problems [Sicoli and Holton, 2014] ran into
were likely due to treating the employed phylogenetic methodology more or less as a black box.
The authors provided complete logs of their analyses, which is precisely what allowed me to identify
some of their technical mistakes. They were completely transparent about what they did. There
is thus absolutely no question that [Sicoli and Holton, 2014] worked in good faith. I hope that
an extensive, but informal, discussion of not-very-transparent technical issues in this paper will
contribute to computational phylogenetics becoming less of a black box for our field of historical linguistics.

1. [Sicoli and Holton, 2014]'s argument for the Beringia homeland of Dene-Yeniseian
[Sicoli and Holton, 2014] use computationally inferred phylogenetic trees based on linguistic ev-
idence in an argument which, according to them, shows that the spread of the Dene-Yeniseian
macro-family proceeded from Beringia. Their linguistic data are 116 binary typological features,
described briefly in their Supplementary Materials 1. Conventionally, the presence of a feature is
coded as 1, its absence as 0, but the phylogenetic-inference model that the authors use does not
make an internal distinction between presences and absences, being only sensitive to match and
mismatch between the same feature in different languages.
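To make the coding concrete, here is a minimal sketch in Python; the feature vectors are invented for illustration and are not taken from Sicoli and Holton's dataset. Since the model registers only matches and mismatches, globally swapping presences and absences changes nothing.

```python
# Toy illustration of binary presence/absence coding. The feature values
# below are invented; they do not come from Sicoli and Holton's data.

def mismatches(lang_a, lang_b):
    """Count positions where two equally long binary feature vectors differ."""
    return sum(a != b for a, b in zip(lang_a, lang_b))

lang_a = [1, 0, 1, 1, 0]  # hypothetical language, 5 binary features
lang_b = [1, 1, 0, 1, 0]

flip = lambda v: [1 - x for x in v]

# The model is sensitive only to (mis)matches, so exchanging the roles of
# "presence" (1) and "absence" (0) everywhere leaves the result unchanged.
print(mismatches(lang_a, lang_b))              # 2
print(mismatches(flip(lang_a), flip(lang_b)))  # 2
```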
Sicoli and Holton's features (which can also be synonymously called "characters" or "sites" in
the context of phylogenetic inference) are highly correlated with each other. For example, features
1-18 concern the shape of the vowel system of the respective languages. Features 1, 4, 8, 12, and
16 are mutually exclusive, as they count how many vowels the system has overall: three (feature
1), four (feature 4), and so on. The 13 other features in the group 1-18 describe the more exact
shape of the vowel system, and are conditional on the number of vowels in it. Thus for the 3-vowel
systems, there are two binary features "1-1-1" (feature 2) and "2-1" (feature 3). Obviously, these
can have value 1 only if feature 1 (= having exactly 3 vowels) is 1. Similarly, only one of those
two features can be 1 at the same time. Finally, if the system has three vowels overall, this means
that all features corresponding to systems with a different number of vowels, that is features 4-18,
must be 0. Summing up, features 1 and 2 and features 1 and 3 are positively correlated, while
features 1-3 and 4-18 are all pairwise negatively correlated. These 18 features are arguably the
most correlated subset in Sicoli and Holton's dataset, but similar problems occur on a smaller scale
in the rest of the data as well. There are two consequences of that. First, the evolutionary model
Sicoli and Holton use for their phylogenetic inference assumes feature independence, which is not
the case.2 But secondly, even if such statistical dependence does not bias the results, it still means
that the actual amount of data in the dataset is effectively smaller than 116 binary characters.3
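The negative correlation between mutually exclusive characters can be verified in a few lines of Python; the ten-language sample below is hypothetical and only illustrates the structural point.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical coding for 10 languages: a language has either exactly
# three vowels ("feature 1") or exactly four ("feature 4"), never both.
three_vowels = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
four_vowels = [1 - x for x in three_vowels]

# Mutually exclusive, jointly exhaustive characters are perfectly
# anti-correlated: the second one adds no independent information.
print(pearson(three_vowels, four_vowels))  # -1.0 (up to floating point)
```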
2The same problem affects other linguistic phylogenetic studies, and I am not aware of a full-scale quantification
of how serious the problem might be.
The problem might be hard to notice for non-biologists because the assumption of character independence is so
fundamental to phylogenetic inference that it rarely gets spelled out explicitly for example, it is not
3Sicoli and Holton also note that 26 out of their 116 binary characters feature the same value for all languages,
and say that they are therefore uninformative for phylogenetic inference. The latter statement is not completely true.
What is true is that a uniform feature does not give us any information about which languages belong to the same
clade within the family: only shared innovations and retentions that affect a part of the family are useful for that.
But uniform features still contribute information about the rates of change, and could also affect inference of likely
feature states in the proto-language. Through that, they can even affect tree topology, albeit not as directly as non-
uniform characters. In the main text, I only report analyses including all the features, unlike Sicoli and Holton,
who excluded the uniform ones. (I discuss the technical aspects of the issue a bit further in Section 2.) I checked whether this
difference would affect tree topologies by running one analysis in two variants. As there was no significant difference
in tree topologies, I believe that in the Dene-Yeniseian case, this choice is not particularly consequential. With other
datasets, however, it can be, as I discussed elsewhere [Author, 2017].

This is important, because 116 binary characters is not very much data to start with as far as
computational phylogenetics goes. Since the effective number was even smaller, it should
come as no surprise that Sicoli and Holton's phylogenetic results depend heavily on the choice of
prior distributions, as we will see in the next section. With large amounts of data, the signal in
the data can often overwhelm the biases of the prior, but the less data we have, the more influence
our prior assumptions will have on the final result.
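The interplay between prior and sample size can be illustrated with the simplest conjugate example, which is not the model Sicoli and Holton used but makes the same point: with a Beta(a, b) prior on the rate at which a binary character takes the value 1, and k ones observed out of n, the posterior mean is (a + k)/(a + b + n).

```python
# Sketch of prior vs. data influence, using a textbook Beta-Binomial model
# (not the phylogenetic model itself). A Beta(5, 5) prior has mean 0.5.

def posterior_mean(a, b, k, n):
    """Posterior mean of a Bernoulli rate under a Beta(a, b) prior."""
    return (a + k) / (a + b + n)

# The data say 0.8 in both cases; only the sample size differs.
small_n = posterior_mean(5, 5, k=8, n=10)      # 13/20 = 0.65: prior pulls hard
large_n = posterior_mean(5, 5, k=800, n=1000)  # 805/1010: data dominate
print(small_n, round(large_n, 3))
```

With only 10 observations the posterior mean sits much closer to the prior's 0.5 than to the observed 0.8; with 1000 observations it is nearly the observed proportion.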
Sicoli and Holton's specific goal in their phylogenetic analysis is to compare two hypotheses about
the shape of the Dene-Yeniseian tree. (Here, "hypothesis" is meant in the statistical sense, namely,
as a theoretical possibility that we can study statistically; this sense is different from the general
scientific sense of "hypothesis" as a possibility that is formulated to explain the present evidence.)
One hypothesis says that the Dene-Yeniseian tree will have the shape [[Yeniseian], [Na-Dene]], with
the first split separating the two traditionally postulated language families. The other hypothesis
says that the tree does not have that shape, and instead that some Na-Dene languages branch out
before the Yeniseian languages branch out from the stem of the tree. In other words, the second
hypothesis says that the tree has the shape [[Na-Dene I], [Na-Dene II, Yeniseian]]. This hypothesis
explicitly contradicts the traditional linguistic classification of the relevant languages, an issue
we will discuss below in Section 3.
Sicoli and Holton argue that the two different topologies correspond to different migration sce-
narios. As this is a crucial step in their argument, it merits a full citation:

We expect the two different migration hypotheses to exhibit different tree topologies.
The out of central/western Asia hypothesis assumes that the Yeniseian languages
(and potentially their extinct relatives) branched off of the Dene-Yeniseian family
with Na-Dene subsequently diversifying. The tree topology for this hypothesis would
place the Yeniseian languages outside of Na-Dene: [Yeniseian[Na-Dene]]. The radia-
tion out of Beringia hypothesis does not assume that Yeniseian necessarily branched off first.
[Sicoli and Holton, 2014, p. 4]
What Sicoli and Holton assume in this passage can be usefully illustrated with Figure 1. If the
homeland of Dene-Yeniseian was in central or western Asia, they assume that the only possible
scenario following that would be (SH-I) a single Na-Dene migration into Beringia and then further
into more southerly North America. If this were indeed the only scenario compatible with a deep
Asian homeland, then Sicoli and Holton would have been right to equate the Asian homeland with
the tree topology [[Yeniseian], [Na-Dene]]. However, this is not so. There is no a priori reason why,
for example, the following, completely hypothetical scenario would be ruled out: (A1) Na-Dene
I splits from Proto-Dene-Yeniseian and occupies some territory in central Siberia; (A2) Yeniseian
and Na-Dene II split from each other, all staying in central Siberia; (A3) Na-Dene I and Na-Dene II
migrate to Beringia as a part of a larger migration of diverse peoples, including the speakers of all
non-Na-Dene and non-Eskimo-Inuit American languages. There are more hypothetical scenarios
that can be formulated. Let's spell out one more option: (B1) Na-Dene I splits from Proto-Dene-
Yeniseian and moves to Beringia (for instance, to Western Alaska); (B2) Yeniseian and Na-Dene
II split; (B3) Na-Dene II moves into North America, and Na-Dene I moves around within North
America. Of course, some of such scenarios would be more far-fetched than others. However, all
of them represent logical possibilities. [Sicoli and Holton, 2014] do not discuss why exactly they
think all possibilities but one should be ruled out: the extract above is all that they say about the matter.
Turning to the hypothesis of a Beringian homeland, Sicoli and Holton remain more cautious.
They only say that a Beringian homeland is compatible with a tree topology other than [[Yeniseian],
[Na-Dene]]. Of course, it is also compatible with [[Yeniseian], [Na-Dene]]: it can be that from a
homeland in Beringia, the Yeniseian branch moves out first (presumably to the east), while Na-Dene

Figure 1. Some logically possible, hypothetical Dene-Yeniseian migration-and-split sce-

narios. Representation is schematic: specific migration targets within a region are not
intended to convey information. The numbering on arrows indicates the intended temporal
ordering. (SH-I) and (SH-II) are suggested by [Sicoli and Holton, 2014]. (A) and (B) are
alternative scenarios with a Siberian homeland, but the linguistic tree structure [[Na-Dene
I], [Na-Dene II, Yeniseian]]. The existence of (A) and (B) shows that the true tree topology
does not bear directly on where the Dene-Yeniseian homeland could have been.

continue to develop in the homeland before moving out. One alternative scenario, which Sicoli and
Holton end up arguing for, is (SH-II): there were several Na-Dene migrations out of the
Beringian homeland and therefore several linguistic splits resulting in the modern Na-Dene groups,
but the split of the Yeniseian languages occurred after some of those Na-Dene splits, but before
others. It is worth noting that Sicoli and Holton's favorite scenario has roughly the same level of a
priori far-fetchedness as the hypothetical Asian scenarios sketched above. It requires there to be no
Proto-Na-Dene stage excluding Yeniseian, just as (A) and (B) above. It requires several separate
migrations by Na-Dene groups out of their homeland towards their positions in North America,
just as (A) and (B) do. Such multiple migrations into the American interior might seem somewhat
suspect to a geneticist or an archaeologist, given the astonishing levels of genetic and technological
similarity among most American populations (excluding Eskimo-Inuit).4 However, for a linguist,
the presence of different migration streams would not be surprising, as the linguistic diversity of the
Americas in terms of the number of apparently unrelated language families is equally astonishing,
[Nichols, 2008]. If even the American language stocks as a whole, with their enormous linguistic
diversity, leave no clear genetic or archaeological traces of that diversity, it is no surprise that we
do not see such clear evidence for several separate Na-Dene migrations either: after all, the former
are far more diverse linguistically than the latter. However, an important point is that a Beringian homeland is

4For the archaeological side, see the review of evidence in [Potter, 2011]. For the genetic side, see a recent review
in [Skoglund and Reich, 2016].

also compatible with a basal split into Na-Dene and Yeniseian: for instance, we can form such a
scenario by simply changing the order of out-of-Beringia migrations in (SH-II) in Figure 1.
Of course, even quite a far-fetched scenario may turn out to be a true one. The very fact that
the Americas feature such a high level of language-family diversity is a good example. If we did
not know about that diversity, we would have found it implausible that any continent could exhibit
it, extrapolating from the rest of the Earth. Yet the American linguistic diversity exists. The
bottom line for the Dene-Yeniseian case is two-fold: (i) one should not rule out even far-fetched
scenarios without sufficient reason; (ii) arguably, logically possible migratory scenarios that were
ruled out without discussion by [Sicoli and Holton, 2014] are not much more far-fetched than the
scenario that they end up arguing for.
Thus if we look at the different logically possible migratory scenarios, we can see that either of the
homelands is compatible with both of the tree topologies considered by [Sicoli and Holton, 2014].
Linguistic phylogenetic trees do not help us decide whether the relevant population splits occurred
in central Asia or in Beringia. In other words, the very premise of Sicoli and Holton's main line of
argumentation is not valid: either topology is compatible with both homelands. There is therefore
no way to use Sicoli and Holton's linguistic framework to either support or refute the Beringia hypothesis.
What if there were valid reasons to rule out all alternative migration scenarios for the Asian
homeland hypothesis? Even though Sicoli and Holton do not provide such reasons or acknowledge
the issue, let's assume for the sake of the argument that such reasons could be provided. In this
case, we would have a clear line of attack. Since by Sicoli and Holton's assumption, linguistic tree
topology [[Yeniseian], [Na-Dene]] is compatible with both homelands, we would not learn anything
if that topology is supported by the data. If, however, it is the other topology that is supported,
namely [[Na-Dene I], [Na-Dene II, Yeniseian]], then we can infer that the homeland was in Beringia.
That is the argument that [Sicoli and Holton, 2014] put forward. In the next two sections, we will
examine how valid that argument is even if we accept their premise.

2. The dependence of phylogenetic results on the choice of tree prior

The bottom line of this long section is very simple: while [Sicoli and Holton, 2014]'s original
analysis did not support a basal split between Yeniseian and Na-Dene, if we change one of the
parameters of the computational analysis, namely the tree prior, the results show exactly
such a split. Because the two analyses both use a priori reasonable settings but disagree,
computational-phylogenetic results by themselves cannot be used either to support or to refute
Sicoli and Holton's position.
If you are only interested in the general structure of the argument for the Dene-Yeniseian home-
land, that information, also illustrated by the consensus trees in Fig. 2, is already enough, and you
can skip to Section 3. The rest of this section explains, in informal terms, how the computational
analysis works and how its settings are chosen. Using computational phylogenetic software can
be daunting, as the manuals and help pages often presuppose a great deal of knowledge about
the technical details, and are written for biologists and geneticists, not linguists. The goal of this
section is to somewhat demystify the process, and at the same time to explain the problems with
[Sicoli and Holton, 2014]'s analysis, so that one can avoid running into similar problems in the
future. For a longer methodological introduction that, unlike the present article, gradually works
its way towards a mathematical presentation, see [Ronquist et al., 2009].

2.1. How Bayesian MCMC works. It is useful to start with a brief and informal discussion
of the statistical method that Sicoli and Holton used to obtain their trees for the Dene-Yeniseian
macro-family. That method is called Bayesian MCMC (abbreviated from Markov Chain Monte
Carlo). Inferring a language-family tree from observed data intuitively involves finding the tree or
trees that are most compatible with the data. To completely describe any given tree from scratch,

we need many parameters: the topology of the tree (a categorical parameter) and the length of
each single branch (many real-valued parameters). Furthermore, to connect the tree to our data,
we need even more parameters. To compute how likely our observations were to be generated if a
certain tree were the true tree, we need an evolutionary model that describes e.g. the rates of change
of our linguistic features. We thus need even more numbers. Inferring the optimal tree(s) and the
evolutionary parameters is a very complex task: the search space of possible trees, even forgetting
the evolutionary parameters, is astronomically large; the parameters are not independent from one
another, making the problem even harder; finally, rather than there being one unique absolutely
best tree, in this type of model there are usually very many different trees each of which explains the
data relatively well. Bayesian MCMC is precisely the kind of method to work with such complex
situations. It is able to search through very hard-to-analyze parameter spaces, and it outputs not
a single tree as its answer, but a sample of trees that are well compatible with the data.
Here is how Markov Chain Monte Carlo works. The algorithm defines a Markov chain (hence
the first "MC" in the name), a mathematical construct that moves through the search space of
possible trees according to certain rules, but at the same time retaining a degree of randomness
(hence the second "MC", "Monte Carlo", metaphorically referring to the randomness component
through association with casinos). At each step of the chain, a new tree is picked together with
the evolutionary parameters, essentially as a guess.5 This is our new hypothesis. We compute the
probability that language change would have generated exactly the data that we observed assuming
that our new hypothesis is correct. That probability is called the "likelihood" of our hypothesis in
statistical parlance. Normally, that probability will be very low even for the best cases, because
there are many ways in which language change can proceed. (That is why that probability is
normally computed on a log scale: it is much harder to work with numbers like 10^-1000 than with
log10(10^-1000), which is just -1000.) We also compute the prior probability of our hypothesis. In
the case of the tree, for example, its prior probability may be equal to how likely that tree would
be to be generated by a certain random evolutionary process. That process is said to induce a tree
prior. There are also other priors participating in the overall prior probability for example, a
prior on the probabilities of different character values at the root of the tree, etc. As tree priors
will turn out to be important for the Dene-Yeniseian case, we will return to them in greater detail
below. For now, it suffices to note that the prior probability of the tree belongs to the prior rather
than the likelihood because it does not depend on our observed linguistic data.
In this manner, we will have obtained the likelihood and the prior probabilities for our hypothesis.
It is the product likelihood * prior that is relevant in what follows. (Recall that a product on the
normal scale corresponds to a simple sum on the log scale.) In technical terms, that product is
proportional to the probability of our hypothesis given the data, which is what makes it a very useful
quantity. Even though the absolute value of likelihood * prior will be quite low for any hypothesis,
there will still be an enormous difference between more likely and less likely outcomes of language
change, and we want to see how exactly our new hypothesis fares compared to others. For that, we
compare the product likelihood * prior generated assuming that our hypothesis were true, with the
same quantity computed for our previous hypothesis. The fact that we only compare those two may
seem unintuitive at first: don't we need to compare our hypothesis with all others? The beauty of
MCMC is that even though we only use pairwise comparison, the resulting sample that we obtain in
the end contains the true information about how all possible different hypotheses fare comparatively.
What we do in our MCMC pairwise comparison is use a special rule to decide whether to keep the
old hypothesis or to adopt the new one instead: the higher the product likelihood * prior of our
new hypothesis, the more likely we are to adopt it. Importantly, the adopted hypothesis is not
necessarily better than the old one (with "goodness" here measured by likelihood * prior). The point

5For the algorithm to be efficient, that guess has to be somewhat informed, but the details of that are not relevant
for our purposes here.

of the algorithm is crucially not monotonic improvement. This again might seem strange at first,
but in fact that's what is needed to obtain the mathematical guarantee that in the end, we will have a
sample from the true posterior distribution of our model, that is, the probability distribution
over the space of our hypotheses (trees and evolutionary parameters) conditional on the
data that we observed. In other words, MCMC allows us to determine which trees are more likely
given our data, which, intuitively, is of course exactly what we want. The interesting thing about
the MCMC chain is not the final hypothesis that we observe, but the sequence of hypotheses that
the chain passes through.
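The pairwise accept/reject rule just described can be sketched in a dozen lines of Python. The three "topologies" and their unnormalized likelihood * prior scores below are invented; the point is only that the frequency with which the chain visits each state approaches that state's share of the total score.

```python
import random

random.seed(1)

# Invented unnormalized posterior scores (likelihood * prior) for a toy
# "tree space" of three topologies; real scores live on a log scale.
score = {"((A,B),C)": 3.0, "(A,(B,C))": 7.0, "((A,C),B)": 1e-4}
states = list(score)

def metropolis(n_steps, start="((A,B),C)"):
    current, sample = start, []
    for _ in range(n_steps):
        proposal = random.choice(states)  # symmetric, uninformed proposal
        # Accept with probability min(1, new/old): better hypotheses are
        # always adopted, worse ones only sometimes.
        if random.random() < min(1.0, score[proposal] / score[current]):
            current = proposal
        sample.append(current)
    return sample

sample = metropolis(100_000)
freqs = {s: sample.count(s) / len(sample) for s in states}
print(freqs)  # roughly 0.3, 0.7, and almost 0
```

Even though each step only compares two hypotheses, the visit frequencies recover the full relative standing of all three.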
The Markov chain is defined in such a way that it can run literally forever. In practice, of course,
we are interested in actually getting the results out, so we want to stop it at some point and examine
what we've got. The mathematics of the chain guarantees an astonishing fact: if we run the MCMC
algorithm long enough, we are bound to start at some point sampling from the true posterior. In
other words, the trees and evolutionary parameters that we sequentially adopt as the chain works
will, after a certain point, be all coming from the set of best hypotheses given the data. However,
this will only happen after a certain moment. In a run-of-the-mill MCMC run, the chain will start
with some random hypothesis. Chances are that that random hypothesis would be pretty bad.
Technically, it will have a low likelihood, which means it explains the observed data very poorly,
and the product likelihood * prior will correspondingly also be very low. However, the setup of our
chain is such that it will start quickly moving towards better and better hypotheses, and at some
point we will usually see that the likelihoods of our hypotheses are not climbing up anymore, but
rather stay at roughly the same level. When we reach that plateau, we are likely to have started
sampling from the true posterior. It is said that the MCMC "converged" at this point. Because the
plateau is such a distinctive shape, a common way to estimate whether we have reached convergence
and started sampling from the true posterior is to simply examine the plot showing the likelihoods
of our sequentially drawn samples: if we see a plateau in that plot, we're likely to be in the right
spot already. The reason people use such eyeballing to detect convergence is that, unfortunately,
there is no theoretical way to detect mathematically, with absolute certainty, that we have reached
the posterior. Heuristics such as eyeballing the likelihood plot are all we have. Fortunately, however,
in practice we can use further tricks for determining (though not guaranteeing) likely convergence.
There is currently broad agreement among MCMC practitioners that, as a whole, detecting
convergence is not such a big problem in real-life studies. We will briefly discuss some commonly
used practical diagnostics in the next section, when we describe Dene-Yeniseian analyses.
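As an illustration of the plateau idea (and only as an illustration: real analyses use dedicated diagnostics, such as effective sample sizes in tools like Tracer), one can compare the two halves of a log-likelihood trace: on a plateau their means should be close relative to the trace's spread.

```python
from statistics import mean, stdev

def looks_converged(trace, tol=0.5):
    """Crude plateau check: are the halves' means close, relative to spread?"""
    half = len(trace) // 2
    spread = stdev(trace) or 1.0
    return abs(mean(trace[:half]) - mean(trace[half:])) / spread < tol

# A trace that is still climbing vs. a noisy plateau (both invented).
climbing = [-1000.0 + i for i in range(100)]
plateau = [-50.0 + 0.1 * ((7 * i) % 5) for i in range(100)]
print(looks_converged(climbing), looks_converged(plateau))  # False True
```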
Even though we start sampling from the posterior as MCMC progresses, this does not mean that
all hypotheses that we sample are equal. Because many parameters in our hypotheses are continuous
numbers, the probability of sampling precisely the same hypothesis twice is zero. In that
uninteresting sense, all hypotheses are equal. However, some hypotheses may come from regions in
the tree space and the parameter space that are densely populated with good hypotheses, while
others may come from regions less likely on the whole. In the end, we are more interested in this
density at the level of regions rather than on the level of individual hypotheses. For example, we
may be interested in the question of which tree structure a clade of languages A, B, and C has:
there are three logical possibilities. It could be that in the true posterior, shape ((A,B),C) occurs
30% of the time, shape (A,(B,C)) 70% of the time, and shape ((A,C),B) never occurs. If this
is the case, then our samples in MCMC should also be roughly 30% ((A,B),C) and roughly 70%
(A,(B,C)). (Why only roughly? In fact, if we ran our MCMC for an infinite amount of time, the
numbers would be exactly as in the true posterior. But because in practice we only run MCMC
for a finite time, we obtain a sample from the posterior rather than the full posterior. If you draw
10000 samples from an infinite urn with green and red balls with a 30% share of greens, you will
get very close to 30% green in your sample, but probably not exactly 30%.) This distribution over
the clade's shapes is what ultimately interests us as analysts, and not the fate of individual
hypotheses. To study that distribution, we check how many trees in our posterior sample exhibit
each of the shapes.
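The ball-drawing analogy can be mimicked in a short simulation. The toy "posterior" below is the hypothetical one from the example, not anything inferred from real data:

```python
import random

random.seed(1)

# Draw a finite sample from a "true posterior" over three topologies,
# mirroring the example in the text: ((A,B),C) has probability 0.30,
# (A,(B,C)) 0.70, and ((A,C),B) never occurs.
true_posterior = {"((A,B),C)": 0.30, "(A,(B,C))": 0.70, "((A,C),B)": 0.00}
topologies = list(true_posterior)
weights = [true_posterior[t] for t in topologies]

sample = random.choices(topologies, weights=weights, k=10_000)

# Sample frequencies land close to, but not exactly at, the true values.
for t in topologies:
    print(t, sample.count(t) / len(sample))
```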
2.2. MCMC inference on Sicoli and Holton's Dene-Yeniseian data: setting up the
parameters. We are now ready to turn to the actual analysis of Dene-Yeniseian data. In this
section, we explain the settings used to set the analysis up, and then in the next section, we discuss
the results. The analysis was performed using MrBayes [Ronquist et al., 2012b], a popular, free,
open-source MCMC program specialized in phylogenetic inference. [Sicoli and Holton, 2014]
published both the data file they used and the logs of many of their runs of MrBayes, which allows us
to exactly replicate their analyses. (Generally, making the data and software parameters available
together with the publication is very good practice, as it makes it easy to replicate and build upon
earlier results.)
[Sicoli and Holton, 2014] ran their MCMC chains for 2 million steps, also called generations (no
connection to human generations). They discarded the initial 25% as burnin: that is, the initial
portion of the chain where it was not likely to have started sampling from the actual posterior. The
likelihood plots in their logs show clear plateaux, suggesting that the chain had indeed converged by
the end of the burnin period. In my replicas of their analyses, together with eyeballing the likelihood
plots, I used another common heuristic for detecting convergence provided in MrBayes. Instead of
running a single MCMC analysis, in MrBayes one can run simultaneously two or more chains. Once
they all converge, they should be sampling from one and the same distribution over trees. This
suggests a simple nice diagnostic: we can compare how similar the trees sampled by our multiple
independent chains are. MrBayes does that comparison by counting how different the relative
frequency of each potential clade in the tree is between the different independent runs. In the
limit, those frequencies should agree almost precisely: in each run, the frequency of a clade should
converge to its true frequency in the posterior distribution. I ran all analyses with 2 independent
runs, using 2 million steps initially, but then adding more steps until the standard deviation of
clade frequencies between the two runs fell below 2%. (This diagnostic is called average standard
deviation of split frequencies in MrBayes's output; split here refers to a partition of all languages
into a clade and everything outside of that clade.) In most cases, 2 million generations were already
enough for that to happen. When two heuristics (clade frequencies and the visual examination of
the likelihood plot) both point to likely convergence, it is safer to conclude that we have reached
the true posterior.6
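The idea behind this diagnostic can be illustrated with a toy computation. This is a simplified sketch of the principle, not MrBayes's exact implementation, and the two four-tree "runs" are made up:

```python
import math
from itertools import chain

# Sketch of the idea behind "average standard deviation of split
# frequencies": compare how often each clade (a frozenset of languages)
# appears in independent runs. Toy posterior samples, not MrBayes output.

run1 = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("BC"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
]
run2 = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("BC"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("BC"), frozenset("ABC")},
]

def split_freqs(trees):
    counts = {}
    for tree in trees:
        for clade in tree:
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: n / len(trees) for clade, n in counts.items()}

def avg_sd_of_split_freqs(runs):
    freqs = [split_freqs(r) for r in runs]
    clades = set(chain.from_iterable(freqs))
    sds = []
    for clade in clades:
        fs = [f.get(clade, 0.0) for f in freqs]
        mean = sum(fs) / len(fs)
        sds.append(math.sqrt(sum((x - mean) ** 2 for x in fs) / len(fs)))
    return sum(sds) / len(sds)

# Clade AB occurs in 75% of run1 trees but only 50% of run2 trees, so
# the runs have not yet converged to the same distribution.
print(avg_sd_of_split_freqs([run1, run2]))
```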
Not all of the samples in the 2 million generations minus the 25% burnin are actually stored and
analyzed. Sicoli and Holton stored only every 500th sample, while I did that with every 1000th
one. The difference between our choices is not significant, but it is very important to thin out one's
sample with a value of around that magnitude. For technical reasons, consecutive samples from
phylogenetic MCMC are normally highly correlated with each other. In other words, they are not
statistically independent. Even though they are drawn from the true posterior, the dependence
6MrBayes has two different parameters that govern the number of MCMC processes run in a single analysis. One
is called nruns, and it represents the number of fully independent runs whose primary purpose is to help us detect
convergence, or a lack thereof, via the diagnostic average standard deviation of split frequencies. Another is called
nchains. That is a very different parameter. With nchains=1, MrBayes runs standard MCMC. With nchains>1,
MrBayes runs an improved version of the algorithm, called Metropolis-coupled MCMC (abbreviated MCMCMC, or
MC3; see [Altekar et al., 2004] for the explanation in the context of MrBayes). That advanced algorithm creates, in
addition to the true MCMC chain, several separate fake MCMC-like chains that follow looser rules for accepting a
new hypothesis. The output of those chains cannot be used directly: taken as a whole, it is just junk. But what we
can do with it is check whether a hypothesis from one of the junk chains would be accepted if it were to appear in the
main, true chain. Nicely, this swap was shown not to invalidate the good properties of the true chain. Allowing such
swaps of hypotheses with the junk chains helps the software to sample from the posterior more efficiently: the junk
chains traverse the hypothesis space as scouts, and find some good hypotheses that the main chain would otherwise
only find after a long time searching. The non-technical bottom line is that it is good to keep the default value
nchains=4 in one's analyses. It is safe to do this, and it should increase the efficiency of the analysis.

means that they represent less information about the posterior's structure than independent samples
would, making our estimates noisier. To reduce this, we thin them out, with the hope that our stored
samples are far enough in the chain to be largely independent. There is also a way to see, for each
parameter of interest, how independent our samples really appear to be, even after thinning out.
This is summarized for each estimated parameter separately in the indicator ESS (for effective
sample size), reported by MrBayes in the output of the sump command after an analysis has finished.
Sicoli and Holton's 1.5 million post-burnin steps yielded 3001 stored samples, while each of my runs with the same
number of steps yielded 1501. For the inferred parameter of the tree height, the ESS
for one of Sicoli and Holton's analyses was over 300, clearly much smaller than the actual number
of samples. For my replica, it was about 500 for each run, even though the overall number of my
samples was smaller. On the other hand, for another parameter Sicoli and Holton's ESS was much
larger than my ESS. This illustrates that it is not trivial to predict in advance how to get a higher
ESS. However, an ESS of 200-300 is generally considered to be sufficient for inference.
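The intuition behind ESS can be illustrated with a simplified version of a textbook autocorrelation-based estimator. This is not the estimator MrBayes itself uses, and the chains below are synthetic:

```python
import random

# Sketch of the idea behind ESS: positively autocorrelated draws carry
# less information than their raw count suggests. A common estimator is
# ESS = n / (1 + 2 * sum of positive autocorrelations). This is a
# simplified textbook version, not MrBayes's exact computation.

def autocorr(xs, lag):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var

def ess(xs, max_lag=50):
    total = 0.0
    for lag in range(1, max_lag + 1):
        rho = autocorr(xs, lag)
        if rho < 0:            # truncate once correlations die out
            break
        total += rho
    return len(xs) / (1 + 2 * total)

random.seed(2)
# A strongly autocorrelated chain (each value drags the previous along)...
chain, x = [], 0.0
for _ in range(2000):
    x = 0.9 * x + random.gauss(0, 1)
    chain.append(x)
# ...versus independent draws of the same length.
indep = [random.gauss(0, 1) for _ in range(2000)]

print(ess(chain), ess(indep))   # far fewer effective samples in the chain
```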
When data are loaded into MrBayes, the user needs to specify which coding scheme was used
when collecting the data. For example, if we compiled in advance a list of typological features
and recorded the values for them for each language, then we have coded our features exhaustively,
regardless of what the values actually were. We tell this to MrBayes with coding=all. The
default setting for this parameter is, however, coding=variable, which means that only features
with non-uniform values were recorded. This would have been appropriate if we did not have
a feature list in advance, but consciously included only interesting features on which we knew
our languages had different values (and which are therefore more useful for phylogenetic classification). As
another example, when lexical cognacy data are used for linguistic phylogenetic analyses (often
obtained from Swadesh lists), each cognate class is usually recorded as a separate binary character.
In this case, the proper coding scheme is coding=noabsencesites, meaning that we did not record
characters that were absent in all languages in the sample. Indeed, if there is a cognate class in our
family, but we did not see its representatives in our languages (for instance, it could have existed
only in ancient languages for which we had no data), then we have no way of knowing it existed.7
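A toy data matrix makes the difference between the three coding schemes concrete (the matrix is invented purely for illustration):

```python
# Rows are languages, columns are binary characters; we show which
# columns each coding scheme would record. The matrix is made up.

matrix = [
    [0, 1, 0, 1],   # language 1
    [0, 1, 1, 0],   # language 2
    [0, 1, 1, 1],   # language 3
]
columns = list(zip(*matrix))

coded_all = columns                                        # coding=all
coded_variable = [c for c in columns if len(set(c)) > 1]   # coding=variable
coded_noabsence = [c for c in columns if any(c)]           # coding=noabsencesites

print(len(coded_all), len(coded_variable), len(coded_noabsence))   # → 4 2 3
```

The all-zero first column survives only under coding=all, and the all-one second column is dropped by coding=variable but kept by coding=noabsencesites.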
Furthermore, we need to tell MrBayes what it should expect in terms of the probability of seeing
a 0 or a 1 for each feature at the root node, that is, the most remote proto-language. Why do
we need this? Luckily, once those probabilities at the root are set, it becomes possible to compute
the exact probability of having generated our observed data if our current tree and evolutionary
7In the current version of MrBayes, 3.2.6, the code prohibits setting coding to all or noabsencesites when
the data type is standard, which is what is used in Sicoli and Holtons datafile. There are two ways to solve this
technical issue. The simpler one is to edit Sicoli and Holtons .nex data file, replacing datatype=STANDARD with
datatype=RESTRICTION. However, this has a subtle analytical consequence, described below in this footnote. The
other way is to replace line 3679 in the source file model.c of MrBayes with the following stopgap line:
if((modelParams[i].dataType != RESTRICTION) && (modelParams[i].dataType != STANDARD))
and then re-compile the program from source.
Both standard and restriction data types allow binary characters. The reason MrBayes disallowed some coding
schemes for standard probably has to do with biologists using the program mostly employing standard characters
when recording biological morphology data (e.g., presence of wings, etc.). For such underlying data, the all coding
scheme would not make much sense, so it is reasonable to disallow it. However, for our linguistic purposes, the
prohibition currently implemented in MrBayes is not meaningful, and it's safe to remove it by using the line above.
If both standard and restriction can be used for binary characters, how do they differ? MrBayes uses different
evolutionary models for the two. For standard data, it assumes that change from 0 to 1 has the same probability
as change from 1 to 0. For restriction data, it allows non-equal rates of change in the two directions. For Dene-
Yeniseian data, allowing the rates of change between 0 and 1 to be unequal makes the Yeniseian clade more distinct
from Na-Dene than in Sicoli and Holtons original analysis, but this effect is mild compared to that of changing the
tree prior. I report in the main text only analyses using the standard data type. First, this makes comparison more
favorable for Sicoli and Holton's results, which I am arguing against. Second, since 0s and 1s in different characters of
Sicoli and Holton's dataset do not actually represent identical states, because they refer to very different linguistic entities, it
is far from obvious that unequal rates are any better than equal rates: neither corresponds to the reality, wherein
most characters in the dataset would presumably each have their own unique true rates of change between 0 and 1.

models were the true ones. In other words, it becomes possible to compute the likelihood of our
MCMC hypothesis. But the process of letting MrBayes know the root frequencies is not trivial.
When data are coded as standard, in MrBayes we are not allowed to say that, for example, 0s
should be more frequent than 1s at the root. That is because biological standard characters
do not have a fixed interpretation for either 0 or 1: those labels are arbitrary, so it does not
make sense to assign them specific global probabilities. The same is true for linguistic typological
data in [Sicoli and Holton, 2014]'s dataset: even though 0s code absences and 1s code presences,
there is hardly any real-world connection between the probability of having a 3-phoneme vowel
system and having the plural expressed on pronouns. Instead of asking us for specific probabilities,
with standard characters MrBayes draws the frequencies of 0s and 1s at the root (state frequencies)
randomly for each feature. The parameter symdirihyperpr determines how far from equal those values
are allowed to be. The default value for symdirihyperpr is fixed(infinity) in MrBayes's current
version, which forces the probabilities for 0s and 1s to be exactly equal. If we do not want that,
we can use a smaller number. With a number greater than 1, the preferred state frequencies are around 50%,
but the smaller the number, the farther they are allowed to deviate on average. (Infinity is just the
limit of this: with infinite symdirihyperpr, the frequencies are infinitely strongly forced to be close
to 50%.) With symdirihyperpr=fixed(1), any combination of state frequencies is equally likely
for every character. Finally, with symdirihyperpr smaller than 1, extreme frequency distributions
become more likely: e.g., MrBayes will assume the probability of 80% for 0s (or for 1s, as 0s and
1s are treated symmetrically) more frequently than the probability of 50%. Sicoli and Holton used
the default, i.e. equal state frequencies, in their analyses. I do the same in this paper. For one set of
settings, I tested how using a more liberal state-frequency prior would affect the results, and found
that changes in the inferred tree topology were small. (Sicoli and Holton's original result comes
out a bit stronger under the equal state frequency setting, so keeping it makes comparison more
favorable to their argument.)8
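For binary characters, the symmetric Dirichlet prior on the two state frequencies reduces to a Beta(a, a) distribution on the frequency of one of the states, so the effect of the hyperparameter can be illustrated with standard-library draws. The alpha values below are arbitrary illustrations of "much larger than 1", "exactly 1", and "below 1":

```python
import random

random.seed(3)

# How the analogue of symdirihyperpr controls how far root state
# frequencies may stray from 50%. For two states, the symmetric
# Dirichlet is just a Beta(alpha, alpha) on the frequency of one state.

def mean_abs_deviation_from_half(alpha, n=5000):
    draws = [random.betavariate(alpha, alpha) for _ in range(n)]
    return sum(abs(d - 0.5) for d in draws) / n

# Large alpha: frequencies hug 50%. Alpha = 1: any value equally likely.
# Alpha < 1: extreme frequencies (near 0 or 1) become typical.
for alpha in (100.0, 1.0, 0.1):
    print(alpha, round(mean_abs_deviation_from_half(alpha), 3))
```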
Sicoli and Holton used so-called gamma rate heterogeneity across characters, set up in MrBayes
by rates=gamma. This setting is common in today's phylogenetics. It allows us to somewhat correct
for the fact that different features may be subject to change at different speeds. Contrary to a
common misconception, gamma rate heterogeneity does not assign a special rate to each feature.
Instead, it computes the likelihood of the data separately for several possible rates of change, and
then averages across them. This helps to account for true rate differences because if a certain
feature changes very fast, the probability of observing its true values will be much greater under
a fast rate of change, and the corresponding term will dominate the others. Similarly for slow-
changing characters. The reason we do such averaging instead of actually assigning a different rate
for each feature is practical: with a separate rate for each feature, we would have greatly increased
the number of parameters to estimate, while our dataset would remain of the same size. This would
lower the quality of our statistical inference. That is why we have to settle for less precise, but
more practical gamma rate heterogeneity.9
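A minimal sketch of this averaging, for a single binary character on a single branch. The four category rates here are illustrative stand-ins with mean 1; real software derives them from the fitted gamma distribution, and the two-state change model is also simplified:

```python
import math

# Toy model: a symmetric two-state character at the two ends of one
# branch of length t, where P(ends differ) = (1 - exp(-2*r*t)) / 2
# under rate r. The likelihood is averaged over the rate categories
# with equal weights, as in discrete gamma rate heterogeneity.

RATES = [0.2, 0.6, 1.2, 2.0]   # illustrative category rates, mean 1.0

def p_differ(rate, t):
    return 0.5 * (1.0 - math.exp(-2.0 * rate * t))

def averaged_likelihood(observed_differ, t):
    per_category = [
        p_differ(r, t) if observed_differ else 1.0 - p_differ(r, t)
        for r in RATES
    ]
    return sum(per_category) / len(per_category)   # equal-weight average

# A character whose ends differ is explained mostly by the fast
# categories; a conserved character mostly by the slow ones.
print(averaged_likelihood(True, t=0.5))    # ≈ 0.27
print(averaged_likelihood(False, t=0.5))   # ≈ 0.73
```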

8If data are coded as restriction, then the state frequencies at the root cannot be set independently. Instead,
they are determined by the rates of change from 0 to 1 and from 1 to 0. (Those rates must be equal for standard
characters, but are allowed to be unequal for restriction ones.) Technically, state frequencies at the root in
MrBayes's model for restriction characters are the stationary frequencies of a Markov chain with rates of change
as transition probabilities. In practice, if the rate of change from 0 to 1 is twice as large as that from 1 to 0, this
means that 1s will be considered twice as likely as 0s to occur at the root. This is because the root itself is assumed
to be the result of a very long process of language change. After a very long time, the probability for the change
process to have 0 as its current value only depends on the rates of change between 0 and 1: after enough time, the
process forgets which value it originally started from. That's why one is not allowed to set up the probabilities for
0 and 1 at the root separately from the rates of change for restriction characters in MrBayes.
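The footnote's claim about rates and root frequencies is easy to verify numerically; here is a minimal sketch of the stationary-frequency calculation (not MrBayes code):

```python
# For restriction characters, the root state frequencies are the
# stationary frequencies of the two-state change process, fully
# determined by the two rates of change.

def stationary(rate_01, rate_10):
    """Stationary frequencies (pi_0, pi_1) of a two-state Markov
    process with the given rates of change."""
    total = rate_01 + rate_10
    return rate_10 / total, rate_01 / total

# Rate 0→1 twice the rate 1→0: 1s are twice as likely as 0s at the root.
pi0, pi1 = stationary(rate_01=2.0, rate_10=1.0)
print(pi0, pi1)   # → 1/3 and 2/3
```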
9One can further parametrize how gamma rate heterogeneity is implemented by setting up the number of gamma
categories ngammacat: the fixed number of distinct rates of change used by the algorithm. MrBayes's default is
4 categories, which is generally considered to be sufficient. When they define gamma heterogeneity in MrBayes,

Furthermore, Sicoli and Holton set up the clocked uniform tree prior, proposed by
[Ronquist et al., 2012a]. The tree prior determines the a priori probability of each language-family
tree regardless of the observed data. Somewhat surprisingly, several large classes of tree priors, in-
cluding all those commonly used in current phylogenetic software, assign equal probabilities to each tree
topology: the shape of the tree that does not take into account branching times (see [Aldous, 2001],
a.m.o.). That is why the tree prior can be conveniently decomposed into two parts: the prior on tree
topology (which, being uniform, does not have to be explicitly taken into account), and the prior
on branching times, or equivalently, on branch lengths. Correspondingly, the tree prior is set up in
MrBayes as brlenspr (for branch lengths prior). [Ronquist et al., 2012a]'s clocked uniform tree
prior assumes that the branching times are uniformly distributed over the time between the root and the present. It
is set up using the command brlenspr = clock:uniform. In the next section, we will see that the
choice of tree prior greatly affects the results for Dene-Yeniseian, and therefore we will discuss tree
priors in more detail below. Even more information intended to demystify different types of tree
priors is provided in Online Appendix B.10
Under the hood, a tree prior is often conditional on the tree height: the temporal distance between
the root and the leaves observed at the present. For the clocked uniform prior, the root age simply
works as the upper bound on the age of any node. Since under this prior all admissible combinations
of branching times are equally probable, the root age does not affect their relative
probabilities. For other priors inducing non-uniform probabilities on branching
times, those probabilities will often be relative to the root age. The tree height/root age itself is
also a parameter. MrBayes reports its inferred values under the name TH. As with any parameter in the
Bayesian setting, the tree height requires a prior, but fortunately, the default prior that MrBayes
uses is very reasonable, so we do not need to worry about it.11
The tree height, as well as the length of each branch, are by default expressed in abstract units.
To give them meaning, we need to explicitly connect them to the actual language change. This is
done through the molecular clock: the rule that says how the abstract units of time that we use as
branch lengths correspond to actual changes in linguistic features. It is common to set the
base rate of the molecular clock to exactly 1. If we do this, then for example a branch of length 0.05
will correspond to the amount of time during which approximately 0.05 changes per one character
occur. In other words, branch lengths can be read as the expected, on average, number of changes
along that branch. This is the choice made in all analyses in [Sicoli and Holton, 2014] and in this paper.12

[Sicoli and Holton, 2014] add command nst=6. That command affects only DNA data, as it determines how many
different change rates between the four nucleotides are allowed in the model. When data are binary, the command
has no effect.
10There is no standard name for this prior introduced by [Ronquist et al., 2012a]. I call it clocked uniform, but
it can also be called simply uniform. The problem is then to avoid confusion with another common uniform
prior, set up in MrBayes with command brlenspr = unconstrained:uniform. See Online Appendix B for more details.
11The prior currently used as the default is the exponential distribution with mean 1. It is reasonably non-
informative, usually allowing the data to determine the right height. One might wonder why we couldn't simply say
that any height is equally likely to occur. Technically, this amounts to a uniform prior distribution over the interval
from 0 to infinity. Even though it might seem attractive, using that distribution is in fact quite a poor choice with
rather unintuitive consequences: basically, it would mean that we expect super-high trees (corresponding, say, to
millions of years of language change) to be just as likely as reasonable-length ones. That can't be right. In addition,
there is a technical problem as well: the uniform prior from zero to infinity belongs to the category of improper
priors. In practical terms, this is harmless for, say, tree topology inference, but using an improper prior makes the
stepping-stone likelihood estimation procedure undefined. What's worse, MrBayes won't report that fact either, as
it does not check for priors being proper.
12One can also go further and try to translate expected numbers of changes into actual calendar years. This is
a difficult enterprise on many levels, and at the minimum requires adding calibration points: dates of existence for
some of the proto-languages or of the extinct languages of the family.

We already discussed the method of gamma rate heterogeneity, which relaxes the assumption
that all features in our dataset are subject to change at the same rate. However, we also know that
language change does not necessarily proceed at the same rate in different languages either. A simple
molecular clock that prohibits such rate-of-change variation between branches is called the strict
clock. But there are many different methods that introduce variation in the speed of language change
for different languages; the resulting clocks are called relaxed. [Sicoli and Holton, 2014] compare
the strict clock model with a specific relaxed clock model called TK02, [Thorne and Kishino, 2002].
The TK02 model belongs to the family of autocorrelated relaxed clocks. What this means is that
other things being equal, average rates at contiguous branches will likely be similar (so the rate at
one branch will correlate with the rate, i.e. with the same quantity, on the neighboring branch, hence
autocorrelation). The amount of allowed difference between the rates is probabilistically controlled
by the parameter called tk02var in MrBayes's output. The larger the parameter's value, the greater
the variation between neighboring branches. The value of that parameter is chosen at random for
each MCMC hypothesis. How those values are chosen is determined by the prior tk02varpr. The
default value for that prior is sensible, and is the one I kept in my analyses. (Sicoli and Holton did
not report theirs.)
Congratulations: we have just reviewed all of Sicoli and Holton's analysis settings! Online
Appendix A presents an example analysis description produced by MrBayes, and explains how its
parts correspond to the notions we have just discussed. It is time to turn to the analyses themselves.

2.3. MCMC inference on Sicoli and Holton's Dene-Yeniseian data: analyzing the re-
sults. First, we look at the choice between strict and relaxed molecular clock. Before starting their
main analysis, [Sicoli and Holton, 2014] wanted to determine whether the strict clock or the TK02
relaxed clock was a better model. Whether to apply computational methods of model selection
in this case depends on the researcher's judgement: strictly speaking, the strict molecular clock
is a special case of the relaxed TK02 clock, arising in the limit of vanishing rate variation. So in one
legitimate sense, TK02 cannot be worse than the strict clock. However, another legitimate way
to compare the two models is to ask whether TK02 with the best fitting parameters explains the
data better than the strict clock. Answering this question involves comparing the likelihood (recall
that this refers to the probability of generating our observed data assuming a particular evolutionary
model and history) under the strict clock and under those parameters of TK02 that maximize the
likelihood (in statistical parlance, under the maximum likelihood estimate). If this is our method
of comparison, then usually it will be the case that the more flexible model will win: since TK02
has more parameters, it can be more finely tuned to fit the observed data, and we can expect
its likelihood to be generally higher. This is indeed the case: while we do not explicitly compute
maximum likelihood estimates, we can use the likelihood of the best MCMC sample as its proxy.
MrBayes reports those values upon concluding an analysis as Likelihood of best state for cold
chain of run n. My replicas of Sicoli and Holton's analyses list a best log likelihood of ca. -1007 for
the strict clock, and ca. -993 for the TK02 clock. As expected, the more flexible TK02 clock wins
under that measure.
However, such winning comes at a price. A model with more parameters might be tuned very
tightly to the observed data, but that does not necessarily mean that it is the true model. For
example, suppose I toss a coin four times, and it lands tails in three of them. If I were to tightly
fit my model of the coin to the observations, I would conclude that the next coin toss has a 75%
probability of landing tails. However, if I used a regular everyday coin, it is probably not as biased as
that, and a simpler model that does not allow any bias would probably predict the future outcomes
better. This is one of the reasons why statisticians often prefer to balance the maximization of the
likelihood of the observed data and the flexibility of the model.
One popular approach to model selection that implicitly favors simpler models involves comparing
the Bayes factors of the models, and that is the approach [Sicoli and Holton, 2014] used. The

definition of the Bayes factor is deceptively simple: it is the ratio of marginal likelihoods derived
under the two models. The key here is the word marginal: it means that we are comparing not the best
possible values of likelihood, but rather the likelihood averaged over all possible parameter settings.
Online Appendix C discusses this in a bit more detail, but the crucial point is that a model that
derives very good likelihoods with good parameters, but terribly low likelihood with many other
parameter sets, is going to have a lower marginal likelihood than a model that consistently derives
only averagely good outcomes. We are thus comparing the performance of two models under a wide
range of parameters, not the ones that present each model in the most favorable light. From this it
follows that the result of Bayes factor comparisons depends on the range of parameters we deem to
be acceptable for each model: for instance, if we exclude beforehand some parameter values that
we know to be terrible, the resulting narrower model will have on average better likelihoods. It
is thus important to bear in mind that the Bayes factor method is sensitive to how we select the
priors for our model parameters.
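The coin example from above can be pushed one step further to show how a Bayes factor can favor the simpler model even though the flexible model achieves a higher best-case likelihood. Everything here is a toy calculation, unrelated to the Dene-Yeniseian models:

```python
# Toy Bayes factor, continuing the coin example: the observed data are
# one fixed sequence of 3 tails and 1 head. Model A is a fair coin (no
# free parameters); model B lets the tail probability p be anything,
# with a uniform prior on p.

def likelihood(p, tails=3, heads=1):
    return p ** tails * (1 - p) ** heads

# Model A has no free parameters: its marginal likelihood is just the
# likelihood at p = 0.5.
marginal_a = likelihood(0.5)

# Model B: average the likelihood over the uniform prior on p
# (midpoint-rule numerical integration).
n = 100_000
marginal_b = sum(likelihood((i + 0.5) / n) for i in range(n)) / n

# Best-case fit of model B, at the maximum likelihood estimate p = 0.75.
best_b = likelihood(0.75)

print(best_b)                    # ≈ 0.105: the flexible model wins on best-case fit
print(marginal_a)                # 0.0625
print(marginal_b)                # ≈ 0.05 (analytically exactly 1/20)
print(marginal_a / marginal_b)   # ≈ 1.25: the Bayes factor favors the fair coin
```

The flexible model is penalized because its prior spreads mass over many values of p that explain the data poorly, exactly the implicit preference for simpler models described above.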
Though theoretically taking the ratio of marginal likelihoods (i.e. computing the Bayes factor) is
straightforward, in practice it is not trivial to obtain marginal likelihoods. Fully accurate averaging
across the whole parameter space is out of the question for all but the simplest and most well-
behaved models. Because of that, most Bayes factors reported in the literature are only estimates
of the true Bayes factors. Sicoli and Holton try two different methods for estimation, both of which
are regularly used in phylogenetics: the harmonic mean and the stepping-stone methods.
The first method involves comparing the harmonic means of the likelihoods of the posterior samples
under the two models. That quantity is easy to compute, and is in fact reported in MrBayes's standard
output. However, it is well known to statisticians to be a terrible, very bad, absolutely not good
estimator of the marginal likelihood we are seeking. One way to show it is to simply note that the
variance of that estimator may be infinite (e.g., [Raftery et al., 2007]). In informal terms, this
means that you have no idea how far your estimated value for the likelihood is from its true
value. Another way to explain the problem (requiring more mathematical background than the
current paper assumes, but very convincing for those who can make it through) may be found in
a blog post by statistician Radford Neal, [Neal, 2008]. Intuitively, the issue is that to accurately
estimate the overall likelihood, we need to gather likelihood values from regions of the tree space
and evolutionary-parameter space that do not explain the data particularly well. By design, our
MCMC chain will pass through such regions only rarely. So our MCMC sample would normally
largely lack information crucial for the accurate computation of the marginal likelihood. The true
marginal likelihood is usually much lower than the harmonic-mean estimate reported by MrBayes
after a regular MCMC run.
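A standard toy example (unrelated to the Dene-Yeniseian data) makes the problem visible: posterior samples almost never visit the low-likelihood regions that the marginal average crucially depends on, so the harmonic mean estimate comes out too high:

```python
import math
import random

random.seed(4)

# One observation x = 0 from N(theta, 1), prior theta ~ N(0, 100^2).
# The true marginal likelihood is then N(0 | 0, 1 + 100^2), computable
# exactly. The harmonic mean estimate is built only from posterior
# draws, which concentrate where the likelihood is high.

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

prior_var = 100.0 ** 2
true_marginal = normal_pdf(0.0, 0.0, 1.0 + prior_var)

# The posterior for theta is N(0, v) with 1/v = 1 + 1/prior_var.
post_var = 1.0 / (1.0 + 1.0 / prior_var)
draws = [random.gauss(0.0, math.sqrt(post_var)) for _ in range(50_000)]
harmonic_mean = len(draws) / sum(1.0 / normal_pdf(0.0, t, 1.0) for t in draws)

print(true_marginal)   # ≈ 0.004
print(harmonic_mean)   # typically many times larger than the truth
```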
Hence the second method that [Sicoli and Holton, 2014] use: stepping-stone sampling
[Xie et al., 2011], conveniently implemented in MrBayes. The mathematical idea of the stepping
stone method is not too difficult, but nevertheless lies beyond the level of the current largely in-
formal discussion. What is important is that stepping stone estimation (when defined) is far more
accurate than the harmonic mean method. In particular, when the two disagree, the stepping stone
results are very likely to be the more accurate ones.
[Sicoli and Holton, 2014] obtained very close stepping stone estimates for the strict and the TK02
clock. This means that adding TK02 rate variation between languages did not on average help to
explain the data better, but at the same time it did not make matters worse on average either.
Based on that, we can conclude that a simpler model with the strict clock should be used on
practical grounds. Sicoli and Holton, however, also observe that the harmonic mean estimate for
the strict clock was substantially higher than the harmonic mean estimate for TK02. Based on
that difference, they declare that the strict clock model better fitted the data. This is incorrect: as
explained above, when stepping stone and harmonic mean estimates disagree, the stepping stone
ones should be accepted. So Sicoli and Holton's decision to use the strict clock is justifiable, but
not by the argument they employ.

Next, Sicoli and Holton turn to the main computational question of their paper: which tree struc-
ture is more likely for the Dene-Yeniseian macro-family given their typological data, [[Yeniseian],
[Na-Dene]] or [[Na-Dene I], [Na-Dene II, Yeniseian]]? To answer that question, it is sufficient to
look at the posterior distribution of Sicoli and Holtons baseline analysis. It is summarized in
the majority-rule consensus tree in Fig. 2(a), based on my (slightly modified) replica of Sicoli and
Holton's analysis. The majority-rule consensus tree contains only clades that are present in more
than 50% of the trees in the posterior sample. The fact that the Na-Dene family is not shown as
one clade means that in more than half of the trees, the Na-Dene languages do not form a single sub-family to
the exclusion of Yeniseian. In other words, the posterior probability of the topology [[Yeniseian],
[Na-Dene]] is less than 50%. In fact, examining MrBayes's output in more detail, we can determine
that the posterior probability of that topology is about 18%. Just as [Sicoli and Holton, 2014]
argued, family structure [[Yeniseian], [Na-Dene]] is not strongly supported by their analysis. The
alternative structure with the most support, namely with 58% posterior probability, is [[Yeniseian,
Californian Athabaskan], [other Na-Dene]].13
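For readers unfamiliar with consensus trees: the construction reduces to counting how often each clade (subset of tips) occurs across the posterior sample of trees. A minimal sketch with invented toy samples (not the real MrBayes output; the tip labels merely reuse abbreviations from Fig. 2):

```python
from collections import Counter

# Each posterior sample is represented by its set of clades, where a clade
# is a frozenset of tip labels. (Four toy samples, for illustration only.)
posterior_sample = [
    {frozenset({"kto", "zko"}), frozenset({"hoi", "ing"})},
    {frozenset({"kto", "zko"}), frozenset({"hoi", "ing"})},
    {frozenset({"kto", "zko"}), frozenset({"kto", "zko", "hoi"})},
    {frozenset({"kto", "hoi"}), frozenset({"hoi", "ing"})},
]

counts = Counter(clade for tree in posterior_sample for clade in tree)
n = len(posterior_sample)

# Majority-rule consensus: keep only clades appearing in > 50% of samples,
# annotated with their posterior probability.
consensus = {clade: counts[clade] / n
             for clade in counts if counts[clade] / n > 0.5}
for clade, pp in sorted(consensus.items(), key=lambda kv: -kv[1]):
    print(sorted(clade), f"{100 * pp:.0f}%")
```

Here {kto, zko} and {hoi, ing} each appear in 3 of 4 samples (75%) and enter the consensus, while the clades seen only once do not; the posterior probability of a topology is obtained the same way, by counting whole trees instead of single clades.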
So far, we have reviewed what [Sicoli and Holton, 2014] did in their computational analysis. Now
we get to the crucial point of this section: something they did not. [Sicoli and Holton, 2014] did
not check the robustness of their analysis to different choices of tree prior. When the dataset is
large, evidence from the data would usually overwhelm the tree prior, though this always needs to
be tested empirically by checking how different priors work on one's data. But the Dene-Yeniseian
typological-feature dataset is very small, with only 116 binary features and 84 unique patterns of
their distribution over the languages. To put this into perspective: [Ritchie et al., 2017] examine
the effect of tree-prior choice on three empirical datasets. In one of them, they find that the
tree prior strongly and adversely affected the results, while the other two were relatively fine. The
problematic dataset was smaller than the other two. It contained 14K DNA letters, with 188 unique
patterns across the taxa. The two larger datasets contained respectively 14K DNA letters with 5765
unique patterns, and 21K DNA letters with 6355 unique patterns. The Dene-Yeniseian dataset is
obviously much smaller than even the smallest, problematic dataset of [Ritchie et al., 2017]. It is
therefore quite likely that the Dene-Yeniseian analysis is sensitive to the choice of tree prior.
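The "unique patterns" counts above are simply the numbers of distinct columns in the character matrix, which is trivial to compute; a toy sketch with invented data:

```python
# A character matrix: one row per language (taxon), one column per feature.
# The number of unique patterns is the number of distinct columns.
matrix = {
    "lang1": [0, 1, 1, 0, 1],
    "lang2": [0, 1, 0, 0, 1],
    "lang3": [1, 1, 1, 0, 1],
}

# zip(*rows) iterates over columns; identical columns collapse in the set.
patterns = set(zip(*matrix.values()))
print(len(patterns))  # 4: columns 2 and 5 carry the same pattern
```

Only unique patterns contribute distinct terms to the likelihood, which is why this number, rather than the raw feature count, measures how informative a dataset is.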
The clocked uniform prior that [Sicoli and Holton, 2014] used is a mathematical abstraction that
does not arise from any known evolutionary process. It takes all branching times to be uniformly
distributed within the admissible intervals, which in our case simply means the interval between
the root of the tree and the present when the living languages are sampled. On the surface, this
might seem like a nice choice that does not bring any dangerous assumptions into our analysis. But
there are two problems with that reasoning. First, other tree priors can also be argued to be nice,
as we will demonstrate shortly. Second, it is in itself dangerous to feel safe without actually testing
how safe one's assumptions are. In the good case when we have enough data, the data should
overwhelm the tree prior, making its choice generally unimportant, and the results more or less the
same no matter which tree prior was used. But in the bad case when different tree priors lead to
different results, each of them effectively imposes its own preferences onto the analysis. Because of
that, it is always a good idea to test one's analysis for robustness against tree prior choice.
Here, I subject Sicoli and Holton's dataset to such a test, running exactly the same analysis, but
with the birth-death tree prior.14 Unlike the clocked uniform prior, the birth-death prior arises from
13Sicoli and Holton themselves argue for the [[Na-Dene I], [Na-Dene II, Yeniseian]] topology using a different
method, namely the Bayes factor comparison between marginal likelihoods, as described above for the choice between
two clock models. Unfortunately, their application of the method in that case was affected by a mathematical mistake:
their stepping stone likelihood estimation favored the [[Yeniseian], [Na-Dene]] topology, not the [[Na-Dene I], [Na-
Dene II, Yeniseian]] topology as Sicoli and Holton claimed. When comparison is done more carefully, however, it can
support their conclusion, under the clocked uniform tree prior. I further discuss the choice between the posterior
probability and the Bayes factor methods, and the latters application to this case, in Online Appendix C.
14For the birth-death tree prior, MrBayes implicitly enforces the improper uniform root-age prior from 0 to infinity.
Even if the user tries to set a different prior, this is overridden by the program without any message informing her
of that. This is all right for exploring the posterior, but makes MrBayes's stepping-stone estimates undefined for
birth-death priors, and therefore meaningless. As a practical consequence, the stepping-stone technique should not
be applied to analyses with birth-death priors. Importantly, MrBayes cannot and does not check this and would
report such estimates without error when asked; the responsibility to avoid that is completely on the user. This
serious issue is thus one of the hidden perils of phylogenetic analysis.


Figure 2. Majority-rule consensus trees. Two Yeniseian languages Ket (kto) and
Kott (zko) highlighted in blue; four Californian Athabaskan languages (Na-Dene) in
orange. Numbers on clades show posterior probabilities in percents. (a) Modified
replica of [Sicoli and Holton, 2014]'s analysis of Dene-Yeniseian excluding the unrelated language
Haida: gamma rate heterogeneity with 4 categories, strict molecular clock, branch
lengths represent expected number of changes per character. The replica differs from
the original analysis in assuming that all sampled characters were included in the data,
whereas [Sicoli and Holton, 2014] employed only non-uniform characters. Tree prior:
[Ronquist et al., 2012a]'s clocked uniform prior. (b) The same, but with a different tree
prior: birth-death. Trees prepared in FigTree, free software by Andrew Rambaut.

a specific biological model of the tree-generating process (see [Gernhard, 2008] for mathematical
analysis as well as references to earlier work). The birth-death model assumes that all languages (or
species) are always equally likely to split into two, with constant rate λ, and also equally likely to die
off, with constant rate μ. These assumptions lead to specific predictions regarding the probability
of the timing of branching events in the tree. (The model accounts for the fact that there would
often exist branches of the family that left no living descendants, in which case we can only detect
a subset of all branching events.) One of the model's predictions is that branching events become
more common as we approach the present. This is intuitively appealing: our family would usually
have more languages closer to the present, and it is natural to assume that among 10 languages, a
single split will happen sooner than if we have only 1 language. The birth-death prior thus prefers
to have branching events closer to the tips, other things being equal. This differs from the clocked
uniform prior, where branchings are distributed uniformly over time. In effect, the clocked uniform
prior implicitly assumes that splitting probabilities per language were greater in the distant past
than in the recent past. Arguably, the birth-death prior is then not less of a natural choice than
the clocked uniform one, and worth trying out. (I discuss common tree priors and the assumptions
between them in more detail in Online Appendix B.)
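The birth-death preference for recent branchings can be seen in a few lines of simulation. Under a pure-birth (Yule) process, the special case of birth-death with death rate zero, the waiting time to the next split when k lineages are alive is exponential with rate kλ, so splits bunch up as the family grows. A sketch of my own, for illustration:

```python
import random

random.seed(42)

def yule_split_times(n_tips, lam=1.0):
    # With k lineages alive, the time to the next split is Exp(k * lam):
    # any of the k lineages may split, each at rate lam.
    t, k, times = 0.0, 1, []
    while k < n_tips:
        t += random.expovariate(k * lam)
        times.append(t)
        k += 1
    return times

times = yule_split_times(40)
total = times[-1]
older_half = sum(1 for t in times if t < total / 2)
print(f"{older_half} of {len(times)} splits fall in the older half of the history")
```

Most splits land in the younger half of the tree's history, whereas under the clocked uniform prior they would be spread evenly; this is exactly the contrast visible between the node heights in Fig. 2(a) and Fig. 2(b).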
Fig. 2(b) shows the majority-rule consensus tree from an analysis that is exactly like my replica of
[Sicoli and Holton, 2014]'s, but with the birth-death tree prior instead of [Ronquist et al., 2012a]'s
clocked uniform prior. It is evident that the tree prior greatly affects the results! While the
clocked uniform prior, Fig. 2(a), did not strongly support the [[Yeniseian], [Na-Dene]] topology,
the birth-death prior does, Fig. 2(b). In fact, under the birth-death prior, the posterior support
for a Na-Dene clade, and consequently for the existence of Proto-Na-Dene, is 99%, so the results
under the birth-death prior are much more concentrated and therefore certain than under the
clocked uniform prior.
Some of the differences in the consensus trees are not hard to trace back to the assumptions
induced by the tree prior. Under the clocked uniform prior, language splits are evenly distributed
through time, and indeed we can see in Fig. 2(a) that the nodes of the tree occur at very different
heights. The birth-death prior, on the other hand, favors trees where most splits are relatively
recent, and in accordance with that, most nodes in Fig. 2(b) occur relatively close to the leaves.
However, this general observation does not by itself explain the topological difference between
Fig. 2(a) and Fig. 2(b): it could well have been that a tree with a topology like that in Fig. 2(a)
had a different distribution of node heights, on average closer to the present. This illustrates
that the choice of the tree prior can affect our inferences about the language family's topology in
non-trivial ways. Note also that a topological difference results from the tree prior even though the
prior probability of all topologies is the same in both analyses: the tree prior induces that change
indirectly, through its preferences regarding branching times.
To reiterate, in the good case when we have a lot of data, the data should to a large extent
override any preferences induced by the tree prior [Ritchie et al., 2017]. So the fact that we get
very different results under these two tree priors simply means that our data are very scarce.
Consequently, it is relatively pointless to argue on the basis of phylogenetics alone which tree prior
is better in this case: this would amount to an a priori discussion about the proper structure of
Dene-Yeniseian not informed by actual empirical evidence from that family. The conclusion that we
should derive from the phylogenetic analyses in this case is that we do not have enough evidence.15
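The general point, that priors wash out with plentiful data but drive the result when data are scarce, can be illustrated outside phylogenetics with a textbook conjugate coin-flipping example (my own sketch, unrelated to Sicoli and Holton's models):

```python
def posterior_mean(heads, n, a, b):
    # Beta(a, b) prior + n coin flips with `heads` successes
    # -> Beta(a + heads, b + n - heads) posterior.
    return (a + heads) / (a + b + n)

results = {}
for n in (10, 10_000):
    heads = round(0.7 * n)                           # observed frequency 0.7
    results[n] = (posterior_mean(heads, n, 1, 1),    # weak, flat prior
                  posterior_mean(heads, n, 50, 50))  # strong prior around 0.5
    flat, strong = results[n]
    print(f"n={n:>6}: flat prior -> {flat:.3f}, strong prior -> {strong:.3f}")
# n=    10: flat prior -> 0.667, strong prior -> 0.518
# n= 10000: flat prior -> 0.700, strong prior -> 0.698
```

With 10 observations the two priors yield visibly different answers; with 10,000 they agree to two decimal places. The Dene-Yeniseian dataset is in the former regime, which is precisely what the disagreement between Fig. 2(a) and Fig. 2(b) tells us.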

15In principle, one could try to apply standard model selection techniques to determine which tree prior is a
better fit for the Dene-Yeniseian data, just as with the clock models we discussed above. However, this would not
be advisable. The fact that the results are so sensitive to the choice of tree prior is an important indicator: it shows
that we have too small an amount of data. It is not very meaningful to ask which prior results in a better fit for a
dataset that is simply not informative enough.

Let us sum up. Sicoli and Holton's argument for the Beringian homeland of the Dene-Yeniseian
macro-family was based on their result that their computational-phylogenetic analysis did not
support the [[Yeniseian], [Na-Dene]] topology. However, it turns out that their computational
result depends on the choice of the tree prior. This in turn means that we do not have enough
linguistic data in the dataset, and that phylogenetics as such cannot actually help us to decide
which Dene-Yeniseian topology is closer to the truth.

3. Lexical and morphological evidence regarding the shape of the Dene-Yeniseian tree

For the sake of the argument, let's suppose that computational analyses provided strong evidence
against the [[Yeniseian], [Na-Dene]] topology (which is actually false, as we just discussed in Sec-
tion 2), and furthermore that the alternative tree topology [[Na-Dene I], [Na-Dene II, Yeniseian]]
strongly supported the Beringian homeland hypothesis (which is also false, as we showed in Sec-
tion 1). At least in this hypothetical case, could we conclude that the Dene-Yeniseian homeland
was in Beringia? Arguably, still not.
Sicoli and Holton's result implied that there was no Proto-Na-Dene that did not include the
Yeniseian languages. As in any statistical investigation, that result should be cross-validated: we
need to check whether it agrees with data external to the analysis. In the case at hand, it is
obvious that it does not. Yeniseian languages and Na-Dene languages are quite different from each
other, and there is no comparative-method reconstruction that supports the notion of Yeniseian
being a closer relative to some Na-Dene subfamily than the rest of Na-Dene. At the very least, the
genealogical unity of the Athabaskan subfamily within Na-Dene (torn into two by Sicoli and Holton)
is presupposed in the literature. In fact, the very reconstruction by [Vajda, 2011] that defends the
Dene-Yeniseian hypothesis employed Proto-Athabaskan forms reconstructed by [Leer, 2008] and
Vajda himself, and crucially not the material of individual Athabaskan languages.
Table 1 illustrates this using lexical data. One should note that the number of likely lexical
correspondences between Yeniseian on the one hand and Na-Dene on the other is quite limited
in the first place: overall ca. 100 sets at present [Campbell, 2011]. This can be compared
with ca. 800 cognate sets used by [Nikolaev, 2014] for the Na-Dene family. But even those Dene-
Yeniseian correspondences that are likely can demonstrate the point. There is a great difference
in the certainty of cognate correspondences within Na-Dene and between Na-Dene and Yeniseian.
Consider the Yeniseian and Athabaskan words for fire and foot in Table 1. Within Athabaskan,
the initial consonants of the relevant words represent one and the same pattern. Such regular
correspondences are a hallmark of convincing arguments for lexical cognacy (that is, for words
descending from the same ancestral word). In contrast to that, the correspondences between the
Athabaskan words and their likely Yeniseian cognates are much less transparent. The initial
consonant belongs to the same series in Athabaskan, but corresponds to q in fire and to k in foot in
Yeniseian. Of course, this does not by itself mean that the words are not related, and [Vajda, 2011]
proposes that the k sound in Ket kis is due to a differential development of uvulars before front
vowels in early Yeniseian. Still, when such constructs are based on a small number of examples
(and that number naturally depends on the overall number of known likely cognates), they have
much less explanatory power than when we observe exact sound correspondences as we do in the
Athabaskan examples. This is, of course, not the only issue: the coda correspondences are not
unproblematic either, as Vajda himself discusses. Finally, for fire, a meaning shift needs to be
postulated. Similar points can be made with respect to two other likely correspondences in Table 1.


Table 1. Selected likely lexical correspondences between Yeniseian and Na-Dene. All rows
represent items considered Dene-Yeniseian cognates in [Vajda, 2011] (an empty cell indicates
that the language is not reported to have the relevant cognate). Transliterations and within-family
cognacy judgements according to Global Lexicostatistical Database datasets for Yeniseian
[Starostin, 2013] and Athabaskan [Kassian, 2016]. For simplicity, only the first variant is
provided where several related items are available. Ket and Kott are Yeniseian languages.
Hupa and Kato are Californian Athabaskan, declared likely to be close relatives of Yeniseian
by Sicoli and Holton's analysis. Central Ahtena and Degexitan are Athabaskan languages spoken in
Alaska. Under [Sicoli and Holton, 2014]'s analysis, Hupa and Kato are expected to be closer
to Ket and Kott than to Ahtena and Degexitan.

          Ket              Kott    Hupa     Kato        Cent. Ahtena   Degexitan

cloud     qon 'dark'       ??                           qos            qUT
earth     baN              paN     ninP     neP
fire      qoN 'daytime'    ??      xoNP     khwo:NP     qhoP           qhUnP
foot      kis              ??      =xe-P    =khweP      =qhe-P         qha:-P

Overall, the picture is clear: while the Athabaskan languages, crucially including Californian
Athabaskan like Hupa and Kato, show a clear genealogical relationship, the matter of the Yeniseian
languages being related to them is far from obvious, which is precisely why Vajda's argument was
taken up with a lot of enthusiasm among historical linguists. But there is simply no evidence in
the lexical and sound-correspondence data that would suggest that the Californian Athabaskan are
more closely related to Yeniseian than to the other Athabaskan groups. [Sicoli and Holton, 2014]'s
phylogenetic findings strongly conflict with that.
Let's assess the situation. The bulk of linguistic evidence points to the existence of the Na-Dene
language family on the one hand, and the Yeniseian family on the other. Specialists arguing for
the Dene-Yeniseian connection on classical historical-linguistic grounds accept this as fact, and
consequently compare the reconstructed proto-languages of the two families, not the modern lan-
guages. There is nothing in the published linguistic evidence known to me that would suggest that
Californian Athabaskan (or any other Athabaskan subgroup) is more closely related to Yeniseian
than to other Athabaskan. The comparative method, by which this conclusion is reached, has been
tested since the 19th century on multiple language families of the world, and rightly remains the
scientific standard for proving language relationship.
At the same time, [Sicoli and Holton, 2014] come to a different conclusion based on a computa-
tional phylogenetic analysis. Their data consist of 116 binary features per language, often highly
correlated with each other. Note that if we recode the words in Table 1 into the binary format, we
will easily have more binary features than Sicoli and Holton used, though this represents only a
tiny fraction of the available data! (It is a different question how reliable such data would be, but then again
we have no reason to believe that all of Sicoli and Holtons features are particularly informative
about language genealogies.) On the basis of their limited dataset, Sicoli and Holton obtain best
support for topologies with a [Yeniseian, Californian Athabaskan] clade. But as we have shown in
the previous section, this happens with the clocked uniform prior, but not the birth-death prior, so
their result is prior-dependent. Finally, we do not have a guarantee that, even if Sicoli and
Holton's binary dataset were sufficient for accurate topology inference, their evolutionary model would
have uncovered the true history of the family. This point affects all phylogenetic methods, as the
models employed in them are very simple and sometimes dictated by mathematical convenience,
while the empirical processes of language change and language diversification have not yet been
studied with enough precision. This is not to condemn attempting computational phylogenetic
analyses, but when we interpret their output, we need to be aware of the fact that we do not have
a guarantee that our results reflect the truth, and had better cross-check what we get. Given all these
circumstances, it is clear that when Sicoli and Holton's phylogenetic analysis resulted in a topology
conflicting with the historical-linguistic body of knowledge on Yeniseian and Na-Dene, it is the
phylogenetic analysis that should be rejected as inaccurate. If the Dene-Yeniseian macro-family
exists at all, it features a basal split into Yeniseian and Na-Dene.
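To see how quickly lexical material translates into binary characters, consider the standard recoding in which every (meaning, cognate class) pair becomes one presence/absence feature. A toy sketch with invented cognate-class labels, only loosely modeled on Table 1 (kto and zko are the Ket and Kott codes from Fig. 2; hup and ktw are ISO-style stand-ins for Hupa and Kato):

```python
# Recoding multi-state cognate judgements into binary features: each
# (meaning, cognate class) pair becomes one presence/absence character.
# None marks a language with no recorded form for the meaning.
cognate_classes = {
    "fire":  {"kto": "A", "zko": None, "hup": "A", "ktw": "A"},
    "foot":  {"kto": "B", "zko": None, "hup": "B", "ktw": "B"},
    "earth": {"kto": "C", "zko": "C",  "hup": "D", "ktw": "D"},
}

languages = sorted({lg for forms in cognate_classes.values() for lg in forms})
features = sorted({(m, c) for m, forms in cognate_classes.items()
                   for c in forms.values() if c is not None})

matrix = {lg: [int(cognate_classes[m].get(lg) == c) for (m, c) in features]
          for lg in languages}

print("features:", features)   # 3 meanings already yield 4 binary features
for lg, row in matrix.items():
    print(lg, row)
```

Three meanings already yield four binary characters here, because "earth" splits into two cognate classes; a few hundred cognate sets would therefore easily exceed Sicoli and Holton's 116 features.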
There are in principle two reasons why Sicoli and Holtons computational analysis could have
produced inaccurate results: (i) the dataset carries wrong signal that does not reflect the true
family structure (this would happen if the feature distribution is primarily determined not by
shared descent, but by other factors for example, by chance, or by areal language contact); or
(ii) the dataset contains true genealogical signal, but not enough of it to show the true structure
when the prior discourages it. Our result that the birth-death prior produces the expected family
tree shows that the prior plays a role, but it does not allow us to decide between (i) and (ii). For
instance, while Ket and Kott are genealogically closely related, they are also close areally, and thus
whatever resemblances in the typological features they have may stem from areal effects and not
from shared descent. Thus at this point we cannot confirm whether the typological features of
Sicoli and Holton carry true phylogenetic signal.

4. Conclusion
[Sicoli and Holton, 2014] argued that computational phylogenetic analysis of their 116 binary-
feature typological dataset supported the Dene-Yeniseian likely macro-familys radiation out of
Beringia with back-migration into central Asia, rather than a migration from central or western
Asia to North America. This is incorrect. First, both migration hypotheses are compatible with
a range of linguistic phylogenies: finding the true phylogeny does not by itself decide the Beringia-
vs-deeper-Asia question. Second, Sicoli and Holtons phylogenetic results are sensitive to the choice
of tree prior: when their clocked uniform prior is replaced with the birth-death prior, we obtain
the topology with a basal split into Yeniseian and Na-Dene clades, and not the topology Sicoli
and Holton found where Californian Athabaskan is more closely related to Yeniseian than to other
Athabaskan. Third, while phylogenetic analyses themselves do not tell us which result is closer to
the truth, the bulk of linguistic evidence does. Linguistic data overwhelmingly support Athabaskan
languages forming a family, and allow us to firmly reject the hypothesis that Californian Athabaskan
are more closely related to Yeniseian than to other Athabaskan. If any phylogenetic estimate based on
Sicoli and Holtons small dataset has a chance to reflect the true family history, it would be an
estimate showing the expected [[Yeniseian], [Na-Dene]] tree structure, and not the one argued for
by Sicoli and Holton.
Why did an earnest attempt by [Sicoli and Holton, 2014] to answer an important question re-
sult in a demonstrably wrong answer? The authors did their best to be transparent about the
computational analyses and did ensure that their research was replicable. They attempted some
cross-validation strategies to check that their results were not spurious: in particular, they checked
what happened when they included or omitted the outgroup, the putative isolate Haida, or one of the
two Yeniseian languages. Thus in many ways they followed the best practices in the field. However,
there were several things they did not check which made their overall reasoning faulty. One of them
was the correctness of the predictions they spelled out for the two homeland hypotheses, Section 1.
The others are connected to the technicalities and the interpretation of computational phylogenetic
analyses. There are two important steps which any phylogenetic study should take:
Check whether the inferred results are sensitive to the choice of priors
In particular, analyses based on small datasets may be sensitive to, among others:
the choice of tree prior;
the choice of feature coding scheme;
the choice of molecular clock;
the choice of evolutionary substitution model.

Estimating the fit of different models to data, through stepping stone estimation of marginal
likelihoods, may provide important insight into which models work better. However, we cannot be
sure that our likelihood estimates are correct. Furthermore, even our best computational models
are probably at best approximately correct. So in addition to likelihood checks, it is important
to also check how stable our inferences are to different choices especially when those inferences
feed into further linguistic analyses. In Sicoli and Holton's case, the relevant inferred variable is
the topology, but in other studies it may be something else, such as the proto-language's age in
calendar years.
When our variable of interest is not inferred stably across different settings, this must be reported.
Care and caution should be taken when basing one's interpretation on a value that only comes out
in a subset of analytical settings: this is often a sign that our data are not sufficient for deciding
the question, and the appearance of certainty in any single analysis is created not by the certainty
of our linguistic evidence, but by the narrowness of our sometimes implicit assumptions.
Check whether the inferred results agree with external knowledge
Computational phylogenetics is good at drawing mathematically precise inferences from large
amounts of data that cannot be efficiently processed by a human analyst. However, it suffers from
two drawbacks. First, it can only work with formalized and relatively uniform data. Its strength is
in numbers. This differentiates computational phylogenetics from the comparative method, which
in some cases is able to draw categorical inferences about one-of-a-kind language-change events.
Second, we are not yet at the stage where we know for sure which, if any, computational phylogenetic
models fit the actual language-change processes well enough. We also do not know what amount
of language-family history is at least theoretically identifiable using current phylogenetic methods.
This does not mean such methods should be abandoned: it is only by studying many test cases that
we will be able to understand their value better. This is similar to how the value, and the limitations,
of the comparative method were understood as the result of many decades of research. However,
taken together, the two drawbacks mean that we cannot take phylogenetic results for granted even
when they have been meticulously cross-validated against small changes in the dataset, choice of
different priors, and other computational-phylogenetic factors.
Fortunately, historical-linguistic research often provides us with enough information to apply
sanity checks to computational phylogenetic results. Section 3 above is an example of that. In the
Dene-Yeniseian case, we can reject Sicoli and Holtons original phylogenetic results with certainty
because they are contradicted by a much more substantial body of evidence employed by historical
linguists studying the families in question. Similar checks need to be applied whenever possible,
which would usually require collaboration with historical linguists specializing in the families in question.
Conversely, the lack of such checks leads to peril: for example, [Bouckaert et al., 2012] report
strong support for the Anatolian homeland of the Indo-European language family based on phylo-
genetic geographical inference of the positions of proto-languages. The study was based on innova-
tive and a priori sensible statistical methodology, and the authors reported a sanity check of their
method using the Italic subfamily, whose proto-language was correctly inferred to have been spoken
in Italy. But for other Indo-European subfamilies, the results of their inference were obviously off
(unknown to the authors themselves): for example, no historical Iranian languages were inferred
to ever have existed in the Pontic and Siberian steppes, while historical sources are clear about the
Iranians' presence there. Given the method's failure to correctly infer the geographical positions
of relatively recent historical Iranian languages, its geographical inferences about the much more
temporally distant Indo-European homeland are without merit. Without thorough sanity checks,
even the best analytic procedures may lead to obviously incorrect conclusions.
Of course, sometimes accepted positions among specialists may turn out to be wrong, and it is
possible that a computational phylogenetic analysis brings new insight into the history. However,

when such an analysis conflicts with the accepted, or at least prominent, positions by the specialists,
this should be explicitly reported. Moreover, one should attempt to weigh against each other the
evidence for the accepted position vs. the evidence on which the phylogenetic result was based. For
example, in our Dene-Yeniseian case, the accepted position was based on large amounts of data
demonstrating the genealogical affinity of the Athabaskan languages, while Sicoli and Holton's result
was obtained from a very small dataset and was not even stable to the choice of the tree prior.
This makes it obvious which position is to be preferred. In other cases, the choice may be more
difficult. This, however, is not different from many controversies in historical linguistics: it is often
the case that each competing position may cite considerable positive evidence in its favor, though
they obviously cannot be true together. In such cases, there is no reason to dismiss computational
phylogenetic results as a priori less reliable. Equally, there is no reason to assume that they should
be more accurate than those established with the help of classical historical-linguistic approaches.

Supplementary Materials
Online Appendix A: Annotated output of MrBayes's settings
Online Appendix B: Tree priors
Online Appendix C: Posterior probabilities or marginal likelihoods?
Supplementary Materials 1: MrBayes command files, logs with MrBayes's output, consensus trees
stemming from MrBayes's analyses.

References
[Aldous, 2001] Aldous, D. J. (2001). Stochastic models and descriptive statistics for phylogenetic trees, from Yule to
today. Statistical Science, 16(1):23–34.
[Altekar et al., 2004] Altekar, G., Dwarkadas, S., Huelsenbeck, J. P., and Ronquist, F. (2004). Parallel Metropolis
coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics, 20(3):407–415.
[Bouckaert et al., 2012] Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S. J., Alekseyenko, A. V., Drummond, A. J.,
Gray, R. D., Suchard, M. A., and Atkinson, Q. D. (2012). Mapping the origins and expansion of the Indo-European
language family. Science, 337:957–960.
[Campbell, 2011] Campbell, L. (2011). Review of The Dene-Yeniseian connection. International Journal of American
Linguistics, 77(3):445–451.
[Clark et al., 2009] Clark, P. U., Dyke, A. S., Shakun, J. D., Carlson, A. E., Clark, J., Wohlfarth, B., Mitrovica,
J. X., Hostetler, S. W., and McCabe, A. M. (2009). The Last Glacial Maximum. Science, 325(5941):710–714.
[Gernhard, 2008] Gernhard, T. (2008). The conditioned reconstructed process. Journal of Theoretical Biology,
253(4):769–778.
[Hoffecker et al., 2016] Hoffecker, J. F., Elias, S. A., O'Rourke, D. H., Scott, G. R., and Bigelow, N. H. (2016).
Beringia and the global dispersal of modern humans. Evolutionary Anthropology: Issues, News and Reviews, 25(2):64
[Kassian, 2016] Kassian, A. (2016). Global lexicostatistical database. Na-Dene family: Athapaskan group. Available
[Kiparsky, 2015] Kiparsky, P. (2015). New perspectives in historical linguistics. In Bowern, C. and Evans, B., editors,
The Routledge Handbook of Historical Linguistics, pages 64–102. Routledge.
[Leer, 2008] Leer, J. (2008). Recent advances in AET comparison. Available at
[Neal, 2008] Neal, R. (2008). The harmonic mean of the likelihood: worst Monte Carlo method ever. Blog post,
University of Toronto.
[Nichols, 2008] Nichols, J. (2008). Language spread rates and prehistoric American migration rates. Current
Anthropology, 49(6):1109–1117.
[Nikolaev, 2014] Nikolaev, S. (2014). Toward the reconstruction of Proto-Na-Dene. Journal of Language Relationship.
[Potter, 2011] Potter, B. A. (2011). Archaeological patterning in Northeast Asia and Northwest North America: An
examination of the Dene-Yeniseian hypothesis. In Kari, J., Potter, B. A., and Vajda, E., editors, The Dene-Yeniseian
Connection, pages 138–167. ANLC, Fairbanks.
[Raftery et al., 2007] Raftery, A. E., Newton, M. A., Satagopan, J. M., and Krivitsky, P. N. (2007). Estimating the
integrated likelihood via posterior simulation using the harmonic mean identity. In Bernardo, J. M., Bayarri, M. J.,
Berger, J. O., Dawid, A. P., Heckerman, D., Smith, A. F. M., and West, M., editors, Bayesian Statistics 8, pages
1–45. Oxford University Press.
[Ritchie et al., 2017] Ritchie, A. M., Lo, N., and Ho, S. Y. W. (2017). The impact of the tree prior on molecular
dating of data sets containing a mixture of inter- and intraspecies sampling. Systematic Biology, 66(3):413–425.
[Ronquist et al., 2012a] Ronquist, F., Klopfstein, S., Vilhelmsen, L., Schulmeister, S., Murray, B. L., and Rasnitsyn,
A. P. (2012a). A total-evidence approach to dating with fossils, applied to the early radiation of the Hymenoptera.
Systematic Biology, 61(6):973–999.
[Ronquist et al., 2012b] Ronquist, F., Teslenko, M., van der Mark, P., Ayres, D. L., Darling, A., Hohna, S., Larget,
B., Liu, L., Suchard, M. A., and Huelsenbeck, J. P. (2012b). MrBayes 3.2: efficient Bayesian phylogenetic inference
and model choice across a large model space. Systematic Biology, 61(3):539–542.
[Ronquist et al., 2009] Ronquist, F., van der Mark, P., and Huelsenbeck, J. P. (2009). Bayesian phylogenetic analysis
using MrBayes: theory and practice. In The Phylogenetic Handbook, pages 210–266. Cambridge University Press,
2nd edition.
[Sicoli and Holton, 2014] Sicoli, M. A. and Holton, G. (2014). Linguistic phylogenies support back-migration from
Beringia to Asia. PLoS ONE, 9(3):e91722.
[Skoglund and Reich, 2016] Skoglund, P. and Reich, D. (2016). A genomic view of the peopling of the Americas.
Current Opinion in Genetics and Development, 41:27–35.
[Starostin, 2012] Starostin, G. (2012). Dene-Yeniseian: a critical assessment. Journal of Language Relationship, 8:117
[Starostin, 2013] Starostin, G. (2013). Global lexicostatistical database. Yeniseian family. Available at starling.
[Thorne and Kishino, 2002] Thorne, J. L. and Kishino, H. (2002). Divergence time and evolutionary rate estimation
with multilocus data. Systematic Biology, 51(5):689–702.
[Vajda, 2011] Vajda, E. (2011). A Siberian link with Na-Dene languages. In Kari, J., Potter, B. A., and Vajda, E.,
editors, The Dene-Yeniseian Connection, pages 33–99. ANLC, Fairbanks.
[Vajda, 2012] Vajda, E. (2012). The Dene-Yeniseian connection: a reply to G. Starostin. Journal of Language
Relationship, 8:138–152.
[Vajda, 2013] Vajda, E. (2013). Vestigial possessive morphology in Na-Dene and Yeniseian. In Hargus, S., Vajda,
E., and Hieber, D., editors, Working Papers in Athabaskan (Dene) Languages 2012, volume 11 of Alaska Native
Language Center Working Papers. ANLC, Fairbanks.
[Watson, 2017] Watson, T. (2017). Is theory about peopling of the Americas a bridge too far? [News feature].
Proceedings of the National Academy of Sciences, 114(22):5554–5557.
[Xie et al., 2011] Xie, W., Lewis, P. O., Fan, Y., Kuo, L., and Chen, M.-H. (2011). Improving marginal likelihood
estimation for Bayesian phylogenetic model selection. Systematic Biology, 60(2):150–160.