Professional Documents
Culture Documents
Ancient Lenguage Reconstructed App PDF
Ancient Lenguage Reconstructed App PDF
This Supporting Information describes the learning and inference algorithms used by our system (Section 1),
details of simulations supporting the accuracy of our system and analyzing the effects of functional load (Section 2),
lists of reconstructions (Section 3), lists of sound changes (Section 4), and analysis of frequent errors (Section 5).
where the second term is a standard L2 regularization penalty intended to reduce over-fitting due to data sparsity
(we used σ 2 = 1) [1]. The goal of learning is to find the value of λ, the parameters of the logistic regression model
for the transition probabilities, that maximizes this function.
1
1.2 A Monte Carlo Expectation-Maximization algorithm for reconstruction
Optimization of the objective function given in Equation 1 is done using a Monte Carlo variant of the Expectation-
Maximization (EM) algorithm [2]. This algorithm breaks down into two steps, an E step in which the objective
function is approximated and an M step in which this approximate objective function is optimized. The M step
is convex and computed using L-BFGS [3] but the E step is intractable [4], in part because it requires solving
the problem of inferring the words in the protolanguages. We approximate the solution to this inference problem
using a Markov chain Monte Carlo (MCMC) algorithm [5]. This algorithm repeatedly samples words from the
protolanguages until it converges on the distribution implied by our generative model. Since this procedure is
guaranteed to find a local maximum of the objective, we ran it with several different random initializations of the
model parameters. The next two subsections provide the details of these two parts of our system.
1 To be precise: the posterior is over both protolanguage strings and the derivations between these strings and the modern words.
2
We use L-BFGS [3] to optimize this convex objective function. L-BFGS requires the partial derivatives
∂O(λ) X X h X i λ
j
= N (C, ξ) fj (C, ξ) − θC (ξ 0 ; λ)fj (C, ξ 0 ) − 2
∂λj 0
σ
C∈C ξ∈S ξ
X X λj
= F̂j − N (C, ·)θC (ξ 0 ; λ)fj (C, ξ 0 ) − ,
σ2
C∈C ξ∈S
P P P
where F̂j = C∈C ξ∈S N (C, ξ)fj (C, ξ) is the empirical feature vector and N (C, ·) = ξ N (C, ξ) is the
number of times context C was used. F̂j and N (C, ·) do not depend on λ and thus can be precomputed at the
beginning of the M-step, thereby speeding up each L-BFGS iteration.
Bayes estimators are not only optimal within the Bayesian decision framework, but also satisfy frequentist opti-
mality criteria such as admissibility [11]. In our case, the loss we used is the Levenshtein [12] distance, denoted
Loss(x, y) = Lev(x, y) (we discuss this choice in more detail in Section 2).
Since we do not have access to π, but rather to an approximation based on S samples, the objective function
we use for reconstruction rewrites as follows:
S
X 1X
argmin Lev(x, y)π(y) ≈ argmin Lev(x, Xs )
x∈Σ∗ x∈Σ∗ S s=1
y∈Σ∗
S
X
= argmin Lev(x, Xs ).
x∈Σ∗ s=1
3
The raw samples contain both derivations and strings for all protolanguages, whereas we are only interested in
reconstructing words in a single protolanguage. This is addressed by marginalization, which is done in sampling
representations by simply discarding the irrelevant information. Hence, the random variables Xs in the above
equation can be viewed as being string-valued random variables.
Note that the optimum is not changed if we restrict the minimization to be taken on x ∈ Σ∗ such that m ≤ |x| ≤
M where m = mins |Xs |, M = maxs |Xs |. However, even with this simplification, optimization is intractable.
As an approximation, we considered only strings built by at most k contiguous substrings taken from the word
forms in X1 , X2 , . . . , XS . If k = 1, then it is equivalent to taking the min over {Xs : 1 ≤ s ≤ S}. At the other
end of the spectrum, if k = S, it is exact. This scheme is exponential in k, but since words are relatively short, we
found that k = 2 often finds the same solution as higher values of k.
2 Experiments
In this section, we give more details on the results in the main paper, namely those concerning validation using
reconstruction error rate, and those measuring the effect of the tree topology and the number of languages. We
also include additional comparisons to other reconstruction methods, as well as cognate inference results. In
Section 2.1, we analyze in isolation the effects of varying the set of features, the number of observed languages,
the topology, and the number of iterations of EM. In Section 2.2 we compare performance to an oracle and to two
other systems.
Evaluation of all methods was done by computing the Levenshtein distance [12] (uniform-cost edit distance)
between the reconstruction produced by each method and the reconstruction produced by linguists. The Leven-
shtein distance is the minimum number of substitutions, insertions, or deletions of a phoneme required to transform
one word to another. While the Levenshtein distance misses important aspects of phonology (all phoneme substi-
tutions are not equal, for instance), it is parameter-free and still correlates to a large extent with linguistic quality
of reconstruction. It is also superior to held-out log-likelihood, which fails to penalize errors in the modeling
assumptions, and to measuring the percentage of perfect reconstructions, which ignores the degree of correctness
of each reconstructed word. We averaged this distance across reconstructed words to report a single number for
each method. The statistical significance of all performance differences are assessed using a paired t-test with
significance level of 0.05.
4
2.1 Evaluating system performance
We used the Austronesian Basic Vocabulary Database (ABVD) [13] as the basis for a series of experiments used
to evaluate the performance of our system and the factors relevant to its success. The database, downloaded from
http://language.psy.auckland.ac.nz/austronesian/
on August 7, 2010, includes partial cognacy judgments and IPA transcriptions,2 as well as a several reconstructed
protolanguages.
In our main experiments, we used the tree topology induced by the Ethnologue classification [14] in order
to facilitate interpretability of the results (i.e. so that we can give well-known names to clades in the tree in
our analyses). Our method does not require specifying branch lengths since our unsupervised learning procedure
provides a more flexible way of estimating the amount of change between points of the phylogenetic tree.
The first claim we verified experimentally is that having more observed languages aids reconstruction of pro-
tolanguages. In the results in this section, we used the subset of the languages under Proto-Oceanic (POc) to
speed-up the computations. To test this hypothesis we added observed modern languages in increasing order of
distance dc to the target reconstruction of POc so that the languages that are most useful for POc reconstruction
are added first. This prevents the effects of adding a close language after several distant ones being confused with
an improvement produced by increasing the number of languages.
The results are reported in Figure 1(c) of the main paper. They confirm that large-scale inference is desirable
for automatic protolanguage reconstruction: reconstruction improved statistically significantly with each increase
except from 32 to 64 languages, where the average edit distance improvement was 0.05.
We then conducted a number of experiments intended to identify the contribution made by different factors it
incorporates. We found that all of the following ablations significantly hurt reconstruction: using a flat tree (in
which all languages are equidistant from the reconstructed root and from each other) instead of the consensus tree,
dropping the markedness features, dropping the faithfulness features, and disabling sharing across branches. The
results of these experiments are shown in Table S.1.
For comparison, we also included in the same table the performance of a semi-supervised system trained by
K-fold validation. The system was run K = 5 times, with 1 − K −1 of the POc words given to the system as
observations in the graphical model for each run. It is semi-supervised in the sense that target reconstructions for
many internal nodes are not available in the dataset, so they are still not filled.3
such as the usage of the okina symbol (’) for glottal stops (P) in some Polynesian languages or ad hoc conventions such as using the bigram
‘ng’ for the agma symbol (N). We have preprocessed as many of these exceptions as possible, and probabilist methods are generally robust to
reasonable amounts of encoding glitches.
3 We also tried a fully supervised system where a flat topology is used so that all of these latent internal nodes are avoided; but it did not
5
Lev(x, y) denote the Levenshtein distance between word forms x and y. Ideally, we would like the baseline
system to return:
X
argmin Lev(x, y),
x∈Σ∗
y∈O
where O = {y1 , . . . , y|O| } is the set of observed word forms. This objective function is motivated by Bayesian
decision theory [11], and shares similarity to the more sophisticated Bayes estimator described in Section 2. How-
ever it replaces the samples obtained by MCMC sampling by the set of observed words. Similarly to the algorithm
of Section 2, we also restrict the minimization to be taken on x ∈ Σ(O)∗ such that m ≤ |x| ≤ M and to strings
built by at most k contiguous substrings taken from the word forms in O, where m = mini |yi |, M = maxi |yi |
and Σ(O) is the set of characters occurring in O. Again, we found that k = 2 often finds the same solution as
higher values of k. The difference was in all the cases not statistically significant, so we report the approximation
k = 2 in what follows.
We also compared against an oracle, denoted ORACLE, which returns
argmin Lev(y, x∗ ),
y∈O
where x∗ is the target reconstruction. We will denote it by ORACLE. This is superior to picking a single closest
language to be used for all word forms, but it is possible for systems to perform better than the oracle since it has
to return one of the observed word forms. Of course, this scheme is only available to assess system performance
on held-out data: It cannot make new predictions.
We performed the comparison against a previous system proposed in [15] on the same dataset and experimental
conditions as used in [15]. The PMJ dataset was compiled by [16], who also reconstructed the corresponding
protolanguage. Since PRAGUE is not guaranteed to return a reconstruction for each cognate set, only 55 word forms
could be directly compared to our system. We restricted comparison to this subset of the data. This favors PRAGUE
since the system only proposes a reconstruction when it is certain. Still, our system outperformed PRAGUE, with
an average distance of 1.60 compared to 2.02 for PRAGUE. The difference is marginally significant, p = 0.06,
partly due to the small number of word forms involved.
To get a more extensive comparison, we considered the hybrid system that returns PRAGUE’s reconstruction
when possible and otherwise back off to the Sundanese (Snd.) modern form, then Madurese (Mad.), Malay (Mal.)
and finally Javanic (Jv.) (the optimal back-off order). In this case, we obtained an edit distance of 1.86 using our
system against 2.33 for PRAGUE, a statistically significant difference.
We also compared against ORACLE and CENTROID in a large-scale setting. Specifically, we compare to the
experimental setup on 64 modern languages used to reconstruct POc described before. Encouragingly, while
the system’s average distance (1.49) does not attain that of the ORACLE (1.13), we significantly outperform the
CENTROID baseline (1.79).
6
for pairs(G). The metrics are defined as follows:
|pairs(G) ∩ pairs(F )|
precision = ,
|pairs(F )|
|pairs(G) ∩ pairs(F )|
recall = ,
|pairs(G)|
precision · recall
F1 = 2 ,
precision + recall
1 X
purity = max |Gg ∩ Ff |.
N g
f
Using these metrics, we found that our system achieved a precision of 84.4, recall of 62.1, F1 of 71.5, and
cluster purity of 91.8. Thus, over 9 out of 10 words are correctly grouped, and our system errs on the side of under-
grouping words rather than clustering words that are not cognates. Since the null hypothesis in historical linguistics
is to deem words to be unrelated unless proven otherwise, a slight under-grouping is the desired behavior.
Since we are ultimately interested in reconstruction, we then compared our reconstruction system’s ability to
reconstruct words given these automatically determined cognates. Specifically, we took every cognate group found
by our system (run on the Oceanic subclade) with at least two words in it. That is, we excluded words that our
system found to be isolates. Then, we automatically reconstructed the Proto-Oceanic ancestor of those words using
our system (using the auxiliary variables z described in Section 1.2.1).
For evaluation, we then looked at the average edit distance from our reconstructions to the known reconstruc-
tions described in the previous sections. This time, however, we average per modern word rather than per cognate
group, to provide a fairer comparison. (Results were not substantially different averaging per cognate group.)
Using known cognates from the ABVD, there was an average reconstruction error of 2.19, versus 2.47 for the
automatically reconstructed cognates, or an increase in error rate of 12.8%. The fraction of words with each Leven-
shtein distance for these reconstructions is shown in Figure S.1. While the plots are similar, the automatic cognates
exhibit a slightly longer tail. Thus, even with automatic cognates, the reconstruction system can reconstruct words
faithfully in many cases, only failing in a few instances.
where the denominator is simply a normalization that insures FL` (x, y) ≤ 1. Note that if x and y are in comple-
mentary distribution in language `, then the two vectors cx,` and cy,` are orthogonal. The functional load is indeed
zero in this case.
In Figure 3 of the main paper, we show heat maps where the color encodes the log of the number of sound
changes that fall into a given 2-dimensional bin. Each sound change x > y is encoded as pair of numbers in the
unit interval, (ˆl, m̂), where ˆl is an estimate of the functional load of the pair and m̂ is the posterior fraction of the
instances of the phoneme x that undergo a change to y. We now describe how ˆl, m̂ were estimated. The posterior
7
fraction m̂ for the merger x > y between languages pa(`) → ` is easily computed from the same expected
sufficient statistics used for parameter estimation:
P
p∈Σ N (S, x, p, `, y)
m̂` (x > y) = P P 0 0
.
p0 ∈Σ y 0 ∈Σ N (S, x, p , `, y )
The estimate of the functional load requires additional statistics, i.e. the expected context vectors ĉx,` and expected
phoneme token counts N̂ (`), but these can be readily extracted from the output of the MCMC sampler. The
estimate is then:
ˆl` (x, y) = 1
hĉx,` , ĉy,` i.
N̂ (`)2
Finally, the set of points used to construct the heat map is:
n o
ˆlpa(`) (x, y), m̂` (x > y) : ` ∈ L − {root}, x ∈ Σ, y ∈ Σ, x 6= y .
3 Reconstruction lists
In Table S.2–S.4, we show the lists of consensus reconstructions produced by our system (the ‘Automatic’ column).
For comparison, we also include a baseline (randomly picking one modern word), and the edit distances (when it
is greater than zero). In Proto-Oceanic, since two manual reconstructions are available, we include the distances
of the automatic reconstruction to both manual reconstructions (‘P-A’ and ‘B-A’) and the distance between the two
manual reconstructions (‘B-P’).
4 Sound changes
In Figures S.2–S.5, we zoom and rotate each quadrant of the tree shown in Figure 2 of the main paper. For
information on the most frequent change of each branch, refer to the row in Table S.5 corresponding to the code in
parenthesis attached to each branch. By most frequent, we mean the change with the highest expected count in the
last EM iteration, collapsing contexts for simplicity. In cases where no change is observed with an expected count
of more than 1.0, we skip the corresponding entry—this can be caused for example by languages where cognacy
information was too sparse in the dataset. The functional load, normalized to be in the interval [0, 1] is also shown
for reference.
5 Frequent errors
In Figure S.6, we analyze the frequent discrepancy between the PAn reconstruction from our system with those
from [18]. The most frequent problematic substitutions, insertions, and deletions are shown. These frequencies
were obtained by aligning, after running the system, the reconstructions with the references. The aligner is a pair-
HMM trained via EM, using only phoneme identity and gap information [19]. The frequencies were extracted
from the posterior alignments.
8
[3] Liu DC, Nocedal J, Dong C (1989) On the limited memory BFGS method for large scale optimization. Mathematical Programming
45:503–528.
[4] Lunter GA, Miklós I, Song YS, Hein J (2003) An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. J.
Comp. Biol. 10:869–889.
[5] Tierney L (1994) Markov chains for exploring posterior distributions. The Annals of Statistics 22:1701–1728.
[6] Holmes I, Bruno WJ (2001) Evolutionary HMM: a Bayesian approach to multiple alignment. Bioinformatics 17:803–820.
[7] Bouchard-Côté A, Jordan MI, Klein D (2009) Efficient inference in phylogenetic InDel trees. Advances in Neural Information Processing
Systems 21:177–184.
[8] Caffo B, Jank W, Jones G (2005) Ascent-based Monte Carlo EM. Journal of the Royal Statistical Society - Series B 67:235–252.
[9] Judea P (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. (Morgan Kaufmann).
[10] Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17:368–
376.
[11] Robert CP (2001) The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation (Springer).
[12] Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10.
[13] Greenhill S, Blust R, Gray R (2008) The Austronesian basic vocabulary database: From bioinformatics to lexomics. Evolutionary
Bioinformatics 4:271–283.
[14] Lewis MP, ed (2009) Ethnologue: Languages of the World, Sixteenth edition. (SIL International).
[15] Oakes M (2000) Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. Journal of Quanti-
tative Linguistics 7:233–244.
[16] Nothofer B (1975) The reconstruction of Proto-Malayo-Javanic (M. Nijhoff).
[17] King R (1967) Functional load and sound change. Language 43:831–852.
[18] Blust R (1999) Subgrouping, circularity and extinction: Some issues in Austronesian comparative linguistics. Inst. Linguist. Acad. Sinica
1:31–94.
[19] Berg-Kirkpatrick T, Bouchard-Côté A, DeNero J, Klein D (2010) Painless unsupervised learning with features. Proceedings of the North
American Conference on Computational Linguistics pp 582–590.
Table S.1: Effects of ablation of various aspects of our unsupervised system on mean edit distance to POc. -Sharing
corresponds to the restriction to the subset of the features in OPERATION, FAITHFULNESS and MARKEDNESS that
are branch-specific, -Topology corresponds to using a flat topology where the only edges in the tree connect
modern languages to POc. The semi-supervised system is described in the text. All differences (compared to the
unsupervised full system) are statistically significant.
9
Table S.2. Proto-Austronesian reconstructions
Gloss Reference Baseline Distance Automatic Distance
tohold(1) *gemgem *higem 3 *gemgem
smoke(1) *qebel *q1v1l 3 *qebel
toscratch(1) *ka Raw *kagaw 1 *ka Raw
and(2) *mah *ma 1 *ma 1
leg/foot(1) *qaqay *ai 4 *qaqay
shoulder(1) *qaba Ra *vala 4 *qaba Ra
woman/female(1) *bahi *babinay 4 *vavaian 5
left(1) *kawi Ri *kayli 3 *kawi Ri
day(1) *qalejaw *andew 5 *qalejaw
mother(1) *tina *qina 1 *tina
tosuck(1) *sepsep *sipsip 2 *sepsep
small(2) *kedi *kedhi 1 *kedi
night(1) *be RNi *beyi 2 *be RNi
tosqueeze(1) *pe Req *perah 3 *pereq 1
tohit(1) *palu *mipalo 3 *palu
rat(1) *labaw *lapo 3 *kulavaw 3
you(1) *ikamu *kamo 2 *kamu 1
toplant(1) *mula *h1mula 2 *mula
toswell(1) *ba Req *ba Req *ba Req
tosee(1) *kita *kita *kita
one(1) *isa *sa 1 *isa
tosleep(1) *tudu R *maturug 4 *tudu R
dog(1) *wasu *kahu 2 *vatu 2
topound,beat(20) *tutuh *nutu 2 *tutu 1
stone(1) *batu *batu *batu
green(1) *mataq *mataq *mataq
father(1) *tama *qama 1 *tama
this(1) *ini *eni 1 *ani 1
tooth(1) *nipen *lipon 2 *nipen
tochoose(1) *piliq *piliP 1 *piliq
star(1) *bituqen *bitun 2 *bituqen
tobuy(23) *baliw *taiw 2 *taiw 2
tovomit(1) *utaq *mutjaq 2 *utaq
towork(1) *qumah *quma 1 *quma 1
wide(1) *malawas *Gawig 5 *malabe R 3
tocut,hack(1) *ta Raq *ta Raq *ta Raq
tofear(1) *matakut *takuP 3 *matakut
tolive,bealive(1) *maqudip *maPuNi 3 *maqudip
thunder(3) *de RuN *zuN 3 *deruN 1
tofly(2) *layap *layap *layap
toshoot(1) *panaq *fanak 2 *panaq
name(1) *Najan *ngaza 4 *Najan
tobuy(1) *beli *poli 2 *beli
and(1) *ka *kae 1 *ka
when?(1) *ijan *pirang 3 *pijan 1
todig(1) *kalih *kali 1 *kali 1
ash(1) *qabu *abu 1 *qabu
big(1) *ma Raya *ma Raya *ma Raya
tocut,hack(3) *tektek *tutek 2 *tektek
road/path(1) *zalan *dalan 1 *zalan
tostand(1) *di Ri *diri 1 *di Ri
sand(1) *qenay *one 4 *qenay
continued on next page
10
continued from previous page
Gloss Reference Baseline Distance Automatic Distance
meat/flesh(31) *isi *ici 1 *isi
what?(2) *nanu *anu 1 *anu 1
toswell(26) * Ribawa *malifawa 4 *abeh 5
bird(2) *qayam *ayam 1 *qayam
hand(1) *lima *lime 1 *lima
in,inside(1) *idalem *dalale 4 *idalem
if(2) *nu *no 1 *nu
toknow,beknowledgeable(2) *bajaq *mafanaP 5 *mafanaP 5
at(20) *di *ri 1 *di
wind(2) *bali *feli 2 *beliu 2
new(1) *mabaqe Ru *baPru 5 *vaquan 6
blood(1) *da Raq *da Raq *da Raq
breast(1) *susu *soso 2 *susu
i(1) *iaku *ako 2 *iaku
salt(1) *qasi Ra *sie 4 *qasi Ra
toflow(1) *qalu R *ilir 4 *qalu R
five(1) *lima *lima *lima
at(1) *i *i *i
other(1) *duma *duma *duma
leaf(2) *bi Raq *bela 3 *bela 3
all(1) *amin *kemon 3 *amin
rotten(1) *mabu Raq *mavuk 4 *mabu Ruk 2
tocook(1) *tanek *tan@k 1 *tanek
head(1) *qulu *PuGuh 3 *qulu
mouth(2) *Nusu *Nutu 1 *Nuju 1
house(1) * Rumaq *uma 2 * Rumaq
if(1) *ka *ke 1 *ka
neck(1) *liqe R *liq1g 2 *liqe R
needle(1) *za Rum *dag1m 3 *za Rum
he/she(1) *siia *ia 2 *siia
fruit(1) *buaq *buaq *buaq
back(1) *likud *likudeP 2 *likud
tochew(2) *qelqel *qmelqel 1 *qmelqel 1
salt(2) *timus *timus *timus
we(2) *kami *sikami 2 *kami
long(1) *inaduq *nandu 3 *anaduq 1
we(1) *ikita *itam 3 *kita 1
three(1) *telu *t1lu 1 *telu
lake(1) *danaw *ranu 3 *danaw
toeat(1) *kaen *kman 2 *kman 2
no,not(3) *ini *ini *ini
where?(1) *inu *sadinno 5 *ainu 1
how?(1) *kuja *gagua 4 *kua 1
tothink(34) *nemnem *n1mn1m 2 *kinemnem 2
who?(2) *siima *cima 2 *tima 2
tobite(1) *ka Rat *kagat 1 *ka Rat
tail(1) *iku R *ikog 2 *iku R
11
Table S.3. Proto-Oceanic reconstructions
Reconstructions Pairwise distances
Gloss Blust (B) Pawley (P) Automatic (A) B-P P-A B-A
fish(1) *ikan *ikan *ikan
five(1) *lima *lima *lima
what?(1) *sapa *saa *sava 1 1 1
meat/flesh(1) *pisiko *pisako *kiko 1 3 3
star(1) *pituqun *pituqon *vetuqu 1 4 3
fog(1) *kaput *kaput *kabu 2 2
toscratch(44) *karu *kadru *kadru 1 1
shoulder(1) *pa Ra *qapa Ra *vara 2 4 2
where?(3) *pai *pea *vea 2 1 3
toclimb(2) *sake *sake *cake 1 1
toeat(1) *kani *kani *kani
two(1) *rua *rua *rua
dry(11) *maca *masa *mamasa 1 2 3
narrow(1) *kopit *kopit *kapi 2 2
todig(1) *keli *keli *keli
bone(2) *su Ri *su Ri *sui 1 1
stone(1) *patu *patu *patu
left(1) *mawi Ri *mawi Ri *mawii 1 1
they(1) *ira *ira *sira 1 1
toliedown(1) *qinop *qeno *eno 2 1 3
tohide(1) *puni *puni *vuni 1 1
rope(1) *tali *tali *tali
smoke(2) *qasu *qasu *qasu
when?(1) *Naican *Naijan *Nisa 1 3 3
we(2) *kamami *kami *kami 2 2
this(1) *ne *ani *eni 2 1 2
egg(1) *qatolu R *katolu R *tolu 1 3 3
stick/wood(1) *kayu *kayu *kai 2 2
tosit(16) *nopo *nopo *nofo 1 1
toshoot(1) *panaq *pana *pana 1 1
liver(1) *qate *qate *qate
needle(1) *sa Rum *sa Rum *sau 2 2
feather(1) *pulu *pulu *vulu 1 1
topound,beat(2) *tutuk *tuki *tutuk 3 3
near(9) *tata *tata *tata
heavy(1) *mamat *mapat *mamava 1 3 2
year(1) *taqun *taqun *taqu 1 1
old(1) *matuqa *matuqa *matuqa
fire(1) *api *api *avi 1 1
tochoose(1) *piliq *piliq *vili 2 2
rain(1) *qusan *qusan *usa 2 2
togrow(1) *tubuq *tubuq *tubu 1 1
tosee(1) *kita *kita *kita
tohear(1) *roNo R *roNo R *roNo 1 1
tochew(1) *mamaq *mamaq *mama 1 1
louse(1) *kutu *kutu *kutu
wind(1) *aNin *mataNi *mataNi 4 4
bad,evil(1) *saqat *saqat *saqa 1 1
hand(1) *lima *lima *lima
toflow(1) *tape *tape *tave 1 1
night(1) *boNi *boNi *boNi
continued on next page
12
continued from previous page
Reconstructions Pairwise distances
Gloss Blust (B) Pawley (P) Automatic (A) B-P P-A B-A
day(5) *qaco *qaco *qaso 1 1
tospit(14) *qanusi *qanusi *aNusu 3 3
person/humanbeing(1) *taumataq *tamwata *tamata 3 1 2
tovomit(1) *mumutaq *mumuta *muta 1 2 3
name(1) *Najan *qajan *qasa 1 2 3
snake(12) *mwata *mwata *mwata
man/male(1) *mwa Ruqane *taumwaqane *mwane 5 5 4
tobreathe(1) *manawa *manawa *manawa
far(1) *sauq *sauq *sau 1 1
tobuy(1) *poli *poli *voli 1 1
tovomit(8) *luaq *luaq *lua 1 1
tocook(9) *tunu *tunu *tunu
thick(3) *matolu *matolu *matolu
leg/foot(1) *waqe *waqe *waqe
tobite(1) *ka Rat *ka Rati *karat 1 2 1
leaf(1) *raun *rau *dau 1 1 2
sky(1) *laNit *laNit *laNi 1 1
todrink(1) *inum *inum *inum
tostand(2) *tuqur *taqur *tuqu 1 2 1
i(1) *au *au *yau 1 1
warm(1) *mapanas *mapanas *mavana 2 2
moon(1) *pulan *pulan *vula 2 2
how?(1) *kua *kuya *kua 1 1
three(1) *tolu *tolu *tolu
toplant(2) *tanum *tanom *tan@m 1 1 1
mosquito(1) *namuk *namuk *namu 1 1
bird(1) *manuk *manuk *manu 1 1
four(1) *pani *pat *vati 2 2 2
water(2) *wai R *wai R *wai 1 1
one(1) *sakai *tasa *sa 3 2 3
skin(1) *kulit *kulit *kulit
toyawn(1) *mawap *mawap *mawa 1 1
he/she(1) *ia *ia *ia
nose(1) *isuN *ijuN *isu 1 2 1
thatch/roof(1) *qatop *qatop *qato 1 1
towalk(2) *pano *pano *vano 1 1
flower(1) *puNa *puNa *vuNa 1 1
dust(1) *qapuk *qapuk *avu 3 3
neck(18) * Ruqa * Ruqa *ua 2 2
eye(1) *mata *mata *mata
father(1) *tama *tamana *tama 2 2
tofear(1) *matakut *matakut *matakut
root(2) *waka Ra *waka R *waka 1 1 2
tostab,pierce(8) *soka *soka *soka
breast(1) *susu *susu *susu
tolive,bealive(1) *maqurip *maqurip *maquri 1 1
head(1) *qulu *qulu *qulu
thou(1) *ko *iko *kou 1 2 1
fruit(1) *puaq *puaq *vua 2 2
13
Table S.4. Proto-Polynesian reconstructions
Gloss Reference Baseline Distance Automatic Distance
rope(9) *taula *taura 1 *taula
short(11) *pukupuku *puPupuPu 2 *pukupuku
strikewithfist *moto *moko 1 *moto
coral *puNa *puNa *puNa
cook *tafu *tahu 1 *tafu
painful,sick(1) *masaki *mahaki 1 *masaki
downwards *hifo *ifo 1 *hifo
dive *ruku *uku 1 *ruku
toface *haNa *haNa *haNa
grasp *kapo *Papo 1 *kapo
bay *faNa *hana 2 *faNa
tail *siku *siPu 1 *siku
toblow(6) *pupusi *pupuhi 1 *pupusi
channel *awa *ava 1 *awa
pandanus *fara *fala 1 *fara
urinate *mimi *mimi *mimi
navel *pito *pito *pito
gall *Pahu *au 2 *Pahu
wave *Nalu *Nalu *Nalu
sleep *mohe *moe 1 *mohe
warm(1) *mafanafana *mafanafana *mafanafana
beak *Nutu *Nutu *Nutu
tooth *nifo *niho 1 *nifo
dance *saka *haka 1 *saka
leg *waPe *wae 1 *waPe
drown *lemo *lemo *lemo
water *wai *vai 1 *wai
nose *isu *isu *isu
taro *talo *kalo 1 *talo
tohide(1) *funi *huna 2 *suna 2
upwards *hake *aPe 2 *hake
overripe *pePe *pee 1 *pePe
tosit(16) *nofo *noho 1 *nofo
red *kula *Pula 1 *kula
tochew(1) *mama *mama *mama
three(1) *tolu *tolu *tolu
tosleep(10) *mohe *moe 1 *mohe
nine *hiwa *iwa 1 *hiwa
dawn *ata *ata *ata
feather(1) *fulu *fulu *fulu
canoe *waka *vaPa 2 *waka
left(11) *sema *hema 1 *sema
ashes *refu *lehu 2 *refu
dew *sau *sau *sau
small *riki *riki *riki
house *fale *hale 1 *fale
voice *lePo *leo 1 *lePo
octopus *feke *hePe 2 *feke
toclimb(2) *kake *aPe 2 *ake 1
sea *tahi *kai 2 *tahi
day *Paho *ao 2 *Paho
branch *maNa *maNa *maNa
continued on next page
14
continued from previous page
Gloss Reference Baseline Distance Automatic Distance
lovepity *PaloPofa *aroha 5 *PaloPofa
flyrun *lele *rere 2 *lele
thick(3) *matolutolu *matolutolu *matolutoru 1
tostrippeel *hisi *hihi 1 *hisi
sitdwell *nofo *noho 1 *nofo
black *kele *Pele 1 *kele
torch *rama *ama 1 *rama
15
continued from previous page
Code Parent Child Functional load Number of occurrences
16
continued from previous page
Code Parent Child Functional load Number of occurrences
17
continued from previous page
Code Parent Child Functional load Number of occurrences
18
continued from previous page
Code Parent Child Functional load Number of occurrences
19
continued from previous page
Code Parent Child Functional load Number of occurrences
(393) Not enough data available for reliable sound change estimates
(394) e (Northern NuclearBel) > i (Megiar) 0.04103 3.96950
(395) b (Nuclear WestCentralPapuan) > p (Mekeo) 0.01057 10.53926
(396) e (Northwest) > a (MelanauKajang) 0.05269 4.10012
(397) a (MelanauKajang) > e (MelanauMuk) 0.05542 8.30760
(398) N (LocalMalay) > n (Melayu) 0.17169 15.57376
(399) e (LocalMalay) > a (MelayuBrun) 0.10667 57.81798
(400) Not enough data available for reliable sound change estimates
(401) r (Vitiaz) > l (Mengen) 0.09236 7.76084
(402) a (WestSanto) > e (Merei) 0.12570 4.40581
(403) u (East Barito) > o (MerinaMala) 0.00538 45.42570
(404) p (WesternOceanic) > b (MesoMelanesian) 0.06671 3.25436
(405) e (ProtoMalay) > 1 (MesoPhilippine) 0.00192 27.69856
(406) s (ProtoMicro) > t (MicronesianProper) 0.14436 10.32542
(407) e (Malayan) > a (Minangkaba) 0.09530 36.65853
(408) b (RajaAmpat) > p (Minyaifuin) 0.08190 5.44290
(409) r (KilivilaLouisiades) > l (Misima) 0.09569 2.85853
(410) u (Malayic) > o (Moken) 0.00621 18.67524
(411) a (Ponapeic) > O (Mokilese) 0.00863 20.21510
(412) k (Bwaidoga) > P (Molima) 0.02226 6.39306
(413) r (MonoUruava) > l (Mono) 0.18018 7.90990
(414) Not enough data available for reliable sound change estimates
(415) Not enough data available for reliable sound change estimates
(416) N (SouthNewIrelandNorthwestSolomonic) > n (MonoUruava) 0.07198 4.50934
(417) t (CenderawasihBay) > P (Mor) 0.00191 7.90698
(418) a (Sulawesi) > o (Mori) 0.06991 15.10951
(419) d (ProtoChuuk) > t (Mortlockes) 0.10821 17.95820
(420) v (EastVanuatu) > w (Mota) 0.04447 4.78867
(421) v (SinagoroKeapara) > h (Motu) 0.01870 13.39891
(422) r (Bibling) > x (Mouk) 3.176e-05 16.89672
(423) u (Sulawesi) > o (MunaButon) 0.04575 3.73293
(424) N (Western Munic) > n (MunaKatobu) 2.731e-04 6.02727
(425) d (UlatInai) > r (MurnatenAl) 0.00181 5.95393
(426) r (ProtoOcean) > l (Mussau) 0.16171 3.95147
(427) a (EastVanuatu) > E (Mwotlap) 0.01111 27.83527
(428) r (EndeLio) > z (Nage) 0.02129 1.96583
(429) i (MalekulaCoastal) > e (Nahavaq) 0.05885 9.63004
(430) n (Willaumez) > l (NakanaiBil) 0.00412 1.02690
(431) a (LavongaiNalik) > @ (Nalik) 0.00211 12.96772
(432) u (CentralVanuatu) > i (Namakir) 0.18127 21.52549
(433) a (MalekulaCentral) > e (Naman) 0.13582 33.25250
(434) b (MalekulaCoastal) > p (Nati) 0.01049 19.54611
(435) r (SoutheastIslands) > l (Nauna) 0.10938 5.60587
(436) a (ProtoMicro) > e (Nauru) 0.13661 5.67814
(437) Not enough data available for reliable sound change estimates
(438) s (NehanNorthBougainville) > h (Nehan) 0.05608 13.16549
(439) N (Nehan) > n (NehanHape) 0.11803 18.93949
(440) a (SouthNewIrelandNorthwestSolomonic) > o (NehanNorthBougainville) 0.16106 4.34745
(441) e (Northern NewCaledonian) > a (Nelemwa) 0.13215 10.43158
(442) o (Utupua) > œ (Nembao) 0.01621 4.98208
(443) a (LoyaltyIslands) > e (Nengone) 0.20861 5.30862
(444) m (Vanuatu) > n (Nese) 0.35703 18.91500
(445) a (MalekulaCentral) > e (Neveei) 0.13582 34.11813
(446) e (LoyaltyIslands) > a (NewCaledonian) 0.20861 1.71479
(447) k (SouthNewIrelandNorthwestSolomonic) > g (NewGeorgia) 0.08381 5.00348
(448) e (MesoMelanesian) > i (NewIreland) 0.06281 3.31247
(449) w (BimaSumba) > v (Ngadha) 4.053e-05 14.09575
(450) i (Aru) > e (NgaiborSAr) 0.02299 6.77784
(451) f (NorthNewGuinea) > w (NgeroVitiaz) 0.01667 3.53470
(452) Not enough data available for reliable sound change estimates
(453) s (Gela) > h (Nggela) 0.05008 17.07009
(454) b (CentralVanuatu) > p (Nguna) 0.06113 6.26520
(455) p (Sumatra) > f (Nias) 0.00205 8.90278
(456) f (NilaSerua) > h (Nila) 0.01125 4.39065
(457) t (TeunNilaSerua) > l (NilaSerua) 0.13648 1.88492
(458) u (Nehan) > w (Nissan) 0.02937 8.95963
(459) a (Tongic) > e (Niue) 0.18204 4.31952
(460) l (North Babar) > n (NorthBabar) 0.16784 6.84401
(461) R (ProtoCentr) > r (NorthBomberai) 0.02692 10.12083
(462) v (WesternOceanic) > p (NorthNewGuinea) 0.12924 8.71798
(463) a (Nuclear PapuanTip) > e (NorthPapuanMainlandDEntrecasteaux) 0.27192 3.13375
(464) a (Northwest) > e (NorthSarawakan) 0.05269 5.44048
(465) e (Babar) > E (North Babar) 0.03978 1.27682
(466) b (Sulawesi) > w (North Minahasan) 0.05030 3.03447
(467) u (Vanuatu) > o (NortheastVanuatuBanksIslands) 0.06513 4.83269
(468) r (NorthernLuzon) > g (NorthernCordilleran) 0.01688 4.17784
(469) m (NorthernPhilippine) > n (NorthernLuzon) 0.15955 5.87812
(470) N (ProtoMalay) > n (NorthernPhilippine) 0.10751 18.96525
(471) o (NorthernCordilleran) > u (Northern Dumagat) 0.01648 1.25949
(472) l (EastFormosan) > n (Northern EastFormosan) 0.14981 5.83985
(473) p (Malaita) > b (Northern Malaita) 0.00727 4.99257
(474) a (NewCaledonian) > o (Northern NewCaledonian) 0.17212 3.00312
20
continued from previous page
Code Parent Child Functional load Number of occurrences
21
continued from previous page
Code Parent Child Functional load Number of occurrences
22
continued from previous page
Code Parent Child Functional load Number of occurrences
23
continued from previous page
Code Parent Child Functional load Number of occurrences
24
Figure S.1: Percentage of words with varying levels of Levenshtein distance. Known Cognates (gold) were hand-
annotated by linguists, while Automatic Cognates were found by our system.
25
m ( n - +
S ik a ia a m a r
F u tu k a u T a u
Uv e an a E a s t
P uka puk a
Uv e a E a s t
F ijia nBa u
Ta nim bili
As um boa
R e n nn a An iw
L ua ng iua
Ifira W e s t
e
F u tuM e le M
L ik u ma T ita
Tok e la u
Ta ne ma
Tu v a luna
Nu ku oro
Ne mb a o
Tonga n
pb t d kg q! 4
e ll e s
W uv ul u
Te a nu
l
B ig ay a ifu in
S iv is n
Bum a
Ka s s a u
As M is o o
Ta k u u
Va no
Na un a
No irn d e s a p e n
Niue
e ip o
a th B a bW a n
Va e a
L on i
M u siu
L ou
Lev e
B u li a n
u o pe n
An u ta
we
ar
bua
M a m fo r
L
m
M inr
i
W ba iY
a
,' f v s z x" 0 h#
Am ra u
Mo
Da i we ra D
la
War
E
s
S e s tM wa s
Ce ri l i a s e
333)
Im la M a
( 483)
T a e s tD E a s a s
( 264)
( 592)
E a pla g
( 694)
7)
B
( 704)
( 181)
N ji rN n g a m a r
N
( 703)
( 64
E m in
t
( 218)
G g a i Aru n B a
( 180)
W o u th a l M
( 537)
(
( 163)
( 13)
o
D
Ar
( 317)
e
( 598)
( 325)
a
( 323)
S n tr
T
( 459)
( 682)
W e s e b o rS
( 666)
( 708)
( 654)
M lu lo i a Te
( 702)
( 680)
( 524)
( 656)
( 442)
Am u A m b e s
S oi tu AKe i B l a
( 162)
( 98)
i y % / &u
( 26)
A a u ha a n n
a
H la t ube
P a m o
r
( 182)
E t r
( 330)
( 435)
( 408)
Al
( 68)
( 25)
B o u rnn e h i
( 8) )
( 378
B e fi a te n
)
eø 5* $o
( 742
( 581)
( 748)
l
( 153)
M a o m a
( 563)
( 22 )
( 729)
Ng a n m b y o ro
7
( 329)
M o b u Na a m
( 16 )
W n a
9
( 66
4 )
( 266)
( 58 8)
( 200)
( 11 )
( 14
8
( 97)
( 60 5)
S u r ri n
K o a d g ru
B e di o a a i
6 )
( 683)
( 6 1)
W o nd h a r
g o
1
)
04
( 13 )
(2
0
6 œ 7 8 .3
5)
( 707)
( 701)
( 46
4)
( 7 1)
na
( 481)
( 13
)
( 19
19
S o i m je w k
Ta
( 177)
( 417)
( 616)
a a
)
( 532
( 373)
( 485 )
( 718
)
)
(4 )
25
( 513)
5
a) 12
)
gg a
( 120
P
n
(9
au
(
)
k
ba
03
(5 )
2)
(6
a N ka
( 62
W a a
Ga a v u
u mo a
6)
ur nu
( 38
( 146)
S
( 6960)
d T
s t ile oy e s
S
7 )
)
92
)
(6
)
450
87
E a a l m b lor
)
61
1
(5
( 159)
( 738)
(
(1
(
)
( 160)
B a F
98
(6
L i d o e
( 116)
a n
L n g e be r i tu g
( 608)
E a m N an
)
5)
( 108
N a lue a l n
( 46
)
86
K a k r rg u n
(4
2)
P na a A a
( 61
( 7 6)
A e k a u e rm
)
22
(8
S ul iT
5)
P o t n i a i ri k
(4
( 426)
( 1)
) R to m b Te
)
86
26
( 3 65) ) A a un k a le
(5
( 1 28 M e t a e r o tI
( 360 )
( 357
( 44 )
( 4
T e m a l ol
( 2 517 9)
(
K am ah
( 8 )
( 72 7)
)
( 609
( 76 1)
L am
52
2)
)
L ik a a ng
4
( 33
(1
( 59 )
( 5 9)
( 57 9)
( 24 )
( 2 56)
( 1 716 8)
4
S ed
( 72
86 )
01 )
( 3 69)
( 6101
)
)
K ra i r
)
)
4 9 6
( 77)
( 1 44) 8) E a lu a i
(
) ( 2
( 30 8 2
) T put i
)
( 5
14
6) ( 25
6 5 A e ra n
(1
(1 (
9
5 )
( 2 496
)
7 31
)
0 7) P ugu
( 11) ( ( 3 06) ) T iun a
(
(1
55
) ( 3 91)
5 )
70
( 1 53) I l e ru
( 73
(2 (6 ) S ila
( 14 6) N un
Te s a r
0
( 5 0) )
9 8 9 r
Ki m a a m a
7)
( 6 3) ( 5 6)
(7
(2 2 5
)
(4
R o s tD s e
( 479
E a ti n e e n a a
7 ) e
L m d im b
( 45
)
0) 5)
Ya iTa n
19
) ) ( 74 ( 28 0)
(5
61 78 0)
(4 (1 ( 67 ( 54 Ke l a ru I ri a
S e i wa i rro
( 67
1)
1) Ko a m o n g
( 75 )
74 Ch m p u g
L a m e ri n
8 6 ) ( 2
) (2
6) (62 3
( 14 2)
5 )
Ko n d a e j a
(6 7 ( 32 S u ja n g R a b
5)
Re o liTa g
( 27 T b g a b il i lB
( 5 8 3)
T a n a d a iB
( 3100)
) Ko ro n g a n
( 29 ( 665
) S a raa n Ko ro
B il a a n S a ra
( 61
( 28
7 )
8)
i y ( 6 3
( 291)
8)
( 573)
% / B il a Ata y a&u
Ciu lili q Ata y
( 70)
) S q u iq
Sed n
4 ) ( 71
( 66
( 133)
B unu
( 1279)
( 50 4)
)
( 65335)
( 81) eø ( 625)
( 580) 5* $o
T h a o rl a n g
Favo
(
Ru k a i n
( 67 2)
( 28) 6)
P a iwa na bu
( 17
( 74) 0)
( 10
Ka na ka
( 737)
( 545)
( 492)
6 œ ( 260)
( 553 )
7 8 Ts ou
.3
S a a ro a
( 688)
( 687 )
P uy um a
( 269) Ka v a la n
( 530)
Ba s a i
( 147)
( 472)
( 109)
( 596)
a) ( 54)
12
Ce ntra lAmi
S ira y a
( 504)
( 556)
P a ze h
( 179) S a is ia t
( 521) ( 750)
( 559) ( 632) ( 91) ( 40) Ya ka n
( 560)
( 375) Ba jo
( 230)
( 629) M a pun
( 630) S a ma lS ia s i
( 628) Ina ba kno n
( 620)
S uba nun S in
( 728) ( 733)
Figure S.2: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross- ( 362) ( 123) S ub a no nS
( 365)
( 370)
W e s te rn Buio
M a no bo k
( 366)
referenced with the code in parenthesis attached to each branch. M a no bo W e s t
( 43) ( 27)
M a no bo Ili a
( 377) ( 614) ( 363)
( 79) ( 364)
( 369) M a no Di ba
M a n o bb o Ata d
( 367)
( 368)
( 355) ( 376) M a o Ata u
n o
M a n b o T ig w
( 40 ( 576) ( 233)
5) ( 42)
M a n o b o Ka la
( 118
)
( 80)
( 121 B in u ko b o S a ra
) M a ra id
(4 ( 507 Ira n u n a o
as n
( 25 ( 613 )
( 3193) 3) ( 37 ) ( S
B a li a k
1) 2 7 17) ( 225
( 49 (6 ) ( 3) )
(5 5) ( 6939) ( 102 ( 210
2) ( 20 ) ( 10
7) ) ) Mam
( 6 Il o an
g g wa
8) 35)
26 Hi l in
( 103
ga y o
)
( 662
) W a n
B u ra y W o n
( 25
(3 ( 37 2)
T a utu a n o na ra y
50 1)
) (7 ( 64
27 ( 64 1) s
)
( 57 0) Su ug
Ak ri g a o Jo l o
(1
(1 ( 6 51) (4 )
87
)
93
) ( 4 94)
( 547)
Ce bl a n o n n o n
( 59 7 K a ua n B is
M la ga o
( 13 5) )
(4
Ta a n s a n
(3 6)
77
49 (6
)
) 1
( 2 5) B ik ga lo k a
( 6 1)
45
)
T a o l N g An
33
(3
Ta gba a ga Ct
)
(1
4
26
( )
m ( n - +
M a no bo W e s t
P a iwa na bu
S uba nun S in
S ub a no nS io
W e s te rnB uk
Di ba
S a ma lS ia s i
M a n o b Ata d
ta u
M a n o b o T ig w
Ce ntra lAmi
Ina ba kno n
M a no bo Ili a
M a n oo b o Ka la
o S a ra
pb t d kg q! 4
T h a o rl a n g
Ka v a la n
Ciu lili q Ata y
A
P uy um a
a no bo
B il a n S a ra
S a is ia t
B il a Ata y a
a pun
M a no bo
Ka na ka
o
P a ze h
Ya ka n
S a a ro a
S ira y a
Ru ka i n
a a n o ra y
Ba s a i
Ira n u n a o
Il o nm a n wa
B u tura y W ao n
M a rak id
Ts ou
Sed n
Ba jo
Ak lri g a o nJo l o
Favo
S q u iq
Ce a n o o n
B unu
S u us ug n
an
Ka l b u a n n B i s
W aiga y n
B a li a k
Hi l g g o
M
B in u
,' f v s z x" 0 h#
M
Ta gba a ga t
M
k w Ka
as
a ga o
B a g b a n wa C
N g An
B a l a w P a l a Ab
B ik ga lo k a
Ta a n s a n
Ma
S Wa n g g a n B a w
S
at
S a la no a no
n
T
o l
P a nu a la w
M a d oa k N n a t
Dai n g hu a n o
M
H P i
a
a
K a y nga a k
ju
i y % / &u
t
T
( 733)
( 370)
( 366)
( 363)
M e r ri g a
( 364)
r u a n la
( 369)
D a ti k B
269)
K ya i
( 225)
( 54)
P
)
( 210
Ke u n j n y M a
)
( 103
( 375)
)
( 662
T a a ina h
(
( 40)
I o w n u a ra
M e l ai n c in g
eø 5* $o
L ga la y uS
( 367)
( 368)
( 687)
P i la y B s n
( 553)
O e y
M a ja e a y
( 260)
( 176)
Ch h a n a n y u a h e M
( 672)
M e l a re s i a
M a ru Ra g k a ru n s
( 580)
( 625)
M n n l
( 133)
B a
B an d o M a
( 71)
( 573)
( 70)
( 291)
a
M
( 638)
( 507)
6 œ 7 8 .3
ng b
)
h
( 630)
( 717
( 628)
( 728)
)
( 365)
( 102
C
)
( 635
( 27)
( 376)
am
( 233)
( 37 2)
( 596)
( 109)
( 472)
1)
( 25
I d y n n Ch
( 1 67)
n
)
a) 12
Gaba k e a n
37
(2
( 560)
n
( 91)
I o in
s u ur
( 576)
( 64 1)
Du n M r
( 4 79)
H
( 57 0)
( 42)
( 64
)
( 3 00)
( 121
aa o
(2
(4 )
( 49 )
u
( 4 98)
4
)
n
o a
( 613
( 3 87)
n d u g i tB
( 547)
7)
on
7)
( 2 31)
( 3)
( 10
B u im la b lu nL
( 4 31)
( 123)
ng k
( 3 03)
( 614)
( 3 8)
)
( 3 48)
37
T e t wa u
( 79)
(4
K i n ra i t h L o M u
)
99
( 556)
( 504)
( 147)
B e la a u
( 530)
l
( 1 1)
( 688)
( 492)
( 545)
1
( 737)
B e ny na n M a n
( 1395)
)
( 100)
(5
30
6)
( 28)
B e la na o a
(5
( 2 5)
( 355)
)
K e a a n oB
( 664)
( 81
)
45
(6
M ah g an
( 80)
L ng g k
( 750)
( 632)
( 230)
(6 )
2
E n g s a tas e c
( 6939)
( 37
)
77 )
)
( 6 276 E i a a B re s
)
27
N o b du B a
( 629)
(
(3
( 362)
)
5
T a ta n t
( 377)
( 6 2)
)
M a ya an
17
9) (6
( 1 07)
(2
(9 Iv ba uy y
(4
( 2 25)
I t a b l a te n g
8)
)
06
7)
( 28
B a ra y a ro n
( 118
( 61
Ir ba o
)
27
3)
) It a m a y
(7
( 6 51)
( 25
97 ) 169 ) )
5)
8)
I s a s ro d
)
3 )
93
( 49
3 2
(1
3 ( 0 ( 68
8)
4
o to a
( 1 8) (3 ( 2 37)
Iv o i
( 20
( 1
( 7 6) ( 2 9)
I ma m b a l B a n g
)
49
( 6 78) ( 3 34)
( 559)
Y a m mp a y n
( 620)
(3
(2 ( 2 38)
2 S a p a o B Ka
( 43)
( 35)
( 74)
( 65335)
( 50 4)
K u g a h a n Ke
( 1279)
( 2 40)
)
( 2 410 )
26
5)
( 2 8) If a lla ha n
(1
(
15 )
( 40
2
)
(2
) K a lla loi a n
(
54 2 )
( 22 7) K iba a s in nI
( 3193)
(5
1)
) ( 25 8) In ng uge k
(4
) 2 41
64
P a k i d tKa g a
( ( 25
2)
(4 7) )
(5
6 1
96
) ( 1 55)
( 4679 ) ( 5663)
( 25
6 ) Ka n g o Am
(3 (
( 75
2 ) (2
I l o g a o B a ta n
)
I fu g a o Gu i
)
50
3 2
(2
I fu n to k Gu i n o
(3
64) )
(5
5 ) ( 5
( 2 7)
1
2 20) B o n to c n a y N
B o nk a a w
)
2 (
87
7) (2 21) 89)
(1
9 2 (
Ka l a n g Gu i
( 4 ( )
77
) 8) ( 87
( 49
(4 13) B a linga on
(1
(2 1 9 ) ( 88) Ka e g B i ng
33
)
9) ( 2 2 6)
( 2 6 2) I tn d d a n p l o
( 6 1)
(5
6) ( 61
( 90
) Ga a P a m a g
Att e g Di b
4 4 )
(3 ( 2 5
9 )
I s nta
8 ) 1 ) 2 3
( 47 (4 (
8 4 ) as
( 1
0 ) Ag a g a tC
Du ma n o
( 3
)
( 60
5) 0) ( 236 k ia n
( 11 )
( 255 Il o
y a ne s ly n e s
( 2)
( 144) Y o gJa va y oP o
Old te rn M a la
)
( 216
0) s
( 47
( 46
9)
i y
( 471) % / W e in e s e S o
B ug h
M a loa s s a r
&u
( 468
) ( 754) ( 93)
( 48 8)
( 354) M a k T o ra ja
( 224) T a e S irT a b u
S a ng
eø 5* $o
( 94 )
( 344) ( 567) S a ng ir a ra
( 637) ( 566) S a ng ilS
( 243)
( 735)
( 611)
( 476)
( 565) Ba ntik ng
248) Ka id ip a
Go ron ta loHon
( 50) (
6 œ 7 8 .3
( 568) ( 202)
( 201) Bo la a ng M
( 203) ( 84) ( 518) P opa lia
( 691) ( 85) Bon e ra te
( 631)
( 423)
( 747) W una
( 739) ( 424) M una Ka tobu
( 179)
( 466)
( 46)
( 685)
( 684) a) Tonte mboa n
Tons e a 12
( 51)
( 746) Ba ngga iW di
Ba re e
( 521)
( 418)
( 270)
( 271) W olio
( 96) M ori
( 529) Ka y a nUm a Ju
( 213) ( 715) Buk at
P un a nKe la
W a mp a r i
( 749)
( 484)
Figure S.3: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross- Ya be m
( 67) ( 422) Nu m ba
referenced with the code in parenthesis attached to each branch. ( 624) ( 20) M ou k m iS ib
( 451) ( 7) ( 309) Aria
( 462) ( 713) L a mo
Am a g a iM u l
( 500)
( 585)
( 61)
( 268)
S e n gra s
a u ne n g
( 73)
( 475) K
( 394)
( 188)
B il iblo
il
g Au V
( 575 ( 292 ( 401) ( 72) ( 387) M e g ia
) ) ( 353)
Ge d a r
( 539
) M a ge d
B il iatu k a r
( 574
)
( 57
Men u
9) ( 272
) ( 661
)
M ge n
( 25 ( a
Ko le u
0) 600)
T a rvpe
( 4) ( 249
)
( 62 ( 35 S o ia
27 ( 48
0) ( 34
7) 8)
Ka b e i
R i wy u p u l
( 49 3) ( 28
4
9) ( 74 ) o a uK
( 31 3) K a
Ki i ri ru
)
( 55
( 20
( 62 8)
W os
( 46
3) 5)
( 10 6)
4) Al g eo
Au i
S a he l
( 28 ( 17
(5 )
S u l i b a a wa
08 1) ( 14
) ( 14 ( 41 0)
M au
1) ( 16
(7 ) 2)
36
)
( 72
0) ( 69 Gu a i s i n
(1
17
) ( 40 ( 18 5) Di o m a w
( 2 9) 5) M di a n
U oli o
(7
23
) (
80
) a
e s ia n
Y o gJa v a n e s la y o P o ly n
m ( n - +
Tonte mboa n
M una Ka tobu
Ka y a nUm a Ju
Go ro nta loHon
Ba ngga iW di
M ou k m iS ib
W a mp a r i
Am a g a iM u l
P un a nKe la
T a e S irT a b u
S a n g ir a ra
Bo la a ng M
M a k T o ra ja
Ba ntik ng
W e sin e s e S o
n g Au V
Bon e ra te
M a loa s s a r
Old te rn M a
Ka u s e n g
pb t d kg q! 4
Tons e a
P opa lia
Ka id ip a
S a ng ilS
Ya be m
Nu m ba
Ag t a g a tC
W olio
Ba re e
W una
at
M a a ge d
a uK
B il iatu k a r
M ori
S e n gra
B ug h
L a mo
e g ia r
S a ng
Du ma n o
Buk
M a ge n
B il iblo
il
Il o k y a
Ari a
Men u
Ko le u
R i wy u p u l
d
o ia
,' f v s z x" 0 h#
T a rvpe
Ka b e i
e
M
S u l i b a wa
G
Ki i ri ru
Al g e o
Ka o
a
S a he la
M i o d i wa n
( 422)
( 309)
S
( 20)
( 585)
W os
( 268)
Gu a i s i n
( 394)
M o b u a u i wa
( 73)
( 188)
M au
7)
Au i
a
D ma
38
U o l i mo
D d a
(
W e pa p
K is i a n
i y % / &u
Gab i r
Gai l i v i m a
D b la
( 67)
4)
( 500)
( 424)
( 747)
R u n ra i
( 518)
3)
( 202)
( 28
)
( 7)
( 475)
K ou a d
( 661
( 85)
( 248
)
( 565)
4
( 566)
( 600
( 567)
7
( 354)
( 72)
( 715)
( 749)
( 484)
(
( 93)
M ili a o
( 144)
M o ro i
eø 5* $o
( 18 5)
t
)
Ku i a r g a a s o u
5)
( 69
( 236
V al e
M o turu p u
( 2)
( 30)
L ek
( 41 0)
S a n d ri S
2)
( 14
( 624)
( 62 8)
)
( 31
(
6)
( 55
)
T a n go
( 249
( 684)
( 685)
( 61)
( 2 3)
( 401)
8
( 353)
( 35
4
( 5 5)
(1
K a
9
6 œ 7 8 .3
)
41
S o tp u a
Ku du boa n aa r
( 4 12) ( 30 5)
)
( 16
0)
39
5)
( 72
)
N im v i a t
R a an
( 574
( 2 9)
(
( 272
( 1 0)
( 40
8
)
83
( 713)
qa gg a e
5)
L o sa e
L u un a v gh
( 20
4)
P
)
k
)
)
82
( 739)
( 292
21
( 10
a) 12
( 271)
( 691)
(7
( 529)
0)
( 201)
( 96)
( 84)
( 476)
(4
( 25
( 50)
( 637)
a
ta
( 344)
( 17
( 6 61)
1)
( 4)
( 94)
( 488)
( 5 55)
a
( 754)
k oe g g
(2
( 14
( 5 44)
( 471)
( 2 90)
( 213)
7 )
)
( 4 93)
( 5 93)
bo e l o n u
(5
H
6
( 62
3)
1
)
( 2 37)
( 255
( 451)
01
2
( 34
)
23
u
K gh a n un o
( 3 12 )
( 2 96
(
3)
(7
( 3 35) )
( 46
)
U h ng v k e
( 575
)
( 539
)
94
G a ro re
)
9)
36
)
1)
)
(5
94 )
42
V a a s
( 57
( 28
( 2 96 )
(3
( 6 93 ) M b lo f e
ap
)
02
M o io
0)
( 1 06 )
)
)
( 462)
17
(5
30
( 48
7
( 82 S a op nH
0)
(7
(1
3 ) T e a n
( 11
( 9 1
(3 T h a u
Ne i s s u g g a a T
( 466)
( 746)
)
( 418)
( 423)
( 203)
46
( 46)
( 51)
( 568)
( 6 68) N a k in a n on
( 611)
)
08
54 ( 6 39) H i s a t Gh
( 224)
(5
(1
)
47
)
( 468
) ( 4 58) S b i
(4
B a a ri si s i
2
0
( 6 2) (4
9)
( 49
7
V a r o a Av
5)
(5
( 60
38
)
7) V i ri g g n a
( 4 7) ( 59 ) R e n a ta a
( 20 8
( 3 11) S a b hu na L o
(4
40) ( 7
10
) B a g a ta n a Ka
( 7 38) V a b a ta n a g a
9)
( 5 84) B a b a ta u n
( 46
( 5 5) B ab aL
( 3 05)
B ung o u
( 270)
( 631)
( 7 4)
L o n o Al
( 735)
(3 )
( 243)
9 )
7
(3 ) M on u
) (1
2 6
( 3 4) M o ra v a u ro
(6
10 ( 33 3) T ru a F a
0)
( 41 4) U ono
)
36
( 47
o
M l u r e Ho l m a
(7
( 41 6)
( 68 0)
( 70 5)
B i e k ge K a t
C h ri n eT
6)
( 41
( 12 )
8) M aa ri n g P o ro l
48
) ( 41 ( 37 )
9 M ga o ge L e
(4
( 38 )
1 g
N a ri n a Ki a a s
) ( 45 )
2
M ba n S a m
( 75 ( 15
7) ( 38 )
0
Z a ghu nga
04
) ( 75 )
5
L a a bla ga G
(4 ( 30
2
B l a bla n
( 73
2) ( 8 2)
B l k o ta a Y s
1)
( 83)
) Ko o k a k
Ki l n o n i
( 57 ( 2 8 9
2 )
B a d a ra u n g
) 8
( 124 (2
M an ga gT t
T u ra W e s
(3 4 0 )
Ka k
( 49) (6 9 2 )
Na lin g
( 26 5)
T ia a k
T ig irS u n g l a s
( 43 1)
( 673)
L ih a k L a m
i y ( 31 6) % / ( 674)
( 324)
( 339)
&u
Mad k
B a ro tu tu
( 53) M a u n a iB il
( 388)
Na k a la i
( 338) L a ka
( 741)
eø 5*
( 430)
( 304) $o
Vitu e
Y a p e s otuDh
( 393) M bu gh
( 714) ( 95) Bu go tu ol i
( 92)
( 651) Ta lis e M a la
( 753)
6 œ 7 8 ( 650)
( 195)
.3 Ta lis e M
Gh a riNdi
( 194) Gha ri
( 681) Tol o
( 648) Ta lis e
( 521)
( 190)
a)
( 204)
( 347)
( 196)
( 392)
12 M a la ngo
Gha riNgga e
( 179) ( 199) M bira o
( 652) Gha riTa nda
( 197) Ta lis e P ole
( 649) Gha riNgge r
( 198) Ta lis e Koo
( 189)
( 319) Gha riNg ini
( 523)
( 320) L e ngo
( 321) L e ng oGha
Figure S.4: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross- ( 453)
L e ng oP a im
( 158)
( 678) Ng ge la rip
referenced with the code in parenthesis attached to each branch. ( 618)
( 298)
( 312)
To a m ba ita
Kwa i
( 299)
L a n g a la
( 473) ( 390)
K wa io nga
( 313)
M ba
L a u e ngguu
( 389)
( 345) ( 301)
M ba
Kwa e le le a
( 175)
F a tara
( 315)
( 328) aeSo
( 314) l
( 346
) L a u Wle k a
( 551) L a uN a la d
e
( 550) L o n g o rth
( 621 ( 490 S a a gu
S a Uk iNiM
) )
( 552
OroahS a a Vil a
)
28 ( 548
( 142
)
Sa a l
S a aa Ul a wa
(1 )
11 ( 18)
) ( 19)
D o
Ar ri o
( 58 ( 549
) )
Ar e a re M
( 59
( 56 )
( 17
Sae
4)
( 24 )
2 a re a a s
W
( 57 7)
( 22 0)
B a a Au l u a i a
( 23 )
B a uu ro B a Vi l
( 21 ) F a g ro H ro o
(6 ( 24 ) Ka a n a u n u
57
) ( 60 6) S a h u a Mi
(1 ( 17 ) Ar n ta a
A o s Ca m i
(1 71
1 ) ( 1 3)
m ( n - +
L e ng oP a im
Y a p e s otuDh
Gha riNgga e
Gha riTa nda
Gha riNgge r
Ta lis e M a la
Ng ge la rip
T ig irS u n g l a s
Ta lis e P ole
Kwa ioa la n g a
Bu go tu ol i
Ta lis e Koo
L a u e ngguu
Gha riNg ini
To a m ba ita
L e ng oGha
M a la ngo
M a u n a iB il
L ih a k L a m
le S o l
Kwa le le a
Gh a riNdi
Ta lis e M
rth e
M bira o
Oro S a a Vi l a
a la d
Vitu e
Ta lis e
pb t d kg q! 4
B a ro tu tu
M bu gh
L e ngo
Na k a la i
S a a Uk iNiM
a uW k a
l
ae
Gha ri
Tol o
L a ng
a re a a s
L a ka
Kwa i
Mad k
S a a ggu
M ba e
S a aa Ul a wa
F a tara
B a a Au l u a i a
ba
o
T ia a k
Vi l
F a g ro H ro o
Ka a n a u n u
L a uN
Ar e a re M
Ar o s i Ca ta m i
M
S a ha
W
L on
B a uu ro B a
Aroo s i TOn e i b l
,' f v s z x" 0 h#
Ar ri o
L
t
S a h u a Mi
Ka s i a wa
S aa g a nn i Agwa V
o
Sae
D
T a wa a A i h u f
T nt iR u
( 678)
Ar n ta
( 298)
( 312)
( 299)
( 390)
( 313)
( 389)
F a ga oP a
( 301)
( 175)
Ur y e Ea k e ra th
( 315)
F a ur a
( 392)
( 199)
( 196)
( 652)
( 347)
( 197)
( 648)
( 649)
( 314)
( 681)
( 198)
( 194)
( 319)
Kwa n n ro gn a
( 195)
( 320)
( 650)
( 321)
( 651)
( 453)
B hu
( 551)
( 393)
a aS a
)
S e n me ou
( 95)
( 550
)
( 490
)
( 552
an
)
( 548
)
( 142
i y % / &u
( 18)
)
( 19)
m
rro l
549
( 339)
h
( 324)
( 53)
ut
( 674)
( 673)
i ut
( 431)
So
( 304)
Ame re i S o
( 430)
( 473)
( 388)
ou
S o a g o tl a e s M a
(
m
( 328)
)
( 58
eø 5* $o
S
M ra k
L
( 24 )
a
)
2
( 59
P e o tab ry
( 57 7)
R w m a ra
e
( 17
( 22 0)
)
( 621
A
( 23 )
( 204)
p
( 189)
( 21 )
M aa er
e
( 24 )
( 60 6)
N a u n n fa t
( 92)
P t
M
( 17 )
( 17 3)
( 56 4)
N g o E
a
9)
6 œ 7 8 .3
N k h
)
( 6 663
r
Or u
N a m a
Ne a h ti a k i
( 345)
( 4 5)
(
(3 )
( 1 2)
8
( 338)
(1
q
( 3 00)
5
0
( 4 0)
( 6 18)
An a p s e v a
( 316)
( 5 20)
ei
( 6 36)
( 4 10)
tO
)
99
An
a
or
( 4 91)
a) 12
( 5 27)
om
oP
( 6 31)
ej e
( 190)
( 4 07)
4)
a
( 714)
( 4 89)
a
1)
( 56
k v ei se
( 4 54)
T
( 74
S a v a v e a n te a n
26
( 4 34) )
( 4 32
(7
A e m wa n n e
)
)
( 346
29
N a l u An i a s K
)
50
N u lo a le e A
(1
P u l e wa s e s
)
57
P o ta k e k n
(6
)
19
)
W a uu oc ia
71
(1
)
(1
28 S h rtl i n s e s
( 618)
) ( 5 27) C o o l k e l e ro
52
2) 5) 5 )
(3
3
( 44 )
( 45
7
( 77 ) M a r u o ro a
( 33 5 ) C hu s nC a
)
67
( 753)
( 32
(4 ( 1 19) C o n pa nn
(4
( 4 06) S a i oA i e
( 1 31) S ul le a e s e s
( 1 03) P o k il p n
W o ila e a
( 523)
( 6 55)
( 1 659 4)
( 44
M ing a p i
2) )
0 ) ( 5 6)
)
(
2 2
57 (5 ( 5 44) P on a t
(5 s
)
( 7 11)
09
51
)
P i ri b a i e l l e
(7
( 4 12)
(3
( 5 14) K us s ha
( 158)
(5 K a r ru a
16
) M a u mw
(5
15
) N e le
(5 N we l a
Ja a n a
4 1 ) C a i one
(4
06
) ( 4 4)
4 Ia e ng
(2 N e hu a n ij
)
)
12
83
D tu me rn F
(1
( 2 97)
( 2 85)
(3 4) Ro e s t i
22
) ( 47
W a or
4) M h i ti
(5 0 5 )
(1 n
( 37 )
T a n rh yo tu
)
11
4 2
5) P e
) ( 6
36
(1
(4
( 44
6) 3)
( 54 4) ( 50 )
T u a mi h i k i
a n ua n o
3 8 9
4) (7 (6 )
( 36 ) M u ru t a n M
( 2143) 1
( 4 39) 5) ( 54 ) R a h i t n g a
6 i n
(1 ( 64 3 T ro to n th
( 64
2) ) Ra ti a an
( 33 ( 534
) Ta h u e s
i
6) ( 72
5)
( 12
2) ( 644
) Mar
q re v a
( 53
6) ( 383
) M a n ia n
ga
( 15 ( 359 w ai as
( 384
) ( 20 9 ) H pa nuiE
a
Ra m o a n
S a n a
o
)
( 533 62) ( 63) B e llo p ia
(5
( 675) T ik a e
( 163) E m ta
An u n e ll e s e
i y ( 481)
% / ( 182)
( 13)
( 537)
( 180)
&u
R e n n a An iw
F u tuM e le M
( 21 8) Ifira W e s t
( 51 3)
( 703) Uv e an a E a s t
( 116) ( 146) ( 181) F u tu k a u T a u
eø ( 563) 5* ( 704)
( 647)
$o
Va e a
Ta k u u
( 694 )
Tu v a luna
( 592) S ik a ia a m a r
( 162) ( 264) Ka pi ng
6 œ 7 8 ( 483)
( 333) .3
Nu ku oro
L ua ng iua
( 524) P uka puk a
( 680) Tok e la u
( 702) Uv e a E a s t
a) ( 682)
12
Tonga n
( 521)
( 683) ( 459)
( 177)
Niue
( 666) F ijia nBa u
( 707) ( 708) Te a nu
Va no
( 179)
( 159) ( 654)
( 98) Ta ne ma
( 701) ( 26) Bum a
( 738)
( 656) As um boa
( 581)
( 442) Ta nim bili
( 1)
( 616)
( 748) Ne mb a o
( 160) ( 330) S e im a t
W uv ul u
Figure S.5: Branch-specific, most frequent estimated changes. See Table S.5 for more information, cross- ( 426)
( 373) ( 153)
( 435)
( 317)
L ou
( 598)
Na un a
referenced with the code in parenthesis attached to each branch. ( 609
( 608)
( 729)
( 329)
( 325) L e ip o
S iv is n
L ik u ma T ita
) ( 323)
( 266)
Lev e
L on i
( 200)
( 417)
M u siu
( 108 ( 97)
)
( 532
) Ka s s a u
( 718 ( 408) Gim ira Ira h
B u li a n
( 33
) ( 485 ) ( 68)
)
( 46
5) ( 25) Mo
M inr
( 120
)
B i g ay a ifu in
( 72
4
29 ( 24 )
) ( 61
( 378
( 8) )
As M i s o o
2) ( 46
0 Wa l
( 13 )
Nu ro p e n
( 742
5 )
( 13 )
M a m fo r
( 62
2) 4)
(1
14 Am u ra
)
( 38 W ba iY
No irn d e s a p e n
( 66
6) 7
(1 ( 22 )
52
)
(4
5) ( 16 )
9
D a th B i W a
(6 ( 14 ) 4
Da we ra D a ba n
( 6 60) ( 58 8) r
Te l i a we
(5
( 4597) ( 11 8)
19
(
( 6101 0) Im M a a
)
(5 ( 60 5) r
01 )
)
86
) ( 19
2)
6) E m o ing s bu
(7
7 (
(1
6 E a pla a
1
A (p, v)
(o, ə) B a
i
C q
a
(r, d) m ʀ
(ʀ, r) e n
(p, b) t k
(w, o) r s
(i, e) s t
(c, s) k p
(u, i) u u
(e, i) l m
(b, p) ŋ ŋ
(o, w) p r
(i, u) q w
(k, g) v i
(j, s) n o
others others others
0 0.1 0.2 0.3 0.4 0 0.075 0.150 0.225 0.300 0 0.075 0.150 0.225 0.300
Fraction of substitution errors Fraction of insertion errors Fraction of deletion errors
Figure S.6: Most common substitution errors, insertion errors, and deletion errors in the PAn reconstructions
produced by our system. In (A), the first phoneme in each pair (x, y) represents the reference phoneme, followed
by the incorrectly hypothesized one. In (B), each phoneme corresponds to a phoneme present in the automatic
reconstruction but not in the reference, and vice-versa in (C).
30