You are on page 1of 26

Creating

Phylogenic Language
Trees
Michele Fretta
Saint Michael’s College
Colchester, Vermont
Phylogenic Trees

 Trees are a useful way to visualize the


connections between things.

 The methods we will use here are often used in


biological genetics.
Language Trees
Have you ever seen something like this?
It’s a language
tree, which
describes the
Latin
way several
languages are
related.

Latin is the
ancestor, and
the “leaves” are
French Italian
Portuguese the
Spanish
contemporary
languages.
 We will construct tree graphs which model the
relationships between the words of 12
languages.
 Warning! The trees in this presentation are examples only. The data they are based on is
not statistically significant.

 In (language) trees, it is assumed that a genetic


relationship implies an evolutionary
relationship
 Suppose we have data of words from several
different languages.

 How can we use the data to construct a visual


representation of the relationships among them
(i.e., a tree)?
Branch and Bound Method

 Based on data from several languages, we will construct a sample


tree.

 First, we will use matrices to organize the data…


Sample Cognate Table
 Cognates are words in two different languages which
have similar sound and meaning.
Dutch ALLES EN DIER ASCH
English ALL AND ANIMAL ASHES

French TOUT E ANIMAL SENDRE


German ALLES UND TIER ASCHE

Hindi SEB OR JANVER RAKH


Italian TUTTO E ANIMALE SENERE
Nepalese SAB AU JANAWAR KHAG
Persian HAME VA HEYVAN KHAKESTAR
Polish WSZYSTEK I ZWIERZE POPIAL

Portuguese TODO E ANIMAL SINDRA


Russian VES I ZVER PYOPEL

Spanish TODO I ANIMAL SENIZA


Cognate Matrix Sample
 “1” indicates that there is a cognate between
the two languages
 “0” indicates that there is no cognate.
Pair-wise Percentage Similarity
 Then, we use the cognate matrices to find the
percentage of similarity between each pair of
languages.
 Group and average.
Building the tree,
group by group

Spanish French Italian Portugues Russian Polish German Dutch English Hindi Nepali Persian
e
Here’s another method, which is often more reliable than Branch & Bound:

Kruskal’s Algorithm Method

 Parsimony trees are based on the assumption


that the least number of evolutionary steps is
the most likely.
 We will construct a graph which represents all
possible language relationships,
 And we will use a greedy algorithm in order to
find the shortest (most likely) relationships.
Pair-wise Cognate Percentages
 Begin with our matrix
of cognate percentages:

Dutch English French German Hindi Italian Nepalese Persian Polish PortugueseRussian Spanish
Dutch x
English 30% x
French 10% 10% x
German 70% 10% 1% x
Hindi 1% 1% 1% 1% x
Italian 10% 10% 90% 1% 1% x
Nepalese 1% 10% 1% 1% 60% 1% x
Persian 1% 11% 1% 1% 22% 1% 33% x
Polish 20% 1% 20% 10% 1% 20% 1% 1% x
Portuguese 10% 10% 80% 1% 1% 70% 1% 1% 20% x
Russian 20% 1% 20% 10% 1% 20% 1% 1% 40% 20% x
Spanish 10% 10% 99% 1% 1% 90% 1% 1% 30% 70% 20% x
The Complete Graph
Each edge represents an entry in our cognate-percentage matrix. They are color-coded by
percentage similarity.

99%
90%
80%
70%
60%
40%
33%
30%
22%
20%
11%
10%
1%
Implementing Kruskal’s Algorithm

German
Hindi
French

English Italian

Nepali
Dutch

Persian
Spanish

Russian
Polish
Portuguese
Our Minimum Spanning Tree
Russian Portuguese
Nepali

Persian Dutch French


Spanish

Polish
English

Italian
Hindi
German
Modifying the Tree
 Remember the tree with Latin?
 Latin is an internal vertex because it is an
ancestor.
 Some of the vertices in our minimal spanning
tree are internal vertices, but we want them all
to be leaves
 This is because leaves will represent present-
day languages, which we are working with in
this case.
Modifying the Tree

 Attach a leaf to each internal vertex.

 The former internal vertices will serve as


ancestors, like Latin. Sometimes, linguists
know very little about these “protolanguages”.
Modifying the Tree (cont’d)
Russian Portuguese
Nepali

Persian Dutch French


Spanish

Polish
English

Italian
Hindi
German
Comparing our trees…
They are rooted
differently,

but they are actually


quite similar!

Portuguese

Polish
Spanish

French

Russian

German

Dutch

English

Hindi

Persian
Italian

Nepali
Portuguese
Nepali Russian
Dutch
Persian Spanish French

Polish

English Italian
Hindi German
Rooting the Tree
 Our two trees would look more similar if they had the
same root.

 The root should be the oldest protolanguage, the


ancestor of all of the tree’s contemporary languages.

 Historians do this.
 Based on historical records, they can rearrange the tree so
that the older languages are closed to the root.
Class Problem
 There is a creole in India which is based on Portuguese.
 If this language began as a synthesis of Portuguese and Hindi,
how would you place it in this tree?
Portuguese
Nepali Russian

Dutch Hint:
Persian Spanish French
There might
be a
problem
with this ...

Polish

Italian
English
Hindi
German
Supplements…
Applications
 Parallels with modeling biological evolution
 Mapping the migrations of human populations
 Modeling the genetic similarities among
human populations
 Time divergences: using a decomposition
“clock” to estimate the number of years which
have separated languages.
Biology vs. Linguistics:
Inheritance and transference
 In biological trees, genes are compared to find these
relationships
 In language trees, words are compared
Common ancestor Common ancestor

Word transfer
Gene transfer
French English German
Bacteria Eukarya Archaea

Glycolysis Replication Boeuf Beef Cow Kuh

Electron Transcription Porc Pork Swine Schwein


Transport
Mouton Mutton Sheep Schaf
Photo- Translation
synthesis
Diagram from Searls, Nature
Disadvantages—Branch and Bound
 This method may group together
slow-changing languages rather than
related languages

 It also does not necessarily yield the best


solution.
Disadvantages—Kruskal’s
 Not necessarily the best tree.
 The parsimony method only guarantees that the
tree is twice the “parsimony length” of the best
tree.
 Some languages break the rules of a tree.
 Creoles cause a cycle in the tree because they have
more than one parent language.

You might also like