Introduction

I recently enjoyed the latest film in the 'Lord of the Rings' trilogy
at the cinema. I hadn't read any of Tolkien's books, so
watching the films was my first exposure to 'Middle Earth', with all
its strange creatures and languages. I was especially intrigued by
Tolkien's invented languages (such as Elvish and Dwarvish) and was
curious to know where the languages came from, or, more
precisely, which real language was the biggest influence on Tolkien
for his inventions. As I have been thinking about issues of string
similarity recently (see my previous articles 'Taming the Beast by
Matching Similar Strings' and 'How to Strike a Match') I wondered
whether I could extend my ideas of string similarity to language
similarity. In other words, could I discover to which real language
Tolkien's artificial language is most similar?

Apparently, this is a much discussed topic. The article 'Are High
Elves Finno-Ugric?' suggests that Finnish had the greatest
influence on the development of the Elvish language Quenya.
Tolkien first came across a Finnish grammar while he was studying
at Oxford, and admitted that it made a strong (even 'intoxicating'!)
impression on him. Indeed, in early versions of Quenya there are
many Finnish or near-Finnish words, although the meanings of the
words are not those of Finnish. Tolkien himself wrote that Quenya
was based on Latin, but with the added 'phonaesthetic ingredients'
of Finnish and Greek. It has also been argued that some aspects
of Tolkien's invention are more like Uralic languages that are
outside of Baltic Finnish, whilst other aspects more closely resemble
Hungarian.

An Algorithmic Approach

As a developer, I was thinking about an algorithmic approach to the
problem. My idea was to write a program that takes each Tolkien
word in turn and finds which real language has the word which is
most similar. By inspecting the number of times each language is
chosen, we should be able to decide which language was Tolkien's
biggest influence. Of course I would need to look on the Web to find
lists of Tolkien words, as well as word lists for other languages, but
I assumed that wouldn't be a problem. My own string similarity
metric (described in 'How to Strike a Match') could be used for the
word-by-word comparison, and is a good choice because it
acknowledges similarity for a common substring of any size, and is
robust to differences in string size. Of course this would be a
comparison of lexical similarity, as my string similarity algorithm
makes only lexical comparisons. It is still possible that the
inspiration for the grammar and the lexical structure of Tolkien's
languages came from entirely different sources.
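
For readers who have not seen 'How to Strike a Match', the following is a minimal Java sketch of a letter-pair similarity of the kind referred to above. It assumes the score is twice the number of shared adjacent letter pairs divided by the total number of pairs in both strings, and it handles single words only. The class and method names are illustrative and are not taken from the original listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative letter-pair similarity; the names here are assumptions. */
final class LetterPairSimilarity {

    /** Returns the adjacent letter pairs of a single word, e.g. "ELF" gives [EL, LF]. */
    static List<String> letterPairs(String word) {
        List<String> pairs = new ArrayList<String>();
        for (int i = 0; i < word.length() - 1; i++) {
            pairs.add(word.substring(i, i + 2));
        }
        return pairs;
    }

    /** Score in [0, 1]: twice the shared pairs divided by the total number of pairs. */
    static double compareStrings(String a, String b) {
        List<String> pairsA = letterPairs(a.toUpperCase(Locale.ROOT));
        List<String> pairsB = letterPairs(b.toUpperCase(Locale.ROOT));
        int total = pairsA.size() + pairsB.size();
        if (total == 0) {
            return 0.0; // neither string is long enough to contain a pair
        }
        int shared = 0;
        for (String pair : pairsA) {
            // remove() consumes the matched pair so repeated pairs are not counted twice
            if (pairsB.remove(pair)) {
                shared++;
            }
        }
        return (2.0 * shared) / total;
    }
}
```

Because the denominator counts the pairs of both strings, a short word matched against a long one is not unduly rewarded or penalised, which is the robustness to string size mentioned above.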

Although I had an existing implementation of the string similarity
metric and a good idea of the basic approach, this was a truly
investigative project. I didn't know what the outcome was going to
be, and I knew there would be some problems to solve along the
way. But then, that's what makes it so pleasing when you do get a
result. In this article, I explain the first part of my investigation -
how I obtained the word lists that enabled me to do the analysis,
how I processed them to clean them up, and how I represented the
word lists in a database.

Acquiring and Cleaning the Word Lists

A quick Google search led me to believe that I should be able to get
the data sources I needed to do the investigation - I found suitable
word lists at phreak.org and cotse.com, including a list of
Tolkien's invented words.

After downloading a number of these word lists, I found that they
needed some 'cleaning' before I could use them. I wanted each file
to be a list of words, formatted as one word per line. This was not
the case with several of the downloaded files, so I found myself
'cleaning' the data. For these basic file manipulation and formatting
tasks, I found the speed and flexibility of the Unix-style bash shell
invaluable. The tasks were as follows:

1. Using a text editor, I deleted any explanatory text or comments from the
tops of the files.
2. I found that the list of Hungarian words had first a word, but then a
number on each line. I stripped the numbers from this file using the
simple awk script '{print $1}'.
3. I sorted the files, and then used a text editor to remove non-alphabetic
words, such as '>='.
4. I combined different word lists for the same language. For example, there
were multiple word lists for English, so I simply appended one list onto the
end of the other:

cat englex-dict.txt >> english.txt

5. I removed duplicates from all of the word lists. It is easy to do this
programmatically in the bash shell. For example:

cat english.txt | sort | uniq > new-english.txt

6. I made sure that all the files used the same line termination sequence.
(Text files developed under Windows use two characters, Carriage Return
and Line Feed, to signify the end of a line, whereas Unix just uses a
Line Feed.) As I was using the bash shell, it was easiest to convert
all the files to Unix file format using:

dos2unix *.txt
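
The same clean-up could also be done in one pass from a program. The sketch below is only an illustration of steps 2 to 6, not the procedure actually used: it keeps the first field of each line, drops entries that contain no letters at all (an assumed reading of 'non-alphabetic'), sorts, and removes duplicates. The class name WordListCleaner is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper mirroring the shell clean-up steps on lines held in memory. */
final class WordListCleaner {

    /** Keeps the first field of each line, drops entries without letters, sorts, de-duplicates. */
    static List<String> clean(List<String> rawLines) {
        TreeSet<String> words = new TreeSet<String>(); // sorted and duplicate-free
        for (String line : rawLines) {
            String trimmed = line.trim(); // also strips a trailing carriage return
            if (trimmed.isEmpty()) {
                continue;
            }
            String first = trimmed.split("\\s+")[0]; // same effect as awk '{print $1}'
            if (containsLetter(first)) {
                words.add(first);
            }
        }
        return new ArrayList<String>(words);
    }

    /** True if at least one character of s is a letter. */
    static boolean containsLetter(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }
}
```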

At this point, I had 15 word lists of different sizes, and a total of
over 1.3 million words (the wc command shows the number of lines,
words and characters in each file):

$ wc *.txt
25485 25485 259551 danish.txt
178429 178430 1998881 dutch.txt
56553 56553 509773 english.txt
287698 287698 3500749 finnish.txt
138257 138257 1524757 french.txt
160086 160086 2060734 german.txt
18028 18028 172943 hungarian.txt
115506 115506 934652 japanese.txt
77107 77107 850131 latin.txt
61843 61843 589234 norwegian.txt
109861 109861 1022137 polish.txt
86061 86061 850532 spanish.txt
18417 18417 181973 swahili.txt
12146 12146 105192 swedish.txt
470 470 3768 tolkien.txt
1345947 1345948 14565007 total

Storing the Word Lists in a Database

Now that the word lists had been cleaned, my next aim was to
access them from a computer program. Although I could have
written a program to access the word lists directly as files, I felt a
database would offer considerable flexibility to query the data and
analyse the results. I was also worried about the volume of data,
and reasoned that the database would help in accessing and
managing the word lists efficiently. I didn't look around much when
choosing a database to store the word lists - MySQL was the natural
choice because it is fast, flexible and above all, free. And besides, it
was already installed on my computer!

I knew I would need only a single table to store all the word lists in
the database. Each row of the table could hold one word together
with the language to which it belongs. However, to devise the
schema precisely, I needed to find out how many characters to
allow per word. A quick bash shell command against the text files
told me the lengths of the words in the word lists:

$ cat *.txt|awk '{print length($0)}'|sort -n|uniq

The command first runs an awk script over the text files to get the
lengths of the lines, then performs a numeric sort, and finally
removes duplicate lines in the output. Using this command, I found
that the longest word in the input was 57 characters, so decided to
make the database column to hold the words 60 characters long.
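
For comparison, the same measurement can be sketched in Java. This is an illustrative equivalent of the shell pipeline, with a hypothetical class name; note that String.length() counts UTF-16 code units, which may differ from what awk reports for accented characters depending on the locale.

```java
import java.util.List;
import java.util.TreeSet;

/** Hypothetical equivalent of: awk '{print length($0)}' | sort -n | uniq */
final class WordLengths {

    /** Returns the distinct word lengths in ascending order. */
    static TreeSet<Integer> distinctLengths(List<String> words) {
        TreeSet<Integer> lengths = new TreeSet<Integer>();
        for (String w : words) {
            lengths.add(w.length());
        }
        return lengths;
    }
}
```

Calling last() on the returned set gives the length of the longest word, which is the figure needed to size the column.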

The table for storing the words is created as follows:

CREATE TABLE words (
    word varchar(60),
    lang enum("DANISH", "DUTCH", "ENGLISH", "FINNISH", "FRENCH", "GERMAN",
        "HUNGARIAN", "JAPANESE", "LATIN", "NORWEGIAN", "POLISH", "SPANISH", "SWAHILI",
        "SWEDISH", "TOLKIEN"),
    word_id int(10) NOT NULL auto_increment,
    primary key (word_id),
    index lang_i (lang),
    index word_i (word)
);

In addition to the word and its language, there is an identifier
(word_id), which is the primary key for the table. Note that I used
an enum type for the lang column, since we know we are only going
to use a limited set of languages. By using an enum, only one byte
need be used to store the value of that column - far less data than
if I'd used a varchar. I also added indexes for the lang and word
columns to improve execution times for queries that constrain these
columns.

Now for each language, I loaded the words from the text file into
the database table. The following two SQL commands load the word
list for Danish; the other languages were loaded similarly.

load data infile 'C:\\temp\\danish.txt' into table words(word);
update words set lang='danish' where lang is null;
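
Since the same pair of statements is repeated for every language, they could be generated rather than typed. The following sketch shows one way to do that; the class LoadStatements and its method are hypothetical, and the plain string concatenation is only reasonable here because the directory and language names are fixed values chosen by the developer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical generator for the per-language load and update statements. */
final class LoadStatements {

    /** Builds the LOAD DATA / UPDATE pair for one language; dir must end with a separator. */
    static List<String> forLanguage(String dir, String lang) {
        String name = lang.toLowerCase(Locale.ROOT);
        List<String> sql = new ArrayList<String>();
        sql.add("load data infile '" + dir + name + ".txt' into table words(word);");
        sql.add("update words set lang='" + name + "' where lang is null;");
        return sql;
    }
}
```

The update relies on the newly loaded rows being the only ones whose lang is still null, so each language has to be loaded and labelled before the next one is loaded.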

Once the word lists were in the database, I carried out one further
data cleansing action. I removed any words that were fewer than
three characters long:

mysql> delete from words where length(word)<3;
Query OK, 2112 rows affected (4.70 sec)

Checking the Data

At this point it is reassuring to run a query to get an overview of the
data that we have stored. First, let's check how many words are in
the database. As we have one word per row of the database table,
that's the same as counting the number of rows in the table. The
following query counts the number of rows, but also stores that
value in a variable, called @total.

mysql> select @total:=count(*) as wordcount from words;
+-----------+
| wordcount |
+-----------+
|   1343410 |
+-----------+
1 row in set (0.00 sec)

Now let's look at the breakdown of the word lists into languages:

mysql> select lang as language, count(word) as wordcount
from words group by lang;
+-----------+-----------+
| language  | wordcount |
+-----------+-----------+
| DANISH    |     25291 |
| DUTCH     |    178341 |
| ENGLISH   |     56355 |
| FINNISH   |    287231 |
| FRENCH    |    138168 |
| GERMAN    |    159989 |
| HUNGARIAN |     17818 |
| JAPANESE  |    115291 |
| LATIN     |     77049 |
| NORWEGIAN |     61679 |
| POLISH    |    109343 |
| SPANISH   |     85965 |
| SWAHILI   |     18363 |
| SWEDISH   |     12057 |
| TOLKIEN   |       470 |
+-----------+-----------+
15 rows in set (3.55 sec)

Given that we stored the total number of rows in a variable, it is
now quite easy to run a query to express the word counts as a
percentage of the total number of words. The query rounds the
percentages to one decimal place.

mysql> select lang as language, round(100*count(word)/@total,1) as percent
from words group by lang;
+-----------+---------+
| language  | percent |
+-----------+---------+
| DANISH    |     1.9 |
| DUTCH     |    13.3 |
| ENGLISH   |     4.2 |
| FINNISH   |    21.4 |
| FRENCH    |    10.3 |
| GERMAN    |    11.9 |
| HUNGARIAN |     1.3 |
| JAPANESE  |     8.6 |
| LATIN     |     5.7 |
| NORWEGIAN |     4.6 |
| POLISH    |     8.1 |
| SPANISH   |     6.4 |
| SWAHILI   |     1.4 |
| SWEDISH   |     0.9 |
| TOLKIEN   |     0.0 |
+-----------+---------+
15 rows in set (3.56 sec)

By converting the output of this query to a comma-separated values
file, and then loading it into Microsoft Excel, we can generate the
following pie chart:

It is important that we understand how well represented the
different languages are in the set of word lists, as this will affect our
interpretation of the results of the lexical similarity analysis. Clearly
Finnish is well represented in our word lists, as are Dutch, French
and German. With Finnish having the largest number of words, we
have a good starting point for testing the belief that Finnish was the
biggest influence on Tolkien's languages of Middle Earth.

In my next article, I will explain the algorithm that I used to analyze
the word lists, present an overview of the Java source code, and
reveal the language that, according to my findings, most influenced
Tolkien.

The Story So Far...

When I saw the latest in the Lord of the Rings trilogy of movies a
short while ago, I wondered how Tolkien had invented the artificial
languages of Middle Earth, such as Dwarvish and Elvish. In my
previous article, Lord of the Strings (Part 1), I told of my desire to
discover which real language had been the biggest influence on
Tolkien for his invented ones. As a software developer, I wanted to
discover this information algorithmically. My idea was to use my
own string similarity algorithm (see my article 'How to Strike a
Match') to compare each word from a list of Tolkien words to words
from 14 other real languages. For each Tolkien word, I would find
and record the language with the word that is (lexically) most
similar. The set of most-similar words and the languages from
which they came would provide new insights into the influences on
Tolkien.

The Part 1 article described how I obtained word lists for 14 real
languages, as well as a list of Tolkien words, cleaned them all up in
preparation for processing, and represented them in a MySQL
relational database. At the end of the article, I had over 1.3 million
words in the database, a Java implementation of my similarity
metric (from the 'How to Strike a Match' article), and JDBC
database access drivers at my disposal that would provide access to
the word lists from a Java program. I also used some SQL queries
to analyse the sizes of the word lists, as these will be important for
any conclusions made.

Preparing to Store Similarity Results

Now that the word lists were ready, my attention turned to the
algorithm that would be applied to the word lists and the results
that would be generated. The question in my mind was 'What
should I do with a result once I have discovered it?' That is, as soon
as I know the most similar word to a Tolkien word, do I print it out
to the screen, only for that information to become lost when the
screen buffer size is exceeded? Do I write it out to a file, so that the
result is persisted for later analysis? Or do I write the results back
to the database, so that they are immediately available for analysis
through SQL queries?

In fact, I opted to write the results back to the database, but not for
the reason given above. Actually, I was concerned about the size of
the problem. There were 470 Tolkien words in the database, each of
which could potentially be compared against 1.3 million other
strings. That's a lot of computation, and even for a single given
word I thought it would surely take a while to compute the most
similar string. My idea was to use the persistence of the database to
help address the combinatorics of the problem. If I designed the
program such that it wrote results to the database as it computed
them, and chose words to analyze that are not already in the
results list in the database, then the computer could be shut down
and restarted and the program would still pick up where it left off.
In other words, very little processing time would be lost through a
system shutdown, even if the analysis was not complete. This is
very desirable behavior, so I created another database table, called
'matches' for storing the results of the similarity comparisons. The
table needed to store the word (or its id), its best matching word,
the language of the best matching word, and the similarity score. I
used the following SQL command to create it:

CREATE TABLE matches (
    word varchar(60),
    word_id int(10) NOT NULL,
    best varchar(60),
    lang enum("DANISH", "DUTCH", "ENGLISH", "FINNISH", "FRENCH", "GERMAN",
        "HUNGARIAN", "JAPANESE", "LATIN", "NORWEGIAN", "POLISH", "SPANISH", "SWAHILI",
        "SWEDISH", "TOLKIEN"),
    similarity float,
    primary key(word_id),
    index word_i (word_id)
);

(Although not strictly necessary, I have included the word and best
word in this table as well as in the words table. I realize this
duplicates information and runs the risk of the data becoming
inconsistent, but it keeps things simple in this article. The
normalization and further refinement of the database schema is left
as an exercise for the reader... :-))
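
The restart behaviour described above can be illustrated without a database. In the sketch below an in-memory map stands in for the 'matches' table (a real run would persist it), the comparison itself is replaced by a placeholder, and the 'limit' parameter simulates the machine being shut down part-way through. All names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of picking up where a previous run left off. */
final class ResumableRun {

    /** Stand-in for the 'matches' table: Tolkien word mapped to its best match. */
    final Map<String, String> matches = new LinkedHashMap<String, String>();

    /** Returns the first word that has no stored result yet, or null when all are done. */
    String nextUnprocessed(List<String> tolkienWords) {
        for (String w : tolkienWords) {
            if (!matches.containsKey(w)) {
                return w;
            }
        }
        return null;
    }

    /** Processes at most 'limit' words, storing each result as soon as it is computed. */
    int run(List<String> tolkienWords, int limit) {
        int done = 0;
        String w;
        while (done < limit && (w = nextUnprocessed(tolkienWords)) != null) {
            matches.put(w, "best-match-for-" + w); // placeholder for the real comparison
            done++;
        }
        return done;
    }
}
```

A second call to run() skips everything already stored, so only the word being processed at the moment of an interruption has to be recomputed.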

Discovering String Similarities

I wrote a Java program that, for each Tolkien word, computes the
most similar word and stores that value back into the database. I
will explain how the program works without providing details about
the JDBC database access (if you are interested in such details, try
this short introduction, or a book such as Beginning Java
Databases.) In fact, the details of the database access are also
hidden from the main program, as it uses a database access class
which hides its inner workings, and instead exposes the following
interface:

import java.sql.ResultSet;
import java.sql.SQLException;

public interface QueryRunner {
    public void openConnection() throws SQLException;
    public ResultSet runQuery(String query) throws SQLException;
    public int runUpdate(String sql) throws SQLException;
    public void closeConnection() throws SQLException;
}

As you can see, there's a method for opening a database connection,
methods for executing a query and performing a database update,
and finally, a method for closing the connection to the database. It
is a very simple interface, but sufficient for the task at hand.

The program also makes use of a small class that represents a
word, its identifier, and the language from which it came:

class Word {
    private String word;
    private int id;
    private String lang;

    public Word(String wd, int wdId, String language) {
        word = wd;
        id = wdId;
        lang = language;
    }

    public String toString() {
        StringBuffer buf = new StringBuffer("Word[");
        buf.append(word);
        buf.append(",");
        buf.append(id);
        buf.append(",");
        buf.append(lang);
        buf.append("]");
        return buf.toString();
    }
}

Now I return to the main logic of the program, which is
implemented in five methods of a class called 'Compare'. The top-level
structure of the class is as follows (and you can see the full
program listing here):

public class Compare {
    protected QueryRunner queryRunner;

    public Compare() {}
    protected Word findBestMatch(Word baseWord, Collection collection) {}
    protected void storeBestMatch(Word w, Word bestMatch, double bestScore) {}
    protected Word getWord(String lang) throws SQLException {}
    protected Collection getCandidateStrings(String str) throws SQLException {}
}

As you can see, the class uses an instance of a QueryRunner to
access the database, and also makes use of the Word class, both as
a method parameter and a result. The following pseudo-code
explains the algorithm and the interaction between the methods:

OpenDatabaseConnection;
// Get a Tolkien word that has not been considered
While (w = getWord("Tolkien")) {
    // Run a first pass query to return promising strings
    Collection candidates = getCandidateStrings(w);
    // Find the best match of those candidates
    findBestMatch(w, candidates);
}
CloseDatabaseConnection;

First, I open a database connection, and then repeatedly retrieve
Tolkien words that have not yet been considered. (The getWord()
method runs an SQL query, which returns a Tolkien word that does
not yet appear in the comparison results table.) For each Tolkien
word, I run a 'first pass' query to retrieve a subset of promising
'candidate' words from all those in the database. I used the
database's pattern matching ability to return all those (non-Tolkien)
words that start with the same two letters as the Tolkien word. The
program then calls findBestMatch() on the Tolkien word and the
candidate words to find and store the best matching word. When
this has been done for all the Tolkien words, the database
connection is closed.
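
To make the first pass concrete, here is a minimal sketch of how the two-letter prefix could be turned into an SQL LIKE pattern. The class FirstPass, the constant CANDIDATE_SQL and the use of a '?' placeholder (which would need a JDBC PreparedStatement rather than the runQuery(String) method shown earlier) are assumptions for illustration; the query text is a guess at the shape of the real one, based on the words table.

```java
/** Hypothetical helper for the 'first pass' candidate query. */
final class FirstPass {

    /** Parameterised query over the words table; the LIKE pattern is bound to the '?'. */
    static final String CANDIDATE_SQL =
        "select word, word_id, lang from words where lang <> 'TOLKIEN' and word like ?";

    /** Builds a LIKE pattern matching words that share the first two letters. */
    static String prefixPattern(String word) {
        String prefix = word.length() <= 2 ? word : word.substring(0, 2);
        // Escape LIKE wildcards so a literal '%' or '_' in the prefix is matched as itself
        String escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_");
        return escaped + "%";
    }
}
```

The trade-off of such a filter is that any word whose best match begins with different letters is never considered, so the first pass buys speed at the cost of possibly missing some matches.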

findBestMatch(Word w, Collection candidates):
    bestScore = -1;
    for each c in candidates {
        similarity = compareStrings(c, w);
        if (similarity > bestScore) {
            bestScore = similarity;
            bestMatch = c;
        }
    }
    storeBestMatch(w, bestMatch, bestScore);

In the findBestMatch() method, the program iterates through each
candidate word and computes the similarity metric. If the similarity
metric is better than the previous best, then the program stores the
new value as its current best. Once all the candidate strings have
been considered, the best match is saved to the database using an
SQL update query.
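
The selection loop can be written as a small self-contained Java method. This is a sketch rather than the listing from the program: it works on plain strings instead of Word objects, returns the best match instead of storing it, and takes the similarity metric as a parameter through a hypothetical Similarity interface.

```java
import java.util.Collection;

/** Hypothetical stand-alone version of the best-match loop. */
final class BestMatchFinder {

    /** Any metric where a higher score means a closer match. */
    interface Similarity {
        double score(String a, String b);
    }

    /** Returns the candidate with the highest score, or null if there are no candidates. */
    static String findBestMatch(String word, Collection<String> candidates, Similarity metric) {
        double bestScore = -1;
        String bestMatch = null;
        for (String c : candidates) {
            double similarity = metric.score(c, word);
            if (similarity > bestScore) { // strict '>' keeps the first of any tied candidates
                bestScore = similarity;
                bestMatch = c;
            }
        }
        return bestMatch;
    }
}
```

One consequence of the strict comparison is that when two candidates tie, the one returned first by the query wins, so the order of the candidates can influence which language is credited with the hit.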

Analyzing The Results

With the first pass mechanism incorporated into the algorithm as
described, I found that the program runs in quite acceptable time.
On my 1.6 GHz PC with 768 MB RAM using Sun's Java J2SE 1.4.2
and MySQL 3.23.49, it runs to completion in about four minutes.
The Java program and database both run on the same machine, so
there is no network latency in this configuration. The main reason
for this relatively quick execution time is the first pass mechanism
that returns a subset of the whole dictionary of words. Without this
mechanism, I am certain that the program would take orders of
magnitude longer to finish. (And if it took longer to finish, then my
idea of storing results back into the database as they are found
would gain significance.)

Anyway, let's have a look at the results obtained. First of all, let's
see how many exact matches were found and for which languages:

mysql> select count(*), lang from matches where similarity=1.0 group by lang;
+----------+-----------+
| count(*) | lang      |
+----------+-----------+
|        2 | DANISH    |
|        7 | DUTCH     |
|        3 | ENGLISH   |
|       13 | FINNISH   |
|        1 | FRENCH    |
|        1 | GERMAN    |
|        1 | HUNGARIAN |
|        7 | JAPANESE  |
|        2 | LATIN     |
|        2 | NORWEGIAN |
|        2 | POLISH    |
|        4 | SPANISH   |
|        1 | SWEDISH   |
+----------+-----------+
13 rows in set (0.00 sec)

The most exact matches were found for Finnish. By executing the
following query, we can inspect the words of those exact matches:

select word from matches where similarity=1 and lang='Finnish';

The exact matches were found for the Finnish words 'andor',
'aragorn', 'arwen', 'balin', 'hobbit', 'ilmarin', 'lindon', 'maia', 'nahar',
'ori', 'rohan', 'sylvan', and 'velar'; whilst English scored exact
matches for the words 'goblin', 'hobgoblin' and 'thrush'.

The following query shows the frequency with which each language
was chosen as being most similar, over the whole set of 470 Tolkien
words.

mysql> select lang, count(*) as hits
from matches
group by lang
order by hits desc;
+-----------+------+
| lang      | hits |
+-----------+------+
| FINNISH   |  100 |
| ENGLISH   |   64 |
| SPANISH   |   62 |
| JAPANESE  |   52 |
| DUTCH     |   40 |
| LATIN     |   26 |
| POLISH    |   25 |
| DANISH    |   19 |
| NORWEGIAN |   19 |
| FRENCH    |   18 |
| HUNGARIAN |   18 |
| GERMAN    |   16 |
| SWAHILI   |    7 |
| SWEDISH   |    4 |
+-----------+------+
14 rows in set (0.00 sec)

If a language is selected as the best match for a Tolkien word, I call
it a 'hit'. As you can see from the results of the query, Finnish
scores the highest by a clear margin, with 100 hits. The languages
with the next-most hits are English, Spanish and Japanese, with 64,
62 and 52 hits, respectively. At first sight, this seems to support the
argument that Finnish is the language that had the biggest
influence on Tolkien (as suggested by the Finnish speaker, Harri
Perälä). However, if you recall the sizes of the word lists, analyzed
at the end of the Part 1 article, you will remember that the Finnish
word list is also the largest word list of all the languages,
comprising 21.4% of the words in the database. So perhaps the
reason why Finnish scores so many hits is simply because I had so
many Finnish words in the database?

Clearly, to conclude anything from the results, we need to take
account of the relative sizes of the word lists for each of the
languages. Perhaps the best way of doing this is to compute an
expected number of hits for each language (based on the size of its
word list), and compare that number to the actual number of hits
that we found. The following pair of queries computes the number
of expected hits for each of the languages, under the assumption
that every word in the database, whatever its language, is equally
likely to be the most similar word to any given Tolkien word (a
so-called 'null hypothesis').

mysql> select @total:=count(*) from words where lang<>'tolkien';
+------------------+
| @total:=count(*) |
+------------------+
|          1342940 |
+------------------+
1 row in set (1.20 sec)

mysql> select lang, round(count(*)*470/@total,1) as expected_hits
from words where lang<>'tolkien'
group by lang
order by expected_hits desc;
+-----------+---------------+
| lang      | expected_hits |
+-----------+---------------+
| FINNISH   |         100.5 |
| DUTCH     |          62.4 |
| GERMAN    |          56.0 |
| FRENCH    |          48.4 |
| JAPANESE  |          40.3 |
| POLISH    |          38.3 |
| SPANISH   |          30.1 |
| LATIN     |          27.0 |
| NORWEGIAN |          21.6 |
| ENGLISH   |          19.7 |
| DANISH    |           8.9 |
| SWAHILI   |           6.4 |
| HUNGARIAN |           6.2 |
| SWEDISH   |           4.2 |
+-----------+---------------+
14 rows in set (1.06 sec)
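
The arithmetic behind the query is simply each language's share of the non-Tolkien words multiplied by the 470 Tolkien words. As a cross-check, it can be sketched in Java as follows; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical re-computation of the expected hits outside the database. */
final class ExpectedHits {

    /** Expected hits per language if hits were shared out in proportion to word-list size. */
    static Map<String, Double> expected(Map<String, Integer> wordCounts, int tolkienWords) {
        long total = 0;
        for (int n : wordCounts.values()) {
            total += n;
        }
        Map<String, Double> result = new LinkedHashMap<String, Double>();
        for (Map.Entry<String, Integer> e : wordCounts.entrySet()) {
            double hits = e.getValue().doubleValue() * tolkienWords / total;
            result.put(e.getKey(), Math.round(hits * 10) / 10.0); // one decimal place
        }
        return result;
    }
}
```

For Finnish, for example, 287231 words out of 1342940 is about 21.4% of the database, and 21.4% of 470 is the 100.5 expected hits shown in the table.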

Computing the expected number of hits is the statistical equivalent
of a control experiment in the physical sciences - in other words,
what would be the outcome if there were no interesting behaviour
to study? You might think such an exercise to be of little use, as
we're pretty sure there is something interesting going on. The point
is that we can now compare the actual results that we obtained to
the expected results, and look at the differences. The following
table orders the languages according to the size of the difference
between actual and expected number of hits.

Language    Expected Hits   Actual Hits   Actual-Expected
ENGLISH              19.7            64              44.3
SPANISH              30.1            62              31.9
HUNGARIAN             6.2            18              11.8
JAPANESE             40.3            52              11.7
DANISH                8.9            19              10.1
SWAHILI               6.4             7               0.6
SWEDISH               4.2             4              -0.2
FINNISH             100.5           100              -0.5
LATIN                27.0            26              -1.0
NORWEGIAN            21.6            19              -2.6
POLISH               38.3            25             -13.3
DUTCH                62.4            40             -22.4
FRENCH               48.4            18             -30.4
GERMAN               56.0            16             -40.0
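
Each row of the table above is a single subtraction, and the ratio of actual to expected hits gives a second view of the same numbers. The sketch below shows both calculations; the class HitComparison is hypothetical and is included only to make the arithmetic explicit.

```java
/** Hypothetical helper for comparing actual hits with expected hits. */
final class HitComparison {

    /** Actual minus expected hits, rounded to one decimal place. */
    static double difference(double expected, int actual) {
        return Math.round((actual - expected) * 10) / 10.0;
    }

    /** How many times the expected number of hits a language actually received. */
    static double ratio(double expected, int actual) {
        return actual / expected;
    }
}
```

The two measures rank the languages differently: English has the largest difference, while the ratio favours Hungarian, whose 18 hits against 6.2 expected is proportionally the largest excess.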

Now we're getting somewhere. The languages with differences that
are greater than zero may have had an influence on Tolkien.
Furthermore, the size of the difference is also an indication of the
level of influence. So we're beginning to see that Tolkien's mother
tongue of English seems to have had the most profound influence
on him. My reaction to this was, at first, one of surprise, then of
reassurance. I was surprised because of the apparent dissimilarity
of Tolkien's invented words to English, and the fact that the Tolkien
words matched only three English words exactly. But I was
reassured because Tolkien was, after all, English, and you would
expect him to be heavily influenced by his native language.

Note also the particularly strong result for Hungarian, which
received nearly three times as many hits as expected. Finnish
performed almost exactly as expected, indicating no appreciable
influence on Tolkien, whilst French and German performed well
under expectations, perhaps indicating that Tolkien was deliberately
avoiding the influences of these languages.

Conclusions

When I started this investigation, I had no idea what the result
would be. I just clung firmly onto the belief that my string similarity
metric, together with a simple algorithm to iterate over the set of
possible word pair comparisons, would provide an interesting result.
In fact, the results are very satisfying. I found that English had a
profound effect on Tolkien's invented languages, with perhaps
further influences from Hungarian and Spanish. This is satisfying
because it is entirely reasonable (at least the part about English!),
though not exactly what I expected after reading about the
(apparently unfounded) claims for the influences of Finnish.

It is also satisfying because it increases my confidence in the string
similarity method. And as developers, we like to have confidence in
our methods.
