I recently enjoyed the latest film in the 'Lord of the Rings' trilogy
at the cinema. I hadn't read any of Tolkien's books, so
watching the films was my first exposure to 'Middle Earth', with all
its strange creatures and languages. I was especially intrigued by
Tolkien's invented languages (such as Elvish and Dwarvish) and was
curious to know where the languages came from, or, more
precisely, which real language was the biggest influence on Tolkien
for his inventions. As I have been thinking about issues of string
similarity recently (see my previous articles 'Taming the Beast by
Matching Similar Strings' and 'How to Strike a Match') I wondered
whether I could extend my ideas of string similarity to language
similarity. In other words, could I discover to which real language
Tolkien's artificial language is most similar?
1. Using a text editor, I deleted any explanatory text or comments from the
tops of the files.
2. In the list of Hungarian words, each line held a word followed by a
number. I stripped the numbers from this file using the
simple awk script '{print $1}'.
3. I sorted the files, and then used a text editor to remove non-alphabetic
words, such as '>='.
4. I combined different word lists for the same language. For example, there
were multiple word lists for English, so I simply appended one list onto the
end of the other:
6. I made sure that all the files used the same line termination sequence.
(Text files created under Windows use two characters, Carriage Return
and Line Feed, to signify the end of a line, whereas Unix uses a single
Line Feed.) As I was using the bash shell, it was easiest to convert
all the files to Unix file format using:
dos2unix *.txt
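The cleaning steps above were carried out with a text editor, awk and dos2unix. As a rough illustration only, the same steps can be sketched in Java. The class and method names here (WordListCleaner, clean) are hypothetical, and the sketch assumes that a "non-alphabetic word" is any token containing a character that is not a letter, which may be stricter than the manual editing was.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the original program.
final class WordListCleaner {

    /** Returns the cleaned, sorted words taken from the raw lines of a word list. */
    static List<String> clean(List<String> rawLines) {
        List<String> words = new ArrayList<>();
        for (String raw : rawLines) {
            // Drop any Carriage Return left over from Windows line endings
            String line = raw.replace("\r", "").trim();
            if (line.isEmpty()) {
                continue;
            }
            // Keep only the first field, like the awk script '{print $1}'
            String first = line.split("\\s+")[0];
            // Assumption: a token counts as a word only if every character is a letter
            if (first.chars().allMatch(Character::isLetter)) {
                words.add(first);
            }
        }
        // Natural String ordering; the Unix sort command may order differently
        Collections.sort(words);
        return words;
    }
}
```

Calling WordListCleaner.clean on the lines of one raw file gives the list that would then be written back out as the cleaned word list.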
At this point, I had 15 word lists of different sizes, and a total of
over 1.3 million words (the wc command shows the number of lines,
words and characters in each file):
$ wc *.txt
   25485    25485   259551 danish.txt
  178429   178430  1998881 dutch.txt
   56553    56553   509773 english.txt
  287698   287698  3500749 finnish.txt
  138257   138257  1524757 french.txt
  160086   160086  2060734 german.txt
   18028    18028   172943 hungarian.txt
  115506   115506   934652 japanese.txt
   77107    77107   850131 latin.txt
   61843    61843   589234 norwegian.txt
  109861   109861  1022137 polish.txt
   86061    86061   850532 spanish.txt
   18417    18417   181973 swahili.txt
   12146    12146   105192 swedish.txt
     470      470     3768 tolkien.txt
 1345947  1345948 14565007 total
Now that the word lists had been cleaned, my next aim was to
access them from a computer program. Although I could have
written a program to access the word lists directly as files, I felt a
database would offer considerable flexibility to query the data and
analyse the results. I was also worried about the volume of data,
and reasoned that the database would help in accessing and
managing the word lists efficiently. I didn't look around much when
choosing a database to store the word lists - MySQL was the natural
choice because it is fast, flexible and above all, free. And besides, it
was already installed on my computer!
I knew I would need only a single table to store all the word lists in
the database. Each row of the table could hold one word together
with the language to which it belongs. However, to devise the
schema precisely, I needed to find out how many characters to
allow per word. A quick bash shell command against the text files
told me the lengths of the words in the word lists:
The command first runs an awk script over the text files to get the
lengths of the lines, then performs a numeric sort, and finally
removes duplicate lines in the output. Using this command, I found
that the longest word in the input was 57 characters, so decided to
make the database column to hold the words 60 characters long.
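The length check was done with a shell pipeline, but the same idea can be sketched in Java: collect the distinct line lengths in sorted order and read off the largest. The class name WordLengths is hypothetical, and reading the files as ISO-8859-1 is an assumption made only so that any byte sequence decodes without error.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of the original program.
final class WordLengths {

    /** Returns the distinct line lengths in ascending order. */
    static TreeSet<Integer> distinctLengths(List<String> lines) {
        TreeSet<Integer> lengths = new TreeSet<>();
        for (String line : lines) {
            lengths.add(line.length());
        }
        return lengths;
    }

    /** Prints the longest line length found across the files named on the command line. */
    public static void main(String[] args) throws IOException {
        TreeSet<Integer> all = new TreeSet<>();
        for (String file : args) {
            // Assumption: ISO-8859-1, so that every byte maps to one character
            List<String> lines = Files.readAllLines(Paths.get(file), StandardCharsets.ISO_8859_1);
            all.addAll(distinctLengths(lines));
        }
        if (!all.isEmpty()) {
            System.out.println("Longest line: " + all.last() + " characters");
        }
    }
}
```

The largest value in the set is the figure that decides how wide the word column needs to be.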
Now for each language, I loaded the words from the text file into
the database table. The following two SQL commands load the word
list for Danish; the other languages were loaded similarly.
Once the word lists were in the database, I carried out one further
data cleansing action. I removed any words that were less than
three characters long:
Now let's look at the breakdown of the word lists into languages:
mysql> select lang as language, count(word) as wordcount
from words group by lang;
+-----------+-----------+
| language  | wordcount |
+-----------+-----------+
| DANISH    |     25291 |
| DUTCH     |    178341 |
| ENGLISH   |     56355 |
| FINNISH   |    287231 |
| FRENCH    |    138168 |
| GERMAN    |    159989 |
| HUNGARIAN |     17818 |
| JAPANESE  |    115291 |
| LATIN     |     77049 |
| NORWEGIAN |     61679 |
| POLISH    |    109343 |
| SPANISH   |     85965 |
| SWAHILI   |     18363 |
| SWEDISH   |     12057 |
| TOLKIEN   |       470 |
+-----------+-----------+
15 rows in set (3.55 sec)
When I saw the latest in the Lord of the Rings trilogy of movies a
short while ago, I wondered how Tolkien had invented the artificial
languages of Middle Earth, such as Dwarvish and Elvish. In my
previous article, Lord of the Strings (Part 1), I told of my desire to
discover which real language had been the biggest influence on
Tolkien for his invented ones. As a software developer, I wanted to
discover this information algorithmically. My idea was to use my
own string similarity algorithm (see my article 'How to Strike a
Match') to compare each word from a list of Tolkien words to words
from 14 other real languages. For each Tolkien word, I would find
and record the language with the word that is (lexically) most
similar. The set of most-similar words and the languages from
which they came would provide new insights into the influences on
Tolkien.
The Part 1 article described how I obtained word lists for 14 real
languages, as well as a list of Tolkien words, cleaned them all up in
preparation for processing, and represented them in a MySQL
relational database. At the end of the article, I had over 1.3 million
words in the database, a Java implementation of my similarity
metric (from the 'How to Strike a Match' article), and JDBC
database access drivers at my disposal that would provide access to
the word lists from a Java program. I also used some SQL queries
to analyse the sizes of the word lists, as these will be important for
any conclusions made.
Now that the word lists were ready, my attention turned to the
algorithm that would be applied to the word lists and the results
that would be generated. The question in my mind was 'What
should I do with a result once I have discovered it?' That is, as soon
as I know the most similar word to a Tolkien word, do I print it out
to the screen, only for that information to become lost when the
screen buffer size is exceeded? Do I write it out to a file, so that the
result is persisted for later analysis? Or do I write the results back
to the database, so that they are immediately available for analysis
through SQL queries?
In fact, I opted to write the results back to the database, but not for
the reason given above. Actually, I was concerned about the size of
the problem. There were 470 Tolkien words in the database, each of
which could potentially be compared against 1.3 million other
strings. That's a lot of computation, and even for a single given
word I thought it would surely take a while to compute the most
similar string. My idea was to use the persistence of the database to
help address the combinatorics of the problem. If I designed the
program such that it wrote results to the database as it computed
them, and chose words to analyze that are not already in the
results list in the database, then the computer could be shut down
and restarted and the program would still pick up where it left off.
In other words, very little processing time would be lost through a
system shutdown, even if the analysis was not complete. This is
very desirable behavior, so I created another database table, called
'matches' for storing the results of the similarity comparisons. The
table needed to store the word (or its id), its best matching word,
the language of the best matching word, and the similarity score. I
used the following SQL command to create it:
(Although not strictly necessary, I have included the word and best
word in this table as well as in the words table. I realize this
duplicates information and runs the risk of the data becoming
inconsistent, but it keeps things simple in this article. The
normalization and further refinement of the database schema is left
as an exercise for the reader... :-))
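The restartable behaviour described above can be sketched in Java without a database. In this minimal sketch a Map stands in for the 'matches' table, and the names ResumableRun, run and bestMatch are hypothetical. In the real program the lookup and the write would both go through the database, so that results survive a shutdown.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration; the Map plays the role of the 'matches' table.
final class ResumableRun {

    /**
     * Processes only the words that have no stored result yet and
     * returns how many words were processed in this run.
     */
    static int run(List<String> words,
                   Map<String, String> results,
                   Function<String, String> bestMatch) {
        int processed = 0;
        for (String word : words) {
            if (results.containsKey(word)) {
                continue; // already handled in an earlier run
            }
            // Store each result as soon as it is computed
            results.put(word, bestMatch.apply(word));
            processed++;
        }
        return processed;
    }
}
```

Because each result is stored as soon as it is known, stopping the program part-way loses at most the word being worked on, and a second call with the same results store skips everything already done.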
I wrote a Java program that, for each Tolkien word, computes the
most similar word and stores the result back in the database. I
will explain how the program works without providing details about
the JDBC database access (if you are interested in such details, try
this short introduction, or a book such as Beginning Java
Databases). In fact, the details of the database access are also
hidden from the main program, which uses a database access class
that conceals its inner workings and exposes the following
interface:
// A word, with its database id and the language it belongs to
class Word {
    private String word;
    private int id;
    private String lang;
}

// Database access class (method bodies omitted)
class Compare {
    public Compare() {}
    protected Word findBestMatch(Word baseWord, Collection collection) {}
    protected void storeBestMatch(Word w, Word bestMatch, double bestScore) {}
    protected Word getWord(String lang) throws SQLException {}
    protected Collection getCandidateStrings(String str) throws SQLException {}
}
OpenDatabaseConnection;
// Get a Tolkien word that has not yet been considered
while (w = getWord("Tolkien")) {
    // Run a first-pass query to return promising strings
    Collection candidates = getCandidateStrings(w);
    // Find the best match among those candidates
    findBestMatch(w, candidates);
}
CloseDatabaseConnection;
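To make the best-match step concrete, here is a minimal sketch that works on plain strings rather than Word objects. It assumes that the metric from 'How to Strike a Match' is twice the number of shared adjacent letter pairs divided by the total number of letter pairs in both strings. The class name PairSimilarity and its method signatures are hypothetical and are not the ones used by the program described here.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of a letter-pair similarity and a best-match search.
final class PairSimilarity {

    /** Returns the adjacent letter pairs of s, e.g. "ROHAN" gives RO, OH, HA, AN. */
    static List<String> letterPairs(String s) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            pairs.add(s.substring(i, i + 2));
        }
        return pairs;
    }

    /** Returns a score from 0.0 (no pairs in common) to 1.0 (same pairs). */
    static double similarity(String a, String b) {
        List<String> pairsA = letterPairs(a.toUpperCase(Locale.ROOT));
        List<String> pairsB = letterPairs(b.toUpperCase(Locale.ROOT));
        int total = pairsA.size() + pairsB.size();
        if (total == 0) {
            return 0.0; // neither string is long enough to contain a pair
        }
        int shared = 0;
        for (String pair : pairsA) {
            // Remove each matched pair so that it cannot be counted twice
            if (pairsB.remove(pair)) {
                shared++;
            }
        }
        return 2.0 * shared / total;
    }

    /** Returns the candidate with the highest score, or null if there are none. */
    static String findBestMatch(String base, Collection<String> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double score = similarity(base, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }
}
```

Under this assumption an exact match scores 1.0, which is the value the results queries test for. When several candidates tie, the sketch keeps the first one it meets.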
Anyway, let's have a look at the results obtained. First of all, let's
see how many exact matches were found and for which languages:
mysql> select count(*), lang from matches where similarity=1.0 group by lang;
+----------+-----------+
| count(*) | lang      |
+----------+-----------+
|        2 | DANISH    |
|        7 | DUTCH     |
|        3 | ENGLISH   |
|       13 | FINNISH   |
|        1 | FRENCH    |
|        1 | GERMAN    |
|        1 | HUNGARIAN |
|        7 | JAPANESE  |
|        2 | LATIN     |
|        2 | NORWEGIAN |
|        2 | POLISH    |
|        4 | SPANISH   |
|        1 | SWEDISH   |
+----------+-----------+
13 rows in set (0.00 sec)
The most exact matches were found for Finnish. By executing the
following query, we can inspect the words of those exact matches:
The exact matches were found for the Finnish words 'andor',
'aragorn', 'arwen', 'balin', 'hobbit', 'ilmarin', 'lindon', 'maia', 'nahar',
'ori', 'rohan', 'sylvan', and 'velar'; whilst English scored exact
matches for the words 'goblin', 'hobgoblin' and 'thrush'.
The following query shows the frequency with which each language
was chosen as being most similar, over the whole set of 470 Tolkien
words.
Conclusions