Professional Documents
Culture Documents
Document Retrieval
MSc thesis, Artificial Intelligence
Marijn Koolen
mhakoole@science.uva.nl
June 2, 2005
Supervisors:
Prof. Dr. Maarten de Rijke, Dr. Jaap Kamps
Informatics Institute
University of Amsterdam
Abstract
Acknowledgements
I’d like to express my gratitude towards Jaap Kamps and Maarten de Rijke for
their guidance and supervision during this research. They’ve read numerous
versions of this thesis without losing patience or hope, and were always quick
with advice when needed. I’d like to thank Frans Adriaans for the brainstorming
sessions getting both our projects started, and for the discussions on science that
somehow always shifted to discussions on music.
Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Document retrieval
  1.2 Historic documents and IR
  1.3 Constructing language resources
  1.4 Outline
2 Historic Documents
  2.1 Language variants or different languages?
  2.2 The gap between two variants
  2.3 Bridging the gap
  2.4 Resources for historic Dutch
  2.5 Corpora
  2.6 Corpus problems
  2.7 Measuring the gap
  2.8 Spelling check
3 Rewrite rules
  3.1 Inconsistent spelling & rewrite rules
    3.1.1 Spelling bottleneck
    3.1.2 Resources
    3.1.3 Linguistic considerations
  3.2 Rewrite rule generation
    3.2.1 Phonetic Sequence Similarity
    3.2.2 The PSS algorithm
    3.2.3 Relative Sequence Frequency
    3.2.4 The RSF algorithm
    3.2.5 Relative N-gram Frequency
    3.2.6 The RNF algorithm
  3.3 Rewrite rule selection
    3.3.1 Selection criteria
4 Further evaluation
  4.1 Iteration and combining of approaches
    4.1.1 Iterating generation methods
    4.1.2 Combining methods
    4.1.3 Reducing double vowels
  4.2 Word-form retrieval
  4.3 Historic Document Retrieval
    4.3.1 Topics, queries and documents
    4.3.2 Rewriting as translation
  4.4 Document collections from specific periods
  4.5 Conclusions
6 Concluding
  6.1 Language resources for historic Dutch
  6.2 Future research
    6.2.1 The spelling gap
    6.2.2 The vocabulary gap
Appendix B - Scripts

List of Tables
1 DBNL dictionaries
Chapter 1
Introduction
The first question can be split into two more specific questions, based on Braun's
observation about the spelling problem and the vocabulary problem:
• What are the options for automatically constructing a thesaurus for historic
languages?
For the spelling problem, Braun [2] and Robertson and Willett [18] have already
mentioned two methods: rewrite rules and spelling correction techniques. But
there may be other options to align two temporal variants of a language. Therefore,
this question can be made more specific:
1.4 Outline
The next chapter will elaborate on the distinction between historic and modern
Dutch documents, and some available historic Dutch document collections will
be described. It will show that information retrieval on historic documents
is different from retrieval on modern document collections. Chapter 3 will
discuss in detail the automatic construction of rewrite rules using phonetics
and statistics, and their effectiveness on historic documents. Several different
methods are described and compared with each other and with the rules from [2].
Further extensions and combinations of these methods, and further evaluations,
follow in chapter 4.
Chapter 2
Historic Documents
Historic Documents are documents written in the past. Of course, this holds for
all documents. But since spoken and written language changes continuously, a
century old Dutch document is written in a form of Dutch that is different from
a document written two weeks ago. The changes are not spectacular, but they
are there all the same. Using a search engine on the internet to find documents
on typical Dutch food with the keywords Hollandse gerechten (English: Dutch
dishes) may retrieve a text written in 1910 containing both words. The keywords
are normal in modern Dutch, but also in early 20th century Dutch. What the
search engine probably won’t find is a website containing hundreds of typically
Dutch recipes from the 16th century, although this website does exist (see section
2.5, the Kookhistorie corpus). The historic texts contain historical spelling
variants of the modern words Hollandse gerechten. This problem is caused by the
fact that the change from 16th century Dutch to modern Dutch is spectacular.
Although the number of digitized 16th century documents is small, this number
is growing rapidly through the increasing interest from historians and through
funding from national governments for digitizing historic documents.1 The
aforementioned problem, the gap between a modern keyword and its relevant
historic variants, thus becomes increasingly important.
Going back further in time, the differences between modern Dutch and Middle
Dutch as used in the late Middle Ages (1200 – 1500) are even bigger. In fact,
between 1200 and 1500, Dutch was not a single language, but a collection of
dialects. Each dialect had its own pronunciation, and spelling was often based
on this pronunciation [23]. Spelling thus differed between geographical regions.
Due to the union of smaller independent countries and increasing commerce,
a more uniform Dutch language emerged after 1500.2 As contacts between
regions increased, spelling was less and less based on pronunciation, becoming
1 See, for example, the CATCH (Continuous Access To Cultural Heritage) project. This
is funded by the Dutch government to make historic material from Dutch cultural heritage
publicly accessible in digital form, thereby preserving the fragile material.
2 For a more detailed description of the changes in language between 1200 and 1500, (in
more consistent. In the 17th century, the Dutch translation of the Bible, the
Statenbijbel, together with books by famous Dutch writers like Vondel and Hooft,
was considered well-written Dutch, bringing about a more consistent and
systematic spelling. Since there was no official spelling (one was not introduced
in the Netherlands until 1804), there were still many acceptable ways of spelling
a word [23].
The modern Dutch version (the author’s own interpretation) would look like
this:3
order is readable for native Dutch speakers, it is somewhat strange. Apparently, grammar has
changed somewhat as well.
would stack wine, oil or olives on top of raisins, onions, rice, salt,
grain or some such goods, where the former spoils the latter, he must
repay the damage to the trader.
Analyzing the historic and modern Dutch sentences, it is clear that the biggest
difference is in spelling. Some words are still the same (schipper, bederven,
alleen, water), but most words have changed in spelling. The changes in
vocabulary are visible in the change from goet doen to vergoeden (English:
repay). There are also some morphological/syntactic changes, like versuijmelijck
(negligent) to verzuimd (neglects).
It is probably easier to attack the spelling bottleneck first. To solve the
second, a thesaurus is needed to translate historic words into modern words
or the other way around. If a method can be found and used to find pairs of
modern and historic words that have the same meaning, such a thesaurus can
be constructed. But if spelling is not uniform, one spelling variant of a historic
word might be matched with a modern one, while another spelling variant is
missed. By solving the spelling bottleneck first, thereby standardizing the
spelling of historic documents, finding word translation pairs for a thesaurus
may become easier.
thesaurus relating these words to modern words would solve this problem. From
this perspective, the historic language can be seen as a different language from
the modern language, and the retrieval task becomes a so-called Cross-Language
Information Retrieval (CLIR) task (see [11] and [10] for an analysis of the main
problems and approaches in CLIR).
But can spelling be standardized with nothing but a collection of historic
documents? And is it possible to make a thesaurus using the same limited
document collection? Of course, it is possible to do everything by hand (see
sections 3.1.1 and 5.3). But this is very time consuming, and different language
experts might have different opinions on what the best modern translation would
be. Automatic generation, if at all possible, might be more error prone. But
as modern IR systems have shown [7], sub-optimal resources can still be very
useful for finding relevant documents.
Although there was no standard way of spelling a word in 17th century
Dutch, the possibilities of spelling a word based on pronunciation are not
infinite. In fact, there are only a few different spellings for a certain vowel. Corpus
statistics can be used to find different spelling variants by looking at the
overlap of context. Also, techniques have been developed to group semantically
related words based purely on corpus statistics. If this can be done for modern
languages, it might work with historic languages as well.
for Dutch will reduce it to versuijm. Other rewrite rules may change this to the
modern stem verzuim, conflating it with all other morphological variants.
Finding historical synonyms for modern words is a problem heretofore only
tackled by manual approaches. For modern languages, techniques have been
developed to find synonyms automatically (see, for instance, [3, 4, 5, 14]), using
plain text or syntactically annotated corpora. Part-Of-Speech (POS) taggers
exist for many languages, but not for 17th century Dutch, and annotated 17th
century Dutch documents are not available either. Therefore, only those
techniques that use nothing but plain text are an option.
The next chapters describe the automatic generation of rewrite rules based
on phonetic information, and the automatic construction of thesauri using plain
text. The various approaches are listed here:
• Rewrite rule evaluation methods: The main evaluation will test the
generated rule sets on a test set of historic and modern word pairs, and
measure the similarity of the words before and after rewriting. Further
evaluation is done through two retrieval experiments, one word-based, the
other document-based.
2.5 Corpora
Although the Nationale Koninklijke Bibliotheek van Nederland has a large
collection of historic documents, at this moment very few of them are in digital
form. The resources that will be constructed use corpora of 17th century texts
acquired from the internet. The following corpora were found:
• Braun corpus: This was acquired from the University of Maastricht, from
the research done by Braun [2].
• Historie van Broer Cornelis: This is a medium-sized corpus from the
beginning (1569) of the Dutch literary ‘Golden Age’, transcribed by the
foundation ‘Secrete Penitentie’ as a contribution to the history of Dutch
satire.
The French word guy (English: guy, fellow) contains the vowel uy, but in French
it is pronounced differently from historic Dutch words like uyt (English: out).
Foreign words ‘contaminate’ the historic Dutch lexicon. The historic corpus will
be used to find typical historic Dutch sequences of characters, so modern Dutch
texts are also considered ‘foreign.’ As a preprocessing step, as many of the
non-17th-century Dutch texts as possible were removed from the corpus. Because
the entire DBNL corpus contains over 8,600 documents, some simple heuristics
were used to find the foreign texts, so the corpus may still contain some texts
other than 17th century Dutch.
Another problem with the texts from the DBNL corpus is the period in which
they were written. The oldest texts date from 1550, the most recent were
written in 1690. In those 150 years, the Dutch language changed somewhat,
including pronunciation and the use of some letter combinations (like uy). For
instance, in the oldest texts, the uy was used to indicate that the u should be
pronounced long (the modern word vuur was spelled as vuyr around 1550). In
more recent texts, after 1600, the uy was often used like the modern ui, as in
the example given above (uyt is the historic variant of uit).

If texts from a wide-ranging period are used, generating rewrite rules will
suffer from ambiguity. To minimize these problems, it is probably better to
use texts from a fairly small period (20 – 50 years, for instance).
Table 2.1: Distribution of the 500 randomly picked historic words over the three categories

Category   Distribution
Modern     177 (35%)
Variant    239 (48%)
Historic    84 (17%)
its morphology has changed (the root of the word beest is still the same, which is
the very reason that its meaning is recognizable from context). This might not
be problematic for native Dutch speakers, but it does pose a problem for finding
relevant historic documents for the query term “beestachtig”. This word, and
other even less recognizable words belong to HIS. Categorizing all 500 randomly
picked words does not result in any hard facts about the gap between the two
language variants, but it does give some idea about the size of the problem.
The results are listed in Table 2.1. It turns out that most of the words (239
words, about 48%) are historic spelling variants of modern words. The overlap
between historic and modern Dutch is also significant (177 words, 35%), leaving
a vocabulary gap of 84 out of 500 words (17%). This shows that solving the
problem of spelling variants bridges the gap between historic (at least 17th
century) Dutch and modern Dutch to a large extent.
10. Item, als de schipper de goeden soo qualijck stouwt of laeijt dat
d’eene door d’andere bedorven worden, gelijk soude mogen gebeuren
als hij onder geladen heeft rozijnen, aluin, rijs, sout, graan ende
andere dergelijke goeden, ende dat hij daar boven op laeijt wijnen,
olin of olijven, die uitlopen ende d’andere bederven, die schade moet
de schipper ook alleen dragen ende de koopman goet doen, als boven.
The words oft, den, genoech, maecken, taeckel, daerdore and others were
recognized as misspelled words, and a list of suggestions was given for each word,
including the correct modern words, which were not always the most probable
alternatives according to A-spell. For the word versuijmelijck no alternative was
suggested, probably because it is too dissimilar to any modern Dutch word. The
word goeden is a historic word for which A-spell suggested ‘goed’ (good), but
not ‘goederen’ (goods). The correct suggestion for coopman-schappen (which
is koopmanschappen, lit. ‘trade goods’) was not given, probably because the
modernized version of the word (koopmanschappen) is not a modern word (the
word koopmanschap was suggested, but this means something else, namely the
business of trading). The same goes for qualijck (modern form: kwalijk) and
laeijt (modern word: laadt). Also, some words are in fact historic but are not
recognized as misspellings. The word niette should become niet (English: not),
but is instead recognized as the past singular form of the verb nieten (English:
to staple, as in stapling sheets of paper together).
Another spell checker available for Dutch is the one that comes with the
MS Word text processor. It suggests orthographically similar words for any
unknown word in the text, and is also capable of checking grammar. This is the
output after applying the correct suggestions by MS Word:
word is in the list of alternatives. For the historic word alsoo it correctly
suggests alzo, and afterwards suggests replacing alzo with the more grammatically
appropriate word zo.
Spell checkers can be used to find correct modern words for historic words
that are orthographically similar. However, for many historic words, spell
checkers cannot find the correct alternative, and for some they cannot find any
modern alternative at all. Moreover, each word has to be checked separately and
the correct suggestion has to be selected from the list manually (the correct
alternative is not always the first one in the list of suggestions). It would still
take an enormous amount of time and effort to modernize historic documents
for HDR in this way. A spelling check is not a good solution. It seems we do
need specific resources to aid HDR.
Chapter 3
Rewrite rules
In this chapter, the spelling bottleneck and approaches to solving it are
described. The following points will be discussed:
• Evaluation of rewrite rules: How are the sets of rewrite rules evaluated?
And how well do they perform?
One way of solving this problem would be to expand your query with spelling
variations typical of that period. But few people possess the necessary
knowledge to do this. Apart from that, it is fairly time consuming to think of all
these variations, and some will inevitably be omitted. It would be far more
efficient to do query expansion automatically, or to rewrite all historic documents
to a standard form that closely matches modern Dutch.
Robertson and Willett [18] have shown that character-based n-gram matching
is an effective way of finding spelling variants of words in 17th century
English texts. Historic word forms for modern words were retrieved based on
the number of n-grams they shared. All the historic words were transformed
into an index of n-grams, and the 20 words with the highest score were retrieved.
The score was computed using the Dice score, with N(Wi, Wj) being the number
of n-grams that Wi and Wj have in common, and L(Wi) the length of word Wi:

    Score(Wmod, Whist) = 2 × N(Wmod, Whist) / (L(Wmod) × L(Whist))    (3.1)
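A minimal sketch of this scoring function, under the simplifying assumption that shared n-grams are counted as types (each distinct n-gram once), and with the denominator as in Eq. 3.1:

```python
def ngrams(word, n=2):
    """The set of overlapping character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def score(w_mod, w_hist, n=2):
    """N-gram similarity in the spirit of Eq. 3.1."""
    shared = len(ngrams(w_mod, n) & ngrams(w_hist, n))
    return 2 * shared / (len(w_mod) * len(w_hist))

# The historic form shares the bigrams ns, sp, pr and ra with its modern form.
print(round(score("aanspraak", "aenspraeck"), 3))  # → 0.089
```

In a retrieval setting, each modern query word would be scored against every word in the historic index and the 20 highest-scoring historic words returned.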
In a historic word list containing 12,191 unique words, 2,620 historic words
were paired with 2,195 unique modern forms; thus, each modern form had at
least one corresponding historic word form. The results in Table 3.1 show the
recall at the 20 most similar matches (no precision scores are given in [18]).
Method            Recall
2-gram matching   94.5
3-gram matching   88.8

Table 3.1: Comparative recall for the 20 most similar matches for historic English
Braun [2] has conducted the same experiment for 17th century Dutch. It
turns out that n-gram matching performance is increased by standardizing
spelling and stemming (Table 3.2). The inconsistency of spelling makes it hard to
apply a stemming algorithm directly to historic documents. Therefore, spelling
is standardized by applying rewrite rules to the historic words. In [2], these
rewrite rules for 17th century Dutch were constructed with the help of experts.
They transform the most common spelling variations to a standard spelling.
Most of the variations of rechtvaardig just mentioned would be changed to the
modern spelling by these rules. By rewriting different spelling variants to the
same word form, and removing affixes through stemming, a fair number of word
forms are conflated to the same stem.

Still, constructing rules manually with the help of experts takes a lot of
effort, and experts in 17th century Dutch are not freely and widely available.
More efficient, but also more error prone, are automatic, statistical methods
to produce rewrite rules. In this chapter, several automatic approaches are
compared with each other as well as with the rewrite rules constructed by Braun.
Table 3.2: Comparative recall for the 20 most similar matches for historic Dutch
3.1.2 Resources
To construct rewrite rules, a collection of historic documents is needed, as well as
a collection of modern documents. Without the modern documents, it would be
much harder to standardize historic spelling. There are several equally acceptable
ways of spelling a word in 17th century Dutch; no single spelling is better
than the others. To ensure uniform rewriting, the rules have to be constructed
with great care. Identifying spelling variants is only the first step. The second
step is rewriting them all to the same form: within a group of spelling variants,
the same standard form should be chosen. But this is far from trivial.
Consider the spelling variants klaeghen, klaegen, klaechen and claeghen.
Three out of four words start with kl, so it seems sensible to choose kl as the
standard form. Also, two out of four words use gh, so g and ch should become
gh as well to transform all four variants into a uniform spelling. Another group
of spelling variants might be vliegen, vlieghen, vlieggen, vlyegen and fliegen. In
this case, rewriting fl to vl seems to make more sense than rewriting vl to fl.
The same goes for ye and ie. But the next transformation should be selected
more carefully. Of the three different options g, gh and gg, g occurs most often. But
rewriting gh to g would conflict with the earlier decision to rewrite g to
gh. A far easier solution, with the goal of making resources for information
retrieval in mind, is to rewrite the historic word forms to modern word forms.
In that case, a standard spelling already exists, and rewriting historic spelling
variants to a uniform word is done by rewriting them to the appropriate modern
word. Of course, we need to find the appropriate modern form, which might
not be easy at all. But we are faced with the same problems when finding the
different historic spelling variants themselves. The other advantage of rewriting
to modern words becomes clear when combining it with an IR system. Modern
users pose queries in modern language. Rewriting all possible historic variants
to one historic word will not make it any easier for the IR system to match
them with their modern variants. Rewriting historic words to modern words means
rewriting to the language of the user.
the period small, is that spelling changed over time. If a larger time-span is
chosen, greater ambiguity in spelling might result in incorrect rewrite rules.
The pronunciation of some character combinations in 1550 might have changed
by 1600. Also, the specific period between 1600 and 1620 makes it possible
to compare the generated rewrite rules with the rules constructed by Braun,
since these rules were based on two law books dating from 1609 and 1620.
The corpus used in this research, named hist1600, contains these same two law
books, in addition to a book by Karel van Mander (Het schilder-boeck), printed
in 1604. Two of the techniques used here compare the words of the historic
corpus to words in a modern corpus. The modern corpus (15 editions of the
Dutch newspaper “Algemeen Dagblad”) is equal in size to the historic corpus
(see Table 3.3). The included editions of the newspaper were selected by
date, ranging over two whole years, to make sure that not all editions cover the
same topics (two successive editions often contain articles on the same topics,
probably repeating otherwise low-frequency content words).
Syllable structure
One such observation is that, apparently, most historic words that are
recognizable spelling variants of modern words have the same syllable structure as their
modern form. Each syllable in Dutch contains a vowel, and can have a
consonant as onset and/or as coda. If we take the modern Dutch word aanspraak
and a historic form aenspraeck, the similarity in syllable structure is obvious.
For both forms the first syllable has a coda (n), the second syllable has an onset
(spr) and shows a difference in the codas (k vs. ck). The vowels of the two
syllables also differ (aa vs. ae). Can this be of any help in choosing a method to
attack the spelling problem? A solution could be to split the words into syllables
and then make rewrite rules mapping the historic syllable to the modern
syllable. This would give the following rules:
aen → aan
spraeck → spraak
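A minimal sketch of applying such syllable-level rules; the rule table and the pre-split syllables are illustrative assumptions:

```python
# Illustrative sketch: rewrite rules that apply to whole syllables only.
RULES = {"aen": "aan", "spraeck": "spraak"}

def rewrite(syllables):
    """Rewrite each syllable only if a rule covers the entire syllable."""
    return "".join(RULES.get(s, s) for s in syllables)

print(rewrite(["aen", "spraeck"]))  # aanspraak
print(rewrite(["staen"]))           # staen: no rule covers this whole syllable
```

Because a rule must cover the whole syllable, staen is left untouched even though it contains aen.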
The advantage of this approach is that it will not only rewrite the word
aenspraeck but also any other historic word that contains the syllable aen. What
it won’t do is rewrite the word staen to staan (English: to stand), since it
won’t rewrite syllables containing aen that have an onset. After reading a few
sentences of a historic document it becomes clear that the vowel ae is very
common in these documents. In modern documents it is not nearly as common.
One problem that is immediately visible is that an enormous number of rules
is needed to rewrite all words that contain the vowel combination ae, covering
all the different syllables in which this combination can appear. And since the
corpus is limited, not all possible syllables can be found. The rules need to be
generalized. For instance, a rule could be: rewrite all instances of ae to aa in
syllables that have a coda.
But this introduces a few problems. For native Dutch speakers, it is probably
fairly easy to recognize the syllable structure of many historic words. But an
automatic way of splitting a word into syllables would be based on the modern
Dutch spelling rules. Since historic words are not in accordance with these rules,
splitting them into syllables might do more harm than good. According
to modern spelling rules, the word claeghen would be split into claeg and hen,
which is not what it should be (namely, clae and ghen). Redundant letters in
historic words can shift the syllable boundaries, adding a coda or onset where
there shouldn’t be one.
To get around this problem, it is possible to split the word into sequences of
vowels and sequences of consonants. The word claeghen would be split into
the sequences cl, ae, gh, e and n. Syllable boundaries can fall inside one
sequence (ia in hiaten), but this need not be a problem. Historic vowel sequences may
only be rewritten to modern vowel sequences, and historic consonant sequences
may only be rewritten to modern consonant sequences. Putting this restriction
on what a historic sequence can be rewritten to will retain the syllable structure.
Again, the considered context can be specific, changing ae to a in the context of
cl and gh, or general, changing ae to a if the sequence is preceded and followed
by any consonant sequence.
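The vowel/consonant split can be sketched as follows; the vowel inventory is an assumption (y is treated as a vowel here):

```python
import re

# Assumed vowel inventory; historic Dutch spelling also uses y as a vowel.
VOWELS = "aeiouy"

def split_sequences(word):
    """Split a word into maximal runs of vowels and maximal runs of consonants."""
    return re.findall(f"[{VOWELS}]+|[^{VOWELS}]+", word)

print(split_sequences("claeghen"))  # ['cl', 'ae', 'gh', 'e', 'n']
```

Note that the split is purely orthographic: the run ia in hiaten stays one sequence even though a syllable boundary falls inside it.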
substitution takes 2 steps (the same as 1 delete + 1 insert). Changing bed into
bad takes one substitution (‘e’ to ‘a’), thus the edit distance between bed and
bad is 2. The edit distance between bard and bad is 1 (deleting the ‘r’). This can
be used to find the word in a lexicon that is closest to the misspelled word [6].
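The edit-distance variant just described, in which a substitution costs two steps, can be sketched as:

```python
def edit_distance(a, b):
    """Edit distance in which a substitution costs 2 (= 1 delete + 1 insert),
    matching the variant described above."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 2
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + sub)  # substitute (or match)
    return d[m][n]

print(edit_distance("bed", "bad"))   # 2: one substitution
print(edit_distance("bard", "bad"))  # 1: one deletion
```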
However, historic spelling differs from misspellings in modern texts. The
variance in spelling is not based on accidentally hitting a wrong key on the
keyboard, but on phonetic information. Without any official spelling, writing
caas or kaas makes no difference: they are both pronounced the same. Thus,
historic Dutch can be treated as modern Dutch with spelling errors based on
a lack of knowledge of modern spelling rules (of which people in the 17th
century were, of course, ignorant). Thus, writing caas instead of kaas (English:
’cheese’) is more probable than writing cist instead of kist (English: ’box’), since
a c is pronounced as a k when followed by an a, but as an s when followed
by an i. From a phonetic perspective, the distance between cist and
kist is bigger than that between caas and kaas.
All these sequence pairs are pronounced the same, including the pair [gh, g].
From this list of pairs, only the ones that contain two distinct sequences are
interesting; rewriting kl to kl has no effect.
After applying rewrite rules based on phonetic transcriptions, the lexicon has
changed. But iterating this process has no further effect. Since the rewrite rules
are based on mapping historic words to modern words that are pronounced the
same, after rewriting, the pronunciation stays the same.
1 see http://tcts.fpms.ac.be/synthesis/mbrola.html
the Nextens package. This list contains 340,000 modern words with phonetic transcriptions.
However, converting the words to phonetic transcriptions using Nextens results in
transcriptions different from the ones in the Kunlex list.
    Seq_hist^i → Seq_mod^i    (3.2)

The resulting rules are ranked by their frequency of occurrence. Thus, if
Seq_hist^i and Seq_mod^i are aligned N times in all the equal-sounding word pairs,
the resulting rule has score N. If Seq_hist^i and Seq_mod^i are aligned often, it is
highly probable that the rule is correct, and that it will have a huge effect on
the historic corpus.
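The ranking of candidate rules by alignment frequency can be sketched with a simple counter; the aligned sequence pairs below are illustrative:

```python
from collections import Counter

# Aligned (historic, modern) sequence pairs from equal-sounding word pairs.
# Identity pairs such as (kl, kl) are dropped: rewriting a sequence to
# itself has no effect.
aligned = [("gh", "g"), ("ae", "aa"), ("gh", "g"), ("kl", "kl")]
rules = Counter(pair for pair in aligned if pair[0] != pair[1])
print(rules.most_common())  # [(('gh', 'g'), 2), (('ae', 'aa'), 1)]
```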
    RF_hist(Seq) = F_hist(Seq) / N_hist(Seq)    (3.3)

The same is done for the modern corpus. The final relative sequence
frequency RSF(Seq) is given by:

    RSF(Seq) = RF_hist(Seq) / RF_mod(Seq)    (3.4)
A sequence Seq^i with a high RSF(Seq^i) is a typically historic character
combination. A score of 1 means that the sequence is used just as frequently in a
modern corpus as in a historic corpus. A threshold is used to determine whether a
sequence is considered typically historic or not. This threshold is set to 10,
meaning that, for a historic and a modern text of equal size N, the character
sequence Seq^i should occur at least 10 times more often in the historic text to
be considered typically historic. The reasoning behind this is that if a sequence
occurs much more often in a historic text (i.e. is much more common in a historic
text), there is a good chance that its spelling has changed in the past few
centuries. If Seq^i occurs in the historic corpus but not in the modern corpus (i.e.
RF_mod(Seq^i) = 0), RSF(Seq^i) is set to 10. No matter what its historic
frequency is, Seq^i is infinitely more frequent in the historic corpus than in the
modern corpus, and is considered a typically historic character combination.
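A minimal sketch of the RSF computation of Eqs. 3.3 and 3.4 on toy data; the corpora and function names are illustrative:

```python
from collections import Counter

def rsf(hist_seqs, mod_seqs, cap=10.0):
    """Relative Sequence Frequency: the relative frequency of a character
    sequence in the historic corpus divided by its relative frequency in the
    modern corpus. Sequences absent from the modern corpus get the cap of 10."""
    hist, mod = Counter(hist_seqs), Counter(mod_seqs)
    n_hist, n_mod = sum(hist.values()), sum(mod.values())
    scores = {}
    for seq, f in hist.items():
        rf_hist = f / n_hist
        rf_mod = mod[seq] / n_mod
        scores[seq] = rf_hist / rf_mod if rf_mod else cap
    return scores

# Toy corpora: 'ae' never occurs in the modern sample, so it gets the cap.
scores = rsf(["ae", "ae", "aa", "cl"], ["aa", "aa", "aa", "kl"])
print(scores["ae"])            # 10.0
print(round(scores["aa"], 2))  # 0.33
```

Sequences whose score reaches the threshold of 10 would then be treated as typically historic and fed into the rule generation step.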
Starting with the highest-ranking character sequence Seq, all historic words that contain this sequence are transformed into so-called 'wildcard words'. If Seq is a vowel sequence, a historic word W_hist contains Seq if Seq is preceded and followed by consonants, or by the start or end of the word. For example, the word quaellijk is not listed as a word containing the sequence ae, since ae is not the full vowel sequence (which is uae). In all the historic words, Seq is replaced by a 'vowel wildcard', resulting in a wildcard word WW_hist. The word aenspraek is transformed to VnsprVk, where V is a vowel wildcard. WW_hist is then compared to all modern words. A modern word W_mod matches WW_hist if it can match all vowel wildcards with vowel sequences, or consonant wildcards with consonant sequences. Thus, VnsprVk is matched with the modern word
aanspraak, but also with inspraak, inspreek, and aanspreek. Given these 4
matches, ae is matched with i once, ee twice, and 4 times with aa, resulting in
3 different rewrite rules:
ae → aa
ae → ee
ae → i
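The wildcard construction and matching can be sketched with regular expressions. This is an illustrative reimplementation under stated assumptions (function names are mine; wildcards are restricted to vowel sequences, and a wildcard is required to capture a full vowel run in the modern word):

```python
import re
from collections import Counter

VOWELS = "aeiouy"

def to_wildcard(word, seq):
    """Replace every full vowel sequence equal to `seq` with the wildcard 'V'.
    Returns None when `seq` never occurs as a full vowel sequence."""
    if seq not in re.findall(f"[{VOWELS}]+", word):
        return None
    return re.sub(f"[{VOWELS}]+",
                  lambda m: "V" if m.group() == seq else m.group(), word)

def rule_candidates(hist_words, modern_words, seq):
    """Count which modern vowel sequences fill the wildcards, yielding
    candidate rewrite rules seq -> modern sequence."""
    counts = Counter()
    for hist in hist_words:
        wildcard = to_wildcard(hist, seq)
        if wildcard is None:
            continue
        # each 'V' must match a (full) vowel sequence in the modern word
        pattern = "^" + "".join(f"([{VOWELS}]+)" if ch == "V" else re.escape(ch)
                                for ch in wildcard) + "$"
        for mod in modern_words:
            m = re.match(pattern, mod)
            if m:
                counts.update(m.groups())
    return counts
```

Running `rule_candidates(["aenspraek"], ["aanspraak", "inspraak", "inspreek", "aanspreek"], "ae")` reproduces the example above: aenspraek becomes VnsprVk, and the three rules ae → aa, ae → ee and ae → i emerge, with aa as the most frequent consequent.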
While the PSS and RSF algorithms match only full sequences (a full vowel sequence is only matched with another full vowel sequence), the RNF algorithm matches an n-gram with any character sequence between n − 2 and n + 1 characters long.
This restriction is based on the fact that modern spelling is more compact than historic spelling. To indicate that vowels should have a long pronunciation, historic words are often spelled with double vowels (like aa, ee). In modern spelling, this is no longer needed (except in a few cases) because of the official spelling rules. Also, exotic combinations like ckxs were normal in historic writing, but in modern spelling, only x or ks is allowed. Thus, it is to be expected that a modern spelling variant of a historic sequence is often shorter than the historic sequence itself.
Also, without this restriction, the number of possible rules would explode. Replacing the n-gram aek in zaek with an unrestricted wildcard W yields the wildcard word zW, which matches zaek with all existing modern words that start with the letter z. Processing hundreds of such wildcard words would require enormous amounts of memory and disk space. By restricting the length of the wildcard, only words of length 2 to 5 are matched (this still matches many words, but memory requirements are now within acceptable limits).
There is no restriction on vowels or consonants. An n-gram containing only
vowels can be matched by a wildcard containing only consonants. RNF is tested
with different n-gram sizes, ranging from 2 to 5.
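The length-restricted wildcard of RNF can be sketched with a bounded regular-expression quantifier. This is a simplified illustration (names are mine; only the first occurrence of the n-gram is replaced, and vowels and consonants are deliberately not distinguished, as in the text):

```python
import re

def rnf_matches(hist_word, ngram, modern_words):
    """Replace the first occurrence of `ngram` in `hist_word` with a wildcard
    restricted to n-2 .. n+1 characters, and return the modern sequences it
    captures (candidate consequents for rules ngram -> sequence)."""
    n = len(ngram)
    i = hist_word.find(ngram)
    if i < 0:
        return []
    pattern = ("^" + re.escape(hist_word[:i])
               + f"(.{{{max(0, n - 2)},{n + 1}}})"   # length-restricted wildcard
               + re.escape(hist_word[i + n:]) + "$")
    return [m.group(1) for m in (re.match(pattern, w) for w in modern_words) if m]
```

For zaek and the 3-gram aek this builds the pattern `^z(.{1,4})$`, i.e. it matches exactly the modern words of length 2 to 5 starting with z; with an unbounded wildcard it would match every modern word starting with z, which is the explosion described above.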
When constructing rules from wildcard matches, the same pruning threshold
is used as for the RSF algorithm described above. Without this threshold, the
number of generated rules would still be enormous for large n (n ≥ 4). Especially
for n = 5, literally hundreds of thousands of rules are generated. To reduce
memory and disk space requirements, the pruning threshold for n = 5 is set to
5.
found. But how many matches are needed to make a reliable judgment whether
a rule is appropriate or not? There are many different criteria that can be used.
For instance, given a typical historic character sequence Seq_hist, the number of modern sequences N(Seq_mod) that lead to rewriting a historic word W_hist to a modern word W_mod should be as low as possible. N(Seq_mod) is the number of alternatives from which a modern sequence should be picked. If the same historic sequence occurs in many different rules (i.e. there are a lot of different modern consequents to rewrite to), the chance of any one of them being correct is small. If there is only one rule (i.e. there is only one modern consequent found for a historic sequence), then that is inevitably the best option. Another aspect to look at is the effect of the rule on the modernly spelled words in the historic corpus. Comparing the words of the historic corpus with a modern word list (Nextens comes with a fairly large list containing approximately 340,000 modern Dutch words with phonetic transcriptions, the so-called Kunlex word list) shows which words in the historic corpus have not changed in spelling. These words shouldn't be affected by rewrite rules. The criterion then becomes selecting only rules that have little to no effect on modernly spelled words. Of course, it is also possible to retain rules that have a large effect on these words, but restrict the application of such a rule to non-modern words (i.e. words that are not in the modern lexicon).
Another important decision to be made is whether a historic sequence can be rewritten to different modern spellings. As the y-problem described in section 3.6.1 indicated, not all sequences ay should be rewritten to the same modern form. The RNF algorithm has no difficulty with these problems, since larger n-grams take the context of ay into account. By first applying large n-grams, different words containing ay might be affected by different RNF rules. The other 2 algorithms, PSS and RSF, cannot take context into account since they use only vowels or only consonants. Thus, whatever selection criterion is used, only one modern spelling will be selected for each historic sequence. The following selection criteria will be discussed:
• Non-Modern: Remove all rules that affect modern words in the historic lexicon. A word from the historic lexicon is modern if it is also in the Kunlex lexicon (NM).
• Salience: For the set of competing rules with the same antecedent part, select the consequent part with the highest score only if the difference between the highest score and the second-highest score is above a certain threshold (S).
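The salience criterion can be sketched as follows. Note an ambiguity in the text: the bullet speaks of a score difference, while the later explanation (a threshold of 2 means the best rule matches at least twice as many wildcards as the runner-up) reads as a ratio; this sketch, with names of my own choosing, uses the ratio interpretation.

```python
from collections import defaultdict

def select_salient(rules, threshold=2.0):
    """Per antecedent, keep the highest-scoring consequent only when it
    scores at least `threshold` times the runner-up (S criterion).
    `rules` maps (antecedent, consequent) pairs to scores."""
    grouped = defaultdict(list)
    for (ante, cons), score in rules.items():
        grouped[ante].append((score, cons))
    selected = {}
    for ante, scored in grouped.items():
        scored.sort(reverse=True)
        if len(scored) == 1 or scored[0][0] >= threshold * scored[1][0]:
            selected[ante] = scored[0][1]
    return selected
```

With competing rules ae → aa (score 4), ae → ee (2) and ae → i (1), only ae → aa survives; an antecedent whose best rule is less than twice as good as the runner-up yields no rule at all.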
Table 3.4: edit distance matrix between volcx (columns) and volks (rows).

        v   o   l   c   x
    0   1   2   3   4   5
v   1   0   1   2   3   4
o   2   1   0   1   2   3
l   3   2   1   0   1   2
k   4   3   2   1   2   3
s   5   4   3   2   3   4
The final edit distance between volcx and volks is 4. The first three characters of both words are the same, resulting in an edit distance of 0. But the next two characters differ: going from c to k takes one substitution, and another substitution is needed to go from x to s.
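The matrix above is the standard dynamic-programming edit distance with cost 1 for insertion and deletion and cost 2 for substitution (the costs stated later in this section). A minimal implementation:

```python
def edit_distance(a, b):
    """Dynamic-programming edit distance: insertion and deletion cost 1,
    substitution costs 2, as in the matrix above."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i              # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j              # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 2
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[-1][-1]
```

This reproduces the distances used in this section: 4 for volcx/volks and volcc/volks, 2 for volcs/volks, and 1 for volk/volks.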
Table 3.5: edit distance matrix between volcc (columns) and volks (rows).

        v   o   l   c   c
    0   1   2   3   4   5
v   1   0   1   2   3   4
o   2   1   0   1   2   3
l   3   2   1   0   1   2
k   4   3   2   1   2   3
s   5   4   3   2   3   4
If the difference between D(W_hist, W_mod) and D(W_rewr, W_mod) is zero, then either the rule is not applicable to the historic word, or it has no effect on the distance, in which case it is probably an inappropriate rule. Changing volcx into volcc has no effect on the edit distance (the distance between volcx and volks is equal to the distance between volcc and volks, see Tables 3.4 and 3.5), but the word has changed into something that is pronounced differently, while the historic word is pronounced the same as its modern variant volks. Most native Dutch speakers will have little trouble recognizing volcx as a spelling variant of the adjective volks, while they would probably recognize volcc as a variant of the noun volk.
The problem with using edit distance as a measure is that a bigger reduction in distance does not necessarily mean that a rule is better. Take two competing rewrite rules lcx → lcs and lcx → lk. The first rule reduces the edit distance from 4 to 2 (see Table 3.6), while the second rule reduces it to 1 (Table 3.7). The result of the first rule is a word that looks and sounds much like the correct modern word. The result of the second rule is a different modern word.
Table 3.6: edit distance matrix between volcs (columns) and volks (rows).

        v   o   l   c   s
    0   1   2   3   4   5
v   1   0   1   2   3   4
o   2   1   0   1   2   3
l   3   2   1   0   1   2
k   4   3   2   1   2   3
s   5   4   3   2   3   2
A rewrite rule has a positive effect on the selection set if the average distance between historic and modern words is reduced. The average change in distance between the original test set and the test set after rewriting is given by:
Table 3.7: edit distance matrix between volk (columns) and volks (rows).

        v   o   l   k
    0   1   2   3   4
v   1   0   1   2   3
o   2   1   0   1   2
l   3   2   1   0   1
k   4   3   2   1   0
s   5   4   3   2   1
C = (1/n) Σ_{i=0}^{n} [ D(W_hist^i, W_mod^i) − D(W_rewr^i, W_mod^i) ]    (3.5)
where D(W_hist, W_mod) is the edit distance between a historic word and its modern variant, and D(W_rewr, W_mod) is the edit distance between the rewritten historic word and the same modern variant. A simple measure would be dividing the average change in edit distance C by the distance D(Seq_hist, Seq_mod) between the historic antecedent Seq_hist of the rule and its modern consequent Seq_mod (rules that change multiple characters should reduce the average distance more than rules that change only one character):
Score(rule_i) = C_i / D_i    (3.6)
If the resulting score is close to 1, the total amount of change by the rewrite rule is mostly in the right direction. Looking again at the example of the rule cx → k, the edit distance between the original historic word volcx and the modern word volks is reduced by 3, and the edit distance between cx and k is also 3 (cost 2 for the substitution of c with k, and cost 1 for deleting x). Thus, this rule scores 1. In other words, every change by the rule is a change in the right direction. But this is not good enough: rewriting cx to k reduces the edit distance between volcx and volks, but the rule cx → ks not only reduces the edit distance, it also rewrites the historic word to the correct modern variant. According to (3.6), both rules would get the same score. But if a rule changes some historic words to their correct modern forms, it must be a good rule. A better measure should account for this. (3.7) adds the number of words changed to their correct modern form M divided by the total number of rewritten words R:
Score(rule_i) = C_i / D_i + M_i / R_i    (3.7)
Now, the rule cx → ks receives a higher score because it rewrites at least some of the words containing cx to their correct modern form. To make sure that rules with an accidental positive effect are not selected, a threshold of 0.5 is set for the final score. In words, this means that for each step done by the rule (insertion or deletion takes one step, substitution takes two steps), the distance should be reduced by at least 0.5.
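Equation (3.7) can be sketched as follows. This is an interpretation, not the thesis code: the names are mine, the average change C is computed over the rewritten pairs only (the text is ambiguous on whether (3.5) averages over all pairs or only affected ones), and rewriting is modelled as simple substring replacement.

```python
def edit_distance(a, b):
    """Edit distance with insert/delete cost 1 and substitution cost 2."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (0 if ca == cb else 2))
    return d[-1]

def score_rule(rule, pairs):
    """Score a rewrite rule on (historic, modern) word pairs via (3.7):
    (C / D) + (M / R)."""
    ante, cons = rule
    d_rule = edit_distance(ante, cons)   # D: distance covered by the rule itself
    change = perfect = rewritten = 0
    for hist, mod in pairs:
        rewr = hist.replace(ante, cons)
        if rewr == hist:
            continue                     # rule does not apply to this pair
        rewritten += 1
        change += edit_distance(hist, mod) - edit_distance(rewr, mod)
        perfect += rewr == mod           # M: perfect rewrites
    if rewritten == 0:
        return 0.0
    return (change / rewritten) / d_rule + perfect / rewritten
```

On the single pair (volcx, volks), the rule cx → k scores exactly 1 (all change in the right direction, no perfect rewrite), while cx → ks scores 2, reflecting its perfect rewrite.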
The big disadvantage of selecting only rules that have a positive effect on
the selection set is that not all the typical historic word forms and letter com-
binations are in the selection set. Although the rules are based on statistics
on the whole corpus, some constructed rewrite rules that are appropriate might
not be selected because they have no effect on the selection set. On the other
hand, from a statistical viewpoint, if a specific character combination is not in a
set of 1600 randomly selected word pairs, then it is probably not a common or
typical historic combination. Another drawback is that words that couldn't be recognized as variants of modern words are not in the test set, but are affected
by the selected rewrite rules. Although the performance of a rule on the test set
gives an indication of its “appropriateness” on the recognizable words, there is
no such indication for its effect on the unrecognized words.
The test set is used as a final evaluation of the selected rewrite rules. The
rewrite rules are applied to the historic words and then compared with the
correct modern forms. As mentioned above, comparison is based on the edit
distance between the words. The final score for a rule is the average distance
between the rewritten words and the correct words. To get some measure of the
effect of rewriting, the average distance between the original historic words and
the correct words is also calculated as a baseline. The difference between these
two averages should give an indication of the effect of rewriting. The baseline score is shown in Table 3.8.
Table 3.8: baseline score on the test set.

            total word pairs    average distance
baseline    400                 2.38
3.5 Results
The three algorithms PSS, RSF and RNF are evaluated using the test set. To get an idea of how well certain rule sets perform, all automatically generated rule sets are compared with the manually constructed rule set in [2]. The results for this set of rules on the test set are given in Table 3.9. Column 2 gives the total number of rules in the rule set (num. rules); column 3 gives the total number of historic words that are affected by the rules (total rewr.). The 4th column shows the number of historic words for which the rewriting is optimal (perf. rewr., indicating a perfect rewrite). The last column shows the new average distance (new dist.) between the rewritten historic words and the modern words. The difference between the new average distance and the baseline average distance is shown in parentheses.
like the ‘e’ in the character. The problem with igh is that in many words, the
‘i’ is pronounced as a schwa, but the ‘gh’ is certainly pronounced. After con-
version, the word veiligh is matched to the modern Dutch word veilen, because
the final ‘n’ in infinitivals is often not pronounced. A chain is as strong as its
weakest link. If the phonetic transcriptions are not 100% correct, the generated
rules can’t be either.
As an extra, second selection criterion, only rules were selected that had no effect on those words of the historic lexicon that also occur in the modern Kunlex lexicon. Thus, only non-modern (NM, see the Non-Modern selection criterion in section 3.3.1) historic sequences are considered. The results for the salience criterion are given for a salience threshold of 2 (S 2 in the table). This means that the highest-scoring rule R1 for a historic sequence Seq_hist is selected if R1 matches at least twice as many wildcards as the second-best rule R2. Several different threshold values were tested. The threshold value 2 consistently shows the best results.
number of rules that are applied has little effect on the average edit distance
between rewritten words and the correct modern words, but is still in balance
with the total number of affected words (more rules rewrite more words). The
highest scoring rules affect the most words (5 rules rewrite 112 words). For
lower thresholds, NM does have a positive effect, reducing the average distance
by almost 38%. But this is probably because it just reduces the number of
rules. Since most lowly ranked rules increase the average distance, reducing the
number of lowly ranked rules will reduce the negative influence. However, the
number of perfect rewrites is heavily affected by NM. Before applying NM, a
higher threshold results in many more perfect rewrites, and average distance
drops to nearly the original distance (which is 2.38). After applying NM, an
MM-threshold of 50 results in an increase in distance, with much less perfect
rewrites (when compared to an MM-threshold of 50 before applying NM). In
other words, the rules that where thrown out by NM where better than the
rules that NM keeps in the set. Dropping the threshold to 20 introduces some
more bad rules (only 5 rules are added, and the average distance goes up again).
Decreasing the threshold even more shows that some of the rules with a score
below 20 are better than some of the rules with a score above 20.
The results for the Salience (S) selection criterion look much more like the Maximal Match results. At each threshold level the number of rules is only slightly smaller than without selecting on salience. For the average distance, salience works much better: rules with a score above 30 decrease the average distance. However, some of the rules that are removed by the salience criterion actually produce perfect rewrites. For thresholds 30, 40 and 50, the number of rules decreases by 3, and the total number of perfect rewrites decreases by 5. Thus, the 3 rules scoring between 30 and 40 that are removed by salience have a bad effect on the average distance but do have a positive effect on some words. This shows that for the historic antecedents in these 3 rules, multiple modern consequents are required, or the context of the historic sequence (the characters preceding and following the sequence) should be taken into account.
The best results by far are produced by using the selection set. As described in section 3.4.1, the selection set contains 1600 word pairs, which are used to filter out rules that have a negative effect on the test set. The MM-scores of the rules are ignored in this selection criterion, and are replaced by a score based on how well the rules perform on the selection set. As the selection set is constructed in the same way as the test set (in fact, only one set of 2000 words was constructed, which was split afterwards into the selection set and the test set), it should come as no surprise that this produces better results. About 63% of all the words in the test set are rewritten, and about 25% of them to their correct modern forms.
The PSS algorithm clearly suffers from wrong phonetic transcriptions. The
change of pronunciation for some character sequences (most notably the se-
quence ae, which occurs very often in the historic corpus) over time is ignored
by the Nextens conversion tool. These problems occur throughout the rule set,
from highly frequent to rare sequences. Therefore, raising the MM-threshold
will only reduce the total number of rules, effectively reducing the number of
rules which have a negative effect on the test set, but also reducing the number
of rules that have a positive effect. The use of the selection set seems the only
way to sort the good rules from bad ones.
grams already 492,804 historic sequences are possible. Of course, most of these sequences will not be in the historic corpus (take 'qqqq' or 'xjhs' for example). So, before generating the rules, we can predict that there will be far more rules for n-grams of size 4 than for n-grams of size 2. See Table 3.12 for the results of all n-gram lengths. The results for NM are not listed, since they show the same bad effect as for the RSF rules, and would only make Table 3.12 less readable. As for salience, it shows mixed results. For 2-grams, the best salience threshold is 1.5, performing far worse than the original rule set. For 3-grams and 4-grams, the best value is around 1.25, showing some improvement in average distance for lower MM-threshold values (up to 20) but a drop in the number of perfect rewrites.
The results for n-grams of length 2 show that only the 8 highest MM-scoring rules, with a threshold above 50, have a big influence on the test set. These rules are very good, rewriting 67% of all the words, and of these, 56% are perfect rewrites. This is due, for the largest part, to the rule ae → aa. Many of the historic words in the test set contain the sequence ae, and most of their corresponding modern variants have aa as the modern spelling. As the results at lower MM-thresholds show, the low-scoring rules have almost no influence on the test set.
Another noticeable result is that at an MM-threshold of 20, the rules show the greatest reduction in average distance for all the other n-gram lengths. Also, for n ≥ 3, increasing the MM-threshold results in fewer perfect rewrites. As for the total rewrite / perfect rewrite ratio, the best n-gram lengths are 2 and 3.
Like the PSS and the RSF algorithms, RNF benefits greatly from the use of the selection set. All n-gram lengths show an improvement over the MM-threshold selection. The number of selected rules is smaller than for low MM-thresholds (which show the highest number of perfect rewrites of the different MM-thresholds), as is the total number of rewrites. But the number of perfect rewrites increases (this is most noticeable for n ≥ 4). Now, the 4-gram rules show the best results: 62% of all rewrites are perfect, and the average distance is reduced by almost 50%.
3.6 Conclusions
The most significant conclusion is that phonetic transcriptions are not nearly as useful as expected. As mentioned earlier, there are two reasons for this. First, the transcriptions are not always correct. Some letter combinations that no longer occur in modern Dutch words are treated as English or French character sequences. From the surrounding characters it should be clear that the word under consideration is certainly not English or French. The grapheme-to-phoneme converter of Nextens is very accurate compared to other conversion tools, but for this particular task, it is simply not good enough. In defense of Nextens, it should be mentioned that it wasn't designed for this task. It was designed with the pronunciation of modern Dutch in mind, and that it does very well.
The other main reason is that, although the overlap between 17th century
3.6.1 Problems
There are of course some specific problems with using rewrite rules based on
statistics. Since spelling was based on pronunciation, and people pronounced
certain characters in different ways, some historic words are ambiguous. Just
like certain modern words can have different meanings determined by context,
the spelling of some historic words can be rewritten to different modern words,
depending on context. The character combination ue in the historic word muer
can be rewritten to modern spelling as oe as in moer (English: nut) or as uu as
in muur (wall).
Further evaluation
As we saw in the previous chapter, the RSF and RNF algorithms outperform
the phonetically based PSS algorithm. Here, extensions to these methods are
considered, as well as some other evaluation methods and test sets generated
from documents from different periods. This chapter is divided into the following
sections:
variant aa might have been found for the historic sequence ae. After applying this rule to the historic corpus, aenspraeck becomes aanspraack. In the next iteration the wildcard word aanspraaC will be matched with the modern word aanspraak, resulting in the rewrite rule ck → k.
However, by combining the different methods, iterating the phonetic transcription method suddenly does have an effect. After applying the ae → aa rule found by the other methods, a phonetic transcription is made for aanspraack instead of for aenspraeck. And the pronunciation of aanspraack is equal to the pronunciation of the modern word aanspraak, while the pronunciation of aenspraeck isn't (at least, according to Nextens).
words (the first 276 rules affect only 153 words in the test set). There are many more typically historic sequences of length 5 than there are of length 4. The problem with evaluating the rules for n-grams of length 5 is that these sequences are so specific that many of them do not occur in the test set at all. In each iteration, there are many more rules than there are affected words. All these rules have an antecedent part that does occur in the selection set, hence the selection of the rule. But the selection set is much bigger than the test set, and thus contains many more sequences. Looking purely at the scores, it is easy to conclude that 4-grams work better for RNF than 5-grams, but a look at the rules themselves gives another impression. Consider the historic sequence verci. The RNF algorithm finds the rule verci → versi for it, which changes a historic word like vercieren to versieren (adorn, decorate). The 4-gram verc should become verk in words like vercopen and overcomen, but it should become vers in vercieren. Because 5-grams are more specific, 5-gram rules probably make fewer mistakes. On the other hand, longer n-grams are more and more like whole words. Instead of generating rewrite rules, the RNF algorithm would be generating historic-to-modern word matches. It would consider every word of approximately the same length as a possible modernization, leaving all the work to the selection process.
Table 4.3: lexicon size after applying each rule set.

Applied rule set    Lexicon size
None                44041
RSF                 41956
PSS                 41557
RNF                 39368
RNF+RSF+PSS         38525
combinations that improve on it are the combinations with PSS. Apparently, the PSS rules are somewhat complementary to the RNF and RSF rules, as was expected. The RNF and RSF algorithms work in a similar way (relative frequency of a sequence); the PSS algorithm is fundamentally different: its rules are based on phonetics.
It is interesting to see that the total number of unique words in the corpus is greatly reduced by rewriting words to modern form. The original hist1600 corpus contains 47,816 unique words (see Table 3.3) if the lexicon is case-sensitive (upper-case letters are distinct from lower-case letters). If case is ignored, there are 44,041 unique words left. After applying the rules of the combined methods PSS, RNF and RSF, the total number of unique words is reduced to 38,525, a 12.5% decrease. By rewriting, many spelling variants are conflated to the same (standard) form. As Table 4.3 shows, the RNF rules have the most significant effect on conflation. Looking at the number of rules that each method generates, this is hardly surprising.
make mistakes. The modern word zeegevecht (sea battle) is changed to the non-Dutch word zegevecht. The error is not in pronunciation, which is the same for both words, but in spelling. The double vowel ii (almost) never occurs in Dutch, so all occurrences of ii in historic words can safely be reduced. For word-final vowels, the e vowel is an exception. If a word ends in a single vowel e, it is pronounced as a schwa (like the e in 'vowel'). For words ending in a long e vowel, the double vowel ee is required (thee, zee, twee, vee). Thus, the algorithm should ignore word-final ee vowels.
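The RDV algorithm is only described informally here; a minimal sketch following that description (the function name is mine) could look like this:

```python
import re

def reduce_double_vowels(word):
    """RDV sketch: reduce doubled vowels (aa -> a, ee -> e, ii -> i, ...)
    except a word-final ee, which marks a long pronounced vowel."""
    def reduce(match):
        vowel = match.group(1)
        if vowel == "e" and match.end() == len(word):
            return "ee"  # keep thee, zee, twee, vee intact
        return vowel
    return re.sub(r"([aeiou])\1", reduce, word)
```

This keeps zee intact but, as discussed below, still makes the compound mistake of turning zeegevecht into zegevecht, since the doubled ee there is not word-final.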
To test its effectiveness, the algorithm was applied to the full historic word list, containing 47,816 unique words. Of these, 1498 words contain redundant vowels according to the RDV-algorithm. The total number of words containing redundant vowels might be larger, since the algorithm is so simple it is bound to miss some of these words. But what of the words it did affect? The list of reduced words was checked manually. It turns out that of all 1498 words, 134 reductions were incorrect (almost 9%). A closer analysis of the incorrect reductions shows that, by far, the most mistakes are made with the double ee vowel in non-final, open (no coda) syllables in words like veedieven (English: cattle thieves), tweedeeligh (English: consisting of two parts) and zeemonster (sea monster). In each of these 3 examples, the first syllable has its vowel reduced. But in all three examples, the first syllable is a Dutch word by itself. In fact, these words were the very reason why the algorithm ignores word-final ee vowels. It seems that the frequent use of compound words in Dutch has a significant effect on the (too) simple RDV-algorithm. A modification might be compound splitting when encountering a word containing ee: if the first part of the word, up to and including ee, is an existing word by itself (i.e. it's in the lexicon), don't reduce the vowel. Other frequent mistakes have to do with the addition of a suffix. A typical Dutch suffix is -achtig, as in twijfelachtig (doubtful; twijfel means doubt). But the word geelachtig (yellowish; geel means yellow) is incorrectly reduced to gelachtig (gel-like). These errors can be reduced by suffix stripping (stemming).
Furthermore, RDV was tested on the test set (see Table 4.4). It affects only 29 of the 400 historic words in the test set, but of these, 20 are rewritten to the correct form. Used after applying the rewrite rules, RDV still has a significant effect on the test set. The best order of combining the 3 rule generation methods (RNF, RSF and PSS) affects 337 words, rewriting 224 of them to the correct modern form. After the RDV algorithm is applied, 5 more words are rewritten, with 235 perfect rewrites (more than 59% of all the words in the test set!). Many of the double vowels are removed by the 4-gram and 5-gram RNF rules (like eelen → elen), but it is mainly because ae is rewritten to aa, resulting in more double vowels, that double vowel reduction still has a significant effect.
Applying the RNF+RSF+PSS rule set and the RDV algorithm on the ex-
ample text from the ‘Antwerpse Compilatae’ (see chapter 2) gives the following
result:
dat.die daardore vijtten takel oft bevangtouw schoten, ende int wa-
ter oft ter aarden vielen, ende also bedorven worden oft te niette
gingen, alsulke schade oft verlies moet den schipper ook alleen dra-
gen ende den koopman goet doen, als vore.
10. item, als den schipper de goeden so kwalijk stouwd oft laijd dat
d’ene door d’andere bedorven worden, gelijk soude mogen gebeuren
als hij onder geladen heeft rozijnen, allijn, rijs, sout, gran ende an-
dere diergelijke goeden, ende dat hij daar boven op laijd wijnen,
olien oft olijven, die vijtlopen ende d’andere bederven, die schade
moet den schipper ook alleen dragen ende den koopman goet doen,
als boven.
The words verzijmelijk, vijten, genoeh and stouwd are incorrect rewrites of
the words versuijmelijck, uit een, genoeg and stouwt. But takel, maken, schade,
koopman, kwalijk, rozijnen and dragen are correctly transformed from taeckel,
maecken, schade, coopman, qualijck and rosijnen. Although it is far from per-
fect, many words are modernized. Even the word verzijmelijk is orthographically
much closer to its correct modern form verzuimelijk, although its pronunciation
is no longer the same.
variant, namely, the one from the test set. Table 4.5 shows recall at several
different levels. The experiment was repeated after rewriting the historic word
list using the 4-gram RNF rules after 2 iterations (see 4.1.1), and using the best
combination rule set (RNF, RSF, PSS).
Especially at low recall levels (1 and 5), the differences between the original historic words and the rewritten words are huge.
The 4-gram rules generated by RNF perform much better than the 5-gram
rules. The 5-gram rules are a huge improvement on the original words, but the
4-grams are much better still. The performance of 2-gram and 3-gram matching @20 is comparable to the experiments by Robertson and Willett (see Table 3.1).
When matching n-grams for historic word forms, small n-grams (2 and 3) perform better than large n-grams (4 and 5). However, it is interesting to see that the difference between rewriting and no rewriting, for recall at lower levels (recall @1 and recall @5), becomes very big for large n-grams. More specifically, the performance of 4-gram and 5-gram matching is very bad at recall levels 1 and 5.
The document collection is also the same as the one used in [2], because these documents were assessed for the expert topics.
[1]. The results shown here are only based on topics for which there is at least one relevant
document. The results in [1] also take into account topics that don’t have any relevant
documents in the collection.
matrix in which the columns represent the documents in the collection and the rows represent all the unique words in the entire document collection, with each cell containing the frequency of the represented word in the represented document. Thus, a column shows the frequencies of all words that occur in the document, and each row shows the frequencies of a word in the documents that it occurs in.
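The matrix described above can be sketched in a few lines. This is a toy illustration (names are mine; tokenization is simple whitespace splitting, which glosses over the real indexing pipeline):

```python
from collections import Counter

def term_document_matrix(documents):
    """Build the word/document frequency matrix described above: one row per
    unique word, one column per document, cells holding term frequencies."""
    doc_counts = [Counter(doc.split()) for doc in documents]
    vocabulary = sorted(set().union(*doc_counts))
    rows = [[counts[word] for counts in doc_counts] for word in vocabulary]
    return vocabulary, rows
```

Each row of the result is the frequency profile of one word across the collection; each column position corresponds to one document.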
The table shows the results of some standard IR techniques as well. Stemming, as explained in section 2.3, conflates words through suffix stripping. The 4-gram experiment uses 4-grams of words in combination with the words themselves as rows in the inverted index. Decompounding is used to split compound words into their compound parts. The results for the rewrite rules are split into document translation and query translation. In the query translation experiment, a list of translation pairs was made for the words in the historic document collection, containing the original historic term and its rewritten form. Each query was expanded with an original historic word if its rewritten form was a query word. The document translation experiment was done by replacing each word in all documents by its rewritten form from the same list of translation pairs. The first 4 experiments can be seen as monolingual IR (documents and queries are treated as one language). Translating queries or documents is a cross-language (CLIR) approach. Either the documents are translated into the language of the queries, or the queries are translated into the language of the documents. The last two rows show the effect of stemming after translation.
Of the 4 monolingual approaches, the use of 4-grams works best, although
stemming and decompounding perform better than the baseline as well. Trans-
lation of the queries is comparable in performance to the 4-gram approach, but
stemming the translated queries has a negative effect, especially when using
descriptions only. Query translation means adding historic word-forms to the
query. These historic word-forms contain historic suffixes that might not be
stripped by the stemmer, just as the historic word-forms in the documents.
Without rewriting, many historic spelling variants cantnot be conflated to the
same stem. Thus, if the historic query terms are not affected by the stemmer,
they will only be matched by the exact same word-forms in the document collec-
tion, not with any morphological variant. Document translation is clearly supe-
rior to query translation. Even without stemming it consistently outperforms
all the other approaches. But here, stemming is useful. By rewriting, many his-
toric spelling variants are conflated to a more modern standard, including their
suffixes. After stemming, morphological variants are conflated to the same stem,
which significantly improves retrieval performance. For the D+T and T only
queries, the improvement over the baseline is almost 100%.
The results for the expert topics are listed in Table 4.7. These results are
in no way comparable to the known-item results. No matter what approach is
used, nothing performs better than the baseline system. The decompounding
approach, and the query translation approach (without stemming) come close
to the performance of the standard system, but they show no improvement.
A closer analysis of the topics shows that the experts who formulated the
queries used specific 17th century terms, and added historic spelling variants to
some of the descriptions and the titles. Topic 13 has the following description
and title:
The document collection used for retrieval contains documents from the
‘Gelders Land- en Stadsrecht’ corpus, and the ‘Antwerpse Compilatae’ corpus.
Both are collections of texts concerning 17th century Dutch law. All documents
from the ‘Antwerpse Compilatae’ contain the words Antwerpse Compilatae. So,
by putting these words in the query, all documents from this corpus are considered
possibly relevant. Next, the word ‘oorvrede’ is combined
with 2 spelling variants in both the description and the title. By rewriting the
documents, the spelling variant oirvede might have changed, so all documents
originally containing oirvede no longer match with the query word oirvede. In
general, if spelling variants are added to the query, the documents should not
be rewritten, since rewriting is used to conflate spelling variants.
4.5 Conclusions
All different evaluations show that 4-gram RNF generates the best rules. Although
the Braun rules prove to be more period-independent, for documents
written after 1600 the automatic methods perform much better. Word retrieval
benefits greatly from rewriting, with performance on par with the results
from [18] for historic English word-forms. For HDR, the effect of rewriting is
spectacular. What is interesting is the improvement of the stemming algorithm.
Before rewriting, stemming the documents and queries has a mixed effect. For
the titles it is useful, but for the descriptions, stemming has very little effect.
But once the rewrite rules have been applied, many more historic words have a
modern suffix that can be removed, conflating spelling and morphological variants
to the same stem. In other words, rewriting has brought the historic
Dutch documents closer to the modern Dutch language.
As iterating RNF ceases to have effect after 3 iterations, it might be
more effective to switch to phonetic matching, or possibly word retrieval, after
that point. It would be interesting to see the results of a combined run: first
rewriting documents and then using n-gramming to find spelling variants that
are not yet fully modernized. Also, the results of the HDR experiment are based
on old rules. The current best rule set performs much better on the test set
evaluation and on the word-retrieval test, and thus might also perform even
better on the HDR experiment.
Chapter 5
Thesauri and Dictionaries
In the European Union, for instance, all political documents written for the
European parliament have to be translated into many different languages. As these
documents contain important information, it is essential that each translation
conveys exactly the same message. The third paragraph in a Polish translation
contains the same information as the third paragraph in an Italian translation.
This can be exploited to construct a translation dictionary automatically by
aligning sentences and words within these sentences. A collection of such docu-
ments in several languages is often called a parallel corpus. A parallel corpus can
thus be used to find synonyms in one language for words in another language.
If such a collection of documents is available for 17th century Dutch and modern
Dutch, it could be used to construct word translation pairs between 17th
century and modern Dutch. This could be a partial solution to the vocabulary
gap identified in [2]. Partial, because historic words for concepts that no longer
make any sense in modern times cannot be aligned with modern translations,
simply because no such translations exist.
One of the largest parallel corpora is probably the Bible. It has been translated
into many different languages, and also into many different historical variants of
many modern languages. The advantage of using different Bible translations is
that a line in one translation corresponds directly to the same line in the other
translation. The Statenbijbel and the NBV (Nieuwe Bijbel Vertaling) can be
used to construct a limited translation dictionary. The Statenbijbel is the first
Dutch translation of the original Bible, written in 1637. The NBV is the most
recent Bible translation, available in book form and on the internet. However,
the Statenbijbel, unlike the NBV, is not electronically available. The oldest
digitized version that can be found is a modernized version of the Statenbijbel,
dating from 1888. By that time, official spelling rules had been introduced, and late
19th century Dutch is very similar to modern Dutch, making this version useless
for bridging the gap to 17th century Dutch.
query word. Query expansion can benefit from a modern to historic translation
dictionary containing opsnappers as a historic synonym for feestvierders.
There are several techniques that can be used to find semantically related
words automatically. Two of them will be discussed here. The first uses the
frequency of co-occurrence of two specific words; the second uses syntactic
structure to find words that are at least syntactically, and possibly semantically,
related.
The mutual information between two words Wi and Wj is defined as:

$I(W_i, W_j) = \log \frac{P(W_i, W_j)}{P(W_i)\,P(W_j)}$ (5.1)

The total mutual information M(i, j) between two classes Ci and Cj is then:

$M(i, j) = \sum_{C_i, C_j} P(C_i, C_j) \log \frac{P(C_i, C_j)}{P(C_i)\,P(C_j)}$ (5.2)
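Equations (5.1) and (5.2) translate directly into code. A small sketch (the function names are ours, and the probabilities are assumed to be estimated elsewhere, e.g. from corpus counts):

```python
import math

def pointwise_mi(p_ij, p_i, p_j):
    """I(Wi, Wj) = log( P(Wi, Wj) / (P(Wi) * P(Wj)) ), Eq. (5.1)."""
    return math.log(p_ij / (p_i * p_j))

def total_mi(joint, marginal):
    """M(i, j) = sum over class pairs of
    P(Ci, Cj) * log( P(Ci, Cj) / (P(Ci) * P(Cj)) ), Eq. (5.2).

    joint: dict mapping (i, j) -> P(Ci, Cj)
    marginal: dict mapping i -> P(Ci)
    Pairs with zero joint probability contribute nothing.
    """
    return sum(p * math.log(p / (marginal[i] * marginal[j]))
               for (i, j), p in joint.items() if p > 0)
```

For example, two statistically independent classes yield zero total mutual information, while perfectly correlated classes yield log 2.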
The idea behind this approach is that closely related words are classified
close to each other, and two unrelated words should be classified in different
classes early in the tree (near the root). If two words convey the same meaning,
it makes no sense to place them next to each other in a sentence, because it
would make one of them redundant. The meanings of two adjacent words should
be complementary. If two words co-occur often (i.e. their bigram frequency is
high), they should not be in the same class. Low co-occurrence (low bigram
frequency) of high-frequency words (high unigram frequency) makes it probable
that the meanings of these words overlap, so they will be classified close to each
other. The example classification given in [16] shows some classes that might
be useful for query expansion. In one class, all days of the week are clustered
together, and in another class, many time-related nouns are clustered. If one
of the words in such a class is used as a query word, adding other words from
the same class to the query might help in finding documents on the same topic.
Once the N most frequent words have been classified, adding other, less
frequent words requires no more than putting each word in the class that
results in the highest mutual information. This second step becomes trivial
when adding words with very low frequencies. A word W with frequency 1 (this
holds for the largest part of the content words) only shows up in 2 bigrams, once
with the previous word in the sentence, and once with the next word. Thus, it
will only add mutual information when classified in the opposite class of one of
these words. If neither the previous nor the next word is in the same class at
classification level s, putting W in class Ci or Cj at level s + 1 makes no difference
to the mutual information, because, using (5.3), it adds 0 to either class.
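The second classification step described above, assigning each remaining word to the class that yields the highest mutual information, can be sketched as follows. Here `mi_gain` is a hypothetical scoring helper standing in for the full MI computation; for frequency-1 words it is typically 0 for every class, which is why the step becomes trivial, as noted in the text.

```python
def best_class(word, classes, mi_gain):
    """Assign `word` to whichever existing class maximizes the
    mutual-information gain of the candidate assignment.

    mi_gain(word, cls) -> float is assumed to score the assignment; with
    very low-frequency words it returns 0 for every class, so the choice
    (and hence the classification) becomes arbitrary.
    """
    return max(classes, key=lambda cls: mi_gain(word, cls))
```

A trivial usage: `best_class("w", ["A", "B"], lambda w, c: 1.0 if c == "B" else 0.0)` returns `"B"`.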
This introduces a new problem for historic documents. Because of the
inconsistency in spelling, resulting in spelling variants, each variant has a lower
corpus frequency and occurs in fewer bigrams than it would have given a
consistent spelling (by conflating spelling variants, the new word frequency is the
sum of the frequencies of the conflated variants). Apart from that, classifica-
tion based on bigrams requires a huge amount of text. More text means better
classification, simply because there is more evidence to base a classification on.
But the amount of electronically available historic text is limited, resulting in
data sparseness.1
To make sure that the algorithm was implemented correctly, a test classification
was made using a 60 million word English newspaper corpus.2 The 1000 most
frequent words were classified in a 6 level binary classification tree. Table 5.1
shows 4 randomly selected classes, paired with their neighbouring class, at
classification level 6 (the leaves of the tree). Out of the 1 million possible bigrams
(each of the 1000 unique words can co-occur with all 1000 words, including
itself), for the 1000 most frequent words, the corpus contains 412,516 unique
bigrams.

1 Data sparseness in this case means the lack of evidence for unseen bigrams. A bigram
Wi−1, Wi might not occur in the corpus, making the probability P(Wi−1, Wi) 0. Smoothing
techniques can be used to overcome this problem, but for the classification algorithm it still
always adds the same amount of mutual information to each class, making classification trivial.
In a larger corpus, there is a bigger chance that a certain bigram occurs, resulting in a more
reliable probability estimate.
2 The newspaper corpus is the LA Times corpus used at CLEF 2002.

Class  Content
9      city Administration movie very given growing
10     nation department only like proposed approved
27     their housing
28     her my financial economic private drug Simi Newport World Laguna
       National Pacific Long Orange Santa Ventura middle as hot five six
       eight few will can ’ve may said would could does cannot did should
       ’ll is ’d must
35     allow take begin
36     bring give provide keep hold get pay sell win find build break
       create use meet leave become call tell ask say see think feel know
       want run stop play Japan husband hours days though
49     Clinton Anaheim Los Northridge Thousand judge wife couple key
       action summer minute top order largest usually anything non New own
50     Department American San Inc star hearing project election list book
       force war quarter morning week bad different free got
Table 5.1: Classification of 1000 most frequent words in English newspaper corpus
In class 28, a number of auxiliary verbs are clustered, and in class 36, some
semantically related verbs are clustered. The neighbouring class, 35, contains
some related verbs as well, which indicates that there is a relation between
clusters that are classified close to each other. In 49 and 50, some time-related
nouns are clustered (summer, minute, quarter, morning, week). But many
clusters contain seemingly semantically unrelated words, like ‘Administration’,
‘movie’, ‘very’ and ‘growing’. The ability to cluster on semantics is limited,
although more data (a larger corpus) should lead to better (or at least, more
reliable) classification. The corpus is used as a language model for English.
Thus, more text leads to a more reliable model.
Better classifications have been made with syntactically annotated corpora.
One of the main problems with plain text is not the semantic but the syntactic
ambiguity of words. The word ‘sail’ can be used as a noun (‘The sail of a ship’)
or as a verb (‘I like to sail’). But the orthographic form ‘sail’ can only be
classified in one class. In syntactically annotated corpora, a word can be classified
together with its part-of-speech tag. For modern English, such corpora exist,
but for 17th century Dutch, all that is available is plain text.
The total historic Dutch corpus is much smaller than the English one, but
still contains about 7 million words. The 1000 most frequent words share 226,318
unique bigrams (which means that 77% of all possible bigrams do not occur in
the corpus). The same experiment was repeated with the historic Dutch corpus,
and again 4 classes were randomly selected, together with their neighbouring
classes, shown in Table 5.2.

Class  Content
11     (empty)
12     In uit Na Aen om Op tot van vanden of ofte en ende Laet Doen Wilt
       Uw Haer Zy Wy Zijn Mijn Hy Ons Ik Sijn Gy selve Noch vp Dus Der o
       Een Geen Daer Daar Dese Dies Dat Des Dees Alle so soo zo Zoo Indien
       Wanneer Nu al
25     aller also als verheven inder vander wien binnen te alwaer ter
26     welcken nam toch dewyl eerste dat dattet mit achter onder Roomsche
33     wie hemels verre inne vooren
34     och heer staat ras Maria connen konnen ware zijnde mede datse dijen
59     Heer Prins hand borst lijf beeld beelden brief boeck kennis steen
       gelt brant dood verdriet troost rust plaats slagh oyt niet voorts
       eerst sprack
60     bloed kroon troon staet stadt plaets wegh Boeck editie uitgave naem
       stof vrucht glans kort quaet begin neder noyt wel voort wederom
       weder zien sien gaven leeren
Table 5.2: Classification of 1000 most frequent words in historic Dutch corpus
In some of these classes, there are some clusters of syntactically related
words. Class 12 contains many prepositions and pronouns, and classes 59 and
60 contain mostly nouns. Semantically, classes 59 and 60 are also interesting,
because there are some themes. ‘Heer’ and ‘Prins’ (lord and prince), ‘hand’,
‘borst’ and ‘lijf’ (hand, chest, body), ‘brief’, ‘boeck’ and ‘kennis’ (letter, book,
knowledge) in 59, ‘kroon’ and ‘troon’ (crown, throne), ‘staet’, ‘stadt’, ‘plaets’,
‘wegh’ (state, city, place, road), ‘Boeck’, ‘editie’, ‘uitgave’ (book, edition, publication)
in 60. The main problem of small corpora is that, if the mutual information
within one class is zero (none of the words in that class share a bigram), further
classification is useless. This is clear in classes 11 and 12. Apparently, moving
one word from class 12 to 11 does not increase the mutual information. A
further subclassification of class 12 will result in one empty subclass, and the
other subclass containing all words of class 12.
For a better comparison, the 1000 most frequent words in a 30 million word
modern Dutch corpus3 were also classified in a 6 level binary tree. 4 randomly
selected classes and their direct neighbouring classes are listed in Table 5.3. In
this corpus, the 1000 most frequent words share 295,404 unique bigrams.

Class  Content
21     dacht zet vraagt grote hoge oude
22     maakte hield sterke enorme vijftig hoe
29     Bosnische Europees Zuid dezelfde hetzelfde welke ieder vele veel
       enkele beide dertig honderd vijf wat vorig economische mogen hard
       ondanks Van
30     Nederlandse nationale Navo Rotterdamse elke deze zoveel bepaalde
       negen mijn dit laten
31     drie tien zeven derde halve volgend vorige zware speciale
       belangrijke oud
32     zwarte rode politieke vrije ex rekening tv milieu gebruik kun
49     me we wij belangrijkste echte dergelijke voormalige meeste klein
       dollar groei overheid regering gemeente rechtbank Spanje ogen
       televisie stuk leeftijd weekeinde week seizoen keer ander
50     handen hart familie bevolking Raad politiek kabinet onderwijs
       school tafel Feyenoord ploeg elftal finale zomer toekomst maand
       periode buitenland koers produktie verkoop verzoek ton rechter kant
       totaal groter mogelijkheid
Table 5.3: Classification of 1000 most frequent words in modern Dutch corpus

Table 5.3 shows 4 directly neighbouring classes (29, 30, 31, 32). At level 4
in the tree they would be merged into one class. This would make sense, as
classes 29, 30 and 31 contain number words (dertig, honderd, vijf, negen, drie,
tien, zeven) and related adjectives (ieder, vele, veel, enkele, beide, elke, zoveel,
bepaalde, derde, halve), and all 4 classes contain some other adjectives. Classes
49 and 50 also contain some semantically related words: ‘Overheid’, ‘regering’,
‘gemeente’, ‘Raad’, ‘politiek’, ‘kabinet’ (government, government, community,
council, politics, cabinet), and ‘leeftijd’, ‘weekeinde’, ‘week’, ‘seizoen’, ‘zomer’,
‘maand’, ‘periode’, ‘toekomst’ (age, weekend, week, season, summer, month,
period, future).
For all three corpora, the classification trees show some useful clustering,
but it is far from being usable for query expansion, because it is based on high
frequency words, which add very little content to a query and mark a lot of
documents as relevant. As mentioned before, classification of low frequency
words is completely unreliable, because there is very little evidence to base a
3 This corpus is also from CLEF 2002.
classification on. But the low frequency words are the very words that are useful
for document retrieval. Low frequency words, by definition, occur in only a few
documents, and are often related to the topic of a document. It seems that the
only way to get a more reliable classification is to use a bigger corpus.
There is, however, a big difference between English and Dutch with respect
to automatic clustering. The 60 million word corpus used for English contains
‘only’ 306,606 unique words, whereas the 30 million word corpus for modern
Dutch contains 495,605 unique words. The historic Dutch corpus, containing
7 million words in total, has 373,596 unique words. In general, a larger corpus
contains more unique words, so a 60 million word corpus of historic Dutch would
probably contain many more unique words than the English corpus. The main
reason for this is probably the compounding of words. In English, compounds
are rare (like ‘schoolday’), as most nouns are separated by whitespace (‘shoe
lace’), but in Dutch, compounding is much more common, resulting in words like
‘bedrijfswagenfabriek’ (lit.: company car factory) and ‘nieuwjaarsgeschenken’
(New Year gifts). To get enough evidence for a reliable classification, a larger
lexicon requires a larger corpus.
Another difference between these two languages is the word order, which is
stricter in English than in Dutch. Both languages share the Subject - Verb
- Object order in basic sentences. But when a modifier is added to the beginning
of the sentence, the order is retained in English, but changes in Dutch (the verb
is always the second part of the sentence, so the subject comes after the verb).
This has consequences for the number of unique bigrams in the corpus. For Dutch,
a larger number of bigrams is needed to get the same reliability for the ‘language
model.’
The quality of the classification seems to depend on quite a number of factors:
1. Lexicon size: Each unique word needs plenty of evidence for proper
classification, thus a larger lexicon needs more evidence, i.e., more text.
2. Ambiguity in a language: Words that can have different syntactic
functions can supply contradictory evidence (the verb ‘sail’ can co-occur
with words that cannot co-occur with the noun ‘sail’). Languages that
have many of these words are harder to model correctly.
3. Strictness of word order: Some languages allow various word orderings
for a sentence. In many so-called free word order languages, like Polish
and Russian, a rich morphology makes it possible to distinguish syntactic
categories. However, for a language like Dutch, the word order may be
changed, but this introduces changes in pronouns and prepositions. A
Dutch translation of the English sentence ‘I’m not aware of that.’ could be
‘Ik ben me daar niet bewust van,’ or ‘Ik ben me daar niet van bewust.’ But
another word order is allowed, like ‘Daar ben ik me niet bewust van’, or
even ‘Daar ben ik me niet van bewust.’ Nothing changes morphologically,
but there are four different sentences, with exactly the same words and
exactly the same meaning.4 More possible orderings need more evidence
4 Thanks to Samson de Jager for pointing out this peculiarity in Dutch.
to be modelled correctly.
To aid word clustering for historic Dutch, the historic document collection
could be mixed with an equal amount of modern Dutch text to reduce data
sparseness. The spelling of many words has changed over time, but the most
frequent words have changed very little. There is still a reasonably large overlap
between the most frequent words in both corpora, so if no more historic text is
available, modern text might help. For modern Dutch, syntactically annotated
corpora are available, and can be mixed with historic Dutch to estimate POS-
tags for historic words. If all the modern words in a class are nouns, it seems
probable that the historic words in that class are nouns as well. To bridge the
vocabulary gap, clustering historic and modern words with related meanings
might be very useful. At least for query expansion, adding historic words to
modern query words can increase recall.
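The POS-estimation idea sketched above, propagating tags from the modern words in a mixed cluster to the historic words in the same cluster, could look like this. The lexicon, tag set, and example words are invented for illustration.

```python
from collections import Counter

def infer_pos(cluster, modern_pos):
    """Guess a POS tag for the historic words in a mixed cluster by
    majority vote over the cluster's modern words.

    modern_pos: word -> tag, taken from a syntactically annotated modern
    corpus. Returns None when the cluster contains no modern evidence.
    """
    tags = Counter(modern_pos[w] for w in cluster if w in modern_pos)
    return tags.most_common(1)[0][0] if tags else None

# Toy annotated lexicon: N = noun, V = verb.
modern_pos = {"boek": "N", "brief": "N", "lopen": "V"}
print(infer_pos(["boeck", "boek", "brief"], modern_pos))  # 'N'
```

A cluster whose modern members are all nouns thus yields the tag N for its historic members, in line with the reasoning in the text.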
in others, a special html-tag is used to mark it. Consider the next two exam-
ples, the first of which is very clear, containing a special tag to signify a word
translation.
The first note has marked the historic word (‘beschaemt’) by tagging it
with a span class called ‘term’. In all of these cases, the modern translation
(‘teleurgesteld’) directly follows the historic word, and ends with a dot or a
semi-colon. The second note is less specific. The historic word (‘bloot’) is
marked in italics, and the modern translation (‘onbeschermd’) again follows it
and ends in a dot (or a semi-colon). The first note is easy to extract. The second
note is more problematic, because italics do not always signify a translation:
Here, the first word in italics, ‘Orpheus’, is not followed by a modern transla-
tion, but by an explanation of who Orpheus was. A simple way of distinguishing
between this note and the previous one, is that the translation pair contains only
one word after the historic, italicized word. But this doesn’t work for transla-
tions containing several words:
For the historic word ‘sloer’, multiple modern translations are given, separated
by a comma. The historic phrase ‘onses Moeders’ has two modern phrases
as translation. How can these be distinguished from the note about Orpheus?
It gets even worse. Consider the next consecutive notes:
The first one contains the historic phrase inside italics and the modern phrase
following it directly. The second one contains both the historic word and its
modern translation inside italics, and an explanation directly after it. And a few
notes further down, the single word after the italics is not a modern translation,
but a reference:
All this makes it very hard to extract only the translation pairs from a
note. Manual correction is not an option, since the 17th century DBNL corpus
contains over 170,000 footnotes. The final list consists of approximately 110,000
translation pairs, many of which are not actual translation pairs but references,
explanations or descriptions. Still, for query expansion it could be useful. If
each modern translation occurs only a few times, only a few historic words or
phrases are added to the query. Not all of them will be useful, but adding noise
to the query might be compensated by the fact that some relevant words are
added as well. Making separate dictionaries for word to word, word to phrase
and phrase to phrase translations, and evaluating each of them separately, will
give an indication of whether a dictionary can be useful, or contains too much
noise.
The dictionaries in table 5.4 are translations from historic to modern, as
extracted from the DBNL corpus. The word to phrase dictionary contains
historic words as entries, and modern phrases as translations. Vice versa, the
phrase to word dictionary contains historic phrases with modern single word
Table 5.5: Simple evaluation of DBNL thesaurus parts: usefulness of 100 random
samples
Some good examples from the word to word and word to phrase dictionaries
are:
The last example is a typical parsing mistake. The right hand side ‘zie’
(English: see) is part of a reference to something.
Furthermore, the word to word and word to phrase dictionaries were used to
get an idea of the overlap between the historic words in a historic corpus, and
the historic words in the dictionaries. How many of the words in the hist1600
corpus (the corpus used for the RSF and RNF algorithms, see section 3.1.2)
for example, have an entry in the DBNL thesaurus? And what about the
corpus that was used for creating the test set? Table 5.6 gives an indication
of the coverage of the thesaurus. The Braun corpus contains the ‘Antwerpse
Compilatae’ and the ‘Gelders Land- en Stadsrecht’, the Mander corpus contains
‘Het Schilderboeck’ by Karel van Mander. Together, they make up the hist1600
corpus. This split was made because the DBNL thesaurus contains some
entries extracted from the Mander corpus. The same holds for the documents
from the test set corpus. The modern words in the corpora, at least the words
that are found in the modern Kunlex lexicon, were first removed from the total
historic lexicon (column 3). Synonyms for these words can be found in a modern
Dutch thesaurus.
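The coverage figures described here amount to a set computation of this kind. This is a sketch with toy data: the modern lexicon stands in for Kunlex, and the thesaurus is a small dictionary.

```python
def thesaurus_coverage(corpus_lexicon, modern_lexicon, thesaurus):
    """Fraction of the historic-only words of a corpus that have an entry
    in the thesaurus. Words found in the modern lexicon (standing in for
    Kunlex) are removed first, as described in the text."""
    historic = set(corpus_lexicon) - set(modern_lexicon)
    if not historic:
        return 0.0
    return len(historic & set(thesaurus)) / len(historic)

cov = thesaurus_coverage(
    ["boeck", "claghen", "recht"],  # corpus lexicon (toy)
    ["recht"],                      # modern (Kunlex-like) words
    {"boeck": "boek"})              # thesaurus entries
print(cov)  # 0.5
```

Run per sub-corpus (Braun, Mander, test set), this yields the kind of comparison shown in Table 5.6.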
The coverage results can be explained by the footnote extraction. The Braun
corpus does not contain any footnotes, and has the smallest coverage by the
DBNL thesaurus. The Mander corpus has a larger coverage, probably because
a number of entries from the DBNL thesaurus come from ‘Het schilderboeck’.
That the DBNL thesaurus covers even a larger part of the test set corpus is
probably due to the fact that ‘De werken van Vondel, Eerste deel (1605 – 1620)’
is part of the corpus and contains several thousand notes with translation pairs.
The DBNL thesaurus covers a far larger part of the historic words in the
selection and test set (see section 3.4.1). Apparently, in the process of giving
modern spelling variants of historic words, there was a bias towards giving
modern forms for historic words with a salient historic spelling, a bias which
may very well have been shared by the editors of the DBNL who made the
footnotes. Also, the selection and test set do contain some modern words. Out
of the 2000 words in both sets, 34 words are in the Kunlex, showing that
the decision whether a word is historic or modern is not trivial.
Table 5.7: HDR results for known-item topics using DBNL thesaurus
translations apparently does more harm than good. Only modern stop words (the
high frequency words that add little content to the query) are removed from
the query, but historic translations are added before this happens. The titles
don’t contain any stop words, so through translation none are added. The
titles contain mostly low frequency content words. Adding historic synonyms
of these words, and stemming all the query words afterwards, improves performance.
Topic 7 gives a good example:
• Description: Kan een eigenaar van onroerend goed zijn verhuurde pand
zomaar verkopen, of heeft hij nog verplichtingen ten opzichte van de
huurder?
• Title: eigenaar onroerend goed verhuurde pand verkopen verplichtingen
huurder
• Description: kan ken koon mach magh een een eigenaar van onroerend
goed aertigh welzijn binnen cleven sinnen verstrekken verhuurde pand
paan panckt zomaar verkopen veylen of heeft hij deselve sulcke versoecker
nog nach verplichtingen ten opzichte van de huurder huurling
• Title: eigenaar onroerend goed aertigh wel verhuurde pand paan panckt
verkopen veylen verplichtingen huurder huurling
Not all translations added to the title are good, but most of them are related
to the topic. As for the description, many totally unrelated historic words are
added that will not be recognized as stop words (mach, magh, koon, sulcke).
Table 5.8: Average number of words in the query using query translation meth-
ods
Table 5.9: HDR results for expert topics using DBNL thesaurus
a specific weight, the final relevance score of a document is the weighted sum of
the relevance scores of each approach. Documents that are considered relevant
by two different approaches, say, retrieval using 4-grams and retrieval using
the DBNL thesaurus, are, on average, ranked higher on the combined list. The
reasoning behind this is that if several approaches retrieve the same document,
there is a fair chance that that document is actually relevant.
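The weighted-sum combination could be implemented along these lines. This is a sketch; the run names and weights are illustrative.

```python
def combine_runs(runs, weights):
    """Weighted-sum combination of relevance scores.

    runs: list of dicts mapping doc_id -> relevance score, one per approach.
    weights: one weight per run.
    Documents retrieved by several runs accumulate score, so they tend to
    end up higher in the combined ranking.
    """
    combined = {}
    for run, w in zip(runs, weights):
        for doc, score in run.items():
            combined[doc] = combined.get(doc, 0.0) + w * score
    # Rank by combined score, highest first.
    return sorted(combined, key=combined.get, reverse=True)

run_4gram = {"d1": 0.9, "d2": 0.4}
run_dbnl = {"d2": 0.8, "d3": 0.5}
print(combine_runs([run_4gram, run_dbnl], [0.5, 0.5]))  # ['d2', 'd1', 'd3']
```

Here d2, found by both runs, outranks d1 even though d1 has the single highest score, which is exactly the intended effect.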
As was mentioned in section 4.3, the HDR results of the advanced techniques
for the expert topics show no improvement over the baseline. The same holds for
document and query translation using the DBNL thesaurus. The expert queries
contain specific 17th century words from the documents, making query expan-
sion redundant for a large part. It is still interesting to see that, consistent with
the known-item results, query translation works better than document transla-
tion and stemming afterwards has a negative effect. Although the monolingual
methods perform better on the descriptions, translation of the titles seems to
work better than stemming or 4-gram matching. And again, adding translations
to the descriptions decreases performance significantly.
the final thesaurus. After applying rewrite rules to the historic lexicon, the
rewritten words will (probably) be more similar in spelling to the corresponding
modern word. Through rewriting, the pronunciation of a word may change.
Since letter-to-phoneme algorithms are based on modern pronunciation rules,
the phonetic transcription of the historic word klaeghen will be different from
the transcription of its corresponding modern word klagen, since the modern
pronunciation of ae is different from the modern pronunciation of a (they may
have been the same in 17th century Dutch). Thus, if after rewriting, klaeghen
has become klaghen, the phonetic transcription will be equal to that of klagen.
Converting the lexicon to a pronunciation dictionary again, and repeating the
construction procedure will result in new pairings.
Of course, words that are pronounced the same are not necessarily the same
words (consider eight and ate). This is where the edit distance clearly helps in
distinguishing between spelling variants and homophones (if the homophones
are orthographically dissimilar enough).6
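A minimal sketch of this edit-distance filter, assuming plain Levenshtein distance and an illustrative threshold:

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert/delete/substitute, cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_spelling_variant(historic, modern, max_dist=2):
    """Accept a phonetically matched pair only if the spellings are close;
    orthographically distant homophones (cf. 'eight'/'ate') are rejected.
    The threshold is illustrative, not the thesis's actual setting."""
    return edit_distance(historic, modern) <= max_dist

print(is_spelling_variant("klaeghen", "klagen"))  # True
print(is_spelling_variant("eight", "ate"))        # False
```

So klaeghen/klagen (distance 2) passes, while a homophone pair like eight/ate (distance 5) is filtered out.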
Because the phonetic transcriptions contain some errors, and because the
pronunciation of some vowel sequences has changed over time, the phonetic
transcriptions before and after rewriting were evaluated by randomly selecting
and checking 100 entries for correctness. The whole experiment was done
twice to get a more reliable indication. If the numbers of correct and incorrect
transcriptions show a big difference between the first and the second run, a
bigger sample, or more iterations, are needed to get a better indication. If the
numbers vary only slightly, their average gives a fair indication of the total
number of correct and incorrect transcriptions. The results in Table 5.10 show
a significant improvement in the quality of the transcriptions. Before rewriting,
the phonetic dictionary contains 4592 entries, and 16% of all transcriptions are
different from their real pronunciation. Only 2% of all the 11,592 entries after
rewriting have incorrect phonetic transcriptions. The phonetic dictionary after
rewriting (using the combined rule set RNF+RSF+PSS) contains the original
historic words as entries, but the modern words were matched with the phonetic
transcriptions of the rewritten forms of the historic words. The word aengaende
was first rewritten to aangaande. Then, aengaende is matched with a modern
word that has the same phonetic transcription as aangaande.
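The pairing procedure can be sketched roughly as follows. The transcription tables, the rewrite rules, and all function names here are toy assumptions standing in for the real pronunciation tool and rule sets used in the thesis.

```python
# Hypothetical modern pronunciation dictionary (word -> transcription).
modern_pron = {"klagen": "klag@n", "aangaande": "angand@"}

def rewrite(word, rules):
    """Apply simple character-sequence rewrite rules in order."""
    for old, new in rules:
        word = word.replace(old, new)
    return word

def transcribe(word):
    """Stand-in for a grapheme-to-phoneme tool; covers only the examples."""
    table = {"klaghen": "klag@n", "aangaande": "angand@"}
    return table.get(word)

def phonetic_pairs(historic_words, rules):
    """Pair each historic word with every modern word whose transcription
    equals the transcription of the historic word's *rewritten* form."""
    by_pron = {}
    for modern, pron in modern_pron.items():
        by_pron.setdefault(pron, []).append(modern)
    pairs = {}
    for h in historic_words:
        pron = transcribe(rewrite(h, rules))
        if pron in by_pron:
            pairs[h] = by_pron[pron]
    return pairs

rules = [("aegh", "agh"), ("ae", "aa")]  # order matters: aegh before ae
```

With these toy rules, klaeghen is rewritten to klaghen and paired with klagen, and aengaende is rewritten to aangaande and paired with its modern form, mirroring the examples in the text.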
Not only does rewriting affect the number of historic words that are phonetically
similar to their modern forms, it also decreases the number of wrong phonetic
matches. The historic ae sequence is no longer matched with the modern ee
sequence, but with aa. The same goes for the historic sequences ey and uy, which
were matched with the sequence ie in modern words before rewriting, and are
matched with ei and ui respectively afterwards.
Table 5.11: HDR results for known-item topics using phonetic transcriptions
An experiment similar to the one described in the previous section was
conducted. Instead of using rewrite rules to translate queries
or documents, the phonetic transcription dictionary was used. The results are
shown in Table 5.11. For this experiment, the stop word list was extended with
phonetic variants of stop words taken from the phonetic dictionary.
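The stop word extension might look like the following sketch; the contents of the phonetic dictionary here are hypothetical examples, not entries from the actual resource.

```python
# Hypothetical phonetic-variant dictionary: historic spelling -> modern word.
phonetic_dict = {"by": "bij", "vanden": "van", "eene": "een", "claght": "klacht"}

stop_words = {"bij", "van", "een", "de", "het"}

# Extend the stop list with every historic form whose modern
# counterpart is already a stop word; content words are untouched.
extended_stop_words = stop_words | {
    historic for historic, modern in phonetic_dict.items()
    if modern in stop_words
}
```

Historic variants of stop words are then filtered from the translated queries, while historic variants of content words (like claght) survive.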
The results of translating the queries are comparable to the use of 4-grams
in the monolingual approach, and, as with rewriting (see Table 4.6
in the previous chapter), stemming the translated queries has a negative effect
for the same reason. The historic words often have historic suffixes that are
unaffected by the stemmer, thus conflation of morphological variants is minimal.
Document translation shows the best results for all different queries (D only,
D+T and T only). But now, the effect of stemming is minimal.
The number of phonetically equal words added to the descriptions and titles
is smaller than the number of related words added by the DBNL thesaurus.
Although the phonetic dictionary adds spelling variants of modern stop words
to the query, the list of modern stop words was extended with their historic
phonetic counterparts. Therefore, the performance of query translation for the
descriptions is comparable to query translation for the titles. Combining them
does not affect average precision much.
76 CHAPTER 5. THESAURI AND DICTIONARIES
Table 5.12: HDR results for expert topics using phonetic transcriptions
In both the title and the description, 3 spelling variants were added for verkoop
(sale) and 1 for pand (house). The spelling variants of the stop words bij, van
and een were removed because of the extended stop word list.
Similar characters:
b,p   d,t   f,v   s,z   y,i   y,ie   y,ij   g,ch   c,k   c,s
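These similar-character pairs could enter a Phonetic Edit Distance along the following lines: substitutions between listed pairs get a reduced cost. The cost of 0.5 and the restriction to single-character pairs are simplifications of this sketch, not the thesis's exact weighting; multi-character pairs like (y,ie) would need extra cases in the recurrence.

```python
# Single-character pairs from the table above; multi-character pairs omitted.
SIMILAR = {("b", "p"), ("d", "t"), ("f", "v"), ("s", "z"),
           ("y", "i"), ("c", "k"), ("c", "s")}

def sub_cost(a, b):
    """Assumed costs: 0 for identity, 0.5 for similar pairs, 1 otherwise."""
    if a == b:
        return 0.0
    if (a, b) in SIMILAR or (b, a) in SIMILAR:
        return 0.5
    return 1.0

def phonetic_edit_distance(a, b):
    """Levenshtein distance with phonetically weighted substitutions."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [float(i)]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1.0,                    # deletion
                           cur[j - 1] + 1.0,                 # insertion
                           prev[j - 1] + sub_cost(ca, cb)))  # substitution
        prev = cur
    return prev[-1]
```

Candidate modern words for a historic word can then be reranked by this distance, so that sien/zien or clacht/klacht score much closer than an arbitrary substitution would suggest.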
5.6 Conclusion
The DBNL thesaurus can be used effectively in query expansion, if the stop word
list is extended with historic variants as was done for the phonetic dictionary,
and with a better note extraction algorithm, the word-to-phrase and phrase-to-word
translations might become useful as well. The downside is that the
construction of this thesaurus depends on manually added word translation
pairs. Automatically extracting them correctly is difficult, and the only historic
words for which a translation is given are the ones that are deemed important
by the DBNL editors. The modern translations of the words that they find
important enough to translate might not be the words that are posed as query
words by the user.
By combining historic Dutch documents with modern Dutch documents, and
more importantly, by increasing the corpus size, the use of word clustering algorithms
can become an important method for bridging the vocabulary gap. As it
stands, the vocabulary gap remains the most difficult bottleneck of the two, as
the spelling gap is partly bridged by the rewrite rules from the previous chapter
and the phonetic dictionary and PED reranking procedure in this chapter.
The phonetic variants dictionary is effective, but only after rewriting. The
phonetic transcriptions of the original historic words are not always correct, thus
by replacing these transcriptions with the transcriptions of the rewritten words,
many historic words are no longer paired with the wrong modern word. The
advantage of matching historic and modern words with phonetic transcriptions
over using rewrite rules is that non-typical historic character sequences (like
‘cl’ in clacht) are not rewritten incorrectly (clausule should not be rewritten
to klausule). The phonetic dictionary only replaces whole words, not parts of
words. Thus, clacht will be replaced with klacht, but the historic word clausule
is matched with the modern word clausule, and is thus retained in the lexicon.
The performance of word retrieval can be greatly improved by reranking
the candidate list using the Phonetic Edit Distance algorithm. The number of
candidates can then be reduced to 3 or 5 words, and the remaining list can be
used for query expansion. This has yet to be tested in an HDR experiment, though.
Chapter 6
Concluding
We’ve seen, in the previous chapters, that language resources can be constructed
from nothing but plain text. They can be used effectively for HDR, and might
also be used as stand-alone resources to improve readability. This chapter
concludes this research, and tries to answer the questions from the first chapter.
Some future directions are given as well.
DBNL website is expanded regularly. These new entries also contain notes and
translation pairs, so the thesaurus could be updated with new entries as well.
The construction of a historic synonym thesaurus using mutual information
seems infeasible at this moment. An enormous amount of text would be re-
quired, and even then, the clusters will still show more syntactic structure than
semantic structure. Large clusters of nouns are almost impossible to split into
semantically related subclusters if no more than plain text corpora are available.
Once syntactically annotated 17th century Dutch corpora are available, classi-
fication based on bigram frequencies might become useful to cluster synonyms.
For HDR purposes, it would be interesting to see the effect of mixing historic
and modern Dutch corpora. If the historic and modern words in a cluster are
semantically related, the historic words could be added to modern query words
from the same cluster.
Finally, if the co-occurrence based thesaurus improves in quality, it could be
combined with the DBNL thesaurus. The DBNL thesaurus could be used to test
the quality of the co-occurrence thesaurus (if it is based on a mix of historic and
modern Dutch). The historic word and its modern translation should be in the
same cluster, or at least, close to each other in the classification tree.
As it stands, the attempts at bridging the vocabulary gap have led to lit-
tle more than plans for building a real bridge. The bridge over the spelling
gap, although still a bit shaky, seems to have reached the other side. Language
resources are now available for historic Dutch, most of them automatically gen-
erated, and possibly useful for other languages as well.
Appendix A - Resource Descriptions

Each of the resources, and the methods used to construct them, is described in
more detail here. Each section covers a resource and its associated algorithms.
aengeclaeght aangeklaagd
aengecomen aangekomen
aengedaen aangedaan
aengedraegen aangedragen
begosten aanvingen
begote overgoten
begoten bespoeld
begraeut afgesnauwd
gerichtschrijversampt gerichtschrijverzambt
gerichtschrijverseydt gerichtschrijverzeid
gerichtscosten gerichtskosten
gerichtsdach gerichtsdag
gerichtsdaege gerichtsdage
The rules generated by PSS and RSF are different from the rules generated
by RNF, because of the vowel/consonant restrictions. The historic antecedent of
these rules consists of a historic sequence and a context restriction. For instance,
a historic vowel sequence should only match a historic word if the vowel sequence
is surrounded by consonants. The word vaek matches the vowel sequence ae,
while the word zwaeyen doesn’t, because its full vowel sequence is aeye. To
make sure that the rule ae doesn’t match zwaeyen, the antecedent is extended
with context wildcards:
[bcdfghjklmnpqrstvwxz]ae# → aa
[aeiouY]lcx[aeiouY]
The antecedent part of the rule is actually a regular expression: the bracketed
consonant wildcard [bcdfghjklmnpqrstvwxz] indicates that the character
sequence ae must be preceded by one of these consonants. The word boundary
character # indicates that the ae sequence must be at the end of the word.3
3 In Perl, this word boundary character can be replaced by a dollar sign ($), which matches
the end of a string.
The uppercase Y in the second rule above is used as a replacement for the
Dutch diphthong ij, because the j would otherwise be recognized as a consonant.
Therefore, all occurrences of ij in words and sequences are replaced by Y in the
RSF and PSS algorithms.
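In Python's re module, such a context-restricted rule could be applied along the following lines (the actual scripts use Perl, where the # marker becomes $). The example word dae below is hypothetical, used only to show the rule firing.

```python
import re

CONSONANTS = "bcdfghjklmnpqrstvwxz"

def apply_ae_rule(word):
    """Rewrite a word-final 'ae' to 'aa', but only when preceded by a
    consonant. 'ij' is first replaced by 'Y' so that its 'j' is not
    mistaken for a consonant, and restored afterwards."""
    w = word.replace("ij", "Y")
    # ([CONSONANTS])ae$ -- '$' plays the role of the '#' boundary marker
    w = re.sub(rf"([{CONSONANTS}])ae$", r"\1aa", w)
    return w.replace("Y", "ij")
```

As the text describes, vaek and zwaeyen are left untouched (their ae is not word-final), while a hypothetical word like dae would become daa.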
The RNF algorithm doesn’t have this consonant/vowel restriction; it will
match anything containing the historic antecedent. Context information is more
detailed for longer n-grams:
ae → aa
bae → baa
bael → baal
baele → bale
Appendix B - Scripts
The PSS algorithm requires two lists of mappings from words to phonetic tran-
scriptions. One list with historic words and phonetic transcriptions, and one
list for modern words and phonetic transcriptions. The phonetic alphabet that
is used is not important, as long as both lists use the same phonetic alphabet.
The output is a plain text file, where each line contains a rewrite rule and its
PSS score. The PSS score is the MM-score (the maximal match score, described
in section 3.3.1).
The RSF and RNF algorithms both require two word frequency indices, one
from a historic corpus and one from a modern corpus. These indices are plain
text files with each line containing a unique word from the corpus, and its corpus
frequency.
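Building such a frequency index is straightforward; a minimal sketch follows. The ASCII-only tokenizer is a simplification of this sketch, and the function names are not those of the actual scripts.

```python
import re
from collections import Counter

def build_frequency_index(text):
    """Count word frequencies in a plain-text corpus
    (ASCII-only tokenizer, good enough for a sketch)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def write_index(counts, path):
    """Write one 'word<TAB>frequency' line per unique word,
    matching the format the RSF/RNF scripts expect."""
    with open(path, "w", encoding="utf-8") as f:
        for word, freq in sorted(counts.items()):
            f.write(f"{word}\t{freq}\n")
```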
These three algorithms are implemented in Perl, simply named ‘PSS.pl’,
‘RSF.pl’ and ‘RNF.pl’. They use only standard packages and some scripts
that are included in the resources package.
Other important algorithms are:
The selection and test set (section 3.4.1) are in the same format as all the other
word lists and dictionaries. Each line in the test set file consists of a historic
word, a tab, and the modern spelling of the historic word (again, not necessarily
an existing modern word). To clarify this, a few entries are given here:
sijnen zijn
sijns zijns
silvere zilvere
simmetrie symmetrie
sin zin
singen zingen
singht zingt
sinlijckheyt zinnelijkheid
sinnen zinnen
sinplaets zinplaats
4 At least, it is not listed in the ‘Van Dale - Groot woordenboek der Nederlandse taal.’
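Reading this tab-separated format back in could look like the following sketch; the function name is hypothetical.

```python
def load_test_set(path):
    """Read 'historic<TAB>modern' pairs, one per line, into a dict."""
    pairs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            historic, modern = line.split("\t")
            pairs[historic] = modern
    return pairs
```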