
Constructing Language Resources for Historic

Document Retrieval
MSc thesis, Artificial Intelligence

Marijn Koolen
mhakoole@science.uva.nl

June 2, 2005

Supervisors:
Prof. Dr. Maarten de Rijke, Dr. Jaap Kamps

Informatics Institute
University of Amsterdam
Abstract

The aim of this research is to investigate the possibility of constructing language
resources for historic Dutch documents automatically. The specific problems
for historic Dutch, when compared to modern Dutch, are the inconsistency in
spelling and the aged vocabulary. Finding relevant historic documents using
modern keywords can be aided by specific resources that add historic variants
of modern words to the query, or by resources that translate historic documents
to modern language. Several techniques from Computational Linguistics, Nat-
ural Language Processing and Information Retrieval are used to build language
resources for Historic Document Retrieval on Dutch historic documents. Most
of these methods are language independent. The resulting resources consist of
a number of language independent algorithms and two thesauri for 17th century
Dutch: a synonym dictionary and a spelling dictionary based on
phonetic similarity.

Acknowledgements

I’d like to express my gratitude towards Jaap Kamps and Maarten de Rijke for
their guidance and supervision during this research. They’ve read numerous
versions of this thesis without losing patience or hope, and were always quick
with advice when needed. I’d like to thank Frans Adriaans for the brainstorming
sessions getting both our projects started, and for the discussions on science that
somehow always shifted to discussions on music.

Contents

Abstract

Acknowledgements

1 Introduction
1.1 Document retrieval
1.2 Historic documents and IR
1.3 Constructing language resources
1.4 Outline

2 Historic Documents
2.1 Language variants or different languages?
2.2 The gap between two variants
2.3 Bridging the gap
2.4 Resources for historic Dutch
2.5 Corpora
2.6 Corpus problems
2.7 Measuring the gap
2.8 Spelling check

3 Rewrite rules
3.1 Inconsistent spelling & rewrite rules
3.1.1 Spelling bottleneck
3.1.2 Resources
3.1.3 Linguistic considerations
3.2 Rewrite rule generation
3.2.1 Phonetic Sequence Similarity
3.2.2 The PSS algorithm
3.2.3 Relative Sequence Frequency
3.2.4 The RSF algorithm
3.2.5 Relative N-gram Frequency
3.2.6 The RNF algorithm
3.3 Rewrite rule selection
3.3.1 Selection criteria
3.4 Evaluation of rewrite rules
3.4.1 Test and selection set
3.5 Results
3.5.1 PSS results
3.5.2 RSF results
3.5.3 RNF results
3.6 Conclusions
3.6.1 Problems
3.6.2 The y-problem

4 Further evaluation
4.1 Iteration and combining of approaches
4.1.1 Iterating generation methods
4.1.2 Combining methods
4.1.3 Reducing double vowels
4.2 Word-form retrieval
4.3 Historic Document Retrieval
4.3.1 Topics, queries and documents
4.3.2 Rewriting as translation
4.4 Document collections from specific periods
4.5 Conclusions

5 Thesauri and dictionaries
5.1 Small parallel corpora
5.2 Non-parallel corpora: using context
5.2.1 Word co-occurrence
5.2.2 Mutual information
5.3 Crawling footnotes
5.3.1 HDR evaluation
5.4 Phonetic transcriptions
5.4.1 HDR and phonetic transcriptions
5.5 Edit distance
5.5.1 The phonetic edit distance algorithm
5.6 Conclusion

6 Concluding
6.1 Language resources for historic Dutch
6.2 Future research
6.2.1 The spelling gap
6.2.2 The vocabulary gap

Appendix A - Resource descriptions

Appendix B - Scripts

Appendix C - Selection and Test set


List of Tables

2.1 Categories of historic words

3.1 Comparative recall for English historic word-forms
3.2 Comparative recall for Dutch historic word-forms
3.3 Corpus statistics for modern and historic corpora
3.4 Edit distance example 1
3.5 Edit distance example 2
3.6 Edit distance example 3
3.7 Edit distance example 4
3.8 Edit distance baseline
3.9 Manually constructed rules on test set
3.10 Results of PSS on test set
3.11 Results of RSF on test set
3.12 Results of RNF on test set
3.13 Different modern spellings for historic y

4.1 Results of iterating RSF and RNF
4.2 Results of combined rule generation methods
4.3 Lexicon size after rewriting
4.4 Results of RDV on test set
4.5 Results of historic word-form retrieval
4.6 HDR results using rewrite rules
4.7 HDR results for expert topics
4.8 Results on test sets from different periods

5.1 Classification of frequent English words
5.2 Classification of frequent historic Dutch words
5.3 Classification of frequent modern Dutch words
5.4 DBNL dictionaries
5.5 Simple evaluation of DBNL thesaurus
5.6 DBNL thesaurus coverage of corpora
5.7 HDR results for known-item topics using DBNL thesaurus
5.8 Analysis of query expansion
5.9 HDR results for expert topics using DBNL thesaurus
5.10 Evaluation of phonetic transcriptions
5.11 HDR results for known-item topics using phonetic transcriptions
5.12 HDR results for expert topics using phonetic transcriptions
5.13 Phonetically similar characters
5.14 Results of historic word-form retrieval using PED

1 DBNL dictionaries
Chapter 1

Introduction

1.1 Document retrieval


An Information Retrieval (IR) system allows a user to pose a query, and retrieves
documents from a document collection that are considered relevant given the
words in the query. A basic IR system retrieves those documents from the
collection that contain the most query words. The drawback of retrieving only
documents that contain query words is that not all relevant documents will
be retrieved; some relevant documents may not contain any of the query words
at all. Many techniques can be used to improve upon the basic system, by
expanding the query with related words or using approximate word-matching
methods. The aim (and the main challenge) of these techniques is
to increase the number of retrieved relevant documents, without increasing the
number of retrieved irrelevant documents. This is a difficult task, but signifi-
cant improvements have been made by using several language dependent and
language independent resources.
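The basic system described above can be sketched as a simple term-overlap ranker. This is a minimal illustration, not an actual IR engine; the example documents and queries are invented.

```python
def rank_by_overlap(query, documents):
    """Rank documents by the number of distinct query words they contain."""
    query_terms = set(query.lower().split())
    scores = []
    for doc_id, text in documents.items():
        doc_terms = set(text.lower().split())
        # Score = how many distinct query words appear in the document.
        scores.append((len(query_terms & doc_terms), doc_id))
    # Highest overlap first; documents sharing no query words are not retrieved.
    return [doc_id for score, doc_id in sorted(scores, reverse=True) if score > 0]

docs = {
    "d1": "hollandse gerechten uit de moderne keuken",
    "d2": "recepten voor gerechten",
    "d3": "engelse literatuur",
}
print(rank_by_overlap("Hollandse gerechten", docs))  # → ['d1', 'd2']
```

A document that spells the query words differently (e.g. in historic spelling) gets a score of zero here, which is exactly the problem the rest of this thesis addresses.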

1.2 Historic documents and IR


IR systems often use external resources to improve retrieval performance. Stem-
mers, for example, are used to map words into a standard form, so that mor-
phologically different forms can be matched [7]. Resources are also used in
Cross-Language Information Retrieval (CLIR), where bilingual dictionaries are
used to translate query terms [10]. Different amounts of resources are available
for different languages.
Through increased performance of OCR (optical character recognition) tech-
niques, and dropping costs, more and more historic documents become digitally
available. These documents are written in a historic variant of a modern lan-
guage. Often, the spelling and vocabulary of a language have changed over time.
For these historic language variants, very few resources are digitally available.
Although the performance of OCR has increased, the linguistic resources used for


automatic correction are based on modern languages. These correction methods


might not work for older texts. Thus, for many historic documents, digitization
requires manual correction of OCR-errors. But the problems don’t end here.
Once a document has been digitized correctly, the historic spelling and vocab-
ulary still form a problem for linguistic resources based on modern languages.
Therefore, this thesis focuses on automatically constructing linguistic tools for
historical variants of a language. These tools can then be used for Historic Doc-
ument Retrieval, which aims at retrieving documents from historical corpora.
The tools will be used to construct resources for 17th century Dutch. Given
that only generic techniques are used, they should also provide a framework for
other periods and other languages.

1.3 Constructing language resources


The aim of this research is to construct language resources to be used for his-
toric document retrieval (HDR) in Dutch. Previous research by Braun [2], and
Robertson and Willett [18] has shown that specific resources for historic texts
can improve IR performance. Robertson and Willett treated historic English
as a corrupted version of modern English, using spelling-correction methods to
find historic spelling variants of modern words. Braun focused on heuristics, ex-
ploiting regularities in historic spelling to develop rewrite rules which transform
historic word forms into modern word forms. These rewrite rules were devel-
oped manually, since the inconsistency of spelling was deemed too problematic
for automatic generation of rules. The rewrite rules can significantly improve
retrieval performance. Therefore, the problem of automatic rule generation will
be investigated, considering approaches from phonetics, computational linguis-
tics and natural language processing (NLP). In both [2] and [18], the focus is
on historic spelling. However, Braun identified a second bottleneck, namely
the vocabulary gap. Some historic words no longer exist. Some modern words
didn’t exist yet (like telephone or bicycle), and other words still exist but have a
different meaning. To tackle this problem, a thesaurus might be used. However,
no such thesaurus is digitally available, so to solve the vocabulary bottleneck
one has to be constructed, either manually or automatically. This research will
thus be focused on the following research questions:

• Can historic language resources be constructed automatically?

• Is the framework for constructing resources a language independent (generic)
approach?

• Can HDR benefit from these resources?

The first question can be split into two more specific questions, based on the
observation of Braun about the spelling problem and the vocabulary problem:

• Can (automatic) methods be used to solve the spelling problem?



• What are the options for automatically constructing a thesaurus for his-
toric languages?

For the spelling problem, Braun and Robertson and Willett have already
mentioned two methods, rewrite rules and spelling correction techniques. But
there may be other options to align two temporal variants of a language. There-
fore, this question can be made more specific:

• Can rewrite rules be generated automatically using corpus statistics and
the overlap between historic and modern variants of a language?
• Are the generated rewrite rules a good way of solving the spelling bottle-
neck?
• Can historic Dutch be treated as a corrupted version of modern Dutch,
and thus be corrected using spelling correction techniques?

The available methods will be tested independently and as a combined
approach. In parallel to this research, Adriaans [1] evaluates the retrieval side of
HDR in much more detail. The methods and thesauri developed in this project
will be used in his retrieval experiment as an external evaluation. If these tech-
niques are found to be useful, this will result in a number of language resources,
some of which are language (and period) dependent, and others are language
independent.
The main drawbacks of manual construction of resources are the need for ex-
pert knowledge in the form of historic language experts, and the huge amount
of time it takes to construct the resources. Automatic construction exploits
statistical correlations and regularities in a language. Therefore, expert knowl-
edge is no longer essential, and the time it takes to build resources is greatly
reduced. Another advantage is that, if automatic construction is effective, the
same techniques might be used for several different languages. As the afore-
mentioned articles have shown, IR performance for both 17th century English
documents and 17th century Dutch documents can be increased by attacking
the spelling variation. The techniques in this research should be language in-
dependent, making them useful for both Dutch and English, and perhaps other
languages for which historic documents pose the same problems.

1.4 Outline
The next chapter will elaborate on the distinction between historic and modern
Dutch documents, and some available historic Dutch document collections will
be described. It will show that information retrieval on historic documents
is different from retrieval on modern document collections. Chapter 3 will
discuss in detail the automatic construction of rewrite rules using phonetics
and statistics, and their effectiveness on historic documents. Several different
methods are described and compared with each other and with the rules from [2].
Further extensions and combinations to these methods and evaluations follow in

chapter 4, including a document retrieval experiment. Chapter 5 will investigate
the possibility of building a thesaurus to find synonyms among historic words,
using various techniques, and other ways of solving the spelling problem are
put to the test. In the final chapter conclusions are drawn from the conducted
experiments, and some guidelines for the future will be given.
Chapter 2

Historic Documents

Historic Documents are documents written in the past. Of course, this holds for
all documents. But since spoken and written language changes continuously, a
century old Dutch document is written in a form of Dutch that is different from
a document written two weeks ago. The changes are not spectacular, but they
are there all the same. Using a search engine on the internet to find documents
on typical Dutch food with the keywords Hollandse gerechten (English: Dutch
dishes) may retrieve a text written in 1910 containing both words. The keywords
are normal in modern Dutch, but also in early 20th century Dutch. What the
search engine probably won’t find is a website containing hundreds of typically
Dutch recipes from the 16th century, although this website does exist (see section
2.5, the Kookhistorie corpus). The historic texts contain historical spelling
variants of the modern words Hollandse gerechten. This problem is caused by the
fact that the change from 16th century Dutch to modern Dutch is spectacular.
Although the number of digitized 16th century documents is still small, it is
growing rapidly through the increasing interest from historians and funding
from national governments for digitizing historic documents.1 The aforementioned
problem, the gap between a modern keyword and its relevant historic
variants, thus becomes increasingly important.
Going back further in time, the differences between modern Dutch and Middle
Dutch as used in the late Middle Ages (1200–1500) are even bigger. In fact,
between 1200 and 1500, Dutch was not a single language but a collection of
dialects. Each dialect had its own pronunciation, and spelling was often based
on this pronunciation [23]. The spelling differed between geographical regions.
Due to the union of smaller independent countries and increasing commerce,
a more uniform Dutch language emerged after 1500.2 As contacts between re-
gions increased, spelling was less and less based on pronunciation, becoming
1 See, for example, the CATCH (Continuous Access To Cultural Heritage) project. This

is funded by the Dutch government to make historic material from Dutch cultural heritage
publicly accessible in digital form, thereby preserving the fragile material.
2 For a more detailed description of the changes in language between 1200 and 1500, (in

Dutch) see http://www.literatuurgeschiedenis.nl


more consistent. In the 17th century, the Dutch translation of the Bible, the
Statenbijbel, together with books by famous Dutch writers like Vondel and Hooft
were considered well-written Dutch, bringing about a more consistent and
systematic spelling. Since there was no official spelling (one was not introduced
in the Netherlands until 1804), there were still many acceptable ways of spelling
a word [23].

2.1 Language variants or different languages?


The Dutch language is related to the German language, yet, we consider them
to be different languages. A native German speaker will recognize certain words
in a Dutch document, but might have problems understanding what the text is
about. A bilingual person, speaking both German and Dutch could translate
the document into German, making it easy for the former reader to understand
it. The same will probably hold for a native Dutch speaker reading a document
written in middle Dutch. A middle Dutch expert could translate the document
into modern Dutch, making it more readable. But for documents written after
1600, the historic language expert is no longer needed (or at least to a much
lesser degree): knowledge of modern Dutch gives native speakers enough of a
handle on 17th century Dutch documents to understand most of the text. It
seems there is a shift from two different languages to a language together with
a certain “dialect”. This makes 17th century Dutch more or less the same lan-
guage as modern Dutch, from an information retrieval (IR) perspective. If 17th
century Dutch can be seen as a “strange” dialect of modern Dutch, its overlap
with modern Dutch might be used to bridge the gap that exists between the
two temporal variants.

2.2 The gap between two variants


But where do the two variants overlap, and where do they differ? Braun, in [2],
identified two main bottlenecks for IR from historic documents. The first
bottleneck is the difference in spelling. Not only is 17th century spelling different
from modern spelling, it is also less consistent. A word has only one officially
correct spelling in modern Dutch (although many documents do contain some
variation, caused by spelling errors, changes in the official spelling or stubborn-
ness), where older Dutch has many acceptable spelling variations for the same
word. The other bottleneck is the vocabulary gap. The modern word koelkast
(English: refrigerator) did not exist in the 17th century. In the same way, the
historic word opsnappers (English: people celebrating) cannot be found in any
modern Dutch dictionary. Some words are no longer used, new words are created
daily, and yet other words have changed in meaning. The fact that 17th century
documents are still readable shows that the grammar has changed very little,
so this is probably not an issue (most IR systems ignore word order anyway).
Here is an example of the difference between historic and modern Dutch. The
following historic text is a paragraph taken from the “Antwerpse Compilatae”,
a collection of law texts written in 1609, describing laws and regulations for
the region of Antwerpen. The full text describes how a captain should load a
trader’s goods, and what his responsibilities towards these goods are at sea:

9. Item, oft den schipper versuijmelijck waere de goeden ende
coopman-schappen int laden oft ontladen vast genoech te maecken,
ende dat die daerdore vuijtten taeckel oft bevanghtouw schoten, ende
int water oft ter aerden vielen, ende alsoo bedorven worden oft te
niette gingen, alsulcke schade oft verlies moet den schipper oock
alleen draegen ende den coopman goet doen, als voore.
10. Item, als den schipper de goeden soo qualijck stouwt oft laeijt
dat d’eene door d’andere bedorven worden, ghelijck soude mogen
gebeuren als hij onder geladen heeft rosijnen, alluijn, rijs, sout, graen
ende andere dierghelijcke goeden, ende dat hij daer boven op laeijt
wijnen, olien oft olijven, die vuijtloopen ende d’andere bederven, die
schaede moet den schipper oock alleen draegen ende den coopman
goet doen, als boven.

The modern Dutch version (the author’s own interpretation) would look like
this:3

9. Idem, als de schipper verzuimd de goederen en koopmanswaren
in het laden of uitladen vast genoeg te maken, en dat die
daardoor uit een takel of vangtouw schieten, en in het water of ter
aarde vallen, en zo bederven of te niet gaan, zulke schade of verlies
moet de schipper ook alleen dragen en de koopman vergoeden, als
hiervoor.
10. Idem, als de schipper de goederen zo kwalijk stouwt of laadt
dat het ene door het andere bedorven wordt, gelijk zou kunnen
gebeuren als hij onder geladen heeft rozijnen, ui, rijst, zout, graan
en andere, dergelijke goederen, en dat hij daar boven op laadt wij-
nen, olieen of olijven, die uitlopen en de andere bederven, die schade
moet de schipper ook alleen dragen, en de koopman vergoeden, als
boven.

The following is an English translation, again the author’s own:

9. Equally, if the shipper neglects to properly secure the goods
during loading or unloading, causing them to fall in the water or on
the ground and thereby spoiling them, he should repay the damage
to the trader.
10. Equally, if the shipper stacks or loads the goods in such a
manner that one of the goods spoils another, as could happen if he
3 The word order is retained to make it easier to compare both texts. Although this word

order is readable for native Dutch speakers, it is somewhat strange. Apparently, grammar has
changed somewhat as well.

would stack wine, oil or olives on top of raisins, onions, rice, salt,
grain or some such goods, where the former spoils the latter, he must
repay the damage to the trader.

Analyzing the historic and modern Dutch sentences, it may be clear that
the biggest difference is in spelling. Some words are still the same (schipper,
bederven, alleen, water), but most words have changed in spelling. The changes
in vocabulary are visible in the change from goet doen to vergoeden (English: re-
pay). There are also some morphological/syntactical changes, like versuijmelijck
(negligent) to verzuimd (neglects).
It is probably easier to attack the spelling bottleneck first. To solve the
second, a thesaurus is needed to translate historic words into modern words
or the other way around. If a method can be found and used to find pairs of
modern and historic words that have the same meaning, such a thesaurus can
be constructed. But if spelling is not uniform, one spelling variant of a historic
word might be matched with a modern one, while another spelling variant is
missed. By solving the spelling bottleneck first, thereby standardizing spelling
for historic documents, finding word translation pairs for a thesaurus may even
be easier.

2.3 Bridging the gap


In Robertson & Willett [18], n-gram matching is used successfully to find historic
spelling variants of modern words. Thus, the lack of specific resources might
not be a problem. However, many IR systems for modern languages use a
stemming algorithm (see [8]) to conflate morphological variants to the same
stem. The Porter stemmer4 is one of the most popular stemmers available for
many different modern languages, with a specific version for each language (a
Dutch version is described in [12]). Because modern languages are consistent in
spelling, stemmers can be very effective. The Porter stemmer for Dutch would
conflate the words gevoelig (sensitive), gevoeligheid (sensitivity) and gevoelens
(feelings) to the same stem gevoel (feeling). Using gevoelig as a query word,
documents containing the word gevoelens will also be considered relevant.
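The effect of such conflation can be illustrated with a toy suffix-stripper. This is emphatically not the real Porter stemmer; the suffix list below is invented purely for illustration of the principle.

```python
def toy_stem(word):
    """Strip a few Dutch-like suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("igheid", "ens", "ig"):  # illustrative suffixes, longest first
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Consistently spelled words conflate to the same stem 'gevoel':
print([toy_stem(w) for w in ("gevoelig", "gevoeligheid", "gevoelens")])
# → ['gevoel', 'gevoel', 'gevoel']

# With inconsistent historic spelling the stems diverge:
print([toy_stem(w) for w in ("gevoel", "ghevoelens", "gevuelig")])
# → ['gevoel', 'ghevoel', 'gevuel']
```

The second print shows why spelling must be standardized before stemming can conflate historic word-forms.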
When spelling is inconsistent, only some word forms would be stemmed.
Using the Porter stemmer for modern Dutch would only affect historic words
that are spelled in the modern way. The historic words gevoel, ghevoelens and gevuelig (English:
feeling, feelings and sensitive) would be stemmed to the stems gevoel, ghevoel and
gevuel respectively. By standardizing spelling (i.e. making it consistent), these
three word-forms will be stemmed to the same stem. Another fairly standard
technique in modern IR is query expansion. This means adding related words
to the keywords in the query. In historic documents, some of the words related
to a modern keyword might be historic words that no longer exist. Although
some of these historic words could be very useful for expanding queries, the
lack of knowledge about them makes it impossible to use them effectively. A
4 http://www.tartarus.org/~martin/PorterStemmer/

thesaurus relating these words to modern words would solve this problem. From
this perspective, the historic language can be seen as a different language from
the modern language, and the retrieval task becomes a so called Cross-Language
Information Retrieval (CLIR) task (see [11] and [10] for an analysis of the main
problems and approaches in CLIR).
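Query expansion with such a thesaurus can be sketched as follows. The thesaurus entries here are invented for illustration; they are not taken from the resources built later in this thesis.

```python
def expand_query(query, thesaurus):
    """Add historic variants of each query word, if the thesaurus has any."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(thesaurus.get(term, []))  # no entry: keep term as-is
    return expanded

# Tiny hand-made thesaurus mapping modern words to historic variants.
thesaurus = {"gelijk": ["gelijck", "ghelijck"], "gevaarlijk": ["ghevaerlijck"]}
print(expand_query("gelijk gevaarlijk water", thesaurus))
# → ['gelijk', 'gelijck', 'ghelijck', 'gevaarlijk', 'ghevaerlijck', 'water']
```

The expanded query can then be handed to an unmodified IR system, which will now also match documents containing only the historic spellings.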
But can spelling be standardized with nothing but a collection of historic
documents? And is it possible to make a thesaurus using the same limited
document collection? Of course, it is possible to do everything by hand (see
sections 3.1.1 and 5.3). But this is very time consuming, and different language
experts might have different opinions on what the best modern translation would
be. Automatic generation, if at all possible, might be more error prone. But
as modern IR systems have shown [7], sub-optimal resources can still be very
useful for finding relevant documents.
Although there was no standard way of spelling a word in 17th century
Dutch, the possibilities of spelling a word based on pronunciation are not
infinite. In fact, there are only a few different spellings for a certain vowel. Corpus
statistics can be used to find different spelling variants by looking at the over-
lap of context. Also, techniques have been developed to group semantically
related words based purely on corpus statistics. If this can be done for modern
languages, it might work with historic languages as well.
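The idea of matching spelling variants through overlapping contexts can be sketched as follows. The toy corpus and the Jaccard measure are illustrative assumptions, not the co-occurrence methods developed later in this thesis.

```python
from collections import defaultdict

def context_profiles(sentences, window=1):
    """Collect, for each word, the set of words occurring within the window."""
    profiles = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for i, w in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            profiles[w].update(words[lo:i] + words[i + 1:hi])
    return profiles

def context_overlap(a, b, profiles):
    """Jaccard overlap of the two words' context sets."""
    pa, pb = profiles[a], profiles[b]
    return len(pa & pb) / len(pa | pb) if pa | pb else 0.0

# Toy corpus: 'gelijck' and 'ghelijck' occur in identical contexts.
corpus = ["den schipper is gelijck den coopman",
          "den schipper is ghelijck den coopman",
          "den schipper laedt wijnen"]
profiles = context_profiles(corpus)
print(context_overlap("gelijck", "ghelijck", profiles))  # → 1.0
```

Two spelling variants of the same word tend to occur in the same contexts, so a high overlap score is weak evidence that they are variants of each other.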

2.4 Resources for historic Dutch


What kinds of resources are needed to standardize spelling, and which are
needed to bridge the vocabulary gap? In [2], rewrite rules are used to map
spelling variants to the same form. By focusing on rewriting affixes to modern
Dutch standards, more morphological variants could be conflated by a stemmer.
Some rules were constructed for rewriting the stems as well, to conflate stems
that were spelled in various ways (like gevoel and gevuel). These rules were
constructed manually, because the spelling was considered to be too
inconsistent to do it automatically. But this inconsistency can actually be exploited to
construct rules automatically. The pronunciation of two spelling variations of
a word is the same. In historic documents, the word gelijk (English: equal) is
often spelled as gelijck or ghelijck. In the same way, gevaarlijk (dangerous) is
often spelled as ghevaerlijck or gevaerlijck. By matching words based on their
pronunciation, spelling variations can be matched as well. In both cases, lijck is
apparently pronounced the same as lijk, and ghe is pronounced the same as ge. If
there are more historic words showing the same variations, it seems reasonable
to rewrite lijck to lijk. But if historic word-forms can be matched with their
modern variants through pronunciation, why would we need rewrite rules? Well,
not all historic words will be matched with a modern word. For instance, the
word versuijmelijck (see the short historic text on loading cargo on a ship) is
not pronounced like any modern Dutch word. This is because the morphology
of the word has changed. The modern variant would probably be verzuimend.
Here, rewriting makes sense, because, changing the suffix lijck to lijk, a stemmer
for Dutch will reduce it to versuijm. Other rewrite rules may change this to the
modern stem verzuim, conflating it with all other morphological variants.
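A minimal sketch of how such rewrite rules could be applied is given below. The rule list and the simple left-to-right application order are illustrative assumptions; the actual generation and selection of rules is the subject of chapter 3.

```python
def apply_rewrite_rules(word, rules):
    """Apply each historic->modern rewrite rule in turn, everywhere it matches."""
    for old, new in rules:
        word = word.replace(old, new)
    return word

# Rules suggested by the spelling variation discussed above.
rules = [("lijck", "lijk"), ("ghe", "ge")]
for historic in ("ghelijck", "gevaerlijck", "versuijmelijck"):
    print(historic, "->", apply_rewrite_rules(historic, rules))
# ghelijck -> gelijk
# gevaerlijck -> gevaerlijk
# versuijmelijck -> versuijmelijk
```

Note that versuijmelijck is not fully modernized: the suffix is fixed, but further rules (or a stemmer) are still needed to reach the modern stem verzuim.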
Finding historical synonyms for modern words is a problem heretofore only
tackled by manual approaches. For modern languages, techniques have been
developed to find synonyms automatically (see, for instance [3, 4, 5, 14]), using
plain text, or syntactically annotated corpora. Part-Of-Speech (POS) taggers
exist for many languages, but not for 17th century Dutch, and annotated, 17th
century Dutch documents are not available either. Therefore, only those tech-
niques that use nothing but plain text are an option.
The next chapters describe the automatic generation of rewrite rules based
on phonetic information, and the automatic construction of thesauri using plain
text. The various approaches are listed here:

• Rewrite rule generation methods: Three different methods, based on
phonetic transcriptions, syllable structure similarity and corpus statistics,
will be described.

• Rewrite rule selection methods: Some of the generated rules could be
bad for performance. Some language dependent and independent selection
criteria will be tested.

• Rewrite rule evaluation methods: The main evaluation will test the
generated rule sets on a test set of historic and modern word pairs, and
measure the similarity of the words before and after rewriting. Further
evaluation is done by retrieval experiments, one word-based, the other
document-based.

• Thesauri and dictionaries for the vocabulary gap: A historic-to-
modern dictionary will be constructed from existing translation pairs (see
next section), and a historic synonym thesaurus will be constructed based
on bigram information. These methods address the vocabulary gap.

• Dictionaries for the spelling gap: A dictionary based on pronunciation
will be made that contains mappings from historic words to modern words
with the same pronunciation. Finally, as a way of finding historic spelling
variants for modern words, the word retrieval experiment will be extended
with a technique to measure the similarity of words based on phonetics.
This results in a dictionary of spelling variants. Both methods try to
bridge the spelling gap.

But to do this, a collection of historic documents is needed. Huge document
collections are electronically available for modern Dutch (especially since the birth
of IR conferences like TREC5 and CLEF6 ), but for 17th century Dutch, docu-
ments are only sparsely available.
5 TREC URL: http://trec.nist.gov
6 CLEF URL: http://clef.isti.cnr.it/

2.5 Corpora
Although the Nationale Koninklijke Bibliotheek van Nederland has a large col-
lection of historic documents, at this moment, very few of them are in digital
form. The resources that will be constructed use corpora of 17th century texts
acquired from the internet. The following corpora were found:

• Braun corpus: This was acquired from the University of Maastricht, from
the research done by Braun [2].

• Dbnl corpus: The Digitale bibliotheek voor de Nederlandse letteren7 stores


a huge amount of Dutch historic texts. The texts used in this research are
all from the Dutch ‘Golden Age’, 1550–1700. This is by far the largest
corpus. Some texts were written in Latin, others are modern Dutch de-
scriptions of the historic texts. Most of these non-historic Dutch texts
were removed from the corpus. Many texts contain notes with word
translation pairs. These historic/modern translations can be used to cre-
ate a thesaurus for historic Dutch.

• Historie van Broer Cornelis: This is a medium size corpus from the be-
ginning (1569) of the Dutch literary ‘Golden Age’, transcribed by the
foundation ‘Secrete Penitentie’ as a contribution to the history of Dutch
satire.

• Hooglied: A very small corpus. It is based on an excerpt from the


’statenvertaling’, the first official Dutch bible translation. The so-called
’Hooglied’ was put to rhyme by Henrick Bruno in 1658.8

• Kookhistorie: A website containing three historic cook books.9 There is a


huge time span between the appearance of the first cook book (1514) and
the last one (1669). The language of the first book is very different from
that of the second (1593) and third. However, since the first cook book
contains some modern translations of historic terms that also occur in the
other two cook books, the translations can still be used for the thesaurus.

• Het Voorlopig Proteusliedboek: A small text transcribed by the ‘Leidse
vereniging van Renaissancisten Proteus.’10

2.6 Corpus problems


The DBNL corpus contains heterogeneous texts: historic Dutch from various
periods, modern Dutch, Latin, French, English. If the overlap in phonetics is
to be used, the texts from all these different languages might cause problems.
7 URL: www.dbnl.nl
8 URL: http://www.xs4all.nl/ pboot/bruno/brunofs.htm
9 URL: www.kookhistorie.com
10 URL: home.planet.nl/ jhelwig/proteus/proteus.htm
The French word guy (English: guy, fellow) contains the vowel uy, but in French
it is pronounced differently from the historic Dutch uy in words like uyt (English: out).
Foreign words ‘contaminate’ the historic Dutch lexicon. The historic corpus will
be used to find typical historic Dutch sequences of characters, so modern Dutch
texts are also considered ‘foreign.’ As a preprocessing step, as many of the non-
17th-century Dutch texts as possible were removed from the corpus. Because the
entire DBNL corpus contains over 8,600 documents, some simple heuristics were
used to find the foreign texts, so the corpus may still contain some texts other
than 17th century Dutch.
Another problem with the texts from the DBNL corpus is the period in which
the texts were written. The oldest texts date from 1550, the most recent were
written in 1690. In 150 years time, the Dutch language has changed somewhat,
including pronunciation and use of some letter combinations (like uy). For
instance, in the oldest texts, the uy was used to indicate that the u should be
pronounced long (the modern word vuur was spelled as vuyr around 1550). In
more recent texts, after 1600, the uy was often used like the modern ui, as in
the example given above (uyt is the historic variant of uit).
If texts from a wide ranging period are used, generating rewrite rules will
suffer from ambiguity. To minimize these problems, it is probably better to
use texts from a fairly small period (20 – 50 years, for instance).

2.7 Measuring the gap


Before considering the construction of resources, it might be helpful to have at
least some idea of the differences between the historic language of the corpus
and the modern language. Some words were spelled differently from the current
spelling, but how many words are we talking about? And how many of these
words in the historic document collection are spelled as modern words? To
get an indication of the differences, a sample of 500 randomly picked words
from the historic collection was assessed (names were excluded
since they do not contribute to the difference between two languages). Each
word was assigned to one of three categories: modern (MOD), spelling variant
(VAR), or historic (HIS). A word belongs to MOD if it is spelled in a modern
way (according to modern Dutch spelling rules). It belongs to VAR if it is
recognized as a modern word spelled in a non-modern way. If a word has some
non-modern morphology, or can’t be recognized as a modern word at all, it
belongs to HIS. The word ik (English: I) is found often in historic texts, but
it hasn’t changed over time. It is still used, thus belongs to MOD. The word
heyligh is recognized as a historic spelling of the modern word heilig (English:
holy), and is categorized as VAR. But the word beestlijck is not recognized as a
modern word. Even after adjusting its historic spelling to beestelijk, it is not
a correct modern Dutch word. Taking a look at the context (V beestlijck leven)
makes it possible to identify this word as a historic translation of the modern
word beestachtig (English: bestial or beastly). From context, it’s not hard for a
native Dutch speaker to find out what it means, but it is clear that over time,

Category Distribution
Modern 177 (35%)
Variant 239 (48%)
Historic 84 (17%)

Table 2.1: Distribution over categories for 500 historic words

its morphology has changed (the root of the word beest is still the same, which is
the very reason that its meaning is recognizable from context). This might not
be problematic for native Dutch speakers, but it does pose a problem for finding
relevant historic documents for the query term “beestachtig”. This word, and
other even less recognizable words belong to HIS. Categorizing all 500 randomly
picked words does not result in any hard facts about the gap between the two
language variants, but it does give some idea about the size of the problem.
The results are listed in Table 2.1. It turns out that most of the words (239
words, about 48%) are historic spelling variants of modern words. The overlap
between historic and modern Dutch is also significant (177 words, 35%), leaving
a vocabulary gap of 84 out of 500 words (17%). This shows that solving
the problem of spelling variants largely bridges the gap between historic (at
least 17th century) Dutch and modern Dutch.

2.8 Spelling check


Robertson and Willett [18] suggested using spelling correction methods. An ap-
proach for handling inconsistent spelling is to treat 17th century Dutch as a
corrupted version of modern Dutch. A spelling checker might be able to map
historic word forms to modern forms. That would take away the need to build
specific resources for historic document retrieval. For instance, the historic word
menschen might be identified by the spelling checker as a misspelled form of the
modern word mensen. One way of testing this is to do a spelling check on
several documents, and manually evaluate the spelling checker’s performance.
Some spelling checkers use the context of a word to find the most probable
correct word.
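The approximate-matching idea behind such spell checkers can be illustrated with Python's standard difflib, which ranks candidates from a lexicon by string similarity. This is only a sketch of the general idea, not of any particular spell checker's algorithm, and the tiny modern lexicon below is made up for the example.

```python
import difflib

# Toy modern lexicon; a real spell checker would use a full dictionary.
modern_lexicon = ["mensen", "mens", "wensen", "kist", "kaas"]

# Rank modern candidates for a historic word by orthographic similarity.
suggestions = difflib.get_close_matches("menschen", modern_lexicon, n=3)
print(suggestions)  # 'mensen' ranks first
```

Like the spell checkers discussed below, this only finds candidates that are orthographically close; a historic form whose morphology has changed would not be matched.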
The Unix spell checker Aspell was tested on the small text snippet from
section 2.2, using a modern Dutch dictionary:11

9. Item, of de schipper versuijmelijck ware de goeden ende coop-


man-schappen int laden ende of ontladen vast genoeg te maken, ende
dat die daardoor vuijtten takel of vangtouw schoten, ende int water
of ter aerden vielen, ende alsoo bedorven worden of te niette gingen,
alsulcke schade of verlies moet de schipper ook alleen dragen ende
de koopman goet doen, als voor.
11 Information on Aspell and the Dutch dictionary used can be found on
http://aspell.sourceforge.net/
14 CHAPTER 2. HISTORIC DOCUMENTS

10. Item, als de schipper de goeden soo qualijck stouwt of laeijt dat
d’eene door d’andere bedorven worden, gelijk soude mogen gebeuren
als hij onder geladen heeft rozijnen, aluin, rijs, sout, graan ende
andere dergelijke goeden, ende dat hij daar boven op laeijt wijnen,
olin of olijven, die uitlopen ende d’andere bederven, die schade moet
de schipper ook alleen dragen ende de koopman goet doen, als boven.

The words oft, den, genoech, maecken, taeckel, daerdore and others were rec-
ognized as misspelled words, and a list of suggestions was given for each word,
including the correct modern words, which were not always the most probable
alternatives according to Aspell. For the word versuijmelijck no alternative was
suggested, probably because it is too dissimilar to any modern Dutch word. The
word goeden is a historic word for which Aspell suggested ‘goed’ (good), but
not ‘goederen’ (goods). The correct suggestion for coopman-schappen (which
is koopmanschappen, lit. ‘trade goods’) was not given, probably because the
modernized version of the word (koopmanschappen) is not a modern word (the
word koopmanschap was suggested, but this means something else, namely the
business of trading). The same goes for qualijck (modern form: kwalijk) and
laeijt (modern word: laadt). Also, some words are in fact historic but are not
recognized as misspellings. The word niette should become niet (English: not),
but is instead recognized as the past singular form of the verb nieten (English:
to staple, as in stapling sheets of paper together).
Another spell checker available for Dutch is the one that comes with the
MS Word text processor.12 It suggests orthographically similar words for any
unknown word in the text, and is also capable of checking grammar. This is the
output after applying the correct suggestions by MS Word:

Item, oft den schipper versuijmelijck ware de goeden ende koop-


manschappen int laden oft ontladen vast genoeg te maecken, en dat
die daardoor vuijtten taeckel oft bevanghtouw schoten, ende int wa-
ter oft ter aarden vielen, ende zo bedorven worden oft te niette gin-
gen, alsulcke schade oft verlies moet den schipper ook alleen dragen
ende den koopman goed doen, als voor.
Item, als den schipper de goeden zo kwalijk stouwt oft laeijt dat dene
door dandere bedorven worden, gelijk zoude mogen gebeuren als hij
onder geladen heeft rozijnen, aluin, rijs, zout, granen ende andere
dergelijke goeden, ende dat hij daer boven op laeijt wijnen, oliën oft
olijven, die uitlopen ende dandere bederven, die schade moet den
schipper ook alleen dragen ende den koopman goed doen, als boven.

MS Word marks the word versuijmelijck as a misspelled word, but no alternatives
are suggested, which happens for bevanghtouw and alsulcke as well.
For some words, the correct word is suggested, as is the case for oft, ende and
maecken and quite a few others. For many other words, the correct modern
12 For those unfamiliar with MS Word, see http://office.microsoft.com
word is in the list of alternatives. For the historic word alsoo it correctly sug-
gests alzo and afterwards suggests to replace alzo with the more grammatically
appropriate word zo.
Spell checkers can be used to find correct modern words for historic words
that are orthographically similar. However, for many historic words, spell check-
ers cannot find the correct alternative, and for some they cannot find any mod-
ern alternative at all. Moreover, each word has to be checked separately and
the correct suggestion has to be selected from the list manually (the correct
alternative is not always the first one in the list of suggestions). It would still
take an enormous amount of time and effort to modernize historic documents
for HDR in this way. A spelling check is not a good solution. It seems we do
need specific resources to aid HDR.
Chapter 3

Rewrite rules

In this chapter, the spelling bottleneck, and approaches for solving this problem
are described. The following points will be discussed:

• Inconsistent spelling & rewrite rules: The problems with inconsistent
spelling, how rewrite rules can solve these problems, and what is
needed.

• Rewrite rule generation: Methods for generating rewrite rules.

• Rewrite rule selection: Which rules are to be selected and applied?

• Evaluation of rewrite rules: How are the sets of rewrite rules evalu-
ated? And how well do they perform?

• Rewrite problems: Multiple modern spellings for historic character se-
quences.

• Conclusions: Is automatic generation of rewrite rules an effective solu-
tion to the spelling problem?

3.1 Inconsistent spelling & rewrite rules


3.1.1 Spelling bottleneck
As mentioned in chapter 2, one of the main problems with searching in historic
texts is that the word or words you are looking for can be spelled in many
different ways. For example, if you searching for texts that contain the word
rechtvaardig (English: righteous), you might find it in one or two texts. But
there probably are many more texts that contain the same word spelled in
different ways (i.e.: rechtvaerdig, reghtvaardig, rechtvaardigh and combinations
of these spelling variations).


One way of solving this problem would be to expand your query with spelling
variations typical of that period. But few people possess the necessary knowl-
edge to do this. Apart from that, it is fairly time consuming to think of all these
variations, and you inevitably omit some variations. It would be far more effi-
cient to do query expansion automatically. Or to rewrite all historic documents
to a standard form that matches modern Dutch closely.
Robertson and Willett [18] have shown that character based n-gram match-
ing is an effective way of finding spelling variants of words in 17th century
English texts. Historic word forms for modern words were retrieved based on
the number of n-grams they shared. All the historic words were transformed
into an index of n-grams, and the 20 words with the highest score were retrieved.
The score was computed using the Dice score, with N(Wi, Wj) being the number
of n-grams that Wi and Wj have in common, and L(Wi) the length of word Wi:

    Score(Wmod, Whist) = 2 × N(Wmod, Whist) / (L(Wmod) + L(Whist))    (3.1)

In a historic word list containing 12191 unique words, 2620 historic words
were paired with 2195 unique modern forms. Thus, each modern form had at
least one corresponding historic word form. The results in Table 3.1 show the
recall at the 20 most similar matches (no precision scores given in [18]).

Method Recall
2-gram matching 94.5
3-gram matching 88.8

Table 3.1: Comparative recall for the 20 most similar matches for historic English
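The scoring in Formula 3.1 can be sketched directly. The sketch below collects each word's distinct character n-grams into a set, so N counts shared n-grams and L is taken as the number of n-grams per word (a common reading of the n-gram Dice score; it differs from the raw word length by a constant for a fixed n).

```python
def ngrams(word: str, n: int = 2) -> set:
    """Distinct character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice_score(w1: str, w2: str, n: int = 2) -> float:
    """2 * shared n-grams / (n-grams of w1 + n-grams of w2)."""
    g1, g2 = ngrams(w1, n), ngrams(w2, n)
    return 2 * len(g1 & g2) / (len(g1) + len(g2))

print(dice_score("gelijk", "ghelijck"))  # -> 0.5
```

To retrieve the 20 most similar historic words for a modern query word, one would score the query against every word in the historic index and keep the 20 highest-scoring ones.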

Braun [2] has conducted the same experiment for 17th century Dutch. It
turns out that n-gram matching performance is increased by standardizing spelling
and stemming (Table 3.2). The inconsistency of spelling makes it hard to
apply a stemming algorithm directly on historic documents. Therefore, spelling
is standardized by applying rewrite rules on the historic words. In [2], these
rewrite rules for 17th century Dutch were constructed with the help of experts.
They transform the most common spelling variations to a standard spelling.
Most of the variations of rechtvaardig just mentioned would be changed to the
modern spelling by these rules. By rewriting different spelling variants to the
same word form, and removing affixes through stemming, a fair number of word
forms are conflated to the same stem.
Still, constructing rules manually with the help of experts takes a lot of
effort, and experts of 17th century Dutch are not freely and widely available.
More efficient, but also more error prone, are automatic, statistical methods
to produce rewrite rules. In this chapter, several automatic approaches are
compared with each other as well as with the rewrite rules constructed by Braun.

Retrieval method                Comp. Recall    Precision
3-gram                          70.4            57.9
3-gram + stemming               74.0            62.5
3-gram + rewriting              74.8            53.7
3-gram + stemming + rewriting   82.1            57.8

Table 3.2: Comparative recall for the 20 most similar matches for historic Dutch

3.1.2 Resources
To construct rewrite rules, a collection of historic documents is needed, as well as
a collection of modern documents. Without the modern documents, it would be
much harder to standardize historic spelling. There are several equally acceptable
ways of spelling a word in 17th century Dutch. There is no single spelling that
would be better than the others. To ensure uniform rewriting, the rules have
to be constructed with great care. Identifying spelling variants is only the first
step. The second step is rewriting them all to the same form. For another group
of spelling variants, the same standard form should be chosen. But this is far from
trivial. Consider the spelling variants klaeghen, klaegen, klaechen and claeghen.
Three out of four words start with kl, so it seems sensible to choose kl as the
standard form. Also, two out of four words use gh, so g and ch should become
gh as well to transform all four variants into a uniform spelling. Another group
of spelling variants might be vliegen, vlieghen, vlieggen, vlyegen and fliegen. In
this case, rewriting fl to vl seems to make more sense than rewriting vl to fl.
The same goes for ye and ie. But the next transformation should be selected
more carefully. Of the three different options g, gh and gg, g occurs most often. But
rewriting gh to g would be in conflict with the earlier decision to rewrite g to
gh. A far easier solution, and with the goal of making resources for information
retrieval in mind, is to rewrite the historic word forms to modern word forms.
In that case, a standard spelling already exists, and rewriting historic spelling
variants to a uniform word is done by rewriting them to the appropriate modern
word. Of course, we need to find the appropriate modern form, which might
not be easy at all. But we’re faced with the same problems when finding the
different historic spelling variants themselves. The other advantage of rewriting
to modern words becomes clear when combining it with an IR system. Modern
users pose queries in modern language. Rewriting all possible historic variants
to one historic word will not make it any easier for the IR system to match
it with its modern variant. Rewriting historic words to modern words, means
rewriting to the language of the user.

The document collections


For the historic document collection, a corpus of several large books is used.
These books all date from the same period (1600 – 1620). The reason for keeping
the period small is that spelling changed over time. If a larger time-span is
chosen, a greater ambiguity in spelling might result in incorrect rewrite rules.
The pronunciation of some character combinations in 1550 might have changed
by 1600. Also, the specific period between 1600 and 1620 makes it possible
to compare the generated rewrite rules with the rules constructed by Braun,
since these rules were based on two law books dating from 1609 and 1620.
The corpus used in this research, named hist1600, contains these same two law
books, in addition to a book by Karel van Mander (Het schilder-boeck), printed
in 1604. Two of the techniques used here compare the words of the historic
corpus to words in a modern corpus. The modern corpus (15 editions of the
Dutch newspaper ”Algemeen Dagblad”) is equal in size to the historic corpus
(see Table 3.3). The included editions of the newspaper were selected by
date, ranging over two whole years, to make sure that not all editions cover the
same topics (two successive editions often contain articles on the same topics,
probably repeating otherwise low-frequency content words).

Name               total size          number of
                   (number of words)   unique words
AC-1609            221739              11648
GLS-1620           131183              6977
mand001schi01      453474              32314
Total (hist1600)   791217              47816
Alg. Dagblad       797530              58664

Table 3.3: Corpus statistics for modern and historic corpora

3.1.3 Linguistic considerations


Is it possible to have some idea about how well a certain method will work?
Surely it would be nice to know in advance that matching variants of a word
based on phonetic similarity works well. But we don’t have this knowledge.
However, some observations beforehand can point in the right direction (or
away from the wrong direction).

Syllable structure
One such observation is that, apparently, most historic words that are recogniz-
able spelling variants of modern words have the same syllable structure as their
modern form. Each syllable in Dutch contains a vowel, and can have a conso-
nant as onset and/or as coda. If we take the modern Dutch word aanspraak
and a historic form aenspraeck, the similarity in syllable structure is obvious.
For both forms the first syllable has a coda (n), the second syllable has an onset
(spr) and shows a difference in the codas (k vs. ck). The vowels of the two
syllables differ also (aa vs. ae). Can this be of any help in choosing a method to
attack the spelling problem? A solution can be to split the words into syllables
and then make rewrite rules by mapping the historic syllable to the modern
syllable. This would give the following rules:
aen → aan
spraeck → spraak
The advantage of this approach is that it will not only rewrite the word
aenspraeck but also any other historic word that contains the syllable aen. What
it won’t do is rewrite the word staen to staan (English: to stand), since it
won’t rewrite syllables containing aen that have an onset. After reading a few
sentences of a historic document it becomes clear that the vowel ae is very
common in these documents. In modern documents it is not nearly as common.
One problem that is immediately visible is that to rewrite all words that contain
the vowel combination ae an enormous amount of rules is needed to cover all
the different syllables in which this combination can appear. And since the
corpus is limited, not all possible syllables can be found. The rules need to be
generalized. For instance, a rule could be: rewrite all instances of ae to aa in
syllables that have a coda.
But this introduces a few problems. For native Dutch speakers, it is probably
fairly easy to recognize the syllable structure of many historic words. But an
automatic way of splitting a word into syllables would be based on the modern
Dutch spelling rules. Since historic words are not in accordance with these rules,
splitting them properly into syllables might do more harm than good. According
to modern spelling rules, the word claeghen would be split into claeg and hen,
which is not what it should be (namely, clae and ghen). Redundant letters in
historic words can shift the syllable boundaries, adding a coda or onset where
there shouldn’t be one.
To get around this problem, it is possible to split the word into sequences of
vowels and sequences of consonants. The word claeghen would be split into
the sequences cl, ae, gh, e and n. A syllable boundary can fall within a single
sequence (like ia in hiaten), but this need not be a problem. Historic vowel sequences may
only be rewritten to modern vowel sequences, and historic consonant sequences
may only be rewritten to modern consonant sequences. Putting this restriction
on what a historic sequence can be rewritten to preserves the syllable structure.
Again, the considered context can be specific, changing ae to a in the context of
cl and gh, or general, changing ae to a if the sequences is preceded and followed
by any consonant sequence.
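Splitting a word into maximal runs of vowels and consonants can be sketched as follows. The vowel set is a simplification assumed for the example: y is counted as a vowel (cf. vlyegen), and the digraph ij gets no special treatment.

```python
from itertools import groupby

VOWELS = set("aeiouy")  # simplification: 'y' as vowel, 'ij' not special-cased

def split_sequences(word: str) -> list:
    """Split a word into maximal runs of vowels and runs of consonants."""
    return ["".join(g) for _, g in groupby(word, key=lambda c: c in VOWELS)]

print(split_sequences("claeghen"))    # -> ['cl', 'ae', 'gh', 'e', 'n']
print(split_sequences("aenspraeck"))  # -> ['ae', 'nspr', 'ae', 'ck']
```

Note that in aenspraeck the run nspr spans a syllable boundary (coda n plus onset spr), which, as argued above, need not be a problem as long as vowel runs are only rewritten to vowel runs and consonant runs to consonant runs.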

Spelling errors versus phonetic spelling


Treating historic spelling as a form of spelling errors leads to the method of spell
checking. A possible algorithm for finding the correct word given a misspelled
word is the Edit Distance algorithm, [24]. This algorithm finds the smallest
number of transformations needed to get from one word to another word. At
each step in the algorithm, the minimal cost of inserting, deleting a substitut-
ing a character is calculated. Inserting or deleting a character takes 1 step, a
22 CHAPTER 3. REWRITE RULES

substitution takes 2 steps (the same as 1 delete + 1 insert). Changing bed into
bad takes one substitution (‘e’ to ‘a’), thus the edit distance between bed and
bad is 2. The edit distance between bard and bad is 1 (deleting the ‘r’). This can
be used to find the word in a lexicon that is closest to the misspelled word [6].
However, historic spelling is different from misspellings in modern texts. The
variance in spelling is not based on accidentally hitting a wrong key on the
keyboard, but on phonetic information. Without any official spelling, writing
caas or kaas makes no difference. They are both pronounced the same. Thus,
historic Dutch can be treated as modern Dutch with spelling errors based on
a lack of knowledge of modern spelling rules (which people in the 17th cen-
tury were, of course, ignorant of). Thus, writing caas instead of kaas (English:
’cheese’) is more probable than writing cist instead of kist (English: ’box’), since
a c is pronounced as a k when followed by an a, but is pronounced as an s when
followed by an i. From a phonetic perspective, the distance between cist and
kist is bigger than between caas and kaas.
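The edit distance with the costs described above (insert and delete cost 1, substitution cost 2) can be sketched with standard dynamic programming:

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance with insert/delete cost 1 and substitution cost 2."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i            # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j            # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 2
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # (mis)match
    return d[m][n]

print(edit_distance("bed", "bad"))   # -> 2 (one substitution)
print(edit_distance("bard", "bad"))  # -> 1 (one deletion)
```

As the text notes, this purely orthographic distance does not capture phonetic plausibility: cist/kist and caas/kaas both have distance 2 here, even though only the latter pair sounds the same.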

3.2 Rewrite rule generation


One can think of many different ways of generating rewrite rules. The use of pho-
netic transcriptions is one, but another way would be to see the spelling variance
as a noisy channel (i.e. treating historic spelling as a misspelling of modern
Dutch), making rewrite rules out of typical misspellings. N-gram matching can
be used to find letter combinations that occur frequently in a historic lexicon,
but much less frequently in a modern lexicon. In all approaches, a few issues
have to be considered. First of all, while some historic words are spelling varia-
tions of modern words, many other historic words are just plain different words.
They cannot be mapped to modern words, although they can be modernized in
spelling. Thus, purely historic words cannot be used to generate rules, but the
generated rules will affect these words.
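The frequency-comparison idea can be sketched by counting character n-grams in a historic and a modern lexicon; an n-gram that is frequent historically but rare or absent in modern text is a candidate for rewriting. The tiny word lists below are made up for illustration and are not the thesis corpora.

```python
from collections import Counter

def ngram_counts(words, n=2):
    """Count character n-gram occurrences over a word list."""
    counts = Counter()
    for w in words:
        for i in range(len(w) - n + 1):
            counts[w[i:i + n]] += 1
    return counts

# Tiny illustrative lexicons (not the actual thesis corpora).
historic = ["claeghen", "aenspraeck", "ghelijck"]
modern = ["klagen", "aanspraak", "gelijk"]
h, m = ngram_counts(historic), ngram_counts(modern)

# The bigram 'ae' is frequent in the historic lexicon but absent in the
# modern one, flagging it as a candidate for a rewrite rule.
print(h["ae"], m["ae"])  # -> 3 0
```

On real corpora the raw counts would be normalized by the total number of n-grams per lexicon (with smoothing for unseen n-grams) before comparing relative frequencies.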
Three different rule generation methods have been developed:

1. Phonetic Sequence Similarity

2. Relative Sequence Frequency

3. Relative N-gram Frequency

3.2.1 Phonetic Sequence Similarity


The first method of mapping historic words to their modern variants is by using
phonetic transcriptions of both historic and modern words. Phonetic matching
techniques are used to find the correct spelling of a name, when a name is given
verbally, i.e. only its pronunciation is known (see [26]). For modern Dutch,
a few automatic conversion tools are available to transform the orthographic
word into a phonetic transcription. A phonetic transcription is a list of phoneme
characters which have a specific pronunciation. A simple conversion tool for
Dutch can be found on the Mbrola website.1 It makes acceptable phonetic
transcriptions of words. But, because of its simplicity, it cannot cope with
the less frequent letter combinations in the Dutch language. For instance, the
combination ae is transcribed to two separate vowels AE. A much more com-
plex grapheme to phoneme conversion tool is embedded in the text-to-speech
software package Nextens (see http://nextens.uvt.nl). This converter is more
sensitive to the context of a grapheme (letter). The grapheme n preceded by
a vowel and followed by a b is not pronounced as an n but as an m. Also, it
can cope with rarer letter combinations like ae (transcribed to the
phoneme e). Which phonetic alphabet is used by these tools is not important,
as long as the same tool is used for all transcriptions.2
While the conversion tools have been developed for modern Dutch, they can
also be used for historic variants of Dutch. It is not clear how well this works,
but if 17th century Dutch is close enough to modern Dutch, this could be a
very simple way to standardize and modernize historic spelling. Once phonetic
transcriptions are made for all the words in the historic lexicon and all the
words in the modern lexicon, it is easy to find historic words and modern words
with the same pronunciation. These word pairs can be combined in a thesaurus
for lookup (see chapter 5), but they can also be used for constructing rewrite
rules. The next step is then to construct a rewrite rule based on the differ-
ence between these historic and modern word pairs. One way to do this is to
make a mapping between the differing syllables. But splitting historic words
into syllables is problematic. However, splitting words in vowel sequences and
consonant sequences is an option. If the equal sounding words have the same
vowel/consonant sequence structure, then, by aligning the consonant/vowel se-
quences, the aligned sequences are paired on the basis of pronunciation. To
clarify the idea, consider the following example:
historic word: klaghen
modern word: klagen
historic sequences: kl, a, gh, e, n
modern sequences: kl, a, g, e, n

All these sequence pairs are pronounced the same, including the pair [gh, g].
From this list of pairs, only the ones that contain two distinct sequences are
interesting. Rewriting kl to kl has no effect.
After applying rewrite rules based on phonetic transcriptions, the lexicon has
changed. But iterating this process has no further effect. Since the rewrite rules
are based on mapping historic words to modern words that are pronounced the
same, after rewriting, the pronunciation stays the same.
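Assuming a historic/modern word pair with identical pronunciation has already been found, the alignment step just described can be sketched as follows; the splitter and vowel set repeat the simplifications used earlier ('y' as vowel, no special 'ij' handling), and the same-type check anticipates the constraint described in the PSS algorithm.

```python
from itertools import groupby

VOWELS = set("aeiouy")  # simplification: 'y' as vowel, 'ij' not special-cased

def split_sequences(word):
    """Split a word into maximal vowel runs and consonant runs."""
    return ["".join(g) for _, g in groupby(word, key=lambda c: c in VOWELS)]

def extract_rules(hist, mod):
    """Pair aligned sequences of two same-sounding words into rewrite rules."""
    hs, ms = split_sequences(hist), split_sequences(mod)
    if len(hs) != len(ms):
        return []  # unmatched sequence: generate no rule
    rules = []
    for h, m in zip(hs, ms):
        if (h[0] in VOWELS) != (m[0] in VOWELS):
            return []  # aligned sequences must both be vowels or consonants
        if h != m:
            rules.append((h, m))  # only distinct pairs are interesting
    return rules

print(extract_rules("klaghen", "klagen"))  # -> [('gh', 'g')]
```

A pair like authentique/authentiek yields no rules, since the words split into a different number of sequences.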

1 See http://tcts.fpms.ac.be/synthesis/mbrola.html or http://www.coli.uni-sb.de/˜eric/stuff/soft/,
which is the website of the author of the conversion tool.
2 This became clear when using the Kunlex phonetic transcriptions list that is supplied with

the Nextens package. This list contains 340.000 modern words with phonetic transcriptions.
However, converting the words to phonetic transcriptions using Nextens results in different
transcriptions from the ones in the Kunlex list.
24 CHAPTER 3. REWRITE RULES

3.2.2 The PSS algorithm


The PSS (Phonetic Sequence Similarity) algorithm aligns two distinct character
sequences that are similar based on phonetics. If the phonetic transcription PT
of a historic word Whist also occurs in the modern phonetic transcriptions list,
then the modern word Wmod that has the same transcription PT is considered
the modern spelling variant of Whist. Both words are split into sequences of
vowels and sequences of consonants. If the number of sequences of Whist is different
from the number of sequences of Wmod, no rewrite rule is generated, because
there would be an unmatched sequence. Consider the modern word authentiek
and the similar sounding historic word authentique.^3 The modern word contains
6 sequences (au, th, e, nt, ie, k), while the historic word contains 7 (au, th, e,
nt, i, q, ue). The last sequence, ue, is not pronounced (at least, not according to
Nextens). All the other sequences can be aligned to the sequences of the modern
word. This problem is sidestepped by ignoring these cases. When the number of
sequences is equal, an extra check is needed to make sure that for both words
the aligned sequences are of the same type, that is, both sequences should be
vowels, or both should be consonants. In this research, no word pairs were found
that couldn't be aligned properly, except for the word pair mentioned above,
but as was mentioned, it is part of a French text. The next step is comparing all
the aligned sequences. If the spelling of the historic sequence Seq^i_hist is different
from the spelling of the modern sequence Seq^i_mod, a possible rewrite rule is
found. Since both words are pronounced the same, apparently, both sequences
are pronounced the same as well. By replacing Seq^i_hist in a historic word with
Seq^i_mod, pronunciation should be preserved. Thus the rewrite rule becomes:

    Seq^i_hist → Seq^i_mod        (3.2)

The resulting rules are ranked by their frequency of occurrence. Thus, if
Seq^i_hist and Seq^i_mod are aligned N times in all the equal sounding word pairs,
the resulting rule has score N. If Seq^i_hist and Seq^i_mod are aligned often, it is
highly probable that the rule is correct, and that it will have a large effect on
the historic corpus.
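The whole PSS procedure can be sketched end-to-end. The toy transcription dictionaries below are hypothetical stand-ins for Nextens output; only the pairing and counting logic is the point:

```python
import itertools
from collections import Counter

VOWELS = set("aeiouy")

def split_sequences(word):
    """Split a word into maximal runs of vowels and runs of consonants."""
    return ["".join(g) for _, g in itertools.groupby(word, key=lambda c: c in VOWELS)]

def pss_rules(hist_phon, mod_phon):
    """hist_phon / mod_phon: word -> phonetic transcription.
    Returns rewrite rules ranked by how often each aligned pair occurs."""
    by_sound = {p: w for w, p in mod_phon.items()}  # transcription -> modern word
    rules = Counter()
    for hist, phon in hist_phon.items():
        mod = by_sound.get(phon)
        if mod is None or mod == hist:
            continue
        h, m = split_sequences(hist), split_sequences(mod)
        if len(h) != len(m):
            continue  # unmatched sequence: skip this pair
        for a, b in zip(h, m):
            if a != b:
                rules[(a, b)] += 1
    return rules.most_common()

# toy transcriptions (hypothetical, for illustration only)
hist = {"klaghen": "klaGe", "daghen": "daGe"}
mod  = {"klagen": "klaGe", "dagen": "daGe"}
print(pss_rules(hist, mod))  # [(('gh', 'g'), 2)]
```

The score of a rule is simply the number of equal-sounding word pairs in which its aligned sequences occur, as described above.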

3.2.3 Relative Sequence Frequency


The second approach tries to find modern spellings for sequences of vowels and
sequences of consonants based on ’wildcard’ matching. Each word, in both
historic and modern corpora, is split in sequences of vowels and sequences of
consonants (in the same way as for the PSS algorithm). Sequences that are
frequent in historic texts but rare in modern texts are likely candidates for
rewriting. To find the appropriate modern sequence to replace it, the historic
sequence could be removed from the historic word and replaced by a wildcard.
This should be a vowel wildcard if the removed historic sequence is a vowel,
3 Although this word is in the DBNL corpus, it is probably taken from a document containing a small portion of French.



and a consonant wildcard for historic consonant sequences. If a modern word can be
matched with the historic word containing a wildcard, the modern sequence that
is aligned with the wildcard is a candidate for replacing the historic sequence.
Historic and modern sequences that are aligned often have a high probability of
being correct.

3.2.4 The RSF algorithm


The Relative Sequence Frequency (RSF) algorithm generates rules based on typical
historic character sequences. The whole historic corpus is split in vowel and
consonant sequences Seq, which are ranked by their frequency F_hist(Seq). After
that, their frequency scores are divided by the total number of sequences of the
whole corpus N_hist(Seq), resulting in a list of relative frequencies:

    RF_hist(Seq) = F_hist(Seq) / N_hist(Seq)        (3.3)

The same is done for the modern corpus. The final relative sequence frequency
RSF(Seq) is given by:

    RSF(Seq) = RF_hist(Seq) / RF_mod(Seq)        (3.4)

A sequence i with a high RSF(Seq_i) is a typical historic character combination.
A score of 1 means that the sequence is used just as frequently in a modern
corpus as in a historic corpus. A threshold is used to determine whether a
sequence is considered typically historic or not. This threshold is set to 10,
meaning that, for a historic and a modern text of equal size N, the character
sequence Seq_i should occur at least 10 times more often in the historic text to
be typically historic. The reasoning behind this is that if a sequence occurs
much more often in a historic text (i.e. is much more common in a historic
text), there is a good chance that its spelling has changed in the past few centuries.
If Seq_i occurs in the historic corpus but not in the modern corpus (i.e.
RF_mod(Seq_i) = 0), RSF(Seq_i) is set to 10. No matter what its historic frequency
is, Seq_i is infinitely more frequent in the historic corpus than in the
modern corpus, and is considered a typical historic character combination.
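The RSF computation in (3.3) and (3.4), including the cap of 10 for sequences absent from the modern corpus, can be sketched as follows (a sketch on toy word lists; the simplified vowel set is an assumption):

```python
import itertools
from collections import Counter

VOWELS = set("aeiouy")

def split_sequences(word):
    return ["".join(g) for _, g in itertools.groupby(word, key=lambda c: c in VOWELS)]

def rsf_scores(hist_words, mod_words, cap=10):
    """Relative sequence frequency RF_hist(Seq) / RF_mod(Seq), capped at 10
    when the sequence does not occur in the modern corpus at all."""
    f_hist = Counter(s for w in hist_words for s in split_sequences(w))
    f_mod  = Counter(s for w in mod_words  for s in split_sequences(w))
    n_hist, n_mod = sum(f_hist.values()), sum(f_mod.values())
    scores = {}
    for seq, f in f_hist.items():
        if f_mod[seq] == 0:
            scores[seq] = cap  # "infinitely more frequent": capped score
        else:
            scores[seq] = (f / n_hist) / (f_mod[seq] / n_mod)
    return scores

scores = rsf_scores(["claghen", "aenspraeck"], ["klagen", "aanspraak"])
print(scores["gh"])  # 10 (capped: gh never occurs in the modern corpus)
```

With the threshold of 10 described above, only sequences such as gh and ae would survive as typically historic on this toy input.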
Starting with the highest ranking character sequence Seq, all historic words
that contain this sequence are transformed into so-called 'wildcard words'. If
Seq is a vowel sequence, a historic word Whist contains Seq if Seq is preceded
and followed by consonants, or the start or end of the word. For example, the
word quaellijk is not listed as a word containing the sequence ae, since ae is
not the full vowel sequence (which is uae). In all the historic words, Seq is
replaced by a 'vowel wildcard', resulting in a wildcard word WWhist. The word
aenspraek is transformed to VnsprVk, where V is a vowel wildcard. WWhist is
then compared to all modern words. A modern word Wmod matches WWhist if
all vowel wildcards can be matched with vowel sequences, and all consonant
wildcards with consonant sequences. Thus, VnsprVk is matched with the modern
word aanspraak, but also with inspraak, inspreek, and aanspreek. Given these 4
matches, ae is matched with i twice, ee twice, and 4 times with aa, resulting in
3 different rewrite rules:

ae → aa
ae → ee
ae → i

Again, a threshold is used to remove unreliable matches. If seqhist has
N(WWhist) wildcard words, then seqmod is considered reliable if it is matched
to seqhist by at least N(WWhist)/10 wildcards, with a minimum threshold of 2.
Only one wildcard match is considered a 'coincidence'. This threshold is called
the pruning threshold. After each historic sequence is processed, and wildcard
matches are found, the list of possible modern sequences is pruned by removing
all rules with a score below the pruning threshold. For instance, the sequence
ae has more than 5000 wildcard words. A modern sequence is a reliable match
if it matches at least 500 wildcard words with modern words. Of course, it
is possible, for words that contain seqhist multiple times (ae occurs twice in
aenspraek), to restrict wildcard matching to words that match all the multiple
wildcards with the same vowel sequence seqmod. In that case, only aanspraak
would be a match. All the other words match ae with two different modern
sequences.
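Wildcard-word matching can be sketched with a regular expression per wildcard word (a sketch; the vowel set and the regex encoding of the vowel wildcard are assumptions, not the thesis' implementation):

```python
import itertools
import re
from collections import Counter

VOWELS = set("aeiouy")

def wildcard_pattern(word, seq):
    """Replace each full occurrence of the vowel sequence `seq` in `word`
    by a capturing vowel wildcard; all other sequences stay literal."""
    parts = ["".join(g) for _, g in itertools.groupby(word, key=lambda c: c in VOWELS)]
    vowel_wc = "([%s]+)" % "".join(sorted(VOWELS))
    return re.compile("^" + "".join(vowel_wc if p == seq else re.escape(p)
                                    for p in parts) + "$")

pat = wildcard_pattern("aenspraek", "ae")          # the wildcard word VnsprVk
modern = ["aanspraak", "inspraak", "inspreek", "aanspreek", "klagen"]
matches = [w for w in modern if pat.match(w)]
# count which modern sequences fill the wildcard slots
counts = Counter(g for w in matches for g in pat.match(w).groups())
print(matches)   # ['aanspraak', 'inspraak', 'inspreek', 'aanspreek']
print(counts)    # aa: 4, i: 2, ee: 2
```

The counts per filled-in modern sequence are exactly the rule scores that the pruning threshold is applied to.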

3.2.5 Relative N-gram Frequency


A standard, language independent method for matching terms is n-gram match-
ing. For each word all substrings of n characters are determined. One way
of determining similarity between two words is by counting the number of n-
grams that are shared by these words. For instance, the words aenspraeck and
aanspraak are split in the following substrings of length 3:
aenspraeck: #ae, aen, ens, nsp, spr, pra, rae, aec, eck, ck#
aanspraak: #aa, aan, ans, nsp, spr, pra, raa, aak, ak#
The number sign (#) shows the word boundary. Only the substrings nsp, spr
and pra are shared by both words. Character n-gramming is a popular technique
in information retrieval, where it can have a huge influence on accuracy (for a
detailed analysis on n-gram techniques, see [17]). For this research, n-gramming
is used to find typical historic n-grams. Like the previous RSF algorithm, rela-
tive frequencies are used to find letter combinations that are frequent in historic
documents, but rare in modern documents.
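The n-gram splitting with boundary markers, and the shared-n-gram count used as a similarity measure, can be sketched as:

```python
def ngrams(word, n=3, boundary="#"):
    """All character n-grams of a word, with a boundary marker on both sides."""
    padded = boundary + word + boundary
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

shared = set(ngrams("aenspraeck")) & set(ngrams("aanspraak"))
print(sorted(shared))   # ['nsp', 'pra', 'spr']
```

This reproduces the example above: of the trigrams of aenspraeck and aanspraak, only nsp, spr and pra are shared.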

3.2.6 The RNF algorithm


The third algorithm is only slightly different from the RSF algorithm, generating
rules based on N-grams instead of vowel/consonant sequences. Hence, it is
called the Relative N-gram Frequency (RNF) algorithm. It is basically the same
algorithm, but with one major difference. Where the RSF algorithm considers
only full sequences (a (full) vowel sequence is only matched with another (full)
vowel sequence), the RNF algorithm matches an n-gram with any character
sequence between n − 2 and n + 1 characters long.
This restriction is based on the fact that modern spelling is more compact than
historic spelling. To indicate that vowels should have a long pronunciation,
historic words are often spelled with double vowels (like aa, ee). In modern
spelling, this is no longer needed (except in a few cases) because of the official
spelling rules. Also, exotic combinations like ckxs were normal in historic
writing, but in modern spelling, only x or ks is allowed. Thus, it is to be
expected that a modern spelling variant of a historic sequence is often shorter
than the historic sequence itself.
Also, without this restriction, the number of possible rules would explode. Replacing
the n-gram aek in zaek with an unrestricted wildcard, resulting in the wildcard
word zW (where W is an unrestricted wildcard), would match zaek with all existing
modern words that start with the letter z. Processing hundreds of wildcard words would
require enormous amounts of memory and disk space. By restricting the length
of the wildcard, only words of length 2 to 5 are matched (this still matches
many words, but memory requirements are now within acceptable limits).
There is no restriction on vowels or consonants. An n-gram containing only
vowels can be matched by a wildcard containing only consonants. RNF is tested
with different n-gram sizes, ranging from 2 to 5.
When constructing rules from wildcard matches, the same pruning threshold
is used as for the RSF algorithm described above. Without this threshold, the
number of generated rules would still be enormous for large n (n ≥ 4). Especially
for n = 5, literally hundreds of thousands of rules are generated. To reduce
memory and disk space requirements, the pruning threshold for n = 5 is set to
5.
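The length-restricted wildcard of the RNF algorithm can be sketched with a bounded regex quantifier (a sketch; the `[a-z]` character class is a simplification that ignores diacritics):

```python
import re

def rnf_pattern(word, ngram, n):
    """Replace one occurrence of `ngram` in `word` by an unrestricted wildcard
    of length n-2 to n+1 characters (the RNF length restriction)."""
    i = word.index(ngram)
    wc = "[a-z]{%d,%d}" % (max(1, n - 2), n + 1)
    return re.compile("^" + re.escape(word[:i]) + wc + re.escape(word[i + n:]) + "$")

pat = rnf_pattern("zaek", "aek", 3)   # wildcard word zW, with W of 1 to 4 letters
candidates = ["zak", "zaak", "ziek", "zebra", "zeegroen"]
print([w for w in candidates if pat.match(w)])   # ['zak', 'zaak', 'ziek', 'zebra']
```

As the text notes, for zaek this still matches every modern word of length 2 to 5 starting with z, but the bound keeps the number of candidates manageable.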

3.3 Rewrite rule selection


3.3.1 Selection criteria
Once the methods for generating rewrite rules are working, it is easy to generate
literally thousands of rewrite rules. Of course, not all these rules work equally
well. Some rules are based on matching one particular historic word to one
particular modern word, and some rules are based on matching a historic word to
the wrong modern word. The number of matches between historic and modern
words on which a rule is based can be used as a ranking criterion. The more
historic words that can be transformed to a modern word with the same rule,
the more probable it is that the rule is correct. A rule that maps only one
historic word to a modern word might be correct, but even if it is, its influence
on an entire corpus will be minimal. A rule that maps over a hundred historic
words to modern words is probably correct. It is highly improbable that an
inappropriate rule rewrites this many historic words to modern words. The
rule lcx → ndst rewrites volcx to vondst, but very few other matches will be
found. But how many matches are needed to make a reliable judgment whether
a rule is appropriate or not? There are many different criteria that can be used.
For instance, given a typical historic character sequence Seqhist, the number of
modern sequences N(Seqmod) that lead to rewriting a historic word Whist to a
modern word Wmod should be as low as possible. N(Seqmod) is the number of
alternatives from which a modern sequence should be picked. If the same historic
sequence occurs in many different rules (i.e. there are a lot of different modern
consequents to rewrite to), the chance of picking the correct one is
small. If there is only one rule (i.e. there is only one modern consequent found
for a historic sequence), then that is inevitably the best option. Another aspect
to look at is the effect of the rule on the modernly spelled words in the historic
corpus. Comparing the words of the historic corpus with a modern word list
(Nextens comes with a fairly large list containing approximately 340.000 modern
Dutch words with phonetic transcriptions, the so called Kunlex word list), shows
which words in the historic corpus have not changed in spelling. These words
shouldn’t be affected by rewrite rules. The criterion then becomes selecting
only rules that have little to no effect on modernly spelled words. Of course,
it is also possible to retain rules that have a large effect on these words, but
restrict the application of such a rule to non-modern words (i.e. words that are
not in the modern lexicon).
Another important decision to be made is whether a historic sequence can be
rewritten to different modern spellings. As the y-problem described in section
3.6.1 indicates, not all sequences ay should be rewritten to the same modern
form. The RNF algorithm has no difficulty with these problems, since larger n-grams
take the context of ay into account. By first applying large n-grams, different
words containing ay might be affected by different RNF rules. The other two
algorithms, PSS and RSF, cannot take context into account, since a sequence
contains only vowels or only consonants. Thus, whatever selection criterion is used, only
one modern spelling will be selected for each historic sequence. The following
selection criteria will be discussed:

• Match-Maximal: Rank rules according to how many wildcard words are
matched (MM).

• Non-Modern: Remove all rules that affect modern words in the historic lexicon.
A word from the historic lexicon is modern if it is also in the Kunlex
lexicon (NM).

• Salience: For the set of competing rules with the same antecedent part,
select the consequent part with the highest score only if the difference
between the highest score and the second highest score is above a certain
threshold (S).
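The Salience criterion can be sketched as follows (a sketch; here the "difference" is taken as a ratio between the top two scores, matching the factor-of-two interpretation used in the PSS results below):

```python
def select_salient(rules, salience=2):
    """Keep, per antecedent, the best-scoring consequent, but only if its score
    is at least `salience` times the score of the runner-up (S criterion)."""
    by_ant = {}
    for ant, cons, score in rules:
        by_ant.setdefault(ant, []).append((score, cons))
    selected = []
    for ant, cands in by_ant.items():
        cands.sort(reverse=True)
        if len(cands) == 1 or cands[0][0] >= salience * cands[1][0]:
            selected.append((ant, cands[0][1]))
    return selected

# toy rule list: (antecedent, consequent, score)
rules = [("ae", "aa", 40), ("ae", "ee", 15), ("gh", "g", 30), ("cx", "ks", 8), ("cx", "k", 5)]
print(select_salient(rules))   # [('ae', 'aa'), ('gh', 'g')]
```

Here cx is dropped entirely: its best consequent (score 8) is not salient enough over the runner-up (score 5).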

3.4 Evaluation of rewrite rules


3.4.1 Test and selection set
A dozen more purely statistical selection criteria could be used, but another alternative
is to create a selection and test set by hand. To test the effectiveness of
the rewrite rules, a test set that contains historic words and the correct modern
variant can be used. The historic words in the test set are picked from a random
sample of words from a small list of 17th century books, published in the same
period as the documents used for the generation of rewrite rules (1600–1620).
Words from these books were randomly selected and added to the test set if
a correct modern spelling could be given. These modern forms were only entered
when the historic word was recognized as a variant of a modern word, or as
a morphological variant of a modern word. The historic word beestlijck would
be spelled as beestelijk in modern Dutch. However, the word beestelijk is not
an existing modern word, but a morphological variant of the word beestachtig
(beastly). The test set contains some of these words. Some words were not
recognizable at all. These were not added to the test set, since no modern
spelling could be entered.
This way of constructing a test set is fairly simple and doesn’t take a lot of
time. In just a few hours, a total set of 2000 words was made. The whole set
was then split into a selection set and a test set. The selection set was used as
a rule selection method, as a way of sanity checking. Some of the constructed
rules clearly make no sense. For example, the rule cxs → mbt might result
in rewriting some historic words to existing modern words, but since it also
changes pronunciation (and word meaning) radically, it is clear that this rule
makes no sense. To make sure that all selected rules make at least some sense,
a way of sanity checking is to select only rules that have a positive effect on the
selection set.
The effect of a rewrite rule can be shown using edit distance, by measuring the
distance D(Whist, Wmod) between the historic word and its modern variant and
the distance D(Wrewr, Wmod) between the rewritten word and the modern variant.
Here is an example to explain edit distance, using the historic word volcx and
its modern version volks:

v o l c x
0 1 2 3 4 5
v 1 0 1 2 3 4
o 2 1 0 1 2 3
l 3 2 1 0 1 2
k 4 3 2 1 2 3
s 5 4 3 2 3 4

Table 3.4: Edit distance between volcx and volks

The final edit distance between volcx and volks is 4. The first three characters
of both words are the same, resulting in an edit distance of 0. But the next two
characters differ: going from c to k takes one substitution, and another substitution is
needed going from x to s. Since a substitution costs 2, the total distance is 4.
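The edit distance used in these tables (insertion and deletion cost 1, substitution cost 2) is standard dynamic programming; a minimal sketch:

```python
def edit_distance(a, b):
    """Levenshtein-style distance with cost 1 for insertion/deletion and
    cost 2 for substitution, matching the cost model of the tables here."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 2
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + sub)
    return d[m][n]

print(edit_distance("volcx", "volks"))  # 4, as in Table 3.4
```

The matrix d corresponds row for row to the tables shown in this section.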

v o l c c
0 1 2 3 4 5
v 1 0 1 2 3 4
o 2 1 0 1 2 3
l 3 2 1 0 1 2
k 4 3 2 1 2 3
s 5 4 3 2 3 4

Table 3.5: Edit distance between volcc and volks

If the difference between D(Whist , Wmod ) and D(Wrewr , Wmod ) is zero, then
either the rule is not applicable to the historic word, or it has no effect on the
distance, in which case it is probably an inappropriate rule. Changing volcx into
volcc has no effect on the edit distance (the distance between volcx and volks
is equal to the distance between volcc and volks, see Tables 3.4 and 3.5), but
the word has changed into something that is pronounced differently, while the
historic word is pronounced the same as its modern variant volks. Most native
Dutch speakers will have little problems recognizing volcx as a spelling variant
of the adverb volks, while they would probably recognize volcc as a variant of
the noun volk.
The problem with using edit distance as a measure is that a bigger reduction
in distance does not necessarily mean that a rule is better. Take two competing
rewrite rules lcx → lcs and lcx → lk. The first rule reduces the edit distance
from 4 to 2 (see Table 3.6), while the second rule reduces it to 1 (Table 3.7).
The result of the first rule is a word that looks and sounds much like the correct
modern word. The result of the second rule is a different modern word.

v o l c s
0 1 2 3 4 5
v 1 0 1 2 3 4
o 2 1 0 1 2 3
l 3 2 1 0 1 2
k 4 3 2 1 2 3
s 5 4 3 2 3 2

Table 3.6: Edit distance between volcs and volks

A rewrite rule has a positive effect on the selection set if the average distance
between historic and modern words is reduced. The average change in distance
between the original test set and the test set after rewriting is given by:

v o l k
0 1 2 3 4
v 1 0 1 2 3
o 2 1 0 1 2
l 3 2 1 0 1
k 4 3 2 1 0
s 5 4 3 2 1

Table 3.7: Edit distance between volk and volks

    C = (1/n) Σ_{i=1}^{n} [ D(W^i_hist, W^i_mod) − D(W^i_rewr, W^i_mod) ]        (3.5)

Where D(Whist , Wmod ) is the edit distance between a historic word and its
modern variant, and D(Wrewr , Wmod ) is the edit distance between the rewritten
historic word and the same modern variant. A simple measure would be divid-
ing the average change in edit distance C by the distance D(Seqhist , Seqmod )
between the historic antecedent Seqhist of the rule and its modern consequent
Seqmod (rules that change multiple characters should reduce the average dis-
tance more than rules that change only one character):
    Score(rule_i) = C_i / D_i        (3.6)
If the resulting score is close to 1, the total amount of change by the rewrite
rule is mostly in the right direction. Looking again at the example of the rule
cx → k, the edit distance between the original historic word volcx and the
modern word volks is reduced by 3, and the edit distance between cx and k is
also 3 (cost 2 for substitution of c with k, and cost 1 for deleting x). Thus,
this rule scores 1. In other words, every change by the rule is a change in the
right direction. But this is not good enough. Rewriting cx to k reduces the edit
distance between volcx and volks, but the rule cx → ks not only reduces the
edit distance, it also rewrites the historic word to the correct modern variant.
According to (3.6), both rules would get the same score. But if a rule changes
some historic words to their correct modern forms, it must be a good rule. A
better measure should account for this. (3.7) adds the number of words changed
to their correct modern form M divided by the total number of rewritten words
R:

    Score(rule_i) = C_i / D_i + M_i / R_i        (3.7)
Now, the rule cx → ks receives a higher score because it rewrites at least
some of the words containing cx to their correct modern form. To make sure
that rules with an accidental positive effect are not selected, a threshold of 0.5
is set for the final score. In words, this means that for each step taken by the rule
(insertion or deletion takes one step, substitution takes two steps), the distance
should be reduced by at least 0.5.
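The scoring in (3.6) and (3.7) can be sketched as follows (a sketch: the rule is applied with a plain string replace, and C is averaged over the selection set as in (3.5)):

```python
def edit_distance(a, b):
    """Edit distance with insertion/deletion cost 1 and substitution cost 2."""
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            sub = 0 if a[i - 1] == b[j - 1] else 2
            d[j] = min(d[j] + 1, d[j - 1] + 1, prev + sub)
            prev = cur
    return d[n]

def score_rule(ant, cons, selection_set):
    """Score a rule ant -> cons per formula (3.7): average distance reduction C
    over the selection set, divided by the rule's own edit distance D, plus
    perfect rewrites M over total rewrites R."""
    D = edit_distance(ant, cons)
    C = R = M = 0
    for hist, mod in selection_set:
        rewr = hist.replace(ant, cons)
        if rewr != hist:
            R += 1
            if rewr == mod:
                M += 1
        C += edit_distance(hist, mod) - edit_distance(rewr, mod)
    return (C / len(selection_set)) / D + (M / R if R else 0)

pairs = [("volcx", "volks")]            # a one-pair selection set
print(score_rule("cx", "k", pairs))     # 1.0, as in the text
print(score_rule("cx", "ks", pairs))    # 2.0: the perfect rewrite adds M/R = 1
```

On this one-pair set the scores reproduce the worked example: cx → k scores exactly 1, while cx → ks gains the extra M/R term and clears the 0.5 threshold comfortably.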
The big disadvantage of selecting only rules that have a positive effect on
the selection set is that not all the typical historic word forms and letter com-
binations are in the selection set. Although the rules are based on statistics
on the whole corpus, some constructed rewrite rules that are appropriate might
not be selected because they have no effect on the selection set. On the other
hand, from a statistical viewpoint, if a specific character combination is not in a
set of 1600 randomly selected word pairs, then it is probably not a common or
typical historic combination. Another drawback is that words that couldn’t be
recognized as variants of modern words, are not in the test set, but are affected
by the selected rewrite rules. Although the performance of a rule on the test set
gives an indication of its “appropriateness” on the recognizable words, there is
no such indication for its effect on the unrecognized words.
The test set is used as a final evaluation of the selected rewrite rules. The
rewrite rules are applied to the historic words and then compared with the
correct modern forms. As mentioned above, comparison is based on the edit
distance between the words. The final score for a rule is the average distance
between the rewritten words and the correct words. To get some measure of the
effect of rewriting, the average distance between the original historic words and
the correct words is also calculated as a baseline. The difference between these
two averages should give an indication of the effect of rewriting. The baseline
score is shown in Table 3.8.

           total        average
           word pairs   distance
baseline   400          2.38

Table 3.8: The baseline average edit distance

3.5 Results
The three algorithms PSS, RSF and RNF are evaluated using the test set. To
get an idea of how well certain rule sets perform, all automatically generated
rule sets are compared with the manually constructed rule set in [2]. The results
for this set of rules on the test set are given in Table 3.9. In column 2, the total
number of rules in the rule set is given (num. rules), in column 3 the total
number of historic words that are affected by the rules is given (total rewr.).
The 4th column shows the number of historic words for which the rewriting
is optimal (perf. rewr. indicating a perfect rewrite). The last column shows
the new average distance (new dist.), between the rewritten historic words, and
the modern words. The difference between the new average distance and the
baseline average distance is shown in parentheses.

rule set   num.    total   perf.   new
           rules   rewr.   rewr.   dist.
Braun      49      248     137     1.41 (-0.97)

Table 3.9: Manually constructed rules on test set

3.5.1 PSS results


The PSS algorithm generated 510 rules, some of which contain the same historic
sequence as antecedent part. From these, only the highest scoring rules with a
unique historic sequence (i.e. only one rule per historic sequence) are selected
(MM, see the Match-Maximal criterion in section 3.3.1). The initial score of a rule is the
number of times that the modern consequent seqmod of the rule is found as a
wildcard match for the historic antecedent seqhist (see the algorithm descriptions in
section 3.2.1). Different threshold values for the rule selection algorithm were
tested on the test set, ranging from 0 to 50 (the MM-threshold). The change
in average distance shows whether the rules have a positive or negative effect.
Also, the total number of words in the test set that are affected by the rules is given,
together with the number of perfect rewrites. The number of perfect rewrites
is the number of words which are rewritten to their correct modern form. The
results are shown in Table 3.10. It is clear that the rewrite rules generated by
the PSS algorithm have a bad influence on the average edit distance between
historic words and their modern variants. However, by increasing the threshold,
only the high scoring rules are applied, rewriting 56 out of 400 words to their
correct modern form. The average distance still increases, though. The rules
selected at threshold 5 perform better than the rules selected at threshold 10.
Apparently, the rules with a score between 5 and 10 (or at least some of them)
are better than some of the higher scoring rules.
One reason for this is the phonetic change of the sequence ae from a as in
naem (name) to e as in collegae (colleagues).4 The ae sequence is very frequent in
the historic corpus, but the Nextens converter transcribes it to an e, so that naem
will be matched with neem (to take) instead of with naam (name). Coincidentally,
there are a lot of wildcard matches with ee, so the rule ae → ee gets a high
score. Another high scoring rule is the rule oo → o, because many historic words
contain a double vowel where their modern forms contain a single vowel. But this
rule generalizes too much. There are still many modern Dutch words containing
a double vowel, and their historic counterparts should not be changed, like boot,
boom, school, etc. There are many lower scoring rules that make more sense
from a phonetic perspective, but low corpus frequencies keep them at a low
score.
There are some phonetic transcriptions that are just plain wrong. For instance,
the sequence igh in veiligh (safe) is transformed by Nextens to a so
called 'schwa' character. A 'schwa' is how non-stressed vowels are pronounced,
4 Letters in boldface indicate phonemes.

like the 'e' in the word character. The problem with igh is that in many words, the
'i' is pronounced as a schwa, but the 'gh' is certainly pronounced. After conversion,
the word veiligh is matched to the modern Dutch word veilen, because
the final 'n' in infinitives is often not pronounced. A chain is as strong as its
weakest link: if the phonetic transcriptions are not 100% correct, the generated
rules can't be either.
As an extra, second selection criterion, only rules were selected that had
no effect on those words of the historic lexicon that also occur in the modern
Kunlex lexicon. Thus, only non-modern (NM, see the Non-Modern selection
criterion in section 3.3.1) historic sequences are considered. The results for
the salience criterion are given for a salience threshold of 2 (S2 in the table).
This means that the highest scoring rule R1 for a historic sequence Seqhist is
selected if R1 matches at least twice as many wildcards as the second best rule
R2. Several different threshold values were tested. The threshold value 2
consistently shows the best results.

Sel.       MM       num.    total   perf.   new
crit.      Thresh.  rules   rewr.   rewr.   dist.
MM only    0        404     394     9       4.6  (+2.22)
MM only    5        109     373     25      3.39 (+1.01)
MM only    10       64      365     18      3.76 (+1.38)
MM only    20       34      320     34      3.14 (+0.76)
MM only    30       25      272     61      2.44 (+0.06)
MM only    40       22      269     59      2.46 (+0.08)
MM only    50       18      248     56      2.48 (+0.10)
MM + NM    0        251     232     39      2.87 (+0.49)
MM + NM    5        43      192     28      2.86 (+0.48)
MM + NM    10       20      185     24      2.88 (+0.50)
MM + NM    20       10      179     18      2.9  (+0.52)
MM + NM    30       6       147     12      2.71 (+0.33)
MM + NM    40       6       147     12      2.71 (+0.33)
MM + NM    50       5       112     12      2.71 (+0.33)
MM + S2    0        383     376     15      3.88 (+1.81)
MM + S2    5        99      331     28      2.75 (+0.65)
MM + S2    10       56      322     21      3.12 (+1.01)
MM + S2    20       29      247     40      2.58 (+0.51)
MM + S2    30       22      195     56      2.15 (-0.23)
MM + S2    40       19      190     54      2.17 (-0.21)
MM + S2    50       15      151     51      2.2  (-0.18)
sel. set   N.A.     104     253     101     1.66 (-0.72)

Table 3.10: Results of PSS on test set

What is interesting is that once the NM selection criterion is applied, the



number of rules that are applied has little effect on the average edit distance
between rewritten words and the correct modern words, but is still in balance
with the total number of affected words (more rules rewrite more words). The
highest scoring rules affect the most words (5 rules rewrite 112 words). For
lower thresholds, NM does have a positive effect, reducing the average distance
by almost 38%. But this is probably because it just reduces the number of
rules. Since most lowly ranked rules increase the average distance, reducing the
number of lowly ranked rules will reduce the negative influence. However, the
number of perfect rewrites is heavily affected by NM. Before applying NM, a
higher threshold results in many more perfect rewrites, and average distance
drops to nearly the original distance (which is 2.38). After applying NM, an
MM-threshold of 50 results in an increase in distance, with far fewer perfect
rewrites (when compared to an MM-threshold of 50 before applying NM). In
other words, the rules that were thrown out by NM were better than the
rules that NM keeps in the set. Dropping the threshold to 20 introduces some
more bad rules (only 5 rules are added, and the average distance goes up again).
Decreasing the threshold even more shows that some of the rules with a score
below 20 are better than some of the rules with a score above 20.
The results for the Salience (S) selection criterion look much more like the
Match-Maximal results. At each threshold level the number of rules is only
slightly smaller than without selecting on salience. For the average distance,
salience works much better: rules with a score above 30 decrease the average
distance. Some of the rules that are removed by the salience criterion actually
produce perfect rewrites. For thresholds 30, 40 and 50, the number of rules
decreases by 3, and the total number of perfect rewrites decreases by 5. Thus,
the 3 rules scoring between 30 and 40 that are removed by salience have a bad effect on
the average distance, but do have a positive effect on some words. This shows
that for the historic antecedents in these 3 rules, multiple modern consequents
are required, or the context of the historic sequence (the characters preceding
and following the sequence) should be taken into account.
The best results by far are produced by using the selection set. As described
in section 3.4.1, the selection set contains 1600 word pairs, which are used to filter
out rules that have a negative effect on the selection set. The MM-score of a rule
is ignored in this selection criterion, and replaced by a score based on how
well the rule performs on the selection set. As the selection set is constructed in the
same way as the test set (in fact, only one set of 2000 words was constructed,
which was split afterwards into the selection set and the test set), it should come
as no surprise that this produces better results. About 63% of all the words
in the test set are rewritten, and about 25% of all test-set words end up in their
correct modern forms.
The PSS algorithm clearly suffers from wrong phonetic transcriptions. The
change of pronunciation of some character sequences over time (most notably the
sequence ae, which occurs very often in the historic corpus) is ignored
by the Nextens conversion tool. These problems occur throughout the rule set,
from highly frequent to rare sequences. Therefore, raising the MM-threshold
will only reduce the total number of rules, effectively reducing the number of
rules which have a negative effect on the test set, but also reducing the number
of rules that have a positive effect. The use of the selection set seems the only
way to separate the good rules from the bad ones.

3.5.2 RSF results


The RSF algorithm generates many more rules than the PSS algorithm, but the
number of historic sequences for which it finds rewrite rules is smaller. This is
because it finds many different rewrite rules for the same historic sequence. After
selecting the highest scoring rule for each unique historic sequence, only 209 rules
are left, compared to 293 rules for the PSS algorithm. This is probably because
the RSF algorithm only considers typical historic character combinations, whereas
the PSS algorithm considers all sequences in the historic index that can be
matched with a modern variant. The PSS algorithm generates the rule cl → kl
because the historic word clacht is pronounced the same as the modern word
klacht. But the RSF algorithm never considers cl a typical historic
sequence, and thus won't generate a rewrite rule for it. The results on the test
set are shown in table 3.11. The rules generated by RSF perform much better
than the ones generated by PSS: the average distance between the historic
words and their correct modern variants decreases, and the number of perfect
rewrites is much higher. Most rules have a very low score. Setting the threshold
to 5 removes 70% of all the rules, while staying close to the
performance of the full set of 209 rules. Apparently, the positive effect of the
RSF rule set comes mainly from the rules with a score above 5. Further increasing
the threshold shows a further decrease in performance, and this time the
differences become significant. Between 10 and 20, almost half of the rules
are removed, and the number of perfect rewrites decreases further. But it should
be clear that the most effective rules are the ones with the highest scores: only
10 rules have a score above 50, but they account for the bulk of the perfect
rewrites and of the decrease in distance. For all thresholds, the ratio between
total rewrites and perfect rewrites is roughly the same (of every 2 affected words,
one rewrite is perfect).
Clearly, the NM selection criterion has a negative effect on the RSF-generated
rules. It throws out some rules (about 20-25%) that have a positive effect
on the test set. By throwing them out, the number of perfect rewrites drops, and
the average distance increases. The effect of NM for PSS was questionable; for
RSF, it is just plain bad. A simple explanation for the performance of NM is
that the RSF algorithm already selects historic sequences based on their relative
frequency. Even if a historic sequence occurs in the modern corpus, the fact
that it was selected by RSF means that it is at least 10 times more frequent in
the historic corpus. The use of relative frequencies makes NM redundant.
The salience criterion also removes many good rules. At an MM-threshold
of 50, 6 out of 10 rules are removed (60%), reducing the number of perfect
rewrites by 78 (76%). In other words, probably the best rules in the entire set
are removed by selecting on salience. By dropping the salience threshold, the
performance goes up again. Another short test revealed that by reducing the
salience threshold by 0.1 at a time, the performance slowly changes towards the
original performance. But only by setting the threshold to 1 (no salience) does
the performance equal that of the original MM rule set. Thus, for RSF, salient
rules are, it seems, no better than other high-ranking but non-salient rules.
Again, the selection set is the best selection criterion. Its performance is
better than the MM-threshold: the number of perfect rewrites is higher, while
the total number of rewrites is lower, and the average distance is reduced by
more than 1 (for edit distance, this amounts to one step, an insert or delete,
closer to the modern word).

Sel. crit.   MM Thresh.   num. rules   total rewr.   perf. rewr.   new dist.
MM only      0            209          261           133           1.41 (-0.97)
MM only      5            62           249           130           1.42 (-0.96)
MM only      10           39           243           127           1.44 (-0.94)
MM only      20           21           231           117           1.48 (-0.90)
MM only      30           13           212           109           1.54 (-0.84)
MM only      40           12           212           109           1.54 (-0.84)
MM only      50           10           206           103           1.56 (-0.82)
MM + NM      0            190          207           100           1.58 (-0.80)
MM + NM      5            51           195           97            1.59 (-0.79)
MM + NM      10           30           188           94            1.61 (-0.77)
MM + NM      20           16           178           85            1.64 (-0.74)
MM + NM      30           10           162           79            1.71 (-0.67)
MM + NM      40           9            162           79            1.71 (-0.67)
MM + NM      50           7            156           73            1.73 (-0.65)
MM + S 2     0            48           83            39            2.17 (-0.21)
MM + S 2     5            21           78            37            2.19 (-0.19)
MM + S 2     10           16           78            37            2.19 (-0.19)
MM + S 2     20           9            74            35            2.20 (-0.18)
MM + S 2     30           6            61            30            2.26 (-0.12)
MM + S 2     40           6            61            30            2.26 (-0.12)
MM + S 2     50           4            54            25            2.27 (-0.11)
sel. set     N.A.         76           252           140           1.33 (-1.05)

Table 3.11: Results of RSF on test set

3.5.3 RNF results


Like the RSF algorithm, RNF also generates many rules. But since the sequences
are not restricted to either consonants or vowels, many more historic
sequences and possible modern sequences are considered. The n-gram size
becomes important. For n-grams of size 2, only 27 * 27 = 729 historic sequences
are possible (26 alphabetic characters plus the word boundary character, which
can occur only at the edges of an n-gram). For 4-grams, already 27 * 26 * 26 * 27
= 492,804 historic sequences are possible. Of course, most of these sequences
will not occur in the historic corpus (take 'qqqq' or 'xjhs' for example). So,
before generating the rules, we can predict that there will be far more rules
for n-grams of size 4 than for n-grams of size 2. See table 3.12 for the results of
all n-gram lengths. The results for NM are not listed, since they show the same
bad effect as for the RSF rules, and would only make table 3.12 less readable.
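One way to arrive at these counts, assuming the word-boundary character may appear only at the first and last position of an n-gram, is the following small calculation:

```python
def possible_ngrams(n, alphabet=26):
    """Count possible n-grams over `alphabet` letters plus a boundary marker,
    where the boundary marker may occupy only the two edge positions:
    (alphabet + 1) choices at each edge, `alphabet` choices inside."""
    if n == 1:
        return alphabet + 1
    return (alphabet + 1) ** 2 * alphabet ** (n - 2)

# 2-grams: 27 * 27 = 729; 4-grams: 27 * 26 * 26 * 27 = 492,804
```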
As for salience, it shows mixed results. For 2-grams, the best salience threshold
is 1.5, performing far worse than the original rule set. For 3-grams and 4-grams,
the best value is around 1.25, showing some improvement in average distance
for lower MM-threshold values (up to 20) but a drop in the number of perfect
rewrites.
The results for n-grams of length 2 show that only the 8 highest MM-scoring
rules, those with a score above 50, have a big influence on the test set. These
rules are very good, rewriting 67% of all the words, and of these rewrites, 56%
are perfect. This is due, for the largest part, to the rule ae → aa: many
of the historic words in the test set contain the sequence ae, and most of their
corresponding modern variants have aa as the modern spelling variant. As
the results at lower MM-thresholds show, the low scoring rules have almost no
influence on the test set.
Another noticeable result is that for all the other n-gram lengths, the rules show
the greatest reduction in average distance at an MM-threshold of 20. Also,
for n ≥ 3, increasing the MM-threshold results in fewer perfect rewrites. As for
the total rewrite / perfect rewrite ratio, the best n-gram lengths are 2 and 3.
Like the PSS and RSF algorithms, RNF benefits greatly from the use
of the selection set. All n-gram lengths show an improvement over the MM-
threshold selection. The number of selected rules is smaller than for low MM-
thresholds (which show the highest number of perfect rewrites of the different
MM-thresholds), as is the total number of rewrites. But the number of
perfect rewrites increases (this is most noticeable for n ≥ 4). Now, the 4-gram
rules show the best results: 62% of all rewrites are perfect, and the average
distance is reduced by almost 50%.

3.6 Conclusions
The most significant conclusion is that phonetic transcriptions are not nearly
as useful as expected. As mentioned earlier, there are two reasons for this.
First, the transcriptions are not always correct. Some letter combinations that
no longer occur in modern Dutch words are treated as English or French character
sequences, although from the surrounding characters it should be clear that the
word under consideration is certainly not English or French. The grapheme-to-
phoneme converter of Nextens is very accurate compared to other conversion
tools, but for this particular task it is simply not good enough. In defense
of Nextens, it should be mentioned that it wasn't designed for this task: it was
designed with the pronunciation of modern Dutch in mind, and that it does very well.
The other main reason is that, although the overlap between 17th century
Dutch and contemporary Dutch is mostly in pronunciation, the pronunciation of
some high-frequency vowel and consonant sequences (highly frequent in historic
Dutch, that is) certainly has changed. The correct transformation of these
sequences is absolutely essential if good performance is to be achieved. Of course,
this problem could be solved by adjusting the conversion tool, tweaking the
rules through which certain character combinations are mapped to phonemes,
but that would require expert knowledge. The main aim of this research was to
find out if the spelling bottleneck can be solved without any expert knowledge.
Clearly, we need more than phonetic transcriptions alone.

N-gram size   MM Threshold   num. rules   total rewr.   perf. rewr.   new dist.
2             0              15           271           150           1.29 (-1.09)
2             5              14           271           150           1.29 (-1.09)
2             10             11           269           150           1.30 (-1.08)
2             20             10           268           150           1.30 (-1.08)
2             30             9            267           150           1.30 (-1.08)
2             40             8            267           150           1.30 (-1.08)
2             50             8            267           150           1.30 (-1.08)
2             sel. set N.A.  12           271           152           1.29 (-1.09)
3             0              163          277           148           1.38 (-1.00)
3             5              163          277           148           1.38 (-1.00)
3             10             124          270           148           1.33 (-1.05)
3             20             81           260           143           1.33 (-1.05)
3             30             50           239           131           1.41 (-0.97)
3             40             39           229           123           1.46 (-0.92)
3             50             27           196           95            1.78 (-0.60)
3             sel. set N.A.  127          274           162           1.19 (-1.19)
4             0              458          284           115           1.89 (-0.49)
4             5              458          284           115           1.89 (-0.49)
4             10             321          268           110           1.87 (-0.51)
4             20             118          211           92            1.87 (-0.51)
4             30             57           163           64            1.93 (-0.45)
4             40             39           138           50            2.13 (-0.25)
4             50             29           114           37            2.18 (-0.20)
4             sel. set N.A.  276          269           166           1.20 (-1.18)
5             0              726          205           57            2.69 (+0.31)
5             5              726          205           57            2.69 (+0.31)
5             10             318          157           52            2.42 (+0.04)
5             20             78           80            28            2.25 (-0.13)
5             30             20           34            9             2.33 (-0.05)
5             40             10           23            6             2.36 (-0.02)
5             50             7            20            6             2.36 (-0.02)
5             sel. set N.A.  276          153           97            1.79 (-0.59)

Table 3.12: Results of RNF on test set
The other two methods work much better. Both methods take only typically
historic sequences into account, and do not suffer from changes in pronunciation.
Corpus statistics are enough to generate and select well-performing rewrite rules.
On the down side, the RSF and RNF algorithms only consider typically
historic sequences. Many words with historic spelling contain sequences that are
quite frequent in modern Dutch as well, like cl in clacht (modern Dutch: klacht).
The PSS algorithm will generate rules for these sequences, but this is probably
where the usefulness of rewrite rules turns into senseless spelling reformation.
After transforming typically historic letter combinations into modern ones, rule
generation should probably be replaced by word matching, either n-gram based
(see section 4.2) or phonetic (see section 5.4).
It seems that the selection criteria NM and S have some positive effect
on the PSS and RNF rule sets only for some MM-thresholds. There is no single
value for the salience threshold that works properly for all methods. The only
criterion that works well for all 3 methods is the selection set: it consistently
shows the best results of all the different selection methods.

3.6.1 Problems
There are, of course, some specific problems with using rewrite rules based on
statistics. Since spelling was based on pronunciation, and people pronounced
certain characters in different ways, some historic words are ambiguous. Just
as certain modern words can have different meanings determined by context,
the spelling of some historic words can be rewritten to different modern words,
depending on context. The character combination ue in the historic word muer
can be rewritten to modern spelling as oe, as in moer (English: nut), or as uu,
as in muur (wall).

3.6.2 The y-problem


Scanning the list of unique words of all corpora quickly revealed a major problem.
Many of the historic terms contain the letter y, in many different combinations
with vowels: it occurs before or after any other vowel, or just by itself. In
all these cases, its modern spelling is different. Table 3.13 shows the possible
combinations and their modern spellings:
vowel   modern spelling   old / modern
ay      aa                withayrigh / witharig
ay      a                 gepayste / gepaste
ay      aai               sway / zwaai
ay      ai                zwaay / zwaai
ay      ei                treckplayster / trekpleister
ay      ij                vriendelayck / vriendelijk
ey      ei                kley / klei
ey      ij                vrey / vrij
ey      ee                algemeyn / algemeen
oy      oy                employeren / employeren
oy      ooi               flickfloyen / flikflooien
oey     oe                armoey / armoe
oey     oei               bloeyde / bloeide
uy      e                 huydendaachse / hedendaagse
uy      ui                suycker / suiker
uy      uu                huyrders / huurders
uy      u                 gheduyrende / gedurende
ya      ia                coryandere / koriander
ya      ija               vyandt / vijand
ya      iea               olyachtich / olieachtig
ye      ie                poezye / poezie
ye      ij                toverye / toverij
ye      ije               vrye / vrije
yu      io                ghetryumpheert / getriomfeerd
yu      ijv               yurich / ijv'rig

Table 3.13: Different modern spellings for y

The next chapter will describe other ways of evaluating the rewrite rules.
The influence of the rewrite rules on document collections from other periods will
be measured, as well as the effect of rewriting on retrieving historic word forms
for modern words from a historic corpus. As an extra evaluation, a document
retrieval experiment is described, in which the rewrite rules are integrated into the
IR system. Furthermore, a few simple extensions to the three methods PSS, RSF
and RNF, such as combinations and iterations, are tested. This should
provide a better indication of the performance of the rule-generation methods.
Chapter 4

Further evaluation

As we saw in the previous chapter, the RSF and RNF algorithms outperform
the phonetically based PSS algorithm. Here, extensions to these methods are
considered, as well as some other evaluation methods and test sets generated
from documents from different periods. This chapter is divided into the following
sections:

• Iteration and combination: The three methods described in the previous
chapter are combined and used iteratively.

• Reducing double vowels: The problem of vowel doubling is investigated.

• Word-form retrieval: A method to retrieve historic word forms for
modern words.

• Historic Document Retrieval: An external evaluation method to evaluate
the rewrite rule sets.

• Documents from different periods: Evaluation of the rewrite rules on
older and newer documents.

4.1 Iteration and combining of approaches


As stated in section 3.2.1, iteration over phonetic transcriptions has no effect.
For the RNF and RSF methods, iteration can have an effect. After the first
iteration, the historic words that are changed by the rewrite rules have become
more similar to their modern variants, so a next iteration might result in more
modern words that match a wildcard word. Consider again the example of
the words aanspraak and aenspraeck. The consonant sequence ck is typical for
historic documents; it rarely occurs in modern documents. But the wildcard
word aenspraeC (C is a consonant wildcard) is not matched by any modern
word. No modern spelling is found for ck. But through other words, the modern
variant aa might have been found for the historic sequence ae. After applying
this rule to the historic corpus, aenspraeck becomes aanspraack. In the next
iteration, the wildcard word aanspraaC will be matched with the modern word
aanspraak, resulting in the rewrite rule ck → k.
However, by combining the different methods, iterating the phonetic
transcription method suddenly does have an effect. After applying the ae → aa
rule found by other methods, a phonetic transcription is made for aanspraack
instead of for aenspraeck. And the pronunciation of aanspraack is equal to
the pronunciation of the modern word aanspraak, while the pronunciation of
aenspraeck isn't (at least, according to Nextens).

Method   iteration   new rules   total rewr.   perf. rewr.   old dist.   new dist.
RSF      1           209         257           110           2.38        1.42
RSF      2           4           3             0             1.42        1.42
RSF      3           0           0             0             1.42        1.42
RNF-2    1           12          271           152           2.38        1.29
RNF-2    2           1           1             0             1.29        1.28
RNF-2    3           0           0             0             1.28        1.28
RNF-3    1           127         274           162           2.38        1.19
RNF-3    2           12          6             3             1.19        1.17
RNF-3    3           0           0             0             1.17        1.17
RNF-4    1           276         269           166           2.38        1.20
RNF-4    2           52          26            19            1.20        1.10
RNF-4    3           0           0             0             1.10        1.10
RNF-5    1           276         153           97            2.38        1.79
RNF-5    2           60          38            19            1.79        1.62
RNF-5    3           14          4             3             1.62        1.60

Table 4.1: Results of iterating RSF and RNF

4.1.1 Iterating generation methods


After applying rewrite rules, the historic words are closer to their modern
counterparts. There might be some historic sequence Seq_hist for which no wildcard
matches could be found because it only occurs in words with several typically
historic sequences. After the first iteration, some of these other historic
sequences may have been changed to a modern spelling. In a second iteration,
Seq_hist can be matched with a modern sequence.
In table 4.1, the results of iterating over the rule generation methods are shown.
The average distances before and after applying the rules generated at each
iteration are shown in columns 6 and 7. After the second iteration, only the
rule set of RNF with n = 5 improves by further iteration. A simple explanation
is that most rules generated for historic antecedents of length 5 affect only a few
words (the first 276 rules affect only 153 words in the test set). There are many
more typically historic sequences of length 5 than there are of length 4. The
problem with evaluating the rules for n-grams of length 5 is that these sequences
are so specific that many of them do not occur in the test set at all. In each
iteration, there are many more rules than there are affected words. All these
rules have an antecedent part that does occur in the selection set, hence
the selection of the rule. But the selection set is much bigger than the test set,
and thus contains many more sequences. Looking purely at the scores, it is
easy to conclude that 4-grams work better for RNF than 5-grams, but a look at
the rules themselves gives another impression. Consider the historic sequence
verci. The RNF algorithm finds the rule verci → versi for it, which changes
a historic word like vercieren to versieren (adorn, decorate). The 4-gram verc
should become verk in words like vercopen and overcomen, but it should become
vers in vercieren. Because 5-grams are more specific, 5-gram rules probably
make fewer mistakes. On the other hand, longer n-grams are more and more like
whole words. Instead of generating rewrite rules, the RNF algorithm would be
generating historic-to-modern word matches: it would consider every word of
approximately the same length as a possible modernization, leaving all the work
to the selection process.

4.1.2 Combining methods


The combining of methods is done by generating, selecting and applying the
rules of one method before generating, selecting and applying the rules of another
method. The rules of PSS contain not only typically historic antecedents, but
also some non-typical antecedents. Thus, the PSS rules will affect different
words, and affect words differently, than the other rule sets. Also, by first applying
the rules generated by RNF or RSF, the historic antecedents that have a wrong
phonetic transcription (like the frequent sequence ae) may be rewritten before
the PSS rules are generated. This reduces the number of wrong phonetic
transcriptions. Therefore, the generation methods are applied one after another, in
all different permutations, to see the effect of ordering the generation methods.
To be able to make a fair comparison of the different orderings, the rules of each
rule set were selected with the selection set, because it performs very well for
all three generation methods. The rule set for RNF is the combination of the rule
sets for n-gram lengths 2, 3, 4 and 5. The rules were applied in order of length,
with the longest antecedents first, because they are more specific than shorter
antecedents: rewriting ae to aa before rewriting aeke to ake cancels the effect
of the latter rule.
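The longest-antecedents-first application described above can be sketched in a few lines (the rule format and the word staeke are illustrative assumptions; ae → aa and aeke → ake are the rules from the text):

```python
def apply_longest_first(word, rules):
    """Apply rewrite rules ordered by antecedent length, longest first, so
    that e.g. 'aeke' -> 'ake' fires before 'ae' -> 'aa' can destroy its
    left-hand side."""
    for old, new in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        word = word.replace(old, new)
    return word
```

Applied in the other order, ae → aa would turn an illustrative word like staeke into staake and the aeke rule would never fire; longest-first yields stake.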
By combining all the rules of n-gram lengths 2, 3, 4 and 5, the improvement
is huge. Even before combining RNF with the other algorithms, over 50%
of all the words in the test set are modernized correctly.
Combining PSS and RSF increases performance significantly, although the
order is not important. But when combining RSF with RNF, the ordering does
matter. Applying RSF first, the results are worse than applying RNF alone;
combining them is no improvement. Compared to the RNF rules, the only
combinations that improve on them are the combinations with PSS. Apparently, the
PSS rules are somewhat complementary to the RNF and RSF rules, as was expected.
The RNF and RSF algorithms work in a similar way (relative frequency of a
sequence); the PSS algorithm is fundamentally different: its rules are based
on phonetics.

Method order          num. rules   total rewr.   perf. rewr.   new dist.
PSS                   104          253           101           1.66 (-0.72)
PSS + RNF             769          347           211           0.91 (-1.47)
PSS + RSF             136          326           166           1.13 (-1.25)
PSS + RNF + RSF       774          349           211           0.90 (-1.48)
PSS + RSF + RNF       389          348           206           0.91 (-1.47)
RNF-2                 12           271           152           1.29 (-1.09)
RNF-3                 127          274           162           1.19 (-1.19)
RNF-4                 276          269           166           1.20 (-1.18)
RNF-5                 276          153           97            1.79 (-0.59)
RNF-all               691          315           207           0.97 (-1.41)
RNF + PSS             746          335           224           0.87 (-1.51)
RNF + RSF             702          319           208           0.95 (-1.43)
RNF + PSS + RSF       752          337           224           0.87 (-1.51)
RNF + RSF + PSS       753          337           224           0.86 (-1.52)
RSF                   62           252           140           1.33 (-1.05)
RSF + RNF             328          295           183           1.05 (-1.33)
RSF + PSS             134          324           167           1.12 (-1.26)
RSF + RNF + PSS       381          342           193           0.96 (-1.42)
RSF + PSS + RNF       397          346           211           0.88 (-1.50)

Table 4.2: Results of combined methods on test set

It is interesting to see that the total number of unique words in the corpus
is greatly reduced by rewriting words to modern form. The original hist1600
corpus contains 47,816 unique words (see table 3.3) if the lexicon is case sensitive
(upper case letters are distinct from lower case letters). If case is ignored,
44,041 unique words are left. After applying the rules of the combined methods
PSS, RNF and RSF, the total number of unique words is reduced to 38,525, a
12.5% decrease. By rewriting, many spelling variants are conflated to the same
(standard) form. As table 4.3 shows, the RNF rules have the most significant
effect on conflation. Looking at the number of rules that each method generates,
this is hardly surprising.

Applied rule set   Lexicon size
None               44041
RSF                41956
PSS                41557
RNF                39368
RNF+RSF+PSS        38525

Table 4.3: Lexicon size after applying sets of rewrite rules
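The conflation effect can be measured with a few lines; case-folding with lower() and the (historic, modern) rule format are assumptions of this sketch:

```python
def lexicon_size(words, rules=()):
    """Number of unique, case-folded word forms after applying rewrite rules;
    spelling variants rewritten to the same form count once."""
    def rewrite(word):
        for old, new in rules:
            word = word.replace(old, new)
        return word
    return len({rewrite(w.lower()) for w in words})
```

On a toy list such as ["Aenspraeck", "aanspraak", "clacht", "klacht"], the rules ae → aa, ck → k and cl → kl conflate the four forms down to two.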

4.1.3 Reducing double vowels


A common spelling phenomenon in 17th century Dutch is the use of double
vowels to indicate vowel lengthening. In English, vowel lengthening is clear
in the word good when compared to poodle: in the former word, the oo is
pronounced somewhat longer than in the latter. The same effect occurs in
Dutch. In bomen (English: trees), the o is long; in bom (bomb), the o is short.
But boom, the singular form of bomen, is also pronounced with a long o. The
double vowel oo is needed in this case to disambiguate it from bom. In modern
Dutch spelling, this doubling of vowels is used only for syllables with a coda.¹ Using
the modern spelling rules for vowel doubling, redundant double vowels in historic
words can be removed. A simple algorithm to do this, the Reduce Double Vowels
algorithm (RDV), reduces a double vowel to a single vowel if it is followed by
a single consonant and one or more vowels, or if the double vowel is at the
end of the word (no coda). Thus, eedelsteenen becomes edelstenen, and gaa is
reduced to ga, but beukeboom is not changed to beukebom.

¹ There are some exceptions to this rule. The Dutch word for sea is zee. Without the double
vowel, the e would be pronounced short, becoming ze (English: she).

This algorithm does make mistakes. The modern word zeegevecht (sea battle) is changed to the
non-Dutch word zegevecht. The error is not in pronunciation, which is the same
for both words, but in spelling. The double vowel ii (almost) never occurs in
Dutch, so all occurrences of ii in historic words can safely be reduced. For
word-final vowels, the e vowel is an exception: if a word ends in a single vowel e, it
is pronounced as a schwa (like the e in 'vowel'). For words ending in a long e
vowel, the double vowel ee is required (thee, zee, twee, vee). Thus, the algorithm
should ignore word-final ee vowels.
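A minimal sketch of the RDV heuristic as described above (the thesis implementation may differ in details; treating y as a vowel is an assumption):

```python
VOWELS = set("aeiouy")

def reduce_double_vowels(word):
    """Reduce a double vowel to a single one when it is followed by a single
    consonant plus a vowel (open syllable) or ends the word, keeping
    word-final 'ee' (zee, thee) and always reducing 'ii'."""
    out = []
    i = 0
    while i < len(word):
        c = word[i]
        double = c in VOWELS and i + 1 < len(word) and word[i + 1] == c
        if double:
            rest = word[i + 2:]
            open_syllable = (len(rest) >= 2
                             and rest[0] not in VOWELS and rest[1] in VOWELS)
            word_final = rest == "" and c != "e"   # word-final 'ee' stays long
            if c == "i" or word_final or open_syllable:
                out.append(c)                      # keep a single vowel
                i += 2
                continue
        out.append(c)
        i += 1
    return "".join(out)
```

This reproduces the examples from the text: eedelsteenen becomes edelstenen, gaa becomes ga, beukeboom and zee are untouched, and it also makes the documented mistake of turning zeegevecht into zegevecht.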
To test its effectiveness, the algorithm was applied to the full historic word list,
containing 47,816 unique words. Of these, 1498 words contain redundant vowels
according to the RDV algorithm. The total number of words containing redundant
vowels might be larger, since the algorithm is so simple it is bound to miss some of
these words. But what of the words it did affect? The list of reduced words
was checked manually. It turns out that of all 1498 words, 134 reductions
were incorrect (almost 9%). A closer analysis of the incorrect reductions shows
that, by far, most mistakes are made with the double ee vowel in non-final,
open (no coda) syllables in words like veedieven (English: cattle thieves),
tweedeeligh (English: consisting of two parts) and zeemonster (sea monster). In
each of these 3 examples, the first syllable has its vowel reduced. But in all
three examples, the first syllable is a Dutch word by itself; in fact, such words
were the very reason why the algorithm ignores word-final ee vowels. It seems
that the frequent use of compound words in Dutch has a significant effect on
the (too) simple RDV algorithm. A modification might be compound splitting
when encountering a word containing ee: if the first part of the word, up to and
including ee, is an existing word by itself (i.e., it is in the lexicon), don't reduce
the vowel. Other frequent mistakes have to do with the adding of a suffix. A
typical Dutch suffix is -achtig, as in twijfelachtig (doubtful; twijfel means doubt).
But the word geelachtig (yellowish; geel means yellow) is incorrectly reduced to
gelachtig (gel-like). These errors can be reduced by suffix stripping (stemming).
Furthermore, the algorithm was tested on the test set (see table 4.4). It affects
only 29 of the 400 historic words in the test set, but of these, 20 are rewritten
to the correct form. Even when used after applying rewrite rules, RDV still has
a significant effect on the test set. The best order of combining the 3 rule
generation methods (RNF, RSF and PSS) affects 337 words, rewriting 224 of them
to the correct modern form. After the RDV algorithm is applied, 5 more words are
rewritten, with 235 perfect rewrites (more than 59% of all the words in the test
set!). Many of the double vowels are removed by the 4-gram and 5-gram RNF
rules (like eelen → elen), but it is mainly because ae is rewritten to aa, resulting
in more double vowels, that double vowel reduction still has a significant effect.
Applying the RNF+RSF+PSS rule set and the RDV algorithm to the example
text from the 'Antwerpse Compilatae' (see chapter 2) gives the following
result:

9. item, oft den schipper verzijmelijk ware de goeden ende koopman-schappen
int laden oft ontladen vast genoeh te maken, ende dat.die daardore vijtten
takel oft bevangtouw schoten, ende int water oft ter aarden vielen, ende also
bedorven worden oft te niette gingen, alsulke schade oft verlies moet den
schipper ook alleen dragen ende den koopman goet doen, als vore.

10. item, als den schipper de goeden so kwalijk stouwd oft laijd dat d'ene
door d'andere bedorven worden, gelijk soude mogen gebeuren als hij onder
geladen heeft rozijnen, allijn, rijs, sout, gran ende andere diergelijke goeden,
ende dat hij daar boven op laijd wijnen, olien oft olijven, die vijtlopen ende
d'andere bederven, die schade moet den schipper ook alleen dragen ende den
koopman goet doen, als boven.

Method order            num. rules   total rewr.   perf. rewr.   new dist.
RDV only                N.A.         29            20            2.31 (-0.09)
RNF + RSF + PSS         753          337           224           0.86 (-1.52)
RNF + RSF + PSS + RDV   753          342           235           0.83 (-1.55)

Table 4.4: Results of RDV on test set

The words verzijmelijk, vijten, genoeh and stouwd are incorrect rewrites of
the words versuijmelijck, uit een, genoeg and stouwt. But takel, maken, schade,
koopman, kwalijk, rozijnen and dragen are correctly transformed from taeckel,
maecken, schade, coopman, qualijck and rosijnen. Although it is far from per-
fect, many words are modernized. Even the word verzijmelijk is orthographically
much closer to its correct modern form verzuimelijk, although its pronunciation
is no longer the same.

4.2 Word-form retrieval


Since the edit distance measure on a manually constructed test set is not the
only (and probably not the best) way of evaluating the performance of rewrite
rules, another evaluation method is used here. In [18], historic spelling variants
of modern words are retrieved using character-based n-gram matching (see
Table 3.1). To evaluate the rewrite rules generated by RNF (the best method by
far), a similar experiment was done on the full test set of 2000 word pairs. Each
modern word in this set is used as a query word. The full list of historic words
from the hist1600 corpus, plus the historic word forms of the test set,² was used
as the total word collection. Since the full number of spelling variants for each
modern word is not known, the word pairs from the test set were used to
perform a known-item retrieval experiment: we are looking for a specific spelling
variant, namely, the one from the test set. Table 4.5 shows recall at several
different levels. The experiment was repeated after rewriting the historic word
list using the 4-gram RNF rules after 2 iterations (see 4.1.1), and using the best
combination rule set (RNF, RSF, PSS).

² This was done to make sure that the appropriate historic word form is in the word list
from which the words are retrieved.

N-gram size   Rule set   recall @20   @10     @5      @1
2             none       85.30        78.30   69.05   32.20
2             4-gram     90.55        86.00   79.25   46.65
2             comb.      92.50        88.90   83.20   48.80
3             none       79.80        70.60   59.95   27.65
3             4-gram     88.35        83.90   76.85   45.65
3             comb.      90.80        86.70   81.50   49.25
4             none       65.75        56.15   45.75   20.40
4             4-gram     83.30        78.50   73.20   43.90
4             comb.      86.50        81.40   76.65   47.25
5             none       52.00        45.50   37.30   16.50
5             4-gram     77.15        73.15   68.30   41.55
5             comb.      80.30        76.65   72.55   45.50

Table 4.5: Results of historic word-form retrieval
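A character n-gram matcher in the spirit of this experiment might look as follows; Dice overlap on boundary-padded bigrams is an assumption of this sketch, not necessarily the exact scoring of [18]:

```python
def char_ngrams(word, n):
    """Set of character n-grams, with '#' marking the word boundaries."""
    padded = "#" + word + "#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def retrieve_variants(query, collection, n=2, k=5):
    """Rank collection words by the Dice coefficient of shared character
    n-grams with the (modern) query word; return the top-k candidates."""
    q = char_ngrams(query, n)
    def dice(word):
        w = char_ngrams(word, n)
        return 2 * len(q & w) / (len(q) + len(w))
    return sorted(collection, key=dice, reverse=True)[:k]
```

In the known-item setting, recall@k is then simply the fraction of test pairs whose historic form appears in the top-k list for its modern query word.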
Especially at low recall levels (1 and 5), the differences between the original
historic words and the rewritten words are huge.
The 4-gram rules generated by RNF perform much better than the 5-gram
rules. The 5-gram rules are a huge improvement over the original words, but the
4-grams are much better still. The performance of 2-gram and 3-gram matching
@20 is comparable to the experiments by Robertson and Willett (see table 3.1).
When matching n-grams for historic word forms, small n-grams (2 and 3)
perform better than large n-grams (4 and 5). However, it is interesting to see
that the difference between rewriting and no rewriting, for recall at the lower levels
(recall @1 and recall @5), becomes very big for large n-grams. More specifically,
the performance of 4-gram and 5-gram matching is very bad at recall levels 1
and 5.

4.3 Historic Document Retrieval


Parallel to this research, Adriaans [1] has worked on historic document retrieval
(HDR). He has investigated whether HDR should be treated as cross-lingual
information retrieval (CLIR) or monolingual information retrieval (MLIR). In
his CLIR approach, the rewrite rule generation methods described here have
been applied to a collection of historic Dutch documents, and a set of modern
Dutch queries.

4.3.1 Topics, queries and documents


Two sets of topics were used. One is a set of 21 expert topics from [2], for
which a number of the documents in the collection were assessed and marked
as relevant or non-relevant. The formulation of the queries and the assessment
of the documents was done by experts of 17th century Dutch. The other
set contains 25 topics, for which only one relevant document is known. This
approach is called known-item retrieval. A document from the collection is
selected randomly, and a query is formulated that describes the content of that
document as precisely as possible. This approach is used when there is a lack of
assessors and/or time to assess the relevance of all the documents for each query.
Each query in the experiment is a combination of a title and a description. The
description is a natural language sentence, posed in modern Dutch describing
the topic of the query. The title contains key words from the description. By
combining the description with the title, the key words occur twice in the query,
giving them extra weight in ranking the documents. The three columns in the
result tables show the average precision for topics using only descriptions (D
only), descriptions and titles (D+T), and titles only (T only) as queries. The average precision
is the non-interpolated average precision score. For each query, the top-10
ranking documents are retrieved. If, for query Q, the 3rd and 5th documents are
relevant, the non-interpolated average precision is the average of the precision
at rank 3 and the precision at rank 5. If n relevant documents are retrieved, the average
precision is:

\[ \text{Avg.Prec.} = \frac{1}{n} \sum_{i=1}^{n} \frac{i}{\text{rank}(i)} \tag{4.1} \]
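Formula (4.1) can be computed directly from the ranks of the relevant documents. A minimal sketch, with the worked example from the text (relevant documents at ranks 3 and 5):

```python
def average_precision(relevant_ranks):
    # non-interpolated average precision as in (4.1):
    # mean over relevant documents of i / rank(i)
    ranks = sorted(relevant_ranks)
    return sum(i / r for i, r in enumerate(ranks, start=1)) / len(ranks)

# relevant documents at ranks 3 and 5: average of 1/3 and 2/5
print(average_precision([3, 5]))  # ≈ 0.3667
```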

The document collection is also the same as the one used in [2], because these
documents were assessed for the expert topics.

4.3.2 Rewriting as translation


The used rule set is a combination of RNF, PSS and RSF (see section 4.1.2).
However, the RNF+PSS+RSF rules are generated by non-final versions of the
generation algorithms, since the HDR experiments were done before the final
versions were ready. The RNF algorithm performed sub-optimally because of
memory problems. Due to time constraints, the best rewrite rules at that time
were used (which is the ordered generation of RNF, PSS and then RSF). The
final version of the RNF algorithm generates more rules, and with the final
combined rule sets this would probably give other HDR results.
Table 4.6 shows the results of the HDR experiment on the known item
topics.3 The baseline is the standard retrieval method, looking up the exact
query words in the inverted document index. An inverted document index is
3 The results in this thesis differ from the results in [1] because of last-minute changes in [1]. The results shown here are only based on topics for which there is at least one relevant document. The results in [1] also take into account topics that don't have any relevant documents in the collection.

Method               Avg. prec.   Avg. prec.   Avg. prec.
                     D only       D+T          T only
Baseline             0.2192       0.1955       0.1568
Stemming             0.2125       0.2352       0.1749
4grams               0.2366       0.2538       0.2457
Decompounding        0.2356       0.2195       0.1795
Rules Doc            0.3097       0.3118       0.3016
Rules Query          0.2266       0.2537       0.2487
Rules Doc + stem     0.3702       0.3884       0.3006
Rules Query + stem   0.1873       0.2370       0.2234

Table 4.6: HDR results using rewrite rules

matrix in which the columns represent the documents in the collection, and the
rows represent all the unique words in the entire document collection, with
each cell containing the frequency of the represented word in the represented
document. Thus, a column shows the frequencies of all words that occur in the
document, and each row shows the frequencies of a word in the documents that
it occurs in.
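Such an index is usually stored sparsely, as a mapping from each word to the documents it occurs in. A minimal sketch with two hypothetical documents:

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: text}; returns {word: {doc_id: frequency}}
    # each row of the matrix described above becomes one dictionary entry
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word, freq in Counter(text.split()).items():
            index[word][doc_id] = freq
    return index

docs = {"d1": "ende de man ende de vrouw", "d2": "de vrouw sprak"}
index = build_inverted_index(docs)
# the row for 'de' holds its frequency in every document it occurs in
```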
The table shows the results of some standard IR techniques as well. Stem-
ming, as explained in section 2.3, conflates words through suffix stripping. The
4-gram experiment uses 4-grams of words, in combination with the words themselves, as rows in the inverted index. Decompounding is used to split compound
words into their compound parts. The results for the rewrite rules are split into
document translation and query translation. In the query translation experi-
ment, a list of translation pairs was made for the words in the historic document
collection, containing the original historic term and its rewritten form. Each
query was expanded with an original historic word if its rewritten form was a
query word. The document translation experiment was done by replacing each
word in all documents by its rewritten form from the same list of translation
pairs. The first 4 experiments can be seen as monolingual IR (documents and
queries are treated as one language). Translating queries or documents is a
cross-language (CLIR) approach. Either the documents are translated into the
language of the queries, or the queries are translated into the language of the
documents. The last two rows show the effect of stemming after translation.
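Both CLIR strategies can be sketched with the same list of translation pairs. This is an illustration with hypothetical pairs, not the thesis's experimental setup:

```python
def expand_query(query, pairs):
    # query translation: append every historic form whose rewritten
    # (modernized) form already occurs in the query
    expanded = list(query)
    for historic, rewritten in pairs:
        if rewritten in query:
            expanded.append(historic)
    return expanded

def translate_documents(docs, pairs):
    # document translation: replace each historic token by its rewritten form
    mapping = dict(pairs)
    return {doc_id: [mapping.get(token, token) for token in tokens]
            for doc_id, tokens in docs.items()}

pairs = [("boeck", "boek"), ("oirvede", "oorvede")]   # illustrative pairs
print(expand_query(["boek", "recht"], pairs))          # ['boek', 'recht', 'boeck']
```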
Of the 4 monolingual approaches, the use of 4-grams works best, although
stemming and decompounding perform better than the baseline as well. Trans-
lation of the queries is comparable in performance to the 4-gram approach, but
stemming the translated queries has a negative effect, especially when using
descriptions only. Query translation means adding historic word-forms to the
query. These historic word-forms contain historic suffixes that might not be
stripped by the stemmer, just as the historic word-forms in the documents.
Without rewriting, many historic spelling variants cannot be conflated to the

Method D only D+T T only


Baseline 0.3396 0.4289 0.4967
Stemming 0.3187 0.3778 0.4206
4grams 0.3037 0.3465 0.3821
Decompounding 0.3307 0.4228 0.4900
Rules Doc 0.2825 0.3835 0.4538
Rules Query 0.3067 0.4224 0.4844
Rules Doc + stem 0.2690 0.3268 0.3799
Rules Query + stem 0.2920 0.3628 0.4214

Table 4.7: HDR results for expert topics

same stem. Thus, if the historic query terms are not affected by the stemmer,
they will only be matched by the exact same word-forms in the document collection, not by any morphological variant. Document translation is clearly superior to query translation. Even without stemming, it consistently outperforms
all the other approaches. But here, stemming is useful. By rewriting, many his-
toric spelling variants are conflated to a more modern standard, including their
suffixes. After stemming, morphological variants are conflated to the same stem,
which significantly improves retrieval performance. For the D+T and T only
queries, the improvement over the baseline is almost 100%.
The results for the expert topics are listed in Table 4.7. These results are
in no way comparable to the known-item results. No matter what approach is
used, nothing performs better than the baseline system. The decompounding
approach, and the query translation approach (without stemming) come close
to the performance of the standard system, but they show no improvement.
A closer analysis of the topics shows that the experts who formulated the
queries used specific 17th century terms, and added historic spelling variants to
some of the descriptions and the titles. Topic 13 has the following description
and title:

• Description: Welke invloed heeft ’oorvrede’ nog in de periode van de
Antwerpse Compilatae (normaal: oorvede, oirvede)? (English: What influence did ‘oorvrede’ still have in the period of the Antwerpse Compilatae (normally: oorvede, oirvede)?)

• Title: oorvrede Antwerpse Compilatae oorvede oirvede

The document collection used for retrieval contains documents from the
‘Gelders Land- en Stadsrecht’ corpus, and the ‘Antwerpse Compilatae’ corpus.
Both are collections of texts concerning 17th century Dutch law. All documents
from the ‘Antwerpse Compilatae’ contain the words Antwerpse Compilatae. So,
by putting these words in the query, all documents from this corpus are considered possibly relevant. Next, the word ’oorvrede’ is combined
with 2 spelling variants in both the description and the title. By rewriting the
documents, the spelling variant oirvede might have changed, so all documents
originally containing oirvede no longer match with the query word oirvede. In

Period   baseline distance   Braun   4-gram   PSS    RSF
1569     2.62                1.53    1.57     1.99   1.82
1600     2.38                1.54    1.20     1.65   1.43
1658     1.98                1.24    0.95     1.54   0.95

Table 4.8: Results of rules on test sets from different periods

general, if spelling variants are added to the query, the documents should not
be rewritten, since rewriting is used to conflate spelling variants.

4.4 Document collections from specific periods


Since spelling changed over time, becoming more and more standardized with the
rise of literacy and the printing press, a rewrite rule should only be used
on documents dating from roughly the same period as the documents it was
generated from. That is, if documents from the beginning of the 17th century
were used to construct rewrite rules, applying them to texts dating from 1560
or from the late 17th century might have a negative effect, because the pronunciation of
certain character combinations might have changed between these periods.
To see if this is the case, two small test sets of 100 word pairs each,
one from a text dating from 1569 and one from a text written in 1658, were
manually constructed (in the same way as the large test set constructed from
texts dating from 1600–1620, see 3.4.1). The rules and the RDV-algorithm
were applied to these sets, with the results shown in Table 4.8 (the second test set, period
1600, is the original, large test set).
As the second column shows, the average distance between the historic words
and their modern counterparts is decreasing together with the age of the source
documents (the actual differences might be even bigger, since the test sets do not
contain words that haven’t changed over time. The texts from 1658 probably
contain more of these words than older texts). Interestingly, the rules
manually constructed by Braun perform better on the oldest test
set than any automatic method. Even the best rules generated by RNF perform
somewhat worse, although they do perform much better on the test sets from
more recent documents. Again, the PSS rules perform worst of all rule sets.
The RSF rules show great improvement in performance as the age of the source
documents decreases. On the 1658 test set, it shows the same performance as
the best RNF rules. Of course, given the small size of the test sets, these results
only give an indication of the effects on test sets from different periods. To
be able to draw reliable conclusions, much larger test sets, and maybe even a
word-form retrieval experiment (see section 4.2) should be used.

4.5 Conclusions
All evaluations show that 4-gram RNF generates the best rules. Although
the Braun rules prove to be more period-independent, for documents
written after 1600 the automatic methods perform much better. Word retrieval
benefits greatly from rewriting, with performance on par with the results
from [18] for historic English word-forms. For HDR, the effect of rewriting is
spectacular. What is interesting is the improvement of the stemming algorithm.
Before rewriting, stemming the documents and queries has a mixed effect: for
the titles it is useful, but for the descriptions, stemming has very little effect.
Once the rewrite rules have been applied, many more historic words have a
modern suffix that can be removed, conflating spelling and morphological variants to the same stem. In other words, rewriting has brought the historic
Dutch documents closer to the modern Dutch language.
As the iteration of RNF ceases to have effect after 3 iterations, it might be
more effective to switch to phonetic matching, or possibly word-retrieval, after
that. It would be interesting to see the results of a combined run, first rewriting
documents and then use n-gramming to find spelling variants that are not yet
fully modernized. Also, the results of the HDR experiment are based on old
rules. The current best rule set performs much better on the test set evaluation
and on the word-retrieval test, and thus might perform even better on the HDR
experiment.
Chapter 5

Thesauri and dictionaries

A thesaurus is a dictionary of words containing for each word a list of related


words. In IR, it is often used to find synonyms of a word, and other closely
related words for query expansion. A document D that doesn’t contain any of
the query terms will not be retrieved, but can still be relevant. The topic of the
query might be discussed in a document using different, but related words. By
expanding the query with related words from the thesaurus, the query might
now contain words that are in D, so D will be retrieved.
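The retrieval effect described here can be made concrete in a few lines. A minimal sketch with a hypothetical thesaurus entry (the opsnappers/feestvierders pair from section 5.2 is reused for illustration):

```python
def expand(query, thesaurus):
    # add related words from the thesaurus to the query terms
    expanded = set(query)
    for term in query:
        expanded.update(thesaurus.get(term, []))
    return expanded

def retrieved(doc_tokens, query):
    # a document is retrieved if it shares at least one term with the query
    return bool(set(doc_tokens) & set(query))

thesaurus = {"feestvierders": ["opsnappers"]}          # illustrative entry
doc_D = ["de", "opsnappers", "vierden", "feest"]
query = ["feestvierders"]
# doc_D misses the original query term, but matches after expansion
```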
In [2], the vocabulary gap is identified as one of the main bottlenecks for HDR. This gap
not only represents concepts that no longer exist, or that didn't exist in the
17th century. It also represents the concepts that are described by modern and
historic synonyms. As 17th century documents contain many synonyms (see [2,
p.31]), a thesaurus can be useful for query expansion.
A thesaurus can be created automatically by extracting and combining word
pairs from different sources:
1. Small parallel corpora
2. Non-parallel corpora (using context)
3. Crawling footnotes
4. Phonetic transcriptions
5. Edit distance
The first three methods can be used to construct a thesaurus for the vocab-
ulary gap. The last two methods might be used to tackle the spelling problem,
as an alternative to rewrite rules.

5.1 Small parallel corpora


A very useful technique for finding word translation pairs between two different
languages is word-to-word alignment in parallel corpora (see [21] and [22]). In


the European Union for instance, all political documents written for the Euro-
pean parliament have to be translated into many different languages. As these
documents contain important information, it is essential that each translation
conveys exactly the same message. The third paragraph in a Polish translation
contains the same information as the third paragraph in an Italian translation.
This can be exploited to construct a translation dictionary automatically by
aligning sentences and words within these sentences. A collection of such docu-
ments in several languages is often called a parallel corpus. A parallel corpus can
thus be used to find synonyms in one language for words in another language.
If such a collection of documents is available for 17th century Dutch and mod-
ern Dutch, it could be used to construct word translation pairs between 17th
century and modern Dutch. This could be a partial solution to the vocabulary
gap identified in [2]. Partial, because historic words for concepts that no longer
make any sense in modern times cannot be aligned with modern translations,
simply because no such translations exist.
One of the largest parallel corpora is probably the Bible. It has been translated
into many different languages, and also into many historical variants of
many modern languages. The advantage of using different Bible translations is
that a line in one translation corresponds directly to the same line in the other
translation. The Statenbijbel and the NBV (Nieuwe Bijbel Vertaling) can be
used to construct a limited translation dictionary. The Statenbijbel is the first
Dutch translation of the original Bible, written in 1637. The NBV is the most
recent Bible translation, available in book form and on the internet. However,
the Statenbijbel, unlike the NBV, is not electronically available. The oldest
digitized version that can be found is a modernized version of the Statenbijbel,
dating from 1888. By that time, official spelling rules were introduced, and late
19th century Dutch is very similar to modern Dutch, making it useless for 17th
century Dutch.
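Because verse i of one translation corresponds directly to verse i of the other, a crude way to harvest candidate translation pairs is to count how often a historic and a modern word occur in corresponding verses. This is a hedged sketch, not the alignment methods of [21] and [22]: it scores pairs with a Dice coefficient over hypothetical aligned verses.

```python
from collections import Counter

def candidate_pairs(historic_verses, modern_verses, min_count=2):
    # historic_verses[i] is assumed to translate modern_verses[i]
    h_freq, m_freq, co = Counter(), Counter(), Counter()
    for h_line, m_line in zip(historic_verses, modern_verses):
        h_words, m_words = set(h_line.split()), set(m_line.split())
        h_freq.update(h_words)
        m_freq.update(m_words)
        for h in h_words:
            for m in m_words:
                co[(h, m)] += 1
    # Dice score: strength of association across aligned verses
    scored = [(2 * c / (h_freq[h] + m_freq[m]), h, m)
              for (h, m), c in co.items() if c >= min_count]
    return sorted(scored, reverse=True)

historic = ["het boeck is goet", "een boeck las hy"]   # illustrative verses
modern   = ["het boek is goed", "een boek las hij"]
print(candidate_pairs(historic, modern)[0])            # (1.0, 'boeck', 'boek')
```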

5.2 Non-parallel corpora: using context


If the time span between the two language variants is large, the variants can
be considered to be different languages; historic documents can be considered
documents written in another language. If the time span between variants
becomes smaller, the languages are more alike, and the lexical overlap increases. Many words have the same meaning in both languages. There are,
however, some words that appear in one variant but not in the other. Over time,
some words are no longer used, and new words have been coined. Some of the
purely historic Dutch words, that are no longer used in modern Dutch, have a
purely historical meaning. It is hard to find corresponding modern Dutch words
for them, because they have lost their meanings in modern society. But some
purely historical words do have a modern counterpart. The word opsnappers is a
17th century Dutch word, that is no longer used in modern Dutch. Its modern
variant is feestvierders (English: people having a party). Someone looking for
documents on throwing a party in ancient days might use feestvierders as a

query word. Query expansion can benefit from a modern to historic translation
dictionary containing opsnappers as a historic synonym for feestvierders.
There are several techniques that can be used to find semantically related
words automatically. Two of them will be discussed here. The first uses the fre-
quency of co-occurrence of two specific words, the second uses syntactic structure
to find words that are at least syntactically, and possibly semantically related.

5.2.1 Word co-occurrence

One way of constructing a thesaurus automatically is using word co-occurrence


statistics to pair related words. This technique exploits the frequent co-occurrence
of related words in the same document, or paragraph. If two words co-occur
frequently in documents, there is a fair chance that these words are related. In
political documents, the words minister and politician will often co-occur. Of
course, high-frequency words like the and in will also co-occur often. Content
words have a lower frequency, and will co-occur less often than function words.
But a thesaurus containing pairs of high-frequency function words will not help
in retrieving relevant documents, since high-frequency words are often removed
from the query. Moreover, most of the documents in the collection contain these
function words, so almost all documents would be considered relevant. Content
words not only carry content, they are also good at discriminating between docu-
ments. The word minister is much better at discriminating between political and
non-political documents than the word the. Thus, a simple word co-occurrence
thesaurus should pair related content words. There are two ways of filtering
out highly frequent term co-occurrences. Removing the N most frequent terms
from the lexicon is a rather radical approach. The other possibility is to penalize
high term frequency by dividing the number of co-occurrences of two words by
the product of their individual term frequencies. The main advantage of the
latter approach is that high frequency content words are not removed from the
lexicon, so no information is lost. On the other hand, if a content word occurs
often, its discriminative power is minimal.
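The frequency-penalized scoring just described can be sketched as follows. A minimal illustration on hypothetical documents, assuming each document is one line of text:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_scores(documents):
    # score(w1, w2) = co-occurrence count / (freq(w1) * freq(w2)):
    # frequent function words are penalized instead of being removed
    freq, co = Counter(), Counter()
    for doc in documents:
        words = set(doc.split())
        freq.update(words)
        for pair in combinations(sorted(words), 2):
            co[pair] += 1
    return {(w1, w2): c / (freq[w1] * freq[w2])
            for (w1, w2), c in co.items()}

docs = ["de minister sprak", "de minister zweeg", "de boom viel"]
scores = cooccurrence_scores(docs)
# 'minister'/'sprak' co-occur once with low frequencies: 1/(2*1) = 0.5
# 'de'/'minister' co-occur twice, but 'de' is frequent: 2/(3*2) ≈ 0.33
```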
However, low frequency terms suffer from accidental co-occurrences. For two
totally unrelated, low frequency words, accidental co-occurrence is enough to
make their co-occurrence significant. An extra problem is the spelling variation.
A low-frequency content word W1 might be spelled differently in each document
it occurs in. Even if it co-occurs with the same related word W2 in each of those
documents, each spelling variation will have a co-occurrence frequency with W2
of 1. Of course, this problem is partly solved by applying the rewrite rules from
chapter 3 to the documents, but there are still some spelling variants that are
not conflated after rewriting. Since the document collection is limited in size,
almost all content words have a low frequency, making it nearly impossible to
construct a useful co-occurrence thesaurus in this way.

5.2.2 Mutual information


Another related approach comes from information theory. In [16], an automatic
word-classification system is described using mutual information of word-based
bigrams. Bigrams are often used in natural language processing techniques to
estimate the next word given the current word. Given a corpus, the probability
of word Wi is given by the probability of the words W1, W2, ..., Wi−1 occurring
before it. This can be approximated by considering only the n − 1 words before
Wi (Wi−n+1, ..., Wi−1). In the case of bigrams, only the previous word Wi−1
is considered. The probability of the bigram Wi−1, Wi is its frequency divided
by the total number of bigrams in the corpus. The N most
frequent words of an English corpus are classified in a binary tree by maximizing
the mutual information between the words in class Ci and the words in class Cj .
The final tree shows groups of semantically or syntactically related words. The
mutual information between a word Wi from class Ci and a word Wj from class
Cj is given in (5.1), where P(Wi, Wj) is the probability of the bigram Wi, Wj:

\[ I(W_i, W_j) = \log \frac{P(W_i, W_j)}{P(W_i)\,P(W_j)} \tag{5.1} \]
The total mutual information M(i, j) between two classes Ci and Cj is then:

\[ M(i, j) = \sum_{C_i, C_j} P(C_i, C_j) \times \log \frac{P(C_i, C_j)}{P(C_i)\,P(C_j)} \tag{5.2} \]
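The probabilities in (5.1) can be estimated by relative frequency from raw unigram and bigram counts. A minimal sketch on an illustrative token list:

```python
import math
from collections import Counter

def bigram_mutual_information(tokens):
    # I(Wi, Wj) = log P(Wi, Wj) / (P(Wi) P(Wj)) as in (5.1),
    # with probabilities estimated by relative frequency
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    return {(w1, w2): math.log((c / n_bi) /
                               ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))
            for (w1, w2), c in bigrams.items()}

tokens = "de man las de brief".split()      # illustrative mini-corpus
mi = bigram_mutual_information(tokens)
```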

Maximizing the mutual information is done sub-optimally, by finding a lo-


cally optimal classification. First, the N words are randomly classified and
Mt(Ci, Cj) is computed. Second, for each word W in both classes, the mutual
information Mt+1(Ci, Cj) is calculated for the situation where W is
swapped from one class to the other. If this increases the mutual information, the
swap is permanent; otherwise, W is swapped back to its original class. In [16],
the final classification stops after these N swaps. Because computing power has
increased so much, it doesn’t take much more time to iterate this process until,
in the next N swaps, no swap is permanent. Working top-down, at each level,
all the words of class Ci are classified further into subclasses. This ensures that
the classification at the previous level stays intact.
To reduce the computational complexity of the algorithm, the mutual infor-
mation at t + 1 can be computed by updating the mutual information at t with the
change made by W. If W is in class Ci at t, computing Mt+1(Ci, Cj) is done by
first computing the mutual information M(W, Cj). This is the contribution of
W at t. Next, the mutual information M(W, Ci) is computed. If M(W, Ci) is
higher than M(W, Cj), swapping W to class Cj increases the mutual information.
The new mutual information is then:

\[ M_{t+1}(C_i, C_j) = M_t(C_i, C_j) - M(W, C_j) + \max\bigl(M(W, C_j),\, M(W, C_i)\bigr) \tag{5.3} \]


In this way, the full mutual information M (Ci , Cj ) has to be calculated only
once, and is updated by each swap.
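The swap procedure can be sketched as follows. This is a simplified illustration, not the implementation used here: it splits the vocabulary into just two classes and recomputes the full cross-class mutual information for every trial swap instead of applying the incremental update (5.3).

```python
import math
import random
from collections import Counter

def class_mi(bi_p, uni_p, class_a, class_b):
    # M(Ci, Cj): mutual information of bigrams crossing the two classes
    return sum(p * math.log(p / (uni_p[w1] * uni_p[w2]))
               for (w1, w2), p in bi_p.items()
               if (w1 in class_a) != (w2 in class_a))

def split_into_two_classes(tokens, seed=0):
    words = sorted(set(tokens))
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    uni_p = {w: c / len(tokens) for w, c in uni.items()}
    bi_p = {b: c / (len(tokens) - 1) for b, c in bi.items()}
    rng = random.Random(seed)
    class_a = set(rng.sample(words, len(words) // 2))   # random initial split
    class_b = set(words) - class_a
    improved = True
    while improved:                        # iterate until no swap helps
        improved = False
        for w in words:
            src, dst = (class_a, class_b) if w in class_a else (class_b, class_a)
            before = class_mi(bi_p, uni_p, class_a, class_b)
            src.remove(w); dst.add(w)      # try the swap
            if class_mi(bi_p, uni_p, class_a, class_b) > before:
                improved = True            # keep the swap
            else:
                dst.remove(w); src.add(w)  # swap back
    return class_a, class_b
```

Since every permanent swap strictly increases the mutual information and there are finitely many partitions, the loop is guaranteed to terminate.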

The idea behind this approach is that closely related words are classified
close to each other, and two unrelated words should be classified in different
classes early in the tree (near the root). If two words convey the same meaning,
it makes no sense to place them next to each other in a sentence, because it
would make one of them redundant. The meanings of two adjacent words should
be complementary. If two words co-occur often (i.e. their bigram frequency is
high), they should not be in the same class. Low co-occurrence (low bigram
frequency) of frequent words (high unigram frequency) makes it probable
that the meanings of these words overlap, so they will be classified close to each
other. The example classification given in [16] shows some classes that might
be useful for query expansion. In one class, all days of the week are clustered
together, and in another class, many time-related nouns are clustered. If one of
the words in such a class is used as a query word, adding other words from the
same class to the query might help finding documents on the same topic.
Once the N most frequent words have been classified, adding other, less
frequent words requires no more than putting each word in the class that
results in the highest mutual information. This second step becomes trivial
when adding words with very low frequencies. A word W with frequency 1 (this
holds for the largest part of the content words) only shows up in 2 bigrams, once
with the previous word in the sentence, and once with the next word. Thus, it
will only add mutual information when classified in the opposite class of one of
these words. If neither the previous nor the next word is in the same class at a
classification level s, putting W at level s + 1 in class Ci or Cj makes no difference to
the mutual information, because, using (5.3), it adds 0 to either class.
This introduces a new problem for historic documents. Because of the in-
consistency in spelling, resulting in spelling variants, each variant has a lower
corpus frequency and occurs in fewer bigrams than it would have given a con-
sistent spelling (by conflating spelling variants, the new word frequency is the
sum of the frequencies of the conflated variants). Apart from that, classifica-
tion based on bigrams requires a huge amount of text. More text means better
classification, simply because there is more evidence to base a classification on.
But the amount of electronically available historic text is limited, resulting in
data sparseness.1
To make sure that the algorithm was implemented correctly, a test classification was made using a 60-million-word English newspaper corpus.2 The 1000 most
frequent words were classified in a 6 level binary classification tree. Table 5.1
shows 4 randomly selected classes, paired with their neighbouring class, at clas-
sification level 6 (the leaves of the tree). Out of the 1 million possible bigrams
(each of the 1000 unique words can co-occur with all 1000 words, including

1 Data sparseness in this case means the lack of evidence for unseen bigrams. A bigram Wi−1, Wi might not occur in the corpus, making the probability P(Wi−1, Wi) 0. Smoothing techniques can be used to overcome this problem, but for the classification algorithm it still always adds the same amount of mutual information to each class, making classification trivial. In a larger corpus, there is a bigger chance that a certain bigram occurs, resulting in a more reliable probability estimate.
2 The newspaper corpus is the LA-times corpus used at CLEF 2002.

Class Class
number content
9 city Administration movie very given growing
10 nation department only like proposed approved
27 their housing
28 her my financial economic private drug Simi Newport
World Laguna National Pacific Long Orange Santa Ventura
middle as hot five six eight few will can ’ve may said
would could does cannot did should ’ll is ’d must
35 allow take begin
36 bring give provide keep hold get pay sell win find
build break create use meet leave become call tell ask
say see think feel know want run stop play Japan
husband hours days though
49 Clinton Anaheim Los Northridge Thousand judge wife
couple key action summer minute top order largest
usually anything non New own
50 Department American San Inc star hearing project
election list book force war quarter morning week bad
different free got

Table 5.1: Classification of 1000 most frequent words in LA-times

itself), for the 1000 most frequent words, the corpus contains 412.516 unique
bigrams.
In class 28, a number of auxiliary verbs is clustered, and in class 36, some
semantically related verbs are clustered. The neighbouring class, 35, contains
some related verbs as well, which indicates that there is a relation between
clusters that are classified close to each other. In 49 and 50, some time-related
nouns are clustered (summer, minute, quarter, morning, week). But many
clusters contain seemingly semantically unrelated words, like ‘Administration’,
‘movie’, ‘very’ and ‘growing’. The ability to cluster on semantics is limited,
although more data (a larger corpus) should lead to better (or at least, more
reliable) classification. The corpus is used as a language model for English.
Thus, more text leads to a more reliable model.
Better classifications have been made with syntactically annotated corpora.
One of the main problems with plain text is not the semantic but the syntactic
ambiguity of words. The word ’sail’ can be used as a noun (‘The sail of a ship.’)
or as a verb (‘I like to sail.’). But the orthographic form ‘sail’ can only be
classified in one class. In syntactically annotated corpora, a word can be classified
together with its part-of-speech tag. For modern English, such corpora exist,
but for 17th century Dutch, all that is available is plain text.
The total historic Dutch corpus is much smaller than the English one, but
still contains about 7 million words. The 1000 most frequent words share 226.318

Class Class
number content
11
12 In uit Na Aen om Op tot van vanden of ofte en ende
Laet Doen Wilt Uw Haer Zy Wy Zijn Mijn Hy Ons Ik Sijn
Gy selve Noch vp Dus Der o Een Geen Daer Daar Dese
Dies Dat Des Dees Alle so soo zo Zoo Indien Wanneer Nu
al
25 aller also als verheven inder vander wien binnen te
alwaer ter
26 welcken nam toch dewyl eerste dat dattet mit achter
onder Roomsche
33 wie hemels verre inne vooren
34 och heer staat ras Maria connen konnen ware zijnde mede
datse dijen
59 Heer Prins hand borst lijf beeld beelden brief boeck
kennis steen gelt brant dood verdriet troost rust
plaats slagh oyt niet voorts eerst sprack
60 bloed kroon troon staet stadt plaets wegh Boeck
editie uitgave naem stof vrucht glans kort quaet
begin neder noyt wel voort wederom weder zien sien
gaven leeren

Table 5.2: Classification of 1000 most frequent words in historic Dutch corpus

bigrams (which means that 77% of all possible bigrams are not in the corpus). The
same experiment was repeated with the historic Dutch corpus, and again 4
classes were randomly selected, shown in table 5.2, together with their neigh-
bouring classes.
In some of these classes, there are some clusters of syntactically related
words. Class 12 contains many prepositions and pronouns, and classes 59 and
60 contain mostly nouns. Semantically, classes 59 and 60 are also interesting,
because there are some themes. ‘Heer’ and ‘Prins’ (lord and prince), ‘hand’,
‘borst’ and ‘lijf’ (hand, chest, body), ‘brief’, ‘boeck’ and ‘kennis’ (letter, book,
knowledge) in 59, ‘kroon’ and ‘troon’ (crown, throne), ‘staet’, ‘stadt’, ‘plaets’,
‘weg’ (state, city, place, road), ‘Boeck’, ‘editie’, ‘uitgave’ (Book, edition, edition)
in 60. The main problem of small corpora is that, if the mutual information
within one class is zero (none of the words in that class share a bigram), further
classification is useless. This is clear in classes 11 and 12. Apparently, moving
one word from class 12 to 11 does not increase the mutual information. A
further subclassification of class 12 will result in one empty subclass, and the
other subclass containing all words of class 12.
For a better comparison, the 1000 most frequent words in a 30-million-word

Class Class
number content
21 dacht zet vraagt grote hoge oude
22 maakte hield sterke enorme vijftig hoe
29 Bosnische Europees Zuid dezelfde hetzelfde welke ieder
vele veel enkele beide dertig honderd vijf wat vorig
economische mogen hard ondanks Van
30 Nederlandse nationale Navo Rotterdamse elke deze zoveel
bepaalde negen mijn dit laten
31 drie tien zeven derde halve volgend vorige zware
speciale belangrijke oud
32 zwarte rode politieke vrije ex rekening tv milieu
gebruik kun
49 me we wij belangrijkste echte dergelijke voormalige
meeste klein dollar groei overheid regering gemeente
rechtbank Spanje ogen televisie stuk leeftijd weekeinde
week seizoen keer ander
50 handen hart familie bevolking Raad politiek kabinet
onderwijs school tafel Feyenoord ploeg elftal finale
zomer toekomst maand periode buitenland koers produktie
verkoop verzoek ton rechter kant totaal groter
mogelijkheid

Table 5.3: Classification of 1000 most frequent words in modern Dutch corpus

modern Dutch corpus3 were also classified in a level 6 binary tree. 4 randomly
selected classes and their direct neighbouring classes are listed in table 5.3. In
this corpus, the 1000 most frequent words share 295,404 unique bigrams.
Table 5.3 shows 4 directly neighbouring classes (29, 30, 31, 32). At level 4
in the tree they would be merged into one class. This would make sense, as
classes 29, 30 and 31 contain number words (dertig, honderd, vijf, negen, drie,
tien, zeven) and related adjectives (ieder, vele, veel, enkele, beide, elke, zoveel,
bepaalde, derde, halve), and all 4 classes contain some other adjectives. Classes
49 and 50 also contain some semantically related words: ‘overheid’, ‘regering’,
‘gemeente’, ‘Raad’, ‘politiek’, ‘kabinet’ (government, government, community,
council, politics, cabinet), and ‘leeftijd’, ‘weekeinde’, ‘week’, ‘seizoen’, ‘zomer’,
‘maand’, ‘periode’, ‘toekomst’ (age, weekend, week, season, summer, month,
period, future).
For all three corpora, the classification trees show some useful clustering,
but it is far from being usable for query expansion, because it is based on high
frequency words, which add very little content to a query and mark a lot of
documents as relevant. As mentioned before, classification of low frequency
words is completely unreliable, because there is very little evidence to base a
3 This corpus is also from CLEF 2002.

classification on. But the low frequency words are the very words that are useful
for document retrieval. Low frequency words, by definition, occur in only a few
documents, and are often related to the topic of a document. It seems that the
only way to get a more reliable classification is to use a bigger corpus.
There is, however, a big difference in automatic clustering between English
and Dutch. In the 60 million words corpus used for English, there are ‘only’
306,606 unique words, whereas the 30 million words corpus for modern Dutch
contains 495,605 unique words. The historic Dutch corpus, containing 7 million
words in total, has 373,596 unique words. In general, a larger corpus contains
more unique words, so a 60 million words corpus of historic Dutch would proba-
bly contain many more unique words than the English corpus. The main reason
for this is probably the compounding of words. In English, compounds
are rare (like ‘schoolday’), as most nouns are separated by whitespace (‘shoe
lace’), but in Dutch, compounding is much more common, resulting in words like
‘bedrijfswagenfabriek’ (lit.: company car factory) and ‘nieuwjaarsgeschenken’
(New Year gifts). To get enough evidence for a reliable classification, a larger
lexicon requires a larger corpus.
Another difference between these two languages is the word order, which is
more strict in English than in Dutch. Both languages share the Subject - Verb
- Object order in basic sentences. But when a modifier is added to the beginning
of the sentence, the order is retained in English, but changes in Dutch (the verb
is always the second constituent of the sentence, so the subject comes after the
verb). This has consequences for the number of unique bigrams in the corpus.
For Dutch, a larger number of bigrams is needed to get the same reliability for
the ‘language model.’
The quality of the classification seems to depend on quite a number of factors:
1. Lexicon size: Each unique word needs plenty of evidence for proper
classification, thus a larger lexicon needs more evidence, i.e., more text.
2. Ambiguity in a language: Words that can have different syntactic
functions can supply contradictory evidence (the verb ‘sail’ can co-occur
with words that cannot co-occur with the noun ‘sail’). Languages that
have many of these words are harder to model correctly.
3. Strictness of word order: Some languages allow various word orderings
for a sentence. In many so-called ‘free word order’ languages like Polish
and Russian, a rich morphology makes it possible to distinguish syntactic
categories. However, for a language like Dutch, the word order may be
changed, but this introduces changes in pronouns and prepositions. A
Dutch translation of the English sentence ‘I’m not aware of that.’ could be
‘Ik ben me daar niet bewust van,’ or ‘Ik ben me daar niet van bewust.’ But
another word order is allowed, like ‘Daar ben ik me niet bewust van’, or
even ‘Daar ben ik me niet van bewust.’ Nothing changes morphologically,
but there are four different sentences, with exactly the same words and
exactly the same meaning.4 More possible orderings need more evidence
4 Thanks to Samson de Jager for pointing out this peculiarity in Dutch.

to be modelled correctly.

4. Document style: Each document is written in a certain style. Sentences


in newspaper articles are often different from sentences in personal letters.
This has an effect on which bigrams occur in the corpus.

5. Depth of classification: At depth 1 (the classes directly under the root),


the classification is often quite reliable. At deeper levels, the number of
bigrams that the words in a class share becomes increasingly small, making
each further classification less reliable. In classes with only nouns
(especially in Dutch, where compounding leads to a large number of low
frequency nouns), a further classification of semantic structure is not pos-
sible because of a lack of syntactic distinction (nouns rarely appear next
to each other).
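The first factor, the amount of bigram evidence available for a lexicon, can be estimated directly from a corpus. The following is a minimal sketch of that estimate; the toy corpus and the cutoff are purely illustrative, not part of the thesis experiments:

```python
from collections import Counter

def bigram_evidence(tokens, top_n=1000):
    """Count the unique bigrams in which both words are among the
    top_n most frequent words -- the only bigrams that provide
    clustering evidence for those words."""
    freq = Counter(tokens)
    top = {w for w, _ in freq.most_common(top_n)}
    unique_bigrams = {(a, b) for a, b in zip(tokens, tokens[1:])
                      if a in top and b in top}
    return len(unique_bigrams)

# Tiny illustrative 'corpus'; a real run would use millions of tokens.
tokens = "de man zag de vrouw en de vrouw zag de man".split()
print(bigram_evidence(tokens, top_n=5))  # 7 unique bigrams among the top-5 words
```

Comparing this count across corpora gives a rough measure of how much evidence a clustering of the top-n words can draw on, which is exactly where the 7 million word historic corpus falls short of the 30 million word modern one.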

To aid word clustering for historic Dutch, the historic document collection
could be mixed with an equal amount of modern Dutch text to reduce data
sparseness. The spelling of many words has changed over time, but the most
frequent words have changed very little. There is still a reasonably large overlap
between the most frequent words in both corpora, so if no more historic text is
available, modern text might help. For modern Dutch, syntactically annotated
corpora are available, and can be mixed with historic Dutch to estimate POS-
tags for historic words. If all the modern words in a class are nouns, it seems
probable that the historic words in that class are nouns as well. To bridge the
vocabulary gap, clustering historic and modern words with related meanings
might be very useful. At least for query expansion, adding historic words to
modern query words can increase recall.

5.3 Crawling footnotes


There are some digital resources available on the web. For instance, the Digi-
tale bibliotheek voor de Nederlandse Letteren5 (DBNL) has a large collection of
historic Dutch literature. Many of these texts contain footnotes of the form “1.
opsnappers: feestvierders”. These are direct translations of historic words to
modern variants. By using a large amount of these texts, a historic to modern
dictionary can be constructed. The texts on DBNL are categorized based on
the century in which they were published. There is a huge list of 17th century
Dutch literature available, containing over 100 books, and more works are
added regularly. Not all books contain footnotes, and not all footnotes are di-
rect translations. Many footnotes contain background information or references
to other works. But some texts contain thousands of translations.
Because the books are annotated by different people, the notes don’t have a
consistent format. In some texts, the historic word is set in italics or boldface,
5 url: www.dbnl.nl

in others, a special html-tag is used to mark it. Consider the following exam-
ples, the first of which is very clear, containing a special tag to signify a word
translation.

<div ID="N098"><small class="note"><a href="#T098" name="N098">


<span class="notenr">&nbsp;4. </span></a>&nbsp;
<span class="term">beschaemt:</span> teleurgesteld. Vgl.
<span class="bible">Rom. X, 11</span>.</small></div>

<div ID="N1944"><small class="note">


<a href="#T1944" name="N1944">
<span class="notenr">&nbsp;9 </span>
</a>&nbsp;<i>bloot:</i> onbeschermd.
</small></div>

<div ID="N1608"><small class="note">


<a href="#T1608" name="N1608">
<span class="notenr">&nbsp;1353 </span></a>&nbsp;
<i>hoofdscheel:</i> hoofdschedel; <i>van:</i> door;
<i>bedropen:</i> overgoten;
Van ’t begin en van ’t einde van Melchisedech’s leven is ons
verder niets bekend; Vondel beschouwt hem als door God zelf
tot priester gewijd.</small></div>

The first note has marked the historic word (‘beschaemt’) by tagging it
with a span class called ‘term’. In all of these cases, the modern translation
(‘teleurgesteld’) directly follows the historic word, and ends with a dot or a
semi-colon. The second note is less specific. The historic word (‘bloot’) is
marked in italics, and the modern translation (‘onbeschermd’) again follows it
and ends in a dot (or a semi-colon). The first note is easy to extract. The second
note is more problematic, because the italics do not always signify a translation:

<div ID="N1728"><small class="note">


<a href="#T1728" name="N1728">
<span class="notenr">&nbsp;10 </span></a>&nbsp;
<i>Orpheus:</i> Orpheus, de bekende zanger van de Griekse sage,
die de wilde dieren bedwong door z’n snarenspel
(<i>konde paren:</i> kon verenigen).</small></div>

Here, the first word in italics, ‘Orpheus’, is not followed by a modern transla-
tion, but by an explanation of who Orpheus was. A simple way of distinguishing
this note from the previous one is that the translation pair contains only
one word after the historic, italicized word. But this doesn’t work for transla-
tions containing several words:

<div ID="N1726"><small class="note">


<a href="#T1726" name="N1726">

<span class="notenr">&nbsp;7 </span>


</a>&nbsp;<i>sloer:</i> sleur, gang, manier.
</small></div>

<div ID="N1437"><small class="note">


<a href="#T1437" name="N1437">
<span class="notenr">&nbsp;12 </span></a>&nbsp;
<i>onses Moeders:</i> van onze moeder, de aarde.</small></div>

For the historic word ‘sloer’, multiple modern translations are given, sepa-
rated by a comma. The historic phrase ‘onses Moeders’ has two modern phrases
as translation. How can these be distinguished from the note about Orpheus?
It gets even worse. Consider the next consecutive notes:

<div class="notes-container" id="noot-1739"> <div class="note">


<a href="#1739T" name="1739"><span class="notenr">4</span></a>
<i>gedraeghe mij tot:</i> houd mij aan.</div></div>
<div class="notes-container" id="noot-1740"> <div class="note">
<a href="#1740T" name="1740"><span class="notenr">5</span></a>
<i>deze: de ene</i>. Bedoeld wordt Pieter Reael, vgl.
<i>502</i>.</div></div>

The first one contains the historic phrase inside italics and the modern phrase
following it directly. The second one contains both the historic word and its
modern translation inside italics, and an explanation directly after it. And a few
notes further down, the single word after the italics is not a modern translation,
but a reference:

<div class="notes-container" id="noot-1744"><div class="note">


<a href="#1744T" name="1744"><span class="notenr">14</span></a>
<i>de schrijver:</i> Vondel.</div></div>

All this makes it very hard to extract only the translation pairs from a
note. Manual correction is not an option, since the 17th century DBNL corpus
contains over 170,000 footnotes. The final list consists of approximately 110,000
translation pairs, many of which are not actual translation pairs but references,
explanations or descriptions. Still, for query expansion it could be useful. If
each modern translation occurs only a few times, only a few historic words or
phrases are added to the query. Not all of them will be useful, but the added noise
might be compensated by the fact that some relevant words are
added as well. Making separate dictionaries for word to word, word to phrase
and phrase to phrase translations, and evaluating each of them separately, will give
an indication of whether a dictionary is useful, or contains too much noise.
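A rough sketch of the extraction and routing step is given below. It covers only the two clean note formats shown above; the regular expressions and the length-based routing are simplifications, and the real crawler has to handle many more variations:

```python
import re

# Two simplified patterns for DBNL-style footnotes: a tagged
# <span class="term"> pair, and an italicized historic word
# followed by its modern gloss ending in a dot or semi-colon.
TERM = re.compile(r'<span class="term">(.*?):</span>\s*([^.;<]+)')
ITAL = re.compile(r'<i>(.*?):</i>\s*([^.;<]+)')

def extract_pairs(note_html):
    """Return (historic, modern) pairs found in one footnote."""
    pairs = TERM.findall(note_html) or ITAL.findall(note_html)
    return [(h.strip(), m.strip()) for h, m in pairs]

def route(pairs, dicts):
    """Sort pairs into word/phrase sub-dictionaries by length."""
    for hist, mod in pairs:
        key = ('word' if len(hist.split()) == 1 else 'phrase',
               'word' if len(mod.split()) == 1 else 'phrase')
        dicts.setdefault(key, {}).setdefault(hist, []).append(mod)

dicts = {}
route(extract_pairs('<i>bloot:</i> onbeschermd.'), dicts)
route(extract_pairs('<i>onses Moeders:</i> van onze moeder, de aarde.'), dicts)
print(dicts[('word', 'word')])  # {'bloot': ['onbeschermd']}
```

The Orpheus-style notes slip through a sketch like this unchanged, which is exactly why the resulting dictionaries contain the noise discussed above.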
The dictionaries in table 5.4 are translations from historic to modern, as
extracted from the DBNL corpus. The word to phrase dictionary contains
historic words as entries, and modern phrases as translations. Vice versa, the
phrase to word dictionary contains historic phrases with modern single word

Dictionary         number of      unique    number of
                   translations   entries   synonyms
word to word        36505         20281     1.8
word to phrase      26445         16649     1.6
phrase to word       5589          4931     1.1
phrase to phrase    42680         35127     1.2
total              111219         68384     1.6

Table 5.4: DBNL dictionary sizes

translations. To get an indication of the usefulness of the DBNL thesaurus, a
random sample of 100 entries was drawn twice, and each entry evaluated. For
each of the four different parts of the total thesaurus, the 100 entries were
marked as useful or useless. Repeating this process once, thus drawing 100
random entries twice, the results give us some idea about the usefulness of the
thesaurus parts.

thesaurus          useful    useless
part               entries   entries
word to word       91/88     9/12
word to phrase     72/62     28/38
phrase to word     59/55     41/44
phrase to phrase   70/68     30/32

Table 5.5: Simple evaluation of DBNL thesaurus parts: usefulness of 100 random
samples

Some good examples from the word to word and word to phrase dictionaries
are:

ghewracht -> bewerkt


badt -> verzocht
booswicht -> zondaar
belent -> zeer nabij
heerschapper -> heer en meester

Here are some bad examples:

galgenbergh -> Golgotha


God -> Godheid
Katten-vel -> kat

d’altaergodin -> Vesta


stuck -> op ’t schaakbord
Hippomeen -> zie

The last example is a typical parsing mistake. The right hand side ‘zie’
(English: see) is part of a reference to something.
Furthermore, the word to word and word to phrase dictionaries were used to
get an idea of the overlap between the historic words in a historic corpus, and
the historic words in the dictionaries. How many of the words in the hist1600
corpus (the corpus used for the RSF and RNF algorithms, see section 3.1.2)
for example, have an entry in the DBNL thesaurus? And what about the
corpus that was used for creating the test set? Table 5.6 gives an indication
of the coverage of the thesaurus. The Braun corpus contains the ‘Antwerpse
Compilatae’ and the ‘Gelders Land- en Stadsrecht’, the Mander corpus contains
‘Het Schilderboeck’ by Karel van Mander. Together, they make up the hist1600
corpus. This split was made because the DBNL thesaurus contains some
entries extracted from the Mander corpus. The same holds for the documents
from the test set corpus. The modern words in the corpora, at least the words
that are found in the modern Kunlex lexicon, were first removed from the total
historic lexicon (column 3). Synonyms for these words can be found in a modern
Dutch thesaurus.
The coverage results can be explained by the footnote extraction. The Braun
corpus does not contain any footnotes, and has the smallest coverage in the
DBNL thesaurus. The Mander corpus has a larger coverage, probably because
a number of entries in the DBNL thesaurus come from ‘Het Schilderboeck’.
That the DBNL thesaurus covers an even larger part of the test set corpus is
probably due to the fact that ‘De werken van Vondel, Eerste deel (1605 – 1620)’
is part of the corpus and contains several thousand notes with translation pairs.

Corpus            Unique   Not in        DBNL
                  words    Kunlex        coverage
hist1600          47816    41156 (86%)   4315 (10.5%)
Braun             17891    14168 (79%)   1429 (10.1%)
Mander            33805    27074 (80%)   3547 (13.1%)
Test set corpus   69453    44827 (65%)   8119 (18.1%)
Selection          1600     1569 (98%)    603 (38.4%)
Test set            400      397 (99%)    152 (38.3%)

Table 5.6: Coverage statistics of corpora for DBNL thesaurus

The DBNL thesaurus covers a far larger part of the historic words in the
selection and test set (see section 3.4.1). Apparently, in the process of giving
modern spelling variants of historic words, there was a bias towards giving
modern forms for historic words with a salient historic spelling, a bias which
may very well have been the same for the editors of the DBNL who made the
footnotes. Also, the selection and test set do contain some modern words. Out
of the 2000 words in both sets, 34 words are in the Kunlex, showing that
the decision whether a word is historic or modern is not trivial.
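The coverage figures in table 5.6 follow from a straightforward set computation. The sketch below shows the idea; the lexicons here are tiny invented stand-ins for the Kunlex lexicon and the DBNL entries:

```python
def coverage(corpus_tokens, modern_lexicon, thesaurus_entries):
    """Report how many unique corpus words are not in the modern
    lexicon, and how many of those have a thesaurus entry."""
    unique = set(corpus_tokens)
    historic = unique - set(modern_lexicon)   # words assumed historic
    covered = historic & set(thesaurus_entries)
    return len(unique), len(historic), len(covered)

tokens = ["heer", "boeck", "kennis", "plaets", "boeck"]
modern = {"heer", "kennis"}          # stand-in for the Kunlex lexicon
thesaurus = {"boeck", "quaet"}       # stand-in for DBNL entries
print(coverage(tokens, modern, thesaurus))  # (4, 2, 1)
```

The middle number corresponds to the ‘Not in Kunlex’ column and the last to the ‘DBNL coverage’ column of table 5.6.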

5.3.1 HDR evaluation


As an external evaluation, the performance of the DBNL thesaurus was tested
in a historic document retrieval experiment. For a description of the experiment,
see section 4.3. Instead of using rewrite rules, the documents and query were
translated using the DBNL thesaurus. As the results in table 5.5 show, apart
from the word to word thesaurus, the thesauri contain many nonsense entries.
Therefore, only the word to word thesaurus was used. The original words in the
historic documents were replaced by one of the related words from the DBNL
thesaurus. For query translation, all the entries containing a query word as a
translation were added to the query. The original query words were kept in the
query as well. The effect of translation was compared to other standard IR
techniques. Table 5.7 contains the results of translation both with and without
stemming on the known-item topics.

Method              Avg. prec.   Avg. prec.   Avg. prec.
                    D only       D+T          T only
Baseline            0.2192       0.1955       0.1568
Stemming            0.2125       0.2352       0.1749
4grams              0.2366       0.2538       0.2457
Decompounding       0.2356       0.2195       0.1795
DBNL Doc            0.1098       0.1262       0.1546
DBNL Query          0.0860       0.1324       0.1597
DBNL Doc + stem     0.0902       0.1250       0.1321
DBNL Query + stem   0.1389       0.1730       0.1847

Table 5.7: HDR results for known-item topics using DBNL thesaurus

Translation of the descriptions is disastrous for retrieval performance, al-
though stemming compensates a little. For the titles, translation works slightly
better. Whereas the baseline shows a decline in performance when adding titles
and removing descriptions, query translation shows the opposite behaviour. One
of the reasons might be that the number of words in the query is fairly small
when using only titles, compared to the descriptions.
Table 5.8 displays the average number of words in the descriptions and titles
for both topic sets. Method None represents the original queries, before trans-
lation. The DBNL thesaurus more than doubles the number of words in the
descriptions, but as the description also contains high frequency words, and the
thesaurus also contains translations for high frequency words, adding so many
translations apparently does more harm than good. Only modern stop words (the
high frequency words, which add little content to the query) are removed from
the query, but historic translations are added before this happens. The titles
don’t contain any stop words, thus through translation none are added. The
titles contain mostly low frequency content words. Adding historic synonyms
of these words, and stemming all the query words afterwards, improves performance.
Topic 7 gives a good example:

• Description: Kan een eigenaar van onroerend goed zijn verhuurde pand
zomaar verkopen, of heeft hij nog verplichtingen ten opzichte van de hu-
urder?
• Title: eigenaar onroerend goed verhuurde pand verkopen verplichtingen
huurder

This is the same query after adding translations:

• Description: kan ken koon mach magh een een eigenaar van onroerend
goed aertigh welzijn binnen cleven sinnen verstrekken verhuurde pand
paan panckt zomaar verkopen veylen of heeft hij deselve sulcke versoecker
nog nach verplichtingen ten opzichte van de huurder huurling
• Title: eigenaar onroerend goed aertigh wel verhuurde pand paan panckt
verkopen veylen verplichtingen huurder huurling

Not all translations added to the title are good, but most of them are related
to the topic. As for the description, many totally unrelated historic words are
added that will not be recognized as stop words (mach, magh, koon, sulcke).
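The query translation step described above amounts to a reverse lookup in the word to word thesaurus: every historic entry whose modern translation occurs in the query is added, and stop word removal happens only afterwards, which is why historic variants of stop words survive. A minimal sketch (the thesaurus entries here are invented examples):

```python
def expand_query(query, hist_to_mod, stop_words=frozenset()):
    """Add every historic word whose modern translation occurs in
    the query; remove modern stop words afterwards, so historic
    additions survive (mirroring the behaviour described above)."""
    expanded = list(query)
    for hist, mods in hist_to_mod.items():
        if any(m in query for m in mods):
            expanded.append(hist)
    return [w for w in expanded if w not in stop_words]

# Toy historic-to-modern entries, invented for illustration.
thesaurus = {"veylen": ["verkopen"], "paan": ["pand"], "magh": ["kan"]}
print(expand_query(["kan", "pand", "verkopen"], thesaurus,
                   stop_words={"kan"}))
# ['pand', 'verkopen', 'veylen', 'paan', 'magh']
```

Note how the modern stop word ‘kan’ is removed but its historic variant ‘magh’ stays in the query, just as in the topic 7 example.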

topics       method   words in   words in
                      title      descr.
expert       None     3.52       11.05
expert       dbnl     5.52       18.43
expert       phon     4.29       16.38
expert       rules    4.14       14.05
known item   None     4.36       11.68
known item   dbnl     8.60       24.48
known item   phon     7.44       19.32
known item   rules    6.68       16.20

Table 5.8: Average number of words in the query using query translation methods

Because the combination of query translation and stemming works better
than query translation alone, it would be interesting to combine query translation
with the other monolingual techniques. Another approach could be to combine
the scores of retrieval runs. By giving the ranked list of each retrieval approach

Method              D only   D+T      T only
Baseline            0.3396   0.4289   0.4967
Stemming            0.3187   0.3778   0.4206
4grams              0.3037   0.3465   0.3821
Decompounding       0.3307   0.4228   0.4900
DBNL Doc            0.2246   0.2326   0.3442
DBNL Query          0.2749   0.3696   0.4632
DBNL Doc + stem     0.2095   0.2574   0.2917
DBNL Query + stem   0.2705   0.3316   0.3848

Table 5.9: HDR results for expert topics using DBNL thesaurus

a specific weight, the final relevance score of a document is the weighted sum
of the relevance scores of each approach. Documents that are considered relevant
by two different approaches, say, retrieval using 4-grams and retrieval using
the DBNL thesaurus, are, on average, ranked higher on the combined list. The
reasoning behind this is that if several approaches retrieve the same document,
there is a fair chance that that document is actually relevant.
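A minimal sketch of such a weighted combination of runs follows; the document scores and weights are invented, and in practice each run would come from a retrieval engine:

```python
def combine_runs(runs, weights):
    """Weighted sum of per-run relevance scores: documents retrieved
    by several approaches accumulate score and rise in the ranking."""
    combined = {}
    for run, w in zip(runs, weights):
        for doc, score in run.items():
            combined[doc] = combined.get(doc, 0.0) + w * score
    return sorted(combined, key=combined.get, reverse=True)

# Invented scores for two hypothetical runs.
fourgram_run = {"d1": 0.9, "d2": 0.4}
dbnl_run = {"d2": 0.8, "d3": 0.5}
print(combine_runs([fourgram_run, dbnl_run], [0.5, 0.5]))
# ['d2', 'd1', 'd3']
```

Document d2 is ranked first because both runs retrieve it, even though neither run ranks it highest on its own.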
As was mentioned in section 4.3, the HDR results of the advanced techniques
for the expert topics show no improvement over the baseline. The same holds for
document and query translation using the DBNL thesaurus. The expert queries
contain specific 17th century words from the documents, making query expan-
sion redundant for a large part. It is still interesting to see that, consistent with
the known-item results, query translation works better than document transla-
tion and stemming afterwards has a negative effect. Although the monolingual
methods perform better on the descriptions, translation of the titles seems to
work better than stemming or 4-gram matching. And again, adding translations
to the descriptions decreases performance significantly.

5.4 Phonetic transcriptions


Although historic words are often spelled differently from their modern counter-
parts, in many cases their pronunciation is the same. This fact can be effectively
used to construct a dictionary of equal sounding word pairs. For Dutch, a few
algorithms are available to convert strings of letters to strings of phonemes (see
section 3.2.1). The algorithm for building a dictionary using phonetic transcrip-
tions is very easy. First, convert the historic lexicon lex_hist into a historic pro-
nunciation dictionary Pdict_hist, and the modern lexicon lex_mod into Pdict_mod.
Next, for all historic words w_hist in Pdict_hist, look up the phonetic transcription
PT(w_hist) in the modern dictionary Pdict_mod. Pair w_hist with all words w_mod
for which the phonetic transcription PT(w_mod) is equal to PT(w_hist), and add
each pair to the final dictionary.
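The construction procedure can be sketched as follows. The toy transcription function is only a stand-in for a real Dutch letter-to-phoneme converter, which the actual experiments use:

```python
def build_phonetic_dict(hist_lex, mod_lex, transcribe):
    """Pair each historic word with all modern words that share its
    phonetic transcription (the construction described above)."""
    mod_by_sound = {}
    for w in mod_lex:
        mod_by_sound.setdefault(transcribe(w), []).append(w)
    return {h: mod_by_sound[transcribe(h)]
            for h in hist_lex if transcribe(h) in mod_by_sound}

# Toy stand-in for a real letter-to-phoneme algorithm: a few
# historic spellings mapped to their modern-sounding equivalents.
def toy_transcribe(word):
    for old, new in [("ae", "a"), ("gh", "g"), ("ck", "k")]:
        word = word.replace(old, new)
    return word

print(build_phonetic_dict(["klaghen", "boeck"], ["klagen", "boek"],
                          toy_transcribe))
# {'klaghen': ['klagen'], 'boeck': ['boek']}
```

Grouping the modern lexicon by transcription first keeps the lookup for each historic word constant time, so the whole construction is linear in the lexicon sizes.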
This approach can also be combined with the rewrite rules to improve upon
the final thesaurus. After applying rewrite rules to the historic lexicon, the
rewritten words will (probably) be more similar in spelling to the corresponding
modern words. Through rewriting, the pronunciation of a word may change.
Since letter-to-phoneme algorithms are based on modern pronunciation rules,
the phonetic transcription of the historic word klaeghen will be different from
the transcription of its corresponding modern word klagen, since the modern
pronunciation of ae is different from the modern pronunciation of a (they may
have been the same in 17th century Dutch). Thus, if after rewriting, klaeghen
has become klaghen, the phonetic transcription will be equal to that of klagen.
Converting the lexicon to a pronunciation dictionary again, and repeating the
construction procedure, will result in new pairings.
Of course, words that are pronounced the same are not necessarily the same
words (consider eight and ate). This is where the edit distance clearly helps in
distinguishing between spelling variants and homophones (if the homophones
are orthographically dissimilar enough).6
Because the phonetic transcriptions contain some errors, and because the
pronunciation of some vowel sequences has changed over time, the phonetic
transcriptions before and after rewriting were evaluated by randomly selecting
and checking 100 entries for correctness. The whole experiment was done
twice to get a more reliable indication. If the numbers of correct and incorrect
transcriptions show a big difference between the first and the second time, a
bigger sample, or more iterations, are needed to get a better indication. If the
numbers vary only slightly, their average gives a fair indication of the total
number of correct and incorrect transcriptions. The results in Table 5.10 show
a significant improvement in the quality of the transcriptions. Before rewriting,
the phonetic dictionary contains 4592 entries, and 16% of all transcriptions are
different from their real pronunciation. Only 2% of all the 11,592 entries after
rewriting have incorrect phonetic transcriptions. The phonetic dictionary after
rewriting (using the combined rule set RNF+RSF+PSS) contains the original
historic words as entries, but the modern words were matched with the phonetic
transcriptions of the rewritten forms of the historic words. The word aengaende
was first rewritten to aangaande. Then, aengaende is matched with a modern
word that has the same phonetic transcription as aangaande.
Not only does rewriting affect the number of historic words that are phonetically
similar to their modern forms, it also decreases the number of wrong phonetic
matches. The historic ae sequence is no longer matched with the modern ee
sequence, but with aa. The same goes for the historic sequences ey and uy, which
were matched with the sequence ie in modern words before rewriting, and are
respectively matched with ei and ui afterwards.

5.4.1 HDR and phonetic transcriptions


As a way of evaluating the effectiveness of mapping words using phonetic tran-
scriptions, the same HDR experiment as described in section 4.3 and the pre-
6 If two homophones are orthographically similar, a spelling variant of one of them could

just as easily be a spelling variant of the other.



Phonetic     total     incorrect   perc.
dictionary   entries   entries     incorrect
                       of 100
normal        4592     15 / 17     16
rewritten    11592      0 / 4       2

Table 5.10: Incorrect transcriptions in 2 samples of 100 randomly selected entries, before and after rewriting

Method                      Avg. prec.   Avg. prec.   Avg. prec.
                            D only       D+T          T only
Baseline                    0.2192       0.1955       0.1568
Stemming                    0.2125       0.2352       0.1749
4grams                      0.2366       0.2538       0.2457
Decompounding               0.2356       0.2195       0.1795
Phonetic Doc                0.2642       0.2901       0.2609
Phonetic Query              0.2458       0.2511       0.2474
Phonetic Doc + stemming     0.2645       0.3054       0.2502
Phonetic Query + stemming   0.1911       0.2153       0.1983

Table 5.11: HDR results for known-item topics using phonetic transcriptions

vious section was conducted. Instead of using rewrite rules to translate queries
or documents, the phonetic transcription dictionary was used. The results are
shown in Table 5.11. For this experiment, the stop word list was extended with
phonetic variants of stop words taken from the phonetic dictionary.
The results of translating the queries are comparable to the use of 4-grams
in the monolingual approach, and, similar to the effect of rewriting (see Table 4.6
in the previous chapter), stemming the translated queries has a negative effect
for the same reason. The historic words often have historic suffixes that are
unaffected by the stemmer, thus conflation of morphological variants is minimal.
Document translation shows the best results for all different queries (D only,
D+T and T only). But now, the effect of stemming is minimal.
The number of phonetically equal words added to the descriptions and titles
is smaller than the number of related words added by the DBNL thesaurus.
Although the phonetic dictionary adds spelling variants of modern stop words
to the query, these are removed again, because the list of modern stop words was
extended with their phonetic historic counterparts. Therefore, the performance
of query translation for the descriptions is comparable to query translation for
the titles. Combining them does not affect average precision much.

Method               D only   D+T      T only
Baseline             0.3396   0.4289   0.4967
Stemming             0.3187   0.3778   0.4206
4grams               0.3037   0.3465   0.3821
Decompounding        0.3307   0.4228   0.4900
Phonetic Doc         0.2719   0.3213   0.4178
Phonetic Query       0.3037   0.4137   0.4920
Phon. Doc + stem     0.2581   0.3063   0.3373
Phon. Query + stem   0.2913   0.3638   0.4176

Table 5.12: HDR results for expert topics using phonetic transcriptions

Another interesting observation is that combining descriptions and titles leads
to a significant increase in precision for document translation. The original
queries of topic 3 are:

• Description: Hoe wordt de hypotheekrente afgehandeld bij de verkoop


van een pand?

• Title: hypotheekrente afgehandeld verkoop pand

Adding phonetic transcriptions results in:

• Description: hoe wordt wort de hypotheekrente afgehandeld bij bei bey


by de verkoop vercoop vercoope vercope van vaen een ehen pand pandt

• Title: hypotheekrente afgehandeld verkoop vercoop vercoope vercope


pand pandt

In both the title and the description, 3 spelling variants for verkoop (sale)
and 1 for pand (house) are added. The spelling variants of the stop words bij, van and een
were removed because of the extended stop word list.

5.5 Edit distance


Similarity between words can also be measured using the edit distance algo-
rithm. At each place in a word, a character can be deleted, inserted or sub-
stituted (the same as delete + insert). The edit distance of two words is equal to
the cost of changing one word into the other. Deleting and inserting cost 1 step,
substitution costs 2 (equal to 1 delete and 1 insert), unless the character to be
substituted is equal to the substitute, in which case the cost is 0. The more
similar two words are, the lower the cost. This technique is often used for spell
checking. The algorithm can be adjusted to account for the distance between
two keys on a keyboard. Accidentally hitting a key next to the intended one
occurs more often than hitting one on the other end of the keyboard. A bigger
distance between two characters on a keyboard results in a higher substitution
cost.

Similar
characters
b, p
d, t
f, v
s, z
y, i
y, ie
y, ij
g, ch
c, k
c, s

Table 5.13: Phonetically similar characters
But the similarity between historic words and their modern variants is not
based on the distance between keys on a keyboard, but on their similarity in
pronunciation. Thus, the algorithm can instead be adjusted to take into account
the similarity of pronunciation. A c can be pronounced as a k or as an s. Thus,
substituting a c for an s should be lower in cost than substituting a c for a
t. Adjusting the cost of substitutions has to be done carefully. The cost of
substituting c for t should not be increased unless the cost of deleting and
inserting characters is increased as well. Otherwise, the algorithm will prefer
deleting + inserting over substituting, resulting in the same cost as before the
adjustment. Instead, lowering the cost for substituting phonetically similar
characters, the algorithm will prefer substituting c for s over deleting c and
then inserting t.
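The cost scheme just described (insert and delete cost 1, substitution cost 2, match cost 0) can be written down with the usual dynamic-programming formulation [24]. A minimal Python sketch for illustration (the thesis scripts themselves are written in Perl):

```python
def edit_distance(a, b):
    """Edit distance with insert/delete cost 1 and substitution cost 2
    (a substitution counts as a delete plus an insert; a match costs 0)."""
    m, n = len(a), len(b)
    # dist[i][j] = cost of turning a[:i] into b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst = 0 if a[i - 1] == b[j - 1] else 2
            dist[i][j] = min(dist[i - 1][j] + 1,          # delete
                             dist[i][j - 1] + 1,          # insert
                             dist[i - 1][j - 1] + subst)  # substitute/match
    return dist[m][n]
```

With this scheme, vercoop and verkoop are at distance 2 (one c/k substitution), and pand and pandt at distance 1 (one insertion).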

5.5.1 The phonetic edit distance algorithm


The Phonetic Edit Distance (PED) algorithm is an adjusted version of the basic
edit distance algorithm. In the standard version, every substitution adds 2 to
the total edit distance unless the two characters under consideration are equal.
The PED version differentiates between substituting two phonetically similar
characters and two phonetically dissimilar characters. If an ‘s’ is substituted
for a ‘z’, the edit distance is increased by 0.5, but if an ‘s’ is substituted for a
‘j’ the edit distance is increased by 2. The characters listed in Table 5.13 are
considered phonetically similar.
The edit distance is increased by 2 when the first character of the combi-
nations ‘ie’, ‘ij’ or ‘ch’ is substituted for the phonetic equivalent. The PED
algorithm then decreases the edit distance by 1.5 if the second character of the
combination is substituted for the phonetic equivalent as well. In total, after
the two substitutions, 0.5 is added to the edit distance.
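As a sketch of this cost scheme, the following Python fragment implements PED with only the single-character pairs from Table 5.13; the multi-character pairs (y/ie, y/ij, g/ch) and the 1.5 discount for their second character are omitted for brevity, and the thesis implementation itself is in Perl:

```python
# Phonetically similar single-character pairs from Table 5.13
# (the multi-character pairs y/ie, y/ij and g/ch are omitted here).
SIMILAR = {frozenset(p) for p in [("b", "p"), ("d", "t"), ("f", "v"),
                                  ("s", "z"), ("y", "i"), ("c", "k"),
                                  ("c", "s")]}

def subst_cost(x, y):
    """0 for a match, 0.5 for a phonetically similar pair, 2 otherwise."""
    if x == y:
        return 0.0
    if frozenset((x, y)) in SIMILAR:
        return 0.5
    return 2.0

def phonetic_edit_distance(a, b):
    """Edit distance with the PED substitution costs."""
    m, n = len(a), len(b)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = float(i)
    for j in range(1, n + 1):
        dist[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + subst_cost(a[i - 1], b[j - 1]))
    return dist[m][n]
```

Under these costs, vercoop and verkoop are at distance 0.5 rather than 2, so phonetically plausible variants rank well above dissimilar words.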
The main problem with the edit distance algorithm is that it is a costly opera-
tion. Since the historic and modern corpora easily consist of several thousand,
or even tens of thousands of words, finding the closest historic match of a modern
word requires a huge amount of computation [25]. A solution to this problem
is to use a coarse grained selection method like n-gram matching first, which
can reduce the number of candidates under consideration. The word-retrieval
experiment (section 4.2) showed that candidate selection using n-grams works
well, especially after applying rewrite rules. With an n-gram size of 2, over 90%
of the historic variants from the test set were found in the top 20 candidates.
However, most of the words in the list of candidates are not historical variants
of the modern word. As the results in Table 4.5 show, only half of the variants
are found at rank 1, and for most modern words, there are only 2 or 3 spelling
variants found in the entire historic corpus. This is where the fine grained selec-
tion of edit distance can be put to good use. Using the phonetic version of the
edit distance algorithm, the 20 candidates can be reranked according to their
phonetic similarity. The historical variants should be ranked higher than all the
other words in the list of candidates. Of course, if there are multiple historic
variants of the modern word, the historical variant from the test set need not be
at rank 1. There might be another spelling variant that is phonetically closer
to the modern word. Thus, the precision @1 will not be much higher, but the
precision @5 should increase (there are very few modern words that have more
than 5 historical spelling variants in the corpus).
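The coarse-grained selection step can be sketched as follows. Dice overlap of padded character bigrams is used here as a plausible stand-in for the n-gram matching of section 4.2; the function names and the exact scoring are illustrative assumptions, not the thesis code:

```python
def bigrams(word):
    """Character bigrams of a word, padded with '#' boundary markers."""
    w = f"#{word}#"
    return [w[i:i + 2] for i in range(len(w) - 1)]

def ngram_candidates(modern_word, historic_lexicon, top=20):
    """Coarse-grained selection: rank historic lexicon words by the
    Dice coefficient of their bigram sets with the modern word."""
    q = set(bigrams(modern_word))

    def dice(w):
        c = set(bigrams(w))
        return 2 * len(q & c) / (len(q) + len(c))

    return sorted(historic_lexicon, key=dice, reverse=True)[:top]
```

The resulting short list can then be reranked with the fine-grained (phonetic) edit distance, so the expensive comparison is only run on 20 or so words instead of the whole lexicon.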
The list of phonetically similar characters could probably be extended, but
this is already an improvement over the standard edit distance algorithm, as can
be seen in Table 5.14. ED stands for the standard edit distance algorithm, PED
is the phonetic version and RR means the rewritten forms of the original historic
words were used for n-gram retrieval using the combined RNF+RSF+PSS rule
set. Recall scores @10 are roughly the same for ED and PED, but at recall @5
the differences become more significant.
After reranking, the recall @5 is much closer to the recall @20 (see Table 5.14).
Actually, the increase in precision makes it interesting to retrieve more than
20 word-forms. The problem with PED is that it is computationally expensive
to find the closest match in an entire lexicon. But for 20 or even 100 words
this is no problem. The increase in recall when retrieving 100 words can be
transformed into an increase in recall 5 through reranking using PED. In the
n-gram retrieval part, once the ranked list of candidates is calculated, selecting
the first 100 words takes negligibly more time than selecting the first 20 words.

N-gram   Pre-      recall
size     process   @20    @10    @5     @1
2                  0.853  0.783  0.691  0.322
2        ED        0.853  0.830  0.720  0.322
2        PED       0.853  0.843  0.802  0.436
2        RR        0.925  0.889  0.832  0.488
2        RR+ED     0.925  0.908  0.850  0.519
2        RR+PED    0.925  0.920  0.892  0.563
3                  0.798  0.706  0.600  0.277
3        ED        0.798  0.782  0.707  0.326
3        PED       0.798  0.791  0.768  0.430
3        RR        0.908  0.867  0.815  0.493
3        RR+ED     0.908  0.894  0.851  0.523
3        RR+PED    0.908  0.905  0.881  0.565

Table 5.14: Results of historic word-form retrieval using PED reranking

5.6 Conclusion

The DBNL thesaurus can be used effectively in query expansion, if the stop word
list is extended with historic variants as was done for the phonetic dictionary,
and with a better note extraction algorithm, the word to phrase and phrase
to word translations might become useful as well. The downside is that the
construction of this thesaurus depends on manually added word translation
pairs. Automatically extracting them correctly is difficult, and the only historic
words for which a translation is given are the ones that are deemed important
by the DBNL editors. The modern translations of the words that they find
important enough to translate might not be the words that are posed as query
words by the user.
By combining historic Dutch documents with modern Dutch documents, and,
more importantly, by increasing the corpus size, the use of word clustering algo-
rithms can become an important method for bridging the vocabulary gap. As it
stands, the vocabulary gap remains the more difficult bottleneck of the two, as
the spelling gap is partly bridged by the rewrite rules from the previous chapter
and the phonetic dictionary and PED reranking procedure in this chapter.
The phonetic variants dictionary is effective, but only after rewriting. The
phonetic transcriptions of the original historic words are not always correct, thus
by replacing these transcriptions with the transcriptions of the rewritten words,
many historic words are no longer paired with the wrong modern word. The
advantage of matching historic and modern words with phonetic transcriptions
over using rewrite rules is that non-typical historic character sequences (like
‘cl’ in clacht) are not rewritten incorrectly (clausule should not be rewritten
to klausule). The phonetic dictionary only replaces whole words, not parts of
words. Thus, clacht will be replaced with klacht, but the historic word clausule
is matched with the modern word clausule, and is thus retained in the lexicon.
The performance of word-retrieval can be greatly improved by reranking
the candidate list using the Phonetic Edit Distance algorithm. The number of
candidates can then be reduced to 3 or 5 words, and the remaining list can be
used for query expansion. It has yet to be tested on a HDR experiment though.
Chapter 6

Concluding

We’ve seen, in the previous chapters, that language resources can be constructed
from nothing but plain text. They can be used effectively for HDR, and might
also be used as stand-alone resources to improve readability. This chapter con-
cludes this research, and will try to answer the questions from the first chapter.
Some future directions are given as well.

6.1 Language resources for historic Dutch


The first chapter posed some research questions. They are repeated here and
an attempt at answering them is made.

• Can resources be constructed automatically? The methods de-


scribed in chapters 3, 4 and 5 have shown that language resources for
Dutch historic document retrieval can be constructed automatically us-
ing nothing but plain historic Dutch text. The HDR experiments and the
word-form retrieval experiments have shown that these language resources
can be used effectively to find historic Dutch word-forms of modern Dutch
words, and also significantly improve HDR performance.

• Can (automatic) methods be used to solve the spelling problem?

– Can rewrite rules be generated automatically using corpus


statistics and overlap between historic and modern variants
of a language? The generation, selection and application of rewrite
rules can be done automatically, and with good results. The RNF
and RSF algorithms work well in finding modern spelling variants
of typical historic character sequences. Both methods use plain text
corpora and exploit the overlap between historic and modern Dutch.
– Are rewrite rules a good way of solving the spelling bot-
tleneck? As the results of combining and iterating the methods
have shown, after rewriting the most important historic character


sequences, they no longer produce any useful rules. However, the


typically historic sequences caused most of the problems for the pho-
netic transcriptions. Once they have been modernized, the grapheme
to phoneme converter produces much better results. So, in answer
to the question, they are a good first step in solving the spelling
bottleneck for 17th century Dutch.
– Can historic Dutch be treated as a corrupted version of
modern Dutch, and thus be corrected using spelling correc-
tion techniques? Using a spell checker shows acceptable results,
but the main problem is that a modern word for each historic word
must be selected manually from a list of suggestions. If the correct
word is not in the list of alternatives, further manual correction is
needed. A language-independent and automatic solution is the use
of n-gram matching to retrieve similar word-forms. This produces a
list of historic spelling variants for modern words. It has yet to be
tested whether the inverse procedure, finding similar modern word-
forms for historic words, works as well. Using n-gram matching as a
coarse-grained search, and edit distance, or its phonetic variant, as a
fine-grained search, the list of candidates can be reduced further.

• What are the options for automatically constructing a thesaurus


for historic languages? The vocabulary gap is still a big problem. Most
of the resources and methods described are solutions to the spelling prob-
lem. Only the DBNL thesaurus and the word co-occurrence classifications
are aimed at the vocabulary gap, and neither is a good solution at this
moment. The DBNL thesaurus contains many nonsense entries, and only
contains manually constructed translation pairs. Extending it to cover
new words depends on the knowledge and effort of experts. As for the
co-occurrence thesaurus, its application in HDR seems a long way off. To
get better classifications, much more text is needed, and even then, the
semantic distinctions are probably too coarse-grained to make them useful
for query expansion.

• Is the framework for constructing resources a language-independent
(general) approach? The same word-form retrieval experiment
described in [18] works for historic Dutch documents. This supports the
claim by Robertson and Willett that their methods are general, language
and period independent. The word-form retrieval method uses only n-
gram information, which is language independent. The rewrite rule gen-
eration methods RNF and RSF can be added to the list of language inde-
pendent techniques. Even without using a manually constructed selection
set, using the MM selection criterium, which is a language independent
methods as well, the rules selected can help modernize historic spelling (at
least for historic Dutch). Further improvements can be made by reranking
the candidates using the PED algorithm, which can increase the precision
at a certain level, or alternatively, increase recall at lower levels. The PED

algorithm is language dependent; the characters are phonetically similar


in Dutch, but not necessarily in all languages. Although the edit distance
algorithm is less effective, it is language independent.
The other resources, the PSS algorithm, the phonetic thesaurus and the
DBNL thesaurus are specific to Dutch. The PSS algorithm and the pho-
netic thesaurus use a grapheme to phoneme conversion tool that must be
designed specifically for each language. The DBNL thesaurus consists of
manually constructed word translation pairs.

• Can HDR benefit from these resources? The experiments described


in [1] show that HDR can gain from several techniques, some treating HDR
as a monolingual approach, others, including the techniques and language
resources for historic Dutch, treating HDR as a CLIR approach. The
retrieval results show that rewriting the historic Dutch documents to a
more modern Dutch is a very effective way to improve performance. After
rewriting, the gap between 17th century Dutch and modern Dutch has
become smaller. The monolingual approach of stemming document and
query words is much more effective after the documents are translated.

6.2 Future research


Resources for 17th century Dutch can help HDR, but there is still much that
can be improved upon. Since the problems for 17th century Dutch have been
split into two main issues throughout this research, directions for future work
will follow these two paths.

6.2.1 The spelling gap


It seems that the spelling bottleneck is not the main problem anymore, although
there are still some techniques that could be improved, like the PED algorithm,
and the phonetic transcription tool.
The phonetic transcription tool from Nextens is designed for modern Dutch,
with its many loan words from English, French, German and other languages. It
may be clear that, although the overlap between historic and modern Dutch is
in pronunciation, there are some differences in pronunciation as well. Taking this
into account, the rules for transcribing a sequence of characters into a sequence
of phonemes can be adjusted for 17th century Dutch. The pronunciation of
ae is one of the main problems, but phenomena like double vowels and double
consonants also form a major hurdle in matching historic and modern words.
Making specific rules for their transcription will probably solve most of these
problems.
The PED algorithm can be adjusted in two main ways. First off, the list
of phonetically similar characters can be extended, and maybe improved. For
instance, the characters ‘b’ and ‘p’ are pronounced similarly in certain contexts,
like the end of a word (both are pronounced as a ‘p’ in Dutch). But at the

beginning of a word, they sound different. The algorithm could be changed to


use context information when judging the similarity of pronunciation. Second,
the current cost function might not be optimal. Right now, the substitution of
similar characters costs less than deleting or inserting a character. Different cost
functions can be tested. It would be interesting to see how well this algorithm
works on historic variants of other languages. Maybe the cost function should
depend on the specific language for which it is used. Another approach would
be to use the normal edit distance algorithm on the phonetic transcriptions of
words.
Also, making a spelling variations dictionary is not trivial, and it has not
been tested either. If the number of spelling variations is unknown, how to
determine which candidates are actual spelling variants, and which are not,
might be a difficult problem.
As far as the rewrite rules are concerned, the effect of a rule set on document
collections from different periods can be investigated further. The results in
Table 4.8 show that the generated rules still work for documents written slightly
earlier or slightly later than the documents that were used to generate the rules
from. However, if the difference in age gets larger (between the documents
from which the rules are generated, and the documents on which the rules are
applied), the performance of the rules will probably decrease. For documents
in Middelnederlands1 the differences with modern Dutch are far bigger, not
only in spelling but also in pronunciation. The gap might even be too large to
be bridged by rewrite rules. For more recent documents, the gap is so small
that rewrite rules based on typical historic character sequences are not effective
any more because there are almost no character sequences that are typical of
the historic documents. After the introduction of official spelling rules, the
differences between historic Dutch and contemporary Dutch are very small,
making the RSF and RNF algorithm redundant. It would be interesting to
see the difference in performance on document collections from a specific period
between rules generated from documents dating from the same period and rules
generated from documents dating from another period. For other languages
the results can be completely different, but spelling often changes gradually,2
so these effects should be very similar in other languages.

6.2.2 The vocabulary gap


As mentioned earlier, the hardest problem of the two is the vocabulary gap. The
resources constructed to bridge this gap are far from being usable. The DBNL
thesaurus contains too much noise, and the co-occurrence thesaurus would too,
if low-frequency words were classified as well.
The quality of the DBNL thesaurus can be improved by using a better
extraction algorithm. Apart from that, the list of 17th century books at the
1 The Dutch language between 1200 and 1500.
2 Except, possibly, for the introduction of official spelling rules, which can have a significant
effect on spelling.

DBNL website is expanded regularly. These new entries also contain notes and
translation pairs, so the thesaurus could be updated with new entries as well.
The construction of a historic synonym thesaurus using mutual information
seems infeasible at this moment. An enormous amount of text would be re-
quired, and even then, the clusters will still show more syntactic structure than
semantic structure. Large clusters of nouns are almost impossible to split into
semantically related subclusters if no more than plain text corpora are available.
Once syntactically annotated 17th century Dutch corpora are available, classi-
fication based on bigram frequencies might become useful to cluster synonyms.
For HDR purposes, it would be interesting to see the effect of mixing historic
and modern Dutch corpora. If the historic and modern words in a cluster are
semantically related, the historic words could be added to modern query words
from the same cluster.
Finally, if the co-occurrence based thesaurus improves in quality, it could be
combined with the DBNL thesaurus. The DBNL thesaurus could be used to test
the quality of the co-occurrence thesaurus (if it is based on a mix of historic and
modern Dutch). The historic word and its modern translation should be in the
same cluster, or at least, close to each other in the classification tree.

As it stands, the attempts at bridging the vocabulary gap have led to lit-
tle more than plans for building a real bridge. The bridge over the spelling
gap, although still a bit shaky, seems to have reached the other side. Language
resources are now available for historic Dutch, most of them automatically gen-
erated, and possibly useful for other languages as well.
Bibliography

[1] Adriaans, F. (2005). Historic Document Retrieval: Exploring strategies


for 17th century Dutch

[2] Braun, L. (2002). Information Retrieval from Dutch Historical Corpora

[3] Brown, P.F.; Della Pietra, V.J.; deSouza, P.V.; Lai, J.C.; Mercer, R.L.
(1992). Class-based n-gram models of natural language in Computational
Linguistics, volume 18, number 4, pp. 467-479

[4] Crouch, C.J., Yang, B. (1992). Experiments in automatic statistical the-


saurus construction

[5] Dagan, I.; Lee, L.; Pereira, F.C.N. (1998). Similarity-based models of word
cooccurrence probabilities in Machine Learning, Volume 34, number 1-3

[6] Hall, P.A.V., Dowling, G.R. (1980). Approximate string matching in Com-
puting Surveys, Vol 12, No.4, December 1980

[7] Hollink, V.; Kamps, J.; Monz, C.; de Rijke, M. (2004). Monolingual doc-
ument retrieval for European languages in Information Retrieval 7, pp.
33-52

[8] Hull, D.A. (1998). Stemming algorithms: A case study for detailed evalu-
ation in Journal of the American Society for Information Science, volume
47, issue 1

[9] Jing, Y.; Croft, W.B. (1994). An association thesaurus for information
retrieval in Proceedings of RIAO, pp. 146-160

[10] Kamps, J.; Fissaha Adafre, S.; de Rijke, M. (2005). Effective Translation,
tokenization and combination for Cross-lingual Retrieval

[11] Kamps, J.; Monz, C.; de Rijke, M.; Sigurbjörnsson, B. (2004). Language-
dependent and language-independent approaches to Cross-Lingual Text
Retrieval In Comparative Evaluation of Multilingual Information Access
Systems, CLEF 2003, volume 3237 of Lecture Notes in Computer Science,
pages 152-165. Springer, 2004.


[12] Kraaij, W. & Pohlmann, R. (1994). Porter’s stemming algorithm for


Dutch
[13] Lam, W., Huang, R., Cheung, P.-S. (2004). Learning phonetic similarity
for matching named entity translations and mining new translations in
Proceedings of the 27th annual international conference on Research and
development in information retrieval, 289-296
[14] Li, H. (2001). Word clustering and disambiguation based on co-occurrence
data in Natural Language Engineering 8(1), pp. 25-42
[15] Lin, D. (1998). Automatic retrieval and clustering of similar words in
Proceedings of COLING/ACL-98, pp. 768-774
[16] McMahon, J.G.; Smith, F.J. (1996). Improving statistical language model
performance with automatically generated word hierarchies in Computa-
tional Linguistics, Volume 22, number 2.
[17] McNamee, P.; Mayfield, J. (2004). Character n-gram tokenization for Eu-
ropean language text retrieval in Information Retrieval, 7, 2004, pp. 73-97
[18] Robertson, A.M.; Willett, P. (1992). Searching for Historical Word-Forms
in a Database of 17th-century English Text Using Spelling-Correction
Methods
[19] Salton, G.; Yang, C.S.; Yu, C.T. (1975) A theory of term importance in
automatic text analysis in Journal of the American Society for Information
Science.
[20] Salton, G. (1986). Another look at automatic text-retrieval systems in
Communications of the ACM, volume 29, number 7
[21] Tiedemann, J. (1999). Word alignment step by step in Proceedings of the
12th Nordic Conference on Computational linguistics, pp. 216-227
[22] Tiedemann, J. (2004). Word to word alignment strategies in Proceedings
of the 20th International Conference on Computational Linguistics,
[23] van der Horst, J., Marschall, F. (1989). Korte geschiedenis van de Neder-
landse taal Sdu Uitgevers, Den Haag
[24] Wagner, R.A.; Fischer, M.J. (1974). The string-to-string correction prob-
lem in Journal of the ACM, Vol. 21, number 1, pp. 168-173
[25] Zobel, J.; Dart, P. (1995). Finding approximate matches in large lexicons
in Software-practice and experience, Vol 25(3), March 1995, pp. 331-345
[26] Zobel, J.; Dart, P. (1996). Phonetic string matching: lessons from infor-
mation retrieval from Proceedings of the 19th International Conference
on Research and Development in Information Retrieval
Appendix A - Resource
descriptions

Each of the resources and methods to construct them are described in more
detail here. Each section covers a resource and its associated algorithms.

Appendix A1 - Phonetic dictionary


The phonetic dictionary (section 5.4) is a plain text file, each line contain-
ing a unique historic word and its phonetically matched modern equivalent. The modern
words are not unique, as a number of historic spelling variants are translated to
the same modern form. For the historic words, the phonetic transcriptions of
their rewritten forms are used to match them with the phonetic transcriptions of
modern words. This is done to solve the biggest problems with the change in
pronunciation (the sequence ae in particular). In total, there are 11,592 entries.
This example shows the format (historic word, tab, modern word) of the
dictionary:

aengeclaeght aangeklaagd
aengecomen aangekomen
aengedaen aangedaan
aengedraegen aangedragen
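Reading such a tab-separated file into a lookup table is straightforward; a Python sketch for illustration (the thesis resources are plain text files processed by Perl scripts), which accepts an open file or any iterable of lines:

```python
def load_dictionary(lines):
    """Parse 'historic<TAB>modern' lines (the format shown above) into
    a dict mapping each historic word to its modern equivalent."""
    mapping = {}
    for line in lines:
        historic, modern = line.rstrip("\n").split("\t")
        mapping[historic] = modern
    return mapping
```

Since the historic words are unique, the historic side can serve as dictionary key; several historic keys may map to the same modern value.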

Appendix A2 - DBNL dictionary


The DBNL dictionary (section 5.3) is also a plain text file, each line containing
a dictionary entry and its translation. The entries and their translations can be
single words or phrases. As Table 5.5 showed, the word to word entries are by
far the most useful. Some statistics are repeated here:

Dictionary        number of     unique   number of
                  translations  entries  synonyms
word to word      36,505        20,281   1.8
word to phrase    26,445        16,649   1.6
word to either    62,950        36,930   1.7
phrase to word    5,589         4,931    1.1
phrase to phrase  42,680        35,127   1.2
phrase to either  48,269        40,058   1.2
total             111,219       68,384   1.6

Table 1: DBNL dictionary sizes

Some entries have multiple translations; therefore, the last column shows the
average number of synonyms for each entry. The format of the DBNL dictionary
is equal to the phonetic dictionary format (historic word, tab, modern word):

begosten aanvingen
begote overgoten
begoten bespoeld
begraeut afgesnauwd

Appendix A3 - Rewrite rule sets


The rewrite rules are generated from corpus statistics and phonetic information
(section 3.2). The three rewrite rule generation algorithms PSS, RSF and RNF
are explained in sections 3.2.1, 3.2.3 and 3.2.5. The rules can be applied to the
lexicon of the document collection to obtain a dictionary of spelling moderniza-
tion. The rewritten forms are not necessarily existing modern words, because
the word can still be a historic word with a modern spelling (see the examples
below).

gerichtschrijversampt gerichtschrijverzambt
gerichtschrijverseydt gerichtschrijverzeid
gerichtscosten gerichtskosten
gerichtsdach gerichtsdag
gerichtsdaege gerichtsdage

The rules generated by PSS and RSF are different from the rules generated
by RNF, because of the vowel/consonant restrictions. The historic antecedent of
these rules consist of a historic sequence and a context restriction. For instance,
a historic vowel sequence should match a historic word if the vowel sequence
is surrounded by consonants. The word vaek matches the vowel sequence ae,
while the word zwaeyen doesn’t, because its full vowel sequence is aeye. To
make sure that the rule ae doesn’t match zwaeyen, the antecedent is extended
with context wildcards:
[bcdfghjklmnpqrstvwxz]ae# → a [aeiouY]lcx[aeiouY]
The antecedent part of the rule is actually a regular expression and brack-
eted consonant wildcard [bcdf ghjklmnpqrstvwxz] indicates that the character
sequence ae must be preceded by one of these consonants. The word boundary
character # indicates that the ae sequence must be at the end of the word.3
3 In Perl, this word boundary character can be replaced by a dollar sign ($), which matches
the preceding regular expression only at the end of the string.



The uppercase Y in the second rule above is used as a replacement for the
Dutch diphthong ij, because the j would otherwise be recognized as a consonant.
Therefore, all occurrences of ij in words and sequences are replaced by Y in the
RSF and PSS algorithms.
The RNF algorithm doesn’t have this consonant/vowel restriction; it will
match any occurrence of the historic antecedent. Context information is more de-
tailed for longer n-grams:
ae → aa    bae → baa    bael → baal    baele → bale
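Since the antecedents are regular expressions, applying a rule amounts to a single substitution. A Python sketch with a hypothetical contextual rule (‘ae’ between consonants becomes ‘aa’, with the captured context retained); the thesis scripts use Perl, where the # boundary becomes $:

```python
import re

CONSONANTS = "bcdfghjklmnpqrstvwxz"
# Hypothetical contextual rule: the vowel sequence 'ae' surrounded by
# consonants is rewritten to 'aa'; the matched context is kept via
# capture groups and backreferences.
RULE = (rf"([{CONSONANTS}])ae([{CONSONANTS}])", r"\1aa\2")

def apply_rule(word, rule=RULE):
    """Apply one rewrite rule (antecedent regex, consequent) to a word."""
    antecedent, consequent = rule
    return re.sub(antecedent, consequent, word)
```

As in the example above, vaek matches this rule (it becomes vaak), while zwaeyen does not, because the ‘y’ after ‘ae’ counts as a vowel.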
Appendix B - Scripts

The PSS algorithm requires two lists of mappings from words to phonetic tran-
scriptions. One list with historic words and phonetic transcriptions, and one
list for modern words and phonetic transcriptions. The phonetic alphabet that
is used is not important, as long as both lists use the same phonetic alphabet.
The output is a plain text file, where each line contains a rewrite rule and its
PSS score. The PSS score is the MM-score (the maximal match score, described
in section 3.3.1).
The RSF and RNF algorithms both require two word frequency indices, one
from a historic corpus and one from a modern corpus. These indices are plain
text files with each line containing a unique word from the corpus, and its corpus
frequency.
These three algorithms are implemented in Perl, simply named ‘PSS.pl’,
‘RSF.pl’ and ‘RNF.pl’. They use only standard packages, plus some helper
scripts that are included in the resources package.
Other important algorithms are:

• mapPhoneticTranscriptions.pl: This expects two lists containing words


and their phonetic transcriptions, and gives as output a dictionary of
words with the same pronunciation.

• PED.pl: This is a package of subroutines, of which ped is the main subroutine


that expects two strings as input and returns the phonetic edit distance
as output.

• RDV.pl: This contains the subroutine rdv, an implementation of the


Reduce Double Vowels algorithm. It needs a string as input and returns
as output the string after reducing redundant double vowels.

• selectRules.pl and selectMethods.pl: The selectRules.pl script is


an executable script that allows you to select rules from a rule set us-
ing a specific selection method (section 3.3). When executing, it needs
three arguments, the number of the selection method, the name of the file
containing the rule set, and finally the name of the output file where the
selected rules are stored. The selection criteria are implemented in the
selectMethods.pl script.


• applyRules.pl: This is a package of subroutines for applying a set of


rules to a string or a list of strings.
• createTestSet.pl: This script requires three arguments, a text file from
which words are randomly selected, a filename for the test set, and a
filename for a list of words to skip. The nice thing about this approach is
that the test set can be constructed once, and then extended later. The
skip file contains a list of words that were already presented to the user
in an earlier iteration, and were discarded. The user is presented with
randomly selected words from the text file, and if an alternative spelling is
given by the user, the original word and its alternative spelling are added
to the test set. If no alternative spelling is given, the word is added to
the list of discarded words. In a second run of the script (with the same
filenames for the test set and the skip file), the words from the test set
and the skip file are not presented to the user.
• buildIndex.pl: This expects three arguments as input. The first argu-
ment is a flag indicating whether the second argument is a text file, or
a file containing a list of text files, if multiple text files are to be indexed.
The third argument is the filename for the resulting index, containing the
unique words from the text file(s), and their collection frequencies.

For more information on or access to the scripts, send an e-mail to the


author.
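The word frequency indices that RSF.pl and RNF.pl consume can be sketched as a simple counting pass; this is an illustrative Python version of what buildIndex.pl produces (unique words with their collection frequencies), not the thesis script itself:

```python
from collections import Counter

def build_index(texts):
    """Count collection frequencies over a list of plain-text strings;
    each unique (lower-cased) word maps to its frequency, mirroring the
    one-word-per-line index format described above."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts
```

Writing the counter out one "word frequency" pair per line yields the plain-text index format the rule generation algorithms expect.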
Appendix C - Selection and
Test set

The selection and test set (section 3.4.1) are in the same format as all the other
word lists and dictionaries. Each line in the test set file consists of a historic
word, a tab, and the modern spelling of the historic word (again, not necessarily
an existing modern word). To clarify this, a few entries are given here:

sijnen zijn
sijns zijns
silvere zilvere
simmetrie symmetrie
sin zin
singen zingen
singht zingt
sinlijckheyt zinnelijkheid
sinnen zinnen
sinplaets zinplaats

The historic word sinplaets is spelled as zinplaats in modern Dutch, although
zinplaats is not an existing modern word.4

4 At least, it is not listed in the ‘Van Dale - Groot woordenboek der Nederlandse taal.’
