Constructing language resources for historic document retrieval

Published by: marijnkoolen on Sep 30, 2010
Copyright:Attribution Non-commercial


Constructing Language Resources for Historic Document Retrieval

MSc thesis, Artificial Intelligence

Marijn Koolen
mhakoole@science.uva.nl
June 2, 2005

Supervisors: Prof. Dr. Maarten de Rijke, Dr. Jaap Kamps
Informatics Institute
University of Amsterdam


Abstract
The aim of this research is to investigate the possibility of constructing language resources for historic Dutch documents automatically. The specific problems for historic Dutch, when compared to modern Dutch, are the inconsistency in spelling and the aged vocabulary. Finding relevant historic documents using modern keywords can be aided by specific resources that add historic variants of modern words to the query, or by resources that translate historic documents to modern language. Several techniques from Computational Linguistics, Natural Language Processing and Information Retrieval are used to build language resources for Historic Document Retrieval on Dutch historic documents. Most of these methods are language independent. The resulting resources consist of a number of language independent algorithms and two thesauri for 17th century Dutch, namely a synonym dictionary and a spelling dictionary based on phonetic similarity.


Acknowledgements
I’d like to express my gratitude towards Jaap Kamps and Maarten de Rijke for their guidance and supervision during this research. They’ve read numerous versions of this thesis without losing patience or hope, and were always quick with advice when needed. I’d like to thank Frans Adriaans for the brainstorming sessions getting both our projects started, and for the discussions on science that somehow always shifted to discussions on music.


Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Document retrieval
  1.2 Historic documents and IR
  1.3 Constructing language resources
  1.4 Outline

2 Historic Documents
  2.1 Language variants or different languages?
  2.2 The gap between two variants
  2.3 Bridging the gap
  2.4 Resources for historic Dutch
  2.5 Corpora
  2.6 Corpus problems
  2.7 Measuring the gap
  2.8 Spelling check

3 Rewrite rules
  3.1 Inconsistent spelling & rewrite rules
    3.1.1 Spelling bottleneck
    3.1.2 Resources
    3.1.3 Linguistic considerations
  3.2 Rewrite rule generation
    3.2.1 Phonetic Sequence Similarity
    3.2.2 The PSS algorithm
    3.2.3 Relative Sequence Frequency
    3.2.4 The RSF algorithm
    3.2.5 Relative N-gram Frequency
    3.2.6 The RNF algorithm
  3.3 Rewrite rule selection
    3.3.1 Selection criteria
  3.4 Evaluation of rewrite rules
    3.4.1 Test and selection set
  3.5 Results
    3.5.1 PSS results
    3.5.2 RSF results
    3.5.3 RNF results
  3.6 Conclusions
    3.6.1 problems
    3.6.2 The y-problem

4 Further evaluation
  4.1 Iteration and combining of approaches
    4.1.1 Iterating generation methods
    4.1.2 Combining methods
    4.1.3 Reducing double vowels
  4.2 Word-form retrieval
  4.3 Historic Document Retrieval
    4.3.1 Topics, queries and documents
    4.3.2 Rewriting as translation
  4.4 Document collections from specific periods
  4.5 Conclusions

5 Thesauri and dictionaries
  5.1 Small parallel corpora
  5.2 Non-parallel corpora: using context
    5.2.1 Word co-occurrence
    5.2.2 Mutual information
  5.3 Crawling footnotes
    5.3.1 HDR evaluation
  5.4 Phonetic transcriptions
    5.4.1 HDR and phonetic transcriptions
  5.5 Edit distance
    5.5.1 The phonetic edit distance algorithm
  5.6 Conclusion

6 Concluding
  6.1 Language resources for historic Dutch
  6.2 Future research
    6.2.1 The spelling gap
    6.2.2 The vocabulary gap

Appendix A - Resource descriptions
Appendix B - Scripts
Appendix C - Selection and Test set

List of Tables

2.1 Categories of historic words
3.1 Comparative recall for English historic word-forms
3.2 Comparative recall for Dutch historic word-forms
3.3 Corpus statistics for modern and historic corpora
3.4 Edit distance example 1
3.5 Edit distance example 2
3.6 Edit distance example 3
3.7 Edit distance example 4
3.8 Edit distance baseline
3.9 Manually constructed rules on test set
3.10 Results of PSS on test set
3.11 Results of RSF on test set
3.12 Results of RNF on test set
3.13 Different modern spellings for historic y
4.1 Results of iterating RSF and RNF
4.2 Results of combined rule generation methods
4.3 Lexicon size after rewriting
4.4 Results of RDV on test set
4.5 Results of historic word-form retrieval
4.6 HDR results using rewrite rules
4.7 HDR results for expert topics
4.8 Results on test sets from different periods
5.1 Classification of frequent English words
5.2 Classification of frequent historic Dutch words
5.3 Classification of frequent modern Dutch words
5.4 DBNL dictionaries
5.5 Simple evaluation of DBNL thesaurus
5.6 DBNL thesaurus coverage of corpora
5.7 HDR results for known-item topics using DBNL thesaurus
5.8 Analysis of query expansion
5.9 HDR results for expert topics using DBNL thesaurus
5.10 Evaluation of phonetic transcriptions
5.11 HDR results for known-item topics using phonetic transcriptions
5.12 HDR results for expert topics using phonetic transcriptions
5.13 Phonetically similar characters
5.14 Results of historic word-form retrieval using PED
1 DBNL dictionaries

Chapter 1

Introduction
1.1 Document retrieval

An Information Retrieval (IR) system allows a user to pose a query, and retrieves documents from a document collection that are considered relevant given the words in the query. A basic IR system retrieves those documents from the collection that contain the most query words. The drawback of retrieving only documents that contain query words is that, often, not all relevant documents will be retrieved: some relevant documents may not contain any of the query words at all. Many techniques can be used to improve upon the basic system, for example expanding the query with related words or using approximate word matching methods. The aim (and the main challenge) of these techniques is to increase the number of retrieved relevant documents without increasing the number of retrieved irrelevant documents. This is a difficult task, but significant improvements have been made by using several language dependent and language independent resources.
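The basic system described above, and the effect of expanding a query with related words, can be illustrated with a minimal Python sketch. The function names, the three toy documents and the historic spellings in them are hypothetical illustrations, not data or code from this thesis, and whitespace tokenization is assumed for simplicity.

```python
def tokenize(text):
    """Lower-case the text and split it on whitespace (a deliberate simplification)."""
    return text.lower().split()


def rank_by_overlap(query, documents):
    """Rank documents by the number of distinct query words they contain.

    Documents that contain no query word at all are not retrieved.
    """
    query_words = set(tokenize(query))
    scores = {}
    for doc_id, text in documents.items():
        overlap = query_words & set(tokenize(text))
        if overlap:
            scores[doc_id] = len(overlap)
    # Highest score first; ties are broken alphabetically by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def expand_query(query, related_words):
    """Append the related words of every query word to the query."""
    words = tokenize(query)
    expanded = list(words)
    for word in words:
        for related in related_words.get(word, []):
            if related not in expanded:
                expanded.append(related)
    return " ".join(expanded)


# Hypothetical toy collection; the historic spellings are invented for illustration.
documents = {
    "modern": "hollandse gerechten uit de moderne keuken",
    "historic": "hollantsche gherechten uyt de oude keucken",
    "other": "een tekst over schepen",
}
# Hypothetical resource that maps modern words to historic variants.
related = {"hollandse": ["hollantsche"], "gerechten": ["gherechten"]}

plain_ranking = rank_by_overlap("hollandse gerechten", documents)
expanded_ranking = rank_by_overlap(
    expand_query("hollandse gerechten", related), documents
)
```

In this toy setting the plain query retrieves only the modern document, while the expanded query also retrieves the historic one. The irrelevant document is retrieved in neither case, which is the behaviour the techniques in this thesis aim for.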

1.2 Historic documents and IR

IR systems often use external resources to improve retrieval performance. Stemmers, for example, are used to map words into a standard form, so that morphologically different forms can be matched [7]. Resources are also used in Cross-Language Information Retrieval (CLIR), where bilingual dictionaries are used to translate query terms [10]. Different amounts of resources are available for different languages.

Through increased performance of OCR (optical character recognition) techniques, and dropping costs, more and more historic documents become digitally available. These documents are written in a historic variant of a modern language. Often, the spelling and vocabulary of a language have changed over time. For these historic language variants, very few resources are digitally available. Although the performance of OCR has increased, the linguistic resources used for automatic correction are based on modern languages. These correction methods might not work for older texts. Thus, for many historic documents, digitization requires manual correction of OCR errors.

But the problems don’t end here. Once a document has been digitized correctly, the historic spelling and vocabulary still form a problem for linguistic resources based on modern languages. Therefore, this thesis focuses on automatically constructing linguistic tools for historical variants of a language. These tools can then be used for Historic Document Retrieval, which aims at retrieving documents from historical corpora. The tools will be used to construct resources for 17th century Dutch. Given that only generic techniques are used, they should also provide a framework for other periods and other languages.

1.3 Constructing language resources

The aim of this research is to construct language resources to be used for historic document retrieval (HDR) in Dutch. Previous research by Braun [2], and Robertson and Willett [18] has shown that specific resources for historic texts can improve IR performance. Robertson and Willett treated historic English as a corrupted version of modern English, using spelling-correction methods to find historic spelling variants of modern words. Braun focused on heuristics, exploiting regularities in historic spelling to develop rewrite rules which transform historic word forms into modern word forms. These rewrite rules were developed manually, since the inconsistency of spelling was deemed too problematic for automatic generation of rules. The rewrite rules can significantly improve retrieval performance. Therefore, the problem of automatic rule generation will be investigated, considering approaches from phonetics, computational linguistics and natural language processing (NLP).

In both [2] and [18], the focus is on historic spelling. However, Braun identified a second bottleneck, namely the vocabulary gap. Some historic words no longer exist. Some modern words didn’t exist yet (like telephone or bicycle), and other words still exist but have a different meaning. To tackle this problem, a thesaurus might be used. However, no such thesaurus is digitally available, so to solve the vocabulary bottleneck one has to be constructed, either manually or automatically.

This research will thus be focused on the following research questions:

• Can historic language resources be constructed automatically?
• Is the framework for constructing resources a language independent (generic) approach?
• Can HDR benefit from these resources?

The first question can be split into two more specific questions, based on Braun’s observation about the spelling problem and the vocabulary problem:

• Can (automatic) methods be used to solve the spelling problem?
• What are the options for automatically constructing a thesaurus for historic languages?

For the spelling problem, Braun and Robertson and Willett have already mentioned two methods, rewrite rules and spelling correction techniques. But there may be other options to align two temporal variants of a language. Therefore, this question can be made more specific:

• Can rewrite rules be generated automatically using corpus statistics and overlap between historic and modern variants of a language?
• Are the generated rewrite rules a good way of solving the spelling bottleneck?
• Can historic Dutch be treated as a corrupted version of modern Dutch, and thus be corrected using spelling correction techniques?

The available methods will be tested independently and as a combined approach. Parallel to this research, Adriaans [1] evaluates the retrieval side of HDR in much more detail. The methods and thesauri developed in this project will be used in his retrieval experiment as an external evaluation. If these techniques are found to be useful, this will result in a number of language resources, some of which are language (and period) dependent, while others are language independent.

The main drawbacks of manual construction of resources are the need for expert knowledge in the form of historic language experts, and the huge amount of time it takes to construct the resources. Automatic construction exploits statistical correlations and regularities in a language. Therefore, expert knowledge is no longer essential, and the time it takes to build resources is greatly reduced. Another advantage is that, if automatic construction is effective, the same techniques might be used for several different languages. As the aforementioned articles have shown, IR performance for both 17th century English documents and 17th century Dutch documents can be increased by attacking the spelling variation. The techniques in this research should be language independent, making them useful for both Dutch and English, and perhaps other languages for which historic documents pose the same problems.

1.4 Outline

The next chapter will elaborate on the distinction between historic and modern Dutch documents, and some available historic Dutch document collections will be described. It will show that information retrieval on historic documents is different from retrieval on modern document collections. Chapter 3 will discuss in detail the automatic construction of rewrite rules using phonetics and statistics, and their effectiveness on historic documents. Several different methods are described and compared with each other and with the rules from [2]. Further extensions and combinations of these methods, and further evaluations, follow in chapter 4, including a document retrieval experiment. Chapter 5 will investigate the possibility of building a thesaurus to find synonyms among historic words, using various techniques, and other ways of solving the spelling problem are put to the test. In the final chapter, conclusions are drawn from the conducted experiments, and some guidelines for the future will be given.

Chapter 2

Historic Documents
Historic Documents are documents written in the past. Of course, this holds for all documents. But since spoken and written language changes continuously, a century old Dutch document is written in a form of Dutch that is different from a document written two weeks ago. The changes are not spectacular, but they are there all the same. Using a search engine on the internet to find documents on typical Dutch food with the keywords Hollandse gerechten (English: Dutch dishes) may retrieve a text written in 1910 containing both words. The keywords are normal in modern Dutch, but also in early 20th century Dutch. What the search engine probably won’t find is a website containing hundreds of typically Dutch recipes from the 16th century, although this website does exist (see section 2.5, the Kookhistorie corpus). The historic texts contain historical spelling variants of the modern words Hollandse gerechten. This problem is caused by the fact that the change from 16th century Dutch to modern Dutch is spectacular. Although the number of digitized 16th century documents is small, through the increasing interest from historians and funding from national governments for digitizing historic documents,¹ this number is growing rapidly. The aforementioned problem, the gap between a modern keyword and its relevant historic variants, becomes increasingly important.

Going back further in time, the differences between modern Dutch and Middle Dutch as used in the late Middle Ages (1200 – 1500) are even bigger. In fact, between 1200 and 1500, Dutch was not a single language, but a collection of dialects. Each dialect had its own pronunciation, and spelling was often based on this pronunciation [23]. Between geographical regions the spelling differed. Due to the union of smaller independent countries and increasing commerce, a more uniform Dutch language emerged after 1500.² As contacts between regions increased, spelling was less and less based on pronunciation, becoming more consistent. In the 17th century, the Dutch translation of the Bible, the Statenbijbel, together with books by famous Dutch writers like Vondel and Hooft, were considered well-written Dutch, bringing about a more consistent and systematic spelling. Since there was no official spelling (which wasn’t introduced in the Netherlands until 1804), there were still many acceptable ways of spelling a word [23].

¹ See, for example, the CATCH (Continuous Access To Cultural Heritage) project. This is funded by the Dutch government to make historic material from Dutch cultural heritage publicly accessible in digital form, thereby preserving the fragile material.
² For a more detailed description of the changes in language between 1200 and 1500 (in Dutch), see http://www.literatuurgeschiedenis.nl

2.1 Language variants or different languages?

The Dutch language is related to the German language, yet we consider them to be different languages. A native German speaker will recognize certain words in a Dutch document, but might have problems understanding what the text is about. A bilingual person, speaking both German and Dutch, could translate the document into German, making it easy for the former reader to understand it. The same will probably hold for a native Dutch speaker reading a document written in Middle Dutch. A Middle Dutch expert could translate the document into modern Dutch, making it more readable. But for documents written after 1600, the historic language expert is no longer needed (or at least to a much lesser degree). Knowledge of modern Dutch gives enough of a handle on 17th century Dutch documents for native speakers to understand most of the text. It seems there is a shift from two different languages to a language together with a certain “dialect”. This makes 17th century Dutch more or less the same language as modern Dutch, from an information retrieval (IR) perspective. If 17th century Dutch can be seen as a “strange” dialect of modern Dutch, its overlap with modern Dutch might be used to bridge the gap that exists between the two temporal variants.

2.2 The gap between two variants

But where do the two variants overlap, and where do they differ? Braun, in [2], identified two main bottlenecks for IR from historic documents. The first bottleneck is the difference in spelling. Not only is 17th century spelling different from modern spelling, it is also less consistent. A word has only one officially correct spelling in modern Dutch (although many documents do contain some variation, caused by spelling errors, changes in the official spelling or stubbornness), where older Dutch has many acceptable spelling variations for the same word. The other bottleneck is the vocabulary gap. The modern word koelkast (English: refrigerator) did not exist in the 17th century. In the same way, the historic word opsnappers (English: people celebrating) cannot be found in any modern Dutch dictionary. Some words are no longer used, new words are created daily, and yet other words have changed in meaning. The fact that 17th century documents are still readable shows that the grammar has changed very little, so this is probably not an issue (most IR systems ignore word order anyway).

Here is an example of the difference between historic and modern Dutch. The following historic text is a paragraph taken from the “Antwerpse Compilatae”, a collection of law texts written in 1609, describing laws and regulations for the region of Antwerpen. The full text describes how a captain should load a trader’s goods, and what his responsibilities towards these goods are at sea:

9. Item, oft den schipper versuijmelijck waere de goeden ende coopman-schappen int laden oft ontladen vast genoech te maecken, ende dat die daerdore vuijtten taeckel oft bevanghtouw schoten, ende int water oft ter aerden vielen, ende alsoo bedorven worden oft te niette gingen, alsulcke schade oft verlies moet den schipper oock alleen draegen ende den coopman goet doen, als voore.

10. Item, als den schipper de goeden soo qualijck stouwt oft laeijt dat d’eene door d’andere bedorven worden, ghelijck soude mogen gebeuren als hij onder geladen heeft rosijnen, alluijn, rijs, sout, graen ende andere dierghelijcke goeden, ende dat hij daer boven op laeijt wijnen, olien oft olijven, die vuijtloopen ende d’andere bederven, die schaede moet den schipper oock alleen draegen ende den coopman goet doen, als boven.

The modern Dutch version (the author’s own interpretation) would look like this:³

9. Idem, als de schipper verzuimd de goederen en koopmanswaren in het laden of uitladen vast genoeg te maken, en dat die daardoor uit een takel of vangtouw schieten, en in het water of ter aarde vallen, en zo bederven of te niet gaan, zulke schade of verlies moet de schipper ook alleen dragen en de koopman vergoeden, als hiervoor.

10. Idem, als de schipper de goederen zo kwalijk stouwt of laadt dat het ene door het andere bedorven wordt, gelijk zou kunnen gebeuren als hij onder geladen heeft rozijnen, ui, rijst, zout, graan en andere, dergelijke goederen, en dat hij daar boven op laadt wijnen, olieen of olijven, die uitlopen en de andere bederven, die schade moet de schipper ook alleen dragen, en de koopman vergoeden, als boven.

³ The word order is retained to make it easier to compare both texts. Although this word order is readable for native Dutch speakers, it is somewhat strange. Apparently, grammar has changed somewhat as well.

This is a translation in English, again the author’s own:

9. Equally, if the shipper neglects to properly secure the goods during loading or unloading, causing them to fall in the water or on the ground and thereby spoiling them, he should repay the damage to the trader.

10. Equally, if the shipper stacks or loads the goods in such a manner that one of the goods spoils another, as could happen if he would stack wine, oil or olives on top of raisins, onions, rice, salt, grain or some such goods, where the former spoils the latter, he must repay the damage to the trader.

Comparing the historic and modern Dutch sentences shows that the biggest difference is in spelling. Some words are still the same (schipper, bederven, alleen, water), but most words have changed in spelling. The changes in vocabulary are visible in the change from goet doen to vergoeden (English: repay). There are also some morphological/syntactical changes, like versuijmelijck (negligent) to verzuimd (neglects). It is probably easier to attack the spelling bottleneck first. To solve the second bottleneck, the vocabulary gap, a thesaurus is needed to translate historic words into modern words or the other way around. If a method can be found to identify pairs of modern and historic words that have the same meaning, such a thesaurus can be constructed. But if spelling is not uniform, one spelling variant of a historic word might be matched with a modern one, while another spelling variant is missed. Solving the spelling bottleneck first, thereby standardizing spelling for historic documents, may therefore also make it easier to find word translation pairs for a thesaurus.

2.3 Bridging the gap

In Robertson & Willett [18] n-gram matching is used successfully to find historic spelling variants of modern words. Thus, the lack of specific resources might not be a problem. However, many IR systems for modern languages use a stemming algorithm (see [8]) to conflate morphological variants to the same stem. The Porter stemmer⁴ is one of the most popular stemmers available for many different modern languages, with a specific version for each language (a Dutch version is described in [12]). Because modern languages are consistent in spelling, stemmers can be very effective. The Porter stemmer for Dutch would conflate the words gevoelig (sensitive), gevoeligheid (sensitivity) and gevoelens (feelings) to the same stem gevoel (feeling). Using gevoelig as a query word, documents containing the word gevoelens will also be considered relevant. When spelling is inconsistent, only some word forms would be stemmed. Using the Porter stemmer for modern Dutch would only affect modernly spelled historic words. The historic words gevoel, ghevoelens and gevuelig (English: feeling, feelings and sensitive) would be stemmed to the stems gevoel, ghevoel and gevuel respectively. By standardizing spelling (i.e. making it consistent), these three word-forms will be stemmed to the same stem.

⁴ http://www.tartarus.org/~martin/PorterStemmer/

Another fairly standard technique in modern IR is query expansion. This means adding related words to the keywords in the query. In historic documents, some of the words related to a modern keyword might be historic words that no longer exist. Although some of these historic words could be very useful for expanding queries, the lack of knowledge about them makes it impossible to use them effectively. A thesaurus relating these words to modern words would solve this problem. From this perspective, the historic language can be seen as a different language from the modern language, and the retrieval task becomes a so-called Cross-Language Information Retrieval (CLIR) task (see [11] and [10] for an analysis of the main problems and approaches in CLIR).

But can spelling be standardized with nothing but a collection of historic documents? And is it possible to make a thesaurus using the same limited document collection? Of course, it is possible to do everything by hand (see sections 3.1.1 and 5.3). But this is very time consuming, and different language experts might have different opinions on what the best modern translation would be. Automatic generation, if at all possible, might be more error prone. But as modern IR systems have shown [7], sub-optimal resources can still be very useful for finding relevant documents. Although there was no standard way of spelling a word in 17th century Dutch, the possibilities of spelling a word based on pronunciation are not infinite. In fact, there are only a few different spellings for a certain vowel. Corpus statistics can be used to find different spelling variants by looking at the overlap of context. Also, techniques have been developed to group semantically related words based purely on corpus statistics. If this can be done for modern languages, it might work with historic languages as well.
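The n-gram matching mentioned at the start of this section can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than the method of [18]: it uses character bigrams with word-boundary padding and the Dice coefficient, and the function names, the padding symbol and the threshold value are all chosen here for the example.

```python
def char_ngrams(word, n=2):
    """Return the character n-grams of a word, padded with '#' at both ends."""
    padded = "#" * (n - 1) + word.lower() + "#" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def dice_similarity(word_a, word_b, n=2):
    """Dice coefficient over the sets of character n-grams of two words."""
    grams_a = set(char_ngrams(word_a, n))
    grams_b = set(char_ngrams(word_b, n))
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def find_variants(modern_word, historic_lexicon, threshold=0.6, n=2):
    """Return historic words whose similarity to the modern word reaches the threshold.

    The threshold of 0.6 is an arbitrary value chosen for this illustration.
    """
    scored = [(word, dice_similarity(modern_word, word, n)) for word in historic_lexicon]
    hits = [(word, score) for word, score in scored if score >= threshold]
    # Most similar first; ties are broken alphabetically.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Word forms taken from the examples in this section, plus one unrelated word.
historic_lexicon = ["gevuelig", "ghevoelens", "schipper"]
variants = find_variants("gevoelig", historic_lexicon)
```

With this lexicon, the modern word gevoelig is matched with the historic spelling gevuelig, while the unrelated word schipper shares no bigrams with it. The morphologically related ghevoelens falls below the chosen threshold, which illustrates why spelling variation and morphological variation are treated as separate problems.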

2.4

Resources for historic Dutch

Which kind of resources are needed to standardize spelling, and which are needed to bridge the vocabulary gap? In [2], rewrite rules are used to map spelling variants to the same form. By focussing on rewriting affixes to modern Dutch standards, more morphological variants could be conflated by a stemmer. Some rules where constructed for rewriting the stems as well, to conflate stems that were spelled in various ways (like gevoel and gevuel). These rules where constructed manually, because the spelling was considered to be too inconsistent to do it automatically. But this inconsistency can actually be exploited to construct rules automatically. The pronunciation of two spelling variations of a word is the same. In historic documents, the word gelijk (English: equal) is often spelled as gelijck or ghelijck. In the same way, gevaarlijk (dangerous) is often spelled as ghevaerlijck or gevaerlijck. By matching words based on their pronunciation, spelling variations can be matched as well. In both cases, lijck is apparently pronounced the same as lijk, and ghe is pronouced the same as ge. If there are more historic words showing the same variations, it seems reasonable to rewrite lijck to lijk. But if historic word-forms can be matched with their modern variants through pronunciation, why would we need rewrite rules? Well, not all historic words will be matched with a modern word. For instance, the word versuijmelijck (see the short historic text on loading cargo on a ship) is not pronounced like any modern Dutch word. This is because the morphology of the word has changed. The modern variant would probably be verzuimend. Here, rewriting makes sense, because, changing the suffix lijck to lijk, a stemmer


for Dutch will reduce it to versuijm. Other rewrite rules may change this to the modern stem verzuim, conflating it with all other morphological variants.

Finding historical synonyms for modern words is a problem that has so far only been tackled manually. For modern languages, techniques have been developed to find synonyms automatically (see, for instance, [3, 4, 5, 14]), using plain text or syntactically annotated corpora. Part-Of-Speech (POS) taggers exist for many languages, but not for 17th century Dutch, and annotated 17th century Dutch documents are not available either. Therefore, only those techniques that use nothing but plain text are an option. The next chapters describe the automatic generation of rewrite rules based on phonetic information, and the automatic construction of thesauri using plain text. The various approaches are listed here:

• Rewrite rule generation methods: Three different methods, based on phonetic transcriptions, syllable structure similarity and corpus statistics, will be described.
• Rewrite rule selection methods: Some of the generated rules could be bad for performance. Some language dependent and language independent selection criteria will be tested.
• Rewrite rule evaluation methods: The main evaluation will test the generated rule sets on a test set of historic and modern word pairs, and measure the similarity of the words before and after rewriting. Further evaluation is done through two retrieval experiments, one word-based, the other document-based.
• Thesauri and dictionaries for the vocabulary gap: A historic to modern dictionary will be constructed from existing translation pairs (see next section), and a historic synonym thesaurus will be constructed based on bigram information. These methods address the vocabulary gap.
• Dictionaries for the spelling gap: A dictionary based on pronunciation will be made that contains mappings from historic words to modern words with the same pronunciation.
Finally, as a way of finding historic spelling variants for modern words, the word retrieval experiment will be extended with a technique to measure the similarity of words based on phonetics. This results in a dictionary of spelling variants. Both methods try to bridge the spelling gap. But to do this, a collection of historic documents is needed. Huge document collections are electronically available for modern Dutch (especially since the birth of IR conferences like TREC⁵ and CLEF⁶), but for 17th century Dutch, documents are only sparsely available.
5 TREC URL: http://trec.nist.gov
6 CLEF URL: http://clef.isti.cnr.it/
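The affix rewriting discussed in section 2.4 can be illustrated with a minimal sketch that applies the two rewrites mentioned in the text (ghe to ge and lijck to lijk). The rule list, the anchoring of the rules to the start and end of a word, and the function name are assumptions made for this example; they are not the rule set of [2], nor the rules generated in the following chapters.

```python
import re

# Hypothetical rule list for illustration only: (pattern, replacement).
# The prefix rule is anchored at the word start, the suffix rule at the word end.
RULES = [
    (r"^ghe", "ge"),      # ghelijck -> gelijck
    (r"lijck$", "lijk"),  # gelijck  -> gelijk
]


def rewrite(word):
    """Apply every rule in order and return the rewritten word."""
    for pattern, replacement in RULES:
        word = re.sub(pattern, replacement, word)
    return word


for historic in ["ghelijck", "gelijck", "ghevaerlijck", "versuijmelijck"]:
    print(historic, "->", rewrite(historic))
```

With only these two rules, ghevaerlijck becomes gevaerlijk rather than gevaarlijk: the vowel sequence ae is left untouched, so further rules would be needed before a stemmer can conflate it with the modern form.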


2.5 Corpora

Although the Nationale Koninklijke Bibliotheek van Nederland has a large collection of historic documents, at this moment very few of them are in digital form. The resources that will be constructed use corpora of 17th century texts acquired from the internet. The following corpora were found:

• Braun corpus: This was acquired from the University of Maastricht, from the research done by Braun [2].
• Dbnl corpus: The Digitale bibliotheek voor de Nederlandse letteren⁷ stores a huge amount of Dutch historic texts. The texts used in this research are all from the Dutch ‘Golden Age’, 1550–1700. This is by far the largest corpus. Some texts were written in Latin, others are modern Dutch descriptions of the historic texts. Most of these non-historic Dutch texts were removed from the corpus. Many texts contain notes with word translation pairs. These historic/modern translations can be used to create a thesaurus for historic Dutch.
• Historie van Broer Cornelis: This is a medium-sized corpus from the beginning (1569) of the Dutch literary ‘Golden Age’, transcribed by the foundation ‘Secrete Penitentie’ as a contribution to the history of Dutch satire.
• Hooglied: A very small corpus. It is based on an excerpt from the ‘statenvertaling’, the first official Dutch bible translation. The so-called ‘Hooglied’ was put to rhyme by Henrick Bruno in 1658.⁸
• Kookhistorie: A website containing three historic cook books.⁹ There is a huge time span between the appearance of the first cook book (1514) and the last one (1669). The language of the first book is very different from that of the second (1593) and third. However, since the first cook book contains some modern translations of historic terms that also occur in the other two cook books, the translations can still be used for the thesaurus.
• Het Voorlopig Proteusliedboek: A small text transcribed by the ‘Leidse vereniging van Renaissancisten Proteus.’¹⁰

2.6 Corpus problems

The DBNL corpus contains heterogeneous texts; historic Dutch from various periods, modern Dutch, Latin, French, English. If the overlap in phonetics is to be used, the texts from all these different languages might cause problems.
7 URL: www.dbnl.nl
8 URL: http://www.xs4all.nl/~pboot/bruno/brunofs.htm
9 URL: www.kookhistorie.com
10 URL: home.planet.nl/~jhelwig/proteus/proteus.htm


The French word guy (English: guy, fellow) contains the vowel uy, but in French it is pronounced differently from historic Dutch words like uyt (English: out). Foreign words ‘contaminate’ the historic Dutch lexicon. The historic corpus will be used to find typical historic Dutch sequences of characters, so modern Dutch texts are also considered ‘foreign.’ As a preprocessing step, as many of the non-17th-century Dutch texts as possible were removed from the corpus. Because the entire DBNL corpus contains over 8,600 documents, some simple heuristics were used to find the foreign texts, so the corpus can still contain some texts that are not 17th century Dutch.

Another problem with the texts from the DBNL corpus is the period in which the texts were written. The oldest texts date from 1550, the most recent were written in 1690. In 150 years' time, the Dutch language has changed somewhat, including pronunciation and the use of some letter combinations (like uy). For instance, in the oldest texts, the uy was used to indicate that the u should be pronounced long (the modern word vuur was spelled as vuyr around 1550). In more recent texts, after 1600, the uy was often used like the modern ui, as in the example given above (uyt is the historic variant of uit). If texts from a wide-ranging period are used, generating rewrite rules will suffer from ambiguity. To minimize these problems, it is probably better to use texts from a fairly small period (20 – 50 years, for instance).

2.7 Measuring the gap

Before considering the construction of resources, it might be helpful to have at least some idea of the differences between the historic language of the corpus and the modern language. Some words were spelled differently from the current spelling, but how many words are we talking about? And how many of the words in the historic document collection are spelled as modern words? To get an indication of the differences, a sample of 500 words was picked at random from the historic collection and assessed (names were excluded since they do not contribute to the difference between two languages). Each word was assigned to one of three categories: modern (MOD), spelling variant (VAR), or historic (HIS). A word belongs to MOD if it is spelled in a modern way (according to modern Dutch spelling rules). It belongs to VAR if it is recognized as a modern word spelled in a non-modern way. If a word has some non-modern morphology, or can’t be recognized as a modern word at all, it belongs to HIS.

The word ik (English: I) is found often in historic texts, but it hasn’t changed over time. It is still used, and thus belongs to MOD. The word heyligh is recognized as a historic spelling of the modern word heilig (English: holy), and is categorized as VAR. But the word beestlijck is not recognized as a modern word. Even after adjusting its historic spelling to beestelijk, it is not a correct modern Dutch word. Taking a look at the context (V beestlijck leven) makes it possible to identify this word as a historic translation of the modern word beestachtig (English: bestial or beastly). From context, it’s not hard for a native Dutch speaker to find out what it means, but it is clear that over time, its morphology has changed (the root of the word, beest, is still the same, which is the very reason that its meaning is recognizable from context). This might not be problematic for native Dutch speakers, but it does pose a problem for finding relevant historic documents for the query term “beestachtig”. This word, and other even less recognizable words, belong to HIS.

Categorizing all 500 randomly picked words does not result in any hard facts about the gap between the two language variants, but it does give some idea about the size of the problem. The results are listed in Table 2.1. It turns out that most of the words (239 words, about 48%) are historic spelling variants of modern words. The overlap between historic and modern Dutch is also significant (177 words, 35%), leaving a vocabulary gap of 84 out of 500 words (17%). This shows that solving the problem of spelling variants bridges a large part of the gap between historic (at least 17th century) Dutch and modern Dutch.

Category    Distribution
Modern      177 (35%)
Variant     239 (48%)
Historic    84 (17%)

Table 2.1: Distribution over categories for 500 historic words

2.8 Spelling check

Robertson and Willett suggested using spelling correction methods. An approach for handling inconsistent spelling is to treat 17th century Dutch as a corrupted version of modern Dutch. A spelling checker might be able to map historic word forms to modern forms. That would take away the need to build specific resources for historic document retrieval. For instance, the historic word menschen might be identified by the spelling checker as a misspelled form of the modern word mensen. One way of testing this is to do a spelling check on several documents, and manually evaluate the spelling checker’s performance. Some spelling checkers use the context of a word to find the most probable correct word. The unix spell checker A-spell was tested on the small text snippet from section 2.2, using a modern Dutch dictionary:¹¹

9. Item, of de schipper versuijmelijck ware de goeden ende coopman-schappen int laden ende of ontladen vast genoeg te maken, ende dat.die daardoor vuijtten takel of vangtouw schoten, ende int water of ter aerden vielen, ende alsoo bedorven worden of te niette gingen, alsulcke schade of verlies moet de schipper ook alleen dragen ende de koopman goet doen, als voor.
11 Information on Aspell and the Dutch dictionary used can be found on http://aspell.sourceforge.net/

10. Item, als de schipper de goeden soo qualijck stouwt of laeijt dat d’eene door d’andere bedorven worden, gelijk soude mogen gebeuren als hij onder geladen heeft rozijnen, aluin, rijs, sout, graan ende andere dergelijke goeden, ende dat hij daar boven op laeijt wijnen, olin of olijven, die uitlopen ende d’andere bederven, die schade moet de schipper ook alleen dragen ende de koopman goet doen, als boven.

The words oft, den, genoech, maecken, taeckel, daerdore and others were recognized as misspelled words, and a list of suggestions was given for each word. These lists included the correct modern words, but the correct words were not always the most probable alternatives according to A-spell. For the word versuijmelijck no alternative was suggested, probably because it is too dissimilar to any modern Dutch word. The word goeden is a historic word for which A-spell suggested ‘goed’ (good), but not ‘goederen’ (goods). The correct suggestion for coopman-schappen (which is koopmanschappen, lit. ‘trade goods’) was not given, probably because the modernized version of the word (koopmanschappen) is not a modern word (the word koopmanschap was suggested, but this means something else, namely the business of trading). The same goes for qualijck (modern form: kwalijk) and laeijt (modern word: laadt). Also, some words are in fact historic but are not recognized as misspellings. The word niette should become niet (English: not), but is instead recognized as the past singular form of the verb nieten (English: to staple, as in stapling sheets of paper together).

Another spell checker available for Dutch is the one that comes with the MS Word text processor.¹² It suggests orthographically similar words for any unknown word in the text, and is also capable of checking grammar. This is the output after applying the correct suggestions by MS Word:

Item, oft den schipper versuijmelijck ware de goeden ende koopmanschappen int laden oft ontladen vast genoeg te maecken, en dat die daardoor vuijtten taeckel oft bevanghtouw schoten, ende int water oft ter aarden vielen, ende zo bedorven worden oft te niette gingen, alsulcke schade oft verlies moet den schipper ook alleen dragen ende den koopman goed doen, als voor.
Item, als den schipper de goeden zo kwalijk stouwt oft laeijt dat dene door dandere bedorven worden, gelijk zoude mogen gebeuren als hij onder geladen heeft rozijnen, aluin, rijs, zout, granen ende andere dergelijke goeden, ende dat hij daer boven op laeijt wijnen, oliën oft olijven, die uitlopen ende dandere bederven, die schade moet den schipper ook alleen dragen ende den koopman goed doen, als boven.

MS Word marks the word versuijmelijck as a misspelled word, but no alternatives are suggested, which happens for bevanghtouw and alsulcke as well. For some words, the correct word is suggested, as is the case for oft, ende and maecken and quite a few others. For many other words, the correct modern
12 For those unfamiliar with MS Word, see http://office.microsoft.com


word is in the list of alternatives. For the historic word alsoo it correctly suggests alzo, and afterwards suggests replacing alzo with the grammatically more appropriate word zo.

Spell checkers can be used to find correct modern words for historic words that are orthographically similar. However, for many historic words, spell checkers cannot find the correct alternative, and for some they cannot find any modern alternative at all. Moreover, each word has to be checked separately and the correct suggestion has to be selected from the list manually (the correct alternative is not always the first one in the list of suggestions). It would still take an enormous amount of time and effort to modernize historic documents for HDR in this way. A spelling check is not a good solution. It seems we do need specific resources to aid HDR.
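The behaviour described in this section, a ranked list of orthographically similar suggestions from which the right one still has to be picked, can be imitated with a small sketch. It does not use Aspell or MS Word: the similarity ranking comes from Python's standard difflib module, and the tiny modern lexicon, the cutoff value and the function name are assumptions chosen for illustration only.

```python
import difflib

# Toy modern lexicon (an assumption; a real spell checker uses a full dictionary).
MODERN_LEXICON = [
    "mensen", "maken", "genoeg", "kwalijk", "koopmanschap",
    "niet", "nieten", "goed", "goederen",
]


def suggest(word, lexicon=MODERN_LEXICON, n=3, cutoff=0.6):
    """Return up to n lexicon words ranked by string similarity to `word`."""
    return difflib.get_close_matches(word, lexicon, n=n, cutoff=cutoff)


for historic in ["menschen", "qualijck", "versuijmelijck"]:
    print(historic, "->", suggest(historic))
```

In this sketch menschen is matched with mensen, while versuijmelijck gets no suggestion at all because nothing in the lexicon is similar enough. This mirrors the limitation noted above: purely orthographic similarity fails for words whose morphology has changed.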


Chapter 3

Rewrite rules
In this chapter, the spelling bottleneck and approaches for solving it are described. The following points will be discussed:

• Inconsistent spelling & rewrite rules: The problems with inconsistent spelling, how rewrite rules can solve these problems, and what is needed.
• Rewrite rule generation: Methods for generating rewrite rules.
• Rewrite rule selection: Which rules are to be selected and applied?
• Evaluation of rewrite rules: How are the sets of rewrite rules evaluated? And how well do they perform?
• Rewrite problems: Multiple modern spellings for historic character sequences.
• Conclusions: Is automatic generation of rewrite rules an effective solution to the spelling problem?

3.1 Inconsistent spelling & rewrite rules

3.1.1 Spelling bottleneck

As mentioned in chapter 2, one of the main problems with searching in historic texts is that the word or words you are looking for can be spelled in many different ways. For example, if you are searching for texts that contain the word rechtvaardig (English: righteous), you might find it in one or two texts. But there probably are many more texts that contain the same word spelled in different ways (e.g. rechtvaerdig, reghtvaardig, rechtvaardigh and combinations of these spelling variations).


One way of solving this problem would be to expand your query with spelling variations typical of that period. But few people possess the necessary knowledge to do this. Apart from that, it is fairly time-consuming to think of all these variations, and you inevitably omit some of them. It would be far more efficient to do query expansion automatically, or to rewrite all historic documents to a standard form that matches modern Dutch closely.

Robertson and Willett [18] have shown that character-based n-gram matching is an effective way of finding spelling variants of words in 17th century English texts. Historic word forms for modern words were retrieved based on the number of n-grams they shared. All the historic words were transformed into an index of n-grams, and the 20 words with the highest score were retrieved. The score was computed using the Dice score, with N(W_i, W_j) being the number of n-grams that W_i and W_j have in common, and L(W_i) the length of word W_i:

\[ \mathrm{Score}(W_{mod}, W_{hist}) = \frac{2 \times N(W_{mod}, W_{hist})}{L(W_{mod}) \times L(W_{hist})} \tag{3.1} \]

In a historic word list containing 12191 unique words, 2620 historic words were paired with 2195 unique modern forms. Thus, each modern form had at least one corresponding historic word form. The results in Table 3.1 show the recall at the 20 most similar matches (no precision scores given in [18]).

Method             Recall
2-gram matching    94.5
3-gram matching    88.8

Table 3.1: Comparative recall for the 20 most similar matches for historic English
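The n-gram retrieval described above can be sketched as follows. This is a minimal illustration, not the implementation of [18] or [2]: it uses the standard Dice coefficient over sets of character n-grams, normalised by the sum of the two set sizes, without any padding of word boundaries. The normalisation, the use of sets rather than counts, and all function names are assumptions and may differ from equation 3.1.

```python
def ngrams(word, n=2):
    """Set of character n-grams of a word (no boundary padding)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(word_a, word_b, n=2):
    """Standard Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    grams_a, grams_b = ngrams(word_a, n), ngrams(word_b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def most_similar(modern_word, historic_words, n=2, k=20):
    """The k historic words with the highest Dice score for a modern word."""
    ranked = sorted(historic_words,
                    key=lambda w: dice(modern_word, w, n),
                    reverse=True)
    return ranked[:k]


historic_lexicon = ["rechtvaerdig", "schipper", "reghtvaardigh", "ghelijck"]
print(most_similar("rechtvaardig", historic_lexicon, n=2, k=2))
```

For the example word, the two spelling variants of rechtvaardig share most of their bigrams with the modern form and are ranked above the unrelated words schipper and ghelijck.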

Braun [2] has conducted the same experiment for 17th century Dutch. It turns out the n-gram matching performance is increased by standardizing spelling and stemming (Table 3.2). The inconsistency of spelling makes it hard to apply a stemming algorithm directly on historic documents. Therefore, spelling is standardized by applying rewrite rules on the historic words. In [2], these rewrite rules for 17th century Dutch were constructed with the help of experts. They transform the most common spelling variations to a standard spelling. Most of the variations of rechtvaardig just mentioned would be changed to the modern spelling by these rules. By rewriting different spelling variants to the same word form, and removing affixes through stemming, a fair number of word forms are conflated to the same stem. Still, constructing rules manually, using the help of experts takes a lot of effort, and experts of 17th century Dutch are not freely and widely available. More efficient, but also more error prone, are automatic, statistical methods to produce rewrite rules. In this chapter, several automatic approaches are compared with each other as well as with the rewrite rules constructed by Braun.

Retrieval method                   Comp. Recall    Precision
3-gram                             70.4            57.9
3-gram + stemming                  74.0            62.5
3-gram + rewriting                 74.8            53.7
3-gram + stemming + rewriting      82.1            57.8

Table 3.2: Comparative recall for the 20 most similar matches for historic Dutch

3.1.2 Resources

To construct rewrite rules, a collection of historic documents is needed, as well as a collection of modern documents. Without the modern documents, it would be much harder to standardize historic spelling. There are several equally acceptable ways of spelling a word in 17th century Dutch. There is no single spelling that would be better than the others. To ensure uniform rewriting, the rules have to be constructed with great care. Identifying spelling variants is only the first step. The second step is rewriting them all to the same form. For another group of spelling variants, the same standard form should be chosen. But this is far from trivial.

Consider the spelling variants klaeghen, klaegen, klaechen and claeghen. Three out of four words start with kl, so it seems sensible to choose kl as the standard form. Also, two out of four words use gh, so g and ch should become gh as well to transform all four variants into a uniform spelling. Another group of spelling variants might be vliegen, vlieghen, vlieggen, vlyegen and fliegen. In this case, rewriting fl to vl seems to make more sense than rewriting vl to fl. The same goes for ye and ie. But the next transformation should be selected more carefully. Of the three different options g, gh and gg, g occurs most often. But rewriting gh to g would be in conflict with the earlier decision to rewrite g to gh.

A far easier solution, with the goal of making resources for information retrieval in mind, is to rewrite the historic word forms to modern word forms. In that case, a standard spelling already exists, and rewriting historic spelling variants to a uniform word is done by rewriting them to the appropriate modern word. Of course, we need to find the appropriate modern form, which might not be easy at all. But we’re faced with the same problems when finding the different historic spelling variants themselves. The other advantage of rewriting to modern words becomes clear when combining it with an IR system.
Modern users pose queries in modern language. Rewriting all possible historic variants to one historic word will not make it any easier for the IR system to match it with its modern variant. Rewriting historic words to modern words means rewriting to the language of the user.

The document collections

For the historic document collection, a corpus of several large books is used. These books all date from the same period (1600 – 1620). The reason for keeping


the period small is that spelling changed over time. If a larger time span is chosen, a greater ambiguity in spelling might result in incorrect rewrite rules. The pronunciation of some character combinations in 1550 might have changed by 1600. Also, the specific period between 1600 and 1620 makes it possible to compare the generated rewrite rules with the rules constructed by Braun, since these rules were based on two law books dating from 1609 and 1620. The corpus used in this research, named hist1600, contains these same two law books, in addition to a book by Karel van Mander (Het schilder-boeck), printed in 1604.

Two of the techniques used here compare the words of the historic corpus to words in a modern corpus. The modern corpus (15 editions of the Dutch newspaper "Algemeen Dagblad") is equal in size to the historic corpus (see Table 3.3). The included editions of the newspaper were selected on date, ranging over two whole years, to make sure that not all editions cover the same topics (two successive editions often contain articles on the same topics, probably repeating content words that are otherwise infrequent).

Name                total size (number of words)    number of unique words
AC-1609             221739                          11648
GLS-1620            131183                          6977
mand001schi01       453474                          32314
Total (hist1600)    791217                          47816
Alg. Dagblad        797530                          58664

Table 3.3: Corpus statistics for modern and historic corpora

3.1.3 Linguistic considerations

Is it possible to have some idea about how well a certain method will work? Surely it would be nice to know in advance that matching variants of a word based on phonetic similarity works well. But we don’t have this knowledge. However, some observations beforehand can point in the right direction (or away from the wrong direction).

Syllable structure

One such observation is that apparently, most historic words that are recognizable spelling variants of modern words have the same syllable structure as their modern form. Each syllable in Dutch contains a vowel, and can have a consonant as onset and/or as coda. If we take the modern Dutch word aanspraak and a historic form aenspraeck, the similarity in syllable structure is obvious. For both forms the first syllable has a coda (n), the second syllable has an onset (spr) and shows a difference in the codas (k vs. ck). The vowels of the two syllables differ also (aa vs. ae). Can this be of any help in choosing a method to


attack the spelling problem? A solution can be to split the words into syllables and then make rewrite rules by mapping each historic syllable to the modern syllable. This would give the following rules:

aen → aan
spraeck → spraak

The advantage of this approach is that it will not only rewrite the word aenspraeck but also any other historic word that contains the syllable aen. What it won’t do is rewrite the word staen to staan (English: to stand), since it won’t rewrite syllables containing aen that have an onset. After reading a few sentences of a historic document it becomes clear that the vowel ae is very common in these documents. In modern documents it is not nearly as common. One problem that is immediately visible is that, to rewrite all words that contain the vowel combination ae, an enormous number of rules is needed to cover all the different syllables in which this combination can appear. And since the corpus is limited, not all possible syllables can be found. The rules need to be generalized. For instance, a rule could be: rewrite all instances of ae to aa in syllables that have a coda.

But this introduces a few problems. For native Dutch speakers, it is probably fairly easy to recognize the syllable structure of many historic words. But an automatic way of splitting a word into syllables would be based on the modern Dutch spelling rules. Since historic words are not in accordance with these rules, splitting them into syllables automatically might do more harm than good. According to modern spelling rules, the word claeghen would be split into claeg and hen, which is not what it should be (namely, clae and ghen). Redundant letters in historic words can shift the syllable boundaries, adding a coda or onset where there shouldn’t be one. To get around this problem, it is possible to split the word into sequences of vowels and sequences of consonants. The word claeghen would be split into the sequences cl, ae, gh, e and n.
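Splitting a word into alternating vowel and consonant sequences can be sketched with a single regular expression. The vowel inventory below (a, e, i, o, u and y) is an assumption made for this illustration; in particular, the j of the digraph ij is treated as a consonant here, and any character outside the vowel set (including a hyphen) ends up in a consonant sequence.

```python
import re

# Assumed vowel letters; y is included because of historic spellings like uyt.
VOWELS = "aeiouy"
SEQUENCE = re.compile(f"[{VOWELS}]+|[^{VOWELS}]+")


def split_sequences(word):
    """Split a word into maximal runs of vowel letters and of other letters."""
    return SEQUENCE.findall(word.lower())


for word in ["claeghen", "aenspraeck", "hiaten"]:
    print(word, "->", split_sequences(word))
```

Because no syllable boundaries are needed, the same function can be applied to historic and modern words alike, which is what makes the sequences comparable position by position.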
Syllable boundaries can be contained in one sequence (ia in hiaten), but this need not be a problem. Historic vowel sequences may only be rewritten to modern vowel sequences, and historic consonant sequences may only be rewritten to modern consonant sequences. Putting this restriction on what a historic sequence can be rewritten to will retain the syllable structure. Again, the considered context can be specific, changing ae to a in the context of cl and gh, or general, changing ae to a if the sequence is preceded and followed by any consonant sequence.

Spelling errors versus phonetic spelling

Treating historic spelling as a form of spelling errors leads to the method of spell checking. A possible algorithm for finding the correct word given a misspelled word is the Edit Distance algorithm [24]. This algorithm finds the smallest number of transformations needed to get from one word to another word. At each step in the algorithm, the minimal cost of inserting, deleting or substituting a character is calculated. Inserting or deleting a character takes 1 step, a


substitution takes 2 steps (the same as 1 delete + 1 insert). Changing bed into bad takes one substitution (‘e’ to ‘a’), thus the edit distance between bed and bad is 2. The edit distance between bard and bad is 1 (deleting the ‘r’). This can be used to find the word in a lexicon that is closest to the misspelled word [6]. However, historic spelling is different from misspellings in modern texts. The variance in spelling is not based on accidentally hitting a wrong key on the keyboard, but on phonetic information. Without any official spelling, writing caas or kaas makes no difference. They are both pronounced the same. Thus, historic Dutch can be treated as modern Dutch with spelling errors based on a lack of knowledge of modern spelling rules (which people in the 17th century were, of course, ignorant of). Thus, writing caas instead of kaas (English: ‘cheese’) is more probable than writing cist instead of kist (English: ‘box’), since a c is pronounced as a k when followed by an a, but is pronounced as an s when followed by an i. From a phonetic perspective, the distance between cist and kist is bigger than between caas and kaas.
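The cost scheme described above (insertion and deletion cost 1, substitution costs 2) can be made concrete with a short dynamic programming sketch. The function name and the row-by-row formulation are choices made for this example and are not taken from [24].

```python
def edit_distance(source, target, substitution_cost=2):
    """Minimal cost of turning source into target.

    Insertions and deletions cost 1; a substitution costs
    `substitution_cost` (2 by default, i.e. one delete plus one insert).
    """
    # previous[j] holds the cost of turning the processed prefix of
    # source into the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            substitute = previous[j - 1] + (
                0 if source_char == target_char else substitution_cost)
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]


print(edit_distance("bed", "bad"))    # 2: one substitution
print(edit_distance("bard", "bad"))   # 1: one deletion
print(edit_distance("caas", "kaas"), edit_distance("cist", "kist"))
```

The last line gives the same distance for caas/kaas and cist/kist, which illustrates the point made above: a purely orthographic distance cannot express that the first pair is phonetically identical while the second is not.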

3.2 Rewrite rule generation

One can think of many different ways of generating rewrite rules. The use of phonetic transcriptions is one, but another way would be to see the spelling variance as a noisy channel (i.e. treating historic spelling as a misspelling of modern Dutch), making rewrite rules out of typical misspellings. N-gram matching can be used to find letter combinations that occur frequently in a historic lexicon, but much less frequently in a modern lexicon. In all approaches, a few issues have to be considered. First of all, while some historic words are spelling variations of modern words, many other historic words are just plain different words. They cannot be mapped to modern words, although they can be modernized in spelling. Thus, purely historic words cannot be used to generate rules, but the generated rules will affect these words. Three different rule generation methods have been developed:

1. Phonetic Sequence Similarity
2. Relative Sequence Frequency
3. Relative N-gram Frequency

3.2.1 Phonetic Sequence Similarity

The first method of mapping historic words to their modern variants is by using phonetic transcriptions of both historic and modern words. Phonetic matching techniques are used to find the correct spelling of a name when the name is given verbally, i.e. only its pronunciation is known (see [26]). For modern Dutch, a few automatic conversion tools are available to transform the orthographic word into a phonetic transcription. A phonetic transcription is a list of phoneme characters which have a specific pronunciation. A simple conversion tool for


Dutch can be found on the Mbrola website.¹ It makes acceptable phonetic transcriptions of words. But, because of its simplicity, it cannot cope with the less frequent letter combinations in the Dutch language. For instance, the combination ae is transcribed to two separate vowels AE. A much more complex grapheme to phoneme conversion tool is embedded in the text-to-speech software package Nextens (see http://nextens.uvt.nl). This converter is more sensitive to the context of a grapheme (letter). The grapheme n preceded by a vowel and followed by a b is not pronounced as an n but as an m. Also, it can cope with rarer letter combinations like ae (transcribed to the phoneme e). Which phonetic alphabet is used by these tools is not important, as long as the same tool is used for all transcriptions.² While the conversion tools have been developed for modern Dutch, they can also be used for historic variants of Dutch. It is not clear how well this works, but if 17th century Dutch is close enough to modern Dutch, this could be a very simple way to standardize and modernize historic spelling.

Once phonetic transcriptions are made for all the words in the historic lexicon and all the words in the modern lexicon, it is easy to find historic words and modern words with the same pronunciation. These word pairs can be combined in a thesaurus for lookup (see chapter 5), but they can also be used for constructing rewrite rules. The next step is then to construct a rewrite rule based on the difference between these historic and modern word pairs. One way to do this is to make a mapping between the differing syllables. But splitting historic words into syllables is problematic. However, splitting words into vowel sequences and consonant sequences is an option. If the equal sounding words have the same vowel/consonant sequence structure, then, by aligning the consonant/vowel sequences, the aligned sequences are paired on the basis of pronunciation.
To clarify the idea, consider the following example:

historic word: klaghen
modern word: klagen
historic sequences: kl, a, gh, e, n
modern sequences: kl, a, g, e, n

All these sequence pairs are pronounced the same, including the pair [gh, g]. From this list of pairs, only the ones that contain two distinct sequences are interesting. Rewriting kl to kl has no effect. After applying rewrite rules based on phonetic transcriptions, the lexicon has changed. But iterating this process has no further effect. Since the rewrite rules are based on mapping historic words to modern words that are pronounced the same, after rewriting, the pronunciation stays the same.
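The pairing and alignment steps of this section can be sketched as follows. The sketch assumes that phonetic transcriptions are already available as plain strings; the two toy dictionaries and their transcription strings are invented placeholders, not output of Nextens or Mbrola. The vowel inventory and all function names are likewise assumptions made for illustration.

```python
import re

VOWELS = "aeiouy"  # assumed vowel letters, as in the splitting example


def split_sequences(word):
    """Split a word into maximal runs of vowel letters and of other letters."""
    return re.findall(f"[{VOWELS}]+|[^{VOWELS}]+", word.lower())


def candidate_rules(historic, modern):
    """Align equal-sounding words and return the differing sequence pairs."""
    hist_seqs = split_sequences(historic)
    mod_seqs = split_sequences(modern)
    if len(hist_seqs) != len(mod_seqs):
        return []  # an unmatched sequence: skip this word pair
    pairs = list(zip(hist_seqs, mod_seqs))
    # Vowel sequences may only be aligned with vowel sequences,
    # consonant sequences only with consonant sequences.
    if any((h[0] in VOWELS) != (m[0] in VOWELS) for h, m in pairs):
        return []
    return [(h, m) for h, m in pairs if h != m]


def rules_from_lexicons(historic_pt, modern_pt):
    """Pair words with identical transcriptions and collect candidate rules."""
    modern_by_pt = {pt: word for word, pt in modern_pt.items()}
    rules = []
    for word, pt in historic_pt.items():
        modern = modern_by_pt.get(pt)
        if modern is not None and modern != word:
            rules.extend(candidate_rules(word, modern))
    return rules


# Hypothetical transcriptions, for illustration only.
historic_pt = {"klaghen": "klaG@n", "authentique": "Otentik"}
modern_pt = {"klagen": "klaG@n", "authentiek": "Otentik"}
print(rules_from_lexicons(historic_pt, modern_pt))
```

For the toy data, only the pair klaghen/klagen yields a rule, gh to g. The pair authentique/authentiek is skipped because the two words split into a different number of sequences.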
1 See http://tcts.fpms.ac.be/synthesis/mbrola.html or http://www.coli.uni-sb.de/~eric/stuff/soft/ (the website of the author of the conversion tool).

2 This became clear when using the Kunlex phonetic transcriptions list that is supplied with the Nextens package. This list contains 340,000 modern words with phonetic transcriptions. However, converting the words to phonetic transcriptions using Nextens results in different transcriptions from the ones in the Kunlex list.


CHAPTER 3. REWRITE RULES

3.2.2

The PSS algorithm

The PSS (Phonetic Sequence Similarity) algorithm aligns two distinct character sequences that are similar based on phonetics. If the phonetic transcription $PT$ of a historic word $W_{hist}$ also occurs in the modern phonetic transcriptions list, then the modern word $W_{mod}$ that has the same transcription $PT$ is considered the modern spelling variant of $W_{hist}$. Both words are split into sequences of vowels and sequences of consonants. If the number of sequences of $W_{hist}$ is different from the number of sequences of $W_{mod}$, no rewrite rule is generated, because at least one sequence is left unmatched. Consider the modern word authentiek and the similar sounding historic word authentique.3 The modern word contains 6 sequences (au, th, e, nt, ie, k), while the historic word contains 7 (au, th, e, nt, i, q, ue). This last sequence ue is not pronounced (at least, not according to Nextens). All the other sequences can be aligned to the sequences of the modern word. This problem is sidestepped by ignoring these cases.

When the numbers of sequences are equal, an extra check is needed to make sure that for both words the aligned sequences are of the same type, that is, both sequences should be vowels, or both should be consonants. In this research, no word pairs were found that couldn't be aligned properly, except for the word pair mentioned above, which is probably part of a French text.

The next step is comparing all the aligned sequences. If the spelling of the historic sequence $Seq^{i}_{hist}$ is different from the spelling of the modern sequence $Seq^{i}_{mod}$, a possible rewrite rule is found. Since both words are pronounced the same, both sequences are apparently pronounced the same as well. By replacing $Seq^{i}_{hist}$ in a historic word with $Seq^{i}_{mod}$, pronunciation should be preserved. Thus the rewrite rule becomes:

\[ Seq^{i}_{hist} \rightarrow Seq^{i}_{mod} \tag{3.2} \]

The resulting rules are ranked by their frequency of occurrence. Thus, if $Seq^{i}_{hist}$ and $Seq^{i}_{mod}$ are aligned $N$ times in all the equal sounding word pairs, the resulting rule has score $N$. If $Seq^{i}_{hist}$ and $Seq^{i}_{mod}$ are aligned often, it is highly probable that the rule is correct, and that it will have a large effect on the historic corpus.
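A minimal sketch of the PSS steps described above, assuming the phonetic transcriptions are already available as dictionaries. The transcription strings in the usage example are invented placeholders, not Nextens output, and all function names are hypothetical. If several modern words share a transcription, this sketch simply keeps one of them.

```python
import re
from collections import Counter

VOWELS = "aeiouy"  # assumption: y is treated as a vowel


def split_sequences(word):
    """Split a word into maximal runs of vowels and runs of consonants."""
    return re.findall(rf"[{VOWELS}]+|[^{VOWELS}]+", word)


def is_vowel_seq(seq):
    return seq[0] in VOWELS


def pss_rules(historic_trans, modern_trans):
    """Both arguments map word -> phonetic transcription.
    Returns a Counter mapping (historic seq, modern seq) -> score N."""
    modern_by_trans = {pt: word for word, pt in modern_trans.items()}
    rules = Counter()
    for w_hist, pt in historic_trans.items():
        w_mod = modern_by_trans.get(pt)
        if w_mod is None or w_mod == w_hist:
            continue
        hist_seqs, mod_seqs = split_sequences(w_hist), split_sequences(w_mod)
        if len(hist_seqs) != len(mod_seqs):
            continue  # unmatched sequence: no rule is generated
        pairs = list(zip(hist_seqs, mod_seqs))
        if any(is_vowel_seq(h) != is_vowel_seq(m) for h, m in pairs):
            continue  # aligned sequences must be of the same type
        for h, m in pairs:
            if h != m:
                rules[(h, m)] += 1
    return rules


# Placeholder transcriptions, for illustration only.
historic = {"klaghen": "T1", "vraghen": "T2", "authentique": "T3"}
modern = {"klagen": "T1", "vragen": "T2", "authentiek": "T3"}
print(pss_rules(historic, modern))  # Counter({('gh', 'g'): 2})
```

The pair authentique/authentiek yields no rule because the two words split into 7 and 6 sequences respectively.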

3.2.3

Relative Sequence Frequency

The second approach tries to find modern spellings for sequences of vowels and sequences of consonants based on ‘wildcard’ matching. Each word, in both historic and modern corpora, is split into sequences of vowels and sequences of consonants (in the same way as for the PSS algorithm). Sequences that are frequent in historic texts but rare in modern texts are likely candidates for rewriting. To find the appropriate modern sequence to replace it, the historic sequence is removed from the historic word and replaced by a wildcard. This should be a vowel wildcard if the removed historic sequence is a vowel sequence, and a consonant wildcard for historic consonant sequences. If a modern word can be matched with the historic word containing a wildcard, the modern sequence that is aligned with the wildcard is a candidate for replacing the historic sequence. Historic and modern sequences that are aligned often have a high probability of being correct.

3 Although this word is in the DBNL corpus, it is probably taken from a document containing a small portion of French.

3.2.4

The RSF algorithm

The Relative Sequence Frequency (RSF) algorithm generates rules based on typical historic character sequences. The whole historic corpus is split into vowel and consonant sequences $Seq$, which are ranked by their frequency $F_{hist}(Seq)$. After that, their frequency scores are divided by the total number of sequences of the whole corpus $N_{hist}(Seq)$, resulting in a list of relative frequencies:

\[ RF_{hist}(Seq) = \frac{F_{hist}(Seq)}{N_{hist}(Seq)} \tag{3.3} \]

The same is done for the modern corpus. The final relative sequence frequency $RSF(Seq)$ is given by:

\[ RSF(Seq) = \frac{RF_{hist}(Seq)}{RF_{mod}(Seq)} \tag{3.4} \]
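Equations (3.3) and (3.4) can be sketched as follows. This is only an illustration on toy input: the function names are hypothetical, the input is assumed to be a flat list of already extracted sequences, and sequences that do not occur in the modern list are left out to avoid dividing by zero.

```python
from collections import Counter


def relative_frequencies(sequences):
    """RF(Seq) = F(Seq) / N, with N the total number of sequences (3.3)."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}


def rsf_scores(hist_sequences, mod_sequences):
    """RSF(Seq) = RF_hist(Seq) / RF_mod(Seq) (3.4), for sequences
    that occur in both lists."""
    rf_hist = relative_frequencies(hist_sequences)
    rf_mod = relative_frequencies(mod_sequences)
    return {seq: rf_hist[seq] / rf_mod[seq] for seq in rf_hist if seq in rf_mod}


# Toy data: ae is half of the historic sequences, one eighth of the modern ones.
hist = ["ae", "ae", "a", "gh"]
mod = ["ae"] + ["a"] * 7
print(rsf_scores(hist, mod)["ae"])  # 4.0
```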

A sequence $Seq_i$ with a high $RSF(Seq_i)$ is a typical historic character combination. A score of 1 means that the sequence is used just as frequently in a modern corpus as in a historic corpus. A threshold is used to determine whether a sequence is considered typically historic or not. This threshold is set to 10, meaning that, for a historic and a modern text of equal size $N$, the character sequence $Seq_i$ should occur at least 10 times more often in the historic text to be typically historic. The reasoning behind this is that if a sequence is much more common in a historic text, there is a good chance that its spelling has changed in the past few centuries. If $Seq_i$ occurs in the historic corpus but not in the modern corpus (i.e. $RF_{mod}(Seq_i) = 0$), $RSF(Seq_i)$ is set to 10. No matter what its historic frequency is, $Seq_i$ is infinitely more frequent in the historic corpus than in the modern corpus, and is considered a typical historic character combination.

Starting with the highest ranking character sequence $Seq$, all historic words that contain this sequence are transformed into so called ‘wildcard words’. If $Seq$ is a vowel sequence, a historic word $W_{hist}$ contains $Seq$ if $Seq$ is preceded and followed by consonants, or by the start or end of the word. For example, the word quaellijk is not listed as a word containing the sequence ae, since ae is not the full vowel sequence (which is uae). In all the historic words, $Seq$ is replaced by a ‘vowel wildcard’, resulting in a wildcard word $WW_{hist}$. The word aenspraek is transformed to VnsprVk, where V is a vowel wildcard. $WW_{hist}$ is then compared to all modern words. A modern word $W_{mod}$ matches $WW_{hist}$ if it can match all vowel wildcards with vowel sequences, or consonant wildcards with consonant sequences. Thus, VnsprVk is matched with the modern word aanspraak, but also with inspraak, inspreek, and aanspreek. Given these 4 matches, ae is matched with i twice, with ee twice, and with aa 4 times, resulting in 3 different rewrite rules:

ae → aa
ae → ee
ae → i

Again, a threshold is used to remove unreliable matches. If $Seq_{hist}$ has $N(WW_{hist})$ wildcard words, then $Seq_{mod}$ is considered reliable if it is matched to $Seq_{hist}$ by at least $N(WW_{hist})/10$ wildcards, with a minimum threshold of 2. A single wildcard match is considered a ‘coincidence’. This threshold is called the pruning threshold. After each historic sequence is processed, and wildcard matches are found, the list of possible modern sequences is pruned by removing all rules with a score below the pruning threshold. For instance, the sequence ae has more than 5000 wildcard words. A modern sequence is a reliable match if it matches at least 500 wildcard words with modern words.

Of course, for words that contain $Seq_{hist}$ multiple times (ae occurs twice in aenspraek), it is possible to restrict wildcard matching to words that match all the multiple wildcards with the same vowel sequence $Seq_{mod}$. In that case, only aanspraak would be a match. All the other words match ae with two different modern sequences.
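The wildcard matching and pruning steps can be sketched with regular expressions. This is a hedged illustration under assumptions: the function names are hypothetical, y is treated as a vowel, and the modern lexicon is a small toy list rather than the Kunlex list.

```python
import re
from collections import Counter

VOWELS = "aeiouy"  # assumption: y is treated as a vowel


def wildcard_pattern(word, seq):
    """Turn a historic word into a 'wildcard word': every full vowel or
    consonant sequence equal to seq becomes a wildcard of the same type.
    Returns None if the word does not contain seq as a full sequence."""
    parts = re.findall(rf"[{VOWELS}]+|[^{VOWELS}]+", word)
    if seq not in parts:
        return None
    char_class = f"[{VOWELS}]" if seq[0] in VOWELS else f"[^{VOWELS}]"
    regex = "".join(f"({char_class}+)" if p == seq else re.escape(p) for p in parts)
    return re.compile(regex)


def candidate_rules(seq, historic_words, modern_words):
    """Count which modern sequences are aligned with the wildcard, then
    prune with threshold max(2, number of wildcard words / 10)."""
    counts = Counter()
    n_wildcard_words = 0
    for w_hist in historic_words:
        pattern = wildcard_pattern(w_hist, seq)
        if pattern is None:
            continue
        n_wildcard_words += 1
        for w_mod in modern_words:
            match = pattern.fullmatch(w_mod)
            if match:
                counts.update(match.groups())
    threshold = max(2, n_wildcard_words / 10)
    return {mod: n for mod, n in counts.items() if n >= threshold}


modern = ["aanspraak", "inspraak", "inspreek", "aanspreek", "klagen"]
print(candidate_rules("ae", ["aenspraek"], modern))
```

A word like quaellijk is skipped for the sequence ae, because its full vowel sequence is uae.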

3.2.5

Relative N-gram Frequency

A standard, language independent method for matching terms is n-gram matching. For each word all substrings of n characters are determined. One way of determining similarity between two words is by counting the number of n-grams that are shared by these words. For instance, the words aenspraeck and aanspraak are split into the following substrings of length 3:

aenspraeck: #ae, aen, ens, nsp, spr, pra, rae, aec, eck, ck#
aanspraak: #aa, aan, ans, nsp, spr, pra, raa, aak, ak#

The number sign (#) shows the word boundary. Only the substrings nsp, spr and pra are shared by both words. Character n-gramming is a popular technique in information retrieval, where it can have a large influence on accuracy (for a detailed analysis of n-gram techniques, see [17]). For this research, n-gramming is used to find typical historic n-grams. Like the previous RSF algorithm, relative frequencies are used to find letter combinations that are frequent in historic documents, but rare in modern documents.
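The n-gram splitting in the aenspraeck/aanspraak example can be reproduced with a few lines of code. The function names are hypothetical; the single boundary marker on each side follows the example in the text.

```python
def char_ngrams(word, n=3):
    """All substrings of length n, with # marking the word boundaries."""
    padded = f"#{word}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def shared_ngrams(word_a, word_b, n=3):
    """N-grams of word_a that also occur in word_b."""
    grams_b = set(char_ngrams(word_b, n))
    return [g for g in char_ngrams(word_a, n) if g in grams_b]


print(char_ngrams("aanspraak"))
# ['#aa', 'aan', 'ans', 'nsp', 'spr', 'pra', 'raa', 'aak', 'ak#']
print(shared_ngrams("aenspraeck", "aanspraak"))
# ['nsp', 'spr', 'pra']
```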

3.2.6

The RNF algorithm

The third algorithm is only slightly different from the RSF algorithm, generating rules based on n-grams instead of vowel/consonant sequences. Hence, it is called the Relative N-gram Frequency (RNF) algorithm. It is basically the same algorithm, but with one major difference. Where the RSF algorithm considers only full sequences (a (full) vowel sequence is only matched with another (full) vowel sequence), the RNF algorithm matches an n-gram with any character sequence between n − 2 and n + 1 characters long. This restriction is based on the fact that modern spelling is more compact than historic spelling. To indicate that vowels should have a long pronunciation, historic words are often spelled with double vowels (like aa, ee). In modern spelling, this is no longer needed (only in a few cases) because of the official spelling rules. Also, exotic combinations like ckxs were normal in historic writing, but in modern spelling, only x or ks is allowed. Thus, it is to be expected that a modern spelling variant of a historic sequence is often shorter than the historic sequence itself.

Also, without this restriction, the number of possible rules would explode. Replacing zaek, for the n-gram aek, with the wildcard word zW (where W is an unrestricted wildcard) would match zaek with all existing modern words that start with the letter z. Processing hundreds of wildcard words would require enormous amounts of memory and disk space. By restricting the length of the wildcard, only words of length 2 to 5 are matched (this will still match many words, but memory requirements are now within acceptable limits). There is no restriction on vowels or consonants. An n-gram containing only vowels can be matched by a wildcard containing only consonants.

RNF is tested with different n-gram sizes, ranging from 2 to 5. When constructing rules from wildcard matches, the same pruning threshold is used as for the RSF algorithm described above. Without this threshold, the number of generated rules would still be enormous for large n (n ≥ 4). Especially for n = 5, literally hundreds of thousands of rules are generated. To reduce memory and disk space requirements, the pruning threshold for n = 5 is set to 5.
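The length-restricted wildcard of the RNF algorithm can be sketched as below. This is an illustration under assumptions: the function name is hypothetical, only the first occurrence of the n-gram in the word is replaced, word boundary markers are ignored, and the modern word list is a toy example.

```python
import re


def rnf_matches(historic_word, ngram, modern_words):
    """Replace ngram in historic_word by a wildcard of n-2 to n+1 arbitrary
    characters and return (modern word, matched sequence) pairs."""
    n = len(ngram)
    start = historic_word.find(ngram)
    if start < 0:
        return []
    prefix = historic_word[:start]
    suffix = historic_word[start + n:]
    low, high = max(n - 2, 0), n + 1
    pattern = re.compile(
        re.escape(prefix) + "(.{%d,%d})" % (low, high) + re.escape(suffix)
    )
    matches = []
    for w_mod in modern_words:
        match = pattern.fullmatch(w_mod)
        if match:
            matches.append((w_mod, match.group(1)))
    return matches


modern = ["zaak", "zak", "zeer", "zwaartekracht", "boek"]
print(rnf_matches("zaek", "aek", modern))
# [('zaak', 'aak'), ('zak', 'ak'), ('zeer', 'eer')]
```

With the wildcard limited to 1 to 4 characters, the long word zwaartekracht is no longer matched, although it starts with z.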

3.3

Rewrite rule selection

3.3.1

Selection criteria

Once the methods for generating rewrite rules are working, it is easy to generate literally thousands of rewrite rules. Of course, not all these rules work equally well. Some rules are based on matching one particular historic word to one particular modern word, and some rules are based on matching a historic word to the wrong modern word. The number of matches between historic and modern words on which a rule is based can be used as a ranking criterion. The more historic words that can be transformed to a modern word with the same rule, the more probable it is that the rule is correct. A rule that maps only one historic word to a modern word might be correct, but even if it is, its influence on an entire corpus will be minimal. A rule that maps over a hundred historic words to modern words is probably correct. It is highly improbable that an inappropriate rule rewrites this many historic words to modern words. The rule lcx → ndst rewrites volcx to vondst, but very few other matches will be found.

But how many matches are needed to make a reliable judgment whether a rule is appropriate or not? There are many different criteria that can be used. For instance, given a typical historic character sequence $Seq_{hist}$, the number of modern sequences $N(Seq_{mod})$ that lead to rewriting a historic word $W_{hist}$ to a modern word $W_{mod}$ should be as low as possible. $N(Seq_{mod})$ is the number of alternatives from which a modern sequence should be picked. If the same historic sequence occurs in many different rules (i.e. there are a lot of different modern consequents to rewrite to), the chance of only one of them being correct is small. If there is only one rule (i.e. there is only one modern consequent found for a historic sequence), then that is inevitably the best option.

Another aspect to look at is the effect of the rule on the modernly spelled words in the historic corpus. Comparing the words of the historic corpus with a modern word list (Nextens comes with a fairly large list containing approximately 340,000 modern Dutch words with phonetic transcriptions, the so called Kunlex word list) shows which words in the historic corpus have not changed in spelling. These words shouldn't be affected by rewrite rules. The criterion then becomes selecting only rules that have little to no effect on modernly spelled words. Of course, it is also possible to retain rules that have a large effect on these words, but restrict the application of such a rule to non-modern words (i.e. words that are not in the modern lexicon).

Another important decision to be made is whether a historic sequence can be rewritten to different modern spellings. As the y-problem described in section 3.6.1 indicated, not all sequences ay should be rewritten to the same modern form. The RNF algorithm has no difficulty with these problems, since larger n-grams take the context of ay into account. By first applying large n-grams, different words containing ay might be affected by different RNF rules. The other two algorithms, PSS and RSF, cannot take context into account since they use only vowels or only consonants. Thus, whatever selection criterion is used, only one modern spelling will be selected for each historic sequence. The following selection criteria will be discussed:

• Match-Maximal: Rank rules according to how many wildcard words are matched (MM).

• Non-Modern: Remove all rules that affect modern words in the historic lexicon. A word from the historic lexicon is modern if it is also in the Kunlex lexicon (NM).

• Salience: For the set of competing rules with the same antecedent part, select the consequent part with the highest score only if the difference between the highest score and the second highest score is above a certain threshold (S).
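The three criteria above can be sketched as simple filters over a rule table. This is a hedged illustration, not the thesis code: the function names and toy scores are hypothetical, the Non-Modern check uses a plain substring test as a simplification, and salience is interpreted here as a ratio between the best and second best score, which is one possible reading of the threshold.

```python
def match_maximal(rules, mm_threshold=0):
    """MM: per historic sequence, keep the highest scoring rule,
    provided its score reaches the MM-threshold."""
    best = {}
    for (hist, mod), score in rules.items():
        if score >= mm_threshold and score > best.get(hist, (None, -1))[1]:
            best[hist] = (mod, score)
    return {(hist, mod): score for hist, (mod, score) in best.items()}


def non_modern(rules, historic_lexicon, modern_lexicon):
    """NM: drop rules whose historic sequence occurs in a word of the
    historic lexicon that is also a modern word (substring test)."""
    modern_in_hist = [w for w in historic_lexicon if w in modern_lexicon]
    return {(hist, mod): score for (hist, mod), score in rules.items()
            if not any(hist in w for w in modern_in_hist)}


def salient(rules, ratio=2.0):
    """S: select the best consequent for a historic sequence only if its
    score is at least ratio times the score of the runner-up."""
    by_hist = {}
    for (hist, mod), score in rules.items():
        by_hist.setdefault(hist, []).append((score, mod))
    selected = {}
    for hist, candidates in by_hist.items():
        candidates.sort(reverse=True)
        top_score, top_mod = candidates[0]
        runner_up = candidates[1][0] if len(candidates) > 1 else 0
        if top_score >= ratio * runner_up:
            selected[(hist, top_mod)] = top_score
    return selected


# Toy rule scores, invented for illustration.
rules = {("ae", "aa"): 40, ("ae", "ee"): 30, ("gh", "g"): 12, ("oo", "o"): 25}
print(match_maximal(rules, mm_threshold=20))
print(non_modern(rules, ["boom", "naem", "klaghen"], {"boom"}))
print(salient(rules))
```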


3.4

Evaluation of rewrite rules

3.4.1

Test and selection set

A dozen more purely statistical selection criteria can be used, but another alternative is to create a selection and test set by hand. To test the effectiveness of the rewrite rules, a test set that contains historic words and the correct modern variant can be used. The historic words in the test set are picked from a random sample of words from a small list of 17th century books, published in the same period as the documents used for the generation of rewrite rules (1600–1620). Words from these books were randomly selected and added to the test set if a correct modern spelling was given. These modern forms were only entered when the historic word was recognized as a variant of a modern word, or as a morphological variant of a modern word. The historic word beestlijck would be spelled as beestelijk in modern Dutch. However, the word beestelijk is not an existing modern word, but a morphological variant of the word beestachtig (beastly). The test set contains some of these words. Some words were not recognizable at all. These were not added to the test set, since no modern spelling could be entered. This way of constructing a test set is fairly simple and doesn't take a lot of time. In just a few hours, a total set of 2000 words was made. The whole set was then split into a selection set and a test set.

The selection set was used as a rule selection method, as a way of sanity checking. Some of the constructed rules clearly make no sense. For example, the rule cxs → mbt might result in rewriting some historic words to existing modern words, but since it also changes pronunciation (and word meaning) radically, it is clear that this rule makes no sense. To make sure that all selected rules make at least some sense, a way of sanity checking is to select only rules that have a positive effect on the selection set.
The effect of a rewrite rule is measured with edit distance, by comparing the distance $D(W_{hist}, W_{mod})$ between the historic word and its modern variant with the distance $D(W_{rewr}, W_{mod})$ between the rewritten word and the modern variant. Here is an example to explain edit distance, using the historic word volcx and its modern version volks:

          v   o   l   k   s
      0   1   2   3   4   5
  v   1   0   1   2   3   4
  o   2   1   0   1   2   3
  l   3   2   1   0   1   2
  c   4   3   2   1   2   3
  x   5   4   3   2   3   4

Table 3.4: Edit distance between volcx and volks

The final edit distance between volcx and volks is 4. The first three characters of both words are the same, resulting in an edit distance of 0. But the next two characters differ: going from c to k takes one substitution, and another substitution is needed going from x to s. Since a substitution counts as two steps, the total distance is 4.

          v   o   l   k   s
      0   1   2   3   4   5
  v   1   0   1   2   3   4
  o   2   1   0   1   2   3
  l   3   2   1   0   1   2
  c   4   3   2   1   2   3
  c   5   4   3   2   3   4

Table 3.5: Edit distance between volcc and volks
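The distance tables can be reproduced with a standard dynamic programming routine. The sketch below assumes that insertions and deletions cost 1 and substitutions cost 2, which is the cost scheme the table values are consistent with; the function name is hypothetical.

```python
def edit_distance(source, target, sub_cost=2):
    """Edit distance with insertion/deletion cost 1 and substitution
    cost sub_cost, computed row by row."""
    previous = list(range(len(target) + 1))  # first row: 0 1 2 ... len(target)
    for i, s_char in enumerate(source, start=1):
        current = [i]  # first column: deleting i characters
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,       # deletion
                current[j - 1] + 1,    # insertion
                previous[j - 1] + (0 if s_char == t_char else sub_cost),
            ))
        previous = current
    return previous[-1]


print(edit_distance("volcx", "volks"))  # 4
print(edit_distance("volcc", "volks"))  # 4
```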

If the difference between $D(W_{hist}, W_{mod})$ and $D(W_{rewr}, W_{mod})$ is zero, then either the rule is not applicable to the historic word, or it has no effect on the distance, in which case it is probably an inappropriate rule. Changing volcx into volcc has no effect on the edit distance (the distance between volcx and volks is equal to the distance between volcc and volks, see Tables 3.4 and 3.5), but the word has changed into something that is pronounced differently, while the historic word is pronounced the same as its modern variant volks. Most native Dutch speakers will have little trouble recognizing volcx as a spelling variant of the adverb volks, while they would probably recognize volcc as a variant of the noun volk.

The problem with using edit distance as a measure is that a bigger reduction in distance does not necessarily mean that a rule is better. Take two competing rewrite rules lcx → lcs and lcx → lk. The first rule reduces the edit distance from 4 to 2 (see Table 3.6), while the second rule reduces it to 1 (Table 3.7). The result of the first rule is a word that looks and sounds much like the correct modern word. The result of the second rule is a different modern word.

          v   o   l   k   s
      0   1   2   3   4   5
  v   1   0   1   2   3   4
  o   2   1   0   1   2   3
  l   3   2   1   0   1   2
  c   4   3   2   1   2   3
  s   5   4   3   2   3   2

Table 3.6: Edit distance between volcs and volks

          v   o   l   k   s
      0   1   2   3   4   5
  v   1   0   1   2   3   4
  o   2   1   0   1   2   3
  l   3   2   1   0   1   2
  k   4   3   2   1   0   1

Table 3.7: Edit distance between volk and volks

A rewrite rule has a positive effect on the selection set if the average distance between historic and modern words is reduced. The average change in distance between the original test set, and the test set after rewriting is given by:

\[ C = \frac{1}{n} \sum_{i=0}^{n} \left[ D(W^{i}_{hist}, W^{i}_{mod}) - D(W^{i}_{rewr}, W^{i}_{mod}) \right] \tag{3.5} \]

Here $D(W_{hist}, W_{mod})$ is the edit distance between a historic word and its modern variant, and $D(W_{rewr}, W_{mod})$ is the edit distance between the rewritten historic word and the same modern variant. A simple measure would be dividing the average change in edit distance $C$ by the distance $D(Seq_{hist}, Seq_{mod})$ between the historic antecedent $Seq_{hist}$ of the rule and its modern consequent $Seq_{mod}$ (rules that change multiple characters should reduce the average distance more than rules that change only one character):

\[ Score(rule_i) = \frac{C_i}{D_i} \tag{3.6} \]

If the resulting score is close to 1, the total amount of change by the rewrite rule is mostly in the right direction. Looking again at the example of the rule cx → k, the edit distance between the original historic word volcx and the modern word volks is reduced by 3, and the edit distance between cx and k is also 3 (cost 2 for substitution of c with k, and cost 1 for deleting x). Thus, this rule scores 1. In other words, every change by the rule is a change in the right direction. But this is not good enough. Rewriting cx to k reduces the edit distance between volcx and volks, but the rule cx → ks not only reduces the edit distance, it also rewrites the historic word to the correct modern variant. According to (3.6), both rules would get the same score. But if a rule changes some historic words to their correct modern forms, it must be a good rule. A better measure should account for this. (3.7) adds the number of words changed to their correct modern form $M$ divided by the total number of rewritten words $R$:

\[ Score(rule_i) = \frac{C_i}{D_i} + \frac{M_i}{R_i} \tag{3.7} \]

Now, the rule cx → ks receives a higher score because it rewrites at least some of the words containing cx to their correct modern form. To make sure that rules with an accidental positive effect are not selected, a threshold for the final score of 0.5 is set. In words, this means that for each step done by the rule (an insertion or deletion takes one step, a substitution takes two steps), the distance should reduce by at least 0.5.

The big disadvantage of selecting only rules that have a positive effect on the selection set is that not all the typical historic word forms and letter combinations are in the selection set. Although the rules are based on statistics on the whole corpus, some constructed rewrite rules that are appropriate might not be selected because they have no effect on the selection set. On the other hand, from a statistical viewpoint, if a specific character combination is not in a set of 1600 randomly selected word pairs, then it is probably not a common or typical historic combination. Another drawback is that words that couldn't be recognized as variants of modern words are not in the test set, but are affected by the selected rewrite rules. Although the performance of a rule on the test set gives an indication of its “appropriateness” on the recognizable words, there is no such indication for its effect on the unrecognized words.

The test set is used as a final evaluation of the selected rewrite rules. The rewrite rules are applied to the historic words and then compared with the correct modern forms. As mentioned above, comparison is based on the edit distance between the words. The final score for a rule is the average distance between the rewritten words and the correct words. To get some measure of the effect of rewriting, the average distance between the original historic words and the correct words is also calculated as a baseline. The difference between these two averages should give an indication of the effect of rewriting. The baseline score is shown in Table 3.8.

            total word pairs   average distance
  baseline  400                2.38

Table 3.8: The baseline average edit distance
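The rule score of (3.7) can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, a rule is applied with plain string replacement, the average change $C$ is taken over all word pairs passed in, and the edit distance uses cost 1 for insertion and deletion and cost 2 for substitution.

```python
def edit_distance(source, target, sub_cost=2):
    """Edit distance with insertion/deletion cost 1, substitution cost 2."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,
                current[j - 1] + 1,
                previous[j - 1] + (0 if s_char == t_char else sub_cost),
            ))
        previous = current
    return previous[-1]


def rule_score(rule, word_pairs):
    """Score(rule) = C / D + M / R, following (3.5) and (3.7)."""
    hist_seq, mod_seq = rule
    total_change, rewritten, perfect = 0, 0, 0
    for w_hist, w_mod in word_pairs:
        w_rewr = w_hist.replace(hist_seq, mod_seq)  # simplification
        total_change += edit_distance(w_hist, w_mod) - edit_distance(w_rewr, w_mod)
        if w_rewr != w_hist:
            rewritten += 1          # R: words changed by the rule
            if w_rewr == w_mod:
                perfect += 1        # M: words rewritten to the correct form
    c = total_change / len(word_pairs)    # average change C
    d = edit_distance(hist_seq, mod_seq)  # distance D between the sequences
    return c / d + (perfect / rewritten if rewritten else 0.0)


pairs = [("volcx", "volks")]
print(rule_score(("cx", "k"), pairs))   # 1.0
print(rule_score(("cx", "ks"), pairs))  # 2.0
```

On this single word pair, both rules get 1 from the first term, and only cx → ks gains the extra 1 for producing the correct modern form.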

3.5

Results

The three algorithms PSS, RSF and RNF are evaluated using the test set. To get an idea of how well certain rule sets perform, all automatically generated rule sets are compared with the manually constructed rule set in [2]. The results for this set of rules on the test set are given in Table 3.9. Column 2 gives the total number of rules in the rule set (num. rules), and column 3 gives the total number of historic words that are affected by the rules (total rewr.). The 4th column shows the number of historic words for which the rewriting is optimal (perf. rewr., indicating a perfect rewrite). The last column shows the new average distance (new dist.) between the rewritten historic words and the modern words. The difference between the new average distance and the baseline average distance is shown in parentheses.

  rule set   num. rules   total rewr.   perf. rewr.   new dist.
  Braun      49           248           137           1.41 (-0.97)

Table 3.9: Manually constructed rules on test set

3.5.1

PSS results

The PSS algorithm generated 510 rules, some of which contain the same historic sequence as antecedent part. From these, only the highest scoring rules with a unique historic sequence (i.e. only one rule per historic sequence) are selected (MM, see the Match-Maximal criterion). The initial score of a rule is the number of times that the modern consequent $Seq_{mod}$ of the rule is found as a wildcard for the historic antecedent $Seq_{hist}$ (see the algorithm descriptions in section 3.2.1). Different threshold values for the rule selection algorithm were tested on the test set, ranging from 0 to 50 (the MM-threshold). The change in average distance shows whether the rules have a positive or negative effect. Also, the total number of words that are affected by the rules is given, together with the number of perfect rewrites. The number of perfect rewrites is the number of words which are rewritten to their correct modern form. The results are shown in Table 3.10.

It is clear that the rewrite rules generated by the PSS algorithm have a bad influence on the average edit distance between historic words and their modern variants. However, by increasing the threshold, only the high scoring rules are applied, rewriting 56 out of 400 words to their correct modern form. The average distance still increases though. The rules selected at threshold 5 perform better than the rules selected at threshold 10. Apparently, the rules with a score between 5 and 10 (or at least some of them) are better than some of the higher scoring rules. One reason for this is the phonetic change of the sequence ae from a as in naem (name) to e as in collegae (colleagues).4 The ae sequence is very frequent in the historic corpus, but the Nextens converter transcribes it to an e, so that naem will be matched with neem (to take) instead of with naam (name). Accidentally, there are a lot of wildcard matches with ee, so the rule ae → ee gets a high score.
Another high scoring rule is the rule oo → o, because many historic words contain a double vowel where their modern forms contain single vowels. But this rule generalizes too much. There are still many modern Dutch words containing a double vowel, and their historic counterparts should not be changed, like boot, boom, school, etc. There are many lower scoring rules that make more sense from a phonetic perspective, but low corpus frequencies keep them at a low score. There are some phonetic transcriptions that are just plain wrong. For instance, the sequence igh in veiligh (safe) is transformed by Nextens to a so called ‘schwa’ character. A ‘schwa’ is how non-stressed vowels are pronounced, like the ‘e’ in the character. The problem with igh is that in many words, the ‘i’ is pronounced as a schwa, but the ‘gh’ is certainly pronounced. After conversion, the word veiligh is matched to the modern Dutch word veilen, because the final ‘n’ in infinitives is often not pronounced. A chain is as strong as its weakest link. If the phonetic transcriptions are not 100% correct, the generated rules can’t be either.

4 Letters in boldface indicate phonemes.

As an extra, second selection criterion, only rules were selected that had no effect on those words of the historic lexicon that also occur in the modern Kunlex lexicon. Thus, only non-modern (NM, see the Non-Modern selection criterion in section 3.3.1) historic sequences are considered. The results for the salience criterion are given for a salience threshold of 2 (S 2 in the table). This means that the highest scoring rule $R_1$ for a historic sequence $Seq_{hist}$ is selected if $R_1$ matches at least twice as many wildcards as the second best rule $R_2$. Several different threshold values were tested. The threshold value 2 consistently shows the best results.

  Sel. crit.   MM Thresh.   num. rules   total rewr.   perf. rewr.   new dist.
  MM only      0            404          394           9             4.6 (+2.22)
  MM only      5            109          373           25            3.39 (+1.01)
  MM only      10           64           365           18            3.76 (+1.38)
  MM only      20           34           320           34            3.14 (+0.76)
  MM only      30           25           272           61            2.44 (+0.06)
  MM only      40           22           269           59            2.46 (+0.08)
  MM only      50           18           248           56            2.48 (+0.10)
  MM + NM      0            251          232           39            2.87 (+0.49)
  MM + NM      5            43           192           28            2.86 (+0.48)
  MM + NM      10           20           185           24            2.88 (+0.50)
  MM + NM      20           10           179           18            2.9 (+0.52)
  MM + NM      30           6            147           12            2.71 (+0.33)
  MM + NM      40           6            147           12            2.71 (+0.33)
  MM + NM      50           5            112           12            2.71 (+0.33)
  MM + S 2     0            383          376           15            3.88 (+1.81)
  MM + S 2     5            99           331           28            2.75 (+0.65)
  MM + S 2     10           56           322           21            3.12 (+1.01)
  MM + S 2     20           29           247           40            2.58 (+0.51)
  MM + S 2     30           22           195           56            2.15 (-0.23)
  MM + S 2     40           19           190           54            2.17 (-0.21)
  MM + S 2     50           15           151           51            2.2 (-0.18)
  sel. set     N.A.         104          253           101           1.66 (-0.72)

Table 3.10: Results of PSS on test set

What is interesting is that once the NM selection criterion is applied, the number of rules that are applied has little effect on the average edit distance between rewritten words and the correct modern words, but is still in balance with the total number of affected words (more rules rewrite more words). The highest scoring rules affect the most words (5 rules rewrite 112 words). For lower thresholds, NM does have a positive effect, reducing the average distance by almost 38%. But this is probably because it just reduces the number of rules. Since most low ranked rules increase the average distance, reducing the number of low ranked rules will reduce the negative influence. However, the number of perfect rewrites is heavily affected by NM. Before applying NM, a higher threshold results in many more perfect rewrites, and the average distance drops to nearly the original distance (which is 2.38). After applying NM, an MM-threshold of 50 results in an increase in distance, with far fewer perfect rewrites (when compared to an MM-threshold of 50 before applying NM). In other words, the rules that were thrown out by NM were better than the rules that NM keeps in the set. Dropping the threshold to 20 introduces some more bad rules (only 5 rules are added, and the average distance goes up again). Decreasing the threshold even more shows that some of the rules with a score below 20 are better than some of the rules with a score above 20.

The results for the Salience (S) selection criterion look much more like the Match-Maximal results. At each threshold level the number of rules is only slightly smaller than without selecting on salience. For the average distance, salience works much better. Rules with a score above 30 decrease the average distance. Some of the rules that are removed by the salience criterion actually produce perfect rewrites. For thresholds 30, 40 and 50, the number of rules decreases by 3, and the total number of perfect rewrites decreases by 5.
Thus, the 3 rules scoring between 30 and 40 removed by salience have a bad effect on the average distance but do have a positive effect on some words. This shows that for the historic antecedents in these 3 rules, multiple modern consequents are required, or the context of the historic sequence (the characters preceding and following the sequence) should be taken into account.

The best results by far are produced by using the selection set. As described in section 3.4.1, the selection set contains 1600 word pairs, which are used to filter out rules that have a negative effect on the test set. The MM-scores of the rules are ignored in this selection criterion, and are replaced by a score based on how well they perform on the selection set. As the selection set is constructed in the same way as the test set (in fact, only one set of 2000 words was constructed, which was split afterwards into the selection set and the test set), it should come as no surprise that this produces better results. About 63% of all the words in the test set are rewritten, and about 25% of them to their correct modern forms.

The PSS algorithm clearly suffers from wrong phonetic transcriptions. The change of pronunciation over time for some character sequences (most notably the sequence ae, which occurs very often in the historic corpus) is ignored by the Nextens conversion tool. These problems occur throughout the rule set, from highly frequent to rare sequences. Therefore, raising the MM-threshold will only reduce the total number of rules, effectively reducing the number of rules which have a negative effect on the test set, but also reducing the number of rules that have a positive effect. The use of the selection set seems the only way to sort the good rules from the bad ones.

3.5.2

RSF results

The RSF algorithm generates many more rules than the PSS algorithm, but the number of historic sequences for which it finds rewrite rules is smaller. This is because it finds many different rewrite rules for the same historic sequence. After selecting the highest scoring rule for each unique historic sequence, only 209 rules are left, compared to 293 rules for the PSS algorithm. This is probably because the RSF algorithm only considers typical historic character combinations, where the PSS algorithm considers all sequences in the historic index that can be matched with a modern variant. The PSS algorithm generates the rule cl → kl because the historic word clacht is pronounced the same as the modern word klacht. But the RSF algorithm never considers cl as a typical historic sequence, and thus won't generate a rewrite rule for it.

The results on the test set are shown in table 3.11. The rules generated by RSF perform much better than the ones generated by PSS. The average distance between the historic words and their correct modern variants decreases. Also, the number of perfect rewrites is much higher. Most rules have a very low score. Setting the threshold to 5 removes 70% of all the rules, while staying close to the performance of the full set of 209 rules. Apparently, the positive effect of the RSF rule set comes mainly from the rules with a score above 5. Further increase of the threshold shows a further decrease in performance, and this time the differences become significant. Between 10 and 20, almost half of the rules are removed, and the number of perfect rewrites decreases further. But it should be clear that the most effective rules are the ones with the highest scores. Only 10 rules have a score above 50, but they account for the bulk of the perfect rewrites and the decrease in distance. For all thresholds, the ratio between total rewrites and perfect rewrites is roughly the same (of every 2 affected words, one rewrite is perfect).
Clearly, the NM selection criterion has a negative effect on the RSF generated rules. It throws out some rules (about 20-25%) which have a positive effect on the test set. By throwing them out, the number of perfect rewrites drops, and the average distance increases. The effect of NM for PSS was questionable. For RSF, it is just plain bad. A simple explanation for the performance of NM is that the RSF algorithm already selects historic sequences based on their relative frequency. Even if a historic sequence occurs in the modern corpus, the fact that it was selected by RSF means that it is at least 10 times more frequent in the historic corpus. The use of relative frequencies makes NM redundant.

The salience criterion also removes many good rules. At an MM-threshold of 50, 6 out of 10 rules are removed (60%), reducing the number of perfect rewrites by 78 (76%). In other words, probably the best rules in the entire set are removed by selecting on salience. By dropping the salience threshold, the performance will go up again. Another short test revealed that by reducing the salience threshold by 0.1 at a time, the performance slowly changes towards the original performance. But only when the threshold is set to 1 (no salience) is the performance equal to that of the original MM rule set. Thus, for RSF, salient rules are no better, it seems, than other high ranking but non-salient rules.

Again, the selection set is the best selection criterion. Its performance is better than that of the MM-threshold. The number of perfect rewrites is higher, while the total number of rewrites is lower, and the average distance is reduced by more than 1 (for edit distance, this amounts to one step, insert or delete, closer to the modern word).

Sel. crit.   MM thresh.   num. rules   total rewr.   perf. rewr.   new dist.
MM only      0            209          261           133           1.41 (-0.97)
MM only      5            62           249           130           1.42 (-0.96)
MM only      10           39           243           127           1.44 (-0.94)
MM only      20           21           231           117           1.48 (-0.90)
MM only      30           13           212           109           1.54 (-0.84)
MM only      40           12           212           109           1.54 (-0.84)
MM only      50           10           206           103           1.56 (-0.82)
MM + NM      0            190          207           100           1.58 (-0.8)
MM + NM      5            51           195           97            1.59 (-0.79)
MM + NM      10           30           188           94            1.61 (-0.77)
MM + NM      20           16           178           85            1.64 (-0.74)
MM + NM      30           10           162           79            1.71 (-0.67)
MM + NM      40           9            162           79            1.71 (-0.67)
MM + NM      50           7            156           73            1.73 (-0.65)
MM + S 2     0            48           83            39            2.17 (-0.21)
MM + S 2     5            21           78            37            2.19 (-0.19)
MM + S 2     10           16           78            37            2.19 (-0.19)
MM + S 2     20           9            74            35            2.2 (-0.18)
MM + S 2     30           6            61            30            2.26 (-0.12)
MM + S 2     40           6            61            30            2.26 (-0.12)
MM + S 2     50           4            54            25            2.27 (-0.11)
sel. set     N.A.         76           252           140           1.33 (-1.05)

Table 3.11: Results of RSF on test set
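The columns reported in the result tables (total rewrites, perfect rewrites and new average distance) can be computed as in the following minimal sketch. It assumes a standard Levenshtein distance in which a substitution costs 1, and it assumes that rules are applied as plain substring replacements in the given order; both are assumptions made for illustration, as are the function names and the tiny example data.

```python
def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute all cost 1; an assumption)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                        # delete
                               current[j - 1] + 1,                     # insert
                               previous[j - 1] + (char_a != char_b)))  # substitute
        previous = current
    return previous[-1]


def apply_rules(word, rules):
    """Apply (historic, modern) rewrite rules as substring replacements."""
    for historic_seq, modern_seq in rules:
        word = word.replace(historic_seq, modern_seq)
    return word


def evaluate(pairs, rules):
    """Return (total rewrites, perfect rewrites, average distance after rewriting)."""
    total = perfect = distance_sum = 0
    for historic, modern in pairs:
        rewritten = apply_rules(historic, rules)
        if rewritten != historic:
            total += 1
            if rewritten == modern:
                perfect += 1
        distance_sum += edit_distance(rewritten, modern)
    return total, perfect, distance_sum / len(pairs)


# Hypothetical mini test set of (historic, modern) pairs.
pairs = [("clacht", "klacht"), ("aenspraeck", "aanspraak"), ("muer", "muur")]
rules = [("ae", "aa"), ("ck", "k")]
print(evaluate(pairs, []))      # baseline without rules
print(evaluate(pairs, rules))   # after rewriting
```

In this toy example only aenspraeck is affected by the rules, and it is rewritten perfectly, so the average distance drops while the two untouched words keep their distance of 1.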

3.5.3

RNF results

Like the RSF algorithm, RNF also generates many rules. But since the sequences are not restricted to either consonants or vowels, many more historic sequences and possible modern sequences are considered. The n-gram size becomes important. For n-grams of size 2 only 27 * 27 (26 alphabetic characters plus the word boundary character) = 729 historic sequences are possible. For 4-grams already 492,804 historic sequences are possible. Of course, most of these sequences will not be in the historic corpus (take 'qqqq' or 'xjhs' for example). So, before generating the rules, we can predict that there will be far more rules for n-grams of size 4 than for n-grams of size 2.

See table 3.12 for the results of all n-gram lengths. The results for NM are not listed, since they show the same bad effect as for the RSF rules, and would only make table 3.12 less readable. As for salience, it shows mixed results. For 2-grams, the best salience threshold is 1.5, performing far worse than the original rule set. For 3-grams and 4-grams, the best value is around 1.25, showing some improvement in average distance for lower MM-threshold values (up to 20) but a drop in the number of perfect rewrites.

The results for n-grams of length 2 show that only the 8 highest MM-scoring rules, with a threshold above 50, have a big influence on the test set. These rules are very good, rewriting 67% of all the words, and of these, 56% are perfect rewrites. This is due, for the largest part, to the rule ae → aa. Many of the historic words in the test set contain the sequence ae, and most of their corresponding modern variants have aa as the modern spelling variant. As the results at lower MM-thresholds show, the low scoring rules have almost no influence on the test set. Another noticeable result is that at an MM-threshold of 20, the rules show the greatest reduction in average distance for all the other n-gram lengths. Also, for n ≥ 3, increasing the MM-threshold results in fewer perfect rewrites. As for the total rewrite / perfect rewrite ratio, the best n-gram lengths are 2 and 3.

Like the PSS and the RSF algorithm, RNF benefits greatly from the use of the selection set. All n-gram lengths show an improvement over the MM-threshold selection. The number of selected rules is smaller than for low MM-thresholds (which show the highest number of perfect rewrites of the different MM-thresholds), as is the total number of rewrites. But the number of perfect rewrites increases (this is most noticeable for n ≥ 4). Now, the 4-gram rules show the best results. 62% of all rewrites are perfect, and the average distance is reduced by almost 50%.
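The n-gram counts quoted in this section can be checked with a few lines of arithmetic. The sketch below rests on an assumption: the figure of 492,804 for 4-grams equals 27 * 26 * 26 * 27, which is what one gets if the word boundary character may only occur in the first and last position of an n-gram. The text does not spell out this restriction, so it is offered only as one reading that reproduces the number. The boundary symbol `#` and the function names are likewise illustrative.

```python
ALPHABET_SIZE = 26  # alphabetic characters; the word boundary is one extra symbol


def count_unrestricted(n):
    """Number of n-grams if all 27 symbols may occur in every position."""
    return (ALPHABET_SIZE + 1) ** n


def count_boundary_at_ends(n):
    """Number of n-grams (n >= 2) if the boundary may only be first or last."""
    return (ALPHABET_SIZE + 1) ** 2 * ALPHABET_SIZE ** (n - 2)


def char_ngrams(word, n, boundary="#"):
    """All character n-grams of a word padded with a boundary symbol."""
    padded = boundary + word + boundary
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(count_unrestricted(2))        # 729, as in the text
print(count_boundary_at_ends(4))    # 492804, as in the text
print(count_unrestricted(4))        # 531441, the unrestricted count
print(char_ngrams("clacht", 4))
```

Under either reading the conclusion of the text is unchanged: the space of possible 4-grams is several hundred times larger than the space of 2-grams.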

3.6

Conclusions

The most significant conclusion is that phonetic transcriptions are not nearly as useful as expected. As mentioned earlier, there are two reasons for this. First, the transcriptions are not always correct. Some letter combinations that no longer occur in modern Dutch words are treated as English or French character sequences. From the surrounding characters it should be clear that the word under consideration is certainly not English or French. The grapheme to phoneme converter of Nextens is very accurate compared to other conversion tools, but for this particular task, it is simply not good enough. In defense of Nextens, it should be mentioned that it wasn't designed for this task. It was designed with pronunciation of modern Dutch in mind. That it does very well. The other main reason is that, although the overlap between 17th century Dutch and contemporary Dutch is mostly in pronunciation, the pronunciation of some high frequency vowel and consonant sequences (highly frequent in historic Dutch, that is) certainly has changed. The correct transformation of these sequences is absolutely essential if good performance is to be achieved. Of course, this problem could be solved by adjusting the conversion tool, tweaking the rules through which certain character combinations are mapped to phonemes, but that would require expert knowledge. The main aim of this research was to find out if the spelling bottleneck can be solved without any expert knowledge. Clearly, we need more than phonetic transcriptions alone.

The other two methods work much better. Both methods only take typically historic sequences into account, and do not suffer from changes in pronunciation. Corpus statistics are enough to generate and select well-performing rewrite rules. On the downside, the RSF and RNF algorithms only consider typically historic sequences. Many words with historic spelling contain sequences that are quite frequent in modern Dutch as well, like cl in clacht (modern Dutch: klacht). The PSS algorithm will generate rules for these sequences, but this is probably where the usefulness of rewrite rules turns into senseless spelling reformation. After transforming typically historic letter combinations into modern ones, rule generation should probably be replaced by word matching, either n-gram based (see section 4.2) or phonetic (see section 5.4).

It seems that the selection criteria NM and S only have some positive effect on the PSS and RNF rule sets for some MM-thresholds. There is no single value for the salience threshold that works properly for all methods. The only criterion that works well for all 3 methods is the selection set. It consistently shows the best results of all the different selection methods.

N-gram size    MM thresh.   num. rules   total rewr.   perf. rewr.   new dist.
2              0            15           271           150           1.29 (-1.09)
2              5            14           271           150           1.29 (-1.09)
2              10           11           269           150           1.30 (-1.08)
2              20           10           268           150           1.30 (-1.08)
2              30           9            267           150           1.30 (-1.08)
2              40           8            267           150           1.30 (-1.08)
2              50           8            267           150           1.30 (-1.08)
2 sel. set     N.A.         12           271           152           1.29 (-1.09)
3              0            163          277           148           1.38 (-1)
3              5            163          277           148           1.38 (-1)
3              10           124          270           148           1.33 (-1.05)
3              20           81           260           143           1.33 (-1.05)
3              30           50           239           131           1.41 (-0.97)
3              40           39           229           123           1.46 (-0.92)
3              50           27           196           95            1.78 (-0.60)
3 sel. set     N.A.         127          274           162           1.19 (-1.19)
4              0            458          284           115           1.89 (-0.49)
4              5            458          284           115           1.89 (-0.49)
4              10           321          268           110           1.87 (-0.51)
4              20           118          211           92            1.87 (-0.51)
4              30           57           163           64            1.93 (-0.45)
4              40           39           138           50            2.13 (-0.25)
4              50           29           114           37            2.18 (-0.2)
4 sel. set     N.A.         276          269           166           1.20 (-1.18)
5              0            726          205           57            2.69 (+0.31)
5              5            726          205           57            2.69 (+0.31)
5              10           318          157           52            2.42 (+0.04)
5              20           78           80            28            2.25 (-0.13)
5              30           20           34            9             2.33 (-0.05)
5              40           10           23            6             2.36 (-0.02)
5              50           7            20            6             2.36 (-0.02)
5 sel. set     N.A.         276          153           97            1.79 (-0.59)

Table 3.12: Results of RNF on test set

3.6.1

problems

There are of course some specific problems with using rewrite rules based on statistics. Since spelling was based on pronunciation, and people pronounced certain characters in different ways, some historic words are ambiguous. Just like certain modern words can have different meanings determined by context, the spelling of some historic words can be rewritten to different modern words, depending on context. The character combination ue in the historic word muer can be rewritten to modern spelling as oe as in moer (English: nut) or as uu as in muur (wall).

3.6.2

The y-problem

Scanning the list of unique words of all corpora quickly showed a major problem. Many of the historic terms contain the letter y, in many different combinations of vowels. It occurs before or after any other vowel, or just by itself. And in all these cases its modern spelling is different. Table 3.13 shows the possible combinations and their modern spelling:

vowel   modern spelling   old / modern
ay      aa                withayrigh / witharig
ay      a                 gepayste / gepaste
ay      aai               sway / zwaai
ay      ai                zwaay / zwaai
ay      ei                treckplayster / trekpleister
ay      ij                vriendelayck / vriendelijk
ey      ei                kley / klei
ey      ij                vrey / vrij
ey      ee                algemeyn / algemeen
oy      oy                employeren / employeren
oy      ooi               flickfloyen / flikflooien
oey     oe                armoey / armoe
oey     oei               bloeyde / bloeide
uy      e                 huydendaachse / hedendaagse
uy      ui                suycker / suiker
uy      uu                huyrders / huurders
uy      u                 gheduyrende / gedurende
ya      ia                coryandere / koriander
ya      ija               vyandt / vijand
ya      iea               olyachtich / olieachtig
ye      ie                poezye / poezie
ye      ij                toverye / toverij
ye      ije               vrye / vrije
yu      io                ghetryumpheert / getriomfeerd
yu      ijv               yurich / ijv'rig

Table 3.13: Different modern spellings for y

The next chapter will describe other ways of evaluating the rewrite rules. The influence of the rewrite rules on document collections from other periods will be measured, as well as the effect of rewriting on retrieving historic word forms for modern words from a historic corpus. As an extra evaluation, a document retrieval experiment is described, where the rewrite rules are integrated into the IR system. Furthermore, a few simple extensions, such as combinations and iterations, to the three methods PSS, RSF and RNF, are tested. This should provide a better indication of the performance of the rule-generation methods.

Chapter 4

Further evaluation
As we saw in the previous chapter, the RSF and RNF algorithms outperform the phonetically based PSS algorithm. Here, extensions to these methods are considered, as well as some other evaluation methods and test sets generated from documents from different periods. This chapter is divided into the following sections:

• Iteration and combination: The three methods described in the previous chapter are combined and used iteratively.
• Reducing double vowels: The problem of vowel doubling is investigated.
• Word-form retrieval: A method to retrieve historic word forms for modern words.
• Historic Document Retrieval: An external evaluation method to evaluate the rewrite rule sets.
• Documents from different periods: Evaluation of the rewrite rules on older and newer documents.

4.1

Iteration and combining of approaches

As stated in section 3.2.1, iteration over phonetic transcriptions has no effect. For the RNF and the RSF methods, iteration can have an effect. After the first iteration, the historic words that are changed by the rewrite rules have become more similar to their modern variants. A next iteration might result in more modern words that match a wildcard word. Consider again the example of the words aanspraak and aenspraeck. The consonant sequence ck is typical for historic documents; it rarely occurs in modern documents. But the wildcard word aenspraeC (C is a consonant wildcard) is not matched by any modern word. No modern spelling is found for ck. But through other words, the modern variant aa might have been found for the historic sequence ae. After applying this rule to the historic corpus, aenspraeck becomes aanspraack. In the next iteration the wildcard word aanspraaC will be matched with the modern word aanspraak, resulting in the rewrite rule ck → k.

However, by combining the different methods, iterating the phonetic transcription method suddenly does have an effect. After applying the ae → aa rule found by other methods, a phonetic transcription is made for aanspraack instead of for aenspraeck. And the pronunciation of aanspraack is equal to the pronunciation of the modern word aanspraak, while the pronunciation of aenspraeck isn't (at least, according to Nextens).

Method   iteration   new rules   total rewr.   perf. rewr.   old dist.   new dist.
RSF      1           209         257           110           2.38        1.42
RSF      2           4           3             0             1.42        1.42
RSF      3           0           0             0             1.42        1.42
RNF-2    1           12          271           152           2.38        1.29
RNF-2    2           1           1             0             1.29        1.28
RNF-2    3           0           0             0             1.28        1.28
RNF-3    1           127         274           162           2.38        1.19
RNF-3    2           12          6             3             1.19        1.17
RNF-3    3           0           0             0             1.17        1.17
RNF-4    1           276         269           166           2.38        1.20
RNF-4    2           52          26            19            1.20        1.10
RNF-4    3           0           0             0             1.10        1.10
RNF-5    1           276         153           97            2.38        1.79
RNF-5    2           60          38            19            1.79        1.62
RNF-5    3           14          4             3             1.62        1.60

Table 4.1: Results of iterating RSF and RNF
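The aenspraeck example can be traced step by step with a small sketch. It assumes that the consonant wildcard C stands for one or more consonants and that a wildcard word must match a modern word completely. The function name, the consonant inventory and the two-word modern lexicon are illustrative assumptions, not the implementation used in this thesis.

```python
import re

CONSONANTS = "bcdfghjklmnpqrstvwxz"  # assumed consonant inventory


def wildcard_matches(historic_word, historic_seq, modern_lexicon):
    """Replace historic_seq by a consonant wildcard and match modern words.

    Returns (modern word, modern sequence) pairs, i.e. candidate rewrite
    rules historic_seq -> modern sequence.
    """
    start = historic_word.find(historic_seq)
    if start < 0:
        return []
    end = start + len(historic_seq)
    pattern = re.compile(
        re.escape(historic_word[:start])
        + "([" + CONSONANTS + "]+)"      # the wildcard C
        + re.escape(historic_word[end:])
    )
    found = []
    for modern_word in modern_lexicon:
        match = pattern.fullmatch(modern_word)
        if match:
            found.append((modern_word, match.group(1)))
    return found


modern_lexicon = ["aanspraak", "klacht"]

# Iteration 1: aenspraeC matches no modern word, so no rule for ck is found.
print(wildcard_matches("aenspraeck", "ck", modern_lexicon))

# The rule ae -> aa (found through other words) is applied first.
rewritten = "aenspraeck".replace("ae", "aa")

# Iteration 2: aanspraaC now matches aanspraak, which suggests ck -> k.
print(wildcard_matches(rewritten, "ck", modern_lexicon))
```

The second call only succeeds because the first-iteration rule has already removed the other typically historic sequence from the word, which is exactly the effect iteration is meant to exploit.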

4.1.1

Iterating generation methods

After applying rewrite rules, the historic words are closer to their modern counterparts. There might be some historic sequence Seq_hist for which no wildcard matches could be found because it only occurs in words with several typically historic sequences. After the first iteration, some of these other historic sequences may have been changed to a modern spelling. In a second iteration, Seq_hist can be matched with a modern sequence. In table 4.1, the results of iterating over rule generation methods are shown. The average distances before and after applying the rules generated at each iteration are shown in columns 6 and 7. After the second iteration, only the rule set of RNF with n = 5 improves by further iteration. A simple explanation is that most rules generated for historic antecedents of length 5 affect only a few words (the first 276 rules affect only 153 words in the test set). There are many more typically historic sequences of length 5 than there are of length 4. The problem with evaluating the rules for n-grams of length 5 is that these sequences are so specific that many of them do not occur in the test set at all. In each iteration, there are many more rules than there are affected words. All these rules have an antecedent part that does occur in the selection set, hence the selection of the rule. But the selection set is much bigger than the test set, and thus contains many more sequences.

Looking purely at the scores, it is easy to conclude that 4-grams work better for RNF than 5-grams, but a look at the rules themselves gives another impression. Consider the historic sequence verci. The RNF algorithm finds the rule verci → versi for it, which changes a historic word like vercieren to versieren (adorn, decorate). The 4-gram verc should become verk in words like vercopen and overcomen, but it should become vers in vercieren. Because 5-grams are more specific, 5-gram rules probably make fewer mistakes. On the other hand, longer n-grams are more and more like whole words. Instead of generating rewrite rules, the RNF algorithm would be generating historic to modern word matches. It would consider every word of approximately the same length as a possible modernization, leaving all the work to the selection process.

4.1.2

Combining methods

The combining of methods is done by generating, selecting and applying the rules of one method before generating, selecting and applying rules of another method. The rules of PSS contain not only typically historic antecedents, but also some non-typical antecedents. Thus, the PSS rules will affect different words, and words differently, than the other rule sets. Also, by first applying the rules generated by RNF or RSF, the historic antecedents that have a wrong phonetic transcription (like the frequent sequence ae) may be rewritten before the PSS rules are generated. This will reduce the number of wrong phonetic transcriptions. Therefore, the generation methods are applied one after another, in all different permutations, to see the effect of ordering the generation methods. To be able to make a fair comparison of the different orderings, the rules of each rule set were selected with the selection set, because it performs very well for all three generation methods. The rule set for RNF is the combined rule sets for the n-gram lengths 2, 3, 4 and 5. The rules were applied in order of length, with the longest antecedents first, because they are more specific than shorter antecedents. Rewriting ae to aa before rewriting aeke to ake cancels the effect of the latter rule.

By combining all the rules of n-gram lengths 2, 3, 4 and 5, the improvement is huge. Even before combining RNF with the other algorithms, over 50% of all the words in the test set are modernized correctly. Combining PSS and RSF increases performance significantly, although the order is not important. But when combining RSF with RNF, the ordering does matter. Applying RSF first, the results are worse than applying RNF alone. Combining them is no improvement. Compared to the RNF rules, the only combinations that improve on it are the combinations with PSS. Apparently, the PSS rules are somewhat complementary to the RNF and RSF rules, as was expected. The RNF and RSF algorithms work in a similar way (relative frequency of a sequence). The PSS algorithm is fundamentally different. Its rules are based on phonetics.

Method order       num. rules   total rewr.   perf. rewr.   new dist.
PSS                104          253           101           1.66 (-0.72)
PSS + RNF          769          347           211           0.91 (-1.47)
PSS + RSF          136          326           166           1.13 (-1.25)
PSS + RNF + RSF    774          349           211           0.90 (-1.48)
PSS + RSF + RNF    389          348           206           0.91 (-1.47)
RNF-2              12           271           152           1.29 (-1.09)
RNF-3              127          274           162           1.19 (-1.19)
RNF-4              276          269           166           1.20 (-1.18)
RNF-5              276          153           97            1.79 (-0.59)
RNF-all            691          315           207           0.97 (-1.41)
RNF + PSS          746          335           224           0.87 (-1.51)
RNF + RSF          702          319           208           0.95 (-1.43)
RNF + PSS + RSF    752          337           224           0.87 (-1.51)
RNF + RSF + PSS    753          337           224           0.86 (-1.52)
RSF                62           252           140           1.33 (-1.05)
RSF + RNF          328          295           183           1.05 (-1.33)
RSF + PSS          134          324           167           1.12 (-1.16)
RSF + RNF + PSS    381          342           193           0.96 (-1.42)
RSF + PSS + RNF    397          346           211           0.88 (-1.50)

Table 4.2: Results of combined methods on test set

It is interesting to see that the total number of unique words in the corpus is greatly reduced by rewriting words to modern form. The original hist1600 corpus contains 47,816 unique words (see table 3.3), if the lexicon is case sensitive (upper case letters are distinct from lower case letters). If case is ignored, there are 44,041 unique words left. After applying the rules of the combined methods PSS, RNF and RSF, the total number of unique words is reduced to 38,525, a 12.5% decrease. By rewriting, many spelling variants are conflated to the same (standard) form. As table 4.3 shows, the RNF rules have the most significant effect on conflation. Looking at the number of rules that each method generates, this is hardly surprising.

Applied rule set   Lexicon size
None               44041
RSF                41956
PSS                41557
RNF                39368
RNF+RSF+PSS        38525

Table 4.3: Lexicon size after applying sets of rewrite rules
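Two points from this section lend themselves to a small illustration: applying rules with the longest antecedent first, and measuring conflation as the size of the case-insensitive lexicon after rewriting. The sketch below treats rules as plain substring replacements. The word forms spraeke and sprake and all function names are hypothetical examples chosen to exercise the ae and aeke rules mentioned in the text.

```python
def apply_in_given_order(word, rules):
    """Apply (historic, modern) rules one after another, in list order."""
    for historic_seq, modern_seq in rules:
        word = word.replace(historic_seq, modern_seq)
    return word


def apply_longest_first(word, rules):
    """Apply rules ordered by antecedent length, longest (most specific) first."""
    ordered = sorted(rules, key=lambda rule: len(rule[0]), reverse=True)
    return apply_in_given_order(word, ordered)


def lexicon_size(words, rules):
    """Number of unique case-insensitive word forms after rewriting."""
    return len({apply_longest_first(word.lower(), rules) for word in words})


rules = [("ae", "aa"), ("aeke", "ake")]

# Short rule first: ae -> aa destroys the context that aeke -> ake needs.
print(apply_in_given_order("spraeke", rules))   # spraake
# Longest antecedent first: the more specific rule gets its chance.
print(apply_longest_first("spraeke", rules))    # sprake

words = ["spraeke", "sprake", "Sprake"]
print(lexicon_size(words, []))      # 2 forms when only case is ignored
print(lexicon_size(words, rules))   # 1 form after rewriting
```

The drop from 2 to 1 forms is the same effect, on a toy scale, as the reduction of the hist1600 lexicon reported in table 4.3.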

4.1.3

Reducing double vowels

A common spelling phenomenon in 17th century Dutch is the use of double vowels to indicate vowel lengthening. In English, vowel lengthening is clear in the word good when compared to poodle. In the former word, the oo is pronounced somewhat longer than in the latter. The same effect occurs in Dutch. In bomen (English: trees), the o is long, in bom (bomb) the o is short. But boom, the singular form of bomen, is also pronounced with a long o. The double vowel 'oo' is needed in this case to disambiguate it from bom. In modern Dutch spelling, this doubling of vowels is only used for syllables with a coda¹. Using the modern spelling rules for vowel doubling, redundant double vowels in historic words can be removed. A simple algorithm to do this, the Reduce Double Vowels algorithm (RDV), reduces a double vowel to a single vowel if it is followed by a single consonant and one or more vowels, or if the double vowel is at the end of the word (no coda). Thus, eedelsteenen becomes edelstenen, and gaa is reduced to ga, but beukeboom is not changed to beukebom. This algorithm does make mistakes. The modern word zeegevecht (sea battle) is changed to the non-Dutch word zegevecht. The error is not in pronunciation, which is the same for both words, but in spelling. The double vowel ii (almost) never occurs in Dutch, so all occurrences of ii in historic words can safely be reduced. For word final vowels, the e vowel is an exception. If a word ends in a single vowel e, this is pronounced as a schwa (like the e in 'vowel'). For words ending in a long e vowel, the double vowel ee is required (thee, zee, twee, vee). Thus, the algorithm should ignore word final vowels ee.

¹ There are some exceptions to this rule. The Dutch word for sea is zee. Without the double vowel the e would be pronounced short, becoming ze (English: she).

To test its effectiveness, the algorithm was applied to the full historic word list, containing 47816 unique words. Of these, 1498 words contain redundant vowels according to the RDV-algorithm. The total number of words containing redundant vowels might be larger, since the algorithm is so simple that it is bound to miss some of these words. But what of the words it did affect? The list of reduced words was checked manually. It turns out that of all 1498 words, 134 reductions were incorrect (almost 9%). A closer analysis of the incorrect reductions shows that, by far, the most mistakes are made with the double ee vowel in non-final, open ended (no coda) syllables in words like veedieven (English: cattle thieves), tweedeeligh (English: consisting of two parts) and zeemonster (sea monster). In each of these 3 examples, the first syllable has its vowel reduced. But in all three examples, the first syllable is a Dutch word by itself. In fact, these words were the very reason why the algorithm ignores word final ee vowels. It seems that the frequent use of compound words in Dutch has a significant effect on the (too) simple RDV-algorithm. A modification might be compound splitting when encountering a word containing ee. If the first part of the word, up to and including ee, is an existing word by itself (i.e. it's in the lexicon), don't reduce the vowel.
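The RDV rules described above fit in a few lines. The following is a minimal sketch, not the original implementation: it treats y as a vowel and uses a fixed consonant inventory (both assumptions), and it adds the suggested compound check as an optional `lexicon` argument, a hypothetical parameter that defaults to an empty set so that the plain algorithm is the default behaviour.

```python
import re

VOWELS = "aeiouy"                    # treating y as a vowel is an assumption
CONSONANTS = "bcdfghjklmnpqrstvwxz"

# A double vowel followed by exactly one consonant and then a vowel.
_INNER = re.compile(r"(aa|ee|oo|uu)(?=[%s][%s])" % (CONSONANTS, VOWELS))
# A word-final double vowel; ee is excluded (thee, zee, twee, vee).
_FINAL = re.compile(r"(aa|oo|uu)$")


def reduce_double_vowels(word, lexicon=frozenset()):
    """Remove redundant double vowels from a (lower case) historic word."""

    def shrink(match):
        # Optional compound check: keep ee if the word up to and including
        # the ee is itself a known word (e.g. zee in zeegevecht).
        if match.group(1) == "ee" and word[:match.end()] in lexicon:
            return match.group(1)
        return match.group(1)[0]

    reduced = _INNER.sub(shrink, word)
    reduced = _FINAL.sub(lambda m: m.group(1)[0], reduced)
    return reduced.replace("ii", "i")  # ii (almost) never occurs in Dutch


for historic in ["eedelsteenen", "gaa", "beukeboom", "zee", "zeegevecht"]:
    print(historic, "->", reduce_double_vowels(historic))

# With a small lexicon the compound check avoids the zeegevecht mistake.
print(reduce_double_vowels("zeegevecht", lexicon={"zee", "twee", "vee"}))
```

Without a lexicon the sketch reproduces both the intended reductions (edelstenen, ga) and the known mistake (zegevecht); with the lexicon the first syllable of such compounds is left alone while later double vowels in the same word are still reduced.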
Other frequent mistakes have to do with the adding of a suffix. A typical Dutch suffix is -achtig, as in twijfelachtig (doubtful, twijfel means doubt). But the word geelachtig (yellowish, geel means yellow) is incorrectly reduced to gelachtig (gelly). These errors can be reduced by suffix stripping (stemming).

Furthermore, the algorithm was tested on the test set (see table 4.4). It affects only 29 of the 400 historic words in the test set, but of these, 20 are rewritten to the correct form. When RDV is used after applying rewrite rules, it still has a significant effect on the test set. The best order of combining the 3 rule generation methods (RNF, RSF and PSS) affects 337 words, rewriting 224 of them to the correct modern form. After the RDV algorithm is applied, 5 more words are rewritten, with 235 perfect rewrites (almost 59% of all the words in the test set!). Many of the double vowels are removed by the 4-gram and 5-gram RNF rules (like eelen → elen), but it is mainly due to the fact that ae is rewritten to aa, resulting in more double vowels, that double vowel reduction still has a significant effect.

Method order             num. rules   total rewr.   perf. rewr.   new dist.
RDV only                 N.A.         29            20            2.31 (-0.09)
RNF + RSF + PSS          753          337           224           0.86 (-1.52)
RNF + RSF + PSS + RDV    753          342           235           0.83 (-1.55)

Table 4.4: Results of RDV on test set

Applying the RNF+RSF+PSS rule set and the RDV algorithm on the example text from the ‘Antwerpse Compilatae’ (see chapter 2) gives the following result:

9. item, oft den schipper verzijmelijk ware de goeden ende koopman-schappen int laden oft ontladen vast genoeh te maken, ende dat.die daardore vijtten takel oft bevangtouw schoten, ende int water oft ter aarden vielen, ende also bedorven worden oft te niette gingen, alsulke schade oft verlies moet den schipper ook alleen dragen ende den koopman goet doen, als vore.

10. item, als den schipper de goeden so kwalijk stouwd oft laijd dat d’ene door d’andere bedorven worden, gelijk soude mogen gebeuren als hij onder geladen heeft rozijnen, allijn, rijs, sout, gran ende andere diergelijke goeden, ende dat hij daar boven op laijd wijnen, olien oft olijven, die vijtlopen ende d’andere bederven, die schade moet den schipper ook alleen dragen ende den koopman goet doen, als boven.

The words verzijmelijk, vijten, genoeh and stouwd are incorrect rewrites of the words versuijmelijck, uit een, genoeg and stouwt. But takel, maken, schade, koopman, kwalijk, rozijnen and dragen are correctly transformed from taeckel, maecken, schade, coopman, qualijck and rosijnen. Although the result is far from perfect, many words are modernized. Even the word verzijmelijk is orthographically much closer to its correct modern form verzuimelijk, although its pronunciation is no longer the same.

4.2

Word-form retrieval

Since the edit distance measure on a manually constructed test set is not the only (and probably not the best) way of evaluating the performance of rewrite rules, another evaluation method is used here. In [18] historic spelling variants of modern words are retrieved using character based n-gram matching (see Table 3.1). To evaluate the rewrite rules generated by RNF (the best method by far), a similar experiment was done on the full test set of 2000 word pairs. Each modern word in this set is used as a query word. The full list of historic words from the hist1600 corpus, plus the historic word forms of the test set,² was used as the total word collection. Since the full number of spelling variants for each modern word is not known, the word pairs from the test set were used to perform a known-item retrieval experiment. We are looking for a specific spelling variant, namely, the one from the test set.

² This was done to make sure that the appropriate historic word form is in the word list from which the words are retrieved.

Table 4.5 shows recall at several different levels. The experiment was repeated after rewriting the historic word list using the 4-gram RNF rules after 2 iterations (see 4.1.1), and using the best combination rule set (RNF, RSF, PSS). Especially at low recall levels (1 and 5) the difference between the original historic words and the rewritten words is huge. The 4-gram rules generated by RNF perform much better than the 5-gram rules. The 5-gram rules are a huge improvement on the original words, but the 4-grams are much better still. The performance of 2-gram and 3-gram matching @20 is comparable to the experiments by Robertson and Willett (see table 3.1). When matching n-grams for historic word forms, small n-grams (2 and 3) perform better than large n-grams (4 and 5). However, it is interesting to see that the difference between rewriting and no rewriting, for recall at lower levels (recall @1 and recall @5), becomes very big for large n-grams. More specifically, the performance of 4-gram and 5-gram matching is very bad at recall levels 1 and 5.

N-gram size   Rule set   recall @20   @10     @5      @1
2             none       85.30        78.30   69.05   32.20
2             4-gram     90.55        86.00   79.25   46.65
2             comb.      92.50        88.90   83.20   48.80
3             none       79.80        70.60   59.95   27.65
3             4-gram     88.35        83.90   76.85   45.65
3             comb.      90.80        86.70   81.50   49.25
4             none       65.75        56.15   45.75   20.40
4             4-gram     83.30        78.50   73.20   43.90
4             comb.      86.50        81.40   76.65   47.25
5             none       52.00        45.50   37.30   16.50
5             4-gram     77.15        73.15   68.30   41.55
5             comb.      80.30        76.65   72.55   45.50

Table 4.5: Results of historic word-form retrieval
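The known-item word-form retrieval experiment can be outlined as follows. This is only a sketch under stated assumptions: word similarity is measured with the Dice coefficient over sets of boundary-padded character n-grams, which is a common choice for n-gram matching but is not confirmed here as the exact measure used, and the word list and query pairs are tiny illustrative stand-ins for the hist1600 word list and the test set.

```python
def ngram_set(word, n, boundary="#"):
    """Set of character n-grams of a word padded with a boundary symbol."""
    padded = boundary + word + boundary
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(word_a, word_b, n):
    """Dice coefficient between the n-gram sets of two words (an assumed measure)."""
    grams_a, grams_b = ngram_set(word_a, n), ngram_set(word_b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def rank_words(query, collection, n):
    """Historic words ordered by decreasing similarity to the modern query."""
    return sorted(collection, key=lambda word: (-dice(query, word, n), word))


def recall_at(pairs, collection, n, k):
    """Percentage of (modern, historic) pairs whose historic form is in the top k."""
    hits = sum(1 for modern, historic in pairs
               if historic in rank_words(modern, collection, n)[:k])
    return 100.0 * hits / len(pairs)


historic_words = ["aenspraeck", "clacht", "coopman", "muer"]
pairs = [("klacht", "clacht"), ("aanspraak", "aenspraeck")]

print(rank_words("klacht", historic_words, 2))
print(recall_at(pairs, historic_words, 2, 1))
```

To repeat the experiment after rewriting, the same functions would be called on the rewritten word list while the lookup of the known item uses the rewritten form of the historic word.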

4.3

Historic Document Retrieval

Parallel to this research, Adriaans [1] has worked on historic document retrieval (HDR). He has investigated whether HDR should be treated as cross-lingual information retrieval (CLIR) or monolingual information retrieval (MLIR). In his CLIR approach, the rewrite rule generation methods described here have been applied to a collection of historic Dutch documents, and a set of modern Dutch queries.


4.3.1

Topics, queries and documents

Two sets of topics were used. One is a set of 21 expert topics from [2], for which a number of the documents in the collection were assessed and marked as relevant or non-relevant. The formulation of the queries and the assessment of the documents were done by experts on 17th century Dutch. The other set contains 25 topics, for which only one relevant document is known. This approach is called known-item retrieval. A document from the collection is selected randomly, and a query is formulated that describes the content of that document as precisely as possible. This approach is used when there is a lack of assessors and/or time to assess the relevance of all the documents for each query.

Each query in the experiment is a combination of a title and a description. The description is a natural language sentence, posed in modern Dutch, describing the topic of the query. The title contains key words from the description. By combining the description with the title, the key words occur twice in the query, giving them extra weight in ranking the documents. The three columns of the result tables (Tables 4.6 and 4.7) show the average precision for topics using only descriptions (D only), descriptions and titles (D+T), and titles only (T only) as queries.

The average precision is the non-interpolated average precision score. For each query, the top-10 ranking documents are retrieved. If, for query Q, the 3rd and 5th documents are relevant, the non-interpolated average precision is the average of the precision at rank 3 (1/3) and the precision at rank 5 (2/5). If n relevant documents are retrieved, and rank(i) is the rank of the i-th relevant document, the average precision is:

\[ \text{Avg.Prec.} = \frac{1}{n} \sum_{i=1}^{n} \frac{i}{\mathrm{rank}(i)} \qquad (4.1) \]

The document collection is the same as the one used in [2], because these documents were assessed for the expert topics.
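The non-interpolated average precision of equation (4.1) can be computed directly from the ranks of the relevant documents that were retrieved. The sketch below is only an illustration. The function name is hypothetical, and, following (4.1), the score is normalised by the number of relevant documents retrieved rather than by the total number of relevant documents in the collection.

```python
def average_precision(relevant_ranks):
    """Non-interpolated average precision from the 1-based ranks of the
    relevant documents that were retrieved, following equation (4.1)."""
    ranks = sorted(relevant_ranks)
    if not ranks:
        return 0.0
    # The i-th relevant document (i = 1..n), found at rank r, contributes precision i / r.
    return sum(i / r for i, r in enumerate(ranks, start=1)) / len(ranks)


# The example from the text: the 3rd and 5th retrieved documents are relevant,
# so the score is (1/3 + 2/5) / 2.
print(average_precision([3, 5]))
```

For a known-item topic there is a single relevant document, so the score reduces to the reciprocal of the rank at which that document is retrieved.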

4.3.2 Rewriting as translation

The rule set used is a combination of RNF, PSS and RSF (see section 4.1.2). However, the RNF+PSS+RSF rules were generated by non-final versions of the generation algorithms, since the HDR experiments were done before the final versions were ready. The RNF algorithm performed sub-optimally because of memory problems. Due to time constraints, the best rewrite rules available at that time were used (the ordered generation of RNF, PSS and then RSF). The final version of the RNF algorithm generates more rules, and the final combined rule sets would probably give different HDR results. Table 4.6 shows the results of the HDR experiment on the known-item topics.3 The baseline is the standard retrieval method, looking up the exact query words in the inverted document index. An inverted document index is
3 The results in this thesis differ from the results in [1] because of last minute changes in [1]. The results shown here are only based on topics for which there is at least one relevant document. The results in [1] also take into account topics that don’t have any relevant documents in the collection.

Method              Avg. prec.   Avg. prec.   Avg. prec.
                    D only       D+T          T only
Baseline            0.2192       0.1955       0.1568
Stemming            0.2125       0.2352       0.1749
4grams              0.2366       0.2538       0.2457
Decompounding       0.2356       0.2195       0.1795
Rules Doc           0.3097       0.3118       0.3016
Rules Query         0.2266       0.2537       0.2487
Rules Doc + stem    0.3702       0.3884       0.3006
Rules Query + stem  0.1873       0.2370       0.2234

Table 4.6: HDR results using rewrite rules

a matrix in which the columns represent the documents in the collection and the rows represent all the unique words in the entire document collection, with each cell containing the frequency of the represented word in the represented document. Thus, a column shows the frequencies of all words that occur in a document, and a row shows the frequencies of a word in the documents that it occurs in. The table shows the results of some standard IR techniques as well. Stemming, as explained in section 2.3, conflates words through suffix stripping. The 4-gram experiment uses 4-grams of words, in combination with the words themselves, as rows in the inverted index. Decompounding is used to split compound words into their compound parts. The results for the rewrite rules are split into document translation and query translation. In the query translation experiment, a list of translation pairs was made for the words in the historic document collection, containing the original historic term and its rewritten form. Each query was expanded with an original historic word if its rewritten form was a query word. The document translation experiment was done by replacing each word in all documents by its rewritten form from the same list of translation pairs. The first 4 experiments can be seen as monolingual IR (documents and queries are treated as one language). Translating queries or documents is a cross-language (CLIR) approach: either the documents are translated into the language of the queries, or the queries are translated into the language of the documents. The last two rows show the effect of stemming after translation. Of the 4 monolingual approaches, the use of 4-grams works best, although stemming and decompounding also perform better than the baseline in most cases. Translation of the queries is comparable in performance to the 4-gram approach, but stemming the translated queries has a negative effect, especially when using descriptions only.
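The difference between query translation and document translation can be illustrated with a small sketch. It is a simplification under stated assumptions: the rewrite rules are reduced to plain substring replacements, whereas the generated rules of chapter 3 are context-dependent, and the three rules, the toy documents and all function names are hypothetical.

```python
from collections import defaultdict


def apply_rules(word, rules):
    """Rewrite a word by applying (old, new) substring rules in order (simplified rules)."""
    for old, new in rules:
        word = word.replace(old, new)
    return word


def translation_pairs(documents, rules):
    """Map each historic word in the collection to its rewritten form."""
    return {w: apply_rules(w, rules) for doc in documents for w in doc.split()}


def translate_documents(documents, pairs):
    """Document translation: replace every word by its rewritten form."""
    return [" ".join(pairs[w] for w in doc.split()) for doc in documents]


def expand_query(query, pairs):
    """Query translation: add each historic word whose rewritten form is a query word."""
    by_modern = defaultdict(set)
    for historic, modern in pairs.items():
        by_modern[modern].add(historic)
    terms = query.split()
    extra = sorted(h for t in terms for h in by_modern.get(t, ()) if h not in terms)
    return terms + extra


# Hypothetical rules and toy documents, for illustration only.
rules = [("ck", "k"), ("ae", "aa"), ("dt", "d")]
documents = ["het boeck", "de stadt", "de plaets"]
pairs = translation_pairs(documents, rules)
print(translate_documents(documents, pairs))
print(expand_query("boek stad", pairs))
```

Both experiments use the same list of translation pairs. Document translation changes the index, while query translation leaves the documents untouched and only lengthens the query.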
Query translation means adding historic word-forms to the query. These historic word-forms contain historic suffixes that might not be stripped by the stemmer, just like the historic word-forms in the documents. Without rewriting, many historic spelling variants cannot be conflated to the

Method              D only   D+T      T only
Baseline            0.3396   0.4289   0.4967
Stemming            0.3187   0.3778   0.4206
4grams              0.3037   0.3465   0.3821
Decompounding       0.3307   0.4228   0.4900
Rules Doc           0.2825   0.3835   0.4538
Rules Query         0.3067   0.4224   0.4844
Rules Doc + stem    0.2690   0.3268   0.3799
Rules Query + stem  0.2920   0.3628   0.4214

Table 4.7: HDR results for expert topics

same stem. Thus, if the historic query terms are not affected by the stemmer, they will only be matched by the exact same word-forms in the document collection, not by any morphological variant. Document translation is clearly superior to query translation. Even without stemming it consistently outperforms all the other approaches. But here, stemming is useful. By rewriting, many historic spelling variants are conflated to a more modern standard, including their suffixes. After stemming, morphological variants are conflated to the same stem, which significantly improves retrieval performance. For the D+T and T only queries, the improvement over the baseline is almost 100%.

The results for the expert topics are listed in Table 4.7. These results are in no way comparable to the known-item results. No matter what approach is used, nothing performs better than the baseline system. The decompounding approach and the query translation approach (without stemming) come close to the performance of the baseline system, but they show no improvement. A closer analysis of the topics shows that the experts who formulated the queries used specific 17th century terms, and added historic spelling variants to some of the descriptions and the titles. Topic 13 has the following description and title:

• Description: Welke invloed heeft ‘oorvrede’ nog in de periode van de Antwerpse Compilatae (normaal: oorvede, oirvede)? (English: What influence does ‘oorvrede’ still have in the period of the Antwerpse Compilatae (normally: oorvede, oirvede)?)
• Title: oorvrede Antwerpse Compilatae oorvede oirvede

The document collection used for retrieval contains documents from the ‘Gelders Land- en Stadsrecht’ corpus and the ‘Antwerpse Compilatae’ corpus. Both are collections of texts concerning 17th century Dutch law. All documents from the ‘Antwerpse Compilatae’ contain the words Antwerpse Compilatae. So, by putting these words in the query, all documents from this corpus are considered as possibly relevant documents.
Next, the word ’oorvrede’ is combined with 2 spelling variants in both the description and the title. By rewriting the documents, the spelling variant oirvede might have changed, so all documents originally containing oirvede no longer match with the query word oirvede. In

Period   Baseline distance   Braun   4-gram   PSS    RSF
1569     2.62                1.53    1.57     1.99   1.82
1600     2.38                1.54    1.20     1.65   1.43
1658     1.98                1.24    0.95     1.54   0.95

Table 4.8: Results of rules on test sets from different periods

general, if spelling variants are added to the query, the documents should not be rewritten, since rewriting is used to conflate spelling variants.

4.4 Document collections from specific periods

Since spelling changed over time, becoming more and more standardized because of the increase in literacy and the spread of the printing press, a rewrite rule should only be used on documents dating from more or less the same period as the documents it was generated from. That is, if documents from the beginning of the 17th century were used to construct rewrite rules, applying them to texts dating from 1560 or from the late 17th century might have a negative effect, because the pronunciation of certain character combinations might have changed between these periods. To see if this is the case, two small test sets containing 100 word pairs each, one from a text dating from 1569 and one from a text written in 1658, were manually constructed (in the same way as the large test set constructed from texts dating from 1600–1620, see 3.4.1). The rules and the RDV-algorithm were applied to these sets; the results are shown in Table 4.8 (the second test set, period 1600, is the original, large test set). As the second column shows, the average distance between the historic words and their modern counterparts decreases with the age of the source documents. The actual differences might be even bigger, since the test sets do not contain words that have not changed over time, and the texts from 1658 probably contain more of these words than older texts. What is interesting to see is that the rules manually constructed by Braun perform better on the oldest test set than any automatic method. Even the best rules generated by RNF perform somewhat worse there, although they do perform much better on the test sets from more recent documents. Again, the PSS rules perform worst of all rule sets. The RSF rules show great improvement in performance as the age of the source documents decreases. On the 1658 test set, they show the same performance as the best RNF rules. Of course, given the small size of the test sets, these results only give an indication of the effects on test sets from different periods.
To be able to draw reliable conclusions, much larger test sets, and maybe even a word-form retrieval experiment (see section 4.2) should be used.
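The kind of measurement reported in Table 4.8 can be sketched as an average string distance between historic words and their modern counterparts, computed before and after rewriting. The sketch assumes that the distance is a unit-cost Levenshtein edit distance, which is an assumption of this example; the word pairs and the single rewrite rule are toy data, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings with unit costs (assumed measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def average_distance(pairs, rewrite=lambda w: w):
    """Mean edit distance between (rewritten) historic words and their modern forms."""
    return sum(edit_distance(rewrite(h), m) for h, m in pairs) / len(pairs)


# Toy (historic, modern) pairs and one hypothetical rewrite rule, for illustration only.
pairs = [("boeck", "boek"), ("stadt", "stad")]
print(average_distance(pairs))                                # baseline distance
print(average_distance(pairs, lambda w: w.replace("ck", "k")))  # after rewriting
```

A rule set helps on a given period when the second figure is lower than the first; the baseline column of Table 4.8 corresponds to the identity rewrite.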


4.5 Conclusions

All evaluations show that 4-gram RNF generates the best rules. Although the Braun rules prove to be more period independent, the automatic methods perform much better for documents written after 1600. Word retrieval benefits greatly from rewriting, bringing performance on par with the results from [18] for historic English word-forms. For HDR on the known-item topics, the effect of rewriting is spectacular. What is interesting is the improvement of the stemming algorithm. Before rewriting, stemming the documents and queries has a mixed effect: for the titles it is useful, but for the descriptions stemming has very little effect. But once the rewrite rules have been applied, many more historic words have a modern suffix that can be removed, conflating spelling and morphological variants to the same stem. In other words, rewriting has brought the historic Dutch documents closer to the modern Dutch language. As the iteration of RNF ceases to have effect after 3 iterations, it might be more effective to switch to phonetic matching, or possibly word-retrieval, after that. It would be interesting to see the results of a combined run, first rewriting the documents and then using n-gram matching to find spelling variants that are not yet fully modernized. Also, the results of the HDR experiment are based on old rules. The current best rule set performs much better on the test set evaluation and on the word-retrieval test, and thus might also perform better on the HDR experiment.


Chapter 5

Thesauri and dictionaries
A thesaurus is a dictionary of words containing for each word a list of related words. In IR, it is often used to find synonyms of a word, and other closely related words, for query expansion. A document D that doesn’t contain any of the query terms will not be retrieved, but can still be relevant: the topic of the query might be discussed in the document using different, but related words. By expanding the query with related words from the thesaurus, the query might now contain words that are in D, so D will be retrieved. In [2], one of the main bottlenecks identified for HDR is the vocabulary gap. This gap not only represents concepts that no longer exist, or that didn’t exist in the 17th century. It also represents the concepts that are described by different modern and historic synonyms. As 17th century documents contain many synonyms (see [2, p.31]), a thesaurus can be useful for query expansion. A thesaurus can be created automatically by extracting and combining word pairs from different sources:

1. Small parallel corpora
2. Non-parallel corpora (using context)
3. Crawling footnotes
4. Phonetic transcriptions
5. Edit distance

The first three methods can be used to construct a thesaurus for the vocabulary gap. The last two methods might be used to tackle the spelling problem, as an alternative to rewrite rules.

5.1 Small parallel corpora

A very useful technique for finding word translation pairs between two different languages is word-to-word alignment in parallel corpora (see [21] and [22]). In the European Union, for instance, all political documents written for the European parliament have to be translated into many different languages. As these documents contain important information, it is essential that each translation conveys exactly the same message. The third paragraph in a Polish translation contains the same information as the third paragraph in an Italian translation. This can be exploited to construct a translation dictionary automatically by aligning sentences and words within these sentences. A collection of such documents in several languages is often called a parallel corpus. A parallel corpus can thus be used to find, for words in one language, their equivalents in another language. If such a collection of documents were available for 17th century Dutch and modern Dutch, it could be used to construct word translation pairs between 17th century and modern Dutch. This could be a partial solution to the vocabulary gap identified in [2]. Partial, because historic words for concepts that no longer make any sense in modern times cannot be aligned with modern translations, simply because no such translations exist. One of the largest parallel corpora is probably the Bible. It has been translated into many different languages, and also into many different historical variants of many modern languages. The advantage of using different Bible translations is that a line in one translation corresponds directly to the same line in the other translation. The Statenbijbel and the NBV (Nieuwe Bijbel Vertaling) can be used to construct a limited translation dictionary. The Statenbijbel, published in 1637, is the first Dutch translation made from the original Bible texts. The NBV is the most recent Bible translation, available in book form and on the internet. However, the Statenbijbel, unlike the NBV, is not electronically available. The oldest digitized version that can be found is a modernized version of the Statenbijbel, dating from 1888.
By that time, official spelling rules were introduced, and late 19th century Dutch is very similar to modern Dutch, making it useless for 17th century Dutch.
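Had a line-aligned 17th century text been available, word translation pairs could have been extracted along the following lines. This is a minimal sketch and not the alignment method of [21] or [22]: it simply pairs each historic word with the modern word that shares the most aligned lines with it, scored with a Dice coefficient. The function name and the three aligned lines are hypothetical toy data.

```python
from collections import Counter
from itertools import product


def align_words(line_pairs):
    """For each historic word, pick the modern word with the highest Dice score
    over the aligned (historic line, modern line) pairs."""
    hist_count, mod_count, joint = Counter(), Counter(), Counter()
    for hist_line, mod_line in line_pairs:
        hist_words, mod_words = set(hist_line.split()), set(mod_line.split())
        hist_count.update(hist_words)
        mod_count.update(mod_words)
        joint.update(product(hist_words, mod_words))
    best = {}
    for (h, m), c in joint.items():
        score = 2.0 * c / (hist_count[h] + mod_count[m])
        # Keep the best-scoring modern word; ties are broken alphabetically.
        candidate = (-score, m)
        if h not in best or candidate < best[h]:
            best[h] = candidate
    return {h: m for h, (_, m) in best.items()}


# Hypothetical aligned lines, for illustration only.
lines = [
    ("de opsnappers zingen", "de feestvierders zingen"),
    ("de opsnappers dansen", "de feestvierders dansen"),
    ("de heer", "de heer"),
]
print(align_words(lines)["opsnappers"])
```

With so few lines, a frequent function word such as de competes with the correct translation, which is why a reasonably large aligned corpus is needed for this to work.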

5.2 Non-parallel corpora: using context

If the time span between the two language variants is large, the variants can be considered to be different languages; historic documents can then be considered documents written in another language. If the time span between the variants becomes smaller, the languages are more alike and there is an increasing lexical overlap: many words have the same meaning in both languages. There are, however, some words that appear in one variant but not in the other. Over time, some words have fallen out of use, and new words have been coined. Some of the purely historic Dutch words that are no longer used in modern Dutch have a purely historical meaning. It is hard to find corresponding modern Dutch words for them, because they have lost their meaning in modern society. But some purely historic words do have a modern counterpart. The word opsnappers is a 17th century Dutch word that is no longer used in modern Dutch. Its modern variant is feestvierders (English: people having a party). Someone looking for documents on throwing a party in earlier times might use feestvierders as a


query word. Query expansion can benefit from a modern to historic translation dictionary containing opsnappers as a historic synonym for feestvierders. There are several techniques that can be used to find semantically related words automatically. Two of them will be discussed here. The first uses the frequency of co-occurrence of two specific words, the second uses syntactic structure to find words that are at least syntactically, and possibly semantically related.

5.2.1 Word co-occurrence

One way of constructing a thesaurus automatically is using word co-occurrence statistics to pair related words. This technique exploits the frequent co-occurrence of related words in the same document or paragraph. If two words co-occur frequently in documents, there is a fair chance that these words are related. In political documents, the words minister and politician will often co-occur. Of course, high-frequency words like the and in will also co-occur often. Content words have a lower frequency, and will co-occur less often than function words. But a thesaurus containing pairs of high-frequency function words will not help in retrieving relevant documents, since high-frequency words are often removed from the query. Moreover, most of the documents in the collection contain these function words, so almost all documents would be considered relevant. Content words not only carry content, they are also good at discriminating between documents. The word minister is much better at discriminating between political and non-political documents than the word the. Thus, a simple word co-occurrence thesaurus should pair related content words. There are two ways of filtering out co-occurrences of highly frequent terms. Removing the N most frequent terms from the lexicon is a rather radical approach. The other possibility is to penalize high term frequency by dividing the number of co-occurrences of two words by the product of their individual term frequencies. The main advantage of the latter approach is that high-frequency content words are not removed from the lexicon, so no information is lost. On the other hand, if a content word occurs often, its discriminative power is minimal. However, low-frequency terms suffer from accidental co-occurrences. For two totally unrelated, low-frequency words, a single accidental co-occurrence is enough to make their co-occurrence appear significant. An extra problem is the spelling variation. A low-frequency content word W1 might be spelled differently in each document it occurs in. Even if it co-occurs with the same related word W2 in each of those documents, each spelling variant will have a co-occurrence frequency with W2 of 1. Of course, this problem is partly solved by applying the rewrite rules from chapter 3 to the documents, but there are still some spelling variants that are not conflated after rewriting. Since the document collection is limited in size, almost all content words have a low frequency, making it nearly impossible to construct a useful co-occurrence thesaurus in this way.
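The frequency-penalized co-occurrence score described above can be sketched as follows. Two choices in the sketch are assumptions: a co-occurrence is counted once per document in which both words appear, and the term frequency of a word is its total number of occurrences in the collection. The function name and the four toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(documents):
    """Score each word pair by its co-occurrence count divided by the product
    of the term frequencies of the two words."""
    term_freq, pair_count = Counter(), Counter()
    for doc in documents:
        words = doc.lower().split()
        term_freq.update(words)
        # Count each unordered pair once per document in which both words occur.
        pair_count.update(combinations(sorted(set(words)), 2))
    return {pair: count / (term_freq[pair[0]] * term_freq[pair[1]])
            for pair, count in pair_count.items()}


# Toy documents, for illustration only.
docs = [
    "the minister met the politician",
    "the minister spoke",
    "the politician spoke",
    "the weather",
]
scores = cooccurrence_scores(docs)
print(scores[("minister", "politician")])  # related content words
print(scores[("minister", "the")])         # penalized function word
print(scores[("met", "minister")])         # single accidental co-occurrence
```

In this toy collection the pair (met, minister) receives the highest of the three scores although it rests on one document only, which is exactly the accidental co-occurrence problem of low-frequency terms noted above.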


5.2.2 Mutual information

Another related approach comes from information theory. In [16], an automatic word-classification system is described that uses the mutual information of word-based bigrams. Bigrams are often used in natural language processing techniques to estimate the next word given the current word. Given a corpus, the probability of word W_i is conditioned on the words W_1, W_2, ..., W_{i-1} occurring before it. This can be approximated by considering only the n − 1 words before W_i (W_{i-n+1}, W_{i-n+2}, ..., W_{i-1}). In the case of bigrams, only the previous word W_{i-1} is considered. The probability of the bigram W_{i-1}, W_i is estimated as its frequency divided by the total number of bigrams in the corpus. The N most frequent words of an English corpus are classified in a binary tree by maximizing the mutual information between the words in class C_i and the words in class C_j. The final tree shows groups of semantically or syntactically related words. The mutual information between a word W_i from class C_i and a word W_j from class C_j is given in (5.1), where P(W_i, W_j) is the probability of the bigram W_i, W_j:

\[ I(W_i, W_j) = \log \frac{P(W_i, W_j)}{P(W_i)\,P(W_j)} \qquad (5.1) \]

The total mutual information M(i, j) between two classes C_i and C_j is then:

\[ M(i, j) = \sum_{C_i, C_j} P(C_i, C_j) \times \log \frac{P(C_i, C_j)}{P(C_i)\,P(C_j)} \qquad (5.2) \]

Maximizing the mutual information is done sub-optimally, by finding a locally optimal classification. First, the N words are randomly classified and M_t(C_i, C_j) is computed. Second, for each word W in both classes, the mutual information M_{t+1}(C_i, C_j) is calculated for the situation where W is swapped from one class to the other. If this increases the mutual information, the swap is made permanent; otherwise, W is swapped back to its original class. In [16], the classification stops after these N swaps. Because computing power has increased so much, it doesn’t take much more time to iterate this process until, in the next N swaps, no swap is permanent. Working top-down, at each level, all the words of class C_i are classified further into subclasses. This ensures that the classification at the previous level stays intact. To reduce the computational complexity of the algorithm, the mutual information at t + 1 can be computed by updating the mutual information at t with the change made by W. If W is in class C_i at t, computing M_{t+1}(C_i, C_j) is done by first computing the mutual information M(W, C_j). This is the contribution of W at t. Next, the mutual information M(W, C_i) is computed. If M(W, C_i) is higher than M(W, C_j), swapping W to class C_j increases the mutual information. The new mutual information is then:

\[ M_{t+1}(C_i, C_j) = M_t(C_i, C_j) - M(W, C_j) + \max\bigl(M(W, C_j),\, M(W, C_i)\bigr) \qquad (5.3) \]

In this way, the full mutual information M(C_i, C_j) has to be calculated only once, and is updated by each swap.
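One level of the swap procedure can be sketched as below. This is a simplified illustration, not the implementation used for the experiments in this chapter: it splits the words into two classes only, recomputes the full mutual information of (5.2) after every tentative swap instead of applying the incremental update of (5.3), and takes the class marginals from the first and second bigram positions, which is an assumption of the sketch. The function names and the toy token sequence are hypothetical.

```python
import math
from collections import Counter


def class_mutual_information(bigrams, assignment):
    """Total mutual information between the classes of adjacent words, as in (5.2)."""
    total = sum(bigrams.values())
    joint, left, right = Counter(), Counter(), Counter()
    for (w1, w2), count in bigrams.items():
        c1, c2 = assignment[w1], assignment[w2]
        joint[c1, c2] += count
        left[c1] += count    # class marginal for the first bigram position
        right[c2] += count   # class marginal for the second bigram position
    mi = 0.0
    for (c1, c2), count in joint.items():
        p = count / total
        mi += p * math.log(p / ((left[c1] / total) * (right[c2] / total)))
    return mi


def exchange(bigrams, assignment):
    """Greedily swap words between classes 0 and 1 while the mutual information increases."""
    assignment = dict(assignment)
    best = class_mutual_information(bigrams, assignment)
    changed = True
    while changed:
        changed = False
        for word in sorted(assignment):
            assignment[word] = 1 - assignment[word]      # tentative swap
            candidate = class_mutual_information(bigrams, assignment)
            if candidate > best + 1e-12:
                best, changed = candidate, True          # keep the swap
            else:
                assignment[word] = 1 - assignment[word]  # undo the swap
    return assignment, best


# Toy token sequence in which determiners and nouns strictly alternate.
tokens = "the dog a cat the cat a dog the".split()
bigrams = Counter(zip(tokens, tokens[1:]))
start = {"the": 0, "dog": 0, "a": 1, "cat": 1}
final, mi = exchange(bigrams, start)
print(final, mi)
```

Applying the same procedure recursively to the words of each resulting class yields the binary classification tree. The outer loop repeats the pass over all words until no swap is kept, which corresponds to the iterated variant described above rather than the single pass of [16].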


The idea behind this approach is that closely related words are classified close to each other, and two unrelated words should be classified in different classes early in the tree (near the root). If two words convey the same meaning, it makes no sense to place them next to each other in a sentence, because it would make one of them redundant. The meanings of two adjacent words should be complementary. If two words co-occur often (i.e. their bigram frequency is high), they should not be in the same class. Low co-occurrence (low bigram frequency) of high-frequency words (high unigram frequency) makes it probable that the meanings of these words overlap, so they will be classified close to each other. The example classification given in [16] shows some classes that might be useful for query expansion. In one class, all days of the week are clustered together, and in another class, many time-related nouns are clustered. If one of the words in such a class is used as a query word, adding other words from the same class to the query might help in finding documents on the same topic. Once the N most frequent words have been classified, adding other, less frequent words requires no more than putting each word in the class that results in the highest mutual information. This second step becomes trivial when adding words with very low frequencies. A word W with frequency 1 (this holds for the largest part of the content words) only shows up in 2 bigrams, once with the previous word in the sentence, and once with the next word. Thus, it will only add mutual information when classified in the opposite class of one of these words. If neither the previous nor the next word is in the same class at a classification level s, putting W in class C_i or C_j at level s + 1 makes no difference to the mutual information, because, using (5.3), it adds 0 to either class. This introduces a new problem for historic documents.
Because of the inconsistency in spelling, each spelling variant has a lower corpus frequency and occurs in fewer bigrams than it would have given a consistent spelling (when spelling variants are conflated, the new word frequency is the sum of the frequencies of the conflated variants). Apart from that, classification based on bigrams requires a huge amount of text. More text means better classification, simply because there is more evidence to base a classification on. But the amount of electronically available historic text is limited, resulting in data sparseness.1 To make sure that the algorithm was implemented correctly, a test classification was made using a 60 million word English newspaper corpus.2 The 1000 most frequent words were classified in a 6 level binary classification tree. Table 5.1 shows 4 randomly selected classes, paired with their neighbouring class, at classification level 6 (the leaves of the tree). Out of the 1 million possible bigrams (each of the 1000 unique words can co-occur with all 1000 words, including
1 Data sparseness in this case means the lack of evidence for unseen bigrams. A bigram W_{i-1}, W_i might not occur in the corpus, making the probability P(W_{i-1}, W_i) equal to 0. Smoothing techniques can be used to overcome this problem, but for the classification algorithm it still always adds the same amount of mutual information to each class, making classification trivial. In a larger corpus, there is a bigger chance that a certain bigram occurs, resulting in a more reliable probability estimate.
2 The newspaper corpus is the LA-times corpus used at CLEF 2002.

Class number: 9 10 27 28 35 36 49 50

Class content: city Administration movie very given growing nation department only like proposed approved their housing her my financial economic private drug Simi Newport World Laguna National Pacific Long Orange Santa Ventura middle as hot five six eight few will can ’ve may said would could does cannot did should ’ll is ’d must allow take begin bring give provide keep hold get pay sell win find build break create use meet leave become call tell ask say see think feel know want run stop play Japan husband hours days though Clinton Anaheim Los Northridge Thousand judge wife couple key action summer minute top order largest usually anything non New own Department American San Inc star hearing project election list book force war quarter morning week bad different free got

Table 5.1: Classification of 1000 most frequent words in LA-times

itself), the corpus contains 412,516 unique bigrams for the 1000 most frequent words. In class 28, a number of auxiliary verbs are clustered, and in class 36, some semantically related verbs are clustered. The neighbouring class, 35, contains some related verbs as well, which indicates that there is a relation between clusters that are classified close to each other. In 49 and 50, some time-related nouns are clustered (summer, minute, quarter, morning, week). But many clusters contain seemingly semantically unrelated words, like ‘Administration’, ‘movie’, ‘very’ and ‘growing’. The ability to cluster on semantics is limited, although more data (a larger corpus) should lead to better (or at least, more reliable) classification. The corpus is used as a language model for English; thus, more text leads to a more reliable model. Better classifications have been made with syntactically annotated corpora. One of the main problems with plain text is not the semantic but the syntactic ambiguity of words. The word ‘sail’ can be used as a noun (‘The sail of a ship.’) or as a verb (‘I like to sail.’). But the orthographic form ‘sail’ can only be assigned to 1 class. In syntactically annotated corpora, a word can be classified together with its part-of-speech tag. For modern English, such corpora exist, but for 17th century Dutch, all that is available is plain text. The total historic Dutch corpus is much smaller than the English one, but still contains about 7 million words. The 1000 most frequent words share 226,318

Class number: 11 12 25 26 33 34 59 60

Class content: In uit Na Aen om Op tot van vanden of ofte en ende Laet Doen Wilt Uw Haer Zy Wy Zijn Mijn Hy Ons Ik Sijn Gy selve Noch vp Dus Der o Een Geen Daer Daar Dese Dies Dat Des Dees Alle so soo zo Zoo Indien Wanneer Nu al aller also als verheven inder vander wien binnen te alwaer ter welcken nam toch dewyl eerste dat dattet mit achter onder Roomsche wie hemels verre inne vooren och heer staat ras Maria connen konnen ware zijnde mede datse dijen Heer Prins hand borst lijf beeld beelden brief boeck kennis steen gelt brant dood verdriet troost rust plaats slagh oyt niet voorts eerst sprack bloed kroon troon staet stadt plaets wegh Boeck editie uitgave naem stof vrucht glans kort quaet begin neder noyt wel voort wederom weder zien sien gaven leeren

Table 5.2: Classification of 1000 most frequent words in historic Dutch corpus

bigrams (that means that 77% of all possible bigrams are not in the corpus). The same experiment was repeated with the historic Dutch corpus, and again 4 classes were randomly selected, shown in table 5.2, together with their neighbouring classes. In some of these classes, there are some clusters of syntactically related words. Class 12 contains many prepositions and pronouns, and classes 59 and 60 contain mostly nouns. Semantically, classes 59 and 60 are also interesting, because there are some themes: ‘Heer’ and ‘Prins’ (lord and prince), ‘hand’, ‘borst’ and ‘lijf’ (hand, chest, body), ‘brief’, ‘boeck’ and ‘kennis’ (letter, book, knowledge) in 59, and ‘kroon’ and ‘troon’ (crown, throne), ‘staet’, ‘stadt’, ‘plaets’, ‘wegh’ (state, city, place, road), ‘Boeck’, ‘editie’, ‘uitgave’ (book, edition, edition) in 60. The main problem of small corpora is that, if the mutual information within one class is zero (none of the words in that class share a bigram), further classification is useless. This is clear in classes 11 and 12. Apparently, moving one word from class 12 to 11 does not increase the mutual information. A further subclassification of class 12 will result in one empty subclass, and the other subclass containing all words of class 12. For a better comparison, the 1000 most frequent words in a 30 million word

Class number: 21 22 29 30 31 32 49 50

Class content: dacht zet vraagt grote hoge oude maakte hield sterke enorme vijftig hoe Bosnische Europees Zuid dezelfde hetzelfde welke ieder vele veel enkele beide dertig honderd vijf wat vorig economische mogen hard ondanks Van Nederlandse nationale Navo Rotterdamse elke deze zoveel bepaalde negen mijn dit laten drie tien zeven derde halve volgend vorige zware speciale belangrijke oud zwarte rode politieke vrije ex rekening tv milieu gebruik kun me we wij belangrijkste echte dergelijke voormalige meeste klein dollar groei overheid regering gemeente rechtbank Spanje ogen televisie stuk leeftijd weekeinde week seizoen keer ander handen hart familie bevolking Raad politiek kabinet onderwijs school tafel Feyenoord ploeg elftal finale zomer toekomst maand periode buitenland koers produktie verkoop verzoek ton rechter kant totaal groter mogelijkheid

Table 5.3: Classification of 1000 most frequent words in modern Dutch corpus

modern Dutch corpus³ were also classified in a level 6 binary tree. Four randomly selected classes and their direct neighbouring classes are listed in table 5.3. In this corpus, the 1000 most frequent words share 295,404 unique bigrams. Table 5.3 shows 4 directly neighbouring classes (29, 30, 31, 32). At level 4 in the tree they would be merged into one class. This would make sense, as classes 29, 30 and 31 contain number words (dertig, honderd, vijf, negen, drie, tien, zeven) and related adjectives (ieder, vele, veel, enkele, beide, elke, zoveel, bepaalde, derde, halve), and all 4 classes contain some other adjectives. Classes 49 and 50 also contain some semantically related words: ‘overheid’, ‘regering’, ‘gemeente’, ‘Raad’, ‘politiek’, ‘kabinet’ (government, government, community, council, politics, cabinet), and ‘leeftijd’, ‘weekeinde’, ‘week’, ‘seizoen’, ‘zomer’, ‘maand’, ‘periode’, ‘toekomst’ (age, weekend, week, season, summer, month, period, future). For all three corpora, the classification trees show some useful clustering, but they are far from usable for query expansion, because they are based on high frequency words, which add very little content to a query and mark a lot of documents as relevant. As mentioned before, classification of low frequency words is completely unreliable, because there is very little evidence to base a
³ This corpus is also from CLEF 2002.


classification on. But the low frequency words are the very words that are useful for document retrieval. Low frequency words, by definition, occur in only a few documents, and are often related to the topic of a document. It seems that the only way to get a more reliable classification is to use a bigger corpus.

There is, however, a big difference in automatic clustering between English and Dutch. In the 60 million word corpus used for English, there are ‘only’ 306,606 unique words, whereas the 30 million word corpus for modern Dutch contains 495,605 unique words. The historic Dutch corpus, containing 7 million words in total, has 373,596 unique words. In general, a larger corpus contains more unique words, so a 60 million word corpus of historic Dutch would probably contain many more unique words than the English corpus. The main reason for this is probably the compounding of words. In English, compounds written as one word (like ‘schoolday’) are rare, as most nouns are separated by whitespace (‘shoe lace’), but in Dutch, compounding is much more common, resulting in words like ‘bedrijfswagenfabriek’ (lit.: company car factory) and ‘nieuwjaarsgeschenken’ (New Year gifts). To get enough evidence for a reliable classification, a larger lexicon requires a larger corpus.

Another difference between these two languages is the word order, which is stricter in English than in Dutch. Both languages share the Subject - Verb - Object order in basic sentences. But when a modifier is added to the beginning of the sentence, the order is retained in English, but changes in Dutch (the verb is always the second part of the sentence, so the subject comes after the verb). This has consequences for the number of unique bigrams in the corpus. For Dutch, a larger number of bigrams is needed to get the same reliability for the ‘language model.’

The quality of the classification seems to depend on quite a number of factors:

1. Lexicon size: Each unique word needs plenty of evidence for proper classification, thus a larger lexicon needs more evidence, i.e., more text.

2. Ambiguity in a language: Words that can have different syntactic functions can supply contradictory evidence (the verb ‘sail’ can co-occur with words that cannot co-occur with the noun ‘sail’). Languages that have many of these words are harder to model correctly.

3. Strictness of word order: Some languages allow various word orderings for a sentence. In many so-called ‘free word order languages’ like Polish and Russian, a rich morphology makes it possible to distinguish syntactic categories. However, for a language like Dutch, the word order may be changed, but this introduces changes in pronouns and prepositions. A Dutch translation of the English sentence ‘I’m not aware of that.’ could be ‘Ik ben me daar niet bewust van,’ or ‘Ik ben me daar niet van bewust.’ But other word orders are allowed as well, like ‘Daar ben ik me niet bewust van’, or even ‘Daar ben ik me niet van bewust.’ Nothing changes morphologically, but there are four different sentences, with exactly the same words and exactly the same meaning.⁴ More possible orderings need more evidence to be modelled correctly.

4. Document style: Each document is written in a certain style. Sentences in newspaper articles are often different from sentences in personal letters. This has an effect on which bigrams occur in the corpus.

5. Depth of classification: At depth 1 (the classes directly under the root), the classification is often quite reliable. At deeper levels, the number of bigrams that the words in a class share becomes increasingly small, making each further classification less reliable. In classes with only nouns (especially in Dutch, where compounding leads to a large number of low frequency nouns), a further classification of semantic structure is not possible because of a lack of syntactic distinction (nouns rarely appear next to each other).

⁴ Thanks to Samson de Jager for pointing out this peculiarity in Dutch.
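The data sparseness argument behind the first factor can be made concrete by counting how many of the possible bigrams over the most frequent words actually occur in a corpus. The following is a minimal sketch, not the procedure used in the experiments: the function name, the plain list-of-tokens input and the toy sentence are assumptions made for illustration.

```python
from collections import Counter


def bigram_coverage(tokens, top_n):
    """Count unique bigrams among the top_n most frequent words.

    Returns (number of unique bigrams seen, fraction of possible bigrams missing).
    `tokens` is assumed to be the corpus as a flat list of word tokens.
    """
    freq = Counter(tokens)
    top = {word for word, _ in freq.most_common(top_n)}
    # Keep only adjacent pairs in which both words are frequent words.
    seen = {(a, b) for a, b in zip(tokens, tokens[1:]) if a in top and b in top}
    possible = len(top) ** 2
    missing = 1 - len(seen) / possible if possible else 0.0
    return len(seen), missing


# Toy corpus (illustrative only).
tokens = "de kat zit op de mat de kat".split()
unique_bigrams, missing = bigram_coverage(tokens, top_n=2)
```

With a real corpus, a high missing fraction for a given lexicon size indicates that the classification has little evidence to work with.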

To aid word clustering for historic Dutch, the historic document collection could be mixed with an equal amount of modern Dutch text to reduce data sparseness. The spelling of many words has changed over time, but the most frequent words have changed very little. There is still a reasonably large overlap between the most frequent words in both corpora, so if no more historic text is available, modern text might help. For modern Dutch, syntactically annotated corpora are available, and can be mixed with historic Dutch to estimate POS tags for historic words. If all the modern words in a class are nouns, it seems probable that the historic words in that class are nouns as well. To bridge the vocabulary gap, clustering historic and modern words with related meanings might be very useful. At least for query expansion, adding historic words to modern query words can increase recall.

5.3 Crawling footnotes

There are some digital resources available on the web. For instance, the Digitale bibliotheek voor de Nederlandse Letteren⁵ (DBNL) has a large collection of historic Dutch literature. Many of these texts contain footnotes of the form “1. opsnappers: feestvierders”. These are direct translations of historic words to modern variants. By using a large number of these texts, a historic to modern dictionary can be constructed. The texts on DBNL are categorized based on the century in which they were published. There is a huge list of 17th century Dutch literature available, containing over 100 books, and more works are added regularly. Not all books contain footnotes, and not all footnotes are direct translations. Many footnotes contain background information or references to other works. But some texts contain thousands of translations. Because the books are annotated by different people, the notes don’t have a consistent format. In some texts, the historic word is set in italics or boldface,
⁵ URL: www.dbnl.nl


in others, a special html-tag is used to mark it. Consider the next examples, the first of which is very clear, containing a special tag to signify a word translation.

    <div ID="N098"><small class="note"><a href="#T098" name="N098">
    <span class="notenr">&nbsp;4. </span></a>&nbsp;
    <span class="term">beschaemt:</span> teleurgesteld. Vgl.
    <span class="bible">Rom. X, 11</span>.</small></div>

    <div ID="N1944"><small class="note">
    <a href="#T1944" name="N1944">
    <span class="notenr">&nbsp;9 </span>
    </a>&nbsp;<i>bloot:</i> onbeschermd. </small></div>

    <div ID="N1608"><small class="note">
    <a href="#T1608" name="N1608">
    <span class="notenr">&nbsp;1353 </span></a>&nbsp;
    <i>hoofdscheel:</i> hoofdschedel; <i>van:</i> door;
    <i>bedropen:</i> overgoten; Van ’t begin en van ’t einde van
    Melchisedech’s leven is ons verder niets bekend; Vondel beschouwt
    hem als door God zelf tot priester gewijd.</small></div>

The first note has marked the historic word (‘beschaemt’) by tagging it with a span class called ‘term’. In all of these cases, the modern translation (‘teleurgesteld’) directly follows the historic word, and ends with a dot or a semi-colon. The second note is less specific. The historic word (‘bloot’) is marked in italics, and the modern translation (‘onbeschermd’) again follows it and ends in a dot (or a semi-colon). The third note contains several such pairs, followed by an explanation. The first note is easy to extract. The second note is more problematic, because the italics do not always signify a translation:

    <div ID="N1728"><small class="note">
    <a href="#T1728" name="N1728">
    <span class="notenr">&nbsp;10 </span></a>&nbsp;
    <i>Orpheus:</i> Orpheus, de bekende zanger van de Griekse sage,
    die de wilde dieren bedwong door z’n snarenspel
    (<i>konde paren:</i> kon verenigen).</small></div>

Here, the first word in italics, ‘Orpheus’, is not followed by a modern translation, but by an explanation of who Orpheus was. A simple way of distinguishing between this note and the previous one is that a translation pair contains only one word after the historic, italicized word. But this doesn’t work for translations containing several words:

    <div ID="N1726"><small class="note">
    <a href="#T1726" name="N1726">
    <span class="notenr">&nbsp;7 </span>
    </a>&nbsp;<i>sloer:</i> sleur, gang, manier. </small></div>

    <div ID="N1437"><small class="note">
    <a href="#T1437" name="N1437">
    <span class="notenr">&nbsp;12 </span></a>&nbsp;
    <i>onses Moeders:</i> van onze moeder, de aarde.</small></div>

For the historic word ‘sloer’, multiple modern translations are given, separated by commas. The historic phrase ‘onses Moeders’ has two modern phrases as translation. How can these be distinguished from the note about Orpheus? It gets even worse. Consider the next consecutive notes:

    <div class="notes-container" id="noot-1739">
    <div class="note">
    <a href="#1739T" name="1739"><span class="notenr">4</span></a>
    <i>gedraeghe mij tot:</i> houd mij aan.</div></div>

    <div class="notes-container" id="noot-1740">
    <div class="note">
    <a href="#1740T" name="1740"><span class="notenr">5</span></a>
    <i>deze: de ene</i>. Bedoeld wordt Pieter Reael, vgl.
    <i>502</i>.</div></div>

The first one contains the historic phrase inside italics and the modern phrase following it directly. The second one contains both the historic word and its modern translation inside italics, and an explanation directly after it. And a few notes further down, the single word after the italics is not a modern translation, but a reference:

    <div class="notes-container" id="noot-1744"><div class="note">
    <a href="#1744T" name="1744"><span class="notenr">14</span></a>
    <i>de schrijver:</i> Vondel.</div></div>

All this makes it very hard to extract only the translation pairs from a note. Manual correction is not an option, since the 17th century DBNL corpus contains over 170,000 footnotes. The final list consists of approximately 110,000 translation pairs, many of which are not actual translation pairs but references, explanations or descriptions. Still, for query expansion it could be useful. If each modern translation occurs only a few times, only a few historic words or phrases are added to the query.
Not all of them will be useful, but the noise added to the query might be compensated by the fact that some relevant words are added as well. Making separate dictionaries for word to word, word to phrase, phrase to word and phrase to phrase translations, and evaluating each of them separately, will give an indication of whether a dictionary can be useful, or contains too much noise. The dictionaries in table 5.4 are translations from historic to modern, as extracted from the DBNL corpus. The word to phrase dictionary contains historic words as entries, and modern phrases as translations. Vice versa, the phrase to word dictionary contains historic phrases with modern single word

Dictionary          number of translations   unique entries   number of synonyms (avg. per entry)
word to word        36505                    20281            1.8
word to phrase      26445                    16649            1.6
phrase to word      5589                     4931             1.1
phrase to phrase    42680                    35127            1.2
total               111219                   68384            1.6

Table 5.4: DBNL dictionary sizes

translations. To get an indication of the usefulness of the DBNL thesaurus, each of the four parts of the total thesaurus was evaluated on a random sample of 100 entries, and each entry was marked as useful or useless. This process was repeated once, so two samples of 100 random entries were drawn for each part. The results (first sample / second sample) give us some idea about the usefulness of the thesaurus parts.

thesaurus part      useful entries   useless entries
word to word        91/88            9/12
word to phrase      72/62            28/38
phrase to word      59/55            41/44
phrase to phrase    70/68            30/32

Table 5.5: Simple evaluation of DBNL thesaurus parts: usefulness of 100 random samples

Some good examples from the word to word and word to phrase dictionaries are:

    ghewracht      ->  bewerkt
    badt           ->  verzocht
    booswicht      ->  zondaar
    belent         ->  zeer nabij
    heerschapper   ->  heer en meester

Here are some bad examples:

    galgenbergh    ->  Golgotha
    God            ->  Godheid
    Katten-vel     ->  kat
    d’altaergodin  ->  Vesta
    stuck          ->  op ’t schaakbord
    Hippomeen      ->  zie

The last example is a typical parsing mistake. The right hand side ‘zie’ (English: see) is part of a reference to something. Furthermore, the word to word and word to phrase dictionaries were used to get an idea of the overlap between the historic words in a historic corpus, and the historic words in the dictionaries. How many of the words in the hist1600 corpus (the corpus used for the RSF and RNF algorithms, see section 3.1.2), for example, have an entry in the DBNL thesaurus? And what about the corpus that was used for creating the test set? Table 5.6 gives an indication of the coverage of the thesaurus. The Braun corpus contains the ‘Antwerpse Compilatae’ and the ‘Gelders Land- en Stadsrecht’, the Mander corpus contains ‘Het Schilderboeck’ by Karel van Mander. Together, they make up the hist1600 corpus. This split was made because the DBNL thesaurus contains some entries extracted from the Mander corpus. The same holds for the documents from the test set corpus. The modern words in the corpora, at least the words that are found in the modern Kunlex lexicon, were first removed from the total historic lexicon (column 3). Synonyms for these words can be found in a modern Dutch thesaurus. The coverage results can be explained by the footnote extraction. The Braun corpus does not contain any footnotes, and has the smallest coverage by the DBNL thesaurus. The Mander corpus has a larger coverage, probably because a number of entries from the DBNL thesaurus come from ‘Het Schilderboeck’. That the DBNL thesaurus covers an even larger part of the test set corpus is probably due to the fact that ‘De werken van Vondel, Eerste deel (1605 – 1620)’ is part of the corpus and contains several thousand notes with translation pairs.
Corpus            Unique words   Not in Kunlex   DBNL coverage
hist1600          47816          41156 (86%)     4315 (10.5%)
Braun             17891          14168 (79%)     1429 (10.1%)
Mander            33805          27074 (80%)     3547 (13.1%)
Test set corpus   69453          44827 (65%)     8119 (18.1%)
Selection         1600           1569 (98%)      603 (38.4%)
Test set          400            397 (99%)       152 (38.3%)

Table 5.6: Coverage statistics of corpora for DBNL thesaurus
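The columns of Table 5.6 follow from simple set operations. The sketch below shows the computation under the assumption that the corpus, the modern lexicon and the thesaurus entries are available as plain collections of words; the function name and the toy data are illustrative and not taken from the experiments.

```python
def coverage(corpus_words, modern_lexicon, thesaurus_entries):
    """Return (unique words, words not in the modern lexicon, covered words)."""
    unique = set(corpus_words)
    # Words found in the modern lexicon are removed first (column 3).
    historic = unique - set(modern_lexicon)
    # Coverage is counted over the remaining, presumably historic, words.
    covered = historic & set(thesaurus_entries)
    return len(unique), len(historic), len(covered)


# Toy data (illustrative only).
corpus = ["boeck", "stadt", "de", "plaets", "boeck"]
stats = coverage(corpus, modern_lexicon={"de"}, thesaurus_entries={"boeck", "heer"})
```

The coverage percentage in the last column is then the third number divided by the second.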

The DBNL thesaurus covers a far larger part of the historic words in the selection and test set (see section 3.4.1). Apparently, in the process of giving modern spelling variants of historic words, there was a bias towards giving modern forms for historic words with a salient historic spelling. This bias may very well have been shared by the editors of the DBNL who made the footnotes. Also, the selection and test set do contain some modern words. Out of the 2000 words in both sets, 34 words are in the Kunlex, showing that the decision whether a word is historic or modern is not trivial.

5.3.1 HDR evaluation

As an external evaluation, the performance of the DBNL thesaurus was tested in a historic document retrieval experiment. For a description of the experiment, see section 4.3. Instead of using rewrite rules, the documents and query were translated using the DBNL thesaurus. As the results in table 5.5 show, apart from the word to word thesaurus, the thesauri contain many nonsense entries. Therefore, only the word to word thesaurus was used. The original words in the historic documents were replaced by one of the related words from the DBNL thesaurus. For query translation, all the entries containing a query word as a translation were added to the query. The original query words were kept in the query as well. The effect of translation was compared to other standard IR techniques. Table 5.7 contains the results of translation both with and without stemming on the known-item topics.

Method               Avg. prec. D only   Avg. prec. D+T   Avg. prec. T only
Baseline             0.2192              0.1955           0.1568
Stemming             0.2125              0.2352           0.1749
4grams               0.2366              0.2538           0.2457
Decompounding        0.2356              0.2195           0.1795
DBNL Doc             0.1098              0.1262           0.1546
DBNL Query           0.0860              0.1324           0.1597
DBNL Doc + stem      0.0902              0.1250           0.1321
DBNL Query + stem    0.1389              0.1730           0.1847

Table 5.7: HDR results for known-item topics using DBNL thesaurus

Translation of the descriptions is disastrous for retrieval performance, although stemming compensates a little. For the titles, translation works slightly better. Whereas the baseline shows a decline in performance when titles are added and descriptions are removed, query translation shows the opposite behaviour. One of the reasons might be that queries using only titles contain far fewer words than the descriptions. Table 5.8 displays the average number of words in the descriptions and titles for both topic sets. Method ‘None’ represents the original queries, before translation. The DBNL thesaurus more than doubles the number of words in the descriptions, but as the description also contains high frequency words, and the thesaurus also contains translations for high frequency words, adding so many


translations apparently does more harm than good. Only modern stop words (the high frequency words, that add little content to the query) are removed from the query, but historic translations are added before this happens. The titles don’t contain any stop words, thus through translation none are added. The titles contain mostly low frequency content words. Adding historic synonyms of these words, and stemming all the query words afterwards, improves performance. Look at topic 7 for a good example:

• Description: Kan een eigenaar van onroerend goed zijn verhuurde pand zomaar verkopen, of heeft hij nog verplichtingen ten opzichte van de huurder?

• Title: eigenaar onroerend goed verhuurde pand verkopen verplichtingen huurder

This is the same query after adding translations:

• Description: kan ken koon mach magh een een eigenaar van onroerend goed aertigh welzijn binnen cleven sinnen verstrekken verhuurde pand paan panckt zomaar verkopen veylen of heeft hij deselve sulcke versoecker nog nach verplichtingen ten opzichte van de huurder huurling

• Title: eigenaar onroerend goed aertigh wel verhuurde pand paan panckt verkopen veylen verplichtingen huurder huurling

Not all translations added to the title are good, but most of them are related to the topic. As for the description, many totally unrelated historic words are added that will not be recognized as stop words (mach, magh, koon, sulcke).

topics       method   words in title   words in descr.
expert       None     3.52             11.05
expert       dbnl     5.52             18.43
expert       phon     4.29             16.38
expert       rules    4.14             14.05
known item   None     4.36             11.68
known item   dbnl     8.60             24.48
known item   phon     7.44             19.32
known item   rules    6.68             16.20

Table 5.8: Average number of words in the query using query translation methods

Because the combination of query translation and stemming works better than query translation only, it would be interesting to combine query translation with the other monolingual techniques. Another approach could be to combine the scores of retrieval runs. By giving the ranked list of each retrieval approach

Method               D only    D+T       T only
Baseline             0.3396    0.4289    0.4967
Stemming             0.3187    0.3778    0.4206
4grams               0.3037    0.3465    0.3821
Decompounding        0.3307    0.4228    0.4900
DBNL Doc             0.2246    0.2326    0.3442
DBNL Query           0.2749    0.3696    0.4632
DBNL Doc + stem      0.2095    0.2574    0.2917
DBNL Query + stem    0.2705    0.3316    0.3848

Table 5.9: HDR results for expert topics using DBNL thesaurus

a specific weight, the final relevance score of a document is the weighted sum of the relevance scores of each approach. Documents that are considered relevant by two different approaches, say, retrieval using 4-grams and retrieval using the DBNL thesaurus, are, on average, ranked higher on the combined list. The reasoning behind this is that if several approaches retrieve the same document, there is a fair chance that that document is actually relevant. As was mentioned in section 4.3, the HDR results of the advanced techniques for the expert topics show no improvement over the baseline. The same holds for document and query translation using the DBNL thesaurus. The expert queries contain specific 17th century words from the documents, making query expansion redundant for a large part. It is still interesting to see that, consistent with the known-item results, query translation works better than document translation, and stemming afterwards has a negative effect. Although the monolingual methods perform better on the descriptions, translation of the titles seems to work better than stemming or 4-gram matching. And again, adding translations to the descriptions decreases performance significantly.
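The weighted combination of retrieval runs suggested above was not carried out in these experiments; the following is only a sketch of how it could look. It assumes that each run is available as a mapping from document identifier to relevance score and that the scores of the different runs have been normalized to a comparable range. The function name, the weights and the toy runs are hypothetical.

```python
def combine_runs(runs, weights):
    """Rank documents by the weighted sum of their scores over several runs.

    `runs` is a list of dicts {doc_id: score}; a document missing from a run
    contributes nothing for that run.
    """
    combined = {}
    for run, weight in zip(runs, weights):
        for doc_id, score in run.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    # Highest combined score first; ties are broken by document identifier.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Toy runs (illustrative scores only).
run_4grams = {"d1": 0.9, "d2": 0.4}
run_dbnl = {"d2": 0.8, "d3": 0.7}
ranking = combine_runs([run_4grams, run_dbnl], weights=[0.5, 0.5])
```

In the toy example, document d2 is retrieved by both runs and ends up at the top of the combined list, although neither run ranks it first.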

5.4 Phonetic transcriptions

Although historic words are often spelled differently from their modern counterparts, in many cases their pronunciation is the same. This fact can be used effectively to construct a dictionary of equal sounding word pairs. For Dutch, a few algorithms are available to convert strings of letters to strings of phonemes (see section 3.2.1). The algorithm for building a dictionary using phonetic transcriptions is very simple. First, convert the historic lexicon lex_hist into a historic pronunciation dictionary Pdict_hist, and the modern lexicon lex_mod into Pdict_mod. Next, for each historic word w_hist in Pdict_hist, look up its phonetic transcription PT(w_hist) in the modern dictionary Pdict_mod. Pair w_hist with all words w_mod for which the phonetic transcription PT(w_mod) is equal to PT(w_hist), and add each pair to the final dictionary. This approach can also be combined with the rewrite rules to improve upon


the final thesaurus. After applying rewrite rules to the historic lexicon, the rewritten words will (probably) be more similar in spelling to the corresponding modern word. Through rewriting, the pronunciation of a word may change. Since letter-to-phoneme algorithms are based on modern pronunciation rules, the phonetic transcription of the historic word klaeghen will be different from the transcription of its corresponding modern word klagen, since the modern pronunciation of ae is different from the modern pronunciation of a (they may have been the same in 17th century Dutch). Thus, if after rewriting, klaeghen has become klaghen, the phonetic transcription will be equal to that of klagen. Converting the lexicon to a pronunciation dictionary again, and repeating the construction procedure, will result in new pairings. Of course, words that are pronounced the same are not necessarily the same words (consider eight and ate). This is where the edit distance clearly helps in distinguishing between spelling variants and homophones (if the homophones are orthographically dissimilar enough).⁶ Because the phonetic transcriptions contain some errors, and because the pronunciation of some vowel sequences has changed over time, the phonetic transcriptions before and after rewriting were evaluated by randomly selecting 100 entries and checking them for correctness. The whole experiment was done twice to get a more reliable indication. If the numbers of correct and incorrect transcriptions show a big difference between the first and the second time, a bigger sample, or more iterations, are needed to get a better indication. If the numbers vary only slightly, their average gives a fair indication of the total number of correct and incorrect transcriptions. The results in Table 5.10 show a significant improvement in the quality of the transcriptions.
Before rewriting, the phonetic dictionary contains 4592 entries, and 16% of all transcriptions are different from their real pronunciation. Only 2% of all the 11,592 entries after rewriting have incorrect phonetic transcriptions. The phonetic dictionary after rewriting (using the combined rule set RNF+RSF+PSS) contains the original historic words as entries, but the modern words were matched with the phonetic transcriptions of the rewritten forms of the historic words. The word aengaende was first rewritten to aangaande. Then, aengaende is matched with a modern word that has the same phonetic transcription as aangaande. Not only does rewriting affect the number of historic words that are phonetically similar to their modern forms, it also decreases the number of wrong phonetic matches. The historic ae sequence is no longer matched with the modern ee sequence, but with aa. The same goes for the historic sequences ey and uy, which were matched with the sequence ie in modern words before rewriting, and are matched with ei and ui respectively afterwards.
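The construction procedure can be sketched in a few lines. The sketch below is an illustration under assumptions: `toy_transcribe` is a hypothetical stand-in for a real letter-to-phoneme converter (see section 3.2.1) and only collapses a handful of spellings, and the optional `rewrite` argument stands in for the rewrite rules. Entries remain the original historic words, while matching uses the transcription of the rewritten form.

```python
from collections import defaultdict


def toy_transcribe(word):
    """Hypothetical stand-in for a letter-to-phoneme converter."""
    word = word.lower()
    for old, new in (("ck", "k"), ("gh", "g"), ("dt", "t")):
        word = word.replace(old, new)
    if word.endswith("d"):  # Dutch final devoicing: 'stad' sounds like 'stat'
        word = word[:-1] + "t"
    return word


def build_phonetic_dictionary(lex_hist, lex_mod, transcribe, rewrite=None):
    """Pair each historic word with all modern words that sound the same."""
    # Pdict_mod: phonetic transcription -> modern words with that transcription.
    pdict_mod = defaultdict(set)
    for w_mod in lex_mod:
        pdict_mod[transcribe(w_mod)].add(w_mod)
    dictionary = {}
    for w_hist in lex_hist:
        form = rewrite(w_hist) if rewrite else w_hist
        matches = pdict_mod.get(transcribe(form))
        if matches:
            dictionary[w_hist] = sorted(matches)
    return dictionary


lex_hist = ["boeck", "stadt", "klaeghen"]
lex_mod = ["boek", "stad", "klagen"]
before = build_phonetic_dictionary(lex_hist, lex_mod, toy_transcribe)
after = build_phonetic_dictionary(
    lex_hist, lex_mod, toy_transcribe, rewrite=lambda w: w.replace("ae", "a")
)
```

In this toy setting klaeghen is only paired with klagen once the hypothetical ae to a rewrite is applied, which mirrors the growth of the dictionary after rewriting.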

⁶ If two homophones are orthographically similar, a spelling variant of one of them could just as easily be a spelling variant of the other.

Phonetic dictionary   total entries   incorrect entries of 100   perc. incorrect
normal                4592            15 / 17                    16
rewritten             11592           0 / 4                      2

Table 5.10: Incorrect transcriptions in 2 samples of 100 randomly selected entries, before and after rewriting

5.4.1 HDR and phonetic transcriptions

As a way of evaluating the effectiveness of mapping words using phonetic transcriptions, the same HDR experiment as described in section 4.3 and the previous section was conducted. Instead of using rewrite rules to translate queries or documents, the phonetic transcription dictionary was used. The results are shown in Table 5.11. For this experiment, the stop word list was extended with phonetic variants of stop words taken from the phonetic dictionary.

Method                      Avg. prec. D only   Avg. prec. D+T   Avg. prec. T only
Baseline                    0.2192              0.1955           0.1568
Stemming                    0.2125              0.2352           0.1749
4grams                      0.2366              0.2538           0.2457
Decompounding               0.2356              0.2195           0.1795
Phonetic Doc                0.2642              0.2901           0.2609
Phonetic Query              0.2458              0.2511           0.2474
Phonetic Doc + stemming     0.2645              0.3054           0.2502
Phonetic Query + stemming   0.1911              0.2153           0.1983

Table 5.11: HDR results for known-item topics using phonetic transcriptions

The results of translating the queries are comparable to the use of 4-grams in the monolingual approach, and, as with rewriting (see Table 4.6 in the previous chapter), stemming the translated queries has a negative effect for the same reason. The historic words often have historic suffixes that are unaffected by the stemmer, thus conflation of morphological variants is minimal. Document translation shows the best results for all different queries (D only, D+T and T only). But now, the effect of stemming is minimal. The number of phonetically equal words added to the descriptions and titles is smaller than the number of related words added by the DBNL thesaurus. Although the phonetic dictionary adds spelling variants of modern stop words to the query, the list of modern stop words was extended with their phonetically equal historic counterparts. Therefore, the performance of query translation for the descriptions is comparable to query translation for the titles. Combining them does not affect average precision much.

Method                D only    D+T       T only
Baseline              0.3396    0.4289    0.4967
Stemming              0.3187    0.3778    0.4206
4grams                0.3037    0.3465    0.3821
Decompounding         0.3307    0.4228    0.4900
Phonetic Doc          0.2719    0.3213    0.4178
Phonetic Query        0.3037    0.4137    0.4920
Phon. Doc + stem      0.2581    0.3063    0.3373
Phon. Query + stem    0.2913    0.3638    0.4176

Table 5.12: HDR results for expert topics using phonetic transcriptions

Another interesting observation is that combining descriptions and titles leads to a significant increase in precision for document translation. The original queries of topic 3:

• Description: Hoe wordt de hypotheekrente afgehandeld bij de verkoop van een pand?

• Title: hypotheekrente afgehandeld verkoop pand

Adding phonetic transcriptions results in:

• Description: hoe wordt wort de hypotheekrente afgehandeld bij bei bey by de verkoop vercoop vercoope vercope van vaen een ehen pand pandt

• Title: hypotheekrente afgehandeld verkoop vercoop vercoope vercope pand pandt

In both the title and the description, 3 spelling variants are added for verkoop (sale) and 1 for pand (house). The spelling variants of the stop words bij, van and een were removed because of the extended stop word list.
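Query translation with an extended stop word list can be sketched as follows. This is a minimal illustration and not the retrieval system used in the experiments: it assumes a dictionary that maps a modern word to its historic variants, and the function names and the toy dictionary (built from the topic 3 example) are hypothetical.

```python
def extend_stop_words(stop_words, dictionary):
    """Add the historic variants of every modern stop word to the stop list."""
    extended = set(stop_words)
    for word in stop_words:
        extended.update(dictionary.get(word, []))
    return extended


def expand_query(query_words, dictionary, stop_words):
    """Keep the original content words and append their historic variants."""
    expanded = []
    for word in query_words:
        if word in stop_words:
            continue
        expanded.append(word)
        for variant in dictionary.get(word, []):
            if variant not in stop_words and variant not in expanded:
                expanded.append(variant)
    return expanded


# Toy dictionary: modern word -> historic spelling variants.
variants = {
    "verkoop": ["vercoop", "vercoope", "vercope"],
    "pand": ["pandt"],
    "van": ["vaen"],
    "een": ["ehen"],
}
stop_words = extend_stop_words({"de", "van", "een"}, variants)
query = expand_query("de verkoop van een pand".split(), variants, stop_words)
```

Extending the stop list before expansion is what keeps variants such as vaen and ehen out of the query.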

5.5 Edit distance

Similarity between words can also be measured using the edit distance algorithm. At each place in a word, a character can be deleted, inserted or substituted (the same as delete + insert). The edit distance of two words is equal to the cost of changing one word into the other. Deleting and inserting cost 1 step each, substitution costs 2 (equal to 1 delete and 1 insert) unless the character to be substituted is equal to the substitute, in which case the cost is 0. The more similar two words are, the lower the cost. This technique is often used for spell checking. The algorithm can be adjusted to account for the distance between two keys on a keyboard. Accidentally hitting a key next to the intended one occurs more often than hitting one on the other end of the keyboard. A bigger

Similar characters

    b, p
    d, t
    f, v
    s, z
    y, i
    y, ie
    y, ij
    g, ch
    c, k
    c, s

Table 5.13: Phonetically similar characters

distance between two characters on a keyboard results in a higher substitution cost. But the similarity between historic words and their modern variants is not based on the distance between keys on a keyboard, but on their similarity in pronunciation. Thus, the algorithm can instead be adjusted to take into account the similarity of pronunciation. A c can be pronounced as a k or as an s. Thus, substituting a c for an s should be lower in cost than substituting a c for a t. Adjusting the cost of substitutions has to be done carefully. The cost of substituting c for t should not be increased unless the cost of deleting and inserting characters is increased as well. Otherwise, the algorithm will prefer deleting + inserting over substituting, resulting in the same cost as before the adjustment. Instead, when the cost of substituting phonetically similar characters is lowered, the algorithm will prefer substituting c for s over deleting c and then inserting s.
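The basic algorithm with the costs given above (1 for a deletion or insertion, 2 for a substitution, 0 for equal characters) can be written as a standard dynamic programming routine. The sketch below is a textbook formulation, not code from the experiments; the function name and the `sub_cost` parameter are illustrative.

```python
def edit_distance(source, target, sub_cost=2.0):
    """Edit distance with insert/delete cost 1 and substitution cost sub_cost."""
    # prev[j] holds the cost of turning the processed prefix of source
    # into the first j characters of target.
    prev = [float(j) for j in range(len(target) + 1)]
    for i, s in enumerate(source, 1):
        curr = [float(i)]
        for j, t in enumerate(target, 1):
            substitute = prev[j - 1] + (0.0 if s == t else sub_cost)
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


distance = edit_distance("boeck", "boek")
```

With a substitution cost of 2, substituting is never cheaper than a deletion followed by an insertion, which is why lowering the cost for selected character pairs is the adjustment that changes the outcome.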

5.5.1 The phonetic edit distance algorithm

The Phonetic Edit Distance (PED) algorithm is an adjusted version of the basic edit distance algorithm. In the standard version, every substitution adds 2 to the total edit distance unless the two characters under consideration are equal. The PED version differentiates between substituting two phonetically similar characters and two phonetically dissimilar characters. If an ‘s’ is substituted for a ‘z’, the edit distance is increased by 0.5, but if an ‘s’ is substituted for a ‘j’ the edit distance is increased by 2. The characters listed in Table 5.13 are considered phonetically similar. The edit distance is increased by 2 when the first character of the combinations ‘ie’, ‘ij’ or ‘ch’ is substituted for the phonetic equivalent. The PED algorithm then decreases the edit distance by 1.5 if the second character of the character combinations ‘ie’, ‘ij’ or ‘ch’ is substituted for the phonetic equivalent

78

CHAPTER 5. THESAURI AND DICTIONARIES

of the character combination. In total, after two substitutions, 0.5 is added to the edit distance. The main problem with the edit distance algorithm is that it is a costly operation. Since the historic and modern corpora easily consist of several thousand, or even tens of thousands of words, finding the closest historic match of a modern word requires a huge amount of computation, [25]. A solution to this problem is to use a coarse grained selection method like n-gram matching first, which can reduce the number of candidates under consideration. The word-retrieval experiment (section 4.2) showed that candidate selection using n-grams works well, especially after applying rewrite rules. With an n-gram size of 2, over 90% of the historic variants from the test set were found in the top 20 candidates. However, most of the words in the list of candidates are no historical variants of the modern word. As the results in Table 4.5 show, only half of the variants are found at rank 1, and for most modern words, there are only 2 or 3 spelling variants found in the entire historic corpus. This is where the fine grained selection of edit distance can be put to good use. Using the phonetic version of the edit distance algorithm, the 20 candidates can be reranked according to their phonetic similarity. The historical variants should be ranked higher than all the other words in the list of candidates. Of course, if there are multiple historic variants of the modern word, the historical variant from the test set need not be at rank 1. There might be another spelling variant that is phonetically closer to the modern word. Thus, the precision @1 will not be much higher, but the precision @5 should increase (there are very little modern words that have more than 5 historical spelling variants in the corpus). 
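The PED cost scheme described above can be sketched as follows. This is a hedged illustration in Python, not the PED.pl implementation. Two assumptions are made: an insertion or a deletion costs 1, and a two-character combination (‘ie’, ‘ij’, ‘ch’) is aligned with its single-character equivalent in one step costing 0.5, which gives the same total as the two-step bookkeeping (2 - 1.5) in the text.

```python
# Pairs from Table 5.13; every pair is treated as symmetric.
SIMILAR_PAIRS = [
    ("b", "p"), ("d", "t"), ("f", "v"), ("s", "z"), ("y", "i"),
    ("y", "ie"), ("y", "ij"), ("g", "ch"), ("c", "k"), ("c", "s"),
]
SIMILAR = {(a, b) for a, b in SIMILAR_PAIRS} | {(b, a) for a, b in SIMILAR_PAIRS}

INDEL_COST = 1.0    # assumption: one insertion or deletion costs 1
SUB_COST = 2.0      # from the text: a dissimilar substitution costs 2
SIMILAR_COST = 0.5  # from the text: a similar substitution costs 0.5


def ped(source, target):
    """Phonetic edit distance between two strings (sketch)."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * INDEL_COST
    for j in range(1, cols):
        dist[0][j] = j * INDEL_COST
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            if a == b:
                sub = 0.0
            elif (a, b) in SIMILAR:
                sub = SIMILAR_COST
            else:
                sub = SUB_COST
            best = min(
                dist[i - 1][j] + INDEL_COST,
                dist[i][j - 1] + INDEL_COST,
                dist[i - 1][j - 1] + sub,
            )
            # One source character against a two-character combination.
            if j >= 2 and (a, target[j - 2:j]) in SIMILAR:
                best = min(best, dist[i - 1][j - 2] + SIMILAR_COST)
            # A two-character combination against one target character.
            if i >= 2 and (source[i - 2:i], b) in SIMILAR:
                best = min(best, dist[i - 2][j - 1] + SIMILAR_COST)
            dist[i][j] = best
    return dist[-1][-1]


print(ped("s", "z"))            # 0.5
print(ped("s", "j"))            # 2.0
print(ped("clacht", "klacht"))  # 0.5
print(ped("myn", "mijn"))       # 0.5
```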
The list of phonetically similar characters could probably be extended, but the algorithm is already an improvement over the standard edit distance algorithm, as can be seen in Table 5.14. ED stands for the standard edit distance algorithm, PED for the phonetic version, and RR means that the rewritten forms of the original historic words were used for n-gram retrieval, using the combined RNF+RSF+PSS rule set. Recall scores @10 are roughly the same for ED and PED, but for recall @5 the differences become more significant. After reranking, the recall @5 is much closer to the recall @20. In fact, the increase in precision makes it interesting to retrieve more than 20 word-forms. The problem with PED is that it is computationally expensive to find the closest match in an entire lexicon, but for 20 or even 100 words this is no problem. The increase in recall when retrieving 100 words can be transformed into an increase in recall @5 through reranking using PED. In the n-gram retrieval part, once the ranked list of candidates is calculated, selecting the first 100 words takes negligibly more time than selecting the first 20 words.
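The two-stage procedure (coarse n-gram selection, then fine reranking) might look roughly like the sketch below. It is written in Python for illustration and makes several assumptions: the n-gram score is a Dice coefficient over padded character bigrams, which may differ from the measure used in section 4.2; the distance function only knows the single-character pairs of Table 5.13; and the small lexicon is made up for the example.

```python
from collections import Counter

# Single-character pairs from Table 5.13 (combinations like 'ij' are left out here).
SIMILAR = {frozenset(p) for p in [("b", "p"), ("d", "t"), ("f", "v"), ("s", "z"),
                                  ("y", "i"), ("c", "k"), ("c", "s")]}


def char_ngrams(word, n=2):
    padded = "#" + word + "#"  # '#' marks the word boundaries
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_overlap(a, b, n=2):
    """Dice coefficient over character n-grams (an assumed similarity measure)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    shared = sum((ga & gb).values())
    return 2.0 * shared / (sum(ga.values()) + sum(gb.values()))


def select_candidates(modern_word, historic_lexicon, k=20, n=2):
    """Coarse grained step: the k historic words with the highest n-gram overlap."""
    ranked = sorted(historic_lexicon,
                    key=lambda w: (-ngram_overlap(modern_word, w, n), w))
    return ranked[:k]


def phonetic_distance(a, b):
    """Simplified phonetic edit distance (insert/delete 1, similar 0.5, other 2)."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)
    for j in range(1, cols):
        dist[0][j] = float(j)
    for i in range(1, rows):
        for j in range(1, cols):
            x, y = a[i - 1], b[j - 1]
            sub = 0.0 if x == y else 0.5 if frozenset((x, y)) in SIMILAR else 2.0
            dist[i][j] = min(dist[i - 1][j] + 1.0, dist[i][j - 1] + 1.0,
                             dist[i - 1][j - 1] + sub)
    return dist[-1][-1]


def rerank(modern_word, candidates, distance, top=5):
    """Fine grained step: order the candidates by distance to the modern word."""
    return sorted(candidates, key=lambda w: (distance(modern_word, w), w))[:top]


# Made-up historic lexicon, for illustration only.
lexicon = ["clacht", "clachte", "kracht", "macht", "klagen", "nacht"]
small = select_candidates("klacht", lexicon, k=4)
large = select_candidates("klacht", lexicon, k=6)
print(rerank("klacht", small, phonetic_distance, top=3))
print(rerank("klacht", large, phonetic_distance, top=3))
```

In this toy example the variant clachte is missed when only 4 candidates are selected, but it ends up at rank 2 when 6 candidates are selected and reranked, which is the effect described above for retrieving 100 instead of 20 words.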

5.6 Conclusion

N-gram size   Preprocess   recall @20   @10     @5      @1
2             -            0.853        0.783   0.691   0.322
2             ED           0.853        0.830   0.720   0.322
2             PED          0.853        0.843   0.802   0.436
2             RR           0.925        0.889   0.832   0.488
2             RR+ED        0.925        0.908   0.850   0.519
2             RR+PED       0.925        0.920   0.892   0.563
3             -            0.798        0.706   0.600   0.277
3             ED           0.798        0.782   0.707   0.326
3             PED          0.798        0.791   0.768   0.430
3             RR           0.908        0.867   0.815   0.493
3             RR+ED        0.908        0.894   0.851   0.523
3             RR+PED       0.908        0.905   0.881   0.565

Table 5.14: Results of historic word-form retrieval using PED reranking

The DBNL thesaurus can be used effectively in query expansion if the stop word list is extended with historic variants, as was done for the phonetic dictionary. With a better note extraction algorithm, the word to phrase and phrase to word translations might become useful as well. The downside is that the construction of this thesaurus depends on manually added word translation pairs. Automatically extracting them correctly is difficult, and the only historic words for which a translation is given are the ones that are deemed important by the DBNL editors. The modern translations of the words that they find important enough to translate might not be the words that users pose as query words. By combining historic Dutch documents with modern Dutch documents and, more importantly, by increasing the corpus size, word clustering algorithms can become an important method for bridging the vocabulary gap. As it stands, the vocabulary gap remains the more difficult bottleneck of the two, as the spelling gap is partly bridged by the rewrite rules from the previous chapter and by the phonetic dictionary and PED reranking procedure in this chapter.

The phonetic variants dictionary is effective, but only after rewriting. The phonetic transcriptions of the original historic words are not always correct. By replacing these transcriptions with the transcriptions of the rewritten words, many historic words are no longer paired with the wrong modern word. The advantage of matching historic and modern words with phonetic transcriptions over using rewrite rules is that non-typical historic character sequences (like ‘cl’ in clacht) are not rewritten incorrectly (clausule should not be rewritten to klausule). The phonetic dictionary only replaces whole words, not parts of words. Thus, clacht will be replaced with klacht, but the historic word clausule is matched with the modern word clausule, and is thus retained in the lexicon.

The performance of word-retrieval can be greatly improved by reranking the candidate list using the Phonetic Edit Distance algorithm. The number of candidates can then be reduced to 3 or 5 words, and the remaining list can be used for query expansion. This has yet to be tested in an HDR experiment, though.


Chapter 6

Concluding
We’ve seen, in the previous chapters, that language resources can be constructed from nothing but plain text. They can be used effectively for HDR, and might also be used as stand alone resources to improve readability. This chapter concludes this research, and will try to answer the questions from the first chapter. Some future directions are given as well.

6.1 Language resources for historic Dutch

The first chapter posed some research questions. They are repeated here and an attempt at answering them is made.

• Can resources be constructed automatically?

The methods described in chapters 3, 4 and 5 have shown that language resources for Dutch historic document retrieval can be constructed automatically using nothing but plain historic Dutch text. The HDR experiments and the word-form retrieval experiments have shown that these language resources can be used effectively to find historic Dutch word-forms of modern Dutch words, and also significantly improve HDR performance.

• Can (automatic) methods be used to solve the spelling problem?

  – Can rewrite rules be generated automatically using corpus statistics and overlap between historic and modern variants of a language?

  The generation, selection and application of rewrite rules can be done automatically, and with good results. The RNF and RSF algorithms work well in finding modern spelling variants of typical historic character sequences. Both methods use plain text corpora and exploit the overlap between historic and modern Dutch.

  – Are rewrite rules a good way of solving the spelling bottleneck?

  As the results of combining and iterating the methods have shown, after rewriting the most important historic character sequences, they no longer produce any useful rules. However, the typically historic sequences caused most of the problems for the phonetic transcriptions. Once they have been modernized, the grapheme to phoneme converter produces much better results. So, in answer to the question, they are a good first step in solving the spelling bottleneck for 17th century Dutch.

  – Can historic Dutch be treated as a corrupted version of modern Dutch, and thus be corrected using spelling correction techniques?

  Using a spell checker shows acceptable results, but the main problem is that a modern word for each historic word must be selected manually from a list of suggestions. If the correct word is not in the list of alternatives, further manual correction is needed. A language independent and automatic solution is the use of n-gram matching to retrieve similar word-forms. This produces a list of historic spelling variants for modern words. It has yet to be tested whether the inverse procedure, finding similar modern word-forms for historic words, works as well. Using n-gram matching as a coarse grained search, and edit distance, or its phonetic variant, as a fine grained search, the list of candidates can be reduced further.

• What are the options for automatically constructing a thesaurus for historic languages?

The vocabulary gap is still a big problem. Most of the resources and methods described are solutions to the spelling problem. Only the DBNL thesaurus and the word co-occurrence classifications are aimed at the vocabulary gap, and neither is a good solution at this moment. The DBNL thesaurus contains many nonsense entries, and only contains manually constructed translation pairs. Extending it to cover new words depends on the knowledge and effort of experts. As for the co-occurrence thesaurus, its application in HDR seems a long way off. To get better classifications, much more text is needed, and even then, the semantic distinctions are probably too coarse grained to make them useful for query expansion.

• Is the framework for constructing resources a language independent (general) approach?

The same word-form retrieval experiment described in [18] works for historic Dutch documents. This supports the claim by Robertson and Willett that their methods are general, language and period independent. The word-form retrieval method uses only n-gram information, which is language independent. The rewrite rule generation methods RNF and RSF can be added to the list of language independent techniques. Even without using a manually constructed selection set, using the MM selection criterion, which is a language independent method as well, the rules selected can help modernize historic spelling (at least for historic Dutch). Further improvements can be made by reranking the candidates using the PED algorithm, which can increase the precision at a certain level, or alternatively, increase recall at lower levels. The PED algorithm is language dependent; the characters are phonetically similar in Dutch, but not necessarily in all languages. Although the edit distance algorithm is less effective, it is language independent. The other resources, the PSS algorithm, the phonetic thesaurus and the DBNL thesaurus, are specific for Dutch. The PSS algorithm and the phonetic thesaurus use a grapheme to phoneme conversion tool that must be designed specifically for each language. The DBNL thesaurus consists of manually constructed word translation pairs.

• Can HDR benefit from these resources?

The experiments described in [1] show that HDR can gain from several techniques, some treating HDR as a monolingual approach, others, including the techniques and language resources for historic Dutch, treating HDR as a CLIR approach. The retrieval results show that rewriting the historic Dutch documents to a more modern Dutch is a very effective way to improve performance. After rewriting, the gap between 17th century Dutch and modern Dutch has become smaller. The monolingual approach of stemming document and query words is much more effective after the documents are translated.

6.2 Future research

Resources for 17th century Dutch can help HDR, but there is still much that can be improved upon. Since the problems for 17th century Dutch have been split into two main issues throughout this research, directions for future work will follow these two paths.

6.2.1 The spelling gap

It seems that the spelling bottleneck is no longer the main problem, although there are still some techniques that could be improved, like the PED algorithm and the phonetic transcription tool. The phonetic transcription tool from Nextens is designed for modern Dutch, with its many loan words from English, French, German and other languages. Although the overlap between historic and modern Dutch is in pronunciation, there are some differences in pronunciation as well. Taking this into account, the rules for transcribing a sequence of characters into a sequence of phonemes can be adjusted for 17th century Dutch. The pronunciation of ae is one of the main problems, but phenomena like double vowels and double consonants also form a major hurdle in matching historic and modern words. Making specific rules for their transcription will probably solve most of these problems.

The PED algorithm can be adjusted in two main ways. First, the list of phonetically similar characters can be extended, and maybe improved. For instance, the characters ‘b’ and ‘p’ are pronounced similarly in certain contexts, like the end of a word (both are pronounced as a ‘p’ in Dutch). But at the beginning of a word, they sound different. The algorithm could be changed to use context information when judging the similarity of pronunciation. Second, the current cost function might not be optimal. Right now, the substitution of similar characters costs less than deleting or inserting a character. Different cost functions can be tested. It would be interesting to see how well this algorithm works on historic variants of other languages. Maybe the cost function should depend on the specific language for which it is used. Another approach would be to use the normal edit distance algorithm on the phonetic transcriptions of words.

Also, making a spelling variations dictionary is not trivial, and it has not been tested either. If the number of spelling variations is unknown, determining which candidates are actual spelling variants, and which are not, might be a difficult problem.

As far as the rewrite rules are concerned, the effect of a rule set on document collections from different periods can be investigated further. The results in Table 4.8 show that the generated rules still work for documents written slightly earlier or slightly later than the documents that were used to generate the rules. However, if the difference in age gets larger (between the documents from which the rules are generated, and the documents on which the rules are applied), the performance of the rules will probably decrease. For documents in Middelnederlands¹ the differences with modern Dutch are far bigger, not only in spelling but also in pronunciation. The gap might even be too large to be bridged by rewrite rules. For more recent documents, the gap is so small that rewrite rules based on typical historic character sequences are not effective any more, because there are almost no character sequences that are typical of the historic documents. After the introduction of official spelling rules, the differences between historic Dutch and contemporary Dutch are very small, making the RSF and RNF algorithms redundant. It would be interesting to see the difference in performance on a document collection from a specific period between rules generated from documents dating from the same period and rules generated from documents from another period. For other languages the results can be completely different, but spelling often changes gradually,² so these effects should be very similar in other languages.

6.2.2 The vocabulary gap

As mentioned earlier, the harder problem of the two is the vocabulary gap. The resources constructed to bridge this gap are far from being usable. The DBNL thesaurus contains too much noise, and the co-occurrence thesaurus would as well, if low frequency words were classified too. The quality of the DBNL thesaurus can be improved by using a better extraction algorithm. Apart from that, the list of 17th century books at the DBNL website is expanded regularly. These new entries also contain notes and translation pairs, so the thesaurus could be updated with new entries as well.

The construction of a historic synonym thesaurus using mutual information seems infeasible at this moment. An enormous amount of text would be required, and even then, the clusters will still show more syntactic structure than semantic structure. Large clusters of nouns are almost impossible to split into semantically related subclusters if no more than plain text corpora are available. Once syntactically annotated 17th century Dutch corpora are available, classification based on bigram frequencies might become useful to cluster synonyms.

For HDR purposes, it would be interesting to see the effect of mixing historic and modern Dutch corpora. If the historic and modern words in a cluster are semantically related, the historic words could be added to modern query words from the same cluster. Finally, if the co-occurrence based thesaurus improves in quality, it could be combined with the DBNL thesaurus. The DBNL thesaurus could be used to test the quality of the co-occurrence thesaurus (if it is based on a mix of historic and modern Dutch). The historic word and its modern translation should be in the same cluster, or at least, close to each other in the classification tree.

As it stands, the attempts at bridging the vocabulary gap have led to little more than plans for building a real bridge. The bridge over the spelling gap, although still a bit shaky, seems to have reached the other side. Language resources are now available for historic Dutch, most of them automatically generated, and possibly useful for other languages as well.

¹ The Dutch language between 1200 and 1500.
² Except, possibly, for the introduction of official spelling rules, which can have a significant effect on spelling.


Bibliography
[1] Adriaans, F. (2005). Historic Document Retrieval: Exploring strategies for 17th century Dutch
[2] Braun, L. (2002). Information Retrieval from Dutch Historical Corpora
[3] Brown, P.F.; Della Pietra, V.J.; deSouza, P.V.; Lai, J.C.; Mercer, R.L. (1992). Class-based n-gram models of natural language in Computational Linguistics, volume 18, number 4, pp. 467-479
[4] Crouch, C.J.; Yang, B. (1992). Experiments in automatic statistical thesaurus construction
[5] Dagan, I.; Lee, L.; Pereira, F.C.N. (1998). Similarity-based models of word cooccurrence probabilities in Machine Learning, volume 34, number 1-3
[6] Hall, P.A.V.; Dowling, G.R. (1980). Approximate string matching in Computing Surveys, volume 12, number 4, December 1980
[7] Hollink, V.; Kamps, J.; Monz, C.; de Rijke, M. (2004). Monolingual document retrieval for European languages in Information Retrieval 7, pp. 33-52
[8] Hull, D.A. (1998). Stemming algorithms: A case study for detailed evaluation in Journal of the American Society for Information Science, volume 47, issue 1
[9] Jing, Y.; Croft, W.B. (1994). An association thesaurus for information retrieval in Proceedings of RIAO, pp. 146-160
[10] Kamps, J.; Fissaha Adafre, S.; de Rijke, M. (2005). Effective Translation, tokenization and combination for Cross-lingual Retrieval
[11] Kamps, J.; Monz, C.; de Rijke, M.; Sigurbjörnsson, B. (2004). Language-dependent and language-independent approaches to Cross-Lingual Text Retrieval in Comparative Evaluation of Multilingual Information Access Systems, CLEF 2003, volume 3237 of Lecture Notes in Computer Science, pp. 152-165. Springer, 2004.
[12] Kraaij, W.; Pohlmann, R. (1994). Porter’s stemming algorithm for Dutch
[13] Lam, W.; Huang, R.; Cheung, P.-S. (2004). Learning phonetic similarity for matching named entity translations and mining new translations in Proceedings of the 27th annual international conference on Research and development in information retrieval, pp. 289-296
[14] Li, H. (2001). Word clustering and disambiguation based on co-occurrence data in Natural Language Engineering 8(1), pp. 25-42
[15] Lin, D. (1998). Automatic retrieval and clustering of similar words in Proceedings of COLING/ACL-98, pp. 768-774
[16] McMahon, J.G.; Smith, F.J. (1996). Improving statistical language model performance with automatically generated word hierarchies in Computational Linguistics, volume 22, number 2
[17] McNamee, P.; Mayfield, J. (2004). Character N-gram tokenization for European language text retrieval in Information Retrieval 7, pp. 73-97
[18] Robertson, A.M.; Willett, P. (1992). Searching for Historical Word-Forms in a Database of 17th-century English Text Using Spelling-Correction Methods
[19] Salton, G.; Yang, C.S.; Yu, C.T. (1975). A theory of term importance in automatic text analysis in Journal of the American Society for Information Science
[20] Salton, G. (1986). Another look at automatic text-retrieval systems in Communications of the ACM, volume 29, number 7
[21] Tiedemann, J. (1999). Word alignment step by step in Proceedings of the 12th Nordic Conference on Computational Linguistics, pp. 216-227
[22] Tiedemann, J. (2004). Word to word alignment strategies in Proceedings of the 20th International Conference on Computational Linguistics
[23] van der Horst, J.; Marschall, F. (1989). Korte geschiedenis van de Nederlandse taal. Sdu Uitgevers, Den Haag
[24] Wagner, R.A.; Fischer, M.J. (1974). The string-to-string correction problem in Journal of the ACM, volume 21, number 1, pp. 168-173
[25] Zobel, J.; Dart, P. (1995). Finding approximate matches in large lexicons in Software-practice and experience, volume 25(3), March 1995, pp. 331-345
[26] Zobel, J.; Dart, P. (1996). Phonetic string matching: lessons from information retrieval in Proceedings of the 19th International Conference on Research and Development in Information Retrieval

Appendix A - Resource descriptions
The resources and the methods used to construct them are described in more detail here. Each section covers a resource and its associated algorithms.

Appendix A1 - Phonetic dictionary
The phonetic dictionary (section 5.4) is a plain text file, each line containing a unique historic word and its modern phonetic equivalent. The modern words are not unique, as a number of historic spelling variants are translated to the same modern form. For the historic words, the phonetic transcriptions of their rewritten forms are used to match them with the phonetic transcriptions of modern words. This is done to solve the biggest problems with the change in pronunciation (the sequence ae in particular). In total, there are 11,592 entries. This example shows the format (historic word, tab, modern word) of the dictionary:

aengeclaeght    aangeklaagd
aengecomen      aangekomen
aengedaen       aangedaan
aengedraegen    aangedragen
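Reading a dictionary in this format and using it to add historic variants to a modern query could be done along the following lines. This is only a sketch in Python, under the assumption that every non-empty line holds exactly one tab; the function names are made up for the example, and the sample lines are the four entries shown above.

```python
from collections import defaultdict

# The four example entries from the dictionary, one "historic<TAB>modern" pair per line.
SAMPLE = (
    "aengeclaeght\taangeklaagd\n"
    "aengecomen\taangekomen\n"
    "aengedaen\taangedaan\n"
    "aengedraegen\taangedragen\n"
)


def load_variants(lines):
    """Map each modern word to the list of historic words paired with it."""
    variants = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        historic, modern = line.split("\t")
        variants[modern].append(historic)
    return variants


def expand_query(terms, variants):
    """Keep every query term and add its historic variants, if any."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(variants.get(term, []))
    return expanded


variants = load_variants(SAMPLE.splitlines())
print(expand_query(["aangekomen", "schip"], variants))
# ['aangekomen', 'aengecomen', 'schip']
```

Because the modern words are not unique in the dictionary, the mapping is inverted into lists, so that several historic variants of one modern word are all added to the query.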

Appendix A2 - DBNL dictionary
The DBNL dictionary (section 5.3) is also a plain text file, each line containing a dictionary entry and its translation. The entries and their translations can be single words or phrases. As Table 5.5 showed, the word to word entries are by far the most useful. Some statistics are repeated here:

Dictionary          number of translations   unique entries   number of synonyms
word to word        36,505                   20,281           1.8
word to phrase      26,445                   16,649           1.6
word to either      62,950                   36,930           1.7
phrase to word      5,589                    4,931            1.1
phrase to phrase    42,680                   35,127           1.2
phrase to either    48,269                   40,058           1.2
total               111,219                  68,384           1.6

Table 1: DBNL dictionary sizes

Some entries have multiple translations, therefore the last column shows the average number of synonyms for each entry. The format of the DBNL dictionary is equal to the phonetic dictionary format (historic word, tab, modern word):

begosten    aanvingen
begote      overgoten
begoten     bespoeld
begraeut    afgesnauwd

Appendix A3 - Rewrite rule sets
The rewrite rules are generated from corpus statistics and phonetic information (section 3.2). The three rewrite rule generation algorithms PSS, RSF and RNF are explained in sections 3.2.1, 3.2.3 and 3.2.5. The rules can be applied to the lexicon of the document collection to obtain a dictionary of spelling modernizations. The rewritten forms are not necessarily existing modern words, because the word can still be a historic word with a modern spelling (see the examples below).

gerichtschrijversampt    gerichtschrijverzambt
gerichtschrijverseydt    gerichtschrijverzeid
gerichtscosten           gerichtskosten
gerichtsdach             gerichtsdag
gerichtsdaege            gerichtsdage

The rules generated by PSS and RSF are different from the rules generated by RNF, because of the vowel/consonant restrictions. The historic antecedent of these rules consists of a historic sequence and a context restriction. For instance, a historic vowel sequence should match a historic word if the vowel sequence is surrounded by consonants. The word vaek matches the vowel sequence ae, while the word zwaeyen doesn’t, because its full vowel sequence is aeye. To make sure that the rule ae doesn’t match zwaeyen, the antecedent is extended with context wildcards:

[bcdfghjklmnpqrstvwxz]ae# → a
[aeiouY]lcx[aeiouY]

The antecedent part of the rule is actually a regular expression, and the bracketed consonant wildcard [bcdfghjklmnpqrstvwxz] indicates that the character sequence ae must be preceded by one of these consonants. The word boundary character # indicates that the ae sequence must be at the end of the word.³

The uppercase Y in the second rule above is used as a replacement for the Dutch diphthong ij, because the j would otherwise be recognized as a consonant. Therefore, all occurrences of ij in words and sequences are replaced by Y in the RSF and PSS algorithms. The RNF algorithm doesn’t have this consonant/vowel restriction; it will match anything with the historic antecedent. Context information is more detailed for longer n-grams:

ae → aa
bae → baa
bael → baal
baele → bale

³ In Perl, this word boundary character can be replaced by a dollar sign ($), which matches the preceding regular expression only at the end of the string.
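How such context-restricted rules behave can be illustrated with ordinary regular expressions. The sketch below is written in Python rather than Perl and is not taken from the resources package. The helper names are invented, the rule "ae between consonants → aa" is a hypothetical combination of the ae → aa rule with the consonant context, and nae is used as an illustrative word for the word-final rule.

```python
import re

CONSONANTS = "[bcdfghjklmnpqrstvwxz]"  # consonant wildcard from the rules above


def encode_ij(word):
    # Replace the diphthong ij by Y, so that its j is not seen as a consonant.
    return word.replace("ij", "Y")


def decode_ij(word):
    return word.replace("Y", "ij")


def make_rule(sequence, replacement, before="", after=""):
    """Build a rewrite function; 'before' and 'after' are regex context fragments.

    The context is matched but copied back unchanged; only 'sequence' is rewritten.
    """
    pattern = re.compile("(" + before + ")" + re.escape(sequence) + "(" + after + ")")

    def apply(word):
        rewritten = pattern.sub(
            lambda m: m.group(1) + replacement + m.group(2), encode_ij(word)
        )
        return decode_ij(rewritten)

    return apply


# Hypothetical rule: ae surrounded by consonants becomes aa.
ae_between_consonants = make_rule("ae", "aa", before=CONSONANTS, after=CONSONANTS)
# The first rule above: a consonant followed by word-final ae; '$' plays the role of '#'.
ae_word_final = make_rule("ae", "a", before=CONSONANTS, after="$")

print(ae_between_consonants("vaek"))     # vaak
print(ae_between_consonants("zwaeyen"))  # zwaeyen (no match: the vowel sequence is aeye)
print(ae_word_final("nae"))              # na
print(encode_ij("wijn"))                 # wYn
```

The word zwaeyen is left alone because y is not in the consonant wildcard, which is exactly the purpose of the context restriction.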


Appendix B - Scripts
The PSS algorithm requires two lists of mappings from words to phonetic transcriptions: one list with historic words and their phonetic transcriptions, and one list with modern words and their phonetic transcriptions. The phonetic alphabet that is used is not important, as long as both lists use the same phonetic alphabet. The output is a plain text file, where each line contains a rewrite rule and its PSS score. The PSS score is the MM-score (the maximal match score, described in section 3.3.1). The RSF and RNF algorithms both require two word frequency indices, one from a historic corpus and one from a modern corpus. These indices are plain text files with each line containing a unique word from the corpus, and its corpus frequency. These three algorithms are implemented in Perl, simply named ‘PSS.pl’, ‘RSF.pl’ and ‘RNF.pl’. They use only standard packages and require some scripts that are included in the resources package. Other important algorithms are:

• mapPhoneticTranscriptions.pl: This expects two lists containing words and their phonetic transcriptions, and gives as output a dictionary of words with the same pronunciation.

• PED.pl: This is a package of subroutines, of which ped is the main subroutine. It expects two strings as input and returns the phonetic edit distance as output.

• RDV.pl: This contains the subroutine rdv, an implementation of the Reduce Double Vowels algorithm. It needs a string as input and returns as output the string after reducing redundant double vowels.

• selectRules.pl and selectMethods.pl: The selectRules.pl script is an executable script that allows you to select rules from a rule set using a specific selection method (section 3.3). When executing, it needs three arguments: the number of the selection method, the name of the file containing the rule set, and finally the name of the output file where the selected rules are stored. The selection criteria are implemented in the selectMethods.pl script.

• applyRules.pl: This is a package of subroutines for applying a set of rules to a string or a list of strings.

• createTestSet.pl: This script requires three arguments: a text file from which words are randomly selected, a filename for the test set, and a filename for a list of words to skip. The nice thing about this approach is that the test set can be constructed once, and then extended later. The skip file contains a list of words that were already presented to the user in an earlier iteration, and were discarded. The user is presented with randomly selected words from the text file, and if an alternative spelling is given by the user, the original word and its alternative spelling are added to the test set. If no alternative spelling is given, the word is added to the list of discarded words. In a second run of the script (with the same filenames for the test set and the skip file), the words from the test set and the skip file are not presented to the user.

• buildIndex.pl: This expects three arguments as input. The first argument is a flag indicating whether the second argument is a text file, or a file containing a list of text files if multiple text files are to be indexed. The third argument is the filename for the resulting index, containing the unique words from the text file(s), and their collection frequencies.
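To give an idea of what such a word frequency index looks like, here is a small sketch in Python. It is not a port of buildIndex.pl: the tokenization (lowercasing and splitting on anything that is not a letter), the tab as separator and the function names are all assumptions made for the example.

```python
import re
from collections import Counter


def build_index(texts):
    """Count how often each word occurs in a collection of texts."""
    counts = Counter()
    for text in texts:
        # Assumed tokenization: lowercase, words are runs of letters.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts


def format_index(counts):
    """One line per unique word: the word, a tab, and its collection frequency."""
    return [f"{word}\t{freq}" for word, freq in sorted(counts.items())]


index = build_index(["De klacht", "de clacht, de Klacht"])
for line in format_index(index):
    print(line)
```

An index built this way from a historic corpus and one built from a modern corpus would have the shape that the RSF and RNF algorithms are described as expecting.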

For more information on or access to the scripts, send an e-mail to the author.

Appendix C - Selection and Test set
The selection and test set (section 3.4.1) are in the same format as all the other word lists and dictionaries. Each line in the test set file consists of a historic word, a tab, and the modern spelling of the historic word (again, not necessarily an existing modern word). To clarify this, a few entries are given here:

sijnen          zijn
sijns           zijns
silvere         zilvere
simmetrie       symmetrie
sin             zin
singen          zingen
singht          zingt
sinlijckheyt    zinnelijkheid
sinnen          zinnen
sinplaets       zinplaats

The historic word sinplaets is spelled as zinplaats in modern Dutch, although zinplaats is not an existing modern word.⁴

⁴ At least, it is not listed in the ‘Van Dale - Groot woordenboek der Nederlandse taal.’
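A test set in this format can be used to compute recall at a cut-off, as in the word-form retrieval experiments. The sketch below, in Python, assumes that recall @k is the fraction of test pairs for which the historic word is among the first k candidates retrieved for its modern spelling. The ranked candidate lists are invented for the example and do not come from the experiments.

```python
def load_test_set(lines):
    """Read (historic word, modern spelling) pairs from tab-separated lines."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        historic, modern = line.split("\t")
        pairs.append((historic, modern))
    return pairs


def recall_at_k(pairs, retrieve, k):
    """Fraction of pairs whose historic word is in the top k for the modern word."""
    if not pairs:
        return 0.0
    hits = sum(1 for historic, modern in pairs if historic in retrieve(modern)[:k])
    return hits / len(pairs)


# Three entries from the test set above.
pairs = load_test_set(["sin\tzin", "singen\tzingen", "silvere\tzilvere"])

# Hypothetical ranked candidate lists, for illustration only.
RANKED = {
    "zin": ["sin", "zinne", "sijn"],
    "zingen": ["zinghen", "singen"],
    "zilvere": ["silver", "zilveren", "silvere"],
}


def retrieve(modern_word):
    return RANKED.get(modern_word, [])


for k in (1, 2, 3):
    print(k, round(recall_at_k(pairs, retrieve, k), 3))
```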
