
Introducing Minerva – A new system for Machine Translation

Jonathan Levin Minerva@hisown.com

Abstract
This paper introduces a novel system for machine translation, code-named "Minerva", and briefly discusses its design. The source language is currently Latin, with several target language options. Our approach handles morphology and grammar alike, and also attempts semantic disambiguation and context determination. While few systems exist to translate Latin, Minerva already displays noteworthy results that surpass those of the leading one. We further claim that our system can, upon maturity, be adapted to handle other languages as well.

1 Introduction

Machine Translation (MT) is one of the most challenging areas in NLP research. Many systems exist, all vying to build a better "mousetrap" and to push accuracy toward a much-desired asymptotic 100%. While most of these systems focus on contemporary spoken languages, we have decided to tackle the challenge of translating Latin (as a source language) into, theoretically, any target language. Our system, code-named "Minerva", is the result. At the time of writing, the leading system for Latin MT is "Perseus", developed by Tufts University and heavily used by classicists worldwide. This system, however, falls short of its vast promise: it focuses solely on the morphological aspects, often correctly determining the various possible senses, but making no real attempt to disambiguate them (see footnote 1). Consequently, it is nearly useless for word combinations or full sentences, and only offers single-word lookups.

Minerva offers significant advancements compared to Perseus. With an improved morphological engine, it correctly deduces more forms than its "sibling". Additionally, it contains an elaborate rule-based system to tackle Latin grammar, providing correct grammatical disambiguation of senses whenever possible. Future work will make a foray into semantics by integrating Minerva with Princeton University's "WordNet", thereby attempting to leap over the most challenging hurdle yet: full word sense disambiguation. It is important to emphasize that, while the semantic module is at the time of writing far from complete, it will not be a statistical, corpus-based approach, but will attempt true context determination by mimicking the associative process of the human brain.

Footnote 1: That said, the newest version (4.0) also offers a corpus-based "guesstimate" of the most applicable sense, but solely based on statistics input by human users.

2 A brief primer to Latin

A full discussion of Latin morphology and grammar is well beyond the scope of this short paper. However, we next explain some important attributes of the language, as well as the reasoning behind choosing it.

2.1 Why Latin?

Latin is traditionally treated as a "dead" language. Even though its terms and idioms permeate popular domains such as medicine and law, and indeed even everyday English, its use in the 21st century is largely limited to specific circles, most notably classicists. We prefer to think of Latin as a "frozen" language, rather than a "dead" one. "Frozen", as it has, in effect, developed little in the past 1,500 years: as opposed to many of its descendant Romance languages, Latin vocabulary remains small, with a relatively straightforward grammar, predictable exceptions and (what we find to be) a reasonable degree of semantic ambiguity. In that, it proves itself to be a worthwhile candidate for automated translation.

Minerva was thus designed with Latin as the source language in mind. However, by its very design the system is modular enough that the input module may later be replaced with a different one, possibly targeting English, French, or even more distant languages such as Hebrew or Japanese.

2.2 Latin-specific challenges

a. Word order and declensions

Latin, like Russian, relies heavily on declensions. As a result, the word order rules of other languages (Subject-Verb-Object (SVO) in English and the Romance languages, SOV in many others) do not apply. While most sentences loosely adopt SOV, sentences can be constructed in any order, so there is no real difference between:

Puer amat puellam
Puer puellam amat
Puellam puer amat

Puer: puer (boy), n., nominative
Puellam: puella (girl), n., accusative
Amat: amare (love), v., 3rd person sg., present, indicative/active

All three mean "The boy loves the girl", due to the use of declensions. The loose word order is often used in Latin for emphasis on a particular part of the sentence, especially in poetry. This proves to be both a blessing and a curse. On the one hand, declensions can be used in the process of grammatical disambiguation. On the other, some declined forms are identical. These include very commonly encountered forms (see footnote 2), and in those cases Latin provides an even greater challenge than languages with a rigid word order. Because of declensions, Latin has no articles and uses prepositions sparingly, thus making common grammatical tasks exceptionally challenging.

b. Grammatical agreement

Latin relies heavily on agreement between words, in gender, case and number. This makes it easy to perform grammatical disambiguation in many cases, especially where adjectives are separated from their corresponding nouns.

c. Clauses

Latin uses punctuation sparingly. Commas cannot be relied upon in determining clauses, which can also be introduced by myriad prepositions and/or pronouns. Examples are the protasis and apodosis of conditionals, e.g.:

si vis pacem para bellum ("if you want peace, prepare for war")

Even without punctuation (between 'pacem' and 'para'), the use of the conditional ('si') and the verb ('para') enables a clear separation into two clauses ("si vis pacem" and "para bellum").

d. Idiomatic expressions

In many cases, well-known idioms can be used to perform both grammatical and semantic disambiguation. For example:

quibuscum continenter bellum gerunt (Caesar, "Gallic War", 1.4)

wherein "bellum" by itself could be either an adjective ("beautiful") or a noun ("war") and "gerunt", while not grammatically ambiguous, is a verb of many potential interpretations ("to bear about, bear, carry, wear, have, hold, sustain"). The combination of both, however, is uniquely determined to be "wage war".
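The idiom-based disambiguation just described can be illustrated with a minimal Python sketch. It is not Minerva's implementation: the candidate table, the idiom table and the function name `match_idiom` are all hypothetical, and the readings are abbreviated from the example in the text.

```python
from typing import Dict, List, Optional, Tuple

# A reading is (lemma, part of speech, gloss); entries abbreviated from the text.
Reading = Tuple[str, str, str]

# Hypothetical output of morphological analysis for the two words.
CANDIDATES: Dict[str, List[Reading]] = {
    "bellum": [("bellum", "noun", "war"), ("bellus", "adj", "beautiful")],
    "gerunt": [("gerere", "verb", "bear, carry, wear, have, hold, sustain")],
}

# Hypothetical idiom table keyed by the lemmas of two adjacent words.
IDIOMS: Dict[Tuple[str, str], str] = {("bellum", "gerere"): "wage war"}


def match_idiom(first: str, second: str) -> Optional[Tuple[Reading, Reading, str]]:
    """Return the pair of readings that forms a known idiom, if any."""
    for a in CANDIDATES.get(first, []):
        for b in CANDIDATES.get(second, []):
            meaning = IDIOMS.get((a[0], b[0]))
            if meaning is not None:
                return a, b, meaning
    return None


result = match_idiom("bellum", "gerunt")
if result is not None:
    noun, verb, meaning = result
    print(noun[0], "+", verb[0], "=", meaning)
```

A single table hit selects the noun reading of "bellum" and narrows the meaning of "gerunt" at the same time, which is why an idiom can resolve grammatical and semantic ambiguity together.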

3 System Design

Minerva is highly modular, and treats MT as a sequence of independent stages, each implemented separately:
#  Stage                        Language Dependent
1  Morphological Analysis       Yes
2  Grammatical Analysis         Yes
3  Semantic Analysis            No
4  Target Language Generation   Yes
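The staged design described above can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not Minerva's Java code: the `Stage` class, `run_pipeline`, `language_dependent_stages` and the placeholder stage bodies are all hypothetical, and each placeholder only tags the data so that the flow between stages is visible.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Stage:
    """One step of the pipeline; names and fields are illustrative only."""
    name: str
    language_dependent: bool
    run: Callable[[Any], Any]


def run_pipeline(stages: List[Stage], text: str) -> Any:
    """Feed the output of each stage into the next one, in order."""
    data: Any = text
    for stage in stages:
        data = stage.run(data)
    return data


def language_dependent_stages(stages: List[Stage]) -> List[str]:
    """Names of the stages that must be replaced to support another language."""
    return [s.name for s in stages if s.language_dependent]


# Placeholder stage bodies (hypothetical): real stages would do actual analysis.
STAGES = [
    Stage("Morphological Analysis", True, lambda text: text.split()),
    Stage("Grammatical Analysis", True,
          lambda words: [(w, "grammar-checked") for w in words]),
    Stage("Semantic Analysis", False,
          lambda pairs: [(w, tag, "sense-checked") for w, tag in pairs]),
    Stage("Target Language Generation", True,
          lambda triples: " ".join(w for w, _, _ in triples)),
]

print(run_pipeline(STAGES, "si vis pacem para bellum"))
print(language_dependent_stages(STAGES))
```

Keeping each stage behind the same call signature is what would allow a language-dependent stage to be swapped without touching the semantic stage.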

Footnote 2: E.g. nominative (subject) and accusative (direct object) for neuter, dative (indirect object) and ablative (means) for feminine, and many others.

It is important to emphasize that, while the language-dependent modules are currently implemented for Latin (stages 1 and 2) and English/French (stage 4), we surmise there is no insurmountable challenge in adapting them to other languages as well. In this sense, our claim can be expanded: semantic analysis and disambiguation are language-agnostic enough that the system should ideally be able to deduce senses and context irrespective of the languages involved.

3.1 Morphological Analysis

Minerva performs morphological analysis by following an XML tree that describes Latin morphology. The tree is organized according to endings, and its hierarchical structure leads directly to the possible meanings of each ending. Rather than follow a greedy approach, all possible endings are considered (including partial ones), and a base form candidate is proposed from each. A dictionary lookup ensues and, if the candidate is found, it is added to the possible meanings of the word. Additionally, Minerva maintains separate lists of exceptional verb and noun forms, to handle the numerous (yet finite) cases wherein words are declined in non-standard or alternate forms. This stage is fully debugged and tested; it is on par with "Perseus", and even exceeds it, as it correctly recognizes many forms the latter does not. The output of this phase is an array of all the possible meanings of a given word. Meanings are disambiguated at the grammatical level only, e.g.

si vis pacem, para bellum (if you want peace, prepare for war)

vis:vis, n/sg/f/gen (force, vigor, etc.)
vis:vis, n/sg/f/nom (force, vigor, etc.)
vis:velle, v/2nd/sg/pres/ind/act (to want)
..
bellum: bellum, n/sg/n/nom (war)
bellum: bellum, n/sg/n/acc (war)
bellum: bellus, adj/sg/m/acc (pretty)
bellum: bellus, adj/sg/n/acc (pretty)
..

This obviously includes wrong meanings, and defers semantic ambiguities (in the above case, "vis" as a noun can specifically mean any of force, vigor, power, or energy) to the semantic analysis stage.

3.2 Grammatical Analysis

Latin has a surprisingly rich and diverse grammar, but it mostly falls into well-followed rules. The Grammatical Analysis module thus implements a classical rule-based system. Grammar rules are defined in a simple, yet effective language that makes use of conditionals and word attributes to form pattern matching expressions. Ambiguities in word senses are handled by reducing multiple senses to a simple character representation, upon which a regular expression may be applied and tested. Rules are defined in one of several classes (mandatory, common, unusual), and evaluated in order. Additionally, when a given meaning of a word is discarded due to the application of a rule, the system keeps track of its decision.

A "mandatory" rule class is one wherein the system will try to enforce the rule, eliminating those possible senses of the word in question which fail to match it. In a way, it mimics human expectation. Much like in English one would expect certain parts of speech to follow others (say, nouns/adjectives after determiners), so too does Minerva anticipate certain declensions to follow prepositions, and so on. If a mismatched mandatory rule results in the elimination of the last possible sense of a given word, Minerva rejects the sentence as ungrammatical.

A "common" rule class is one wherein the system expects a common, yet not necessarily strict, pattern to occur in a sentence. These rules are "nice-to-have", yet can be violated.

An "unusual" rule class groups grammatical constructs that are not found in common prose or speech, but do occur in rare cases, most often in poetry. These rules are thus tagged to allow their consideration only in special cases, wherein previous processing has proven insufficient, or the text source is known to be poetry.

Minerva processes rules repeatedly, and in order, according to a simple algorithm:

For ruleClass in Classes
    TryAgain = True
    While (TryAgain)
        TryAgain = False
        For rule in rules[ruleClass]
            If isApplicable(rule, sentence)
                apply(rule, sentence)
                TryAgain = True
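The rule-processing loop of the Grammatical Analysis module can be sketched in runnable Python as follows. This is a minimal sketch under several assumptions: a sentence is modelled as a list of sets of candidate readings, a rule is a function that returns True only when it removed a reading (so applicability and application are combined), and the single toy rule, the reading labels and all names (`process`, `ablative_after_cum`, `Ungrammatical`) are hypothetical. The toy rule reflects the fact that the Latin preposition "cum" governs the ablative.

```python
from typing import Callable, Dict, List, Set

Sentence = List[Set[str]]           # candidate readings for each word, in order
Rule = Callable[[Sentence], bool]   # returns True only if it removed a reading


class Ungrammatical(Exception):
    """Raised when a mandatory rule would remove the last reading of a word."""


def ablative_after_cum(sentence: Sentence) -> bool:
    """Toy mandatory rule: the word after the preposition 'cum' must be ablative."""
    changed = False
    for prev, word in zip(sentence, sentence[1:]):
        if prev == {"prep:cum"}:
            keep = {r for r in word if r.endswith("/abl")}
            if not keep:
                raise Ungrammatical("no ablative reading after 'cum'")
            if keep != word:
                word.intersection_update(keep)  # discard non-ablative readings
                changed = True
    return changed


def process(sentence: Sentence, rules: Dict[str, List[Rule]],
            allow_unusual: bool = False) -> Sentence:
    """Apply each rule class in order, repeating a class until nothing changes."""
    classes = ["mandatory", "common"] + (["unusual"] if allow_unusual else [])
    for rule_class in classes:
        try_again = True
        while try_again:
            try_again = False
            for rule in rules.get(rule_class, []):
                if rule(sentence):
                    try_again = True
    return sentence


# "cum amicis": 'amicis' is dative or ablative plural before the rule applies.
sentence = [{"prep:cum"}, {"n/pl/m/dat", "n/pl/m/abl"}]
print(process(sentence, {"mandatory": [ablative_after_cum]}))
```

Because a rule reports a change only when it actually removes a reading, and the number of readings is finite, each inner loop is guaranteed to terminate. The `allow_unusual` flag stands in for the special cases in which the "unusual" class is consulted.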

The application of the current rule set on our example sentence yields:
vis:vis, n/sg/f/gen (force, vigor, etc.)
vis:vis, n/sg/f/nom (force, vigor, etc.)
vis:velle, v/2nd/sg/pres/ind/act (to want)
..
bellum: bellum, n/sg/n/nom (war)
bellum: bellum, n/sg/n/acc (war)
bellum: bellus, adj/sg/m/acc (pretty)
bellum: bellus, adj/sg/n/acc (pretty)
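The Grammatical Analysis module is described as reducing multiple senses to a simple character representation on which a regular expression can be tested. The paper does not specify that encoding, so the following Python sketch is only one possible reading of the idea: the one-letter codes, the `encode` function and the sample pattern are all assumptions made for illustration.

```python
import re
from typing import List, Set

# Hypothetical one-letter codes: C = conjunction, N = noun, V = verb, A = adjective.
# One set of candidate codes per word of "si vis pacem para bellum".
SENTENCE: List[Set[str]] = [{"C"}, {"N", "V"}, {"N"}, {"V"}, {"A", "N"}]


def encode(sentence: List[Set[str]]) -> str:
    """Encode each word as the sorted string of its candidate codes."""
    return " ".join("".join(sorted(codes)) for codes in sentence)


# Sample pattern: does the word right after a leading conjunction
# have at least one verb reading?
VERB_AFTER_CONJ = re.compile(r"^C [A-Z]*V[A-Z]*( |$)")

encoded = encode(SENTENCE)
print(encoded)
print(bool(VERB_AFTER_CONJ.search(encoded)))
```

Encoding ambiguity as a short run of letters lets a single pattern test every remaining combination of readings at once, instead of enumerating them.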

Note that, after grammatical analysis, we have eliminated some of the ambiguity (namely, that of "vis"), but we still cannot determine "bellum" to be "war" and not "pretty".

3.3 Semantic Analysis

Semantic analysis is, at this time, not fully implemented. Dictionary meanings, however, are already tied to "WordNet", allowing for the consideration of specific meanings and for future disambiguation. This process is currently outside the scope of the present implementation and, thus, of this paper. This paper is proposed as a short talk submission, which will focus on the semantic challenges foreseen. In our example sentence, semantic disambiguation will enable Minerva to deduce that "war" is a more likely meaning for "bellum" than "pretty" is, since the former is an antonym of "peace", and the latter is an adjective used substantively.

3.4 Target Language Generation

While the default target language is English, Minerva maintains a "sense table" as a proof of concept, mapping English senses into corresponding ones in Hebrew, French and Japanese (the choice of languages being half-arbitrary: they are the ones known by the authors, but they also represent three distinct language classes). Many multilingual Machine Translation systems use an intermediate interlingua stage. Minerva does not do this directly, but rather relies on the unique senses of words, as determined by the previous stage of semantic analysis. Under the assumption that the ambiguous senses of a word can be eliminated, leaving one unique sense, we claim that there exist mappings of that sense to any target language (different mappings for different languages). We further claim that this mapping is injective into the target language space (possibly also surjective, i.e. a bijection). For example, Latin "malus" would translate as:

Latin:    malus
English:  apple#1 (apple%1:13:00::)
French:   pomme
Hebrew:   תפוח
Etc.

Once the target-language noun, verb or other part-of-speech base form is determined, the language generation module conjugates or declines it. Finally, word ordering is enforced. Results leave something to be desired, but are quite satisfactory.

4 Implementation Details

Minerva is implemented with the following technologies/platforms:

Platform  Use
Java      Main System Logic
MySQL     Database, dictionaries
PHP       Web 2.0 Front End

The system output is natively in XML format, allowing for use and integration with other systems as a web service. Human-readable output is produced by means of XSL rendering, which converts the output into standards-conformant XHTML 1.1.

5 Final Notes

Minerva is still very much in a nascent stage, and is a work in progress. It has, however, reached a degree of maturity at which it could already benefit Latin speakers and classicists. It is the aim of this paper, and of the proposed presentation, to spread the word of its existence to as wide an audience as possible, so as to boost its usage as well as its evolution. The system's main drawback, at present, is its rather limited dictionary. A simple Web 2.0 interface allows it to query its users (effectively "asking for help") for the base forms of new words encountered in translation. It is the authors' hope that, with more use, Minerva's dictionary will be expanded organically by its users, increasing, in turn, its efficiency and its appeal to new users.

References
Tufts, Perseus - http://www.perseus.tufts.edu/
Princeton, WordNet - http://wordnet.princeton.edu/