Elimination of lexical ambiguities by grammars

Eric Laporte, Anne Monceaux

To cite this version:
Eric Laporte, Anne Monceaux. Elimination of lexical ambiguities by grammars: The ELAG
system. Lingvisticae Investigationes, John Benjamins Publishing Company, 1999, 22, pp.341-
367. <hal-00189727>

HAL Id: hal-00189727
Submitted on 21 Nov 2007


ELIMINATION OF LEXICAL AMBIGUITIES BY GRAMMARS: THE ELAG SYSTEM

ÉRIC LAPORTE, Université de Marne-la-Vallée - CNRS
ANNE MONCEAUX, Aérospatiale-Matra

Introduction

Due to lexical ambiguity, lexical tagging involves some automatic analysis and recognition of grammatical context, e.g. noun, verb, adjective, subject, etc., in order to check local constraints. For example, in the tagging of the following French sentence:

(1) C'est une véritable marque de fabrique aux yeux des connaisseurs

the various lexical hypotheses in conflict for the form marque include at least a compound noun marque de fabrique (the only correct hypothesis in this case), a form of the feminine noun marque and a form of the verb marquer. Such a short context sensitivity is required, for instance, to recognize technical terms in view of an indexation process, and for context-sensitive spelling checkers. If the lexical tagger is coupled with a syntactic parser, the checking of local syntactic constraints and the contextual reduction of lexical ambiguity can considerably speed up the process, by filtering out invalid analyses before parsing (R. Milne 1986).

Depending on systems, local constraint checking is viewed either as a part of lexical tagging, since it contributes to selecting tags for words, or as a part of syntactic parsing, since it takes into account contextual constraints, or as an intermediate step. In the case of a lexicon-based tagger like INTEX, a contextual constraint checking module can process the result of the tagger and resolve some lexical ambiguities.

In this article, we present a new, INTEX-compatible formalism of expression of distributional constraints, ELAG (elimination of lexical ambiguities by grammars). We describe the main properties of ELAG and we exemplify them with simple rules formalizing exploitable constraints. We also present an INTEX-compatible evaluation procedure for the lexical disambiguation results.

1. Definition of the problem

For INTEX users, one of the most constant sources of nuisance is the presence of artificial lexical ambiguities in the tags assigned to words during the process of lexical analysis. This problem comes from the fact that the process of lexical analysis, which is achieved by looking up large-coverage dictionaries, successfully retrieves all possible tags of a word, but does not take context into account. This multiple tagging disturbs the location of linguistic patterns in the text by INTEX: if we try to locate occurrences of constructions like une marque de parfums or N0 donner une marque d'amitié à N1 or N0 marquer le livre de la page 30 à la page 36, sentence (1) might very well show up. We use the term artificial ambiguities in order to emphasize that these words with several tags are usually not felt as ambiguous in context by the human reader, as opposed to effective ambiguities, i.e. sentences which have several possible interpretations.

The problem is all the more serious as the tagset becomes more informative and fine-grained, which is exactly the objective we want to reach by using large-coverage dictionaries and INTEX. Since the average number of tags per form increases with the granularity of the tagset, the quantity of noise in the identification of lexical units in the text increases accordingly.

The general solution to this problem is syntactic parsing, since this operation would, as a side effect, determine the right tag(s) of each word. In some cases, syntactic parsing is even the only way of resolving all lexical ambiguities. In the following sentence, for instance:

(2) La République ne veut pas que l'école publique, son fleuron, véhicule des signes discriminatoires.

the part of speech of véhicule cannot be determined without a thorough recognition of the syntactic structure of the sentence, nor without syntactic information about the verbs vouloir and véhiculer, namely the fact that they are transitive verbs with a direct complement.

However, in many cases, a simple constraint involving basic linguistic information about words in the immediate context suffices to resolve much of the artificial lexical ambiguity in a text. For example, if we modify sentence (2) by inserting ne to the left of véhicule:

(3) La République ne veut pas que l'école publique, son fleuron, ne véhicule des signes discriminatoires.

the ambiguity of véhicule is resolved in favour of the verb, since in French, the form ne can be followed by various parts of speech, but never by a noun. This type of effect can be identified as a realistic objective of a system of resolution of lexical ambiguity. We can express it as follows:

- increasing the precision in the tagging of the words, i.e. removing as many incorrect analyses as possible,
- without reducing the recall, i.e. with the strict constraint of never discarding a correct analysis.

The principle of a system of lexical ambiguity resolution is to obtain this effect by exploring only a local context of the words marked as ambiguous after dictionary lookup.

In this article, the objective of never discarding a correct analysis is considered as a strict constraint. A trade-off between recall and precision may be suited for some applications, but not for the most fundamental of them, which is syntactic parsing. In order to increase the precision as much as possible, we can focus in priority on the lexical ambiguities that contribute the most massively to lexical ambiguity in sentences, which does not imply that we hope to eliminate all of them.

Given these objectives, the best approach should be to handcraft and maintain directly the distributional data required for the reduction of lexical ambiguity. These data are elaborated by taking into account underlying linguistic structures, by abstracting general rules from observable facts, and by expressing them in a readable form. During the past ten years, several systems following this scheme have been presented (M. Silberztein 1991, E. Roche 1992, T. Vosse 1992, H. Paulussen and W. Martin 1992, A. Voutilainen 1994, Ph. Laval 1995). This scheme is opposed to another, in which the distributional data is automatically acquired by frequency-based corpus training (I. Marshall 1983, F. Jelinek 1985, R. Garside 1987, D. Hindle 1989, J. Benello et al. 1989, D. Cutting et al. 1992, E. Brill 1992, H. Schmid 1994, E. Roche and Y. Schabès 1995).

Building a grammar of resolution of lexical ambiguity is a challenge: correct analyses should not be removed; the linguistic analysis that we wish to assign to a sentence must be taken into account, which implies that the writer of a grammar of resolution of lexical ambiguity has particular views about the desired results of syntactic analysis, but the results of syntactic parsing cannot be explicitly used, since they are not available at the time when ambiguity resolution is applied to the text; and finally, a formalism is indispensable: it should be as simple as possible, but rules cannot be expressed in the form of statements in natural language. In the following, we explain how ELAG helps to face these issues.

2. Tagset

ELAG is compatible with INTEX, and the set of lexical tags that can be attached to words is the INTEX tagset. These tags appear in various forms in INTEX files and windows. In ELAG grammars, they appear as e.g. <faire.V:Kms>, which comprises:

- the canonical form or lemma of the lexical unit. The inflected form occurs in the text and is not explicitly represented in the tag. In <faire.V:Kms>, this form is faire, whereas the inflected form described by the tag is fait. The canonical form may be a compound, as in <émission de télévision.N+NDN:fp>, where NDN indicates the internal structure of the compound;
- the part of speech, here V for verb;
- immediately to the right of the part of speech, the lexical tag can include a subcategory;
- if the part of speech allows for it, a series of inflectional feature values, here Kms for past participle, masculine, singular.

The lexical information conveyed by the tag depends on the dictionary. Since such tags comprise all the information stored in the dictionary, we call them complete tags. Complete tags are used in the representation of tagged texts; they can also appear in rules formalizing distributional or grammatical constraints which are specific to a particular word. When an inflected form is ambiguous between several series of inflectional values for the same canonical form, for example <faire.V:Kms> and <faire.V:P3s>, each of these separate tags is a complete tag. In ELAG, the format <faire.V:Kms:P3s> is an abbreviation.

An ELAG rule always comprises an "if" part and one or several "then" parts. For example, let us state formally that aussi cannot be tagged as a coordinating conjunction (CONJC) when it is followed by a sentence boundary. We will assume that INTEX recognizes sentence boundaries and that the dictionary describes aussi as (at least) an adverb:

Luc est aussi arrivé à l'heure

and as a CONJC:

Luc s'est dépêché, aussi est-il arrivé à l'heure

and we will write our first rule. We can formulate our example as:

If <aussi.CONJC> is followed by a punctuation, then the punctuation cannot be a sentence boundary.

In order to express this constraint formally, we will use the symbol # which represents sentence boundaries. Other non-verbal symbols are represented by their actual form: . , / ' and - are all considered as tags. <PUNCT> is a variable that stands for any of these symbols, and <!#.PUNCT> stands for any <PUNCT> except #.

With this notation, our constraint becomes:

If <aussi.CONJC> is followed by <PUNCT>, then this <PUNCT> is <!#.PUNCT>,

and the rule looks like Graph 1.

! <aussi.CONJC> ! <PUNCT> !
= = <!#.PUNCT> =
Graph 1.

The "if...then" structure of ELAG disambiguation rules is borrowed from M. Silberztein (1993), because it is a very natural way of formulating readable disambiguation rules. Practically, the "if" part is always signalled and delimited by three boxes with "!", and each "then" part by three boxes with "=". The boxes with "!" and "=" are delimiters which are used to recognize the structure of the rules. The central "!" and "=" are the only element of synchronization between the "if" part and the "then" part. The left and right "!" and "=" are present to make the rule more readable. All other boxes contain linguistic elements that are searched in input sentences when rules are applied to text. Thus, the "if" part describes the pattern <aussi.CONJC> <PUNCT>, and the "then" part describes a single <!#.PUNCT>.

As a matter of fact, when the rule of Graph 1 is applied to a sentence, ELAG checks that whenever <aussi.CONJC> is immediately followed by <PUNCT>, the juncture between them is immediately followed by <!#.PUNCT> in every analysis, and it removes the analyses that do not obey this constraint. When the form aussi ends a sentence, the CONJC tag will be correctly removed because it violates the constraint, but the ADV tag will correctly remain unchanged because it is not concerned by the rule.

Graph 1 makes use of variables standing for categories of symbols: <PUNCT> and <!#.PUNCT>. They are also variable tags, since non-verbal symbols are considered as tags. The recognition of the context usually involves the use of tags for categories of words, e.g. parts of speech, like <N>, which stands for any noun. We call them variable tags because they represent a set of usual lexical tags. In variable tags, the part of speech can be combined with:

- one or several inflectional values: <N:s> for any noun in the singular, <V:1s> for any verb in the first person singular, etc.;
- or with a canonical form: <fait.A>, <fait.A:f>;
- or with one or several canonical forms and exclamation marks, making up a negative variable tag, e.g. <!être!avoir.V>, which represents any verb except être and avoir.

When an inflected form is ambiguous between several series of inflectional values for the same canonical form, for example <faire.V:Kms> and <faire.V:P3s>, the format <faire.V:Kms:P3s> can be viewed as a variable tag or as an abbreviation for a list of complete tags.

The conventions of formalization of variable tags are an important element of the formalism of description of grammatical constraints.
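The variable-tag conventions described above can be illustrated with a small matcher. The following Python sketch is only an illustration, not the ELAG or INTEX implementation: the names `Tag`, `parse_tag` and `matches` are hypothetical, subcategories such as +NDN are simply ignored, punctuation tags and the universal tag are not handled, and an inflectional requirement such as `3s` is assumed to be satisfied by a code such as `P3s` when all of its feature letters occur in that code.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    lemmas: frozenset   # canonical forms named in the tag (empty: any)
    negated: bool       # True for negative tags such as <!être!avoir.V>
    pos: str            # part of speech, e.g. "V", "N", "A"
    infl: tuple         # inflectional codes, e.g. ("Kms", "P3s")


def parse_tag(text: str) -> Tag:
    """Parse '<lemma.POS+subcat:code:code>' (simplified, hypothetical)."""
    body = text.strip()[1:-1]                      # drop "<" and ">"
    head, _, codes = body.partition(":")
    lemma_part, _, pos = head.rpartition(".")      # no "." means no lemma
    negated = lemma_part.startswith("!")
    lemmas = frozenset(l for l in lemma_part.split("!") if l)
    return Tag(lemmas, negated, pos.split("+")[0],
               tuple(codes.split(":")) if codes else ())


def matches(variable: str, complete: str) -> bool:
    """True if the complete tag belongs to the set denoted by the variable tag."""
    v, c = parse_tag(variable), parse_tag(complete)
    if v.pos != c.pos:
        return False
    # A named canonical form must be present, or absent for a negative tag.
    if v.lemmas and (c.lemmas <= v.lemmas) == v.negated:
        return False
    if v.infl:
        # Assumption: every feature letter of some requested code must
        # occur in some inflectional code of the complete tag.
        return any(set(vc) <= set(cc) for vc in v.infl for cc in c.infl)
    return True


print(matches("<V:3s>", "<faire.V:P3s>"))            # True
print(matches("<!être!avoir.V>", "<être.V:P3s>"))    # False
```

Under these assumptions, <faire.V:Kms:P3s> behaves as described in the text: it matches both <faire.V:Kms> and <faire.V:P3s>, like a list of complete tags.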

These conventions are slightly different from current INTEX conventions for formalizing grammatical patterns to be located in texts [1]. In particular, our conventions about variable tags impose that any tag should include at least a part of speech, except the joker or universal tag <> that represents any tag [2]. The set of unknown words, i.e. the forms not described in the dictionary, is also considered as an additional part of speech: <?>. In fact, the Locate menu does accept a type of variable tags which is entirely natural, but not accepted by ELAG: tags reduced to a single word, like fait, which represent any complete lexical tag that the dictionary may associate to the form. In the case of fait, there are at least four of them: <fait.N:ms>, <fait.A:ms>, <faire.V:Kms>, <faire.V:P3s>. With this convention, if we formalize a constraint to resolve a lexical ambiguity with respect to the form est, we are not allowed to refer to it through the variable tag est, so we will necessarily resort to lexical tags with a part of speech, like <être.V:P3s> or <est.N:ms>.

[1] For example, at the time of our experiments, the Locate menu of INTEX did not accept variable tags with both a canonical form and a grammatical category, like <fait.A> or <fait.A:fs>.
[2] PUNCT is considered as a part of speech.

Generally, writers of descriptions consider any restriction to the expressive power of the formalism as an obstacle to their work of elaboration and of generalization from their intuition, because it restricts the technical means provided by the formalism to deal with some category of grammatical constraints. However, paradoxically, a restriction of the expressive power of the formalism may make the activity of formalization easier. This is the case of this convention:

- There is no reason why contexts relevant to the distinct tags (the verb and the adjective, for example) should have anything in common. When the rule writer describes contexts relative to one of them, he need not take into account the other one. Moreover, during filtering, analyses are processed as if they were entirely distinct.
- Describing grammatical constraints is usually more comfortable when one can ignore completely the possible ambiguity of the elements being described, even the ambiguity of the form. The paradox is only apparent, since the existence of ambiguity is the reason why we undertake the construction of grammars, but what is relevant to the description is not the ambiguity but the distributional and syntactic contexts of the specific linguistic elements themselves.
- In case of a doubt about the tags associated to a form in the linguistic analysis underlying the system, the convention encourages the rule writer to refer to the electronic dictionary and to take into account its contents.
- Given that grammatical constraints are expressed with tags which comprise at least a part of speech, they refer to specific grammatical constructions, and sometimes to definite word senses. This is why the correctness of the rules remains unchanged in case one introduces into the dictionary new lexical tags describing new senses of the same words.

Consider, for instance, the following sentence:

Elle est fidèle à ses idées

and the grammatical constraint of graph 2, designed to resolve the gender ambiguity of the adjective.

! <il.PRO:3fs> <être.V:3s> ! <A> !
= = <A:f> =
Graph 2.

Now, imagine that we introduce into the dictionary of proper nouns a tag <Elle.N+pr:ms> for a magazine titled Elle. The rule of graph 2, when applied to:

(4) Elle était plein de reportages cette semaine.

does not apply to the correct analysis with <Elle.N+pr:ms>, because it is specific enough, and therefore does not reject it. More precisely, the use of graph 2 does not dispense with introducing the new tag into the dictionary, since otherwise graph 2 rejects the correct analysis of plein in (4), but at least, graph 2 does not need any modification when the new tag is introduced.

If we write another version of graph 2 by substituting a variable tag elle for <il.PRO:3fs>, without complying with the convention being discussed, the new version becomes erroneous with the introduction of <Elle.N+pr:ms> into the dictionary: when it is applied to (4), it eliminates all analyses with <plein.A:ms> and retains only the analyses in which plein is tagged differently. Therefore, the correct analysis is discarded.

We can conclude that our convention reduces the degree of interdependency between the contents of the dictionary and of grammars. This paradoxical situation, in which a limitation of the expressive power of the formalism makes the use of the formalism itself easier, is well known in the field of software engineering. Object-oriented programming is a particularly well suited practice for creating and maintaining complex software systems. One of the means successfully used in this framework to facilitate the implementation of programs is to prevent the implementer from using certain variables in certain conditions.

3. Specification of the effects of applying ELAG rules

Graphs 1 and 2 give a rough idea of the effect produced by the application of a rule to a sentence. In this section, we will specify this effect in a more detailed and general way.

Any ELAG rule consists of an "if" part delimited by boxes with "!" and one or more "then" parts delimited by boxes with "=". The boxes with "!" and "=" are delimiters whose only function is to mark the structure of the rules. All other boxes contain linguistic elements that are searched in input sentences when rules are applied to text.

The simplest form of an ELAG rule comprises only one "then" part, as in graph 3. The boxes with "!" and "=" delimit four patterns R1, R2, C1, C2. Any of these four patterns can be the empty word.

! pattern R1 ! pattern R2 !
= pattern C1 = pattern C2 =
Graph 3. General form of an ELAG rule with one "then" part.

The constraint expressed by such a rule is that whenever an occurrence of R1 is immediately followed by an occurrence of R2 in an analysis of a sentence, then the juncture between them must be immediately preceded by an occurrence of C1 and followed by an occurrence of C2 in the same analysis. The effect of such a rule is to remove all analyses that do not obey this constraint. There is no restriction on the lengths of the sequences that belong to patterns R1, R2, C1, C2. In fact, any of these patterns represents a set of tagged sequences that may not all have the same length.

For example, graph 4 states that when a pronoun ils or elles follows a dash, the word before is always a verb in the third person plural.

! ! - <il.PRO:3p> !
= <V:3p> = =
  <?>
Graph 4.

This graph was written assuming that the pronouns ils and elles are described in the dictionary by the tags <il.PRO:3mp> and <il.PRO:3fp>. This rule resolves partially or totally the lexical ambiguity of the verb: if it is ambiguous with an adjective, like convergent, or with a <V:3s>, like pressent, these tags are correctly discarded. In this graph, patterns R1 and C2 consist of the empty word, which means that the corresponding context is not taken into account. The two lines in pattern C1 mean that the word before the pronoun must be either a <V:3p> or an unknown word <?>. An unknown word is a form which is not found in the dictionary. The tag for unknown words was included here in order to avoid the situation where the verb before the pronoun would not be in the dictionary. Should that happen, graph 4 would correctly not eliminate the analysis <?>-<il.PRO:3p>.
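The constraint expressed by a rule with one "then" part can be checked mechanically at every juncture of an analysis. The Python sketch below is a simplified illustration under stated assumptions, not the ELAG implementation: each pattern is a finite list of alternative sequences of tag predicates (the actual system works with automata), tags are hypothetical (lemma, part of speech, inflection) triples, and the names `ends_at`, `starts_at`, `satisfies` and `GRAPH_4`, as well as the three sample analyses, are invented for the example.

```python
DASH = ("-", "PUNCT", "")           # the dash is a tag like any other symbol


def pos(p, feats=""):
    """Predicate for a variable tag such as <V:3p> (simplified matching)."""
    return lambda t: t[1] == p and set(feats) <= set(t[2])


def lemma(l, p, feats=""):
    """Predicate for a variable tag with a canonical form, e.g. <il.PRO:3p>."""
    return lambda t: t[0] == l and pos(p, feats)(t)


def ends_at(pattern, seq, j):
    """True if some alternative of the pattern matches the tags ending at j."""
    return any(len(alt) <= j
               and all(f(t) for f, t in zip(alt, seq[j - len(alt):j]))
               for alt in pattern)


def starts_at(pattern, seq, j):
    """True if some alternative of the pattern matches the tags starting at j."""
    return any(j + len(alt) <= len(seq)
               and all(f(t) for f, t in zip(alt, seq[j:j + len(alt)]))
               for alt in pattern)


def satisfies(rule, seq):
    """Check the one-"then"-part constraint at every juncture of an analysis."""
    r1, r2, c1, c2 = rule
    for j in range(len(seq) + 1):
        if ends_at(r1, seq, j) and starts_at(r2, seq, j):
            if not (ends_at(c1, seq, j) and starts_at(c2, seq, j)):
                return False
    return True


EMPTY = [[]]                        # pattern reduced to the empty word
GRAPH_4 = (
    EMPTY,                                               # R1
    [[lambda t: t == DASH, lemma("il", "PRO", "3p")]],   # R2
    [[pos("V", "3p")], [pos("?")]],                      # C1: two alternatives
    EMPTY,                                               # C2
)

# Three hypothetical analyses of a form followed by "-ils".
verb_reading = [("converger", "V", "P3p"), DASH, ("il", "PRO", "3mp")]
adj_reading = [("convergent", "A", "ms"), DASH, ("il", "PRO", "3mp")]
unknown_reading = [("xyzent", "?", ""), DASH, ("il", "PRO", "3mp")]

kept = [a for a in (verb_reading, adj_reading, unknown_reading)
        if satisfies(GRAPH_4, a)]
print(len(kept))                    # 2: the adjective reading is discarded
```

Each analysis is tested on its own, so the verb reading and the unknown-word reading are kept whether or not the adjective reading is present in the input.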

Let us now write an equivalent of graph 4 for the singular. In addition to the <V:3s>, we will have to take into account the form -t- which is inserted between the verb and the pronoun in the case of some verbs, but not all:

Avait-il l'intention de se manifester ?
Le fera-t-il maintenant ?

If we wish to tag -t- as <t.XI>, where XI is a dummy part of speech for this type of isolated phenomena, we can write the rule as in graph 5.

[Graph 5. "If" part: pattern R1 is empty; pattern R2 is a dash followed by <il.PRO:3s> or <on.PRO:3s>. "Then" part: pattern C1 is built from the boxes <V:3s>, <voilà.XV>, <revoilà.XV>, <?>, the dash and <t.XI>; pattern C2 is empty.]

The graph states that if the pronouns in the "if" part are preceded by a dash, then the dash is preceded by one of the patterns in the "then" part. In this graph, pattern C1 contains various sequences of tags: some of them comprise three tags, others one. Graph 5 contains other additional elements as compared to graph 4:

- <on.PRO:3s>. If we omitted <on.PRO:3s> from the "if" part, the rule would become less general, but would still be correct.
- <voilà.XV> and <revoilà.XV>. If we omit these complete tags from the "then" part, and if voilà and revoilà are not tagged as verbs in the dictionary, the rule will discard the correct analysis of voilà-t-il.

The constraints expressed in graphs 1, 2, 4 and 5 exemplify a practical feature of ELAG which is important for writers of disambiguation grammars: what is described in rules are sequences that can occur in correct analyses of correct texts. This feature distinguishes ELAG from the system of E. Roche (1992), which asks the user to describe sequences of tags that can never occur in a correct analysis of a correct text. One of the reasons for us to design ELAG was the opinion of several potential grammar writers, who found that describing correct sequences would be more comfortable and easier than describing incorrect sequences.

The context explored during the application of a rule is usually very local, often one or two words, but a given rule may describe rather various contexts, and the formalism does not set any bounds to the length (in number of tags) of the sequences described in contexts. The context explored by ELAG is thus local, but not bounded. This feature distinguishes ELAG from frequency-based taggers, which usually explore a context of a fixed length. In addition, patterns R1 and C1, or patterns R2 and C2, may describe portions of text which overlap partially or totally.

Graphs 1, 2, 4 and 5 are all very simple, but it is usually necessary to include more complex patterns in order to formalize more elaborate constraints, as in graph 6. This graph assumes that in the syntactic analysis that underlies the whole system (dictionary, ambiguity resolution and syntactic parsing), the determiner un with a deleted noun should still be tagged as determiner:

Plus d'un réussira

for example. The rule applies even if a preverbal pronoun (PPV).PREP> <un. In general.V:3s> in: Le livre commence par une préface In this graph.PREP> <PRO+PPV:3ms:3mp:3fs:3fp> = <de. ' <PFX> - ! <PREP> ! <un. or both.DET:ms:fs> <par. C1 explores a little more than R1 to the left. the tag <préfacer.The graph states in which conditions this determiner can occur between a preposition and a <V:3s>.PREP> <PRO+PPV> <près.PREP> <à. there is no reason why the "if" part and the "then" part of a rule should cover the same portion of text.PREP> <dessus. occurs between un and the verb.PREP> <dessous.PREP> Graph 6. The "then" parts of this graph do not check for the presence of the sentence boundary. This rule discards correctly. graph 7 states that a sequence <A> <N> after a sentence boundary agrees in gender and number. or a prefix (PFX) with a dash. 9 . This graph resolves lexical ambiguities of the verb: any analysis in conformity with the "if" part but not with the "then" part is rejected. at least one of the "then" parts should be present. Graphs 4 and 5 exemplify this. The most general form of an ELAG rule includes several "then" parts.PREP> <PRO+PPV:1s:1p:2s:2p:3s:3p> <sans.PREP> = = <contre.PREP> <pour. For example. In the case of prepositions à and par. meaning that whenever the "if" part is recognized.DET:ms:fs> <V:3s> ! <autour. ! # <A> ! <N> ! = <A:ms> = <N:ms> = = <A:mp> = <N:mp> = = <A:fs> = <N:fs> = = <A:fp> = <N:fp> = Graph 7. patterns R1 and C1 overlap one another.

Graph 7 is simple, but imprudent. In fact, in the beginning of a sentence, an adjective in the plural can be followed by a coordination of several nouns in the singular:

Chers Marie et Alexandre, acceptez ce clin d'œil

If we apply graph 7 to this sentence, it will discard the correct analysis. The graph should be modified in order to avoid this effect.

Examples are useful for understanding a formalism intuitively, but they are not equivalent to a rigorous specification. Let us now formulate in mathematical notation the constraint expressed by a rule. The simplest case is when the rule comprises only one "then" part, and therefore 4 patterns $R_1$, $R_2$, $C_1$, $C_2$. Each of these patterns is a finite automaton representing a set of sequences of tags. The tags in the patterns may be complete tags or variable tags, but a variable tag represents a set of complete tags, so in any case each pattern represents a set of sequences of complete tags (including punctuation tags and unknown-word tags). Note that each sequence is of finite length, but the set of sequences may be infinite, since there is no bound to the lengths of the sequences. Technically, each pattern represents a rational set on the alphabet $A$ of complete tags. Each pattern can be reduced to the empty word, but not to the empty set.

Let $P$ be the set of taggings of a sentence. Each tagging of the sentence, i.e. each element of $P$, is a sequence of complete tags. Note that the taggings in $P$ do not necessarily all have the same length, because of compound words. We will specify the constraint, and therefore the effects of applying the rule to the sentence, by defining the set of taggings in $P$ which satisfy the constraint. To do this, let us examine first which sequences do not satisfy the constraint: in such sequences, either $C_1$ is absent:

$(A^{*}R_1 \setminus A^{*}C_1)\,R_2A^{*}$

(where "*" means iteration and "\" means set subtraction), or $C_2$ is absent:

$A^{*}R_1\,(R_2A^{*} \setminus C_2A^{*})$

or both. Hence, the set of taggings in $P$ which satisfy the constraint is:

(5) $P \setminus \bigl(A^{*}R_1\,(R_2A^{*} \setminus C_2A^{*}) \mid (A^{*}R_1 \setminus A^{*}C_1)\,R_2A^{*}\bigr)$

where "|" means union.

If there are $n$ "then" parts, we need a more general formula to specify the constraint. Let us name $C_{11}, C_{12}, \ldots, C_{1n}, C_{21}, C_{22}, \ldots, C_{2n}$ the $2n$ patterns in the $n$ "then" parts. If a sequence does not satisfy the constraint, there is at least a subset $I$ of $[1, n]$ such that $C_{1i}$ is absent from the sequence for each $i$ in $I$, and such that $C_{2i}$ is absent for all $i$ in $[1, n] \setminus I$. Consequently, the generalization of (5) is:

(6) $P \setminus \bigcup_{I \subseteq [1,n]} \Bigl(A^{*}R_1 \setminus \bigcup_{i \in I} A^{*}C_{1i}\Bigr)\Bigl(R_2A^{*} \setminus \bigcup_{i \in [1,n] \setminus I} C_{2i}A^{*}\Bigr)$

Expressions (5) and (6) show an important feature of the formalism: $P$ appears only once in these expressions. In other words, the set of analyses that do not satisfy the constraint (and, consequently, the set of analyses that satisfy the constraint) is independent of the set of analyses to which the rule is applied. This feature may look a bit theoretical, but in fact it has important consequences on the practical use of the formalism.
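As a quick consistency check (a worked step added here for illustration, not part of the original derivation), setting n = 1 in (6) gives back (5). The subsets I of [1, 1] are the empty set and {1}, and a union over an empty index set is the empty set, so the two terms of the outer union in (6) are:

```latex
\begin{align*}
I = \emptyset &: \quad (A^{*}R_1 \setminus \emptyset)\,(R_2A^{*} \setminus C_{21}A^{*})
               = A^{*}R_1\,(R_2A^{*} \setminus C_{21}A^{*}) \\
I = \{1\} &: \quad (A^{*}R_1 \setminus A^{*}C_{11})\,(R_2A^{*} \setminus \emptyset)
               = (A^{*}R_1 \setminus A^{*}C_{11})\,R_2A^{*}
\end{align*}
```

Writing C1 for the single left "then" pattern and C2 for the single right one, the union of these two sets is exactly the set subtracted from P in (5).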

4. Independence of ELAG constraints

The various graphs examined above are very simple examples, and more complex graphs are needed for more complex constraints. However, there is a limit to the complexity of ELAG rules: they must be simple and compact enough to be readable by writers of disambiguation grammars. Practically, the size of an ELAG rule is limited to a page. Given that dozens of constraints can be exploited in order to resolve lexical ambiguity, the only possible strategy for building a disambiguation grammar is to accumulate separate rules and to use them together, in order to benefit from the addition of their effects.

In the preceding section, we specified the effect of applying an ELAG rule. What is the effect of applying a disambiguation grammar made of several ELAG rules? The answer is very simple: the constraint formalized by a set of rules is a logical "and" of the individual constraints formalized by the rules. An analysis of a sentence is accepted by the set of rules if and only if it would be accepted by each rule used in isolation. For example, consider graphs 8 and 9, which are adapted from E. Roche (1992).

! <le.DET:ms:mp:fs:fp> ! <V> !
  <le.DET:ms:mp:fs:fp> '
= = <V:K> =
Graph 8.

! <le.PRO:3ms:3mp:3fs:3fp> ! <N> !
  <le.PRO:3ms:3mp:3fs:3fp> '
= - <le.PRO:3ms:3mp:3fs:3fp> = =
  - <le.PRO:3ms:3mp:3fs:3fp> '
Graph 9.

Graph 8 states that the determiner le is not followed by a verb, except in the past participle. Graph 9 states that the pronoun le is not followed by a noun, except if it is preceded by a dash. Let us apply both rules to the following sentence:

(7) Les banques l'expédient dans un délai de quatre à six semaines.

Graph 8 correctly eliminates the analyses with <le.DET:ms:fs> ' <expédier.V:P3p:S3p> and graph 9 those with <le.PRO:3ms:3fs> ' <expédient.N:ms>, but the analyses with <le.DET:ms:fs> ' <expédient.N:ms> and <le.PRO:3ms:3fs> ' <expédier.V:P3p:S3p> are accepted, because they satisfy the constraints of both graphs. Graph 8 applies only to the analyses with the verb, and graph 9 only to those with the noun, as if each analysis were processed in an entirely distinct way.

In addition to ELAG, several other systems of resolution of lexical ambiguity use logical "and" to combine rules (E. Roche 1992, A. Voutilainen 1994). In Ph. Laval (1995), one of the two formalisms proposed (p. 101) is of the same type. Such systems, including ELAG, have a remarkable property: grammatical constraints expressed as separate rules are independent of each other and do not show any interactions. A given rule has a specific effect on texts, and this effect will be the same, no matter whether this rule is used with other rules or not. This comes from the fact that constraints are combined with logical "and", a commutative operation. In particular, the order of application of rules is indifferent: a set of rules can be used either simultaneously or sequentially, and in any order, or even applied iteratively, without any modification of the final result. If the same set of rules is applied again to the result, the result remains unchanged after the first application.

In addition, rules can be written separately and used together, even without any coordination between rule writers [3]. The effects of disambiguation rules being cumulative, this implies that when you introduce new rules to a grammar, the rate of reduction of lexical ambiguity can increase but never decrease. When you add a new rule to an existing grammar, the formalism itself ensures that the effects of the existing grammar are not modified. The new rule can only reject further analyses. The only risk with this procedure is to duplicate work: there cannot be compatibility problems.

[3] Provided they are based on the same linguistic analyses.

The independence of constraints and the independence of analyses are two fundamental properties of such systems. They make them easy to build and to maintain. However, the price to be paid for this flexibility is a limitation of the expressive power of these systems. For example, the following rule can be stated in English but cannot be directly translated in ELAG:

(8) In any sequence of the form <avoir.V> <V:K>, all the lexical ambiguity of the first word is resolved in favour of the verb avoir, and all the lexical ambiguity of the second is resolved in favour of the past participle.

This is a priority rule. Let us apply it to the following sentence:

(9) Nous avions relevé l'incident le jour même

The tags <avion.N:mp> and <relevé.N:ms> are correctly rejected, but this correct application of the rule depends on the presence of the tags <avoir.V:I1p> and <relever.V:Kms> in other analyses for the same words. Rule (8) cannot be expressed in ELAG, since ELAG does not provide any means of describing a constraint about an analysis through the description of another analysis. Of course, the lexical ambiguity of compound tenses in French can be partially resolved with the aid of other grammatical constraints.

Now let us examine how rule (8) behaves in the presence of other rules. If another rule had previously (incorrectly) eliminated <avoir.V:I1p> or <relever.V:Kms> in (9), the application of (8) would leave the sentence unchanged, since its conditions of application would not be satisfied. Therefore, the result of the application of rule (8) depends on whether other rules have been applied to the same sentence before. In addition, the final result of the application of a set of rules depends on the order of application of the different rules.

Let us now exemplify a type of rule in which a form is recognized only if all its lexical ambiguity has been resolved beforehand. Rule (11) is a more prudent variant of (8):

(11) In any sequence of the form <avoir.V> <V:K>, if all the lexical ambiguity of the first word has already been resolved in favour of the verb avoir, then all the lexical ambiguity of the second is resolved in favour of the past participle.

This version may correctly accept the tag <relever.V:Kms> in (9), but only if another rule has already rejected <avion.N:mp>. Rule (11) cannot be expressed in ELAG, since there is no way of describing how a constraint on a given analysis depends on the absence of other tags for one of the words.

In other examples of grammatical constraints that cannot be formalized in ELAG, the description of the context takes into account lexical ambiguity, as in rule (10):

(10) When a form is ambiguous between <V:P3s> and <N:fs>, like aide, and occurs before an <A:fs>, the ambiguity is resolved in favour of <N:fs>.

For instance, when this constraint is applied to:

Nous avons besoin d'une aide urgente

it correctly rejects the tags <aider.V:P1s:P3s:S1s:S3s:Y2s>, but this correct behaviour depends on the presence of <aide.N:fs>, which is another tag for the same word, and therefore belongs to other analyses. The result of the application of (10) depends on other rules, since they may have eliminated <aide.N:fs> before (10) applies. Therefore, there may be implicit dependencies between rules in the grammar.

The independence of rules, in ELAG and in related formalisms, excludes the existence of a network of rule/exception relations between the rules. For example, in graph 5, the two forms voilà and revoilà are the only exceptions to rule (12):

(12) Any word occurring to the immediate left of -t-il, -t-elle and -t-on is a verb.

If we decide to adopt (12) as a rule, and to state separately that rule (13):

(13) voilà and revoilà may occur to the immediate left of -t-il, -t-elle and -t-on,

is an exception to (12), we introduce explicit dependencies between rules. The effects of (12) and (13) cannot be determined independently of each other.
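The contrast drawn above, between constraints that judge each analysis on its own and a priority rule such as (8), can be illustrated with a minimal Python sketch. It is not ELAG's implementation: the representation of analyses as tuples of tag strings, the helper names (`local_rule`, `priority_rule_8`, `apply_in_order`) and the three toy constraints are all assumptions made for the demonstration, and only the first two words of sentence (9) carrying the ambiguity are modelled.

```python
from itertools import permutations, product

# Tags of "avions relevé" in sentence (9). The tag strings follow the paper;
# everything else in this sketch is a hypothetical illustration.
AVIONS = ["avoir.V:I1p", "avion.N:mp"]
RELEVE = ["relever.V:Kms", "relevé.N:ms"]
ANALYSES = frozenset(product(AVIONS, RELEVE))  # one tuple per analysis


def local_rule(reject):
    """ELAG-style filter: every analysis is accepted or rejected on its own."""
    return lambda analyses: frozenset(a for a in analyses if not reject(a))


# Three invented constraints; bad_rule is deliberately incorrect.
rule_a = local_rule(lambda a: a == ("avion.N:mp", "relever.V:Kms"))
rule_b = local_rule(lambda a: a == ("avoir.V:I1p", "relevé.N:ms"))
bad_rule = local_rule(lambda a: a[0] == "avoir.V:I1p")


def priority_rule_8(analyses):
    """Rule (8): if some analysis reads <avoir.V> <V:K>, keep only those.

    The decision about one analysis depends on the other analyses present,
    which is exactly what a local filter cannot express.
    """
    hits = frozenset(a for a in analyses
                     if a[0].startswith("avoir.V") and ".V:K" in a[1])
    return hits if hits else analyses


def apply_in_order(rules, analyses):
    """Apply the rules one after the other, in the given order."""
    for rule in rules:
        analyses = rule(analyses)
    return analyses


# Local filters: every ordering gives the same result, and reapplying
# the same set of rules changes nothing.
local_results = {apply_in_order(order, ANALYSES)
                 for order in permutations([rule_a, rule_b, bad_rule])}
once = apply_in_order([rule_a, rule_b], ANALYSES)
twice = apply_in_order([rule_a, rule_b], once)

# Priority rule: the outcome depends on what was applied before it.
priority_first = apply_in_order([priority_rule_8, bad_rule], ANALYSES)
priority_last = apply_in_order([bad_rule, priority_rule_8], ANALYSES)
```

In this toy setting `local_results` contains a single value and `once == twice`, whereas `priority_first` and `priority_last` differ: once the incorrect rule has removed the avoir tag, rule (8) no longer finds its context and leaves the remaining analyses untouched.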

The independence of rules has several other important practical consequences. The effects of each rule are independent of all others: they are not altered with the introduction of new rules, they are just added to the effects of the new ones. This feature facilitates the construction of a disambiguation grammar by accumulating rules. Given the variety of situations of lexical ambiguity, a disambiguation grammar generally contains numerous rules.

It is possible to define what a correct disambiguation rule is, with reference to the underlying linguistic analyses on which one wishes to build the system, by checking that all the analyses ruled out are incorrect. The way of combination of grammatical restrictions ensures that a set of correct rules is correct, no matter in which order they are applied. If a given analysis is eliminated by a set of rules, this implies that at least one of the rules eliminates the analysis. In addition, if the application of a disambiguation grammar rules out a correct analysis, a situation which we want to avoid, we can apply each of the rules to the sentence separately, and we are sure to discover which of the rules reject the correct analysis. This feature makes it possible to detect precisely the origin of errors, and to stick to the objective of never discarding a correct analysis during the evolution of the disambiguation grammar. Then, if we modify these rules in order to correct them, these modifications will not change the effects of other rules.

The description of grammatical constraints with this type of system is also comfortable for a practical reason: rule writers can completely ignore the possible lexical ambiguities of the elements they are describing, since the distributional and syntactic contexts of linguistic elements are recognized independently of other constructions that could happen to be homographic with them.

5. Overlap between application zones

In ELAG, the context relevant to the processing of a case of lexical ambiguity is explicitly defined by the rule writer, and the formalism ensures that a larger context does not become implicitly relevant. When rules are applied to a text, the recognition of sequences selects portions of text that we will call application zones. For example, the application zone of graph 8 is always two words, with possibly an apostrophe between them. Practically, due to the density of lexical ambiguity in texts, the application zones of rules in a text frequently overlap each other. Contexts relative to adjacent elements may partly coincide, to the left as well as to the right. For example, in the following sentence:

(14) Sans doute plus d'un y a-t-il déjà pensé

the application zones of graphs 5 (a-t-il) and 6 (d'un y a) overlap each other. An application zone can even be completely embedded in another. Two distinct application zones of the same rule can even present such an overlap, if the beginning and end of the contexts described in the rule contain some identical part. In such cases, the ELAG formalism ensures that the effect of applying each rule to the relevant application zone is independent of the rest of the text, i.e. the extension of the context relevant to each rule remains strictly local, and does not change from what was decided by the rule writer. In (14), both graphs 5 and 6 apply in the same way as they would if each of them were used alone.

6. Evaluation

The evaluation of the results of lexical disambiguation is indispensable in order to be able to

compare versions, for monitoring the evolution of a disambiguation grammar, and to compare different systems, for determining the relative position of systems in the competitive domain of lexical disambiguation. In addition to ELAG, we implemented a separate, INTEX-compatible procedure of evaluation of the results of lexical disambiguation. This procedure can be used with another procedure of disambiguation than ELAG, provided that the input and output of the system are available in the form of finite automata, and more specifically in .fst format, which is the main format in which INTEX generates tagged texts.

The evaluation involves two independent parameters: recall, i.e. the proportion of analyses accepted by the system among correct analyses, and precision, i.e. the proportion of correct analyses among accepted analyses. In current literature, these parameters are usually not evaluated separately. However, in our view, the status of recall and precision is very different, since we consider as a priority to ensure that a correct analysis is never rejected. Therefore, we will consider separately recall and precision. Once a given tagset has been chosen, a comparison between grammars or between versions becomes theoretically possible. On the contrary, the performances of systems with different tagsets cannot be compared without a specific work on the two tagsets.

6.1. Evaluation of recall

The problem is to detect cases of correct analyses being eliminated during the resolution of lexical ambiguity. This work cannot be automated, otherwise we would have at our disposal an automatic procedure able to determine the correct analyses of sentences in unrestricted text, which is not the case yet. Unfortunately, the procedure is nearly entirely manual. In practice, we use two types of computer aids.

a) A readable listing of the text after processing allows us to manually examine unrestricted text, and to check whether the correct analyses are present. We implemented a listing format in which the tagged text is displayed in lines. A file in this format is generated from the .fst output of the system of lexical disambiguation. The automaton editor of INTEX, FSGraph, could also be used, provided that a file in .grf format can be generated from the .fst output of the system.

b) A complementary method consists in automatically detecting sentences in which the system of lexical disambiguation rejects all analyses. This detection is easy, because the output for these sentences is empty. These sentences are usually not very numerous, but it is worth examining their a priori analyses. These sentences fall into three cases:
- either the text itself is incorrect, e.g. a title with an agreement mistake, like Les atteintes au paysages (sic);
- or the correct analyses do not belong to the automaton of the sentence after dictionary lookup, because a lexical item is absent from the dictionary, like in Je me suis bien amusé, au revoir, et merci ! where au revoir belongs to a class of compounds which cannot be reliably recognized by INTEX yet and has not been integrated into the dictionaries;
- or the correct analyses are present in input and therefore are rejected by the grammar.

6.2. Evaluation of precision

A rigorous evaluation of precision involves determining the correct analyses of sentences, since precision can be defined as the proportion of correct analyses among accepted analyses. This work cannot be automated on unrestricted text for the same reason as before, but another parameter, the rate of reduction of lexical ambiguity, is computable and gives a good evaluation of precision provided that recall is high enough.

The rate of reduction of lexical ambiguity can be defined as the proportion of lexical ambiguity removed during the application of the grammar. The quantity of lexical ambiguity is roughly defined as the number of tags per word. If this quantity is $r_1$ in input and $r_2$ in output, the reduction rate is simply $(r_1 - r_2) / r_1$.

In current literature, the quantity of lexical ambiguity is usually defined as the average number of tags per simple word. This definition is simplistic because it makes this quantity insensitive to a certain type of resolution of lexical ambiguity. For example, graphs 8 and 9, when applied to sentence (7), reject approximately one half of the analyses of l'expédient, but do not reject any of the tags of any of these two words. They only reject combinations of these tags. Doing this, they remove a part of the lexical ambiguity of the sentence, but the usual definition of the quantity of lexical ambiguity does not measure this difference. This imperfection comes from the fact that this type of resolution of lexical ambiguity can only be implemented if output is represented in the form of acyclic finite automata, whereas standard disambiguators use less powerful means of representation of their output.

We propose a definition of the quantity of lexical ambiguity that takes into account the removal of paths in an automaton even if the number of tags per simple word remains unchanged. We cannot define the quantity of lexical ambiguity as, for instance, the number of states or transitions of the automaton divided by the number of simple words. Indeed, the number of states or transitions of an automaton can either increase or decrease when one removes paths, even if states or transitions are counted in the minimal deterministic automaton. Instead of this, we substitute the geometric mean for the arithmetic mean when we compute the average number of tags per simple word. If the i-th simple word has $a_i$ tags, the geometric mean $r$ is defined by:

(15) $\log r = (\log a_1 + \log a_2 + \dots + \log a_n) / n$

The numerical result is not very different from the arithmetic mean $(a_1 + a_2 + \dots + a_n) / n$, but (15) can be generalized to the case of an acyclic automaton. If any tag of a word can be combined with any tag of the next word, the number of paths of the automaton is $a_1 a_2 \dots a_n$. We can therefore generalize (15) to the case of any acyclic automaton, by computing the number $p$ of paths of the automaton:

$\log r = (\log p) / n$

With this definition, the quantity of lexical ambiguity is close to the value given by the standard definition, but more sensitive to partial resolution of lexical ambiguity. This definition is implemented in our procedure of evaluation of reduction rate. The user of this procedure obtains for each sentence and for the whole text a rate that represents the proportion of ambiguity removed during the processing.
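The path-based measure of section 6.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' evaluation procedure: the sentence automaton is represented as a dictionary from states to lists of (tag, next state) pairs with a single final state, the function names are invented, and the example treats l'expédient as n = 2 simple words, ignoring the apostrophe.

```python
import math
from functools import lru_cache


def count_paths(transitions, start, final):
    """Count the paths from start to final in an acyclic automaton."""
    @lru_cache(maxsize=None)
    def paths_from(state):
        if state == final:
            return 1
        return sum(paths_from(nxt) for _tag, nxt in transitions.get(state, ()))
    return paths_from(start)


def ambiguity(transitions, start, final, n_words):
    """Quantity of lexical ambiguity r, defined by log r = (log p) / n."""
    p = count_paths(transitions, start, final)
    return p ** (1.0 / n_words)


def reduction_rate(r1, r2):
    """Proportion of ambiguity removed: (r1 - r2) / r1."""
    return (r1 - r2) / r1


# Before disambiguation: any tag of "le" combines with any tag of "expédient".
BEFORE = {
    0: [("le.DET:ms:fs", 1), ("le.PRO:3ms:3fs", 1)],
    1: [("expédient.N:ms", 2), ("expédier.V:P3p:S3p", 2)],
}
# After graphs 8 and 9: each word keeps both tags, but only the
# combinations determiner + noun and pronoun + verb remain.
AFTER = {
    0: [("le.DET:ms:fs", "det"), ("le.PRO:3ms:3fs", "pro")],
    "det": [("expédient.N:ms", 2)],
    "pro": [("expédier.V:P3p:S3p", 2)],
}

r1 = ambiguity(BEFORE, 0, 2, n_words=2)
r2 = ambiguity(AFTER, 0, 2, n_words=2)
rate = reduction_rate(r1, r2)
```

The arithmetic mean of tags per word is 2 both before and after, so the usual definition reports no reduction. With the path count, p drops from 4 to 2, r from 2 to the square root of 2, and the reduction rate is about 0.29.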

7. Implementation

The implementation of ELAG follows formulae (5) and (6). For simplicity, we will comment only about (5) which is a particular case of (6). In (5), P depends only on the text and the rest of the expression depends only on the rule. Since the same rule is used on many texts, as much as possible of the computation is executed independently of P, the result being an automaton R which is the compiled form of the rule:

(16) $R = A^* \setminus \bigl( A^* R_1 (R_2 A^* \setminus C_2 A^*) \mid (A^* R_1 \setminus A^* C_1) R_2 A^* \bigr)$

The automaton of P is generated by INTEX. If only one rule is to be applied to P, the result is obtained by an automaton intersection between P and R:

(17) $P \cap R$

If several rules are to be applied together, the compiled forms of these rules are submitted to automaton intersection and the result is the compiled form of the grammar. It is applied to a text P by (17).

Conclusions and perspectives

Approximately 70 ELAG rules for French have been written and tested on texts, some of them in imitation of grammars by Max Silberztein (1993) and Maurice Gross (1997). The examination of residual lexical ambiguity in the output of these tests shows that existing rules can be extended and numerous new rules could be written and tested. In fact, the possibilities of construction of lexical disambiguation grammars are still largely unexplored. In their present state, the conventions for variable tags do not take into account subcategories, e.g. +pr in <N+pr> for proper nouns, and this restriction is felt as an obstacle to the design of efficient rules. An extension of the set of variable tags is now indispensable for writers to design more elaborate ELAG rules. Finally, ELAG is not quick enough to process big texts with big grammars. A time optimization of the program of application of rules to texts is under study.

Authors' addresses

Éric Laporte
IGM, Université de Marne-la-Vallée - CNRS
5, bd Descartes
F-77454 Marne-la-Vallée CEDEX 2
France
eric.laporte@univ-mlv.fr

Anne Monceaux
Aérospatiale
12, rue Pasteur
B.P. 76
F-92152 Suresnes CEDEX
France
anne.monceaux@aeromatra.com

ACKNOWLEDGMENTS

The implementation of ELAG came after a few years of reflection and work on resolution of lexical ambiguity. Éric Laporte became interested in this question in 1992. He designed a first version of the formalism in 1996 (with one "then" part). Anne Monceaux and he wrote the first rules in 1997. The software was implemented in 1997 by Éric Laporte and his

undergraduate class of computer science at the University of Rheims, in compatibility with INTEX for NextStep. The second version of the formalism (with n "then" parts) was designed in 1998 by Éric Laporte and Ali Issa. They adapted the software to INTEX for Windows and to the second version of the formalism in 1999. This work was partially financed by the European Union project Copernicus 621 Gramlex. We thank all those that showed their enthusiasm for this project and those that contributed to it, even without any direct participation.

REFERENCES

Benello, J., A.W. Mackie, J.A. Anderson. 1989. Syntactic category disambiguation with neural networks. Computer Speech and Language 3:203-217.
Brill, Eric. 1992. A simple rule-based part-of-speech tagger. In Proceedings of the 3d Conference on Applied Natural Language Processing, Trento, ACL, pp. 152-155.
Cutting, D., J. Kupiec, J. Pedersen, P. Sibun. 1992. A practical part-of-speech tagger. In Proceedings of the 3d Conference on Applied Natural Language Processing, Trento, ACL, pp. 133-140.
Garside, R. 1987. The CLAWS word-tagging system. In The Computational Analysis of English, Garside and Sampson (eds.), London/New York: Longman.
Gross, Maurice. 1997. The construction of local grammars. In Finite-State Language Processing, E. Roche and Y. Schabès (eds.), Cambridge, Mass./London, England: MIT Press, pp. 329-354.
Hindle, D. 1989. Acquiring disambiguation rules from text. In Proceedings of the 27th Annual Meeting of the ACL, pp. 118-125.
Jelinek, F. 1985. Self-organized language modelling for speech recognition. In Impact of Processing Techniques on Communication, J. Skwirzinski (ed.).
Laval, Philippe. 1995. Un système simple de levée des homographies. Lingvisticae Investigationes XIX(1):97-105.
Marshall, I. 1983. Choice of grammatical word-class without global syntactic analysis: Tagging words in the LOB Corpus. Computers and the Humanities 17:139-150.
Milne, R. 1986. Resolving lexical ambiguity in a deterministic parser. Computational Linguistics 12(1):1-12.
Paulussen, H., W. Martin. 1992. Dilemma-2: a lemmatizer-tagger for medical abstracts. In Proceedings of the 3d Conference on Applied Natural Language Processing, Trento, ACL, pp. 141-146.
Roche, Emmanuel. 1992. Text disambiguation by finite-state automata, an algorithm and experiments on corpora. In Proceedings of COLING-92, Nantes.
Roche, Emmanuel, Yves Schabès. 1995. Deterministic part-of-speech tagging with finite-state transducers. Computational Linguistics 21(2):227-253.
Schmid, H. 1994. Part-of-speech tagging with neural networks. In Proceedings of COLING-94, Kyoto.
Silberztein, Max. 1991. A new approach to tagging: the use of a large-coverage electronic dictionary. Applied Computer Translation 1(4):23-39.
Silberztein, Max. 1993. Dictionnaires électroniques et analyse automatique de textes. Le système INTEX. Paris: Masson, 233 p.
Vosse, T. 1992. Detecting and correcting morpho-syntactic errors in real texts. In Proceedings of the 3d Conference on Applied Natural Language Processing, Trento, ACL, pp. 111-118.
Voutilainen, A. 1994. Morphological disambiguation. In Constraint Grammar: a Language-Independent System for Parsing Unrestricted Text, F. Karlsson, A. Voutilainen, J. Heikkilä, A. Anttila (eds.), Berlin/New York: Mouton-de Gruyter.