by multi-criteria analysis
Rekkal kahina
Institute of Computer Science, University of Bechar, Bechar 08000, Algeria
Telephone number 0668953970
rekkal@yahoo.fr

ABBAS Moncef
USTHB, Faculty of Mathematics, LAID3 Laboratory, Dépt. RO, Algeria
moncef_abbas@yahoo.com

HOCEINI Youcef
Institute of Computer Science, University of Bechar, Bechar 08000, Algeria
Telephone number 0775250017
y_hoceini@yahoo.fr
ABSTRACT

NLP still faces a recurrent problem: ambiguity, which occurs at almost all levels of processing (morphological, syntactic and semantic) and generates several plausible solutions. The Arabic lexicon contains homographic words which, even uninflected, can have different pronunciations, different meanings and, usually, different grammatical categories. This is where our contribution lies: we work on a new class of tagger whose disambiguation module resolves this homography by selecting a suitable solution from a set of plausible scenarios, using the approach and techniques of multi-criteria analysis (MCA). The method used in this article is PROMETHEE (Preference Ranking Organization Method for Enrichment Evaluations).

Keywords: PROMETHEE, Multi-criteria analysis, Morphosyntactic ambiguity, Disambiguation.

I. INTRODUCTION

Language processing is concerned with the computer processing of linguistic material: spelling and grammar, text analysis, text generation, machine translation, etc.

Since the early work in Natural Language Processing (NLP), various research directions have been pursued. One can distinguish in particular numerical approaches based on probability and statistics, and syntactic approaches based on formal language theory. The scope of research also ranges from the detailed analysis of sentences to more global approaches that treat a text as a whole.

Any NLP application requires a morphological analysis step to identify, for each word, its class and its various morphosyntactic features.

While some NLP systems currently seem satisfactory for some Latin-script languages, the representation, input, editing and processing of non-Latin languages such as Arabic remain largely unexplored. In addition, most of the models developed to formalize subsets of natural language at the morphosyntactic level are not configurable; often dedicated to Indo-European languages, they are difficult to adapt to other language families such as the Semitic languages.

The work presented in this paper falls within this context, specifically the morphosyntactic analysis of unvowelized Arabic. Our contribution to the field of natural language processing is to design and implement a morphosyntactic analyzer for unvowelized Arabic that handles the various cases of Arabic ambiguity.

Our disambiguator is based on a multi-criteria decision approach that ranks the disambiguation scenarios in order to bring out the best one. This approach has the advantage of eliminating the dominated scenarios and ranking the remaining ones according to different evaluation criteria.

II. MORPHOLOGICAL AMBIGUITIES

Like all Semitic languages, Arabic is characterized by a complex morphology and a rich vocabulary, which makes automatic analysis difficult. In what follows, we present some ambiguities that arise in the automatic processing of Arabic at the morphological level.

A. Derivational ambiguities

Most Arabic words are derived from triliteral or quadriliteral roots. An Arabic word is not the result of a simple concatenation of morphemes, as is the case in English (e.g. unfailingly = un + fail + ing + ly) (Beesley, 1996); rather, a word is obtained from a root, a combination of vowels, prefixes, infixes and suffixes, and a morphological pattern.

Example: The word "يتأثرون" (they are influenced) is the combination of the root "أثر" (he influenced), the prefix "يت", the infix "" and the suffix "ون".

B. Agglutination ambiguities

Unlike in Latin-script languages, in Arabic the articles, prepositions, pronouns, etc. attach to the adjectives, nouns, verbs and particles to which they relate. Compared with French, a single Arabic word can sometimes correspond to a whole French phrase (Souissi, 1997).

Example: The Arabic word "أتتذكروننا" corresponds in French to the phrase « Est-ce que vous vous souvenez de nous ? » ("Do you remember us?").

This feature generates morphological ambiguity during analysis. Indeed, it is sometimes difficult to distinguish between a proclitic or enclitic and an original character of the word. For example, the character "و" in the word "وصل" (he arrived) is an original character, while in the word "وفتح" (and he opened) it is a proclitic.
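The clitic-versus-root ambiguity described for agglutination can be made concrete with a small sketch. The following Python fragment enumerates the candidate segmentations of a word; the proclitic list and the four-word lexicon are toy assumptions chosen for illustration, not the resources used by the authors.

```python
# Hypothetical toy resources, for illustration only.
PROCLITICS = ("و", "ف")                  # "and", "then"
LEXICON = {"وصل", "فتح", "صل", "فصل"}    # toy list of known stems

def candidate_segmentations(word, lexicon=LEXICON, proclitics=PROCLITICS):
    """Return every (proclitic, stem) reading licensed by the toy lexicon."""
    candidates = []
    # Reading 1: the whole word is a stem, so its first letter is original.
    if word in lexicon:
        candidates.append(("", word))
    # Reading 2: the first letter is a proclitic attached to a known stem.
    for clitic in proclitics:
        stem = word[len(clitic):]
        if word.startswith(clitic) and stem in lexicon:
            candidates.append((clitic, stem))
    return candidates

for w in ("وصل", "وفتح", "فصل"):
    print(w, candidate_segmentations(w))
```

A word that receives more than one reading is exactly the kind of case a disambiguation module has to resolve.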
C. Ambiguities due to annexation

In Arabic, as in other languages such as French, there are compound words whose meaning differs from that of their individual words taken separately. In French, for example, we find the compound word "pomme de terre". In Arabic, compound words are very limited; examples are بين بين (in between), قوس قزح (rainbow), عباد الشمس, etc.

Example: Consider the phrase "رأيت بنات الجار" (I saw the neighbor's girls). The word "بنات" (girls) is an indefinite feminine plural noun, while the word "الجار" (the neighbor) is a definite masculine singular noun. The analysis in this case gives different morphological features for these two words, which can play a role in the sentence only if they are combined into a single morphological unit (a word group). The morphological features of the group "بنات الجار" will be those of its first word, "بنات".

D. Ambiguities due to the absence of vowelization

To describe the problem of vowelization, we use the definition of Joseph Dichy:

"Standard Arabic script does not note short vowels, consonant gemination, or the case markers composed of a short vowel followed, for indefinite nouns and adjectives, by a consonant ("ن", n, tanwin), etc. Such writing is called 'unvowelized'. The vowelization signs, which, when they are written, are realized as diacritics placed above or below the letters, appear in certain religious texts (Qur'an or hadith) and literary texts (classical poetry in particular): these are said to be published in vowelized spelling. Two practices are also distinguished: fully vowelized writing and partially vowelized writing. In careful editions, partial vowelization serves to resolve certain first-reading ambiguities. It relies on a tradition and has no systematic codification." [Dichy, 1997].

The automatic analysis of Arabic faces the formidable problem of the absence of vowels (diacritics) in texts. This causes many cases of lexical ambiguity in Arabic, given the polysemous nature of unvowelized words. The problem of multiple vowelization, although much more common in Arabic, can be compared to that of the multiple accentuation of unaccented French words.

To illustrate this, consider an unaccented French word such as "peche". This word can be interpreted as "pêche" (feminine noun), "péché" (masculine noun, "sin", and also a past participle) and "pêche" (verb, present indicative, active voice, 3rd person singular, masculine or feminine). Similarly in Arabic, the unvowelized word "فصل", taken out of context, can be:

- a past tense verb conjugated in the third person singular, "فصل" (he dismissed);
- a masculine noun, "فصل" (chapter, season);
- the coordinating particle "ف" (then) + the verb "صل" (connect) in the imperative, second person masculine singular.

In addition, many Arabic words are homographic: they have the same orthographic form although their pronunciation differs. This homography, when combined with other phenomena (absence of vowelization, inflectional and agglutinative morphology, etc.), leads to a fairly high rate of ambiguity. Among the many factors contributing to this problem, we mention the following:

• The Arabic lexicon contains homographic words which, even uninflected, may have different pronunciations, different meanings and, generally, different grammatical categories.
ذهب (dhb)
ذهب (Dahab - gold)
ذهب (Dahaba - he went)

• During verb inflection, successive morphological and orthographic operations (such as character deletion or assimilation) frequently produce homographic inflected forms that may belong to two or more lemmas. The following example shows a simple verbal form that can be interpreted as belonging to five lemmas.
يعد (y‛d)
يُعِد (yu‛id, from أعاد áa‛āda - he redoes)
يَعُد (ya‛ud, from عاد ‛āda - he returns)
يَعِد (ya‛id, from وعد wa‛ada - he promises)
يَعُدّ (ya‛udd, from عد ‛adda - he counts)
يُعِدّ (yu‛idd, from أعد áa‛adda - he prepares)

• Some lemmas differ only by gemination, marked by the shadda, which is not explicitly written. The two forms below differ only in the doubling of the middle consonant.
درس (drs)
دَرَسَ (darasa - he studied)
دَرَّسَ (darrasa - he taught)

• The inflection of a verb can produce a form that is homographic with a nominal lemma.
أسد (ásd)
أسُدّ (áasuddu - I block)
أسد (áasad - a lion)

• Clitics can accidentally generate two homographic forms.
علمي (‛lmy)
علمي (‛ilmiyy - scientific)
علم + ي (‛ilm + y - my knowledge)

• The example below shows the ambiguity caused by conjugating the same verb in the perfective active voice, the perfective passive voice and the imperative.
أرسل (ársl)
أرسَل (áarsala - he sent)
أُرسِل (áursila - was sent)
أَرسِل (áarsil - send [imperative])

• The following two forms are homographic, although the first is conjugated in the 3rd person feminine and the second in the 2nd person masculine.
تكتب (tktb)
تَكْتُبُ (taktubu - she writes)
تَكْتُبُ (taktubu - you write [masculine])

• Similarly, the dual form is always confused with the regular plural form in the accusative and genitive.
أمريكيين (ámrykyyn)
أمريكيَّيْنِ (áamriykiyyayni - American [dual])
أمريكيِّينَ (áamriykiyyiyna - American [plural])

III. DISAMBIGUATION MODULE

Disambiguation is a crucial step in the morphosyntactic tagging process: if a word is mislabeled at this level of processing, the rules of grammar apply poorly or not at all. However, the disambiguation phase is not always necessary for the tagging process to run properly. The disambiguation module comes into play in one case only: when the token (word) receives more than one label (more than one piece of morphosyntactic information), which creates a situation of confusion or ambiguity.

Figure 1. Condition of use of the disambiguation module.

A. Approaches to disambiguation

Most existing taggers are classified according to their mode of disambiguation. In the literature, approaches to disambiguation fall into two categories, and each category includes one or more techniques for resolving morphological ambiguity. The figure below gives an overview of the different approaches and their associated techniques:

Figure 2. Representation of the different disambiguation approaches and techniques.

B. An approach to disambiguation based on multi-criteria analysis

The main objective of this study is to propose a new approach to disambiguation that plays the same role as the two previous approaches but relies on other techniques. For this, we chose a method based on mathematical principles, namely multi-criteria analysis.

C. Why multi-criteria analysis?

NLP often involves decision-making that corresponds to a classification and a sequence of choices; here the tools and techniques of multi-criteria analysis prove promising and effective.

In most decision-making processes, however complex and conflictual they may be, it is often possible to bring out a number of axes of meaning common to the different actors, around which they justify and revise their preferences. Building different criteria around these axes is an attempt to model what may appear to be the stable part of the various actors' perception of the problem.

Multi-criteria models rest on various assumptions, of which we retain the following [9] [10]:

- The fact that the criteria are multiple implies the search for a compromise, an acceptable solution, as opposed to a rigidly defined optimal solution.
- The fact that the selected algorithms are flexible enough to allow many inexpensive iterations also addresses the non-simultaneity of information.

For our morphological ambiguity problem, the tagger can be in a situation similar to that of the decision maker: it may have multiple labels for a single recognized lexical unit and must choose one from this set.

Thus in both cases, whether for the decision maker or the tagger, we face a decision problem whose aim is to choose the right alternative (an action or a label) from a set (of actions or labels), which we hereafter call the set of potential actions or labels, on the basis of two parameters: the criteria and the resolution method to apply.

It is this similarity in the definition of the two problems (decision making and morphological ambiguity) that led us to opt for this mathematical approach, namely multi-criteria analysis.

D. Formulation of a problem in MCA

Let X = {x1, .., xn} be the set of all disambiguation scenarios. These scenarios are distinct, finite in number, and constitute the entire set of possible solutions (labels).

To choose the best scenario in X, we use a set F = {f1, .., fq} that forms a coherent family of criteria. In order to judge each disambiguation scenario on each criterion, we define an evaluation function as follows:

fj : X → ℝ, j = 1, …, q
x ↦ fj(x)

where fj(x) represents the evaluation of scenario x on criterion fj.

Each of these functions is to be maximized (or minimized), depending on the type of criterion used.

Ideal scenario: A scenario is ideal if it is the best solution on all criteria.

Dominance relation between scenarios: we say that scenario x1 dominates scenario x2 if and only if fj(x1) ≤ fj(x2) for each j, with at least one strict inequality (this form applies to criteria to be minimized; the inequalities are reversed for criteria to be maximized).

Efficient scenario: A scenario x in X is said to be efficient if and only if no other scenario in X dominates it. The set of efficient scenarios is regarded as the set of the most interesting solutions.
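The dominance relation and the notion of efficient scenario can be illustrated with a short sketch. It assumes that every criterion is to be maximized, so the inequalities are written with ≥; the scenario names and scores are hypothetical values chosen only to show the filtering.

```python
def dominates(a, b):
    """True if evaluation vector a dominates b when all criteria are maximized."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better

def efficient_scenarios(evaluations):
    """Keep only the scenarios that no other scenario dominates."""
    return {
        name: scores
        for name, scores in evaluations.items()
        if not any(
            dominates(other, scores)
            for other_name, other in evaluations.items()
            if other_name != name
        )
    }

# Hypothetical evaluations on three criteria (higher is better).
evaluations = {
    "x1": (3, 5, 2),
    "x2": (2, 5, 1),  # dominated by x1
    "x3": (4, 1, 2),  # incomparable with x1
}
print(sorted(efficient_scenarios(evaluations)))
```

Only the efficient scenarios need to be passed on to the ranking step, which is the reduction of dominated scenarios mentioned in the introduction.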
Ranking of scenarios: The objective is to determine the scenario with the best overall evaluation. Thus, for each scenario xi, we compute an overall evaluation score S(xi), which is the weighted sum of the evaluations of xi on all criteria.

... the parameters of each curve represent the indifference and/or preference thresholds.
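To show how indifference and preference thresholds shape a preference curve, here is a minimal sketch of two of the generalized criteria from the PROMETHEE literature: the linear-preference ("V-shape") function, which uses only a preference threshold p, and the linear function with an indifference area, which uses both q and p. The function names are illustrative, and d denotes the difference between the evaluations of two actions on one criterion.

```python
def v_shape(d, p):
    """Linear preference: grows from 0 to 1 as d goes from 0 to p."""
    if d <= 0:
        return 0.0
    return min(d / p, 1.0)

def linear_with_indifference(d, q, p):
    """No preference up to q, linear between q and p, full preference beyond p."""
    if d <= q:
        return 0.0
    if d >= p:
        return 1.0
    return (d - q) / (p - q)

# Example with an indifference threshold q = 1 and a preference threshold p = 20.
for d in (0.5, 10.5, 25):
    print(d, linear_with_indifference(d, q=1, p=20))
```

The resulting degree always lies between 0 and 1, so degrees from criteria measured on different scales can be combined.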
- PROMETHEE II consists of ranking the actions in decreasing order of the scores Φ(ai), defined as follows:

Φ(ai) = Φ+(ai) - Φ-(ai)

Thus, PROMETHEE II provides a total preorder.

IV. SOLUTION OVERVIEW AND APPLICATION

Our solution follows these steps:

• Step 1: Construction of the list of labels. This list is built directly after an ambiguous morphological analysis (multiple solutions), which generates the set E. Thus, for the sentence:

"لقد عمر ابن عمر طويلا"

After recognition (morphosyntactic analysis), the patterns generated for a word can differ; here we have the case of the ambiguous word "عمر", which can be a verb or a noun (Vtype عُمِّرَ, nameType1 عُمَرَ, nameType2 عُمَرُ, nameType3 عُمْر). In this case the set E is: E = {Vtype, nameType 1, nameType 2, nameType 3}.

• Step 2: In order to obtain a coherent family of criteria F, we propose three basic criteria for discriminating between the labeling scenarios: the criterion of vowel pattern matching inside the word, the frequency criterion, and the structure criterion.

To apply the PROMETHEE method, we must be able to compute degrees of preference. For this we need further parameters: the type of preference function associated with each criterion, any thresholds attached to it (preference, indifference, Gaussian), and the sense of evaluation.

All this information is summarized in the table below:

Table II. Types of preference functions and associated thresholds

Criterion                    Frequency                       Vowel matching    Structure
Function type                Pseudo-criterion 2              Pre-criterion     Pseudo-criterion 2
                             (form 5, Vincke, 89)
Indifference threshold (q)   1                               //                0
Preference threshold (p)     20                              10                1
Evaluation sense             Max                             Max               Max
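Once the aggregated preference degrees π(a, b) between pairs of scenarios are available, the PROMETHEE II flows follow mechanically. The sketch below is an illustration rather than the authors' implementation; it takes as input the preference degrees that the paper reports in Table V for the four scenarios S1 to S4, and the function name is an assumption.

```python
labels = ["S1", "S2", "S3", "S4"]

# pi[i][j]: degree to which scenario i is preferred to scenario j (Table V).
pi = [
    [0.00, 0.17, 0.10, 0.17],
    [0.00, 0.00, 0.005, 0.005],
    [0.27, 0.34, 0.00, 0.34],
    [0.00, 0.00, 0.00, 0.00],
]

def promethee_flows(pi):
    """Return (phi_plus, phi_minus, phi) for a square preference matrix."""
    n = len(pi)
    # Phi+: average preference of an action over the n-1 others (row sums).
    phi_plus = [sum(row) / (n - 1) for row in pi]
    # Phi-: average preference of the n-1 others over the action (column sums).
    phi_minus = [sum(pi[i][j] for i in range(n)) / (n - 1) for j in range(n)]
    # Net flow: the balance used by PROMETHEE II to rank the actions.
    phi = [plus - minus for plus, minus in zip(phi_plus, phi_minus)]
    return phi_plus, phi_minus, phi

phi_plus, phi_minus, phi = promethee_flows(pi)
ranking = sorted(labels, key=lambda s: phi[labels.index(s)], reverse=True)
for name, plus, minus, net in zip(labels, phi_plus, phi_minus, phi):
    print(name, round(plus, 3), round(minus, 3), round(net, 3))
print(ranking)
```

Rounded to the precision used in the paper, these values agree with the flows reported in Table VI, and sorting by net flow places S3 first.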
Table V. Degrees of preference

π     S1     S2     S3      S4
S1    0      0.17   0.10    0.17
S2    0      0      0.005   0.005
S3    0.27   0.34   0       0.34
S4    0      0      0       0

Calculation of the positive flow, negative flow and net flow for each action ai:

Table VI. Positive flow, negative flow, net flow

Scenario   Φ+(ai)   Φ-(ai)   Φ(ai)
S1         0.15     0.09     0.06
S2         0.003    0.17     -0.17
S3         0.32     0.04     0.28
S4         0        0.172    -0.172

• The positive flow Φ+ expresses how much the action outranks the (n-1) other actions, that is to say its power.
0 ≤ Φ+(ai) ≤ 1

• The negative flow Φ- expresses how much the action is outranked by the (n-1) other actions, that is to say its weakness.
0 ≤ Φ-(ai) ≤ 1

• The net flow expresses the balance between the positive and negative flows of the action. The greater it is, the better the action.
• Φ(ai) = Φ+(ai) - Φ-(ai) with -1 ≤ Φ(ai) ≤ 1

- Determine the total preorders and proceed to the simple ranking of the actions:
• The positive flow gives: S3 > S1 > S2 > S4, so S3 is the most powerful.
• The negative flow gives: S3 > S1 > S2 > S4, so S3 is the least weak.

Application of PROMETHEE II gives the following total preorder: S3 > S1, after eliminating the negative values.

According to this preorder, the scenario with the highest score is selected; in our case this is scenario 3, "فُعِّلَ". This pattern corresponds to the word عُمِّرَ.

V. DISCUSSION OF RESULTS

• To evaluate our disambiguator, we used an evaluation corpus containing unvowelized words. Our disambiguator is able to resolve the ambiguity of a word by applying the PROMETHEE method, but the result depends on the parameters assigned to each criterion (weight, sense of evaluation, and type of preference function).
• We therefore plan to work further on the disambiguator by defining more criteria, in order to better produce the right solution (label), and by working on several larger corpora.

Finally, we recommend PROMETHEE. Its algorithm is easy to use, it is non-compensatory and, above all, it is very versatile. Depending on the case, it can be replaced with one of the ELECTRE methods (Elimination and Choice Translating Reality), which are also very effective.

VI. CONCLUSION

The main conclusion of this work is that the multi-criteria approach can be considered as an alternative for classification or choice, possibly combined with one of the two existing approaches (stochastic or constraint-based), for addressing the problem of morphological ambiguity, which is a key step for the success or failure of an application in the automatic processing of Arabic.

We proposed in this paper a method for the morphosyntactic analysis and disambiguation of unvowelized Arabic texts. This method follows the multi-criteria decision approach, and the analyzer performs a comprehensive morphosyntactic analysis. It is based on segmentation, a fundamental step that recovers the words and associates each word with a set of morphosyntactic categories; if the cardinality of this set is greater than 1, we have a case of ambiguity, which can be resolved with a partial aggregation algorithm such as PROMETHEE.

The analyzer provides all possible morphosyntactic categories of the word; at the output of the disambiguation module we obtain a ranking and select the most suitable scenario.

References

[1] Abdelkader Hammami. Modélisation Techno-économique d'une chaine logistique dans une entreprise réseau, 2003.
[2] Beesley K. Arabic Finite-State Morphological Analysis and Generation, 1996.
[3] Brans J.P., Vincke Ph. A Preference Ranking Organization Method, 1985.
[4] Cheragui Mohamed Amine. Un modèle d'analyse multicritère de la levée de l'ambiguïté associé à un Tagger pour le Traitement Automatique de l'arabe, 2010.
[5] Dichy J. Pour une lexicomatique de l'arabe: l'unité lexicale de l'inventaire du mot. META - journal de traduction, Vol. 42, n° 2, pp. 291-306, 1997.
[6] Hoceini Youssef & Abbas Moncef. Méthodologie Multicritère de Désambiguïsation Morphosyntaxique de la Langue Arabe, 2009.
[7] Nouha Chaaben & Lamia Hadrich Belguith. Analyse et désambiguïsation morphologiques de textes arabes non voyellés, 2006.
[8] Roy B., Bouyssou D. Aide multicritère à la décision : méthodes et cas. Economica, 1993.
[9] Roy B. Quête de l'optimum et aide à la décision. Cahier du Lamsade n° 167, Université Paris-Dauphine, 21 pp., 2000.
[10] Roy B. Science de la décision ou science de l'aide à la décision ? Revue Internationale de systémique, Vol. 6, pp. 497-529, 1992.
[11] Slim Mesfar. Analyse morphosyntaxique automatique et reconnaissance des entités nommées en arabe standard, 2008.
[12] Souissi E. Etiquetage grammatical de l'Arabe voyellé ou non. Thèse de doctorat de l'université de Paris III, octobre 1997.
[13] Vincke Ph. L'Aide Multicritère à la Décision. Editions de l'Université de Bruxelles, Bruxelles, 1989.