
Int J Speech Technol (2016) 19:289–302

DOI 10.1007/s10772-015-9302-8

A hybrid Arabic POS tagging for simple and compound morphosyntactic tags

N. Ababou · A. Mazroui

Received: 27 February 2015 / Accepted: 12 August 2015 / Published online: 8 October 2015
© Springer Science+Business Media New York 2015

Abstract  The objective of this work is to develop a POS tagger for the Arabic language. The analyzer uses a very rich tag set that gives syntactic information about the proclitics attached to words. The study employs a probabilistic model and a morphological analyzer to identify the right tag in context. Most published research on probabilistic analysis uses only a training corpus to search for the probable tags of each word, and this sometimes affects performance. In this paper, we propose a method that also takes into account tags that are not present in the training data; these tags are proposed by the Alkhalil_Morpho_Sys analyzer (Bebah et al. 2011). We show that this consideration significantly increases the accuracy of the morphosyntactic analysis. In addition, the adopted tag set is very rich and contains compound tags that allow the proclitics attached to words to be analyzed.

Keywords  Part of speech tagging · Morphological analysis · Hidden Markov model · Smoothing · Training set · Testing set

Corresponding author: N. Ababou (nabilaababou@gmail.com); A. Mazroui (azze.mazroui@gmail.com)
Department of Mathematics and Computer Science, Faculty of Sciences, University Mohamed First, B-P 717, 60000 Oujda, Morocco

1 Introduction

Part-of-speech tagging is a crucial step for several applications in natural language processing (NLP): lemmatization, parsing, machine translation, speech synthesis, and so on. It consists in attributing to each word of a given sentence a tag providing certain information (the syntactic category, gender and number for nouns, the tense for verbs, etc.).

Many studies have been carried out in the field of POS tagging. The best known are based on Markov models (Viterbi 1967; El Jihad and Yousfi 2005), on rule-based symbolic systems, or on neural networks. Such systems reach error rates as low as 3–4 % for English and other languages (Huang et al. 2002), but this is far from being the case for Arabic (El-Jihad et al. 2011; Buckwalter 2002; Pasha et al. 2014). Indeed, in spite of the various studies devoted to Arabic, there are few free and generic tools for the language that could be useful to the research community.

The performance of a POS tagging system is mostly measured by the accuracy rate at the word level. It strongly depends on the nature and size of the tag set: the richer the tag set, the lower the tagger performance. Any comparison between POS tagging systems must therefore consider the structure of the tag set used and its ability to adequately represent the different morphosyntactic categories of Arabic words.

In addition, the high level of performance displayed by some POS taggers is somewhat misleading. These performances are partly due to the treatment of punctuation marks as tags, which can be verified by analyzing the error rate at the tag level rather than the overall error rate.

Our work focuses on the construction of a new POS tagging system for the Arabic language that meets the needs of many NLP applications. The construction is based on the morphological analyzer Alkhalil Morpho Sys (Bebah et al. 2011) and on hidden Markov models (HMMs). The learning and testing phases are performed using the Nemlar corpus (Attia et al. 2005).

Unlike other POS tagging systems, the majority of which are based on parts of the Arabic Treebank annotations derived from the annotations introduced for the English Treebank, the annotation and terminology we have adopted are inspired by the conventional grammatical analysis of the Arabic language.

In addition, most of these POS tagging systems do not take into account the proclitics attached to words and treat all words as simple elements. Thus, these systems give the same tag to the three words shown in Table 1.

Table 1 Example of three words having the same tag

Words | Gloss | Tag
ساعدا | They help | Verb + past
ساعداه | They help him | Verb + past
ساعده | He helps him | Verb + past

The POS tagger we developed uses annotations enriched with syntactic information about the clitics. For example, the phrases لكم، بهم، له، لها (lkm, bhm, lh, lhA; to you, with them, to him, to her) receive the tag "جار ومجرور" (jAr wa majrour; preposition + pronoun compound), and for the three words ساعدا، ساعداه، ساعده our system gives respectively the tags فعل ماض + فاعل (fEl mAD + fAEl, Verb_Past + subject), فعل ماض + مفعول به (fEl mAD + mfEwl bh, Verb_Past + object), and فعل ماض + فاعل + مفعول به (fEl mAD + fAEl + mfEwl bh, Verb_Past + subject + object). Likewise, our system determines the syntactic function of the proclitics, which is very useful for some NLP applications such as parsers.

The paper is organized as follows. Section 2 gives a glimpse of some previous work and recalls the different approaches used in the POS tagging field. Section 3 is devoted to an overview of the analyzer Alkhalil Morpho Sys used in the morphological phase of our system. Section 4 describes the adopted method, and Section 5 introduces the tag set used. The system evaluation is detailed in Section 6, and the paper ends with a conclusion.

2 Related works

POS taggers, whether or not they rely on a supervised machine learning technique, can be grouped into four main categories: rule-based systems, statistical systems, hybrid systems and neural systems. Note that unsupervised tagging does not require a pre-tagged corpus for the training phase.

2.1 Rule-based method

Such tagging is based on grammatical or morphological rules. It seeks to select the right tag for a word, or to define the possible transitions between tags (Al-Taani and Al-Rub 2009; Brill 1992; Thibeault 2004).

2.2 Statistical method

Statistical tagging (Ghoul 2011; Huang et al. 2002; Schmid 1994) uses frequencies and probabilities calculated from a training corpus. The best-known algorithms that have worked well for this type of POS tagging are dynamic programming algorithms, such as the Viterbi algorithm (Neuhoff 1975; Viterbi 1967).

2.3 Hybrid method

These methods combine the rule-based and statistical methods to allocate the correct tags to a given sequence of words (Nakagawa and Uchimoto 2007; Antony and Soman 2011).

2.4 Tagging based on neural networks

These methods have not produced good results and are therefore rarely used.

We present in the following a few POS tagging systems among the most recognized in the field.

2.5 StanfordPOS

This POS tagger was developed in 2003 for English (Toutanova et al. 2003). It was later extended to other languages (Arabic, Chinese, German, French,…). It is constantly improved and freely distributed on the Stanford University website. Its principle is based on the application of a bidirectional Markov model with maximum entropy, and its tag set uses the Treebank annotations.

2.6 Arabic POS Tagger Library

This tagger was recently developed by the Qatar Computing Research Institute (QCRI). It is based on the CRF++ model (conditional random fields) (Darwish et al. 2014).

2.7 ASVM

This is a free POS tagger developed in Perl by Diab's team (Diab et al. 2004). It is an adaptation of the English system "Yamcha" to the Arabic language and was trained on the annotated Treebank corpus. The system is based on the support vector machine model and uses 24 tags. Mona Diab's team later released an improved version of this tagger under the name AMIRA 2.0 (Diab 2009).

2.8 APT Khoja

This is an adaptation of the English BNC tagger to Arabic. It combines statistical data and rules to determine all the morphological features of a lexical unit. The analyzer was trained on a corpus containing 50,000 words and uses 131 tags (Khoja 2001).

2.9 Sakhr analyzer

This is a commercial analyzer developed by Chalabi (2004). It can treat both modern and classical Arabic. Its principle is to find the basic form of a word by deleting all prefixes and suffixes and then to give the morphological features of the word. According to the website of the Sakhr Company, the analyzer achieves an accuracy of 90 %.

2.10 Madamira

Madamira is the combination of the two tools MADA and AMIRA. It also includes the functionality of MADA-ARZ and can achieve the following tasks: lemmatization, diacritization, glossing, part-of-speech tagging, stemming, tokenization, morphological analysis and full morphological disambiguation. Its use requires LDC's Standard Arabic Morphological Analyzer (SAMA) version 3.1.

To evaluate our POS tagger, we chose to compare it with the most widely used system, StanfordPOS, and with the most recent one, the Arabic POS Tagger Library.

3 Alkhalil Morpho Sys analyzer

The morphological analyzer Alkhalil Morpho Sys is one of the best open-source morphological analyzers (Boudchiche et al. 2014; Altabba et al. 2010). It does not use the context of the word: for a given word, the analyzer determines all the possible morphological and syntactic tags. This information comprises:

● the possible vowelized forms of the word;
● its possible proclitics and enclitics;
● its possible roots (only if the word is a derivative noun or a derivative verb);
● its possible syntactic natures;
● its possible vowelized patterns (only if the word is a derivative noun or a derivative verb);
● its possible lemmas and their corresponding patterns;
● its possible stems and their corresponding patterns;
● its possible syntactic forms.

4 Method description

The developed system goes through two stages, as shown in Fig. 1.

Fig. 1 Analysis phases of a sentence. Morphological step: the unvoweled input sentence is segmented; the tags of each word are first searched in the training corpus and, if they are not found, requested from Alkhalil Morpho Sys; if the word is still not analyzed, its proclitic and enclitic are used, which yields the possible tags of each word. Statistical step: an HMM decoded with the Viterbi algorithm and an absolute discounting smoothing method selects a single tag for each word.

We give in the following a detailed description of these two phases.


4.1 First phase: search of possible tags for each word

For each word w of a sentence PH, we first identify all its possible tags by performing the following two steps:

First step: we use the training corpus extracted from the Nemlar corpus (Attia et al. 2005) to identify the possible tags of w.

Second step: we then use the morphological analyzer Alkhalil Morpho Sys to determine other possible tags of w.

The out-of-context tags of the word w are those obtained from these two steps.

In the case where the word w does not appear in the training set and is not analyzed by Alkhalil Morpho Sys, the system applies a specific treatment: it attributes to the word w the most likely tag based on the potential prefix and suffix that may be attached to it. For example, the word التعجيزية (Taâjezeya, incapacitating), which begins with "ال" (Al) and ends with "ية" (yati), receives the adjective tag.
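As an illustration, the sketch below implements this two-step candidate lookup with the prefix/suffix fallback. It assumes a dictionary `corpus_tags` built from the training part of the Nemlar corpus and a hypothetical wrapper `alkhalil_tags()` around the Alkhalil Morpho Sys analyzer; the fallback rules are simplified examples, not the exact rule set of the system.

```python
def candidate_tags(word, corpus_tags, alkhalil_tags):
    """Return the out-of-context candidate tags of `word`.

    corpus_tags:   dict mapping a word form to the set of tags observed
                   in the training corpus.
    alkhalil_tags: callable returning the set of tags proposed by the
                   Alkhalil Morpho Sys analyzer (hypothetical wrapper).
    """
    # Step 1: tags observed for this word in the training corpus.
    tags = set(corpus_tags.get(word, ()))
    # Step 2: tags proposed by the morphological analyzer.
    tags |= set(alkhalil_tags(word))
    if tags:
        return tags
    # Fallback for unknown, unanalyzed words: guess from the potential
    # prefix and suffix (illustrative rules only).
    if word.startswith("ال") and word.endswith("ية"):
        return {"DSifa"}   # definite adjective, as for التعجيزية
    if word.startswith("ال"):
        return {"DNo"}     # Det + noun
    return {"No"}          # default: noun
```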
4.2 Second phase: statistical analysis

After identifying the possible tags of each word of the sentence, the statistical step selects the tag that is the most likely in context. It is based on an HMM, on smoothing techniques and on the Viterbi algorithm (Neuhoff 1975; Viterbi 1967). We give below a brief overview of these three mathematical tools.

4.2.1 Hidden Markov models

An HMM is a statistical model that associates two dependent processes: the states of the first one are unobservable (hidden states) and those of the second are observable.

Formally, if O = {o_1, o_2, …, o_r} is a finite set of observations and E = {e_1, …, e_m} is a finite set of hidden states, a dual process (X_t, Y_t), t ≥ 1, is a first-order HMM if:

● (X_t) is a homogeneous Markov chain with values in the hidden state set E, where

  Pr(X_{t+1} = e_j | X_t = e_i, …, X_1 = e_h) = Pr(X_{t+1} = e_j | X_t = e_i) = a_{ij};

  a_{ij} is the transition probability from state e_i to state e_j.

● (Y_t) is an observable process taking values in the observation set O, where

  Pr(Y_t = o_k | X_t = e_i, Y_{t-1} = o_{k_{t-1}}, X_{t-1} = e_{i_{t-1}}, …, Y_1 = o_{k_1}, X_1 = e_{i_1}) = Pr(Y_t = o_k | X_t = e_i) = b_i(k);

  b_i(k) is the probability of observing o_k given the state e_i.

The information about the hidden states can be deduced from the observed data. Let S be an observed sentence consisting of the words w_1, w_2, …, w_n, and let E = {e_1, e_2, …, e_N} be the set of Arabic tags. In order to search for the most likely morphosyntactic tags in context, the words w_i of the sentence S are taken as the observations and the morphosyntactic tags as the hidden states. Our goal, given the sentence S = (w_1, w_2, …, w_n), is to find the most likely tag sequence (e_1*, e_2*, …, e_n*) satisfying

  (e_1*, …, e_n*) = argmax over e_1 ∈ E_1, …, e_n ∈ E_n of Pr(e_1, …, e_n | w_1, …, w_n),

where E_i is the set of possible tags of the word w_i obtained in the first phase.

4.2.2 Viterbi algorithm

To find the most likely sequence of tags, we assume that the two following Markov assumptions are satisfied:

H1 The sequence of word tags in a sentence is a homogeneous Markov chain:

  Pr(e_k | e_1, …, e_{k-1}) = Pr(e_k | e_{k-1}).

H2 The prediction of a word requires only the knowledge of its tag:

  Pr(w_k | e_k, w_{k-1}, e_{k-1}, …, w_1, e_1) = Pr(w_k | e_k).

We then use the Viterbi algorithm (Viterbi 1967), which is well suited for finding the optimal path. Indeed, let φ(t, e), for e ∈ E_t, be the maximum over all tag paths of length t − 1 of the probability that the first t − 1 words carry the tags of the path and the t-th word w_t carries the tag e, i.e.,

  φ(t, e) = max over e_1 ∈ E_1, …, e_{t-1} ∈ E_{t-1} of Pr(w_1, …, w_t | e_1, …, e_{t-1}, e) · Pr(e_1, …, e_{t-1}, e).

Then, using the two hypotheses (H1) and (H2), one can easily verify that

  φ(t, e) = max over e' ∈ E_{t-1} of [ φ(t − 1, e') · Pr(e | e') ] · Pr(w_t | e).

This recursion allows us to compute the values of φ recursively. In order to recover the optimal path, we use a function ψ that memorizes, at each time t, the hidden tag that realizes the maximum. It is defined by

  ψ(t, e) = argmax over e' ∈ E_{t-1} of φ(t − 1, e') · Pr(e | e').
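As a concrete illustration, here is a minimal sketch of this constrained Viterbi decoding in Python. It assumes the model is given as plain dictionaries: `trans[prev][cur]` for Pr(cur | prev), an `emit(tag, word)` function for Pr(word | tag) (for instance the smoothed b_i(t) of the next subsection), a `start[tag]` table for the sentence-initial probabilities (a detail the paper does not spell out), and `candidates[t]` for the per-word tag sets E_t produced by the first phase. The names are ours, not the paper's.

```python
import math

def viterbi(words, candidates, trans, emit, start):
    """Most likely tag sequence under the HMM, restricted to the candidate
    tag sets E_t produced by the first (morphological) phase."""
    floor = 1e-12  # avoid log(0) for unseen events

    # phi[t][tag]: best log-probability of a path ending with `tag` at position t
    phi = [{tag: math.log(max(start.get(tag, 0.0), floor))
                 + math.log(max(emit(tag, words[0]), floor))
            for tag in candidates[0]}]
    psi = [{}]  # psi[t][tag]: best previous tag (back-pointer)

    for t in range(1, len(words)):
        phi.append({})
        psi.append({})
        for cur in candidates[t]:
            e_log = math.log(max(emit(cur, words[t]), floor))
            best_prev, best_score = None, -math.inf
            for prev in candidates[t - 1]:
                score = (phi[t - 1][prev]
                         + math.log(max(trans.get(prev, {}).get(cur, 0.0), floor)))
                if score > best_score:
                    best_prev, best_score = prev, score
            phi[t][cur] = best_score + e_log
            psi[t][cur] = best_prev

    # Backtrack from the best final tag.
    last = max(phi[-1], key=phi[-1].get)
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return list(reversed(path))
```

Working in log space simply avoids numerical underflow on long sentences; restricting each maximization to the candidate sets E_t is what keeps the search tractable despite the 82-tag inventory.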

4.2.3 Smoothing methods

To identify the most likely tags of the words of a sentence, the parameters of the statistical model, i.e., the transition matrix A = (a_ij) and the emission matrix B = (b_i(t)), are estimated from the training corpus. The estimation method is based on maximum likelihood (Manning and Schütze 1999).

If w_t is the t-th word of the sentence S and (e_i, e_j) is a pair of tags, we denote by:

c(e_i): the number of occurrences of the hidden state e_i in the training corpus;
c(e_i, e_j): the number of occurrences of the transition from the hidden state e_i to the hidden state e_j in the training corpus;
c(w_t, e_i): the number of times the word w_t appears with the tag e_i in the training corpus.

The coefficients a_ij and b_i(t) are then estimated by the following equations:

  a_ij = c(e_i, e_j) / c(e_i),   1 ≤ i ≤ N, 1 ≤ j ≤ N,   (1)

  b_i(t) = c(w_t, e_i) / c(e_i),   1 ≤ t ≤ n, 1 ≤ i ≤ N.   (2)

Given that in some cases the correct tag e_i of the word w_t is not included in the training data [i.e., c(w_t, e_i) = 0], the Viterbi algorithm can miss the optimal path. To avoid this problem, we use a smoothing technique to estimate the parameters b_i(t) of the emission matrix B, which consists in a redistribution of the emission probabilities. We tested several smoothing methods, and the best performances are achieved with the absolute discounting method (Ney et al. 1994).

For a given word w_t and a possible tag e_i obtained in the first phase, the estimate of b_i(t) = P(w_t | e_i) is given by

  b_i(t) = (c(w_t, e_i) − d) / c(e_i)      if c(w_t, e_i) ≠ 0,
  b_i(t) = (Nb · d) / (c(e_i) · z)         otherwise,

where Nb is the number of words that appear in the corpus with the tag e_i, z is the number of words that are not annotated with the tag e_i in the corpus but for which the Alkhalil analyzer generates this tag, and d is a selected value between zero and one.
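A minimal sketch of this estimation is given below, assuming the training data is available as a list of sentences, each a list of (word, tag) pairs; the function names and data layout are ours. It computes the counts, the maximum-likelihood transition probability of Eq. (1) and the absolutely discounted emission probability defined above.

```python
from collections import Counter

def train_counts(tagged_sentences):
    """Collect c(e), c(e_i, e_j) and c(w, e) from a training corpus given
    as a list of sentences, each a list of (word, tag) pairs."""
    c_tag, c_trans, c_emit = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            c_tag[tag] += 1
            c_emit[(word, tag)] += 1
        for (_, prev), (_, cur) in zip(sent, sent[1:]):
            c_trans[(prev, cur)] += 1
    return c_tag, c_trans, c_emit

def transition_prob(prev, cur, c_tag, c_trans):
    """Maximum-likelihood estimate a_ij = c(e_i, e_j) / c(e_i) of Eq. (1)."""
    return c_trans[(prev, cur)] / c_tag[prev] if c_tag[prev] else 0.0

def emission_prob(word, tag, c_tag, c_emit, nb, z, d=0.5):
    """Absolutely discounted estimate of b_i(t) = P(w_t | e_i).

    nb: number of words seen with `tag` in the corpus (Nb above);
    z:  number of words not annotated with `tag` in the corpus but for
        which the Alkhalil analyzer proposes `tag`;
    d:  discount constant chosen between 0 and 1 (0.5 is only a placeholder).
    """
    count = c_emit[(word, tag)]
    if count > 0:
        return (count - d) / c_tag[tag]
    return (nb * d) / (c_tag[tag] * z) if z else 0.0
```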
5 Tag set

Our system uses the Alkhalil Morpho Sys analyzer in the first phase and the Nemlar corpus in the training and testing phases. Since the tag set used in the Nemlar corpus is not identical to that used by the Alkhalil analyzer, we chose a common set of tags between them.

Our system has 27 basic tags, to which information about prefixes and suffixes is added. Table 2 below shows the basic tags. By combining these basic tags with the information about the attached prefixes and suffixes, we obtain a set of 82 tags (see the Appendix). We give in Table 3 below some examples of compound tags.


Table 2 The basic tags of our system

Tags | Tag symbol | Tag in Arabic | Example
Past tense | VePa | فعل ماض | قال (qAla, said)
Present tense | VeP | فعل مضارع | تدور (tdwr, spin)
Imperative verb | VeIm | فعل أمر | نم (nm, sleep)
Det + noun | DNo | اسم معرف | العشب (AlEo$b, grass)
Noun | No | اسم | لقاء (lq', meeting)
Det + adjective | DSifa | صفة معرفة | الطيبة (AlTybap, nice)
Adjective | Sifa | صفة | حامل (HAml, pregnant)
Proper noun | Alam | علم | محمد (mHmd)
Proper noun part 1 | Alam1 | الجزء الأول من اسم علم | امرؤ (Amru&)
Proper noun part 2 | Alam2 | الجزء الثاني من اسم علم | القيس (Alqys)
Demonstrative pronoun | NoDP | اسم إشارة | ذلك (*lk, that)
Relative pronoun | NoRP | اسم موصول | التي (Alty, which)
Comparative | Compa | اسم تفضيل | أشعث ([$Et, disheveled)
Det + comparative | DCompa | اسم تفضيل معرف | الأحدب (Al[Hdb, humpback)
Time adverb or location adverb | Dz | ظرف زمان أو مكان | تحت (tHt, under)
Preposition | Pr | حرف جر | عن (En, about)
Coordinating conjunction | Co | حرف عطف | أو، بل، لكن ([w, bl, lkn; or, even, but)
Particles without syntactic effect | Ns | حرف غير عامل | أي، قد، أما ([y, qd, [mA; may, either)
Vocative particle | Interj | حرف نداء | يا (yA, O)
Particles that put the imperfect verb in the subjunctive | PaNs | حرف ناصب | أن ([n, that)
Prohibition or negation particle | La | لا الناهية أو النافية | لا (lA, do not)
Exception particle | Ex | أداة استثناء | إلا (except)
Particles that put the subject of the nominal sentence in the second Arabic syntactic case | PNa | حرف ناسخ | أن ([nn)
Det + abbreviation | DAbrv | مختصر معرف | الآر (R)
Abbreviation | Abrv | مختصر | بي (B)
Particles that put the imperfect verb in the jussive | CJ | حرف جزم | لم (lm, not)
Preposition + pronoun compound | PCo | جار ومجرور | له (lh, to him)

Table 3 Examples of compound tags

Tags | Tag symbol | Tag in Arabic | Examples
Verb-past-tense + subject | VePa.S | فعل ماض + فاعل | قالوا (qAlw, they say)
Verb-past-tense + object | VePa.O | فعل ماض + مفعول به | ضربه (Drbh, he hits it)
Verb-past-tense + subject + object | VePa.S.O | فعل ماض + فاعل + مفعول به | خاطبته (xTbth, I address him)
Prefix + Det + adjective | P.DSifa | حرف عطف + صفة | والإسلامية (and Islamic)
Noun + possessive construction | No.S | اسم + مضاف إليه | إعلاناتها ([ElAnth, her announcements)
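The compound symbols follow a simple naming convention: an optional "P." in front of the basic symbol marks an attached proclitic, and suffixes such as ".S" (subject or possessive construction) and ".O" (object) mark the function of an attached enclitic. A small sketch of how such a symbol could be assembled is shown below; the helper is purely illustrative and is not part of the described system.

```python
def compound_tag(base, proclitic=False, subject=False, obj=False):
    """Assemble a compound tag symbol from a basic tag of Table 2.

    Illustrative only, e.g. compound_tag("VePa", subject=True, obj=True)
    gives "VePa.S.O" (the tag of خاطبته in Table 3) and
    compound_tag("DSifa", proclitic=True) gives "P.DSifa".
    """
    symbol = ("P." if proclitic else "") + base
    if subject:
        symbol += ".S"
    if obj:
        symbol += ".O"
    return symbol
```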

The use of syntactic features with a Markov model gives good results because of the crucial role of the transition probabilities. Indeed, a verb can be followed by a verb, a noun, a preposition + noun or an adjective, and it is natural that the transition (verb → noun) dominates the other transitions (verb → verb, verb → adjective, verb → preposition + noun). However, with the syntactic features, the probability of the transition (verb → noun) decreases, whereas the other transitions have more chance to appear. For example, the sentence أحسا بحر الشمس ([Has bHr Al$ms; they feel the sun's heat) can be analyzed, depending on the tag set, as follows:

● Tag set without syntactic features: the first word is tagged /V, and the candidate transitions towards the second word are T1 (→ /V), T2 (→ /Noun) and T3 (→ /pre+noun).
● Tag set with syntactic features: the first word is tagged /V+Sjt (verb + subject), and the candidate transitions are T'1 (→ /V), T'2 (→ /Noun) and T'3 (→ /pre+noun).

With the tag set that does not use syntactic features, the probability of the transition T2 (verb → noun) is much greater than the other probabilities, so the system gives the noun tag to the word بحر (bHr; heat), which is not correct. With the second tag set, which includes syntactic features, the probability of the transition T'2 (verb + subject → noun) is penalized, which gives the other transitions more chance to appear.

6 Experiment and evaluation

As reported previously, we compare our system to the Arabic POS Tagger Library and to StanfordPOS. We aimed to include the Madamira system in the comparison, but we could not because it uses the SAMA analyzer, which requires LDC's SAMA version 3.1. To make these comparisons, we established correspondences between the tag sets of these systems and evaluated them on the same test corpus (10 % of the Nemlar corpus). As it is sometimes impossible to establish this correspondence for some tags, we only compared the part of the test corpus for which the tags provided by the three systems have correspondences between them. This part represents 88.68 % of the test corpus.

6.1 Nemlar corpus (500,000 words) (Atiyya et al. 2005)

Its database was produced and annotated by RDI, Egypt, for the Nemlar Consortium. The Nemlar Arabic written corpus is owned and copyrighted by the Nemlar Consortium. The sampling parameters taken into consideration are:

● Time span: mostly recent, i.e., from the late 1990s until 2005.
● Standard Arabic only is considered, as it is the most commonly accepted variant among native Arabic speakers and because its regularity can be consistently modeled by our tools.
● Multiple miscellaneous domains are considered: political, scientific, general news.

To evaluate our POS tagger and compare it with the other systems, we use the standard evaluation metrics: the accuracy, and the precision, recall and F-measure related to each tag (Al Shamsi and Guessoum 2006). The local metrics are defined for a given tag e by

  Recall(e) = t_e / r_e,
  Precision(e) = t_e / c_e,

where t_e is the number of words correctly assigned to the tag e in the test corpus, r_e is the number of occurrences of the tag e in the test corpus, and c_e is the number of words assigned to the tag e in the test corpus. The F-measure related to the tag e is given by

  F-measure(e) = 2 · Recall(e) · Precision(e) / (Recall(e) + Precision(e)).

We also use the global metrics defined by

  Accuracy = number of true results / number of true and false results,
  Precision = (1/n) · Σ_{i=1..n} Precision(tag(i)),
  Recall = (1/n) · Σ_{i=1..n} Recall(tag(i)),
  F-measure = 2 · Recall · Precision / (Recall + Precision),

where n is the total number of tags.
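A short sketch of these metrics is given below, assuming the gold and predicted tags are available as two parallel lists over the test corpus; the per-tag counts t_e, r_e and c_e follow the definitions above, and the global precision, recall and F-measure are averaged over the tags as in the formulas.

```python
from collections import Counter

def evaluate(gold, predicted):
    """Per-tag recall/precision/F-measure plus the global metrics.

    gold, predicted: parallel lists of tags, one entry per word of the
    test corpus.
    """
    r = Counter(gold)                                          # r_e
    c = Counter(predicted)                                     # c_e
    t = Counter(g for g, p in zip(gold, predicted) if g == p)  # t_e

    per_tag = {}
    for e in r:
        recall = t[e] / r[e]
        precision = t[e] / c[e] if c[e] else 0.0
        f = (2 * recall * precision / (recall + precision)) if (recall + precision) else 0.0
        per_tag[e] = (recall, precision, f)

    n = len(per_tag)
    macro_recall = sum(v[0] for v in per_tag.values()) / n
    macro_precision = sum(v[1] for v in per_tag.values()) / n
    denom = macro_recall + macro_precision
    macro_f = 2 * macro_recall * macro_precision / denom if denom else 0.0
    accuracy = sum(t.values()) / len(gold)
    return per_tag, accuracy, macro_precision, macro_recall, macro_f
```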
6.2 Evaluation of our system

We give in Table 4 the values of the recall, the precision and the F-measure for each tag.

To compare the impact of each tag, we must take into consideration the frequency of the tag in the corpus. For example, if we consider the P.DAlam and No.S tags, we notice that their F-measures are close. However, given that the number of occurrences of the No.S tag is higher than that of the P.DAlam tag, we conclude that our system makes mistakes more often with the No.S tag than with the P.DAlam tag. Also, the tags that have a very low F-measure, such as VeIm, appear in the corpus with a very small frequency.


Table 4 Recall, precision and F-measure of our system

Tags | Recall (%) | Precision (%) | F-measure (%)
DAlam | 100.00 | 100.00 | 100.00
Dz.S | 100.00 | 100.00 | 100.00
NoDP | 100.00 | 100.00 | 100.00
NoDP.S | 100.00 | 100.00 | 100.00
P.Dz | 100.00 | 100.00 | 100.00
P.NoDP.S | 100.00 | 100.00 | 100.00
PNa.O | 100.00 | 100.00 | 100.00
Dz | 99.83 | 99.66 | 99.74
P.DNoRP | 98.00 | 100.00 | 98.99
DNoRP | 97.52 | 100.00 | 98.75
P.Pr | 97.62 | 99.39 | 98.50
Pr | 96.87 | 99.67 | 98.25
P.NoDP | 95.00 | 100.00 | 97.44
DSifa | 96.34 | 98.55 | 97.43
DNo | 98.25 | 96.46 | 97.35
P.PNa.O | 94.74 | 100.00 | 97.30
P.NoRP | 97.40 | 96.89 | 97.14
Ex | 95.83 | 97.87 | 96.84
VeP | 98.28 | 93.71 | 95.94
Alam | 93.79 | 97.96 | 95.83
PCo | 92.86 | 98.48 | 95.59
P.Ns | 97.65 | 92.91 | 95.22
PNa | 93.38 | 96.83 | 95.08
Co | 97.44 | 92.68 | 95.00
No | 95.70 | 93.07 | 94.37
P.VeP | 94.29 | 94.29 | 94.29
Vide | 91.15 | 97.63 | 94.28
CJ | 87.36 | 100.00 | 93.25
Ns | 92.92 | 93.36 | 93.14
Compa | 93.04 | 92.45 | 92.74
Abrv | 97.41 | 88.28 | 92.62
VeP.O | 95.28 | 89.38 | 92.24
P.DNo | 96.01 | 88.70 | 92.21
Sifa | 91.13 | 93.22 | 92.16
VeP.S | 97.17 | 87.29 | 91.96
PaNs | 85.98 | 97.24 | 91.26
P.No | 96.36 | 85.20 | 90.44
VePa | 86.00 | 95.32 | 90.42
P.DAlam | 78.57 | 100.00 | 88.00
No.S | 93.81 | 82.38 | 87.72
P.PNa | 81.08 | 95.24 | 87.59
P.VePa | 80.33 | 93.90 | 86.59
P.Dz.S | 75.00 | 100.00 | 85.71
DCompa | 83.33 | 88.24 | 85.71
P.VeP.O | 92.86 | 76.47 | 83.87
P.No.S | 95.06 | 74.40 | 83.47
VePa.O | 77.07 | 88.97 | 82.59
P.DSifa | 78.93 | 82.33 | 80.59
P.PCo | 96.77 | 66.67 | 78.95
P.Sifa | 73.21 | 83.67 | 78.10
P.Alam | 63.09 | 97.92 | 76.73
VePa.S | 76.28 | 72.56 | 74.38
Sifa.S | 60.85 | 92.90 | 73.54
P.VeP.S | 100.00 | 43.48 | 60.61
P.PaNs | 100.00 | 42.86 | 60.00
P.VePa.O | 39.13 | 90.00 | 54.55
NoRP | 93.07 | 38.06 | 54.02
VePa.S.O | 66.67 | 40.00 | 50.00
P.VePa.S | 47.83 | 45.83 | 46.81
P.Sifa.S | 34.55 | 67.86 | 45.78
VeIm | 100.00 | 4.94 | 9.41

6.3 Evaluation of Arabic POS Tagger Library

We present in Table 5 the recall, precision and F-measure percentages obtained with the Arabic POS Tagger Library system.

We notice that, except for the tag (V) and the noun tags (NOUN, DET + NOUN, {CONJ,PRE} + NOUN + PRON, {CONJ,PRE} + NOUN, {CONJ,PRE} + DET + NOUN, NOUN + PRON), all the other tags have an F-measure lower than 79 %. This can be explained by the low precision of the noun tags and their strong presence in the corpus, because the system gives in many cases the noun tag instead of the other tags.

6.4 StanfordPOS

We give in this section a comparison between the three systems: StanfordPOS, the Arabic POS Tagger Library and our system. To do this, we use the global metrics: precision, recall, F-measure and accuracy. The results are presented in Table 6.

These results show that the best performances for the four criteria are obtained with our system. We achieve an accuracy of around 94 %, which exceeds the others by more than 10 %. The F-measure of our system is also greater than those of the other systems by 20 %. Comparing the three systems, we can make the following remarks.


Table 5 Evaluation of Arabic POS Tagger Library

Tags | Recall (%) | Precision (%) | F-measure (%)
NOUN | 91.84 | 80.94 | 86.05
DET + NOUN | 96.02 | 73.32 | 83.15
{CONJ,PRE} + NOUN | 86.48 | 79.02 | 82.58
V | 72.08 | 94.01 | 81.60
{CONJ,PRE} + NOUN + PRON | 92.40 | 70.03 | 79.67
{CONJ,PRE} + DET + NOUN | 97.31 | 67.13 | 79.45
NOUN + PRON | 96.67 | 67.11 | 79.23
{CONJ,PRE} + V | 63.03 | 94.31 | 75.56
DET + ADJ | 59.63 | 92.93 | 72.65
ADJ | 38.64 | 84.73 | 53.07
V + VSUFF | 67.71 | 37.03 | 47.87
{CONJ,PRE} + V + VSUFF | 67.65 | 22.52 | 33.79
{CONJ,PRE} + DET + ADJ | 19.21 | 81.63 | 31.11
{CONJ,PRE} + ADJ | 10.71 | 78.57 | 18.86
ABBREV | 9.91 | 73.32 | 17.47

Table 6 Global metrics related to StanfordPOS, Arabic POS Tagger Library and our system

Metric | StanfordPOS (%) | QCRI (%) | Our system (%)
Precision | 75.65 | 73.11 | 87.30
Recall | 54.66 | 64.62 | 88.98
F-measure | 63.46 | 68.60 | 88.13
Accuracy | 72.68 | 84.54 | 94.02

6.4.1 Tag set

Unlike the other two analyzers, our system uses a set of detailed tags, very rich and suited to the Arabic language. Indeed, the tag set used by StanfordPOS comes from the English Treebank, and this sometimes causes confusion in the correspondence between the functionality of the tags translated from English and those specific to Arabic. In addition, the Arabic POS Tagger Library uses a tag set that is too small: this tagger does not specify the tense of verbs, using the tag V for all verbal tenses, and its PART tag covers all the tags of the following classes: NoRP, PaNs, La, Ex, PNa, Ns and CJ.

We also note that, for the three systems, a significant part of the errors is related to adjectives. In many cases, the three systems output the tag NOUN instead of ADJ.

6.4.2 Punctuation

Punctuation plays a crucial role in the structure of all languages, especially Arabic. However, many Arabic texts are written without punctuation, particularly on the web. Moreover, even when punctuation exists in a text, the blank between a sign and the neighboring word is sometimes forgotten, and this poses problems for POS tagging systems. To illustrate this, we give in Table 7 the results of the analysis of a few sentences processed by the StanfordPOS analyzer.

Table 7 Problems generated by the punctuation

Sentence | Result of analysis
فقال لهم دخل الرجل القائد (he told them the commander man entered) | فقال/NN لهم/NN دخل/VBD الرجل/DTNN القائد/DTNN
فقال لهم : دخل الرجل القائد. | فقال/NN لهم/NN :/PUNC دخل/VBD الرجل/DTNN القائد/DTNN ./PUNC
فقال لهم :دخل الرجل القائد. | فقال/NN لهم/VBD دخل/NN الرجل/DTNN القائد./DTJJ

We note that in these examples the results of the analysis depend both on the presence or absence of punctuation and on its position.

Given the sensitivity of the systems to punctuation and the multiplicity of errors that these signs can cause, we decided to remove the punctuation signs from the training phase, except for the dots that mark the end of sentences. This allows our system to keep the transition probabilities independent of punctuation.

6.4.3 Affixes

Affixes enrich the Arabic language to the point that a single word can be translated as a whole sentence: رسماه (rasamAho, they drew it). The structure of the Arabic language thus poses additional problems for POS tagging. To illustrate this, we give in Table 8 the results of the analysis of two sentences by the StanfordPOS analyzer.


Table 8 Problems generated by the prefixes

Sentence | Result of analysis
توجه الولد إلى مقعده (twajha Alwld [la mqEdh; the boy went to his seat) | توجه/VBD الولد/DTNN إلى/NN مقعده/JJ
وتوجه بالولد إلى مقعده | وتوجه/NN بالولد/NNP إلى/NNP مقعده/NNP

We find that StanfordPOS processed the words توجه and الولد correctly in the first sentence, while in the second sentence, where these words are attached respectively to the prefixes "و" and "ب", the results are wrong.

Therefore, taking these syntactic features into consideration improves the performance of POS tagging. It also expands its use in other applications, such as syntactic or semantic analysis. These arguments are behind our choice of a POS tagger that considers the proclitics attached to words.

7 Conclusions

We have developed in this paper a POS tagging system for Arabic texts. The use of the morphological analyzer Alkhalil Morpho Sys in the first phase and of smoothing techniques in the second phase has significantly improved the system performance. In addition, the adopted tag set is very rich and has been developed taking into account the agglutination structure that is very present in the Arabic language. Thus, the developed system handles the enclitics and proclitics attached to words, which, in addition to improving the system performance, provides accurate and detailed morphosyntactic information. We have now begun the development of an Arabic parser; the POS tagger developed in this paper is the first step of that parser.

Appendix

References

Al Shamsi, F., & Guessoum, A. (2006). A hidden Markov model-based POS tagger for Arabic. In Proceedings of the 8th International Conference on the Statistical. Besançon, France.

Al-Taani, A. T., & Al-Rub, S. A. (2009). A rule-based approach for tagging non-vocalized Arabic words. International Arab Journal of Information Technology, 6(3), 320–328.

Altabba, M., Al-Zaraee, A., & Shukairy, M. A. (2010). An Arabic morphological analyzer and part-of-speech tagger. Thesis, Faculty of Informatics Engineering, Arab International University, Damascus.

Antony, P. J., & Soman, K. P. (2011). Parts of speech tagging for Indian languages: A literature survey. International Journal of Computer Applications (0975-8887), 34(8), 22–29.

Atiyya, M., Choukri, K., & Yaseen, M. (2005, September 29). NEMLAR Arabic written corpus. Retrieved June 11, 2015, from http://www.rdi-eg.com/Downloads/Lang%20Tech/Nemlar-specifications-resources-WC-V3.0_Final.doc.

Attia, M., Yaseen, M., & Choukri, K. (2005). Specifications of the Arabic written corpus produced within the NEMLAR project. http://www.medar.info/The_Nemlar_Project/Publications/WC_design_final.pdf.

Bebah, M. O. A. O., Meziane, A., Mazroui, A., & Lakhouaja, A. (2011). Alkhalil Morpho Sys. In 7th International Computing Conference in Arabic.

Boudchiche, M., Mazroui, M., ould Abdallahi Ould Bebah, M., & Lakhouaja, A. (2014). L'analyseur morphosyntaxique Alkhalil Morpho Sys 2. In 1st National Doctoral Day of Engineering Arabic Language.

Brill, E. (1992). A simple rule-based part of speech tagger. In Proceedings of the Workshop on Speech and Natural Language (pp. 112–116). Association for Computational Linguistics.

Buckwalter, T. (2002). Buckwalter Arabic morphological analyzer version 2.0. Linguistic Data Consortium, University of Pennsylvania. LDC Catalog No. LDC2002L49. ISBN 1-58563-324-0.

Chalabi, A. (2004). Sakhr Arabic lexicon. In NEMLAR International Conference on Arabic Language Resources and Tools (pp. 21–24).

Darwish, K., Abdelali, A., & Mubarak, H. (2014). Using stem-templates to improve Arabic POS and gender/number tagging. In International Conference on Language Resources and Evaluation (LREC-2014).

Diab, M. (2009). Second generation AMIRA tools for Arabic processing: Fast and robust tokenization, POS tagging, and base phrase chunking. In 2nd International Conference on Arabic Language Resources and Tools. Cairo, Egypt.

Diab, M., Hacioglu, K., & Jurafsky, D. (2004). Automatic tagging of Arabic text: From raw text to base phrase chunks. In Proceedings of HLT-NAACL 2004: Short papers (pp. 149–152). Association for Computational Linguistics.

El Jihad, A., & Yousfi, A. (2005). Etiquetage morpho-syntaxique des textes arabes par modèle de Markov caché. In Proceedings of Rencontre des Etudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (pp. 649–654). Dourdan, France.

El-Jihad, A., Yousfi, A., & Si-Lhoussain, A. (2011). Morpho-syntactic tagging system based on the patterns words for Arabic texts. International Arab Journal of Information Technology, 8(4), 350–354.

Ghoul, D. (2011). Outils génériques pour l'étiquetage morphosyntaxique de la langue arabe: segmentation et corpus d'entraînement.

Huang, L., Peng, Y., Wang, H., & Wu, Z. (2002). Statistical part-of-speech tagging for classical Chinese. In Text, Speech and Dialogue (pp. 115–122). Brno.

Khoja, S. (2001). APT: Arabic part-of-speech tagger. In Proceedings of the Student Workshop at NAACL (pp. 20–25).

Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.

Nakagawa, T., & Uchimoto, K. (2007). A hybrid approach to word segmentation and POS tagging. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (pp. 217–220). Association for Computational Linguistics.

Neuhoff, D. L. (1975). The Viterbi algorithm as an aid in text recognition (Corresp.). IEEE Transactions on Information Theory, 21(2), 222–226.

Ney, H., Essen, U., & Kneser, R. (1994). On structuring probabilistic dependences in stochastic language modelling. Computer Speech and Language, 8(1), 1–38.

Pasha, A., Al-Badrashiny, M., Diab, M., El Kholy, A., Eskander, R., Habash, N., et al. (2014). A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. Reykjavik: LREC.

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing (Vol. 12, pp. 44–49). Manchester.

Thibeault, M. (2004). La catégorisation grammaticale automatique: adaptation du catégoriseur de Brill au français et modification de l'approche. Université Laval.

Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (Vol. 1, pp. 173–180). Association for Computational Linguistics.

Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2), 260–269.