DOI 10.1007/s10772-015-9302-8
Received: 27 February 2015 / Accepted: 12 August 2015 / Published online: 8 October 2015
© Springer Science+Business Media New York 2015
Int J Speech Technol (2016) 19:289–302
many applications of NLP. This construction is based on the morphological analyzer Alkhalil Morpho Sys (Bebah et al. 2011) and hidden Markov models (HMMs). The learning and testing phases are performed using the Nemlar corpus (Attia et al. 2005).

Unlike most other POS tagging systems, which are largely based on the Arabic Treebank annotations derived from those introduced for the English Treebank, the annotation and terminology we have adopted are inspired by the conventional grammatical analysis of the Arabic language.

In addition, most of these POS tagging systems do not take the proclitics attached to words into account and treat all words as simple elements. Thus, these systems give the same tag to the three words shown in Table 1.

Table 1 Example of three words having the same tag

Words     Gloss           Tag
ساعدا     they help       Verb + past
ساعداه    they help him   Verb + past
ساعده     he helps him    Verb + past

The POS tagger we developed uses annotations enriched with syntactic information about clitics. For example, the phrases لكم، بهم، له، لها (lkm, bhm, lh, lhA; to you, with them, to him, to her) all receive the tag جار ومجرور (jarWamajrour; preposition + pronoun compound), while for the three words ساعدا، ساعداه، ساعده our system gives respectively the tags فعل ماض + فاعل (fEl mAD + fAEl, Verb_Past + subject), فعل ماض + مفعول به (fEl mAD + mfEwl bh, Verb_Past + object), and فعل ماض + فاعل + مفعول به (fEl mAD + fAEl + mfEwl bh, Verb_Past + subject + object). Likewise, our system determines the syntactic function of proclitics, which is very useful for some NLP applications such as parsers.

The paper is organized as follows. Section 2 gives a glimpse of some previous works and recalls the different approaches used in the POS tagging field. Section 3 is devoted to an overview of the analyzer Alkhalil Morpho Sys used in the morphological phase of our system. Section 4 describes the adopted method, and Section 5 introduces the tag set used. The system evaluation is detailed in Section 6, and the paper ends with a conclusion.

2 Related works

POS tagging systems, whether or not they are based on a supervised machine learning technique, can be grouped into four main categories: rule-based systems, statistical systems, hybrid systems, and systems based on neural networks. Note here that unsupervised tagging does not require a pre-tagged corpus for the training phase.

2.1 Rule-based method

Such tagging is based on grammatical or morphological rules. It seeks to select the right tag of a word, or to set the possible transitions between tags (Al-Taani and Al-Rub 2009; Brill 1992; Thibeault 2004).

2.2 Statistical method

Statistical tagging (Ghoul 2011; Huang et al. 2002; Schmid 1994) uses frequencies and probabilities calculated from a training corpus. The best-known algorithms that have worked well for this type of POS tagging are dynamic programming algorithms, such as the Viterbi algorithm (Neuhoff 1975; Viterbi 1967).

2.3 Hybrid method

These methods combine the rule-based and statistical methods to allocate the correct tags to a given sequence of words (Nakagawa and Uchimoto 2007; Antony and Soman 2011).

2.4 Tagging based on neural networks

These methods have not produced good results and are therefore rarely used.

In the following, we present a few of the most recognized POS tagging systems in the field.

2.5 StanfordPOS

This POS tagger was developed in 2003 for English (Toutanova et al. 2003) and later extended to other languages (Arabic, Chinese, German, French, …). It is constantly improved and freely distributed on the Stanford University website. Its principle is based on the application of a bidirectional Markov model with maximum entropy, and its tag set uses the Treebank annotations.

2.6 Arabic POS Tagger Library

This tagger was recently developed by the Qatar Computing Research Institute (QCRI). It is based on the CRF++ model (conditional random fields) (Darwish et al. 2014).

2.7 ASVM

This is a free POS tagger developed in Perl by Diab's team (2004). It is an adaptation of the English system
“Yamcha” to the Arabic language. It was trained on the annotated Treebank corpus. This system is based on the support vector machine model and uses 24 tags. Mona Diab's team has achieved an improvement of this POS tagger under the name AMIRA 2.0 (Diab 2009).

● the possible vowelized form of the word;
● its possible proclitics and enclitics;
● its possible roots (only if the word is a derivative noun or a derivative verb);
● its possible syntactic natures;
● its possible vowelized patterns (only if the word is a derivative noun or a derivative verb);
● its possible lemmas and their corresponding patterns;
● its possible stems and their corresponding patterns;
● its possible syntactic forms.

[Fig. 1 Analysis phases of a sentence: the output is a single tag for each word]
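As a rough illustration, the analyzer outputs listed above could be held in a record type like the following sketch; the class and field names are our own invention, not the actual interface of Alkhalil Morpho Sys:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Analysis:
    # One out-of-context analysis of a word; field names are illustrative.
    vowelized_form: str
    proclitics: List[str] = field(default_factory=list)
    enclitics: List[str] = field(default_factory=list)
    root: Optional[str] = None               # derivative nouns/verbs only
    syntactic_nature: Optional[str] = None   # e.g. "verb_past", "noun"
    vowelized_pattern: Optional[str] = None  # derivative nouns/verbs only
    lemma: Optional[str] = None
    lemma_pattern: Optional[str] = None
    stem: Optional[str] = None
    stem_pattern: Optional[str] = None
    syntactic_form: Optional[str] = None

# A word generally receives several candidate analyses; the statistical
# phase later keeps a single tag per word.
analyses = [
    Analysis(vowelized_form="كَتَبَ", root="كتب", syntactic_nature="verb_past"),
    Analysis(vowelized_form="كُتُب", root="كتب", syntactic_nature="noun"),
]
```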
4.1 First phase: search of possible tags for each word

For each word w of a sentence PH, we first identify all its possible tags by performing the following two steps:

First step: we use the training corpus extracted from the Nemlar corpus (Attia et al. 2005) to identify the possible tags of w.

Second step: we then use the morphological analyzer Alkhalil Morpho Sys to determine other possible tags of w.

The out-of-context tags of the word w are those obtained from these two steps.

In the case where the word w does not appear in the training set and is not analyzed by Alkhalil Morpho Sys, the system applies a specific treatment: it attributes to the word w the most likely tag based on the potential prefix and suffix that may be attached to the word. For example, the word التعجيزية (Taâjezeya, incapacitating), beginning with "Al" (ال) and ending with ية (yati), will receive the adjective tag.

4.2 Second phase: statistical analysis

After identifying the possible tags for each word of the sentence, the statistical step selects the most likely one. It is based on HMMs, smoothing techniques and the Viterbi algorithm (Neuhoff 1975; Viterbi 1967). Hereafter, we give a brief overview of these three mathematical concepts.

4.2.1 Hidden Markov models

An HMM is a statistical model that associates with each model two dependent processes: the states of the first one are unobservable (hidden states) and those of the second are observable.

Indeed, if O = {o_1, o_2, …, o_r} is a finite set of observations and E = {e_1, …, e_m} is a finite set of hidden states, a dual process (X_t, Y_t)_{t≥1} is a first-order HMM if:

● (X_t)_t is a homogeneous Markov chain with values in the hidden state set E, where

  \Pr(X_{t+1} = e_j \mid X_t = e_i, \ldots, X_1 = e_h) = \Pr(X_{t+1} = e_j \mid X_t = e_i) = a_{ij}.

  a_{ij} is the transition probability from state e_i to state e_j.

● (Y_t)_t is an observable process taking values in the observation set O, where

  \Pr(Y_t = o_k \mid X_t = e_i, Y_{t-1} = o_{k_{t-1}}, X_{t-1} = e_{i_{t-1}}, \ldots, Y_1 = o_{k_1}, X_1 = e_{i_1}) = \Pr(Y_t = o_k \mid X_t = e_i) = b_i(k).

  b_i(k) is the probability of observing o_k given the state e_i.

The information about the hidden states can be deduced from the observed data.

Let S be an observed sentence consisting of the words w_1, w_2, …, w_n, and E = {e_1, e_2, …, e_N} be the set of Arabic tags. In order to search for the most likely morphosyntactic tags in context, the words w_i of the sentence S are taken as the observations and the morphosyntactic tags as the hidden states. Our goal is, given the sentence S = (w_1, w_2, …, w_n), to find the most likely sequence (e_1^*, e_2^*, \ldots, e_n^*) satisfying the following relation:

  (e_1^*, \ldots, e_n^*) = \operatorname*{argmax}_{e_i \in E_i} \Pr(e_1, \ldots, e_n \mid w_1, \ldots, w_n),

where E_i is the set of possible tags of the word w_i obtained in the first phase.

4.2.2 Viterbi algorithm

To find the most likely sequence of tags, we assume that the two following Markov assumptions are satisfied:

H1 The sequence of word tags in a sentence is a homogeneous Markov chain:

  \Pr(e_k \mid e_1, \ldots, e_{k-1}) = \Pr(e_k \mid e_{k-1}).

H2 The prediction of the word w_k requires only the knowledge of its tag:

  \Pr(w_k \mid e_k, w_{k-1}, e_{k-1}, \ldots, w_1, e_1) = \Pr(w_k \mid e_k).

We then use the Viterbi algorithm (1967), which is well suited for finding the optimal path. Indeed, if \phi(t, e_t^k) is the maximum over all paths of length (t − 1) of the probability that the (t − 1) first words carry the tags of the path and the t-th word w_t has the tag e_t^k, i.e.,

  \phi(t, e_t^k) = \max_{\substack{e_i^{j_i} \in E_i \\ 1 \le i \le t-1}} \Pr(w_1, \ldots, w_t \mid e_1^{j_1}, \ldots, e_{t-1}^{j_{t-1}}, e_t^k)\, \Pr(e_1^{j_1}, \ldots, e_{t-1}^{j_{t-1}}, e_t^k),

then, using the two hypotheses (H1) and (H2), we can easily verify that:

  \phi(t, e_t^k) = \max_{e_{t-1}^{j} \in E_{t-1}} \left[ \phi(t-1, e_{t-1}^{j})\, \Pr(e_t^k \mid e_{t-1}^{j}) \right] \Pr(w_t \mid e_t^k).
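The decoding step of Sect. 4.2.2 can be sketched as follows. The probability tables, the start-of-sentence pseudo-tag `<s>` and the flooring constant `1e-12` (a crude stand-in for the smoothing techniques mentioned above) are our own illustrative assumptions, not the authors' implementation:

```python
import math

def viterbi(words, candidates, trans, emit):
    """Most likely tag sequence under assumptions H1 and H2.

    words      -- tokens w_1..w_n of the sentence
    candidates -- dict: word -> set of possible tags E_i (first phase)
    trans      -- dict: (prev_tag, tag) -> Pr(tag | prev_tag)
    emit       -- dict: (tag, word) -> Pr(word | tag)
    """
    n = len(words)
    phi = [dict() for _ in range(n)]  # phi[t][e]: best log-prob ending in e
    psi = [dict() for _ in range(n)]  # back-pointers (the function psi)

    # Initialisation: "<s>" is an assumed start-of-sentence pseudo-tag;
    # 1e-12 floors unseen events so the logarithm stays defined.
    for e in candidates[words[0]]:
        p = trans.get(("<s>", e), 1e-12) * emit.get((e, words[0]), 1e-12)
        phi[0][e] = math.log(p)

    # Recursion: phi(t, e) = max_j phi(t-1, e_j) Pr(e | e_j) Pr(w_t | e)
    for t in range(1, n):
        for e in candidates[words[t]]:
            prev, best = None, float("-inf")
            for ep, score in phi[t - 1].items():
                s = score + math.log(trans.get((ep, e), 1e-12))
                if s > best:
                    prev, best = ep, s
            phi[t][e] = best + math.log(emit.get((e, words[t]), 1e-12))
            psi[t][e] = prev

    # Backtracking along psi from the best final tag
    tag = max(phi[-1], key=phi[-1].get)
    path = [tag]
    for t in range(n - 1, 0, -1):
        tag = psi[t][tag]
        path.append(tag)
    return path[::-1]
```

Restricting the maximization at each position to the candidate set E_i of the first phase is what distinguishes this decoder from a generic HMM tagger over the full tag set.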
This equation allows us to find, recursively, the values of \phi. In order to obtain the optimal path, we use the function \psi, which memorizes at each time t the hidden tag that realizes the maximum. It is defined by:

  \psi(t, e_t^k) = \operatorname*{argmax}_{e_{t-1}^{j} \in E_{t-1}} \phi(t-1, e_{t-1}^{j})\, \Pr(e_t^k \mid e_{t-1}^{j}).

… the corpus with the tag e_i and for which the Alkhalil analyzer generates this tag, and d is a selected value between zero and one.

5 Tag set
Tag                                  Abbreviation  Arabic tag                  Example
Verb-past-tense + subject            VePa.S        فعل ماض + فاعل              قالوا (qAlw, they say)
Verb-past-tense + object             VePa.O        فعل ماض + مفعول به          ضربه (Drbh, he hits it)
Verb-past-tense + object + subject   VePa.S.O      فعل ماض + فاعل + مفعول به   خاطبته (xTbth, I address him)
Prefix + Det + adjective             P.DSifa       حرف عطف + صفة               والإسلامية (wAl\slmy, and Islamic)
Noun + possessive construction       No.S          اسم + مضاف إليه             إعلاناتها ([ElAnth, her announcements)
Table 5 Evaluation of Arabic POS Tagger Library

Tags                        Recall (%)  Precision (%)  F-measure (%)
NOUN                        91.84       80.94          86.05
DET + NOUN                  96.02       73.32          83.15
{CONJ,PRE} + NOUN           86.48       79.02          82.58
V                           72.08       94.01          81.60
{CONJ,PRE} + NOUN + PRON    92.40       70.03          79.67
{CONJ,PRE} + DET + NOUN     97.31       67.13          79.45
NOUN + PRON                 96.67       67.11          79.23
{CONJ,PRE} + V              63.03       94.31          75.56
DET + ADJ                   59.63       92.93          72.65
ADJ                         38.64       84.73          53.07
V + VSUFF                   67.71       37.03          47.87
{CONJ,PRE} + V + VSUFF      67.65       22.52          33.79
{CONJ,PRE} + DET + ADJ      19.21       81.63          31.11
{CONJ,PRE} + ADJ            10.71       78.57          18.86
ABBREV                      9.91        73.32          17.47
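The per-tag scores in Table 5 follow the usual definitions of recall, precision and F-measure. A minimal sketch of how such figures are computed from parallel gold and predicted tag sequences (function and variable names are ours):

```python
from collections import Counter

def per_tag_metrics(gold, predicted):
    """Per-tag recall, precision and F-measure.

    gold, predicted -- parallel lists of tags, one per token.
    Returns {tag: (recall, precision, f_measure)}; a ratio whose
    denominator is zero is reported as 0.0.
    """
    tp = Counter()                 # correctly tagged tokens, per tag
    gold_n = Counter(gold)         # gold occurrences, per tag
    pred_n = Counter(predicted)    # predicted occurrences, per tag
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
    metrics = {}
    for tag in set(gold) | set(predicted):
        r = tp[tag] / gold_n[tag] if gold_n[tag] else 0.0
        p = tp[tag] / pred_n[tag] if pred_n[tag] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        metrics[tag] = (r, p, f)
    return metrics
```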
Table 6 Global metrics related to StanfordPOS, Arabic POS Tagger Library (QCRI) and our system

             StanfordPOS (%)  QCRI (%)  Our system (%)
Precision    75.65            73.11     87.30
Recall       54.66            64.62     88.98
F-measure    63.46            68.60     88.13
Accuracy     72.68            84.54     94.02

Table 7 Problems generated by the punctuation

Sentence: فقال لهم دخل الرجل القائد (fql lhm Alrjl Alq\d; he told them the commander man entered)
Result of analysis: فقال/NN لهم/NN دخل/VBD الرجل/DTNN القائد/DTNN

Sentence: فقال لهم : دخل الرجل القائد . (with a blank between the signs and the words)
Result of analysis: فقال/NN لهم/NN :/PUNC دخل/VBD الرجل/DTNN القائد/DTNN ./PUNC

Sentence: فقال لهم: دخل الرجل القائد. (without blanks)
Result of analysis: فقال/NN لهم/VBD دخل/NN الرجل/DTNN القائد./DTJJ
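The last two rows of Table 7 differ only in whether a blank separates the punctuation signs from the words. A common pre-processing workaround, sketched here under our own naming (the paper does not describe such a step), is to insert blanks around punctuation before tagging:

```python
import re

# Latin and Arabic punctuation marks to detach from adjacent words.
PUNCT = re.compile(r'([.,:;!?،؛؟])')

def separate_punctuation(text):
    """Put a blank around each punctuation sign, then normalize spaces."""
    return ' '.join(PUNCT.sub(r' \1 ', text).split())
```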
Punctuation plays a crucial role in the structure of all languages, especially Arabic. However, many Arabic texts are written without punctuation, particularly on the web. Moreover, even when punctuation exists in the texts, a blank is sometimes forgotten between the signs and the words.

6.4.1 Tag set

The affixes enrich the Arabic language to the point that a single word can be translated as a whole sentence: رسماه (rasamAho, they drew it). The structure of the Arabic language thus poses additional problems for POS tagging. To illustrate this, we give in …
References

Khoja, S. (2001). APT: Arabic part-of-speech tagger. In Proceedings of the student workshop at NAACL (pp. 20–25).

Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.

Nakagawa, T., & Uchimoto, K. (2007). A hybrid approach to word segmentation and POS tagging. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions (pp. 217–220). Association for Computational Linguistics.

Neuhoff, D. L. (1975). The Viterbi algorithm as an aid in text recognition (Corresp.). IEEE Transactions on Information Theory, 21(2), 222–226.

Ney, H., Essen, U., & Kneser, R. (1994). On structuring probabilistic dependences in stochastic language modelling. Computer Speech and Language, 8(1), 1–38.

Pasha, A., Al-Badrashiny, M., Diab, M., El Kholy, A., Eskander, R., Habash, N., et al. (2014). A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. Reykjavik: LREC.

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing (Vol. 12, pp. 44–49). Manchester.

Thibeault, M. (2004). La catégorisation grammaticale automatique: adaptation du catégoriseur de Brill au français et modification de l'approche. Université Laval.

Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology (Vol. 1, pp. 173–180). Association for Computational Linguistics.

Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2), 260–269.