
ARABIC PART-OF-SPEECH TAGGING USING THE SENTENCE STRUCTURE
Y.O. Mohamed El Hadj (1), I.A. Al-Sughayeir (1), A.M. Al-Ansari (2)

(1) Center of Research at the College of Computer & Information Sciences
(2) College of Arabic Language
Imam University
P.O. Box 8488, Riyadh 11681, KSA
m_e_hadj@hotmail.com, imadas@gmail.com, ansary_22@hotmail.com
Abstract
This paper presents a system for Arabic Part-of-Speech Tagging that combines morphological analysis with a Hidden Markov Model (HMM) and relies on the Arabic sentence structure. On the one hand, morphological analysis is used to reduce the size of the tag lexicon by segmenting Arabic words into their prefixes, stems, and suffixes, since Arabic is a derivational language. On the other hand, the HMM is used to represent the Arabic sentence structure in order to take into account the logical linguistic sequencing. For these purposes, an appropriate tagging system has been proposed to represent the main Arabic parts of speech in a hierarchical manner, allowing easy expansion whenever needed. Each tag in this system represents a possible state of the HMM, and the transitions between tags (states) are governed by the syntax of the sentence.
A corpus of old texts, extracted from books of the third century (Hijri), is manually tagged using our tagset and then used for training and testing the system. First experiments conducted on this dataset give a recognition rate of 96%, which is very promising given the small amount of data tagged so far and used in training.

INTRODUCTION
The computational processing of Arabic has gained more interest in the last few years due to a massive need for computer tools to deal with the huge amount of Arabic data available electronically, which is increasing dramatically every day (Abdelali et al., 2005). A report published by Madar Research Journal in 2005, which includes statistics and forecasts on Internet users in 17 Arab countries, estimated the size of the Internet community in the Arab world at more than 25 million (Madar). An update of this study published in March 2008 reports a 20-fold increase in the total number of Arabic Web pages produced collectively by 12 countries in the two-year period 2006-2007, with growth ranging from as little as 11-fold in Saudi Arabia to an outstanding 163-fold in Syria. Moreover, a study from the Research Unit of Internet Arab World magazine states that there are 1.9 million online websites in Arabic and that this number is expected to double every year (IAWRU). In addition to Arabic content on the web, there are many initiatives for developing electronic libraries and corpora of various types for a wide range of research purposes (Alansary et al., 2007; Al-Sulaiti & Atwell, 2006).
Providing users with high-quality tools for linguistic processing is essential to keep up with this growth, and still needs contributions from the whole scientific community. One of the basic components necessary for any robust Natural Language Processing infrastructure for a given language is Part-of-Speech tagging (POST), also known as PoS-tagging or just tagging (Atwell et al., 2004; Alansary et al., 2008). It is considered one of the basic tools needed in speech recognition, natural language parsing, information retrieval and information extraction. Moreover, POST is also considered a first stage for analyzing and annotating corpora.

Our contribution in this paper concerns the development of an Arabic Part-of-Speech Tagging system that combines morphological analysis with a statistical approach relying on the Arabic sentence structure.

POS-TAGGING TECHNIQUES
POST is the process by which a specific tag is assigned to each word of a sentence to indicate the function of that word in its specific context (Jurafsky & Martin, 2008). Arabic POST (APOST) is not an easy task because of the high ambiguity that results from the absence of diacritics and from the complexity of Arabic morphology. Consider the following example: [Arabic example illegible in the source]. Each word in the above example has more than one morphological analysis. The APOST is responsible for assigning to each word the most appropriate morphological tag.

There are three general approaches to the tagging problem:
1. Rule-based approach: consists of developing a knowledge base of rules written by linguists to define precisely how and where to assign the various POS tags.
2. Statistical approach: consists of building a trainable model and using a previously tagged corpus to estimate its parameters. Once this is done, the model can be used to automatically tag other texts. Successful statistical taggers have been built in recent years and are mainly based on Hidden Markov Models (HMMs).
3. Hybrid approach: consists of combining a rule-based approach with a statistical one. Most recent works use this approach as it gives better results.

Different Arabic taggers have recently emerged. Some of them were developed by companies (Xerox, Sakhr, RDI) as commercial products, while others result from research efforts in the scientific community (Khoja, 2001; Freeman, 2001; Maamouri & Cieri, 2002; Diab et al., 2004; Banko & Moore, 2004; Tlili-Guiassa, 2006). Among these works, Khoja (2001) combines statistical and rule-based techniques and uses a tagset of 131 tags basically derived from the BNC English tagset. Freeman (2001) is based on the Brill tagger and uses a machine learning approach; a tagset of 146 tags, based on that of the Brown corpus for English, is used. Maamouri & Cieri (2002) is based on the automatic annotation output produced by the morphological analyzer of Tim Buckwalter (Buckwalter, 2004); it achieved an accuracy of 96%. Diab et al. (2004) use the Support Vector Machine (SVM) method and the LDC's POS tagset, which consists of 24 tags. Banko and Moore (2004) present an HMM tagger that exploits context on both sides of a word to be tagged. It is evaluated in both the unsupervised and supervised cases and achieves an accuracy of about 96%. Tlili-Guiassa (2006) uses a hybrid method combining rules and memory-based learning. A tagset composed of symbols from Khoja's tagger and new ones is used, and a performance of 85% was reported.

Almost all of these taggers either use tagsets derived from English, which is not appropriate for Arabic, or rely on a transliteration of the Arabic input text. Another important point is that the structure of the Arabic sentence is generally not taken into account during the tagging process and, to our knowledge, few works have addressed it (Shamsi & Guessoum, 2006).
In this paper, we present a system for Arabic Part-of-Speech Tagging that relies on the Arabic sentence structure and combines morphological analysis with Hidden Markov Models (HMMs), as explained in the following section.
OUR APPROACH FOR ARABIC POST
In this work, statistical and linguistic approaches are combined, so that processing is performed at two levels. At the first level, the text is normalized and tokenized into words, and then morphologically analyzed. Morphological analysis is used as an input module to reduce the size of the required tag lexicon by segmenting Arabic words into their prefixes, stems, and suffixes. This is very important because Arabic is a derivational language. For this purpose, an appropriate tagging system has been proposed to represent the main Arabic parts of speech in a hierarchical manner, allowing easy expansion whenever needed.
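
As an illustration of the first level, the following Python sketch normalizes a text, tokenizes it, and splits each word into prefix, stem and suffix. It is not the morphological analyzer used in this work: the affix lists `PREFIXES` and `SUFFIXES`, the `min_stem` threshold and all function names are assumptions chosen for the example, and a real analyzer needs a much richer lexicon and a way to resolve ambiguous segmentations.

```python
import re

# Arabic short-vowel marks (U+064B..U+0652) and tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# A token is a run of Arabic letters, a run of digits, or one punctuation mark.
TOKEN = re.compile(r"[\u0621-\u064A]+|\d+|[^\s\w]")

# Hypothetical, deliberately tiny affix lists (illustration only).
PREFIXES = ["ال", "و", "ب"]
SUFFIXES = ["ون", "ان", "ات", "ة"]


def normalize(text):
    """Remove diacritics and tatweel so that spelling variants collapse."""
    return DIACRITICS.sub("", text)


def tokenize(text):
    """Split normalized text into words, numbers and punctuation marks."""
    return TOKEN.findall(text)


def segment(word, min_stem=2):
    """Strip at most one known prefix and one known suffix from a word."""
    prefix = suffix = ""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            prefix, word = p, word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            suffix, word = s, word[:-len(s)]
            break
    return prefix, word, suffix


for token in tokenize(normalize("المعلمون في الكتاب.")):
    print(token, segment(token))
```

Such naive stripping over-segments words whose first letter merely looks like a prefix, which is one reason a proper morphological analysis is needed before tagging.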

At the second level, an appropriate statistical model based on the internal structure of the Arabic sentence is used to recognize the morphological characteristics of the words of the input text. Using the internal linguistic structure of the Arabic sentence allows us to identify logical sequences of words, and consequently their corresponding tags. Since the probability of occurrence of a certain word (or its tag) depends on the words preceding it in a given context, the HMM is the most suitable statistical model to keep track of this history. A linguistic study is conducted to determine the Arabic sentence structure by identifying the main forms of both nominal and verbal sentences. Based on that, an HMM is then used to represent this structure. Each state of the HMM is represented by a possible tag in the lexicon, and the transitions between states (tags) are governed by the syntax of the sentence. Transition probabilities are calculated using a smoothed trigram model, and special processing is used to handle unknown words in order to determine their lexical probabilities.

Before giving the details of our Arabic POS tagger, a linguistic study of Arabic words and grammatical structures is required, both for coding morphological characteristics and for extracting the most appropriate structure for common forms of the Arabic sentence.
DESCRIPTION OF THE TAGGING SYSTEM
We investigated the principal aspects of Arabic morphology and grammar. The following is a brief review of those aspects. Arabic words are divided into three classes: noun (اسم), verb (فعل) and what we will call particle (حرف).
NOUN
It is either a name or a word that describes a person, thing or idea. It can be definite or indefinite and can be subcategorized by person (first person or speaker, second person or addressee, and third person or absent), number (singular, dual, plural), gender (masculine, feminine), and grammatical case ("الرفع", "النصب", "الجر"). Fig. 1 gives a main classification of the noun and its prominent ramifications.

Fig. 1: Noun and its sub-categories
VERB
It is a word that denotes an action and can be combined with some particles. In terms of tense (see Fig. 2), the verb can be past (perfect), present (imperfect) or imperative. A future tense exists, but it is derived from the present tense by attaching a prefix to the present-tense form of the verb. Particles can be added as prefixes and/or suffixes indicating the number, gender, and person of the subject, for example: [Arabic examples illegible in the source].
Three moods are possible for verbs: indicative "الرفع", subjunctive "النصب", and jussive "الجزم".

Fig. 2: Verb and its temporal forms
PARTICLE
This class includes everything that is neither a verb nor a noun. It contains the “jarr” prepositions, the coordinating conjunctions and function words like “inna wa akhawatuha” (إن وأخواتها), which influence the analysis of the following words. There are many prepositions, but we do not need, at least in this phase of the work, to give an exhaustive list of them, since our objective is not to study them in detail. Fig. 3 gives an example of the classification of particles according to their functions.

Fig. 3: Main groups of particles

PROPOSED TAGSET
The previous classification is used to develop an appropriate tagging scheme that follows the part-of-speech hierarchy, in order to make it meaningful and easily expandable to include more detail and precision about Arabic units whenever needed.
As we have seen, the noun can be definite or indefinite. We give the noun in its global form the symbol “NoIf”. In its definite form, it gets the symbol “NoPr” if it is a proper name, “NoPn” if it is a pronoun, “NoDe” if it is a demonstrative pronoun, and “NoCn” if it is a relative pronoun (اسم موصول). Because a pronoun can be attached (متصل) or not attached (منفصل) to another word, we use “NoPnAt” to tag the former and “NoPnSe” to tag the latter. To indicate gender, number and person, we add respectively the letters M or F; S, D or P; and 1, 2 or 3.
As far as the verb is concerned, it is given the symbol “Ve” globally, “VePe” for the perfect, “VeIf” for the imperfect and “VeIa” for the imperative.
Regarding the class of particles, tags are specified only for those that are relevant to our work in its initial phase. Among those, “PaDe” is used to tag the definite article (أل), while “PaDu” and “PaPl” are used respectively for tagging particles indicating number (dual and plural). To indicate gender, the letters M or F can be used. The remaining particles are assigned the tag “PaOt”, but they can be tagged separately following the same logic. “Pa” is the global tag given to a particle if we do not need to distinguish a particular one.
Finally, we assign the symbol “Pu” to punctuation signs (., ?, !, etc.). Digits and dates are denoted by the symbol “Nu”.
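
The hierarchical naming scheme described above lends itself to simple string composition. The following Python sketch is only an illustration of that idea: the helper names `build_tag` and `hierarchy`, the example tag `NoPnAtMS3`, and the treatment of the bare first level (for instance `No`) as a class label are assumptions made here, not part of the tagset specification.

```python
import re

GENDER = {"M", "F"}
NUMBER = {"S", "D", "P"}
PERSON = {"1", "2", "3"}


def build_tag(*levels, gender="", number="", person=""):
    """Join two-letter hierarchy levels, then gender, number and person."""
    checks = ((gender, GENDER), (number, NUMBER), (person, PERSON))
    for value, allowed in checks:
        if value and value not in allowed:
            raise ValueError(f"unexpected feature letter: {value!r}")
    return "".join(levels) + gender + number + person


def hierarchy(tag):
    """List the tag's levels from coarsest to finest, ignoring features."""
    # Each level is an uppercase letter followed by a lowercase letter;
    # feature letters (M, F, S, D, P) and digits never match this pattern.
    levels = re.findall(r"[A-Z][a-z]", tag)
    return ["".join(levels[:k]) for k in range(1, len(levels) + 1)]


tag = build_tag("No", "Pn", "At", gender="M", number="S", person="3")
print(tag)             # NoPnAtMS3
print(hierarchy(tag))  # ['No', 'NoPn', 'NoPnAt']
```

Because every finer tag starts with its parent, a tagger trained on detailed tags can be evaluated at a coarser level simply by truncating the tag string.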
SPECIFICATION OF THE SENTENCE STRUCTURE: MODEL ARCHITECTURE
A linguistic study has been conducted to extract the common forms of the Arabic sentence, so that they can serve as the architecture of the statistical model. The sources for this study were old morphology books and modern studies concerned with sentence structure in the Arabic language, such as [Harkat, Mutawakkil, Al-Rahhali, Al-Shukri, Yaqut].

A sentence in the Arabic language is either nominal or verbal [the Arabic example of each is illegible in the source]. Each type may take different forms and styles. A list of more than 100 common grammatical structures of the Arabic language has been surveyed. It covers the general syntactic analysis and the detailed morphological analysis of nouns and verbs.
STRUCTURE OF NOMINAL SENTENCES
Different forms have been identified for nominal sentences. They can be represented as sequences by the following figure (Fig. 4), where N, V, and P respectively denote NOUN, VERB, and PARTICLE. S and E are special states used to represent the start and the end of the nominal sentence. Notice that a loop on a state indicates a certain number of repetitions of this symbol, and an arrow between two states means that the first one may be followed by the second one, according to the direction of the arrow.

Fig. 4: Structure of nominal sentences (states S, N, V, P, E)
STRUCTURE OF VERBAL SENTENCES
The structure of verbal sentences can be represented by a graph, as in the following figure (Fig. 5). A verbal sentence starts with either a verb or a particle and is followed by any combination of the main parts of speech.

Fig. 5: Structure of verbal sentences (states S, N, V, P, E)
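
The verbal-sentence graph can be written down as a small transition table. The Python sketch below encodes only what the prose states (a start on V or P, then any combination of N, V and P); the exact arcs of Fig. 5 are not reproduced here, so the edge set, the names `VERBAL` and `accepts`, and the rule that an empty sentence is rejected are assumptions.

```python
MAIN = ("N", "V", "P")

# S may be followed by V or P; every main state may be followed by any
# main state or by the end state E (edges assumed from the prose only).
VERBAL = {"S": {"V", "P"}}
for state in MAIN:
    VERBAL[state] = set(MAIN) | {"E"}


def accepts(graph, pos_sequence):
    """Check that a part-of-speech sequence is a path from S to E."""
    state = "S"
    for symbol in list(pos_sequence) + ["E"]:
        if symbol not in graph.get(state, set()):
            return False
        state = symbol
    return True


print(accepts(VERBAL, ["V", "N", "N"]))  # True
print(accepts(VERBAL, ["N", "V"]))       # False: cannot start with a noun
```

In the HMM, a missing arc of this kind corresponds to a transition that the sentence structure rules out, while the allowed arcs receive probabilities estimated from the corpus.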
ARCHITECTURE OF THE STATISTICAL MODEL
Although the previous representations of nominal and verbal sentence structures may seem trivial and straightforward, they are very useful for specifying the architecture of our HMM. It suffices to combine them into one graph, replace each state by the underlying part of speech, and then expand it to include its subcategories as specified in the description of the tagset. Each state in the new model (HMM) represents a valid tag from our lexicon. Determination of the model parameters is discussed in the following section.
THE HMM-BASED POS TAGGER
The use of a Hidden Markov Model for part-of-speech tagging can be seen as a special case of Bayesian inference. It can be formalized as follows: for a given sequence of words, what is the best sequence of tags corresponding to this sequence of words? If we represent an input text (a sequence of morphological units in our case) by $W = (w_i)_{1 \le i \le n}$ and a sequence of tags from the lexicon by $T = (t_i)_{1 \le i \le n}$, we have to compute:

$$\max_{T} \left[ P(T \mid W) \right].$$

By using Bayes' rule and then eliminating the constant term $P(W)$, the expression can be transformed into:

$$\max_{T} \left[ P(W \mid T) \, P(T) \right].$$

$P(T)$ represents the probability of the tag sequence (tag transition probabilities), and can be computed using an N-gram model (a trigram model in our case), as follows:

$$P(T = t_1 t_2 \ldots t_n) = \prod_{i=1}^{n} P(t_i \mid t_{i-2} t_{i-1}).$$

A tagged training corpus is used to compute $P(t_i \mid t_{i-2} t_{i-1})$ by calculating the frequencies of trigrams and bigrams (respectively $f(t_{i-2} t_{i-1} t_i)$ and $f(t_{i-2} t_{i-1})$) as follows:

$$P(t_i \mid t_{i-2} t_{i-1}) = \frac{f(t_{i-2} t_{i-1} t_i)}{f(t_{i-2} t_{i-1})}.$$
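
The relative-frequency estimate above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this work: the padding symbol `START`, the function name `train_transitions` and the toy tag sequences are assumptions. The bigram count is taken over histories that are actually followed by a tag, so the estimates for each history sum to one.

```python
from collections import Counter

START = "<s>"  # assumed padding symbol for the two positions before a sentence


def train_transitions(tag_sequences):
    """Return a function giving P(tag | prev2 prev1) by relative frequency."""
    trigrams, histories = Counter(), Counter()
    for tags in tag_sequences:
        padded = [START, START] + list(tags)
        for i in range(2, len(padded)):
            trigrams[(padded[i - 2], padded[i - 1], padded[i])] += 1
            histories[(padded[i - 2], padded[i - 1])] += 1

    def prob(tag, prev2, prev1):
        seen = histories[(prev2, prev1)]
        return trigrams[(prev2, prev1, tag)] / seen if seen else 0.0

    return prob


p = train_transitions([["NoPr", "VePe", "NoPr"], ["NoPr", "VePe", "PaOt"]])
print(p("NoPr", "NoPr", "VePe"))  # 0.5
print(p("PaOt", "PaOt", "PaOt"))  # 0.0, an unseen history
```

The zero returned for unseen histories is exactly the problem that smoothing has to address.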

However, it can happen that some trigrams (bigrams) never appear in the training set. To avoid assigning null probabilities to unseen trigrams (bigrams), we used the deleted interpolation developed by Brants (2000):

$$\lambda_1 P(t_i \mid t_{i-2} t_{i-1}) + \lambda_2 P(t_i \mid t_{i-1}) + \lambda_3 P(t_i),$$

where $\lambda_1 + \lambda_2 + \lambda_3 = 1$.
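
A possible way to obtain the weights is the deleted-interpolation procedure described by Brants (2000), sketched below in Python with the indexing used here ($\lambda_1$ for the trigram, $\lambda_2$ for the bigram, $\lambda_3$ for the unigram term). The function names, the unpadded counting, the tie-breaking in favour of the higher-order model and the uniform fallback for an empty corpus are assumptions of this sketch.

```python
from collections import Counter


def ngram_counts(tag_sequences):
    """Count tag unigrams, bigrams and trigrams (no sentence padding)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
        tri.update(zip(tags, tags[1:], tags[2:]))
    return uni, bi, tri


def ratio(num, den):
    """Division that returns 0 when the denominator is not positive."""
    return num / den if den > 0 else 0.0


def deleted_interpolation(uni, bi, tri):
    """Return [lambda1, lambda2, lambda3] for trigram, bigram, unigram."""
    n = sum(uni.values())
    weights = [0.0, 0.0, 0.0]
    for (a, b, c), f in tri.items():
        # Remove one occurrence of the trigram and see which model still
        # predicts c best; that model's weight grows by the trigram count.
        cases = [
            ratio(f - 1, bi[(a, b)] - 1),
            ratio(bi[(b, c)] - 1, uni[b] - 1),
            ratio(uni[c] - 1, n - 1),
        ]
        weights[cases.index(max(cases))] += f
    total = sum(weights)
    return [w / total for w in weights] if total else [1 / 3, 1 / 3, 1 / 3]


def smoothed(tag, prev2, prev1, lambdas, uni, bi, tri):
    """Interpolated estimate of P(tag | prev2 prev1)."""
    l1, l2, l3 = lambdas
    n = sum(uni.values())
    return (l1 * ratio(tri[(prev2, prev1, tag)], bi[(prev2, prev1)])
            + l2 * ratio(bi[(prev1, tag)], uni[prev1])
            + l3 * ratio(uni[tag], n))


counts = ngram_counts([["N", "V", "N"], ["N", "V", "P"], ["P", "N", "V"]])
lambdas = deleted_interpolation(*counts)
print(lambdas)
print(smoothed("P", "V", "N", lambdas, *counts))  # non-zero for an unseen trigram
```

With such a small toy corpus most of the weight goes to the lower-order models, which is the expected behaviour when trigram counts are sparse.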

Now, for calculating the likelihood of the word sequence given the tags, $P(W \mid T)$, the probability of a word appearing is generally assumed to depend only on its own part-of-speech tag. So, it can be written as follows:

$$P(W \mid T) = \prod_{i=1}^{n} P(w_i \mid t_i).$$

Here also, a tagged training set has to be used for computing these probabilities, as follows:

$$P(w_i \mid t_i) = \frac{f(w_i, t_i)}{f(t_i)},$$

where $f(w_i, t_i)$ and $f(t_i)$ represent respectively the number of times $w_i$ is tagged as $t_i$ and the frequency of the tag $t_i$ itself.
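
The emission estimate can be sketched in the same way. In the Python fragment below, the function name `train_emissions` and the placeholder units `w1`, `w2`, `w3` are assumptions; in practice the units would be the morphological segments produced at the first level.

```python
from collections import Counter


def train_emissions(tagged_sentences):
    """Return a function giving P(word | tag) by relative frequency."""
    pair_counts, tag_counts = Counter(), Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            pair_counts[(word, tag)] += 1
            tag_counts[tag] += 1

    def prob(word, tag):
        seen = tag_counts[tag]
        return pair_counts[(word, tag)] / seen if seen else 0.0

    return prob


e = train_emissions([
    [("w1", "NoPr"), ("w2", "VePe")],
    [("w1", "VePe"), ("w3", "NoPr")],
])
print(e("w1", "NoPr"))  # 0.5
print(e("w4", "NoPr"))  # 0.0: an unknown word needs special handling
```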

Tag sequence probabilities and word likelihoods are the parameters of the HMM: transition probabilities and emission (observation) probabilities. Once these parameters are set, the HMM can be used to find the best sequence of tags for a given sequence of input words. The Viterbi algorithm is used to perform this task.
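
To make the decoding step concrete, the following Python sketch implements the Viterbi algorithm for a simplified first-order (bigram) HMM. This is a deliberate simplification of the trigram model described above, chosen to keep the example short; the dictionary layout of `trans` and `emit`, the start symbol and the toy probabilities are all assumptions.

```python
import math


def viterbi(words, tags, trans, emit, start="<s>"):
    """Best tag sequence under a bigram HMM (simplified from trigrams).

    trans[(prev, tag)] and emit[(word, tag)] hold probabilities;
    missing entries are treated as zero.
    """
    if not words:
        return []

    def lp(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t] = (log-probability of the best path ending in t, backpointer)
    best = [{t: (lp(trans.get((start, t), 0)) + lp(emit.get((words[0], t), 0)),
                 None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = lp(emit.get((words[i], t), 0))
            column[t] = max(
                (best[i - 1][p][0] + lp(trans.get((p, t), 0)) + e, p)
                for p in tags
            )
        best.append(column)

    # Follow the backpointers from the best final state.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


trans = {("<s>", "N"): 0.8, ("<s>", "V"): 0.2, ("N", "N"): 0.3,
         ("N", "V"): 0.7, ("V", "N"): 0.6, ("V", "V"): 0.4}
emit = {("w1", "N"): 0.6, ("w1", "V"): 0.1, ("w2", "N"): 0.2, ("w2", "V"): 0.5}
print(viterbi(["w1", "w2"], ["N", "V"], trans, emit))  # ['N', 'V']
```

Extending the sketch to trigrams amounts to using pairs of tags as states, at the cost of a larger state space.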
PERFORMANCE EVALUATION
CORPUS PREPARATION
Recall that our ultimate goal is to build an Arabic POS tagger that can be used for relatively old books (from the third century Hijri). Although these texts may be classified as MSA, their style can differ greatly from that of today. So, we have created a corpus composed of texts extracted from ALJAHEZ's book "Albayan-wa-tabyin" (255 Hijri). It was obtained from the "Ashamila" library, which can be downloaded from http://www.shamela.ws. Manual tagging of this corpus using our own tagset is currently in progress. Due to the complexity of manual tagging, only a subset of the corpus has been completed so far. It contains a total of 21882 words, of which 3565 are unique, arranged in more than 1600 sentences. Among these, there are 10258 nouns, 2587 verbs, and 9037 particles.
DATA-SETS AND EVALUATION
Our model is trained on 95% of the tagged corpus previously described, using 13 tags: 3 subcategories of verbs, 6 subcategories of nouns, and 4 subcategories of particles. It is tested on the remaining 5%, which represents about 1000 words. To evaluate its performance, we used the F-measure, defined as $2PR/(P+R)$, where $P$ and $R$ denote precision and recall respectively. They are calculated using the total number of correctly assigned tags ($N_c$), the total number of assigned tags ($N_a$), and the total number of tags in the test set ($N_t$): $P = N_c / N_a$ and $R = N_c / N_t$.
We have obtained an accuracy of 96%, which is very encouraging given the size of the tagset used so far.
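
The evaluation measure can be computed as in the Python sketch below. The counts in the usage lines are hypothetical values chosen only to illustrate the arithmetic; they are not the counts of the experiment reported here. Note that when the tagger assigns exactly one tag to every unit of the test set, $N_a = N_t$, so precision, recall and F-measure all coincide with plain accuracy.

```python
def f_measure(n_correct, n_assigned, n_reference):
    """F-measure from counts of correct, assigned and reference tags."""
    if n_correct == 0:
        return 0.0
    precision = n_correct / n_assigned
    recall = n_correct / n_reference
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: one tag per unit, so P = R = F = accuracy.
print(f_measure(960, 1000, 1000))
# Hypothetical counts where some units are left untagged.
print(f_measure(90, 100, 120))
```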
CONCLUSION
In this paper we have presented an Arabic Part-of-Speech tagger that uses an HMM to represent the internal linguistic structure of the Arabic sentence. We conducted a linguistic study to determine the main Arabic parts of speech and to specify the common forms of the Arabic sentence. After that, an appropriate tagging system was proposed to represent these main Arabic parts of speech in a hierarchical manner, allowing easy expansion whenever needed. Next, a suitable architecture for the HMM was specified based on the structure of both nominal and verbal sentences. Having done this, a corpus composed of old texts extracted from books of the third century Hijri was created. Part of it was manually tagged and used to train and test the tagger. Performance evaluation has shown an accuracy of 96%. However, although this is a very good result given the size of the training corpus, we have to enlarge our tagged corpus and conduct further tests on more interesting datasets to evaluate the real performance of this approach.
We plan to use the developed tagger in our research activities in a variety of ways, especially for applications dealing with old texts "النصوص التراثية".
REFERENCES
Abdelali A., Cowie J., Soliman H.S. (2005). Building a Modern Standard Arabic Corpus. Workshop on Computational Modeling of Lexical Acquisition, The Split Meeting, Croatia.
Alansary S., Nagi M., Adly N. (2008). Towards Analyzing the International Corpus of Arabic (ICA). 8th International Conference on Language Engineering, Egypt.
Alansary S., Nagi M., Adly N. (2007). Building an International Corpus of Arabic (ICA). 7th International Conference on Language Engineering, Egypt.
Al-Sulaiti L., Atwell E. (2006). The Design of a Corpus of Contemporary Arabic. International Journal of Corpus Linguistics, vol. 11, pp. 135-171.
Atwell E., Al-Sulaiti L., Al-Osaimi S., Abu-Shawar B. (2004). A Review of Arabic Corpus Analysis Tools. Proceedings of JEP-TALN'04 Arabic Language Processing.
Banko M., Moore R.C. (2004). Part of Speech Tagging in Context. Proceedings of the 20th International Conference on Computational Linguistics, Switzerland.
Brants T. (2000). TnT: A Statistical Part-of-Speech Tagger. Proceedings of ANLP 2000, the 6th Conference on Applied Natural Language Processing, pp. 224-231, Seattle, Washington. Morgan Kaufmann Publishers Inc.
Buckwalter T. (2004). Buckwalter Arabic Morphological Analyzer, Version 2.0. LDC Catalog No. LDC2004L02, Linguistic Data Consortium, www.ldc.upenn.edu/Catalog.
Diab M., Hacioglu K., Jurafsky D. (2004). Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks. Proceedings of HLT-NAACL'04, pp. 149-152.
Freeman A. (2001). Brill's POS Tagger and a Morphology Parser for Arabic. ACL'01 Workshop on Arabic Language Processing.
[IAWRU] Internet Arab World Research Unit: http://www.teckies.com/lebanon/
Jurafsky D., Martin J.H. (2008). Speech and Language Processing: An Introduction to Speech Recognition, Computational Linguistics and Natural Language Processing. 2nd Edition.
[Madar] Madar Research: http://www.madarresearch.com/archive/archive_toc.aspx?id=50
Maamouri M., Cieri C. (2002). Resources for Arabic Natural Language Processing at the LDC. Proceedings of the International Symposium on the Processing of Arabic, Tunisia, pp. 125-146.
Shamsi F., Guessoum A. (2006). A Hidden Markov Model-Based POS Tagger for Arabic. JADT'06.
Tlili-Guiassa Y. (2006). Hybrid Method for Tagging Arabic Text. Journal of Computer Science 2 (3): 245-248.

245

) َ ف‬ NOUN It is either a name or a word that describes a person. (Freeman. ‫. so that the processing will be performed in two levels. A future verb tense exists."ا‬ subjunctive " ‫ . gender (Masculine. stems. Each state of the HMM is represented by a possible tag in the lexicon and the transitions between states (tags) are governed by the syntax of the sentence. interlocutor and absent). an appropriate statistical model based on the internal structure of the Arabic sentence is used to recognize the morphological characteristics of the words for the entered text. a form of combination between statistical and linguistic approaches will be employed. Tlili-Guiassa (2006) uses a hybrid method of based-rules and a memory-based learning method. and person of the subject. It could be definite or indefinite and can be subcategorized by the person (narrator.)ا‬verb ( ِ ) and that we will call particle (‫. Based on that. like for example: . either they rely on a transliteration of the Arabic input text. thing or idea. DESCRIPTION OF THE TAGGING SYSTEM We investigated the principle aspects of Arabic morphology and grammar. Plural). In this paper.‫. 2). in our knowledge. gender. In the second level. but it's a derivative of the present tense that you achieve by attaching a prefix to the present tense of the verb."ا‬Fig1 gives a main classification of the noun and its prominent ramifications."ا‬and jussive "‫"ا م‬ 242 . The following is a brief review of those aspects. ‫ا‬ OUR APPROACH FOR ARABIC POST In this work. ل‬ ‫ن‬ . ن‬ Three moods are possible for verbs: indicative " ‫. The use of the linguistic internal structure of the Arabic sentence will allow us to identify logical sequences of words. In term of tense (see Fig.and rule-based techniques and uses a tagset of 131 basically derived from the BNC English tagset. 2006). In the first level. and then morphologically analyzed. 
an appropriate tagging system has been proposed to represent the main Arabic part of speech in a hierarchical manner allowing an easy expansion whenever it is needed. and suffixes. the verb could be past (imperative). Dual. Transition' probabilities are ‫ﻥ ة‬ ‫ال + ﻥ ة‬ ‫إ رة‬ ‫ﺹ ل‬ ‫د‬ ‫ﻥ‬ ‫آ‬ Fig. present (imperfect) or imperative. which consists of 24 tags. and grammatical cases ( ،" ‫"ا‬ (" ‫"، "ا‬ ‫ . Feminine). 1: Noun and its sub-categories VERB It is a word that denotes an action and could be combined with some particles. It is evaluated in both the unsupervised and supervised cases and achieves an accuracy of about 96%. An other important point is that the structure of the Arabic sentence does not generally taken into account during the tagging process and. Particles can be added as prefixes and/or suffixes indicating the number. This is very important due to the fact that Arabic is a derivational language. a HMM model is then used to represent this structure. 2002) is based on the automatic annotation output produced by the morphological analyzer of Tim Buckwalter (Buckwalter. we present a system for Arabic Part-OfSpeech Tagging that relies on the Arabic sentence structure and combines morphological analysis with Hidden Markov Models (HMMs) as we will explain in the following section. For this purpose. A tagset composed of symbols from Khoja's tagger and new ones is used and a performance of 85% was reported. few works are interested to that (Shamsi & Guessoum. A tagset of 146 tags. The morphological analysis is used as input module to reduce the size of the needed tags' lexicon by segmenting Arabic words in their prefixes. A linguistic study is conducted to determine the Arabic sentence structure by identifying the different main forms of both nominal and verbal sentences. calculated using a smoothed tri-gram and a special processing is used to handle unknown words to determine their lexical probabilities. 
the HMM will be the best suitable statistical model to keep track of this history. text is firstly normalized and tokenized into words. (Maamouri & Cieri. Since the probability of a certain word (or its tag) occurrence depends on the words preceding it in a given context. Banko and Moore (2004) presents a HMM tagger that exploits context on both sides of a word to be tagged. a linguistic study of Arabic words and grammatical structures will be required for the purpose of coding morphological characteristics and for extracting the most appropriate structure for common Arabic sentence' forms. The Arabic verbal structures are composed of three classes: noun ( ‫ . and consequently their corresponding tags. Diab et al (2004) use Support Vector Machine (SVM) method and the LDC's POS tagset. 2004). Almost all of these taggers. Before giving the details of our Arabic POS tagger. number (Singular. it achieved an accuracy of 96%. 2001) is based on the Brill tagger and uses a machine learning approach. based on that of Brown corpus for English is used. either use tagsets derived from English which is not appropriate for Arabic.

Finally. So we will use “NoPnAt” to tag the first one and “NoPnSe” to tag the second one. Mutawakkil.PARTICULE ‫ا‬ This class includes everything that is neither a verb nor a noun. “PaDe” is used to tag the identifier (‫“ . 243 . STRUCTURE OF NOMINAL SENTENCES Different forms of formulation have been identified for nominal sentences. 2: verb and its temporal-forms ‫ﺍﻝﺤﺭﻑ‬ ‫ﻋﻁﻑ‬ ‫ﺠﻭﺍﺏ‬ ‫ﻨﺩﺍﺀ‬ ‫ﺯﺠﺭ‬ ‫ﺍﺴﺘﺜﻨﺎﺀ‬ ‫ﺠﺭ‬ ‫ﻨﻔﻲ‬ ‫ﻨﻬﻲ‬ ‫ﺍﺴﺘﻔﻬﺎﻡ‬ ‫ﺸﺭﻁ‬ ‫ﺁﺨﺭ‬ ‫ﺘﺄﻜﻴﺩ‬ ‫ﺘﻨﺒﻴﻪ‬ ‫ﺘﻭﻗﻊ‬ ‫ﺘﻤﻨﻲ‬ ‫ﺠﻤﻊ‬ ‫ﺘﺜﻨﻴﺔ‬ ‫ﺘﺄﻨﻴﺙ‬ ‫ﺘﻔﺴﻴﺭ‬ ‫ﺘﻌﺭﻴﻑ‬ ‫ﻤﺅﻨﺙ‬ ‫ﻤﺫﻜﺭ‬ Fig. “VePe” for the Perfect. Yaqut]. As far as the verb is concerned. our objective is not to know them in detail. etc) the symbol “Pu”. “Pa” is the global tag given to the particle if we do not need to distinguish a particular one. The digits and dates are denoted by the symbol “Nu”. the noun could be defined or undefined. S and E are special states. at least in this phase of work. D. Al-Rahhali. Regarding the class of particles. we will assign to the punctuation signs (. where V. we will add respectively the letters M. P. They can be represented by the following figure (fig. In its defined format. to give an exhaustive list of them. “NoDe” if it's a demonstrative pronoun. the coordination prepositions and the functional words like “inna wa akhawatuha ‫ ”إن وأﺥ اﺕ‬which influences the upcoming words analysis. VERB. 3: main groups of particles PROPOSED TAGSET The previous classification is used to develop an appropriate tagging scheme considering the parts of speech hierarchy in order to make it meaningful and easily expandable to include more details and precision about the Arabic units whenever it is needed. There are many prepositions. N. Among those. but we do not really need. The references of this study were the old morphology books and modern studies concerned with sentence structures in the Arabic language such as the following references [Harkat. it will get the symbol “NoPr” if it is a proper name. 
according to their functions. Because the pronoun can be attached or not attached to another word, [...]. For indicating the gender, number and person, further symbols are used: to indicate the gender, the letters M or F can be used, [...] 1, 2 or 3.

We give the noun in its global format the symbol "NoIf", the symbol "NoPn" if it is a pronoun, and "NoCn" if it is [...]. As for the verb, it is given the symbol "Ve" globally, [...] "VeIf" for the imperfect and "VeIa" for the imperative.

Fig. 3 gives an example of the classification of particles. It contains the "jarr" prepositions, [...] (أل). "PaDu" and "PaPl" are respectively used for tagging particles indicating the number (dual and plural). The remaining particles are assigned the tag "PaOt". In fact, tags are specified only for the particles that are relevant to our work in its initial phase, but the others can be tagged separately following the same logic.

SPECIFICATION OF THE SENTENCE STRUCTURE: MODEL ARCHITECTURE
A linguistic study has been conducted to extract the common types of formulations of the Arabic sentence, so that it can serve as the architecture of the statistical model. A list of more than 100 common grammatical structures in the Arabic language has been surveyed (Al-Shukri). It covers the general syntactical analysis and the detailed morphological analysis of nouns and verbs. The sentence in the Arabic language is either nominal or verbal. Each of them may have different forms and styles.

The structure of nominal sentences is represented by a graph (Fig. 4) in terms of sequences. N, V, and P respectively denote NOUN, VERB, and PARTICLE, while S and E are used to represent the start and the end of the nominal phrase. Notice that a loop on a state indicates a certain number of repetitions of this symbol, and an arrow between two states means that the first one may be followed by the second one, depending on its direction.

Fig. 4: Structure of Nominal sentences

STRUCTURE OF VERBAL SENTENCES
Verbal sentence structure can be represented by a graph as in the following figure (Fig. 5). This means that a verbal sentence starts with either a verb or a particle and is followed by any combination of the main parts of speech.

Fig. 5: Structure of Verbal sentences

ARCHITECTURE OF THE STATISTICAL MODEL
Although the previous representations of both nominal and verbal sentence structures can be seen as trivial and straightforward, they are very useful for specifying the architecture of our HMM model. It suffices to combine them in one graph, to replace each state by the underlying part of speech, and then to expand it to include its subcategories as specified in the description of the tagset. Each state in the new model (HMM) represents a valid tag from our lexicon. Determination of the model parameters will be discussed in the following section.

THE HMM-BASED POS TAGGER
The use of a Hidden Markov Model to do part-of-speech tagging can be seen as a special case of Bayesian inference. It can be formalized as follows: for a given sequence of words, what is the best sequence of tags which corresponds to this sequence of words? If we represent an entered text (a sequence of morphological units in our case) by $W = (w_i)_{1 \le i \le n}$ and a sequence of tags from the lexicon by $T = (t_i)_{1 \le i \le n}$, we have to compute:

$$\max_T \left[ P(T \mid W) \right]$$

By using the Bayesian rule and then eliminating the constant part $P(W)$, the equation can be transformed to this new one:

$$\max_T \left[ P(W \mid T) \cdot P(T) \right]$$

$P(T)$ represents the probability of the tag sequence (tag transition probabilities), and can be computed using an N-gram model (trigram in our case), as follows:

$$P(T = t_1 t_2 \ldots t_n) = \prod_{i=1}^{n} P(t_i \mid t_{i-2} t_{i-1})$$

A tagged training corpus is used to compute $P(t_i \mid t_{i-2} t_{i-1})$, by calculating frequencies of trigrams and bigrams (respectively $f(t_{i-2} t_{i-1} t_i)$ and $f(t_{i-2} t_{i-1})$) as follows:

$$P(t_i \mid t_{i-2} t_{i-1}) = \frac{f(t_{i-2} t_{i-1} t_i)}{f(t_{i-2} t_{i-1})}$$

However, it can happen that some trigrams (bigrams) never appear in the training set. So, to avoid assigning null probabilities to unseen trigrams (bigrams), we used the deleted interpolation developed by (Brants, 2000):

$$\lambda_1 P(t_i \mid t_{i-2} t_{i-1}) + \lambda_2 P(t_i \mid t_{i-1}) + \lambda_3 P(t_i)$$

where $\lambda_1 + \lambda_2 + \lambda_3 = 1$.

Now, for calculating the likelihood of the word sequence given the tags, $P(W \mid T)$, the probability of a word appearing is generally supposed to be dependent only on its own part-of-speech tag. So, it can be written as follows:

$$P(W \mid T) = \prod_{i=1}^{n} P(w_i \mid t_i)$$

Here also, a tagged training set has to be used for computing these probabilities, as follows:

$$P(w_i \mid t_i) = \frac{f(w_i, t_i)}{f(t_i)}$$

where $f(w_i, t_i)$ and $f(t_i)$ represent respectively how many times $w_i$ is tagged as $t_i$ and the frequency of the tag $t_i$ itself.

Tag sequence probabilities and word likelihoods represent the HMM model's parameters: transition probabilities and emission (observation) probabilities. Once these parameters are set, the HMM model can be used to find the best sequence of tags given a sequence of input words. The Viterbi algorithm is used to perform this task.

PERFORMANCE EVALUATION

CORPUS PREPARATION
We recall that our ultimate goal is to build an Arabic POS tagger that can be used for relatively old books (from the third century Hijri). So, we have created a corpus composed of some texts extracted from ALJAHEZ's book "Albayan-wa-tabyin" (255 Hijri). It is obtained from the "Ashamila" library, which is downloadable from this link: http://www.shamela.ws. Although these texts may be classified as MSA, their styles can vary greatly from those of nowadays.

A manual tagging of this corpus using our own tagset is currently running. Due to the complexity of the manual tagging, only a subset of the corpus has been finished till now. It counts a total of 21882 words, with 3565 unique words, arranged in more than 1600 sentences. Among these counts, there are 10258 nouns, 2587 verbs, and 9037 particles.

DATA-SETS AND EVALUATION
Our model is trained on 95% of the tagged corpus previously described, using 13 tags.
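The training and decoding steps described for the HMM tagger can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `HmmTagger`, the boundary symbols `<s>` and `</s>`, the fixed interpolation weights (0.6, 0.3, 0.1), the uniform score given to unknown words, and the transliterated toy sentences with their tag assignments are all hypothetical. In particular, the weights are simply fixed here, whereas the deleted interpolation of (Brants, 2000) estimates them from the training data.

```python
from collections import Counter
from math import log

START, END = "<s>", "</s>"  # hypothetical sentence-boundary symbols


class HmmTagger:
    """Toy trigram HMM tagger; names and weights are illustrative only."""

    def __init__(self, tagged_sentences, lambdas=(0.6, 0.3, 0.1)):
        self.l1, self.l2, self.l3 = lambdas  # assumed fixed weights, summing to 1
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        self.ctx1, self.ctx2, self.emit = Counter(), Counter(), Counter()
        for sent in tagged_sentences:
            tags = [START, START] + [t for _, t in sent] + [END]
            for w, t in sent:
                self.emit[(w, t)] += 1        # f(w_i, t_i)
            for a, b, c in zip(tags, tags[1:], tags[2:]):
                self.uni[c] += 1              # f(t_i)
                self.bi[(b, c)] += 1          # f(t_{i-1} t_i)
                self.tri[(a, b, c)] += 1      # f(t_{i-2} t_{i-1} t_i)
                self.ctx1[b] += 1             # f(t_{i-1}) as a context
                self.ctx2[(a, b)] += 1        # f(t_{i-2} t_{i-1}) as a context
        self.total = sum(self.uni.values())
        self.tags = sorted(t for t in self.uni if t != END)
        self.vocab = {w for w, _ in self.emit}

    def transition(self, a, b, c):
        """Interpolated estimate l1*P(c|a b) + l2*P(c|b) + l3*P(c)."""
        p3 = self.tri[(a, b, c)] / self.ctx2[(a, b)] if self.ctx2[(a, b)] else 0.0
        p2 = self.bi[(b, c)] / self.ctx1[b] if self.ctx1[b] else 0.0
        p1 = self.uni[c] / self.total
        return self.l1 * p3 + self.l2 * p2 + self.l3 * p1

    def emission(self, w, t):
        """P(w|t) = f(w, t) / f(t); unknown words get a uniform score (assumption)."""
        if w not in self.vocab:
            return 1.0 / len(self.tags)
        return self.emit[(w, t)] / self.uni[t]

    def viterbi(self, words):
        """Return the most probable tag sequence for a list of tokens."""
        best = {(START, START): (0.0, [])}    # state = the last two tags
        for w in words:
            step = {}
            for (a, b), (score, path) in best.items():
                for c in self.tags:
                    e = self.emission(w, c)
                    if e == 0.0:
                        continue              # tag never seen with this word
                    s = score + log(self.transition(a, b, c)) + log(e)
                    if (b, c) not in step or s > step[(b, c)][0]:
                        step[(b, c)] = (s, path + [c])
            best = step
        # Close the sentence with the transition into END.
        _, (_, path) = max(
            best.items(),
            key=lambda kv: kv[1][0] + log(self.transition(kv[0][0], kv[0][1], END)),
        )
        return path


# Toy training data: transliterated tokens with illustrative tags.
train = [
    [("kataba", "Ve"), ("alwaladu", "NoIf"), ("fi", "PaOt"), ("albayti", "NoIf")],
    [("alwaladu", "NoIf"), ("fi", "PaOt"), ("albayti", "NoIf")],
    [("qaraa", "Ve"), ("huwa", "NoPn")],
]
tagger = HmmTagger(train)
print(tagger.viterbi(["kataba", "alwaladu", "fi", "albayti"]))
```

The Viterbi state is the pair of the two most recent tags, which is what a trigram transition model requires. Because the unigram term always receives a non-zero weight here, every transition has a non-zero probability and the logarithms stay finite.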

The 13 tags comprise 3 subcategories of verbs, 6 subcategories of nouns, and 4 subcategories of particles. The model is tested on the remaining 5%, which represents about 1000 words. To evaluate its performance, we have used the F-measure defined as follows:

$$F = \frac{2 \cdot P \cdot R}{P + R}$$

where $P$ and $R$ denote precision and recall respectively. They are calculated using the total number of correctly assigned tags ($N_c$), the total number of assigned tags ($N_a$), and the total number of tags in the test-set ($N_t$):

$$P = \frac{N_c}{N_a}, \qquad R = \frac{N_c}{N_t}$$

We have obtained an accuracy of 96%, which is very encouraging given the size of the tagset used so far.

CONCLUSION
In this paper we have presented an Arabic Part-Of-Speech tagger that uses an HMM model to represent the internal linguistic structure of the Arabic sentence. We have conducted a linguistic study to determine the main Arabic POS and to specify the different common forms of the Arabic sentence. Having done this, an appropriate tagging system has been proposed to represent these main Arabic parts of speech in a hierarchical manner, allowing an easy expansion whenever it is needed. After that, a suitable architecture of the HMM model is specified based on the structure of both nominal and verbal sentences. Next, a corpus composed of old texts extracted from books of the third century Hijri is created. A part of it is manually tagged and used to train and to test the tagger. Performance evaluation has shown an accuracy of 96%. However, although this represents a very good result given the size of the training corpus, we have to increase our tagged corpus and conduct further tests on more varied datasets to evaluate the real performance of this approach. We plan to use the developed tagger for our research activities in a variety of ways, especially for applications dealing with old texts.

REFERENCES
Abdelali A., Cowie J., Soliman H.S. (2005). Building A Modern Standard Arabic Corpus. Workshop on Computational Modeling of Lexical Acquisition, the split meeting, Croatia.
Alansary S., Nagi M., Adly N. (2007). Building an International Corpus of Arabic (ICA). 7th International Conference on Language Engineering, Egypt.
Alansary S., Nagi M., Adly N. (2008). Towards Analyzing the International Corpus of Arabic (ICA). 8th International Conference on Language Engineering, Egypt.
Al-Sulaiti L., Atwell E. (2006). The design of a corpus of contemporary Arabic. International Journal of Corpus Linguistics, vol. 11, pp. 135-171.
Atwell E., Al-Sulaiti L., Al-Osaimi S., Abu-Shawar B. (2004). A Review of Arabic Corpus Analysis Tools. Proceedings of JEP-TALN'04 Arabic Language Processing.
Banko M., Moore R.C. (2004). Part of Speech Tagging in Context. Proc. of the 20th International Conference on Computational Linguistics, Switzerland.
Brants T. (2000). TnT: A statistical part of speech tagger. In proc. of ANLP'2000, the 6th Conference on Applied Natural Language Processing: 224-231. Seattle, Washington. Morgan Kaufmann Publishers Inc.
Diab M., Hacioglu K., and Jurafsky D. (2004). Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks. Proc. of HLT-NAACL'04: 149-152.
Freeman A. (2001). Brill's POS tagger and a morphology parser for Arabic. In ACL'01 Workshop on Arabic language processing.
Internet Arab World research Unit (IAWRU): http://www.teckies.com/lebanon/
Jurafsky D., Martin J.H. (2008). Speech and Language Processing: An introduction to speech recognition, computational linguistics and natural language processing. 2nd Edition.
Maamouri M., Cieri C. (2002). Resources for Arabic Natural Language Processing at the LDC. Proceedings of the International Symposium on the Processing of Arabic, pp. 125-146, Tunisia.
[Madar] Madar Research: http://www.madarresearch.com/archive/archive_toc.aspx?id=50.
Shamsi F., Guessoum A. (2006). A Hidden Markov Model-Based POS Tagger for Arabic. JADT'06.
Tim Buckwalter (2004). Buckwalter Arabic Morphological Analyzer, Version 2.0. Linguistic Data Consortium, LDC Catalog No. LDC2004L02. www.ldc.upenn.edu/Catalog.
Tlili-Guiassa Y. (2006). Hybrid Method for Tagging Arabic Text. Journal of Computer Science 2 (3): 245-248.
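As a worked illustration of the evaluation measures defined in the DATA-SETS AND EVALUATION section, the following sketch computes precision, recall and F-measure from the counts Nc, Na and Nt. The function name `prf`, the use of `None` to mark a token that the tagger left untagged, and the five-token example are assumptions made for this illustration only; they are not taken from the experiments reported in the paper.

```python
def prf(gold_tags, predicted_tags):
    """Precision, recall and F-measure from Nc, Na and Nt.

    A predicted tag of None marks a token left untagged (assumption).
    """
    nt = len(gold_tags)                                    # tags in the test-set
    na = sum(1 for p in predicted_tags if p is not None)   # assigned tags
    nc = sum(1 for g, p in zip(gold_tags, predicted_tags)  # correct assigned tags
             if p is not None and g == p)
    precision = nc / na if na else 0.0
    recall = nc / nt if nt else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical five-token test-set: four tokens tagged, three of them correctly.
gold = ["Ve", "NoIf", "PaOt", "NoIf", "NoPn"]
pred = ["Ve", "NoIf", "PaOt", "NoPn", None]
p, r, f = prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Here Nc = 3, Na = 4 and Nt = 5, so P = 0.75, R = 0.6 and F = 0.9 / 1.35, which is about 0.667. When the tagger assigns a tag to every token, Na equals Nt, so precision, recall and F-measure all coincide with plain accuracy.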