Comparative Study for Recent Technologies in Arabic Language Parsing

Conference Paper · June 2019


DOI: 10.1109/SDS.2019.8768587

Darah Aqel, Shadi AlZu’bi, and Siham Hamadah
Alzaytoonah University of Jordan, Amman, Jordan
{d.aqel, smalzubi, siham}@zuj.edu.jo

Abstract - Parsing natural language is of interest to many natural language processing (NLP) applications. Parsing a complex language such as Arabic is a challenging task for many researchers. Arabic is complex because of its rich morphology and the complicated structure of its sentences, which contain difficult linguistic features. This comparative study discusses existing Arabic parsing systems and describes the current technologies employed for parsing Arabic text. A comparison between several parsing systems is also presented, stating the key features and limitations of each. Finally, solutions that address the main limitations of these Arabic parsing systems are suggested.

Keywords - Natural Language Processing, Arabic Language Processing, Arabic Parsing, Syntactic Parsing, Constituency Parsing, Dependency Parsing

I. INTRODUCTION
Natural language processing (NLP) has recently gained increasing interest from many research fields [1]. NLP has many applications, such as parsing [2], sentiment analysis [3], grammar rule induction [4], and machine translation [5]. Constituency parsing is one of the most significant NLP tasks; it determines the syntactic structure of a sentence according to a set of grammatical rules. Parsing relies on a preceding part-of-speech (POS) tagging step, which assigns the right POS tag (e.g. noun, verb, adjective) to each word in the text. Additionally, parsing is a prerequisite for deeper NLP tasks such as semantic analysis, including word sense disambiguation, sentiment analysis, and co-reference resolution [1].
Parsing systems need to learn the syntax of the modeled language to classify newly seen sentences. Research into parsing English has recently achieved successful results, and different constituency parsing systems have been implemented to process and parse English text effectively. However, adapting the available parsing systems to other languages such as Arabic has faced many challenges.
Previous work on parsing Arabic has shown that the language is highly ambiguous and difficult to parse. Arabic is characterized by its rich morphology, and its complex syntax has gained much attention from researchers in the last decade. Some state-of-the-art systems have been developed to process this complex language automatically and deeply.
This paper gives an overview and a comprehensive study of the current state of theoretical research on parsing Arabic, and reports the approaches and systems that have been developed for parsing this language. A comparison between Arabic parsing systems is also presented to show their main features, limitations, and challenges. Finally, suggestions and solutions for addressing the main limitations of current Arabic parsing systems are proposed, and prospects for future improvements of these systems are provided.
The remainder of this paper is organized as follows. Section II provides a literature review of current studies on parsing Arabic. It also compares the common systems developed for parsing Arabic text by identifying the main characteristics and limitations of each. Section III presents a detailed discussion based on the conducted review and suggests possible solutions for handling the major limitations of current Arabic parsing systems. Section IV concludes the paper.

II. RELATED LITERATURE
Research on Arabic language processing began in the 1970s. Current technologies and trends in Arabic NLP emphasize text parsing. Several open-source NLP tools and parsers can be applied to Arabic text, such as the Stanford Parser, FARASA (Advanced Tools for Arabic), NLTK, and Stanford CoreNLP. Furthermore, many researchers have put great effort into developing text parsers for Arabic. Their main concern was to provide a clear syntactic analysis of Arabic text and to produce a full representation of the grammatical relations between the words in sentences, such as identifying Arabic verbal and nominal sentences in text.
For example, the study in [10] develops an Arabic parser, called A'reb, which performs lexical and syntactic analysis of Arabic sentences. The parser includes the main Arabic grammar rules for verbal sentences and can be extended to any Arabic sentence. The authors of [10] stated that the parser has some limitations and requires more work to be completed.
The authors in [11] introduce a model for parsing Arabic sentences in general, and Quranic Arabic sentences in particular, using the Natural Language Toolkit (NLTK). The model builds a context-free grammar (CFG) and invokes the NLTK recursive-descent parser. It performs some NLP tasks and generates parse trees for Arabic sentences. The authors mentioned that their approach does not perform deeper processing such as specifying the parsing dependencies of Arabic sentences.
The study presented in [12] constructs a CFG and presents an Arabic parser based on it. In particular, the NLTK recursive-descent parser, which follows a top-down strategy, was used to check whether an Arabic sentence is grammatically correct. To test the performance of the parser, a corpus of 150 Arabic sentences was used in the experiments. The parser achieved a high accuracy of 95% in parsing verbal and nominal sentences. However, the study also showed that some sentences were not properly parsed.
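The CFG plus recursive-descent setup described for [11] and [12] can be illustrated with a small self-contained sketch. It deliberately does not call NLTK; it is a hand-written top-down parser, and the grammar, the four-word lexicon, and the function names (parse, expand, parse_sentence) are hypothetical choices made for illustration, not the rules used in those studies.

```python
# Toy top-down (recursive-descent) parser over a tiny CFG.
# Grammar, lexicon, and example sentences are illustrative assumptions,
# not the grammars built in [11] or [12].

GRAMMAR = {
    "S": [["VS"], ["NS"]],                    # verbal or nominal sentence
    "VS": [["V", "NP", "NP"], ["V", "NP"]],   # verb + subject (+ object)
    "NS": [["NP", "ADJ"], ["NP", "NP"]],      # subject + predicate
    "NP": [["N"]],
}

LEXICON = {
    "كتب": "V",      # "wrote"
    "الولد": "N",    # "the boy"
    "الدرس": "N",    # "the lesson"
    "مجتهد": "ADJ",  # "diligent"
}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for each way `symbol` derives a prefix of tokens[pos:]."""
    if symbol not in GRAMMAR:
        # POS category: consume one token if the lexicon assigns it this tag.
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            yield (symbol, tokens[pos]), pos + 1
        return
    for production in GRAMMAR[symbol]:
        yield from expand(symbol, production, tokens, pos, [])


def expand(symbol, rest, tokens, pos, children):
    """Match the remaining right-hand-side symbols one by one, left to right."""
    if not rest:
        yield (symbol, *children), pos
        return
    for child, next_pos in parse(rest[0], tokens, pos):
        yield from expand(symbol, rest[1:], tokens, next_pos, children + [child])


def parse_sentence(sentence):
    """Return the first parse tree that covers the whole sentence, or None."""
    tokens = sentence.split()
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None


print(parse_sentence("كتب الولد الدرس"))   # verbal sentence
print(parse_sentence("الولد مجتهد"))       # nominal sentence
print(parse_sentence("الدرس كتب"))         # None: no production matches
```

Because every alternative is retried from scratch, a parser of this kind repeats work on shared prefixes, and a left-recursive rule (for example NP -> NP ADJ) would make it recurse without consuming input. It also depends entirely on the tags in the lexicon, so a wrong POS tag makes an otherwise valid sentence unparsable.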
TABLE 1. A Comparison between some Arabic Parsing Systems

Reference: [10]
Proposed approach: A'reb Arabic parser
Achievements: Parsing Arabic verbal and nominal sentences
Limitations and demerits of the approach:
• The approach does not parse the rest of the verbal and nominal sentences.
• It does not perform any Arabic grammar correction process.
• It does not take into account the diacritics of words.
• It does not perform a semantic analysis of the text, which causes errors related to the improper use of semantic meaning.

Reference: [11]
Proposed approach: NLTK recursive-descent parser + building a new CFG for Arabic sentences
Achievements: Parsing Arabic and Quranic sentences
Limitations and demerits of the approach:
• The approach does not perform a deeper processing task such as specifying the parsing dependencies for the Arabic sentences.
• The parser may sometimes go into an infinite loop.
• The parser tests all words and grammar rules even when they are inappropriate for the input sentence, which maximizes its time complexity.
• The parser does not keep the successfully parsed segments and therefore repeats much of its work.

Reference: [12]
Proposed approach: NLTK recursive-descent parser + building a new CFG for Arabic sentences
Achievements: Parsing Arabic verbal and nominal sentences
Limitations and demerits of the approach:
• Some sentences were not properly parsed by the parser. This is due to:
  - The incorrect POS tagging of some words.
  - The complexity in ordering some sentences.
  - Some input sentences did not match the right CFG production rules.

Reference: [13]
Proposed approach: Bel-Arabi Arabic grammar analyzer system
Achievements: Parsing Arabic verbal and nominal sentences
Limitations and demerits of the approach:
• Inability to perform grammar correction.
• Inability to perform a semantic analysis of the text.
• Inability to differentiate between Arabic active and passive verbs.

Reference: [14]
Proposed approach: Transducers parser
Achievements: Parsing and annotating Arabic nominal and verbal sentences
Limitations and demerits of the approach:
• Some sentences were only partially disambiguated due to the lack of semantic rules and the inability of the approach to identify some rules.
• Some sentences were not properly parsed by the parser. This is due to:
  - The incorrect POS tagging of some words.
  - Some sentences failed to be disambiguated due to the lack of some information in the dictionaries used.

Reference: [15]
Proposed approach: ILP method called ILA
Achievements: Parsing Arabic nominal and verbal sentences
Limitations and demerits of the approach:
• The approach was inaccurate in parsing some nominal sentences.

Reference: [16]
Proposed approach: An Arabic parser that uses CFG and classical grammar rules
Achievements: Parsing Arabic nominal sentences
Limitations and demerits of the approach:
• The approach was applied to Arabic nominal sentences only, not verbal ones.
• The failures in parsing some sentences are due to improper POS tagging and segmentation.

Reference: [17]
Proposed approach: A top-down chart Arabic parser + CFG
Achievements: Parsing Arabic nominal and verbal sentences
Limitations and demerits of the approach:
• The approach is unable to recognize passive or active voice verbs.
• It does not take into account the diacritics of words.
• It does not identify more than one linked pronoun in a word.
• It does not distinguish the particles that are used to describe two present verbs in the sentence.

Reference: [23]
Proposed approach: A top-down Arabic parser with a recursive transition network + CFG
Achievements: Parsing Arabic sentences
Limitations and demerits of the approach:
• Some sentences were wrongly parsed due to the incompatibility between the words' attributes.
• The parser failed to parse 11 sentences, since it was unable to associate any rule with these input sentences.

Reference: [2]
Proposed approach: A parsing system based on some rules obtained by an ILP method called ALEPH
Achievements: Parsing Arabic relative clauses in Arabic sentences
Limitations and demerits of the approach:
• The approach incorrectly classified some Arabic relative clauses due to the incorrect POS tagging of some relative pronouns in the sentences that have relative clauses.
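Two of the limitations that Table 1 lists for the recursive-descent parser, possible non-termination and repeated work on segments that were already parsed, are the kind of problem that chart-based parsing is usually meant to address. The sketch below is only an illustration of the chart idea: it uses CYK, a bottom-up chart algorithm over a grammar in Chomsky normal form, and is not the top-down chart parser of [17]. The tag set and the rules (including the left-recursive NP -> NP ADJ) are assumptions made for the example.

```python
# CYK recognizer: every span of the input is analysed once and stored in a chart.
# Tags and rules are illustrative assumptions, not a grammar from the cited studies.

# POS tag -> non-terminals that can produce it
TERMINAL_RULES = {"V": {"V"}, "N": {"NP"}, "ADJ": {"ADJ"}}

# (left child, right child) -> parent non-terminals
BINARY_RULES = {
    ("NP", "ADJ"): {"NP"},       # left-recursive: NP -> NP ADJ
    ("V", "NP"): {"VP", "S"},    # VP -> V NP, and S -> V NP (verb + subject)
    ("VP", "NP"): {"S"},         # S -> VP NP (verb + subject + object)
}


def recognize(tags, start="S"):
    """Return True if the POS tag sequence can be derived from `start`."""
    n = len(tags)
    if n == 0:
        return False
    # chart[i][j] holds every non-terminal that derives tags[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tag in enumerate(tags):
        chart[i][i + 1] = set(TERMINAL_RULES.get(tag, ()))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):            # split point
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        chart[i][j] |= BINARY_RULES.get((left, right), set())
    return start in chart[0][n]


print(recognize(["V", "N", "ADJ", "N"]))   # True
print(recognize(["N", "V"]))               # False
```

The three nested loops bound the running time by the cube of the sentence length, and the left-recursive rule causes no looping because each chart cell is filled exactly once from shorter spans.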

The research discussed in [13] proposes an Arabic grammar analyzer system, called Bel-Arabi, which covers the main grammar rules for Arabic nominal and verbal sentences. The proposed grammar analyzer achieved promising results in analyzing 600 Arabic sentences. Yet the research also showed that the developed analyzer had some restrictions and challenges that affected the parsing of sentences.
Moreover, the authors in [14] introduced an approach for parsing Arabic nominal sentences using transducers. They implemented a set of lexical and syntactic rules for Arabic. Their approach also allowed the parsing and annotation of verbal sentences. On a corpus of 200 sentences, their approach achieved satisfactory results, with a precision of 80% and a recall of 90%.
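Precision and recall are often combined into a single F-measure. As a minimal sketch, assuming the balanced F1 score (the harmonic mean of precision and recall), the figures reported for [14] can be combined as follows. The function name f_measure is hypothetical, and [14] is not stated here to report an F-measure itself.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the transducer-based parser of [14].
print(round(f_measure(0.80, 0.90), 3))   # 0.847

# Precision and recall reported for the statistical parser of [25].
print(round(f_measure(82.4, 86.6), 1))   # 84.4
```

The second call uses the precision and recall that this section reports for the statistical parser of [25]; the result agrees with the F-measure of 84.4% given there, which suggests that the balanced F1 is the measure being used.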
The researchers in [15] apply an inductive logic programming (ILP) method called the Inductive Learning Algorithm (ILA) to identify nominal and verbal sentences in Arabic text. The ILP method produces a set of parsing rules from a training dataset, and these rules can then be used for parsing Arabic nominal and verbal sentences. Their method achieved a very good accuracy of 92%. However, the experimental results showed that it was more accurate and efficient in classifying verbal sentences than nominal ones, because nominal sentences have several more formulations than verbal ones.
The study in [16] describes a method for parsing Arabic nominal sentences based on a CFG and some rules of classical grammar. On a corpus of 200 Arabic nominal sentences, the method achieved an accuracy of 97%. The study showed that most of the failures in parsing nominal sentences resulted from improper POS tagging and segmentation. As future work, the authors planned to improve and extend the method to parse Arabic verbal sentences as well.
The authors in [17] proposed a top-down chart parser for Arabic nominal and verbal sentences. They first used a CFG to develop Arabic grammar rules, since a CFG provides an accurate description of the grammatical structure of Arabic sentences. Then they implemented the top-down chart parser, which assigns a grammatical structure to the input sentences. To measure the performance of the parser, a set of 70 Arabic nominal and verbal sentences, taken from real documents, was used in the experiments. The experimental results showed that the parser was efficient in parsing both nominal and verbal sentences, achieving an overall accuracy of 94.3%.
Bataineh and Bataineh [23] developed an Arabic parser based on a top-down parsing method with a recursive transition network. They first generated Arabic grammar rules by building a CFG, and then represented these rules using transition networks. The parser was tested on a real dataset of 90 Arabic sentences, of which 85.6% were parsed correctly.
The authors in [2] developed a system that parses Arabic sentences to identify the relative clauses in them. Their system performed a deeper dependency parsing task based on linguistic grammar rules for Arabic relative clauses learned by an ILP machine learning method called ALEPH. To assess the performance of the system empirically, a corpus of Arabic sentences taken from the Quran was given to the system for processing. The system parsed these sentences and identified their relative clauses, achieving an overall accuracy of 83%. In general, the results were satisfactory and indicated that applying ILP to parsing Arabic relative clauses is a promising contribution to the field of Arabic NLP.
The authors in [24] propose baselines for three existing parsing models. They also developed a manually annotated grammar for Arabic and evaluated it to test and improve the performance of Arabic constituency parsing. Their study showed that the performance of Arabic constituency parsing remains much lower than that of English. The study in [25] applies a statistical parsing approach that uses the Penn Arabic Treebank (PATB) [26], an Arabic linguistic resource, to construct a model for parsing Arabic sentences and generating their parse trees. In the performance experiments, the statistical parser was trained on a dataset of 22,000 Arabic sentences taken from PATB and tested on a different gold-standard dataset of 2,000 Arabic sentences, also taken from PATB. The parser achieved a precision of 82.4%, a recall of 86.6%, and an F-measure of 84.4%.
Different Arabic parsing systems were described above. Table 1 compares some of these systems, summarizing their merits and limitations.

III. DETAILED DISCUSSION BASED ON THE CONDUCTED REVIEW
As described before, text parsing aims to specify the syntactic structure and the grammatical relations of a sentence's constituents. Therefore, grammar checking is considered a major part of the parsing task. In fact, grammar checking and parsing are both challenging tasks in Arabic, where each constituent needs deep analysis and wide study [7, 8].
The current literature in Arabic NLP shows a lack of research on parsing Arabic, as very few attempts have been made to develop an effective parsing system for the language. Furthermore, there is still no public parser available for Arabic with sufficiently wide coverage. Various factors lie behind these issues and explain the challenges facing the development of automatic parsing systems. Some of these are: the rich morphology of Arabic, the difficult syntax of its sentences, the length of its sentences, the free word order, the unique linguistic characteristics of the language, the complicated formation of words from roots, the omission of diacritics in written Arabic text, the absence of regular punctuation marks, and the presence of elliptic personal pronouns. In addition, Arabic linguistic resources, such as a large manually annotated Arabic corpus, are scarce. The problem of syntactic and semantic disambiguation of Arabic words is a further factor that limits research on Arabic parsing [6, 7, 8, 9].
Several approaches have been developed for parsing Arabic text. This study selected and discussed the 10 most recent approaches (see Table 1). These 10 approaches were chosen for illustration and to show the feasibility of this comparative study; more Arabic parsing approaches could be included where necessary. All of the selected approaches are representative of the latest approaches to constituency parsing for Arabic. Most of them perform constituency parsing only and do not perform a deeper parsing task such as dependency parsing. This is one of the main limitations of the current systems developed for parsing Arabic.
Furthermore, the review has shown that existing Arabic parsing systems have other limitations (e.g. [10-23, 2]). As a result, their performance may drop if the corpus is enlarged or replaced with another one. Hence, future research on Arabic parsing should focus on improving the existing systems and resolving the limitations mentioned above.
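The distinction drawn above between constituency parsing and dependency parsing can be made concrete by writing down both analyses of one short sentence as data structures. This is a minimal sketch: the sentence, the phrase labels, and the relation names (root, nsubj, obj) are assumptions chosen for illustration and are not taken from any of the reviewed systems.

```python
# Two analyses of the same three-word verbal sentence.
# Phrase labels and relation names are illustrative assumptions.

TOKENS = ["كتب", "الولد", "الدرس"]   # "wrote", "the boy", "the lesson"

# Constituency analysis: nested phrases whose leaves are token indices.
CONSTITUENCY = ("S", ("V", 0), ("NP", ("N", 1)), ("NP", ("N", 2)))

# Dependency analysis: one (head index, relation) pair per token; -1 marks the root.
DEPENDENCY = [(-1, "root"), (0, "nsubj"), (0, "obj")]


def leaves(tree):
    """Return the token indices covered by a constituency tree, left to right."""
    _label, *children = tree
    if len(children) == 1 and isinstance(children[0], int):
        return [children[0]]
    return [index for child in children for index in leaves(child)]


def dependents(head):
    """Return (word, relation) for every token directly attached to `head`."""
    return [(TOKENS[i], rel) for i, (h, rel) in enumerate(DEPENDENCY) if h == head]


print(leaves(CONSTITUENCY))   # [0, 1, 2]
print(dependents(0))          # words attached to the verb, with their relations
```

The constituency tree groups words into phrases, while the dependency list records which word governs which and in what grammatical role. The second kind of information is what most of the systems in Table 1 do not produce.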
One way to address some of these limitations is to focus on enhancing the performance of Arabic POS taggers. When each word in the text is assigned the right POS tag, the accuracy of the parsers increases and more sentences are parsed correctly. Furthermore, more researchers should be encouraged to pursue deeper dependency parsing and semantic analysis of Arabic text. These efforts will help NLP researchers resolve many issues, such as the syntactic and semantic ambiguity of words, parsing more Arabic sentences accurately, performing proper grammar correction for Arabic sentences, and supporting the automatic diacritization of Arabic words.

IV. CONCLUSION
Parsing Arabic sentences is a complex process. It requires wide research and analysis to develop an efficient parser suited to the rich morphology and the difficult linguistic characteristics of Arabic. This paper provided an inclusive review of the existing approaches and systems developed for parsing Arabic text. In addition, it illustrated the merits, limitations, and challenges of these systems. The study also presented suggestions and future perspectives for building effective parsing systems for Arabic text, which can help improve the performance of existing parsers by addressing their main limitations. NLP researchers who are targeting an efficient parser for Arabic are advised to consider semantic analysis to resolve word ambiguity, so that errors in parsing sentences can be avoided. Moreover, future work on parsing Arabic should focus more on enhancing Arabic POS taggers, improving the Arabic grammar correction process in parsers, and supporting deeper dependency parsing. It should also take into consideration the diacritics of words in written Arabic text.

V. REFERENCES
[1] G S. K. Mishra, “Artificial Intelligence and Natural Language Processing”, Cambridge Scholars Publisher, 385 pages, 2018.
[2] D. Aqel and B. Hawashin, “Arabic Relative Clauses Parsing Based on Inductive Logic Programming”, Recent Patents on Computer Science, Vol. 11, No. 2, pp. 121-133, 2018.
[3] D. Aqel and S. Vadera, “A Framework for Employee Appraisals based on Sentiment Analysis”, In Proceedings of the 1st International Conference on Intelligent Semantic Web-Services and Applications, ACM, Article 8, pp. 62-67, 2010.
[4] D. Aqel and S. Vadera, “A Framework for Employee Appraisals Based on Inductive Logic Programming and Data Mining Methods”, In International Conference on Application of Natural Language to Information Systems, Springer, Berlin, pp. 404-407, 2013.
[5] L.H. Baniata, S.Y. Park, and S.B. Park, “A Neural Machine Translation Model for Arabic Dialects that Utilizes Multitask Learning”, Computational Intelligence and Neuroscience, 2018.
[6] A. Farghaly and K. Shaalan, “Arabic natural language processing: Challenges and solutions”, ACM Transactions on Asian Language Information Processing (TALIP), Vol. 8, No. 4, p. 14, 2009.
[7] J. Nivre, J. Hall, S. Kübler, R. McDonald, et al., “The CoNLL 2007 Shared Task on Dependency Parsing”, In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 915-932, 2007.
[8] A. Soudi, G. Neumann, and A. Bosch, “Introductory Chapter. In Arabic Computational Morphology: Knowledge-Based and Empirical Methods”, Arabic Computational Morphology, Springer, 2007.
[9] E. Othman, K. Shaalan, and A. Rafea, “A Chart Parser for Analyzing Modern Standard Arabic Sentence”, WS on Machine Translation for Semitic Languages: Issues and Approaches, pp. 37-44, 2003.
[10] E. Al Daoud and A. Basata, “A Framework to Automate the Parsing of Arabic Language Sentences”, International Arab Journal of Information Technology, Vol. 6, No. 2, pp. 191-195, 2009.
[11] M. Shatnawi and B. Belkhouche, “Parse Trees of Arabic Sentences Using the Natural Language Toolkit”, The 13th International Arab Conference on Information Technology (ACIT'2012), 2012.
[12] S. Alqrainy, H. Muaidi, and M. AlKoffash, “Context-Free Grammar Analysis for Arabic Sentences”, International Journal of Computer Applications, Vol. 53, No. 3, pp. 7-11, 2012.
[13] M. Ibrahim, M. Mahmoud, and D. El-Reedy, “Bel-Arabi: Advanced Arabic Grammar Analyzer”, International Journal of Social Science and Humanity, Vol. 6, No. 5, pp. 341-346, 2016.
[14] N. Hammouda and K. Haddar, “Parsing Arabic Nominal Sentences with Transducers to Annotate Corpora”, Computación y Sistemas, Vol. 21, No. 4, pp. 647-656, 2017.
[15] D. Abdelrazaq, S. Abu-Soud, and A. Awajan, “Distinguishing Nominal and Verbal Arabic Sentences: A Machine Learning Approach”, The International Arab Conference on Information Technology, 2017.
[16] N. Ababou, A. Mazroui, and R. Belehbib, “Parsing Arabic Nominal Sentences Using Context Free Grammar and Fundamental Rules of Classical Grammar”, International Journal of Intelligent Systems and Applications, Vol. 9, No. 8, p. 11, 2017.
[17] A. Al-Taani, M. Msallam, and S. Wedian, “A Top-Down Chart Parser for Analyzing Arabic Sentences”, The International Arab Journal of Information Technology, Vol. 9, No. 3, pp. 109-116, 2012.
[18] S. AlZubi, N. Islam, and M. Abbod, “Enhanced Hidden Markov Models for accelerating medical volumes segmentation”, In 2011 IEEE GCC Conference and Exhibition (GCC), IEEE, pp. 287-290, 2011.
[19] S. Al-Zu’bi, M. Al-Ayyoub, Y. Jararweh, and M.A. Shehab, “Enhanced 3D segmentation techniques for reconstructed 3D medical volumes: Robust and Accurate Intelligent System”, Procedia Computer Science, Vol. 113, pp. 531-538, 2017.
[20] H. Rezaee, A. Aghagolzadeh, M.H. Seyedarabi, and S. Al Zu'bi, “Tracking and occlusion handling in multi-sensor networks by particle filter”, In 2011 IEEE GCC Conference and Exhibition (GCC), IEEE, pp. 397-400, 2011.
[21] S. AlZu’bi, M. Shehab, M. Al-Ayyoub, Y. Jararweh, and B. Gupta, “Parallel implementation for 3D medical volume fuzzy segmentation”, Pattern Recognition Letters, 2018.
[22] S. AlZu’bi, B. Hawashin, M. Mujahed, Y. Jararweh, and B.B. Gupta, “An efficient employment of internet of multimedia things in smart and future agriculture”, Multimedia Tools and Applications, pp. 1-25, 2019.
[23] B. Bataineh and E. Bataineh, “An Efficient Recursive Transition Network Parser for Arabic Language”, In Proceedings of the World Congress on Engineering WCE, Vol. 2, pp. 124-127, 2009.
[24] S. Green and C. D. Manning, “Better Arabic Parsing: Baselines, Evaluations, and Analysis”, In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pp. 394-402, 2010.
[25] M. Al-Emran, S. Zaza, and K. Shaalan, “Parsing Modern Standard Arabic Using Treebank Resources”, In Int. Conf. on Information and Communication Technology Research, pp. 80-83, 2015.
[26] M. Maamouri and C. Cieri, “Resources for Arabic Natural Language Processing”, In International Symposium on Processing Arabic, Vol. 1, 2002.
