Abstract: Problem statement: Extensive research efforts in the area of Natural Language Processing (NLP) have focused on developing reading comprehension Question Answering (QA) systems for Latin-based languages such as English, French and German. Approach: However, little effort has been directed towards developing such systems for bidirectional languages such as Arabic, Urdu and Farsi. In general, QA systems are more sophisticated and more complex than Search Engines (SE) because they seek a specific and somewhat exact answer to the query. Results: Existing Arabic QA systems, including the most recently described, excluded one or both of the question types (How and Why) from their work because of the difficulty of handling these questions. In this study, we present a new approach and a new question-answering system (QArabPro) for reading comprehension texts in Arabic. The overall accuracy of our system is 84%. Conclusion/Recommendations: These results are promising compared to existing systems. Our system handles all types of questions, including (How and Why).

Key words: Arabic Q/A system, Information Retrieval (IR), Natural Language Processing (NLP), Arabic language, acronyms, Information Extraction (IE), morphological root, morphological analysis, QA systems, Stemming-root extraction
The backbone of any natural language application is a lexicon. It is critical for parsing, text generation, text summarization and for question answering systems (Abuleil and Evens, 1998). Furthermore, Information Retrieval is one of the first areas of natural language processing in which statistics were successfully applied (Hiemstra and De Vries, 2000). Two models of ranked retrieval were developed in the late 60s and early 70s and are still in use today: Salton's vector space model (Polettini, 2004) and the Robertson and Sparck Jones probabilistic model (Robertson and Sparck, 1976).

The rest of this study is structured as follows: Section two discusses related work. Section three describes a generic architecture for the new Arabic QA system and explains the new approach. Section four discusses testing and evaluation results of the new system. Section five discusses the results. Section six contains our conclusions and future work to further improve our QA system.

MATERIALS AND METHODS

Related work: Mohammed et al. (1993) introduced a QA system for Arabic called AQAS. The AQAS system is knowledge-based and therefore extracts answers from structured data rather than from raw text (unstructured text written in natural language); moreover, no results were reported. Hirschman et al. (1999) described an automated reading comprehension system for English text that can be applied to random stories and answers questions about them. Their work was executed on a corpus of 60 development and 60 test stories of 3rd and 6th grade materials. Each story was followed by a short-answer test: the tests ask the students to read a story or article and then answer questions about it to evaluate their understanding of the article. They used several metrics: precision, recall, HumSent (compiled by a human annotator who examined the texts and chose the sentence(s) that best answer the question) and AutSent (an automated routine that examines the texts and chooses the sentences that had the highest recall compared against the answer key). HumSent and AutSent compared the sentence chosen by the system to a list of acceptable answer sentences, scoring one point for a response on the list and zero points otherwise.

Rilo and Thelen (2000) developed a rule-based system called Quarc for English text that can read a short story and find the sentence presenting the best answer to a question. Each type of WH question looks for different types of answers, so Quarc used a separate set of rules for each question type (e.g., WHO, WHAT, WHEN, WHERE, WHY). Each rule gave a certain number of points to a sentence. After applying all rules, the sentence that obtained the highest score was returned as the answer. All question types are similar in using a common WordMatch function, which counts the number of words that appear in both the question and the sentence being considered. Two words match if they share the same morphological root (Rilo and Thelen, 2000).

Following the Rilo approach, Hammo et al. (2002) introduced a rule-based QA system for Arabic text called QARAB. QARAB excludes two types of questions, "كيف" (How) and "لماذا" (Why), because "they require long and complex processing." The QARAB system did not report any data regarding precision or recall. The system was evaluated by the developers (four native Arabic speakers). They fed 113 questions to the system and evaluated the correctness of the answers. Such testing cannot be reliable and is possibly biased. In addition, such accuracy was not achieved for any other language using state-of-the-art QA systems.

Rotaru and Litman (2005) worked on evaluating the process of combining the outputs of several question answering systems, to see whether the combination improved over the performance of any individual system. Their study included a variety of question answering systems on reading comprehension articles, especially the systems described in (Hirschman et al., 1999; Rilo and Thelen, 2000). The training and testing data for the question answering projects came from two reading comprehension datasets available from the MITRE Corporation for research purposes. They concluded that no combination of those systems was globally optimal; the best performing system varied both across and within datasets and by question type.

Kanaan et al. (2009) described a new Arabic QA system (QAS) using techniques from IR and NLP to answer short questions in Arabic. Similar to QARAB, QAS excludes the questions "كيف" (How) and "لماذا" (Why), citing the same reason cited by Hammo et al.; both stated that (How and Why) questions "require long and complex processing." Furthermore, the authors reported a test reference collection consisting of 25 documents gathered from the Internet, 12 queries (questions) and some relevant documents provided by the authors. Kanaan et al. reported recall levels {0, 10 and 20%} where the interpolated precision was equal to 100%, while at recall levels of 90 and 100% it was equal to 43%. We should also note that the study instructs the reader to see their results in a figure that is missing from the study.

In VSM a document is conceptually represented by a vector of keywords extracted from the document, with associated weights representing the importance of the keywords in the document and within the whole document collection. Similarly, a query is modeled as a list of keywords with associated weights representing
Am. J. Applied Sci., 8 (6): 652-661, 2011
the importance of the keywords in the query (Salton et al., 1975; Salton and Buckley, 1988).

Term weighting is an important task in many areas of Information Retrieval (IR), including Question Answering (QA), Information Extraction (IE) and Text Categorization. The purpose of term weighting is to assign to each term w found in a collection of text documents a specific score s(w) that measures the importance, with respect to a certain goal, of the information represented by the word. For instance, passage retrieval systems weigh the words of a document in order to discover important portions of text and discard irrelevant ones. Other applications, such as QA, snippet extraction, keyword extraction and automatic summarization, use term weighting for the same purpose.

There are many term weighting approaches, including IDF, TF.IDF, WIDF, ITF and log(1+TF). Term weighting techniques have been investigated heavily in the literature (Robertson and Sparck, 1976; Salton and Buckley, 1988; Rotaru and Litman, 2005). However, little consensus exists on which weighting method is best; different methods seem to work well for different goals (Kolda, 1997). Salton and Buckley (1988) confirmed that the most used document term weighting was obtained by the inner product of the within-document term frequency and the Inverse Document Frequency (idf), all normalized by the length of the document. Salton and Buckley proposed (augmented normalized term frequency * idf), normalized by cosine, as the best term weighting scheme. Further discussion follows in section 3.

Polettini (2004) analyzed and compared different techniques for term weighting and how much normalization improves retrieval of relevant documents. He presented two reasons that necessitate the use of normalization in term weights. According to Polettini (2004), the success or failure of the vector space method depends on term weighting: term weighting plays an important role in the similarity measure that indicates the most relevant document.

The new QA system (QArabPro): The new QA system assumes that the answer is located in one document (i.e., it does not span multiple documents). With this in hand, the processing cycle of a QA system is composed of the following steps (Hammo et al., 2002):

• Process the input question and formulate the query
• Retrieve the candidate documents that contain answers using an IR system
• Process each candidate document in the same way as the question is processed
• Return sentences that may contain the answer

The generic architecture of our QA system is shown in Fig. 1. It is composed of the following components:

• Question Analysis - question classification and query formulation
• IR System - document (passage) retrieval
• NLP System - answer extraction

Question analysis-query reformulation: In general, question understanding requires deep semantic processing, which is a non-trivial task in NLP. Furthermore, Arabic NLP research at the semantic level is still immature (Hammo et al., 2002; Mohammed et al., 1993; Kanaan et al., 2009). Therefore, current Arabic QA systems do not attempt to understand the content of the question at the semantic level. Instead, they rely on shallow language understanding, i.e., the QA system uses keyword-based techniques to locate relevant passages and sentences from the retrieved documents (Hammo et al., 2002).

Stemming-root extraction: Conflating the various forms of the same word to its root form, called stemming in IR jargon, is the most critical and the most difficult process, especially for Arabic. The root is the primary lexical unit of a word; it carries the most significant aspects of semantic content and cannot be reduced into smaller constituents (Root and Stem, 2010). A stem, on the other hand, is the main part of a word to which prefixes and suffixes can be added, and it may not necessarily be a word itself. For example, the English word friendships contains the stem friend, to which the derivational suffix -ship is attached to form a new stem friendship, to which the inflectional suffix -s is attached (Root and Stem, 2010). Another example where the stem may not be a word itself is "dod" as the stem in "doddle." The extracted roots are used for indexing purposes.

Several studies suggested that indexing Arabic text using roots significantly increases retrieval effectiveness over the use of words (Abuleil and Evens, 1998; Hammo et al., 2002; Habash and Rambow, 2005; Khoja, 1999; Al-Kharashi and Evens, 1994). Stemming in Arabic is difficult compared to English. English has very little inflection and hence a tendency to have words that are very close to their respective roots. The distinction between the word as a unit of speech and the root as a unit of meaning is more important for languages whose roots take many different forms when used in actual words, as is the case in Arabic.
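The weighting scheme Salton and Buckley recommend above (augmented normalized term frequency multiplied by idf, then cosine-normalized) can be sketched as a small function. This is a minimal illustration under stated assumptions, not the paper's implementation, and the toy documents used to exercise it are invented:

```python
import math

def tfidf_vector(doc_terms, collection):
    """Weight one document's terms by augmented normalized tf * idf,
    then cosine-normalize the resulting vector (Salton and Buckley, 1988)."""
    n_docs = len(collection)
    counts = {t: doc_terms.count(t) for t in set(doc_terms)}
    max_tf = max(counts.values())
    vec = {}
    for t, tf in counts.items():
        aug_tf = 0.5 + 0.5 * tf / max_tf           # augmented normalized term frequency
        df = sum(1 for d in collection if t in d)   # number of documents containing t
        idf = math.log(n_docs / df)                 # inverse document frequency
        vec[t] = aug_tf * idf
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else vec
```

Ranking a document against a query then reduces to a dot product of the two normalized vectors.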
language and users mostly formulate questions using words that might not appear in the base document. Therefore, if the retrieved passage does not contain all or part of the question keywords, the question extraction module will not provide the expected answer. To avoid such problems, a Query Expansion (QE) process is used to generate new keywords that may exist in the base document, which in turn improves the performance of the whole system.

The user query can be extended by adding new words deemed to be somehow (usually semantically) connected to those contained in the initial query. QE modules generally use a dictionary of synonyms, a thesaurus, an ontology, or an index file storing words with similar roots. Because Arabic is highly inflectional and derivational, morphological analysis is a very complex task. Therefore, we need to consider the most important derivations in query reformulation to find all related words in a document; one example is the definite article ال, which is sometimes not added at the beginning of some words. The QE module in our system uses a small dictionary of synonyms. For more details about QE and query correction techniques, please see (Rachidi et al., 2003; Abdelali et al., 2003).

Query type: Questions are classified based on a set of known "question types". Question types are instrumental in determining the type of processing needed to identify and extract the final answer. Table 1 shows the main interrogative particles that precede the questions and determine what types of answers are expected. While previous Q/A systems excluded question types that are difficult to handle, our QA system handles all question types listed in Table 1.

Rule based Q/A systems: A rule-based system for question answering is a system that looks for evidence that a sentence contains the answer to a question (Rilo and Thelen, 2000). Each type of question looks for different types of answers, so a question answering system uses a separate set of rules for each question type, such as (من، متى، لماذا، أين). The rules we adapted in our system are generally similar to those used in a rule-based QA for English text (Hirschman et al., 1999). However, the rules are modified and enhanced to accommodate the many unique aspects of Arabic text. These modifications are very critical to the process of determining the correct answer. Each rule awards a certain number of points to a sentence. After applying the rules, the sentence with the highest score is marked as the answer.

All question types share a common word matching function that counts the number of words that appear in both the question and the sentence under consideration. The word match function first removes stopwords (such as "و" and "إذا") from a sentence and then matches the remaining words against the words in the question. Two words match if they share the same morphological root. Because verbs are very important in determining when a question and a sentence are related, verb matches are weighted more heavily than non-verb matches: matching verbs are awarded (4) points each and other matching words are awarded (2) points each. The remaining rules used by the question answering system look for a variety of clues. Lexical clues look for specific words or phrases. Unless a rule indicates otherwise, words are compared using their morphological roots. Some rules can be satisfied by the lexical items themselves; these rules are written using set notation.

Each rule awards a specific number of points to a sentence, depending on how strongly the rule believes that it found the answer. A rule can assign four possible levels of points: clue (+3), good clue (+4), confident (+6) and slam dunk (+20). The main purpose of these values is to assess the relative importance of each clue.

Figure 2 shows the rules for (Who/Whose) "من", which use three fairly general heuristics as well as the word match function (rule #1). If the question (Q) does not contain any names, then rules #2 and #3 assume that the question is looking for a name. Rule #2 rewards sentences that contain a recognized NAME and rule #3 rewards sentences that contain the word "name". Rule #4 awards points to all sentences that contain either a name or a reference to a human (often an occupation, such as "writer" "كاتب"). Note that more than one rule can apply to a sentence, in which case the sentence is awarded points by all of the rules that apply.

The (What/Which) "ما" questions were among the most difficult to handle because they sought an amazingly wide variety of answers. However, Fig. 3 shows a few specific rules that worked reasonably well. Rule #1 is the generic word matching function shared by all question types. Rule #2 rewards sentences that contain a date expression if the question contains a month of the year; this rule handles questions that ask what happened on a specific date. We also noticed several questions that look for a description of an object; rule #3 addresses these questions by rewarding sentences that contain the corresponding descriptive words. Rule #4 looks for words associated with names in both the question and the sentence.

The rule set for (When) "متى" questions, shown in Fig. 4, is the only rule set that does not apply the word match function to every sentence in the text.
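The base word matching step described above (stopwords removed, comparison by morphological root, verbs +4 points, other matches +2) can be sketched as follows. The `ROOT` and `STOPWORDS` tables here are tiny hypothetical stand-ins for the system's real root extractor and its 1457-word stopword list:

```python
# Hypothetical stand-ins: the real system uses a morphological root extractor
# and a 1457-entry stopword list; these toy tables only illustrate the scoring.
STOPWORDS = {"و", "في", "اذا"}
ROOT = {"كتب": "كتب", "يكتب": "كتب", "الكاتب": "كتب", "درس": "درس"}

def word_match(question, sentence, verbs):
    """Base score for a sentence: +4 for each verb whose root also occurs in
    the question, +2 for every other root match (rule #1 of every rule set)."""
    q_roots = {ROOT.get(w, w) for w in question if w not in STOPWORDS}
    score = 0
    for w in sentence:
        if w in STOPWORDS:
            continue                      # stopwords never contribute
        if ROOT.get(w, w) in q_roots:     # match on shared morphological root
            score += 4 if w in verbs else 2
    return score
```

Scores from the remaining rules (clue +3, good clue +4, confident +6, slam dunk +20) would be added on top of this base score before the highest-scoring sentence is selected.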
The nouns after "كم" must always be singular and in the accusative case. If the noun following "كم" were part of a genitive construction, it would not be in the accusative case but in the regular nominative case. Furthermore, the noun following "كم" can be omitted from the question. When asking about price, "كم" will be preceded by one of the prepositions meaning "in/by/with" and the noun is generally omitted. Furthermore, "كم" can be used in a style that states numerousness instead of interrogation. These are a few of the many cases that govern this type of question; there are many more.

Figure 7 shows a few specific rules that worked very well with this type of question. Rule #1 is the generic word matching function shared by all question types. Rule #2 rewards sentences that contain numbers as the answer to a question containing one or more measurement criteria; it handles questions that ask about a countable/uncountable noun where the predicted answer contains an exact number. Rule #3 rewards sentences that contain approximation words and handles questions that ask about a countable/uncountable noun where the predicted answer contains an approximation rather than an exact number. Rule #4 rewards sentences that contain quantity words and handles questions that ask about a countable/uncountable noun where the predicted answer contains an exact number.

Fig. 7: Rules for "كم" (How many/much)

Table 1: Question types processed by the question answering system
Question word               Query type
Who/Whose "من"              Person
When "متى"                  Date, Time
What/Which "ما/ماذا"         Organization, Product, Event
Why "لماذا"                  Reason
Where "أين"                 Location
How many/much "كم"          Number/Quantity

RESULTS

We tested our system using a collection of reading comprehension texts. The data used were collected from WIKIPEDIA (Root and Stem, 2010). The data set contains 75 reading comprehension tests with 335 questions. The HumSent answers are sentences that a human expert judged to be the best answer for each question. The AutSent answers are generated automatically by determining which sentence carries the highest weight, excluding stopwords. Our parser uses a small dictionary, so that words can be defined with semantic classes. The semantic classes used by our system, along with a description of the words assigned to each class, are the following:

• Human: 52 words, including titles such as "doctor"
• Location: 135 words, including country names and city names
• Names: 621 words, including common first and last names
• Times: 42 words, including years, the 12 months and the 7 days
• Stopwords: 1457 words, including words that do not have meaning on their own
• Criteria: 34 words, enumerating measurement criteria for countable and uncountable nouns

Table 2 shows the evaluation results of our QA system for each type of question. Figure 8 shows a summary of each question type and its corresponding accuracy. The system achieved 84% overall accuracy. The system performed best on the (Who/Whose) "من", (Where) "أين" and (What/Which) "ما" questions and performed worst on the (Why) "لماذا" and (How many/much) "كم" questions, reaching only 62% and 69% accuracy respectively. The low results for the "لماذا" and "كم" questions were expected because of the difficulty of handling them; these questions were completely excluded by the QA systems introduced by (Hammo et al., 2002; Kanaan et al., 2009).

Table 2: Overall results
Question type        Total # of questions    Correct answers    Incorrect answers    Correct percent
من (Who/Whose)       53                      50                 3                    94.34%
متى (When)           48                      44                 4                    91.67%
ماذا (What)          45                      40                 5                    88.89%
أين (Where)          47                      44                 3                    93.62%
ما (Which)           54                      46                 8                    85.19%
لماذا (Why)          45                      28                 17                   62.22%
كم (How many/much)   43                      30                 13                   69.77%
Overall              335                     282                53                   84.18%
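The overall accuracy reported above follows directly from the per-type counts in Table 2; a quick check of the arithmetic (the Arabic labels are assumed from Table 1's question words, the counts are from the paper):

```python
# Table 2 counts per question word: (total questions, correct answers).
table2 = {
    "من":    (53, 50),   # Who/Whose
    "متى":   (48, 44),   # When
    "ماذا":  (45, 40),   # What
    "أين":   (47, 44),   # Where
    "ما":    (54, 46),   # Which
    "لماذا": (45, 28),   # Why
    "كم":    (43, 30),   # How many/much
}
total = sum(n for n, _ in table2.values())     # 335 questions overall
correct = sum(c for _, c in table2.values())   # 282 answered correctly
accuracy = 100 * correct / total               # overall accuracy, about 84.18%
```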
Hiemstra, D. and A. De Vries, 2000. Relating the new language models of information retrieval to the traditional retrieval models. Technical Report TR-CTIT-00-09, Centre for Telematics and Information Technology.

Hirschman, L., M. Light, E. Breck and J. Burger, 1999. Deep Read: A reading comprehension system. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics.

Kanaan, G., A. Hammouri, R. Al-Shalabi and M. Swalha, 2009. A new question answering system for the Arabic language. Am. J. Applied Sci., 6: 797-805. ISSN 1546-9239.

Katz, B., J. Lin and S. Felshin, 2001. Gathering knowledge for a question answering system from heterogeneous information sources. In: Proceedings of the Workshop on Human Language Technology, ACL-2001, Toulouse.

Khoja, S., 1999. Stemming Arabic text. http://zeus.cs.pacificu.edu/shereen/research.htm#stemming.

Kolda, T., 1997. Limited-memory matrix methods with applications. PhD Thesis, Applied Mathematics Program, University of Maryland at College Park, 59-68.

Lundquist, C., D. Grossman and O. Frieder, 1999. Improving relevance feedback in the vector space model. In: Proceedings of the 6th ACM Annual Conference on Information and Knowledge Management (CIKM), pp: 16-23.

Mohammed, F.A., K. Nasser and H.M. Harb, 1993. A knowledge-based Arabic Question Answering System (AQAS). ACM SIGART Bulletin, pp: 21-33.

Polettini, N., 2004. The vector space model in information retrieval - term weighting problem. University of Trento. http://sra.itc.it/people/polettini/STUDYS/Polettini_Information_Retrieval.pdf (accessed Aug. 2009).

Rachidi, T., M. Bouzoubaa, L. ElMortaji, B. Boussouab and A. Bensaid, 2003. Arabic user search query correction and expansion. In: Proceedings of COPSTIC'03, Rabat, Dec. 11-13.

Rilo, E. and M. Thelen, 2000. A rule-based question answering system for reading comprehension tests. In: Proceedings of the ANLP/NAACL 2000 Workshop on Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems.

Robertson, S.E. and K. Sparck, 1976. Relevance weighting of search terms. J. Am. Soc. Inform. Sci., 27: 129-146.

Root and Stem (linguistics), 2010. http://ar.wikipedia.com.

Rotaru, M. and D. Litman, 2005. Improving question answering for reading comprehension tests by combining multiple systems. In: Proceedings of the AAAI 2005 Workshop on Question Answering in Restricted Domains.

Salton, G. and C. Buckley, 1988. Term weighting approaches in automatic text retrieval. Inform. Process. Manage.

Salton, G., A. Wong and C.S. Yang, 1975. A vector space model for automatic indexing. Commun. ACM, 18: 613-620.

UN Department for General Assembly and Conference Management, 2008. http://www.un.org/Depts/DGACM/faq_languages.htm.