
miraQA: Experiments with Learning Answer Context Patterns from the Web

César de Pablo-Sánchez1, José Luis Martínez-Fernández1, Paloma Martínez1, and Julio Villena2
1 Advanced Databases Group, Computer Science Department, Universidad Carlos III de Madrid, Avda. Universidad 30, 28911 Leganés, Madrid, Spain
{cdepablo, jlmferna, pmf}
2 DAEDALUS – Data, Decisions and Language S.A., Centro de Empresas “La Arboleda”, Ctra. N-III km 7,300, 28031 Madrid, Spain

Abstract. We present miraQA, MIRACLE’s first Question Answering system for monolingual Spanish. The general architecture of the system developed for QA@CLEF 2004 is presented, together with evaluation results. miraQA is characterized by learning answer extraction rules from the Web using a Hidden Markov Model of the context in which answers appear. We follow a supervised approach that uses questions and answers from last year’s evaluation set for training.

1 Introduction
Question Answering has received a lot of attention in recent years due to advances in IR and NLP. As in other applications in these areas, the bulk of the research has been in English, while perhaps one of the most interesting applications of QA systems lies in cross-lingual and multilingual scenarios. Access to precise, high-quality information in a language that the user does not speak, or understands only poorly, would be advantageous over current IR systems in many situations. QA@CLEF [8] has encouraged the development of QA systems in languages other than English and in cross-lingual scenarios.

QA systems are usually complex because of the number of different modules they use and the need for good integration among them. Even if questions expect a simple fact or a short definition as an answer, the requirement of more precise information has entailed the use of language- and domain-specific modules. On the other hand, some other approaches relying on data-intensive [4], machine learning and statistical techniques [10] have achieved widespread and relative success. Moreover, the interest of these approaches for multilingual QA systems lies in the possibility of adapting them quickly to other target languages.

In this paper we present our first approach to the QA task. As we have not taken part before in any of the QA evaluation forums, most of the work has gone into integrating different available resources. So far, the system we present is targeted only
C. Peters et al. (Eds.): CLEF 2004, LNCS 3491, pp. 494 – 501, 2005. © Springer-Verlag Berlin Heidelberg 2005



to the monolingual Spanish task. The system explores the use of Hidden Markov Models (HMM) [9] for Answer Extraction and uses Google to collect training data. The results show that further improvements and tuning are needed, both in the system and in the answer extraction method. We expect to continue working on this system to enhance its results and to investigate the suitability of the approach for different languages.

2 Description
miraQA, the system that the MIRACLE group has developed for QA@CLEF 2004, represents our first attempt at the Question Answering task. The system has been developed for the monolingual Spanish subtask, as we are familiar with the available tools for Spanish. Although we only address this task, we believe that our approach to Answer Extraction could be easily adapted to other target languages, as it uses resources available for most languages, such as POS (Part-Of-Speech) taggers and partial parsers. The architecture of the system follows the usual structure of a QA system, with three modules as shown in Figure 1: Question Analysis, Document Retrieval and Answer Extraction.
Fig. 1. miraQA architecture. The Question Analysis module (POS tagging and parsing, question classifier) produces the QA class and a list of semantically tagged terms; the Document Retrieval module (IR engine over the EFE94/95 collection) returns documents and extracted sentences; the Answer Extraction module (POS tagging and parsing, sentence extraction, anchor searching, answer recognition with the QA class model, answer ranking) produces the final answer

Besides the modules used in the question-answering phase, our approach requires a system to train the models used for answer recognition. This system uses pairs of questions and answers to query Google and selects relevant snippets that contain the answer and other question terms. In order to build the models we have used the QA@CLEF 2003 [7] evaluation set, with questions and the answers identified by the judges.
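The selection of training material can be sketched as follows. This is a simplified illustration, not the actual implementation: the function name and the crude sentence splitting are our own, and the real system works on analyzed snippets as described in Sect. 2.4.

```python
# Sketch of training-data selection (illustrative; names are our own).
# Keep only snippet sentences that contain the answer string and at
# least one other question term, as these train the context model.

def select_training_sentences(snippets, question_terms, answer):
    selected = []
    for snippet in snippets:
        for sentence in snippet.split(". "):  # crude sentence split
            s = sentence.lower()
            if answer.lower() not in s:
                continue
            if any(t.lower() in s for t in question_terms):
                selected.append(sentence)
    return selected

snippets = ["Zagreb es la capital de Croacia y, junto a Belgrado, es una gran ciudad",
            "Croacia declaró su independencia en 1991"]
sents = select_training_sentences(snippets, ["capital", "Croacia"], "Zagreb")
# only the first snippet qualifies: it contains the answer and a question term
```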




2.1 Question Analysis

This module classifies questions and selects the terms that are relevant for later processing stages. We have used a taxonomy of 17 different classes in our system, presented in Table 1. The criteria for the choice of classes consider the type of the answer, the question form, and the relation of the question terms to the answer. Therefore, we refer to classes in this taxonomy as question-answer (QA) classes. General QA classes were split into more specific classes depending on the number of examples in last year’s evaluation set. As we were planning to use a statistical approach for answer extraction, we also required enough examples in every QA class, which determines when to stop subdividing.
Table 1. Question answer (QA) classes used in miraQA

Name  Person  Group  Count
Time  Year  Month  Day
Location  Country  City_0  City_1  Rest
Cause  Manner  Definition  Quantity
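Question classification with hand-written rules over the parsed question can be sketched as follows. This is a hypothetical, simplified representation, chunks as (text, lemma, syntactic tag) triples and rules as per-chunk constraints; the actual rules operate on ms-tools parse trees.

```python
# Sketch of rule-based question classification (hypothetical simplification).
# A question is a list of chunks; each chunk is (text, lemma, syntactic tag).
# A rule tests a token, lemma, or syntactic constraint per chunk and, on a
# full match, yields the QA class plus semantic tags for selected chunks.

def classify(chunks, rules):
    """Return (qa_class, {chunk_index: semantic_tag}) for the first matching rule."""
    for qa_class, pattern in rules:
        if len(chunks) != len(pattern):
            continue
        tags = {}
        for i, ((text, lemma, syn), (kind, value, sem)) in enumerate(zip(chunks, pattern)):
            if kind == "token" and text.lower() != value:
                break
            if kind == "lemma" and lemma != value:
                break
            if kind == "syn" and syn != value:
                break
            if sem:
                tags[i] = sem
        else:  # no break: every chunk satisfied its constraint
            return qa_class, tags
    return "rest", {}

# Toy version of the city_1 rule for "¿Cuál es la capital de Croacia?"
rules = [("city_1", [
    ("token", "¿", None),
    ("token", "cuál", None),          # C/: match as token
    ("lemma", "ser", None),           # S/: match as lemma
    ("token", "capital", "##REL##"),  # M/: assign a semantic tag
    ("syn", "grup-sp", "##COUNTRY##"),
    ("token", "?", None),
])]

question = [("¿", "¿", "Fia"), ("Cuál", "cuál", "sn"), ("es", "ser", "grup-verb"),
            ("capital", "capital", "sn"), ("de Croacia", "de croacia", "grup-sp"),
            ("?", "?", "Fit")]

qa_class, sem_tags = classify(question, rules)
# qa_class == "city_1"; sem_tags marks the "capital" and "de Croacia" chunks
```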

In this module, questions are analyzed using ms-tools [1], a language processing package that contains a POS tagger (MACO) and a partial parser (TACAT), as well as other tools such as a Named Entity Recognition and Classification (NERC) module. MACO is able to recognize basic proper names (np), but the NERC module is needed to classify them. As this module is built with a statistical approach trained on a corpus of a different genre, its accuracy was not good enough for questions and we decided not

Fig. 2. Analysis of question #1 in the QA@CLEF 2003 evaluation set, “¿Cuál es la capital de Croacia?”: the chunked parse (POS symbols and states), with the ##REL## tag assigned to the “capital” chunk (sn) and the ##COUNTRY## tag assigned to the “de Croacia” chunk (grup-sp)



to use it. We also modified TACAT to prevent prepositional attachment, as this was more appropriate for our purposes.

Once the questions are tagged and parsed, a set of manually developed rules is used to classify them. This set of rules is also used to assign a semantic tag to some of the chunks, according to the class they belong to. These tags are a crude attempt to represent the main relations between the answer and the units appearing in the question. A simple example for the question “¿Cuál es la capital de Croacia?” (“What is the capital city of Croatia?”) is shown in Figure 2, together with the rule that is applied. The rule classifies the question into the city_1 QA class and assigns (M/) the ##REL## and ##COUNTRY## semantic tags (C/ means that the word is matched as a token, S/ as a lemma):

{13,city_1,S_[¿_Fia sn_[C/cuál] grup-verb_[S/ser] sn_[C/capital;M/##REL##] M/##COUNTRY## ?_Fit ]}

2.2 Document Retrieval

The IR module retrieves the most relevant documents for a query and extracts those sentences that contain any of the words used in the query. Words that were assigned a semantic tag during question analysis are used to build the query; for robustness, the query is scanned again to remove stopwords. Our system uses the Xapian probabilistic engine to index and search for the most relevant documents. The last step of the retrieval module tokenizes the documents using the DAEDALUS Tokenizer and extracts the sentences that contain relevant terms. The system assigns two scores to every sentence: the relevance measure that Xapian gives to the document, and a figure proportional to the number of query terms found in the sentence.

2.3 Answer Extraction

The answer extraction module uses a statistical approach to answer pinpointing, based on a syntactic-semantic context model of the answer built for each of the classes that the system uses. The following operations are performed: 1.
Parsing and Anchor Searching. Sentences selected in the previous step are tagged and parsed using ms-tools. Chunks that contain any of the question terms are retagged with their semantic tags and are used as anchors. Finally, the system selects pieces in a window of words around the anchor terms that pass to the next phase. 2. Answer Recognition. For every QA class we have previously trained an HMM that models the context of answers found in Google snippets, as explained later. A variant of the N-best recognition strategy is used to identify the most probable sequence of states (syntactic and semantic tags) that originated the POS sequence. A special semantic tag that identifies the answer (##ANSWER##) represents the




state where the words that form the answer are generated. The recognition algorithm is guided to visit the states marked as anchors in order to find a path that passes through the answer state. The algorithm assigns a score to every computed path and candidate answer based on the log probabilities of the HMM. 3. Ranking. Candidate answers are normalized (stopwords are removed) and ranked according to a weighted score that takes into account their length, the scores of the original documents and sentences, and the paths followed during recognition.
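The decoding step can be illustrated with a plain Viterbi search over a toy model. This is a sketch under simplifying assumptions: the paper uses an N-best variant guided through anchor states, while here a single best path is decoded and the ##ANSWER## span read off, and all probabilities are invented for the example.

```python
# Sketch of answer pinpointing via Viterbi decoding (illustrative only).
import math

def viterbi(symbols, states, start_p, trans_p, emit_p):
    """Most probable state sequence for an observed POS-tag sequence."""
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][symbols[0]]), [s])
          for s in states}]
    for sym in symbols[1:]:
        row = {}
        for s in states:
            best_prev = max(states, key=lambda p: V[-1][p][0] + math.log(trans_p[p][s]))
            score = (V[-1][best_prev][0] + math.log(trans_p[best_prev][s])
                     + math.log(emit_p[s][sym]))
            row[s] = (score, V[-1][best_prev][1] + [s])
        V.append(row)
    return max(V[-1].values())[1]

# Toy model for "Zagreb es la capital ...": states are chunk tags,
# symbols are POS tags (np = proper noun, vs = verb, da = article, nc = noun).
states = ["##ANSWER##", "g-v", "##REL##"]
start_p = {"##ANSWER##": 0.8, "g-v": 0.1, "##REL##": 0.1}
trans_p = {"##ANSWER##": {"##ANSWER##": 0.2, "g-v": 0.7, "##REL##": 0.1},
           "g-v": {"##ANSWER##": 0.1, "g-v": 0.2, "##REL##": 0.7},
           "##REL##": {"##ANSWER##": 0.1, "g-v": 0.1, "##REL##": 0.8}}
emit_p = {"##ANSWER##": {"np": 0.7, "vs": 0.1, "da": 0.1, "nc": 0.1},
          "g-v": {"np": 0.1, "vs": 0.7, "da": 0.1, "nc": 0.1},
          "##REL##": {"np": 0.1, "vs": 0.1, "da": 0.4, "nc": 0.4}}

words = ["Zagreb", "es", "la", "capital"]
path = viterbi(["np", "vs", "da", "nc"], states, start_p, trans_p, emit_p)
answer = [w for w, s in zip(words, path) if s == "##ANSWER##"]
# decoded path visits ##ANSWER## on the first np, giving "Zagreb"
```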
Fig. 3. Answer extraction for “Zagreb es la capital de Croacia y, junto a Belgrado, es…” (“Zagreb is the capital city of Croatia and, together with Belgrade, is…”). The model suggests the most probable sequence of states (##ANSWER##, grup-verb, ##REL##, ##COUNTRY##, …) for the sequence of POS tags and assigns ##ANSWER## to the first np (proper noun), giving “Zagreb” as the candidate answer

Fig. 4. Architecture for the training of the extraction models: questions are analyzed as in the main system (POS tagging and parsing, question classification), question terms and answers are combined into Web IR queries, the retrieved snippets are tagged and parsed, anchors are searched, and the QA class model is trained

2.4 Training for Answer Recognition

The models used in the answer extraction phase are trained beforehand from examples. For training the models we have used questions and answers from CLEF 2003. Questions are analyzed as in the main QA system. Question terms and answer strings



are combined and sent to Google using the Google API. Snippets for the top 100 results are retrieved and stored to build the model. They are split into sentences, the sentences are analyzed, and finally the terms that appeared in the question are tagged. The tag is either the semantic class assigned to that term in the question or the answer tag (##ANSWER##). Only sentences containing the answer and at least one of the other semantic tags are selected to train the model.

In order to extract answers we train an HMM in which the states are the syntactic-semantic tags assigned to the chunks and the symbols are POS tags. To estimate the transition and emission probabilities of the automaton, we count the frequencies of the bigrams for POS-POS and POS-CHUNK pairs; in addition, a simple add-one smoothing technique is applied. For every QA class we train a model that is used to estimate the score of a given sequence and to identify the answer, as explained above.
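A minimal sketch of this estimation step, under our reading of the model, with chunk-tag transitions and chunk-to-POS emissions, both add-one smoothed; the tag names and the toy sentence are illustrative:

```python
# Sketch of HMM parameter estimation from tagged training sentences
# (illustrative; real counts come from ms-tools analyses of snippets).
# Each training sentence is a list of (pos_tag, chunk_tag) pairs.
from collections import Counter

def estimate(sentences, chunk_tags, pos_tags):
    trans, emit = Counter(), Counter()
    for sent in sentences:
        for (pos, chunk) in sent:
            emit[(chunk, pos)] += 1                 # chunk -> POS emission
        for (_, c1), (_, c2) in zip(sent, sent[1:]):
            trans[(c1, c2)] += 1                    # chunk-tag bigram
    V_c, V_p = len(chunk_tags), len(pos_tags)
    # Add-one smoothing: every unseen event gets a pseudo-count of 1.
    trans_p = {c1: {c2: (trans[(c1, c2)] + 1) /
                        (sum(trans[(c1, x)] for x in chunk_tags) + V_c)
                    for c2 in chunk_tags} for c1 in chunk_tags}
    emit_p = {c: {p: (emit[(c, p)] + 1) /
                     (sum(emit[(c, x)] for x in pos_tags) + V_p)
                  for p in pos_tags} for c in chunk_tags}
    return trans_p, emit_p

train = [[("np", "##ANSWER##"), ("vs", "g-v"), ("da", "##REL##"), ("nc", "##REL##")]]
trans_p, emit_p = estimate(train, ["##ANSWER##", "g-v", "##REL##"],
                           ["np", "vs", "da", "nc"])
# every row of trans_p and emit_p sums to 1 after smoothing
```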

3 Results
We submitted one run for the monolingual Spanish task (mira041eses) that provides one exact answer to every question. Our system is unable to compute a confidence measure, so we limited ourselves to assigning the default value of 0. There are two main kinds of questions, factoid and definition, and we have tried the same approach for both of them. Besides, the question set contains some questions whose answer cannot be found in the document corpus; the valid answer in that case is the NIL string.
Table 2. Results for run mira041eses

Question type   Right   Wrong   IneXact   Unsupported
Factoid           18     154       4          1
Definition         0      17       3          0
Total             18     174       7          1

The results we have obtained are fairly low compared with those of other systems. We attribute them to the fact that the system is at a very early stage of development and tuning. The analysis of correct and wrong answers yields several conclusions that will guide our future work.

The extraction algorithm works better for factoid questions than for definitional ones. Among factoid questions, results are also better for certain QA classes (DATE, NAME...), which appear with higher frequency in our training set. For other QA classes (MANNER, DEFINITION) there were not enough examples to build a reliable model. Another noteworthy fact is that our HMM algorithm is somewhat greedy when trying to identify the answer, and it shows some preference for words appearing near anchor terms. Finally, the algorithm is actually doing two jobs at once, as it identifies





answers and, in some way, recognises answer types or entities according to patterns that were present in the training answers of the same kind.

Another source of errors in our system is induced by the document retrieval process and the way we pose queries and score documents. All the terms selected from the question receive the same relevance, when it is clear that weighting proper names higher would favour the retrieval of more precise documents. Besides, the simple scoring schema we used for sentences (one term, one point) contributes to masking some of the useful fragments.

Finally, some errors are also generated during the question classification step, as it is unable to handle some of the new surface forms introduced in this year’s question set. For that reason a catch-all class was defined and used as a ragbag, but results were not expected to be good for that class. Moreover, POS tagging with MACO fails more frequently on questions, and these errors are propagated to the partial parsing; our limited set of rules was not able to cope with some of these inaccurate parses.

The evaluation also reports the percentage of correct NIL answers. We returned 74 NIL answers and only 11 of them were correct (14.86%). NIL values were returned when the process did not provide any answer, and their high number is due to the chaining of the other problems mentioned above.

4 Future Work
Several lines of further research are open, following the deficiencies that we have detected in the different modules of our system. One of the most straightforward improvements is the recognition of Named Entities and other specific types, which should entail changes and improvements in several modules. We believe these improvements could enhance precision in answer recognition and also in retrieval.

With regard to the Question Analysis module, we are planning to improve the QA taxonomy as well as the coverage and precision of the rules. We are considering manual and automatic methods for the acquisition of classification rules. Besides the use of NEs in the Document Retrieval module, we need to improve its interface with the other two main subsystems; we plan to develop better strategies for transforming questions into queries and a more effective scoring mechanism.

Results suggest that the answer extraction mechanism could work properly with appropriate training. We are interested in determining the amount of training data that would be needed to improve recognition results, and we would likely need to acquire or generate a larger question-answer corpus. In the same line, we expect to experiment with different finite-state approaches and learning techniques.

In a cross-cutting line, our interest lies in the development of multilingual and cross-lingual QA. Some attempts were already made for this campaign in order to face more target languages, but they revealed that question classification needs a more robust approach to accept the output of current machine translation systems, at least for questions. Finally, we would like to explore whether our statistical approach to answer recognition is practical for other languages.



Acknowledgements

This work has been partially supported by the projects OmniPaper (European Union, 5th Framework Programme for Research and Technological Development, IST-2001-32174) and MIRACLE (Regional Government of Madrid, Regional Plan for Research, 07T/0055/2003). Special mention should be made of our colleagues in the MIRACLE group: Ana García-Serrano, José Carlos González, José Miguel Goñi and Javier Alonso.

References

1. Abney, S., Collins, M., Singhal, A.: Answer Extraction. In: Proceedings of Applied Natural Language Processing (ANLP-2000) (2000)
2. Atserias, J., Carmona, J., Castellón, I., Cervell, S., Civit, M., Màrquez, L., Martí, M.A., Padró, L., Placer, R., Rodríguez, H., Taulé, M., Turmo, J.: Morphosyntactic Analysis and Parsing of Unrestricted Spanish Text. In: Proceedings of the 1st International Conference on Language Resources and Evaluation (LREC'98), Granada, Spain (1998)
3. Baeza-Yates, R., Ribeiro-Neto, B. (eds.): Modern Information Retrieval. Addison Wesley, New York (1999)
4. Brill, E., Lin, J., Banko, M., Dumais, S., Ng, A.: Data-Intensive Question Answering. In: Proceedings of TREC 2001 (2001)
5. Jurafsky, D., Martin, J.H.: Speech and Language Processing. Prentice Hall, Upper Saddle River, New Jersey (2000)
6. Manning, C., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press (1999)
7. Magnini, B., Romagnoli, S., Vallin, A., Herrera, J., Peñas, A., Peinado, V., Verdejo, F., de Rijke, M.: The Multiple Language Question Answering Track at CLEF 2003 (2003)
8. Magnini, B., Vallin, A., Ayache, C., Erbach, G., Peñas, A., de Rijke, M., Rocha, P., Simov, K., Sutcliffe, R.: Overview of the CLEF 2004 Multilingual Question Answering Track. In: Peters, C., Clough, P., Gonzalo, J., Jones, G., Kluck, M., Magnini, B. (eds.): Fifth Workshop of the Cross-Language Evaluation Forum (CLEF 2004), Lecture Notes in Computer Science (LNCS), Springer, Heidelberg, Germany (2005)
9. Mérialdo, B.: Tagging English Text with a Probabilistic Model. Computational Linguistics 20 (1994) 155-171
10. Ravichandran, D., Hovy, E.H.: Learning Surface Text Patterns for a Question Answering System. In: Proceedings of the 40th ACL Conference, Philadelphia, PA (2002)
11. Vicedo, J.L.: Recuperando información de alta precisión: los sistemas de Búsqueda de Respuestas (High-precision information retrieval: Question Answering systems). PhD Thesis, Universidad de Alicante (2003)