
QA Review

IR-based Question Answering


An alternative approach to answer extraction, used solely in Web search, is based on N-gram tiling,
sometimes called the redundancy-based approach (Brill et al. 2002, Lin 2007). This simplified method
begins with the snippets returned from the Web search engine, produced by a reformulated query. In
the first step, N-gram mining, every unigram, bigram, and trigram occurring in the snippet is
extracted and weighted. The weight is a function of the number of snippets in which the N-gram
occurred, and the weight of the query reformulation pattern that returned it. In the N-gram filtering
step, N-grams are scored by how well they match the predicted answer type. These scores are
computed by hand-written filters built for each answer type. Finally, an N-gram tiling algorithm
concatenates overlapping N-gram fragments into longer answers. A standard greedy method is to start
with the highest-scoring candidate and try to tile each other candidate with this candidate. The best-
scoring concatenation is added to the set of candidates, the lower-scoring candidate is removed, and
the process continues until a single answer is built.
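The greedy tiling loop described above can be sketched as follows. This is an illustrative reconstruction, not the exact Brill et al. algorithm: the candidate fragments and scores are invented, and `tile` is a simple helper that concatenates two fragments on their longest word overlap.

```python
# Illustrative sketch of greedy N-gram tiling; fragments and scores
# below are invented for the example.

def tile(a, b):
    """Concatenate b onto a if a suffix of a overlaps a prefix of b."""
    a_toks, b_toks = a.split(), b.split()
    for k in range(min(len(a_toks), len(b_toks)), 0, -1):
        if a_toks[-k:] == b_toks[:k]:
            return " ".join(a_toks + b_toks[k:])
    return None

def greedy_tiling(candidates):
    """candidates: list of (fragment, score) pairs."""
    candidates = sorted(candidates, key=lambda c: -c[1])
    while len(candidates) > 1:
        best, rest = candidates[0], candidates[1:]
        merged = False
        for frag, score in rest:
            joined = tile(best[0], frag) or tile(frag, best[0])
            if joined is not None:
                # keep the concatenation, drop the lower-scoring fragment
                candidates = ([(joined, best[1] + score)]
                              + [c for c in rest if c[0] != frag])
                merged = True
                break
        if not merged:
            break
    return candidates[0][0]

print(greedy_tiling([("abraham lincoln", 3.0),
                     ("lincoln was born", 2.0),
                     ("was born in 1809", 1.5)]))
# abraham lincoln was born in 1809
```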

Knowledge-based Question Answering

Like the text-based paradigm for question answering, this approach dates back to the earliest days of
natural language processing, with systems like BASEBALL (Green et al., 1961) that answered
questions from a structured database of baseball games and stats. Systems that map from a text
string to a logical form are called semantic parsers. Semantic parsers for question answering
usually map either to some version of predicate calculus or to a query language like SQL or SPARQL,
as in the examples in Fig. 28.7.

Rule-based Method

Supervised Method

Dealing with Variation: Semi-Supervised Methods


Because it is difficult to create training sets with questions labeled with their meaning representation,
supervised datasets can’t cover the wide variety of forms that even simple factoid questions can take.
For this reason most techniques for mapping factoid questions to the canonical relations or other
structures in knowledge bases find some way to make use of textual redundancy.
The most common source of redundancy, of course, is the web, which contains a vast number of textual
variants expressing any relation. For this reason, most methods make some use of web text, either via
semi-supervised methods like distant supervision or unsupervised methods like open information
extraction, both introduced in Chapter 20. For example, the REVERB open information extractor
(Fader et al., 2011) extracts billions of (subject, relation, object) triples of strings from the
web, such as (“Ada Lovelace”, “was born in”, “1815”). By aligning these strings with a canonical
knowledge source like Wikipedia, we create new relations that can be queried while simultaneously
learning to map between the words in questions and canonical relations.

To align a REVERB triple with a canonical knowledge source we first align the arguments and then
the predicate. Recall from Chapter 23 that linking a string like “Ada Lovelace” with a Wikipedia page
is called entity linking; we thus represent the concept ‘Ada Lovelace’ by a unique identifier of a
Wikipedia page. If this subject string is not associated with a unique page on Wikipedia, we can
disambiguate which page is being sought, for example by using the cosine distance between the triple
string (‘Ada Lovelace was born in 1815’) and each candidate Wikipedia page. Date strings like ‘1815’
can be turned into a normalized form using standard tools for temporal normalization like SUTime
(Chang and Manning, 2012). Once we’ve aligned the arguments, we align the predicates. Given the
Freebase relation people.person.birthdate(ada lovelace,1815) and the string ‘Ada Lovelace was born in
1815’, having linked Ada Lovelace and normalized 1815, we learn the mapping between the string
‘was born in’ and the relation people.person.birthdate. In the simplest case, this can be done by
aligning the relation with the string of
words in between the arguments; more complex alignment algorithms like IBM Model 1 (Chapter 25)
can be used. Then if a phrase aligns with a predicate across many entities, it can be extracted into a
lexicon for mapping questions to relations.
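The lexicon-building idea can be sketched as follows. The triples, KB facts, and the `min_pairs` threshold are invented for illustration; a real system would align billions of open-IE triples against a knowledge base like Freebase.

```python
# Toy sketch of learning a phrase -> relation lexicon by aligning
# open-IE triples with knowledge-base facts; all data below is
# invented for illustration.
from collections import Counter, defaultdict

# (subject, predicate string, object) triples from an open IE system,
# after entity linking and temporal normalization of the arguments
triples = [
    ("Ada Lovelace", "was born in", "1815"),
    ("Alan Turing", "was born in", "1912"),
    ("Ada Lovelace", "wrote", "the first program"),
]

# KB facts: relation -> set of (subject, object) pairs
kb = {
    "people.person.birthdate": {("Ada Lovelace", "1815"),
                                ("Alan Turing", "1912")},
}

def build_lexicon(triples, kb, min_pairs=2):
    """Map a predicate phrase to a KB relation if the phrase aligns with
    that relation across at least min_pairs distinct entity pairs."""
    counts = defaultdict(Counter)
    for subj, phrase, obj in triples:
        for relation, pairs in kb.items():
            if (subj, obj) in pairs:
                counts[phrase][relation] += 1
    return {phrase: rel for phrase, c in counts.items()
            for rel, n in c.items() if n >= min_pairs}

print(build_lexicon(triples, kb))
# {'was born in': 'people.person.birthdate'}
```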

The third stage of DeepQA, candidate answer scoring, uses many sources of evidence to score the candidates. One
of the most important is the lexical answer type. DeepQA includes a system that takes a candidate
answer and a lexical answer type and returns a score indicating whether the candidate answer can be
interpreted as a subclass or instance of the answer type. Consider the candidate “difficulty swallowing”
and the lexical answer type “manifestation”. DeepQA first matches each of these words with possible
entities in ontologies like DBpedia and WordNet. Thus the candidate “difficulty swallowing” is
matched with the DBpedia entity “Dysphagia”, and then that instance is mapped to the WordNet type
“Symptom”. The answer type “manifestation” is mapped to the WordNet type “Condition”. The
system looks for a link of hyponymy, instance-of or synonymy between these two types; in this case a
hyponymy relation is found between “Symptom” and “Condition”.
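The type check in the “difficulty swallowing” example can be sketched as below. The hypernym links and entity-to-type mappings are a hand-made stand-in for DBpedia and WordNet, not real data from those resources.

```python
# Minimal sketch of DeepQA-style answer-type checking against a toy
# taxonomy; the ontology fragment below is invented for illustration.

# child -> parent links (hyponym -> hypernym)
hypernyms = {
    "Dysphagia": "Symptom",
    "Symptom": "Condition",
    "Condition": "State",
}

# surface string -> ontology type, as produced by entity matching
entity_type = {
    "difficulty swallowing": "Dysphagia",
    "manifestation": "Condition",
}

def is_instance_of(candidate, answer_type):
    """Walk hypernym links from the candidate's type up to the answer type."""
    node = entity_type.get(candidate)
    target = entity_type.get(answer_type)
    while node is not None:
        if node == target:
            return True
        node = hypernyms.get(node)
    return False

print(is_instance_of("difficulty swallowing", "manifestation"))  # True
```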

Question Answering Systems: Survey and Trends


The need to query information content available in various formats including structured and
unstructured data (text in natural language, semi-structured Web documents, structured RDF data in
the semantic Web, etc.) has become increasingly important. Thus, Question Answering Systems
(QAS) are essential to satisfy this need.

The rapid growth of mass information storage and the popularity of the Web allow researchers to
store data and make it available to the public. However, exploring this large amount of data makes
finding information a complex and time-consuming task. This difficulty has motivated the development
of new, adapted search tools, such as Question Answering Systems.

However, for Question Answering Systems that work over text and Web documents, the structure of the
required information affects the accuracy of these systems; QAS are most effective when interacting
with structured knowledge bases.
Conventional keyword-based search engines return a set of documents relevant to a specific query. As
the amount of information on the Internet grows constantly, it can be hard to find the correct
information, so it should be possible to obtain specific answers to a query. One motivation for
automatic QA is to spare users from crawling through many documents to find the correct answer. On
the other hand, supporting documents can still be useful to confirm the answers, so that users are
convinced the information is relevant.

New Trends in Automatic QA

An Automatic Answering System with Template Matching for Natural Language Questions
In QA systems there is a general distinction between two problems: open-domain problems and closed-
domain problems. In the open domain, the answers to questions can be found in public information
sources, and almost any question can be posed. In the closed domain, however, answers are stored by a
domain expert in a database, so the permissible questions are limited to a specific topic. This
work focuses on closed-domain problems.

The system is split up into three main modules:


1. pre-processing module
2. question template matching module
3. answering module
In the first module, SMS abbreviations are converted to English words and stop words are removed. In
the second module, the result of the previous module is matched against every template to find the best
match. A special syntax is used to describe the templates, and for each question the template has to be
written manually. The template-matching approach has also been improved by using synonym lists and by
adding robustness to spelling mistakes.
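The three-module pipeline can be sketched as follows. This is a hypothetical reconstruction: the abbreviation table, stop-word list, templates, and canned answers are all invented for illustration, and the matching here is simple token overlap rather than the paper's template syntax.

```python
# Hypothetical sketch of the preprocessing / template matching /
# answering pipeline; all data below is invented for illustration.
import re

SMS_ABBREVIATIONS = {"u": "you", "plz": "please", "wat": "what"}
STOP_WORDS = {"the", "a", "an", "is", "are", "please"}

templates = {
    # template keywords -> canned answer from the closed-domain database
    ("opening", "hours"): "The office is open 9am-5pm, Monday to Friday.",
    ("exam", "date"): "The exam takes place on June 12.",
}

def preprocess(question):
    """Module 1: expand SMS abbreviations and drop stop words."""
    tokens = re.findall(r"[a-z]+", question.lower())
    tokens = [SMS_ABBREVIATIONS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]

def answer(question):
    """Modules 2-3: pick the template sharing the most tokens, then reply."""
    tokens = set(preprocess(question))
    best = max(templates, key=lambda tpl: len(tokens & set(tpl)))
    if not tokens & set(best):
        return "Sorry, no matching template."
    return templates[best]

print(answer("plz wat are the opening hours?"))
# The office is open 9am-5pm, Monday to Friday.
```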
Question Answering (QA)
The Internet provides an ever-growing amount of information, and a variety of search methods exist
for gaining access to the desired information. Conventional search engines usually provide a list of
relevant documents in which the requested information can be found. The goal of QA, however, is to
provide the information to the user immediately, without requiring a search through all the
documents.

Natural Language Generation


Natural language generation is the process of constructing a natural language text. The text should be
syntactically and semantically correct and provide a formal presentation of the content. The challenge
is to imitate human language ability using computational means. The researcher Claude Shannon laid
important foundations for natural language generation: in his 1948 paper ”A Mathematical Theory of
Communication”, he introduced the ability to automatically generate text using Markov transition
probabilities from one word to another, creating the first theoretical model of a text generator
(Shannon, 1948).
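A Shannon-style word-level Markov generator can be sketched in a few lines. The toy corpus below is invented for illustration; Shannon's own examples used letter and word frequencies estimated from English text.

```python
# Minimal sketch of a Shannon-style Markov text generator: estimate
# word-to-word transition probabilities from a corpus, then sample a
# sequence by following them. The training corpus is a toy example.
import random
from collections import defaultdict

corpus = ("the cat sat on the mat and the cat saw the dog "
          "and the dog sat on the mat").split()

# transition table: word -> list of observed next words
# (repeats in the list encode the transition probabilities)
transitions = defaultdict(list)
for current, following in zip(corpus, corpus[1:]):
    transitions[current].append(following)

def generate(start, length, seed=0):
    """Sample a word sequence by following bigram transitions."""
    random.seed(seed)
    words = [start]
    for _ in range(length - 1):
        choices = transitions.get(words[-1])
        if not choices:
            break
        words.append(random.choice(choices))
    return " ".join(words)

print(generate("the", 8))
```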
For basic applications, the text generation process can be kept very simple by composing the text from
standard text blocks and some linking words. For more complex systems, however, more processing
steps, as described in (Reiter & Dale, 2000), are needed.
The process is composed of several stages:
• Content determination: what information should be included in the output text
• Document structuring: which content parts should be grouped
• Lexicalisation: what specific words should be used
• Referring expression generation: what expressions should be used to reference objects
• Aggregation: how the linguistic structure should be mapped onto sentences and paragraphs
• Linguistic and structural realisation: transformation of abstract representations of sentences,
paragraphs and sections into real text

Is Question Answering fit for the Semantic Web? A Survey

With the recent rapid growth of the Semantic Web (SW), the processes of searching and querying
content that is both massive in scale and heterogeneous have become increasingly challenging. User-
friendly interfaces, which can support end users in querying and exploring this novel and diverse,
structured information space, are needed to make the vision of the SW a reality. We present a survey
on ontology-based Question Answering (QA), which has emerged in recent years to exploit the
opportunities offered by structured semantic information on the Web.

The notion of introducing semantics to search on the Web is not understood in a unique way. According
to (Fazzinga et al., 2010) the two most common uses of SW technology are: (1) to interpret Web
queries and Web resources annotated with respect to the background knowledge described by
underlying ontologies, and (2) to search in the large structured datasets and Knowledge Bases (KBs)
of the SW as an alternative or a complement to the current Web.

Apart from the benefits that can be obtained as more semantic data is published on the Web, the
emergence and continued growth of a large scale SW poses some challenges and drawbacks:
− There is a gap between users and the SW: it is difficult for end-users to understand the complexity
of the logic-based SW. Solutions that can allow the typical Web user to profit from the expressive
power of SW data-models, while hiding the complexity behind them, are of crucial importance.
− The processes of searching and querying content that is massive in scale and highly heterogeneous
have become increasingly challenging: current approaches to querying semantic data have difficulty
scaling their models to cope with the increasing amount of distributed semantic data available
online.

Hence, there is a need for user-friendly interfaces that can scale up to the Web of Data, to support end
users in querying this heterogeneous information space.

We classify a QA system, or any approach to query the SW content, according to four dimensions
based on the type of questions (input), the sources (unstructured data such as documents, or structured
data in a semantic or non-semantic space), the scope (domain-specific, open-domain), and the
traditional intrinsic problems derived from the search environment and scope of the system.

Goals and dimensions of Question Answering

The goal of QA systems, as defined by (Hirschman et al., 2001), is to allow users to ask questions in
Natural Language (NL), using their own terminology, and receive a concise answer.

In (Moldovan et al., 2003), QA systems are classified, according to the complexity of the input
question and the difficulty of extracting the answer, in five increasingly sophisticated types: systems
capable of processing factual questions (factoids), systems enabling reasoning mechanisms, systems
that fuse answers from different sources, interactive (dialog) systems and systems capable of analogical
reasoning.

As pointed out in (Hunter, 2000), more difficult kinds of questions include those that ask for an
opinion, Why or How questions, which require an understanding of causal or instrumental relations,
What questions, which place little constraint on the answer type, and definition questions.

Ontology-based QA emerged as a combination of ideas from two different research areas: it enhances
the scope of closed NLIDB over structured data, by being agnostic to the domain of the ontology that
it exploits; and it also presents complementary affordances to open QA over free text (TREC), the
advantage being that it can help with answering questions requiring situation-specific answers, where
multiple pieces of information (from one or several sources) need to be assembled to infer the answers
at run time.

Nonetheless, most ontology-based QA systems are akin to NLIDB in the sense that they are able to
extract precise answers from structured data in a specific domain scenario, instead of retrieving
relevant paragraphs of text in an open scenario.

As stated in (Uren et al., 2007)

NL interfaces are an often-proposed solution in the literature for casual users (Kaufmann and
Bernstein, 2007), being particularly appropriate in domains for which there are authoritative and
comprehensive databases or resources (Mollá and Vicedo, 2007). However, their success has typically
been overshadowed by the brittleness and habitability problems (Thompson et al., 2005), defined as
the mismatch between user expectations and the capabilities of the system with respect to its NL
understanding and what it knows about (users do not know what it is possible to ask). As stated in
(Uren et al., 2007), iterative and exploratory search modes are important to the usability of all
search systems, to support the user in understanding what the system knows and what subset of NL it
is possible to ask about. Systems should also be able to provide justifications for an answer in an
intuitive way (NL generation), suggest the presence of unrequested but related information, and
actively help the user by recommending searches or proposing alternate paths of exploration. For
example, view-based search and forms can help the user explore the search space better than
keyword-based or NL querying systems, but they become tedious to use in large spaces and impossible
in heterogeneous ones.

Some of the early NLIDB approaches relied on pattern-matching techniques.


The main drawback of these early NLIDB systems is that they were built with a particular database in
mind; thus they could not easily be modified for use with different databases and were difficult to
port to different application domains. Configuration phases were tedious and required a long time,
because of domain-specific grammars, hard-wired knowledge, or hand-written mapping rules that had to
be developed by domain experts.

The next generation of NLIDBs used an intermediate representation language, which expressed the
meaning of the user’s question in terms of high-level concepts, independently of the database’s
structure (Androutsopoulos et al., 1995). This separated the (domain-independent) linguistic process
from the (domain-dependent) process of mapping into the database, improving the portability of the
front end (Martin et al., 1985).

The formal semantics approach presented in (De Roeck et al., 1991) follows this paradigm and clearly
separates the NL front end, which has a very high degree of portability, from the back end. The
front end provides a mapping between sentences of English and expressions of a formal semantic
theory, and the back end maps these into expressions that are meaningful with respect to the domain
in question. Adapting a developed system to a new application requires altering the domain-specific
back end alone.

4. Semantic ontology-based Question Answering


In this section we look at ontology-based semantic QA systems (also referred to in this paper as
semantic QA systems), which take queries expressed in NL and a given ontology as input, and return
answers drawn from one or more KBs that subscribe to the ontology. Therefore, they do not require the
user to learn the vocabulary or structure of the ontology to be queried.
4.1. Ontology-specific QA systems
With the steady growth of the SW and the emergence of large-scale semantics, the need for NL
interfaces to ontology-based repositories has become more acute, re-igniting interest in NL front
ends. This trend has also been supported by usability studies (Kaufmann and Bernstein, 2007), which
show that casual users, typically overwhelmed by the formal logic of the SW, prefer to use a NL
interface to query an ontology. Hence, in the past few years there has been much interest in
ontology-based QA systems, where the power of ontologies as a model of knowledge is directly
exploited for query analysis and translation, thus providing a new twist on the old issues of NLIDB,
by focusing on portability and performance, and replacing costly domain-specific NLP techniques with
shallow but domain-independent ones. A wide range of off-the-shelf components, including triple
stores (e.g., Sesame) or text retrieval engines (e.g., Lucene), domain-independent linguistic
resources, such as WordNet and FrameNet, and NLP parsers, such as the Stanford Parser (Klein and
Manning, 2002), support the evolution of these new NLIs.
Ontology-based QA systems vary on two main aspects:
(1) the degree of domain customization they require, which correlates with their retrieval
performance, and
(2) the subset of NL they are able to understand (full grammar-based NL, controlled or guided NL,
pattern-based), in order to reduce both complexity and the habitability problem, pointed out as the
main issue that hampers the successful use of NLIs (Kaufmann and Bernstein, 2007).

One can conclude that the techniques used to bridge the lexical gap between users and the
structured knowledge are largely comparable across all systems: off-the-shelf parsers and shallow
parsing are used to create a triple-based representation of the user query, while string distance
metrics, WordNet, and heuristic rules are used to match and rank the possible ontological
representations.
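The matching-and-ranking step can be sketched as below. The ontology relation labels and the synonym table are invented for illustration (a real system would consult WordNet rather than a hand-made synonym list), and the string distance here is Python's built-in `difflib` similarity ratio.

```python
# Illustrative sketch of matching a query phrase against ontology
# relation labels via a string distance plus a synonym list; labels
# and synonyms below are invented for the example.
import difflib

relation_labels = ["hasAuthor", "hasBirthDate", "locatedIn"]
synonyms = {"written by": "author", "born on": "birth date"}

def normalize(label):
    """Split a camelCase label like 'hasBirthDate' into lowercase words."""
    words, current = [], ""
    for ch in label:
        if ch.isupper() and current:
            words.append(current)
            current = ch.lower()
        else:
            current += ch.lower()
    words.append(current)
    return " ".join(w for w in words if w != "has")

def rank_relations(phrase):
    """Rank ontology relations by string similarity to the query phrase."""
    phrase = synonyms.get(phrase, phrase)
    scored = [(difflib.SequenceMatcher(None, phrase, normalize(r)).ratio(), r)
              for r in relation_labels]
    return [r for _, r in sorted(scored, reverse=True)]

print(rank_relations("born on")[0])  # hasBirthDate
```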

Scientific American: The Semantic Web


The Semantic Web is not a separate Web but an extension of the current one, in which information is
given well-defined meaning, better enabling computers and people to work in cooperation.
The essential property of the World Wide Web is its universality. The power of a hypertext link is that
"anything can link to anything."

Knowledge Representation
For the semantic web to function, computers must have access to structured collections of information
and sets of inference rules that they can use to conduct automated reasoning.

The challenge of the Semantic Web, therefore, is to provide a language that expresses both data and
rules for reasoning about the data and that allows rules from any existing knowledge-representation
system to be exported onto the Web.

Two important technologies for developing the Semantic Web are already in place: eXtensible
Markup Language (XML) and the Resource Description Framework (RDF). XML lets everyone create their
own tags, hidden labels that annotate Web pages or sections of text on a page. Scripts, or programs,
can make use of these tags in sophisticated ways, but the script writer has to know what the page
writer uses each tag for. In short, XML allows users to add arbitrary structure to their documents
but says nothing about what the structures mean.

Meaning is expressed by RDF, which encodes it in sets of triples, each triple being rather like the
subject, verb and object of an elementary sentence. These triples can be written using XML tags. In
RDF, a document makes assertions that particular things (people, Web pages or whatever) have
properties (such as "is a sister of," "is the author of") with certain values (another person, another Web
page).

Subject and object are each identified by a Universal Resource Identifier (URI), just as used
in a link on a Web page.

The triples of RDF form webs of information about related things.
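The triple model above can be sketched with plain tuples. The URIs and facts below are invented for the example; real RDF work would use a library such as rdflib and a query language like SPARQL.

```python
# Illustrative sketch of RDF's subject-predicate-object triple model
# using plain Python tuples; the URIs and facts are invented.

triples = {
    ("http://example.org/Ada", "http://example.org/isAuthorOf",
     "http://example.org/Notes"),
    ("http://example.org/Ada", "http://example.org/isSisterOf",
     "http://example.org/Byron_Jr"),
    ("http://example.org/Notes", "http://example.org/publishedIn", "1843"),
}

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern (None = wildcard)."""
    return [(s, p, o) for (s, p, o) in triples
            if subject in (None, s)
            and predicate in (None, p)
            and obj in (None, o)]

# what is Ada the author of?
for s, p, o in query("http://example.org/Ada",
                     "http://example.org/isAuthorOf"):
    print(o)
```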

Ontology

A solution to this problem is provided by the third basic component of the Semantic Web, collections
of information called ontologies. In philosophy, an ontology is a theory about the nature of existence,
of what types of things exist; ontology as a discipline studies such theories. Artificial-intelligence and
Web researchers have co-opted the term for their own jargon, and for them an ontology is a document
or file that formally defines the relations among terms. The most typical kind of ontology for the Web
has a taxonomy and a set of inference rules.

The taxonomy defines classes of objects and relations among them. Inference rules in ontologies
supply further power.
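A toy example shows how an inference rule adds power beyond the bare taxonomy and facts. The classes, facts, and rule below are invented for illustration; real Semantic Web systems express such rules in dedicated rule or ontology languages.

```python
# Toy sketch of an ontology fact base with one inference rule:
# an address in a city implies an address in that city's country.
# All data below is invented for the example.

# facts: (subject, property, value) triples
facts = {
    ("Ithaca", "type", "City"),
    ("Ithaca", "locatedIn", "USA"),
    ("Cornell", "addressCity", "Ithaca"),
}

def infer(facts):
    """Apply the addressCity + locatedIn -> addressCountry rule."""
    new = set(facts)
    for s, p, o in facts:
        if p == "addressCity":
            for s2, p2, o2 in facts:
                if s2 == o and p2 == "locatedIn":
                    new.add((s, "addressCountry", o2))
    return new

inferred = infer(facts)
print(("Cornell", "addressCountry", "USA") in inferred)  # True
```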

Ontologies can enhance the functioning of the Web in many ways. They can be used in a simple
fashion to improve the accuracy of Web searches—the search program can look for only those pages
that refer to a precise concept instead of all the ones using ambiguous keywords. More advanced
applications will use ontologies to relate the information on a page to the associated knowledge
structures and inference rules.

The Semantic Web, in contrast, is more flexible. The consumer and producer agents can reach a
shared understanding by exchanging ontologies, which provide the vocabulary needed for discussion.

The Semantic Web, in naming every concept simply by a URI, lets anyone express new concepts that
they invent with minimal effort. Its unifying logical language will enable these concepts to be
progressively linked into a universal Web. This structure will open up the knowledge and workings
of humankind to meaningful analysis by software agents, providing a new class of tools by which we
can live, work and learn together.

Knowledge Representation and Information Presentation

Relation with Hypermedia Research


While the Semantic Web aims primarily at providing a generic infrastructure for machine-processable
Web content, it has direct relevance to hypermedia research.
