Like the text-based paradigm for question answering, this approach dates back to the earliest days of
natural language processing, with systems like BASEBALL (Green et al., 1961) that answered
questions from a structured database of baseball games and stats. Systems for mapping from a text
string to any logical form are called semantic parsers (???). Semantic parsers for question answering
usually map either to some version of predicate calculus or a query language like SQL or SPARQL, as
in the examples in Fig. 28.7.
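A simple rule-based semantic parser of this kind can be sketched in a few lines. The question patterns, predicate names, and SPARQL templates below are invented for illustration; real semantic parsers use far richer grammars or learned models:

```python
import re

# Toy rule-based semantic parser: each rule pairs a question pattern with a
# SPARQL template. Patterns, prefixes, and predicates are illustrative only.
PATTERNS = [
    (re.compile(r"when was (.+) born\??", re.I),
     'SELECT ?date WHERE {{ ?p rdfs:label "{0}" . ?p :birthdate ?date }}'),
    (re.compile(r"who wrote (.+?)\??$", re.I),
     'SELECT ?a WHERE {{ ?w rdfs:label "{0}" . ?w :author ?a }}'),
]

def parse(question):
    """Return a SPARQL query string for the question, or None if no rule fires."""
    for pattern, template in PATTERNS:
        m = pattern.match(question.strip())
        if m:
            return template.format(m.group(1))
    return None

print(parse("When was Ada Lovelace born?"))
```

Mapping to SQL instead of SPARQL only changes the templates; the pattern-matching front end stays the same.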
To align a REVERB triple with a canonical knowledge source we first align the arguments and then
the predicate. Recall from Chapter 23 that linking a string like “Ada Lovelace” with a Wikipedia page
is called entity linking; we thus represent the concept ‘Ada Lovelace’ by a unique identifier of a
Wikipedia page. If this subject string is not associated with a unique page on Wikipedia, we can
disambiguate which page is being sought, for example by using the cosine distance between the triple
string (‘Ada Lovelace was born in 1815’) and each candidate Wikipedia page. Date strings like ‘1815’
can be turned into a normalized form using standard tools for temporal normalization like SUTime
(Chang and Manning, 2012). Once we’ve aligned the arguments, we align the predicates. Given the
Freebase relation people.person.birthdate(ada lovelace,1815) and the string ‘Ada Lovelace was born in
1815’, having linked Ada Lovelace and normalized 1815, we learn the mapping between the string
‘was born in’ and the relation people.person.birthdate. In the simplest case, this can be done by aligning the relation with the string of
words in between the arguments; more complex alignment algorithms like IBM Model 1 (Chapter 25)
can be used. Then if a phrase aligns with a predicate across many entities, it can be extracted into a
lexicon for mapping questions to relations.
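The simplest alignment described above can be sketched as follows. The sentences, entity links, and relation name are invented for illustration; a real system would use a large corpus and an actual knowledge base:

```python
from collections import Counter, defaultdict

def phrase_between(sentence, arg1, arg2):
    """Words between the two argument strings, or None if not found in order."""
    i = sentence.find(arg1)
    j = sentence.find(arg2, i + len(arg1)) if i != -1 else -1
    if i == -1 or j == -1:
        return None
    return sentence[i + len(arg1):j].strip()

# (sentence, arg1, arg2, relation): arguments already entity-linked and the
# relation looked up in the knowledge base. Toy data for illustration.
aligned = [
    ("Ada Lovelace was born in 1815", "Ada Lovelace", "1815",
     "people.person.birthdate"),
    ("Alan Turing was born in 1912", "Alan Turing", "1912",
     "people.person.birthdate"),
]

counts = defaultdict(Counter)
for sent, a1, a2, rel in aligned:
    p = phrase_between(sent, a1, a2)
    if p:
        counts[p][rel] += 1

# A phrase that aligns with the same relation across >= 2 entity pairs
# is promoted into the question-to-relation lexicon.
lexicon = {p: rel for p, c in counts.items()
           for rel, n in c.items() if n >= 2}
print(lexicon)  # {'was born in': 'people.person.birthdate'}
```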
The third candidate answer scoring stage uses many sources of evidence to score the candidates. One
of the most important is the lexical answer type. DeepQA includes a system that takes a candidate
answer and a lexical answer type and returns a score indicating whether the candidate answer can be
interpreted as a subclass or instance of the answer type. Consider the candidate “difficulty swallowing”
and the lexical answer type “manifestation”. DeepQA first matches each of these words with possible
entities in ontologies like DBpedia and WordNet. Thus the candidate “difficulty swallowing” is
matched with the DBpedia entity “Dysphagia”, and then that instance is mapped to the WordNet type
“Symptom”. The answer type “manifestation” is mapped to the WordNet type “Condition”. The
system looks for a link of hyponymy, instance-of or synonymy between these two types; in this case a
hyponymy relation is found between “Symptom” and “Condition”.
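The type-coercion check can be illustrated with a toy taxonomy. The hypernym table below is a hand-built stand-in, not DeepQA's actual DBpedia/WordNet machinery:

```python
# Walk hypernym (IS-A) links to test whether the candidate's type is a
# hyponym of the lexical answer type. Taxonomy invented for illustration.
HYPERNYM = {
    "dysphagia": "symptom",
    "symptom": "condition",
    "condition": "state",
}

def is_a(concept, answer_type):
    """True if `concept` reaches `answer_type` by following hypernym links."""
    seen = set()
    while concept is not None and concept not in seen:
        if concept == answer_type:
            return True
        seen.add(concept)
        concept = HYPERNYM.get(concept)
    return False

# "difficulty swallowing" -> entity "dysphagia"; check against "condition"
print(is_a("dysphagia", "condition"))  # True: a symptom IS-A condition here
```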
The rapid growth of mass information storage and the popularity of the Web allow researchers to store data and make it available to the public. However, exploring this large amount of data makes finding information a complex and time-consuming task. This difficulty has motivated the development of new, adapted search tools, such as Question Answering Systems.
However, for Question Answering Systems that operate over text and Web documents, the structure of the required information affects the accuracy of these systems. QA systems are most effective when they interact with structured knowledge bases.
Conventional keyword-based search engines return a set of documents relevant to a specific query. As the amount of information on the Internet grows constantly, it can be hard to find the correct information, so it should be possible to obtain specific answers to a query. One motivation for automatic QA is to spare users from crawling through many documents to find the correct answer. On the other hand, supporting documents can also be useful for confirming answers, so that users are convinced the information is relevant.
An Automatic Answering System with Template Matching for Natural Language Questions
In QA systems there is a general distinction between two problem settings: open-domain and closed-domain. In the open domain, the answers to questions can be found in public information sources, and almost any question can be posed. In the closed domain, however, answers are stored by a domain expert in a database, so the permissible questions are limited to a specific topic. This work focuses on closed-domain problems.
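A closed-domain template-matching answerer can be sketched directly: each template pairs a question pattern with a lookup into a small expert-curated table. The domain facts and patterns below are invented for illustration:

```python
import re

# Expert-curated answer table for one narrow domain (illustrative data).
FACTS = {
    "opening hours": "Mon-Fri, 9:00-17:00",
    "location": "Building A, Room 101",
}

# Each template maps a question pattern to a key in the fact table.
TEMPLATES = [
    (re.compile(r"\bwhen (are you|is the office) open\b", re.I), "opening hours"),
    (re.compile(r"\bwhere (are you|is the office)\b", re.I), "location"),
]

def answer(question):
    for pattern, key in TEMPLATES:
        if pattern.search(question):
            return FACTS[key]
    return "Sorry, I cannot answer that."

print(answer("When is the office open?"))
```

The limitation discussed above is visible here: only questions anticipated by a template can be answered, which is acceptable precisely because the domain is closed.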
The notion of introducing semantics into Web search is not understood in a unique way. According to (Fazzinga et al., 2010), the two most common uses of SW technology are: (1) to interpret Web queries and Web resources annotated with respect to the background knowledge described by underlying ontologies, and (2) to search the large structured datasets and Knowledge Bases (KBs) of the SW as an alternative or a complement to the current Web.
Apart from the benefits that can be obtained as more semantic data is published on the Web, the
emergence and continued growth of a large scale SW poses some challenges and drawbacks:
− There is a gap between users and the SW: it is difficult for end-users to understand the complexity
of the logic-based SW. Solutions that can allow the typical Web user to profit from the expressive
power of SW data-models, while hiding the complexity behind them, are of crucial importance.
− The processes of searching and querying content that is massive in scale and highly heterogeneous have become increasingly challenging: current approaches to querying semantic data have difficulty scaling their models to cope with the increasing amount of distributed semantic data available online.
Hence, there is a need for user-friendly interfaces that can scale up to the Web of Data, to support end
users in querying this heterogeneous information space.
We classify a QA system, or any approach to query the SW content, according to four dimensions
based on the type of questions (input), the sources (unstructured data such as documents, or structured
data in a semantic or non-semantic space), the scope (domain-specific, open-domain), and the
traditional intrinsic problems derived from the search environment and scope of the system.
Goals and dimensions of Question Answering
The goal of QA systems, as defined by (Hirschman et al., 2001), is to allow users to ask questions in Natural Language (NL), using their own terminology, and receive a concise answer.
In (Moldovan et al., 2003), QA systems are classified, according to the complexity of the input question and the difficulty of extracting the answer, into five increasingly sophisticated types: systems
capable of processing factual questions (factoids), systems enabling reasoning mechanisms, systems
that fuse answers from different sources, interactive (dialog) systems and systems capable of analogical
reasoning.
As pointed out in (Hunter, 2000), more difficult kinds of questions include those which ask for opinion; Why and How questions, which require understanding of causality or instrumental relations; What questions, which place little constraint on the answer type; and definition questions.
Nonetheless, most ontology-based QA systems are akin to NLIDB in the sense that they are able to
extract precise answers from structured data in a specific domain scenario, instead of retrieving
relevant paragraphs of text in an open scenario.
NL interfaces are an often-proposed solution in the literature for casual users (Kaufmann and Bernstein, 2007), being particularly appropriate in domains for which there are authoritative and comprehensive databases or resources (Mollá and Vicedo, 2007). However, their success has typically been overshadowed by the brittleness and habitability problems (Thompson et al., 2005), defined as the mismatch between user expectations and the capabilities of the system with respect to its NL understanding and what it knows about (users do not know what it is possible to ask). As
stated in (Uren et al., 2007), iterative and exploratory search modes are important to the usability of all search systems, to support the user in understanding what the system knows and what subset of NL it is possible to ask about. Systems should also be able to provide justifications for an answer in an intuitive way (NL generation), suggest the presence of unrequested but related information, and actively help the user by recommending searches or proposing alternative paths of exploration. For example, view-based search and forms can help the user explore the search space better than keyword-based or NL querying systems, but they become tedious to use in large spaces and impossible in heterogeneous ones.
The next generation of NLIDBs used an intermediate representation language, which expressed the
meaning of the user’s question in terms of high-level concepts, independently of the database’s
structure (Androutsopoulos et al., 1995). The (domain-independent) linguistic process is thus separated from the (domain-dependent) mapping process into the database, improving the portability of the front end (Martin et al., 1985).
The formal semantics approach presented in (De Roeck et al., 1991) follows this paradigm and clearly separates the NL front end, which has a very high degree of portability, from the back end.
The front end provides a mapping between sentences of English and expressions of a formal semantic
theory, and the back end maps these into expressions that are meaningful with respect to the domain in question. Adapting a developed system to a new application requires altering only the domain-specific back end.
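The front-end/back-end split described above can be sketched in a few lines. The grammar, the intermediate logical form, and the database schema are invented for illustration; the point is only that the back end can be swapped without touching the linguistic front end:

```python
import re

def front_end(question):
    """Domain-independent: English -> high-level intermediate logical form."""
    m = re.match(r"what is the capital of (.+?)\??$", question.strip(), re.I)
    if m:
        return ("capital_of", m.group(1))
    raise ValueError("unparsed question")

def sql_back_end(logical_form):
    """Domain-dependent: logical form -> query over one concrete schema."""
    predicate, arg = logical_form
    if predicate == "capital_of":
        return f"SELECT capital FROM countries WHERE name = '{arg}'"
    raise ValueError("unknown predicate")

lf = front_end("What is the capital of France?")
print(sql_back_end(lf))
```

Porting to a new database means writing a new back end for the same logical forms, which is exactly the portability argument made for this architecture.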
One can conclude that the techniques used to solve the lexical gap between the users and the
structured knowledge are largely comparable across all systems: off-the-shelf parsers and shallow
parsing are used to create a triple-based representation of the user query, while string distance metrics, WordNet, and heuristic rules are used to match and rank the possible ontological representations.
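The string-similarity part of that matching step can be sketched as follows. The ontology relation names are invented, and a real system would combine this score with WordNet synonyms and heuristic rules rather than use string distance alone:

```python
from difflib import SequenceMatcher

# Candidate ontology relations (illustrative names).
ONTOLOGY_RELATIONS = ["hasAuthor", "hasBirthDate", "locatedIn"]

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def rank_relations(query_predicate):
    """Ontology relations sorted by similarity to the query predicate."""
    return sorted(ONTOLOGY_RELATIONS,
                  key=lambda r: similarity(query_predicate, r),
                  reverse=True)

# e.g. the triple representation of "Who wrote Hamlet?" -> (?x, author, Hamlet)
print(rank_relations("author"))
```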
Knowledge Representation
For the semantic web to function, computers must have access to structured collections of information
and sets of inference rules that they can use to conduct automated reasoning.
The challenge of the Semantic Web, therefore, is to provide a language that expresses both data and
rules for reasoning about the data and that allows rules from any existing knowledge-representation
system to be exported onto the Web.
Two important technologies for developing the Semantic Web are already in place: eXtensible
Markup Language (XML) and the Resource Description Framework (RDF). XML lets everyone
create their own tags—hidden labels such as &lt;zip code&gt; or &lt;alma mater&gt; that annotate Web pages or sections of text on a page.
Scripts, or programs, can make use of these tags in sophisticated ways, but the script writer has to
know what the page writer uses each tag for. In short, XML allows users to add arbitrary structure to
their documents but says nothing about what the structures mean.
Meaning is expressed by RDF, which encodes it in sets of triples, each triple being rather like the
subject, verb and object of an elementary sentence. These triples can be written using XML tags. In
RDF, a document makes assertions that particular things (people, Web pages or whatever) have
properties (such as "is a sister of," "is the author of") with certain values (another person, another Web
page).
Subject and object are each identified by a Universal Resource Identifier (URI), just as used
in a link on a Web page.
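The triple model is simple enough to sketch directly: each assertion is a (subject, property, value) triple, with subjects and properties named by URIs. The URIs below are invented for illustration:

```python
# A minimal in-memory triple store in the spirit of RDF (illustrative URIs).
triples = [
    ("http://example.org/person/ada", "http://example.org/prop/isSisterOf",
     "http://example.org/person/ralph"),
    ("http://example.org/person/ada", "http://example.org/prop/isAuthorOf",
     "http://example.org/doc/notes"),
]

def values(subject, prop):
    """All values asserted for a given subject and property."""
    return [v for s, p, v in triples if s == subject and p == prop]

print(values("http://example.org/person/ada",
             "http://example.org/prop/isAuthorOf"))
```

Because every identifier is a URI, two independent documents asserting triples about the same subject can be merged by simple concatenation, which is a key design point of RDF.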
Ontology
A solution to the problem of differing terminologies across sources is provided by the third basic component of the Semantic Web, collections of information called ontologies. In philosophy, an ontology is a theory about the nature of existence,
of what types of things exist; ontology as a discipline studies such theories. Artificial-intelligence and
Web researchers have co-opted the term for their own jargon, and for them an ontology is a document
or file that formally defines the relations among terms. The most typical kind of ontology for the Web
has a taxonomy and a set of inference rules.
The taxonomy defines classes of objects and relations among them. Inference rules in ontologies
supply further power.
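A toy version of such an ontology, with a taxonomy plus one inference rule (class membership is inherited up the subclass hierarchy), can be sketched as follows. The class names and instances are invented for illustration:

```python
# Taxonomy: subclass links between classes (illustrative).
SUBCLASS = {"city": "place", "capital": "city"}
# Asserted facts: each entity's most specific class (illustrative).
INSTANCE_OF = {"Paris": "capital"}

def classes_of(entity):
    """Inference rule: if X is a C and C is a subclass of D, then X is a D."""
    cls = INSTANCE_OF.get(entity)
    result = []
    while cls is not None:
        result.append(cls)
        cls = SUBCLASS.get(cls)
    return result

print(classes_of("Paris"))  # ['capital', 'city', 'place']
```

This is what lets a search for "places" match a page annotated only as being about a "capital", even though the word "place" never appears.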
Ontologies can enhance the functioning of the Web in many ways. They can be used in a simple
fashion to improve the accuracy of Web searches—the search program can look for only those pages
that refer to a precise concept instead of all the ones using ambiguous keywords. More advanced
applications will use ontologies to relate the information on a page to the associated knowledge
structures and inference rules.
The Semantic Web, in contrast to rigidly pre-agreed schemas, is more flexible. The consumer and producer agents can reach a
shared understanding by exchanging ontologies, which provide the vocabulary needed for discussion.
The Semantic Web, in naming every concept simply by a URI, lets anyone express new concepts that
they invent with minimal effort. Its unifying logical language will enable these concepts to be
progressively linked into a universal Web. This structure will open up the knowledge and workings
of humankind to meaningful analysis by software agents, providing a new class of tools by which we
can live, work and learn together.