Week 8
Natural Language Processing
LEARNING OUTCOMES
o There are over a trillion pages of information on the Web, almost all
of it in natural language. An agent that wants to do knowledge
acquisition needs to understand (at least partially) the ambiguous,
messy languages that humans use. We examine the problem from the
point of view of specific information-seeking tasks: text
classification, information retrieval, and information extraction.
o Formal languages also have rules that define the meaning or
semantics of a program; for example, the rules say that the "meaning"
of "2 + 2" is 4, and the meaning of "1 / 0" is that an error is
signaled.
LANGUAGE MODELS
o N-gram character models
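The idea of an n-gram character model can be sketched in a few lines of Python. This is a toy, unsmoothed model (real language models add smoothing such as backoff or interpolation); the training string is made up for illustration:

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all character n-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_char_model(corpus, n=3):
    """Count n-gram and (n-1)-gram (context) frequencies."""
    ngrams = Counter(char_ngrams(corpus, n))
    contexts = Counter(char_ngrams(corpus, n - 1))
    return ngrams, contexts

def prob(ngrams, contexts, gram):
    """P(last character | preceding n-1 characters), unsmoothed."""
    context = gram[:-1]
    if contexts[context] == 0:
        return 0.0
    return ngrams[gram] / contexts[context]

ngrams, contexts = train_char_model("the theory of the thing", n=3)
prob(ngrams, contexts, "the")  # "th" occurs 4 times, "the" 3 times -> 0.75
```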
2. User Query
The query is the formula the user states to express the
information needed. In its simplest form, a query is a list of
keywords, and the documents retrieved are those that contain
the keywords.
Example: [AI book]; ["AI book"]; [AI AND book];
[AI NEAR book]; [AI book site:www.aaai.org].
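The simplest case above, keyword matching with AND semantics, can be sketched as follows (a toy collection scanned linearly; real IR systems use inverted indexes and relevance ranking):

```python
def search(documents, query):
    """Return ids of documents containing every query keyword (AND semantics)."""
    keywords = [w.lower() for w in query.split()]
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if all(k in words for k in keywords):
            hits.append(doc_id)
    return hits

# Illustrative toy collection.
docs = {
    "d1": "AI book on machine learning",
    "d2": "cookbook recipes",
    "d3": "an AI textbook and book reviews",
}
search(docs, "AI book")  # matches d1 and d3
```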
INFORMATION RETRIEVAL
Characteristics of IR
3. Set of Results
The results from the query: the subset of the documents
that is relevant to the query.
4. Display of result sets
Usually presented as a ranked list of document titles.
o When the query is a question, the result is not a ranked list of documents
but a short response: a sentence or a phrase.
o The ASKMSR system (Banko et al., 2002) is a web-based question-answering
system. It is based on the premise that a question may be answered on many
web pages, so question answering is treated as a problem of precision
(accuracy), not recall (completeness).
o ASKMSR does not try to recognize pronouns, verbs, etc. It only recognizes
15 types of questions and how to rewrite each of them as search-engine queries.
o For example, the query [who killed Abraham Lincoln] can be rewritten as
[killed Abraham Lincoln] and as ["Abraham Lincoln was killed by *"].
o The results obtained are not full pages but only brief summaries of text
(snippets) that are likely to be close to the query terms.
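The rewriting step can be sketched as a small set of pattern rules. The patterns and weights below are invented for illustration; they are not ASKMSR's actual rules, and the "was <verb> by" template only works when the question verb is already in the past tense:

```python
import re

# Each rule maps a question form to (rewritten query, weight);
# higher weight = more specific rewrite.
REWRITES = [
    # [who killed X] -> ["X was killed by *"], the most specific form.
    (re.compile(r"^who (\w+) (.+)$"),
     lambda m: (f'"{m.group(2)} was {m.group(1)} by *"', 5)),
    # [who killed X] -> [killed X], a looser form.
    (re.compile(r"^who (\w+) (.+)$"),
     lambda m: (f"{m.group(1)} {m.group(2)}", 2)),
]

def rewrite(question):
    """Produce (query, weight) rewrites for a question, most specific first."""
    question = question.strip().rstrip("?").lower()
    results = []
    for pattern, build in REWRITES:
        m = pattern.match(question)
        if m:
            results.append(build(m))
    # Fall back to bare keywords with the lowest weight.
    results.append((" OR ".join(question.split()), 1))
    return results

rewrite("Who killed Abraham Lincoln?")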
INFORMATION RETRIEVAL
Question Answering
o The snippets returned are broken into 1-, 2-, and 3-grams; each n-gram is
scored by its frequency across the result set, weighted by the rewrite that
produced it: an n-gram returned from a very specific rewrite (such as the
exact phrase ["Abraham Lincoln was killed by *"]) gains more weight than
one from a general rewrite such as [Abraham OR Lincoln OR killed]. We
expect "John Wilkes Booth" to be among the highly ranked n-grams, but so
are "Abraham Lincoln", "the assassination of", and "Ford's Theatre".
o Once the n-grams are scored, they are filtered by question type: a "who"
question keeps only n-grams that look like a person's name, and a "when"
question keeps only dates or times. There is also a filter that rejects
n-grams appearing in the question itself, since the answer should not be
part of the question.
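The scoring-and-filtering steps above can be sketched as follows. The snippets, weights, and the crude capitalization-based "person name" filter are all illustrative assumptions, not ASKMSR's actual components:

```python
from collections import defaultdict

def score_ngrams(snippets_with_weights, max_n=3):
    """Weight each 1-, 2-, and 3-gram by frequency times rewrite weight."""
    scores = defaultdict(float)
    for snippet, weight in snippets_with_weights:
        words = snippet.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                scores[" ".join(words[i:i + n])] += weight
    return scores

def filter_answers(scores, question_words, type_filter):
    """Drop n-grams that fail the type filter or appear in the question."""
    qset = {w.lower() for w in question_words}
    return {g: s for g, s in scores.items()
            if type_filter(g) and not (set(g.lower().split()) & qset)}

# A toy "who" filter: keep sequences of capitalized words (candidate names).
looks_like_name = lambda g: all(w[0].isupper() for w in g.split())

snippets = [("Abraham Lincoln was killed by John Wilkes Booth", 5),
            ("Booth shot Lincoln at Ford's Theatre", 2)]
scores = score_ngrams(snippets)
answers = filter_answers(scores, ["who", "killed", "Abraham", "Lincoln"],
                         looks_like_name)
best = max(answers, key=answers.get)  # "Booth" scores 5 + 2 = 7
```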
INFORMATION EXTRACTION
o Information extraction is the process of acquiring knowledge
by skimming a text and looking for occurrences of a particular
class of object and for relationships among objects.
o The simplest type of information extraction system is an
attribute-based extraction system, which assumes that the entire
text refers to a single object; the task is to extract attributes
of that object, typically by matching regular-expression templates.
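A minimal attribute-based extractor along these lines, using regular-expression templates. The templates and the sample product text are illustrative, not from any particular system:

```python
import re

# One regex template per attribute; the whole text is assumed to
# describe a single object.
TEMPLATES = {
    "price": re.compile(r"price:\s*\$(\d+(?:\.\d{2})?)", re.IGNORECASE),
    # Crude heuristic: the model name is the text up to the first period.
    "model": re.compile(r"^([A-Z][\w]*(?:\s+[A-Z]?\w+)*?)\."),
}

def extract_attributes(text):
    """Return a dict of all template attributes found in the text."""
    attrs = {}
    for name, pattern in TEMPLATES.items():
        m = pattern.search(text)
        if m:
            attrs[name] = m.group(1)
    return attrs

extract_attributes("IBM ThinkBook 970. Our price: $399.00")
```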
INFORMATION EXTRACTION
o Four dimensions along which Information Extraction approaches vary:
• Deterministic to stochastic
• Domain-specific to general
• Hand-crafted to learned
• Small-scale to large-scale
o FASTUS is a relational extraction system.
o It is divided into 5 stages:
• Tokenization
• Complex-word handling
• Basic-group handling
• Complex-phrase handling
• Structure merging
INFORMATION EXTRACTION
o Tokenization: divides the character stream into tokens (words, numbers,
and punctuation)
o Complex-word handling: handles complex multiword constructions, such
as "Bridgestone Sports Co.", which names a company
o Basic-group handling: segments the tokens into basic groups by
syntactic category, such as verb groups and noun groups
o Complex-phrase handling: merges basic groups into complex phrases,
using rules designed so that the result is unambiguous
o Structure merging: merges the structures produced by the previous steps
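The first two stages can be sketched as follows. The tokenizer heuristic (a capitalized word followed by a period is treated as an abbreviation) and the tiny one-entry name lexicon are illustrative assumptions, not FASTUS's actual rules:

```python
import re

def tokenize(text):
    """Stage 1: split characters into word, number, and punctuation tokens.
    A capitalized word followed by '.' is kept whole (e.g. 'Co.')."""
    return re.findall(r"[A-Z][a-z]*\.|\w+|[^\w\s]", text)

def complex_words(tokens, known_names={("Bridgestone", "Sports", "Co.")}):
    """Stage 2: merge known multiword names into single tokens (toy lexicon)."""
    out, i = [], 0
    while i < len(tokens):
        merged = False
        for name in known_names:
            if tuple(tokens[i:i + len(name)]) == name:
                out.append(" ".join(name))
                i += len(name)
                merged = True
                break
        if not merged:
            out.append(tokens[i])
            i += 1
    return out

tokens = tokenize("Bridgestone Sports Co. said Friday it has set up a joint venture.")
groups = complex_words(tokens)  # first group: "Bridgestone Sports Co."
```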
INFORMATION EXTRACTION
o When information extraction must be attempted
from noisy or varied input, simple finite-state
approaches fare poorly.
o It is too hard to get all the rules and their
priorities right; it is better to use a probabilistic
model rather than a rule-based model.
o The simplest probabilistic model for sequences
with hidden state is the hidden Markov model, or
HMM.
INFORMATION EXTRACTION
Example
o A brief text and the most probable (Viterbi) path
for that text under two HMMs: one trained to
recognize the speaker in a talk announcement,
and one trained to recognize dates.
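A minimal Viterbi decoder for such an HMM. The toy speaker-recognizer below tags each token as background (BG) or part of the speaker's name (NAME); all probabilities are invented for illustration:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most probable hidden-state path."""
    # V[t][s] = (best probability of reaching state s at time t, its path)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 0.0), [s]) for s in states}]
    for t in range(1, len(obs)):
        layer = {}
        for s in states:
            prob, path = max(
                (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s].get(obs[t], 0.0),
                 V[t - 1][prev][1] + [s])
                for prev in states)
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())

states = ["BG", "NAME"]          # background vs. speaker-name tokens
start_p = {"BG": 0.9, "NAME": 0.1}
trans_p = {"BG": {"BG": 0.8, "NAME": 0.2},
           "NAME": {"BG": 0.4, "NAME": 0.6}}
emit_p = {"BG": {"speaker": 0.3, ":": 0.3, "will": 0.2, "talk": 0.2},
          "NAME": {"Alice": 0.5, "Smith": 0.5}}

prob, path = viterbi(["speaker", ":", "Alice", "Smith"],
                     states, start_p, trans_p, emit_p)
# path tags "Alice Smith" as the speaker: ["BG", "BG", "NAME", "NAME"]
```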
INFORMATION EXTRACTION
Ontology extraction from large corpora
o So far we have thought of information extraction as finding a specific set of
relations (e.g., speaker, time, location) in a specific text (e.g., a talk
announcement). A different application of extraction technology is building
a large knowledge base or ontology of facts from a corpus. This is different
in three ways:
o First it is open-ended—we want to acquire facts about all types of domains,
not just one specific domain.
o Second, with a large corpus, the task is dominated by precision, not recall,
just as with question answering on the Web (Section 22.3.6).
o Third, the results can be statistical aggregates gathered from multiple
sources, rather than being extracted from one specific text.
INFORMATION EXTRACTION
Machine Reading
o A system that could read on its own and build up its own
database. Such a system would be relation-independent;
it would work for any relation.
o In practice, these systems work on all relations in
parallel, because of the I/O demands of large corpora.
They behave less like a traditional information-extraction
system that is targeted at a few relations and more like a
human reader who learns from the text itself; because of
this, the field has been called machine reading.
o A representative machine-reading system is
TEXTRUNNER (Banko and Etzioni, 2008).
INFORMATION EXTRACTION
Machine Reading
o Rules (in Prolog notation):
likes(fred, X) :- tall(X).
examines(Person, Course) :- teaches(Person, Course).