
COMP6275 – Artificial Intelligence

Week 8
Natural Language Processing
LEARNING OUTCOMES

At the end of this session, students will be able to:


o LO4: Apply various techniques to an agent acting under
uncertainty, and process natural language and other
perceptual signals so that an agent can interact
intelligently with the real world
OUTLINE

1. Natural Language Processing


2. Language Models
3. Text Classification
4. Information Retrieval
5. Information Extraction
6. Machine Translation
7. Summary
NATURAL LANGUAGE PROCESSING

o An agent that wants to acquire information needs to
understand (at least partially) human language,
which is often ambiguous and unclear.
o From an information-retrieval perspective, there are
three ways to find information:
• Text Classification
• Information retrieval
• Information Extraction
o One common factor in searching information is the
Language Model.
NATURAL LANGUAGE PROCESSING

o There are over a trillion pages of information on the Web, almost all
of it in natural language. An agent that wants to do knowledge
acquisition needs to understand (at least partially) the ambiguous,
messy languages that humans use. We examine the problem from the
point of view of specific information-seeking tasks: text
classification, information retrieval, and information extraction.
o Formal languages also have rules that define the meaning or
semantics of a program; for example, the rules say that the "meaning"
of "2 + 2" is 4, and the meaning of "1 / 0" is that an error is
signaled.
LANGUAGE MODELS
o N-gram character models

o An n-gram character model is a Markov chain of order n − 1:
the probability of character c_i depends only on the n − 1
immediately preceding characters, not on any other characters.
So in a trigram model (Markov chain of order 2) we have:

P(c_i | c_1:i−1) = P(c_i | c_i−2:i−1)
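The trigram estimate above can be sketched with simple counts. This is an illustrative toy (the sample text and the function names `train_trigram` and `p_char` are invented for the example), assuming maximum-likelihood estimation with no smoothing:

```python
from collections import defaultdict

def train_trigram(text):
    """Count trigrams and their two-character contexts in a string."""
    counts = defaultdict(int)
    context = defaultdict(int)
    for i in range(2, len(text)):
        counts[text[i-2:i+1]] += 1   # trigram c_{i-2} c_{i-1} c_i
        context[text[i-2:i]] += 1    # its bigram context
    return counts, context

def p_char(counts, context, c, prev2):
    """P(c | previous two characters): the order-2 Markov estimate."""
    if context[prev2] == 0:
        return 0.0
    return counts[prev2 + c] / context[prev2]

counts, context = train_trigram("the theatre then thawed")
# "th" occurs 4 times as a context, and "the" 3 times as a trigram.
print(p_char(counts, context, "e", "th"))  # → 0.75
```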
LANGUAGE MODELS
Smoothing N-gram models
o In n-gram character models, rare sequences such as "ht" may
never appear in the training corpus and would get a probability
estimate of 0 (zero).
o Smoothing generalizes the language model so it can assign a
non-zero probability to text the agent has never seen before.
o The smoothing approach of Pierre-Simon Laplace: if a random
variable X has been false in all n observations so far, then the
estimate for P(X = true) is 1 / (n + 2).
o A better smoothing approach is a backoff model, for example
linear interpolation smoothing.
o Linear interpolation smoothing combines the unigram, bigram, and
trigram estimates by linear interpolation.
LANGUAGE MODELS
Model Evaluation
o Evaluate the quality of a language model by the probability it
assigns to a test corpus, calculated word by word.
o Perplexity: the reciprocal of that probability, normalized by the
number of words; lower perplexity is better.
o There are 3 n-gram word models:
• Unigram
• Bigram
• Trigram
o Example (random text generated by each model):
• unigram: logical are as are confusion a may right tries agent goal the was. . .
• bigram: systems are very similar computational approach would be
represented. . .
• trigram: planning and scheduling are integrated the success of naive bayes
model. . .
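Perplexity can be computed directly from the per-word model probabilities. A minimal sketch (the function name is ours), using perplexity(w_1:N) = P(w_1:N)^(−1/N), computed in log space for numerical stability:

```python
import math

def perplexity(probs):
    """Perplexity of a sequence, given the probability the model
    assigned to each word: P(w_1..w_N) ** (-1/N)."""
    n = len(probs)
    log_p = sum(math.log(p) for p in probs)
    return math.exp(-log_p / n)

# A model that assigns probability 1/10 to every word has
# perplexity 10, regardless of sequence length.
print(perplexity([0.1] * 5))
```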
TEXT CLASSIFICATION
o Text classification, also known as categorization: given a text
of some kind, decide which of a predefined set of classes it
belongs to. Language identification and genre classification
are examples of text classification, as is sentiment analysis
(classifying a movie or product review as positive or
negative) and spam detection (classifying an email message
as spam or not-spam).
TEXT CLASSIFICATION
Classification by data compression
o Another way to think about classification is as a problem in data
compression. A lossless compression algorithm takes a sequence of
symbols, detects repeated patterns in it, and writes a description of
the sequence that is more compact than the original.
o For example, the text "0.142857142857142857" might be
compressed to "0.[142857]*3". Compression algorithms work by building
dictionaries of subsequences of the text, and then referring to entries
in the dictionary. The example here had only one dictionary entry,
"142857".
o In effect, compression algorithms are creating a language model.
The LZW algorithm in particular directly models a maximum-
entropy probability distribution. To do classification by compression,
we first lump together all the spam training messages and compress
them as a unit, and do the same for the ham. Then, given a new
message, we append it to each class's messages, compress the result,
and choose the class whose compressed file grows the least.
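The idea can be sketched with an off-the-shelf compressor. This uses zlib (DEFLATE) rather than LZW, and the tiny "spam"/"ham" corpora are invented for illustration:

```python
import zlib

def compressed_size(text):
    """Size in bytes of the zlib-compressed text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def classify_by_compression(message, corpora):
    """Append the message to each class corpus; the class whose
    compressed file grows least shares the most patterns with it."""
    best, best_delta = None, None
    for label, corpus in corpora.items():
        delta = compressed_size(corpus + message) - compressed_size(corpus)
        if best_delta is None or delta < best_delta:
            best, best_delta = label, delta
    return best

corpora = {
    "spam": "win money now free offer win money free " * 20,
    "ham": "meeting agenda project notes schedule review " * 20,
}
print(classify_by_compression("free money offer win now", corpora))
```

Because the spam corpus already contains the message's words, the compressor encodes them as cheap back-references, so the spam file grows less than the ham file.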
INFORMATION RETRIEVAL
o Information retrieval is the task of finding documents that are relevant
to a user's need for information. The best-known examples of
information retrieval systems are search engines on the World Wide
Web. A Web user can type a query such as [AI book] into a search engine
and see a list of relevant pages. In this section, we will see how such
systems are built.
INFORMATION RETRIEVAL
o An information retrieval (henceforth IR) system can be characterized by :
• A corpus of documents. Each system must decide what it wants to treat
as a document: a paragraph, a page, or a multipage text.
• Queries posed in a query language. A query specifies what the user
wants to know. The query language can be just a list of words, such as
[AI book]; or it can specify a phrase of words that must be adjacent.
• A result set. This is the subset of documents that the IR system judges to
be relevant to the query. By relevant, we mean likely to be of use to the
person who posed the query, for the particular information need
expressed in the query.
• A presentation of the result set. This can be as simple as a ranked list
of document titles or as complex as a rotating color map of the result set
projected onto a three-dimensional space, rendered as a two-
dimensional display.
INFORMATION RETRIEVAL
o Information retrieval = searching for information.
o Searching for information, such as documents, based on a query (user
input), to get the information the user needs from the whole collection
of documents.
o An example is information search using a search engine on the World
Wide Web (WWW), e.g. Google.
INFORMATION RETRIEVAL
Characteristics of IR
1. A collection of documents. The system must decide what it wants
to treat as a single document. Example: a paragraph, a page, etc.

2. User Query
The query is the formula used to find the information needed by the
user. In its simplest form, a query is a list of keywords, and documents
that contain the keywords are the documents searched for.
Example: [AI book]; ["AI book"]; [AI AND book];
[AI NEAR book]; [AI book site:www.aaai.org].
INFORMATION RETRIEVAL
Characteristics of IR
3. Result Set
The results of the query: the subset of the documents that
is relevant to the query.
4. Presentation of the result set
Can be as simple as a ranked list of document titles.

Originally, IR systems worked using the Boolean model,
but it is now mostly abandoned.
INFORMATION RETRIEVAL
IR Scoring models
o The Boolean function has been abandoned in favor of
statistical models based on word counts.
o The BM25 scoring function, by Stephen Robertson and Karen
Sparck Jones at City College, London, has been used
in search engines.
o A scoring function takes a document and a query and
returns a numeric value; the most relevant documents have
the highest scores.
o In the BM25 function, the score is a weighted
combination of scores for each word of the
query.
INFORMATION RETRIEVAL
Factors that affect the weight:
1. The frequency with which a query word appears in the
document (term frequency, TF).
2. The inverse document frequency of the word (IDF):
words that appear in almost every document (usually
the connectors in a query) get low weight.
3. The length of the document. A document of a
million words will probably mention all the query
words, yet may not be the one the query refers to.
A short document containing all the words is a
better candidate.

Example: in the query [farming in Kansas], "in" has a low
IDF, so "farming" and "Kansas" carry most of the weight.
INFORMATION RETRIEVAL
IR System Evaluation
o How do we check that our IR system works well?
o Evaluating the performance of an IR application
shows how successfully the system returns the
information required by the user.
o The parameters used in measuring the
performance of the system are completeness
and accuracy.
o Measures: recall (completeness) and precision
(accuracy).
INFORMATION RETRIEVAL
Case example:

               In Result   Not In Result
Relevant           30            20
Not Relevant       10            40

• Recall (completeness) is the ratio of the number of relevant
documents retrieved by the system to the number of all relevant
documents in the collection (retrieved or not retrieved by the
system).
• Recall = 30 / (30 + 20) = 0.60
• Precision (accuracy) is the ratio of relevant documents retrieved
to all documents retrieved: Precision = 30 / (30 + 10) = 0.75
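The two measures follow directly from the contingency table. A minimal sketch using the example's numbers (the function names are ours):

```python
def recall(tp, fn):
    """Fraction of all relevant documents that the system returned:
    tp = relevant and returned, fn = relevant but missed."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of returned documents that are actually relevant:
    fp = returned but not relevant."""
    return tp / (tp + fp)

# Numbers from the example table: 30 relevant returned,
# 20 relevant missed, 10 irrelevant returned.
print(recall(30, 20))     # → 0.6
print(precision(30, 10))  # → 0.75
```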
INFORMATION RETRIEVAL
IR Refinement
o There are many possible refinements in IR, achieved
by finding better algorithms to improve the relevance of
the returned documents.
o The BM25 scoring function treats every word as if it stood
alone, but in reality many words are related. For example,
the word "couch" is related to the word "couches" and to
"sofa". Many IR systems try to capture these associations.
o One approach to relating "couch" and "couches" is a
stemming algorithm, which strips suffixes (in this case,
removing -es).
INFORMATION RETRIEVAL
IR Refinement
o But stemming introduces another problem: it can
reduce precision. Stemming "stocking" to "stock"
is fine when the text is about inventory, but it
hurts precision when "stocking" means a foot
covering.
o Dictionary-based stemming addresses this problem:
the stemmer does not strip a suffix such as -ing
when the whole word is found in the
dictionary.
o The next step is to recognize synonyms, such as
"couch" and "sofa".
INFORMATION RETRIEVAL
BM25 function:

BM25(d_j, q_1:N) = Σ_{i=1..N} IDF(q_i) · TF(q_i, d_j) · (k + 1) / (TF(q_i, d_j) + k · (1 − b + b · |d_j| / L))

o |d_j| = length of document d_j (in words)
o N = total number of documents in the corpus
o L = average length of the documents in the corpus (document
collection)
o TF(q_i, d_j) = term frequency: how often query word q_i appears in d_j
o IDF(q_i) = inverse document frequency:
IDF(q_i) = log((N − DF(q_i) + 0.5) / (DF(q_i) + 0.5)),
where DF(q_i) is the number of documents containing q_i
o There are 2 tuning parameters, k and b;
typical values are k = 2.0 and b = 0.75.
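Under these definitions, the scoring function can be sketched directly. The toy corpus and query below are invented for illustration; a real system would precompute DF, L, and the term counts in an index rather than rescanning the corpus per query:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k=2.0, b=0.75):
    """BM25 score of one document (a list of terms) for a query.
    `corpus` is a list of documents, each a list of terms, used for
    N (corpus size), L (average length), and document frequencies."""
    N = len(corpus)
    L = sum(len(d) for d in corpus) / N
    score = 0.0
    for q in query_terms:
        tf = doc_terms.count(q)
        df = sum(1 for d in corpus if q in d)
        idf = math.log((N - df + 0.5) / (df + 0.5))
        score += idf * tf * (k + 1) / (tf + k * (1 - b + b * len(doc_terms) / L))
    return score

corpus = [
    "farming in kansas wheat farming".split(),
    "stock market fell again today".split(),
    "kansas city weather report".split(),
    "cooking recipes and baking tips".split(),
    "travel guide to europe".split(),
]
query = "farming in kansas".split()
scores = [bm25_score(query, d, corpus) for d in corpus]
print(scores.index(max(scores)))  # → 0: the farming-in-Kansas document
```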
INFORMATION RETRIEVAL
PageRank Algorithm
o One of the original Google search ideas that distinguished it
from other Web search engines.
o The innovation: use the hyperlinks pointing to a page as evidence
of the page's quality.
o For the query [IBM], how do we make sure that the IBM home page
(ibm.com) is first in the results, even if other pages contain the
word IBM more frequently? The idea is that ibm.com has
many in-links (links pointing to ibm.com), so it should be
ranked first in the results.
o But if we only counted in-links, a Web spammer could
create many pages and point them all at a target page to
boost its score. Therefore, the PageRank
algorithm weights links from high-quality sites more
heavily. What is a high-quality site? One that is linked to by other
high-quality sites. The definition is recursive.
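The recursive definition is resolved by iteration: start with equal ranks and repeatedly redistribute them along the links until the values settle. A minimal power-iteration sketch (the link graph is invented, and the damping constant 0.85 is the commonly cited default):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. `links` maps each page to the pages
    it links to. A page's rank is shared among its out-links, so a
    link from a high-rank page is worth more than one from a spammer."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# "ibm" receives the most in-links, so it ends up ranked highest.
links = {"a": ["ibm"], "b": ["ibm"], "c": ["ibm", "a"], "ibm": ["a"]}
rank = pagerank(links)
print(max(rank, key=rank.get))  # → ibm
```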
INFORMATION RETRIEVAL
HITS Algorithm
o HITS (Hyperlink-Induced Topic Search)
It is similar to the PageRank algorithm, but HITS is
query-dependent: instead of counting all links to a page, it
looks only at the neighborhood of the pages found for the
query. The better the match between the linking pages and
the linked page with respect to the query, the higher the
authority value of the page.
INFORMATION RETRIEVAL
Question Answering

o When the query is a question, the result is not a ranked list of
documents but a short response: a sentence or a phrase.
o The ASKMSR system (Banko et al., 2002) is a Web-based question-answering
system. It is based on the premise that a question may be answered on many
Web pages, so question answering is treated as a problem of
precision (accuracy), not recall (completeness).
o ASKMSR does not model pronouns, verbs, etc. It only knows 15
types of questions and how to rewrite each as a search-engine query.
o For example, the query [who killed Abraham Lincoln] can be rewritten as
[killed Abraham Lincoln] and as ["Abraham Lincoln was killed by *"].
o The results used are not full pages but only the brief text summaries
(snippets) that appear near the query terms.
INFORMATION RETRIEVAL
Question Answering
o The returned snippets are broken into 1-, 2-, and 3-grams, and each
n-gram is scored by its frequency across the result set and by a
weight: an n-gram returned from a very specific query rewrite
(such as the exact phrase ["Abraham Lincoln was killed by *"])
gets more weight than one from a general rewrite such as
[Abraham OR Lincoln OR killed]. We expect "John Wilkes Booth"
to be among the highest-ranked n-grams, but so will
"Abraham Lincoln", "the assassination of", and "Ford's Theatre".
o Once the n-grams are scored, they are filtered by question type:
a "who" question keeps only person names; a "when" question
keeps only dates or times. There is also a filter that discards
n-grams that are part of the question itself.
INFORMATION EXTRACTION
o Information extraction is the process of acquiring knowledge
by skimming a text and looking for occurrences of a particular
class of object and for relationships among objects.
o The simplest type of information extraction system is an
attribute-based extraction system, which assumes that the
entire text refers to a single object; the task is to extract
attributes of that object, using templates typically defined
by regular expressions.
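An attribute-based extractor can be sketched with regular-expression templates. The templates and the sample product text below are invented for illustration; a real system would have many templates per attribute:

```python
import re

# Hypothetical templates: one regex per attribute, each with a single
# capture group for the attribute's value.
TEMPLATES = {
    "manufacturer": re.compile(r"\b(IBM|Apple|Dell)\b"),
    "model": re.compile(r"\b(ThinkBook \d+)\b"),
    "price": re.compile(r"\$([0-9]+(?:\.[0-9]{2})?)"),
}

def extract_attributes(text):
    """Attribute-based extraction: assume the whole text describes one
    object, and pull out each attribute with its template."""
    found = {}
    for attr, pattern in TEMPLATES.items():
        m = pattern.search(text)
        if m:
            found[attr] = m.group(1)
    return found

print(extract_attributes("IBM ThinkBook 970. Our price: $399.00"))
```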
INFORMATION EXTRACTION
o Information extraction systems vary along 4 dimensions:
• Deterministic to stochastic
• Domain-specific to general
• Hand-crafted to learned
• Small-scale to large-scale
o FASTUS is a relational extraction system.
o It processes text in 5 stages:
• Tokenization
• Complex-word handling
• Basic-group handling
• Complex-phrase handling
• Structure merging
INFORMATION EXTRACTION
o Tokenization: divide the stream of characters into tokens (words,
numbers, and punctuation)
o Complex-word handling: handle multi-word units and names,
such as "Bridgestone Sports Co.", which is recognized as the
name of a company
o Basic-group handling: group words by their grammatical category,
such as verb groups and noun groups (including prefixes and
conjunctions)
o Complex-phrase handling: merge basic groups into larger
phrases, so that they can be processed and the
result is unambiguous
o Structure merging: combine the structures produced by the
previous stages
INFORMATION EXTRACTION
o When information extraction must be attempted
from noisy or varied input, simple finite-state
approaches fare poorly.
o It is too hard to get all the rules and their
priorities right; it is better to use a probabilistic
model rather than a rule-based model.
o The simplest probabilistic model for sequences
with hidden state is the hidden Markov model, or
HMM.
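The most probable hidden-state sequence of an HMM is computed with the Viterbi algorithm. A minimal sketch; the two-state speaker-recognition HMM below and all of its probabilities are invented for illustration (BG = background text, NAME = part of a speaker name):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for an observation sequence."""
    # V[t][s] = (best probability of ending in state s at step t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][r][0] * trans_p[r][s] * emit_p[s][obs[t]], r)
                for r in states)
            V[t][s] = (prob, prev)
    # Trace the best final state back to the start.
    path = [max(states, key=lambda s: V[-1][s][0])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return path[::-1]

states = ["BG", "NAME"]
start_p = {"BG": 0.8, "NAME": 0.2}
trans_p = {"BG": {"BG": 0.7, "NAME": 0.3}, "NAME": {"BG": 0.4, "NAME": 0.6}}
emit_p = {
    "BG":   {"speaker": 0.4, ":": 0.4, "john": 0.1, "smith": 0.1},
    "NAME": {"speaker": 0.05, ":": 0.05, "john": 0.45, "smith": 0.45},
}
print(viterbi(["speaker", ":", "john", "smith"], states, start_p, trans_p, emit_p))
```

The path labels "john smith" as NAME tokens and the preceding words as background, which is exactly the kind of extraction the slide's example describes.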
INFORMATION EXTRACTION
Example
o A brief text and the most probable (Viterbi) path
for that text under two HMMs: one trained to
recognize the speaker in a talk announcement,
and one trained to recognize dates.
INFORMATION EXTRACTION
INFORMATION EXTRACTION
Ontology extraction from large corpora
o So far we have thought of information extraction as finding a specific set of
relations (e.g.,speaker, time, location) in a specific text (e.g., a talk
announcement). A different application of extraction technology is building
a large knowledge base or ontology of facts from a corpus. This is different
in three ways:
o First it is open-ended—we want to acquire facts about all types of domains,
not just one specific domain.
o Second, with a large corpus, this task is dominated by precision, not
recall, just as with question answering on the Web (Section 22.3.6).
o Third, the results can be statistical aggregates gathered from multiple
sources, rather than being extracted from one specific text.
MACHINE TRANSLATION
o Consider a system that could read on its own and build up its own
database. Such a system would be relation-independent: it
would work for any relation.
o In practice, these systems work on all relations in
parallel, because of the I/O demands of large corpora.
They behave less like a traditional information-extraction
system that is targeted at a few relations and more like a
human reader who learns from the text itself; because of
this the field has been called machine reading.
o A representative machine-reading system is
TEXTRUNNER (Banko and Etzioni, 2008).
MACHINE TRANSLATION

o Given a set of labeled examples of this type, TEXTRUNNER

trains a linear-chain CRF to extract further examples from
unlabeled text. The features in the CRF include function words
like "to", "of", and "the", but not nouns and verbs (and not
noun phrases or verb phrases). Because TEXTRUNNER is
domain-independent, it cannot rely on predefined lists of nouns
and verbs.
o TEXTRUNNER achieves a precision of 88% and recall of 45%
(F1 of 60%) on a large Web corpus. TEXTRUNNER has extracted
hundreds of millions of facts from a corpus of a half-billion Web pages.
MACHINE TRANSLATION
Getting Started with Prolog

o Prolog is a language based on first order predicate logic. (Will


revise/introduce this later).
o We can assert some facts and some rules, then ask questions to find
out what is true.
o Facts:
likes(john, mary).
tall(john).
tall(sue).
short(fred).
teaches(alison, artificialIntelligence).
MACHINE TRANSLATION
Prolog

o Rules:
likes(fred, X) :- tall(X).
examines(Person, Course) :- teaches(Person, Course).

• Fred likes someone if that someone is tall.


• A person examines a course if they teach that course.
• NOTE: “:-” used to mean IF. Meant to look a bit like a
backwards arrow
• NOTE: Use of capitals (or words starting with capitals)
for variables.
MACHINE TRANSLATION
Prolog
o Your “program” consists of a file containing facts and
rules.
o You “run” your program by asking “questions” at the
prolog prompt.
|?- likes(fred, X).

o Fred likes who?

o Answers are then displayed. Type ";" to get more
answers:

X = john ? ;
X = sue ? ;
no

(Note: darker font for system output)
MACHINE TRANSLATION
Prolog and Search

o Prolog can return more than one answer to a question.


o It has a built-in search method for going through all the
possible rules and facts to obtain all possible answers.
o The search method is depth-first search with backtracking.
SUMMARY
o Probabilistic language models based on n-grams recover a
surprising amount of information about a language. They
can perform well on such diverse tasks as language
identification, spelling correction, genre classification, and
named-entity recognition.
o Text classification can be done with naive Bayes.
o Information retrieval systems use a very simple language
model based on bags of words, yet manage good recall and
precision on very large corpora of text. On Web corpora,
link-analysis algorithms improve performance.
SUMMARY
o Question answering can be handled by an approach based on
information retrieval, for questions that have multiple answers
in the corpus. When more answers are available in the corpus,
we can use techniques that emphasize precision rather than
recall.
o Information extraction systems use a more complex model
that includes limited notions of syntax and semantics in the form
of templates. They can be built from finite-state automata.
o In building a statistical language system, it is best to devise a
model that can make good use of available data, even if the
model seems overly simplistic.
REFERENCES
o Stuart Russell and Peter Norvig. 2010. Artificial Intelligence: A Modern
Approach. Pearson Education, New Jersey. ISBN 9780132071482. Chapter 22.
o Elaine Rich, Kevin Knight, and Shivashankar B. Nair. 2010. Artificial
Intelligence. McGraw-Hill Education, New York. ISBN 0070678162. Chapter 15.
o Applications in Natural Language Processing:
http://artint.info/html/ArtInt_290.html
o Artificial Intelligence: Natural Language Processing:
http://www.cs.utexas.edu/~mooney/cs343/slide-handouts/nlp.pdf
Thank You...
