
MODULE - 5

NATURAL LANGUAGE PROCESSING
SUBJECT CODE: 18AI641

BY RACHEL E C
BITM, BALLARI
Information retrieval (IR) may be defined as a software program that
deals with the
1. organization,
2. storage,
3. retrieval and
4. evaluation of information from document repositories, particularly textual information.

We deal only with retrieval of text documents.

The word document in general includes non-textual information such as images and speech.

The interaction between NLP and IR is now strengthening many NLP techniques, including probabilistic models.

IR applications include latent semantic indexing and vector space retrieval.


Design Features of Information Retrieval Systems

▪ The words term and keyword will be used interchangeably.

▪ Keywords can be single words or multi-word phrases.

▪ They may be extracted automatically or manually.

▪ The process of transforming document text to some representation of it is known as indexing; an inverted index is a commonly used data structure for this.

▪ Text operations – stop word elimination (removal of grammatical or functional words) and stemming (reduction of words to their grammatical roots).
▪ Zipf's law states that the frequency of a token in a text is inversely proportional to its rank in the list sorted by frequency.

▪ This law describes how tokens are distributed in languages: some tokens occur very frequently, some occur with intermediate frequency, and some tokens occur rarely.

▪ Quantify the significance of index terms to a document by assigning numerical values, called weights.

▪ A number of term-weighting schemes have been proposed in the literature.
Indexing
▪ A collection of raw documents is transformed into an easily accessible representation.
▪ This involves identifying good document descriptors (GDs).
▪ A GD helps describe the content of a document and discriminate it from the other documents in the collection.
TREC – Text Retrieval Conference method for phrase extraction

▪ Any pair of adjacent non-stop words is regarded as a potential phrase.

▪ The final list of phrases is composed of those pairs of words that occur in 25 or more documents in the collection.
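A minimal sketch of this idea in Python (the stop-word list and the default 25-document threshold below are illustrative assumptions, not the actual TREC implementation):

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}  # illustrative subset

def candidate_phrases(doc_tokens):
    """Yield pairs of adjacent non-stop words from a tokenized document."""
    for w1, w2 in zip(doc_tokens, doc_tokens[1:]):
        if w1 not in STOP_WORDS and w2 not in STOP_WORDS:
            yield (w1, w2)

def extract_phrases(docs, min_doc_count=25):
    """Keep word pairs that occur in at least min_doc_count documents."""
    doc_freq = defaultdict(int)
    for tokens in docs:
        for pair in set(candidate_phrases(tokens)):  # count each pair once per document
            doc_freq[pair] += 1
    return [pair for pair, df in doc_freq.items() if df >= min_doc_count]
```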
Eliminating Stop words
Stemming
Stemming vs Lemmatization
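As an illustration (not part of the original slides), NLTK's PorterStemmer and WordNetLemmatizer show how stop word elimination, stemming, and lemmatization differ; this assumes the NLTK stopwords and wordnet data packages are installed:

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# one-time setup: nltk.download("stopwords"); nltk.download("wordnet")

words = ["studies", "studying", "better", "leaves"]
stop_set = set(stopwords.words("english"))

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

tokens = [w for w in words if w not in stop_set]           # stop word elimination
print([stemmer.stem(w) for w in tokens])                   # stems, e.g. 'studies' -> 'studi'
print([lemmatizer.lemmatize(w, pos="v") for w in tokens])  # lemmas, e.g. 'studies' -> 'study'
```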
Information Retrieval Model
Classical IR Model
It is the simplest and easiest IR model to implement. It is based on mathematical knowledge that is easily recognized and understood. The Boolean, vector, and probabilistic models are the three classical IR models.

Non-Classical IR Model
It stands in contrast to the classical IR model. Such IR models are based on principles other than similarity, probability, or Boolean operations. The information logic model, situation theory model, and interaction model are examples of non-classical IR models.

Alternative IR Model
It enhances the classical IR model by making use of specific techniques from other fields. The cluster model, fuzzy model, and latent semantic indexing (LSI) model are examples of alternative IR models.
Classical IR Model
Boolean Model
▪ Based on Boolean Logic and Classical set theory.
▪ Documents are represented as sets of keywords stored in an inverted file.
▪ In the Boolean retrieval model, we can pose any query in the form of a Boolean expression of terms, where the terms are logically combined using the Boolean operators AND, OR, and NOT.
Given a finite set of index terms
T = {t1, t2, …, ti, …, tm}
and a finite set of documents
D = {d1, d2, …, dj, …, dn}
a query Q in normal form is written as
Q = ⋀(⋁ θi), where θi ∈ {ti, ¬ti}
➢ The Boolean AND of two logical statements x and y means that both x AND y must be satisfied, and it yields a set of documents that is smaller than or equal to either statement's document set.
➢ The Boolean OR of the same two statements means that at least one of them must be satisfied, and it fetches a set of documents that is greater than or equal to either statement's document set.
➢ Any number of logical statements can be combined using the three
Boolean operators.
➢ The queries are designed as Boolean expressions which have precise semantics, and the retrieval strategy is based on a binary decision criterion.

➢ The Boolean model can also be explained well by mapping the terms in the query with a set of documents.

➢ Google, the most famous web search engine in recent times, also ranks its web page result set based on a two-stage system:
- In the first step, a simple Boolean retrieval model returns matching documents in no particular order, and
- In the next step, ranking is done according to some estimator of relevance.
Aspects of Boolean Information Retrieval Model

Indexing: Indexing is one of the core functionalities of information retrieval models and the first step in building an IR system; it enables efficient retrieval of information.

• Indexing is largely an offline operation that collects data about which words occur in the text corpus, so that at search time we only have to access the pre-compiled index.

• The Boolean model builds the indices for the terms in the query, considering only whether the index terms are present or absent in a document.
Term-Document Incidence Matrix: This is one of the basic mathematical models for representing text data; it can be used to answer any query posed as a Boolean expression in the Boolean retrieval model.

• It views the document as a set of terms and creates the indexing required for the Boolean retrieval model.

• The text data is represented in the form of a matrix whose rows represent the documents and whose columns represent the terms to be analyzed and retrieved; each cell of the matrix indicates whether the term occurs in the document (1) or not (0).

• This model has good precision, as documents are retrieved only when the condition is matched, but it doesn't scale well with the size of the corpus; an inverted index can be used as a good alternative.
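A small illustrative sketch (the documents and the query are invented) of answering a Boolean query over a term-document incidence matrix using bitwise operations:

```python
import numpy as np

docs = [
    "information retrieval and ranking",
    "boolean retrieval model",
    "vector space model for ranking",
]
terms = sorted({w for d in docs for w in d.split()})

# incidence[i, j] = 1 if term i occurs in document j, else 0
incidence = np.array([[1 if t in d.split() else 0 for d in docs] for t in terms])

def row(term):
    return incidence[terms.index(term)]

# Query: retrieval AND ranking AND NOT boolean
hits = row("retrieval") & row("ranking") & ~row("boolean") & 1
print([f"doc{j}" for j, h in enumerate(hits) if h])   # -> ['doc0']
```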
Processing the data for Boolean retrieval model

• We should strip unwanted characters/markup such as HTML tags, punctuation marks, numbers, etc. before breaking the corpus into tokens/keywords on whitespace.

• Stemming is then performed, and common stop words are removed, depending on the needs of the application.

• The term-document incidence matrix or inverted index (mapping each keyword to the list of documents containing it) is built.

Then common queries/phrases may be detected using a domain-specific dictionary, if needed.
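A hedged sketch of this pipeline (the regular expressions and the stop-word list are simplifications for illustration, not a prescribed implementation; PorterStemmer comes from NLTK):

```python
import re
from collections import defaultdict
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}  # illustrative
stemmer = PorterStemmer()

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)           # strip HTML tags
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # drop punctuation and numbers
    tokens = text.split()                          # tokenize on whitespace
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each keyword to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}
```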
Advantages
Simple, efficient, easy to implement, and performs well (in recall and precision) if the query is well formulated.

Drawbacks
First, the model is not able to retrieve documents that are only partly relevant to the user query; retrieval is all or nothing ("to be or not to be").

Second, a Boolean system is not able to rank the returned list of documents. It distinguishes between the presence and absence of keywords but fails to assign relevance or importance to a keyword within a document.

Third, users seldom formulate their queries in the pure Boolean expressions that this model requires.
Vector Space Model
▪ Represents documents and queries as vectors of features
representing terms that occur within them.
▪ Each document is characterized by a Boolean or numerical vector.
▪ Documents are represented in a multi-dimensional space, in which each dimension corresponds to a distinct term in the corpus of documents.
▪ Each feature takes a value of either 0 or 1, indicating the absence or presence of that term in the document or query.
▪ Ranking algorithms compute the similarity between the document and query vectors to yield a retrieval score.

Given a finite set of n documents
D = {d1, d2, …, dj, …, dn}
and a finite set of m terms
T = {t1, t2, …, ti, …, tm}
each document is represented by a column vector of weights
(w1j, w2j, …, wij, …, wmj)^t
where wij is the weight of the term ti in document dj. The collection as a whole is represented by an m x n term-document matrix:

w11 w12 … w1j … w1n
w21 w22 … w2j … w2n
…
wi1 wi2 … wij … win
…
wm1 wm2 … wmj … wmn
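A minimal illustration of ranking in the vector space model (the documents are invented, term weights are raw term frequencies, and cosine similarity is used as one common choice of similarity measure):

```python
import math
from collections import Counter

def vectorize(tokens, vocab):
    counts = Counter(tokens)
    return [counts[t] for t in vocab]          # term-frequency weights

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [["boolean", "retrieval", "model"],
        ["vector", "space", "retrieval", "model"],
        ["probabilistic", "ranking"]]
query = ["vector", "retrieval"]

vocab = sorted({t for d in docs for t in d} | set(query))
doc_vecs = [vectorize(d, vocab) for d in docs]
q_vec = vectorize(query, vocab)

scores = sorted(((cosine(q_vec, dv), j) for j, dv in enumerate(doc_vecs)), reverse=True)
print(scores)   # documents ranked by similarity to the query
```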
Probabilistic Model
▪ Ranks documents based on the probability of their relevance to a given query.
▪ A document is retrieved if its probability of relevance is higher than its probability of non-relevance, i.e. exceeds a threshold value.
▪ Given a set of documents D, a query q and a cut-off value α, the model calculates the relevance and irrelevance probabilities for each document and ranks the documents in decreasing order of relevance.
If
P(R|d) – probability of relevance
P(I|d) – probability of irrelevance
the set of documents retrieved in response to query q is
S = {dj | P(R|dj) ≥ P(I|dj) and P(R|dj) ≥ α}


▪ Most systems assume that the terms are independent when estimating probabilities.
▪ This assumption allows estimation of parameter values and helps reduce the computational complexity of the model.
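A toy sketch of this retrieval rule under the term-independence assumption (the per-term probabilities and documents below are invented for illustration; real systems estimate these probabilities from relevance feedback or collection statistics):

```python
# P(term | relevant) and P(term | irrelevant), assumed known for this sketch
p_rel = {"retrieval": 0.8, "ranking": 0.6, "boolean": 0.3}
p_irr = {"retrieval": 0.2, "ranking": 0.3, "boolean": 0.5}

def score(doc_terms, probs, default=0.1):
    s = 1.0
    for t in doc_terms:
        s *= probs.get(t, default)   # independence: multiply per-term probabilities
    return s

def retrieve(docs, alpha=0.05):
    results = []
    for doc_id, terms in docs.items():
        rel, irr = score(terms, p_rel), score(terms, p_irr)
        if rel >= irr and rel >= alpha:   # S = {d | P(R|d) >= P(I|d) and P(R|d) >= alpha}
            results.append((rel, doc_id))
    return sorted(results, reverse=True)

print(retrieve({"d1": ["retrieval", "ranking"], "d2": ["boolean"]}))  # -> [(0.48, 'd1')]
```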

Drawbacks
▪ The results given by this model will only partly match the user query.
▪ A threshold value must be determined for the initially retrieved set.
▪ The number of relevant documents returned by a query is often too small for the probabilities to be estimated accurately.
Term Weighting
▪ Term weighting is a procedure that takes place during the text
indexing process in order to assess the value of each term to the
document.

▪ Term weighting is the assignment of numerical values to terms that represent their importance in a document, in order to improve retrieval effectiveness.

▪ Each term selected as an indexing feature acts as a discriminator between that document and the other documents in the corpus.
▪ Luhn (1958) attempted to quantify the discriminating power of terms by associating it with the frequency of their occurrence (term frequency) within the document.
▪ Term weighting rests on the following facts:
1. The more often a document contains a given word, the more that document is about the concept represented by that word.
2. The fewer documents a term occurs in within a collection, the more discriminating that term is.
3. Document length: a long document may repeat terms simply because of its length, so weights should be normalized for length.
▪ Terms occurring in only a few documents are useful for distinguishing those documents from the rest of the collection.
▪ Terms occurring frequently across the entire collection are less helpful when discriminating among documents.
Consider the ratio n / ni, where n is the total number of documents in the collection and ni is the number of documents in which term i occurs.
▪ This measure assigns the lowest weight, 1, to a term that appears in all documents,
▪ and the highest weight, n, to a term that appears in only one document.
▪ As the number of documents in any collection is usually large, the
log of this measure is usually taken, resulting in the following form of
inverse document frequency (idf) term weight:
idfi = log (n / ni )
▪ idf attaches more importance to more specific terms.
▪ If a term occurs in all documents, its idf is 0.
▪ Sparck Jones showed experimentally that a weight of log(n / ni) + 1 leads to more effective retrieval.
▪ Combining the term frequency (tf) and idf weights results in the tf x idf scheme:
wij = tfij x log(n / ni)
• Attempts have been made to normalize the tf and idf factors in different ways to allow for variations in document length, e.g.
wij = (tfij / max(tfij)) x (log(n / ni) / log n)
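A short sketch of computing these tf-idf weights for a toy term-document collection (the documents are invented; the formula follows wij = tfij x log(n / ni) from above):

```python
import math
from collections import Counter

docs = [["information", "retrieval", "retrieval"],
        ["retrieval", "model"],
        ["language", "model", "model"]]
n = len(docs)
vocab = sorted({t for d in docs for t in d})

# ni = number of documents containing term i
df = {t: sum(1 for d in docs if t in d) for t in vocab}

def tf_idf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(n / df[t]) for t in tf}

for j, d in enumerate(docs):
    print(f"d{j}:", {t: round(w, 3) for t, w in tf_idf(d).items()})
```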
▪ Most weighting schemes can be characterized by the following 3
factors:
▪ Within-document frequency or term frequency (tf)
▪ Collection frequency or inverse document frequency (idf)
▪ Document Length.
Each weighting scheme can be described by a triple ABC:
A – how the tf component is handled,
B – how the idf component is incorporated,
C – the length normalization component.
Non - Classical IR Model
Non-classical information retrieval models are based on principles other than similarity, probability, and Boolean operations, on which classical retrieval models are based.
1. Information Logic Model
2. Situation Theory Model and
3. Interaction Model.
Information Logic Model
• It uses a special logic technique called logical imaging.
• Retrieval is performed by making inferences from the document to the query.
• The principle put forward by Rijsbergen is used to measure uncertainty.

Situation Theory Model
• Based on Rijsbergen’s principle.
• Retrieval is considered as a flow of information from the document to the query.
• A structure called an infon, denoted by l, is used to describe the information flow in this model.
• An infon represents an n-ary relation together with its polarity, 1 or 0 (positive or negative information).

• For example, "Adil is serving a dish" is conveyed by the infon
l = << serving, Adil, dish; 1 >>
• The polarity of an infon depends on its support.
• The support relation s |= l means that situation s makes the infon l true.
• l = << serving, Adil, dish; 1 >> is made true by the situation s1 = "I see Adil serving a dish."
• A document d is retrieved if d |= q; if the document does not directly support the query q, additional information such as synonyms, hypernyms/hyponyms, meronyms, etc. is used to establish the flow of information.
Interaction IR Model
• The documents are not isolated; instead, they are interconnected.
• The query interacts with the interconnected documents.
• Retrieval is conceived as result of this interaction.
• Artificial Neural Network can be used to implement this model.
• Each document is modeled as a neuron; the document collection as a whole forms a neural network.
• The query is also modeled as a neuron and integrated into the network.
• As a result, new connections are built between the query and the documents, and existing connections are changed.
• This restructuring corresponds to the concept of interaction.
• A measure of this interaction is obtained and used for retrieval.
Alternative IR Model
It enhances the classical IR model by making use of specific techniques from other fields.
1. Cluster model
2. Fuzzy model
3. Latent Semantic Indexing (LSI) model.
1.Cluster Model
• The cluster model is an attempt to reduce the number of matches
during retrieval.
• Salton proposed the use of clustering, based on the cluster hypothesis:
Closely associated documents tend to be relevant to the same requests.
• By forming groups (classes or clusters) of related documents, the search time is reduced considerably.
• The query is matched against the representative of each class; only documents from classes whose representatives are close to the query are considered for individual matching.
• Clustering can also be applied to terms, forming groups called classes of co-occurring terms.
• Term co-occurrence is used in dimensionality reduction or in thesaurus construction (a thesaurus captures information about a particular field or set of concepts).
Let
D = {d1, d2, …, dk, …, dn} be the document collection,
E = (eij)n,n the similarity matrix, where
eij is the similarity between documents di and dj, and
T a threshold value.
A pair of documents di, dj (i ≠ j) is placed in the same cluster if their similarity measure exceeds the threshold (eij ≥ T).
The remaining documents form a single cluster. The resulting set of clusters is
C = {C1, C2, C3, …, Ck, …, Cp}
The representative vector for cluster Ck is
rk = (a1k, a2k, …, aik, …, amk)
where each element aik is the weight of term ti in cluster Ck, derived from the weights of ti in the documents dj belonging to that cluster.
During retrieval the query is compared with the cluster vectors (r1, r2, …, rk, …, rp). The comparison involves computing the similarity sk between the query vector q and each representative vector rk. A cluster Ck whose similarity sk exceeds a threshold is returned, and the search proceeds within that cluster.
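An illustrative sketch of cluster-based retrieval (centroids are used here as the cluster representative vectors rk, which is one common choice; the documents, clusters, and threshold are invented):

```python
import math

def centroid(vectors):
    """Representative vector rk: the mean of the document vectors in the cluster."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# toy term weights for documents already grouped into clusters
clusters = {
    "C1": [[1, 1, 0, 0], [2, 1, 0, 0]],    # documents about terms t1, t2
    "C2": [[0, 0, 1, 2], [0, 1, 1, 1]],    # documents about terms t3, t4
}
query = [1, 2, 0, 0]
threshold = 0.5

for name, docs in clusters.items():
    sk = cosine(query, centroid(docs))
    if sk >= threshold:   # the search proceeds only inside clusters close to the query
        ranked = sorted(docs, key=lambda d: cosine(query, d), reverse=True)
        print(name, round(sk, 3), ranked)
```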
2. Fuzzy Model
• The fuzzy model is based on fuzzy set theory, in which membership in a set is a matter of degree rather than a crisp yes/no decision; this makes it possible to express how strongly a document is associated with a term.
• The document is represented as a fuzzy set of terms, i.e. a set of pairs [ti, μ(ti)],
• where μ is the membership function.
• The membership function assigns to each term of the document a numeric membership degree.
• The membership degree expresses the significance (weight) of the term with respect to the information contained in the document.
• Weights are based on the number of occurrences of the term in the document and in the entire document collection.
Given the documents
D = {d1, d2, …, dj, …, dn}
each document dj is represented by a vector of term weights
(w1j, w2j, …, wij, …, wmj)^t
where wij is the degree to which term ti belongs to document dj.
Each term in the document is taken as a representative of a subject area, and each term ti is itself represented by a fuzzy set fi in the domain of documents, given by
fi = {(dj, wij) | j = 1, …, n}, for i = 1, …, m
• This weight representation makes it possible to rank the retrieved documents in decreasing order of their relevance to the user query.
• Queries are Boolean queries. Fuzzy set operators are then applied to obtain the desired result.
• For a single-term query q = tq, the result is fq = {(dj, wqj)}.
Case of AND and OR:
q = tq1 ∧ tq2 – first fq1 and fq2 are obtained and their intersection is taken: fq1 ∧ fq2 = {(dj, min(wq1j, wq2j))}
q = tq1 ∨ tq2 – the union of the fuzzy sets fq1 and fq2 is taken: fq1 ∨ fq2 = {(dj, max(wq1j, wq2j))}
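A small sketch of these fuzzy operators over invented term-document membership degrees (min for AND, max for OR, as described above):

```python
# membership degree of each document in the fuzzy set of a term: f_t = {(doc, weight)}
f = {
    "retrieval": {"d1": 0.9, "d2": 0.4, "d3": 0.0},
    "ranking":   {"d1": 0.6, "d2": 0.8, "d3": 0.1},
}

def fuzzy_and(f1, f2):
    return {d: min(f1.get(d, 0.0), f2.get(d, 0.0)) for d in f1.keys() | f2.keys()}

def fuzzy_or(f1, f2):
    return {d: max(f1.get(d, 0.0), f2.get(d, 0.0)) for d in f1.keys() | f2.keys()}

# q = retrieval AND ranking: rank documents by their membership in the intersection
result = fuzzy_and(f["retrieval"], f["ranking"])
print(sorted(result.items(), key=lambda kv: kv[1], reverse=True))
# -> [('d1', 0.6), ('d2', 0.4), ('d3', 0.0)]
```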
3. Latent Semantic Indexing (LSI) model
• Latent Semantic Indexing is also known as Latent Semantic
Analysis.
• Latent Semantic Indexing is a method used to improve the accuracy of information retrieval.
• It uses a technique called Singular Value Decomposition (SVD) to scan unstructured data within the documents and recognize the relationships between the concepts contained therein.
• To improve the understanding of the information, it finds the
hidden associations between words.

The term-document matrix W is decomposed as
W = T S D^T
Keeping only the k largest singular values gives
Wk = Tk Sk Dk^T
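A brief sketch of this truncation with NumPy (the term-document matrix below is invented; numpy.linalg.svd returns factors such that W = T S D^T):

```python
import numpy as np

# toy m x n term-document matrix W (rows = terms, columns = documents)
W = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

T, s, Dt = np.linalg.svd(W, full_matrices=False)   # W = T @ diag(s) @ Dt

k = 2                                              # keep the k largest singular values
W_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]        # rank-k approximation Wk = Tk Sk Dk^T

print(np.round(W_k, 2))
```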
LEXICAL RESOURCES
A lexicon, or lexical resource, is a collection of words and/or phrases
along with associated information, such as part-of-speech and sense
definitions. Lexical resources are secondary to texts, and are usually
created and enriched with the help of texts.

Examples include:
▪ WordNet
▪ FrameNet
▪ Tools such as stemmers, taggers, and parsers
▪ Test corpora
WordNet
Word Sense
Words are ambiguous, which means the same word can be used
differently depending on the context.
For example, a 'bank' could be a river bank or a financial institution.
These meanings and variety due to context are captured by sense (or
word sense).
A sense (or word sense) is a discrete representation of one aspect of
the meaning of a word.
Semantic Relations
Synonymy The senses of two separate words are called synonyms if
the meanings of these words are identical or similar. Example:
center/middle, run/jog, etc.
Antonymy Antonyms are words with opposite meanings.
Example: dark/light, fast/slow etc.
Taxonomic Relations
Word senses can be related taxonomically so that they can be
classified in certain categories.
A word (or sense) is a hyponym of another word or sense if it denotes a subclass of the other; the more general word is conversely called a hypernym.
For example, man is a hyponym of animal, and animal is a hypernym of man. Alternatively, the hyponym/hypernym relation can be defined as an IS-A relationship: 'Man IS-A animal'.
Meronymy The 'part-whole' relationship is called meronymy. A wheel is part of a car.
What is the WordNet?

WordNet is a large lexical database of words, senses, and their semantic relations.
The project was started by George A. Miller in the mid-1980s, and it captures words and their senses.
In WordNet, the sense is defined by a set of synonyms, called synsets,
that have a similar meaning or sense.
This means WordNet represents words (or senses) as lists of the word
senses that can be used to express the concept.
Here is an example of a synset. Sense for the word 'fool' can be defined
by the list of synonyms as {chump, jester, gull, fritter, dupe, fool
around}
Synset
In NLTK, Synset is the interface used to look up words in WordNet.
A Synset instance is a grouping of words that are synonymous or that express similar concepts.
Some words have a single Synset, and some have multiple.
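For instance, with NLTK's WordNet interface (requires the wordnet data package; output abbreviated):

```python
from nltk.corpus import wordnet as wn

# all synsets (senses) for the word 'bank'
for syn in wn.synsets("bank")[:3]:
    print(syn.name(), "-", syn.definition())

dog = wn.synset("dog.n.01")
print(dog.lemma_names())        # synonyms in the synset
print(dog.hypernyms())          # more general concepts, e.g. canine.n.02
print(dog.hyponyms()[:3])       # more specific concepts, e.g. puppy, lapdog
print(dog.part_meronyms())      # part-whole relations
```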
Structure of WordNet
A synonym set (synset) is a group of words that all refer to the same notion in WordNet.
The structure of WordNet is made up of words and synsets linked together by conceptual-semantic links.
Applications of WordNet
1. Concept Identification in Natural Language
2. Word Sense Disambiguation
3. Automatic Query Expansion
4. Document Structuring and Categorization
5. Document Summarization
FrameNet
Frames
A Frame is a script-like conceptual structure that describes a
particular type of situation, object, or event along with the
participants and props that are needed for that Frame.
For example, the “Apply_heat” frame describes a common situation
involving a Cook, some Food, and a Heating_Instrument, and is
evoked by words such as bake, blanch, boil, broil, brown, simmer,
steam, etc.
We call the roles of a Frame “frame elements” (FEs) and the frame-
evoking words are called “lexical units” (LUs).
The FrameNet corpus is a lexical database of English that is both
human- and machine-readable, based on annotating examples of
how words are used in actual texts.
FrameNet is based on a theory of meaning called Frame Semantics,
deriving from the work of Charles J. Fillmore and colleagues.
The basic idea is straightforward: that the meanings of most words
can best be understood on the basis of a semantic frame: a
description of a type of event, relation, or entity and the participants
in it.
For example, the concept of cooking typically involves a person doing
the cooking (Cook), the food that is to be cooked (Food), something
to hold the food while cooking (Container) and a source of heat
(Heating_instrument).
In the FrameNet project, this is represented as a frame called
Apply_heat, and the Cook, Food, Heating_instrument and Container
are called frame elements (FEs).
Words that evoke this frame, such as fry, bake, boil, and broil, are
called lexical units (LUs) of the Apply_heat frame.
The job of FrameNet is to define the frames and to annotate
sentences to show how the FEs fit syntactically around the word that
evokes the frame.
FrameNet includes relations between Frames. Several types of
relations are defined, of which the most important are:
Inheritance: An IS-A relation. The child frame is a subtype of the
parent frame, and each FE in the parent is bound to a corresponding
FE in the child. An example is the “Revenge” frame which inherits
from the “Rewards_and_punishments” frame.
Using: The child frame presupposes the parent frame as background, e.g. the "Speed" frame "uses" (or presupposes) the "Motion" frame; however, not all parent FEs need to be bound to child FEs.
Subframe: The child frame is a subevent of a complex event
represented by the parent, e.g. the “Criminal_process” frame has
subframes of “Arrest”, “Arraignment”, “Trial”, and “Sentencing”.

Perspective_on: The child frame provides a particular perspective on an un-perspectivized parent frame. A pair of examples consists of the "Hiring" and "Get_a_job" frames, which perspectivize the "Employment_start" frame from the Employer's and the Employee's points of view, respectively.
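For example, NLTK ships an interface to the FrameNet corpus that can be queried directly (requires the framenet_v17 data package; output abbreviated):

```python
from nltk.corpus import framenet as fn

frame = fn.frame("Apply_heat")
print(frame.name)
print(sorted(frame.FE.keys())[:5])                  # frame elements, e.g. Container, Cook, Food
print(sorted(frame.lexUnit.keys())[:5])             # lexical units, e.g. 'bake.v', 'boil.v'
print([r.type.name for r in frame.frameRelations])  # frame relations such as Inheritance, Using
```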
