
UNIT II MODELING

Taxonomy and Characterization of IR Models – Boolean Model – Vector Model – Term Weighting – Scoring and Ranking – Language Model – Set Theoretic Models – Probabilistic Models – Algebraic Models – Structured Text Retrieval Models – Models for Browsing
2.1 Taxonomy and Characterization of IR models
Information Retrieval (IR) models are systems designed to retrieve relevant information from a large
collection of data in response to a user's query. These models play a crucial role in search engines, document
retrieval systems, and other information retrieval applications. Taxonomy and characterization of IR models
involve categorizing them based on their underlying principles, techniques, and characteristics.
2.1.1 Taxonomy of IR Models:
Based on Retrieval Technique:
1. Boolean Models: Utilize Boolean algebra to retrieve documents based on the presence or absence of terms.
2. Vector Space Models (VSM): Represent documents and queries as vectors in a high-dimensional space, measuring similarity using cosine similarity.
3. Probabilistic Models: Assign probabilities to the relevance of documents given a query, e.g., BM25.

Based on Learning Approach:
1. Supervised Learning Models: Trained on labeled data to predict relevance based on features.
2. Unsupervised Learning Models: Discover patterns and relationships in data without labeled examples.

Based on Document Representation:
1. Bag-of-Words Models: Represent documents as an unordered set of words, ignoring grammar and word order.
2. Sequence Models: Consider word order and syntax, capturing sequential dependencies.

Based on Interaction:
1. Interactive Models: Incorporate user feedback to refine search results over time.
2. Batch Models: Retrieve results without user interaction.
2.1.2 Characterization of IR Models:
1. Scalability:
Batch Processing Models: Typically applied to large document collections.
Online Processing Models: Handle queries and updates in real-time.
2. Adaptability:
Static Models: Fixed parameters and structures.
Dynamic Models: Adapt to changes in the collection or user behavior.
3. Query Complexity:
Simple Models: Rely on basic term matching.
Complex Models: Consider semantic relationships, context, and query expansion.
4. Relevance Feedback:
Feedback Models: Incorporate user feedback to adjust ranking.
Non-feedback Models: Rely solely on the original query.
5. Probabilistic Scoring:
Binary Models: Classify documents as relevant or non-relevant.
Probabilistic Models: Assign a degree of relevance to each document.
6. Combination Strategies:
Combining Models: Ensemble methods or hybrid models that integrate multiple retrieval strategies.
Single Models: Models that operate independently without combination.
7. Evaluation Metrics:
Precision and Recall Models: Emphasize precision, recall, or their trade-offs.
User-Centric Models: Incorporate user satisfaction and utility as evaluation metrics.

2.2 BOOLEAN RETRIEVAL MODEL:

The Boolean retrieval model was used by the earliest search engines and is still in use today. It is
also known as exact-match retrieval since documents are retrieved if they exactly match the query
specification, and otherwise are not retrieved. Although this defines a very simple form of ranking, Boolean
retrieval is not generally described as a ranking algorithm. This is because the Boolean retrieval model
assumes that all documents in the retrieved set are equivalent in terms of relevance, in addition to the
assumption that relevance is binary. The name Boolean comes from the fact that there are only two possible outcomes for query evaluation (TRUE and FALSE) and because the query is usually specified using operators from Boolean logic (AND, OR, NOT). Searching with a regular expression utility such as Unix grep is another example of exact-match retrieval.
A fat book which many people own is Shakespeare’s Collected Works. Suppose you wanted to
determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia. One
way to do that is to start at the beginning and to read through all the text, noting for each play whether it
contains Brutus and Caesar and excluding it from consideration if it contains Calpurnia. The simplest form
of document retrieval is for a computer to do this sort of linear scan through documents.

With modern computers, for simple querying of modest collections (the size of Shakespeare’s
Collected Works is a bit under one million words of text in total), you really need nothing more. But for
many purposes, you do need more:

1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words.

2. To allow more flexible matching operations. For example, it is impractical to perform the query
Romans NEAR countrymen with grep, where NEAR might be defined as “within 5 words” or “within the
same sentence”.

3. To allow ranked retrieval: in many cases you want the best answer to an information need
among many documents that contain certain words.

The way to avoid linearly scanning the texts for each query is to index the documents in advance.
Let us stick with Shakespeare’s Collected Works, and use it to introduce the basics of the Boolean retrieval
model. Suppose we record for each document – here a play of Shakespeare’s – whether it contains each word
out of all the words Shakespeare used (Shakespeare used about 32,000 different words). The result is a binary term-document incidence matrix, as in the figure below. Terms are the indexed units; they are usually words, and for the moment you can think of them as words.

To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus,
Caesar and Calpurnia, complement the last, and then do a bitwise AND:

110100 AND 110111 AND 101111 = 100100

Answer: The answers to this query are thus the plays Antony and Cleopatra and Hamlet.
Figure: A term-document incidence matrix. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.
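A minimal Python sketch of this evaluation; the incidence bit patterns are the ones given above, while the column ordering of the six plays is assumed from the standard example (leftmost bit = first play):

```python
# Term-document incidence vectors, one bit per play; the leftmost
# column in the figure is the highest-order bit of each vector.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

MASK = (1 << len(plays)) - 1  # six-bit mask for complementing a vector

# Brutus AND Caesar AND NOT Calpurnia: complement the last vector,
# then take the bitwise AND of all three.
result = incidence["Brutus"] & incidence["Caesar"] & (~incidence["Calpurnia"] & MASK)

# Read off the plays whose bit is set in the result.
answer = [p for i, p in enumerate(plays)
          if result & (1 << (len(plays) - 1 - i))]
print(answer)  # ['Antony and Cleopatra', 'Hamlet']
```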

Let us now consider a more realistic scenario, simultaneously using the opportunity to introduce
some terminology and notation. Suppose we have N = 1 million documents. By documents we mean
whatever units we have decided to build a retrieval system over. They might be individual memos or
chapters of a book. We will refer to the group of documents over which we perform retrieval as the
COLLECTION. It is sometimes also referred to as a Corpus.

Suppose each document is about 1,000 words long. If we assume an average of 6 bytes per word including spaces and punctuation, then this is a document collection about 6 GB in size. Typically, there might be about M = 500,000 distinct terms in these
documents. There is nothing special about the numbers we have chosen, and they might vary by an order of
magnitude or more, but they give us some idea of the dimensions of the kinds of problems we need to
handle.

Advantages:

1. The results of the model are very predictable and easy to explain to users.
2. The operands of a Boolean query can be any document feature, not just words, so it is
straightforward to incorporate metadata such as a document date or document type in the query
specification.
3. From an implementation point of view, Boolean retrieval is usually more efficient than ranked retrieval because documents can be rapidly eliminated from consideration in the scoring process.

Disadvantages:

1. The major drawback of this approach to search is that the effectiveness depends entirely on the
user. Because of the lack of a sophisticated ranking algorithm, simple queries will not work well.
2. All documents containing the specified query words will be retrieved, and this retrieved set will be
presented to the user in some order, such as by publication date, that has little to do with relevance.
It is possible to construct complex Boolean queries that narrow the retrieved set to mostly relevant
documents, but this is a difficult task.
3. No ranking of returned results.

Effectiveness of IR:

RELEVANCE: A document is relevant if it is one that the user perceives as containing information of
value with respect to their personal information need.

To assess the effectiveness of an IR system (i.e., the quality of its search results), a user will usually
want to know two key statistics about the system’s returned results for a query:

PRECISION: What fraction of the returned results is relevant to the information need?

RECALL: What fraction of the relevant documents in the collection were returned by the system?
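As a minimal sketch, both statistics can be computed directly from the set of returned results and the set of relevant documents; the document IDs below are hypothetical:

```python
# Hypothetical document IDs: the set a system returned for a query
# and the set judged relevant to the information need.
retrieved = {1, 2, 3, 4, 5}
relevant = {1, 3, 6, 7}

true_positives = retrieved & relevant

precision = len(true_positives) / len(retrieved)  # fraction of returned that is relevant
recall = len(true_positives) / len(relevant)      # fraction of relevant that was returned

print(precision, recall)  # 0.4 0.5
```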

2.3 VECTOR SPACE MODEL

The vector space model computes a measure of similarity by defining a vector that represents each
document, and a vector that represents the query. The model is based on the idea that, in some rough sense,
the meaning of a document is conveyed by the words used. If one can represent the words in the document
by a vector, it is possible to compare documents with queries to determine how similar their content is. If a
query is considered to be like a document, a similarity coefficient (SC) that measures the similarity between
a document and a query can be computed. Documents whose content, as measured by the terms in the
document, correspond most closely to the content of the query are judged to be the most relevant.

This model involves constructing a vector that represents the terms in the document and
another vector that represents the terms in the query. Next, a method must be chosen to
measure the closeness of any document vector to the query vector.

One could look at the magnitude of the difference vector between two vectors, but this would tend to
make any large document appear to be not relevant to most queries, which typically are short. The traditional
method of determining closeness of two vectors is to use the size of the angle between them. This angle is
computed by using the inner product (or dot product); however, it is not necessary to use the actual angle. Often the expression "similarity coefficient" is used instead of an angle. Computing this number is done in a variety of ways, but the inner product generally plays a prominent role.

There is one component in these vectors for every distinct term or concept that occurs in the
document collection. Consider a document collection with only two distinct terms, α and β. All vectors
contain only two components, the first component represents occurrences of α, and the second represents
occurrences of β. The simplest means of constructing a vector is to place a one in the corresponding vector
component if the term appears and a zero if the term does not appear. Consider a document, D1, that
contains two occurrences of term α and zero occurrences of term β. The vector < 1,0 > represents this
document using a binary representation. This binary representation can be used to produce a similarity
coefficient, but it does not take into account the frequency of a term within a document. By extending the
representation to include a count of the number of occurrences of the terms in each component, the
frequency of the terms can be considered. In this example, the vector would now appear as < 2,0 >.

A simple example is given in the following figure. A component of each vector is required for each distinct term in the collection. Using the toy example of a language with a two-word vocabulary (only A and I are valid terms), all queries and documents can be represented in a two-dimensional space. A query and three documents are given along with their corresponding vectors and a graph of these vectors. The similarity coefficient between the query and each document can be computed as the distance from the query vector to the document vectors.

In this example, it can be seen that document one is represented by the same vector as the query so it
will have the highest rank in the result set. Instead of simply specifying a list of terms in the query, a user is
often given the opportunity to indicate that one term is more important than another. This was done initially
with manually assigned term weights selected by users.

Another approach uses automatically assigned weights, typically based on the frequency of a term as it occurs across the entire document collection. The idea was that a term that occurs infrequently should be given a higher weight than a term that occurs frequently. Similarity coefficients that employed automatically assigned weights were compared to manually assigned weights. It was shown that automatically assigned weights perform at least as well as manually assigned weights. Unfortunately, these results did not include the relative weight of the term across the entire collection.

The value of a collection weight was studied in the 1970's. The conclusion was that relevance
rankings improved if collection-wide weights were included. Although relatively small document collections
were used to conduct the experiments, the authors still concluded that, "in so far as anything can be called a
solid result in information retrieval research, this is one".

This more formal definition, and slightly larger example, illustrates the use of weights based on the
collection frequency. Weight is computed using the Inverse Document Frequency (IDF) corresponding to
a given term.
To construct a vector that corresponds to each document, consider the following definitions:

t = number of distinct terms in the document collection

tf_ij = number of occurrences of term t_j in document D_i (the term frequency)

df_j = number of documents which contain t_j (the document frequency)

idf_j = log(d/df_j), where d is the total number of documents (the inverse document frequency)

The vector for each document has t components and contains an entry for each distinct term in the
entire document collection. The components in the vector are filled with weights computed for each term in
the document collection. The terms in each document are automatically assigned weights based on how
frequently they occur in the entire document collection and how often a term appears in a particular
document. The weight of a term in a document increases the more often the term appears in one document
and decreases the more often it appears in all other documents.

A weight computed for a term in a document vector is non-zero only if the term appears in the
document. For a large document collection consisting of numerous small documents, the document vectors
are likely to contain mostly zeros. For example, a document collection with 10,000 distinct terms results in a 10,000-dimensional vector for each document. A given document that has only 100 distinct terms will have a document vector that contains 9,900 zero-valued components.

The weighting factor for a term in a document is defined as a combination of term frequency and inverse document frequency. That is, to compute the value of the j-th entry in the vector corresponding to document i, the following equation is used:

d_ij = tf_ij × idf_j
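A minimal sketch of this vector construction in Python, under the definitions above; the three toy documents are hypothetical:

```python
import math

# Hypothetical toy collection; any list of tokenized documents works.
docs = [["alpha", "alpha", "beta"],
        ["beta", "gamma"],
        ["alpha", "gamma", "gamma"]]

d = len(docs)                                      # total number of documents
terms = sorted({t for doc in docs for t in doc})   # the t distinct collection terms

# df_j: number of documents containing term t_j
df = {t: sum(1 for doc in docs if t in doc) for t in terms}
# idf_j = log(d / df_j); base 10, as in the worked example below
idf = {t: math.log10(d / df[t]) for t in terms}

# Each document vector has one component per distinct term,
# weighted as d_ij = tf_ij * idf_j.
vectors = [[doc.count(t) * idf[t] for t in terms] for doc in docs]

for v in vectors:
    print([round(w, 3) for w in v])
```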

2.4. TERM WEIGHTING:

Each term in a document is assigned a weight that depends on the number of occurrences of the term in the document. The simplest approach is to assign the weight to be equal to the number of occurrences of term t in document d. This weighting scheme is referred to as TERM FREQUENCY and is denoted tf_t,d, with the subscripts denoting the term (t) and the document (d), in that order.

2.4.1 Document frequency:

The document frequency df_t is defined to be the number of documents in the collection that contain a term t. Denoting as usual the total number of documents in a collection by N, we define the inverse document frequency (idf) of a term t as follows:

idf_t = log(N / df_t)

2.4.2 Tf-idf weighting:

We now combine the definitions of term frequency and inverse document frequency, to produce a
composite weight for each term in each document.

The tf-idf weighting scheme assigns to term t a weight in document d given by

tf-idf_t,d = tf_t,d × idf_t

Consider a document collection that contains a document, D1, with ten occurrences of the term green and a document, D2, with only five occurrences of the term green. If green is the only term found in the query, then document D1 is ranked higher than D2.

When a document retrieval system is used to query a collection of documents with t distinct collection-wide terms, the system computes a vector D_i = (d_i1, d_i2, ..., d_it) of size t for each document. The vectors are filled with term weights as described above. Similarly, a vector Q = (w_q1, w_q2, ..., w_qt) is constructed for the terms found in the query.

A simple similarity coefficient (SC) between a query Q and a document D_i is defined by the dot product of the two vectors. Since a query vector is similar in length to a document vector, this same measure is often used to compute the similarity between two documents.
Example of Similarity Coefficient

Consider a case-insensitive query Q and a document collection consisting of the following three documents:

Q: "gold silver truck"

D1 : "Shipment of gold damaged in a fire"

D2 : "Delivery of silver arrived in a silver truck"

D3: "Shipment of gold arrived in a truck"

In this collection, there are three documents, so d = 3. If a term appears in only one of the three documents, its idf is log(d/df_j) = log(3/1) = 0.477. Similarly, if a term appears in two of the three documents its idf is log(d/df_j) = log(3/2) = 0.176, and a term which appears in all three documents has an idf of log(d/df_j) = log(3/3) = 0. (Logarithms here are base 10.)

The idf for the eleven terms in the collection is given below:

idf_a = 0, idf_arrived = 0.176, idf_damaged = 0.477, idf_delivery = 0.477, idf_fire = 0.477, idf_gold = 0.176, idf_in = 0, idf_of = 0, idf_shipment = 0.176, idf_silver = 0.477, idf_truck = 0.176

Document vectors can now be constructed. Since eleven terms appear in the document collection, an eleven-dimensional document vector is constructed. The alphabetical ordering given above is used to construct the document vector, so that t1 corresponds to term number one, which is a, and t2 is arrived, etc. The weight for term j in the vector for document i is computed as tf_ij × idf_j. The document vectors are shown in the table.

SC(Q, D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) + (0)(0.176) + (0.477)(0) + (0.176)(0) = (0.176)^2 ≈ 0.031

Similarly,

SC(Q, D2) = (0.954)(0.477) + (0.176)^2 ≈ 0.486

SC(Q, D3) = (0.176)^2 + (0.176)^2 ≈ 0.062


Hence, the ranking would be D2 , D3 , D1 .
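As a check, here is a minimal Python sketch that reproduces this worked example end to end, using base-10 logarithms and the dot-product similarity coefficient defined above:

```python
import math

query = "gold silver truck"
docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}

tokenized = {name: text.split() for name, text in docs.items()}
terms = sorted({t for words in tokenized.values() for t in words})

d = len(docs)
# df_j and idf_j = log10(d / df_j) for each collection term
df = {t: sum(1 for words in tokenized.values() if t in words) for t in terms}
idf = {t: math.log10(d / df[t]) for t in terms}

def weight_vector(words):
    # Component for term j is tf_ij * idf_j.
    return [words.count(t) * idf[t] for t in terms]

q_vec = weight_vector(query.split())
scores = {}
for name, words in tokenized.items():
    d_vec = weight_vector(words)
    scores[name] = sum(q * w for q, w in zip(q_vec, d_vec))  # dot product

for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
# D2 0.486
# D3 0.062
# D1 0.031
```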

Advantages:

1. Its term-weighting scheme improves retrieval performance.


2. Its partial matching strategy allows retrieval of documents that approximate the query conditions.
3. Its cosine ranking formula sorts the documents according to their degree of similarity to the query.

Disadvantages:

1. The assumption of mutual independence between index terms.


2. It cannot denote the "clear logic view" of the Boolean model.

COSINE SIMILARITY

To factor out the effect of document length, the standard way of quantifying the similarity between two documents d1 and d2 is to compute the cosine similarity of their vector representations V(d1) and V(d2):

sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|)

where the numerator represents the dot product (also known as the inner product) of the vectors V(d1) and V(d2), while the denominator is the product of their Euclidean lengths. The dot product x · y of two vectors is defined as Σ_{i=1}^{M} x_i y_i. Let V(d) denote the document vector for d, with M components V_1(d), ..., V_M(d). The Euclidean length of d is defined to be √(Σ_{i=1}^{M} V_i^2(d)). The effect of the denominator is thus to length-normalize the vectors V(d1) and V(d2) to unit vectors v(d1) = V(d1)/|V(d1)| and v(d2) = V(d2)/|V(d2)|. We can then rewrite the similarity as

sim(d1, d2) = v(d1) · v(d2)
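A minimal sketch of this computation; the two tf-idf document vectors below are hypothetical:

```python
import math

def cosine_similarity(x, y):
    # Dot product of the two vectors.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    # Euclidean lengths used to length-normalize each vector.
    len_x = math.sqrt(sum(xi * xi for xi in x))
    len_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (len_x * len_y)

# Hypothetical tf-idf vectors for two documents over an M-term vocabulary.
v_d1 = [0.0, 0.477, 0.176, 0.0]
v_d2 = [0.176, 0.954, 0.176, 0.0]
print(round(cosine_similarity(v_d1, v_d2), 3))
```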

2.5 SCORING AND RANKING

In information retrieval (IR), ranking and scoring are fundamental processes that determine the order in
which documents are presented to users based on their relevance to a given query. These processes play
a crucial role in search engines and other systems where the goal is to retrieve the most relevant
information for a user's query. Here's an overview of ranking and scoring in IR:

2.5.1Ranking in Information Retrieval:


Ranking involves ordering a set of documents based on their estimated relevance to a user's query.
Documents are retrieved based on their similarity or match to the query. The ranking process arranges
these documents in descending order of estimated relevance.

Process:

Various algorithms and models are used for ranking, such as:

Vector Space Models: Representing documents and queries in a high-dimensional space.

Probabilistic Models: Assigning probabilities to the relevance of documents given a query.

Evaluation:

Precision, recall, F1 score, and other metrics are used to evaluate the effectiveness of ranking algorithms. User satisfaction and engagement metrics may also be considered.

2.5.2 Scoring in Information Retrieval:

Scoring involves assigning a numerical score to each document based on its relevance to a query. Each document is assigned a score indicating its likelihood of being relevant to the query. Documents are then ranked based on these scores.

Methods:

The scoring process often involves combining various factors, such as:

Term Frequency (TF) and Inverse Document Frequency (IDF).

Normalization

To ensure fair comparison and meaningful scores, normalization techniques may be applied.
Normalization can account for document length, query length, or other factors.

Relevance Feedback:

User feedback on initial search results may be used to adjust scores. Positive feedback increases scores,
while negative feedback decreases them.

Dynamic Scoring:

Scoring models may dynamically adapt to changes in the document collection or user behavior. Dynamic scoring helps maintain relevance in evolving scenarios.

Result Presentation:

The final scores determine the order in which documents are presented to users. Typically, the top-ranked documents are displayed first. Both ranking and scoring are iterative processes, and the effectiveness of an information retrieval system is often measured by how well it ranks and scores documents in response to user queries. The choice of ranking and scoring algorithms depends on the characteristics of the data, the user's information needs, and the specific goals of the retrieval system.
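A minimal sketch of the score-then-rank pipeline described above; the scoring function (a plain sum of matching query-term weights over hypothetical tf-idf values) is an illustrative assumption, not a prescribed formula:

```python
# Hypothetical per-document term weights (e.g., tf-idf values).
doc_weights = {
    "doc1": {"gold": 0.176, "fire": 0.477},
    "doc2": {"silver": 0.954, "truck": 0.176},
    "doc3": {"gold": 0.176, "truck": 0.176},
}

def score(query_terms, weights):
    # Sum the weights of the query terms present in the document.
    return sum(weights.get(t, 0.0) for t in query_terms)

query = ["gold", "silver", "truck"]
scores = {name: score(query, w) for name, w in doc_weights.items()}

# Rank: present documents in descending order of score.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['doc2', 'doc3', 'doc1']
```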

2.6.PROBABILISTIC RETRIEVAL MODEL

Probabilistic considerations may apply if one assumes that system characteristics such as the assignment of terms to the records, or the relevance properties of the records, are probabilistic in nature. Other mathematical techniques that have been used include decision theory, information theory, pattern classification, mathematical linguistics and feature selection methods.

In addition to a vector space model, the other dominant approach uses a probabilistic retrieval model.
The model that has been most successful in this area is the Bayesian approach. This approach is natural
to information systems and is based upon the theories of evidential reasoning (drawing conclusions from
evidence).

Bayesian approaches have long been applied to information systems. The Bayesian approach could be
applied as part of index term weighting, but usually is applied as part of the retrieval process by
calculating the relationship between an item and a specific query. The probabilistic approach is based
upon direct application of the theory of probability to information retrieval systems.

This has the advantage of being able to use the developed formal theory of probability to direct the algorithmic development. The use of probability theory is a natural choice because it is the basis of evidential reasoning. According to Van Rijsbergen [1979], probability theory is an intuitively pleasing model for describing and analysing information retrieval.

In this approach one can estimate the probability of retrieval. In other words, the probability of relevance of retrieval is measured. Van Rijsbergen proposed that a measure of the probability of relevance of a given document to a particular query be based on a vector representing that document. He postulated that the pattern of index terms in relevant documents will differ from their pattern in non-relevant documents. The probability approach helps to analyse term clustering, frequency weighting, relevance weighting and ranking.

Probability theory can also be used to rank and order documents according to their probability of relevance. Robertson [1978] shows that the order of documents can be based on term values and on an 'optimal retrieval function'. However, if one attempts to rank-order the documents in a Boolean environment, some difficulties arise which are inherent to Boolean logic. Bookstein [1978] suggested that the retrieved documents be ordered according to the number of Boolean expressions present in the document that are true.
Doszkocs [1978] suggests that probability associations can be used to find terms that are associated with other terms. The association procedures are based on term occurrences and the frequency of these terms in the database. The probabilistic approach of a retrieval model is based on the assumption that the distribution of the indexing features tells something about the relevance of a document. This approach adopts a retrieval model which optimizes retrieval effectiveness according to the Probability Ranking Principle (PRP).

William Cooper has formulated the Probability Ranking Principle as "If a reference retrieval system's
response to each request is a ranking of the documents in the collections in order of decreasing
probability of usefulness to the user who submitted the request, where the probabilities are estimated as
accurately as possible on the basis of whatever data has been made available to the system for this
purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the
basis of that data". In this method the system replies to a query by presenting the beginning of a list of
documents that are ranked in descending order of scores that either represent probabilities themselves or
could be mapped to probabilities by means of an order preserving transformation.

These scores, often called Retrieval Status Values (RSV), depend on document descriptions consisting of appropriate statistical information about the indexing features. Such scores may also depend on domain-dependent parameters that are estimated by means of additional data, e.g. by a thesaurus. Probabilities are usually based upon a binary condition: an item is relevant or not. But in information systems the relevance of an item is a continuous function from non-relevant to absolutely useful.

The output ordering of items by rank based upon probabilities, even if accurately calculated, may not be as optimal as that defined by some domain-specific heuristic. The problems that arise in applying probability theory come from a lack of accurate data and from the simplifying assumptions that are applied to the mathematical model.
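A minimal sketch of PRP-style ranking, assuming the classic binary independence model (not named in the text) as the scoring function: each query term contributes a log-odds weight estimated from relevance data, and a document's RSV is the sum of the weights of the query terms it contains. All statistics below are hypothetical:

```python
import math

def bim_term_weight(r, R, n, N, k=0.5):
    # Log-odds weight of a term, with the usual +k smoothing:
    # r of R judged-relevant documents contain the term,
    # n of N documents in the whole collection contain it.
    p = (r + k) / (R + 2 * k)          # P(term present | relevant)
    q = (n - r + k) / (N - R + 2 * k)  # P(term present | non-relevant)
    return math.log((p * (1 - q)) / (q * (1 - p)))

# Hypothetical judgment statistics for two query terms.
N, R = 100, 10
weights = {
    "gold":   bim_term_weight(r=8, R=R, n=20, N=N),
    "silver": bim_term_weight(r=3, R=R, n=40, N=N),
}

# Hypothetical documents as term sets; the RSV of a document is the
# sum of the weights of the query terms it contains.
docs = {"d1": {"gold", "truck"}, "d2": {"silver"}, "d3": {"gold", "silver"}}
rsv = {name: sum(w for t, w in weights.items() if t in terms)
       for name, terms in docs.items()}

# Present documents in descending order of RSV, per the PRP.
print(sorted(rsv, key=rsv.get, reverse=True))
```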

2.7.STRUCTURED TEXT RETRIEVAL MODELS

 Retrieval models which combine information on text content with information on the document structure are called structured text retrieval models.

 Match point : refer to the position in the text of a sequence of words which matches the user query.

 Region : refer to a contiguous portion of the text.

 Node : refer to a structural component of the document such as a chapter, a section, a subsection.

Structured text retrieval models provide a formal definition or mathematical framework for querying semistructured textual databases. A textual database contains both content and structure. The content is the text itself, and the structure divides the database into separate textual parts and relates those textual parts by some criterion. Often, textual databases can be represented as marked-up text, for instance as XML, where the XML elements define the structure on the text content.

Retrieval models for textual databases should comprise three parts: 1) a model of the text, 2) a model of the structure, and 3) a query language [4]. The model of the text defines a tokenization into words or other semantic units, as well as stop words, stemming, synonyms, etc. The model of the structure defines parts of the text, typically a contiguous portion of the text called an element, region, or segment, which is defined on top of the text model's word tokens. The query language typically defines a number of operators on content and structure, such as set operators and operators like "containing" and "contained-by", to model relations between content and structure, as well as relations between the structural elements themselves.

Using such a query language, the (expert) user can for instance formulate requests like "I want a paragraph discussing formal models near to a table discussing the differences between databases and information retrieval". Here, "formal models" and "differences between databases and information retrieval" should match the content that needs to be retrieved from the database, whereas "paragraph" and "table" refer to structural constraints on the units to retrieve.
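A minimal sketch of the region model and a "contained-by" operator, assuming regions are represented as (start, end) token offsets; the structure, offsets, and match points below are hypothetical:

```python
from typing import NamedTuple

class Region(NamedTuple):
    # A contiguous portion of the text, as token offsets [start, end).
    label: str
    start: int
    end: int

def contained_by(inner: Region, outer: Region) -> bool:
    # "contained-by": the inner region lies entirely within the outer one.
    return outer.start <= inner.start and inner.end <= outer.end

# Hypothetical structure: a section containing a paragraph and a table.
section = Region("section", 0, 200)
paragraph = Region("paragraph", 10, 60)
table = Region("table", 120, 180)

# Hypothetical match points: token positions where query words matched.
match_points = [15, 130]

# Nodes contained by the section that also contain a match point.
hits = [r for r in (paragraph, table)
        if contained_by(r, section)
        and any(r.start <= m < r.end for m in match_points)]
print([r.label for r in hits])  # ['paragraph', 'table']
```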

2.8.ALGEBRAIC MODELS

Algebraic models in information retrieval involve representing documents and queries as mathematical
structures, often vectors, and using algebraic operations to measure the similarity between these
representations. These models aim to capture the relationships between terms and documents in a way
that facilitates effective retrieval of relevant information. Here are two prominent algebraic models used
in information retrieval:

2.8.1.Vector Space Model (VSM):

Documents and queries are represented as vectors in a high-dimensional space. Each dimension corresponds to a unique term in the vocabulary. The value of each dimension represents the weight or frequency of the corresponding term in the document or query.

Term Frequency-Inverse Document Frequency (TF-IDF): Enhances the VSM by incorporating term importance information. BM25, mentioned in Section 2.1, builds on these ideas as a probabilistic model that also considers term saturation and document length.

2.8.2 Latent Semantic Analysis (LSA):

LSA uses singular value decomposition (SVD) to reduce the dimensionality of the term-document matrix. The reduced-dimensional space captures latent semantic relationships between terms and documents.
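A minimal LSA sketch using NumPy's SVD; the toy term-document matrix is hypothetical, and k is the number of latent dimensions kept:

```python
import numpy as np

# Hypothetical term-document matrix: rows are terms, columns are
# documents, entries are term weights (e.g., tf-idf).
A = np.array([
    [1.0, 1.0, 0.0, 0.0],   # "gold"
    [0.0, 1.0, 0.0, 0.0],   # "silver"
    [1.0, 0.0, 1.0, 0.0],   # "shipment"
    [0.0, 0.0, 1.0, 1.0],   # "delivery"
])

# SVD: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values to obtain the
# reduced-dimensional latent semantic space.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents can now be compared in the k-dimensional space
# given by the columns of diag(s_k) @ Vt_k.
doc_coords = np.diag(s[:k]) @ Vt[:k, :]
print(np.round(doc_coords, 3))
```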
