TYCS SEM VI INFORMATION RETRIEVAL NOTES: BY: PROF. AJAY PASHANKAR

I Foundations of Information Retrieval 15
Introduction to Information Retrieval (IR) systems: Definition and goals of information
retrieval, Components of an IR system, Challenges and applications of IR

Document Indexing, Storage, and Compression: Inverted index construction and compression techniques, Document representation and term weighting, Storage and retrieval of indexed documents

Retrieval Models: Boolean model: Boolean operators, query processing, Vector space
model: TF-IDF, cosine similarity, query-document matching, Probabilistic model:
Bayesian retrieval, relevance feedback

Spelling Correction in IR Systems: Challenges of spelling errors in queries and documents, Edit distance and string similarity measures, Techniques for spelling correction in IR systems

Performance Evaluation: Evaluation metrics: precision, recall, F-measure, average precision, Test collections and relevance judgments, Experimental design and significance testing
II Advanced Topics in Information Retrieval 15
Text Categorization and Filtering: Text classification algorithms: Naive Bayes,
Support Vector Machines, Feature selection and dimensionality reduction, Applications
of text categorization and filtering
Text Clustering for Information Retrieval: Clustering techniques: K-means,
hierarchical clustering, Evaluation of clustering results, Clustering for query expansion
and result grouping
Web Information Retrieval: Web search architecture and challenges, Crawling and
indexing web pages, Link analysis and PageRank algorithm
Learning to Rank: Algorithms and Techniques, Supervised learning for ranking:
RankSVM, RankBoost, Pairwise and listwise learning to rank approaches Evaluation
metrics for learning to rank
Link Analysis and its Role in IR Systems: Web graph representation and link
analysis algorithms, HITS and PageRank algorithms, Applications of link analysis in IR
systems
III Advanced Topics in Information Retrieval 15
Crawling and Near-Duplicate Page Detection: Web page crawling techniques:
breadth-first, depth-first, focused crawling, Near-duplicate page detection algorithms,
Handling dynamic web content during crawling
Advanced Topics in IR: Text Summarization: extractive and abstractive methods,
Question Answering: approaches for finding precise answers, Recommender Systems:
collaborative filtering, content-based filtering
Cross-Lingual and Multilingual Retrieval: Challenges and techniques for cross-
lingual retrieval, Machine translation for IR, Multilingual document representations and
query translation, Evaluation Techniques for IR Systems
User-based evaluation: user studies, surveys, Test collections and benchmarking,
Online evaluation methods: A/B testing, interleaving experiments
Textbook(s):
1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto, "Modern Information Retrieval: The Concepts and Technology behind Search", Second Edition, ACM Press Books
2. C. Manning, P. Raghavan, and H. Schütze, "Introduction to Information Retrieval", Cambridge University Press

Additional Reference(s):
1. Ricci, F., Rokach, L., Shapira, B., Kantor, P., "Recommender Systems Handbook", First Edition
2. Bruce Croft, Donald Metzler, and Trevor Strohman, "Search Engines: Information Retrieval in Practice", Pearson Education
3. Stefan Buttcher, Charlie Clarke, Gordon Cormack, "Information Retrieval: Implementing and Evaluating Search Engines", MIT Press


INDEX

SR.NO CHAPTER NAME PAGE NO.

1 Introduction to Information Retrieval 4-9

2 Document Indexing, Storage and Compression 10-21

3 Retrieval Models 21-30

4 Spelling correction in IR systems 30-41

5 Performance Evaluation 41-45

6 Text Categorization and Filtering 45-48

7 Text Clustering for Information Retrieval 49-51

8 Web Information Retrieval 52-61

9 Learning to Rank 62-67

10 Link Analysis and its Role in IR Systems 67-70

11 Crawling and Near-Duplicate Page Detection 71-75

12 Text Summarization 75-86

13 Cross-Lingual and Multilingual Retrieval 86-94

14 User-Based evaluation 95-103

CHAPTER I: INTRODUCTION TO INFORMATION RETRIEVAL (IR) SYSTEMS
Topics covered:
Definition and goals of information retrieval, Components of an IR system, Challenges and applications
of IR
Introduction
There is potential for confusion about the differences between Database Management Systems (DBMS) and Information Retrieval Systems. It is easy to confuse the software that provides the functional support for each type of system with the actual information or structured data being stored and manipulated. The importance of the differences lies in the inability of a database management system to provide the functions needed to process "information"; conversely, an information retrieval system containing structured data suffers major functional deficiencies of its own.
Information retrieval (IR) is the process of obtaining information system resources that are relevant to
an information need from a collection of those resources. Searches can be based on full-text or
other content-based indexing. Information retrieval is the science of searching for information in a
document, searching for documents themselves, and also searching for the metadata that describes
data, and for databases of texts, images or sounds.
Automated information retrieval systems are used to reduce what has been called information
overload. An Information Retrieval system is a software system that provides access to books,
journals, and other documents; stores and manages those documents. Web search engines are the
most visible IR applications.
Why is information retrieval important?
Data is the big game now: huge amounts of data are generated on a daily basis. But this data is useless if there is no way to obtain and query it; without that, the information we collect has no value. An information retrieval system is critical for making sense of data. Without Google or other search engines, it would be very difficult to retrieve any information from the internet.
Text indexing and retrieval systems index data in data repositories and allow users to search against it.
Retrieval systems provide users with online access to information that they may not be aware of. Users
are able to query all information that the administrator has decided to index with a single search.

What is information retrieval example?


Librarians, professional searchers, etc., engage themselves in the activity of information retrieval but
nowadays hundreds of millions of people engage in IR every day when they use web search
engines. Information retrieval is believed to be the dominant form of information access. The IR system assists users in finding the information they require, but it does not explicitly return answers to their questions; it notifies them of the existence and location of documents that might contain the required information. Information retrieval also supports users in browsing or filtering document collections, or in processing a set of retrieved documents. The system searches over billions of documents stored on millions of computers. Email programs provide a spam filter and manual or automatic means of classifying mail so that messages can be placed directly into particular folders.
An IR system has the ability to represent, store, organize, and access information items. A set of keywords is required to search. Keywords are what people are searching for in search engines, and they summarize the description of the information need.

What are the types of information retrieval?


Methods/Techniques in which information retrieval techniques are employed include:

• Adversarial information retrieval

• Automatic summarization
• Multi-document summarization
• Compound term processing
• Cross-lingual retrieval
• Document classification
• Spam filtering
• Question answering

What is information retrieval used for?


When you ask the librarian at your school for a book, they quickly find the one you need from hundreds of others segregated into various sections or types. This is a kind of information retrieval. Now imagine entering a similar search query into a web search engine, which goes through billions of pages and resources to find the results for your query. Information retrieval is believed to be the dominant form of information access. The IR system assists users in finding the information they require, but it does not explicitly return answers to their questions; just like the librarian, who might also recommend a few other books in the same genre, it notifies them of the existence and location of documents that might contain the required information.
What are the three classic models in information retrieval systems?

An information retrieval (IR) model can be classified into one of the following three types:

Classical IR Model
It is the simplest IR model and the easiest to implement. It is based on mathematical formalisms that are easily recognized and understood. Boolean, vector and probabilistic are the three classical IR models.

Non-Classical IR Model
It is completely opposite to the classical IR model. Such IR models are based on principles other than similarity, probability, or Boolean operations. The information logic model, situation theory model, and interaction models are examples of non-classical IR models.

Alternative IR Model
It is an enhancement of the classical IR model that makes use of specific techniques from other fields. The cluster model, fuzzy model, and latent semantic indexing (LSI) model are examples of alternative IR models.

What are the characteristics of information retrieval?


There are 12 characteristics of an Information Retrieval model:

• Search intermediary
• Domain knowledge
• Relevance feedback
• Natural language interface
• Graphical query language

• Conceptual queries
• Full-text IR
• Field searching
• Fuzzy queries
• Hypertext integration
• Machine learning
• Ranked output

What are the components and features of Information retrieval systems?

1. Inverted Index
The primary data structure of most IR systems is the inverted index. We can define an inverted index as a data structure that lists, for every word, all documents that contain it and the frequency of its occurrences in each document. This makes it easy to search for 'hits' of a query word.

2. Stop Word Elimination


Stop words are high-frequency words that are deemed unlikely to be useful for searching. They carry little semantic weight. All such words are kept in a list called a stop list. For example, articles such as "a", "an", "the" and prepositions like "in", "of", "for", "at", etc. are stop words. The size of the inverted index can be significantly reduced by a stop list; as per Zipf's law, a stop list covering a few dozen words reduces the size of the inverted index by almost half. On the other hand, eliminating stop words can sometimes remove a term that is useful for searching. For example, if we eliminate the word "A" from "Vitamin A", the remaining term loses its significance.

3. Stemming
Stemming, the simplified form of morphological analysis, is the heuristic process of extracting the base
form of words by chopping off the ends of words. For example, the words laughing, laughs, laughed
would be stemmed to the root word laugh.

4. Crawling
Crawling is the process of gathering different web pages to index them to support a search engine. The
purpose of crawling is to quickly and efficiently gather as many relevant web pages as possible, together with the link structure that interconnects them.

5. Query
Queries are search statements that describe the information requirement in a search engine. A query does not identify a single particular result; it finds many results that match the query to different degrees.

6. Relevance Feedback
Relevance feedback takes the results initially returned for a query, gathers user feedback on their relevance, and uses that information to perform a new, improved query.

Precision and recall in information retrieval

Precision and recall are two metrics used to evaluate the performance of an information retrieval system, such as a search engine. Precision is the fraction of the returned results that are relevant, while recall is the fraction of all relevant documents that the system was able to return. In other words, precision measures the accuracy of the results returned, while recall measures their completeness. A system tuned for high precision tends to return fewer results, but they are more likely to be relevant; a system tuned for high recall returns more results, at the risk of including more irrelevant ones. For example, if a search engine returns 100 results and 80 of them are relevant, then the precision is 80%. If the collection contains 200 relevant documents in total, the recall of that result set is only 40%; recall would reach 100% only if the engine returned all 200 relevant documents.
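
The following short Python sketch shows how precision and recall are computed from sets of retrieved and relevant documents; the numbers reuse the hypothetical scenario above (100 results, 80 relevant, 200 relevant documents in the collection).

def precision(retrieved, relevant):
    # fraction of the returned results that are relevant
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    # fraction of all relevant documents that were returned
    return len(retrieved & relevant) / len(relevant)

relevant = set(range(200))                               # 200 relevant documents exist
retrieved = set(range(80)) | set(range(1000, 1020))      # 100 results, 80 of them relevant

print(precision(retrieved, relevant))                    # 0.8 -> 80% precision
print(recall(retrieved, relevant))                       # 0.4 -> 40% recall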

Information Retrieval techniques

Information retrieval has many widespread applications, which can be categorized into three types.

General Applications -

• Digital Libraries
• Media Search
• Search Engines

Domain-specific applications

• Expert finding


• Genomic information retrieval
• Geographic information retrieval
• Information retrieval for chemical structures
• Information retrieval in software engineering
• Legal information retrieval
• Vertical search

Other retrieval methods

• Adversarial information retrieval
• Automatic summarization
• Multi-document summarization
• Compound term processing
• Cross-lingual retrieval
• Document classification
• Spam filtering
• Question answering

Difference between Data Retrieval and Information Retrieval


Data retrieval (as performed by a database management system, or DBMS) usually works with structured data with well-defined semantics, while IR deals with mostly unstructured data. A DBMS returns an exact result, or no result at all if no exact match is found, whereas an IR system yields many results with rankings. Small errors in an IR system are likely to go unnoticed, but in data retrieval even a single erroneous result is treated as a complete failure.

Information Retrieval Services


Information retrieval (IR) services are computer-based systems that allow users to search and retrieve
documents, websites, and other types of information from a database or a collection of documents.
These services are designed to help users find relevant information quickly and efficiently.
There are several types of IR services, including:

1. Search engines: These are the most common type of IR service, and they allow users to search
the Internet for websites, documents, and other types of information. Some examples of search
engines include Google, Bing, and Yahoo.
2. Library catalogs: These IR services allow users to search for books, journals, and other
materials in a library's collection.
3. Document databases: These IR services allow users to search for documents within a specific
database or collection, such as a database of research papers or legal documents.
4. Specialized IR services: These are IR services that are designed to search specific types of
information, such as medical literature or patents.
IR services use various techniques to index and retrieve information, including keyword searches,
natural language processing, and machine learning algorithms. They may also use metadata, such as
author names, publication dates, and subject tags, to help users find relevant information.

Information Storage and Retrieval


Information storage and retrieval refers to the processes of storing and accessing information in a
computer system or database. These processes are essential for organizing and managing large
amounts of data, and they allow users to quickly and easily access the information they need.
There are several methods for storing and retrieving information, including:

1. File systems: A file system is a way of organizing and storing files on a computer or other digital
device. It typically includes a hierarchy of folders and subfolders, and users can access and
retrieve files by navigating through the folder structure.
2. Databases: A database is a collection of structured data that can be searched, queried, and
accessed using a specialized software application. Databases can be used to store and retrieve a
wide range of information, including customer data, financial records, and product information.
3. Cloud storage: Cloud storage refers to the practice of storing data on remote servers that are
accessed over the Internet. This allows users to access and retrieve their data from any device
with an Internet connection.
4. Optical storage: Optical storage refers to the use of lasers or other light-based technologies to
store and retrieve data on media such as CDs, DVDs, and Blu-ray discs.
Regardless of the method used, effective information storage and retrieval systems should be efficient,
reliable, and secure. They should also be easy to use and allow users to access and retrieve the
information they need quickly and easily

Issues in Information Retrieval

Indexing is the most vital part of any Information Retrieval System. It is a process in which the
documents required by the users are transformed into searchable data structures. Indexing can also be viewed as a process of extraction rather than analysis of particular content. It provides the core functionality of the IR process, since it is the first step in IR and enables efficient information retrieval. In the process, document surrogates are first created to represent each document. Secondly, it requires analysis of the original documents, covering both simple data (identifying meta-information such as author, title, and subject) and complex data (linguistic analysis of content). Indexes are the data structures that are used to make the search faster.
Evaluation in Information Retrieval is the process of systematically determining a subject’s merit,
worth, and significance by using certain criteria that are governed by a set of standards.
Issues in Information Retrieval :
The main issues of the Information Retrieval (IR) are Document and Query Indexing, Query Evaluation,
and System Evaluation.

1. Document and Query Indexing –
The main goal of document and query indexing is to identify important meanings and create an internal representation. The factors to be considered are accuracy in representing semantics, exhaustiveness, and how easily a computer can manipulate the representation.
2. Query Evaluation –
The retrieval model must specify how a document is represented with the selected keywords and how document and query representations are compared to calculate a score. Information Retrieval (IR) deals with issues like uncertainty and vagueness in information systems.
• Uncertainty :
The available representation does not typically reflect true semantics of objects such as
images, videos etc.
• Vagueness :
The information that the user requires lacks clarity, is only vaguely expressed in a query,
feedback or user action.
3. System Evaluation –
System evaluation is concerned with determining the impact of the information provided on user achievement. Here, we examine the efficiency of the particular system with respect to time and space.
-------------------------------------------------------------------------------------------------------------------

CHAPTER II: Document Indexing, Storage, and Compression
Topics Covered:
Document Indexing, Storage, and Compression
Inverted index construction and compression techniques, Document representation and term
weighting, Storage and retrieval of indexed documents

-------------------------------------------------------------------------------------------------------------------
Indexing (Creating document representation)
- Indexing is the manual or automated process of making statements about a document, lesson, person, and so on.
- For example: author wise, subject wise, text wise, etc.
- Index can be:
i. Document oriented: - the indexer captures the subjects and other features actually treated in the document itself.
ii. Request oriented: - the indexer assesses the document's relevance to subjects and other features of interest to users.
- Automated indexing begins with feature extraction, such as extracting all words from a text, followed by refinements such as eliminating stop words (a, an, the, of), stemming (walking -> walk), counting the most frequent words, and mapping concepts using a thesaurus (tube -> pipe).
-------------------------------------------------------------------------------------------------------------------
BUILDING AN INVERTED INDEX
- Inverted index, also called postings file or inverted file, is an index data structure storing a mapping
from content, such as words or numbers to its locations in a database file or in a document or a set of
documents.

- The purpose of an inverted index is to allow fast full text searches.

- An index always maps back from terms to the parts of a document where they occur.

- A dictionary of terms is kept.

- Then, for each term, a list is maintained of the documents in which the term occurs.

- Each item in the list, which records that a term appeared in a document, is called a posting.

- This list is therefore called the postings list.

- The dictionary is sorted alphabetically and each postings list is sorted by document ID.

- Example:
DOC 1 = new home sales top forecasts
DOC 2 = home sales rise in July
DOC 3 = increase in home sales in July
DOC 4 = July new home sales rise

Dictionary term -> postings list:
forecasts -> DOC 1
home -> DOC 1, DOC 2, DOC 3, DOC 4
increase -> DOC 3
July -> DOC 2, DOC 3, DOC 4
new -> DOC 1, DOC 4
rise -> DOC 2, DOC 4
sales -> DOC 1, DOC 2, DOC 3, DOC 4
top -> DOC 1

INDEXING ARCHITECTURE


BIWORD INDEX
- Index every consecutive pair of terms in the text as a phrase.
- Example: Friends, Romans, Countrymen would generate two bi-words “Friends Romans” and
“Romans Countrymen”.
- Each of these bi-word is now a vocabulary term.
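
A minimal Python sketch of biword generation (the tokenization here is just a whitespace split; a real system would normalize the tokens first):

def biwords(text):
    tokens = text.split()
    # pair each token with its successor to form consecutive two-word phrases
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords("Friends Romans Countrymen"))
# ['Friends Romans', 'Romans Countrymen']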

POSITIONAL INDEXES
- In a positional index, each posting consists of a docID and a list of positions at which the term occurs.
- Example:

cat, 100: < 1, 6: <7, 18, 33, 72, 86, 231>;
            2, 5: <1, 17, 74, 222, 255>;
            4, 2: <8, 16>; ... >

The word "cat" has a document frequency of 100; it occurs 6 times in document 1 at positions 7, 18, 33, 72, 86, 231, 5 times in document 2, twice in document 4, and so on.
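
The posting above can be pictured as the following in-memory structure (an illustrative Python dictionary, not an on-disk format): each term maps to its document frequency and to a per-document list of positions.

# term -> (document frequency in the whole collection, {docID: [positions]})
positional_index = {
    "cat": (100, {
        1: [7, 18, 33, 72, 86, 231],
        2: [1, 17, 74, 222, 255],
        4: [8, 16],
        # ... remaining documents omitted
    }),
}

df, postings = positional_index["cat"]
print(df, postings[1])   # 100 [7, 18, 33, 72, 86, 231]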
SPARSE VECTORS
- Most documents and queries do not contain most words, so vectors are sparse,

i.e. most entries are zero (0).

- Efficient methods are needed for storing and computing with sparse vectors.
- Sparse vectors can be represented as lists, as trees, or as hash tables.

An Inverted Index is a data structure used in information retrieval systems to efficiently retrieve
documents or web pages containing a specific term or set of terms. In an inverted index, the index is
organized by terms (words), and each term points to a list of documents or web pages that contain
that term.
Inverted indexes are widely used in search engines, database systems, and other applications where
efficient text search is required. They are especially useful for large collections of documents, where
searching through all the documents would be prohibitively slow.
An inverted index is an index data structure storing a mapping from content, such as words or
numbers, to its locations in a document or a set of documents. In simple words, it is a hashmap-like
data structure that directs you from a word to a document or a web page.
Example: Consider the following documents.
Document 1: The quick brown fox jumped over the lazy dog.
Document 2: The lazy dog slept in the sun.
To create an inverted index for these documents, we first tokenize the documents into terms, as
follows.
Document 1: The, quick, brown, fox, jumped, over, the, lazy, dog.
Document 2: The, lazy, dog, slept, in, the, sun.
Next, we create an index of the terms, where each term points to a list of documents that contain that
term, as follows.
The -> Document 1, Document 2
Quick -> Document 1
Brown -> Document 1
Fox -> Document 1
Jumped -> Document 1
Over -> Document 1
Lazy -> Document 1, Document 2
Dog -> Document 1, Document 2
Slept -> Document 2
In -> Document 2
Sun -> Document 2

To search for documents containing a particular term or set of terms, the search engine queries the
inverted index for those terms and retrieves the list of documents associated with each term. The
search engine can then use this information to rank the documents based on relevance to the query
and present them to the user in order of importance.
There are two types of inverted indexes:
• Record-Level Inverted Index: Record Level Inverted Index contains a list of references to
documents for each word.
• Word-Level Inverted Index: Word Level Inverted Index additionally contains the positions of
each word within a document. The latter form offers more functionality but needs more
processing power and space to be created.
Suppose we want to search the texts “hello everyone, ” “this article is based on an inverted index, ”
and “which is hashmap-like data structure“. If we index by (text, word within the text), the index
with a location in the text is:
hello (1, 1)
everyone (1, 2)
this (2, 1)
article (2, 2)
is (2, 3); (3, 2)
based (2, 4)
on (2, 5)
inverted (2, 6)
index (2, 7)
which (3, 1)
hashmap (3, 3)
like (3, 4)
data (3, 5)
structure (3, 6)

The word “hello” is in document 1 (“hello everyone”) starting at word 1, so has an entry (1, 1), and the
word “is” is in documents 2 and 3 at ‘3rd’ and ‘2nd’ positions respectively (here position is based on
the word).
The index may have weights, frequencies, or other indicators.
Steps to Build an Inverted Index
• Fetch the document.
• Remove stop words: stop words are the most frequently occurring and least useful words in documents, such as "I", "the", "we", "is", and "an".
• Stem to the root word: when searching for "cat", we also want documents in which the word appears as "cats" or "catty". To relate these words, part of each word is chopped off so that the root word remains; standard tools such as Porter's Stemmer perform this step.
• Record document IDs: if the word is already present, add a reference to the document to its index entry; otherwise, create a new entry. Additional information such as the frequency and location of the word can also be stored. (A small code sketch of these steps follows the example below.)
Example:
Words Document
ant doc1
demo doc2
world doc1, doc2
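
As promised above, here is a minimal Python sketch of these steps. The stop list and the one-line "stemmer" are deliberately crude stand-ins (a real system would use Porter's stemmer); the structure returned is a word-level index mapping each term to document IDs and positions.

from collections import defaultdict

STOP_WORDS = {"i", "the", "we", "is", "an", "a", "of", "in", "on"}

def crude_stem(word):
    # stand-in for a real stemmer such as Porter's: strip a trailing 's'
    return word[:-1] if word.endswith("s") else word

def build_inverted_index(docs):
    # docs: {doc_id: text}; returns {term: {doc_id: [positions]}}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split(), start=1):
            token = crude_stem(token.strip(".,"))
            if token and token not in STOP_WORDS:
                index[token][doc_id].append(pos)
    return index

index = build_inverted_index({
    1: "The quick brown fox jumped over the lazy dog.",
    2: "The lazy dog slept in the sun.",
})
print(sorted(index["dog"].items()))   # [(1, [9]), (2, [3])]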

Advantages of Inverted Index


• An inverted index allows fast full-text searches, at the cost of increased processing when a document is added to the database.
• It is easy to develop.
• It is the most popular data structure used in document retrieval systems, used on a large scale
for example in search engines.


Disadvantages of Inverted Index


• Large storage overhead and high maintenance costs on updating, deleting, and inserting.
• Instead of retrieving the data in decreasing order of expected usefulness, the records are
retrieved in the order in which they occur in the inverted lists.
-------------------------------------------------------------------------------------------------------------------
Features of Inverted Indexes
• Efficient search: Inverted indexes allow for efficient searching of large volumes of text-based
data. By indexing every term in every document, the index can quickly identify all documents
that contain a given search term or phrase, significantly reducing search time.
• Fast updates: Inverted indexes can be updated quickly and efficiently as new content is added
to the system. This allows for near-real-time indexing and searching for new content.
• Flexibility: Inverted indexes can be customized to suit the needs of different types
of information retrieval systems. For example, they can be configured to handle different types
of queries, such as Boolean queries or proximity queries.
• Compression: Inverted indexes can be compressed to reduce storage requirements. Various
techniques such as delta encoding, gamma encoding, variable byte encoding, etc. can be used
to compress the posting list efficiently.
• Support for stemming and synonym expansion: Inverted indexes can be configured to
support stemming and synonym expansion, which can improve the accuracy and relevance of
search results. Stemming is the process of reducing words to their base or root form, while
synonym expansion involves mapping different words that have similar meanings to a common
term.
• Support for multiple languages: Inverted indexes can support multiple languages, allowing
users to search for content in different languages using the same system.

Compression techniques:
-------------------------------------------------------------------------------------------------------------
Why compression (in general)?

• Use less disk space


• Saves a little money
• Keep more data in memory
• Increases speed
• Increase speed of data transfer from disk to memory
• [read compressed data | decompress] is faster than [read uncompressed data]
• Premise: Decompression algorithms are fast
• True of the decompression algorithms we use
-----------------------------------------------------------------------------------------------------------
Lossless vs. lossy compression
Lossless compression: All information is preserved
What we mostly do in IR.
Lossy compression: Discard some information
Several of the pre-processing steps can be viewed as lossy compression: case folding, stop words,
stemming, number elimination
Later: prune postings entries that are unlikely to turn up in the top k list for any query
Almost no loss of quality for the top k list
-------------------------------------------------------------------------------------------------------------------
Compressing an Inverted Index
The main goal in compressing inverted index files is to develop algorithms that reduce I/O bandwidth and storage overhead. The size of the index file determines the storage overhead imposed; furthermore, since large index files demand greater I/O bandwidth to read them, the size also directly affects processing times. Although compression of text has been studied extensively, relatively little work was done in the area of inverted index compression. Early work produced an index that was relatively easy to decompress, comprised less than ten percent of the original document collection and, more impressively, included stop terms. The two primary areas in which an inverted index might be compressed are the term dictionary and the posting lists. Given relatively inexpensive memory costs, we do not focus on compression of the term dictionary. The number of new terms always increases slightly as new domains are encountered, but it is reasonable to expect that it will stabilize at around one or two million terms. With an average term length of six, a four-byte document frequency counter, and a four-byte pointer to the first entry in the posting list, fourteen bytes are required for each term. For the conservative estimate of two million terms, the uncompressed dictionary is likely to fit comfortably within 32 MB; even if we are off by an order of magnitude, the amount of memory needed to store it is conservatively under a gigabyte. Given the relatively small size of the dictionary and the ease with which it should fit in memory, we do not give a detailed discussion of techniques used to compress it. We note that stemming reduces this requirement. Also, the use of phrases improves precision and recall, and storage of phrases in the index may well require compression; this depends upon how phrases are identified and restricted. Most systems eliminate phrases that occur infrequently.
To introduce index compression algorithms, we first describe a relatively straightforward one referred to as Byte-Aligned (BA) index compression. BA compression is done within byte boundaries to improve runtime at a slight cost to the compression ratio. This algorithm is easy to implement and provides good compression. Variable-length encodings yield better compression, but they do so at the expense of increased implementation complexity.

Fixed Length Index Compression

Entries in a posting list are in ascending order by document identifier (an exception to this ordering occurs when a pruned inverted index approach is used). Hence, run-length encoding is applicable for document identifiers: for any document identifier, only the offset (gap) between the current identifier and the identifier immediately preceding it is stored. For the first entry, where no preceding document identifier exists, a compressed version of the document identifier itself is stored. Using this technique, a high proportion of relatively low numerical values is assured. This scheme effectively reduces the domain of the identifiers, allowing them to be stored in a more concise format.
Subsequently, the following method is applied to compress the data. For a given input value, the two left-most bits are reserved to store a count of the number of bytes used to store the value. There are four possible combinations of the two-bit representation, so a two-bit length indicator is used for all document identifiers, and integers are stored in either 6, 14, 22, or 30 bits. Optimally, each individual data record is reduced by a factor of four, since in the best case all values are less than 2^6 = 64 and can be stored in a single byte, whereas without compression four bytes are used for every document identifier. For each value to be compressed, the minimum number of bytes required to store it is computed; the length indicator records whether one, two, three, or four bytes are used. For document collections exceeding 2^30 documents, this scheme can be extended to a three-bit length indicator, which extends the range to 2^61 - 1. For term frequencies, there is no concept of using an offset between successive values, as each frequency is independent of the preceding value; however, the same encoding scheme can be used. Since we do not expect a document to contain a term more than 2^15 = 32,768 times, either one or two bytes are used to store the value, with one bit serving as the length indicator.

Fig: Byte-Aligned Compression

Fig: Baseline: No Compression

Example: Fixed Length Compression
Consider an entry for an arbitrary term, t1, which indicates that t1 occurs in documents 1, 3, 7, 70, and 250. Byte-aligned (BA) compression uses the leading two high-order bits to indicate the number of bytes required to represent the value. Note that only the differences (gaps) between entries in the posting list must be encoded; the difference 250 - 70 = 180 is all that must be stored for the final value. For the first four gaps (1, 2, 4, and 63), only one byte each is required; for the final gap, 180, two bytes are required. The values and their corresponding compressed bit strings are shown in the table above. Using no compression, the five entries in the posting list require four bytes each, for a total of twenty bytes. In this example, the uncompressed data requires 160 bits, while BA compression requires only 48 bits.
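
A small Python sketch of the byte-aligned scheme just described, under the stated assumptions (two high-order bits hold the byte count, payloads of 6, 14, 22, or 30 bits, and gaps rather than raw document identifiers are encoded). Encoding the gaps of the example posting list yields 6 bytes, i.e. the 48 bits mentioned above.

def ba_encode(value):
    # pick the smallest of 1..4 bytes whose 6/14/22/30 payload bits can hold the value
    for nbytes, bits in ((1, 6), (2, 14), (3, 22), (4, 30)):
        if value < (1 << bits):
            word = ((nbytes - 1) << bits) | value   # 2-bit length indicator in the top bits
            return word.to_bytes(nbytes, "big")
    raise ValueError("gap too large for this scheme")

def encode_posting_list(doc_ids):
    out, prev = bytearray(), 0
    for d in doc_ids:                 # store gaps, not absolute identifiers
        out += ba_encode(d - prev)
        prev = d
    return bytes(out)

encoded = encode_posting_list([1, 3, 7, 70, 250])
print(len(encoded) * 8)               # 48 bits, versus 160 bits uncompressed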

4.1.3 Variable Length Index Compression

Variable-length schemes encode the differences (gaps) in the posting list. They capitalize on the fact that, for most long posting lists, the difference between two entries is relatively small. Patterns can be seen in these differences, and Huffman encoding provides the best compression: the frequency distribution of all the offsets is obtained through an initial pass over the text, a compression scheme is developed based on that frequency distribution, and a second pass applies the new compression scheme. For example, if it is found that an offset of one has the highest frequency throughout the entire index, the scheme uses a single bit to represent the offset of one.
The Elias gamma code represents an integer x with 2*floor(log2 x) + 1 bits. The first floor(log2 x) bits are the unary representation of floor(log2 x). (Unary representation is a base-one representation of integers using only the digit one; the decimal number 5 is represented as 11111.) After the leading unary representation, the next bit is a single stop bit of zero. At this point, the highest power of two that does not exceed x has been represented. The next floor(log2 x) bits represent the remainder x - 2^floor(log2 x) in binary. As an example, consider the compression of the decimal value 14. First, floor(log2 14) = 3 is represented in unary as 111. Next, the stop bit is appended. Subsequently, the remainder x - 2^floor(log2 x) = 14 - 8 = 6 is stored in binary using floor(log2 14) = 3 bits as 110. Hence, the compressed code for decimal 14 is 1110110. Decompression requires only one pass, because it is known that for a number with n bits prior to the stop bit, there will be n bits after the stop bit. The codes for the first eight integers follow the same pattern.

4.1.3.1 Example: Variable Length Compression

For our same example, the gamma codes of the differences 1, 2, 4, 63, and 180 require only 35 bits, thirteen fewer than the simple BA compression, even though our example contained an even mix of relatively large and small offsets. The real gain comes from very small offsets, which require only a single bit. Moffat and Zobel use the gamma code to compress the term frequency in a posting list, but use a more complex coding scheme for the posting list entries.
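
A Python sketch of the gamma code as defined above (unary part, stop bit, then the remainder in binary); gamma_encode(14) returns '1110110', and the gaps 1, 2, 4, 63, 180 together take 1 + 3 + 5 + 11 + 15 = 35 bits, matching the example.

def gamma_encode(x):
    # assumes x >= 1
    n = x.bit_length() - 1                  # floor(log2 x)
    unary = "1" * n + "0"                   # n ones followed by the stop bit
    remainder = x - (1 << n)                # x - 2^floor(log2 x)
    return unary + (format(remainder, "b").zfill(n) if n else "")

print(gamma_encode(14))                                        # 1110110
print(sum(len(gamma_encode(g)) for g in (1, 2, 4, 63, 180)))   # 35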
4.1.4 Varying Compression Based on Posting List Size
The gamma scheme can be generalized as a coding paradigm based on a vector V of positive integers v_1, v_2, ..., v_n. To code an integer x >= 1 relative to V, find k such that

v_1 + v_2 + ... + v_(k-1) < x <= v_1 + v_2 + ... + v_k

In other words, find the first component of V such that the sum of all components up to and including it is greater than or equal to the value x to be encoded. For our example of 7, using a vector V of <1, 2, 4, 8, 16, 32>, we find that the first three components (1, 2, 4) are needed to equal or exceed 7, so k is equal to three. Now k can be encoded in some representation (unary is typically used), followed by the difference

d = x - (v_1 + ... + v_(k-1)) - 1

Using this sum we have d = 7 - (1 + 2) - 1 = 3, which is now coded in ceil(log2 v_k) = ceil(log2 4) = 2 binary bits.

Table: Variable Compression based on Posting List Size

With this generalization, the gamma scheme can be seen as using the vector V composed of powers of two, <1, 2, 4, 8, ...>, and coding k in unary. Clearly, V can be changed to give different compression characteristics. Low values in V optimize compression for low numbers, while higher values in V provide more resilience for high numbers. A clever refinement is to vary V for each posting list such that V = <b, 2b, 4b, 8b, 16b, 32b, 64b, ...>, where b is the median offset in the posting list.

4.1.4.1 Example: Using the Posting List Size

Using our example of gaps 1, 2, 4, 63, and 180, the median b is four, which results in the vector V = <4, 8, 16, 32, 64, 128, 256>. The table contains the encoding of the five posting list entries using this scheme. This requires thirty-three bits, and we can see that, for this example, the use of the median was not such a good choice because of the wide skew in the numbers. A more typical posting list, in which the numbers are uniformly closer to the median, could result in better compression.
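
A Python sketch of coding an integer relative to a vector V as described above. With V = <1, 2, 4, 8, ...> it reproduces the gamma code; with V = <4, 8, 16, 32, 64, 128, 256> the gaps 1, 2, 4, 63, 180 take 3 + 3 + 3 + 11 + 13 = 33 bits, matching the example.

def vector_encode(x, V):
    # find k such that v_1 + ... + v_(k-1) < x <= v_1 + ... + v_k
    total = 0
    for k, vk in enumerate(V, start=1):
        if total + vk >= x:
            d = x - total - 1                          # 0 <= d < v_k
            width = (vk - 1).bit_length()              # ceil(log2 v_k) bits, 0 when v_k == 1
            code_k = "1" * (k - 1) + "0"               # k in unary
            return code_k + (format(d, "b").zfill(width) if width else "")
        total += vk
    raise ValueError("x is outside the range covered by V")

V = [4, 8, 16, 32, 64, 128, 256]
print(sum(len(vector_encode(g, V)) for g in (1, 2, 4, 63, 180)))   # 33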
4.1.4.2 Throughput-optimized Compression
Anh and Moffat developed an index compression scheme that yields good compression ratios while maintaining fast decompression for efficient query processing. They devised a variable-length encoding that takes advantage of the distribution of the document identifier offsets in each posting list.
This is a hybrid of bit-aligned and byte-aligned compression: each 32-bit word contains encodings for a variable number of integers, but each integer within the word is encoded using an equal number of bits. Words are divided into bits used for a "selector" field and bits used for storing data. The selector field contains an index into a table of intra-word partitioning strategies based on the number of bits available for storing data, ensuring that each integer encoded in the word uses the same number of bits. The appropriate partitioning strategy is chosen based on the largest document identifier offset among the integers to be stored in the word. Anh and Moffat propose three variants based on this strategy, differing primarily in how the bits in a word are partitioned:
• Simple-9: uses 28 bits for data and 4 bits for the selector field; the selection table has nine rows, as there are nine different ways to split 28 bits into equal-width codes.
• Relative-10: similar to Simple-9, but uses only two bits for the selector field, leaving 30 data bits with 10 possible partitions. The key difference is that, with only 2 selector bits, each word can only choose from 4 of the 10 available partitions; these are chosen relative to the selector value of the previous word. This algorithm obtains slight improvements over Simple-9.
• Carryover-12: a variant of Relative-10 in which some of the space wasted by partitioning is reclaimed by using the leftover bits of a word to store the selector value for the next word, allowing that word to use all of its bits for data storage. This obtains the best compression of the three, but it is the most complex and requires the most decompression time.
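
A Python sketch of the Simple-9 packing idea described above (an illustration, not a reference implementation; it handles only the encoding side and assumes every gap fits in 28 bits). The selector in the top 4 bits says how the 28 data bits of each 32-bit word are split into equal-width codes.

# (codes per word, bits per code) for the nine possible equal splits of 28 bits
SELECTORS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]

def simple9_encode(gaps):
    words, i = [], 0
    while i < len(gaps):
        for sel, (count, width) in enumerate(SELECTORS):
            chunk = gaps[i:i + count]
            # greedily pick the densest packing that fits the next `count` gaps
            if len(chunk) == count and all(g < (1 << width) for g in chunk):
                word = sel << 28
                for j, g in enumerate(chunk):
                    word |= g << (j * width)
                words.append(word)
                i += count
                break
        else:
            raise ValueError("gap does not fit in 28 bits")
    return words

print(len(simple9_encode([1, 2, 4, 63, 180])))   # the five gaps fit into two 32-bit words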

Index Pruning
To this point, we have discussed lossless approaches to inverted index compression. A lossy approach is called static index pruning. The basic idea is that posting list entries may be removed, or pruned, without significantly degrading precision. Experiments were done with both term-specific pruning and uniform pruning. With term-specific pruning, different levels of pruning are applied for each term, while uniform pruning simply eliminates posting list entries in a uniform fashion, regardless of the term. It was shown that pruning nearly seventy percent of the full inverted index did not significantly affect average precision.

Reordering Documents Prior to Indexing
Index compression efficiency can also be improved if we use an algorithm to reorder documents prior to compressing the inverted index. Since the compression effectiveness of many encoding schemes largely depends on the gaps between document identifiers, the idea is that if we can feed documents to the indexer in the right order, we can reduce the average gap and thereby maximize compression. Consider documents d1, d51, and d101, all of which contain the same term t. For these documents we obtain a posting list for t of t -> d1, d51, d101, where the document gap between consecutive entries is 50. If, however, we arranged the documents prior to submitting them to the indexer, we could renumber them as d1, d2, and d3, which completely eliminates the gap. We note that for D documents there are D! possible orderings, so any attempt to order documents faces significant scalability concerns. The following algorithms compare documents to other documents prior to submitting them for indexing.
Top-Down
Generally, the two top-down algorithms consist of four main phases. In the first phase, called center selection, two groups of documents are selected from the collection and used as partitions in subsequent phases. In the redistribution phase, all remaining documents are divided among the selected centers according to their similarity. In the recursion phase, the previous phases are repeated recursively over the two resulting partitions until each one becomes a singleton. Finally, in the merging phase, the partitions formed by each recursive call are merged bottom-up, creating an ordering. The first of the two proposed top-down algorithms is called transactional B&B, as it is an implementation of the Blelloch and Blandford algorithm. This reordering algorithm obtains the best compression ratios of the four, but it is not scalable. The second top-down algorithm is called Bisecting, so named because its center selection phase consists of choosing two random documents as centers, thereby dramatically reducing the cost of this phase.
Since its center selection is so simple, the Bisecting algorithm obtains less effective compression
but it is more efficient.
Bottom-Up
The bottom-up algorithms begin by considering each document in the collection separately and progressively group documents based on their similarity. The first bottom-up algorithm is inspired by the popular k-means approach to document clustering; the second uses k-scan, a simplified version of k-means based on a centroid-search algorithm.
The k-means algorithm initially chooses k documents as cluster representatives and assigns all remaining documents to those clusters based on a measure of similarity. At the end of the first pass, the cluster centroids are recomputed and the documents are reassigned according to their similarity to the new centroids. This iteration continues until the cluster centroids stabilize. The single-pass version of this algorithm performs only the first pass, and the authors select the k initial centers using the Buckshot clustering technique. The k-scan algorithm is a simplified version of single-pass k-means, requiring only k steps to complete. It forms clusters in place at each step by first selecting a document to serve as the centroid for a cluster, and then assigning to that cluster a portion of the unassigned documents that have the highest similarity to it.
-------------------------------------------------------------------------------------------------------------------
Sequential Searching
A few approaches to directly searching compressed text exist. One of the most successful techniques in
practice relies on Huffman coding taking words as symbols. That is, consider each different text word
as a symbol, count their frequencies, and generate a Huffman code for the words. Then, compress the
text by replacing each word with its code. To improve compression/decompression efficiency, the
Huffman code uses an alphabet of bytes instead of bits. This scheme compresses faster and better
than known commercial systems, even those based on Ziv-Lempel coding.
Since Huffman coding needs to store the codes of each symbol, this scheme has to store the whole
vocabulary of the text, i.e. the list of all different text words. This is fully exploited to efficiently search
complex queries. Although, according to Heaps' law, the vocabulary (i.e., the alphabet) grows as O(n^beta) for 0 < beta < 1, the generalized Zipf's law shows that the distribution is skewed enough that the entropy remains constant (i.e., the compression ratio will not degrade as the text grows). Those laws are explained in Chapter 6.
Any single-word or pattern query is first searched in the vocabulary. Some queries can be binary searched, while others, such as approximate searching or regular expression searching, must sequentially traverse the whole vocabulary. This vocabulary is rather small compared to the text size, thanks to Heaps' law. Notice that this process is exactly the same as the vocabulary searching performed by inverted indices, either for simple or complex pattern matching.
Once that search is complete, the list of different words that match the query is obtained. The Huffman
codes of all those words are collected and they are searched in the compressed text. One alternative is
to traverse byte-wise the compressed text and traverse the Huffman decoding tree in synchronization,
so that each time that a leaf is reached, it is checked whether the leaf (i.e., word) was marked as
'matching' the query or not. This is illustrated in Figure 8.24. Boyer-Moore filtering can be used to
speed up the search.
Solving phrases is a little more difficult. Each element is searched in the vocabulary. For each word of
the vocabulary we define a bit mask. We set the i-th bit in the mask of all words which match with the
i-th element of the phrase query. This is used together with the Shift-Or algorithm. The text is
traversed byte-wise, and only when a leaf is reached, does the Shift-Or algorithm consider that a new
text symbol has been read, whose bit mask is that of the leaf (see Figure 8.24). This algorithm is
surprisingly simple and efficient.

-------------------------------------------------------------------------------------------------------------
DOCUMENT REPRESENTATION AND TERM WEIGHTING:

STEPS IN IR PROCESS (RETRIEVAL PROCESS)


1. Indexing (Creating document representation)
- Indexing is the manual or automated process of making statements about a document, lesson, person, and so on.
- For example: author wise, subject wise, text wise, etc.
- Index can be:
i. Document oriented: - the indexer captures the subjects and other features actually treated in the document itself.
ii. Request oriented: - the indexer assesses the document's relevance to subjects and other features of interest to users.
- Automated indexing begins with feature extraction, such as extracting all words from a text, followed by refinements such as eliminating stop words (a, an, the, of), stemming (walking -> walk), counting the most frequent words, and mapping concepts using a thesaurus (tube -> pipe).

2. Query Formulation (Creating query representation)


- Retrieval means using the available evidence to predict the degree to which a document is relevant or
useful for a given user need as described in a free form query description.

- A query can specify text words or phrase, the system should look for.
- The query description is transformed manually or automatically into a formal query representation, ready to match with the document representation.

3. Matching the Query Representation With Entity Representation


- The match uses the features specified in the query to predict document relevance.
- Exact match (0 or 1).
- Synonym expansion (pipe -> tube).
- Hierarchical expansion (pipe -> capillary).
- The system ranks the results.

4. Selection
- User examines the results and selects the relevant items.


5. Relevance Feedback & Interactive Retrieval


- The system can assist the user in improving the query by showing a list of features (option) found in
many relevant items.

Document Representation:
Document representation refers to how documents are represented within an information retrieval
system. In most cases, documents are represented as a bag-of-words model, where each document is
treated as an unordered collection of words or terms. Other representations, such as vector space
models, can also be used.
Bag-of-Words Model:
• In the bag-of-words model, a document is represented as a vector where each dimension
corresponds to a unique term in the vocabulary.
• The value of each dimension (term) in the vector typically indicates the frequency of the
corresponding term in the document.
• Stop words (common words like "the", "and", "of", etc.) are often removed to reduce noise in
the representation.
• Stemming and lemmatization may also be applied to reduce inflected or derived words to their
base or dictionary form.

Term Weighting:
Term weighting involves assigning weights to terms in the document representation to reflect their
importance in distinguishing relevant documents from irrelevant ones. The goal is to give higher
weights to terms that are more discriminative and informative.
Term Frequency-Inverse Document Frequency (TF-IDF):
• TF-IDF is a popular term weighting scheme used in information retrieval.
• It calculates the importance of a term within a document relative to its importance across all
documents in the corpus.
• The weight of a term t in a document d is calculated as the product of two components:
1. Term Frequency (TF): The frequency of term t in document d, usually normalized by the total number of terms in d to account for document length.
2. Inverse Document Frequency (IDF): The logarithmically scaled inverse fraction of the documents that contain term t among the documents in the corpus. It measures the informativeness of a term; terms that occur in fewer documents tend to have higher IDF weights.
• The TF-IDF weight for term t in document d is given by:

TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)

IDF(t) = log(Total number of documents / Number of documents containing term t)

TF-IDF(t, d) = TF(t, d) * IDF(t)
By using TF-IDF, terms that appear frequently in a particular document but rarely in other documents
are assigned higher weights, making them more relevant for retrieval purposes.
-------------------------------------------------------------------------------------------------------------------
Types of Term Weighting Techniques:
1. Frequency-based Techniques:
• These techniques rely on the frequency of occurrence of terms in documents and
queries.
• Term Frequency (TF): Measures how often a term appears in a document.
• Document Frequency (DF): Measures how many documents contain a particular term.
2. Statistical Techniques:
• These techniques use statistical methods to assign weights to terms.
• Inverse Document Frequency (IDF): Measures the informativeness of a term by
considering how often it appears across all documents in the corpus.
• Term Frequency-Inverse Document Frequency (TF-IDF): Combines TF and IDF to
assign higher weights to terms that are frequent in the document but rare across the
corpus.
3. Probabilistic Techniques:
• These techniques model the probability of relevance of a document given a query.
• Binary Independence Model (BIM): Assumes that terms in a document are
independent and calculates the probability of relevance based on term occurrences.
• Okapi BM25: A variation of the Binary Independence Model that considers document
length normalization and term saturation.
Example:
Let's consider a simplified example using TF-IDF:
Suppose we have a small corpus containing three documents:
• Document 1: "Information retrieval is an essential aspect of modern search engines."
• Document 2: "Search engines use various techniques to retrieve relevant information."
• Document 3: "Modern search engines employ advanced algorithms for information retrieval."
We want to calculate the TF-IDF weight of the term "information" in each document.
1. Term Frequency (TF):
• TF("information", Document 1) = 1/10
• TF("information", Document 2) = 1/9
• TF("information", Document 3) = 1/9
2. Document Frequency (DF):
• DF("information") = 3 (the term appears in all three documents)
3. Inverse Document Frequency (IDF):
• IDF("information") = log(3/3) = log(1) = 0
4. TF-IDF Weight:
• TF-IDF("information", Document 1) = (1/10) * 0 = 0
• TF-IDF("information", Document 2) = (1/9) * 0 = 0
• TF-IDF("information", Document 3) = (1/9) * 0 = 0
In this example, "information" has a TF-IDF weight of 0 in every document because it occurs in all of them, so it is not discriminative in this corpus.
Term weighting techniques like TF-IDF help information retrieval systems to rank documents based on
their relevance to user queries by assigning appropriate weights to terms based on their importance
and occurrence patterns across the corpus.
-------------------------------------------------------------------------------------------------------------------
Storage and retrieval of indexed documents:
In Information Retrieval (IR) systems, the storage and retrieval of indexed documents are critical
components for efficiently managing and accessing large volumes of information. Here's how the
process typically works:
Storage of Indexed Documents:
1. Document Preprocessing:
• Before storage, documents undergo preprocessing steps such as tokenization, stemming,
stop-word removal, and possibly other normalization techniques to prepare them for
indexing.
2. Indexing:
• The preprocessed documents are indexed to facilitate efficient retrieval. Indexing
involves creating data structures that map terms to the documents in which they occur.
3. Inverted Index:
• The most common data structure used for indexing is the inverted index. It maps each
unique term in the corpus to the list of documents containing that term.
• Inverted index typically includes metadata such as term frequency (TF), document
frequency (DF), and possibly other statistics to aid in relevance ranking.
4. Storage Mechanism:
• Indexed documents and their associated metadata are stored in a storage system. This
storage can be in-memory, on-disk, or distributed across multiple nodes depending on
the scale and requirements of the IR system.
• Various database systems, NoSQL databases, or custom storage solutions can be used
for efficient storage and retrieval.
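To make the indexing step concrete, here is a minimal in-memory sketch (illustrative only; production systems add positional postings, compression and on-disk layouts) of building an inverted index that records term frequencies per document:

from collections import defaultdict

def build_inverted_index(docs):
    # Maps each term to {doc_id: term frequency in that document}
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():          # naive whitespace tokenization
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

docs = [
    "information retrieval is an essential aspect of modern search engines",
    "search engines use various techniques to retrieve relevant information",
]
index = build_inverted_index(docs)
print(index["information"])       # {0: 1, 1: 1} -> posting list with TF per document
print(len(index["information"]))  # document frequency (DF) of the term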
Retrieval of Indexed Documents:
1. Query Processing:

• When a user submits a query, it goes through a similar preprocessing pipeline as
documents to ensure consistency and compatibility with the indexed data.
• The query may undergo tokenization, stemming, and stop-word removal to extract
meaningful terms.
2. Term Matching:
• The preprocessed query terms are matched against the inverted index to identify
relevant documents.
• The inverted index is queried to retrieve the list of documents containing the query
terms.
3. Ranking and Scoring:
• Retrieved documents are often ranked based on their relevance to the query. This may
involve calculating similarity scores using techniques such as TF-IDF, BM25, or machine
learning-based methods.
• Documents are ranked based on their relevance scores, and the top-ranked documents
are presented to the user.
4. Retrieval Mechanism:
• The retrieval mechanism efficiently fetches the relevant documents from storage based
on the results of the term matching and ranking processes.
• Caching strategies and optimizations may be employed to speed up retrieval and reduce
latency.
-------------------------------------------------------------------------------------------------------------------
CHAPTER III: RETRIEVAL MODELS
Topics Covered:
: Boolean model: Boolean operators, query processing, Vector space model: TF-IDF, cosine
similarity, query-document matching, Probabilistic model: Bayesian retrieval, relevance
feedback
-------------------------------------------------------------------------------------------------------------
Retrieval models in Information Retrieval (IR) are frameworks or algorithms used to rank documents in
response to a user query. These models aim to determine the relevance of documents to a particular
query and retrieve the most relevant documents first. There are several retrieval models, each with its
own approach to ranking documents. Here are some common retrieval models in IR:
1. Boolean Model:
• The Boolean model is based on set theory and retrieves documents based on the
presence or absence of terms specified in the query using Boolean operators (AND, OR,
NOT).
• Documents are ranked as either relevant or non-relevant to the query.
• While simple, the Boolean model lacks the ability to rank documents by relevance.
2. Vector Space Model (VSM):
• In the vector space model, documents and queries are represented as vectors in a multi-
dimensional space.
• Each dimension represents a term, and the value of each dimension indicates the weight
of the term in the document or query.
• Cosine similarity is often used to measure the similarity between the query vector and
document vectors.
• Documents are ranked based on their cosine similarity with the query vector.
3. Probabilistic Models:
• Probabilistic retrieval models, such as the Binary Independence Model (BIM) and Okapi
BM25, are based on probabilistic principles.
• They estimate the probability of relevance of a document given a query.
• These models consider factors such as term frequency, document length, and term
saturation to calculate relevance scores.
4. Language Models:
• Language models in IR treat documents and queries as probabilistic distributions of
terms.
• They estimate the likelihood of observing a query given a document and vice versa.
• Document-query similarity is typically measured with query-likelihood scoring or with
divergence measures such as KL divergence between the query and document language models.
5. Machine Learning-Based Models:
• Machine learning techniques, including supervised and unsupervised learning, are used
to learn the relevance of documents to queries.
• Features extracted from documents and queries are used to train models that predict
document relevance.
• Learning to Rank (LTR) algorithms, such as RankNet, LambdaMART, and SVMRank, fall
under this category.

6. Hybrid Models:
• Hybrid retrieval models combine multiple retrieval models to leverage their respective
strengths.
• For example, a hybrid model may combine the vector space model with language models
or probabilistic models to improve retrieval effectiveness.
Each retrieval model has its own strengths and weaknesses, and the choice of model depends on
factors such as the characteristics of the document collection, the nature of user queries, and the
desired trade-offs between recall and precision. Experimentation and evaluation are crucial to selecting
the most suitable retrieval model for a specific IR system.

-------------------------------------------------------------------------------------------------------------------
Boolean Model:
The Boolean model is one of the oldest and simplest information retrieval models used to retrieve
documents that match a Boolean query. It operates on the principle of set theory and allows queries to
be formulated using Boolean operators such as AND, OR, and NOT. Here's how the Boolean model
works:
Principles of the Boolean Model:
1. Document Representation:
• In the Boolean model, each document and query is represented as a set of index terms
(words or phrases).
2. Binary Representation:
• Each term in a document or query is either present (1) or absent (0), resulting in binary
representation.
• The presence of a term indicates that it occurs at least once in the document or query.
3. Boolean Operators:
• Boolean operators (AND, OR, NOT) are used to construct queries to retrieve documents
based on the presence or absence of terms.
• AND Operator: Retrieves documents containing all terms in the query.
• OR Operator: Retrieves documents containing at least one of the terms in the query.
• NOT Operator: Excludes documents containing the specified term.
Example:
Consider a small document collection with the following documents:
• Document 1: "information retrieval techniques"
• Document 2: "document indexing methods"
• Document 3: "retrieval models in IR"
Suppose we want to retrieve documents related to "information retrieval" and "models" using Boolean
queries.
1. Boolean Query Formation:

• Query 1: "information AND retrieval"
• Query 2: "models"
2. Document Representation:
• Convert each document into a set of index terms.
Document 1: {information, retrieval, techniques} Document 2: {document, indexing, methods}
Document 3: {retrieval, models, in, IR}
3. Boolean Retrieval:
• For Query 1 ("information AND retrieval"):
• Documents containing both "information" and "retrieval" are retrieved.
• Only Document 1 satisfies this condition.
• For Query 2 ("models"):
• Documents containing "models" are retrieved.
• Document 3 satisfies this condition.
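The Boolean retrieval in this example can be expressed directly as set operations on posting lists. The sketch below is a simplified illustration (no query parser, document ids and postings hard-coded for the three documents above):

# Posting lists: term -> set of document ids containing the term
postings = {
    "information": {1}, "retrieval": {1, 3}, "techniques": {1},
    "document": {2}, "indexing": {2}, "methods": {2},
    "models": {3}, "in": {3}, "ir": {3},
}
all_docs = {1, 2, 3}

# Query 1: "information AND retrieval" -> set intersection
print(postings["information"] & postings["retrieval"])   # {1}

# Query 2: "models" -> a single posting list
print(postings["models"])                                # {3}

# OR is set union, NOT is set difference against the full collection
print(postings["retrieval"] | postings["models"])        # {1, 3}
print(all_docs - postings["retrieval"])                  # {2}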
Advantages and Limitations:
Advantages:
• Simple and easy to understand.
• Exact matching: Documents either match the query or they don't.
Limitations:
• Lack of ranking: The Boolean model does not rank documents by relevance.
• No partial matching: Documents must contain all terms in an AND query to be retrieved.
• Sensitivity to query formulation: Small changes in query formulation can yield significantly
different results.
-------------------------------------------------------------------------------------------------------------------
Boolean Operators:
The Boolean model in Information Retrieval (IR) is based on set theory and uses Boolean operators
(AND, OR, NOT) to formulate queries and retrieve documents. Here's a brief overview of Boolean
operators and query processing in the Boolean model:
Boolean Operators:
1. AND Operator (∧):
• The AND operator retrieves documents that contain all the terms specified in the query.
• It narrows down the search space by finding documents that satisfy all the conditions
specified in the query.
• Example: "information ∧ retrieval" retrieves documents that contain both "information"
and "retrieval".
2. OR Operator (∨):
• The OR operator retrieves documents that contain at least one of the terms specified in
the query.
• It broadens the search space by finding documents that contain any of the terms in the
query.
• Example: "information ∨ retrieval" retrieves documents that contain either
"information" or "retrieval" or both.
3. NOT Operator (¬):
• The NOT operator excludes documents that contain the term specified in the query.
• It is used to remove irrelevant documents from the search results.
• Example: "information ¬ retrieval" retrieves documents that contain "information" but
do not contain "retrieval".
-------------------------------------------------------------------------------------------------------------------
Query Processing in the Boolean Model:
1. Tokenization:
• The query is tokenized into individual terms or keywords.
• Tokenization breaks down the query into its constituent elements, separating them by
whitespace or other delimiters.
2. Parsing:
• The parsed query is evaluated to identify Boolean operators (AND, OR, NOT) and their
operands.
• The query is structured based on the Boolean operators to determine the relationships
between terms.
3. Document Retrieval:
• For an AND query, documents that contain all the terms in the query are retrieved.
• For an OR query, documents that contain at least one of the terms in the query are
retrieved.
• For a NOT query, documents that do not contain the specified term are retrieved.
4. Ranking:

• In the Boolean model, documents are typically retrieved without any ranking based on
relevance.
• Documents either match the query conditions (relevant) or they don't (non-relevant).
5. Presentation:
• The retrieved documents are presented to the user, often in the form of a list or ranked
set of search results.
• Users may further refine their queries based on the initial set of results.
Advantages and Limitations:
• Advantages:
• Simple and intuitive query formulation.
• Exact matching: Documents either match the query or they don't.
• Limitations:
• Lack of ranking: Documents are not ranked by relevance.
• Boolean queries can be rigid and may not capture the nuances of natural language
queries.
• No support for partial matching or synonyms.
Despite its limitations, the Boolean model serves as a foundational concept in IR and has influenced
the development of more sophisticated retrieval models and techniques.
-------------------------------------------------------------------------------------------------------------------
Vector space model
The Vector Space Model (VSM) is a widely used and fundamental approach in Information Retrieval
(IR) for representing documents and queries as vectors in a multi-dimensional space. It enables the
calculation of similarity scores between documents and queries, facilitating the retrieval of relevant
documents based on their similarity to the query. Here's an overview of the Vector Space Model in IR:
Principles of the Vector Space Model:
1. Document and Query Representation:
• In the VSM, both documents and queries are represented as vectors in a high-
dimensional space.
• Each dimension of the space corresponds to a term in the vocabulary of the document
collection.
2. Term Frequency-Inverse Document Frequency (TF-IDF) Weighting:
• Before representing documents and queries as vectors, TF-IDF weighting is often applied
to the terms.
• TF-IDF reflects the importance of a term in a document relative to its importance across
the entire document collection.
• TF-IDF assigns higher weights to terms that are frequent in the document but rare
across the collection, thus capturing the discriminative power of terms.
3. Vector Representation:
• Each document and query is represented as a vector in the TF-IDF weighted term space.
• The dimensions of the vectors correspond to the terms, and the values represent the TF-
IDF weights of the terms.
4. Cosine Similarity:
• To measure the similarity between a document vector and a query vector, cosine
similarity is often used.
• Cosine similarity calculates the cosine of the angle between the document vector and the
query vector.
• Higher cosine similarity values indicate higher similarity between the document and the
query.
5. Retrieval and Ranking:
• Documents are ranked based on their cosine similarity scores with the query vector.
• Documents with higher cosine similarity scores are considered more relevant to the
query and are ranked higher in the search results.

Steps in the Vector Space Model:


1. Document Preprocessing:
• Documents undergo preprocessing steps such as tokenization, lowercasing, stop-word
removal, and stemming.
2. TF-IDF Calculation:
• TF-IDF weights are calculated for each term in the document collection.
3. Vector Representation:
• Each document and query is represented as a vector in the TF-IDF weighted term space.
4. Cosine Similarity Calculation:
• Cosine similarity scores are computed between the query vector and each document
vector.

5. Ranking:
• Documents are ranked based on their cosine similarity scores with the query vector.

Advantages of the Vector Space Model:


• Flexibility: VSM can handle a wide variety of queries and document types.
• Scalability: It can scale to large document collections efficiently.
• Adaptability: VSM can be extended and adapted with additional features and refinements.

Limitations of the Vector Space Model:


• Bag-of-Words Representation: VSM ignores word order and semantic relationships between
terms.
• Sparsity: In high-dimensional spaces, vectors can become sparse, leading to computational
challenges.
• Difficulty with Synonyms and Polysemy: VSM may struggle with capturing synonymous
terms or terms with multiple meanings.
Despite its limitations, the Vector Space Model remains a powerful and widely used approach in
Information Retrieval due to its simplicity, effectiveness, and flexibility in modeling document-query
relationships.
-------------------------------------------------------------------------------------------------------------------
TF-IDF calculation:
Let's walk through an example of TF-IDF calculation for a simple document collection. Suppose we have
a collection of three documents:
1. Document 1: "Information retrieval is important."
2. Document 2: "Information retrieval techniques are used in search engines."
3. Document 3: "Search engines use algorithms for information retrieval."
For TF-IDF calculation, we'll follow these steps:
1. Term Frequency (TF) Calculation:
• Calculate the Term Frequency (TF) for each term in each document.
• TF is the ratio of the number of times a term appears in a document to the total number
of terms in the document.
2. Inverse Document Frequency (IDF) Calculation:
• Calculate the Inverse Document Frequency (IDF) for each term in the entire document
collection.
• IDF is the logarithm of the ratio of the total number of documents to the number of
documents containing the term.
3. TF-IDF Calculation:
• Multiply TF by IDF for each term in each document to get the TF-IDF weight.
Let's go through the example:
Term Frequency (TF) Calculation:
• For each document, count the number of times each term appears (raw counts are shown below).
• TF is then obtained by dividing each count by the total number of terms in the document
(4 terms in Document 1, 8 in Document 2, 7 in Document 3).

Term Document 1 Document 2 Document 3
Information 1 1 1
Retrieval 1 1 1
Is 1 0 0
Important 1 0 0
Techniques 0 1 0
Are 0 1 0
Used 0 1 0
Use 0 0 1
In 0 1 0
Search 0 1 1
Engines 0 1 1
Algorithms 0 0 1
For 0 0 1

Inverse Document Frequency (IDF) Calculation:

• Calculate the IDF for each term using the formula: IDF(t) = log(N / DF(t)),
where N is the total number of documents and DF(t) is the number of documents containing term t.

Term DF(t) IDF(t)
Information 3 log(3/3)
Retrieval 3 log(3/3)
Is 1 log(3/1)
Important 1 log(3/1)
Techniques 1 log(3/1)
Are 1 log(3/1)
Used 1 log(3/1)
Use 1 log(3/1)
In 1 log(3/1)
Search 2 log(3/2)
Engines 2 log(3/2)
Algorithms 1 log(3/1)
For 1 log(3/1)
TF-IDF Calculation:
• Multiply TF by IDF for each term in each document to get the TF-IDF weight.
Term Document 1 Document 2 Document 3
Information TF * IDF TF * IDF TF * IDF
Retrieval TF * IDF TF * IDF TF * IDF
Is TF * IDF 0 0
Important TF * IDF 0 0
Techniques 0 TF * IDF 0
Are 0 TF * IDF 0
Used 0 TF * IDF 0
Use 0 0 TF * IDF
In 0 TF * IDF 0
Search 0 TF * IDF TF * IDF
Engines 0 TF * IDF TF * IDF
Algorithms 0 0 TF * IDF
For 0 0 TF * IDF
This table shows the TF-IDF weights for each term in each document. The values are calculated by
multiplying the TF of each term by its corresponding IDF. Note that "information" and "retrieval" occur
in every document, so their IDF is log(3/3) = 0 and their TF-IDF weight is 0 in all three documents.
This process assigns higher weights to terms that are important within a document but occur
infrequently across the entire collection, making them more discriminative for retrieval purposes.
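In practice TF-IDF is rarely computed by hand; a library such as scikit-learn can produce the weighted term-document matrix directly. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF variant and L2-normalizes each document vector by default, so its absolute numbers differ slightly from the table above, although the relative weighting behaves the same. A small sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Information retrieval is important.",
    "Information retrieval techniques are used in search engines.",
    "Search engines use algorithms for information retrieval.",
]

vectorizer = TfidfVectorizer()                   # default: lowercasing, smoothed IDF, L2 normalization
tfidf_matrix = vectorizer.fit_transform(docs)    # shape: (3 documents, vocabulary size)

terms = vectorizer.get_feature_names_out()
for doc_id in range(len(docs)):
    row = tfidf_matrix[doc_id].toarray().ravel()
    top_terms = sorted(zip(terms, row), key=lambda x: -x[1])[:3]
    print(doc_id, top_terms)                     # the three highest-weighted terms per document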
-------------------------------------------------------------------------------------------------------------------
Cosine similarity:
Cosine similarity is a widely used metric in Information Retrieval (IR) and Natural Language Processing
(NLP) for measuring the similarity between two vectors. In the context of IR, cosine similarity is
commonly used to determine the relevance of documents to a query. Here's how cosine similarity
works in IR:
Cosine Similarity Formula:
Given two vectors A and B, the cosine similarity similarity(A, B) is calculated using the following
formula:
similarity(A, B) = (A · B) / (||A|| × ||B||)
i.e. the dot product of the two vectors divided by the product of their Euclidean norms.


Cosine Similarity in Information Retrieval:


1. Document and Query Representation:
• In IR, documents and queries are often represented as vectors in a high-dimensional
space, such as the Vector Space Model (VSM).
• Each dimension corresponds to a term, and the value of each dimension represents the
importance (e.g., TF-IDF weight) of the term in the document or query.
2. Calculation of Cosine Similarity:
• To calculate the similarity between a document and a query, both the document vector
and the query vector are normalized.
• Cosine similarity is then computed as the cosine of the angle between the two
normalized vectors.
• Cosine similarity ranges from -1 to 1 in general; with non-negative term weights such as TF-IDF
it lies between 0 and 1, and a value closer to 1 indicates higher similarity between the
document and the query.
3. Ranking of Documents:
• Documents are ranked based on their cosine similarity scores with the query.
• Documents with higher cosine similarity scores are considered more relevant to the
query and are ranked higher in the search results.
Example:
Suppose we have a document vector D and a query vector Q in a three-dimensional space:
• Document Vector D: [3, 2, 1]
• Query Vector Q: [1, 2, 3]
We first compute the dot product of D and Q:
D · Q = (3 × 1) + (2 × 2) + (1 × 3) = 3 + 4 + 3 = 10
Next, we calculate the Euclidean norms of D and Q:
||D|| = √(3² + 2² + 1²) = √14 ≈ 3.74
||Q|| = √(1² + 2² + 3²) = √14 ≈ 3.74
Now, we compute the cosine similarity:
similarity(D, Q) = (D · Q) / (||D|| × ||Q||) = 10 / (3.74 × 3.74) ≈ 10 / 14 ≈ 0.714
In this example, the cosine similarity between the document vector D and the query vector Q is
approximately 0.714, indicating a relatively high degree of similarity.
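The same computation can be reproduced in a few lines of Python; the snippet below is a minimal sketch using NumPy:

import numpy as np

D = np.array([3, 2, 1])   # document vector
Q = np.array([1, 2, 3])   # query vector

# Cosine similarity: dot product divided by the product of the Euclidean norms
cosine = np.dot(D, Q) / (np.linalg.norm(D) * np.linalg.norm(Q))
print(round(cosine, 3))   # 0.714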


-----------------------------------------------------------------------------------------------------------------
Query-document matching
Query-document matching is a fundamental concept in Information Retrieval (IR) systems, where the
goal is to retrieve and rank documents based on their relevance to a given user query. The process
involves comparing the content of documents against the content of the query to identify relevant
documents. Here's how query-document matching typically works:
1. Query Processing:
• The user submits a query to the IR system.
• The query undergoes preprocessing steps, including tokenization, stemming, stop-word
removal, and possibly other normalization techniques to prepare it for matching against the
documents.
2. Term Matching:
• The preprocessed query terms are matched against the indexed documents to identify
documents containing the query terms.
• Documents that contain all or some of the query terms are candidates for retrieval.
3. Scoring and Ranking:
• Once candidate documents are identified, a relevance score is assigned to each document based
on its similarity to the query.
• Various scoring methods, such as TF-IDF, BM25, or machine learning-based models, may be
used to calculate the relevance score.
• The documents are ranked based on their relevance scores, with the most relevant documents
appearing at the top of the search results.
4. Retrieval and Presentation:
• The top-ranked documents are retrieved from the index and presented to the user as search
results.
• The user can then review the search results and select relevant documents based on their
information needs.
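To make this pipeline concrete, the following sketch (a toy illustration under vector-space assumptions, not production code; the corpus and query are invented) preprocesses a query with the same vectorizer used for the documents, scores every document by cosine similarity, and prints them in ranked order:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "information retrieval is an essential aspect of modern search engines",
    "search engines use various techniques to retrieve relevant information",
    "modern search engines employ advanced algorithms for information retrieval",
]
query = "algorithms for information retrieval"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)     # index the documents
query_vector = vectorizer.transform([query])     # apply the same preprocessing to the query

scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = sorted(enumerate(scores), key=lambda x: -x[1])
for doc_id, score in ranking:
    print(f"doc {doc_id}: {score:.3f}")          # highest-scoring document first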
Techniques for Query-Document Matching:
1. Exact Matching:
• Documents are retrieved if they contain all the terms in the query.
• Boolean model is an example of exact matching.
2. Partial Matching:
• Documents are retrieved based on the presence of some terms in the query.
• Vector space model and probabilistic models often allow for partial matching.

3. Relevance Feedback:
• Techniques that use user feedback to refine the query or ranking of documents.
• Incorporates user interactions to improve the relevance of retrieved documents.
4. Semantic Matching:
• Techniques that consider the meaning of the terms in the query and documents.
• Semantic matching methods leverage knowledge graphs, word embeddings, or semantic
analysis to improve matching accuracy.
5. Machine Learning-Based Matching:
• Utilizes machine learning algorithms to learn patterns of relevance between queries and
documents.
• Learning to Rank (LTR) algorithms and neural network models can be used for query-
document matching.
Evaluation:
• The effectiveness of query-document matching is evaluated using metrics such as precision,
recall, F1-score, and Mean Average Precision (MAP).
• Evaluation measures how well the system retrieves relevant documents and ranks them
appropriately.
In summary, query-document matching is a crucial component of IR systems that enables users to
efficiently find relevant information from large document collections based on their queries. Various
matching techniques and evaluation metrics are used to ensure the effectiveness and accuracy of the
retrieval process.
-------------------------------------------------------------------------------------------------------------------
Probabilistic Model: Bayesian Retrieval and Relevance Feedback
The probabilistic model in Information Retrieval (IR) is based on the principles of probability theory and
aims to estimate the probability of relevance of documents to a given query. One of the prominent
techniques within the probabilistic model is Bayesian retrieval, which utilizes Bayesian probability
theory to model the relevance of documents. Additionally, relevance feedback is a technique used to
improve retrieval effectiveness by incorporating feedback from users about the relevance of retrieved
documents. Let's explore both concepts:
Bayesian Retrieval:
Bayesian retrieval is based on the Bayesian probability framework, which calculates the posterior
probability of a document's relevance given a query. The key idea is to estimate the probability that a
document is relevant given the evidence provided by the query terms. The formula for Bayesian
retrieval is:
P(R|Q) = [ P(Q|R) · P(R) ] / P(Q)
Where:
• P(R|Q) is the posterior probability that a document is relevant (R) given the query Q.
• P(Q|R) is the probability of observing the query Q given that the document is relevant (R).
• P(R) is the prior probability of relevance, representing the probability that a random
document is relevant.
• P(Q) is the probability of observing the query Q.
Bayesian retrieval involves estimating these probabilities from the document collection and the query.
Relevance Feedback:
Relevance feedback is a technique used to iteratively improve the relevance of search results by
incorporating feedback from users about the relevance of retrieved documents. The process typically
involves the following steps:
1. Initial Retrieval:
• The IR system retrieves an initial set of documents based on the user's query.
2. User Feedback:
• The user provides feedback on the relevance of the retrieved documents. This feedback
can be binary (relevant or non-relevant) or graded (e.g., on a scale of relevance).
3. Feedback Incorporation:
• The system uses the feedback to re-rank the retrieved documents or to refine the query.
• In Bayesian retrieval, feedback can be used to update the prior probability of relevance
and the probability of observing the query given relevance.
4. Re-Retrieval:
• The system performs a new retrieval based on the updated relevance estimates or
modified query.
5. Iteration:
• The process may iterate, with the system incorporating additional feedback and refining
the retrieval process until satisfactory results are obtained.

Relevance feedback helps to bridge the gap between the user's information needs and the retrieval
system's understanding of relevance. By leveraging user feedback, the system can adapt and improve
its performance over time.
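One widely used concrete realization of this feedback loop, in the vector space model rather than the Bayesian formulation, is Rocchio query expansion: the query vector is moved toward the centroid of documents judged relevant and away from the centroid of non-relevant ones. The sketch below is a minimal illustration; the alpha, beta and gamma weights and the toy vectors are arbitrary choices for demonstration:

import numpy as np

def rocchio(query_vec, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query toward relevant documents and away from non-relevant ones
    q_new = alpha * query_vec
    if len(relevant_docs):
        q_new += beta * np.mean(relevant_docs, axis=0)
    if len(nonrelevant_docs):
        q_new -= gamma * np.mean(nonrelevant_docs, axis=0)
    return np.clip(q_new, 0, None)   # negative term weights are usually dropped

query = np.array([0.0, 1.0, 0.5])                        # toy TF-IDF query vector
relevant = np.array([[0.2, 0.9, 0.7], [0.1, 0.8, 0.6]])  # documents the user marked relevant
nonrelevant = np.array([[0.9, 0.1, 0.0]])                # documents the user marked non-relevant
print(rocchio(query, relevant, nonrelevant))

The updated query can then be re-run against the index, which is the "re-retrieval" step described above.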
Advantages and Considerations:
• Advantages:
• Bayesian retrieval provides a principled framework for estimating document relevance
based on probabilistic reasoning.
• Relevance feedback enables the system to learn from user interactions and adapt to user
preferences.
• Considerations:
• Bayesian retrieval may require accurate estimation of prior probabilities, which can be
challenging.
• Relevance feedback systems need mechanisms to ensure that user feedback is
appropriately incorporated without biasing the results.
In summary, Bayesian retrieval and relevance feedback are important techniques within the
probabilistic model of IR, offering principled approaches to estimating document relevance and
improving retrieval effectiveness based on user feedback.

CHAPTER IV: SPELLING CORRECTION IN IR SYSTEMS


Topics covered:
Challenges of spelling errors in queries and documents, Edit distance and string similarity
measures, Techniques for spelling correction in IR systems.
------------------------------------------------------------------------------------------------------------
Challenges of spelling errors in queries and documents:
Spelling errors in queries and documents can present several challenges, particularly in information
retrieval and natural language processing systems. Some of the key challenges include:
1. Reduced Search Accuracy: Spelling errors can lead to mismatches between the search terms
used by users and the terms present in documents or databases. This can result in relevant
documents being overlooked or irrelevant documents being retrieved, reducing the overall
accuracy of search results.
2. Ambiguity: Spelling errors can introduce ambiguity into search queries and documents. A
misspelled word may have multiple correct spellings or may be similar to other words with
different meanings, making it challenging for search engines to accurately interpret user intent.
3. User Frustration: Users may become frustrated if they do not find the information they are
looking for due to spelling errors. This can lead to a poor user experience and may discourage
users from using the search system in the future.
4. Automatic Correction: While some search engines and natural language processing systems
offer automatic spelling correction, accurately identifying and correcting spelling errors can be
challenging, especially for words that are misspelled in non-standard ways or are context-
dependent.
5. Context Sensitivity: Spelling errors may go unnoticed in certain contexts, especially if they
result in valid words that are unrelated to the intended search query or document content. This
can lead to false positives in search results or misinterpretation of document meaning.
6. Computational Cost: Implementing algorithms for spell checking and correction can introduce
additional computational overhead, especially for large datasets or real-time search systems
where speed is critical.
7. Domain-specific Challenges: In certain domains or technical fields, there may be specialized
terminology or jargon with unique spelling conventions or variations. Spell checkers and
correction algorithms may not always be effective in handling such domain-specific terms.

Addressing these challenges often involves a combination of techniques including spell checking, fuzzy
matching, context-aware processing, and user feedback mechanisms to continuously improve search
accuracy and user satisfaction.
-------------------------------------------------------------------------------------------------------------------
Let's consider an example scenario where spelling errors in queries and documents pose
challenges:
Scenario: A user is searching for information about "climate change" in a large database of scientific
articles and reports.
Challenge: The user enters the query "climete change" due to a typographical error.
Issues:
1. Reduced Search Accuracy: The misspelled query "climete change" may not match any
documents containing the correct term "climate change" exactly, leading to potentially relevant
documents being overlooked by the search engine.
2. Ambiguity: The misspelled term "climete" could be interpreted as a different concept
altogether, leading to confusion in understanding the user's intent.
3. User Frustration: If the search engine does not provide suggestions or correct the spelling
error, the user may become frustrated by the lack of relevant results and the perceived
inefficiency of the search system.
Solution:
1. Automatic Correction: The search engine can utilize algorithms for spell checking and
correction to suggest the term "climate change" as a correction for "climete change" before
executing the search query.
2. Fuzzy Matching: Implementing fuzzy matching techniques allows the search engine to identify
similar terms or variations of the correct term "climate change," thus increasing the likelihood
of retrieving relevant documents despite minor spelling errors.
3. User Feedback Mechanism: Providing users with the option to provide feedback on search
results helps improve the search engine's performance over time by learning from user
interactions and refining its spelling correction algorithms.
By implementing these solutions, the search engine can mitigate the impact of spelling errors in
queries and documents, improving search accuracy and enhancing the user experience.


The following steps outline the process of handling spelling errors in a search engine:
1. User Inputs Query: The user inputs a query into the search engine.
2. Spell Check: The search engine performs a spell check on the query. If the query contains a
spelling error, it proceeds to suggest a correction.
3. Did you mean: The search engine suggests a corrected query to the user based on the spell
check results. If the user accepts the correction, the search engine proceeds to search for
documents related to the corrected query.
4. Search Results: The search engine retrieves and presents relevant search results based on the
corrected query.
This flow illustrates a basic approach to handling spelling errors in a search engine, including
spell checking, suggestion, and search result presentation. More sophisticated systems may
incorporate additional steps, such as fuzzy matching algorithms and user feedback mechanisms, to
further improve search accuracy and user experience.
-------------------------------------------------------------------------------------------------------------------
In addition to the strategies mentioned earlier, there are several other considerations and
techniques relevant to addressing spelling errors in queries and documents:
1. Language Models: Advanced language models like BERT (Bidirectional Encoder
Representations from Transformers) and GPT (Generative Pre-trained Transformer) have been
trained on vast amounts of text data and can assist in understanding context even with
misspelled words. Fine-tuning these models on domain-specific data can enhance their ability to
handle spelling variations.
2. Word Embeddings: Word embedding techniques such as Word2Vec and GloVe can capture
semantic relationships between words. By leveraging word embeddings, it's possible to identify
similar words or phrases that may correspond to misspelled terms, thereby improving search
accuracy.
3. Probabilistic Models: Probabilistic models like the noisy channel model and edit distance
algorithms (e.g., Levenshtein distance) estimate the likelihood of certain spelling corrections
given the observed misspelled words. These models are widely used in spell checking and
correction systems.

4. User Behavior Analysis: Analyzing user search behavior and patterns can provide insights
into common misspellings and variations. Search logs and user feedback data can be valuable
sources of information for improving spelling correction algorithms and search relevance.
5. Domain-Specific Dictionaries: Building and maintaining domain-specific dictionaries can help
improve the accuracy of spell checking and correction, especially for technical terms and
specialized jargon that may not be present in standard dictionaries.
6. Interactive Correction Interfaces: Providing users with interactive interfaces that offer real-
time spelling suggestions as they type can help prevent spelling errors before queries are
submitted, improving the overall search experience.
7. Multilingual Considerations: In multilingual environments, spelling errors may occur due to
language-specific variations and transliteration issues. Implementing multilingual spell checking
and correction mechanisms can enhance search accuracy for diverse user demographics.
By incorporating these additional considerations and techniques, search engines and natural language
processing systems can better handle spelling errors in queries and documents, ultimately improving
search accuracy, user satisfaction, and overall system performance.

-------------------------------------------------------------------------------------------------------------------
Edit distance and string similarity measures
Edit distance and string similarity measures are fundamental concepts in computer science and natural
language processing that quantify the similarity between two strings. They are widely used in tasks
such as spell checking, fuzzy string matching, and information retrieval. Let's explore each concept:
Edit Distance:
Edit distance, also known as Levenshtein distance, measures the minimum number of single-character
edits (insertions, deletions, or substitutions) required to transform one string into another.
For example, the edit distance between "kitten" and "sitting" is 3, achieved by the following
transformations:
1. Substituting 's' for 'k'
2. Substituting 'i' for 'e'
3. Inserting 'g' at the end
The computation of edit distance is typically done using dynamic programming algorithms, such as the
Wagner-Fischer algorithm, which efficiently computes the minimum edit distance between two strings.

String Similarity Measures:


String similarity measures assess how similar two strings are based on various criteria, including
character overlap, sequence alignment, and phonetic similarity.
1. Jaccard Similarity: This measures the similarity between two sets by calculating the ratio of
the size of the intersection to the size of the union of the sets.
2. Cosine Similarity: Commonly used in information retrieval and text mining, cosine similarity
measures the cosine of the angle between two vectors in a multi-dimensional space. It's often
applied to text documents represented as term frequency-inverse document frequency (TF-IDF)
vectors.
3. Jaro-Winkler Similarity: Designed to measure the similarity between two strings, particularly
for record linkage, this method computes a similarity score based on the number of matching
characters and transpositions.
4. Hamming Distance: Hamming distance measures the number of positions at which
corresponding characters are different between two strings of equal length.
5. Dice Coefficient: This similarity measure assesses the similarity between two strings by
calculating twice the number of common characters divided by the sum of the character counts
in each string.
These measures are often used in combination with edit distance to improve accuracy in tasks such as
spell checking and fuzzy string matching.
In summary, edit distance quantifies the minimum number of edits needed to transform one string into
another, while string similarity measures evaluate the likeness between strings based on various
criteria, providing valuable tools for tasks involving string comparison and analysis.

Let's walk through an example using edit distance and string similarity measures:
Example: Suppose we have two strings: "kitten" and "sitting".
1. Edit Distance: We want to find the minimum number of single-character edits (insertions,
deletions, or substitutions) required to transform "kitten" into "sitting".
Using dynamic programming, we can compute the edit distance:


kitten
↓
sitting

The transformation involves:


• Substitute 's' for 'k'
• Substitute 'i' for 'e'
• Insert 'g' at the end
So, the edit distance between "kitten" and "sitting" is 3.

2. String Similarity Measures: Let's explore some string similarity measures between "kitten"
and "sitting".
• Jaccard Similarity: This measure compares the similarity between two sets. Let's
consider the sets of characters in each string:
• Set 1: {'k', 'i', 't', 'e', 'n'}
• Set 2: {'s', 'i', 't', 'n', 'g'} (as a set, repeated characters count only once)
The Jaccard similarity is the size of the intersection divided by the size of the union:
J(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2| = |{'i', 't', 'n'}| / |{'k', 'i', 't', 'e', 'n', 's', 'g'}| = 3/7 ≈ 0.43.
• Cosine Similarity: We represent the strings as vectors in a vector space. Each
dimension represents the frequency of a character in the string. Then, we calculate the
cosine of the angle between the vectors.
• Vector 1 ("kitten"): (1, 1, 2, 1, 1, 0, 0) (frequencies of 'k', 'i', 't', 'e', 'n', 's', 'g')
• Vector 2 ("sitting"): (0, 2, 2, 0, 1, 1, 1)
The cosine similarity is the dot product of the vectors divided by the product of their
magnitudes: 7 / (√8 × √11) ≈ 0.75.
• Jaro-Winkler Similarity: This measure takes into account the number of matching
characters and transpositions. It is more complex and involves a formula to calculate a
similarity score between two strings.
These measures provide different perspectives on the similarity between "kitten" and "sitting" based on
various criteria, including character overlap, sequence alignment, and phonetic similarity. Each
measure has its strengths and weaknesses depending on the specific task and context of the
comparison.
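The set-based measures from this example are straightforward to compute. The sketch below (illustrative only; it works on sets of distinct characters, matching the Jaccard calculation above) evaluates Jaccard and Dice similarity for the two strings:

def jaccard(s1, s2):
    # Similarity of the sets of distinct characters
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b)

def dice(s1, s2):
    # Dice coefficient over the sets of distinct characters
    a, b = set(s1), set(s2)
    return 2 * len(a & b) / (len(a) + len(b))

print(round(jaccard("kitten", "sitting"), 3))   # 3 common chars / 7 distinct chars ≈ 0.429
print(round(dice("kitten", "sitting"), 3))      # 2 * 3 / (5 + 5) = 0.6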

Let's delve deeper into the concepts of edit distance and string similarity measures:
Edit Distance:
Edit distance, also known as Levenshtein distance, is a metric used to quantify the similarity between
two strings. It measures the minimum number of single-character edits (insertions, deletions, or
substitutions) required to transform one string into another.
Calculation of Edit Distance:
The calculation of edit distance is typically done using dynamic programming algorithms, such as the
Wagner-Fischer algorithm. Here's a high-level overview of how the algorithm works:
1. Create a matrix where the rows correspond to characters of the first string and the columns
correspond to characters of the second string.
2. Initialize the first row and column with incremental values representing the number of
characters in each string.
3. Traverse the matrix row by row, filling in each cell with the minimum of the following three
operations:
• If the characters at the current positions match, take the value from the diagonal cell
(representing no edit).
• Otherwise, take the minimum of the value from the cell above (representing insertion),
the value from the cell to the left (representing deletion), and the diagonal value
(representing substitution), and add one.
4. The value in the bottom-right cell of the matrix represents the edit distance between the two
strings.
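For reference, a compact Python sketch of this matrix-filling (Wagner-Fischer) procedure is shown below; it is a minimal illustration rather than an optimized implementation:

def edit_distance(s1, s2):
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                                     # delete all i characters of s1
    for j in range(n + 1):
        dp[0][j] = j                                     # insert all j characters of s2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]              # characters match: no edit needed
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],         # deletion
                                   dp[i][j - 1],         # insertion
                                   dp[i - 1][j - 1])     # substitution
    return dp[m][n]

print(edit_distance("kitten", "sitting"))   # 3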

-------------------------------------------------------------------------------------------------------------------

Applications of Edit Distance:
• Spell Checking: Edit distance is widely used in spell checking systems to suggest corrections
for misspelled words by finding words with the shortest edit distance from the input.
• Fuzzy String Matching: It is used in applications where approximate string matching is
required, such as in search engines, database queries, and data deduplication.
• Genetic Sequencing: Edit distance can be applied to compare genetic sequences to identify
similarities and differences between DNA or protein sequences.
String Similarity Measures:
String similarity measures assess how similar two strings are based on various criteria. Here are some
commonly used string similarity measures:
Jaccard Similarity:
• Jaccard similarity measures the similarity between two sets by calculating the ratio of the size
of the intersection to the size of the union of the sets.
• It is particularly useful when dealing with sets of elements rather than sequences of characters.
• Jaccard similarity is sensitive to the presence or absence of elements in the sets but does not
consider the order of elements.
Cosine Similarity:
• Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional
space.
• It is often used in information retrieval and text mining to compare documents represented as
vectors, such as TF-IDF (Term Frequency-Inverse Document Frequency) vectors.
• Cosine similarity is not sensitive to the magnitude of vectors and is mainly concerned with the
direction.
Jaro-Winkler Similarity:
• Jaro-Winkler similarity is designed to measure the similarity between two strings, particularly
for record linkage and fuzzy matching.
• It takes into account the number of matching characters and transpositions and penalizes
differences in the initial characters.
• Jaro-Winkler similarity is particularly effective for comparing strings with minor differences,
such as typographical errors or slight variations.
Hamming Distance:
• Hamming distance measures the number of positions at which corresponding characters are
different between two strings of equal length.
• It is primarily used for strings of equal length and requires that both strings have the same
number of characters.
Dice Coefficient:
• The Dice coefficient assesses the similarity between two strings by calculating twice the number
of common characters divided by the sum of the character counts in each string.
• It is commonly used in computational linguistics and bioinformatics for comparing sequences of
characters.
These string similarity measures provide different perspectives on the likeness between strings, each
with its strengths and weaknesses depending on the specific requirements of the task at hand. They
are essential tools in natural language processing, information retrieval, and data analysis tasks where
string comparison and similarity assessment are crucial.

EDIT DISTANCE EXAMPLE:


Given two strings str1 and str2 of length M and N respectively and below operations that can be
performed on str1. Find the minimum number of edits (operations) to convert ‘str1‘ into ‘str2‘.
• Operation 1 (INSERT): Insert any character before or after any index of str1
• Operation 2 (REMOVE): Remove a character of str1
• Operation 3 (Replace): Replace a character at any index of str1 with some other character.
Note: All of the above operations are of equal cost.
Examples:
Input: str1 = “geek”, str2 = “gesek”
Output: 1
Explanation: We can convert str1 into str2 by inserting an 's' between the two consecutive 'e' characters of str1.
Input: str1 = “cat”, str2 = “cut”
Output: 1
Explanation: We can convert str1 into str2 by replacing ‘a’ with ‘u’.
Input: str1 = “sunday”, str2 = “saturday”
Output: 3
Explanation: Last three and first characters are same. We basically need to convert “un” to
“atur”. This can be done using below three operations. Replace ‘n’ with ‘r’, insert t, insert a
Illustration of Edit Distance:

Let’s suppose we have str1=”GEEXSFRGEEKKS” and str2=”GEEKSFORGEEKS”
Now to convert str1 into str2 we would require 3 minimum operations:
Operation 1: Replace 'X' with 'K'
Operation 2: Insert ‘O‘ between ‘F‘ and ‘R‘
Operation 3: Remove second last character i.e. ‘K‘

-------------------------------------------------------------------------------------------------------------------
Edit Distance using Recursion
Subproblems in Edit Distance:
The idea is to process the characters one by one, starting from either the left or the right end of both
strings.
Let us process from the right end of the strings. For every pair of characters being traversed there are
two possibilities: either they match or they don't. If the last characters of both strings match, no
operation is needed for that position, so we recursively calculate the answer for the rest of the strings.
When the last characters do not match, we can perform any of the three operations (insert, replace,
remove) to make the last characters match, recursively calculate the result for the remaining part of
the strings, and take the minimum of the three answers plus one for the operation performed.
The recursion can be summarized as follows:


When the last characters of the two strings match, make a recursive call EditDistance(M-1, N-1) to
calculate the answer for the remaining part of the strings.
When the last characters do not match, make three recursive calls as shown below:
• Insert str2[N-1] at the end of str1 : EditDistance(M, N-1)
• Replace str1[M-1] with str2[N-1] : EditDistance(M-1, N-1)
• Remove str1[M-1] : EditDistance(M-1, N)
Recurrence Relations for Edit Distance:
• Case 1: When the last characters of both strings are the same
EditDistance(str1, str2, M, N) = EditDistance(str1, str2, M-1, N-1)
• Case 2: When the last characters are different
EditDistance(str1, str2, M, N) = 1 + Minimum{ EditDistance(str1, str2, M-1, N-1),
EditDistance(str1, str2, M, N-1), EditDistance(str1, str2, M-1, N) }
Base Cases for Edit Distance:
• Case 1: When str1 becomes empty, i.e. M = 0
• return N, as N insertions are required to convert an empty string into str2 of size N
• Case 2: When str2 becomes empty, i.e. N = 0
• return M, as M removals are required to convert str1 of size M into an empty string
• Time Complexity: O(3^M) in the worst case, when none of the characters of the two strings
match, since every call branches into three further calls.
Auxiliary Space: O(1), ignoring the recursion stack.
• Edit Distance Using Dynamic Programming (Memoization):
• In the above recursive approach, there are several overlapping subproblems:
Edit_Distance(M-1, N-1) is called three times,
Edit_Distance(M-1, N-2) is called two times,
Edit_Distance(M-2, N-1) is called two times, and so on.
• So, we can use the memoization technique to store the result of each subproblem and avoid
recalculating it again and again.
• These overlapping subproblems appear repeatedly in the recursion, which is exactly what memoization exploits.

• Below is the implementation of Edit Distance Using Dynamic Programming (Memoization):
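(The code listing from the original notes is not reproduced here. The following is a minimal Python sketch of the memoized recursion described by the recurrences above; function and variable names are illustrative.)

from functools import lru_cache

def edit_distance_memo(str1, str2):
    @lru_cache(maxsize=None)               # memoization: cache each (m, n) subproblem
    def solve(m, n):
        if m == 0:                         # str1 exhausted: insert the remaining n characters
            return n
        if n == 0:                         # str2 exhausted: remove the remaining m characters
            return m
        if str1[m - 1] == str2[n - 1]:     # last characters match: no operation needed
            return solve(m - 1, n - 1)
        return 1 + min(solve(m, n - 1),      # insert
                       solve(m - 1, n - 1),  # replace
                       solve(m - 1, n))      # remove
    return solve(len(str1), len(str2))

print(edit_distance_memo("GEEXSFRGEEKKS", "GEEKSFORGEEKS"))   # 3
print(edit_distance_memo("sunday", "saturday"))               # 3

With memoization, the time complexity drops to O(M × N), at the cost of O(M × N) space for the cached results.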


-------------------------------------------------------------------------------------------------------------------
Techniques for spelling correction in IR systems:
In Information Retrieval (IR) systems, spelling correction is a crucial component to enhance the
accuracy of search results. Here are several techniques commonly used for spelling correction in IR
systems:
1. Dictionary-Based Correction: This technique involves comparing the input word against a
dictionary of correctly spelled words. If the input word is not found in the dictionary, the system
suggests corrections based on words that are similar in terms of edit distance (Levenshtein
distance), phonetic similarity (Soundex or Metaphone), or semantic similarity (using word
embeddings or lexical databases like WordNet).
2. Edit Distance Algorithms: Edit distance algorithms such as Levenshtein distance and Damerau-
Levenshtein distance are used to measure the similarity between the misspelled word and
candidate corrections. These algorithms calculate the number of insertions, deletions, or
substitutions (and, for Damerau-Levenshtein, transpositions) required to transform one word
into another; set-based measures such as Jaccard similarity are sometimes used alongside them.
3. Phonetic Matching: Phonetically similar words are considered as potential corrections for
misspelled words. Techniques like Soundex or Metaphone encode words into phonetic
representations, allowing the system to find words that sound similar to the misspelled word.
4. N-gram Language Models: N-gram language models capture the statistical relationships
between words in a corpus. By analyzing the probability of word sequences, these models can
suggest corrections based on the likelihood of certain words occurring together.
5. Statistical Language Models: Statistical language models, such as Hidden Markov Models
(HMMs) or Recurrent Neural Networks (RNNs), can be trained to predict the likelihood of a word
given its context. These models can be used to suggest corrections by considering the context
of the misspelled word within the query.
6. Contextual Correction: Contextual information from the query or surrounding words can be
used to improve spelling correction accuracy. Techniques like context-aware spell checking
leverage the context of the query to suggest corrections that are contextually relevant.
7. User Feedback and Learning: IR systems can learn from user feedback to improve spelling
correction over time. User interactions, such as clicks on search results or manual corrections,
can be used to refine the spelling correction algorithms and adapt them to user preferences.
8. Hybrid Approaches: Combining multiple techniques mentioned above into a hybrid approach
often yields better results. For example, a system may use dictionary-based correction as a first
pass to identify potential corrections and then use language models to rank and select the most
appropriate correction based on contextual information.

By employing these techniques, IR systems can effectively correct spelling errors and enhance the
accuracy of search results, improving the overall user experience.
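As a simple illustration of dictionary-based correction, Python's standard-library difflib can propose close matches from a vocabulary (it ranks candidates by a sequence-similarity ratio rather than pure edit distance, but the effect is similar for short words). The vocabulary below is a toy example, not a real term dictionary:

import difflib

# Toy vocabulary; in a real IR system this would come from the index's term dictionary
vocabulary = ["climate", "change", "client", "climb", "chance", "regulation", "expression"]

def suggest(word, vocab, n=3, cutoff=0.6):
    # Return up to n candidate corrections ranked by string similarity
    return difflib.get_close_matches(word, vocab, n=n, cutoff=cutoff)

print(suggest("climete", vocabulary))    # 'climate' is the top suggestion
print(suggest("expretion", vocabulary))  # 'expression' ranked first; weaker candidates may follow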

In addition to the techniques mentioned earlier, there are a few more advanced approaches and
considerations for spelling correction in IR systems:

1. Domain-Specific Correction: Sometimes, IR systems operate within specific domains where
the vocabulary and language conventions may differ from general language use. Building
domain-specific dictionaries and language models can improve the accuracy of spelling
correction within those domains.
2. Error Correction Cascades: Error correction cascades involve applying multiple correction
steps sequentially, starting with simpler techniques and progressing to more complex ones if
needed. For example, the system may first attempt dictionary-based correction and then apply
more sophisticated techniques if the initial correction is unsuccessful.
3. Fuzzy Matching with Term Frequency-Inverse Document Frequency (TF-IDF): TF-IDF is
a statistical measure used to evaluate the importance of a word in a document relative to a
collection of documents. Fuzzy matching techniques can be combined with TF-IDF scores to
prioritize candidate corrections based on their relevance to the overall document collection.
4. Spell Checking with External Knowledge Sources: Some systems integrate external
knowledge sources such as Wikipedia, Wiktionary, or domain-specific ontologies to improve
spelling correction accuracy. These knowledge sources can provide additional context and
information that may not be available within the system's internal dictionaries or language
models.
5. Customization and Fine-Tuning: Users may have specific preferences or requirements for
spelling correction. Providing customization options, such as adjustable tolerance levels for edit
distances or the ability to add custom dictionaries or ignore certain words, can enhance user
satisfaction and adaptability to diverse contexts.
6. Real-Time Correction and Performance Optimization: In real-time applications, such as
web search engines or chatbots, efficient implementation of spelling correction algorithms is
essential to maintain low latency and responsiveness. Techniques like indexing precomputed
corrections or using approximate data structures (e.g., Bloom filters) can optimize performance
without sacrificing accuracy.
7. Multilingual Spelling Correction: For IR systems that support multiple languages,
implementing multilingual spelling correction requires handling language-specific characteristics,
such as different character sets, morphology, and linguistic rules. Techniques like language
identification and language-specific correction models are essential for accurate multilingual
spelling correction.

By incorporating these advanced techniques and considerations, IR systems can achieve robust and
efficient spelling correction capabilities across diverse contexts and languages.

Example:
Let's consider an example scenario of spelling correction in an information retrieval system:
Suppose we have an IR system that allows users to search for documents in a large collection of
scientific papers related to biology and genetics. A user enters the query "gene expretion regulation"
into the search bar, intending to find documents about gene expression regulation.
Here's how the system might process the query using various spelling correction techniques:
1. Dictionary-Based Correction: The system first checks the query words against its dictionary
of correctly spelled words. It identifies "expretion" as a misspelled word since it's not found in
the dictionary. The system suggests corrections based on similar words like "expression."
2. Edit Distance Algorithms: Using an edit distance algorithm like Levenshtein distance, the
system calculates the distance between "expretion" and words in the dictionary. It finds that
"expression" has a low edit distance and suggests it as the correction.
3. Phonetic Matching: The system may employ a phonetic matching algorithm like Soundex or
Metaphone to find phonetically similar words. Even though "expretion" and "expression" might
not be phonetically similar, such techniques can help in other cases where phonetic similarity is
more apparent.
4. N-gram Language Models: The system analyzes the surrounding words and their frequency
in the document collection. It observes that "gene expression regulation" is a common phrase in
the corpus, and "expretion" is likely a misspelling based on the context.
5. User Feedback and Learning: If users frequently click on search results related to "gene
expression regulation" after typing "gene expretion regulation," the system learns from this
feedback and may prioritize "expression" as the correct spelling in future corrections.
In this example, the system applies a combination of techniques to identify and correct the spelling
error in the user query, ultimately improving the relevance and accuracy of search results returned to
the user.
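To make the edit-distance step concrete, here is a minimal Python sketch (illustrative, not part of the original system) of dictionary-based candidate generation using Levenshtein distance; the tiny dictionary and the misspelling "expretion" are assumptions made only for this example:

def levenshtein(a, b):
    """Compute the Levenshtein (edit) distance between two strings with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

# Toy dictionary; a real system would use the indexed vocabulary.
dictionary = ["expression", "excretion", "expedition", "regulation", "gene"]

def correct(word, max_dist=2):
    """Return dictionary words within max_dist edits of the input, closest first."""
    scored = sorted((levenshtein(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_dist]

print(correct("expretion"))  # ['excretion', 'expression']

Note that pure edit distance alone would actually prefer "excretion" (one edit) over "expression" (two edits) for this query, which is exactly why real systems combine edit distance with language-model context, as described in the example above.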

-------------------------------------------------------------------------------------------------------------------

Types of spelling errors:
▪ Non-word errors
  ▪ graffe → giraffe
▪ Real-word errors
  ▪ Typographical errors
    ▪ three → there
  ▪ Cognitive errors (homophones)
    ▪ piece → peace, too → two, your → you’re
▪ Non-word correction was historically mainly context insensitive
▪ Real-word correction almost always needs to be context sensitive

Non-word spelling errors:


▪ Non-word spelling error detection:
  ▪ Any word not in a dictionary is an error
  ▪ The larger the dictionary the better … up to a point
  ▪ (The Web is full of mis-spellings, so the Web isn’t necessarily a great dictionary …)
▪ Non-word spelling error correction:
  ▪ Generate candidates: real words that are similar to the error
  ▪ Choose the one which is best:
    ▪ Shortest weighted edit distance
    ▪ Highest noisy channel probability

Real-word & non-word spelling errors:

▪ For each word w, generate a candidate set:
  ▪ Find candidate words with similar pronunciations
  ▪ Find candidate words with similar spellings
  ▪ Include w in the candidate set
▪ Choose the best candidate:
  ▪ Noisy channel view of spelling errors
  ▪ Context-sensitive – so we have to consider whether the surrounding words “make sense”
    ▪ Flying form Heathrow to LAX → Flying from Heathrow to LAX

Terminology:
▪ Character bigrams: st, pr, an, …
▪ Word bigrams: palo alto, flying from, road repairs
▪ Here we generally deal with word bigrams, although character n-grams are also widely used in spelling correction
▪ Similarly we can define trigrams and, more generally, k-grams

Independent-word spelling correction


Noisy Channel = Bayes’ Rule

▪ We see an observation x of a misspelled word
▪ Find the correct word ŵ, i.e. the word that maximises the posterior probability:
  ŵ = argmax_w P(w | x) = argmax_w P(x | w) P(w) / P(x) = argmax_w P(x | w) P(w)
▪ Here P(x | w) is the channel (error) model, P(w) is the language model, and P(x) can be dropped because it is the same for every candidate w.

-------------------------------------------------------------------------------------------------------------------
CHAPTER V: PERFORMANCE EVALUATION
Topics Covered: Evaluation metrics: precision, recall, F-measure, average precision, Test collections
and relevance judgments, Experimental design and significance testing
-------------------------------------------------------------------------------------------------------------------
Precision measures the relevance of the retrieved documents/items among all the retrieved ones. It
helps answer the question: "Of all the items retrieved, how many are relevant?"
Formula:
Precision = (Number of relevant documents retrieved) / (Total number of documents retrieved)
Precision focuses on the accuracy of the retrieval system. A high precision indicates that a large
proportion of the retrieved documents are relevant to the user's query.
Recall:
Recall measures the completeness of the retrieval system by quantifying the proportion of relevant
documents retrieved out of all the relevant documents available. It answers the question: "Of all the
relevant items available, how many were retrieved?"
Formula:
Recall = (Number of relevant documents retrieved) / (Total number of relevant documents in the collection)
Recall is a measure of how well the system manages to retrieve all the relevant documents from the
collection. A high recall value means the system is effectively retrieving a large portion of the relevant
documents.
Relationship between Precision and Recall:
• Trade-off: Precision and recall often exhibit a trade-off relationship. Improving precision may
result in lower recall and vice versa.
• Balancing Act: IR systems often strive to strike a balance between precision and recall,
depending on the application and user requirements.
• Thresholds: Sometimes, systems may allow users to set thresholds for precision or recall,
depending on their preferences. For example, in a web search engine, users may prefer high
precision for specific queries while accepting lower recall for others.
Evaluation and Interpretation:
• F1 Score: F1 score is the harmonic mean of precision and recall, providing a single metric that
balances both aspects. It's calculated as:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
• Interpretation: While precision and recall are informative individually, they are often
considered together to give a more comprehensive understanding of the system's performance.
• User-Centric: The interpretation of precision and recall can vary based on user needs. For
instance, in a medical information retrieval system, high recall may be more critical to ensure
all relevant research papers are retrieved, even if it means a few irrelevant ones slip through.
In summary, precision and recall are crucial metrics in evaluating the effectiveness of IR systems. They
provide insights into how well the system retrieves relevant information and how accurately it filters
out irrelevant content, ultimately contributing to the overall user satisfaction and utility of the system.
-------------------------------------------------------------------------------------------------------------------
Recall and precision are two important metrics used to evaluate the performance of systems,
particularly in the context of information retrieval, search engines, and machine learning classifiers.
While both metrics measure aspects of a system's effectiveness, they focus on different aspects of
performance:
1. Recall:
• Recall, also known as sensitivity or true positive rate, measures the ability of a system to
retrieve all relevant items from the total pool of relevant items.
• It answers the question: "Of all the relevant items that exist, how many did the system
retrieve?"
• Mathematically, recall is calculated as the ratio of the number of true positive results to
the total number of relevant items:
Recall = True Positives / (True Positives + False Negatives)

• A high recall value indicates that the system is successfully retrieving a large proportion
of the relevant items, even if it may also retrieve some irrelevant ones.
2. Precision:
• Precision measures the proportion of retrieved items that are relevant among all the
retrieved items.
• It answers the question: "Of all the items retrieved by the system, how many are
relevant?"
• Mathematically, precision is calculated as the ratio of true positive results to the total
number of retrieved items:
Precision = True Positives / (True Positives + False Positives)
• A high precision value indicates that a large proportion of the retrieved items are
relevant to the user's query, minimizing the presence of irrelevant results.
In summary, recall focuses on the system's ability to capture all relevant items, regardless of how
many irrelevant items are retrieved along with them, while precision focuses on the system's ability to
retrieve relevant items accurately, minimizing the inclusion of irrelevant items in the results. These
metrics are often used together to provide a comprehensive evaluation of system performance,
particularly in tasks such as information retrieval, document classification, and search engine
evaluation.
-------------------------------------------------------------------------------------------------------------------

1. F-measure:
The F-measure, also known as the F1 score, is a single metric that combines precision and recall into
one value. It's particularly useful when you want to balance both precision and recall.
Formula:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The F1 score ranges from 0 to 1, where 1 indicates perfect precision and recall, while 0 indicates poor
performance in either precision or recall.
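As a small illustration (assuming hypothetical document IDs for a single query), precision, recall, and the F1 score can be computed as follows:

def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 for one query from sets of document IDs."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Hypothetical document IDs for one query.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5"}
print(precision_recall_f1(retrieved, relevant))  # (0.5, 0.666..., 0.571...)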

2. Average Precision:
Average precision is a metric used to evaluate the performance of an IR system, especially in ranked
retrieval scenarios. It calculates the average precision at each relevant document retrieved.

Calculation:
1. For each relevant document in the retrieved list, calculate the precision at that point.
2. Average all the precisions across the relevant documents.
Average precision is particularly useful in scenarios where the system returns a ranked list of
documents, such as web search engines. It provides a measure of how well the system ranks relevant
documents.
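A minimal sketch of this calculation, assuming a ranked list in which we know which positions hold relevant documents (the average is taken over the relevant documents retrieved, i.e. assuming all relevant documents appear somewhere in the list):

def average_precision(ranked_relevance):
    """Average precision for one ranked result list.
    ranked_relevance[i] is 1 if the document at rank i+1 is relevant, else 0."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / hits if hits else 0.0

# Relevant documents retrieved at ranks 1, 3 and 5:
print(average_precision([1, 0, 1, 0, 1]))  # (1/1 + 2/3 + 3/5) / 3 ≈ 0.756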

3. Test Collections:
Test collections are curated datasets used to evaluate the performance of IR systems. These collections
typically contain:
• A set of queries: These are representative search queries that users might enter into the
system.
• A set of documents: A collection of documents that the system can retrieve results from.
• Relevance judgments: Annotations indicating which documents in the collection are relevant for
each query.
Test collections provide a standardized way to evaluate the effectiveness of IR systems across different
algorithms and approaches.

4. Relevance Judgments:
Relevance judgments are annotations that indicate the relevance of documents to specific queries
within a test collection. They are typically provided by human assessors who evaluate the documents
based on predefined relevance criteria.
Relevance judgments can be binary (relevant or non-relevant) or graded (with degrees of relevance).
They serve as ground truth labels against which the performance of IR systems is evaluated.
In summary, F-measure, average precision, test collections, and relevance judgments are essential
components of the evaluation process in Information Retrieval. They help measure the effectiveness
and performance of IR systems, providing insights into their precision, recall, ranking capabilities, and
overall retrieval quality.
-------------------------------------------------------------------------------------------------------------------
Let's provide examples for each of the concepts discussed in the context of Information Retrieval (IR):
1. F-measure:
Suppose we have an IR system designed to retrieve relevant documents for a set of queries. After
running the system, we evaluate its performance using precision and recall. Let's say the precision is
0.75 and the recall is 0.80. We can calculate the F1 score as follows:
F1 = 2 × (0.75 × 0.80) / (0.75 + 0.80) = 1.20 / 1.55 ≈ 0.774
So, the F1 score for the IR system is 0.774.

2. Average Precision:
Imagine we have a test collection consisting of 10 queries and corresponding relevant documents.
After running our IR system for each query, we obtain ranked lists of retrieved documents. We then
calculate precision at each relevant document position and average them across all queries.
For example, if the precision at relevant documents for query 1 is 0.8, query 2 is 0.6, and so forth, we
sum up these values and divide by the total number of queries to obtain the average precision.

3. Test Collections:
A typical test collection could be the TREC (Text REtrieval Conference) collection, which contains
various datasets for evaluating IR systems. It includes sets of queries, relevant documents, and
relevance judgments provided by human assessors.
4. Relevance Judgments:
Suppose we have a test collection with a query "information retrieval techniques" and a set of
documents retrieved by our IR system. Human assessors review each document and label them as
relevant or non-relevant based on the query's context and relevance criteria.
For instance, if a document provides a comprehensive overview of information retrieval techniques, it
may be labeled as relevant, while a document discussing unrelated topics might be labeled as non-
relevant.
In summary, these examples illustrate how F-measure, average precision, test collections, and
relevance judgments are used in evaluating the performance of Information Retrieval systems,
providing insights into their effectiveness and accuracy.
-------------------------------------------------------------------------------------------------------------------
Experimental design and significance testing: are crucial aspects of evaluating the performance of
Information Retrieval (IR) systems. They help researchers and practitioners understand how well a
system performs and whether observed differences are statistically significant.

Experimental Design:
Experimental design refers to the process of planning and conducting experiments to evaluate the
performance of IR systems. It involves several key steps:
1. Formulating Research Questions: Clearly define the research questions and objectives of the
study. What aspects of the IR system do you want to evaluate? What hypotheses are you
testing?
2. Selection of Evaluation Measures: Choose appropriate evaluation measures based on the
research questions and the specific task of the IR system. Common measures include precision,
recall, F1 score, mean average precision (MAP), normalized discounted cumulative gain (nDCG),
etc.
3. Selection of Test Collections: Choose suitable test collections that reflect the characteristics
of the real-world data and the tasks the IR system is designed for. Test collections should
include queries, relevant documents, and relevance judgments.
4. Experimental Setup: Define the experimental setup, including the selection of baseline
methods, parameter settings, preprocessing techniques, and experimental protocols.
5. Cross-Validation and Replication: Use techniques like cross-validation to ensure the
robustness and generalizability of the results. Replicate experiments with different datasets and
settings to validate the findings.
6. Controlled Variables: Control for variables that could impact the results, such as hardware
configurations, indexing techniques, retrieval algorithms, and user interfaces.
------------------------------------------------------------------------------------------------------------------
Significance Testing:
Significance testing is used to determine whether observed differences or effects in experimental data
are statistically significant or simply due to chance. It helps researchers make inferences about the
population based on sample data.
1. Hypothesis Formulation: Formulate null and alternative hypotheses based on the research
questions. The null hypothesis typically assumes that there is no significant difference between
groups or conditions, while the alternative hypothesis suggests otherwise.
2. Selection of Statistical Test: Choose an appropriate statistical test based on the research
design, data distribution, and nature of the variables being analyzed. Common tests include t-
tests, chi-square tests, ANOVA, Mann-Whitney U test, etc.
3. Calculation of P-value: Perform the statistical test and calculate the p-value, which
represents the probability of observing the data or more extreme results under the assumption
that the null hypothesis is true.
4. Interpretation of Results: Compare the obtained p-value with the significance level (alpha),
typically set at 0.05. If the p-value is less than alpha, the null hypothesis is rejected, indicating
that the observed difference is statistically significant.
5. Effect Size: Consider the effect size in addition to statistical significance to assess the practical
importance or magnitude of the observed differences.
6. Multiple Comparisons: Adjust for multiple comparisons if conducting multiple tests to control
the family-wise error rate or false discovery rate.
In summary, experimental design and significance testing play pivotal roles in the evaluation and
validation of IR systems. They provide a structured framework for conducting experiments, interpreting
results, and drawing meaningful conclusions about system performance and effectiveness.
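As an illustration, a paired t-test can compare the per-query average precision of two systems evaluated on the same queries; the scores below are made-up values used only for this sketch:

from scipy import stats

# Hypothetical per-query average precision for two systems on the same 8 queries.
system_a = [0.62, 0.48, 0.71, 0.55, 0.60, 0.43, 0.68, 0.52]
system_b = [0.58, 0.45, 0.66, 0.57, 0.52, 0.40, 0.61, 0.50]

# Paired t-test: the same queries are evaluated under both systems.
t_stat, p_value = stats.ttest_rel(system_a, system_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# Reject the null hypothesis (no difference) if p < 0.05.
if p_value < 0.05:
    print("The difference between the systems is statistically significant.")
else:
    print("No statistically significant difference was detected.")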

In addition to experimental design and significance testing, there are several other important
considerations in Information Retrieval (IR) research and evaluation:
User-Centric Evaluation:
• User Studies: Conduct user studies and user feedback sessions to understand user
preferences, behaviors, and satisfaction with the IR system.

• User Experience (UX) Evaluation: Evaluate the overall user experience, including ease of
use, system responsiveness, interface design, and relevance of retrieved results.

Contextual Evaluation:
• Task-Based Evaluation: Evaluate the IR system in the context of specific tasks or user
scenarios to assess its practical utility and effectiveness.
• Domain-Specific Evaluation: Consider the unique characteristics and requirements of
different application domains when designing experiments and evaluating system performance.

Error Analysis:
• Error Analysis: Conduct thorough error analysis to identify common sources of errors, such as
false positives, false negatives, and misclassifications.
• Root Cause Analysis: Investigate the underlying reasons behind errors and explore strategies
to mitigate them.

Bias and Fairness:


• Bias Detection and Mitigation: Identify and address potential biases in the IR system,
including algorithmic biases, dataset biases, and user biases.
• Fairness Evaluation: Assess the fairness of the IR system across different demographic
groups and user profiles to ensure equitable access to information.

Longitudinal Studies:
• Long-Term Evaluation: Conduct longitudinal studies to assess the stability and performance
of the IR system over time, considering factors such as system drift, user dynamics, and
evolving information needs.

Multi-Modal Evaluation:
• Multi-Modal IR: Evaluate IR systems that support multiple modalities, such as text, image,
audio, and video, considering the unique challenges and evaluation metrics associated with
each modality.

Reproducibility and Transparency:


• Reproducibility: Ensure that experimental procedures, datasets, and evaluation
methodologies are well-documented and reproducible to facilitate independent verification and
comparison.

• Open Science Practices: Embrace open science principles by sharing code, datasets, and
research findings openly to foster collaboration and transparency in the IR community.
By considering these additional factors and best practices, researchers and practitioners can conduct
more comprehensive and rigorous evaluations of IR systems, leading to more meaningful insights and
advancements in the field.
-------------------------------------------------------------------------------------------------------------------
CHAPTER VI: TEXT CATEGORIZATION AND FILTERING

Topics covered: Text classification algorithms: Naive Bayes, Support Vector Machines, Feature
selection and dimensionality reduction, Applications of text categorization and filtering

In Information Retrieval (IR), text classification algorithms are essential for tasks such as document
categorization, sentiment analysis, spam detection, and more. Here's an overview of some common
algorithms and techniques used in text classification within IR:
1. Naive Bayes (NB):
• Naive Bayes classifiers are based on Bayes' theorem and assume that features are
conditionally independent given the class label.
• In text classification, Naive Bayes is often used due to its simplicity, efficiency, and
effectiveness, especially with large feature spaces.
• It works well with text data and can handle high-dimensional feature spaces efficiently.

• Despite its simplicity and the independence assumption, Naive Bayes often performs
surprisingly well in practice for text classification tasks.
2. Support Vector Machines (SVM):
• SVM is a supervised learning algorithm used for classification tasks.
• In text classification, SVM aims to find the hyperplane that best separates the data
points into different classes.
• SVMs are effective in high-dimensional spaces and can handle large feature sets, making
them suitable for text classification where the feature space can be expansive.
• They are known for their ability to find the optimal margin of separation between
classes, which can lead to good generalization performance.
3. Feature Selection and Dimensionality Reduction:
• Feature selection techniques aim to identify the most informative features while reducing
the dimensionality of the feature space.
• In text classification, feature selection methods can include techniques like mutual
information, information gain, chi-square test, and others to select relevant features.
• Dimensionality reduction techniques like Principal Component Analysis (PCA) and Latent
Semantic Analysis (LSA) can also be used to reduce the dimensionality of the feature
space while preserving important information.
• These techniques are crucial for improving the efficiency of classification algorithms,
reducing overfitting, and improving generalization performance.
In Information Retrieval, the effectiveness of these algorithms and techniques can vary depending on
the specific task, the characteristics of the dataset, and the choice of parameters. Experimentation and
empirical evaluation are often necessary to determine the most suitable approach for a particular text
classification problem in IR.
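To make the feature selection step concrete, here is a minimal sketch of chi-square feature selection with scikit-learn on a toy labelled corpus (the documents, labels, and the choice of k are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Tiny illustrative corpus with binary labels (1 = sports, 0 = politics).
docs = [
    "the team won the football match",
    "parliament passed the new budget bill",
    "the striker scored a late goal",
    "the minister announced election reforms",
]
labels = [1, 0, 1, 0]

# Term-frequency features.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Keep the 5 terms with the highest chi-square scores with respect to the labels.
selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X, labels)

selected_terms = vectorizer.get_feature_names_out()[selector.get_support()]
print(selected_terms)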

Let's break down each component:


1. Naive Bayes Classifier:
Naive Bayes is a probabilistic classifier based on Bayes' theorem with the "naive" assumption of
independence between features. Despite its simplicity, Naive Bayes classifiers often perform well in
text classification tasks.
Example: Spam Email Classification Suppose you have a dataset of emails labeled as spam or not
spam. Each email is represented by a set of features (words or phrases) along with their frequencies.
The Naive Bayes classifier calculates the probability that an email belongs to a certain class (spam or
not spam) given the presence of certain words. It then assigns the email to the class with the highest
probability.
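A minimal scikit-learn sketch of this idea, using a toy set of emails (the data is illustrative only):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data; a real spam filter would be trained on thousands of emails.
emails = [
    "win a free prize now", "limited offer claim your reward",
    "meeting agenda for monday", "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts feed the multinomial Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["claim your free reward now"]))     # likely ['spam']
print(model.predict(["agenda for the review meeting"]))  # likely ['ham']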
2. Support Vector Machines (SVM):
SVM is a powerful supervised machine learning algorithm used for classification tasks. It works by
finding the hyperplane that best separates the classes in the feature space.
Example: Text Sentiment Analysis Consider a dataset of movie reviews labeled as positive or
negative. Each review is represented as a vector in a high-dimensional space, where each dimension
corresponds to a feature (e.g., frequency of words). SVM tries to find the hyperplane that maximizes
the margin between the positive and negative reviews.
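A comparable sketch for sentiment classification with a linear SVM over TF-IDF features (again on toy data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy movie-review data; real sentiment models use far larger labelled corpora.
reviews = [
    "a brilliant and moving film", "wonderful acting and a great story",
    "a dull and boring movie", "terrible plot and weak characters",
]
sentiment = ["positive", "positive", "negative", "negative"]

# TF-IDF vectors are separated by a linear SVM hyperplane.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(reviews, sentiment)

print(classifier.predict(["a great and moving story"]))  # likely ['positive']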
3. Feature Selection and Dimensionality Reduction:
In information retrieval (IR), feature selection and dimensionality reduction techniques are often used
to improve classification performance and reduce computational complexity.
Example: Term Frequency-Inverse Document Frequency (TF-IDF) TF-IDF is a feature selection
technique commonly used in text classification tasks. It measures the importance of a term in a
document relative to a corpus. By weighting terms based on their frequency in the document and
inverse frequency across the corpus, TF-IDF reduces the importance of common terms and emphasizes
rare ones.
Example: Principal Component Analysis (PCA) PCA is a dimensionality reduction technique used to
reduce the number of features while preserving most of the variance in the data. In text classification,
PCA can be applied to reduce the dimensionality of the feature space, which can help improve
classification performance and reduce computational cost.
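Since TF-IDF matrices are sparse, LSA-style reduction is usually implemented with truncated SVD rather than classical PCA; here is a minimal sketch (toy documents and an assumed number of components):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "gene expression regulation in cells",
    "regulation of gene transcription",
    "stock market regulation and policy",
    "financial markets and monetary policy",
]

# High-dimensional sparse TF-IDF matrix.
tfidf = TfidfVectorizer().fit_transform(docs)

# Project onto 2 latent dimensions (LSA-style reduction).
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(reduced.shape)                  # (4, 2)
print(svd.explained_variance_ratio_)  # variance captured by each component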
In summary, these algorithms and techniques play crucial roles in text classification and information
retrieval tasks by enabling the effective analysis and classification of textual data.

-------------------------------------------------------------------------------------------------------------------
Applications of text categorization and filtering
Text categorization and filtering have numerous applications across various domains. Here are some
prominent examples:
1. Email Spam Filtering:

One of the most well-known applications of text categorization and filtering is in email spam detection.
Algorithms categorize incoming emails as either spam or legitimate based on various features such as
the content of the email, sender information, and metadata.
2. News Article Categorization:
Text categorization is used to automatically categorize news articles into predefined categories such as
politics, sports, technology, and entertainment. This helps news organizations organize and present
content to their readers more effectively.
3. Sentiment Analysis:
Sentiment analysis involves categorizing text into different sentiment classes such as positive,
negative, or neutral. It is widely used in social media monitoring, product reviews, and customer
feedback analysis to understand public opinion and sentiment towards products, brands, or events.
4. Document Classification in Libraries and Archives:
Libraries and archives often use text categorization techniques to classify and organize documents,
manuscripts, and books based on their subject matter, genre, or author. This facilitates efficient
document retrieval and management.
5. Legal Document Classification:
In the legal domain, text categorization is used to classify legal documents such as court cases,
contracts, and statutes. This enables law firms and legal departments to organize and retrieve legal
information more efficiently.
6. Medical Text Classification:
In healthcare, text categorization is used to classify medical documents such as patient records,
medical literature, and clinical trial reports. It helps healthcare professionals extract relevant
information, identify patterns, and make informed decisions.
7. Customer Support and Ticket Routing:
Text categorization is used in customer support systems to automatically route incoming support
tickets to the appropriate department or support agent based on the nature of the inquiry or issue
raised by the customer.
8. Social Media Content Moderation:
Social media platforms use text categorization algorithms to automatically detect and filter out
inappropriate or offensive content, such as hate speech, harassment, or graphic images, in order to
maintain a safe and welcoming online environment.
9. Fraud Detection:
Text categorization techniques are employed in fraud detection systems to classify text-based data
such as transaction descriptions, account activity logs, and customer communications. This helps
financial institutions and e-commerce platforms identify and prevent fraudulent activities.
10. Search Engine Result Categorization:
Search engines use text categorization algorithms to categorize and rank search results based on
relevance to the user query, user preferences, and other factors. This helps users find the most
relevant and useful information quickly and easily.
These are just a few examples of the wide-ranging applications of text categorization and filtering
across different industries and domains. The ability to automatically analyze and categorize textual
data enables organizations to extract valuable insights, improve decision-making processes, and
enhance user experiences.
Here are some specific examples of text categorization and filtering applications:
1. Email Spam Filtering:
• Example: Gmail's spam filter automatically categorizes incoming emails as spam or not spam
based on various features such as email content, sender reputation, and user feedback.
2. News Article Categorization:
• Example: The New York Times categorizes its news articles into sections such as World, Politics,
Business, Technology, and Sports, allowing readers to navigate and explore content more
effectively.
3. Sentiment Analysis:
• Example: Twitter sentiment analysis tools classify tweets as positive, negative, or neutral
sentiments to gauge public opinion on various topics, events, or brands.
4. Document Classification in Libraries and Archives:
• Example: The Library of Congress categorizes and organizes its vast collection of books,
manuscripts, and documents into different subject classifications, making it easier for
researchers to access relevant information.
5. Legal Document Classification:
• Example: Legal research platforms like LexisNexis use text categorization algorithms to classify
and organize legal documents such as court cases, statutes, and regulations by jurisdiction,
topic, and legal issue.
6. Medical Text Classification:

• Example: Healthcare providers use text classification algorithms to categorize electronic health
records (EHRs) and medical literature into different medical specialties, diseases, and
treatments for clinical decision support and research purposes.
7. Customer Support and Ticket Routing:
• Example: Zendesk and other customer support platforms automatically categorize and route
incoming support tickets to the appropriate support team based on the nature of the customer
inquiry or issue.
8. Social Media Content Moderation:
• Example: Facebook employs text categorization algorithms to automatically detect and remove
content that violates its community standards, such as hate speech, harassment, and graphic
violence.
9. Fraud Detection:
• Example: Banks and financial institutions use text categorization techniques to analyze
transaction descriptions and customer communications to detect fraudulent activities such as
phishing attempts, identity theft, and money laundering.
10. Search Engine Result Categorization:
• Example: Google categorizes and ranks search results based on relevance to the user query,
user location, search history, and other factors, helping users find the most relevant information
quickly.
These examples illustrate how text categorization and filtering technologies are applied in various
domains to automate tasks, extract insights, and enhance user experiences.

Here are a few more examples of text categorization and filtering applications:
11. Product Review Analysis:
• Example: E-commerce platforms like Amazon categorize and analyze product reviews to provide
insights to manufacturers and other consumers about the quality, features, and satisfaction
levels associated with different products.
12. Content Recommendation Systems:
• Example: Streaming services like Netflix and Spotify use text categorization algorithms to
analyze user preferences and behavior, categorize content based on genres, themes, and
attributes, and recommend personalized movies, shows, and music playlists to users.
13. Content Tagging and Metadata Management:
• Example: Content management systems and digital asset management platforms use text
categorization to automatically tag and classify digital assets such as images, videos, and
documents, making it easier to search, organize, and retrieve content.
14. Online Advertising Targeting:
• Example: Ad networks and digital marketing platforms analyze website content and user
behavior to categorize web pages and target advertisements based on user interests,
demographics, and preferences.
15. Language Identification and Translation:
• Example: Language identification algorithms classify text into different languages, enabling
multilingual search engines, translation services, and global communication platforms to
accurately detect and translate text across language barriers.
16. Knowledge Base Construction and Ontology Development:
• Example: Text categorization algorithms are used to extract and categorize information from
unstructured text sources such as websites, documents, and articles, enabling the construction
of knowledge bases and the development of ontologies for knowledge representation and
semantic web applications.
17. Event Detection and Trend Analysis:
• Example: Social media monitoring tools analyze text data from social media platforms to detect
and categorize events, trends, and discussions in real-time, helping organizations and
governments track public opinion, emerging issues, and crisis events.
These additional examples demonstrate the diverse range of applications and industries where text
categorization and filtering techniques are utilized to automate processes, extract insights, and
enhance decision-making capabilities.

CHAPTER VII: TEXT CLUSTERING FOR INFORMATION RETRIEVAL
Topics covered: Clustering techniques: K-means, hierarchical clustering, Evaluation of clustering
results, Clustering for query expansion and result grouping
------------------------------------------------------------------------------------------------------------------
Clustering techniques: K-means, hierarchical clustering
Clustering techniques, such as K-means and hierarchical clustering, are widely used in Information
Retrieval (IR) for organizing and analyzing large collections of documents or data. Here's an overview
of how K-means and hierarchical clustering are applied in IR:
K-means Clustering:
K-means clustering is an iterative algorithm that partitions data into K clusters based on their features
or attributes. In IR, K-means clustering can be used to group documents or data points with similar
characteristics together. Here's how it works in IR:
1. Document Representation: Documents are represented as feature vectors, often using
techniques like TF-IDF (Term Frequency-Inverse Document Frequency) to represent the
importance of words in the document corpus.
2. Cluster Initialization: The algorithm starts by randomly selecting K initial cluster centroids.
3. Assignment Step: Each document is assigned to the cluster whose centroid is closest to it
based on some distance metric, such as Euclidean distance or cosine similarity.
4. Update Step: After all documents are assigned to clusters, the centroids of the clusters are
recalculated based on the mean of the documents in each cluster.
5. Iterations: Steps 3 and 4 are repeated iteratively until the centroids stabilize or a convergence
criterion is met.
Application in IR: K-means clustering can be applied in IR for document organization, topic
modeling, and document retrieval. For example, it can be used to automatically group similar
documents together, which can aid in exploratory data analysis or in organizing search results.
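A minimal sketch of these steps with scikit-learn (toy documents; note that scikit-learn's KMeans uses Euclidean distance over the TF-IDF vectors, whereas cosine similarity is the other common choice mentioned above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy document collection; a real system would cluster thousands of documents.
docs = [
    "the striker scored in the final match",
    "the team won the league title",
    "parliament debated the new tax bill",
    "the minister announced election reforms",
]

# Step 1: TF-IDF document representation.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Steps 2-5: initialise K centroids, assign documents, update centroids, iterate.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)  # e.g. [0 0 1 1] -- one sports cluster and one politics cluster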
Hierarchical Clustering:
Hierarchical clustering is a method of cluster analysis that builds a hierarchy of clusters. In IR,
hierarchical clustering can be used to create a tree-like structure of clusters, where similar documents
are grouped together at different levels of granularity. Here's how it works in IR:
1. Document Representation: Similar to K-means, documents are represented as feature
vectors.
2. Initialization: Each document is initially considered as a single-cluster.
3. Merge Step: At each iteration, the two most similar clusters are merged into a single cluster
based on a similarity measure.
4. Hierarchy Formation: This process continues until all documents are part of a single cluster or
until a stopping criterion is met, resulting in a hierarchical clustering structure.
Application in IR: Hierarchical clustering can be useful for organizing documents into a hierarchical
taxonomy or for exploring the relationships between documents at different levels of abstraction. It can
also aid in navigation and browsing within document collections by providing a hierarchical organization
of topics or themes.
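A minimal sketch of agglomerative clustering over TF-IDF vectors using SciPy (toy documents and assumed parameters):

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

docs = [
    "football team wins championship",
    "basketball player breaks scoring record",
    "new smartphone released this week",
    "chip maker unveils faster processor",
]

# Dense TF-IDF vectors (SciPy's linkage expects a dense array).
X = TfidfVectorizer().fit_transform(docs).toarray()

# Agglomerative clustering: repeatedly merge the two most similar clusters.
Z = linkage(X, method="average", metric="cosine")

# Cut the resulting hierarchy into 2 flat clusters.
print(fcluster(Z, t=2, criterion="maxclust"))  # e.g. [1 1 2 2]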
In summary, K-means and hierarchical clustering techniques play important roles in organizing and
analyzing document collections in Information Retrieval, enabling tasks such as document grouping,
topic modeling, and exploratory data analysis.

Let's illustrate K-means and hierarchical clustering with an example in the context of Information
Retrieval (IR):
Example: Document Clustering in IR
Suppose we have a collection of news articles from different categories such as sports, politics,
technology, and entertainment. Our goal is to cluster these articles based on their content using K-
means and hierarchical clustering techniques.
Step 1: Document Representation
We represent each document using the TF-IDF (Term Frequency-Inverse Document Frequency) vector
representation. Each document becomes a high-dimensional vector where each dimension represents
the importance of a term in that document relative to the entire corpus.
Step 2: K-means Clustering
Let's say we want to group the documents into 4 clusters (K=4).
1. Initialization: Randomly select 4 initial cluster centroids.
2. Assignment Step: Assign each document to the cluster whose centroid is closest to it based
on cosine similarity or Euclidean distance.
3. Update Step: Recalculate the centroids of the clusters based on the mean of the documents in
each cluster.
4. Iterations: Repeat the assignment and update steps until convergence or until a maximum
number of iterations is reached.

Step 3: Hierarchical Clustering
We build a hierarchical clustering structure using agglomerative clustering:
1. Initialization: Treat each document as a single cluster.
2. Merge Step: At each iteration, merge the two most similar clusters based on a similarity
measure (e.g., cosine similarity or Euclidean distance).
3. Hierarchy Formation: Continue merging clusters until all documents are part of a single
cluster or until a stopping criterion is met.
Result Interpretation:
• After clustering, we can examine the clusters formed by both K-means and hierarchical
clustering methods.
• We might find that K-means clustering provides distinct clusters, while hierarchical clustering
offers a hierarchical structure, allowing for exploration at different levels of granularity.
• For instance, K-means might group sports articles together, while hierarchical clustering could
reveal subcategories within sports (e.g., football, basketball, etc.).
Application:
• These clustering results can be used for various IR tasks such as organizing news articles on a
website, providing recommendations to users based on their interests, or enhancing search
results by grouping similar documents together.
By applying K-means and hierarchical clustering techniques to document collections in IR, we can
efficiently organize and explore large volumes of textual data, enabling better retrieval, navigation, and
analysis of information.
-------------------------------------------------------------------------------------------------------------------
Evaluation of clustering results, Clustering for query expansion and result grouping

Evaluation of Clustering Results:


Evaluation of clustering results is essential to assess the quality of clusters generated by clustering
algorithms. Here are some common evaluation metrics:
1. Silhouette Score: Measures how well-separated clusters are and how similar data points are
within the same cluster.
2. Davies-Bouldin Index: Computes the average similarity between each cluster and its most
similar cluster, where lower values indicate better clustering.
3. Cluster Purity: Measures the degree to which clusters contain predominantly data points from
a single class or category.
4. Cluster Compactness and Separation: Measure the tightness of clusters and the distance
between different clusters.
5. Visual Inspection: Visualization techniques such as t-SNE (t-distributed Stochastic Neighbor
Embedding) or PCA (Principal Component Analysis) can help visually inspect the clustering
results.
Clustering for Query Expansion:
Query expansion is a technique used to improve information retrieval systems by expanding user
queries with additional terms. Clustering can aid in query expansion by:
1. Cluster-based Expansion: Expand the user query with terms extracted from documents
within the same cluster as relevant documents. This assumes that documents within the same
cluster are semantically similar.
2. Representative Term Selection: Select representative terms from each cluster and use them
to expand the user query. These terms can be centroids of clusters or terms with high TF-IDF
scores within clusters.
3. Contextual Expansion: Incorporate contextual information from clusters to expand queries.
For example, if a user query is ambiguous, clustering can help disambiguate by providing
context from documents in relevant clusters.
Clustering for Result Grouping:
In IR, clustering can be used to group search results into meaningful categories to enhance user
experience and facilitate information exploration. Here's how clustering can be applied for result
grouping:
1. Topical Result Clustering: Group search results into clusters based on topical similarity.
Users can then explore results within specific topics of interest.
2. Faceted Search: Apply clustering to categorize search results across different facets such as
time, location, or document type. This allows users to refine their search results based on
specific criteria.
3. Personalized Result Clustering: Cluster search results based on user preferences and
behavior. For example, if a user frequently clicks on news articles related to technology, the
system can cluster search results to prioritize technology-related content for that user.
4. Dynamic Result Grouping: Clustering can be performed dynamically based on user
interactions and feedback to adapt result grouping according to user needs and preferences.

Evaluation of Result Grouping:
Evaluation of result grouping in IR involves assessing the relevance, coherence, and usability of the
clustered search results. User studies, feedback mechanisms, and relevance judgments can be used to
evaluate the effectiveness of result grouping techniques in improving user satisfaction and information
retrieval performance.
In summary, clustering plays a vital role in IR for evaluating clustering results, expanding user queries,
and grouping search results to enhance user experience and retrieval effectiveness. Evaluating
clustering and result grouping techniques is crucial for optimizing information retrieval systems and
meeting user needs effectively.

1. Evaluation of Clustering Results:


Suppose we have a collection of news articles, and we apply K-means clustering to group them into
different topics. Here's how we can evaluate the clustering results:
• Silhouette Score: Calculate the silhouette score for the clustering. A higher silhouette score
indicates that the clusters are well-separated and data points are tightly grouped within
clusters.
• Davies-Bouldin Index: Compute the Davies-Bouldin Index, which measures the average
similarity between each cluster and its most similar cluster. A lower index suggests better
clustering.
• Cluster Purity: Assess the purity of clusters by measuring the proportion of documents from
the same category within each cluster. Higher purity indicates better clustering quality.
• Visual Inspection: Visualize the clustering results using dimensionality reduction techniques
like t-SNE or PCA. Visual inspection helps to understand the distribution of documents across
clusters and identify any patterns or outliers.

2. Clustering for Query Expansion:


Let's say a user searches for "machine learning" in a scholarly database. We can use clustering to
expand the query and retrieve more relevant documents:
• Cluster-based Expansion: Identify clusters containing documents related to "machine
learning." Extract terms from these clusters, such as "artificial intelligence," "deep learning,"
and "neural networks," to expand the user query.
• Representative Term Selection: Choose representative terms from clusters, such as
frequently occurring terms or cluster centroids, to augment the user query and retrieve
documents with similar content.
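A minimal sketch of cluster-based expansion, in which the top-weighted centroid terms of the cluster containing the query are used as candidate expansion terms (toy documents and parameters, for illustration only):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "machine learning with deep neural networks",
    "training deep learning models for artificial intelligence",
    "gardening tips for growing tomatoes",
    "how to water and prune garden plants",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

def expansion_terms(cluster_id, top_n=3):
    """Return the highest-weighted centroid terms of a cluster as expansion candidates."""
    centroid = kmeans.cluster_centers_[cluster_id]
    terms = vectorizer.get_feature_names_out()
    return [terms[i] for i in np.argsort(centroid)[::-1][:top_n]]

# Assign the query to a cluster, then expand it with that cluster's top terms.
query_vec = vectorizer.transform(["machine learning"])
cluster_id = kmeans.predict(query_vec)[0]
print(expansion_terms(cluster_id))  # e.g. ['learning', 'deep', 'neural']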

3. Clustering for Result Grouping:


Consider a web search engine that retrieves results for the query "data visualization." We can use
clustering to group search results into coherent categories:
• Topical Result Clustering: Cluster search results based on topical similarity. For instance,
group results related to "infographics," "charting tools," and "data visualization techniques" into
separate clusters.
• Faceted Search: Categorize search results across different facets such as "time" (e.g., recent
articles vs. historical resources) or "document type" (e.g., articles vs. tutorials).
• Personalized Result Clustering: Adapt result grouping based on user preferences and
behavior. If a user frequently clicks on results related to "interactive visualizations," prioritize
clustering those results together.

Example Summary:
Let's tie these concepts together with an example:
Suppose we're developing a news aggregation platform. We apply K-means clustering to group news
articles into topics. Evaluation metrics such as silhouette score and cluster purity help assess the
quality of clustering. Users searching for "climate change" can benefit from query expansion, where
terms from clusters related to "environmental science" and "sustainability" are added to the query.
When presenting search results, clustering aids in grouping articles under different facets like "policy
implications," "scientific research," and "public awareness campaigns," enhancing user exploration and
comprehension.
This example illustrates how evaluation, query expansion, and result grouping with clustering can
enhance the Information Retrieval process, providing users with more relevant and organized
information.

CHAPTER VIII: WEB INFORMATION RETRIEVAL
Topics Covered: Web search architecture and challenges, Crawling and indexing web pages, Link
analysis and PageRank algorithm
-------------------------------------------------------------------------------------------------------------------

Web Search Architecture and Challenges:

1] Web crawler
Search engine bots, web robots, and spiders are other names for web crawlers. A crawler is essentially a piece of software that browses the web, downloads pages, and gathers their content; it also plays a crucial role in search engine optimization (SEO).
The following web crawler features can affect the search results:
o Included Pages
o Excluded Pages
o Document Types
o Frequency of Crawling
2] Database
A search engine database is typically a non-relational store that holds all of the data gathered from the web. Amazon Elasticsearch Service and Splunk are two of the most well-known search engine databases.
The following two characteristics of the database can affect search results:
o Size of the database
o Freshness (recency) of the database
3] Search interfaces
The search interface is one of the most crucial elements of a search engine. It serves as the user's gateway to the database and helps users formulate and run queries.
The following search interface features affect the search results:
o Operators
o Phrase searching
o Truncation
4] Ranking algorithms
Ranking algorithms determine the order in which web pages appear in search results; Google's ranking algorithm is a well-known example.
The following ranking factors have an impact on the search results:
o Location and frequency of query terms
o Link evaluation
o Clickthrough analysis


Architecture:
1. Crawling: The process of discovering and fetching web pages from the internet.
2. Indexing: Analyzing and storing the content of web pages in a searchable index.
3. Query Processing: Interpreting and executing user queries against the indexed data.
4. Ranking: Determining the relevance of indexed pages to the user query and ranking them
accordingly.
5. User Interface: Presenting search results to the user in a user-friendly manner.

Challenges:
1. Scale: The web is vast and constantly expanding, requiring search engines to crawl and index
billions of pages.
2. Freshness: Keeping indexed content up-to-date with the rapidly changing web.
3. Relevance: Providing users with relevant and diverse search results for their queries.
4. Spam and Manipulation: Dealing with spam, low-quality content, and attempts to manipulate
search engine rankings.
5. Diversity of Content: Indexing and retrieving various types of content, including text, images,
videos, and multimedia.

Web search faces several challenges due to the dynamic nature of the web, the vast amount of
information available, and the diverse needs and behaviors of users. Some of the key challenges in
web search include:

Information Overload: The web contains an enormous volume of information, and users often
struggle to find relevant content amidst the abundance of data. Information overload can lead to user
frustration and difficulties in locating specific information.

Quality and Trustworthiness: Ensuring the quality and trustworthiness of information retrieved from
the web is a significant challenge. The web contains a mix of reliable, authoritative sources and
unreliable or misleading content. Users may encounter misinformation, fake news, and biased
perspectives, which can undermine the credibility of search results.

Dynamic and Evolving Content: The web is constantly evolving, with new content being created,
updated, and removed at a rapid pace. Search engines must continuously crawl, index, and update
their databases to reflect the latest information available on the web.

Multimedia Content: The increasing prevalence of multimedia content, including images, videos,
audio files, and interactive media, presents challenges for search engines in effectively indexing,
analyzing, and retrieving non-textual content.

Multilingual and Multicultural Content: The web is a global platform with content available in
multiple languages and tailored to diverse cultural contexts. Search engines must support multilingual
search capabilities and account for cultural differences in language usage, terminology, and
preferences.

User Intent and Context: Understanding user intent and context is essential for delivering relevant
search results. Users may have varying search intents, such as informational, navigational,
transactional, or exploratory, and search engines must interpret user queries accurately to provide the
most relevant results.

Personalization and Privacy: Balancing the need for personalized search experiences with user
privacy concerns is a challenge for search engines. While personalized search results can enhance user
satisfaction and relevance, they also raise privacy issues related to data collection, tracking, and
profiling.

Semantic Understanding: Improving the semantic understanding of search queries and web content
is an ongoing challenge. Search engines must go beyond keyword matching and incorporate natural
language processing, entity recognition, and semantic analysis techniques to better understand the
meaning and context of user queries and web documents.

Mobile and Voice Search: The increasing prevalence of mobile devices and voice-activated assistants
has transformed user search behavior. Search engines must adapt to the unique characteristics of
mobile and voice search, including shorter queries, location-based information, and conversational
language.

Adapting to Emerging Technologies: Emerging technologies such as artificial intelligence, machine learning, natural language processing, and blockchain present both opportunities and challenges for web search. Search engines must adapt to leverage these technologies effectively while addressing potential ethical, legal, and technical implications.

Addressing these challenges requires ongoing research, innovation, and collaboration among search
engine providers, information retrieval experts, web developers, and other stakeholders to enhance the
quality, relevance, and accessibility of web search experiences.
-------------------------------------------------------------------------------------------------------------------
Crawling and Indexing Web Pages:

Crawling:
• Crawlers, also known as spiders or bots, systematically navigate the web by following links from
one page to another.
• They discover new pages and update existing ones by fetching and analyzing their content.
• Example: Googlebot, the crawler used by Google, navigates the web, discovering and fetching
web pages to be indexed.

Web crawling is the process by which we gather pages from the Web, in order to index them and
support a search engine. The objective of crawling is to quickly and efficiently gather as many useful
web pages as possible, together with the link structure that interconnects them.

Features a crawler must provide


We list the desiderata for web crawlers in two categories: features that web crawlers must provide,
followed by features they should provide.

Robustness: The Web contains servers that create spider traps, which are generators of web pages
that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain.
Crawlers must be designed to be resilient to such traps. Not all such traps are malicious; some
are the inadvertent side-effect of faulty website development.
Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler
can visit them. These politeness policies must be respected.

Features a crawler should provide


Distributed: The crawler should have the ability to execute in a distributed fashion across multiple
machines.
Scalable: The crawler architecture should permit scaling up the crawl rate by adding extra machines
and bandwidth.
Performance and efficiency: The crawl system should make efficient use of various system
resources including processor, storage and network bandwidth.
Quality: Given that a significant fraction of all web pages are of poor utility for serving user query
needs, the crawler should be biased towards fetching “useful” pages first.

Freshness: In many applications, the crawler should operate in continuous mode: it should obtain
fresh copies of previously fetched pages. A search engine crawler, for instance, can thus ensure that
the search engine’s index contains a fairly current representation of each indexed web page. For
such continuous crawling, a crawler should be able to crawl a page with a frequency that approximates
the rate of change of that page.
Extensible: Crawlers should be designed to be extensible in many ways – to cope with new data
formats, new fetch protocols, and so on. This demands that the crawler architecture be modular.

Crawler architecture
The simple scheme outlined above for crawling demands several modules that fit together as shown in
Figure 20.1.
1. The URL frontier, containing URLs yet to be fetched in the current crawl (in the case of continuous
crawling, a URL may have been fetched previously but is back in the frontier for re-fetching). We
describe this further
in Section 20.2.3.
2. A DNS resolution module that determines the web server from which to fetch the page specified by
a URL. We describe this further in Section 20.2.2.
3. A fetch module that uses the http protocol to retrieve the web page at a URL.
4. A parsing module that extracts the text and set of links from a fetched web page.
5. A duplicate elimination module that determines whether an extracted link is already in the URL
frontier or has recently been fetched.
Crawling is performed by anywhere from one to potentially hundreds of threads, each of which loops
through the logical cycle in Figure 20.1. These threads may be run in a single process, or be
partitioned amongst multiple processes running at different nodes of a distributed system. We begin by
assuming that the URL frontier is in place and non-empty and defer our description of the
implementation of the URL frontier to Section 20.2.3. We follow the progress of a single URL through
the cycle of being fetched, passing through various checks and filters, then finally (for continuous
crawling) being returned to the URL frontier.
A crawler thread begins by taking a URL from the frontier and fetching the web page at that URL,
generally using the http protocol.
The fetched page is then written into a temporary store, where a number of operations are performed
on it. Next, the page is parsed and the text as well as the links in it are extracted.
The text (with any tag information – e.g., terms in boldface) is passed on to the indexer.
Link information including anchor text is also passed on to the indexer for use in ranking in ways that
are described in Chapter 21. In addition, each extracted link goes through a series of tests to
determine whether the link should be added to the URL frontier.
First, the thread tests whether a web page with the same content has already been seen at another
URL. The simplest implementation for this would use a simple fingerprint such as a checksum (placed
in a store labelled "Doc FP’s" in Figure 20.1). A more sophisticated test would use shingles instead.
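
The fingerprint test described above can be illustrated with a short sketch. The snippet below is a
minimal, hypothetical Python version of the "Doc FP" check, assuming a simple MD5 checksum over
whitespace-normalized page content; a production crawler would typically prefer shingles for
near-duplicate detection, as noted above.

import hashlib

doc_fingerprints = set()   # plays the role of the "Doc FP's" store

def is_duplicate_content(html: str) -> bool:
    """Return True if a page with identical content has already been seen."""
    # Whitespace-normalized checksum of the page body; an assumption of this
    # sketch, not the exact fingerprint used by any particular crawler.
    fp = hashlib.md5(" ".join(html.split()).encode("utf-8")).hexdigest()
    if fp in doc_fingerprints:
        return True
    doc_fingerprints.add(fp)
    return False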

Web crawling is the automated process of systematically browsing the internet, typically performed by
software programs called web crawlers or spiders. These crawlers navigate the web by following
hyperlinks from one webpage to another, gathering information along the way. Here's a detailed
overview of how web crawling works:
1. Starting Point: The web crawler begins its journey from a list of seed URLs provided by the
user or generated programmatically. These URLs serve as entry points into the web.
2. HTTP Requests: The crawler sends HTTP requests to the web servers hosting the webpages.
These requests typically use the HTTP GET method to retrieve the content of the webpage.
3. Downloading Content: Upon receiving the HTTP response from the server, the crawler
downloads the content of the webpage, including HTML, CSS, JavaScript, images, and other
embedded resources.
4. Parsing HTML: The crawler parses the HTML content of the webpage to extract useful
information such as text content, hyperlinks, metadata, and structural elements like headings
and paragraphs. Parsing is often done using HTML parsing libraries or modules.
5. Extracting URLs: The crawler identifies and extracts hyperlinks (URLs) from the HTML content
of the webpage. These URLs point to other webpages that the crawler has not yet visited.
6. URL Frontier: The extracted URLs are added to a queue known as the URL frontier or crawl
frontier. This queue maintains a list of URLs that the crawler intends to visit in the future.
7. URL Filtering: Before adding URLs to the frontier, the crawler may apply filters to ensure that
only relevant and valid URLs are included. Filters may include domain restrictions, URL patterns,
and URL canonicalization to prevent duplicate content.
8. Politeness and Crawling Policies: Web crawlers often adhere to politeness policies to avoid
overloading web servers with excessive requests. These policies may include crawl delay, user-
agent identification, and respect for robots.txt directives, which indicate which parts of a
website are open to crawling.
9. Recursion: The crawler continues the process recursively by fetching URLs from the frontier,
downloading their content, parsing HTML, extracting new URLs, and adding them to the frontier
for future crawling.
10. Storing Data: As the crawler traverses the web, it may store extracted data such as text
content, metadata, and URL relationships in a structured format like a database or an index.
This data can be later used for various purposes such as search engine indexing, data analysis,
or information retrieval.
11. Crawl Control and Monitoring: Throughout the crawling process, the crawler monitors its
progress, manages resources efficiently, handles errors and exceptions, and adjusts its crawling
strategy based on various factors such as server responses, network conditions, and crawl
priorities.
12. Completion and Reporting: Once the crawling process is complete or reaches a predefined
stopping condition (e.g., maximum depth, time limit), the crawler generates reports, logs, or
notifications summarizing its activities, including statistics on crawled pages, errors
encountered, and data collected.
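To make the steps above concrete, the following is a minimal, single-threaded crawl loop in Python
using only the standard library. The seed URL, page limit, and crawl delay are illustrative
assumptions, and real crawlers add many refinements (distribution, retry logic, per-host queues)
that are not shown here.

import time
import urllib.request
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags while parsing HTML."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=20, delay=1.0):
    robots = urllib.robotparser.RobotFileParser(urljoin(seed, "/robots.txt"))
    try:
        robots.read()                               # politeness: respect robots.txt
    except Exception:
        pass
    frontier, seen = deque([seed]), {seed}          # URL frontier + visited set
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        if not robots.can_fetch("*", url):
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                                # skip fetch errors
        parser = LinkExtractor()
        parser.feed(html)                           # parse HTML, extract links
        for href in parser.links:
            link, _ = urldefrag(urljoin(url, href)) # canonicalize: absolute URL, no fragment
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)               # add new URL to the frontier
        time.sleep(delay)                           # crawl delay (politeness)
    return seen
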
Web crawling is a fundamental component of many applications and services, including search engines,
web indexing, content aggregation, data mining, and competitive analysis. However, it's essential for
developers and operators to ensure that web crawling activities comply with legal, ethical, and
technical guidelines to avoid disrupting website operations and infringing on intellectual property
rights.
-------------------------------------------------------------------------------------------------------------------
Indexing:
• Indexing involves parsing and analyzing the content of web pages to extract relevant
information.
• The extracted data is stored in a searchable index, enabling efficient retrieval in response to
user queries.
• Example: Google's indexing process analyzes the text, images, metadata, and links within web
pages to create an index of searchable content.

Indexing web pages is a crucial step in the process of organizing and making web content searchable.
Search engines like Google, Bing, and Yahoo use indexing to create a searchable database of web
pages that users can access through their search interfaces. Here's how indexing web pages typically
works:
1. Crawling: As mentioned earlier, web crawling is the process of systematically browsing the
internet to discover and retrieve web pages. Crawlers, also known as spiders or bots, visit web
pages by following hyperlinks from one page to another. They download the content of each
page they visit, including text, images, links, and metadata.
2. Parsing and Analyzing Content: Once the web crawler downloads the content of a web page,
it parses the HTML and extracts relevant information such as text content, headings, metadata
(e.g., title, description), and links to other pages. This extracted data provides the basis for
indexing.
3. Tokenization and Analysis: The text content extracted from web pages is tokenized and
analyzed to identify individual words, phrases, and other linguistic elements. This process may
involve techniques like stemming (reducing words to their root form), stop word removal
(filtering out common words like "and," "the," etc.), and language-specific processing.
4. Building an Inverted Index: The indexed information is then stored in a data structure
known as an inverted index. In an inverted index, each unique term (token) found in the
indexed documents is associated with a list of references to the documents containing that
term. This allows for efficient searching based on keywords.
5. Indexing Metadata: In addition to the textual content of web pages, metadata such as page
titles, descriptions, URLs, and other attributes are indexed to provide additional context and
improve search relevance.
6. Storing and Updating the Index: The indexed data is stored in a searchable database or
index. Search engines continuously update their indexes by periodically crawling the web to
discover new content, revisit previously indexed pages for changes, and remove outdated or
irrelevant pages.
7. Ranking and Relevance: Search engines use sophisticated algorithms to determine the
relevance of indexed pages to specific search queries. Factors such as keyword frequency, page
authority, relevance of incoming links, user engagement metrics, and other signals are used to
rank search results and present the most relevant pages to users.
8. Query Processing: When a user enters a search query, the search engine retrieves relevant
documents from its index based on the query terms and ranking criteria. The search results are
then presented to the user through the search engine's interface, typically in the form of a list
of links accompanied by titles, snippets, and other metadata.
Indexing web pages enables search engines to quickly and efficiently retrieve relevant information in
response to user queries, making it a critical component of the modern web ecosystem.
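
As a small illustration of steps 2-4 above, the sketch below tokenizes document text, removes a few
stop words, and builds an in-memory inverted index that maps each term to the documents and
positions where it occurs. The two sample documents and the stop-word list are assumptions made
for the example.

import re
from collections import defaultdict

STOP_WORDS = {"and", "the", "of", "to", "a", "in", "is"}   # illustrative list

def tokenize(text):
    # Lowercase, split on non-alphanumeric characters, drop stop words.
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def build_inverted_index(docs):
    index = defaultdict(dict)            # term -> {doc_id: [positions]}
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {
    1: "The web crawler downloads the content of a web page",
    2: "Search engines build an inverted index of web pages",
}
index = build_inverted_index(docs)
print(index["web"])                      # {1: [0, 4], 2: [6]}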

-------------------------------------------------------------------------------------------------------------------
Link Analysis and PageRank Algorithm in IR:
Link Analysis:
• Link analysis examines the structure of the web, particularly the network of hyperlinks between
web pages.
• It seeks to understand the relationships and importance of pages based on how they are linked
to by other pages.
• Example: Analyzing inbound links to a web page can provide insights into its popularity and
authority within the web ecosystem.

Link analysis in information retrieval (IR) refers to the process of analyzing the relationships between
documents based on hyperlinks. It's a fundamental concept used in various applications, particularly in
web search engines like Google, Bing, and Yahoo. Link analysis helps search engines understand the
structure and relevance of web pages by examining how they are linked together. Here's how link
analysis works in IR:
1. Hyperlink Structure: On the web, hyperlinks are used to connect one webpage to another.
Each hyperlink represents a relationship or connection between two web pages. By analyzing
these hyperlinks, search engines can uncover valuable information about the relationships and
authority of web pages.
2. PageRank Algorithm: One of the most famous algorithms used for link analysis is PageRank,
developed by Larry Page and Sergey Brin, the founders of Google. PageRank assigns a
numerical weight to each webpage based on the quantity and quality of inbound links it receives
from other pages. Pages with higher PageRank scores are considered more authoritative and
are likely to appear higher in search results.
3. Link-based Relevance: Search engines use link analysis to assess the relevance and
importance of a webpage based on its inbound and outbound links. A webpage that receives
many inbound links from other reputable and relevant sites is considered more authoritative
and trustworthy on a particular topic. Similarly, a webpage that links to other authoritative
pages may also gain credibility.
4. Anchor Text Analysis: In addition to analyzing the number and quality of links, search
engines also examine the anchor text (the clickable text of a hyperlink) used in inbound links to
determine the relevance and context of the linked content. Pages with descriptive and relevant
anchor text are likely to be considered more authoritative and relevant for specific search
queries.
5. Link Structure Analysis: Search engines analyze the overall structure of the link graph to
identify patterns, clusters, and communities of related web pages. This analysis helps improve
search relevance by understanding the topical relationships between different pages and
identifying authoritative hubs within specific subject areas.
6. Link Spam Detection: Link analysis is also used to detect and combat link spam, which
involves artificially manipulating the link graph to improve a webpage's ranking in search
results. Search engines employ various techniques to identify and penalize websites engaged in
manipulative link-building practices, such as link farms, link exchanges, and paid links.
7. Reciprocal Link Analysis: Search engines may also analyze reciprocal links (two-way links)
between websites to evaluate the authenticity and naturalness of link relationships. Excessive
reciprocal linking or link schemes aimed at artificially inflating page rankings may be penalized
by search engines.
Overall, link analysis plays a crucial role in information retrieval by helping search engines assess the
relevance, authority, and credibility of web pages based on their relationships with other pages on the
web. It's an essential component of modern search algorithms and contributes to the accuracy and
effectiveness of search engine results.

-------------------------------------------------------------------------------------------------------------------
PageRank Algorithm:
• PageRank is a link analysis algorithm developed by Larry Page and Sergey Brin, the founders of
Google.
• It assigns a numerical value (PageRank score) to each web page based on the quantity and
quality of inbound links.
• Pages with higher PageRank scores are considered more important and are likely to rank higher
in search engine results.
• Example: Suppose Page A has many high-quality inbound links from reputable websites, while
Page B has fewer links from less authoritative sources. Page A is likely to have a higher
PageRank score and rank higher in search results.

Example: PageRank in Action:


Consider a scenario where Page A is a well-established news website with numerous inbound links from
other reputable news sites and academic institutions. Page B, on the other hand, is a personal blog
with fewer inbound links.
When a user searches for a news topic related to the content of both Page A and Page B, Google's
search engine may assign a higher PageRank to Page A due to its authoritative inbound links. As a
result, Page A is more likely to appear higher in the search results compared to Page B.
In summary, the PageRank algorithm, along with crawling, indexing, and link analysis, plays a crucial
role in the architecture of web search engines. By analyzing the structure of the web and the
relationships between pages, search engines can deliver relevant and authoritative search results to
users.
The PageRank algorithm, developed by Larry Page and Sergey Brin at Stanford University, is a
foundational algorithm used by search engines to rank web pages based on their importance and
relevance. Here's a detailed explanation of the PageRank algorithm along with an example:

PageRank Algorithm:
1. Concept of Page Importance:
• The PageRank algorithm views the web as a network of interconnected pages, where
each page is considered a node.
• The importance of a page is determined by the number and quality of inbound links it
receives from other pages.
• Pages with many inbound links from high-quality and authoritative sources are
considered more important.
2. Iterative Calculation:
• PageRank is calculated iteratively, over multiple passes (iterations).
• At each iteration, the PageRank score of each page is updated based on the PageRank
scores of the pages linking to it.
3. Damping Factor:
• The PageRank algorithm incorporates a damping factor, typically set to 0.85, to model
the behavior of web users who may randomly jump from one page to another.
• The damping factor ensures that even pages with no outbound links receive a small
fraction of PageRank from every page on the web.
4. Formula for PageRank Calculation:
• The PageRank score PR(u) of a page u is calculated as the sum of the PageRank scores
of all pages v linking to u, divided by the number of outbound links from page v, and
multiplied by the damping factor:

The PageRank of a page u can therefore be written as:

PR(u) = (1 - d) / N + d * SUM over all pages v linking to u of [ PR(v) / L(v) ]

where d is the damping factor, N is the total number of pages, and L(v) is the number of outbound
links on page v.

5. Iterative Process:
• The PageRank scores are initially set to a uniform value for all pages.
• The PageRank calculation is performed iteratively until convergence, where the
PageRank scores stabilize and stop changing significantly.
-------------------------------------------------------------------------------------------------------------------
Example of PageRank Algorithm:
Let's consider a simple example with four web pages (A, B, C, D) connected by links:
• Page A has outbound links to pages B, C, and D.
• Page B has an inbound link from page A and outbound links to pages C and D.
• Page C has inbound links from pages A and B, and no outbound links.
• Page D has inbound links from pages A and B, and no outbound links.

Initial PageRank Scores:


• Initially, we assign a uniform PageRank score of 0.25 to each page since there are four pages in
total.
Iteration 1:
• We calculate the new PageRank scores for each page based on the formula discussed earlier.
• After the first iteration, the PageRank scores may change for each page.
Iteration 2:
• We recalculate the PageRank scores based on the updated scores from the previous iteration.
• We continue this iterative process until the PageRank scores stabilize and converge.
Final PageRank Scores:
• After multiple iterations, the PageRank scores stabilize, and we obtain the final PageRank scores
for each page.
The page with the highest PageRank score is considered the most important and authoritative, and it is
more likely to appear higher in search engine results for relevant queries.
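The iterative calculation just described can be sketched in a few lines of Python for this four-page
example. Treating pages C and D, which have no outbound links, as spreading their score uniformly
across all pages is an assumption of the sketch (one common way of handling dangling pages), not
something dictated by the description above.

DAMPING = 0.85
N_ITER = 50

links = {                     # outbound links of the example graph
    "A": ["B", "C", "D"],
    "B": ["C", "D"],
    "C": [],                  # dangling page: no outbound links
    "D": [],                  # dangling page: no outbound links
}

pages = list(links)
n = len(pages)
ranks = {p: 1.0 / n for p in pages}       # uniform initial scores (0.25 each)

for _ in range(N_ITER):
    # Score of dangling pages is shared equally among all pages.
    dangling = sum(ranks[p] for p in pages if not links[p])
    new_ranks = {}
    for p in pages:
        incoming = sum(ranks[q] / len(links[q]) for q in pages if p in links[q])
        new_ranks[p] = (1 - DAMPING) / n + DAMPING * (incoming + dangling / n)
    ranks = new_ranks

for p, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(p, round(score, 4))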
In summary, the PageRank algorithm plays a crucial role in determining the importance and relevance
of web pages, helping search engines deliver high-quality search results to users based on the
authority and popularity of pages within the web network.

How does the PageRank algorithm work?


The PageRank algorithm assigns a numerical value called a PageRank score to each web page in a
network of linked pages. This score indicates the relevance or importance of the page on the web.
The algorithm works step by step:
1. Initialization: The algorithm begins by determining the initial PageRank value of each web
page. Typically, this initial value is set uniformly across all pages so that every page has the
same initial value.
2. Link analysis: The algorithm analyzes the links between web pages. It considers both inbound
links (links pointing to a page) and outbound links (links from a page to other pages). Pages
with more inbound links are considered more important because they are believed to receive
recommendations or votes of trust from other important pages.
3. Iterative calculation: The algorithm repeatedly updates the PageRank score of each page
based on the PageRank scores of the pages that link to it. During each iteration, the PageRank of
a page is recalculated, taking into account the PageRank contribution of its incoming links.
Damping factor: a damping factor (typically 0.85) is introduced to avoid infinite loops and to
ensure the algorithm converges. It reflects the assumption that a user will most likely continue
browsing by following a link on the current page rather than jumping to a random page. The
damping factor helps distribute importance evenly and prevents the entire PageRank value from
concentrating on a single page.
4. Rank Distribution: As the algorithm progresses, the PageRank of a page is distributed
among its outgoing links. For example, if a page has a high PageRank and many outbound
links, each outbound link passes on only a fraction of that page's score. This division ensures
that a page's importance is shared among the pages it links to.

5. Convergence: The iterative process continues until the PageRank score stabilizes or converges.
Convergence occurs when the difference in PageRank scores between successive iterations falls
below a certain threshold. At this point, the algorithm has reached a stable ranking, and the
PageRank scores indicate the relative importance of each web page.
6. Ranking and Display: Pages are ranked based on their final PageRank scores. Pages with a
higher PageRank score are considered more influential or essential. Search engines can use
these scores when displaying search results, so pages with higher rankings are usually shown
closer to the top. By considering the link structure and updating the PageRank score iteratively,
the algorithm effectively measures the importance of web pages relative to others. It allows
ranking pages based on their popularity and influence, helping to build more accurate and
relevant search engines.
Advantages of Page-Rank Algorithm
The PageRank algorithm, developed by Larry Page and Sergey Brin at Google, is a critical component
of Google's search engine algorithm. It revolutionized how web pages are ranked and provided several
advantages over traditional ranking methods. Here are some advantages of the PageRank algorithm:
1. Objective and unbiased: The PageRank algorithm is based on the web's link structure rather than
solely on content analysis. It measures the importance of a web page based on the number and
quality of incoming links from other pages. This approach reduces the impact of subjective
factors and manipulation, making the ranking more objective and unbiased.
2. Quality-focused: PageRank assigns higher importance to pages linked by other essential and
trustworthy pages. It considers the authority and reputation of the linking pages, effectively
measuring the quality of content. This approach helps filter out spam or low-quality pages,
ensuring that highly relevant and reliable pages are ranked higher.
3. Resilience to manipulation: PageRank is designed to resist manipulation and spamming
techniques. The algorithm considers the entire web graph and the collective influence of all
pages. It is difficult for web admins to artificially inflate their page rankings by creating
numerous low-quality links or manipulating the anchor text. This makes the algorithm more
reliable and trustworthy.
4. Scalability: The PageRank algorithm is highly scalable and can handle large-scale web graphs
efficiently. It doesn't require re-indexing or analyzing the entire web each time a search query
is performed. Instead, it calculates and stores the PageRank values for web pages, allowing for
quick retrieval and ranking during search queries.
5. Query-independent: PageRank is a query-independent ranking algorithm that doesn't depend
on specific search terms. The ranking is determined based on the overall link structure and
importance of pages rather than the relevance to a particular query. This allows for consistent
and stable rankings across different search queries, ensuring a more robust search experience.
6. Foundation for other algorithms: The PageRank algorithm forms the foundation for various
ranking algorithms and search engine techniques. It has inspired the development of advanced
algorithms such as HITS (Hyperlink-Induced Topic Search) and Trust Rank, further improving
search results' accuracy and relevance.
Overall, the PageRank algorithm has transformed web search by introducing a more reliable and
objective method of ranking web pages. Its focus on link quality and resilience to manipulation has
made it a cornerstone of modern search engines, providing users with more accurate and trustworthy
search results.

Disadvantages of the Page-Rank algorithm


Developed by Larry Page and Sergey Brin, the PageRank algorithm is widely used for ranking web
pages in search engines. Although it has proven effective in many cases, the PageRank algorithm has
some disadvantages:
1. Vulnerability to manipulation: The original PageRank algorithm depends heavily on the
number and quality of incoming links to a web page. This makes it vulnerable to manipulative
individuals or organizations that engage in link spamming or other black hat SEO techniques to
increase the relevance of their pages artificially. Over time, search engines have implemented
various measures to mitigate this problem, but it remains a concern.
2. Emphasis on old pages: PageRank favors pages that have been around longer. Because the
algorithm determines relevance based on the quantity and quality of incoming links, older pages
accumulate more links over time, giving them a higher PageRank score. This bias can make it
difficult for new or recently updated pages to rank highly, even if they provide valuable and
relevant content.
3. Lack of user context: PageRank relies primarily on link analysis and does not consider user
context or search intent. The algorithm does not directly consider user preferences, location, or
personalization factors. As a result, search results may only sometimes accurately reflect the
user's specific needs or preferences.

4. Limited in dealing with spam and low-quality content: While PageRank attempts to rank
pages based on their importance and relevance, it does not directly consider the quality or
reliability of the content on those pages. This can lead to pages with low-quality or spam
content being ranked highly based on their link profile alone.
5. Lack of real-time updates: The original PageRank algorithm works on a static snapshot of the
web and does not dynamically adapt to changes in the web ecosystem. Because the web evolves
rapidly and new pages are created, updated, or deleted frequently, the static nature of PageRank
can result in outdated rankings that may not accurately reflect the current state of the web. It is
worth noting that the original PageRank algorithm has been improved and modified over the
years, and many of these shortcomings have been addressed to some extent in more modern
algorithms and search engine ranking systems.
-------------------------------------------------------------------------------------------------------------------
Applications of the Page-Rank algorithm
The PageRank algorithm has found several applications beyond its original use in ranking web pages.
Some notable applications include:
1. Search engine ranking: PageRank helps determine the importance and relevance of web pages
based on the web's link structure. Search engines such as Google include PageRank as one of
many factors used to rank search results and provide users with more accurate and helpful
information.
2. Recommender systems: PageRank can be used to recommend relevant items to users based on
their preferences and similarity. Applying the algorithm to a network of items and analyzing their
relationships can identify important and influential items that may interest the user.
3. Social Network Analysis: PageRank analyzes social networks to identify influential individuals
or network nodes. The algorithm can classify users based on their connections and network
influence by treating individuals as nodes and connections as links. This information can be
valuable in various areas, such as marketing, identifying key opinion leaders, or understanding
the spread of information.
4. Citation analysis: In academic research, the PageRank algorithm can be applied to analyze
citation networks. The algorithm can identify influential articles or researchers in a given field by
treating academic articles as nodes and citations as links. This information helps to understand
the impact and importance of scientific work.
5. Content Recommendation: PageRank can recommend related or similar content on a website
or platform. By analyzing the link structure between different pages or articles, the algorithm
can identify related pages and suggest them to users as recommended content.
6. Fraud detection: PageRank can be used in fraud detection systems to identify suspicious
patterns or anomalies. By analyzing connections between entities, such as financial transactions
or network communications, the algorithm can flag potentially fraudulent nodes or transactions
based on their influence on the network.
It is important to note that while the original PageRank algorithm was created to rank web pages,
variations and adaptations of the algorithm were developed to serve specific applications and domains,
and the approach was adapted to the unique characteristics of the data analyzed.

-------------------------------------------------------------------------------------------------------------------

CHAPTER IX: LEARNING TO RANK

Topics covered: Algorithms and Techniques, Supervised learning for ranking: RankSVM, RankBoost,
Pairwise and listwise learning to rank approaches Evaluation metrics for learning to rank
-------------------------------------------------------------------------------------------------------------------
Algorithms and Techniques:
PageRank is a key algorithm developed by Larry Page and Sergey Brin, the founders of Google, as part
of their early work on the Google search engine. It revolutionized web search by introducing a method
for ranking web pages based on their importance and relevance, as determined by the structure of the
web itself. Here are some of the key algorithms and techniques used in PageRank:
1. Link Analysis: PageRank is fundamentally a link analysis algorithm. It assigns a numerical
weight, or PageRank score, to each webpage in a network of hyperlinked documents based on
the quantity and quality of inbound links it receives from other pages.
2. Random Walk Model: The PageRank algorithm models web users as random surfers
navigating the web by following hyperlinks from one page to another. The probability that a
surfer will move from one page to another is determined by the number and quality of links on
each page.
3. Graph Theory: PageRank views the web as a directed graph, where web pages are nodes and
hyperlinks between pages are edges. The algorithm applies principles from graph theory to
analyze the structure of the web graph and compute PageRank scores for individual pages.
4. Transition Matrix: PageRank represents the web graph as a transition matrix, where each
element represents the probability of transitioning from one page to another via a hyperlink.
The transition matrix is typically sparse and can be efficiently manipulated using matrix
operations.
5. Iterative Algorithm: PageRank is computed iteratively using an iterative algorithm that
repeatedly updates the PageRank scores of web pages until convergence is achieved. In each
iteration, the PageRank scores are recalculated based on the current estimates and the link
structure of the web graph.
6. Damping Factor: To model the behavior of real users who may occasionally jump to a random
page instead of following a hyperlink, PageRank introduces a damping factor (usually denoted
as d) between 0 and 1. The damping factor represents the probability that a random surfer
will continue following links rather than jumping to a random page.
7. Teleportation: The damping factor also introduces the concept of teleportation, where a
random surfer has a probability 1 - d of jumping to any page in the web graph, regardless of
its link structure. Teleportation helps ensure that the PageRank algorithm converges to a unique
solution even for disconnected or poorly connected web graphs.
8. Convergence Criteria: PageRank iterates until the PageRank scores stabilize, indicating that
the algorithm has converged to a stable solution. Convergence criteria may include a maximum
number of iterations or a threshold for the change in PageRank scores between iterations.
9. Handling Dead Ends and Spider Traps: PageRank algorithms incorporate techniques to
handle dead ends (pages with no outgoing links) and spider traps (cycles of links that trap the
random surfer). These techniques ensure that the algorithm converges and produces
meaningful results even for complex web graphs.
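The transition-matrix and power-iteration view in points 4-8 can be sketched with NumPy (assumed
available). The matrix below encodes the four-page example used earlier in these notes, with the
dangling pages C and D treated as linking to every page uniformly, which is an assumption of the
sketch.

import numpy as np

d, n = 0.85, 4
# Column-stochastic transition matrix: entry (i, j) is the probability of
# moving from page j to page i. Columns are ordered A, B, C, D.
M = np.array([
    [0.0, 0.0, 0.25, 0.25],   # links into A
    [1/3, 0.0, 0.25, 0.25],   # links into B
    [1/3, 0.5, 0.25, 0.25],   # links into C
    [1/3, 0.5, 0.25, 0.25],   # links into D
])

r = np.full(n, 1.0 / n)                      # uniform starting vector
for _ in range(100):                         # power iteration
    r_next = (1 - d) / n + d * M @ r         # damping + teleportation term
    if np.abs(r_next - r).sum() < 1e-10:     # convergence check
        break
    r = r_next
print(dict(zip("ABCD", np.round(r, 4))))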
-------------------------------------------------------------------------------------------------------------------

Supervised learning for ranking refers to the process of training machine learning models to rank items
or documents based on their relevance to a given query or context. Several algorithms have been
developed for this purpose, including RankSVM, RankBoost, and pairwise learning methods. Here's an
overview of each approach:
1. RankSVM (Ranking Support Vector Machine):
• RankSVM is an extension of the traditional Support Vector Machine (SVM) algorithm
tailored for ranking tasks.
• In RankSVM, the goal is to learn a ranking function that maps input features (e.g.,
document features, query-document features) to a ranking score that reflects the
relevance of documents to a query.
• RankSVM optimizes a loss function that penalizes the deviation of the predicted ranking
from the true ranking based on labeled training data.
• The optimization process involves solving a constrained optimization problem to find the
optimal separating hyperplane between relevant and irrelevant documents.
• RankSVM is capable of learning complex non-linear ranking functions and can handle
large feature spaces effectively.

Ranking Support Vector Machine (RankSVM) is a supervised learning algorithm designed specifically for
ranking tasks. It extends the traditional Support Vector Machine (SVM) algorithm to learn a ranking
function that can rank items, documents, or data points based on their relevance to a given query or
context. Here's a detailed overview of RankSVM:
1. Objective:
• The primary objective of RankSVM is to learn a ranking function that assigns a numerical score
to each item or document, reflecting its relevance to a query or context.
2. Training Data:
• RankSVM requires labeled training data in the form of query-document pairs, where each pair is
associated with a relevance judgment or relevance score.
• Each query-document pair is represented by a feature vector that captures relevant information
about the query and the document.
3. Margin-based Ranking:
• RankSVM learns a ranking function by maximizing the margin between relevant and irrelevant
documents in the feature space.
• The margin represents the separation between documents that should be ranked higher and
those that should be ranked lower according to the relevance judgments.
4. Optimization:
• RankSVM optimizes a convex objective function that penalizes violations of pairwise ranking
constraints.
• Pairwise ranking constraints specify that the predicted ranking of relevant documents should be
higher than that of irrelevant documents for the same query.
5. Loss Function:
• RankSVM typically employs a hinge loss function to penalize deviations from the correct
pairwise ranking.
• The hinge loss encourages the correct ordering of document pairs by imposing larger penalties
for violating pairwise ranking constraints.
6. Kernel Trick:
• Like traditional SVMs, RankSVM can utilize the kernel trick to handle non-linear relationships
between features and learn complex ranking functions.
• Common kernel functions used in RankSVM include linear, polynomial, radial basis function
(RBF), and sigmoid kernels.
7. Regularization:
• RankSVM incorporates regularization terms to prevent overfitting and improve generalization
performance.
• Regularization parameters control the trade-off between maximizing the margin and minimizing
the classification error on the training data.
8. Prediction:
• Once trained, RankSVM can rank new documents or items based on their feature vectors and
the learned ranking function.
• Documents with higher predicted scores are ranked higher in the final list of search results or
recommendations.
9. Evaluation:
• The performance of RankSVM models is typically evaluated using ranking metrics such as Mean
Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), Precision at k, and
Mean Reciprocal Rank (MRR).
10. Applications:
• RankSVM has been widely used in information retrieval, web search, recommendation systems,
and other domains where ranking plays a critical role in user experience and decision-making.
Overall, RankSVM provides a principled framework for learning ranking functions from labeled training
data, enabling effective ranking of items or documents based on their relevance to users' queries or
preferences.
-------------------------------------------------------------------------------------------------------------------
How RankSVM works:
1. Training Data Preparation:
• RankSVM requires training data in the form of query-document pairs, where each pair is
associated with a relevance judgment or relevance score.
• Features are extracted for each query-document pair. These features can include various
characteristics such as keyword matches, document length, metadata, and more.
• Additionally, each pair is labelled with a relevance judgment, typically represented as a
binary label (relevant or non-relevant) or a graded relevance score.
2. Pairwise Ranking Constraints:
• RankSVM learns by enforcing pairwise ranking constraints on the training data.
• The pairwise ranking constraints specify that the predicted ranking of relevant
documents should be higher than that of non-relevant documents for the same query.
3. Margin Maximization:

• RankSVM aims to learn a ranking function that maximizes the margin between relevant
and non-relevant documents in the feature space.
• The margin represents the separation between documents that should be ranked higher
and those that should be ranked lower based on their relevance.
4. Optimization:
• RankSVM optimizes a convex objective function, typically using techniques such as
gradient descent or quadratic programming.
• The objective function includes a regularization term to prevent overfitting and a loss
term that penalizes violations of pairwise ranking constraints.
• The regularization parameter controls the trade-off between maximizing the margin and
minimizing the classification error on the training data.
5. Kernel Trick:
• Like traditional SVMs, RankSVM can utilize the kernel trick to handle non-linear
relationships between features.
• Common kernel functions include linear, polynomial, radial basis function (RBF), and
sigmoid kernels.
• The choice of kernel function depends on the characteristics of the data and the
complexity of the ranking problem.
6. Prediction:
• Once trained, RankSVM can rank new documents or items based on their feature vectors
and the learned ranking function.
• Documents with higher predicted scores are ranked higher in the final list of search
results or recommendations.
7. Evaluation:
• The performance of RankSVM models is typically evaluated using ranking metrics such
as Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG),
Precision at k, and Mean Reciprocal Rank (MRR).
• These metrics assess the quality of the ranked lists produced by the model and measure
its effectiveness in retrieving relevant documents.
Overall, RankSVM provides a powerful framework for learning ranking functions from labelled training
data and is widely used in applications such as information retrieval, web search, recommendation
systems, and more.
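A common and simple way to realize RankSVM in practice is the pairwise-difference formulation:
every pair of documents with different relevance for the same query is turned into a difference
vector, and a linear SVM with hinge loss is trained to classify the direction of the difference. The
sketch below assumes scikit-learn is installed and uses tiny synthetic feature vectors and relevance
labels; it illustrates the idea rather than providing a production implementation.

import numpy as np
from sklearn.svm import LinearSVC

# One query: each row is a document's feature vector, with a graded relevance label.
X = np.array([[3.0, 0.1], [2.0, 0.4], [1.0, 0.9], [0.5, 0.2]])
rel = np.array([3, 2, 1, 0])

# Build pairwise difference vectors: (x_i - x_j) labelled +1 when doc i is more
# relevant than doc j, and the mirrored pair labelled -1.
diffs, labels = [], []
for i in range(len(X)):
    for j in range(len(X)):
        if rel[i] > rel[j]:
            diffs.append(X[i] - X[j]); labels.append(1)
            diffs.append(X[j] - X[i]); labels.append(-1)

svm = LinearSVC(C=1.0)                     # hinge loss on the pairwise constraints
svm.fit(np.array(diffs), np.array(labels))

# The learned weight vector acts as the ranking function: higher score, higher rank.
scores = X @ svm.coef_.ravel()
print(np.argsort(-scores))                 # document indices in predicted rank order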
-------------------------------------------------------------------------------------------------------------------
2. RankBoost:
• RankBoost is a boosting algorithm designed for ranking tasks, inspired by AdaBoost.
• In RankBoost, weak rankers are trained sequentially to improve the overall ranking
performance.
• At each iteration, RankBoost assigns weights to training examples based on their ranking
errors from previous iterations.
• The weak rankers are trained to minimize the weighted ranking errors, with the objective
of improving the overall ranking accuracy.
• RankBoost iteratively combines multiple weak rankers to construct a strong ranking
model that optimizes a predefined performance measure, such as Mean Average
Precision (MAP) or Normalized Discounted Cumulative Gain (NDCG).
RankBoost is a supervised learning algorithm specifically designed for ranking tasks in the field of
Information Retrieval (IR). It belongs to the class of boosting algorithms, which sequentially combine
weak learners to create a strong learner capable of making accurate predictions. Here's a detailed
explanation of how RankBoost works in the context of Information Retrieval:
1. Objective:
• The primary objective of RankBoost in Information Retrieval is to learn a ranking function that
can effectively order documents based on their relevance to a given query.
2. Weak Rankers:
• RankBoost utilizes weak rankers as its basic building blocks. These weak rankers are typically
decision stumps or shallow decision trees.
• A decision stump is a decision tree with only one split node, making a binary decision based on
a single feature or feature combination.
3. Training Data:
• RankBoost requires labeled training data in the form of query-document pairs, where each pair
is associated with a relevance judgment or relevance score.
• Each query-document pair is represented by a feature vector that captures relevant
characteristics of the query and the document.
4. Weighting Training Examples:
• At the start of the training process, each training example is assigned an equal weight.

• During subsequent iterations, RankBoost adjusts the weights of training examples based on
their performance in the previous iteration.
• Examples that are misranked by the current set of weak rankers are assigned higher weights,
while correctly ranked examples receive lower weights.
5. Boosting Iterations:
• RankBoost iterates through a predefined number of boosting rounds or iterations.
• In each iteration, RankBoost selects the weak ranker that minimizes a weighted error function
over the training examples.
• The weak ranker is chosen to improve the ranking performance by focusing on examples that
are currently misranked.
6. Weighting Weak Rankers:
• Each weak ranker is assigned a weight proportional to its performance in reducing the training
error.
• Weak rankers that contribute more to the overall improvement of the ranking performance are
assigned higher weights, while weaker rankers receive lower weights.
7. Combining Weak Rankers:
• The weak rankers are combined into a weighted sum to produce the final ranking.
• The contribution of each weak ranker to the final ranking is determined by its weight, which
reflects its importance in improving the overall ranking performance.
8. Output:
• Once trained, RankBoost can rank new documents based on their feature vectors and the
ensemble of weak rankers.
• Documents with higher predicted scores are ranked higher in the final list of search results or
recommendations.
9. Evaluation:
• The performance of RankBoost models in Information Retrieval is typically evaluated using
ranking metrics such as Mean Average Precision (MAP), Normalized Discounted Cumulative Gain
(NDCG), Precision at k, and Mean Reciprocal Rank (MRR).
• These metrics assess the quality of the ranked lists produced by the model and measure its
effectiveness in retrieving relevant documents.
RankBoost is a powerful algorithm for learning ranking functions from labelled training data and is
commonly used in Information Retrieval applications such as web search, document retrieval,
recommendation systems, and more.
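The boosting loop can be sketched in simplified form. In the sketch below, weak rankers are
threshold rules on a single feature, the distribution over document pairs is re-weighted toward
pairs the chosen weak ranker still misorders, and the final score is the weighted sum of weak
rankers. The data is synthetic, and several practical details (smarter weak-learner search, stopping
rules) are omitted.

import numpy as np

X = np.array([[3.0, 0.1], [2.0, 0.4], [1.0, 0.9], [0.5, 0.2]])
rel = np.array([3, 2, 1, 0])
# Pairs (lo, hi): document `hi` should be ranked above document `lo`.
pairs = [(j, i) for i in range(len(X)) for j in range(len(X)) if rel[i] > rel[j]]
D = np.full(len(pairs), 1.0 / len(pairs))           # uniform initial pair weights

def stump(feature, threshold):
    return lambda x: float(x[feature] > threshold)  # binary weak ranker

rankers, alphas = [], []
for _ in range(10):                                 # boosting rounds
    best_r, best_h = 0.0, None
    for f in range(X.shape[1]):                     # search candidate stumps
        for theta in np.unique(X[:, f]):
            h = stump(f, theta)
            r = sum(D[k] * (h(X[hi]) - h(X[lo])) for k, (lo, hi) in enumerate(pairs))
            if abs(r) > abs(best_r):
                best_r, best_h = r, h
    if best_h is None or abs(best_r) >= 1.0:        # nothing useful left to learn
        break
    alpha = 0.5 * np.log((1 + best_r) / (1 - best_r))
    rankers.append(best_h); alphas.append(alpha)
    # Raise the weight of pairs this weak ranker fails to order correctly.
    D *= np.array([np.exp(alpha * (best_h(X[lo]) - best_h(X[hi]))) for lo, hi in pairs])
    D /= D.sum()

scores = np.array([sum(a * h(x) for a, h in zip(alphas, rankers)) for x in X])
print(np.argsort(-scores))                          # predicted rank order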
-------------------------------------------------------------------------------------------------------------------
3. Pairwise Learning:
• Pairwise learning methods aim to learn ranking models by directly comparing pairs of
items or documents.
• In pairwise learning, training examples consist of pairs of documents, where each pair is
labelled with the relative preference or relevance between the documents.
• Common pairwise learning algorithms include RankNet, RankBoost, and RankSVM, which
learn to rank documents based on their pairwise preferences.
• Pairwise learning methods typically optimize a loss function that encourages the correct
ranking of document pairs according to their labels.
• Pairwise learning approaches are effective for learning ranking models when only
pairwise preference information is available, as opposed to explicitly labelled relevance
scores for individual documents.
Overall, supervised learning for ranking algorithms such as RankSVM, RankBoost, and pairwise
learning methods provide powerful tools for building ranking models that can effectively rank items or
documents based on their relevance to users' queries or preferences. These algorithms have been
widely used in information retrieval, web search, recommendation systems, and other applications
where ranking plays a crucial role in user experience and decision-making.

Pairwise learning in Information Retrieval (IR) refers to a supervised learning approach where models
are trained to rank items or documents based on their pairwise relationships. In pairwise learning,
training examples consist of pairs of items or documents, and the model learns to predict which item in
the pair is more relevant or preferable to a given query or context. Here's how pairwise learning works
in IR:
1. Training Data:
• Pairwise learning requires labeled training data in the form of query-document pairs, where
each pair is labeled with a relevance judgment or preference.
• Each pair consists of two documents: one document that is considered more relevant or
preferable (positive example) and another document that is considered less relevant or
preferable (negative example).
2. Feature Extraction:

• For each query-document pair, relevant features are extracted to represent the characteristics
of the query and the documents.
• These features can include keyword matches, document metadata, document length, relevance
signals, and other relevant attributes.
3. Pairwise Ranking:
• Pairwise learning models aim to learn a ranking function that can correctly rank pairs of
documents based on their relevance to the query.
• The model is trained to predict the preference or relevance of one document over another
within each pair.
4. Loss Function:
• Pairwise learning algorithms typically optimize a loss function that penalizes deviations from the
correct pairwise ranking.
• The loss function encourages the model to correctly order pairs of documents according to their
relevance judgments.
5. Model Training:
• During training, the pairwise learning algorithm iteratively updates the model parameters to
minimize the loss function over the training data.
• Optimization techniques such as stochastic gradient descent (SGD), gradient boosting, or
convex optimization methods are commonly used to update the model parameters.
6. Evaluation:
• The performance of pairwise learning models in IR is evaluated using ranking metrics such as
Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), Precision at k,
and Mean Reciprocal Rank (MRR).
• These metrics assess the quality of the ranked lists produced by the model and measure its
effectiveness in retrieving relevant documents.
7. Applications:
• Pairwise learning is widely used in various IR applications such as web search, document
retrieval, recommendation systems, and more.
• It provides a principled approach to learning ranking functions from labeled training data and
can effectively address the complexities of relevance ranking in IR tasks.
In summary, pairwise learning is a powerful approach in Information Retrieval for training ranking
models based on pairwise relationships between items or documents. It allows models to learn from
explicit preferences and relevance judgments, leading to more accurate and effective ranking
outcomes.
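A tiny numerical illustration of the pairwise idea (in the spirit of RankNet, mentioned above): the
probability that document i should be ranked above document j is modelled with a logistic function
of the score difference, and the loss penalizes orderings that contradict the relevance judgment.
The scores below are made-up values.

import math

def pair_probability(score_i, score_j):
    # Modelled probability that document i should be ranked above document j.
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))

def pairwise_loss(score_i, score_j, i_is_more_relevant=True):
    p = pair_probability(score_i, score_j)
    return -math.log(p) if i_is_more_relevant else -math.log(1.0 - p)

print(pairwise_loss(2.3, 0.7))   # small loss: the more relevant doc scores higher
print(pairwise_loss(0.7, 2.3))   # large loss: the pair is misordered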
-------------------------------------------------------------------------------------------------------------------
Listwise learning
Listwise learning to rank approaches involve training models that directly optimize ranking metrics
based on lists of items or documents. Instead of focusing on pairwise preferences or relevance
judgments, listwise methods consider the entire ranked list of documents and optimize a loss function
that reflects the quality of the entire list. Here are some popular listwise learning to rank approaches
and evaluation metrics used in learning to rank:
Listwise Learning to Rank Approaches:
1. LambdaMART:
• LambdaMART is a popular listwise learning to rank algorithm based on gradient boosting
trees.
• It optimizes a listwise objective function by combining gradient boosting with a
LambdaRank loss function.
2. ListNet:
• ListNet is a listwise learning to rank algorithm based on neural networks.
• It uses a softmax function to model the probability distribution over permutations of the
input list and minimizes the cross-entropy loss.
3. ListMLE (Listwise Maximum Likelihood Estimation):
• ListMLE directly maximizes the likelihood of observing the correct ranking of documents
in the training data.
• It treats the list of documents as a sequence and models the joint probability of
observing the correct permutation of the list.
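As a small sketch of the listwise idea behind ListNet: the ground-truth relevance scores and the
model scores for one query's list are both turned into "top-one" probability distributions with a
softmax, and the listwise loss is the cross-entropy between them. The numbers below are synthetic.

import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))                  # numerically stable softmax
    return e / e.sum()

def listnet_loss(true_scores, model_scores):
    p_true = softmax(np.asarray(true_scores, dtype=float))
    p_model = softmax(np.asarray(model_scores, dtype=float))
    return -np.sum(p_true * np.log(p_model))   # cross-entropy between distributions

print(listnet_loss([3, 2, 0], [2.5, 1.0, 0.2]))   # well-ordered list: lower loss
print(listnet_loss([3, 2, 0], [0.2, 1.0, 2.5]))   # reversed list: higher loss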
Evaluation Metrics for Learning to Rank:
1. Mean Average Precision (MAP):
• MAP computes the average precision across all queries in the evaluation dataset.
• It rewards systems that return relevant documents higher in the ranked list.
2. Normalized Discounted Cumulative Gain (NDCG):
• NDCG measures the effectiveness of a ranked list by considering both the relevance and
the rank position of each document.

• It discounts the gain of documents at lower ranks and normalizes the score to be
between 0 and 1.
3. Precision at k:
• Precision at k measures the proportion of relevant documents among the top k ranked
documents.
• It evaluates the precision of a ranking system at various cutoff points.
4. Mean Reciprocal Rank (MRR):
• MRR calculates the average reciprocal rank of the first relevant document across all
queries.
• It provides a single score indicating the effectiveness of a ranking system in returning
relevant documents early in the ranked list.
5. Expected Reciprocal Rank (ERR):
• ERR estimates the expected reciprocal rank of a query result based on the relevance of
the documents.
• It models the probability that a user will stop examining the results after a certain rank
position.
These evaluation metrics provide a comprehensive assessment of the quality and effectiveness of
learning to rank models in retrieving relevant documents or items. They are widely used in benchmark
datasets and competitions to evaluate and compare different learning to rank algorithms and
approaches.
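The metrics above are straightforward to compute for a single ranked list. The sketch below gives
small reference implementations for one query (MAP and MRR would average these values over many
queries); the relevance judgments are synthetic, and the NDCG variant uses the common 2^rel - 1
gain.

import math

ranked_relevance = [3, 2, 0, 1, 0]            # graded relevance, in ranked order
binary = [1 if r > 0 else 0 for r in ranked_relevance]

def precision_at_k(rels, k):
    return sum(rels[:k]) / k

def reciprocal_rank(rels):
    for i, r in enumerate(rels, start=1):
        if r:
            return 1.0 / i
    return 0.0

def average_precision(rels):
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / i                  # precision at each relevant position
    return total / max(hits, 1)

def dcg(rels, k):
    return sum((2 ** r - 1) / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))

def ndcg(rels, k):
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

print(precision_at_k(binary, 3), reciprocal_rank(binary),
      average_precision(binary), ndcg(ranked_relevance, 5))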
-------------------------------------------------------------------------------------------------------------------
CHAPTER X: LINK ANALYSIS AND ITS ROLE IN IR SYSTEMS
Topics covered: Web graph representation and link analysis algorithms, HITS and PageRank
algorithms, Applications of link analysis in IR systems
-------------------------------------------------------------------------------------------------------------------
Web graph representation and link analysis algorithms:

The web graph is a conceptual representation of the World Wide Web, where web pages are
represented as nodes and hyperlinks between pages are represented as edges. It forms the backbone
of web search engines and plays a crucial role in various web-related tasks. Here's an overview of web
graph representation and link analysis algorithms:

Web Graph Representation:


1. Nodes and Edges:
• Web pages are represented as nodes in the web graph, and hyperlinks between pages
are represented as directed edges.
• A directed edge from page A to page B indicates that page A contains a hyperlink
pointing to page B.
2. Directed Graph:
• The web graph is typically represented as a directed graph, as hyperlinks are inherently
directional, pointing from one page to another.
3. Node Attributes:
• Nodes in the web graph may have attributes such as URL, content, metadata, and other
features that provide additional information about the corresponding web pages.

Web graph representation is a fundamental concept in web science and information retrieval, where
the structure of the World Wide Web is abstracted into a graph-like structure. Here's a detailed
explanation of web graph representation:
1. Nodes and Edges:
• Nodes: In the web graph, nodes represent web pages or documents accessible on the World
Wide Web. Each node corresponds to a unique URL.
• Edges: Edges represent hyperlinks between web pages. If page A contains a hyperlink pointing
to page B, there exists a directed edge from node A to node B in the web graph.
2. Directed Graph:
• The web graph is a directed graph because hyperlinks have directionality. A hyperlink from page
A to page B does not imply a link from page B to page A.
• This directed nature reflects the inherent structure of the web, where pages can link to other
pages without reciprocation.
3. Representation:
• Adjacency List: One common representation of the web graph is using an adjacency list. In
this representation, each node is associated with a list of its outgoing links (nodes it points to).
• Adjacency Matrix: Another representation is using an adjacency matrix, where rows and
columns correspond to nodes, and entries indicate the presence or absence of edges between
nodes.

4. Node Attributes:
• In addition to URLs, nodes in the web graph may have associated attributes such as content,
metadata, popularity metrics, and other information.
• These attributes provide additional context about the pages and can be used for indexing,
ranking, and retrieval purposes.
5. Directed Connectivity:
• The web graph exhibits a rich structure of directed connectivity. Pages can be densely
interconnected within specific domains or topics, while also linking to pages across different
domains.
• Directed connectivity influences the flow of information and navigation patterns on the web.
6. Large Scale:
• The web graph is immense in scale, comprising billions of nodes and edges. As a result,
representing and analyzing the entire web graph is a significant computational challenge.
• Techniques such as graph partitioning, sampling, and distributed computing are employed to
handle the scale of the web graph.
7. Dynamic Nature:
• The web graph is dynamic and constantly evolving as new pages are created, existing pages are
updated, and links are added or removed.
• Representing and analyzing the dynamics of the web graph is crucial for understanding trends,
patterns, and changes in web content and structure over time.
8. Applications:
• Web graph representation is foundational to various web-related applications, including web
search engines, link analysis algorithms, information retrieval systems, social network analysis,
and more.
• Understanding the structure and properties of the web graph enables the development of
effective algorithms and techniques for indexing, ranking, and navigating web content.
In summary, web graph representation provides a structured framework for understanding the
complex interconnections and relationships among web pages on the World Wide Web. It serves as the
basis for numerous web-related tasks and forms the cornerstone of modern web science and
information retrieval.
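To make the two representations concrete, here is a minimal Python sketch of the same idea on a toy four-page graph stored first as an adjacency list and then as an adjacency matrix; the page names and links are invented purely for illustration.

# A small sketch of the two web-graph representations described above,
# using a toy four-page graph ("A" links to "B" and "C", and so on).
pages = ["A", "B", "C", "D"]

# Adjacency list: each page maps to the list of pages it links to.
adj_list = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

# Adjacency matrix: adj_matrix[i][j] = 1 if pages[i] links to pages[j].
index = {p: i for i, p in enumerate(pages)}
adj_matrix = [[0] * len(pages) for _ in pages]
for src, targets in adj_list.items():
    for dst in targets:
        adj_matrix[index[src]][index[dst]] = 1

print(adj_matrix)  # [[0, 1, 1, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0]]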
-------------------------------------------------------------------------------------------------------------------
Link Analysis Algorithms:
1. PageRank:
• PageRank is a link analysis algorithm developed by Google's Larry Page and Sergey Brin.
• It assigns a numerical weight (PageRank score) to each page in the web graph based on
the quantity and quality of inbound links it receives from other pages.
• PageRank operates on the principle that pages with higher inbound link counts from
authoritative pages are likely to be more important and relevant.
• The algorithm iteratively computes PageRank scores until convergence, taking into
account damping factors and teleportation to handle dead ends and spider traps.
2. HITS (Hypertext Induced Topic Selection):
• HITS is another link analysis algorithm that evaluates the authority and hub scores of
web pages.
• Authority scores measure a page's importance based on the number and quality of
inbound links it receives.
• Hub scores measure a page's capability to link to authoritative pages.
• HITS operates iteratively, updating authority and hub scores until convergence is
achieved.
3. Salton's Cosine Similarity:
• Salton's Cosine Similarity is a similarity measure used in information retrieval to assess
the relevance of documents to a query.
• It calculates the cosine of the angle between the vector representations of the query and
the document in a high-dimensional vector space.
• Documents with higher cosine similarity values are considered more relevant to the
query.
• Strictly speaking, it is a content-based similarity measure rather than a link analysis
algorithm, but it is often combined with link-based scores when ranking pages.
4. Kleinberg's Hubs and Authorities Algorithm:
• Kleinberg's hubs and authorities algorithm is the original formulation of HITS, proposed
by Jon Kleinberg; it defines mutually reinforcing authority and hub scores over the web's
link structure.
• It identifies hubs as pages that point to many authorities and authorities as pages that
are pointed to by many hubs.
• The algorithm iteratively updates hub and authority scores based on the connectivity
structure of the web graph.
5. TrustRank:
• TrustRank is a variant of PageRank that aims to combat web spam and identify
trustworthy pages.
• It starts with a seed set of trusted pages and propagates trust scores through the web
graph, discounting the influence of untrustworthy pages.
• TrustRank helps search engines prioritize trustworthy pages in search results.
These algorithms and representations form the basis of web search engines and information retrieval
systems, enabling efficient indexing, ranking, and retrieval of web pages based on their connectivity
and relevance.

Link analysis algorithms in Information Retrieval (IR) are techniques used to analyze the relationships
and interconnections between web pages or documents. These algorithms help assess the importance,
authority, and relevance of documents based on the structure of hyperlinks between them. Here's a
detailed explanation of link analysis algorithms in IR (small PageRank and HITS sketches follow at the end of this section):
1. PageRank Algorithm:
• Concept: PageRank, developed by Larry Page and Sergey Brin at Google, assigns a numerical
weight (PageRank score) to each page in the web graph based on the quantity and quality of
inbound links it receives from other pages.
• Working: PageRank operates on the principle that pages with higher inbound link counts from
authoritative pages are likely to be more important and relevant.
• Algorithm: It iteratively computes PageRank scores until convergence, taking into account
damping factors and teleportation to handle dead ends and spider traps.
• Application: PageRank is widely used by search engines to rank search results based on the
importance and relevance of web pages.

2. HITS Algorithm (Hypertext Induced Topic Selection):
• Concept: HITS evaluates the authority and hub scores of web pages. Authority scores measure
a page's importance based on the number and quality of inbound links it receives. Hub scores
measure a page's capability to link to authoritative pages.
• Working: HITS operates iteratively, updating authority and hub scores until convergence is
achieved.
• Algorithm: It analyzes the connectivity structure of the web graph to identify authoritative
pages and hubs within the network.
• Application: HITS is used in information retrieval, link analysis, and web search to identify
important pages and understand the topical structure of the web.

3. Kleinberg's Hubs and Authorities Algorithm:
• Concept: Kleinberg's hubs and authorities algorithm is the original formulation of HITS,
proposed by Jon Kleinberg; it defines mutually reinforcing authority and hub scores over the
web's link structure.
• Working: It identifies hubs as pages that point to many authorities and authorities as pages
that are pointed to by many hubs.
• Algorithm: The algorithm iteratively updates hub and authority scores based on the
connectivity structure of the web graph.
• Application: Kleinberg's algorithm helps analyze the structure and connectivity patterns of
large-scale networks, including the World Wide Web.

4. TrustRank Algorithm:
• Concept: TrustRank is a variant of PageRank that aims to combat web spam and identify
trustworthy pages.
• Working: It starts with a seed set of trusted pages and propagates trust scores through the
web graph, discounting the influence of untrustworthy pages.
• Algorithm: TrustRank helps search engines prioritize trustworthy pages in search results and
improve the quality of search engine rankings.
• Application: TrustRank is used to enhance the credibility and reliability of search results by
identifying and filtering out spammy or low-quality web pages.

5. SALSA Algorithm (Stochastic Approach for Link-Structure Analysis):
• Concept: SALSA is a link analysis algorithm that extends HITS by incorporating authority and
hub scores based on random walks on the web graph.
• Working: It models the behavior of web surfers navigating the web by considering both the
relevance of the page content and the quality of its links.
• Algorithm: SALSA uses iterative computation to update authority and hub scores based on the
stochastic transition matrix of the web graph.
• Application: SALSA is used for web search, information retrieval, and link analysis to identify
authoritative pages and hubs in large-scale networks.
These link analysis algorithms play a crucial role in information retrieval, web search, and web mining
by analyzing the relationships between documents and determining their importance, relevance, and
credibility based on hyperlink structure. They form the foundation of modern search engine algorithms
and contribute to the effectiveness and relevance of search results on the World Wide Web.
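The following is a minimal, self-contained Python sketch of PageRank (power iteration with a damping factor) and HITS (mutually reinforcing hub and authority updates) on a toy four-page graph. It is an illustrative sketch, not the production algorithms used by search engines; the graph, damping factor, and iteration counts are chosen only for the example.

# PageRank and HITS sketches on a toy web graph.
# The graph is a dict mapping each page to the list of pages it links to.
import numpy as np

def pagerank(graph, d=0.85, iters=50):
    nodes = list(graph)
    n = len(nodes)
    idx = {p: i for i, p in enumerate(nodes)}
    M = np.zeros((n, n))               # column-stochastic link matrix
    for page, links in graph.items():
        if links:
            for target in links:
                M[idx[target], idx[page]] = 1.0 / len(links)
        else:                          # dangling page: spread its weight uniformly
            M[:, idx[page]] = 1.0 / n
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (M @ r)  # teleportation term + following links
    return dict(zip(nodes, r))

def hits(graph, iters=50):
    nodes = list(graph)
    auth = {p: 1.0 for p in nodes}
    hub = {p: 1.0 for p in nodes}
    for _ in range(iters):
        # Authority of p: sum of hub scores of the pages that link to p.
        auth = {p: sum(hub[q] for q in nodes if p in graph[q]) for p in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub of p: sum of authority scores of the pages p links to.
        hub = {p: sum(auth[q] for q in graph[p]) for p in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(toy_graph))
print(hits(toy_graph))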
-------------------------------------------------------------------------------------------------------------------
Applications of link analysis in IR systems
Link analysis plays a significant role in Information Retrieval (IR) systems, particularly in analyzing and
leveraging the relationships between documents, web pages, or entities. Here are some key
applications of link analysis in IR systems:
1. Web Search Engines:
• Link analysis algorithms such as PageRank and HITS are used by search engines to rank
search results.
• Pages with higher PageRank scores, which are determined by the quality and quantity of
inbound links, are often ranked higher in search engine results pages (SERPs).
• Link analysis helps search engines identify authoritative pages and prioritize them in
search results, improving the relevance and quality of search outcomes.
2. Web Crawling and Indexing:
• Link analysis guides the process of web crawling and indexing by determining which
pages to crawl and how frequently to revisit them.
• Crawlers follow hyperlinks from one page to another, discovering new content and
updating the search engine's index.
• Link analysis algorithms help prioritize pages for crawling based on their importance,
popularity, and relevance.
3. Recommender Systems:
• Link analysis techniques can be applied to build recommendation systems that suggest
relevant content to users based on their preferences and browsing history.
• By analyzing the relationships between users, items, and interactions (e.g., clicks, views,
purchases), recommender systems can identify patterns and make personalized
recommendations.
4. Social Network Analysis:
• In social network analysis, link analysis is used to study the relationships between
individuals or entities in a network.
• Algorithms such as HITS and SALSA help identify influential nodes (hubs) and
authoritative nodes (authorities) in social networks.
• Link analysis enables the detection of communities, influential users, and patterns of
influence propagation in social networks.
5. Text Mining and Document Analysis:
• Link analysis techniques can be applied to analyze the citation networks in academic
literature or the hyperlink structure in web documents.
• By examining the relationships between documents, researchers can identify influential
papers, emerging trends, and research communities.
• Link analysis facilitates the exploration of scholarly networks and the discovery of
relevant research in a particular field.
6. Spam Detection and Trustworthiness Evaluation:
• Link analysis algorithms such as TrustRank help detect spammy or low-quality pages by
analyzing their link profiles.
• Pages with suspicious linking patterns or links from untrustworthy sources may be
penalized or filtered out from search results.
• Link analysis assists in evaluating the trustworthiness and credibility of web pages and
identifying potential sources of misinformation or fraud.
In summary, link analysis is a versatile technique with diverse applications in Information Retrieval
systems. By analyzing the relationships and connections between documents, entities, or users, link
analysis algorithms enhance the effectiveness, relevance, and reliability of IR systems across various
domains and applications.
-------------------------------------------------------------------------------------------------------------------
CHAPTER XI: CRAWLING AND NEAR-DUPLICATE PAGE DETECTION

Topics covered: Web page crawling techniques: breadth-first, depth-first, focused crawling, Near-
duplicate page detection algorithms, Handling dynamic web content during crawling
-------------------------------------------------------------------------------------------------------------------
Web page crawling techniques
Web page crawling is the process of systematically browsing the World Wide Web to discover and
retrieve web pages for indexing by search engines or other purposes. Here are some common web
page crawling techniques:
1. Breadth-First Crawling:
• Breadth-first crawling starts with a set of seed URLs and systematically explores web pages by
visiting pages at each level of depth before moving to the next level.
• It ensures that pages closer to the seed URLs are crawled first, gradually expanding the crawl
frontier outward.
2. Depth-First Crawling:
• Depth-first crawling prioritizes exploring pages at deeper levels of the web page hierarchy
before visiting pages at shallower levels.
• It may be useful for certain scenarios, but it can lead to deep, narrow crawls that may not cover
a wide range of content.
3. Focused Crawling:
• Focused crawling aims to crawl specific areas of the web that are relevant to a particular topic,
domain, or set of keywords.
• It uses heuristics, content analysis, or relevance feedback to identify and prioritize pages
related to the target topic.
4. Parallel Crawling:
• Parallel crawling involves running multiple crawlers concurrently to crawl different parts of the
web simultaneously.
• It improves the efficiency and speed of crawling by distributing the workload across multiple
threads, processes, or machines.
5. Incremental Crawling:
• Incremental crawling focuses on updating the index with new or modified content since the last
crawl.
• It uses techniques such as timestamp comparison, change detection, or crawling frequency
adjustments to identify and prioritize pages that have been updated or added recently.
6. Politeness and Crawling Ethics:
• Politeness policies regulate the rate and frequency of requests sent to web servers to avoid
overloading servers or causing disruptions.
• Crawlers often adhere to the robots.txt protocol, which specifies guidelines for web crawlers
regarding which pages to crawl and which to avoid (a small robots.txt check is sketched after this list).
7. Duplicate Content Detection:
• Crawlers may implement techniques to detect and avoid crawling duplicate or near-duplicate
content to maintain index quality and reduce redundancy.
• Techniques include using checksums, fingerprints, or similarity measures to identify duplicate
content.
8. Dynamic Page Handling:
• Crawlers must handle dynamically generated pages, AJAX content, and other dynamically
loaded resources to ensure comprehensive coverage of the web.
• Techniques include executing JavaScript, interpreting AJAX requests, or analyzing embedded
content to discover and crawl dynamically generated content.
9. Link Analysis and Page Ranking:
• Crawlers may prioritize crawling pages based on link analysis algorithms such as PageRank or
HITS to focus on high-quality or authoritative content.
• Page ranking algorithms influence the crawling strategy by determining which pages are more
likely to be relevant or important.
10. Crawl Frontier Management:
• Crawl frontier management involves maintaining a queue or priority list of URLs to be crawled
and managing crawl scheduling, prioritization, and resource allocation.
• Techniques include URL scheduling algorithms, crawl budget allocation, and dynamic
adjustment of crawl priorities based on content freshness or importance.
Effective web page crawling requires a combination of these techniques, along with careful
consideration of scalability, efficiency, relevance, and ethical considerations. Modern web crawlers
employ sophisticated algorithms and strategies to navigate the vast and dynamic landscape of the
World Wide Web efficiently while respecting the guidelines and constraints set by web servers and
website owners.
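As a small illustration of the politeness point above, the sketch below uses Python's standard urllib.robotparser module to check robots.txt before fetching a URL; the site and the crawler name are hypothetical.

# Minimal politeness check: consult robots.txt before fetching a page.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/some/page.html"
if rp.can_fetch("MyCrawlerBot", url):
    print("Allowed to fetch:", url)
else:
    print("robots.txt disallows:", url)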
-------------------------------------------------------------------------------------------------------------------
1. Breadth-First Crawling:
• Description: Breadth-first crawling explores web pages by visiting pages at each level of depth
before moving to the next level.
• Working: It starts with a set of seed URLs and systematically explores pages by visiting all
pages at the current depth level before moving deeper (a minimal BFS crawler sketch appears after this list).
• Advantages: Ensures that pages closer to the seed URLs are crawled first, facilitating
comprehensive coverage of the web.
• Disadvantages: May result in crawling a large number of low-quality or irrelevant pages
before reaching more relevant content deeper in the hierarchy.

2. Depth-First Crawling:
• Description: Depth-first crawling prioritizes exploring pages at deeper levels of the web page
hierarchy before visiting shallower levels.
• Working: It focuses on visiting pages as deeply as possible along a single branch of the web
page hierarchy before exploring other branches.
• Advantages: Can lead to more focused and efficient crawls, especially when targeting specific
areas of the web.
• Disadvantages: May miss important content located at shallower levels of the hierarchy,
potentially leading to incomplete coverage.

3. Focused Crawling:
• Description: Focused crawling aims to crawl specific areas of the web that are relevant to a
particular topic, domain, or set of keywords.
• Working: Uses heuristics, content analysis, or relevance feedback to identify and prioritize
pages related to the target topic.
• Advantages: Enables efficient discovery and retrieval of content relevant to specific
information needs or user queries.
• Disadvantages: Requires sophisticated algorithms and heuristics to determine relevance, and
may miss valuable content outside the defined focus area.

4. Parallel Crawling:
• Description: Parallel crawling involves running multiple crawlers concurrently to crawl different
parts of the web simultaneously.
• Working: Distributes the workload across multiple threads, processes, or machines to improve
efficiency and speed.
• Advantages: Accelerates the crawling process, enabling faster discovery and retrieval of web
content.
• Disadvantages: Requires infrastructure and resource management to coordinate parallel
crawlers and avoid duplication or conflicts.

5. Incremental Crawling:
• Description: Incremental crawling focuses on updating the index with new or modified content
since the last crawl.
• Working: Uses techniques such as timestamp comparison, change detection, or crawling
frequency adjustments to identify and prioritize pages that have been updated or added
recently.
• Advantages: Helps maintain index freshness and relevance by prioritizing recently updated or
added content.
• Disadvantages: Requires efficient mechanisms for detecting changes and managing crawl
scheduling to ensure timely updates.

6. Politeness and Crawling Ethics:
• Description: Politeness policies regulate the rate and frequency of requests sent to web
servers to avoid overloading servers or causing disruptions.
• Working: Crawlers adhere to the robots.txt protocol, which specifies guidelines for web
crawlers regarding which pages to crawl and which to avoid.
• Advantages: Promotes responsible and ethical crawling behavior, fostering positive
relationships with website owners and operators.
• Disadvantages: Requires careful implementation and adherence to crawling guidelines to
avoid inadvertently violating site policies or causing server overload.

7. Duplicate Content Detection:
• Description: Detects and avoids crawling duplicate or near-duplicate content to maintain index
quality and reduce redundancy.
• Working: Uses techniques such as checksums, fingerprints, or similarity measures to identify
duplicate content and avoid redundant crawling.
• Advantages: Improves index quality and search relevance by eliminating duplicate content
from search results.
• Disadvantages: Requires computational resources and processing overhead to perform
duplicate content detection and elimination.

8. Dynamic Page Handling:
• Description: Involves handling dynamically generated pages, AJAX content, and other
dynamically loaded resources during crawling.
• Working: Crawlers execute JavaScript, interpret AJAX requests, or analyze embedded content
to discover and crawl dynamically generated content.
• Advantages: Enables comprehensive coverage of web content, including dynamically
generated pages and rich media.
• Disadvantages: Requires additional processing and complexity to handle dynamic content,
potentially increasing crawl time and resource consumption.

9. Link Analysis and Page Ranking:
• Description: Prioritizes crawling pages based on link analysis algorithms such as PageRank or
HITS to focus on high-quality or authoritative content.
• Working: Page ranking algorithms influence the crawling strategy by determining which pages
are more likely to be relevant or important.
• Advantages: Enhances the relevance and quality of search results by prioritizing authoritative
and high-quality pages for indexing.
• Disadvantages: Requires computational resources and processing overhead to compute and
apply link-based page rankings during crawling.

10. Crawl Frontier Management:
• Description: Involves maintaining a queue or priority list of URLs to be crawled and managing
crawl scheduling, prioritization, and resource allocation.
• Working: Utilizes URL scheduling algorithms, crawl budget allocation, and dynamic adjustment
of crawl priorities based on content freshness or importance.
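To illustrate the breadth-first strategy described in item 1 above, here is a minimal Python sketch that maintains a FIFO frontier of URLs. It is only a sketch: the seed URL is a placeholder, links are extracted with a naive regular expression, and a real crawler would add politeness delays, robots.txt checks, URL normalization, and more robust error handling.

# A minimal breadth-first crawler sketch.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def fetch_links(url):
    """Fetch a page and return the absolute URLs found in its href attributes."""
    try:
        html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
    except Exception:
        return []
    return [urljoin(url, href) for href in re.findall(r'href="([^"#]+)"', html)]

def bfs_crawl(seed_urls, max_pages=20):
    frontier = deque(seed_urls)      # FIFO queue gives breadth-first order
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return visited

print(bfs_crawl(["https://example.com"]))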
-------------------------------------------------------------------------------------------------------------------
Near-duplicate page detection algorithms:
Near-duplicate page detection algorithms are crucial for various web-related tasks such as web
search, information retrieval, and content management. They help identify and eliminate redundant or
highly similar pages from search engine indexes, improve search result diversity, and enhance user
experience. Here's an explanation of some common near-duplicate page detection algorithms (a small shingling and MinHash sketch follows at the end of this section):
1. Hashing-based Techniques:
• MinHash: MinHash is a technique that represents documents using their characteristic shingle
sets (sequences of words). It then hashes each shingle using multiple hash functions and
selects the minimum hash value for each function. By comparing the sets of minimum hashes
between documents, near-duplicate pairs can be efficiently identified.
• SimHash: SimHash employs locality-sensitive hashing (LSH) to generate a fixed-size hash
representation for documents. It XORs the hash values of individual features and then applies a
sign function to generate a signature vector. Similar documents have similar signature vectors,
allowing for near-duplicate detection.
2. Token-based Techniques:
• Jaccard Similarity: Jaccard similarity calculates the intersection over the union of the sets of
tokens (words, phrases, or shingles) between two documents. Documents with a Jaccard
similarity above a certain threshold are considered near-duplicates.
• Cosine Similarity: Cosine similarity measures the cosine of the angle between the vector
representations of documents in a high-dimensional space. It is commonly used in conjunction
with TF-IDF (Term Frequency-Inverse Document Frequency) weights to compare the similarity
of documents based on their token frequencies.
3. Structural and Textual Features:
• Content-based Features: Near-duplicate detection algorithms analyze both textual content
and structural features (HTML structure, metadata, etc.) of web pages to identify similarities.
• Edit Distance: Edit distance algorithms compute the minimum number of operations
(insertions, deletions, substitutions) required to transform one document into another.
Documents with low edit distances are likely to be near-duplicates.
4. Locality-sensitive Hashing (LSH):
• LSH is a family of hash functions that map similar inputs to the same or nearby hash values
with high probability.
• It is used to partition the input space into "buckets" such that similar items are more likely to
be hashed into the same bucket.
• LSH-based techniques efficiently group similar documents for further analysis or identification of
near-duplicates.
5. Machine Learning Approaches:
• Supervised Learning: Supervised learning algorithms can be trained using labeled data to
classify pairs of documents as near-duplicates or non-duplicates.
• Feature Engineering: Features such as token frequency distributions, textual similarity
measures, structural features, and domain-specific attributes can be extracted and used as
input for machine learning models.
6. Scalability and Efficiency:
• Near-duplicate detection algorithms need to be scalable to handle large volumes of web pages
efficiently.
• Techniques such as inverted indexing, distributed processing, and parallelization help improve
the scalability and performance of near-duplicate detection systems.
7. Application-specific Considerations:
• The choice of near-duplicate detection algorithm depends on the specific application
requirements, such as precision, recall, computational resources, and scalability.
• Different applications may prioritize different aspects of near-duplicate detection, such as
content similarity, structural similarity, or a combination of both.
Near-duplicate page detection algorithms are essential for maintaining the quality and relevance of
search engine indexes, improving user experience, and facilitating various content management tasks
on the web. The choice of algorithm depends on factors such as the nature of the documents, the
desired level of sensitivity, and the available computational resources.
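The sketch below illustrates the shingling, Jaccard similarity, and MinHash ideas described above on two short example documents. It is a simplified illustration (few hash functions, no LSH banding); the helper names and sample texts are invented for the example.

# Shingling + MinHash sketch for estimating document similarity.
import hashlib
import random

def shingles(text, k=5):
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash_signature(shingle_set, num_hashes=64, seed=42):
    """For each of num_hashes salted hash functions, keep the minimum hash value."""
    rng = random.Random(seed)
    salts = [str(rng.random()) for _ in range(num_hashes)]
    sig = []
    for salt in salts:
        sig.append(min(int(hashlib.md5((salt + s).encode()).hexdigest(), 16)
                       for s in shingle_set))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates the true Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river"
s1, s2 = shingles(doc1), shingles(doc2)
true_jaccard = len(s1 & s2) / len(s1 | s2)
print(true_jaccard, estimated_jaccard(minhash_signature(s1), minhash_signature(s2)))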
-------------------------------------------------------------------------------------------------------------------
Handling dynamic web content during crawling:
Handling dynamic web content during crawling is essential for ensuring that web crawlers can
effectively discover and index the content available on the modern web, which often includes
dynamically generated pages, AJAX content, and other interactive elements. Here's how dynamic content handling during crawling works (a small headless-browser sketch follows this list):
1. Dynamic Page Identification:
• Dynamic pages are generated on the fly in response to user interactions, JavaScript execution,
or server-side processing.
• Web crawlers need to identify dynamic pages to ensure they are appropriately rendered and
crawled.
2. JavaScript Rendering:
• Many modern web pages heavily rely on JavaScript to dynamically load content, update the
DOM (Document Object Model), and interact with users.
• Web crawlers capable of handling dynamic content execute JavaScript code to render pages
accurately, ensuring that dynamically generated content is discovered and crawled.
3. Headless Browsers and Rendering Engines:
• Headless browsers are browser engines that operate without a graphical user interface, making
them suitable for automated web crawling and testing.
• Crawlers may utilize headless browsers or rendering engines such as Chromium, WebKit, or
Gecko to parse and render JavaScript-heavy pages accurately.
4. AJAX Handling:
• Asynchronous JavaScript and XML (AJAX) requests are commonly used to load dynamic content
dynamically without refreshing the entire page.
• Crawlers need to intercept and handle AJAX requests to ensure that dynamically loaded content
is discovered and indexed.
• Techniques include monitoring network traffic, parsing AJAX responses, and triggering
subsequent requests to fetch additional content.
5. Dynamic Element Detection:
• Crawlers must identify and handle dynamic elements such as pop-ups, modal windows, infinite
scroll, and lazy loading of images or content.
• Techniques include DOM inspection, event monitoring, and content analysis to identify and
interact with dynamically generated elements.
6. Delayed Loading and Infinite Scroll:
• Many websites implement delayed loading or infinite scroll mechanisms to dynamically load
content as users scroll down the page.
• Crawlers need to simulate user interactions and trigger scroll events to load additional content
dynamically.
• Techniques include emulating user behavior, monitoring DOM changes, and dynamically
updating the crawl queue with newly loaded URLs.
7. Form Submission and User Interaction:
• Some web pages require user interaction, such as filling out forms, clicking buttons, or
navigating through menus, to load dynamic content.
• Crawlers may simulate user interactions by programmatically submitting forms, clicking
buttons, and interacting with UI elements to trigger content loading.
8. Content Extraction and Parsing:
• Once dynamic content is loaded and rendered, crawlers extract relevant text, links, and
metadata from the DOM for indexing and analysis.
• Content extraction techniques include XPath queries, CSS selectors, and regular expressions to
identify and extract relevant content elements.
9. Performance and Scalability:
• Handling dynamic web content during crawling introduces additional computational overhead
and complexity.
• Crawlers must balance performance and scalability considerations to efficiently handle dynamic
content while maintaining a reasonable crawl rate and resource usage.
By effectively handling dynamic web content during crawling, web crawlers can ensure comprehensive
coverage and accurate indexing of the rich and interactive content available on the modern web,
enabling users to discover and access relevant information effectively.
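As a small illustration of headless-browser rendering, the sketch below loads a JavaScript-heavy page with Playwright and waits for network activity to settle before extracting the rendered HTML. It assumes the Playwright package and its browser binaries are installed (pip install playwright, then playwright install); the URL is a placeholder.

# Render a JavaScript-heavy page with a headless browser before extraction.
from playwright.sync_api import sync_playwright

def render_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Wait until network activity settles so AJAX-loaded content is present.
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

print(len(render_page("https://example.com")))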
-------------------------------------------------------------------------------------------------------------------
CHAPTER XII: TEXT SUMMARIZATION
Topics covered:
Extractive and abstractive methods, Question Answering: approaches for finding precise answers,
Recommender Systems: collaborative filtering, content-based filtering
-------------------------------------------------------------------------------------------------------------------
Types of Text Summarization: Extractive and Abstractive Summarization

Summarization is one of the most common Natural Language Processing (NLP) tasks. With new content generated by billions of people and their smartphones, we are inundated with an increasing amount of data every day. Humans can only consume a finite amount of information and need a way to separate the wheat from the chaff and find the information that matters. Text summarization can help achieve that for textual information: it lets us separate the signal from the noise and act on what is meaningful.
In this article, we explore different methods to implement this task and some of the lessons we have learned along the way. We hope this will be helpful to other folks who would like to implement basic summarization in their data science pipelines for solving different business problems.
Python provides some excellent libraries and modules to perform Text Summarization. We will provide
a simple example of generating Extractive Summarization using the Gensim and HuggingFace modules
in this article.

Uses of Summarization?

It may be tempting to use summarization on every text to extract the useful information and spend less time reading. For now, however, NLP summarization has proved successful in only a few areas.
Text summarization works well when a text contains many raw facts, since it can filter the important information from them. NLP models can summarize long documents and represent them in a few simpler sentences. News articles, factsheets, and mailers fall into this category.
However, for texts where each sentence builds on the previous one, summarization does not work as well. Research journals and medical texts are good examples where summarization might not be very successful.
Finally, summarization methods can work reasonably well on fiction, but they may miss the style and tone that the author tried to express.
Hence, text summarization is helpful only in a handful of use cases.

Two Types Of Summarization

There are two main types of Text Summarization

Extractive

Extractive summarization methods take the text, rank all of its sentences by their relevance to the overall content, and present the most important ones.
This method does not create new words or phrases; it only reuses sentences that already exist in the text. You can imagine this as taking a page of text and marking the most important sentences with a highlighter.

Abstractive

Abstractive summarization, on the other hand, tries to capture the meaning of the whole text and restate it. It creates new words and phrases, puts them together in a meaningful way, and includes the most important facts found in the text. As a result, abstractive summarization techniques are more complex than extractive ones and are also computationally more expensive.

Comparison with practical example

The best way to illustrate these types is through an example. Here we have run the Input Text below
through both types of summarization and the results are shown below.
Input Text:
China’s Huawei overtook Samsung Electronics as the world’s biggest seller of mobile phones in the
second quarter of 2020, shipping 55.8 million devices compared to Samsung’s 53.7 million, according
to data from research firm Canalys. While Huawei’s sales fell 5 per cent from the same quarter a year
earlier, South Korea’s Samsung posted a bigger drop of 30 per cent, owing to disruption from the
coronavirus in key markets such as Brazil, the United States and Europe, Canalys said. Huawei’s
overseas shipments fell 27 per cent in Q2 from a year earlier, but the company increased its
dominance of the China market which has been faster to recover from COVID-19 and where it now
sells over 70 per cent of its phones. “Our business has demonstrated exceptional resilience in these
difficult times,” a Huawei spokesman said. “Amidst a period of unprecedented global economic
slowdown and challenges, we’ve continued to grow and further our leadership position.” Nevertheless,
Huawei’s position as number one seller may prove short-lived once other markets recover given it is
mainly due to economic disruption, a senior Huawei employee with knowledge of the matter told
Reuters. Apple is due to release its Q2 iPhone shipment data on Friday.
Extractive Summarization Output:
While Huawei’s sales fell 5 per cent from the same quarter a year earlier, South Korea’s Samsung
posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as
Brazil, the United States and Europe, Canalys said. Huawei’s overseas shipments fell 27 per cent in Q2
from a year earlier, but the company increased its dominance of the China market which has been
faster to recover from COVID-19 and where it now sells over 70 per cent of its phones.
Abstractive Summarization Output:
Huawei overtakes Samsung as world’s biggest seller of mobile phones in the second quarter of 2020.
Sales of Huawei’s 55.8 million devices compared to 53.7 million for south Korea’s Samsung. Shipments
overseas fell 27 per cent in Q2 from a year earlier, but company increased its dominance of the china
market. Position as number one seller may prove short-lived once other markets recover, a senior
Huawei employee says.

Extractive Text Summarization Using Gensim

Import the required libraries and functions:
from gensim.summarization.summarizer import summarize
from gensim.summarization.textcleaner import split_sentences
We store the article content in a variable called Input (the text shown above). Next, we pass it to the summarize function; the second parameter is the fraction of the original text that we want the summary to be. We chose 0.4, so the summary will be around 40% of the original text.
summarize(Input, 0.4)
Output:
While Huawei’s sales fell 5 per cent from the same quarter a year earlier, South Korea’s Samsung
posted a bigger drop of 30 per cent, owing to disruption from the coronavirus in key markets such as
Brazil, the United States and Europe, Canalys said. Huawei’s overseas shipments fell 27 per cent in Q2
from a year earlier, but the company increased its dominance of the China market which has been
faster to recover from COVID-19 and where it now sells over 70 per cent of its phones.
With the parameter split=True, you can see the output as a list of sentences.
Gensim's summarization works with the TextRank algorithm: it ranks the sentences and returns the most important ones. (Note that the gensim.summarization module was removed in Gensim 4.0, so this example requires a 3.x release of Gensim.)

Extractive Text Summarization Using Huggingface Transformers

We use the same article to summarize as before, but this time, we use a transformer model from
Huggingface,
from transformers import pipeline
We have to load the pre-trained summarization model into the pipeline:
summarizer = pipeline("summarization")
Next, to use this model, we pass the text, the minimum length, and the maximum length parameters.
We get the following output:
summarizer(Input, min_length=30, max_length=300)
Output:
China’s Huawei overtook Samsung Electronics as the world’s biggest seller of mobile phones in the
second quarter of 2020, shipping 55.8 million devices compared to Samsung’s 53.7 million. Samsung
posted a bigger drop of 30 per cent, owing to disruption from coronavirus in key markets such as
Brazil, the United States and Europe.

Text summarization is the process of distilling the most important information from a text while
preserving its key meaning and content. There are two primary approaches to text summarization:
extractive summarization and abstractive summarization.
1. Extractive Summarization:
• Extractive summarization involves selecting and extracting key sentences or passages
directly from the original text to create a summary. The extracted sentences are typically
the ones that contain the most relevant information or represent the main ideas of the
text.
• Extractive summarization methods often use statistical techniques, natural language
processing (NLP), and machine learning algorithms to identify important sentences
based on criteria such as word frequency, sentence position, and semantic similarity.
• Advantages of extractive summarization include the preservation of the original wording
and the ability to generate coherent summaries quickly. However, extractive methods
may struggle to produce concise summaries and may include redundant or less relevant
information.
2. Abstractive Summarization:
• Abstractive summarization involves generating a summary that may contain rephrased
or paraphrased content not present in the original text. Unlike extractive summarization,
abstractive methods have the capability to generate summaries using language
understanding and generation techniques.
• Abstractive summarization methods leverage advanced NLP techniques such as natural
language understanding, semantic analysis, and language generation models (e.g.,
neural networks) to interpret the meaning of the text and generate concise summaries in
natural language.
• Advantages of abstractive summarization include the ability to generate more concise
and coherent summaries by synthesizing information from multiple parts of the text.
However, abstractive methods are more complex and computationally intensive
compared to extractive techniques.

Key differences between extractive and abstractive summarization:
• Extractive summarization directly selects and reuses existing sentences from the original text,
while abstractive summarization involves generating new sentences that capture the essence of
the original content.
• Extractive summarization tends to produce summaries that closely resemble the original text,
while abstractive summarization can generate more concise and human-like summaries.
• Extractive summarization methods are generally easier to implement and computationally less
demanding compared to abstractive methods, which require more sophisticated NLP techniques
and language generation models.
Both extractive and abstractive summarization techniques have their strengths and weaknesses, and
the choice between them depends on factors such as the desired level of summary coherence, the
complexity of the input text, and the computational resources available.

-------------------------------------------------------------------------------------------------------------------
Text summarization is the process of distilling the most important information from a source text to
produce a condensed version while retaining the key ideas and meaning. There are two primary
approaches to text summarization: extractive summarization and abstractive summarization. Let's
delve into each approach in detail:

Extractive Summarization:
Extractive summarization involves selecting a subset of sentences or passages from the source text
and combining them to create a summary. The selected sentences are usually the most informative
and representative of the content of the original text. Here's how extractive summarization works:
1. Sentence Ranking: Extractive summarization algorithms analyze the source text to identify
sentences that contain important information. Various features can be used to assess the
importance of sentences, such as word frequency, sentence length, position in the text, and the
presence of keywords.
2. Scoring and Selection: Once the sentences are identified, each sentence is assigned a score
based on its importance or relevance to the overall content. Common techniques for scoring
sentences include algorithms like TextRank, which is based on graph-based ranking algorithms
similar to Google's PageRank algorithm, and TF-IDF (Term Frequency-Inverse Document
Frequency), which measures the importance of words in a document relative to a corpus of
documents.
3. Sentence Selection: The sentences with the highest scores are then selected to form the
summary. These selected sentences are typically arranged in the same order as they appear in
the original text to maintain coherence and readability.
4. Generation of Summary: The selected sentences are concatenated to form the final
summary, which provides a condensed representation of the main ideas and key points of the
source text.
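The sketch below implements the rank-score-select pipeline described above in its simplest form: sentences are scored by the average corpus frequency of their words and the top-scoring ones are returned in their original order. It is a frequency-based toy example rather than TextRank or a TF-IDF system; the sample text is invented.

# A minimal frequency-based extractive summarizer sketch.
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))
    def score(sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)
    top = set(sorted(sentences, key=score, reverse=True)[:num_sentences])
    # Keep selected sentences in their original order for readability.
    return " ".join(s for s in sentences if s in top)

sample = ("Huawei overtook Samsung in smartphone shipments in Q2 2020. "
          "Huawei shipped 55.8 million devices while Samsung shipped 53.7 million. "
          "Analysts said the lead may be short-lived once other markets recover.")
print(extractive_summary(sample, num_sentences=1))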

Abstractive Summarization:
Abstractive summarization goes beyond merely selecting and rearranging sentences from the source
text. Instead, it aims to generate a summary that captures the essence of the original content in a
more human-like manner. Abstractive summarization involves the following steps:
1. Understanding the Text: Abstractive summarization algorithms employ natural language
processing (NLP) techniques to comprehend the meaning and context of the source text. This
may involve parsing the text, identifying entities and relationships, and understanding the
semantic structure of the content.
2. Generating Summary: Using the understanding gained from the text, the algorithm generates
a summary in a new form, often using its own words and sentence structures. This process may
involve paraphrasing, rephrasing, and synthesizing information to produce a concise and
coherent summary.
3. Language Generation: Abstractive summarization systems may use techniques such as
sequence-to-sequence models, recurrent neural networks (RNNs), and transformer models like
BERT and GPT (Generative Pre-trained Transformer) to generate human-like summaries. These
models are trained on large datasets of text and learn to generate summaries based on the
patterns and structures observed in the training data.
4. Evaluation and Refinement: Generated summaries are evaluated for coherence,
informativeness, and relevance to the original text. The system may iterate through multiple
generations, refining the summary based on feedback and evaluation metrics.

Comparison:
• Extractive Summarization:
• Pros:
• Retains the original wording and structure of the text.
• Generally produces grammatically correct summaries.
• Cons:
• Limited to sentences present in the source text.
• May not capture the semantic meaning or context of the original text
comprehensively.
• Abstractive Summarization:
• Pros:
• Can generate summaries that go beyond the original text.
• Captures the semantic meaning and context more effectively.
• Cons:
• Challenging to generate grammatically correct and coherent summaries.
• Requires more advanced natural language processing techniques and language
models.
In summary, while extractive summarization focuses on selecting and rearranging existing content,
abstractive summarization aims to understand the text and generate new content that effectively
conveys the main ideas and key points of the source text. Each approach has its strengths and
limitations, and the choice between extractive and abstractive summarization depends on the specific
requirements and constraints of the task at hand.
-------------------------------------------------------------------------------------------------------------------
Question answering (QA) involves finding precise answers to user queries or questions, typically posed
in natural language. There are several approaches for finding precise answers in QA systems:
1. Information Retrieval (IR)-based QA: In this approach, the QA system retrieves relevant
documents or passages from a large corpus in response to the user's question. The system uses
keyword matching, vector space models, or other IR techniques to identify documents
containing potential answers. Once the documents are retrieved, the system may employ
techniques such as passage extraction or document ranking to select the most relevant
information for answering the question.
2. Text Matching and Similarity: This approach involves analyzing the similarity between the
user's question and textual content in the corpus. Techniques such as cosine similarity,
semantic similarity, or word embeddings are used to measure the similarity between the
question and candidate answers. The system selects the answer that best matches the semantic
meaning or context of the question.
3. Machine Learning and Natural Language Processing (NLP): Machine learning models,
particularly deep learning architectures such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and transformer models like BERT, have been employed in QA
systems. These models are trained on large datasets to understand the relationship between
questions and answers and to generate responses based on learned patterns in the data.
4. Semantic Parsing: Semantic parsing involves analyzing the syntactic and semantic structure
of the question to understand its meaning and intent. Techniques such as dependency parsing,
semantic role labelling, and entity recognition are used to extract relevant entities,
relationships, and constraints from the question. The parsed representation of the question is
then used to query structured or unstructured data sources to find precise answers.
5. Knowledge Graphs: Knowledge graphs represent structured information about entities and
their relationships in a graph-based format. QA systems can leverage knowledge graphs to find
precise answers by traversing the graph to identify relevant entities and relationships based on
the user's question. Techniques such as graph-based inference and query expansion can be
used to infer additional information and improve answer precision.
6. Hybrid Approaches: Many QA systems combine multiple techniques and approaches to
improve answer precision. For example, a system might use IR-based techniques to retrieve
candidate passages followed by machine learning models to rank and score the passages based
on relevance to the question. Hybrid approaches leverage the strengths of different methods to
address the limitations of individual techniques and improve overall performance.

Question answering (QA) involves finding precise answers to user queries or questions posed in natural
language. Various approaches exist for achieving this goal, ranging from rule-based systems to
advanced machine learning models. Here, I'll detail several approaches for finding precise answers in QA systems (a small extractive QA sketch using a transformer pipeline follows this list):
1. Keyword Matching: One of the simplest approaches to QA involves matching keywords in the
user's question with keywords in a corpus of documents or a knowledge base. The system
retrieves documents containing the relevant keywords and extracts sentences or passages that
contain matching keywords. While straightforward, this approach may not capture nuanced or
complex queries effectively.
2. Information Retrieval (IR) + Passage Retrieval: In this approach, the QA system first
retrieves relevant documents using information retrieval techniques such as TF-IDF (Term
Frequency-Inverse Document Frequency) or BM25 (Best Matching 25). Then, it selects relevant
passages or sentences from the retrieved documents based on their relevance to the user's
question. Passage retrieval methods may consider factors such as semantic similarity,
document context, and language models to identify relevant passages.
3. Named Entity Recognition (NER): Named Entity Recognition identifies entities such as
people, organizations, locations, and dates mentioned in the user's question and in the corpus
of documents. QA systems can use NER to extract relevant entities and then search for
sentences or passages containing these entities to provide answers. NER enhances precision by
focusing on specific entities mentioned in the question.
4. Semantic Parsing and Structured Knowledge Bases: Some QA systems leverage
structured knowledge bases such as Wikidata, Freebase, or DBpedia. Semantic parsing
techniques are used to translate the user's question into a structured query language (e.g.,
SPARQL for RDF knowledge bases). The system then executes the query against the knowledge
base to retrieve precise answers. Structured knowledge bases offer rich semantic information
and enable precise retrieval of factual knowledge.
5. Machine Learning Models: Modern QA systems often employ machine learning models,
particularly deep learning architectures, to understand and answer questions. These models
include:
• Sequence-to-Sequence Models: Seq2Seq models, based on recurrent neural networks
(RNNs) or transformer architectures, can map the user's question to an answer directly.
These models learn to generate answers based on input questions and can handle both
extractive and abstractive QA tasks.
• BERT and Transformers: Bidirectional Encoder Representations from Transformers
(BERT) and other transformer-based models have shown remarkable performance in QA
tasks. These models can understand the context and semantics of the question and the
document corpus, enabling accurate answer extraction.
• BERT-based Fine-tuning: Pre-trained language models like BERT can be fine-tuned on
QA datasets using techniques such as extractive summarization. During fine-tuning, the
model learns to extract the most relevant spans of text from documents to generate
precise answers to questions.
6. Ensemble Approaches: Some QA systems combine multiple approaches mentioned above to
improve precision and robustness. Ensemble methods integrate outputs from different models
or techniques to generate more accurate answers. For example, an ensemble model may
combine results from keyword matching, IR-based passage retrieval, and machine learning
models to provide precise answers across a range of queries.
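As a small illustration of the transformer-based approach, the sketch below uses the HuggingFace question-answering pipeline (the same pipeline API used for summarization earlier in these notes) to extract an answer span from a short context. It assumes the transformers library and a default QA model are available; the question and context are invented.

# Minimal extractive QA sketch with a transformer pipeline.
from transformers import pipeline

qa = pipeline("question-answering")
context = ("PageRank was developed by Larry Page and Sergey Brin. "
           "It assigns a score to each page based on its inbound links.")
result = qa(question="Who developed PageRank?", context=context)
print(result["answer"], result["score"])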
-------------------------------------------------------------------------------------------------------------------
Recommender systems are information filtering systems that aim to predict user preferences and
recommend items (such as movies, products, or articles) that users are likely to be interested in. Two
primary approaches to building recommender systems are collaborative filtering and content-based
filtering.
1. Collaborative Filtering:
Collaborative filtering (CF) recommends items to users based on the preferences of other users. The
underlying assumption is that users who have preferred similar items in the past will likely prefer
similar items in the future. Collaborative filtering methods can be further categorized into two types:
a. Memory-Based Collaborative Filtering: Memory-based CF techniques compute similarities
between users or items based on their historical interactions. One common method is user-based
collaborative filtering, where recommendations for a user are generated based on the preferences of
similar users. Another approach is item-based collaborative filtering, where items are recommended to
a user based on their similarity to items the user has interacted with in the past.
b. Model-Based Collaborative Filtering: Model-based CF techniques use machine learning
algorithms to learn patterns from the user-item interaction data and make predictions. Matrix
factorization methods, such as Singular Value Decomposition (SVD) and matrix factorization with
techniques like Alternating Least Squares (ALS), are commonly used in model-based collaborative
filtering. These methods decompose the user-item interaction matrix into latent factors that represent
user and item preferences.
2. Content-Based Filtering:
Content-based filtering recommends items to users based on the features or attributes of the items
and the user's preferences. This approach relies on analyzing the content of the items and building
user profiles based on their historical interactions with items. Content-based filtering methods typically
involve the following steps:
a. Feature Extraction: Extracting relevant features from items, such as keywords, genres, or
descriptive attributes.
b. User Profile Creation: Creating user profiles based on their historical interactions and preferences.
User profiles represent the user's preferences for different features or attributes.
c. Matching Items to User Profiles: Recommending items to users that match their preferences
based on the similarity between the item features and the user profile.
Content-based filtering is particularly useful in scenarios where user-item interactions are sparse or
when there is limited data about user preferences. However, it may suffer from the problem of over-
specialization, where recommendations are limited to items similar to those the user has interacted
with in the past.
Both collaborative filtering and content-based filtering have their strengths and weaknesses, and
hybrid approaches that combine both techniques are often used to build more effective recommender
systems. Hybrid approaches leverage the strengths of both methods to provide more accurate and
diverse recommendations to users.
-------------------------------------------------------------------------------------------------------------------
Recommender systems are a type of information filtering system that predict the "rating" or
"preference" a user would give to an item. These systems have become ubiquitous in today's digital
landscape, powering personalized recommendations in various domains such as e-commerce, social
media, music streaming, and movie recommendations. Two prominent approaches to building
recommender systems are collaborative filtering and content-based filtering.
1. Collaborative Filtering:
Collaborative filtering (CF) is based on the idea that users who have agreed in the past tend to agree
again in the future. It works by collecting and analyzing user interactions and preferences to make
automatic predictions about a user's interests (a minimal user-based sketch appears after the comparison at the end of this section).
• Memory-Based Collaborative Filtering:
• User-based CF: This approach recommends items by finding similar users to the
target user and suggesting items that they have liked or interacted with. It
involves building a user-item matrix where each cell represents the rating a user
has given to an item.
• Item-based CF: Instead of comparing users, item-based CF identifies similar
items based on user interactions and recommends items that are similar to those
the user has liked in the past.
• Model-Based Collaborative Filtering:
• This approach involves building a model from the user-item interactions to make
predictions. Techniques like matrix factorization, singular value decomposition
(SVD), and factorization machines fall under this category. These models
generalize better to unseen data and can handle sparse matrices more efficiently
compared to memory-based methods.
2. Content-Based Filtering:
Content-based filtering recommends items similar to those a user has liked in the past based on the
attributes of the items. It relies on item features and user profiles to make recommendations.
• Item Feature Representation: Each item is described by a set of features or
attributes. For example, in a movie recommendation system, features could include
genre, director, actors, and plot keywords.
• User Profile Representation: A user profile is created based on the items they have
interacted with and liked in the past. This profile is then matched against item features
to make recommendations.
• Vector Space Model: Item features and user profiles are often represented in a high-
dimensional vector space. Techniques like cosine similarity or Euclidean distance are
used to measure the similarity between item vectors and user profiles.
• Hybrid Models: Combining collaborative filtering and content-based filtering techniques
can often result in more accurate and diverse recommendations. Hybrid models leverage
the strengths of both approaches to mitigate their individual weaknesses.
3. Comparison:
• Scalability: Content-based filtering can be more scalable since it relies on item
attributes rather than user interactions. However, collaborative filtering can suffer from
scalability issues with large user-item matrices.
• Cold Start Problem: Content-based filtering can handle the cold start problem better
since it relies on item features. However, collaborative filtering requires a sufficient
amount of user interaction data to make accurate predictions.
• Serendipity vs. Accuracy: Collaborative filtering tends to provide more serendipitous
recommendations since it relies on user behavior. Content-based filtering, on the other
hand, may lead to more accurate recommendations based on item attributes.
In practice, many modern recommender systems combine both collaborative and content-based
filtering techniques to leverage the strengths of each approach and provide more accurate and diverse
recommendations to users.

A content-based recommender works on data that we take from the user, either explicitly (ratings) or implicitly (clicking on a link). From this data we create a user profile, which is then used to make suggestions to the user; as the user provides more input or takes more actions on the recommendations, the engine becomes more accurate.

User Profile: In the user profile, we create vectors that describe the user's preferences. To build the profile, we use the utility matrix, which describes the relationship between users and items. With this information, the best estimate we can make of which items the user will like is some aggregation of the profiles of those items.
Item Profile: In a content-based recommender, we must build a profile for each item that represents the item's important characteristics. For example, if the item is a movie, then its actors, director, release year, and genre are its most significant features. We can also add its rating from IMDB (Internet Movie Database) to the item profile.
Utility Matrix: The utility matrix captures the user's preference for particular items. From the data gathered from the user, we have to find some relation between the items the user likes and those the user dislikes; the utility matrix serves this purpose. In it, we assign a value, known as the degree of preference, to each user-item pair, and we then arrange these values as a matrix of users against items to identify their preference relationships.

Some cells of the matrix are blank because we do not get complete input from every user, and the goal of a recommendation system is not to fill every cell but to recommend items that the user will prefer. From such a table, our recommender system would not suggest Movie 3 to User 2: the two users gave approximately the same rating to Movie 1, and User 1 gave Movie 3 a low rating, so it is highly likely that User 2 will not like it either.
Recommending Items to User Based on Content:
• Method 1: We can use the cosine distance between the item's vector and the user's vector to determine how well the item matches the user's preferences. As an example, the user vector will have positive values for actors who tend to appear in movies the user likes and negative values for actors the user dislikes. For a movie featuring mostly actors the user likes and only a few the user dislikes, the cosine of the angle between the user's vector and the movie's vector will be a large positive fraction; the angle is therefore close to 0, and the cosine distance between the vectors is small, which indicates that the user is likely to enjoy the movie. If the cosine distance is large, we tend to leave the item out of the recommendation. (A small code sketch of this method follows after this list.)
• Method 2: We can also use a classification approach, for example a decision tree that decides whether a user wants to watch a movie or not; at each level of the tree we apply a condition (such as a check on the genre or on a favorite actor) to refine the recommendation.
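The following is a minimal sketch of Method 1 (the movie features, ratings, and weighting scheme are invented for illustration): the user profile is built as a weighted sum of the feature vectors of rated movies, with likes contributing positively and dislikes negatively, and unseen movies are then ranked by cosine similarity to that profile.

import numpy as np

# Item profiles: feature columns = [actor_A, actor_B, genre_action, genre_comedy].
item_profiles = {
    "Movie 1": np.array([1, 0, 1, 0], dtype=float),
    "Movie 2": np.array([0, 1, 0, 1], dtype=float),
    "Movie 3": np.array([1, 1, 1, 0], dtype=float),
    "Movie 4": np.array([0, 1, 0, 1], dtype=float),
}

# Utility-matrix entries for one user (1-5 scale). Weights are centred on the
# scale midpoint 3, so liked movies push the profile towards their features and
# disliked movies push it away from theirs.
user_ratings = {"Movie 1": 5, "Movie 2": 2}
profile = sum((r - 3) * item_profiles[m] for m, r in user_ratings.items())

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# Rank the unseen movies: a small cosine distance (large cosine similarity)
# means the movie matches the user's profile and is worth recommending.
unseen = [m for m in item_profiles if m not in user_ratings]
for movie in sorted(unseen, key=lambda m: cosine(profile, item_profiles[m]), reverse=True):
    print(movie, round(cosine(profile, item_profiles[movie]), 3))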

-------------------------------------------------------------------------------------------------------------------
Content-based filtering

What is content-based filtering?


Content-based filtering uses item features to recommend other items similar to what the user likes,
based on their previous actions or explicit feedback.
To demonstrate content-based filtering, let’s hand-engineer some features for the Google Play store.
The following figure shows a feature matrix where each row represents an app and each column
represents a feature. Features could include categories (such as Education, Casual, Health), the
publisher of the app, and many others. To simplify, assume this feature matrix is binary: a non-zero
value means the app has that feature.


What are the components of a recommender?


There are three component procedures of a recommender:
• Candidate Generation: This stage is responsible for generating a smaller subset of candidate items to recommend to a user, given a huge pool of thousands of items.
• Scoring System: Candidate generation can be done by different generators, so we need to standardize the candidates and assign a score to each item in the subset; this is done by the scoring system.
• Re-Ranking System: After scoring, the system takes additional constraints into account to produce the final ranking.

What is content-based and collaborative filtering?


Current recommendation systems such as content-based filtering and collaborative filtering use different information sources to make recommendations. Content-based filtering makes recommendations based on user preferences for product features. Collaborative filtering mimics user-to-user recommendations: it predicts a user's preferences as a linear, weighted combination of other users' preferences.
Both methods have limitations. Content-based filtering can recommend a new item, but it needs enough data about user preferences to find the best match. Similarly, collaborative filtering needs a large dataset of active users who have rated products before in order to make accurate predictions. Combinations of these different recommendation approaches are called hybrid systems.

How is content-based filtering implemented?


This method of content-based filtering revolves around comparing user interests to product features; the products whose features overlap most with the user's interests are the ones recommended.
Given the significance of product features in this system, it is important to discuss how the user’s
favorite features are decided.
Here, two methods can be used (possibly in combination). Firstly, users can be given a list of features
out of which they can choose whatever they identify with the most. Secondly, the algorithm can keep
track of the products the user has chosen before and add those features to the users’ data.
Similarly, product features can be identified by the developers of the product themselves, and users can also be asked which features they think best describe the products.

Once a numerical value, whether a binary 1 or 0 or an arbitrary number, has been assigned to product features and user interests, a method is needed to measure the similarity between them. A very basic choice is the dot product, as sketched below.
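As a concrete illustration of the dot-product idea for a binary feature matrix like the Google Play example above (the apps, features, and user vector below are made up):

import numpy as np

# Binary app-feature matrix: rows = apps, columns = [Education, Casual, Health, Publisher_X].
features = np.array([
    [1, 0, 0, 1],   # App 1
    [0, 1, 0, 0],   # App 2
    [1, 0, 1, 1],   # App 3
])
app_names = ["App 1", "App 2", "App 3"]

# User-interest vector over the same feature columns (e.g. built from past installs).
user = np.array([1, 0, 1, 0])

# Score every app with a dot product and recommend the highest-scoring ones.
scores = features @ user
for name, score in sorted(zip(app_names, scores), key=lambda pair: -pair[1]):
    print(name, int(score))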

Why is content-based better than collaborative filtering?


Content-based filtering does not require other users' data when making recommendations to a single user.

What are the main methods of content-based recommendation?

A content-based recommendation system works with two methods, each using different models and algorithms. One uses the vector space method and is called method 1, while the other uses a classification model and is called method 2.
1. The vector space method
Suppose you read a crime thriller by Agatha Christie and review it on the internet as good, and you also review a comedy novel and rate it as bad. A rating system is then built from the information you provided: on a scale from 0 to 9, the crime thriller and detective genres are ranked at 9, other serious genres fall somewhere between 9 and 0, and the comedy genre ends up at the bottom, perhaps even negative. With this information, your next book recommendation will most probably be a crime thriller, since those are your highest-rated genres.
For this ranking system, a user vector is created from the information you provided, and an item vector is created in which each book is scored according to its genres. Every book is then assigned a value by taking the dot product of the user vector and the item vector, and this value is used for recommendation: the dot products of all the available books are ranked, and the top 5 or top 10 books are recommended. This was the first method used by content-based recommendation systems to recommend items to the user.
2. Classification method
The second method of content-based filtering is the classification method, in which we build a decision tree to decide whether the user would want to read a given book. For example, consider the book The Alchemist. Based on the user's data, we first check the author: it is not Agatha Christie. Then we check the genre: it is not a crime thriller, nor is it a genre you have ever reviewed. From these checks we conclude that this book should not be recommended to you. (A small sketch of this approach follows below.)
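A minimal sketch of this classification approach using scikit-learn's DecisionTreeClassifier (assuming scikit-learn is installed; the binary book features and like/dislike labels below are invented purely for illustration):

from sklearn.tree import DecisionTreeClassifier

# Each book is described by binary features:
# [author_is_agatha_christie, genre_is_crime_thriller, previously_reviewed_genre]
X_train = [
    [1, 1, 1],   # a Christie crime thriller the user reviewed -> liked
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],   # non-thriller in a previously reviewed genre -> disliked
    [0, 0, 0],
]
y_train = [1, 1, 1, 0, 0]   # 1 = user would read, 0 = would not

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)

# "The Alchemist": not by Christie, not a crime thriller, not a reviewed genre.
the_alchemist = [[0, 0, 0]]
print("recommend" if tree.predict(the_alchemist)[0] == 1 else "do not recommend")

With these toy labels the tree learns to split on the crime-thriller feature, so The Alchemist, which matches none of the user's liked features, is not recommended.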

What are the advantages and disadvantages of content-based recommendation system?


The advantages of a content-based recommender system are the following:
• Because recommendations are tailored to an individual, the model does not require any information about other users. This makes it easier to scale to a large number of users.
• The model can capture a user's individual preferences and recommend niche items that only a few other users are interested in.

• New items can be suggested before they have been rated by a large number of users, in contrast to collaborative filtering.
The disadvantages are as follows:
• This methodology necessitates a great deal of domain knowledge because the feature
representation of the items is hand-engineered to some extent. As a result, the model can only
be as good as the characteristics that were hand-engineered.
• The model can only make suggestions based on the user's existing interests; in other words, its ability to expand beyond those interests is limited.
• Since it must align the features of a user's profile with available products, content-based
filtering offers only a small amount of novelty.
• Only item profiles are generated in the case of item-based filtering, and users are
recommended items that are close to what they rate or search for, rather than their previous
background. A perfect content-based filtering system can reveal nothing surprising or
unexpected.
-------------------------------------------------------------------------------------------------------------------
CHAPTER XIII: CROSS-LINGUAL AND MULTILINGUAL RETRIEVAL
Topics covered: Challenges and techniques for cross-lingual retrieval, Machine translation for IR,
Multilingual document representations and query translation, Evaluation Techniques for IR Systems
-------------------------------------------------------------------------------------------------------------------
Cross-lingual retrieval is the process of retrieving information written in a language different from the
language of the query. It plays a crucial role in scenarios where users may be searching for information
in multiple languages or when information needs to be accessed across linguistic boundaries. Let's
delve into the details of cross-lingual retrieval:
1. Challenges:
1. Vocabulary Mismatch: Different languages have distinct vocabularies, and translating queries
directly may not capture the intended meaning accurately.
2. Syntax and Structure Variations: Languages exhibit differences in word order, grammar
rules, and syntactic structures, making it challenging to match queries and documents
accurately.
3. Semantic Equivalence: Expressing the same concept or idea may vary across languages due
to cultural and linguistic differences, leading to ambiguity in retrieval.
4. Resource Limitations: Resources such as parallel corpora, bilingual dictionaries, and language
models may be scarce or unavailable for certain language pairs.
2. Techniques for Cross-Lingual Retrieval:
1. Machine Translation:
• Translating queries from one language to another using machine translation services like
Google Translate, DeepL, or Microsoft Translator.
• This approach enables users to search for information in languages they are not
proficient in.
2. Cross-Lingual Information Retrieval Models:
• Developing models that can effectively match queries and documents across languages
by capturing semantic similarities.
• Utilizing cross-lingual word embeddings or multilingual models to represent documents
and queries in a shared semantic space.
3. Bilingual Lexicons and Dictionaries:
• Leveraging bilingual lexicons and dictionaries to establish mappings between words or
phrases in different languages.
• These resources can aid in query expansion and improving the relevance of cross-lingual
retrieval.
4. Cross-Lingual Transfer Learning:
• Transferring knowledge learned from resource-rich languages to low-resource languages
to improve retrieval performance.
• Pre-training language models on large multilingual corpora and fine-tuning them on
specific languages or domains.
5. Query Expansion and Relevance Feedback:
• Expanding queries using synonyms, related terms, or translations to mitigate vocabulary
mismatch and improve retrieval accuracy.
• Incorporating relevance feedback mechanisms to refine search results based on user
interactions and feedback.
6. Domain Adaptation and Cross-Lingual Adaptation:
• Adapting retrieval models and techniques to specific domains or language pairs to
enhance performance in targeted settings.
• Fine-tuning models on domain-specific or cross-lingual data to improve relevance and
effectiveness.

3. Applications:
• Multilingual Search Engines: Enabling users to search for information across multiple
languages on the web.
• Cross-Lingual Document Retrieval: Facilitating access to documents and information in
diverse languages for research or knowledge discovery.
• Multilingual Information Access: Supporting multilingual communities or organizations by
providing access to information in their native languages.
• Cross-Cultural Communication: Bridging linguistic and cultural barriers to facilitate
communication and collaboration in global contexts.
4. Evaluation:
• Evaluation of cross-lingual retrieval systems involves assessing metrics such as precision, recall,
and F1-score across different languages and datasets.
• Cross-lingual test collections and benchmark datasets are used to evaluate the effectiveness
and performance of retrieval algorithms.
5. Challenges and Future Directions:
• Adapting to low-resource languages and domains where linguistic resources are limited.
• Addressing domain-specific challenges and nuances in cross-lingual retrieval.
• Exploring novel techniques and approaches to improve the accuracy and efficiency of cross-
lingual retrieval systems.
In summary, cross-lingual retrieval is a vital area of research and development aimed at enabling
users to access information across linguistic boundaries effectively. By leveraging advanced techniques
and models, cross-lingual retrieval systems can overcome challenges related to vocabulary mismatch,
syntax variations, and semantic equivalence, thereby facilitating seamless access to information in
diverse languages.
-------------------------------------------------------------------------------------------------------------------
Challenges and techniques for cross-lingual retrieval:
Cross-lingual retrieval refers to the process of retrieving relevant information from documents written
in languages different from the language of the query. It is a crucial task in information retrieval
systems, especially in multilingual and globalized environments. However, it comes with its own set of
challenges, which require specific techniques to address effectively. Below, I'll outline the challenges
and techniques for cross-lingual retrieval along with examples:
Challenges:
1. Vocabulary Mismatch: Different languages have different vocabularies, and a direct
translation may not always capture the nuances or semantics of a query.
2. Syntax and Morphology Variations: Languages exhibit variations in word order, morphology,
and syntax, making it challenging to match queries and documents across languages.
3. Cross-Cultural Differences: The same concept may be expressed differently across languages
due to cultural and linguistic differences, leading to retrieval ambiguity.
4. Resource Scarcity: Resources such as parallel corpora or bilingual dictionaries may be scarce
for certain language pairs, hindering the development of effective cross-lingual retrieval models.
Techniques:
1. Machine Translation:
• Utilizing machine translation techniques to translate queries from one language to
another before performing retrieval.
• Example: Translating an English query "weather forecast" into Spanish ("pronóstico del
tiempo") and retrieving relevant documents in Spanish.
2. Cross-lingual Information Retrieval Models:
• Developing retrieval models that can effectively match queries and documents across
languages by capturing semantic similarities.
• Example: Using cross-lingual word embeddings to represent words from different
languages in a shared vector space, enabling the measurement of semantic similarity
across languages.
3. Parallel Corpora and Bilingual Dictionaries:
• Exploiting parallel corpora and bilingual dictionaries to learn cross-lingual mappings and
align representations between languages.
• Example: Using a parallel corpus of English and French documents along with a bilingual
dictionary to learn cross-lingual word embeddings for retrieval.
4. Cross-lingual Transfer Learning:
• Leveraging pre-trained models or representations learned from resource-rich languages
to improve retrieval performance in resource-scarce languages.
• Example: Fine-tuning a pre-trained language model on a large English corpus and
transferring the learned representations to improve retrieval in a low-resource language
such as Swahili.
5. Query Expansion and Cross-lingual Thesauri:

• Expanding queries using synonyms or related terms from the target language to mitigate
vocabulary mismatch.
• Example: Expanding an English query for "machine learning" with related terms from a
cross-lingual thesaurus before retrieving documents in French.
6. Multilingual Embeddings and Multilingual Models:
• Learning multilingual embeddings or training models that can effectively handle multiple
languages simultaneously.
• Example: Training a multilingual neural network model that encodes documents and
queries in multiple languages and retrieves relevant documents regardless of the
language.

Cross-lingual retrieval is a challenging yet essential task for information retrieval in multilingual
environments. Addressing vocabulary mismatch, syntax variations, and resource scarcity requires
sophisticated techniques such as machine translation, cross-lingual information retrieval models, and
cross-lingual transfer learning. By leveraging these techniques, cross-lingual retrieval systems can
effectively retrieve relevant information across different languages, enabling users to access
information regardless of linguistic barriers.

Example of cross-lingual retrieval using machine translation and cross-lingual information retrieval
models:

Scenario: Suppose we have a multinational company with offices in English-speaking countries and
Spanish-speaking countries. Employees across these offices need to access documents and information
stored in the company's knowledge base, which is available in both English and Spanish. To facilitate
efficient information retrieval across languages, the company wants to implement a cross-lingual
retrieval system.
Challenges:
• Vocabulary Mismatch: English and Spanish have different vocabularies, and direct translations
may not capture the exact meaning of queries or documents.
• Syntax Variations: English and Spanish have different word orders and syntactic structures,
making it challenging to match queries and documents accurately.
• Resource Scarcity: The company may not have sufficient bilingual resources or parallel corpora
to develop effective cross-lingual retrieval models.
Solution: The company decides to implement a cross-lingual retrieval system using a combination of
machine translation and cross-lingual information retrieval models.
Implementation Steps:
1. Query Translation:
• When a user submits a query in one language (e.g., English), the system translates the
query into the target language (e.g., Spanish) using a machine translation service such
as Google Translate or Microsoft Translator.
• For example, if a user in an English-speaking country searches for "project
management," the system translates the query to "gestión de proyectos" in Spanish.
2. Cross-lingual Information Retrieval:
• The system uses cross-lingual information retrieval models to match the translated
query with relevant documents in the target language.
• It employs techniques such as cross-lingual word embeddings or multilingual models to
capture semantic similarities across languages and retrieve relevant documents.
• For instance, if the translated query "gestión de proyectos" is matched with Spanish
documents containing similar terms, such as "técnicas de gestión de proyectos" or
"mejores prácticas en gestión de proyectos," those documents are retrieved and
presented to the user.
3. Evaluation and Refinement:
• The system continuously evaluates the relevance of retrieved documents based on user
feedback and adjusts its retrieval algorithms accordingly.
• It may refine the cross-lingual retrieval models by incorporating additional linguistic
features, optimizing parameters, or leveraging user interactions to improve retrieval
accuracy.
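The query-translation and retrieval steps can be sketched end to end in Python with a toy bilingual dictionary standing in for the machine translation service and TF-IDF standing in for the company index (all documents, queries, and dictionary entries below are invented; a real deployment would call an MT API and use a production search engine):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy English -> Spanish dictionary used in place of a machine translation service.
en_to_es = {"project": "proyectos", "management": "gestión", "sales": "ventas"}

def translate_query(query_en: str) -> str:
    # Word-by-word translation; unknown words are kept as-is.
    return " ".join(en_to_es.get(w, w) for w in query_en.lower().split())

# Spanish-language document collection (the "knowledge base").
docs_es = [
    "técnicas de gestión de proyectos",
    "mejores prácticas en gestión de proyectos",
    "informe de ventas del primer trimestre",
]

# Index the Spanish documents with TF-IDF.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs_es)

# Translate the English query, then retrieve by cosine similarity in Spanish space.
query_es = translate_query("project management")          # -> "proyectos gestión"
scores = cosine_similarity(vectorizer.transform([query_es]), doc_matrix)[0]
for doc, score in sorted(zip(docs_es, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {doc}")

The two documents about "gestión de proyectos" score highest while the sales report scores zero, mirroring the behaviour described in step 2 above.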
Benefits:
• Seamless Access to Information: Employees across English and Spanish-speaking regions can
access relevant documents and information regardless of their language preferences.
• Improved Efficiency: The cross-lingual retrieval system reduces the time and effort required to
manually search for information in different languages.
• Enhanced Collaboration: Employees from diverse linguistic backgrounds can collaborate more
effectively by sharing and accessing documents across languages.


By implementing a cross-lingual retrieval system that combines machine translation and cross-lingual
information retrieval techniques, the multinational company can overcome language barriers and
facilitate efficient access to knowledge and information across its global workforce. This example
demonstrates how cross-lingual retrieval solutions can address the challenges of vocabulary mismatch,
syntax variations, and resource scarcity in multilingual environments.
-------------------------------------------------------------------------------------------------------------------
Machine translation for Information Retrieval (IR) involves using automated translation systems
to translate queries or documents from one language to another in order to facilitate cross-lingual
search and retrieval. This approach allows users to search for information in languages they are not
proficient in and enables access to a wider range of resources. Let's explore machine translation for IR
in detail:
1. Basic Components of Machine Translation:
Machine translation systems typically consist of the following components:
• Text Analysis: Breaking down the input text into its constituent parts, such as words, phrases,
and sentences.
• Translation Model: Generating a translation based on statistical or neural models that capture
the relationships between source and target languages.
• Language Generation: Reconstructing the translated text in the target language, ensuring
fluency and coherence.
2. Types of Machine Translation:
• Statistical Machine Translation (SMT): Based on statistical models that learn translation
patterns from bilingual corpora. SMT systems rely on word alignments and phrase-based
translation techniques.
• Neural Machine Translation (NMT): Utilizes deep learning models, such as recurrent neural
networks (RNNs) or transformer models, to learn translation mappings directly from source to
target languages. NMT has shown significant improvements in translation quality over
traditional SMT approaches.
3. Integration of Machine Translation with IR:
Machine translation can be integrated into the IR process in various ways:
• Query Translation: Translating user queries from the source language to the target language
before retrieving relevant documents. For example, translating an English query into Spanish
before searching for relevant documents in Spanish databases.
• Document Translation: Translating documents retrieved in the target language back to the
source language for user comprehension. This allows users to understand the content of
documents written in languages they are not proficient in.
• Cross-Lingual Retrieval: Facilitating retrieval of documents across multiple languages by
translating queries and documents between source and target languages as part of the retrieval
process.
4. Example Scenario:
Consider a multinational company with offices in English-speaking countries and Japanese-speaking
countries. Employees across these offices need to access documents and information stored in the
company's knowledge base, which is available in both English and Japanese.
• Query Translation: An employee in the English-speaking office submits a query in English,
such as "sales report analysis." The machine translation system translates the query into
Japanese, generating the equivalent query in Japanese, "売上レポート分析."
• Cross-Lingual Retrieval: The translated query is used to retrieve relevant documents written
in Japanese from the company's knowledge base. These documents could include sales reports,
market analyses, and financial summaries.
• Document Translation: The retrieved Japanese documents can be translated back into
English for the English-speaking employee to understand the content and extract relevant
information.
5. Challenges and Considerations:
• Translation Quality: The accuracy and fluency of machine translation can significantly impact
the effectiveness of cross-lingual retrieval. Poor translations may lead to irrelevant search
results and user frustration.
• Domain Specificity: Machine translation systems may struggle with domain-specific
terminology and context. Customizing translation models for specific domains can improve
translation quality and relevance.
• Resource Availability: Availability of bilingual corpora and language resources can impact the
development and performance of machine translation systems, particularly for low-resource
languages.
In conclusion, machine translation plays a crucial role in enabling cross-lingual search and retrieval in
Information Retrieval systems. By leveraging machine translation technologies, users can access and

comprehend information across linguistic boundaries, facilitating collaboration and knowledge sharing
in multilingual environments.
-------------------------------------------------------------------------------------------------------------------
Multilingual document representations and query translation are essential components in multilingual
information retrieval systems, where users interact with content across multiple languages. Here's a
breakdown of each:
1. Multilingual Document Representations:
• Multilingual document representations involve techniques to represent documents in a
manner that transcends language boundaries. These representations enable the
comparison and retrieval of documents across different languages.
• Techniques for multilingual document representations often involve methods like word
embeddings, where words from multiple languages are mapped into a shared vector
space. For instance, models like Word2Vec, FastText, and multilingual versions of BERT
(Bidirectional Encoder Representations from Transformers) can help create such
representations.
• Another approach involves using cross-lingual embeddings, which align word
embeddings across languages to facilitate cross-lingual tasks.
2. Query Translation:
• Query translation is the process of translating user queries from one language to another
to retrieve relevant documents in the target language.
• One common approach to query translation involves using machine translation systems
like Google Translate or custom-built translation models. These systems translate user
queries from the source language to the target language before submitting them to the
search engine.
• Additionally, techniques like cross-lingual information retrieval (CLIR) allow users to
issue queries in one language and retrieve relevant documents in another language
without explicit translation. CLIR systems often rely on multilingual document
representations to bridge the semantic gap between languages.
Key challenges in multilingual document representations and query translation include dealing with
language-specific nuances, handling low-resource languages, and ensuring the accuracy of translation
and retrieval processes across languages.
Overall, effective multilingual document representations and query translation are crucial for enabling
users to access information from diverse linguistic sources and breaking down language barriers in
information retrieval systems.

1. Multilingual Document Representations:


Example: Word Embeddings
• Word embeddings are numerical representations of words in a continuous vector space.
These embeddings capture semantic and syntactic similarities between words, regardless
of the language they belong to.
• For instance, consider the words "cat" and "chat" (French for cat). In a well-trained
multilingual word embedding space, these words might have similar vector
representations because they share similar semantic meanings.
• By leveraging multilingual word embeddings, documents can be represented as vectors
in a shared embedding space. This allows for cross-lingual document retrieval and
comparison.

Example: Cross-lingual Embeddings


• Cross-lingual embeddings align word embeddings from different languages into a shared
space.
• For instance, the word "house" in English and its equivalent "casa" in Spanish might be
mapped to nearby points in the shared embedding space, despite originating from
different languages.
• Models like MUSE (Multilingual Unsupervised and Supervised Embeddings) facilitate the
alignment of word embeddings across multiple languages.
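A tiny numerical sketch of the shared-space idea (the 3-dimensional vectors below are invented; real aligned embeddings such as MUSE vectors are typically 300-dimensional and learned from data):

import numpy as np

# Hypothetical aligned word embeddings in a shared 3-D space:
# translation pairs sit close together regardless of language.
emb = {
    "house": np.array([0.90, 0.10, 0.00]),
    "casa":  np.array([0.88, 0.12, 0.02]),   # Spanish "house"
    "cat":   np.array([0.10, 0.90, 0.10]),
    "chat":  np.array([0.12, 0.88, 0.08]),   # French "cat"
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(round(cosine(emb["house"], emb["casa"]), 3))   # high: same concept, different language
print(round(cosine(emb["house"], emb["cat"]), 3))    # low: different concepts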

Multilingual document representations are techniques used to represent textual documents in a way
that allows for meaningful comparisons, analysis, and retrieval across multiple languages. They enable
systems to process and understand documents written in different languages, facilitating tasks such as
cross-lingual information retrieval, machine translation, and cross-lingual document classification. Here
are some key approaches and methods used in multilingual document representations:
1. Word Embeddings:

• Word embeddings are dense, low-dimensional vector representations of words, where
semantically similar words are represented by vectors that are close to each other in the
embedding space.
• Techniques like Word2Vec, GloVe (Global Vectors for Word Representation), and
FastText are commonly used to learn word embeddings from large corpora in multiple
languages.
• Multilingual word embeddings map words from different languages into a shared
embedding space, allowing for cross-lingual comparisons and analysis.
2. Sentence Embeddings:
• Sentence embeddings aim to capture the semantic meaning of entire sentences or
documents in a continuous vector space.
• Methods such as Doc2Vec and Universal Sentence Encoder (USE) learn fixed-length
representations of sentences that encode their semantic content.
• Multilingual sentence embeddings extend these techniques to support documents in
multiple languages, enabling cross-lingual document similarity calculations and retrieval.
3. Cross-Lingual Embeddings:
• Cross-lingual embeddings align word embeddings across different languages into a
shared space, enabling direct comparison and similarity calculation across languages.
• Models like MUSE (Multilingual Unsupervised and Supervised Embeddings) and LASER
(Language-Agnostic SEntence Representations) learn mappings between word
embeddings of different languages, facilitating cross-lingual tasks such as machine
translation and cross-lingual information retrieval.
4. Multilingual Topic Models:
• Topic models such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis
(LSA) identify latent topics in a collection of documents and represent documents as
distributions over these topics.
• Multilingual topic models extend these techniques to analyze and represent documents in
multiple languages simultaneously, capturing cross-lingual relationships and themes.
5. Transformer-Based Models:
• Transformer-based models, such as BERT (Bidirectional Encoder Representations from
Transformers) and its multilingual variants, learn contextualized representations of
words and sentences.
• Multilingual BERT models are pre-trained on large corpora containing text from multiple
languages, enabling them to capture language-agnostic features and relationships.
Multilingual document representations play a crucial role in various natural language processing tasks,
including cross-lingual information retrieval, multilingual document classification, machine translation,
and cross-lingual sentiment analysis. By capturing semantic similarities and relationships across
languages, these representations facilitate effective communication and understanding in multilingual
environments.
-------------------------------------------------------------------------------------------------------------------
2. Query Translation:
Example: Machine Translation
• Suppose a user wants to search for "weather forecast" in French but the search engine
primarily indexes documents in English.
• The user's query "prévision météo" is translated into English using a machine translation
system like Google Translate, resulting in "weather forecast."
• The translated query is then submitted to the search engine, which retrieves relevant
English-language documents about the weather forecast.
Example: Cross-lingual Information Retrieval (CLIR)
• In CLIR, the user issues a query in one language, and the system retrieves relevant
documents in another language without explicit translation.
• For instance, a user might enter the query "restaurantes en Madrid" (restaurants in
Madrid) in Spanish, seeking restaurant recommendations in Madrid.
• The CLIR system retrieves documents written in English that discuss restaurants in
Madrid, even though the user's query was in Spanish.
• CLIR systems leverage multilingual document representations and cross-lingual similarity
measures to bridge the linguistic gap between the query language and the indexed
documents.
In summary, multilingual document representations and query translation enable users to interact with
information across language boundaries, facilitating effective information retrieval and access to
diverse linguistic content. These techniques play a crucial role in breaking down language barriers and
enhancing the accessibility of information in multilingual environments.
-------------------------------------------------------------------------------------------------------------------

Query translation is the process of converting user queries from one language to another language
to retrieve relevant information from documents indexed in the target language. It's a fundamental
component of multilingual information retrieval systems and plays a vital role in enabling users to
interact with content across language barriers. Here's a detailed overview of query translation:
1. Process of Query Translation:
• User Input: The process begins when a user enters a query or search term in a source
language, typically their native or preferred language.
• Language Detection: The system first detects the language of the user query. This
step ensures that the system knows which language the query is in and whether
translation is necessary.
• Translation: Once the source language is identified, the user query is translated into
the target language(s) in which the documents are indexed. This translation can be
performed using various machine translation techniques, ranging from rule-based
approaches to statistical methods and neural machine translation models.
• Query Expansion (Optional): In some cases, the translated query may undergo query
expansion, where synonyms or related terms in the target language are added to the
translated query to improve retrieval effectiveness.
• Query Submission: The translated query is then submitted to the search engine or
retrieval system, which retrieves relevant documents in the target language(s) based on
the translated query.
2. Challenges and Considerations:
• Semantic Differences: Translating queries between languages can be challenging due
to semantic nuances and differences in expression between languages. A term in one
language may not have a direct equivalent in another language, leading to potential loss
of meaning or relevance in translation.
• Ambiguity: User queries may contain ambiguous terms or expressions that can be
interpreted differently depending on context. Translating such queries accurately
requires context-aware translation techniques to capture the intended meaning.
• Multilingual Ambiguity: In multilingual environments, words or phrases may have
different meanings across languages. Translation systems need to disambiguate between
these meanings to ensure accurate translation and retrieval.
• Resource Availability: The availability of linguistic resources, such as bilingual
dictionaries and parallel corpora, can impact the quality and accuracy of query
translation. Low-resource languages may pose additional challenges due to limited
linguistic resources for translation.
3. Approaches to Query Translation:
• Statistical Machine Translation (SMT): SMT models use statistical techniques to
learn translation patterns from large bilingual corpora. These models estimate the
probability of generating a target-language sentence given a source-language sentence.
• Neural Machine Translation (NMT): NMT models employ deep neural networks to
directly model the mapping between source and target languages. NMT models have
shown significant improvements over traditional SMT approaches, particularly for
capturing long-range dependencies and handling complex linguistic structures.
• Rule-based Translation: Rule-based translation systems rely on predefined linguistic
rules and dictionaries to perform translation. While rule-based systems may lack the
flexibility of statistical and neural approaches, they can be effective for specific domains
or languages with well-defined grammar and syntax rules.
4. Evaluation of Query Translation:
• The quality of query translation is typically evaluated using metrics such as translation
accuracy, fluency, and relevance of translated queries in retrieving relevant documents.
• Evaluation may involve human annotators assessing the adequacy and naturalness of
translated queries, as well as automated metrics such as BLEU (Bilingual Evaluation
Understudy) and METEOR (Metric for Evaluation of Translation with Explicit Ordering).
In conclusion, query translation is a complex process that involves translating user queries from one
language to another to enable effective information retrieval in multilingual environments. Advances in
machine translation technology and cross-lingual retrieval techniques continue to improve the accuracy
and efficiency of query translation systems, facilitating seamless access to information across language
barriers.
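As a small illustration of the automatic evaluation metrics mentioned above, the sketch below scores a machine-translated query against a human reference translation with NLTK's sentence-level BLEU (assuming the nltk package is installed; the query and translations are invented, and only unigram and bigram precision are used because queries are much shorter than full sentences):

from nltk.translate.bleu_score import sentence_bleu

# Reference (human) translation and a candidate (machine) translation of a query.
reference = [["project", "management", "techniques"]]
candidate = ["project", "management", "methods"]

# Use only unigram and bigram precision, since queries are very short.
score = sentence_bleu(reference, candidate, weights=(0.5, 0.5))
print(round(score, 3))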
-------------------------------------------------------------------------------------------------------------------
Evaluation techniques for Information Retrieval (IR) systems are crucial for assessing the
effectiveness and performance of these systems in retrieving relevant information in response to user
queries. Here are some common evaluation techniques used in IR:
1. Relevance Judgments:

• Relevance judgments involve human assessors determining the relevance of documents
returned by the IR system to a set of user queries.
• Assessors typically assign relevance judgments based on predefined criteria, such as
whether the document contains information that satisfies the user's information need.
• Relevance judgments serve as the ground truth against which the performance of the IR
system is measured.
2. Precision and Recall:
• Precision measures the proportion of relevant documents retrieved by the system among
all documents retrieved.
• Recall measures the proportion of relevant documents retrieved by the system among all
relevant documents in the collection.
• Precision and recall are typically computed at different levels, such as document level,
query level, or user session level.
3. F-measure:
• The F-measure (or F1 score) is the harmonic mean of precision and recall and provides a
single measure that balances the two metrics.
• It is computed as 2 * (precision * recall) / (precision + recall).
4. Mean Average Precision (MAP):
• MAP computes the average precision across all queries in the evaluation dataset.
• It considers the ranked list of retrieved documents for each query and computes the
average precision over relevant documents until all relevant documents are retrieved.
5. Mean Reciprocal Rank (MRR):
• MRR measures the average of the reciprocal ranks of the first relevant document
retrieved for each query.
• It is particularly useful for tasks where only the top-ranked document matters (e.g.,
question answering).
6. Normalized Discounted Cumulative Gain (NDCG):
• NDCG evaluates the quality of the ranked list of retrieved documents by considering both
relevance and the position of documents in the list.
• It assigns higher scores to relevant documents that appear higher in the ranked list.
7. Precision-Recall Curves:
• Precision-recall curves illustrate the trade-off between precision and recall at different
retrieval thresholds.
• They are useful for comparing the performance of IR systems across different settings or
algorithms.
8. User Studies and User Satisfaction:
• User studies involve gathering feedback from users regarding their satisfaction with the
retrieved results.
• Techniques such as surveys, interviews, and user observations can provide insights into
the usability and effectiveness of IR systems from the user's perspective.
By employing these evaluation techniques, researchers and practitioners can systematically assess the
performance and effectiveness of Information Retrieval systems and make informed decisions
regarding system improvements and optimizations.

1. Relevance Judgments:
• Relevance judgments involve human assessors determining the relevance of documents
returned by the IR system to a set of user queries.
• Assessors typically assign relevance judgments based on predefined criteria, such as
whether the document contains information that satisfies the user's information need.
• Relevance judgments serve as the ground truth against which the performance of the IR
system is measured.
2. Precision and Recall:
• Precision measures the proportion of relevant documents retrieved by the system among
all documents retrieved. It is computed as:
Precision = Number of relevant documents retrieved / Total number of documents retrieved
• Recall measures the proportion of relevant documents retrieved by the system among all
relevant documents in the collection. It is computed as:
Recall = Number of relevant documents retrieved / Total number of relevant documents
• Precision and recall are typically computed at different levels, such as document level,
query level, or user session level.



3. F-measure:
• The F-measure (or F1 score) is the harmonic mean of precision and recall and provides a
single measure that balances the two metrics.
• It is computed as: F1 = 2 * (Precision * Recall) / (Precision + Recall)
4. Mean Average Precision (MAP):


• MAP computes the average precision across all queries in the evaluation dataset.
• It considers the ranked list of retrieved documents for each query and computes the
average precision over relevant documents until all relevant documents are retrieved.
• The formula for MAP is: MAP = (1/N) * Σ_{i=1}^{N} AP_i, where N is the total number of queries and AP_i is the average precision for query i.

5. Mean Reciprocal Rank (MRR):
• MRR measures the average of the reciprocal ranks of the first relevant document
retrieved for each query.
• It is particularly useful for tasks where only the top-ranked document matters (e.g.,
question answering).

• The formula for MRR is: MRR = (1/N) * Σ_{i=1}^{N} (1 / rank_i), where N is the total number of queries and rank_i is the rank of the first relevant document for query i.
6. Normalized Discounted Cumulative Gain (NDCG):
• NDCG evaluates the quality of the ranked list of retrieved documents by considering both
relevance and the position of documents in the list.
• It assigns higher scores to relevant documents that appear higher in the ranked list.
• The formula for NDCG varies, but a common form discounts the gain of each document by its rank position: DCG = Σ_i rel_i / log2(i + 1) over the ranked list, and NDCG = DCG / IDCG, where IDCG is the DCG of the ideal (perfectly ordered) ranking.
7. Precision-Recall Curves:
• Precision-recall curves illustrate the trade-off between precision and recall at different
retrieval thresholds.
• They are useful for comparing the performance of IR systems across different settings or
algorithms.
8. User Studies and User Satisfaction:
• User studies involve gathering feedback from users regarding their satisfaction with the
retrieved results.
• Techniques such as surveys, interviews, and user observations can provide insights into
the usability and effectiveness of IR systems from the user's perspective.
These evaluation techniques are essential for assessing the performance of IR systems objectively,
identifying areas for improvement, and comparing the effectiveness of different retrieval algorithms
and strategies. They help ensure that IR systems meet the information needs of users and provide
relevant and useful results.
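The ranking-based metrics above can be made concrete with a short, self-contained sketch; the relevance judgments and the ranked result list below are invented for illustration (MAP would simply be the mean of the average precision values over a set of queries):

import math

# Relevance judgments (ground truth) and a ranked list returned by the system.
relevant = {"d1", "d3", "d7"}
ranking = ["d3", "d2", "d1", "d5", "d7"]

# Precision and recall over the whole returned list.
hits = [d for d in ranking if d in relevant]
precision = len(hits) / len(ranking)
recall = len(hits) / len(relevant)
f1 = 2 * precision * recall / (precision + recall)

# Average precision: mean of precision@k at each rank k where a relevant doc appears.
precisions_at_hits = []
num_rel_so_far = 0
for k, doc in enumerate(ranking, start=1):
    if doc in relevant:
        num_rel_so_far += 1
        precisions_at_hits.append(num_rel_so_far / k)
average_precision = sum(precisions_at_hits) / len(relevant)

# Reciprocal rank: 1 / rank of the first relevant document.
reciprocal_rank = next(1 / k for k, d in enumerate(ranking, start=1) if d in relevant)

# DCG and NDCG with binary gains and a log2(rank + 1) discount.
gains = [1 if d in relevant else 0 for d in ranking]
dcg = sum(g / math.log2(k + 1) for k, g in enumerate(gains, start=1))
idcg = sum(g / math.log2(k + 1) for k, g in enumerate(sorted(gains, reverse=True), start=1))
ndcg = dcg / idcg

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f} "
      f"AP={average_precision:.2f} RR={reciprocal_rank:.2f} NDCG={ndcg:.2f}")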

CHAPTER XIV: USER-BASED EVALUATION
Topics covered:
User studies, surveys, Test collections and benchmarking, Online evaluation methods: A/B testing,
interleaving experiments
-------------------------------------------------------------------------------------------------------------------
User studies:
User studies play a critical role in evaluating Information Retrieval (IR) systems from the perspective
of end-users. These studies aim to assess how well the IR system meets users' information needs,
their satisfaction with the retrieved results, and the overall usability of the system. Here's a detailed
overview of user studies in IR:
1. Objective of User Studies:
• The primary objective of user studies in IR is to understand how users interact with the
system, their search behavior, and their satisfaction level with the retrieved results.
• User studies help identify usability issues, user preferences, and areas for improvement
in the IR system.
2. Types of User Studies:
• Laboratory Studies: These studies are conducted in controlled environments where
participants perform predefined search tasks under the observation of researchers.
Researchers can closely monitor user interactions, behaviors, and feedback.
• Field Studies: Field studies involve observing users' interactions with the IR system in
their natural environment, such as their workplace or home. This approach provides
insights into real-world usage patterns and challenges.
• Surveys and Questionnaires: Surveys and questionnaires are used to gather
quantitative and qualitative feedback from users regarding their satisfaction,
preferences, and suggestions for improvement.
• Interviews and Focus Groups: Interviews and focus groups allow researchers to
conduct in-depth discussions with users to understand their information needs, search
strategies, and experiences with the IR system.
3. Key Metrics and Measures:
• Task Success Rate: Measures the percentage of search tasks completed successfully
by users.
• Search Time: The time taken by users to complete search tasks, including query
formulation, result scanning, and document selection.
• Precision and Recall: Users' ability to find relevant documents among the retrieved
results.
• User Satisfaction: Users' subjective evaluation of the system's performance, relevance
of retrieved results, ease of use, and overall satisfaction.
• Task Completion Time: The time taken by users to accomplish specific search tasks,
providing insights into the efficiency of the system.
• Search Relevance Judgment: Users' assessment of the relevance of retrieved
documents to their information needs.
4. Ethical Considerations:
• It's essential to ensure that user studies adhere to ethical guidelines, including informed
consent, privacy protection, and data anonymization.
• Researchers should prioritize the safety, privacy, and well-being of study participants
throughout the research process.
5. Analysis and Interpretation:
• Data collected from user studies are analyzed to identify patterns, trends, and user
preferences.
• Qualitative data from interviews and focus groups may be analyzed using thematic
analysis or content analysis techniques.
• Quantitative data from surveys and task performance metrics are analyzed using
statistical methods to derive meaningful insights.
6. Iterative Design and Improvement:
• Findings from user studies inform iterative design improvements to the IR system.
• Researchers and developers use user feedback to refine system features, interface
design, relevance ranking algorithms, and overall user experience.
By conducting user studies, IR researchers and practitioners can gain valuable insights into user
behavior, preferences, and satisfaction levels, leading to the development of more effective and user-
centric IR systems.
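To make the study metrics concrete, here is a minimal sketch (with invented session data) that computes the task success rate and search-time measures from a simple log of search sessions:

# One record per task attempt in a user study (all values are made up).
sessions = [
    {"user": "u1", "task": "t1", "success": True,  "seconds": 45},
    {"user": "u1", "task": "t2", "success": False, "seconds": 120},
    {"user": "u2", "task": "t1", "success": True,  "seconds": 60},
    {"user": "u2", "task": "t2", "success": True,  "seconds": 90},
]

task_success_rate = sum(s["success"] for s in sessions) / len(sessions)
mean_search_time = sum(s["seconds"] for s in sessions) / len(sessions)
mean_time_successful = (sum(s["seconds"] for s in sessions if s["success"])
                        / sum(s["success"] for s in sessions))

print(f"Task success rate: {task_success_rate:.0%}")
print(f"Mean search time: {mean_search_time:.1f} s")
print(f"Mean search time (successful tasks only): {mean_time_successful:.1f} s")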

User studies in Information Retrieval (IR) involve systematic investigations into how users interact with
IR systems, their information-seeking behavior, and their satisfaction with the retrieval outcomes.
Here's a detailed breakdown of the components and methodologies involved in user studies in IR:

1. Research Design:
• Defining Objectives: Clearly articulate the research goals, including understanding user
behavior, evaluating system effectiveness, or identifying usability issues.
• Participant Selection: Determine the target user population based on demographics,
expertise, and relevance to the study objectives.
• Task Definition: Define specific search tasks or scenarios that participants will perform
using the IR system.
2. Data Collection Methods:
• Laboratory Studies: Conducted in controlled environments where participants are asked
to perform predefined search tasks using the IR system. Researchers can observe and
record user interactions, behavior, and feedback in real-time.
• Field Studies: Observing users in their natural environment (e.g., workplace or home) as
they interact with the IR system. This approach provides insights into real-world usage
patterns and challenges.
• Surveys and Questionnaires: Administer structured surveys or questionnaires to collect
quantitative and qualitative feedback from users regarding their satisfaction,
preferences, and suggestions for improvement.
• Interviews and Focus Groups: Conduct in-depth interviews or group discussions with
users to explore their information needs, search strategies, and experiences with the IR
system.
3. Key Metrics and Measures:
• Task Success Rate: The percentage of search tasks completed successfully by users.
• Search Time: Time taken by users to complete search tasks, including query
formulation, result scanning, and document selection.
• Precision and Recall: Users' ability to find relevant documents among the retrieved
results.
• User Satisfaction: Users' subjective evaluation of the system's performance, relevance of
retrieved results, ease of use, and overall satisfaction.
• Task Completion Time: The time taken by users to accomplish specific search tasks,
providing insights into the efficiency of the system.
• Search Relevance Judgment: Users' assessment of the relevance of retrieved documents
to their information needs.
4. Ethical Considerations:
• Informed Consent: Participants should be provided with clear information about the
study's purpose, procedures, and potential risks, and they should voluntarily agree to
participate.
• Privacy Protection: Ensure that participants' privacy and confidentiality are protected
throughout the study, especially when dealing with sensitive information.
• Data Anonymization: Anonymize and aggregate data to protect participants' identities
and ensure confidentiality.
5. Data Analysis and Interpretation:
• Qualitative Analysis: Thematic analysis or content analysis of qualitative data (e.g.,
interview transcripts, open-ended survey responses) to identify patterns, themes, and
user perceptions.
• Quantitative Analysis: Statistical analysis of quantitative data (e.g., survey ratings, task
completion times) to identify significant trends, correlations, and differences between
user groups.
6. Iterative Design and Improvement:
• Incorporate findings from user studies into the iterative design process to enhance the
IR system's usability, functionality, and performance.
• Engage with users and stakeholders to gather feedback, validate design decisions, and
prioritize improvements based on user needs and preferences.
By conducting comprehensive user studies, IR researchers and practitioners can gain valuable insights
into user behavior, preferences, and satisfaction levels, leading to the development of more effective
and user-centric IR systems.
-------------------------------------------------------------------------------------------------------------------
User-based evaluation in Information Retrieval (IR) involves assessing the performance and
effectiveness of IR systems from the perspective of end-users. Unlike traditional evaluation methods
that focus solely on system-based metrics like precision and recall, user-based evaluation considers
user satisfaction, interaction patterns, and overall usability of the system. Here's a detailed breakdown
of user-based evaluation in IR:
1. Objective:
• The primary objective of user-based evaluation is to understand how well the IR system
meets users' information needs and how satisfied users are with the retrieval outcomes.
• It aims to assess the effectiveness, efficiency, and user experience of the IR system in
real-world scenarios.
2. Methods:
• User Studies: Conducting user studies in controlled or natural environments to observe
how users interact with the IR system, formulate queries, scan search results, and select
relevant documents.
• Surveys and Questionnaires: Administering structured surveys or questionnaires to
gather quantitative and qualitative feedback from users regarding their satisfaction,
preferences, and suggestions for improvement.
• Interviews and Focus Groups: Conducting in-depth interviews or group discussions
with users to explore their information needs, search strategies, and experiences with
the IR system.
3. Key Metrics and Measures:
• Task Success Rate: The percentage of search tasks completed successfully by users,
indicating the system's ability to retrieve relevant information.
• Search Time: The time taken by users to complete search tasks, including query
formulation, result scanning, and document selection.
• Precision and Recall: Users' ability to find relevant documents among the retrieved
results, indicating the system's accuracy and relevance.
• User Satisfaction: Users' subjective evaluation of the system's performance, relevance
of retrieved results, ease of use, and overall satisfaction.
• Task Completion Time: The time taken by users to accomplish specific search tasks,
providing insights into the efficiency of the system.
• Search Relevance Judgment: Users' assessment of the relevance of retrieved
documents to their information needs, helping identify gaps in relevance ranking
algorithms.
4. Ethical Considerations:
• Ensure informed consent from participants, providing clear information about the study's
purpose, procedures, and potential risks.
• Protect participants' privacy and confidentiality throughout the study, especially when
dealing with sensitive information.
• Anonymize and aggregate data to safeguard participants' identities and ensure data
confidentiality and integrity.
5. Data Analysis and Interpretation:
• Qualitative Analysis: Thematic analysis or content analysis of qualitative data (e.g.,
interview transcripts, open-ended survey responses) to identify patterns, themes, and
user perceptions.
• Quantitative Analysis: Statistical analysis of quantitative data (e.g., survey ratings, task
completion times) to identify significant trends, correlations, and differences between
user groups.
6. Iterative Design and Improvement:
• Incorporate findings from user-based evaluation into the iterative design process to
enhance the IR system's usability, functionality, and performance.
• Engage with users and stakeholders to gather feedback, validate design decisions, and
prioritize improvements based on user needs and preferences.
By conducting comprehensive user-based evaluation, IR researchers and practitioners can gain
valuable insights into user behavior, preferences, and satisfaction levels, leading to the development of
more effective and user-centric IR systems.
-------------------------------------------------------------------------------------------------------------------
Surveys:
Surveys play a significant role in evaluating Information Retrieval (IR) systems by collecting
quantitative and qualitative feedback from users regarding their satisfaction, preferences, and
suggestions for improvement. Surveys in IR are structured questionnaires administered to participants
to gather insights into their experiences, perceptions, and interactions with the IR system. Here's a
detailed overview of surveys in IR:
1. Objective:
• Surveys aim to assess users' satisfaction levels, their perception of the IR system's
effectiveness, and their preferences regarding system features, interface design, and
retrieval outcomes.
• They provide valuable insights into user needs, behaviors, and expectations, guiding the
design and improvement of IR systems.
2. Survey Design:
• Identify Research Objectives: Clearly define the research goals and objectives of the
survey, such as evaluating system usability, relevance of retrieved results, or user
satisfaction.
• Design Survey Instrument: Develop a structured questionnaire containing a mix of
closed-ended (quantitative) and open-ended (qualitative) questions.
• Select Survey Format: Choose the appropriate survey format, including online surveys,
paper-based surveys, or in-person interviews, based on participant demographics and
preferences.
• Pilot Testing: Conduct pilot testing of the survey instrument with a small group of
participants to identify potential issues, refine questions, and ensure clarity and
relevance.
3. Survey Content:
• Demographic Information: Collect demographic data such as age, gender, education
level, and professional background to understand the characteristics of the survey
participants.
• System Usage: Assess participants' frequency of using the IR system, types of tasks
performed, and preferred search strategies.
• Satisfaction Ratings: Use Likert scale or rating scales to measure participants'
satisfaction levels with various aspects of the IR system, including interface design,
search functionality, relevance of results, and overall user experience.
• Open-Ended Questions: Include open-ended questions to allow participants to provide
detailed feedback, suggestions, and comments about their experiences with the IR
system.
4. Survey Administration:
• Recruitment: Recruit participants from the target user population, ensuring diversity and
representation across relevant demographics and user groups.
• Administration: Administer the survey through appropriate channels, such as online
survey platforms, email invitations, or in-person interviews.
• Informed Consent: Provide participants with clear information about the survey's
purpose, voluntary participation, confidentiality measures, and data handling
procedures.
• Data Collection: Collect survey responses from participants within a specified timeframe,
ensuring data accuracy and completeness.
5. Data Analysis and Interpretation:
• Quantitative Analysis: Analyze quantitative survey responses using statistical methods such as mean, median, standard deviation, and frequency distribution to identify trends, patterns, and correlations (a small computational sketch follows at the end of this topic).
• Qualitative Analysis: Conduct thematic analysis or content analysis of qualitative
responses to identify recurring themes, insights, and actionable feedback from
participants.
• Integration: Integrate findings from quantitative and qualitative analyses to derive
comprehensive insights into user perceptions, preferences, and satisfaction levels.
6. Reporting and Dissemination:
• Prepare a detailed report summarizing survey findings, including key insights, trends,
and recommendations for improving the IR system.
• Share survey results with relevant stakeholders, including researchers, developers, and
decision-makers, to inform decision-making and prioritize system enhancements.
By leveraging surveys in IR, researchers and practitioners can gather valuable feedback from users,
identify areas for improvement, and enhance the usability, functionality, and effectiveness of IR
systems to better meet user needs and expectations.
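As a small illustration of the quantitative analysis step described above, the sketch below summarises hypothetical 5-point Likert satisfaction ratings with the descriptive statistics mentioned (mean, median, standard deviation, and a frequency distribution); the ratings themselves are invented for the example.

# Minimal sketch: summarising 5-point Likert satisfaction ratings (hypothetical data).
from statistics import mean, median, stdev
from collections import Counter

ratings = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]   # 1 = very dissatisfied ... 5 = very satisfied

print("Mean rating:", round(mean(ratings), 2))
print("Median rating:", median(ratings))
print("Std. deviation:", round(stdev(ratings), 2))
print("Frequency distribution:", dict(sorted(Counter(ratings).items())))

Open-ended responses from the same survey would instead be coded thematically, and the quantitative and qualitative findings integrated as described above.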
-------------------------------------------------------------------------------------------------------------------
Test collections and benchmarking:
Test collections and benchmarking play a pivotal role in Information Retrieval (IR) research and
development. They provide standardized datasets and evaluation metrics for assessing the
performance of IR systems. Here's a detailed explanation of test collections and benchmarking in IR:
1. Test Collections:
Test collections are curated datasets that consist of:
1. Document Collection: A corpus of documents typically crawled from the web, scientific
papers, news articles, or other sources relevant to the domain of interest.
2. Queries: A set of queries or search topics formulated based on real-world information needs or
specific test scenarios.
3. Relevance Judgments: For each query, human assessors provide relevance judgments, indicating which documents in the collection are relevant to the query and to what degree (a loading sketch follows this list).
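Relevance judgments are often distributed as a TREC-style "qrels" text file in which each line carries a query id, an (often unused) iteration field, a document id, and a relevance grade. The sketch below loads such a file into a nested dictionary; the file name and the exact field order are assumptions that should be checked against the specific collection being used.

# Minimal sketch: loading TREC-style relevance judgments (qrels) into a dictionary.
# Assumed line format: "<query_id> <iteration> <doc_id> <relevance>".
from collections import defaultdict

def load_qrels(path):
    qrels = defaultdict(dict)          # query_id -> {doc_id: relevance grade}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            query_id, _iteration, doc_id, relevance = line.split()
            qrels[query_id][doc_id] = int(relevance)
    return qrels

# Hypothetical usage: qrels = load_qrels("collection.qrels")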
2. Benchmarking:
Benchmarking involves evaluating the performance of IR systems using test collections. The process
usually includes the following steps:
1. System Retrieval: IR systems retrieve documents from the test collection in response to the
given queries.
2. Evaluation: The retrieved documents are compared against the relevance judgments to assess
the system's effectiveness using various evaluation metrics.
3. Metrics: Common evaluation metrics include precision, recall, F1 score, mean average precision (MAP), mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), and precision-recall curves (a per-query computation sketch follows this list).
4. Comparison: The performance of different IR systems can be compared based on their scores
on these metrics.
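To make the metrics step concrete, the sketch below computes precision@k, average precision, and NDCG@k for a single query, given a ranked result list and graded relevance judgments (binary judgments are simply grades 0/1). The example ranking and judgments are hypothetical, and the NDCG shown uses the plain-gain formulation; some evaluations use the 2^grade - 1 variant instead.

# Minimal sketch: per-query evaluation metrics against relevance judgments (hypothetical data).
import math

def precision_at_k(ranking, qrels, k):
    return sum(1 for d in ranking[:k] if qrels.get(d, 0) > 0) / k

def average_precision(ranking, qrels):
    relevant_total = sum(1 for g in qrels.values() if g > 0)
    if relevant_total == 0:
        return 0.0
    hits, score = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if qrels.get(d, 0) > 0:
            hits += 1
            score += hits / rank        # precision at each relevant rank
    return score / relevant_total

def ndcg_at_k(ranking, qrels, k):
    def dcg(gains):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))
    ideal = dcg(sorted(qrels.values(), reverse=True)[:k])
    return dcg([qrels.get(d, 0) for d in ranking[:k]]) / ideal if ideal > 0 else 0.0

ranking = ["d3", "d1", "d7", "d2", "d9"]          # system output for one query
qrels = {"d1": 2, "d2": 1, "d5": 2}               # unjudged documents count as non-relevant

print("P@5   :", precision_at_k(ranking, qrels, 5))
print("AP    :", round(average_precision(ranking, qrels), 3))
print("NDCG@5:", round(ndcg_at_k(ranking, qrels, 5), 3))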
Advantages of Test Collections and Benchmarking:
1. Standardization: Test collections provide a standardized framework for evaluating and
comparing the performance of IR systems across different research studies and
implementations.
2. Reproducibility: Researchers can replicate experiments and compare results across studies
using the same test collections and evaluation metrics.
3. Validation: Test collections allow researchers to validate new algorithms, techniques, or
enhancements in IR systems in a controlled and systematic manner.
4. Progress Tracking: Benchmarking enables tracking the progress and advancements in IR
technology over time by comparing the performance of newer systems with existing
benchmarks.
Challenges and Considerations:
1. Representativeness: Test collections should reflect real-world information retrieval scenarios
and domain-specific characteristics to ensure the relevance and validity of evaluation results.
2. Scalability: Building large-scale and diverse test collections can be resource-intensive and
time-consuming, particularly for domains with vast and dynamic content.
3. Bias and Generalization: Test collections may exhibit biases in terms of topics, queries, or
relevance judgments, which can influence evaluation outcomes and limit the generalizability of
results.
4. Dynamic Nature: Test collections may become outdated over time due to changes in user
information needs, content availability, or retrieval algorithms, necessitating periodic updates
and revisions.
In summary, test collections and benchmarking provide valuable resources for evaluating, comparing,
and advancing Information Retrieval systems. They enable researchers and practitioners to
systematically assess system performance, identify areas for improvement, and contribute to the
ongoing development and innovation in the field of IR.
Building on the overview above, the following sections look more closely at how test collections are constructed and how benchmarking studies are designed, conducted, and reported in IR:
1. Test Collections:
Test collections refer to standardized datasets comprising documents, queries, and relevance
judgments used for evaluating the effectiveness of IR systems. These collections are carefully
constructed to represent various aspects of real-world information retrieval scenarios. Here's how test
collections are typically constructed and utilized:
• Document Collection: Test collections consist of a corpus of documents, often drawn from
sources like news articles, web pages, academic papers, or other textual sources. The
documents cover diverse topics and represent the types of content users might encounter in
real-world information retrieval tasks.
• Query Set: A set of queries is selected or generated to represent the information needs of
users. These queries may be based on actual user queries from search logs, curated by domain
experts, or generated based on specific criteria such as topic diversity and query length.
• Relevance Judgments: Human assessors annotate the relevance of documents to each query
in the collection. Relevance judgments indicate whether a document is relevant, partially
relevant, or irrelevant to a given query. Assessors may use binary judgments (relevant or non-
relevant) or graded relevance judgments to capture varying degrees of relevance.
• Metadata and Ground Truth: Test collections often include additional metadata such as
document titles, URLs, publication dates, and author information. This metadata helps
researchers analyze retrieval performance and understand the characteristics of the documents
and queries.
• Usage: Test collections serve as standardized benchmarks for evaluating the performance of IR
systems. Researchers use these collections to measure various metrics such as precision, recall,
mean average precision (MAP), and other evaluation measures. Test collections also facilitate
comparisons between different retrieval models, algorithms, and techniques.
2. Benchmarking in IR:
Benchmarking involves the systematic evaluation and comparison of IR systems using standardized
test collections and evaluation measures. Here's how benchmarking is conducted in IR:
• System Evaluation: IR systems are evaluated using test collections and evaluation metrics to
assess their performance in retrieving relevant documents for given queries.
• Comparison of Techniques: Benchmarking allows researchers to compare the effectiveness of
different retrieval models, algorithms, and techniques under controlled conditions. This helps
identify the strengths and weaknesses of various approaches and informs the development of
more effective retrieval systems.
• Evaluation Measures: Benchmarking typically involves the computation of evaluation measures such as precision, recall, F1-score, MAP, mean reciprocal rank (MRR), and normalized discounted cumulative gain (NDCG). These measures provide quantitative assessments of retrieval performance and help researchers understand the trade-offs between precision and recall (a MAP aggregation sketch appears at the end of this topic).
• Statistical Analysis: Benchmarking studies often include statistical analysis to determine
whether observed differences in retrieval performance are statistically significant. This helps
ensure that any observed improvements or differences are not due to random chance.
• Publication and Sharing: Benchmarking studies are often published in academic conferences
and journals, allowing researchers to share their findings with the broader IR community. Test
collections and evaluation results are also shared publicly to facilitate reproducibility and further
research in the field.
In summary, test collections and benchmarking provide essential resources and methodologies for
evaluating and comparing the performance of IR systems. They help researchers assess the
effectiveness of retrieval techniques, identify areas for improvement, and advance the state-of-the-art
in information retrieval.
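As a small follow-on to the per-query metrics sketched earlier, the sketch below aggregates mean average precision (MAP) over a set of queries; the run and qrels data are hypothetical, and the average_precision helper is repeated here so the example stands on its own.

# Minimal sketch: mean average precision (MAP) across queries (hypothetical data).
from statistics import mean

def average_precision(ranking, judgments):
    relevant_total = sum(1 for g in judgments.values() if g > 0)
    if relevant_total == 0:
        return 0.0
    hits, score = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if judgments.get(d, 0) > 0:
            hits += 1
            score += hits / rank
    return score / relevant_total

runs = {                       # query_id -> ranked list returned by the system
    "q1": ["d3", "d1", "d7"],
    "q2": ["d9", "d4", "d2"],
}
qrels = {                      # query_id -> {doc_id: relevance grade}
    "q1": {"d1": 1, "d5": 1},
    "q2": {"d4": 1, "d2": 1},
}

map_score = mean(average_precision(runs[q], qrels[q]) for q in runs)
print("MAP:", round(map_score, 3))

When two systems are compared, it is the per-query AP values (not only the aggregate MAP) that a significance test such as a paired t-test across queries would be applied to.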
-------------------------------------------------------------------------------------------------------------------
Online evaluation methods in Information Retrieval:
Online evaluation methods in Information Retrieval (IR) involve assessing the performance of retrieval
systems using live user interactions and real-world data. Unlike offline evaluation methods that rely on
predefined test collections, online evaluation leverages actual user queries, interactions, and feedback
to measure system effectiveness. Here's a detailed overview of online evaluation methods in IR:
1. Click-Through Rate (CTR):
• Click-through rate measures the proportion of users who click on a search result after
issuing a query.
• In online evaluation, CTR is used to assess the relevance and attractiveness of search
results presented to users.
• Higher CTRs indicate that the displayed results are more relevant and engaging to users.
2. Dwell Time:
• Dwell time refers to the amount of time users spend interacting with a search result
page or a specific document after clicking on it.
• Longer dwell times often indicate that users find the content relevant and engaging.
• Dwell time can be used as a proxy for user satisfaction and result relevance. (A sketch computing CTR and mean dwell time from an interaction log appears at the end of this topic.)
3. User Engagement Metrics:
• Various engagement metrics, such as scroll depth, time spent on page, and interaction
with search facets (filters), provide insights into user behavior and preferences.
• Analyzing user engagement metrics helps evaluate the effectiveness of search interfaces
and the relevance of presented results.
4. Online A/B Testing:
• A/B testing involves comparing two versions (A and B) of a feature or algorithm in live
production environments.
• In IR, A/B testing can be used to evaluate different ranking algorithms, user interface
designs, or search result presentation formats.
• Metrics such as click-through rates, conversion rates, and user satisfaction scores are
compared between the A and B versions to determine the most effective approach.
5. Personalization and User Feedback:
• Personalization techniques, such as collaborative filtering and content-based
recommendation systems, tailor search results to individual user preferences.
• Online evaluation involves collecting and analyzing user feedback, ratings, and
preferences to refine personalization algorithms and improve result relevance.
6. Continuous Monitoring and Analytics:
• Online evaluation requires continuous monitoring of user interactions, system
performance, and user feedback.
• Analytics tools and monitoring platforms track key performance indicators (KPIs) such as
search volume, user engagement, and conversion rates in real-time.
• Insights from analytics data help identify trends, patterns, and areas for improvement in
the IR system.
7. User Surveys and Feedback Loops:
• Periodic user surveys and feedback mechanisms allow users to provide input on their
search experiences, satisfaction levels, and suggestions for improvement.
• Feedback loops facilitate communication between users and system developers, enabling
iterative improvements based on user needs and preferences.
8. Longitudinal Studies:
• Longitudinal studies track user behavior and system performance over extended periods
to observe trends and changes in user preferences and system effectiveness.
• Analyzing longitudinal data helps identify long-term patterns, challenges, and
opportunities for innovation in IR systems.
Online evaluation methods in IR provide valuable insights into user behavior, preferences, and system
performance in real-world contexts. By leveraging live user interactions and feedback, researchers and
practitioners can continuously improve IR systems to better meet user needs and expectations.
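As a concrete illustration of the click-through-rate and dwell-time metrics described above, the sketch below derives both from a hypothetical interaction log; the log layout (one record per displayed result, with a dwell time recorded only for clicked results) is an assumption made for the example.

# Minimal sketch: CTR and mean dwell time from a hypothetical interaction log.
from statistics import mean

impressions = [
    {"query": "q1", "doc": "d1", "clicked": True,  "dwell_seconds": 42.0},
    {"query": "q1", "doc": "d2", "clicked": False, "dwell_seconds": None},
    {"query": "q2", "doc": "d7", "clicked": True,  "dwell_seconds": 8.5},
    {"query": "q2", "doc": "d3", "clicked": False, "dwell_seconds": None},
]

clicks = [r for r in impressions if r["clicked"]]
ctr = len(clicks) / len(impressions)
mean_dwell = mean(r["dwell_seconds"] for r in clicks)

print(f"CTR: {ctr:.0%}")
print(f"Mean dwell time on clicked results: {mean_dwell:.1f} s")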
------------------------------------------------------------------------------------------------------------------
A/B testing, also known as split testing, is a method used to compare two or more versions of a
webpage, application, or feature to determine which one performs better based on predefined metrics.
In the context of Information Retrieval (IR), A/B testing is often used to evaluate different algorithms,
user interfaces, or system configurations to optimize search performance. Here's a detailed overview of
A/B testing:
1. Objective:
• The primary objective of A/B testing is to determine which variant (A or B) of a
webpage, application feature, or algorithm performs better in terms of predefined
metrics.
• In IR, A/B testing aims to improve the effectiveness, relevance, and user satisfaction of
search results by evaluating different ranking algorithms, user interfaces, or search
features.
2. Setup:
• A/B testing involves dividing users randomly into two or more groups (A and B) and
exposing each group to a different variant of the system or feature being tested.
• The control group (A) typically receives the existing or default version of the system,
while the treatment group (B) is exposed to the new or modified version.
• The allocation of users to different groups should be randomized to ensure that the
groups are statistically similar in terms of user characteristics and behavior.
3. Variants:
• Variants in A/B testing can include changes to the user interface, search algorithms,
result presentation formats, relevance ranking strategies, or any other aspect of the IR
system being evaluated.
• Each variant should represent a distinct hypothesis or proposed improvement that can
be measured objectively using predefined metrics.
4. Metrics:
• Predefined metrics are used to measure the performance and effectiveness of each
variant in the A/B test.
• Common metrics in IR A/B testing include click-through rate (CTR), conversion rate,
session duration, bounce rate, relevance judgments, and user satisfaction scores.
• Metrics should be selected based on their relevance to the goals of the A/B test and their
ability to capture user engagement, relevance, and satisfaction.
5. Duration and Sample Size:
• The duration of an A/B test and the sample size required depend on factors such as the
expected effect size, statistical power, and variability of the metrics being measured.
• A/B tests should be run for a sufficient duration to capture variability in user behavior
and ensure reliable results.
• Sample size calculations help determine the number of users needed to detect
meaningful differences between variants with statistical significance.
6. Analysis:
• Statistical analysis is performed on the collected data to determine whether there are
significant differences in the performance of the variants.
• Hypothesis testing techniques, such as t-tests, chi-square tests, or Bayesian inference,
are used to compare the metrics between groups and assess statistical significance.
• Confidence intervals and p-values are calculated to quantify uncertainty and the likelihood that the observed differences arose by chance (a worked sketch appears at the end of this topic).
7. Interpretation and Decision Making:
• Based on the results of the A/B test, decisions are made regarding the adoption,
optimization, or rejection of the tested variants.
• Positive outcomes may lead to the implementation of the new variant as the default
option in the IR system, while negative outcomes may prompt further iterations or
alternative strategies.
8. Iterative Improvement:
• A/B testing is an iterative process that allows for continuous improvement and
optimization of IR systems based on user feedback and data-driven insights.
• Multiple rounds of A/B testing may be conducted to refine and fine-tune the system over
time, leading to incremental improvements in search performance and user satisfaction.
A/B testing is a powerful and widely used technique in IR for empirically evaluating and optimizing
search algorithms, user interfaces, and system configurations.
By systematically comparing different variants and measuring their impact on predefined metrics, A/B
testing helps drive data-driven decision-making and continuous improvement in IR systems.
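As a worked illustration of the sample-size and analysis points above (points 5 and 6), the sketch below applies a two-proportion z-test to hypothetical click counts from variants A and B and then estimates the per-group sample size needed to detect a difference of that size; all numbers are invented, and in practice the choice of test (z-test, chi-square, Bayesian) is part of the experiment design.

# Minimal sketch: two-proportion z-test on hypothetical A/B click counts.
import math

clicks_a, users_a = 410, 10000      # control variant A (hypothetical)
clicks_b, users_b = 475, 10000      # treatment variant B (hypothetical)

p_a, p_b = clicks_a / users_a, clicks_b / users_b
p_pool = (clicks_a + clicks_b) / (users_a + users_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))   # two-sided, normal CDF

print(f"CTR A = {p_a:.2%}, CTR B = {p_b:.2%}, z = {z:.2f}, p = {p_value:.4f}")

# Rough per-group sample size to detect a difference of this size
# (alpha = 0.05 two-sided, power = 0.80; 1.96 and 0.84 are standard normal quantiles).
z_alpha, z_beta = 1.96, 0.84
n_per_group = ((z_alpha + z_beta) ** 2
               * (p_a * (1 - p_a) + p_b * (1 - p_b))
               / (p_b - p_a) ** 2)
print(f"Approx. users needed per group: {math.ceil(n_per_group)}")

A p-value below the chosen significance level (commonly 0.05) would support adopting variant B, subject to the interpretation and decision-making considerations in point 7.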
-------------------------------------------------------------------------------------------------------------------
Interleaving experiments in Information Retrieval (IR):
Interleaving experiments in Information Retrieval (IR) are a method used to compare the performance
of different ranking algorithms or search strategies in a live environment. These experiments involve
interleaving results from multiple algorithms and presenting them to users in a randomized order to
evaluate which algorithm provides better user satisfaction or relevance. Here's a detailed overview of
interleaving experiments in IR:
1. Objective:
• The primary objective of interleaving experiments is to compare the effectiveness of
different ranking algorithms or search strategies in providing relevant and satisfying
search results to users.
• Interleaving helps identify which algorithm or strategy performs better in real-world
scenarios based on user feedback and preferences.
2. Setup:
• Interleaving experiments involve presenting search results to users in a randomized
order, with results interleaved from different algorithms or strategies.
• Each user query triggers the execution of multiple ranking algorithms or strategies, and
the top-ranked results from each algorithm are combined and presented to the user in a
randomized order.
• Users are then asked to interact with the interleaved results and provide feedback or
relevance judgments based on their preferences.
3. Interleaving Methods:
• Team Draft Interleaving: In team draft interleaving, two or more algorithms are treated as competing "teams." Results from each team are combined into a single interleaved list by taking turns selecting the next result (a code sketch appears at the end of this topic).
• Probabilistic Interleaving: Probabilistic interleaving assigns probabilities to each
result from different algorithms and samples results based on these probabilities to
create the interleaved list.
• Balanced Interleaving: Balanced interleaving ensures that each algorithm's results are
equally represented in the interleaved list, regardless of their individual performance.
4. Metrics:
• Interleaving experiments use user-centric metrics to evaluate the effectiveness of
different ranking algorithms or strategies.
• Common metrics include user satisfaction, relevance judgments, click-through rate
(CTR), dwell time, and conversion rate.
• Metrics are used to assess which interleaved list leads to higher user engagement,
relevance, and satisfaction.
5. User Feedback and Relevance Judgments:
• Users are asked to interact with the interleaved search results and provide feedback or
relevance judgments based on their preferences.
• Relevance judgments may involve rating the relevance of individual results, marking
relevant documents, or providing qualitative feedback on the overall search experience.
6. Statistical Analysis:
• Statistical analysis is performed on the collected user feedback and relevance judgments
to determine which interleaved list performs better in terms of predefined metrics.
• Hypothesis testing techniques, such as paired t-tests or Wilcoxon signed-rank tests, are
used to assess statistical significance between interleaved lists.
7. Interpretation and Decision Making:
• Based on the results of the interleaving experiment, decisions are made regarding the
selection and optimization of ranking algorithms or search strategies.
• Algorithms or strategies that lead to higher user satisfaction, relevance, and engagement
are prioritized for deployment in the live IR system.
Interleaving experiments provide valuable insights into the comparative performance of different
ranking algorithms or search strategies in real-world scenarios. By systematically evaluating and
comparing interleaved search results based on user feedback, interleaving experiments help inform
data-driven decision-making and optimization of IR systems.
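To make the team draft method concrete, the sketch below interleaves two hypothetical rankings by letting the two "teams" take turns (with a coin flip deciding which team drafts first in each round) and records which team contributed each position, so that later clicks can be credited to a team. It is a simplified sketch of the general idea, not a full implementation.

# Minimal sketch: team draft interleaving of two ranked lists (simplified).
import random

def team_draft_interleave(ranking_a, ranking_b, length):
    interleaved, credit = [], []        # credit[i] records which team supplied position i
    shown = set()
    ia = ib = 0
    while len(interleaved) < length and (ia < len(ranking_a) or ib < len(ranking_b)):
        order = [("A", ranking_a), ("B", ranking_b)]
        random.shuffle(order)           # coin flip: which team drafts first this round
        for team, ranking in order:
            idx = ia if team == "A" else ib
            while idx < len(ranking) and ranking[idx] in shown:
                idx += 1                # skip documents already drafted by either team
            if idx < len(ranking) and len(interleaved) < length:
                interleaved.append(ranking[idx])
                credit.append(team)
                shown.add(ranking[idx])
                idx += 1
            if team == "A":
                ia = idx
            else:
                ib = idx
    return interleaved, credit

# Hypothetical rankings from two competing algorithms.
a = ["d1", "d2", "d3", "d4"]
b = ["d2", "d5", "d1", "d6"]
results, credit = team_draft_interleave(a, b, 6)
print(list(zip(results, credit)))       # clicks on a position are credited to its team

In an actual experiment, clicks on the interleaved list are credited to the team that contributed the clicked position, and the algorithm with more credited clicks across many queries is judged better, with significance assessed by a paired test such as the Wilcoxon signed-rank test mentioned above.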