
UNIT 3 INDEXING

Static and Dynamic Inverted Indices – Index Construction and Index Compression – Searching –
Sequential Searching and Pattern Matching – Query Operations – Query Languages – Query
Processing – Relevance Feedback and Query Expansion – Automatic Local and Global Analysis –
Measuring Effectiveness and Efficiency.

3.1. STATIC AND DYNAMIC INVERTED INDICES


An Inverted Index is a data structure used in information retrieval systems to efficiently retrieve documents
or web pages containing a specific term or set of terms. In an inverted index, the index is organized by
terms (words), and each term points to a list of documents or web pages that contain that term.
Inverted indexes are widely used in search engines, database systems, and other applications where
efficient text search is required. They are especially useful for large collections of documents, where
searching through all the documents would be prohibitively slow.
An inverted index is an index data structure storing a mapping from content, such as words or numbers, to
its locations in a document or a set of documents. In simple words, it is a hashmap-like data structure that
maps a word to the documents or web pages in which it appears.
Example: Consider the following documents.
Document 1: The quick brown fox jumped over the lazy dog.
Document 2: The lazy dog slept in the sun.
To create an inverted index for these documents, first tokenize the documents into terms, as follows.
Document 1: The, quick, brown, fox, jumped, over, the, lazy, dog.
Document 2: The, lazy, dog, slept, in, the, sun.

Next, create an index of the terms, where each term points to a list of documents that contain that term, as
follows.

The -> Document 1, Document 2


Quick -> Document 1
Brown -> Document 1
Fox -> Document 1
Jumped -> Document 1
Over -> Document 1
Lazy -> Document 1, Document 2
Dog -> Document 1, Document 2
Slept -> Document 2
In -> Document 2
Sun -> Document 2
To search for documents containing a particular term or set of terms, the search engine queries the inverted
index for those terms and retrieves the list of documents associated with each term. The search engine can
then use this information to rank the documents based on relevance to the query and present them to the
user in order of importance.
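The construction and lookup described above can be sketched in a few lines of Python. This is a minimal
illustration (a lowercasing tokenizer and set-valued postings), not production code:

import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on non-alphabetic characters.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def build_index(docs):
    # Record-level inverted index: term -> set of document IDs.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def and_query(index, *terms):
    # Intersect the posting sets of all query terms.
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "The quick brown fox jumped over the lazy dog.",
    2: "The lazy dog slept in the sun.",
}
index = build_index(docs)
print(sorted(index["lazy"]))                    # [1, 2]
print(sorted(and_query(index, "lazy", "sun")))  # [2]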
There are two types of inverted indexes:
1. Record-Level Inverted Index: Record Level Inverted Index contains a list of references to documents for
each word.
2. Word-Level Inverted Index: Word Level Inverted Index additionally contains the positions of each word
within a document. The latter form offers more functionality but needs more processing power and space to
be created.
Suppose we want to index the texts “hello everyone”, “this article is based on an inverted index”, and
“which is hashmap-like data structure”. Indexing by (text number, word position within the text) gives the
following index with locations in the text:
hello (1, 1)
everyone (1, 2)
this (2, 1)
article (2, 2)
is (2, 3); (3, 2)
based (2, 4)
on (2, 5)
inverted (2, 6)
index (2, 7)
which (3, 1)
hashmap (3, 3)
like (3, 4)
data (3, 5)
structure (3, 6)

The word “hello” is in document 1 (“hello everyone”) starting at word 1, so it has the entry (1, 1), and the
word “is” is in documents 2 and 3 at the 3rd and 2nd word positions respectively (here positions are counted
by word).
The index may have weights, frequencies, or other indicators.
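A word-level index can be sketched the same way, recording (document number, word position) pairs; the
positions below come from a plain whitespace tokenizer, so they may differ slightly from the hand-built
listing above:

from collections import defaultdict

def build_positional_index(docs):
    # Word-level inverted index: term -> list of (doc number, word position) pairs.
    index = defaultdict(list)
    for doc_no, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term].append((doc_no, pos))
    return index

docs = {
    1: "hello everyone",
    2: "this article is based on an inverted index",
    3: "which is hashmap-like data structure",
}
index = build_positional_index(docs)
print(index["hello"])  # [(1, 1)]
print(index["is"])     # [(2, 3), (3, 2)]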
Steps to Build an Inverted Index
Fetch the document: Collect the document that is to be indexed.
Remove stop words: Stop words are the most frequently occurring and least informative words in
documents, such as “I”, “the”, “we”, “is”, and “an”.
Stem each word to its root: A search for “cat” should also match documents that contain “cats” or
“catty”. To relate these forms, each word is reduced to its root word by stripping affixes; standard tools
such as Porter’s stemmer perform this step.
Record document IDs: If the word is already present in the index, add a reference to the current document;
otherwise create a new entry. Additional information such as the frequency and location of the word can
also be stored. (A short sketch of this pipeline appears after the example below.)
Example:
Word      Documents
ant       doc1
demo      doc2
world     doc1, doc2
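A short sketch of this pipeline follows. The stop-word list and the crude_stem function are deliberately tiny
stand-ins for a real stop list and a real stemmer such as Porter's:

from collections import defaultdict

STOP_WORDS = {"i", "the", "we", "is", "an", "in", "on", "a"}  # tiny illustrative stop list

def crude_stem(word):
    # Stand-in for a real stemmer (e.g. Porter's): strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def index_documents(docs):
    index = defaultdict(lambda: defaultdict(int))  # term -> {doc_id: frequency}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue                  # remove stop words
            term = crude_stem(word)       # stem to a root form
            index[term][doc_id] += 1      # record document ID and frequency
    return index

docs = {"doc1": "the cats slept", "doc2": "a cat in the sun"}
print(dict(index_documents(docs)["cat"]))  # {'doc1': 1, 'doc2': 1}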
Advantages of Inverted Index

 The inverted index allows fast full-text searches, at the cost of increased processing when a document is
added to the database.
 It is easy to develop.
 It is the most popular data structure used in document retrieval systems, and is used on a large scale, for
example in search engines.
Disadvantages of Inverted Index

 Large storage overhead and high maintenance costs on updating, deleting, and inserting.
 Instead of retrieving the data in decreasing order of expected usefulness, the records are retrieved in the
order in which they occur in the inverted lists.
Features of Inverted Indexes

 Efficient search: Inverted indexes allow for efficient searching of large volumes of text-based data. By
indexing every term in every document, the index can quickly identify all documents that contain a given
search term or phrase, significantly reducing search time.
 Fast updates: Inverted indexes can be updated quickly and efficiently as new content is added to the
system. This allows for near-real-time indexing and searching for new content.
 Flexibility: Inverted indexes can be customized to suit the needs of different types of information retrieval
systems. For example, they can be configured to handle different types of queries, such as Boolean queries
or proximity queries.
 Compression: Inverted indexes can be compressed to reduce storage requirements. Various techniques such
as delta encoding, gamma encoding, variable byte encoding, etc. can be used to compress the posting list
efficiently.
 Support for stemming and synonym expansion: Inverted indexes can be configured to support stemming
and synonym expansion, which can improve the accuracy and relevance of search results. Stemming is the
process of reducing words to their base or root form, while synonym expansion involves mapping different
words that have similar meanings to a common term.
 Support for multiple languages: Inverted indexes can support multiple languages, allowing users to search
for content in different languages using the same system.
3.2. INDEX CONSTRUCTION AND INDEX COMPRESSION
3.2.1. Index Construction
Index construction is the process of creating an index for a collection of documents. An index is a data
structure that allows for efficient searching of the document collection. The index is typically created by
first parsing the documents to extract the terms, and then storing these terms in a data structure that allows
for fast lookup.
There are a number of different algorithms for index construction. The most common approach is sort-
based indexing, which collects (term, docID) pairs and sorts them, first by term and then by document
identifier (DocID). This approach is simple to implement, but it can be inefficient for large collections of
documents.
Other index construction algorithms include:
1. Blocked sort-based indexing: This algorithm divides the document collection into smaller blocks and
constructs an index for each block. The resulting indexes are then merged to create a single index for the
entire collection (a sketch follows this list).
2. Single-pass indexing: This algorithm constructs the index in a single pass over the document collection.
This algorithm is more efficient than the sort-based algorithm for large collections of documents, but it is
more complex to implement.
3. Distributed indexing: This algorithm distributes the index construction process across multiple machines.
This algorithm is useful for very large collections of documents, where it would be impractical to construct
the index on a single machine.
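A rough sketch of blocked sort-based indexing (item 1 above), under the simplifying assumptions that each
block's (term, docID) pairs fit in memory and that heapq.merge stands in for a real on-disk multi-way merge:

import heapq
from itertools import groupby

def index_block(block):
    # block: list of (doc_id, text). Produce sorted (term, doc_id) pairs for this block.
    pairs = []
    for doc_id, text in block:
        for term in set(text.lower().split()):
            pairs.append((term, doc_id))
    pairs.sort()
    return pairs

def merge_blocks(block_indexes):
    # Merge the sorted runs of all blocks into one inverted index.
    merged = heapq.merge(*block_indexes)
    return {term: sorted({doc for _, doc in group})
            for term, group in groupby(merged, key=lambda p: p[0])}

blocks = [
    [(1, "the quick brown fox"), (2, "the lazy dog")],
    [(3, "the dog and the fox")],
]
index = merge_blocks([index_block(b) for b in blocks])
print(index["fox"])  # [1, 3]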
3.2.2. Index Compression
Index compression is the process of reducing the size of an index. This is important because large indexes
can take up a lot of disk space. There are a number of different index compression techniques, which can
be broadly classified into two categories:
1. Term-level compression: These techniques compress the terms themselves. For example, Huffman
coding can be used to represent terms with shorter bit sequences.
2. Postings-level compression: These techniques compress the postings lists, which are the lists of
document identifiers for each term. For example, delta encoding can be used to store the differences (gaps)
between successive document identifiers instead of the identifiers themselves (see the sketch below).
The choice of index compression technique depends on the specific characteristics of the document
collection and the index structure. In general, a combination of term-level and postings-level compression
techniques is used to achieve the best compression ratio.
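As an illustration of postings-level compression, the sketch below gap-encodes (delta encodes) a sorted
postings list and then variable-byte encodes each gap, using 7 data bits per byte with the high bit marking
the last byte of each number. This follows the common textbook scheme; real systems combine it with
further refinements:

def vbyte_encode(number):
    # Variable-byte encode one non-negative integer.
    bytes_out = []
    while True:
        bytes_out.insert(0, number % 128)
        if number < 128:
            break
        number //= 128
    bytes_out[-1] += 128  # set the high bit on the final byte as a terminator
    return bytes(bytes_out)

def encode_postings(doc_ids):
    # Gap-encode the sorted doc IDs, then variable-byte encode each gap.
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return b"".join(vbyte_encode(g) for g in gaps)

def decode_postings(data):
    doc_ids, current, running = [], 0, 0
    for byte in data:
        if byte < 128:
            current = current * 128 + byte
        else:
            current = current * 128 + (byte - 128)
            running += current          # undo the gap encoding
            doc_ids.append(running)
            current = 0
    return doc_ids

postings = [824, 829, 215406]
encoded = encode_postings(postings)
print(len(encoded), decode_postings(encoded))  # 6 [824, 829, 215406]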
Benefits of Index Construction and Index Compression

Index construction and index compression provide a number of benefits for information retrieval systems,
including:
 Improved search performance: Indexes allow for efficient searching of the document collection, which can
significantly improve search performance.
 Reduced storage requirements: Index compression can reduce the size of the index, which can save disk
space.
 Improved scalability: Index construction and index compression can be used to create indexes for very
large collections of documents, which makes them scalable to large-scale information retrieval systems.
3.3 SEARCHING – SEQUENTIAL SEARCHING AND PATTERN MATCHING

Searching is the fundamental operation of information retrieval (IR). It involves locating relevant
information within a large collection of data. There are two main types of searching:

Exact matching: This type of searching finds all occurrences of a specific pattern in the data. For example,
searching for the word "cat" in a document finds every instance of the exact string "cat", wherever it occurs
in the text.

Pattern matching: This type of searching finds all occurrences of a pattern in the data, even if the pattern
is not exact. For example, searching for the pattern "ca*" would find all words that start with "ca", such as
"cat", "car", and "captain".
3.3.1 Sequential searching
Sequential searching is the simplest and most straightforward search algorithm. It works by comparing the
search pattern to each item in the data collection until a match is found. Sequential searching is a relatively
slow algorithm, but it is easy to implement and understand.
Example:

Suppose you are looking for a specific book in a library. The library has a collection of books. You can use
sequential searching to find the book by starting at the beginning of the collection and comparing each
book's title to the title of the book you are looking for until you find a match.
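Sequential searching is essentially a single loop over the collection; a minimal sketch:

def sequential_search(docs, target):
    # Compare the target against each item until a match is found.
    for position, doc in enumerate(docs):
        if doc == target:
            return position  # index of the first match
    return -1                # not found

titles = ["A Tale of Two Cities", "Moby Dick", "War and Peace"]
print(sequential_search(titles, "Moby Dick"))  # 1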
3.3.2. Pattern matching
Pattern matching is a more complex type of searching that uses algorithms to find occurrences of patterns
in data.
Example: Suppose you are searching for a specific word in a document. The document is a large text file
that contains thousands of words. You can use pattern matching to find the word by using a regular
expression. For example, the regular expression \bcat\b matches every occurrence of the word "cat" as a
whole word (and, with a case-insensitive flag, regardless of capitalization), without matching longer words
such as "category".
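The same \bcat\b example, expressed with Python's re module; the IGNORECASE flag makes the match
case-insensitive, while the word boundaries keep "catalogue" from matching:

import re

text = "A Cat sat near another cat, but the catalogue stayed shut."
matches = re.findall(r"\bcat\b", text, flags=re.IGNORECASE)
print(matches)  # ['Cat', 'cat'] -- "catalogue" is not matched because of the word boundaries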

Here is a table that summarizes the key differences between sequential searching and pattern matching:

Feature                 Sequential Searching     Pattern Matching
Search pattern          Exact                    Approximate (e.g. wildcards, regular expressions)
Algorithm complexity    O(n)                     O(n*m) for naive matching; O(n + m) with KMP
Applications            Simple searches          Complex searches

where n is the size of the data collection and m is the size of the search pattern.

There are many different pattern matching algorithms, each with its own strengths and weaknesses. Some
of the most common pattern matching algorithms include:

1. Knuth-Morris-Pratt (KMP) algorithm: The KMP algorithm is a fast and efficient pattern matching
algorithm that is well-suited for searching large amounts of text (a sketch follows this list).
2. Boyer-Moore (BM) algorithm: The BM algorithm is another fast and efficient pattern matching algorithm,
known for its excellent average-case performance: it can skip over large portions of the text.
3. Rabin-Karp algorithm: The Rabin-Karp algorithm is a pattern matching algorithm that uses hashing to
quickly find potential matches.
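A hedged sketch of the KMP algorithm in Python: the failure (prefix) table records, for each prefix of the
pattern, the length of its longest proper prefix that is also a suffix, which lets the scan avoid re-examining
text characters and gives O(n + m) time overall.

def kmp_search(text, pattern):
    """Return the starting indices of every occurrence of pattern in text."""
    if not pattern:
        return []
    # Build the failure table: fail[i] = length of the longest proper prefix of
    # pattern[:i+1] that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text, falling back via the failure table on mismatches.
    matches, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches

print(kmp_search("abracadabra", "abra"))  # [0, 7]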

Applications of Searching : Searching is used in a wide variety of applications, including:

 Web search: Search engines use searching to find relevant web pages based on user queries.
 Database search: Database management systems use searching to find records that match user-specified
criteria.
 Text search: Text editors use searching to find specific words or phrases within a document.
 Pattern recognition: Pattern recognition systems use searching to find patterns in images, audio, and other
types of data.
3.4. QUERY OPERATIONS – QUERY LANGUAGES – QUERY PROCESSING
A query operation is a fundamental action that is performed on a collection of documents to retrieve
relevant information. Common query operations include:

 Keyword matching: Finds documents that contain specific keywords or phrases.


 Boolean operators: Combines query terms using Boolean operators such as AND, OR, and NOT to refine
the search (a sketch of AND evaluation follows this list).
 Proximity operators: Finds documents where specific keywords or phrases appear within a certain
distance of each other.
 Ranking: Orders retrieved documents based on their relevance to the query.
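For instance, the AND operator is usually evaluated by intersecting two sorted postings lists with a
linear-time merge; a minimal sketch:

def intersect_postings(p1, p2):
    # Linear-time merge of two sorted postings lists (AND of two terms).
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect_postings([1, 2, 4, 11, 31], [2, 31, 54]))  # [2, 31]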
3.4.1. Query Languages
A query language is a formal way of expressing queries to an information retrieval system. Common query
languages include:

 Keyword search: Simple and intuitive for basic searches.


 Boolean query languages: Support Boolean operators for more complex searches.
 Structured query languages: Allow for structured queries, such as those that filter documents based on
specific attributes.
 Natural language query languages: Enable users to express queries in natural language, such as English.
3.4.2. Query Processing
Query processing is the process of transforming a user's query into a form that can be executed by an
information retrieval system. The query processing pipeline typically consists of the following steps:
Parsing: Converts the user's query into an internal representation that the system can understand.
Query rewriting: Applies query rewriting techniques to improve the efficiency or effectiveness of the
query.
Query optimization: Determines the most efficient way to execute the query.
Execution: Executes the query against the document collection.
Evaluation: Measures the effectiveness of the query and provides feedback to the user.
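A schematic sketch of this pipeline as a chain of function stages; each stage here is a trivial placeholder
(the stop-word rule and the rarest-term-first heuristic are assumptions for illustration), not a real
implementation:

def parse(raw_query):
    # Parsing: turn the raw string into an internal representation (here, a term list).
    return raw_query.lower().split()

def rewrite(terms):
    # Query rewriting: e.g. drop stop words (placeholder rule).
    return [t for t in terms if t not in {"the", "a", "of"}]

def optimize(terms, index):
    # Query optimization: process the rarest term first to keep intersections small.
    return sorted(terms, key=lambda t: len(index.get(t, [])))

def execute(terms, index):
    # Execution: intersect the postings of all terms.
    postings = [set(index.get(t, [])) for t in terms]
    return set.intersection(*postings) if postings else set()

index = {"history": [1, 2, 3], "rome": [2, 3], "aqueducts": [3]}
terms = optimize(rewrite(parse("the history of Rome aqueducts")), index)
print(terms, execute(terms, index))  # ['aqueducts', 'rome', 'history'] {3}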
3.4.3. Relationship between Query Operation, Query Language, and Query Processing
Query operations, query languages, and query processing are all interrelated components of information
retrieval. Query operations provide the basic building blocks for expressing user queries, query languages
provide a formal way to represent these operations, and query processing transforms these representations
into executable commands that can be used to retrieve relevant information.
Information retrieval systems use query operations, query languages, and query processing to efficiently
and effectively find relevant information for users.
The choice of query operation, query language, and query processing strategy depends on the specific
application and the desired level of expressiveness and efficiency.
Advanced information retrieval systems may employ a variety of techniques to improve the accuracy and
efficiency of query processing, such as relevance feedback, query expansion, and machine learning
algorithms.
3.4.4 Types of Queries
During the process of indexing, many keywords are associated with each document in the collection:
words, phrases, creation date, author names, and type of document. They are used by an IR system to build
an inverted index, which is then consulted during the search. The queries formulated by users are compared
to the set of index keywords. Most IR systems also allow the use of Boolean and other operators to build a
complex query; a query language with these operators enriches the expressiveness of a user’s information
need. The IR system finds the relevant documents from a large data set according to the user query. Queries
submitted by users to search engines may be ambiguous or concise, and their meaning may change over
time. Some of the types of queries in IR systems are:
1. Keyword Queries :
Simplest and most common queries. The user enters just keyword combinations to retrieve documents.
These keywords are connected by logical AND operator. All retrieval models provide support for keyword
queries.
2. Boolean Queries :
Some IR systems allow the operators +, -, AND, OR, NOT, and parentheses ( ) to be combined with keyword
formulations. No ranking is involved, because a document either satisfies such a query or does not satisfy
it. A document is retrieved for a Boolean query only if the query evaluates to true for that document, i.e.
as an exact match.
3. Phrase Queries :
When documents are represented using an inverted keyword index for searching, the relative order of terms
in a document is lost. To support exact phrase retrieval, phrases must be encoded in the inverted index (for
example by storing word positions) or handled separately. A phrase query consists of a sequence of words
that make up a phrase, and it is generally enclosed within double quotes (a sketch of phrase matching
follows this list).
4. Natural Language Queries :
There are only a few natural language search engines that aim to understand the structure and meaning of
queries written in natural language text, generally as a question or narrative. The system tries to formulate
answers for these queries from the retrieved results. Semantic models can provide support for this query
type.
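A hedged sketch of phrase matching against a word-level (positional) index like the one built in section 3.1:
a document matches when each successive phrase term occurs exactly one position after the previous one.

def phrase_match(positional_index, phrase_terms):
    # positional_index: term -> list of (doc, position) pairs.
    # A document matches if the terms occur at consecutive positions.
    first = positional_index.get(phrase_terms[0], [])
    matches = set()
    for doc, pos in first:
        if all((doc, pos + offset) in set(positional_index.get(term, []))
               for offset, term in enumerate(phrase_terms[1:], start=1)):
            matches.add(doc)
    return matches

index = {
    "lazy":  [(1, 8), (2, 2)],
    "dog":   [(1, 9), (2, 3)],
    "slept": [(2, 4)],
}
print(phrase_match(index, ["lazy", "dog", "slept"]))  # {2}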

3.5. RELEVANCE FEEDBACK AND QUERY EXPANSION


In global query expansion, the query is modified based on some global resource, i.e., a resource that is not
query-dependent. In relevance feedback, users give input on documents (are they relevant or not), which is
used to refine the query. In query expansion, users give input on query terms or phrases.
3.5.1. Relevance Feedback

Relevance feedback is a user-centered technique that involves asking the user to evaluate the relevance of
the retrieved documents to their information need. The user's feedback is then used to refine the query and
improve the ranking of the retrieved documents.
There are two main types of relevance feedback:
Directed relevance feedback: The user is asked to specify which documents are relevant and which are
not relevant.
Undirected relevance feedback: The user is asked to rate the relevance of the documents on a scale, such
as 1 (not relevant) to 5 (very relevant).
The feedback can be used to perform the following tasks:
Identifying relevant terms: Relevant terms are the words or phrases that are most closely related to the
user's information need. These terms can be added to the query, or given greater weight when the query is
refined.
Removing irrelevant terms: Irrelevant terms are the words or phrases that are not related to the user's
information need. These terms can be used to refine the query by removing them from the query.
Improving term weights: Term weights are a measure of the importance of a term in the query. Term
weights can be adjusted based on the relevance of the documents that contain the term.
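One classical way to adjust term weights from feedback is the Rocchio formula,
q' = alpha*q + beta*(centroid of relevant docs) - gamma*(centroid of non-relevant docs). The sketch below
assumes the query and documents are already represented as term-weight dictionaries; the alpha, beta and
gamma values are just common defaults, not the only choice.

from collections import defaultdict

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    # query and each document are dicts mapping term -> weight.
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(non_relevant)
    # Negative weights are usually dropped.
    return {t: w for t, w in new_query.items() if w > 0}

query = {"jaguar": 1.0}
relevant = [{"jaguar": 0.9, "car": 0.8}]
non_relevant = [{"jaguar": 0.7, "cat": 0.9}]
print(rocchio(query, relevant, non_relevant))
# approximately {'jaguar': 1.57, 'car': 0.6}; 'cat' gets a negative weight and is dropped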
3.5.2. Query Expansion
Query expansion is a technique that involves adding new terms to the query based on the retrieved
documents. This can be done using a variety of methods, such as:
Leveraging semantic relationships: Related words and phrases can be added to the query to expand the
search.
Using thesauruses and dictionaries: Thesauruses and dictionaries can be used to find synonyms and
related terms of the query terms (see the sketch after this list).
Using word embedding techniques: Word embedding techniques can be used to identify words that are
semantically similar to the query terms.
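A minimal sketch of the thesaurus-based method above, using a small hand-made synonym dictionary
(THESAURUS here is a hypothetical stand-in for a real resource such as WordNet or a domain thesaurus):

# Hypothetical mini-thesaurus; a real system would consult WordNet or a domain thesaurus.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "fast": ["quick", "rapid"],
}

def expand_query(terms, thesaurus=THESAURUS, max_new=2):
    expanded = list(terms)
    for term in terms:
        for synonym in thesaurus.get(term, [])[:max_new]:
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded

print(expand_query(["fast", "car"]))
# ['fast', 'car', 'quick', 'rapid', 'automobile', 'vehicle']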
Query expansion can be used to improve the effectiveness of search results in several ways:

 Capturing more relevant information: New terms can be added to the query that capture more of the user's
information need.
 Reducing ambiguity: New terms can be added to the query that help to reduce the ambiguity of the query.
 Improving recall: New terms can be added to the query that help to identify more relevant documents that
were not originally retrieved.
Relevance feedback and query expansion are often used together to improve the effectiveness of search
results. Relevance feedback is used to identify relevant terms and refine the query, while query expansion
is used to further expand the search and capture more relevant information. The combination of relevance
feedback and query expansion can be a powerful tool for improving the effectiveness of IR systems. By
involving the user in the search process and using the feedback to refine the query, these techniques can
help to identify more relevant documents and improve the overall user experience.
3.5.3. Automatic Local Analysis
Automatic Local Analysis, Query Expansion Through Local Clustering, and Query Expansion Through
Local Context Analysis are three important techniques used in query expansion for information retrieval
(IR). These techniques aim to improve the effectiveness of search results by expanding the user's original
query with relevant terms.

Automatic Local Analysis (ALA) is a technique that identifies relevant terms based on their co-
occurrence with the query terms in the retrieved documents. The underlying assumption is that terms that
frequently appear together are likely to be semantically related. ALA involves the following steps (a short
sketch follows the list):

1. Gather retrieval results: Retrieve a set of documents based on the initial query.
2. Identify local clusters: Divide the retrieved documents into local clusters based on their similarity.
3. Rank terms: Rank the terms within each cluster based on their frequency and similarity to the query terms.
4. Select terms: Select the top-ranked terms from each cluster to expand the query.
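A rough sketch of the co-occurrence idea behind local analysis: candidate terms are scored by how often
they appear in top-retrieved documents that also contain a query term (the clustering step is omitted here
for brevity):

from collections import Counter

def local_expansion_terms(top_docs, query_terms, k=3):
    # top_docs: list of token lists for the top-ranked retrieved documents.
    query_terms = set(query_terms)
    scores = Counter()
    for tokens in top_docs:
        tokens = [t.lower() for t in tokens]
        if query_terms & set(tokens):           # document mentions a query term
            for t in tokens:
                if t not in query_terms:
                    scores[t] += 1              # credit co-occurring terms
    return [term for term, _ in scores.most_common(k)]

top_docs = [
    "jaguar speed engine test".split(),
    "jaguar engine price review".split(),
    "amazon rainforest wildlife".split(),
]
print(local_expansion_terms(top_docs, ["jaguar"]))
# ['engine', 'speed', 'test']  ('engine' co-occurs with the query term most often)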

Query Expansion Through Local Clustering (QE-LC)

QE-LC is a technique that expands the query by selecting terms from the centroids of local clusters. The
centroids represent the most representative documents in each cluster and are assumed to contain the most
relevant terms.

QE-LC involves the following steps:

1. Gather retrieval results: Retrieve a set of documents based on the initial query.
2. Identify local clusters: Divide the retrieved documents into local clusters based on their similarity.
3. Extract cluster centroids: Extract the centroids from each local cluster.
4. Select terms: Select the terms from the cluster centroids to expand the query.

Query Expansion Through Local Context Analysis (QE-LCA)

QE-LCA is a technique that expands the query by selecting terms from the local context of the query terms.
The local context refers to the terms that appear within a certain window size of the query terms in the
retrieved documents.

QE-LCA involves the following steps:

1. Gather retrieval results: Retrieve a set of documents based on the initial query.
2. Identify local context: Identify the local context for each query term in the retrieved documents.
3. Rank terms: Rank the terms within each local context based on their frequency and similarity to the query
terms.
4. Select terms: Select the top-ranked terms from each local context to expand the query.

The three techniques differ in how they identify and select relevant terms for query expansion:
 ALA uses co-occurrence analysis to identify relevant terms.
 QE-LC uses cluster centroids to identify relevant terms.
 QE-LCA uses local context analysis to identify relevant terms.

Automatic Global Analysis

Automatic Global Analysis (AGA) is a query expansion technique that utilizes global document analysis to
identify relevant terms for query refinement. It employs statistical methods to analyze the relationships
between terms across the entire document collection, rather than relying on local context within individual
documents. This global perspective allows AGA to capture broader semantic connections and uncover
terms that may not be apparent when examining documents in isolation.

AGA typically involves the following steps:

 Document indexing: Create a representation of each document in the collection, typically as a vector of
term weights.
 Term-term correlation analysis: Compute the correlation between each pair of terms in the collection,
identifying terms that tend to co-occur frequently.
 Term selection: Select a subset of the most highly correlated terms to add to the query (a small sketch
follows this list).
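A small sketch of the term-term correlation step, assuming NumPy: with a term-document matrix A
(rows = terms, columns = documents), the association matrix C = A * A^T counts, for each pair of terms,
how many documents contain both.

import numpy as np

# Term-document incidence matrix A: rows are terms, columns are documents.
terms = ["jaguar", "car", "engine", "rainforest"]
A = np.array([
    [1, 1, 0, 1],   # jaguar
    [1, 1, 0, 0],   # car
    [1, 0, 1, 0],   # engine
    [0, 0, 0, 1],   # rainforest
])

# Term-term association matrix: C[i, j] = number of documents containing both terms.
C = A @ A.T

query_term = terms.index("jaguar")
scores = C[query_term].copy()
scores[query_term] = 0                      # ignore the term's correlation with itself
best = terms[int(np.argmax(scores))]
print(best)  # car  (the term most strongly associated with "jaguar")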
Query Expansion Through Global Clustering
Query Expansion Through Global Clustering (QEGC) employs clustering techniques to group semantically
related terms and identify relevant terms for query expansion. It first clusters the terms in the document
collection based on their distributional similarities, resulting in a set of clusters that represent distinct
semantic concepts. Then, it identifies the most representative terms from each cluster and adds them to the
query.
QEGC typically involves the following steps:
 Term clustering: Cluster the terms in the document collection using a clustering algorithm, such as k-
means clustering.
 Cluster centroid computation: For each cluster, compute the centroid, which represents the most central
term in the cluster.
 Term selection: Select the centroids of clusters that have high relevance to the query and add them to the
query.

Query Expansion Through Global Context Analysis


Query Expansion Through Global Context Analysis (QECA) utilizes global context information to identify
relevant terms for query expansion. It analyzes the context of terms within the entire document collection
to identify patterns and relationships that indicate semantic relevance. This global context analysis allows
QECA to capture subtle nuances of meaning and identify terms that are not readily apparent from term-
term correlations alone.
QECA typically involves the following steps:
1. Document parsing: Parse each document in the collection to extract contextual information, such as
sentence boundaries, part-of-speech tags, and named entities.
2. Context modeling: Construct a model of the global context, capturing the relationships between terms and
their contextual features.
3. Term selection: Identify terms that are highly relevant to the query based on their contextual associations
and add them to the query.

UNIT 4 CLASSIFICATION AND CLUSTERING

Text Classification and Naïve Bayes – Vector Space Classification – Support Vector Machines and
machine learning on documents – Flat Clustering – Hierarchical Clustering – Matrix decompositions
and latent semantic indexing – Fusion and Meta learning
