
Information Retrieval

Assignment 1

Session: 2020 – 2024

Submitted by:
Saqlain Nawaz 2020-CS-135

Supervised by:
Sir Khaldoon Syed Khurshid

Department of Computer Science


University of Engineering and Technology
Lahore, Pakistan
Introduction
Welcome to the Inverted Indexing and Text Search Manual. This manual provides
comprehensive guidance on utilizing a Python tool designed to create an inverted index from
a collection of text documents and conduct text searches within them. Whether you're an
experienced programmer or have limited coding skills, this manual will help you make the
most of this powerful tool.

Purpose of the Program:


The Inverted Indexing and Text Search Tool is a versatile utility designed to assist you in
various text-related tasks. It allows you to:

● Create an inverted index: Transform a collection of text documents into a structured index that facilitates efficient text retrieval.
● Search for specific terms: Locate documents that contain particular words or
phrases.
● Count word occurrences: Quantify how frequently specific words appear within each
document.

By the end of this manual, you'll be proficient in using this tool to streamline your text
analysis tasks and extract valuable insights from your documents.

Installation and Setup:


Python: Ensure you have Python installed on your system. This tool is compatible with
Python 3.

NLTK Library: Install the NLTK library if you haven't already. You can install it using the
following command:
pip install nltk
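
NLTK Data: The tokenizers, POS tagger, and stopword list used by this tool rely on NLTK data packages that are downloaded separately from the library itself. As a one-time setup sketch (the resource names below are the standard NLTK package names this code most likely needs), you can run:

import nltk

# Download the resources used by sent_tokenize/word_tokenize, pos_tag, and stopwords
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

If NLTK raises a LookupError when the tool first runs, the error message usually names the missing resource.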

Running the Tool: Place your text documents in the same directory as the tool. Save the
code in a Python file (e.g., text_search.py). You can run the tool by executing the Python
script.

python text_search.py
Explanation and Guide

Imports (Libraries)

import os
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

import os

● Purpose: The os module provides a way to work with the operating system, allowing
you to perform various file and directory operations.
● Use in the Program: In the code, os is used to manipulate file paths and interact
with the filesystem. It's used to list files in a directory, join file paths, and determine
the script's directory path.

import nltk

● Purpose: The nltk (Natural Language Toolkit) library is a comprehensive library for
natural language processing tasks.
● Use in the Program: nltk is used extensively for text processing in this code. It
provides tools for tokenization, part-of-speech tagging, and stemming, which are
crucial for creating an inverted index and performing text searches.

import string

● Purpose: The string module provides a collection of common string operations, including a list of punctuation characters.
● Use in the Program: In the code, string.punctuation is used to filter out
punctuation characters from the text. This is important when tokenizing sentences
into words.

from nltk.corpus import stopwords

● Purpose: The NLTK corpus module includes predefined lists of stopwords for various
languages, including English.
● Use in the Program: The stopwords module is used to access a set of common
English stopwords. Stopwords are words that are commonly used in text but often do
not carry significant meaning (e.g., "the," "and"). Filtering out stopwords is a common
preprocessing step in text analysis.

from nltk.stem import PorterStemmer

● Purpose: The PorterStemmer is a stemming algorithm that reduces words to their base or root form. Stemming helps in reducing words to their essential meaning.
● Use in the Program: In the code, the PorterStemmer is used to stem words in text
documents before they are indexed. This simplifies the process of matching different
forms of a word during text searches.

from nltk.tokenize import word_tokenize, sent_tokenize

● Purpose: The nltk.tokenize module provides functions for breaking text into
words or sentences.
● Use in the Program: In the code, word_tokenize and sent_tokenize functions
are used to tokenize text into words and sentences, respectively. This tokenization is
essential for processing text at the word and sentence level.
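
To see how these imports work together, here is a small self-contained sketch (the sample sentence is invented for illustration) that tokenizes, POS-tags, and stems a sentence, mirroring the per-document steps the tool performs:

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

sample = "The indexed documents were searched quickly."
words = word_tokenize(sample.lower())   # ['the', 'indexed', 'documents', 'were', ...]
tagged = nltk.pos_tag(words)            # e.g. [('the', 'DT'), ('indexed', 'VBN'), ...]
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w, _ in tagged]
print(stems)                            # e.g. ['the', 'index', 'document', 'were', ...]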

Variables
# Get the list of English stopwords
stop_words = set(stopwords.words('english'))
unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}  # Add more characters if needed
# Initialize a Porter stemmer
stemmer = PorterStemmer()

stop_words = set(stopwords.words('english'))

● Explanation: The variable stop_words is assigned a set of English stopwords using NLTK's stopwords.words('english'). These stopwords will be used to filter out common words from the text documents being processed. This filtering helps reduce the size of the inverted index and focuses on the content-carrying words.

unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}

● Explanation: This variable unwanted_chars is a set containing characters that are considered unwanted and should be removed from the text before processing. The characters include various forms of quotes, dashes, and ellipses. If additional unwanted characters are identified, they can be added to this set.

stemmer = PorterStemmer()

● Explanation: Here, an instance of the Porter Stemmer is initialized as the variable stemmer. The Porter Stemmer is used to reduce words to their root or base form. In this code, it's employed so that different inflected forms of a word (e.g., "running" and "runs," which both stem to "run") are treated as the same term during indexing and searching. This is particularly important for improving the accuracy of the inverted index and search results.
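
As a quick illustration of what the stemmer does (the word list below is the classic Porter example, not words taken from the assignment's documents):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connect", "connected", "connecting", "connection", "connections"]:
    print(word, "->", stemmer.stem(word))   # every form reduces to 'connect'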

Functions

def create_index(dir_path)
def create_index(dir_path):
# Initialize an empty dictionary for the inverted index
inverted_index = {}

1. def create_index(dir_path): This line defines a Python function called create_index. It takes one argument, dir_path, which is the path to the directory containing the text documents that you want to index. This function will be responsible for building the inverted index and word counts for each document.
2. # Initialize an empty dictionary for the inverted index: This is a
comment that explains the purpose of the next line of code. It's initializing an empty
dictionary named inverted_index, which will be used to store the inverted index.
3. inverted_index = {}: This line creates an empty Python dictionary called
inverted_index. Inverted indexing is a technique used for text retrieval, where
words are associated with the documents they appear in. This dictionary will store
those associations.
def create_index(dir_path):
# Initialize an empty dictionary for the inverted index
inverted_index = {}
# Initialize a dictionary to store word counts per document
word_counts_per_document = {}

1. # Initialize a dictionary to store word counts per document: This comment explains that the following line initializes a dictionary to store word counts for each document in the directory.
2. word_counts_per_document = {}: This line creates an empty dictionary called
word_counts_per_document. This dictionary will be used to keep track of the
frequency of each word within each document, essentially counting how many times
each word appears in each text file. It is crucial for later search and retrieval
operations.
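
To make these two dictionaries concrete, here is a hypothetical shape they might take after indexing two small files (the filenames, words, and counts are invented purely for illustration):

# Hypothetical contents after indexing two documents
inverted_index = {
    'index':    ['doc1.txt', 'doc2.txt'],
    'search':   ['doc1.txt'],
    'document': ['doc1.txt', 'doc1.txt', 'doc2.txt'],  # one entry per occurrence
}

word_counts_per_document = {
    'doc1.txt': {'index': 1, 'search': 1, 'document': 2},
    'doc2.txt': {'index': 1, 'document': 1},
}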

# For each word, if it's a noun or verb, stem it and add an entry in the
# inverted index pointing to this filename
for word, pos in tagged_words:
    if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP'] and word not in stop_words:
        stemmed_word = stemmer.stem(word)
        if stemmed_word not in inverted_index:
            inverted_index[stemmed_word] = []
        inverted_index[stemmed_word].append(filename)

        # Update word counts for this document
        if stemmed_word not in word_counts:
            word_counts[stemmed_word] = 1
        else:
            word_counts[stemmed_word] += 1

# Store word counts for this document
word_counts_per_document[filename] = word_counts

except UnicodeDecodeError:
    print(f"Skipping file {filename} due to UnicodeDecodeError")

return inverted_index, word_counts_per_document

1. for filename in os.listdir(dir_path): This line sets up a loop that iterates over each file in the directory specified by dir_path. The os.listdir() function returns a list of all files and directories in the given directory, and this loop iterates through the file names.
2. if filename.endswith('.txt'): This line checks if the current filename
ends with the ".txt" extension, which typically indicates a text file.
3. try: This line begins a try-except block to handle potential errors during file
processing.
4. with open(os.path.join(dir_path, filename), 'r',
encoding='utf8') as file: Within the try block, this line opens the current text
file for reading. It uses os.path.join() to create the full path to the file by
combining dir_path with the filename. The file is opened in text mode ('r') and
with the 'utf8' encoding to handle text files encoded in UTF-8.
5. sentences = sent_tokenize(file.read().lower()): This line reads the
content of the file using file.read(), converts the content to lowercase using
.lower(), and then uses sent_tokenize (from NLTK) to split the content into a
list of sentences. This step prepares the text for further processing.
6. word_counts = {}: This line creates an empty dictionary called word_counts to
store word frequencies for the current document. This dictionary will be populated in
the following steps.
7. for sentence in sentences: This line sets up a loop to iterate over each
sentence in the sentences list.
8. sentence_without_punctuation = "".join([char for char in
sentence if char not in string.punctuation and char not in
unwanted_chars]): This line removes punctuation and unwanted characters from
the current sentence. It creates a new string called
sentence_without_punctuation by joining characters that are not in
string.punctuation or unwanted_chars.
9. words = word_tokenize(sentence_without_punctuation): This line
tokenizes the sentence_without_punctuation into a list of words using the
word_tokenize function from NLTK.
10. tagged_words = nltk.pos_tag(words): This line uses nltk.pos_tag to tag
each word in words with its part of speech. The result is stored in the
tagged_words list of word-tag pairs.
11. # For each word, if it's a noun or verb, stem it and add an
entry in the inverted index pointing to this filename: This
comment explains that the code will process each word in the current sentence,
checking if it's a noun or verb, and then stemming it before associating it with the
current filename in the inverted index.
12. for word, pos in tagged_words: This line sets up a loop to iterate over each
word and its corresponding part of speech in the tagged_words list.
13. if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP'] and word not in stop_words: This line checks two conditions for each word:
● Whether the word's part of speech (pos) is in the specified list of noun and
verb POS tags. If it is, it's considered for further processing.
● Whether the word is not in the set of stop_words, which are common words
that are often filtered out in text analysis.
14. stemmed_word = stemmer.stem(word): If a word passes the previous
conditions, it is stemmed using the Porter Stemmer. The stemmed word is stored in
the variable stemmed_word.
15. if stemmed_word not in inverted_index: This line checks if the
stemmed_word is not already in the inverted_index.
16. inverted_index[stemmed_word] = []: If the word is not in the inverted index,
it initializes an empty list as the value for that word in the inverted index.
17. inverted_index[stemmed_word].append(filename): Regardless of whether
the word was already in the inverted index or not, it appends the filename of the
current document to the list associated with the stemmed_word. This associates the
word with the document where it appears in the inverted index.
18. if stemmed_word not in word_counts: This line checks if the
stemmed_word is not in the word_counts dictionary.
19. word_counts[stemmed_word] = 1: If the word is not in word_counts, it
initializes it with a count of 1, indicating that this word has been found once in the
current document.
20. else: If the word is already in word_counts, this block of code is executed.
21. word_counts[stemmed_word] += 1: It increments the count for the word in
word_counts to indicate that the word has been found again in the current
document.
22. # Store word counts for this document: This comment explains that the
code is about to store the word counts for the current document.
23. word_counts_per_document[filename] = word_counts: This line stores
the word_counts dictionary (word counts for the current document) in the
word_counts_per_document dictionary with the filename as the key. This
associates the word counts with the document.
24. except UnicodeDecodeError: This is an exception handler that catches
UnicodeDecodeError exceptions. This exception occurs when a file cannot be
decoded using the specified encoding, which can happen when processing text files
with non-standard encodings.
25. print(f"Skipping file {filename} due to UnicodeDecodeError"): If
a UnicodeDecodeError is raised, this line prints a message indicating that the file
is being skipped due to this encoding-related error.
26. return inverted_index, word_counts_per_document: This line returns two
values as a tuple:
● inverted_index: This is a dictionary containing the inverted index, where
each stemmed word is associated with a list of filenames where it appears.
● word_counts_per_document: This is a dictionary containing word counts
for each document, showing how many times each word appears in each
document.
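
Putting the pieces together, the full create_index function described in this walkthrough looks roughly like the sketch below (reconstructed from the fragments shown in this manual, so indentation and minor details may differ slightly from the original script):

def create_index(dir_path):
    # Initialize an empty dictionary for the inverted index
    inverted_index = {}
    # Initialize a dictionary to store word counts per document
    word_counts_per_document = {}

    for filename in os.listdir(dir_path):
        if filename.endswith('.txt'):
            try:
                with open(os.path.join(dir_path, filename), 'r', encoding='utf8') as file:
                    sentences = sent_tokenize(file.read().lower())
                    word_counts = {}
                    for sentence in sentences:
                        # Strip punctuation and unwanted characters
                        sentence_without_punctuation = "".join(
                            [char for char in sentence
                             if char not in string.punctuation and char not in unwanted_chars])
                        words = word_tokenize(sentence_without_punctuation)
                        tagged_words = nltk.pos_tag(words)
                        # Keep only nouns and verbs that are not stopwords
                        for word, pos in tagged_words:
                            if pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP'] and word not in stop_words:
                                stemmed_word = stemmer.stem(word)
                                if stemmed_word not in inverted_index:
                                    inverted_index[stemmed_word] = []
                                inverted_index[stemmed_word].append(filename)
                                # Update word counts for this document
                                if stemmed_word not in word_counts:
                                    word_counts[stemmed_word] = 1
                                else:
                                    word_counts[stemmed_word] += 1
                    # Store word counts for this document
                    word_counts_per_document[filename] = word_counts
            except UnicodeDecodeError:
                print(f"Skipping file {filename} due to UnicodeDecodeError")

    return inverted_index, word_counts_per_document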

def main()

Code

# Now you can create the index, search it, and count word occurrences within documents
def main():
    dir_path = os.path.dirname(os.path.abspath(__file__))
    inverted_index, word_counts_per_document = create_index(dir_path)
    print(inverted_index)

    def search(query):
        # Tokenize and stem the query
        query_words = word_tokenize(query.lower())
        stemmed_query_words = [stemmer.stem(word) for word in query_words]
        # Retrieve the filenames for each query word
        matching_filenames_for_each_word = {word: inverted_index.get(word, []) for word in stemmed_query_words}
        return matching_filenames_for_each_word

    # User-friendly search prompt
    while True:
        user_query = input("Enter a search query (or 'exit' to quit): ")
        if user_query == 'exit':
            break
        results = search(user_query)

        # Collect and count unique filenames
        unique_filenames = set()
        for filenames in results.values():
            unique_filenames.update(filenames)

        for filename in unique_filenames:
            word_count = sum(word_counts_per_document[filename].get(word, 0) for word in results.keys())
            print(f"The word(s) appear in '{filename}' {word_count} time(s):")

        if not unique_filenames:
            print("No matching documents found for the query.")


if __name__ == "__main__":
    main()

Explanation

def main():

dir_path = os.path.dirname(os.path.abspath(__file__))

1. def main(): This line defines the main function, which is the entry point of your
program. It doesn't take any arguments.
2. dir_path = os.path.dirname(os.path.abspath(__file__)): This line
sets dir_path to the directory path of the script file itself. It uses os.path to obtain
the absolute path of the current script (__file__) and then extracts the directory
path from it. This is used to determine the directory where the text documents are
located.

inverted_index, word_counts_per_document = create_index(dir_path)

print(inverted_index)

3. inverted_index, word_counts_per_document =
create_index(dir_path): Here, the code calls the create_index function to
build the inverted index and word counts for the documents in the directory specified
by dir_path. It stores the results in inverted_index and
word_counts_per_document.
4. print(inverted_index): This line prints the inverted_index to the console. It
provides a visual representation of the inverted index, showing how words are
associated with the documents they appear in.

def search(query):

5. def search(query): This line defines a new function called search, which takes
a single argument, query. This function is responsible for searching the inverted
index based on user queries.

query_words = word_tokenize(query.lower())

stemmed_query_words = [stemmer.stem(word) for word in query_words]

6. The lines within the search function tokenize and stem the user's query:
○ query_words = word_tokenize(query.lower()): The query is
tokenized into individual words using word_tokenize, and all the words are
converted to lowercase to ensure consistent matching.
○ stemmed_query_words = [stemmer.stem(word) for word in
query_words]: Each tokenized word in the query is stemmed using the
stemmer.stem function. This ensures that the query words are in the same
form as the words in the inverted index.

matching_filenames_for_each_word = {word: inverted_index.get(word, []) for word in stemmed_query_words}

return matching_filenames_for_each_word

7. The code retrieves filenames associated with each query word from the inverted
index. It creates a dictionary, matching_filenames_for_each_word, where
each query word is the key, and the associated list of filenames is the value. This
information will be used for search results.
8. return matching_filenames_for_each_word: The function returns
matching_filenames_for_each_word, which contains the search results
indicating which documents contain the query words.
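
For example (the query and filenames here are hypothetical), a query such as "indexing documents" is first stemmed to ['index', 'document'], so the returned dictionary maps the stemmed forms, not the original query words, to filename lists:

# search('indexing documents') could return something like:
{'index': ['doc1.txt', 'doc2.txt'], 'document': ['doc1.txt']}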

while True:

user_query = input("Enter a search query (or 'exit' to quit): ")

9. This code initiates a user-friendly search interface where users can input search
queries. It uses a while loop to repeatedly prompt the user for input.
10. user_query = input("Enter a search query (or 'exit' to quit):
"): This line reads the user's search query from the console. Users can type a
search query or type 'exit' to quit the search interface.

if user_query == 'exit':
break

11. This conditional statement checks if the user entered 'exit' as the query. If they did,
the while loop is exited, ending the search interface.

results = search(user_query)

12. results = search(user_query): The user's search query is passed to the search function, and the results are stored in the results variable.

unique_filenames = set()
for filenames in results.values():
unique_filenames.update(filenames)

13. This part of the code processes the search results:
● unique_filenames is initialized as an empty set to collect unique filenames that match the search query.
● The for loop iterates over the filenames associated with each query word from the results dictionary and updates the unique_filenames set with those filenames.

for filename in unique_filenames:
    word_count = sum(word_counts_per_document[filename].get(word, 0) for word in results.keys())
    print(f"The word(s) appear in '{filename}' {word_count} time(s):")

if not unique_filenames:
    print("No matching documents found for the query.")

14. The code further processes the unique filenames:
● It iterates over the unique filenames.
● For each filename, it calculates the total word count for the query words found in that document. This is done by iterating over the query words and checking how many times each of them appears in the document.
● It then prints the filename along with the word count for the query words found in that document.
15. If there are no unique filenames (i.e., no matching documents for the query), it prints a message indicating that no matching documents were found for the query.

if __name__ == "__main__":
main()

16. Finally, this code checks if the script is being run as the main program (not imported
as a module). If it is, it calls the main function to start the search interface.
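
To tie everything together, a run of the tool might look like the following (the printed index, filenames, and counts are illustrative only, since they depend entirely on the .txt files placed next to the script):

python text_search.py
{'index': ['doc1.txt', 'doc2.txt'], 'search': ['doc1.txt'], ...}
Enter a search query (or 'exit' to quit): indexing
The word(s) appear in 'doc1.txt' 1 time(s):
The word(s) appear in 'doc2.txt' 1 time(s):
Enter a search query (or 'exit' to quit): exit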
Data Flow Diagram
Block Diagram
