Assignment 1
Submitted by:
Saqlain Nawaz 2020-CS-135
Supervised by:
Sir Khaldoon Syed Khurshid
By the end of this manual, you'll be proficient in using this tool to streamline your text
analysis tasks and extract valuable insights from your documents.
NLTK Library: Install the NLTK library if you haven't already. You can install it using the
following command:
pip install nltk
The tool also relies on NLTK data packages for tokenization, stopwords, and part-of-speech
tagging. Download them once with:
python -m nltk.downloader punkt stopwords averaged_perceptron_tagger
Running the Tool: Place your text documents in the same directory as the tool. Save the
code in a Python file (e.g., text_search.py). You can run the tool by executing the Python
script.
python text_search.py
Explanation and Guide
Imports (Libraries)
import os
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize
import os
● Purpose: The os module provides a way to work with the operating system, allowing
you to perform various file and directory operations.
● Use in the Program: In the code, os is used to manipulate file paths and interact
with the filesystem. It's used to list files in a directory, join file paths, and determine
the script's directory path.
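The file and path operations above can be sketched in a few lines (a stand-alone example; the directory and filenames here are created just for the demonstration):

```python
import os
import tempfile

# Create a throwaway directory with two sample "documents" (hypothetical names)
with tempfile.TemporaryDirectory() as dir_path:
    for name in ("doc1.txt", "doc2.txt"):
        # os.path.join builds a full, OS-correct file path
        with open(os.path.join(dir_path, name), "w") as f:
            f.write("sample text")

    # os.listdir enumerates the files the indexer would iterate over
    filenames = sorted(os.listdir(dir_path))
    print(filenames)  # ['doc1.txt', 'doc2.txt']

    # os.path.dirname recovers the directory a file lives in, mirroring
    # how the tool locates its own script directory
    full_path = os.path.join(dir_path, "doc1.txt")
    print(os.path.dirname(full_path) == dir_path)  # True
```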
import nltk
● Purpose: The nltk (Natural Language Toolkit) library is a comprehensive library for
natural language processing tasks.
● Use in the Program: nltk is used extensively for text processing in this code. It
provides tools for tokenization, part-of-speech tagging, and stemming, which are
crucial for creating an inverted index and performing text searches.
import string
● Purpose: The string module provides common string constants, such as
string.punctuation, which lists the ASCII punctuation characters.
● Use in the Program: It is imported so that punctuation can be filtered out of tokens
during preprocessing.
from nltk.corpus import stopwords
● Purpose: The NLTK corpus module includes predefined lists of stopwords for various
languages, including English.
● Use in the Program: The stopwords module is used to access a set of common
English stopwords. Stopwords are words that appear frequently in text but often do
not carry significant meaning (e.g., "the," "and"). Filtering out stopwords is a standard
preprocessing step in text analysis.
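Stopword filtering boils down to a set-membership test. A minimal sketch (the real tool loads the full list via stopwords.words('english'); this tiny hard-coded set is only for illustration):

```python
# Toy stopword set standing in for NLTK's full English list
stop_words = {"the", "and", "is", "a", "of"}

tokens = ["the", "index", "is", "a", "map", "of", "words", "and", "files"]

# Keep only the words that carry content
content_words = [t for t in tokens if t not in stop_words]
print(content_words)  # ['index', 'map', 'words', 'files']
```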
from nltk.stem import PorterStemmer
● Purpose: The nltk.stem module provides stemming algorithms; PorterStemmer
implements the classic Porter algorithm, which reduces words to a common root form.
● Use in the Program: The stemmer normalizes words before they are added to the
inverted index, so that variants such as "running" and "runs" map to the same entry.
from nltk.tokenize import word_tokenize, sent_tokenize
● Purpose: The nltk.tokenize module provides functions for breaking text into
words or sentences.
● Use in the Program: In the code, the word_tokenize and sent_tokenize functions
are used to tokenize text into words and sentences, respectively. This tokenization is
essential for processing text at the word and sentence level.
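The two levels of tokenization can be approximated without NLTK. word_tokenize and sent_tokenize require NLTK's punkt data; this regex-based stand-in only roughly mimics their behavior, but shows the idea:

```python
import re

text = "NLTK splits text. It works at two levels!"

# Rough sentence tokenization: split after ., !, or ? followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)
# Rough word tokenization of the first sentence: runs of letters
words = re.findall(r"[A-Za-z]+", sentences[0])

print(sentences)  # ['NLTK splits text.', 'It works at two levels!']
print(words)      # ['NLTK', 'splits', 'text']
```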
Variables
# Get the list of English stopwords
stop_words = set(stopwords.words('english'))

# Add more characters if needed
unwanted_chars = {'“', '”', '―', '...', '—', '-', '–'}

# Initialize a Porter stemmer
stemmer = PorterStemmer()
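To see why the stemmer is initialized here, consider what stemming does to index keys. PorterStemmer applies a full cascade of rules; this toy suffix stripper is only a hypothetical stand-in, and its outputs differ from the real stemmer's (Porter gives 'run' for "running" and 'file' for "files"):

```python
# Simplified stand-in for a stemmer: strip a few common suffixes
def toy_stem(word):
    for suffix in ("ing", "es", "s"):
        # Only strip when enough of the word remains to stay recognizable
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("running"))  # 'runn' (the real Porter stemmer gives 'run')
print(toy_stem("files"))    # 'fil'  (the real Porter stemmer gives 'file')
print(toy_stem("maps"))     # 'map'
```

Because both the indexed words and the query words pass through the same stemmer, inflected forms land on the same index key.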
Functions
def create_index(dir_path)
def create_index(dir_path):
    # Initialize empty dictionaries for the inverted index and per-document word counts
    inverted_index = {}
    word_counts_per_document = {}
    for filename in os.listdir(dir_path):
        if not filename.endswith('.txt'):
            continue
        try:
            with open(os.path.join(dir_path, filename), encoding='utf-8') as file:
                text = file.read()
            # Tokenize the document and tag each word with its part of speech
            words = word_tokenize(text.lower())
            tagged_words = nltk.pos_tag(words)
            word_counts_per_document[filename] = len(words)
            # For each word, if it's a noun or verb, stem it and add an entry in the
            # inverted index pointing to this filename
            for word, pos in tagged_words:
                if (pos in ['NN', 'NNS', 'NNP', 'NNPS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP']
                        and word not in stop_words):
                    stemmed_word = stemmer.stem(word)
                    if stemmed_word not in inverted_index:
                        inverted_index[stemmed_word] = []
                    inverted_index[stemmed_word].append(filename)
        except UnicodeDecodeError:
            print(f"Skipping file {filename} due to UnicodeDecodeError")
    return inverted_index, word_counts_per_document
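The core indexing loop can be demonstrated in isolation. This self-contained sketch uses plain string methods instead of NLTK tokenization and tagging, so it runs without any NLTK data; the document names and contents are made up for the example:

```python
# Hypothetical in-memory "documents" standing in for files on disk
docs = {
    "a.txt": "cats chase mice",
    "b.txt": "mice hide from cats",
}

# Build the inverted index: word -> list of documents containing it
inverted_index = {}
for filename, text in docs.items():
    for word in text.split():
        # setdefault is equivalent to the if-not-in / append pattern above
        inverted_index.setdefault(word, []).append(filename)

print(inverted_index["cats"])  # ['a.txt', 'b.txt']
print(inverted_index["hide"])  # ['b.txt']
```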
def main()
Code
# Now you can create the index, search it, and count word occurrences within documents
def main():
    dir_path = os.path.dirname(os.path.abspath(__file__))
    inverted_index, word_counts_per_document = create_index(dir_path)
    print(inverted_index)

    def search(query):
        # Tokenize and stem the query so it matches the stemmed index entries
        query_words = word_tokenize(query.lower())
        stemmed_query_words = [stemmer.stem(word) for word in query_words]
        matching_filenames_for_each_word = {
            word: inverted_index.get(word, []) for word in stemmed_query_words
        }
        return matching_filenames_for_each_word

    while True:
        user_query = input("Enter a search query (or 'exit' to quit): ")
        if user_query == 'exit':
            break
        results = search(user_query)
        unique_filenames = set()
        for filenames in results.values():
            unique_filenames.update(filenames)
        if not unique_filenames:
            print("No matching documents found.")
        else:
            for filename in sorted(unique_filenames):
                print(filename)

if __name__ == "__main__":
    main()
Explanation
def main():
dir_path = os.path.dirname(os.path.abspath(__file__))
1. def main(): This line defines the main function, which is the entry point of the
program. It doesn't take any arguments.
2. dir_path = os.path.dirname(os.path.abspath(__file__)): This line
sets dir_path to the directory path of the script file itself. It uses os.path to obtain
the absolute path of the current script (__file__) and then extracts the directory
path from it. This is used to determine the directory where the text documents are
located.
inverted_index, word_counts_per_document = create_index(dir_path)
print(inverted_index)
3. inverted_index, word_counts_per_document =
create_index(dir_path): Here, the code calls the create_index function to
build the inverted index and word counts for the documents in the directory specified
by dir_path. It stores the results in inverted_index and
word_counts_per_document.
4. print(inverted_index): This line prints the inverted_index to the console. It
provides a visual representation of the inverted index, showing how words are
associated with the documents they appear in.
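To make the printed output easier to read, here is the hypothetical shape of the two values returned by create_index (filenames and counts invented for illustration):

```python
# Each stemmed word maps to every file it occurs in
inverted_index = {"cat": ["a.txt", "b.txt"], "mous": ["b.txt"]}
# Each file maps to its total token count
word_counts_per_document = {"a.txt": 120, "b.txt": 95}

print(inverted_index["cat"])              # ['a.txt', 'b.txt']
print(word_counts_per_document["a.txt"])  # 120
```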
def search(query):
5. def search(query): This line defines a new function called search, which takes
a single argument, query. This function is responsible for searching the inverted
index based on user queries.
query_words = word_tokenize(query.lower())
6. The lines within the search function tokenize and stem the user's query:
○ query_words = word_tokenize(query.lower()): The query is
tokenized into individual words using word_tokenize, and all the words are
converted to lowercase to ensure consistent matching.
○ stemmed_query_words = [stemmer.stem(word) for word in
query_words]: Each tokenized word in the query is stemmed using the
stemmer.stem function. This ensures that the query words are in the same
form as the words in the inverted index.
matching_filenames_for_each_word = {
    word: inverted_index.get(word, []) for word in stemmed_query_words
}
return matching_filenames_for_each_word
7. The code retrieves filenames associated with each query word from the inverted
index. It creates a dictionary, matching_filenames_for_each_word, where
each query word is the key, and the associated list of filenames is the value. This
information will be used for search results.
8. return matching_filenames_for_each_word: The function returns
matching_filenames_for_each_word, which contains the search results
indicating which documents contain the query words.
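The lookup step can be demonstrated on a hand-built index. Note how inverted_index.get returns [] for unknown words, so unmatched query terms never raise a KeyError:

```python
# Hand-made index and query for the example
inverted_index = {"cat": ["a.txt"], "dog": ["b.txt"]}
stemmed_query_words = ["cat", "bird"]

# One entry per query word; absent words map to an empty list
matching = {word: inverted_index.get(word, []) for word in stemmed_query_words}
print(matching)  # {'cat': ['a.txt'], 'bird': []}
```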
while True:
9. This code initiates a user-friendly search interface where users can input search
queries. It uses a while loop to repeatedly prompt the user for input.
10. user_query = input("Enter a search query (or 'exit' to quit):
"): This line reads the user's search query from the console. Users can type a
search query or type 'exit' to quit the search interface.
if user_query == 'exit':
break
11. This conditional statement checks if the user entered 'exit' as the query. If they did,
the while loop is exited, ending the search interface.
results = search(user_query)
unique_filenames = set()
for filenames in results.values():
    unique_filenames.update(filenames)
12. results = search(user_query): The user's query is passed to the search
function, which returns a dictionary mapping each stemmed query word to the list of
filenames that contain it.
13. unique_filenames = set(): An empty set is created to collect the matching
filenames without duplicates.
14. The for loop iterates over the filename lists in results.values() and adds them
to unique_filenames, so each matching document is reported only once.
15. If unique_filenames is empty, no document matched the query; otherwise the
matching filenames are shown to the user.
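The aggregation step collapses the per-word results into one set of unique filenames. A minimal sketch, where the results dictionary is a hand-made example of what search might return:

```python
# Hypothetical per-word search results
results = {"cat": ["a.txt", "b.txt"], "mouse": ["b.txt"]}

# set.update adds every filename while silently dropping duplicates
unique_filenames = set()
for filenames in results.values():
    unique_filenames.update(filenames)

print(sorted(unique_filenames))  # ['a.txt', 'b.txt']
```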
if __name__ == "__main__":
main()
16. Finally, this code checks if the script is being run as the main program (not imported
as a module). If it is, it calls the main function to start the search interface.
Data Flow Diagram
Block Diagram