
Information Retrieval

Assignment 1

Session: 2020 – 2024

Submitted by:
Saqlain Nawaz 2020-CS-135

Supervised by:
Sir Khaldoon Syed Khurshid

Department of Computer Science


University of Engineering and Technology
Lahore Pakistan
Inverted Index for Text Files
Overview
This document provides an overview and explanation of the code used to create an inverted
index for a collection of text files.

Libraries Used
The following libraries are used in this code:

● os: Provides functions for interacting with the operating system, used here for file
operations and directory traversal.
● nltk: The Natural Language Toolkit (NLTK), used for natural language processing tasks
such as tokenization, stemming, and part-of-speech tagging.
● string: Provides string constants, including the set of punctuation characters.
● nltk.corpus.stopwords: Provides a list of common English stopwords.
● nltk.stem.PorterStemmer: Implements the Porter stemming algorithm for word
stemming.
● nltk.tokenize.word_tokenize: Tokenizes sentences into words.
● nltk.tokenize.sent_tokenize: Tokenizes text into sentences.

Code Flow
The code is structured as follows:

Import Libraries

Import the required libraries at the beginning of the code.

Initialize Variables

● Initialize a set of English stopwords using nltk.corpus.stopwords.
● Initialize a Porter stemmer using nltk.stem.PorterStemmer.

create_index Function

This function takes a directory path as input and returns an inverted index.

1. It iterates over the text files in the specified directory.
2. For each file, it reads the content, tokenizes it into sentences, and then tokenizes
each sentence into words.
3. It tags each word with its part of speech and checks whether the word is a noun or a
verb.
4. If the word is a noun or verb and is not in the stopword list, it is stemmed using the
Porter stemmer.
5. The stemmed word is added to the inverted index as a key, and the filename is recorded
among the files listed for that key.
6. The function handles the UnicodeDecodeError raised for files that cannot be
decoded.
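
The steps above can be sketched roughly as follows. This is a minimal illustration and not the assignment's actual code: the extra parameters tag_words, stem and stop_words are hypothetical and are passed in so that the sketch runs without NLTK or its data files. It also assumes ".txt" files read as UTF-8 and Penn Treebank style tags, where noun tags start with "NN" and verb tags with "VB".

```python
import os
from collections import defaultdict


def create_index(directory, tag_words, stem, stop_words):
    """Map each stemmed noun/verb to the set of filenames that contain it.

    tag_words(text) -> list of (word, tag) pairs   (hypothetical helper)
    stem(word)      -> stemmed word                (hypothetical helper)
    stop_words      -> set of lowercase words to ignore
    """
    index = defaultdict(set)
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".txt"):  # assumption: text files end in .txt
            continue
        path = os.path.join(directory, filename)
        try:
            with open(path, encoding="utf-8") as handle:
                text = handle.read()
        except UnicodeDecodeError:
            # Step 6: report and skip files that cannot be decoded.
            print(f"Skipping {filename}: could not decode as UTF-8")
            continue
        for word, tag in tag_words(text):
            word = word.lower()
            # Steps 3 and 4: keep only nouns and verbs that are not stopwords.
            if not tag.startswith(("NN", "VB")) or word in stop_words:
                continue
            # Step 5: record the filename under the stemmed word.
            index[stem(word)].add(filename)
    return index
```

With the libraries listed earlier, tag_words could be built from sent_tokenize, word_tokenize and nltk.pos_tag, stem could be PorterStemmer().stem, and stop_words could be set(stopwords.words("english")). Storing a set of filenames per key keeps each file listed once, however often the word occurs in it.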

search Function

This function takes a user's search query as input.

1. It tokenizes and stems the query.
2. For each stemmed query word, it retrieves the filenames associated with it from the
inverted index.
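
A possible shape for this lookup is sketched below. It is an illustration under assumptions, not the assignment's code: the index is passed in explicitly as a dictionary from stemmed word to filenames, and tokenize and stem are hypothetical parameters standing in for word_tokenize and the Porter stemmer.

```python
def search(query, index, tokenize, stem):
    """Return a dict mapping each stemmed query word to its sorted filenames."""
    results = {}
    for word in tokenize(query):
        term = stem(word.lower())
        # Words missing from the index map to an empty list.
        results[term] = sorted(index.get(term, ()))
    return results
```

The query has to be stemmed with the same stemmer that was used to build the index; otherwise a query word such as "running" would not match the key stored for "run".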

User Interaction

In the main function, the program interacts with the user.

● The user can input a search query, and the code returns the filenames in which each
query word appears.
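
One way the interaction could look is sketched here. The details are assumptions rather than the assignment's code: lookup is a hypothetical callable that maps a query to a dictionary of query word to filenames, a blank line ends the loop, and read and write default to input and print so they can be swapped out for testing.

```python
def main(lookup, read=input, write=print):
    """Prompt for queries until a blank line and report the files per query word."""
    while True:
        query = read("Enter search query (blank to quit): ").strip()
        if not query:
            break
        for term, filenames in lookup(query).items():
            if filenames:
                write(f"{term}: {', '.join(filenames)}")
            else:
                write(f"{term}: no matching files")
```

Reporting the filenames per query word, rather than a single merged list, matches the behaviour described above.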

Execution

The main function is executed when the script is run.

Block Diagram:
A block diagram is a visual representation of the code's structure and key
components.
Data Flow Diagram (DFD):
A DFD illustrates how data moves through the code.
