
Assignment – I

(Based on Pre-Processing of Textual Data)


Q1: Load a corpus (of .txt files) of your choice containing at least 10 text files using:

1. File method
2. PlaintextCorpusReader
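
The two loading routes in Q1 could be sketched as below. This is a minimal illustration, not a prescribed solution: the function names, the temporary directory, and the ten generated sample files are assumptions made only so the example runs on its own. The second function assumes NLTK is installed and imports it lazily.

```python
import tempfile
from pathlib import Path


def load_with_file_method(corpus_dir):
    """Read every .txt file in corpus_dir with plain Python file I/O."""
    docs = {}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        with open(path, encoding="utf-8") as fh:
            docs[path.name] = fh.read()
    return docs


def load_with_nltk(corpus_dir):
    """Read the same files through NLTK's PlaintextCorpusReader.

    Assumes NLTK is installed; the import is kept inside the function so
    the rest of this sketch runs without it.
    """
    from nltk.corpus.reader import PlaintextCorpusReader

    reader = PlaintextCorpusReader(str(corpus_dir), r".*\.txt")
    return {fid: reader.raw(fid) for fid in reader.fileids()}


# Build a throwaway corpus of 10 files (hypothetical sample data).
with tempfile.TemporaryDirectory() as tmp:
    for i in range(10):
        Path(tmp, f"doc{i:02d}.txt").write_text(
            f"Sample document number {i}.", encoding="utf-8"
        )
    corpus = load_with_file_method(tmp)

print(len(corpus), "files loaded")
```

With NLTK available, `load_with_nltk(your_corpus_dir)` should return the same mapping of file names to raw text for an existing directory of .txt files.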

Q2: Pre-process the corpus loaded in Q1 (apply normalization, tokenization, stopword removal,
and stemming).
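
One possible shape for the Q2 pipeline is sketched below using only the standard library. The stopword set and the `crude_stem` suffix stripper are small hypothetical stand-ins, not real linguistic resources. In an actual solution you would likely pass `nltk.stem.PorterStemmer().stem` as the `stem` argument and use NLTK's stopword list instead.

```python
import re

# Tiny illustrative stopword list (an assumption, not NLTK's list).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "and", "in", "to", "on", "for", "with"}


def crude_stem(token):
    """Strip one common suffix; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, stem=crude_stem):
    """Normalize, tokenize, remove stopwords, then stem."""
    text = text.lower()                                  # normalization
    tokens = re.findall(r"[a-z]+", text)                 # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [stem(t) for t in tokens]                     # stemming


tokens = preprocess("The cats are chasing the mice in the gardens!")
print(tokens)  # ['cat', 'chas', 'mice', 'garden']
```

Keeping the stemmer as a parameter lets the same pipeline be rerun with different stemmers to compare their output.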

Q3: Convert the corpus into Bag-of-Words and tf-idf feature matrix using:

(a) TfidfVectorizer() and CountVectorizer()


(b) Without using in-built functions
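
For part (b), a from-scratch version might look like the sketch below. It takes already tokenized documents, builds a count matrix, and then applies a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization. That formula is chosen here because it corresponds to the documented defaults of scikit-learn's TfidfVectorizer (`smooth_idf=True`, `norm="l2"`), so the result can be checked against part (a). The function names and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: j for j, tok in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for tok, n in Counter(doc).items():
            matrix[i][index[tok]] = n
    return vocab, matrix


def tfidf(counts):
    """Turn a count matrix into L2-normalized tf-idf rows."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        result.append([x / norm for x in weighted])
    return result


docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"]]  # toy tokenized corpus
vocab, bow = bag_of_words(docs)
weights = tfidf(bow)
print(vocab)  # ['cat', 'dog', 'mat', 'sat']
print(bow)    # [[1, 0, 1, 1], [2, 1, 0, 0]]
```

For part (a), `CountVectorizer().fit_transform(texts)` and `TfidfVectorizer().fit_transform(texts)` from `sklearn.feature_extraction.text` take raw strings rather than token lists and return sparse matrices.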

Q4: Explore how to access, pre-process, and create feature vectors for HTML text.
(Hint: explore the BeautifulSoup package.)
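
A rough starting point for Q4 is sketched below: strip the markup, drop script and style content, then count tokens. It uses BeautifulSoup when the `bs4` package is installed and otherwise falls back to the standard library's `HTMLParser`. The sample HTML string and all helper names are assumptions for illustration; a real solution would read HTML from a file or a fetched page.

```python
import re
from collections import Counter
from html.parser import HTMLParser

try:
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
except ImportError:
    BeautifulSoup = None


class _TextExtractor(HTMLParser):
    """Stdlib fallback: collect text found outside <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string."""
    if BeautifulSoup is not None:
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style"]):
            tag.decompose()  # remove non-visible content
        return soup.get_text(separator=" ", strip=True)
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical sample page.
html = (
    "<html><head><title>Demo</title><style>p{color:red}</style></head>"
    "<body><h1>Text Mining</h1><p>Mining text from <b>HTML</b> pages.</p>"
    "<script>alert('x')</script></body></html>"
)
text = html_to_text(html)
counts = Counter(re.findall(r"[a-z]+", text.lower()))  # simple bag of words
print(counts.most_common(3))
```

Once the text is extracted, the same pre-processing and vectorization steps used for the .txt corpus apply unchanged.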
