
NLP Worksheet

Text processing, bag of words, tf-idf activity


Suppose you have obtained this information and would like to analyse it. Let’s start by getting it ready for the
computer!

Corpus
Document 1: We can use health chatbots for treating stress.
Document 2: We can use NLP to create chatbots and we will be making health chatbots now!
Document 3: Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y

Step 1: Sentence Segmentation

No. Sentence
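To check your table afterwards, here is a minimal Python sketch of rule-based sentence segmentation. It assumes a sentence ends at ".", "!" or "?" followed by a space; the function name split_sentences is only illustrative. Notice that this simple rule also cuts the noisy tail of Document 3 into fragments, which is one reason cleaning is needed later.

```python
import re

corpus = [
    "We can use health chatbots for treating stress.",
    "We can use NLP to create chatbots and we will be making health chatbots now!",
    "Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y",
]

def split_sentences(text):
    # Split wherever ., ! or ? is followed by whitespace (a naive rule).
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

# Flatten the per-document results into one numbered list of sentences.
sentences = [s for doc in corpus for s in split_sentences(doc)]
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```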

Step 2: Tokenization
Separate your sentences into tokens. How many tokens do you have?

Tokens

Number of tokens: ________
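If you want to compare your count with a program, the sketch below tokenizes one sentence with a regular expression. It assumes that every word is one token and every punctuation mark is its own token; other tokenizers may count differently, so your answer can be valid even if it does not match.

```python
import re

sentence = "We can use NLP to create chatbots and we will be making health chatbots now!"

def tokenize(text):
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize(sentence)
print(tokens)
print("Number of tokens:", len(tokens))
```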


Step 3: Remove stopwords, special characters, numbers
List out the stopwords, special characters, and numbers that you want to remove!

Stopwords, special characters, and numbers
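The removal step can be sketched in Python as below. The STOPWORDS set is a short, hypothetical list picked for this corpus (real stopword lists are much longer), and any token that is not purely alphabetic is treated as a special character or number. Because case conversion only happens in Step 4, the stopword check compares the lowercased token.

```python
# Hypothetical stopword list for this worksheet; not a standard list.
STOPWORDS = {"we", "can", "for", "to", "and", "will", "be", "now"}

def clean(tokens):
    kept = []
    for tok in tokens:
        if tok.lower() in STOPWORDS:
            continue  # drop stopwords, whatever their case
        if not tok.isalpha():
            continue  # drop punctuation, digits, and mixed tokens like "4Y"
        kept.append(tok)
    return kept

doc1 = ["We", "can", "use", "health", "chatbots", "for", "treating", "stress", "."]
noise = ["Yay", ">", "<", "!", "!", "@", "1nteLA", "!", "4Y"]
print(clean(doc1))
print(clean(noise))
```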

Step 4: Converting text to a common case


Which text do you need to modify? What is the modified form?

Modified form
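As a small illustration, assuming lowercase is the common case you choose, the sketch below lowercases the tokens of Document 3 and lists only the ones that actually changed.

```python
tokens = ["Health", "Chatbots", "cannot", "replace", "human", "counsellors"]

# Convert every token to lowercase.
lowered = [t.lower() for t in tokens]

# Keep (original, modified) pairs only where the text was changed.
changed = [(old, new) for old, new in zip(tokens, lowered) if old != new]
print(changed)
```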

Step 5: Stemming
List out the stem words.

Stem words
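Stemming chops off word endings by rule, so the result is not always a real word. The sketch below is a toy suffix stripper written for this worksheet (the name toy_stem and its two rules are assumptions); it is not the Porter stemmer or any other standard algorithm.

```python
def toy_stem(word):
    # Rule 1: strip "ing" from longer words (treating -> treat).
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    # Rule 2: strip a plural "s", but leave words ending in "ss" alone.
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

words = ["use", "health", "chatbots", "treating", "stress", "making", "counsellors"]
stems = [toy_stem(w) for w in words]
print(stems)
```

Here "making" becomes "mak", which shows how a stem can stop being a dictionary word.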

Step 6: Lemmatization
List out the root words (lemmas).

Lemma
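Lemmatization maps each word to a real dictionary form instead of just cutting the ending. Real lemmatizers rely on large dictionaries; the sketch below stands in for one with a tiny hand-made lookup table (LEMMAS is hypothetical and covers only this corpus).

```python
# Hypothetical mini-dictionary: word form -> lemma.
LEMMAS = {
    "chatbots": "chatbot",
    "treating": "treat",
    "making": "make",
    "counsellors": "counsellor",
}

def lemmatize(word):
    # Words missing from the table are returned unchanged.
    return LEMMAS.get(word, word)

words = ["health", "chatbots", "treating", "making", "counsellors"]
lemmas = [lemmatize(w) for w in words]
print(lemmas)
```

Compare "making": a lemmatizer gives "make", a real word, where simple suffix stripping would leave "mak".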
Final data
List out the final, processed data.

Processed data

Congratulations, you’ve managed to process the data!

Bag of words
Step 1: Collect data and process it
For this exercise, we can use the sentences without processing them, so that they are easier to read.

No. Sentence

1 We can use health chatbots for treating stress

2 We can use NLP to create chatbots and we will be making health chatbots now

3 Health chatbots cannot replace human counsellors now

Step 2: Create dictionary


Make a list of all the different words in the text.

Dictionary
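One possible way to build the dictionary in Python is sketched below. It assumes words are separated by spaces and lowercased first, so that "We" and "we" (or "Health" and "health") count as the same word; words are kept in the order they first appear.

```python
sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

def build_dictionary(sentences):
    vocab = []
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:  # add each different word only once
                vocab.append(word)
    return vocab

vocab = build_dictionary(sentences)
print(vocab)
print("Dictionary size:", len(vocab))
```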

Step 3: Create document vectors


Use the next page to create your document vector!
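A document vector has one number per dictionary word: how many times that word appears in the sentence. The sketch below builds the three vectors under the same assumptions as before (lowercased, space-separated words); to_vector is an illustrative name.

```python
sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

# Dictionary: every different word, in order of first appearance.
vocab = list(dict.fromkeys(w for s in sentences for w in s.lower().split()))

def to_vector(sentence, vocab):
    words = sentence.lower().split()
    # One count per dictionary word; 0 if the word is absent.
    return [words.count(term) for term in vocab]

vectors = [to_vector(s, vocab) for s in sentences]
for number, vector in enumerate(vectors, start=1):
    print(number, vector)
```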
Tf-idf
You’ve obtained your bag of words. Now let’s continue with the tf-idf!

Step 1 - 3: Count the number of documents where the word appears at least once & write that
number down next to the word in your vocabulary to get your document frequency. Draw your
own table for this!

Example of a document frequency:

aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
2     1    2     1    1         2     2    2    1          1         1       1

Your document frequency:
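To check a hand-drawn table, the document frequency can be computed as in this sketch. It uses the three sentences from the bag-of-words exercise, lowercased and split on spaces (an assumption), and counts each word at most once per document.

```python
sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

# A set per document, so repeated words inside one document count once.
docs = [set(s.lower().split()) for s in sentences]
vocab = list(dict.fromkeys(w for s in sentences for w in s.lower().split()))

# Document frequency: in how many documents does each word appear?
df = {term: sum(1 for d in docs if term in d) for term in vocab}
for term, count in df.items():
    print(term, count)
```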

Step 4: Get your inverse document frequency.


Example of an inverse document frequency:

aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
3/2   3/1  3/2   3/1  3/1       3/2   3/2  3/2  3/1        3/1       3/1     3/1

Your inverse document frequency:
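The inverse document frequency is the total number of documents divided by the document frequency, as in the 3/2 and 3/1 entries of the example. The sketch below computes that ratio for a few words from the bag-of-words sentences and then takes the logarithm. Base 10 is an assumption here, since the worksheet does not say which base to use.

```python
import math

n_docs = 3
# Document frequencies of a few words from the three worksheet sentences.
df = {"we": 2, "health": 3, "nlp": 1}

# Inverse document frequency as a ratio: total documents / document frequency.
ratio = {term: n_docs / count for term, count in df.items()}

# Log of the ratio (base 10 assumed).
idf = {term: math.log10(r) for term, r in ratio.items()}

for term in df:
    print(term, ratio[term], round(idf[term], 3))
```

A word that appears in every document, such as "health", gets a ratio of 1 and therefore a log value of 0.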


Step 5: Get your tf-idf
Example of a tf-idf:

After log operation:

Your tf-idf:
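Putting the steps together, tf-idf multiplies each count in the document vector (the term frequency) by the log of the inverse document frequency. The sketch below is one way to do this for the three bag-of-words sentences; the base-10 logarithm and the helper name tfidf are assumptions, not something the worksheet specifies.

```python
import math

sentences = [
    "We can use health chatbots for treating stress",
    "We can use NLP to create chatbots and we will be making health chatbots now",
    "Health chatbots cannot replace human counsellors now",
]

docs = [s.lower().split() for s in sentences]
vocab = list(dict.fromkeys(w for d in docs for w in d))
n_docs = len(docs)

# Document frequency: number of documents containing each word.
df = {t: sum(1 for d in docs if t in d) for t in vocab}

def tfidf(doc):
    # term frequency (raw count) * log10(total documents / document frequency)
    return {t: doc.count(t) * math.log10(n_docs / df[t]) for t in vocab}

scores = [tfidf(d) for d in docs]
for number, row in enumerate(scores, start=1):
    print(number, {t: round(v, 3) for t, v in row.items() if v > 0})
```

Words found in every document score 0, while words that are frequent in one document but rare elsewhere score highest.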
