
NLP Worksheet

Text processing, Bag of Words and TF-IDF


Suppose you have obtained this information and you would like to analyze it. Let’s start by making it ready for the
computer!

Corpus
Document 1: We can use health chatbots for treating stress.
Document 2: We can use NLP to create chatbots and we will be making health chatbots now!
Document 3: Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y

Step 1: Sentence Segmentation

No. Sentence

1 We can use health chatbots for treating stress.

2 We can use NLP to create chatbots and we will be making health chatbots now!

3 Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y

Step 2: Tokenization
Separate your sentences into tokens. How many tokens do you have?

Tokens

We can use health chatbots for treating stress . We can use NLP to
create chatbots and we will be making health chatbots now ! Health
Chatbots cannot replace human counsellors now . Yay > < !! @1nteLA!4Y

Number of tokens: 38
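If you would like to check your count with code, here is a minimal sketch of a tokenizer. Its rules are assumptions chosen to mirror the token list above, not a standard: punctuation at the end of a word becomes its own token, runs of the same symbol (like !!) stay together, and a mixed chunk such as @1nteLA!4Y is left whole. Counts differ between tokenizers; treating "now!" as a single token, for instance, gives one token fewer.

```python
import re

corpus = [
    "We can use health chatbots for treating stress.",
    "We can use NLP to create chatbots and we will be making health chatbots now!",
    "Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y",
]

def tokenize(text):
    """Split text into tokens using a few simple, hand-picked rules."""
    tokens = []
    for chunk in text.split():
        word_then_punct = re.fullmatch(r"(\w+)([.!?,]+)", chunk)
        if word_then_punct:
            # "stress." becomes "stress" and "."
            tokens.extend(word_then_punct.groups())
        elif re.fullmatch(r"[^\w\s]+", chunk):
            # Symbols only: keep runs of the same symbol together ("!!"),
            # but split different symbols apart ("><" becomes ">" and "<").
            tokens.extend(m.group(0) for m in re.finditer(r"(.)\1*", chunk))
        else:
            # Plain words and mixed chunks such as "@1nteLA!4Y" stay whole.
            tokens.append(chunk)
    return tokens

tokens = [token for document in corpus for token in tokenize(document)]
print(tokens)
print("Number of tokens:", len(tokens))
```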
Step 3: Remove stopwords, special characters, numbers
List out the stopwords, special characters, and numbers that you want to remove!

Stopwords, special characters and numbers

can for and to will be . ! !! > < @1nteLA!4Y
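The removal step can be sketched as a simple filter. The stopword set below is just the list chosen in this worksheet (real stopword lists are much longer), and the rule "keep only purely alphabetic tokens" is an assumption that happens to remove every special character and the garbled token here.

```python
# Stopwords chosen in this worksheet; a real list would be longer.
STOPWORDS = {"can", "for", "and", "to", "will", "be"}

def keep(token):
    """Keep a token only if it is purely alphabetic and not a stopword."""
    return token.isalpha() and token.lower() not in STOPWORDS

tokens = ["We", "can", "use", "health", "chatbots", "for", "treating", "stress", ".",
          "Yay", ">", "<", "!!", "@1nteLA!4Y"]
filtered = [token for token in tokens if keep(token)]
print(filtered)
# ['We', 'use', 'health', 'chatbots', 'treating', 'stress', 'Yay']
```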

Step 4: Converting text to a common case


Which text do you need to modify? What is the modified form?

Modified Form

Converting to a common case usually means changing upper-case or
mixed-case words to lower case. In this corpus some words start with a
capital letter and one (NLP) is written fully in upper case, so we
change them to lower case as follows:

We -> we
Health -> health
Chatbots -> chatbots
Yay -> yay
NLP -> nlp
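To find which words need modifying, you can compare each token with its lower-case form. This is a small sketch; the token list is a hand-picked sample from the corpus.

```python
tokens = ["We", "use", "NLP", "Health", "Chatbots", "now", "Yay"]

# Map every token that changes to its lower-case form.
modified = {token: token.lower() for token in tokens if token != token.lower()}
print(modified)
# {'We': 'we', 'NLP': 'nlp', 'Health': 'health', 'Chatbots': 'chatbots', 'Yay': 'yay'}
```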

Step 5: Stemming
List out the stem words.

Stem words

Word      Affix  Stem

treating  -ing   treat
making    -ing   mak
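The table above strips the affix and keeps whatever is left, even when the result is not a real word. Here is a deliberately crude sketch of that idea. The function name and the minimum-length rule are assumptions for illustration; established stemmers apply many more rules and may return different stems.

```python
def strip_suffix(word, suffixes=("ing",)):
    """Chop a known suffix off the end of a word, with no dictionary check."""
    for suffix in suffixes:
        # Require at least 3 letters to remain, so "sing" is not cut to "s".
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

for word in ["treating", "making", "stress"]:
    print(word, "->", strip_suffix(word))
# treating -> treat
# making -> mak
# stress -> stress
```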

Step 6: Lemmatization
List out the root words/ lemma.

Lemma

Word      Affix  Lemma

treating  -ing   treat
making    -ing   make
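Unlike stemming, lemmatization has to return a real word, so it needs some knowledge of the vocabulary. A minimal sketch, assuming a tiny hand-written lookup table (hypothetical, covering only this corpus) in place of a real dictionary:

```python
# Hypothetical lookup table; real lemmatizers use a full dictionary and grammar rules.
LEMMAS = {"treating": "treat", "making": "make"}

def lemmatize(word):
    """Return the dictionary form of a word, or the word itself if unknown."""
    return LEMMAS.get(word, word)

print(lemmatize("making"))   # make
print(lemmatize("health"))   # health
```

Compare "make" here with the stem "mak": the lemma is always a meaningful word.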

Final data
List out the final, processed data.

Processed data

we use health chatbots treat stress
we use nlp create chatbots we make health chatbots now
health chatbots cannot replace human counsellors now yay
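All the steps can be chained into one small function. This is only a sketch under the same assumptions as the worksheet answers: the stopword set and lemma table are the hand-picked ones from the earlier steps, and any chunk that is not purely alphabetic after trimming punctuation is dropped.

```python
import string

STOPWORDS = {"can", "for", "and", "to", "will", "be"}
LEMMAS = {"treating": "treat", "making": "make"}  # hypothetical lookup table

def preprocess(text):
    """Tokenize, clean, lower-case, remove stopwords and lemmatize one document."""
    processed = []
    for chunk in text.split():
        word = chunk.strip(string.punctuation)  # "stress." becomes "stress"
        if not word.isalpha():                  # drops "", "1nteLA!4Y", numbers
            continue
        word = word.lower()
        if word in STOPWORDS:
            continue
        processed.append(LEMMAS.get(word, word))
    return " ".join(processed)

corpus = [
    "We can use health chatbots for treating stress.",
    "We can use NLP to create chatbots and we will be making health chatbots now!",
    "Health Chatbots cannot replace human counsellors now. Yay >< !! @1nteLA!4Y",
]
for document in corpus:
    print(preprocess(document))
```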
Congratulations, you’ve managed to process the data!

Bag of words
Step 1: Collect data and process it
For this exercise, we use the sentences without processing them so that they are easier to read.

No. Sentence

1 We can use health chatbots for treating stress

2 We can use NLP to create chatbots and we will be making health chatbots now

3 Health chatbots cannot replace human counsellors now yay


Step 2: Create dictionary
Make a list of all the different words in the text.

Dictionary

we can use health chatbots for treating stress nlp to create and will be
making now cannot replace human counsellors yay
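Building the dictionary means walking through the documents and writing each word down the first time you meet it. A minimal sketch, assuming the sentences have been lower-cased and that "yay" is kept in document 3 so the result matches the dictionary above:

```python
documents = [
    "we can use health chatbots for treating stress",
    "we can use nlp to create chatbots and we will be making health chatbots now",
    "health chatbots cannot replace human counsellors now yay",
]

dictionary = []
for document in documents:
    for word in document.split():
        if word not in dictionary:   # record each word only once
            dictionary.append(word)

print(dictionary)
print(len(dictionary), "different words")
```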

Step 3: Create document vectors


Use the space below to create your document vectors!

Word         Doc 1  Doc 2  Doc 3

we           1      2      0
can          1      1      0
use          1      1      0
health       1      1      1
chatbots     1      2      1
for          1      0      0
treating     1      0      0
stress       1      0      0
nlp          0      1      0
to           0      1      0
create       0      1      0
and          0      1      0
will         0      1      0
be           0      1      0
making       0      1      0
now          0      1      1
cannot       0      0      1
replace      0      0      1
human        0      0      1
counsellors  0      0      1
yay          0      0      1
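A document vector records how many times each dictionary word occurs in a document. The sketch below counts every occurrence, so "we" and "chatbots" get 2 for document 2 because each appears twice in that sentence. The lower-cased sentences are an assumption made to match the dictionary.

```python
from collections import Counter

documents = [
    "we can use health chatbots for treating stress",
    "we can use nlp to create chatbots and we will be making health chatbots now",
    "health chatbots cannot replace human counsellors now yay",
]

# Dictionary of unique words, in order of first appearance.
dictionary = list(dict.fromkeys(word for doc in documents for word in doc.split()))

vectors = []
for doc in documents:
    counts = Counter(doc.split())            # word -> occurrences in this document
    vectors.append([counts[word] for word in dictionary])

# Print one row per word: word, then its count in documents 1, 2 and 3.
for word, row in zip(dictionary, zip(*vectors)):
    print(f"{word:<12}", *row)
```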
TF-IDF
You’ve obtained your bag of words. Now let’s continue with the TF-IDF!

Steps 1 to 3: Count the number of documents in which each word appears at least once, and
write that number next to the word in your vocabulary. This is its document frequency.
Draw your own table for this!

Example of a document frequency:

Aman  and  Anil  are  stressed  went  to  a  therapist  download  health  chatbot
2     1    2     1    1         2     2   2  1          1         1       1

Your document frequency:


we 2
can 2
use 2
health 3
chatbots 3
for 1
treating 1
stress 1
nlp 1
to 1
create 1
and 1
will 1
be 1
making 1
now 2
cannot 1
replace 1
human 1
counsellors 1
yay 1
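Document frequency only asks whether a word is present in a document, not how often. A minimal sketch, using the same lower-cased sentences as an assumption:

```python
documents = [
    "we can use health chatbots for treating stress",
    "we can use nlp to create chatbots and we will be making health chatbots now",
    "health chatbots cannot replace human counsellors now yay",
]
dictionary = list(dict.fromkeys(word for doc in documents for word in doc.split()))

# Turn each document into a set so repeated words count only once.
word_sets = [set(doc.split()) for doc in documents]
document_frequency = {
    word: sum(word in words for words in word_sets) for word in dictionary
}

for word, count in document_frequency.items():
    print(word, count)
```

Note that "we" occurs twice in document 2 but still contributes only 1 to its document frequency.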
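Step 4: Get your inverse document frequency.
Put the total number of documents in the numerator and the word's document frequency in the denominator.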
Example of an inverse document frequency:

aman  and  anil  are  stressed  went  to   a    therapist  download  health  chatbot
3/2   3/1  3/2   3/1  3/1       3/2   3/2  3/2  3/1        3/1       3/1     3/1

Your inverse document frequency:


we 3/2
can 3/2
use 3/2
health 3/3
chatbots 3/3
for 3/1
treating 3/1
stress 3/1
nlp 3/1
to 3/1
create 3/1
and 3/1
will 3/1
be 3/1
making 3/1
now 3/2
cannot 3/1
replace 3/1
human 3/1
counsellors 3/1
yay 3/1
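The inverse document frequency is just a fraction: total documents over document frequency. A small sketch with Python's Fraction type, using a few document frequencies copied from the table earlier. Fraction simplifies automatically, so 3/1 prints as 3 and 3/3 prints as 1.

```python
from fractions import Fraction

total_documents = 3
# A sample of document frequencies copied from the worksheet table.
document_frequency = {"we": 2, "health": 3, "for": 1, "now": 2}

inverse_document_frequency = {
    word: Fraction(total_documents, df) for word, df in document_frequency.items()
}

for word, idf in inverse_document_frequency.items():
    print(word, idf)
# we 3/2
# health 1
# for 3
# now 3/2
```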
Step 5: Get your TF-IDF
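Multiply each word's count in a document (its term frequency) by the log of its
inverse document frequency. The values in this worksheet use log base 10.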

Your TF-IDF:

Word         Doc 1       Doc 2       Doc 3

we           1*log(3/2)  2*log(3/2)  0*log(3/2)
can          1*log(3/2)  1*log(3/2)  0*log(3/2)
use          1*log(3/2)  1*log(3/2)  0*log(3/2)
health       1*log(3/3)  1*log(3/3)  1*log(3/3)
chatbots     1*log(3/3)  2*log(3/3)  1*log(3/3)
for          1*log(3/1)  0*log(3/1)  0*log(3/1)
treating     1*log(3/1)  0*log(3/1)  0*log(3/1)
stress       1*log(3/1)  0*log(3/1)  0*log(3/1)
nlp          0*log(3/1)  1*log(3/1)  0*log(3/1)
to           0*log(3/1)  1*log(3/1)  0*log(3/1)
create       0*log(3/1)  1*log(3/1)  0*log(3/1)
and          0*log(3/1)  1*log(3/1)  0*log(3/1)
will         0*log(3/1)  1*log(3/1)  0*log(3/1)
be           0*log(3/1)  1*log(3/1)  0*log(3/1)
making       0*log(3/1)  1*log(3/1)  0*log(3/1)
now          0*log(3/2)  1*log(3/2)  1*log(3/2)
cannot       0*log(3/1)  0*log(3/1)  1*log(3/1)
replace      0*log(3/1)  0*log(3/1)  1*log(3/1)
human        0*log(3/1)  0*log(3/1)  1*log(3/1)
counsellors  0*log(3/1)  0*log(3/1)  1*log(3/1)
yay          0*log(3/1)  0*log(3/1)  1*log(3/1)

After log operation:

Word         Doc 1  Doc 2  Doc 3

we           0.176  0.352  0
can          0.176  0.176  0
use          0.176  0.176  0
health       0      0      0
chatbots     0      0      0
for          0.477  0      0
treating     0.477  0      0
stress       0.477  0      0
nlp          0      0.477  0
to           0      0.477  0
create       0      0.477  0
and          0      0.477  0
will         0      0.477  0
be           0      0.477  0
making       0      0.477  0
now          0      0.176  0.176
cannot       0      0      0.477
replace      0      0      0.477
human        0      0      0.477
counsellors  0      0      0.477
yay          0      0      0.477

health and chatbots score 0 in every document because log(3/3) = log 1 = 0.
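You can check the numbers after the log operation with a few lines of code. This is a sketch of the formula used in this worksheet (term frequency times the base-10 log of total documents over document frequency); the function and parameter names are assumptions for illustration. Other references define TF-IDF with slightly different scaling.

```python
import math

def tf_idf(term_frequency, document_frequency, total_documents=3):
    """Term frequency multiplied by log10 of the inverse document frequency."""
    return term_frequency * math.log10(total_documents / document_frequency)

print(round(tf_idf(1, 2), 3))  # word seen once, found in 2 of 3 documents: 0.176
print(round(tf_idf(1, 1), 3))  # word seen once, found in 1 of 3 documents: 0.477
print(round(tf_idf(1, 3), 3))  # word found in all 3 documents: 0.0
print(round(tf_idf(2, 2), 3))  # word seen twice, found in 2 of 3 documents: 0.352
```

Words that appear in every document get a score of 0, while words unique to one document get the highest score. That is how TF-IDF highlights the words that make a document distinctive.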

Thank You
Sampurna Rastogi
