
Bag of Words, also called the Document Matrix
Overview
• Before text is given to a model, it goes through text preprocessing. A model usually cannot understand raw text directly, because it works with numbers rather than words.
• That is why bag of words is an important step in text visualization and text preprocessing.
• We convert the text into a numeric format before passing it to the model.
• When the text is passed in integer or floating-point form, the model can perform calculations on it and use it in its algorithm.
• Bag of words is widely used in sentiment analysis.
Step by Step
Example
Lowercase the sentences
• Without lowercasing, the system treats "Today" as one word and "today" as another.
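A minimal sketch of the lowercasing step. The two sentences are hypothetical examples, not the ones used in the original slides:

```python
# Hypothetical example sentences (assumed for illustration).
sentences = [
    "Today is a good day",
    "I am happy today",
]

# Lowercase every sentence so "Today" and "today" become the same word.
lowered = [sentence.lower() for sentence in sentences]

print(lowered)
```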
Tokenization
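Tokenization splits each sentence into individual words (tokens). One possible sketch uses a regular expression. The helper name `tokenize` and the choice to keep only letters, digits, and apostrophes are assumptions made for this example:

```python
import re

def tokenize(sentence):
    """Split a sentence into lowercase tokens (assumed rule: letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

# Hypothetical example sentence.
tokens = tokenize("Today is a good day, isn't it?")
print(tokens)
```

A library tokenizer could be used instead; the regular expression just keeps the example self-contained.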
Histogram
• The histogram is an important step in bag of words.
• It counts the frequency of each word present in the sentences.
• These counts are then used to create the matrix.
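The word counts can be sketched with `collections.Counter` from the Python standard library. The sentences are hypothetical and assumed to be lowercased already:

```python
from collections import Counter

# Hypothetical, already lowercased sentences (assumed for illustration).
sentences = ["today is a good day", "i am happy today"]

# Count how often each word occurs across all sentences.
word_counts = Counter()
for sentence in sentences:
    word_counts.update(sentence.split())

print(word_counts)
```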
Sort the Histogram in Descending Order
Filter
• The text contains names, proper nouns, and other words that are not useful to the project, so we filter them out.
• Here we keep only the 10 most frequent words.
• In general we select the most frequent words. For example, out of 6,500 words we might keep approximately 4,500.
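Sorting the histogram in descending order and keeping the top words can be done in one call to `Counter.most_common`. The counts below are made up, and `top_n` is set to 3 rather than 10 only to keep the toy example small:

```python
from collections import Counter

# Hypothetical word counts (assumed for illustration).
word_counts = Counter({"the": 5, "movie": 4, "good": 3, "plot": 2, "alice": 1})

# most_common(n) sorts by count in descending order and keeps the first n entries.
top_n = 3  # the text above uses 10
vocabulary = [word for word, _ in word_counts.most_common(top_n)]

print(vocabulary)
```

Rare words such as the name in this example fall below the cutoff and are dropped.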
Creating a matrix
• The resulting matrix is the bag of words.
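A sketch of building the matrix, assuming a binary encoding (1 if the word occurs in the sentence, otherwise 0). The sentences and the filtered vocabulary are hypothetical:

```python
# Hypothetical lowercased sentences and an already filtered vocabulary.
sentences = ["today is a good day", "today i am happy", "a good movie"]
vocabulary = ["today", "a", "good"]

# One row per sentence, one column per vocabulary word:
# 1 if the word occurs in the sentence, otherwise 0 (assumed binary encoding).
matrix = []
for sentence in sentences:
    words = set(sentence.split())
    matrix.append([1 if word in words else 0 for word in vocabulary])

for row in matrix:
    print(row)
```

Each row is the numeric representation of one sentence that can be passed to a model.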
Disadvantages
• Bag of words has a few disadvantages.
• Most of the values in the bag of words matrix are 1 or 0.
Solution
• The solution is to build the document matrix using TF-IDF.
Example
• Lowercase the sentences
• Tokenization
• Histogram
• Filter
Intuition
Term Frequency - Formula
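The formula itself is not reproduced in the text here. A commonly used definition, assumed for this sketch, divides the number of times a word appears in a document by the total number of words in that document. The symbols $t$ (a word), $d$ (a document), and $f_{t,d}$ (the count of $t$ in $d$) are notation introduced for illustration:

```latex
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}
```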
Applying formula
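A small sketch of computing term frequency, assuming the common definition (occurrences of a word in a document divided by the total number of words in that document). The helper name `term_frequency` and the sentence are hypothetical:

```python
def term_frequency(term, document_tokens):
    """Occurrences of `term` divided by the number of tokens in the document."""
    return document_tokens.count(term) / len(document_tokens)

# Hypothetical document of 9 tokens; "today" appears twice.
doc = "today is a good day and today is sunny".split()

tf_today = term_frequency("today", doc)
print(tf_today)
```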
Inverse Document Frequency
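Inverse document frequency gives a higher weight to words that appear in fewer documents. A minimal sketch, assuming the common form log(N / n), where N is the number of documents and n is the number of documents containing the word. Variants that add smoothing terms also exist. The documents and the helper name are hypothetical:

```python
import math

# Hypothetical tokenized documents (assumed for illustration).
documents = [
    "today is a good day".split(),
    "today i am happy".split(),
    "a good movie".split(),
]

def inverse_document_frequency(term, documents):
    """log(N / n); assumes the term appears in at least one document."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing)

idf_today = inverse_document_frequency("today", documents)  # in 2 of 3 documents
idf_movie = inverse_document_frequency("movie", documents)  # in 1 of 3 documents
print(idf_today, idf_movie)
```

Multiplying a word's term frequency by its inverse document frequency gives its TF-IDF weight, so the matrix holds graded values instead of only 1 or 0.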
