
Vector Space Model: BOW

The bag-of-words model is a vector space representation used in NLP.

In this model, a text is represented as the bag of its words, disregarding grammar
and even word order but keeping multiplicity.

BOW                                it  is  puppy  cat  pen  a  this

It is a puppy                       1   1      1    0    0  1     0
It is a kitten                      1   1      0    0    0  1     0
It is a cat                         1   1      0    1    0  1     0
That is a dog and this is a pen     0   2      0    0    1  2     1
It is a matrix                      1   1      0    0    0  1     0
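The counts in the table above can be reproduced with a minimal bag-of-words sketch in plain Python; the vocabulary and sentences are taken directly from the table:

```python
from collections import Counter

# Fixed vocabulary, matching the columns of the table above
vocab = ["it", "is", "puppy", "cat", "pen", "a", "this"]

def bow_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

print(bow_vector("It is a puppy", vocab))                    # [1, 1, 1, 0, 0, 1, 0]
print(bow_vector("That is a dog and this is a pen", vocab))  # [0, 2, 0, 0, 1, 2, 1]
```

Note that "kitten", "dog", and "matrix" are simply dropped: words outside the vocabulary contribute nothing to the vector, which is why several different sentences map to the same row.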
Vector Space Model: TF-IDF

TF (term frequency) of a word is the frequency of that word in a document. For example, if a document containing 100 words has the term 'cat' appearing 12 times, then

TF(cat) = 12/100 = 0.12

IDF (inverse document frequency) of a word is a measure of how significant that term is throughout the corpus. It is given by the logarithm of the total number of documents divided by the number of documents containing the term. For example, if the corpus contains 10 million documents and 0.3 million of them contain the term 'cat', then

IDF(cat) = log(10,000,000 / 300,000) = 1.52

∴ W(cat) = (TF × IDF)(cat) = 0.12 × 1.52 ≈ 0.18
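The arithmetic of the 'cat' example can be checked in a few lines of Python, assuming the base-10 logarithm used above:

```python
import math

tf_cat = 12 / 100                            # 12 occurrences in a 100-word document
idf_cat = math.log10(10_000_000 / 300_000)   # base-10 log, as in the worked example
w_cat = tf_cat * idf_cat                     # TF-IDF weight for 'cat'

print(round(tf_cat, 2), round(idf_cat, 2), round(w_cat, 2))  # 0.12 1.52 0.18
```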


BOW vs TF-IDF

BOW                          bed  cat  dog  face  my  on  sat  the

1  the cat sat on my face      0    1    0     1   1   1    1    1
2  the dog sat on my bed       1    0    1     0   1   1    1    1

TF-IDF                       bed      cat      dog      face     my   on   sat  the

1  the cat sat on my face    0.00000  0.11552  0.00000  0.11552  0.0  0.0  0.0  0.0
2  the dog sat on my bed     0.11552  0.00000  0.11552  0.00000  0.0  0.0  0.0  0.0
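The 0.11552 entries in the TF-IDF table are consistent with term frequency normalized by document length and a natural-log IDF; a minimal sketch under that assumption (words appearing in both documents get IDF = ln(2/2) = 0, hence the rows of zeros):

```python
import math

docs = ["the cat sat on my face", "the dog sat on my bed"]
tokenized = [d.split() for d in docs]

def tfidf(doc, word):
    tf = doc.count(word) / len(doc)                # relative frequency in this document
    df = sum(1 for d in tokenized if word in d)    # number of documents containing the word
    idf = math.log(len(tokenized) / df)            # natural log, matching the table's values
    return tf * idf

print(round(tfidf(tokenized[0], "cat"), 5))   # 0.11552  (1/6 * ln 2)
print(round(tfidf(tokenized[0], "sat"), 5))   # 0.0      (appears in both documents)
```

This illustrates the contrast with plain BOW: 'sat', 'the', 'on', and 'my' occur everywhere and are weighted down to zero, while the discriminative words ('cat', 'face', 'dog', 'bed') keep non-zero weights.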
