Vector Space Model BOW
In this model, a text is represented as the bag of its words, disregarding grammar
and even word order but keeping multiplicity.
Sentence                            Count vector
It is a kitten                      1 1 0 0 0 1 0
It is a cat                         1 1 0 1 0 1 0
That is a dog and this is a pen     0 2 0 0 1 2 1
It is a matrix                      1 1 0 0 0 1 0
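The counting step behind such a table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bag_of_words` is hypothetical, tokenization is a plain lowercase whitespace split, and the vocabulary is built from the four sentences themselves in alphabetical order. The column labels of the table above did not survive, so the vectors produced here have 11 entries and are not expected to line up with its 7 columns.

```python
from collections import Counter


def bag_of_words(sentences):
    """Return (vocabulary, count_vectors) for a list of sentences."""
    # Assumption: lowercase + whitespace split is enough for these toy sentences.
    tokenized = [sentence.lower().split() for sentence in sentences]
    # Assumption: vocabulary = every distinct word, sorted alphabetically.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # keeps multiplicity, ignores word order
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


sentences = [
    "It is a kitten",
    "It is a cat",
    "That is a dog and this is a pen",
    "It is a matrix",
]
vocabulary, vectors = bag_of_words(sentences)
print(vocabulary)
for sentence, vector in zip(sentences, vectors):
    print(f"{sentence:<34}{vector}")
```

Words that occur twice in a sentence ("is" and "a" in the third sentence) get a count of 2, which is the "keeping multiplicity" part of the definition.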
Vector Space Model TF-IDF
TF (term frequency) of a word is the frequency of that word in a document.
For example, if the term 'cat' appears 12 times in a document containing 100 words, the TF for the word 'cat' is
TFcat = 12/100 = 0.12

IDF (inverse document frequency) of a word is a measure of how significant that term is throughout the corpus.
For example, say the corpus contains 10 million documents, and 0.3 million of them contain the term 'cat'. Then the IDF is given by the logarithm (base 10 here) of the total number of documents divided by the number of documents containing the term 'cat':
IDF(cat) = log(10,000,000/300,000) = 1.52
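The two worked numbers above can be checked with a short Python sketch. The function names are hypothetical, and the base-10 logarithm is an assumption inferred from the stated result of 1.52.

```python
import math


def term_frequency(term_count, total_terms):
    """Occurrences of the term divided by the number of words in the document."""
    return term_count / total_terms


def inverse_document_frequency(total_docs, docs_with_term):
    """log10(total documents / documents containing the term).

    Assumption: base 10, which is what reproduces the 1.52 in the example.
    """
    return math.log10(total_docs / docs_with_term)


tf_cat = term_frequency(12, 100)
idf_cat = inverse_document_frequency(10_000_000, 300_000)
print(f"TF(cat)  = {tf_cat:.2f}")   # 0.12
print(f"IDF(cat) = {idf_cat:.2f}")  # 1.52
```

A different logarithm base only rescales every IDF value by the same constant, so it does not change how terms rank against each other.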
Document                  No.   TF-IDF weights
the cat sat on my face    1     0.00000 0.11552 0.00000 0.11552 0.0 0.0 0.0 0.0
the dog sat on my bed     2     0.11552 0.00000 0.11552 0.00000 0.0 0.0 0.0 0.0
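The weights in these two rows can be reproduced with the sketch below. It rests on assumptions that are inferred from the numbers and not stated in the source: TF is the word count divided by the document length (1/6 here), IDF uses the natural logarithm (ln(2/1) for a word found in one of the two documents), and the columns follow the alphabetical vocabulary order bed, cat, dog, face, my, on, sat, the. Note that the earlier IDF example uses base 10 instead. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, rows) where rows[i][j] is the TF-IDF weight of word j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    # Assumption: alphabetical vocabulary, which matches the pattern of the rows above.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    doc_freq = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = []
        for w in vocabulary:
            tf = counts[w] / len(tokens)          # relative frequency in this document
            idf = math.log(n_docs / doc_freq[w])  # assumption: natural logarithm
            row.append(tf * idf)
        rows.append(row)
    return vocabulary, rows


documents = ["the cat sat on my face", "the dog sat on my bed"]
vocabulary, rows = tf_idf(documents)
print(vocabulary)
for row in rows:
    print(" ".join(f"{value:.5f}" for value in row))
```

Words shared by both documents ("the", "sat", "on", "my") get IDF = ln(2/2) = 0 and therefore weight 0, while each word unique to one document gets (1/6) * ln 2, about 0.11552.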