Inverse Document Frequency
Mr. V. M. Vasava
GPG,Surat
IT Dept.
Agenda
• TF (Term Frequency): the ratio of the number of occurrences of a word (w)
in a document (d) to the total number of words in that document.
Term Frequency = (No. of repetitions of the word in the sentence) / (No. of words in the sentence)
OR
Corpus   Text
Doc1     good boy
Doc2     good girl
Doc3     boy girl good
Create the frequency distribution of words
Vocabulary   Frequency of words
good         3
boy          2
girl         2
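The frequency distribution above can be reproduced with a short sketch. It assumes the three-document corpus from the slide, lowercases the text, and splits on whitespace; the names `corpus` and `frequency` are illustrative only.

```python
from collections import Counter

# The corpus from the slide; each document is a single short sentence.
corpus = {
    "Doc1": "good boy",
    "Doc2": "good girl",
    "Doc3": "boy girl good",
}

# Count every occurrence of every word across the whole corpus.
frequency = Counter(
    word
    for text in corpus.values()
    for word in text.lower().split()
)

# Print the vocabulary from most to least frequent.
for word, count in frequency.most_common():
    print(word, count)
```

The counts should match the table: good 3, boy 2, girl 2.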
TF
Word   Doc1   Doc2   Doc3
good   1/2    1/2    1/3
boy    1/2    0      1/3
girl   0      1/2    1/3
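As a minimal sketch, the TF table can be computed as follows. It assumes the same corpus and whitespace tokenization, and uses `Fraction` so the values print as the exact ratios shown on the slide; `term_frequency` is a hypothetical helper name.

```python
from fractions import Fraction

corpus = {
    "Doc1": "good boy",
    "Doc2": "good girl",
    "Doc3": "boy girl good",
}
vocabulary = ["good", "boy", "girl"]


def term_frequency(word, text):
    """Occurrences of `word` divided by the number of words in `text`."""
    words = text.lower().split()
    return Fraction(words.count(word), len(words))


# One row per word, one column per document, as in the TF table.
tf = {
    word: {doc: term_frequency(word, text) for doc, text in corpus.items()}
    for word in vocabulary
}

for word, row in tf.items():
    print(word, [str(value) for value in row.values()])
```

Each printed row lists the values for Doc1, Doc2, and Doc3 in that order.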
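IDF
IDF = Log(No. of sentences / No. of sentences containing the word)
Word   IDF
good   Log(3/3) = 0
boy    Log(3/2)
girl   Log(3/2)
TF-IDF = TF * IDF
       good   boy             girl
Doc1   0      1/2*Log(3/2)    0
Doc2   0      0               1/2*Log(3/2)
Doc3   0      1/3*Log(3/2)    1/3*Log(3/2)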
Implementation of TF-IDF
Example
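One possible implementation is sketched below in plain Python. It is not necessarily the code used in the presentation. It assumes TF is the within-document ratio, IDF is the logarithm of (number of documents / number of documents containing the word), and that the logarithm is natural, since the slides do not state a base. The function name `tf_idf` is illustrative.

```python
import math

corpus = {
    "Doc1": "good boy",
    "Doc2": "good girl",
    "Doc3": "boy girl good",
}


def tf_idf(corpus):
    """Return {doc: {word: score}} with score = TF * log(N / document frequency)."""
    tokens = {doc: text.lower().split() for doc, text in corpus.items()}
    vocabulary = sorted({word for words in tokens.values() for word in words})
    n_docs = len(tokens)

    # IDF: log of (number of documents / number of documents containing the word).
    # Assumption: natural logarithm.
    idf = {
        word: math.log(n_docs / sum(word in words for words in tokens.values()))
        for word in vocabulary
    }

    # TF-IDF: the word's share of the document, scaled by its IDF.
    return {
        doc: {word: words.count(word) / len(words) * idf[word] for word in vocabulary}
        for doc, words in tokens.items()
    }


scores = tf_idf(corpus)
for doc, row in scores.items():
    print(doc, {word: round(value, 3) for word, value in row.items()})
```

Because "good" appears in all three documents, its IDF is log(3/3) = 0 and its score is zero everywhere. The sketch does not handle empty documents, which would cause a division by zero.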
Advantages & Disadvantages
Advantages
• Reflects Word Importance: TF-IDF highlights words that
are important to a specific document in a corpus.
• Reduces Emphasis on Common Words: Commonly
occurring words (e.g., "the," "is," "and") often have high term
frequencies but low importance; their low IDF scales their scores down.
• Handles Variable Document Lengths: TF-IDF accounts for
variations in document lengths by considering the relative
frequency of terms in a document.
• Supports text retrieval systems like Google search, text
classification, and keyword extraction.
Disadvantages
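• Sparsity: each vector has one entry per vocabulary word, so most
entries are zero when the vocabulary is large.
• Out of vocabulary (OOV): words that are not in the vocabulary
cannot be represented.
• Ordering: word order is ignored.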
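The ordering and OOV limitations can be illustrated with a small sketch. It uses raw word counts over the fixed three-word vocabulary rather than TF-IDF weights, which is enough to show the effect; `count_vector` is a hypothetical helper and "clever" is an invented out-of-vocabulary word.

```python
from collections import Counter

vocabulary = ["good", "boy", "girl"]


def count_vector(text):
    """Map a text to per-word counts over the fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


# Ordering: both sentences produce the same vector.
print(count_vector("good boy"), count_vector("boy good"))

# OOV: "clever" is not in the vocabulary, so it leaves no trace in the vector.
print(count_vector("clever boy"), count_vector("boy"))
```

Multiplying the counts by IDF weights would not change either observation, since the weights depend only on which words occur, not on their position.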
Any Questions?