
Inverse Document Frequency

• Term Frequency measures how common a word is in a
document. But is a word that occurs frequently
important?
• For a word to be important to a document, it may also need
to be unique to that document (compared to the other
documents in the collection), i.e. the word is important if it
is a differentiator.
• Two “opposing” objectives: (1) Higher term occurrence,
(2) fewer documents containing the term.
Inverse Document Frequency
• Inverse Document Frequency (IDF) measures the sparseness
of a term, i.e. the uniqueness of a word to a document
compared to the full set of documents.
$$\mathrm{IDF}(t) = \log\left(\frac{\text{Total Number of Documents}}{\text{Number of Documents Containing } t}\right)$$

• IDF(t) is large when term t appears in only a few documents,
and decreases as t appears in more documents, reaching 0
when t appears in every document.
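As a rough illustration of the IDF formula, here is a minimal base R sketch. The toy corpus `docs`, the helper `idf`, and the choice of the natural logarithm are all assumptions made for this example; the slides do not state a log base, and a different base only rescales the values.

```r
# Toy corpus (hypothetical data): each document is a vector of tokenised words.
docs <- list(
  d1 = c("the", "cat", "sat", "on", "the", "mat"),
  d2 = c("the", "dog", "sat"),
  d3 = c("the", "cat", "chased", "the", "dog"),
  d4 = c("the", "bird", "sang")
)

# IDF(t) = log(total number of documents / number of documents containing t).
# Natural log is an assumption; a term found in no document yields Inf.
idf <- function(term, docs) {
  n_docs <- length(docs)
  n_containing <- sum(vapply(docs, function(d) term %in% d, logical(1)))
  log(n_docs / n_containing)
}

idf_the  <- idf("the", docs)   # appears in all 4 documents: log(4/4)
idf_cat  <- idf("cat", docs)   # appears in 2 documents:     log(4/2)
idf_bird <- idf("bird", docs)  # appears in 1 document:      log(4/1)
```

In this toy corpus the word found in every document gets an IDF of zero, while the word found in a single document gets the largest value.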
Term Frequency and Inverse Document Frequency

• A measure of the uniqueness and the relative importance of
term t in document d is given by the Term Frequency and
Inverse Document Frequency (TFIDF):
TFIDF(t, d) = TF(t, d) × IDF(t)
• TFIDF thus assigns a value to every word based on the
frequency and the rarity of the word.
• Words that are common in every document have a low score
even if they appear many times, since they don’t mean much
to that document in particular.
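The product TF × IDF can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy corpus `docs` and the helpers `tf`, `idf`, and `tfidf` are hypothetical, TF is taken to be the raw count of the term in the document, and the natural logarithm is used for IDF.

```r
# Toy corpus (hypothetical data): each document is a vector of tokenised words.
docs <- list(
  d1 = c("the", "cat", "sat", "on", "the", "mat"),
  d2 = c("the", "dog", "sat"),
  d3 = c("the", "cat", "chased", "the", "dog"),
  d4 = c("the", "bird", "sang")
)

# TF(t, d): raw count of term t in document d (an assumed definition of TF).
tf <- function(term, doc) sum(doc == term)

# IDF(t): log of (number of documents / number of documents containing t).
idf <- function(term, docs) {
  n_containing <- sum(vapply(docs, function(d) term %in% d, logical(1)))
  log(length(docs) / n_containing)
}

# TFIDF(t, d) = TF(t, d) * IDF(t)
tfidf <- function(term, doc, docs) tf(term, doc) * idf(term, docs)

score_the <- tfidf("the", docs$d1, docs)  # 2 * log(4/4)
score_cat <- tfidf("cat", docs$d1, docs)  # 1 * log(4/2)
score_mat <- tfidf("mat", docs$d1, docs)  # 1 * log(4/1)
```

Although "the" is the most frequent word in `d1`, it scores zero because it appears in every document, whereas "mat", which appears once and only in `d1`, scores highest.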
• In R (tm package), use the “weightTfIdf” weighting option:
library(tm)  # myDocument is assumed to be a tm corpus
TermDocumentMatrix(myDocument, control = list(weighting =
function(x) weightTfIdf(x, normalize = FALSE)))
