• Term Frequency measures how common a word is in a document. But is a word that occurs frequently necessarily important?
• For a word to be important to a document, it may also need to be unique to that document (compared to the other documents that exist), i.e. the word is important if it is a differentiator.
• Two “opposing” objectives: (1) higher term occurrence within the document, (2) fewer documents containing the term.
Inverse Document Frequency
• Inverse Document Frequency (IDF) measures the sparseness of a term, i.e. the uniqueness of a word to a document compared to the full set of documents.
  IDF(t) = log( Total Number of Documents / Number of Documents Containing t )
• IDF(t) is large when term t appears in only a few documents, and decreases quickly as t appears in more documents, reaching log(1) = 0 when t appears in every document.
Term Frequency and Inverse Document Frequency
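As a quick check of the IDF definition before it is combined with TF, the sketch below computes IDF(t) by hand in base R. The four-document toy corpus and the names `docs`, `doc_freq` and `idf` are assumptions made for illustration, and the natural logarithm is assumed because the formula does not state a base.

```r
# Hypothetical toy corpus: each document is a vector of already-tokenised words.
docs <- list(
  d1 = c("the", "cat", "sat", "on", "the", "mat"),
  d2 = c("the", "dog", "sat"),
  d3 = c("the", "cat", "chased", "the", "dog"),
  d4 = c("the", "zebra", "ran")
)

n_docs <- length(docs)
vocab  <- sort(unique(unlist(docs)))

# Number of documents containing each term (presence only, not how often).
doc_freq <- sapply(vocab, function(term) {
  sum(sapply(docs, function(d) term %in% d))
})

# IDF(t) = log(total documents / documents containing t); natural log assumed.
idf <- log(n_docs / doc_freq)

print(round(idf[c("the", "cat", "zebra")], 3))
```

In this toy corpus "the" occurs in all four documents and gets IDF = log(4/4) = 0, "cat" occurs in two and gets log(2) ≈ 0.693, and "zebra" occurs in one and gets log(4) ≈ 1.386, so the rarest term scores highest.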
• A measure of the uniqueness and the relative importance of term t in document d is given by the Term Frequency and Inverse Document Frequency (TFIDF):
  TFIDF(t, d) = TF(t, d) × IDF(t)
• TFIDF thus assigns a value to every word in a document based on the frequency and the rarity of the word.
• Words that are common to every document have a low score even if they appear many times, since they don’t mean much to that document in particular.
• In R (tm package), use the “weightTfIdf” weighting option:
  TermDocumentMatrix(myDocument, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))
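To make the product TF(t, d) × IDF(t) concrete, here is a minimal base-R sketch that builds the scores by hand, without the tm package. The toy corpus and the names `docs`, `tf`, `idf` and `tfidf` are assumptions for illustration; TF is taken as the raw count and the natural logarithm is assumed. The tm documentation describes the IDF in weightTfIdf with a base-2 logarithm, so its scores would be expected to differ from these by a constant factor.

```r
# Hypothetical toy corpus: each document is a vector of already-tokenised words.
docs <- list(
  d1 = c("the", "cat", "sat", "on", "the", "mat"),
  d2 = c("the", "dog", "sat"),
  d3 = c("the", "cat", "chased", "the", "dog"),
  d4 = c("the", "zebra", "ran")
)
vocab <- sort(unique(unlist(docs)))

# TF(t, d): raw count of term t in document d.
# Result is a matrix with terms in rows and documents in columns.
tf <- sapply(docs, function(d) {
  sapply(vocab, function(term) sum(d == term))
})

# IDF(t) = log(total documents / documents containing t); natural log assumed.
doc_freq <- rowSums(tf > 0)
idf <- log(length(docs) / doc_freq)

# TFIDF(t, d) = TF(t, d) * IDF(t).
# idf has one entry per row of tf, so recycling scales each term's row by its IDF.
tfidf <- tf * idf

print(round(tfidf, 3))
```

Although "the" is the most frequent word in every document, its row is all zeros because its IDF is 0, while "zebra" scores log(4) ≈ 1.386 in d4, the only document that contains it.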