The document describes the Naive Bayes algorithm for document classification. It outlines the training process which involves counting the occurrence of words in labeled documents and calculating word probabilities. It then describes the classification process which calculates a score for a document by summing the log probabilities of its words and compares the score to a threshold to determine the document class.
Train(X, Y) {reads documents X and labels Y}
  Compute dictionary D of X with n words.
  Compute m, m_ham and m_spam.
  Initialize b := log c + log m_ham − log m_spam to offset the rejection threshold.
  Initialize p ∈ R^(2×n) with p_ij = 1, w_spam = n, w_ham = n.
  {Count occurrence of each word. Here x_i^j denotes the number of times word j occurs in document x_i.}
  for i = 1 to m do
    if y_i = spam then
      for j = 1 to n do
        p_{0,j} ← p_{0,j} + x_i^j
        w_spam ← w_spam + x_i^j
      end for
    else
      for j = 1 to n do
        p_{1,j} ← p_{1,j} + x_i^j
        w_ham ← w_ham + x_i^j
      end for
    end if
  end for
  {Normalize counts to yield word probabilities}
  for j = 1 to n do
    p_{0,j} ← p_{0,j} / w_spam
    p_{1,j} ← p_{1,j} / w_ham
  end for

Classify(x) {classifies document x}
  Initialize score t := −b
  for j = 1 to n do
    t ← t + x^j (log p_{0,j} − log p_{1,j})
  end for
  if t > 0 return spam else return ham
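The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a production implementation: the function names, the token-list document representation, and the default offset constant c = 1 are assumptions made for the example. Initializing every count at 1 (and the totals at n) reproduces the Laplace smoothing in the pseudocode, and the classifier sums log-probability ratios only over words that actually occur in the document, which is equivalent to the full sum since absent words contribute x^j = 0.

```python
import math
from collections import Counter

def train(docs, labels, c=1.0):
    """docs: list of token lists; labels: 'spam' or 'ham' per document."""
    vocab = sorted({w for d in docs for w in d})
    n = len(vocab)
    m_spam = sum(1 for y in labels if y == "spam")
    m_ham = len(labels) - m_spam
    # Offset b folds the class priors and rejection threshold c into one term.
    b = math.log(c) + math.log(m_ham) - math.log(m_spam)
    # Laplace smoothing: every word count starts at 1, each total at n.
    counts = {"spam": Counter({w: 1 for w in vocab}),
              "ham": Counter({w: 1 for w in vocab})}
    totals = {"spam": n, "ham": n}
    for doc, y in zip(docs, labels):
        for w in doc:
            counts[y][w] += 1
            totals[y] += 1
    # Normalize counts to yield word probabilities.
    probs = {y: {w: counts[y][w] / totals[y] for w in vocab}
             for y in ("spam", "ham")}
    return vocab, probs, b

def classify(doc, vocab, probs, b):
    """Score a tokenized document and compare against the threshold."""
    t = -b
    for w in doc:
        if w in probs["spam"]:  # words outside the dictionary are skipped
            t += math.log(probs["spam"][w]) - math.log(probs["ham"][w])
    return "spam" if t > 0 else "ham"
```

For instance, training on a toy corpus such as `[["buy","cheap","pills"], ["meeting","at","noon"]]` with labels `["spam","ham"]` and then calling `classify` on a new token list returns the predicted label.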
1.3.2 Nearest Neighbor Estimators
An even simpler estimator than Naive Bayes is the nearest neighbor classifier. In its most basic form it assigns to an observation x the label of its nearest neighbor (see Figure 1.17). Hence, all we need to implement it is a distance measure d(x, x′) between pairs of observations. Note that this distance need not even be symmetric. This means that nearest neighbor classifiers can be extremely