05 Naivebayes
Is this spam?
Dan Jurafsky
Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language Identification
• Sentiment analysis
• …
Hand-coded rules
• Rules based on combinations of words or other features
• spam: black-list-address OR (“dollars” AND “have been selected”)
• Accuracy can be high if the rules are carefully refined by an expert
• But building and maintaining these rules is expensive
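The spam rule above can be sketched as a small Python function. This is only an illustration: the function name, the black-list contents, and the sample messages are hypothetical and not from the slides.

```python
def is_spam(sender, body, blacklist):
    """Hand-coded rule: black-listed address OR ("dollars" AND "have been selected")."""
    text = body.lower()
    on_blacklist = sender.lower() in blacklist
    has_both_phrases = "dollars" in text and "have been selected" in text
    return on_blacklist or has_both_phrases


# Hypothetical black-list and messages, for illustration only.
blacklist = {"promo@example.com"}
print(is_spam("promo@example.com", "Hello", blacklist))
print(is_spam("friend@example.org", "You have been selected to win dollars", blacklist))
print(is_spam("friend@example.org", "Lunch tomorrow?", blacklist))
```

Every new spam pattern needs another hand-written clause, which is where the maintenance cost comes from.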
Classification Methods:
Supervised Machine Learning
• Input:
• a document d
• a fixed set of classes C = {c1, c2,…, cJ}
• A training set of m hand-labeled documents (d1,c1), …, (dm,cm)
• Output:
• a learned classifier γ: d → c
Classification Methods:
Supervised Machine Learning
• Any kind of classifier
• Naïve Bayes
• Logistic regression
• Support-vector machines
• k-Nearest Neighbors
•…
Text Classification and Naïve Bayes
The bag of words representation:
γ(… be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.) = c
γ(xxxx whimsical xxxx romantic xxxx laughing xxxx recommend xxxx several xxxx happy xxxx again xxxx) = c
γ(
  recommend  1
  laugh      1
  happy      1
  ...        ...
) = c
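A minimal sketch of turning text into a bag of words, using the surviving part of the review as input. The tokenizer (lowercasing plus a simple regular expression) and the helper name `bag_of_words` are assumptions made for illustration; the slides do not specify how words are extracted or normalized.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a document to word counts, discarding word order."""
    # Assumed tokenizer: lowercase, keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


review = (
    "I would recommend it to just about anyone. I've seen it several times, "
    "and I'm always happy to see it again whenever I have a friend who "
    "hasn't seen it yet."
)
bow = bag_of_words(review)
print(bow["recommend"], bow["happy"], bow["it"])
```

The classifier then sees only these counts, not the position of each word in the review.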
Test document (class ?): parser, language, label, translation, …
Candidate classes and their words:
• Machine Learning: learning, training, algorithm, shrinkage, network...
• NLP: parser, tag, training, translation, language...
• Garbage Collection: garbage, collection, memory, optimization, region...
• Planning: planning, temporal, reasoning, plan, language...
• GUI: ...
Formalizing the Naïve Bayes Classifier
• MAP is “maximum a posteriori” = most likely class
• Bayes Rule
• Dropping the denominator
• Document d represented as features x1, …, xn
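The labels above name the steps of the standard MAP derivation. As a sketch in the usual textbook notation (the symbols c_MAP and c_NB are conventional and assumed here, not taken from the surviving slide text), the chain reads:

```latex
\begin{align*}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(c \mid d)
        && \text{most likely class} \\
        &= \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)}
        && \text{Bayes rule} \\
        &= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c)
        && \text{dropping the denominator} \\
        &= \operatorname*{argmax}_{c \in C} P(x_1, \ldots, x_n \mid c)\,P(c)
        && \text{document as features}
\end{align*}
```

The denominator P(d) can be dropped because it is the same for every class. Under the naïve Bayes assumption that features are conditionally independent given the class, the last line factorizes further:

```latex
\[
c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{i=1}^{n} P(x_i \mid c)
\]
```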
Naïve Bayes: Learning
Parameter estimation (Sec. 13.3)
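As a sketch, the usual estimates for multinomial naïve Bayes with add-one (Laplace) smoothing are given below. The notation is assumed for illustration: N_c is the number of training documents in class c, N the total number of training documents, count(w, c) the number of occurrences of word w in documents of class c, and V the vocabulary. The smoothed form matches the counts used in the worked example later in the deck, e.g. P(Chinese|c) = (5+1) / (8+6).

```latex
\begin{align*}
\hat{P}(c) &= \frac{N_c}{N} \\
\hat{P}(w \mid c) &= \frac{\operatorname{count}(w, c) + 1}
                         {\sum_{w' \in V} \operatorname{count}(w', c) + |V|}
\end{align*}
```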
Naïve Bayes: Relationship to Language Modeling
c=China
Multinomial Naïve Bayes: A Worked Example
          Doc  Words                                Class
Training  1    Chinese Beijing Chinese              c
          2    Chinese Chinese Shanghai             c
          3    Chinese Macao                        c
          4    Tokyo Japan Chinese                  j
Test      5    Chinese Chinese Chinese Tokyo Japan  ?

Priors:
P(c) = 3/4
P(j) = 1/4

Conditional Probabilities:
P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo|c) = (0+1) / (8+6) = 1/14
P(Japan|c) = (0+1) / (8+6) = 1/14
P(Chinese|j) = (1+1) / (3+6) = 2/9
P(Tokyo|j) = (1+1) / (3+6) = 2/9
P(Japan|j) = (1+1) / (3+6) = 2/9

Choosing a class:
P(c|d5) ∝ 3/4 * (3/7)^3 * 1/14 * 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 * (2/9)^3 * 2/9 * 2/9 ≈ 0.0001
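The arithmetic above can be checked with a short Python sketch of multinomial naïve Bayes with add-one smoothing on the same five documents. The variable and function names are illustrative assumptions, and exact fractions are used only so the results can be compared directly with the hand calculation.

```python
from collections import Counter
from fractions import Fraction

# Training documents and the test document from the worked example.
train = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

vocab = {w for text, _ in train for w in text.split()}
class_docs = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_docs}
for text, label in train:
    word_counts[label].update(text.split())


def prior(label):
    """Fraction of training documents that belong to the class."""
    return Fraction(class_docs[label], len(train))


def likelihood(word, label):
    """Add-one smoothed estimate of P(word | class)."""
    total_tokens = sum(word_counts[label].values())
    return Fraction(word_counts[label][word] + 1, total_tokens + len(vocab))


def score(doc, label):
    """Unnormalized posterior: prior times the product of word likelihoods."""
    result = prior(label)
    for word in doc.split():
        result *= likelihood(word, label)
    return result


scores = {label: score(test_doc, label) for label in class_docs}
best = max(scores, key=scores.get)
for label, value in scores.items():
    print(label, value, float(value))
print("predicted class:", best)
```

On longer documents the product of many small probabilities underflows in floating point, so implementations normally sum log probabilities instead of multiplying.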
A combined measure: F
• A combined measure that assesses the P/R tradeoff is the F measure (the weighted harmonic mean of P and R)
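For reference, the weighted harmonic mean of precision P and recall R is usually written as follows. The weighting symbols α and β are the conventional ones and are assumed here, since they do not appear in the surviving slide text.

```latex
\[
F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}
  = \frac{(\beta^2 + 1)\,P R}{\beta^2 P + R},
\qquad \beta^2 = \frac{1 - \alpha}{\alpha}
\]
\[
F_1 = \frac{2 P R}{P + R} \qquad (\beta = 1,\ \alpha = \tfrac{1}{2})
\]
```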
Text Classification: Evaluation
More Than Two Classes: Sets of binary classifiers (Sec. 14.5)
• Dealing with any-of or multivalue classification
• A document can belong to 0, 1, or >1 classes.
Confusion matrix c
• For each pair of classes <c1,c2> how many documents from c1
were incorrectly assigned to c2?
• c3,2: 90 wheat documents incorrectly assigned to poultry
Docs in test set  Assigned UK  Assigned poultry  Assigned wheat  Assigned coffee  Assigned interest  Assigned trade
True UK           95           1                 13              0                1                  0
True poultry      0            1                 0               0                0                  0
True wheat        10           90                0               1                0                  0
True coffee       0            0                 0               34               3                  7
True interest     -            1                 2               13               26                 5
True trade        0            0                 2               14               5                  10
Precision (Sec. 15.2.4):
Fraction of docs assigned class i that are actually about class i
Accuracy: (1 - error rate)
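A sketch of these measures computed from the confusion matrix in this deck, with rows as true classes and columns as assigned classes. Two assumptions are made for illustration: the "-" cell in the "True interest" row is treated as 0, and recall (fraction of docs in class i that are assigned class i) is added using its standard definition. The function names are hypothetical.

```python
classes = ["UK", "poultry", "wheat", "coffee", "interest", "trade"]

# confusion[i][j] = number of docs with true class i assigned to class j.
# ASSUMPTION: the "-" cell (true interest, assigned UK) is treated as 0.
confusion = [
    [95, 1, 13, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [10, 90, 0, 1, 0, 0],
    [0, 0, 0, 34, 3, 7],
    [0, 1, 2, 13, 26, 5],
    [0, 0, 2, 14, 5, 10],
]


def precision(i):
    """Fraction of docs assigned class i that are actually about class i."""
    assigned = sum(row[i] for row in confusion)
    return confusion[i][i] / assigned if assigned else 0.0


def recall(i):
    """Fraction of docs truly in class i that were assigned class i."""
    actual = sum(confusion[i])
    return confusion[i][i] / actual if actual else 0.0


def accuracy():
    """Fraction of all docs classified correctly (1 - error rate)."""
    correct = sum(confusion[i][i] for i in range(len(classes)))
    total = sum(sum(row) for row in confusion)
    return correct / total


for i, name in enumerate(classes):
    print(f"{name:9s} precision={precision(i):.3f} recall={recall(i):.3f}")
print(f"accuracy={accuracy():.3f}")
```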
Text Classification: Practical Issues (Sec. 15.3.1)
No training data?
Manually written rules
If (wheat or grain) and not (whole or bread) then categorize as grain
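The rule above translates directly into code. A minimal sketch, assuming a simple lowercase word tokenizer; the function name and the sample sentences are hypothetical.

```python
import re


def categorize_as_grain(text):
    """Rule: (wheat or grain) and not (whole or bread)."""
    # Assumed tokenizer: lowercase, keep runs of letters.
    words = set(re.findall(r"[a-z]+", text.lower()))
    mentions_grain = "wheat" in words or "grain" in words
    mentions_food = "whole" in words or "bread" in words
    return mentions_grain and not mentions_food


print(categorize_as_grain("Wheat futures rose sharply"))
print(categorize_as_grain("Whole grain bread recipe"))
```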
Brill and Banko on spelling correction