Chapter 13
Wei Wei
wwei@idi.ntnu.no
Lecture series
TDT4215 Naive Bayes Text Classification
OUTLINES
• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification
• hire some web editors
  for a small quantity of news – possible
  for a large scale of online news – impossible
Spam or Not?
Text Classification
• Manual classification
  originally used by Yahoo!
  very accurate when done by experts
  consistent for small-size problems
  difficult and expensive to scale
• Automatic classification
  Hand-coded rule-based systems
  o complex query languages
  o assign a category if a document contains a given boolean combination of words
  o accuracy is usually very high if a rule has been carefully refined over time by a subject expert
  o building and maintaining these rules is expensive
INTRODUCTION
• Automatic classification
  utilizing machine learning techniques
  o k-Nearest Neighbors (kNN)
  o Naive Bayes (NB)
  o Support Vector Machines (SVM)
  o … some other similar methods
  o requires hand-classified training data
• Note that many commercial systems use a mixture of methods
• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification
An example:
• Document with only a sentence:
“London is planning to organize the 2012 Olympics.”
• We have six classes:
<UK>, <China>, <car>, <coffee>, <elections>, <sports>
• Determined: <UK>
An example:
• Document with only a sentence:
“London is planning to organize the 2012 Olympics.”
• We have six classes:
<UK>, <China>, <car>, <coffee>, <elections>, <sports>
• Determined: <UK> and <sports>
A formal definition:
Given:
• A description of an instance, x ∈ X, where X is the instance language or instance space.
• A fixed set of classes: C = {c1, c2, …, cJ}
Determine:
• The category of x: γ(x) ∈ C, where γ is a classification function γ: X → C
• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification
• There are two different ways to set up an NB classifier:
• Multinomial Naive Bayes
The probability of a document d being in class c is computed as:
P(c|d) ∝ P(c) · ∏_{1 ≤ k ≤ nd} P(tk|c)
P(c): the prior probability of a document occurring in class c
P(tk|c): the conditional probability of term tk occurring in a document of class c
t1, t2, …, tnd: the tokens in document d that are part of the vocabulary used for classification
nd: the number of such tokens in d
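In practice the product of probabilities is evaluated in log space to avoid floating-point underflow. A minimal sketch in Python; the probability tables are the estimates from the worked example later in this chapter, while the test document's token list is an assumption for illustration:

```python
import math

def multinomial_nb_score(tokens, prior, cond_prob):
    """Log-space multinomial NB score: log P(c) + sum_k log P(t_k | c)."""
    return math.log(prior) + sum(math.log(cond_prob[t]) for t in tokens)

# Estimates from the worked example; the document tokens are assumed.
priors = {"China": 3 / 4, "not-China": 1 / 4}
cond = {
    "China": {"Chinese": 3 / 7, "Tokyo": 1 / 14, "Japan": 1 / 14},
    "not-China": {"Chinese": 2 / 9, "Tokyo": 2 / 9, "Japan": 2 / 9},
}
doc = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]

best = max(priors, key=lambda c: multinomial_nb_score(doc, priors[c], cond[c]))
```

Taking the maximum over classes of this log-space score implements the arg max selection described on the next slide.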
• Multinomial Naive Bayes
P(c|d) ∝ P(c) · ∏_{1 ≤ k ≤ nd} P(tk|c)
The best class is the one maximizing this score (the MAP class):
cmap = arg max_{c ∈ C} [ log P̂(c) + Σ_{1 ≤ k ≤ nd} log P̂(tk|c) ]
log P̂(c): the relative frequency of c
log P̂(tk|c): how good an indicator tk is for c
With add-one smoothing:
P̂(t|c) = (Tct + 1) / Σ_{t' ∈ V} (Tct' + 1) = (Tct + 1) / ((Σ_{t' ∈ V} Tct') + B)
where B = |V| is the number of terms in the vocabulary.
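The smoothed estimate can be written as a small function over raw term counts. A sketch, assuming a hypothetical count table consistent with the worked example that follows (class c has 8 tokens, 5 of them "Chinese", over a 6-term vocabulary):

```python
from fractions import Fraction

def smoothed_cond_prob(term, cls, term_counts, vocab_size):
    """Add-one smoothed estimate: (T_ct + 1) / (sum_t' T_ct' + B), B = |V|."""
    t_ct = term_counts[cls].get(term, 0)          # occurrences of term in class
    total = sum(term_counts[cls].values())        # all token occurrences in class
    return Fraction(t_ct + 1, total + vocab_size)

# Hypothetical per-class token counts (assumed, not stated on this slide).
counts = {"c": {"Chinese": 5, "Beijing": 1, "Shanghai": 1, "Macao": 1}}
p_chinese = smoothed_cond_prob("Chinese", "c", counts, vocab_size=6)  # (5+1)/(8+6)
p_tokyo = smoothed_cond_prob("Tokyo", "c", counts, vocab_size=6)      # (0+1)/(8+6)
```

Using exact fractions makes it easy to compare the results against the hand-computed values on the next slide.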
NAIVE BAYES TEXT CLASSIFICATION
• Question:
Decide whether document d5 belongs to class c = China.
• Solution:
Training: P̂(c) = 3/4, P̂(c̄) = 1/4
P̂(Chinese|c) = (5+1)/(8+6) = 6/14 = 3/7
P̂(Tokyo|c) = P̂(Japan|c) = (0+1)/(8+6) = 1/14
P̂(Chinese|c̄) = (1+1)/(3+6) = 2/9
P̂(Tokyo|c̄) = P̂(Japan|c̄) = (1+1)/(3+6) = 2/9
Testing: c = China
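These estimates can be checked mechanically with exact fractions. Assuming d5's in-vocabulary tokens are Chinese, Chinese, Chinese, Tokyo, Japan (an assumption, not stated on this slide), the score for c comes out larger, matching the classification above:

```python
from fractions import Fraction as F

# Smoothed estimates as given on the slide.
p_c, p_cbar = F(3, 4), F(1, 4)
cond_c = {"Chinese": F(3, 7), "Tokyo": F(1, 14), "Japan": F(1, 14)}
cond_cbar = {"Chinese": F(2, 9), "Tokyo": F(2, 9), "Japan": F(2, 9)}
d5 = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]  # assumed tokens

score_c, score_cbar = p_c, p_cbar
for t in d5:
    score_c *= cond_c[t]       # multiply in P^(t|c) once per token occurrence
    score_cbar *= cond_cbar[t]
# score_c > score_cbar, so the classifier picks c = China
```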
• There are two different ways to set up an NB classifier:
• Bernoulli model
differences from the multinomial NB model:
  training for the prior probability is the same
  nonoccurring terms still affect the computation
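The difference can be made concrete in code: a Bernoulli score multiplies in a factor for every vocabulary term, including the absent ones. A minimal sketch with a made-up two-term vocabulary and illustrative probabilities:

```python
def bernoulli_nb_score(doc_terms, prior, cond_prob, vocab):
    """P(c) times, for EVERY vocabulary term, P(t|c) if t occurs in the
    document and (1 - P(t|c)) if it does not. Nonoccurring terms still
    affect the score, unlike in the multinomial model."""
    score = prior
    for t in vocab:
        score *= cond_prob[t] if t in doc_terms else 1 - cond_prob[t]
    return score

# Illustrative (made-up) numbers.
vocab = ["Chinese", "Tokyo"]
cond = {"Chinese": 0.8, "Tokyo": 0.2}
s = bernoulli_nb_score({"Chinese"}, 0.75, cond, vocab)  # 0.75 * 0.8 * (1 - 0.2)
```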
Decide whether document d5 belongs to class c = China.
Training: P̂(c) = 3/4, P̂(c̄) = 1/4
P̂(Chinese|c) = (3+1)/(3+2) = 4/5
P̂(Beijing|c) = P̂(Macao|c) = (1+1)/(3+2) = 2/5
P̂(Tokyo|c) = P̂(Japan|c) = (0+1)/(3+2) = 1/5
P̂(Chinese|c̄) = (1+1)/(1+2) = 2/3
P̂(Tokyo|c̄) = P̂(Japan|c̄) = (1+1)/(1+2) = 2/3
P̂(Beijing|c̄) = P̂(Macao|c̄) = P̂(Shanghai|c̄) = (0+1)/(1+2) = 1/3
Testing:
P̂(c|d5) ≈ 0.005
P̂(c̄|d5) ≈ 0.022, so d5 is classified as not-China
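Assuming the vocabulary {Chinese, Beijing, Shanghai, Macao, Tokyo, Japan}, P̂(Shanghai|c) = 2/5 by the same computation as Beijing and Macao, and d5 containing Chinese, Tokyo, and Japan (assumptions not stated on this slide), exact arithmetic reproduces the quoted scores:

```python
from fractions import Fraction as F

vocab = ["Chinese", "Beijing", "Shanghai", "Macao", "Tokyo", "Japan"]
cond_c = {"Chinese": F(4, 5), "Beijing": F(2, 5), "Shanghai": F(2, 5),
          "Macao": F(2, 5), "Tokyo": F(1, 5), "Japan": F(1, 5)}
cond_cbar = {"Chinese": F(2, 3), "Beijing": F(1, 3), "Shanghai": F(1, 3),
             "Macao": F(1, 3), "Tokyo": F(2, 3), "Japan": F(2, 3)}
d5 = {"Chinese", "Tokyo", "Japan"}  # assumed content of the test document

def bernoulli_score(prior, cond):
    score = prior
    for t in vocab:  # every vocabulary term contributes one factor
        score *= cond[t] if t in d5 else 1 - cond[t]
    return score

s_c = bernoulli_score(F(3, 4), cond_c)        # ≈ 0.005
s_cbar = bernoulli_score(F(1, 4), cond_cbar)  # ≈ 0.022, so not-China
```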
OUTLINES
• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification
• Recall Bayes’ rule:
P(AB) = P(A)·P(B|A) = P(B)·P(A|B)
P(B|A) = P(B)·P(A|B) / P(A)
• Conditional Independence Assumption
Multinomial:
P(d|c) = P(t1, …, tk, …, tnd | c) = ∏_{1 ≤ k ≤ nd} P(Xk = tk | c)
P(Xk = tk | c): the probability that in a document of class c the term tk will occur in position k
Bernoulli:
P(d|c) = P(e1, …, ek, …, eM | c) = ∏_{1 ≤ i ≤ M} P(Ui = ei | c)
• Multinomial: the probability that in a document of class c the term t will occur in position k is the same for all positions (positional independence):
P(Xk1 = t | c) = P(Xk2 = t | c)
• Equivalent to the bag of words model
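The equivalence is easy to demonstrate: under positional independence, two documents with the same words in a different order are indistinguishable to the model. A small sketch:

```python
from collections import Counter

# Word order carries no information; a document reduces to its term
# frequencies, i.e. its bag of words.
doc_a = "to be or not to be".split()
doc_b = "be be not or to to".split()  # same words, different order

same_bag = Counter(doc_a) == Counter(doc_b)  # True: identical to the model
```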
• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification
• Frequency-based feature selection
  selecting terms that are most common in the class
U is a random variable:
  o et = 1: the document contains t
  o et = 0: the document does not contain t
C is a random variable:
  o ec = 1: the document is in class c
  o ec = 0: the document is not in class c
N11 – number of documents that contain t and are in c
N10 – number of documents that contain t but are NOT in c
N01 – number of documents that do NOT contain t but are in c
N00 – number of documents that do NOT contain t and are NOT in c
N1. = N10 + N11 – number of documents that contain t
N.1 = N01 + N11 – number of documents in c
N0. = N01 + N00 – number of documents that do NOT contain t
N.0 = N10 + N00 – number of documents NOT in c
N = N00 + N01 + N10 + N11 – total number of documents
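The four cell counts can be tallied in a single pass over the labelled documents. A sketch with a hypothetical three-document corpus (the documents and labels are made up for illustration):

```python
def contingency_counts(labeled_docs, term, cls):
    """Tally N00, N01, N10, N11 for term t and class c in one pass.
    Key (et, ec): et = 1 iff the document contains t, ec = 1 iff it is in c."""
    n = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for tokens, label in labeled_docs:
        n[(int(term in tokens), int(label == cls))] += 1
    return n

# Hypothetical labelled corpus.
docs = [({"export", "poultry"}, "poultry"),
        ({"export", "oil"}, "energy"),
        ({"chicken"}, "poultry")]
n = contingency_counts(docs, "export", "poultry")
# n[(1, 1)] = 1, n[(1, 0)] = 1, n[(0, 1)] = 1, n[(0, 0)] = 0
```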
FEATURE SELECTION
• An Example
In Reuters-RCV1, c = poultry, t = export
• The χ² feature selection
In statistics, the χ² test is applied to test the independence of two events.
Events A and B are defined to be independent if
P(AB) = P(A)·P(B), or
P(A|B) = P(A) and P(B|A) = P(B)
In feature selection, the two events are the occurrence of the term and the occurrence of the class.
FEATURE SELECTION
• The 2
feature selection
• N et ec has
h the
th same meaning
i as in
i Mutual
M t l Information
I f ti
feature selection.
• Ee e is the expected frequency of t and c occurring
together in a document assuming that term and class
t c
are independent.
• The 2
feature selection
• N et(
ec N 00 , N 01 , N10 , N11)can be
b counted
t dffrom th
the
training data set as in Mutual Information feature
selection
selection.
• Eet(ec E00 , E01 , E10 , E11)can also be computed from the
training data set.
compute
t E11 :
C
Compute
t other
th Ee e in
i the
t c
th same way:
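Putting the pieces together, the statistic sums (N − E)²/E over the four cells of the contingency table, with each expected count computed under independence. A sketch, assuming the standard χ² definition:

```python
def chi_square(n00, n01, n10, n11):
    """X^2 = sum over the four cells of (N_cell - E_cell)^2 / E_cell,
    where E_{et,ec} = N * P(et) * P(ec) under independence."""
    n = n00 + n01 + n10 + n11
    chi2 = 0.0
    for et, ec, observed in ((0, 0, n00), (0, 1, n01), (1, 0, n10), (1, 1, n11)):
        p_t = (n10 + n11) / n if et else (n00 + n01) / n  # marginal of the term
        p_c = (n01 + n11) / n if ec else (n00 + n10) / n  # marginal of the class
        expected = n * p_t * p_c
        chi2 += (observed - expected) ** 2 / expected
    return chi2

independent = chi_square(1, 1, 1, 1)  # term and class unrelated -> 0.0
dependent = chi_square(5, 0, 0, 5)    # term occurs exactly when class does
```

A high χ² value indicates that term and class are dependent, which is what makes the term a useful feature for that class.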
• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification