
NAIVE BAYES TEXT CLASSIFICATION

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze,
Introduction to Information Retrieval, Cambridge University Press, 2008.

Chapter 13

Wei Wei
wwei@idi.ntnu.no

Lecture series TDT4215
OUTLINE

• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification




INTRODUCTION

Motivation for Text Classification





INTRODUCTION

How could they do this?

• hire some web editors
– for a small quantity of news: possible
– for a large scale of online news: impossible
• the only way: automatic classification by machines




INTRODUCTION

Spam or Not?

Deciding this is itself a Text Classification task.


INTRODUCTION

Methods for Text Classification



INTRODUCTION

• Manual classification
– originally used by Yahoo!
– very accurate when done by experts
– consistent for small-size problems
– difficult and expensive to scale


INTRODUCTION

• Automatic classification
– Hand-coded rule-based systems
o complex query languages
o assign a category if a document contains a given Boolean combination of words
o accuracy is usually very high if a rule has been carefully refined over time by a subject expert
o building and maintaining these rules is expensive

INTRODUCTION

• Automatic classification
– utilizing machine learning techniques
o k-Nearest Neighbors (kNN)
o Naive Bayes (NB)
o Support Vector Machines (SVM)
o … and other similar methods
o requires hand-classified training data
• Note: many commercial systems use a mixture of methods


OUTLINE

• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification


THE TEXT CLASSIFICATION PROBLEM

Text Classification, also known as Text Categorization:

• Given a set of classes, we seek to determine which class(es) a given document belongs to.

An example:
• Document with only one sentence:
“London is planning to organize the 2012 Olympics.”
• We have six classes:
<UK>, <China>, <car>, <coffee>, <elections>, <sports>
• Determined: <UK>


THE TEXT CLASSIFICATION PROBLEM

An example:
• Document with only one sentence:
“London is planning to organize the 2012 Olympics.”
• We have six classes:
<UK>, <China>, <car>, <coffee>, <elections>, <sports>
• Determined: <UK> and <sports>

For some documents:
• there is more than one class the document belongs to
– referred to as the any-of classification problem
However,
• we only consider the one-of classification problem
– a document is a member of exactly one class


THE TEXT CLASSIFICATION PROBLEM

A formal definition:
– Given:
• A description of an instance, $x \in X$, where $X$ is the instance language or instance space.
• A fixed set of classes: $C = \{c_1, c_2, \ldots, c_J\}$
– Determine:
• The category of $x$: $\gamma(x) \in C$, where $\gamma: X \to C$ is a classification function.
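As a small illustration, the classification function $\gamma$ can be written down as a type. This is only a sketch; the type aliases are illustrative, not part of the original slides:

```python
from typing import Callable, Sequence

# A document is modeled as a sequence of tokens; a class is a label.
Document = Sequence[str]
Class = str

# The classification function gamma maps instances X to classes C.
Gamma = Callable[[Document], Class]
```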


OUTLINE

• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification


NAIVE BAYES TEXT CLASSIFICATION

• There are two different ways to set up an NB classifier:

– multinomial Naive Bayes (multinomial NB model)

– multivariate Bernoulli model (Bernoulli model)


NAIVE BAYES TEXT CLASSIFICATION

• Multinomial Naive Bayes
The probability of a document d being in class c is computed as:

$$P(c|d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k|c)$$

where $P(c)$ is the prior probability and the $P(t_k|c)$ are conditional probabilities:
– $P(t_k|c)$: conditional probability of term $t_k$ occurring in a document of class c
– $P(c)$: prior probability of a document occurring in class c
– $\langle t_1, t_2, \ldots, t_{n_d} \rangle$: tokens in document d that are part of the vocabulary used for classification
– $n_d$: number of such tokens in d


NAIVE BAYES TEXT CLASSIFICATION

• Multinomial Naive Bayes
The probability of a document d being in class c is computed as:

$$P(c|d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k|c)$$

• How do we decide the best class in NB classification? We take the class with the maximum a posteriori probability:

$$c_{map} = \arg\max_{c \in C} \hat{P}(c|d) = \arg\max_{c \in C} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k|c)$$

Note: we do not know the true values of the parameters, but estimate them from the training data.


NAIVE BAYES TEXT CLASSIFICATION

• How do we decide the best class in NB classification?

$$c_{map} = \arg\max_{c \in C} \hat{P}(c|d) = \arg\max_{c \in C} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k|c)$$

• Multiplying many conditional probabilities will result in floating point underflow.
• Since log(xy) = log(x) + log(y), it is better to add logarithms instead:

$$c_{map} = \arg\max_{c \in C} \Big[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k|c) \Big]$$

Here $\log \hat{P}(c)$ reflects the relative frequency of c, and $\log \hat{P}(t_k|c)$ measures how good an indicator $t_k$ is for c.
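A quick sketch of why the log transform is needed; the constants are arbitrary, chosen only to trigger underflow:

```python
import math

# Multiplying 2000 small probabilities underflows double precision to 0.0,
# while the equivalent sum of logarithms remains perfectly representable.
p = 0.01
product = 1.0
for _ in range(2000):
    product *= p

print(product)              # 0.0 (underflow)
print(2000 * math.log(p))   # -9210.34... (no underflow)
```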


NAIVE BAYES TEXT CLASSIFICATION

$$c_{map} = \arg\max_{c \in C} \Big[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k|c) \Big]$$

How do we estimate the parameters?

Maximum Likelihood Estimation (MLE) for the parameters


NAIVE BAYES TEXT CLASSIFICATION

• What is Maximum Likelihood Estimation (MLE)?

– the relative frequency, which corresponds to the most likely value of each parameter given the training data.
• How?
– for the priors:
$$\hat{P}(c) = \frac{N_c}{N}$$
where $N_c$ is the number of documents in class c and N is the total number of documents.
– for the conditional probability:
$$\hat{P}(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$$
where $T_{ct}$ is the number of occurrences of term t in training documents from class c.


NAIVE BAYES TEXT CLASSIFICATION

• A problem with MLE

– what if a term did not occur in the training data?

$$\hat{P}(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}} = 0$$

– to compute $\log \hat{P}(t|c)$, we need $\hat{P}(t|c) \neq 0$

• Solution: add-one (Laplace) smoothing

$$\hat{P}(t|c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\big(\sum_{t' \in V} T_{ct'}\big) + B}$$

where $B = |V|$ is the number of terms in the vocabulary.

NAIVE BAYES TEXT CLASSIFICATION

• Naive Bayes algorithm:

– Training (see the sketch below)
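The original slide showed the training pseudocode as a figure. Below is a minimal Python sketch of multinomial NB training, using the MLE estimates and add-one smoothing from the previous slides; the function and variable names are illustrative assumptions, not from the slides:

```python
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """Train a multinomial NB classifier with add-one smoothing.

    docs -- list of (tokens, class_label) pairs.
    Returns (priors, cond_prob, vocab).
    """
    n = len(docs)                          # N: total number of documents
    doc_count = Counter()                  # N_c: documents per class
    term_count = defaultdict(Counter)      # T_ct: term counts per class
    vocab = set()
    for tokens, c in docs:
        doc_count[c] += 1
        for t in tokens:
            term_count[c][t] += 1
            vocab.add(t)
    b = len(vocab)                         # B = |V|
    priors, cond_prob = {}, {}
    for c in doc_count:
        priors[c] = doc_count[c] / n       # P(c) = N_c / N
        total = sum(term_count[c].values())
        cond_prob[c] = {t: (term_count[c][t] + 1) / (total + b)  # Laplace
                        for t in vocab}
    return priors, cond_prob, vocab
```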


NAIVE BAYES TEXT CLASSIFICATION

• Naive Bayes algorithm:

– Testing (see the sketch below)
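Again as a hedged sketch with illustrative names: applying the trained model scores each class in log space, as derived earlier, and returns the argmax:

```python
import math

def apply_multinomial_nb(priors, cond_prob, vocab, tokens):
    """Return the class maximizing log P(c) + sum_k log P(t_k|c)."""
    best_class, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c])
        for t in tokens:
            if t in vocab:                 # out-of-vocabulary tokens are skipped
                score += math.log(cond_prob[c][t])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```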


NAIVE BAYES TEXT CLASSIFICATION

• Question:

Decide: does document d5 belong to class c = China?

Training set:
– d1: “Chinese Beijing Chinese”, in c
– d2: “Chinese Chinese Shanghai”, in c
– d3: “Chinese Macao”, in c
– d4: “Tokyo Japan Chinese”, not in c
Test set:
– d5: “Chinese Chinese Chinese Tokyo Japan”, class = ?


NAIVE BAYES TEXT CLASSIFICATION

• Solution:

– Training:
$$\hat{P}(c) = 3/4, \quad \hat{P}(\bar{c}) = 1/4$$
$$\hat{P}(\text{Chinese}|c) = (5+1)/(8+6) = 6/14 = 3/7$$
$$\hat{P}(\text{Tokyo}|c) = \hat{P}(\text{Japan}|c) = (0+1)/(8+6) = 1/14$$
$$\hat{P}(\text{Chinese}|\bar{c}) = (1+1)/(3+6) = 2/9$$
$$\hat{P}(\text{Tokyo}|\bar{c}) = \hat{P}(\text{Japan}|\bar{c}) = (1+1)/(3+6) = 2/9$$
– Testing:
$$\hat{P}(c|d_5) \propto 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003$$
$$\hat{P}(\bar{c}|d_5) \propto 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001$$
⇒ c = China
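Running the sketch functions from the algorithm slides on this example reproduces the decision (the function names are the illustrative ones introduced above):

```python
docs = [("Chinese Beijing Chinese".split(), "China"),
        ("Chinese Chinese Shanghai".split(), "China"),
        ("Chinese Macao".split(), "China"),
        ("Tokyo Japan Chinese".split(), "not-China")]

priors, cond_prob, vocab = train_multinomial_nb(docs)
d5 = "Chinese Chinese Chinese Tokyo Japan".split()
print(apply_multinomial_nb(priors, cond_prob, vocab, d5))  # China
```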


NAIVE BAYES TEXT CLASSIFICATION

• There are two different ways to set up an NB classifier:

– multinomial Naive Bayes (multinomial NB model)

– multivariate Bernoulli model (Bernoulli model)


NAIVE BAYES TEXT CLASSIFICATION

• Bernoulli model
– differs from the multinomial NB model in:

different estimation strategies

different classification rules


NAIVE BAYES TEXT CLASSIFICATION

Training of the prior probability is the same in both models. The conditional probability differs:

– multinomial model: $\hat{P}(t|c)$ is the fraction of tokens in documents of class c that are t
– Bernoulli model: $\hat{P}(t|c)$ is the fraction of documents of class c that contain t


NAIVE BAYES TEXT CLASSIFICATION

Classification also differs:

– multinomial model: only terms that appear in the document are considered
– Bernoulli model: non-occurring terms still affect the computation


NAIVE BAYES TEXT CLASSIFICATION

• Question with the Bernoulli model:

Decide: does document d5 belong to class c = China?
(same training and test sets as in the multinomial example)


NAIVE BAYES TEXT CLASSIFICATION

• Solution with the Bernoulli model:

– Training:
$$\hat{P}(c) = 3/4, \quad \hat{P}(\bar{c}) = 1/4$$
$$\hat{P}(\text{Chinese}|c) = (3+1)/(3+2) = 4/5$$
$$\hat{P}(\text{Beijing}|c) = \hat{P}(\text{Macao}|c) = (1+1)/(3+2) = 2/5$$
$$\hat{P}(\text{Tokyo}|c) = \hat{P}(\text{Japan}|c) = (0+1)/(3+2) = 1/5$$
$$\hat{P}(\text{Chinese}|\bar{c}) = (1+1)/(1+2) = 2/3$$
$$\hat{P}(\text{Tokyo}|\bar{c}) = \hat{P}(\text{Japan}|\bar{c}) = (1+1)/(1+2) = 2/3$$
$$\hat{P}(\text{Beijing}|\bar{c}) = \hat{P}(\text{Macao}|\bar{c}) = \hat{P}(\text{Shanghai}|\bar{c}) = (0+1)/(1+2) = 1/3$$
– Testing:
$$\hat{P}(c|d_5) \approx 0.005$$
$$\hat{P}(\bar{c}|d_5) \approx 0.022$$
⇒ not-China
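A minimal sketch of the Bernoulli classification rule, assuming cond_prob[c][t] now holds the smoothed document fractions estimated above (names are illustrative). Unlike the multinomial rule, it iterates over the entire vocabulary, so absent terms contribute the factor 1 − P(t|c):

```python
import math

def apply_bernoulli_nb(priors, cond_prob, vocab, tokens):
    """Bernoulli NB: every vocabulary term votes, present or absent."""
    present = set(tokens) & vocab
    best_class, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c])
        for t in vocab:
            p = cond_prob[c][t]            # fraction of class-c docs containing t
            score += math.log(p) if t in present else math.log(1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

On the example above this yields not-China, matching the hand computation.
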
OUTLINE

• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification


PROPERTIES OF NAIVE BAYES

• Recall Bayes' rule:

$$P(AB) = P(A)\,P(B|A) = P(B)\,P(A|B)$$

$$P(B|A) = \frac{P(B)\,P(A|B)}{P(A)}$$


PROPERTIES OF NAIVE BAYES

• With Bayes' rule, for a document d and a class c:

$$P(c|d) = \frac{P(d|c)\,P(c)}{P(d)}$$

$$c_{map} = \arg\max_{c \in C} P(c|d) = \arg\max_{c \in C} \frac{P(d|c)\,P(c)}{P(d)} = \arg\max_{c \in C} P(d|c)\,P(c)$$

P(d) is the same for every class, so it does not affect the result and can be dropped.

PROPERTIES OF NAIVE BAYES

$$c_{map} = \arg\max_{c \in C} P(d|c)\,P(c)$$

Computing the conditional probabilities $P(d|c)$ directly has high time complexity.

• How to compute P(d|c):
– Multinomial: $P(d|c) = P(\langle t_1, \ldots, t_k, \ldots, t_{n_d} \rangle \,|\, c)$
where $\langle t_1, \ldots, t_{n_d} \rangle$ is the sequence of terms as it occurs in d
– Bernoulli: $P(d|c) = P(\langle e_1, \ldots, e_k, \ldots, e_M \rangle \,|\, c)$
where $\langle e_1, \ldots, e_M \rangle$ is a binary vector of dimensionality M that indicates, for each term, whether it occurs in d or not


PROPERTIES OF NAIVE BAYES

• Conditional Independence Assumption

– Multinomial:
$$P(d|c) = P(\langle t_1, \ldots, t_{n_d} \rangle \,|\, c) = \prod_{1 \le k \le n_d} P(X_k = t_k \,|\, c)$$
where $P(X_k = t_k | c)$ is the probability that, in a document of class c, the term $t_k$ occurs in position k.

– Bernoulli:
$$P(d|c) = P(\langle e_1, \ldots, e_M \rangle \,|\, c) = \prod_{1 \le i \le M} P(U_i = e_i \,|\, c)$$
where $P(U_i = e_i | c)$ is the probability that, in a document of class c, the term $t_i$ occurs (if $e_i = 1$) or does not occur (if $e_i = 0$).


PROPERTIES OF NAIVE BAYES

• Multinomial:

$$P(d|c) = P(\langle t_1, \ldots, t_{n_d} \rangle \,|\, c) = \prod_{1 \le k \le n_d} P(X_k = t_k \,|\, c)$$

where $P(X_k = t_k | c)$ is the probability that, in a document of class c, the term $t_k$ occurs in position k.

• The time complexity is still high if we have to consider the position at which each term t occurs.
• Positional Independence Assumption:

$$P(X_{k_1} = t \,|\, c) = P(X_{k_2} = t \,|\, c)$$

• This is equivalent to the bag-of-words model, as the sketch below illustrates.
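A two-line illustration of the bag-of-words consequence (the example tokens are made up): once positions are ignored, documents with the same term counts are indistinguishable to the classifier:

```python
from collections import Counter

# Same bag of words, different word order: identical NB scores.
d_a = "London plans Olympics London".split()
d_b = "Olympics London London plans".split()
assert Counter(d_a) == Counter(d_b)
```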


OUTLINE

• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification


FEATURE SELECTION

• Feature Selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
• Feature Selection: Why?
– Text collections have a large number of features
o 10,000 – 1,000,000 unique words … and more
– May make using a particular classifier feasible
o some classifiers can't deal with hundreds of thousands of features
– Reduces training time
o training time for some methods is quadratic or worse in the number of features
– Can improve generalization
o eliminates noise features and avoids overfitting

FEATURE SELECTION

• Feature Selection: How?
• A(t, c) – utility measures:
– frequency
– mutual information
– the $\chi^2$ test

FEATURE SELECTION

• Frequency-based feature selection
– selecting the terms that are most common in the class
– simple and easy to implement (see the sketch below)
– may select frequent terms that carry no class-specific information (such as Monday, Tuesday, …)
– however, if many thousands of features are selected, it usually does well
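A minimal sketch of frequency-based selection; the function name and data layout are illustrative assumptions:

```python
from collections import Counter

def select_by_frequency(docs, cls, k):
    """Return the k terms occurring most often in training docs of class cls.

    docs -- list of (tokens, class_label) pairs
    """
    counts = Counter()
    for tokens, c in docs:
        if c == cls:
            counts.update(tokens)
    return [t for t, _ in counts.most_common(k)]
```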


FEATURE SELECTION

• Mutual Information feature selection: $A(t, c) = I(U; C)$

– U is a random variable
o $e_t = 1$: the document contains t
o $e_t = 0$: the document does not contain t
– C is a random variable
o $e_c = 1$: the document is in class c
o $e_c = 0$: the document is not in class c


FEATURE SELECTION

• With Maximum Likelihood Estimation:

$$I(U; C) = \frac{N_{11}}{N}\log_2\frac{N\,N_{11}}{N_{1.}N_{.1}} + \frac{N_{01}}{N}\log_2\frac{N\,N_{01}}{N_{0.}N_{.1}} + \frac{N_{10}}{N}\log_2\frac{N\,N_{10}}{N_{1.}N_{.0}} + \frac{N_{00}}{N}\log_2\frac{N\,N_{00}}{N_{0.}N_{.0}}$$

– $N_{11}$: number of documents that contain t and are in c
– $N_{10}$: number of documents that contain t but are not in c
– $N_{01}$: number of documents that do not contain t but are in c
– $N_{00}$: number of documents that neither contain t nor are in c
– $N_{1.} = N_{10} + N_{11}$: number of documents that contain t
– $N_{.1} = N_{01} + N_{11}$: number of documents in c
– $N_{0.} = N_{01} + N_{00}$: number of documents that do not contain t
– $N_{.0} = N_{10} + N_{00}$: number of documents not in c
– $N = N_{00} + N_{01} + N_{10} + N_{11}$: total number of documents
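A small sketch of the same computation in code (the function name is illustrative):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(U;C) from the four document counts defined above.

    A zero count contributes 0 (the limit of x*log x as x -> 0).
    """
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00    # contain t / do not contain t
    n_1, n_0 = n11 + n01, n10 + n00    # in c / not in c

    def term(nij, row, col):
        return 0.0 if nij == 0 else (nij / n) * math.log2(n * nij / (row * col))

    return (term(n11, n1_, n_1) + term(n01, n0_, n_1)
            + term(n10, n1_, n_0) + term(n00, n0_, n_0))
```
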
FEATURE SELECTION

• An Example
In Reuters-RCV1, c = poultry, t = export



FEATURE SELECTION

• The figure shows terms with high mutual information scores for the six classes in Reuters-RCV1.


FEATURE SELECTION

• The $\chi^2$ feature selection
– In statistics, the $\chi^2$ test is applied to test the independence of two events.
– Events A and B are defined to be independent if
$$P(AB) = P(A)\,P(B)$$
or, equivalently, if $P(A|B) = P(A)$ and $P(B|A) = P(B)$.
– In feature selection, the two events are the occurrence of the term and the occurrence of the class.

FEATURE SELECTION

• The $\chi^2$ feature selection

$$X^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}$$

• $N_{e_t e_c}$ has the same meaning as in Mutual Information feature selection.
• $E_{e_t e_c}$ is the expected frequency of t and c occurring together in a document, assuming that term and class are independent.


FEATURE SELECTION

• The $\chi^2$ feature selection
• $N_{e_t e_c}$ ($N_{00}, N_{01}, N_{10}, N_{11}$) can be counted from the training data set, as in Mutual Information feature selection.
• $E_{e_t e_c}$ ($E_{00}, E_{01}, E_{10}, E_{11}$) can also be computed from the training data set.


FEATURE SELECTION

• The example again:

Compute $E_{11}$, the expected number of documents that contain t and are in c if term and class were independent:

$$E_{11} = N \cdot P(t) \cdot P(c) = N \cdot \frac{N_{11} + N_{10}}{N} \cdot \frac{N_{11} + N_{01}}{N}$$

Compute the other $E_{e_t e_c}$ in the same way.

The higher the $X^2$ value, the more dependence between term t and class c.
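A small sketch of the whole computation from the four observed counts (the function name is illustrative):

```python
def chi_square(n11, n10, n01, n00):
    """X^2(D, t, c) from observed counts; expected counts assume independence."""
    n = n11 + n10 + n01 + n00
    x2 = 0.0
    for observed, row, col in [
        (n11, n11 + n10, n11 + n01),   # t present, in c
        (n10, n11 + n10, n10 + n00),   # t present, not in c
        (n01, n01 + n00, n11 + n01),   # t absent, in c
        (n00, n01 + n00, n10 + n00),   # t absent, not in c
    ]:
        expected = row * col / n       # E = N * P(e_t) * P(e_c)
        x2 += (observed - expected) ** 2 / expected
    return x2
```
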
OUTLINE

• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification


EVALUATION OF TEXT CLASSIFICATION

• Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances)
• Classification accuracy: c/n
– n is the total number of test instances
– c is the number of test instances correctly classified
• The accuracy measure is appropriate only if the percentage of documents in the class is high
– for a class with relative frequency 1%, the “always no” classifier will achieve 99% accuracy
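A tiny worked illustration of that last point, with made-up labels:

```python
# A class with 1% relative frequency: the trivial "always no" classifier
# is 99% accurate, which is why raw accuracy misleads on skewed classes.
labels = ["yes"] * 1 + ["no"] * 99
predictions = ["no"] * 100
accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.99
```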


SUMMARY

• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification
