
NAIVE BAYES TEXT CLASSIFICATION

Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze,
Introduction to Information Retrieval, Cambridge University Press, 2008.

Chapter 13

Wei Wei
wwei@idi.ntnu.no

Lecture series TDT4215
OUTLINE

• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification




INTRODUCTION

Motivation for Text Classification





INTRODUCTION

How could they do this?

• hire some web editors
– for a small quantity of news: possible
– for a large scale of online news: impossible
• the only way: automatic classification by machines




INTRODUCTION

Spam or Not?

Deciding this is itself a Text Classification task.


INTRODUCTION

Methods for Text Classification



INTRODUCTION

• Manual classification
– originally used by Yahoo!
– very accurate when done by experts
– consistent for small-size problems
– difficult and expensive to scale


INTRODUCTION

• Automatic classification
– Hand-coded rule-based systems
o complex query languages
o assign a category if a document contains a given Boolean combination of words
o accuracy is usually very high if a rule has been carefully refined over time by a subject expert
o building and maintaining these rules is expensive

INTRODUCTION

• Automatic classification
– utilizing machine learning techniques
o k-Nearest Neighbors (kNN)
o Naive Bayes (NB)
o Support Vector Machines (SVM)
o … and other similar methods
o requires hand-classified training data
• Note: many commercial systems use a mixture of methods


OUTLINE

• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification


THE TEXT CLASSIFICATION PROBLEM

Text Classification, also known as Text Categorization:

• Given a set of classes, we seek to determine which class(es) a given document belongs to.

An example:
• Document with only one sentence:
“London is planning to organize the 2012 Olympics.”
• We have six classes:
<UK>, <China>, <car>, <coffee>, <elections>, <sports>
• Determined: <UK>


THE TEXT CLASSIFICATION PROBLEM

An example:
• Document with only one sentence:
“London is planning to organize the 2012 Olympics.”
• We have six classes:
<UK>, <China>, <car>, <coffee>, <elections>, <sports>
• Determined: <UK> and <sports>

For some documents:
• there is more than one class the document belongs to
– referred to as the any-of classification problem
However,
• we only consider the one-of classification problem
– a document is a member of exactly one class


THE TEXT CLASSIFICATION PROBLEM

A formal definition:
– Given:
• A description of an instance, $x \in X$, where $X$ is the instance language or instance space.
• A fixed set of classes: $C = \{c_1, c_2, \ldots, c_J\}$
– Determine:
• The category of $x$: $\gamma(x) \in C$, where $\gamma: X \to C$ is a classification function.
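As a small illustration, the classification function $\gamma$ can be written down as a type. This is only a sketch; the type aliases are illustrative, not part of the original slides:

```python
from typing import Callable, Sequence

# A document is modeled as a sequence of tokens; a class is a label.
Document = Sequence[str]
Class = str

# The classification function gamma maps instances X to classes C.
Gamma = Callable[[Document], Class]
```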


OUTLINE

• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification


NAIVE BAYES TEXT CLASSIFICATION

• There are two different ways to set up an NB classifier:

– multinomial Naive Bayes (multinomial NB model)

– multivariate Bernoulli model (Bernoulli model)


NAIVE BAYES TEXT CLASSIFICATION

• Multinomial Naive Bayes
The probability of a document d being in class c is computed as:

$$P(c|d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k|c)$$

where $P(c)$ is the prior probability and the $P(t_k|c)$ are conditional probabilities:
– $P(t_k|c)$: conditional probability of term $t_k$ occurring in a document of class c
– $P(c)$: prior probability of a document occurring in class c
– $\langle t_1, t_2, \ldots, t_{n_d} \rangle$: tokens in document d that are part of the vocabulary used for classification
– $n_d$: number of such tokens in d


NAIVE BAYES TEXT CLASSIFICATION

• Multinomial Naive Bayes
The probability of a document d being in class c is computed as:

$$P(c|d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k|c)$$

• How do we decide the best class in NB classification? We take the class with the maximum a posteriori probability:

$$c_{map} = \arg\max_{c \in C} \hat{P}(c|d) = \arg\max_{c \in C} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k|c)$$

Note: we do not know the true values of the parameters, but estimate them from the training data.


NAIVE BAYES TEXT CLASSIFICATION

• How do we decide the best class in NB classification?

$$c_{map} = \arg\max_{c \in C} \hat{P}(c|d) = \arg\max_{c \in C} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k|c)$$

• Multiplying many conditional probabilities will result in floating point underflow.
• Since log(xy) = log(x) + log(y), it is better to add logarithms instead:

$$c_{map} = \arg\max_{c \in C} \Big[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k|c) \Big]$$

Here $\log \hat{P}(c)$ reflects the relative frequency of c, and $\log \hat{P}(t_k|c)$ measures how good an indicator $t_k$ is for c.
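A quick sketch of why the log transform is needed; the constants are arbitrary, chosen only to trigger underflow:

```python
import math

# Multiplying 2000 small probabilities underflows double precision to 0.0,
# while the equivalent sum of logarithms remains perfectly representable.
p = 0.01
product = 1.0
for _ in range(2000):
    product *= p

print(product)              # 0.0 (underflow)
print(2000 * math.log(p))   # -9210.34... (no underflow)
```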


NAIVE BAYES TEXT CLASSIFICATION

$$c_{map} = \arg\max_{c \in C} \Big[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k|c) \Big]$$

How do we estimate the parameters?

Maximum Likelihood Estimation (MLE) for the parameters


NAIVE BAYES TEXT CLASSIFICATION

• What is Maximum Likelihood Estimation (MLE)?

– the relative frequency, which corresponds to the most likely value of each parameter given the training data.
• How?
– for the priors:
$$\hat{P}(c) = \frac{N_c}{N}$$
where $N_c$ is the number of documents in class c and N is the total number of documents.
– for the conditional probability:
$$\hat{P}(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$$
where $T_{ct}$ is the number of occurrences of term t in training documents from class c.


NAIVE BAYES TEXT CLASSIFICATION

• A problem with MLE

– what if a term did not occur in the training data?

$$\hat{P}(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}} = 0$$

– to compute $\log \hat{P}(t|c)$, we need $\hat{P}(t|c) \neq 0$

• Solution: add-one (Laplace) smoothing

$$\hat{P}(t|c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\big(\sum_{t' \in V} T_{ct'}\big) + B}$$

where $B = |V|$ is the number of terms in the vocabulary.

NAIVE BAYES TEXT CLASSIFICATION

• Naive Bayes algorithm:

– Training (see the sketch below)
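The original slide showed the training pseudocode as a figure. Below is a minimal Python sketch of multinomial NB training, using the MLE estimates and add-one smoothing from the previous slides; the function and variable names are illustrative assumptions, not from the slides:

```python
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """Train a multinomial NB classifier with add-one smoothing.

    docs -- list of (tokens, class_label) pairs.
    Returns (priors, cond_prob, vocab).
    """
    n = len(docs)                          # N: total number of documents
    doc_count = Counter()                  # N_c: documents per class
    term_count = defaultdict(Counter)      # T_ct: term counts per class
    vocab = set()
    for tokens, c in docs:
        doc_count[c] += 1
        for t in tokens:
            term_count[c][t] += 1
            vocab.add(t)
    b = len(vocab)                         # B = |V|
    priors, cond_prob = {}, {}
    for c in doc_count:
        priors[c] = doc_count[c] / n       # P(c) = N_c / N
        total = sum(term_count[c].values())
        cond_prob[c] = {t: (term_count[c][t] + 1) / (total + b)  # Laplace
                        for t in vocab}
    return priors, cond_prob, vocab
```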


NAIVE BAYES TEXT CLASSIFICATION

• Naive Bayes algorithm:

– Testing (see the sketch below)
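Again as a hedged sketch with illustrative names: applying the trained model scores each class in log space, as derived earlier, and returns the argmax:

```python
import math

def apply_multinomial_nb(priors, cond_prob, vocab, tokens):
    """Return the class maximizing log P(c) + sum_k log P(t_k|c)."""
    best_class, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c])
        for t in tokens:
            if t in vocab:                 # out-of-vocabulary tokens are skipped
                score += math.log(cond_prob[c][t])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```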


NAIVE BAYES TEXT CLASSIFICATION

• Question:

Decide: does document d5 belong to class c = China?

Training set:
– d1: “Chinese Beijing Chinese”, in c
– d2: “Chinese Chinese Shanghai”, in c
– d3: “Chinese Macao”, in c
– d4: “Tokyo Japan Chinese”, not in c
Test set:
– d5: “Chinese Chinese Chinese Tokyo Japan”, class = ?


NAIVE BAYES TEXT CLASSIFICATION

• Solution:

– Training:
$$\hat{P}(c) = 3/4, \quad \hat{P}(\bar{c}) = 1/4$$
$$\hat{P}(\text{Chinese}|c) = (5+1)/(8+6) = 6/14 = 3/7$$
$$\hat{P}(\text{Tokyo}|c) = \hat{P}(\text{Japan}|c) = (0+1)/(8+6) = 1/14$$
$$\hat{P}(\text{Chinese}|\bar{c}) = (1+1)/(3+6) = 2/9$$
$$\hat{P}(\text{Tokyo}|\bar{c}) = \hat{P}(\text{Japan}|\bar{c}) = (1+1)/(3+6) = 2/9$$
– Testing:
$$\hat{P}(c|d_5) \propto 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003$$
$$\hat{P}(\bar{c}|d_5) \propto 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001$$
⇒ c = China
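Running the sketch functions from the algorithm slides on this example reproduces the decision (the function names are the illustrative ones introduced above):

```python
docs = [("Chinese Beijing Chinese".split(), "China"),
        ("Chinese Chinese Shanghai".split(), "China"),
        ("Chinese Macao".split(), "China"),
        ("Tokyo Japan Chinese".split(), "not-China")]

priors, cond_prob, vocab = train_multinomial_nb(docs)
d5 = "Chinese Chinese Chinese Tokyo Japan".split()
print(apply_multinomial_nb(priors, cond_prob, vocab, d5))  # China
```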


NAIVE BAYES TEXT CLASSIFICATION

• There are two different ways to set up an NB classifier:

– multinomial Naive Bayes (multinomial NB model)

– multivariate Bernoulli model (Bernoulli model)


NAIVE BAYES TEXT CLASSIFICATION

• Bernoulli model
– differs from the multinomial NB model in:

different estimation strategies

different classification rules


NAIVE BAYES TEXT CLASSIFICATION

Training of the prior probability is the same in both models. The conditional probability differs:

– multinomial model: $\hat{P}(t|c)$ is the fraction of tokens in documents of class c that are t
– Bernoulli model: $\hat{P}(t|c)$ is the fraction of documents of class c that contain t


NAIVE BAYES TEXT CLASSIFICATION

Classification also differs:

– multinomial model: only terms that appear in the document are considered
– Bernoulli model: non-occurring terms still affect the computation


NAIVE BAYES TEXT CLASSIFICATION

• Question with the Bernoulli model:

Decide: does document d5 belong to class c = China?
(same training and test sets as in the multinomial example)


NAIVE BAYES TEXT CLASSIFICATION

• Solution with the Bernoulli model:

– Training:
$$\hat{P}(c) = 3/4, \quad \hat{P}(\bar{c}) = 1/4$$
$$\hat{P}(\text{Chinese}|c) = (3+1)/(3+2) = 4/5$$
$$\hat{P}(\text{Beijing}|c) = \hat{P}(\text{Macao}|c) = (1+1)/(3+2) = 2/5$$
$$\hat{P}(\text{Tokyo}|c) = \hat{P}(\text{Japan}|c) = (0+1)/(3+2) = 1/5$$
$$\hat{P}(\text{Chinese}|\bar{c}) = (1+1)/(1+2) = 2/3$$
$$\hat{P}(\text{Tokyo}|\bar{c}) = \hat{P}(\text{Japan}|\bar{c}) = (1+1)/(1+2) = 2/3$$
$$\hat{P}(\text{Beijing}|\bar{c}) = \hat{P}(\text{Macao}|\bar{c}) = \hat{P}(\text{Shanghai}|\bar{c}) = (0+1)/(1+2) = 1/3$$
– Testing:
$$\hat{P}(c|d_5) \approx 0.005$$
$$\hat{P}(\bar{c}|d_5) \approx 0.022$$
⇒ not-China
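A minimal sketch of the Bernoulli classification rule, assuming cond_prob[c][t] now holds the smoothed document fractions estimated above (names are illustrative). Unlike the multinomial rule, it iterates over the entire vocabulary, so absent terms contribute the factor 1 − P(t|c):

```python
import math

def apply_bernoulli_nb(priors, cond_prob, vocab, tokens):
    """Bernoulli NB: every vocabulary term votes, present or absent."""
    present = set(tokens) & vocab
    best_class, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c])
        for t in vocab:
            p = cond_prob[c][t]            # fraction of class-c docs containing t
            score += math.log(p) if t in present else math.log(1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

On the example above this yields not-China, matching the hand computation.
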
OUTLINE

• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification


PROPERTIES OF NAIVE BAYES

• Recall Bayes' rule:

$$P(AB) = P(A)\,P(B|A) = P(B)\,P(A|B)$$

$$P(B|A) = \frac{P(B)\,P(A|B)}{P(A)}$$


PROPERTIES OF NAIVE BAYES

• With Bayes' rule, for a document d and a class c:

$$P(c|d) = \frac{P(d|c)\,P(c)}{P(d)}$$

$$c_{map} = \arg\max_{c \in C} P(c|d) = \arg\max_{c \in C} \frac{P(d|c)\,P(c)}{P(d)} = \arg\max_{c \in C} P(d|c)\,P(c)$$

P(d) is the same for every class, so it does not affect the result and can be dropped.

PROPERTIES OF NAIVE BAYES

$$c_{map} = \arg\max_{c \in C} P(d|c)\,P(c)$$

Computing the conditional probabilities $P(d|c)$ directly has high time complexity.

• How to compute P(d|c):
– Multinomial: $P(d|c) = P(\langle t_1, \ldots, t_k, \ldots, t_{n_d} \rangle \,|\, c)$
where $\langle t_1, \ldots, t_{n_d} \rangle$ is the sequence of terms as it occurs in d
– Bernoulli: $P(d|c) = P(\langle e_1, \ldots, e_k, \ldots, e_M \rangle \,|\, c)$
where $\langle e_1, \ldots, e_M \rangle$ is a binary vector of dimensionality M that indicates, for each term, whether it occurs in d or not


PROPERTIES OF NAIVE BAYES

• Conditional Independence Assumption

– Multinomial:
$$P(d|c) = P(\langle t_1, \ldots, t_{n_d} \rangle \,|\, c) = \prod_{1 \le k \le n_d} P(X_k = t_k \,|\, c)$$
where $P(X_k = t_k | c)$ is the probability that, in a document of class c, the term $t_k$ occurs in position k.

– Bernoulli:
$$P(d|c) = P(\langle e_1, \ldots, e_M \rangle \,|\, c) = \prod_{1 \le i \le M} P(U_i = e_i \,|\, c)$$
where $P(U_i = e_i | c)$ is the probability that, in a document of class c, the term $t_i$ occurs (if $e_i = 1$) or does not occur (if $e_i = 0$).


PROPERTIES OF NAIVE BAYES

• Multinomial:

$$P(d|c) = P(\langle t_1, \ldots, t_{n_d} \rangle \,|\, c) = \prod_{1 \le k \le n_d} P(X_k = t_k \,|\, c)$$

where $P(X_k = t_k | c)$ is the probability that, in a document of class c, the term $t_k$ occurs in position k.

• The time complexity is still high if we have to consider the position at which each term t occurs.
• Positional Independence Assumption:

$$P(X_{k_1} = t \,|\, c) = P(X_{k_2} = t \,|\, c)$$

• This is equivalent to the bag-of-words model, as the sketch below illustrates.
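A two-line illustration of the bag-of-words consequence (the example tokens are made up): once positions are ignored, documents with the same term counts are indistinguishable to the classifier:

```python
from collections import Counter

# Same bag of words, different word order: identical NB scores.
d_a = "London plans Olympics London".split()
d_b = "Olympics London London plans".split()
assert Counter(d_a) == Counter(d_b)
```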


OUTLINE

• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification


FEATURE SELECTION

• Feature Selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
• Feature Selection: Why?
– Text collections have a large number of features
o 10,000 – 1,000,000 unique words … and more
– May make using a particular classifier feasible
o some classifiers can't deal with hundreds of thousands of features
– Reduces training time
o training time for some methods is quadratic or worse in the number of features
– Can improve generalization
o eliminates noise features and avoids overfitting

FEATURE SELECTION

• Feature Selection: How?
• A(t, c) – utility measures:
– frequency
– mutual information
– the $\chi^2$ test

FEATURE SELECTION

• Frequency-based feature selection
– selecting the terms that are most common in the class
– simple and easy to implement (see the sketch below)
– may select frequent terms that carry no class-specific information (such as Monday, Tuesday, …)
– however, if many thousands of features are selected, it usually does well
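A minimal sketch of frequency-based selection; the function name and data layout are illustrative assumptions:

```python
from collections import Counter

def select_by_frequency(docs, cls, k):
    """Return the k terms occurring most often in training docs of class cls.

    docs -- list of (tokens, class_label) pairs
    """
    counts = Counter()
    for tokens, c in docs:
        if c == cls:
            counts.update(tokens)
    return [t for t, _ in counts.most_common(k)]
```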


FEATURE SELECTION

• Mutual Information feature selection: $A(t, c) = I(U; C)$

– U is a random variable
o $e_t = 1$: the document contains t
o $e_t = 0$: the document does not contain t
– C is a random variable
o $e_c = 1$: the document is in class c
o $e_c = 0$: the document is not in class c


FEATURE SELECTION

• With Maximum Likelihood Estimation:

$$I(U; C) = \frac{N_{11}}{N}\log_2\frac{N\,N_{11}}{N_{1.}N_{.1}} + \frac{N_{01}}{N}\log_2\frac{N\,N_{01}}{N_{0.}N_{.1}} + \frac{N_{10}}{N}\log_2\frac{N\,N_{10}}{N_{1.}N_{.0}} + \frac{N_{00}}{N}\log_2\frac{N\,N_{00}}{N_{0.}N_{.0}}$$

– $N_{11}$: number of documents that contain t and are in c
– $N_{10}$: number of documents that contain t but are not in c
– $N_{01}$: number of documents that do not contain t but are in c
– $N_{00}$: number of documents that neither contain t nor are in c
– $N_{1.} = N_{10} + N_{11}$: number of documents that contain t
– $N_{.1} = N_{01} + N_{11}$: number of documents in c
– $N_{0.} = N_{01} + N_{00}$: number of documents that do not contain t
– $N_{.0} = N_{10} + N_{00}$: number of documents not in c
– $N = N_{00} + N_{01} + N_{10} + N_{11}$: total number of documents
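A small sketch of the same computation in code (the function name is illustrative):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(U;C) from the four document counts defined above.

    A zero count contributes 0 (the limit of x*log x as x -> 0).
    """
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00    # contain t / do not contain t
    n_1, n_0 = n11 + n01, n10 + n00    # in c / not in c

    def term(nij, row, col):
        return 0.0 if nij == 0 else (nij / n) * math.log2(n * nij / (row * col))

    return (term(n11, n1_, n_1) + term(n01, n0_, n_1)
            + term(n10, n1_, n_0) + term(n00, n0_, n_0))
```
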
FEATURE SELECTION

• An Example
In Reuters-RCV1, c = poultry, t = export



FEATURE SELECTION

• The figure shows terms with high mutual information scores for the six classes in Reuters-RCV1.


FEATURE SELECTION

• The $\chi^2$ feature selection
– In statistics, the $\chi^2$ test is applied to test the independence of two events.
– Events A and B are defined to be independent if
$$P(AB) = P(A)\,P(B)$$
or, equivalently, if $P(A|B) = P(A)$ and $P(B|A) = P(B)$.
– In feature selection, the two events are the occurrence of the term and the occurrence of the class.

FEATURE SELECTION

• The $\chi^2$ feature selection

$$X^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}$$

• $N_{e_t e_c}$ has the same meaning as in Mutual Information feature selection.
• $E_{e_t e_c}$ is the expected frequency of t and c occurring together in a document, assuming that term and class are independent.


FEATURE SELECTION

• The $\chi^2$ feature selection
• $N_{e_t e_c}$ ($N_{00}, N_{01}, N_{10}, N_{11}$) can be counted from the training data set, as in Mutual Information feature selection.
• $E_{e_t e_c}$ ($E_{00}, E_{01}, E_{10}, E_{11}$) can also be computed from the training data set.


FEATURE SELECTION

• The example again:

Compute $E_{11}$, the expected number of documents that contain t and are in c if term and class were independent:

$$E_{11} = N \cdot P(t) \cdot P(c) = N \cdot \frac{N_{11} + N_{10}}{N} \cdot \frac{N_{11} + N_{01}}{N}$$

Compute the other $E_{e_t e_c}$ in the same way.

The higher the $X^2$ value, the more dependence between term t and class c.
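A small sketch of the whole computation from the four observed counts (the function name is illustrative):

```python
def chi_square(n11, n10, n01, n00):
    """X^2(D, t, c) from observed counts; expected counts assume independence."""
    n = n11 + n10 + n01 + n00
    x2 = 0.0
    for observed, row, col in [
        (n11, n11 + n10, n11 + n01),   # t present, in c
        (n10, n11 + n10, n10 + n00),   # t present, not in c
        (n01, n01 + n00, n11 + n01),   # t absent, in c
        (n00, n01 + n00, n10 + n00),   # t absent, not in c
    ]:
        expected = row * col / n       # E = N * P(e_t) * P(e_c)
        x2 += (observed - expected) ** 2 / expected
    return x2
```
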
OUTLINE

• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification


EVALUATION OF TEXT CLASSIFICATION

• Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances)
• Classification accuracy: c/n
– n is the total number of test instances
– c is the number of test instances correctly classified
• The accuracy measure is appropriate only if the percentage of documents in the class is high
– for a class with relative frequency 1%, the “always no” classifier will achieve 99% accuracy
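A tiny worked illustration of that last point, with made-up labels:

```python
# A class with 1% relative frequency: the trivial "always no" classifier
# is 99% accurate, which is why raw accuracy misleads on skewed classes.
labels = ["yes"] * 1 + ["no"] * 99
predictions = ["no"] * 100
accuracy = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.99
```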


SUMMARY

• Introduction: motivation and methods
• The Text Classification Problem
• Naive Bayes Text Classification
• Properties of Naive Bayes
• Feature Selection
• Evaluation of Text Classification
