

Pattern Classification

All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.


Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (part 2)
• Bayesian Estimation (BE)
• Bayesian Parameter Estimation: Gaussian Case
• Bayesian Parameter Estimation: General Theory
• Problems of Dimensionality
• Computational Complexity
• Component Analysis and Discriminants
• Hidden Markov Models

Example on MLE

• Bayesian Estimation (Bayesian learning applied to pattern classification problems)
  • In MLE, θ was assumed to be fixed
  • In BE, θ is a random variable
  • The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
  • Goal: compute P(ωi | x, D)
    Given the sample D, Bayes formula can be written:

    P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j \mid D)}

• To demonstrate the preceding equation, use:

    P(x, D \mid \omega_i) = P(x \mid D, \omega_i)\, P(D \mid \omega_i)   (from the definition of conditional probability)
    P(x \mid D) = \sum_j P(x, \omega_j \mid D)   (from the law of total probability)
    P(\omega_i) = P(\omega_i \mid D)   (the training sample provides this!)

  We assume that samples in different classes are independent. Thus:

    P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D_j)\, P(\omega_j)}


• Bayesian Parameter Estimation: Gaussian Case

  Goal: estimate μ using the a posteriori density p(μ | D)

  • The univariate case: p(μ | D)
    • μ is the only unknown parameter
    • p(x | μ) ~ N(μ, σ²)
    • p(μ) ~ N(μ₀, σ₀²)
      (μ₀ and σ₀ are known!)
  Bayes formula:

    p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)

• But we know p(x_k | μ) ~ N(μ, σ²) and p(μ) ~ N(μ₀, σ₀²)

• Plugging in their Gaussian expressions and extracting out factors not depending on μ yields:

    p(\mu \mid D) = \alpha' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]

  (from eq. 29, page 93)



• Observation: p(μ | D) is an exponential of a quadratic function of μ:

    p(\mu \mid D) = \alpha' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]

• It is again normal! It is called a reproducing density:

    p(μ | D) ~ N(μₙ, σₙ²)


    p(\mu \mid D) = \alpha' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]

• Identifying coefficients in the top equation with those of the generic Gaussian

    p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]

  yields expressions for μₙ and σₙ²:

    \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \qquad \text{and} \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\,\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}

  where \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k is the sample mean.

Solving for μₙ and σₙ² yields:

    \mu_n = \left(\frac{n\,\sigma_0^2}{n\,\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\,\sigma_0^2 + \sigma^2}\,\mu_0
    \qquad \text{and} \qquad
    \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\,\sigma_0^2 + \sigma^2}

From these equations we see that, as n increases:
  • the variance σₙ² decreases monotonically
  • the estimate of p(μ | D) becomes more and more sharply peaked
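To make the update concrete, here is a minimal Python sketch (not from the slides; the sample, the known σ², and the prior N(μ₀, σ₀²) are all made-up values) that evaluates μₙ and σₙ² exactly as in the equations above:

```python
import numpy as np

# Hypothetical setup: known data variance and a Gaussian prior on the mean.
sigma2 = 4.0                 # known sigma^2
mu0, sigma0_2 = 0.0, 1.0     # prior N(mu0, sigma0^2)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=50)   # sample D
n, mu_hat = len(x), x.mean()                              # sample mean (MLE)

# Posterior parameters from the equations above.
mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * mu_hat \
       + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
sigma_n2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)

print(mu_n, sigma_n2)   # sigma_n2 shrinks roughly like sigma^2 / n as n grows
```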

• The univariate case: p(x | D)
  • p(μ | D) has been computed (in the preceding discussion)
  • p(x | D) remains to be computed:

    p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \quad \text{is Gaussian}

    It provides: p(x \mid D) \sim N(\mu_n,\ \sigma^2 + \sigma_n^2)

  • We know σ² and how to compute μₙ and σₙ²
    (this is the desired class-conditional density p(x | Dj, ωj))
  • Therefore, using p(x | Dj, ωj) together with P(ωj) and Bayes formula, we obtain the Bayesian classification rule:

    \max_{\omega_j} P(\omega_j \mid x, D) \;=\; \max_{\omega_j}\, p(x \mid \omega_j, D_j)\, P(\omega_j)
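As a sketch of how this rule could be used, the following Python code (illustrative only: two classes, a shared known σ², a made-up prior for each mean, and scipy for the Gaussian pdf) classifies a point with the predictive densities N(μₙ, σ² + σₙ²):

```python
import numpy as np
from scipy.stats import norm

def posterior_params(x, sigma2, mu0, sigma0_2):
    """mu_n and sigma_n^2 of the posterior p(mu | D)."""
    n, mu_hat = len(x), np.mean(x)
    mu_n = (n * sigma0_2 * mu_hat + sigma2 * mu0) / (n * sigma0_2 + sigma2)
    sigma_n2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)
    return mu_n, sigma_n2

sigma2 = 1.0
rng = np.random.default_rng(1)
D1 = rng.normal(0.0, 1.0, 30)   # training samples for class 1
D2 = rng.normal(2.0, 1.0, 30)   # training samples for class 2
priors = {1: 0.5, 2: 0.5}

params = {1: posterior_params(D1, sigma2, 0.0, 1.0),
          2: posterior_params(D2, sigma2, 0.0, 1.0)}

def classify(x):
    # p(x | omega_j, D_j) ~ N(mu_n, sigma^2 + sigma_n^2); pick the max of p * P(omega_j)
    scores = {j: norm.pdf(x, mu_n, np.sqrt(sigma2 + s_n2)) * priors[j]
              for j, (mu_n, s_n2) in params.items()}
    return max(scores, key=scores.get)

print(classify(0.3), classify(1.8))
```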


• 3.5 Bayesian Parameter Estimation: General Theory

  • The p(x | D) computation can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:
    • The form of p(x | θ) is assumed known, but the value of θ is not known exactly
    • Our knowledge about θ is assumed to be contained in a known prior density p(θ)
    • The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to p(x)
The basic problem is:
  "Compute the posterior density p(θ | D)", then "Derive p(x | D)", where

    p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta

Using Bayes formula, we have:

    p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}

And by the independence assumption:

    p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)
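For a scalar θ this recipe can be carried out numerically on a grid. The sketch below is illustrative only: it assumes an exponential form p(x | θ) = θ e^(−θx) and a made-up prior, then normalizes the likelihood-times-prior product and integrates to get p(x | D):

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.exponential(scale=1.0 / 2.0, size=40)   # samples from p(x | theta = 2)

theta = np.linspace(0.01, 10.0, 2000)           # grid over the parameter
prior = np.exp(-theta)                          # assumed prior p(theta), unnormalized

# log p(D | theta) = sum_k log p(x_k | theta) for p(x | theta) = theta * exp(-theta x)
log_lik = len(D) * np.log(theta) - theta * D.sum()
post = np.exp(log_lik) * prior
post /= np.trapz(post, theta)                   # normalize: p(theta | D)

# Predictive density p(x | D) = integral of p(x | theta) p(theta | D) d(theta)
x = 0.7
p_x_given_D = np.trapz(theta * np.exp(-theta * x) * post, theta)
print(p_x_given_D)
```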


Convergence (from notes)


• Problems of Dimensionality
  • Problems involving 50 or 100 features are common (usually binary valued)
  • Note: microarray data might entail ~20,000 real-valued features
  • Classification accuracy depends on:
    • dimensionality
    • amount of training data
    • discrete vs. continuous features


• Case of two classes with multivariate normal densities and the same covariance
  • p(x | ωj) ~ N(μj, Σ), j = 1, 2
  • Statistically independent features
  • If the priors are equal, then:

    P(\text{error}) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du \qquad \text{(Bayes error)}

    where r^2 = (\mu_1 - \mu_2)^t\, \Sigma^{-1} (\mu_1 - \mu_2) is the squared Mahalanobis distance, and

    \lim_{r \to \infty} P(\text{error}) = 0
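A small sketch of this error bound (the means and covariance below are made up; scipy supplies the Gaussian tail integral):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class problem with a shared covariance.
mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

diff = mu1 - mu2
r2 = diff @ np.linalg.inv(Sigma) @ diff      # squared Mahalanobis distance
r = np.sqrt(r2)

# P(error) = 1/sqrt(2*pi) * integral_{r/2}^{inf} exp(-u^2/2) du = 1 - Phi(r/2)
p_error = norm.sf(r / 2.0)
print(r2, p_error)    # the error shrinks toward 0 as r grows
```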


• If the features are conditionally independent, then:

    \Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)
    \qquad \text{and} \qquad
    r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2
• Do we remember what conditional independence is?
• Example for binary features: let pi = Pr[xi = 1 | ω1]; then P(x | ω1) factors as the product ∏ᵢ pi^{xi} (1 − pi)^{1 − xi}
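A tiny sketch of that factorization for binary features (the pi values are made up):

```python
import numpy as np

p = np.array([0.9, 0.2, 0.7])        # p_i = Pr[x_i = 1 | omega_1], assumed values
x = np.array([1, 0, 1])              # a binary feature vector

# Conditional independence: P(x | omega_1) = prod_i p_i^{x_i} * (1 - p_i)^{1 - x_i}
likelihood = np.prod(p**x * (1 - p)**(1 - x))
print(likelihood)                     # 0.9 * 0.8 * 0.7 = 0.504
```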


• The most useful features are the ones for which the difference between the means is large relative to the standard deviation
  • This does not require independence
• Adding independent features helps increase r → reduces the error
• Caution: adding features increases the cost and complexity of both the feature extractor and the classifier
• It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance:
  • we have the wrong model, or
  • we don't have enough training data to support the additional dimensions


• Computational Complexity
  • Our design methodology is affected by computational difficulty
  • "big oh" notation: f(x) = O(h(x)), read "big oh of h(x)",
    if \exists\, (c_0, x_0) \in \mathbb{R}^2 such that f(x) \le c_0\, h(x) for all x > x_0
    (an upper bound on f(x) that grows no worse than h(x) for sufficiently large x!)

  Example:
    f(x) = 2 + 3x + 4x²
    g(x) = x²
    f(x) = O(x²)

• "big oh" is not unique!
    f(x) = O(x²); f(x) = O(x³); f(x) = O(x⁴)

• "big theta" notation: f(x) = Θ(h(x))
    if \exists\, (x_0, c_1, c_2) \in \mathbb{R}^3 such that for all x > x_0:
    0 \le c_1\, h(x) \le f(x) \le c_2\, h(x)

    f(x) = Θ(x²) but f(x) ≠ Θ(x³)

• Complexity of the ML Estimation
  • Gaussian priors in d dimensions, with n training samples for each of c classes
  • For each category, we have to compute the discriminant function

    g(x) = -\frac{1}{2}(x - \hat{\mu})^t\, \hat{\Sigma}^{-1} (x - \hat{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega)

    Per-term costs: computing \hat{\mu} is O(dn); computing \hat{\Sigma} and the quadratic form is O(d²n); the (d/2) ln 2π term is O(1); ln|\hat{\Sigma}| is O(d²n); and ln P(ω) is O(n).

    Total = O(d²n)
    Total for c classes = O(c d² n) ≅ O(d² n)

  • The cost increases when d and n are large!
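A minimal sketch of the computation being counted, on made-up data for a single class; the sample covariance is the dominant O(d²n) step:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 5, 200
X = rng.normal(size=(n, d))          # n training samples for one class
prior = 0.5                          # P(omega), assumed

mu_hat = X.mean(axis=0)              # O(dn)
Sigma_hat = np.cov(X, rowvar=False)  # O(d^2 n), the dominant term
Sigma_inv = np.linalg.inv(Sigma_hat)
_, logdet = np.linalg.slogdet(Sigma_hat)

def g(x):
    # Discriminant function g(x) for this class.
    diff = x - mu_hat
    return (-0.5 * diff @ Sigma_inv @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * logdet
            + np.log(prior))

print(g(np.zeros(d)))
```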



• Overfitting
  • Dimensionality of the model vs. size of the training data
  • Issue: not enough data to support the model
  • Possible solutions:
    • Reduce the dimensionality of the model
    • Make (possibly incorrect) assumptions to better estimate Σ


• Overfitting
  • Estimate a better Σ:
    • use data pooled from all classes (normalization issues)
    • use a pseudo-Bayesian form: λΣ₀ + (1 − λ)Σ̂ₙ
    • "doctor" Σ̂ by thresholding its entries (reduces chance correlations)
    • assume statistical independence: zero all off-diagonal elements
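A small sketch of two of these fixes on a made-up covariance estimate: thresholding small off-diagonal entries, and assuming independence by keeping only the diagonal (the threshold value is arbitrary):

```python
import numpy as np

Sigma_hat = np.array([[2.0,  0.05, 0.8],
                      [0.05, 1.0,  0.02],
                      [0.8,  0.02, 3.0]])     # an estimated covariance (made up)

# "Doctor" Sigma: zero out small off-diagonal entries (likely chance correlations).
thresh = 0.1
doctored = Sigma_hat.copy()
off_diag = ~np.eye(3, dtype=bool)
doctored[off_diag & (np.abs(doctored) < thresh)] = 0.0

# Assume statistical independence: keep only the diagonal.
independent = np.diag(np.diag(Sigma_hat))

print(doctored)
print(independent)
```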

• Shrinkage
  • Shrinkage: a weighted combination of the common and individual covariances:

    \Sigma_i(\alpha) = \frac{(1-\alpha)\, n_i\, \Sigma_i + \alpha\, n\, \Sigma}{(1-\alpha)\, n_i + \alpha\, n} \qquad \text{for } 0 < \alpha < 1

  • We can also shrink the estimated common covariance toward the identity matrix:

    \Sigma(\beta) = (1-\beta)\, \Sigma + \beta\, I \qquad \text{for } 0 < \beta < 1
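A minimal sketch of both shrinkage formulas on made-up data (the α and β values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
X_i = rng.normal(size=(20, d))                 # samples of class i (small n_i)
X_all = rng.normal(size=(200, d))              # pooled samples from all classes

n_i, n = len(X_i), len(X_all)
Sigma_i = np.cov(X_i, rowvar=False)            # individual covariance
Sigma = np.cov(X_all, rowvar=False)            # common covariance

def shrink_toward_common(alpha):
    # Sigma_i(alpha) = ((1-alpha) n_i Sigma_i + alpha n Sigma) / ((1-alpha) n_i + alpha n)
    return ((1 - alpha) * n_i * Sigma_i + alpha * n * Sigma) / ((1 - alpha) * n_i + alpha * n)

def shrink_toward_identity(beta):
    # Sigma(beta) = (1-beta) Sigma + beta I
    return (1 - beta) * Sigma + beta * np.eye(d)

print(shrink_toward_common(0.3))
print(shrink_toward_identity(0.1))
```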
• Component Analysis and Discriminants
  • Combine features in order to reduce the dimension of the feature space
  • Linear combinations are simple to compute and tractable
  • Project high-dimensional data onto a lower-dimensional space
  • Two classical approaches for finding an "optimal" linear transformation:
    • PCA (Principal Component Analysis): "the projection that best represents the data in a least-squares sense"
    • MDA (Multiple Discriminant Analysis): "the projection that best separates the data in a least-squares sense"
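A minimal PCA sketch on made-up data: center the data, take the leading eigenvectors of the sample covariance, and project; this is the least-squares-best linear representation mentioned above.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated data, made up

# Center, estimate the covariance, and take the leading eigenvectors.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]

k = 2
W = eigvecs[:, order[:k]]                        # d x k projection matrix
Y = Xc @ W                                       # data projected to k dimensions
print(Y.shape)                                   # (100, 2)
```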
PCA (from notes)

• Hidden Markov Models: Markov Chains
  • Goal: make a sequence of decisions
  • Processes that unfold in time: the state at time t is influenced by the state at time t − 1
  • Applications: speech recognition, gesture recognition, parts-of-speech tagging, DNA sequencing, …
  • Any temporal process without memory:
      ω^T = {ω(1), ω(2), ω(3), …, ω(T)} is a sequence of states
      We might have ω^6 = {ω1, ω4, ω2, ω2, ω1, ω4}
  • The system can revisit a state at different steps, and not every state needs to be visited

• First-order Markov models
  • Our production of any sequence is described by the transition probabilities:

    P(\omega_j(t+1) \mid \omega_i(t)) = a_{ij}



• The model: θ = (a_{ij}, T)

    P(\omega^T \mid \theta) = a_{14}\, a_{42}\, a_{22}\, a_{21}\, a_{14} \cdot P(\omega(1) = \omega_i)

• Example: speech recognition ("production of spoken words")
  • Production of the word "pattern" represented by phonemes:
      /p/ /a/ /tt/ /er/ /n/ and a final silent state
  • Transitions from /p/ to /a/, /a/ to /tt/, /tt/ to /er/, /er/ to /n/, and /n/ to the silent state
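A small sketch of the sequence-probability product above, using a hypothetical 4-state transition matrix and a uniform initial distribution (the numbers are not the book's example):

```python
import numpy as np

# Hypothetical transition matrix a[i, j] = P(omega_j at t+1 | omega_i at t), 4 states.
a = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.3, 0.3, 0.2, 0.2],
              [0.2, 0.5, 0.2, 0.1],
              [0.4, 0.1, 0.4, 0.1]])
prior = np.array([0.25, 0.25, 0.25, 0.25])      # P(omega(1) = omega_i)

def sequence_prob(states):
    """P(omega^T | theta) = P(omega(1)) * prod_t a[omega(t), omega(t+1)]."""
    p = prior[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= a[s, s_next]
    return p

# The sequence omega^6 = {omega1, omega4, omega2, omega2, omega1, omega4}, 0-indexed below.
print(sequence_prob([0, 3, 1, 1, 0, 3]))
```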