

Pattern Classification

All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.


Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (part 2)
• Bayesian Estimation (BE)
• Bayesian Parameter Estimation: Gaussian Case
• Bayesian Parameter Estimation: General Theory
• Problems of Dimensionality
• Computational Complexity
• Component Analysis and Discriminants
• Hidden Markov Models

Example on MLE

• Bayesian Estimation (Bayesian learning applied to pattern classification problems)
  • In MLE, θ was assumed to be fixed
  • In BE, θ is a random variable
  • The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification
  • Goal: compute P(ωi | x, D)
    Given the sample D, Bayes formula can be written:

    P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j \mid D)}

• To demonstrate the preceding equation, use:

    P(x, D \mid \omega_i) = P(x \mid D, \omega_i)\, P(D \mid \omega_i)   (from the definition of conditional probability)
    P(x \mid D) = \sum_j P(x, \omega_j \mid D)   (from the law of total probability)
    P(\omega_i) = P(\omega_i \mid D)   (the training sample provides this!)

  We assume that samples in different classes are independent. Thus:

    P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D_j)\, P(\omega_j)}


• Bayesian Parameter Estimation: Gaussian Case

  Goal: estimate μ using the a posteriori density p(μ | D)

  • The univariate case: p(μ | D)
    • μ is the only unknown parameter
    • p(x | μ) ~ N(μ, σ²)
    • p(μ) ~ N(μ₀, σ₀²)
      (μ₀ and σ₀ are known!)
  Bayes formula:

    p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu} = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)

• But we know p(x_k | μ) ~ N(μ, σ²) and p(μ) ~ N(μ₀, σ₀²)

• Plugging in their Gaussian expressions and extracting out factors not depending on μ yields:

    p(\mu \mid D) = \alpha' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]

  (from eq. 29, page 93)



• Observation: p(μ | D) is an exponential of a quadratic function of μ:

    p(\mu \mid D) = \alpha' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]

• It is again normal! It is called a reproducing density:

    p(μ | D) ~ N(μₙ, σₙ²)


    p(\mu \mid D) = \alpha' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]

• Identifying coefficients in the top equation with those of the generic Gaussian

    p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]

  yields expressions for μₙ and σₙ²:

    \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \qquad \text{and} \qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\,\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}

  where \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k is the sample mean.

Solving for μₙ and σₙ² yields:

    \mu_n = \left(\frac{n\,\sigma_0^2}{n\,\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\,\sigma_0^2 + \sigma^2}\,\mu_0
    \qquad \text{and} \qquad
    \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\,\sigma_0^2 + \sigma^2}

From these equations we see that, as n increases:
  • the variance σₙ² decreases monotonically
  • the estimate of p(μ | D) becomes more and more sharply peaked
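To make the update concrete, here is a minimal Python sketch (not from the slides; the sample, the known σ², and the prior N(μ₀, σ₀²) are all made-up values) that evaluates μₙ and σₙ² exactly as in the equations above:

```python
import numpy as np

# Hypothetical setup: known data variance and a Gaussian prior on the mean.
sigma2 = 4.0                 # known sigma^2
mu0, sigma0_2 = 0.0, 1.0     # prior N(mu0, sigma0^2)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=50)   # sample D
n, mu_hat = len(x), x.mean()                              # sample mean (MLE)

# Posterior parameters from the equations above.
mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * mu_hat \
       + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
sigma_n2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)

print(mu_n, sigma_n2)   # sigma_n2 shrinks roughly like sigma^2 / n as n grows
```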

• The univariate case: p(x | D)
  • p(μ | D) has been computed (in the preceding discussion)
  • p(x | D) remains to be computed:

    p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu \quad \text{is Gaussian}

    It provides: p(x \mid D) \sim N(\mu_n,\ \sigma^2 + \sigma_n^2)

  • We know σ² and how to compute μₙ and σₙ²
    (this is the desired class-conditional density p(x | Dj, ωj))
  • Therefore, using p(x | Dj, ωj) together with P(ωj) and Bayes formula, we obtain the Bayesian classification rule:

    \max_{\omega_j} P(\omega_j \mid x, D) \;=\; \max_{\omega_j}\, p(x \mid \omega_j, D_j)\, P(\omega_j)
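As a sketch of how this rule could be used, the following Python code (illustrative only: two classes, a shared known σ², a made-up prior for each mean, and scipy for the Gaussian pdf) classifies a point with the predictive densities N(μₙ, σ² + σₙ²):

```python
import numpy as np
from scipy.stats import norm

def posterior_params(x, sigma2, mu0, sigma0_2):
    """mu_n and sigma_n^2 of the posterior p(mu | D)."""
    n, mu_hat = len(x), np.mean(x)
    mu_n = (n * sigma0_2 * mu_hat + sigma2 * mu0) / (n * sigma0_2 + sigma2)
    sigma_n2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)
    return mu_n, sigma_n2

sigma2 = 1.0
rng = np.random.default_rng(1)
D1 = rng.normal(0.0, 1.0, 30)   # training samples for class 1
D2 = rng.normal(2.0, 1.0, 30)   # training samples for class 2
priors = {1: 0.5, 2: 0.5}

params = {1: posterior_params(D1, sigma2, 0.0, 1.0),
          2: posterior_params(D2, sigma2, 0.0, 1.0)}

def classify(x):
    # p(x | omega_j, D_j) ~ N(mu_n, sigma^2 + sigma_n^2); pick the max of p * P(omega_j)
    scores = {j: norm.pdf(x, mu_n, np.sqrt(sigma2 + s_n2)) * priors[j]
              for j, (mu_n, s_n2) in params.items()}
    return max(scores, key=scores.get)

print(classify(0.3), classify(1.8))
```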


• 3.5 Bayesian Parameter Estimation: General Theory

  • The p(x | D) computation can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:
    • The form of p(x | θ) is assumed known, but the value of θ is not known exactly
    • Our knowledge about θ is assumed to be contained in a known prior density p(θ)
    • The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn independently according to p(x)
The basic problem is:
  "Compute the posterior density p(θ | D)", then "Derive p(x | D)", where

    p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta

Using Bayes formula, we have:

    p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}

And by the independence assumption:

    p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)
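For a scalar θ this recipe can be carried out numerically on a grid. The sketch below is illustrative only: it assumes an exponential form p(x | θ) = θ e^(−θx) and a made-up prior, then normalizes the likelihood-times-prior product and integrates to get p(x | D):

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.exponential(scale=1.0 / 2.0, size=40)   # samples from p(x | theta = 2)

theta = np.linspace(0.01, 10.0, 2000)           # grid over the parameter
prior = np.exp(-theta)                          # assumed prior p(theta), unnormalized

# log p(D | theta) = sum_k log p(x_k | theta) for p(x | theta) = theta * exp(-theta x)
log_lik = len(D) * np.log(theta) - theta * D.sum()
post = np.exp(log_lik) * prior
post /= np.trapz(post, theta)                   # normalize: p(theta | D)

# Predictive density p(x | D) = integral of p(x | theta) p(theta | D) d(theta)
x = 0.7
p_x_given_D = np.trapz(theta * np.exp(-theta * x) * post, theta)
print(p_x_given_D)
```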


Convergence (from notes)


• Problems of Dimensionality
  • Problems involving 50 or 100 features are common (usually binary valued)
  • Note: microarray data might entail ~20,000 real-valued features
  • Classification accuracy depends on:
    • dimensionality
    • amount of training data
    • discrete vs. continuous features


• Case of two classes with multivariate normal densities and the same covariance
  • p(x | ωj) ~ N(μj, Σ), j = 1, 2
  • Statistically independent features
  • If the priors are equal, then:

    P(\text{error}) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\, du \qquad \text{(Bayes error)}

    where r^2 = (\mu_1 - \mu_2)^t\, \Sigma^{-1} (\mu_1 - \mu_2) is the squared Mahalanobis distance, and

    \lim_{r \to \infty} P(\text{error}) = 0
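A small sketch of this error bound (the means and covariance below are made up; scipy supplies the Gaussian tail integral):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class problem with a shared covariance.
mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

diff = mu1 - mu2
r2 = diff @ np.linalg.inv(Sigma) @ diff      # squared Mahalanobis distance
r = np.sqrt(r2)

# P(error) = 1/sqrt(2*pi) * integral_{r/2}^{inf} exp(-u^2/2) du = 1 - Phi(r/2)
p_error = norm.sf(r / 2.0)
print(r2, p_error)    # the error shrinks toward 0 as r grows
```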


• If the features are conditionally independent, then:

    \Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)
    \qquad \text{and} \qquad
    r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2
• Do we remember what conditional independence is?
• Example for binary features: let pi = Pr[xi = 1 | ω1]; then P(x | ω1) factors as the product ∏ᵢ pi^{xi} (1 − pi)^{1 − xi}
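A tiny sketch of that factorization for binary features (the pi values are made up):

```python
import numpy as np

p = np.array([0.9, 0.2, 0.7])        # p_i = Pr[x_i = 1 | omega_1], assumed values
x = np.array([1, 0, 1])              # a binary feature vector

# Conditional independence: P(x | omega_1) = prod_i p_i^{x_i} * (1 - p_i)^{1 - x_i}
likelihood = np.prod(p**x * (1 - p)**(1 - x))
print(likelihood)                     # 0.9 * 0.8 * 0.7 = 0.504
```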


• The most useful features are the ones for which the difference between the means is large relative to the standard deviation
  • This does not require independence
• Adding independent features helps increase r → reduces the error
• Caution: adding features increases the cost and complexity of both the feature extractor and the classifier
• It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance:
  • we have the wrong model, or
  • we don't have enough training data to support the additional dimensions


• Computational Complexity
  • Our design methodology is affected by computational difficulty
  • "big oh" notation: f(x) = O(h(x)), read "big oh of h(x)",
    if \exists\, (c_0, x_0) \in \mathbb{R}^2 such that f(x) \le c_0\, h(x) for all x > x_0
    (an upper bound on f(x) that grows no worse than h(x) for sufficiently large x!)

  Example:
    f(x) = 2 + 3x + 4x²
    g(x) = x²
    f(x) = O(x²)

• "big oh" is not unique!
    f(x) = O(x²); f(x) = O(x³); f(x) = O(x⁴)

• "big theta" notation: f(x) = Θ(h(x))
    if \exists\, (x_0, c_1, c_2) \in \mathbb{R}^3 such that for all x > x_0:
    0 \le c_1\, h(x) \le f(x) \le c_2\, h(x)

    f(x) = Θ(x²) but f(x) ≠ Θ(x³)

• Complexity of the ML Estimation
  • Gaussian priors in d dimensions, with n training samples for each of c classes
  • For each category, we have to compute the discriminant function

    g(x) = -\frac{1}{2}(x - \hat{\mu})^t\, \hat{\Sigma}^{-1} (x - \hat{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega)

    Per-term costs: computing \hat{\mu} is O(dn); computing \hat{\Sigma} and the quadratic form is O(d²n); the (d/2) ln 2π term is O(1); ln|\hat{\Sigma}| is O(d²n); and ln P(ω) is O(n).

    Total = O(d²n)
    Total for c classes = O(c d² n) ≅ O(d² n)

  • The cost increases when d and n are large!
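A minimal sketch of the computation being counted, on made-up data for a single class; the sample covariance is the dominant O(d²n) step:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 5, 200
X = rng.normal(size=(n, d))          # n training samples for one class
prior = 0.5                          # P(omega), assumed

mu_hat = X.mean(axis=0)              # O(dn)
Sigma_hat = np.cov(X, rowvar=False)  # O(d^2 n), the dominant term
Sigma_inv = np.linalg.inv(Sigma_hat)
_, logdet = np.linalg.slogdet(Sigma_hat)

def g(x):
    # Discriminant function g(x) for this class.
    diff = x - mu_hat
    return (-0.5 * diff @ Sigma_inv @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * logdet
            + np.log(prior))

print(g(np.zeros(d)))
```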



• Overfitting
  • Dimensionality of the model vs. size of the training data
  • Issue: not enough data to support the model
  • Possible solutions:
    • Reduce the dimensionality of the model
    • Make (possibly incorrect) assumptions to better estimate Σ


• Overfitting
  • Estimate a better Σ:
    • use data pooled from all classes (normalization issues)
    • use a pseudo-Bayesian form: λΣ₀ + (1 − λ)Σ̂ₙ
    • "doctor" Σ̂ by thresholding its entries (reduces chance correlations)
    • assume statistical independence: zero all off-diagonal elements
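A small sketch of two of these fixes on a made-up covariance estimate: thresholding small off-diagonal entries, and assuming independence by keeping only the diagonal (the threshold value is arbitrary):

```python
import numpy as np

Sigma_hat = np.array([[2.0,  0.05, 0.8],
                      [0.05, 1.0,  0.02],
                      [0.8,  0.02, 3.0]])     # an estimated covariance (made up)

# "Doctor" Sigma: zero out small off-diagonal entries (likely chance correlations).
thresh = 0.1
doctored = Sigma_hat.copy()
off_diag = ~np.eye(3, dtype=bool)
doctored[off_diag & (np.abs(doctored) < thresh)] = 0.0

# Assume statistical independence: keep only the diagonal.
independent = np.diag(np.diag(Sigma_hat))

print(doctored)
print(independent)
```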

• Shrinkage
  • Shrinkage: a weighted combination of the common and individual covariances:

    \Sigma_i(\alpha) = \frac{(1-\alpha)\, n_i\, \Sigma_i + \alpha\, n\, \Sigma}{(1-\alpha)\, n_i + \alpha\, n} \qquad \text{for } 0 < \alpha < 1

  • We can also shrink the estimated common covariance toward the identity matrix:

    \Sigma(\beta) = (1-\beta)\, \Sigma + \beta\, I \qquad \text{for } 0 < \beta < 1
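A minimal sketch of both shrinkage formulas on made-up data (the α and β values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
X_i = rng.normal(size=(20, d))                 # samples of class i (small n_i)
X_all = rng.normal(size=(200, d))              # pooled samples from all classes

n_i, n = len(X_i), len(X_all)
Sigma_i = np.cov(X_i, rowvar=False)            # individual covariance
Sigma = np.cov(X_all, rowvar=False)            # common covariance

def shrink_toward_common(alpha):
    # Sigma_i(alpha) = ((1-alpha) n_i Sigma_i + alpha n Sigma) / ((1-alpha) n_i + alpha n)
    return ((1 - alpha) * n_i * Sigma_i + alpha * n * Sigma) / ((1 - alpha) * n_i + alpha * n)

def shrink_toward_identity(beta):
    # Sigma(beta) = (1-beta) Sigma + beta I
    return (1 - beta) * Sigma + beta * np.eye(d)

print(shrink_toward_common(0.3))
print(shrink_toward_identity(0.1))
```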
• Component Analysis and Discriminants
  • Combine features in order to reduce the dimension of the feature space
  • Linear combinations are simple to compute and tractable
  • Project high-dimensional data onto a lower-dimensional space
  • Two classical approaches for finding an "optimal" linear transformation:
    • PCA (Principal Component Analysis): "the projection that best represents the data in a least-squares sense"
    • MDA (Multiple Discriminant Analysis): "the projection that best separates the data in a least-squares sense"
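A minimal PCA sketch on made-up data: center the data, take the leading eigenvectors of the sample covariance, and project; this is the least-squares-best linear representation mentioned above.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated data, made up

# Center, estimate the covariance, and take the leading eigenvectors.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]

k = 2
W = eigvecs[:, order[:k]]                        # d x k projection matrix
Y = Xc @ W                                       # data projected to k dimensions
print(Y.shape)                                   # (100, 2)
```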
PCA (from notes)

• Hidden Markov Models: Markov Chains
  • Goal: make a sequence of decisions
  • Processes that unfold in time: the state at time t is influenced by the state at time t − 1
  • Applications: speech recognition, gesture recognition, parts-of-speech tagging, DNA sequencing, …
  • Any temporal process without memory:
      ω^T = {ω(1), ω(2), ω(3), …, ω(T)} is a sequence of states
      We might have ω^6 = {ω1, ω4, ω2, ω2, ω1, ω4}
  • The system can revisit a state at different steps, and not every state needs to be visited

• First-order Markov models
  • Our production of any sequence is described by the transition probabilities:

    P(\omega_j(t+1) \mid \omega_i(t)) = a_{ij}



• The model: θ = (a_{ij}, T)

    P(\omega^T \mid \theta) = a_{14}\, a_{42}\, a_{22}\, a_{21}\, a_{14} \cdot P(\omega(1) = \omega_i)

• Example: speech recognition ("production of spoken words")
  • Production of the word "pattern" represented by phonemes:
      /p/ /a/ /tt/ /er/ /n/ and a final silent state
  • Transitions from /p/ to /a/, /a/ to /tt/, /tt/ to /er/, /er/ to /n/, and /n/ to the silent state
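A small sketch of the sequence-probability product above, using a hypothetical 4-state transition matrix and a uniform initial distribution (the numbers are not the book's example):

```python
import numpy as np

# Hypothetical transition matrix a[i, j] = P(omega_j at t+1 | omega_i at t), 4 states.
a = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.3, 0.3, 0.2, 0.2],
              [0.2, 0.5, 0.2, 0.1],
              [0.4, 0.1, 0.4, 0.1]])
prior = np.array([0.25, 0.25, 0.25, 0.25])      # P(omega(1) = omega_i)

def sequence_prob(states):
    """P(omega^T | theta) = P(omega(1)) * prod_t a[omega(t), omega(t+1)]."""
    p = prior[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= a[s, s_next]
    return p

# The sequence omega^6 = {omega1, omega4, omega2, omega2, omega1, omega4}, 0-indexed below.
print(sequence_prob([0, 3, 1, 1, 0, 3]))
```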