
Pattern Classification

All materials in these slides were taken from
Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork,
John Wiley & Sons, 2000,
with the permission of the authors and the publisher.
Chapter 3:
Maximum-Likelihood & Bayesian Parameter Estimation (part 1)

• Introduction
• Maximum-Likelihood Estimation
• Example of a Specific Case
• The Gaussian Case: unknown μ and σ
• Bias
• Appendix: ML Problem Statement


• Introduction
  • Data availability in a Bayesian framework
    • We could design an optimal classifier if we knew:
      • P(ωi) (priors)
      • P(x | ωi) (class-conditional densities)
    • Unfortunately, we rarely have this complete information!
  • Design a classifier from a training sample
    • No problem with prior estimation
    • Samples are often too small for class-conditional estimation
      (large dimension of feature space!)

• A priori information about the problem
  • Normality of P(x | ωi):
    P(x | ωi) ~ N(μi, Σi)
  • Characterized by 2 parameters
• Estimation techniques
  • Maximum-Likelihood (ML) and Bayesian estimation
  • The results are nearly identical, but the approaches are different

• Parameters in ML estimation are fixed but unknown!
  • The best parameters are obtained by maximizing the
    probability of obtaining the samples observed
• Bayesian methods view the parameters as random variables
  having some known distribution
• In either approach, we use P(ωi | x) for our classification rule!

• Maximum-Likelihood Estimation
  • Has good convergence properties as the sample size increases
  • Simpler than alternative techniques
• General principle
  • Assume we have c classes and
    P(x | ωj) ~ N(μj, Σj)
    P(x | ωj) ≡ P(x | ωj, θj), where:

    $\theta_j = (\mu_j, \Sigma_j) = (\mu_j^1, \mu_j^2, \ldots, \sigma_j^{11}, \sigma_j^{22}, \ldots, \operatorname{cov}(x_j^m, x_j^n), \ldots)$
• Use the information provided by the training samples to estimate
  θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated
  with one category
• Suppose that D contains n samples, x1, x2, …, xn:

  $P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta) = F(\theta)$

  P(D | θ) is called the likelihood of θ w.r.t. the set of samples
• The ML estimate of θ is, by definition, the value θ̂ that
  maximizes P(D | θ)
  "It is the value of θ that best agrees with the actually
  observed training samples"
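To make the definition concrete, here is a minimal sketch (not part of the original slides; the sample values, the candidate means, and the assumption of a known σ are invented for illustration) that evaluates the log of P(D | θ) = ∏ P(xk | θ) for a one-dimensional Gaussian and two candidate values of θ = μ:

```python
# Minimal sketch: evaluating the (log-)likelihood of two candidate parameter values
# for a 1-D Gaussian with known sigma. Data and candidates are invented.
import numpy as np
from scipy.stats import norm

D = np.array([2.1, 1.7, 2.4, 1.9, 2.6])   # hypothetical training samples x_1..x_n
sigma = 0.5                                # assumed known standard deviation

def log_likelihood(mu):
    # l(theta) = ln P(D | theta) = sum_k ln P(x_k | theta)
    return np.sum(norm.logpdf(D, loc=mu, scale=sigma))

for mu in (1.0, np.mean(D)):
    print(f"mu = {mu:.3f}  ->  log-likelihood = {log_likelihood(mu):.3f}")

# The candidate closer to the sample mean "agrees better" with the observed
# data, i.e. it yields a larger (log-)likelihood.
```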
• Optimal estimation
  • Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator:

    $\nabla_\theta = \left[ \frac{\partial}{\partial\theta_1}, \frac{\partial}{\partial\theta_2}, \ldots, \frac{\partial}{\partial\theta_p} \right]^t$

  • We define l(θ) as the log-likelihood function:
    l(θ) = ln P(D | θ)
  • New problem statement:
    determine the θ that maximizes the log-likelihood

    $\hat{\theta} = \arg\max_\theta\, l(\theta)$


The set of necessary conditions for an optimum is:

  $\nabla_\theta\, l = \sum_{k=1}^{n} \nabla_\theta \ln P(x_k \mid \theta) = 0$
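This stationarity condition can also be checked numerically. The sketch below (an illustration with invented data and an assumed known σ, not from the original slides) minimizes the negative log-likelihood with scipy.optimize.minimize and compares the result with the closed-form sample mean:

```python
# Sketch: solving grad_theta l(theta) = 0 numerically by minimizing -l(theta).
# The data and the known sigma are invented for illustration.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

D = np.array([2.1, 1.7, 2.4, 1.9, 2.6])
sigma = 0.5

def neg_log_likelihood(theta):
    mu = theta[0]
    return -np.sum(norm.logpdf(D, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0])
print("numerical ML estimate:   ", result.x[0])
print("closed-form (sample mean):", D.mean())
```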

• Example of a specific case: unknown μ
  • P(xk | μ) ~ N(μ, Σ)
    (samples are drawn from a multivariate normal population)

    $\ln P(x_k \mid \mu) = -\frac{1}{2}\ln\!\left[(2\pi)^d\,|\Sigma|\right] - \frac{1}{2}(x_k - \mu)^t\,\Sigma^{-1}(x_k - \mu)$

    and

    $\nabla_\mu \ln P(x_k \mid \mu) = \Sigma^{-1}(x_k - \mu)$

  • θ = μ, therefore the ML estimate of μ must satisfy:

    $\sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat{\mu}) = 0$

• Multiplying by Σ and rearranging, we obtain:

  $\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$

  This is just the arithmetic average of the training samples!

Conclusion:
If P(xk | ωj) (j = 1, 2, …, c) is supposed to be Gaussian in a
d-dimensional feature space, then we can estimate the vector
θ = (θ1, θ2, …, θc)^t and perform an optimal classification!


• ML Estimation: Gaussian Case, unknown μ and σ

  θ = (θ1, θ2) = (μ, σ²)

  $l = \ln P(x_k \mid \theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2$

  $\nabla_\theta\, l = \begin{pmatrix} \dfrac{\partial}{\partial\theta_1}\,\ln P(x_k \mid \theta) \\[8pt] \dfrac{\partial}{\partial\theta_2}\,\ln P(x_k \mid \theta) \end{pmatrix} = 0$

  $\begin{cases} \dfrac{1}{\theta_2}(x_k - \theta_1) = 0 \\[8pt] -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^2}{2\theta_2^{\,2}} = 0 \end{cases}$

Summation over all n samples:

  $\begin{cases} \displaystyle\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2}\,(x_k - \hat{\theta}_1) = 0 & (1) \\[10pt] \displaystyle -\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2} + \sum_{k=1}^{n} \frac{(x_k - \hat{\theta}_1)^2}{\hat{\theta}_2^{\,2}} = 0 & (2) \end{cases}$

Combining (1) and (2), one obtains:

  $\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k \,; \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2$
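The same two formulas in code (a sketch with invented 1-D data, not from the slides); note that the ML variance divides by n, not n − 1:

```python
# Sketch: joint ML estimates for the unknown mean and variance of a 1-D Gaussian.
# Data are invented for illustration.
import numpy as np

x = np.array([2.1, 1.7, 2.4, 1.9, 2.6, 2.2, 1.8])
n = len(x)

mu_hat = x.sum() / n                        # mu_hat = (1/n) sum_k x_k
sigma2_hat = ((x - mu_hat) ** 2).sum() / n  # divides by n, not n - 1

print("mu_hat =", mu_hat)
print("sigma2_hat (ML) =", sigma2_hat)
print("np.var agrees:", np.var(x))  # np.var defaults to ddof=0, i.e. the 1/n (ML) form
```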

• Bias
  • The ML estimate for σ² is biased:

    $E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$

  • An elementary unbiased estimator for Σ is the sample covariance matrix:

    $C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^t$
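The sketch below (invented data, not from the slides) contrasts the biased ML estimate with the unbiased sample covariance; numpy exposes both forms through the bias argument of np.cov:

```python
# Sketch: biased ML covariance (divide by n) vs. unbiased sample covariance (divide by n-1).
# Data are invented for illustration; a small n makes the difference visible.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[2.0, 0.5],
                                 [0.5, 1.0]],
                            size=10)

ml_cov = np.cov(X, rowvar=False, bias=True)          # ML estimate: divides by n
unbiased_cov = np.cov(X, rowvar=False, bias=False)   # sample covariance C: divides by n-1

print("ML (biased) estimate:\n", ml_cov)
print("Unbiased sample covariance:\n", unbiased_cov)
# The two differ by the factor (n-1)/n; only the latter has expectation Sigma.
```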


• Appendix: ML Problem Statement
  • Let D = {x1, x2, …, xn}

    $P(x_1, \ldots, x_n \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta); \qquad |D| = n$

  • Our goal is to determine θ̂, the value of θ that
    makes this sample the most representative!


[Figure: the training set D (|D| = n) partitioned into class subsets D1, …, Dk, …, Dc; the samples of each subset Dj are drawn according to P(x | ωj) ~ N(μj, Σj).]

 = (1, 2, …, c)

Problem: find θ̂ such that:

  $\max_\theta P(D \mid \theta) = \max_\theta P(x_1, \ldots, x_n \mid \theta) = \max_\theta \prod_{k=1}^{n} P(x_k \mid \theta)$
