
Pattern Classification

All materials in these slides were taken from
Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork,
John Wiley & Sons, 2000,
with the permission of the authors and the publisher.
Chapter 3:
Maximum-Likelihood & Bayesian Parameter Estimation (part 1)

• Introduction
• Maximum-Likelihood Estimation
• Example of a Specific Case
• The Gaussian Case: unknown μ and σ
• Bias
• Appendix: ML Problem Statement


• Introduction
  • Data availability in a Bayesian framework
    • We could design an optimal classifier if we knew:
      • P(ωi) (priors)
      • P(x | ωi) (class-conditional densities)
    • Unfortunately, we rarely have this complete information!
  • Design a classifier from a training sample
    • No problem with prior estimation
    • Samples are often too small for class-conditional estimation
      (large dimension of feature space!)

• A priori information about the problem
  • Normality of P(x | ωi):
    P(x | ωi) ~ N(μi, Σi)
  • Characterized by 2 parameters
• Estimation techniques
  • Maximum-Likelihood (ML) and Bayesian estimation
  • The results are nearly identical, but the approaches are different

• Parameters in ML estimation are fixed but unknown!
  • The best parameters are obtained by maximizing the
    probability of obtaining the samples observed
• Bayesian methods view the parameters as random variables
  having some known distribution
• In either approach, we use P(ωi | x) for our classification rule!

• Maximum-Likelihood Estimation
  • Has good convergence properties as the sample size increases
  • Simpler than alternative techniques
• General principle
  • Assume we have c classes and
    P(x | ωj) ~ N(μj, Σj)
    P(x | ωj) ≡ P(x | ωj, θj), where:

    $\theta_j = (\mu_j, \Sigma_j) = (\mu_j^1, \mu_j^2, \ldots, \sigma_j^{11}, \sigma_j^{22}, \ldots, \operatorname{cov}(x_j^m, x_j^n), \ldots)$
• Use the information provided by the training samples to estimate
  θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated
  with one category
• Suppose that D contains n samples, x1, x2, …, xn:

  $P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta) = F(\theta)$

  P(D | θ) is called the likelihood of θ w.r.t. the set of samples
• The ML estimate of θ is, by definition, the value θ̂ that
  maximizes P(D | θ)
  "It is the value of θ that best agrees with the actually
  observed training samples"
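To make the definition concrete, here is a minimal sketch (not part of the original slides; the sample values, the candidate means, and the assumption of a known σ are invented for illustration) that evaluates the log of P(D | θ) = ∏ P(xk | θ) for a one-dimensional Gaussian and two candidate values of θ = μ:

```python
# Minimal sketch: evaluating the (log-)likelihood of two candidate parameter values
# for a 1-D Gaussian with known sigma. Data and candidates are invented.
import numpy as np
from scipy.stats import norm

D = np.array([2.1, 1.7, 2.4, 1.9, 2.6])   # hypothetical training samples x_1..x_n
sigma = 0.5                                # assumed known standard deviation

def log_likelihood(mu):
    # l(theta) = ln P(D | theta) = sum_k ln P(x_k | theta)
    return np.sum(norm.logpdf(D, loc=mu, scale=sigma))

for mu in (1.0, np.mean(D)):
    print(f"mu = {mu:.3f}  ->  log-likelihood = {log_likelihood(mu):.3f}")

# The candidate closer to the sample mean "agrees better" with the observed
# data, i.e. it yields a larger (log-)likelihood.
```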
• Optimal estimation
  • Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator:

    $\nabla_\theta = \left[ \frac{\partial}{\partial\theta_1}, \frac{\partial}{\partial\theta_2}, \ldots, \frac{\partial}{\partial\theta_p} \right]^t$

  • We define l(θ) as the log-likelihood function:
    l(θ) = ln P(D | θ)
  • New problem statement:
    determine the θ that maximizes the log-likelihood

    $\hat{\theta} = \arg\max_\theta\, l(\theta)$


The set of necessary conditions for an optimum is:

  $\nabla_\theta\, l = \sum_{k=1}^{n} \nabla_\theta \ln P(x_k \mid \theta) = 0$
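This stationarity condition can also be checked numerically. The sketch below (an illustration with invented data and an assumed known σ, not from the original slides) minimizes the negative log-likelihood with scipy.optimize.minimize and compares the result with the closed-form sample mean:

```python
# Sketch: solving grad_theta l(theta) = 0 numerically by minimizing -l(theta).
# The data and the known sigma are invented for illustration.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

D = np.array([2.1, 1.7, 2.4, 1.9, 2.6])
sigma = 0.5

def neg_log_likelihood(theta):
    mu = theta[0]
    return -np.sum(norm.logpdf(D, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0])
print("numerical ML estimate:   ", result.x[0])
print("closed-form (sample mean):", D.mean())
```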

• Example of a specific case: unknown μ
  • P(xk | μ) ~ N(μ, Σ)
    (samples are drawn from a multivariate normal population)

    $\ln P(x_k \mid \mu) = -\frac{1}{2}\ln\!\left[(2\pi)^d\,|\Sigma|\right] - \frac{1}{2}(x_k - \mu)^t\,\Sigma^{-1}(x_k - \mu)$

    and

    $\nabla_\mu \ln P(x_k \mid \mu) = \Sigma^{-1}(x_k - \mu)$

  • θ = μ, therefore the ML estimate of μ must satisfy:

    $\sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat{\mu}) = 0$

• Multiplying by Σ and rearranging, we obtain:

  $\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$

  This is just the arithmetic average of the training samples!

Conclusion:
If P(xk | ωj) (j = 1, 2, …, c) is supposed to be Gaussian in a
d-dimensional feature space, then we can estimate the vector
θ = (θ1, θ2, …, θc)^t and perform an optimal classification!


• ML Estimation: Gaussian Case, unknown μ and σ

  θ = (θ1, θ2) = (μ, σ²)

  $l = \ln P(x_k \mid \theta) = -\frac{1}{2}\ln 2\pi\theta_2 - \frac{1}{2\theta_2}(x_k - \theta_1)^2$

  $\nabla_\theta\, l = \begin{pmatrix} \dfrac{\partial}{\partial\theta_1}\,\ln P(x_k \mid \theta) \\[8pt] \dfrac{\partial}{\partial\theta_2}\,\ln P(x_k \mid \theta) \end{pmatrix} = 0$

  $\begin{cases} \dfrac{1}{\theta_2}(x_k - \theta_1) = 0 \\[8pt] -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^2}{2\theta_2^{\,2}} = 0 \end{cases}$

Summation over all n samples:

  $\begin{cases} \displaystyle\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2}\,(x_k - \hat{\theta}_1) = 0 & (1) \\[10pt] \displaystyle -\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2} + \sum_{k=1}^{n} \frac{(x_k - \hat{\theta}_1)^2}{\hat{\theta}_2^{\,2}} = 0 & (2) \end{cases}$

Combining (1) and (2), one obtains:

  $\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k \,; \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2$
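The same two formulas in code (a sketch with invented 1-D data, not from the slides); note that the ML variance divides by n, not n − 1:

```python
# Sketch: joint ML estimates for the unknown mean and variance of a 1-D Gaussian.
# Data are invented for illustration.
import numpy as np

x = np.array([2.1, 1.7, 2.4, 1.9, 2.6, 2.2, 1.8])
n = len(x)

mu_hat = x.sum() / n                        # mu_hat = (1/n) sum_k x_k
sigma2_hat = ((x - mu_hat) ** 2).sum() / n  # divides by n, not n - 1

print("mu_hat =", mu_hat)
print("sigma2_hat (ML) =", sigma2_hat)
print("np.var agrees:", np.var(x))  # np.var defaults to ddof=0, i.e. the 1/n (ML) form
```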

• Bias
  • The ML estimate for σ² is biased:

    $E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$

  • An elementary unbiased estimator for Σ is the sample covariance matrix:

    $C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^t$
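The sketch below (invented data, not from the slides) contrasts the biased ML estimate with the unbiased sample covariance; numpy exposes both forms through the bias argument of np.cov:

```python
# Sketch: biased ML covariance (divide by n) vs. unbiased sample covariance (divide by n-1).
# Data are invented for illustration; a small n makes the difference visible.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[2.0, 0.5],
                                 [0.5, 1.0]],
                            size=10)

ml_cov = np.cov(X, rowvar=False, bias=True)          # ML estimate: divides by n
unbiased_cov = np.cov(X, rowvar=False, bias=False)   # sample covariance C: divides by n-1

print("ML (biased) estimate:\n", ml_cov)
print("Unbiased sample covariance:\n", unbiased_cov)
# The two differ by the factor (n-1)/n; only the latter has expectation Sigma.
```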


• Appendix: ML Problem Statement
  • Let D = {x1, x2, …, xn}

    $P(x_1, \ldots, x_n \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta); \qquad |D| = n$

  • Our goal is to determine θ̂, the value of θ that
    makes this sample the most representative!


[Figure: the training set D (|D| = n) partitioned into class subsets D1, …, Dk, …, Dc; the samples of each subset Dj are drawn according to P(x | ωj) ~ N(μj, Σj).]

 = (1, 2, …, c)

Problem: find θ̂ such that:

  $\max_\theta P(D \mid \theta) = \max_\theta P(x_1, \ldots, x_n \mid \theta) = \max_\theta \prod_{k=1}^{n} P(x_k \mid \theta)$
