
MIMA Group


Chapter 3
Parameter Estimation

Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University


Contents

- Introduction
- Maximum-Likelihood Estimation
- Bayesian Estimation


Bayes Theorem

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\,P(\omega_i)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j)\,P(\omega_j)$$

- To compute the posterior probability P(ω_i | x), we need to know p(x | ω_i) and P(ω_i).

How can we get these values?
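As a small illustration of the theorem above, here is a minimal Python sketch. The two-class setup (Gaussian class-conditionals with these particular means, priors 0.6 and 0.4) is hypothetical, chosen only to make the computation concrete:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of the univariate Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

# Hypothetical two-class problem: Gaussian class-conditionals and priors.
priors = [0.6, 0.4]                         # P(omega_1), P(omega_2)
params = [(0.0, 1.0), (2.0, 1.0)]           # (mu_i, sigma_i) of p(x | omega_i)

def posterior(x):
    """P(omega_i | x) = p(x | omega_i) P(omega_i) / p(x)."""
    joint = [gauss_pdf(x, mu, s) * p for (mu, s), p in zip(params, priors)]
    evidence = sum(joint)                   # p(x) = sum_j p(x | omega_j) P(omega_j)
    return [j / evidence for j in joint]
```

At x = 1.0, midway between the two class means, the likelihoods cancel and the posterior reduces to the priors [0.6, 0.4].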


Samples

$$D = \{D_1, D_2, \ldots, D_c\}$$

The samples in D_j are drawn independently according to the probability law p(x | ω_j); that is, the examples in D_j are i.i.d. random variables (independent and identically distributed).

It is easy to compute the prior probability:

$$P(\omega_i) = \frac{|D_i|}{\sum_{j=1}^{c} |D_j|}$$

(Figure: the sample set partitioned into class subsets D_1, D_2, D_3.)
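The prior estimate above is just a ratio of class counts; a two-line sketch with hypothetical counts |D_1|, |D_2|, |D_3|:

```python
# Hypothetical class counts |D_1|, |D_2|, |D_3| for a c = 3 problem.
counts = [50, 30, 20]

# P(omega_i) = |D_i| / sum_j |D_j|: the fraction of training samples in class i.
total = sum(counts)
priors = [n / total for n in counts]
```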


Samples

- For the class-conditional pdf:
  - Case I: p(x | ω_j) has a known parametric form, e.g.

$$p(\mathbf{x} \mid \omega_j) \sim N(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j), \qquad \boldsymbol{\theta}_j = (\theta_1, \theta_2, \ldots, \theta_m)^T$$

  If x ∈ R^d, θ_j contains "d + d(d+1)/2" free parameters (d for μ_j, d(d+1)/2 for the symmetric Σ_j).
  - Case II: p(x | ω_j) has no parametric form: next chapter.


Goal

$$D = \{D_1, D_2, \ldots, D_c\}, \qquad p(\mathbf{x} \mid \omega_j) = p(\mathbf{x} \mid \boldsymbol{\theta}_j)$$

Use D_j to estimate the unknown parameter vector

$$\boldsymbol{\theta}_j = (\theta_1, \theta_2, \ldots, \theta_m)^T$$

(Figure: each class subset D_j yields its own estimate θ̂_j.)


Estimation Under Parametric Form

- Maximum-Likelihood Estimation: view the parameters as quantities whose values are fixed but unknown; estimate the parameter values by maximizing the likelihood (probability) of observing the actual examples.
- Bayesian Estimation: view the parameters as random variables having some known prior distribution; observation of the actual training examples transforms the parameters' prior into a posterior distribution (via Bayes rule).


Maximum-Likelihood Estimation

- Because each class is considered individually, the class subscript used before will be dropped.
- Now the problem becomes: given a sample set D whose elements are drawn independently from a population possessing a known parametric form, say p(x | θ), we want to choose the θ̂ under which D is most likely to occur.
Maximum-Likelihood Estimation (Cont.)

- Criterion of ML: let D = {x_1, x_2, ..., x_n}. By the independence assumption, we have

$$p(D \mid \boldsymbol{\theta}) = p(\mathbf{x}_1 \mid \boldsymbol{\theta})\,p(\mathbf{x}_2 \mid \boldsymbol{\theta}) \cdots p(\mathbf{x}_n \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$

- The likelihood function:

$$L(\boldsymbol{\theta} \mid D) = p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$

- The maximum-likelihood estimate:

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} L(\boldsymbol{\theta} \mid D)$$


Maximum-Likelihood Estimation (Cont.)

- Often we maximize the log-likelihood function instead:

$$l(\boldsymbol{\theta} \mid D) = \ln L(\boldsymbol{\theta} \mid D) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol{\theta})$$

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}\, l(\boldsymbol{\theta} \mid D) \iff \hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}\, L(\boldsymbol{\theta} \mid D)$$

Why? Because ln is strictly increasing, maximizing l is equivalent to maximizing L, and a sum of log-densities is easier to differentiate (and numerically far better behaved) than a product of densities.
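The equivalence above can be checked numerically. A minimal sketch, assuming a hypothetical set of five i.i.d. draws from N(μ, 1) with σ = 1 known, and a brute-force grid search instead of calculus:

```python
import math

samples = [1.2, 0.7, 1.9, 1.4, 0.8]         # hypothetical i.i.d. draws, sigma = 1 known

def likelihood(mu):
    """L(mu | D): a product of n densities."""
    L = 1.0
    for x in samples:
        L *= math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)
    return L

def log_likelihood(mu):
    """l(mu | D): a sum of n log-densities, easier to handle than the product."""
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi) for x in samples)

grid = [i / 1000.0 for i in range(-2000, 4001)]
mu_from_L = max(grid, key=likelihood)
mu_from_l = max(grid, key=log_likelihood)
# Both maximizers agree, since ln is strictly increasing.
```

Both grid searches land on the same value, which (anticipating the Gaussian case below) is the sample mean, 1.2.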


Maximum-Likelihood Estimation (Cont.)

- Find the extreme values using the methods of differential calculus.
- Gradient operator: let f(θ) be a continuous function, where θ = (θ_1, θ_2, ..., θ_n)^T:

$$\nabla_{\boldsymbol{\theta}} = \left( \frac{\partial}{\partial \theta_1}, \frac{\partial}{\partial \theta_2}, \ldots, \frac{\partial}{\partial \theta_n} \right)^T$$

- Find the extreme values by solving

$$\nabla_{\boldsymbol{\theta}} f = \mathbf{0}$$
The Gaussian Case I

- Case I: μ unknown, Σ known

$$p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$

$$L(\boldsymbol{\mu} \mid D) = p(D \mid \boldsymbol{\mu}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\mu}) = \frac{1}{(2\pi)^{nd/2} |\boldsymbol{\Sigma}|^{n/2}} \exp\!\left( -\frac{1}{2} \sum_{k=1}^{n} (\mathbf{x}_k - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu}) \right)$$

$$l(\boldsymbol{\mu} \mid D) = \ln L(\boldsymbol{\mu} \mid D) = -\ln\!\left( (2\pi)^{nd/2} |\boldsymbol{\Sigma}|^{n/2} \right) - \frac{1}{2} \sum_{k=1}^{n} (\mathbf{x}_k - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu})$$
The Gaussian Case I

Setting the gradient of the log-likelihood to zero:

$$\nabla_{\boldsymbol{\mu}}\, l(\boldsymbol{\mu} \mid D) = \sum_{k=1}^{n} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu}) = \mathbf{0}
\;\Longrightarrow\;
\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k \qquad \text{(the sample mean!)}$$

- Intuitive result: the maximum-likelihood estimate of the unknown μ is just the arithmetic average of the training samples, i.e. the sample mean.
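In code the estimate is one line. A minimal sketch with a hypothetical 2-D sample set (Σ known, so only μ is estimated):

```python
# Hypothetical 2-D training samples for one class (Sigma known).
samples = [(1.0, 2.0), (3.0, 0.0), (2.0, 4.0), (0.0, 2.0)]
n = len(samples)

# mu_hat = (1/n) sum_k x_k: the ML estimate is the per-coordinate average.
mu_hat = tuple(sum(x[d] for x in samples) / n for d in range(2))
```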


The Gaussian Case II

- Case II: both μ and σ² are unknown
- Consider the univariate case:

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \qquad \boldsymbol{\theta} = (\theta_1, \theta_2)^T = (\mu, \sigma^2)^T$$

$$L(\boldsymbol{\theta} \mid D) = p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(x_k \mid \boldsymbol{\theta}) = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\!\left( -\frac{\sum_{k=1}^{n} (x_k - \mu)^2}{2\sigma^2} \right)$$

$$l(\boldsymbol{\theta} \mid D) = \ln L(\boldsymbol{\theta} \mid D) = -\ln\!\left( (2\pi)^{n/2} \sigma^n \right) - \frac{1}{2\sigma^2} \sum_{k=1}^{n} (x_k - \mu)^2$$


The Gaussian Case II

$$l(\boldsymbol{\theta} \mid D) = -\ln\!\left( (2\pi)^{n/2} \theta_2^{\,n/2} \right) - \frac{1}{2\theta_2} \sum_{k=1}^{n} (x_k - \theta_1)^2$$

$$\nabla_{\boldsymbol{\theta}}\, l(\boldsymbol{\theta} \mid D) =
\begin{pmatrix}
\dfrac{1}{\theta_2} \displaystyle\sum_{k=1}^{n} (x_k - \theta_1) \\[2ex]
-\dfrac{n}{2\theta_2} + \dfrac{1}{2\theta_2^2} \displaystyle\sum_{k=1}^{n} (x_k - \theta_1)^2
\end{pmatrix} = \mathbf{0}$$

$$\hat{\mu} = \hat{\theta}_1 = \frac{1}{n} \sum_{k=1}^{n} x_k \qquad \text{(unbiased): the arithmetic average of the } n \text{ samples}$$

$$\hat{\sigma}^2 = \hat{\theta}_2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2 \qquad \text{(biased); in the multivariate case, } \frac{1}{n} \sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^T$$

- Unbiased estimator: E[θ̂] = θ. Asymptotically unbiased estimator: lim_{n→∞} E[θ̂] = θ.
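A minimal sketch of both estimates on a hypothetical univariate sample; note that the variance estimate divides by n, not n-1:

```python
# Hypothetical univariate draws with both mu and sigma^2 unknown.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(samples)

mu_hat = sum(samples) / n                                  # theta_1_hat: sample mean
var_hat = sum((x - mu_hat) ** 2 for x in samples) / n      # theta_2_hat: divides by n (biased)
```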
MLE for Normal Population

$$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k \qquad \text{(sample mean)}, \qquad E[\hat{\boldsymbol{\mu}}] = \boldsymbol{\mu}$$

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^T, \qquad E[\hat{\boldsymbol{\Sigma}}] = \frac{n-1}{n} \boldsymbol{\Sigma} \neq \boldsymbol{\Sigma} \quad \text{(biased)}$$

$$\mathbf{C} = \frac{1}{n-1} \sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^T \qquad \text{(sample covariance matrix)}, \qquad E[\mathbf{C}] = \boldsymbol{\Sigma}$$
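The bias factor (n-1)/n can be seen empirically. A hypothetical simulation in the univariate case (σ² = 1, n = 5, so the ML estimate should average out near 0.8 rather than 1.0):

```python
import random

# Empirical check: draw many size-n samples from N(0, 1) and average
# the ML variance estimate; it concentrates near (n-1)/n * sigma^2.
random.seed(0)
n, trials = 5, 20000
acc = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(xs) / n
    acc += sum((x - m) ** 2 for x in xs) / n               # ML estimate: divides by n
mean_biased = acc / trials
# Expect roughly (n-1)/n = 0.8, not 1.0; dividing by n-1 instead removes the bias.
```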




Bayesian Estimation

- Settings
  - The parametric form of the likelihood function for each category is known.
  - However, θ_j is considered a random variable instead of a fixed (but unknown) value.
- In this case, we can no longer make a single ML estimate θ̂ and then infer P(ω_i | x) based on P(ω_i) and p(x | ω_i).
- How can we proceed? Fully exploit the training examples!


Posterior Probabilities from Samples

$$P(\omega_i \mid \mathbf{x}, D) = \frac{P(\omega_i, \mathbf{x}, D)}{P(\mathbf{x}, D)} = \frac{P(\omega_i, \mathbf{x}, D)}{\sum_{j=1}^{c} P(\omega_j, \mathbf{x}, D)}$$

$$P(\omega_i, \mathbf{x}, D) = P(D)\,P(\omega_i, \mathbf{x} \mid D) = P(D)\,P(\omega_i \mid D)\,p(\mathbf{x} \mid \omega_i, D)$$

Assumptions (each class can be considered independently):

- P(ω_i | D) = P(ω_i)
- p(x | ω_i, D) = p(x | ω_i, D_i)

$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D_j)\,P(\omega_j)}$$
Problem Formulation

$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D_j)\,P(\omega_j)}$$

- The key problem is to determine p(x | ω_i, D_i). Treating each class independently, the problem becomes estimating p(x | D).
- This is the central problem of Bayesian learning.


Class-Conditional Density Estimation

- Assume p(x) is unknown but has a known fixed form with parameter vector θ, where θ is a random variable with respect to that parametric form.

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x}, \boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta} = \int p(\mathbf{x} \mid \boldsymbol{\theta}, D)\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta} = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

(The last step holds because x is independent of D given θ.)

- p(x | θ): the form of the distribution is assumed known; p(θ | D): the posterior density we want to estimate.
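The integral above can also be evaluated numerically. A sketch under hypothetical ingredients: a univariate Gaussian p(x | θ) with known σ, and an assumed Gaussian posterior p(θ | D) = N(μ_n, σ_n²) of the kind derived in the Gaussian case below; the trapezoidal grid is an assumption of this sketch, not part of the method:

```python
import math

def gauss(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (math.sqrt(2.0 * math.pi) * s)

# Hypothetical ingredients: known sigma, and a Gaussian posterior p(theta | D).
sigma = 1.0
mu_n, sigma_n = 1.0, 0.5

def predictive(x, lo=-10.0, hi=10.0, steps=4000):
    """Trapezoidal approximation of p(x | D) = integral of p(x|theta) p(theta|D) d(theta)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        t = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0    # trapezoid endpoint weights
        total += w * gauss(x, t, sigma) * gauss(t, mu_n, sigma_n)
    return total * h
```

In this all-Gaussian setting the integral also has a closed form (derived later in the chapter), so the quadrature can be checked against gauss(x, mu_n, math.sqrt(sigma**2 + sigma_n**2)).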
Bayesian Estimation: General Procedure

Phase I: p(θ | D) = ?


Bayesian Estimation: General Procedure

Phase II:

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

Phase III:

$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D_j)\,P(\omega_j)}$$


The Gaussian Case

- The univariate Gaussian: unknown μ

Phase I:

$$p(\boldsymbol{\theta} \mid D) \propto \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$$

From p(μ), p(x | μ), and D we obtain p(μ | D), where

$$p(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right), \qquad
p(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\!\left( -\frac{1}{2} \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2 \right)$$

- Other forms of prior pdf could be assumed as well.
The Gaussian Case

$$p(\mu \mid D) \propto \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)
= \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2} \left( \frac{x_k - \mu}{\sigma} \right)^2 \right) \cdot \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\!\left( -\frac{1}{2} \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2 \right)$$

$$= \alpha \exp\!\left\{ -\frac{1}{2} \left[ \sum_{k=1}^{n} \left( \frac{x_k - \mu}{\sigma} \right)^2 + \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2 \right] \right\}$$

$$= \alpha' \exp\!\left\{ -\frac{1}{2} \left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \mu^2 - 2 \left( \frac{1}{\sigma^2} \sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2} \right) \mu \right] \right\}$$


The Gaussian Case

- p(μ | D) is an exponential function of a quadratic function of μ; thus p(μ | D) is also normal: p(μ | D) ~ N(μ_n, σ_n²), i.e.

$$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left( -\frac{1}{2} \left( \frac{\mu - \mu_n}{\sigma_n} \right)^2 \right)
= \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left\{ -\frac{1}{2} \left[ \frac{1}{\sigma_n^2} \mu^2 - \frac{2\mu_n}{\sigma_n^2} \mu + \frac{\mu_n^2}{\sigma_n^2} \right] \right\}$$

- Compare with the form obtained above:

$$p(\mu \mid D) = \alpha' \exp\!\left\{ -\frac{1}{2} \left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \mu^2 - 2 \left( \frac{1}{\sigma^2} \sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2} \right) \mu \right] \right\}$$


The Gaussian Case

- Equating the coefficients in both forms, we have

$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\, \hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\, \mu_0, \qquad \hat{\mu}_n = \frac{1}{n} \sum_{k=1}^{n} x_k$$

$$\sigma_n^2 = \frac{\sigma_0^2\, \sigma^2}{n\sigma_0^2 + \sigma^2}$$
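The two formulas above are direct to compute. A minimal sketch with a hypothetical sample and a standard-normal prior (μ_0 = 0, σ_0² = 1), σ² = 1 known:

```python
# Hypothetical data and prior for the univariate Gaussian with known sigma^2.
samples = [1.2, 0.7, 1.9, 1.4, 0.8]
n = len(samples)
sigma2 = 1.0                    # known variance of p(x | mu)
mu0, sigma0_2 = 0.0, 1.0        # prior p(mu) ~ N(mu0, sigma0^2)

mu_hat_n = sum(samples) / n     # sample mean of the n observations

# Posterior parameters obtained by equating coefficients:
denom = n * sigma0_2 + sigma2
mu_n = (n * sigma0_2 / denom) * mu_hat_n + (sigma2 / denom) * mu0
sigma_n2 = sigma0_2 * sigma2 / denom
# As n grows, mu_n -> sample mean and sigma_n2 -> 0: the data overwhelm the prior.
```

Here the posterior mean lands between the prior mean 0 and the sample mean 1.2, weighted toward the data.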


The Gaussian Case

Phase II:

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

From p(μ | D) and p(x | μ) we obtain p(x | D), where

$$p(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right), \qquad p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)$$

What would p(x | D) look like in this case?
The Gaussian Case

$$p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu
= \int \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right) \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left( -\frac{1}{2} \left( \frac{\mu - \mu_n}{\sigma_n} \right)^2 \right) d\mu$$

$$= \frac{1}{2\pi\sigma\sigma_n} \exp\!\left( -\frac{1}{2} \frac{(x - \mu_n)^2}{\sigma^2 + \sigma_n^2} \right) \int \exp\!\left( -\frac{1}{2}\, \frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n^2} \left( \mu - \frac{\sigma_n^2 x + \sigma^2 \mu_n}{\sigma^2 + \sigma_n^2} \right)^2 \right) d\mu$$

- p(x | D) is an exponential function of a quadratic function of x; thus it is also a normal pdf. But with what parameters?
The Gaussian Case

Carrying out the integration (the remaining integral is a Gaussian integral that does not depend on x) gives

$$p(x \mid D) \sim N(\mu_n,\; \sigma^2 + \sigma_n^2)$$

The predictive density is centered at the posterior mean μ_n, and its variance is the known variance σ² plus the posterior uncertainty σ_n².
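The result can be checked by simulation: sampling μ from the posterior and then x from p(x | μ) draws x directly from p(x | D). A sketch with hypothetical parameter values:

```python
import random

# Monte Carlo check (hypothetical parameters): sample mu from the posterior
# N(mu_n, sigma_n^2), then x from N(mu, sigma^2); the resulting x is
# distributed as N(mu_n, sigma^2 + sigma_n^2).
random.seed(1)
mu_n, sigma_n, sigma = 1.0, 0.5, 1.0
N = 200_000
draws = []
for _ in range(N):
    mu = random.gauss(mu_n, sigma_n)       # mu ~ p(mu | D)
    draws.append(random.gauss(mu, sigma))  # x ~ p(x | mu)
mean = sum(draws) / N
var = sum((x - mean) ** 2 for x in draws) / N
# Expect mean near mu_n = 1.0 and variance near sigma^2 + sigma_n^2 = 1.25.
```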
The Gaussian Case

Phase III:

$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D_j)\,P(\omega_j)}$$


Summary

- Key issue
  - Estimate the prior and class-conditional pdf from the training set
  - Basic assumption on training examples: i.i.d.
- Two strategies for the key issue
  - Parametric form for the class-conditional pdf
    - Maximum likelihood estimation
    - Bayesian estimation
  - No parametric form for the class-conditional pdf


Summary

- Maximum likelihood estimation
  - Settings: parameters as fixed but unknown values
  - The objective function: the log-likelihood function
  - The gradient of the objective function should be zero
  - The Gaussian cases
- Bayesian estimation
  - Settings: parameters as random variables
  - General procedure: Phases I, II, III
  - The Gaussian case

Project 3.2


MIMA Group

Any Questions?