
MIMA Group


Chapter 3
Parameter Estimation

Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University


Contents

- Introduction
- Maximum-Likelihood Estimation
- Bayesian Estimation


Bayes Theorem

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\,P(\omega_i)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j)\,P(\omega_j)$$

- To compute the posterior probability P(ω_i | x), we need to know p(x | ω_i) and P(ω_i).

How can we get these values?
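As a small illustration of the theorem above, here is a minimal Python sketch. The two-class setup (Gaussian class-conditionals with these particular means, priors 0.6 and 0.4) is hypothetical, chosen only to make the computation concrete:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of the univariate Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

# Hypothetical two-class problem: Gaussian class-conditionals and priors.
priors = [0.6, 0.4]                         # P(omega_1), P(omega_2)
params = [(0.0, 1.0), (2.0, 1.0)]           # (mu_i, sigma_i) of p(x | omega_i)

def posterior(x):
    """P(omega_i | x) = p(x | omega_i) P(omega_i) / p(x)."""
    joint = [gauss_pdf(x, mu, s) * p for (mu, s), p in zip(params, priors)]
    evidence = sum(joint)                   # p(x) = sum_j p(x | omega_j) P(omega_j)
    return [j / evidence for j in joint]
```

At x = 1.0, midway between the two class means, the likelihoods cancel and the posterior reduces to the priors [0.6, 0.4].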


Samples

$$D = \{D_1, D_2, \ldots, D_c\}$$

The samples in D_j are drawn independently according to the probability law p(x | ω_j); that is, the examples in D_j are i.i.d. random variables (independent and identically distributed).

It is easy to compute the prior probability:

$$P(\omega_i) = \frac{|D_i|}{\sum_{j=1}^{c} |D_j|}$$

(Figure: the sample set partitioned into class subsets D_1, D_2, D_3.)
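The prior estimate above is just a ratio of class counts; a two-line sketch with hypothetical counts |D_1|, |D_2|, |D_3|:

```python
# Hypothetical class counts |D_1|, |D_2|, |D_3| for a c = 3 problem.
counts = [50, 30, 20]

# P(omega_i) = |D_i| / sum_j |D_j|: the fraction of training samples in class i.
total = sum(counts)
priors = [n / total for n in counts]
```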


Samples

- For the class-conditional pdf:
  - Case I: p(x | ω_j) has a known parametric form, e.g.

$$p(\mathbf{x} \mid \omega_j) \sim N(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j), \qquad \boldsymbol{\theta}_j = (\theta_1, \theta_2, \ldots, \theta_m)^T$$

  If x ∈ R^d, θ_j contains "d + d(d+1)/2" free parameters (d for μ_j, d(d+1)/2 for the symmetric Σ_j).
  - Case II: p(x | ω_j) has no parametric form: next chapter.


Goal

$$D = \{D_1, D_2, \ldots, D_c\}, \qquad p(\mathbf{x} \mid \omega_j) = p(\mathbf{x} \mid \boldsymbol{\theta}_j)$$

Use D_j to estimate the unknown parameter vector

$$\boldsymbol{\theta}_j = (\theta_1, \theta_2, \ldots, \theta_m)^T$$

(Figure: each class subset D_j yields its own estimate θ̂_j.)


Estimation Under Parametric Form

- Maximum-Likelihood Estimation: view the parameters as quantities whose values are fixed but unknown; estimate the parameter values by maximizing the likelihood (probability) of observing the actual examples.
- Bayesian Estimation: view the parameters as random variables having some known prior distribution; observation of the actual training examples transforms the parameters' prior into a posterior distribution (via Bayes rule).


Maximum-Likelihood Estimation

- Because each class is considered individually, the class subscript used before will be dropped.
- Now the problem becomes: given a sample set D whose elements are drawn independently from a population possessing a known parametric form, say p(x | θ), we want to choose the θ̂ under which D is most likely to occur.
Maximum-Likelihood Estimation (Cont.)

- Criterion of ML: let D = {x_1, x_2, ..., x_n}. By the independence assumption, we have

$$p(D \mid \boldsymbol{\theta}) = p(\mathbf{x}_1 \mid \boldsymbol{\theta})\,p(\mathbf{x}_2 \mid \boldsymbol{\theta}) \cdots p(\mathbf{x}_n \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$

- The likelihood function:

$$L(\boldsymbol{\theta} \mid D) = p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})$$

- The maximum-likelihood estimate:

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} L(\boldsymbol{\theta} \mid D)$$


Maximum-Likelihood Estimation (Cont.)

- Often we maximize the log-likelihood function instead:

$$l(\boldsymbol{\theta} \mid D) = \ln L(\boldsymbol{\theta} \mid D) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k \mid \boldsymbol{\theta})$$

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}\, l(\boldsymbol{\theta} \mid D) \iff \hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}\, L(\boldsymbol{\theta} \mid D)$$

Why? Because ln is strictly increasing, maximizing l is equivalent to maximizing L, and a sum of log-densities is easier to differentiate (and numerically far better behaved) than a product of densities.
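The equivalence above can be checked numerically. A minimal sketch, assuming a hypothetical set of five i.i.d. draws from N(μ, 1) with σ = 1 known, and a brute-force grid search instead of calculus:

```python
import math

samples = [1.2, 0.7, 1.9, 1.4, 0.8]         # hypothetical i.i.d. draws, sigma = 1 known

def likelihood(mu):
    """L(mu | D): a product of n densities."""
    L = 1.0
    for x in samples:
        L *= math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)
    return L

def log_likelihood(mu):
    """l(mu | D): a sum of n log-densities, easier to handle than the product."""
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi) for x in samples)

grid = [i / 1000.0 for i in range(-2000, 4001)]
mu_from_L = max(grid, key=likelihood)
mu_from_l = max(grid, key=log_likelihood)
# Both maximizers agree, since ln is strictly increasing.
```

Both grid searches land on the same value, which (anticipating the Gaussian case below) is the sample mean, 1.2.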


Maximum-Likelihood Estimation (Cont.)

- Find the extreme values using the methods of differential calculus.
- Gradient operator: let f(θ) be a continuous function, where θ = (θ_1, θ_2, ..., θ_n)^T:

$$\nabla_{\boldsymbol{\theta}} = \left( \frac{\partial}{\partial \theta_1}, \frac{\partial}{\partial \theta_2}, \ldots, \frac{\partial}{\partial \theta_n} \right)^T$$

- Find the extreme values by solving

$$\nabla_{\boldsymbol{\theta}} f = \mathbf{0}$$
The Gaussian Case I

- Case I: μ unknown, Σ known

$$p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$$

$$L(\boldsymbol{\mu} \mid D) = p(D \mid \boldsymbol{\mu}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\mu}) = \frac{1}{(2\pi)^{nd/2} |\boldsymbol{\Sigma}|^{n/2}} \exp\!\left( -\frac{1}{2} \sum_{k=1}^{n} (\mathbf{x}_k - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu}) \right)$$

$$l(\boldsymbol{\mu} \mid D) = \ln L(\boldsymbol{\mu} \mid D) = -\ln\!\left( (2\pi)^{nd/2} |\boldsymbol{\Sigma}|^{n/2} \right) - \frac{1}{2} \sum_{k=1}^{n} (\mathbf{x}_k - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu})$$
The Gaussian Case I

Setting the gradient of the log-likelihood to zero:

$$\nabla_{\boldsymbol{\mu}}\, l(\boldsymbol{\mu} \mid D) = \sum_{k=1}^{n} \boldsymbol{\Sigma}^{-1} (\mathbf{x}_k - \boldsymbol{\mu}) = \mathbf{0}
\;\Longrightarrow\;
\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k \qquad \text{(the sample mean!)}$$

- Intuitive result: the maximum-likelihood estimate of the unknown μ is just the arithmetic average of the training samples, i.e. the sample mean.
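In code the estimate is one line. A minimal sketch with a hypothetical 2-D sample set (Σ known, so only μ is estimated):

```python
# Hypothetical 2-D training samples for one class (Sigma known).
samples = [(1.0, 2.0), (3.0, 0.0), (2.0, 4.0), (0.0, 2.0)]
n = len(samples)

# mu_hat = (1/n) sum_k x_k: the ML estimate is the per-coordinate average.
mu_hat = tuple(sum(x[d] for x in samples) / n for d in range(2))
```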


The Gaussian Case II

- Case II: both μ and σ² are unknown
- Consider the univariate case:

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \qquad \boldsymbol{\theta} = (\theta_1, \theta_2)^T = (\mu, \sigma^2)^T$$

$$L(\boldsymbol{\theta} \mid D) = p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(x_k \mid \boldsymbol{\theta}) = \frac{1}{(2\pi)^{n/2} \sigma^n} \exp\!\left( -\frac{\sum_{k=1}^{n} (x_k - \mu)^2}{2\sigma^2} \right)$$

$$l(\boldsymbol{\theta} \mid D) = \ln L(\boldsymbol{\theta} \mid D) = -\ln\!\left( (2\pi)^{n/2} \sigma^n \right) - \frac{1}{2\sigma^2} \sum_{k=1}^{n} (x_k - \mu)^2$$


The Gaussian Case II

$$l(\boldsymbol{\theta} \mid D) = -\ln\!\left( (2\pi)^{n/2} \theta_2^{\,n/2} \right) - \frac{1}{2\theta_2} \sum_{k=1}^{n} (x_k - \theta_1)^2$$

$$\nabla_{\boldsymbol{\theta}}\, l(\boldsymbol{\theta} \mid D) =
\begin{pmatrix}
\dfrac{1}{\theta_2} \displaystyle\sum_{k=1}^{n} (x_k - \theta_1) \\[2ex]
-\dfrac{n}{2\theta_2} + \dfrac{1}{2\theta_2^2} \displaystyle\sum_{k=1}^{n} (x_k - \theta_1)^2
\end{pmatrix} = \mathbf{0}$$

$$\hat{\mu} = \hat{\theta}_1 = \frac{1}{n} \sum_{k=1}^{n} x_k \qquad \text{(unbiased): the arithmetic average of the } n \text{ samples}$$

$$\hat{\sigma}^2 = \hat{\theta}_2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2 \qquad \text{(biased); in the multivariate case, } \frac{1}{n} \sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^T$$

- Unbiased estimator: E[θ̂] = θ. Asymptotically unbiased estimator: lim_{n→∞} E[θ̂] = θ.
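A minimal sketch of both estimates on a hypothetical univariate sample; note that the variance estimate divides by n, not n-1:

```python
# Hypothetical univariate draws with both mu and sigma^2 unknown.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(samples)

mu_hat = sum(samples) / n                                  # theta_1_hat: sample mean
var_hat = sum((x - mu_hat) ** 2 for x in samples) / n      # theta_2_hat: divides by n (biased)
```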
MLE for Normal Population

$$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{k=1}^{n} \mathbf{x}_k \qquad \text{(sample mean)}, \qquad E[\hat{\boldsymbol{\mu}}] = \boldsymbol{\mu}$$

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n} \sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^T, \qquad E[\hat{\boldsymbol{\Sigma}}] = \frac{n-1}{n} \boldsymbol{\Sigma} \neq \boldsymbol{\Sigma} \quad \text{(biased)}$$

$$\mathbf{C} = \frac{1}{n-1} \sum_{k=1}^{n} (\mathbf{x}_k - \hat{\boldsymbol{\mu}})(\mathbf{x}_k - \hat{\boldsymbol{\mu}})^T \qquad \text{(sample covariance matrix)}, \qquad E[\mathbf{C}] = \boldsymbol{\Sigma}$$
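The bias factor (n-1)/n can be seen empirically. A hypothetical simulation in the univariate case (σ² = 1, n = 5, so the ML estimate should average out near 0.8 rather than 1.0):

```python
import random

# Empirical check: draw many size-n samples from N(0, 1) and average
# the ML variance estimate; it concentrates near (n-1)/n * sigma^2.
random.seed(0)
n, trials = 5, 20000
acc = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(xs) / n
    acc += sum((x - m) ** 2 for x in xs) / n               # ML estimate: divides by n
mean_biased = acc / trials
# Expect roughly (n-1)/n = 0.8, not 1.0; dividing by n-1 instead removes the bias.
```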




Bayesian Estimation

- Settings
  - The parametric form of the likelihood function for each category is known.
  - However, θ_j is considered a random variable instead of a fixed (but unknown) value.
- In this case, we can no longer make a single ML estimate θ̂ and then infer P(ω_i | x) based on P(ω_i) and p(x | ω_i).
- How can we proceed? Fully exploit the training examples!


Posterior Probabilities from Samples

$$P(\omega_i \mid \mathbf{x}, D) = \frac{P(\omega_i, \mathbf{x}, D)}{P(\mathbf{x}, D)} = \frac{P(\omega_i, \mathbf{x}, D)}{\sum_{j=1}^{c} P(\omega_j, \mathbf{x}, D)}$$

$$P(\omega_i, \mathbf{x}, D) = P(D)\,P(\omega_i, \mathbf{x} \mid D) = P(D)\,P(\omega_i \mid D)\,p(\mathbf{x} \mid \omega_i, D)$$

Assumptions (each class can be considered independently):

- P(ω_i | D) = P(ω_i)
- p(x | ω_i, D) = p(x | ω_i, D_i)

$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D_j)\,P(\omega_j)}$$
Problem Formulation

$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D_j)\,P(\omega_j)}$$

- The key problem is to determine p(x | ω_i, D_i). Treating each class independently, the problem becomes estimating p(x | D).
- This is the central problem of Bayesian learning.


Class-Conditional Density Estimation

- Assume p(x) is unknown but has a known fixed form with parameter vector θ, where θ is a random variable with respect to that parametric form.

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x}, \boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta} = \int p(\mathbf{x} \mid \boldsymbol{\theta}, D)\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta} = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

(The last step holds because x is independent of D given θ.)

- p(x | θ): the form of the distribution is assumed known; p(θ | D): the posterior density we want to estimate.
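The integral above can also be evaluated numerically. A sketch under hypothetical ingredients: a univariate Gaussian p(x | θ) with known σ, and an assumed Gaussian posterior p(θ | D) = N(μ_n, σ_n²) of the kind derived in the Gaussian case below; the trapezoidal grid is an assumption of this sketch, not part of the method:

```python
import math

def gauss(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (math.sqrt(2.0 * math.pi) * s)

# Hypothetical ingredients: known sigma, and a Gaussian posterior p(theta | D).
sigma = 1.0
mu_n, sigma_n = 1.0, 0.5

def predictive(x, lo=-10.0, hi=10.0, steps=4000):
    """Trapezoidal approximation of p(x | D) = integral of p(x|theta) p(theta|D) d(theta)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        t = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0    # trapezoid endpoint weights
        total += w * gauss(x, t, sigma) * gauss(t, mu_n, sigma_n)
    return total * h
```

In this all-Gaussian setting the integral also has a closed form (derived later in the chapter), so the quadrature can be checked against gauss(x, mu_n, math.sqrt(sigma**2 + sigma_n**2)).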
Bayesian Estimation: General Procedure

Phase I: p(θ | D) = ?


Bayesian Estimation: General Procedure

Phase II:

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

Phase III:

$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D_j)\,P(\omega_j)}$$


The Gaussian Case

- The univariate Gaussian: unknown μ

Phase I:

$$p(\boldsymbol{\theta} \mid D) \propto \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$$

From p(μ), p(x | μ), and D we obtain p(μ | D), where

$$p(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right), \qquad
p(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\!\left( -\frac{1}{2} \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2 \right)$$

- Other forms of prior pdf could be assumed as well.
The Gaussian Case

$$p(\mu \mid D) \propto \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)
= \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2} \left( \frac{x_k - \mu}{\sigma} \right)^2 \right) \cdot \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\!\left( -\frac{1}{2} \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2 \right)$$

$$= \alpha \exp\!\left\{ -\frac{1}{2} \left[ \sum_{k=1}^{n} \left( \frac{x_k - \mu}{\sigma} \right)^2 + \left( \frac{\mu - \mu_0}{\sigma_0} \right)^2 \right] \right\}$$

$$= \alpha' \exp\!\left\{ -\frac{1}{2} \left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \mu^2 - 2 \left( \frac{1}{\sigma^2} \sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2} \right) \mu \right] \right\}$$


The Gaussian Case

- p(μ | D) is an exponential function of a quadratic function of μ; thus p(μ | D) is also normal: p(μ | D) ~ N(μ_n, σ_n²), i.e.

$$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left( -\frac{1}{2} \left( \frac{\mu - \mu_n}{\sigma_n} \right)^2 \right)
= \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left\{ -\frac{1}{2} \left[ \frac{1}{\sigma_n^2} \mu^2 - \frac{2\mu_n}{\sigma_n^2} \mu + \frac{\mu_n^2}{\sigma_n^2} \right] \right\}$$

- Compare with the form obtained above:

$$p(\mu \mid D) = \alpha' \exp\!\left\{ -\frac{1}{2} \left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right) \mu^2 - 2 \left( \frac{1}{\sigma^2} \sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2} \right) \mu \right] \right\}$$


The Gaussian Case

- Equating the coefficients in both forms, we have

$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\, \hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\, \mu_0, \qquad \hat{\mu}_n = \frac{1}{n} \sum_{k=1}^{n} x_k$$

$$\sigma_n^2 = \frac{\sigma_0^2\, \sigma^2}{n\sigma_0^2 + \sigma^2}$$
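The two formulas above are direct to compute. A minimal sketch with a hypothetical sample and a standard-normal prior (μ_0 = 0, σ_0² = 1), σ² = 1 known:

```python
# Hypothetical data and prior for the univariate Gaussian with known sigma^2.
samples = [1.2, 0.7, 1.9, 1.4, 0.8]
n = len(samples)
sigma2 = 1.0                    # known variance of p(x | mu)
mu0, sigma0_2 = 0.0, 1.0        # prior p(mu) ~ N(mu0, sigma0^2)

mu_hat_n = sum(samples) / n     # sample mean of the n observations

# Posterior parameters obtained by equating coefficients:
denom = n * sigma0_2 + sigma2
mu_n = (n * sigma0_2 / denom) * mu_hat_n + (sigma2 / denom) * mu0
sigma_n2 = sigma0_2 * sigma2 / denom
# As n grows, mu_n -> sample mean and sigma_n2 -> 0: the data overwhelm the prior.
```

Here the posterior mean lands between the prior mean 0 and the sample mean 1.2, weighted toward the data.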


The Gaussian Case

Phase II:

$$p(\mathbf{x} \mid D) = \int p(\mathbf{x} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid D)\, d\boldsymbol{\theta}$$

From p(μ | D) and p(x | μ) we obtain p(x | D), where

$$p(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right), \qquad p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)$$

What would p(x | D) look like in this case?
The Gaussian Case

$$p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu
= \int \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right) \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left( -\frac{1}{2} \left( \frac{\mu - \mu_n}{\sigma_n} \right)^2 \right) d\mu$$

$$= \frac{1}{2\pi\sigma\sigma_n} \exp\!\left( -\frac{1}{2} \frac{(x - \mu_n)^2}{\sigma^2 + \sigma_n^2} \right) \int \exp\!\left( -\frac{1}{2}\, \frac{\sigma^2 + \sigma_n^2}{\sigma^2 \sigma_n^2} \left( \mu - \frac{\sigma_n^2 x + \sigma^2 \mu_n}{\sigma^2 + \sigma_n^2} \right)^2 \right) d\mu$$

- p(x | D) is an exponential function of a quadratic function of x; thus it is also a normal pdf. But with what parameters?
The Gaussian Case

Carrying out the integration (the remaining integral is a Gaussian integral that does not depend on x) gives

$$p(x \mid D) \sim N(\mu_n,\; \sigma^2 + \sigma_n^2)$$

The predictive density is centered at the posterior mean μ_n, and its variance is the known variance σ² plus the posterior uncertainty σ_n².
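The result can be checked by simulation: sampling μ from the posterior and then x from p(x | μ) draws x directly from p(x | D). A sketch with hypothetical parameter values:

```python
import random

# Monte Carlo check (hypothetical parameters): sample mu from the posterior
# N(mu_n, sigma_n^2), then x from N(mu, sigma^2); the resulting x is
# distributed as N(mu_n, sigma^2 + sigma_n^2).
random.seed(1)
mu_n, sigma_n, sigma = 1.0, 0.5, 1.0
N = 200_000
draws = []
for _ in range(N):
    mu = random.gauss(mu_n, sigma_n)       # mu ~ p(mu | D)
    draws.append(random.gauss(mu, sigma))  # x ~ p(x | mu)
mean = sum(draws) / N
var = sum((x - mean) ** 2 for x in draws) / N
# Expect mean near mu_n = 1.0 and variance near sigma^2 + sigma_n^2 = 1.25.
```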
The Gaussian Case

Phase III:

$$P(\omega_i \mid \mathbf{x}, D) = \frac{p(\mathbf{x} \mid \omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, D_j)\,P(\omega_j)}$$


Summary

- Key issue
  - Estimate the prior and class-conditional pdf from the training set
  - Basic assumption on training examples: i.i.d.
- Two strategies for the key issue
  - Parametric form for the class-conditional pdf
    - Maximum likelihood estimation
    - Bayesian estimation
  - No parametric form for the class-conditional pdf


Summary

- Maximum likelihood estimation
  - Settings: parameters as fixed but unknown values
  - The objective function: the log-likelihood function
  - The gradient of the objective function should be zero
  - The Gaussian cases
- Bayesian estimation
  - Settings: parameters as random variables
  - General procedure: Phases I, II, III
  - The Gaussian case

Project 3.2


MIMA Group

Any Questions?