
Maximum Likelihood Estimation
Prof. Nicholas Zabaras

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

August 31, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


Contents
• The likelihood function, frequentist vs. Bayesian approaches

• MLE estimator, MLE for the univariate Gaussian

• Biased and unbiased estimators, MLE estimates for μ, σ² in a Gaussian

• MLE for the Poisson distribution, MLE for the Multinomial Distribution

• MLE and Least Squares

• MLE for the Multivariate Gaussian

• Sequential MLE Estimation for the Univariate Gaussian Distribution and Introduction to the Robbins-Monro Algorithm, Sequential MLE for the Multivariate Gaussian
Goals
• The goals of this lecture are:

  • Understand how to compute maximum likelihood parameter estimators

  • Learn about biased and unbiased estimators

  • Familiarize ourselves with the MLE estimates of mean and variance in the univariate and multivariate Gaussian distributions

  • Learn how to sequentially compute MLE estimates

  • Learn about the Robbins-Monro algorithm



References
• Following closely Chris Bishop's PRML book, Chapter 2

• Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 2



The Likelihood Function
• Consider Bayes' theorem

    p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}, \qquad p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta

• The quantity 𝑝(𝒟|𝜃) on the right-hand side of Bayes' theorem is evaluated for the observed data set 𝒟 and can be viewed as a function of the parameter vector 𝜃, in which case it is called the likelihood function.

• Given this definition of likelihood, we can state Bayes' theorem in words:

    posterior ∝ likelihood × prior



Frequentist Versus Bayesian Paradigms
• The likelihood 𝑝(𝒟|𝜃) is essential in both the Bayesian and the frequentist approaches, but it is used in different roles.

• In the frequentist approach, 𝜃 is a fixed parameter computed by an estimator (e.g. the maximum likelihood estimator). Error bars on this point estimate are obtained by considering the distribution of all possible data sets 𝒟 (e.g. the variability of predictions between different bootstrap data sets).

• In the Bayesian approach, there is only one data set 𝒟 (the observed one), and the uncertainty in 𝜃 is expressed through an appropriate prior and the resulting posterior probabilities over 𝜃.



Maximum Likelihood Estimator (MLE)
• Consider the following parametric problem:

    X \sim \pi(x) = \pi(x \mid \theta), \qquad \theta \in \mathbb{R}^k

• Assume that the observations 𝑥_𝑗 are obtained independently, i.e. that 𝑋_1, 𝑋_2, …, 𝑋_𝑁 are i.i.d. and 𝑥_𝑗 is a realization of 𝑋_𝑗.

• Independence:

    \pi(x_1, x_2, \ldots, x_N \mid \theta) = \pi(x_1 \mid \theta)\, \pi(x_2 \mid \theta) \cdots \pi(x_N \mid \theta)

  or, briefly,

    \pi(\mathcal{D} \mid \theta) = \prod_{j=1}^{N} \pi(x_j \mid \theta), \qquad \text{where } \mathcal{D} = \{x_1, x_2, \ldots, x_N\}



Maximum Likelihood Estimator (MLE)
• The maximum likelihood estimator (MLE) of 𝜃 is the parameter value that maximizes the probability of the observed outcome:

    \theta_{ML} = \arg\max_{\theta} \prod_{j=1}^{N} \pi(x_j \mid \theta)

• Define the negative log-likelihood as

    L(\mathcal{D} \mid \theta) = -\log \pi(\mathcal{D} \mid \theta)

• The minimizer of L(𝒟|𝜃) is the maximizer of 𝜋(𝒟|𝜃).
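
As an added illustration (not part of the original slides), the short Python sketch below finds 𝜃_ML numerically by minimizing the negative log-likelihood of a univariate Gaussian on synthetic data; the data set, seed, and log-variance parametrization are assumptions made for the example.

    import numpy as np
    from scipy.optimize import minimize

    # Illustrative sketch: obtain theta_ML by minimizing L(D|theta) = -log pi(D|theta)
    # for a univariate Gaussian with theta = (mu, log sigma^2).
    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=1.5, size=500)   # assumed synthetic data set D

    def neg_log_likelihood(theta, x):
        mu, log_var = theta                           # log-variance keeps sigma^2 > 0
        var = np.exp(log_var)
        return 0.5 * np.sum((x - mu) ** 2) / var + 0.5 * len(x) * np.log(2 * np.pi * var)

    res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
    mu_ml, var_ml = res.x[0], np.exp(res.x[1])
    print(mu_ml, var_ml)   # matches np.mean(data), np.var(data): the closed-form MLE derived next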



MLE – Gaussian Model
 For a Gaussian model,
1  1 2 
 ( x |  , ) 
2
exp   2 ( x   )  ,    2 
2 2
 2   

the Likelihood function is given as follows:


N
1  1 N
2
exp   L(D |  )     ( x j |  )  exp   ( x   ) 
 2 2   2 2
N /2 j 1
j 1 j 1 
 1 N
N 
 exp    ( x j  1 )2  log(2 2 )  
 2 2 j 1 2 

1 N N
L( D |  )  
2 2 j 1
( x j  1 ) 2

2
log(2 2 )



MLE – Gaussian Model
• The gradient of the negative log-likelihood is then:

    \nabla_\theta L(\mathcal{D} \mid \theta) = \begin{pmatrix} \partial L / \partial \theta_1 \\ \partial L / \partial \theta_2 \end{pmatrix} = \begin{pmatrix} -\dfrac{1}{\theta_2} \sum_{j=1}^{N} (x_j - \theta_1) \\[2mm] -\dfrac{1}{2\theta_2^2} \sum_{j=1}^{N} (x_j - \theta_1)^2 + \dfrac{N}{2\theta_2} \end{pmatrix} = 0

• This gives:

    \mu_{mle} = \theta_{ML,1} = \frac{1}{N} \sum_{j=1}^{N} x_j, \qquad \sigma^2_{mle} = \theta_{ML,2} = \frac{1}{N} \sum_{j=1}^{N} \big( x_j - \theta_{ML,1} \big)^2

• These estimates agree with what we predicted in an earlier lecture using the law of large numbers.
MLE for the Univariate Gaussian
• So, for the Gaussian distribution

    \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)

  the likelihood function is

    p(\mathcal{D} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)

  and the log-likelihood takes the form*

    \ln p(\mathcal{D} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln\sigma^2 - \frac{N}{2} \ln(2\pi)

• The maximum likelihood solution is

    \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2

* We often work with the log-likelihood to avoid underflow (taking products of small probabilities) and to simplify the algebra.



Datasets: CMBData
• CMBdata: spectral representation of the cosmological microwave background (CMB), i.e. electromagnetic radiation from photons dating back to 300,000 years after the Big Bang, expressed as the difference in apparent temperature from the mean temperature.

• (Figure: histogram of the CMBdata samples together with the Gaussian constructed from the MLE-based estimates of 𝜇 and σ², shown with the solid line.)

• Matlab implementation: Normal estimation.

• From Bayesian Core, J.-M. Marin and C.P. Robert, Chapter 2 (available online).



Unbiased Estimators
• An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter.

• If 𝑥_1, 𝑥_2, …, 𝑥_𝑁 ~ (i.i.d.) 𝒩(𝜇, σ²), then

    \mathbb{E}\big[ \mu_{mle} \big] = \mathbb{E}\!\left[ \frac{1}{N} \sum_{i=1}^{N} x_i \right] = \mu

• Thus 𝜇_mle is an unbiased estimator.



Biased Estimators
• An estimator of a parameter is biased if the expected value of the estimate is different from the true value of the parameter.

• If 𝑥_1, 𝑥_2, …, 𝑥_𝑁 ~ (i.i.d.) 𝒩(𝜇, σ²), then

    \mathbb{E}\big[ \sigma^2_{mle} \big] = \mathbb{E}\!\left[ \frac{1}{N} \sum_{i=1}^{N} \big( x_i - \mu_{mle} \big)^2 \right] = \mathbb{E}\!\left[ \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \frac{1}{N} \sum_{j=1}^{N} x_j \right)^{2} \right] = \left( 1 - \frac{1}{N} \right) \sigma^2 \neq \sigma^2

• Thus σ²_mle is a biased estimator.



MLE for a Gaussian Distribution
    \mu_{ML} = \frac{1}{N} \sum_{i=1}^{N} x_i \;\; \text{(sample mean)}, \qquad \sigma^2_{ML} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_{ML})^2 \;\; \text{(sample variance w.r.t. the ML mean, not the exact mean)}

• The maximum likelihood solutions 𝜇_ML, σ²_ML are functions of the data set values 𝑥_1, . . . , 𝑥_𝑁. Consider the expectations of these quantities with respect to the data set values, which come from a Gaussian.

• Using the point estimates above, we showed that:

    \mathbb{E}[\mu_{ML}] = \mu, \qquad \mathbb{E}\big[\sigma^2_{ML}\big] = \frac{N-1}{N}\, \sigma^2

  In this derivation you need to use  \mathbb{E}[x_i x_j] = \mathbb{E}[x_i]\, \mathbb{E}[x_j] = \mu^2  for  i \neq j  and  \mathbb{E}[x_i^2] = \mu^2 + \sigma^2.

• The MLE approach thus underestimates the variance (bias) – this is at the root of the over-fitting problem.
Unbiased Estimate of Variance
• If 𝑥_1, 𝑥_2, …, 𝑥_𝑁 ~ (i.i.d.) 𝒩(𝜇, σ²), then

    \mathbb{E}\big[ \sigma^2_{mle} \big] = \mathbb{E}\!\left[ \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \frac{1}{N} \sum_{j=1}^{N} x_j \right)^{2} \right] = \left( 1 - \frac{1}{N} \right) \sigma^2 \neq \sigma^2

• So define

    \sigma^2_{unbiased} = \frac{\sigma^2_{mle}}{1 - \frac{1}{N}} = \frac{1}{N-1} \sum_{i=1}^{N} \big( x_i - \mu_{mle} \big)^2, \qquad \mathbb{E}\big[ \sigma^2_{unbiased} \big] = \sigma^2

• The two estimates are nearly the same for large 𝑁.
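
A small Monte Carlo sketch (added for illustration; the sample size, number of trials, and seed are arbitrary assumptions) that checks the bias factor (1 − 1/N) and the effect of the N − 1 correction:

    import numpy as np

    # Monte Carlo check of E[sigma^2_mle] = (1 - 1/N) sigma^2 (illustrative sketch).
    rng = np.random.default_rng(1)
    mu_true, var_true, N, trials = 0.0, 4.0, 5, 200_000

    x = rng.normal(mu_true, np.sqrt(var_true), size=(trials, N))
    var_mle = x.var(axis=1, ddof=0)   # divides by N   -> biased MLE estimator
    var_unb = x.var(axis=1, ddof=1)   # divides by N-1 -> unbiased estimator

    print(var_mle.mean())             # ~ (1 - 1/N) * var_true = 3.2
    print(var_unb.mean())             # ~ var_true = 4.0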



Bias in MLE
• In the schematic from Bishop's PRML, we consider 3 cases, each with 2 data points extracted from the true Gaussian. (Figure: the true Gaussian together with the MLE Gaussian fitted to each pair of data points.)

• The mean of the three distributions predicted via MLE (i.e. averaged over the data) is correct.

• However, the variance is underestimated, since it is a variance with respect to the sample mean and NOT the true mean.
Poisson Distribution
• Recall the Poisson (discrete) distribution for 𝑁 ∈ {0, 1, 2, …}:

    P(N = n) = \mathcal{P}_{Poisson}(n \mid \theta) = e^{-\theta}\, \frac{\theta^n}{n!}

• The mean and the variance are both equal to 𝜃:

    \mathbb{E}[N] = \sum_{n=0}^{\infty} n\, \mathcal{P}_{Poisson}(n \mid \theta) = \theta, \qquad \mathbb{E}\big[ (N - \theta)^2 \big] = \theta



MLE – Poisson Model
• Consider the following parametric model:

    \pi(n \mid \theta) = e^{-\theta}\, \frac{\theta^n}{n!}

• We sample independently  \mathcal{D} = \{ n_1, n_2, \ldots, n_N \}, \; n_k \in \mathbb{N}.  The likelihood is:

    \pi(\mathcal{D} \mid \theta) = \prod_{k=1}^{N} \pi(n_k \mid \theta) = e^{-N\theta} \prod_{k=1}^{N} \frac{\theta^{n_k}}{n_k!}

• The negative log-likelihood function is then:

    L(\mathcal{D} \mid \theta) = -\log \pi(\mathcal{D} \mid \theta) = \sum_{k=1}^{N} \big( \theta - n_k \log\theta + \log n_k! \big)

• Taking the derivative with respect to 𝜃 and setting it to zero:

    \frac{\partial}{\partial\theta} L(\mathcal{D} \mid \theta) = \sum_{k=1}^{N} \left( 1 - \frac{n_k}{\theta} \right) = 0 \quad\Rightarrow\quad \theta_{ML} = \frac{1}{N} \sum_{k=1}^{N} n_k

• However, note that the Law of Large Numbers predicts that the sample variance of the counts also converges to 𝜃, consistent with 𝜃_ML:

    \operatorname{var}(N) = \theta \;\approx\; \frac{1}{N} \sum_{k=1}^{N} \left( n_k - \frac{1}{N} \sum_{j=1}^{N} n_j \right)^{2}
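
An illustrative numeric check of this result (the synthetic counts below are assumed): the MLE of the Poisson rate is simply the sample mean of the counts, and the sample variance provides a consistent alternative estimate, since the mean and variance of a Poisson are both 𝜃.

    import numpy as np

    # Illustrative sketch: MLE of the Poisson rate is the sample mean of the counts.
    rng = np.random.default_rng(2)
    theta_true = 3.7
    counts = rng.poisson(theta_true, size=1000)   # synthetic observations n_1, ..., n_N

    theta_ml = counts.mean()                      # theta_ML = (1/N) sum_k n_k
    print(theta_ml)                               # close to theta_true
    print(counts.var())                           # also close to theta_true (mean = variance = theta)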
MLE – Poisson Model
• Assume that 𝜃 is known a priori to be large. In this case, we can use the Gaussian approximation of the Poisson distribution (a result derived in an earlier lecture):

    \prod_{j=1}^{N} \pi_{Poisson}(n_j \mid \theta) \approx \left( \frac{1}{2\pi\theta} \right)^{N/2} \exp\!\left( -\frac{1}{2\theta} \sum_{j=1}^{N} (n_j - \theta)^2 \right) = \left( \frac{1}{2\pi} \right)^{N/2} \exp\!\left( -\frac{1}{2} \left[ \frac{1}{\theta} \sum_{j=1}^{N} (n_j - \theta)^2 + N \log\theta \right] \right)

    L(\mathcal{D} \mid \theta) \simeq \frac{1}{2} \left[ \frac{1}{\theta} \sum_{j=1}^{N} (n_j - \theta)^2 + N \log\theta \right]

    \theta_{ML}: \quad \frac{\partial}{\partial\theta} L(\mathcal{D} \mid \theta) = \frac{1}{2\theta^2} \left( N\theta^2 + N\theta - \sum_{j=1}^{N} n_j^2 \right) = 0

• This gives an approximation for 𝜃_ML,

    \theta_{ML} \approx \left( \frac{1}{4} + \frac{1}{N} \sum_{j=1}^{N} n_j^2 \right)^{1/2} - \frac{1}{2}

  (compare with the result  \theta_{ML} = \frac{1}{N} \sum_{j=1}^{N} n_j  from the exact density).



MLE for the Multinomial Distribution
• Suppose a 𝐾-dimensional multinomial ℳ(𝜇_1, 𝜇_2, …, 𝜇_𝐾) with 𝒙_1, 𝒙_2, …, 𝒙_𝑁 ~ (i.i.d.)

    p(\boldsymbol{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}, \qquad \sum_{k=1}^{K} \mu_k = 1

• Let  m_k = \sum_{n=1}^{N} x_{nk}.  What is the MLE of 𝝁 = (𝜇_1, 𝜇_2, …, 𝜇_𝐾)? The likelihood is:

    \mathcal{M}(m_1, m_2, \ldots, m_K \mid \mu_1, \mu_2, \ldots, \mu_K) = \binom{N}{m_1, m_2, \ldots, m_K} \mu_1^{m_1} \mu_2^{m_2} \cdots \mu_K^{m_K} = \frac{N!}{\prod_i m_i!}\, \mu_1^{m_1} \mu_2^{m_2} \cdots \mu_K^{m_K}

• The penalized log-likelihood, with a Lagrange multiplier enforcing the constraint, is:

    \ell(\mu_1, \mu_2, \ldots, \mu_K) = \log N! - \sum_{i=1}^{K} \log m_i! + \sum_{i=1}^{K} m_i \log\mu_i + \lambda \left( 1 - \sum_{i=1}^{K} \mu_i \right)

• Differentiation w.r.t. 𝜇_𝑖 and enforcing  \sum_{k=1}^{K} \mu_k = 1  gives the expected result:

    \mu_i^{mle} = \frac{m_i}{N}, \qquad i = 1, \ldots, K
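
A short numerical sketch of this closed-form result (the probabilities and sample size below are assumed for illustration):

    import numpy as np

    # Illustrative sketch: MLE of the multinomial parameters, mu_k = m_k / N.
    rng = np.random.default_rng(3)
    mu_true = np.array([0.2, 0.5, 0.3])        # assumed true probabilities, K = 3
    N = 10_000

    X = rng.multinomial(1, mu_true, size=N)    # N one-hot draws x_n (1-of-K coding)
    m = X.sum(axis=0)                          # counts m_k = sum_n x_nk
    mu_mle = m / N                             # closed-form MLE from the Lagrangian
    print(mu_mle)                              # close to mu_true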
MLE and Weighted Least Squares
• Consider a multivariate Gaussian model,

    X \sim \mathcal{N}(x_0, \Sigma)

  where  x_0 \in \mathbb{R}^n  is unknown and  \Sigma \in \mathbb{R}^{n \times n}  is a known symmetric positive definite matrix.

• Assume that  x_0  depends on hidden parameters  z \in \mathbb{R}^k  through a linear equation (model reduction approach, 𝑘 ≪ 𝑛):

    x_0 = A z, \qquad A \in \mathbb{R}^{n \times k}, \quad z \in \mathbb{R}^k

• Every time you introduce model reduction, you also introduce model errors.

• We consider "the inverse problem" of computing 𝒛 from realizations of 𝑿.
MLE and Weighted Least Squares
• Our problem can also be written by considering noisy observations:

    X = A z + E, \qquad E \sim \mathcal{N}(0, \Sigma)

  Note that:

    \mathbb{E}[X] = A z + \mathbb{E}[E] = A z = x_0, \qquad \operatorname{cov}(X) = \mathbb{E}\big[ (X - \mathbb{E}X)(X - \mathbb{E}X)^T \big] = \mathbb{E}\big[ E E^T \big] = \Sigma

• The probability density of 𝑋 given 𝑧 is:

    \pi(x \mid z) = \frac{1}{(2\pi)^{n/2} \det(\Sigma)^{1/2}} \exp\!\left( -\frac{1}{2} (x - A z)^T \Sigma^{-1} (x - A z) \right)

• Assume independent observations  \mathcal{D} = \{ x_1, x_2, \ldots, x_N \}, \; x_j \in \mathbb{R}^n.

• The likelihood function is then

    \prod_{j=1}^{N} \pi(x_j \mid z) \;\sim\; \exp\!\left( -\frac{1}{2} \sum_{j=1}^{N} (x_j - A z)^T \Sigma^{-1} (x_j - A z) \right)

• Now minimize 𝐿(𝒟|𝑧):

    L(\mathcal{D} \mid z) = \frac{1}{2} \sum_{j=1}^{N} (x_j - A z)^T \Sigma^{-1} (x_j - A z) = \frac{N}{2}\, z^T \big( A^T \Sigma^{-1} A \big) z - z^T A^T \Sigma^{-1} \left( \sum_{j=1}^{N} x_j \right) + \frac{1}{2} \sum_{j=1}^{N} x_j^T \Sigma^{-1} x_j
MLE and Weighted Least Squares
    L(\mathcal{D} \mid z) = \frac{1}{2} \sum_{j=1}^{N} (x_j - A z)^T \Sigma^{-1} (x_j - A z) = \frac{N}{2}\, z^T \big( A^T \Sigma^{-1} A \big) z - z^T A^T \Sigma^{-1} \left( \sum_{j=1}^{N} x_j \right) + \frac{1}{2} \sum_{j=1}^{N} x_j^T \Sigma^{-1} x_j

• Setting the gradient w.r.t. 𝑧 equal to zero:

    \nabla_z L(\mathcal{D} \mid z) = N \big( A^T \Sigma^{-1} A \big) z - A^T \Sigma^{-1} \left( \sum_{j=1}^{N} x_j \right) = 0 \;\;\Rightarrow\;\; \big( A^T \Sigma^{-1} A \big) z = A^T \Sigma^{-1} \bar{x}, \qquad \bar{x} = \frac{1}{N} \sum_{j=1}^{N} x_j

• The existence of a solution of this system depends on the matrix  A \in \mathbb{R}^{n \times k}.

• For the case of one observation, i.e.  \mathcal{D} = \{x\}:  L(x \mid z) = (x - A z)^T \Sigma^{-1} (x - A z).

• Using  \Sigma = U D U^T  and  \Sigma^{-1} = (D^{-1/2} U^T)^T (D^{-1/2} U^T) = W^T W  with  W = D^{-1/2} U^T,  we can finally write:

    L(x \mid z) = \| W (A z - x) \|^2

• The MLE minimization problem is a weighted least squares problem!
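
The sketch below (with assumed A, Σ, and synthetic data) solves the resulting normal equations (AᵀΣ⁻¹A) z = AᵀΣ⁻¹x̄ for the reduced parameters z:

    import numpy as np

    # Illustrative sketch: MLE of z from noisy observations x_j = A z + e_j, e_j ~ N(0, Sigma).
    rng = np.random.default_rng(4)
    n, k, N = 8, 3, 200                          # assumed dimensions and sample size

    A = rng.normal(size=(n, k))                  # known reduction matrix (k << n in practice)
    Sigma = np.diag(rng.uniform(0.5, 2.0, n))    # known SPD noise covariance (diagonal here)
    z_true = rng.normal(size=k)

    E = rng.multivariate_normal(np.zeros(n), Sigma, size=N)
    X = A @ z_true + E                           # rows are the observations x_j

    x_bar = X.mean(axis=0)
    Sigma_inv = np.linalg.inv(Sigma)
    # Normal equations: (A^T Sigma^{-1} A) z = A^T Sigma^{-1} x_bar
    z_mle = np.linalg.solve(A.T @ Sigma_inv @ A, A.T @ Sigma_inv @ x_bar)
    print(z_mle, z_true)                         # close for large N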


MLE for the Multivariate Gaussian
• We can easily generalize the earlier MLE results to a multivariate Gaussian. The log-likelihood takes the form (𝐷 is the data dimensionality):

    \ln p(\boldsymbol{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = -\frac{N D}{2} \ln(2\pi) - \frac{N}{2} \ln|\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (x_n - \boldsymbol{\mu})

• Setting the derivatives w.r.t. 𝝁 and 𝚺 equal to zero gives the following:

    \boldsymbol{\mu}_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \boldsymbol{\Sigma}_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \boldsymbol{\mu}_{ML})(x_n - \boldsymbol{\mu}_{ML})^T

• We provide a proof of the calculation of 𝜮_ML next.
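
Before the proof, a quick numerical companion (with assumed synthetic data): the MLE mean is the componentwise sample mean, and Σ_ML is the scatter about it with a 1/N normalization; np.cov defaults to the 1/(N−1) unbiased version discussed later.

    import numpy as np

    # Illustrative sketch: MLE for a multivariate Gaussian from N samples (rows of X).
    rng = np.random.default_rng(5)
    mu_true = np.array([1.0, -2.0])
    Sigma_true = np.array([[2.0, 0.6],
                           [0.6, 1.0]])
    X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)

    N = X.shape[0]
    mu_ml = X.mean(axis=0)
    centered = X - mu_ml
    Sigma_ml = centered.T @ centered / N         # 1/N normalization -> MLE (biased)
    Sigma_unbiased = np.cov(X, rowvar=False)     # 1/(N-1) normalization -> unbiased
    print(mu_ml, Sigma_ml, Sigma_unbiased, sep="\n")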



MLE for the Multivariate Gaussian

    \ln p(\boldsymbol{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = -\frac{N D}{2} \ln(2\pi) - \frac{N}{2} \ln|\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (x_n - \boldsymbol{\mu})

• We differentiate the log-likelihood w.r.t. 𝚺⁻¹. Each contributing term is:

    \frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}}\!\left( -\frac{N}{2} \ln|\boldsymbol{\Sigma}| \right) = \frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}}\!\left( \frac{N}{2} \ln|\boldsymbol{\Sigma}^{-1}| \right) = \frac{N}{2} \boldsymbol{\Sigma}^T = \frac{N}{2} \boldsymbol{\Sigma} \qquad \text{(a useful trick!)}

    \frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}}\!\left( -\frac{1}{2} \sum_{n=1}^{N} (x_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (x_n - \boldsymbol{\mu}) \right) = \frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}}\!\left( -\frac{N}{2} \operatorname{Tr}\big[ \boldsymbol{\Sigma}^{-1} S \big] \right) = -\frac{N}{2} S \quad (S \text{ symmetric}), \qquad S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \boldsymbol{\mu})(x_n - \boldsymbol{\mu})^T

• Setting the derivative equal to zero leads to  \boldsymbol{\Sigma}_{ML} = S.

• Here we used:

    \frac{\partial}{\partial A} \operatorname{Tr}(A B) = B^T, \qquad \frac{\partial}{\partial A} \ln|A| = \big( A^{-1} \big)^T, \qquad |A^{-1}| = |A|^{-1}, \qquad \operatorname{tr}(AB) = \operatorname{tr}(BA)
Appendix: Some Useful Matrix Operations
• Show that  \frac{\partial}{\partial A} \operatorname{Tr}(A B) = B^T  (and similarly  \frac{\partial}{\partial A} \operatorname{Tr}(A^T B) = B ).

  Indeed,

    \frac{\partial}{\partial A_{mn}} \operatorname{Tr}(A B) = \frac{\partial}{\partial A_{mn}} \sum_{i,k} A_{ik} B_{ki} = B_{nm} = \big( B^T \big)_{mn} \;\;\Rightarrow\;\; \frac{\partial}{\partial A} \operatorname{Tr}(A B) = B^T

• Show that  \frac{\partial}{\partial A} \ln|A| = \big( A^{-1} \big)^T.

  Using the cofactor expansion of the determinant,  |A| = \sum_j (-1)^{i+j} A_{ij} M_{ij},

    \frac{\partial}{\partial A_{mn}} \ln|A| = \frac{1}{|A|} \frac{\partial |A|}{\partial A_{mn}} = \frac{(-1)^{m+n} M_{mn}}{|A|} = \big( A^{-1} \big)_{nm} = \Big[ \big( A^{-1} \big)^T \Big]_{mn}

  where in the last step we used Cramer's rule.



MLE for a Multivariate Gaussian
    \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n \equiv \bar{x}, \qquad \Sigma_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})(x_n - \mu_{ML})^T = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T - \bar{x}\, \bar{x}^T

• Note that the unconstrained maximization of the log-likelihood gives a symmetric 𝚺.

• As for the univariate case, we can define an unbiased covariance as:

    \bar{\Sigma}_{ML} = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \mu_{ML})(x_n - \mu_{ML})^T, \qquad \mathbb{E}\big[ \bar{\Sigma}_{ML} \big] = \Sigma

• To prove this, you will need to use that:

    \mathbb{E}\big[ x_n x_m^T \big] = \mu \mu^T + \delta_{nm} \Sigma
Sequential MLE Estimation for the Gaussian
• Often we are interested in computing an estimate of 𝜇_ML sequentially as more data arrive. This can easily be done:

    \mu_{ML}^{(N)} = \frac{1}{N} \sum_{n=1}^{N} x_n = \frac{x_N}{N} + \frac{1}{N} \sum_{n=1}^{N-1} x_n = \frac{x_N}{N} + \frac{N-1}{N} \cdot \frac{1}{N-1} \sum_{n=1}^{N-1} x_n = \frac{x_N}{N} + \frac{N-1}{N}\, \mu_{ML}^{(N-1)} = \mu_{ML}^{(N-1)} + \underbrace{\frac{1}{N}}_{\text{learning rate}} \underbrace{\big( x_N - \mu_{ML}^{(N-1)} \big)}_{\text{error signal}}

• This sequential approach cannot easily be generalized to other cases (non-Gaussians, etc.).
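
A minimal sketch of this recursion (synthetic data assumed), confirming that the online update with learning rate 1/N reproduces the batch sample mean:

    import numpy as np

    # Illustrative sketch: sequential (online) MLE of the Gaussian mean.
    rng = np.random.default_rng(6)
    x = rng.normal(loc=3.0, scale=2.0, size=1000)

    mu = 0.0
    for N, x_N in enumerate(x, start=1):
        mu = mu + (x_N - mu) / N    # learning rate 1/N times the error signal
    print(mu, x.mean())             # identical up to floating-point round-off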



Robbins-Monro Algorithm
• A more powerful approach to computing the MLE estimates sequentially is the Robbins-Monro algorithm.

• We review the algorithm by considering the calculation of the zero of a regression function.*

• Consider the joint distribution 𝑝(𝑧, 𝜃) of two random variables and define the regression function as:

    f(\theta) = \mathbb{E}[z \mid \theta] = \int z\, p(z \mid \theta)\, dz

• Assume we are given samples from 𝑝(𝑧, 𝜃) one at a time.

* Effectively, we don't know the regression function f(𝜃), but we have data on a noisy version 𝑧 of it. We take the regression function to be the expectation 𝔼[z | 𝜃].

• Robbins, H. and S. Monro (1951). A stochastic approximation method. Annals of Mathematical Statistics 22, 400–407.
• Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition (Second ed.). Academic Press.
Robbins-Monro Algorithm
    f(\theta) = \mathbb{E}[z \mid \theta] = \int z\, p(z \mid \theta)\, dz

• We want to find the root  f(\theta^\star) = 0  in a sequential manner. The Robbins-Monro algorithm proceeds as:

    \theta^{(N)} = \theta^{(N-1)} + a_{N-1}\, z\big( \theta^{(N-1)} \big)

• The learning coefficients {𝑎_𝑁} should satisfy:

    \lim_{N \to \infty} a_N = 0, \qquad \sum_{N=1}^{\infty} a_N = \infty, \qquad \sum_{N=1}^{\infty} a_N^2 < \infty



Robbins-Monro Algorithm
• We can state the MLE calculation of 𝜇_ML for our Gaussian example as finding the root of a regression function: the MLE condition is

    \frac{\partial}{\partial \mu_{ML}} \left\{ \frac{1}{N} \sum_{n=1}^{N} \ln p(x_n \mid \mu_{ML}) \right\} = 0, \qquad \text{and as } N \to \infty, \quad \frac{1}{N} \sum_{n=1}^{N} \frac{\partial}{\partial \mu_{ML}} \ln p(x_n \mid \mu_{ML}) \;\to\; \mathbb{E}_x\!\left[ \frac{\partial}{\partial \mu_{ML}} \ln p(x \mid \mu_{ML}) \right] = \mathbb{E}[z \mid \mu_{ML}] = 0

• In the context of the Robbins-Monro algorithm,

    z = \frac{\partial}{\partial \mu_{ML}} \ln p(x \mid \mu_{ML}) = \frac{x - \mu_{ML}}{\sigma^2}, \qquad z \text{ is Gaussian}, \qquad f(\mu_{ML}) = \mathbb{E}[z \mid \mu_{ML}] = \frac{\mu - \mu_{ML}}{\sigma^2}

• The algorithm then takes the form:

    \mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + a_{N-1}\, \frac{x_N - \mu_{ML}^{(N-1)}}{\sigma^2}

• Substituting  a_{N-1} = \frac{\sigma^2}{N}  gives the exact update discussed earlier.
Robbins-Monro Algorithm
• A graphical interpretation of the algorithm: (Figure: the regression function  f(\mu_{ML}) = \mathbb{E}[z \mid \mu_{ML}] = \frac{\mu - \mu_{ML}}{\sigma^2}  plotted against 𝜇_ML, together with the Gaussian distribution  p(z \mid \mu_{ML})  of 𝑧 at a given 𝜇_ML.)

    z = \frac{x - \mu_{ML}}{\sigma^2}, \qquad \mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + a_{N-1}\, \frac{x_N - \mu_{ML}^{(N-1)}}{\sigma^2}

• The idea here is that we don't know the true 𝜇, but we have noisy measurements 𝑥, expressed here in terms of 𝑧. The parameter 𝜃 is here 𝜇_ML, and the zero of the regression function occurs at 𝜇_ML = 𝜇.

• The Robbins-Monro algorithm computes the zero of the regression function, which corresponds to the true mean 𝜇.

• Blum, J. A. (1965). Multidimensional stochastic approximation methods. Annals of Mathematical Statistics 25, 737–744.
Sequential MLE Estimation for Gaussians
• Let us now repeat the same calculations, but for the MLE estimate of σ² (with the mean 𝜇 treated as known):

    \sigma^2_{(N)} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)^2 = \frac{1}{N} \sum_{n=1}^{N-1} (x_n - \mu)^2 + \frac{(x_N - \mu)^2}{N} = \frac{N-1}{N}\, \sigma^2_{(N-1)} + \frac{(x_N - \mu)^2}{N} = \sigma^2_{(N-1)} + \frac{1}{N} \Big( (x_N - \mu)^2 - \sigma^2_{(N-1)} \Big)

• If we substitute the expression for the Gaussian likelihood into the Robbins-Monro procedure for maximizing the likelihood:

    \sigma^2_{(N)} = \sigma^2_{(N-1)} + a_{N-1}\, \frac{\partial}{\partial \sigma^2_{(N-1)}} \ln p\big( x_N \mid \mu, \sigma^2_{(N-1)} \big) = \sigma^2_{(N-1)} + a_{N-1} \left( \frac{(x_N - \mu)^2}{2 \sigma^4_{(N-1)}} - \frac{1}{2 \sigma^2_{(N-1)}} \right) = \sigma^2_{(N-1)} + \frac{a_{N-1}}{2 \sigma^4_{(N-1)}} \Big( (x_N - \mu)^2 - \sigma^2_{(N-1)} \Big)

• The two formulas are identical for  a_{N-1} = \frac{2 \sigma^4_{(N-1)}}{N}.


Sequential MLE: Multivariate Gaussian
• To simplify things, assume that  \mu_{ML} = \mu  and thus:

    \Sigma_{ML}^{(N)} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T

• From this equation we can derive:

    \Sigma_{ML}^{(N)} = \Sigma_{ML}^{(N-1)} + \frac{1}{N} \Big( (x_N - \mu)(x_N - \mu)^T - \Sigma_{ML}^{(N-1)} \Big)

• To apply the Robbins-Monro algorithm, assume that 𝚺 is diagonal and, as before, compute the derivative

    \frac{\partial}{\partial \Sigma_{ML}^{(N-1)}} \ln p\big( x_N \mid \mu, \Sigma_{ML}^{(N-1)} \big) = \frac{1}{2} \big( \Sigma_{ML}^{(N-1)} \big)^{-2} \Big( (x_N - \mu)(x_N - \mu)^T - \Sigma_{ML}^{(N-1)} \Big)

• Substituting into the Robbins-Monro algorithm:

    \Sigma_{ML}^{(N)} = \Sigma_{ML}^{(N-1)} + A_{N-1}\, \frac{1}{2} \big( \Sigma_{ML}^{(N-1)} \big)^{-2} \Big( (x_N - \mu)(x_N - \mu)^T - \Sigma_{ML}^{(N-1)} \Big)

• Thus, from the RM algorithm, we can obtain the exact update by selecting

    A_{N-1} = \frac{2}{N} \big( \Sigma_{ML}^{(N-1)} \big)^{2}
