
Maximum Likelihood Estimation
Prof. Nicholas Zabaras

Email: nzabaras@gmail.com
URL: https://www.zabaras.com/

August 31, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


Contents
• The likelihood function, frequentist vs. Bayesian approaches

• MLE estimator, MLE for the univariate Gaussian

• Biased and unbiased estimators, MLE estimates for μ, σ² in a Gaussian

• MLE for the Poisson distribution, MLE for the Multinomial Distribution

• MLE and Least Squares

• MLE for the Multivariate Gaussian

• Sequential MLE Estimation for the Univariate Gaussian Distribution and Introduction to the Robbins-Monro Algorithm, Sequential MLE for the Multivariate Gaussian
Goals
• The goals of this lecture are:

  • Understand how to compute maximum likelihood parameter estimators

  • Learn about biased and unbiased estimators

  • Familiarize ourselves with the MLE estimates of mean and variance in the univariate and multivariate Gaussian distributions

  • Learn how to sequentially compute MLE estimates

  • Learn about the Robbins-Monro algorithm



References
• Following closely Chris Bishop's PRML book, Chapter 2

• Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 2



The Likelihood Function
• Consider Bayes' theorem

    p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}, \qquad p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta

• The quantity 𝑝(𝒟|𝜃) on the right-hand side of Bayes' theorem is evaluated for the observed data set 𝒟 and can be viewed as a function of the parameter vector 𝜃, in which case it is called the likelihood function.

• Given this definition of likelihood, we can state Bayes' theorem in words:

    posterior ∝ likelihood × prior



Frequentist Versus Bayesian Paradigms
• The likelihood 𝑝(𝒟|𝜃) is essential in both the Bayesian and the frequentist approaches, but it is used in different roles.

• In the frequentist approach, 𝜃 is a fixed parameter computed by an estimator (e.g. the maximum likelihood estimator). Error bars on this point estimate are obtained by considering the distribution of all possible data sets 𝒟 (e.g. the variability of predictions between different bootstrap data sets).

• In the Bayesian approach, there is only one data set 𝒟 (the observed one), and the uncertainty in 𝜃 is expressed through an appropriate prior and the resulting posterior probabilities over 𝜃.



Maximum Likelihood Estimator (MLE)
• Consider the following parametric problem:

    X \sim \pi(x) = \pi(x \mid \theta), \qquad \theta \in \mathbb{R}^k

• Assume that the observations 𝑥_𝑗 are obtained independently, i.e. that 𝑋_1, 𝑋_2, …, 𝑋_𝑁 are i.i.d. and 𝑥_𝑗 is a realization of 𝑋_𝑗.

• Independence:

    \pi(x_1, x_2, \ldots, x_N \mid \theta) = \pi(x_1 \mid \theta)\, \pi(x_2 \mid \theta) \cdots \pi(x_N \mid \theta)

  or, briefly,

    \pi(\mathcal{D} \mid \theta) = \prod_{j=1}^{N} \pi(x_j \mid \theta), \qquad \text{where } \mathcal{D} = \{x_1, x_2, \ldots, x_N\}



Maximum Likelihood Estimator (MLE)
• The maximum likelihood estimator (MLE) of 𝜃 is the parameter value that maximizes the probability of the observed outcome:

    \theta_{ML} = \arg\max_{\theta} \prod_{j=1}^{N} \pi(x_j \mid \theta)

• Define the negative log-likelihood as

    L(\mathcal{D} \mid \theta) = -\log \pi(\mathcal{D} \mid \theta)

• The minimizer of L(𝒟|𝜃) is the maximizer of 𝜋(𝒟|𝜃).
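
As an added illustration (not part of the original slides), the short Python sketch below finds 𝜃_ML numerically by minimizing the negative log-likelihood of a univariate Gaussian on synthetic data; the data set, seed, and log-variance parametrization are assumptions made for the example.

    import numpy as np
    from scipy.optimize import minimize

    # Illustrative sketch: obtain theta_ML by minimizing L(D|theta) = -log pi(D|theta)
    # for a univariate Gaussian with theta = (mu, log sigma^2).
    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=1.5, size=500)   # assumed synthetic data set D

    def neg_log_likelihood(theta, x):
        mu, log_var = theta                           # log-variance keeps sigma^2 > 0
        var = np.exp(log_var)
        return 0.5 * np.sum((x - mu) ** 2) / var + 0.5 * len(x) * np.log(2 * np.pi * var)

    res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(data,))
    mu_ml, var_ml = res.x[0], np.exp(res.x[1])
    print(mu_ml, var_ml)   # matches np.mean(data), np.var(data): the closed-form MLE derived next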



MLE – Gaussian Model
 For a Gaussian model,
1  1 2 
 ( x |  , ) 
2
exp   2 ( x   )  ,    2 
2 2
 2   

the Likelihood function is given as follows:


N
1  1 N
2
exp   L(D |  )     ( x j |  )  exp   ( x   ) 
 2 2   2 2
N /2 j 1
j 1 j 1 
 1 N
N 
 exp    ( x j  1 )2  log(2 2 )  
 2 2 j 1 2 

1 N N
L( D |  )  
2 2 j 1
( x j  1 ) 2

2
log(2 2 )



MLE – Gaussian Model
• The gradient of the negative log-likelihood is then:

    \nabla_\theta L(\mathcal{D} \mid \theta) = \begin{pmatrix} \partial L / \partial \theta_1 \\ \partial L / \partial \theta_2 \end{pmatrix} = \begin{pmatrix} -\dfrac{1}{\theta_2} \sum_{j=1}^{N} (x_j - \theta_1) \\[2mm] -\dfrac{1}{2\theta_2^2} \sum_{j=1}^{N} (x_j - \theta_1)^2 + \dfrac{N}{2\theta_2} \end{pmatrix} = 0

• This gives:

    \mu_{mle} = \theta_{ML,1} = \frac{1}{N} \sum_{j=1}^{N} x_j, \qquad \sigma^2_{mle} = \theta_{ML,2} = \frac{1}{N} \sum_{j=1}^{N} \big( x_j - \theta_{ML,1} \big)^2

• These estimates agree with what we predicted in an earlier lecture using the law of large numbers.
MLE for the Univariate Gaussian
• So, for the Gaussian distribution

    \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)

  the likelihood function is

    p(\mathcal{D} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)

  and the log-likelihood takes the form*

    \ln p(\mathcal{D} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln\sigma^2 - \frac{N}{2} \ln(2\pi)

• The maximum likelihood solution is

    \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2

* We often work with the log-likelihood to avoid underflow (taking products of small probabilities) and to simplify the algebra.



Datasets: CMBData
• CMBdata: spectral representation of the cosmological microwave background (CMB), i.e. electromagnetic radiation from photons dating back to 300,000 years after the Big Bang, expressed as the difference in apparent temperature from the mean temperature.

• (Figure: histogram of the CMBdata samples together with the Gaussian constructed from the MLE-based estimates of 𝜇 and σ², shown with the solid line.)

• Matlab implementation: Normal estimation.

• From Bayesian Core, J.-M. Marin and C.P. Robert, Chapter 2 (available online).



Unbiased Estimators
• An estimator of a parameter is unbiased if the expected value of the estimate is the same as the true value of the parameter.

• If 𝑥_1, 𝑥_2, …, 𝑥_𝑁 ~ (i.i.d.) 𝒩(𝜇, σ²), then

    \mathbb{E}\big[ \mu_{mle} \big] = \mathbb{E}\!\left[ \frac{1}{N} \sum_{i=1}^{N} x_i \right] = \mu

• Thus 𝜇_mle is an unbiased estimator.



Biased Estimators
• An estimator of a parameter is biased if the expected value of the estimate is different from the true value of the parameter.

• If 𝑥_1, 𝑥_2, …, 𝑥_𝑁 ~ (i.i.d.) 𝒩(𝜇, σ²), then

    \mathbb{E}\big[ \sigma^2_{mle} \big] = \mathbb{E}\!\left[ \frac{1}{N} \sum_{i=1}^{N} \big( x_i - \mu_{mle} \big)^2 \right] = \mathbb{E}\!\left[ \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \frac{1}{N} \sum_{j=1}^{N} x_j \right)^{2} \right] = \left( 1 - \frac{1}{N} \right) \sigma^2 \neq \sigma^2

• Thus σ²_mle is a biased estimator.



MLE for a Gaussian Distribution
    \mu_{ML} = \frac{1}{N} \sum_{i=1}^{N} x_i \;\; \text{(sample mean)}, \qquad \sigma^2_{ML} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_{ML})^2 \;\; \text{(sample variance w.r.t. the ML mean, not the exact mean)}

• The maximum likelihood solutions 𝜇_ML, σ²_ML are functions of the data set values 𝑥_1, . . . , 𝑥_𝑁. Consider the expectations of these quantities with respect to the data set values, which come from a Gaussian.

• Using the point estimates above, we showed that:

    \mathbb{E}[\mu_{ML}] = \mu, \qquad \mathbb{E}\big[\sigma^2_{ML}\big] = \frac{N-1}{N}\, \sigma^2

  In this derivation you need to use  \mathbb{E}[x_i x_j] = \mathbb{E}[x_i]\, \mathbb{E}[x_j] = \mu^2  for  i \neq j  and  \mathbb{E}[x_i^2] = \mu^2 + \sigma^2.

• The MLE approach thus underestimates the variance (bias) – this is at the root of the over-fitting problem.
Unbiased Estimate of Variance
• If 𝑥_1, 𝑥_2, …, 𝑥_𝑁 ~ (i.i.d.) 𝒩(𝜇, σ²), then

    \mathbb{E}\big[ \sigma^2_{mle} \big] = \mathbb{E}\!\left[ \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \frac{1}{N} \sum_{j=1}^{N} x_j \right)^{2} \right] = \left( 1 - \frac{1}{N} \right) \sigma^2 \neq \sigma^2

• So define

    \sigma^2_{unbiased} = \frac{\sigma^2_{mle}}{1 - \frac{1}{N}} = \frac{1}{N-1} \sum_{i=1}^{N} \big( x_i - \mu_{mle} \big)^2, \qquad \mathbb{E}\big[ \sigma^2_{unbiased} \big] = \sigma^2

• The two estimates are nearly the same for large 𝑁.
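
A small Monte Carlo sketch (added for illustration; the sample size, number of trials, and seed are arbitrary assumptions) that checks the bias factor (1 − 1/N) and the effect of the N − 1 correction:

    import numpy as np

    # Monte Carlo check of E[sigma^2_mle] = (1 - 1/N) sigma^2 (illustrative sketch).
    rng = np.random.default_rng(1)
    mu_true, var_true, N, trials = 0.0, 4.0, 5, 200_000

    x = rng.normal(mu_true, np.sqrt(var_true), size=(trials, N))
    var_mle = x.var(axis=1, ddof=0)   # divides by N   -> biased MLE estimator
    var_unb = x.var(axis=1, ddof=1)   # divides by N-1 -> unbiased estimator

    print(var_mle.mean())             # ~ (1 - 1/N) * var_true = 3.2
    print(var_unb.mean())             # ~ var_true = 4.0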



Bias in MLE
• In the schematic from Bishop's PRML, we consider 3 cases, each with 2 data points extracted from the true Gaussian. (Figure: the true Gaussian together with the MLE Gaussian fitted to each pair of data points.)

• The mean of the three distributions predicted via MLE (i.e. averaged over the data) is correct.

• However, the variance is underestimated, since it is a variance with respect to the sample mean and NOT the true mean.
Poisson Distribution
• Recall the Poisson (discrete) distribution for 𝑁 ∈ {0, 1, 2, …}:

    P(N = n) = \mathcal{P}_{Poisson}(n \mid \theta) = e^{-\theta}\, \frac{\theta^n}{n!}

• The mean and the variance are both equal to 𝜃:

    \mathbb{E}[N] = \sum_{n=0}^{\infty} n\, \mathcal{P}_{Poisson}(n \mid \theta) = \theta, \qquad \mathbb{E}\big[ (N - \theta)^2 \big] = \theta



MLE – Poisson Model
• Consider the following parametric model:

    \pi(n \mid \theta) = e^{-\theta}\, \frac{\theta^n}{n!}

• We sample independently  \mathcal{D} = \{ n_1, n_2, \ldots, n_N \}, \; n_k \in \mathbb{N}.  The likelihood is:

    \pi(\mathcal{D} \mid \theta) = \prod_{k=1}^{N} \pi(n_k \mid \theta) = e^{-N\theta} \prod_{k=1}^{N} \frac{\theta^{n_k}}{n_k!}

• The negative log-likelihood function is then:

    L(\mathcal{D} \mid \theta) = -\log \pi(\mathcal{D} \mid \theta) = \sum_{k=1}^{N} \big( \theta - n_k \log\theta + \log n_k! \big)

• Taking the derivative with respect to 𝜃 and setting it to zero:

    \frac{\partial}{\partial\theta} L(\mathcal{D} \mid \theta) = \sum_{k=1}^{N} \left( 1 - \frac{n_k}{\theta} \right) = 0 \quad\Rightarrow\quad \theta_{ML} = \frac{1}{N} \sum_{k=1}^{N} n_k

• However, note that the Law of Large Numbers predicts that the sample variance of the counts also converges to 𝜃, consistent with 𝜃_ML:

    \operatorname{var}(N) = \theta \;\approx\; \frac{1}{N} \sum_{k=1}^{N} \left( n_k - \frac{1}{N} \sum_{j=1}^{N} n_j \right)^{2}
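
An illustrative numeric check of this result (the synthetic counts below are assumed): the MLE of the Poisson rate is simply the sample mean of the counts, and the sample variance provides a consistent alternative estimate, since the mean and variance of a Poisson are both 𝜃.

    import numpy as np

    # Illustrative sketch: MLE of the Poisson rate is the sample mean of the counts.
    rng = np.random.default_rng(2)
    theta_true = 3.7
    counts = rng.poisson(theta_true, size=1000)   # synthetic observations n_1, ..., n_N

    theta_ml = counts.mean()                      # theta_ML = (1/N) sum_k n_k
    print(theta_ml)                               # close to theta_true
    print(counts.var())                           # also close to theta_true (mean = variance = theta)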
MLE – Poisson Model
• Assume that 𝜃 is known a priori to be large. In this case, we can use the Gaussian approximation of the Poisson distribution (a result derived in an earlier lecture):

    \prod_{j=1}^{N} \pi_{Poisson}(n_j \mid \theta) \approx \left( \frac{1}{2\pi\theta} \right)^{N/2} \exp\!\left( -\frac{1}{2\theta} \sum_{j=1}^{N} (n_j - \theta)^2 \right) = \left( \frac{1}{2\pi} \right)^{N/2} \exp\!\left( -\frac{1}{2} \left[ \frac{1}{\theta} \sum_{j=1}^{N} (n_j - \theta)^2 + N \log\theta \right] \right)

    L(\mathcal{D} \mid \theta) \simeq \frac{1}{2} \left[ \frac{1}{\theta} \sum_{j=1}^{N} (n_j - \theta)^2 + N \log\theta \right]

    \theta_{ML}: \quad \frac{\partial}{\partial\theta} L(\mathcal{D} \mid \theta) = \frac{1}{2\theta^2} \left( N\theta^2 + N\theta - \sum_{j=1}^{N} n_j^2 \right) = 0

• This gives an approximation for 𝜃_ML,

    \theta_{ML} \approx \left( \frac{1}{4} + \frac{1}{N} \sum_{j=1}^{N} n_j^2 \right)^{1/2} - \frac{1}{2}

  (compare with the result  \theta_{ML} = \frac{1}{N} \sum_{j=1}^{N} n_j  from the exact density).



MLE for the Multinomial Distribution
• Suppose a 𝐾-dimensional multinomial ℳ(𝜇_1, 𝜇_2, …, 𝜇_𝐾) with 𝒙_1, 𝒙_2, …, 𝒙_𝑁 ~ (i.i.d.)

    p(\boldsymbol{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}, \qquad \sum_{k=1}^{K} \mu_k = 1

• Let  m_k = \sum_{n=1}^{N} x_{nk}.  What is the MLE of 𝝁 = (𝜇_1, 𝜇_2, …, 𝜇_𝐾)? The likelihood is:

    \mathcal{M}(m_1, m_2, \ldots, m_K \mid \mu_1, \mu_2, \ldots, \mu_K) = \binom{N}{m_1, m_2, \ldots, m_K} \mu_1^{m_1} \mu_2^{m_2} \cdots \mu_K^{m_K} = \frac{N!}{\prod_i m_i!}\, \mu_1^{m_1} \mu_2^{m_2} \cdots \mu_K^{m_K}

• The penalized log-likelihood, with a Lagrange multiplier enforcing the constraint, is:

    \ell(\mu_1, \mu_2, \ldots, \mu_K) = \log N! - \sum_{i=1}^{K} \log m_i! + \sum_{i=1}^{K} m_i \log\mu_i + \lambda \left( 1 - \sum_{i=1}^{K} \mu_i \right)

• Differentiation w.r.t. 𝜇_𝑖 and enforcing  \sum_{k=1}^{K} \mu_k = 1  gives the expected result:

    \mu_i^{mle} = \frac{m_i}{N}, \qquad i = 1, \ldots, K
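
A short numerical sketch of this closed-form result (the probabilities and sample size below are assumed for illustration):

    import numpy as np

    # Illustrative sketch: MLE of the multinomial parameters, mu_k = m_k / N.
    rng = np.random.default_rng(3)
    mu_true = np.array([0.2, 0.5, 0.3])        # assumed true probabilities, K = 3
    N = 10_000

    X = rng.multinomial(1, mu_true, size=N)    # N one-hot draws x_n (1-of-K coding)
    m = X.sum(axis=0)                          # counts m_k = sum_n x_nk
    mu_mle = m / N                             # closed-form MLE from the Lagrangian
    print(mu_mle)                              # close to mu_true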
MLE and Weighted Least Squares
• Consider a multivariate Gaussian model,

    X \sim \mathcal{N}(x_0, \Sigma)

  where  x_0 \in \mathbb{R}^n  is unknown and  \Sigma \in \mathbb{R}^{n \times n}  is a known symmetric positive definite matrix.

• Assume that  x_0  depends on hidden parameters  z \in \mathbb{R}^k  through a linear equation (model reduction approach, 𝑘 ≪ 𝑛):

    x_0 = A z, \qquad A \in \mathbb{R}^{n \times k}, \quad z \in \mathbb{R}^k

• Every time you introduce model reduction, you also introduce model errors.

• We consider "the inverse problem" of computing 𝒛 from realizations of 𝑿.
MLE and Weighted Least Squares
• Our problem can also be written by considering noisy observations:

    X = A z + E, \qquad E \sim \mathcal{N}(0, \Sigma)

  Note that:

    \mathbb{E}[X] = A z + \mathbb{E}[E] = A z = x_0, \qquad \operatorname{cov}(X) = \mathbb{E}\big[ (X - \mathbb{E}X)(X - \mathbb{E}X)^T \big] = \mathbb{E}\big[ E E^T \big] = \Sigma

• The probability density of 𝑋 given 𝑧 is:

    \pi(x \mid z) = \frac{1}{(2\pi)^{n/2} \det(\Sigma)^{1/2}} \exp\!\left( -\frac{1}{2} (x - A z)^T \Sigma^{-1} (x - A z) \right)

• Assume independent observations  \mathcal{D} = \{ x_1, x_2, \ldots, x_N \}, \; x_j \in \mathbb{R}^n.

• The likelihood function is then

    \prod_{j=1}^{N} \pi(x_j \mid z) \;\sim\; \exp\!\left( -\frac{1}{2} \sum_{j=1}^{N} (x_j - A z)^T \Sigma^{-1} (x_j - A z) \right)

• Now minimize 𝐿(𝒟|𝑧):

    L(\mathcal{D} \mid z) = \frac{1}{2} \sum_{j=1}^{N} (x_j - A z)^T \Sigma^{-1} (x_j - A z) = \frac{N}{2}\, z^T \big( A^T \Sigma^{-1} A \big) z - z^T A^T \Sigma^{-1} \left( \sum_{j=1}^{N} x_j \right) + \frac{1}{2} \sum_{j=1}^{N} x_j^T \Sigma^{-1} x_j
MLE and Weighted Least Squares
    L(\mathcal{D} \mid z) = \frac{1}{2} \sum_{j=1}^{N} (x_j - A z)^T \Sigma^{-1} (x_j - A z) = \frac{N}{2}\, z^T \big( A^T \Sigma^{-1} A \big) z - z^T A^T \Sigma^{-1} \left( \sum_{j=1}^{N} x_j \right) + \frac{1}{2} \sum_{j=1}^{N} x_j^T \Sigma^{-1} x_j

• Setting the gradient w.r.t. 𝑧 equal to zero:

    \nabla_z L(\mathcal{D} \mid z) = N \big( A^T \Sigma^{-1} A \big) z - A^T \Sigma^{-1} \left( \sum_{j=1}^{N} x_j \right) = 0 \;\;\Rightarrow\;\; \big( A^T \Sigma^{-1} A \big) z = A^T \Sigma^{-1} \bar{x}, \qquad \bar{x} = \frac{1}{N} \sum_{j=1}^{N} x_j

• The existence of a solution of this system depends on the matrix  A \in \mathbb{R}^{n \times k}.

• For the case of one observation, i.e.  \mathcal{D} = \{x\}:  L(x \mid z) = (x - A z)^T \Sigma^{-1} (x - A z).

• Using  \Sigma = U D U^T  and  \Sigma^{-1} = (D^{-1/2} U^T)^T (D^{-1/2} U^T) = W^T W  with  W = D^{-1/2} U^T,  we can finally write:

    L(x \mid z) = \| W (A z - x) \|^2

• The MLE minimization problem is a weighted least squares problem!
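
The sketch below (with assumed A, Σ, and synthetic data) solves the resulting normal equations (AᵀΣ⁻¹A) z = AᵀΣ⁻¹x̄ for the reduced parameters z:

    import numpy as np

    # Illustrative sketch: MLE of z from noisy observations x_j = A z + e_j, e_j ~ N(0, Sigma).
    rng = np.random.default_rng(4)
    n, k, N = 8, 3, 200                          # assumed dimensions and sample size

    A = rng.normal(size=(n, k))                  # known reduction matrix (k << n in practice)
    Sigma = np.diag(rng.uniform(0.5, 2.0, n))    # known SPD noise covariance (diagonal here)
    z_true = rng.normal(size=k)

    E = rng.multivariate_normal(np.zeros(n), Sigma, size=N)
    X = A @ z_true + E                           # rows are the observations x_j

    x_bar = X.mean(axis=0)
    Sigma_inv = np.linalg.inv(Sigma)
    # Normal equations: (A^T Sigma^{-1} A) z = A^T Sigma^{-1} x_bar
    z_mle = np.linalg.solve(A.T @ Sigma_inv @ A, A.T @ Sigma_inv @ x_bar)
    print(z_mle, z_true)                         # close for large N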


MLE for the Multivariate Gaussian
• We can easily generalize the earlier MLE results to a multivariate Gaussian. The log-likelihood takes the form (𝐷 is the data dimensionality):

    \ln p(\boldsymbol{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = -\frac{N D}{2} \ln(2\pi) - \frac{N}{2} \ln|\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (x_n - \boldsymbol{\mu})

• Setting the derivatives w.r.t. 𝝁 and 𝚺 equal to zero gives the following:

    \boldsymbol{\mu}_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \boldsymbol{\Sigma}_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \boldsymbol{\mu}_{ML})(x_n - \boldsymbol{\mu}_{ML})^T

• We provide a proof of the calculation of 𝜮_ML next.
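
Before the proof, a quick numerical companion (with assumed synthetic data): the MLE mean is the componentwise sample mean, and Σ_ML is the scatter about it with a 1/N normalization; np.cov defaults to the 1/(N−1) unbiased version discussed later.

    import numpy as np

    # Illustrative sketch: MLE for a multivariate Gaussian from N samples (rows of X).
    rng = np.random.default_rng(5)
    mu_true = np.array([1.0, -2.0])
    Sigma_true = np.array([[2.0, 0.6],
                           [0.6, 1.0]])
    X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)

    N = X.shape[0]
    mu_ml = X.mean(axis=0)
    centered = X - mu_ml
    Sigma_ml = centered.T @ centered / N         # 1/N normalization -> MLE (biased)
    Sigma_unbiased = np.cov(X, rowvar=False)     # 1/(N-1) normalization -> unbiased
    print(mu_ml, Sigma_ml, Sigma_unbiased, sep="\n")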



MLE for the Multivariate Gaussian

    \ln p(\boldsymbol{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = -\frac{N D}{2} \ln(2\pi) - \frac{N}{2} \ln|\boldsymbol{\Sigma}| - \frac{1}{2} \sum_{n=1}^{N} (x_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (x_n - \boldsymbol{\mu})

• We differentiate the log-likelihood w.r.t. 𝚺⁻¹. Each contributing term is:

    \frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}}\!\left( -\frac{N}{2} \ln|\boldsymbol{\Sigma}| \right) = \frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}}\!\left( \frac{N}{2} \ln|\boldsymbol{\Sigma}^{-1}| \right) = \frac{N}{2} \boldsymbol{\Sigma}^T = \frac{N}{2} \boldsymbol{\Sigma} \qquad \text{(a useful trick!)}

    \frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}}\!\left( -\frac{1}{2} \sum_{n=1}^{N} (x_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (x_n - \boldsymbol{\mu}) \right) = \frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}}\!\left( -\frac{N}{2} \operatorname{Tr}\big[ \boldsymbol{\Sigma}^{-1} S \big] \right) = -\frac{N}{2} S \quad (S \text{ symmetric}), \qquad S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \boldsymbol{\mu})(x_n - \boldsymbol{\mu})^T

• Setting the derivative equal to zero leads to  \boldsymbol{\Sigma}_{ML} = S.

• Here we used:

    \frac{\partial}{\partial A} \operatorname{Tr}(A B) = B^T, \qquad \frac{\partial}{\partial A} \ln|A| = \big( A^{-1} \big)^T, \qquad |A^{-1}| = |A|^{-1}, \qquad \operatorname{tr}(AB) = \operatorname{tr}(BA)
Appendix: Some Useful Matrix Operations
• Show that  \frac{\partial}{\partial A} \operatorname{Tr}(A B) = B^T  (and similarly  \frac{\partial}{\partial A} \operatorname{Tr}(A^T B) = B ).

  Indeed,

    \frac{\partial}{\partial A_{mn}} \operatorname{Tr}(A B) = \frac{\partial}{\partial A_{mn}} \sum_{i,k} A_{ik} B_{ki} = B_{nm} = \big( B^T \big)_{mn} \;\;\Rightarrow\;\; \frac{\partial}{\partial A} \operatorname{Tr}(A B) = B^T

• Show that  \frac{\partial}{\partial A} \ln|A| = \big( A^{-1} \big)^T.

  Using the cofactor expansion of the determinant,  |A| = \sum_j (-1)^{i+j} A_{ij} M_{ij},

    \frac{\partial}{\partial A_{mn}} \ln|A| = \frac{1}{|A|} \frac{\partial |A|}{\partial A_{mn}} = \frac{(-1)^{m+n} M_{mn}}{|A|} = \big( A^{-1} \big)_{nm} = \Big[ \big( A^{-1} \big)^T \Big]_{mn}

  where in the last step we used Cramer's rule.



MLE for a Multivariate Gaussian
    \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n \equiv \bar{x}, \qquad \Sigma_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})(x_n - \mu_{ML})^T = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T - \bar{x}\, \bar{x}^T

• Note that the unconstrained maximization of the log-likelihood gives a symmetric 𝚺.

• As for the univariate case, we can define an unbiased covariance as:

    \bar{\Sigma}_{ML} = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \mu_{ML})(x_n - \mu_{ML})^T, \qquad \mathbb{E}\big[ \bar{\Sigma}_{ML} \big] = \Sigma

• To prove this, you will need to use that:

    \mathbb{E}\big[ x_n x_m^T \big] = \mu \mu^T + \delta_{nm} \Sigma
Sequential MLE Estimation for the Gaussian
• Often we are interested in computing an estimate of 𝜇_ML sequentially as more data arrive. This can easily be done:

    \mu_{ML}^{(N)} = \frac{1}{N} \sum_{n=1}^{N} x_n = \frac{x_N}{N} + \frac{1}{N} \sum_{n=1}^{N-1} x_n = \frac{x_N}{N} + \frac{N-1}{N} \cdot \frac{1}{N-1} \sum_{n=1}^{N-1} x_n = \frac{x_N}{N} + \frac{N-1}{N}\, \mu_{ML}^{(N-1)} = \mu_{ML}^{(N-1)} + \underbrace{\frac{1}{N}}_{\text{learning rate}} \underbrace{\big( x_N - \mu_{ML}^{(N-1)} \big)}_{\text{error signal}}

• This sequential approach cannot easily be generalized to other cases (non-Gaussians, etc.).
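
A minimal sketch of this recursion (synthetic data assumed), confirming that the online update with learning rate 1/N reproduces the batch sample mean:

    import numpy as np

    # Illustrative sketch: sequential (online) MLE of the Gaussian mean.
    rng = np.random.default_rng(6)
    x = rng.normal(loc=3.0, scale=2.0, size=1000)

    mu = 0.0
    for N, x_N in enumerate(x, start=1):
        mu = mu + (x_N - mu) / N    # learning rate 1/N times the error signal
    print(mu, x.mean())             # identical up to floating-point round-off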



Robbins-Monro Algorithm
• A more powerful approach to computing the MLE estimates sequentially is the Robbins-Monro algorithm.

• We review the algorithm by considering the calculation of the zero of a regression function.*

• Consider the joint distribution 𝑝(𝑧, 𝜃) of two random variables and define the regression function as:

    f(\theta) = \mathbb{E}[z \mid \theta] = \int z\, p(z \mid \theta)\, dz

• Assume we are given samples from 𝑝(𝑧, 𝜃) one at a time.

* Effectively, we don't know the regression function f(𝜃), but we have data on a noisy version 𝑧 of it. We take the regression function to be the expectation 𝔼[z | 𝜃].

• Robbins, H. and S. Monro (1951). A stochastic approximation method. Annals of Mathematical Statistics 22, 400–407.
• Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition (Second ed.). Academic Press.
Robbins-Monro Algorithm
    f(\theta) = \mathbb{E}[z \mid \theta] = \int z\, p(z \mid \theta)\, dz

• We want to find the root  f(\theta^\star) = 0  in a sequential manner. The Robbins-Monro algorithm proceeds as:

    \theta^{(N)} = \theta^{(N-1)} + a_{N-1}\, z\big( \theta^{(N-1)} \big)

• The learning coefficients {𝑎_𝑁} should satisfy:

    \lim_{N \to \infty} a_N = 0, \qquad \sum_{N=1}^{\infty} a_N = \infty, \qquad \sum_{N=1}^{\infty} a_N^2 < \infty



Robbins-Monro Algorithm
• We can state the MLE calculation of 𝜇_ML for our Gaussian example as finding the root of a regression function: the MLE condition is

    \frac{\partial}{\partial \mu_{ML}} \left\{ \frac{1}{N} \sum_{n=1}^{N} \ln p(x_n \mid \mu_{ML}) \right\} = 0, \qquad \text{and as } N \to \infty, \quad \frac{1}{N} \sum_{n=1}^{N} \frac{\partial}{\partial \mu_{ML}} \ln p(x_n \mid \mu_{ML}) \;\to\; \mathbb{E}_x\!\left[ \frac{\partial}{\partial \mu_{ML}} \ln p(x \mid \mu_{ML}) \right] = \mathbb{E}[z \mid \mu_{ML}] = 0

• In the context of the Robbins-Monro algorithm,

    z = \frac{\partial}{\partial \mu_{ML}} \ln p(x \mid \mu_{ML}) = \frac{x - \mu_{ML}}{\sigma^2}, \qquad z \text{ is Gaussian}, \qquad f(\mu_{ML}) = \mathbb{E}[z \mid \mu_{ML}] = \frac{\mu - \mu_{ML}}{\sigma^2}

• The algorithm then takes the form:

    \mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + a_{N-1}\, \frac{x_N - \mu_{ML}^{(N-1)}}{\sigma^2}

• Substituting  a_{N-1} = \frac{\sigma^2}{N}  gives the exact update discussed earlier.
Robbins-Monro Algorithm
• A graphical interpretation of the algorithm: (Figure: the regression function  f(\mu_{ML}) = \mathbb{E}[z \mid \mu_{ML}] = \frac{\mu - \mu_{ML}}{\sigma^2}  plotted against 𝜇_ML, together with the Gaussian distribution  p(z \mid \mu_{ML})  of 𝑧 at a given 𝜇_ML.)

    z = \frac{x - \mu_{ML}}{\sigma^2}, \qquad \mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + a_{N-1}\, \frac{x_N - \mu_{ML}^{(N-1)}}{\sigma^2}

• The idea here is that we don't know the true 𝜇, but we have noisy measurements 𝑥, expressed here in terms of 𝑧. The parameter 𝜃 is here 𝜇_ML, and the zero of the regression function occurs at 𝜇_ML = 𝜇.

• The Robbins-Monro algorithm computes the zero of the regression function, which corresponds to the true mean 𝜇.

• Blum, J. A. (1965). Multidimensional stochastic approximation methods. Annals of Mathematical Statistics 25, 737–744.
Sequential MLE Estimation for Gaussians
• Let us now repeat the same calculations, but for the MLE estimate of σ² (with the mean 𝜇 treated as known):

    \sigma^2_{(N)} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)^2 = \frac{1}{N} \sum_{n=1}^{N-1} (x_n - \mu)^2 + \frac{(x_N - \mu)^2}{N} = \frac{N-1}{N}\, \sigma^2_{(N-1)} + \frac{(x_N - \mu)^2}{N} = \sigma^2_{(N-1)} + \frac{1}{N} \Big( (x_N - \mu)^2 - \sigma^2_{(N-1)} \Big)

• If we substitute the expression for the Gaussian likelihood into the Robbins-Monro procedure for maximizing the likelihood:

    \sigma^2_{(N)} = \sigma^2_{(N-1)} + a_{N-1}\, \frac{\partial}{\partial \sigma^2_{(N-1)}} \ln p\big( x_N \mid \mu, \sigma^2_{(N-1)} \big) = \sigma^2_{(N-1)} + a_{N-1} \left( \frac{(x_N - \mu)^2}{2 \sigma^4_{(N-1)}} - \frac{1}{2 \sigma^2_{(N-1)}} \right) = \sigma^2_{(N-1)} + \frac{a_{N-1}}{2 \sigma^4_{(N-1)}} \Big( (x_N - \mu)^2 - \sigma^2_{(N-1)} \Big)

• The two formulas are identical for  a_{N-1} = \frac{2 \sigma^4_{(N-1)}}{N}.


Sequential MLE: Multivariate Gaussian
• To simplify things, assume that  \mu_{ML} = \mu  and thus:

    \Sigma_{ML}^{(N)} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T

• From this equation we can derive:

    \Sigma_{ML}^{(N)} = \Sigma_{ML}^{(N-1)} + \frac{1}{N} \Big( (x_N - \mu)(x_N - \mu)^T - \Sigma_{ML}^{(N-1)} \Big)

• To apply the Robbins-Monro algorithm, assume that 𝚺 is diagonal and, as before, compute the derivative

    \frac{\partial}{\partial \Sigma_{ML}^{(N-1)}} \ln p\big( x_N \mid \mu, \Sigma_{ML}^{(N-1)} \big) = \frac{1}{2} \big( \Sigma_{ML}^{(N-1)} \big)^{-2} \Big( (x_N - \mu)(x_N - \mu)^T - \Sigma_{ML}^{(N-1)} \Big)

• Substituting into the Robbins-Monro algorithm:

    \Sigma_{ML}^{(N)} = \Sigma_{ML}^{(N-1)} + A_{N-1}\, \frac{1}{2} \big( \Sigma_{ML}^{(N-1)} \big)^{-2} \Big( (x_N - \mu)(x_N - \mu)^T - \Sigma_{ML}^{(N-1)} \Big)

• Thus, from the RM algorithm, we can obtain the exact update by selecting

    A_{N-1} = \frac{2}{N} \big( \Sigma_{ML}^{(N-1)} \big)^{2}
