
Estimating Probabilities from Data

Suppose you have a dataset $D = \{\bar{x}_i\}_{i=1}^{n} \subset \mathbb{R}^d$. Each data point $\bar{x}_i$ is drawn independently from $\mathcal{N}(\bar{\mu}, I)$, where $I$ is the $d \times d$ identity matrix.

1. Find the MLE for $\bar{\mu}$.

2. Assume a standard Gaussian prior on $\bar{\mu}$, namely $P(\bar{\mu}) = \mathcal{N}(0, I)$, and find the MAP estimate for $\bar{\mu}$.

3. Assuming the same Gaussian prior on $\bar{\mu}$ as in (2), find the posterior distribution $P(\bar{\mu} \mid D)$. (Hint: the posterior is a Gaussian distribution.)

Solution:

1. We first find $P(D \mid \bar{\mu})$. Since the data are i.i.d., we have
$$
\begin{aligned}
P(D \mid \bar{\mu}) &= \prod_{i=1}^{n} P(\bar{x}_i \mid \bar{\mu}) \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{(2\pi)^d |I|}} \exp\!\left(-\frac{(\bar{x}_i - \bar{\mu})^T I^{-1} (\bar{x}_i - \bar{\mu})}{2}\right) \\
&= \prod_{i=1}^{n} \frac{1}{(2\pi)^{d/2}} \exp\!\left(-\frac{(\bar{x}_i - \bar{\mu})^T (\bar{x}_i - \bar{\mu})}{2}\right),
\end{aligned}
$$
where we use the facts that $\det I = 1$ and $I^{-1} = I$. We take the log-likelihood (making the derivative calculation much easier), giving us

$$
\begin{aligned}
\log P(D \mid \bar{\mu}) &= \log \prod_{i=1}^{n} \frac{1}{(2\pi)^{d/2}} \exp\!\left(-\frac{(\bar{x}_i - \bar{\mu})^T (\bar{x}_i - \bar{\mu})}{2}\right) \\
&= \sum_{i=1}^{n} \log\left(\frac{1}{(2\pi)^{d/2}} \exp\!\left(-\frac{(\bar{x}_i - \bar{\mu})^T (\bar{x}_i - \bar{\mu})}{2}\right)\right) \\
&= \sum_{i=1}^{n} \left(\log \frac{1}{(2\pi)^{d/2}} - \frac{(\bar{x}_i - \bar{\mu})^T (\bar{x}_i - \bar{\mu})}{2}\right) \\
&= \sum_{i=1}^{n} \left(-\frac{d}{2}\log(2\pi) - \frac{(\bar{x}_i - \bar{\mu})^T (\bar{x}_i - \bar{\mu})}{2}\right) \\
&= -\frac{nd}{2}\log(2\pi) - \sum_{i=1}^{n} \frac{(\bar{x}_i - \bar{\mu})^T (\bar{x}_i - \bar{\mu})}{2}.
\end{aligned}
$$

We then take the gradient with respect to $\bar{\mu}$:
$$
\begin{aligned}
\nabla_{\bar{\mu}} \log P(D \mid \bar{\mu}) &= \nabla_{\bar{\mu}} \left(-\frac{nd}{2}\log(2\pi) - \sum_{i=1}^{n} \frac{(\bar{x}_i - \bar{\mu})^T (\bar{x}_i - \bar{\mu})}{2}\right) \\
&= \sum_{i=1}^{n} (\bar{x}_i - \bar{\mu}).
\end{aligned}
$$

To find the derivative of $(\bar{x}_i - \bar{\mu})^T (\bar{x}_i - \bar{\mu})$, note that $(\bar{x}_i - \bar{\mu})^T (\bar{x}_i - \bar{\mu}) = \|\bar{x}_i - \bar{\mu}\|_2^2$ and $\nabla_x \|x\|_2^2 = 2x$; we then apply the chain rule, where the inner derivative with respect to $\bar{\mu}$ contributes a factor of $-1$. Finally, setting the gradient to zero gives us

$$
\begin{aligned}
\nabla_{\bar{\mu}} \log P(D \mid \bar{\mu}) = 0
&\implies \sum_{i=1}^{n} (\bar{x}_i - \bar{\mu}) = 0 \\
&\implies n\bar{\mu} = \sum_{i=1}^{n} \bar{x}_i \\
&\implies \bar{\mu} = \frac{\sum_{i=1}^{n} \bar{x}_i}{n}.
\end{aligned}
$$

To verify that this is indeed a maximum, we perform the second derivative test:
$$
\nabla^2_{\bar{\mu}} \log P(D \mid \bar{\mu}) = -nI.
$$
Since the Hessian is negative definite, we have a local maximum.
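The MLE derivation above can be checked numerically. The sketch below (with an arbitrarily chosen dimension, sample size, and true mean) evaluates the log-likelihood we derived and confirms that the sample mean scores at least as high as nearby perturbations:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 500
mu_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(loc=mu_true, scale=1.0, size=(n, d))  # x_i ~ N(mu, I)

def log_likelihood(mu, X):
    """log P(D | mu) for N(mu, I) data, as derived above."""
    n, d = X.shape
    diffs = X - mu
    return -n * d / 2 * np.log(2 * np.pi) - 0.5 * np.sum(diffs * diffs)

mu_mle = X.mean(axis=0)  # the MLE: the sample mean

# The sample mean should score at least as high as random perturbations of it.
for _ in range(100):
    other = mu_mle + rng.normal(scale=0.1, size=d)
    assert log_likelihood(mu_mle, X) >= log_likelihood(other, X)
```

Because the log-likelihood is concave in $\bar{\mu}$, the sample mean is in fact the unique global maximizer, so every perturbation check passes.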
2. To find the MAP estimate, we simply want to find $\arg\max_{\bar{\mu}} P(\bar{\mu} \mid D)$. We know
$$
\begin{aligned}
\arg\max_{\bar{\mu}} P(\bar{\mu} \mid D) &= \arg\max_{\bar{\mu}} \log P(\bar{\mu} \mid D) \\
&= \arg\max_{\bar{\mu}} \log \frac{P(D \mid \bar{\mu}) P(\bar{\mu})}{P(D)} \\
&= \arg\max_{\bar{\mu}} \log\left(P(D \mid \bar{\mu}) P(\bar{\mu})\right) \\
&= \arg\max_{\bar{\mu}} \left(\log P(D \mid \bar{\mu}) + \log P(\bar{\mu})\right),
\end{aligned}
$$

where we use the monotonicity of the log function in the first step above. We have already computed $\log P(D \mid \bar{\mu})$, leaving us only to find $\log P(\bar{\mu})$. We have
$$
\begin{aligned}
\log P(\bar{\mu}) &= \log\left(\frac{1}{\sqrt{(2\pi)^d |I|}} \exp\!\left(-\frac{\bar{\mu}^T I^{-1} \bar{\mu}}{2}\right)\right) \\
&= -\frac{d}{2}\log 2\pi - \frac{\bar{\mu}^T \bar{\mu}}{2}.
\end{aligned}
$$

Trivially, the gradient of $\log P(\bar{\mu})$ with respect to $\bar{\mu}$ is $-\bar{\mu}$. Thus,

$$
\begin{aligned}
\nabla_{\bar{\mu}} \left(\log P(D \mid \bar{\mu}) + \log P(\bar{\mu})\right) = 0
&\implies \sum_{i=1}^{n} (\bar{x}_i - \bar{\mu}) - \bar{\mu} = 0 \\
&\implies (n+1)\bar{\mu} = \sum_{i=1}^{n} \bar{x}_i \\
&\implies \bar{\mu} = \frac{\sum_{i=1}^{n} \bar{x}_i}{n+1}.
\end{aligned}
$$

Again, we test this point by taking the second derivative:
$$
\nabla^2_{\bar{\mu}} \left(\log P(D \mid \bar{\mu}) + \log P(\bar{\mu})\right) = -(n+1)I.
$$
The Hessian is negative definite, so this is a maximum.
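As a sanity check of the MAP formula (with an arbitrarily chosen dimension, sample size, and true mean), the sketch below maximizes the unnormalized log-posterior numerically by comparing the closed-form MAP against perturbations, and confirms the shrinkage effect of the prior relative to the MLE:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200
X = rng.normal(loc=[1.0, -2.0, 0.5], size=(n, d))  # x_i ~ N(mu, I)

def log_posterior_unnorm(mu):
    """log P(D|mu) + log P(mu), dropping mu-independent constants."""
    return -0.5 * np.sum((X - mu) ** 2) - 0.5 * mu @ mu

mu_map = X.sum(axis=0) / (n + 1)  # MAP estimate from the derivation

# The MAP should score at least as high as nearby perturbations.
for _ in range(100):
    other = mu_map + rng.normal(scale=0.1, size=d)
    assert log_posterior_unnorm(mu_map) >= log_posterior_unnorm(other)

# The zero-mean prior shrinks the estimate toward the origin vs. the MLE.
assert np.linalg.norm(mu_map) < np.linalg.norm(X.mean(axis=0))
```

The shrinkage is exactly the factor $\frac{n}{n+1}$ relative to the sample mean, which vanishes as $n \to \infty$: with enough data, the MAP and MLE coincide.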


3. Intuitively, we can expect the posterior mean to be $\frac{\sum_{i=1}^{n} \bar{x}_i}{n+1}$ and the covariance to be $\frac{1}{n+1}I$, but let us see if we can prove it. We have

$$
\begin{aligned}
P(\bar{\mu} \mid D) &\propto \prod_{i=1}^{n} \frac{1}{\sqrt{(2\pi)^d |I|}} \exp\!\left(-\frac{(\bar{x}_i - \bar{\mu})^T I^{-1} (\bar{x}_i - \bar{\mu})}{2}\right) \cdot \frac{1}{\sqrt{(2\pi)^d |I|}} \exp\!\left(-\frac{\bar{\mu}^T I^{-1} \bar{\mu}}{2}\right) \\
&= \left(\frac{1}{\sqrt{(2\pi)^d}}\right)^{n+1} \exp\!\left(-\frac{(n+1)\bar{\mu}^T \bar{\mu} - 2\bar{\mu}^T \sum_{i=1}^{n} \bar{x}_i + \sum_{i=1}^{n} \bar{x}_i^T \bar{x}_i}{2}\right).
\end{aligned}
$$

We know the above has to be a Gaussian distribution. We note that a Gaussian distribution has the form $c \exp\!\left(-\frac{(\bar{\mu} - m)^T \Sigma^{-1} (\bar{\mu} - m)}{2}\right)$, which, when we expand the quadratic, comes out to be
$$
c \exp\!\left(-\frac{1}{2}\left(\bar{\mu}^T \Sigma^{-1} \bar{\mu} - 2\bar{\mu}^T \Sigma^{-1} m + \text{const}\right)\right),
$$
where $\Sigma^{-1}$ is the inverse covariance matrix.

We can see that the above is in the same form, and by matching coefficients we derive that
$$
\Sigma^{-1} = (n+1)I \implies \Sigma = \frac{1}{n+1}I
$$
and
$$
\Sigma^{-1} m = \sum_{i=1}^{n} \bar{x}_i \implies m = \frac{\sum_{i=1}^{n} \bar{x}_i}{n+1}.
$$

Thus,
$$
P(\bar{\mu} \mid D) = \mathcal{N}\!\left(\frac{\sum_{i=1}^{n} \bar{x}_i}{n+1},\; \frac{1}{n+1} I\right).
$$
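The completing-the-square argument can be verified numerically: since the posterior is proportional to the joint $P(D \mid \bar{\mu})P(\bar{\mu})$, differences of their log densities (which cancel all $\bar{\mu}$-independent constants) must agree exactly. A minimal sketch, with arbitrarily chosen dimension and sample size:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 50
X = rng.normal(size=(n, d))  # x_i ~ N(mu, I)

m = X.sum(axis=0) / (n + 1)  # derived posterior mean
var = 1.0 / (n + 1)          # derived posterior covariance is var * I

def log_joint(mu):
    """log P(D|mu) + log P(mu), up to a mu-independent constant."""
    return -0.5 * np.sum((X - mu) ** 2) - 0.5 * mu @ mu

def log_posterior(mu):
    """log N(mu; m, var*I), up to its normalizing constant."""
    return -0.5 * np.sum((mu - m) ** 2) / var

# Log-density differences are constant-free, so they must match exactly.
for _ in range(20):
    a, b = rng.normal(size=d), rng.normal(size=d)
    assert np.isclose(log_joint(a) - log_joint(b),
                      log_posterior(a) - log_posterior(b))
```

If the derived $m$ or $\Sigma$ were wrong, the quadratic forms would differ in a $\bar{\mu}$-dependent way and the assertion would fail for almost every pair of test points.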
