Contents

1 The Student's t Distribution
4 Multivariate Case
5 Demonstration
In the univariate case,

\[
\text{Student}(x \mid \mu, \lambda, \nu)
= \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}
\left(\frac{\lambda}{\pi\nu}\right)^{1/2}
\left[ 1 + \frac{\lambda (x - \mu)^2}{\nu} \right]^{-\frac{\nu+1}{2}}.
\]

Multivariate

\[
\text{Student}(x \mid \mu, \Lambda, \nu)
\equiv \int_0^\infty \text{Normal}\left(x \mid \mu, (\eta\Lambda)^{-1}\right)
\text{Gamma}(\eta \mid \nu/2, \nu/2) \, d\eta
\]
\[
= \frac{\Gamma\left(\frac{\nu+D}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}
\frac{|\Lambda|^{1/2}}{(\pi\nu)^{D/2}}
\left[ 1 + \frac{(x - \mu)^T \Lambda (x - \mu)}{\nu} \right]^{-\frac{\nu+D}{2}}
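As a sanity check on the univariate parametrization, the density can be evaluated and compared against SciPy's built-in Student's t (a minimal sketch; the function name `student_pdf` is mine, and the mapping scale = 1/√λ is the standard correspondence to SciPy's loc/scale form):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import t as student_t

def student_pdf(x, mu, lam, nu):
    """Univariate Student's t density in the (mu, lambda, nu) parametrization,
    computed in log space for numerical stability."""
    log_norm = (gammaln((nu + 1.0) / 2.0) - gammaln(nu / 2.0)
                + 0.5 * np.log(lam / (np.pi * nu)))
    return np.exp(log_norm) * (1.0 + lam * (x - mu) ** 2 / nu) ** (-(nu + 1.0) / 2.0)
```

This should agree with `student_t.pdf(x, nu, loc=mu, scale=1/np.sqrt(lam))`.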
1. Write down the complete data likelihood, P(X, Z | Θ), where Z is a set of latent variables such that

   \[ P(X \mid \Theta) = \int_Z P(X, Z \mid \Theta). \]

2. Write down the posterior latent distribution, P(Z | X, Θ0).

¹ They're all column vectors.
3. E step: write down the expectations under the distribution P(Z | X, Θ0) of all terms in the complete data log likelihood (step 1).
4. Write down the function to maximize,

   \[ Q(\Theta, \Theta_0) = \int_Z P(Z \mid X, \Theta_0) \log P(X, Z \mid \Theta). \]

5. M step: maximize Q(Θ, Θ0) with respect to Θ to obtain the parameter update equations.
Once all of the expectation update equations (from step 3) and maximization update equations (from
step 5) are known, we initialize our current estimate of the parameters Θ and update them by iterating
the E and M updates until convergence. Note that a subscript 0 is used here and in the rest of the article
to denote the old setting of the parameters. With each iteration the new parameter setting (found in
step 5) will replace the old one.
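The iteration just described can be organized as a small driver loop. In this sketch, `e_step` and `m_step` are placeholders for the model-specific updates derived below, and the stopping rule on the parameter change is my own choice, not something the article prescribes:

```python
import numpy as np

def run_em(x, e_step, m_step, theta0, tol=1e-8, max_iter=1000):
    """Generic EM driver: alternate the E and M updates until the
    parameter vector stops changing (simple convergence test)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        stats = e_step(x, theta)       # expectations under P(Z | X, theta_old)
        theta_new = m_step(x, stats)   # maximizer of Q(theta, theta_old)
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta
```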
Step 1: Complete data likelihood We identify

\[ X = \{x_i\}, \quad Z = \{\eta_i\}, \quad \Theta = \{\mu, \lambda, \nu\} \]

so that the complete likelihood function is

\[ P(X, Z \mid \Theta) = \prod_{i=1}^{N} \text{Normal}\left(x_i \mid \mu, (\lambda\eta_i)^{-1}\right) \text{Gamma}(\eta_i \mid \nu/2, \nu/2). \tag{1} \]
Step 2: Posterior latent distribution
\[
P(Z \mid X, \Theta) \propto P(X, Z \mid \Theta)
= \prod_{i=1}^{N} \text{Normal}\left(x_i \mid \mu, (\lambda\eta_i)^{-1}\right) \text{Gamma}(\eta_i \mid \nu/2, \nu/2)
\propto \prod_{i=1}^{N} \text{Gamma}(\eta_i \mid a_i, b_i) \tag{2}
\]
where the last step follows from the fact that the Gamma distribution is the conjugate prior to a Normal
distribution with unknown precision. We find the parameters ai and bi by combining the factors from
the Normal and Gamma distributions.²
\[
P(X, Z \mid \Theta) = \prod_{i=1}^{N} \text{Normal}\left(x_i \mid \mu, (\lambda\eta_i)^{-1}\right) \text{Gamma}(\eta_i \mid \nu/2, \nu/2)
\propto \prod_{i=1}^{N} \eta_i^{\frac{\nu-1}{2}} \exp\left[ -\eta_i \left( \frac{\nu}{2} + \frac{\lambda}{2}(x_i - \mu)^2 \right) \right]
\]

(All factors independent of ηi are absorbed into the proportionality.)

\[
\propto \prod_{i=1}^{N} \text{Gamma}\left( \eta_i \,\middle|\, \frac{\nu+1}{2},\ \frac{\nu}{2} + \frac{\lambda}{2}(x_i - \mu)^2 \right)
\]

and hence

\[
a_i = \frac{\nu+1}{2}, \qquad b_i = \frac{\nu}{2} + \frac{\lambda}{2}(x_i - \mu)^2.
\]
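In code, the posterior Gamma parameters are one vectorized line each (a sketch assuming NumPy arrays; `mu0`, `lam0`, `nu0` denote the old parameter settings):

```python
import numpy as np

def gamma_posterior_params(x, mu0, lam0, nu0):
    """Posterior Gamma(a_i, b_i) parameters for each latent eta_i."""
    a = np.full_like(x, (nu0 + 1.0) / 2.0)           # a_i = (nu + 1) / 2
    b = nu0 / 2.0 + lam0 * (x - mu0) ** 2 / 2.0      # b_i = nu/2 + lam (x_i - mu)^2 / 2
    return a, b
```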
Step 3: Expectation By looking at (1) we find that we need to calculate the expectations of 1, ηi
and log ηi under the posterior latent distribution (2).
\[ E[1] = 1 \]
\[
E[\eta_i] = \int_Z \eta_i \prod_{j=1}^{N} \text{Gamma}(\eta_j \mid a_j, b_j)
= \int_{\eta_i} \eta_i \, \text{Gamma}(\eta_i \mid a_i, b_i)
= \frac{a_i}{b_i}
= \frac{\nu_0 + 1}{\nu_0 + \lambda_0 (x_i - \mu_0)^2}
\]
\[
E[\log \eta_i] = \int_Z \log \eta_i \prod_{j=1}^{N} \text{Gamma}(\eta_j \mid a_j, b_j)
= \int_{\eta_i} \log \eta_i \, \text{Gamma}(\eta_i \mid a_i, b_i)
= \psi(a_i) - \log b_i
= \psi\left( \frac{\nu_0 + 1}{2} \right) - \log \frac{\nu_0 + \lambda_0 (x_i - \mu_0)^2}{2}
\]
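Both expectations are cheap to compute with SciPy's digamma function (a sketch; the function and argument names are my own):

```python
import numpy as np
from scipy.special import digamma

def e_step(x, mu0, lam0, nu0):
    """E step: E[eta_i] = a_i / b_i and E[log eta_i] = psi(a_i) - log b_i
    under the posterior Gamma(eta_i | a_i, b_i)."""
    a = (nu0 + 1.0) / 2.0
    b = nu0 / 2.0 + lam0 * (x - mu0) ** 2 / 2.0
    e_eta = a / b                        # = (nu0 + 1) / (nu0 + lam0 (x_i - mu0)^2)
    e_log_eta = digamma(a) - np.log(b)
    return e_eta, e_log_eta
```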
Step 4: Function to optimize

\[
Q(\Theta, \Theta_0) = \int_Z P(Z \mid X, \Theta_0) \log P(X, Z \mid \Theta)
\]
\[
= -\frac{N}{2} \log 2\pi + \frac{N}{2} \log \lambda + \frac{1}{2} \sum_{i=1}^{N} E[\log \eta_i] - \frac{\lambda}{2} \sum_{i=1}^{N} (x_i - \mu)^2 E[\eta_i]
- N \log \Gamma\left(\frac{\nu}{2}\right) + \frac{N\nu}{2} \log \frac{\nu}{2} + \left( \frac{\nu}{2} - 1 \right) \sum_{i=1}^{N} E[\log \eta_i] - \frac{\nu}{2} \sum_{i=1}^{N} E[\eta_i]
\]
Note that all the elements of Θ0 are now implicit in the expectations.
Step 5: Maximization
\[
\frac{\partial Q}{\partial \mu} = 0 \;\Rightarrow\; \lambda \sum_{i=1}^{N} (x_i - \mu) E[\eta_i] = 0
\;\Rightarrow\; \mu = \frac{\sum_{i=1}^{N} x_i E[\eta_i]}{\sum_{i=1}^{N} E[\eta_i]}
\]
\[
\frac{\partial Q}{\partial \lambda} = 0 \;\Rightarrow\; \frac{N}{2\lambda} - \frac{1}{2} \sum_{i=1}^{N} (x_i - \mu)^2 E[\eta_i] = 0
\;\Rightarrow\; \lambda = \left( \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 E[\eta_i] \right)^{-1}
\]
\[
\frac{\partial Q}{\partial \nu} = 0 \;\Rightarrow\; -\frac{N}{2} \psi\left(\frac{\nu}{2}\right) + \frac{N}{2} \log \frac{\nu}{2} + \frac{N}{2} + \frac{1}{2} \sum_{i=1}^{N} E[\log \eta_i] - \frac{1}{2} \sum_{i=1}^{N} E[\eta_i] = 0
\;\Rightarrow\; \psi\left(\frac{\nu}{2}\right) - \log \frac{\nu}{2} = 1 + \frac{1}{N} \sum_{i=1}^{N} E[\log \eta_i] - \frac{1}{N} \sum_{i=1}^{N} E[\eta_i]
\]
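The µ and λ updates are a weighted mean and an inverse weighted second moment, which translate directly into NumPy (a sketch; `e_eta` holds the E[η_i] values from the E step):

```python
import numpy as np

def m_step_mu_lam(x, e_eta):
    """M step for mu and lambda: mu is an E[eta]-weighted mean of the data,
    lambda is the inverse of the weighted mean squared deviation."""
    mu = np.sum(x * e_eta) / np.sum(e_eta)
    lam = 1.0 / np.mean((x - mu) ** 2 * e_eta)
    return mu, lam
```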
Note that there is no closed-form solution for ν; it has to be found numerically.
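The ν equation can be handed to a one-dimensional root finder, for example SciPy's brentq (a sketch; the wide bracketing interval is my assumption, chosen because ψ(ν/2) − log(ν/2) is increasing in ν and approaches 0 from below while the right-hand side is strictly negative for non-degenerate data):

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def solve_nu(e_eta, e_log_eta, lo=1e-6, hi=1e6):
    """Solve psi(nu/2) - log(nu/2) = 1 + mean(E[log eta]) - mean(E[eta]) for nu."""
    rhs = 1.0 + np.mean(e_log_eta) - np.mean(e_eta)
    f = lambda nu: digamma(nu / 2.0) - np.log(nu / 2.0) - rhs
    return brentq(f, lo, hi)
```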
4 Multivariate Case
The derivation of the multivariate case requires taking vector and matrix derivatives to find the values
of µ and Λ, but is otherwise very similar to the univariate case. The update equations turn out to be³
\[
E[\eta_i] = \frac{\nu_0 + D}{\nu_0 + (x_i - \mu_0)^T \Lambda_0 (x_i - \mu_0)}
\]
\[
E[\log \eta_i] = \psi\left( \frac{\nu_0 + D}{2} \right) - \log \frac{\nu_0 + (x_i - \mu_0)^T \Lambda_0 (x_i - \mu_0)}{2}
\]
\[
\mu = \frac{\sum_{i=1}^{N} x_i E[\eta_i]}{\sum_{i=1}^{N} E[\eta_i]}
\]
\[
\Lambda = \left( \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T E[\eta_i] \right)^{-1}
\]
The update equation for ν remains unchanged.
³ D is still the dimensionality.
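The multivariate µ and Λ updates also translate directly into NumPy (a sketch; I assume the rows of `X` are the data points x_i and `e_eta` holds the E[η_i] values):

```python
import numpy as np

def m_step_multivariate(X, e_eta):
    """M step for mu and Lambda: mu is an E[eta]-weighted mean of the rows of X,
    and Lambda inverts the weighted mean of outer products (x_i - mu)(x_i - mu)^T."""
    mu = e_eta @ X / np.sum(e_eta)
    d = X - mu                                   # deviations, shape (N, D)
    S = (d * e_eta[:, None]).T @ d / len(X)      # weighted scatter matrix
    return mu, np.linalg.inv(S)
```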
5 Demonstration
The update equations for the univariate case were implemented in Python.⁴ Figure 1 shows the results.
Notice how the ML fit to the Normal distribution has to increase its variance (flatten out) more and more
to accommodate the increasing number of outliers. The Student’s t distribution is relatively robust to
this.
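A compact way to reproduce this behaviour is to run the full univariate EM on Normal samples plus a few large outliers (a sketch only; the sample size, seed, outlier values, and initialization are my assumptions, not the article's exact setup):

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def fit_student_t(x, n_iter=200):
    """EM for the univariate Student's t; returns (mu, lam, nu)."""
    mu, lam, nu = np.mean(x), 1.0 / np.var(x), 4.0  # assumed initialization
    for _ in range(n_iter):
        # E step: expectations under the posterior Gamma(a, b_i)
        a = (nu + 1.0) / 2.0
        b = nu / 2.0 + lam * (x - mu) ** 2 / 2.0
        e_eta, e_log_eta = a / b, digamma(a) - np.log(b)
        # M step: weighted updates for mu and lam, root find for nu
        mu = np.sum(x * e_eta) / np.sum(e_eta)
        lam = 1.0 / np.mean((x - mu) ** 2 * e_eta)
        rhs = 1.0 + np.mean(e_log_eta) - np.mean(e_eta)
        nu = brentq(lambda v: digamma(v / 2.0) - np.log(v / 2.0) - rhs, 1e-6, 1e6)
    return mu, lam, nu

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 100), [8.0, 9.0, 10.0]])  # 3 outliers
mu, lam, nu = fit_student_t(x)
# mu stays near the bulk of the data instead of being dragged toward the outliers
```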
[Figure 1: two density plots, "1 outlier" (left) and "3 outliers" (right), each showing the ML Normal distribution fit ignoring the outliers, the ML Normal distribution fit with the outliers, and the ML Student's t distribution fit.]
Figure 1: Plots of data points drawn from a Normal distribution along with a small number of outliers.
The ML fit of a Normal (with and without considering the outliers) and a Student’s t to the data are
also shown.
⁴ If you have a look at the code, you'll probably notice that the update equations are a bit different from those in the