Model
Dootika Vats
22/10/2020
In class, we did the univariate Gaussian mixture model example. In this document, we go through the steps for the multivariate Gaussian mixture model. Note that, fundamentally, the steps are exactly the same; however, there are some additional nuances in the M-step, and the practical implementation details are important to consider.
Suppose for some $C$ classes/clusters/components, we observe $X_1, \ldots, X_n$ from a mixture of multivariate normal distributions. The joint distribution of the observed $x$ is:
$$f(x \mid \theta) = \sum_{c=1}^{C} \pi_c\, f_1(x \mid \mu_c, \Sigma_c)\,.$$
EM Algorithm
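The E-step is the same as in the univariate case: at iteration $k$, the responsibilities $\gamma_{i,c,k}$ used below are the conditional membership probabilities, stated here for completeness with notation matching the M-step objective (with $Z_i$ denoting the latent class label of observation $i$):
$$\gamma_{i,c,k} = \Pr\left(Z_i = c \mid x_i, \theta^{(k)}\right) = \frac{\pi_c^{(k)}\, f_1(x_i \mid \mu_c^{(k)}, \Sigma_c^{(k)})}{\sum_{l=1}^{C} \pi_l^{(k)}\, f_1(x_i \mid \mu_l^{(k)}, \Sigma_l^{(k)})}\,.$$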
Then, in the M-step, we maximize:
$$q(\theta \mid \theta^{(k)}) = \text{const} - \frac{1}{2}\sum_{i=1}^{n}\sum_{c=1}^{C} \log|\Sigma_c|\,\gamma_{i,c,k} - \frac{1}{2}\sum_{i=1}^{n}\sum_{c=1}^{C} (x_i - \mu_c)^T \Sigma_c^{-1} (x_i - \mu_c)\,\gamma_{i,c,k} + \sum_{i=1}^{n}\sum_{c=1}^{C} \log \pi_c\,\gamma_{i,c,k}\,.$$
Taking derivatives with respect to a matrix $\Sigma_c$ is tricky. But the following properties will be useful to remember for compatible matrices $A, B, C$:

• $\mathrm{tr}(ABC) = \mathrm{tr}(CAB) = \mathrm{tr}(BCA)$ (the trace is invariant under cyclic permutations)

• for a vector $x$, the scalar $x^T A x = \mathrm{tr}(x^T A x) = \mathrm{tr}(x x^T A)$

• $\dfrac{\partial}{\partial A}\,\mathrm{tr}(AB) = B^T$

• $\dfrac{\partial}{\partial A}\,\log(|A|) = A^{-T}$
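These identities are easy to sanity-check numerically. The following base-R sketch (variable names are illustrative) verifies the cyclic trace property, the quadratic-form identity, and the log-determinant derivative via finite differences:

```r
set.seed(1)
tr <- function(M) sum(diag(M))          # matrix trace
p <- 3
A <- crossprod(matrix(rnorm(p^2), p, p)) + diag(p)  # symmetric positive definite
B <- matrix(rnorm(p^2), p, p)
C <- matrix(rnorm(p^2), p, p)
x <- rnorm(p)

# cyclic property: tr(ABC) = tr(CAB) = tr(BCA)
stopifnot(abs(tr(A %*% B %*% C) - tr(C %*% A %*% B)) < 1e-8,
          abs(tr(A %*% B %*% C) - tr(B %*% C %*% A)) < 1e-8)

# quadratic form as a trace: x^T A x = tr(x x^T A)
stopifnot(abs(drop(t(x) %*% A %*% x) - tr(x %*% t(x) %*% A)) < 1e-8)

# d/dA log|A| = A^{-T}, checked entrywise by finite differences
eps <- 1e-6
G <- matrix(0, p, p)
for (i in 1:p) for (j in 1:p) {
  A2 <- A
  A2[i, j] <- A2[i, j] + eps
  G[i, j] <- (determinant(A2)$modulus - determinant(A)$modulus) / eps
}
stopifnot(max(abs(G - t(solve(A)))) < 1e-3)
```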
Using these properties and taking the derivative w.r.t. $\Sigma_c^{-1}$ (for convenience), we get
$$\frac{\partial q}{\partial \Sigma_c^{-1}} = \frac{1}{2}\left[\Sigma_c \sum_{i=1}^{n} \gamma_{i,c,k} - \frac{\partial}{\partial \Sigma_c^{-1}} \sum_{i=1}^{n} \mathrm{tr}\!\left((x_i - \mu_c)(x_i - \mu_c)^T\, \Sigma_c^{-1}\right) \gamma_{i,c,k}\right]$$
$$= \frac{1}{2}\left[\Sigma_c \sum_{i=1}^{n} \gamma_{i,c,k} - \sum_{i=1}^{n} (x_i - \mu_c)(x_i - \mu_c)^T\, \gamma_{i,c,k}\right] \overset{\text{set}}{=} 0$$
$$\Rightarrow\ \Sigma_{c,(k+1)} = \frac{\sum_{i=1}^{n} \gamma_{i,c,k}\,(x_i - \mu_{c,(k+1)})(x_i - \mu_{c,(k+1)})^T}{\sum_{i=1}^{n} \gamma_{i,c,k}}\,.$$
The optimization for $\pi_c$ is the same as in the univariate case. Hence, we get the following updates:
$$\mu_{c,(k+1)} = \frac{\sum_{i=1}^{n} \gamma_{i,c,k}\, x_i}{\sum_{i=1}^{n} \gamma_{i,c,k}} \qquad (1)$$
$$\Sigma_{c,(k+1)} = \frac{\sum_{i=1}^{n} \gamma_{i,c,k}\,(x_i - \mu_{c,(k+1)})(x_i - \mu_{c,(k+1)})^T}{\sum_{i=1}^{n} \gamma_{i,c,k}} \qquad (2)$$
$$\pi_{c,(k+1)} = \frac{1}{n}\sum_{i=1}^{n} \gamma_{i,c,k} \qquad (3)$$
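Given an $n \times C$ matrix `Ep` of responsibilities $\gamma_{i,c,k}$, updates (1)-(3) take only a few lines of base R. This is a sketch on simulated data; `Ep` here is random rather than the output of an actual E-step, and the names are illustrative:

```r
set.seed(2)
n <- 100; p <- 2; C <- 2
X <- matrix(rnorm(n * p), n, p)              # data (n x p)
Ep <- matrix(runif(n * C), n, C)
Ep <- Ep / rowSums(Ep)                       # responsibilities: each row sums to 1

pi.new <- colMeans(Ep)                       # update (3)
mu.new <- lapply(1:C, function(c)            # update (1): weighted means
  colSums(Ep[, c] * X) / sum(Ep[, c]))
Sigma.new <- lapply(1:C, function(c) {       # update (2): weighted covariances
  Xc <- sweep(X, 2, mu.new[[c]])             # rows are x_i - mu_{c,(k+1)}
  crossprod(Xc * sqrt(Ep[, c])) / sum(Ep[, c])
})
```

Each `Sigma.new[[c]]` is symmetric by construction; scaling the rows by `sqrt(Ep[, c])` before `crossprod` computes the weighted outer-product sum without an explicit loop over observations.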
• Local solutions: Further, note that the target log-likelihood is not concave, so there may be multiple local maxima. The EM algorithm is an MM algorithm, so it does converge to a local maximum, but it need not converge to a global maximum. In order to reach a reasonable maximum, it is useful to start from multiple starting points and see where each run converges. Then note down the log-likelihood and compare it with the log-likelihoods from runs with other starting points. The optimum with the largest log-likelihood should be chosen.
• Starting values: In addition to trying multiple starting values, one must also make sure the starting values are realistic. That is, if most of the data is centered near (50, 100), then it is unreasonable to start at (0, 0), and in fact such starting values will lead to numerical inaccuracies. Additionally, if your starting values for the $\mu_c$ are the same for all $c$, then all normal distribution components are centered at the same spot, and this leads to only one cluster, even if you have specified $C > 1$.
• Lack of positive definiteness: In the update for $\Sigma_c$, note that we are calculating a covariance matrix. Particularly when $C$ is chosen to be larger than the real number of clusters, an iterate $\Sigma_{c,(k+1)}$ may be (numerically) singular. To correct for this, a common solution is to set
$$\Sigma_{c,(k+1)} = \frac{\sum_{i=1}^{n} \gamma_{i,c,k}\,(x_i - \mu_{c,(k+1)})(x_i - \mu_{c,(k+1)})^T}{\sum_{i=1}^{n} \gamma_{i,c,k}} + \mathrm{diag}(\epsilon, p)\,.$$
That is, we add a diagonal matrix with small entries $\epsilon$ in order to make sure it's positive definite.
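In R, this fix is one line: `Sigma + diag(eps, p)`. A small sketch (with an illustrative choice of `eps`) shows a singular covariance matrix becoming positive definite, which `chol()` can confirm since it fails on non-positive-definite input:

```r
p <- 2
eps <- 1e-5                                  # illustrative small jitter
Sigma.bad <- matrix(c(1, 1, 1, 1), p, p)     # rank-1, hence singular
Sigma.fix <- Sigma.bad + diag(eps, p)        # add small diagonal entries

det(Sigma.bad)                               # exactly 0: not positive definite
det(Sigma.fix)                               # positive
R <- chol(Sigma.fix)                         # succeeds only for positive definite input
```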
set.seed(10)
library(mvtnorm)
library(ellipse)
##
## Attaching package: 'ellipse'
## The following object is masked from 'package:graphics':
##
## pairs
We now continue with a bivariate analysis of the Old Faithful dataset.
data(faithful)
plot(faithful)
[Figure: scatterplot of the faithful data, eruptions (x-axis) vs. waiting (y-axis)]
Below are the functions that calculate the log-likelihood and that run the EM algorithm for any p and any C.
# calculate log-likelihood
# of mixture of multivariate normal
log_like <- function(X, pi.list, mu.list, Sigma.list, C)
{
foo <- 0
for(c in 1:C)
{
foo <- foo + pi.list[[c]]*dmvnorm(X, mean = mu.list[[c]], sigma = Sigma.list[[c]])
}
return(sum(log(foo)))
}
# X is the data
# C is the number of clusters
mvnEM <- function(X, C = 2, tol = 1e-3, maxit = 1e3)
{
  n <- dim(X)[1]
  p <- dim(X)[2]

  ######## Starting values ###################
  ## pi are equally split over C
  pi.list <- rep(1/C, C)
  mu <- list()
  Sigma <- list()
  # random starting means so that repeated runs can find
  # different solutions
  for(i in 1:C)
  {
    mu[[i]] <- rnorm(p, sd = 3) + colMeans(X)
    Sigma[[i]] <- var(X)
  }
  # Choosing good starting values is important since
  # the GMM likelihood is not concave, so the algorithm
  # may converge to a local optimum.

  iter <- 0
  diff <- 100
  old.mu <- mu
  old.Sigma <- Sigma
  old.pi <- pi.list
  save.loglike <- log_like(X, pi.list, mu, Sigma, C)

  while(diff > tol && iter < maxit)
  {
    ### E-step: responsibilities gamma_{i,c,k} for all i and c
    Ep <- matrix(0, nrow = n, ncol = C)
    for(c in 1:C)
    {
      Ep[ ,c] <- pi.list[c] * dmvnorm(X, mean = mu[[c]], sigma = Sigma[[c]])
    }
    Ep <- Ep/rowSums(Ep)

    ### M-step
    pi.list <- colMeans(Ep)
    for(c in 1:C)
    {
      mu[[c]] <- colSums(Ep[ ,c] * X)/sum(Ep[ ,c])
    }
    for(c in 1:C)
    {
      foo <- 0
      for(i in 1:n)
      {
        foo <- foo + (X[i, ] - mu[[c]]) %*% t(X[i, ] - mu[[c]]) * Ep[i,c]
      }
      Sigma[[c]] <- foo/sum(Ep[ ,c])
    }

    iter <- iter + 1
    save.loglike[iter + 1] <- log_like(X, pi.list, mu, Sigma, C)
    diff <- abs(save.loglike[iter + 1] - save.loglike[iter])
    old.mu <- mu
    old.Sigma <- Sigma
    old.pi <- pi.list
  }
  return(list(pi = pi.list, mu = mu, Sigma = Sigma, Ep = Ep,
              log.like = save.loglike[iter + 1], iter = iter))
}
Now for the Old Faithful dataset, we will try to fit C = 2 clusters. Since the starting values can lead to different local optima, the random starting values of the µc chosen above are incredibly helpful: they allow each run to possibly converge to a different local solution. We will then choose the one with the largest log-likelihood.
# We now fit this to the faithful data
data(faithful)
X <- as.matrix(faithful)
C <- 2
class2.1 <- mvnEM(X = X, C = C)
class2.2 <- mvnEM(X = X, C = C)
class2.3 <- mvnEM(X = X, C = C)
class2.4 <- mvnEM(X = X, C = C)
par(mfrow = c(2,2))
for(i in 1:4)
{
model <- get(paste("class2.",i, sep = ""))
allot <- apply(model$Ep, 1, which.max) ## Final allotment of classification
plot(X[,1], X[,2], col = allot, pch = 16,
main = paste("Log-Like = ", round(model$log.like,3))) # plot allotment
}
[Figure: four EM runs with C = 2; each panel plots X[,1] vs. X[,2] colored by allotment; visible panel titles include Log-Like = -1289.797 and Log-Like = -1279.356]
Notice that each run converged to a different local optimum, as each started from a different starting value. However, the log-likelihood value for the last run was the largest (and its cluster formation the most reasonable). Thus, the chosen MLE estimates are the estimates from the last run.
Now we will repeat the same with C = 3. Note that C = 3 classes is slightly more annoying to fit, since the data naturally look like they contain only C = 2 classes. Thus, we will need to work harder to find good local maxima. We will run from 6 different (randomly chosen) starting values.
C <- 3
class3.1 <- mvnEM(X = X, C = C)
class3.2 <- mvnEM(X = X, C = C)
class3.3 <- mvnEM(X = X, C = C)
class3.4 <- mvnEM(X = X, C = C)
class3.5 <- mvnEM(X = X, C = C)
class3.6 <- mvnEM(X = X, C = C)
par(mfrow = c(2,3))
for(i in 1:6)
{
model <- get(paste("class3.",i, sep = ""))
allot <- apply(model$Ep, 1, which.max) ## Final allotment of classification
plot(X[,1], X[,2], col = allot, pch = 16,
main = paste("Log-Like = ", round(model$log.like,3))) # plot allotment
ell <- list()
for(c in 1:C)
{
ell[[c]] <- ellipse(model$Sigma[[c]], centre = as.numeric(model$mu[[c]]))
lines(ell[[c]], col = c)
}
}
[Figure: six EM runs with C = 3; each panel plots X[,1] vs. X[,2] colored by allotment, with a covariance ellipse per component; visible panel titles include Log-Like = -1125.728, -1125.728, and -1120.564]
The fifth run has the largest log-likelihood, so that is chosen. Here I present the final allocation. This is a reminder that we are not fitting different models; we are just running the same algorithm multiple times in order to find the largest local maximum. It's possible our final estimate corresponds to a local maximum, and not a global maximum.
par(mfrow = c(1,1))
model <- class3.5
allot <- apply(model$Ep, 1, which.max) ## Final allotment of classification
plot(X[,1], X[,2], col = allot, pch = 16,
main = paste("Log-Like = ", round(model$log.like,3))) # plot allotment
[Figure: final allocation for C = 3, X[,1] vs. X[,2] colored by allotment; Log-Like = -1119.217]