Statistical guarantees for the EM algorithm: From population to sample-based analysis

Mollen KHAN

Université Paris 1 Panthéon-Sorbonne

December 14, 2023

The EM Algorithm for Gaussian Mixture Models

Model:

p(x) = Σ_z p(z) p(x|z) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k)

where K is the number of Gaussian components in the mixture, π_k are the mixing coefficients, and N(x | µ_k, Σ_k) are the component Gaussian distributions.
Training with the EM Algorithm:
E-step: Compute the responsibilities γ(z_nk), i.e. the posterior distribution of the latent variables under the current parameters.
M-step: Maximize the expected log-likelihood over the parameters.
Hyper-parameters:
Number of components K .
Number of EM iterations.
Prediction:
Probability distribution p(x) or assignment to one of the K Gaussians.
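The following NumPy sketch illustrates one possible implementation of these E and M steps for a K-component mixture (an illustrative sketch, not the paper's code; the function name fit_gmm_em and the small covariance regularization are our choices):

```python
import numpy as np

def fit_gmm_em(X, K, n_iters=50, seed=0):
    """Illustrative EM for a K-component Gaussian mixture with full covariances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: uniform weights, K random data points as means, pooled covariance.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iters):
        # E-step: responsibilities gamma(z_nk) proportional to pi_k * N(x_n | mu_k, Sigma_k).
        log_resp = np.empty((n, K))
        for k in range(K):
            diff = X - mu[k]
            _, logdet = np.linalg.slogdet(Sigma[k])
            maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(Sigma[k]), diff)
            log_resp[:, k] = np.log(pi[k]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log-likelihood.
        Nk = resp.sum(axis=0)
        pi = Nk / n
        mu = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, resp
```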
Exploring the EM Algorithm: Goals and Key Insights

Context: Dealing with incomplete data, such as missing values and latent variables, presents significant computational challenges for maximum likelihood estimation (MLE).
EM Algorithm: The expectation-maximization (EM) algorithm
provides a powerful approach to such problems, with a rich literature
supporting its application.
Challenge: While MLEs have desirable statistical properties, the EM
algorithm often converges to local optima, raising concerns about its
practical utility.
Paper’s Objective: This paper aims to bridge the gap between
statistical and computational guarantees when applying the EM
algorithm.

Exploring the EM Algorithm: Goals and Key Insights

Insight: Proper initialization can lead the EM algorithm to converge to statistically useful estimates, even in non-convex settings.
Approach: Through a series of theoretical and empirical analyses,
the paper characterizes the conditions under which the EM algorithm
and its variants, such as gradient EM, converge effectively.
Structure: The paper provides a comprehensive introduction,
detailed model examples, and general convergence results,
culminating in practical implications and supporting simulations.

EM Algorithm and its relatives

Observations and latent variables have a joint density f_θ(y, z) within a parameterized family {f_θ | θ ∈ Ω}.
The goal is to maximize the observed-data likelihood:

g_θ(y) = ∫_Z f_θ(y, z) dz    (1)

The EM algorithm maximizes the expected log-likelihood Q(θ′|θ), with

log g_θ′(y) ≥ ∫_Z k_θ(z|y) log f_θ′(y, z) dz − ∫_Z k_θ(z|y) log k_θ(z|y) dz,    (2)

where the first integral on the right-hand side is Q(θ′|θ).
M-step maximizes Q(θ′ |θ) and E-step reevaluates at the new
parameter value.
Standard EM Updates

The iterative EM algorithm consists of two main steps in each iteration:


E-step: Compute the expected log-likelihood Q(θ′ |θt ) using the
current estimate θt .
M-step: Update the estimate to θt+1 by maximizing Q(θ′ |θt ) over θ′
in the parameter space.
The mapping M : Θ → Θ, which is central to the M-step, is given by:

M(θ) = arg max_{θ′ ∈ Θ} Q(θ′|θ).

Generalized EM Updates

Relaxation of the M-step.


New improvement condition:

Q(θt+1 | θt) ≥ Q(θt | θt)

Defines a family of algorithms.

Gradient EM Updates

A variant of generalized EM with differentiable function Q


Update formula with a step size α > 0:

θt+1 = θt + α∇Q(θt |θt )

Mapping G defined by:

G (θ) = θ + α∇Q(θ|θ)

Compact iteration: θt+1 = G (θt )


Extension to constraints by projection (not addressed here)
With suitable α, meets the EM improvement condition
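A minimal sketch of this iteration, assuming the user supplies a routine grad_Q (a placeholder name) that returns ∇Q(θ|θ) evaluated at the current θ:

```python
import numpy as np

def gradient_em(theta0, grad_Q, alpha=0.1, n_iters=100):
    """Gradient EM: theta <- G(theta) = theta + alpha * grad_Q(theta),
    where grad_Q(theta) stands for the self-consistent gradient grad Q(theta | theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta + alpha * grad_Q(theta)
    return theta
```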

Population vs. Sample Updates

"Oracle" population version with an infinite number of samples:

Q(θ′|θ) = ∫_Y [ ∫_Z k_θ(z|y) log f_θ′(y, z) dz ] g_θ∗(y) dy

Sample version with n i.i.d. observations:

Q̂_n(θ′|θ) = (1/n) Σ_{i=1}^{n} ∫_Z k_θ(z|y_i) log f_θ′(y_i, z) dz

Sample-based EM update operator:

M_n(θ) = arg max_{θ′ ∈ Θ} Q̂_n(θ′|θ)

Gaussian Mixture Models
Model with density:
f_θ(y) = (1/2) ϕ(y; θ∗, σ²I_d) + (1/2) ϕ(y; −θ∗, σ²I_d)
Objective: estimate the unknown mean vector θ∗
Hidden variable Z indicates the mixture component
Sample-based Q̂ function:

Q̂(θ′|θ) = −(1/(2n)) Σ_{i=1}^{n} [ w_θ(y_i) ∥y_i − θ′∥² + (1 − w_θ(y_i)) ∥y_i + θ′∥² ]

Mixture component probability w_θ(y):

w_θ(y) = exp(−∥θ − y∥²/(2σ²)) / [ exp(−∥θ − y∥²/(2σ²)) + exp(−∥θ + y∥²/(2σ²)) ]
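Setting the gradient of Q̂(θ′|θ) in θ′ to zero gives the closed-form sample EM update M_n(θ) = (1/n) Σ_i (2 w_θ(y_i) − 1) y_i. A NumPy sketch of this update and of the weights w_θ (synthetic data; the initialization and parameter values are arbitrary choices for illustration):

```python
import numpy as np

def w_theta(Y, theta, sigma):
    """Posterior probability that each y_i was drawn from the +theta component."""
    a = -np.sum((Y - theta) ** 2, axis=1) / (2 * sigma ** 2)
    b = -np.sum((Y + theta) ** 2, axis=1) / (2 * sigma ** 2)
    m = np.maximum(a, b)                                  # log-sum-exp stabilization
    return np.exp(a - m) / (np.exp(a - m) + np.exp(b - m))

def em_update(Y, theta, sigma):
    """Sample EM operator M_n(theta), the closed-form maximizer of Q_hat(.|theta)."""
    w = w_theta(Y, theta, sigma)
    return ((2 * w - 1)[:, None] * Y).mean(axis=0)

# Illustrative run on synthetic data from the symmetric two-component model.
rng = np.random.default_rng(0)
d, n, sigma = 10, 1000, 1.0
theta_star = 2.0 * np.ones(d) / np.sqrt(d)                # ||theta*||_2 / sigma = 2
signs = rng.choice([-1.0, 1.0], size=n)
Y = signs[:, None] * theta_star + sigma * rng.standard_normal((n, d))
theta = theta_star + 0.2 * rng.standard_normal(d)         # initialization near theta*
for _ in range(20):
    theta = em_update(Y, theta, sigma)
```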

Gradient EM Updates for Mixture of Regressions Model

Gradient EM operator for the sample:

G_n(θ) = θ + α { (1/n) Σ_{i=1}^{n} [ (2 w_θ(x_i, y_i) − 1) y_i x_i − x_i x_i^⊤ θ ] }

Gradient EM operator for the population:

G(θ) = θ + α E[ (2 w_θ(X, Y) − 1) Y X − θ ]

Step size parameter α > 0 for parameter adjustment
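A sketch of the sample operator G_n for this model. The weight function w_θ(x, y) is not spelled out on the slide; the version below assumes the standard posterior weight of the +θ component in the symmetric two-component regression mixture:

```python
import numpy as np

def w_mlr(X, y, theta, sigma):
    """Assumed posterior weight of the +theta regression component (symmetric mixture)."""
    r_plus = -(y - X @ theta) ** 2 / (2 * sigma ** 2)
    r_minus = -(y + X @ theta) ** 2 / (2 * sigma ** 2)
    m = np.maximum(r_plus, r_minus)
    return np.exp(r_plus - m) / (np.exp(r_plus - m) + np.exp(r_minus - m))

def gradient_em_step_mlr(X, y, theta, sigma, alpha):
    """One sample gradient EM step G_n(theta) for the mixture of regressions."""
    n = X.shape[0]
    w = w_mlr(X, y, theta, sigma)
    grad = ((2 * w - 1) * y) @ X / n - (X.T @ X / n) @ theta
    return theta + alpha * grad
```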

Linear Regression with Missing Covariates

Extension of standard linear regression with missing data


Instead of full covariate vector xi , observe corrupted version x̃i
Components observed with probability 1 − ρ or missing with probability ρ:

x̃_ij = x_ij with probability 1 − ρ,   x̃_ij = ∗ with probability ρ
ρ ∈ [0, 1) denotes the probability that a given covariate is missing
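A small sketch of this corruption mechanism, with NaN standing in for the missing marker ∗ (the function name is ours):

```python
import numpy as np

def corrupt_covariates(X, rho, seed=0):
    """Independently replace each entry of X by NaN with probability rho
    (missing completely at random), mirroring the model for x_tilde above."""
    rng = np.random.default_rng(seed)
    X_tilde = X.astype(float).copy()
    X_tilde[rng.random(X.shape) < rho] = np.nan
    return X_tilde
```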

E-step Notation for Linear Regression with Missing
Covariates
Observed and missing parts of the covariate vector x and of the parameter β:
x_obs, x_mis, β_obs, β_mis

Conditional mean given observed data:

µ_mis(x_obs, y) = [ E(x_mis | x_obs, y, θ) ; x_obs ] = [ M_mis · z_obs ; x_obs ]

Auxiliary notation for the conditional mean:

M_mis = (1 / (∥β_mis∥² + σ²)) [ −β_mis β_obs^⊤   β_mis ]

Observed composite vector:

z_obs = [ x_obs ; y ]

Conditional second-moment matrix:

Σ_mis(x_obs, y) = [ I                         M_mis z_obs x_obs^⊤
                    x_obs z_obs^⊤ M_mis^⊤     x_obs x_obs^⊤      ]
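For the conditional mean alone, M_mis · z_obs reduces to β_mis (y − ⟨β_obs, x_obs⟩) / (∥β_mis∥² + σ²); a small sketch of that computation (helper name and interface are ours):

```python
import numpy as np

def conditional_mean_missing(x_obs, y, beta_obs, beta_mis, sigma):
    """E[x_mis | x_obs, y] under parameter beta = (beta_obs, beta_mis), i.e. M_mis @ z_obs."""
    denom = np.dot(beta_mis, beta_mis) + sigma ** 2
    return beta_mis * (y - np.dot(beta_obs, x_obs)) / denom
```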

EM Updates for Linear Regression with Missing Covariates

Sample-based EM update maximizes:

Q(θ′|θ) = −(1/(2n)) Σ_{i=1}^{n} w_θ(x_i, y_i) (y_i − x_i^⊤ θ′)²

Closed-form solution for the EM update:

M_n(θ) = [ Σ_{i=1}^{n} Σ_θ(x_obs,i, y_i) ]^{−1} [ Σ_{i=1}^{n} y_i µ_θ(x_obs,i, y_i) ]

Population EM update analogously defined with expectations.

Gradient EM Updates for Missing Covariates

Gradient EM operator for the sample with step size α:

G_n(θ) = θ + α { (1/n) Σ_{i=1}^{n} [ y_i µ_θ(x_obs,i, y_i) − Σ_θ(x_obs,i, y_i) θ ] }

Population gradient EM operator defined similarly with expectations.

EM Update for Linear Regression with Missing Covariates

EM update maximizes the function Q(θ′|θ):

Q(θ′|θ) = −(1/(2n)) Σ_{i=1}^{n} [ ⟨θ′, Σ_θ(x_obs,i, y_i) θ′⟩ − 2 y_i ⟨µ_θ(x_obs,i, y_i), θ′⟩ ]

Explicit solution for the sample-based EM operator:

M_n(θ) = [ Σ_{i=1}^{n} Σ_θ(x_obs,i, y_i) ]^{−1} [ Σ_{i=1}^{n} y_i µ_θ(x_obs,i, y_i) ]

Population EM operator:

M(θ) = { E[Σ_θ(X_obs, Y)] }^{−1} E[Y µ_θ(X_obs, Y)]
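Given the per-observation quantities µ_θ(x_obs,i, y_i) and Σ_θ(x_obs,i, y_i) (supplied here as precomputed arrays; this interface is an assumption of the sketch), the sample update M_n(θ) is a single d × d linear solve:

```python
import numpy as np

def em_update_missing(Sigmas, mus, y):
    """Closed-form EM update M_n(theta) = (sum_i Sigma_i)^{-1} (sum_i y_i * mu_i).

    Sigmas: (n, d, d) array with Sigma_theta(x_obs_i, y_i)
    mus:    (n, d)    array with mu_theta(x_obs_i, y_i)
    y:      (n,)      array of responses
    """
    A = Sigmas.sum(axis=0)
    b = (y[:, None] * mus).sum(axis=0)
    return np.linalg.solve(A, b)
```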

Gradient EM Update for Missing Covariates

Gradient EM algorithm for the sample with step size α:

G_n(θ) = θ + α { (1/n) Σ_{i=1}^{n} [ y_i µ_θ(x_obs,i, y_i) − Σ_θ(x_obs,i, y_i) θ ] }

Population counterpart for gradient EM:

G (θ) = θ + αE [Y µθ (Xobs , Y ) − Σθ (Xobs , Y )θ]

General Convergence Results Overview

Exploring convergence to θ∗, the maximizer of the population likelihood.
Sufficient conditions for population algorithms to converge to θ∗.
Sample-based algorithms: convergence within an ε-ball of θ∗.
Theorems detail contractive behavior for both population and
sample-based updates.
Probabilistic analysis for sample-based convergence, including
stochastic updates.

Guarantees for Population-level EM

Self-Consistency:
Assume vector θ∗ maximizes population likelihood.
Must satisfy self-consistency condition:

θ∗ = arg max_{θ ∈ Ω} Q(θ|θ∗)

The function q(·) := Q(·|θ∗) is key.


λ-Strongly Concave:
Assumes q is λ-strongly concave:
q(θ₁) − q(θ₂) − ⟨∇q(θ₂), θ₁ − θ₂⟩ ≤ −(λ/2) ∥θ₁ − θ₂∥₂²,
Valid in neighborhoods of θ∗ .

Guarantees for Population-level EM

Gradient Mappings and Fixed Points:


First-order optimality condition:

⟨∇Q(θ∗ |θ∗ ), θ′ − θ∗ ⟩ ≤ 0 for all θ′ ∈ Ω,

This condition characterizes θ∗ as a fixed point of the EM operator M, i.e. M(θ∗) = θ∗.


EM Update Characterization:
For EM updates, we have:

⟨∇Q(M(θ)|θ), θ′ − M(θ)⟩ ≤ 0 for all θ′ ∈ Ω,

where M(θ) denotes the EM update mapping.

Regularity Conditions in EM Convergence
Regularity Condition FOS(γ):
The functions {Q(·|θ), θ ∈ Ω} satisfy the condition FOS(γ) over B₂(r; θ∗) if

∥∇Q(M(θ)|θ∗) − ∇Q(M(θ)|θ)∥₂ ≤ γ ∥θ − θ∗∥₂  for all θ ∈ B₂(r; θ∗).    (3)

Intuition and Example:

The condition is trivially met at θ = θ∗, where both sides equal 0.
For the Gaussian mixture model it reduces to

∥2 E[(w_θ(Y) − w_θ∗(Y)) Y]∥₂ ≤ γ ∥θ − θ∗∥₂.

Smoothness of w_θ(y) suggests the condition holds near θ∗.

Guarantees for Sample-based EM
EM Variants: Standard iterative EM and sample-splitting EM.
Objective: Analyze convergence guarantees based on sample size (n)
and tolerance (δ).
Convergence Rates
Define ϵM (n, δ) for fixed θ ∈ B2 (r ; θ∗ ):

∥Mn (θ) − M(θ)∥ ≤ ϵM (n, δ), with probability ≥ 1 − δ.

Define the uniform rate ϵ_M^unif(n, δ) over B₂(r; θ∗):

sup_{θ ∈ B₂(r;θ∗)} ∥M_n(θ) − M(θ)∥ ≤ ϵ_M^unif(n, δ), with probability ≥ 1 − δ.

Theoretical Guarantee
If M is contractive on B2 (r ; θ∗ ) and θ0 ∈ B2 (r ; θ∗ ), EM convergence
is guaranteed under the defined rates.
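One informal way to read ε_M(n, δ): estimate the deviation ∥M_n(θ) − M(θ)∥ by Monte Carlo at a fixed θ, using a very large sample as a stand-in for the population operator. The model, sample sizes, and quantile estimate below are choices made for this sketch only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 2, 1.0
theta_star = np.array([1.0, 1.0])

def sample(n):
    s = rng.choice([-1.0, 1.0], size=n)
    return s[:, None] * theta_star + sigma * rng.standard_normal((n, d))

def M_n(Y, theta):
    """Sample EM operator for the symmetric two-component Gaussian mixture."""
    a = -np.sum((Y - theta) ** 2, axis=1) / (2 * sigma ** 2)
    b = -np.sum((Y + theta) ** 2, axis=1) / (2 * sigma ** 2)
    w = 1.0 / (1.0 + np.exp(np.clip(b - a, -50.0, 50.0)))
    return ((2 * w - 1)[:, None] * Y).mean(axis=0)

theta = np.array([0.8, 0.9])                    # a fixed theta in a ball around theta*
M_pop = M_n(sample(2_000_000), theta)           # large-sample proxy for M(theta)

n, delta, trials = 1000, 0.05, 500
devs = np.array([np.linalg.norm(M_n(sample(n), theta) - M_pop) for _ in range(trials)])
print("estimated eps_M(n, delta):", np.quantile(devs, 1 - delta))
```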
Guarantees for Population-level Gradient EM

Update equation for the gradient EM algorithm:

G (θ) := θ + α∇Q(θ|θ).

Here, α > 0 is a step size parameter.

Guarantees for Population-level Gradient EM: Gradient Stability (GS) Definition

Definition (Gradient Stability (GS))


The functions {Q(·|θ), θ ∈ Ω} satisfy the Gradient Stability condition
GS(γ) over B2 (r ; θ∗ ) if

∥∇Q(θ∗ |θ∗ ) − ∇Q(θ|θ)∥2 ≤ γ∥θ − θ∗ ∥2 for all θ ∈ B2 (r ; θ∗ ). (5)

Guarantees for Population-level Gradient EM: Gradient Stability Condition

[Figure omitted: sketch of ∇q(θ), ∇Q(θ|θ₁), and ∇Q(θ|θ₂) as functions of θ, with the points θ∗, θ₁, θ₂ marked on the horizontal axis.]

Figure: Illustration of the gradient stability condition

Guarantees for Sample-based Gradient EM

Deviations of the sample gradient EM operator G_n from the population version G:

Let ϵ_G(n, δ) be the smallest scalar such that
∥G_n(θ) − G(θ)∥₂ ≤ ϵ_G(n, δ) for any fixed vector θ ∈ B₂(r; θ∗), with probability at least 1 − δ.

Uniform deviation analogue:

Let ϵ_G^unif(n, δ) be the smallest scalar such that
sup_{θ ∈ B₂(r;θ∗)} ∥G_n(θ) − G(θ)∥₂ ≤ ϵ_G^unif(n, δ), with probability at least 1 − δ.

Population Contractivity for Gaussian Mixture Models

Gaussian mixture model with equally weighted components at ±θ∗ and variance σ²I.
Signal-to-noise ratio (SNR) condition:

∥θ∗∥₂ / σ > η,    (6)
where η is a sufficiently large constant.
Population EM operator is contractive over the ball B₂(r; θ∗) with radius r and contractivity parameter κ, which decreases exponentially with η².

Population Contractivity for Gaussian Mixtures

Corollary (Corollary 1: Population contractivity for Gaussian mixtures)
Consider a Gaussian mixture model satisfying the signal-to-noise ratio (SNR) condition (6) with a sufficiently large η. There exists a universal constant c > 0 such that the population EM operator is κ-contractive over the ball B₂(r; θ∗) with

r = ∥θ∗∥₂ / 4,
κ(η) ≤ e^{−c η²}.

Sample-based EM Guarantees for Gaussian Mixtures

Corollary (Corollary 2: Sample-based EM guarantees for Gaussian mixtures)
Given a Gaussian mixture model satisfying the conditions of Corollary 1, suppose the sample size n is at least of order d log(1/δ). For any initial parameter θ0 in the ball B₂(r; θ∗) with r = ∥θ∗∥₂/4, there exists a contraction coefficient κ(η) ≤ e^{−c η²} such that the EM iterates {θt}_{t≥0} satisfy

∥θ^t − θ∗∥₂ ≤ κ^t ∥θ^0 − θ∗∥₂ + (φ(σ; ∥θ∗∥₂) / (1 − κ)) √( (1/n) log(1/δ) )

with probability at least 1 − δ.

Comparison of EM and Gradient EM for Gaussian Mixtures
[Figure omitted: two panels, (a) EM Algorithm and (b) Gradient EM Algorithm, each plotting log error (optimization error and statistical error) against iteration count.]

Figure: Comparison of EM and gradient EM for Gaussian mixtures. Each plot illustrates the performance of the algorithm on 10 different trials for a mixture model with dimension d = 10, sample size n = 1000, and signal-to-noise ratio (SNR) of 2.
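A sketch of the kind of experiment behind this figure, using the symmetric two-component model from the earlier slides with d = 10, n = 1000, SNR = 2. The error definitions (optimization error relative to the sample fixed point θ̂, statistical error relative to θ∗) and the initialization are our reading of the setup, not the authors' exact code:

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(1)
d, n, sigma = 10, 1000, 1.0
theta_star = 2.0 * np.ones(d) / np.sqrt(d)            # ||theta*||_2 / sigma = 2

# Synthetic data from the symmetric two-component mixture.
signs = rng.choice([-1.0, 1.0], size=n)
Y = signs[:, None] * theta_star + sigma * rng.standard_normal((n, d))

def em_step(theta):
    a = -np.sum((Y - theta) ** 2, axis=1) / (2 * sigma ** 2)
    b = -np.sum((Y + theta) ** 2, axis=1) / (2 * sigma ** 2)
    w = expit(a - b)                                   # w_theta(y_i)
    return ((2 * w - 1)[:, None] * Y).mean(axis=0)     # M_n(theta)

# Run EM to (numerical) convergence to get the sample fixed point theta_hat.
theta_hat = theta_star.copy()
for _ in range(200):
    theta_hat = em_step(theta_hat)

# Track optimization and statistical errors along an EM run from a nearby start.
theta = theta_star + 0.5 * rng.standard_normal(d)
for t in range(12):
    opt_err = np.linalg.norm(theta - theta_hat)
    stat_err = np.linalg.norm(theta - theta_star)
    print(t, np.log(opt_err + 1e-16), np.log(stat_err))
    theta = em_step(theta)
```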

EM Convergence Rates Across Different SNRs
[Figure omitted: log optimization error versus iteration count for EM on a mixture of Gaussians, one curve per SNR value in {0.5, 0.75, 1, 1.8, 2.5}.]

Figure: Plot of the iteration count versus the (log) optimization error log(∥θt − θ̂∥) for different values of the SNR ∥θ∗∥₂/σ. For each SNR, we performed 10 independent trials of a Gaussian mixture model with dimension d = 10 and sample size n = 1000. Larger values of the SNR lead to faster convergence rates, consistent with Corollary 2.
Conclusion and Perspectives

The paper offers an in-depth analysis of EM and gradient EM algorithms, investigating their dynamics in population and sample contexts.
Techniques introduced may extend to other non-convex optimization
algorithms, broadening the scope of application.
Future work could explore EM’s performance when model assumptions
such as correct specification and i.i.d. sampling are relaxed.
Understanding EM under model mis-specification is highlighted as an
important avenue for research, considering the robustness of MLE.
The role of appropriate initialization is emphasized, suggesting the
potential of pilot estimators for setting up EM and gradient EM
algorithms effectively.

