Statistical guarantees for the EM algorithm: From population to sample-based analysis

Mollen KHAN

Université Paris 1 Panthéon-Sorbonne

December 14, 2023

The EM Algorithm for Gaussian Mixture Models

Model:

p(x) = Σ_z p(z) p(x|z) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k)

where K is the number of Gaussian components in the mixture, π_k are the mixing coefficients, and N(x | µ_k, Σ_k) are the component Gaussian distributions.
Training with the EM Algorithm:
E-step: Compute the responsibilities γ(z_nk), i.e. the posterior distribution of the latent variables under the current parameters.
M-step: Maximize the expected log-likelihood over the parameters.
Hyper-parameters:
Number of components K .
Number of EM iterations.
Prediction:
Probability distribution p(x) or assignment to one of the K Gaussians.
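The following NumPy sketch illustrates one possible implementation of these E and M steps for a K-component mixture (an illustrative sketch, not the paper's code; the function name fit_gmm_em and the small covariance regularization are our choices):

```python
import numpy as np

def fit_gmm_em(X, K, n_iters=50, seed=0):
    """Illustrative EM for a K-component Gaussian mixture with full covariances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: uniform weights, K random data points as means, pooled covariance.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    for _ in range(n_iters):
        # E-step: responsibilities gamma(z_nk) proportional to pi_k * N(x_n | mu_k, Sigma_k).
        log_resp = np.empty((n, K))
        for k in range(K):
            diff = X - mu[k]
            _, logdet = np.linalg.slogdet(Sigma[k])
            maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(Sigma[k]), diff)
            log_resp[:, k] = np.log(pi[k]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log-likelihood.
        Nk = resp.sum(axis=0)
        pi = Nk / n
        mu = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, resp
```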
Exploring the EM Algorithm: Goals and Key Insights

Context: Dealing with incomplete data, such as missing values and latent variables, presents significant computational challenges for maximum likelihood estimation (MLE).
EM Algorithm: The expectation-maximization (EM) algorithm
provides a powerful approach to such problems, with a rich literature
supporting its application.
Challenge: While MLEs have desirable statistical properties, the EM
algorithm often converges to local optima, raising concerns about its
practical utility.
Paper’s Objective: This paper aims to bridge the gap between
statistical and computational guarantees when applying the EM
algorithm.

Exploring the EM Algorithm: Goals and Key Insights

Insight: Proper initialization can lead the EM algorithm to converge to statistically useful estimates, even in non-convex settings.
Approach: Through a series of theoretical and empirical analyses,
the paper characterizes the conditions under which the EM algorithm
and its variants, such as gradient EM, converge effectively.
Structure: The paper provides a comprehensive introduction,
detailed model examples, and general convergence results,
culminating in practical implications and supporting simulations.

EM Algorithm and its relatives

Observations and latent variables have a joint density f_θ(y, z) within a parameterized family {f_θ | θ ∈ Ω}.
The goal is to maximize the observed-data likelihood:

g_θ(y) = ∫_Z f_θ(y, z) dz    (1)

The EM algorithm maximizes the expected log-likelihood Q(θ′|θ), with

log g_θ′(y) ≥ ∫_Z k_θ(z|y) log f_θ′(y, z) dz − ∫_Z k_θ(z|y) log k_θ(z|y) dz,    (2)

where the first integral on the right-hand side is Q(θ′|θ).
M-step maximizes Q(θ′ |θ) and E-step reevaluates at the new
parameter value.
Standard EM Updates

The iterative EM algorithm consists of two main steps in each iteration:


E-step: Compute the expected log-likelihood Q(θ′ |θt ) using the
current estimate θt .
M-step: Update the estimate to θt+1 by maximizing Q(θ′ |θt ) over θ′
in the parameter space.
The mapping M : Θ → Θ, which is central to the M-step, is given by:

M(θ) = arg max_{θ′ ∈ Θ} Q(θ′|θ).

Generalized EM Updates

Relaxation of the M-step.


New improvement condition:

Q(θt+1 | θt) ≥ Q(θt | θt)

Defines a family of algorithms.

Gradient EM Updates

A variant of generalized EM with differentiable function Q


Update formula with a step size α > 0:

θt+1 = θt + α∇Q(θt |θt )

Mapping G defined by:

G (θ) = θ + α∇Q(θ|θ)

Compact iteration: θt+1 = G (θt )


Extension to constraints by projection (not addressed here)
With suitable α, meets the EM improvement condition
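A minimal sketch of this iteration, assuming the user supplies a routine grad_Q (a placeholder name) that returns ∇Q(θ|θ) evaluated at the current θ:

```python
import numpy as np

def gradient_em(theta0, grad_Q, alpha=0.1, n_iters=100):
    """Gradient EM: theta <- G(theta) = theta + alpha * grad_Q(theta),
    where grad_Q(theta) stands for the self-consistent gradient grad Q(theta | theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta + alpha * grad_Q(theta)
    return theta
```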

Population vs. Sample Updates

"Oracle" population version with an infinite number of samples:

Q(θ′|θ) = ∫_Y [ ∫_Z k_θ(z|y) log f_θ′(y, z) dz ] g_θ∗(y) dy

Sample version with n i.i.d. observations:

Q̂_n(θ′|θ) = (1/n) Σ_{i=1}^{n} ∫_Z k_θ(z|y_i) log f_θ′(y_i, z) dz

Sample-based EM update operator:

M_n(θ) = arg max_{θ′ ∈ Θ} Q̂_n(θ′|θ)

Gaussian Mixture Models
Model with density:
f_θ(y) = (1/2) ϕ(y; θ∗, σ²I_d) + (1/2) ϕ(y; −θ∗, σ²I_d)
Objective: estimate the unknown mean vector θ∗
Hidden variable Z indicates the mixture component
Sample-based Q̂ function:

Q̂(θ′|θ) = −(1/(2n)) Σ_{i=1}^{n} [ w_θ(y_i) ∥y_i − θ′∥² + (1 − w_θ(y_i)) ∥y_i + θ′∥² ]

Mixture component probability w_θ(y):

w_θ(y) = exp(−∥θ − y∥²/(2σ²)) / [ exp(−∥θ − y∥²/(2σ²)) + exp(−∥θ + y∥²/(2σ²)) ]
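Setting the gradient of Q̂(θ′|θ) in θ′ to zero gives the closed-form sample EM update M_n(θ) = (1/n) Σ_i (2 w_θ(y_i) − 1) y_i. A NumPy sketch of this update and of the weights w_θ (synthetic data; the initialization and parameter values are arbitrary choices for illustration):

```python
import numpy as np

def w_theta(Y, theta, sigma):
    """Posterior probability that each y_i was drawn from the +theta component."""
    a = -np.sum((Y - theta) ** 2, axis=1) / (2 * sigma ** 2)
    b = -np.sum((Y + theta) ** 2, axis=1) / (2 * sigma ** 2)
    m = np.maximum(a, b)                                  # log-sum-exp stabilization
    return np.exp(a - m) / (np.exp(a - m) + np.exp(b - m))

def em_update(Y, theta, sigma):
    """Sample EM operator M_n(theta), the closed-form maximizer of Q_hat(.|theta)."""
    w = w_theta(Y, theta, sigma)
    return ((2 * w - 1)[:, None] * Y).mean(axis=0)

# Illustrative run on synthetic data from the symmetric two-component model.
rng = np.random.default_rng(0)
d, n, sigma = 10, 1000, 1.0
theta_star = 2.0 * np.ones(d) / np.sqrt(d)                # ||theta*||_2 / sigma = 2
signs = rng.choice([-1.0, 1.0], size=n)
Y = signs[:, None] * theta_star + sigma * rng.standard_normal((n, d))
theta = theta_star + 0.2 * rng.standard_normal(d)         # initialization near theta*
for _ in range(20):
    theta = em_update(Y, theta, sigma)
```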

Gradient EM Updates for Mixture of Regressions Model

Gradient EM operator for the sample:

G_n(θ) = θ + α { (1/n) Σ_{i=1}^{n} [ (2 w_θ(x_i, y_i) − 1) y_i x_i − x_i x_i^⊤ θ ] }

Gradient EM operator for the population:

G(θ) = θ + α E[ (2 w_θ(X, Y) − 1) Y X − θ ]

Step size parameter α > 0 for parameter adjustment
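A sketch of the sample operator G_n for this model. The weight function w_θ(x, y) is not spelled out on the slide; the version below assumes the standard posterior weight of the +θ component in the symmetric two-component regression mixture:

```python
import numpy as np

def w_mlr(X, y, theta, sigma):
    """Assumed posterior weight of the +theta regression component (symmetric mixture)."""
    r_plus = -(y - X @ theta) ** 2 / (2 * sigma ** 2)
    r_minus = -(y + X @ theta) ** 2 / (2 * sigma ** 2)
    m = np.maximum(r_plus, r_minus)
    return np.exp(r_plus - m) / (np.exp(r_plus - m) + np.exp(r_minus - m))

def gradient_em_step_mlr(X, y, theta, sigma, alpha):
    """One sample gradient EM step G_n(theta) for the mixture of regressions."""
    n = X.shape[0]
    w = w_mlr(X, y, theta, sigma)
    grad = ((2 * w - 1) * y) @ X / n - (X.T @ X / n) @ theta
    return theta + alpha * grad
```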

Linear Regression with Missing Covariates

Extension of standard linear regression with missing data


Instead of full covariate vector xi , observe corrupted version x̃i
Components observed with probability 1 − ρ or missing with probability ρ:

x̃_ij = x_ij with probability 1 − ρ,   x̃_ij = ∗ with probability ρ
ρ ∈ [0, 1) denotes the probability that a given covariate is missing
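A small sketch of this corruption mechanism, with NaN standing in for the missing marker ∗ (the function name is ours):

```python
import numpy as np

def corrupt_covariates(X, rho, seed=0):
    """Independently replace each entry of X by NaN with probability rho
    (missing completely at random), mirroring the model for x_tilde above."""
    rng = np.random.default_rng(seed)
    X_tilde = X.astype(float).copy()
    X_tilde[rng.random(X.shape) < rho] = np.nan
    return X_tilde
```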

E-step Notation for Linear Regression with Missing
Covariates
Observed and missing parts of the covariate vector x and of the parameter β:
x_obs, x_mis, β_obs, β_mis

Conditional mean given observed data:

µ_mis(x_obs, y) = [ E(x_mis | x_obs, y, θ) ; x_obs ] = [ M_mis · z_obs ; x_obs ]

Auxiliary notation for the conditional mean:

M_mis = (1 / (∥β_mis∥² + σ²)) [ −β_mis β_obs^⊤   β_mis ]

Observed composite vector:

z_obs = [ x_obs ; y ]

Conditional second-moment matrix:

Σ_mis(x_obs, y) = [ I                         M_mis z_obs x_obs^⊤
                    x_obs z_obs^⊤ M_mis^⊤     x_obs x_obs^⊤      ]
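For the conditional mean alone, M_mis · z_obs reduces to β_mis (y − ⟨β_obs, x_obs⟩) / (∥β_mis∥² + σ²); a small sketch of that computation (helper name and interface are ours):

```python
import numpy as np

def conditional_mean_missing(x_obs, y, beta_obs, beta_mis, sigma):
    """E[x_mis | x_obs, y] under parameter beta = (beta_obs, beta_mis), i.e. M_mis @ z_obs."""
    denom = np.dot(beta_mis, beta_mis) + sigma ** 2
    return beta_mis * (y - np.dot(beta_obs, x_obs)) / denom
```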

EM Updates for Linear Regression with Missing Covariates

Sample-based EM update maximizes:

Q(θ′|θ) = −(1/(2n)) Σ_{i=1}^{n} w_θ(x_i, y_i) (y_i − x_i^⊤ θ′)²

Closed-form solution for the EM update:

M_n(θ) = [ Σ_{i=1}^{n} Σ_θ(x_obs,i, y_i) ]^{−1} [ Σ_{i=1}^{n} y_i µ_θ(x_obs,i, y_i) ]

Population EM update analogously defined with expectations.

Gradient EM Updates for Missing Covariates

Gradient EM operator for the sample with step size α:

G_n(θ) = θ + α { (1/n) Σ_{i=1}^{n} [ y_i µ_θ(x_obs,i, y_i) − Σ_θ(x_obs,i, y_i) θ ] }

Population gradient EM operator defined similarly with expectations.

EM Update for Linear Regression with Missing Covariates

EM update maximizes the function Q(θ′|θ):

Q(θ′|θ) = −(1/(2n)) Σ_{i=1}^{n} [ ⟨θ′, Σ_θ(x_obs,i, y_i) θ′⟩ − 2 y_i ⟨µ_θ(x_obs,i, y_i), θ′⟩ ]

Explicit solution for the sample-based EM operator:

M_n(θ) = [ Σ_{i=1}^{n} Σ_θ(x_obs,i, y_i) ]^{−1} [ Σ_{i=1}^{n} y_i µ_θ(x_obs,i, y_i) ]

Population EM operator:

M(θ) = { E[Σ_θ(X_obs, Y)] }^{−1} E[Y µ_θ(X_obs, Y)]
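Given the per-observation quantities µ_θ(x_obs,i, y_i) and Σ_θ(x_obs,i, y_i) (supplied here as precomputed arrays; this interface is an assumption of the sketch), the sample update M_n(θ) is a single d × d linear solve:

```python
import numpy as np

def em_update_missing(Sigmas, mus, y):
    """Closed-form EM update M_n(theta) = (sum_i Sigma_i)^{-1} (sum_i y_i * mu_i).

    Sigmas: (n, d, d) array with Sigma_theta(x_obs_i, y_i)
    mus:    (n, d)    array with mu_theta(x_obs_i, y_i)
    y:      (n,)      array of responses
    """
    A = Sigmas.sum(axis=0)
    b = (y[:, None] * mus).sum(axis=0)
    return np.linalg.solve(A, b)
```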

Gradient EM Update for Missing Covariates

Gradient EM algorithm for the sample with step size α:

G_n(θ) = θ + α { (1/n) Σ_{i=1}^{n} [ y_i µ_θ(x_obs,i, y_i) − Σ_θ(x_obs,i, y_i) θ ] }

Population counterpart for gradient EM:

G (θ) = θ + αE [Y µθ (Xobs , Y ) − Σθ (Xobs , Y )θ]

General Convergence Results Overview

Exploring convergence to θ∗, the maximizer of the population likelihood.
Sufficient conditions for population algorithms to converge to θ∗.
Sample-based algorithms: convergence within an ε-ball of θ∗.
Theorems detail contractive behavior for both population and
sample-based updates.
Probabilistic analysis for sample-based convergence, including
stochastic updates.

Guarantees for Population-level EM

Self-Consistency:
Assume vector θ∗ maximizes population likelihood.
Must satisfy self-consistency condition:

θ∗ = arg max_{θ ∈ Ω} Q(θ|θ∗)

The function q(·) := Q(·|θ∗) is key.


λ-Strongly Concave:
Assumes q is λ-strongly concave:
q(θ₁) − q(θ₂) − ⟨∇q(θ₂), θ₁ − θ₂⟩ ≤ −(λ/2) ∥θ₁ − θ₂∥₂²,
Valid in neighborhoods of θ∗ .

Guarantees for Population-level EM

Gradient Mappings and Fixed Points:


First-order optimality condition:

⟨∇Q(θ∗ |θ∗ ), θ′ − θ∗ ⟩ ≤ 0 for all θ′ ∈ Ω,

This condition characterizes θ∗ as a fixed point of the EM operator M, i.e. M(θ∗) = θ∗.


EM Update Characterization:
For EM updates, we have:

⟨∇Q(M(θ)|θ), θ′ − M(θ)⟩ ≤ 0 for all θ′ ∈ Ω,

where M(θ) denotes the EM update mapping.

Regularity Conditions in EM Convergence
Regularity Condition FOS(γ):
The functions {Q(·|θ), θ ∈ Ω} satisfy the condition FOS(γ) over B₂(r; θ∗) if

∥∇Q(M(θ)|θ∗) − ∇Q(M(θ)|θ)∥₂ ≤ γ ∥θ − θ∗∥₂  for all θ ∈ B₂(r; θ∗).    (3)

Intuition and Example:

The condition is trivially met at θ = θ∗, where both sides equal 0.
For the Gaussian mixture model it reduces to

∥2 E[(w_θ(Y) − w_θ∗(Y)) Y]∥₂ ≤ γ ∥θ − θ∗∥₂.

Smoothness of w_θ(y) suggests the condition holds near θ∗.

Guarantees for Sample-based EM
EM Variants: Standard iterative EM and sample-splitting EM.
Objective: Analyze convergence guarantees based on sample size (n)
and tolerance (δ).
Convergence Rates
Define ϵM (n, δ) for fixed θ ∈ B2 (r ; θ∗ ):

∥Mn (θ) − M(θ)∥ ≤ ϵM (n, δ), with probability ≥ 1 − δ.

Define the uniform rate ϵ_M^unif(n, δ) over B₂(r; θ∗):

sup_{θ ∈ B₂(r;θ∗)} ∥M_n(θ) − M(θ)∥ ≤ ϵ_M^unif(n, δ), with probability ≥ 1 − δ.

Theoretical Guarantee
If M is contractive on B2 (r ; θ∗ ) and θ0 ∈ B2 (r ; θ∗ ), EM convergence
is guaranteed under the defined rates.
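One informal way to read ε_M(n, δ): estimate the deviation ∥M_n(θ) − M(θ)∥ by Monte Carlo at a fixed θ, using a very large sample as a stand-in for the population operator. The model, sample sizes, and quantile estimate below are choices made for this sketch only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 2, 1.0
theta_star = np.array([1.0, 1.0])

def sample(n):
    s = rng.choice([-1.0, 1.0], size=n)
    return s[:, None] * theta_star + sigma * rng.standard_normal((n, d))

def M_n(Y, theta):
    """Sample EM operator for the symmetric two-component Gaussian mixture."""
    a = -np.sum((Y - theta) ** 2, axis=1) / (2 * sigma ** 2)
    b = -np.sum((Y + theta) ** 2, axis=1) / (2 * sigma ** 2)
    w = 1.0 / (1.0 + np.exp(np.clip(b - a, -50.0, 50.0)))
    return ((2 * w - 1)[:, None] * Y).mean(axis=0)

theta = np.array([0.8, 0.9])                    # a fixed theta in a ball around theta*
M_pop = M_n(sample(2_000_000), theta)           # large-sample proxy for M(theta)

n, delta, trials = 1000, 0.05, 500
devs = np.array([np.linalg.norm(M_n(sample(n), theta) - M_pop) for _ in range(trials)])
print("estimated eps_M(n, delta):", np.quantile(devs, 1 - delta))
```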
Guarantees for Population-level Gradient EM

Update equation for the gradient EM algorithm:

G (θ) := θ + α∇Q(θ|θ).

Here, α > 0 is a step size parameter.

Guarantees for Population-level Gradient EM: Gradient Stability (GS) Definition

Definition (Gradient Stability (GS))


The functions {Q(·|θ), θ ∈ Ω} satisfy the Gradient Stability condition
GS(γ) over B2 (r ; θ∗ ) if

∥∇Q(θ∗ |θ∗ ) − ∇Q(θ|θ)∥2 ≤ γ∥θ − θ∗ ∥2 for all θ ∈ B2 (r ; θ∗ ). (5)

Guarantees for Population-level Gradient EM: Gradient Stability Condition

[Figure omitted: sketch of ∇q(θ), ∇Q(θ|θ₁), and ∇Q(θ|θ₂) as functions of θ, with the points θ∗, θ₁, θ₂ marked on the horizontal axis.]

Figure: Illustration of the gradient stability condition

Guarantees for Sample-based Gradient EM

Deviations of the sample gradient EM operator G_n from the population version G:

Let ϵ_G(n, δ) be the smallest scalar such that
∥G_n(θ) − G(θ)∥₂ ≤ ϵ_G(n, δ) for any fixed vector θ ∈ B₂(r; θ∗), with probability at least 1 − δ.

Uniform deviation analogue:

Let ϵ_G^unif(n, δ) be the smallest scalar such that
sup_{θ ∈ B₂(r;θ∗)} ∥G_n(θ) − G(θ)∥₂ ≤ ϵ_G^unif(n, δ), with probability at least 1 − δ.

Population Contractivity for Gaussian Mixture Models

Gaussian mixture model with equally weighted components at ±θ∗ and variance σ²I.
Signal-to-noise ratio (SNR) condition:

∥θ∗∥₂ / σ > η,    (6)
where η is a sufficiently large constant.
Population EM operator is contractive over the ball B₂(r; θ∗) with radius r and contractivity parameter κ, which decreases exponentially with η².

Population Contractivity for Gaussian Mixtures

Corollary (Corollary 1: Population contractivity for Gaussian mixtures)
Consider a Gaussian mixture model satisfying the signal-to-noise ratio (SNR) condition (6) with a sufficiently large η. There exists a universal constant c > 0 such that the population EM operator is κ-contractive over the ball B₂(r; θ∗) with

r = ∥θ∗∥₂ / 4,
κ(η) ≤ e^{−c η²}.

Sample-based EM Guarantees for Gaussian Mixtures

Corollary (Corollary 2: Sample-based EM guarantees for Gaussian mixtures)
Given a Gaussian mixture model satisfying the conditions of Corollary 1, suppose the sample size n is at least of order d log(1/δ). For any initial parameter θ0 in the ball B₂(r; θ∗) with r = ∥θ∗∥₂/4, there exists a contraction coefficient κ(η) ≤ e^{−c η²} such that the EM iterates {θt}_{t≥0} satisfy

∥θ^t − θ∗∥₂ ≤ κ^t ∥θ^0 − θ∗∥₂ + (φ(σ; ∥θ∗∥₂) / (1 − κ)) √( (1/n) log(1/δ) )

with probability at least 1 − δ.

Comparison of EM and Gradient EM for Gaussian Mixtures
[Figure omitted: two panels, (a) EM Algorithm and (b) Gradient EM Algorithm, each plotting log error (optimization error and statistical error) against iteration count.]

Figure: Comparison of EM and gradient EM for Gaussian mixtures. Each plot illustrates the performance of the algorithm on 10 different trials for a mixture model with dimension d = 10, sample size n = 1000, and signal-to-noise ratio (SNR) of 2.
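A sketch of the kind of experiment behind this figure, using the symmetric two-component model from the earlier slides with d = 10, n = 1000, SNR = 2. The error definitions (optimization error relative to the sample fixed point θ̂, statistical error relative to θ∗) and the initialization are our reading of the setup, not the authors' exact code:

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(1)
d, n, sigma = 10, 1000, 1.0
theta_star = 2.0 * np.ones(d) / np.sqrt(d)            # ||theta*||_2 / sigma = 2

# Synthetic data from the symmetric two-component mixture.
signs = rng.choice([-1.0, 1.0], size=n)
Y = signs[:, None] * theta_star + sigma * rng.standard_normal((n, d))

def em_step(theta):
    a = -np.sum((Y - theta) ** 2, axis=1) / (2 * sigma ** 2)
    b = -np.sum((Y + theta) ** 2, axis=1) / (2 * sigma ** 2)
    w = expit(a - b)                                   # w_theta(y_i)
    return ((2 * w - 1)[:, None] * Y).mean(axis=0)     # M_n(theta)

# Run EM to (numerical) convergence to get the sample fixed point theta_hat.
theta_hat = theta_star.copy()
for _ in range(200):
    theta_hat = em_step(theta_hat)

# Track optimization and statistical errors along an EM run from a nearby start.
theta = theta_star + 0.5 * rng.standard_normal(d)
for t in range(12):
    opt_err = np.linalg.norm(theta - theta_hat)
    stat_err = np.linalg.norm(theta - theta_star)
    print(t, np.log(opt_err + 1e-16), np.log(stat_err))
    theta = em_step(theta)
```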

EM Convergence Rates Across Different SNRs
[Figure omitted: log optimization error versus iteration count for EM on a mixture of Gaussians, one curve per SNR value in {0.5, 0.75, 1, 1.8, 2.5}.]

Figure: Plot of the iteration count versus the (log) optimization error log(∥θt − θ̂∥) for different values of the SNR ∥θ∗∥₂/σ. For each SNR, we performed 10 independent trials of a Gaussian mixture model with dimension d = 10 and sample size n = 1000. Larger values of the SNR lead to faster convergence rates, consistent with Corollary 2.
Conclusion and Perspectives

The paper offers an in-depth analysis of EM and gradient EM algorithms, investigating their dynamics in population and sample contexts.
Techniques introduced may extend to other non-convex optimization
algorithms, broadening the scope of application.
Future work could explore EM’s performance when model assumptions
such as correct specification and i.i.d. sampling are relaxed.
Understanding EM under model mis-specification is highlighted as an
important avenue for research, considering the robustness of MLE.
The role of appropriate initialization is emphasized, suggesting the
potential of pilot estimators for setting up EM and gradient EM
algorithms effectively.

