Tractable nonconvex problems

Suvrit Sra (suvrit@mit.edu)
Massachusetts Institute of Technology
6.881 Optimization for Machine Learning, Lecture 6 (March 4, 2021)
Spectral problems
Simplest example: eigenvalues

Largest eigenvalue of a symmetric matrix A: write A = U diag(λ_1, ..., λ_n) U^T and substitute y = U^T x, then z_i = y_i^2. Then
\[
\lambda_{\max}(A) \;=\; \max_{y^T y = 1} \sum_i \lambda_i y_i^2 \;=\; \max_{z^T \mathbf{1} = 1,\ z \ge 0} \sum_i \lambda_i z_i,
\]
a linear program over the probability simplex, attained at a vertex z = e_i corresponding to the largest λ_i.
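A quick numerical check of this identity, as a minimal numpy sketch (the matrix size and sample count are arbitrary): Rayleigh quotients of random unit vectors never exceed λ_max and approach it from below.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                       # a symmetric test matrix
lam = np.linalg.eigvalsh(A)             # eigenvalues, ascending order

# Sample random unit vectors: x^T A x stays below lambda_max,
# and its supremum over the unit sphere attains it.
X = rng.standard_normal((100000, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)
rayleigh = np.einsum('ij,jk,ik->i', X, A, X)

print(lam[-1])          # largest eigenvalue
print(rayleigh.max())   # close to, and never above, lam[-1]
```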
Generalized eigenvalues

Let A, B be symmetric matrices (B ≻ 0, so the quotient below is well defined); the largest generalized eigenvalue is
\[
\lambda_{\max}(A, B) = \max_{x \neq 0} \frac{x^T A x}{x^T B x}.
\]
Nonconvex, yet tractable: it reduces to an ordinary symmetric eigenproblem for B^{-1/2} A B^{-1/2}.
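Despite the nonconvex Rayleigh-quotient form, standard eigensolvers handle this directly. A sketch using scipy.linalg.eigh on random illustrative data, assuming B positive definite:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(6)
n = 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
C = rng.standard_normal((n, n))
B = C @ C.T + n * np.eye(n)     # symmetric positive definite

w, V = eigh(A, B)               # generalized eigenvalues of (A, B), ascending
x = V[:, -1]                    # maximizer of the Rayleigh quotient
print(w[-1])                    # largest generalized eigenvalue
print((x @ A @ x) / (x @ B @ x))  # equals w[-1]
```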
Trust region subproblem

\[
\begin{aligned}
\min_x \quad & x^T A x + 2 b^T x + c \\
\text{s.t.} \quad & x^T B x + 2 d^T x + e \le 0.
\end{aligned}
\]

Here A and B are merely symmetric; hence the problem is nonconvex.

The dual problem can be formulated as (Verify!)
\[
\begin{aligned}
\max_{u,\, v \in \mathbb{R}} \quad & u \\
\text{s.t.} \quad & \begin{bmatrix} A + vB & b + vd \\ (b + vd)^T & c + ve - u \end{bmatrix} \succeq 0, \qquad v \ge 0.
\end{aligned}
\]
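A sketch of this dual SDP in CVXPY, with illustrative random data and a unit-ball constraint (B = I, d = 0, e = -1 are assumptions of this example, not from the slides). Strong duality holds for the trust region subproblem, so the optimal u matches the primal minimum.

```python
import cvxpy as cp
import numpy as np

n = 3
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # merely symmetric
b = rng.standard_normal(n); c = 1.0
B = np.eye(n); d = np.zeros(n); e = -1.0             # unit-ball constraint x^T x <= 1

u = cp.Variable()
v = cp.Variable(nonneg=True)                         # v >= 0
M = cp.bmat([[A + v * B, cp.reshape(b + v * d, (n, 1))],
             [cp.reshape(b + v * d, (1, n)), cp.reshape(c + v * e - u, (1, 1))]])
prob = cp.Problem(cp.Maximize(u), [M >> 0])          # LMI constraint
prob.solve()
print(prob.value)   # dual optimum; equals the primal TRS value by strong duality
```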
Toeplitz-Hausdorff Theorem

Let A be a complex, square matrix. Its numerical range is
\[
W(A) = \{\, x^* A x \mid x \in \mathbb{C}^n,\ \|x\|_2 = 1 \,\}.
\]
Theorem (Toeplitz-Hausdorff). W(A) is a convex subset of the complex plane.
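A short numpy sketch tracing boundary points of W(A) via the standard support-function construction: for each direction θ, the top eigenvector of the Hermitian part of e^{-iθ}A yields an extreme point of the numerical range (the matrix here is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

pts = []
for t in np.linspace(0, 2 * np.pi, 200):
    # Hermitian part of e^{-it} A; its top eigenvalue is the support
    # function of W(A) in direction t.
    H = (np.exp(-1j * t) * A + np.exp(1j * t) * A.conj().T) / 2
    w, V = np.linalg.eigh(H)
    x = V[:, -1]                        # eigenvector of the largest eigenvalue
    pts.append(x.conj() @ A @ x)        # a boundary point of W(A)

pts = np.array(pts)
print(pts[:5])   # pts traces the boundary; its convexity is the theorem's content
```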
Principal Component Analysis (PCA)

Let A ∈ R^{n×p}. Consider the nonconvex problem of best rank-k approximation,
\[
\min_{X \,:\, \mathrm{rank}(X) \le k} \|A - X\|_F.
\]
Despite the nonconvex rank constraint, the global optimum is the truncated SVD (Eckart-Young):
\[
X^* = U_k \Sigma_k V_k^T.
\]
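A numpy sketch verifying Eckart-Young on random data: the truncated SVD achieves the optimal Frobenius error, which equals the norm of the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 20, 10, 3
A = rng.standard_normal((n, p))

# Truncated SVD: the global optimum of the (nonconvex) rank-k approximation.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
X_star = U[:, :k] * s[:k] @ Vt[:k, :]   # X* = U_k Sigma_k V_k^T

# Optimal error equals the tail singular values (Eckart-Young).
print(np.linalg.norm(A - X_star, 'fro'))
print(np.sqrt(np.sum(s[k:] ** 2)))      # same value
```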
PCA via the Fantope

Another characterization of the SVD, as a nonconvex problem: for a symmetric matrix S (e.g., S = A^T A), the sum of the k largest eigenvalues satisfies
\[
\sum_{i=1}^{k} \lambda_i(S) = \max_{V \in \mathbb{R}^{n \times k},\ V^T V = I_k} \mathrm{Tr}(V^T S V),
\]
a maximization over the (nonconvex) Stiefel manifold. Replacing VV^T by a matrix Z ranging over the Fantope (defined next) gives an exact convex reformulation: max_{Z ∈ C} Tr(SZ).
Fantope

\[
\mathcal{C} = \left\{\, Z = Z^T \;\middle|\; \lambda_i(Z) \in [0, 1],\ \mathrm{Tr}(Z) = k \,\right\}
= \left\{\, Z = Z^T \;\middle|\; 0 \preceq Z \preceq I,\ \mathrm{Tr}(Z) = k \,\right\}.
\]
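A CVXPY sketch with illustrative random data: maximizing Tr(SZ) over the Fantope is a convex SDP whose optimal value matches the sum of the k largest eigenvalues, i.e., the relaxation of the Stiefel problem above is exact.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, k = 6, 2
S = rng.standard_normal((n, n)); S = (S + S.T) / 2

Z = cp.Variable((n, n), symmetric=True)
constraints = [Z >> 0,                     # Z >= 0
               np.eye(n) - Z >> 0,         # Z <= I
               cp.trace(Z) == k]           # Tr(Z) = k
prob = cp.Problem(cp.Maximize(cp.trace(S @ Z)), constraints)
prob.solve()

print(prob.value)                                 # Fantope SDP value
print(np.sort(np.linalg.eigvalsh(S))[-k:].sum())  # sum of top-k eigenvalues: matches
```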
Sparsity
Nonconvex sparse optimization

The ℓ0-quasi-norm is defined as
\[
\|x\|_0 := \#\{\, i \mid x_i \neq 0 \,\},
\]
the number of nonzero entries of x. It is a quasi-norm, not a norm: ‖αx‖_0 = ‖x‖_0 for every α ≠ 0, so absolute homogeneity fails.
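A two-line numpy illustration of the definition, and of why ‖·‖_0 fails homogeneity:

```python
import numpy as np

x = np.array([0.0, 1.5, 0.0, -0.2, 0.0])
l0 = np.count_nonzero(x)               # ||x||_0 = number of nonzeros
print(l0)                              # 2
print(np.count_nonzero(2 * x) == l0)   # True: ||a*x||_0 = ||x||_0, so not a norm
```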
Compressed Sensing
Generalized convexity
Geometric programming

A monomial is a function of the form
\[
f(x) = \gamma\, x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n}, \qquad \gamma > 0,\ x > 0,\ a_i \in \mathbb{R}.
\]
Clearly, nonconvex.
Geometric programming

Make the change of variables y_i = log x_i (recall x_i > 0). Then
\[
f(x) = f(e^y) = \gamma\,(e^{y_1})^{a_1} \cdots (e^{y_n})^{a_n} = e^{a^T y + b},
\]
for b = log γ, which is convex in y.
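A small numpy sketch of this change of variables (the values of γ and a are illustrative): it checks the identity f(e^y) = e^{a^T y + b} and spot-checks convexity of the transformed function along a segment.

```python
import numpy as np

# Monomial f(x) = gamma * x1^a1 * x2^a2 on x > 0 (illustrative values).
gamma, a = 2.0, np.array([1.5, -0.7])
f = lambda x: gamma * np.prod(x ** a)

# After y = log x: log f(e^y) = a^T y + b with b = log gamma, affine in y.
b = np.log(gamma)
y = np.array([0.3, -1.2])
assert np.isclose(f(np.exp(y)), np.exp(a @ y + b))

# Midpoint convexity check of g(y) = exp(a^T y + b) along a random segment.
y1, y2 = np.random.default_rng(5).standard_normal((2, 2))
g = lambda y: np.exp(a @ y + b)
print(g((y1 + y2) / 2) <= (g(y1) + g(y2)) / 2)   # True
```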
Generalized convexity
Linear fractional programming

\[
\begin{aligned}
\min_x \quad & \frac{a^T x + b}{c^T x + d} \\
\text{s.t.} \quad & Gx \le h, \quad c^T x + d > 0, \quad Ex = f.
\end{aligned}
\]

Nonconvex (a ratio of affine functions), but reducible to a linear program via the Charnes-Cooper transformation y = x / (c^T x + d), z = 1 / (c^T x + d):

\[
\begin{aligned}
\min_{y, z} \quad & a^T y + bz \\
\text{s.t.} \quad & Gy - hz \le 0, \quad z \ge 0, \\
& Ey = fz, \quad c^T y + dz = 1.
\end{aligned}
\]
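A sketch solving the transformed LP with scipy.optimize.linprog on hypothetical data (the Ex = f block is omitted for brevity); the original solution is recovered as x = y / z.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical instance: minimize (x1 + 2*x2 + 1) / (0.5*x1 + 2)
# subject to x >= 0 and x1 + x2 <= 4 (encoded in G, h below).
a, b = np.array([1.0, 2.0]), 1.0
c, d = np.array([0.5, 0.0]), 2.0
G = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
h = np.array([4.0, 0.0, 0.0])

# Charnes-Cooper LP in variables (y, z): min a^T y + b z.
cost = np.concatenate([a, [b]])
A_ub = np.hstack([G, -h[:, None]])        # G y - h z <= 0
b_ub = np.zeros(len(h))
A_eq = np.concatenate([c, [d]])[None, :]  # c^T y + d z = 1
b_eq = np.array([1.0])

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(None, None), (None, None), (0, None)])  # z >= 0
y, z = res.x[:2], res.x[2]
x = y / z                                  # undo the transformation
print(x, (a @ x + b) / (c @ x + d))        # optimum x = [0, 0], value 0.5
```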
Challenge: Simplex convexity
The Polyak-Łojasiewicz class

PL class, also known as gradient-dominated functions:
\[
f(x) - f(x^*) \le \tau\, \|\nabla f(x)\|^\alpha, \qquad \alpha \ge 1.
\]
Observe that if ∇f(x) = 0, then x must be a global optimum.
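A sketch of why this matters algorithmically: f(x) = x² + 3 sin²(x) is nonconvex but satisfies PL with α = 2 (a standard example from Karimi, Nutini, and Schmidt, 2016), so gradient descent converges to the global minimum.

```python
import numpy as np

f = lambda x: x**2 + 3 * np.sin(x)**2
grad = lambda x: 2 * x + 3 * np.sin(2 * x)   # derivative of 3 sin^2 x is 3 sin 2x

x = 3.0
for _ in range(50):
    x -= 0.1 * grad(x)       # step < 1/L; here f'' <= 8, so L = 8 suffices
print(x, f(x))               # x -> 0, f(x) -> 0: the global minimum
```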
Important non-convex PL example

- Let g(x) = (g_1(x), ..., g_m(x)) be a differentiable function.
- Consider the system of nonlinear equations g(x) = 0.
- Assume that m ≤ n and that ∃ x^* s.t. g(x^*) = 0.
- Assume the Jacobian J(x) = (∇g_1(x), ..., ∇g_m(x)) is non-degenerate on a convex set X containing x^*. Then σ = inf_{x ∈ X} λ_min(J(x)^T J(x)) > 0.
- Let f(x) = (1/2) Σ_i g_i^2(x); note that ∇f(x) = J(x) g(x). Then ‖∇f(x)‖^2 = ‖J(x) g(x)‖^2 ≥ σ ‖g(x)‖^2 = 2σ (f(x) − f(x^*)), so f satisfies PL with α = 2 and τ = 1/(2σ), even though f is nonconvex in general (see the sketch below).
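A minimal numerical sketch under these assumptions, with an illustrative system g (m = 2 equations, n = 3 unknowns; not from the slides): its row-Jacobian has full rank everywhere, and gradient descent on f = (1/2)‖g‖² drives g to zero.

```python
import numpy as np

# g1(x) = x0 + x1^2 - 1,  g2(x) = x1 + x2: an underdetermined smooth system
# whose Jacobian rows are linearly independent for every x (non-degenerate).
def g(x):
    return np.array([x[0] + x[1] ** 2 - 1.0, x[1] + x[2]])

def jac(x):                                  # rows are the gradients of g_i
    return np.array([[1.0, 2 * x[1], 0.0],
                     [0.0, 1.0, 1.0]])

x = np.array([2.0, 1.0, -1.0])
for _ in range(500):
    x -= 0.1 * jac(x).T @ g(x)               # gradient step on f = 0.5 ||g||^2
print(g(x))                                  # ~0: a global minimizer of f is found
```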
Other tractable nonconvex problems

Ref: Chulhee Yun, Suvrit Sra, Ali Jadbabaie. Global optimality conditions for deep neural networks. ICLR 2018.