6.881 Optimization for Machine Learning (Lecture 4)
Suvrit Sra
Massachusetts Institute of Technology
25 Feb 2021
Optimality
(Local and global optima)
Set of Optimal Solutions

For a convex problem, the set of optimal solutions argmin f is itself a convex set (possibly empty).
Optimality conditions
(Recognizing optima)
First-order conditions: unconstrained

Theorem. Let f : Rⁿ → R be continuously differentiable on an
open set S containing x∗, a local minimum. Then, ∇f(x∗) = 0.

Proof: Consider the function g(t) = f(x∗ + td), where d ∈ Rⁿ and t > 0.
Since x∗ is a local min, for small enough t, f(x∗ + td) ≥ f(x∗). Hence

      g′(0) = ⟨∇f(x∗), d⟩ = lim_{t↓0} [f(x∗ + td) − f(x∗)] / t ≥ 0.

Since d is arbitrary, applying this to both d and −d gives
⟨∇f(x∗), d⟩ = 0 for every d ∈ Rⁿ, i.e., ∇f(x∗) = 0.
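As a quick numerical sanity check of this first-order condition, here is a minimal sketch (assuming NumPy; the test function f and the point x_star are illustrative choices, not from the lecture): it estimates the gradient by central differences and confirms it vanishes at the known minimizer.

```python
import numpy as np

def f(x):
    # Smooth convex test function with minimizer x* = (1, -2).
    return (x[0] - 1.0)**2 + 2.0 * (x[1] + 2.0)**2

def num_grad(f, x, h=1e-6):
    # Central-difference estimate of the gradient of f at x.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x_star = np.array([1.0, -2.0])
print(num_grad(f, x_star))  # approximately [0, 0], as the theorem predicts
```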
First-order conditions: constrained

♠ For convex f, we have f(y) ≥ f(x) + ⟨∇f(x), y − x⟩.

♠ Thus, x∗ is optimal if and only if

      ⟨∇f(x∗), y − x∗⟩ ≥ 0, for all y ∈ X.

♠ If X = Rⁿ, this reduces to ∇f(x∗) = 0.

[Figure: the feasible set X with minimizer x∗ on its boundary; ∇f(x∗) makes a non-negative inner product with every feasible direction y − x∗.]
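A small numerical sketch of this condition (illustrative, assuming NumPy): minimize f(x) = ½‖x − c‖² over the box X = [0, 1]ⁿ, whose minimizer is simply the clipped point, and check ⟨∇f(x∗), y − x∗⟩ ≥ 0 on random feasible y.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
c = rng.normal(size=n) * 2.0           # unconstrained minimizer of f
x_star = np.clip(c, 0.0, 1.0)          # minimizer of f over the box [0, 1]^n
grad = x_star - c                      # grad f(x*) for f(x) = 0.5 * ||x - c||^2

for _ in range(1000):
    y = rng.uniform(0.0, 1.0, size=n)  # arbitrary feasible point
    assert grad @ (y - x_star) >= -1e-12   # first-order optimality holds
print("variational inequality verified on 1000 feasible points")
```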
First-order conditions: constrained

▶ Let f be continuously differentiable, possibly nonconvex
▶ Suppose ∃y ∈ X such that ⟨∇f(x∗), y − x∗⟩ < 0
▶ Using the mean-value theorem of calculus, ∃ξ ∈ [0, 1] s.t.

      f(x∗ + t(y − x∗)) = f(x∗) + t⟨∇f(x∗ + ξt(y − x∗)), y − x∗⟩.

▶ By continuity of ∇f, the inner product stays negative for small enough t > 0,
  so f(x∗ + t(y − x∗)) < f(x∗), contradicting local optimality of x∗.
▶ Hence ⟨∇f(x∗), y − x∗⟩ ≥ 0 for all y ∈ X is necessary even without convexity.
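A tiny illustrative check of the descent argument (assuming NumPy; the function is an arbitrary smooth example, not from the slides): whenever ⟨∇f(x), d⟩ < 0, a small step along d strictly decreases f.

```python
import numpy as np

def f(x):
    # Arbitrary smooth (nonconvex) test function.
    return np.sin(x[0]) + x[1]**2

def grad_f(x):
    return np.array([np.cos(x[0]), 2 * x[1]])

x = np.array([0.3, 1.0])
d = -grad_f(x)                      # a direction with <grad f(x), d> < 0
assert grad_f(x) @ d < 0
for t in [1e-1, 1e-2, 1e-3]:
    assert f(x + t * d) < f(x)      # strict decrease for small steps
print("negative directional derivative => descent, as in the proof")
```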
Optimality without differentiability

For convex but possibly nondifferentiable f, the gradient is replaced by the
subdifferential: x∗ minimizes f if and only if 0 ∈ ∂f(x∗) (Fermat's rule).
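An illustrative instance of Fermat's rule (my example, not from the slides; assuming NumPy): for the scalar problem min_x ½(x − y)² + λ|x|, the minimizer is the soft-thresholding of y, and one can verify 0 ∈ (x∗ − y) + λ∂|x∗|.

```python
import numpy as np

def soft_threshold(y, lam):
    # Closed-form minimizer of 0.5*(x - y)^2 + lam*|x|.
    return np.sign(y) * max(abs(y) - lam, 0.0)

def check_fermat(y, lam, tol=1e-12):
    x = soft_threshold(y, lam)
    if x != 0.0:
        # |x| is differentiable here: subdifferential is {sign(x)}.
        return abs((x - y) + lam * np.sign(x)) <= tol
    # At x = 0 the subdifferential of |x| is the interval [-1, 1].
    return abs(y) <= lam + tol      # 0 in -y + lam*[-1, 1]

for y in [2.0, 0.3, -1.5, 0.0]:
    assert check_fermat(y, lam=0.5)
print("0 lies in the subdifferential at the soft-thresholded point, for each y")
```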
Optimality – nonsmooth

Application
Example

x ∈ dom f, ‖x‖ ≤ 1,
Optimality conditions
(KKT and friends)
Optimality conditions via Lagrangian

Suppose strong duality holds and the optimal values are attained by a
primal-dual pair (x∗, λ∗):

      p∗ = f(x∗) = d∗ = g(λ∗).

Then x∗ ∈ argmin_x L(x, λ∗).
Optimality conditions via Lagrangian

x∗ ∈ argmin_x L(x, λ∗).

If f, f₁, …, fₘ are differentiable, this implies

      ∇ₓL(x, λ∗)|_{x=x∗} = ∇f(x∗) + Σᵢ λᵢ∗ ∇fᵢ(x∗) = 0.

Moreover, since g(λ∗) = L(x∗, λ∗), we get f(x∗) = f(x∗) + Σᵢ λᵢ∗ fᵢ(x∗);
with λᵢ∗ ≥ 0 and fᵢ(x∗) ≤ 0, every term in the sum must vanish:

      λᵢ∗ fᵢ(x∗) = 0,  i = 1, …, m.
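A minimal worked instance of these two conditions (my illustrative example, plain Python): minimize f(x) = x² subject to f₁(x) = 1 − x ≤ 0; the sketch confirms the KKT pair (x∗, λ∗) = (1, 2).

```python
# f(x) = x^2, constraint f1(x) = 1 - x <= 0; Lagrangian L = x^2 + lam*(1 - x).
def grad_f(x): return 2 * x
def f1(x): return 1 - x
def grad_f1(x): return -1.0

x_star, lam_star = 1.0, 2.0
assert grad_f(x_star) + lam_star * grad_f1(x_star) == 0   # stationarity
assert lam_star * f1(x_star) == 0                          # complementary slackness
assert f1(x_star) <= 0 and lam_star >= 0                   # primal/dual feasibility
print("KKT conditions hold at (x*, lambda*) = (1, 2)")
```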
KKT conditions

For min f(x) s.t. fᵢ(x) ≤ 0, i = 1, …, m, the Karush-Kuhn-Tucker (KKT)
conditions at a pair (x∗, λ∗) collect the requirements derived above:

▶ Primal feasibility: fᵢ(x∗) ≤ 0 for all i
▶ Dual feasibility: λᵢ∗ ≥ 0 for all i
▶ Complementary slackness: λᵢ∗ fᵢ(x∗) = 0 for all i
▶ Stationarity: ∇f(x∗) + Σᵢ λᵢ∗ ∇fᵢ(x∗) = 0

Read Ch. 5 of BV (Boyd & Vandenberghe, Convex Optimization).
Examples
Projection onto a hyperplane

      min_x ½‖x − y‖²  s.t.  aᵀx = b.

KKT Conditions

The Lagrangian is L(x, ν) = ½‖x − y‖² + ν(aᵀx − b). Stationarity gives
x = y − νa; enforcing aᵀx = b yields ν = (aᵀy − b)/‖a‖². Hence

      x = y − ((aᵀy − b)/‖a‖²) a.
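A quick numerical sketch (assuming NumPy; the data is random): compute the closed-form projection and confirm both feasibility and stationarity.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
a, y = rng.normal(size=n), rng.normal(size=n)
b = 0.7

nu = (a @ y - b) / (a @ a)       # multiplier from the KKT system
x = y - nu * a                   # closed-form projection onto {x : a^T x = b}

assert abs(a @ x - b) < 1e-10               # primal feasibility
assert np.allclose((x - y) + nu * a, 0.0)   # stationarity of the Lagrangian
print("projection verified:", x)
```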
Projection onto simplex

      min_x ½‖x − y‖²  s.t.  xᵀ1 = 1, x ≥ 0.

KKT Conditions

      L(x, λ, ν) = ½‖x − y‖² − Σᵢ λᵢxᵢ + ν(xᵀ1 − 1)

      ∂L/∂xᵢ = xᵢ − yᵢ − λᵢ + ν = 0
      λᵢxᵢ = 0
      λᵢ ≥ 0
      xᵀ1 = 1, x ≥ 0

Eliminating λᵢ: if xᵢ > 0 then λᵢ = 0 and xᵢ = yᵢ − ν; if xᵢ = 0 then
λᵢ = ν − yᵢ ≥ 0. So xᵢ = max(yᵢ − ν, 0), with ν chosen so that
Σᵢ max(yᵢ − ν, 0) = 1.
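The KKT characterization suggests searching for ν over the sorted entries of y; below is one standard O(n log n) sort-based sketch of this projection (following the usual recipe, e.g., Duchi et al.; assuming NumPy, and the helper name is mine), with a feasibility check.

```python
import numpy as np

def project_simplex(y):
    # Solve min 0.5*||x - y||^2 s.t. sum(x) = 1, x >= 0 via the KKT form
    # x_i = max(y_i - nu, 0), finding nu from the sorted entries of y.
    u = np.sort(y)[::-1]
    cumsum = np.cumsum(u)
    ks = np.arange(1, len(y) + 1)
    rho = np.max(np.nonzero(u - (cumsum - 1) / ks > 0)[0])  # last valid index
    nu = (cumsum[rho] - 1) / (rho + 1)
    return np.maximum(y - nu, 0.0)

y = np.array([0.9, 0.4, -0.3, 1.2])
x = project_simplex(y)
assert abs(x.sum() - 1) < 1e-12 and (x >= 0).all()
print(x)   # the Euclidean projection of y onto the probability simplex
```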
Total variation minimization

      min_x ½‖x − y‖² + λ Σᵢ |xᵢ₊₁ − xᵢ|,

or equivalently, with D the (n−1)×n finite-difference matrix (Dx)ᵢ = xᵢ₊₁ − xᵢ,

      min_x ½‖x − y‖² + λ‖Dx‖₁.
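A small sketch (assuming NumPy; the test vectors are arbitrary) that builds the differencing matrix D and confirms the two formulations of the objective agree.

```python
import numpy as np

def diff_matrix(n):
    # (n-1) x n matrix with (Dx)_i = x_{i+1} - x_i.
    D = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

def tv_objective(x, y, lam, D):
    return 0.5 * np.sum((x - y)**2) + lam * np.sum(np.abs(D @ x))

n, lam = 6, 0.8
rng = np.random.default_rng(2)
x, y = rng.normal(size=n), rng.normal(size=n)
D = diff_matrix(n)
direct = 0.5 * np.sum((x - y)**2) + lam * np.sum(np.abs(np.diff(x)))
assert np.isclose(direct, tv_objective(x, y, lam, D))
print("sum_i |x_{i+1} - x_i| equals ||Dx||_1, as claimed")
```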
Nonsmooth KKT
(via subdifferentials)
KKT via subdifferentials?

Assume all fᵢ(x) are finite valued, and dom f = Rⁿ. Write Cᵢ := {x : fᵢ(x) ≤ 0}
and φ(x) := f(x) + 1C₁(x) + ··· + 1Cₘ(x), where 1Cᵢ is the indicator function of
Cᵢ (zero on Cᵢ, +∞ outside). Then the constrained problem becomes unconstrained,
and x̄ is optimal if and only if

      0 ∈ ∂φ(x̄).

Slater's condition tells us that

      int C₁ ∩ ··· ∩ int Cₘ ≠ ∅.
KKT via subdifferentials?

Since int C₁ ∩ ··· ∩ int Cₘ ≠ ∅, Rockafellar's theorem tells us that the
subdifferential of the sum is the sum of the subdifferentials:

      ∂φ(x) = ∂f(x) + ∂1C₁(x) + ··· + ∂1Cₘ(x).
KKT via subdifferentials?

Thus, ∂φ(x) = ∪ {∂f(x) + λ₁∂f₁(x) + ··· + λₘ∂fₘ(x)}, over all
choices of λᵢ ≥ 0 such that λᵢ fᵢ(x) = 0.

If fᵢ(x) < 0, ∂1Cᵢ(x) = {0}, while for fᵢ(x) = 0, ∂1Cᵢ(x) = {λᵢ ∂fᵢ(x) | λᵢ ≥ 0},
and we cannot jointly have λᵢ ≥ 0 and fᵢ(x) > 0 (such x is infeasible, where
∂1Cᵢ(x) = ∅). So 0 ∈ ∂φ(x̄) recovers the KKT conditions with ∇ replaced by ∂.
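An illustrative scalar instance (my example, not from the slides): minimize f(x) = |x − 2| subject to f₁(x) = x − 1 ≤ 0. The minimizer is x̄ = 1, where ∂f(1) = {−1} and the active-constraint term is {λ · 1 : λ ≥ 0}, so 0 ∈ ∂φ(1) with λ = 1.

```python
# Nonsmooth KKT certificate for min |x - 2| s.t. x - 1 <= 0, with xbar = 1.
lam = 1.0
subgrad_f = -1.0            # d/dx |x - 2| at x = 1 (since 1 < 2)
grad_f1 = 1.0               # d/dx (x - 1)
assert subgrad_f + lam * grad_f1 == 0.0    # 0 in the subdifferential sum
assert lam * (1.0 - 1.0) == 0.0            # complementary slackness
print("nonsmooth KKT certificate found with lambda = 1")
```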
Example: Constrained regression

      min_x ½‖Ax − b‖²  s.t.  ‖x‖ ≤ θ.

KKT Conditions

Hmmm...?

▶ Case (i). Let x ← pinv(A)b; if ‖x‖ < θ, then x∗ = x.
▶ Case (ii). If ‖x‖ ≥ θ, then ‖x∗‖ = θ. Thus, consider instead
      min_x ½‖Ax − b‖²  s.t.  ‖x‖² = θ².  (Exercise: complete the idea.)
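A hedged numerical sketch of the two cases (assuming NumPy; the data is random, the helper name and bisection tolerance are mine): case (i) uses the pseudoinverse directly, while case (ii) searches for ν ≥ 0 with ‖(AᵀA + νI)⁻¹Aᵀb‖ = θ, the stationarity condition when the norm constraint is active.

```python
import numpy as np

def constrained_ls(A, b, theta, tol=1e-10):
    x = np.linalg.pinv(A) @ b
    if np.linalg.norm(x) < theta:            # case (i): constraint inactive
        return x
    # Case (ii): constraint active; solve (A^T A + nu I) x = A^T b and
    # bisect on nu >= 0 until ||x(nu)|| = theta (the norm decreases in nu).
    AtA, Atb, I = A.T @ A, A.T @ b, np.eye(A.shape[1])
    lo, hi = 0.0, 1.0
    while np.linalg.norm(np.linalg.solve(AtA + hi * I, Atb)) > theta:
        hi *= 2.0
    while hi - lo > tol:
        nu = 0.5 * (lo + hi)
        xn = np.linalg.solve(AtA + nu * I, Atb)
        lo, hi = (nu, hi) if np.linalg.norm(xn) > theta else (lo, nu)
    return np.linalg.solve(AtA + hi * I, Atb)

rng = np.random.default_rng(3)
A, b = rng.normal(size=(8, 3)), rng.normal(size=8)
x = constrained_ls(A, b, theta=0.5)
print(np.linalg.norm(x))   # <= 0.5 (equals 0.5 when the constraint binds)
```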