Pattern Recognition

A. G. Schwing

University of Illinois at Urbana-Champaign, 2017



L4: Optimization Primal



Goals of this lecture

Understanding the basics of optimization


Optimization problems that we have seen so far:

Linear Regression
$$\min_w \; \frac{1}{2} \sum_{(x^{(i)}, y^{(i)}) \in D} \left( y^{(i)} - \phi(x^{(i)})^\top w \right)^2$$

Logistic Regression
$$\min_w \; \sum_{(x^{(i)}, y^{(i)}) \in D} \log\left( 1 + \exp\left( -y^{(i)} w^\top \phi(x^{(i)}) \right) \right)$$

Finding the optimum: analytically computable optimum vs. gradient descent
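
As a concrete illustration (my addition, not from the slides): a minimal NumPy sketch of both problems on synthetic data, where the matrix `Phi` stands in for the stacked feature vectors $\phi(x^{(i)})^\top$. Least squares is solved analytically via the normal equations; logistic regression has no closed form, so we take gradient steps.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 5))   # rows play the role of phi(x^(i))^T (synthetic)
y_reg = Phi @ rng.normal(size=5)  # regression targets
y_cls = np.sign(y_reg)            # labels in {-1, +1} for logistic regression

# Linear regression: the optimum is analytically computable (normal equations).
w_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ y_reg)

# Logistic regression: no closed form, so run plain gradient descent.
w = np.zeros(5)
for _ in range(500):
    margins = y_cls * (Phi @ w)
    grad = -Phi.T @ (y_cls / (1.0 + np.exp(margins)))  # gradient of the logistic loss
    w -= 0.01 * grad
```

Setting the least-squares gradient to zero gives $\Phi^\top \Phi w = \Phi^\top y$, which is exactly what `np.linalg.solve` computes here.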


The Problem more generally:
$$\min_w \; f_0(w) \quad \text{s.t.} \quad f_i(w) \le 0 \;\; \forall i \in \{1, \dots, C\}$$

Solution: $w^*$ has the smallest value $f_0(w^*)$ among all values that satisfy the constraints.

[Figure: a one-dimensional objective with a saddle point, a local minimum, and a global minimum]


Questions:
When can we find the optimum?
Algorithms to search for the optimum?
How long does it take to find the optimum?


When can we find the optimum?
Least squares, linear, and convex programs can be solved efficiently and reliably.
General optimization problems are very difficult to solve.
Often there is a compromise between accuracy and computation time.


What’s the form of ‘least squares,’ ‘linear,’ and ‘convex’ programs?

Least squares program
$$\min_w \; \frac{1}{2} \sum_{(x^{(i)}, y^{(i)}) \in D} \left( y^{(i)} - \phi(x^{(i)})^\top w \right)^2$$

Linear program
$$\min_w \; c^\top w \quad \text{s.t.} \quad Aw \le b$$

Convex program, when all $f_i$ are convex (generalizes the above)
$$\min_w \; f_0(w) \quad \text{s.t.} \quad f_i(w) \le 0 \;\; \forall i \in \{1, \dots, C\}$$


Convex set:

A set is convex if for any two points $w_1, w_2$ in the set, the line segment $\lambda w_1 + (1 - \lambda) w_2$ for $\lambda \in [0, 1]$ also lies in the set.

Example: Polyhedron
$$\{ w \mid Aw \le b, \; Cw = d \}$$
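
A quick numerical sanity check of this definition (my addition; the unit box is a hypothetical example polyhedron): sample two feasible points and verify that every convex combination stays feasible.

```python
import numpy as np

# Hypothetical polyhedron {w | A w <= b}: the unit box in 2D.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.ones(4)

def in_set(w):
    return bool(np.all(A @ w <= b + 1e-12))

rng = np.random.default_rng(1)
w1 = rng.uniform(-1.0, 1.0, size=2)
w2 = rng.uniform(-1.0, 1.0, size=2)
assert in_set(w1) and in_set(w2)

# The segment lambda*w1 + (1-lambda)*w2 stays inside the set.
for lam in np.linspace(0.0, 1.0, 11):
    assert in_set(lam * w1 + (1.0 - lam) * w2)
```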


Convex function:

A function $f$ is convex if its domain is a convex set and for any points $w_1, w_2$ in the domain and any $\lambda \in [0, 1]$,
$$f\left( (1 - \lambda) w_1 + \lambda w_2 \right) \le (1 - \lambda) f(w_1) + \lambda f(w_2)$$

[Figure: the chord between $(w_1, f(w_1))$ and $(w_2, f(w_2))$ lies above the graph of $f$]
Recognizing convex functions

If $f$ is differentiable, then $f$ is convex if and only if its domain is convex and, for all $w_1, w_2$ in the domain,
$$f(w_1) \ge f(w_2) + \nabla f(w_2)^\top (w_1 - w_2)$$

[Figure: the tangent $f(w_2) + \nabla f(w_2)^\top (w_1 - w_2)$ lies below the graph of $f$]

If $f$ is differentiable, then $f$ is convex if and only if its domain is convex and, for all $w_1, w_2$ in the domain,
$$\left( \nabla f(w_1) - \nabla f(w_2) \right)^\top (w_1 - w_2) \ge 0 \qquad \text{(monotone mapping)}$$

If $f$ is twice differentiable, then $f$ is convex if and only if its domain is convex and $\nabla^2 f(w) \succeq 0$ for all $w$ in the domain.
Examples of convex functions
Exponential: $\exp(ax)$ is convex on $x \in \mathbb{R}$ for all $a \in \mathbb{R}$
Negative logarithm: $-\log(x)$ is convex on $x \in \mathbb{R}_{++}$
Negative entropy: $-H(x) = x \log(x)$ is convex on $x \in \mathbb{R}_{++}$
Norms: $\|w\|_p$ for $p \ge 1$
Log-sum-exp: $\log\left( \exp(w_1) + \dots + \exp(w_d) \right)$


Operations which preserve convexity
Non-negative weighted sums: if $\alpha_i \ge 0$ and the $f_i$ are convex, so is $g = \alpha_1 f_1 + \alpha_2 f_2 + \dots$
Composition with an affine mapping: if $f$ is convex, so is $g(w) = f(Aw + b)$
Pointwise maximum: if $f_1, f_2$ are convex, so is $g(w) = \max\{ f_1(w), f_2(w) \}$

Exercise: show that $\log(1 + \exp(x))$ is convex for $x \in \mathbb{R}$ (one solution is worked out below).
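
A worked solution to the exercise (my addition): the second-order condition from the previous slide does it directly, with $\sigma(x) = 1/(1 + e^{-x})$ the logistic sigmoid:

$$f(x) = \log(1 + e^x), \qquad f'(x) = \frac{e^x}{1 + e^x} = \sigma(x), \qquad f''(x) = \sigma(x)\left( 1 - \sigma(x) \right) \ge 0 \;\; \forall x \in \mathbb{R}.$$

Alternatively, $\log(1 + e^x) = \log(e^0 + e^x)$ is log-sum-exp composed with the affine map $x \mapsto (0, x)^\top$, so the operations above apply.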


Optimality of convex optimization

A point $w^*$ is locally optimal if $f(w^*) \le f(w)$ for all $w$ in a neighborhood of $w^*$; globally optimal if $f(w^*) \le f(w)$ for all $w$.

[Figure: a one-dimensional objective with a saddle point, a local minimum, and a global minimum]

For convex problems, global optimality follows directly from local optimality.
At a local minimum of $f$, $\nabla f(w^*) = 0$.
If $f$ is convex, then $\nabla f(w^*) = 0$ is sufficient for global optimality.

This makes convex optimization special!


Algorithms to search for the optimum?


Descent methods
$$\min_w f(w)$$

Intuition: find a stationary point with $\nabla f(w) = 0$.

[Figure: at a point $w$, the step $w + \alpha d$ with $d = -\nabla f(w)$ moves downhill to $w - \alpha \nabla f(w)$]


Iterative algorithm
Start with some guess $w$
Iterate $k = 1, 2, 3, \dots$
  Select direction $d_k$ and stepsize $\alpha_k$
  $w \leftarrow w + \alpha_k d_k$
  Check whether we should stop (e.g., if $\nabla f(w) \approx 0$)

A descent direction $d_k$ satisfies $\nabla f(w)^\top d_k < 0$ (a minimal sketch of this loop follows below).
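
A minimal sketch of the generic descent loop (my addition; steepest-descent direction and a constant stepsize, with the gradient supplied by the caller):

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha=0.1, tol=1e-6, max_iter=1000):
    """Iterate w <- w + alpha_k * d_k with d_k = -grad f(w) until grad f(w) ~ 0."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(w)
        if np.linalg.norm(g) < tol:  # stopping check
            break
        w = w - alpha * g            # steepest-descent step
    return w

# Example: minimize f(w) = 0.5 * ||w||^2, whose gradient is w.
w_star = gradient_descent(lambda w: w, w0=[3.0, -2.0])
```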


How to select direction:
Steepest descent: $d_k = -\nabla f(w_k)$
Scaled gradient: $d_k = -D_k \nabla f(w_k)$ for $D_k \succ 0$
Newton’s method: $D_k = [\nabla^2 f(w_k)]^{-1}$
...


How to select stepsize:
Exact: $\alpha_k = \arg\min_{\alpha \ge 0} f(w_k + \alpha d_k)$
Constant: $\alpha_k = 1/L$ (for suitable $L$)
Diminishing: $\alpha_k \to 0$ but $\sum_k \alpha_k = \infty$ (e.g., $\alpha_k = 1/k$)
Armijo rule: start with $\alpha = s$ and continue with $\alpha = \beta s, \beta^2 s, \dots$, until $\alpha = \beta^m s$ falls within the set of $\alpha$ with
$$f(w_k + \alpha d_k) - f(w_k) \le \sigma \alpha \nabla f(w_k)^\top d_k$$

[Figure: acceptable stepsizes are those for which $f(w_k + \alpha d_k) - f(w_k)$ lies below the line $\sigma \alpha \nabla f(w_k)^\top d_k$; the trials $s$ and $\beta s$ fail, and $\alpha_k = \beta^2 s$ is accepted]
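
The Armijo rule as a backtracking loop (my addition; $s = 1$, $\beta = 0.5$, $\sigma = 10^{-4}$ are typical but assumed constants):

```python
import numpy as np

def armijo_stepsize(f, grad_f, w, d, s=1.0, beta=0.5, sigma=1e-4):
    """Shrink alpha = s, beta*s, beta^2*s, ... until the Armijo condition holds."""
    alpha = s
    fw = f(w)
    slope = grad_f(w) @ d  # negative whenever d is a descent direction
    while f(w + alpha * d) - fw > sigma * alpha * slope:
        alpha *= beta
    return alpha

# Example on f(w) = 0.5 * ||w||^2 with the steepest-descent direction.
f = lambda w: 0.5 * (w @ w)
grad_f = lambda w: w
w = np.array([3.0, -2.0])
alpha_k = armijo_stepsize(f, grad_f, w, -grad_f(w))
```

The loop terminates for any descent direction, since the condition holds for all sufficiently small $\alpha$ when $f$ is smooth.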


How long does it take to find the optimum?


Goal: how many iterations $k$ until
$$f(w_k) - f(w^*) \le \epsilon \,?$$


Two important properties:
Lipschitz continuous gradient
Strong convexity


Properties: Lipschitz continuous gradient
$$\|\nabla f(w_1) - \nabla f(w_2)\|_2 \le L \|w_1 - w_2\|_2 \quad \forall w_1, w_2$$

If $f$ has a Lipschitz continuous gradient, then $g(w) = \frac{L}{2}\|w\|_2^2 - f(w)$ is convex.

Proof: by Cauchy-Schwarz and the Lipschitz property,
$$\left( \nabla f(w_1) - \nabla f(w_2) \right)^\top (w_1 - w_2) \le \|\nabla f(w_1) - \nabla f(w_2)\|_2 \, \|w_1 - w_2\|_2 \le L \|w_1 - w_2\|_2^2.$$
Since $\nabla g(w_1) - \nabla g(w_2) = L(w_1 - w_2) - \left( \nabla f(w_1) - \nabla f(w_2) \right)$, it follows that
$$\left( \nabla g(w_1) - \nabla g(w_2) \right)^\top (w_1 - w_2) = L\|w_1 - w_2\|_2^2 - \left( \nabla f(w_1) - \nabla f(w_2) \right)^\top (w_1 - w_2) \ge 0,$$
i.e., $\nabla g$ is a monotone mapping, so $g$ is convex.


If $g(w) = \frac{L}{2}\|w\|_2^2 - f(w)$ is convex, then
$$f(w_2) \le f(w_1) + \nabla f(w_1)^\top (w_2 - w_1) + \frac{L}{2}\|w_2 - w_1\|_2^2 \quad \forall w_1, w_2$$

Proof: plug the definition of $g$ into $g(w_2) \ge g(w_1) + \nabla g(w_1)^\top (w_2 - w_1)$ and rearrange.

Properties: Strong convexity
$$f(w_2) \ge f(w_1) + \nabla f(w_1)^\top (w_2 - w_1) + \frac{\sigma}{2}\|w_2 - w_1\|_2^2 \quad \forall w_1, w_2$$

If $f$ is twice differentiable, the two properties read
$$\sigma I \preceq \nabla^2 f(w) \preceq L I \quad \forall w$$

[Figure: at $w_1$, the Lipschitz property gives a quadratic upper bound on $f$, strong convexity a quadratic lower bound, and plain convexity the tangent lower bound]


How many iterations $k$ such that $f(w_k) - f(w^*) \le \epsilon$ for $w_{k+1} = w_k + \alpha d_k$?

How to pick $\alpha d_k$? Minimize, with respect to $w_{k+1}$, the right-hand side of the upper bound
$$f(w_{k+1}) \le f(w_k) + \nabla f(w_k)^\top (w_{k+1} - w_k) + \frac{L}{2}\|w_{k+1} - w_k\|_2^2$$

Hence
$$\alpha d_k = -\frac{1}{L} \nabla f(w_k)$$
$$f(w_{k+1}) \le f(w_k) - \frac{1}{2L}\|\nabla f(w_k)\|_2^2 \qquad \text{(bound on guaranteed progress)}$$

Bound on sub-optimality from strong convexity:
$$f(w^*) \ge f(w_k) - \frac{1}{2\sigma}\|\nabla f(w_k)\|_2^2$$
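
Filling in the algebra behind the "hence" (my addition): setting the gradient of the upper bound's right-hand side to zero,

$$\nabla f(w_k) + L (w_{k+1} - w_k) = 0 \;\Longrightarrow\; w_{k+1} = w_k - \frac{1}{L}\nabla f(w_k),$$

and substituting $w_{k+1} - w_k = -\frac{1}{L}\nabla f(w_k)$ back into the bound gives

$$f(w_{k+1}) \le f(w_k) - \frac{1}{L}\|\nabla f(w_k)\|_2^2 + \frac{1}{2L}\|\nabla f(w_k)\|_2^2 = f(w_k) - \frac{1}{2L}\|\nabla f(w_k)\|_2^2.$$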


Guaranteed progress:
$$f(w_{k+1}) \le f(w_k) - \frac{1}{2L}\|\nabla f(w_k)\|_2^2$$

Maximum sub-optimality:
$$f(w^*) \ge f(w_k) - \frac{1}{2\sigma}\|\nabla f(w_k)\|_2^2$$

[Figure: at $w_1$, the Lipschitz upper bound yields the guaranteed progress, the strong-convexity lower bound the maximum sub-optimality, and plain convexity the tangent lower bound]

Distance to go:
$$f(w_k) - f(w^*) \le \frac{1}{2\sigma}\|\nabla f(w_k)\|_2^2$$

Plugging this into the 'guaranteed progress' bound:
$$f(w_k) - f(w^*) \le f(w_{k-1}) - f(w^*) - \frac{\sigma}{L}\left( f(w_{k-1}) - f(w^*) \right) \le \left( 1 - \frac{\sigma}{L} \right)^k \left( f(w_0) - f(w^*) \right) \qquad (\sigma < L)$$

Rate:
$$c \left( 1 - \frac{\sigma}{L} \right)^k \le \epsilon \;\Longrightarrow\; k \ge O(\log(1/\epsilon)) \quad \text{(sometimes written as } O(e^{-k}) \text{ error decay)}$$
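
Making the final implication explicit (my addition): taking logarithms,

$$c\left( 1 - \frac{\sigma}{L} \right)^k \le \epsilon \;\Longleftrightarrow\; k \ge \frac{\log(c/\epsilon)}{-\log(1 - \sigma/L)},$$

and since $-\log(1 - x) \ge x$ for $x \in (0, 1)$, it suffices to take $k \ge \frac{L}{\sigma}\log\frac{c}{\epsilon}$, i.e., $O(\log(1/\epsilon))$ iterations with a constant depending on the condition number $L/\sigma$.
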
No strong convexity assumption:

The Lipschitz bound and $w_2 = w_1 - \alpha \nabla f(w_1)$ yield
$$f(w_2) \le f(w_1) - \left( 1 - \frac{L\alpha}{2} \right) \alpha \|\nabla f(w_1)\|_2^2$$
(for $\alpha \le 1/L$, the factor in front of $\|\nabla f(w_1)\|_2^2$ is at least $\alpha/2$).

Combined with convexity, $f(w_1) + \nabla f(w_1)^\top (w^* - w_1) \le f(w^*)$:
$$f(w_2) \le f(w^*) + \nabla f(w_1)^\top (w_1 - w^*) - \frac{\alpha}{2}\|\nabla f(w_1)\|_2^2$$

Using $w_2 - w_1 = -\alpha \nabla f(w_1)$ and rearranging terms gives
$$f(w_2) \le f(w^*) + \frac{1}{2\alpha}\left( \|w_1 - w^*\|_2^2 - \|w_2 - w^*\|_2^2 \right)$$

Summing over all iterations:
$$\sum_{i=1}^{k} \left( f(w_i) - f(w^*) \right) \le \frac{1}{2\alpha}\|w_0 - w^*\|_2^2$$


$f(w_i)$ non-increasing:
$$f(w_k) - f(w^*) \le \frac{1}{k}\sum_{i=1}^{k} \left( f(w_i) - f(w^*) \right) \le \frac{1}{2 k \alpha}\|w_0 - w^*\|_2^2 \le \epsilon$$

Consequently: $k \ge O(1/\epsilon)$

Are these rates optimal?


Gradient with momentum

Intuition:

[Figure: two surface plots of a loss $l$ over $(w_1, w_2)$ illustrating the intuition for momentum]

Video

Polyak’s method (aka heavy-ball):
$$w_{k+1} = w_k - \alpha_k \nabla f(w_k) + \beta_k (w_k - w_{k-1})$$

Momentum method in deep learning:
$$v_{k+1} = \beta v_k + \nabla f(w_k), \qquad w_{k+1} = w_k - \alpha v_{k+1}$$
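
A sketch of the deep-learning momentum update above (my addition; the quadratic test problem and the constants are assumed for illustration):

```python
import numpy as np

def momentum_descent(grad_f, w0, alpha=0.01, beta=0.9, num_iter=200):
    """v <- beta * v + grad f(w);  w <- w - alpha * v."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(num_iter):
        v = beta * v + grad_f(w)
        w = w - alpha * v
    return w

# Example: an ill-conditioned quadratic f(w) = 0.5 * w^T Q w, where plain
# gradient descent zig-zags across the narrow valley and momentum damps this.
Q = np.diag([1.0, 50.0])
w_star = momentum_descent(lambda w: Q @ w, w0=[5.0, 5.0])
```
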
Recall the structure of our optimization problems:
$$\min_w \sum_{(x^{(i)}, y^{(i)}) \in D} \ell\left( y^{(i)}, F(x^{(i)}, w) \right)$$

So far we didn’t consider the time for computing the gradient:
Iteration complexity is linear in the number of samples $|D|$
A large dataset makes gradient computation slow

How to deal with this?


Stochastic gradient descent

Consider a subset of samples and approximate the gradient based on this batch of data.

Select a subset of samples $B_k$
Gradient update using the approximation
$$\nabla f(w) \approx \frac{1}{|B_k|} \sum_{(x^{(i)}, y^{(i)}) \in B_k} \nabla \ell\left( y^{(i)}, F(x^{(i)}, w) \right)$$

Convergence rates for stochastic gradient descent:
Lipschitz continuous gradient and strongly convex: $k \ge O(1/\epsilon)$
Lipschitz continuous gradient: $k \ge O(1/\epsilon^2)$
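
A minimal minibatch-SGD sketch (my addition; synthetic least-squares data, with batch size and stepsize as assumed choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))       # synthetic dataset D
y = X @ rng.normal(size=5)

w = np.zeros(5)
batch_size, alpha = 32, 0.05
for k in range(2000):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample batch B_k
    residual = X[idx] @ w - y[idx]
    grad = X[idx].T @ residual / batch_size  # batch estimate of the full gradient
    w -= alpha * grad
```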


Stochastic vs. deterministic (strongly convex)

Batch gradient descent:
Convergence rate: $O(\log 1/\epsilon)$
Iteration complexity: linear in $|D|$

Stochastic gradient descent:
Convergence rate: $O(1/\epsilon)$
Iteration complexity: independent of $|D|$

Can we get the best of both worlds?


Many related algorithms:
SAG (Le Roux, Schmidt, Bach 2012)
SDCA (Shalev-Shwartz and Zhang 2013)
SVRG (Johnson and Zhang 2013)
MISO (Mairal 2015)
Finito (Defazio 2014)
SAGA (Defazio, Bach, Lacoste-Julien 2014)
...

Idea: variance reduction


Example: SVRG
Initialize $\hat{w}$
For epoch $1, 2, 3, \dots$
  Compute $\nabla f(\hat{w}) = \sum_{i \in D} \nabla \ell_i(\hat{w})$
  Initialize $w_0 = \hat{w}$
  For $t$ in length of epoch:
  $$w_t = w_{t-1} - \alpha \left( \nabla f(\hat{w}) + \nabla \ell_{i(t)}(w_{t-1}) - \nabla \ell_{i(t)}(\hat{w}) \right)$$
  Update $\hat{w} = w_t$
Output $\hat{w}$
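
A runnable sketch of this pseudocode on a least-squares objective (my addition; the snapshot gradient is taken as the mean over samples rather than the sum, a common variant, and the epoch length, stepsize, and data are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5)
n = len(X)

def grad_i(w, i):
    """Gradient of the i-th sample loss 0.5 * (x_i^T w - y_i)^2."""
    return X[i] * (X[i] @ w - y[i])

w_hat = np.zeros(5)               # snapshot, the pseudocode's w-hat
alpha, epoch_len = 0.05, 2 * n
for epoch in range(20):
    full_grad = X.T @ (X @ w_hat - y) / n  # full (mean) gradient at the snapshot
    w = w_hat.copy()
    for t in range(epoch_len):
        i = rng.integers(n)       # pick sample i(t) uniformly at random
        # full_grad + grad_i(w) - grad_i(w_hat) is an unbiased estimate of the
        # full gradient at w, with variance shrinking as w approaches w_hat.
        w -= alpha * (full_grad + grad_i(w, i) - grad_i(w_hat, i))
    w_hat = w
```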


Quiz:
Stepsize/learning rate rules?
Descent directions?
Properties of convex functions?
Convergence rates?
Improvements?


Important topics of this lecture
Convex optimization basics
Algorithm choices
Rates

Up next:
Duality
