Pattern Recognition

A. G. Schwing

University of Illinois at Urbana-Champaign, 2017



L4: Optimization Primal



Goals of this lecture

Understanding the basics of optimization


Optimization problems that we have seen so far:

Linear Regression
$$\min_w \; \frac{1}{2} \sum_{(x^{(i)}, y^{(i)}) \in D} \left( y^{(i)} - \phi(x^{(i)})^\top w \right)^2$$

Logistic Regression
$$\min_w \; \sum_{(x^{(i)}, y^{(i)}) \in D} \log\left( 1 + \exp\left( -y^{(i)} w^\top \phi(x^{(i)}) \right) \right)$$

Finding the optimum: analytically computable optimum vs. gradient descent
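
As a concrete illustration (my addition, not from the slides): a minimal NumPy sketch of both problems on synthetic data, where the matrix `Phi` stands in for the stacked feature vectors $\phi(x^{(i)})^\top$. Least squares is solved analytically via the normal equations; logistic regression has no closed form, so we take gradient steps.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 5))   # rows play the role of phi(x^(i))^T (synthetic)
y_reg = Phi @ rng.normal(size=5)  # regression targets
y_cls = np.sign(y_reg)            # labels in {-1, +1} for logistic regression

# Linear regression: the optimum is analytically computable (normal equations).
w_ls = np.linalg.solve(Phi.T @ Phi, Phi.T @ y_reg)

# Logistic regression: no closed form, so run plain gradient descent.
w = np.zeros(5)
for _ in range(500):
    margins = y_cls * (Phi @ w)
    grad = -Phi.T @ (y_cls / (1.0 + np.exp(margins)))  # gradient of the logistic loss
    w -= 0.01 * grad
```

Setting the least-squares gradient to zero gives $\Phi^\top \Phi w = \Phi^\top y$, which is exactly what `np.linalg.solve` computes here.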


The Problem more generally:
$$\min_w \; f_0(w) \quad \text{s.t.} \quad f_i(w) \le 0 \;\; \forall i \in \{1, \dots, C\}$$

Solution: $w^*$ has the smallest value $f_0(w^*)$ among all values that satisfy the constraints.

[Figure: a one-dimensional objective with a saddle point, a local minimum, and a global minimum]


Questions:
When can we find the optimum?
Algorithms to search for the optimum?
How long does it take to find the optimum?


When can we find the optimum?
Least squares, linear, and convex programs can be solved efficiently and reliably.
General optimization problems are very difficult to solve.
Often there is a compromise between accuracy and computation time.


What’s the form of ‘least squares,’ ‘linear,’ and ‘convex’ programs?

Least squares program
$$\min_w \; \frac{1}{2} \sum_{(x^{(i)}, y^{(i)}) \in D} \left( y^{(i)} - \phi(x^{(i)})^\top w \right)^2$$

Linear program
$$\min_w \; c^\top w \quad \text{s.t.} \quad Aw \le b$$

Convex program, when all $f_i$ are convex (generalizes the above)
$$\min_w \; f_0(w) \quad \text{s.t.} \quad f_i(w) \le 0 \;\; \forall i \in \{1, \dots, C\}$$


Convex set:

A set is convex if for any two points $w_1, w_2$ in the set, the line segment $\lambda w_1 + (1 - \lambda) w_2$ for $\lambda \in [0, 1]$ also lies in the set.

Example: Polyhedron
$$\{ w \mid Aw \le b, \; Cw = d \}$$
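
A quick numerical sanity check of this definition (my addition; the unit box is a hypothetical example polyhedron): sample two feasible points and verify that every convex combination stays feasible.

```python
import numpy as np

# Hypothetical polyhedron {w | A w <= b}: the unit box in 2D.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.ones(4)

def in_set(w):
    return bool(np.all(A @ w <= b + 1e-12))

rng = np.random.default_rng(1)
w1 = rng.uniform(-1.0, 1.0, size=2)
w2 = rng.uniform(-1.0, 1.0, size=2)
assert in_set(w1) and in_set(w2)

# The segment lambda*w1 + (1-lambda)*w2 stays inside the set.
for lam in np.linspace(0.0, 1.0, 11):
    assert in_set(lam * w1 + (1.0 - lam) * w2)
```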


Convex function:

A function $f$ is convex if its domain is a convex set and for any points $w_1, w_2$ in the domain and any $\lambda \in [0, 1]$,
$$f\left( (1 - \lambda) w_1 + \lambda w_2 \right) \le (1 - \lambda) f(w_1) + \lambda f(w_2)$$

[Figure: the chord between $(w_1, f(w_1))$ and $(w_2, f(w_2))$ lies above the graph of $f$]
Recognizing convex functions

If $f$ is differentiable, then $f$ is convex if and only if its domain is convex and, for all $w_1, w_2$ in the domain,
$$f(w_1) \ge f(w_2) + \nabla f(w_2)^\top (w_1 - w_2)$$

[Figure: the tangent $f(w_2) + \nabla f(w_2)^\top (w_1 - w_2)$ lies below the graph of $f$]

If $f$ is differentiable, then $f$ is convex if and only if its domain is convex and, for all $w_1, w_2$ in the domain,
$$\left( \nabla f(w_1) - \nabla f(w_2) \right)^\top (w_1 - w_2) \ge 0 \qquad \text{(monotone mapping)}$$

If $f$ is twice differentiable, then $f$ is convex if and only if its domain is convex and $\nabla^2 f(w) \succeq 0$ for all $w$ in the domain.
Examples of convex functions
Exponential: $\exp(ax)$ is convex on $x \in \mathbb{R}$ for all $a \in \mathbb{R}$
Negative logarithm: $-\log(x)$ is convex on $x \in \mathbb{R}_{++}$
Negative entropy: $-H(x) = x \log(x)$ is convex on $x \in \mathbb{R}_{++}$
Norms: $\|w\|_p$ for $p \ge 1$
Log-sum-exp: $\log\left( \exp(w_1) + \dots + \exp(w_d) \right)$


Operations which preserve convexity
Non-negative weighted sums: if $\alpha_i \ge 0$ and the $f_i$ are convex, so is $g = \alpha_1 f_1 + \alpha_2 f_2 + \dots$
Composition with an affine mapping: if $f$ is convex, so is $g(w) = f(Aw + b)$
Pointwise maximum: if $f_1, f_2$ are convex, so is $g(w) = \max\{ f_1(w), f_2(w) \}$

Exercise: show that $\log(1 + \exp(x))$ is convex for $x \in \mathbb{R}$ (one solution is worked out below).
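
A worked solution to the exercise (my addition): the second-order condition from the previous slide does it directly, with $\sigma(x) = 1/(1 + e^{-x})$ the logistic sigmoid:

$$f(x) = \log(1 + e^x), \qquad f'(x) = \frac{e^x}{1 + e^x} = \sigma(x), \qquad f''(x) = \sigma(x)\left( 1 - \sigma(x) \right) \ge 0 \;\; \forall x \in \mathbb{R}.$$

Alternatively, $\log(1 + e^x) = \log(e^0 + e^x)$ is log-sum-exp composed with the affine map $x \mapsto (0, x)^\top$, so the operations above apply.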


Optimality of convex optimization

A point $w^*$ is locally optimal if $f(w^*) \le f(w)$ for all $w$ in a neighborhood of $w^*$; globally optimal if $f(w^*) \le f(w)$ for all $w$.

[Figure: a one-dimensional objective with a saddle point, a local minimum, and a global minimum]

For convex problems, global optimality follows directly from local optimality.
At a local minimum of $f$, $\nabla f(w^*) = 0$.
If $f$ is convex, then $\nabla f(w^*) = 0$ is sufficient for global optimality.

This makes convex optimization special!


Algorithms to search for the optimum?


Descent methods
$$\min_w f(w)$$

Intuition: find a stationary point with $\nabla f(w) = 0$.

[Figure: at a point $w$, the step $w + \alpha d$ with $d = -\nabla f(w)$ moves downhill to $w - \alpha \nabla f(w)$]


Iterative algorithm
Start with some guess $w$
Iterate $k = 1, 2, 3, \dots$
  Select direction $d_k$ and stepsize $\alpha_k$
  $w \leftarrow w + \alpha_k d_k$
  Check whether we should stop (e.g., if $\nabla f(w) \approx 0$)

A descent direction $d_k$ satisfies $\nabla f(w)^\top d_k < 0$ (a minimal sketch of this loop follows below).
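
A minimal sketch of the generic descent loop (my addition; steepest-descent direction and a constant stepsize, with the gradient supplied by the caller):

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha=0.1, tol=1e-6, max_iter=1000):
    """Iterate w <- w + alpha_k * d_k with d_k = -grad f(w) until grad f(w) ~ 0."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(w)
        if np.linalg.norm(g) < tol:  # stopping check
            break
        w = w - alpha * g            # steepest-descent step
    return w

# Example: minimize f(w) = 0.5 * ||w||^2, whose gradient is w.
w_star = gradient_descent(lambda w: w, w0=[3.0, -2.0])
```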


How to select direction:
Steepest descent: $d_k = -\nabla f(w_k)$
Scaled gradient: $d_k = -D_k \nabla f(w_k)$ for $D_k \succ 0$
Newton’s method: $D_k = [\nabla^2 f(w_k)]^{-1}$
...


How to select stepsize:
Exact: $\alpha_k = \arg\min_{\alpha \ge 0} f(w_k + \alpha d_k)$
Constant: $\alpha_k = 1/L$ (for suitable $L$)
Diminishing: $\alpha_k \to 0$ but $\sum_k \alpha_k = \infty$ (e.g., $\alpha_k = 1/k$)
Armijo rule: start with $\alpha = s$ and continue with $\alpha = \beta s, \beta^2 s, \dots$, until $\alpha = \beta^m s$ falls within the set of $\alpha$ with
$$f(w_k + \alpha d_k) - f(w_k) \le \sigma \alpha \nabla f(w_k)^\top d_k$$

[Figure: acceptable stepsizes are those for which $f(w_k + \alpha d_k) - f(w_k)$ lies below the line $\sigma \alpha \nabla f(w_k)^\top d_k$; the trials $s$ and $\beta s$ fail, and $\alpha_k = \beta^2 s$ is accepted]
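
The Armijo rule as a backtracking loop (my addition; $s = 1$, $\beta = 0.5$, $\sigma = 10^{-4}$ are typical but assumed constants):

```python
import numpy as np

def armijo_stepsize(f, grad_f, w, d, s=1.0, beta=0.5, sigma=1e-4):
    """Shrink alpha = s, beta*s, beta^2*s, ... until the Armijo condition holds."""
    alpha = s
    fw = f(w)
    slope = grad_f(w) @ d  # negative whenever d is a descent direction
    while f(w + alpha * d) - fw > sigma * alpha * slope:
        alpha *= beta
    return alpha

# Example on f(w) = 0.5 * ||w||^2 with the steepest-descent direction.
f = lambda w: 0.5 * (w @ w)
grad_f = lambda w: w
w = np.array([3.0, -2.0])
alpha_k = armijo_stepsize(f, grad_f, w, -grad_f(w))
```

The loop terminates for any descent direction, since the condition holds for all sufficiently small $\alpha$ when $f$ is smooth.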


How long does it take to find the optimum?


Goal: how many iterations $k$ until
$$f(w_k) - f(w^*) \le \epsilon \,?$$


Two important properties:
Lipschitz continuous gradient
Strong convexity


Properties: Lipschitz continuous gradient
$$\|\nabla f(w_1) - \nabla f(w_2)\|_2 \le L \|w_1 - w_2\|_2 \quad \forall w_1, w_2$$

If $f$ has a Lipschitz continuous gradient, then $g(w) = \frac{L}{2}\|w\|_2^2 - f(w)$ is convex.

Proof: by Cauchy-Schwarz and the Lipschitz property,
$$\left( \nabla f(w_1) - \nabla f(w_2) \right)^\top (w_1 - w_2) \le \|\nabla f(w_1) - \nabla f(w_2)\|_2 \, \|w_1 - w_2\|_2 \le L \|w_1 - w_2\|_2^2.$$
Since $\nabla g(w_1) - \nabla g(w_2) = L(w_1 - w_2) - \left( \nabla f(w_1) - \nabla f(w_2) \right)$, it follows that
$$\left( \nabla g(w_1) - \nabla g(w_2) \right)^\top (w_1 - w_2) = L\|w_1 - w_2\|_2^2 - \left( \nabla f(w_1) - \nabla f(w_2) \right)^\top (w_1 - w_2) \ge 0,$$
i.e., $\nabla g$ is a monotone mapping, so $g$ is convex.


If $g(w) = \frac{L}{2}\|w\|_2^2 - f(w)$ is convex, then
$$f(w_2) \le f(w_1) + \nabla f(w_1)^\top (w_2 - w_1) + \frac{L}{2}\|w_2 - w_1\|_2^2 \quad \forall w_1, w_2$$

Proof: plug the definition of $g$ into $g(w_2) \ge g(w_1) + \nabla g(w_1)^\top (w_2 - w_1)$ and rearrange.

Properties: Strong convexity
$$f(w_2) \ge f(w_1) + \nabla f(w_1)^\top (w_2 - w_1) + \frac{\sigma}{2}\|w_2 - w_1\|_2^2 \quad \forall w_1, w_2$$

If $f$ is twice differentiable, the two properties read
$$\sigma I \preceq \nabla^2 f(w) \preceq L I \quad \forall w$$

[Figure: at $w_1$, the Lipschitz property gives a quadratic upper bound on $f$, strong convexity a quadratic lower bound, and plain convexity the tangent lower bound]


How many iterations $k$ such that $f(w_k) - f(w^*) \le \epsilon$ for $w_{k+1} = w_k + \alpha d_k$?

How to pick $\alpha d_k$? Minimize, with respect to $w_{k+1}$, the right-hand side of the upper bound
$$f(w_{k+1}) \le f(w_k) + \nabla f(w_k)^\top (w_{k+1} - w_k) + \frac{L}{2}\|w_{k+1} - w_k\|_2^2$$

Hence
$$\alpha d_k = -\frac{1}{L} \nabla f(w_k)$$
$$f(w_{k+1}) \le f(w_k) - \frac{1}{2L}\|\nabla f(w_k)\|_2^2 \qquad \text{(bound on guaranteed progress)}$$

Bound on sub-optimality from strong convexity:
$$f(w^*) \ge f(w_k) - \frac{1}{2\sigma}\|\nabla f(w_k)\|_2^2$$
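
Filling in the algebra behind the "hence" (my addition): setting the gradient of the upper bound's right-hand side to zero,

$$\nabla f(w_k) + L (w_{k+1} - w_k) = 0 \;\Longrightarrow\; w_{k+1} = w_k - \frac{1}{L}\nabla f(w_k),$$

and substituting $w_{k+1} - w_k = -\frac{1}{L}\nabla f(w_k)$ back into the bound gives

$$f(w_{k+1}) \le f(w_k) - \frac{1}{L}\|\nabla f(w_k)\|_2^2 + \frac{1}{2L}\|\nabla f(w_k)\|_2^2 = f(w_k) - \frac{1}{2L}\|\nabla f(w_k)\|_2^2.$$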


Guaranteed progress:
$$f(w_{k+1}) \le f(w_k) - \frac{1}{2L}\|\nabla f(w_k)\|_2^2$$

Maximum sub-optimality:
$$f(w^*) \ge f(w_k) - \frac{1}{2\sigma}\|\nabla f(w_k)\|_2^2$$

[Figure: at $w_1$, the Lipschitz upper bound yields the guaranteed progress, the strong-convexity lower bound the maximum sub-optimality, and plain convexity the tangent lower bound]

Distance to go:
$$f(w_k) - f(w^*) \le \frac{1}{2\sigma}\|\nabla f(w_k)\|_2^2$$

Plugging this into the 'guaranteed progress' bound:
$$f(w_k) - f(w^*) \le f(w_{k-1}) - f(w^*) - \frac{\sigma}{L}\left( f(w_{k-1}) - f(w^*) \right) \le \left( 1 - \frac{\sigma}{L} \right)^k \left( f(w_0) - f(w^*) \right) \qquad (\sigma < L)$$

Rate:
$$c \left( 1 - \frac{\sigma}{L} \right)^k \le \epsilon \;\Longrightarrow\; k \ge O(\log(1/\epsilon)) \quad \text{(sometimes written as } O(e^{-k}) \text{ error decay)}$$
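
Making the final implication explicit (my addition): taking logarithms,

$$c\left( 1 - \frac{\sigma}{L} \right)^k \le \epsilon \;\Longleftrightarrow\; k \ge \frac{\log(c/\epsilon)}{-\log(1 - \sigma/L)},$$

and since $-\log(1 - x) \ge x$ for $x \in (0, 1)$, it suffices to take $k \ge \frac{L}{\sigma}\log\frac{c}{\epsilon}$, i.e., $O(\log(1/\epsilon))$ iterations with a constant depending on the condition number $L/\sigma$.
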
No strong convexity assumption:

The Lipschitz bound and $w_2 = w_1 - \alpha \nabla f(w_1)$ yield
$$f(w_2) \le f(w_1) - \left( 1 - \frac{L\alpha}{2} \right) \alpha \|\nabla f(w_1)\|_2^2$$
(for $\alpha \le 1/L$, the factor in front of $\|\nabla f(w_1)\|_2^2$ is at least $\alpha/2$).

Combined with convexity, $f(w_1) + \nabla f(w_1)^\top (w^* - w_1) \le f(w^*)$:
$$f(w_2) \le f(w^*) + \nabla f(w_1)^\top (w_1 - w^*) - \frac{\alpha}{2}\|\nabla f(w_1)\|_2^2$$

Using $w_2 - w_1 = -\alpha \nabla f(w_1)$ and rearranging terms gives
$$f(w_2) \le f(w^*) + \frac{1}{2\alpha}\left( \|w_1 - w^*\|_2^2 - \|w_2 - w^*\|_2^2 \right)$$

Summing over all iterations:
$$\sum_{i=1}^{k} \left( f(w_i) - f(w^*) \right) \le \frac{1}{2\alpha}\|w_0 - w^*\|_2^2$$


$f(w_i)$ non-increasing:
$$f(w_k) - f(w^*) \le \frac{1}{k}\sum_{i=1}^{k} \left( f(w_i) - f(w^*) \right) \le \frac{1}{2 k \alpha}\|w_0 - w^*\|_2^2 \le \epsilon$$

Consequently: $k \ge O(1/\epsilon)$

Are these rates optimal?


Gradient with momentum

Intuition:

[Figure: two surface plots of a loss $l$ over $(w_1, w_2)$ illustrating the intuition for momentum]

Video

Polyak’s method (aka heavy-ball):
$$w_{k+1} = w_k - \alpha_k \nabla f(w_k) + \beta_k (w_k - w_{k-1})$$

Momentum method in deep learning:
$$v_{k+1} = \beta v_k + \nabla f(w_k), \qquad w_{k+1} = w_k - \alpha v_{k+1}$$
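
A sketch of the deep-learning momentum update above (my addition; the quadratic test problem and the constants are assumed for illustration):

```python
import numpy as np

def momentum_descent(grad_f, w0, alpha=0.01, beta=0.9, num_iter=200):
    """v <- beta * v + grad f(w);  w <- w - alpha * v."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(num_iter):
        v = beta * v + grad_f(w)
        w = w - alpha * v
    return w

# Example: an ill-conditioned quadratic f(w) = 0.5 * w^T Q w, where plain
# gradient descent zig-zags across the narrow valley and momentum damps this.
Q = np.diag([1.0, 50.0])
w_star = momentum_descent(lambda w: Q @ w, w0=[5.0, 5.0])
```
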
Recall the structure of our optimization problems:
$$\min_w \sum_{(x^{(i)}, y^{(i)}) \in D} \ell\left( y^{(i)}, F(x^{(i)}, w) \right)$$

So far we didn’t consider the time for computing the gradient:
Iteration complexity is linear in the number of samples $|D|$
A large dataset makes gradient computation slow

How to deal with this?


Stochastic gradient descent

Consider a subset of samples and approximate the gradient based on this batch of data.

Select a subset of samples $B_k$
Gradient update using the approximation
$$\nabla f(w) \approx \frac{1}{|B_k|} \sum_{(x^{(i)}, y^{(i)}) \in B_k} \nabla \ell\left( y^{(i)}, F(x^{(i)}, w) \right)$$

Convergence rates for stochastic gradient descent:
Lipschitz continuous gradient and strongly convex: $k \ge O(1/\epsilon)$
Lipschitz continuous gradient: $k \ge O(1/\epsilon^2)$
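
A minimal minibatch-SGD sketch (my addition; synthetic least-squares data, with batch size and stepsize as assumed choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))       # synthetic dataset D
y = X @ rng.normal(size=5)

w = np.zeros(5)
batch_size, alpha = 32, 0.05
for k in range(2000):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample batch B_k
    residual = X[idx] @ w - y[idx]
    grad = X[idx].T @ residual / batch_size  # batch estimate of the full gradient
    w -= alpha * grad
```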


Stochastic vs. deterministic (strongly convex)

Batch gradient descent:
Convergence rate: $O(\log 1/\epsilon)$
Iteration complexity: linear in $|D|$

Stochastic gradient descent:
Convergence rate: $O(1/\epsilon)$
Iteration complexity: independent of $|D|$

Can we get the best of both worlds?


Many related algorithms:
SAG (Le Roux, Schmidt, Bach 2012)
SDCA (Shalev-Shwartz and Zhang 2013)
SVRG (Johnson and Zhang 2013)
MISO (Mairal 2015)
Finito (Defazio 2014)
SAGA (Defazio, Bach, Lacoste-Julien 2014)
...

Idea: variance reduction


Example: SVRG
Initialize $\hat{w}$
For epoch $1, 2, 3, \dots$
  Compute $\nabla f(\hat{w}) = \sum_{i \in D} \nabla \ell_i(\hat{w})$
  Initialize $w_0 = \hat{w}$
  For $t$ in length of epoch:
  $$w_t = w_{t-1} - \alpha \left( \nabla f(\hat{w}) + \nabla \ell_{i(t)}(w_{t-1}) - \nabla \ell_{i(t)}(\hat{w}) \right)$$
  Update $\hat{w} = w_t$
Output $\hat{w}$
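
A runnable sketch of this pseudocode on a least-squares objective (my addition; the snapshot gradient is taken as the mean over samples rather than the sum, a common variant, and the epoch length, stepsize, and data are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5)
n = len(X)

def grad_i(w, i):
    """Gradient of the i-th sample loss 0.5 * (x_i^T w - y_i)^2."""
    return X[i] * (X[i] @ w - y[i])

w_hat = np.zeros(5)               # snapshot, the pseudocode's w-hat
alpha, epoch_len = 0.05, 2 * n
for epoch in range(20):
    full_grad = X.T @ (X @ w_hat - y) / n  # full (mean) gradient at the snapshot
    w = w_hat.copy()
    for t in range(epoch_len):
        i = rng.integers(n)       # pick sample i(t) uniformly at random
        # full_grad + grad_i(w) - grad_i(w_hat) is an unbiased estimate of the
        # full gradient at w, with variance shrinking as w approaches w_hat.
        w -= alpha * (full_grad + grad_i(w, i) - grad_i(w_hat, i))
    w_hat = w
```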


Quiz:
Stepsize/learning rate rules?
Descent directions?
Properties of convex functions?
Convergence rates?
Improvements?


Important topics of this lecture
Convex optimization basics
Algorithm choices
Rates

Up next:
Duality
