L04 Slides
A. G. Schwing
Least squares:
$$\min_w \; \frac{1}{2} \sum_{(x^{(i)}, y^{(i)}) \in D} \left( y^{(i)} - \phi(x^{(i)})^\top w \right)^2$$

Logistic Regression:
$$\min_w \; \sum_{(x^{(i)}, y^{(i)}) \in D} \log\left( 1 + \exp\left( -y^{(i)}\, w^\top \phi(x^{(i)}) \right) \right)$$
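To make the two objectives concrete, here is a minimal NumPy sketch that evaluates both losses on a toy dataset; the feature map `phi`, the data, and the weights are made-up placeholders, not part of the slides.

```python
import numpy as np

def phi(x):
    # Hypothetical feature map: raw input plus a bias feature.
    return np.append(x, 1.0)

# Toy dataset: inputs x^(i) with targets y^(i) (real-valued for least
# squares; +/-1 labels for logistic regression).
X = [np.array([0.5, -1.2]), np.array([1.0, 0.3]), np.array([-0.7, 2.0])]
y_reg = [0.4, 1.1, -0.2]    # regression targets
y_cls = [1, -1, 1]          # binary labels in {-1, +1}
w = np.zeros(3)             # one weight per feature of phi(x)

# Least squares: 1/2 * sum_i (y_i - phi(x_i)^T w)^2
ls_loss = 0.5 * sum((y - phi(x) @ w) ** 2 for x, y in zip(X, y_reg))

# Logistic regression: sum_i log(1 + exp(-y_i * w^T phi(x_i)))
lr_loss = sum(np.log1p(np.exp(-y * (w @ phi(x)))) for x, y in zip(X, y_cls))

print(ls_loss, lr_loss)
```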
Finding optimum:
$$\min_w f_0(w) \quad \text{s.t.} \quad f_i(w) \le 0 \quad \forall i \in \{1, \dots, C\}$$

Solution:
[Figure: stationary points of a non-convex function: saddle point, local minimum, global minimum]
Least squares:
$$\min_w \; \frac{1}{2} \sum_{(x^{(i)}, y^{(i)}) \in D} \left( y^{(i)} - \phi(x^{(i)})^\top w \right)^2$$

Linear program:
$$\min_w \; c^\top w \quad \text{s.t.} \quad Aw \le b$$
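As a concrete instance of the linear-program form above, here is a minimal sketch using `scipy.optimize.linprog`; the specific `c`, `A`, `b` values are made up for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# min_w c^T w  s.t.  A w <= b
c = np.array([1.0, 2.0])
A = np.array([[-1.0, 0.0],    # -w1 <= 0  (i.e., w1 >= 0)
              [0.0, -1.0],    # -w2 <= 0  (i.e., w2 >= 0)
              [1.0, 1.0]])    # w1 + w2 <= 1
b = np.array([0.0, 0.0, 1.0])

# bounds=(None, None) disables linprog's default w >= 0 box so the
# constraints live entirely in A w <= b, matching the slide's form.
res = linprog(c, A_ub=A, b_ub=b, bounds=(None, None))
print(res.x, res.fun)  # optimal w and objective value
```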
Convex program
A set is convex if for any two points $w_1, w_2$ in the set, the line segment $\lambda w_1 + (1 - \lambda) w_2$ for $\lambda \in [0, 1]$ also lies in the set.

Example: Polyhedron
$$\{w \mid Aw \le b,\; Cw = d\}$$
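A quick sketch of checking membership in such a polyhedron; `A`, `b`, `C`, `d`, and the test points are placeholder values.

```python
import numpy as np

def in_polyhedron(w, A, b, C, d, tol=1e-9):
    # w lies in {w | Aw <= b, Cw = d} iff every inequality and equality holds.
    return bool(np.all(A @ w <= b + tol) and np.all(np.abs(C @ w - d) <= tol))

A = np.array([[1.0, 1.0]]); b = np.array([1.0])    # w1 + w2 <= 1
C = np.array([[1.0, -1.0]]); d = np.array([0.0])   # w1 = w2
print(in_polyhedron(np.array([0.3, 0.3]), A, b, C, d))  # True
print(in_polyhedron(np.array([0.8, 0.8]), A, b, C, d))  # False: violates Aw <= b
```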
A function $f$ is convex if its domain is a convex set and for any points $w_1, w_2$ in the domain and any $\lambda \in [0, 1]$:
$$f(\lambda w_1 + (1 - \lambda) w_2) \le \lambda f(w_1) + (1 - \lambda) f(w_2)$$
[Figure: the chord between $(w_1, f(w_1))$ and $(w_2, f(w_2))$ lies above the graph of $f$]
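A minimal numeric sanity check of this inequality for the least-squares objective; the data and weights are placeholders, and random sampling only gives evidence, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 3))   # stacked features phi(x^(i))
y = rng.normal(size=20)

def f(w):
    # Least-squares objective: 1/2 * ||y - Phi w||^2
    return 0.5 * np.sum((y - Phi @ w) ** 2)

for _ in range(1000):
    w1, w2 = rng.normal(size=3), rng.normal(size=3)
    lam = rng.uniform()
    assert f(lam * w1 + (1 - lam) * w2) <= lam * f(w1) + (1 - lam) * f(w2) + 1e-9
print("convexity inequality held on all sampled points")
```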
Recognizing convex functions
For differentiable $f$, convexity is equivalent to the first-order condition: the tangent at any point is a global underestimator,
$$f(w_1) \ge f(w_2) + \nabla f(w_2)^\top (w_1 - w_2) \quad \forall w_1, w_2.$$
[Figure: $f$ lies above its tangent line $f(w_2) + \nabla f(w_2)^\top (w_1 - w_2)$]
Operations that preserve convexity (for convex $f, f_1, f_2$ and $\alpha_i \ge 0$), illustrated in the sketch below:
- Nonnegative weighted sums: $g = \alpha_1 f_1 + \alpha_2 f_2 + \dots$
- Composition with an affine map: $g(w) = f(Aw + b)$
- Pointwise maximum: $g(w) = \max\{f_1(w), f_2(w)\}$
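For instance, the hinge loss combines the last two rules: it is the pointwise maximum of two affine (hence convex) functions of $w$. A small sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=3), 1.0   # one feature vector and its +/-1 label

def hinge(w):
    # max(0, 1 - y * w^T x): pointwise max of the affine functions
    # f1(w) = 0 and f2(w) = 1 - y * x^T w, so hinge is convex in w.
    return max(0.0, 1.0 - y * (w @ x))

# Midpoint convexity check on random pairs.
for _ in range(1000):
    w1, w2 = rng.normal(size=3), rng.normal(size=3)
    assert hinge(0.5 * (w1 + w2)) <= 0.5 * hinge(w1) + 0.5 * hinge(w2) + 1e-12
print("hinge loss passed the midpoint convexity check")
```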
[Figure: stationary points revisited: saddle point, local minimum, global minimum]
Intuition:
[Figure: at $w$, step along a descent direction $d$ to $w + \alpha d$; for gradient descent $d = -\nabla f(w)$, giving $w - \alpha \nabla f(w)$]
Backtracking line search: accept a step size $\alpha$ if it satisfies the Armijo condition
$$f(w_k + \alpha d_k) - f(w_k) \le \sigma \alpha \nabla f(w_k)^\top d_k,$$
trying candidates $\alpha_k \in \{s, \beta s, \beta^2 s, \dots\}$ until the condition holds.
[Figure: $f(w_k + \alpha d_k) - f(w_k)$ as a function of $\alpha$, against the lines $\alpha \nabla f(w_k)^\top d_k$ and $\sigma \alpha \nabla f(w_k)^\top d_k$]
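A minimal sketch of this backtracking rule; the parameter values `s`, `beta`, `sigma` are typical choices, not prescribed by the slides.

```python
import numpy as np

def backtracking_step(f, grad_f, w, d, s=1.0, beta=0.5, sigma=1e-4):
    # Shrink alpha through {s, beta*s, beta^2*s, ...} until the Armijo
    # condition f(w + alpha*d) - f(w) <= sigma*alpha*grad^T d holds.
    # Assumes d is a descent direction (grad_f(w)^T d < 0), so the loop ends.
    alpha, fw, slope = s, f(w), grad_f(w) @ d
    while f(w + alpha * d) - fw > sigma * alpha * slope:
        alpha *= beta
    return alpha

# Example on f(w) = ||w||^2 with d = -grad f(w).
f = lambda w: w @ w
grad_f = lambda w: 2 * w
w = np.array([3.0, -4.0])
d = -grad_f(w)
print(backtracking_step(f, grad_f, w, d))  # accepted step size
```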
How small is $f(w_k) - f(w^*)$?
Lipschitz gradient: $\|\nabla g(w_1) - \nabla g(w_2)\|_2 \le L \|w_1 - w_2\|_2$
Proof:
If $f$ is twice differentiable, the Lipschitz gradient and strong convexity together correspond to
$$\sigma I \preceq \nabla^2 f(w) \preceq L I \quad \forall w.$$
[Figure: near $w_1$, $f$ is sandwiched between a quadratic lower bound (strong convexity) and a quadratic upper bound (Lipschitz), with plain convexity giving the tangent line in between]
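For the least-squares objective this is easy to check numerically: the Hessian is the constant matrix $\Phi^\top \Phi$, so $\sigma$ and $L$ are its extreme eigenvalues. A sketch with placeholder data:

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(50, 4))   # stacked features phi(x^(i))

# For f(w) = 1/2 * ||y - Phi w||^2 the Hessian is Phi^T Phi everywhere.
H = Phi.T @ Phi
eigvals = np.linalg.eigvalsh(H)  # ascending order
sigma, L = eigvals[0], eigvals[-1]
print(f"sigma = {sigma:.3f}, L = {L:.3f}")  # sigma*I <= H <= L*I
```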
$f(w_k) - f(w^*) \le\; ?$ for $w_{k+1} = w_k + \alpha d_k$

How to pick $\alpha d_k$? The Lipschitz gradient gives the quadratic upper bound
$$f(w_{k+1}) \le f(w_k) + \nabla f(w_k)^\top (w_{k+1} - w_k) + \frac{L}{2} \|w_{k+1} - w_k\|_2^2.$$
Minimizing the right-hand side over $w_{k+1}$ gives
$$\alpha d_k = -\frac{1}{L} \nabla f(w_k),$$
and hence
$$f(w_{k+1}) \le f(w_k) - \frac{1}{2L} \|\nabla f(w_k)\|_2^2 \qquad \text{(bound on guaranteed progress)}.$$
Strong convexity gives the matching lower bound
$$f(w^*) \ge f(w_k) - \frac{1}{2\sigma} \|\nabla f(w_k)\|_2^2.$$
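A sketch of gradient descent with the constant step $1/L$ on the least-squares objective, checking the guaranteed-progress bound at every iteration; the data is made up.

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(50, 4))
y = rng.normal(size=50)

f = lambda w: 0.5 * np.sum((y - Phi @ w) ** 2)
grad = lambda w: Phi.T @ (Phi @ w - y)
L = np.linalg.eigvalsh(Phi.T @ Phi)[-1]    # Lipschitz constant of grad f

w = np.zeros(4)
for k in range(100):
    g = grad(w)
    w_next = w - g / L                     # alpha * d_k = -(1/L) grad f(w_k)
    # Guaranteed progress: f(w_{k+1}) <= f(w_k) - ||g||^2 / (2L)
    assert f(w_next) <= f(w) - (g @ g) / (2 * L) + 1e-9
    w = w_next
print("final objective:", f(w))
```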
Guaranteed progress:
$$f(w_{k+1}) \le f(w_k) - \frac{1}{2L} \|\nabla f(w_k)\|_2^2 \qquad \text{(Lipschitz)}$$

Maximum sub-optimality:
$$f(w^*) \ge f(w_k) - \frac{1}{2\sigma} \|\nabla f(w_k)\|_2^2 \qquad \text{(strong convexity)}$$

[Figure: at $w_1$, the Lipschitz bound gives the guaranteed progress and the strong-convexity bound gives the maximum sub-optimality]

Distance to go:
$$f(w_k) - f(w^*) \le \frac{1}{2\sigma} \|\nabla f(w_k)\|_2^2$$

Plugging the 'distance to go' into 'guaranteed progress':
$$f(w_k) - f(w^*) \le f(w_{k-1}) - f(w^*) - \frac{\sigma}{L} \left( f(w_{k-1}) - f(w^*) \right) \le \left( 1 - \frac{\sigma}{L} \right)^k \left( f(w_0) - f(w^*) \right) \quad (\sigma < L)$$

Rate:
$$c \left( 1 - \frac{\sigma}{L} \right)^k \le \epsilon \implies k \ge O(\log(1/\epsilon))$$
(equivalently, the sub-optimality decreases as $O(e^{-k})$)
No strong convexity assumption:
Consequently:
$$k \ge O(1/\epsilon)$$
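A quick numeric comparison of the two regimes: iterations needed for $\epsilon$ accuracy under the linear rate $(1 - \sigma/L)^k$ versus the sublinear $O(1/\epsilon)$ bound; the constants and the ratio $\sigma/L$ are illustrative only.

```python
import math

eps = 1e-6
sigma_over_L = 0.01   # illustrative ratio sigma/L

# Strongly convex: (1 - sigma/L)^k <= eps  =>  k >= log(1/eps) / log(1/(1 - sigma/L))
k_linear = math.ceil(math.log(1 / eps) / math.log(1 / (1 - sigma_over_L)))

# Without strong convexity: k >= O(1/eps); with a unit constant:
k_sublinear = math.ceil(1 / eps)

print(k_linear, k_sublinear)   # roughly 1375 vs 1,000,000 iterations
```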
[Figure: two 3D surface plots of the loss $\ell$ over $(w_1, w_2)$, comparing iterates without and with momentum]

Video

Polyak's method (aka heavy-ball):
$$w_{k+1} = w_k - \alpha_k \nabla f(w_k) + \beta_k (w_k - w_{k-1})$$
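A minimal sketch of the heavy-ball update on an ill-conditioned quadratic; the step size $\alpha_k$ and momentum $\beta_k$ are held constant, with illustrative (not tuned) values.

```python
import numpy as np

# Ill-conditioned quadratic: f(w) = 0.5 * w^T H w, minimized at the origin.
H = np.diag([1.0, 100.0])
grad = lambda w: H @ w

alpha, beta = 0.015, 0.9          # constant step and momentum (illustrative)
w_prev = w = np.array([10.0, 1.0])
for k in range(500):
    # w_{k+1} = w_k - alpha * grad f(w_k) + beta * (w_k - w_{k-1})
    w, w_prev = w - alpha * grad(w) + beta * (w - w_prev), w
print(w)   # approaches the minimizer at the origin
```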
Stochastic (mini-batch) gradient: approximate the full gradient with a mini-batch $B_k$,
$$\nabla f(w) \approx \frac{1}{|B_k|} \sum_{(x^{(i)}, y^{(i)}) \in B_k} \nabla \ell\left( y^{(i)}, F(x^{(i)}, w) \right)$$
- Update $\hat{w} = w_t$
- Output $\hat{w}$
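A sketch of a mini-batch SGD loop matching the gradient approximation above, for the least-squares loss $\ell(y, F(x, w)) = \frac{1}{2}(y - w^\top x)^2$ with an identity feature map; the data, batch size, and learning rate are placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))            # features x^(i)
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=200)

w, alpha, batch = np.zeros(5), 0.05, 16
for t in range(1000):
    B = rng.choice(len(X), size=batch, replace=False)   # mini-batch B_k
    # (1/|B_k|) * sum over the batch of grad of 1/2*(y - w^T x)^2
    g = X[B].T @ (X[B] @ w - y[B]) / batch
    w = w - alpha * g
w_hat = w                                # update w_hat = w_t
print("error:", np.linalg.norm(w_hat - w_true))  # output w_hat
```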
Up next:
Duality