
Subgradient and Bundle Methods

for optimization of convex nonsmooth functions


April 1, 2009
Motivation
Many naturally occurring problems are nonsmooth:
Hinge loss
Feasible region of a convex minimization problem
Piecewise linear functions
A function that approximates a nonsmooth function may be analytically smooth but numerically nonsmooth.
Methods for nonsmooth optimization
Approximate by a series of smooth functions
Reformulate the problem, adding more constraints, so that the objective is smooth
Subgradient Methods
Cutting Plane Methods
Moreau-Yosida Regularization
Bundle Methods
UV decomposition
Definition
An extension of gradients
For a convex differentiable function f(x), for all x and y:
    f(y) \ge f(x) + \nabla f(x)^T (y - x)    (1)
So a subgradient of f at x is defined as any g \in R^n such that, for all y,
    f(y) \ge f(x) + g^T (y - x)    (2)
The set of all subgradients of f at x is denoted \partial f(x).
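The subgradient inequality (2) is easy to check numerically. Here is a small sketch (assuming NumPy is available; the choice f(x) = |x| is only an illustration, not from the slides) verifying that any g in [-1, 1] is a subgradient of |x| at x = 0:

import numpy as np

def f(x):
    return np.abs(x)

def a_subgradient(x):
    # one valid subgradient of |x|: sign(x) away from 0, any value in [-1, 1] at 0
    return np.sign(x) if x != 0 else 0.3   # 0.3 is an arbitrary element of [-1, 1]

x0 = 0.0
g = a_subgradient(x0)
y = np.linspace(-2.0, 2.0, 401)
# subgradient inequality (2): f(y) >= f(x0) + g * (y - x0) for all y
assert np.all(f(y) >= f(x0) + g * (y - x0) - 1e-12)
print("g =", g, "satisfies the subgradient inequality at x =", x0)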
Some Facts
From Convex Analysis
A convex function is always subdifferentiable, i.e. a subgradient of a convex function exists at every point.
Directional derivatives also exist at every point.
If a convex function f is differentiable at x, its only subgradient there is the gradient, i.e. \partial f(x) = \{\nabla f(x)\}.
Subgradients are lower bounds for directional derivatives:
    f'(x; d) = \sup_{g \in \partial f(x)} \langle g, d \rangle
Further, d is a descent direction iff g^T d < 0 for all g \in \partial f(x).
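As a quick worked illustration of the last two facts (an added example, not from the slides): for f(x) = |x| on R we have \partial f(0) = [-1, 1], so

    f'(0; d) = \sup_{g \in [-1, 1]} g \, d = |d| \ge 0,

hence no direction d satisfies g^T d < 0 for every g \in \partial f(0); consistently, 0 \in \partial f(0) and x = 0 is the minimizer.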
Properties
Without Proof
\partial (f_1 + f_2)(x) = \partial f_1(x) + \partial f_2(x)
\partial (\lambda f)(x) = \lambda \, \partial f(x) for \lambda \ge 0
If g(x) = f(Ax + b), then \partial g(x) = A^T \partial f(Ax + b)
Local minimum \iff 0 \in \partial f(x)
However, for f(x) = |x| the oracle returns the subgradient 0 only at x = 0, so waiting for the oracle to return 0 is not a good way to find minima.
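The affine-composition rule is how subgradients of losses are usually assembled in practice. Below is a minimal sketch (assuming NumPy; the hinge loss from the Motivation slide and the particular vectors are illustrative choices): for f(z) = max(0, 1 - z) and g(x) = f(a^T x), a subgradient of g at x is a * s with s \in \partial f(a^T x).

import numpy as np

def hinge(z):
    return max(0.0, 1.0 - z)

def hinge_subgrad(z):
    # subdifferential of max(0, 1 - z): {-1} for z < 1, {0} for z > 1, [-1, 0] at z = 1
    if z < 1.0:
        return -1.0
    if z > 1.0:
        return 0.0
    return -0.5   # arbitrary element of [-1, 0] at the kink

a = np.array([2.0, -1.0, 0.5])
x = np.array([0.3, 0.2, 0.1])
z = a @ x
# chain rule for the affine composition g(x) = hinge(a^T x): subgradient = a * s
g_sub = a * hinge_subgrad(z)
print("g(x) =", hinge(z), " subgradient =", g_sub)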
Subgradient Method
Algorithm
The Subgradient Method is NOT a descent method!
    x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}    for \alpha_k \ge 0 and g^{(k)} \in \partial f(x^{(k)})
    f^{(k)}_{best} = \min\{ f^{(k-1)}_{best}, f(x^{(k)}) \}
Line search is not performed; the step lengths \alpha_k are usually fixed ahead of time.
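A compact sketch of this iteration (assuming NumPy; the test objective f(x) = ||x||_1 and the 1/sqrt(k) step sizes are illustrative choices, not prescribed by the slides):

import numpy as np

def f(x):
    return np.sum(np.abs(x))      # nonsmooth test objective

def subgrad(x):
    return np.sign(x)             # one valid subgradient of ||x||_1

x = np.array([3.0, -2.0, 1.5])
f_best = f(x)
for k in range(1, 201):
    g = subgrad(x)
    alpha = 1.0 / np.sqrt(k)      # nonsummable diminishing step size
    x = x - alpha * g             # not necessarily a descent step
    f_best = min(f_best, f(x))    # keep the best value seen so far
print("f_best after 200 iterations:", f_best)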
Step Lengths
Commonly used step lengths
Constant step size: \alpha_k = h
Constant step length: \alpha_k = h / \|g^{(k)}\|_2 (so that \|x^{(k+1)} - x^{(k)}\|_2 = h)
Square summable but not summable step sizes:
    \alpha_k \ge 0,  \sum_{k=1}^{\infty} \alpha_k^2 < \infty,  \sum_{k=1}^{\infty} \alpha_k = \infty
Nonsummable diminishing:
    \alpha_k \ge 0,  \lim_{k \to \infty} \alpha_k = 0,  \sum_{k=1}^{\infty} \alpha_k = \infty
Nonsummable diminishing step lengths: \alpha_k = \gamma_k / \|g^{(k)}\|_2 with
    \gamma_k \ge 0,  \lim_{k \to \infty} \gamma_k = 0,  \sum_{k=1}^{\infty} \gamma_k = \infty
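These schedules are simple to express as callables that the loop above can query; a sketch (assuming NumPy; the constants are placeholders one would tune):

import numpy as np

def constant_size(h=0.01):
    return lambda k, g: h                              # alpha_k = h

def constant_length(h=0.01):
    return lambda k, g: h / np.linalg.norm(g)          # each step has Euclidean length h

def square_summable(a=1.0):
    return lambda k, g: a / k                          # sum (a/k)^2 < inf, sum a/k = inf

def nonsummable_diminishing(a=1.0):
    return lambda k, g: a / np.sqrt(k)                 # a/sqrt(k) -> 0, sum diverges

# usage inside the subgradient loop: alpha = rule(k, g)
rule = nonsummable_diminishing()
print(rule(100, np.array([3.0, 4.0])))                 # 0.1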
Convergence Result
Assume there exists G such that the norms of the subgradients are bounded, i.e. \|g^{(k)}\|_2 \le G for all k
(for example, when f is Lipschitz continuous).
Result:
    f^{(k)}_{best} - f^* \le \frac{ \mathrm{dist}(x^{(1)}, X^*)^2 + G^2 \sum_{i=1}^{k} \alpha_i^2 }{ 2 \sum_{i=1}^{k} \alpha_i }
The proof works by tracking how \|x^{(k)} - x^*\|_2 decreases.
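For reference, the key step of that proof (a standard reconstruction, not verbatim from the slides): the subgradient inequality gives f^* \ge f(x^{(i)}) + g^{(i)T}(x^* - x^{(i)}), so

    \|x^{(i+1)} - x^*\|_2^2 = \|x^{(i)} - \alpha_i g^{(i)} - x^*\|_2^2
                            \le \|x^{(i)} - x^*\|_2^2 - 2 \alpha_i \left( f(x^{(i)}) - f^* \right) + \alpha_i^2 G^2 .

Summing over i = 1, ..., k, discarding \|x^{(k+1)} - x^*\|_2^2 \ge 0, and bounding each f(x^{(i)}) - f^* below by f^{(k)}_{best} - f^* yields the stated bound.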
Convergence for Commonly Used Step Lengths
Constant step size: f^{(k)}_{best} converges to within G^2 h / 2 of optimal
Constant step length: f^{(k)}_{best} converges to within G h of optimal
Square summable but not summable step size: f^{(k)}_{best} \to f^*
Nonsummable diminishing: f^{(k)}_{best} \to f^*
Nonsummable diminishing step lengths: f^{(k)}_{best} \to f^*
Writing R = \mathrm{dist}(x^{(1)}, X^*), these follow from
    f^{(k)}_{best} - f^* \le \frac{ R^2 + G^2 \sum_{i=1}^{k} \alpha_i^2 }{ 2 \sum_{i=1}^{k} \alpha_i }
So, for a fixed iteration count k, the optimal \alpha_i are R / (G \sqrt{k}), and the method reaches accuracy \epsilon in (RG/\epsilon)^2 steps.
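The last claim is a short calculation (reconstructed here for illustration): with a constant step \alpha_i = \alpha the bound becomes

    f^{(k)}_{best} - f^* \le \frac{ R^2 + G^2 k \alpha^2 }{ 2 k \alpha },

which is minimized by \alpha = R / (G \sqrt{k}), giving f^{(k)}_{best} - f^* \le R G / \sqrt{k}; requiring R G / \sqrt{k} \le \epsilon gives k \ge (RG/\epsilon)^2.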
Variations
If the optimal value f^* is known (e.g. it is known to be 0 even though the minimizer is not), the Polyak step size can be used:
    \alpha_k = \frac{ f(x^{(k)}) - f^* }{ \|g^{(k)}\|_2^2 }
Projected subgradient, for minimizing f(x) s.t. x \in C:
    x^{(k+1)} = P_C\left( x^{(k)} - \alpha_k g^{(k)} \right)
Alternating projections: find a point in the intersection of two convex sets.
Heavy ball method:
    x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)} + \beta_k \left( x^{(k)} - x^{(k-1)} \right)
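A sketch of the projected subgradient update (assuming NumPy; the feasible set C, here the Euclidean unit ball, and the objective ||x - c||_1 are illustrative choices):

import numpy as np

c = np.array([2.0, -3.0])

def f(x):
    return np.sum(np.abs(x - c))

def subgrad(x):
    return np.sign(x - c)

def project_unit_ball(x):
    # Euclidean projection onto C = {x : ||x||_2 <= 1}
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

x = np.zeros(2)
for k in range(1, 501):
    g = subgrad(x)
    x = project_unit_ball(x - (1.0 / np.sqrt(k)) * g)
print("approximate constrained minimizer:", x, " f(x) =", f(x))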
Pros
Can immediately be applied to a wide variety of problems, especially when the required accuracy is not very high.
Low memory usage.
Often possible to design distributed methods if the objective is decomposable.
Cons
Slower than second-order methods.
Cutting Plane Method
Again, consider the problem: minimize f(x) subject to x \in C.
Construct an approximate model from the points visited so far:
    \check{f}(x) = \max_{i \in I} \left\{ f(x_i) + g_i^T (x - x_i) \right\}
Minimize the model over x, then evaluate f(x) and a subgradient g at the new point.
Update the model and repeat until the desired accuracy is reached.
Numerically unstable.
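A minimal sketch of the loop (assuming NumPy and a recent SciPy; the box C = [-2, 2]^2 and the test objective f(x) = ||x||_1 are illustrative choices). Each outer iteration minimizes the piecewise-linear model in epigraph form over (x, t) with scipy.optimize.linprog:

import numpy as np
from scipy.optimize import linprog

def f(x):
    return np.sum(np.abs(x))

def subgrad(x):
    return np.sign(x)

n, lo, hi = 2, -2.0, 2.0                 # C = [lo, hi]^n
pts, vals, grads = [], [], []
x = np.array([1.5, -1.0])                # starting point

for it in range(20):
    pts.append(x.copy()); vals.append(f(x)); grads.append(subgrad(x))
    # model minimization: min t  s.t.  f(x_i) + g_i^T (x - x_i) <= t,  x in C
    m = len(pts)
    A = np.hstack([np.array(grads), -np.ones((m, 1))])              # g_i^T x - t
    b = np.array([g @ p - v for p, v, g in zip(pts, vals, grads)])  # g_i^T x_i - f(x_i)
    c = np.zeros(n + 1); c[-1] = 1.0
    res = linprog(c, A_ub=A, b_ub=b,
                  bounds=[(lo, hi)] * n + [(None, None)], method="highs")
    x, model_val = res.x[:n], res.x[-1]
    if f(x) - model_val < 1e-6:          # model agrees with f at its minimizer: stop
        break
print("cutting-plane solution:", x, " f =", f(x))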
Moreau-Yosida Regularization
Idea: solve a series of smooth convex problems to minimize f(x).
    F(x) = \min_{y \in R^n} \left\{ f(y) + \frac{\lambda}{2} \|y - x\|^2 \right\}
    p(x) = \operatorname{argmin}_{y \in R^n} \left\{ f(y) + \frac{\lambda}{2} \|y - x\|^2 \right\}
F(x) is differentiable!
    \nabla F(x) = \lambda (x - p(x))
The minimization is done using the dual.
Cutting Plane Method + Moreau-Yosida Regularization = Bundle Methods
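For f(x) = |x| these objects have a closed form, the well-known soft-thresholding map. A small sketch (assuming NumPy; lam = 2 is an arbitrary choice) that also checks \nabla F(x) = \lambda (x - p(x)) against a numerical derivative:

import numpy as np

lam = 2.0

def f(x):
    return np.abs(x)

def p(x):
    # proximal point of |.| with parameter lam: soft-thresholding at 1/lam
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / lam, 0.0)

def F(x):
    # Moreau envelope: F(x) = f(p(x)) + lam/2 * (p(x) - x)^2
    y = p(x)
    return f(y) + 0.5 * lam * (y - x) ** 2

x, eps = 0.7, 1e-6
numeric_grad = (F(x + eps) - F(x - eps)) / (2 * eps)
print(numeric_grad, lam * (x - p(x)))    # the two values agree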
Elementary Bundle Method
As before, f is assumed to be Lipschitz continuous.
At a generic iteration we maintain a bundle of tuples
    \langle y_i, f(y_i), s_i, \alpha_i \rangle
with s_i \in \partial f(y_i) a subgradient and \alpha_i the corresponding linearization error.
Elementary Bundle Method
Follow the Cutting Plane Method, but apply M-Y regularization when minimizing the model:
    y_{k+1} = \operatorname{argmin}_{y \in R^n} \left\{ \check{f}_k(y) + \frac{\mu_k}{2} \|y - \hat{x}_k\|^2 \right\}
    \delta_k = f(\hat{x}_k) - \left[ \check{f}_k(y_{k+1}) + \frac{\mu_k}{2} \|y_{k+1} - \hat{x}_k\|^2 \right] \ge 0
If \delta_k < \epsilon, stop.
If f(\hat{x}_k) - f(y_{k+1}) \ge m \, \delta_k (for a fixed m \in (0, 1)): Serious Step, \hat{x}_{k+1} = y_{k+1}
else: Null Step, \hat{x}_{k+1} = \hat{x}_k
In both cases, update the model:
    \check{f}_{k+1}(y) = \max\left\{ \check{f}_k(y), \; f(y_{k+1}) + \langle s_{k+1}, y - y_{k+1} \rangle \right\}
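A compact sketch of this loop (assuming NumPy and SciPy; f(x) = ||x||_1, a fixed \mu_k = 1 and m = 0.1 are illustrative choices, and the proximal subproblem is solved in epigraph form with SLSQP rather than with a dual method as the slides suggest):

import numpy as np
from scipy.optimize import minimize

def f(x):
    return np.sum(np.abs(x))             # illustrative nonsmooth objective

def subgrad(x):
    return np.sign(x)

n, mu, m_par, eps = 2, 1.0, 0.1, 1e-6
xhat = np.array([2.0, -1.5])
ys, vals, grads = [xhat.copy()], [f(xhat)], [subgrad(xhat)]

for k in range(30):
    Y, V, S = np.array(ys), np.array(vals), np.array(grads)

    def model(y):
        # cutting-plane model: max_i f(y_i) + s_i^T (y - y_i)
        return np.max(V + np.sum(S * (y - Y), axis=1))

    # proximal subproblem over z = (y, t): min t + mu/2 ||y - xhat||^2, s.t. all cuts <= t
    obj = lambda z: z[-1] + 0.5 * mu * np.sum((z[:n] - xhat) ** 2)
    cons = {"type": "ineq",
            "fun": lambda z: z[-1] - (V + S @ z[:n] - np.sum(S * Y, axis=1))}
    z0 = np.concatenate([xhat, [f(xhat)]])
    z = minimize(obj, z0, constraints=[cons], method="SLSQP").x
    y_new = z[:n]

    delta = f(xhat) - (model(y_new) + 0.5 * mu * np.sum((y_new - xhat) ** 2))
    if delta < eps:
        break
    if f(xhat) - f(y_new) >= m_par * delta:   # serious step: move the center
        xhat = y_new
    # null step otherwise: keep xhat; in both cases enrich the bundle
    ys.append(y_new); vals.append(f(y_new)); grads.append(subgrad(y_new))

print("bundle method estimate:", xhat, " f =", f(xhat))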
Convergence
Either the algorithm makes a finite number of Serious Steps and afterwards only makes Null Steps:
    then, if k_0 is the last Serious Step and \mu_k is nondecreasing, \delta_k \to 0.
Or it makes an infinite number of Serious Steps:
    then, with K_s the set of Serious Step indices,
    \sum_{k \in K_s} \delta_k \le \frac{ f(\hat{x}_0) - f^* }{ m },  so \delta_k \to 0.
Variations
Replace \|y - x\|^2 by (y - x)^T M_k (y - x): still differentiable.
Conjugate gradient methods are obtained as a slight modification of the algorithm (see [5]).
Variable Metric Methods [10].
M_k = u_k I for Diagonal Variable Metric Methods.
Bundle-Newton Methods.
Summary
Nonsmooth convex optimization has been explored since the 1960s. The original subgradient methods were introduced by Naum Shor. Bundle methods have been developed more recently.
Subgradient methods are simple but slow, unless distributed, which is the predominant current application.
Bundle methods solve a bounded QP at each iteration, which is slow, but they need fewer iterations. They are preferred for applications where the oracle cost is high.
For Further Reading I
[1] Naum Z. Shor. Minimization Methods for Non-differentiable Functions. Springer-Verlag, 1985.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press.
[3] A. Ruszczynski. Nonlinear Optimization. Princeton University Press.
[4] Wikipedia. en.wikipedia.org/wiki/Subgradient_method
For Further Reading II
[5] Marko Makela. Survey of Bundle Methods, 2009. http://www.informaworld.com/smpp/content~db=all~content=a713741700
[6] Alexandre Belloni. An Introduction to Bundle Methods. http://web.mit.edu/belloni/www/LecturesIntroBundle.pdf
[7] John E. Mitchell. Cutting Plane and Subgradient Methods, 2005. http://www.optimization-online.org/DB_HTML/2009/05/2298.html
For Further Reading III
[8] Stephen Boyd. Lecture Notes on Subgradient Methods. http://www.stanford.edu/class/ee392o/subgrad_method.pdf
[9] Alexander J. Smola, S. V. N. Vishwanathan, Quoc V. Le. Bundle Methods for Machine Learning, 2007. http://books.nips.cc/papers/files/nips20/NIPS2007_0470.pdf
[10] C. Lemarechal. Variable Metric Bundle Methods, 1997. http://www.springerlink.com/index/3515WK428153171N.pdf
[11] Quoc Le, Alexander Smola. Direct Optimization of Ranking Measures, 2007. http://arxiv.org/abs/0704.3359
For Further Reading IV
[12] S. V. N. Vishwanathan, A. Smola. Quasi-Newton Methods for Efficient Large-Scale Machine Learning. http://portal.acm.org/ft_gateway.cfm?id=1390309&type=pdf and www.stat.purdue.edu/~vishy/talks/LBFGS.pdf