
Subgradient and Bundle Methods

for optimization of convex nonsmooth functions


April 1, 2009
Motivation
Many naturally occurring problems are nonsmooth:
Hinge loss
Feasible region of a convex minimization problem
Piecewise linear functions
A function that approximates a nonsmooth function may be analytically smooth but numerically nonsmooth.
Methods for nonsmooth optimization
Approximate by a series of smooth functions
Reformulate the problem, adding more constraints, so that the objective is smooth
Subgradient Methods
Cutting Plane Methods
Moreau-Yosida Regularization
Bundle Methods
UV decomposition
Definition
An extension of gradients
For a convex differentiable function f(x), for all x and y:
    f(y) \ge f(x) + \nabla f(x)^T (y - x)    (1)
So a subgradient of f at x is defined as any g \in R^n such that, for all y,
    f(y) \ge f(x) + g^T (y - x)    (2)
The set of all subgradients of f at x is denoted \partial f(x).
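The subgradient inequality (2) is easy to check numerically. Here is a small sketch (assuming NumPy is available; the choice f(x) = |x| is only an illustration, not from the slides) verifying that any g in [-1, 1] is a subgradient of |x| at x = 0:

import numpy as np

def f(x):
    return np.abs(x)

def a_subgradient(x):
    # one valid subgradient of |x|: sign(x) away from 0, any value in [-1, 1] at 0
    return np.sign(x) if x != 0 else 0.3   # 0.3 is an arbitrary element of [-1, 1]

x0 = 0.0
g = a_subgradient(x0)
y = np.linspace(-2.0, 2.0, 401)
# subgradient inequality (2): f(y) >= f(x0) + g * (y - x0) for all y
assert np.all(f(y) >= f(x0) + g * (y - x0) - 1e-12)
print("g =", g, "satisfies the subgradient inequality at x =", x0)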
Some Facts
From Convex Analysis
A convex function is always subdifferentiable, i.e. a subgradient of a convex function exists at every point.
Directional derivatives also exist at every point.
If a convex function f is differentiable at x, its only subgradient there is the gradient, i.e. \partial f(x) = \{\nabla f(x)\}.
Subgradients are lower bounds for directional derivatives:
    f'(x; d) = \sup_{g \in \partial f(x)} \langle g, d \rangle
Further, d is a descent direction iff g^T d < 0 for all g \in \partial f(x).
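As a quick worked illustration of the last two facts (an added example, not from the slides): for f(x) = |x| on R we have \partial f(0) = [-1, 1], so

    f'(0; d) = \sup_{g \in [-1, 1]} g \, d = |d| \ge 0,

hence no direction d satisfies g^T d < 0 for every g \in \partial f(0); consistently, 0 \in \partial f(0) and x = 0 is the minimizer.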
Properties
Without Proof
\partial (f_1 + f_2)(x) = \partial f_1(x) + \partial f_2(x)
\partial (\lambda f)(x) = \lambda \, \partial f(x) for \lambda \ge 0
If g(x) = f(Ax + b), then \partial g(x) = A^T \partial f(Ax + b)
Local minimum \iff 0 \in \partial f(x)
However, for f(x) = |x| the oracle returns the subgradient 0 only at x = 0, so waiting for the oracle to return 0 is not a good way to find minima.
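The affine-composition rule is how subgradients of losses are usually assembled in practice. Below is a minimal sketch (assuming NumPy; the hinge loss from the Motivation slide and the particular vectors are illustrative choices): for f(z) = max(0, 1 - z) and g(x) = f(a^T x), a subgradient of g at x is a * s with s \in \partial f(a^T x).

import numpy as np

def hinge(z):
    return max(0.0, 1.0 - z)

def hinge_subgrad(z):
    # subdifferential of max(0, 1 - z): {-1} for z < 1, {0} for z > 1, [-1, 0] at z = 1
    if z < 1.0:
        return -1.0
    if z > 1.0:
        return 0.0
    return -0.5   # arbitrary element of [-1, 0] at the kink

a = np.array([2.0, -1.0, 0.5])
x = np.array([0.3, 0.2, 0.1])
z = a @ x
# chain rule for the affine composition g(x) = hinge(a^T x): subgradient = a * s
g_sub = a * hinge_subgrad(z)
print("g(x) =", hinge(z), " subgradient =", g_sub)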
Subgradient Method
Algorithm
The Subgradient Method is NOT a descent method!
    x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}    for \alpha_k \ge 0 and g^{(k)} \in \partial f(x^{(k)})
    f^{(k)}_{best} = \min\{ f^{(k-1)}_{best}, f(x^{(k)}) \}
Line search is not performed; the step lengths \alpha_k are usually fixed ahead of time.
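A compact sketch of this iteration (assuming NumPy; the test objective f(x) = ||x||_1 and the 1/sqrt(k) step sizes are illustrative choices, not prescribed by the slides):

import numpy as np

def f(x):
    return np.sum(np.abs(x))      # nonsmooth test objective

def subgrad(x):
    return np.sign(x)             # one valid subgradient of ||x||_1

x = np.array([3.0, -2.0, 1.5])
f_best = f(x)
for k in range(1, 201):
    g = subgrad(x)
    alpha = 1.0 / np.sqrt(k)      # nonsummable diminishing step size
    x = x - alpha * g             # not necessarily a descent step
    f_best = min(f_best, f(x))    # keep the best value seen so far
print("f_best after 200 iterations:", f_best)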
Step Lengths
Commonly used step lengths
Constant step size: \alpha_k = h
Constant step length: \alpha_k = h / \|g^{(k)}\|_2 (so that \|x^{(k+1)} - x^{(k)}\|_2 = h)
Square summable but not summable step sizes:
    \alpha_k \ge 0,  \sum_{k=1}^{\infty} \alpha_k^2 < \infty,  \sum_{k=1}^{\infty} \alpha_k = \infty
Nonsummable diminishing:
    \alpha_k \ge 0,  \lim_{k \to \infty} \alpha_k = 0,  \sum_{k=1}^{\infty} \alpha_k = \infty
Nonsummable diminishing step lengths: \alpha_k = \gamma_k / \|g^{(k)}\|_2 with
    \gamma_k \ge 0,  \lim_{k \to \infty} \gamma_k = 0,  \sum_{k=1}^{\infty} \gamma_k = \infty
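These schedules are simple to express as callables that the loop above can query; a sketch (assuming NumPy; the constants are placeholders one would tune):

import numpy as np

def constant_size(h=0.01):
    return lambda k, g: h                              # alpha_k = h

def constant_length(h=0.01):
    return lambda k, g: h / np.linalg.norm(g)          # each step has Euclidean length h

def square_summable(a=1.0):
    return lambda k, g: a / k                          # sum (a/k)^2 < inf, sum a/k = inf

def nonsummable_diminishing(a=1.0):
    return lambda k, g: a / np.sqrt(k)                 # a/sqrt(k) -> 0, sum diverges

# usage inside the subgradient loop: alpha = rule(k, g)
rule = nonsummable_diminishing()
print(rule(100, np.array([3.0, 4.0])))                 # 0.1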
Convergence Result
Assume there exists G such that the norms of the subgradients are bounded, i.e. \|g^{(k)}\|_2 \le G for all k
(for example, when f is Lipschitz continuous).
Result:
    f^{(k)}_{best} - f^* \le \frac{ \mathrm{dist}(x^{(1)}, X^*)^2 + G^2 \sum_{i=1}^{k} \alpha_i^2 }{ 2 \sum_{i=1}^{k} \alpha_i }
The proof works by tracking how \|x^{(k)} - x^*\|_2 decreases.
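For reference, the key step of that proof (a standard reconstruction, not verbatim from the slides): the subgradient inequality gives f^* \ge f(x^{(i)}) + g^{(i)T}(x^* - x^{(i)}), so

    \|x^{(i+1)} - x^*\|_2^2 = \|x^{(i)} - \alpha_i g^{(i)} - x^*\|_2^2
                            \le \|x^{(i)} - x^*\|_2^2 - 2 \alpha_i \left( f(x^{(i)}) - f^* \right) + \alpha_i^2 G^2 .

Summing over i = 1, ..., k, discarding \|x^{(k+1)} - x^*\|_2^2 \ge 0, and bounding each f(x^{(i)}) - f^* below by f^{(k)}_{best} - f^* yields the stated bound.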
Convergence for Commonly Used Step Lengths
Constant step size: f^{(k)}_{best} converges to within G^2 h / 2 of optimal
Constant step length: f^{(k)}_{best} converges to within G h of optimal
Square summable but not summable step size: f^{(k)}_{best} \to f^*
Nonsummable diminishing: f^{(k)}_{best} \to f^*
Nonsummable diminishing step lengths: f^{(k)}_{best} \to f^*
Writing R = \mathrm{dist}(x^{(1)}, X^*), these follow from
    f^{(k)}_{best} - f^* \le \frac{ R^2 + G^2 \sum_{i=1}^{k} \alpha_i^2 }{ 2 \sum_{i=1}^{k} \alpha_i }
So, for a fixed iteration count k, the optimal \alpha_i are R / (G \sqrt{k}), and the method reaches accuracy \epsilon in (RG/\epsilon)^2 steps.
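The last claim is a short calculation (reconstructed here for illustration): with a constant step \alpha_i = \alpha the bound becomes

    f^{(k)}_{best} - f^* \le \frac{ R^2 + G^2 k \alpha^2 }{ 2 k \alpha },

which is minimized by \alpha = R / (G \sqrt{k}), giving f^{(k)}_{best} - f^* \le R G / \sqrt{k}; requiring R G / \sqrt{k} \le \epsilon gives k \ge (RG/\epsilon)^2.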
Variations
If the optimal value f^* is known (e.g. it is known to be 0 even though the minimizer is not), the Polyak step size can be used:
    \alpha_k = \frac{ f(x^{(k)}) - f^* }{ \|g^{(k)}\|_2^2 }
Projected subgradient, for minimizing f(x) s.t. x \in C:
    x^{(k+1)} = P_C\left( x^{(k)} - \alpha_k g^{(k)} \right)
Alternating projections: find a point in the intersection of two convex sets.
Heavy ball method:
    x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)} + \beta_k \left( x^{(k)} - x^{(k-1)} \right)
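A sketch of the projected subgradient update (assuming NumPy; the feasible set C, here the Euclidean unit ball, and the objective ||x - c||_1 are illustrative choices):

import numpy as np

c = np.array([2.0, -3.0])

def f(x):
    return np.sum(np.abs(x - c))

def subgrad(x):
    return np.sign(x - c)

def project_unit_ball(x):
    # Euclidean projection onto C = {x : ||x||_2 <= 1}
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

x = np.zeros(2)
for k in range(1, 501):
    g = subgrad(x)
    x = project_unit_ball(x - (1.0 / np.sqrt(k)) * g)
print("approximate constrained minimizer:", x, " f(x) =", f(x))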
Pros
Can immediately be applied to a wide variety of problems, especially when the required accuracy is not very high.
Low memory usage.
Often possible to design distributed methods if the objective is decomposable.
Cons
Slower than second-order methods.
Cutting Plane Method
Again, consider the problem: minimize f(x) subject to x \in C.
Construct an approximate model from the points visited so far:
    \check{f}(x) = \max_{i \in I} \left\{ f(x_i) + g_i^T (x - x_i) \right\}
Minimize the model over x, then evaluate f(x) and a subgradient g at the new point.
Update the model and repeat until the desired accuracy is reached.
Numerically unstable.
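A minimal sketch of the loop (assuming NumPy and a recent SciPy; the box C = [-2, 2]^2 and the test objective f(x) = ||x||_1 are illustrative choices). Each outer iteration minimizes the piecewise-linear model in epigraph form over (x, t) with scipy.optimize.linprog:

import numpy as np
from scipy.optimize import linprog

def f(x):
    return np.sum(np.abs(x))

def subgrad(x):
    return np.sign(x)

n, lo, hi = 2, -2.0, 2.0                 # C = [lo, hi]^n
pts, vals, grads = [], [], []
x = np.array([1.5, -1.0])                # starting point

for it in range(20):
    pts.append(x.copy()); vals.append(f(x)); grads.append(subgrad(x))
    # model minimization: min t  s.t.  f(x_i) + g_i^T (x - x_i) <= t,  x in C
    m = len(pts)
    A = np.hstack([np.array(grads), -np.ones((m, 1))])              # g_i^T x - t
    b = np.array([g @ p - v for p, v, g in zip(pts, vals, grads)])  # g_i^T x_i - f(x_i)
    c = np.zeros(n + 1); c[-1] = 1.0
    res = linprog(c, A_ub=A, b_ub=b,
                  bounds=[(lo, hi)] * n + [(None, None)], method="highs")
    x, model_val = res.x[:n], res.x[-1]
    if f(x) - model_val < 1e-6:          # model agrees with f at its minimizer: stop
        break
print("cutting-plane solution:", x, " f =", f(x))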
Moreau-Yosida Regularization
Idea: solve a series of smooth convex problems to minimize f(x).
    F(x) = \min_{y \in R^n} \left\{ f(y) + \frac{\lambda}{2} \|y - x\|^2 \right\}
    p(x) = \operatorname{argmin}_{y \in R^n} \left\{ f(y) + \frac{\lambda}{2} \|y - x\|^2 \right\}
F(x) is differentiable!
    \nabla F(x) = \lambda (x - p(x))
The minimization is done using the dual.
Cutting Plane Method + Moreau-Yosida Regularization = Bundle Methods
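For f(x) = |x| these objects have a closed form, the well-known soft-thresholding map. A small sketch (assuming NumPy; lam = 2 is an arbitrary choice) that also checks \nabla F(x) = \lambda (x - p(x)) against a numerical derivative:

import numpy as np

lam = 2.0

def f(x):
    return np.abs(x)

def p(x):
    # proximal point of |.| with parameter lam: soft-thresholding at 1/lam
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / lam, 0.0)

def F(x):
    # Moreau envelope: F(x) = f(p(x)) + lam/2 * (p(x) - x)^2
    y = p(x)
    return f(y) + 0.5 * lam * (y - x) ** 2

x, eps = 0.7, 1e-6
numeric_grad = (F(x + eps) - F(x - eps)) / (2 * eps)
print(numeric_grad, lam * (x - p(x)))    # the two values agree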
Elementary Bundle Method
As before, f is assumed to be Lipschitz continuous.
At a generic iteration we maintain a bundle of tuples
    \langle y_i, f(y_i), s_i, \alpha_i \rangle
with s_i \in \partial f(y_i) a subgradient and \alpha_i the corresponding linearization error.
Elementary Bundle Method
Follow the Cutting Plane Method, but apply M-Y regularization when minimizing the model:
    y_{k+1} = \operatorname{argmin}_{y \in R^n} \left\{ \check{f}_k(y) + \frac{\mu_k}{2} \|y - \hat{x}_k\|^2 \right\}
    \delta_k = f(\hat{x}_k) - \left[ \check{f}_k(y_{k+1}) + \frac{\mu_k}{2} \|y_{k+1} - \hat{x}_k\|^2 \right] \ge 0
If \delta_k < \epsilon, stop.
If f(\hat{x}_k) - f(y_{k+1}) \ge m \, \delta_k (for a fixed m \in (0, 1)): Serious Step, \hat{x}_{k+1} = y_{k+1}
else: Null Step, \hat{x}_{k+1} = \hat{x}_k
In both cases, update the model:
    \check{f}_{k+1}(y) = \max\left\{ \check{f}_k(y), \; f(y_{k+1}) + \langle s_{k+1}, y - y_{k+1} \rangle \right\}
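A compact sketch of this loop (assuming NumPy and SciPy; f(x) = ||x||_1, a fixed \mu_k = 1 and m = 0.1 are illustrative choices, and the proximal subproblem is solved in epigraph form with SLSQP rather than with a dual method as the slides suggest):

import numpy as np
from scipy.optimize import minimize

def f(x):
    return np.sum(np.abs(x))             # illustrative nonsmooth objective

def subgrad(x):
    return np.sign(x)

n, mu, m_par, eps = 2, 1.0, 0.1, 1e-6
xhat = np.array([2.0, -1.5])
ys, vals, grads = [xhat.copy()], [f(xhat)], [subgrad(xhat)]

for k in range(30):
    Y, V, S = np.array(ys), np.array(vals), np.array(grads)

    def model(y):
        # cutting-plane model: max_i f(y_i) + s_i^T (y - y_i)
        return np.max(V + np.sum(S * (y - Y), axis=1))

    # proximal subproblem over z = (y, t): min t + mu/2 ||y - xhat||^2, s.t. all cuts <= t
    obj = lambda z: z[-1] + 0.5 * mu * np.sum((z[:n] - xhat) ** 2)
    cons = {"type": "ineq",
            "fun": lambda z: z[-1] - (V + S @ z[:n] - np.sum(S * Y, axis=1))}
    z0 = np.concatenate([xhat, [f(xhat)]])
    z = minimize(obj, z0, constraints=[cons], method="SLSQP").x
    y_new = z[:n]

    delta = f(xhat) - (model(y_new) + 0.5 * mu * np.sum((y_new - xhat) ** 2))
    if delta < eps:
        break
    if f(xhat) - f(y_new) >= m_par * delta:   # serious step: move the center
        xhat = y_new
    # null step otherwise: keep xhat; in both cases enrich the bundle
    ys.append(y_new); vals.append(f(y_new)); grads.append(subgrad(y_new))

print("bundle method estimate:", xhat, " f =", f(xhat))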
Convergence
Either the algorithm makes a finite number of Serious Steps and afterwards only makes Null Steps:
    then, if k_0 is the last Serious Step and \mu_k is nondecreasing, \delta_k \to 0.
Or it makes an infinite number of Serious Steps:
    then, with K_s the set of Serious Step indices,
    \sum_{k \in K_s} \delta_k \le \frac{ f(\hat{x}_0) - f^* }{ m },  so \delta_k \to 0.
Variations
Replace \|y - x\|^2 by (y - x)^T M_k (y - x): still differentiable.
Conjugate gradient methods are obtained as a slight modification of the algorithm (see [5]).
Variable Metric Methods [10].
M_k = u_k I for Diagonal Variable Metric Methods.
Bundle-Newton Methods.
Summary
Nonsmooth convex optimization has been explored since the 1960s. The original subgradient methods were introduced by Naum Shor. Bundle methods have been developed more recently.
Subgradient methods are simple but slow, unless distributed, which is the predominant current application.
Bundle methods solve a bounded QP at each iteration, which is slow, but they need fewer iterations. They are preferred for applications where the oracle cost is high.
For Further Reading I
[1] Naum Z. Shor. Minimization Methods for Non-differentiable Functions. Springer-Verlag, 1985.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press.
[3] A. Ruszczynski. Nonlinear Optimization. Princeton University Press.
[4] Wikipedia. en.wikipedia.org/wiki/Subgradient_method
For Further Reading II
[5] Marko Makela. Survey of Bundle Methods, 2009. http://www.informaworld.com/smpp/content~db=all~content=a713741700
[6] Alexandre Belloni. An Introduction to Bundle Methods. http://web.mit.edu/belloni/www/LecturesIntroBundle.pdf
[7] John E. Mitchell. Cutting Plane and Subgradient Methods, 2005. http://www.optimization-online.org/DB_HTML/2009/05/2298.html
For Further Reading III
[8] Stephen Boyd. Lecture Notes on Subgradient Methods. http://www.stanford.edu/class/ee392o/subgrad_method.pdf
[9] Alexander J. Smola, S. V. N. Vishwanathan, Quoc V. Le. Bundle Methods for Machine Learning, 2007. http://books.nips.cc/papers/files/nips20/NIPS2007_0470.pdf
[10] C. Lemarechal. Variable Metric Bundle Methods, 1997. http://www.springerlink.com/index/3515WK428153171N.pdf
[11] Quoc Le, Alexander Smola. Direct Optimization of Ranking Measures, 2007. http://arxiv.org/abs/0704.3359
For Further Reading IV
[12] S. V. N. Vishwanathan, A. Smola. Quasi-Newton Methods for Efficient Large-Scale Machine Learning. http://portal.acm.org/ft_gateway.cfm?id=1390309&type=pdf and www.stat.purdue.edu/~vishy/talks/LBFGS.pdf