
Distributed Optimization via

Alternating Direction Method of Multipliers


Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato
Stanford University
ITMANET, Stanford, January 2011
Outline
precursors
dual decomposition
method of multipliers
alternating direction method of multipliers
applications/examples
conclusions/big picture
Dual problem
convex equality constrained optimization problem
minimize f(x)
subject to Ax = b
Lagrangian: L(x, y) = f(x) + y^T (Ax − b)
dual function: g(y) = inf_x L(x, y)
dual problem: maximize g(y)
recover x⋆ = argmin_x L(x, y⋆)
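As a worked example (not on the slides), take f(x) = (1/2)‖x‖_2^2; the dual can then be computed in closed form:

```latex
% dual of:  minimize (1/2)||x||_2^2  subject to  Ax = b
% the x-minimization in g(y) = inf_x L(x, y) gives x = -A^T y, so
\[
  g(y) = \inf_x \left( \frac{1}{2}\|x\|_2^2 + y^T (Ax - b) \right)
       = -\frac{1}{2}\, y^T A A^T y - b^T y ,
\]
% an unconstrained concave quadratic in y; maximizing it and setting
% x^\star = -A^T y^\star recovers the primal solution.
```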
Dual ascent
gradient method for dual problem: y^{k+1} = y^k + α^k ∇g(y^k)
∇g(y^k) = A x̃ − b, where x̃ = argmin_x L(x, y^k)
dual ascent method is
x^{k+1} := argmin_x L(x, y^k) // x-minimization
y^{k+1} := y^k + α^k (Ax^{k+1} − b) // dual update
works, with lots of strong assumptions
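A minimal numerical sketch of dual ascent (not from the slides) for a strictly convex quadratic f(x) = (1/2) x^T P x + q^T x, where the x-minimization has a closed form; the data, the step size rule, and the iteration count below are illustrative assumptions:

```python
import numpy as np

def dual_ascent(P, q, A, b, alpha, iters=200):
    """Dual ascent for:  minimize (1/2) x^T P x + q^T x  subject to  Ax = b."""
    y = np.zeros(A.shape[0])
    for _ in range(iters):
        # x-minimization: solve grad_x L = P x + q + A^T y = 0
        x = np.linalg.solve(P, -(q + A.T @ y))
        # dual update: gradient ascent on g(y) with step size alpha
        y = y + alpha * (A @ x - b)
    return x, y

# illustrative data (assumed): strictly convex quadratic, random constraints
rng = np.random.default_rng(0)
A, b = rng.standard_normal((3, 10)), rng.standard_normal(3)
P, q = np.eye(10), rng.standard_normal(10)
alpha = 1.0 / np.linalg.norm(A @ A.T, 2)   # safe step size for this quadratic
x, y = dual_ascent(P, q, A, b, alpha)
print(np.linalg.norm(A @ x - b))           # primal residual, should be near zero
```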
Dual decomposition
suppose f is separable:
f(x) = f_1(x_1) + · · · + f_N(x_N), x = (x_1, . . . , x_N)
then L is separable in x: L(x, y) = L_1(x_1, y) + · · · + L_N(x_N, y) − y^T b,
L_i(x_i, y) = f_i(x_i) + y^T A_i x_i
x-minimization in dual ascent splits into N separate minimizations
x_i^{k+1} := argmin_{x_i} L_i(x_i, y^k)
which can be carried out in parallel
Dual decomposition
dual decomposition (Everett, Dantzig, Wolfe, Benders 1960–65)
x_i^{k+1} := argmin_{x_i} L_i(x_i, y^k), i = 1, . . . , N
y^{k+1} := y^k + α^k ( ∑_{i=1}^N A_i x_i^{k+1} − b )
scatter y^k; update x_i in parallel; gather A_i x_i^{k+1}
solve a large problem by iteratively solving subproblems (in parallel)
dual variable update provides coordination
works, with lots of assumptions; often slow
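A sketch of one dual decomposition iteration for the separable case (the per-block solver interface `argmin_i` is an assumption, not part of the slides); the x-updates are independent, so in practice each would run on its own processor:

```python
import numpy as np

def dual_decomposition_step(blocks, y, alpha, b):
    """One iteration of dual decomposition.

    `blocks` is a list of (argmin_i, A_i) pairs, where argmin_i(y) is a
    hypothetical solver returning argmin_{x_i} f_i(x_i) + y^T A_i x_i.
    """
    # scatter y, then update each x_i (this loop is embarrassingly parallel)
    xs = [argmin_i(y) for argmin_i, _ in blocks]
    # gather the A_i x_i terms and take a dual gradient step
    residual = sum(A_i @ x_i for (_, A_i), x_i in zip(blocks, xs)) - b
    return xs, y + alpha * residual
```

Only y and the gathered residual are communicated; each block keeps its own data and variables local.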
Method of multipliers
a method to robustify dual ascent
use augmented Lagrangian (Hestenes, Powell 1969), ρ > 0
L_ρ(x, y) = f(x) + y^T (Ax − b) + (ρ/2)‖Ax − b‖_2^2
method of multipliers (Hestenes, Powell; analysis in Bertsekas 1982)
x^{k+1} := argmin_x L_ρ(x, y^k)
y^{k+1} := y^k + ρ (Ax^{k+1} − b)
(note specific dual update step length ρ)
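The same quadratic test problem as in the dual ascent sketch, now with the augmented Lagrangian; the closed-form x-update below is specific to that assumed quadratic f:

```python
import numpy as np

def method_of_multipliers(P, q, A, b, rho=1.0, iters=100):
    """Method of multipliers for:  minimize (1/2) x^T P x + q^T x  s.t.  Ax = b."""
    y = np.zeros(A.shape[0])
    for _ in range(iters):
        # x-update: solve grad_x L_rho = P x + q + A^T y + rho A^T (A x - b) = 0
        x = np.linalg.solve(P + rho * A.T @ A, rho * A.T @ b - q - A.T @ y)
        # dual update with step length exactly rho
        y = y + rho * (A @ x - b)
    return x, y
```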
Method of multipliers
good news: converges under much more relaxed conditions
(f can be nondifferentiable, take on value +∞, . . . )
bad news: quadratic penalty destroys splitting of the x-update, so can't do decomposition
Alternating direction method of multipliers
a method
with good robustness of method of multipliers
which can support decomposition
robust dual decomposition or decomposable method of multipliers
proposed by Gabay, Mercier, Glowinski, Marrocco in 1976
Alternating direction method of multipliers
ADMM problem form (with f, g convex)
minimize f(x) + g(z)
subject to Ax + Bz = c
two sets of variables, with separable objective
L_ρ(x, z, y) = f(x) + g(z) + y^T (Ax + Bz − c) + (ρ/2)‖Ax + Bz − c‖_2^2
ADMM:
x^{k+1} := argmin_x L_ρ(x, z^k, y^k) // x-minimization
z^{k+1} := argmin_z L_ρ(x^{k+1}, z, y^k) // z-minimization
y^{k+1} := y^k + ρ (Ax^{k+1} + Bz^{k+1} − c) // dual update
Alternating direction method of multipliers
if we minimized over x and z jointly, reduces to method of multipliers
instead, we do one pass of a Gauss-Seidel method
we get splitting since we minimize over x with z fixed, and vice versa
Convergence
assume (very little!)
f, g convex, closed, proper
L_0 has a saddle point
then ADMM converges:
iterates approach feasibility: Ax^k + Bz^k − c → 0
objective approaches optimal value: f(x^k) + g(z^k) → p⋆
Related algorithms
operator splitting methods
(Douglas, Peaceman, Rachford, Lions, Mercier, . . . 1950s, 1979)
proximal point algorithm (Rockafellar 1976)
Dykstra's alternating projections algorithm (1983)
Spingarn's method of partial inverses (1985)
Rockafellar-Wets progressive hedging (1991)
proximal methods (Rockafellar, many others, 1976–present)
Bregman iterative methods (2008–present)
most of these are special cases of the proximal point algorithm
Consensus optimization
want to solve problem with N objective terms
minimize ∑_{i=1}^N f_i(x)
e.g., f_i is the loss function for ith block of training data
ADMM form:
minimize ∑_{i=1}^N f_i(x_i)
subject to x_i − z = 0
x_i are local variables
z is the global variable
x_i − z = 0 are consistency or consensus constraints
Consensus optimization via ADMM
L_ρ(x, z, y) = ∑_{i=1}^N ( f_i(x_i) + y_i^T (x_i − z) + (ρ/2)‖x_i − z‖_2^2 )
ADMM:
x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + y_i^{kT} (x_i − z^k) + (ρ/2)‖x_i − z^k‖_2^2 )
z^{k+1} := (1/N) ∑_{i=1}^N ( x_i^{k+1} + (1/ρ) y_i^k )
y_i^{k+1} := y_i^k + ρ (x_i^{k+1} − z^{k+1})
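A sketch of these consensus updates; the assumed per-block interface `xmins[i](v, rho)` returns argmin_{x_i} f_i(x_i) + (ρ/2)‖x_i − v‖², which matches the x_i-update above once the linear term is folded in (v = z^k − (1/ρ)y_i^k):

```python
import numpy as np

def consensus_admm(xmins, n, rho=1.0, iters=100):
    """Global consensus ADMM for  minimize sum_i f_i(x)  via local copies x_i."""
    N = len(xmins)
    z = np.zeros(n)
    ys = [np.zeros(n) for _ in range(N)]
    for _ in range(iters):
        # x-updates: independent across blocks, so they can run in parallel
        xs = [xmins[i](z - ys[i] / rho, rho) for i in range(N)]
        # z-update: average of x_i + (1/rho) y_i
        z = sum(x + y / rho for x, y in zip(xs, ys)) / N
        # dual updates
        ys = [y + rho * (x - z) for x, y in zip(xs, ys)]
    return xs, z, ys
```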
Consensus optimization via ADMM
using ∑_{i=1}^N y_i^k = 0, algorithm simplifies to
x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + y_i^{kT} (x_i − x̄^k) + (ρ/2)‖x_i − x̄^k‖_2^2 )
y_i^{k+1} := y_i^k + ρ (x_i^{k+1} − x̄^{k+1})
where x̄^k = (1/N) ∑_{i=1}^N x_i^k
in each iteration
gather x_i^k and average to get x̄^k
scatter the average x̄^k to processors
update y_i^k locally (in each processor, in parallel)
update x_i locally
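The simplified form as a gather/scatter loop (a serial stand-in for the parallel processors, using the same assumed local solver interface as above):

```python
import numpy as np

def consensus_admm_averaged(xmins, n, rho=1.0, iters=100):
    """Simplified consensus ADMM: only the average xbar is communicated."""
    N = len(xmins)
    xs = [np.zeros(n) for _ in range(N)]
    ys = [np.zeros(n) for _ in range(N)]
    for _ in range(iters):
        xbar = sum(xs) / N                                            # gather and average
        # after scattering xbar, each processor updates locally:
        ys = [y + rho * (x - xbar) for x, y in zip(xs, ys)]           # dual update
        xs = [xmins[i](xbar - ys[i] / rho, rho) for i in range(N)]    # x-update
    return sum(xs) / N
```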
Statistical interpretation
f_i is negative log-likelihood for parameter x given ith data block
x_i^{k+1} is MAP estimate under prior N(x̄^k + (1/ρ) y_i^k, ρ^{−1} I)
prior mean is previous iteration's consensus shifted by price of
processor i disagreeing with previous consensus
processors only need to support a Gaussian MAP method
type or number of data in each block not relevant
consensus protocol yields global maximum-likelihood estimate
Consensus classification
data (examples) (a_i, b_i), i = 1, . . . , N, a_i ∈ R^n, b_i ∈ {−1, +1}
linear classifier sign(a^T w + v), with weight w, offset v
margin for ith example is b_i (a_i^T w + v); want margin to be positive
loss for ith example is l(b_i (a_i^T w + v))
l is loss function (hinge, logistic, probit, exponential, . . . )
choose w, v to minimize
(1/N) ∑_{i=1}^N l(b_i (a_i^T w + v)) + r(w)
r(w) is regularization term (ℓ_2, ℓ_1, . . . )
split data and use ADMM consensus to solve
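A sketch of a local solver that plugs into the consensus routines above, using the logistic loss (one of the losses listed) so a generic smooth solver applies; the block data names and the use of scipy.optimize.minimize are assumptions for illustration, and r(w) is omitted here (it could be split across the local objectives):

```python
import numpy as np
from scipy.optimize import minimize

def make_local_solver(A_block, b_block):
    """Local x-update for one data block: x = (w, v) stacks weights and offset."""
    def xmin(v_center, rho):
        def obj(x):
            w, v = x[:-1], x[-1]
            margins = b_block * (A_block @ w + v)
            loss = np.mean(np.logaddexp(0.0, -margins))   # logistic loss
            return loss + (rho / 2.0) * np.sum((x - v_center) ** 2)
        return minimize(obj, v_center).x
    return xmin

# usage sketch: split the examples into blocks, build one solver per block,
# and pass the list to the consensus ADMM routine above; the consensus
# variable z then converges to the shared classifier (w, v).
```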
Consensus SVM example
hinge loss l(u) = (1 − u)_+ with ℓ_2 regularization
baby problem with n = 2, N = 400 to illustrate
examples split into 20 groups, in worst possible way:
each group contains only positive or negative examples
Iteration 1
[plot omitted: consensus SVM example after iteration 1; horizontal axis −3 to 3, vertical axis −10 to 10]
Iteration 5
[plot omitted: consensus SVM example after iteration 5; horizontal axis −3 to 3, vertical axis −10 to 10]
Iteration 40
[plot omitted: consensus SVM example after iteration 40; horizontal axis −3 to 3, vertical axis −10 to 10]
Reference
Distributed Optimization and Statistical Learning via the Alternating
Direction Method of Multipliers (Boyd, Parikh, Chu, Peleato, Eckstein)
available at Boyd web site