Distributed Optimization via Alternating Direction Method of Multipliers
recover x⋆ = argmin_x L(x, y⋆)
ITMANET, Stanford, January 2011 2
Dual ascent
gradient method for dual problem: y^{k+1} = y^k + α^k ∇g(y^k)
∇g(y^k) = Ax̃ − b, where x̃ = argmin_x L(x, y^k)
dual ascent method is
x^{k+1} := argmin_x L(x, y^k)          // x-minimization
y^{k+1} := y^k + α^k (Ax^{k+1} − b)    // dual update
works, with lots of strong assumptions
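As a concrete sketch of the two steps above (my illustration, not from the talk): take the strongly convex problem minimize (1/2)‖x‖₂² subject to Ax = b, where the x-minimization has the closed form x = −Aᵀy; the data and step size α are arbitrary illustrative choices.

```python
import numpy as np

# dual ascent on: minimize (1/2)||x||^2  subject to Ax = b
# (f is strongly convex, so argmin_x L(x, y) = -A^T y in closed form)
np.random.seed(0)
A = np.random.randn(3, 5)   # illustrative problem data
b = np.random.randn(3)
alpha = 0.08                # constant step size alpha^k

y = np.zeros(3)
for k in range(5000):
    x = -A.T @ y                 # x^{k+1} := argmin_x L(x, y^k)
    y = y + alpha * (A @ x - b)  # y^{k+1} := y^k + alpha^k (A x^{k+1} - b)

print(np.linalg.norm(A @ x - b))  # primal residual, tends to 0
```

At convergence y is dual optimal and x = −Aᵀy recovers the primal solution; for merely convex or nonsmooth f this simple scheme can fail, which is what motivates the augmented-Lagrangian methods later in the talk.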
Dual decomposition
suppose f is separable:
f(x) = f_1(x_1) + ··· + f_N(x_N),   x = (x_1, . . . , x_N)
then L is separable in x: L(x, y) = L_1(x_1, y) + ··· + L_N(x_N, y) − yᵀb, where
L_i(x_i, y) = f_i(x_i) + yᵀA_i x_i
x-minimization in dual ascent splits into N separate minimizations
x_i^{k+1} := argmin_{x_i} L_i(x_i, y^k)
which can be carried out in parallel
Dual decomposition
dual decomposition (Everett, Dantzig, Wolfe, Benders 1960–65)
x_i^{k+1} := argmin_{x_i} L_i(x_i, y^k),   i = 1, . . . , N
y^{k+1} := y^k + α^k (Σ_{i=1}^N A_i x_i^{k+1} − b)
scatter y^k; update x_i in parallel; gather A_i x_i^{k+1}
solve a large problem by iteratively solving subproblems (in parallel)
dual variable update provides coordination
works, with lots of assumptions; often slow
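A minimal sketch of this scheme on a toy separable problem (all data illustrative, not from the talk): f_i(x_i) = (1/2)(x_i − c_i)² with scalar blocks and a single coupling constraint Σ_i x_i = b, so A_i = 1 and each x_i-update is independent.

```python
import numpy as np

# dual decomposition on: minimize sum_i (1/2)(x_i - c_i)^2  s.t.  sum_i x_i = b
np.random.seed(1)
N = 10
c = np.random.randn(N)  # illustrative data
b = 2.0
alpha = 0.05            # step size alpha^k

y = 0.0
for k in range(200):
    x = c - y                      # N independent x_i-minimizations (parallelizable)
    y = y + alpha * (x.sum() - b)  # gather A_i x_i^{k+1}, then dual update

print(abs(x.sum() - b))  # coupling-constraint residual
```

The scalar dual variable y acts as a price that coordinates the otherwise independent blocks.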
Method of multipliers
a method to robustify dual ascent
use augmented Lagrangian (Hestenes, Powell 1969), ρ > 0
L_ρ(x, y) = f(x) + yᵀ(Ax − b) + (ρ/2)‖Ax − b‖₂²
method of multipliers (Hestenes, Powell; analysis in Bertsekas 1982)
x^{k+1} := argmin_x L_ρ(x, y^k)
y^{k+1} := y^k + ρ(Ax^{k+1} − b)
(note specific dual update step length ρ)
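A numerical sketch of the iteration above (problem data and ρ are my illustrative choices): with f(x) = (1/2)‖x‖₂² + qᵀx, the x-update has the closed form (I + ρAᵀA)x = −q − Aᵀy + ρAᵀb.

```python
import numpy as np

# method of multipliers on: minimize (1/2)||x||^2 + q^T x  subject to Ax = b
np.random.seed(0)
A = np.random.randn(3, 5)   # illustrative problem data
b = np.random.randn(3)
q = np.random.randn(5)
rho = 10.0                  # penalty parameter

y = np.zeros(3)
for k in range(200):
    # x^{k+1} := argmin_x L_rho(x, y^k), available in closed form for this f
    x = np.linalg.solve(np.eye(5) + rho * A.T @ A,
                        -q - A.T @ y + rho * A.T @ b)
    y = y + rho * (A @ x - b)   # dual update with the specific step length rho

print(np.linalg.norm(A @ x - b))  # primal residual
```

Note the dual step length is exactly ρ, matching the penalty parameter in L_ρ.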
Method of multipliers
good news: converges under much more relaxed conditions
(f can be nondifferentiable, take on value +∞, . . . )
bad news: quadratic penalty destroys splitting of the x-update, so can't do decomposition
Alternating direction method of multipliers
a method
with good robustness of method of multipliers
which can support decomposition
"robust dual decomposition" or "decomposable method of multipliers"
proposed by Gabay, Mercier, Glowinski, Marrocco in 1976
Alternating direction method of multipliers
ADMM problem form (with f, g convex)
minimize f(x) + g(z)
subject to Ax + Bz = c
two sets of variables, with separable objective
ADMM:
x^{k+1} := argmin_x L_ρ(x, z^k, y^k)            // x-minimization
z^{k+1} := argmin_z L_ρ(x^{k+1}, z, y^k)        // z-minimization
y^{k+1} := y^k + ρ(Ax^{k+1} + Bz^{k+1} − c)     // dual update
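One standard problem that fits this template is the lasso (this instance is my illustration, not an example from the talk): f(x) = (1/2)‖Dx − d‖₂², g(z) = λ‖z‖₁, with A = I, B = −I, c = 0, so the z-update reduces to soft thresholding.

```python
import numpy as np

# ADMM for the lasso: minimize (1/2)||Dx - d||^2 + lam*||z||_1  s.t.  x - z = 0
np.random.seed(0)
D = np.random.randn(20, 8)   # illustrative data
d = np.random.randn(20)
lam, rho = 1.0, 1.0

x, z, y = (np.zeros(8) for _ in range(3))
DtD, Dtd = D.T @ D, D.T @ d
for k in range(300):
    x = np.linalg.solve(DtD + rho * np.eye(8), Dtd + rho * z - y)  # x-minimization
    v = x + y / rho
    z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)        # z-min: soft threshold
    y = y + rho * (x - z)                                          # dual update

print(np.linalg.norm(x - z))  # feasibility residual x - z -> 0
```

Only the z-update "sees" the nonsmooth ℓ₁ term, which is the point of the splitting: each half-step stays easy.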
Alternating direction method of multipliers
if we minimized over x and z jointly, reduces to method of multipliers
instead, we do one pass of a Gauss-Seidel method
we get splitting since we minimize over x with z fixed, and vice versa
Convergence
assume (very little!)
f, g convex, closed, proper
L_0 has a saddle point
then ADMM converges:
iterates approach feasibility: Ax^k + Bz^k − c → 0
objective approaches optimal value: f(x^k) + g(z^k) → p⋆
Consensus optimization
want to solve a problem with N objective terms
minimize Σ_{i=1}^N f_i(x)
e.g., f_i is the loss function for ith block of training data
ADMM form:
minimize Σ_{i=1}^N f_i(x_i)
subject to x_i − z = 0
x_i are local variables
z is the global variable
x_i − z = 0 are consistency or consensus constraints
Consensus optimization via ADMM
L_ρ(x, z, y) = Σ_{i=1}^N ( f_i(x_i) + y_iᵀ(x_i − z) + (ρ/2)‖x_i − z‖₂² )
ADMM:
x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + (y_i^k)ᵀ(x_i − z^k) + (ρ/2)‖x_i − z^k‖₂² )
z^{k+1} := (1/N) Σ_{i=1}^N ( x_i^{k+1} + (1/ρ) y_i^k )
y_i^{k+1} := y_i^k + ρ(x_i^{k+1} − z^{k+1})
Consensus optimization via ADMM
using Σ_{i=1}^N y_i^k = 0, algorithm simplifies to
x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + (y_i^k)ᵀ(x_i − x̄^k) + (ρ/2)‖x_i − x̄^k‖₂² )
y_i^{k+1} := y_i^k + ρ(x_i^{k+1} − x̄^{k+1})
where x̄^k = (1/N) Σ_{i=1}^N x_i^k
in each iteration
gather x_i^k and average to get x̄^k
scatter the average x̄^k to processors
update y_i^k locally (in each processor, in parallel)
update x_i locally
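A sketch of this simplified consensus loop with local least-squares terms f_i(x_i) = (1/2)‖A_i x_i − b_i‖₂² (all data and ρ are illustrative choices); the loop over i is what would run in parallel, one block per processor.

```python
import numpy as np

# consensus ADMM (simplified form): N local least-squares objectives, one global x
np.random.seed(0)
N, n = 5, 3
As = [np.random.randn(10, n) for _ in range(N)]  # illustrative local data blocks
bs = [np.random.randn(10) for _ in range(N)]
rho = 1.0

xs = np.zeros((N, n))   # local variables x_i
ys = np.zeros((N, n))   # local dual variables y_i (sum starts at 0, stays 0)
xbar = np.zeros(n)      # consensus average
for k in range(500):
    for i in range(N):  # local x_i-updates: independent, parallelizable
        xs[i] = np.linalg.solve(As[i].T @ As[i] + rho * np.eye(n),
                                As[i].T @ bs[i] - ys[i] + rho * xbar)
    xbar = xs.mean(axis=0)     # gather x_i^{k+1} and average
    ys += rho * (xs - xbar)    # local dual updates against the new average

print(np.linalg.norm(xs - xbar))  # disagreement with consensus -> 0
```

Here xbar converges to the solution of the centralized problem minimize Σ_i (1/2)‖A_i x − b_i‖₂², even though each data block (A_i, b_i) stays on its own processor.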
Statistical interpretation
f_i is negative log-likelihood for parameter x given ith data block
x_i^{k+1} is MAP estimate under prior N(x̄^k + (1/ρ)y_i^k, ρI)
prior mean is previous iteration's consensus shifted by "price" of processor i disagreeing with previous consensus
processors only need to support a Gaussian MAP method
type or number of data in each block not relevant
consensus protocol yields global maximum-likelihood estimate
Consensus classification
data (examples) (a_i, b_i), i = 1, . . . , N, a_i ∈ R^n, b_i ∈ {−1, +1}
linear classifier sign(aᵀw + v), with weight w, offset v
margin for ith example is b_i(a_iᵀw + v); want margin to be positive
loss for ith example is l(b_i(a_iᵀw + v))
l is loss function (hinge, logistic, probit, exponential, . . . )
choose w, v to minimize
(1/N) Σ_{i=1}^N l(b_i(a_iᵀw + v)) + r(w)
r(w) is regularization term (ℓ₂, ℓ₁, . . . )
split data and use ADMM consensus to solve
Consensus SVM example
hinge loss l(u) = (1 − u)₊ with ℓ₂ regularization
baby problem with n = 2, N = 400 to illustrate
examples split into 20 groups, in worst possible way:
each group contains only positive or negative examples
Iteration 1
[figure: consensus SVM iterate after iteration 1]
Iteration 5
[figure: consensus SVM iterate after iteration 5]
Iteration 40
[figure: consensus SVM iterate after iteration 40]
Reference
"Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers" (Boyd, Parikh, Chu, Peleato, Eckstein)
available at Boyd web site