
Parallel Coordinate Descent for L1 -Regularized Loss Minimization

Joseph K. Bradley † jkbradle@cs.cmu.edu


Aapo Kyrola † akyrola@cs.cmu.edu
Danny Bickson bickson@cs.cmu.edu
Carlos Guestrin guestrin@cs.cmu.edu
Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213 USA

† These authors contributed equally to this work.

Abstract

We propose Shotgun, a parallel coordinate descent algorithm for minimizing L1-regularized losses. Though coordinate descent seems inherently sequential, we prove convergence bounds for Shotgun which predict near-linear speedups, up to a problem-dependent limit. We present a comprehensive empirical study of Shotgun for Lasso and sparse logistic regression. Our theoretical predictions on the potential for parallelism closely match behavior on real data. Shotgun outperforms other published solvers on a range of large problems, proving to be one of the most scalable algorithms for L1.

1. Introduction

Many applications use L1-regularized models such as the Lasso (Tibshirani, 1996) and sparse logistic regression (Ng, 2004). L1 regularization biases learning towards sparse solutions, and it is especially useful for high-dimensional problems with large numbers of features. For example, in logistic regression, it allows sample complexity to scale logarithmically w.r.t. the number of irrelevant features (Ng, 2004).

Much effort has been put into developing optimization algorithms for L1 models. These algorithms range from coordinate minimization (Fu, 1998) and stochastic gradient (Shalev-Shwartz & Tewari, 2009) to more complex interior point methods (Kim et al., 2007).

Coordinate descent, which we call Shooting after Fu (1998), is a simple but very effective algorithm which updates one coordinate per iteration. It often requires no tuning of parameters, unlike, e.g., stochastic gradient. As we discuss in Sec. 2, theory (Shalev-Shwartz & Tewari, 2009) and extensive empirical results (Yuan et al., 2010) have shown that variants of Shooting are particularly competitive for high-dimensional data.

The need for scalable optimization is growing as more applications use high-dimensional data, but processor core speeds have stopped increasing in recent years. Instead, computers come with more cores, and the new challenge is utilizing them efficiently. Yet despite the many sequential optimization algorithms for L1-regularized losses, few parallel algorithms exist.

Some algorithms, such as interior point methods, can benefit from parallel matrix-vector operations. However, we found empirically that such algorithms were often outperformed by Shooting.

Recent work analyzes parallel stochastic gradient descent for multicore (Langford et al., 2009b) and distributed settings (Mann et al., 2009; Zinkevich et al., 2010). These methods parallelize over samples. In applications using L1 regularization, though, there are often many more features than samples, so parallelizing over samples may be of limited utility.

We therefore take an orthogonal approach and parallelize over features, with a remarkable result: we can parallelize coordinate descent—an algorithm which seems inherently sequential—for L1-regularized losses. In Sec. 3, we propose Shotgun, a simple multicore algorithm which makes P coordinate updates in parallel. We prove strong convergence bounds for Shotgun which predict speedups over Shooting which are near-linear in P, up to a problem-dependent optimum P∗. Moreover, our theory provides an estimate for this ideal P∗ which may be easily computed from the data.

Parallel coordinate descent was also considered by Tsitsiklis et al. (1986), but for differentiable objectives in the asynchronous setting. They give a very general analysis, proving asymptotic convergence but not convergence rates. We are able to prove rates and theoretical speedups for our class of objectives.

In Sec. 4, we compare multicore Shotgun with five state-of-the-art algorithms on 35 real and synthetic datasets. The results show that in large problems Shotgun outperforms the other algorithms. Our experiments also validate the theoretical predictions by showing that Shotgun requires only about 1/P as many iterations as Shooting. We measure the parallel speedup in running time and analyze the limitations imposed by the multicore hardware.

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).
2. L1-Regularized Loss Minimization

We consider optimization problems of the form

    min_{x ∈ R^d} F(x) = Σ_{i=1}^n L(a_i^T x, y_i) + λ‖x‖_1,    (1)

where L(·) is a non-negative convex loss. Each of n samples has a feature vector a_i ∈ R^d and observation y_i (where y ∈ Y^n). x ∈ R^d is an unknown vector of weights for features. λ ≥ 0 is a regularization parameter. Let A ∈ R^{n×d} be the design matrix, whose ith row is a_i. Assume w.l.o.g. that columns of A are normalized s.t. diag(A^T A) = 1.^1

^1 Normalizing A does not change the objective if a separate, normalized λ_j is used for each x_j.

An instance of (1) is the Lasso (Tibshirani, 1996) (in penalty form), for which Y ≡ R and

    F(x) = (1/2) ‖Ax − y‖_2^2 + λ‖x‖_1,    (2)

as well as sparse logistic regression (Ng, 2004), for which Y ≡ {−1, +1} and

    F(x) = Σ_{i=1}^n log(1 + exp(−y_i a_i^T x)) + λ‖x‖_1.    (3)

For analysis, we follow Shalev-Shwartz and Tewari (2009) and transform (1) into an equivalent problem with a twice-differentiable regularizer. We let x̂ ∈ R^{2d}_+, use duplicated features â_i = [a_i; −a_i] ∈ R^{2d}, and solve

    min_{x̂ ∈ R^{2d}_+} Σ_{i=1}^n L(â_i^T x̂, y_i) + λ Σ_{j=1}^{2d} x̂_j.    (4)

If x̂ ∈ R^{2d}_+ minimizes (4), then x: x_i = x̂_{d+i} − x̂_i minimizes (1). Though our analysis uses duplicate features, they are not needed for an implementation.

2.1. Sequential Coordinate Descent

Shalev-Shwartz and Tewari (2009) analyze Stochastic Coordinate Descent (SCD), a stochastic version of Shooting for solving (1). SCD (Alg. 1) randomly chooses one weight x_j to update per iteration. It computes the update x_j ← x_j + δx_j via

    δx_j = max{−x_j, −(∇F(x))_j / β},    (5)

where β > 0 is a loss-dependent constant.

Algorithm 1 Shooting: Sequential SCD
  Set x = 0 ∈ R^{2d}_+.
  while not converged do
    Choose j ∈ {1, . . . , 2d} uniformly at random.
    Set δx_j ← max{−x_j, −(∇F(x))_j / β}.
    Update x_j ← x_j + δx_j.
  end while

To our knowledge, Shalev-Shwartz and Tewari (2009) provide the best known convergence bounds for SCD. Their analysis requires a uniform upper bound on the change in the loss F(x) from updating a single weight:

Assumption 2.1. Let F(x): R^{2d}_+ → R be a convex function. Assume there exists β > 0 s.t., for all x and single-weight updates δx_j, we have:

    F(x + (δx_j) e_j) ≤ F(x) + δx_j (∇F(x))_j + (β/2)(δx_j)^2,

where e_j is a unit vector with 1 in its jth entry. For the losses in (2) and (3), Taylor expansions give

    β = 1 (squared loss) and β = 1/4 (logistic loss).    (6)

Using this bound, they prove the following theorem.

Theorem 2.1. (Shalev-Shwartz & Tewari, 2009) Let x∗ minimize (4) and x^(T) be the output of Alg. 1 after T iterations. If F(x) satisfies Assumption 2.1, then

    E[F(x^(T)) − F(x∗)] ≤ d(β‖x∗‖_2^2 + 2F(x^(0))) / (T + 1),    (7)

where E[·] is w.r.t. the random choices of weights j.

As Shalev-Shwartz and Tewari (2009) argue, Theorem 2.1 indicates that SCD scales well in the dimensionality d of the data. For example, it achieves better runtime bounds w.r.t. d than stochastic gradient methods such as SMIDAS (Shalev-Shwartz & Tewari, 2009) and truncated gradient (Langford et al., 2009a).
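To make update (5) concrete, the following NumPy sketch simulates Alg. 1 for the Lasso (2) on the duplicated-feature formulation (4), with β = 1 as in (6). It is an illustrative re-implementation under names of our own choosing (shooting_scd and its arguments are not from the paper), and it omits the practical refinements described later in Sec. 4.1.1.

    import numpy as np

    def shooting_scd(A, y, lam, iters, seed=0):
        # Minimal simulation of Alg. 1 (SCD / Shooting) for the Lasso (2),
        # written on the duplicated-feature formulation (4).
        rng = np.random.default_rng(seed)
        n, d = A.shape
        A2 = np.hstack([A, -A])                  # duplicated features [a_i; -a_i]
        x = np.zeros(2 * d)                      # iterate in R^{2d}, kept >= 0
        Ax = np.zeros(n)                         # maintain A2 @ x incrementally
        beta = 1.0                               # squared loss => beta = 1, Eq. (6)
        for _ in range(iters):
            j = rng.integers(2 * d)              # uniform random weight
            grad_j = A2[:, j] @ (Ax - y) + lam   # (grad F(x))_j for problem (4)
            delta = max(-x[j], -grad_j / beta)   # update rule (5)
            if delta != 0.0:
                x[j] += delta
                Ax += delta * A2[:, j]
        return x[:d] - x[d:]                     # signed weights for this encoding

    # Toy usage on a synthetic problem with normalized columns (diag(A^T A) = 1).
    rng = np.random.default_rng(1)
    A = rng.standard_normal((100, 20))
    A /= np.linalg.norm(A, axis=0)
    y = A[:, :3] @ np.array([1.0, -2.0, 3.0]) + 0.01 * rng.standard_normal(100)
    w = shooting_scd(A, y, lam=0.1, iters=20000)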
3. Parallel Coordinate Descent

As the dimensionality d or sample size n increase, even fast sequential algorithms become expensive. To scale to larger problems, we turn to parallel computation.

In this section, we present our main theoretical contribution: we show coordinate descent can be parallelized by proving strong convergence bounds.
Algorithm 2 Shotgun: Parallel SCD
  Choose number of parallel updates P ≥ 1.
  Set x = 0 ∈ R^{2d}_+.
  while not converged do
    Choose random subset of P weights in {1, . . . , 2d}.
    In parallel on P processors
      Get assigned weight j.
      Set δx_j ← max{−x_j, −(∇F(x))_j / β}.
      Update x_j ← x_j + δx_j.
  end while

Figure 1. Intuition for parallel coordinate descent. Contour plots of two objectives, with darker meaning better. Left: Features are uncorrelated; parallel updates are useful. Right: Features are correlated; parallel updates conflict.

We parallelize stochastic Shooting and call our algorithm Shotgun (Alg. 2). Shotgun initially chooses P, the number of weights to update in parallel. On each iteration, it chooses a subset of P weights from {1, . . . , 2d} uniformly at random from the possible combinations; these form a set P_t.^2 It updates each x_{i_j}: i_j ∈ P_t, in parallel using the same update as Shooting (5). Let ∆x be the collective update to x, i.e., (∆x)_k = Σ_{i_j ∈ P_t: k = i_j} δx_{i_j}.

Intuitively, parallel updates might increase the risk of divergence. In Fig. 1, in the left subplot, parallel updates speed up convergence since features are uncorrelated; in the right subplot, parallel updates of correlated features risk increasing the objective. We can avoid divergence by imposing a step size, but our experiments showed that approach to be impractical.^3
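As an illustration of Alg. 2, the sketch below performs one synchronous Shotgun iteration for the Lasso on formulation (4): all P updates are computed from the same snapshot of x and then applied together, mirroring the setting analyzed in Sec. 3.1 rather than the asynchronous C++ implementation of Sec. 4.1.1. The names (shotgun_iteration, A2, Ax) and the seed argument are ours.

    import numpy as np

    def shotgun_iteration(A2, y, x, Ax, lam, P, beta=1.0, seed=None):
        # One synchronous iteration of Alg. 2 for the Lasso on formulation (4).
        # x (length 2d) and Ax = A2 @ x are modified in place.
        rng = np.random.default_rng(seed)
        Pt = rng.choice(x.shape[0], size=P, replace=False)  # random set of P weights
        grad = A2[:, Pt].T @ (Ax - y) + lam                 # (grad F(x))_j for j in Pt
        delta = np.maximum(-x[Pt], -grad / beta)            # update rule (5), per weight
        x[Pt] += delta                                      # apply the collective update
        Ax += A2[:, Pt] @ delta                             # keep Ax consistent with x
        return x, Ax

With P = 1 this reduces to a single Shooting step; larger P increases the risk of interference between correlated features, which Theorem 3.1 below quantifies.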
We formalize this intuition for the Lasso in Theorem 3.1. We can separate a sequential progress term (summing the improvement from separate updates) from a term measuring interference between parallel updates. If A^T A were normalized and centered to be a covariance matrix, the elements in the interference term's sum would be non-zero only for correlated variables, matching our intuition from Fig. 1. Harmful interference could occur when, e.g., δx_i, δx_j > 0 and features i, j were positively correlated.

Theorem 3.1. Fix x. If ∆x is the collective update to x in one iteration of Alg. 2 for the Lasso, then

    F(x + ∆x) − F(x) ≤ −(1/2) Σ_{i_j ∈ P_t} (δx_{i_j})^2 + (1/2) Σ_{i_j, i_k ∈ P_t, j ≠ k} (A^T A)_{i_j, i_k} δx_{i_j} δx_{i_k},

where the first sum is the sequential progress term and the second is the interference term.

Proof Sketch:^4 Write the Taylor expansion of F around x. Bound the first-order term using (5). □

In the next section, we show that this intuition holds for the more general optimization problem in (1).

3.1. Shotgun Convergence Analysis

In this section, we present our convergence result for Shotgun. The result provides a problem-specific measure of the potential for parallelization: the spectral radius ρ of A^T A (i.e., the maximum of the magnitudes of eigenvalues of A^T A). Moreover, this measure is prescriptive: ρ may be estimated via, e.g., power iteration^5 (Strang, 1988), and it provides a plug-in estimate of the ideal number of parallel updates.

We begin by generalizing Assumption 2.1 to our parallel setting. The scalars β for Lasso and logistic regression remain the same as in (6).

Assumption 3.1. Let F(x): R^{2d}_+ → R be a convex function. Assume that there exists β > 0 such that, for all x and parallel updates ∆x, we have

    F(x + ∆x) ≤ F(x) + ∆x^T ∇F(x) + (β/2) ∆x^T A^T A ∆x.

We now state our main result, generalizing the convergence bound in Theorem 2.1 to the Shotgun algorithm.

Theorem 3.2. Let x∗ minimize (4) and x^(T) be the output of Alg. 2 after T iterations with P parallel updates/iteration. Let ρ be the spectral radius of A^T A. If F(x) satisfies Assumption 3.1 and P is s.t. ε = (P − 1)(ρ − 1)/(2d − 1) < 1, then

    E[F(x^(T)) − F(x∗)] ≤ d(β‖x∗‖_2^2 + (2/(1 − ε)) F(x^(0))) / ((T + 1) P),

where E[·] is w.r.t. the random choices of weights to update. Choosing a near-optimal P∗ ≈ d/ρ gives

    E[F(x^(T)) − F(x∗)] ≲ ρ(β‖x∗‖_2^2 + 4F(x^(0))) / (T + 1).

^2 In the supplement, we also analyze choosing weights independently to form a multiset, which gives worse bounds.
^3 A step size of 1/P ensures convergence since F is convex in x, but it results in very small steps and long runtimes.
^4 We include detailed proofs of all theorems and lemmas in the supplementary material.
^5 For our datasets, power iteration gave reasonable estimates within a small fraction of the total runtime.
The choice P∗ is near-optimal since it is within a factor 2 of the maximum P s.t. ε < 1 and since it sets ε ≈ 1/2 in the bound (so ε is a small constant). Without duplicated features, Theorem 3.2 predicts that we can do up to P ≤ d/(2ρ) parallel updates and achieve speedups almost linear in P. For an ideal problem with uncorrelated features, ρ = 1, so we could do up to P∗ = d parallel updates. For a pathological problem with exactly correlated features, ρ = d, so our theorem tells us that we could not do parallel updates. With P = 1, we recover the result for Shooting in Theorem 2.1.
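To make the plug-in estimate concrete, the following sketch (ours, not the paper's code) estimates ρ by power iteration using only products with A and A^T, and then forms the un-duplicated estimate P∗ ≈ d/(2ρ) used for Fig. 2; the iteration count and the choice of a fixed number of iterations are arbitrary assumptions of this sketch.

    import numpy as np

    def estimate_rho(A, iters=200, seed=0):
        # Power iteration (Strang, 1988) for the spectral radius of A^T A,
        # using only matrix-vector products with A and A^T.
        rng = np.random.default_rng(seed)
        z = rng.standard_normal(A.shape[1])
        z /= np.linalg.norm(z)
        for _ in range(iters):
            w = A.T @ (A @ z)             # one application of A^T A
            z = w / np.linalg.norm(w)
        return z @ (A.T @ (A @ z))        # Rayleigh quotient at the top eigenvector

    # Plug-in estimate for a column-normalized design matrix A (cf. Fig. 2):
    # rho = estimate_rho(A); P_star = max(1, int(A.shape[1] / (2 * rho)))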
To prove Theorem 3.2, we first bound the negative impact of interference between parallel updates.

Lemma 3.3. Fix x. Let ε = (P − 1)(ρ − 1)/(2d − 1), where P is s.t. ε < 1. Under the assumptions and definitions from Theorem 3.2, if ∆x is the collective update to x in one iteration of Alg. 2, then

    E_{P_t}[F(x + ∆x) − F(x)] ≤ P E_j[δx_j (∇F(x))_j + (β/2)(1 + ε)(δx_j)^2],

where E_{P_t} is w.r.t. a random choice of P_t and E_j is w.r.t. choosing j ∈ {1, . . . , 2d} uniformly at random.

Proof Sketch: Take the expectation w.r.t. P_t of the inequality in Assumption 3.1:

    E_{P_t}[F(x + ∆x) − F(x)] ≤ E_{P_t}[∆x^T ∇F(x) + (β/2) ∆x^T A^T A ∆x].    (8)

Rewrite the expectation using our independent choices of i_j ∈ P_t, separating out the diagonal second-order terms. (Here, δx_j is the update given by (5), regardless of whether j ∈ P_t.)

    = P E_j[δx_j (∇F(x))_j + (β/2)(δx_j)^2] + (β/2) P(P − 1) E_{i,j: i ≠ j}[δx_i (A^T A)_{i,j} δx_j].    (9)

Upper-bound the second expectation in terms of E_j[(δx_j)^2] by expressing the spectral radius ρ of A^T A as ρ = max_{z: z^T z = 1} z^T (A^T A) z. Regroup terms to get the lemma's result. □

Proof Sketch (Theorem 3.2): Our proof resembles Shalev-Shwartz and Tewari (2009)'s proof of Theorem 2.1. We use a modified potential function Ψ(x) = (β/2)‖x − x∗‖_2^2 + (1/(1 − ε)) F(x), and the result from Lemma 3.3 replaces Assumption 2.1. □

Our analysis implicitly assumes that parallel updates of the same weight x_j will not make x_j negative. Proper write-conflict resolution can ensure this assumption holds and is viable in our multicore setting.

3.2. Theory vs. Empirical Performance

We end this section by comparing the predictions of Theorem 3.2 about the number of parallel updates P with empirical performance for Lasso. We exactly simulated Shotgun as in Alg. 2 to eliminate effects from the practical implementation choices made in Sec. 4. We tested two single-pixel camera datasets from Duarte et al. (2008) with very different ρ, estimating E_{P_t}[F(x^(T))] by averaging 10 runs of Shotgun. We used λ = 0.5 for Ball64_singlepixcam to get x∗ with about 27% non-zeros; we used λ = 0.05 for Mug32_singlepixcam to get about 20% non-zeros.

Figure 2. Theory for Shotgun's P (Theorem 3.2) vs. empirical performance for Lasso on two datasets (left: Ball64_singlepixcam, d = 4096, ρ = 2047.8, P∗ = 1; right: Mug32_singlepixcam, d = 1024, ρ = 6.4967, P∗ = 79; X-axis: P (parallel updates), Y-axis: T (iterations)). Y-axis has iterations T until E_{P_t}[F(x^(T))] came within 0.5% of F(x∗). Thick red lines trace T for increasing P (until too large P caused divergence). Vertical lines mark P∗. Dotted diagonal lines show optimal (linear) speedups (partly hidden by solid line in right-hand plot).

Fig. 2 plots P versus the iterations T required for E_{P_t}[F(x^(T))] to come within 0.5% of the optimum F(x∗). Theorem 3.2 predicts that T should decrease as 1/P as long as P ≤ P∗ ≈ d/(2ρ). The empirical behavior follows this theory: using the predicted P∗ gives almost optimal speedups, and speedups are almost linear in P. As P exceeds P∗, Shotgun soon diverges.

Fig. 2 confirms Theorem 3.2's result: Shooting, a seemingly sequential algorithm, can be parallelized and achieve near-linear speedups, and the spectral radius of A^T A succinctly captures the potential for parallelism in a problem. To our knowledge, our convergence results are the first for parallel coordinate descent for L1-regularized losses, and they apply to any convex loss satisfying Assumption 3.1. Though Fig. 2 ignores certain implementation issues, we show in the next section that Shotgun performs well in practice.

3.3. Beyond L1

Theorems 2.1 and 3.2 generalize beyond L1, for their main requirements (Assumptions 2.1, 3.1) apply to a more general class of problems: min F(x) s.t. x ≥ 0, where F(x) is smooth. We discuss Shooting and Shotgun for sparse regression since both the method (coordinate descent) and problem (sparse regression) are arguably most useful for high-dimensional settings.
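As a small illustration of this remark (our example, not the paper's), the same update rule (5), with the L1 term dropped, performs coordinate descent for a smooth nonnegatively-constrained problem such as nonnegative least squares:

    import numpy as np

    def scd_nonnegative_ls(A, y, iters, seed=0):
        # Coordinate descent for min_{x >= 0} 0.5 * ||Ax - y||^2 using update (5)
        # with beta = 1 (columns of A normalized); no feature duplication needed.
        rng = np.random.default_rng(seed)
        n, d = A.shape
        x = np.zeros(d)
        Ax = np.zeros(n)
        for _ in range(iters):
            j = rng.integers(d)
            grad_j = A[:, j] @ (Ax - y)    # gradient of the smooth objective
            delta = max(-x[j], -grad_j)    # update (5): never leaves x >= 0
            x[j] += delta
            Ax += delta * A[:, j]
        return x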
4. Experimental Results

We present an extensive study of Shotgun for the Lasso and sparse logistic regression. On a wide variety of datasets, we compare Shotgun with published state-of-the-art solvers. We also analyze self-speedup in detail in terms of Theorem 3.2 and hardware issues.

4.1. Lasso

We tested Shooting and Shotgun for the Lasso against five published Lasso solvers on 35 datasets. We summarize the results here; details are in the supplement.

4.1.1. Implementation: Shotgun

Our implementation made several practical improvements to the basic Shooting and Shotgun algorithms.

Following Friedman et al. (2010), we maintained a vector Ax to avoid repeated computation. We also used their pathwise optimization scheme: rather than directly solving with the given λ, we solved with an exponentially decreasing sequence λ_1, λ_2, . . . , λ. The solution x for λ_k is used to warm-start optimization for λ_{k+1}. This scheme can give significant speedups.

Though our analysis is for the synchronous setting, our implementation was asynchronous because of the high cost of synchronization. We used atomic compare-and-swap operations for updating the Ax vector.

We used C++ and the CILK++ library (Leiserson, 2009) for parallelism. All tests ran on an AMD processor using up to eight Opteron 8384 cores (2.69 GHz).
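To illustrate the pathwise scheme described above, the sketch below (ours) sweeps an exponentially decreasing regularization path down to the target λ; solve_lasso is a hypothetical stand-in for any warm-startable Lasso routine, such as the Shotgun sketch given earlier, and the number of stages is an arbitrary choice.

    import numpy as np

    def pathwise_solve(A, y, lam_target, solve_lasso, num_stages=8):
        # Pathwise optimization with warm starts (Friedman et al., 2010).
        # solve_lasso(A, y, lam, x0) -> x is any Lasso solver accepting a warm start.
        lam_max = np.max(np.abs(A.T @ y))      # for lam >= lam_max, x = 0 is optimal
        lams = np.geomspace(lam_max, lam_target, num_stages)
        x = np.zeros(A.shape[1])
        for lam in lams:
            x = solve_lasso(A, y, lam, x0=x)   # warm-start from the previous stage
        return x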
4.1.2. Other Algorithms

L1_LS (Kim et al., 2007) is a log-barrier interior point method. It uses Preconditioned Conjugate Gradient (PCG) to solve Newton steps iteratively and avoid explicitly inverting the Hessian. The implementation is in Matlab, but the expensive step (PCG) uses very efficient native Matlab calls. In our tests, matrix-vector operations were parallelized on up to 8 cores.

FPC_AS (Wen et al., 2010) uses iterative shrinkage to estimate which elements of x should be non-zero, as well as their signs. This reduces the objective to a smooth, quadratic function which is then minimized.

GPSR_BB (Figueiredo et al., 2008) is a gradient projection method which uses line search and termination techniques tailored for the Lasso.

Hard_l0 (Blumensath & Davies, 2009) uses iterative hard thresholding for compressed sensing. It sets all but the s largest weights to zero on each iteration. We set s as the sparsity obtained by Shooting.

SpaRSA (Wright et al., 2009) is an accelerated iterative shrinkage/thresholding algorithm which solves a sequence of quadratic approximations of the objective.

As with Shotgun, all of Shooting, FPC_AS, GPSR_BB, and SpaRSA use pathwise optimization schemes.

We also tested published implementations of the classic algorithms GLMNET (Friedman et al., 2010) and LARS (Efron et al., 2004). Since we were unable to get them to run on our larger datasets, we exclude their results.

4.1.3. Results

We divide our comparisons into four categories of datasets; the supplementary material has descriptions.

Sparco: Real-valued datasets of varying sparsity from the Sparco testbed (van den Berg et al., 2009). n ∈ [128, 29166], d ∈ [128, 29166].

Single-Pixel Camera: Dense compressed sensing problems from Duarte et al. (2008). n ∈ [410, 4770], d ∈ [1024, 16384].

Sparse Compressed Imaging: Similar to Single-Pixel Camera datasets, but with very sparse random −1/+1 measurement matrices. Created by us. n ∈ [477, 32768], d ∈ [954, 65536].

Large, Sparse Datasets: Very large and sparse problems, including predicting stock volatility from text in financial reports (Kogan et al., 2009). n ∈ [30465, 209432], d ∈ [209432, 5845762].

We ran each algorithm on each dataset with regularization λ = 0.5 and 10. Fig. 3 shows runtime results, divided by dataset category. We omit runs which failed to converge within a reasonable time period.

Shotgun (with P = 8) consistently performs well, converging faster than other algorithms on most dataset categories. Shotgun does particularly well on the Large, Sparse Datasets category, for which most algorithms failed to converge anywhere near the ranges plotted in Fig. 3. The largest dataset, whose features are occurrences of bigrams in financial reports (Kogan et al., 2009), has 5 million features and 30K samples. On this dataset, Shooting converges but requires ~4900 seconds, while Shotgun takes < 2000 seconds.

On the Single-Pixel Camera datasets, Shotgun (P = 8) is slower than Shooting. In fact, it is surprising that Shotgun converges at all with P = 8, for the plotted datasets all have P∗ = 1. Fig. 2 shows Shotgun with P > 4 diverging for the Ball64_singlepixcam dataset; however, after the practical adjustments to Shotgun used to produce Fig. 3, Shotgun converges with P = 8.
Figure 3. Runtime comparison of algorithms for the Lasso on 4 dataset categories: (a) Sparco (P∗ ∈ [1, 8683], avg 1493), (b) Single-Pixel Camera (P∗ = 1), (c) Sparse Compressed Img. (P∗ ∈ [1432, 5889], avg 3844), (d) Large, Sparse Datasets (P∗ ∈ [107, 1036], avg 571). Each marker compares an algorithm (Shooting, L1_LS, FPC_AS, GPSR_BB, SpaRSA, or Hard_l0) with Shotgun (with P = 8) on one dataset (and one λ ∈ {0.5, 10}). Y-axis is that algorithm's running time; X-axis is Shotgun's (P=8) running time on the same problem. Markers above the diagonal line indicate that Shotgun was faster; markers below the line indicate Shotgun was slower.

Among the other solvers, L1_LS is the most robust and even solves some of the Large, Sparse Datasets.

It is difficult to compare optimization algorithms and their implementations. Algorithms' termination criteria differ; e.g., primal-dual methods such as L1_LS monitor the duality gap, while Shotgun monitors the change in x. Shooting and Shotgun were written in C++, which is generally fast; the other algorithms were in Matlab, which handles loops slowly but linear algebra quickly. Therefore, we emphasize major trends: Shotgun robustly handles a range of problems; Theorem 3.2 helps explain its speedups; and Shotgun generally outperforms published solvers for the Lasso.

4.2. Sparse Logistic Regression

For logistic regression, we focus on comparing Shotgun with Stochastic Gradient Descent (SGD) variants. SGD methods are of particular interest to us since they are often considered to be very efficient, especially for learning with many samples; they often have convergence bounds independent of the number of samples.

For a large-scale comparison of various algorithms for sparse logistic regression, we refer the reader to the recent survey by Yuan et al. (2010). On L1_logreg (Koh et al., 2007) and CDN (Yuan et al., 2010), our results qualitatively matched their survey. Yuan et al. (2010) do not explore SGD empirically.

4.2.1. Implementation: Shotgun CDN

As Yuan et al. (2010) show empirically, their Coordinate Descent Newton (CDN) method is often orders of magnitude faster than the basic Shooting algorithm (Alg. 1) for sparse logistic regression. Like Shooting, CDN does coordinate descent, but instead of using a fixed step size, it uses a backtracking line search starting at a quadratic approximation of the objective. Although our analysis uses the fixed step size in (5), we modified Shooting and Shotgun to use line searches as in CDN. We refer to CDN as Shooting CDN, and we refer to parallel CDN as Shotgun CDN.

Shooting CDN and Shotgun CDN maintain an active set of weights which are allowed to become non-zero; this scheme speeds up optimization, though it can limit parallelism by shrinking d.

4.2.2. Other Algorithms

SGD iteratively updates x in a gradient direction estimated with one sample and scaled by a learning rate. We implemented SGD in C++ following, e.g., Zinkevich et al. (2010). We used lazy shrinkage updates (Langford et al., 2009a) to make use of sparsity in A. Choosing learning rates for SGD can be challenging. In our tests, constant rates led to faster convergence than decaying rates (decaying as 1/√T). For each test, we tried 14 exponentially increasing rates in [10^{-4}, 1] (in parallel) and chose the rate giving the best training objective. We did not use a sparsifying step for SGD.

SMIDAS (Shalev-Shwartz & Tewari, 2009) uses stochastic mirror descent but truncates gradients to sparsify x. We tested their published C++ implementation.

Parallel SGD refers to Zinkevich et al. (2010)'s work, which runs SGD in parallel on different subsamples of the data and averages the solutions x. We tested this method since it is one of the few existing methods for parallel regression, but we note that Zinkevich et al. (2010) did not address L1 regularization in their analysis. We averaged over 8 instances of SGD.
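For reference, the sketch below shows the SGD baseline in a plain (non-lazy) form for the L1-regularized logistic loss (3), with a constant learning rate and a soft-threshold shrinkage step after each sample; the lazy variant used in our experiments defers shrinkage for coordinates untouched by the current sparse sample. This is an illustrative re-implementation under our own names, not the C++ code used in Sec. 4.

    import numpy as np

    def sgd_logreg_l1(A, y, lam, eta, epochs, seed=0):
        # Plain SGD for the L1-regularized logistic loss (3): constant learning
        # rate eta, followed by a soft-threshold shrinkage step of size eta*lam.
        rng = np.random.default_rng(seed)
        n, d = A.shape
        w = np.zeros(d)
        for _ in range(epochs):
            for i in rng.permutation(n):
                margin = y[i] * (A[i] @ w)
                g = -y[i] * A[i] / (1.0 + np.exp(margin))   # gradient of the logistic loss
                w -= eta * g
                w = np.sign(w) * np.maximum(np.abs(w) - eta * lam, 0.0)   # shrink toward 0
        return w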
4.2.3. Results

Fig. 4 plots training objectives and test accuracy (on a held-out 10% of the data) for two large datasets.

The zeta dataset^6 illustrates the regime with n ≫ d. It contains 500K samples with 2000 features and is fully dense (in A). SGD performs well and is fairly competitive with Shotgun CDN (with P = 8).

The rcv1 dataset^7 (Lewis et al., 2004) illustrates the high-dimensional regime (d > n). It has about twice as many features (44504) as samples (18217), with 17% non-zeros in A. Shotgun CDN (P = 8) was much faster than SGD, especially in terms of the objective. Parallel SGD performed almost identically to SGD.

Though convergence bounds for SMIDAS are comparable to those for SGD, SMIDAS iterations take much longer due to the mirror descent updates. To execute 10M updates on the zeta dataset, SGD took 728 seconds, while SMIDAS took over 8500 seconds.

These results highlight how SGD is orthogonal to Shotgun: SGD can cope with large n, and Shotgun can cope with large d. A hybrid algorithm might be scalable in both n and d and, perhaps, be parallelized over both samples and features.

Figure 4. Sparse logistic regression on 2 datasets: zeta with λ = 1 (d = 2000, n = 500,000) and rcv1 with λ = 1 (d = 44504, n = 18217). Top plots trace training objectives over time; bottom plots trace classification error rates on held-out data (10%). Algorithms shown: SGD, Parallel SGD, Shooting CDN, Shotgun CDN. On zeta (n ≫ d), SGD converges faster initially, but Shotgun CDN (P=8) overtakes it. On rcv1 (d > n), Shotgun CDN converges much faster than SGD (note the log scale); Parallel SGD (P=8) is hidden by SGD.

4.3. Self-Speedup of Shotgun

To study the self-speedup of Shotgun Lasso and Shotgun CDN, we ran both solvers on our datasets with varying λ, using varying P (number of parallel updates = number of cores). We recorded the running time as the first time when an algorithm came within 0.5% of the optimal objective, as computed by Shooting.

Fig. 5 shows results for both speedup in time and speedup in iterations until convergence. The speedups in iterations match Theorem 3.2 quite closely. However, relative speedups in iterations (about 8×) are not matched by speedups in runtime (about 2× to 4×).

We thus discovered that speedups in time were limited by low-level technical issues. To understand the limiting factors, we analyzed various Shotgun-like algorithms to find bottlenecks.^8 We found we were hitting the memory wall (Wulf & McKee, 1995); memory bus bandwidth and latency proved to be the most limiting factors. Each weight update requires an atomic update to the shared Ax vector, and the ratio of memory accesses to floating point operations is only O(1). Data accesses have no temporal locality since each weight update uses a different column of A. We further validated these conclusions by monitoring CPU counters.

5. Discussion

We introduced Shotgun, a simple parallel algorithm for L1-regularized optimization. Our convergence results for Shotgun are the first such results for parallel coordinate descent with L1 regularization. Our bounds predict near-linear speedups, up to an interpretable, problem-dependent limit. In experiments, these predictions matched empirical behavior.

Extensive comparisons showed that Shotgun outperforms state-of-the-art L1 solvers on many datasets. We believe that, currently, Shotgun is one of the most efficient and scalable solvers for L1-regularized problems. The most exciting extension to this work might be the hybrid of SGD and Shotgun discussed in Sec. 4.3.

Code, Data, and Benchmark Results: Available at http://www.select.cs.cmu.edu/projects

Acknowledgments

Thanks to John Langford, Guy Blelloch, Joseph Gonzalez, Yucheng Low and our reviewers for feedback. Funded by NSF IIS-0803333, NSF CNS-0721591, ARO MURI W911NF0710287, ARO MURI W911NF0810242.

^6 The zeta dataset is from the Pascal Large Scale Learning Challenge: http://www.mlbench.org/instructions/
^7 Our version of the rcv1 dataset is from the LIBSVM repository (Chang & Lin, 2001).
^8 See the supplement for the scalability analysis details.
Figure 5. (a,c) Runtime speedup for Shotgun Lasso and Shotgun CDN (sparse logistic regression). (b,d) Speedup in iterations until convergence as a function of P∗. Both Shotgun instances exhibit almost linear speedups w.r.t. iterations. [Panels: (a) Shotgun Lasso runtime speedup vs. number of cores; (b) Shotgun Lasso iterations vs. dataset P∗; (c) Shotgun CDN runtime speedup vs. number of cores; (d) Shotgun CDN iterations vs. dataset P∗. Legend labels include Best/Mean/Worst and P = 2, 4, 8.]

References

Blumensath, T. and Davies, M.E. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.

Chang, C.-C. and Lin, C.-J. LIBSVM: a library for support vector machines, 2001. http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Duarte, M.F., Davenport, M.A., Takhar, D., Laska, J.N., Sun, T., Kelly, K.F., and Baraniuk, R.G. Single-pixel imaging via compressive sampling. Signal Processing Magazine, IEEE, 25(2):83–91, 2008.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.

Figueiredo, M.A.T., Nowak, R.D., and Wright, S.J. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE J. of Sel. Top. in Signal Processing, 1(4):586–597, 2008.

Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

Fu, W.J. Penalized regressions: The bridge versus the lasso. J. of Comp. and Graphical Statistics, 7(3):397–416, 1998.

Kim, S.-J., Koh, K., Lustig, M., Boyd, S., and Gorinevsky, D. An interior-point method for large-scale ℓ1-regularized least squares. IEEE Journal of Sel. Top. in Signal Processing, 1(4):606–617, 2007.

Kogan, S., Levin, D., Routledge, B.R., Sagi, J.S., and Smith, N.A. Predicting risk from financial reports with regression. In Human Language Tech.-NAACL, 2009.

Koh, K., Kim, S.-J., and Boyd, S. An interior-point method for large-scale l1-regularized logistic regression. JMLR, 8:1519–1555, 2007.

Langford, J., Li, L., and Zhang, T. Sparse online learning via truncated gradient. In NIPS, 2009a.

Langford, J., Smola, A.J., and Zinkevich, M. Slow learners are fast. In NIPS, 2009b.

Leiserson, C. E. The Cilk++ concurrency platform. In 46th Annual Design Automation Conference. ACM, 2009.

Lewis, D.D., Yang, Y., Rose, T.G., and Li, F. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361–397, 2004.

Mann, G., McDonald, R., Mohri, M., Silberman, N., and Walker, D. Efficient large-scale distributed training of conditional maximum entropy models. In NIPS, 2009.

Ng, A.Y. Feature selection, l1 vs. l2 regularization and rotational invariance. In ICML, 2004.

Shalev-Shwartz, S. and Tewari, A. Stochastic methods for ℓ1 regularized loss minimization. In ICML, 2009.

Strang, G. Linear Algebra and Its Applications. Harcourt Brace Jovanovich, 3rd edition, 1988.

Tibshirani, R. Regression shrinkage and selection via the lasso. J. Royal Statistical Society, 58(1):267–288, 1996.

Tsitsiklis, J. N., Bertsekas, D. P., and Athans, M. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.

van den Berg, E., Friedlander, M.P., Hennenfent, G., Herrmann, F., Saab, R., and Yılmaz, O. Sparco: A testing framework for sparse reconstruction. ACM Transactions on Mathematical Software, 35(4):1–16, 2009.

Wen, Z., Yin, W., Goldfarb, D., and Zhang, Y. A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization and continuation. SIAM Journal on Scientific Computing, 32(4):1832–1857, 2010.

Wright, S.J., Nowak, R.D., and Figueiredo, M.A.T. Sparse reconstruction by separable approximation. IEEE Trans. on Signal Processing, 57(7):2479–2493, 2009.

Wulf, W.A. and McKee, S.A. Hitting the memory wall: Implications of the obvious. ACM SIGARCH Computer Architecture News, 23(1):20–24, 1995.

Yuan, G.-X., Chang, K.-W., Hsieh, C.-J., and Lin, C.-J. A comparison of optimization methods and software for large-scale l1-reg. linear classification. JMLR, 11:3183–3234, 2010.

Zinkevich, M., Weimer, M., Smola, A.J., and Li, L. Parallelized stochastic gradient descent. In NIPS, 2010.
