
Mind the duality gap: safer rules for the Lasso

Olivier Fercoq          olivier.fercoq@telecom-paristech.fr
Alexandre Gramfort      alexandre.gramfort@telecom-paristech.fr
Joseph Salmon           joseph.salmon@telecom-paristech.fr

Institut Mines-Télécom, Télécom ParisTech, CNRS LTCI
46 rue Barrault, 75013, Paris, France
arXiv:1505.03410v3 [stat.ML] 3 Dec 2015

Abstract

Screening rules make it possible to discard irrelevant variables early from the optimization in Lasso problems, or derivatives thereof, making solvers faster. In this paper, we propose new versions of the so-called safe rules for the Lasso. Based on duality gap considerations, our new rules create safe test regions whose diameters converge to zero, provided that one relies on a converging solver. This property helps screening out more variables, for a wider range of regularization parameter values. In addition to faster convergence, we prove that we correctly identify the active sets (supports) of the solutions in finite time. While our proposed strategy can cope with any solver, its performance is demonstrated using a coordinate descent algorithm particularly adapted to machine learning use cases. Significant computing time reductions are obtained with respect to previous safe rules.

1. Introduction

Since the mid 1990's, high dimensional statistics has attracted considerable attention, especially in the context of linear regression with more explanatory variables than observations: the so-called p > n case. In such a context, least squares with ℓ1 regularization, referred to as the Lasso (Tibshirani, 1996) in statistics, or Basis Pursuit (Chen et al., 1998) in signal processing, has been one of the most popular tools. It enjoys theoretical guarantees (Bickel et al., 2009), as well as practical benefits: it provides sparse solutions, and fast convex solvers are available. This has made the Lasso a popular method in modern data-science toolkits. Among the fields where it has been applied successfully, one can mention dictionary learning (Mairal, 2010), biostatistics (Haury et al., 2012) and medical imaging (Lustig et al., 2007; Gramfort et al., 2012), to name a few.

Many algorithms exist to approximate Lasso solutions, but it is still a burning issue to accelerate solvers in high dimensions. Indeed, although some other variable selection and prediction methods exist (Fan & Lv, 2008), the best performing methods usually rely on the Lasso. For stability selection methods (Meinshausen & Bühlmann, 2010; Bach, 2008; Varoquaux et al., 2012), hundreds of Lasso problems need to be solved. For non-convex approaches such as SCAD (Fan & Li, 2001) or MCP (Zhang, 2010), solving the Lasso is often a required preliminary step (Zou, 2006; Zhang & Zhang, 2012; Candès et al., 2008).

Among possible algorithmic candidates for solving the Lasso, one can mention homotopy methods (Osborne et al., 2000), LARS (Efron et al., 2004), and approximate homotopy (Mairal & Yu, 2012), which provide solutions for the full Lasso path, i.e., for all possible choices of the tuning parameter λ. More recently, particularly for p > n, coordinate descent approaches (Friedman et al., 2007) have proved to be among the best methods to tackle large scale problems.

Following the seminal work by El Ghaoui et al. (2012), screening techniques have emerged as a way to exploit the known sparsity of the solution by discarding features prior to starting a Lasso solver. Such techniques are coined safe rules when they screen out coefficients guaranteed to be zero in the targeted optimal solution. Zeroing those coefficients allows one to focus more precisely on the non-zero ones (likely to represent signal) and helps reduce the computational burden. We refer to (Xiang et al., 2014) for a concise introduction to safe rules. Other alternatives have tried to screen the Lasso while relaxing the "safety": potentially, some variables are wrongly disregarded, and post-processing is needed to recover them. This is for instance the strategy adopted for the strong rules (Tibshirani et al., 2012).

The original basic safe rules operate as follows: one

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

chooses a fixed tuning parameter λ and, before launching any solver, tests whether a coordinate can be zeroed or not (equivalently, whether the corresponding variable can be disregarded or not). We will refer to such safe rules as static safe rules. Note that the test is performed according to a safe region, i.e., a region containing a dual optimal solution of the Lasso problem. In the static case, the screening is performed only once, prior to any optimization iteration. Two directions have emerged to improve on static strategies.

• The first direction is oriented towards the resolution of the Lasso for a large number of tuning parameters. Indeed, practitioners commonly compute the Lasso over a grid of parameters and select the best one in a data-driven manner, e.g., by cross-validation. As two consecutive λ's in the grid lead to similar solutions, knowing the first solution may help improve screening for the second one. We call such strategies sequential safe rules, also referred to as recursive safe rules in (El Ghaoui et al., 2012). This road has been pursued in (Wang et al., 2013; Xu & Ramadge, 2013; Xiang et al., 2014), and can be thought of as a "warm start" of the screening (in addition to the warm start of the solution itself). When performing sequential safe rules, one should keep in mind that generally only an approximation of the previous dual solution is computed, whereas the safety of the rule is guaranteed only if one uses the exact solution. Neglecting this issue leads to "unsafe" rules: relevant variables might be wrongly disregarded.

• The second direction aims at improving the screening by interlacing it throughout the optimization algorithm itself: although screening might be useless at the beginning of the algorithm, it might become (more) efficient as the algorithm proceeds towards the optimal solution. We call these strategies dynamic safe rules, following (Bonnefoy et al., 2014a;b).

Based on convex optimization arguments, we leverage duality gap computations to propose a simple strategy unifying both sequential and dynamic safe rules. We have coined such safe rules GAP SAFE rules.

The main contributions of this paper are 1) the introduction of new safe rules which demonstrate a clear practical improvement compared to prior strategies, and 2) the definition of a theoretical framework for comparing safe rules by looking at the convergence of their associated safe regions.

In Section 2, we present the framework and the basic concepts which guarantee the soundness of static and dynamic screening rules. Then, in Section 3, we introduce the new concept of converging safe rules. Such rules identify in finite time the active variables of the optimal solution (or equivalently the inactive variables), and the tests become more and more precise as the optimization algorithm proceeds. We also show that our new GAP SAFE rules, built on dual gap computations, are converging safe rules since their associated safe regions have a diameter converging to zero. We also explain how our GAP SAFE tests are sequential by nature. Application of our GAP SAFE rules with a coordinate descent solver for the Lasso problem is proposed in Section 4. Using standard data-sets, we report the time improvement compared to prior safe rules.

1.1. Model and notation

We denote by [d] the set {1, ..., d} for any integer d ∈ ℕ. Our observation vector is y ∈ ℝ^n, and the design matrix X = [x_1, ..., x_p] ∈ ℝ^{n×p} has p explanatory variables (or features) stored column-wise. We aim at approximating y as a linear combination of few variables x_j, hence expressing y as Xβ where β ∈ ℝ^p is a sparse vector. The standard Euclidean norm is written ‖·‖, the ℓ1 norm ‖·‖₁, the ℓ∞ norm ‖·‖∞, and the transpose of a matrix Q is denoted by Q⊤. We denote (t)₊ = max(0, t).

For such a task, the Lasso is often considered (see Bühlmann & van de Geer (2011) for an introduction). For a tuning parameter λ > 0, controlling the trade-off between data fidelity and sparsity of the solutions, a Lasso estimator β̂^(λ) is any solution of the primal optimization problem

    β̂^(λ) ∈ arg min_{β ∈ ℝ^p} (1/2)‖Xβ − y‖² + λ‖β‖₁ =: P_λ(β).    (1)

Denoting Δ_X = {θ ∈ ℝ^n : |x_j⊤θ| ≤ 1, ∀j ∈ [p]} the dual feasible set, a dual formulation of the Lasso reads (see for instance Kim et al. (2007) or Xiang et al. (2014)):

    θ̂^(λ) = arg max_{θ ∈ Δ_X ⊂ ℝ^n} (1/2)‖y‖² − (λ²/2)‖θ − y/λ‖² =: D_λ(θ).    (2)

We can reinterpret Eq. (2) as θ̂^(λ) = Π_{Δ_X}(y/λ), where Π_C refers to the projection onto a closed convex set C. In particular, this ensures that the dual solution θ̂^(λ) is always unique, contrarily to the primal β̂^(λ).

1.2. A KKT detour

For the Lasso problem, a primal solution β̂^(λ) ∈ ℝ^p and the dual solution θ̂^(λ) ∈ ℝ^n are linked through the relation:

    y = Xβ̂^(λ) + λθ̂^(λ).    (3)

The Karush-Kuhn-Tucker (KKT) conditions state:

    ∀j ∈ [p],  x_j⊤θ̂^(λ) ∈ {sign(β̂_j^(λ))}  if β̂_j^(λ) ≠ 0,
               x_j⊤θ̂^(λ) ∈ [−1, 1]           if β̂_j^(λ) = 0.    (4)

Rule                                     Center                 Radius                                              Ingredients

Static Safe (El Ghaoui et al., 2012)     y/λ                    R̃_λ(y/λ_max)                                        λ_max = ‖X⊤y‖∞ = |x_{j*}⊤y|
Dynamic ST3 (Xiang et al., 2011)         y/λ − δ x_{j*}/‖x_{j*}‖  (R̃_λ(θ_k)² − δ²)^{1/2}                             δ = (λ_max/λ − 1)/‖x_{j*}‖
Dynamic Safe (Bonnefoy et al., 2014a)    y/λ                    R̃_λ(θ_k)                                            θ_k ∈ Δ_X (e.g., as in (11))
Sequential (Wang et al., 2013)           θ̂^(λ_{t−1})            |1/λ_{t−1} − 1/λ_t| ‖y‖                             exact θ̂^(λ_{t−1}) required
GAP SAFE sphere (proposed)               θ_k                    r_{λ_t}(β_k, θ_k) = (1/λ_t) √(2 G_{λ_t}(β_k, θ_k))  dual gap for β_k, θ_k

Table 1. Review of some common safe sphere tests.

See for instance (Xiang et al., 2014) for more details. The KKT conditions lead to the fact that for λ ≥ λ_max = ‖X⊤y‖∞, 0 ∈ ℝ^p is a primal solution. It can be considered as the mother of all safe screening rules. From now on, we therefore assume that λ ≤ λ_max for all the considered λ's.

2. Safe rules

Safe rules exploit the KKT condition (4). This equation implies that β̂_j^(λ) = 0 as soon as |x_j⊤θ̂^(λ)| < 1. The main challenge is that the dual optimal solution is unknown. Hence, a safe rule aims at constructing a set C ⊂ ℝ^n containing θ̂^(λ). We call such a set C a safe region. Safe regions are all the more helpful when μ_C(x_j) := sup_{θ∈C} |x_j⊤θ| < 1 for many j's, since then β̂_j^(λ) = 0 for many j's.

Practical benefits are obtained if one can construct a region C for which it is easy to compute its support function, denoted by σ_C and defined for any x ∈ ℝ^n by:

    σ_C(x) = max_{θ∈C} x⊤θ.    (5)

Cast differently, for any safe region C, any j ∈ [p], and any primal optimal solution β̂^(λ), the following holds true:

    If μ_C(x_j) = max(σ_C(x_j), σ_C(−x_j)) < 1, then β̂_j^(λ) = 0.    (6)

We call a safe test, or safe rule, a test associated to C that screens out explanatory variables thanks to Eq. (6).

Remark 1. Reminding that the support function of a set is the same as the support function of its closed convex hull (Hiriart-Urruty & Lemaréchal, 1993)[Proposition V.2.2.1], we restrict our search to closed convex safe regions.

Based on a safe region C, one can partition the explanatory variables into a safe active set A^(λ)(C) and a safe zero set Z^(λ)(C), where:

    A^(λ)(C) = {j ∈ [p] : μ_C(x_j) ≥ 1},    (7)
    Z^(λ)(C) = {j ∈ [p] : μ_C(x_j) < 1}.    (8)

Note that for nested safe regions C1 ⊂ C2, one has A^(λ)(C1) ⊂ A^(λ)(C2). Consequently, a natural goal is to find safe regions as small as possible: narrowing safe regions can only increase the number of screened out variables.

Remark 2. If C = {θ̂^(λ)}, the safe active set is the equicorrelation set A^(λ)(C) = E_λ := {j ∈ [p] : |x_j⊤θ̂^(λ)| = 1} (in most cases (Tibshirani, 2013), it is exactly the support of β̂^(λ)). Even when the Lasso solution is not unique, the equicorrelation set contains the supports of all the solutions. The other extreme case is C = Δ_X, for which A^(λ)(C) = [p]. Here, no variable is screened out: Z^(λ)(C) = ∅ and the screening is useless.

We now consider common safe regions whose support functions are easy to obtain in closed form. For simplicity, we focus only on balls and domes, though more complicated regions could be investigated (Xiang et al., 2014).

2.1. Sphere tests

Following previous work on safe rules, we call sphere tests the tests relying on balls as safe regions. For a sphere test, one chooses a ball containing θ̂^(λ) with center c and radius r, i.e., C = B(c, r). Due to their simplicity, safe spheres have been the most commonly investigated safe regions (see Table 1 for a brief review). The corresponding test is defined as follows:

    If μ_{B(c,r)}(x_j) = |x_j⊤c| + r‖x_j‖ < 1, then β̂_j^(λ) = 0.    (9)

Note that for a fixed center, the smaller the radius, the better the safe screening strategy.

Example 1. The first introduced sphere test (El Ghaoui et al., 2012) consists in using the center c = y/λ and radius r = |1/λ − 1/λ_max| ‖y‖. Given that θ̂^(λ) = Π_{Δ_X}(y/λ), this is a safe region since y/λ_max ∈ Δ_X and ‖y/λ_max − Π_{Δ_X}(y/λ)‖ ≤ ‖y‖ |1/λ − 1/λ_max|. However, one can check that this static safe rule is useless as soon as

    λ/λ_max ≤ min_{j∈[p]} [ (1 + |x_j⊤y|/(‖x_j‖‖y‖)) / (1 + λ_max/(‖x_j‖‖y‖)) ].    (10)

2.2. Dome tests

Other popular safe regions are domes, i.e., intersections between a ball and a half-space. This kind of safe region has been considered for instance in (El Ghaoui et al., 2012; Xiang & Ramadge, 2012; Xiang et al., 2014; Bonnefoy et al., 2014b). We denote by D(c, r, α, w) the dome with ball center c, ball radius r, oriented hyperplane with unit normal vector w, and parameter α such that c − αrw is the projection of c on the hyperplane (see Figure 1 for an illustration in the interesting case α > 0).

Figure 1. Representation of the dome D(c, r, α, w) (dark blue). In our case, note that α is positive.

Remark 3. The dome is non-trivial whenever α ∈ [−1, 1]. When α = 0, one simply gets a hemisphere.

For the dome test, one needs to compute the support function for C = D(c, r, α, w). Interestingly, as for balls, it can be obtained in closed form. Due to its length though, the formula is deferred to the Appendix (see also (Xiang et al., 2014)[Lemma 3] for more details).

2.3. Dynamic safe rules

For approximating a solution β̂^(λ) of the Lasso primal problem P_λ, iterative algorithms are commonly used. We denote by β_k ∈ ℝ^p the current estimate after k iterations of any such algorithm (see Section 4 for a specific study on coordinate descent). Dynamic safe rules aim at discovering safe regions that become narrower as k increases. To do so, one first needs dual feasible points θ_k ∈ Δ_X. Following El Ghaoui et al. (2012) (see also (Bonnefoy et al., 2014a)), this can be achieved by a simple rescaling of the current residual ρ_k = y − Xβ_k, defining θ_k as

    θ_k = α_k ρ_k,  where  α_k = min[ max( y⊤ρ_k/(λ‖ρ_k‖²), −1/‖X⊤ρ_k‖∞ ), 1/‖X⊤ρ_k‖∞ ].    (11)

Such a dual feasible θ_k is proportional to ρ_k, and it is the closest point (for the norm ‖·‖) to y/λ in Δ_X with this property, i.e., θ_k = Π_{Δ_X ∩ Span(ρ_k)}(y/λ). A reason for choosing this dual point is that the dual optimal solution θ̂^(λ) is the projection of y/λ on the dual feasible set Δ_X, and the optimal θ̂^(λ) is proportional to y − Xβ̂^(λ), cf. Equation (3).

Remark 4. Note that if lim_{k→∞} β_k = β̂^(λ) (convergence of the primal), then with the previous display and (3) we can show that lim_{k→∞} θ_k = θ̂^(λ). Moreover, the convergence of the primal is unaltered by safe rules: screening out unnecessary coefficients of β_k can only decrease the distance between β_k and β̂^(λ).

Example 2. Note that any dual feasible point θ ∈ Δ_X immediately provides a ball that contains θ̂^(λ), since

    ‖θ̂^(λ) − y/λ‖ = min_{θ′∈Δ_X} ‖θ′ − y/λ‖ ≤ ‖θ − y/λ‖ =: R̃_λ(θ).    (12)

The ball B(y/λ, R̃_λ(θ_k)) corresponds to the simplest safe region introduced in (Bonnefoy et al., 2014a;b) (cf. Figure 2 for more insights). As the algorithm proceeds, one expects that θ_k gets closer to θ̂^(λ), so ‖θ_k − y/λ‖ should get closer to ‖θ̂^(λ) − y/λ‖. Similarly to Example 1, this dynamic rule becomes useless once λ is too small. More precisely, this occurs as soon as

    λ/λ_max ≤ min_{j∈[p]} [ (1 + |x_j⊤y|/(‖x_j‖‖y‖)) / (λ_max‖θ̂^(λ)‖/‖y‖ + λ_max/(‖x_j‖‖y‖)) ].    (13)

Noticing that ‖θ̂^(λ)‖ ≤ ‖y/λ‖ (since Π_{Δ_X} is a contraction and 0 ∈ Δ_X) and proceeding as for (10), one can show that this dynamic safe rule is inefficient when:

    λ/λ_max ≤ min_{j∈[p]} [ |x_j⊤y| / λ_max ].    (14)

This is a critical threshold, yet the screening might stop even at a larger λ because of Eq. (13). In practice, the bound in Eq. (13) cannot be evaluated a priori due to the term ‖θ̂^(λ)‖. Note also that the bound in Eq. (14) is close to the one in Eq. (10), which explains the similar behavior observed in our experiments (see Figure 3 for instance).

3. New contributions on safe rules

3.1. Support discovery in finite time

Let us first introduce the notions of converging safe regions and converging safe tests.

Definition 1. Let (C_k)_{k∈ℕ} be a sequence of closed convex sets in ℝ^n containing θ̂^(λ). It is a converging sequence of safe regions for the Lasso with parameter λ if the diameters of the sets converge to zero. The associated safe screening rules are referred to as converging safe tests.

Not only are converging safe regions crucial to speed up computation, but they are also helpful to reach exact active set identification in a finite number of steps. More precisely, we prove that one recovers the equicorrelation set of the Lasso (cf. Remark 2) in finite time with any converging strategy: after a finite number of steps, the equicorrelation set E_λ is exactly identified. Such a property is sometimes referred to as finite identification of the support (Liang et al., 2014). This is summarized in the following.

Figure 2. Our new GAP SAFE sphere and dome (in orange). (a) Location of the dual optimal θ̂^(λ) in the annulus. (b) Refined location of the dual optimal θ̂^(λ) (dark blue). (c) Proposed GAP SAFE sphere (orange). (d) Proposed GAP SAFE dome (orange). The dual optimal solution θ̂^(λ) must lie in the dark blue region; β is any point in ℝ^p, and θ is any point in the dual feasible set Δ_X. Remark that the GAP SAFE dome is included in the GAP SAFE sphere, and that it is the convex hull of the dark blue region.

Theorem 1. Let (C_k)_{k∈ℕ} be a sequence of converging safe regions. The estimated support provided by C_k, A^(λ)(C_k) = {j ∈ [p] : max_{θ∈C_k} |θ⊤x_j| ≥ 1}, satisfies lim_{k→∞} A^(λ)(C_k) = E_λ, and there exists k_0 ∈ ℕ such that for all k ≥ k_0 one gets A^(λ)(C_k) = E_λ.

Proof. The main idea of the proof is to use that lim_{k→∞} C_k = {θ̂^(λ)}, that lim_{k→∞} μ_{C_k}(x) = μ_{{θ̂^(λ)}}(x) = |x⊤θ̂^(λ)|, and that the set A^(λ)(C_k) is discrete. Details are deferred to the Appendix.

Remark 5. A more general result is proved for a specific algorithm (Forward-Backward) in Liang et al. (2014). Interestingly, our scheme is independent of the algorithm considered (e.g., Forward-Backward (Beck & Teboulle, 2009), Primal-Dual (Chambolle & Pock, 2011), coordinate descent (Tseng, 2001; Friedman et al., 2007)) and relies only on the convergence of a sequence of safe regions.

3.2. GAP SAFE regions: leveraging the duality gap

In this section, we provide new dynamic safe rules built on converging safe regions.

Theorem 2. Take any (β, θ) ∈ ℝ^p × Δ_X. Denote

    R̂_λ(β) := (1/λ) (‖y‖² − ‖Xβ − y‖² − 2λ‖β‖₁)₊^{1/2},    R̃_λ(θ) := ‖θ − y/λ‖,

let θ̂^(λ) be the dual optimal Lasso solution, and set r̃_λ(β, θ) := (R̃_λ(θ)² − R̂_λ(β)²)^{1/2}. Then

    θ̂^(λ) ∈ B(θ, r̃_λ(β, θ)).    (15)

Proof. The construction of the ball B(θ, r̃_λ(β, θ)) is based on the weak duality theorem (cf. (Rockafellar & Wets, 1998) for a reminder on weak and strong duality). Fix θ ∈ Δ_X and β ∈ ℝ^p; then it holds that

    (1/2)‖y‖² − (λ²/2)‖θ − y/λ‖² ≤ (1/2)‖Xβ − y‖² + λ‖β‖₁.

Hence,

    ‖θ − y/λ‖ ≥ (1/λ) (‖y‖² − ‖Xβ − y‖² − 2λ‖β‖₁)₊^{1/2}.    (16)

In particular, this provides ‖θ̂^(λ) − y/λ‖ ≥ R̂_λ(β). Combining (12) and (16) asserts that θ̂^(λ) belongs to the annulus A(y/λ, R̃_λ(θ), R̂_λ(β)) := {z ∈ ℝ^n : R̂_λ(β) ≤ ‖z − y/λ‖ ≤ R̃_λ(θ)} (the light blue zone in Figure 2).

Remind that the dual feasible set Δ_X is convex, hence Δ_X ∩ B(y/λ, R̃_λ(θ)) is convex as well. Thanks to (16), Δ_X ∩ B(y/λ, R̃_λ(θ)) = Δ_X ∩ A(y/λ, R̃_λ(θ), R̂_λ(β)), so the latter intersection is convex too. Hence, θ̂^(λ) lies inside the annulus A(y/λ, R̃_λ(θ), R̂_λ(β)), and so does the segment [θ, θ̂^(λ)] by convexity (see Figures 2(a) and 2(b)). Moreover, θ̂^(λ) is the point of [θ, θ̂^(λ)] which is closest to y/λ. The farthest θ̂^(λ) can be according to this information is when [θ, θ̂^(λ)] is tangent to the inner ball B(y/λ, R̂_λ(β)) and ‖θ̂^(λ) − y/λ‖ = R̂_λ(β). Let us denote by θ_int such a point. The tangency property reads ‖θ_int − y/λ‖ = R̂_λ(β) and (θ − θ_int)⊤(y/λ − θ_int) = 0. Hence, with the latter and the definition of R̃_λ(θ), ‖θ − y/λ‖² = ‖θ − θ_int‖² + ‖θ_int − y/λ‖², and so ‖θ − θ_int‖ = (R̃_λ(θ)² − R̂_λ(β)²)^{1/2}. Since by construction θ̂^(λ) cannot be further away from θ than θ_int (again, insights can be gleaned from Figure 2), we conclude that θ̂^(λ) ∈ B(θ, (R̃_λ(θ)² − R̂_λ(β)²)^{1/2}).

Remark 6. Choosing β = 0 and θ = y/λ_max, one recovers the static safe rule given in Example 1.

With the definitions of the primal objective P_λ(β) and the dual objective D_λ(θ), the duality gap reads G_λ(β, θ) = P_λ(β) − D_λ(θ). Remind that if G_λ(β, θ) ≤ ε, then one has

P_λ(β) − P_λ(β̂^(λ)) ≤ ε, which is a standard stopping criterion for Lasso solvers. The next proposition establishes a connection between the radius r_λ(β, θ) and the duality gap G_λ(β, θ).

Proposition 1. For any (β, θ) ∈ ℝ^p × Δ_X, the following holds:

    r̃_λ(β, θ)² ≤ r_λ(β, θ)² := (2/λ²) G_λ(β, θ).    (17)

Proof. Use the facts that R̃_λ(θ)² = ‖θ − y/λ‖² and R̂_λ(β)² ≥ (‖y‖² − ‖Xβ − y‖² − 2λ‖β‖₁)/λ².

If we could choose the "oracle" values θ = θ̂^(λ) and β = β̂^(λ) in (15), we would obtain a zero radius. Since those quantities are unknown, we rather pick dynamically the current estimates provided by an optimization algorithm: β = β_k and θ = θ_k as in Eq. (11). Introducing GAP SAFE spheres and domes as below, Proposition 2 ensures that they are converging safe regions.

GAP SAFE sphere:

    C_k = B(θ_k, r_λ(β_k, θ_k)).    (18)

GAP SAFE dome:

    C_k = D( (y/λ + θ_k)/2,  R̃_λ(θ_k)/2,  2(R̂_λ(β_k)/R̃_λ(θ_k))² − 1,  (θ_k − y/λ)/‖θ_k − y/λ‖ ).    (19)

Proposition 2. For any converging primal sequence (β_k)_{k∈ℕ} and the dual sequence (θ_k)_{k∈ℕ} defined as in Eq. (11), the GAP SAFE sphere and the GAP SAFE dome are converging safe regions.

Proof. For the GAP SAFE sphere, the result follows from strong duality: Remark 4 and Proposition 1 yield lim_{k→∞} r_λ(β_k, θ_k) = 0, since lim_{k→∞} θ_k = θ̂^(λ) and lim_{k→∞} β_k = β̂^(λ). For the GAP SAFE dome, one can check that it is included in the GAP SAFE sphere, and it therefore inherits the convergence (see also Figures 2(c) and 2(d)).

Remark 7. The radius r_λ(β_k, θ_k) can be compared with the radii considered for the Dynamic Safe rule and the Dynamic ST3 rule (Bonnefoy et al., 2014a), respectively R̃_λ(θ_k) = ‖θ_k − y/λ‖ and (R̃_λ(θ_k)² − δ²)^{1/2}, where δ = (λ_max/λ − 1)/‖x_{j*}‖. We have proved that lim_{k→∞} r_λ(β_k, θ_k) = 0, but only a weaker property is satisfied by the two other radii: lim_{k→∞} R̃_λ(θ_k) = R̃_λ(θ̂^(λ)) = ‖θ̂^(λ) − y/λ‖ and lim_{k→∞} (R̃_λ(θ_k)² − δ²)^{1/2} = (R̃_λ(θ̂^(λ))² − δ²)^{1/2} > 0.

3.3. GAP SAFE rules: sequential for free

As a byproduct, our dynamic screening tests provide a warm start strategy for the safe regions, making our GAP SAFE rules inherently sequential. The next proposition shows their efficiency when attacking a new tuning parameter after having solved the Lasso for a previous λ, even only approximately. Handling approximate solutions is a critical issue for producing safe sequential strategies: without taking into account the approximation error, the screening might disregard relevant variables, especially the ones near the safe region boundaries. Except for λ_max, it is unrealistic to assume that one can dispose of exact solutions.

Consider λ_0 = λ_max and a non-increasing sequence of T − 1 tuning parameters (λ_t)_{t∈[T−1]} in (0, λ_max). In practice, we choose the common grid (Bühlmann & van de Geer, 2011)[2.12.1]: λ_t = λ_0 10^{−δt/(T−1)} (for instance, in Figure 3 we considered δ = 3). The next result controls how the duality gap, or equivalently the diameter of our GAP SAFE regions, evolves from λ_{t−1} to λ_t.

Proposition 3. Suppose that t ≥ 1 and (β, θ) ∈ ℝ^p × Δ_X. Reminding that r_{λ_t}(β, θ)² = 2G_{λ_t}(β, θ)/λ_t², the following holds:

    r_{λ_t}(β, θ)² = (λ_{t−1}/λ_t) r_{λ_{t−1}}(β, θ)² + (1 − λ_t/λ_{t−1}) ‖(Xβ − y)/λ_t‖² − (λ_{t−1}/λ_t − 1) ‖θ‖².    (20)

Proof. Details are given in the Appendix.

This proposition motivates screening sequentially as follows: having (β, θ) ∈ ℝ^p × Δ_X such that G_{λ_{t−1}}(β, θ) ≤ ε, we can screen using the GAP SAFE sphere with center θ and radius r_{λ_t}(β, θ). The adaptation to the GAP SAFE dome is straightforward and consists in replacing θ_k, β_k, λ by θ, β, λ_t in the GAP SAFE dome definition.

Remark 8. The basic sphere test of (Wang et al., 2013) requires the exact dual solution θ = θ̂^(λ_{t−1}) as center, and has radius |1/λ_t − 1/λ_{t−1}| ‖y‖, which is strictly larger than ours. Indeed, if one has access to the dual and primal optimal solutions at λ_{t−1}, i.e., (θ, β) = (θ̂^(λ_{t−1}), β̂^(λ_{t−1})), then r_{λ_{t−1}}(β, θ)² = 0, θ = (y − Xβ)/λ_{t−1}, and

    r_{λ_t}(β, θ)² = [ (λ_{t−1}²/λ_t²)(1 − λ_t/λ_{t−1}) − (λ_{t−1}/λ_t − 1) ] ‖θ‖² ≤ (1/λ_t − 1/λ_{t−1})² ‖y‖²,

since ‖θ‖ ≤ ‖y‖/λ_{t−1} for θ = θ̂^(λ_{t−1}).

Note that, contrarily to former sequential rules (Wang et al., 2013), our GAP SAFE rules still work when one only has access to approximations of θ̂^(λ_{t−1}).
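The two ingredients of the GAP SAFE sphere can be sketched compactly: the dual point of Eq. (11), obtained by rescaling the residual, and the radius r_λ(β, θ) = √(2 G_λ(β, θ))/λ of Eq. (17). The code below is our own minimal illustration (names are ours, and the degenerate case of a zero residual is not handled):

```python
import numpy as np

def dual_point(X, y, beta, lam):
    # Rescale the residual rho = y - X beta into the dual feasible set,
    # following Eq. (11): theta = alpha * rho, with alpha clipped so that
    # ||X^T theta||_inf <= 1.
    rho = y - X @ beta
    nrm = np.linalg.norm(X.T @ rho, np.inf)
    alpha = np.clip(y @ rho / (lam * (rho @ rho)), -1.0 / nrm, 1.0 / nrm)
    return alpha * rho

def gap_safe_sphere(X, y, beta, theta, lam):
    # GAP SAFE sphere of Eq. (18): center theta, radius sqrt(2 G)/lambda.
    p_obj = 0.5 * np.linalg.norm(X @ beta - y) ** 2 + lam * np.abs(beta).sum()
    d_obj = 0.5 * np.linalg.norm(y) ** 2 \
        - 0.5 * lam ** 2 * np.linalg.norm(theta - y / lam) ** 2
    gap = max(p_obj - d_obj, 0.0)  # guard against tiny negative round-off
    return theta, np.sqrt(2.0 * gap) / lam
```

At λ = λ_max with β = 0 and θ = y/λ_max, the gap vanishes and the radius is zero, consistent with the fact that 0 is then a primal solution.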

4. Experiments

4.1. Coordinate descent

Screening procedures can be used with any optimization algorithm. We chose coordinate descent because it is well suited to machine learning tasks, especially with a sparse and/or unstructured design matrix X. Coordinate descent requires efficient extraction of the columns of X, which is typically not easy in signal processing applications where X is commonly an implicit operator (e.g., Fourier or wavelets).

Algorithm 1: Coordinate descent with GAP SAFE rules

Input: X, y, ε, K, f, (λ_t)_{t∈[T−1]}
Initialization: λ_0 = λ_max, β^{λ_0} = 0
for t ∈ [T − 1] do
    β ← β^{λ_{t−1}}   (previous ε-solution)
    for k ∈ [K] do
        if k mod f = 1 then
            Compute θ and C thanks to (11) and (18) or (19)
            Get A_{λ_t}(C) = {j ∈ [p] : μ_C(x_j) ≥ 1} as in (7)
            if G_{λ_t}(β, θ) ≤ ε then
                β^{λ_t} ← β
                break
        for j ∈ A_{λ_t}(C) do
            β_j ← ST( λ_t/‖x_j‖², β_j − x_j⊤(Xβ − y)/‖x_j‖² )
            # ST(u, x) = sign(x)(|x| − u)₊ (soft-threshold)
Output: (β^{λ_t})_{t∈[T−1]}

We implemented the screening rules of Table 1 on top of the coordinate descent in Scikit-learn (Pedregosa et al., 2011). This code is written in Python and Cython to generate low-level C code, offering high performance. A low-level language is necessary for this algorithm to scale. Two implementations were written to work efficiently with both dense data stored as Fortran-ordered arrays and sparse data stored in the compressed sparse column (CSC) format. Our pseudo-code is presented in Algorithm 1. In practice, we perform the dynamic screening tests every f = 10 passes through the (active) variables. Iterations are stopped when the duality gap is smaller than the target accuracy.

The naive computation of θ_k in (11) involves the computation of ‖X⊤ρ_k‖∞ (ρ_k being the current residual), which costs O(np) operations. This can be avoided since, when using a safe rule, one knows that the index achieving the maximum for this norm is in A_{λ_t}(C). Indeed, by construction, arg max_{j∈A_{λ_t}(C)} |x_j⊤θ_k| = arg max_{j∈[p]} |x_j⊤θ_k| = arg max_{j∈[p]} |x_j⊤ρ_k|. In practice, the evaluation of the dual gap is therefore not O(np) but O(nq), where q is the size of A_{λ_t}(C). In other words, using screening also speeds up the evaluation of the stopping criterion.

We did not compare our method against the strong rules of Tibshirani et al. (2012) because they are not safe and therefore need complex post-processing with parameters to tune. We also did not compare against the sequential rule of Wang et al. (2013) (e.g., EDPP) because it requires the exact dual optimal solution of the previous Lasso problem, which is not available in practice and can prevent the solver from actually converging: this is a phenomenon we consistently observed in our experiments.

Figure 3. Proportion of active variables as a function of λ and the number of iterations K on the Leukemia dataset. Better strategies have a longer range of λ with (red) small active sets.

4.2. Number of screened variables

Figure 3 presents the proportion of variables screened by several safe rules on the standard Leukemia dataset. The screening proportion is presented as a function of the number of iterations K. As the SAFE screening rule of El Ghaoui et al. (2012) is sequential but not dynamic, for a given λ the proportion of screened variables does not depend on K. The rules of Bonnefoy et al. (2014a) are more efficient on this dataset, but they do not benefit much from the dynamic framework. Our proposed GAP SAFE tests screen many more variables, especially when the tuning parameter λ gets small, which is particularly relevant in practice. Moreover, even for very small λ's (notice the logarithmic scale), where no variable is screened at the beginning of the optimization procedure, the GAP SAFE rules manage to screen more variables, especially as K increases. Finally, the figure demonstrates that the GAP SAFE dome test only brings marginal improvement over the sphere.
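For a single value of λ, Algorithm 1 can be sketched in pure NumPy as follows. This is our own simplified illustration of the pseudo-code (not the Cython implementation shipped with Scikit-learn); the screening frequency f and the tolerance eps mirror the inputs of Algorithm 1, and the ℓ∞-norm in Eq. (11) is restricted to the safe active set, as described above.

```python
import numpy as np

def st(u, x):
    # soft-threshold: ST(u, x) = sign(x) (|x| - u)_+
    return np.sign(x) * np.maximum(np.abs(x) - u, 0.0)

def cd_gap_safe(X, y, lam, beta=None, eps=1e-8, max_iter=1000, f=10):
    n, p = X.shape
    beta = np.zeros(p) if beta is None else beta.copy()
    active = np.arange(p)                 # current safe active set
    sq_norms = (X ** 2).sum(axis=0)       # ||x_j||^2
    for k in range(max_iter):
        if k % f == 0:                    # dynamic GAP SAFE screening
            rho = y - X @ beta            # residual
            nrm = np.abs(X[:, active].T @ rho).max()  # argmax lies in active set
            alpha = np.clip(y @ rho / (lam * (rho @ rho)), -1 / nrm, 1 / nrm)
            theta = alpha * rho           # dual feasible point, Eq. (11)
            p_obj = 0.5 * rho @ rho + lam * np.abs(beta).sum()
            d_obj = 0.5 * y @ y - 0.5 * lam ** 2 * ((theta - y / lam) ** 2).sum()
            gap = max(p_obj - d_obj, 0.0)
            if gap <= eps:
                break
            r = np.sqrt(2 * gap) / lam    # GAP SAFE radius, Eq. (17)
            mu = np.abs(X[:, active].T @ theta) + r * np.sqrt(sq_norms[active])
            beta[active[mu < 1]] = 0.0    # screened coefficients are safely zero
            active = active[mu >= 1]      # safe active set, Eq. (7)
            if active.size == 0:          # everything screened: beta = 0 is optimal
                break
        for j in active:                  # one coordinate descent pass
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = st(lam / sq_norms[j], X[:, j] @ r_j / sq_norms[j])
    return beta
```

On a small dense problem, this sketch reaches the prescribed duality gap in a handful of passes; the point of the screening steps is that the inner loop shrinks as variables are discarded.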

4.3. Gains in the computation of Lasso paths

The main interest of variable screening is to reduce computation costs. Indeed, the time to compute the screening itself should not be larger than the gains it provides. Hence, we compared the time needed to compute Lasso paths to a prescribed accuracy for different safe rules. Figures 4, 5 and 6 illustrate the results on three datasets. Figure 4 presents results on the dense, small scale, Leukemia dataset. Figure 5 presents results on a medium scale sparse dataset obtained with bag of words features extracted from the 20newsgroup dataset (comp.graphics vs. talk.religion.misc, with TF-IDF, removing English stop words and words occurring only once or in more than 95% of the documents). Text feature extraction was done using Scikit-Learn. Figure 6 focuses on the large scale sparse RCV1 (Reuters Corpus Volume 1) dataset, cf. (Schmidt et al., 2013).

Figure 4. Time to reach convergence using various screening rules on the Leukemia dataset (dense data: n = 72, p = 7129). [Curves compared in Figures 4-6: No screening; SAFE (El Ghaoui et al.); ST3 (Bonnefoy et al.); SAFE (Bonnefoy et al.); GAP SAFE (sphere); GAP SAFE (dome). Time in seconds versus -log10(duality gap).]

Figure 5. Time to reach convergence using various screening rules on bag of words features from the 20newsgroup dataset (sparse data: n = 961, p = 10094).

Figure 6. Computation time to reach convergence using different screening strategies on the RCV1 (Reuters Corpus Volume 1) dataset (sparse data: n = 20242, p = 47236).

In all cases, Lasso paths are computed as required to estimate optimal regularization parameters in practice (when using cross-validation, one path is computed for each fold). For each Lasso path, solutions are obtained for T = 100 values of λ, as detailed in Section 3.3. Remark that this grid is the default one in both Scikit-Learn and the glmnet R package. With our proposed GAP SAFE screening, we obtain substantial gains in computational time on all datasets. We already get a speedup of up to 3x when requiring a duality gap smaller than 10^-4. The interest of the screening is even clearer for higher accuracies: the GAP SAFE sphere is 11x faster than its competitors on the Leukemia dataset at accuracy 10^-8. One can observe that, with the parameter grid used here, the larger p is compared to n, the higher the gain in computation time.

In our experiments, the other safe screening rules did not show much speed-up. As one can see in Figure 3, those screening rules keep all the active variables for a wide range of λ's. The algorithm is thus faster for large λ's only, so the overall gain would not be significant.

5. Conclusion

We have presented new results on safe rules for accelerating algorithms solving the Lasso problem (see the Appendix for an extension to the Elastic Net). First, we have introduced the framework of converging safe rules, a key concept independent of the implementation chosen. Our second contribution was to leverage duality gap computations to create two safer rules satisfying the aforementioned convergence properties. Finally, we demonstrated the important practical benefits of those new rules by applying them to standard dense and sparse datasets using a coordinate descent solver. Future work will extend our framework to generalized linear models and the group-Lasso.

Acknowledgment

The authors would like to thank Jalal Fadili and Jingwei Liang for helping clarify some misleading statements on the equicorrelation set. We acknowledge support from the Chair Machine Learning for Big Data at Télécom ParisTech.
afterwards, since we still compute the screening tests. Even and from the Orange/Télécom ParisTech think tank phi-
if one can avoid some of these useless computations thanks TAB. This work benefited from the support of the ”FMJH
to formulas like (14) or (10), the corresponding speed-up Program Gaspard Monge in optimization and operation re-
Mind the duality gap: safer rules for the Lasso

search”, and from the support to this program from EDF. Gramfort, A., Kowalski, M., and Hämäläinen, M. Mixed-
norm estimates for the M/EEG inverse problem using
References accelerated gradient methods. Phys. Med. Biol., 57(7):
1937–1961, 2012.
Bach, F. Bolasso: model consistent Lasso estimation
through the bootstrap. In ICML, 2008. Haury, A.-C., Mordelet, F., Vera-Licona, P., and Vert, J.-
P. TIGRESS: Trustful Inference of Gene REgulation us-
Beck, A. and Teboulle, M. A fast iterative shrinkage- ing Stability Selection. BMC systems biology, 6(1):145,
thresholding algorithm for linear inverse problems. 2012.
SIAM J. Imaging Sci., 2(1):183–202, 2009.
Hiriart-Urruty, J.-B. and Lemaréchal, C. Convex analysis
Bickel, P. J., Ritov, Y., and Tsybakov, A. B. Simultaneous and minimization algorithms. I, volume 305. Springer-
analysis of Lasso and Dantzig selector. Ann. Statist., 37 Verlag, Berlin, 1993.
(4):1705–1732, 2009.
Kim, S.-J., Koh, K., Lustig, M., Boyd, S., and Gorinevsky,
Bonnefoy, A., Emiya, V., Ralaivola, L., and Gribonval, R.
D. An interior-point method for large-scale l1 -
A dynamic screening principle for the lasso. In EU-
regularized least squares. IEEE J. Sel. Topics Signal Pro-
SIPCO, 2014a.
cess., 1(4):606–617, 2007.
Bonnefoy, A., Emiya, V., Ralaivola, L., and Gribonval,
R. Dynamic Screening: Accelerating First-Order Algo- Liang, J., Fadili, J., and Peyré, G. Local linear conver-
rithms for the Lasso and Group-Lasso. ArXiv e-prints, gence of forward–backward under partial smoothness. In
2014b. NIPS, pp. 1970–1978, 2014.

Bühlmann, P. and van de Geer, S. Statistics for high- Lustig, M., Donoho, D. L., and Pauly, J. M. Sparse MRI:
dimensional data. Springer Series in Statistics. Springer, The application of compressed sensing for rapid MR
Heidelberg, 2011. Methods, theory and applications. imaging. Magnetic Resonance in Medicine, 58(6):1182–
1195, 2007.
Candès, E. J., Wakin, M. B., and Boyd, S. P. Enhancing
sparsity by reweighted l1 minimization. J. Fourier Anal. Mairal, J. Sparse coding for machine learning, image pro-
Applicat., 14(5-6):877–905, 2008. cessing and computer vision. PhD thesis, École normale
supérieure de Cachan, 2010.
Chambolle, A. and Pock, T. A first-order primal-dual algo-
rithm for convex problems with applications to imaging. Mairal, J. and Yu, B. Complexity analysis of the lasso reg-
J. Math. Imaging Vis., 40(1):120–145, 2011. ularization path. In ICML, 2012.
Chen, S. S., Donoho, D. L., and Saunders, M. A. Atomic Meinshausen, N. and Bühlmann, P. Stability selection. J.
decomposition by basis pursuit. SIAM J. Sci. Comput., Roy. Statist. Soc. Ser. B, 72(4):417–473, 2010.
20(1):33–61 (electronic), 1998.
Osborne, M. R., Presnell, B., and Turlach, B. A. A new
Efron, B., Hastie, T., Johnstone, I. M., and Tibshirani, R. approach to variable selection in least squares problems.
Least angle regression. Ann. Statist., 32(2):407–499, IMA J. Numer. Anal., 20(3):389–403, 2000.
2004. With discussion, and a rejoinder by the authors.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
El Ghaoui, L., Viallon, V., and Rabbani, T. Safe feature Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
elimination in sparse supervised learning. J. Pacific Op- Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cour-
tim., 8(4):667–698, 2012. napeau, D., Brucher, M., Perrot, M., and Duchesnay,
Fan, J. and Li, R. Variable selection via nonconcave penal- E. Scikit-learn: Machine learning in Python. J. Mach.
ized likelihood and its oracle properties. J. Amer. Statist. Learn. Res., 12:2825–2830, 2011.
Assoc., 96(456):1348–1360, 2001.
Rockafellar, R. T. and Wets, R. J.-B. Variational analysis,
Fan, J. and Lv, J. Sure independence screening for ultrahigh volume 317 of Grundlehren der Mathematischen Wis-
dimensional feature space. J. Roy. Statist. Soc. Ser. B, 70 senschaften [Fundamental Principles of Mathematical
(5):849–911, 2008. Sciences]. Springer-Verlag, Berlin, 1998.

Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite
Pathwise coordinate optimization. Ann. Appl. Stat., 1 sums with the stochastic average gradient. arXiv preprint
(2):302–332, 2007. arXiv:1309.2388, 2013.
Tibshirani, R. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.

Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. Strong rules for discarding predictors in lasso-type problems. J. Roy. Statist. Soc. Ser. B, 74(2):245–266, 2012.

Tibshirani, R. J. The lasso problem and uniqueness. Electron. J. Stat., 7:1456–1490, 2013.

Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3):475–494, 2001.

Varoquaux, G., Gramfort, A., and Thirion, B. Small-sample brain mapping: sparse recovery on spatially correlated designs with randomization and clustering. In ICML, 2012.

Wang, J., Zhou, J., Wonka, P., and Ye, J. Lasso screening rules via dual polytope projection. In NIPS, pp. 1070–1078, 2013.

Xiang, Z. J. and Ramadge, P. J. Fast lasso screening tests based on correlations. In ICASSP, pp. 2137–2140, 2012.

Xiang, Z. J., Xu, H., and Ramadge, P. J. Learning sparse representations of high dimensional data on large scale dictionaries. In NIPS, pp. 900–908, 2011.

Xiang, Z. J., Wang, Y., and Ramadge, P. J. Screening tests for lasso problems. arXiv preprint arXiv:1405.4897, 2014.

Xu, P. and Ramadge, P. J. Three structural results on the lasso problem. In ICASSP, pp. 3392–3396, 2013.

Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. Ann. Statist., 38(2):894–942, 2010.

Zhang, C.-H. and Zhang, T. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27(4):576–593, 2012.

Zou, H. The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc., 101(476):1418–1429, 2006.

Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. J. Roy. Statist. Soc. Ser. B, 67(2):301–320, 2005.

A. Supplementary materials
This Appendix provides further details on the theoretical results stated in the main text.

A.1. Dome test

Let us consider the case where the safe region $C$ is the dome $D(c, r, \alpha, w)$, with parameters: center $c$, radius $r$, relative distance ratio $\alpha$ and unit normal vector $w$.

The support function of the dome is computed as follows:
\[
\sigma_C(x_j) = \begin{cases}
c^\top x_j + r\|x_j\| & \text{if } w^\top x_j < -\alpha\|x_j\|,\\
c^\top x_j - r\alpha w^\top x_j + r\sqrt{(\|x_j\|^2 - |w^\top x_j|^2)(1-\alpha^2)} & \text{otherwise,}
\end{cases} \tag{21}
\]
and so
\[
\sigma_C(-x_j) = \begin{cases}
-c^\top x_j + r\|x_j\| & \text{if } -w^\top x_j < -\alpha\|x_j\|,\\
-c^\top x_j + r\alpha w^\top x_j + r\sqrt{(\|x_j\|^2 - |w^\top x_j|^2)(1-\alpha^2)} & \text{otherwise.}
\end{cases} \tag{22}
\]
With the previous display we can now compute $\mu_C(x_j) := \max(\sigma_C(x_j), \sigma_C(-x_j))$. Thanks to Eq. (6), we express our dome test as:
\[
\text{If } M_{\min} < c^\top x_j < M_{\max}, \text{ then } \hat\beta_j^{(\lambda)} = 0. \tag{23}
\]
With the former notation:
\[
M_{\max} = \begin{cases}
1 - r\|x_j\| & \text{if } w^\top x_j < -\alpha\|x_j\|,\\
1 + r\alpha w^\top x_j - r\sqrt{(\|x_j\|^2 - |w^\top x_j|^2)(1-\alpha^2)} & \text{otherwise,}
\end{cases} \tag{24}
\]
\[
M_{\min} = \begin{cases}
-1 + r\|x_j\| & \text{if } -w^\top x_j < -\alpha\|x_j\|,\\
-1 + r\alpha w^\top x_j + r\sqrt{(\|x_j\|^2 - |w^\top x_j|^2)(1-\alpha^2)} & \text{otherwise.}
\end{cases} \tag{25}
\]

We use the following dome parameters, for any $\theta \in \Delta_X$:

• Center: $c = (y/\lambda + \theta)/2$.

• Radius: $r = \check{R}_\lambda(\theta)/2$.

• Ratio: $\alpha = -1 + 2\hat{R}_\lambda(\theta)^2/\check{R}_\lambda(\theta)^2$.

• Normal vector: $w = (y/\lambda - \theta)/\check{R}_\lambda(\theta)$.

Recalling that the support function of a set coincides with the support function of its closed convex hull (Hiriart-Urruty & Lemaréchal, 1993, Proposition V.2.2.1), we only need to optimize over the dome introduced above. Therefore, one cannot improve our previous result by optimizing the problem on the intersection of the ball of radius $\check{R}_\lambda(\theta)$ and the complement of the ball of radius $\hat{R}_\lambda(\theta)$ (i.e., the blue region in Figure 2).
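For concreteness, the dome test above can be vectorized over all features at once. The following NumPy sketch is an illustration only: the helper name `dome_test` and its calling convention are ours, and the dome parameters `(c, r, alpha, w)` are assumed to be given (e.g., built from a dual feasible point as above).

```python
import numpy as np

def dome_test(X, c, r, alpha, w):
    """Vectorized dome test: True at index j means beta_j can safely be set to 0.

    Implements the condition M_min < c^T x_j < M_max of Eqs. (23)-(25) for the
    dome D(c, r, alpha, w) with center c, radius r, ratio alpha, unit normal w.
    """
    norms = np.linalg.norm(X, axis=0)   # ||x_j|| for each column j
    cx = X.T @ c                        # c^T x_j
    wx = X.T @ w                        # w^T x_j
    # common square root term of Eqs. (24)-(25); clip to avoid tiny negatives
    root = np.sqrt(np.maximum((norms ** 2 - wx ** 2) * (1 - alpha ** 2), 0.0))
    m_max = np.where(wx < -alpha * norms,
                     1 - r * norms,
                     1 + r * alpha * wx - r * root)
    m_min = np.where(-wx < -alpha * norms,
                     -1 + r * norms,
                     -1 + r * alpha * wx + r * root)
    return (m_min < cx) & (cx < m_max)
```

For a small radius and a center having all correlations $|c^\top x_j|$ well below 1, every feature is screened out, as expected.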

A.2. Proof of Theorem 1

Proof. Define $t = \max_{j \notin E_\lambda} |x_j^\top \hat\theta^{(\lambda)}| < 1$. Fix $\epsilon > 0$ such that $\epsilon < (1-t)/(\max_{j \notin E_\lambda} \|x_j\|)$. As $(C_k)_{k \in \mathbb{N}}$ is a converging sequence containing $\hat\theta^{(\lambda)}$, its diameter converges to zero, and there exists $k_0 \in \mathbb{N}$ such that $\forall k \ge k_0, \forall \theta \in C_k, \|\theta - \hat\theta^{(\lambda)}\| \le \epsilon$.

Hence, for any $j \notin E_\lambda$ and any $\theta \in C_k$, $|x_j^\top(\theta - \hat\theta^{(\lambda)})| \le (\max_{j \notin E_\lambda}\|x_j\|)\,\|\theta - \hat\theta^{(\lambda)}\| \le \epsilon\,(\max_{j \notin E_\lambda}\|x_j\|)$. Using the triangle inequality, one gets
\[
|x_j^\top \theta| \le \epsilon\Big(\max_{j \notin E_\lambda}\|x_j\|\Big) + \max_{j \notin E_\lambda}|x_j^\top \hat\theta^{(\lambda)}| \le \epsilon\Big(\max_{j \notin E_\lambda}\|x_j\|\Big) + t < 1,
\]
provided that $\epsilon < (1-t)/(\max_{j \notin E_\lambda}\|x_j\|)$. Thus, for any $k \ge k_0$, $E_\lambda^c \subset Z^{(\lambda)}(C_k) = A^{(\lambda)}(C_k)^c$ and $A^{(\lambda)}(C_k) \subset E_\lambda$.

For the reverse inclusion, take $j \in E_\lambda$, i.e., $|x_j^\top \hat\theta^{(\lambda)}| = 1$. Since for all $k \in \mathbb{N}$, $\hat\theta^{(\lambda)} \in C_k$, then $j \in A^{(\lambda)}(C_k) = \{j \in [p] : \max_{\theta \in C_k} |x_j^\top \theta| \ge 1\}$ and the result holds.
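The mechanism of the proof is easiest to see on spherical safe regions $C_k = B(\theta_c, r_k)$, for which $\max_{\theta \in C_k} |x_j^\top \theta| = |x_j^\top \theta_c| + r_k \|x_j\|$. A minimal NumPy sketch (the helper name `sphere_screen` is ours):

```python
import numpy as np

def sphere_screen(X, theta_c, r):
    """Return True at index j when max over the ball B(theta_c, r) of
    |x_j^T theta|, i.e. |x_j^T theta_c| + r * ||x_j||, stays below 1,
    so that feature j is screened out by the sphere test."""
    return np.abs(X.T @ theta_c) + r * np.linalg.norm(X, axis=0) < 1.0
```

As $r_k \to 0$ with $\theta_c \to \hat\theta^{(\lambda)}$, the discarded set grows to $E_\lambda^c$, which is the content of the theorem.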

A.3. Proof of Proposition 3

We detail here the proof of Proposition 3.

Proof. We first use the fact that
\[
G_{\lambda_{t-1}}(\beta, \theta) = \frac{1}{2}\|X\beta - y\|_2^2 + \lambda_{t-1}\|\beta\|_1 - \frac{1}{2}\|y\|_2^2 + \frac{\lambda_{t-1}^2}{2}\Big\|\theta - \frac{y}{\lambda_{t-1}}\Big\|_2^2,
\]
to obtain
\[
\|\beta\|_1 = \frac{1}{\lambda_{t-1}}\Big(\frac{1}{2}\|y\|_2^2 - \frac{1}{2}\|X\beta - y\|_2^2 - \frac{\lambda_{t-1}^2}{2}\Big\|\theta - \frac{y}{\lambda_{t-1}}\Big\|_2^2 + G_{\lambda_{t-1}}(\beta, \theta)\Big).
\]
Then,
\begin{align*}
G_{\lambda_t}(\beta,\theta) &= \frac{1}{2}\|X\beta - y\|_2^2 + \frac{\lambda_t}{\lambda_{t-1}}\Big(\frac{1}{2}\|y\|_2^2 - \frac{1}{2}\|X\beta - y\|_2^2 - \frac{\lambda_{t-1}^2}{2}\Big\|\theta - \frac{y}{\lambda_{t-1}}\Big\|_2^2 + G_{\lambda_{t-1}}(\beta, \theta)\Big)\\
&\quad - \frac{1}{2}\|y\|_2^2 + \frac{\lambda_t^2}{2}\Big\|\theta - \frac{y}{\lambda_t}\Big\|_2^2\\
&= \frac{1}{2}\Big(\frac{\lambda_t}{\lambda_{t-1}} - 1\Big)\|y\|_2^2 + \frac{1}{2}\Big(1 - \frac{\lambda_t}{\lambda_{t-1}}\Big)\|X\beta - y\|_2^2 + \frac{\lambda_t}{\lambda_{t-1}}G_{\lambda_{t-1}}(\beta, \theta) + \frac{1}{2}\Big(\|\lambda_t\theta - y\|_2^2 - \frac{\lambda_t}{\lambda_{t-1}}\|\lambda_{t-1}\theta - y\|_2^2\Big)\\
&= \frac{1}{2}\Big(\frac{\lambda_t}{\lambda_{t-1}} - 1\Big)\|y\|_2^2 + \frac{1}{2}\Big(1 - \frac{\lambda_t}{\lambda_{t-1}}\Big)\|X\beta - y\|_2^2 + \frac{\lambda_t}{\lambda_{t-1}}G_{\lambda_{t-1}}(\beta, \theta)\\
&\quad + \frac{1}{2}\Big(\|\lambda_t\theta - y\|_2^2 - \frac{\lambda_t}{\lambda_{t-1}}\big(\|\lambda_t\theta - y\|_2^2 + \|(\lambda_{t-1} - \lambda_t)\theta\|_2^2 + 2(\lambda_t\theta - y)^\top(\lambda_{t-1} - \lambda_t)\theta\big)\Big).
\end{align*}
We deal with the dot product as
\[
2\lambda_t(\lambda_{t-1} - \lambda_t)\Big(\theta - \frac{y}{\lambda_t}\Big)^\top \theta = \lambda_t(\lambda_{t-1} - \lambda_t)\Big(\|\theta\|_2^2 + \Big\|\theta - \frac{y}{\lambda_t}\Big\|_2^2 - \Big\|\frac{y}{\lambda_t}\Big\|_2^2\Big).
\]
Hence,
\begin{align*}
G_{\lambda_t}(\beta, \theta) &= \frac{1}{2}\Big(\frac{\lambda_t}{\lambda_{t-1}} - 1 + \frac{1}{\lambda_{t-1}}(\lambda_{t-1} - \lambda_t)\Big)\|y\|_2^2 + \frac{1}{2}\Big(1 - \frac{\lambda_t}{\lambda_{t-1}}\Big)\|X\beta - y\|_2^2\\
&\quad - \frac{\lambda_t}{2}(\lambda_{t-1} - \lambda_t)\|\theta\|_2^2 + \frac{1}{2}\Big(1 - \frac{\lambda_t}{\lambda_{t-1}} - \frac{1}{\lambda_{t-1}}(\lambda_{t-1} - \lambda_t)\Big)\|\lambda_t\theta - y\|_2^2 + \frac{\lambda_t}{\lambda_{t-1}}G_{\lambda_{t-1}}(\beta, \theta)\\
&= \frac{1}{2}\Big(1 - \frac{\lambda_t}{\lambda_{t-1}}\Big)\|X\beta - y\|_2^2 - \frac{\lambda_t}{2}(\lambda_{t-1} - \lambda_t)\|\theta\|_2^2 + \frac{\lambda_t}{\lambda_{t-1}}G_{\lambda_{t-1}}(\beta, \theta),
\end{align*}
since the coefficients of $\|y\|_2^2$ and $\|\lambda_t\theta - y\|_2^2$ both vanish. We observe in the end that
\[
\frac{2}{\lambda_t^2}\,G_{\lambda_t}(\beta, \theta) = \Big(1 - \frac{\lambda_t}{\lambda_{t-1}}\Big)\Big\|\frac{X\beta - y}{\lambda_t}\Big\|_2^2 - \Big(\frac{\lambda_{t-1}}{\lambda_t} - 1\Big)\|\theta\|_2^2 + \frac{2}{\lambda_t\lambda_{t-1}}\,G_{\lambda_{t-1}}(\beta, \theta).
\]
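Since the derivation is purely algebraic, the final identity holds for arbitrary $\beta$, $\theta$ and any positive $\lambda_t, \lambda_{t-1}$, which can be sanity-checked numerically. In the sketch below (random data, assumed shapes only), `gap` evaluates the Lasso duality gap from the first display of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 30
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)
theta = rng.standard_normal(n)

def gap(lam):
    """Lasso duality gap G_lambda(beta, theta) = primal - dual."""
    primal = 0.5 * np.sum((X @ beta - y) ** 2) + lam * np.sum(np.abs(beta))
    dual = 0.5 * np.sum(y ** 2) - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
    return primal - dual

lam_prev, lam_t = 1.5, 1.0  # lambda_{t-1} and lambda_t
lhs = 2 / lam_t ** 2 * gap(lam_t)
rhs = ((1 - lam_t / lam_prev) * np.sum(((X @ beta - y) / lam_t) ** 2)
       - (lam_prev / lam_t - 1) * np.sum(theta ** 2)
       + 2 / (lam_t * lam_prev) * gap(lam_prev))
# lhs and rhs agree up to floating point error
```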
Mind the duality gap: safer rules for the Lasso

A.4. Elastic-Net

The previously proposed tests can be adapted straightforwardly to the Elastic-Net estimator (Zou & Hastie, 2005). We provide here some more details for the interested reader. The Elastic-Net solves
\[
\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|X\beta - y\|_2^2 + \lambda\alpha\|\beta\|_1 + \frac{\lambda}{2}(1 - \alpha)\|\beta\|_2^2. \tag{26}
\]
One can reformulate this problem as a Lasso problem
\[
\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|\tilde{X}\beta - \tilde{y}\|_2^2 + \lambda\alpha\|\beta\|_1, \tag{27}
\]
where
\[
\tilde{X} = \begin{pmatrix} X \\ \sqrt{(1-\alpha)\lambda}\, I_p \end{pmatrix} \in \mathbb{R}^{(n+p) \times p} \quad \text{and} \quad \tilde{y} = \begin{pmatrix} y \\ 0 \end{pmatrix} \in \mathbb{R}^{n+p}.
\]
With this modification all the tests introduced for the Lasso can be adapted to the Elastic-Net.
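The equivalence between (26) and (27) is easy to check numerically for any fixed $\beta$ (NumPy sketch with arbitrary random data; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 15, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)
lam, alpha = 0.7, 0.6

# Augmented Lasso data: stack sqrt((1 - alpha) * lam) * I_p under X, pad y with zeros
X_tilde = np.vstack([X, np.sqrt((1 - alpha) * lam) * np.eye(p)])
y_tilde = np.concatenate([y, np.zeros(p)])

enet = (0.5 * np.sum((X @ beta - y) ** 2)
        + lam * alpha * np.sum(np.abs(beta))
        + 0.5 * lam * (1 - alpha) * np.sum(beta ** 2))
lasso = 0.5 * np.sum((X_tilde @ beta - y_tilde) ** 2) + lam * alpha * np.sum(np.abs(beta))
# enet == lasso up to floating point error
```

The lower block of $\tilde{X}$ contributes exactly the ridge term, so any Lasso solver (and any safe rule) can be run on $(\tilde{X}, \tilde{y})$ unchanged.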
