joint work with Jerome Friedman, Rob Tibshirani and Noah Simon
Forms of Regularization
[Figure: two constraint regions in the (β1, β2) plane (the ℓ1 diamond and the ℓ2 disk), each shown with the unconstrained estimate β̂ and elliptical contours of the loss.]
[Figure: lasso coefficient paths: standardized coefficients (−500 to 500) for 10 predictors, labeled 1–10, plotted against ||β̂(λ)||1/||β̂(0)||1.]
Lasso: $$\hat\beta(\lambda) = \operatorname{argmin}_{\beta}\; \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^T\beta\right)^2 + \lambda\|\beta\|_1$$
Fit using the lars package in R (Efron, Hastie, Johnstone & Tibshirani 2004).
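A minimal sketch of producing such a path with lars; the data here are simulated, not the example plotted above:

library(lars)
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)             # 100 observations, 10 predictors
y <- drop(x[, 1:3] %*% c(3, -2, 1)) + rnorm(100)  # sparse true model
fit <- lars(x, y, type = "lasso")                 # entire lasso path
plot(fit)  # standardized coefficients vs the fraction ||beta(lambda)||_1 / ||beta(0)||_1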
[Figure: two panels of coefficient profiles for the prostate cancer data; the predictors lcavol, lweight, svi, lbph, pgg45, gleason, age, and lcp are labeled, with coefficients ranging from about −0.2 to 0.6.]
[Figure: cross-validated Poisson deviance (about 1.2 to 1.5) versus log(λ).]
K-fold cross-validation is easy and fast. Here K=10, and the true
model had 10 out of 100 nonzero coefficients.
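A minimal sketch of this with cv.glmnet, simulating data to match the setup (10 of 100 coefficients nonzero, Poisson response):

library(glmnet)
set.seed(1)
n <- 500; p <- 100
x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(0.3, 10), rep(0, p - 10))   # 10 of 100 coefficients nonzero
y <- rpois(n, exp(1 + drop(x %*% beta)))  # Poisson responses
cvfit <- cv.glmnet(x, y, family = "poisson", nfolds = 10)
plot(cvfit)        # CV Poisson deviance versus log(lambda)
cvfit$lambda.min   # lambda minimizing the CV deviance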
Efficient path algorithms for β̂(λ) allow for easy and exact
cross-validation and model selection.
• In 2004 the LARS algorithm (Efron et al.) provided a way to compute the entire lasso coefficient path efficiently, at the cost of a full least-squares fit.
• 2004–2008: path algorithms popped up for a wide variety of related problems: grouped lasso (Yuan & Lin 2006), support-vector machine (Hastie, Rosset, Tibshirani & Zhu 2004), elastic net (Zou & Hastie 2005), quantile regression (Li & Zhu 2007), logistic regression and GLMs (Park & Hastie 2007), Dantzig selector (James & Radchenko 2008), ...
• Many of these do not enjoy the piecewise-linearity of LARS,
and seize up on very large problems.
[Figure: coefficient paths (about −40 to 20) versus L1 norm.]
glmnet package in R
Fits coefficient paths for a variety of different GLMs and the elastic
net family of penalties.
Some features of glmnet:
• Models: linear, logistic, multinomial (grouped or not), Poisson,
Cox model, and multiple-response grouped linear.
• Elastic net penalty includes ridge and lasso, and hybrids in
between (more to come)
• Speed!
• Can handle a large number of variables p; along with exact screening rules we can fit GLMs on a GWAS scale (more to come).
• Cross-validation functions for all models.
• Can allow sparse matrix formats for X, and hence massive problems (see the sketch below).
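As an illustration of the sparse-matrix support, a sketch with simulated data (the sizes and names here are illustrative only, not from the talk):

library(glmnet)
library(Matrix)
set.seed(1)
x <- rsparsematrix(1000, 5000, density = 0.01)  # mostly-zero predictor matrix
beta <- c(rep(3, 10), rep(0, 4990))             # 10 true signals
y <- rbinom(1000, 1, plogis(as.numeric(x %*% beta)))
fit <- glmnet(x, y, family = "binomial")        # x stays sparse throughout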
Coordinate descent: holding all the other coefficients fixed, the lasso update for coefficient j soft-thresholds its partial-residual least-squares value β∗j:
$$\beta_j \leftarrow S(\beta_j^*, \lambda) = \operatorname{sign}(\beta_j^*)\left(|\beta_j^*| - \lambda\right)_+$$
[Figure: the soft-thresholding function S(·, λ), equal to zero on (−λ, λ) and passing through (0,0).]
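As a didactic sketch built on this update (not glmnet's implementation, which adds warm starts, active sets, and other tricks), naive coordinate descent for the lasso:

soft <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# minimizes (1/(2N)) ||y - x %*% b||^2 + lambda ||b||_1, assuming the
# columns of x are centered and scaled to (approximately) unit variance
lasso_cd <- function(x, y, lambda, n_sweeps = 100) {
  N <- nrow(x); p <- ncol(x)
  b <- rep(0, p)
  r <- y - mean(y)                         # residual; intercept = mean(y)
  for (sweep in 1:n_sweeps) {
    for (j in 1:p) {
      bstar <- b[j] + sum(x[, j] * r) / N  # partial-residual LS coefficient
      bnew <- soft(bstar, lambda)          # the soft-threshold update above
      r <- r - x[, j] * (bnew - b[j])      # O(N) residual update
      b[j] <- bnew
    }
  }
  b
}

# usage: x <- scale(x); b <- lasso_cd(x, y, lambda = 0.1)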
[Figure: three panels of coefficient paths, each with coefficients from about −0.1 to 0.3.]
Genome analysis
[Figure: number of predictors remaining after filtering (0 to 5000), comparing the SAFE, recursive SAFE, global STRONG, and sequential STRONG screening rules.]
Multiclass classification
Microarray classification (Pathwork Diagnostics): 3220 samples, 22K genes. A multinomial regression model with 17 × 22K = 374K parameters, fit with the elastic net (α = 0.25).
[Figure: misclassification error versus log(λ).]
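A sketch of such a fit with glmnet, with the dimensions shrunk and noise data (an interface illustration only, not the model fit in the talk):

library(glmnet)
set.seed(1)
n <- 340; p <- 1000; K <- 17
x <- matrix(rnorm(n * p), n, p)
y <- factor(sample(1:K, n, replace = TRUE))
cvfit <- cv.glmnet(x, y, family = "multinomial", alpha = 0.25,
                   type.measure = "class")  # misclassification error
plot(cvfit)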
Soo-Yon Rhee, Jonathan Taylor, Gauhar Wadhera, Asa Ben-Hur, Douglas L. Brutlag, and Robert W. Shafer. Genotypic predictors of human immunodeficiency virus type 1 drug resistance. PNAS, published online Oct 25, 2006; doi:10.1073/pnas.0607274103.
> require(glmnet)
> fit = glmnet(xtr, ytr, standardize = FALSE)
> plot(fit)
[Figure: plot(fit) output: coefficient paths (about −1 to 2) versus L1 norm.]
[Figure: mean-squared error versus log(λ), comparing the 10-fold CV curve with the test curve.]
[Figure: two sets of coefficient paths versus log λ; the counts along the top (195, 121, 36, 5, 1 and 204, 165, 86, 15, 2) are the numbers of nonzero coefficients.]
[Figure: histogram of ytr.]
[Figure: coefficient paths versus L1 norm; the top axis (0 to 24) counts nonzero coefficients.]
[Figure: cross-validated AUC (about 0.75 to 1.00) versus log(λ); the top axis counts nonzero coefficients.]
Inference?
[Figure: two panels indexed by β1–β10; the left panel shows coefficient values from about −0.6 to 0.6, the right panel values on a 0-to-1 scale.]
Covariance Test
[Figure: lasso coefficient paths versus −log(λ), with knots λ1, …, λ5 marking where successive variables enter the model.]
Under the null hypothesis that all signal variables are in the model:
$$\frac{1}{\sigma^2}\left(\langle y,\, X\hat\beta(\lambda_{j+1})\rangle - \langle y,\, X_A\hat\beta_A(\lambda_{j+1})\rangle\right) \;\to\; \operatorname{Exp}(1) \quad \text{as } p, n \to \infty$$
[Figure: quantile-quantile plot of the test statistic against Exp(1).]
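In practice the test is available in the covTest package (Lockhart, Taylor, Tibshirani & Tibshirani); a sketch assuming that package's interface, which consumes a lars fit:

library(lars)
library(covTest)
set.seed(1)
x <- scale(matrix(rnorm(100 * 10), 100, 10))
y <- drop(x[, 1:2] %*% c(3, -2)) + rnorm(100)
fit <- lars(x, y)
covTest(fit, x, y)  # Exp(1)-based p-value as each variable enters the path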
Other Applications
[Figure: estimated networks on nodes Raf, Mek, and Jnk at λ = 36, 27, 7, and 0; the graphs gain edges as λ decreases.]
$$\eta(X) = X_1\beta_1 + X_1\theta_1 + X_2\theta_2$$
with penalty
$$|\beta_1| + \sqrt{\theta_1^2 + \theta_2^2}$$
Note: the coefficient of X1 is nonzero if either group is nonzero; this allows one to enforce hierarchy.
• Interactions with weak or strong hierarchy: an interaction is present only when its main effect(s) are present (w.i.p. with student Michael Lim; see the sketch below).
• Sparse additive models: overlap the linear part of a spline with the non-linear part. This allows a "sticky" null term, linear term, or smooth term (with varying smoothness) for each variable. (w.i.p. with student Alexandra Chouldechova)
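The interactions work later became the glinternet package; a minimal sketch assuming its released interface (an assumption, since the talk predates the package):

library(glinternet)
set.seed(1)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + 3 * X[, 1] * X[, 2] + rnorm(n)
numLevels <- rep(1, p)              # all predictors continuous
fit <- glinternet(X, y, numLevels)  # main effects + hierarchical interactions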
[Figure: a constraint region in the (B1, B2) plane, with the origin (0,0) marked.]
CGH modeling and the fused lasso. Here the penalty has the form
$$\sum_{j=1}^{p}|\beta_j| + \alpha\sum_{j=1}^{p-1}|\beta_{j+1} - \beta_j|.$$
This is not additive, so a modified coordinate descent algorithm is required (FHT + Hoefling 2007).
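To make the formula concrete, the penalty itself is a few lines of R (fused_penalty is a hypothetical helper, not from any package):

# sparsity term plus total-variation (fusion) term
fused_penalty <- function(beta, alpha) {
  sum(abs(beta)) + alpha * sum(abs(diff(beta)))
}
fused_penalty(c(0, 0, 2, 2, 2, 0), alpha = 1)  # 6 + 1 * 4 = 10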
[Figure: fused-lasso fit to CGH data: log2 ratio (−2 to 4) versus genome order.]
Matrix Completion
$$\min_{Z}\ \sum_{(i,j)\ \mathrm{observed}} \left(X_{ij} - Z_{ij}\right)^2 \quad \text{subject to } \operatorname{rank}(Z) = r$$
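A convex relative of this problem (a nuclear-norm penalty in place of the rank constraint) is solved by the softImpute package (Mazumder, Hastie & Tibshirani); a sketch with simulated data:

library(softImpute)
set.seed(1)
X <- matrix(rnorm(100 * 50), 100, 50)
X[sample(length(X), 2000)] <- NA                # 40% of entries missing
fit <- softImpute(X, rank.max = 5, lambda = 1)  # low-rank fit to observed entries
Xhat <- complete(X, fit)                        # impute the missing entries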
Thank You!