Peter Bühlmann
ETH Zürich
September 2009
High-dimensional data
Riboflavin production with Bacillus subtilis
(in collaboration with DSM (Switzerland))
goal: improve the riboflavin production rate of Bacillus subtilis
using clever genetic engineering
response variable Y ∈ R: riboflavin (log-)production rate
covariates X ∈ R^p: expression measurements of p = 4088 genes
sample size n = 115, p ≫ n
[Figure: scatterplots of the log production rate y (roughly −10 to −6) against the expression values of several individual genes]
Z1, . . . , Zn i.i.d. or stationary
dim(Zi) ≫ n
for example:
Zi = (Xi, Yi), Xi ∈ R^p, Yi ∈ R: regression with p ≫ n
Zi = (Xi, Yi), Xi ∈ R^p, Yi ∈ {0, 1}: classification with p ≫ n
numerous applications:
biology, imaging, economics, environmental sciences, ...
High-dimensional linear models
Yi = α + Σ_{j=1}^p βj Xi^(j) + εi ,   i = 1, . . . , n,   with p ≫ n
in short: Y = Xβ + ε
goals:
▶ prediction, e.g. w.r.t. squared prediction error
▶ variable selection,
  i.e. estimating the effective variables
  (those having corresponding coefficient ≠ 0)
Prediction
Y = Xβ + ε,   n = 287, p = 195
Y = Xβ + ε,   p large; or p ≫ n
we need to regularize...
and there are many proposals:
▶ Bayesian methods for regularization
▶ greedy algorithms: aka forward selection or boosting
▶ preliminary dimension reduction
▶ ...
e.g. 2'650'000 entries on Google Scholar for
“high dimensional linear model” ...
Penalty-based methods
‖β⁰‖_1 = Σ_{j=1}^p |β⁰_j|
⇝ penalize with the ‖ · ‖_1-norm, i.e. the Lasso:
argmin_β ( ‖Y − Xβ‖_2²/n + λ‖β‖_1 )
⇝ convex optimization:
computationally feasible and very fast for large p
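A minimal sketch of fitting this estimator on simulated data with p ≫ n (illustrative only, not code from the slides; note that scikit-learn's Lasso uses the parameterization ‖Y − Xβ‖_2²/(2n) + α‖β‖_1, so its α corresponds to λ/2 above):

```python
# Sketch: Lasso on simulated p >> n data (riboflavin-like dimensions, illustrative).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s0 = 115, 4088, 3                       # hypothetical sizes and sparsity
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s0] = 1.0                               # s0 truly active variables
y = X @ beta + 0.5 * rng.standard_normal(n)

fit = Lasso(alpha=0.1, max_iter=10_000).fit(X, y)   # alpha ~ lambda / 2
S_hat = np.flatnonzero(fit.coef_)                    # estimated active set
print(len(S_hat), fit.coef_[:5])
```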
The Lasso (Tibshirani, 1996)
left: the ℓ1-“world”:
the residual sum of squares reaches a minimal value (for certain
constellations of the data) when its contour lines hit the ℓ1-ball in one of its
corners
⇝ β̂_1 = 0
the ℓ2-“world” is different:
Ridge regression,
β̂_Ridge(λ) = argmin_β ( ‖Y − Xβ‖_2²/n + λ‖β‖_2² ),
for Y = Xβ + ε with XᵀX = I
Adaptive Lasso
[Figure: hard-thresholding and soft-thresholding functions plotted against z ∈ [−3, 3]]
Lasso for prediction: x_newᵀ β̂(λ)
an older formulation:
Theorem (Meinshausen & PB, 2004 (publ: 2006))
▶ sufficient and necessary neighborhood stability condition
  on the design X; see also Zhao & Yu (2006)
▶ p = p_n is growing with n
▶ p_n = O(n^α) for some 0 < α < ∞ (high-dimensionality)
▶ |S_true,n| = |S_0,n| = O(n^κ) for some 0 < κ < 1 (sparsity)
▶ the non-zero βj’s are outside the n^(−1/2)-range
▶ Y, X^(j)’s Gaussian (not crucial)
Then, if λ = λ_n ∼ const. n^(−1/2−δ/2) (0 < δ < 1/2):
P[Ŝ(λ) = S_0] = 1 − O(exp(−C n^(1−δ)))   (n → ∞)
≈ 1 even for relatively small n
Problem 1: n⁻¹XᵀX → Σ
hence: max_j |β̂_j − β_j| ≤ ‖β̂ − β‖_1 ≤ C √(log(p) s_0 / n)
and if min{|β_j|; β_j ≠ 0} > C √(log(p) s_0 / n):
Ŝ ⊇ S_0
Ŝ(λ_opt) ⊇ S_0
⇝ the Lasso is an excellent screening procedure
[Figure: estimated Lasso coefficients plotted against the variable index (two panels, two independent realizations)]
original data
[Figure: estimated coefficients (up to ≈ 0.20) plotted against the variables]
Ŝ ⊇ S_0
threshold functions
[Figure: hard-thresholding, adaptive Lasso (with β̂_init = OLS) and soft-thresholding functions plotted against z ∈ [−3, 3]]
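A small sketch of the three threshold functions in the figure for the orthonormal-design case; the adaptive-Lasso curve assumes β̂_init = z (the OLS estimate in that case) and penalty weight 1/|z|, one common parameterization (an assumption, not spelled out on the slide):

```python
# Sketch: threshold functions as in the figure above, with threshold t.
import numpy as np

def hard_threshold(z, t):
    return z * (np.abs(z) > t)

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def adaptive_lasso_threshold(z, t):
    # solves argmin_b (z - b)^2 + 2 t^2 |b| / |z|  (assumption: beta_init = z)
    with np.errstate(divide="ignore", invalid="ignore"):
        out = np.sign(z) * np.maximum(np.abs(z) - t**2 / np.abs(z), 0.0)
    return np.where(z == 0, 0.0, out)

z = np.linspace(-3, 3, 601)
curves = {f.__name__: f(z, 1.0) for f in
          (hard_threshold, soft_threshold, adaptive_lasso_threshold)}
```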
s0 = 3, p = 10 000, n = 50
same 2 independent realizations from before
Adaptive Lasso
[Figure: adaptive-Lasso coefficient estimates plotted against the variables (two panels for the two independent realizations, two panels for the original data)]
β̂_init,j = 0  ⇒  β̂_j = 0,
since
β̂ = argmin_β ( ‖Y − Xβ‖_2²/n + λ Σ_{j=1}^p |β_j| / |β̂_init,j| )
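A sketch of a two-stage adaptive Lasso (assumptions: LassoCV as the initial estimator and scikit-learn as the solver; with p < n, OLS could serve as β̂_init instead). The reweighted ℓ1 penalty is handled by an ordinary Lasso solver after rescaling column j of X by |β̂_init,j|:

```python
# Sketch (not the slides' own code): adaptive Lasso via column rescaling.
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

def adaptive_lasso(X, y, alpha, beta_init=None):
    if beta_init is None:
        beta_init = LassoCV(cv=5, max_iter=10_000).fit(X, y).coef_   # initial estimator
    active = np.flatnonzero(beta_init)                # beta_init_j = 0  =>  beta_j = 0
    if active.size == 0:
        return np.zeros(X.shape[1])
    Xw = X[:, active] * np.abs(beta_init[active])     # rescale columns by |beta_init_j|
    gamma = Lasso(alpha=alpha, max_iter=10_000).fit(Xw, y).coef_
    beta = np.zeros(X.shape[1])
    beta[active] = np.abs(beta_init[active]) * gamma  # transform back to beta-scale
    return beta
```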
Lemma
Denote: G(β) = −2Xᵀ(Y − Xβ)/n (gradient of n⁻¹‖Y − Xβ‖_2²)
Then: a necessary and sufficient condition for β̂ to be a Lasso solution is
G_j(β̂) = −sign(β̂_j) λ  if β̂_j ≠ 0,   |G_j(β̂)| ≤ λ  if β̂_j = 0.
Moreover:
if the Lasso solution is not unique (e.g. if p > n) and |G_j(β̂)| < λ for
some solution β̂, then β̂_j = 0 for all Lasso solutions,
i.e. the zeroes are unique (and hence estimated variable
selection is “well-defined”)
Karush–Kuhn–Tucker (KKT) conditions
∂Q_λ(β)/∂β_j |_{β=β̂(λ)} = ( −2X_jᵀ(Y − Xβ)/n + λ sign(β_j) ) |_{β=β̂(λ)} = 0   for β̂_j(λ) ≠ 0
⇔  G_j(β̂(λ)) = −λ sign(β̂_j(λ))   for β̂_j(λ) ≠ 0
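A small numerical check of these conditions for a candidate solution β̂ (a sketch; λ refers to the penalty parameter in Q_λ above, and G(β) is as defined in the Lemma):

```python
# Sketch: verify the KKT conditions for a candidate Lasso solution beta_hat.
import numpy as np

def kkt_check(X, y, beta_hat, lam, tol=1e-6):
    n = X.shape[0]
    G = -2.0 * X.T @ (y - X @ beta_hat) / n                     # gradient of the loss part
    nonzero = beta_hat != 0
    ok_active = np.allclose(G[nonzero], -lam * np.sign(beta_hat[nonzero]), atol=tol)
    ok_inactive = np.all(np.abs(G[~nonzero]) <= lam + tol)
    return ok_active and ok_inactive
```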
[Figure: Lasso coefficient paths; standardized coefficients plotted against |beta|/max|beta|, with variables 1131, 1855, 2874, 3669, 4004, 1762, 73 and 4003 labelled]
Coordinatewise optimization and shooting algorithms
general idea: compute a solution β̂(λ_grid,k) and use it as a
starting value for the computation of β̂(λ_grid,k−1), where λ_grid,k−1 < λ_grid,k
Coordinatewise algorithm
⇝ componentwise soft-thresholding
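A generic sketch of such a coordinate descent (not the slides' own code), for Q_λ(β) = ‖Y − Xβ‖_2²/n + λ‖β‖_1, with warm starts along a decreasing λ-grid as described above; it assumes no all-zero columns in X:

```python
# Sketch: coordinate descent with componentwise soft-thresholding, plus warm starts.
import numpy as np

def soft(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, beta0=None, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    col_ss = (X ** 2).sum(axis=0) / n                 # X_j^T X_j / n
    r = y - X @ beta                                  # current residual
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * beta[j]                 # partial residual without variable j
            z = X[:, j] @ r / n
            beta[j] = soft(z, lam / 2.0) / col_ss[j]  # coordinatewise soft-thresholding
            r = r - X[:, j] * beta[j]
    return beta

def lasso_path(X, y, lam_grid):
    # beta_hat(lam_grid,k) is the warm start for lam_grid,k-1 < lam_grid,k
    path, beta = {}, None
    for lam in sorted(lam_grid, reverse=True):
        beta = lasso_cd(X, y, lam, beta0=beta)
        path[lam] = beta.copy()
    return path
```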
Group-Lasso penalty: λ Σ_{g=1}^G s(df_Gg) ‖β_Gg‖_2
typically s(df_Gg) = √(df_Gg), so that s(df_Gg) ‖β_Gg‖_2 = O(df_g)
properties of the Group-Lasso penalty
▶ for group sizes |Gg| ≡ 1 ⇝ standard Lasso penalty
▶ convex penalty ⇝ convex optimization for standard
  likelihoods (exponential family models)
▶ either (β̂_G(λ))_j = 0 for all j ∈ G, or (β̂_G(λ))_j ≠ 0 for all j ∈ G
▶ the penalty is invariant under orthonormal transformations,
  e.g. invariant when requiring an orthonormal parameterization
  for factors
Some aspects from theory
“again”:
▶ optimal prediction and estimation (oracle inequality)
▶ group screening: Ŝ ⊇ S_0 with high probability,
  where S_0 is the set of active groups
and expand
f_j(x^(j)) = Σ_{k=1}^n β_k^(j) B_k^(j)(x^(j)),
where (β_Gj)_k = β_k^(j) and the B_k^(j) are basis functions;
f_j(·) smooth ⇒ “smoothness” of β_Gj
Computation and KKT
criterion function:
Q_λ(β) = n⁻¹ Σ_{i=1}^n ρ_β(x_i, Y_i) + λ Σ_{g=1}^G s(df_g) ‖β_g‖_2,
where the loss function ρ_β(·, ·) is convex in β
KKT conditions:
∇ρ(β̂)_g + λ s(df_g) β̂_Gg / ‖β̂_Gg‖_2 = 0   if β̂_Gg ≠ 0 (not the 0-vector),
‖∇ρ(β̂)_g‖_2 ≤ λ s(df_g)   if β̂_Gg ≡ 0.
Block coordinate descent algorithm
for the Gaussian log-likelihood (squared error loss):
blockwise updates are easy and closed-form solutions exist
(use the KKT conditions)
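A minimal sketch of such a blockwise update for the squared-error loss, under the additional assumption that the columns within each group are orthonormalized (X_gᵀX_g/n = I); the closed-form update is then a groupwise soft-thresholding derived from the KKT conditions above:

```python
# Sketch: block coordinate descent for the Group Lasso with squared-error loss.
# Assumption: X_g^T X_g / n = I within each group (orthonormalized groups).
import numpy as np

def group_lasso_bcd(X, y, groups, lam, n_sweeps=100):
    # groups: list of column-index arrays, one per group G_g; s(df_g) = sqrt(|G_g|)
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta
    for _ in range(n_sweeps):
        for g in groups:
            s_g = np.sqrt(len(g))
            r = r + X[:, g] @ beta[g]            # partial residual without group g
            z = X[:, g].T @ r / n
            norm_z = np.linalg.norm(z)
            if norm_z <= lam * s_g / 2.0:
                beta[g] = 0.0                    # KKT: whole group set to zero
            else:
                beta[g] = (1.0 - lam * s_g / (2.0 * norm_z)) * z
            r = r - X[:, g] @ beta[g]
    return beta
```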
. . . ACGGC . . . EEE GC IIII . . . AAC . . .
potential donor site: 3 positions in the exon, the conserved GC, and 4 positions in the intron
response Y ∈ {0, 1}: splice or non-splice site
predictor variables: 7 factors, each having 4 levels
(full dimension: 4^7 = 16'384)
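To make the setup concrete, a hypothetical sketch (not the original preprocessing) of how the 7 four-level factors could be dummy-encoded, with one group per factor, for a Group-Lasso solver such as the block coordinate descent sketch above:

```python
# Sketch: encode each sequence position (factor with levels A, C, G, T) as one group.
import numpy as np

LEVELS = np.array(list("ACGT"))

def encode_factors(seqs):
    # seqs: array of shape (n, 7) with characters from {A, C, G, T}
    seqs = np.asarray(seqs)
    n, n_pos = seqs.shape
    X, groups, col = [], [], 0
    for j in range(n_pos):
        dummies = (seqs[:, j][:, None] == LEVELS[None, :]).astype(float)  # n x 4
        X.append(dummies)
        groups.append(np.arange(col, col + 4))      # group G_j = columns of factor j
        col += 4
    return np.hstack(X), groups

# usage (after centring / orthonormalizing within groups, e.g. via a QR step):
# X, groups = encode_factors(seqs); beta = group_lasso_bcd(X, y, groups, lam=0.1)
```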
data:
[Figure: ℓ2-norms of the estimated group coefficients for the main-effect terms 1–7 and for the 2-way and 3-way interaction terms, comparing GL/R and GL/MLE]
⇝ represent
Σ_{j=1}^p f_j(x^(j)) = Σ_{j=1}^p Σ_{k=1}^m β_k^(j) B_k^(j)(x^(j))
sparsity-smoothness penalty (SSP):
λ_1 Σ_{j=1}^p √( ‖f_j‖_2² + λ_2 I²(f_j) ),
I²(f_j) = ∫ (f_j''(x))² dx,
where f_j = (f_j(X_1^(j)), . . . , f_j(X_n^(j)))ᵀ
⇝ the SSP penalty does variable selection (f̂_j ≡ 0 for some j)
the penalty equals
= λ_1 Σ_{j=1}^p √( β_jᵀ B_jᵀB_j β_j + λ_2 β_jᵀ Ω_j β_j ),   with B_jᵀB_j = Σ_j and Ω_j the integrated second derivatives
= λ_1 Σ_{j=1}^p √( β_jᵀ (Σ_j + λ_2 Ω_j) β_j ),   A_j = A_j(λ_2) = Σ_j + λ_2 Ω_j
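The last display suggests a standard reparameterization (an assumption here, not necessarily how the original software proceeds): with A_j = Σ_j + λ_2 Ω_j and any R_j satisfying R_jᵀR_j = A_j (e.g. a Cholesky factor), the SSP penalty equals ‖R_j β_j‖_2, i.e. a Group-Lasso penalty in β̃_j = R_j β_j. A minimal sketch, assuming each A_j is positive definite:

```python
# Sketch: turn the SSP penalty into a Group-Lasso penalty via Cholesky factors.
import numpy as np

def ssp_to_group_lasso(B_list, Omega_list, lam2):
    # B_list[j]: n x m basis matrix for f_j; Omega_list[j]: m x m integrated 2nd derivatives
    B_tilde, R_list = [], []
    for B_j, Omega_j in zip(B_list, Omega_list):
        Sigma_j = B_j.T @ B_j                      # Sigma_j as in the display above
        A_j = Sigma_j + lam2 * Omega_j
        R_j = np.linalg.cholesky(A_j).T            # upper triangular, R_j' R_j = A_j
        B_tilde.append(B_j @ np.linalg.inv(R_j))   # transformed design for group j
        R_list.append(R_j)
    # solve a Group Lasso in beta_tilde; recover beta_j = R_j^{-1} beta_tilde_j
    return B_tilde, R_list
```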
n = 287, p = 196
[Figure: estimated partial effects (about −0.6 to 0.8) plotted against the covariates Motif.P1.6.23 and Motif.P1.6.26]
Motif regression
for finding HIF1α transcription factor binding sites in DNA sequences
really?
do we trust our selection algorithm?
how stable are the findings?
P-values for high-dimensional regression
estimated coefficients β̂(λ̂_CV)
original data
[Figure: estimated coefficients β̂(λ̂_CV) plotted against the variables]
stability check: subsampling with subsample size ⌊n/2⌋
[Figure: estimated coefficients on subsamples of size ⌊n/2⌋, plotted against the variables]
original data / subsampled mean
[Figure: estimated coefficients on the original data and averaged over the subsamples, plotted against the variables]
⇝ only 2 “stable” findings
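A minimal sketch of such a subsampling stability check (a hypothetical helper, not the original code): refit the Lasso on random subsamples of size ⌊n/2⌋ and record how often each variable is selected.

```python
# Sketch: relative selection frequencies over B subsamples of size floor(n/2).
import numpy as np
from sklearn.linear_model import LassoCV

def selection_frequencies(X, y, B=100, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(B):
        idx = rng.choice(n, size=n // 2, replace=False)        # subsample of size floor(n/2)
        coef = LassoCV(cv=5, max_iter=10_000).fit(X[idx], y[idx]).coef_
        counts += coef != 0
    return counts / B                                          # relative selection frequency
```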
Bonferroni-corrected P-values:
P_corr^(j) = min( P^(j) · |Ŝ|, 1 )
[Figure: histograms of the adjusted P-values (frequency versus adjusted P-value)]
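A sketch of one sample-splitting round in this spirit (an illustration under simplifying assumptions, not the authors' reference implementation): select Ŝ with the Lasso on one half of the data, compute OLS P-values for j ∈ Ŝ on the other half, and Bonferroni-correct by |Ŝ|.

```python
# Sketch: single-split, Bonferroni-corrected P-values (assumes |S_hat| < n/2).
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

def split_pvalues(X, y, rng):
    n, p = X.shape
    idx = rng.permutation(n)
    i1, i2 = idx[: n // 2], idx[n // 2 :]
    S_hat = np.flatnonzero(LassoCV(cv=5, max_iter=10_000)
                           .fit(X[i1], y[i1]).coef_)            # selection on first half
    p_corr = np.ones(p)                                         # P_corr = 1 for j not in S_hat
    if S_hat.size > 0:
        ols = sm.OLS(y[i2], sm.add_constant(X[i2][:, S_hat])).fit()  # inference on second half
        raw = ols.pvalues[1:]                                   # drop the intercept
        p_corr[S_hat] = np.minimum(raw * S_hat.size, 1.0)       # Bonferroni by |S_hat|
    return p_corr
```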
goal:
aggregation of P_corr,1^(j), . . . , P_corr,B^(j) to a single P-value P_final^(j)
problem: dependence among P_corr,1^(j), . . . , P_corr,B^(j)
define
Q^(j)(γ) = q_γ( { P_corr,b^(j) / γ ;  b = 1, . . . , B } ),
where q_γ(·) is the empirical γ-quantile function
⇝ reject H_0,j: β_j = 0  ⇐⇒  P_final^(j) ≤ α
P_final^(j) equals roughly a raw P-value based on sample size ⌊n/2⌋,
multiplied by ...
lim sup_{n→∞} P( min_{j∈Sᶜ} P_final^(j) ≤ α ) ≤ α
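A sketch of the aggregation step for one fixed γ (e.g. γ = 0.5, i.e. the median; fixing γ and capping the result at 1 are assumptions here, following the definition of Q^(j)(γ) above):

```python
# Sketch: quantile aggregation of corrected P-values over B splitting rounds.
import numpy as np

def aggregate_pvalues(p_corr_matrix, gamma=0.5):
    # p_corr_matrix: B x p array with P_corr,b^(j) from B sample-splitting rounds
    q = np.quantile(p_corr_matrix / gamma, gamma, axis=0)   # Q^(j)(gamma)
    return np.minimum(q, 1.0)                                # cap at 1 to remain a P-value

# usage (hypothetical):
# p_final = aggregate_pvalues(np.vstack([split_pvalues(X, y, rng) for _ in range(50)]))
```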
based on the ordered P_final^(j)’s from before
⇝ control of the FDR for multiple testing of regression coefficients
with p ≫ n
(Meinshausen, Meier & PB, 2008)
the assumptions on the selector Ŝ are satisfied for
▶ the Lasso,
  • assuming compatibility conditions on the design X
  • assuming sparsity of the true regression coefficients
Lasso with CV
[Figure: mean number of true positives (0–5) for the methods labelled M and S]
Motif regression
p = 195, n = 287
for α = 0.05, only one variable/motif j̃ remains:
P_final^(j̃) = 0.0059 (= 0.59%)
in this application:
we are rather concerned about false positive findings
⇝ (conservative) P-values are very useful
From another perspective
using the Lasso:
1. on the first half of the sample: with high probability,
Ŝ ⊇ S_subst;C ,
S_subst;C = { j ; |β_j| ≥ C },   C ≥ const. √(log(p)/n)