

Consistent and Scalable Bayesian Model Selection for High Dimensional Data

Naveen Naidu Narisetty

Department of Statistics
University of Michigan, Ann Arbor

December 11, 2015



Other Work (A small Detour)

Functional Data Depth – proposed a new notion of depth, studied its properties and applications to simultaneous inference
Ref: Narisetty, N.N., Nair, V.N. Extremal Notion of Depth for Functional Data and Applications. JASA Theory & Methods, To Appear

Large-Scale Computer Models – worked on nonparametric regression based methods as statistical emulators
Ref: Zhang, Z., Narisetty, N.N., Nair, V.N., Zhu, J. Calibration of Computer Models with Expanded MARS.

Environmental Exposures – proposed a method to select nonlinear interactions between environmental exposures
Ref: Narisetty, N.N. and Mukherjee, B. Nonlinear interaction selection for environmental exposures.

Outline of the Talk

1 High Dimensional Model Selection
  Introduction

2 Bayesian Model Selection Consistency
  Shrinking and Diffusing Priors

3 Scalable Computation
  Skinny Gibbs Sampler

4 Censoring and Non-convexity



Introduction

Modern Data Sets contain a large number of variables, often even larger than the sample size

High Dimensional Data examples include gene expression data, healthcare data, search queries data, etc.

Variable (Model) Selection: deals with identifying the most important variables for a response of interest


High Dimensional Linear Regression

Consider the standard linear regression set-up

    Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}

n - no. of observations
p - no. of variables

When p > n, even the estimation problem is ill-posed.

A natural working assumption is sparsity:

    #{j : β_j ≠ 0} ≪ (p ∧ n)

How can the set of active variables be consistently identified?
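A minimal simulation sketch of this set-up (illustrative only; the data-generating choices below, such as the four active coefficients, are picked for the example and are not prescribed by the talk):

import numpy as np

# Sparse linear model Y = X beta + eps with p > n.
rng = np.random.default_rng(0)
n, p = 100, 250
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:4] = [1.5, 2.0, 2.5, 3.0]                  # only 4 nonzero coefficients
Y = X @ beta + rng.normal(size=n)

print("true active set:", np.flatnonzero(beta))  # #{j: beta_j != 0} << (p ^ n)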


(Popular) Penalty Trick

The penalty trick is to minimize the loss together with a penalty function ρ_λ(β) to induce shrinkage or sparsity

    β̂_PEN = arg min_β { ‖Y − Xβ‖² + ρ_λ(β) }

A very natural penalty is the L0 penalty

    ρ_λ(β) = λ Σ_{j=1}^p 1{β_j ≠ 0}

L0 penalty is argued to give good theoretical properties (Schwarz, 1978; Shen, Pan and Zhu, 2012)
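As a toy illustration of the L0 criterion, the sketch below enumerates all small supports and scores each by residual sum of squares plus λ times the model size; brute-force enumeration is only feasible for very small p and is not how any of the methods cited here are implemented, and the choice lam = log(n) is just a BIC-flavored example.

import numpy as np
from itertools import combinations

def l0_best_subset(X, Y, lam, max_size=3):
    """Minimize ||Y - X b||^2 + lam * (number of nonzero coefficients) by enumeration."""
    best_score, best_support = float(np.sum(Y ** 2)), ()   # empty model (b = 0)
    p = X.shape[1]
    for size in range(1, max_size + 1):
        for support in combinations(range(p), size):
            Xk = X[:, list(support)]
            bk, *_ = np.linalg.lstsq(Xk, Y, rcond=None)
            score = np.sum((Y - Xk @ bk) ** 2) + lam * size
            if score < best_score:
                best_score, best_support = score, support
    return best_support, best_score

rng = np.random.default_rng(1)
n, p = 50, 8
X = rng.normal(size=(n, p))
Y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)
print(l0_best_subset(X, Y, lam=np.log(n)))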


Penalization methods

Several convex and non-convex penalties have been proposed as a compromise between the L0 penalty and computational tractability (a usage sketch follows the list below):
LASSO (Tibshirani, 1996)
SCAD (Fan and Li, 2001)
Elastic Net (Zou and Hastie, 2005)
Adaptive LASSO (Zou, 2006)
Dantzig Selector (Candes and Tao, 2007)
OSCAR (Bondell and Reich, 2008)
MCP (Zhang, 2010), and more . . .
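For instance, the LASSO can be run with off-the-shelf software; a minimal scikit-learn sketch (the regularization level alpha = 0.1 is an arbitrary choice here and would normally be tuned, e.g., by cross-validation):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 250
X = rng.normal(size=(n, p))
Y = X[:, :4] @ np.array([1.5, 2.0, 2.5, 3.0]) + rng.normal(size=n)

fit = Lasso(alpha=0.1, max_iter=10000).fit(X, Y)   # L1-penalized least squares
print("selected variables:", np.flatnonzero(fit.coef_))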


Why “Bayesian Methods”?

Bayesian methods have been popular

Bayesian model selection can be asymptotically equivalent to information criteria

Properties of Bayesian model selection methods for high dimensional data are not as well understood – two challenges will be discussed


Bayesian Framework

Regression Model: Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1};  ε ∼ N(0, σ²I)

Introduce binary variables Z_j to indicate that the j-th covariate is active (j = 1, . . . , p)

Place priors on β_j as

    β_j | Z_j = 0 ∼ π_0(β_j),   β_j | Z_j = 1 ∼ π_1(β_j),    (1)

and on Z = (Z_1, . . . , Z_p)

The posterior of Z is used for model selection


Spike and Slab Priors

The prior π_0 (called spike) is often taken to be a point mass (Geweke, 1996; Scott and Berger, 2010; Liang, Song, and Wu, 2013; Yang, Jordan and Wainwright, 2015)

Some common choices for the prior π_1 (called slab) include:
– g priors (Zellner, 1983; Liang et al., 2007)
– Uniform (Mitchell and Beauchamp, 1988)
– Gaussian (George and McCulloch, 1993; Ročková and George, 2014)
– Laplace (Park and Casella, 2008; Castillo, Schmidt-Hieber, and van der Vaart, 2015)
– Non-local priors (Johnson and Rossell, 2012), & more . . .


Gaussian Spike and Slab Priors

George and McCulloch (1993)


Spike: π_0 ≡ N(0, τ_0²), the variance τ_0² is a small and fixed value

Slab: π_1 ≡ N(0, τ_1²), the variance τ_1² is fixed at a larger value

This allowed the use of a Gibbs sampler for posterior sampling

Extensively used in applications

Model selection properties not understood and computationally intense in high dimensional settings
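A sketch of such a Gibbs sampler for the Gaussian spike-and-slab model (a minimal illustration, not the authors' implementation; the hyperparameter defaults tau0, tau1, q, a1, a2 below are placeholder values chosen for the example):

import numpy as np
from scipy.stats import norm

def spike_slab_gibbs(X, Y, tau0=0.05, tau1=5.0, q=0.1, a1=1.0, a2=1.0,
                     n_iter=2000, seed=0):
    """Standard Gibbs sweep: beta | Z, sigma2 is multivariate normal, Z_j | beta are
    independent Bernoulli, sigma2 | beta, Z is inverse gamma.  Note the full p x p
    inversion in every sweep -- the cost that Skinny Gibbs is designed to avoid."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, XtY = X.T @ X, X.T @ Y
    Z, sigma2, incl = np.zeros(p, dtype=int), 1.0, np.zeros(p)
    for _ in range(n_iter):
        # beta | Z, sigma2 ~ N(V X'Y, sigma2 V) with V = (X'X + D_Z)^{-1}
        d = np.where(Z == 1, tau1, tau0) ** -2.0
        V = np.linalg.inv(XtX + np.diag(d))
        beta = rng.multivariate_normal(V @ XtY, sigma2 * V)
        # Z_j | beta: Bernoulli, slab density vs spike density weighted by q
        log1 = np.log(q) + norm.logpdf(beta, 0.0, np.sqrt(sigma2) * tau1)
        log0 = np.log(1 - q) + norm.logpdf(beta, 0.0, np.sqrt(sigma2) * tau0)
        Z = rng.binomial(1, 1.0 / (1.0 + np.exp(log0 - log1)))
        # sigma2 | beta, Z: conjugate inverse-gamma update
        dz = np.where(Z == 1, tau1, tau0) ** -2.0
        ss = np.sum((Y - X @ beta) ** 2) + np.sum(dz * beta ** 2)
        sigma2 = 1.0 / rng.gamma(a1 + 0.5 * (n + p), 1.0 / (a2 + 0.5 * ss))
        incl += Z
    return incl / n_iter   # estimated marginal inclusion probabilities P[Z_j = 1 | Data]

With (X, Y) simulated as in the earlier sketch, spike_slab_gibbs(X, Y) returns one estimated inclusion probability per covariate, which can then be thresholded for model selection.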


Goal and Challenges

Goal:
To study model selection properties using such priors, and
to provide a practically feasible implementation even when p ≫ n

Two Challenges:
Theoretical Challenge: analyzing the posterior on a huge space of 2^p models
– A new framework using sample-size dependent priors
Computational Challenge: standard Gibbs samplers not scalable due to large matrix computations
– A scalable and flexible Gibbs sampling algorithm


Outline

1 High Dimensional Model Selection

2 Bayesian Model Selection Consistency
  Shrinking and Diffusing Priors

3 Scalable Computation

4 Censoring and Non-convexity


Motivation - Shrinking Spike Prior

Regression Model: Y | β ∼ N(Xβ, I)

Priors: β_j | Z_j = 0 ∼ N(0, τ_0²),  β_j | Z_j = 1 ∼ N(0, τ_1²);  P[Z_j = 1] = 0.5 for j = 1, . . . , p_n

Data generating model: Y ∼ N(Xβ_0, I);  X′X = nI  (p_n ≤ n)

0 < τ_0² < τ_1² < ∞;  for |β_{0j}| ≤ τ_0 ≠ 0,

    P[Z_j = 1 | Y] < 0.5 w.p. 1 as n → ∞

Not consistent!

Let the spike variance go to zero with sample size!


Motivation - Diffusing Slab Prior

Consider τ_{0n}² → 0, τ_1² < ∞ and P[Z_j = 1] = 0.5. Then,

    P[Z_j = 1 | Y] > 0.5  ⇐⇒  β̂_j ≥ t_n ≈ √(log n / n),

where β̂_j is the OLS estimator.

When p_n ≫ √n,  P[True Model | Y] → 0 in probability. Not consistent!

What would be a framework to understand the model selection properties?


Shrinking and Diffusing Priors

Spike prior shrinks: τ_{0,n}² → 0 (faster than 1/n)
– Does not miss “small” non-zero coefficients

Slab prior diffuses: τ_{1,n}² → ∞ (as (p_n^{2+δ}/n ∨ 1))
– Acts as a penalty to drive inactive covariates under Z = 0

P[Z_j = 1] = q_n → 0 (as p_n^{-1})
– Controls the model size a priori


Shrinking and Diffusing Priors

Figure : spike variance: 0.2, slab variance: 1

Shrinking and Diffusing Priors

Figure : spike variance: 0.1, slab variance: 5



Bayesian Hierarchical Framework

We consider the following hierarchical model:

    Y | β ∼ N(Xβ, σ²I)

    β_j | (Z_j = 0) ∼ N(0, σ²τ_{0,n}²),   β_j | (Z_j = 1) ∼ N(0, σ²τ_{1,n}²)

    P(Z_j = 1) = q_n,  j = 1, . . . , p_n

    σ² ∼ IG(α_1, α_2)


Notation

Model vector: Z = (Z_1, . . . , Z_p)
True regression vector: β_0
True model: t = {j : β_{0j} ≠ 0}
For e.g., β_0 = (0.5, 0.8, 0, . . . , 0) =⇒ t = (1, 1, 0, . . . , 0)
Arbitrary model: k (binary vector of length p)
Posterior probability of model k: P[Z = k | Data]
Posterior Ratio:

    PR(k, t) = P[Z = k | Data] / P[Z = t | Data]


Posterior Ratios

Lemma
For any model k ≠ t, we have

    PR(k, t) ≤ (n τ_{1n}² q_n^{-2})^{-(|k|-|t|)/2} exp{ -(R̃_k - R̃_t) / (2σ²) }

            “≈” (p_n^{(2+δ)} ∨ √n p_n)^{-(|k|-|t|)} exp{ -(R̃_k - R̃_t) / (2σ²) },

where R̃_k ≈ Y′(I − P_{X_k})Y plays the role of the RSS for model k.

This implies

    PR(k, t) = P[Z = k | Data] / P[Z = t | Data] → 0 in probability, for each k ≠ t


Strong Selection Consistency

(Weak) Selection Consistency: For each model k ≠ t,

    P[Z = k | Data] / P[Z = t | Data] → 0 in probability    (2)

Even if (2) holds, it is possible that P[Z = t | Data] → 0 when p_n → ∞, as the number of models increases exponentially

It is desirable to have Strong Selection Consistency (SSC):

    P[Z = t | Data] → 1 in probability

SSC =⇒ selection consistency of the maximum a posteriori model


Sum of Posterior Ratios


SSC is equivalent to  Σ_{k≠t} P[Z = k | Data] / P[Z = t | Data] → 0 in probability.  This sum can be decomposed as:

– M1: Large models, having size more than n / log p_n:
      Σ_{k∈M1} PR(k, t) ≤ exp{−c_1 n}

– M2: Overfitted models, including (but not equal to) t:
      Σ_{k∈M2} PR(k, t) ≤ exp{−c_2 log p_n}

– M3: Underfitted models, not including a covariate in t:
      Σ_{k∈M3} PR(k, t) ≤ exp{−c_3 n Δ_n²};  Δ_n is the minimum signal


SSC with Shrinking and Diffusing Priors

Theorem (Narisetty and He (AoS, 2014))
Posterior probability of the true model converges to one. That is,
P[Z = t | Data] → 1 in probability as n → ∞ (under mild conditions).

Remark 1: p_n can be large such that |t| log p_n / n → 0.

Remark 2: Conditions are on the minimum signal and the minimum non-zero eigenvalues of small submatrices of the Gram matrix

Remark 3: As a useful consequence, for 0 < c < 1, P[Z_j = t_j | Data] > c for all j with probability going to one.


Connection with L0 Penalization

The posterior on the model space using shrinking and diffusing priors can be written as

    −Log Posterior(k) = R̃_k + (|k| − |t|) ψ_{n,k}
                      ≈ Y′(I − P_{X_k})Y + (|k| − |t|) ψ_{n,k}

Resembles the well-known Bayesian Information Criterion

    BIC(k) = Y′(I − P_{X_k})Y + |k| log n

The penalty ψ_{n,k} satisfies

    c log(n ∨ p_n) ≤ ψ_{n,k} ≤ C log(n ∨ p_n),

and has the same rate as that of EBIC (Chen and Chen, 2008)


Related Results on SSC

Our contribution: theoretical framework for SSC with n-dependent priors, and connection to L0 penalization even when p_n > n

Shang and Clayton (2011), and Johnson and Rossell (2012): SSC for p_n < n

More recently, Castillo, Schmidt-Hieber, and van der Vaart (2015) – SSC with n-dependent Laplace priors for p_n > n


Outline

1 High Dimensional Model Selection

2 Bayesian Model Selection Consistency

3 Scalable Computation
Skinny Gibbs Sampler

4 Censoring and Non-convexity


Posterior Computation

Standard Gibbs Sampler: the conditional distributions are
– Z_j (given β): independent Bernoulli
– β (given Z): multivariate normal N(VX′Y, σ²V)

Easy to implement but not easily scalable; involves p_n-dimensional matrix computations

Similar difficulty for other Gibbs sampling based methods (Ishwaran and Rao, 2005; 2010; Bhattacharya et al., 2015)

Can we have a scalable Gibbs sampler having strong selection consistency?


Large Matrix Computations

The distribution of β (given Z) has covariance

    V = (X′X + D_z)^{-1},  where D_z = Diag( Z τ_{1,n}^{-2} + (1 − Z) τ_{0,n}^{-2} )

Requires at least p_n²-order computations, making it not feasible for large p_n

Strategy:
– Sparsify the precision matrix X′X + D_z so that the β_j’s can somehow be sampled independently


Sparse Precision Matrix

A = {j : Z_j = 1} - active set;  I = {j : Z_j = 0} - inactive set

Rearrange the variables and partition X = [X_A, X_I]

Can the p_n × p_n precision matrix be made skinny as

    V^{-1} = [ X_A′X_A + τ_{1n}^{-2} I    X_A′X_I
               X_I′X_A                    X_I′X_I + τ_{0n}^{-2} I ]

            ⇓ ?

    V^{-1} = [ X_A′X_A + τ_{1n}^{-2} I    0
               0                          (n + τ_{0n}^{-2}) I ]

If so, generating β requires only linear order computations in p_n


Does the Sparsification Work?

The Gibbs sampler would not have the right stationary distribution

The correlations among the components of β_I, and between β_A and β_I, are lost

We can recover!


Skinny Gibbs

Re-introduce the dependencies through the Z’s by sampling them sequentially!

    P[Z_j = 1 | Z_{-j}, Rest] / P[Z_j = 0 | Z_{-j}, Rest]
        = [ q_n φ(β_j; 0, σ²τ_{1,n}²) / ((1 − q_n) φ(β_j; 0, σ²τ_{0,n}²)) ] · exp{ σ^{-2} β_j X_j′(Y − X_{A_j} β_{A_j}) },

where A_j is the set of active variables in Z_{-j}

We call the new sampler - Skinny Gibbs!
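A minimal sketch of one Skinny Gibbs sweep as read off these slides (not the authors' implementation): the active block of β is drawn jointly from a |A|-dimensional normal, the inactive block from independent mean-zero normals, and the Z_j are updated sequentially with the odds ratio displayed above. It assumes standardized columns (X_j′X_j ≈ n, matching the (n + τ_{0n}^{-2}) term), and σ² is held fixed to keep the sketch short.

import numpy as np
from scipy.stats import norm

def skinny_gibbs(X, Y, tau0, tau1, q, sigma2=1.0, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Z = np.zeros(p, dtype=bool)
    beta, incl = np.zeros(p), np.zeros(p)
    for _ in range(n_iter):
        A = np.flatnonzero(Z)
        if A.size:                              # active block: |A| x |A| system only
            XA = X[:, A]
            prec = XA.T @ XA + np.eye(A.size) / tau1 ** 2
            mean = np.linalg.solve(prec, XA.T @ Y)
            beta[A] = rng.multivariate_normal(mean, sigma2 * np.linalg.inv(prec))
        Ic = np.flatnonzero(~Z)                 # inactive block: independent draws
        beta[Ic] = rng.normal(0.0, np.sqrt(sigma2 / (n + tau0 ** -2)), size=Ic.size)
        for j in range(p):                      # sequential Z_j updates (slide's odds ratio)
            Aj = np.flatnonzero(Z)
            Aj = Aj[Aj != j]
            resid = Y - X[:, Aj] @ beta[Aj]
            log_odds = (np.log(q) - np.log(1.0 - q)
                        + norm.logpdf(beta[j], 0.0, np.sqrt(sigma2) * tau1)
                        - norm.logpdf(beta[j], 0.0, np.sqrt(sigma2) * tau0)
                        + beta[j] * (X[:, j] @ resid) / sigma2)
            Z[j] = rng.uniform() < 1.0 / (1.0 + np.exp(-log_odds))
        incl += Z
    return incl / n_iter

The point of the design is visible in the code: no p × p matrix is ever formed, only an |A| × |A| block plus O(p) scalar work, while the sequential Z updates restore the dependence dropped by the sparsified precision matrix.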


Skinny Gibbs Posterior

Skinny Gibbs has a stationary distribution:

Lemma
The posterior of Z corresponding to Skinny Gibbs is given by

    P[Z = k | Data] ∝ |V_{k1}|^{1/2} |V_{k0}|^{1/2} |D_k|^{1/2} q_n^{|k|} exp{ −R̃_k / 2 },

where V_{k1} = (τ_{1n}^{-2} I + X_k′X_k)^{-1},  V_{k0} = (τ_{0n}^{-2} + n)^{-1} I.

Skinny Gibbs still retains the strong selection consistency!

Theorem (Narisetty, Shen, and He (2015))
Under similar conditions, we have P[Z = t | Data] → 1 in probability.


Skinny and Strong


Existing Computational Methods

Optimization based Methods
– EM Algorithm (Ročková and George, 2014)
– Approximate L0 Optimization (Bertsimas, King, and Mazumder; Cassidy and Solo, 2015)

Metropolis-Hastings (MH) Random Walk Methods
– Integrated Likelihood Approach (Guan and Stephens, 2011)
– Shotgun Stochastic Search (Hans, Dobra, West, 2007)
– Bayesian Subset Regression (Liang, Song, and Yu, 2013)

Gibbs Sampling Methods
– Standard Samplers (George and McCulloch, 1993; Ishwaran, Kogalur, and Rao, 2005; 2010)
– Skinny Gibbs


Features of Skinny Gibbs

Extends readily to many non-linear models including logistic regression

Continues to work in non-convex settings unlike optimization based methods

Computationally scalable without sacrificing strong selection consistency!


Beyond Linear Regression

Many distributions can be written as mixtures of Gaussian distributions; the Skinny Gibbs Trick extends readily to them

Example: the logistic distribution

    L_i ∼ Logistic(x_i β)  ⇐⇒  L_i | s_i ∼ N(x_i β, s_i²);  s_i / 2 ∼ F_KS,

where F_KS is the Kolmogorov-Smirnov distribution (Stefanski, 1991)

Skinny Gibbs works for logistic regression! (Details: Narisetty, Shen, and He, 2015)


Simulation Study

n = 100, p = 250.

|t| = 4;  β_t = (1.5, 2, 2.5, 3, 0, . . . , 0)

Logistic model: Y_i | x_i ∼ Logistic(x_i β)

Signal-to-noise ratio around 0.9 for the latent response.
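A sketch of one simulated data set under this design (the uncorrelated-covariate setting; the correlated setting used in the second table below is described in the appendix):

import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 250
beta = np.zeros(p)
beta[:4] = [1.5, 2.0, 2.5, 3.0]              # |t| = 4 active coefficients
X = rng.normal(size=(n, p))                  # independent covariates
prob = 1.0 / (1.0 + np.exp(-X @ beta))       # logistic link
Y = rng.binomial(1, prob)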


Skinny Gibbs - Marginal Posterior Probabilities


Simulation Results

Table : Covariates without Correlation

              TP     FP     Z = t   Z_4 = t
Skinny Gibbs  3.64   1.19   0.26    0.61
BSR           3.17   0.23   0.25    0.58
ALasso        3.76   4.00   0.01    0.48
SCAD          3.72   3.26   0.02    0.35
MCP           3.84   4.90   0.00    0.49

TP = True Positives; FP = False Positives; Z = t: Exact Selection; Z_4 = t: Top 4 Selection


Simulation Results

Table : Covariates with Correlation

              TP     FP     Z = t   Z_4 = t
Skinny Gibbs  3.41   1.17   0.24    0.44
BSR           2.56   0.29   0.10    0.27
ALasso        3.66   4.03   0.02    0.30
SCAD          3.40   3.33   0.01    0.03
MCP           3.68   4.18   0.01    0.11

TP = True Positives; FP = False Positives; Z = t: Exact Selection; Z_4 = t: Top 4 Selection

Covariates have correlations between (0.1, 0.5)


Lymph Data Example

n = 148 subjects and p = 4514 genes

Binary response Y = +ve or −ve status of lymph node

Reference: Hans, Dobra, and West (2007)


Cross Validated Prediction Error

Figure : Cross Validated Prediction Error versus Model Size for several model selection methods

n = 148; p = 4514 genes
Response Y = +ve or −ve status of lymph node

Outline

1 High Dimensional Model Selection

2 Bayesian Model Selection Consistency

3 Scalable Computation

4 Censoring and Non-convexity


Censoring

Suppose Y is censored, and we only observe

    Y^o = (Y ∨ c)

Conditional mean of Y is not identifiable but (most) conditional quantiles are!

τ-th conditional quantile of Y: Q_τ(Y | X) = Xβ_τ

Powell (1984, 86)’s objective for estimation:

    β̂_Pow(τ) = arg min_β Σ_{i=1}^n ρ_τ( Y_i^o − (X_i β ∨ c) ),

where ρ_τ(u) = u(τ − 1(u < 0)) is the check-loss
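The check loss and Powell objective transcribe directly into code; a small sketch (note the max with the censoring point c inside the loss, which is what makes the objective non-convex in β):

import numpy as np

def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def powell_objective(beta, X, Y_obs, c, tau):
    """Powell's censored quantile objective: sum_i rho_tau(Y_i^o - max(x_i'beta, c))."""
    fitted = np.maximum(X @ beta, c)
    return np.sum(check_loss(Y_obs - fitted, tau))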


Non-convexity

One major difficulty even in low dimensions is computational, as the objective function is non-convex!


Bayesian Computation - Skinny Gibbs

Working asymmetric Laplace (AL) likelihood (weights w_i):

    L(β) = Π_{i=1}^n w_i τ(1 − τ) exp{ −w_i ρ_τ( Y_i^o − max(x_i′β, c) ) }

The AL distribution can be represented as (Kozumi and Kobayashi, 2011):

    w_i^{-1} ξ_1 ν + w_i^{-1} ξ_2 √ν z ∼ exp{ −w_i ρ_τ(·) },

where ν ∼ Exp(1), z ∼ N(0, 1), and ξ_1, ξ_2 are constants

Skinny Gibbs Trick for scalable computation and model selection (Narisetty and He, 2015)!
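A small numerical check of the mixture representation above (the specific constants ξ_1 = (1 − 2τ)/(τ(1 − τ)) and ξ_2² = 2/(τ(1 − τ)) are the usual Kozumi–Kobayashi values, which the slide does not spell out, so treat them as an assumption):

import numpy as np

def al_mixture_draws(tau, w=1.0, size=100000, seed=4):
    """Draw asymmetric Laplace errors via the normal-exponential mixture."""
    rng = np.random.default_rng(seed)
    xi1 = (1 - 2 * tau) / (tau * (1 - tau))
    xi2 = np.sqrt(2 / (tau * (1 - tau)))
    nu = rng.exponential(1.0, size)
    z = rng.normal(size=size)
    return (xi1 * nu + xi2 * np.sqrt(nu) * z) / w

# The tau-quantile of the AL(tau) error is zero, so this should print roughly 0.3.
print(np.mean(al_mixture_draws(tau=0.3) <= 0))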


Non-convexity and Theoretical Difficulties

Powell’s objective function: P(β) = Σ_{i=1}^n w_i ρ_τ( Y_i − (X_i β ∨ c) )

Non-convexity and misspecification cause theoretical difficulties!

No quadratic approximations on compact sets!
– We show local quadratic approximations in vanishing balls

P(β) − P(β_0) not guaranteed to be positive for large ‖β − β_0‖
– We show divergence in ‖β − β_0‖ ≤ B_n → ∞ (not quadratic!)

Posterior has w_i both in likelihood and penalty parts
– Strong selection consistency holds true for some choices of w!


Summing up

Skinny Gibbs Trick provides a computationally feasible approach for the non-convex Powell objective function

Theoretical difficulties arise due to non-convexity and misspecification, but consistency is shown to hold

Empirical results show good performance while penalization approaches give unstable solutions


Directions for Future Research

Inference: The posterior given by a Bayesian method may in principle be used for inference. Coverage properties?

Model Selection for Mixture Models: How to deal with non-convex likelihoods under mixture models? Applications to subgroup identification?

Interaction Selection: Environmental exposures exhibit nonlinear and higher-order interactions. Bayesian modeling strategies?


Conclusions

Provided a theoretical framework to study Bayesian model selection with spike-slab priors (Narisetty and He 2014, AoS)

Proposed Skinny Gibbs - a scalable and flexible algorithm (Narisetty, Shen and He 2015, under review)

Shown to be applicable to non-convex problems as in censored quantile regression (Narisetty and He 2015)

Bayesian model selection deserves further investigation!

Thank You!
and

Xuming He, Vijay Nair – Statistics, U Mich.

Bhramar Mukherjee – Biostatistics, U Mich.

Juan Shen – Management, Fudan U

Minsuk Shin – Statistics, Texas A&M U

Fei He, Derek Posselt – Atmospheric Science, U Mich.

Steve Broglio, James Eckner – Kinesiology, Medicine, U Mich.
Appendix
1. Conditions for Strong Selection Consistency
2. Simulation Setting and Default Parameters
3. Kolmogorov-Smirnov Distribution
4. Computational Time
5. Computational Complexity
6. Posterior Probabilities - Lymph Data
7. Non-convexity - Theoretical Details
Conditions for Strong Selection Consistency

We assume the following conditions:

Number of covariates p_n is s.t. |t| log p_n / n → 0

Active covariates have coefficients of order at least √(|t| log p_n / n)

Minimum non-zero eigenvalues of the submatrices of X′X/n with size M_n = n / log p_n may go to zero, but not too fast!

Errors are (sub-)Gaussian

Many assumptions could be relaxed (with additional work)

Simulation Setting and Default Parameters

Covariate Distribution:

    x_i ∼ N(0, Σ) independently;  Σ_AA = ρ_1 1_{4×4},  Σ_AI = ρ_2 1_{4×(p−4)},  Σ_II = ρ_3 1
    – ρ_1 = ρ_2 = ρ_3 = 0
    – ρ_1 = 0.1; ρ_2 = 0.25; ρ_3 = 0.5

Default parameters for Skinny Gibbs:

Spike and slab variances are

    τ_{0n}² = 1/n,   τ_{1n}² = max( p_n^{2.1} / (100 n), 1 )

q_n = P(Z_j = 1) is chosen such that

    P[ Σ_{j=1}^{p_n} Z_j > K ] = 0.1,  for K = max(10, log(n)).
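A minimal sketch that computes these defaults numerically (a transcription of the formulas above; the root-finding step for q_n is simply one straightforward way to impose the binomial tail condition):

import numpy as np
from scipy.optimize import brentq
from scipy.stats import binom

def default_hyperparameters(n, p):
    tau0_sq = 1.0 / n
    tau1_sq = max(p ** 2.1 / (100.0 * n), 1.0)
    K = max(10.0, np.log(n))
    # q_n solves P[Binomial(p, q) > K] = 0.1, controlling the prior model size
    q = brentq(lambda q: binom.sf(K, p, q) - 0.1, 1e-10, 1 - 1e-10)
    return tau0_sq, tau1_sq, q

print(default_hyperparameters(n=100, p=250))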

Kolmogorov-Smirnov Distribution

The Kolmogorov-Smirnov distribution has CDF given by

    G(σ) = 1 − 2 Σ_{n=1}^∞ (−1)^{n+1} exp(−2n²σ²).

It is the distribution of

    K = sup_{t∈[0,1]} |B(t)|,

where B(t) is the Brownian bridge.
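The series truncates easily; a short sketch evaluating the CDF from the formula above:

import numpy as np

def ks_cdf(sigma, n_terms=100):
    """Kolmogorov-Smirnov CDF G(sigma) via its alternating series."""
    n = np.arange(1, n_terms + 1)
    return 1.0 - 2.0 * np.sum((-1.0) ** (n + 1) * np.exp(-2.0 * n ** 2 * sigma ** 2))

print(ks_cdf(0.5), ks_cdf(1.0))   # roughly 0.04 and 0.73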

Computational Time

Figure : CPU time (in seconds) for BASAD and Skinny Gibbs for n = 100 as p varies.

Computational Complexity

Yang, Wainwright, and Jordan (2015) recently studied the computational complexity of MH methods.
– Complexity is nearly linear in p: a number of iterations in the order of p n log p is sufficient for convergence!

Geometric ergodicity of Gibbs samplers in many settings (with fixed p) established (Román and Hobert 2012; 2015)

Number of iterations for standard Gibbs samplers for regression: p n^{-1} (Rajaratnam and Sparks, 2015)

Mixing and Complexity of Skinny Gibbs - Future work!

Posterior Probabilities - Lymph Data Example

Figure : Marginal Posterior Probabilities from two different chains of Skinny Gibbs.

Non-convexity - Theoretical Details
Powell’s objective function: Pow(β) = Σ_{i=1}^n w_i ρ_τ( Y_i − (X_i β ∨ c) ).

Pow(β) is non-convex, with no global quadratic approximations. But:

Local quadratic approximation: for ‖β − β_0‖ ≤ ε_n = |t|² log p_n / n → 0,

    Pow(β) − Pow(β_0) = n(β − β_0)′ D_w (β − β_0) + o_P(1)

Deviation in a diverging ball: for ε_n < ‖β − β_0‖ < B_n = (n / log p_n)^{1/8} → ∞, we have

    Pow(β) − Pow(β_0) > c_1 n ( ‖β − β_0‖² ∧ c_2 )

Note: Usually, ‖β − β_0‖ is assumed to be bounded, but we can allow it to diverge as B_n → ∞.
