
Scalable Bayesian Model Selection


for High Dimensional Data

Naveen N. Narisetty
Department of Statistics

2018 BayesComp Meeting


Barcelona, Spain

Outline of the Talk

1 High Dimensional Model Selection
  – Introduction
  – Shrinking and Diffusing Priors

2 Computation
  – Large Matrix Computations
  – Scalable Gibbs Sampler

Introduction

Modern data sets contain a large number of variables, often even larger than the sample size

High dimensional data examples include gene expression data, healthcare data, search query data, etc.

Variable (model) selection deals with identifying the most important variables for a response of interest


Challenges with High Dimensionality

Regression: Linear, Generalized Linear, Nonlinear, Quantile, etc.

Classification: Logistic, Support Vector Machines, Decision Trees, etc.

Dimension Reduction: Principal Component / Canonical Correlation / Linear Discriminant Analysis, etc.

Graphical Models

· · ·

The number of parameters in most statistical models explodes, bringing computational and theoretical challenges!


High Dimensional Linear Regression

Consider the standard linear regression set-up

Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}

n - number of observations
p - number of variables

When p > n, even the estimation problem is ill-posed.

A natural working assumption is sparsity:

#{j : β_j ≠ 0} ≪ (p ∧ n)

How can the set of active variables be consistently identified?


(Popular) Penalty Trick

The penalty trick is to minimize the loss together with a penalty function ρ_λ(β) to induce shrinkage or sparsity

β̂_PEN = argmin_β { ‖Y − Xβ‖² + ρ_λ(β) }

A very natural penalty is the L0 penalty

ρ_λ(β) = λ Σ_{j=1}^{p} 1{β_j ≠ 0}

L0 penalty is argued to give good theoretical properties (Schwarz, 1978; Shen, Pan and Zhu, 2012)
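To make concrete why the L0 penalty is computationally demanding, here is a minimal sketch (Python/NumPy; the data X, Y, the tuning parameter lam, and the cap max_size are hypothetical choices) that evaluates the L0-penalized objective by exhaustive search over small supports:

    import itertools
    import numpy as np

    def l0_best_subset(X, Y, lam, max_size=3):
        """Minimize ||Y - X b||^2 + lam * #{j: b_j != 0} by exhaustive search
        over all supports of size <= max_size -- combinatorial in p."""
        n, p = X.shape
        best_obj, best_beta = np.sum(Y ** 2), np.zeros(p)   # null model
        for k in range(1, max_size + 1):
            for S in itertools.combinations(range(p), k):
                cols = list(S)
                bS, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
                obj = np.sum((Y - X[:, cols] @ bS) ** 2) + lam * k
                if obj < best_obj:
                    best_obj = obj
                    best_beta = np.zeros(p)
                    best_beta[cols] = bS
        return best_beta, best_obj

The search is combinatorial in p, which is what motivates the convex and non-convex surrogates listed on the next slide.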


Variable Selection Methods

Several convex and non-convex penalties have been proposed as a compromise between the L0 penalty and computational tractability
– LASSO (Tibshirani, 1996)
– SCAD (Fan and Li, 2001)
– Elastic Net (Zou and Hastie, 2005)
– Adaptive LASSO (Zou, 2006)
– MCP (Zhang, 2010), and more . . .

Other approaches include: forward selection (Hwang, Zhang, and Ghoshal, 09; Wasserman and Roeder, 09; Guo, Lu, and Li, 14); false selection rate (Boos, Stefanski, and Wu, 09)
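As a point of reference for the penalized route, a minimal LASSO sketch (scikit-learn assumed available; the design, the sparse truth, and the regularization level alpha are hypothetical, with alpha normally chosen by cross-validation):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p = 100, 250
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:4] = [1.5, 2.0, 2.5, 3.0]            # sparse truth
    Y = X @ beta + rng.standard_normal(n)

    fit = Lasso(alpha=0.1).fit(X, Y)           # L1-penalized least squares
    selected = np.flatnonzero(fit.coef_)       # estimated active set
    print(selected)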


Why Bayesian Methods?

The Bayesian approach offers a flexible modeling framework

Bayesian model selection can be asymptotically equivalent to information criteria

In some cases, it provides an alternative and more suitable computational framework compared to optimization


Bayesian Framework

Regression Model: Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1};  ε ∼ N(0, σ²I)

Introduce binary variables Z_j to indicate whether the j-th covariate is active (j = 1, . . . , p)

Place priors on β_j as

β_j | Z_j = 0 ∼ π_0(β_j),   β_j | Z_j = 1 ∼ π_1(β_j),     (1)

and on Z = (Z_1, . . . , Z_p)

With a prior placed on Z, the posterior of Z can be used for model selection


Spike and Slab Priors

The prior π_0 (called the spike) is often taken to be a point mass (Geweke, 1996; Scott and Berger, 2010; Liang, Song, and Wu, 2013; Yang, Jordan and Wainwright, 2015)

Some common choices for the prior π_1 (called the slab) include:
– g priors (Zellner, 1983; Som, Hans, MacEachern, 2014)
– Uniform (Mitchell and Beauchamp, 1988)
– Gaussian (George and McCulloch, 1993)
– Laplace (Park and Casella, 2008)
– Non-local priors (Johnson and Rossell, 2012; Rossell and Telesca, 2017), & more . . .


Gaussian Spike and Slab Priors

George and McCulloch (1993)

Spike: π_0 ≡ N(0, τ_0²), with the variance τ_0² fixed at a small value

Slab: π_1 ≡ N(0, τ_1²), with the variance τ_1² fixed at a larger value

Other choices for the mixing distribution have also been considered more recently (Ročková and George, 2014; 2016)

This allowed the use of a Gibbs sampler for posterior sampling

Extensively used in applications

Model selection properties and computational challenges will be investigated in high dimensional settings
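A minimal sketch of drawing from this prior hierarchy (Python/NumPy; the values of q, tau0 and tau1 below are hypothetical placeholders, not the values prescribed later in the talk):

    import numpy as np

    rng = np.random.default_rng(1)
    p, q = 250, 0.05            # number of covariates, prior P(Z_j = 1)
    tau0, tau1 = 0.1, 5.0       # spike and slab standard deviations

    Z = rng.binomial(1, q, size=p)                     # latent inclusion indicators
    beta = np.where(Z == 1,
                    rng.normal(0.0, tau1, size=p),     # slab: N(0, tau1^2)
                    rng.normal(0.0, tau0, size=p))     # spike: N(0, tau0^2)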


Shrinking and Diffusing Priors (Narisetty and He, 2014)

Spike prior shrinks: τ²_{0,n} → 0 (faster than 1/n)
– Does not miss “small” non-zero coefficients

Slab prior diffuses: τ²_{1,n} → ∞ (as (p_n^{2+δ}/n ∨ 1) for some δ > 0)
– Acts as a penalty to drive inactive covariates under Z = 0

P[Z_j = 1] = q_n → 0 (as p_n^{-1})
– Controls the model size a priori

Strong selection consistency: under such priors,

P[Z = t | Y] → 1 in probability (the posterior concentrates on the true model t).



Figure: spike and slab prior densities with spike variance 0.04 and slab variance 1


Figure: spike and slab prior densities with spike variance 0.01 and slab variance 25
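A minimal sketch reproducing this kind of picture (matplotlib and SciPy assumed available; only the two variance pairs shown in the figures above are used):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    x = np.linspace(-6, 6, 500)
    for spike_var, slab_var in [(0.04, 1.0), (0.01, 25.0)]:
        plt.figure()
        plt.plot(x, norm.pdf(x, scale=np.sqrt(spike_var)), label="spike")
        plt.plot(x, norm.pdf(x, scale=np.sqrt(slab_var)), label="slab")
        plt.title(f"spike variance {spike_var}, slab variance {slab_var}")
        plt.legend()
    plt.show()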



Outline

1 High Dimensional Model Selection

2 Computation
  – Large Matrix Computations
  – Scalable Gibbs Sampler


Posterior Computation

Standard Gibbs Sampler: the conditional distributions are
– Z_j (given β): independent Bernoulli
– β (given Z): multivariate normal N(V X′Y, σ²V)

Easy to implement but not easily scalable; involves p_n-dimensional matrix computations

Similar difficulty for other Gibbs sampling based methods (Ishwaran and Rao, 2005; 2010; Bhattacharya et al., 2015)

Can we have a scalable Gibbs sampler with strong selection consistency?
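A minimal sketch of one iteration of this standard Gibbs sampler (Python/NumPy/SciPy; σ², τ_0, τ_1 and the prior inclusion probability q are treated as fixed, and the σ²-scaled spike/slab densities follow the R_j formula shown later, both simplifying assumptions), making the p_n × p_n inversion explicit:

    import numpy as np
    from scipy.stats import norm

    def standard_gibbs_step(X, Y, Z, sigma2, tau0, tau1, q, rng):
        n, p = X.shape
        # beta | Z ~ N(V X'Y, sigma2 * V) with V = (X'X + D_z)^{-1}: full p x p inverse
        d = np.where(Z == 1, tau1 ** -2, tau0 ** -2)
        V = np.linalg.inv(X.T @ X + np.diag(d))
        beta = rng.multivariate_normal(V @ (X.T @ Y), sigma2 * V)
        # Z_j | beta ~ independent Bernoulli, comparing slab and spike densities at beta_j
        slab = q * norm.pdf(beta, scale=np.sqrt(sigma2) * tau1)
        spike = (1 - q) * norm.pdf(beta, scale=np.sqrt(sigma2) * tau0)
        Z = rng.binomial(1, slab / (slab + spike))
        return beta, Z

The np.linalg.inv call on the p_n × p_n matrix is the bottleneck that the next slides remove.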


Large Matrix Computations

The distribution of β (given Z) has covariance

V = (X′X + D_z)^{-1},   where D_z = Diag(Z τ_{1,n}^{-2} + (1 − Z) τ_{0,n}^{-2})

Requires at least order p_n² computations, making it infeasible for large p_n

Strategy:
– Sparsify the precision matrix X′X + D_z so that the β_j’s can somehow be sampled independently

Sparse Precision Matrix

A = {j : Z_j = 1} - active set;   I = {j : Z_j = 0} - inactive set

Rearrange the variables and partition X = [X_A, X_I]

Can the p_n × p_n precision matrix be made skinny as

V^{-1} = ( X_A′X_A + τ_{1n}^{-2} I     X_A′X_I
           X_I′X_A                     X_I′X_I + τ_{0n}^{-2} I )

                         ⇓ ?

S^{-1} := ( X_A′X_A + τ_{1n}^{-2} I    0
            0                          (n + τ_{0n}^{-2}) I )

If so, generating β requires only linear order computations in p_n
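A minimal sketch of the cost difference (Python/NumPy; the sizes n, p and the active-set size k are hypothetical, and σ² is ignored since only the matrix work matters here):

    import time
    import numpy as np

    rng = np.random.default_rng(4)
    n, p, k = 100, 4000, 10                        # k = |A|, the active-set size
    X = rng.standard_normal((n, p))
    Z = np.zeros(p, dtype=int); Z[:k] = 1
    tau0, tau1 = 0.1, 5.0
    d = np.where(Z == 1, tau1 ** -2, tau0 ** -2)

    t0 = time.time()
    V = np.linalg.inv(X.T @ X + np.diag(d))                              # full p x p inverse
    t1 = time.time()
    SA = np.linalg.inv(X[:, :k].T @ X[:, :k] + tau1 ** -2 * np.eye(k))   # only the |A| x |A| block
    t2 = time.time()
    print(f"full V: {t1 - t0:.2f}s   skinny block: {t2 - t1:.5f}s")

The full inverse scales like p³, while the skinny version needs only an |A| × |A| solve plus independent draws for the inactive block.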


Does the Sparsification Work?

The correlations among the components of β_I, and between β_A and β_I, are lost

The Gibbs sampler would not have an appropriate stationary distribution

We can recover!


Skinny Gibbs

Re-introduce the dependencies through the Z’s by sampling them sequentially!

R_j = P[Z_j = 1 | Z_{-j}, Rest] / P[Z_j = 0 | Z_{-j}, Rest]

    = [ q_n φ(β_j, 0, σ²τ²_{1,n}) / ((1 − q_n) φ(β_j, 0, σ²τ²_{0,n})) ] exp{ σ^{-2} β_j X_j′(Y − X_{A_j} β_{A_j}) },

where A_j is the set of active variables in Z_{-j}

We call the new sampler - Skinny Gibbs!


Skinny Gibbs Sampler

Sample β (given Z) from N(S X′Y, S), where S is the sparsified covariance matrix

S = ( (X_A′X_A + τ_{1n}^{-2} I)^{-1}    0
      0                                 (n + τ_{0n}^{-2})^{-1} I )

Sample Z (given β) from the conditional Bernoulli draws:

P[Z_j = 1 | Z_{-j}, Rest] = R_j / (1 + R_j)
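A minimal sketch of one Skinny Gibbs iteration as read off these two updates (Python/NumPy/SciPy; σ² is taken as known, the covariance of the β draw is scaled by σ² by analogy with the standard sampler, and the columns of X are assumed standardized so that X_j′X_j ≈ n; an illustrative reading, not the authors' exact implementation):

    import numpy as np
    from scipy.special import expit
    from scipy.stats import norm

    def skinny_gibbs_step(X, Y, Z, sigma2, tau0, tau1, q, rng):
        n, p = X.shape
        A, I = np.flatnonzero(Z == 1), np.flatnonzero(Z == 0)
        beta = np.zeros(p)
        # beta | Z from a block-diagonal S: only an |A| x |A| solve is needed
        if len(A) > 0:
            SA = np.linalg.inv(X[:, A].T @ X[:, A] + tau1 ** -2 * np.eye(len(A)))
            beta[A] = rng.multivariate_normal(SA @ (X[:, A].T @ Y), sigma2 * SA)
        beta[I] = rng.normal((X[:, I].T @ Y) / (n + tau0 ** -2),
                             np.sqrt(sigma2 / (n + tau0 ** -2)))
        # Z_j | Z_{-j}, beta: sequential Bernoulli draws with odds R_j
        for j in range(p):
            Aj = np.flatnonzero(Z == 1)
            Aj = Aj[Aj != j]                           # active variables in Z_{-j}
            resid = Y - X[:, Aj] @ beta[Aj]
            log_Rj = (np.log(q / (1 - q))
                      + norm.logpdf(beta[j], scale=np.sqrt(sigma2) * tau1)
                      - norm.logpdf(beta[j], scale=np.sqrt(sigma2) * tau0)
                      + beta[j] * (X[:, j] @ resid) / sigma2)
            Z[j] = rng.binomial(1, expit(log_Rj))
        return beta, Z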


Skinny Gibbs Posterior

Skinny Gibbs has a stationary distribution:

Lemma
The posterior of Z corresponding to Skinny Gibbs is given by

P[Z = k | Data] ∝ |V_{k1}|^{1/2} |V_{k0}|^{1/2} |D_k|^{1/2} q_n^{|k|} exp( −R̃_k / 2 ),

where V_{k1} = (τ_{1n}^{-2} I + X_k′X_k)^{-1},  V_{k0} = (τ_{0n}^{-2} + n)^{-1} I.

Skinny Gibbs still retains strong selection consistency!

Theorem (Narisetty, Shen, and He)
Under similar conditions, we have

P[Z = t | Data] → 1 in probability.


Existing Computational Methods

Optimization-based Methods
– EM Algorithm (Ročková and George, 2014)
– Approximate L0 Optimization (Bertsimas, King, and Mazumder; Cassidy and Solo, 2015)

Metropolis-Hastings (MH) Random Walk Methods
– Integrated Likelihood Approach (Guan and Stephens, 2011)
– Shotgun Stochastic Search (Hans, Dobra, West, 2007)
– Bayesian Subset Regression (Liang, Song, and Yu, 2013)

Gibbs Sampling Methods
– Standard Samplers (George and McCulloch, 1993; Ishwaran, Kogalur, and Rao, 2005; 2010)
– Skinny Gibbs


Features of Skinny Gibbs

Extends readily to mixed data types and non-linear models such as logistic regression and quantile regression

Continues to work in non-convex settings, unlike optimization based methods

Computationally scalable without sacrificing strong selection consistency

Mixing properties of Skinny Gibbs and other high dimensional Bayesian algorithms require future work!


Beyond Linear Regression

Many distributions can be written as mixtures of Gaussian distributions; the Skinny Gibbs trick extends readily to them

Example: Logistic Distribution

L_i ∼ Logistic(x_i β)  ⟺  L_i | s_i ∼ N(x_i β, s_i²),  s_i / 2 ∼ F_KS,

where F_KS is the Kolmogorov-Smirnov distribution (Stefanski, 1991)

Skinny Gibbs works for logistic regression!
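A minimal numerical check of this scale-mixture representation (SciPy assumed available, with scipy.stats.kstwobign used as the Kolmogorov-Smirnov distribution and x_i β set to 0 for simplicity):

    import numpy as np
    from scipy.stats import kstwobign, logistic

    rng = np.random.default_rng(2)
    N = 200_000
    s = 2.0 * kstwobign.rvs(size=N, random_state=rng)   # s/2 ~ Kolmogorov-Smirnov
    L = rng.normal(0.0, s)                               # L | s ~ N(0, s^2)

    # quantiles of the mixture should match the standard logistic distribution
    for q in (0.1, 0.25, 0.5, 0.75, 0.9):
        print(q, np.quantile(L, q), logistic.ppf(q))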


Simulation Study

n = 100, p = 250.

|t| = 4; βt = (1.5, 2, 2.5, 3, 0, . . . , 0)

Logistic model: Yi | xi ∼ Logistic(xi β)

Signal-to-noise ratio around 0.9 for the latent response.


Figure: Skinny Gibbs - marginal posterior probabilities of the covariates



Simulation Results

Table: Covariates without Correlation

          TP     FP     Z = t   Z ⊃ t   Z_s = t
Skinny    3.72   0.54   0.50    0.75    0.70
VBayes    3.33   0.04   0.46    0.47    0.75
BSR       2.90   0.09   0.23    0.28    0.52
Alasso    3.69   1.23   0.24    0.70    0.52
SCAD      3.69   1.75   0.17    0.71    0.40
MCP       3.82   2.69   0.13    0.83    0.56

TP: true positives; FP: false positives; Z = t: exact selection; Z ⊃ t: inclusion; Z_s = t: top-4 selection


Simulation Results

Table: Covariates with Correlation of 0.25 between covariates

          TP     FP     Z = t   Z ⊃ t   Z_s = t
Skinny    3.62   0.56   0.38    0.65    0.66
VBayes    3.07   0.03   0.33    0.33    0.60
BSR       2.82   0.14   0.23    0.26    0.46
Alasso    3.30   1.06   0.15    0.40    0.28
SCAD      3.33   1.65   0.09    0.47    0.24
MCP       3.72   3.22   0.03    0.74    0.47

TP: true positives; FP: false positives; Z = t: exact selection; Z ⊃ t: inclusion; Z_s = t: top-4 selection


Computational Time

Figure: CPU time (in seconds) for BASAD and Skinny Gibbs for n = 100 as p varies.


Censored Quantile Regression

Suppose Y is censored, and we only observe Y^o = (Y ∨ c)

Example: carotid artery plaque thickness in stroke patients (Gardener et al., 2014)

The conditional mean of Y is not identifiable, but (most) conditional quantiles are!

τ-th conditional quantile of Y: Q_τ(Y | X) = X β_τ

Powell's (1984, 1986) objective for estimation:

β̂_Pow(τ) = argmin_β Σ_{i=1}^{n} ρ_τ( Y_i^o − (X_i β ∨ c) ),

where ρ_τ(u) = u(τ − 1(u < 0)) is the check loss
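A minimal sketch of this objective (Python/NumPy; the data, the censoring point c, and the quantile level tau are hypothetical), making the source of the non-convexity, the (X_i β ∨ c) term, explicit:

    import numpy as np

    def check_loss(u, tau):
        """rho_tau(u) = u * (tau - 1{u < 0})."""
        return u * (tau - (u < 0))

    def powell_objective(beta, X, Y_obs, c, tau):
        """Powell's censored quantile objective: sum_i rho_tau(Y_i^o - max(X_i beta, c))."""
        fitted = np.maximum(X @ beta, c)
        return np.sum(check_loss(Y_obs - fitted, tau))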



Non-convexity

One major difficulty, even in low dimensions, is computational: the objective function is non-convex!


Bayesian Computation - Skinny Gibbs

Working asymmetric Laplace (AL) likelihood (with weights w_i):

L(β) = Π_{i=1}^{n} w_i τ(1 − τ) exp{ −w_i ρ_τ( Y_i^o − (x_i′β ∨ c) ) }

The AL distribution can be represented as (Kozumi and Kobayashi, 2011):

w_i^{-1} ξ_1 ν + w_i^{-1} ξ_2 √ν z  ∼  exp{ −w_i ρ_τ(·) },

where ν ∼ Exp(1), z ∼ N(0, 1), and ξ_1, ξ_2 are constants

The Skinny Gibbs trick works for scalable computation and model selection!
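A minimal sketch of this mixture representation with w_i = 1 (Python/NumPy; the constants ξ_1 = (1 − 2τ)/(τ(1 − τ)) and ξ_2² = 2/(τ(1 − τ)) are the usual Kozumi-Kobayashi choices, stated here as an assumption since the slide leaves them unspecified):

    import numpy as np

    rng = np.random.default_rng(3)
    tau, N = 0.3, 200_000
    xi1 = (1 - 2 * tau) / (tau * (1 - tau))
    xi2 = np.sqrt(2 / (tau * (1 - tau)))

    nu = rng.exponential(1.0, size=N)                  # nu ~ Exp(1)
    z = rng.standard_normal(N)                         # z ~ N(0, 1)
    u = xi1 * nu + xi2 * np.sqrt(nu) * z               # should follow AL(tau)

    # under the AL(tau) density tau*(1-tau)*exp(-rho_tau(u)), P(u < 0) equals tau
    print(np.mean(u < 0))                              # should be close to tau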


Conclusions

Provided a theoretical framework to study Bayesian model selection with spike-and-slab priors

Proposed Skinny Gibbs - a scalable and flexible algorithm

Also applicable to non-convex problems such as censored quantile regression

Bayesian model selection deserves further investigation!
