
Scalable Bayesian Model Selection


for High Dimensional Data

Naveen N. Narisetty
Department of Statistics

2018 BayesComp Meeting


Barcelona, Spain

Outline of the Talk

1 High Dimensional Model Selection
  – Introduction
  – Shrinking and Diffusing Priors

2 Computation
  – Large Matrix Computations
  – Scalable Gibbs Sampler

Introduction

Modern data sets contain a large number of variables, often even larger than the sample size

High dimensional data examples include gene expression data, healthcare data, search query data, etc.

Variable (model) selection deals with identifying the most important variables for a response of interest


Challenges with High Dimensionality

Regression: Linear, Generalized Linear, Nonlinear, Quantile, etc.

Classification: Logistic, Support Vector Machines, Decision Trees, etc.

Dimension Reduction: Principal Component / Canonical Correlation / Linear Discriminant Analysis, etc.

Graphical Models

· · ·

The number of parameters in most statistical models explodes, bringing computational and theoretical challenges!


High Dimensional Linear Regression

Consider the standard linear regression set-up

Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}

n - number of observations
p - number of variables

When p > n, even the estimation problem is ill-posed.

A natural working assumption is sparsity:

#{j : β_j ≠ 0} ≪ (p ∧ n)

How can the set of active variables be consistently identified?


(Popular) Penalty Trick

The penalty trick is to minimize the loss together with a penalty function ρ_λ(β) to induce shrinkage or sparsity

β̂_PEN = argmin_β { ‖Y − Xβ‖² + ρ_λ(β) }

A very natural penalty is the L0 penalty

ρ_λ(β) = λ Σ_{j=1}^{p} 1{β_j ≠ 0}

L0 penalty is argued to give good theoretical properties (Schwarz, 1978; Shen, Pan and Zhu, 2012)
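To make concrete why the L0 penalty is computationally demanding, here is a minimal sketch (Python/NumPy; the data X, Y, the tuning parameter lam, and the cap max_size are hypothetical choices) that evaluates the L0-penalized objective by exhaustive search over small supports:

    import itertools
    import numpy as np

    def l0_best_subset(X, Y, lam, max_size=3):
        """Minimize ||Y - X b||^2 + lam * #{j: b_j != 0} by exhaustive search
        over all supports of size <= max_size -- combinatorial in p."""
        n, p = X.shape
        best_obj, best_beta = np.sum(Y ** 2), np.zeros(p)   # null model
        for k in range(1, max_size + 1):
            for S in itertools.combinations(range(p), k):
                cols = list(S)
                bS, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
                obj = np.sum((Y - X[:, cols] @ bS) ** 2) + lam * k
                if obj < best_obj:
                    best_obj = obj
                    best_beta = np.zeros(p)
                    best_beta[cols] = bS
        return best_beta, best_obj

The search is combinatorial in p, which is what motivates the convex and non-convex surrogates listed on the next slide.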


Variable Selection Methods

Several convex and non-convex penalties have been proposed as a compromise between the L0 penalty and computational tractability
– LASSO (Tibshirani, 1996)
– SCAD (Fan and Li, 2001)
– Elastic Net (Zou and Hastie, 2005)
– Adaptive LASSO (Zou, 2006)
– MCP (Zhang, 2010), and more . . .

Other approaches include: forward selection (Hwang, Zhang, and Ghoshal, 09; Wasserman and Roeder, 09; Guo, Lu, and Li, 14); false selection rate (Boos, Stefanski, and Wu, 09)
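As a point of reference for the penalized route, a minimal LASSO sketch (scikit-learn assumed available; the design, the sparse truth, and the regularization level alpha are hypothetical, with alpha normally chosen by cross-validation):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p = 100, 250
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:4] = [1.5, 2.0, 2.5, 3.0]            # sparse truth
    Y = X @ beta + rng.standard_normal(n)

    fit = Lasso(alpha=0.1).fit(X, Y)           # L1-penalized least squares
    selected = np.flatnonzero(fit.coef_)       # estimated active set
    print(selected)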


Why Bayesian Methods?

The Bayesian approach offers a flexible modeling framework

Bayesian model selection can be asymptotically equivalent to information criteria

In some cases, it provides an alternative and more suitable computational framework compared to optimization


Bayesian Framework

Regression Model: Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1};  ε ∼ N(0, σ²I)

Introduce binary variables Z_j to indicate whether the j-th covariate is active (j = 1, . . . , p)

Place priors on β_j as

β_j | Z_j = 0 ∼ π_0(β_j),   β_j | Z_j = 1 ∼ π_1(β_j),     (1)

and on Z = (Z_1, . . . , Z_p)

With a prior placed on Z, the posterior of Z can be used for model selection


Spike and Slab Priors

The prior π_0 (called the spike) is often taken to be a point mass (Geweke, 1996; Scott and Berger, 2010; Liang, Song, and Wu, 2013; Yang, Jordan and Wainwright, 2015)

Some common choices for the prior π_1 (called the slab) include:
– g priors (Zellner, 1983; Som, Hans, MacEachern, 2014)
– Uniform (Mitchell and Beauchamp, 1988)
– Gaussian (George and McCulloch, 1993)
– Laplace (Park and Casella, 2008)
– Non-local priors (Johnson and Rossell, 2012; Rossell and Telesca, 2017), & more . . .


Gaussian Spike and Slab Priors

George and McCulloch (1993)

Spike: π_0 ≡ N(0, τ_0²), with the variance τ_0² fixed at a small value

Slab: π_1 ≡ N(0, τ_1²), with the variance τ_1² fixed at a larger value

Other choices for the mixing distribution have also been considered more recently (Ročková and George, 2014; 2016)

This allowed the use of a Gibbs sampler for posterior sampling

Extensively used in applications

Model selection properties and computational challenges will be investigated in high dimensional settings
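A minimal sketch of drawing from this prior hierarchy (Python/NumPy; the values of q, tau0 and tau1 below are hypothetical placeholders, not the values prescribed later in the talk):

    import numpy as np

    rng = np.random.default_rng(1)
    p, q = 250, 0.05            # number of covariates, prior P(Z_j = 1)
    tau0, tau1 = 0.1, 5.0       # spike and slab standard deviations

    Z = rng.binomial(1, q, size=p)                     # latent inclusion indicators
    beta = np.where(Z == 1,
                    rng.normal(0.0, tau1, size=p),     # slab: N(0, tau1^2)
                    rng.normal(0.0, tau0, size=p))     # spike: N(0, tau0^2)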


Shrinking and Diffusing Priors (Narisetty and He, 2014)

Spike prior shrinks: τ²_{0,n} → 0 (faster than 1/n)
– Does not miss “small” non-zero coefficients

Slab prior diffuses: τ²_{1,n} → ∞ (as (p_n^{2+δ}/n ∨ 1) for some δ > 0)
– Acts as a penalty to drive inactive covariates under Z = 0

P[Z_j = 1] = q_n → 0 (as p_n^{-1})
– Controls the model size a priori

Strong selection consistency: under such priors,

P[Z = t | Y] → 1 in probability (the posterior concentrates on the true model t).



Figure: spike and slab prior densities with spike variance 0.04 and slab variance 1


Figure: spike and slab prior densities with spike variance 0.01 and slab variance 25
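A minimal sketch reproducing this kind of picture (matplotlib and SciPy assumed available; only the two variance pairs shown in the figures above are used):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    x = np.linspace(-6, 6, 500)
    for spike_var, slab_var in [(0.04, 1.0), (0.01, 25.0)]:
        plt.figure()
        plt.plot(x, norm.pdf(x, scale=np.sqrt(spike_var)), label="spike")
        plt.plot(x, norm.pdf(x, scale=np.sqrt(slab_var)), label="slab")
        plt.title(f"spike variance {spike_var}, slab variance {slab_var}")
        plt.legend()
    plt.show()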



Outline

1 High Dimensional Model Selection

2 Computation
  – Large Matrix Computations
  – Scalable Gibbs Sampler


Posterior Computation

Standard Gibbs Sampler: the conditional distributions are
– Z_j (given β): independent Bernoulli
– β (given Z): multivariate normal N(V X′Y, σ²V)

Easy to implement but not easily scalable; involves p_n-dimensional matrix computations

Similar difficulty for other Gibbs sampling based methods (Ishwaran and Rao, 2005; 2010; Bhattacharya et al., 2015)

Can we have a scalable Gibbs sampler with strong selection consistency?
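A minimal sketch of one iteration of this standard Gibbs sampler (Python/NumPy/SciPy; σ², τ_0, τ_1 and the prior inclusion probability q are treated as fixed, and the σ²-scaled spike/slab densities follow the R_j formula shown later, both simplifying assumptions), making the p_n × p_n inversion explicit:

    import numpy as np
    from scipy.stats import norm

    def standard_gibbs_step(X, Y, Z, sigma2, tau0, tau1, q, rng):
        n, p = X.shape
        # beta | Z ~ N(V X'Y, sigma2 * V) with V = (X'X + D_z)^{-1}: full p x p inverse
        d = np.where(Z == 1, tau1 ** -2, tau0 ** -2)
        V = np.linalg.inv(X.T @ X + np.diag(d))
        beta = rng.multivariate_normal(V @ (X.T @ Y), sigma2 * V)
        # Z_j | beta ~ independent Bernoulli, comparing slab and spike densities at beta_j
        slab = q * norm.pdf(beta, scale=np.sqrt(sigma2) * tau1)
        spike = (1 - q) * norm.pdf(beta, scale=np.sqrt(sigma2) * tau0)
        Z = rng.binomial(1, slab / (slab + spike))
        return beta, Z

The np.linalg.inv call on the p_n × p_n matrix is the bottleneck that the next slides remove.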


Large Matrix Computations

The distribution of β (given Z) has covariance

V = (X′X + D_z)^{-1},   where D_z = Diag(Z τ_{1,n}^{-2} + (1 − Z) τ_{0,n}^{-2})

Requires at least order p_n² computations, making it infeasible for large p_n

Strategy:
– Sparsify the precision matrix X′X + D_z so that the β_j’s can somehow be sampled independently

Sparse Precision Matrix

A = {j : Z_j = 1} - active set;   I = {j : Z_j = 0} - inactive set

Rearrange the variables and partition X = [X_A, X_I]

Can the p_n × p_n precision matrix be made skinny as

V^{-1} = ( X_A′X_A + τ_{1n}^{-2} I     X_A′X_I
           X_I′X_A                     X_I′X_I + τ_{0n}^{-2} I )

                         ⇓ ?

S^{-1} := ( X_A′X_A + τ_{1n}^{-2} I    0
            0                          (n + τ_{0n}^{-2}) I )

If so, generating β requires only linear order computations in p_n
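A minimal sketch of the cost difference (Python/NumPy; the sizes n, p and the active-set size k are hypothetical, and σ² is ignored since only the matrix work matters here):

    import time
    import numpy as np

    rng = np.random.default_rng(4)
    n, p, k = 100, 4000, 10                        # k = |A|, the active-set size
    X = rng.standard_normal((n, p))
    Z = np.zeros(p, dtype=int); Z[:k] = 1
    tau0, tau1 = 0.1, 5.0
    d = np.where(Z == 1, tau1 ** -2, tau0 ** -2)

    t0 = time.time()
    V = np.linalg.inv(X.T @ X + np.diag(d))                              # full p x p inverse
    t1 = time.time()
    SA = np.linalg.inv(X[:, :k].T @ X[:, :k] + tau1 ** -2 * np.eye(k))   # only the |A| x |A| block
    t2 = time.time()
    print(f"full V: {t1 - t0:.2f}s   skinny block: {t2 - t1:.5f}s")

The full inverse scales like p³, while the skinny version needs only an |A| × |A| solve plus independent draws for the inactive block.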


Does the Sparsification Work?

The correlations among the components of β_I, and between β_A and β_I, are lost

The Gibbs sampler would not have an appropriate stationary distribution

We can recover!


Skinny Gibbs

Re-introduce the dependencies through the Z’s by sampling them sequentially!

R_j = P[Z_j = 1 | Z_{-j}, Rest] / P[Z_j = 0 | Z_{-j}, Rest]

    = [ q_n φ(β_j, 0, σ²τ²_{1,n}) / ((1 − q_n) φ(β_j, 0, σ²τ²_{0,n})) ] exp{ σ^{-2} β_j X_j′(Y − X_{A_j} β_{A_j}) },

where A_j is the set of active variables in Z_{-j}

We call the new sampler - Skinny Gibbs!


Skinny Gibbs Sampler

Sample β (given Z) from N(S X′Y, S), where S is the sparsified covariance matrix

S = ( (X_A′X_A + τ_{1n}^{-2} I)^{-1}    0
      0                                 (n + τ_{0n}^{-2})^{-1} I )

Sample Z (given β) from the conditional Bernoulli draws:

P[Z_j = 1 | Z_{-j}, Rest] = R_j / (1 + R_j)
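A minimal sketch of one Skinny Gibbs iteration as read off these two updates (Python/NumPy/SciPy; σ² is taken as known, the covariance of the β draw is scaled by σ² by analogy with the standard sampler, and the columns of X are assumed standardized so that X_j′X_j ≈ n; an illustrative reading, not the authors' exact implementation):

    import numpy as np
    from scipy.special import expit
    from scipy.stats import norm

    def skinny_gibbs_step(X, Y, Z, sigma2, tau0, tau1, q, rng):
        n, p = X.shape
        A, I = np.flatnonzero(Z == 1), np.flatnonzero(Z == 0)
        beta = np.zeros(p)
        # beta | Z from a block-diagonal S: only an |A| x |A| solve is needed
        if len(A) > 0:
            SA = np.linalg.inv(X[:, A].T @ X[:, A] + tau1 ** -2 * np.eye(len(A)))
            beta[A] = rng.multivariate_normal(SA @ (X[:, A].T @ Y), sigma2 * SA)
        beta[I] = rng.normal((X[:, I].T @ Y) / (n + tau0 ** -2),
                             np.sqrt(sigma2 / (n + tau0 ** -2)))
        # Z_j | Z_{-j}, beta: sequential Bernoulli draws with odds R_j
        for j in range(p):
            Aj = np.flatnonzero(Z == 1)
            Aj = Aj[Aj != j]                           # active variables in Z_{-j}
            resid = Y - X[:, Aj] @ beta[Aj]
            log_Rj = (np.log(q / (1 - q))
                      + norm.logpdf(beta[j], scale=np.sqrt(sigma2) * tau1)
                      - norm.logpdf(beta[j], scale=np.sqrt(sigma2) * tau0)
                      + beta[j] * (X[:, j] @ resid) / sigma2)
            Z[j] = rng.binomial(1, expit(log_Rj))
        return beta, Z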


Skinny Gibbs Posterior

Skinny Gibbs has a stationary distribution:

Lemma
The posterior of Z corresponding to Skinny Gibbs is given by

P[Z = k | Data] ∝ |V_{k1}|^{1/2} |V_{k0}|^{1/2} |D_k|^{1/2} q_n^{|k|} exp( −R̃_k / 2 ),

where V_{k1} = (τ_{1n}^{-2} I + X_k′X_k)^{-1},  V_{k0} = (τ_{0n}^{-2} + n)^{-1} I.

Skinny Gibbs still retains strong selection consistency!

Theorem (Narisetty, Shen, and He)
Under similar conditions, we have

P[Z = t | Data] → 1 in probability.


Existing Computational Methods

Optimization-based Methods
– EM Algorithm (Ročková and George, 2014)
– Approximate L0 Optimization (Bertsimas, King, and Mazumder; Cassidy and Solo, 2015)

Metropolis-Hastings (MH) Random Walk Methods
– Integrated Likelihood Approach (Guan and Stephens, 2011)
– Shotgun Stochastic Search (Hans, Dobra, West, 2007)
– Bayesian Subset Regression (Liang, Song, and Yu, 2013)

Gibbs Sampling Methods
– Standard Samplers (George and McCulloch, 1993; Ishwaran, Kogalur, and Rao, 2005; 2010)
– Skinny Gibbs


Features of Skinny Gibbs

Extends readily to mixed data types and non-linear models such as logistic regression and quantile regression

Continues to work in non-convex settings, unlike optimization based methods

Computationally scalable without sacrificing strong selection consistency

Mixing properties of Skinny Gibbs and other high dimensional Bayesian algorithms require future work!


Beyond Linear Regression

Many distributions can be written as mixtures of Gaussian distributions; the Skinny Gibbs trick extends readily to them

Example: Logistic Distribution

L_i ∼ Logistic(x_i β)  ⟺  L_i | s_i ∼ N(x_i β, s_i²),  s_i / 2 ∼ F_KS,

where F_KS is the Kolmogorov-Smirnov distribution (Stefanski, 1991)

Skinny Gibbs works for logistic regression!
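A minimal numerical check of this scale-mixture representation (SciPy assumed available, with scipy.stats.kstwobign used as the Kolmogorov-Smirnov distribution and x_i β set to 0 for simplicity):

    import numpy as np
    from scipy.stats import kstwobign, logistic

    rng = np.random.default_rng(2)
    N = 200_000
    s = 2.0 * kstwobign.rvs(size=N, random_state=rng)   # s/2 ~ Kolmogorov-Smirnov
    L = rng.normal(0.0, s)                               # L | s ~ N(0, s^2)

    # quantiles of the mixture should match the standard logistic distribution
    for q in (0.1, 0.25, 0.5, 0.75, 0.9):
        print(q, np.quantile(L, q), logistic.ppf(q))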


Simulation Study

n = 100, p = 250.

|t| = 4; βt = (1.5, 2, 2.5, 3, 0, . . . , 0)

Logistic model: Yi | xi ∼ Logistic(xi β)

Signal-to-noise ratio around 0.9 for the latent response.


Figure: Skinny Gibbs - marginal posterior probabilities of the covariates



Simulation Results

Table: Covariates without Correlation

          TP     FP     Z = t   Z ⊃ t   Z_s = t
Skinny    3.72   0.54   0.50    0.75    0.70
VBayes    3.33   0.04   0.46    0.47    0.75
BSR       2.90   0.09   0.23    0.28    0.52
Alasso    3.69   1.23   0.24    0.70    0.52
SCAD      3.69   1.75   0.17    0.71    0.40
MCP       3.82   2.69   0.13    0.83    0.56

TP: true positives; FP: false positives; Z = t: exact selection; Z ⊃ t: inclusion; Z_s = t: top-4 selection


Simulation Results

Table: Covariates with Correlation of 0.25 between covariates

          TP     FP     Z = t   Z ⊃ t   Z_s = t
Skinny    3.62   0.56   0.38    0.65    0.66
VBayes    3.07   0.03   0.33    0.33    0.60
BSR       2.82   0.14   0.23    0.26    0.46
Alasso    3.30   1.06   0.15    0.40    0.28
SCAD      3.33   1.65   0.09    0.47    0.24
MCP       3.72   3.22   0.03    0.74    0.47

TP: true positives; FP: false positives; Z = t: exact selection; Z ⊃ t: inclusion; Z_s = t: top-4 selection


Computational Time

Figure: CPU time (in seconds) for BASAD and Skinny Gibbs for n = 100 as p varies.


Censored Quantile Regression

Suppose Y is censored, and we only observe Y^o = (Y ∨ c)

Example: carotid artery plaque thickness in stroke patients (Gardener et al., 2014)

The conditional mean of Y is not identifiable, but (most) conditional quantiles are!

τ-th conditional quantile of Y: Q_τ(Y | X) = X β_τ

Powell's (1984, 1986) objective for estimation:

β̂_Pow(τ) = argmin_β Σ_{i=1}^{n} ρ_τ( Y_i^o − (X_i β ∨ c) ),

where ρ_τ(u) = u(τ − 1(u < 0)) is the check loss
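A minimal sketch of this objective (Python/NumPy; the data, the censoring point c, and the quantile level tau are hypothetical), making the source of the non-convexity, the (X_i β ∨ c) term, explicit:

    import numpy as np

    def check_loss(u, tau):
        """rho_tau(u) = u * (tau - 1{u < 0})."""
        return u * (tau - (u < 0))

    def powell_objective(beta, X, Y_obs, c, tau):
        """Powell's censored quantile objective: sum_i rho_tau(Y_i^o - max(X_i beta, c))."""
        fitted = np.maximum(X @ beta, c)
        return np.sum(check_loss(Y_obs - fitted, tau))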



Non-convexity

One major difficulty, even in low dimensions, is computational: the objective function is non-convex!


Bayesian Computation - Skinny Gibbs

Working asymmetric Laplace (AL) likelihood (with weights w_i):

L(β) = Π_{i=1}^{n} w_i τ(1 − τ) exp{ −w_i ρ_τ( Y_i^o − (x_i′β ∨ c) ) }

The AL distribution can be represented as (Kozumi and Kobayashi, 2011):

w_i^{-1} ξ_1 ν + w_i^{-1} ξ_2 √ν z  ∼  exp{ −w_i ρ_τ(·) },

where ν ∼ Exp(1), z ∼ N(0, 1), and ξ_1, ξ_2 are constants

The Skinny Gibbs trick works for scalable computation and model selection!
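A minimal sketch of this mixture representation with w_i = 1 (Python/NumPy; the constants ξ_1 = (1 − 2τ)/(τ(1 − τ)) and ξ_2² = 2/(τ(1 − τ)) are the usual Kozumi-Kobayashi choices, stated here as an assumption since the slide leaves them unspecified):

    import numpy as np

    rng = np.random.default_rng(3)
    tau, N = 0.3, 200_000
    xi1 = (1 - 2 * tau) / (tau * (1 - tau))
    xi2 = np.sqrt(2 / (tau * (1 - tau)))

    nu = rng.exponential(1.0, size=N)                  # nu ~ Exp(1)
    z = rng.standard_normal(N)                         # z ~ N(0, 1)
    u = xi1 * nu + xi2 * np.sqrt(nu) * z               # should follow AL(tau)

    # under the AL(tau) density tau*(1-tau)*exp(-rho_tau(u)), P(u < 0) equals tau
    print(np.mean(u < 0))                              # should be close to tau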


Conclusions

Provided a theoretical framework to study Bayesian model selection with spike-and-slab priors

Proposed Skinny Gibbs - a scalable and flexible algorithm

Also applicable to non-convex problems such as censored quantile regression

Bayesian model selection deserves further investigation!
