

Consistent and Scalable Bayesian Model Selection for High Dimensional Data

Naveen Naidu Narisetty

Department of Statistics
University of Michigan, Ann Arbor

December 11, 2015



Other Work (A small Detour)

Functional Data Depth – proposed a new notion of depth, studied its properties and applications to simultaneous inference
Ref: Narisetty, N.N., Nair, V.N. Extremal Notion of Depth for Functional Data and Applications. JASA Theory & Methods, To Appear

Large-Scale Computer Models – worked on nonparametric regression based methods as statistical emulators
Ref: Zhang, Z., Narisetty, N.N., Nair, V.N., Zhu, J. Calibration of Computer Models with Expanded MARS.

Environmental Exposures – proposed a method to select nonlinear interactions between environmental exposures
Ref: Narisetty, N.N. and Mukherjee, B. Nonlinear interaction selection for environmental exposures.

Outline of the Talk

1 High Dimensional Model Selection
  Introduction

2 Bayesian Model Selection Consistency
  Shrinking and Diffusing Priors

3 Scalable Computation
  Skinny Gibbs Sampler

4 Censoring and Non-convexity



Introduction

Modern Data Sets contain a large number of variables, often even larger than the sample size

High Dimensional Data examples include gene expression data, healthcare data, search queries data, etc.

Variable (Model) Selection: deals with identifying the most important variables for a response of interest


High Dimensional Linear Regression

Consider the standard linear regression set-up

    Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}

n - no. of observations
p - no. of variables

When p > n, even the estimation problem is ill-posed.

A natural working assumption is sparsity:

    #{j : β_j ≠ 0} ≪ (p ∧ n)

How can the set of active variables be consistently identified?
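A minimal simulation sketch of this set-up (illustrative only; the data-generating choices below, such as the four active coefficients, are picked for the example and are not prescribed by the talk):

import numpy as np

# Sparse linear model Y = X beta + eps with p > n.
rng = np.random.default_rng(0)
n, p = 100, 250
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:4] = [1.5, 2.0, 2.5, 3.0]                  # only 4 nonzero coefficients
Y = X @ beta + rng.normal(size=n)

print("true active set:", np.flatnonzero(beta))  # #{j: beta_j != 0} << (p ^ n)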


(Popular) Penalty Trick

The penalty trick is to minimize the loss together with a penalty function ρ_λ(β) to induce shrinkage or sparsity

    β̂_PEN = arg min_β { ‖Y − Xβ‖² + ρ_λ(β) }

A very natural penalty is the L0 penalty

    ρ_λ(β) = λ Σ_{j=1}^p 1{β_j ≠ 0}

L0 penalty is argued to give good theoretical properties (Schwarz, 1978; Shen, Pan and Zhu, 2012)
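As a toy illustration of the L0 criterion, the sketch below enumerates all small supports and scores each by residual sum of squares plus λ times the model size; brute-force enumeration is only feasible for very small p and is not how any of the methods cited here are implemented, and the choice lam = log(n) is just a BIC-flavored example.

import numpy as np
from itertools import combinations

def l0_best_subset(X, Y, lam, max_size=3):
    """Minimize ||Y - X b||^2 + lam * (number of nonzero coefficients) by enumeration."""
    best_score, best_support = float(np.sum(Y ** 2)), ()   # empty model (b = 0)
    p = X.shape[1]
    for size in range(1, max_size + 1):
        for support in combinations(range(p), size):
            Xk = X[:, list(support)]
            bk, *_ = np.linalg.lstsq(Xk, Y, rcond=None)
            score = np.sum((Y - Xk @ bk) ** 2) + lam * size
            if score < best_score:
                best_score, best_support = score, support
    return best_support, best_score

rng = np.random.default_rng(1)
n, p = 50, 8
X = rng.normal(size=(n, p))
Y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)
print(l0_best_subset(X, Y, lam=np.log(n)))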


Penalization methods

Several convex and non-convex penalties have been proposed as a compromise between the L0 penalty and computational tractability (a usage sketch follows the list below):
LASSO (Tibshirani, 1996)
SCAD (Fan and Li, 2001)
Elastic Net (Zou and Hastie, 2005)
Adaptive LASSO (Zou, 2006)
Dantzig Selector (Candes and Tao, 2007)
OSCAR (Bondell and Reich, 2008)
MCP (Zhang, 2010), and more . . .
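For instance, the LASSO can be run with off-the-shelf software; a minimal scikit-learn sketch (the regularization level alpha = 0.1 is an arbitrary choice here and would normally be tuned, e.g., by cross-validation):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 250
X = rng.normal(size=(n, p))
Y = X[:, :4] @ np.array([1.5, 2.0, 2.5, 3.0]) + rng.normal(size=n)

fit = Lasso(alpha=0.1, max_iter=10000).fit(X, Y)   # L1-penalized least squares
print("selected variables:", np.flatnonzero(fit.coef_))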


Why “Bayesian Methods”?

Bayesian methods have been popular

Bayesian model selection can be asymptotically equivalent to information criteria

Properties of Bayesian model selection methods for high dimensional data are not as well understood – two challenges will be discussed


Bayesian Framework

Regression Model: Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1};  ε ∼ N(0, σ²I)

Introduce binary variables Z_j to indicate that the j-th covariate is active (j = 1, . . . , p)

Place priors on β_j as

    β_j | Z_j = 0 ∼ π_0(β_j),   β_j | Z_j = 1 ∼ π_1(β_j),    (1)

and on Z = (Z_1, . . . , Z_p)

The posterior of Z is used for model selection


Spike and Slab Priors

The prior π_0 (called spike) is often taken to be a point mass (Geweke, 1996; Scott and Berger, 2010; Liang, Song, and Wu, 2013; Yang, Jordan and Wainwright, 2015)

Some common choices for the prior π_1 (called slab) include:
– g priors (Zellner, 1983; Liang et al., 2007)
– Uniform (Mitchell and Beauchamp, 1988)
– Gaussian (George and McCulloch, 1993; Ročková and George, 2014)
– Laplace (Park and Casella, 2008; Castillo, Schmidt-Hieber, and van der Vaart, 2015)
– Non-local priors (Johnson and Rossell, 2012), & more . . .


Gaussian Spike and Slab Priors

George and McCulloch (1993)


Spike: π_0 ≡ N(0, τ_0²), the variance τ_0² is a small and fixed value

Slab: π_1 ≡ N(0, τ_1²), the variance τ_1² is fixed at a larger value

This allowed the use of a Gibbs sampler for posterior sampling

Extensively used in applications

Model selection properties not understood and computationally intense in high dimensional settings
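A sketch of such a Gibbs sampler for the Gaussian spike-and-slab model (a minimal illustration, not the authors' implementation; the hyperparameter defaults tau0, tau1, q, a1, a2 below are placeholder values chosen for the example):

import numpy as np
from scipy.stats import norm

def spike_slab_gibbs(X, Y, tau0=0.05, tau1=5.0, q=0.1, a1=1.0, a2=1.0,
                     n_iter=2000, seed=0):
    """Standard Gibbs sweep: beta | Z, sigma2 is multivariate normal, Z_j | beta are
    independent Bernoulli, sigma2 | beta, Z is inverse gamma.  Note the full p x p
    inversion in every sweep -- the cost that Skinny Gibbs is designed to avoid."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, XtY = X.T @ X, X.T @ Y
    Z, sigma2, incl = np.zeros(p, dtype=int), 1.0, np.zeros(p)
    for _ in range(n_iter):
        # beta | Z, sigma2 ~ N(V X'Y, sigma2 V) with V = (X'X + D_Z)^{-1}
        d = np.where(Z == 1, tau1, tau0) ** -2.0
        V = np.linalg.inv(XtX + np.diag(d))
        beta = rng.multivariate_normal(V @ XtY, sigma2 * V)
        # Z_j | beta: Bernoulli, slab density vs spike density weighted by q
        log1 = np.log(q) + norm.logpdf(beta, 0.0, np.sqrt(sigma2) * tau1)
        log0 = np.log(1 - q) + norm.logpdf(beta, 0.0, np.sqrt(sigma2) * tau0)
        Z = rng.binomial(1, 1.0 / (1.0 + np.exp(log0 - log1)))
        # sigma2 | beta, Z: conjugate inverse-gamma update
        dz = np.where(Z == 1, tau1, tau0) ** -2.0
        ss = np.sum((Y - X @ beta) ** 2) + np.sum(dz * beta ** 2)
        sigma2 = 1.0 / rng.gamma(a1 + 0.5 * (n + p), 1.0 / (a2 + 0.5 * ss))
        incl += Z
    return incl / n_iter   # estimated marginal inclusion probabilities P[Z_j = 1 | Data]

With (X, Y) simulated as in the earlier sketch, spike_slab_gibbs(X, Y) returns one estimated inclusion probability per covariate, which can then be thresholded for model selection.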


Goal and Challenges

Goal:
To study model selection properties using such priors, and
to provide a practically feasible implementation even when p ≫ n

Two Challenges:
Theoretical Challenge: analyzing the posterior on a huge space of 2^p models
– A new framework using sample-size dependent priors
Computational Challenge: standard Gibbs samplers not scalable due to large matrix computations
– A scalable and flexible Gibbs sampling algorithm


Outline

1 High Dimensional Model Selection

2 Bayesian Model Selection Consistency
  Shrinking and Diffusing Priors

3 Scalable Computation

4 Censoring and Non-convexity


Motivation - Shrinking Spike Prior

Regression Model: Y | β ∼ N(Xβ, I)

Priors: β_j | Z_j = 0 ∼ N(0, τ_0²),  β_j | Z_j = 1 ∼ N(0, τ_1²);  P[Z_j = 1] = 0.5 for j = 1, . . . , p_n

Data generating model: Y ∼ N(Xβ_0, I);  X′X = nI  (p_n ≤ n)

0 < τ_0² < τ_1² < ∞;  for |β_{0j}| ≤ τ_0 ≠ 0,

    P[Z_j = 1 | Y] < 0.5 w.p. 1 as n → ∞

Not consistent!

Let the spike variance go to zero with sample size!


Motivation - Diffusing Slab Prior

Consider τ_{0n}² → 0, τ_1² < ∞ and P[Z_j = 1] = 0.5. Then,

    P[Z_j = 1 | Y] > 0.5  ⇐⇒  β̂_j ≥ t_n ≈ √(log n / n),

where β̂_j is the OLS estimator.

When p_n ≫ √n,  P[True Model | Y] → 0 in probability. Not consistent!

What would be a framework to understand the model selection properties?


Shrinking and Diffusing Priors

Spike prior shrinks: τ_{0,n}² → 0 (faster than 1/n)
– Does not miss “small” non-zero coefficients

Slab prior diffuses: τ_{1,n}² → ∞ (as (p_n^{2+δ}/n ∨ 1))
– Acts as a penalty to drive inactive covariates under Z = 0

P[Z_j = 1] = q_n → 0 (as p_n^{-1})
– Controls the model size a priori


Shrinking and Diffusing Priors

Figure : spike variance: 0.2, slab variance: 1

Shrinking and Diffusing Priors

Figure : spike variance: 0.1, slab variance: 5



Bayesian Hierarchical Framework

We consider the following hierarchical model:

    Y | β ∼ N(Xβ, σ²I)

    β_j | (Z_j = 0) ∼ N(0, σ²τ_{0,n}²),   β_j | (Z_j = 1) ∼ N(0, σ²τ_{1,n}²)

    P(Z_j = 1) = q_n,  j = 1, . . . , p_n

    σ² ∼ IG(α_1, α_2)


Notation

Model vector: Z = (Z_1, . . . , Z_p)
True regression vector: β_0
True model: t = {j : β_{0j} ≠ 0}
For e.g., β_0 = (0.5, 0.8, 0, . . . , 0) =⇒ t = (1, 1, 0, . . . , 0)
Arbitrary model: k (binary vector of length p)
Posterior probability of model k: P[Z = k | Data]
Posterior Ratio:

    PR(k, t) = P[Z = k | Data] / P[Z = t | Data]


Posterior Ratios

Lemma
For any model k ≠ t, we have

    PR(k, t) ≤ (n τ_{1n}² q_n^{-2})^{-(|k|-|t|)/2} exp{ -(R̃_k - R̃_t) / (2σ²) }

            “≈” (p_n^{(2+δ)} ∨ √n p_n)^{-(|k|-|t|)} exp{ -(R̃_k - R̃_t) / (2σ²) },

where R̃_k ≈ Y′(I − P_{X_k})Y plays the role of the RSS for model k.

This implies

    PR(k, t) = P[Z = k | Data] / P[Z = t | Data] → 0 in probability, for each k ≠ t


Strong Selection Consistency

(Weak) Selection Consistency: For each model k ≠ t,

    P[Z = k | Data] / P[Z = t | Data] → 0 in probability    (2)

Even if (2) holds, it is possible that P[Z = t | Data] → 0 when p_n → ∞, as the number of models increases exponentially

It is desirable to have Strong Selection Consistency (SSC):

    P[Z = t | Data] → 1 in probability

SSC =⇒ selection consistency of the maximum a posteriori model


Sum of Posterior Ratios


SSC is equivalent to  Σ_{k≠t} P[Z = k | Data] / P[Z = t | Data] → 0 in probability.  This sum can be decomposed as:

– M1: Large models, having size more than n / log p_n:
      Σ_{k∈M1} PR(k, t) ≤ exp{−c_1 n}

– M2: Overfitted models, including (but not equal to) t:
      Σ_{k∈M2} PR(k, t) ≤ exp{−c_2 log p_n}

– M3: Underfitted models, not including a covariate in t:
      Σ_{k∈M3} PR(k, t) ≤ exp{−c_3 n Δ_n²};  Δ_n is the minimum signal


SSC with Shrinking and Diffusing Priors

Theorem (Narisetty and He (AoS, 2014))
Posterior probability of the true model converges to one. That is,
P[Z = t | Data] → 1 in probability as n → ∞ (under mild conditions).

Remark 1: p_n can be large such that |t| log p_n / n → 0.

Remark 2: Conditions are on the minimum signal and the minimum non-zero eigenvalues of small submatrices of the Gram matrix

Remark 3: As a useful consequence, for 0 < c < 1, P[Z_j = t_j | Data] > c for all j with probability going to one.


Connection with L0 Penalization

The posterior on the model space using shrinking and diffusing priors can be written as

    −Log Posterior(k) = R̃_k + (|k| − |t|) ψ_{n,k}
                      ≈ Y′(I − P_{X_k})Y + (|k| − |t|) ψ_{n,k}

Resembles the well-known Bayesian Information Criterion

    BIC(k) = Y′(I − P_{X_k})Y + |k| log n

The penalty ψ_{n,k} satisfies

    c log(n ∨ p_n) ≤ ψ_{n,k} ≤ C log(n ∨ p_n),

and has the same rate as that of EBIC (Chen and Chen, 2008)


Related Results on SSC

Our contribution: theoretical framework for SSC with n-dependent priors, and connection to L0 penalization even when p_n > n

Shang and Clayton (2011), and Johnson and Rossell (2012): SSC for p_n < n

More recently, Castillo, Schmidt-Hieber, and van der Vaart (2015) – SSC with n-dependent Laplace priors for p_n > n


Outline

1 High Dimensional Model Selection

2 Bayesian Model Selection Consistency

3 Scalable Computation
Skinny Gibbs Sampler

4 Censoring and Non-convexity


Posterior Computation

Standard Gibbs Sampler: the conditional distributions are
– Z_j (given β): independent Bernoulli
– β (given Z): multivariate normal N(VX′Y, σ²V)

Easy to implement but not easily scalable; involves p_n-dimensional matrix computations

Similar difficulty for other Gibbs sampling based methods (Ishwaran and Rao, 2005; 2010; Bhattacharya et al., 2015)

Can we have a scalable Gibbs sampler having strong selection consistency?


Large Matrix Computations

The distribution of β (given Z) has covariance

    V = (X′X + D_z)^{-1},  where D_z = Diag( Z τ_{1,n}^{-2} + (1 − Z) τ_{0,n}^{-2} )

Requires at least p_n²-order computations, making it not feasible for large p_n

Strategy:
– Sparsify the precision matrix X′X + D_z so that the β_j’s can somehow be sampled independently


Sparse Precision Matrix

A = {j : Z_j = 1} - active set;  I = {j : Z_j = 0} - inactive set

Rearrange the variables and partition X = [X_A, X_I]

Can the p_n × p_n precision matrix be made skinny as

    V^{-1} = [ X_A′X_A + τ_{1n}^{-2} I    X_A′X_I
               X_I′X_A                    X_I′X_I + τ_{0n}^{-2} I ]

            ⇓ ?

    V^{-1} = [ X_A′X_A + τ_{1n}^{-2} I    0
               0                          (n + τ_{0n}^{-2}) I ]

If so, generating β requires only linear order computations in p_n


Does the Sparsification Work?

The Gibbs sampler would not have the right stationary distribution

The correlations among the components of β_I, and between β_A and β_I, are lost

We can recover!


Skinny Gibbs

Re-introduce the dependencies through the Z’s by sampling them sequentially!

    P[Z_j = 1 | Z_{-j}, Rest] / P[Z_j = 0 | Z_{-j}, Rest]
        = [ q_n φ(β_j; 0, σ²τ_{1,n}²) / ((1 − q_n) φ(β_j; 0, σ²τ_{0,n}²)) ] · exp{ σ^{-2} β_j X_j′(Y − X_{A_j} β_{A_j}) },

where A_j is the set of active variables in Z_{-j}

We call the new sampler - Skinny Gibbs!
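A minimal sketch of one Skinny Gibbs sweep as read off these slides (not the authors' implementation): the active block of β is drawn jointly from a |A|-dimensional normal, the inactive block from independent mean-zero normals, and the Z_j are updated sequentially with the odds ratio displayed above. It assumes standardized columns (X_j′X_j ≈ n, matching the (n + τ_{0n}^{-2}) term), and σ² is held fixed to keep the sketch short.

import numpy as np
from scipy.stats import norm

def skinny_gibbs(X, Y, tau0, tau1, q, sigma2=1.0, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Z = np.zeros(p, dtype=bool)
    beta, incl = np.zeros(p), np.zeros(p)
    for _ in range(n_iter):
        A = np.flatnonzero(Z)
        if A.size:                              # active block: |A| x |A| system only
            XA = X[:, A]
            prec = XA.T @ XA + np.eye(A.size) / tau1 ** 2
            mean = np.linalg.solve(prec, XA.T @ Y)
            beta[A] = rng.multivariate_normal(mean, sigma2 * np.linalg.inv(prec))
        Ic = np.flatnonzero(~Z)                 # inactive block: independent draws
        beta[Ic] = rng.normal(0.0, np.sqrt(sigma2 / (n + tau0 ** -2)), size=Ic.size)
        for j in range(p):                      # sequential Z_j updates (slide's odds ratio)
            Aj = np.flatnonzero(Z)
            Aj = Aj[Aj != j]
            resid = Y - X[:, Aj] @ beta[Aj]
            log_odds = (np.log(q) - np.log(1.0 - q)
                        + norm.logpdf(beta[j], 0.0, np.sqrt(sigma2) * tau1)
                        - norm.logpdf(beta[j], 0.0, np.sqrt(sigma2) * tau0)
                        + beta[j] * (X[:, j] @ resid) / sigma2)
            Z[j] = rng.uniform() < 1.0 / (1.0 + np.exp(-log_odds))
        incl += Z
    return incl / n_iter

The point of the design is visible in the code: no p × p matrix is ever formed, only an |A| × |A| block plus O(p) scalar work, while the sequential Z updates restore the dependence dropped by the sparsified precision matrix.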


Skinny Gibbs Posterior

Skinny Gibbs has a stationary distribution:

Lemma
The posterior of Z corresponding to Skinny Gibbs is given by

    P[Z = k | Data] ∝ |V_{k1}|^{1/2} |V_{k0}|^{1/2} |D_k|^{1/2} q_n^{|k|} exp{ −R̃_k / 2 },

where V_{k1} = (τ_{1n}^{-2} I + X_k′X_k)^{-1},  V_{k0} = (τ_{0n}^{-2} + n)^{-1} I.

Skinny Gibbs still retains the strong selection consistency!

Theorem (Narisetty, Shen, and He (2015))
Under similar conditions, we have P[Z = t | Data] → 1 in probability.


Skinny and Strong


Existing Computational Methods

Optimization based Methods
– EM Algorithm (Ročková and George, 2014)
– Approximate L0 Optimization (Bertsimas, King, and Mazumder; Cassidy and Solo, 2015)

Metropolis-Hastings (MH) Random Walk Methods
– Integrated Likelihood Approach (Guan and Stephens, 2011)
– Shotgun Stochastic Search (Hans, Dobra, West, 2007)
– Bayesian Subset Regression (Liang, Song, and Yu, 2013)

Gibbs Sampling Methods
– Standard Samplers (George and McCulloch, 1993; Ishwaran, Kogalur, and Rao, 2005; 2010)
– Skinny Gibbs


Features of Skinny Gibbs

Extends readily to many non-linear models including logistic regression

Continues to work in non-convex settings unlike optimization based methods

Computationally scalable without sacrificing strong selection consistency!


Beyond Linear Regression

Many distributions can be written as mixtures of Gaussian distributions; the Skinny Gibbs Trick extends readily to them

Example: the logistic distribution

    L_i ∼ Logistic(x_i β)  ⇐⇒  L_i | s_i ∼ N(x_i β, s_i²);  s_i / 2 ∼ F_KS,

where F_KS is the Kolmogorov-Smirnov distribution (Stefanski, 1991)

Skinny Gibbs works for logistic regression! (Details: Narisetty, Shen, and He, 2015)


Simulation Study

n = 100, p = 250.

|t| = 4;  β_t = (1.5, 2, 2.5, 3, 0, . . . , 0)

Logistic model: Y_i | x_i ∼ Logistic(x_i β)

Signal-to-noise ratio around 0.9 for the latent response.
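A sketch of one simulated data set under this design (the uncorrelated-covariate setting; the correlated setting used in the second table below is described in the appendix):

import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 250
beta = np.zeros(p)
beta[:4] = [1.5, 2.0, 2.5, 3.0]              # |t| = 4 active coefficients
X = rng.normal(size=(n, p))                  # independent covariates
prob = 1.0 / (1.0 + np.exp(-X @ beta))       # logistic link
Y = rng.binomial(1, prob)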


Skinny Gibbs - Marginal Posterior Probabilities


Simulation Results

Table : Covariates without Correlation

              TP     FP     Z = t   Z_4 = t
Skinny Gibbs  3.64   1.19   0.26    0.61
BSR           3.17   0.23   0.25    0.58
ALasso        3.76   4.00   0.01    0.48
SCAD          3.72   3.26   0.02    0.35
MCP           3.84   4.90   0.00    0.49

TP = True Positives; FP = False Positives; Z = t: Exact Selection; Z_4 = t: Top 4 Selection


Simulation Results

Table : Covariates with Correlation

              TP     FP     Z = t   Z_4 = t
Skinny Gibbs  3.41   1.17   0.24    0.44
BSR           2.56   0.29   0.10    0.27
ALasso        3.66   4.03   0.02    0.30
SCAD          3.40   3.33   0.01    0.03
MCP           3.68   4.18   0.01    0.11

TP = True Positives; FP = False Positives; Z = t: Exact Selection; Z_4 = t: Top 4 Selection

Covariates have correlations between (0.1, 0.5)


Lymph Data Example

n = 148 subjects and p = 4514 genes

Binary response Y = +ve or −ve status of lymph node

Reference: Hans, Dobra, and West (2007)


Cross Validated Prediction Error

Figure : Cross Validated Prediction Error versus Model Size for several model selection methods

n = 148; p = 4514 genes
Response Y = +ve or −ve status of lymph node

Outline

1 High Dimensional Model Selection

2 Bayesian Model Selection Consistency

3 Scalable Computation

4 Censoring and Non-convexity


Censoring

Suppose Y is censored, and we only observe

    Y^o = (Y ∨ c)

Conditional mean of Y is not identifiable but (most) conditional quantiles are!

τ-th conditional quantile of Y: Q_τ(Y | X) = Xβ_τ

Powell (1984, 86)’s objective for estimation:

    β̂_Pow(τ) = arg min_β Σ_{i=1}^n ρ_τ( Y_i^o − (X_i β ∨ c) ),

where ρ_τ(u) = u(τ − 1(u < 0)) is the check-loss
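The check loss and Powell objective transcribe directly into code; a small sketch (note the max with the censoring point c inside the loss, which is what makes the objective non-convex in β):

import numpy as np

def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def powell_objective(beta, X, Y_obs, c, tau):
    """Powell's censored quantile objective: sum_i rho_tau(Y_i^o - max(x_i'beta, c))."""
    fitted = np.maximum(X @ beta, c)
    return np.sum(check_loss(Y_obs - fitted, tau))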


Non-convexity

One major difficulty even in low dimensions is computational, as the objective function is non-convex!


Bayesian Computation - Skinny Gibbs

Working asymmetric Laplace (AL) likelihood (weights w_i):

    L(β) = Π_{i=1}^n w_i τ(1 − τ) exp{ −w_i ρ_τ( Y_i^o − max(x_i′β, c) ) }

The AL distribution can be represented as (Kozumi and Kobayashi, 2011):

    w_i^{-1} ξ_1 ν + w_i^{-1} ξ_2 √ν z ∼ exp{ −w_i ρ_τ(·) },

where ν ∼ Exp(1), z ∼ N(0, 1), and ξ_1, ξ_2 are constants

Skinny Gibbs Trick for scalable computation and model selection (Narisetty and He, 2015)!
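A small numerical check of the mixture representation above (the specific constants ξ_1 = (1 − 2τ)/(τ(1 − τ)) and ξ_2² = 2/(τ(1 − τ)) are the usual Kozumi–Kobayashi values, which the slide does not spell out, so treat them as an assumption):

import numpy as np

def al_mixture_draws(tau, w=1.0, size=100000, seed=4):
    """Draw asymmetric Laplace errors via the normal-exponential mixture."""
    rng = np.random.default_rng(seed)
    xi1 = (1 - 2 * tau) / (tau * (1 - tau))
    xi2 = np.sqrt(2 / (tau * (1 - tau)))
    nu = rng.exponential(1.0, size)
    z = rng.normal(size=size)
    return (xi1 * nu + xi2 * np.sqrt(nu) * z) / w

# The tau-quantile of the AL(tau) error is zero, so this should print roughly 0.3.
print(np.mean(al_mixture_draws(tau=0.3) <= 0))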


Non-convexity and Theoretical Difficulties

Powell’s objective function: P(β) = Σ_{i=1}^n w_i ρ_τ( Y_i − (X_i β ∨ c) )

Non-convexity and misspecification cause theoretical difficulties!

No quadratic approximations on compact sets!
– We show local quadratic approximations in vanishing balls

P(β) − P(β_0) not guaranteed to be positive for large ‖β − β_0‖
– We show divergence in ‖β − β_0‖ ≤ B_n → ∞ (not quadratic!)

Posterior has w_i both in likelihood and penalty parts
– Strong selection consistency holds true for some choices of w!


Summing up

Skinny Gibbs Trick provides a computationally feasible approach for the non-convex Powell objective function

Theoretical difficulties arise due to non-convexity and misspecification, but consistency is shown to hold

Empirical results show good performance while penalization approaches give unstable solutions


Directions for Future Research

Inference: The posterior given by a Bayesian method may in principle be used for inference. Coverage properties?

Model Selection for Mixture Models: How to deal with non-convex likelihoods under mixture models? Applications to subgroup identification?

Interaction Selection: Environmental exposures exhibit nonlinear and higher-order interactions. Bayesian modeling strategies?


Conclusions

Provided a theoretical framework to study Bayesian model selection with spike-slab priors (Narisetty and He 2014, AoS)

Proposed Skinny Gibbs - a scalable and flexible algorithm (Narisetty, Shen and He 2015, under review)

Shown to be applicable to non-convex problems as in censored quantile regression (Narisetty and He 2015)

Bayesian model selection deserves further investigation!

Thank You!
and

Xuming He, Vijay Nair – Statistics, U Mich.

Bhramar Mukherjee – Biostatistics, U Mich.

Juan Shen – Management, Fudan U

Minsuk Shin – Statistics, Texas A&M U

Fei He, Derek Posselt – Atmospheric Science, U Mich.

Steve Broglio, James Eckner – Kinesiology, Medicine, U Mich.
Appendix
1. Conditions for Strong Selection Consistency
2. Simulation Setting and Default Parameters
3. Kolmogorov-Smirnov Distribution
4. Computational Time
5. Computational Complexity
6. Posterior Probabilities - Lymph Data
7. Non-convexity - Theoretical Details
Conditions for Strong Selection Consistency

We assume the following conditions:

Number of covariates p_n is s.t. |t| log p_n / n → 0

Active covariates have coefficients of order at least √(|t| log p_n / n)

Minimum non-zero eigenvalues of the submatrices of X′X/n with size M_n = n / log p_n may go to zero, but not too fast!

Errors are (sub-)Gaussian

Many assumptions could be relaxed (with additional work)

Simulation Setting and Default Parameters

Covariate Distribution:

    x_i ∼ N(0, Σ) independently;  Σ_AA = ρ_1 1_{4×4},  Σ_AI = ρ_2 1_{4×(p−4)},  Σ_II = ρ_3 1
    – ρ_1 = ρ_2 = ρ_3 = 0
    – ρ_1 = 0.1; ρ_2 = 0.25; ρ_3 = 0.5

Default parameters for Skinny Gibbs:

Spike and slab variances are

    τ_{0n}² = 1/n,   τ_{1n}² = max( p_n^{2.1} / (100 n), 1 )

q_n = P(Z_j = 1) is chosen such that

    P[ Σ_{j=1}^{p_n} Z_j > K ] = 0.1,  for K = max(10, log(n)).
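A minimal sketch that computes these defaults numerically (a transcription of the formulas above; the root-finding step for q_n is simply one straightforward way to impose the binomial tail condition):

import numpy as np
from scipy.optimize import brentq
from scipy.stats import binom

def default_hyperparameters(n, p):
    tau0_sq = 1.0 / n
    tau1_sq = max(p ** 2.1 / (100.0 * n), 1.0)
    K = max(10.0, np.log(n))
    # q_n solves P[Binomial(p, q) > K] = 0.1, controlling the prior model size
    q = brentq(lambda q: binom.sf(K, p, q) - 0.1, 1e-10, 1 - 1e-10)
    return tau0_sq, tau1_sq, q

print(default_hyperparameters(n=100, p=250))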

Kolmogorov-Smirnov Distribution

The Kolmogorov-Smirnov distribution has CDF given by

    G(σ) = 1 − 2 Σ_{n=1}^∞ (−1)^{n+1} exp(−2n²σ²).

It is the distribution of

    K = sup_{t∈[0,1]} |B(t)|,

where B(t) is the Brownian bridge.
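The series truncates easily; a short sketch evaluating the CDF from the formula above:

import numpy as np

def ks_cdf(sigma, n_terms=100):
    """Kolmogorov-Smirnov CDF G(sigma) via its alternating series."""
    n = np.arange(1, n_terms + 1)
    return 1.0 - 2.0 * np.sum((-1.0) ** (n + 1) * np.exp(-2.0 * n ** 2 * sigma ** 2))

print(ks_cdf(0.5), ks_cdf(1.0))   # roughly 0.04 and 0.73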

Computational Time

Figure : CPU time (in seconds) for BASAD and Skinny Gibbs for n = 100 as p varies.

Computational Complexity

Yang, Wainwright, and Jordan (2015) recently studied the computational complexity of MH methods.
– Complexity is nearly linear in p: a number of iterations in the order of p n log p is sufficient for convergence!

Geometric ergodicity of Gibbs samplers in many settings (with fixed p) established (Román and Hobert 2012; 2015)

Number of iterations for standard Gibbs samplers for regression: p n^{-1} (Rajaratnam and Sparks, 2015)

Mixing and Complexity of Skinny Gibbs - Future work!

Posterior Probabilities - Lymph Data Example

Figure : Marginal Posterior Probabilities from two different chains of Skinny Gibbs.

Non-convexity - Theoretical Details
Powell’s objective function: Pow(β) = Σ_{i=1}^n w_i ρ_τ( Y_i − (X_i β ∨ c) ).

Pow(β) is non-convex, with no global quadratic approximations. But:

Local quadratic approximation: for ‖β − β_0‖ ≤ ε_n = |t|² log p_n / n → 0,

    Pow(β) − Pow(β_0) = n(β − β_0)′ D_w (β − β_0) + o_P(1)

Deviation in a diverging ball: for ε_n < ‖β − β_0‖ < B_n = (n / log p_n)^{1/8} → ∞, we have

    Pow(β) − Pow(β_0) > c_1 n ( ‖β − β_0‖² ∧ c_2 )

Note: Usually, ‖β − β_0‖ is assumed to be bounded, but we can allow it to diverge as B_n → ∞.
