
ISYE 6416 COMPUTATIONAL STATISTICS

Project 6
AN IMPLEMENTATION OF VARIABLE SELECTION VIA GIBBS SAMPLING METHOD
CHITTA RANJAN GT ID# 902838398

Mar 28, 2012

AN IMPLEMENTATION OF VARIABLE SELECTION VIA GIBBS SAMPLING METHOD


Chitta Ranjan, Department of Industrial and Systems Engineering, Georgia Institute of Technology.

ABSTRACT

Efficient and effective variable selection is a crucial problem in multiple regression. Commonly used methods such as AIC, Cp and BIC become computationally very cumbersome when the number of predictors is large. George and McCulloch (1993) devised a stochastic search variable selection method based on Gibbs sampling. The method uses a hierarchical normal mixture model in which latent variables identify subset choices; the best subsets can be identified as those with the highest posterior probability, and Gibbs sampling greatly reduces the computational burden of finding them. In this paper we present the results of our own implementation of one of the examples illustrated in George and McCulloch (1993), compare our results with theirs, and discuss the interpretation of the results.

1. INTRODUCTION

The Stochastic Search Variable Selection (SSVS) method is used for variable selection in linear regression. SSVS is based on embedding the entire regression setup in a hierarchical Bayes normal mixture model, where latent variables are used to identify subset choices. Other popular selection techniques are AIC, Cp and BIC; however, these require the criterion to be computed for each of the 2^p possible subsets, where p is the number of predictors (for example, p = 30 already gives more than 10^9 subsets). This becomes computationally prohibitive when p is large. The SSVS method, which is stochastic in nature and uses sampling to identify promising subsets, is therefore useful for large p, since the best subsets can be estimated without carrying out all 2^p computations. The method is controlled by tuning parameters that can be specified according to the problem goals, for example the search for a parsimonious model. Ultimately, given the predictor variables X = [X1, ..., Xp], we want to find the best submodel

Y = β_{i1} X_{i1} + ... + β_{iq} X_{iq} + ε,

where 0 ≤ q ≤ p. In this paper we implement the method devised by George and McCulloch (1993) in Tinn-R, using basic R functions and statistical packages. We compute the results keeping all parameters at the same values as in the illustration of George and McCulloch (1993), Example 4.1, and then compare our results with theirs. The results are interpreted and discussed in Section 3.

2. APPROACH

Given p predictors [X1, ..., Xp] and a response Y with n observations, we start with the canonical regression model

Y | β, σ ~ N_n(Xβ, σ²I). (1)

Selecting a subset of predictors is equivalent to setting to 0 those βi corresponding to the non-selected predictors. We introduce a vector of indicator variables γ = [γ1, ..., γp], where γi indicates whether Xi should be included. We now define priors for β and σ² that depend on γ:

β | γ ~ N_p(0, D_γ R D_γ), (2)

where R is a prior correlation matrix (the identity in our implementation) and D_γ = diag[a1 τ1, ..., ap τp], with ai = 1 if γi = 0 and ai = ci if γi = 1. For σ² we use a conjugate inverse gamma prior,

σ² | γ ~ IG(ν_γ/2, ν_γ λ_γ/2). (3)

We now aim to obtain the marginal posterior distribution f(γ | Y) ∝ f(Y | γ) f(γ), which contains the information relevant to variable selection. Here f(γ) can be interpreted as the statistician's prior probability that the Xi corresponding to the nonzero components of γ should be included in the final model. In our implementation we use the uniform or indifference prior

f(γ) ≡ 2^(-p). (4)

For the choice of τi and ci, our implementation uses the pair (σ̂_{βi}/τi, ci) = (1, 10), i.e. τi is set equal to the standard error of the least squares estimate of βi and ci = 10.
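The prior (2) makes each βi a scale mixture of two normals: βi ~ N(0, τi²) when γi = 0 (the coefficient is effectively negligible) and βi ~ N(0, ci² τi²) when γi = 1. As a small illustration of what the setting (1, 10) does — this R sketch is our own and is not part of the implementation in the Appendix, and the value tau_i = 1 is only illustrative — we can compare the two components and show how the diagonal of D_γ is built:

# Illustration only: the two prior components of beta_i under (tau_i, c_i) = (1, 10)
tau_i <- 1                                   # illustrative value of tau_i
c_i   <- 10                                  # c_i from the setting
beta_grid <- seq(-5, 5, by = 0.5)
spike <- dnorm(beta_grid, mean = 0, sd = tau_i)         # component used when gamma_i = 0
slab  <- dnorm(beta_grid, mean = 0, sd = c_i * tau_i)   # component used when gamma_i = 1

# Diagonal entries a_i * tau_i of D_gamma for a given inclusion vector gamma
D_diag <- function(gamma, tau, c) ifelse(gamma == 1, c * tau, tau)
D_diag(c(0, 1, 1, 0, 1), tau_i, c_i)         # returns 1 10 10 1 10

The "spike" component concentrates near 0, so coefficients assigned γi = 0 are shrunk toward 0, while the "slab" component is diffuse enough to accommodate coefficients worth keeping in the model.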

Finally, we select the model with the maximum posterior probability. We could find it by computing the posterior for each of the 2^p subsets, but that would defeat the purpose of improving computational efficiency. Therefore, rather than calculating all 2^p posterior probabilities in f(γ | Y), SSVS uses the Gibbs sampler to generate a Gibbs sequence in which f(γ | Y) is embedded:

β^(0), σ^(0), γ^(0), β^(1), σ^(1), γ^(1), ...

This sequence is an ergodic Markov chain. β^(0) and σ^(0) are initialized at the least squares estimates from (1), while γ^(0) = (1, 1, ..., 1). The subsequent values are computed by an iterative sampling scheme whose simulations can be carried out quickly and efficiently. The sequence is obtained by sampling from the full conditional distributions

β^(j) ~ f(β | Y, σ^(j-1), γ^(j-1)), (5)
σ^(j) ~ f(σ | Y, β^(j), γ^(j-1)), (6)
γi^(j) ~ f(γi | Y, β^(j), σ^(j), γ_(i)^(j)), i = 1, ..., p, (7)

where γ_(i)^(j) denotes the components of γ other than γi, at their most recently sampled values.
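In our implementation, where R is the identity and the prior on σ² does not depend on γ (ν_γ = 0 in (3), corresponding to f(σ²) ∝ 1/σ²), these conditionals take the standard conjugate forms that the code in the Appendix samples from:

β | Y, σ², γ ~ N_p(A (X'X/σ²) β̂_LS, A), with A = (X'X/σ² + (D_γ R D_γ)^(-1))^(-1) and β̂_LS the least squares estimate of β,

σ² | Y, β ~ IG(n/2, |Y - Xβ|²/2).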



Each distribution in (7) is Bernoulli, with

f(γi = 1 | Y, β, σ, γ_(i)) = a / (a + b), (8)

where

a = f(β | σ, γ_(i), γi = 1) f(σ | γ_(i), γi = 1) f(γ_(i), γi = 1), (8a)
b = f(β | σ, γ_(i), γi = 0) f(σ | γ_(i), γi = 0) f(γ_(i), γi = 0). (8b)
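Under the indifference prior (4), and because the prior on σ² does not depend on γ in our setting, the f(σ | ·) and f(γ) factors in (8a) and (8b) are equal for γi = 1 and γi = 0 and cancel, so the probability in (8) reduces to a ratio of two multivariate normal prior densities for β. The following R sketch of this calculation is our own illustration (the name p_include and its arguments are ours); the working version used for the results is the rand_g function in the Appendix.

library(mvtnorm)   # dmvnorm: multivariate normal density

# Probability that gamma_i = 1 given beta and the other components of gamma, following eq. (8)
p_include <- function(i, beta, gamma, R, c, tau) {
  d1 <- ifelse(replace(gamma, i, 1) == 1, c * tau, tau)   # diag of D_gamma with gamma_i set to 1
  d0 <- ifelse(replace(gamma, i, 0) == 1, c * tau, tau)   # diag of D_gamma with gamma_i set to 0
  a  <- dmvnorm(beta, sigma = diag(d1) %*% R %*% diag(d1))
  b  <- dmvnorm(beta, sigma = diag(d0) %*% R %*% diag(d0))
  a / (a + b)
}

# Example: fourth coefficient, five predictors, identity R, illustrative setting tau = 1, c = 10
p_include(4, beta = c(0.1, 0.1, 0.1, 1.0, 1.2), gamma = rep(1, 5), R = diag(5), c = 10, tau = 1)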

From the Gibbs sequence we can find the model (the value of γ) that shows up most frequently; that model is taken to be the best model.

Simulated example: George and McCulloch (1993), Example 4.1

We implement Example 4.1 from George and McCulloch (1993). This example considers two simple variable selection problems with p = 5 predictors, each of length n = 60.

Problem 1

The predictors are generated as independent standard normal vectors, X1, ..., X5 iid ~ N_60(0, I), so that they are practically uncorrelated. The dependent variable is generated according to the model

Y = X4 + 1.2 X5 + ε, (9)

where ε ~ N_60(0, σ²I) with σ = 2.5. We applied the method described above to these data and obtained the results shown in Table 1, which lists how often each model was sampled (a short tabulation sketch is given after the table). As stated above, the model sampled the most times is considered the best.

Table 1: High-frequency models, Problem 1

γ's       Model variables   # times sampled   Percentage
'00001'   5                 1207              24.14
'00010'   4                 1099              21.98
'00011'   45                 681              13.62
'00000'   (none)             531              10.62
'01011'   245                141               2.82
'00100'   3                  140               2.80
'01001'   25                 139               2.78
'10001'   15                 139               2.78
'00101'   35                 135               2.70
'10000'   1                  135               2.70
'10010'   14                 107               2.14
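Tables 1 and 2 were obtained by encoding each sampled γ vector as a 0/1 string and counting how often each string appears. A compact way to do this in R, equivalent in spirit to the tabulation function in the Appendix (the object gibbs_seq refers to the m x p matrix of sampled γ vectors produced there):

# gibbs_seq: m x p matrix of sampled gamma vectors (0/1)
models <- apply(gibbs_seq, 1, paste0, collapse = "")   # encode each draw as a string, e.g. "00001"
freq   <- sort(table(models), decreasing = TRUE)       # counts, most frequent model first
round(100 * freq / sum(freq), 2)                       # percentages, as reported in the tables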

Problem 2

Problem 2 is identical to Problem 1, except that X3 is replaced by X3 = X5 + 0.15 Z, where Z ~ N_60(0, I). This makes X3 a substantial proxy for X5, with a correlation of about 0.989 between them, so the problem shows how well the method copes with extreme collinearity. A quick check of this construction is sketched below; the resulting model frequencies are shown in Table 2.
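The following lines are our own illustration (not part of the implementation in the Appendix; the seed is arbitrary) and simply verify the strength of the proxy:

# Illustration: how strongly X3 proxies X5 in Problem 2
set.seed(1)                     # arbitrary seed, for reproducibility of this check only
n  <- 60
X5 <- rnorm(n)
Z  <- rnorm(n)
X3 <- X5 + 0.15 * Z             # the proxy construction used in Problem 2
cor(X3, X5)                     # typically around 0.989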
Table 2: High-frequency models, Problem 2

γ's       Model variables   # times sampled   Percentage
'00010'   4                 1075              21.50
'00000'   (none)             772              15.44
'00100'   3                  349               6.98
'00001'   5                  297               5.94
'00110'   34                 290               5.80
'10010'   14                 277               5.54
'00001'   5                  267               5.34
'01000'   2                  230               4.60
'00011'   45                 230               4.60
'01010'   24                 181               3.62
'10100'   13                 101               2.02

As explained in George and McCulloch (1993), SSVS is applied to both problems with the indifference prior f(γ) ≡ 2^(-5) and the setting (σ̂_{βi}/τi, ci) = (1, 10). For this setting, the threshold for inclusion in the model occurs at coefficient estimates with a t statistic of about 2.7. A sample of m = 5000 observations of the Gibbs sequence is simulated and tabulated.

3. DISCUSSION AND CONCLUSION

In Problem 1, the two most frequent models contain only X5 and only X4, respectively; they were sampled in 24.1% and 21.9% of the iterations. Contrary to the results of George and McCulloch (1993), X4 is not often excluded in our run. This result can still be reasonable, because deciding whether β4 = 1 is "close to 0" is quite subjective: the t statistic for β4 is 0.84/0.31 ≈ 2.7, right at the inclusion/exclusion threshold for the setting (σ̂_{βi}/τi, ci) = (1, 10). Among the other variables, only X5 has a t statistic greater than this threshold. The results also show that the model without any predictors is worth considering; this too is plausible, since both the predictors and the error term used to simulate Y are centered at 0 with modest variance. Our implementation thus also confirms that SSVS is useful for identifying several promising models rather than a single best model.

In Problem 2, the model containing only X4 is sampled the most often (21.5% of the iterations). The second most sampled model contains no predictors at all; as explained above, this is a plausible model too. Models containing X3 and models containing X5 should behave approximately the same, because X3 is a proxy for X5, and Table 2 bears this out: the model with only X3 and the model with only X5 are sampled in almost equal proportions. In this sense the results for Problem 1 and Problem 2 are very similar: the former says the best models contain X5 or X4, and the latter says the best model contains X4, followed by models containing X3 or X5. Since using X3 or X5 makes little difference in Problem 2, the conclusions are effectively the same. However, George and McCulloch (1993) caution that introducing proxies may dilute the focus of SSVS by increasing the number of promising models; to avoid this dilution, they suggest eliminating any strong proxies from the data before applying SSVS. With this example they also show that marginal frequencies by themselves do not suffice for variable selection. Similar to their results for Problem 2, we observe that although X5 is effective in both problems, it appears in fewer of the high-frequency models in Problem 2 than in Problem 1 because of the presence of the proxy X3.

In summary, we implemented the SSVS technique using basic R functions in Tinn-R and its statistical packages. Our implementation was efficient enough that we did not need to reduce the number of Gibbs samples or otherwise deviate from the illustration in George and McCulloch (1993).

4. REFERENCES
[1] Lange, K., Numerical Analysis for Statisticians, 2nd edition, Springer.
[2] George, E. I. and McCulloch, R. E. (1993), "Variable Selection via Gibbs Sampling," Journal of the American Statistical Association, Vol. 88, No. 423, pp. 881-889.

APPENDIX

###### R code (written and run in Tinn-R) #######
# Stochastic search variable selection (SSVS) via Gibbs sampling.
# Helper functions are defined first; the driver code for Problems 1 and 2 is at the end.

library(mvtnorm)   # rmvnorm / dmvnorm for multivariate normal sampling and densities

# Diagonal entries a_i * tau_i of D_gamma: a_i = 1 if gamma_i = 0, a_i = c_i if gamma_i = 1
D <- function(gamma, c, t) {
  a <- 1 + (c - 1) * gamma
  a * t
}

# Prior of beta given gamma: N_p(0, D_gamma R D_gamma)
b_prior <- function(b, gamma, R, c, t) {
  d <- D(gamma, c, t)
  dmvnorm(b, mean = rep(0, length(b)), sigma = diag(d) %*% R %*% diag(d))
}

# Prior of gamma: indifference prior f(gamma) = 2^(-p)
g_prior <- function(p) {
  2^(-p)
}

# Sampling beta from its full conditional (multivariate normal)
rand_b <- function(X, gamma, s, invR, LSE_b, c, t) {
  Dinv  <- diag(1 / D(gamma, c, t))                 # D_gamma^{-1}
  XtX_s <- t(X) %*% X / s
  A     <- solve(XtX_s + Dinv %*% invR %*% Dinv)    # conditional covariance
  mu    <- as.vector(A %*% XtX_s %*% LSE_b)         # conditional mean
  as.vector(rmvnorm(1, mean = mu, sigma = A))
}

# Sampling sigma^2 from its full conditional (inverse gamma)
rand_s <- function(n, Y, X, b) {
  u <- 0.5 * n
  v <- 0.5 * sum((Y - X %*% b)^2)
  1 / rgamma(1, shape = u, rate = v)
}

# Bernoulli sampling of gamma_i from its full conditional, eq. (8)
rand_g <- function(i, b, gamma, R, c, t) {
  gamma1 <- replace(gamma, i, 1)
  gamma0 <- replace(gamma, i, 0)
  a1 <- b_prior(b, gamma1, R, c, t)                 # term with gamma_i = 1
  a0 <- b_prior(b, gamma0, R, c, t)                 # term with gamma_i = 0
  as.numeric(runif(1) < a1 / (a1 + a0))
}

# Gibbs sampler for the inclusion indicators gamma
gibbs_selection <- function(X, Y, m, setting) {
  fit   <- lm(Y ~ X - 1)                            # least squares fit (no intercept)
  LSE_b <- as.vector(fit$coefficients)
  se_b  <- summary(fit)$coefficients[, 2]           # standard errors of the LSE
  s     <- var(fit$residuals)                       # initial value for sigma^2
  t     <- se_b / setting[1]                        # tau_i from the setting (se/tau, c)
  n <- nrow(X)
  p <- ncol(X)
  c     <- rep(setting[2], p)                       # c_i
  gamma <- rep(1, p)                                # start with all predictors included
  R     <- diag(p)                                  # prior correlation matrix (identity)
  invR  <- solve(R)
  gibbs_seq <- matrix(0, nrow = m, ncol = p)
  # Gibbs sampling
  for (j in 1:m) {
    b <- rand_b(X, gamma, s, invR, LSE_b, c, t)
    s <- rand_s(n, Y, X, b)
    for (i in 1:p) gamma[i] <- rand_g(i, b, gamma, R, c, t)
    gibbs_seq[j, ] <- gamma
  }
  gibbs_seq
}

# Tabulating the generated Gibbs sequence: frequency of each visited model
tabulation <- function(G) {
  models <- apply(G, 1, paste0, collapse = "")      # encode each draw as a string, e.g. "00101"
  sort(table(models), decreasing = TRUE)
}

# Problem 1
n <- 60; p <- 5; m <- 5000
X <- matrix(rnorm(n * p), nrow = n, ncol = p)       # X1,...,X5 iid N_60(0, I)
sigma <- 2.5
e <- rnorm(n, 0, sigma)
Y <- X[, 4] + 1.2 * X[, 5] + e
gibbs_seq <- gibbs_selection(X, Y, m, c(1, 10))
tabulated_samples <- tabulation(gibbs_seq)

# Problem 2
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
Z <- rnorm(n)
X[, 3] <- X[, 5] + 0.15 * Z                         # make X3 a strong proxy for X5
e <- rnorm(n, 0, sigma)
Y <- X[, 4] + 1.2 * X[, 5] + e
gibbs_seq <- gibbs_selection(X, Y, m, c(1, 10))
tabulated_samples <- tabulation(gibbs_seq)
