Project 2: Logistic Regression and Model Selection
1/19/2003

Question 1
Give a brief explanation of the meaning of these commands.

    data(birthwt)
    # Load the data set birthwt, which ships with the MASS package.
    # The package has to be on the search path first, for example via
    # require(MASS), which loads the whole package into the working space.

    attach(birthwt)
    # Attach the data set birthwt to the search path so that each variable in it
    # can be accessed directly by its name. Otherwise, we have to write
    # birthwt$varname instead of varname.

    race <- factor(race, labels = c("white", "black", "other"))
    # The function factor encodes a vector as a factor. race is originally stored
    # as a numeric variable; this command turns it into a categorical factor with
    # named levels.

    table(ftv)
    # table uses the cross-classified factors to build a contingency table of the
    # counts at each combination of factor levels. The result of this command is:
    #   ftv
    #     0   1   2   3   4   6
    #   100  47  30   7   4   1

1. Convert ftv to three levels.

    ftv <- factor(ftv)
    # The function factor encodes a vector as a factor.
    levels(ftv)[-(1:2)] <- "2+"
    # Every level except the first two is renamed "2+", so ftv now has the three
    # levels 0, 1, and 2+.

2. Convert ptl to two levels and name the new variable ptd.

    ptd <- factor(ptl > 0)
    # ptd is a new factor with two levels only (FALSE and TRUE).

3. Create a new data frame bwt.

    bwt <- data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
                      ptd, ht = (ht > 0), ui = (ui > 0), ftv)
    # Create a new data frame from the recoded variables.

4. Clean up data.

    detach(birthwt)
    # Remove the data set from the search path of available R objects. detach can
    # remove either a data frame that has been attached or a package that was
    # loaded previously.
    rm(race, ptd, ftv)
    # remove and rm can be used to remove objects. Here, the variables race, ptd,
    # and ftv, which were created in this workspace, are removed and no longer
    # exist.

Question 2
Give a brief explanation of the specification of the regression model.
    birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)

glm is used to fit generalized linear models (GLMs) to the data. In a GLM, two things need to be specified. The first is the error distribution of the dependent variable. Here, the chosen distribution is the binomial distribution, specified with family = binomial. The second is how the unknown parameter of that distribution depends on the independent variables. For the binomial distribution, the unknown parameter is the probability of success p, and the so-called link function associates p with the independent variables. The default link function for the binomial distribution is the logit, which gives

$$F(x) = P(Y = 1 \mid x) = \frac{\exp(z)}{1 + \exp(z)}, \qquad z = \beta_0 + \beta_1 X_1 + \dots + \beta_K X_K .$$

The fitted values of this model are the probabilities of low = 1 given all the other variables in the data frame bwt. Here, we only consider an additive model.
It means that all variables enter the model linearly. (There is no interaction term in the model.)

    summary(birthwt.glm, correlation = FALSE)

This gives the deviance residuals and the coefficients. summary is a generic function that produces summaries of the results of various model-fitting functions. The correlations of the coefficients are not shown because the parameter correlation is set to FALSE. The AIC value is 217.48 for this additive model containing all the variables. In addition, the prediction error rate is 0.2698413. In the following analyses we compare AIC values and prediction error rates against these figures to show the effect of including or excluding particular variables. From the p-value Pr(>|z|) of each coefficient, we find that ptdTRUE, htTRUE, lwt, and raceblack are significant for the prediction at the 0.05 significance level.

Question 2: Model Selection
Consider a model with the above four important variables.

    glm(low ~ lwt + race + ptd + ht, family = binomial, data = bwt)

The AIC value becomes 217.40, which is slightly smaller than that of the model using all of the variables. The prediction error rate is 0.2698413. If we drop only the two least significant variables, ftv and age, we obtain an alternative model:

    glm(low ~ lwt + race + ptd + ht + smoke + ui, family = binomial, data = bwt)

The AIC value becomes 213.8516, which is smaller than for the two models described above. The prediction error rate is 0.2433862.

Question 2: Model Selection with AIC
Use AIC as the objective of model selection.

    birthwt.step <- step(birthwt.glm, trace = FALSE)

The command step selects a formula-based model by AIC. Its basic idea is to remove variables one by one in order to find models with smaller AIC values. The parameter trace is turned off so that the deletion process is not printed. We can build the model with backward elimination or forward selection.
    birthwt.step <- step(birthwt.glm, trace = FALSE, direction = "forward")
    birthwt.step <- step(birthwt.glm, trace = FALSE, direction = "backward")
    birthwt.step$anova

The last command is useful for showing the process of deleting variables. For this data set, the two variables ftv and age are removed and the final AIC value is reduced to 213.8516. The selection procedure stops because removing any other variable does not reduce the AIC value further. It is interesting that the model selected by the AIC criterion agrees with the one we obtained with a different criterion, namely dropping the least significant variables.

Question 3
Repeat the steps in Question 2 and consider all models that include pairwise interactions.
To ensure that collinearity is not present in the model, we usually start by checking the correlations among all independent variables. The difficult question is how to address correlation with categorical variables. Refer to your notes on association.

Model building strategy:
Strategy 1: backward elimination
Add all pairwise interactions and then remove them one by one.

Implementation of strategy 1
Start from the additive model with all independent variables and add all pairwise interactions.

    birthwt.glm.pwall <- glm(low ~ .^2, family = binomial, data = bwt, maxit = 20)

Because of convergence problems, we increase the upper bound on the number of iterations, which is done with maxit.
Suggestion: Start from the best additive linear model derived in Question 2. (Exclude the two variables ftv and age.)

Question 3: backward elimination

    birthwt.glm <- glm(low ~ (. - ftv - age)^2, family = binomial, data = bwt, maxit = 20)

Consider an additive model without ftv and age.
This leads to the following model:

    low ~ age + lwt + race + smoke + ptd + ht + ui + ftv + age:ht + age:ftv + lwt:smoke + lwt:ht + lwt:ui + race:ht + smoke:ht + ptd:ht + ht:ui + ht:ftv

Model selection with AIC:

    birthwt.step.pwall <- stepAIC(birthwt.glm.pwall, trace = FALSE)
    birthwt.step.pwall$anova

This leads to the following model with AIC = 210.8205:

    low ~ age + lwt + race + smoke + ptd + ht + ui + ftv + lwt:smoke + lwt:ht + lwt:ui + ht:ui

Suggestion: Can we just compare all possible models with two pairwise interaction terms? The best one has the interaction terms age:ftv and ht:ui. Its AIC is 209.0006. Although the variables age and ftv are not important when we consider only an additive linear model, they are included once the interaction term is taken into consideration.

Question 3: Model Searching Strategy

    birthwt.step.both <- stepAIC(birthwt.glm, scope = list(upper = ~ .^2, lower = ~ 1), trace = FALSE)

The direction of the stepwise search can be one of both, backward, or forward, with a default of both. If the scope argument is missing, the default for direction is backward. With the scope given here, we therefore not only remove predictors from the model but also add predictors to reduce the AIC value. For this data set, we start with the original model with no interaction term. The interaction terms age:ftv and smoke:ui are added sequentially. Finally, the race term is removed. The process stops at the model

    low ~ age + lwt + smoke + ptd + ht + ui + ftv + age:ftv + smoke:ui

with the lowest AIC value, 207.0734.

Question 5
Repeat the above procedure for the two chosen models, using cross-validation to estimate the prediction error again.
How do we divide the data randomly into several groups (fold = 5, for example)?
Use the cv.glm procedure in the boot package. Some students write their own version of cross-validation.

cv.glm: Cross-validation for Generalized Linear Models
This function calculates the estimated K-fold cross-validation prediction error for generalized linear models.
    cv.glm(data, glmfit, cost, K)

data: A matrix or data frame containing the data. The rows should be cases and the columns correspond to variables, one of which is the response.
glmfit: An object of class "glm" containing the results of a generalized linear model fitted to data.
cost: A function of two vector arguments specifying the cost function for the cross-validation. The first argument to cost should correspond to the observed responses and the second argument should correspond to the predicted or fitted responses from the generalized linear model. The default is the average squared error function.
K: The number of groups into which the data should be split to estimate the cross-validation prediction error.

Cross-validation Algorithm

    # Shuffle the rows of bwt by repeatedly swapping two randomly chosen rows.
    bwt.shuffle <- bwt
    shuf.iter <- 1000
    datasize <- length(bwt$low)
    for (i in 1:shuf.iter) {
      n1 <- round(runif(n = 1, min = 1, max = datasize))
      n2 <- round(runif(n = 1, min = 1, max = datasize))
      temp <- bwt.shuffle[n1, ]
      bwt.shuffle[n1, ] <- bwt.shuffle[n2, ]
      bwt.shuffle[n2, ] <- temp
    }

    # Split the shuffled data into consecutive blocks, one block per fold.
    k <- 5   # number of folds; 5 is only an example value
    fold <- k
    testsize <- round(datasize / fold)
    rate.fold <- rep(0, fold)
    for (i in 1:fold) {
      test.start <- (i - 1) * testsize
      test.end <- test.start + testsize
      if (test.end > datasize) test.end <- datasize
      bwt.test <- data.frame(bwt.shuffle[(test.start + 1):test.end, ])
      bwt.train <- data.frame(bwt.shuffle[-((test.start + 1):test.end), ])
      # Fit on the training block and predict probabilities for the test block.
      train.glm <- glm(low ~ ., family = binomial, data = bwt.train)
      pred <- predict(train.glm,
                      subset(bwt.test, select = c(age, lwt, race, smoke, ptd, ht, ui, ftv)),
                      type = "response")
      # Share of test cases classified correctly with a 0.5 cutoff.
      rate.fold[i] <- sum(round(pred) == bwt.test$low) / length(pred)
    }
    cvrate <- mean(rate.fold)
    cat("prediction error rate of cross validation:", 1 - cvrate, "\n")
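
As a complement to the hand-written loop, the cv.glm interface described above could be used roughly as follows. This is a minimal sketch, not a reference solution: the cost function name misclass, the 0.5 cutoff, the seed, and the choice of which two models to compare are assumptions made for illustration. The two formulas are the ones reported as selected by AIC in Questions 2 and 3.

```r
# Sketch: 5-fold cross-validated misclassification rate with boot::cv.glm.
library(MASS)   # provides the birthwt data
library(boot)   # provides cv.glm

# Rebuild the bwt data frame from Question 1 without attach()/detach().
bwt <- with(birthwt, {
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt,
             race = factor(race, labels = c("white", "black", "other")),
             smoke = (smoke > 0), ptd = factor(ptl > 0),
             ht = (ht > 0), ui = (ui > 0), ftv)
})

# Assumed cost function: share of cases whose predicted probability
# lies on the wrong side of 0.5 (obs is the 0/1 response).
misclass <- function(obs, pred) mean(abs(obs - pred) > 0.5)

# Additive model selected in Question 2 (ftv and age dropped).
fit.add <- glm(low ~ lwt + race + smoke + ptd + ht + ui,
               family = binomial, data = bwt)

# Interaction model at which the "both" search in Question 3 stopped.
fit.int <- glm(low ~ age + lwt + smoke + ptd + ht + ui + ftv +
                 age:ftv + smoke:ui,
               family = binomial, data = bwt)

set.seed(1)  # arbitrary seed so the random fold assignment is repeatable
cv.add <- cv.glm(bwt, fit.add, cost = misclass, K = 5)
cv.int <- cv.glm(bwt, fit.int, cost = misclass, K = 5)

# delta[1] is the raw K-fold estimate of the prediction error.
cat("CV error, additive model:   ", cv.add$delta[1], "\n")
cat("CV error, interaction model:", cv.int$delta[1], "\n")
```

Because the folds are drawn at random, the estimates change with the seed, so small differences between the two models should be read with caution.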