Financial Time Series I/Methods of Statistical Prediction

Suggested Answers to Project 2

Project 2: Logistic Regression and Model Selection
1/19/2003
Question 1
Give a brief explanation of the meaning of these commands.
data(birthwt)
# Load the data set birthwt. The data set ships with the MASS package (not boot),
# so load the package first with library(MASS), or call
# data(birthwt, package = "MASS").
attach(birthwt)
# The data set birthwt is attached to the search path so that each variable in the
# data set can be accessed directly by its name. Otherwise, we have to use
# birthwt$varname instead of varname.
race <- factor(race, labels = c("white", "black", "other"))
# The function factor encodes a vector as a factor. race is originally stored as a
# numeric code (1, 2, 3). This command converts it into a categorical factor with
# the labels white, black, and other.
table(ftv)
# table cross-classifies factors to build a contingency table of the counts at
# each combination of factor levels. With a single variable it gives the count of
# each value of ftv. The result of this command is as follows:
  0   1   2   3   4   6
100  47  30   7   4   1
1. Convert ftv to three levels.
ftv <- factor(ftv)
# The function factor encodes a vector as a factor.
levels(ftv)[-(1:2)] <- "2+"
# This command renames every level except the first two to "2+", which merges them,
# so ftv is left with three levels: 0, 1, and 2+.
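The level-merging step can be checked in isolation. The sketch below rebuilds the ftv values from the counts printed by table(ftv) above rather than from the data set itself; the name ftv.counts is introduced here only for illustration.

```r
# Visit counts reconstructed from the table(ftv) output (189 cases in total)
ftv <- rep(c(0, 1, 2, 3, 4, 6), times = c(100, 47, 30, 7, 4, 1))
ftv <- factor(ftv)
levels(ftv)                  # six levels: "0" "1" "2" "3" "4" "6"

# Assigning one label to several levels merges those levels
levels(ftv)[-(1:2)] <- "2+"

ftv.counts <- table(ftv)
print(ftv.counts)            # counts: 0 -> 100, 1 -> 47, 2+ -> 42
```

The merged level holds 30 + 7 + 4 + 1 = 42 cases, which keeps every level reasonably populated before ftv enters the regression.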
2. Convert ptl to two levels and name the new variable ptd.
ptd <- factor(ptl > 0)
# The function factor encodes a vector as a factor. Here, ptd is a new factor with
# only two levels: FALSE (ptl = 0) and TRUE (ptl > 0).
3. Create a new data frame bwt.
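bwt <- data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0), ptd, ht = (ht > 0),
ui = (ui > 0), ftv)
# Create a new data frame bwt. The response low becomes a factor, and smoke, ht, and
# ui become logical variables.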
4. Clean up data.
detach(birthwt)
# Remove the data set from the search path of available R objects. This command can be
# used to remove either a data frame which has been attached or a package which was loaded
# previously.
rm(race, ptd, ftv)
# remove and rm can be used to remove objects. Here, the variables race, ptd, and
# ftv, which were created in the workspace, are removed. Copies of them are kept in bwt.
Question 2
Give a brief explanation of the specification of the regression model.
birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)
# glm is used to fit generalized linear models (GLM) to the data.
In a GLM, two components need to be specified.
The first component is the error distribution of the dependent variable. Here, the
chosen distribution is the binomial distribution. It is specified by
family=binomial.
The second component is the link function. The error distribution has an unknown
parameter; for the binomial distribution, it is the probability of success p. The link
function associates p with the independent variables.
The default link function for the binomial distribution is the logit,
$\log[p/(1-p)] = z$. It gives:
$F(x) = P(Y=1 \mid x) = \exp(z)/[1+\exp(z)]$, where $z = \beta_0 + \beta_1 X_1 + \dots + \beta_K X_K$
The fitted values of this model are the probabilities of low=1 given all the
other variables in the data frame bwt. Here, we only consider an additive
model. It means that all variables enter the linear predictor z as main effects. (There is no
interaction term in the model.)
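The relation between the probability p and the linear predictor z can be checked numerically. The following is a minimal sketch with made-up values of z; the helper names inv.logit and logit are introduced here for illustration, while plogis is the equivalent function built into base R.

```r
# Inverse logit written out by hand: maps a linear predictor z to a probability
inv.logit <- function(z) exp(z) / (1 + exp(z))

# Logit link: maps a probability back to the linear predictor
logit <- function(p) log(p / (1 - p))

# Hypothetical values of z = beta_0 + beta_1 * X_1 + ... + beta_K * X_K
z <- c(-2, 0, 1.5)
p <- inv.logit(z)
print(p)                         # probabilities strictly between 0 and 1

# The two functions are inverses, and inv.logit agrees with plogis
print(all.equal(logit(p), z))
print(all.equal(p, plogis(z)))
```

A value z = 0 corresponds to p = 0.5, so the sign of the linear predictor decides which outcome the model considers more likely.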
summary(birthwt.glm, correlation = FALSE)
It gives the deviance residuals and coefficients.
The command summary is a generic function that produces
summaries of the results of various model-fitting functions.
The correlations of the coefficients are not shown because the argument
correlation is set to FALSE.
The AIC value is 217.48 for this additive model containing all the
variables. In addition, the prediction error rate is 0.2698413.
In the following analyses we can compare the AIC values and prediction error
rates of other models with these values to show the effect of including or
excluding particular variables.
From the p-value Pr(>|z|) of each coefficient, we find that the
terms ptdTRUE, htTRUE, lwt, and raceblack are
significant for the prediction if the significance level is set at 0.05.
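The slides do not show how the prediction error rate was computed. A plausible reading, consistent with the round(pred) rule used in the cross-validation code later in this document, is to classify each case as 1 when its fitted probability exceeds 0.5 and count the mismatches. The sketch below illustrates this on simulated data; the data frame toy and all coefficient values are made up.

```r
set.seed(1)
n <- 200

# Simulated data (hypothetical): only x1 truly affects the response
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * toy$x1))

toy.glm <- glm(y ~ ., family = binomial, data = toy)

# Fitted probabilities P(y = 1 | x), then a 0.5 cut-off (an assumption)
p.hat <- fitted(toy.glm)
y.hat <- as.numeric(p.hat > 0.5)

# In-sample prediction error rate: share of misclassified cases
err.rate <- mean(y.hat != toy$y)
print(err.rate)
print(table(observed = toy$y, predicted = y.hat))
```

Because the same data are used for fitting and for evaluation, this rate tends to be optimistic, which is the motivation for the cross-validation step later on.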
Question 2: Model Selection
Consider a model with the above four important variables.
glm(low ~ lwt + race + ptd + ht, family = binomial, data = bwt)
The AIC value becomes 217.40, which is slightly smaller than that of the model using
all of the variables.
In addition, the prediction error rate is 0.2698413.
If we drop only the two least significant variables, ftv and age,
this leads to an alternative model:
glm(low ~ lwt + race + ptd + ht + smoke + ui, family = binomial,
data = bwt)
The AIC value becomes 213.8516, which is smaller than those of the two
models described above.
The prediction error rate is 0.2433862.
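To see what the AIC values being compared consist of: for a logistic regression with a 0/1 response, the residual deviance equals minus twice the log-likelihood, so AIC is the deviance plus twice the number of estimated coefficients. The sketch below checks this on simulated data; toy, fit.full, and fit.small are hypothetical names and the data are made up.

```r
set.seed(2)
n <- 150

# Simulated data (hypothetical): x2 and x3 are pure noise
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(0.3 + toy$x1))

fit.full  <- glm(y ~ x1 + x2 + x3, family = binomial, data = toy)
fit.small <- glm(y ~ x1, family = binomial, data = toy)

# AIC = deviance + 2 * (number of coefficients) for a binary response
k.full <- length(coef(fit.full))
aic.by.hand <- deviance(fit.full) + 2 * k.full
print(all.equal(aic.by.hand, AIC(fit.full)))

# Dropping a variable lowers the penalty by 2 but can only raise the deviance
print(c(full = AIC(fit.full), small = AIC(fit.small)))
```

A variable is therefore worth keeping under AIC only if it reduces the deviance by more than 2 per coefficient it adds.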
Question 2: Model Selection with AIC
Use AIC as the objective of model selection.
birthwt.step <- step(birthwt.glm, trace = FALSE)
The command step selects a formula-based model by AIC. Its basic idea is
to remove variables one at a time in order to find better models with smaller
AIC values.
The argument trace is turned off so that the deletion process is not
shown.
We can build the model with backward elimination or forward selection.
Forward selection also needs a scope argument listing the terms that may be
added; without it, nothing can be added to the starting model.
birthwt.step <- step(birthwt.glm, trace = FALSE, direction = "forward")
birthwt.step <- step(birthwt.glm, trace = FALSE, direction = "backward")
birthwt.step$anova
This command is useful in showing the process of deleting variables.
For this data set, it removes the two variables ftv and age, and the final AIC
value is reduced to 213.8516. The selection procedure stops because no further
reduction of the AIC value can be achieved by removing any other variable.
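The backward search can be tried on a small example. This is a sketch on simulated data, not a rerun of the birthwt analysis; the data frame toy, its variable noise, and the coefficients are made up.

```r
set.seed(3)
n <- 200

# Simulated data (hypothetical): 'noise' is unrelated to the response
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(-0.2 + 1.5 * toy$x1 - 1.0 * toy$x2))

# Start from the additive model with every variable
toy.glm <- glm(y ~ ., family = binomial, data = toy)

# Backward elimination by AIC, without printing each step
toy.step <- step(toy.glm, direction = "backward", trace = FALSE)

print(toy.step$anova)      # which terms were dropped, and the AIC after each step
print(formula(toy.step))   # the selected model
print(c(start = AIC(toy.glm), selected = AIC(toy.step)))
```

The selected model never has a larger AIC than the starting model, because step only accepts a deletion when it lowers the AIC.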
It is interesting that the model selected by the AIC criterion coincides
with the one we obtained above by dropping the least significant variables.
Question 3
Repeat the steps in Question 2 and consider all models that include
pairwise interactions.
In order to ensure that collinearity is not present in the model, we
usually start by checking the correlations among all
independent variables.
The difficult question is how to measure the correlation with categorical
variables. Refer to your notes on association.
Model building strategy:
Strategy 1: backward elimination
Add all pairwise interactions and then remove them one by one.
Implementation of strategy 1
Start from the model with all independent variables and all their pairwise interactions.
birthwt.glm.pwall <- glm(low ~ .^2, family = binomial, data = bwt, maxit = 20)
Because of convergence problems, we increase the upper bound on the number
of iterations, which is done with maxit.
Suggestion: Start from the best additive linear model derived in Question 2
(exclude the two variables ftv and age).
Question 3: backward elimination
birthwt.glm <- glm(low ~ (. - ftv - age)^2, family = binomial, data = bwt,
maxit = 20)
Consider an additive model without ftv and age.
This leads to the following model
low ~ age + lwt + race + smoke + ptd + ht + ui + ftv + age:ht + age:ftv +
lwt:smoke + lwt:ht + lwt:ui + race:ht + smoke:ht + ptd:ht + ht:ui + ht:ftv
Model selection with AIC (stepAIC is in the MASS package)
birthwt.step.pwall <- stepAIC(birthwt.glm.pwall, trace = FALSE)
birthwt.step.pwall$anova
This leads to the following model with AIC=210.8205
low ~ age + lwt + race + smoke + ptd + ht + ui + ftv + lwt:smoke + lwt:ht +
lwt:ui + ht:ui
Suggestion: Can we just compare all possible pairs of pairwise
interaction terms?
The best one is with the interaction terms age:ftv and ht:ui.
Its AIC is 209.0006.
Although the variables age and ftv are not important when we only
consider an additive linear model, they are included when interaction terms
are also taken into consideration.
Question 3: Model Searching Strategy
birthwt.step.both <- stepAIC(birthwt.glm, scope = list(upper = ~ .^2,
lower = ~ 1), trace = FALSE)
The direction of the stepwise search can be one of both, backward, or
forward, with a default of both.
If the scope argument is missing, the default for direction is backward.
Because scope is supplied here, the search runs in both directions: we not
only remove predictors from the model but also add
predictors to reduce the AIC value.
For this data set, we start with the original model with no interaction terms.
The interaction terms age:ftv and smoke:ui are added sequentially.
Finally, the race term is removed.
The process stops at the model
low ~ age + lwt + smoke + ptd + ht + ui + ftv + age:ftv + smoke:ui
with the lowest AIC value, 207.0734.
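A search in both directions with an upper scope of all pairwise interactions can be sketched as follows. To stay within base R, the sketch uses step from the stats package in place of stepAIC; step accepts a scope list of the same form. The data are simulated with one true interaction, and all names and coefficients are made up.

```r
set.seed(4)
n <- 300

# Simulated data (hypothetical) with a genuine x1:x2 interaction
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- rbinom(n, size = 1,
                prob = plogis(0.5 * toy$x1 + 0.5 * toy$x2 + 1.5 * toy$x1 * toy$x2))

# Start from the additive model with no interaction terms
toy.add <- glm(y ~ ., family = binomial, data = toy)

# Terms may be added up to all pairwise interactions and removed down to
# the intercept-only model; with scope given, direction defaults to "both"
toy.both <- step(toy.add, scope = list(upper = ~ .^2, lower = ~ 1), trace = FALSE)

print(toy.both$anova)      # sequence of additions (+) and deletions (-)
print(c(additive = AIC(toy.add), selected = AIC(toy.both)))
```

Starting from the small model and growing it avoids fitting the full interaction model first, which is where the convergence problems mentioned under Strategy 1 arise.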
Question 5
Repeat the above procedure with the two chosen models, using
cross-validation to estimate the prediction error again.
How do we divide the data randomly into several groups (fold=5, for
example)?
Use the cv.glm procedure in the boot package.
Some students wrote their own version of cross-validation.
cv.glm
Cross-validation for Generalized Linear Models
This function calculates the estimated K-fold cross-validation prediction error
for generalized linear models.
cv.glm(data, glmfit, cost, K)
data: A matrix or data frame containing the data. The rows should be cases and the
columns correspond to variables, one of which is the response.
glmfit: An object of class "glm" containing the results of a generalized linear model
fitted to data.
cost: A function of two vector arguments specifying the cost function for the cross-
validation. The first argument to cost should correspond to the observed responses
and the second argument should correspond to the predicted or fitted responses from
the generalized linear model. The default is the average squared error function.
K: The number of groups into which the data should be split to estimate the cross-
validation prediction error.
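A call to cv.glm with a misclassification cost might look like the sketch below. It assumes the boot package is installed and runs on simulated data; toy and cost.misclass are hypothetical names, and the 0.5 cut-off in the cost function is an assumption chosen to match the error rate used elsewhere in this document.

```r
library(boot)   # provides cv.glm

set.seed(5)
n <- 200

# Simulated data (hypothetical)
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * toy$x1))

toy.glm <- glm(y ~ ., family = binomial, data = toy)

# Cost: share of cases whose predicted probability falls on the wrong
# side of 0.5 (first argument observed 0/1, second argument predicted)
cost.misclass <- function(obs, pred) mean(abs(obs - pred) > 0.5)

# 5-fold cross-validation; the random split depends on the seed
toy.cv <- cv.glm(data = toy, glmfit = toy.glm, cost = cost.misclass, K = 5)

# delta[1] is the raw cross-validation estimate of the prediction error,
# delta[2] is a bias-adjusted version
print(toy.cv$delta)
```

Without the cost argument, cv.glm reports the average squared error, which is not directly comparable with a misclassification rate.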
Cross-validation Algorithm
# Shuffle the rows by swapping two randomly chosen rows many times
bwt.shuffle <- bwt; shuf.iter <- 1000; datasize <- length(bwt$low)
for (i in 1:shuf.iter) {
  n1 <- round(runif(n = 1, min = 1, max = datasize))
  n2 <- round(runif(n = 1, min = 1, max = datasize))
  temp <- bwt.shuffle[n1, ]
  bwt.shuffle[n1, ] <- bwt.shuffle[n2, ]
  bwt.shuffle[n2, ] <- temp
}
# k is the chosen number of folds (fold = 5 in the example above)
k <- 5
fold <- k; testsize <- round(datasize/fold); rate.fold <- rep(0, fold)
for (i in 1:fold) {
  # Rows (test.start + 1):test.end form the test set of fold i
  test.start <- (i - 1) * testsize; test.end <- test.start + testsize
  if (test.end > datasize) test.end <- datasize
  bwt.test <- data.frame(bwt.shuffle[(test.start + 1):test.end, ])
  bwt.train <- data.frame(bwt.shuffle[-((test.start + 1):test.end), ])
  # Fit on the training rows, then predict probabilities for the test rows
  train.glm <- glm(low ~ ., family = binomial, data = bwt.train)
  pred <- predict(train.glm, subset(bwt.test, select = c(age, lwt, race, smoke, ptd, ht, ui,
    ftv)), type = "response")
  # Share of test cases classified correctly with a 0.5 cut-off
  rate.fold[i] <- sum(round(pred) == bwt.test$low) / length(pred)
}
cvrate <- mean(rate.fold); cat("prediction error rate of cross validation:", 1 - cvrate, "\n")
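The random division into groups can also be done in one step with sample, instead of the swap loop above. The sketch below is an alternative, not the method used in the slides; fold.id and test.rows are names introduced here, and the data size of 189 is the total of the table(ftv) counts.

```r
set.seed(6)
datasize <- 189   # number of cases, from the table(ftv) counts
fold <- 5

# Assign each row a fold label 1..fold in near-equal numbers, then permute
fold.id <- sample(rep(1:fold, length.out = datasize))
print(table(fold.id))          # fold sizes differ by at most one

# Rows of the first test set; all remaining rows form the training set
test.rows <- which(fold.id == 1)
train.rows <- which(fold.id != 1)
print(length(test.rows))
```

Every case lands in exactly one test set with this scheme, whereas fixed-size blocks based on round(datasize/fold) can leave the last few rows untested when datasize is not a multiple of fold.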
