
STAT 5700 Homework 2 Due 10/27/2020

1. Chapter 4 (Exercise 10)


This question should be answered using the Weekly data set, which is part of the ISLR package. This
data is similar in nature to the Smarket data from this chapter's lab, except that it contains 1,089
weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any
patterns?
>library(ISLR)
>plot(Today~Lag1, col="darkgreen", data=Weekly)
>simplelm = lm(Today~Lag1, data=Weekly)
>abline(simplelm, lwd = 3, col="goldenrod4")
>pairs(Weekly)

(b) Use the full data set to perform a logistic regression with Direction as the response and the five
lag variables plus Volume as predictors. Use the summary() function to print the results. Do any of the
predictors appear to be statistically significant? If so, which ones?
>logmod = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, family="binomial", data=Weekly)
>summary(logmod)

Lag2 and the intercept are statistically significant at the 0.05 level: Lag2 has a p-value of 0.0296
and the intercept a p-value of 0.0019.

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion
matrix is telling you about the types of mistakes made by logistic regression.
>probs = predict(logmod, type="response")
>preds = rep("Down", 1089)
>preds[probs > 0.5] = "Up"
>table(preds, Weekly$Direction)

The confusion matrix shows that the model predicts "Up" for the vast majority of weeks, so it captures
most of the true Up weeks but also produces many false positives. As a result it identifies very few of
the Down weeks correctly: only 54 of the 484 true Down weeks.
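The overall fraction of correct predictions can be read directly off the confusion matrix. A minimal sketch, assuming the cell counts produced by the full-data logistic fit above (the 54 true negatives are stated above; the remaining cells are the values the fit typically yields):

```r
# Confusion matrix from part (c): rows = predicted, cols = actual
# (cell counts assumed from the full-data logistic fit above)
cm <- matrix(c(54, 430, 48, 557), nrow = 2,
             dimnames = list(pred = c("Down", "Up"),
                             actual = c("Down", "Up")))
accuracy <- sum(diag(cm)) / sum(cm)   # fraction of correct predictions
error    <- 1 - accuracy
round(c(accuracy = accuracy, error = error), 4)
```

With these counts the model is correct on (54 + 557)/1089, or roughly 56% of weeks, only slightly better than always guessing "Up".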

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as
the only predictor. Compute the confusion matrix and overall fraction of correct predictions for the held
out data (that is, the data from 2009 and 2010).
>training.data = Weekly[Weekly$Year<2009,]
>test.data = Weekly[Weekly$Year>2008,]
>simpleglm = glm(Direction~Lag2, data=training.data, family="binomial")
>summary(simpleglm)

>testprobs = predict(simpleglm, type="response", newdata=test.data)


>testdirs = Weekly$Direction[Weekly$Year>2008]
>plot(testprobs, col=ifelse(testdirs=="Down", "blue", "orange"), pch=16)
>abline(h = 0.5, lwd = 3)
>testpreds = rep("Down", 104)
>testpreds[testprobs > 0.5] = "Up"
>mean(testpreds == testdirs)
[1] 0.625
>table(testpreds, testdirs)

The test error rate for logistic regression is 37.5% (accuracy 62.5%).
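The 37.5% figure comes from comparing the held-out predictions to the held-out labels. A sketch using the 2009-2010 confusion matrix (cell counts assumed from the Lag2-only fit above):

```r
# Held-out (2009-2010) confusion matrix: rows = predicted, cols = actual
# (cell counts assumed from the Lag2-only logistic fit above)
cm.test <- matrix(c(9, 34, 5, 56), nrow = 2,
                  dimnames = list(pred = c("Down", "Up"),
                                  actual = c("Down", "Up")))
test.error <- 1 - sum(diag(cm.test)) / sum(cm.test)
test.error   # 0.375
```

Equivalently, `mean(testpreds != testdirs)` gives the same number directly from the prediction and label vectors.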

(e) Repeat (d) using LDA.


>library(MASS)
>lda.fit = lda(Direction~Lag2, data=training.data)
>lda.fit
>plot(lda.fit)

>lda.pred = predict(lda.fit, newdata=test.data)


>lda.class = lda.pred$class
>table(lda.class, test.data$Direction)
The LDA test error rate is the same as for logistic regression, 37.5%.

(f) Repeat (d) using QDA.


>qda.fit = qda(Direction~Lag2, data= training.data)
>qda.fit

>qda.pred = predict(qda.fit, newdata=test.data)


>qda.class = qda.pred$class
>table(qda.class, test.data$Direction)

With a test error rate of 41.35%, QDA performs the worst of the models so far.

(g) Repeat (d) using KNN with K = 1.


>library(class)
>set.seed(1)
>train.x = cbind(training.data$Lag2)
>test.x = cbind(test.data$Lag2)
>train.y = training.data$Direction
>knn.pred = knn(train.x, test.x, train.y, k=1)
>table(knn.pred, test.data$Direction)

(h) Which of these methods appears to provide the best results on this data? It appears that logistic
regression and linear discriminant analysis provide the best results, each with an accuracy of 62.5%.

(i) Experiment with different combinations of predictors, including possible transformations and
interactions, for each of the methods. Report the variables, method, and associated confusion matrix
that appears to provide the best results on the held-out data. Note that you should also experiment with
values for K in the KNN classifier.
>qda.fit2 = qda(Direction~Lag1+Lag2+Lag4, data= training.data)
>qda.fit2
>qda.pred2 = predict(qda.fit2, newdata=test.data)
>qda.class2 = qda.pred2$class
>table(qda.class2, test.data$Direction)

>lda.fit2 = lda(Direction~Lag1+Lag2+Lag4,data= training.data)


>lda.fit2

>lda.pred2 = predict(lda.fit2, newdata=test.data)


>lda.class2 = lda.pred2$class
>table(lda.class2, test.data$Direction)

>knn10.pred = knn(train.x, test.x, train.y, k=10)


>table(knn10.pred, test.data$Direction)
>knn5.pred = knn(train.x, test.x, train.y, k=5)
>table(knn5.pred, test.data$Direction)

Out of these models, the original logistic regression and LDA still have the best performance in terms
of test error rate.
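A convenient way to experiment with K is to loop over candidate values and record the test error for each. This sketch uses simulated stand-in data (one predictor, two classes) purely to illustrate the loop; on the Weekly split, the `train.x`, `test.x`, and `train.y` objects from part (g) would be used instead:

```r
library(class)   # for knn()
set.seed(1)
# Toy stand-in data; replace with train.x / test.x / train.y from part (g)
train.x <- matrix(rnorm(200), ncol = 1)
train.y <- factor(ifelse(train.x[, 1] + rnorm(200, sd = 0.5) > 0, "Up", "Down"))
test.x  <- matrix(rnorm(100), ncol = 1)
test.y  <- factor(ifelse(test.x[, 1] > 0, "Up", "Down"))

# Test error for each candidate K
ks <- c(1, 5, 10, 50)
errs <- sapply(ks, function(k) {
  pred <- knn(train.x, test.x, train.y, k = k)
  mean(pred != test.y)
})
names(errs) <- paste0("K=", ks)
errs
```

Picking the K with the smallest held-out error is the same model-selection idea used for the other classifiers above.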
2. Chapter 5 (Exercise 7)

In Sections 5.3.2 and 5.3.3, we saw that the cv.glm() function can be used in order to compute the
LOOCV test error estimate. Alternatively, one could compute those quantities using just the glm() and
predict.glm() functions, and a for loop. You will now take this approach in order to compute the
LOOCV error for a simple logistic regression model on the Weekly data set. Recall that in the context of
classification problems, the LOOCV error is given in (5.4).

(a) Fit a logistic regression model that predicts Direction using Lag1 and Lag2.
>fit.glm = glm(Direction~Lag1+Lag2, data=Weekly, family="binomial")
>summary(fit.glm)

(b) Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but the first
observation.
>fit.glm = glm(Direction~Lag1+Lag2, data=Weekly[-1,], family="binomial")
>summary(fit.glm)

(c) Use the model from (b) to predict the direction of the first observation. You can do this by predicting
that the first observation will go up if P(Direction="Up"|Lag1, Lag2) > 0.5. Was this
observation correctly classified?
>predict(fit.glm, newdata=Weekly[1,], type="response") > 0.5

>Weekly[1,]$Direction

(d) Write a for loop from i = 1 to i = n, where n is the number of observations in the data set, that
performs each of the following steps:

I. Fit a logistic regression model using all but the ith observation to predict Direction
using Lag1 and Lag2.
II. Compute the posterior probability of the market moving up for the ith observation.
III. Use the posterior probability for the ith observation in order to predict whether or not
the market moves up.
IV. Determine whether or not an error was made in predicting the direction for the ith
observation. If an error was made then indicate this as a 1, and otherwise indicate it as a
0.

>error <- rep(0, dim(Weekly)[1])
for (i in 1:dim(Weekly)[1]) {
  fit.glm <- glm(Direction~Lag1+Lag2, data=Weekly[-i, ],
                 family="binomial")
  pred.up <- predict.glm(fit.glm, Weekly[i, ], type="response") > 0.5
  true.up <- Weekly[i, ]$Direction == "Up"
  if (pred.up != true.up)
    error[i] <- 1
}
>error

(e) Take the average of the n numbers obtained in (d)iv in order to obtain the LOOCV estimate for the
test error. Comment on the results.

The LOOCV estimate of the test error rate is 0.4499541, or about 45%, so the model does only slightly better than random guessing.
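As a cross-check, the same kind of LOOCV error can be obtained from cv.glm() with a 0/1 misclassification cost function, as in Eq. (5.4). This sketch uses simulated stand-in data so it is self-contained; on the real data, the Weekly data frame and the Lag1+Lag2 fit would take the place of the simulated names:

```r
library(boot)   # for cv.glm()
set.seed(1)
# Simulated stand-in for Weekly: binary response, two predictors
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(0.3 * x1 - 0.2 * x2))
df <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, data = df, family = binomial)
# 0/1 misclassification cost: predict class 1 when fitted prob > 0.5
cost <- function(r, pi) mean(abs(r - pi) > 0.5)
cv.err <- cv.glm(df, fit, cost)$delta[1]   # K defaults to n, i.e. LOOCV
cv.err
```

With K left at its default of n, cv.glm() refits the model n times, exactly mirroring the hand-written loop above.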


3. Chapter 5 (Exercise 9)

We will now consider the Boston housing data set, from the MASS library.

(a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate mu-
hat.
>library(MASS)
>mu.hat = mean(Boston$medv)
>mu.hat
[1] 22.53281

(b) Provide an estimate of the standard error of mu-hat. Interpret this result. Hint: We can compute the
standard error of the sample mean by dividing the sample standard deviation by the square root of the
number of observations.
>se.hat = sd(Boston$medv) /sqrt(dim(Boston)[1])
>se.hat
[1] 0.4088611

(c) Now estimate the standard error of mu-hat using the bootstrap. How does this compare to your
answer from (b)?
>library(boot)
>set.seed(1)
>boot.fn <- function(data, index) {
  mu <- mean(data[index])
  return(mu)
}
>boot(Boston$medv, boot.fn, 1000)

(d) Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv.
Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95%
confidence interval using the formula [mu-hat - 2SE(mu-hat), mu-hat + 2SE(mu-hat)].
>t.test(Boston$medv)

>CI.mu.hat <- c(22.53 - 2 * 0.4119, 22.53 + 2 * 0.4119)


>CI.mu.hat
[1] 21.7062 23.3538

(e) Based on this data set, provide an estimate, mu-hat med, for the median value of medv in the
population.
>med.hat <- median(Boston$medv)
>med.hat
[1] 21.2

(f) We now would like to estimate the standard error of mu-hat med. Unfortunately, there is no simple
formula for computing the standard error of the median. Instead, estimate the standard error of the
median using the bootstrap. Comment on your findings.
>boot.fn <- function(data, index) {
  mu <- median(data[index])
  return(mu)
}
>boot(Boston$medv, boot.fn, 1000)

(g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call
this quantity mu-hat 0.1. (You can use the quantile() function.)
>percent.hat <- quantile(Boston$medv, c(0.1))
>percent.hat
10%
12.75

(h) Use the bootstrap to estimate the standard error of mu-hat 0.1. Comment on your findings.
>boot.fn <- function(data, index) {
  mu <- quantile(data[index], c(0.1))
  return(mu)
}
>boot(Boston$medv, boot.fn, 1000)

Using the bootstrap to estimate the standard error of mu-hat 0.1, the bootstrap point estimate of the
tenth percentile is 12.75, matching the value obtained in part (g). Its standard error, 0.5113, is small
relative to the percentile value.
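Applying the same two-standard-error rule used in part (d), the bootstrap SE of 0.5113 reported above gives an approximate 95% confidence interval for the tenth percentile:

```r
# Approximate 95% CI: estimate +/- 2 * bootstrap SE (values from above)
percent.hat <- 12.75
se.boot     <- 0.5113
CI.percent  <- c(percent.hat - 2 * se.boot, percent.hat + 2 * se.boot)
CI.percent   # [1] 11.7274 13.7726
```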
