
Exercises

Fit some of the non-linear models investigated in this chapter to the Auto data set. Is there evidence
for non-linear relationships in this data set? Create some informative plots to justify your answer.

Answer:
1. Polynomial Regression
Step 1: Check for a non-linear relationship between a car's weight and its mpg.
In the RStudio console:
> require(ISLR); require(tidyverse); require(caret)
> require(ggthemes); require(broom); require(knitr)
> theme_set(theme_tufte(base_size = 14) + theme(legend.position = 'top'))
> set.seed(5)
> options(knitr.kable.NA = '')
> data('Auto')
> force(Auto)
Result:
    mpg cylinders displacement horsepower weight acceleration year origin
1    18         8        307.0        130   3504         12.0   70      1
2    15         8        350.0        165   3693         11.5   70      1
3    18         8        318.0        150   3436         11.0   70      1
4    16         8        304.0        150   3433         12.0   70      1
5    17         8        302.0        140   3449         10.5   70      1
6    15         8        429.0        198   4341         10.0   70      1
7    14         8        454.0        220   4354          9.0   70      1
8    14         8        440.0        215   4312          8.5   70      1
9    14         8        455.0        225   4425         10.0   70      1
10   15         8        390.0        190   3850          8.5   70      1
11   15         8        383.0        170   3563         10.0   70      1
12   14         8        340.0        160   3609          8.0   70      1
13   15         8        400.0        150   3761          9.5   70      1
14   14         8        455.0        225   3086         10.0   70      1
15   24         4        113.0         95   2372         15.0   70      3
………continued until…
111  27         4        108.0         94   2379         16.5   73      3
112  18         3         70.0         90   2124         13.5   73      3

    name
1   chevrolet chevelle malibu
2   buick skylark 320
3   plymouth satellite
4   amc rebel sst
5   ford torino
6   ford galaxie 500
7   chevrolet impala
8   plymouth fury iii
9   pontiac catalina
10  amc ambassador dpl
11  dodge challenger se
12  plymouth 'cuda 340
13  chevrolet monte carlo
14  buick estate wagon (sw)
15  toyota corona mark ii
………continued until…
111 datsun 610
112 maxda rx3
[ reached 'max' / getOption("max.print") -- omitted 281 rows ]
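
Before fitting anything, a quick exploratory scatter plot (a minimal sketch, not part of the original console log; the loess smoother is only there to reveal curvature) already hints that the weight-mpg relationship is non-linear:
> ggplot(Auto, aes(weight, mpg)) +
+   geom_point(alpha = 0.5) +
+   geom_smooth(method = 'loess') +  # nonparametric smoother to expose curvature
+   labs(x = 'Weight', y = 'Miles Per Gallon')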

Step 2: Use 8-fold cross-validation to find which polynomial degree fits the training data best.
In the RStudio console:
> folds <- sample(rep(1:8, nrow(Auto)/8))
> errors <- matrix(NA, 8, 8)
> for (k in 1:8) {
+   for (i in 1:8) {
+     # hold out fold k; fit a degree-i polynomial on the remaining folds
+     model <- lm(mpg ~ poly(weight, i), data = Auto[folds != k,])
+     pred <- predict(model, Auto[folds == k,])
+     errors[k, i] <- sqrt(mean((Auto$mpg[folds == k] - pred)^2))
+   }
+ }
> errors <- apply(errors, 2, mean)
> tibble(RMSE = errors) %>%
+   mutate(Polynomial = row_number()) %>%
+   ggplot(aes(Polynomial, RMSE, fill = Polynomial == which.min(errors))) +
+   geom_col() +
+   scale_x_continuous(breaks = 1:8) +
+   guides(fill = FALSE) +
+   coord_cartesian(ylim = c(min(errors), max(errors)))

Result: a bar chart of mean RMSE for each polynomial degree, with the lowest cross-validated RMSE highlighted (degree 8 here, matching the coefficient table in Step 3).
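
As a cross-check on the hand-rolled loop, the same degree selection can be done with boot::cv.glm (a sketch; this call is an addition, not part of the original console log):
> require(boot)
> cv_rmse <- sapply(1:8, function(d) {
+   fit <- glm(mpg ~ poly(weight, d), data = Auto)  # glm so cv.glm can resample it
+   sqrt(cv.glm(Auto, fit, K = 8)$delta[1])         # delta[1] is the raw CV estimate of MSE
+ })
> which.min(cv_rmse)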

Step 3: Fit the best model on the whole data set and examine the p-value for each coefficient.
In the RStudio console:
> poly_model <- lm(mpg ~ poly(weight, which.min(errors)), data = Auto)
> tidy(poly_model) %>%
+ mutate(term = ifelse(row_number() == 1, 'Intercept',
+ paste0('$Weight^', row_number()-1,'$'))) %>%
+ kable(digits = 3)

Result:

| term       | estimate | std.error | statistic | p.value |
|:-----------|---------:|----------:|----------:|--------:|
| Intercept  |   23.446 |     0.212 |   110.807 |   0.000 |
| $Weight^1$ | -128.444 |     4.189 |   -30.660 |   0.000 |
| $Weight^2$ |   23.159 |     4.189 |     5.528 |   0.000 |
| $Weight^3$ |    0.220 |     4.189 |     0.053 |   0.958 |
| $Weight^4$ |   -2.808 |     4.189 |    -0.670 |   0.503 |
| $Weight^5$ |    3.830 |     4.189 |     0.914 |   0.361 |
| $Weight^6$ |   -3.336 |     4.189 |    -0.796 |   0.426 |
| $Weight^7$ |    5.400 |     4.189 |     1.289 |   0.198 |
| $Weight^8$ |   -0.520 |     4.189 |    -0.124 |   0.901 |

Only the first- and second-order terms are significant, so there is clear evidence of a non-linear (essentially quadratic) relationship between weight and mpg.

Step 4: Plot the data.
In the RStudio console:
> Auto %>%
+ mutate(Predictions = predict(poly_model, Auto)) %>%
+ ggplot() + xlab('Weight') + ylab('Miles Per Gallon') +
+ geom_point(aes(weight, mpg, col = 'blue')) +
+ geom_line(aes(weight, Predictions, col = 'goldenrod2'), size = 1.5) +
+ scale_color_manual(name = 'Value Type',
+ labels = c('Observed', 'Predicted'),
+ values = c('#56B4E9', '#E69F00'))
Result: a scatter plot of observed mpg against weight with the fitted polynomial predictions overlaid.
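
As an extra check that the polynomial terms earn their keep, the straight-line and selected polynomial fits can be compared with an F-test (a sketch; lin_model is a hypothetical name, not in the original log):
> lin_model <- lm(mpg ~ weight, data = Auto)
> anova(lin_model, poly_model)  # significant F => the polynomial improves on the straight line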
2. Step Functions (Piecewise Linear Model)


Step 1: Finding the Optimal Number of Cuts via K-Fold Cross-Validation
In the RStudio console:
> folds <- sample(rep(1:8, nrow(Auto)/8))
> errors <- matrix(NA, 8, 8)
> for (k in 1:8) {
+   for (i in 1:8) {
+     # bin horsepower into i + 1 intervals; hold out fold k
+     Auto$cut <- cut(Auto$horsepower, i + 1)
+     model <- lm(mpg ~ horsepower * cut, data = Auto[folds != k,])
+     pred <- predict(model, Auto[folds == k,])
+     errors[k, i] <- sqrt(mean((Auto$mpg[folds == k] - pred)^2))
+     Auto$cut <- NULL
+   }
+ }
> errors <- apply(errors, 2, mean)
> tibble(RMSE = errors) %>%
+   mutate(Steps = row_number() + 1) %>%
+   ggplot(aes(Steps, RMSE, fill = Steps == (which.min(errors) + 1))) +
+   geom_col() +
+   scale_x_continuous(breaks = 2:(length(errors) + 1)) +
+   guides(fill = FALSE) +
+   coord_cartesian(ylim = c(min(errors), max(errors)))
Result: a bar chart of mean RMSE by number of intervals; the optimal predictive power is found with four intervals of the horsepower variable.

Step 2: Fitting the final model
In the RStudio console:
> min_error <- which.min(errors) + 1
> step_model <- lm(mpg ~ horsepower * cut(horsepower, min_error), data = Auto)
> tidy(step_model) %>%
+   kable(digits = 3)
Result:
| term                                           | estimate | std.error | statistic | p.value |
|:-----------------------------------------------|---------:|----------:|----------:|--------:|
| (Intercept)                                    |   51.145 |     2.052 |    24.929 |   0.000 |
| horsepower                                     |   -0.292 |     0.027 |   -10.894 |   0.000 |
| cut(horsepower, min_error)(92,138]             |  -18.889 |     4.370 |    -4.322 |   0.000 |
| cut(horsepower, min_error)(138,184]            |  -25.675 |     6.923 |    -3.709 |   0.000 |
| cut(horsepower, min_error)(184,230]            |  -43.112 |    16.436 |    -2.623 |   0.009 |
| horsepower:cut(horsepower, min_error)(92,138]  |    0.183 |     0.045 |     4.111 |   0.000 |
| horsepower:cut(horsepower, min_error)(138,184] |    0.223 |     0.050 |     4.441 |   0.000 |
| horsepower:cut(horsepower, min_error)(184,230] |    0.315 |     0.082 |     3.819 |   0.000 |
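
For reference on what the step terms above mean, `cut()` splits horsepower into four equal-width intervals and the interaction fits a separate slope within each; a quick way to inspect the bins (this check is an addition, not part of the original log):
> table(cut(Auto$horsepower, 4))  # observation counts per horsepower interval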
Step 3: Plotting the data
In the RStudio console:
> Auto %>%
+ mutate(Predictions = predict(step_model, Auto)) %>%
+ ggplot() + xlab('Horsepower') + ylab('Miles Per Gallon') +
+ geom_point(aes(horsepower, mpg, col = 'blue')) +
+ geom_line(aes(horsepower, Predictions, col = 'goldenrod2'), size = 1.5) +
+ scale_color_manual(name = 'Value Type',
+ labels = c('Observed', 'Predicted'),
+ values = c('#56B4E9', '#E69F00'))
Result: a scatter plot of observed mpg against horsepower with the piecewise linear fit overlaid.

3. Splines
Step 1: Model Selection with caret
In the RStudio console:
> spline_model <- train(mpg ~ displacement, data = Auto,
+ method = 'gamSpline',
+ trControl = trainControl(method = 'cv', number = 8),
+ tuneGrid = expand.grid(df = seq(1, 12, 1)))
> disp_lm <- train(mpg ~ displacement, data = Auto, method = 'lm',
+ trControl = trainControl(method = 'cv', number = 8))
> plot(spline_model)
Result: a plot of cross-validated RMSE against the spline degrees of freedom.
> anova(disp_lm$finalModel,
+ spline_model$finalModel) %>%
+ tidy() %>%
+ mutate(model = c('Linear', 'Spline'),
+ percent = sumsq/lag(rss) * 100) %>%
+ select(model, rss, percent, sumsq, p.value) %>%
+ kable(digits = 3)
Result:
| model  |      rss | percent |    sumsq | p.value |
|:-------|---------:|--------:|---------:|--------:|
| Linear | 8378.822 |         |          |         |
| Spline | 6629.277 |  20.881 | 1749.545 |       0 |

The spline reduces the residual sum of squares by roughly 21% relative to the linear fit, strong evidence that the displacement-mpg relationship is non-linear.
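
A similar spline fit can be obtained outside caret with a natural spline from the splines package (a minimal sketch; df = 5 is an illustrative value, not the df tuned above):
> require(splines)
> ns_fit <- lm(mpg ~ ns(displacement, df = 5), data = Auto)  # natural cubic spline basis
> summary(ns_fit)$r.squared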
> Auto %>%
+ mutate(pred = predict(spline_model, Auto)) %>%
+ ggplot() +
+ geom_point(aes(displacement, mpg, col = 'blue')) +
+ geom_line(aes(displacement, pred, col = 'goldenrod2'), size = 1.5) +
+ scale_color_manual(name = 'Value Type',
+ labels = c('Observed', 'Predicted'),
+ values = c('#56B4E9', '#E69F00')) +
+ labs(x = 'Displacement', y = 'Miles Per Gallon', title = 'Spline Relationship') +
+ theme(legend.position = 'top')

Result: a scatter plot of observed mpg against displacement with the fitted spline overlaid.

4. Generalized Additive Model (GAM)


Step 1: Cross-Validating via caret
In the RStudio console:
> gam_model <- train(mpg ~ horsepower +
+                      weight +
+                      displacement +
+                      acceleration +
+                      year +
+                      cylinders, data = Auto,
+                    method = 'gam',
+                    trControl = trainControl(method = 'cv', number = 10))
>
> tidy(gam_model$finalModel) %>%
+ kable(digits = 3)
Result:
| term | edf | ref.df | statistic | p.value |
|:---------------|------:|-------:|---------:|-------:|
| s(year) | 8.427 | 8.903 | 47.705 | 0.000 |
| s(displacement) | 1.753 | 2.184 | 3.171 | 0.045 |
| s(horsepower) | 2.181 | 2.779 | 7.851 | 0.000 |
| s(acceleration) | 2.382 | 3.074 | 2.246 | 0.088 |
| s(weight) | 2.545 | 3.219 | 18.996 | 0.000 |
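
The same GAM can be fit directly with mgcv, whose plot method gives the partial-effect plots the exercise asks for (a minimal sketch; the explicit s() smoothers are assumptions mirroring what caret's 'gam' method selected):
> require(mgcv)
> gam_fit <- gam(mpg ~ s(year) + s(displacement) + s(horsepower) +
+                  s(acceleration) + s(weight) + cylinders, data = Auto)
> plot(gam_fit, pages = 1)  # one page of partial-effect plots, one per smooth
Smooths with effective degrees of freedom well above 1 (year, weight, horsepower in the table above) appear visibly curved, which is direct graphical evidence of non-linear relationships in the Auto data.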
