
Exercises

Fit some of the non-linear models investigated in this chapter to the Auto data set. Is there evidence
for non-linear relationships in this data set? Create some informative plots to justify your answer.

Answer:
1. Polynomial Regression
Step 1: Check for a non-linear relationship between a car's weight and its mpg.
In the RStudio console:
> require(ISLR); require(tidyverse); require(caret)
> require(ggthemes); require(broom); require(knitr)
> theme_set(theme_tufte(base_size = 14) + theme(legend.position = 'top'))
> set.seed(5)
> options(knitr.kable.NA = '')
> data('Auto')
> force(Auto)
Result:
    mpg cylinders displacement horsepower weight acceleration year origin
1    18         8        307.0        130   3504         12.0   70      1
2    15         8        350.0        165   3693         11.5   70      1
3    18         8        318.0        150   3436         11.0   70      1
4    16         8        304.0        150   3433         12.0   70      1
5    17         8        302.0        140   3449         10.5   70      1
6    15         8        429.0        198   4341         10.0   70      1
7    14         8        454.0        220   4354          9.0   70      1
8    14         8        440.0        215   4312          8.5   70      1
9    14         8        455.0        225   4425         10.0   70      1
10   15         8        390.0        190   3850          8.5   70      1
11   15         8        383.0        170   3563         10.0   70      1
12   14         8        340.0        160   3609          8.0   70      1
13   15         8        400.0        150   3761          9.5   70      1
14   14         8        455.0        225   3086         10.0   70      1
15   24         4        113.0         95   2372         15.0   70      3
………continued until…
111  27         4        108.0         94   2379         16.5   73      3
112  18         3         70.0         90   2124         13.5   73      3

    name
1   chevrolet chevelle malibu
2   buick skylark 320
3   plymouth satellite
4   amc rebel sst
5   ford torino
6   ford galaxie 500
7   chevrolet impala
8   plymouth fury iii
9   pontiac catalina
10  amc ambassador dpl
11  dodge challenger se
12  plymouth 'cuda 340
13  chevrolet monte carlo
14  buick estate wagon (sw)
15  toyota corona mark ii
………continued until…
111 datsun 610
112 maxda rx3
[ reached 'max' / getOption("max.print") -- omitted 281 rows ]
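
Before fitting anything, a quick exploratory scatter plot (a minimal sketch, not part of the original console log; the loess smoother is only there to reveal curvature) already hints that the weight-mpg relationship is non-linear:
> ggplot(Auto, aes(weight, mpg)) +
+   geom_point(alpha = 0.5) +
+   geom_smooth(method = 'loess') +  # nonparametric smoother to expose curvature
+   labs(x = 'Weight', y = 'Miles Per Gallon')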

Step 2: Use 8-fold cross-validation to find which polynomial degree fits the training data best.
In the RStudio console:
> folds <- sample(rep(1:8, nrow(Auto)/8))
> errors <- matrix(NA, 8, 8)
> for (k in 1:8) {
+   for (i in 1:8) {
+     # hold out fold k; fit a degree-i polynomial on the remaining folds
+     model <- lm(mpg ~ poly(weight, i), data = Auto[folds != k,])
+     pred <- predict(model, Auto[folds == k,])
+     errors[k, i] <- sqrt(mean((Auto$mpg[folds == k] - pred)^2))
+   }
+ }
> errors <- apply(errors, 2, mean)
> tibble(RMSE = errors) %>%
+   mutate(Polynomial = row_number()) %>%
+   ggplot(aes(Polynomial, RMSE, fill = Polynomial == which.min(errors))) +
+   geom_col() +
+   scale_x_continuous(breaks = 1:8) +
+   guides(fill = FALSE) +
+   coord_cartesian(ylim = c(min(errors), max(errors)))

Result: a bar chart of mean RMSE for each polynomial degree, with the lowest cross-validated RMSE highlighted (degree 8 here, matching the coefficient table in Step 3).
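
As a cross-check on the hand-rolled loop, the same degree selection can be done with boot::cv.glm (a sketch; this call is an addition, not part of the original console log):
> require(boot)
> cv_rmse <- sapply(1:8, function(d) {
+   fit <- glm(mpg ~ poly(weight, d), data = Auto)  # glm so cv.glm can resample it
+   sqrt(cv.glm(Auto, fit, K = 8)$delta[1])         # delta[1] is the raw CV estimate of MSE
+ })
> which.min(cv_rmse)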

Step 3: Fit the best model on the whole data set and examine the p-value for each coefficient.
In the RStudio console:
> poly_model <- lm(mpg ~ poly(weight, which.min(errors)), data = Auto)
> tidy(poly_model) %>%
+ mutate(term = ifelse(row_number() == 1, 'Intercept',
+ paste0('$Weight^', row_number()-1,'$'))) %>%
+ kable(digits = 3)

Result:

| term       | estimate | std.error | statistic | p.value |
|:-----------|---------:|----------:|----------:|--------:|
| Intercept  |   23.446 |     0.212 |   110.807 |   0.000 |
| $Weight^1$ | -128.444 |     4.189 |   -30.660 |   0.000 |
| $Weight^2$ |   23.159 |     4.189 |     5.528 |   0.000 |
| $Weight^3$ |    0.220 |     4.189 |     0.053 |   0.958 |
| $Weight^4$ |   -2.808 |     4.189 |    -0.670 |   0.503 |
| $Weight^5$ |    3.830 |     4.189 |     0.914 |   0.361 |
| $Weight^6$ |   -3.336 |     4.189 |    -0.796 |   0.426 |
| $Weight^7$ |    5.400 |     4.189 |     1.289 |   0.198 |
| $Weight^8$ |   -0.520 |     4.189 |    -0.124 |   0.901 |

Only the first- and second-order terms are significant, so there is clear evidence of a non-linear (essentially quadratic) relationship between weight and mpg.

Step 4: Plot the data.
In the RStudio console:
> Auto %>%
+ mutate(Predictions = predict(poly_model, Auto)) %>%
+ ggplot() + xlab('Weight') + ylab('Miles Per Gallon') +
+ geom_point(aes(weight, mpg, col = 'blue')) +
+ geom_line(aes(weight, Predictions, col = 'goldenrod2'), size = 1.5) +
+ scale_color_manual(name = 'Value Type',
+ labels = c('Observed', 'Predicted'),
+ values = c('#56B4E9', '#E69F00'))
Result: a scatter plot of observed mpg against weight with the fitted polynomial predictions overlaid.
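
As an extra check that the polynomial terms earn their keep, the straight-line and selected polynomial fits can be compared with an F-test (a sketch; lin_model is a hypothetical name, not in the original log):
> lin_model <- lm(mpg ~ weight, data = Auto)
> anova(lin_model, poly_model)  # significant F => the polynomial improves on the straight line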
2. Step Functions (Piecewise Linear Model)


Step 1: Finding the Optimal Number of Cuts via K-Fold Cross-Validation
In the RStudio console:
> folds <- sample(rep(1:8, nrow(Auto)/8))
> errors <- matrix(NA, 8, 8)
> for (k in 1:8) {
+   for (i in 1:8) {
+     # bin horsepower into i + 1 intervals; hold out fold k
+     Auto$cut <- cut(Auto$horsepower, i + 1)
+     model <- lm(mpg ~ horsepower * cut, data = Auto[folds != k,])
+     pred <- predict(model, Auto[folds == k,])
+     errors[k, i] <- sqrt(mean((Auto$mpg[folds == k] - pred)^2))
+     Auto$cut <- NULL
+   }
+ }
> errors <- apply(errors, 2, mean)
> tibble(RMSE = errors) %>%
+   mutate(Steps = row_number() + 1) %>%
+   ggplot(aes(Steps, RMSE, fill = Steps == (which.min(errors) + 1))) +
+   geom_col() +
+   scale_x_continuous(breaks = 2:(length(errors) + 1)) +
+   guides(fill = FALSE) +
+   coord_cartesian(ylim = c(min(errors), max(errors)))
Result: a bar chart of mean RMSE by number of intervals; the optimal predictive power is found with four intervals of the horsepower variable.

Step 2: Fitting the final model
In the RStudio console:
> min_error <- which.min(errors) + 1
> step_model <- lm(mpg ~ horsepower * cut(horsepower, min_error), data = Auto)
> tidy(step_model) %>%
+   kable(digits = 3)
Result:
| term                                           | estimate | std.error | statistic | p.value |
|:-----------------------------------------------|---------:|----------:|----------:|--------:|
| (Intercept)                                    |   51.145 |     2.052 |    24.929 |   0.000 |
| horsepower                                     |   -0.292 |     0.027 |   -10.894 |   0.000 |
| cut(horsepower, min_error)(92,138]             |  -18.889 |     4.370 |    -4.322 |   0.000 |
| cut(horsepower, min_error)(138,184]            |  -25.675 |     6.923 |    -3.709 |   0.000 |
| cut(horsepower, min_error)(184,230]            |  -43.112 |    16.436 |    -2.623 |   0.009 |
| horsepower:cut(horsepower, min_error)(92,138]  |    0.183 |     0.045 |     4.111 |   0.000 |
| horsepower:cut(horsepower, min_error)(138,184] |    0.223 |     0.050 |     4.441 |   0.000 |
| horsepower:cut(horsepower, min_error)(184,230] |    0.315 |     0.082 |     3.819 |   0.000 |
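
For reference on what the step terms above mean, `cut()` splits horsepower into four equal-width intervals and the interaction fits a separate slope within each; a quick way to inspect the bins (this check is an addition, not part of the original log):
> table(cut(Auto$horsepower, 4))  # observation counts per horsepower interval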
Step 3: Plotting the data
In the RStudio console:
> Auto %>%
+ mutate(Predictions = predict(step_model, Auto)) %>%
+ ggplot() + xlab('Horsepower') + ylab('Miles Per Gallon') +
+ geom_point(aes(horsepower, mpg, col = 'blue')) +
+ geom_line(aes(horsepower, Predictions, col = 'goldenrod2'), size = 1.5) +
+ scale_color_manual(name = 'Value Type',
+ labels = c('Observed', 'Predicted'),
+ values = c('#56B4E9', '#E69F00'))
Result: a scatter plot of observed mpg against horsepower with the piecewise linear fit overlaid.

3. Splines
Step 1: Model Selection with caret
In the RStudio console:
> spline_model <- train(mpg ~ displacement, data = Auto,
+ method = 'gamSpline',
+ trControl = trainControl(method = 'cv', number = 8),
+ tuneGrid = expand.grid(df = seq(1, 12, 1)))
> disp_lm <- train(mpg ~ displacement, data = Auto, method = 'lm',
+ trControl = trainControl(method = 'cv', number = 8))
> plot(spline_model)
Result: a plot of cross-validated RMSE against the spline degrees of freedom.
> anova(disp_lm$finalModel,
+ spline_model$finalModel) %>%
+ tidy() %>%
+ mutate(model = c('Linear', 'Spline'),
+ percent = sumsq/lag(rss) * 100) %>%
+ select(model, rss, percent, sumsq, p.value) %>%
+ kable(digits = 3)
Result:
| model  |      rss | percent |    sumsq | p.value |
|:-------|---------:|--------:|---------:|--------:|
| Linear | 8378.822 |         |          |         |
| Spline | 6629.277 |  20.881 | 1749.545 |       0 |

The spline reduces the residual sum of squares by roughly 21% relative to the linear fit, strong evidence that the displacement-mpg relationship is non-linear.
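
A similar spline fit can be obtained outside caret with a natural spline from the splines package (a minimal sketch; df = 5 is an illustrative value, not the df tuned above):
> require(splines)
> ns_fit <- lm(mpg ~ ns(displacement, df = 5), data = Auto)  # natural cubic spline basis
> summary(ns_fit)$r.squared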
> Auto %>%
+ mutate(pred = predict(spline_model, Auto)) %>%
+ ggplot() +
+ geom_point(aes(displacement, mpg, col = 'blue')) +
+ geom_line(aes(displacement, pred, col = 'goldenrod2'), size = 1.5) +
+ scale_color_manual(name = 'Value Type',
+ labels = c('Observed', 'Predicted'),
+ values = c('#56B4E9', '#E69F00')) +
+ labs(x = 'Displacement', y = 'Miles Per Gallon', title = 'Spline Relationship') +
+ theme(legend.position = 'top')

Result: a scatter plot of observed mpg against displacement with the fitted spline overlaid.

4. Generalized Additive Model (GAM)


Step 1: Cross-Validating via caret
In the RStudio console:
> gam_model <- train(mpg ~ horsepower +
+                      weight +
+                      displacement +
+                      acceleration +
+                      year +
+                      cylinders, data = Auto,
+                    method = 'gam',
+                    trControl = trainControl(method = 'cv', number = 10))
>
> tidy(gam_model$finalModel) %>%
+ kable(digits = 3)
Result:
| term | edf | ref.df | statistic | p.value |
|:---------------|------:|-------:|---------:|-------:|
| s(year) | 8.427 | 8.903 | 47.705 | 0.000 |
| s(displacement) | 1.753 | 2.184 | 3.171 | 0.045 |
| s(horsepower) | 2.181 | 2.779 | 7.851 | 0.000 |
| s(acceleration) | 2.382 | 3.074 | 2.246 | 0.088 |
| s(weight) | 2.545 | 3.219 | 18.996 | 0.000 |
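
The same GAM can be fit directly with mgcv, whose plot method gives the partial-effect plots the exercise asks for (a minimal sketch; the explicit s() smoothers are assumptions mirroring what caret's 'gam' method selected):
> require(mgcv)
> gam_fit <- gam(mpg ~ s(year) + s(displacement) + s(horsepower) +
+                  s(acceleration) + s(weight) + cylinders, data = Auto)
> plot(gam_fit, pages = 1)  # one page of partial-effect plots, one per smooth
Smooths with effective degrees of freedom well above 1 (year, weight, horsepower in the table above) appear visibly curved, which is direct graphical evidence of non-linear relationships in the Auto data.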
