
Lab 3.

Linear Regression
AUTHOR
Jorge Zazueta

Most of this material follows the ISLR lab in chapter three and the tidy version of the labs by Emil Hvitfeldt.

Libraries
library(MASS) # For Boston data set
library(tidymodels) # For modeling

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom 1.0.1 ✔ recipes 1.0.3
✔ dials 1.1.0 ✔ rsample 1.1.1
✔ dplyr 1.0.10 ✔ tibble 3.1.8
✔ ggplot2 3.4.0 ✔ tidyr 1.2.1
✔ infer 1.0.4 ✔ tune 1.0.1
✔ modeldata 1.0.1 ✔ workflows 1.1.2
✔ parsnip 1.0.3 ✔ workflowsets 1.0.0
✔ purrr 0.3.5 ✔ yardstick 1.1.0

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ dplyr::select() masks MASS::select()
✖ recipes::step() masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/

library(ISLR) # Other data sets

Brief exploration
Let’s start by exploring the Boston data set.

head(Boston)

crim zn indus chas nox rm age dis rad tax ptratio black lstat
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7

You can access the documentation by typing ?Boston in the console.

Let’s glimpse at the data:

glimpse(Boston)

Rows: 506
Columns: 14
$ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
$ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
$ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
$ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
$ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
$ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
$ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
$ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
$ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
$ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
$ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
$ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
$ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

We are interested in predicting the median home value ( medv ). There are 13 potential predictors, all coded as
numeric, although some appear to be categorical. Specifically, whether a home's tract bounds the Charles River
( chas ) is coded as a dummy variable, where 1 indicates the tract bounds the river and 0 otherwise.
Let’s take a look at the distribution of our target variable:

Boston %>%
  ggplot(aes(x = medv)) +
  geom_histogram(bins = 15, color = "white", fill = "steelblue")

It looks reasonably symmetric, except for a few high-priced properties.

summary(Boston$medv)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   5.00   17.02   21.20   22.53   25.00   50.00

Simple linear regression

From base R, we can run a quick regression of medv vs. the percentage of lower-status population ( lstat ):

lm_0 <- lm(medv ~ lstat, data = Boston)

summary(lm_0)

Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
Min 1Q Median 3Q Max
-15.168 -3.990 -1.318 2.034 24.500

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384 0.56263 61.41 <2e-16 ***
lstat -0.95005 0.03873 -24.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16

Not bad for a starting model with only one variable. The percentage of lower-status population appears to
have a significant, negative effect on home value.

It is time to shift gears to the tidymodels framework. We start by declaring our model.

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

This code doesn’t compute anything; it is just a specification, and it is already more complex than the lm()
call in base R! Be patient: you will see how this formality becomes incredibly helpful as we build more
complex models.
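
As a quick optional check (not part of the original lab), we can ask parsnip to show the call template it will eventually use; translate() does this without fitting anything:

lm_spec %>%
  translate() # prints the underlying stats::lm() call template; nothing is computed yet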

To fit the model we can use the following code:

lm_fit <- lm_spec %>%
  fit(medv ~ lstat, data = Boston)

lm_fit

parsnip model object

Call:
stats::lm(formula = medv ~ lstat, data = data)

Coefficients:
(Intercept) lstat
34.55 -0.95

We can get the more traditional linear regression fit output using pluck() and summary().

lm_fit |>
pluck("fit") |>
summary()

Call:
stats::lm(formula = medv ~ lstat, data = data)

Residuals:
Min 1Q Median 3Q Max
-15.168 -3.990 -1.318 2.034 24.500

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384 0.56263 61.41 <2e-16 ***
lstat -0.95005 0.03873 -24.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
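
An equivalent way to reach the underlying lm object is extract_fit_engine(); a minimal sketch that should reproduce the summary above:

lm_fit |>
  extract_fit_engine() |> # pull out the underlying stats::lm object
  summary()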

Another way to extract fit information is by tidying the fit object.

tidy(lm_fit)

# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 34.6 0.563 61.4 3.74e-236
2 lstat -0.950 0.0387 -24.5 5.08e- 88

glance() will give us a tibble with useful statistics that are easy to extract.

glance(lm_fit)

# A tibble: 1 × 12
r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.544 0.543 6.22 602. 5.08e-88 1 -1641. 3289. 3302. 19472.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹adj.r.squared, ²statistic, ³deviance
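
Since glance() returns an ordinary tibble, individual statistics are easy to pull out with dplyr verbs; for instance (a small illustration):

glance(lm_fit) %>%
  select(r.squared, adj.r.squared, sigma) # or pull(r.squared) for a single number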

With a model in hand, we can make predictions.

predict(lm_fit, new_data = Boston)

# A tibble: 506 × 1
.pred
<dbl>
1 29.8
2 25.9
3 30.7
4 31.8
5 29.5
6 29.6
7 22.7
8 16.4
9 6.12
10 18.3
# … with 496 more rows

If we are interested in confidence intervals, we can get those too.

predict(lm_fit, new_data = Boston, type = "conf_int")

# A tibble: 506 × 2
.pred_lower .pred_upper
<dbl> <dbl>
1 29.0 30.6
2 25.3 26.5
3 29.9 31.6
4 30.8 32.7
5 28.7 30.3
6 28.8 30.4
7 22.2 23.3
8 15.6 17.1
9 4.70 7.54
10 17.7 18.9
# … with 496 more rows
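
Prediction intervals for individual observations follow the same pattern; a brief sketch (at the default 95% level), which should give wider bounds than the confidence intervals above:

predict(lm_fit, new_data = Boston, type = "pred_int")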

To combine observed and predicted values, you can use augment().

augment(lm_fit, new_data = Boston) %>%
  select(medv, .pred)

# A tibble: 506 × 2
medv .pred
<dbl> <dbl>
1 24 29.8
2 21.6 25.9
3 34.7 30.7
4 33.4 31.8
5 36.2 29.5
6 28.7 29.6
7 22.9 22.7
8 27.1 16.4
9 16.5 6.12
10 18.9 18.3
# … with 496 more rows

augment(lm_fit, new_data = Boston) %>%
  select(medv, .pred, lstat) %>%
  ggplot(aes(x = lstat, y = .pred)) +
  geom_point(aes(y = medv), col = "steelblue", alpha = .4) +
  geom_line(col = "orange", linewidth = 1)
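
The augmented data also makes it easy to compute performance metrics with yardstick; for example, the in-sample fit (a quick sketch, not a substitute for evaluating on held-out data):

augment(lm_fit, new_data = Boston) %>%
  metrics(truth = medv, estimate = .pred) # RMSE, R-squared, and MAE in one call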

Multiple linear regression

The API for multiple linear regression is the same. We just need to specify the predictors within the formula.

lm_fit2 <- lm_spec %>%
  fit(medv ~ lstat + age, data = Boston)

lm_fit2 %>% tidy()

# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 33.2 0.731 45.5 2.94e-180
2 lstat -1.03 0.0482 -21.4 8.42e- 73
3 age 0.0345 0.0122 2.83 4.91e- 3

Everything works just as with the simple linear regression case.
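
For instance, glance() and predict() apply to the two-predictor fit without any changes (a small check):

glance(lm_fit2)

predict(lm_fit2, new_data = Boston)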

Sometimes we want to use all the predictors in our model.

lm_fit3 <- lm_spec %>%
  fit(medv ~ ., data = Boston)

lm_fit3 %>% tidy()

# A tibble: 14 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 36.5 5.10 7.14 3.28e-12
2 crim -0.108 0.0329 -3.29 1.09e- 3
3 zn 0.0464 0.0137 3.38 7.78e- 4
4 indus 0.0206 0.0615 0.334 7.38e- 1
5 chas 2.69 0.862 3.12 1.93e- 3
6 nox -17.8 3.82 -4.65 4.25e- 6
7 rm 3.81 0.418 9.12 1.98e-18
8 age 0.000692 0.0132 0.0524 9.58e- 1
9 dis -1.48 0.199 -7.40 6.01e-13
10 rad 0.306 0.0663 4.61 5.07e- 6
11 tax -0.0123 0.00376 -3.28 1.11e- 3
12 ptratio -0.953 0.131 -7.28 1.31e-12
13 black 0.00931 0.00269 3.47 5.73e- 4
14 lstat -0.525 0.0507 -10.3 7.78e-23

Adding interaction terms

Using * in the formula will include x, y, and their interaction x:y.

lm_fit4 <- lm_spec %>%
  fit(medv ~ lstat * age, data = Boston)

lm_fit4

parsnip model object

Call:
stats::lm(formula = medv ~ lstat * age, data = data)

Coefficients:
(Intercept) lstat age lstat:age
36.0885359 -1.3921168 -0.0007209 0.0041560

Using : in the formula returns only the interaction term.

lm_fit4 <- lm_spec %>%
  fit(medv ~ lstat:age, data = Boston)

lm_fit4

parsnip model object

Call:
stats::lm(formula = medv ~ lstat:age, data = data)

Coefficients:
(Intercept) lstat:age
30.158863 -0.007715

Recipes and workflows

We can implement feature engineering and pre-processing steps with recipe(), and bundle them with our
models using workflow().

rec_spec_interact <-
  recipe(medv ~ lstat + age, data = Boston) %>%
  step_interact(~ lstat:age) # Add interaction terms as a step in the recipe

lm_wf_interact <- workflow() %>% # Bundle our model and recipe
  add_model(lm_spec) %>%         # in a workflow
  add_recipe(rec_spec_interact)

lm_wf_interact %>% fit(Boston)

══ Workflow [trained] ══════════════════════════════════════════════════════════


Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_interact()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept) lstat age lstat_x_age
36.0885359 -1.3921168 -0.0007209 0.0041560

Non-linear transformations
It is common for the relationship between our predictors and our target to be nonlinear. We can extend the
linear model by transforming predictors within the tidymodels paradigm.

rec_spec_pow2 <- recipe(medv ~ lstat, data = Boston) %>%
  step_mutate(lstat2 = lstat ^ 2)

lm_wf_pow2 <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec_pow2)

lm_wf_pow2 %>% fit(Boston)

══ Workflow [trained] ══════════════════════════════════════════════════════════


Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_mutate()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept) lstat lstat2
42.86201 -2.33282 0.04355
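
To see what the quadratic term buys us, we can store the fitted workflow, augment the data, and plot the fitted curve against the observations, along the lines of the earlier lstat plot (lm_fit_pow2 is just a name chosen here):

lm_fit_pow2 <- lm_wf_pow2 %>% fit(Boston)

augment(lm_fit_pow2, new_data = Boston) %>%
  ggplot(aes(x = lstat)) +
  geom_point(aes(y = medv), col = "steelblue", alpha = .4) +
  geom_line(aes(y = .pred), col = "orange", linewidth = 1)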

Using a recipe to perform transformations ensures consistency across the modeling process. As we start
separating training and testing data sets, or using cross-validation (we will define this soon), the tidymodels
structure becomes quite valuable.

There are many predefined steps that you can add to your recipes; the tidymodels documentation is a
nice source of information. Let’s say you want to log-transform the lstat variable within the recipe.

rec_spec_log <- recipe(medv ~ lstat, data = Boston) %>%
  step_log(lstat)

lm_wf_log <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec_log)

lm_wf_log %>% fit(Boston)

══ Workflow [trained] ══════════════════════════════════════════════════════════


Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_log()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept) lstat
52.12 -12.48

Handling qualitative features

Consider the Carseats data set.

head(Carseats)

Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
1 9.50 138 73 11 276 120 Bad 42 17
2 11.22 111 48 16 260 83 Good 65 10
3 10.06 113 35 10 269 80 Medium 59 12
4 7.40 117 100 4 466 97 Medium 55 14
5 4.15 141 64 3 340 128 Bad 38 13
6 10.81 124 113 13 501 72 Bad 78 16
Urban US
1 Yes Yes
2 Yes Yes
3 Yes Yes
4 Yes Yes
5 Yes No
6 No Yes

Some of the predictors are qualitative, like ShelveLoc for example. Some models, including lm(), can handle
categorical variables automatically, but that is not always the case. We can convert qualitative features into
dummy variables within a recipe. While this is not necessary for the lm engine, it will make our pre-
processing applicable to other models.

rec_spec <- recipe(Sales ~ ., data = Carseats) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact(~ Income:Advertising + Price:Age)

lm_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec)

lm_wf %>% fit(Carseats)

══ Workflow [trained] ══════════════════════════════════════════════════════════


Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_interact()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept) CompPrice Income
6.5755654 0.0929371 0.0108940
Advertising Population Price
0.0702462 0.0001592 -0.1008064
Age Education ShelveLoc_Good
-0.0579466 -0.0208525 4.8486762
ShelveLoc_Medium Urban_Yes US_Yes
1.9532620 0.1401597 -0.1575571
Income_x_Advertising Price_x_Age
0.0007510 0.0001068
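
If we also want standard errors and p-values for these coefficients, we can pull the parsnip fit out of the trained workflow and tidy it, just as before (a brief sketch):

lm_wf %>%
  fit(Carseats) %>%
  extract_fit_parsnip() %>% # recover the parsnip fit from the workflow
  tidy()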

Writing functions in R 
To define a function in R, just follow this script:

square <- function(x) {
  x^2
}

square(2)

[1] 4
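
Functions can take several arguments, with defaults; a small illustrative extension (not part of the original lab):

power <- function(x, p = 2) { # p defaults to 2, so power(3) squares its argument
  x^p
}

power(3)    # 9
power(2, 3) # 8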
