
Lab 3.

Linear Regression
AUTHOR
Jorge Zazueta

Most of this material follows the ISLR lab in chapter three and the tidy version of the labs by Emil Hvitfeldt.

Libraries
library(MASS) # For Boston data set
library(tidymodels) # For modeling

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom 1.0.1 ✔ recipes 1.0.3
✔ dials 1.1.0 ✔ rsample 1.1.1
✔ dplyr 1.0.10 ✔ tibble 3.1.8
✔ ggplot2 3.4.0 ✔ tidyr 1.2.1
✔ infer 1.0.4 ✔ tune 1.0.1
✔ modeldata 1.0.1 ✔ workflows 1.1.2
✔ parsnip 1.0.3 ✔ workflowsets 1.0.0
✔ purrr 0.3.5 ✔ yardstick 1.1.0

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ dplyr::select() masks MASS::select()
✖ recipes::step() masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/

library(ISLR) # Other data sets

Brief exploration
Let’s start by exploring the Boston data set.

head(Boston)

crim zn indus chas nox rm age dis rad tax ptratio black lstat
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7

You can access the documentation by typing ?Boston in the console.

Let’s glimpse at the data:

glimpse(Boston)

Rows: 506
Columns: 14
$ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
$ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
$ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
$ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
$ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
$ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
$ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
$ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
$ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
$ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
$ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
$ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
$ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

We are interested in predicting the median home value ( medv ). There are 13 potential predictors, all coded as
numeric, although some appear to be categorical. Specifically, whether a home's tract bounds the Charles River
( chas ) is coded as a dummy variable, where 1 indicates the tract bounds the river and 0 otherwise.
Let’s take a look at the distribution of our target variable:

Boston %>%
  ggplot(aes(x = medv)) +
  geom_histogram(bins = 15, color = "white", fill = "steelblue")

It looks reasonably symmetric, except for a few high-priced properties.

summary(Boston$medv)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   5.00   17.02   21.20   22.53   25.00   50.00

Simple linear regression

From base R, we can run a quick regression of medv vs. the percentage of lower-status population ( lstat ):

lm_0 <- lm(medv ~ lstat, data = Boston)

summary(lm_0)

Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
Min 1Q Median 3Q Max
-15.168 -3.990 -1.318 2.034 24.500

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384 0.56263 61.41 <2e-16 ***
lstat -0.95005 0.03873 -24.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16

Not bad for a starting model with only one variable. The percentage of lower-status population appears to
have a significant, negative effect on home value.

It is time to shift gears to the tidymodels framework. We start by declaring our model.

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

This code doesn’t compute anything; it is just a specification, and it is already more complex than the lm()
call in base R! Be patient: you will see how this formality becomes incredibly helpful as we build more
complex models.
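
As a quick optional check (not part of the original lab), we can ask parsnip to show the call template it will eventually use; translate() does this without fitting anything:

lm_spec %>%
  translate() # prints the underlying stats::lm() call template; nothing is computed yet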

To fit the model we can use the following code:

lm_fit <- lm_spec %>%
  fit(medv ~ lstat, data = Boston)

lm_fit

parsnip model object

Call:
stats::lm(formula = medv ~ lstat, data = data)

Coefficients:
(Intercept) lstat
34.55 -0.95

We can get the more traditional linear regression fit output using pluck() and summary().

lm_fit |>
pluck("fit") |>
summary()

Call:
stats::lm(formula = medv ~ lstat, data = data)

Residuals:
Min 1Q Median 3Q Max
-15.168 -3.990 -1.318 2.034 24.500

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384 0.56263 61.41 <2e-16 ***
lstat -0.95005 0.03873 -24.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432
F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
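
An equivalent way to reach the underlying lm object is extract_fit_engine(); a minimal sketch that should reproduce the summary above:

lm_fit |>
  extract_fit_engine() |> # pull out the underlying stats::lm object
  summary()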

Another way to extract fit information is by tidying the fit object.

tidy(lm_fit)

# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 34.6 0.563 61.4 3.74e-236
2 lstat -0.950 0.0387 -24.5 5.08e- 88

glance() will give us a tibble with useful statistics that are easy to extract.

glance(lm_fit)

# A tibble: 1 × 12
r.squared adj.r.squa…¹ sigma stati…² p.value df logLik AIC BIC devia…³
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.544 0.543 6.22 602. 5.08e-88 1 -1641. 3289. 3302. 19472.
# … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated
#   variable names ¹adj.r.squared, ²statistic, ³deviance
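
Since glance() returns an ordinary tibble, individual statistics are easy to pull out with dplyr verbs; for instance (a small illustration):

glance(lm_fit) %>%
  select(r.squared, adj.r.squared, sigma) # or pull(r.squared) for a single number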

With a model in hand, we can make predictions.

predict(lm_fit, new_data = Boston)

# A tibble: 506 × 1
.pred
<dbl>
1 29.8
2 25.9
3 30.7
4 31.8
5 29.5
6 29.6
7 22.7
8 16.4
9 6.12
10 18.3
# … with 496 more rows

If we are interested in confidence intervals, we can get those too.

predict(lm_fit, new_data = Boston, type = "conf_int")

# A tibble: 506 × 2
.pred_lower .pred_upper
<dbl> <dbl>
1 29.0 30.6
2 25.3 26.5
3 29.9 31.6
4 30.8 32.7
5 28.7 30.3
6 28.8 30.4
7 22.2 23.3
8 15.6 17.1
9 4.70 7.54
10 17.7 18.9
# … with 496 more rows
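
Prediction intervals for individual observations follow the same pattern; a brief sketch (at the default 95% level), which should give wider bounds than the confidence intervals above:

predict(lm_fit, new_data = Boston, type = "pred_int")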

To combine observed and predicted values, you can use augment().

augment(lm_fit, new_data = Boston) %>%
  select(medv, .pred)

# A tibble: 506 × 2
medv .pred
<dbl> <dbl>
1 24 29.8
2 21.6 25.9
3 34.7 30.7
4 33.4 31.8
5 36.2 29.5
6 28.7 29.6
7 22.9 22.7
8 27.1 16.4
9 16.5 6.12
10 18.9 18.3
# … with 496 more rows

augment(lm_fit, new_data = Boston) %>%
  select(medv, .pred, lstat) %>%
  ggplot(aes(x = lstat, y = .pred)) +
  geom_point(aes(y = medv), col = "steelblue", alpha = .4) +
  geom_line(col = "orange", linewidth = 1)
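
The augmented data also makes it easy to compute performance metrics with yardstick; for example, the in-sample fit (a quick sketch, not a substitute for evaluating on held-out data):

augment(lm_fit, new_data = Boston) %>%
  metrics(truth = medv, estimate = .pred) # RMSE, R-squared, and MAE in one call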

Multiple linear regression

The API for multiple linear regression is the same. We just need to specify the predictors within the formula.

lm_fit2 <- lm_spec %>%
  fit(medv ~ lstat + age, data = Boston)

lm_fit2 %>% tidy()

# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 33.2 0.731 45.5 2.94e-180
2 lstat -1.03 0.0482 -21.4 8.42e- 73
3 age 0.0345 0.0122 2.83 4.91e- 3

Everything works just as with the simple linear regression case.
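
For instance, glance() and predict() apply to the two-predictor fit without any changes (a small check):

glance(lm_fit2)

predict(lm_fit2, new_data = Boston)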

Sometimes we want to use all the predictors in our model.

lm_fit3 <- lm_spec %>%
  fit(medv ~ ., data = Boston)

lm_fit3 %>% tidy()

# A tibble: 14 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 36.5 5.10 7.14 3.28e-12
2 crim -0.108 0.0329 -3.29 1.09e- 3
3 zn 0.0464 0.0137 3.38 7.78e- 4
4 indus 0.0206 0.0615 0.334 7.38e- 1
5 chas 2.69 0.862 3.12 1.93e- 3
6 nox -17.8 3.82 -4.65 4.25e- 6
7 rm 3.81 0.418 9.12 1.98e-18
8 age 0.000692 0.0132 0.0524 9.58e- 1
9 dis -1.48 0.199 -7.40 6.01e-13
10 rad 0.306 0.0663 4.61 5.07e- 6
11 tax -0.0123 0.00376 -3.28 1.11e- 3
12 ptratio -0.953 0.131 -7.28 1.31e-12
13 black 0.00931 0.00269 3.47 5.73e- 4
14 lstat -0.525 0.0507 -10.3 7.78e-23

Adding interaction terms

Using * in the formula will include x, y, and their interaction x:y.

lm_fit4 <- lm_spec %>%
  fit(medv ~ lstat * age, data = Boston)

lm_fit4

parsnip model object

Call:
stats::lm(formula = medv ~ lstat * age, data = data)

Coefficients:
(Intercept) lstat age lstat:age
36.0885359 -1.3921168 -0.0007209 0.0041560

Using : in the formula returns only the interaction term.

lm_fit4 <- lm_spec %>%
  fit(medv ~ lstat:age, data = Boston)

lm_fit4

parsnip model object

Call:
stats::lm(formula = medv ~ lstat:age, data = data)

Coefficients:
(Intercept) lstat:age
30.158863 -0.007715

Recipes and workflows

We can implement feature engineering and pre-processing steps with recipe(), and bundle them with our
models using workflow().

rec_spec_interact <-
  recipe(medv ~ lstat + age, data = Boston) %>%
  step_interact(~ lstat:age) # Add interaction terms as a step in the recipe

lm_wf_interact <- workflow() %>% # Bundle our model and recipe
  add_model(lm_spec) %>%         # in a workflow
  add_recipe(rec_spec_interact)

lm_wf_interact %>% fit(Boston)

══ Workflow [trained] ══════════════════════════════════════════════════════════


Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_interact()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept) lstat age lstat_x_age
36.0885359 -1.3921168 -0.0007209 0.0041560

Non-linear transformations
It is common for the relationship between our predictors and our target to be nonlinear. We can extend the
linear model by transforming predictors within the tidymodels paradigm.

rec_spec_pow2 <- recipe(medv ~ lstat, data = Boston) %>%
  step_mutate(lstat2 = lstat ^ 2)

lm_wf_pow2 <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec_pow2)

lm_wf_pow2 %>% fit(Boston)

══ Workflow [trained] ══════════════════════════════════════════════════════════


Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_mutate()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept) lstat lstat2
42.86201 -2.33282 0.04355
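
To see what the quadratic term buys us, we can store the fitted workflow, augment the data, and plot the fitted curve against the observations, along the lines of the earlier lstat plot (lm_fit_pow2 is just a name chosen here):

lm_fit_pow2 <- lm_wf_pow2 %>% fit(Boston)

augment(lm_fit_pow2, new_data = Boston) %>%
  ggplot(aes(x = lstat)) +
  geom_point(aes(y = medv), col = "steelblue", alpha = .4) +
  geom_line(aes(y = .pred), col = "orange", linewidth = 1)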

Using a recipe to perform transformations ensures consistency across the modeling process. As we start
separating training and testing data sets, or using cross-validation (we will define this soon), the tidymodels
structure becomes quite valuable.

There are many predefined steps that you can add to your recipes; the tidymodels documentation is a
nice source of information. Let’s say you want to log-transform the lstat variable within the recipe.

rec_spec_log <- recipe(medv ~ lstat, data = Boston) %>%
  step_log(lstat)

lm_wf_log <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec_log)

lm_wf_log %>% fit(Boston)

══ Workflow [trained] ══════════════════════════════════════════════════════════


Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step

• step_log()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept) lstat
52.12 -12.48

Handling qualitative features

Consider the Carseats data set.

head(Carseats)

Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
1 9.50 138 73 11 276 120 Bad 42 17
2 11.22 111 48 16 260 83 Good 65 10
3 10.06 113 35 10 269 80 Medium 59 12
4 7.40 117 100 4 466 97 Medium 55 14
5 4.15 141 64 3 340 128 Bad 38 13
6 10.81 124 113 13 501 72 Bad 78 16
Urban US
1 Yes Yes
2 Yes Yes
3 Yes Yes
4 Yes Yes
5 Yes No
6 No Yes

Some of the predictors are qualitative, like ShelveLoc for example. Some models, including lm(), can handle
categorical variables automatically, but that is not always the case. We can convert qualitative features into
dummy variables within a recipe. While this is not necessary for the lm engine, it will make our pre-
processing applicable to other models.

rec_spec <- recipe(Sales ~ ., data = Carseats) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact(~ Income:Advertising + Price:Age)

lm_wf <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(rec_spec)

lm_wf %>% fit(Carseats)

══ Workflow [trained] ══════════════════════════════════════════════════════════


Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_interact()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept) CompPrice Income
6.5755654 0.0929371 0.0108940
Advertising Population Price
0.0702462 0.0001592 -0.1008064
Age Education ShelveLoc_Good
-0.0579466 -0.0208525 4.8486762
ShelveLoc_Medium Urban_Yes US_Yes
1.9532620 0.1401597 -0.1575571
Income_x_Advertising Price_x_Age
0.0007510 0.0001068
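
If we also want standard errors and p-values for these coefficients, we can pull the parsnip fit out of the trained workflow and tidy it, just as before (a brief sketch):

lm_wf %>%
  fit(Carseats) %>%
  extract_fit_parsnip() %>% # recover the parsnip fit from the workflow
  tidy()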

Writing functions in R 
To define a function in R, just follow this script:

square <- function(x) {
  x^2
}

square(2)

[1] 4
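
Functions can take several arguments, with defaults; a small illustrative extension (not part of the original lab):

power <- function(x, p = 2) { # p defaults to 2, so power(3) squares its argument
  x^p
}

power(3)    # 9
power(2, 3) # 8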
