Linear Regression
AUTHOR
Jorge Zazueta
Most of this material follows the ISLR lab in Chapter 3 and the tidy version of the labs by Emil Hvitfeldt.
Libraries
library(MASS) # For Boston data set
library(tidymodels) # For modeling
Brief exploration
Let’s start by exploring the Boston data set.
head(Boston)
crim zn indus chas nox rm age dis rad tax ptratio black lstat
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7
glimpse(Boston)
Rows: 506
Columns: 14
$ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
$ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
$ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
$ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
$ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
$ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
$ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
$ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
$ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
$ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
$ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
$ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
$ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
We are interested in predicting median home value ( medv ). There are 13 potential predictors, all coded as numeric, although some appear to be categorical. Specifically, chas indicates whether a tract bounds the Charles River; it is coded as a dummy variable, with 1 meaning the tract bounds the river and 0 otherwise.
Let’s take a look at the distribution of our target variable:
Boston %>%
ggplot(aes(x = medv)) +
geom_histogram(bins = 15, color = "white", fill = "steelblue")
summary(Boston$medv)
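The code that produced the regression summary below is missing from this extract; it would correspond to a base-R fit along these lines (the object name lm_base is an assumption):

```r
# Simple linear regression of medv on lstat using base R
lm_base <- lm(medv ~ lstat, data = Boston)
summary(lm_base)
```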
Call:
lm(formula = medv ~ lstat, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.168 -3.990 -1.318 2.034 24.500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384 0.56263 61.41 <2e-16 ***
lstat -0.95005 0.03873 -24.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Not bad for a starting model with only one variable. It seems that the percentage of lower-status population ( lstat ) has a significant negative effect on home value.
It is time to shift gears to the tidymodels framework. We start by declaring our model.
This code doesn't compute anything; it is just a specification, and it is already more complex than the lm() command in base R! Be patient, and you will see how this formality becomes incredibly helpful as we build more complex models.
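The specification-and-fit code is missing from this extract; a sketch consistent with the printed output below (the object names lm_spec and lm_fit follow the tidymodels lab conventions):

```r
# Declare a linear regression specification using the lm engine
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

# Fit the specification to the Boston data
lm_fit <- lm_spec %>%
  fit(medv ~ lstat, data = Boston)

lm_fit
```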
Call:
stats::lm(formula = medv ~ lstat, data = data)
Coefficients:
(Intercept) lstat
34.55 -0.95
We can get the more traditional linear regression fit output using pluck() and summary():
lm_fit |>
pluck("fit") |>
summary()
Call:
stats::lm(formula = medv ~ lstat, data = data)
Residuals:
Min 1Q Median 3Q Max
-15.168 -3.990 -1.318 2.034 24.500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.55384 0.56263 61.41 <2e-16 ***
lstat -0.95005 0.03873 -24.53 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
tidy() gives us the coefficient table as a tidy tibble.
tidy(lm_fit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 34.6 0.563 61.4 3.74e-236
2 lstat -0.950 0.0387 -24.5 5.08e- 88
glance() will give us a tibble with useful statistics that are easy to extract.
glance(lm_fit)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC deviance
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>
1     0.544         0.543  6.22      602. 5.08e-88     1 -1641. 3289. 3302.   19472.
# … with 2 more variables: df.residual <int>, nobs <int>
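The point predictions shown below come from predict() on the fitted model; the call itself is missing from this extract:

```r
# Point predictions for every row of the training data
predict(lm_fit, new_data = Boston)
```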
# A tibble: 506 × 1
.pred
<dbl>
1 29.8
2 25.9
3 30.7
4 31.8
5 29.5
6 29.6
7 22.7
8 16.4
9 6.12
10 18.3
# … with 496 more rows
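The prediction intervals that follow are obtained by passing type = "pred_int" to predict(); the missing call is presumably:

```r
# Lower and upper prediction-interval bounds for each observation
predict(lm_fit, new_data = Boston, type = "pred_int")
```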
# A tibble: 506 × 2
.pred_lower .pred_upper
<dbl> <dbl>
1 29.0 30.6
2 25.3 26.5
3 29.9 31.6
4 30.8 32.7
5 28.7 30.3
6 28.8 30.4
7 22.2 23.3
8 15.6 17.1
9 4.70 7.54
10 17.7 18.9
# … with 496 more rows
To combine observed and predicted values you can use augment() .
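A sketch of the missing call (selecting medv and .pred is an assumption based on the printed columns):

```r
# Attach predictions to the observed data, keeping observed and predicted values
augment(lm_fit, new_data = Boston) %>%
  select(medv, .pred)
```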
# A tibble: 506 × 2
medv .pred
<dbl> <dbl>
1 24 29.8
2 21.6 25.9
3 34.7 30.7
4 33.4 31.8
5 36.2 29.5
6 28.7 29.6
7 22.9 22.7
8 27.1 16.4
9 16.5 6.12
10 18.9 18.3
# … with 496 more rows
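The coefficient table that follows corresponds to a multiple regression on lstat and age; the missing fit would look like this (the object names lm_fit2 and lm_spec are assumptions):

```r
# Multiple regression with two predictors
lm_fit2 <- lm_spec %>%
  fit(medv ~ lstat + age, data = Boston)

tidy(lm_fit2)
```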
# A tibble: 3 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 33.2 0.731 45.5 2.94e-180
2 lstat -1.03 0.0482 -21.4 8.42e- 73
3 age 0.0345 0.0122 2.83 4.91e- 3
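The 14-row table that follows uses every predictor via the . shorthand in the formula; the missing fit would be (object names assumed):

```r
# Regression on all 13 predictors at once
lm_fit3 <- lm_spec %>%
  fit(medv ~ ., data = Boston)

tidy(lm_fit3)
```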
# A tibble: 14 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 36.5 5.10 7.14 3.28e-12
2 crim -0.108 0.0329 -3.29 1.09e- 3
3 zn 0.0464 0.0137 3.38 7.78e- 4
4 indus 0.0206 0.0615 0.334 7.38e- 1
5 chas 2.69 0.862 3.12 1.93e- 3
6 nox -17.8 3.82 -4.65 4.25e- 6
7 rm 3.81 0.418 9.12 1.98e-18
8 age 0.000692 0.0132 0.0524 9.58e- 1
9 dis -1.48 0.199 -7.40 6.01e-13
10 rad 0.306 0.0663 4.61 5.07e- 6
11 tax -0.0123 0.00376 -3.28 1.11e- 3
12 ptratio -0.953 0.131 -7.28 1.31e-12
13 black 0.00931 0.00269 3.47 5.73e- 4
14 lstat -0.525 0.0507 -10.3 7.78e-23
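The print of lm_fit4 below reflects an interaction model fit with the * operator, which includes both main effects and their interaction; the missing code would be (the lm_spec name is an assumption):

```r
# lstat * age expands to lstat + age + lstat:age
lm_fit4 <- lm_spec %>%
  fit(medv ~ lstat * age, data = Boston)
```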
lm_fit4
Call:
stats::lm(formula = medv ~ lstat * age, data = data)
Coefficients:
(Intercept) lstat age lstat:age
36.0885359 -1.3921168 -0.0007209 0.0041560
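The second print suggests the model was re-fit using : alone, which includes only the interaction term without the main effects; a sketch (reassigning the same object name, as the output implies):

```r
# lstat:age includes only the interaction term
lm_fit4 <- lm_spec %>%
  fit(medv ~ lstat:age, data = Boston)
```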
lm_fit4
Call:
stats::lm(formula = medv ~ lstat:age, data = data)
Coefficients:
(Intercept) lstat:age
30.158863 -0.007715
rec_spec_interact <-
recipe(medv ~ lstat + age, data = Boston) %>%
step_interact(~ lstat:age) # Add interaction terms as a step in the recipe
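The recipe is then bundled with the model specification in a workflow and fit, producing the output below; the missing code would be roughly (object and spec names assumed):

```r
# Combine preprocessing recipe and model spec in one workflow
lm_wf_interact <- workflow() %>%
  add_recipe(rec_spec_interact) %>%
  add_model(lm_spec)

lm_wf_interact %>%
  fit(Boston)
```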
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_interact()
── Model ───────────────────────────────────────────────────────────────────────
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) lstat age lstat_x_age
36.0885359 -1.3921168 -0.0007209 0.0041560
Non-linear transformations
It is common for the relationship between our predictors and our target to be nonlinear. We can extend the linear model by transforming predictors within the tidymodels paradigm.
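The quadratic fit printed below can be produced with step_mutate(); the missing code would look like this (object names assumed):

```r
# Add a squared term for lstat inside the recipe
rec_spec_pow2 <- recipe(medv ~ lstat, data = Boston) %>%
  step_mutate(lstat2 = lstat^2)

workflow() %>%
  add_recipe(rec_spec_pow2) %>%
  add_model(lm_spec) %>%
  fit(Boston)
```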
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_mutate()
── Model ───────────────────────────────────────────────────────────────────────
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) lstat lstat2
42.86201 -2.33282 0.04355
Using a recipe to perform transformations ensures consistency across the modeling process. As we start separating training and testing data sets or using cross-validation (we will define this soon), the tidymodels structure becomes quite valuable.
There are many predefined steps that you can add to your recipes; the tidymodels documentation is a nice source of information. Let's say you want to log-transform the lstat variable within the recipe.
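The log-transform fit printed below would come from step_log(); a sketch (object names assumed):

```r
# Log-transform lstat inside the recipe (natural log by default)
rec_spec_log <- recipe(medv ~ lstat, data = Boston) %>%
  step_log(lstat)

workflow() %>%
  add_recipe(rec_spec_log) %>%
  add_model(lm_spec) %>%
  fit(Boston)
```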
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_log()
── Model ───────────────────────────────────────────────────────────────────────
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) lstat
52.12 -12.48
Some of the predictors are qualitative, like ShelveLoc in the Carseats data set, for example. Some models, including lm(), can handle categorical variables automatically, but that is not always the case. We can convert qualitative features into dummy variables within a recipe. While this is not necessary for the lm engine, it makes our pre-processing applicable to other models.
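The fit printed below uses the Carseats data from the ISLR package (an additional dependency not loaded above); the missing code would be along these lines (object names assumed, with the two interactions inferred from the printed coefficients):

```r
library(ISLR) # For the Carseats data set

# Dummy-encode all nominal predictors, then add two interaction terms
rec_spec_dummy <- recipe(Sales ~ ., data = Carseats) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact(~ Income:Advertising + Price:Age)

workflow() %>%
  add_recipe(rec_spec_dummy) %>%
  add_model(lm_spec) %>%
  fit(Carseats)
```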
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_dummy()
• step_interact()
── Model ───────────────────────────────────────────────────────────────────────
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) CompPrice Income
6.5755654 0.0929371 0.0108940
Advertising Population Price
0.0702462 0.0001592 -0.1008064
Age Education ShelveLoc_Good
-0.0579466 -0.0208525 4.8486762
ShelveLoc_Medium Urban_Yes US_Yes
1.9532620 0.1401597 -0.1575571
Income_x_Advertising Price_x_Age
0.0007510 0.0001068
Writing functions in R
To define a function in R, just follow this script:
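The function definition itself is missing from this extract; a minimal version consistent with the output below:

```r
# Define a simple function that squares its input
square <- function(x) {
  x^2
}
```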
square(2)
[1] 4