
Lab 4 Classification

AUTHOR
Jorge Zazueta

Most of this material follows the ISLR lab in chapter four and the independent tidy versions of the labs by
Emil Hvitfeldt and Taylor Dunn.

Packages
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   0.3.5
✔ tibble  3.1.8     ✔ dplyr   1.0.10
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.1     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.3
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter() masks stats::filter()
✖ recipes::fixed() masks stringr::fixed()
✖ dplyr::lag() masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step() masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/

library(ISLR)
library(patchwork)

Set a theme for consistency

theme_set(theme_classic())

Classification
Classification is concerned with qualitative outcomes, also known as categorical. Here are a few examples:

1. A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one
of three medical conditions. Which of the three conditions does the individual have?

2. An online banking service must be able to determine whether or not a transaction being performed on
the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.

3. On the basis of DNA sequence data for a number of patients with and without a given disease, a
biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are
not.

The Default dataset is a simulated data set containing information on ten thousand customers. The aim here
is to predict which customers will default on their credit card debt.

tibble(Default)

# A tibble: 10,000 × 4
   default student balance income
   <fct>   <fct>     <dbl>  <dbl>
 1 No      No         730. 44362.
 2 No      Yes        817. 12106.
 3 No      No        1074. 31767.
 4 No      No         529. 35704.
 5 No      No         786. 38463.
 6 No      Yes        920.  7492.
 7 No      No         826. 24905.
 8 No      Yes        809. 17600.
 9 No      No        1161. 37469.
10 No      No           0  29275.
# … with 9,990 more rows
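Defaults are rare in this data; a quick count with dplyr makes the class imbalance explicit before we start modeling:

# Count outcomes: only a small fraction of customers default
tibble(Default) |> count(default)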
To speed up exploration, we can work with a smaller sample. Let us also make the data a tibble.

set.seed(239874)
default <- slice_sample(tibble(Default), n = 1000)
default

# A tibble: 1,000 × 4
   default student balance income
   <fct>   <fct>     <dbl>  <dbl>
 1 No      Yes        774. 14842.
 2 Yes     Yes       1238. 14863.
 3 No      Yes        787. 22435.
 4 No      No        1294. 40768.
 5 No      No         562. 37637.
 6 No      No        1253. 33876.
 7 No      No         631. 30466.
 8 No      Yes       1220. 15040.
 9 No      Yes       1147. 19189.
10 No      No         866. 44827.
# … with 990 more rows

Now, we can take a look at the data.

p1 <- default |>
  ggplot(aes(x = balance, y = income,
             color = default,
             shape = default)) +
  geom_point(size = 2, alpha = .4, show.legend = FALSE)

p2 <- default |>
  ggplot(aes(x = balance, y = default)) +
  geom_boxplot(aes(fill = default), show.legend = FALSE) +
  coord_flip()

p3 <- default |>
  ggplot(aes(x = income, y = default)) +
  geom_boxplot(aes(fill = default), show.legend = FALSE) +
  coord_flip()

p1 | (p2 | p3) # Displaying multiple charts with the patchwork package

From the chart above, it is clear that balance is a better predictor of default than income. How can we make a
prediction model?

Why not use linear regression?


Let’s find out.

lr_model <- linear_reg()

lr_recipe <-
  recipe(default ~ balance, data = default) |>
  step_mutate(default = as.numeric(default) - 1)

lr_workflow <-
  workflow() |>
  add_model(lr_model) |>
  add_recipe(lr_recipe)

lr_fit <-
  lr_workflow |>
  fit(default)

p <- default |>
  ggplot(aes(x = balance, y = as.numeric(default) - 1)) +
  geom_point(color = "steelblue", alpha = .4, size = 3)

p1 <- p +
  geom_abline(slope = tidy(lr_fit)$estimate[2],
              intercept = tidy(lr_fit)$estimate[1],
              color = "orange",
              linewidth = 2,
              na.rm = TRUE) +
  labs(title = "Simple linear regression model",
       y = "default")

p1

We can see the problem now. For starters, the model gives us negative probabilities for small balances! And,
following the trend, we would get probabilities greater than one for large enough balances. We need a model
that behaves better in these kinds of situations.
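To make the issue concrete, we can look at the range of fitted values over the sample, a quick check using the lr_fit object from above; the smallest fitted value drops below zero:

# The linear model's fitted "probabilities" are not confined to [0, 1]
predict(lr_fit, new_data = default) |>
  summarize(lowest = min(.pred), highest = max(.pred))  # lowest is negative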

Logistic Regression
Logistic regression uses the formula

$$P(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}, \tag{1}$$

which is sometimes referred to as the sigmoid function. Its range is the interval (0, 1), making it convenient
for probability assignment. To find the optimal $\beta_0$ and $\beta_1$ coefficients, we maximize the likelihood function:

$$\ell(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} p(x_i) \prod_{i':\, y_{i'} = 0} \left(1 - p(x_{i'})\right). \tag{2}$$

Intuitively, we are looking for values of $\beta_0$ and $\beta_1$ such that the estimated probability of default is close to
one for the customers who defaulted and close to zero for those who did not. Let's try this approach on our default data.

log_r_model <- logistic_reg()

log_r_recipe <- recipe(default ~ balance, data = default)

log_r_workflow <-
workflow() |>
add_model(log_r_model) |>
add_recipe(log_r_recipe)

log_r_fit <-
log_r_workflow |>
fit(default)

p2 <- augment(log_r_fit, default) |>
  ggplot() +
  geom_point(aes(x = balance,
                 y = as.numeric(default) - 1),
             color = "steelblue",
             alpha = .4,
             size = 3) +
  geom_line(aes(x = balance, y = .pred_Yes),
            color = "orange",
            linewidth = 2) +
  labs(title = "Logistic regression model",
       y = NULL)

p2

This looks much better! We can compare both graphs side by side using the patchwork package.

p1 | p2

To retrieve model information, we can print the fitted object.

log_r_fit

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────

Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
(Intercept) balance
-10.474727 0.005376

Degrees of Freedom: 999 Total (i.e. Null);  998 Residual
Null Deviance:      276.4
Residual Deviance: 154    AIC: 158

A quick detour on logistic regression metrics


The deviance is negative two times the maximized log-likelihood. The smaller the deviance, the
better the fit. (Its role is similar to that of the RSS, but it applies to a wider range of models.)

For the logistic regression model, using the binomial log-likelihood, the Akaike information
criterion, or AIC, is given by

$$AIC = -\frac{2}{N}\,\text{loglik} + 2\,\frac{d}{N}.$$

The AIC provides an estimate of the test error curve and is generally used for model
selection, simply by choosing the model with the smallest AIC among the models considered.
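If we want these numbers programmatically rather than reading them off the printout, one option, a sketch assuming broom's glance() method for glm objects, is to extract the underlying engine fit:

# Pull the underlying stats::glm object and summarize its fit statistics
log_r_fit |>
  extract_fit_engine() |>  # the glm object inside the trained workflow
  glance()                 # null.deviance, deviance, AIC, logLik, ...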

We can tidy the fit object to get the usual model report.

tidy(log_r_fit)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept) -10.5     1.10         -9.55 1.35e-21
2 balance       0.00538 0.000675      7.97 1.59e-15

Interpretation of the logistic model is not as straightforward as with linear regression. A commonly used (by
horse race bettors at least) concept is odds, defined as

$$\frac{P(X)}{1 - P(X)} = e^{\beta_0 + \beta_1 X}. \tag{3}$$

This expression is obtained by manipulating the logistic function. Odds of 1 to 4 are the same as a probability of
0.20, since $\frac{0.20}{1 - 0.20} = \frac{1}{4}$. We can take the logarithm on both sides of Equation 3 to get the log odds, or logit, function:

$$\log\left(\frac{P(X)}{1 - P(X)}\right) = \beta_0 + \beta_1 X. \tag{4}$$

From this point of view, we can say that an increase in $X$ by one unit increases the log odds by $\beta_1$ or,
equivalently, multiplies the odds by $e^{\beta_1}$.
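For instance, we can compute the odds multipliers implied by the fitted slope on balance, a quick check based on Equation 3 and the tidy() output above:

# Odds multipliers implied by the fitted slope on balance
beta_1 <- tidy(log_r_fit)$estimate[2]
exp(beta_1)        # odds multiplier per extra dollar of balance (just above 1)
exp(100 * beta_1)  # odds multiplier per extra $100 of balance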

We can predict the probability of default.

predict(log_r_fit, new_data = tibble(balance = 1000), type = "prob")

# A tibble: 1 × 2
  .pred_No .pred_Yes
     <dbl>     <dbl>
1    0.994   0.00607
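We can verify this number by plugging the fitted coefficients into Equation 1 directly; base R's plogis() computes the sigmoid, so this serves as a quick manual check:

# Manual check of .pred_Yes at balance = 1000 via the sigmoid in Equation 1
coefs <- tidy(log_r_fit)$estimate
plogis(coefs[1] + coefs[2] * 1000)  # matches .pred_Yes above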

Or make a categorical prediction.

predict(log_r_fit, new_data = tibble(balance = 1000), type = "class")

# A tibble: 1 × 1
  .pred_class
  <fct>
1 No
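Under the hood, the class prediction is simply .pred_Yes thresholded at 0.5. To see how these hard predictions fare across the whole sample, yardstick's conf_mat() tabulates them against the truth:

# In-sample confusion matrix (optimistic, since no holdout set was used)
augment(log_r_fit, default) |>
  conf_mat(truth = default, estimate = .pred_class)

Because we never held out a test set, these counts overstate how well the model would do on new customers.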
