AUTHOR
Jorge Zazueta
Most of this material follows the ISLR lab in Chapter 4 and the independent tidy versions of the labs by
Emil Hvitfeldt and Taylor Dunn.
Packages
library(tidyverse)
library(tidymodels)
library(ISLR)
library(patchwork)
theme_set(theme_classic())
Classification
Classification is concerned with qualitative outcomes, also known as categorical. Here are a few examples:
1. A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one
of three medical conditions. Which of the three conditions does the individual have?
2. An online banking service must be able to determine whether or not a transaction being performed on
the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
3. On the basis of DNA sequence data for a number of patients with and without a given disease, a
biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are
not.
The Default dataset is a simulated data set containing information on ten thousand customers. The aim here
is to predict which customers will default on their credit card debt.
tibble(Default)
# A tibble: 10,000 × 4
default student balance income
<fct> <fct> <dbl> <dbl>
1 No No 730. 44362.
2 No Yes 817. 12106.
3 No No 1074. 31767.
4 No No 529. 35704.
5 No No 786. 38463.
6 No Yes 920. 7492.
7 No No 826. 24905.
8 No Yes 809. 17600.
9 No No 1161. 37469.
10 No No 0 29275.
# … with 9,990 more rows
We can create a sample to work with to speed up exploration. Let us also make the data a tibble.
set.seed(239874)
default <- slice_sample(tibble(Default), n = 1000)
default
# A tibble: 1,000 × 4
default student balance income
<fct> <fct> <dbl> <dbl>
1 No Yes 774. 14842.
2 Yes Yes 1238. 14863.
3 No Yes 787. 22435.
4 No No 1294. 40768.
5 No No 562. 37637.
6 No No 1253. 33876.
7 No No 631. 30466.
8 No Yes 1220. 15040.
9 No Yes 1147. 19189.
10 No No 866. 44827.
# … with 990 more rows
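The chart referenced in the next sentence did not survive extraction. A plausible reconstruction, assuming it compared balance and income by default status with side-by-side boxplots (names `b1` and `b2` are mine):

```r
# Assumed reconstruction of the missing exploratory chart:
# distributions of balance and income, split by default status
b1 <- ggplot(default, aes(default, balance, fill = default)) +
  geom_boxplot(show.legend = FALSE)
b2 <- ggplot(default, aes(default, income, fill = default)) +
  geom_boxplot(show.legend = FALSE)
b1 | b2  # patchwork: place the two panels side by side
```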
From the chart above, it is clear that balance is a better predictor of default than income. How can we make a
prediction model?
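The workflow below uses a linear model specification `lr_model`, and the plotting code after it uses a base plot `p`; neither is defined above. Minimal sketches of how they were presumably created, assuming a standard `lm` spec and a scatter of default (coded 0/1) against balance:

```r
# Assumed definitions, not shown in the original
lr_model <- linear_reg() |>   # ordinary least squares
  set_engine("lm")

p <- default |>
  mutate(default = as.numeric(default) - 1) |>  # recode No/Yes as 0/1
  ggplot(aes(balance, default)) +
  geom_point(alpha = 0.3)
```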
lr_recipe <-
recipe(default ~ balance, data = default) |>
step_mutate(default = as.numeric(default)-1)
lr_workflow <-
workflow() |>
add_model(lr_model) |>
add_recipe(lr_recipe)
lr_fit <-
lr_workflow |>
fit(default)
p1 <- p +
geom_abline(slope = tidy(lr_fit)$estimate[2],
intercept = tidy(lr_fit)$estimate[1],
color = "orange",
linewidth = 2,
na.rm = TRUE) +
labs(title = "Simple linear regression model",
y = "default")
p1
We can see the problem now. For starters, the model gives us negative probabilities for small balances! And
following the trend we would get probabilities bigger than one for large enough balances. We need a model
that behaves better in these kinds of situations.
Logistic Regression
Logistic regression uses the formula

$$P(x) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}, \tag{1}$$

which is sometimes referred to as the sigmoid function. Its range is the interval (0, 1), making it convenient for probability assignment. To find the optimal $\beta_0$ and $\beta_1$ coefficients, we maximize the likelihood function

$$\ell(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} P(x_i) \prod_{i':\, y_{i'} = 0} \bigl(1 - P(x_{i'})\bigr). \tag{2}$$

Intuitively, we are looking for values of $\beta_0$ and $\beta_1$ such that the estimated probability of default is close to one for the customers who defaulted and close to zero for those who did not. Let's try this approach on our default data.
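The workflow below needs a logistic model specification `log_r_model` and a recipe `log_r_recipe`, neither of which appears above. Presumed definitions (a bare recipe is consistent with the "0 Recipe Steps" in the fit printout further down):

```r
# Assumed definitions for the logistic regression workflow
log_r_model <- logistic_reg() |>   # classification model
  set_engine("glm")                # fitted via stats::glm

log_r_recipe <- recipe(default ~ balance, data = default)
```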
log_r_workflow <-
workflow() |>
add_model(log_r_model) |>
add_recipe(log_r_recipe)
log_r_fit <-
log_r_workflow |>
fit(default)
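The plot object `p2` is not constructed anywhere above. A plausible sketch that overlays the fitted logistic curve on the data; the `size` argument here would account for the deprecation warning that follows:

```r
# Assumed construction of the logistic-curve plot
p2 <- default |>
  mutate(default = as.numeric(default) - 1) |>  # recode No/Yes as 0/1
  ggplot(aes(balance, default)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),  # logistic fit
              se = FALSE, size = 2, color = "orange") +
  labs(title = "Logistic regression model", y = "default")
```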
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
p2
This looks much better! We can compare both graphs side by side using the patchwork package.
p1 | p2
log_r_fit
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
Coefficients:
(Intercept) balance
-10.474727 0.005376
$$\mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2\,\frac{d}{N}.$$
The AIC criterion provides an estimate of the test error curve and it is generally used for model
selection, simply by choosing the model with the smallest AIC among the models considered.
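In practice, the AIC can be read straight off the underlying glm object. One way to do this with broom's `glance()` (which returns AIC, BIC, and deviance columns for a glm):

```r
# Pull the raw glm out of the workflow and summarize its fit statistics
log_r_fit |>
  extract_fit_engine() |>        # the stats::glm object inside the workflow
  glance() |>                    # one-row tibble of model-level statistics
  select(AIC, BIC, deviance)
```

Note that R reports the unscaled version, $\mathrm{AIC} = -2\,\mathrm{loglik} + 2d$, without the $1/N$ factor shown above; the two rank models identically.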
We can tidy the fit object to get the usual model report.
tidy(log_r_fit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -10.5 1.10 -9.55 1.35e-21
2 balance 0.00538 0.000675 7.97 1.59e-15
Interpretation of the logistic model is not as straightforward as with linear regression. A commonly used (by horse race bettors, at least) concept is odds, defined as

$$\frac{P(x)}{1 - P(x)} = e^{\beta_0 + \beta_1 X}. \tag{3}$$

This expression is obtained by manipulating the logistic function. Odds of 1 to 4 are the same as a probability of 0.20, since $\frac{0.20}{1 - 0.20} = \frac{1}{4}$. We can take the logarithm on both sides of Equation 3 to get the log odds, or logit, function:

$$\log\left(\frac{P(x)}{1 - P(x)}\right) = \beta_0 + \beta_1 X. \tag{4}$$
From this point of view, we can say that an increase in X by one unit increases the log odds by $\beta_1$ or, equivalently, multiplies the odds by $e^{\beta_1}$.
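The predictions below appear without the code that produced them. Given the fitted coefficients, they correspond to a customer with a balance of $1,000 ($-10.47 + 0.0054 \times 1000 \approx -5.10$, giving a default probability near 0.006), presumably obtained as:

```r
# Assumed calls behind the output below: probability and class predictions
new_customer <- tibble(balance = 1000)
predict(log_r_fit, new_data = new_customer, type = "prob")
predict(log_r_fit, new_data = new_customer, type = "class")
```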
# A tibble: 1 × 2
.pred_No .pred_Yes
<dbl> <dbl>
1 0.994 0.00607
# A tibble: 1 × 1
.pred_class
<fct>
1 No