Logistic Regression
Logistic regression (aka logit
regression or logit model) was
developed by statistician David
Cox in 1958 and is a regression
model where the response
variable Y is categorical. Logistic regression allows us to
estimate the probability of a categorical response based
on one or more predictor variables (X). It allows one to say
that the presence of a predictor increases (or decreases)
the probability of a given outcome by a specific
percentage. This tutorial covers the case when Y is binary
— that is, where it can take only two values, “0” and “1”,
which represent outcomes such as pass/fail, win/lose,
alive/dead or healthy/sick. Cases where the dependent
variable has more than two outcome categories may be
analysed with multinomial logistic regression, or, if the
multiple categories are ordered, in ordinal logistic
regression. However, discriminant analysis has become a popular method for multi-class classification, so our next tutorial will focus on that technique for those instances.
tl;dr
This tutorial serves as an introduction to logistic regression and covers:

- Replication requirements
- Assessing coefficients and making predictions
- Goodness-of-fit and residual assessment
- Additional resources
https://afit-r.github.io/logistic_regression 1/19
6/20/22, 4:23 PM Logistic Regression · AFIT Data Science Lab R Programming Guide
Replication Requirements
This tutorial primarily leverages the Default data
provided by the ISLR package. This is a simulated data
set containing information on ten thousand customers
such as whether the customer defaulted, is a student, the
average balance carried by the customer and the income
of the customer. We’ll also use a few packages that provide
data manipulation, visualization, pipeline modeling
functions, and model output tidying functions.
# Packages
library(tidyverse)  # data manipulation and visualization
library(modelr)     # pipeline modeling functions
library(broom)      # tidying model outputs

# Load data
(default <- as_tibble(ISLR::Default))

## # A tibble: 10,000 × 4
##    default student   balance    income
##    <fctr>  <fctr>      <dbl>     <dbl>
## 1  No      No       729.5265 44361.625
## 3  No      No      1073.5492 31767.139
## 4  No      No       529.2506 35704.494
## 5  No      No       785.6559 38463.496
## 7  No      No       825.5133 24905.227
## 9  No      No      1161.0579 37468.529
## 10 No      No         0.0000 29275.268
Why not linear regression? If the response has more than two unordered categories — say, the medical condition of an emergency-room patient — any numeric coding is arbitrary. We could code

$$
Y = \begin{cases}
1, & \text{if stroke;} \\
2, & \text{if drug overdose;} \\
3, & \text{if epileptic seizure,}
\end{cases}
$$

but an equally reasonable coding is

$$
Y = \begin{cases}
1, & \text{if epileptic seizure;} \\
2, & \text{if stroke;} \\
3, & \text{if drug overdose,}
\end{cases}
$$

and the two codings imply entirely different orderings and spacings of the outcomes, so a linear model would give different results for each.
$$
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \tag{1}
$$
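Rearranging Eq. 1 shows that the log-odds (logit) is linear in $X$, which is why logistic regression coefficients are interpreted on a log-odds scale:

$$
\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X
$$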
To assess model performance later, we hold out 40% of the observations as a test set:

set.seed(123)
sample <- sample(c(TRUE, FALSE), nrow(default), replace = TRUE, prob = c(0.6, 0.4))
train  <- default[sample, ]
test   <- default[!sample, ]
We seek estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ such that plugging these estimates into the model
for p(X), given in Eq. 1, yields a number close to one for all
individuals who defaulted, and a number close to zero for
all individuals who did not. This intuition can be
formalized using a mathematical equation called a
likelihood function:
$$
\ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr) \tag{2}
$$
default %>%
  mutate(prob = ifelse(default == "Yes", 1, 0)) %>%  # recode Yes/No to 1/0
  ggplot(aes(balance, prob)) +
  geom_point(alpha = .15) +
  xlab("Balance") +
  ylab("Probability of Default")
model1 <- glm(default ~ balance, family = "binomial", data = train)
summary(model1)
Assessing Coefficients
The table below shows the coefficient estimates and related information that result from fitting a logistic regression model in order to predict the probability of default = Yes using balance. Bear in mind that the coefficient estimates from logistic regression characterize the relationship between the predictor and response variable on a log-odds scale (see Ch. 3 of ISLR for more details). Thus, we see that $\hat{\beta}_1 = 0.0057$; this indicates that a one-unit increase in balance is associated with an increase in the log odds of default by 0.0057 units.
tidy(model1)
As noted above, the coefficient estimates are on the log-odds scale; exponentiating them converts them to an odds scale:

exp(coef(model1))

## (Intercept)     balance 
## 1.659718e-05 1.005685e+00

We can also compute confidence intervals for the coefficient estimates:

confint(model1)

## 2.5 % 97.5 %
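As an illustration of the odds-scale interpretation, the exponentiated balance coefficient (about 1.0057) means each additional dollar of balance multiplies the odds of default by roughly 1.0057, so a $100 increase in balance multiplies the odds by

$$
e^{100 \times 0.0057} = e^{0.57} \approx 1.77,
$$

i.e. roughly a 77% increase in the odds (not the probability) of default.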
Making Predictions
Once the coefficients have been estimated, it is a simple
matter to compute the probability of default for any given
credit card balance. Mathematically, using the coefficient
estimates from our model we predict that the default
probability for an individual with a balance of $1,000 is
less than 0.5%
$$
\hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1 X}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1 X}} = \frac{e^{-11.0063 + 0.0057 \times 1000}}{1 + e^{-11.0063 + 0.0057 \times 1000}} = 0.004785 \tag{3}
$$
The same probabilities can be computed directly with predict(); here for balances of $1,000 and $2,000:

predict(model1, data.frame(balance = c(1000, 2000)), type = "response")

##           1           2 
## 0.004785057 0.582089269
Qualitative predictors can be used as well. Here we fit a model that uses student to predict default:

model2 <- glm(default ~ student, family = "binomial", data = train)
tidy(model2)
$$
\hat{p}(\text{default} = \text{Yes} \mid \text{student} = \text{No}) = \frac{e^{-3.55 + 0.44 \times 0}}{1 + e^{-3.55 + 0.44 \times 0}} = 0.0278
$$
predict(model2, data.frame(student = factor(c("Yes", "No"))), type = "response")

##          1          2 
## 0.04261206 0.02783019
$$
p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}} \tag{4}
$$
We can also assess relative variable importance; for a GLM, caret::varImp() reports the absolute value of each coefficient's z-statistic:

model3 <- glm(default ~ balance + income + student, family = "binomial", data = train)
caret::varImp(model3)

##                Overall
## balance     19.0403764
## income       0.4647343
## studentYes   2.5835947
## 1 2
## 0.05437124 0.11440288
Goodness-of-Fit
In the linear regression tutorial we saw how the F-statistic, $R^2$, adjusted $R^2$, and residual diagnostics inform us of how well the model fits the data. Here, we'll look at a few ways to assess the goodness-of-fit for our logit models.
A likelihood ratio test compares nested models by the change in deviance. For example, model1 leaves a residual deviance of 908.69 on 6045 degrees of freedom:

anova(model1, model3, test = "Chisq")

##   Resid. Df Resid. Dev
## 1      6045     908.69
Pseudo R²

A common goodness-of-fit measure for logit models is McFadden's pseudo $R^2$:

$$
R^2_{\text{McFadden}} = 1 - \frac{\ln(L_{M_1})}{\ln(L_{M_0})}
$$

where $L_{M_1}$ is the likelihood of the fitted model and $L_{M_0}$ is the likelihood of the null (intercept-only) model.
list(model1 = pscl::pR2(model1)["McFadden"],
model2 = pscl::pR2(model2)["McFadden"],
model3 = pscl::pR2(model3)["McFadden"])
## $model1
## McFadden
## 0.4726215
##
## $model2
## McFadden
## 0.004898314
##
## $model3
## McFadden
## 0.4805543
Residual Assessment
Keep in mind that logistic regression does not assume the
residuals are normally distributed nor that the variance is
constant. However, the deviance residual is useful for
determining if individual points are not well fit by the
model. Here we can plot the standardized deviance residuals to see how many exceed 3 standard deviations. First we extract several useful bits of model results with augment and then proceed to plot.

model1_data <- augment(model1) %>%
  mutate(index = 1:n())

ggplot(model1_data, aes(index, .std.resid, color = default)) +
  geom_point(alpha = .5) +
  geom_ref_line(h = 3)
model1_data %>%
filter(abs(.std.resid) > 3)
model1_data %>%
top_n(5, .cooksd)
Using the test set, we can compare each model's predicted classifications (with a 0.5 probability cutoff) against the observed outcomes:

test.predicted.m1 <- predict(model1, newdata = test, type = "response")
test.predicted.m2 <- predict(model2, newdata = test, type = "response")
test.predicted.m3 <- predict(model3, newdata = test, type = "response")

list(
  model1 = table(test$default, test.predicted.m1 > 0.5) %>% prop.table() %>% round(3),
  model2 = table(test$default, test.predicted.m2 > 0.5) %>% prop.table() %>% round(3),
  model3 = table(test$default, test.predicted.m3 > 0.5) %>% prop.table() %>% round(3)
)
## $model1
##
## FALSE TRUE
## No 0.962 0.003
##
## $model2
##
## FALSE
## No 0.965
## Yes 0.035
##
## $model3
##
## FALSE TRUE
## No 0.963 0.003
The raw counts behind model 1's classification table show where the errors occur:

table(test$default, test.predicted.m1 > 0.5)

##       FALSE TRUE
##   No   3803   12
##   Yes    98   40
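From this confusion matrix we can read off the test error rate and class-specific rates as a worked check of the counts above:

$$
\text{error rate} = \frac{12 + 98}{3803 + 12 + 98 + 40} = \frac{110}{3953} \approx 0.028
$$

However, the model identifies only $40 / (98 + 40) \approx 29\%$ of the customers who actually defaulted — a reminder that high overall accuracy can mask poor sensitivity on a rare class.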
The receiver operating characteristic (ROC) curve is a graphic that shows the trade-off between the rate at which you can correctly predict something and the rate of incorrectly predicting something. Ultimately, we're concerned about the area under the ROC curve, or AUC. That metric ranges from 0.50 to 1.00, and values above 0.80 indicate that the model does a good job of discriminating between the two categories that comprise our target variable. We can compare the ROC and AUC for models 1 and 2, which show a strong difference in performance. We definitely want our ROC plots to look more like model 1's (left) rather than model 2's (right)!
library(ROCR)

par(mfrow = c(1, 2))

# ROC curves for models 1 and 2
prediction(test.predicted.m1, test$default) %>%
  performance(measure = "tpr", x.measure = "fpr") %>%
  plot()

prediction(test.predicted.m2, test$default) %>%
  performance(measure = "tpr", x.measure = "fpr") %>%
  plot()

# model 1 AUC
prediction(test.predicted.m1, test$default) %>%
  performance(measure = "auc") %>%
  .@y.values

## [[1]]
## [1] 0.939932
# model 2 AUC
prediction(test.predicted.m2, test$default) %>%
  performance(measure = "auc") %>%
  .@y.values

## [[1]]
## [1] 0.5386955
Additional Resources
This will get you up and running with logistic regression. Keep in mind that there is a lot more you can dig into, so the following resources will help you learn more: