
Logistic Regression

Every machine learning algorithm performs best under a given set of conditions. To ensure good performance, we must know which algorithm to use for the problem at hand; no single algorithm suits every problem. For example, linear regression cannot be applied to a categorical dependent variable. This is where logistic regression comes in.
You may be wondering: what is logistic regression? How does it differ from other algorithms? And why is it called “regression” if it doesn’t model continuous outcomes?
Logistic Regression
Logistic Regression is a popular statistical model used for binary
classification, that is, for predictions of the type this or that, yes or
no, A or B, and so on.
Logistic regression can, however, also be used for multi-class classification,
but here we will focus on its simplest application. It is one of the most
frequently used machine learning algorithms for binary classification,
translating the input to 0 or 1. For example,
 0: negative class
 1: positive class
Some examples of classification are mentioned below:
 Email: spam / not spam
 Online transactions: fraudulent / not fraudulent
 Tumor: malignant / not malignant
What is Logistic Regression?
Logistic Regression is the appropriate regression analysis to conduct
when the dependent variable is binary. It is used to predict the outcome
of a categorical dependent variable: the model outputs a probability
between 0 and 1, which is then converted into a discrete class label.

Implementation steps of Logistic Regression.


Logistic Regression Curve

Logistic Function
The function g(z) = 1 / (1 + e^(−z)) is the logistic function, also known as
the sigmoid function.
The logistic function has horizontal asymptotes at 0 and 1, and it crosses
the y-axis at 0.5.
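These properties can be checked numerically with a short sketch in base R (the helper name sigmoid is ours, not part of any package):
R
# Sigmoid (logistic) function: maps any real z into the interval (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(0)           # 0.5: the curve crosses the y-axis at 0.5
sigmoid(c(-10, 10))  # close to 0 and 1: the horizontal asymptotes
curve(sigmoid, from = -6, to = 6, xlab = "z", ylab = "g(z)")  # the S-shaped curve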

When to use Logistic Regression?


Logistic Regression is used when the input needs to be separated into
“two regions” by a linear decision boundary; the data points are separated
by a straight line, as shown:
Based on the number of categories, Logistic regression can be
classified as:
1. binomial: target variable can have only 2 possible types: “0” or “1”
which may represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”,
etc.
2. multinomial: target variable can have 3 or more possible types which
are not ordered (i.e., the types have no quantitative significance), like
“disease A” vs “disease B” vs “disease C”.
3. ordinal: it deals with target variables with ordered categories. For
example, a test score can be categorized as:“very poor”, “poor”,
“good”, “very good”. Here, each category can be given a score like 0,
1, 2, 3.
We can check the accuracy or goodness of fit of the model using various
techniques such as accuracy, precision, F1 score, the ROC curve, the
confusion matrix, etc.
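As a small illustration of these metrics in R (the labels and predictions below are made up):
R
# Hypothetical true labels and model predictions (for illustration only)
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)

conf_mat  <- table(actual, predicted)                   # confusion matrix
accuracy  <- mean(actual == predicted)                  # fraction classified correctly
precision <- conf_mat["1", "1"] / sum(conf_mat[, "1"])  # TP / (TP + FP)
recall    <- conf_mat["1", "1"] / sum(conf_mat["1", ])  # TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1)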
Use cases:
(i) Weather prediction: with logistic regression we predict whether it is
cloudy or not, or raining or not, whereas with linear regression we would
predict the temperature itself.
(ii) Determining illness: with logistic regression we predict whether a
person is ill or not.

Linear vs Logistic Regression


Graphical Representation between Linear and Logistic Regression.
Here you can clearly see that linear regression forms a straight line whose
range can exceed 1 (and fall below 0), while logistic regression forms an
S-shaped (sigmoid) curve, because its output is constrained to lie between
0 and 1.
Mathematical Implementation
The outputs on the logistic regression curve lie between 0 and 1.
→Threshold value
Here we introduce the threshold. Let us understand the threshold through an
example. In the diagram above, the threshold value is taken as 0.5, and
accordingly two conditions are given:
1) If the value is > 0.5, it is rounded up to 1.
2) If the value is < 0.5, it is rounded down to 0.
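A one-line illustration of this rule in R (the probabilities below are made up):
R
# Hypothetical predicted probabilities
p <- c(0.12, 0.47, 0.63, 0.91)

# Apply the 0.5 threshold
ifelse(p > 0.5, 1, 0)   # 0 0 1 1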
To obtain this curve, we need to derive the underlying equation.
→Logistic Regression Equation
The logistic regression equation is derived from the straight-line equation.
Equation of a straight line with more than one independent variable:
Y = c + B1*X1 + B2*X2 + … + Bn*Xn
where c = constant (intercept),
B1, B2, … = slopes,
X1, X2, … = independent variables,
Y = dependent variable.

Let's derive the logistic regression equation. In logistic regression, Y
represents a probability, so it lies between 0 and 1, while the right-hand
side of the straight-line equation can take any real value.

Now, to get a range between 0 and infinity, let's transform Y into the odds:
Y / (1 − Y)

Let us transform it further, by taking the natural logarithm, to get a range
between −(infinity) and +(infinity); equating this to the straight-line
equation gives the logistic regression equation:
ln(Y / (1 − Y)) = c + B1*X1 + B2*X2 + … + Bn*Xn
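A quick numeric check of these two transformations in R:
R
# Probabilities map to (0, infinity) via the odds, and to the whole real line via the log-odds
p     <- c(0.01, 0.25, 0.50, 0.75, 0.99)
odds  <- p / (1 - p)     # ranges over (0, infinity)
logit <- log(odds)       # ranges over (-infinity, +infinity)
cbind(p, odds, logit)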

Why is logistic regression called “regression” if it doesn’t model continuous outcomes?
Logistic regression falls under the category of supervised learning; it
measures the relationship between the categorical dependent variable
and one or more independent variables by estimating probabilities using
a logistic/sigmoid function.
In spite of the name ‘logistic regression,’ this is not used for regression
problems, where the task is to predict a real-valued output. It is used for
classification problems, to predict a binary outcome (1/0, -1/1, True/False)
given a set of independent variables.
Logistic regression is closely related to linear regression; in fact, it can
be seen as a generalized linear model.
In linear regression, we predict a real-valued output y based on a
weighted sum of input variables.
y = c + x1*w1 + x2*w2 + x3*w3 + … + xn*wn

The aim of linear regression is to estimate values for the model
coefficients c, w1, w2, w3, …, wn, fitting the training data with minimal
squared error, and then predict the output y.
Logistic regression does the same thing, but with one addition: it runs the
result through a special non-linear function called the logistic function or
sigmoid function to produce the output y.
y = logistic(c + x1*w1 + x2*w2 + x3*w3 + … + xn*wn)
y = 1 / (1 + e^[−(c + x1*w1 + x2*w2 + x3*w3 + … + xn*wn)])
The sigmoid/logistic function is given by the following equation:
y = 1 / (1 + e^(−x))
The moral of the story: classification and regression are not such different
beasts as we might think; it all comes down to the kind of problem we are
trying to solve. We may simply treat the output of a classifier as a
regression if we care about probabilities rather than the binary outputs,
and then use regression metrics to evaluate the model.
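A small sketch of this computation in R, with made-up coefficients and inputs:
R
# Hypothetical intercept, weights, and inputs (illustration only)
c0 <- -1.5
w  <- c(0.8, -0.4, 2.1)
x  <- c(1.2,  3.0, 0.5)

linear_part <- c0 + sum(w * x)       # c + x1*w1 + x2*w2 + x3*w3
y <- 1 / (1 + exp(-linear_part))     # logistic(...) gives a probability in (0, 1)
y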

How linear regression can be converted to logistic regression?


The logistic regression classifier can be derived by analogy to
the linear regression hypothesis, which is:

y = c + x1*w1 + x2*w2 + … + xn*wn   (linear regression hypothesis)

However, the logistic regression hypothesis generalizes the linear
regression hypothesis in that it passes this weighted sum through the
logistic function:

g(z) = 1 / (1 + e^(−z))   (logistic function)

The result is the logistic regression hypothesis:

y = g(c + x1*w1 + … + xn*wn) = 1 / (1 + e^(−(c + x1*w1 + … + xn*wn)))   (logistic regression hypothesis)

The function g(z) is the logistic function, also known as the sigmoid
function.
Pros and Cons of Logistic Regression
Many of the pros and cons of the linear regression model also apply to
the logistic regression model. Although logistic regression is widely used
for solving many kinds of problems, it has several limitations, and other
predictive models sometimes provide better predictive results.
Pros
 The logistic regression model not only acts as a classification model,
but also gives you probabilities. This is a big advantage over models
that can only provide the final classification: knowing that an instance
has a 99% probability for a class, rather than 51%, makes a big
difference. Logistic regression performs well when the dataset is
linearly separable.
 Logistic regression not only gives a measure of how relevant a
predictor is (coefficient size), but also its direction of association
(positive or negative). It is also easy to implement and interpret, and
very efficient to train.
Cons
 Logistic regression can suffer from complete separation. If there is
a feature that perfectly separates the two classes, the model can no
longer be trained: the weight for that feature does not converge,
because the optimal weight would be infinite. This is somewhat
unfortunate, because such a feature would be very useful; then again,
if a single rule separates both classes, you hardly need machine
learning. The problem of complete separation can be solved by
penalizing the weights or by defining a prior probability distribution
over the weights (a small sketch follows this list).
 Logistic regression is less prone to overfitting than more flexible
models, but it can still overfit on high-dimensional datasets; in that
case, regularization techniques should be considered to avoid
overfitting.
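As a sketch of the penalization idea, one option (among several) is ridge-penalized logistic regression via the glmnet package; the tiny simulated dataset below, with a perfectly separating first feature, is purely illustrative:
R
# install.packages("glmnet")
library(glmnet)

# Tiny simulated example with a perfectly separating first feature (illustration only)
set.seed(1)
x <- cbind(sep = c(1:5, 11:15), noise = rnorm(10))
y <- c(rep(0, 5), rep(1, 5))        # x[, "sep"] separates the classes perfectly

# Plain glm() would warn that fitted probabilities of 0 or 1 occurred;
# a ridge penalty (alpha = 0) keeps the weights finite
ridge_fit <- glmnet(x, y, family = "binomial", alpha = 0, lambda = 0.1)
coef(ridge_fit)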
Conclusion
Logistic regression provides a useful means for modelling the
dependence of a binary response variable on one or more explanatory
variables, where the latter can be either categorical or continuous. The
fit of the resulting model can be assessed using a number of methods.

Logistic Regression Models


The central mathematical concept that underlies logistic regression is
the logit (the natural logarithm) of an odds ratio. The simplest example
of a logit is derived from a 2 × 2 contingency table. Consider an instance
in which the distribution of a dichotomous outcome variable (a child from
a city school who is recommended for remedial reading classes) is paired
with a dichotomous predictor variable (gender). See the data in Table 8 below.

Table 8: Sample Data for Gender and Recommendation for Remedial Reading Instruction

Remedial reading instruction     Boys   Girls   Total
Recommended (coded as 1)           73      15      88
Not recommended (coded as 0)       23      11      34
Total                              96      26     122

A test of independence using chi-square could be applied; the result is
χ²(1) = 3.43. Alternatively, one might prefer to assess a boy’s odds of
being recommended for remedial reading instruction relative to a girl’s
odds. The result is an odds ratio of 2.33, which suggests that the odds of
a boy being recommended for remedial reading classes are 2.33 times the
odds for a girl. The odds ratio is derived from two odds (73/23 for boys
and 15/11 for girls); its natural logarithm [i.e., ln(2.33)] is a logit,
which equals 0.85. The value of 0.85 would be the regression coefficient
of the gender predictor if logistic regression were used to model the two
outcomes of a remedial recommendation as it relates to gender.
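These quantities can be reproduced directly from Table 8 in R:
R
# Counts from Table 8
odds_boys  <- 73 / 23                  # about 3.17
odds_girls <- 15 / 11                  # about 1.36

odds_ratio <- odds_boys / odds_girls   # about 2.33
log(odds_ratio)                        # about 0.85, the logit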
Generally, logistic regression is well suited for describing and testing
hypotheses about relationships between a categorical outcome
variable and one or more categorical or continuous predictor
variables. In the simplest case of linear regression for one continuous
predictor X (a child’s reading score on a standardized test) and one
dichotomous outcome variable Y (the child being recommended for
remedial reading classes), the plot of such data results in two parallel
lines, each corresponding to a value of the dichotomous outcome.
Because the two parallel lines are difficult to describe with an
ordinary least squares regression equation due to the dichotomy of
outcomes, one may instead create categories for the predictor and
compute the mean of the outcome variable for the respective
categories. The resultant plot of categories’ means will appear linear
in the middle, much like what one would expect to see on an ordinary
scatter plot, but curved at the ends. Such a shape, often referred to
as sigmoidal or S-shaped, is difficult to describe with a linear
equation for two reasons. First, the extremes do not follow a linear
trend. Second, the errors are neither normally distributed nor
constant across the entire range of the data.

Logistic regression solves these problems by applying the logit
transformation to the dependent variable. In essence, the logistic
model predicts the logit of Y from X. As stated earlier, the logit is
the natural logarithm (ln) of odds of Y, and odds are ratios of
probabilities (𝜋) of Y happening (i.e., a student is recommended for
remedial reading instruction) to probabilities (1 − 𝜋) of Y not
happening (i.e., a student is not recommended for remedial reading
instruction). Although logistic regression can accommodate
categorical outcomes that are polytomous, in this Lecture we focus on
dichotomous outcomes only. The illustration presented in this Lecture
can be extended easily to polytomous variables with ordered (i.e.,
ordinal-scaled) or unordered (i.e., nominal-scaled) outcomes.
The simple logistic model has the form
logit(Y) = natural log(odds) = ln(π / (1 − π)) = α + βX.    (1)
For the data in Table 8, the regression coefficient 𝛽 is the logit
(0.85) previously explained. Taking the antilog of Equation 1 on both
sides, one derives an equation to predict the probability of the
occurrence of the outcome of interest as follows:
π = Probability(Y = outcome of interest | X = x, a specific value of X)
  = e^(α + βX) / (1 + e^(α + βX))    (2)
where 𝜋 is the probability of the outcome of interest or “event,” such
as a child’s referral for remedial reading classes, 𝛼 is the Y intercept,
𝛽 is the regression coefficient, and e = 2.71828 is the base of the
system of natural logarithms. X can be categorical or continuous, but
Y is always categorical. According to Equation 1, the relationship
between logit (Y) and X is linear. Yet, according to Equation 2, the
relationship between the probability of Y and X is nonlinear. For this
reason, the natural log transformation of the odds in Equation 1 is
necessary to make the relationship between a categorical outcome
variable and its predictor(s) linear.
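A small R helper, following Equation 2, converts the linear predictor back into a probability (the α, β, and x values below are placeholders, not estimates from the data):
R
# Probability of the outcome for a given x, per Equation 2
prob_from_logit <- function(alpha, beta, x) {
  exp(alpha + beta * x) / (1 + exp(alpha + beta * x))
}

# Placeholder values, purely for illustration
prob_from_logit(alpha = 0.5, beta = -0.03, x = 65)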
The value of the coefficient 𝛽 determines the direction of the
relationship between X and the logit of Y. When 𝛽 is greater than
zero, larger (or smaller) X values are associated with larger (or
smaller) logits of Y. Conversely, if 𝛽 is less than zero, larger (or
smaller) X values are associated with smaller (or larger) logits of Y.
Within the framework of inferential statistics, the null hypothesis
states that 𝛽 equals zero, or there is no linear relationship in the
population. Rejecting such a null hypothesis implies that a linear
relationship exists between X and the logit of Y. If a predictor is
binary, as in Table 8, then the odds ratio is equal to e, the natural
logarithm base, raised to the power of the slope β (i.e., e^β).
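This relationship can be verified by fitting a one-predictor logistic regression to the Table 8 counts (expanded into individual cases):
R
# Expand the Table 8 counts into one row per child
gender      <- c(rep(1, 96), rep(0, 26))                       # 1 = boy, 0 = girl
recommended <- c(rep(1, 73), rep(0, 23), rep(1, 15), rep(0, 11))

fit <- glm(recommended ~ gender, family = binomial)
coef(fit)["gender"]        # about 0.85, the logit for gender
exp(coef(fit)["gender"])   # about 2.33, the odds ratio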
Extending the logic of the simple logistic regression to multiple
predictors (say X1 = reading score and X2 = gender), one can
construct a complex logistic regression for Y (recommendation for
remedial reading programs) as follows:

logit(Y) = ln(π / (1 − π)) = α + β1X1 + β2X2    (3)
Therefore
π = Probability(Y | X1 = x1, X2 = x2)
  = e^(α + β1X1 + β2X2) / (1 + e^(α + β1X1 + β2X2))    (4)

where Y is the outcome of interest, π is once again the probability
of the event, α is the Y intercept, βs are regression coefficients,
and Xs are a set of predictors. 𝛼 and 𝛽s are typically estimated by
the maximum likelihood (ML) method, which is preferred over the
weighted least squares approach by several authors, such as
Haberman (1978) and Schlesselman (1982). The ML method is
designed to maximize the likelihood of reproducing the data given the
parameter estimates. Data are entered into the analysis as 0 or 1
coding for the dichotomous outcome, continuous values for continuous
predictors, and dummy codings (e.g., 0 or 1) for categorical
predictors. The null hypothesis underlying the overall model states
that all 𝛽s equal zero. A rejection of this null hypothesis implies that
at least one 𝛽 does not equal zero in the population, which means that
the logistic regression equation predicts the probability of the
outcome better than the mean of the dependent variable Y. The
interpretation of results is rendered using the odds ratio for both
categorical and continuous predictors.

Illustration of Logistic Regression Analysis and Reporting


For the sake of illustration, we construct a hypothetical data set
consisting of reading scores and gender for 189 city school children.
Of these children, 59 (31.22%) were recommended for remedial
reading classes and 130 (68.78%) were not. A legitimate research
hypothesis posed to the data is that “the likelihood that a city school
child is recommended for remedial reading instruction is related to
both his/her reading score and gender.” Thus, the outcome variable,
remedial, is students being recommended for remedial reading
instruction (1 = yes, 0 = no), and the two predictors are students’
reading score on a standardized test (X1 = the reading variable) and
gender (X2 = gender). The reading scores range from 40 to 125 points,
with a mean of 64.91 points and standard deviation of 15.29 points
(Table 9). The gender predictor is coded as 1 = boy and 0 = girl. The
gender distribution is nearly even with 49.21% (n = 93) boys and
50.79% (n = 96) girls.

Table 9: Description of a Hypothetical Data Set for Logistic Regression

Remedial reading    Total       Boys    Girls   Reading score   Reading score
recommended?        sample (N)  (n1)    (n2)    mean (M)        SD
Yes                  59          36      23      61.07           13.28
No                  130          57      73      66.65           15.86
Total               189          93      96      64.91           15.29

Logistic Regression Analysis


A two-predictor logistic model was fitted to the data to test the
research hypothesis regarding the relationship between the
likelihood that a city child is recommended for remedial reading
instruction and his or her reading score and gender. The logistic
regression analysis was carried out by the Logistic procedure in
statistical software and the results are as shown in Table 10.
Table 10: Logistic Regression Analysis of 189 Children’s Referrals
for Remedial Reading Programs

Predictor                   β         SE β     Wald's χ²   df   p        e^β (odds ratio)
Constant                    0.5340    0.8109   0.4337      1    0.5102   NA
Reading                    −0.0261    0.0122   4.5648      1    0.0326   0.9742
Gender (1=boys, 0=girls)    0.6477    0.3248   3.9759      1    0.0462   1.9111

The result showed that
Predicted logit of (REMEDIAL) = 0.5340 + (−0.0261)*READING + (0.6477)*GENDER    (5)
According to the model, the log of the odds of a child being
recommended for remedial reading instruction was negatively related
to reading scores (p < .05) and positively related to gender (p < .05;
Table 10). In other words, the higher the reading score, the less likely
it is that a child would be recommended for remedial reading classes.
Given the same reading score, boys were more likely to be
recommended for remedial reading classes than girls because boys
were coded as 1 and girls as 0. In fact, the odds of a boy being
recommended for remedial reading programs were 1.9111 (= e^0.6477;
Table 10) times greater than the odds for a girl.
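Using the coefficients in Table 10, the predicted probability for an individual child can be computed directly; the reading score of 60 below is just an example value:
R
# Coefficients from Table 10
b0 <- 0.5340; b_reading <- -0.0261; b_gender <- 0.6477

# Example: a boy (gender = 1) with a reading score of 60
logit_hat <- b0 + b_reading * 60 + b_gender * 1          # Equation 5
prob_hat  <- exp(logit_hat) / (1 + exp(logit_hat))       # predicted probability
prob_hat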

Question
To assess effect modification, the outcome of interest (Y) is related to
independent variables representing the treatment assignment, sex, and the
product of the two (called the treatment-by-sex interaction variable).
Here T is the treatment assignment (1 = new drug, 0 = placebo), M = male
gender (1 = yes, 0 = no), and TM (i.e., T * M or T x M) is the product of
treatment and male gender. The multiple regression analysis revealed the
following results:

Independent Variable             Regression Coefficient      T       P-value
Intercept                                         39.24   65.89      0.0001
T (Treatment)                                     -0.36   -0.43      0.6711
M (Male Gender)                                   -0.18   -0.13      0.8991
TM (Treatment x Male Gender)                       6.55    3.37      0.0011

Deduce the multiple regression model; hence, comment on the results.

Solution for question


The multiple regression model is:
𝑌̂ = 39.24 − 0.36𝑇 − 0.18𝑀 + 6.55𝑇𝑀
The details of the test are not shown here, but note in the table above
that in this model, the regression coefficient associated with the
interaction term, 𝛽3 , is statistically significant (i.e., H0: 𝛽3 = 0 versus
H1: 𝛽3 ≠ 0). The fact that this is statistically significant indicates
that the association between treatment and outcome differs by sex.
The model shown above can be used to estimate the mean HDL levels
for men and women who are assigned to the new medication and to the
placebo. In order to use the model to generate these estimates, we
must recall the coding scheme (i.e., T = 1 indicates new drug, T=0
indicates placebo, M=1 indicates male sex and M=0 indicates female
sex).
The expected or predicted HDL for men (M=1) assigned to the new
drug (T=1) can be estimated as follows:
𝑌̂ = 39.24 − 0.36(1) − 0.18(1) + 6.55(1)(1) = 45.25

The expected HDL for men (M=1) assigned to the placebo (T=0) is:
𝑌̂ = 39.24 − 0.36(0) − 0.18(1) + 6.55(0)(1) = 39.06
Similarly, the expected HDL for women (M=0) assigned to the new
drug (T=1) is:
𝑌̂ = 39.24 − 0.36(1) − 0.18(0) + 6.55(1)(0) = 38.88
The expected HDL for women (M=0) assigned to the placebo (T=0)
is:
𝑌̂ = 39.24 − 0.36(0) − 0.18(0) + 6.55(0)(0) = 39.24
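The same four predictions can be reproduced with a short R sketch:
R
# Fitted model: Y_hat = 39.24 - 0.36*T - 0.18*M + 6.55*T*M
predict_hdl <- function(trt, male) 39.24 - 0.36 * trt - 0.18 * male + 6.55 * trt * male

predict_hdl(trt = 1, male = 1)   # men on the new drug:   45.25
predict_hdl(trt = 0, male = 1)   # men on placebo:        39.06
predict_hdl(trt = 1, male = 0)   # women on the new drug: 38.88
predict_hdl(trt = 0, male = 0)   # women on placebo:      39.24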

Logistic Regression in R

The Dataset
mtcars (Motor Trend Car Road Test) comprises fuel consumption,
performance, and 10 aspects of automobile design for 32 automobiles.
It ships with base R (in the datasets package), so it is available
without installing any additional packages.
R

# Installing the package (optional here: mtcars itself ships with base R)
install.packages("dplyr")

# Loading package
library(dplyr)

# Summary of the mtcars dataset
summary(mtcars)

Performing Logistic regression on a dataset


Logistic regression is implemented in R using glm(), with family =
"binomial", by training the model on features (variables) in the dataset.
R
# Installing the package

# For Logistic regression


install.packages("caTools")

# For ROC curve to evaluate model


install.packages("ROCR")

# Loading package
library(caTools)
library(ROCR)

Splitting the Data


R
# Splitting the dataset on the outcome column
# (no seed is set, so the exact split and the output below vary between runs)
split <- sample.split(mtcars$vs, SplitRatio = 0.8)
split

train_reg <- subset(mtcars, split == "TRUE")
test_reg <- subset(mtcars, split == "FALSE")

# Training model
logistic_model <- glm(vs ~ wt + disp,
                      data = train_reg,
                      family = "binomial")
logistic_model

# Summary
summary(logistic_model)
Output:

Call:
glm(formula = vs ~ wt + disp, family = "binomial", data = train_reg)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.6552 -0.4051 0.4446 0.6180 1.9191

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.58781 2.60087 0.610 0.5415
wt 1.36958 1.60524 0.853 0.3936
disp -0.02969 0.01577 -1.882 0.0598 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 34.617 on 24 degrees of freedom


Residual deviance: 20.212 on 22 degrees of freedom
AIC: 26.212

Number of Fisher Scoring iterations: 6

Call: The function call used to fit the logistic regression model is
displayed, along with the formula, family, and data.

Deviance Residuals: These gauge the model’s goodness of fit. They
represent the discrepancies between the actual responses and the
probabilities predicted by the logistic regression model.

Coefficients: In logistic regression, these coefficients are on the scale
of the response variable’s log odds (logit). The standard errors of the
estimated coefficients are shown in the “Std. Error” column.

Significance codes: The significance codes indicate the level of
significance of each predictor variable.

Dispersion parameter: In logistic regression, the dispersion parameter
serves as the scaling parameter for the binomial distribution. It is set
to 1 in this instance, indicating that the assumed dispersion is 1.

Null deviance: The null deviance measures the model’s deviance when only
the intercept is included, that is, the deviance of a model with no
predictors.

Residual deviance: The residual deviance measures the model’s deviance
after the predictors have been fitted, that is, the deviance that remains
once the predictors are taken into account.

AIC: The Akaike Information Criterion (AIC) is a measure of a model’s
goodness of fit that accounts for the number of predictors; it penalizes
more intricate models in order to discourage overfitting. Lower AIC values
indicate better-fitting models.

Number of Fisher Scoring iterations: This is the number of iterations the
Fisher scoring procedure needed to converge while estimating the model
parameters.
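Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret:
R
# Convert log-odds coefficients to odds ratios
exp(coef(logistic_model))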
Predict test data based on model

R
predict_reg <- predict(logistic_model,
test_reg, type = "response")
predict_reg
Output:
Hornet Sportabout         Merc 280C        Merc 450SE Chrysler Imperial
       0.01226166        0.78972164        0.26380531        0.01544309
      AMC Javelin        Camaro Z28    Ford Pantera L
       0.06104267        0.02807992        0.01107943
R
# Converting probabilities to class labels with a 0.5 cutoff
# (predict_reg keeps the probabilities so the ROC curve can use them)
pred_class <- ifelse(predict_reg > 0.5, 1, 0)

# Evaluating model accuracy
# using confusion matrix
table(test_reg$vs, pred_class)

missing_classerr <- mean(pred_class != test_reg$vs)
print(paste('Accuracy =', 1 - missing_classerr))

# ROC-AUC Curve (built from the predicted probabilities)
ROCPred <- prediction(predict_reg, test_reg$vs)
ROCPer <- performance(ROCPred, measure = "tpr",
                      x.measure = "fpr")

auc <- performance(ROCPred, measure = "auc")


auc <- auc@y.values[[1]]
auc

# Plotting curve
plot(ROCPer)
plot(ROCPer, colorize = TRUE,
     print.cutoffs.at = seq(0.1, 0.9, by = 0.1),
     main = "ROC CURVE")
abline(a = 0, b = 1)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)
