
Logistic Regression Model

Lesson 13

20538 – Predictive Analytics for Data Driven Decision Making

Daniele Tonini
daniele.tonini@unibocconi.it
Agenda
Logistic regression:

• Background
• Problem setting
• Model features
• Interpretation of the coefficients
• ODDS ratio
• Estimation of the model
• Model comparison and evaluation

Background
Consider the following business case:

• A retail bank runs a direct marketing campaign to promote a new financial product, mailing advertising materials to a sample of its clients.

• Customers interested in receiving additional information about the product (and thus potentially willing to buy it) were asked to send back the coupon included in the promotional mail sent by the bank.

• Business problem: as newly hired employees, we have to analyze the data collected during the campaign and identify the customer features most likely to lead to a positive reply, in terms of the request for additional information.

Background
To deal with some specific managerial problems, it is not always sufficient to rely on the linear regression model, since the response (dependent) variable is categorical.
For example, we may be interested in studying the purchase drivers for a specific product category, in explaining the membership of customers to market segments, or in identifying the factors leading to customer defection.

For this kind of problem, the business analyst needs specific quantitative tools to model the phenomenon of interest: as will become clear in what follows, the estimation of a logistic regression model provides the proper solution to these issues.

Problem Setting

Categorical response variable in its simplest expression → binary variable

• Value «1»: presence of a characteristic or occurrence of an event of interest (for example: product purchase, client insolvency, response to a marketing campaign, etc.)

• Value «0»: absence of the characteristic of interest

(note: the response variable can be modeled using a Bernoulli random variable)

Goal: analysis of the probability of occurrence for the event of interest (Y=1) given one
or more predictor (independent) variables (X)

Pr (Y=1 | X)
Problem Setting
The linear regression model is not suitable to solve the previous problem (categorical
binary response variable), since:

1. In linear regression, the predicted value is not bounded, so it can assume values outside the interval [0, 1].

2. The assumption of homoskedasticity, crucial in linear regression, is not reasonable in the case of a binary dependent variable.

3. The usual tests for the parameters of the linear regression model are based on the assumption of Normality for the distribution of the error terms → since y can take on only the values 0 and 1, the Normality assumption is hard to justify, even in an approximate sense.

The Logistic Regression Model
Model features

To solve the problems mentioned above, it is possible to use a non-linear function of the independent variables which, unlike the linear function, assumes values bounded between 0 and 1.
Among all the suitable non-linear functions, the standard logistic function (or sigmoid function) is the one most often used in practice: $\pi = \Pr(Y = 1)$ is defined as a function of p independent variables, according to the following expression:

(1)   $\pi = \Pr(Y = 1) = \dfrac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p}}$   OR   $\pi = \Pr(Y = 1) = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)}}$

[Figure: the standard logistic function, and a comparison of a simple logistic vs. a simple linear model with one X]
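To make the boundedness concrete, here is a minimal Python sketch (not part of the original slides) of the standard logistic function:

```python
import numpy as np

def logistic(z):
    """Standard logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A linear predictor beta0 + beta1*x1 + ... can take any real value...
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
# ...but the logistic transform always returns a valid probability:
print(logistic(z))  # [0.0067 0.2689 0.5    0.7311 0.9933]
```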
The Logistic Regression Model
The LOGIT expression

• Logistic regression makes it possible to model a binary dependent variable but, due to its non-linearity, the interpretation of the coefficients is less intuitive than in linear regression.

• Starting from the original definition of the model, it is possible to re-express it through a different functional form, which allows an easier interpretation of the coefficients:

(2)   $\ln\!\left(\dfrac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$

The left-hand side of the equation is often called the LOGIT transformation of 𝜋 ,
while the right-hand side is the linear combination of the p independent variables
𝑥1 , 𝑥2 , … , 𝑥𝑝 , just as for the linear regression model

The Logistic Regression Model
Definition of ODDS and interpretation of coefficients

• ODDS: the ratio between the probability of occurrence of a specific event and the probability of non-occurrence.

There is a simple relation between the odds (O) and the probability (𝜋)

$O = \dfrac{\pi}{1-\pi} \qquad\Longleftrightarrow\qquad \pi = \dfrac{O}{1+O}$
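As a quick numerical check of this relation (a hypothetical example, not from the slides):

```python
def odds_from_prob(pi):
    """O = pi / (1 - pi)"""
    return pi / (1.0 - pi)

def prob_from_odds(o):
    """pi = O / (1 + O)"""
    return o / (1.0 + o)

# If the event has probability 0.8, it is 4 times as likely as its complement:
print(odds_from_prob(0.8))  # 4.0
print(prob_from_odds(4.0))  # 0.8
```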

The LOGIT formulation (2) for the logistic model corresponds to a linear model for
the logarithm of the odds (log-odds)
⟹ the generic coefficient 𝜷𝒌 of the k-th independent variable 𝒙𝒌 is
interpreted as the change in the log-odds for one unit increase in 𝒙𝒌 , keeping
all other variables constant.

The Logistic Regression Model
Example: direct marketing campaign

• Dependent Variable: RESPONSE (1: positive interest in a new financial product, 0: otherwise)

• Independent Variables: GENDER (gender of the customer → "1": male, "0": female), ACTIVITY (1: financially active client, 0: inactive client), AGE (age in years)

The Logistic Regression Model
Example: direct marketing campaign

The estimated equation of the logistic model (in LOGIT form) for the previous example is (changing the sign for category 1):

$\ln\!\left(\dfrac{\hat{\pi}}{1-\hat{\pi}}\right) = \hat{\beta}_0 + 0.9865\,\mathrm{GENDER} + 0.9350\,\mathrm{ACTIVITY} - 0.0011\,\mathrm{AGE}$

It is possible to provide the following interpretation for the coefficients:

GENDER: keeping the other variables constant, the estimated log-odds for the event of interest (i.e. positive interest of the customer) increases by 0.9865 for a man compared to a woman.
ACTIVITY: keeping the other variables constant, the estimated log-odds increases by 0.9350 for an active customer compared to a non-active one.
AGE: keeping the other variables constant, the estimated log-odds decreases by 0.0011 when the age of the customer increases by one year.

→ A positive coefficient denotes an increasing relation between the corresponding predictor and the (estimate of the) probability of occurrence of the event.
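As an illustration, a minimal sketch of how such a model could be estimated in Python with statsmodels; the file name and column layout are assumptions, not part of the original campaign data:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file: assumed to contain RESPONSE (0/1), GENDER (0/1),
# ACTIVITY (0/1) and AGE (in years), as in the campaign example.
df = pd.read_csv("campaign.csv")

X = sm.add_constant(df[["GENDER", "ACTIVITY", "AGE"]])  # adds the intercept term
logit_model = sm.Logit(df["RESPONSE"], X).fit()         # maximum likelihood fit

# The reported coefficients are on the log-odds (LOGIT) scale.
print(logit_model.params)
```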

The Logistic Regression Model
An additional interpretation: the ODDS RATIO

• ODDS RATIO: the ratio of two ODDS computed at two different values of the independent variable (for a binary predictor, at x = 1 and x = 0):

$ODDS\ RATIO = \dfrac{\pi(1)\,/\,(1-\pi(1))}{\pi(0)\,/\,(1-\pi(0))} = \dfrac{ODDS_1}{ODDS_0}$

where $\pi(1)$ and $\pi(0)$ denote the probability that the dependent variable equals 1 when x = 1 and x = 0, respectively.
The Logistic Regression Model
An additional interpretation: the ODDS RATIO

The odds ratio is a non-negative quantity such that:

1. It assumes a value larger than 1 when a unit increase in the independent variable leads to an increase in the probability of occurrence of the event of interest;

2. It assumes a value smaller than 1 when a unit increase in the independent variable leads to a decrease in the probability of occurrence of the event of interest;

3. It assumes a value equal to 1 when a unit increase in the independent variable leaves the probability of occurrence of the event unchanged.

For a categorical independent variable, a «unit increase» clearly stands for the change from the reference category (coded as «0») to the category coded as «1».

So, the ODDS RATIO for a categorical predictor represents the ratio of the ODDS for an individual belonging to the specific category TO the ODDS for an individual NOT belonging to the category.
The Logistic Regression Model
An additional interpretation: the ODDS RATIO

• DEFINITION: the exponential of the generic coefficient $\beta_k$ of the k-th independent variable $x_k$, that is $e^{\beta_k}$, can be interpreted as the ODDS RATIO corresponding to a unit increase in $x_k$, keeping all other variables constant.

To sum up, the estimated coefficients of a logistic regression model can be interpreted alternatively as:

1. $\beta_k$ → change in the log-odds for a unit increase in $x_k$ (ceteris paribus);

2. $e^{\beta_k}$ → odds ratio corresponding to a unit increase in $x_k$ (ceteris paribus).

For categorical independent variables, please refer to what has been previously stated about the meaning of a «unit increase».

The Logistic Regression Model
An additional interpretation: the ODDS RATIO

EXAMPLE: consider the output of the logistic regression for the direct marketing campaign and the estimated coefficients (see the example above).

We interpreted the coefficient of the variable ACTIVITY as the estimated increase in the log-odds for an active customer with respect to an inactive one (ceteris paribus).

Logistic Regression: RESPONSE — output for the variable ACTIVITY:

  coefficient              0.935039591
  std error of coef        0.184687926
  z-ratio                  5.0628
  p-value                  0.00%
  lower 95% conf. int.     0.573057906
  upper 95% conf. int.     1.297021275
  ======================
  odds ratio               2.547314305

An alternative way of interpreting the estimated relation between the response variable and ACTIVITY is the following: the estimated odds ratio for the variable ACTIVITY is 2.54, meaning that the ODDS for an active customer (ACTIVITY=1) is 2.54 times the ODDS for an inactive one (ACTIVITY=0); since this value is larger than 1, we can conclude that there is a positive relation between RESPONSE (the dependent variable) and ACTIVITY.
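The odds ratio (and a confidence interval for it) can be recovered from the log-odds output by simple exponentiation, as this sketch using the numbers above shows:

```python
import numpy as np

coef = 0.935039591                           # estimated coefficient of ACTIVITY
ci_low, ci_high = 0.573057906, 1.297021275   # 95% CI on the log-odds scale

print(np.exp(coef))               # 2.5473... -> the odds ratio in the output
print(np.exp([ci_low, ci_high]))  # 95% CI for the odds ratio: about [1.77, 3.66]
```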
The Logistic Regression Model
Estimation of the Model

Similarly to the linear regression model, the relation between the dependent variable and the independent ones is known up to the values of the parameters $\beta_k$'s, which have to be estimated.

• An estimation method is thus needed to obtain «good» estimates of the parameters on the basis of the available sample observations. In this situation, it can be proved that OLS estimates are not optimal.

• A more general method, Maximum Likelihood Estimation (MLE), is then used: this approach chooses the values of the parameter estimates (i.e. the regression coefficients) that make the observed data "maximally likely".

The Logistic Regression Model
Estimation of the Model

Maximum Likelihood Estimation notes:


Probability function of Y (Bernoulli random variable):
$\Pr(y) = p^{y}(1-p)^{1-y}$, so that $\Pr(y=1) = p$ and $\Pr(y=0) = 1-p$

Likelihood function of $N$ independent observations $y_i$ $(i = 1, \ldots, N)$:
$L = \Pr(y_1, y_2, \ldots, y_N) = \prod_{i=1}^{N} \Pr(y_i) = \prod_{i=1}^{N} p_i^{y_i}(1-p_i)^{1-y_i} = \prod_{i=1}^{N}\left(\dfrac{p_i}{1-p_i}\right)^{y_i}(1-p_i)$

Finding the Log-Likelihood function (LL) by taking the logarithm of both sides (this helps the optimization process):
$LL = \ln L = \sum_i y_i \ln\!\left(\dfrac{p_i}{1-p_i}\right) + \sum_i \ln(1-p_i)$

From the LOGIT regression equation, $\ln\!\left(\dfrac{p_i}{1-p_i}\right) = \beta x_i$, so the expression to maximize is:
$LL = \sum_i y_i(\beta x_i) - \sum_i \ln\!\left(1 + \exp(\beta x_i)\right)$
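As a numerical illustration (a sketch on simulated data, not from the slides), the expression above can be coded directly and handed to a generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(beta, X, y):
    """Negative of LL = sum_i y_i*(x_i beta) - sum_i ln(1 + exp(x_i beta))."""
    z = X @ beta
    return -(y @ z - np.logaddexp(0.0, z).sum())  # log(1+exp(z)), computed stably

# Simulated data; X includes a column of ones for the intercept (assumption).
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = np.column_stack([np.ones(500), x1])
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * x1)))
y = (rng.random(500) < p).astype(float)

# No closed-form solution exists, so an iterative optimizer finds the MLE.
fit = minimize(neg_log_likelihood, x0=np.zeros(2), args=(X, y), method="BFGS")
print(fit.x)  # estimates of (beta0, beta1), close to the true (0.5, 1.2)
```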
The Logistic Regression Model
Estimation of the Model

The Log-Likelihood expression (= the conditions that determine the estimates) is non-linear in the parameters and does not admit an explicit solution (i.e. it is not a closed form).

→ Iterative optimization methods must be used to find an approximate solution. The standard choice is the Newton–Raphson algorithm (note: equivalent to the Iteratively Reweighted Least Squares algorithm in Knime) or the Stochastic Gradient Descent method (and its numerous variations).

• Maximum Likelihood Estimates have some optimality properties, for a sufficiently large sample size:
  – Asymptotically unbiased (estimates are approximately unbiased)
  – Asymptotically efficient (standard errors of the estimates are as low as those of any other procedure)
  – Asymptotically normal (it is possible to use a normal or chi-square distribution to compute confidence intervals and critical values or p-values in statistical tests)

The Logistic Regression Model
Regularization

Overfitting the training data is a problem that can arise in logistic regression, especially when the data is high-dimensional and sparse.
One approach to reducing overfitting is Regularization, in which we create a modified "penalized log-likelihood function" that penalizes large values of the estimates.
Generally, we do not want large estimates: if the weights are large, a small change in a feature can result in a large change in the prediction.
Penalized log-likelihood function:
$LL_{pen} = \sum_i y_i(\beta x_i) - \sum_i \ln\!\left(1 + \exp(\beta x_i)\right) - \lambda R(\beta)$
where $\lambda$ is the penalty factor and $R(\beta)$ is the regularization function.

L1 regularization: $R(\beta) = \sum_{i=1}^{p} |\beta_i|$ → it tends to produce sparse solutions by forcing unimportant coefficients to be zero (L1 is equivalent to the Laplace method in Knime).

L2 regularization: $R(\beta) = \dfrac{1}{2}\sum_{i=1}^{p} \beta_i^{2}$ → it keeps the coefficients from becoming too large but does not force them to be zero (L2 is equivalent to the Gauss method in Knime).
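In scikit-learn terms (a sketch, assuming features X and a binary target y are already prepared), the two penalties correspond to the penalty argument, with C playing the role of 1/λ:

```python
from sklearn.linear_model import LogisticRegression

# Smaller C = larger penalty factor lambda = stronger regularization.
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")  # sparse: some coefficients exactly 0
l2_model = LogisticRegression(penalty="l2", C=0.1)                      # shrunken but non-zero coefficients

# l1_model.fit(X, y); print(l1_model.coef_)   # with X, y prepared beforehand
```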
The Logistic Regression Model
Goodness-of-fit and model comparison

• Different approaches can be used to evaluate the goodness-of-fit of a logistic model and to compare different logistic models:

– AIC (Akaike Information Criterion): an index combining an evaluation of the goodness-of-fit of the model and of its complexity, measured as the number of independent variables. Its explicit expression is the following:

$AIC = -2\,LL(M) + 2(k + 1)$

where $LL(M)$ is the value of the log-likelihood function at the estimated coefficients and k is the number of independent variables.

→ A model with a low value of AIC is to be preferred to a model with a high value, since it represents a better compromise between goodness-of-fit and complexity.
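A one-line computation makes the trade-off explicit (the log-likelihood values below are hypothetical):

```python
def aic(log_likelihood, k):
    """AIC = -2*LL(M) + 2*(k + 1), where k is the number of independent variables."""
    return -2.0 * log_likelihood + 2.0 * (k + 1)

# Hypothetical values: the extra variables of the second model do not
# improve the fit enough to offset the complexity penalty.
print(aic(-120.3, k=3))  # 248.6  -> preferred (lower AIC)
print(aic(-119.8, k=5))  # 251.6
```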
The Logistic Regression Model
Model evaluation: Confusion Matrix
Confusion matrix (rows = actual values, columns = predicted values):

              Predicted 0   Predicted 1   Total
  Actual 0        a             b          a+b
  Actual 1        c             d          c+d
  Total          a+c           b+d          n

1. Accuracy: (a + d) / n
2. True negatives proportion: a / (a + b)
3. False negatives proportion: c / (c + d)
4. False positives proportion: b / (a + b)
5. True positives proportion: d / (c + d)

The category of the target variable predicted by the model ("predicted values" in the confusion matrix above) is assigned on the basis of the estimated probability of the logistic model: if this probability is greater than a pre-determined threshold, called cut-off (default value equal to 0.5), then the observation is assigned to category "1", otherwise it is assigned to category "0".
→ In order to obtain more reliable results, it is possible to select a cut-off value equal to the "a priori" probability of the target variable (the proportion of category "1" in the target variable).
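A sketch of this classification rule on toy data (hypothetical probabilities, not from the campaign dataset):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def classify(probabilities, cutoff=0.5):
    """Assign category 1 when the estimated probability exceeds the cut-off."""
    return (probabilities > cutoff).astype(int)

y_true = np.array([0, 0, 0, 1, 1, 0])              # actual classes
p_hat = np.array([0.2, 0.6, 0.3, 0.4, 0.9, 0.1])   # hypothetical estimated probabilities

# Rows = actual values (0, 1); columns = predicted values (0, 1): [[a, b], [c, d]]
print(confusion_matrix(y_true, classify(p_hat)))                        # default cut-off 0.5
print(confusion_matrix(y_true, classify(p_hat, cutoff=y_true.mean())))  # "a priori" cut-off
```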
The Logistic Regression Model
Model evaluation: ROC Curve

Given a confusion matrix based on a specific cut-off, the ROC curve is calculated starting from the joint frequencies of predicted and observed events (correct classifications) and of predicted but not observed events (errors). Specifically, with regard to the confusion matrix of the previous slide, the ROC curve is based on the following indicators:

Sensitivity → True positives proportion: d / (c + d)
Specificity → True negatives proportion: a / (a + b)
1 − Sensitivity → False negatives proportion: c / (c + d)
1 − Specificity → False positives proportion: b / (a + b)

The Logistic Regression Model
Model evaluation: ROC Curve

Once these values have been computed, the ROC curve is obtained by plotting, for each possible threshold (cut-off) value, a point in the Cartesian plane whose horizontal coordinate is the percentage of false positives (1 − Specificity) and whose vertical coordinate is the percentage of true positives (Sensitivity). Each point of the curve, therefore, corresponds to a particular value of the cut-off (varying from 0 to 1) at which a confusion matrix has been built.
[Figure: ROC curves of Model 1 and Model 2, plotting Sensitivity (vertical axis) against 1 − Specificity (horizontal axis), both ranging from 0.0 to 1.0]
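A sketch of this construction with scikit-learn (reusing the hypothetical toy data from the confusion matrix example): roc_curve returns exactly the (1 − Specificity, Sensitivity) pairs traced out by varying the cut-off:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 0, 1, 1, 0])              # actual classes (toy data)
p_hat = np.array([0.2, 0.6, 0.3, 0.4, 0.9, 0.1])   # hypothetical estimated probabilities

fpr, tpr, thresholds = roc_curve(y_true, p_hat)
# Each (fpr[i], tpr[i]) pair is one point of the ROC curve, i.e. the
# (1 - Specificity, Sensitivity) of the confusion matrix at cut-off thresholds[i].
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"cut-off {thr:.2f}: 1 - Specificity = {f:.2f}, Sensitivity = {t:.2f}")
```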
The Logistic Regression Model
Model evaluation: ROC Curve

The point (0,1) in the chart represents the ideal model, which does not commit any prediction error: the percentage of false positives is 0 and the percentage of true positives is equal to 1 → this allows a perfect separation between the two classes.
The point (0,0) corresponds to a model that assigns all the observations to the negative class of the response variable (the absence of the target characteristic), while the point (1,1) corresponds to a model that predicts all events as belonging to the positive class.

→ In terms of comparison between two different models, the best fitting curve is, therefore, the one closest to the upper left corner of the graph.
→ Comparing the two sample models represented in the figure of the previous slide, model 1 can be considered better from this point of view.

The Logistic Regression Model
Model evaluation: AUC
Sometimes, especially when two models have similar performances, it is difficult to clearly distinguish which curve is better, because they often overlap at some points.

→ For this reason, an index that numerically measures the goodness of the model is frequently used together with the graphical display.
This index, called AUC (Area Under the Curve), is obtained simply by calculating the area below the ROC curve.

Formally, the AUC index is calculated, by varying the threshold, as follows:

$AUC_{ROC} = \int_0^1 \frac{TP}{P}\, d\!\left(\frac{FP}{N}\right) = \frac{1}{P \cdot N}\int_0^N TP\, dFP$

where P and N denote the total numbers of actual positives and negatives.
A random classifier has an AUC of 0.5 (graphically represented by a ROC curve corresponding to the bisecting line of the quadrant), while a perfect classifier has an AUC of 1 → a logistic regression model, therefore, will have an AUC between these two extreme values.
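With the same toy data as in the ROC sketch above, the AUC can be computed directly (a sketch, not from the slides):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 0])              # actual classes (toy data)
p_hat = np.array([0.2, 0.6, 0.3, 0.4, 0.9, 0.1])   # hypothetical estimated probabilities

print(roc_auc_score(y_true, p_hat))  # 0.875 here; 0.5 = random, 1.0 = perfect
```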

Keywords

Binary dependent variable

Logistic function

LOGIT Transformation

ODDS

ODDS RATIO

Maximum Likelihood Estimation

AIC

Confusion matrix

ROC Curve
