Lecture7 Logistic Regression

CDS504 Business Data Analytics
Week 7: Logistic Regression

Aihua YAN
To be, or not to be questions
Yes or No?
Fail or Pass?
“To be, or not to be” decision in the business world
In or Out?
Advanced decision:
Which one? Accept or Deny?
(flipping property vs. speculating in cryptocurrency vs. flipping sneakers)
Shall I Lend Money to Him?
o You work for a bank, and some of your customers apply to you for loans. By checking
their background, you know the property they own, their credit score, and some
information on their previous credit card use.
o Decision: should you approve the loan?
• Revolving balance is the portion of credit card spending that goes
unpaid at the end of a billing cycle.
• Revolving utilization is the share of your revolving credit limit
that you are currently using.
To Whom to Lend Money?
o $Y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 + b_5 x_5$
o Y: approve or not (binary)
o X:
– X1: Homeowner or not
– X2: Credit score
– X3: Years of credit history
– X4: Revolving balance
– X5: Revolving utilization
Cannot Use Linear Regression!
o If we use linear OLS (ordinary least squares) regression:
$Y = a + bX + e$, where $Y \in \{0, 1\}$
§ The residual is not normally distributed because Y takes on only
two values.
o Which assumption of linear regression is violated?
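A minimal sketch of the problem, assuming a small made-up data set of hours studied and pass/fail results (the numbers are hypothetical, not the lecture's data). Fitting Y = a + bX by OLS to a 0/1 outcome gives fitted values below 0 and above 1, which cannot be read as probabilities.

```python
# Hypothetical data: hours studied (x) and pass/fail outcome (y = 0 or 1).
hours = [60, 80, 100, 120, 140, 160, 180, 200, 220, 240]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(passed) / n

# Closed-form OLS estimates for Y = a + bX.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, passed))
s_xx = sum((x - mean_x) ** 2 for x in hours)
b = s_xy / s_xx
a = mean_y - b * mean_x


def ols_predict(x):
    """Fitted value of the straight line at x."""
    return a + b * x


# The fitted line is unbounded, so extreme x values fall outside [0, 1].
low, high = ols_predict(20), ols_predict(300)
print(f"a={a:.3f}, b={b:.4f}, fit(20)={low:.2f}, fit(300)={high:.2f}")
```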
Why Not Linear Regression?
[Figure: Exam Result]
Source: http://www.math.cornell.edu/~numb3rs/kostyuk/num218.htm
S-Shaped Curve
• Logistic regression came into wide use among statisticians in the 1970s.
• The idea is to approximate the data by an "S"-shaped curve, called the
logistic curve, which is given by the following equation:
$y = \frac{e^{a + bx}}{1 + e^{a + bx}}$
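The logistic curve can be evaluated directly. This is a sketch only: the parameter values a = -6 and b = 0.04 are hypothetical, chosen so that the curve crosses 0.5 at 150 hours.

```python
import math


def logistic(x, a=0.0, b=1.0):
    """Logistic curve y = e^(a + b*x) / (1 + e^(a + b*x))."""
    z = a + b * x
    if z >= 0:
        # Algebraically identical form that avoids overflow for large z.
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical parameters: a = -6, b = 0.04 per hour studied.
for hours in (60, 100, 150, 200, 260):
    print(hours, round(logistic(hours, a=-6.0, b=0.04), 3))
```

The output rises slowly at first, fastest near the midpoint, and then flattens towards 1, which gives the "S" shape.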
Odds of an event
• We define the odds of an event A as the probability of A happening divided
by the probability of it not happening:
$\mathrm{Odds}(A) = \frac{P(A)}{1 - P(A)}$
• Natural log of the odds: ln(odds) is called the logit of p.
• Use the LN() function in Excel to calculate the log odds. You will get the
same results as shown in the figure.
• In the left figure, we divide the hours studied into intervals of 40 hours and
calculate the odds for each interval. Let X denote the hours studied.
Odds(60 < X <= 100) = (1/6)/(1 - (1/6)) = 1/5
Odds(100 < X <= 140) = 1/4
Odds(140 < X <= 180) = 2/3
Odds(180 < X <= 220) = 5
Odds(220 < X <= 260) = 4
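The interval odds listed above can be checked with exact fractions. In this sketch the function names are illustrative; only the first interval's probability (1/6) is stated on the slide, so the other probabilities are recovered from the listed odds with P = odds / (1 + odds).

```python
from fractions import Fraction
import math


def odds_from_probability(p):
    """Odds(A) = P(A) / (1 - P(A))."""
    return p / (1 - p)


def probability_from_odds(odds):
    """Inverse relationship: P(A) = odds / (1 + odds)."""
    return odds / (1 + odds)


# First interval from the slide: P = 1/6 gives odds of 1/5.
print(odds_from_probability(Fraction(1, 6)))

# Odds listed on the slide for each interval of hours studied.
interval_odds = {
    "60 < X <= 100": Fraction(1, 5),
    "100 < X <= 140": Fraction(1, 4),
    "140 < X <= 180": Fraction(2, 3),
    "180 < X <= 220": Fraction(5),
    "220 < X <= 260": Fraction(4),
}
for interval, odds in interval_odds.items():
    p = probability_from_odds(odds)
    log_odds = math.log(float(odds))  # same as Excel's LN()
    print(f"{interval}: odds={odds}, P={p}, ln(odds)={log_odds:.3f}")
```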
Probabilities and the Corresponding Odds
Probability    Corresponding Odds
0.5            50:50 or 1
0.9            90:10 or 9
0.999          999:1 or 999
0.01           1:99 or 0.0101
0.001          1:999 or 0.001001
Probabilities, odds, and the corresponding log-odds
Probability    Odds                 Log-Odds
0.5            50:50 or 1           0
0.9            90:10 or 9           2.20
0.999          999:1 or 999         6.9
0.01           1:99 or 0.0101       -4.6
0.001          1:999 or 0.001001    -6.9
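The table can be reproduced with a few lines, which also shows that log-odds are symmetric around 0: a probability p and its complement 1 - p have log-odds of equal size and opposite sign. The function name is illustrative.

```python
import math


def log_odds(p):
    """Logit of p: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


for p in (0.5, 0.9, 0.999, 0.01, 0.001):
    odds = p / (1.0 - p)
    print(f"p={p:<6} odds={odds:<10.6g} log-odds={log_odds(p):.2f}")
```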
Kinds of research questions
o Logistic regression allows one to predict a discrete outcome
(DV), such as group membership, from a set of variables (IVs) that may
be continuous, discrete, dichotomous, or a mix.
– If the outcome variable is dichotomous, run binary logistic regression.
– If the outcome variable has 3 or more values, run multinomial logistic regression.
Logistic Regression
o The purpose is to assess the effects of multiple
explanatory variables, which can be numeric and/or
categorical, on the outcome.
o It establishes a classification system based on the logistic model.
– Profiling: the outcome category is already known, and we want to
describe the characteristics of the explanatory variables.
– Classification: use the logistic model to predict the class
probability.
o The function that maps the predictors to the predicted probability is non-linear.
The Logistic Regression Model
log[p/(1-p)] = a + bX
log(odds) = a + bX
o p is the probability that the event Y occurs, p(Y=1)
o p/(1-p) is the odds
o ln[p/(1-p)] is the log odds, or "logit"
o In other words, $p = \frac{e^{a + bX}}{1 + e^{a + bX}}$
o Parameter estimation: maximum likelihood method
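A minimal sketch of maximum likelihood estimation for the one-predictor model. It assumes a made-up data set (hours studied, pass/fail) and uses plain gradient ascent on the log-likelihood. Statistical packages usually use faster Newton-type iterations, so treat this as an illustration of the idea rather than the method any particular tool uses.

```python
import math

# Hypothetical data: hours studied and pass (1) / fail (0).
hours = [60, 80, 100, 120, 140, 160, 180, 200, 220, 240]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

# Centre and scale X so that plain gradient ascent is well behaved.
xs = [(h - 150) / 50 for h in hours]


def prob(a, b, x):
    """p = e^(a + bX) / (1 + e^(a + bX)), written as 1 / (1 + e^-(a + bX))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def log_likelihood(a, b):
    """Sum over cases of y*ln(p) + (1 - y)*ln(1 - p)."""
    return sum(
        y * math.log(prob(a, b, x)) + (1 - y) * math.log(1.0 - prob(a, b, x))
        for x, y in zip(xs, passed)
    )


def score(a, b):
    """Gradient of the log-likelihood with respect to (a, b)."""
    residuals = [y - prob(a, b, x) for x, y in zip(xs, passed)]
    return sum(residuals), sum(r * x for r, x in zip(residuals, xs))


a, b = 0.0, 0.0
for _ in range(5000):
    grad_a, grad_b = score(a, b)
    a += 0.1 * grad_a  # step size 0.1 is an arbitrary choice for this sketch
    b += 0.1 * grad_b

print(f"a={a:.4f}, b={b:.4f}, log-likelihood={log_likelihood(a, b):.4f}")
```

At the maximum the gradient is zero, so the loop stops changing a and b once the estimates have converged.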
Multiple Logistic Regression Model
o The logistic regression model
$p = \frac{e^{\alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}{1 + e^{\alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}$
o Another way to present the model
$\log\frac{p}{1 - p} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$
$\log(\mathrm{odds}) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$
o The quantity $\frac{p}{1 - p}$ is called the odds. It can take any
nonnegative value.
o The quantity $\log\frac{p}{1 - p}$ (the log odds) is referred to as the logit of p.
It can take any value in $(-\infty, +\infty)$.
o The dependent variable in a logistic regression is a logit.
Why do we use logit rather than probability
directly in the logistic regression model?
o You may have had this question in mind for a long time. Hahaha~~
o Here are the reasons:
1. Logistic regression models the probability that an event occurs,
depending on a series of IVs, which can be categorical or numerical.
2. The linear combination of IVs (e.g., $\alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$)
can take any value from $-\infty$ to $+\infty$.
3. However, a DV that is a binary value (0/1) or a probability
(ranging from 0 to 1) is bounded, so we cannot set it equal to the
linear combination of IVs.
4. Thus, we need a DV that can take any value from $-\infty$ to $+\infty$. That
is why the logit is the DV of a logistic regression.
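A sketch of the multiple model applied to the loan example. The coefficients and the applicant record below are hypothetical illustrations, not estimates from any data, and p is taken here to be the probability of approval. The linear combination (the logit) is unbounded, while the transformed value always lies between 0 and 1.

```python
import math

# Hypothetical coefficients for the five loan predictors (not estimated from data).
alpha = -4.0
betas = {
    "homeowner": 0.8,               # X1: 1 if homeowner, else 0
    "credit_score": 0.01,           # X2
    "years_credit_history": 0.05,   # X3
    "revolving_balance": -0.0001,   # X4
    "revolving_utilization": -2.0,  # X5: fraction of credit limit in use
}


def logit(applicant):
    """alpha + beta_1*x_1 + ... + beta_k*x_k; can be any real number."""
    return alpha + sum(betas[name] * applicant[name] for name in betas)


def probability(applicant):
    """p = e^logit / (1 + e^logit), always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit(applicant)))


# Hypothetical applicant.
applicant = {
    "homeowner": 1,
    "credit_score": 700,
    "years_credit_history": 10,
    "revolving_balance": 5000,
    "revolving_utilization": 0.3,
}
z = logit(applicant)
p = probability(applicant)
print(f"logit={z:.2f}, p={p:.4f}")
```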
Assumptions of Logistic Regression
o The advantages of logistic regression are primarily the result of the
general lack of assumptions.
o Logistic regression does not require any specific distributional form for
the independent variables.
o Heteroscedasticity (violation of homoscedasticity) is not a “big” concern.
o Linear relationships between the dependent and independent variables are
not required; the model assumes linearity only between the IVs and the logit.
o The presence of multicollinearity will not lead to biased coefficients,
although it inflates their standard errors.
Performance of the Model
§ Prediction correctness
§ For instance, if the estimated p is greater than or equal to .5,
then consider it to be predicted as 1 (i.e., the loan is rejected).
Specificity = correctly predicted approved cases / all approved cases = 22 / (22 + 1)
Sensitivity = correctly predicted rejected cases / all rejected cases = 26 / (1 + 26)
Overall accuracy = correct predictions / all predictions = (22 + 26) / (22 + 1 + 26 + 1)
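The three ratios can be computed from the four counts on the slide. The variable names are illustrative; as on the slide, "rejected" is the class coded 1.

```python
# Counts from the slide's classification results.
approved_correct = 22  # approved cases predicted as approved
approved_wrong = 1     # approved cases predicted as rejected
rejected_correct = 26  # rejected cases predicted as rejected
rejected_wrong = 1     # rejected cases predicted as approved

specificity = approved_correct / (approved_correct + approved_wrong)
sensitivity = rejected_correct / (rejected_wrong + rejected_correct)
total = approved_correct + approved_wrong + rejected_correct + rejected_wrong
accuracy = (approved_correct + rejected_correct) / total

print(f"specificity={specificity:.3f}")
print(f"sensitivity={sensitivity:.3f}")
print(f"accuracy={accuracy:.3f}")
```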
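Sensitivity vs. Specificity
$\text{Sensitivity} = \frac{n_{1,1}}{n_{1,0} + n_{1,1}}$
$\text{Specificity} = \frac{n_{0,0}}{n_{0,0} + n_{0,1}}$
Here $n_{i,j}$ is the number of cases with actual class i and predicted class j.
We are more interested in maximizing the sensitivity.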
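One way to raise sensitivity is to lower the classification cut-off, at the cost of specificity. The sketch below assumes ten made-up cases with hypothetical predicted probabilities of rejection; `n[(i, j)]` counts cases with actual class i and predicted class j.

```python
# Hypothetical predicted probabilities of rejection and actual outcomes (1 = rejected).
predicted = [0.05, 0.20, 0.35, 0.45, 0.55, 0.30, 0.48, 0.65, 0.80, 0.95]
actual = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def sensitivity_specificity(threshold):
    """Classify as 1 when p >= threshold, then compute both rates."""
    n = {(i, j): 0 for i in (0, 1) for j in (0, 1)}  # n[(actual, predicted)]
    for p, y in zip(predicted, actual):
        n[(y, 1 if p >= threshold else 0)] += 1
    sensitivity = n[(1, 1)] / (n[(1, 0)] + n[(1, 1)])
    specificity = n[(0, 0)] / (n[(0, 0)] + n[(0, 1)])
    return sensitivity, specificity


for threshold in (0.5, 0.3):
    sens, spec = sensitivity_specificity(threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With these made-up numbers, moving the cut-off from 0.5 to 0.3 catches every rejected case but misclassifies more approved ones.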
Interpretation of the Coefficients
log(odds) = a + bX
• In logistic regression, $\beta$ measures the incremental impact of
increasing the IV by 1 unit on the log odds.
• Equivalently, $e^{\beta}$ (Exp(B)) is the factor by which the odds of the outcome
are multiplied when x increases by 1 unit.
• If B = 0, the odds and probability are the same at all x levels ($e^{B} = 1$)
• If B > 0, the odds and probability increase as x increases ($e^{B} > 1$)
• If B < 0, the odds and probability decrease as x increases ($e^{B} < 1$)
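A quick numerical check of this interpretation, using hypothetical values a = -3 and b = 0.7: whatever the starting x, a one-unit increase multiplies the odds by the same factor, e^b.

```python
import math

a, b = -3.0, 0.7  # hypothetical intercept and slope


def odds(x):
    """Odds implied by log(odds) = a + b*x."""
    return math.exp(a + b * x)


for x in (0, 2, 4, 6):
    ratio = odds(x + 1) / odds(x)
    print(f"x={x}: odds={odds(x):.4f}, odds(x+1)/odds(x)={ratio:.4f}")

print(f"e^b = {math.exp(b):.4f}")
```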
Example
o Look at the odds ratios
Exp(B), the odds ratio, is the effect size of the
predictor. The closer the odds ratio is to 1,
the smaller the effect.
Interpretation:
o OR>1: increased frequency of exposure among cases
o OR=1: no change in frequency of exposure
o OR<1: decreased frequency of exposure
How to calculate the change in odds due to a one-unit change
in the independent variable?
                                       Value
Exponentiated coefficient ($e^{B}$)    0.20     0.50     1.0    1.5     1.7
$e^{B} - 1$                            -0.80    -0.50    0.0    0.50    0.70
Percentage change in odds              -80%     -50%     0%     50%     70%
In the above table, if the exponentiated coefficient
is .20, a one-unit change in the independent
variable will reduce the odds by 80%.
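The table row can be computed as (e^B - 1) x 100. The percentage applies to the odds. The resulting change in probability depends on the starting probability, as this sketch shows with illustrative function names and a hypothetical starting probability of 0.5.

```python
def percent_change_in_odds(exp_b):
    """Percentage change in odds for a one-unit increase: (e^B - 1) * 100."""
    return (exp_b - 1.0) * 100.0


def updated_probability(p, exp_b):
    """Probability after a one-unit increase, starting from probability p."""
    new_odds = (p / (1.0 - p)) * exp_b
    return new_odds / (1.0 + new_odds)


for exp_b in (0.20, 0.50, 1.0, 1.5, 1.7):
    change = percent_change_in_odds(exp_b)
    p_new = updated_probability(0.5, exp_b)  # hypothetical starting p = 0.5
    print(f"e^B={exp_b}: odds change={change:+.0f}%, p: 0.5 -> {p_new:.3f}")
```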
Use logistic regression to estimate the class probability
Logistic regression is widely used to estimate class probabilities
such as
– the probability of default on credit,
– the probability of response to an offer,
– the probability of fraud on an account,
– the probability that a document is relevant to a topic, and so on.
Tutorial: Use Logistic Regression To Predict The
Probability Of Marketing Campaign Response