
Logistic Regression

Dr. Sayak Roychowdhury


Department of Industrial & Systems Engineering,
IIT Kharagpur
Reference Books
• James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112). New York: Springer.
• Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.
• Dobson, A. J., & Barnett, A. G. (2018). An Introduction to Generalized Linear Models. Chapman and Hall/CRC.
Example 1: Qualitative Response
• Suppose there are three possible diagnoses for the medical condition of a patient who has just arrived in the ER:

$$Y = \begin{cases} 1 & \text{stroke} \\ 2 & \text{drug overdose} \\ 3 & \text{epileptic seizure} \end{cases}$$

• Using this coding, a least squares regression could be fitted.

• What is the problem? The coding imposes an ordering on the outcomes and implies equal spacing between them, neither of which is justified for an unordered qualitative response.
Example 2: Default on Credit Card Payment
• A credit card company wishes to identify which customers will default in the next billing cycle, given their balance and income:

$$Y = \begin{cases} 0 & \text{No} \\ 1 & \text{Yes} \end{cases}$$

• Can you fit OLS here? You can, but the fitted values can fall below 0 or above 1, so they cannot be interpreted as probabilities (see the sketch below).


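A minimal R sketch, assuming the Default data from the ISLR package (the source of this example in James et al., 2013): an OLS fit to the 0/1 coding yields fitted "probabilities" outside [0, 1].

library(ISLR)
# Code default as 0/1 and regress on balance by ordinary least squares
ols <- lm(as.numeric(default == "Yes") ~ balance, data = Default)
range(fitted(ols))   # the lower end is negative: not a valid probability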
Example: Default or Not
Contingency Table
Logit Model
• To keep the response (here the probability $p(X)$) between 0 and 1 for all values of $X$, the logistic function is used:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

This results in the link function:

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$

• The quantity $\log\big(p(X)/(1 - p(X))\big)$ is called the log-odds or logit.
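A quick numeric illustration (the coefficient values are made up, not estimates from data): the logistic function keeps $p(X)$ strictly between 0 and 1, and the logit transform recovers the linear predictor.

beta0 <- -10; beta1 <- 0.005          # illustrative coefficients
x <- seq(0, 3000, by = 500)
p <- exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))
p                                     # all values lie in (0, 1)
all.equal(log(p / (1 - p)), beta0 + beta1 * x)   # log-odds equal the linear predictor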
Parameter Estimation
• Logistic regression models are usually fitted by maximum likelihood, using the conditional likelihood of $Y$ given $X$, $\Pr(Y \mid X)$.
• Since $\Pr(Y \mid X)$ completely specifies the conditional distribution, the multinomial distribution is appropriate.
• The likelihood for $N$ observations with two classes 0 and 1 is

$$L(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} p(x_i; \beta) \prod_{i':\, y_{i'} = 0} \big(1 - p(x_{i'}; \beta)\big)$$

where $p(x_i; \beta) = \Pr(Y = 1 \mid X = x_i; \beta)$.
• The estimates $\hat\beta_0$ and $\hat\beta_1$ are chosen to maximize the likelihood function.
Fitting the Logistic Regression Model
Parameter Estimation
• For the two-class case the log-likelihood can be written as

$$\ell(\beta_0, \beta_1) = \log L(\beta_0, \beta_1) = \sum_{i=1}^{N}\big\{y_i \log p(x_i; \beta) + (1 - y_i)\log\big(1 - p(x_i; \beta)\big)\big\}$$

where $p(x_i; \beta) = \Pr(Y = 1 \mid X = x_i; \beta)$.
• The estimates $\hat\beta_0$ and $\hat\beta_1$ are chosen to maximize the log-likelihood.
• The score equations $\partial \ell / \partial \beta = 0$ are solved using the Newton-Raphson method, as the sketch below illustrates.
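A minimal sketch of the Newton-Raphson iteration for the two-class model, on simulated data (all variable names are illustrative); the result agrees with R's glm().

set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(-0.5 + 1.2 * x))   # simulated 0/1 response
X <- cbind(1, x)                              # design matrix with intercept
beta <- c(0, 0)                               # starting values
for (iter in 1:25) {
  p <- as.vector(plogis(X %*% beta))          # p(x_i; beta)
  score <- t(X) %*% (y - p)                   # gradient of the log-likelihood
  hess  <- -t(X) %*% (X * (p * (1 - p)))      # Hessian: -X' W X
  beta  <- beta - solve(hess, score)          # Newton-Raphson update
}
cbind(newton = as.vector(beta), glm = coef(glm(y ~ x, family = binomial)))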
Inference
• Hypothesis tests in a statistical modelling framework are performed by
comparing how well two related models fit the data.
• For generalized linear models, the two models should have the same
probability distribution and the same link function, but the linear component
of one model has more parameters than the other.
• The simpler model, corresponding to the null hypothesis 𝐻0 , must be a
special case of the other more general model.
• If the simpler model fits the data as well as the more general model does,
then it is preferred on the grounds of parsimony and 𝐻0 is retained.
• The goodness of fit statistics may be based on the maximum value of the
likelihood function, the maximum value of the log-likelihood function, the
minimum value of the sum of squares criterion or a composite statistic based
on the residuals.

Source: An Introduction to Generalized Linear Models, A.J. Dobson, A.G. Barnett (Ch 5)
Inference
• Specify a model 𝑀0 as null hypothesis 𝐻0 and a more general model
(with more terms) 𝑀1 as 𝐻1
• Fit $M_0$ and calculate the corresponding goodness-of-fit statistic (e.g. the likelihood function) $G_0$; do the same for $M_1$ to get $G_1$
• Compute the improvement, $G_1 - G_0$ or $G_1 / G_0$, and compare it with the corresponding sampling distribution
• Test the null hypothesis $G_1 = G_0$; if it is not rejected, $M_0$ is the preferred model, otherwise $M_1$ is the preferred model

Source: An Introduction to Generalized Linear Models, A.J. Dobson, A.G. Barnett (Ch 5)
Log-Likelihood Ratio Statistic
• A model with the maximum number of possible parameters is called a saturated model.
• If there are $N$ observations $Y_1, \ldots, Y_N$, all with potentially different values of the linear component $\mathbf{x}_i^T\boldsymbol\beta$, then a saturated model can be specified with $N$ parameters. This is also called the maximal or full model.
• Let $\boldsymbol\beta_m$ be the parameter vector for the full model and $\mathbf{b}_m$ the maximum likelihood estimator of $\boldsymbol\beta_m$.
• Let $\mathbf{b}$ be the maximum likelihood estimator of the parameter vector of the model of interest. The likelihood ratio is given by

$$\lambda = \frac{L(\mathbf{b}_m, \mathbf{y})}{L(\mathbf{b}, \mathbf{y})}$$

• $\log\lambda = \log L(\mathbf{b}_m, \mathbf{y}) - \log L(\mathbf{b}, \mathbf{y}) = \ell(\mathbf{b}_m, \mathbf{y}) - \ell(\mathbf{b}, \mathbf{y})$
• $2\log\lambda$ is approximately chi-squared distributed, so it is the commonly used statistic, called the deviance.
Deviance
• The log-likelihood ratio $\log\lambda$ can be used for model comparison and hypothesis testing because $2\log\lambda$ is approximately chi-squared distributed.
• The deviance, or log-likelihood ratio statistic, is given by

$$D = 2\big[\ell(\mathbf{b}_m, \mathbf{y}) - \ell(\mathbf{b}, \mathbf{y})\big] \sim \chi^2(m - p, \nu)$$

where $m$ is the number of parameters in the full model and $p$ is the number of parameters in the model of interest.
• The constant $\nu$ is the non-centrality parameter, which is close to 0 if the model of interest fits the data almost as well as the full model.
• Hence the deviance is compared with $\chi^2(m - p)$ to test the hypothesis.
• The deviance forms the basis of hypothesis tests for most GLMs; a short R sketch follows.
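A minimal R sketch on simulated data: glm() reports the deviance of the fitted model and of the null (intercept-only) model, and their difference can be referred to a chi-squared distribution.

set.seed(2)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.3 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)
fit$deviance         # deviance of the model of interest
fit$null.deviance    # deviance of the null model
pchisq(fit$null.deviance - fit$deviance, df = 1, lower.tail = FALSE)  # p-value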
Deviance of a Normal Model
• $E(Y_i) = \mu_i = \mathbf{x}_i^T\boldsymbol\beta$, where $Y_i \sim N(\mu_i, \sigma^2)$, $i = 1, 2, \ldots, N$
• Log-likelihood:

$$\ell(\boldsymbol\beta, \mathbf{y}) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - \mu_i)^2 - \frac{N}{2}\log(2\pi\sigma^2)$$

• For a saturated model, all the $\mu_i$ are different, so $\boldsymbol\beta$ has $N$ elements.
• Setting $\partial \ell / \partial \mu_i = 0$ gives $\hat\mu_i = y_i$.
• So the saturated-model log-likelihood is

$$\ell(\mathbf{b}_m, \mathbf{y}) = -\frac{N}{2}\log(2\pi\sigma^2)$$

• For any other model with $p$ parameters, $p < N$: $\hat{\boldsymbol\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
• The fitted values are $\hat y_i = \mathbf{x}_i^T\hat{\boldsymbol\beta}$, so

$$\ell(\hat{\boldsymbol\beta}, \mathbf{y}) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(y_i - \mathbf{x}_i^T\hat{\boldsymbol\beta}\big)^2 - \frac{N}{2}\log(2\pi\sigma^2)$$
Deviance of a Normal Model
• Deviance:

$$D_p = 2\big[\ell(\mathbf{b}_m, \mathbf{y}) - \ell(\hat{\boldsymbol\beta}, \mathbf{y})\big] = \frac{1}{\sigma^2}\sum_{i=1}^{N}\big(y_i - \mathbf{x}_i^T\hat{\boldsymbol\beta}\big)^2$$

• In the case where there is only one parameter $\mu$, $E(Y_i) = \mu$:
• $\mathbf{X}$ is a vector of $N$ ones, and $\hat\mu = \bar y = \frac{1}{N}\sum_{i=1}^{N} y_i$
• Deviance of the null model:

$$D_0 = \frac{1}{\sigma^2}\sum_{i=1}^{N}(y_i - \bar y)^2$$

• This statistic is related to the sample variance:

$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \bar y)^2 = \frac{\sigma^2 D_0}{N-1}$$

• $D_p \sim \chi^2(N - p)$: deviance of the model with $p$ parameters
• $D_0 \sim \chi^2(N - 1)$: deviance of the null model
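A quick check on simulated data: for a normal linear model, R's deviance() returns the residual sum of squares, i.e. $\sigma^2 D_p$.

set.seed(3)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
deviance(fit)            # residual sum of squares
sum(residuals(fit)^2)    # identical, by definition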
Nested or Hierarchical Models
• Consider a model $M_0$ with a smaller number $q$ of parameters and a more general model $M_1$ with $p$ parameters, where $q < p < N$ and $N$ is the number of observations.

$$H_0: \boldsymbol\beta = \boldsymbol\beta_0 = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_q \end{pmatrix}; \qquad H_1: \boldsymbol\beta = \boldsymbol\beta_1 = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}$$

• $H_0$ can be tested against $H_1$ using the difference in deviance statistics:

$$\Delta D = D_0 - D_1 = 2\big[\ell(\mathbf{b}_1, \mathbf{y}) - \ell(\mathbf{b}_0, \mathbf{y})\big] \sim \chi^2(p - q)$$

• If $\Delta D$ is consistent with $\chi^2(p - q)$, the smaller model is generally chosen, but judgement should be used in deciding which factors are important. A minimal R sketch follows.
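A minimal sketch with simulated data: anova() computes $\Delta D$ for two nested logistic models and compares it with $\chi^2(p - q)$.

set.seed(4)
x1 <- rnorm(300); x2 <- rnorm(300)
y <- rbinom(300, 1, plogis(0.2 + 0.7 * x1))   # x2 is truly irrelevant
M0 <- glm(y ~ x1, family = binomial)          # q = 2 parameters
M1 <- glm(y ~ x1 + x2, family = binomial)     # p = 3 parameters
anova(M0, M1, test = "Chisq")                 # Delta-D on p - q = 1 df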
Nested or Hierarchical Models
• For normal linear models, the deviance expression has 𝜎 2 in it, which
is usually not known.
• Hence an 𝐹 statistic is used where
$$F = \frac{(D_0 - D_1)/(p - q)}{D_1/(N - p)} \sim F(p - q, N - p)$$

Please review the likelihood-ratio notes for the linear regression model; the expression is the same.
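The corresponding sketch for a normal linear model, again on simulated data: anova() applied to two nested lm() fits performs exactly this F test.

set.seed(5)
x1 <- rnorm(60); x2 <- rnorm(60)
y <- 1 + 2 * x1 + rnorm(60)
M0 <- lm(y ~ x1)
M1 <- lm(y ~ x1 + x2)
anova(M0, M1)   # F = ((D0 - D1)/(p - q)) / (D1/(N - p))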
AIC and BIC (Chapter 7.5, Dobson)
• AIC (Akaike Information Criterion) and the Schwarz or BIC (Bayesian Information Criterion) are log-likelihood statistics with an adjustment for the number of parameters:
• $AIC = -2\,\ell(\hat\pi, \mathbf{y}) + 2p$, where $p$ is the number of parameters
• $BIC = -2\,\ell(\hat\pi, \mathbf{y}) + p \ln(N)$, where $N$ is the number of observations (this is the form R's BIC() uses)
• BIC imposes a greater penalty on the number of parameters.
• A small value of these statistics, like a large p-value in a goodness-of-fit test, indicates that the model fits the data well.
• They are not recommended for nested models; usually they are used to compare models that are not nested.
• The above formulae are the ones used in R (see the sketch below); other software may use similar but different expressions.
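A minimal sketch on simulated data; AIC() and BIC() are generic functions in R and apply to most fitted GLMs.

set.seed(6)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.5 * x))
fit <- glm(y ~ x, family = binomial)
AIC(fit)   # -2 * logLik + 2 * p
BIC(fit)   # -2 * logLik + p * log(N)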
Example: Stock Market Data

Matrix Plot:
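A sketch of the plot referred to above, assuming the Smarket data from the ISLR package (daily S&P 500 returns, as used in James et al., 2013):

library(ISLR)
pairs(Smarket)   # matrix (scatter-plot) view of all variables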
Example: Stock Market Data
Correlation Analysis
corrplot::corrplot(cor(Smarket[,-9]))   # column 9 (Direction) is a qualitative factor, so it is excluded
Fitting the Logistic Regression Model
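A sketch of the fit for this example, following the ISLR lab: predict Direction from the five lag returns and volume.

library(ISLR)
glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
               data = Smarket, family = binomial)
summary(glm.fit)   # coefficients, z statistics and p-values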
Prediction on Training Data

Probability of market going “Up”


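Continuing the fit above, the training-data probabilities are obtained with predict(); by R's factor coding these are P(Direction = "Up").

glm.probs <- predict(glm.fit, type = "response")  # fitted P(Up) on training data
contrasts(Smarket$Direction)                      # confirms "Up" is coded as 1
head(glm.probs)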
Training Prediction Accuracy

Fraction of correct prediction


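Continuing the example, classify as "Up" when the fitted probability exceeds 0.5 and compute the fraction of correct predictions.

glm.pred <- ifelse(glm.probs > 0.5, "Up", "Down")
table(glm.pred, Smarket$Direction)     # confusion (contingency) table
mean(glm.pred == Smarket$Direction)    # fraction correct on training data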
Example: Test Data Accuracy
• Data Subsetting

• Fit the model on the subset of years < 2005, then evaluate on the held-out 2005 data (sketch below)


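A sketch of the held-out evaluation (continuing the example): fit on years before 2005 and assess accuracy on the 2005 observations.

train <- Smarket$Year < 2005
glm.fit2 <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = Smarket, family = binomial, subset = train)
probs.2005 <- predict(glm.fit2, newdata = Smarket[!train, ], type = "response")
pred.2005 <- ifelse(probs.2005 > 0.5, "Up", "Down")
mean(pred.2005 == Smarket$Direction[!train])   # test-set accuracy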