
Statistics

Logistic Regression

Shaheena Bashir

FALL, 2019
Outline

Background

Introduction
Logit Transformation
Assumptions

Estimation

Example
Analysis
How Good is the Fitted Model?

Single Categorical Predictor

Types of Logistic Regression Models


Background

Motivating Example

Background

Scatter Plot
Relationship between Age & CHD

[Figure: scatter plot of CHD status (0 = absent, 1 = present) against Age in years, 20-70; the raw 0/1 responses form two horizontal bands of points.]

Not informative!!
Background

Regression Model: Objective

- Describe the relationship between an outcome (dependent or response) variable and a set of independent (predictor or explanatory) variables by some regression model (equation).
- Predict some future outcome based on the regression model.

How do we model the relationship of CHD with age?

Background

Background

- What distinguishes the logistic regression model from the linear regression model is that the outcome variable is binary (or dichotomous), e.g.:
  - Whether a tumor is malignant (Yes = 1) or not (No = 0)
  - Whether a newborn baby has low birth weight (Yes = 1) or not (No = 0)
  - Whether a student gets admission at LUMS (Yes = 1) or not (No = 0)

For a categorical response variable, the assumption that the errors follow a normal distribution fails.

Background

Tabular Form of CHD Data

Age Group     n   CHD Present   Proportion with CHD
20-29        10             1                  0.10
30-34        15             2                  0.13
35-39        12             3                  0.25
40-44        15             5                  0.33
45-49        13             6                  0.46
50-54         8             5                  0.63
55-59        17            13                  0.76
60-69        10             8                  0.80
Total       100            43
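The last column is simply CHD cases divided by group size. A quick check of the table (counts copied from above):

```python
# Grouped CHD counts from the table: cases and group sizes per age group.
n_per_group   = [10, 15, 12, 15, 13, 8, 17, 10]
chd_per_group = [1, 2, 3, 5, 6, 5, 13, 8]

# Proportion with CHD in each age group.
proportions = [c / n for c, n in zip(chd_per_group, n_per_group)]
for c, n, p in zip(chd_per_group, n_per_group, proportions):
    print(f"{c:2d}/{n:3d} = {p:.2f}")
```

The proportions rise steadily from the youngest to the oldest group.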

Background

Proportion of Individuals with CHD


Relationship between Age & CHD

[Figure: proportion with CHD in each age group plotted against age (20-70 years); the points rise from about 0.10 to 0.80 in an S-shaped pattern.]

Introduction

Logistic Regression Model

- The response variable in logistic regression is categorical, so the linear regression model Y = Xβ + ε does not work well, for a few reasons:
  - The response values, 0 and 1, are arbitrary, so modeling the actual values of Y is not of direct interest.
  - Our interest is in modeling the probability that an individual in the population responds with 0 or 1.
  - The error terms in this case do not follow a normal distribution.

Thus, we might consider modeling P, the probability, as the response variable.

Introduction

Sigmoid Function

Modeling the probability as the response raises some problems:

- Although the probability generally increases with age, P, like all probabilities, can only fall within the boundaries of 0 and 1.
- It is better to assume that the relationship between age and P is sigmoidal (S-shaped) rather than a straight line.
- It is possible, however, to find a linear relationship between age and a function of P. Although a number of functions work, one of the most useful is the logit function.

Introduction
Logit Transformation

Logit Function
The logit function ln(p/(1-p)) (also called the log-odds) is simply the log of the ratio of P(Y = 1) to P(Y = 0):

    ln(p/(1-p)) = Xβ

The odds are

    p/(1-p) = exp(Xβ).

Solving for p,

    p = Pr(Y = 1 | X = x) = exp(y)/[1 + exp(y)] = 1/[1 + exp(-y)],

gives the standard logistic function, where y = Xβ.
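The logit and the standard logistic function are inverses of one another. A minimal numeric sketch (the function names are mine, not from the slides):

```python
import math

def logit(p):
    # log-odds: ln(p / (1 - p))
    return math.log(p / (1 - p))

def logistic(y):
    # standard logistic function: p = 1 / (1 + exp(-y))
    return 1.0 / (1.0 + math.exp(-y))

# Applying one after the other recovers the original probability.
for p in (0.1, 0.5, 0.9):
    assert abs(logistic(logit(p)) - p) < 1e-12
```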
Introduction
Logit Transformation

Logit Function

g(x) = ln(p/(1-p)) has many of the desirable properties of a linear regression model:

- It is continuous.
- It is linear in the parameters.
- It can range from -∞ to +∞, depending on the range of x.

Introduction
Logit Transformation

Summary: Logit Transformation

Quantity                Formula        min    max
Probability             p                0      1
Odds                    p/(1-p)          0      ∞
Logit or 'log-odds'     ln(p/(1-p))     -∞      ∞

Logit stretches the probability scale
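The "stretching" can be seen numerically: probabilities squeezed near 0 or 1 map to large negative or positive logits, while p = 0.5 maps to 0. A small illustration:

```python
import math

# Walk a few probabilities through the odds and logit transformations.
for p in (0.01, 0.10, 0.50, 0.90, 0.99):
    odds = p / (1.0 - p)
    log_odds = math.log(odds)
    print(f"p = {p:.2f}   odds = {odds:8.4f}   logit = {log_odds:+8.4f}")
```

Note the symmetry: p and 1 - p give logits of equal size and opposite sign.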

Introduction
Assumptions

Assumptions

Linear Regression          Logistic Regression
ε ~ N(0, σ²)               ε ~ Bin(p)
Y = Xβ + ε                 ln(p/(1-p)) = Xβ + ε
Y|X ~ N(Xβ, σ²)            Y|X ~ Bin(p)

Estimation

Estimation of Parameters of Regression Model: β

- The method of maximum likelihood yields values for the unknown parameters that maximize the probability of obtaining the observed set of data.
- For logistic regression, the likelihood equations are non-linear in the parameters β and require special methods for their solution.
- These methods are iterative in nature and have been programmed into available logistic regression software.
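As a sketch of what such software does internally, here is Newton-Raphson (equivalently, iteratively reweighted least squares) for a one-predictor model. The data and names below are illustrative toy values, not the CHD data:

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(x, y, iters=25):
    """Newton-Raphson for logit(p) = b0 + b1 * x."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)      # observation weight
            g0 += yi - p           # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w               # entries of X'WX (negative Hessian)
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + (X'WX)^{-1} * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Toy data where the event becomes more likely as x grows.
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
outc = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
b0, b1 = fit_logistic(ages, outc)
```

At the maximum, the score equation for the intercept forces the fitted probabilities to reproduce the observed number of events, which is a handy convergence check.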

Example

Example: CHD Data

- Is age a risk factor for CHD? How does the probability of CHD change with age?
- Outcome variable: CHD (Yes, No)
- Predictor: Age (in years)

Logistic regression models the probability of some event occurring as a linear function of a set of predictors.

Example
Analysis

CHD Analysis

ln(p̂/(1-p̂)) = -5.31 + 0.11 Age

- The coefficient is interpreted as the marginal increase in the log-odds of CHD when age increases by 1 year.

             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     -5.31        1.13    -4.68      0.00
age              0.11        0.02     4.61      0.00

OR = exp(0.11) = 1.116
The odds of getting CHD are · · · · · · when age increases by 1 year.

Example
Analysis

Fitted Values

p = exp(β0 + β1 X) / [1 + exp(β0 + β1 X)]
  = exp(-5.31 + 0.11 Age) / [1 + exp(-5.31 + 0.11 Age)]
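Plugging a few ages into the fitted equation gives the predicted probabilities. A small Python check using the coefficients reported above (the function name is mine):

```python
import math

def p_chd(age, b0=-5.31, b1=0.11):
    # Fitted probability: exp(b0 + b1*age) / (1 + exp(b0 + b1*age)).
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))

for age in (30, 50, 70):
    print(f"age {age}: predicted P(CHD) = {p_chd(age):.3f}")
```

The predicted probability climbs from roughly 0.12 at age 30 to over 0.9 at age 70, matching the S-shaped pattern in the grouped data.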

Example
Analysis

R Software

mod1 <- glm(chd ~ age, family = "binomial", data = chdage)
summary(mod1)
predict(mod1, type = "response")
anova(mod1, test = "Chisq")
plot(mod1)

Example
Analysis

Predicted Probabilities





[Figure: predicted probabilities from the fitted model plotted against Age (20-70 years), tracing a smooth S-shaped curve from about 0.1 to above 0.8.]
Example
How Good is the Fitted Model?

Analysis of Deviance
Model: binomial, link: logit
Terms added sequentially (first to last)

       Df  Deviance  Resid. Df  Resid. Dev  Pr(>Chi)
NULL                        99      136.66
Age     1     29.31         98      107.35  6.168e-08 ***

- Deviance is a measure of goodness of fit of a generalized linear model; or rather, it is a measure of badness of fit.
- If our new model explains the data better than the null model, there should be a significant reduction in the deviance, which can be tested against the chi-square distribution to give a p-value.
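The tabled p-value is just the upper tail of a chi-square(1) distribution evaluated at the deviance drop. For 1 degree of freedom that tail equals erfc(sqrt(x/2)), so it can be checked with the Python standard library alone:

```python
import math

null_dev, resid_dev = 136.66, 107.35
drop = null_dev - resid_dev              # 29.31 on 1 df
# Chi-square(1) upper tail: P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(drop / 2.0))
print(f"deviance drop = {drop:.2f}, p = {p_value:.3e}")
```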
Example
How Good is the Fitted Model?

Hosmer-Lemeshow Goodness of Fit

How well our model fits depends on the difference between the
model and the observed data.

library(ResourceSelection)
hoslem.test(as.numeric(chdage$chd)-1, fitted(mod1))
R Output
Hosmer and Lemeshow goodness of fit (GOF) test
data: as.numeric(chdage$chd) - 1, fitted(mod1)
X-squared = 2.2243, df = 8, p-value = 0.9734

Our model appears to fit well: there is no significant difference between the model and the observed data (the p-value > 0.05).
Single Categorical Predictor

Simple Logistic Regression Model with a Categorical Predictor

- How some function of the probability of a categorical response is linearly related to a predictor.
- Interpretation of the resulting intercept β0 and slope β1 when the predictor variable is also binary.

Single Categorical Predictor

Case-Control Study: A Recap Example

Past exposure   CHD Cases   Controls (without disease)
Smokers               112                          176
Non-smokers            88                          224
Totals                200                          400

Odds of CHD for Smokers = · · ·
Odds of CHD for Non-Smokers = · · ·

Single Categorical Predictor

Case-Control Study: A Recap Example Cont’d

Let yi be the binary response variable:

- yi = 1 if CHD = yes
- yi = 0 if CHD = no

Past exposure    yi    ni
Smokers         112   288
Non-smokers      88   312

Then yi ~ Bin(ni, pi).
Let xi be the binary predictor of past smoking:
- xi = 1 if past smoker
- xi = 0 if non-smoker in the past

Single Categorical Predictor

Case-Control Study: A Recap Example Cont’d

The probability of CHD, pi, can be modeled as:

    logit(pi) = β0 + β1 xi

- If xi = 1, then logit(pi | xi = 1) = β0 + β1 (1)
- If xi = 0, then logit(pi | xi = 0) = β0

So

    β1 = logit(pi | xi = 1) - logit(pi | xi = 0) = log[ odds(xi = 1) / odds(xi = 0) ]

∴ OR = · · · · · ·
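For the 2×2 case-control table earlier (112 cases/176 controls among smokers, 88/224 among non-smokers), the log odds ratio computed straight from the counts should match the slope a logistic fit would report. A quick Python check:

```python
import math

# Odds of CHD in each exposure group: cases / controls.
odds_smokers    = 112 / 176
odds_nonsmokers = 88 / 224

OR = odds_smokers / odds_nonsmokers
b1 = math.log(OR)                # slope = log odds ratio
b0 = math.log(odds_nonsmokers)   # intercept = log-odds when x = 0
print(f"OR = {OR:.3f}, b0 = {b0:.2f}, b1 = {b1:.2f}")
```

These round to the (Intercept) and pastsmoke1 estimates in the glm output on the next slide.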

Single Categorical Predictor

Example: Logistic Regression

             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     -0.93        0.13    -7.43      0.00
pastsmoke1       0.48        0.17     2.76      0.01

- For past smokers (xi = 1): ln(odds of CHD) = β0 + β1, ∴ odds for smokers = · · ·
- For past non-smokers (xi = 0): ln(odds of CHD) = β0, ∴ odds for non-smokers = · · ·

OR = · · ·

Types of Logistic Regression Models

Types of Logistic Regression Model

- Binary Logistic Regression Model: the categorical response is dichotomous (has only two possible outcomes), e.g., an email is spam or not.
- Multinomial Logistic Regression Model: three or more categories without ordering (polytomous response), e.g., predicting food choices (Veg, Non-Veg, Vegan).
- Ordinal Logistic Regression Model: three or more categories with ordering, e.g., a movie rating from 1 to 5, teaching evaluations by students, etc.
