Adam Korczyński
The Collegium of Economic Analysis of Warsaw School of Economics
Institute of Statistics and Demography
Statistical Methods and Business Analytics Unit
e-mail: akorczy@sgh.waw.pl
Contingency table and measures of association
Agenda
• Introductory notes: scope of the class, schedule and requirements
• Basic notions
• Contingency table
• Comparing two proportions: measures in the analysis of categorical data
• Types of studies
• Calculating risk, risk ratio, odds and odds ratio
• Proc Freq in SAS. Cross tabulation with R and Python
Logistic Regression with SAS ©AK, SGH 2020
Contingency table
• Y and X: two categorical variables with k and l categories respectively
• We can present the joint distribution of the two variables using a table with k rows for the categories of Y and l columns for the categories of X
• The cells represent the possible outcomes
• A table holding the frequency counts of outcomes for a sample is called a contingency table (term introduced by Karl Pearson, 1904)
• 2×2 table: columns X = x1, X = x2

Relative risk: RR10 = P(Y = i | x = 1) / P(Y = i | x = 0)

Odds (comparing the probability with the complementary probability): Oxj = P(Y = i | x = j) / (1 − P(Y = i | x = j))

Odds ratio: ORx10 = Ox1 / Ox0
Types of studies
• Follow-up study (prospective)
• Risk factor → Observation → Outcome
• Assigning tax regimen A or B →Observation → Survived or not
• Case-control study (retrospective)
• Outcome → Observation → Risk factor
• Survived or not → what was the tax regimen under which the companies were operating?
• Cross-sectional
• Snapshot of data at a given time point. Allows assessing coexistence of
features.
Retrospective studies
• Research organization decides to study the status of the Polish companies
in a specific industry in years 2015-2016. There were n1 companies which
went bankrupt in this period. For the purpose of the comparison n2
companies operating throughout the whole period have been added to the
sample. The tax regimens under which the companies were operating were
identified afterwards.
• In this setting X is a random variable while Y is fixed by the sampling design. We first identify the survival status and only then identify the characteristics. As a consequence we can estimate P(X=j|y=i), i.e. the probability of being in tax regimen A or B given that a company survived or did not survive. This does not address the research question, though. The counts for Y are not collected through a random process, and therefore the sample proportion is not a good estimate of P(Y=i|X=j), the probability of survival conditional on tax regimen.
Prospective sample

               Regimen A   Regimen B   ni.
Survived             95          44     139
Not survived       1095         766    1861
n.j                1190         810    2000

RISKS: Regimen A = 0.079832, Regimen B = 0.054321; RR = 1.46963
ODDS:  Regimen A = 0.08676,  Regimen B = 0.05744;  OR = 1.51038

Retrospective sample

               Regimen A   Regimen B   ni.
Survived            616         284     900
Not survived        530         370     900
n.j                1146         654    1800

RISKS: Regimen A = 0.537522, Regimen B = 0.434251; RR = 1.23781
ODDS:  Regimen A = 1.16226,  Regimen B = 0.76757;  OR = 1.51422

The odds of survival within the studied time period are about 50% greater for companies operating under tax regimen A (OR ≈ 1.51 in both designs, while the risk ratio is not reproduced in the retrospective sample).
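The stability of the odds ratio across the two designs can be checked numerically; a minimal Python sketch using the two tables above:

```python
def rr_and_or(surv_a, surv_b, died_a, died_b):
    """Risk ratio and odds ratio of survival, regimen A vs B,
    from a 2x2 table (rows: survived / not survived)."""
    risk_a = surv_a / (surv_a + died_a)   # P(survived | regimen A)
    risk_b = surv_b / (surv_b + died_b)   # P(survived | regimen B)
    return risk_a / risk_b, (surv_a / died_a) / (surv_b / died_b)

rr_pro, or_pro = rr_and_or(95, 44, 1095, 766)    # prospective sample
rr_ret, or_ret = rr_and_or(616, 284, 530, 370)   # retrospective sample
# the RR changes with the study design, while the OR stays close to 1.51
```

Note that in the retrospective sample the "risks" do not estimate P(survival | regimen), since the survival counts were fixed by design; only the odds ratio transfers between designs.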
Exercise
• Assess the risk of lung cancer among smokers numerically in order to
answer the following question: What is the risk of having LC while
being a smoker as compared to not being a smoker?
• Use Excel and Proc Freq in SAS
                 Smoker
Lung Cancer    Yes     No
Yes            688     21
No             650     59

Data source: Agresti, 2002, Table 2.5.
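The exercise calls for Excel and PROC FREQ; the same quantities can also be obtained with a short Python sketch over the counts in the table:

```python
# counts from the table: rows = lung cancer yes/no, columns = smoker yes/no
lc_smoker, lc_nonsmoker = 688, 21
no_lc_smoker, no_lc_nonsmoker = 650, 59

risk_smoker = lc_smoker / (lc_smoker + no_lc_smoker)              # P(LC | smoker)
risk_nonsmoker = lc_nonsmoker / (lc_nonsmoker + no_lc_nonsmoker)  # P(LC | non-smoker)
rr = risk_smoker / risk_nonsmoker                                 # relative risk
odds_ratio = (lc_smoker * no_lc_nonsmoker) / (lc_nonsmoker * no_lc_smoker)
```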
Agenda
• Origins of logistic regression
• Distributions for categorical data
• Introduction to modelling of binary response variable
• The basic concepts of logistic regression
• Fitting a binary logistic regression model using the LOGISTIC procedure
• Describing the standard output from the LOGISTIC procedure with one continuous and one categorical predictor variable
• Reading and interpreting odds ratio tables and plots
Population growth

[SAS] l02_OriginsOfLogReg01.sas

Ẇ(t) = βW(t)

The rate of growth of the growing quantity W(t) is proportional to the level already attained; the solution is the exponential model W(t) = A·e^(βt).
data a01;
A=100;
beta=&beta.;
do t=0 to 100 by 0.01;
Wt=A*exp(beta*t);
output;
end;
run;
• In fact the exponential model has no upper limit. This aspect of the
described processes was criticized by Adolphe Quetelet (1795-1874) and
investigated by his pupil Pierre-Francois Verhulst (1804-1849).
• Their proposal was to introduce the term representing the resistance to
further growth. The model with Ω as the upper limit (saturation level) of
W(t) was suggested following some investigation on potential limiting
terms:
Ẇ(t) = βW(t)·(1 − W(t)/Ω)

• In this model "growth is proportional both to the population already attained W(t) and to the remaining room for further expansion Ω − W(t)" (Cramer, 2002, p. 4).
Solving the differential equation Ẇ(t) = βW(t)·(1 − W(t)/Ω)

dW(t)/dt = βW(t)·(1 − W(t)/Ω)

Separating the variables:

dW(t) / [W(t)·(1 − W(t)/Ω)] = β dt

Partial fractions for the left-hand side:

1 / [W(t)·(1 − W(t)/Ω)] = Ω / [W(t)·(Ω − W(t))] = [(Ω − W(t)) + W(t)] / [W(t)·(Ω − W(t))] = 1/W(t) + 1/(Ω − W(t))

Hence:

∫ [1/W(t) + 1/(Ω − W(t))] dW(t) = ∫ β dt
Solving ∫ [1/W(t) + 1/(Ω − W(t))] dW(t) = ∫ β dt

ln W(t) − ln(Ω − W(t)) = βt + C        | ·(−1)
−ln W(t) + ln(Ω − W(t)) = −βt − C
ln[(Ω − W(t)) / W(t)] = −βt − C        | exp()
(Ω − W(t)) / W(t) = exp(−βt − C)
Ω/W(t) − 1 = exp(−βt − C)              | +1
Ω/W(t) = 1 + exp(−βt − C)
W(t) = Ω / (1 + exp(−βt − C))

When taking this mathematical model to statistics we get the following parameters: the intercept C and the regression coefficient β. For P(t) = W(t)/Ω the equation simplifies to:

P(t) = 1 / (1 + e^(−C−βt))

which is the logistic function.
Verhulst conclusion
• Verhulst concluded that "the curve agrees with the actual course of the population of France, Belgium, Essex and Russia for periods up to 1833" (Cramer, 2002, p. 4).
data a01;
beta0=2;
beta1a=&beta1a.;
do t=-10 to 10 by 0.01;
Pt=1/(1+exp(-1*(beta0+beta1a*t)));
output;
end;
run;
• Ordinal (i > 2):
  Y = 0 (Good), 1 (Moderate), 2 (Poor)
• Multinomial (i > 2):
  Y = 0 (Standard phone), 1 (Android), 2 (iPhone)
Distributions for categorical data
[Figure: probability of drawing a given number of employed and unemployed in a sample of status (employed, unemployed, inactive) with π1 = 0.75, π2 = 0.05; e.g. drawing 70 employed and 5 unemployed.]
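Such probabilities come from the multinomial distribution; a sketch of its pmf, assuming for illustration a sample of n = 100 and π3 = 0.20 for the inactive category (neither is stated on the slide):

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(N1=n1,...,Nk=nk) = n!/(n1!*...*nk!) * prod(pi^ni)."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)          # exact integer multinomial coefficient
    return coef * prod(p ** c for p, c in zip(probs, counts))

# probability of drawing 70 employed, 5 unemployed and 25 inactive
p = multinomial_pmf([70, 5, 25], [0.75, 0.05, 0.20])
```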
Logistic curve

π(xi) = 1 / (1 + e^(−(α+βxi)))

The logistic curve embodies an assumption about the relationship between X and p. It can be considered an approximation of that relationship.
data a01;
beta0=2;
beta1a=&beta1a.;
beta1b=&beta1b.;
do x=-10 to 10 by 0.01;
p1=1/(1+exp(-1*(beta0+beta1a*x)));
p2=1/(1+exp(-1*(beta0+beta1b*x)));
output;
end;
run;
Logit transformation

The logistic model is often represented through the logit:

logit[π(xi)] = ln[ π(xi) / (1 − π(xi)) ] = α + βxi

where:

π(xi) = 1 / (1 + e^(−(α+βxi)))

• The logit is the natural logarithm of the odds
• It defines the relationship between the linear predictor and the probability of the response:
  p → 1 then ln(p/[1−p]) → +∞
  p → 0 then ln(p/[1−p]) → −∞
data a01;
beta0=2;
beta1a=&beta1a.;
beta1b=&beta1b.;
do x=-10 to 10 by 0.01;
p1=1/(1+exp(-1*(beta0+beta1a*x)));
logit1=log(p1/(1-p1));
p2=1/(1+exp(-1*(beta0+beta1b*x)));
logit2=log(p2/(1-p2));
output;
end;
run;
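The logit is the inverse of the logistic function, which the data step above exploits; a quick Python check:

```python
import math

def logistic(z):
    """pi = 1 / (1 + exp(-z)); maps the real line onto (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """ln(p / (1 - p)); maps (0, 1) onto the whole real line."""
    return math.log(p / (1.0 - p))

# round trip: logit(logistic(z)) == z for any linear predictor z
for z in [-5.0, -1.0, 0.0, 2.5]:
    assert abs(logit(logistic(z)) - z) < 1e-9
```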
SAS OnLineDoc:
https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#logistic_toc.htm
[SAS] l02_LogRegBank01.sas
[Data] bank.sas7bdat
Response Profile

Ordered Value   NewDeposit   Total Frequency
1               0            4000
2               1            521
• −2 Log L
• Akaike's information criterion: AIC = −2 ln L + 2p
• Schwarz criterion: SC = −2 ln L + p·ln n

The criteria assess the goodness of fit of different models, measuring only the relative fit among models: the smaller the value, the better the fit. AIC and SC adjust the −2 Log L statistic for the number of parameters, and for the number of parameters together with the sample size, respectively. More parsimonious models are preferred.
logit[π(xi)] = ln[ π(xi) / (1 − π(xi)) ] = −2.3318 + 0.00995·x1 + 0.5004·x21 + 2.4852·x22 − 0.3835·x23

π(xi) = 1 / (1 + e^(2.3318 − 0.00995·x1 − 0.5004·x21 − 2.4852·x22 + 0.3835·x23))
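Predicted probabilities follow directly from the fitted equation; a Python sketch, assuming x1 is the continuous age variable and x21, x22, x23 the three design columns for poutcome (an assumed mapping):

```python
import math

def predicted_prob(x1, x21, x22, x23):
    """pi = 1 / (1 + exp(-eta)) with the estimates quoted above."""
    eta = -2.3318 + 0.00995 * x1 + 0.5004 * x21 + 2.4852 * x22 - 0.3835 * x23
    return 1.0 / (1.0 + math.exp(-eta))

# e.g. a hypothetical 40-year-old client whose profile sets x22 = 1
p = predicted_prob(40, 0, 1, 0)
```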
[Figure: predicted probability of response vs. age, by poutcome (failure, other, success, unknown).]

The odds ratio for age indicates that the odds of a positive response to the marketing campaign increase by 5% for each 5-year increase in the age of bank clients.
[Figure: odds ratio estimates with confidence limits.]

Consider pairs of observations in which one observation has Event = 0 and the other Event = 1, with predicted probabilities p̂i and p̂j.

Concordant pair
• The outcome is in agreement with the predicted probabilities (the Event = 1 observation has the higher predicted probability)

Discordant pair
• The outcome is in disagreement with the predicted probabilities

Tied pair
• The model cannot distinguish between the two subjects (equal predicted probabilities)
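These pair counts can be computed directly; a small sketch:

```python
def pair_counts(y, p):
    """Count concordant, discordant and tied pairs over all
    (event, non-event) pairs, comparing predicted probabilities."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    conc = disc = tied = 0
    for p1 in events:
        for p0 in nonevents:
            if p1 > p0:
                conc += 1          # concordant: event has higher p-hat
            elif p1 < p0:
                disc += 1          # discordant
            else:
                tied += 1          # tied
    return conc, disc, tied
```

PROC LOGISTIC reports these counts as percentages in its "Association of Predicted Probabilities and Observed Responses" table.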
Exercise 2.1

Fit the logistic regression model using dataset German, with Default as the response variable and housing type as a categorical predictor.

[Data] german.sas
[SAS] l02_LogRegGerman01.sas
[Figure: odds ratio estimates for age with confidence limits.]

[Figure: predicted probability vs. age, by housing (for free, own, rent).]
Binary logistic regression model

Logistic regression: the response variable follows a Bernoulli distribution, e.g.:

Y = 1 if the company is on the market at time t; Y = 0 if the company left the market prior to t

P(Yi = 1) = π(xi),  P(Yi = 0) = 1 − π(xi)

π(xi) = P(Y = 1 | X = xi) = exp(α + Σk=1..p βk·xki) / (1 + exp(α + Σk=1..p βk·xki)) = 1 / (1 + e^(−(α + Σk=1..p βk·xki)))

logit[π(xi)] = ln[ π(xi) / (1 − π(xi)) ] = α + Σk=1..p βk·xki
For independent observations:

P(Y1 = y1, Y2 = y2, …, Yn = yn) = P(Y1 = y1)·P(Y2 = y2)·…·P(Yn = yn) = ∏i=1..n p(yi)

• Likelihood function:

L(y1, y2, …, yn | α, β1, …, βp) = π(x1)^y1·(1 − π(x1))^(1−y1)·…·π(xn)^yn·(1 − π(xn))^(1−yn) = ∏i=1..n π(xi)^yi·(1 − π(xi))^(1−yi)

• The natural logarithm of the likelihood function gives:

ln L(y1, y2, …, yn | α, β1, …, βp) = ln ∏i=1..n π(xi)^yi·(1 − π(xi))^(1−yi)
= Σi=1..n [ yi·ln π(xi) + (1 − yi)·ln(1 − π(xi)) ]
= Σi=1..n yi·ln[ π(xi) / (1 − π(xi)) ] + Σi=1..n ln(1 − π(xi))
= Σi=1..n yi·(α + Σk=1..p βk·xki) − Σi=1..n ln(1 + exp(α + Σk=1..p βk·xki))
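The equivalence of the first and last lines of the derivation can be checked numerically; a sketch with one explanatory variable and arbitrary illustrative data:

```python
import math

def loglik_direct(y, x, a, b):
    """ln L as sum of y*ln(pi) + (1-y)*ln(1-pi)."""
    total = 0.0
    for yi, xi in zip(y, x):
        pi = 1.0 / (1.0 + math.exp(-(a + b * xi)))
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total

def loglik_simplified(y, x, a, b):
    """ln L as sum of y*(a+b*x) - ln(1+exp(a+b*x))."""
    return sum(yi * (a + b * xi) - math.log1p(math.exp(a + b * xi))
               for yi, xi in zip(y, x))

y = [1, 0, 1, 1, 0]
x = [0.5, -1.2, 2.0, 0.1, -0.7]
ll1 = loglik_direct(y, x, 0.3, 0.8)
ll2 = loglik_simplified(y, x, 0.3, 0.8)
```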
Setting the partial derivatives of ln L to zero gives the likelihood equations:

∂ ln L(y1, y2, …, yn | α, β1, …, βr) / ∂α = 0
∂ ln L(y1, y2, …, yn | α, β1, …, βr) / ∂βk = 0,  k = 1, …, r

Solving these equations yields the estimates of the parameters. The equations have no closed-form solution, so iterative algorithms are used for model fitting:

• Newton-Raphson algorithm (Agresti, 2002, pp. 143-145)
• Fisher scoring (Agresti, 2002, pp. 145-146)
u = [ ∂L(β)/∂β1, ∂L(β)/∂β2, …, ∂L(β)/∂βp ]ᵀ

H = [ ∂²L(β)/∂βj∂βk ],  j, k = 1, …, p

i.e. u is the vector of first partial derivatives (the score) and H is the matrix of second partial derivatives (the Hessian) of the log-likelihood L(β).
Newton-Raphson Algorithm

Logistic regression (LR) model with one explanatory variable, β = (α, β)ᵀ:

β^(t+1) = β^(t) − (H^(t))⁻¹·u^(t)

u = [ Σi=1..n ( yi − exp(α + βxi) / (1 + exp(α + βxi)) ),
      Σi=1..n ( yi·xi − xi·exp(α + βxi) / (1 + exp(α + βxi)) ) ]ᵀ

H = − [ Σi e^(α+βxi) / (1 + e^(α+βxi))²,       Σi xi·e^(α+βxi) / (1 + e^(α+βxi))² ;
        Σi xi·e^(α+βxi) / (1 + e^(α+βxi))²,    Σi xi²·e^(α+βxi) / (1 + e^(α+βxi))² ]
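The update equations above can be sketched in Python; a minimal implementation (the data used to exercise it below are hypothetical, not from the course materials):

```python
import math

def newton_raphson_logistic(x, y, iters=25):
    """Estimate (alpha, beta) in P(Y=1|x) = 1/(1+exp(-(alpha+beta*x)))
    by iterating beta_new = beta_old - H^{-1} u."""
    a = b = 0.0                          # starting point (0, 0)
    for _ in range(iters):
        u0 = u1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            u0 += yi - p                 # score component for alpha
            u1 += (yi - p) * xi          # score component for beta
            w = p * (1.0 - p)
            h00 -= w                     # Hessian entries (negative definite)
            h01 -= w * xi
            h11 -= w * xi * xi
        det = h00 * h11 - h01 * h01
        a -= (h11 * u0 - h01 * u1) / det   # components of H^{-1} u
        b -= (h00 * u1 - h01 * u0) / det
    return a, b
```

With a binary x the result can be checked against the closed-form group logits: the fitted probabilities must equal the observed proportions in the x = 0 and x = 1 groups.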
Newton-Raphson Algorithm: Example, Results

Homework:
1. What if the starting point is different, e.g. −2? What does this tell us about the properties of the Newton-Raphson algorithm?
2. Write a program to estimate the parameters of a logistic regression model with one explanatory variable using the Newton-Raphson algorithm.
Fisher scoring replaces H with the expected information matrix J:

u = [ ∂L(β)/∂β1, …, ∂L(β)/∂βp ]ᵀ,  J = [ −E( ∂²L(β)/∂βj∂βk ) ],  j, k = 1, …, p

For large samples J and H are approximately equal. For details on the expected information matrix see: Agresti, 2002, pp. 135-139.
For a single binary explanatory variable x the log-likelihood simplifies:

ln L = Σi=1..n yi·(α + βxi) − Σi=1..n ln(1 + exp(α + βxi))
= α·Σi=1..n{y=1} yi + β·Σi=1..n{y=1;x=1} yi·xi − Σ{i: xi=1} ln(1 + exp(α + β)) − Σ{i: xi=0} ln(1 + exp(α))
= n{y=1}·α + n{y=1;x=1}·β − n{x=1}·ln(1 + exp(α + β)) − n{x=0}·ln(1 + exp(α))
[SAS] l03_PlottingLnL01.sas
goptions device=activex;
proc g3d data=p01;
plot a*b=f;
run;
quit;
Model (1)

• Response Y (customer status) and explanatory X (housing status) variables:

Y = 1 (default), 0 (non-default)
X = 1 (Own), 2 (Rent), 3 (For Free)

• Estimate the model assessing the risk of a customer defaulting given their housing status:

π(xi) = 1 / (1 + e^(−(α + β1·x1i + β2·x2i)))
[SAS] l03_EstimationTechniques01.sas

Reference parametrization:  π(xi) = 1 / (1 + e^(−(−1.0414 + 0.6669·x1i + 0.5987·x2i)))
Effect parametrization:     π(xi) = 1 / (1 + e^(−(−0.6195 + 0.2450·x1i + 0.1768·x2i)))
Reference parametrization:  π(xi) = 1 / (1 + e^(1.0414 − 0.6669·x1i − 0.5987·x2i))

O(xi = rent) = π(xi) / (1 − π(xi)) = 0.6423
O(xi = own) = π(xi) / (1 − π(xi)) = 0.3530

OR(rent vs own) = O(xi = rent) / O(xi = own) = 1.8198
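The quoted odds follow from the reference-parametrized estimates; a sketch assuming the 0.5987 design column is the rent dummy (an assumed mapping, consistent with the quoted odds):

```python
import math

intercept = -1.0414           # reference level: own
beta_rent = 0.5987            # rent vs own (assumed dummy mapping)

odds_own = math.exp(intercept)                # odds of default for owners
odds_rent = math.exp(intercept + beta_rent)   # odds of default for renters
odds_ratio = odds_rent / odds_own             # equals exp(beta_rent)
```

Because the reference level's odds appear in both numerator and denominator, the odds ratio reduces to exp of the dummy's coefficient.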
Effect parametrization:  π(xi) = 1 / (1 + e^(0.6195 − 0.2450·x1i − 0.1768·x2i))
Parametrization programmatically
• Reference:
proc logistic data=german;
class housing (param=ref ref='own');
model default(event='1')= housing;
run;
• Effect:
*Binary logistic regression model;
proc logistic data=german;
class housing (param=effect ref='own');
model default(event='1')= housing;
run;
Assessing goodness-of-fit

ln L(y1, y2, …, yn | α, β1, …, βp) = Σi=1..n yi·(α + Σk=1..p βk·xki) − Σi=1..n ln(1 + exp(α + Σk=1..p βk·xki))

In model (1):

ln L(y1, y2, …, yn | α, β1, β2) = Σi=1..n yi·(α + β1·x1i + β2·x2i) − Σi=1..n ln(1 + exp(α + β1·x1i + β2·x2i))

AIC = −2 ln L + 2p
SC = −2 ln L + p·ln n

H0: β = 0
H1: ∃k: βk ≠ 0
[SAS] l04_WaldTest01.sas

W = (β̂ − β0)ᵀ·[cov(β̂)]⁻¹·(β̂ − β0)

• For model (1) we get:

W = (β̂1, β̂2)·[ σ̂²β̂1, σ̂β̂1β̂2 ; σ̂β̂1β̂2, σ̂²β̂2 ]⁻¹·(β̂1, β̂2)ᵀ

• W has a limiting chi-square distribution with degrees of freedom equal to the rank of the covariance matrix.
• The idea is to compare the slope and expected curvature of the log-likelihood function at β0. The test statistic is as follows:

S = uᵀ·J⁻¹·u

where u = [ ∂L(β)/∂β1, …, ∂L(β)/∂βp ]ᵀ is the score vector and J = [ −E( ∂²L(β)/∂βj∂βk ) ] is the expected information matrix, both evaluated at β0.
W = (β̂ − β0)ᵀ·[cov(β̂)]⁻¹·(β̂ − β0) = β̂3² / σ̂²β̂3 = (−0.021304067)² / 0.0000472834 = 9.5988
Measures of Goodness-of-Fit

• Assessing fit of the model: we would like to know whether the probabilities produced by the model accurately reflect the true outcome experience in the data (Hosmer et al., 2007).
• Deviance and Pearson χ²: the statistics compare the saturated model with the estimated model. The saturated model reflects ideal fit: for each covariate profile the observed cell proportions nij/n are compared with the model-estimated proportions p̂ij = n̂ij/n (i = 1, 2, …, l; j = 1, 2, …, k).
[SAS] l04_SaturatedModel01.sas

Saturated model

• Let us consider model (1) with TELEPHONE included as an explanatory variable. The counts (nij) in the saturated model:

proc freq data=german;
tables housing*telephone / out=f01;
run;

• Deviance:

D = 2·Σi=1..l Σj=1..k nij·ln(nij / n̂ij)

• Pearson χ²:

χ² = Σi=1..l Σj=1..k (nij − n̂ij)² / n̂ij
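Both statistics are straightforward to compute from observed and fitted counts; a Python sketch:

```python
import math

def deviance(observed, expected):
    """D = 2 * sum n_ij * ln(n_ij / nhat_ij) over all cells."""
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected))

def pearson_chi2(observed, expected):
    """chi2 = sum (n_ij - nhat_ij)^2 / nhat_ij over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

With a perfectly fitting model (observed equal to expected) both statistics are zero.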
Example

• Test the goodness of fit of model (1) with TELEPHONE included as an explanatory variable. Calculate the Deviance and Pearson χ² statistics.

Predicted probabilities and expected counts by profile (Count, Percent, IP_0, IP_1, Expected0, Expected1):

116  11.6  0.592563  0.407437   68.73725   47.26275
433  43.3  0.724411  0.275589  313.67     119.33
47    4.7  0.565716  0.434284   26.58863   20.41137
63    6.3  0.639025  0.360975   40.25858   22.74142
280  28.0  0.76188   0.23812   213.3264    66.67357
61    6.1  0.613241  0.386759   37.40772   23.59228

Cell-level comparison (Default, Telephone, Housing, Expected, Observed, nij·ln(nij/n̂ij), χ² contribution):

0  no   rent       68.73725   66   -2.68201   0.1090029
0  no   own       313.67     319    5.37501   0.09056869
0  no   for free   26.58863   24   -2.45832   0.25202516
0  yes  rent       40.25858   43    2.832714  0.18667833
0  yes  own       213.3264   208   -5.25938   0.13299282
0  yes  for free   37.40772   40    2.680094  0.1796398
1  no   rent       47.26275   50    2.815032  0.15852994
1  no   own       119.33     114   -5.20913   0.23806829
1  no   for free   20.41137   23    2.746248  0.32829759
1  yes  rent       22.74142   20   -2.56912   0.33047201
1  yes  own        66.67357   72    5.533743  0.42551922
1  yes  for free   23.59228   21   -2.44434   0.28483535

Totals: D = 2·Σ nij·ln(nij/n̂ij) = 2.721091; χ² = 2.7166301
Hosmer-Lemeshow test
• The Hosmer-Lemeshow statistic is calculated as follows:
• Estimate the individual probabilities.
• Sort the observations by the estimated probabilities and divide the set into 10 groups with a similar number of observations (k = 1, 2, 3, …, 10).
• Using the groups, calculate the expected counts and compare them with the observed counts using a Pearson χ² statistic, which has approximately a χ² distribution with v = 10 − 2 = 8 degrees of freedom.
proc logistic data=german;
class housing (param=ref ref='own') telephone (param=ref ref='yes');
model default(event='1')= housing telephone / aggregate scale=none
lackfit;
output out=out predprobs=(i);
run;
[SAS] l04_HosmerLemeshow01.sas
Example
• Test goodness of fit of model (1) with AGE being included as
explanatory variable using Hosmer-Lemeshow statistic.
SAS OnlineDoc

Hosmer-Lemeshow χ² contributions by group (two components per group):

0.002428  0.000547
0.069971  0.020732
1.302721  0.43821
0.072579  0.026138
0.173811  0.066921
0.083389  0.034293
0.831501  0.361852
0.801296  0.392317
0.796441  0.523366
0.784196  0.626636

Total: χ²HL = 7.409344
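The grouping-and-comparison steps above can be sketched in Python (a simplified version using equal-sized groups after sorting by predicted probability):

```python
def hosmer_lemeshow(y, p, groups=10):
    """Simplified Hosmer-Lemeshow statistic: sort by predicted probability,
    split into equal-sized groups and compare observed vs expected event
    counts with a Pearson-type chi-square (df = groups - 2)."""
    pairs = sorted(zip(p, y))                 # sort by predicted probability
    size = len(pairs) // groups
    hl = 0.0
    for g in range(groups):
        lo = g * size
        hi = lo + size if g < groups - 1 else len(pairs)
        chunk = pairs[lo:hi]
        n = len(chunk)
        obs1 = sum(yi for _, yi in chunk)     # observed events in group
        exp1 = sum(pi for pi, _ in chunk)     # expected events in group
        # event and non-event components of the group's contribution
        hl += (obs1 - exp1) ** 2 / exp1 + (obs1 - exp1) ** 2 / (n - exp1)
    return hl
```

When the predicted probabilities match the observed event rates in every group, the statistic is zero.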
[SAS] l04_RSqaure01.sas
Example
• Assess the predictive power of model (1) with AGE included as an explanatory variable using R-square measures.
Classification tables

• A tool to assess the predictive power of the model: how good is the model at predicting the values of Y? We set a cut-off c in (0, 1). Using c we may classify observations into those for which the event occurred and those for which it did not. We can assess the appropriateness of the classification using the following statistics:

Sensitivity (assessing occurrence of the event):  S1 = n{Ŷ=1|Y=1} / n{Y=1}
Specificity (assessing non-occurrence of the event):  S0 = n{Ŷ=0|Y=0} / n{Y=0}
False occurrence:  FP = n{Ŷ=1|Y=0} / n{Ŷ=1}
False non-occurrence:  FN = n{Ŷ=0|Y=1} / n{Ŷ=0}
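These four statistics are easy to compute for a given cut-off; a sketch:

```python
def classification_stats(y, p, c=0.5):
    """Classify yhat = 1 if p > c; return sensitivity, specificity,
    false occurrence rate and false non-occurrence rate."""
    yhat = [1 if pi > c else 0 for pi in p]
    n11 = sum(1 for a, b in zip(yhat, y) if a == 1 and b == 1)
    n00 = sum(1 for a, b in zip(yhat, y) if a == 0 and b == 0)
    n10 = sum(1 for a, b in zip(yhat, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(yhat, y) if a == 0 and b == 1)
    sens = n11 / (n11 + n01)   # S1: events correctly classified
    spec = n00 / (n00 + n10)   # S0: non-events correctly classified
    fp = n10 / (n11 + n10)     # false occurrence
    fn = n01 / (n00 + n01)     # false non-occurrence
    return sens, spec, fp, fn
```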
[SAS] l04_ClassificationTable01.sas
ROC curve

S1 = n{Ŷ=1|Y=1} / n{Y=1}
S0 = n{Ŷ=0|Y=0} / n{Y=0}

The ROC curve plots sensitivity S1 against 1 − S0 over all possible cut-offs; a smaller cut-off increases sensitivity at the cost of specificity.
[SAS] l04_Residuals01.sas
[SAS] l04_Collinearity01.sas
Collinearity assessment

• Collinearity in logistic regression leads to overestimation of the standard errors and can distort the estimated regression coefficients
• Standard statistics used in linear regression can be applied to detect collinearity: the variance inflation factor (VIF) and tolerance (TOL):

VIF = 1 / (1 − Rk²)    TOL = 1 / VIF

where Rk² comes from regressing the k-th explanatory variable on the remaining ones.
[SAS] l04_DepositBinaryModel01.sas
Exercise with R and Python [R] l04_DepositBinaryModel01.R
o Age (numeric)
o Education (categorical: "unknown", "secondary", "primary", "tertiary")
o Balance: average yearly balance, in euros (numeric)
o Duration: last contact duration, in seconds (numeric)
o Poutcome: outcome of the previous marketing campaign (categorical: "unknown", "other", "failure", "success")
Homework
1. Download and familiarize yourself with the data and documentation from
European Social Survey (ESS) study, Round 9:
http://www.europeansocialsurvey.org
2. Formulate a research problem. The aim is to practice binary logistic regression
– please consider that in the choice of the problem
3. Formulate the research hypothesis. The hypothesis will be verified in your
paper.
4. Suggest the exact set of features from the ESS data to include in the analysis.
5. Specify the model to be used in the analysis.
Prepare a brief presentation to share with the group. The proposals will be
discussed. The research problem will be analyzed in your papers (Projects I, II and
III).
• Ordinal (i > 2):
  Y = 0 (Good), 1 (Moderate), 2 (Poor)
• Multinomial (i > 2):
  Y = 0 (Standard phone), 1 (Android), 2 (iPhone)
For a three-category response, with category 0 as the baseline:

ln[ P(Y = 1 | x) / P(Y = 0 | x) ] = β10 + β11·x1 + … + β1p·xp = xᵀβ1
ln[ P(Y = 2 | x) / P(Y = 0 | x) ] = β20 + β21·x1 + … + β2p·xp = xᵀβ2
Conditional probabilities

• The conditional probabilities πj|x are derived from the following system of equations:

ln( π1|x / π0|x ) = xᵀβ1
ln( π2|x / π0|x ) = xᵀβ2
π0|x = 1 − (π1|x + π2|x)

The individual conditional probabilities can be expressed by:

π0|x = 1 / (1 + e^(xᵀβ1) + e^(xᵀβ2))
π1|x = e^(xᵀβ1) / (1 + e^(xᵀβ1) + e^(xᵀβ2))
π2|x = e^(xᵀβ2) / (1 + e^(xᵀβ1) + e^(xᵀβ2))
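The three conditional probabilities share one denominator; a sketch with a single linear predictor per equation (arbitrary illustrative values):

```python
import math

def multinomial_probs(eta1, eta2):
    """Baseline-category logits eta_j = x'beta_j (category 0 as baseline).
    Returns (pi0, pi1, pi2)."""
    denom = 1.0 + math.exp(eta1) + math.exp(eta2)
    return 1.0 / denom, math.exp(eta1) / denom, math.exp(eta2) / denom

p0, p1, p2 = multinomial_probs(0.4, -1.1)
```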
The likelihood function:

L = ∏i=1..n π0(xi)^y0i · π1(xi)^y1i · π2(xi)^y2i
  = ∏i=1..n (1 − π1(xi) − π2(xi))^y0i · π1(xi)^y1i · π2(xi)^y2i
Phones dataset
ID  Name    Description                                     Categories
1   Y       Type of phone                                   0=standard, 1=Android, 2=iPhone
2   LQ      Quality of Life                                 1=Excellent, 2=Very good, 3=Good, 4=Fair or poor
3   EDU     Education                                       1=At least started higher education, 0=otherwise
4   SEX     Sex of the respondent                           1=Male, 2=Female
5   AGE     Age of the respondent                           N/A
6   INTUSE  Is using Internet?                              1=Yes, 2=No
7   INTMOB  Is using Internet through mobile devices?       1=Yes, 2=No
iPhone vs Android
• H0: OR10=OR20
[SAS] l05_MultinomialModel01.sas
Y = 0 (Standard phone), 1 (Android), 2 (iPhone)
Profile probabilities

Given the estimates of the model, provide the probability estimates for the following profiles:
1. A 31-year-old woman with higher education, using the Internet in general and specifically on mobile devices: what are the probabilities of each phone type?
2. Plot the probabilities against age.
Plotting probabilities
Comparing cumulative splits of an ordinal response with categories 0, 1, 2, 3, 4:

{0} vs {1, 2, 3, 4}    (≤0 vs >0)
{0, 1} vs {2, 3, 4}    (≤1 vs >1)
{0, 1, 2} vs {3, 4}    (≤2 vs >2)
{0, 1, 2, 3} vs {4}    (≤3 vs >3)

The odds are derived for each cut point:

D ≤ 1 vs D > 1
D ≤ 2 vs D > 2
…
D ≤ G−1 vs D > G−1

odds(D ≤ g) = P(D ≤ g) / P(D > g),  g = 1, 2, …, G−1

with the proportional odds assumption linking the cut points.
OR(D ≤ 1) = odds(D ≤ 1) | X=1  /  odds(D ≤ 1) | X=0
…
OR(D ≤ 4) = odds(D ≤ 4) | X=1  /  odds(D ≤ 4) | X=0
Ordinal model parameters:

Variable    Parameter
Intercept   α1, α2, …, αG−1
X1          β1
P(D ≤ g | X) = 1 / (1 + e^(−(αg + βxi))),  g = 1, 2, …, G−1

P(D > g | X) = 1 − P(D ≤ g | X) = e^(−(αg + βxi)) / (1 + e^(−(αg + βxi)))

odds(D ≤ g) = P(D ≤ g | X) / (1 − P(D ≤ g | X)) = e^(αg + βxi)
odds(D ≤ g) = P(D ≤ g | X) / (1 − P(D ≤ g | X)) = P(D ≤ g | X) / P(D > g | X) = e^(αg + βxi),  where g = 1, 2, …, G−1.
LQ by age group (younger = 0/1); odds of being in the lower category:

Cut point Y ≤ 1:
                                        0      1    Total
1 Fair or poor                        119     58     177
2 Very good or good, 3 Excellent      436    448     884
Total                                 555    506    1061

Risk(Y≤1|X=0) = 0.214414   Risk(Y≤1|X=1) = 0.114625
Odds(Y≤1|X=0) = 0.272936   Odds(Y≤1|X=1) = 0.129464   OR01 = 2.108194

Cut point Y ≤ 2:
                                        0      1    Total
1 Fair or poor, 2 Very good or good   468    369     837
3 Excellent                            87    137     224
Total                                 555    506    1061

Risk(Y≤2|X=0) = 0.843243   Risk(Y≤2|X=1) = 0.729249
Odds(Y≤2|X=0) = 5.37931    Odds(Y≤2|X=1) = 2.693431   OR01 = 1.997197
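The cumulative odds ratios above can be reproduced directly from the table counts; a sketch:

```python
def cumulative_or(below_x0, above_x0, below_x1, above_x1):
    """Odds ratio of being at or below the cut point, X=0 vs X=1."""
    odds0 = below_x0 / above_x0
    odds1 = below_x1 / above_x1
    return odds0 / odds1           # OR01

# cut point Y <= 1: fair or poor vs better
or_cut1 = cumulative_or(119, 436, 58, 448)
# cut point Y <= 2: below excellent vs excellent
or_cut2 = cumulative_or(468, 87, 369, 137)
```

The two odds ratios being of similar size (about 2.1 and 2.0) is what the proportional odds assumption requires.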
[SAS] l05_PropOdds01.sas
Proportional odds assumption – SAS
(„Phones” dataset)
H0: OR1 = OR2 = … = ORG−1
H1: ∃ i ≠ j: ORi ≠ ORj
https://support.sas.com/documentation/cdl/en/statug/63962/
HTML/default/viewer.htm#statug_logistic_sect039.htm
P(D ≤ 1 | X = 1) = 1 / (1 + e^(−(α1 + βxi))) = 1 / (1 + e^(2.0235)) = 0.1168
P(D ≤ 2 | X = 1) = 1 / (1 + e^(−(α2 + βxi))) = 1 / (1 + e^(−0.9820)) = 0.7275
P(D ≤ 1 | X = 0) = 1 / (1 + e^(−(α1 + βxi))) = 1 / (1 + e^(2.0235 − 0.7147)) = 0.2127
P(D ≤ 2 | X = 0) = 1 / (1 + e^(−(α2 + βxi))) = 1 / (1 + e^(−0.9820 − 0.7147)) = 0.8451
OR((1 and 2) vs 3) = odds(D ≤ 2 | X = 1) / odds(D ≤ 2 | X = 0) = 2.6697 / 5.4558 = 0.48
[SAS] l05_OrdinalModel01.sas
References
Agresti, A. (2002). Categorical Data Analysis. Hoboken: John Wiley & Sons.
Allison, P.D. (1999). Logistic Regression Using the SAS System: Theory and Application. Cary: SAS Institute Inc.
Any questions?