
Logistic Regression with SAS

Summer Semester 2019/2020

Adam Korczyński
The Collegium of Economic Analysis of Warsaw School of Economics
Institute of Statistics and Demography
Statistical Methods and Business Analytics Unit

e-mail: akorczy@sgh.waw.pl
Contingency table and measures
of association

Logistic Regression with SAS ©AK, SGH 2020



Agenda
• Introductory notes: scope of the class, schedule and requirements
• Basic notions
• Contingency table
• Comparing two proportions - measures in the analysis of categorical
data
• Types of studies
• Calculating risk, risk ratio, odds and odds ratio
• Proc Freq in SAS. Cross tabulation with R and Python

Basic notions (Agresti, 2002)


• Categorical variable: scale consisting of a set of categories
• Response Y – Explanatory variable X
• Scales:
• Nominal
• Ordinal
• Interval
• Ratio

Contingency table
• Y and X – two categorical variables with k and l categories respectively
• We can present the joint distribution of the two variables using a table with k rows for the categories of Y and l columns for the categories of X
• The cells represent the possible outcomes
• A table holding the frequency counts of the outcomes for a sample is called a contingency table (a term introduced by Karl Pearson, 1904)
• 2x2 table:

        X=x1  X=x2
Y=y1    n11   n12
Y=y2    n21   n22

E.g. X – tax regimen (A, B); Y – survival of a company at time t

Comparing two proportions – measures in the analysis of categorical data

Y = 1 if the company is on the market at time t; 0 if the company got out of the market prior to t
X = 1 for tax regimen A; 0 for tax regimen B

Risk (conditional probability): R_j = P(Y = i | X = j)

Relative risk: RR_10 = P(Y = i | X = 1) / P(Y = i | X = 0)

Odds (the probability compared with its complementary probability): O_j = P(Y = i | X = j) / [1 − P(Y = i | X = j)]

Odds ratio: OR_10 = O_1 / O_0

2x2 contingency table


Calculating R, RR, O, OR

Types of studies
• Follow-up study (prospective)
  • Risk factor → Observation → Outcome
  • Assigning tax regimen A or B → Observation → Survived or not
• Case-control study (retrospective)
  • Risk factor ← Observation ← Outcome
  • What was the tax regimen under which the companies were operating? ← Survived or not
• Cross-sectional
  • Snapshot of data at a given time point. Allows assessing the coexistence of features.

Prospective study – experimental design

• Set a target sample of subjects (e.g. companies) to be in the study
• Randomize the subjects into the groups of interest, e.g. tax regimens x
• Then observe the companies at scheduled time points (6 months, 1 year, 2 years) and determine which companies are still operating (Y)
• In this setting Y is a random variable and x is fixed. Therefore we can estimate P(Y=i|x=j), the probability of being on the market at time t given that the company was operating under tax regimen j

Retrospective studies
• A research organization decides to study the status of Polish companies in a specific industry in the years 2015-2016. There were n1 companies which went bankrupt in this period. For the purpose of comparison, n2 companies operating throughout the whole period have been added to the sample. The tax regimens under which the companies were operating were identified afterwards.
• In this setting X is a random variable and Y is set post hoc: we identify the survival status first and only then identify the characteristics. As a consequence we can estimate P(X=j|y=i), the probability of being in tax regimen A or B given that a company survived or did not survive. This does not address the research question, though. The counts for Y are not collected through a random process, and therefore P(Y=i|X=j) is not a good estimate of survival conditional on tax regimen.

Numeric example – comparison of measures of association

Population
Survival       Regimen A  Regimen B    ni.
Survived             478        220    698
Not survived        5476       3826   9302
n.j                 5954       4046  10000

            RISK      RR       ODDS     OR
Regimen A   0.080282  1.47646  0.08729  1.51805
Regimen B   0.054375           0.05750

Prospective sample
Survival       Regimen A  Regimen B    ni.
Survived              95         44    139
Not survived        1095        766   1861
n.j                 1190        810   2000

            RISK      RR       ODDS     OR
Regimen A   0.079832  1.46963  0.08676  1.51038
Regimen B   0.054321           0.05744

Retrospective sample
Survival       Regimen A  Regimen B    ni.
Survived             616        284    900
Not survived         530        370    900
n.j                 1146        654   1800

            RISK      RR       ODDS     OR
Regimen A   0.537522  1.23781  1.16226  1.51422
Regimen B   0.434251           0.76757

The odds of survival within the studied time period are about 50% greater for companies operating under tax regimen A. Note that the odds ratio is stable (≈1.51-1.52) across all three designs, whereas the risks and the relative risk are distorted in the retrospective sample.
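These measures can be cross-checked with a few lines of Python (the course ships R and Python companions; this sketch is independent of those files):

```python
# Recomputing the population-panel measures of association from the counts above.
survived = {"A": 478, "B": 220}
not_survived = {"A": 5476, "B": 3826}

def risk(reg):
    # conditional probability P(survived | regimen) = n11 / n.j
    return survived[reg] / (survived[reg] + not_survived[reg])

def odds(reg):
    # survived : not survived within a regimen column
    return survived[reg] / not_survived[reg]

rr = risk("A") / risk("B")    # relative risk, regimen A vs B
orr = odds("A") / odds("B")   # odds ratio, regimen A vs B
```

The values reproduce the RISK, RR, ODDS and OR columns of the population panel (RR ≈ 1.476, OR ≈ 1.518).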

Proc Freq in SAS


PROC FREQ < options > ;
BY variables ;
EXACT statistic-options < / computation-options > ;
OUTPUT <OUT=SAS-data-set > output-options ;
TABLES requests < / options > ;
TEST options ;
WEIGHT variable < / option > ;
BY Provides separate analyses for each BY group
EXACT Requests exact tests
OUTPUT Requests an output data set
TABLES Specifies tables and requests analyses
TEST Requests tests for measures of association and agreement
WEIGHT Identifies a weight variable
Source: http://support.sas.com/documentation/cdl/en/procstat/66703/HTML/default/viewer.htm#procstat_freq_syntax.htm

Exercise
• Assess the risk of lung cancer among smokers numerically in order to
answer the following question: What is the risk of having LC while
being a smoker as compared to not being a smoker?
• Use Excel and Proc Freq in SAS
Smoker
Lung Cancer Yes No
Yes 688 21
No 650 59
Data source: Agresti, 2002, Table 2.5.

Exercise with R and Python examples – calculating OR
[SAS] l01_OddsRatio.sas  [R] l01_OddsRatio.R  [Python] l01_OddsRatio.py
data a01;
input y$ x$ n;
cards;
L S 688
L X 21
N S 650
N X 59
;
run;

proc freq data=a01;
tables y*x / relrisk;
weight n;
run;
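For a quick numeric cross-check of the PROC FREQ output, the cross-product ratio can be computed directly (a Python sketch; since the lung-cancer data are retrospective, the odds ratio is the measure to report):

```python
# 2x2 table from Agresti (2002), Table 2.5:
# rows = lung cancer yes/no, columns = smoker yes/no.
n11, n12 = 688, 21    # lung cancer: smoker, non-smoker
n21, n22 = 650, 59    # no lung cancer: smoker, non-smoker

# Cross-product (odds) ratio, the quantity reported with the relrisk option.
odds_ratio = (n11 * n22) / (n12 * n21)
```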
Origins of logistic regression.
Introduction to logistic regression


Agenda
• Origins of logistic regression
• Distributions for categorical data
• Introduction to modelling of binary response variable
• The basic concepts of logistic regression
• Fitting a binary logistic regression model using the LOGISTIC procedure
• Describing the standard output from the LOGISTIC procedure one continuous
and one categorical predictor variable
• Reading and interpreting odds ratio tables and plots

Origins of logistic regression

• J.S. Cramer (2002), The Origins of Logistic Regression, "Tinbergen Institute Working Paper", No. 119(4).
• The origins of logistic regression lie in the 19th century, in work aiming to describe the growth of populations and of autocatalytic chemical reactions. The process in question is a quantity W(t) evolving over time t.
• Growth rate (e.g. population growth rate):

Ẇ(t) = dW(t)/dt

• The assumption is that the growth rate is proportional to the current quantity:

Ẇ(t) = βW(t),  β = Ẇ(t)/W(t)

where: β – constant rate of growth.

Exponential growth model

• Assumptions: a constant rate of growth β and a growing quantity W(t):

Ẇ(t) = βW(t)

• The quantity W(t) at time t is determined by how far into the process we are, which is simply t, but the effect of β compounds over time, which leads to the exponential growth model:

W(t) = A·exp(βt)

where: A – initial level, which can be replaced by W(0).

The thesis by T. Malthus, dated 1798, claimed that a population left to itself will grow in geometric progression.

Population growth

Source: University of Oxford, Our World in Data, https://ourworldindata.org/world-population-growth, date accessed 22Feb2020.

Exponential growth model – example



[SAS] l02_OriginsOfLogReg01.sas

Exponential growth model – SAS code


%let beta=1.7;

data a01;
A=100;
beta=&beta.;
do t=0 to 100 by 0.01;
Wt=A*exp(beta*t);
output;
end;
run;

proc sgplot data=a01;
series x=t y=Wt / lineattrs=(color=red) legendlabel="beta=&beta.";
run;

Further development leading to the logistic regression model

• The exponential model has no upper limit. This aspect of the described processes was criticized by Adolphe Quetelet (1795-1874) and investigated by his pupil Pierre-Francois Verhulst (1804-1849).
• Their proposal was to introduce a term representing the resistance to further growth. The model with Ω as the upper limit (saturation level) of W(t) was suggested after some investigation of potential limiting terms:

Ẇ(t) = βW(t)·(1 − W(t)/Ω)

• In this model „growth is proportional both to the population already attained W(t) and to the remaining room for further expansion Ω−W(t)" (Cramer, 2002, p. 4).

Verhulst's suggestion for the final model for W(t)

• In the model the quantity W(t) can be expressed as a proportion representing the share of the current level in the maximum:

P(t) = W(t)/Ω

• Then by solving the differential equation below (see the next 2 slides):

Ẇ(t) = βW(t)·(1 − W(t)/Ω)

we get the logistic function:

P(t) = 1 / (1 + e^(−C−βt))

Solving the differential equation Ẇ(t) = βW(t)·(1 − W(t)/Ω)

dW(t)/dt = βW(t)·(1 − W(t)/Ω)                | multiply by dt
dW(t) = βW(t)·(1 − W(t)/Ω)·dt                | divide by W(t)·(1 − W(t)/Ω)
dW(t) / [W(t)·(1 − W(t)/Ω)] = β·dt

Side calculation: W(t)·(1 − W(t)/Ω) = W(t)·(Ω − W(t))/Ω, and (adding −W(t) + W(t) in the numerator):

Ω / [W(t)·(Ω − W(t))] = [Ω − W(t) + W(t)] / [W(t)·(Ω − W(t))]
                      = 1/W(t) + 1/(Ω − W(t))

So, integrating both sides:

∫ [1/W(t) + 1/(Ω − W(t))] dW(t) = ∫ β dt

Solving ∫ [1/W(t) + 1/(Ω − W(t))] dW(t) = ∫ β dt

ln W(t) − ln(Ω − W(t)) = βt + C          | multiply by (−1)
−ln W(t) + ln(Ω − W(t)) = −βt − C
ln[(Ω − W(t))/W(t)] = −βt − C            | apply exp()
(Ω − W(t))/W(t) = exp(−βt − C)
Ω/W(t) − 1 = exp(−βt − C)                | add 1
Ω/W(t) = 1 + exp(−βt − C)
W(t) = Ω / (1 + exp(−βt − C))

For P(t) = W(t)/Ω the equation simplifies to:

P(t) = 1 / (1 + e^(−C−βt))

which is the logistic function. When taking this mathematical model to statistics we get the following parameters: the intercept C and the regression coefficient β.
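The solution can be verified numerically: a central finite difference of W(t) should match the right-hand side of the differential equation (β, C and Ω below are arbitrary test values):

```python
import math

# Check that W(t) = Omega / (1 + exp(-beta*t - C))
# satisfies dW/dt = beta * W * (1 - W/Omega).
beta, C, Omega = 1.7, 0.5, 100.0

def W(t):
    return Omega / (1.0 + math.exp(-beta * t - C))

h = 1e-6
for t in (-2.0, 0.0, 1.5):
    lhs = (W(t + h) - W(t - h)) / (2 * h)      # numerical derivative
    rhs = beta * W(t) * (1 - W(t) / Omega)     # right-hand side of the ODE
    assert abs(lhs - rhs) < 1e-4
```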

Verhulst conclusion
• Verhulst concluded that „the curve agrees with the actual course of
the population of France, Belgium, Essex and Russia for periods up to
1833.” (Cramer, 2002, p. 4).

Population growth

1833 – the end date for the data in Verhulst's study.

Source: University of Oxford, Our World in Data, https://ourworldindata.org/grapher/population-by-country?time=1500..2000, date accessed 22Feb2020.

Logistic curve growth model – example



Logistic growth model – SAS code


%let beta1a=1.7;

data a01;
beta0=2;
beta1a=&beta1a.;
do t=-10 to 10 by 0.01;
Pt=1/(1+exp(-1*(beta0+beta1a*t)));
output;
end;
run;

proc sgplot data=a01;
series x=t y=Pt / lineattrs=(color=red) legendlabel="beta=&beta1a.";
run;

The use of logistic regression models

• The response Y is categorical
• The predictors X are:
  • Categorical: contingency table or logistic regression
  • Continuous: logistic regression
  • Categorical and continuous: logistic regression

Schematically: predictors X1 ∈ {a, b, c}, X2 ∈ {y, n}, …, Xk ∈ [0, 120] → response Y.

In general, regression analysis aims to quantify the relationship between the response variable and one or more predictors.

Types of response variables in logistic regression models

• Binary (i=2):
  Y = 1 if the company is on the market at time t; 0 if the company got out of the market prior to t

• Ordinal (i>2):
  Y = 0 Good; 1 Moderate; 2 Poor

• Multinomial (i>2):
  Y = 0 Standard phone; 1 Android; 2 iPhone
Distributions for categorical data


Binomial distribution (Agresti 2002, p. 5-6)

• y1, y2, …, yn – n independent and identically distributed (Bernoulli) trials such that:
  P(Yi=1) = π
  P(Yi=0) = 1 − π, where 1 can represent a „success" and 0 a „failure".
• The total number of successes Y = Σ_{i=1}^{n} Yi has the binomial distribution with index n and parameter π, with the probability mass function:

p(y) = C(n, y) · π^y · (1 − π)^(n−y),  y = 0, 1, 2, …, n

with parameters:

μ = E(Y) = nπ
σ² = var(Y) = nπ(1 − π)

• The distribution converges to normality as n increases for fixed π.
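The pmf is easy to evaluate directly; the sketch below computes P(Y = 15) for n = 100 and π = 0.1, the survey-response question posed on the next slide:

```python
from math import comb

# P(Y = 15) for Y ~ Binomial(n=100, pi=0.1):
# 1 in 10 clients responds to the survey; we call 100.
n, pi = 100, 0.1

def binom_pmf(y):
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

p15 = binom_pmf(15)            # roughly 0.033
mean, var = n * pi, n * pi * (1 - pi)   # mu = 10, sigma^2 = 9
```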
[SAS] l02_BinomialDist.sas

Binomial distribution PDF and CDF

• We know that one out of ten clients responds to the survey. How likely is it that we get responses from 15 clients if we call 100?

[Figure: PMF and CDF of the Binomial(n=100, π=0.1) distribution, P(X = x) plotted against x = 0, …, 100.]

Multinomial distribution (Agresti 2002, p. 6-7)

• yi = (yi1, yi2, …, yic) – a multinomial trial with c outcome categories: yij = 1 if trial i has its outcome in category j and yij = 0 otherwise, e.g. yi = (0,1,0)
• The number of trials having an outcome in category j is nj = Σ_{i=1}^{n} yij; the counts (n1, n2, …, nc) then have the multinomial distribution.
• Let πj = P(Yij=1); the probability mass function of the multinomial distribution is:

p(n1, n2, …, nc) = [n! / (n1!·n2!·…·nc!)] · π1^n1 · π2^n2 · … · πc^nc

with parameters:

μ = E(nj) = nπj
σ² = var(nj) = nπj(1 − πj)
cov(nj, nk) = −nπj·πk

• The binomial distribution is the special case with c = 2.
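A small Python sketch of the pmf, checking the stated special case that the multinomial with c = 2 collapses to the binomial:

```python
from math import comb, factorial

def multinomial_pmf(counts, probs):
    # n! / (n1! n2! ... nc!) * prod(pi_j ** n_j)
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    p = float(coef)
    for c, pi in zip(counts, probs):
        p *= pi**c
    return p

# c = 2 special case equals the Binomial(n, pi) pmf at y successes
n, y, pi = 20, 7, 0.3
assert abs(multinomial_pmf([y, n - y], [pi, 1 - pi])
           - comb(n, y) * pi**y * (1 - pi)**(n - y)) < 1e-12
```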
[SAS] l02_MultinomialDist.sas

Multinomial distribution – kernel density

• How likely is it to draw 70 employed and 5 unemployed, given that there are three types of status (employed, unemployed, inactive) with π1=0.75, π2=0.05 and π3=0.2, and n=100?

[Figure: joint distribution and kernel density of the counts of employed (x axis, 60-90) and unemployed (y axis, 0-15).]

Poisson distribution (Agresti 2002, p. 7)

• The Poisson distribution addresses questions about the number of „successes" or „failures" when the number of trials is not fixed upfront. For example, y can represent the number of car defects on the next day.
• The probability mass function of the Poisson distribution is:

p(y) = e^(−μ)·μ^y / y!,  y = 0, 1, 2, …

with parameters:

E(Y) = var(Y) = μ

The distribution converges to normality as μ increases. The distribution is used for counts of events that occur randomly over time or space, when outcomes in disjoint periods or regions are independent. It approximates the binomial probability when n is large and π is small, with μ = nπ.
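The pmf answers the car-defects question on the next slide directly:

```python
from math import exp, factorial

# P(Y = 4) for Y ~ Poisson(mu = 5).
mu = 5

def poisson_pmf(y):
    return exp(-mu) * mu**y / factorial(y)

p4 = poisson_pmf(4)   # approx 0.175
```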
[SAS] l02_PoissonDist.sas

Poisson distribution PDF and CDF


• What is the
likelihood of
observing 4 car
defects next
day if the count
follows Poisson
distribution
with 𝜇=5?

Overdispersion (Agresti, 2002, p. 7-8)

• In practice, count observations often exhibit variability exceeding that predicted by the binomial or Poisson distributions. This is called overdispersion.
• For example, with the Poisson model we assume that each subject has the same probability of getting a car defect. In reality this probability varies due to factors such as the amount of time spent driving, having the car checked, driving conditions etc. Such variation causes the counts to exhibit more variability than under the Poisson model.
• Assuming a Poisson distribution for a count variable is therefore often too simplistic. The alternative is the negative binomial distribution, which permits the variance to exceed the mean.
Introduction to modelling of
binary response variable


The constraints in probability modelling

How can we relate πi to β0 + β1·x1i (i = 1, 2, …, n)?

• Probabilities need to be bounded in [0, 1] => we need a link other than the identity between the linear equation and the probability πi
• What would be the best shape to reflect the relationship between X and π throughout the possible range of X? The relationship is considered non-linear in the logistic regression model.
• In practice we do not observe the probability – this leads to a different assessment of the „error term" than in the linear regression model

Notation in the logistic model

Each observation on Y has a Bernoulli distribution with parameters:

P(Yi = 1) = πi
P(Yi = 0) = 1 − πi

Through the logit transformation we get:

logit(πi) = ln[πi / (1 − πi)] = α + βxi

Logistic curve

π(xi) = 1 / (1 + e^(−(α+βxi)))

The logistic curve embodies an assumption about the relationship between X and π. It can be considered an approximation of that relationship.

Interpreting parameters in logistic regression

π(xi) = 1 / (1 + e^(−(α+βxi)))

• β governs the rate of change – whether the transition is gradual or sharp in X (the greater β, the steeper the rate of change)
• β > 0: the predictor „stimulates" the probability
• β < 0: the predictor is an „inhibitor" for the probability
• β = 0: a flat curve – each value of X gives the same probability assessment

Logistic curve on an example – program code


%let beta1a=1;
%let beta1b=5;

data a01;
beta0=2;
beta1a=&beta1a.;
beta1b=&beta1b.;
do x=-10 to 10 by 0.01;
p1=1/(1+exp(-1*(beta0+beta1a*x)));
p2=1/(1+exp(-1*(beta0+beta1b*x)));
output;
end;
run;

proc sgplot data=a01;
series x=x y=p1 / lineattrs=(color=green) legendlabel="beta=&beta1a.";
series x=x y=p2 / lineattrs=(color=red) legendlabel="beta=&beta1b.";
run;

Logistic curve on an example – output



Logit transformation
The logistic model is often represented through the logit:

logit[π(xi)] = ln[π(xi) / (1 − π(xi))] = α + βxi

where:

π(xi) = 1 / (1 + e^(−(α+βxi)))

• The logit is the natural logarithm of the odds
• It defines the relationship between the linear predictor and the probability as response:
  p→1 then ln(p/[1-p])→+∞
  p→0 then ln(p/[1-p])→−∞
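The two displayed relations are inverses of each other, which a short sketch can confirm (α and β are arbitrary illustrative values):

```python
from math import log, exp

# Round trip: the logit is the inverse of the logistic function,
# so logit(pi(x)) recovers the linear predictor alpha + beta*x.
alpha, beta = 2.0, 1.7

def logistic(x):
    return 1.0 / (1.0 + exp(-(alpha + beta * x)))

def logit(p):
    return log(p / (1.0 - p))

for x in (-3.0, 0.0, 2.5):
    assert abs(logit(logistic(x)) - (alpha + beta * x)) < 1e-9
```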

Logit transformation on an example – program code


%let beta1a=1;
%let beta1b=5;

data a01;
beta0=2;
beta1a=&beta1a.;
beta1b=&beta1b.;
do x=-10 to 10 by 0.01;
p1=1/(1+exp(-1*(beta0+beta1a*x)));
logit1=log(p1/(1-p1));
p2=1/(1+exp(-1*(beta0+beta1b*x)));
logit2=log(p2/(1-p2));
output;
end;
run;

proc sgplot data=a01;
series x=x y=logit1 / lineattrs=(color=green) legendlabel="beta=&beta1a.";
series x=x y=logit2 / lineattrs=(color=red) legendlabel="beta=&beta1b.";
run;

Logit transformation on an example – output



Multivariate binary logistic regression model

logit[π(xi)] = β0 + β1·x1i + ⋯ + βk·xki

• logit[π(xi)] – the logit of the probability of the event
• β0 – the intercept of the regression equation
• βk – the coefficient of the kth predictor variable

PROC LOGISTIC in SAS


• The basic syntax of the LOGISTIC procedure is as follows:

PROC LOGISTIC <options> ;


CLASS variable <(options)><variable <(options)> ></ options> ;
<label:> MODEL events/trials=<effects></ options> ;
ODDSRATIO <’label’> variable </ options> ;
OUTPUT <OUT=SAS-data-set><keyword=name <keyword=name >></ option> ;
RUN;

SAS OnLineDoc:
https://support.sas.com/documentation/cdl/en/statug/63962/HTML/default/viewer.htm#logistic_toc.htm
[SAS] l02_LogRegBank01.sas
[Data] bank.sas7bdat

Example 2.1 Binary logistic regression model

• The data are related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable y), based on a sample of n=4521 clients.
• Source: http://archive.ics.uci.edu/ml/datasets/bank+marketing

Y – NEWDEPOSIT = 1 if the client subscribed to a term deposit; 0 otherwise

X1 – AGE: age of the client

X2 – POUTCOME: response to the previous marketing campaign ∈ {success, failure, unknown, other}

Example 2.1 Syntax

proc logistic data=b02 alpha=0.05 plots(only)=(effect oddsratio);
class Poutcome (param=ref ref='failure');
model NewDeposit(event='1')= Age Poutcome / clodds=pl;
units age=5; *OR for change by given unit;
output out=p predicted=p; *Predictions;
run;

Annotations:
• plots(only)=(effect oddsratio) – plot of the predicted probability (Y axis) by the predictor (X axis), and a plot of the odds ratios with confidence limits
• event='1' – the event category for the binary response model
• clodds=pl – confidence intervals for the odds ratios of all predictor variables
• units age=5 – the unit change in X used for the OR assessment

Example 2.1 SAS Output [1]

Model Information
Data Set                   WORK.B02
Response Variable          NewDeposit
Number of Response Levels  2
Model                      binary logit
Optimization Technique     Fisher's scoring

Number of Observations Read  4521
Number of Observations Used  4521

Response Profile
Ordered Value  NewDeposit  Total Frequency
1              0           4000
2              1            521

Example 2.1 SAS Output [2]

Probability modeled is NewDeposit=1.
(The modelled event by default is the smallest or first value in alphanumeric ordering.)

Class Level Information
Class     Value    Design Variables
poutcome  failure  0 0 0
          other    1 0 0
          success  0 1 0
          unknown  0 0 1

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied – the model parameters were estimated.

Example 2.1 SAS Output [3]

Model Fit Statistics
Criterion  Intercept Only  Intercept and Covariates
AIC        3233.000        3000.518
SC         3239.417        3032.600
-2 Log L   3231.000        2990.518

• -2 Log L: minus twice the log likelihood
• Akaike's information criterion (AIC)
• Schwarz criterion (SC)

The criteria are for assessing the goodness-of-fit of different models; they only measure the relative fit among models. The smaller the value, the better the fit. AIC and SC adjust the log likelihood statistic for the number of parameters, and for the number of parameters and the sample size, respectively. More parsimonious models are preferred.

Example 2.1 SAS Output [4]

Testing Global Null Hypothesis: BETA=0
Test              Chi-Square  DF  Pr > ChiSq
Likelihood Ratio  240.4826    4   <.0001
Score             391.6167    4   <.0001
Wald              239.7336    4   <.0001

The test of H0: β=0 checks whether all regression coefficients are 0.
A p-value ≤ α indicates that at least one predictor is statistically significant.
The three tests assess the same aspect, but the construction of the respective test statistics differs. The likelihood ratio test is more reliable for smaller samples; all tests give similar results for large samples.

Example 2.1 SAS Output [5]

Analysis of Maximum Likelihood Estimates
Parameter              DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept (α)          1   -2.3318   0.2322          100.8112         <.0001
age (β1)               1   0.00995   0.00445         5.0081           0.0252
poutcome other (β2)    1   0.5004    0.2258          4.9124           0.0267
poutcome success (β3)  1   2.4852    0.2285          118.3311         <.0001
poutcome unknown (β4)  1   -0.3835   0.1467          6.8352           0.0089

logit[π(xi)] = ln[π(xi)/(1 − π(xi))]
             = −2.3318 + 0.00995·x1 + 0.5004·x21 + 2.4852·x22 − 0.3835·x23

π(xi) = 1 / (1 + e^(2.3318 − 0.00995·x1 − 0.5004·x21 − 2.4852·x22 + 0.3835·x23))

Example 2.1 SAS Output [6]

[Figure: Predicted Probabilities for NewDeposit=1 – predicted probability (Y axis, 0.00-1.00) by age (X axis, 20-80), with separate curves for poutcome = failure, other, success, unknown.]

Whom should we contact with the intention of getting a positive response?

Example 2.1 SAS Output [7]

• Logistic regression model:

logit[π(xi)] = ln(odds) = β0 + β1·x1 + β2·x21 + β3·x22 + β4·x23

• Odds ratio (5-year increase in age):

odds for the older client   = e^(β0 + β1·(x1+5) + β2·x21 + β3·x22 + β4·x23)
odds for the younger client = e^(β0 + β1·x1 + β2·x21 + β3·x22 + β4·x23)

odds ratio = odds older / odds younger = e^(β1·5) = e^(0.00995·5) ≈ 1.051

The odds ratio for age indicates that the odds of a positive response to the marketing campaign increase by about 5% for each increase of 5 years in the age of the bank client.
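The last step is just the exponential of the coefficient times the unit change:

```python
from math import exp

# Odds ratio for a 5-year increase in age, using the age coefficient
# from the maximum likelihood estimates table above.
beta_age = 0.00995
or_age_5 = exp(beta_age * 5)   # approx 1.051, matching the SAS odds ratio table
```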

Example 2.1 SAS Output [8]

Odds Ratio Estimates and Profile-Likelihood Confidence Intervals
Effect                       Unit    Estimate  95% Confidence Limits
age (units=5)                5.0000  1.051     1.006   1.098
poutcome other vs failure    1.0000  1.649     1.053   2.557
poutcome success vs failure  1.0000  12.003    7.720   18.928
poutcome unknown vs failure  1.0000  0.681     0.515   0.915

[Figure: forest plot of the odds ratios with 95% profile-likelihood confidence limits; odds ratio axis from 0 to 20.]

Basic model assessment: comparing pairs

How does π(xi) compare with π(xj)?

• To find concordant, discordant and tied pairs we compare subjects that had the outcome of interest against subjects that did not

Event = 0: p̂i = ?    Event = 1: p̂j = ?

Concordant pair
• Outcome in agreement with the probabilities

Event = 0: p̂i = 0.03    Event = 1: p̂j = 0.40

Discordant pair
• Outcome in disagreement with the probabilities

Event = 0: p̂i = 0.80    Event = 1: p̂j = 0.60

Tied pair
• The model cannot distinguish between the two subjects

Event = 0: p̂i = 0.60    Event = 1: p̂j = 0.60

Example 2.1 SAS Output [9]

Association of Predicted Probabilities and Observed Responses
Percent Concordant  60.7      Somers' D  0.228
Percent Discordant  37.8      Gamma      0.232
Percent Tied         1.5      Tau-a      0.047
Pairs            2084000      c          0.614
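The reported c statistic can be reconstructed from the percentages as c = (%concordant + 0.5·%tied)/100, and Somers' D as (%concordant − %discordant)/100; small differences from the SAS values come from rounding in the printed percentages:

```python
# Reconstructing c and Somers' D from the concordance table above.
pct_concordant, pct_discordant, pct_tied = 60.7, 37.8, 1.5

c = (pct_concordant + 0.5 * pct_tied) / 100          # about 0.614
somers_d = (pct_concordant - pct_discordant) / 100   # about 0.228
```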

[Data] german.sas
Exercise 2.1
Fit the logistic regression model using the German dataset, with Default as the response variable and housing type as a categorical variable.

1. Write the logistic model equation.
2. Assess the statistical significance of the parameter estimates.
3. Assess the risk of not paying off the loan using the odds ratio estimates from the model.
4. Assess the effect of AGE and DURATION on the risk of default.

[SAS] l02_LogRegGerman01.sas

[Figure: Odds Ratios with 95% Wald Confidence Limits for the German credit model – housing for free vs own, housing rent vs own, and age; odds ratio axis from 1 to 3.]

[Figure: Predicted Probabilities for default=1 – probability (0.00-1.00) by age (20-80), with separate curves for housing = for free, own, rent.]
Binary logistic regression model


Logistic regression
The response variable has a Bernoulli distribution, e.g.:

Y = 1 if the company is on the market at time t; 0 if the company got out of the market prior to t

• The purpose of the analysis is to estimate the probability of observing an event Y=y conditionally on X=x, which is P(Y=y|X=x). The explanatory variables in the model can be:
  • discrete
  • continuous

• Examples of research questions: finding factors affecting company survival; identifying variables having an impact on a decision-making process, e.g. in marketing, to explain why a customer chooses to buy a product.

Specification of the logistic regression model

• Each observation on Y has a Bernoulli distribution:

P(Yi = 1) = πi
P(Yi = 0) = 1 − πi   (0 ≤ πi ≤ 1)

• In the logistic regression model the probabilities are conditional on Xk (k=1,…,p):

π(xi) = P(Y=1 | X=xi) = exp(α + Σ_{k=1}^{p} βk·xki) / [1 + exp(α + Σ_{k=1}^{p} βk·xki)]
      = 1 / [1 + e^(−(α + Σ_{k=1}^{p} βk·xki))]

• The logit transformation provides:

logit[π(xi)] = ln[π(xi) / (1 − π(xi))] = α + Σ_{k=1}^{p} βk·xki

Maximum likelihood estimation

• The method of maximum likelihood yields values for the unknown parameters that maximize the probability of obtaining the observed set of data (Hosmer et al., 2013, section 1.2)
• Set up the likelihood function describing the process under analysis: p(y) for the observed set of outcomes (Y1, Y2, …, Yn)
• (Y1, Y2, …, Yn) → sample → (y1, y2, …, yn)
• Assuming that the individual observations are independent, the probability is:

P(Y1=y1, Y2=y2, …, Yn=yn) = P(Y1=y1)·P(Y2=y2)·…·P(Yn=yn) = Π_{i=1}^{n} p(yi)

• Likelihood function:

L(y1, y2, …, yn | θ1, …, θr) = Π_{i=1}^{n} p(yi | θ1, …, θr)

Maximizing over θ1, …, θr gives the ML estimates θ̂1, …, θ̂r.

Estimation in logistic regression model (1)

• For n independent observations from the Bernoulli distribution the likelihood function is:

L(y1, y2, …, yn | θ1, …, θr) = Π_{i=1}^{n} p(yi | θ1, …, θr)
= π(x1)^y1·(1 − π(x1))^(1−y1) · … · π(xn)^yn·(1 − π(xn))^(1−yn)
= Π_{i=1}^{n} π(xi)^yi·(1 − π(xi))^(1−yi)

• The natural logarithm of the likelihood function gives:

ln L(y1, y2, …, yn | θ1, …, θr) = ln Π_{i=1}^{n} π(xi)^yi·(1 − π(xi))^(1−yi)
= Σ_{i=1}^{n} [ yi·ln π(xi) + (1 − yi)·ln(1 − π(xi)) ]
= Σ_{i=1}^{n} yi·ln[π(xi)/(1 − π(xi))] + Σ_{i=1}^{n} ln(1 − π(xi))
= Σ_{i=1}^{n} yi·(α + Σ_{k=1}^{p} βk·xki) − Σ_{i=1}^{n} ln[1 + exp(α + Σ_{k=1}^{p} βk·xki)]
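The last equality can be checked numerically: the expanded form and the direct Bernoulli form of the log likelihood agree for any data and parameter values (those below are arbitrary illustrative choices):

```python
from math import exp, log

# Compare sum_i [ y_i*ln(pi_i) + (1-y_i)*ln(1-pi_i) ]  (direct form)
# with    sum_i [ y_i*(a + b*x_i) - ln(1 + exp(a + b*x_i)) ]  (expanded form).
x = [0.5, 1.2, -0.3, 2.0, 0.0]
y = [0, 1, 0, 1, 1]
a, b = -0.4, 0.9

def pi(xi):
    return 1.0 / (1.0 + exp(-(a + b * xi)))

direct = sum(yi * log(pi(xi)) + (1 - yi) * log(1 - pi(xi))
             for xi, yi in zip(x, y))
expanded = sum(yi * (a + b * xi) - log(1 + exp(a + b * xi))
               for xi, yi in zip(x, y))
assert abs(direct - expanded) < 1e-12
```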

Estimation in logistic regression model (2)

• In order to get the ML estimates the task is to find the values of the parameters that maximize the likelihood function. For a model with one explanatory variable the likelihood equations are found from:

∂ ln L(y1, …, yn | α, β)/∂α = Σ_{i=1}^{n} yi − Σ_{i=1}^{n} exp(α + βxi)/[1 + exp(α + βxi)]

∂ ln L(y1, …, yn | α, β)/∂β = Σ_{i=1}^{n} yi·xi − Σ_{i=1}^{n} xi·exp(α + βxi)/[1 + exp(α + βxi)]

Setting

∂ ln L/∂α = 0
∂ ln L/∂β = 0

and solving yields α̂, β̂ – the estimates of the parameters of the logistic regression model.

Algorithms for model fitting:
• Newton-Raphson algorithm (Agresti, 2002, p. 143-145)
• Fisher scoring (Agresti, 2002, p. 145-146)

Algorithms for model fitting visually



Newton-Raphson Algorithm numerically

• For t representing the iteration of the algorithm the procedure is:

β^(t+1) = β^(t) − (H^(t))^(−1)·u^(t)

where u is the vector of first derivatives (the score) and H the matrix of second derivatives of the log likelihood L(β):

u = [ ∂L(β)/∂β1, ∂L(β)/∂β2, …, ∂L(β)/∂βp ]ᵀ

H = [ ∂²L(β)/∂βj∂βk ],  j, k = 1, …, p

• The matrix −H is the observed information matrix in the Newton-Raphson setting.

Newton-Raphson Algorithm
Logistic regression (LR) model with one explanatory variable

β^(t+1) = β^(t) − (H^(t))^(−1)·u^(t)

β = (α, β)ᵀ

u = [ Σ_{i=1}^{n} yi − Σ_{i=1}^{n} exp(α + βxi)/(1 + exp(α + βxi)),
      Σ_{i=1}^{n} yi·xi − Σ_{i=1}^{n} xi·exp(α + βxi)/(1 + exp(α + βxi)) ]ᵀ

H = − [ Σ_{i=1}^{n} e^(α+βxi)/(1 + e^(α+βxi))²      Σ_{i=1}^{n} xi·e^(α+βxi)/(1 + e^(α+βxi))²
        Σ_{i=1}^{n} xi·e^(α+βxi)/(1 + e^(α+βxi))²   Σ_{i=1}^{n} xi²·e^(α+βxi)/(1 + e^(α+βxi))² ]

Note that e^(α+βxi)/(1 + e^(α+βxi))² = π(xi)·(1 − π(xi)).
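A minimal Python sketch of the iteration above, with u and H assembled exactly as displayed (the data are arbitrary illustrative values, not from the course datasets):

```python
from math import exp

# Newton-Raphson for the one-predictor logistic model.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 1, 0, 1, 1]

alpha, beta = 0.0, 0.0
for _ in range(30):
    p = [1.0 / (1.0 + exp(-(alpha + beta * xi))) for xi in x]
    # score vector u (first derivatives of ln L)
    u1 = sum(yi - pi for yi, pi in zip(y, p))
    u2 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
    # Hessian H; note e^eta/(1+e^eta)^2 = p*(1-p)
    w = [pi * (1 - pi) for pi in p]
    h11 = -sum(w)
    h12 = -sum(xi * wi for xi, wi in zip(x, w))
    h22 = -sum(xi * xi * wi for xi, wi in zip(x, w))
    det = h11 * h22 - h12 * h12
    # beta_new = beta_old - H^{-1} u, with the 2x2 inverse written explicitly
    alpha -= (h22 * u1 - h12 * u2) / det
    beta -= (-h12 * u1 + h11 * u2) / det

assert abs(u1) < 1e-6 and abs(u2) < 1e-6   # score ~ 0 at the MLE
```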

Newton-Raphson Algorithm [SAS] l03_NetwonRaphsonAlgorithm01.sas

Example, program code

*Function: y = -x**4 - x**3 + x**2 + x + 1.
 Task: find the maximum using the Newton-Raphson algorithm;
data a01;
  do x=-3 to 3 by 0.001;
    y=-x**4-x**3+x**2+x+1;
    output;
  end;
run;

data a02;
  x=5;
  do until(conv<0.01);
    dy  = -4*x**3-3*x**2+2*x+1;
    ddy = -12*x**2-6*x+2;
    xt=x;
    x=x-dy/ddy;
    conv = abs(xt-x);
    output;
  end;
run;

*Plot the function and the iterations;
proc sql noprint;
  select x into :a separated by ' ' from a02;
quit;

proc sgplot data=a01;
  series x=x y=y;
  xaxis min=-3 max=3;
  refline &a. / axis=x lineattrs=(color=red);
run;
Logistic Regression with SAS ©AK, SGH 2020
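The same iteration can be written in Python; a direct translation of the SAS data step above (same starting point x=5 and the same stopping rule |Δx| < 0.01):

```python
# Maximize y = -x**4 - x**3 + x**2 + x + 1 with Newton-Raphson,
# iterating on the first derivative exactly as the SAS data step does.
x = 5.0
conv = 1.0
while conv >= 0.01:
    dy = -4 * x**3 - 3 * x**2 + 2 * x + 1   # first derivative
    ddy = -12 * x**2 - 6 * x + 2            # second derivative
    xt = x
    x = x - dy / ddy
    conv = abs(xt - x)

print(x)  # critical point reached from x0 = 5
```

Starting from x=5 the iteration walks down to the local maximum near x ≈ 0.64; the homework question about other starting points (e.g. -2) can be explored simply by changing the initial value.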

Newton-Raphson Algorithm
Example, Results
Homework:
1. What if the starting point
is different, e.g. -2? What
does this tell about
properties of Newton-
Raphson algorithm?
2. Write a program to
estimate the parameters
of logistic regression
model with one
explanatory variable
using Newton-Raphson
algorithm.
Logistic Regression with SAS ©AK, SGH 2020

Fisher scoring numerically

• The Fisher scoring method is similar to Newton-Raphson with the
exception that the former uses the matrix of expected information J
rather than the matrix of observed information H. For t representing
the iteration of the algorithm the procedure is:

β⁽ᵗ⁺¹⁾ = β⁽ᵗ⁾ − (J⁽ᵗ⁾)⁻¹ u⁽ᵗ⁾

u = [∂L(β)/∂β₁, …, ∂L(β)/∂βₚ]ᵀ,  J = [−E(∂²L(β)/∂βⱼ∂βₖ)],  j,k = 1,…,p

For large samples J and H are approximately equal. For details on the expected
information matrix see: Agresti, 2002, p. 135-139.
Logistic Regression with SAS ©AK, SGH 2020

Plotting log likelihood (1)

Model (0)

Y = 1 (default), 0 (non-default)
X = 1 (having telephone), 0 (not having telephone)

π(xᵢ) = 1/(1 + e^(−(α+βxᵢ)))

Log-likelihood for model (0):

lnL(y₁,y₂,…,yₙ|α,β)
 = Σᵢ₌₁ⁿ yᵢ(α + βxᵢ) − Σᵢ₌₁ⁿ ln[1 + exp(α + βxᵢ)]
 = n_{y=1}·α + n_{y=1;x=1}·β − n_{x=1}·ln[1 + exp(α+β)] − n_{x=0}·ln[1 + exp(α)]
Logistic Regression with SAS ©AK, SGH 2020
[SAS] l03_PlottingLnL01.sas

Plotting log likelihood for model (0)

lnL(y₁,y₂,…,yₙ|α,β)
 = n_{y=1}·α + n_{y=1;x=1}·β − n_{x=1}·ln[1 + exp(α+β)] − n_{x=0}·ln[1 + exp(α)]
 = 300α + 113β − 404·ln[1 + exp(α+β)] − 596·ln[1 + exp(α)]

goptions device=activex;
proc g3d data=p01;
  plot a*b=f;
run;
quit;
Logistic Regression with SAS ©AK, SGH 2020

Model (1)
• Response Y (Customer status) and explanatory X (Housing status)
variables are:
1 𝑑𝑒𝑓𝑎𝑢𝑙𝑡
𝑌=ቊ
0 𝑛𝑜𝑛 − 𝑑𝑒𝑓𝑎𝑢𝑙𝑡

1 𝑂𝑤𝑛
𝑋 = ቐ2 𝑅𝑒𝑛𝑡
3 𝐹𝑜𝑟 𝐹𝑟𝑒𝑒

• Estimate the model assessing the risk of customer being default given
their housing status:
π(xᵢ) = 1/(1 + e^(−(α+β₁x₁ᵢ+β₂x₂ᵢ)))
Logistic Regression with SAS ©AK, SGH 2020
[SAS] l03_EstimationTechniques01.sas

Estimation of logistic regression in SAS


Choosing estimating algorithm

• The option to choose the estimation algorithm is specified in the MODEL statement:

technique=Fisher | Newton

• Fisher scoring is the default
*Binary logistic regression model;
proc logistic data=german;
class housing (param=ref ref='own');
model default(event='1')= housing / technique=Newton;
run;
Model Information

Data Set                   CLASS3.GERMAN      CLASS3.GERMAN
Response Variable          default            default
Number of Response Levels  2                  2
Model                      binary logit       binary logit
Optimization Technique     Fisher's scoring   Newton-Raphson
Logistic Regression with SAS ©AK, SGH 2020

Alternative estimating procedure for LR

proc genmod data=class3.german descending;
  class housing (param=ref ref='own');
  model default= housing / dist=bin link=logit;
run;

Analysis Of Maximum Likelihood Parameter Estimates

Parameter         DF  Estimate  Std Error  Wald 95% Conf. Limits  Wald Chi-Square  Pr > ChiSq
Intercept          1   -1.0415     0.0853   -1.2086   -0.8743         149.11         <.0001
housing for free   1    0.6668     0.2136    0.2481    1.0854           9.74         0.0018
housing rent       1    0.5986     0.1753    0.2550    0.9422          11.66         0.0006
Scale              0    1.0000     0.0000    1.0000    1.0000
Logistic Regression with SAS ©AK, SGH 2020

Parametrization of the logistic model


Parametrization (see SAS Online Doc for other
types of parametrizations):
• Reference - Parameter estimates of CLASS main
effects, using the reference coding scheme,
estimate the difference in the effect of each
nonreference level compared to the effect of
the reference level.
• Effect - Parameter estimates of CLASS main
effects, using the effect coding scheme,
estimate the difference in the effect of each
nonreference level compared to the average
effect over all levels.
Logistic Regression with SAS ©AK, SGH 2020

Reference vs. effect parametrization


• In the model assessing the relationship between a customer being in default
and their housing status, the model has three structural parameters:
the intercept and the effects related to the housing status. The individual
probabilities are calculated using:

π(xᵢ) = 1/(1 + e^(−(α+β₁x₁ᵢ+β₂x₂ᵢ)))

Reference coding:
π(xᵢ) = 1/(1 + e^(−(−1.0414+0.6669x₁ᵢ+0.5987x₂ᵢ)))

Effect coding:
π(xᵢ) = 1/(1 + e^(−(−0.6195+0.2450x₁ᵢ+0.1768x₂ᵢ)))
Logistic Regression with SAS ©AK, SGH 2020

Reference parametrization:  π(xᵢ) = 1/(1 + e^(1.0414−0.6669x₁ᵢ−0.5987x₂ᵢ))

• Calculating the risk of being in default, comparing those renting with
customers owning flats [model (1)]:

π(xᵢ=rent) = 1/(1 + e^(1.0414−0.5987)) ≈ 0.3911
π(xᵢ=own)  = 1/(1 + e^(1.0414)) ≈ 0.2610

O(xᵢ=rent) = π(xᵢ)/(1−π(xᵢ)) ≈ 0.6423
O(xᵢ=own)  = π(xᵢ)/(1−π(xᵢ)) ≈ 0.3530

OR(rent vs own) = O(xᵢ=rent)/O(xᵢ=own) ≈ 1.8198
Logistic Regression with SAS ©AK, SGH 2020
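A short Python check of the arithmetic above, using the reference-coded estimates from the slide (−1.0414 for the intercept and 0.5987 for the rent effect):

```python
import math

alpha, b_rent = -1.0414, 0.5987  # reference-coded estimates

p_rent = 1 / (1 + math.exp(-(alpha + b_rent)))  # risk for renters
p_own = 1 / (1 + math.exp(-alpha))              # risk for owners

o_rent = p_rent / (1 - p_rent)   # odds for renters
o_own = p_own / (1 - p_own)      # odds for owners
or_rent_own = o_rent / o_own     # odds ratio rent vs own

print(p_rent, p_own, or_rent_own)
```

Note that with reference coding the odds are e^(α+βx), so the odds ratio equals exp(0.5987) exactly; the slide's 1.8198 is that value rounded.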

Effect parametrization:  π(xᵢ) = 1/(1 + e^(0.6195−0.2450x₁ᵢ−0.1768x₂ᵢ))

• Calculating the risk of being in default, comparing those renting with
customers owning flats [model (1)]:

π(xᵢ=rent) = 1/(1 + e^(0.6195−0.1768)) ≈ 0.3911
π(xᵢ=own)  = 1/(1 + e^(0.6195+0.2450+0.1768)) ≈ 0.2609

O(xᵢ=rent) = π(xᵢ)/(1−π(xᵢ)) ≈ 0.6423
O(xᵢ=own)  = π(xᵢ)/(1−π(xᵢ)) ≈ 0.3530

OR(rent vs own) = O(xᵢ=rent)/O(xᵢ=own) ≈ 1.8196
Logistic Regression with SAS ©AK, SGH 2020

Parametrization programmatically
• Reference:
proc logistic data=german;
class housing (param=ref ref='own');
model default(event='1')= housing;
run;

• Effect:
*Binary logistic regression model;
proc logistic data=german;
class housing (param=effect ref='own');
model default(event='1')= housing;
run;
Logistic Regression with SAS ©AK, SGH 2020

Selecting variables for the model


• Theoretically sensible
• Frequencies analysis to check the distributions – sparse categories
will affect the model estimation
• Selection method (e.g. forward, backward, stepwise)
• Statistical significance of the effects
Logistic Regression with SAS ©AK, SGH 2020

Variable selection methods in SAS


• SAS OnlineDoc:
• When SELECTION=FORWARD, PROC LOGISTIC first estimates parameters for effects forced into
the model. These effects are the intercepts and the first n explanatory effects in
the MODEL statement, where n is the number specified by the START= or INCLUDE= option in
the MODEL statement (n is zero by default). Next, the procedure computes the score chi-square
statistic for each effect not in the model and examines the largest of these statistics. If it is
significant at the SLENTRY= level, the corresponding effect is added to the model. Once an effect
is entered in the model, it is never removed from the model.
• When SELECTION=BACKWARD, the model including all the variables is estimated first. Results of the Wald test for
individual parameters are examined.
the SLSTAY= level for staying in the model is removed. Once an effect is removed from the model,
it remains excluded. The process is repeated until no other effect in the model meets the
specified level for removal or until the STOP= value is reached.
• The SELECTION=STEPWISE option is similar to the SELECTION=FORWARD option except that
effects already in the model do not necessarily remain. Effects are entered into and removed
from the model in such a way that each forward selection step can be followed by one or more
backward elimination steps. The stepwise selection process terminates if no further effect can be
added to the model or if the current model is identical to a previously visited model.
Logistic Regression with SAS ©AK, SGH 2020
[SAS] l04_VariableSelection01.sas

Exploring variables through SAS selection


method
• Ex 1. Estimate the model assessing risk of customer being default
adding new explanatory variables using three selection methods.
Compare the results.
*Exploring variables using selection procedure;
proc logistic data=german;
class housing personal_status debtors telephone / param=ref;
model default(event='1')= duration credit_amt age housing
personal_status debtors telephone
/ selection=forward /*Forward | Backward | Stepwise*/;
run;
Logistic Regression with SAS ©AK, SGH 2020
[SAS] l04_ModelFitStatistics01.sas

Assessing goodness-of-fit

lnL(y₁,y₂,…,yₙ|β₁,…,βᵣ)
 = Σᵢ₌₁ⁿ yᵢ(α + Σₖ₌₁ᵖ βₖxₖᵢ) − Σᵢ₌₁ⁿ ln[1 + exp(α + Σₖ₌₁ᵖ βₖxₖᵢ)]

In model (1):
lnL(y₁,y₂,…,yₙ|α,β₁,β₂)
 = Σᵢ₌₁ⁿ yᵢ(α + β₁x₁ᵢ + β₂x₂ᵢ) − Σᵢ₌₁ⁿ ln[1 + exp(α + β₁x₁ᵢ + β₂x₂ᵢ)]

AIC = −2lnL + 2p
SC  = −2lnL + p·ln(n)

p - number of parameters [model (1) has 3 parameters]
n - sample size
Logistic Regression with SAS ©AK, SGH 2020
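As a numeric illustration (assumed inputs: −2lnL = 1204.049 for model (1), taken from the likelihood-ratio example later in the lecture, with p = 3 parameters and n = 1000 observations):

```python
import math

neg2lnL = 1204.049  # -2 ln L for model (1)
p = 3               # number of parameters
n = 1000            # sample size

aic = neg2lnL + 2 * p           # AIC = -2lnL + 2p
sc = neg2lnL + p * math.log(n)  # SC  = -2lnL + p ln(n)

print(aic, sc)
```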

Testing significance of effects

• Testing if at least one explanatory variable in the model is statistically
significant, i.e. the model allows us to identify at least one effect which
affects the probability under analysis:

H₀: β = 0
H₁: ∃k βₖ ≠ 0
Logistic Regression with SAS ©AK, SGH 2020

Likelihood ratio test (Agresti, 2002, p. 12)

• The idea is to compare lnL in the null model with the
model with explanatory variable(s). The test statistic is:

−2lnΛ = −2ln(l₀/l₁) = −2(lnL₀ − lnL₁)

• For model (1) we get:

−2lnΛ = −2lnL₀ + 2lnL₁ = 1221.729 − 1204.049 = 17.68

• −2lnΛ has a limiting chi-square distribution with degrees of freedom equal to the
difference in the dimension of the parameter spaces between the compared
models.
Logistic Regression with SAS ©AK, SGH 2020

[SAS] l04_WaldTest01.sas

Wald test (Agresti, 2002, p. 11)

• The idea is to compare the estimates with their
standard errors. The test statistic is:

W = (β̂ − β₀)ᵀ [cov(β̂)]⁻¹ (β̂ − β₀)

• For model (1) we get:

W = [β̂₁ β̂₂] · [ σ̂²_β₁    σ̂_β₁β₂ ]⁻¹ · [β̂₁]
               [ σ̂_β₂β₁   σ̂²_β₂  ]     [β̂₂]

• W has a limiting chi-square distribution with
degrees of freedom equal to the rank of the covariance matrix.
Logistic Regression with SAS ©AK, SGH 2020

Score test (Agresti, 2002, p. 12-13)

• The idea is to compare the slope and expected curvature of the log-
likelihood function at β₀. The test statistic is:

S = uᵀ J⁻¹ u

• S has an approximate chi-square distribution with r
degrees of freedom, where r is the number of restrictions imposed on
β. See details in the OnlineDoc.
Logistic Regression with SAS ©AK, SGH 2020

Score test (2)

Under H₀ the assumption is that the regression coefficients are equal to 0, and therefore the logit equals:

logit[π(xᵢ)] = ln[π(xᵢ)/(1 − π(xᵢ))]

with the estimate of P(Y=1) given by π̂(xᵢ) = (1/n)·Σᵢ yᵢ.

H₀: [β₁ … βₚ]ᵀ = [0 … 0]ᵀ
H₁: ∃k βₖ ≠ 0

The test statistic is:

S = uᵀ J⁻¹ u

u = [∂L(β)/∂β₁, …, ∂L(β)/∂βₚ]ᵀ,  J = [−E(∂²L(β)/∂βⱼ∂βₖ)],  j,k = 1,…,p
Logistic Regression with SAS ©AK, SGH 2020

Score test example

In model (0) we get:

lnL(y₁,y₂,…,yₙ|α,β) = n_{y=1}·α + n_{y=1;x=1}·β − n_{x=1}·ln[1+exp(α+β)] − n_{x=0}·ln[1+exp(α)]

u = [ n_{y=1} − n_{x=1}·exp(α+β)/(1+exp(α+β)) − n_{x=0}·exp(α)/(1+exp(α)) ,
      n_{y=1;x=1} − n_{x=1}·exp(α+β)/(1+exp(α+β)) ]ᵀ

J = [ n_{x=1}·e^(α+β)/(1+e^(α+β))² + n_{x=0}·e^α/(1+e^α)²     n_{x=1}·e^(α+β)/(1+e^(α+β))²
      n_{x=1}·e^(α+β)/(1+e^(α+β))²                             n_{x=1}·e^(α+β)/(1+e^(α+β))² ]

where n_{y=1} = 300, n_{y=0} = 700, n_{x=0} = 596, n_{x=1} = 404, n_{x=1;y=1} = 113.

Under H₀: α = ln(0.3/0.7) = −0.8473, β = 0

S = uᵀ J⁻¹ u = 1.329784
Logistic Regression with SAS ©AK, SGH 2020
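The calculation above can be reproduced in a few lines; a Python sketch using the counts from the slide, with the 2×2 quadratic form uᵀJ⁻¹u written out explicitly:

```python
import math

n_y1, n_x1, n_x0, n_y1x1 = 300, 404, 596, 113
n = n_x1 + n_x0

alpha = math.log(0.3 / 0.7)                   # intercept under H0
p = math.exp(alpha) / (1 + math.exp(alpha))   # fitted probability, equals 0.3
w = p * (1 - p)

# Score vector evaluated at (alpha, beta = 0)
u1 = n_y1 - n * p
u2 = n_y1x1 - n_x1 * p

# Expected information J = [[a, b], [b, d]] and the quadratic form u' J^{-1} u
a, b, d = n * w, n_x1 * w, n_x1 * w
det = a * d - b * b
S = (d * u1 * u1 - 2 * b * u1 * u2 + a * u2 * u2) / det

print(S)
```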

Wald test for specific variable

• For model (1) with the additional effect of age included we get:

W = (β̂ − β₀)ᵀ [cov(β̂)]⁻¹ (β̂ − β₀) = β̂₃²/σ̂²_β₃ = (−0.021304067)²/0.0000472834 ≈ 9.5988

where se(β̂₃) = σ̂_β₃.

• W has a limiting chi-square distribution with
degrees of freedom equal to the rank of the covariance matrix.
Logistic Regression with SAS ©AK, SGH 2020
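For a single coefficient the Wald statistic reduces to a scalar ratio; a quick check of the numbers on the slide:

```python
beta3 = -0.021304067   # estimate of the age effect
var3 = 0.0000472834    # estimated variance of beta3

W = beta3 ** 2 / var3  # Wald chi-square with 1 degree of freedom
print(W)
```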

Measures of Goodness-of-Fit
• Assessing the fit of the model - we would like to know whether the
probabilities produced by the model accurately reflect the true
outcome experience in the data (Hosmer et al., 2007).
• Deviance and Pearson χ²: the statistics compare the saturated model with
the estimated model. The saturated model reflects an ideal fit – the probabilities
from the model are equal to the observed proportions for each profile
of the contingency table of counts nᵢⱼ (i = 1,…,l rows; j = 1,…,k columns):

Saturated model: nᵢⱼ/n   vs.   Estimated model: p̂ᵢⱼ = n̂ᵢⱼ/n
Logistic Regression with SAS ©AK, SGH 2020

[SAS] l04_SaturatedModel01.sas

Saturated model

• Let us consider model (1) with TELEPHONE included as an
explanatory variable. The counts (nᵢⱼ) in the saturated model come from the
cross-tabulation, and the probabilities in the saturated model are the
corresponding observed proportions:

proc freq data=german;
  tables housing*telephone / out=f01;
run;

proc logistic data=german;
  class housing (param=ref ref='own')
        telephone (param=ref ref='yes');
  model default(event='1')= housing telephone / aggregate scale=none;
  output out=out predprobs=(i);
run;
Logistic Regression with SAS ©AK, SGH 2020

Deviance and Pearson χ2

• Deviance:
𝑙 𝑘
𝑛𝑖𝑗
𝐷 = 2 ෍ ෍ 𝑛𝑖𝑗 𝑙𝑛
𝑛ො 𝑖𝑗
𝑖=1 𝑗=1
• Pearson χ2:
𝑙 𝑘 2
2
𝑛𝑖𝑗 − 𝑛ො 𝑖𝑗
𝜒 = ෍෍
𝑛ො 𝑖𝑗
𝑖=1 𝑗=1

𝑛𝑖𝑗 - observed counts


n̂ᵢⱼ − counts estimated from the model, E(X) = nπ (by the property of the binomial distribution)
D and χ² follow a chi-square distribution with degrees of freedom equal to the difference between the number
of profiles (parameters estimated in the saturated model) and the number of estimated parameters.
Logistic Regression with SAS ©AK, SGH 2020

Example
• Test the goodness of fit of model (1) with TELEPHONE included as an
explanatory variable using Deviance and Pearson χ². Calculate the
Deviance and Pearson χ² statistics.

Profile (Y, telephone, housing)   Expected    Observed   n·ln(n/n̂)   (n−n̂)²/n̂
0  no   rent                       68.73725      66       -2.68201    0.10900
0  no   own                       313.67000     319        5.37501    0.09057
0  no   for free                   26.58863      24       -2.45832    0.25203
0  yes  rent                       40.25858      43        2.83271    0.18668
0  yes  own                       213.32640     208       -5.25938    0.13299
0  yes  for free                   37.40772      40        2.68009    0.17964
1  no   rent                       47.26275      50        2.81503    0.15853
1  no   own                       119.33000     114       -5.20913    0.23807
1  no   for free                   20.41137      23        2.74625    0.32830
1  yes  rent                       22.74142      20       -2.56912    0.33047
1  yes  own                        66.67357      72        5.53374    0.42552
1  yes  for free                   23.59228      21       -2.44434    0.28484

D = 2·Σ n·ln(n/n̂) = 2.721091       χ² = 2.716630
Logistic Regression with SAS ©AK, SGH 2020
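The per-profile contributions can be checked directly; a Python sketch with the expected and observed counts from the example table:

```python
import math

# (expected, observed) counts for the 12 outcome-by-covariate profiles
cells = [
    (68.73725, 66), (313.67, 319), (26.58863, 24),
    (40.25858, 43), (213.3264, 208), (37.40772, 40),
    (47.26275, 50), (119.33, 114), (20.41137, 23),
    (22.74142, 20), (66.67357, 72), (23.59228, 21),
]

D = 2 * sum(n * math.log(n / e) for e, n in cells)   # deviance
chi2 = sum((n - e) ** 2 / e for e, n in cells)       # Pearson chi-square

print(D, chi2)
```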

Deviance and Pearson χ² - practical aspects

• If the logistic regression model satisfies any of the following conditions:
• it includes many explanatory variables
• the explanatory variables have a considerable number of categories
• the model includes continuous variables
then Deviance and Pearson χ² do not follow a χ² distribution.
In practice the contingency table then includes sparse categories – cells with small or zero
counts. A test which allows us to measure the goodness of fit of a
model satisfying any of the above conditions is the Hosmer-Lemeshow (HL)
test.
Logistic Regression with SAS ©AK, SGH 2020

Hosmer-Lemeshow test
• Hosmer Lemeshow statistic is calculated as follows:
• Estimate the individual probabilities.
• Sort within response and divide set into 10 groups with similar
number of observations (k=1,2,3,…,10)
• Using the groups calculate expected counts and compare with
observed counts using Pearson χ2 statistic, which has approximately χ2
distribution with v=10-2=8 degrees of freedom.
proc logistic data=german;
class housing (param=ref ref='own') telephone (param=ref ref='yes');
model default(event='1')= housing telephone / aggregate scale=none
lackfit;
output out=out predprobs=(i);
run;
Logistic Regression with SAS ©AK, SGH 2020

[SAS] l04_HosmerLemeshow01.sas

Example
• Test the goodness of fit of model (1) with AGE included as an
explanatory variable using the Hosmer-Lemeshow statistic.
(SAS OnlineDoc)

[Table: per-decile contributions to the Pearson χ² statistic for events and
non-events; the contributions sum to the Hosmer-Lemeshow statistic HL = 7.409344]
Logistic Regression with SAS ©AK, SGH 2020

Predictive power of the model

• How do the predictions differ from the observed values? Generalized
coefficient of determination (Cox & Snell R²):

R² = 1 − e^(−LR/n)

• where LR is the likelihood ratio statistic. The minimum of the range of R² is 0,
with the maximum smaller than 1. The statistic can be normalized using
the following transformation (Nagelkerke):

R²ₙ = R² / (1 − e^(2lnL₀/n))

• where lnL₀ is the log-likelihood of the model with intercept only; R²ₙ is in [0,1]
Logistic Regression with SAS ©AK, SGH 2020
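A numeric sketch of the two statistics (assumed inputs: LR = 17.68, n = 1000, and −2lnL₀ = 1221.729, taken from the likelihood-ratio example earlier in the lecture):

```python
import math

LR = 17.68             # likelihood ratio statistic for model (1)
n = 1000               # sample size
lnL0 = -1221.729 / 2   # log-likelihood of the intercept-only model

r2 = 1 - math.exp(-LR / n)            # Cox & Snell R-square
r2_max = 1 - math.exp(2 * lnL0 / n)   # its maximum attainable value
r2_n = r2 / r2_max                    # Nagelkerke R-square

print(r2, r2_n)
```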

[SAS] l04_RSqaure01.sas

Example
• Assess the predictive power of model (1) with AGE included as an
explanatory variable using the R² statistics (RSQ option).

proc logistic data=german;


class housing (param=ref ref='own');
model default(event='1')= housing age / aggregate scale=none
lackfit rsq;
output out=out predprobs=(i);
ods output LackFitPartition=hl;
run;
Logistic Regression with SAS ©AK, SGH 2020

Discriminatory performance of the model (1)

• Create pairs of observations and compare
the probabilities (Kleinbaum and Klein, 2010, p. 358):

• nc - number of concordant pairs – π(default) > π(non-default)
• nd - number of discordant pairs – π(default) < π(non-default)
• nt - number of ties – π(default) = π(non-default)
• t - number of pairs - comparing 300 defaults with 700 non-defaults
gives 300×700 = 210 000 comparisons.
Logistic Regression with SAS ©AK, SGH 2020

Discriminatory performance of the model (2)

• Somers' D = (nc−nd)/t – in [-1,1], where -1 means all pairs discordant and 1 all concordant
• Goodman-Kruskal Gamma = (nc−nd)/(nc+nd)
• Kendall's Tau = (nc−nd)/(0.5n(n−1))
• c = (nc+0.5(t−nc−nd))/t – area under the ROC curve, normalized in [0.5,1],
where 0.5 reflects a random assignment of probabilities and 1 an ideal
assignment of probabilities
Logistic Regression with SAS ©AK, SGH 2020
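The pairwise counts can be sketched directly; a toy Python example (the probabilities below are hypothetical, not from the German credit model):

```python
# Predicted probabilities for events (Y=1) and non-events (Y=0) - toy data
p_event = [0.8, 0.6]
p_nonevent = [0.7, 0.3, 0.6]

nc = nd = nt = 0
for pe in p_event:            # compare every event / non-event pair
    for pn in p_nonevent:
        if pe > pn:
            nc += 1           # concordant
        elif pe < pn:
            nd += 1           # discordant
        else:
            nt += 1           # tied
t = len(p_event) * len(p_nonevent)

somers_d = (nc - nd) / t
gamma = (nc - nd) / (nc + nd)
c = (nc + 0.5 * (t - nc - nd)) / t   # area under the ROC curve

print(nc, nd, nt, somers_d, gamma, c)
```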

Classification tables
• A tool to assess the predictive power of the model. How good is the model in assessing the values of
Y? We set up a boundary cut-off in (0,1). Then, using the cut-off, we may classify observations into those for
which the event occurred and those for which the event did not occur. We can assess the
appropriateness of the classification using the following statistics:

Sensitivity (assessing occurrence of the event):      S₁ = n_{Ŷ=1|Y=1} / n_{Y=1}
Specificity (assessing non-occurrence of the event):  S₀ = n_{Ŷ=0|Y=0} / n_{Y=0}
False occurrence:                                     FP = n_{Ŷ=1|Y=0} / n_{Ŷ=1}
False non-occurrence:                                 FN = n_{Ŷ=0|Y=1} / n_{Ŷ=0}
Logistic Regression with SAS ©AK, SGH 2020
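The four statistics can be computed from the 2×2 classification counts; a small Python sketch with made-up counts at some cut-off:

```python
# Hypothetical classification counts at a chosen cut-off
n11 = 80   # Y=1 predicted as 1
n12 = 20   # Y=1 predicted as 0
n21 = 30   # Y=0 predicted as 1
n22 = 70   # Y=0 predicted as 0

sensitivity = n11 / (n11 + n12)   # S1: correct among actual events
specificity = n22 / (n21 + n22)   # S0: correct among actual non-events
fp = n21 / (n11 + n21)            # false occurrence rate
fn = n12 / (n12 + n22)            # false non-occurrence rate

print(sensitivity, specificity, fp, fn)
```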

[SAS] l04_ClassificationTable01.sas

Classification table for model (1)

        Ŷ=1   Ŷ=0   nᵢ.
Y=1     n11   n12   n1.
Y=0     n21   n22   n2.
n.j     n.1   n.2   n

Accuracy: (n11+n22)/n
Sensitivity: n11/n1.   Specificity: n22/n2.
False occurrence: n21/n.1   False non-occurrence: n12/n.2

proc logistic data=class3.german;
  class housing (param=ref ref='own');
  model default(event='1')= housing /
        ctable pprob=0.3;
  output out=out predprobs=(i);
run;
Logistic Regression with SAS ©AK, SGH 2020 [SAS] l04_RocCurve01.sas

ROC curve

S₁ = n_{Ŷ=1|Y=1} / n_{Y=1}  (sensitivity; increases with a smaller cut-off)
S₀ = n_{Ŷ=0|Y=0} / n_{Y=0}  (specificity)

[Figure: ROC curve – sensitivity plotted against 1−specificity for varying cut-offs]
Logistic Regression with SAS ©AK, SGH 2020
[SAS] l04_Residuals01.sas

Residuals in logistic regression


• Assessing goodness-of-fit (large residuals indicate poor fit)
• Identifying influential observations
• The residuals are:
• DFBETASi – how the estimates of the coefficients would change after removing the ith
observation
• DIFDEVi/DIFCHISQi – the change in Deviance/Pearson χ² after removing the ith observation
• Ci and CBARi – the confidence interval displacement diagnostics that
measure the influence of individual observations on the regression estimates
• SAS OnlineDoc:
https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/vie
wer.htm#statug_logistic_sect042.htm
Logistic Regression with SAS ©AK, SGH 2020

[SAS] l04_Collinearity01.sas

Collinearity assessment
• Collinearity in logistic regression leads to overestimation of the
standard errors and could lead to overestimation of the regression
coefficients
• Standard statistics used in linear regression can be applied to detect
collinearity: the variance inflation factor (VIF) and tolerance (TOL):

VIF = 1/(1 − R²ₖ)    TOL = 1/VIF
Logistic Regression with SAS ©AK, SGH 2020
[SAS] l04_DepositBinaryModel01.sas
Exercise with R and Python [R] l04_DepositBinaryModel01.R

examples – Binary Logistic Model [Python] l04_DepositBinaryModel01.py


Use the dataset Bank.sas7bdat (see: Lectures - slide 52) to estimate the logistic regression model with NEWDEPOSIT
as the target variable. The aim of the analysis is to find out what is the chance that a client with a particular profile
would be willing to subscribe to a term deposit. Sample size n=4521. The input variables to consider are:

o Age (numeric)
o Education (categorical: "unknown", "secondary", "primary", "tertiary")
o Balance: average yearly balance, in euros (numeric)
o Duration: last contact duration, in seconds (numeric)
o Poutcome: outcome of the previous marketing campaign (categorical: "unknown", "other", "failure", "success")

Output variable (target):


17 - y (NEWDEPOSIT)- has the client subscribed a term deposit? (binary: "yes", "no")

Following the estimation provide answers to questions in


Logistic regression with SAS Quiz: Binary Logistic Regression: Link
Logistic Regression with SAS ©AK, SGH 2020

Homework
1. Download and familiarize yourself with the data and documentation from
European Social Survey (ESS) study, Round 9:
http://www.europeansocialsurvey.org
2. Formulate a research problem. The aim is to practice binary logistic regression
– please consider that in the choice of the problem
3. Formulate the research hypothesis. The hypothesis will be verified in your
paper.
4. Suggest the exact set of features from the ESS data to include in the analysis.
5. Specify the model to be used in the analysis.

Prepare a brief presentation to share with the group. The proposals will be
discussed. The research problem will be analyzed in your papers (Projects I, II and
III).
Logistic Regression with SAS ©AK, SGH 2020

Logistic regression - model types


• Binary(i=2)
1 𝑐𝑜𝑚𝑝𝑎𝑛𝑦 𝑖𝑠 𝑜𝑛 𝑡ℎ𝑒 𝑚𝑎𝑟𝑘𝑒𝑡 𝑎𝑡 𝑡𝑖𝑚𝑒 𝑡
𝑌=ቊ
0 𝑐𝑜𝑚𝑝𝑎𝑛𝑦 𝑔𝑜𝑡 𝑜𝑢𝑡 𝑜𝑓 𝑡ℎ𝑒 𝑚𝑎𝑟𝑘𝑒𝑡 𝑝𝑟𝑖𝑜𝑟 𝑡𝑜 𝑡

• Ordinal (i>2)
0 𝐺𝑜𝑜𝑑
𝑌 = ቐ1 𝑀𝑜𝑑𝑒𝑟𝑎𝑡𝑒
2 𝑃𝑜𝑜𝑟

• Multinomial (i>2)
0 𝑆𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑝ℎ𝑜𝑛𝑒
𝑌 = ቐ1 𝐴𝑛𝑑𝑟𝑜𝑖𝑑
2 𝑖𝑃ℎ𝑜𝑛𝑒
Logistic Regression with SAS ©AK, SGH 2020

Multinomial logistic regression model


Logistic Regression with SAS ©AK, SGH 2020

Multinomial model specification


• Response variable Y has j>2 categories. Let us discuss the example of
Y={0,1,2}, j=3. The assumption is that Y has multinomial distribution
with parameters:
• P(Y=0)=π0
• P(Y=1)=π1
• P(Y=2)= π2 =1-(π0+π1)
• Model includes p explanatory variables captured in a vector x and
intercept
0 𝑆𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑝ℎ𝑜𝑛𝑒
𝑌 = ቐ1 𝐴𝑛𝑑𝑟𝑜𝑖𝑑
2 𝑖𝑃ℎ𝑜𝑛𝑒
Logistic Regression with SAS ©AK, SGH 2020

Logit notation in multinomial model

• For Y with three categories the multinomial logistic regression model can
be expressed by two logit functions:

ln[P(Y=1|x)/P(Y=0|x)] = β₁₀ + β₁₁x₁ + … + β₁ₚxₚ = xᵀβ₁

ln[P(Y=2|x)/P(Y=0|x)] = β₂₀ + β₂₁x₁ + … + β₂ₚxₚ = xᵀβ₂
Logistic Regression with SAS ©AK, SGH 2020

Conditional probabilities
• The conditional probabilities πⱼ|x are derived using the following
system of equations:

ln(π₁|x / π₀|x) = xᵀβ₁
ln(π₂|x / π₀|x) = xᵀβ₂
π₀|x = 1 − (π₁|x + π₂|x)

The individual conditional probabilities can be expressed by:

π₀|x = 1 / (1 + e^(xᵀβ₁) + e^(xᵀβ₂))
π₁|x = e^(xᵀβ₁) / (1 + e^(xᵀβ₁) + e^(xᵀβ₂))
π₂|x = e^(xᵀβ₂) / (1 + e^(xᵀβ₁) + e^(xᵀβ₂))
Logistic Regression with SAS ©AK, SGH 2020
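The three probabilities above amount to a baseline-category softmax; a minimal Python sketch with hypothetical linear predictors xᵀβ₁ and xᵀβ₂ (the values 0.4 and −0.2 are illustrative, not estimates):

```python
import math

# Hypothetical linear predictors for the two non-reference logits
xb1 = 0.4    # x'beta_1 (e.g. Android vs standard phone)
xb2 = -0.2   # x'beta_2 (e.g. iPhone vs standard phone)

denom = 1 + math.exp(xb1) + math.exp(xb2)
pi0 = 1 / denom               # P(Y=0|x), the reference category
pi1 = math.exp(xb1) / denom   # P(Y=1|x)
pi2 = math.exp(xb2) / denom   # P(Y=2|x)

print(pi0, pi1, pi2)
```

The probabilities sum to 1 by construction, and the two log-ratios recover the linear predictors, which is a convenient self-check of the formulas.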

Model estimation – likelihood function

• The multinomial model is estimated based on the multinomial
distribution. The conditional likelihood function for a sample of n
independent observations is (Agresti, 2002, p. 272-273; Hosmer,
Lemeshow, Sturdivant, 2013, p. 269-272):

Y = 0: y₀ = 1, y₁ = 0, y₂ = 0
Y = 1: y₀ = 0, y₁ = 1, y₂ = 0
Y = 2: y₀ = 0, y₁ = 0, y₂ = 1

L(y₁,y₂,…,yₙ|β₁,…,βᵣ) = Πᵢ₌₁ⁿ p(yᵢ|β₁,…,βᵣ)
 = π₀(x₁)^y₀₁ π₁(x₁)^y₁₁ π₂(x₁)^y₂₁ … π₀(xₙ)^y₀ₙ π₁(xₙ)^y₁ₙ π₂(xₙ)^y₂ₙ
 = Πᵢ₌₁ⁿ π₀(xᵢ)^y₀ᵢ π₁(xᵢ)^y₁ᵢ π₂(xᵢ)^y₂ᵢ
 = Πᵢ₌₁ⁿ (1 − π₁(xᵢ) − π₂(xᵢ))^y₀ᵢ π₁(xᵢ)^y₁ᵢ π₂(xᵢ)^y₂ᵢ
Logistic Regression with SAS ©AK, SGH 2020

Odds ratio in multinomial model

• For Y=0 being the reference category and Y=1 the category under
analysis, the odds ratio allows us to assess the chance of observing
Y=1 as compared to Y=0 if the subject has X=a as compared to X=b, where
a and b are specified characteristics. For example, we may answer the
question: what is the chance that a client would have a specific
smartphone and not a simple phone in the group of people with higher
education?

OR₁₀(a,b) = [P(Y=1|X=a)/P(Y=0|X=a)] / [P(Y=1|X=b)/P(Y=0|X=b)]
Logistic Regression with SAS ©AK, SGH 2020
[Data] phones.sas7bdat

Exercise – Estimation of multinomial model


• What is the chance of a client having a smartphone given their characteristics?
• The dataset Phones contains information on use of phone based on study „The
Pew Research Center’s Internet & American Life Project”,
http://pewinternet.org/Shared-Content/Data-Sets/2012/April-2012--Cell-
Phones.aspx [date accessed 3Jan2014]. The dataset includes clients with standard
phone, Android phone or iPhone with the following additional characteristics: age
(AGE), education (EDU), internet use (INTUSE), using internet in mobile device
(INTMOB) and quality of life (LQ). There are n=1061 observations in the dataset.
• Task: assess the chance using contingency tables and logistic regression model.
0 𝑆𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑝ℎ𝑜𝑛𝑒
𝑌 = ቐ1 𝐴𝑛𝑑𝑟𝑜𝑖𝑑
2 𝑖𝑃ℎ𝑜𝑛𝑒
X: LQ SEX AGE EDU INTUSE INTMOB
Logistic Regression with SAS ©AK, SGH 2020

Phones dataset

ID  Name     Description                                 Categories
1   Y        Type of phone                               0=standard; 1=Android; 2=iPhone
2   LQ       Quality of Life                             1=Excellent; 2=Very good; 3=Good; 4=Fair or poor
3   EDU      Education                                   1=At least started higher education; 0=otherwise
4   SEX      Sex of the respondent                       1=Male; 2=Female
5   AGE      Age of the respondent                       N/A
6   INTUSE   Is using Internet?                          1=Yes; 2=No
7   INTMOB   Is using Internet through mobile devices?   1=Yes; 2=No
Logistic Regression with SAS ©AK, SGH 2020

Comparing clients with higher education [SAS] l05_MultContingency01.sas

to other education – contingency tables


Logistic Regression with SAS ©AK, SGH 2020

Comparing clients with higher education to


other education – multinomial model

• The chance of having an Android smartphone is more than 50% greater in the
group of clients with higher education.
• The chance of having an iPhone is more than 2 times greater
in the group of clients with higher education.
• For those with higher education it is likely that they will have
smartphones. Question: is there a significant difference between
Android and iPhone though?
Logistic Regression with SAS ©AK, SGH 2020

iPhone vs Android
• H₀: OR₁₀ = OR₂₀

• The point estimate with a 95% confidence interval:

• OR₂₁ = exp(0.3377) = 1.40 (1.01, 1.94). H₀ rejected: OR₁₀ and OR₂₀
are statistically significantly different.
Logistic Regression with SAS ©AK, SGH 2020

[SAS] l05_MultinomialModel01.sas

Exercise – estimating multinomial model


• Estimate the model assessing the chance of having a smartphone given
the client's characteristics. Assess the significance of the parameters and
interpret the results. The model specification is as follows:

0 𝑆𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑝ℎ𝑜𝑛𝑒
𝑌 = ቐ1 𝐴𝑛𝑑𝑟𝑜𝑖𝑑
2 𝑖𝑃ℎ𝑜𝑛𝑒

X: SEX AGE EDU INTUSE INTMOB


Logistic Regression with SAS ©AK, SGH 2020

Profile probabilities
Given the estimates of the model, provide the probability estimates
for the following profiles:
1. A 31-year-old woman with higher education, using the Internet in general and
specifically on mobile devices - what is the probability of each phone type?
2. Plot the probabilities against age.
Logistic Regression with SAS ©AK, SGH 2020

Deriving profile probabilities


Logistic Regression with SAS ©AK, SGH 2020

Plotting probabilities
Logistic Regression with SAS ©AK, SGH 2020

Ordinal logistic regression model


Logistic Regression with SAS ©AK, SGH 2020

Ordinal model specification


• Response variable Y has j>2 categories measured on an ordinal scale.
Order of the categories matters. Let us discuss the example of
Y={0,1,2}, j=3. The assumption is that Y has multinomial distribution
with parameters:
• P(Y=0)=π0
• P(Y=1)=π1
• P(Y=2)= π2 =1-(π0+π1)
• Model includes p explanatory variables captured in a vector x and
intercept. For example let Y denotes quality of life:
1 𝐹𝑎𝑖𝑟 𝑜𝑟 𝑝𝑜𝑜𝑟
𝑌 = ቐ2 𝑉𝑒𝑟𝑦 𝑔𝑜𝑜𝑑 𝑜𝑟 𝑔𝑜𝑜𝑑
3 𝐸𝑥𝑐𝑒𝑙𝑙𝑒𝑛𝑡
Logistic Regression with SAS ©AK, SGH 2020

Cumulative Logit Model

• An ordinal variable can be modeled in various ways, e.g. using the
multinomial model. A special type of model which allows us to take the
ordering of the categories into account is the cumulative logit model
(Kleinbaum and Klein, 2010, Chapter 13). For categories 0,1,2,3,4 the
comparisons are:

<=0 vs >0
<=1 vs >1
<=2 vs >2
<=3 vs >3
Logistic Regression with SAS ©AK, SGH 2020

Cumulative Logit Model

• For an ordinal outcome variable Y with G categories (0,1,2,…,G-1) there
are G-1 ways to dichotomize the outcome:

D ≥ 1 vs D < 1
D ≥ 2 vs D < 2
...
D ≥ G−1 vs D < G−1

The odds is derived for each cut point (proportional odds assumption):

odds(D ≥ g) = P(D ≥ g) / P(D < g),  g = 1,2,…,G−1
Logistic Regression with SAS ©AK, SGH 2020

Proportional odds assumption

Proportional odds means the same odds ratio regardless of where the
categories are dichotomized. In other words, the odds ratio is invariant
to where the outcome categories are dichotomized, e.g.:

OR(D ≥ 1) = OR(D ≥ 4)

OR(D ≥ 1) = [odds(D ≥ 1) | X=1] / [odds(D ≥ 1) | X=0]

OR(D ≥ 4) = [odds(D ≥ 4) | X=1] / [odds(D ≥ 4) | X=0]
Logistic Regression with SAS ©AK, SGH 2020

Cumulative logit model
This implies that if there are G outcome categories, there is only one
parameter (β) for each of the predictor variables (e.g., β1 for predictor
X1). However, there is still a separate intercept term (αg) for each of
the G−1 comparisons:

Variable    Parameter
Intercept   α1, α2, ..., αG−1
X1          β1
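The parameter bookkeeping above can be sketched in a few lines of Python (the course uses Python alongside SAS); the function name n_params and the example values of G and p are hypothetical, purely for illustration:

```python
# A tiny sketch of the parameter count in a proportional odds model:
# G-1 intercepts plus one slope per predictor.
def n_params(G: int, p: int) -> int:
    """Number of parameters: G-1 intercepts (alpha_g) + p slopes (beta)."""
    return (G - 1) + p

print(n_params(3, 1))  # quality-of-life example: 2 intercepts + 1 slope = 3
```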
Cumulative logit model
• G outcome levels and one predictor X:

P(D ≥ g | X) = 1 / (1 + e^−(αg + βxi)),   g = 1, 2, …, G−1

P(D < g | X) = 1 − P(D ≥ g | X) = e^−(αg + βxi) / (1 + e^−(αg + βxi))

odds(D ≥ g) = P(D ≥ g | X) / P(D < g | X) = e^(αg + βxi)
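A quick numeric sketch of these formulas in Python (not SAS output): p_ge and odds_ge are hypothetical helper names, and the values of alpha_g and beta below are arbitrary illustrative numbers, not estimates from the course data.

```python
import math

def p_ge(alpha_g, beta, x):
    """P(D >= g | X): cumulative probability from the logit model."""
    return 1.0 / (1.0 + math.exp(-(alpha_g + beta * x)))

def odds_ge(alpha_g, beta, x):
    """odds(D >= g) = P(D >= g | X) / P(D < g | X)."""
    p = p_ge(alpha_g, beta, x)
    return p / (1.0 - p)

# The ratio of probabilities reduces to the closed form exp(alpha_g + beta*x):
a_g, b, x = -0.5, 0.8, 1.0
print(abs(odds_ge(a_g, b, x) - math.exp(a_g + b * x)) < 1e-12)  # True
```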
Cumulative logit model – alternative parameterization

• G outcome levels and one predictor X:

odds(D ≤ g) = P(D ≤ g | X) / [1 − P(D ≤ g | X)]
            = P(D ≤ g | X) / P(D > g | X) = e^(αg − βxi),   where g = 1, 2, …, G−1

• Many models can be parameterized in different ways. This need not be
problematic as long as the investigator understands how the model is
formulated and how to interpret its parameters.
Proportional odds assumption – contingency table („Phones” dataset)

H0: OR1 = OR2 = … = ORG−1
H1: ∃ i≠j: ORi ≠ ORj

LQ1                                      younger=0  younger=1  Total
1 Fair or poor                                 119         58    177
2 Very good or good                            349        311    660
3 Excellent                                     87        137    224
Total                                          555        506   1061

Cut at Y<=1 – odds of being in the lower category when older (younger=0):
LQ1                                      younger=0  younger=1  Total
1 Fair or poor                                 119         58    177
2 Very good or good or 3 Excellent             436        448    884
Total                                          555        506   1061

Risk(Y<=1|X=0) = 0.214414    Risk(Y<=1|X=1) = 0.114625
Odds(Y<=1|X=0) = 0.272936    Odds(Y<=1|X=1) = 0.129464
OR01 = 0.272936 / 0.129464 = 2.108194

Cut at Y<=2 – odds of being in the lower category when older (younger=0):
LQ1                                      younger=0  younger=1  Total
1 Fair or poor or 2 Very good or good          468        369    837
3 Excellent                                     87        137    224
Total                                          555        506   1061

Risk(Y<=2|X=0) = 0.843243    Risk(Y<=2|X=1) = 0.729249
Odds(Y<=2|X=0) = 5.37931     Odds(Y<=2|X=1) = 2.693431
OR01 = 5.37931 / 2.693431 = 1.997197
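The risks, odds and odds ratios in the tables above can be cross-checked with a short stdlib-only Python snippet (the names counts and ors are hypothetical, introduced just for this sketch):

```python
# Cut-point odds ratios from the "Phones" contingency table.
counts = {                 # category -> (younger=0, younger=1) counts
    "1 Fair or poor":      (119, 58),
    "2 Very good or good": (349, 311),
    "3 Excellent":         (87, 137),
}
cats = list(counts)
totals = [sum(c[g] for c in counts.values()) for g in (0, 1)]  # 555, 506

ors = {}
for cut in (1, 2):                    # dichotomize at Y<=1 and at Y<=2
    odds = []
    for g in (0, 1):
        low = sum(counts[c][g] for c in cats[:cut])  # count with Y <= cut
        risk = low / totals[g]
        odds.append(risk / (1 - risk))
    ors[cut] = odds[0] / odds[1]      # odds of lower category when older
    print(f"Y<={cut}: OR = {ors[cut]:.4f}")
# The two ORs (~2.108 and ~1.997) are close, which is what the
# proportional odds assumption requires.
```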
[SAS] l05_PropOdds01.sas
Proportional odds assumption – SAS
(„Phones” dataset)
H0: OR1=OR2=…=ORG-1
H1: Ǝi≠j ORi≠ORj
https://support.sas.com/documentation/cdl/en/statug/63962/
HTML/default/viewer.htm#statug_logistic_sect039.htm
Probabilities from cumulative logit model
With P(D ≤ g | X) = 1 / (1 + e^−(αg + βxi)) and the estimates substituted:

P(D ≤ 1 | X = 1) = 1 / (1 + e^2.0235) ≈ 0.1168

P(D ≤ 2 | X = 1) = 1 / (1 + e^−0.9820) ≈ 0.7275

P(D ≤ 1 | X = 0) = 1 / (1 + e^(2.0235 − 0.7147)) ≈ 0.2127

P(D ≤ 2 | X = 0) = 1 / (1 + e^(−0.9820 − 0.7147)) ≈ 0.8451
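These four probabilities can be reproduced numerically in Python from the exponents shown above (p_le is a hypothetical helper name, not part of the SAS program):

```python
import math

def p_le(exponent):
    """P(D <= g | X) = 1 / (1 + e^exponent), with the fitted exponent."""
    return 1.0 / (1.0 + math.exp(exponent))

print(round(p_le(2.0235), 4))             # P(D<=1 | X=1) -> 0.1168
print(round(p_le(-0.9820), 4))            # P(D<=2 | X=1) -> 0.7275
print(round(p_le(2.0235 - 0.7147), 4))    # P(D<=1 | X=0) -> 0.2127
print(round(p_le(-0.9820 - 0.7147), 4))   # P(D<=2 | X=0) -> 0.8451
```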
Odds ratio in cumulative logit model
Cut at g = 1:
P(D ≤ 1 | X = 1) = 1 / (1 + e^2.0235) ≈ 0.1168
P(D ≤ 1 | X = 0) = 1 / (1 + e^(2.0235 − 0.7147)) ≈ 0.2127
P(D > 1 | X = 1) = 1 − 0.1168 = 0.8832
P(D > 1 | X = 0) = 1 − 0.2127 = 0.7873
Odds(D ≤ 1 | X = 1) = 0.1168 / 0.8832 ≈ 0.1322
Odds(D ≤ 1 | X = 0) = 0.2127 / 0.7873 ≈ 0.2702
OR 1 vs (2 and 3) = 0.1322 / 0.2702 ≈ 0.48

Cut at g = 2:
P(D ≤ 2 | X = 1) = 1 / (1 + e^−0.9820) ≈ 0.7275
P(D ≤ 2 | X = 0) = 1 / (1 + e^(−0.9820 − 0.7147)) ≈ 0.8451
P(D > 2 | X = 1) = 1 − 0.7275 = 0.2725
P(D > 2 | X = 0) = 1 − 0.8451 = 0.1549
Odds(D ≤ 2 | X = 1) = 0.7275 / 0.2725 ≈ 2.6697
Odds(D ≤ 2 | X = 0) = 0.8451 / 0.1549 ≈ 5.4558
OR (1 and 2) vs 3 = 2.6697 / 5.4558 ≈ 0.48

The odds ratio is the same at both cut points, as the proportional odds
model requires by construction.
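The invariance can be checked numerically in Python, using the fitted exponents from the previous slides (p_le and odds are hypothetical helper names): the odds ratio comes out the same at both cut points.

```python
import math

def p_le(exponent):
    """P(D <= g | X) = 1 / (1 + e^exponent)."""
    return 1.0 / (1.0 + math.exp(exponent))

def odds(p):
    """Odds of being at or below the cut point."""
    return p / (1.0 - p)

or_cut1 = odds(p_le(2.0235)) / odds(p_le(2.0235 - 0.7147))
or_cut2 = odds(p_le(-0.9820)) / odds(p_le(-0.9820 - 0.7147))
print(round(or_cut1, 4), round(or_cut2, 4))  # 0.4893 0.4893
```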
Proportional odds assumption check – example
(Source: SAS OnlineDoc, Proc Logistic Example 72.3)

Consider a study of the effects on taste of various cheese additives.
Researchers tested four cheese additives and obtained 52 response ratings
for each additive. Each response was measured on a scale of nine
categories ranging from strong dislike (1) to excellent taste (9). The
data, given in McCullagh and Nelder (1989, p. 175) in the form of a
two-way frequency table of additive by rating, are saved in the data set
Cheese by the following program. The variable y contains the response
rating, the variable Additive specifies the cheese additive (1, 2, 3, or
4), and the variable freq gives the frequency with which each additive
received each rating. Assess the proportional odds assumption.

data Cheese;
   do Additive = 1 to 4;
      do y = 1 to 9;
         input freq @@;
         output;
      end;
   end;
   label y = 'Taste Rating';
   datalines;
0 0 1 7 8 8 19 8 1
6 9 12 11 7 6 1 0 0
1 1 6 8 23 7 5 1 0
0 0 0 1 3 7 14 16 11
;
run;
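For comparison, the same expansion of the frequency table into one record per (Additive, y) cell can be done in plain Python (the cheese list of dicts is a hypothetical stand-in for the SAS data set, just to illustrate what the DATA step builds):

```python
# Expand the 4x9 additive-by-rating frequency table, one record per cell.
freqs = [
    [0, 0, 1, 7, 8, 8, 19, 8, 1],    # Additive 1
    [6, 9, 12, 11, 7, 6, 1, 0, 0],   # Additive 2
    [1, 1, 6, 8, 23, 7, 5, 1, 0],    # Additive 3
    [0, 0, 0, 1, 3, 7, 14, 16, 11],  # Additive 4
]
cheese = [
    {"Additive": a + 1, "y": y + 1, "freq": f}
    for a, row in enumerate(freqs)
    for y, f in enumerate(row)
]
print(len(cheese), sum(r["freq"] for r in cheese))  # 36 208 (4 x 52 ratings)
```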
[SAS] l05_OrdinalModel01.sas
Cumulative logit model
• Assess the effect of sex, age, education level and Internet usage on
the quality of life. Use the phones dataset. The variables in the model are:

• Estimate the model and interpret the results.

• Plot the probabilities in order to assess the effect of the respective
features on the quality of life.
PROC LOGISTIC summary (OnLineDoc: Link)
ods graphics on;
proc logistic data=g /*Input dataset*/
  covout outest=cov /*Output covariance matrix of estimates*/
  plots(only)=ROC; /*Output ROC curve*/
  class housing (param=ref ref='own') acc_status (param=ref); /*Define
  classification variables and parametrization (e.g. reference)*/
  model default(event='1')= age duration housing acc_status / /*logistic
  model definition, response and response category = predictors*/
    technique=fisher /*Estimation algorithm*/
    corrb /*Correlation matrix of parameter estimates*/
    covb /*Covariance matrix of parameter estimates*/
    waldcl /*Wald confidence intervals for model parameters*/
    aggregate scale=none /*Deviance and Pearson chi-square*/
    lackfit /*Hosmer-Lemeshow goodness-of-fit test*/
    expb /*exp of the estimates – OR in the table of parameter estimates*/
    rsq /*coefficient of determination*/
    ctable /*classification tables [optional cut-points: pprob=(0.4 0.5)]*/
    influence /*influential observation diagnostics*/
    ;
run;
ods graphics off;
References
A. Agresti (2002) Categorical Data Analysis (Hoboken: John Wiley & Sons).

P.D. Allison (1999) Logistic Regression Using the SAS System: Theory and
Application (Cary: SAS Institute Inc.).

D.W. Hosmer, S. Lemeshow, R.X. Sturdivant (2013) Applied Logistic
Regression (Hoboken: John Wiley & Sons).

D.G. Kleinbaum and M. Klein (2010) Logistic Regression: A Self-Learning
Text, 3rd ed. (New York: Springer).

P. McCullagh and J.A. Nelder (1989) Generalized Linear Models, 2nd ed.
(London: Chapman and Hall).
Thank you for your attention!

Any questions?