
An Introduction to Logistic Regression Analysis and Reporting
Chao-Ying Joanne Peng
Indiana University-Bloomington

Three purposes of this session:
1. Introduces you to the basic concepts of logistic regression.
Logistic regression (LR) constitutes a special class of regression methods for research with dichotomous outcomes.

Three purposes of this session:
2. Provides you with a set of guidelines for what to expect in an article using logistic regression techniques.
What tables, figures, or charts should be included to comprehensively assess the results?
And what assumptions should be verified?

Three purposes of this session:
3. Offers recommendations for appropriate reporting formats of logistic regression results and the minimum observation-to-predictor ratio.

An Introduction to Logistic Regression Analysis and Reporting
Many research problems in education call for the analysis and prediction of a dichotomous outcome: for example, whether a child should be classified as learning disabled (LD), or whether a teenager is prone to engage in risky behaviors.
Traditionally, these research questions were addressed by either ordinary least squares (OLS) regression or linear discriminant function analysis.

An Introduction to Logistic Regression Analysis and Reporting
Both techniques were subsequently found to be less than ideal for handling dichotomous outcomes because of their strict statistical assumptions, i.e., linearity, normality, and continuity for OLS regression, and multivariate normality with equal variances and covariances for discriminant analysis.

An Introduction to Logistic Regression Analysis and Reporting
As an alternative, logistic regression was proposed in the late 1960s and early 1970s (Cabrera, 1994). It became routinely available in statistical packages in the early 1980s.
With the wide availability of sophisticated statistical software installed on high-speed computers, the use of logistic regression is increasing.

Logistic Regression Models
The central mathematical concept that underlies logistic regression is the logit, i.e., the natural log of the odds, ln(odds).
The simplest example of a logit derives from a 2 × 2 contingency table. Consider an instance in which the distribution of a dichotomous outcome variable (a child from an inner-city school recommended for remedial reading classes) is paired with a dichotomous predictor variable (gender).

Table 1. Sample Data for Gender and Recommendation for Remedial Reading

Remedial reading                     Gender
instruction recommended        Boys      Girls      Totals
Yes (coded as 1)                73        15          88
No (coded as 0)                 23        11          34
Totals                          96        26         122

The results yield χ²(df = 1) = 3.43. Alternatively, one might prefer to assess a boy's odds of being recommended for remedial reading instruction, relative to a girl's odds; the result is an odds ratio of 2.33.

odds ratio = (73 / 23) / (15 / 11) = 3.17 / 1.36 = 2.33

Its natural logarithm [i.e., ln(2.33)] equals 0.85, which would be the regression coefficient of the gender predictor if logistic regression were used to model the two outcomes of a remedial recommendation as it relates to gender.
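This arithmetic is easy to verify; a minimal Python sketch (not part of the original analysis) that reproduces the odds ratio and its natural log from the Table 1 counts:

```python
import math

# Cell counts from Table 1
boys_yes, boys_no = 73, 23
girls_yes, girls_no = 15, 11

odds_boys = boys_yes / boys_no         # about 3.17
odds_girls = girls_yes / girls_no      # about 1.36
odds_ratio = odds_boys / odds_girls    # about 2.33

print(round(odds_ratio, 2))            # 2.33
print(round(math.log(odds_ratio), 2))  # 0.84, i.e., the ~0.85 logit quoted above, up to rounding
```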

The simple logistic model has the form:

logit(Y) = natural log(odds) = ln( π / (1 − π) ) = α + βX,        (1)

where π is the probability of the outcome of interest, α is the intercept parameter, β is a regression coefficient (a.k.a. the slope parameter), and X is a predictor.

Figure 1. The relationship of a dichotomous outcome variable, Y (1 = remedial reading recommended, 0 = remedial reading not recommended), with a continuous predictor, READING scores.

For the data in Table 1, the regression coefficient (β) is the logit (= 0.85) previously explained. Taking the antilog of equation (1) on both sides, one derives an equation for predicting the probability of the occurrence of the outcome of interest as follows:

π = P(Y = outcome of interest | X = x, a specific value)
  = e^(α + βx) / (1 + e^(α + βx))
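A minimal Python sketch of this logit-to-probability conversion; the alpha and beta values in the example call are made up for illustration, not fitted estimates:

```python
import math

def predicted_probability(x, alpha, beta):
    """Return pi = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    logit = alpha + beta * x
    return math.exp(logit) / (1 + math.exp(logit))

# Hypothetical parameter values, purely for illustration:
print(round(predicted_probability(x=70, alpha=0.5, beta=-0.03), 3))  # about 0.168
```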

Extending the logic of the simple logistic regression to multiple predictors (say X1 = reading score and X2 = gender), one may construct a more complex logistic regression for Y (recommendation for remedial reading programs) as follows:

logit(Y) = ln( π / (1 − π) ) = α + β1 X1 + β2 X2.

Therefore,

π = P(Y = outcome of interest | X1 = x1, X2 = x2)
  = e^(α + β1 x1 + β2 x2) / (1 + e^(α + β1 x1 + β2 x2))

Illustration of Logistic Regression Analysis and Reporting
The hypothetical data consisted of 189 inner-city school children's reading scores and gender.
Of these children, 59 (31.22%) were recommended for remedial reading classes while 130 (68.78%) were not. A legitimate research hypothesis posed to the data was: the likelihood that an inner-city school child is recommended for remedial reading instruction is related to both his/her reading score and gender.

Table 2. Description of a Hypothetical Data Set for Logistic Regression

Remedial reading              Total sample    Boys    Girls     Reading scores
instruction recommended?          (N)         (n1)    (n2)      Mean       SD
Yes                                59           36      23      61.07     13.28
No                                130           57      73      66.65     15.86
Summary                           189           93      96      64.91     15.29

Logistic Regression Analysis
The logistic regression analysis was carried out by the LOGISTIC REGRESSION command in SPSS version 13 (SPSS Inc., 2004).
Predicted logit of (REMEDIAL) = 0.534 + (−0.026)READING + (0.648)GENDER
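For readers working outside SPSS, a hedged sketch of how the same model could be fit in Python with statsmodels; the data file name and column names are assumptions, not part of the original analysis:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file with columns REMEDIAL (1/0), READING (score), GENDER (1 = boy, 0 = girl)
df = pd.read_csv("remedial_reading.csv")

model = smf.logit("REMEDIAL ~ READING + GENDER", data=df).fit()
print(model.params)     # intercept and slopes on the logit scale
print(model.summary())  # standard errors and z tests (Wald chi-square = z squared)
```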

Evaluations of the Logistic Regression Model
a) overall model evaluation
b) goodness-of-fit statistics
c) statistical tests of individual predictors
d) validations of predicted probabilities

Overall Model Evaluation

Test                        χ²      df       p
Likelihood ratio test     10.019     2     0.007
Score test                 9.518     2     0.009

Goodness-of-fit Test

Test                                        χ²     df       p
Hosmer-Lemeshow goodness-of-fit test       9.286    8     0.319

R2-type Indices
Cox and Snell R squared = .052
Nagelkerke (Max rescaled) R squared = .073
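Both indices can be recomputed from quantities already reported: the intercept-only (null) log-likelihood follows from the 59/130 split, and the fitted log-likelihood follows from the likelihood ratio chi-square of 10.019. A sketch, under those assumptions:

```python
import math

def cox_snell_r2(ll_null, ll_model, n):
    # 1 - (L_null / L_model)^(2/n), written with log-likelihoods
    return 1 - math.exp(2 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_null, ll_model, n):
    # Cox and Snell R^2 rescaled by its maximum attainable value
    return cox_snell_r2(ll_null, ll_model, n) / (1 - math.exp(2 * ll_null / n))

n = 189
ll_null = 59 * math.log(59 / n) + 130 * math.log(130 / n)  # intercept-only log-likelihood
ll_model = ll_null + 10.019 / 2                            # from the likelihood ratio chi-square

print(round(cox_snell_r2(ll_null, ll_model, n), 3))   # 0.052
print(round(nagelkerke_r2(ll_null, ll_model, n), 3))  # 0.073
```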

Table 3. Logistic Regression Analysis of 189 Children's Referrals for Remedial Reading Programs by SPSS LOGISTIC REGRESSION command (version 12)

Predictor                    β        SE     Wald's χ² (df = 1)      p      e^β (odds ratio)
CONSTANT                   0.534    0.811          0.434           .510    (not applicable)
READING                   −0.026    0.012          4.565           .033          0.974
GENDER (1 = boys,          0.648    0.325          3.976           .046          1.911
  0 = girls)

Table 4. The Observed and the Predicted Frequencies for Remedial Reading Instructions by Logistic Regression with the Cutoff of 0.50

                                  Predicted
Observed                    Yes         No        Percentage correct
Yes                           3         56               5.1%
No                            1        129              99.2%
Overall % correct                                       69.8%

Note.
Sensitivity = 3/(3 + 56) = 5.1%
Specificity = 129/(1 + 129) = 99.2%
False positive = 1/(1 + 3) = 25.0%
False negative = 56/(56 + 129) = 30.3%
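All of the statistics in the note follow directly from the four cell counts; a short Python sketch using the slide's own definitions of false positive and false negative (computed within the predicted columns):

```python
# Cell counts from Table 4 (cutoff = 0.50)
tp, fn = 3, 56    # observed Yes: predicted Yes, predicted No
fp, tn = 1, 129   # observed No:  predicted Yes, predicted No

sensitivity    = tp / (tp + fn)                   # 3/59    = 5.1%
specificity    = tn / (fp + tn)                   # 129/130 = 99.2%
false_positive = fp / (fp + tp)                   # 1/4     = 25.0%
false_negative = fn / (fn + tn)                   # 56/185  = 30.3%
overall        = (tp + tn) / (tp + fn + fp + tn)  # 132/189 = 69.8%
```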

[Figure 2. Predicted probability of being referred for remedial reading instructions versus reading scores; plotting symbols A = 1 observation, B = 2 observations, C = 3 observations, etc. Separate point clouds are shown for boys and girls, with estimated probability (roughly 0.0 to 0.6) on the vertical axis and reading score (roughly 40 to 140) on the horizontal axis.]

Reporting and Interpreting Logistic Regression Results
In addition to Tables 3 and 4 and Figure 2, it is helpful to demonstrate the relationship between the predicted outcome and certain characteristics found in observations.

Table 5. Predicted Probability of Being Referred for Remedial Reading Instructions for 8 Children

Case      READING          GENDER          Intercept    Predicted probability of     Actual outcome
number    (β = −0.026)     (β = 0.648)     (= 0.534)    being referred for           (1 = Yes, 0 = No)
                                                        remedial reading
                                                        instructions
           52.5             Boy             0.5340            0.4530
           85               Boy             0.5340            0.2618
           75               Girl            0.5340            0.1941
           92               Girl            0.5340            0.1250
           60               Boy             0.5340            0.4051                      --
           60               Girl            0.5340            0.2627                      --
          100               Boy             0.5340            0.1934                      --
          100               Girl            0.5340            0.1115                      --
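The hypothetical-profile rows of Table 5 can be reproduced from the coefficients reported in Table 3; a sketch (small differences from the table reflect rounding of the coefficients to three decimals):

```python
import math

# Rounded coefficients from Table 3 (GENDER coded 1 = boy, 0 = girl)
INTERCEPT, B_READING, B_GENDER = 0.534, -0.026, 0.648

def referral_probability(reading, boy):
    logit = INTERCEPT + B_READING * reading + B_GENDER * (1 if boy else 0)
    return math.exp(logit) / (1 + math.exp(logit))

print(round(referral_probability(60, boy=True), 3))   # about 0.407 (Table 5: 0.4051)
print(round(referral_probability(60, boy=False), 3))  # about 0.264 (Table 5: 0.2627)
```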

Interpretation of Regression Coefficients
For each one-point increase on the reading score, the odds of being recommended for remedial reading programs decrease from one to 0.974 [= e^(−0.026), Table 3].
If the increase on the reading score is 10 points, the odds decrease from one to 0.771 [= e^(10 × (−0.026))].
However, when the READING score was held constant, boys were predicted to be referred for remedial reading instruction with greater probability than girls.
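The same arithmetic in a two-line check:

```python
import math

b_reading = -0.026                         # READING coefficient from Table 3
print(round(math.exp(b_reading), 3))       # 0.974: odds multiplier per 1-point increase
print(round(math.exp(10 * b_reading), 3))  # 0.771: odds multiplier per 10-point increase
```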

Guidelines and Recommendations


What Tables, Figures, or Charts Should be
Included to Comprehensively Assess the
Result?
the overall evaluation of the logistic
model
goodness-of-fit statistics
statistical tests of individual predictors
an assessment of the predicted
probabilities

What Assumptions Should Be Verified?
Logistic regression does not assume that the predictor variables are distributed as a multivariate normal distribution with an equal covariance matrix.
It assumes that the binomial distribution describes the distribution of the errors, which equal the actual Y minus the predicted Y; the binomial is also the assumed distribution for the conditional mean of the dichotomous outcome.

What Assumptions Should Be Verified?
The binomial assumption may be tested by the normal z test (Siegel & Castellan, 1988), or taken to be robust as long as the sample is random and, therefore, the observations are independent of each other.

Recommended Reporting Formats of Logistic Regression
In terms of reporting logistic regression results, we recommend presenting:
a complete logistic regression model, including the Y-intercept
odds ratios
a table such as Table 5 to illustrate the relationship between outcomes and observations with profiles of certain characteristics

Recommended Minimum
Observation to Predictor Ratio
The literature has not offered specific rules
applicable to logistic regression.
Several authors of multivariate statistics
recommended a minimum ratio of 10 to 1,
with a minimum sample size of 100 or 50
plus a variable number that is a function of
the number of predictors.

Preview of Next Session
