
DEN 1015H LECTURE NOTES Session 11

Logistic Regression
Binary logistic regression (which will be referred to simply as logistic regression) is a
statistical technique useful for analyses in which we want to be able to predict the presence or
absence of a characteristic based on the values of a set of predictor variables. It is a multivariate
technique (the multiple logistic regression model) similar to a multiple regression model, but it is
suited to models where the dependent variable is dichotomous (e.g. alive or dead, healthy or diseased, 0
or 1). [Independent variables can be continuous (interval, ratio) or categorical (nominal,
ordinal); if categorical, they should be coded as dummy variables.] Treating a variable with only two
values as a sample from a normal distribution requires a leap of faith. However, by a simple
transformation, called the logit (the log odds, whose inverse is the logistic function), we can make
the dependent variable behave more like a regular continuous variable.
In logistic regression, we directly estimate the probability of an event occurring. Logistic
regression coefficients are used to estimate ODDS RATIOS for each of the independent
variables adjusting for the effect of the other variables in the model.
In linear regression we estimate the parameters of the model using the least-squares
method. That is, we select regression coefficients that result in the smallest sums of squared
distances between the observed and the predicted values of the dependent variable. In logistic
regression, the parameters of the model are estimated using the maximum-likelihood method.
That is, the coefficients that make our observed results most "likely" are selected.
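As an illustration only (not part of the original notes), here is a minimal Python sketch of
fitting a logistic regression by maximum likelihood with the statsmodels library. The file name
fluorosis.csv is a hypothetical stand-in; the column names mirror the example below.

import pandas as pd
import statsmodels.api as sm

# Hypothetical data set laid out like the fluorosis study:
# one row per child, DEO = 1 if diffuse enamel opacities.
df = pd.read_csv("fluorosis.csv")

# Add the constant term and fit by maximum likelihood: the solver
# iterates until the log-likelihood stops improving.
X = sm.add_constant(df[["WATERF", "MT2", "CRD1FS", "TOPICALF"]])
result = sm.Logit(df["DEO"], X).fit()

print(result.summary())   # coefficients (B), standard errors, tests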

There are two general goals when doing logistic regression:

1) To determine the effect of a set of variables on the probability of the outcome, plus
the effect of individual variables.
2) To attain the highest predictive accuracy possible with a given set of predictors.

Example: A study was performed investigating the risk of dental fluorosis in two urban
communities in Ontario according to whether their drinking water was fluoridated at optimal
levels, “0.7-1.0 F ppm,” or below optimal levels, “< 0.3 F ppm,” as determined by water
samples. A total of 420 12–16-year-old schoolchildren were randomly selected from school
districts in the two communities: 271 in the fluoridated and 149 in the fluoride-deficient. In both
localities, subjects who exhibited enamel opacities according to the DDE Index (FDI, 1982) were
enrolled in this case-control study. An attempt was made to enrol at least 5 times as many
subjects with no enamel opacities (the controls) as subjects with teeth presenting diffuse enamel
opacities (the cases). The interviewer asked the participants about other possibly related factors,
including demographic factors (e.g. age, sex, ethnic group), dental health behaviours (e.g. dental
attendance pattern, toothbrushing frequency) and other fluoride sources (e.g. use of fluoride
supplements in early childhood, fluoride toothpastes, gels, mouthrinses and varnishes). The
samples studied in both localities exhibited generally high caries activity (high DMFS Index
scores, including non-cavitated enamel lesions). We would like to be able to predict the children
at risk for fluorosis based on the values of the variables collected in this study, so we will need to
look at the most important set of predictors of fluorosis jointly in a multivariate model.



SPSS OUTPUT FOR LOGISTIC REGRESSION
Total number of cases: 420 (Unweighted)
Number of selected cases: 420
Number of unselected cases: 0
Number rejected because of missing data: 0
Number of cases included in the analysis: 420

Dependent Variable Encoding:

  Original Value   Internal Value
        0                 0
        1                 1

Dependent Variable.. DEO Enamel Opacities

Beginning Block Number 0. Initial Log Likelihood Function

-2 Log Likelihood 358.52624

* Constant is included in the model.

Beginning Block Number 1. Method: Enter

Variable(s) Entered on Step Number
1..   WATERF     Fluoride in water
      MT2        Missing teeth (one or more)
      CRD1FS     D1FS Index score
      TOPICALF   Topical fluoride

Estimation terminated at iteration number 4 because the
Log Likelihood decreased by less than .01 percent.

-2 Log Likelihood 328.175


Goodness of Fit 419.832
Cox & Snell R²   .070
Nagelkerke R²    .121

            Chi-Square   df   Significance
Model         30.351      4      .0000
Block         30.351      4      .0000
Step          30.351      4      .0000

Classification Table for DEO
The Cut Value is .50

                         Predicted
Observed            No Fluorosis   Fluorosis   Percent Correct
No Fluorosis             356            0          100.00%
Fluorosis                 34           30           46.87%
Overall                                             91.90%



----------------- Variables in the Equation ------------------

Variable         B       S.E.      Wald    df    Sig      R

WATERF        1.0072    .3713    7.3565    1    .0067   .1222
MT2           -.9748    .4063    5.7565    1    .0164  -.1024
CRD1FS        -.0406    .0176    5.3413    1    .0208  -.0965
TOPICALF       .1729    .4015     .1856    1    .6666   .0000
Constant     -1.8036    .3994   20.3918    1    .0000

              Odds Ratio     95% CI for Exp(B)

Variable        Exp(B)       Lower      Upper

WATERF          2.7379       1.3223     5.6689
MT2              .3772        .1701      .8365
CRD1FS           .9602        .9277      .9938
TOPICALF        1.1888        .5412     2.6111

The SPSS output immediately above (Variables in the Equation) contains the estimated
regression coefficients (under column heading B) and related statistics from the logistic
regression model that predicts fluorosis from the constant and the variables WATERF, MT2,
CRD1FS, TOPICALF. WATERF, MT2 and TOPICALF are indicator variables, coded 0 or 1.
The value of 1 for WATERF indicates living in a fluoridated area, the value of 1 for MT2
indicates having lost one or more teeth, and the value of 1 for TOPICALF indicates that the child
currently receives some form of topical fluoride treatment on a regular basis. CRD1FS is the
combined clinical and radiographic DFS Index score for each child that is measured on a
continuous scale.

Given these coefficients, the logistic regression equation for the probability of fluorosis (y) can
be written as

y = Prob(fluorosis) = 1 / (1 + e^(-Z))
where

Z = Ln (Odds)

While probabilities must lie between 0 and 1, odds can take any value between 0 and infinity
(∞). However, because the odds are always positive (or 0), but never negative, we usually model
the natural logarithm (base e = 2.71...) of the odds. The log odds are not constrained at all; they
can take any value between -∞ and +∞. Thus, in logistic regression we assume that the log odds is
a linear function of the x's, as follows:

Z = β0 + β1x1 + β2x2 + β3x3 + β4x4

Z = -1.8036 + 1.0072 (WATERF) - 0.9748 (MT2) - 0.0406 (CRD1FS) + 0.1729 (TOPICALF)



Applying this to a child who lives in a fluoridated area, with a DFS score of 10 and values 0 for
the remaining independent variables, we find

Z = -1.8036 + 1.0072 (1) - 0.9748 (0) - 0.0406 (10) + 0.1729 (0)

  = -1.8036 + 1.0072 - 0.4060 = -1.2024

The probability of fluorosis is then estimated to be

y = Prob(fluorosis) = 1 / (1 + e^(-(-1.2024))) = 1 / (1 + e^1.2024) = 0.231

When Z = 0, y is 1 / (1 + e^0) = 1 / (1 + 1) = 0.5

When Z goes to infinity (+∞), y becomes 1 / (1 + e^(-∞)) = 1

When Z goes to -∞, y becomes 1 / (1 + e^(+∞)) = 0

So, the logistic function describes a smooth S-shaped curve that approaches 0 for large negative
values of Z and approaches 1 when Z is large and positive (see the figure of the logistic
function).
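The arithmetic above can be checked with a few lines of Python; this is a verification sketch
only, using the coefficients reported in the output.

import math

# Worked example: fluoridated area (WATERF = 1), DFS score of 10,
# all other predictors 0.
b0, b_waterf, b_mt2, b_crd1fs, b_topicalf = -1.8036, 1.0072, -0.9748, -0.0406, 0.1729
z = b0 + b_waterf * 1 + b_mt2 * 0 + b_crd1fs * 10 + b_topicalf * 0

def logistic(z):
    """The logistic transform: maps any Z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

print(round(z, 4))               # -1.2024
print(round(logistic(z), 3))     # 0.231
print(logistic(0))               # 0.5, the midpoint of the curve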

The Wald statistic (column labelled Wald) tests the hypothesis that the coefficient (column B) is
0. This test has a chi-square distribution. When a variable has a single degree of freedom, the
Wald statistic is the square of the ratio of the coefficient to its standard error (column labelled
S.E.). For categorical variables, the Wald statistic has degrees of freedom equal to one less than
the number of categories.
For example, the coefficient for CRD1FS is -0.0406, and its standard error is .0176. The
Wald statistic is (-0.0406/0.0176)², or about 5.32, with some difference from the
computer output (5.3413) because of rounding. The significance level for the Wald
statistic is shown in the column labelled Sig. In this example, only the coefficients for
WATERF, MT2, and CRD1FS appear to be significantly different from 0, using a significance
level of 0.05.
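As a check on the Wald computation, a small Python sketch (values taken from the CRD1FS row
above; scipy supplies the chi-square tail probability):

from scipy.stats import chi2

b, se = -0.0406, 0.0176          # CRD1FS coefficient and standard error
wald = (b / se) ** 2             # about 5.32 (output: 5.3413; rounding)
p = chi2.sf(wald, df=1)          # upper-tail chi-square probability, 1 df
print(round(wald, 3), round(p, 4))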



Partial correlation, R. The R statistic (column labelled R) is used to look at the partial
correlation between the dependent variable and each of the independent variables. R can range
in value from -1 to +1. A positive value indicates that as the variable increases in value, so does
the likelihood of the event occurring. If R is negative, the opposite is true. Small values for R
indicate that the variable has a small partial contribution to the model.

Interpreting the Regression Coefficients


In multiple linear regression, the interpretation of the regression coefficient is straightforward. It
tells you the amount of change in the dependent variable for a one-unit change in the
independent variable.

In logistic regression, the coefficient can be interpreted as the change in the natural log of the
odds, log odds (logit), associated with a one-unit change in the independent variable. (The
reason for modelling the log odds rather than the proportions is that log odds can take any value,
positive or negative, whereas proportions are constrained to lie between 0 and 1.) For example,
from the SPSS output above, you see that the coefficient for WATERF is 1.0072. This tells us
that when the WATERF changes from 0 to 1 and the values of the other independent variables
remain the same, the log odds of fluorosis increase by 1.0072, the value of the coefficient. As
with multiple linear regression, the regression coefficients are estimates for the effect of a
particular variable, controlling for the other variables in the equation.

Since it is easier to think of odds rather than log odds, then e raised to power 1.0072 is the factor
by which the odds change when WATERF increases by one unit. This is called the odds ratio
or the ratio of the odds of fluorosis when WATERF is 1 to the same odds when WATERF is 0.
The odds ratio for WATERF is in the column labelled Exp(B) (e^1.0072 = 2.7379) and its 95%
confidence interval is in the last two columns (95% CI = e^(1.0072 - 1.96 × 0.3713) to
e^(1.0072 + 1.96 × 0.3713) = 1.322 to 5.669). From the 95% confidence interval, you can see that values anywhere from 1.322
to 5.669 are plausible for the population value of the odds ratio for WATERF. Since this interval
does not include the value 1 (OR = 1 is equivalent to no change in odds), we can conclude based
on this sample of data that a unit change in WATERF in the population is associated with a
change in the odds of fluorosis. That is to say, the odds of fluorosis in fluoridated areas are
2.7 times the odds in fluoride-deficient areas.
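The odds ratio and its confidence limits come straight from the coefficient and its standard
error; a quick sketch of that arithmetic:

import math

b, se = 1.0072, 0.3713                      # WATERF coefficient and SE
odds_ratio = math.exp(b)                    # 2.7379
ci_lower = math.exp(b - 1.96 * se)          # 1.322
ci_upper = math.exp(b + 1.96 * se)          # 5.669
print(round(odds_ratio, 4), round(ci_lower, 3), round(ci_upper, 3))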

If the coefficient is positive, the factor will be greater than 1, which means that the odds are
increased; if the regression coefficient is negative, the factor will be less than 1, which means
that the odds are decreased. For example, since the beta coefficient (column labelled B) for MT2
is negative, -0.9748, the ratio of the odds of fluorosis for children with missing teeth to the
odds for those without missing teeth is exp(-0.9748) = 0.3772. That is to say, the odds of
fluorosis for a child who has had teeth extracted are approximately 62% lower than for a child
who has all his/her teeth. When B is
zero, the factor equals 1, which leaves the odds unchanged.

When an independent variable is continuous, such as age, DFS or Plaque score, the odds ratio for
a unit change in the value of the independent variable may be less informative than the odds ratio
associated with a decade change in age, or a 5 unit change in DFS or plaque scores.



Categorical variables
If you have a two-category variable, such as MT2 (# of missing teeth dichotomized), you can
code each case as '0' or '1' to indicate either no missing teeth or one or more teeth missing. This is
called a dummy-variable or indicator-variable coding. The code of 1 indicates that the poorer
outcome is present. The resulting coefficient tells you the difference in the log odds when a case
is a member of the “poor” category and when it is not.

When you have a categorical variable with more than two categories, you must create dummy
variables to represent the categories. The number of dummy variables required to represent a
categorical variable is one less than the number of categories, and the coefficients for the new
variables represent the effect of each category as it is compared to a reference category (see
example below).

Example: From the SPSS output, you see that there are 106 cases with a value of 1 in the data set for
AGEGRP, meaning the child is 12 years old. Each of these cases will be assigned a code of 1 for the new variable
AGEGRP(1) and a code of 0 for the new variables AGEGRP(2), AGEGRP(3) and AGEGRP(4). Similarly,
cases with a value of 2 for AGEGRP (those aged 13 yrs) will be given the code of 1 for AGEGRP(2) and 0
for AGEGRP(1), AGEGRP(3) and AGEGRP(4), and so on. Cases with a value of 5 for AGEGRP (16-yr-
olds) will be given the code of 0 for all the new variables since they are the reference group.

                                Parameter Coding
              Value   Freq     (1)     (2)     (3)     (4)
AGEGRP
  12-yr-old     1     106    1.000    .000    .000    .000
  13-yr-old     2     102     .000   1.000    .000    .000
  14-yr-old     3      90     .000    .000   1.000    .000
  15-yr-old     4      59     .000    .000    .000   1.000
  16-yr-old     5      63     .000    .000    .000    .000
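The coding SPSS shows above can be reproduced with pandas; a sketch for illustration (the codes
1-5 are as in the table, with the 16-year-old category dropped so it serves as the reference):

import pandas as pd

# AGEGRP codes 1-5 as in the table above (5 = 16-yr-olds, the reference).
ages = pd.Series([1, 2, 3, 4, 5], name="AGEGRP")

# Four indicators for five categories; the reference row is all zeros.
dummies = pd.get_dummies(ages, prefix="AGEGRP").drop(columns="AGEGRP_5")
print(dummies)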

------------------ Variables in the Equation ------------------

Variable         B       S.E.      Wald    df    Sig      R

WATERF        1.0062    .3793    7.0369    1    .0080   .1185
MT2           -.8541    .4179    4.1761    1    .0410  -.0779
CRD1FS        -.0364    .0182    4.0126    1    .0452  -.0749
TOPICALF       .1624    .4120     .1553    1    .6935   .0000
AGEGRP                           3.7424    4    .4420   .0000
AGEGRP(1)      .2684    .5443     .2431    1    .6220   .0000
AGEGRP(2)      .5767    .5223    1.2193    1    .2695   .0000
AGEGRP(3)      .6039    .5270    1.3132    1    .2518   .0000
AGEGRP(4)     -.3021    .6876     .1930    1    .6604   .0000
Constant     -2.2159    .6361   12.1358    1    .0005

              Odds Ratio     95% CI for Exp(B)

Variable        Exp(B)       Lower      Upper

WATERF          2.7353       1.3005     5.7529
MT2              .4257        .1876      .9657
CRD1FS           .9643        .9305      .9992
TOPICALF        1.1763        .5246     2.6377
AGEGRP(1)       1.3079        .4500     3.8012
AGEGRP(2)       1.7802        .6396     4.9553
AGEGRP(3)       1.8292        .6512     5.1382
AGEGRP(4)        .7393        .1921     2.8451

In the SPSS output above, you see that the coefficients for all the dummy variables are positive,
except for AGEGRP(4). This means that compared to the 16-year-old school children, children
from 12 to 14 years of age have increased log odds of fluorosis.

Interaction Terms
Just as in Multiple Regression, you can include terms in the model that are products of single
terms. For example, if it makes sense, you could include a term for the WATERF by MT2
interaction in the fluorosis model to see if its inclusion improves the goodness of fit.
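A sketch of how such a product term might be added, using a statsmodels formula (the data frame
and file name are hypothetical, as in the earlier sketches; WATERF:MT2 denotes the product of
the two indicators):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("fluorosis.csv")   # hypothetical data set, as before

# Main effects plus the WATERF-by-MT2 product term.
fit = smf.logit("DEO ~ WATERF + MT2 + WATERF:MT2 + CRD1FS + TOPICALF",
                data=df).fit()
print(fit.summary())                # compare -2LL with/without the interaction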

Selecting Predictor Variables


In Logistic Regression, as in other multivariate statistical techniques, we want to identify subsets
of independent variables that are good predictors of the dependent variable. We can estimate
models using block entry of variables or any of the following stepwise methods: forward
conditional, forward likelihood-ratio (LR) test, forward Wald, backward conditional, backward
LR, or backward Wald. Different methods for variable selection may result in different models,
so it is always a good idea to examine several possible models and choose from among them on
the basis of “interpretability, parsimony, and ease of variable acquisition”. You will be using
different variable selection algorithms in your data analysis project. For example, SPSS uses the
following logistic regression variable selection methods:

• Enter. A procedure for variable selection in which all variables in a block are entered in
a single step.
• Forward Selection (Conditional). Stepwise selection method with entry testing based
on the significance of the score statistic, and removal testing based on the probability of a
likelihood-ratio statistic based on conditional parameter estimates.
• Forward Selection (Likelihood Ratio). Stepwise selection method with entry testing
based on the significance of the score statistic, and removal testing based on the
probability of a likelihood-ratio statistic based on the maximum partial likelihood
estimates.
• Forward Selection (Wald). Stepwise selection method with entry testing based on the
significance of the score statistic, and removal testing based on the probability of the
Wald statistic.
• Backward Elimination (Conditional). Backward stepwise selection. Removal testing is
based on the probability of the likelihood-ratio statistic based on conditional parameter
estimates.
• Backward Elimination (Likelihood Ratio). Backward stepwise selection. Removal
testing is based on the probability of the likelihood-ratio statistic based on the maximum
partial likelihood estimates.
• Backward Elimination (Wald). Backward stepwise selection. Removal testing is based
on the probability of the Wald statistic.
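SPSS implements these methods internally. As a rough illustration of the idea only (not SPSS's
algorithm), here is a crude backward-elimination sketch in Python that refits the model and
drops the least significant predictor by Wald p-value until everything remaining is significant:

import pandas as pd
import statsmodels.api as sm

def backward_wald(y, X, alpha=0.05):
    """Backward elimination on Wald p-values: drop the worst
    non-significant predictor, refit, repeat."""
    X = sm.add_constant(X)
    while True:
        fit = sm.Logit(y, X).fit(disp=0)
        pvals = fit.pvalues.drop("const")   # keep the constant in the model
        if pvals.empty or pvals.max() < alpha:
            return fit                      # everything left is significant
        X = X.drop(columns=pvals.idxmax())

df = pd.read_csv("fluorosis.csv")           # hypothetical data set
final = backward_wald(df["DEO"], df[["WATERF", "MT2", "CRD1FS", "TOPICALF"]])
print(final.summary())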



Assumptions
Logistic regression neither assumes homogeneity of variance across values of the dependent
variable nor relies on distributional assumptions in the same sense that multiple regression does.
However, the model may be more stable if the predictors have a multivariate normal distribution.
Additionally, as with other forms of regression, multicollinearity among the predictors can lead
to biased estimates and inflated standard errors.
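One common screen for multicollinearity (standard practice, not from the notes themselves) is
the variance inflation factor; a sketch with statsmodels, reusing the hypothetical data set:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("fluorosis.csv")           # hypothetical data set
X = sm.add_constant(df[["WATERF", "MT2", "CRD1FS", "TOPICALF"]])

# VIFs well above ~10 are a conventional warning sign of collinearity.
for i, name in enumerate(X.columns):
    print(name, round(variance_inflation_factor(X.values, i), 2))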

Assessing the Goodness of Fit of the Model


Classification Table for DEO
The Cut Value is probability = .50

                         Predicted
Observed            No Fluorosis   Fluorosis   Percent Correct
No Fluorosis             356            0          100.00%
Fluorosis                 34           30           46.87%
Overall                                             91.90%

One way to assess how well the model fits is to compare our predictions to the observed
outcomes. From the classification table in the output above, we see that 356 subjects without
fluorosis were correctly predicted by the model not to have fluorosis. Similarly, 30 children with
fluorosis were correctly predicted to have fluorosis. The off-diagonal entries of the table tell us
how many children were incorrectly classified. A total of 34 children were misclassified in this
example. Overall, 91.9% of the 420 school children were correctly classified.

Outcome            Basis           Estimated Probability

True Positive      30/(34+30)            .469
True Negative      356/(356+0)          1.000
False Negative     34/(34+30)            .531
False Positive     0/(356+0)             .000

Sensitivity = the true-positive probability

Specificity = the true-negative probability
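These probabilities follow directly from the counts in the classification table; a short check:

# Counts from the classification table (cut value .50).
tp, fn = 30, 34      # observed fluorosis: predicted yes / predicted no
tn, fp = 356, 0      # observed no fluorosis: predicted no / predicted yes

sensitivity = tp / (tp + fn)                   # 30/64   = 0.469
specificity = tn / (tn + fp)                   # 356/356 = 1.000
accuracy = (tp + tn) / (tp + fn + tn + fp)     # 386/420 = 0.919
print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 3))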

The classification table does not reveal the distribution of estimated probabilities for children in
the two groups (a histogram of the estimated or predicted probabilities of fluorosis would be
useful to set up a different cut-off point for fluorosis cases). For each predicted group, the table
shows only whether the estimated probability is greater or less than 0.5 (the default). Using the
ROC (Receiver Operating Characteristic) Curve procedure available in SPSS we can examine
the effects of using different cut-offs for classification on the sensitivity and specificity of the
model.
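A sketch of that ROC analysis in Python (scikit-learn's roc_curve sweeps every possible
cut-off; the fitting step repeats the hypothetical setup from the earlier sketches):

import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import roc_curve, roc_auc_score

df = pd.read_csv("fluorosis.csv")            # hypothetical data set
X = sm.add_constant(df[["WATERF", "MT2", "CRD1FS", "TOPICALF"]])
y_prob = sm.Logit(df["DEO"], X).fit(disp=0).predict(X)

# Sensitivity and specificity at every candidate cut-off.
fpr, tpr, cuts = roc_curve(df["DEO"], y_prob)
print("AUC =", round(roc_auc_score(df["DEO"], y_prob), 3))
for f, t, c in zip(fpr, tpr, cuts):
    print(f"cut-off {c:.3f}: sensitivity {t:.3f}, specificity {1 - f:.3f}")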

Another way of assessing the goodness of fit of the model is to examine how “likely” the sample
results actually are, given the parameter estimates, i.e. the likelihood. Since the likelihood is a



small number less than 1, it is customary to use -2 times the log of the likelihood (-2LL) as a
measure of how well the estimated model fits the data. A good model is one that results in a
high likelihood of the observed results. This translates to a small value for -2LL. (If a model fits
perfectly, the likelihood is 1, and -2 times the log-likelihood is 0.)

For the logistic regression model that contains only the constant, -2LL is 358.52624, as shown
below. For the model with all of the independent variables, the value of -2LL is 328.175, which
is smaller than the -2LL for the model containing only a constant. The next entry is the
goodness-of-fit statistic, which compares the observed probabilities to those predicted by the
model. The next two entries, the Cox & Snell R² and the Nagelkerke R², are statistics that
attempt to quantify the proportion of explained “variation” in the logistic regression model.
They are similar in intent to the R2 in a multiple linear regression model. By either measure, the
independent variables only explain a modest amount of variance. There are three additional chi-
square entries in the SPSS output. They are labelled Model, Block, and Step.
Dependent Variable.. DEO Enamel Opacities

Beginning Block Number 0. Initial Log Likelihood Function

-2 Log Likelihood 358.52624

* Constant is included in the model.

-2 Log Likelihood 328.175
Goodness of Fit 419.832
Cox & Snell R²   .070
Nagelkerke R²    .121

            Chi-Square   df   Significance
Model         30.351      4      .0000
Block         30.351      4      .0000
Step          30.351      4      .0000

The Model chi-square is the difference between -2LL for the model with only a constant and
-2LL for the current model with all the variables included. Thus, the model chi-square tests the
null hypothesis that the coefficients for all of the terms in the current model, except the constant,
are 0. This is comparable to the overall F test for multiple linear regression. The degrees of
freedom for the model chi-square equal the difference between the number of parameters in the two
models. We reject the null hypothesis because the significance is so low, .0000 (which should be
interpreted as < 0.001), and conclude that the set of variables improves the prediction of the log
odds.
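The model chi-square is simple arithmetic on the two -2LL values reported above; a sketch:

from scipy.stats import chi2

neg2ll_null = 358.52624       # constant-only model
neg2ll_full = 328.175         # model with the four predictors

model_chi2 = neg2ll_null - neg2ll_full    # 30.351
p = chi2.sf(model_chi2, df=4)             # 4 parameters added
print(round(model_chi2, 3), p)            # p is far below 0.001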

The entry labelled Block is the change in -2LL between successive entry blocks during model
building. In this example, we entered our variables in a single block, so the block chi-square is
the same as the model chi-square. If you enter variables in more than one block, these chi-square
values will be different.

The entry labelled Step is the change in -2LL between successive steps of building a model. It
tests the null hypothesis that the coefficients for the variables added at the last step are 0. For



this statistic to be different from the block chi-square, we would have to use either forward or
backward variable selection.

Other Diagnostic Methods


Whenever you build a statistical model, it is important to examine the adequacy of the resulting
model. In Multiple Regression we examine a variety of residuals, measures of influence, and
indicators of multicollinearity. These are valuable tools for identifying data points for which the
model does not fit well, points that exert a strong influence on the coefficient estimates, and
variables that are highly related to each other. In Logistic Regression there are comparable
diagnostics that should be used to look at how well the model fits the sample data. The SPSS
Logistic Regression procedure provides a variety of such statistics (e.g. residual, standardized
residual, deviance, leverage, c statistic, Hosmer-Lemeshow goodness-of-fit test) and plots. It is
also important that the model be validated when predictive accuracy is the focus of an analysis.
This means deriving a model on a subset of the data and then testing it on a holdout sample.
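A sketch of that derive-and-test workflow (the split proportion and file name are illustrative
assumptions):

import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

df = pd.read_csv("fluorosis.csv")                       # hypothetical data set
train, test = train_test_split(df, test_size=0.3, random_state=1)

cols = ["WATERF", "MT2", "CRD1FS", "TOPICALF"]
fit = sm.Logit(train["DEO"], sm.add_constant(train[cols])).fit(disp=0)

# Classify the untouched holdout sample at the default .50 cut-off.
pred = (fit.predict(sm.add_constant(test[cols])) >= 0.5).astype(int)
print("holdout accuracy:", round((pred == test["DEO"]).mean(), 3))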

Attached are excerpts from papers to illustrate how the logistic model is summarized.

Suggested Reading:

Norman GR, Streiner DL. Biostatistics: The Bare Essentials. 2nd edition. St. Louis, MO:
Mosby-Year Book Inc., 2000. Chapter 15.

Prepared by Dr Herenia P. Lawrence 2012
