
Biostat 2, Module 5

Logistic regression

Getu Degu

November 6, 2008

Logistic regression
► The preceding section dealt with multiple regression with a continuous dependent variable, extending the methods of linear regression introduced in Biostat 2 (Module 1).

► In many studies the outcome variable of interest is the presence or absence of some condition, whether or not the subject has a particular characteristic such as a symptom of a certain disease.

► We cannot use ordinary multiple linear regression for such data, but instead we can use a similar approach known as multiple linear logistic regression or just logistic regression.
Logistic regression is part of a category of statistical models
called generalized linear models. This broad class of models
includes ordinary regression, ANOVA, ANCOVA and loglinear
regression.

Uses and selection of independent variables

♣ In general, there are two main uses of logistic regression.

♣ The first is the prediction (estimation) of the probability that an individual will have (develop) the characteristic. For example, logistic regression is often used in epidemiological studies where the result of the analysis is the probability of developing cancer after controlling for other associated risks.

♣ The second is to assess the relationship, and its strength, between a two-valued outcome (dependent) variable and explanatory (independent) variables that can be categorical or continuous (e.g., smoking 10 packs a day puts you at a higher risk of developing cancer than working in an asbestos mine).

♣ Logistic regression can be applied to case-control, follow-up and cross-sectional data.

The Model:

► The basic principle of logistic regression is much the same as for ordinary multiple regression.

► The main difference is that instead of developing a model that uses a combination of the values of a group of explanatory variables to predict the value of a dependent variable, we predict a transformation of the dependent variable.
► The dependent variable in logistic regression is usually dichotomous, that is, it can take the value 1 (success) with probability p, or the value 0 (failure) with probability 1-p.

► This type of variable is called a binomial (or binary) variable.

► Although not discussed in this pack, applications of logistic regression have also been extended to cases where the dependent variable has more than two categories, known as multinomial logistic regression.

► When the multiple classes of the dependent variable can be ranked, ordinal logistic regression is preferred to multinomial logistic regression.

► As mentioned previously, one of the goals of logistic regression is to correctly predict the category of outcome for individual cases using the most parsimonious (condensed) model.

► To accomplish this goal, a model is created that includes all predictor variables that are useful in predicting the response variable.

► Several different options are available during model creation. Variables can be entered into the model in the order specified by the researcher, or logistic regression can test the fit of the model after each coefficient is added or deleted, an approach called stepwise regression.

Backward stepwise regression appears to be the
preferred method of exploratory analyses, where the
analysis begins with a full or saturated model and
variables are eliminated from the model in an iterative
process. The fit of the model is tested after the elimination
of each variable to ensure that the model still adequately fits
the data. When no more variables can be eliminated from
the model, the analysis has been completed.
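
As a rough illustration of this procedure, the sketch below runs backward elimination with likelihood-ratio tests in Python. The function names, the use of statsmodels and scipy, the assumption that X is a pandas DataFrame of candidate predictors with a 0/1 outcome y, and the 0.05 retention threshold are all choices made for the example, not something prescribed by this module.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def fit_logit(y, X, cols):
    """Fit a logistic model on the given columns (intercept-only if cols is empty)."""
    design = sm.add_constant(X[cols]) if cols else np.ones((len(y), 1))
    return sm.Logit(y, design).fit(disp=0)

def backward_eliminate(y, X, alpha=0.05):
    """Repeatedly drop the predictor whose removal hurts the fit least,
    judged by a 1-df likelihood-ratio test, until every remaining
    variable is significant at the chosen alpha."""
    cols = list(X.columns)
    while cols:
        full = fit_logit(y, X, cols)
        p_drop = {}
        for col in cols:                                   # test the fit after removing each variable
            reduced = fit_logit(y, X, [c for c in cols if c != col])
            lr = 2 * (full.llf - reduced.llf)              # likelihood-ratio statistic
            p_drop[col] = stats.chi2.sf(lr, df=1)
        weakest = max(p_drop, key=p_drop.get)
        if p_drop[weakest] > alpha:
            cols.remove(weakest)                           # not significant: eliminate and refit
        else:
            break                                          # all remaining variables contribute; stop
    return cols
```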

► Logistic regression is a powerful statistical tool for estimating the magnitude of the association between an exposure and a binary outcome after adjusting simultaneously for a number of potential confounding factors.
► If we have a binary variable and give the categories numerical values of 0 and 1, usually representing ‘No’ and ‘Yes’ respectively, then the mean of these values in a sample of individuals is the same as the proportion of individuals with the characteristic.
► We could expect, therefore, that the appropriate regression model would estimate the probability (proportion) that an individual will have the characteristic.
► We cannot use an ordinary linear regression, because this might predict proportions less than zero or greater than one, which would be meaningless.
► In practice, a statistically preferable method is to use a transformation of this proportion.

♣ The transformation we use is called the logit
transformation, written as logit (p). Here p is the proportion
of individuals with the characteristic.

For example, if p is the probability of a subject having a myocardial infarction, then 1-p is the probability that they do not have one. The ratio p / (1-p) is called the odds, and thus

logit(p) = ln[ p / (1-p) ]

is the log odds.

♣ The logit can take any value from minus infinity to plus
infinity.

♣ We can fit regression models to the logit which are very similar to the ordinary multiple regression and analysis of variance models found for data from a normal distribution.

♣ We assume that relationships are linear on the logistic scale:

ln[ p / (1-p) ] = a + b1X1 + b2X2 + … + bnXn

where X1, …, Xn are the predictor variables and p is the proportion to be predicted. The calculation is computer intensive.
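
As a concrete but hypothetical illustration, a model of this form could be fitted in Python with statsmodels; the file name mi_study.csv and the variable names mi, age, smoker and sbp are invented for the example and are not part of this module.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mi_study.csv")                     # hypothetical data set
y = df["mi"]                                         # 1 = myocardial infarction, 0 = none
X = sm.add_constant(df[["age", "smoker", "sbp"]])    # constant a plus predictors X1..Xn

model = sm.Logit(y, X).fit()     # iterative maximum likelihood estimation
print(model.summary())           # fitted coefficients are on the log-odds (logit) scale
```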

► Because the logistic regression equation predicts the log odds, the coefficients represent the difference between two log odds, a log odds ratio.

► The antilog of the coefficients is thus an odds ratio. Most programs print these odds ratios.

► These are often called adjusted odds ratios.

The above equation can be rewritten to represent the probability of disease as:

P = e^(a + b1X1 + b2X2 + … + bnXn) / (1 + e^(a + b1X1 + b2X2 + … + bnXn))
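
Continuing the hypothetical sketch above, the adjusted odds ratios and a predicted probability follow directly from the fitted coefficients (the subject's values here are again invented).

```python
import numpy as np

odds_ratios = np.exp(model.params)     # antilog of each coefficient = adjusted odds ratio
print(odds_ratios)

# Predicted probability for one hypothetical subject: age 60, smoker, SBP 150.
x_new = np.array([1, 60, 1, 150])                 # the leading 1 multiplies the constant a
log_odds = np.dot(x_new, model.params)
p = np.exp(log_odds) / (1 + np.exp(log_odds))     # P = e^(a+ΣbX) / (1 + e^(a+ΣbX))
print(p)
```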

Significance tests
The process by which coefficients are tested for significance
for inclusion or elimination from the model involves several
different techniques.

I) Z-test

The significance of each variable can be assessed by treating

Z = b / SE(b)

(the estimated coefficient divided by its standard error) as a standard normal deviate. The corresponding P-values are easily computed (found from the table of the Z-distribution).
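
This is the Wald test that most packages print beside each coefficient. A minimal sketch for the hypothetical model fitted earlier (statsmodels reports the same values in its summary):

```python
import numpy as np
from scipy import stats

z = model.params / model.bse                # each coefficient divided by its standard error
p_values = 2 * stats.norm.sf(np.abs(z))     # two-sided P-values from the Z-distribution
print(z)
print(p_values)
```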

II) Likelihood-Ratio Test:

The likelihood-ratio test (LRT) uses the ratio of the maximized value of the likelihood function for the full model (L1) over the maximized value of the likelihood function for the simpler model (L0); the test statistic is -2 ln(L0 / L1).
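
A short sketch of the likelihood-ratio test, again using the hypothetical example: the full model (L1) is compared with a simpler model (L0) that omits one of the invented predictors.

```python
import statsmodels.api as sm
from scipy import stats

reduced = sm.Logit(y, sm.add_constant(df[["age", "smoker"]])).fit(disp=0)   # model without "sbp"
lr_stat = -2 * (reduced.llf - model.llf)    # -2 ln(L0 / L1)
p_value = stats.chi2.sf(lr_stat, df=1)      # one coefficient removed, so 1 degree of freedom
print(lr_stat, p_value)
```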

Deviance

► Before proceeding to the likelihood ratio test, we need to know about the deviance, which is analogous to the residual sum of squares from a linear model.
► The deviance of a model is -2 times the log likelihood associated with that model.
► As a model’s ability to predict outcomes improves, the deviance falls. Poorly fitting models have higher deviance.
► If a model perfectly predicts outcomes, the deviance will be zero. This is analogous to the situation in linear regression, where the residual sum of squares falls to 0 if the model predicts the values of the dependent variable perfectly.
► Based on the deviance, it is possible to construct an analogue of r² for logistic regression, commonly referred to as the pseudo r².

► If G1² is the deviance of a model with variables, and G0² is the deviance of a null model, the pseudo r² of the model is:

pseudo r² = 1 - (G1² / G0²) = 1 - (ln L1 / ln L0)

► One might think of it as the proportion of deviance explained.
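
For the hypothetical model fitted earlier, both quantities can be read off the statsmodels result, whose llf attribute is ln L1 and llnull is ln L0:

```python
G1_sq = -2 * model.llf        # deviance of the fitted model
G0_sq = -2 * model.llnull     # deviance of the null (intercept-only) model
pseudo_r2 = 1 - G1_sq / G0_sq
print(pseudo_r2)
print(model.prsquared)        # statsmodels reports the same McFadden pseudo r-squared
```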

► The likelihood ratio test, which makes use of the deviance, is analogous to the F-test from linear regression.

► In its most basic form, it can test the hypothesis that all the coefficients in a model are equal to 0:

H0: β1 = β2 = … = βk = 0

► The test statistic has a chi-square distribution with k degrees of freedom.
► If we want to test whether a subset of q coefficients in a model are all equal to zero, the test statistic is the same, except that for L0 we use the likelihood from the model without those coefficients, and L1 is the likelihood from the model with them.
► This chi-square has q degrees of freedom.

Assumptions

► Logistic regression is popular in part because
it enables the researcher to overcome many of
the restrictive assumptions of OLS regression:

1. Logistic regression does not assume a linear relationship between the dependents and the independents. It may handle nonlinear effects even when exponential and polynomial terms are not explicitly added as additional independents, because the logit link function on the left-hand side of the logistic regression equation is non-linear. However, it is also possible and permitted to add explicit interaction and power terms as variables on the right-hand side of the logistic equation, as in OLS regression.
2. The dependent variable need not be
normally distributed.
3. The dependent variable need not be
homoscedastic for each level of the
independents; that is, there is no homogeneity
of variance assumption.
However, other assumptions still apply:

1. Meaningful coding. Logistic coefficients will
be difficult to interpret if not coded meaningfully.
The convention for binomial logistic regression is
to code the dependent class of greatest interest
as 1 and the other class as 0.

2. Inclusion of all relevant variables in the regression model.
3. Exclusion of all irrelevant variables.
4. Error terms are assumed to be independent (independent sampling).
Violations of this assumption can have serious
effects. Violations are apt to occur, for
instance, in correlated samples and repeated
measures designs, such as before-after or
matched-pairs studies, cluster sampling, or
time-series data. That is, subjects cannot
provide multiple observations at different time
points. In some cases, special methods are
available to adapt logistic models to handle
non-independent data.
5. Linearity: Logistic regression does not require linear relationships between the independents and the dependent, as OLS regression does, but it does assume a linear relationship between the (continuous) independents and the logit of the dependent.

6. No multicollinearity: To the extent that one independent is a linear function of another independent, the problem of multicollinearity will occur in logistic regression, as it does in OLS regression. As the independents increase in correlation with each other, the standard errors of the logit (effect) coefficients will become inflated.
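
One common way to screen for this, not prescribed by the lecture, is the variance inflation factor; a sketch for the invented predictors of the earlier hypothetical example:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_design = sm.add_constant(df[["age", "smoker", "sbp"]])
vif = {col: variance_inflation_factor(X_design.values, i)
       for i, col in enumerate(X_design.columns) if col != "const"}
print(vif)     # values much larger than about 10 suggest problematic collinearity
```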

7. No outliers: As in OLS regression, outliers can affect results significantly. The researcher should analyze standardized residuals for outliers and consider removing them or modeling them separately. Standardized residuals > 2.58 are outliers at the .01 level, which is the customary level (standardized residuals > 1.96 are outliers at the less-used .05 level).
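
A minimal sketch of this screening step for the earlier hypothetical model, with standardized (Pearson) residuals computed by hand:

```python
import numpy as np

p_hat = model.predict()                                    # fitted probabilities
std_resid = (y - p_hat) / np.sqrt(p_hat * (1 - p_hat))     # standardized (Pearson) residuals
print(df[np.abs(std_resid) > 2.58])                        # cases flagged as outliers at the .01 level
```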

8. Large samples: Unlike OLS regression, logistic regression uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to derive parameters.

► MLE relies on large-sample asymptotic normality, which means that the reliability of estimates declines when there are few cases for each observed combination of independent variables.
► That is, in small samples one may get high standard errors. In the extreme, if there are too few cases in relation to the number of variables, it may be impossible to converge on a solution.
► Very high parameter estimates (logistic coefficients) may signal inadequate sample size.

Hosmer and Lemeshow Test

♣ The Hosmer-Lemeshow goodness-of-fit statistic is used to assess whether the necessary assumptions for the application of multiple logistic regression are fulfilled.

♣ The Hosmer and Lemeshow goodness-of-fit statistic is computed as the Pearson chi-square from the contingency table of observed frequencies and expected frequencies.

♣ A good fit as measured by Hosmer and Lemeshow's test will yield a large p-value.
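
A sketch of the Hosmer-Lemeshow statistic computed by hand for the earlier hypothetical model: subjects are grouped into deciles of predicted risk and a Pearson chi-square compares observed with expected events (statistical packages report this test directly).

```python
import pandas as pd
from scipy import stats

p_hat = pd.Series(model.predict(), index=y.index)    # fitted probabilities
groups = pd.qcut(p_hat, q=10, duplicates="drop")     # deciles of predicted risk

obs = y.groupby(groups).sum()        # observed events in each group
exp = p_hat.groupby(groups).sum()    # expected events in each group
n = y.groupby(groups).size()         # subjects in each group

hl_stat = (((obs - exp) ** 2) / (exp * (1 - exp / n))).sum()
p_value = stats.chi2.sf(hl_stat, df=len(n) - 2)      # number of groups minus 2 degrees of freedom
print(hl_stat, p_value)              # a large p-value indicates an adequate fit
```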

Summary
♣ A likelihood is a probability, specifically the probability that the values of the dependent variable may be predicted from the values of the independent variables. Like any probability, the likelihood varies from 0 to 1.

♣ The log likelihood ratio test (sometimes called the model chi-square test) of a model tests the difference between -2LL for the full model and -2LL for the null (initial) model. That is, the model chi-square is computed as -2LL for the null (initial) model minus -2LL for the researcher’s model.

♣ The initial chi-square is -2LL for the model which accepts the null hypothesis that all the b coefficients are zero.

♣ The log likelihood ratio test tests the null hypothesis that all population logistic regression coefficients except the constant are zero. It is an overall model test which does not assure that every independent is significant.

♣ It measures the improvement in fit that the
explanatory variables make compared to the null
model.

♣ The method of analysis uses an iterative procedure whereby the answer is obtained by several repeated cycles of calculation using the maximum likelihood approach.

♣ Because of this extra complexity, logistic regression is only found in large statistical packages or those primarily intended for the analysis of epidemiological data.
