Mark Tranmer
Mark Elliot
CONTENTS
1) Introduction
Categorical data and 2 x 2 tables
Odds and Relative Odds
Odds
Relative odds
2) Logistic regression theory
Introduction
The Theory
Logistic regression theory
Dummy variables
Exercise 2
3) Logistic regression in SPSS 13
4) Summary and further comments
Summary
Further comments
Reading list
1) Introduction
Socio-economic variables are very often categorical, rather than interval scale. In
many cases research focuses on models where the dependent variable is
categorical. For example, the dependent variable might be ‘unemployed’ /
‘employed’, and we could be interested in how this variable is related to sex, age,
ethnic group, etc. In this case we could not carry out a multiple linear regression
as many of the assumptions of this technique will not be met, as will be explained
theoretically below. Instead we would carry out a logistic regression analysis.
Hence, logistic regression may be thought of as an approach that is similar to that
of multiple linear regression, but takes into account the fact that the dependent
variable is categorical.
We can write categorical data in two forms: list form or table form. The important
point to make about this is that whichever way we choose to think about this kind
of data, the information is the same. For example, if we were interested in the
association between unemployment and sex for a sample of 12 people (this is a
smaller sample than we would tend to use in general but it illustrates the point),
we could write the data in list form as:
1 0 0
2 0 1
3 0 1
4 0 1
5 0 0
6 1 1
7 0 1
8 1 0
9 0 0
10 0 1
11 0 0
12 1 0
Or the same data in table form as:
2 x 2 tables are quite a good way to present the information, because the
relationship between the two variables can usually be clearly interpreted. When
we are interested in the association between several variables we can, of course,
still construct a multi-way table. However, it is less easy to interpret the
relationships from the tables when several variables are involved.
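The equivalence of the list form and the table form can be checked by counting. The sketch below, in Python, tallies the 12 records from the list into a 2 x 2 table. The list does not label its columns, so the names a and b are placeholders for the two binary variables (unemployment and sex, in whichever order the list uses).

```python
from collections import Counter

# (person, a, b) records copied from the list form.
# The names a and b are placeholders: the source list is unlabelled.
records = [
    (1, 0, 0), (2, 0, 1), (3, 0, 1), (4, 0, 1),
    (5, 0, 0), (6, 1, 1), (7, 0, 1), (8, 1, 0),
    (9, 0, 0), (10, 0, 1), (11, 0, 0), (12, 1, 0),
]

# Count how many people fall in each (a, b) combination.
cells = Counter((a, b) for _, a, b in records)

print("       b=0  b=1  Total")
for a in (0, 1):
    row = [cells[(a, 0)], cells[(a, 1)]]
    print(f"a={a}   {row[0]:>4} {row[1]:>4} {sum(row):>6}")
col_totals = [cells[(0, b)] + cells[(1, b)] for b in (0, 1)]
print(f"Total {col_totals[0]:>4} {col_totals[1]:>4} {sum(col_totals):>6}")
```

Every record lands in exactly one cell, so the four cell counts add up to the sample size of 12 and no information is lost in moving between the two forms.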
Example 1
Table 1.
                        Ethnic Group
Behaviour Problems   White        Black        Total
NO                   90 [0.83]    30 [0.48]    120 (70%)
YES                  19 [0.17]    33 [0.52]    52 (30%)
Total                109 (63%)    63 (37%)     172 (100%)
Table 1 is a cross tabulation of two binary variables for a sample of 172 boys in
reception classes.
We can see that the majority of the sample of boys (70%) are not perceived to
have a behaviour problem and that 63% of them are white. The conditional
probabilities of having a behaviour problem, given ethnic group are shown in
square brackets after each of the cell frequencies. For example the probability of
being perceived to have a behaviour problem for white boys is 0.17, and for black
boys is 0.52.
A useful way of using the information in cross tabulations where one dimension
of the table is an outcome of interest (whether 2x2 tables or more complicated
ones), is to calculate odds and relative odds (odds ratios).
Odds
In the above table, the odds of a white boy being seen to have a behaviour
problem are 19/90 = 0.21 or 0.21 to 1. In betting terms that is about 5 : 1 against
– much less than even money.
For black boys, the corresponding odds are 33/30 = 1.1, or 1.1 to 1.
This is equivalent to 11 to 10 on (a little better than even money). Note that odds are
not the same as probabilities: they are not restricted to the range 0 to 1.
Relative odds
We can also think of the information in the table in terms of relative odds. The
relative odds of a black boy compared with a white boy being seen as having a
behaviour problem are 1.1 / 0.21 or 5.2 to 1. In other words the odds of a black
boy being seen as having a behaviour problem are 5.2 times the odds for a white
boy. Equally, the odds of being black rather than white are 5.2 times as high for
boys perceived to have behaviour problems as for boys without perceived
behaviour problems. Relative odds are symmetrical in that sense; like correlation,
we do not think of this measure in terms of a dependent variable and an
explanatory variable. We just think in terms of the association between two
variables.
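The probabilities, odds and relative odds quoted for Table 1 can be reproduced directly from the four cell counts. The following Python sketch is only an illustration; the dictionary layout and function names are choices made here, not part of the original workshop material.

```python
# Cell counts from Table 1: behaviour problems (yes/no) by ethnic group.
counts = {
    "white": {"no": 90, "yes": 19},
    "black": {"no": 30, "yes": 33},
}

def probability(group):
    """Conditional probability of a perceived behaviour problem."""
    c = counts[group]
    return c["yes"] / (c["yes"] + c["no"])

def odds(group):
    """Odds of a perceived behaviour problem: yes count over no count."""
    c = counts[group]
    return c["yes"] / c["no"]

relative_odds = odds("black") / odds("white")

for g in counts:
    print(f"{g}: probability = {probability(g):.2f}, odds = {odds(g):.2f}")
print(f"relative odds (black vs white) = {relative_odds:.2f}")
```

Working from the unrounded counts gives relative odds of about 5.21; the figure of 5.2 in the text comes from dividing the rounded odds, 1.1 / 0.21.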
Exercise 1
Calculate the probabilities, odds and relative odds of being unemployed at 18 for
white and black ethnic groups.
2) Logistic regression theory
Introduction
The Theory
When we have a proportion as a response, we use a logistic or logit
transformation to link the dependent variable to the set of explanatory variables.
The logit link has the form:
Logit(P) = ln[P/(1-P)]
The term within the square brackets is the odds of an event occurring. In the
example above this would be the odds of a person being perceived to have
behaviour problems.
Using the logit scale changes the scale of a proportion from the range 0 to 1 to
the range minus infinity to plus infinity, and the scale is centred so that
Logit(P) = 0 when P = 0.5. When we transform our results back from the logit
(log odds) scale to the original probability scale, our predicted values will
always be at least 0 and at most 1.
Let:
Logit(Pi) = ln[Pi/(1-Pi)] = β0+β1xi
Then:
Pi = exp(β0+β1xi)/(1 + exp(β0+β1xi))
and:
1 - Pi = 1/(1 + exp(β0+β1xi))
Notice that we have so far not included a residual term in the models, and have
instead expressed the model in terms of population probabilities. But we could
write it as:
pi = Pi + fi = exp(β0+β1xi)/(1 + exp(β0+β1xi)) + fi
Note that in this case, fi is not normally distributed, as it was assumed to be for
linear regression.
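The logit transformation and its inverse are short enough to write out. The Python sketch below uses illustrative function names chosen here; it shows that Logit(P) is 0 at P = 0.5 and that transforming back from the logit scale always returns a value between 0 and 1.

```python
import math

def logit(p):
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Map a value on the logit scale back to a probability."""
    return math.exp(z) / (1 + math.exp(z))

for p in (0.1, 0.5, 0.9):
    z = logit(p)
    print(f"P = {p}: logit = {z:+.3f}, back-transformed = {inverse_logit(z):.3f}")
```

The inverse function is the expression exp(β0+β1xi)/(1 + exp(β0+β1xi)) with z standing for the linear predictor β0+β1xi.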
We will now consider some logistic regression theory and return to Example 1,
session 1.
                        Ethnic Group
Behaviour Problems   White        Black        Total
NO                   90 [0.83]    30 [0.48]    120 (70%)
YES                  19 [0.17]    33 [0.52]    52 (30%)
Total                109 (63%)    63 (37%)     172 (100%)
Fitting a logistic regression of perceived behaviour problems on ethnic group
(EG = 0 for white boys, EG = 1 for black boys) gives the estimated model:
Logit(P) = -1.56 + 1.65 EG
We can interpret this as the log odds of a white boy (EG=0) being seen as having
a behaviour problem being equal to -1.56; hence the odds of a white boy being
seen as having a behaviour problem are exp(-1.56) = 0.21.
The log odds of a black boy (EG=1) having a perceived behaviour problem are
-1.56 + 1.65 = 0.09. Hence the odds of a black boy having a perceived behaviour
problem are exp(0.09) = 1.1. Alternatively we can say that the odds for black boys
are exp(1.65) = 5.21 times as high as they are for white boys. That is, the relative
odds of a teacher perceiving a black boy to have behavioural problems compared
with a white boy are 5.21.
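With a single binary explanatory variable, the two coefficients do not need iterative fitting: they can be read off the 2 x 2 table. The Python sketch below assumes the coding EG = 0 for white boys and EG = 1 for black boys; the names b0 and b1 are labels chosen here for the intercept and the ethnic group coefficient.

```python
import math

# Odds taken from Table 1 (yes count divided by no count).
odds_white = 19 / 90   # EG = 0
odds_black = 33 / 30   # EG = 1

b0 = math.log(odds_white)               # intercept: log odds for EG = 0
b1 = math.log(odds_black / odds_white)  # slope: log of the relative odds

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print(f"log odds for EG = 1: {b0 + b1:.2f}")
print(f"exp(b1) = {math.exp(b1):.2f}")
```

The unrounded log odds for EG = 1 come out at about 0.10 rather than 0.09, because the text adds the two coefficients after rounding them to two decimal places.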
Notice that these results correspond exactly to the results in Table 1. This is
because for Table 1 there is one degree of freedom: we can calculate the degrees
of freedom in the table as (r-1)*(c-1) = 1, where r is the number of rows in the
table and c is the number of columns. So if we fit one parameter (ethnic group,
EG) we have used up, or saturated, all the degrees of freedom and hence fitted a
saturated model. This means we have fitted enough terms in the model to explain
everything that is going on in Table 1. Before we fitted the ethnic group term, we
could not have explained everything that is going on in the table, and we would
hence find a deviance (or –2 log likelihood) of 22.8. The deviance is a measure
of how much variation is left having fitted the model (how much is left
unexplained by the model). The deviance follows a chi-squared distribution and we would
in general compare the difference in deviance in two models to find out if the
extra terms we added were significant. In the current example, we cannot really
talk in terms of ‘change in deviance’ because once we have fitted EG to the model,
we have fitted a saturated model, and the deviance is 0, but in general, assessing
the change in deviance is a very useful way of assessing whether we need to add
extra terms to our model. In the workshop examples we will see how this works
in much more detail.
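The deviance of 22.8 quoted for the model without ethnic group can be checked by hand. The Python sketch below measures that model against the saturated model using the likelihood ratio statistic for a 2 x 2 table, 2 * sum(observed * ln(observed / expected)). This is one standard way to obtain the figure and is not necessarily how the original analysis computed it.

```python
import math

# Observed counts from Table 1, keyed by (behaviour problem, ethnic group).
observed = {
    ("no", "white"): 90, ("no", "black"): 30,
    ("yes", "white"): 19, ("yes", "black"): 33,
}

n = sum(observed.values())
row_totals = {}
col_totals = {}
for (r, c), count in observed.items():
    row_totals[r] = row_totals.get(r, 0) + count
    col_totals[c] = col_totals.get(c, 0) + count

# Deviance of the model without ethnic group, measured against the
# saturated model. The expected counts assume the same probability of a
# perceived behaviour problem in both ethnic groups.
deviance = 0.0
for (r, c), count in observed.items():
    expected = row_totals[r] * col_totals[c] / n
    deviance += 2 * count * math.log(count / expected)

print(f"deviance before fitting EG = {deviance:.1f}")
```

Once EG is in the model the fitted counts equal the observed counts, every log term is ln(1) = 0, and the deviance drops to 0.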
Dummy variables
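A categorical explanatory variable with more than two categories cannot be
entered into the model as a single numeric code, because the codes are only
labels; instead it is represented by a set of dummy (0/1) variables.
For example, suppose the explanatory variable was housing tenure coded like this:
Tenure
1: Owner occupier
2: Renting from a private landlord
3: Renting from the local authority
We would need to choose a baseline category and create two dummy
variables. For example, if we chose owner occupier as the baseline category we
would code the dummy variables like this: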
Tenure                   D1   D2
Owner occupier           0    0
Rented private           1    0
Rented local authority   0    1
For logistic regression SPSS can create dummy variables for us from categorical
explanatory variables, as we will see later.
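Outside SPSS the same coding can be produced by hand. The Python sketch below mirrors the tenure example, with owner occupier as the baseline; the function name and the dictionary of labels are illustrative choices made here.

```python
# Tenure codes as in the example; category 1 is the baseline.
TENURE_LABELS = {
    1: "Owner occupier",
    2: "Rented private",
    3: "Rented local authority",
}

def tenure_dummies(code, baseline=1):
    """Return (D1, D2): one 0/1 indicator per non-baseline category."""
    if code not in TENURE_LABELS:
        raise ValueError(f"unknown tenure code: {code}")
    others = [c for c in sorted(TENURE_LABELS) if c != baseline]
    return tuple(int(code == c) for c in others)

for code, label in TENURE_LABELS.items():
    d1, d2 = tenure_dummies(code)
    print(f"{label:<24} D1={d1} D2={d2}")
```

A variable with k categories always needs k - 1 dummy variables, because the baseline category is identified by all of the dummies being 0.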
Exercise 2
No 50 40 45 135
Yes 9 5 4 18
Total 59 45 49 153
3) Logistic regression in SPSS 13
These data are taken from the British Election Study 2005 pre-campaign and
post-election panel data. More information:
http://www.essex.ac.uk/bes/
But first some exploratory data analysis: we will check the distributions of each of
the variables and do some filtering of the data and re-coding of the variables.
Frequency of turnout from unfiltered data:
We will now filter the dataset so that it only contains those people who either
answered yes or no to “did you vote in the general election 2005?".
Frequency of turnout from filtered data:
We will now recode the bq12a variable into another variable called ‘vote2005’
which we will recode as 0=didn’t turn out to vote, 1=did turn out to vote. This will
enable us to model the probability of turning out to vote, which is the response
we require.
Frequencies
Histogram of age
Question: why are some bars much lower than their neighbours?
Cross tabulations
From these results we can see that:
The conditional probability of a male turning out to vote is 1346/1837 = 0.733
(which, when multiplied by 100, is equal to the row % in this table,
given the way the cross tab is organised).
The conditional probability of a female turning out to vote is 1729/2316 = 0.747.
The odds of a male turning out to vote are:
1346/491 = 2.741
The odds of a female turning out to vote are:
1729/587 = 2.945
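These figures, together with the relative odds of females compared with males, can be recomputed from the counts in the cross tabulation. The Python sketch below is an illustration only; the variable and function names are choices made here.

```python
# Counts from the cross tabulation: (voted, did not vote).
turnout = {
    "male": (1346, 491),
    "female": (1729, 587),
}

def summarise(voted, not_voted):
    """Return the conditional probability and the odds of voting."""
    return voted / (voted + not_voted), voted / not_voted

stats = {sex: summarise(*cells) for sex, cells in turnout.items()}
for sex, (prob, odds) in stats.items():
    print(f"{sex}: probability = {prob:.3f}, odds = {odds:.3f}")

relative_odds = stats["female"][1] / stats["male"][1]
print(f"relative odds (female vs male) = {relative_odds:.3f}")
```

The relative odds of about 1.074 are the same quantity that the logistic regression model reports as Exp(B) for gender.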
From this table we can see that, according to these data, owner occupiers (‘owns’)
are much more likely to turn out to vote than renters (‘rents’). The conditional
probability of an owner occupier turning out to vote is 0.805, whereas for renters
it is 0.569.
(If time permits, please work out the odds and relative odds for owns and rents.)
There are a total of 49 people who describe their housing tenure as neither ‘owns’
nor ‘rents’.
A scatterplot of the relationship between age and the proportion at each age
turning out to vote shows that there is a much higher chance of turning out to
vote when you are older.
Line of best fit: linear
Logistic regression models
Now Click on Continue and then OK to run the model!
We can see from the table above that we are modelling 4156 cases here. Some
cases are deleted from the analysis where information is missing: the SPSS
default is listwise deletion, so only cases where all dependent and explanatory
variables are complete are included in the analysis. The tables below show us
firstly that we have coded our dependent variable in the right direction and
secondly that the categorical variable for gender has a reference category of
male. The (1) means that gender(1) in the results refers to female here.
Block 0: Beginning Block
Block 1: Method = Enter
We have added one new variable to the model, which has reduced the -2 log
likelihood by 1.018 with 1 degree of freedom. The -2 log likelihood (sometimes
called the deviance) measures how much variation in the outcome of interest, in
this example turnout, is left unexplained by the model, so a larger reduction
indicates a more useful variable. The change in -2 log likelihood between two
nested models has a chi-squared distribution. The p value for the result of adding
gender to the model is given in the table above and we can see that this is 0.313,
which is greater than the conventional significance level of 0.05. Hence we would
conclude that the addition of gender to the model is not statistically significant.
In other words, we have no evidence that this variable explains variations in
turnout.
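The p value of 0.313 can be checked from the change in -2 log likelihood. With 1 degree of freedom a chi-squared variable is a squared standard normal, so its upper-tail probability can be written with the complementary error function from the Python standard library. The helper name below is hypothetical, and the shortcut applies only to the 1 degree of freedom case.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of
    freedom: P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2))

change_in_deviance = 1.018   # reduction in -2 log likelihood from adding gender
p_value = chi2_sf_1df(change_in_deviance)
print(f"p = {p_value:.3f}")
```

For comparison, a change of 3.841 gives p = 0.05, which is why reductions smaller than about 3.84 on 1 degree of freedom are not significant at the 5% level.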
We see from the table above that the estimated model is
We can see that the coefficient of gender is non-significant (sig = 0.313 > 0.05).
The Exp(B) column shows the relative odds (odds ratio) and indicates that the
odds of turning out to vote for females are 1.074 times the odds for males. We
can request a confidence interval for this result as shown below.
The confidence interval for exp(B), 0.935 to 1.235, indicates that the odds of
turning out to vote for females are between 0.935 and 1.235 times the odds for
males, i.e. the range has a lower limit of ‘slightly less than males’ and an upper
limit of ‘slightly more than males’, and therefore includes ‘males and females are
equally likely to turn out to vote’ (i.e. exp(B)=1). This is not surprising since we
have already concluded that gender has no statistically significant explanatory
power in explaining variations in turnout. We will now add an additional variable
to the model: age in years (which is a continuous variable, rather than a
categorical one).
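When gender is the only explanatory variable, the interval can be approximated from the cross tabulation counts. The Python sketch below assumes a 95% Wald interval, with the standard error of the log odds ratio taken as the square root of the summed reciprocals of the four cell counts. It is a hand calculation for comparison, not the SPSS procedure itself.

```python
import math

# Turnout counts by gender from the cross tabulation: (voted, did not vote).
male = (1346, 491)
female = (1729, 587)

# Log of the relative odds (female vs male), i.e. the coefficient B.
b = math.log((female[0] / female[1]) / (male[0] / male[1]))

# Standard error of B for a 2 x 2 table: square root of the summed
# reciprocals of the four cell counts.
se = math.sqrt(sum(1 / count for count in male + female))

z = 1.96  # normal quantile for a 95% interval
lower, upper = math.exp(b - z * se), math.exp(b + z * se)
print(f"exp(B) = {math.exp(b):.3f}, 95% CI = {lower:.3f} to {upper:.3f}")
```

The interval is built on the log odds scale and then exponentiated, which is why it is not symmetric around exp(B).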
We will make use of the ‘block’ procedure to add age to the model, so that we can
see both the effect of adding age alone on the -2 log likelihood as well as seeing
how a model which includes both age and gender might explain variations in
turnout.
Block 2: Method = Enter
The addition of age to the model has, as a single variable, reduced the -2 log
likelihood by 300.666 on 1 degree of freedom. The model, which now contains two
explanatory variables, gender and age, has collectively reduced the -2 log
likelihood by 301.684, but we can see that it is age that has the explanatory
power, and gender is not adding anything extra.
The model which includes gender and age explains between 7% and 10% of the
variation in turnout.
The model is now:
The age coefficient is statistically significant. Exp(B) for age is 1.038, which
means that for each additional year of age the odds of turning out to vote are
multiplied by 1.038, having allowed for gender in the model. E.g. the odds for a
21 year old are 1.038 times the odds for a 20 year old. This might not seem much
of a difference, but a 20 year difference multiplies the odds by 1.038^20 = 2.11.
E.g. the odds of a 40 year old turning out to vote are 2.11 times those of a
20 year old, having allowed for gender in the model.
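Because the model is linear on the log odds scale, the effect of an age gap of several years is Exp(B) raised to the power of the gap. A short Python sketch, with a function name chosen here for illustration:

```python
exp_b_age = 1.038  # Exp(B) for age, as reported in the text

def odds_multiplier(years):
    """Factor by which the odds of voting change for an age gap in years."""
    return exp_b_age ** years

for years in (1, 10, 20):
    print(f"{years:>2} years: odds multiplied by {odds_multiplier(years):.2f}")
```

The multiplier applies to the odds, not to the probability, so the change in the probability of voting depends on the starting point.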
Click on Continue and then OK to run the model again.
Block 0: Beginning Block
Block 1: Method = Enter
Block 2: Method = Enter
Block 3: Method = Enter
We can immediately see that tenure reduces the -2 log likelihood by 174.499
having added 2 new variables (tenure has 3 categories in all so we need 2 dummy
variables). Tenure is statistically significant in this model.
The table above shows us that the estimated model is now:
in other words
tenure(1), which contrasts ‘rents’ with ‘owns’, has an exp(B) of 0.349, which means
that the odds of turning out for a person who rents are only 0.349 times (i.e.
much lower than) the odds for a person who owns their own property, having
allowed for gender and age. If we calculate the inverse of exp(B) here, i.e.
1/0.349 = 2.87, we can say that the odds of voting for a person who owns their
own home are 2.87 times the odds for someone who rents, having allowed for
gender and age.
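Taking the inverse of exp(B) is the same as changing the sign of B, which is what happens when the reference category is swapped. A minimal Python check, using only the value 0.349 reported in the text (the variable names are illustrative):

```python
import math

exp_b_rents = 0.349            # Exp(B) for tenure(1): rents vs owns
b_rents = math.log(exp_b_rents)

# Swapping the reference category flips the sign of B,
# which is the same as taking the reciprocal of Exp(B).
exp_b_owns = math.exp(-b_rents)
print(f"owns vs rents: {exp_b_owns:.2f}")
```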
4) Summary and further comments
Summary
We have seen how logistic regression analysis may be used to analyse tabular
data where one of the dimensions of the table is an outcome of interest. This
morning, we looked at some examples where we calculated the probabilities, odds
and relative odds from the table, and we have seen how we can also calculate
these (and get the same results) from the model parameter estimates. Some
theory was introduced and we saw how the logistic model framework is a good
way to investigate associations in multi-way tables where one of the dimensions
of the tables is an outcome of interest, with two categories.
We have seen how we can use SPSS to fit logistic regression models to data using
an example based on the 2005 UK election. We covered main effects models and
models with interactions and we went through the output that SPSS gives us,
including the classification table, the deviance, the model coefficients and other
useful measures such as exp(B), which gives the relative odds or odds ratio for a
particular explanatory variable, given the other explanatory variables in the
model.
Further comments
The term ‘generalised linear model’ is used to describe a procedure for
transforming the dependent variable so that the ‘right hand side’ of the model
equation can be interpreted as a ‘linear combination’ of the explanatory variables:
When the response variable has several categories we can use a model that allows
for several categories in the response variable, such as multinomial regression. If
this response variable is ordinal (as opposed to nominal) we can allow for this in
the modelling (see Agresti; reference details in reading list). An alternative is to
recode the response variable into just two categories and do a logistic regression
analysis, or to fit several logistic regression models to different pairs of categories
in the response variable, although this is not as statistically efficient as doing a
true multinomial analysis.
Note also that logistic regression models can be fitted with multilevel
components in MLwiN and STATA.
Reading list
Field, A. (2005) Discovering statistics using SPSS for Windows: advanced
techniques for the beginner. London: Sage. (Chapter 6)
Plewis, I. (1997) Statistics in Education. Edward Arnold. (Especially chapter 5)
Dobson, A. (2001) An introduction to generalized linear models (second
edition). Chapman and Hall.
McCullagh, P. and Nelder, J.A. (1989) Generalized linear models (second
edition). Chapman and Hall.
Agresti, A. (1996) Introduction to categorical data analysis. John Wiley.