Further Poisson Regression (AS05)
Course: PG Diploma/MSc Epidemiology
(AS05)
EPM304 Advanced Statistical Methods in Epidemiology
This document contains a copy of the study material located within the computer
assisted learning (CAL) session.
If you have any questions regarding this document or your course, please contact
DLsupport via DLsupport@lshtm.ac.uk.
Important note: this document does not replace the CAL material found on your
module CDROM. When studying this session, please ensure you work through the
CDROM material first. This document can then be used for revision purposes to
refer back to specific sessions.
These study materials have been prepared by the London School of Hygiene & Tropical Medicine as part of
the PG Diploma/MSc Epidemiology distance learning course. This material is not licensed either for resale
or further copying.
London School of Hygiene & Tropical Medicine September 2013 v1.0
To discuss the Poisson distribution, review Poisson regression and learn about
overdispersion within a Poisson model.
Objectives
By the end of this session you will be able to:
This session should take you between 1.5 and 2 hours to complete.
SM02
SM05
SM11
Poisson distribution
Sampling Distribution
The distribution of sample estimates obtained when many random samples are
generated independently from the same population.
Interaction: Hyperlink: parameter (pop up box appears):
Parameter
A parameter is a measurement from a population that characterises one of its
features.
Incorrect Response:
Sorry, that's not right. You can see from the graph that the greatest number of
notifications in a week was 8.
Interaction: Calculation: Mean, =
You will see how to do this calculation next, but first consider the plot below.
What is the probability of 2 registrations in a week? Give your answer to 2 decimal
places.
Prob (2 registrations/week) =
Interaction: Calculation: Prob (2 registrations/week) =____:
Correct Response 0.14 or 0.15
Correct
Yes, the plot shows that the probability of 2 registrations in a week is somewhere
between 0.14 and 0.15.
Incorrect Response:
No, the probability of 2 registrations in a week is shown as the height of the vertical
line at 2 events on the x-axis. This line reaches to a point just below a probability of
0.15, so your answer should be either 0.14 or 0.15.
$$\text{Probability}(r \text{ occurrences}) = \frac{e^{-\mu}\,\mu^{r}}{r!}$$
where:
e is the mathematical constant 2.71828
μ (mu) is the mean number of events
r is the number of occurrences
r! is the factorial of r.
Interaction: Hyperlink: factorial (pop up box appears):
Factorial
The factorial of an integer n is written n! and is calculated
as 1 x 2 x 3 x ... x n.
For example, the factorial of 5 is:
5! = 1 x 2 x 3 x 4 x 5 = 120
Let's work this out for the registration of breast cancer cases.
The mean number of new cases of breast cancer registered each week is 4.0. Using
this number with the formula below, you can calculate the probability of 0, 1, 2, 3...
new registrations in a week.
For example, to calculate the probability of 2 registrations in a week for a mean of
4:
$$\text{Probability}(2 \text{ registrations}) = \frac{e^{-4} \times 4^{2}}{2!} = \frac{e^{-4} \times 16}{2 \times 1} = 0.1465$$
Does this agree with the probability you read from the plot? Go back and check.
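As a cross-check on the hand calculation, the formula can be evaluated directly. The sketch below is illustrative only: the function name `poisson_prob` and the variable names are assumptions made for this example, and the mean of 4.0 is the value given in the text.

```python
import math

def poisson_prob(r, mu):
    """Probability of exactly r events when the mean number of events is mu."""
    return math.exp(-mu) * mu ** r / math.factorial(r)

# Mean of 4.0 new breast cancer registrations per week, as given in the text.
MEAN_REGISTRATIONS = 4.0

# Probabilities for 0 to 9 registrations in a week, rounded to 4 decimal places.
probabilities = {r: round(poisson_prob(r, MEAN_REGISTRATIONS), 4) for r in range(10)}

for r, p in probabilities.items():
    print(f"{r} registrations: {p:.4f}")
```

The value printed for 2 registrations should match the 0.1465 obtained by hand.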
Remember, $\text{prob}(r) = \dfrac{e^{-\mu}\,\mu^{r}}{r!}$
Number of registrations    Probability
0                          0.0183
1                          0.0733
2                          0.1465
3                          0.1954
4                          0.1954
5                          0.1563
6                          0.1042
7                          ___
8                          0.0298
9                          ___

Prob(6 registrations) = $(e^{-4} \times 4^{6}) / 6!$ = 0.1042
Interaction: Button: 0.0298 (Number of registrations = 8):
Prob(8 registrations) = $(e^{-4} \times 4^{8}) / 8!$
= $65536\,e^{-4} / 40320$
= 0.0298
Interaction: Calculation: Number of registrations = 7, Probability = ___:
Correct Response 0.0595:
That's right, the probability of 7 registrations per week is:
Prob(7 registrations) = $(e^{-4} \times 4^{7}) / 7!$
= $16384\,e^{-4} / 5040$
= 0.0595
Incorrect Response:
Sorry, that's not correct. Remember that the mean, mu = 4, so for r = 7, you use
the formula as follows:
Prob(7 registrations) = $(e^{-\mu} \times \mu^{r}) / r!$
= $(e^{-4} \times 4^{7}) / 7!$
= $16384\,e^{-4} / 5040$
= 0.0595
Incorrect Response:
Sorry, that's not correct. Remember that the mean, mu = 4, so for r = 9, you use
the formula as follows:
Prob(9 registrations) = $(e^{-4} \times 4^{9}) / 9!$
= $262144\,e^{-4} / 362880$
= 0.0132
The mean and variance are often used to summarize a distribution. For a Poisson
distribution the variance is equal to the mean, so it can be described by a single
parameter.
Interaction: Button: More (text appears on bottom LHS):
So, if the variance is equal to the mean, the standard error equals the square root of
the mean, $\sqrt{\mu}$.
This standard error can be used to obtain a confidence interval for the mean.
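The equality of mean and variance can be checked numerically. The sketch below sums over the Poisson probabilities for a mean of 4 (the breast cancer example); the helper name `poisson_prob` and the cut-off of 60 events are assumptions made for this illustration.

```python
import math

def poisson_prob(r, mu):
    """Probability of exactly r events when the mean number of events is mu."""
    return math.exp(-mu) * mu ** r / math.factorial(r)

mu = 4.0
# Sum over 0..59 events; probabilities beyond that are negligible when mu = 4.
counts = range(60)

mean = sum(r * poisson_prob(r, mu) for r in counts)
variance = sum((r - mean) ** 2 * poisson_prob(r, mu) for r in counts)
standard_error = math.sqrt(mean)

print(f"mean = {mean:.4f}, variance = {variance:.4f}, square root of mean = {standard_error:.4f}")
```

Both the mean and the variance come out at 4, so the square root of the mean is 2.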
Sorry, that's not right. The total number of infections during the study, r, was 73,
and the total person-time of follow-up was 868 person-years. So, using the formula:
$$\text{rate} = \frac{\text{number of events}}{\text{person-time at risk}} = \frac{73}{868} = 0.084$$
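The same arithmetic as a short sketch, using the figures quoted above (73 infections, 868 person-years); the variable names are chosen for this example only.

```python
events = 73          # total number of infections during the study
person_years = 868   # total person-time of follow-up

rate = events / person_years
print(f"{rate:.3f} per person-year, or {100 * rate:.1f} per 100 person-years")
```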
$$\log(\text{rate}) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i$$
Click below to compare this model to two other models you are familiar with, namely
the logistic model and the Cox model.
Interaction: Button: Logistic:
Logistic model
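$$\log\left(\frac{\pi}{1-\pi}\right) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i$$
where $\pi$ is the probability (risk) of the outcome.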
Microfilariae per mg    95% confidence limits
0                       7.161     20.415
0.1-10                  7.500     18.893
10.1-50                 21.981    44.444
> 50                    16.510    66.015
Poisson model for the effect of microfilarial load on optic nerve disease

            Rate ratio    SE       z         P > |z|    95% confidence limits
Mfload 1    0.985         0.351    -0.044    0.965      0.490    1.979
Mfload 2    2.585         0.832    2.950     0.003      1.375    4.859
Mfload 3    2.731         1.210    2.266     0.023      1.146    6.509

Log likelihood = 275.95294
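The columns of this output are linked. The sketch below recovers z and the confidence limits for Mfload 2 from its rate ratio and standard error. It assumes the reported SE is on the rate ratio scale and is converted to the log scale by dividing by the rate ratio (the delta method), which is how some packages report it; the session does not state this, and the function name `wald_summary` is hypothetical.

```python
import math

def wald_summary(rate_ratio, se_rate_ratio, z_crit=1.96):
    """Return z and approximate 95% limits, working on the log rate ratio scale."""
    # Assumption: SE(log RR) = SE(RR) / RR (delta method).
    se_log = se_rate_ratio / rate_ratio
    log_rr = math.log(rate_ratio)
    z = log_rr / se_log
    lower = math.exp(log_rr - z_crit * se_log)
    upper = math.exp(log_rr + z_crit * se_log)
    return z, lower, upper

# Mfload 2 from the table: rate ratio 2.585, SE 0.832.
z, lower, upper = wald_summary(2.585, 0.832)
print(f"z = {z:.3f}, 95% limits = {lower:.3f} to {upper:.3f}")
```

Under that assumption the results agree with the tabulated z of 2.950 and limits of 1.375 and 4.859 to within rounding.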
To assess confounding, you fit two models. One is a model with the exposure AND
the potential confounder, the other is a model with the exposure but NOT the
potential confounder. You then compare the rate ratios for the exposure in the
model that includes the potential confounder, with the rate ratios for the exposure in
the model that does not include the potential confounder.
If the rate ratios for the exposure differ considerably between the two models
(e.g. they change by around 10% or more), then we can say there is confounding.
If there is confounding, we must use the model that includes the confounder when
estimating the effect of the exposure variable.
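The 10% guide can be written as a small check. This is a sketch only: the rate ratios are hypothetical, the function names are invented for this example, and measuring the change relative to the adjusted rate ratio is an assumption, since the text does not say which model is the reference.

```python
def percent_change(crude_rr, adjusted_rr):
    """Percentage change in the rate ratio, relative to the adjusted estimate (assumption)."""
    return 100 * abs(crude_rr - adjusted_rr) / adjusted_rr

def suggests_confounding(crude_rr, adjusted_rr, threshold=10.0):
    """Apply the rule of thumb: a change of around 10% or more suggests confounding."""
    return percent_change(crude_rr, adjusted_rr) >= threshold

# Hypothetical rate ratios from models without and with the potential confounder.
print(suggests_confounding(crude_rr=2.0, adjusted_rr=1.6))   # change of about 25%
print(suggests_confounding(crude_rr=2.0, adjusted_rr=1.95))  # change of about 2.6%
```

The threshold is a guide rather than a formal test, so borderline changes still call for judgement.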
Interaction: Hyperlink: (review this):
Window showing SM12 appears, section 5
Interaction: Hyperlink: Test for interaction (and the proportional rates
assumption) (card appears on RHS):
Test for interaction
(Click here to review this from SM10.)
The simple Poisson model assumes that the effect of the exposure is constant over
the follow-up period. We can test this assumption by splitting the follow-up time into
several categories (time bands), and then seeing if there is evidence that the effect
of the exposure varies over time.
For example, if the follow-up period is 10 years we could split the follow-up period
into 2 time bands, the first 5 years of follow-up and the second 5 years of follow-up.
We could then investigate whether there is evidence that the effect of the exposure
is different in the first 5 years (0-5 years) from the second 5 years (6-10 years) of
follow-up. We would do this by fitting an interaction between the exposure variable
and timeband, and using a likelihood ratio test (LRT) to see if the interaction was
statistically significant (comparing the model with the interaction term to the one
without).
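The likelihood ratio test described above can be sketched as follows. The log likelihoods are hypothetical, and the example assumes a binary exposure and two time bands, so that the interaction adds a single parameter (1 degree of freedom). For 1 degree of freedom the chi-squared tail probability can be written with the complementary error function, which keeps the sketch within the standard library.

```python
import math

def likelihood_ratio_test_1df(loglik_without, loglik_with):
    """LRT comparing nested models that differ by one parameter."""
    # The likelihood ratio statistic is twice the gain in log likelihood.
    lrs = max(0.0, 2 * (loglik_with - loglik_without))
    # For a chi-squared distribution on 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(lrs / 2))
    return lrs, p_value

# Hypothetical log likelihoods for the models without and with the interaction term.
lrs, p_value = likelihood_ratio_test_1df(loglik_without=-280.0, loglik_with=-277.5)
print(f"LRS = {lrs:.2f}, p = {p_value:.3f}")
```

With more than two time bands or exposure levels the interaction has more degrees of freedom, and the 1 df shortcut used here no longer applies.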
Alternatively, we might think that the effect of the exposure varied with calendar
time. Suppose that individuals were followed up over the period 1980-1989. We
could investigate whether the effect of the exposure was different during 1980-1984
from the period 1985-1989.
Or we might think that the effect of the exposure varied with an individual's age. You
will see how to investigate this in Practical 5.
Interaction: Hyperlink: review this:
Window showing SM10, section 5.
Interaction: Hyperlink: Look for a linear effect for increasing (or decreasing)
exposures (card appears on RHS):
Linear effect
Section 6: Overdispersion
The assumptions underlying the Poisson model are that all events are independent,
and that the event rates are similar within categories of the covariates.
In some situations these assumptions are invalid, i.e. where for some reason an event
may be more likely to happen in certain individuals. For examples, click below:
Recurrent events
Clustered observations
6.1: Overdispersion
For such data, the variance of the distribution of events will be greater than that
predicted by a Poisson model. This is known as overdispersion. For example,
clusters of individuals can have different rates to each other; overall this will give a
wider spread of rates.
Sometimes you might not be aware of clustering in a dataset, because the
characteristic that distinguishes the clustering has not been measured.
Overdispersion can occur because of extra variability, which is not completely
explained by the covariates in the model.
6.2: Overdispersion
You can account for the extra variability by assuming that individual rates depend on
the covariates that you would normally put in the Poisson model plus an unobserved
component specific to each subject. This component is known as the frailty.
The frailty component varies between individuals. It is a measure of how some
individuals may be more prone to infection than others. This may be because of
the community they live in or their disease history.
Therefore, the model can be written:
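A sketch of one standard way to write such a model, assuming the frailty enters additively on the log-rate scale. The symbols $\alpha$, $\beta$ and $f_j$ are notation assumed here rather than taken from the session:

$$\log(\text{rate}_j) = \alpha + \beta_1 X_{1j} + \beta_2 X_{2j} + \dots + \beta_i X_{ij} + f_j$$

Here $f_j$ is the frailty of subject $j$. Setting every $f_j$ to zero gives back the ordinary Poisson model.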
6.3: Overdispersion
Statistically we can incorporate a second distribution to model the frailty component.
When the Poisson distribution is expanded in this way, with the frailty assumed to
follow a gamma distribution, the resulting model is called the negative binomial
model. This is fitted to the data in the same way a Poisson model is fitted.
Regression methods can be used to estimate the usual exposure effects accounting
for the frailty term.
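A small simulation can show why frailty produces overdispersion. This is a sketch under stated assumptions: a common mean of 4 events per subject, a multiplicative gamma frailty with mean 1 (shape 2, scale 0.5), and a simple Poisson sampler (Knuth's method) written for this example. None of these values come from the session.

```python
import math
import random
import statistics

def sample_poisson(mean, rng):
    """Draw one Poisson count using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(2013)   # fixed seed so the run is reproducible
BASE_MEAN = 4.0             # assumed mean number of events per subject
N_SUBJECTS = 20000

# Every subject shares the same mean: a pure Poisson process.
poisson_counts = [sample_poisson(BASE_MEAN, rng) for _ in range(N_SUBJECTS)]

# Each subject's mean is multiplied by an individual gamma frailty with mean 1.
frailty_counts = [
    sample_poisson(BASE_MEAN * rng.gammavariate(2.0, 0.5), rng)
    for _ in range(N_SUBJECTS)
]

for label, counts in [("Poisson", poisson_counts), ("With frailty", frailty_counts)]:
    print(f"{label}: mean = {statistics.mean(counts):.2f}, "
          f"variance = {statistics.variance(counts):.2f}")
```

Both sets of counts have a mean close to 4, but the variance of the frailty counts is well above the mean, which is the signature of overdispersion.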
Negative binomial model
6.4: Overdispersion
To illustrate the difference in estimates obtained from a Poisson model and a negative
binomial model, you will now consider a study of recurrent infections. The data are
from a trial of a pneumococcal vaccine carried out in Papua New Guinea. The trial
involved schoolchildren, and the occurrence of infection was recorded over 2 years.
Which of the calculations opposite do you think gives the best estimate for the rate
of infection in the population of schoolchildren?
The number of children who had
an infection during the two-year
period, divided by the total
children-years.
The number of infections during
the two-year period, divided by the
total children-years.
Interaction: Hotspot: The number of children who had an infection during the two
year period, divided by the total children-years.
Incorrect Response (pop up box appears):
No, remember each child could have a number of infections within the two-year
period. So, to measure the rate of infection in the population, it is more relevant to
consider the number of infections.
Interaction: Hotspot: The number of infections during the two-year period, divided
by the total children-years.
6.5: Overdispersion
The table opposite gives the distribution of recurring infections and infection rates by
the number of previous infections observed in the Papua New Guinea schoolchildren
during the 2-year period. Click 'swap' beneath the table to see a plot of these data.
Then answer the questions below.
1. Why might overdispersion occur with
these data?
Interaction: Button: clouds picture (pop up box appears):
Overdispersion would occur if children who have had previous infection(s) are more
prone to infection than children who have not, i.e. if re-infection is more probable
with an increasing number of previous infections.
2. Is there any evidence of overdispersion from the table and plot?
Interaction: Button: clouds picture (pop up box appears):
You can see from the data that the infection rate per 100 person-years increases
with an increasing number of previous infections. This is an indication that recurring
infections produce variability between individuals that is likely to cause
overdispersion.
Distribution of recurrent infections in children followed for 2 years

Number of previous    Number of children      Infection rate per
infections            with new infections     100 person-years
0                     702                     51
1                     444                     82
2                     261                     102
3                     147                     135
4                     77                      141
5                     34                      107
6                     25                      252
7                     11                      172
8                     6                       217
9                     3                       189
10                    2                       451
6.6: Overdispersion
The estimates from the Poisson model for the effect of vaccination are shown below.
Click 'swap' to see the corresponding estimates from the negative binomial model.
How do the estimates compare?
Interaction: Button: clouds picture (pop up box appears):
Both models give similar estimates for the effect of vaccination. However, the
negative binomial model, which accounts for the frailty component, gives larger
standard errors, i.e. it allows for the extra variation due to overdispersion.
Because the Poisson model underestimates the standard error, inference from it is
misleading: it shows an apparently stronger association between vaccination and
infection rates.
Poisson model for pneumococcal vaccine trial in Papua New Guinea

               Rate ratio    Standard error    z         P > |z|    95% confidence limits
Vaccination    0.9038        0.0383            -2.389    0.017      0.8318    0.9820
This model assumes no child is more prone to infection than another, i.e. all children
have the same probability of infection.
You could conclude that vaccination has a significant effect in reducing the infection
rate by almost 10%.
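The P value and the "almost 10%" reduction can be reproduced from the tabulated rate ratio and z statistic. The sketch assumes a two-sided test based on the standard normal distribution; the function name `two_sided_p` is chosen for this example.

```python
import math

def two_sided_p(z):
    """Two-sided P value for a z statistic under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

rate_ratio = 0.9038   # vaccination rate ratio from the Poisson model
z_magnitude = 2.389   # size of the z statistic from the same model

p_value = two_sided_p(z_magnitude)
percent_reduction = 100 * (1 - rate_ratio)

print(f"P = {p_value:.3f}, reduction in infection rate = {percent_reduction:.1f}%")
```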
Interaction: Button: Swap (table and text changes to the following):
6.7: Overdispersion
The distribution of the observed number of infections, and the expected number of
infections based on the Poisson distribution, can be seen below. The expected values
are calculated assuming the rate of infection does not depend on the number of
previous infections. So the rate is assumed to be the same for each of the 11 groups
of children (0-10 previous infections), and is equal to the overall rate for the study
population (1712 infections / 2376 person-years = 72 per 100 person-years).
Notice how the number of new infections among children with high numbers of
previous infections is much greater than that expected from the Poisson distribution.
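The overall rate quoted above can be checked against the observed counts and group-specific rates tabulated in section 6.5. The sketch below uses those figures; the variable names are assumptions made for this example.

```python
# Observed new infections by number of previous infections (0 to 10).
observed = [702, 444, 261, 147, 77, 34, 25, 11, 6, 3, 2]
# Group-specific infection rates per 100 person-years, from the same table.
group_rates = [51, 82, 102, 135, 141, 107, 252, 172, 217, 189, 451]

total_infections = sum(observed)
person_years = 2376
overall_rate_per_100 = 100 * total_infections / person_years

# Groups whose rate exceeds the single overall rate assumed by the Poisson model.
above_overall = [k for k, rate in enumerate(group_rates) if rate > overall_rate_per_100]

print(f"{total_infections} infections, {overall_rate_per_100:.0f} per 100 person-years")
print("Groups above the overall rate (previous infections):", above_overall)
```

Every group except children with no previous infections has a rate above the overall figure, which is why a single common rate fits these data poorly.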
Infections in Papua New Guinea schoolchildren followed for 2 years

No. of previous    No. of children         Expected number of children with new
infections         with new infections     infections, if no effect of previous infections
0                  702                     982
1                  444                     390
2                  261                     184
3                  147                     78
4                  77                      39
5                  34                      23
6                  25                      7
7                  11                      5
8                  6                       2
9                  3                       1
10                 2                       0.3
6.8: Overdispersion
There are two ways to check for overdispersion within your data:
Interaction: Tabs: Examine Rates:
You can subjectively examine the infection rates within clusters. If you observe that
the rates vary between clusters of the data, this is most likely an indication of
overdispersion.
Interaction: Tabs: Model deviance:
To formally check for overdispersion, you can compare the model that allows for
differences between clusters with a model that does not.
So for this example, you would fit the model including "number of previous
infections" and compare it to the model without this variable, using a likelihood ratio
test. If the test is significant, then you have evidence of overdispersion. For this data
set, the likelihood ratio statistic (LRS) is highly statistically significant (p<0.001). To
quantify the overdispersion, you can calculate the ratio of the LRS to its degrees of
freedom. In this example, you obtain a dispersion parameter of 251/10 = 25.1,
which is far higher than the value of 1 that you would expect if there were no
overdispersion.
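The dispersion calculation and the accompanying P value can be sketched as below, using the likelihood ratio statistic of 251 on 10 degrees of freedom from the text. For an even number of degrees of freedom the chi-squared tail probability has an exact closed form, which is used here to stay within the standard library; the function name is an assumption for this example.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(chi-squared on df degrees of freedom exceeds x), valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2
    # Closed form: exp(-x/2) * sum over j < df/2 of (x/2)^j / j!
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

lrs = 251   # likelihood ratio statistic from the text
df = 10     # its degrees of freedom

dispersion = lrs / df
p_value = chi2_upper_tail_even_df(lrs, df)

print(f"dispersion parameter = {dispersion}, p = {p_value:.2e}")
```

The P value is far below 0.001, in line with the text, and the dispersion parameter of 25.1 is far above 1.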
Assessing and dealing with overdispersion can be complex, but it is a problem you
should be aware of, as you may come across it in a cohort analysis.
A simple guide is to remember to:
1. consider the potential for variation within your data
2. formally assess the validity of a Poisson model
3. if there is evidence of overdispersion, use a negative binomial model
(maybe with the help of a statistician).
Section 7: Summary
This is the end of AS05. When you are happy with the material covered here please move
on to session AS06.
The main points of this session will appear below as you click on the relevant title.
Poisson regression
Poisson regression models are used when the outcome measure is a rate.
$$\text{Rate} = \frac{\text{number of events}}{\text{person-time at risk}}$$
Overdispersion
A Poisson model assumes that all events are independent and event rates are similar
within categories of the explanatory variables.
If this is not true, the variance of events will be greater than that predicted by the
model. This is known as overdispersion.
Overdispersion should always be assessed when fitting a Poisson model, and if it is
present a more complex model (such as the negative binomial model) is required.