
Further Poisson regression

(AS05)
EPM304 Advanced Statistical Methods in Epidemiology

Course: PG Diploma/ MSc Epidemiology

This document contains a copy of the study material located within the computer
assisted learning (CAL) session.
If you have any questions regarding this document or your course, please contact
DLsupport via DLsupport@lshtm.ac.uk.
Important note: this document does not replace the CAL material found on your
module CDROM. When studying this session, please ensure you work through the
CDROM material first. This document can then be used for revision purposes to
refer back to specific sessions.
These study materials have been prepared by the London School of Hygiene & Tropical Medicine as part of
the PG Diploma/MSc Epidemiology distance learning course. This material is not licensed either for resale
or further copying.
London School of Hygiene & Tropical Medicine September 2013 v1.0

Section 1: Further Poisson regression


Aim

To discuss the Poisson distribution, review Poisson regression and learn about
overdispersion within a Poisson model.

Objectives
By the end of this session you will be able to:

describe the Poisson distribution


calculate Poisson probabilities
summarise the procedures involved in developing a Poisson model
explain overdispersion in a Poisson model
explain how to deal with overdispersion

This session should take you between 1.5 and 2 hours to complete.

Section 2: Planning your study


This session will look at the Poisson distribution and Poisson regression in more
detail.
We will first consider the Poisson distribution, then review the main issues of model
development for a Poisson regression model, and finally discuss what to do if Poisson
regression does not provide the most appropriate model for your data.
To work through this session you should be familiar with the analysis of cohort
studies and regression modelling.
Students who completed EPM202 can refer to the appropriate sessions listed below.
Occasional students should refer to their preferred text.
Cohort studies (SM02)
Likelihood (SM05)
Introduction to Poisson and Cox regression (SM11)

2.1: Planning your study


To illustrate the methods discussed in this session three examples are used. Click on
each of the studies listed below for further details.
Study 1: Weekly registrations of new breast cancer cases within a region
Study 2: A study of optic nerve disease carried out in Nigeria
Study 3: A pneumococcal vaccine trial based in Papua New Guinea
Interaction: Hyperlink: Study 1: (card appears on right hand side):
Study 1
In the UK, all new cases of breast cancer are reported to the regional cancer registry
via a GP or a hospital. The cases are reported on a weekly basis.
Interaction: Hyperlink: Study 2: (card appears on right hand side):
Study 2
Individuals living in an area of rural northern Nigeria that is endemic for
onchocerciasis formed the placebo arm of a trial of ivermectin. All individuals were
screened for optic nerve disease (OND) at the start of the trial. Only individuals
without optic nerve disease at baseline were included. Individuals were re-examined
at about 2 and 3 years post-baseline to identify incident cases of optic nerve disease.
Interaction: Hyperlink: Study 3: (card appears on right hand side):
Study 3
A trial was carried out in Papua New Guinea to assess the effect of a pneumococcal
vaccine on morbidity from acute lower respiratory tract infections in schoolchildren.
Information was collected on all children in the study and throughout the trial each
episode of infection was recorded.

Section 3: The Poisson distribution


You should already be familiar with some statistical distributions, i.e. the Normal and
Binomial distributions (covered in sessions SC04 and SC05).
During parts of the EPP course we have applied the Poisson distribution but not fully
discussed the distribution itself. On this page we will look more closely at this
distribution.
Poisson distribution

3.1: The Poisson distribution


The Poisson distribution is named after the French mathematician Siméon Denis Poisson
(1781-1840).
Let's first think about the type of data it can be applied to, i.e. data in discrete
counts.
Shown below are some examples of instances in which you would count the number
of occurrences of an event over a period of time.
Examples
1. The weekly number of new cases of breast cancer notified to the regional cancer
registry.
2. The number of deaths that occur in a cohort of civil servants in a year.
3. The number of cases of salmonella reported in a regional health authority within a
week.

3.2: The Poisson distribution


So, the Poisson distribution is appropriate for describing:
The number of occurrences of an event during a period of time.
...but only when the events are independent of each other and occur at random.

Poisson distribution

3.3: The Poisson distribution


The simplest way to describe the Poisson distribution is to consider the average rate
at which events occur over time.
For example, imagine you were told that the weekly number of new cases of breast
cancer reported to the regional register was 4 on average. Does this mean that every
week you should expect 4 new cases of breast cancer to be reported?
Interaction: Button: clouds picture (pop up box appears):
It is highly unlikely that you would get exactly 4 new cases of breast cancer reported
each week, but the weekly average over a few weeks should be 4. In one week there
might be no cases reported and the next week there might be more than 6 cases
reported.

3.4: The Poisson distribution


A Poisson distribution describes the sampling distribution of the number of
occurrences, r, of an event during a period of time.
It depends on only one parameter:
μ = the mean number of occurrences in periods of the same length.
This is shown below.
Interaction: Hyperlink: sampling distribution (pop up box appears):

Sampling Distribution
The distribution of sample estimates obtained when many random samples are
generated independently from the same population.
Interaction: Hyperlink: parameter (pop up box appears):
Parameter
A parameter is a measurement from a population that characterises one of its
features.

3.5: The Poisson distribution


Now, the plot below shows the distribution of the weekly notifications of breast
cancer to a regional register over 19 weeks.
What are the values of the following parameters? (You may wish to go back to the
previous card. To work out μ you need to do a calculation that takes into account the
frequency of each value of r.)
Maximum value of r observed
Mean, μ =

Interaction: Calculation: Maximum value of r =____:


Correct Response 8:
Correct
Yes, the greatest number of notifications in a week was 8.

Incorrect Response:
Sorry, that's not right. You can see from the graph that the greatest number of
notifications in a week was 8.
Interaction: Calculation: Mean, μ =____:

Correct Response 4.2:

That's right, the mean is given by the total number of notifications over the 19-week
period (80), divided by the number of weeks (19).
Therefore the mean is 80 / 19 = 4.2.
Incorrect Response:
No, that's not right. To find the mean, you first need to calculate the total number of
notifications over the 19-week period. This is given by:
(1x1) + (2x2) + (4x3) + (5x4) + (3x5) + (1x6) + (2x7) + (1x8) = 80
Therefore the mean is 80 / 19 = 4.2.
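The same calculation can be sketched in Python. This is only an illustration: it assumes each term in the sum above reads as (number of weeks) x (notifications per week), which is consistent with the 19-week total, and the variable names are invented for the example.

```python
# Number of weeks in which each weekly count of notifications was observed.
# Keys are notifications per week (r); values are the number of weeks.
# Assumption: each term "(a x b)" in the sum is (weeks x notifications).
weeks_by_count = {1: 1, 2: 2, 3: 4, 4: 5, 5: 3, 6: 1, 7: 2, 8: 1}

total_weeks = sum(weeks_by_count.values())
total_notifications = sum(r * weeks for r, weeks in weeks_by_count.items())
mean_per_week = total_notifications / total_weeks

print("Weeks observed:", total_weeks)
print("Total notifications:", total_notifications)
print("Mean notifications per week:", round(mean_per_week, 1))
```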

3.6: The Poisson distribution


Let's go back to the theoretical Poisson distribution. If you assume the weekly
registrations are independent and follow a Poisson distribution, you can calculate the
probability that a specific number of cases will occur in a week.

You will see how to do this calculation next, but first consider the plot below.
What is the probability of 2 registrations in a week? Give your answer to 2 decimal
places.
Prob (2 registrations/week) =
Interaction: Calculation: Prob (2 registrations/week) =____:
Correct Response 0.14 or 0.15
Correct
Yes, the plot shows that the probability of 2 registrations in a week is somewhere
between 0.14 and 0.15.
Incorrect Response:
No, the probability of 2 registrations in a week is shown as the height of the vertical
line at 2 events on the x-axis. This line reaches to a point just below a probability of
0.15, so your answer should be either 0.14 or 0.15.

3.7: The Poisson distribution


The probability of r occurrences can be calculated from the Poisson formula shown
below.

$$\text{Probability}(r \text{ occurrences}) = \frac{e^{-\mu}\,\mu^{r}}{r!}$$

where:
e is the mathematical constant 2.71828...
μ is the mean number of events
r is the number of occurrences
r! is the factorial of r.
Interaction: Hyperlink: factorial (pop up box appears):
Factorial
The factorial of an integer n is written n! and is calculated
as 1 x 2 x 3 x ... x n.
For example, the factorial of 5 is:
5! = 1 x 2 x 3 x 4 x 5 = 120
Let's work this out for the registration of breast cancer cases.
The mean number of new cases of breast cancer registered each week is 4.0. Using
this number with the Poisson formula, you can calculate the probability of 0, 1, 2, 3...
new registrations in a week.
For example, to calculate the probability of 2 registrations in a week for a mean of
4:

$$\text{Probability}(2 \text{ registrations}) = \frac{e^{-4} \times 4^{2}}{2!} = \frac{e^{-4} \times 16}{2 \times 1} = 0.1465$$

Does this agree with the probability you read from the plot? Go back and check.
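As a check on the hand calculation, the Poisson formula can be written as a short Python function. This is a minimal sketch using only the standard library; the function name `poisson_prob` is chosen for the example and is not part of the course material.

```python
import math


def poisson_prob(r, mu):
    """Probability of exactly r occurrences when the mean is mu."""
    return math.exp(-mu) * mu ** r / math.factorial(r)


# Probability of 2 registrations in a week when the weekly mean is 4
prob_two = poisson_prob(2, 4)
print(round(prob_two, 4))  # 0.1465
```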

3.8: The Poisson distribution


The table below shows the probabilities for μ = 4. Click on each cell in the
probability column to see how it was calculated. Then calculate the missing
probabilities, giving your answers to 4 decimal places.

Remember, $\text{prob}(r) = \dfrac{e^{-\mu}\,\mu^{r}}{r!}$

Probability of r breast cancer registrations per week

Number of registrations (r)    Probability
0                              0.0183
1                              0.0733
2                              0.1465
3                              0.1954
4                              0.1954
5                              0.1563
6                              0.1042
7                              ____
8                              0.0298
9                              ____

Interaction: Button: 0.0183 (Number of registrations = 0):
Prob(0 registrations) = $(e^{-4} \times 4^{0}) / 0!$
= $e^{-4} / 1$
= 0.0183

Interaction: Button: 0.0733 (Number of registrations = 1):
Prob(1 registration) = $(e^{-4} \times 4^{1}) / 1!$
= $4e^{-4} / 1$
= 0.0733

Interaction: Button: 0.1465 (Number of registrations = 2):
Prob(2 registrations) = $(e^{-4} \times 4^{2}) / 2!$
= $16e^{-4} / 2$
= 0.1465

Interaction: Button: 0.1954 (Number of registrations = 3):
Prob(3 registrations) = $(e^{-4} \times 4^{3}) / 3!$
= $64e^{-4} / 6$
= 0.1954

Interaction: Button: 0.1954 (Number of registrations = 4):
Prob(4 registrations) = $(e^{-4} \times 4^{4}) / 4!$
= $256e^{-4} / 24$
= 0.1954

Interaction: Button: 0.1563 (Number of registrations = 5):
Prob(5 registrations) = $(e^{-4} \times 4^{5}) / 5!$
= $1024e^{-4} / 120$
= 0.1563

Interaction: Button: 0.1042 (Number of registrations = 6):
Prob(6 registrations) = $(e^{-4} \times 4^{6}) / 6!$
= $4096e^{-4} / 720$
= 0.1042

Interaction: Button: 0.0298 (Number of registrations = 8):
Prob(8 registrations) = $(e^{-4} \times 4^{8}) / 8!$
= $65536e^{-4} / 40320$
= 0.0298
Interaction: Calculation: Number of registrations = 7, Probability = ___:
Correct Response 0.0595:
That's right, the probability of 7 registrations per week is:
Prob(7 registrations) = $(e^{-4} \times 4^{7}) / 7!$
= $16384e^{-4} / 5040$
= 0.0595
Incorrect Response:
Sorry, that's not correct. Remember that the mean, μ = 4, so for r = 7, you use
the formula as follows:
Prob(7 registrations) = $(e^{-\mu} \times \mu^{r}) / r!$
= $(e^{-4} \times 4^{7}) / 7!$
= $16384e^{-4} / 5040$
= 0.0595

Interaction: Calculation: Number of registrations = 9, Probability = ___:

Correct Response 0.0132:
That's right, the probability of 9 registrations per week is:
Prob(9 registrations) = $(e^{-4} \times 4^{9}) / 9!$
= $262144e^{-4} / 362880$
= 0.0132

Incorrect Response:
Sorry, that's not correct. Remember that the mean, μ = 4, so for r = 9, you use
the formula as follows:
Prob(9 registrations) = $(e^{-4} \times 4^{9}) / 9!$
= $262144e^{-4} / 362880$
= 0.0132

(When Probability (7 registrations) is answered, the following text and interaction
appear on the LHS. An extra row with a calculation also appears in the table.)

Now, can you calculate the probability of 10 or more registrations, to 4 decimal
places? Make sure you have entered the correct values in rows 7 and 9 before you do
this.
Interaction: Button: Hint:
Hint
Remember that probabilities should add up to 1.
Number of registrations (r)    Probability
0                              0.0183
1                              0.0733
2                              0.1465
3                              0.1954
4                              0.1954
5                              0.1563
6                              0.1042
7                              ____
8                              0.0298
9                              ____
10+                            ____

Interaction: Calculation: Number of registrations = 10+, probability = ___:


Correct Response 0.0081:
Correct
That's right,
Prob(10+) = 1 - [prob(0) + prob(1) + ... + prob(9)]
= 1 - 0.9919
= 0.0081
Incorrect Response:
Sorry, that's not right. Remember that the probabilities must add up to 1 in total, so
by summing the probabilities for rows 0 to 9, and subtracting that total from 1, you
get the probability that there are 10 or more registrations in a week.
Prob(10+) = 1 - [prob(0) + prob(1) + ... + prob(9)]
= 1 - 0.9919
= 0.0081
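The whole table, including the 10+ row, can be reproduced with a few lines of Python. This is a sketch for checking your answers, assuming μ = 4 as in the table; the names `poisson_prob`, `probs` and `prob_ten_or_more` are illustrative.

```python
import math


def poisson_prob(r, mu):
    """Probability of exactly r occurrences when the mean is mu."""
    return math.exp(-mu) * mu ** r / math.factorial(r)


mu = 4
# Probabilities for r = 0, 1, ..., 9
probs = {r: poisson_prob(r, mu) for r in range(10)}
for r, p in probs.items():
    print(r, round(p, 4))

# Probabilities over all possible counts sum to 1, so the probability of
# 10 or more is 1 minus the sum of the probabilities for 0 to 9.
prob_ten_or_more = 1 - sum(probs.values())
print("10+", round(prob_ten_or_more, 4))
```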

3.9: The Poisson distribution

The mean and variance are often used to summarize a distribution. For a Poisson
distribution the variance is equal to the mean, so it can be described by a single
parameter.
Interaction: Button: More (text appears on bottom LHS):
So, if the variance is equal to the mean, the standard error equals the square root of
the mean, √μ.
This standard error can be used to obtain a confidence interval for the mean.

3.10: The Poisson distribution


The Poisson distribution for the weekly registrations of breast cancer cases is shown
below.
Which of the boxes below best describes this distribution?
Left skewed
Symmetrical
Right skewed
Interaction: Hotspot: Left skewed:
Incorrect Response (pop up box appears):
Left skewed
Sorry, a left skewed distribution has a low probability of a small number of events
and a high probability of a large number of events. That is not the case for this
distribution. Please try again.
Interaction: Hotspot: Symmetrical:
Incorrect Response (pop up box appears):
Symmetrical
No, one tail of the distribution is much longer than the other tail, therefore the
distribution is not symmetrical. Please try again.
Interaction: Hotspot: Right skewed:
Correct Response (pop up box appears):
Correct
The Poisson distribution is skewed to the right since the probability of a small
number of events is high and the probability of a large number of events is low.

3.11: The Poisson distribution


Use the drop-down menu beneath the plot below to view the shape of the Poisson
distribution for different values of the mean.
Consider the plots. Which distribution can we use as an approximation to the Poisson
when the mean is large?
Interaction: Button: clouds picture (pop up box appears):
For large means, the shape of the Poisson distribution is approximately symmetrical.
In such cases, the distribution can be approximated by the Normal distribution.

Interaction: Pulldown: μ = 3:

Interaction: Pulldown: μ = 5:

Interaction: Button: μ = 10:
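One way to see the distribution becoming more symmetrical without the plots is to compute its skewness for the three means above. The sketch below does this numerically from the Poisson probabilities. It relies on the standard result that the skewness of a Poisson distribution is 1/√μ, which is not stated in the session, and it truncates the sum at r = 100, which is an assumption that is harmless for means this small.

```python
import math


def poisson_prob(r, mu):
    """Probability of exactly r occurrences when the mean is mu."""
    return math.exp(-mu) * mu ** r / math.factorial(r)


def poisson_skewness(mu, max_r=100):
    """Skewness computed from the probabilities for r = 0..max_r."""
    probs = [poisson_prob(r, mu) for r in range(max_r + 1)]
    mean = sum(r * p for r, p in enumerate(probs))
    variance = sum((r - mean) ** 2 * p for r, p in enumerate(probs))
    third_moment = sum((r - mean) ** 3 * p for r, p in enumerate(probs))
    return third_moment / variance ** 1.5


for mu in (3, 5, 10):
    # Skewness shrinks towards 0 (a symmetrical shape) as the mean grows
    print(mu, round(poisson_skewness(mu), 3), round(1 / math.sqrt(mu), 3))
```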

Section 4: The Poisson distribution and rates


In epidemiology we are more interested in the rate of occurrence of an outcome
rather than counts.

The mean rate, λ, of the occurrence of an event is given by:

λ = r / t
where r is the number of events in a time period t.
So for the Poisson distribution what is the standard error of this rate?
Interaction: Button: clouds picture (text appears on the bottom RHS):
The Poisson distribution is described by one parameter, the mean, which is equal to
the variance.
So the standard error of the count r is estimated by √r, and the standard error of
the rate will be √r / t = √(λ / t).

4.1: The Poisson distribution and rates


Usually the follow-up time differs for each individual, and in calculating the rate we
therefore divide by the person-time.
So, the mean rate is given by:
λ = r / person-time
Example
A study was carried out in a poor area of Mexico, in which 652 children aged less
than 5 years were followed for a period of up to 2 years. During the study 73 cases
of lower respiratory infection were recorded. The total child-years of follow-up was
868.
Calculate the overall (incidence) rate of lower respiratory infection estimated from
this study and enter your answer in the box below.
Rate =
Interaction: Calculation: Rate =____:
Correct Response 0.084:
Correct
Yes, the incidence rate is given by the number of infections divided by the total
person-time:
λ = 73 / 868 = 0.084
Incorrect Response:

Sorry, that's not right. The total number of infections during the study, r, was 73,
and the total person-time of follow-up was 868 person-years. So, using the formula:
λ = 73 / 868 = 0.084

4.2: The Poisson distribution and rates


Such rates are usually expressed per 100 or per 1000 person-years.
For example, the incidence rate of lower respiratory infection in the Mexico study
was 0.084 x 100 = 8.4 per 100 child-years.
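The Mexico calculation can be sketched in Python as follows. The standard error line is an addition for illustration: it assumes the number of events follows a Poisson distribution, so that the standard error of the count is estimated by its square root. Variable names are invented for the example.

```python
import math

events = 73          # cases of lower respiratory infection
child_years = 868    # total child-years of follow-up

rate = events / child_years       # rate per child-year
rate_per_100 = rate * 100         # rate per 100 child-years

# Assumption: events are Poisson, so SE(count) is about sqrt(count)
# and SE(rate) is about sqrt(count) / person-time.
se_rate_per_100 = math.sqrt(events) / child_years * 100

print("Rate per child-year:", round(rate, 3))
print("Rate per 100 child-years:", round(rate_per_100, 1))
print("Approximate SE per 100 child-years:", round(se_rate_per_100, 2))
```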

Section 5: Poisson regression model


A Poisson regression model is appropriate for the analysis of cohort data where the
outcome measure is a rate. The information essential for analysis is:
1. Time of entry to the study
2. Whether the event of interest occurred
3. Time of exit / end of follow-up
This allows the calculation of rates, based on the information for the whole cohort.
Rate = (number of events) / (person-time at risk)

Interaction: Button: More (card appears on RHS):


Using Poisson regression we can model the log rates for the exposure(s) of interest
and adjust for many variables simultaneously.
The general format of a Poisson model may be written as:

$$\log \lambda = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i$$
Click below to compare this model to two other models you are familiar with, namely
the logistic model and the Cox model.
Interaction: Button: Logistic:
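Logistic model

$$\log\left(\frac{\pi}{1 - \pi}\right) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i$$

Interaction: Button: Cox:

$$\log \lambda(t; X) = \log \lambda(t; 0) + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i$$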

5.1: Poisson regression model


As with all regression models, a Poisson model can:
Account for interaction
Model a linear effect
Adjust for many confounders
Simultaneously model different exposure types

5.2: Poisson regression model


You will now investigate the application of the Poisson model using data from a study
of optic nerve disease carried out in Nigeria.
Individuals living in an onchocercal area in Nigeria were followed over a 3-year
period to study the incidence of optic nerve disease (OND). The exposure of interest
was onchocerciasis which was measured by microfilarial load.
The categories for this are shown opposite.
Interaction: Hyperlink: onchocerciasis (pop up box appears):
Onchocerciasis
(Also known as river blindness)
A disease affecting the skin and eyes, caused by the filarial worm Onchocerca
volvulus. The disease is transmitted by blackflies and has been endemic in large
areas of West Africa.

Interaction: Hyperlink: microfilarial (pop up box appears):


Microfilariae
Offspring of the female worm Onchocerca volvulus, which migrate through the skin
causing onchocerciasis.
Category of microfilarial load (microfilariae per mg)
0
0.1 to 10
10.1 to 50
> 50

5.3: Poisson regression model


The rates of optic nerve disease for each category of microfilarial load are calculated
by dividing the number of OND cases by the total person-time in each category of
microfilarial load.
Examine the rates shown below. Do you think there is a relationship between optic
nerve disease and microfilarial load?
Rates of optic nerve disease by microfilarial load per 1000 person-years

Microfilarial load    D     Y (/1000)    Rate      95% confidence limits
0 mg                  14    1.158        12.091    7.161 to 20.415
0.1 to 10 mg          18    1.512        11.903    7.500 to 18.893
10.1 to 50 mg         31    0.992        31.256    21.981 to 44.444
> 50 mg               8     0.242        33.014    16.510 to 66.015

Rate = counts (D) / person-time (Y)

Person-time is the total follow-up time for all people in each group
Interaction: Button: clouds picture (pop up box appears and part of table is
highlighted):
The rates of optic nerve disease appear to increase with a higher microfilarial load,
from 12.1 per 1000 person years in the baseline microfilarial load group to 33.0 per
1000 person years in individuals with the highest microfilarial load.
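The rates and confidence limits in the table can be approximated with the sketch below. It assumes the limits were obtained on the log scale, by multiplying and dividing the rate by the error factor exp(1.96 / √D). The session does not state the method, but this choice reproduces the tabulated limits closely. Small differences from the table arise because Y is given here only to three decimal places.

```python
import math

# (label, cases D, person-time Y in thousands of person-years), from the table
groups = [
    ("0", 14, 1.158),
    ("0.1 to 10", 18, 1.512),
    ("10.1 to 50", 31, 0.992),
    ("> 50", 8, 0.242),
]


def rate_with_ci(d, y, z=1.96):
    """Rate and approximate 95% limits using a log-scale error factor."""
    rate = d / y
    error_factor = math.exp(z / math.sqrt(d))  # assumed method, see lead-in
    return rate, rate / error_factor, rate * error_factor


for label, d, y in groups:
    rate, lower, upper = rate_with_ci(d, y)
    print(f"{label:>10}: {rate:6.2f} ({lower:.2f} to {upper:.2f}) per 1000 person-years")
```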

5.4: Poisson regression model


To compare the rate of optic nerve disease in individuals with higher levels of
microfilarial load to individuals with no onchocercal infection you can look at the rate
ratios. These are shown below.
What can you conclude from these estimates?
Rate ratios for each level of microfilarial load compared to the baseline level

Microfilarial load    Rate ratio    X²       P > |z|    95% confidence limits
0.1 to 10 mg          0.985         0.002    0.965      0.490 to 1.979
10.1 to 50 mg         2.585         9.370    0.002      1.375 to 4.859
> 50 mg               2.731         5.580    0.018      1.146 to 6.509
Interaction: Button: clouds picture (pop up box appears):
From the estimates in the table, there appears to be no difference in the rates for
the lowest category (0.1 to 10 mg) and the baseline group (P = 0.97). However, the
two highest categories of microfilarial load have significantly higher rates of optic
nerve disease compared to the baseline group (P = 0.002 and P = 0.018).
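A rate ratio and its confidence limits can be sketched from the counts and person-time in each group. The example assumes the usual large-sample standard error of the log rate ratio, √(1/D1 + 1/D0), which is not spelled out in the session but gives values close to those tabulated. The function name is illustrative.

```python
import math


def rate_ratio_with_ci(d1, y1, d0, y0, z=1.96):
    """Rate ratio (group 1 vs baseline 0) with approximate 95% limits."""
    rate_ratio = (d1 / y1) / (d0 / y0)
    se_log_rr = math.sqrt(1 / d1 + 1 / d0)   # assumed large-sample formula
    error_factor = math.exp(z * se_log_rr)
    return rate_ratio, rate_ratio / error_factor, rate_ratio * error_factor


# 10.1 to 50 mg group (D = 31, Y = 0.992) vs baseline (D = 14, Y = 1.158)
rr, lower, upper = rate_ratio_with_ci(31, 0.992, 14, 1.158)
print(f"Rate ratio {rr:.3f} ({lower:.3f} to {upper:.3f})")
```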

5.5: Poisson regression model


By assuming that the rate of OND follows the Poisson distribution, we can model the
rates using a Poisson regression model:
$$\log(\text{rate of OND}) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i$$
where: $X_1$ = Mfload1 = 0.1 mg to 10 mg;
$X_2$ = Mfload2 = 10.1 mg to 50 mg;
$X_3$ = Mfload3 = > 50 mg.

The model estimates are shown below.


Click 'swap' below to see the results from the classical analysis. How do the
estimates from the two analyses compare?
Poisson model for the effect of microfilarial load on optic nerve disease

            Rate ratio    SE       z         P > |z|    95% confidence limits
Mfload 1    0.985         0.351    -0.044    0.965      0.490 to 1.979
Mfload 2    2.585         0.832    2.950     0.003      1.375 to 4.859
Mfload 3    2.731         1.210    2.266     0.023      1.146 to 6.509

Log likelihood = -275.95294

Interaction: Button: clouds picture (pop up box appears):


The results from the simple Poisson model produce exactly the same rate ratios and
confidence intervals as the classical analysis of the association between optic nerve
disease and microfilarial load. However, the P values are slightly different for the 2
methods, because in the regression analysis a Wald test is used, whereas in the
classical analysis a score test is used.
Note that neither analysis is adjusted for potential confounders.
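The Wald test reported by the regression can be sketched by hand for a single exposure level. The example assumes the standard error of the log rate ratio is √(1/D1 + 1/D0) and that the z statistic is referred to a standard Normal distribution. These are standard large-sample results rather than details given in the session, and the function name is illustrative.

```python
import math


def wald_test(rate_ratio, d_exposed, d_baseline):
    """Wald z statistic and two-sided P value for a log rate ratio."""
    se_log_rr = math.sqrt(1 / d_exposed + 1 / d_baseline)
    z = math.log(rate_ratio) / se_log_rr
    # Two-sided tail area of the standard Normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return se_log_rr, z, p_value


# Mfload 2: rate ratio 2.585, with 31 cases vs 14 cases at baseline
se_log_rr, z, p_value = wald_test(2.585, 31, 14)
se_rate_ratio = 2.585 * se_log_rr   # SE on the rate ratio scale
print(f"SE = {se_rate_ratio:.3f}, z = {z:.3f}, P = {p_value:.3f}")
```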

5.6: Poisson regression model


Is there a single estimate in the table that shows whether microfilarial load has an
effect on the rate of optic nerve disease?
Interaction: Button: clouds picture (box appears on RHS):
No, from the table below we can say how each level of microfilarial load compares to
the baseline group but we cannot say what the overall effect of microfilarial load is.
For this you need to calculate a likelihood ratio test.

Poisson model for the effect of microfilarial load on optic nerve disease

            Rate ratio    SE       z         P > |z|    95% confidence limits
Mfload 1    0.985         0.351    -0.044    0.965      0.490 to 1.979
Mfload 2    2.585         0.832    2.950     0.003      1.375 to 4.859
Mfload 3    2.731         1.210    2.266     0.023      1.146 to 6.509

Log likelihood = -275.95294

5.7: Poisson regression model


The likelihood ratio test for the effect of microfilarial load is calculated as follows:
Log likelihood for:
Model with microfilarial load = -275.95
Model without microfilarial load = -284.17
Therefore, the likelihood ratio statistic (LRS):
LRS = 2 x (-275.95 - (-284.17))
= 16.44; P = 0.001
So what can you conclude about the overall effect of microfilarial load on optic nerve
disease?
Interaction: Button: clouds picture (pop up box appears):
P = 0.001, so we can say that microfilarial load has a significant effect on the rate of
optic nerve disease, before adjusting for potential confounders.
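The likelihood ratio calculation can be sketched as follows. The example assumes the statistic is compared with a chi-squared distribution on 3 degrees of freedom, one for each microfilarial load indicator in the model. The session does not state the degrees of freedom explicitly. The tail area uses a closed-form expression valid only for 3 degrees of freedom, so that no statistics library is needed.

```python
import math


def chi2_upper_tail_3df(x):
    """Upper-tail probability of a chi-squared distribution with 3 df."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


loglik_with = -275.95      # model with microfilarial load
loglik_without = -284.17   # model without microfilarial load

lrs = 2 * (loglik_with - loglik_without)
p_value = chi2_upper_tail_3df(lrs)   # assumes 3 degrees of freedom

print(f"LRS = {lrs:.2f}, P = {p_value:.3f}")
```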

5.8: Poisson regression model


Using the same procedures and comparing models you can also:
Assess the effect of confounding variables
Test for interaction (and the proportional rates assumption)
Look for a linear effect for increasing (or decreasing) exposures
Interaction: Hyperlink: Assess the effect of confounding variables (card
appears on RHS):
Confounding variables (review this)

To assess confounding, you fit two models. One is a model with the exposure AND
the potential confounder, the other is a model with the exposure but NOT the
potential confounder. You then compare the rate ratios for the exposure in the
model that includes the potential confounder, with the rate ratios for the exposure in
the model that does not include the potential confounder.
If the rate ratios for the exposure are considerably different between the 2 models
(e.g. they change by around 10% or more between the 2 models), then we can say
there is confounding. If there is confounding, then we must use the model that
includes the confounder when estimating the effect of the exposure variable.
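The 10% rule of thumb can be sketched as a small helper. The rate ratios passed in below are hypothetical values chosen only to show the calculation. They are not results from the optic nerve disease study. Measuring the change relative to the unadjusted estimate is also an assumption, since the session does not say which estimate to use as the reference.

```python
def percent_change(crude_rr, adjusted_rr):
    """Percentage change in the rate ratio after adjustment (relative to crude)."""
    return abs(adjusted_rr - crude_rr) / crude_rr * 100


def suggests_confounding(crude_rr, adjusted_rr, threshold=10.0):
    """True if the rate ratio changes by the threshold percentage or more."""
    return percent_change(crude_rr, adjusted_rr) >= threshold


# Hypothetical rate ratios, for illustration only
print(suggests_confounding(2.0, 1.7))    # change of 15%
print(suggests_confounding(2.0, 1.95))   # change of 2.5%
```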
Interaction: Hyperlink: (review this):
Window showing SM12 appears, section 5
Interaction: Hyperlink: Test for interaction (and the proportional rates
assumption) (card appears on RHS):
Test for interaction
(Click here to review this from SM10.)
The simple Poisson model assumes that the effect of the exposure is constant over
the follow-up period. We can test this assumption by splitting the follow-up time into
several categories (time bands), and then seeing if there is evidence that the effect
of the exposure varies over time.
For example, if the follow-up period is 10 years we could split the follow-up period
into 2 time bands, the first 5 years of follow-up and the second 5 years of follow-up.
We could then investigate whether there is evidence that the effect of the exposure
is different in the first 5 years (0-5 years) from the second 5 years (6-10 years) of
follow-up. We would do this by fitting an interaction between the exposure variable
and timeband, and using a likelihood ratio test (LRT) to see if the interaction was
statistically significant (comparing the model with the interaction term to the one
without).
Alternatively, we might think that the effect of the exposure varied with calendar
time. Suppose that individuals were followed up over the period 1980-1989. We
could investigate whether the effect of the exposure was different during 1980-1984
from the period 1985-1989.
Or we might think that the effect of the exposure varied with an individual's age. You
will see how to investigate this in Practical 5.
Interaction: Hyperlink: review this:
Window showing SM10, section 5.
Interaction: Hyperlink: Look for a linear effect for increasing (or decreasing)
exposures (card appears on RHS):
Linear effect

(Click here to review this from SM10.)

For a variable with increasing or decreasing exposure it is possible to model a linear
effect. The goal in all model analysis should be to produce the simplest model
that describes the data. If a linear effect can be assumed then this reduces the
number of parameters required in a model.
In addition, where a linear effect shows a dose-response relationship, this is further
evidence of a causal relationship.
Interaction: Hyperlink: review this:
Window showing SM10, section 6.

Section 6: Overdispersion
The assumptions underlying the Poisson model are that all events are independent,
and that the event rates are similar within categories of the covariates.
In some situations these assumptions are invalid, e.g. where for some reason an event
may be more likely to happen in certain individuals. For examples, click below:
Recurrent events
Clustered observations

The analysis of such data is covered in more detail in AS09.


Interaction: Hyperlink: covariates (pop up box appears):
Covariates
This is another term for explanatory variables, or independent variables.
Interaction: Hyperlink: Recurrent events (pop up box appears):
Recurrent events
The possibility of recurrent asthma episodes is high for children who have previously
had asthma, i.e. the chance of a recurrent episode is dependent on whether a child
has previously had asthma.
Interaction: Hyperlink: Clustered observations(pop up box appears):
Clustered observations
The chance of contracting an infectious disease may be higher for subjects within a
certain community than for individuals living in another community. This may be due
to some environmental or social factors.

6.1: Overdispersion

For such data, the variance of the distribution of events will be greater than that
predicted by a Poisson model. This is known as overdispersion. For example,
clusters of individuals can have different rates to each other; overall this will give a
wider spread of rates.
Sometimes you might not be aware of clustering in a dataset, because the
characteristic that distinguishes the clustering has not been measured.
Overdispersion can occur because of extra variability, which is not completely
explained by the covariates in the model.

6.2: Overdispersion
You can account for the extra variability by assuming that individual rates depend on
the covariates that you would normally put in the Poisson model plus an unobserved
component specific to each subject. This component is known as the frailty.
The frailty component varies between individuals. It is a measure of how some
individuals may be more prone to infection than others. This may be because of
the community they live in or their disease history.
Therefore, the model can be written:

$$\log \lambda = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i + \text{frailty}$$

6.3: Overdispersion
Statistically we can incorporate a second distribution to model the frailty component.
When the Poisson model is extended in this way, the result is called the
negative binomial model. This is fitted to the data in the same way a Poisson
model is fitted.
Regression methods can be used to estimate the usual exposure effects accounting
for the frailty term.
Negative binomial model

$$\log \lambda = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_i X_i + \text{frailty}$$
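A small simulation can illustrate how a frailty term produces overdispersion. Everything in this sketch is hypothetical: the base rate of 2 events per subject, the 5000 subjects, and the choice of a gamma distribution with mean 1 for the frailty (the session does not name the second distribution). The Poisson sampler is a simple textbook method written out so that only the standard library is needed.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson count by multiplying uniforms until below exp(-mean)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def mean_and_variance(values):
    n = len(values)
    m = sum(values) / n
    return m, sum((x - m) ** 2 for x in values) / (n - 1)


rng = random.Random(1)
n_subjects = 5000     # hypothetical
base_rate = 2.0       # hypothetical expected events per subject

# No frailty: every subject has the same rate
plain = [sample_poisson(base_rate, rng) for _ in range(n_subjects)]
# Frailty: each subject's rate is multiplied by a gamma value (mean 1, variance 1)
frail = [sample_poisson(base_rate * rng.gammavariate(1.0, 1.0), rng)
         for _ in range(n_subjects)]

p_mean, p_var = mean_and_variance(plain)
f_mean, f_var = mean_and_variance(frail)
print("Poisson:      variance / mean =", round(p_var / p_mean, 2))
print("With frailty: variance / mean =", round(f_var / f_mean, 2))
```

Without frailty the variance stays close to the mean, whereas with frailty the variance is clearly larger than the mean, which is the pattern described as overdispersion.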

6.4: Overdispersion
To illustrate the difference in estimates obtained from a Poisson model and negative
binomial model you will now consider a study of recurrent infections. The data is
from a trial of a pneumococcal vaccine carried out in Papua New Guinea. The trial
involved school children, and occurrence of infection was recorded over 2 years.
Which of the calculations opposite do you think gives the best estimate for the rate
of infection in the population of schoolchildren?
The number of children who had
an infection during the two-year
period, divided by the total
children-years.
The number of infections during
the two-year period, divided by the
total children-years.
Interaction: Hotspot: The number of children who had an infection during the two
year period, divided by the total children-years.
Incorrect Response (pop up box appears):
No, remember each child could have a number of infections within the two-year
period. So, to measure the rate of infection in the population, it is more relevant to
consider the number of infections.
Interaction: Hotspot: The number of infections during the two-year period, divided
by the total children-years.

Correct Response (pop up box appears):


Yes, calculating the rate of infection in this group of schoolchildren using the number
of infections gives a better estimate of the burden of infection in the population of
schoolchildren.

6.5: Overdispersion
The table opposite gives the distribution of recurring infections and infection rates by
the number of previous infections observed in the Papua New Guinea schoolchildren
during the 2-year period. Click 'swap' beneath the table to see a plot of these data.
Then answer the questions below.
1. Why might overdispersion occur with
these data?
Interaction: Button: clouds picture (pop up box appears):
Overdispersion would occur if children who have had previous infection(s) are more
prone to infection than children who have not, i.e. if re-infection is more probable
with an increasing number of previous infections.
2. Is there any evidence of overdispersion from the table and plot?
Interaction: Button: clouds picture (pop up box appears):
You can see from the data that the infection rate per 100 person-years increases
with an increasing number of previous infections. This is an indication that recurring
infections produce variability between individuals that is likely to cause
overdispersion.
Distribution of recurrent infections in children followed for 2 years

Number of previous   Number of children     Infection rate per
infections           with new infections    100 person-years
0                    702                    51
1                    444                    82
2                    261                    102
3                    147                    135
4                    77                     141
5                    34                     107
6                    25                     252
7                    11                     172
8                    6                      217
9                    3                      189
10                   2                      451
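One way to read the table is to express each rate relative to the rate in children with no previous infections. The sketch below uses only the rates in the table; the variable names are illustrative.

```python
# Infection rate per 100 person-years, by number of previous infections (0 to 10)
rates = [51, 82, 102, 135, 141, 107, 252, 172, 217, 189, 451]

baseline = rates[0]
rate_ratios = [rate / baseline for rate in rates]

for previous, ratio in enumerate(rate_ratios):
    print(f"{previous:>2} previous infections: rate ratio {ratio:.2f}")
```

Every group with at least one previous infection has a higher rate than the baseline group, although the increase is not perfectly steady because the groups with many previous infections are small.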

Interaction: Button: Swap (table changes to the following):

Plot: Distribution of recurrent infections

6.6: Overdispersion
The estimates from the Poisson model for the effect of vaccination are shown below.
Click 'swap' to see the corresponding estimates from the negative binomial model.
How do the estimates compare?
Interaction: Button: clouds picture (pop up box appears):
Both models give similar estimates for the effect of vaccination. However, the
negative binomial model, which accounts for the frailty component, gives a larger
standard error, i.e. it allows for the extra variation due to overdispersion.
Because the Poisson model underestimates the standard error, inference from this
model is misleading: it shows an apparently stronger association between vaccination
and infection rates.
Poisson model for pneumococcal vaccine trial in Papua New Guinea

              Rate ratio   Standard error   z         P > |z|   95% confidence limits
Vaccination   0.9038       0.0383           -2.389    0.017     0.8318   0.9820

This model assumes no child is more prone to infection than another, i.e. all children
have the same probability of infection.

You could conclude that vaccination has a significant effect in reducing the infection
rate by almost 10%.
Interaction: Button: Swap (table and text changes to the following):

Negative binomial model for pneumococcal vaccine trial in Papua New Guinea

              Rate ratio   Standard error   z         P > |z|   95% confidence limits
Vaccination   0.8845       0.05683          -1.910    0.056     0.7788   1.003
This model allows children who have previously been infected to be more likely to
be infected again than children who have not, i.e. it does not assume that
observations are independent.
Vaccination appears to reduce the infection rate, but the effect is not statistically
significant at the 5% level.
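The link between the standard error and the conclusion can be checked from the published rate ratios and confidence limits. The sketch below assumes the limits are Wald limits calculated on the log scale with a multiplier of 1.96; the function name is illustrative and the results agree with the tables only up to rounding.

```python
import math

def wald_summary(rate_ratio, ci_lower, ci_upper):
    """Recover the log-scale standard error, z statistic and two-sided p-value."""
    se_log = (math.log(ci_upper) - math.log(ci_lower)) / (2 * 1.96)
    z = math.log(rate_ratio) / se_log
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value from the normal distribution
    return se_log, z, p

poisson = wald_summary(0.9038, 0.8318, 0.9820)
neg_binomial = wald_summary(0.8845, 0.7788, 1.003)

for name, (se_log, z, p) in [("Poisson", poisson), ("Negative binomial", neg_binomial)]:
    print(f"{name}: SE(log rate ratio) = {se_log:.4f}, z = {z:.2f}, p = {p:.3f}")
```

The rate ratios are similar, but the standard error of the log rate ratio rises from roughly 0.042 to roughly 0.065, which is enough to move the p-value from below 0.05 to just above it.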

6.7: Overdispersion
The distribution of the observed number of infections, and the expected number of
infections based on the Poisson distribution, can be seen below. The expected values
are calculated assuming the rate of infection does not depend on the number of
previous infections. So the rate is assumed to be the same for each of the 11 groups
of children (0-10 previous infections), and is equal to the overall rate for the study
population (1712 infections / 2376 person-years = 72 per 100 person-years).
Notice how the number of new infections among children with high numbers of
previous infections is much greater than that expected from the Poisson distribution.
Infections in Papua New Guinea schoolchildren followed for 2 years

No. of previous   No. of children        Expected number of children with new
infections        with new infections    infections, if no effect of previous infections
0                 702                    982
1                 444                    390
2                 261                    184
3                 147                    78
4                 77                     39
5                 34                     23
6                 25                     7
7                 11                     5
8                 6                      2
9                 3                      1
10                2                      0.3
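The comparison can be made explicit with a few lines of arithmetic. In this sketch the observed counts are taken from the table in 6.5 (they sum to the 1712 infections quoted above), and the expected counts are copied from the table in this section rather than recalculated.

```python
# Observed counts (table in 6.5) and expected counts (table above), groups 0 to 10
observed = [702, 444, 261, 147, 77, 34, 25, 11, 6, 3, 2]
expected = [982, 390, 184, 78, 39, 23, 7, 5, 2, 1, 0.3]

# Overall rate used for the expected values, per 100 person-years
overall_rate = 1712 / 2376 * 100
print(f"Overall rate: {overall_rate:.0f} per 100 person-years")

ratios = [obs / exp for obs, exp in zip(observed, expected)]
for previous, (obs, exp, ratio) in enumerate(zip(observed, expected, ratios)):
    print(f"{previous:>2} previous: observed {obs:>4}, expected {exp:>5}, ratio {ratio:.2f}")
```

The ratio of observed to expected is below 1 only for children with no previous infections and above 1 for every other group, which is the pattern described in the text.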

6.8: Overdispersion
There are two ways to check for overdispersion within your data:
Interaction: Tabs: Examine Rates:
You can informally examine the infection rates within clusters. If you observe
variation in rates within clusters of the data, this is most likely an indication of
overdispersion.
Interaction: Tabs: Model deviance:
To formally check for overdispersion, you can compare the model that allows for
differences between clusters with a model that does not.
So for this example, you would fit the model including "number of previous
infections" and compare it with the model without this variable, using a likelihood
ratio test. If the test is significant, you have evidence of overdispersion. For this
data set, the likelihood ratio statistic (LRS) is highly statistically significant
(p<0.001). To quantify the overdispersion, you can calculate the ratio of the LRS to
its degrees of freedom (here 10, one fewer than the 11 groups). In this example, you
obtain a dispersion parameter of 251/10 = 25.1, which is far higher than the value of
1 that you would expect if there were no overdispersion.
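The calculation can be reproduced with the standard library alone. The sketch below takes the LRS of 251 and its 10 degrees of freedom from the text; the helper function is illustrative and uses a closed form for the chi-squared upper tail that is valid only when the degrees of freedom are even.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-squared probability; this closed form holds only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires positive, even degrees of freedom")
    half = x / 2.0
    terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms

lrs = 251.0  # likelihood ratio statistic quoted in the text
df = 10      # degrees of freedom quoted in the text

dispersion = lrs / df
p_value = chi2_sf_even_df(lrs, df)

print(f"Dispersion parameter: {dispersion:.1f}")
print(f"p-value below 0.001:  {p_value < 0.001}")
```

For odd degrees of freedom this shortcut does not apply, and a statistical package would be needed for the p-value.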
Assessing and dealing with overdispersion can be complex, but it is a problem you
should be aware of, as you may come across it in a cohort analysis.
A simple guide is to remember to:
1. consider the potential for variation within your data
2. formally assess the validity of a Poisson model
3. if there is evidence of overdispersion, use a negative binomial model
(perhaps with the help of a statistician).

Section 7: Summary
This is the end of AS05. When you are happy with the material covered here please move
on to session AS06.
The main points of this session will appear below as you click on the relevant title.
Poisson regression
Poisson regression models are used when the outcome measure is a rate.
Rate = number of events / person-time at risk

It is the log rate that is modelled.
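Written out, modelling the log rate is equivalent to modelling the log of the expected number of events with the log of the person-time as a fixed offset. The notation below is an assumption made for illustration: $d$ is the number of events, $T$ the person-time at risk, and $\alpha$ and $\beta_1, \dots, \beta_i$ the regression parameters.

```latex
\log\left(\frac{\mathrm{E}[d]}{T}\right) = \alpha + \beta_1 X_1 + \dots + \beta_i X_i
\quad\Longleftrightarrow\quad
\log\left(\mathrm{E}[d]\right) = \log(T) + \alpha + \beta_1 X_1 + \dots + \beta_i X_i
```

Exponentiating a parameter $\beta_j$ gives the rate ratio for a one-unit increase in $X_j$.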


Using a Poisson regression model
As with all regression models, using a Poisson model you can:

Estimate the effect of many exposure variables at the same time
Adjust for potential confounding variables
Examine for effect modification
Estimate the effect of a quantitative exposure

Overdispersion
A Poisson model assumes that all events are independent and event rates are similar
within categories of the explanatory variables.
If this is not true, the variance of events will be greater than that predicted by the
model. This is known as overdispersion.
The presence of overdispersion should always be assessed in a Poisson model,
and if it is present a more complex model (such as the negative binomial model) is
required.
