Professional Documents
Culture Documents
Basic Models For Disease Occurrence in Epidemiology
Basic Models For Disease Occurrence in Epidemiology
1
O International Epidemiological Association 1995 Printed in Groat Britain
LEADING ARTICLE
Flanders W D (Emory University School of Public Health , Division of Epidemiology, 1599 Clifton Rd, NE, Atlanta,
Assessment of disease occurrence is one of the epidemio- activities, for example, the epidemiologist may wish to
logist's most basic tasks. To perform this task, the epi- assess whether the observed increase in rates is likely
demiologist typically uses either of two basic measures: to indicate a real increase in morbidity or is consistent
cumulative incidence, which reflects the average risk of with random variation. In aetiological studies, the epi-
disease in a population over a specified time, or inci- demiologist may wish to assess if the difference in ob-
dence rate, which reflects the average rate of disease served rates between two subgroups is likely to reflect a
occurrence per unit of person time. Each measure has real difference or is compatible with chance.
well-developed, accepted methods for estimation.1"5 This important step—accounting for variability in
Although these basic measures allow the epidemiolo- disease occurrence—generally involves use of a statisti-
gist to describe disease occurrence in a population, dis- cal distribution to model the variability. Distributions
ease occurrence is not fixed, but varies from place to commonly used for this purpose include the binomial,
place, from time to time, and from group to group. This the Poisson and the exponential.1"3 Although applica-
variability reflects differences in biological phenomena, tion of these distributions is described separately in
genetic differences, environmental and social differ- standard texts and papers, we are unaware of any text,
ences, differences in medical care, and random and review or commentary in which epidemiological appli-
poorly understood phenomena. Sampling variability, cation of these distributions is compared, contrasted
when one selects subjects from a larger universe, also and integrated.
contributes. This review has four purposes. In the first section, we
In many applications, the epidemiologist will need to review the binomial, Poisson and exponential distribu-
account for this inherent variability, perhaps by calcu- tions, highlighting key assumptions and indicating the
lating confidence intervals or p-values. In surveillance relationship of model parameters with basic measures
of epidemiology. In the second section, we discuss
Emory University School of Public Health, Division of Epidemiology, limitations of these three models. In the third section,
1599 Clifton Rd, NE, Atlanta, GA 30329, USA. we indicate the inter-relationships between these three
INTERNATIONAL JOURNAL OF EPIDEMIOLOGY
basic statistical approaches. And finally, we discuss the TABLE 1 Data from follow-up study of Doll and Hill,1 summarized
relationship of these models for disease occurrence to using counts
some of the standard methods of assessing disease-
exposure associations. Smokers Non-smokers Total
where N is the number of people at risk, X is the num- Example I. Consider the follow-up study of smoking
ber of 'cases', and p is the probability of disease for among British physicians reported by Doll and Hill7 in
each person. which each subject, initially disease-free, was followed
We can justify the use of the binomial distribution by for 4 years 5 months to detect subsequent occurrence of
assuming that: 1) disease occurs independently in dif- disease. In Table 1, we summarize data separately for
ferent people; 2) disease occurs with probability p in the smoking and non-smoking cohorts. As just noted,
each person. This homogeneity of risks can also hold if we can model variability of the number of deaths (cases)
the disease probability for each person can be modelled in each cohort (smoking and non-smoking) by using bi-
as arising randomly by independent selection from an nomial distributions, if independence and homogeneity
underlying distribution of risks. In this situation, each of risks are reasonable assumptions. Based on the bino-
person's risk is: mial distribution, the maximum likelihood estimate of
the cumulative incidence among the smokers is 0.0551
with associated standard deviation estimate of 0.00097.
Since the sample is large, the maximum likelihood
estimator of cumulative incidence has an approximate
where p, denotes the ith disease probability, and P(p,) Gaussian distribution, so that an approximate 95%
the probability of selecting this disease probability. On confidence interval is 0.0532 to 0.0570. Similarly,
the other hand, use of the binomial may not be justified the estimated cumulative incidence among the non-
if the length of follow-up varies since risks may then be smokers is 0.0286 with confidence interval from 0.0243
heterogeneous—longer follow-up should associate with to 0.0329.
BASIC MODELS FOR DISEASE OCCURRENCE
Poisson Distribution TABLE 2 Data from follow-up study of Doll and Hill,1 summarized
Epidemiologists frequently use the Poisson distribution using person-time
to analyse data from follow-up studies when the sum-
mary data involve counts of cases. Typical applications Smokers Non-smokers
include studies of cancer, cardiovascular disease, other
chronic diseases, and mortality. Data required for analy- Disease 1582 166
sis consist of the total number of cases and the person- Person-years 123 436 25 250
time of follow-up for each subgroup of interest. Use of
the Poisson distribution to model variability in counts
of cases is probably reasonable if we can assume that confidence interval is 1504 to 1660. Similarly, the MLE
the disease occurs independently in different people of the mean number of deaths among non-smokers is
and in the same person at different points in time, 166 with confidence interval, 141 to 191. Dividing by
that the likelihood that a new case will occur in a short the appropriate number of person-years gives an esti-
TABLE 3 Data from follow-up study of Doll and Hill,1 summarized of two subgroups, one with annual risk p, and size n,,
using (average) individual follow-up time the other with annual risk p2 and size n2. The average
annual cumulative incidence, ignoring subgroups is
Number of Average years Smoking Disease p. = (n^, + n2p2)/(n, + n2), and the associated year-to-
subjects of follow-up status year variance is Cn,p,(l—p,) + n 2 p 2 (l-p 2 ))/(n, + n2)2,
each person"
assuming the composition of the county changes negli-
gibly from year to year. The simple variance estimate
27 116 4.42 yes no based on the average risk, however, is p.(l-p.)/(n,+n 2 ),
1582 2.32 yes yes
an overestimate of the actual variance.
5630 4.42 no no
166 2.32 no yes Biased estimation of the variance can also result
from correlation of outcomes between people (lack of
independence). To exemplify this bias, consider the
"Calculated from data presented by Doll and Hill by assuming that
the average follow-up of subjects who died was the same among year-to-year variation in estimated risk of a communic-
agreement of analytic results for these two distribu- TABLE 4 Summary of incidence rate estimators, basic distributions
tions. Moreover, if the number of deaths in a given pop-
ulation has a Poisson distribution, the time until the Distribution Maximum likelihood Variance estimator
first death and the time between deaths is known to fol- estimator for incidence rate
low an exponential distribution, further illustrating the
close relationship between these two distributions.6'12 Exponential A/PT A/PT2
The binomial distribution is closely related to these Poisson A/PT APT2
two distributions, a correspondence which is particular- Binomial -Ind-A/NVt- A/PT* A/PT2
ly easy to see for rare disease. As noted above, the bi-
nomial distribution leads to p = A/N as the estimated * Approximately, assuming rare disease
cumulative incidence. We can convert this to an esti-
mate of the incidence rate, assuming the rate to be con- In illustrating the relationships of the binomial to the
stant, by using the usual1'2 relationship between risk exponential and Poisson distributions, we used a rare
cumulative incidence odds among the exposed divided distributions, the conditioning arguments cited above
by that among the unexposed (p,/( 1 —p,)) + ( p / O - p ^ ) . lead to use of the hypergeometric distribution to
To focus on this measure, statisticians sometimes treat estimate the odds ratio. The estimated odds ratio
the marginal totals of the summary 2 x 2 table as comparing smokers to non-smokers is 1.98 with 95%
fixed.12 This statistical device, sometimes called a con- confidence limits from 1.68 to 2.33. Since the odds
ditional analysis, eliminates the 'nuisance parameters' ratio approximates the risk ratio for rare disease, this
and leads us to use of the (non-central) hypergeometric odds ratio is nearly the same as the estimated risk
distribution which involves only the parameter of inter- among smokers divided by that among non-smokers:
est, the odds ratio. The conditional argument which 0.0551 +0.0286= 1.93.
leads to use of the hypergeometric distribution, howev- Alternatively, if we use person-time data (Table 2)
er, is based on the underlying assumption that the num- and assume Poisson distributions for the numbers of
ber of exposed cases and the number of unexposed deaths of smokers and non-smokers, conditional argu-
cases each have a binomial distribution. The hypergeo- ments lead to use of the binomial distribution for esti-
5
expanded to incorporate covariates that reflect potential Breslow N E, Day N E. Statistical Methods in Cancer Research.
confounding or effect modification. In particular, even Vol. II—The Design and Analysis of Cohort Studies. Lyon:
though the three distributions are not mathematically International Agency for Research on Cancer, 1987.
6
Hogg R V, Craig A T. Introduction to Mathematical Statistics.
equivalent, they do belong to a family of distributions
3rd Edn. London: Macmillan, 1970.
called the 'exponential family'. Statisticians have de- 7
Doll R, Hill A B. Lung cancer and other causes of death in rela-
veloped a generalized theory of regression modelling tion to smoking. Br Med J 1956; 5001: 1071-81.
which is based on this exponential family in which 'Parzen E. Stochastic Processes. San Francisco: Holden-Day,
these three basic distributions arise as special cases. We 1962.
9
Longini I M, Koopman J S, Haber M, Cotsonis G. Statistical in-
can use this approach, called generalized linear models,
ference for infectious diseases. Am J Epidemiol 1988; 128:
to study disease occurrence by specifying an appro- 845-59.
10
priate member of the exponential family, such as the Greenland S. Randomization, statistics, and causal inference.
binomial distribution, to model variability.14 Related Epidemiology 1990; 1: 421-29.
methods of analysis based on quasi-likelihood approach- "Cox D R, Snell E J. Analysis of Binary Data. New York:
Chapman and Hall, 1989.
es or generalized estimating equations, 14 ' 15 those based