PGCert/PGDip/MSc Public Health

PHM205 Environmental Epidemiology


Frequently asked questions from students


Below are some common questions that have been raised in the web conferencing sessions.

Cluster investigations
Can you clarify what is meant by a post hoc study and why p-values are not
appropriate for assessing their significance?

The concept of a ‘post hoc’ study is difficult and somewhat counter-intuitive.


It refers to the fact that, in many cases, apparent clusters are only being investigated because
someone has observed that something appears abnormal. However, it could be (and often is) just
an aggregation of cases from the random variation in disease occurrence. The problem is that
you have no real basis to judge how unusual this aggregation is because there is effectively an
infinite number of ways of defining the boundaries of the cluster -- by circles of varying location
and size, by time period, by disease group, by age etc.
If you had a random set of dots on the page, by judicious selection, you could doubtless draw a
circle somewhere that circumscribed an above-average number of dots. This would appear to be
a cluster. There isn't really a cluster in the sense of some underlying cause of disease, because
the distribution of dots is random. But by adjusting the location/size of the circle, it appears to
contain an unusually high number of cases.
This is what often happens in real life, perhaps without anyone realizing it. Someone notices several cases
of disease in a small area and wonders if it is a cluster. But they didn't take note of the many other
small areas where there weren't a lot of cases. Such areas don't look unusual and therefore aren't
investigated.
There is therefore a difficulty in judging how unusual any particular observation is. No amount of
statistical testing will give you a true picture. And it turns out that in 99 cases out of a hundred,
the investigation of such apparent clusters never reveals an underlying cause. The assumption is
that most of these clusters are simply random fluctuations of disease.
This is quite different from the circumstance in which a hypothesis about disease in a
specific locality is set out before you have any knowledge of what the data show. Then you can
test it by conventional means.
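
The 'random dots' argument can be tried out with a small simulation. The sketch below is only an illustration in Python (not part of the module materials): it scatters cases completely at random over a square map, divides the map into equal cells instead of circles for simplicity, and then picks out the busiest cell after looking at the data. The number of cases, the grid size and the seed are all arbitrary assumptions.

```python
import random

def busiest_cell(n_cases=200, grid=10, seed=1):
    """Scatter cases uniformly at random and return (max cell count, expected count)."""
    rng = random.Random(seed)              # fixed seed so the run is repeatable
    counts = [0] * (grid * grid)           # one counter per map cell
    for _ in range(n_cases):
        col = int(rng.random() * grid)     # random x position -> column index
        row = int(rng.random() * grid)     # random y position -> row index
        counts[row * grid + col] += 1
    expected = n_cases / (grid * grid)     # average number of cases per cell
    return max(counts), expected

busiest, expected = busiest_cell()
print(f"expected per cell: {expected:.1f}, busiest cell: {busiest}")
```

The busiest cell can never hold fewer cases than the average, and with purely random data it will usually hold noticeably more. Selecting that cell after the event and testing it as if it had been chosen in advance is the post hoc problem described above.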

PHM205 2022-23
Can you explain the observed and expected numbers in Table 1.3 (p12 in e-Book
Environmental Epidemiology)?

The numbers calculated in Table 1.3 relate to the census wards (small areas) in which the factory
lies. However, such areas are nonetheless much larger than the two short roads that were the
focus of the television documentary. You can tell this from the population size (5000 is clearly
many more than you might expect in two short roads).
Because of this, the expected numbers are also much larger than the observed cases reported in
the documentary (5000 people x 10 years of investigation x national rates), and are measured in
the hundreds. However, the point to note is that the observed number of cases of brain cancer
appears to be disproportionately high.
Unfortunately the number 5.4 is a misprint (it should say 1.5, as shown in Table 1.3). But the point
remains that the proportion of brain cancers still seems very large by comparison with cancers as
a whole.
In the last part of the feedback, you were shown a method that tries to guess the population of
the relevant roads (guessed to be about 250 people). Hence the number of expected cases that
this leads to is much closer to the observed number.
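
The arithmetic behind an expected number (people x years x rate) can be sketched as follows. This is an illustration in Python only; the rate used is a made-up placeholder, not the national rate used in the e-Book, so the results do not reproduce Table 1.3.

```python
def expected_cases(population, years, annual_rate_per_100k):
    """Expected cases = person-years at risk x reference rate."""
    person_years = population * years
    return person_years * annual_rate_per_100k / 100_000

# HYPOTHETICAL reference rate, per 100,000 person-years (not from the e-Book)
ASSUMED_RATE = 7.0

ward_expected = expected_cases(5000, 10, ASSUMED_RATE)   # census wards
roads_expected = expected_cases(250, 10, ASSUMED_RATE)   # guessed population of the two roads

print(f"wards: {ward_expected:.3f} expected, roads: {roads_expected:.3f} expected")
```

Whatever rate is used, shrinking the population from 5000 to 250 shrinks the expected number by the same factor of 20, which is why the choice of denominator matters so much when comparing observed with expected cases.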

Why is the baseline group omitted in the regression analysis on p34 in e-Book
Environmental Epidemiology?

In regression analyses such as this one, the (exponentiated) coefficients are odds ratios -- i.e. the
odds for each group compared to the baseline group. (Divide the odds of disease in exposure
group x by the odds of disease in the baseline group.)
The baseline group is omitted because the ratio of odds for this group would by definition always
be 1 (the baseline odds divided by itself), and it has no confidence interval because it is simply
defined as being 1.
There is nothing wrong with showing that the baseline odds ratio is 1, and in tables it is often
clearer to include a line to show this. But to save space, the Stata output omits the baseline group
as it assumes the reader will know it is always just 1.
Remember, however, that the odds of disease (rather than the odds ratio) is set by the study
design, and is not usually of interest itself. It is only the relative change in the odds of disease
with exposure that is of interest – hence the calculation of odds ratios.
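
To see why the baseline odds ratio is always 1, here is a minimal sketch in Python using crude odds from made-up counts (not the data from p34, and not a fitted regression model). The group names and counts are hypothetical.

```python
def odds(cases, non_cases):
    """Odds of disease = cases / non-cases."""
    return cases / non_cases

# HYPOTHETICAL counts per exposure group: (cases, non-cases)
groups = {
    "baseline": (20, 80),
    "medium": (30, 70),
    "high": (40, 60),
}

baseline_odds = odds(*groups["baseline"])

# Odds ratio = odds in each group divided by odds in the baseline group
odds_ratios = {name: odds(*counts) / baseline_odds for name, counts in groups.items()}

for name, value in odds_ratios.items():
    print(f"{name}: OR = {value:.2f}")
```

The baseline entry is the baseline odds divided by itself, so it is 1 whatever the counts are, and there is nothing to estimate.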

Air pollution
What is a confounding factor in a time-series study?

The essence of a time-series study is that the population is effectively compared with itself. The
usual situation entails analysis of day to day variation in the count of health endpoints (deaths,
hospital admissions, emergency room visits etc) in relation to variation in air pollution measured
at similar time resolution. This is typically done for health and air pollution statistics for a particular
city.
Because it is the day to day variation that is being looked at, the characteristics of the population
itself are not usually confounders, because they don’t change much from day to day. The
population of London today is likely to be very similar to the population of London yesterday.
Hence age, smoking prevalence etc. are not confounders as they are unlikely to change much
from day to day or to be associated with exposure and outcome over the short term.
However, things that do vary quickly and are both associated with exposure (air pollution) and
health outcome are potential confounders. These include such factors as outdoor temperature,
periods of influenza, and day of the week. (More people tend to die on Mondays than at
weekends, and weekdays will generally have higher emissions of pollution.) So it is these
‘environmental’ risk factors that vary quickly over time that constitute the confounders in time
series studies.

Do we have to take account of indoor sources in air pollution studies?

There is now a very large number of studies of the health effects of outdoor air pollution, including
many from lower income countries. Time-series studies are the most common, but other designs
are also used.
In many lower income countries, there has been particular interest in indoor smoke (particles)
arising from the inefficient or poorly ventilated burning of biomass fuels. The exposures from such
sources may often be of greater importance than those from pollution in outdoor air. The effects of
indoor smoke are not investigated by the same time-series studies as outdoor air pollution, but
usually entail cohort studies, trials and other designs that measure the frequency of adverse events in people
exposed to different (measured) levels of exposure to indoor sources.
However, for time-series studies, the principle is simply to measure the day to day variation in
health and whether it correlates with similar variation in pollution, usually measured at just one or
two sites. Although a single site doesn't tell us about the actual level of exposure of an individual,
the relative change from day to day is probably a reasonable measure of the change for
everybody. (Air pollutant levels are largely driven by meteorological conditions, which are broadly
the same across any local area.) It means you can still do a time-series analysis without separately
measuring indoor pollution, the effects of which you would probably assess through a separate
study design.

What is an interrupted time-series?

An interrupted time-series is the name given to the assembly and analysis of data over time
relating to a ‘step’ change in exposure. Typically, a single population or group of participants is
observed repeatedly both before and after an intervention in a natural experiment. An example is
the Clancy study of the health effects of the ban on coal sales in Dublin. The authors measured
air pollution, mortality and various other factors before and after the introduction of the coal ban,
and attempted to quantify the ‘step change’ in mortality as air pollution levels fell after the ban.
Although the analysis may entail many of the same principles as a conventional time-series study
(controlling for other time-varying confounders, for example), the focus is different in that its
main outcome is the quantification of the before-after change.

What are the differences between natural experiment studies, interrupted time-series
studies and change-on-change studies?

Natural experiment
‘Interrupted time-series study’ is not an equivalent term to ‘natural experiment’. When a study
examines the impact of an actual policy or intervention that has already occurred in the real world, we
call it a natural experiment. So the three examples that we introduced in the lecture are all
natural experiment settings (the interrupted time-series study of the ban on coal sales in Dublin;
the change-on-change study of PM2.5 and life expectancy in the US, 1980-2000; and the pre versus post
comparison of exposure and life years gained for the London Congestion Charge scheme). Most
interrupted time-series studies are of natural experiments, but not all natural experiment studies
are interrupted time-series studies.
Interrupted time-series study and change-on-change study
Generally, an interrupted time-series study means a segmented time-series study. Specifically, it is
segmented by the timing of the intervention (for instance, pre- and post-intervention) or into
more segments by the level of intervention. You could fit a different model in each segment,
or you could fit one model with an indicator variable and coefficient for the time segments.
To control for the underlying trend, which would be present regardless of the intervention(s), the trend can be
modelled or, even better, a control population (or area) that is not exposed to the intervention
can be set up; the latter is known as a controlled interrupted time-series study. For
technical details on interrupted time-series and controlled interrupted time-series studies, I
suggest the excellent tutorials by Lopez Bernal et al. 2017 and 2018.
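
As an illustration of the 'one model with an indicator variable' approach, a commonly used segmented regression for daily counts can be written as below. The notation is an assumption made for this sketch rather than something taken from the lecture: Y_t is the count on day t, t_0 is the day the intervention starts, and X_t is an indicator equal to 0 before the intervention and 1 afterwards.

```latex
\[
\log E[Y_t] = \beta_0 + \beta_1 t + \beta_2 X_t + \beta_3 (t - t_0) X_t
\]
```

Here \beta_1 describes the underlying trend before the intervention, \beta_2 the immediate 'step' change in level, and \beta_3 any change in slope afterwards. Terms for other time-varying confounders (temperature, season, etc.) would be added in the same way as in a conventional time-series model.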

In contrast to the classic Dublin interrupted time-series study, we introduced Pope’s US
PM2.5 and life expectancy study as an example of a change-on-change study. In principle, a
change-on-change study does not need denominators, because it compares the change in exposure with the
change in outcome within the same population (assuming other general
changes are controlled for in the analysis). Pope’s paper compares the reduction in PM2.5 from the 1980s to the
1990s with the change in life expectancy over the same period. Alternatively, we could fit a model of continuous
PM2.5 level and life expectancy with an indicator variable for the time period, instead of
taking changes (differences) between the two time points. In that case, it looks very similar to a
time-series study.

As your question suggests, one could argue that there is an overlap between interrupted
time-series and change-on-change studies: in a controlled interrupted time-series study we
compare the pre-post “change” in outcome in the area with the intervention with that in the
area without the intervention, which could perhaps be called change-on-change. But the term
change-on-change is more usually used for comparing change in outcome over many areas, each with
a different level of environmental change, and the change may be gradual, as in the Pope study,
not sudden, as in the Dublin study.
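
The idea of comparing change in outcome with change in exposure across many areas can be sketched as a simple regression of one difference on the other. The Python below is an illustration only: the four areas and their values are invented, and it fits an ordinary least-squares line without any of the adjustments a real analysis would need.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# HYPOTHETICAL areas: change in PM2.5 (ug/m3) and change in life expectancy (years)
delta_pm25 = [-2.0, -4.0, -6.0, -8.0]
delta_life_expectancy = [0.2, 0.4, 0.6, 0.8]

slope = ols_slope(delta_pm25, delta_life_expectancy)
print(f"change in life expectancy per 1 ug/m3 increase in PM2.5: {slope:.2f} years")
```

Each area contributes one pair of differences, so the population of each area is in effect compared with itself, which is why no denominators appear in the calculation.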

I am afraid I am not familiar with autocorrelation and overdispersion. Do we need
to know more about this description?

As time-series data (of counts) are usually analysed using Poisson regression models,
‘autocorrelation’ and ‘overdispersion’ will be issues. But you don't need to worry too much about the
details in this module. What you are expected to be aware of here is:
Autocorrelation
Observations in time-series data are unlikely to be independent, with observations close in time
likely to be more similar than those distant in time. In other words, the temperature today is likely
to be more similar to that of yesterday or tomorrow than to that of a month earlier or later. This
represents temporal autocorrelation. After controlling for seasonality, long-term patterns, the
exposure of interest and other explanatory variables, residual autocorrelation will tend to be much
smaller.
However, it is always worth checking whether any significant autocorrelation remains and whether
adjustment for this remaining autocorrelation would change the main results. You can learn how
to check and adjust for autocorrelation in the time-series practical using Stata (in the optional
advanced section, page 2).
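
As a rough illustration of what 'checking for autocorrelation' means, the Python sketch below computes the lag-1 autocorrelation of a series, i.e. the correlation between each value and the value one day earlier. It is not the method used in the Stata practical, and the two short series are invented; in a real analysis you would apply this kind of check to the model residuals.

```python
def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    total_var = sum((v - mean) ** 2 for v in series)
    lagged_cov = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return lagged_cov / total_var

# HYPOTHETICAL series: one drifting smoothly, one flipping from day to day
smooth = [10, 11, 12, 13, 14, 15, 16, 17]
alternating = [10, 14, 10, 14, 10, 14, 10, 14]

r_smooth = lag_autocorrelation(smooth)
r_alternating = lag_autocorrelation(alternating)
print(f"smooth: {r_smooth:.3f}, alternating: {r_alternating:.3f}")
```

A value well above zero means neighbouring days resemble each other (positive autocorrelation); a value near zero is what you hope to see in the residuals once season, trend and the other explanatory variables have been controlled for.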
Overdispersion
Observed time-series data sometimes have a higher variance than predicted under a Poisson
distribution (for which the variance is equal to the mean). This is called overdispersion. With overdispersion, the usual
Poisson model underestimates the standard errors of the parameters. In this case, we need to
apply a simple adjustment to obtain appropriate standard errors in the model fitting. The time-series
practical (optional advanced, pages 2-3) introduces a method that scales the variance-covariance
matrix using a parameter estimated as the Pearson chi-squared statistic divided by the degrees
of freedom.
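
The scaling adjustment can be sketched in a few lines. The Python below is an illustration under stated assumptions, not the Stata code from the practical: the daily counts are invented, the 'model' is simply a constant mean (one parameter), and the unadjusted standard error of 0.10 is a placeholder.

```python
import math

def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-squared statistic divided by the residual degrees of freedom."""
    chi2 = sum((o - f) ** 2 / f for o, f in zip(observed, fitted))
    return chi2 / (len(observed) - n_params)

def scaled_se(se, dispersion):
    """Scaling the variance by the dispersion multiplies each SE by its square root."""
    return se * math.sqrt(dispersion)

# HYPOTHETICAL daily counts; an intercept-only Poisson model fits their mean (6.0)
observed = [2, 9, 3, 10, 1, 11]
fitted = [sum(observed) / len(observed)] * len(observed)

dispersion = pearson_dispersion(observed, fitted, n_params=1)
adjusted_se = scaled_se(0.10, dispersion)   # 0.10 is a placeholder unadjusted SE
print(f"dispersion: {dispersion:.2f}, adjusted SE: {adjusted_se:.3f}")
```

A dispersion close to 1 is what a Poisson distribution predicts; a value clearly above 1 indicates overdispersion, and the standard errors (and hence confidence intervals) are widened accordingly.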

In the air pollution time-series practical with Stata, when I control for seasonal patterns (using
splines rather than Fourier terms), Stata does not recognize the frencurvnk command
suggested in the exercise. It shows the following error message:
. frencurvnk, x(date) gen(spl) nk(56) power(3)
n of knots specified as: 56
B-splines could not be calculated for the specified knots and reference
values
unrecognized command
r(199);

This may be due to your Stata settings or environment. One possibility, judging from the
second line of the output, is the bspline.ado file. frencurvnk needs bspline.ado too, and your
installation may not have it. If not, can you install it by following the procedure below?
1. Type "search bspline" in the Command window.
2. Click "sg151_2" in the Viewer window (the sg151_2 package contains bspline.ado). Then click
"click here to install".
After this, try the "frencurvnk, x(date) gen(spl) nk(56) power(3)" command again in Stata.

Assessment exercise
In the assessment exercise, what answers are expected in relation to the
improvement in health from traffic management?

Estimating the effect of a traffic-management policy clearly entails a number of assumptions, and exact
quantification is not possible. However, there are various aspects of the problem that can be
commented on. It is hoped that an answer would include:
- some (semi-)quantitative attempt to gauge the approximate size of the impact, i.e. a very
rough calculation
- comment on the degree to which traffic contributes to local air pollution exposure
- comment on the likely difficulty of achieving the stated reduction in traffic and what that
might do to the change in emissions
- comment on the degree to which reduction of pollution exposure is likely to translate into
health benefits (considering lag effects, irreversibility of some effects etc)
- discussion of the fact that air pollution effects are only part of the likely impact of the
change in traffic

Last updated in September 2022
