
Biometrics 61, 899–941
December 2005
DOI: 10.1111/j.1541-0420.2005.00454.x

Statistical Issues Arising in the Women’s Health Initiative

Ross L. Prentice,∗ Mary Pettinger,∗∗ and Garnet L. Anderson∗∗∗


Division of Public Health Sciences, Fred Hutchinson Cancer Research Center,
P.O. Box 19024, Seattle, Washington 98109-1024, U.S.A.

∗email: rprentic@whi.org
∗∗email: mpetting@whi.org
∗∗∗email: garnet@whi.org

Summary. A brief overview of the design of the Women’s Health Initiative (WHI) clinical trial and
observational study is provided along with a summary of results from the postmenopausal hormone therapy
clinical trial components. Since its inception in 1992, the WHI has encountered a number of statistical
issues where further methodology developments are needed. These include measurement error modeling and
analysis procedures for dietary and physical activity assessment; clinical trial monitoring methods when
treatments may affect multiple clinical outcomes, either beneficially or adversely; study design and analysis
procedures for high-dimensional genomic and proteomic data; and failure time data analysis procedures
when treatment group hazard ratios are time dependent. This final topic seems important in resolving the
discrepancy between WHI clinical trial and observational study results on postmenopausal hormone therapy
and cardiovascular disease.
Key words: Chronic disease prevention; Clinical trial monitoring; Genome-wide scan; Hazard ratio;
Measurement error; Nutritional epidemiology; Observational study; Randomized controlled trial; Women’s
health.

1. Introduction

The Women’s Health Initiative (WHI) is perhaps the most ambitious population research investigation
ever undertaken. The centerpiece of the WHI program is a randomized, controlled clinical trial (CT)
to evaluate the health benefits and risks of four distinct interventions (dietary modification, two
postmenopausal hormone therapy [HT] interventions, and calcium/vitamin D supplementation) among
68,132 postmenopausal women in the age range 50–79 at randomization. Participating women were
identified from the general population living in proximity to any of the 40 participating clinical
centers throughout the United States. The WHI program also includes an observational study (OS)
that comprised 93,676 postmenopausal women recruited from the same population base as the CT.
Enrollment into WHI began in 1993 and concluded in 1998. Intervention activities in the estrogen
plus progestin HT component of the CT ended early on July 8, 2002 when evidence had accumulated
that the risks exceed the benefits. Intervention activities in the estrogen-alone component of the
CT also ended early, on February 29, 2004. Intervention activities in the other two CT components
ended on March 31, 2005. Nonintervention follow-up on participating women is planned through 2010,
giving an average follow-up duration of about 13 years in the CT and 12 years in the OS.

The CT used a “partial factorial” design. Participating women met eligibility for, and agreed to
be randomized to, either the dietary modification (DM) or one of the HT components, or both the DM
and HT. The DM component randomly assigned 48,835 eligible women to either a sustained low-fat
eating pattern (40%) or self-selected dietary behavior (60%), with breast cancer and colorectal
cancer as designated primary outcomes and coronary heart disease (CHD) as a secondary outcome. The
nutrition goals for women assigned to the DM intervention group were to reduce total dietary fat to
20%, and saturated fat to 7%, of corresponding daily calories and, secondarily, to increase daily
servings of vegetables and fruits to at least five and of grain products to at least six, and to
maintain these changes throughout the trial intervention period. The randomization of 40%, rather
than 50%, of participating women to the DM intervention group was intended to reduce trial costs,
while testing trial hypotheses with specified power.

The postmenopausal HT clinical trial components comprised two parallel randomized, double-blind,
placebo-controlled trials among 27,347 women, with CHD as the primary outcome, with hip and other
fractures as secondary outcomes, and with breast cancer as a primary adverse outcome. Of these,
10,739 women (39.3% of total) had a hysterectomy prior to randomization, in which case there was a
randomized allocation between conjugated equine estrogen (E-alone) 0.625 mg/day or placebo. The
remaining 16,608 (60.7%) of women, each having a uterus at baseline, were randomized (aside from an
early assignment of 331 of these women to E-alone) to the same preparation of estrogen plus
2.5 mg/day of medroxyprogesterone (E+P) or placebo. A total of 8050 women were randomized to both
the DM and HT clinical trial components.
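
The enrollment figures quoted above can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch: the variable and function names are hypothetical, and the counts are copied from the text. It confirms that the two HT trials partition the HT enrollment, and that counting DM and HT enrollees once each, less the women randomized to both, recovers the overall CT sample size, as expected if every CT participant entered the DM component, an HT component, or both.

```python
# Enrollment counts quoted in the Introduction (names are illustrative).
CT_TOTAL = 68_132    # all clinical trial (CT) participants
DM_TOTAL = 48_835    # dietary modification (DM) component
HT_TOTAL = 27_347    # both hormone therapy (HT) trials combined
E_ALONE = 10_739     # HT women with a prior hysterectomy
E_PLUS_P = 16_608    # HT women with a uterus at baseline
DM_AND_HT = 8_050    # women randomized to both DM and HT


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# The E-alone and E+P trials together account for all HT enrollees.
assert E_ALONE + E_PLUS_P == HT_TOTAL

# Inclusion-exclusion over the DM and HT components.
dm_or_ht = DM_TOTAL + HT_TOTAL - DM_AND_HT

print("E-alone share of HT (%):", percent(E_ALONE, HT_TOTAL))
print("E+P share of HT (%):", percent(E_PLUS_P, HT_TOTAL))
print("DM or HT equals CT total:", dm_or_ht == CT_TOTAL)
```

The two percentages agree with the 39.3% and 60.7% given in the text, and the inclusion-exclusion total matches the 68,132 women reported for the CT.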

At their 1-year anniversary from DM and/or HT trial enrollment, all CT women were further screened
for possible randomization in the calcium and vitamin D (CaD) component, a randomized, double-blind,
placebo-controlled trial of 1000 mg elemental calcium plus 400 international units of vitamin D3
daily, versus placebo. Hip fracture is the designated primary outcome for the CaD component, with
other fractures and colorectal cancer as secondary outcomes. A total of 36,282 (53.3% of CT
enrollees) were randomized to the CaD component.

The total CT sample size of 68,132 is only 60.6% of the sum of the individual sample sizes for the
four CT components, providing a cost and logistics justification for the use of a partial factorial
design with overlapping components.

Postmenopausal women of ages 50–79 years who were screened for the CT but proved to be ineligible
or unwilling to be randomized were offered the opportunity to enroll in the OS. The OS is intended
to provide additional knowledge about risk factors for a range of diseases, including cancer,
cardiovascular disease, and fractures. It has an emphasis on biological markers of disease risk,
and on risk factor changes as modifiers of risk.

There was an emphasis on the recruitment of women of racial/ethnic minority groups throughout the
WHI. Overall, 18.5% of CT women and 16.7% of OS women identified themselves as other than white.
These fractions allow meaningful study of disease risk factors within certain minority groups in
the OS. Also, key CT subsamples are weighted heavily in favor of the inclusion of minority women in
order to strengthen the study of intervention effects on specific intermediate outcomes (e.g.,
changes in blood lipids or micronutrients) within minority groups.

To ensure adequate power for principal outcome comparisons, age distribution goals were specified
for the CT as follows: 10%, ages 50–54 years; 20%, ages 55–59 years; 45%, ages 60–69 years; and
25%, ages 70–79 years. While there was substantial interest in assessing the benefits and risks of
each CT intervention over the entire 50–79 year age range, there was also interest in having a
sufficient representation of younger (50–54 years) postmenopausal women for meaningful age
group-specific intermediate outcome (biomarker) studies, and of older (70–79 years) women for
studies of treatment effects on quality of life measures, including aspects of physical and
cognitive functioning. Differing shapes for age incidence rate functions within the 50–79 age range
across the clinical outcomes that were hypothesized to be affected by the interventions under study
provided an additional motivation for a prescribed age-at-enrollment distribution. Table 1 provides
information on enrollment by age group in the various WHI components.

Table 1
Women’s Health Initiative sample sizes (% of total) by age group

                          Postmenopausal hormone therapy
           Dietary        Without uterus   With uterus   Calcium and   Observational
Age group  modification   (E-alone)        (E+P)         vitamin D     study
50–54       6,961 (14)     1,396 (13)       2,029 (12)    5,157 (14)   12,386 (13)
55–59      11,043 (23)     1,916 (18)       3,492 (21)    8,265 (23)   17,321 (18)
60–69      22,713 (47)     4,852 (45)       7,512 (45)   16,520 (46)   41,196 (44)
70–79       8,118 (17)     2,575 (24)       3,575 (22)    6,340 (17)   22,773 (24)
Total      48,835         10,739           16,608        36,282        93,676

In addition to the 40 participating clinical centers, the WHI program is implemented through a
clinical coordinating center based at the Fred Hutchinson Cancer Research Center in Seattle.
Several components of the National Institutes of Health (National Heart, Lung and Blood Institute,
National Cancer Institute, National Institute of Aging, National Institute of Arthritis,
Musculoskeletal and Skin Diseases, NIH Office of Women’s Health, and NIH Director’s Office) sponsor
the WHI program, with NHLBI taking a coordinating role.

Several important statistical issues have arisen in the design, conduct, and analysis of the WHI.
Some of these, where additional methodology developments are required, will be described below in
some detail.

2. Study Design

Most aspects of the CT and OS design, including target sample sizes, eligibility criteria, primary
and secondary clinical outcomes, biological specimen collection and storage protocols,
quality-assurance procedures, and CT monitoring and reporting methods, have previously been
described (Freedman et al., 1996; Women’s Health Initiative Study Group, 1998; Anderson et al.,
2003; Prentice and Anderson, 2005). There are, however, study design issues related to the
nutritional and physical activity epidemiology goals of the program, as well as design issues
related to the efficient uses of the WHI specimen repository for genomic and proteomic purposes,
that remain under active consideration.

2.1 Nutritional and Physical Activity Epidemiology

The reliable assessment of nutrient consumption and activity-related energy expenditure poses
central challenges in nutritional and physical activity epidemiology. In fact, a principal argument
in support of the need for the DM trial of a low-fat eating pattern, and for the CaD trial, as
opposed to a reliance on observational study designs, comes from dietary assessment uncertainties
and their potentially dominant impact on nutritional epidemiology association studies. Very similar
measurement issues arise in physical activity assessment, as most nutritional and physical activity
association studies rely on self-report assessment methods. Of particular current interest are
dietary and physical activity
patterns that may be associated with long-term energy balance in view of the obesity epidemic in
North America and other Western countries, and the strong association between obesity and such
major chronic diseases as diabetes, CHD, and cancer (e.g., Calle et al., 2003). A recent commentary
(Prentice et al., 2004) focused on the future research agenda in the nutrition, physical activity,
and chronic disease areas, and pointed to nutrition and physical activity assessment and modeling
as key areas for further methodologic and substantive research.

The validity of the intervention versus control group comparisons in the DM trial does not rely
directly on dietary assessment among participating women. Indeed, this lack of reliance, along with
the absence of confounding by baseline risk factors, is the major motivation for an intervention
trial. Dietary assessment, however, is needed for the evaluation of adherence to nutritional goals,
and for explanatory analyses that attempt to attribute intervention effects on clinical outcomes to
specific nutritional changes (e.g., reduced total fat, increased fruits and vegetables) induced by
a multifaceted intervention program. Of course, WHI CT and OS data will be used to examine many
nutritional and physical activity epidemiology associations beyond those tested by CT
interventions. For these other association analyses, nutritional and physical activity assessment
data will play a direct and central role.

Diet and physical activity are typically assessed in epidemiologic studies using frequencies,
records, or recalls. For example, a food-frequency questionnaire (FFQ) or an activity-frequency
questionnaire provides a list of foods or activities and asks a respondent to specify how
frequently each is consumed or engaged in, and with what portion size or intensity, over the
preceding few months. It has long been known from reliability studies (e.g., Willett et al., 1985)
that these types of assessment procedures may incorporate substantial random measurement error, but
evidence is emerging from biomarker studies concerning the presence of important systematic
measurement error as well (e.g., Heitmann and Lissner, 1995; Day et al., 2001; Kipnis et al., 2003;
Subar et al., 2003; Hebert et al., 2004). Systematic bias may occur when a person consistently
tends to under- or overreport the consumption of certain foods, or the practice of certain activity
patterns, on successive applications of the same or different self-report instruments. Relaxing the
classical measurement error model (e.g., Carroll, Ruppert, and Stefanski, 1995) to include an
independent person-specific random effect may help to deal with the resulting correlated
measurement errors, but this modeling device will be insufficient if the systematic component of
the measurement error tends to depend on individual characteristics, such as body mass, ethnicity,
age, or social desirability factors. Instead, the measurement model may be conditioned on a vector,
V, of such characteristics, with the mean and variance of a random effect allowed to depend on V.

These self-report measurement issues may cause one to instead consider biomarkers that plausibly
adhere to a classical measurement model for nutritional or physical activity assessment. In fact,
suitable biomarkers are available for short-term total and activity-related energy expenditure
(Schoeller et al., 2002), and for protein, sodium, and potassium consumption (Bingham et al., 2002)
among weight-stable persons, through a doubly labeled water protocol, urinary recovery, and
indirect calorimetry. However, some of these measures (e.g., energy expenditure using the doubly
labeled water technique) are quite expensive and practical only in a moderate-sized subset of an
epidemiologic cohort. Hence, the viable research strategy for reliable epidemiologic association
analysis seems to be to carry out a classical measurement error biomarker substudy in a suitable
subset of a study cohort, and use this substudy to calibrate the self-report data that are
available for the entire study cohort. For example, Prentice et al. (2002) consider a model

$$X = Z + \varepsilon \tag{1}$$

for a nutrient consumption or activity-related energy expenditure measure Z having biomarker
measure X, where the error variate ε is independent of Z and other study subject characteristics
(V), and the variance of ε is estimated using a repeat application of the biomarker protocol in a
reliability subsample. The corresponding self-report assessment, W, of Z was modeled as

$$W = \alpha + \beta Z + \gamma^{T} V + \delta^{T} (Z \otimes V) + U + e, \tag{2}$$

where, again, V is a vector of study-subject characteristics that may relate to the self-report
measurement properties, while U is a mean zero random effect for the study subject that allows
repeat assessments W to be correlated (given V) and e is an independent error term. Some
development of logistic regression estimation procedures to relate a disease odds ratio to the
underlying nutrient or activity exposure Z under this measurement model, using regression
calibration, conditional scores, and nonparametric corrected scores procedures (e.g., Carroll
et al., 1995; Huang and Wang, 2000), is included in an unpublished 2003 Department of Statistics,
University of Washington doctoral dissertation by Elizabeth Sugar.

Study design issues related to the use of models (1) and (2), or variations thereof, arise from the
need to specify a sample size and sampling procedure for a biomarker subsample. Related issues
concern the selection of reliability subsamples for both X and W. Suitable design choices, under
(1) and (2), likely relate strongly to the relative magnitudes of the variances of ε, U, and e in
relation to the variance of Z, and to the dependence of such variances on V, and also to the
magnitude of the regression coefficients in (2), particularly β and δ. There are, of course,
related analysis issues concerning consistent and efficient means of estimating odds ratios or
hazard ratios for clinical outcomes of interest, the robustness of such inferences to moderate
departures from (1) and (2), and the choice between (1) and (2) and other measurement error models.

At the time of this writing, a Nutrient Biomarker Study among 543 women in the DM component of the
Women’s Health Initiative CT (50% control, 50% intervention) was just being completed with a
principal goal of elucidating trial results in terms of the components of this multifaceted
intervention through a biomarker calibration of FFQ data. A grant proposal to study the comparative
measurement properties of the FFQ, a 4-day food record and (three) 24-hour recalls, and to study
the comparative properties of an activity frequency questionnaire, a 7-day physical activity
recall, and
WHI personal habits questionnaire, among 450 OS women is also pending. These efforts not only
include the “recovery” biomarkers (Kaaks et al., 2002) listed above, but also blood serum
concentration measures for various nutrients. The classical measurement model (1) will typically be
implausible for these concentration markers, so additional design and analysis issues arise in
attempts to use these biomarkers in conjunction with self-report assessments in nutritional and
physical activity–disease association analyses.

Since few full-scale dietary intervention trials with clinical outcomes are practical at any point
in time for reasons of cost and logistics, these measurement error modeling and analysis activities
become key to progress in these important population science research areas.

2.2 High-Dimensional Genomic and Proteomic Studies

The WHI includes a well-developed system for the standardized collection and storage of biological
materials from participating women. This includes the storage of blood plasma and serum, as well as
white blood cells for DNA extraction. These specimens in the well-characterized CT and OS cohorts,
with comprehensive outcome ascertainment, provide an extremely valuable resource for elucidating
mechanisms that determine chronic disease risk, and for explaining CT intervention effects. The WHI
includes a substantial number of externally funded ancillary studies, as well as a few internally
funded case–control studies, that make use of these specimens. Ideas for priority uses of specimens
include high-dimensional approaches to studying genotype, or to studying serum protein expression
patterns, or changes in such patterns over time. The technological advances that allow genome-wide
scans of hundreds of thousands of single nucleotide polymorphisms (SNPs), from a minute amount of
DNA, are impressive indeed. Though the technology is less mature, there are also several platforms
for high-dimensional proteomics. However, suitable statistical methods for the design and analysis
of case–control studies that include such high-dimensional data are essential for these innovations
to have their desired impact on medicine and public health, and much related statistical work
remains to be carried out (e.g., Feng, Prentice, and Srivastava, 2004).

Consider genetic association studies which examine the relationship of genotype to disease risk.
Genotype can be characterized using the several million SNPs (Kruglyak, 1999) that exist in the
human genome. There is substantial effort, including the publicly funded HapMap project, to
identify a reduced set of tag SNPs that convey most genotype information as a result of correlation
(linkage disequilibrium) between neighboring SNPs (Gabriel et al., 2002; Gibbs et al., 2003). Use
of “chip” technologies has allowed genotyping costs to fall to the vicinity of $0.01 per SNP, and
certain organizations make 50,000–250,000 tag SNPs commercially available, the latter number having
potential to characterize most of the common variability across the human genome. Furthermore, SNP
determinations are evidently quite accurate and can be based on amplified DNA, so that as little as
1 mcg of DNA is sufficient for a rather comprehensive genome-wide scan.

However, large numbers of cases and controls are needed to detect associations of plausible
magnitude between a given SNP and disease risk for such complex diseases as cardiovascular diseases
and cancers, especially when such association is dependent on linkage disequilibrium that is less
than one due to the use of tag SNPs. For example, to detect an odds ratio of 1.5 for the presence
of one or both copies of the minor allele of an SNP having an allele frequency of 0.1 at the 0.05
level of significance, one would require 763 cases and 763 controls for 80% power, and 1301 cases
and controls for 95% power (e.g., Breslow and Day, 1987). At 1 cent per SNP, a study of 250,000
SNPs in 1000 cases and 1000 controls would involve genotyping costs of $5 million, and would be
expected to yield 12,500 “false positive” associations under the global null hypothesis of no
SNP–disease associations. This implies the need for a larger sample size, or a multistage design to
screen out most of the false positives, and argues for additional innovation to reduce genotyping
costs.

One approach to reducing genotyping costs is to restrict the analysis to the subset of SNPs that
are within the coding or regulatory regions of known genes. This is a logical and attractive
approach, though there is considerable debate about the potential biologic importance of
polymorphisms outside of these regions. A second interesting approach involves the pooling of equal
amounts of DNA from each case (or control) prior to genotyping. Though the concept of genotyping
from pooled DNA has existed for some time, much of the pertinent literature is quite recent (see
Sham et al., 2002 for a review). Recent studies (e.g., Le Hellard et al., 2002; Mohlke et al.,
2002) document the agreement that can be achieved between allele frequency estimates from pooled
DNA and those from individual SNP genotyping. Some additional variation is introduced by using an
allele frequency estimate for the set of cases (or controls), rather than an allele frequency
measurement, though this additional variation can be controlled by employing a small number of
replicate pools, and/or by drawing replicate samples from each pool. For example, if one formed two
case pools and two control pools, each of size 500, carried out four polymerase chain reaction
(PCR) amplifications from each, and sampled each PCR pool in quadruplicate, one would incur
$160,000 genotyping costs for 250,000 SNPs at 1 cent/SNP. This represents a 30-fold cost reduction
relative to corresponding individual genotyping, evidently with little reduction in power (Mohlke
et al., 2002) for determining SNP–disease associations. This cost reduction factor is somewhat
optimistic in view of pool formation costs, and necessary specialized whole genome DNA
amplification procedures, but the use of an initial pooled DNA step may often be essential for an
epidemiologic study to be practical in terms of cost.

A limitation of the pooled DNA approach is that one is unable to examine the joint association with
disease risk of adjacent SNPs (haplotypes), or SNP–SNP interactions more generally, from pooled
DNA, so there are important research strategy trade-offs to consider. Multistage study designs that
employ pooling at the early stages in an attempt to screen out many of the false positives,
followed by individual genotyping stages, may have considerable appeal in some settings, and
deserve formal evaluation of statistical properties. Other statistical design issues relate to
preferred pool sizes, with some researchers evidently advocating smaller pool sizes (Barratt
et al., 2002; Downes et al., 2004) than do others (Le Hellard et al., 2002; Mohlke et al., 2002)
based on components of variance considerations.
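
The genotyping cost and false-positive figures quoted in this section follow from simple multiplication, which the sketch below reproduces. It is an illustration only: the function names are hypothetical, and it assumes that each pool, PCR amplification, and replicate combination is charged as one genotype determination per SNP at 1 cent, which is the assumption under which the $160,000 figure in the text is recovered.

```python
CENTS_PER_SNP = 1      # assumed unit cost, "1 cent per SNP" in the text
N_SNPS = 250_000       # SNPs in the genome-wide scan
ALPHA = 0.05           # per-SNP significance level


def individual_cost(n_cases: int, n_controls: int) -> float:
    """Dollar cost of genotyping every SNP on every case and control."""
    return (n_cases + n_controls) * N_SNPS * CENTS_PER_SNP / 100


def pooled_cost(n_pools: int, n_pcr: int, n_replicates: int) -> float:
    """Dollar cost when each pool/PCR/replicate is genotyped once per SNP."""
    return n_pools * n_pcr * n_replicates * N_SNPS * CENTS_PER_SNP / 100


def expected_false_positives(alpha: float = ALPHA) -> float:
    """Expected count of chance-significant SNPs under the global null."""
    return N_SNPS * alpha


full = individual_cost(1000, 1000)
# Two case pools plus two control pools, four PCRs each, quadruplicate samples.
pooled = pooled_cost(n_pools=4, n_pcr=4, n_replicates=4)

print("individual genotyping ($):", full)
print("pooled genotyping ($):", pooled)
print("cost ratio:", full / pooled)
print("expected false positives:", round(expected_false_positives()))
```

Under these assumptions the individual-genotyping cost is $5 million, the pooled cost is $160,000, and the ratio is 31.25, consistent with the roughly 30-fold reduction described above; the expected number of false positives at the 0.05 level is 12,500.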
A referee has pointed out that the use of pooled DNA at a given study design stage will also
preclude the study of the SNPs tested in relation to other traits (e.g., hypertension) for which
data may be available for individuals in the cohort, unless such trait values were specifically
used in pool construction.

A multistage design seems attractive in this high-dimensional setting, whether or not pooling is
employed, as a way to avoid both excess cost and false positives. For example, with 250,000 SNPs a
three-stage design with equal sample sizes at each stage could be carried out by testing at the
0.022 level (Z = 2.30) at each stage, giving an expected 2.5 false positives overall under the
global null hypothesis. This design would screen out nearly 98% of the SNPs at the first stage, and
would involve only about 120 SNPs that are unrelated to disease at the third stage, with close to a
two-thirds reduction in genotyping costs. However, further evaluation is needed of corresponding
statistical properties (e.g., power properties relative to a single-stage design that tests at a
very extreme significance level of 0.00001). See Satagopan, Venkatraman, and Begg (2004) for some
related encouraging power analyses.

At the time of this writing, the WHI is in the early stages of implementing a three-stage design to
identify SNPs, or haplotypes, that relate to the risk of CHD, stroke, or breast cancer and to
identify SNPs or haplotypes that relate to the magnitude of combined hormone (E+P) effects on these
diseases. The first two stages will be in the OS, the first involving pooled DNA, while the third
will take place in the E+P trial cohort, which has the most reliable information on E+P effects.

The relationship between serum (or plasma) protein concentrations and disease risk has great
potential for the early detection of disease, and for the study of disease processes and
intervention mechanisms. Equally important, changes in high-dimensional serum protein patterns as a
result of treatment or intervention activities have great potential for preventive intervention
development and initial screening, as knowledge develops on the associations of such patterns with
a range of clinical outcomes. This seems fundamental, as preventive intervention development to
date has needed to rely on extrapolations from therapeutic trials and on low-dimensional
intermediate outcome trials, both of which may lack sensitivity, or on observational epidemiology,
which may often lack specificity.

Mass spectrum profiles provide an estimate of protein (peptide) intensity as a function of the
peptide mass to charge ratio. Serum specimens, and hence these profiles, are, however, quite
sensitive to specimen handling and processing methods, and measurement platforms differ in their
resolution and other measurement properties. A multistage sequential design (Feng et al., 2004) is
attractive also in this context for the identification of peptide peaks that distinguish cases from
controls. Such peaks can then be studied in more detail to identify the distinguishing peptides and
proteins. These analyses are more greedy in terms of specimen usage, so that a multistage design
could allow poorer quality specimens to be used at the early stages (with false positives due to
specimen collection or processing differences screened out at later stages), saving the better
quality specimens (e.g., prediagnostic specimens collected under a standardized protocol in a
cohort study or intervention trial) for the final design stages. Additional proteomic platforms
that fractionate proteins according to additional features, such as affinity tags or elution times,
are under vigorous development, and some are suitable for high-throughput applications, or will be
in the near future.

These genomic and proteomic design issues, and associated high-dimensional data analysis issues
(e.g., Tibshirani and Efron, 2002; Simon et al., 2003; Diamandis, 2004), deserve the attention of
the statistical community in the upcoming years, and are expected to be crucial to the longer-term
productivity of the WHI.

3. CT Monitoring and Reporting Methods

Each CT component has its designated primary and secondary clinical outcomes, and in the case of
the two HT trials a designated primary adverse outcome (breast cancer). The CT monitoring
guidelines, adopted by the external Data and Safety Monitoring Board (DSMB), which is composed of
senior researchers and clinicians having expertise in relevant areas of medicine, epidemiology,
nutrition, biostatistics, CTs, and ethics, included a special role for the designated primary
outcome(s). This primary outcome was CHD for the HT trials, breast cancer and colorectal cancer
separately for the dietary modification trial, and hip fractures for the CaD trial.

It was also recognized from the outset that the interventions under study had potential to affect
the risk, either beneficially or adversely, for various clinical outcomes beyond the primary
outcome(s), and that these other effects should enter early trial stopping considerations. Hence
for the HT trials the monitoring plan involved reviewing weighted log-rank statistics for breast
cancer, stroke, pulmonary embolism, hip fractures, colorectal cancer, endometrial cancer (E+P
trial), and deaths from other causes, in addition to CHD. For the DM trial, weighted log-rank
statistics were reviewed for CHD and deaths from other causes, in addition to breast and colorectal
cancer, while for the CaD trial colorectal cancer, breast cancer, fractures other than hip, and
deaths from other causes were reviewed, in addition to hip fracture. The weights were linear from
zero at randomization up to a plateau point at 3 years for cardiovascular disease and fracture
incidence, and at 10 years for cancer and mortality. These weights were chosen to enhance the power
of outcomes comparison between randomization groups, under the hypothesized time course of
intervention effects. These weights were not well suited to the identification of any early adverse
effects, a fundamental element of data and safety monitoring, so that unweighted log-rank
statistics and Cox model hazard ratio estimates and confidence intervals were also routinely
provided to the DSMB in biannual CT monitoring reports.

An important statistical and substantive issue concerns the means of usefully summarizing the
benefits and risks of an intervention that may plausibly affect multiple clinical outcomes, each
with its own time course, incidence rate pattern, and severity. Following a series of exercises in
which DSMB members individually specified their recommended course of action concerning trial
continuation (stop, continue, do not know) under scenarios as to how the data may look at a future
point in time (Freedman et al., 1996), a so-called global index was developed as a part of the CT
monitoring procedure. For each CT component, the global index was
defined for each participating woman as the time to the first occurrence of the clinical outcomes listed in the preceding paragraph, each of which was regarded as a major health event. If the primary outcome for a CT component, or the primary adverse outcome for the HT trials, showed a significant difference between randomization groups, the global index was to be examined with early stoppage considerations for benefit or risk based on weighted log-rank statistics for the global index. The DSMB agreed to pay attention to these monitoring statistics, but not necessarily to be bound by them, and the DSMB also viewed data on a number of additional clinical and behavioral outcomes as a part of their overall assessment and safety monitoring activities.

While available statistical methods for the analysis of correlated failure times (e.g., Kalbfleisch and Prentice, 2002, Chapter 10) mostly focus on analyses of marginal hazard rates, the WHI CT highlights the importance of carefully selected summary measures of treatment effect that can guide the monitoring and interpretation of CT data. The global index defined above did play an influential role in the early stoppage of the combined hormone trial (Writing Group for the Women’s Health Initiative, 2002) when the DSMB judged that risks exceeded benefits over a 5-year usage period, and has been the subject of some discussion and debate ever since. Some critics have asked, for example, why hip fracture was included but not vertebral or other fractures. No doubt there is no uniquely suited single index in such a complex setting, and additional calculations to examine the sensitivity of conclusions to inclusion and exclusion choices, and to the specification of weights among various outcomes, may be a useful element of data presentation and summary. On the other hand, however, the absence of an attempt to specify pertinent summary measures in advance of the outcome data becoming available leaves an undue likelihood that post hoc debate would too strongly influence trial interpretation and clinical practice and public health impact.

The estrogen-alone CT component also was stopped early (Steering Committee for the Women’s Health Initiative, 2004). In the reporting of principal results from the two HT trials, we presented hazard ratio estimates, as well as nominal and adjusted confidence intervals. The adjusted confidence intervals accommodated the sequential examination of evolving data using an O’Brien–Fleming approach, while the elements of the global index other than the primary outcome (and primary adverse outcome) were also adjusted according to the number of elements of the global index, using a Bonferroni procedure. These latter intervals were substantially conservative since most outcomes in the global index were expected to have only a small influence on early stopping, and the Bonferroni emphasis on controlling experiment-wise error is not so natural in this setting. On the other hand, the nominal intervals are somewhat liberal, especially for the primary outcomes that may have greater influence on early stopping. Some critics of the combined hormone trial results have been quick to adopt the conservative adjusted intervals and declare some differences, where nominal but not adjusted confidence intervals excluded one, as “not significant.” It would be useful to have further development of statistical monitoring and reporting methods that would lead to more specifically suited tests and confidence intervals in these types of complex situations.

4. The Roles of Clinical Trials and Observational Studies in Population Science Research

A major issue in the chronic disease prevention and population science research area concerns the designs that are needed to obtain reliable information on disease associations and intervention effects. Large-scale observational studies, especially cohort studies, allow study of the associations between a wide variety of exposures or characteristics and clinical outcomes of interest. Controlled intervention trials, on the other hand, represent the gold standard for studying the effects of a given treatment or intervention, in spite of typically high costs and demanding logistics. Clearly, rather few full-scale intervention trials with disease outcomes can be afforded, so the question is better focused on the interplay and complementary role that can be fulfilled by the two study designs. Hence, pertinent questions relate to the criteria, and the hypothesis and intervention development processes, that are needed to establish the feasibility and potential of a full-scale intervention trial.

4.1 Combined HT and Cardiovascular Disease

The rather few situations where there is evidence from observational studies and from one or more intervention trials provide an important opportunity to examine this interplay. The WHI HT trials and a large body of preceding observational studies provide such an opportunity. In fact, few research reports have stimulated as much public response (The End of the Age of Estrogen, 2002; The Truth about Hormones, 2002) or have engendered as sustained a discussion among medical practitioners and researchers as the results of the WHI E+P trial. While a major reduction in CHD incidence had been hypothesized based on a substantial body of observational research (Stampfer et al., 1991; Grady et al., 1992; Barrett-Conner and Grady, 1998), the WHI E+P trial found an elevation in CHD risk, and assessed that overall health risks exceeded benefits over an average 5.6-year follow-up period (Writing Group for the Women’s Health Initiative, 2002; Manson et al., 2003). Table 2 shows Cox model hazard ratio estimates and nominal 95% confidence intervals from the E+P trial, and from the companion E-alone trial, from the Writing Group for the WHI (2002) and WHI Steering Committee (2004), respectively, where confidence intervals adjusted for multiple testing can also be found. Note the apparent impact of E+P, and to a lesser extent E-alone, on multiple important clinical outcomes.

The lack of explanation for the departure of E+P trial results on CHD from expectation based on observational studies has prompted some clinicians and researchers to hypothesize flaws in the WHI trial (e.g., Creasman et al., 2003; Goodman, Goldzieher, and Ayala, 2003). Others have argued lack of relevance of trial results to important sub-groups of combined HT users. For example, a recent contribution noted that WHI was not designed to provide a powerful test of cardioprotective effects among 50- to 54-year-old women in menopausal transition, and concluded that observational studies provide “the only applicable clinical guide to this issue” (Naftolin et al., 2004).

Other authors have speculated on reasons for a discrepancy between WHI E+P trial results and related observational research, citing confounding in observational studies, the limited ability of observational studies to assess

Table 2
Clinical outcomes in the WHI postmenopausal hormone therapy trials

                                     E+P trial                     E-alone trial
Outcomes                             Hazard ratio   95% CI         Hazard ratio   95% CI
Coronary heart disease               1.29           1.02–1.63      0.91           0.75–1.12
Stroke                               1.41           1.07–1.85      1.39           1.10–1.77
Venous thromboembolism               2.11           1.58–2.82      1.33           0.99–1.79
Invasive breast cancer               1.26           1.00–1.59      0.77           0.59–1.01
Colorectal cancer                    0.63           0.43–0.92      1.08           0.75–1.55
Endometrial cancer                   0.83           0.47–1.47      –              –
Hip fracture                         0.66           0.45–0.98      0.61           0.41–0.91
Death due to other causes            0.92           0.74–1.14      1.08           0.88–1.32
Global index                         1.15           1.03–1.28      1.01           0.91–1.12
Number of women                      8506           8102           5310           5429
Follow-up time, mean (SD), months    62.2 (16.1)    61.2 (15.0)    81.6 (19.3)    81.9 (19.7)

short-term effects, differences among combined HT preparations, and differences among populations of women studied as possible reasons (Grodstein, Clarkson, and Manson, 2003; Michels and Manson, 2003; Ray, 2003). The April 2004 issue of the International Journal of Epidemiology includes several commentaries on this topic that illustrate the continuing diversity of opinion on the sources of the discrepancy, and on the clinical implications of the available evidence.

Related perspectives on study designs that are needed to obtain reliable public health information have ranged from the statement (Herrington and Howard, 2003) that “many people suspended ordinary standards of evidence concerning medical interventions and concluded that HT was the right thing to prevent heart disease in millions of postmenopausal women despite the absence of any large-scale CT quantifying its overall risk–benefit ratio” to the assertion (Whittemore and McGuire, 2003) that “the good agreement between the observational studies and the [WHI] trial on end points other than CHD confirms the utility and validity of observational studies as monitors of new preventive agents.”

Recently, Prentice et al. (2005) analyzed data from the WHI combined hormone trial among 16,608 women with a uterus, and the corresponding subset of 53,054 women in the WHI observational study who had a uterus and were not using unopposed estrogen at baseline, in an attempt to resolve this apparent discrepancy. See Langer et al. (2003) and Prentice et al. (2005) for a description of the distribution of cardiovascular disease risk factors in the two cohorts. Compared to nonusers, OS women who were using E+P preparations at baseline tended to be younger, leaner, of higher socioeconomic status, and with a lesser history of cardiovascular disease. The analyses in Prentice et al. (2005) included CHD and venous thromboembolism (VT), both of which had been shown in the CT (Writing Group for the Women’s Health Initiative, 2002) to have had hazard ratios for combined hormone (E+P) use that declined with increasing time from randomization, as well as stroke. The Cox regression model

$$\lambda\{t; X(t), Z\} = \lambda_{0s}(t)\,\exp\{x(t)\beta_c + z\gamma\} \qquad (3)$$

was employed in these analyses, where the hazard rate model for a specific clinical outcome included a $\lambda_{0s}$ function that was stratified (s) on baseline age in 5-year intervals, as well as cohort (CT or OS), and included treatment effects that may depend on the history X(t) of E+P use up to time t following enrollment (t = 0) in the WHI, and baseline potential confounding factors Z. Principal interest resided in the treatment coefficients $\beta_c$, which were allowed to differ between the CT (c = 0) and the OS (c = 1). The modeled regression vector z was formed from the baseline potential confounding factors Z.

Initial analyses included an indicator variable x(t) = 1 if the woman was assigned to the active intervention group in the CT with x(t) = 0 in the placebo group, and x(t) = 1 if the woman was among the 33% of these OS women who were using combined hormones at baseline, and x(t) = 0 otherwise, without confounding factor control. For CHD, these analyses gave a hazard ratio estimate for E+P use in the OS that was only 61% of that in the CT. More specifically, the ratio (95% CI) of the E+P hazard ratio in the OS to that in the CT was 0.61 (0.46, 0.81) following simple 5-year age stratification. The corresponding ratio of hazard ratios for VT was 0.52 (0.37, 0.73), indicating that the apparent discrepancy is not just an issue for CHD. Including a vector of potential confounding factors, z, in (3) provided a partial explanation for such discrepancies, as the ratio of hazard ratios became 0.71 (0.52, 0.95) for CHD and 0.62 (0.43, 0.88) for VT following control for such factors as body mass index, education, cigarette smoking history, age at menopause, a baseline physical functioning measure, and age (linear) within the 5-year strata. The remainder of the discrepancy for these diseases was largely explained by acknowledging a hazard ratio dependence on time from initiation of E+P use, using the exposure history X(t). In the CT, time from initiation of E+P use was defined as time from randomization with time-dependent indicator variables $x(t) = \{x_1(t), x_2(t), x_3(t)\}$ defined according to whether women assigned to active treatment were less than 2, 2 to 5, or more than 5 years from randomization. Women using hormone therapy during screening for the hormone therapy trials were required to undergo a “wash-out” period prior to randomization. In the OS, some women had been using E+P for several years prior to enrollment. For these women, the indicator variables x(t) were defined to take

Table 3
E+P hazard ratios (95% CIs) in the CT and OS as a function of years from E+P initiation∗

                  Coronary heart disease                           Venous thromboembolism
Years from E+P    CT                      OS                       CT                      OS
initiation        HR (95% CI; m†)         HR (95% CI; m)           HR (95% CI; m)          HR (95% CI; m)
<2                1.68 (1.15, 2.45; 80)   1.12 (0.46, 2.74; 5)     3.10 (1.85, 5.19; 73)   2.37 (1.08, 5.19; 7)
2–5               1.25 (0.87, 1.79; 80)   1.05 (0.70, 1.58; 27)    1.89 (1.24, 2.88; 72)   1.52 (1.01, 2.29; 27)
>5                0.66 (0.36, 1.21; 28)   0.83 (0.67, 1.01; 126)   1.31 (0.64, 2.67; 22)   1.24 (0.99, 1.55; 119)
∗ From Prentice et al. (2005).
† m is the number of E+P group women developing disease during WHI follow-up.

value 1 according to whether the E+P usage episode prior to OS enrollment plus time from WHI enrollment was less than 2, 2 to 5, or more than 5 years at follow-up time t. A usage gap of 1 year or more defined a new hormone therapy episode.

With these definitions, and with the same potential confounding factors as in the analyses previously mentioned, there was no longer significant evidence of different treatment effect parameters between the CT and OS (Table 3) for either clinical outcome (p-values for likelihood ratio test of $\beta_0 = \beta_1$ were greater than 0.6 for CHD, and 0.8 for VT). Evidently, a major component of the apparent discrepancy for these outcomes arises from the fact that OS enrollment included few recent E+P initiators and hence little information on effects during the early years of E+P use, whereas the CT was relatively sparse following 5 or more years from randomization, while the hazard ratios decreased with increasing years from E+P initiation. The ratio of OS to CT hazard ratios for E+P (95% CI) after accounting for both years from hormone therapy initiation and confounding was 0.93 (0.64, 1.36) for CHD, and 0.84 (0.54, 1.28) for VT based on an analysis that included common β’s in (3) for each of the three time periods, plus a product term between the combined hormone group indicator and the indicator for OS versus CT cohort.

Reanalyses of other observational study data, using methods like those leading to Table 3, may similarly align their results with those from the WHI E+P trial. Other factors may also prove to be important. For example, Nurses’ Health Study investigators reported a substantially lower CHD risk among postmenopausal hormone therapy (E-alone and E+P) users (Grodstein et al., 2000) and this study enrolled primarily premenopausal women and hence was in a position to identify women who initiated E+P during cohort follow-up. However, apparently only biennial indicators of hormone therapy use were used in these analyses. Hence a woman who initiates E+P could be regarded as a nonuser for much of the first 2 years of use, during which the greatest hazard ratio elevation occurs. To assess the potential effects of such limitations in E+P exposure data on hazard ratio estimates, we undertook an exercise in the WHI E+P trial cohort as follows. For each E+P group woman we generated an ascertainment time, uniformly distributed over the first 2 years from randomization. Furthermore, we generated a random E+P stopping time. E+P group women were then regarded as nonusers up to their time of ascertainment if ascertainment preceded stopping E+P and permanently as nonusers if stopping preceded ascertainment. Motivated by hormone therapy stopping rates in community studies, the E+P stopping time density was taken to be uniform over the first 6 months with 20% stopping probability by 6 months, and uniform from 6 months to 2 years with a cumulative stopping probability of 59% at 2 years. Following final outcome adjudication, the E+P trial gave a summary CHD hazard ratio (95% CI) of 1.24 (1.00, 1.54) and a standardized hazard ratio trend statistic of −2.36 (p = 0.02) (Manson et al., 2003). This trend statistic arose by adding to the E+P group indicator variable a product term between this indicator variable and time (days) from randomization. The trend statistic was defined as the maximum partial likelihood estimate for this product term divided by its estimated standard deviation. Ten runs of the contamination process just described were carried out, yielding respective hazard ratio (HR) estimates (95% CI) of 1.16 (0.91, 1.47), 1.01 (0.80, 1.29), 1.25 (0.99, 1.58), 0.97 (0.76, 1.24), 1.23 (0.97, 1.55), 1.09 (0.86, 1.39), 1.13 (0.89, 1.43), 1.18 (0.93, 1.49), 1.07 (0.85, 1.36), and 1.08 (0.85, 1.37). The corresponding standardized trend statistics took values of −1.59, −1.38, −0.35, −0.07, −1.03, −2.02, −0.86, −0.59, −1.10, and −1.78. It seems evident that this type of limitation in exposure data can have important effects on study results if hazard ratios are strongly time dependent.

4.2 Statistical Methods for Time-Varying Hazard Ratios

Proportional hazards modeling assumptions will provide a suitable approximation in many applications. In situations where all study subjects are followed from randomization or other natural time origin for the “exposure” of interest, hazard ratio estimates arising from a proportionality assumption may provide simple and useful summary measures, even if the hazard ratio is moderately time dependent. Specifically, such estimates can be given an average hazard ratio interpretation over the study follow-up period. However, when study subjects enter a study late relative to initiation of the exposure of interest, as for hormone therapy in the OS, summary statistics calculated under a proportionality assumption may be quite sensitive to departure from a proportional hazards assumption. More generally, aspects of the hazard ratio shape may be of considerable interest in assessing the short- and long-term implications of a treatment. Statistical research is needed to develop suitable methods for summarizing treatment effects over defined exposure durations when hazard ratios are time dependent. For example, if baseline hazard rates, $\lambda_{0s}(\cdot)$ in the Cox model (3), are not strongly dependent on time (t)

estimates of hazard ratios averaged over specified treatment durations may be useful, and can be based on estimates of β and its asymptotic distribution. For example, the upper part of Table 4 shows HR estimates for CHD and VT as a function of time from E+P initiation, when these estimates are restricted to be common to the CT and OS. The lower part of Table 4 shows corresponding average hazard ratio estimates and nominal 95% confidence intervals, obtained using the delta method, over various time periods from E+P initiation. Note that these analyses suggest that the HR for CHD may drop below one at 5 or more years from E+P initiation. An HR below one, however, does not by itself imply cardioprotection in view of the likely selection of women at high risk for CHD at earlier times from E+P initiation. Also, the lower part of Table 4 shows an average HR estimate above one, even over a 10-year period from E+P initiation. Finally, the suggestion of an HR below one at more than 5 years from initiation derives largely from OS data, so the possibility of residual confounding needs to be kept in mind in interpreting these analyses.

Table 4
E+P hazard ratios (95% CIs) as a function of years from E+P initiation, and average HRs over various times from E+P initiation, assuming common HR functions in the CT and OS

Years from E+P    Coronary heart disease    Venous thromboembolism
initiation        HR (95% CI)               HR (95% CI)
<2                1.56 (1.12, 2.19)         2.87 (1.89, 4.35)
2–5               1.16 (0.89, 1.51)         1.70 (1.28, 2.26)
>5                0.81 (0.67, 0.99)         1.26 (1.02, 1.56)

                  Average HR (95% CI)       Average HR (95% CI)
2                 1.56 (1.12, 2.19)         2.87 (1.89, 4.35)
4                 1.36 (1.09, 1.70)         2.28 (1.72, 3.03)
6                 1.27 (1.04, 1.54)         2.07 (1.62, 2.63)
8                 1.13 (0.96, 1.33)         1.83 (1.50, 2.23)
10                1.07 (0.92, 1.24)         1.71 (1.43, 2.05)

More generally, one might consider ratios between treatment groups of estimates of cumulative hazards, or cumulative incidence rates, as summary measures of treatment effects in the presence of time-varying hazard functions. These measures would be more complex since estimates of baseline hazard rates would be involved. These types of summary measures could be considered for the type of step function hazard ratio model shown in Table 3, or for smooth hazard ratio models, such as that recently proposed by Yang and Prentice (2005), which includes separate parameters for short- and long-term hazard ratios with a hazard ratio function that varies smoothly with t, or for the rather general class of hazard ratio models discussed by Fahrmeir and Klinger (1998).

4.3 Intervention Adherence and Causal Inference Methods

The analyses described in Section 4.1 used the randomization assignment and baseline current use of hormones in the OS to define a treatment indicator variable. This was done so that we could compare hazard ratio estimates in the OS to “intention-to-treat” hazard ratio estimates in the CT, the latter having a useful interpretation and comparative freedom from assumption. The magnitude of treatment effects among persons who adhere to their treatment group assignment, however, is likely to differ from that among those who do not, and differential adherence patterns between the CT and OS could themselves be a source of hazard ratio discrepancy. Hence, the analyses of Table 3 and the upper part of Table 4 were re-run censoring a woman’s follow-up period at 6 months beyond a change in E+P group status (stopped E+P use in the active groups, or initiated hormone therapy in the control groups). As shown in Table 5, this analysis among adherent women does produce HR estimates that are somewhat more distant from unity, as expected, but the patterns are similar to those given in Tables 3 and 4. This type of adherence-adjusted analysis represents a rather simple approach to a complex issue. Other approaches (e.g., Cuzick, Edwards, and Segnan, 1997; Frangakis and Rubin, 1999) are certainly worth considering, particularly if detailed and reliable adherence histories are available. In the WHI hormone therapy trials, quantitative adherence data were obtained, primarily through the use of weighed returned pill bottles, whereas in the OS adherence data were updated through annual questionnaires, and are essentially qualitative, thereby limiting the range of adherence-adjusted analyses that can be entertained.
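As a rough illustration of the average hazard ratios in the lower part of Table 4, the sketch below computes a time-weighted average of a step-function hazard ratio over a chosen duration. It assumes a baseline hazard that is roughly constant in time, so that the average reduces to a duration-weighted mean of the period-specific HRs; the function name and arguments are hypothetical, and the published estimates and delta-method confidence intervals were derived from the fitted regression coefficients rather than from rounded table entries, so the output need not match every tabulated value.

```python
def average_hazard_ratio(breakpoints, hazard_ratios, horizon):
    """Time-weighted average of a step-function hazard ratio over [0, horizon].

    breakpoints   -- increasing times (years) at which the HR changes,
                     e.g. [2, 5] for the periods <2, 2-5 and >5 years
    hazard_ratios -- one HR per period, so len(breakpoints) + 1 values
    horizon       -- duration (years) over which to average

    Assumption (not from the paper): the baseline hazard is constant,
    so each period is weighted only by the time spent in it.
    """
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    if len(hazard_ratios) != len(breakpoints) + 1:
        raise ValueError("need one hazard ratio per period")
    edges = [0.0] + list(breakpoints) + [float("inf")]
    total = 0.0
    for lower, upper, hr in zip(edges[:-1], edges[1:], hazard_ratios):
        # Years of the averaging window that fall inside this period.
        overlap = max(0.0, min(upper, horizon) - lower)
        total += hr * overlap
    return total / horizon


# Period-specific point estimates from the upper part of Table 4.
CHD_HR = [1.56, 1.16, 0.81]
VT_HR = [2.87, 1.70, 1.26]

if __name__ == "__main__":
    for years in (2, 4, 8, 10):
        print(years,
              average_hazard_ratio([2, 5], CHD_HR, years),
              average_hazard_ratio([2, 5], VT_HR, years))
```

Under this simple weighting the 4-year CHD average is (2 × 1.56 + 2 × 1.16)/4 = 1.36, in line with the tabulated entry. A baseline hazard that varies with time would call for weighting by the baseline hazard as well.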

Table 5
Adherence sensitivity analyses of hazard ratios in the CT and OS and combined CT and OS as a function of years from E+P initiation

Years from E+P    CT                   OS                   CT/OS
initiation        HR (95% CI)          HR (95% CI)          HR (95% CI)
Coronary heart disease
<2                1.75 (1.19, 2.58)    1.03 (0.38, 2.81)    1.62 (1.14, 2.29)
2–5               1.47 (1.00, 2.17)    1.08 (0.69, 1.68)    1.28 (0.96, 1.70)
>5                0.60 (0.27, 1.29)    0.82 (0.66, 1.03)    0.81 (0.66, 1.00)
Venous thromboembolism
<2                3.16 (1.89, 5.31)    2.60 (1.10, 6.07)    3.01 (1.95, 4.64)
2–5               2.15 (1.37, 3.39)    1.81 (1.17, 2.81)    1.98 (1.46, 2.70)
>5                1.86 (0.87, 3.98)    1.28 (1.00, 1.64)    1.34 (1.06, 1.69)
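The adherence sensitivity analyses summarized in Table 5 censor a woman’s follow-up at 6 months beyond a change in her E+P group status. A minimal sketch of such a censoring rule is given below; the record layout, field names, and the use of years as the time unit are assumptions made for illustration and are not taken from the WHI analysis files.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

GRACE_YEARS = 0.5  # 6 months beyond a change in E+P group status


@dataclass
class FollowUp:
    """Hypothetical per-woman record; times are years from the time origin."""
    end_time: float   # time of the outcome event or end of follow-up
    event: bool       # True if the outcome occurred at end_time
    # Time at which group status changed (stopped E+P in the active group,
    # or initiated hormone therapy in the control group); None if never.
    status_change_time: Optional[float] = None


def adherence_censored(record: FollowUp) -> Tuple[float, bool]:
    """Return (time, event) after censoring at status change + 6 months."""
    if record.status_change_time is None:
        return record.end_time, record.event
    cutoff = record.status_change_time + GRACE_YEARS
    if record.end_time <= cutoff:
        # Outcome or end of follow-up falls inside the grace window: keep it.
        return record.end_time, record.event
    # Otherwise follow-up is cut at the cutoff and any later event is dropped.
    return cutoff, False
```

The censored times and event indicators could then be supplied to the same Cox regression used for the intention-to-treat analyses. As the surrounding text notes, censoring of this kind may depend on preceding adherence history, so the resulting estimates are best read as a sensitivity analysis.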

Some authors make a strong connection between adherence-adjusted analysis and so-called causal inference (Angrist, Imbens, and Rubin, 1996) and label treatment effect parameters that would apply if there were full adherence as “causal” parameters. While it is certainly of interest to consider assumptions that would lead to identifiability of such treatment parameters, the issue of causal interpretation would seem much more closely related to the type of study design, with randomized controlled designs having a distinct advantage through the statistical independence between treatment and all baseline confounding factors, whether or not such factors can be well measured, or are even recognized. In comparison, observational study analyses typically must begin with such critical assumptions as no unmeasured confounders, an ignorable “treatment assignment mechanism,” and nondifferential outcome ascertainment. These assumptions may often be uncertain enough to raise questions about the causality of any estimated associations. Adherence-adjusted analyses, whether in an observational or randomized trial setting, additionally must deal with the issues that adherence to treatment goals may be highly variable due to study subject characteristics or to properties of the intervention, and that rates of censoring of follow-up times may depend on preceding adherence histories. Hence, in realistic situations adherence-adjusted analyses are best regarded as sensitivity analyses, and associated parameter estimates (e.g., full adherence hazard ratio estimates) as data extrapolation that may be less meaningful if nonadherence arises for treatment-related reasons, but of greater interest if adherence history can be regarded as a variable intrinsic to the study subject that is not affected by treatment.

In the WHI E+P trial it would not seem appropriate to regard adherence as an intrinsic study subject characteristic. For example, in the active treatment group a larger fraction of women than expected experienced persistent vaginal bleeding following initiation of this combined hormone regimen. The protocol called for dosage modification, or the use of other hormonal agents, in response to bleeding that persisted for several months or years, and some women chose to discontinue study pills due to this side effect. Vaginal bleeding in the placebo group was far less common, but more likely to be indicative of endometrial pathology, giving rise to biopsy and the possibility of discontinuation of study pills for other reasons. Breast tenderness was another important issue for participating women that may be treatment related. Also, long-term adherers to treatments that have potential to affect many body organs and systems, and that are subject to high-profile media coverage, likely have many biobehavioral characteristics that distinguish them from short-term users, and the extent to which such characteristics can be measured and adequately accommodated in data analysis is unclear. The context of a randomized controlled trial typically offers substantial advantages in providing independence between any such baseline biobehavioral factors and treatment group assignment, and also through the provision of a context for censoring rates that may depend little on such factors or upon actual adherence, provided study participants provide clinical outcome data in a comprehensive fashion regardless of their extent of adherence to intervention activities. Issues of adherence modeling and interpretation merit continued statistical development, with much to be learned through specific applications, such as arise in the WHI.

5. Discussion

Compared to therapeutic research among persons having disease, rather few statisticians devote their energies to disease prevention research. The wide variation in the rates of chronic diseases around the world, and the results of prevention trials to date for various prominent chronic diseases (e.g., Prentice, 2004) support the concept that chronic disease risk can be altered within relatively few years, even at advanced ages, by practical lifestyle and pharmaceutical approaches. Statisticians have an important role to play in the realization of this potential.

There are a number of pivotal study design, conduct, and analysis issues that pose rate-limiting obstacles to progress in the primary disease prevention area. The WHI illustrates some of these, including measurement error modeling methods for the study of disease rate associations with difficult-to-measure dietary and physical activity exposures; intervention development methods using high-dimensional genomic and proteomic data; trial monitoring and analysis methods when multiple disease outcomes may be affected by an intervention; and research to elucidate the interplay between observational studies, randomized trials having intermediate outcomes, and full-scale intervention trials. Prevention research is intrinsically multidisciplinary, with the statistical role on a par with that of other key disciplines.

Reviewers of this article have requested additional discussion of some of the points raised above, particularly concerning the advantages and disadvantages of specifying composite indices formed by several clinical outcomes in data monitoring and analysis; concerning trial monitoring considerations for early stopping in the WHI hormone therapy trials given the possibility of hazard ratios below one after several years of use; and concerning lessons that have been learned from WHI for future clinical trial and observational study design.

While no simple index can be expected to adequately summarize intervention effects on several clinical outcomes that may each have their own time course, it seems quite important for study monitoring and reporting to specify a clear trial monitoring plan before meaningful clinical outcome data become available within the trial. In the case of each of the WHI CT components, the monitoring plan gave a special place to the trial’s primary outcome, the prevention of which motivated and justified the trial, and in the case of the HT trials to an anticipated safety outcome (breast cancer). Beyond these outcomes, however, the specification of a so-called global index in an attempt to summarize benefits and risks of the intervention seemed quite valuable for trial monitoring, and the exercises (scenarios) used in developing these indices and the overall monitoring procedure were quite valuable to the DSMB. For example, these exercises facilitated the identification and resolution of differing viewpoints among board members in advance of needing to make recommendations based on trial outcome data. Of course, monitoring committees will appropriately want to examine data beyond these primary outcomes and summary indices, and the reporting of trial results could usefully include analyses of the robustness

of clinical implications to variations in the composition of summary indices, and to other aspects of the reporting process.

Some reviewers raised questions about whether the E+P trial should have stopped after an average 5.6 years of follow-up in view of the potential long-term benefits (Table 3). Certainly, these are complex and challenging decisions, and the time course of evolving and potential future risks and benefits is one of the most difficult to assimilate into trial monitoring procedures. Statistical methods for trial monitoring also seem quite limited in this respect, in that most formal sequential testing procedures make a proportional hazards assumption for outcomes that may affect an early stopping decision. In the case of the WHI E+P trial, an elevation in the designated safety outcome, breast cancer, was the trigger for an early stopping consideration under the monitoring guidelines, and this elevation was supported by a global index value indicating that risks exceeded benefits over the intervention period. These statistics were supplemented by various other less formal outcome contrasts, and conditional power calculations under various scenarios concerning future trends constituted the statistical input to early stopping considerations, with the DSMB reserving the option of making recommendations based on their own judgments which may, for example, be informed also by data external to the trial. Additional publications are under development to elaborate the data and considerations leading to the early stopping of the two WHI HT trials.

There are many lessons from WHI relative to the design of disease prevention trials and cohort studies. Two that may merit repeating relate to HR function shape in cohort study design and analysis, and the complementary role of trials and cohort studies in assessing the overall benefits and risks of a preventive intervention. If an exposure, such as hormone therapy, is a major motivation for a cohort study, then attention should be directed to the enrollment of a sufficient number of new initiators of such exposure (e.g., Ray, 2003) in order to be in a position to assess short-term intervention effects. Even if a sizeable number of new initiators are enrolled, cohort study data analyses may often need to use summary measures of exposure effect, such as average hazard ratios, to allow for time variation in hazard ratios, and to summarize exposure effects over defined exposure periods.

For reasons of cost, logistics, and ethics, preventive intervention trials may often not be able to be continued as long as would be necessary to assess risks and benefits of the long-term use of an intervention, or even to assess the longer-term risks and benefits of a relatively short-term intervention. Observational study data, strengthened by joint analysis with intervention trial data when practical, are essential for assessing such long-term effects, and for examining interactions of exposure effects with study subject characteristics, which CTs are typically not designed to do in a powerful fashion.

Finally, the surprising results from the WHI HT trials reinforce questions about the adequacy of the hypothesis development and early evaluation infrastructure for the national and international disease prevention program. Attention to observational study design and analysis issues can strengthen this infrastructure. The promise of comprehensive genomic and proteomic tools may also strengthen this “enterprise” by enhancing the development of interventions that are likely to have favorable benefit versus risk profiles, thereby setting the stage for additional valuable primary disease prevention trials.

Acknowledgements

This work was supported by grant CA-53996 from the National Cancer Institute, and by contract WH-2-2110 from the National Heart, Lung, and Blood Institute.

References

Anderson, G. L., Manson, J., Wallace, R., Lund, B., Hall, D., Davis, S., Shumaker, S., Wang, C. Y., Stein, E., and Prentice, R. L. (2003). Implementation of the Women’s Health Initiative study design. Annals of Epidemiology 13, 5–17.
Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association 91, 444–455.
Barratt, B. J., Payne, F., Rance, H. E., Nutland, S., Todd, J. A., and Clayton, D. G. (2002). Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Annals of Human Genetics 66, 393–405.
Barrett-Conner, E. and Grady, D. (1998). Hormone replacement therapy, heart disease, and other considerations. Annual Review of Public Health 19, 55–72.
Bingham, S. A. (2002). Biomarkers in nutritional epidemiology. Public Health Nutrition 5, 821–827.
Bingham, S. A., Luben, R., Welch, A., Wareham, N., Khaw, K. T., and Day, N. (2003). Are imprecise methods obscuring a relationship between fat and breast cancer? Lancet 362, 212–214.
Boyd, N. F., Stone, J., Vogt, K. N., Connelly, B. S., Martin, L. J., and Minkin, S. (2003). Dietary fat and breast cancer revisited: A meta-analysis of the published literature. British Journal of Cancer 89, 1672–1685.
Breslow, N. E. and Day, N. E. (1987). Statistical Methods for Cancer Research 2. The Design and Analysis of Cohort Studies. IARC Scientific Publication 82. Lyon, France: International Agency for Research on Cancer.
Calle, E. E., Rodriquez, C., Walker-Thurmond, K., and Thun, M. J. (2003). Overweight, obesity, and mortality from cancer in a prospectively studied cohort of U.S. adults. New England Journal of Medicine 348, 1625–1638.
Carroll, R. J., Ruppert, D., and Stefanski, L. A. (1995). Measurement Error in Nonlinear Models. New York: Chapman and Hall.
Creasman, W. T., Hoel, D., and DiSaia, P. J. (2003). WHI: Now that the dust has settled: A commentary. American Journal of Obstetric Gynecology 189, 621–626.
Cuzick, J., Edwards, R., and Segnan, N. (1997). Adjusting for non-compliance and contamination in randomized clinical trials. Statistics in Medicine 16, 1017–1029.
Diamandis, E. P. (2004). Analysis of serum proteomic patterns for early cancer diagnostics: Drawing attention to potential problems. Journal of the National Cancer Institute 96, 353–356.
Downes, K., Barratt, B. J., Akan, P., Bumpstead, S. J., Taylor, S. D., Clayton, D. G., and Deloukas, P. (2004). SNP allele frequency estimation in DNA pools and variance component analysis. Biotechniques 36, 840–845.
The End of the Age of Estrogen [cover story]. (2002). Newsweek July 22.
Fahrmeir, L. and Klinger, A. (1998). A nonparametric multiplicative hazard model for event history analysis. Biometrika 85, 581–592.
Feng, Z., Prentice, R. L., and Srivastava, S. (2004). Research issues and strategies for genomic and proteomic biomarker discovery and validation: A statistical perspective. Pharmacogenomics 5, 709–719.
Frangakis, C. E. and Rubin, D. B. (1999). Addressing complications of intention-to-treat analysis in the combined presence of all-or-none treatment non-compliance and subsequent missing outcomes. Biometrika 86, 365–379.
Freedman, L. S., Anderson, G. L., Kipnis, V., Prentice, R. L., Wang, C. Y., Rossouw, J. R., Wittes, J., and DeMets, D. (1996). Approaches to monitoring the results of long-term disease prevention trials: Examples from the Women’s Health Initiative. Controlled Clinical Trials 17, 509–525.
Gabriel, S. B., Schaffner, S. F., Nguyen, H., et al. (2003). The structure of haplotype blocks in the human genome. Science 296, 2225–2229.
Gibbs, R. A., Belmont, J. W., Hardenbol, P., et al. (2003). The International HapMap Consortium. The International HapMap Project. Nature 426, 789–796.
Goodman, D., Goldzieher, J., and Ayala, C. (2003). Critique of the report from the Writing Group of the WHI. Menopausal Medicine 10, 1–4.
Grady, D., Rubin, S. B., Petitti, D. B., et al. (1992). Hormone therapy to prevent disease and prolong life in postmenopausal women. Annals of Internal Medicine 117, 1016–1037.
Greenwald, P. (1999). Role of dietary fat in the causation of breast cancer: Point. Cancer Epidemiology Biomarkers and Prevention 8, 3–7.
Grodstein, F., Manson, J. E., Colditz, G. A., Willett, W. C., Speizer, F. E., and Stampfer, M. J. (2000). A prospective observational study of post-menopausal hormone therapy and primary prevention of cardiovascular disease. Annals of Internal Medicine 133, 933–941.
Grodstein, F., Clarkson, T. B., and Manson, J. E. (2003). Understanding the divergent data on post-menopausal hormone therapy. New England Journal of Medicine 348, 645–650.
Hebert, J. R., Clemow, L., Pbert, L., Ockene, I. S., and Ockene, J. K. (1995). Social desirability bias in dietary self-report may compromise the validity of dietary intake measures. International Journal of Epidemiology 24, 389–398.
Heitmann, B. L. and Lissner, L. (1995). Dietary underreporting by obese individuals: Is it specific or non-specific? British Medical Journal 311, 986–989.
Herrington, D. M. and Howard, T. D. (2003). From presumed benefit to potential harm: Hormone therapy and heart disease. New England Journal of Medicine 349, 519–521.
Huang, Y. and Wang, C. Y. (2000). Cox regression with accurate covariates unascertainable: A nonparametric correction approach. Journal of the American Statistical Association 95, 1209–1219.
Hunter, D. J. (1999). Role of dietary fat in the causation of breast cancer: Counter-point. Cancer Epidemiology Biomarkers and Prevention 8, 9–13.
Kaaks, R., Ferrari, P., Ciampi, A., Plummer, M., and Riboli, E. (2002). Uses and limitations of statistical accounting for random error correlations, in the validation of dietary questionnaire assessments. Public Health Nutrition 5, 969–976.
Kalbfleisch, J. D. and Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data, 2nd edition. New York: John Wiley and Sons.
Kipnis, V., Subar, A. F., Midthune, D., et al. (2003). Structure of dietary measurement error: Results of the OPEN biomarker study. American Journal of Epidemiology 158, 14–21.
Kruglyak, L. (1999). Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nature Genetics 22, 139–144.
Langer, R. D., White, E., Lewis, C. E., Kotchen, J. M., Hendrix, S. L., and Trevisan, M. (2003). The Women’s Health Initiative observational study: Baseline characteristics of participants and reliability of baseline measures. Annals of Epidemiology 13, S107–S121.
Le Hellard, S., Ballereau, S. J., Visscher, P. M., et al. (2002). SNP genotyping on pooled DNAs: Comparison of genotyping technologies and a semi-automated method for data storage and analysis. Nucleic Acids Research 30, 1–10.
Manson, J. E., Hsia, J., Johnson, K. C., et al., for the Women’s Health Initiative Investigators. (2003). Estrogen plus progestin and the risk of coronary heart disease. New England Journal of Medicine 349, 523–534.
Michels, K. B. and Manson, J. E. (2003). Postmenopausal hormone therapy: A reversal of fortune. Circulation 107, 1830–1833.
Mohlke, K. L., Erdos, M. R., Scott, L. J., et al. (2002). High-throughput screening for evidence of association by using mass spectrometry genotyping on DNA pools. Proceedings of the National Academy of Sciences of the United States of America 99, 16928–16933.
Naftolin, F., Taylor, H. S., Karas, R., et al. (2004). The Women’s Health Initiative could not have detected cardioprotective effects of starting hormone therapy during the menopausal transition. Fertility and Sterility 81, 1498–1501.
Prentice, R. L. (2004). Chronic disease prevention: Public health potential and research needs. Statistics in Medicine 23, 3409–3420.
Prentice, R. L. and Anderson, G. (2005). Women’s Health Initiative: Statistical aspects and early results. In Encyclopedia of Clinical Trials, 2nd edition, P. Armitage and T. Colton (eds). New York: Wiley.
Prentice, R. L., Sugar, E., Wang, C. Y., Neuhouser, M., and Patterson, R. (2002). Research strategies and the use of nutrient biomarkers in studies of diet and chronic disease. Public Health Nutrition 5, 977–984.
Prentice, R. L., Willett, W. C., Greenwald, P., et al. (2004). Nutrition and physical activity and chronic disease prevention: Research strategies and recommendations. Journal of the National Cancer Institute 96, 1276–1287.
Prentice, R. L., Langer, R., Stefanick, M., et al. (2005). Combined postmenopausal hormone therapy and cardiovascular disease: Toward resolving the discrepancy between the observational studies and the Women’s Health Initiative clinical trial. American Journal of Epidemiology 162, 1–11.
Ray, W. A. (2003). Evaluating medication effects outside of clinical trials: New-user designs. American Journal of Epidemiology 158, 915–920.
Satagopan, J. M., Venkatraman, E. S., and Begg, C. B. (2004). Two-stage designs for gene-disease association studies with sample size constraints. Biometrics 60, 589–597.
Schoeller, D. A. (2002). Validation of habitual energy intake. Public Health Nutrition 5, 883–888.
Sham, P., Bader, J. S., Craig, I., O’Donovan, M., and Owen, M. (2002). DNA pooling: A tool for large-scale association studies. Nature Reviews Genetics 3, 862–871.
Simon, R., Radmacher, M. D., Dobbin, K., and McShane, L. M. (2003). Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute 95, 14–18.
Stampfer, M. and Colditz, G. (1991). Estrogen replacement therapy and coronary heart disease: A quantitative assessment of the epidemiologic evidence. Preventive Medicine 20, 47–63.
Subar, A. F., Kipnis, V., Troiano, R. P., et al. (2003). Using intake biomarkers to evaluate the extent of dietary misreporting in a large sample of adults: The OPEN study. American Journal of Epidemiology 158, 1–13.
Tibshirani, R. and Efron, B. (2002). Pre-validation and inference in microarrays. Statistical Applications in Genetics and Molecular Biology 1, Article 1, The Berkeley Electronic Press, http://www.bepress.com/sagmb.
The Truth about Hormones [cover story]. (2002). Time July 22.
Whittemore, A. S. and McGuire, V. (2003). Observational studies and randomized studies of hormone replacement therapy: What can we learn from them? Epidemiology 14, 8–10.
Willett, W. C., Sampson, L., Stampfer, M. J., et al. (1985). Reproducibility and validity of a semiquantitative food frequency questionnaire. American Journal of Epidemiology 122, 51–65.
Women’s Health Initiative Steering Committee. (2004). Effects of conjugated equine estrogen in post-menopausal women with hysterectomy: The Women’s Health Initiative randomized controlled trial. Journal of the American Medical Association 291, 1701–1712.
Women’s Health Initiative Study Group. (1998). Design of the Women’s Health Initiative clinical trial and observational study. Controlled Clinical Trials 19, 61–109.
Writing Group for the Women’s Health Initiative Investigators. (2002). Risks and benefits of estrogen plus progestin in healthy post-menopausal women. Principal results from the Women’s Health Initiative randomized controlled trial. Journal of the American Medical Association 288, 321–333.
Yang, S. and Prentice, R. L. (2005). Semiparametric analysis of short-term and long-term relative risks with two sample survival data. Biometrika 92, 1–17.

Received October 2004. Revised February 2005. Accepted March 2005.

Discussions

Raymond J. Carroll
Department of Statistics
Texas A&M University
TAMU 3143, College Station
Texas 77843-3143, U.S.A.
email: carroll@stat.tamu.edu

Prentice, Pettinger, and Anderson are to be congratulated for an interesting and timely article.

In what follows, we will use the notation of Carroll, Ruppert, and Stefanski (1995), which is slightly different from that of Prentice et al. One of the plagues of measurement error modeling is that everyone uses the same symbols (X, W, Z, U), but their meaning is seemingly randomly permuted from author to author!

Let X denote true intake, W intake from a self-report instrument such as a food frequency questionnaire, Z study-specific characteristics, and M a biomarker. Let i denote the individual and j denote the replicated instrument. Then models such as equation (2) of Prentice et al. or the person-specific bias models of Kipnis et al. (2001, 2003) basically state that for some function $m(\cdot)$,

$$W_{ij} = m(X_i, Z_i, B) + r_i + \epsilon_{ij}; \tag{1}$$

$$M_{ij} = X_i + U_{ij}, \tag{2}$$

where the random variables $r_i$, $\epsilon_{ij}$, and $U_{ij}$ are mutually independent. In most of the models in the literature, and in Prentice et al., $m(\cdot)$ is linear in true intake X, a fact that
conveniently allows identification and method of moment estimation, and later on allows one to correct risk models for the uncertainties in the self-report instrument as given in equation (1).

The random variable $r_i$ is called a person-specific bias (Kipnis et al., 2001), indicating that two people who eat the same amount will systematically report that amount differently.

Prentice et al. briefly allude to what is probably the biggest challenge in nutritional epidemiology, which unfortunately from this statistician’s perspective is not how to handle models such as (1)–(2). That issue is the difference between a recovery biomarker and a concentration biomarker. A recovery biomarker such as doubly labeled water for energy is one where the standard classical measurement error model (2) holds. When one has a recovery biomarker, the now-vast literature on measurement error modeling can be brought into play to understand design and analysis issues.

Concentration biomarkers, such as serum plasma concentrations, do not satisfy (2), but instead in their simplest form can be thought of as following

$$M_{ij} = \alpha_0 + \alpha_1 X_i + s_i + U_{ij}, \tag{3}$$

where $s_i$ is another variance component indicating a special type of person-specific bias, namely that two people who eat the same food may process the foods differently, and systematically differ in their concentration biomarkers. One would expect the concentration biomarker person-specific bias $s_i$ to be independent of the self-report person-specific bias $r_i$.

When $m(\cdot)$ in (1) is linear in X, and when $s_i \equiv 0$, it is possible to estimate the correlation between the self-report instrument W and the true intake X, a useful fact when one is setting sample sizes. However, this estimate would be sensitive to person-specific bias in the concentration biomarker. Even worse, without additional information, $\alpha_1$ in (3) is not identifiable, and trying to correct relative risk estimates for measurement error then becomes problematic.

In the case of concentration biomarkers, there seem to be at least two possibilities, and we would be interested in what Prentice et al. think of them.

• The first is to abandon the idea of using measurement error methods to estimate the relative risk of X, and instead take an operational definition as in Carroll et al. (1995, Chapter 1, Section 1.5), namely to redefine $X_i$ as the mythical average of $M_{ij}$ over many replications of the concentration biomarker. In other words, redefine usual intake as measured by the concentration biomarker to be $\alpha_0 + \alpha_1 X_i + s_i$, or, more simply, to redefine the risk factor to be the concentration biomarker after removing variability in it via averaging.
• A second possibility is to do separate feeding experiments to try to understand how the concentration biomarker is related to actual intake. It is not clear whether this is feasible, and it is especially not clear whether one can get around the issue of person-specific bias in the concentration biomarker.

Acknowledgements

Research supported by a grant from the National Cancer Institute (CA-57030), and by the Texas A&M Center for Environmental and Rural Health via a grant from the National Institute of Environmental Health Sciences (P30-ES09106).

References

Carroll, R. J., Ruppert, D., and Stefanski, L. A. (1995). Measurement Error in Nonlinear Models. London: Chapman & Hall CRC Press.
Kipnis, V., Midthune, D., Freedman, L. S., Bingham, S., Schatzkin, A., Subar, A., and Carroll, R. J. (2001). Empirical evidence of correlated biases in dietary assessment instruments and its implications. American Journal of Epidemiology 153, 394–403.
Kipnis, V., Subar, A. F., Midthune, D., Freedman, L. S., Ballard-Barbash, R., Troiano, R., Bingham, S., Schoeller, D. A., Schatzkin, A., and Carroll, R. J. (2003). The structure of dietary measurement error: Results of the OPEN biomarker study. American Journal of Epidemiology 158, 14–21.
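
As a rough illustration of the identifiability point made for models (1)–(3) in the discussion above, the following simulation sketch compares a method-of-moments estimate of the correlation between W and X under a recovery biomarker and under a concentration biomarker. It is not taken from the discussion: the linear form of $m(\cdot)$, the normal distributions, and every numerical value (slopes, standard deviations, sample size) are assumptions chosen for illustration, and the function names `simulate` and `mom_corr` are hypothetical.

```python
import numpy as np


def simulate(n, alpha1=1.0, sigma_s=0.0, seed=0):
    """Draw one self-report W and two biomarker replicates M per person.

    All parameter values are hypothetical. alpha1=1 and sigma_s=0 give the
    recovery biomarker model (2); other values give the concentration
    biomarker model (3) with alpha0 = 0.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, n)                  # true intake X_i
    r = rng.normal(0.0, 0.5, n)                  # self-report person-specific bias r_i
    w = 0.5 * x + r + rng.normal(0.0, 0.7, n)    # model (1) with an assumed linear m
    s = rng.normal(0.0, sigma_s, n)              # biomarker person-specific bias s_i
    u = rng.normal(0.0, 0.5, (n, 2))             # replicate errors U_ij, j = 1, 2
    m = (alpha1 * x + s)[:, None] + u            # two biomarker replicates per person
    return x, w, m


def mom_corr(w, m):
    """Method-of-moments estimate of corr(W, X) from W and two biomarker replicates."""
    cov_wm = np.cov(w, m.mean(axis=1))[0, 1]     # estimates beta1 * alpha1 * var(X)
    cov_mm = np.cov(m[:, 0], m[:, 1])[0, 1]      # estimates alpha1^2 * var(X) + var(s)
    return cov_wm / np.sqrt(np.var(w, ddof=1) * cov_mm)


if __name__ == "__main__":
    scenarios = [("recovery", 1.0, 0.0),
                 ("concentration, s = 0", 2.0, 0.0),
                 ("concentration, s != 0", 2.0, 1.5)]
    for label, a1, ss in scenarios:
        x, w, m = simulate(200_000, alpha1=a1, sigma_s=ss)
        # Print the correlation computed from the (normally unobserved) X
        # next to the estimate that uses only W and the biomarker replicates.
        print(label, round(np.corrcoef(x, w)[0, 1], 3), round(mom_corr(w, m), 3))
```

Under these assumptions the estimate tracks the true correlation when $s_i \equiv 0$, whatever the value of $\alpha_1$, and is attenuated once the concentration biomarker carries its own person-specific bias, which is the sensitivity described above.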

N. E. Day
Strangeways Research Laboratory
University of Cambridge
Wort’s Causeway
Cambridge CB1 8RN, UK
email: nick.day@srl.cam.ac.uk

Professor Prentice and his colleagues are to be congratulated on an outstanding paper. As they rightly say, the Women’s Health Initiative (WHI) is perhaps the most ambitious population research investigation ever undertaken. The complexity of the interventions, the sophistication of the design, the range of endpoints for which the trial was designed to provide definitive information, together with the overall size of the trial, are deeply impressive. It is reassuring to see that the framework for the analysis is commensurate with the power of the design. The “partial factorial” design sets the standard for the design of future large-scale intervention trials, and the inclusion of an observational component has proved highly serendipitous, an aspect I will discuss later. The paper covers a range of issues, including measurement problems in nutritional epidemiology, the design of genetic studies given the technological revolution that is sweeping through the area, the reporting and monitoring of clinical trials, and the relative roles and merits of clinical trials and observational studies in population science research.
The dietary modification (DM) component of the WHI has its origins in the distant history of the WHI, and was initially the main motivation for the study. The issues are clear. Diet and nutrition, together with physical activity, appear to be key determinants of a range of major health endpoints. Diet, however, is notoriously difficult to assess accurately, a problem compounded by the fact that diet is a high-dimensional complex of factors, many of which are highly correlated. This high level of measurement error gives great uncertainty to the results of observational studies, both to the identification of the precise dietary factor of importance and the quantitative level of effect, even in fact whether there is any appreciable dietary effect. Negative results can be at least as suspect as positive ones. The hope of the WHI was that these problems could be circumvented by a randomized clinical trial. The results of the DM component of the WHI have not yet appeared, so it is too early to tell whether the optimism behind the design was justified. However, problems that were raised at the outset have not disappeared. The primary DM was to reduce intakes of total fat and saturated fat to 20% and 7%, respectively, of average daily caloric intake, while keeping total caloric intake constant. This is an intervention that is easy neither to achieve nor to maintain. The trial will, of course, be analyzed on an intention-to-treat basis, but an understanding of what the trial results mean will depend on accurate estimation of compliance over time of the intervention, and lack of change in the control arm. The intention-to-treat analysis only answers the operational question of whether this mode of delivering the intervention has an effect. The underlying question, the one of real interest, is whether sustained reduction in fat, or saturated fat, consumption modifies health outcomes. To answer this question one has to measure the degree of compliance, that is, assess fat and saturated fat intake. Prentice and his colleagues have developed more complex, and perhaps more realistic, models of the error of dietary self-assessment, together with simpler error structure models for biomarkers (models (2) and (1) in the paper). These have been used for the design of a biomarker study now under way, and which will presumably form the basis of their analysis. It is difficult to see, however, how such a biomarker study is going to resolve the issue of sustained compliance with the study protocol by both arms of the trial. First, no biomarkers are currently available either for fat or for saturated fat intake, or indeed for carbohydrate. Second, although for the so-called recovery biomarkers, at present basically total energy, protein, potassium, and sodium, model (1) may be appropriate, there is no compelling reason why model (1) would apply to blood serum concentration markers, where levels may be affected by individual endogenous or external exposure factors and the assumption of the independence of the errors may be seriously vitiated. For crucial parameters to be identifiable, some independence assumption, or equivalent, has to be made, and only for the recovery biomarkers does there appear to be compelling justification for such an assumption. It therefore seems unlikely that the self-reported fat consumption data obtained from the trial participants can be fully or credibly calibrated. However, for interpretation of the intention-to-treat analysis individual calibration is not necessary, all that is needed is an estimate of mean fat consumption on the two arms of the trial. Even these estimates of the mean, however, will prove problematic since in model (2) there is a bias term, which requires an appropriate biomarker study for its estimation. It is also, as a second-order problem, possible, even likely, that this bias term will depend on the dietary pattern, almost certainly different on the two arms of the trial given the nature of the intervention. If the study demonstrates an appreciable effect for the intervention on the incidence of breast cancer, interpretation will be uncontroversial. If, however, the breast cancer results of the DM component are negative or only marginally positive on an intention-to-treat analysis, then interpretation will be unclear. One will not know whether the intervention produced little or no effect because fat intake is unrelated to breast cancer risk, or because the intervention did not generate sufficient difference between the two arms. Shades of the Multiple Risk Factor Intervention (MRFIT) trial may hang over the results.

The issue dealt with in this article that will attract the greatest attention, along with the companion paper in the American Journal of Epidemiology, relates to the effect of hormone replacement therapy on the risk for cardiovascular disease, specifically the apparent discrepancy between the consistent finding from earlier observational studies of a protective effect with the clear finding of an excess risk from the randomized component of the WHI. The results published by the WHI Writing Committee in 2002, describing an increased risk of coronary heart disease among women randomized to combined estrogen–progesterone treatment (E+P) compared to controls, gave rise to extravagant review and comment in the literature. As Prentice and colleagues point out, an issue of the International Journal of Epidemiology was devoted to the topic, with lurid titles to papers such as “Is this the end of observational epidemiology?” Many pet theories and old hobby-horses were brought out to “explain” the discrepancy. Among these was the claim that not just socioeconomic status but the pattern of socioeconomic status and deprivation since birth was of crucial importance. Without adjustment for such a complex of variables, available in virtually no observational study, results were fundamentally unreliable. A following paper purported to demonstrate the validity of the claim by showing that adjustment for a lifetime measure of deprivation gave results close to the E+P result in the WHI, using data from a cross-sectional study with information on prevalent coronary heart disease (i.e., a medical record or self-report of a physician diagnosis). Another commentary referred to the “vindication of old epidemiological theory.” In an elegant if simple reanalysis of the WHI results, Prentice and his colleagues show such commentaries to be empty rhetoric. They examine the effect of one of the most basic of epidemiological variables, time since start of exposure. In cancer epidemiology, it is fundamental to the relationship between exposure and risk, and in cancer epidemiology would be considered a routine part of an analysis of cohort studies. They compare the results from the randomized component of the WHI with the results from the observational component.

When examined by time since E+P initiation, the two sets of results are as close as random fluctuation would allow. The apparent discrepancy simply disappears. In the first two years since initiation of E+P, the risk of coronary heart disease, and particularly venous thromboembolism, is high. More than
5 years after initiation of E+P, for coronary heart disease there is a substantial protective effect. Of particular note is that over 80% of the coronary heart disease cases on E+P on the observational component occur more than 5 years after E+P initiation, whereas among women taking E+P on the randomized component of the WHI, less than 20% of cases of coronary heart disease occurred 5 years or more after initiation of treatment. The analysis in the paper provides the clearest vindication of the insistence on using incident cases of disease, and treating time since onset of exposure as a basic variable of interest. Cross-sectional studies using data on the prevalence of disease can hardly hope to make a serious contribution.

A troubling aspect of the WHI results is the importance of the early results, that is, outcomes occurring within 2 years of treatment initiation, in triggering the trial stopping rules. Notwithstanding this paper, and the companion paper in the American Journal of Epidemiology, the headlines generated by the incomplete analysis published in 2002 will continue to reverberate. There has been a series of trials, mainly in the United States, where early stopping has led to incomplete, even misleading, data being published. Apart from this trial, the U.S. NIH intervention study on the use of tamoxifen for the primary prevention of breast cancer is another obvious example. These trials have been stopped before they have been allowed to continue sufficiently to generate data of unambiguous value for clinical or public health decisions. The stopping rules for the WHI were complex and sophisticated, yet have led to the appearance of misleading publications. More thought needs to be given, as Prentice and his colleagues stress, to the formulation of stopping rules which provide a more helpful balance between short- and longer-term effects. Conversely, again as is pointed out in the paper, many observational studies would benefit from the inclusion of adequate person-years at risk soon after exposure starts. Observational studies and clinical trials should be complementary, the former giving information on the effects of exposure under a much wider range of conditions and doses, but susceptible to bias, the latter giving potentially more accurate estimates of effect, but under much more restrictive conditions.
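
The time-since-initiation argument in the discussion above can be made concrete with a small arithmetic sketch. It is not the analysis reported in the paper: the interval-specific hazard ratios and most of the case shares below are hypothetical, with only the shares beyond 5 years (under 20% in the trial, over 80% in the observational component) taken from the text. The summary used here, an event-weighted average of log hazard ratios, is a crude stand-in for what a single proportional-hazards summary would report, and the name `event_weighted_hr` is an assumption.

```python
import math


def event_weighted_hr(interval_hrs, event_shares):
    """Geometric mean of interval-specific hazard ratios, weighted by event share."""
    if len(interval_hrs) != len(event_shares):
        raise ValueError("one event share is needed per interval")
    if abs(sum(event_shares) - 1.0) > 1e-9:
        raise ValueError("event shares must sum to 1")
    log_hr = sum(w * math.log(hr) for hr, w in zip(interval_hrs, event_shares))
    return math.exp(log_hr)


# Hypothetical hazard ratios for <2, 2-5, and >5 years since E+P initiation,
# assumed identical in the trial and in the observational component.
HRS = [1.6, 1.1, 0.7]

# Hypothetical shares of E+P cases by interval; only the ">5 years" shares
# (under 20% in the trial, over 80% in the cohort) follow the text.
TRIAL_SHARES = [0.45, 0.37, 0.18]
COHORT_SHARES = [0.05, 0.13, 0.82]

if __name__ == "__main__":
    print("trial summary HR:", round(event_weighted_hr(HRS, TRIAL_SHARES), 2))
    print("cohort summary HR:", round(event_weighted_hr(HRS, COHORT_SHARES), 2))
```

With identical interval-specific hazard ratios in both settings, the trial summary falls above 1 and the cohort summary below 1, purely because the cases are distributed differently over time since initiation.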

David L. DeMets
University of Wisconsin–Madison
K6/446a Clinical Sciences Center
600 Highland Avenue
Madison, WI 53792-4675
email: demets@biostat.wisc.edu

1. Introduction

Prentice et al. (1998) describe several statistical issues that arose during the design, conduct, and analysis of the Women’s Health Initiative (WHI) randomized clinical trial (RCT) and observational study (OS). The issues include measurement error in modeling risk for dietary and physical activity assessment, interim monitoring for multiple outcomes and multiple diseases, the high dimensionality of genomic data, and time-dependent treatment group hazard ratios.

As Prentice et al. summarize, the WHI (Women’s Health Initiative Study Group, 1998) was no ordinary RCT and OS. Most trials, even very large trials, have one or two treatments being tested on a single disease for each treatment with one or two major outcomes for each treatment. The WHI was probably the largest trial ever conducted, with over 68,000 postmenopausal women participating, and the OS had over 93,000 participants. The WHI RCT had three treatments under evaluation: a low-fat dietary modification (DM); a hormone therapy (HT) consisting of estrogen and progestin (EP) for women with a uterus (Writing Group for the Women’s Health Initiative Investigators, 2002) and estrogen (E) alone for women without a uterus (Women’s Health Initiative Steering Committee, 2004); and a third treatment consisting of calcium and vitamin D (CaD) supplementation. The DM arm had both breast cancer and colon cancer as primary outcomes with coronary heart disease (CHD) as a leading secondary outcome. The goal was to lower a typical 40% fat content diet to 20%. The HT component had as a primary goal the reduction of CHD and reduction of hip fractures as a secondary outcome. The risk of breast cancer was a major concern. For the CaD component, the reduction of hip fractures was the primary outcome.

From a design perspective, the WHI is a formidable challenge. There is no reason to expect that the sample size requirements should be the same for each component, and in fact they were not the same. In the DM component, almost 49,000 women were enrolled. For the HT component, 10,739 patients were enrolled in the estrogen alone study (Women’s Health Initiative Steering Committee, 2004) and 16,608 were enrolled in the estrogen–progestin study (Writing Group for the Women’s Health Initiative Investigators, 2002), and over 36,000 were in the CaD study. Each treatment arm was compared to a control arm, which was standard diet for the DM component and a placebo for the E, EP, and CaD treatment arms in the other components. Furthermore, women could be eligible and elect to participate in one or more of the three components (DM, HT, or CaD). In addition, the randomized cohorts needed to be stratified to achieve racial and age targets. Recruitment was to be conducted in 40 clinical centers.

Because of these complexities, a partial factorial design was used, relying on individual design and sample size calculations for each component. The WHI assumed that the individual components would be independent of each other; that is, no interaction was expected or assumed. However, there were several other multiplicities, especially in multiple outcomes for each of the three components, and particularly for the HT component. In addition to CHD, hip fracture, and breast
Discussion on Statistical Issues in the Women’s Health Initiative 915

cancer, other outcomes such as stroke and specific subtypes (e.g., ischemic and hemorrhagic) as well as outcomes related to blood clotting risks (e.g., deep vein thrombosis, pulmonary embolism) arose during the conduct of the trial. How to be sensitive to various risks and yet be prudent about the increase in false claims due to multiplicities is not clear even for the standard RCT, much less a trial of this complexity.
Another challenge is that all three treatments are readily available, and many groups in the medical community and the public believe that they are effective. Thus, the challenge of adherence to the treatment arm assigned during the conduct of the trial was substantial. Based on previous observational studies by several research groups, the use of each of the three treatment modalities was associated with a reduction in risk. While the medical community fully recognized the limitation of observational studies, HT, for example, was among the most widely prescribed pharmacologic agents for women.
There are several historical lessons prior to WHI about the use of observational cohort studies to infer not just associations but causality. For example, several cohort studies demonstrated an association between serum betacarotene levels and the risk of cancer, especially lung cancer. Based on these cohort studies, three major trials of betacarotene were launched. The Alpha-Tocopherol Beta Carotene (ATBC) trial was a randomized placebo-controlled factorial trial conducted in Finland among 26,000 heavy smokers (Alpha-Tocopherol, Beta Carotene Cancer Prevention Study Group, 1994). The CARET trial was a similar design conducted in the United States among heavy smokers and industrial workers exposed, for example, to asbestos (Omenn et al., 1994). The third trial, the Physicians Health Study (PHS), was a randomized placebo-controlled factorial trial of aspirin and betacarotene involving over 22,000 U.S. male physicians (Hennekens et al., 1996). All three trials used a synthetic betacarotene to increase serum levels. The ATBC, at completion, indicated an increased risk of lung cancer incidence and mortality, contrary to expectations based on the observational studies. The CARET trial terminated early with an increased risk of lung cancer incidence and mortality, the rates being nearly identical to those in the ATBC trial. The betacarotene component of the PHS ended with a hazard ratio of nearly unity, in a population that had only a small subgroup of smokers and little exposure to other lung carcinogens. Interestingly, in the placebo arms of all three trials, baseline serum betacarotene levels were associated with the risk of lung cancer, confirming the association seen in earlier observational studies. Yet, modification of serum betacarotene had the opposite effect. The lesson is that observational studies identify associations and should not be taken as evidence of causality or as a basis for subsequent treatment strategies.
Similar lessons were learned in identifying the association of lipid values and the risk of CHD. The Framingham Heart Study (FHS) was among the first observational studies to identify this risk factor, in the late 1950s and early 1960s (Dawber, Meadors, and Moore, 1951). Yet, several trials were able to effectively reduce serum lipid values without any benefit in reducing CHD risk. The Coronary Drug Project (CDP) was among the first trials, started in the late 1960s, to demonstrate that lowering serum lipid values through agents such as clofibrate did not reduce CHD (Coronary Drug Project Research Group, 1975). In fact, the first successful lipid reduction with a corresponding reduction in CHD mortality came almost 30 years later, using a statin, simvastatin, in a Scandinavian trial (Scandinavian Simvastatin Survival Study, 1994).
For the HT component, the observational studies did not predict the effect of either treatment modality. The reasons for this are not clear beyond the knowledge that association is not the same as causation. One possible factor is selection bias. For the HT component, women who were taking hormones were possibly more health conscious and physically active. Thus, their CHD risk was already lower and the use of hormones to treat postmenopausal symptoms induced a correlation that was not correct. Another factor is that researchers study what they can measure, but there are probably many unknown but extremely important factors involved in the increased risk of CHD.
In evaluating the failure of a low-fat diet to reduce the risk of breast and colon cancer, Prentice et al. examine the impact of measurement error in dietary assessment on the assessment of risk. They recognize the limitations of the observational studies that suggested the low-fat hypothesis. Dietary assessment is very challenging and full of imprecision. Food frequency questionnaires are fraught with measurement errors and also susceptible to systematic bias such as over- or underreporting, conscious or not. Prentice et al. consider a model of risk assessment which incorporates measurement error in the independent variable. Measurement error is likely to have attenuated the strength of the association, but accounting for it still may not fully address the causation issue. The final results of the DM component are not yet available.

2. Even Higher Dimensionality
The WHI RCT and OS studies came at a time of great change and innovation in biomedical research. The sequencing of the human genome and the advances in both genomic and proteomic research offer exciting new opportunities. The WHI leaders collected and stored biological materials from the women participating in the WHI RCT and OS studies. These data from this well-characterized cohort of women will be analyzed and explored for years. The dimensionality of the data collected is far beyond anything undertaken previously.
For both epidemiology and clinical trials, current statistical methodology is simply not adequate to meet the challenges of such high-dimensional data in very large cohorts such as the WHI RCT and OS studies. New methodology, both frequentist and Bayesian based, must be developed that addresses the dimensionality and multiplicity. In addition, the laboratory methods used to measure the biological specimens are also changing rapidly as new advances are made in both the biology and the technology. Many methods such as microarrays are full of measurement error that could be reduced using some of the statistical designs for laboratory quality control. For example, current results can vary with the placement of

the material on the microarray chip from run to run and from day to day.
In addition, as Prentice et al. point out, the costs of these measurements can limit the amount of data that can be collected. Of course, with time and improved technology, the costs will come down dramatically so that the volume of data generated from the WHI cohorts will be affordable.
Nevertheless, this should be a rich area for statistical research, whether the environment is laboratory, epidemiological, or clinical trial investigation. The WHI may well be both a leading motivation for and a beneficiary of such statistical methodology.

3. Trial Monitoring
As suggested by the design, the WHI is a complicated trial to monitor and to analyze at interim points for early evidence of benefit or harm. There are essentially four trials being conducted, with three treatment modalities, through the same trial infrastructure, with women participating in one or more of the components. Each treatment modality can affect more than one disease, and each disease may have one or more measurements assessing treatment effect. Finally, safety monitoring for these three treatment modalities involves a multitude of outcomes.
The NIH appointed an independent Data and Safety Monitoring Board (DSMB) consisting of experts in the different treatment modalities and diseases, as well as senior biostatisticians and ethicists. All were experienced researchers and familiar with clinical trials. Not all had experience in trial monitoring on a DSMB. The WHI DSMB was chartered to review the accumulating WHI data at least twice per year for evidence of early benefit or harm in any or all of the treatment modalities. The DSMB could recommend continuation, a protocol modification, or early termination if the interim data were convincing. To prepare the DSMB members, the WHI leadership prepared several scenarios and surveyed the members as to what they would recommend for the WHI RCT (Freedman et al., 1996). While none of the imagined scenarios actually occurred, the process was perhaps helpful to some members and did serve to bring the DSMB together into a functioning unit.
Standard group sequential methodology was used to monitor each major primary outcome and leading secondary outcome. Some adjustments were made for multiplicities of outcomes but not all. For the HT arm, only an upper group sequential boundary for benefit was prespecified, which turned out to be a mistake. A lower boundary for harm should have been prespecified as well, perhaps an asymmetric boundary.
The EP component was terminated early due to a convincing adverse risk of clotting problems as evidenced by increases in stroke, pulmonary embolism, and deep vein thrombosis. In addition, there was an increase in breast cancer (Writing Group for the Women’s Health Initiative Investigators, 2002). The trends began to emerge and kept getting stronger while there was no apparent reduction in either mortality or CHD. Hip and other fractures had a benefit with HT, as was expected. After a few meetings, the trends became convincing and the DSMB recommended to the sponsor that the EP component should be terminated. The prespecified scenarios were not so useful at this juncture, and the group sequential boundaries were helpful, but still the DSMB had to render its best scientific and ethical judgment.
The E component of the HT was also terminated early but with much greater debate among the DSMB (Women’s Health Initiative Steering Committee, 2004). Here, the same risk factors for clotting problems emerged as had been the case for the EP component. Hip fractures were reduced, but there was again no effect on CHD. However, in contrast to the EP component, there was a trend for a breast cancer benefit, not harm. Thus, the mix of the issues was different. The DSMB was of a mixed mind on what should be done. When the data became convincing of the clotting problems, the DSMB view was that some change needed to be made, that continuing as is was not acceptable. In a close vote, the DSMB recommended continuing the trial but informing the participants about the clotting risks and that the breast cancer question was not resolved. This was an agonizing recommendation, with each DSMB member being split within themselves. The split vote was taken to another ad hoc committee, which affirmed the recommendation of the DSMB. The trial sponsor, the National Heart, Lung, and Blood Institute, engaged in discussions with the other NIH institutes as well as the director’s office. Ultimately, the NIH determined to simply terminate the WHI E component.
A global index was created which was a combination of all the major health events. The plan was to require the global index to be consistent with the results of a primary outcome before early termination should be seriously considered. However, since the global index was a combination of outcomes that were going in different directions, the global index was not as useful as originally intended. Had the major outcomes all moved in the same direction, its influence might have been greater.
No additional statistical methodology would have made the DSMB recommendation either easier or faster. The issues were simply too complex, and while statistics was a part of the discussion, it was not the dominating factor. Still, the challenges of monitoring multiple outcomes, not totally independent, remain and further work is warranted.

4. Changing Hazards and Changing Weights
The primary analysis of the time-to-event data used a weighted log-rank test. The weights were constructed to diminish the impact of early events or early treatment effect. The rationale for this weighting is that it would not be reasonable to expect the treatments, for example HT, to have an immediate impact. Thus, a modest treatment effect, if any, in the early going could reduce the power of the comparison unless this period of follow-up was discounted. The challenge, however, is what the weights should be. In the WHI, the weights were linear from randomization to 3 years for cardiovascular disease and fracture and 10 years for cancer incidence and mortality. Unweighted rank tests were used for safety assessment. The challenge is what lag period to use for the weighted rank tests. Many effective treatments in cardiology, such as aspirin, statins, and beta blockers,

demonstrated an effect within 3 years. For cancer, it is assumed that the process of initiation, promotion, and progression of cancer takes time, and thus no treatment can have an effect immediately. Any early cancer incidence reflects a process already underway and not subject to a DM prevention strategy. However, 10 years may be too long. In any case, both weighted and unweighted analyses should probably be conducted.
The issue of changing hazard ratios over the follow-up period is not new to clinical trials but was of special interest in the WHI. As Prentice et al. point out, “hazard ratio estimates arising from a proportionality assumption may provide simple and useful summary measures even if the hazard ratio is moderately time dependent.” However, the hazard ratio may be sensitive to time dependency if the participants enter late relative to the initiation of risk exposure. Estimation of downstream hazard ratios is itself challenging since the participants may represent different risk groups due to differential mortality, adherence, and follow-up. That is, the different hazard ratios may be confounded. This may not have been a major issue in the WHI but is nevertheless a concern. Clearly, more research into the sensitivity of this effect would be welcome for all clinical trials, not just the WHI.

5. Intervention Adherence and Causal Inference
Since Canner first wrote about the challenge of analysis of primary outcomes adjusting for intervention compliance, based on the Coronary Drug Project, clinical trialists have recognized the dangers of this approach (Canner, 1991). Canner and others have provided examples that demonstrate that placebo compliers may have better or worse outcomes than placebo noncompliers. Compliance is itself an outcome and not necessarily independent of how the participant is faring in the trial. Canner also demonstrated that using a multitude of measured covariates did not make this anomaly go away.
Several authors have tried to model treatment effect based on compliance to treatment in RCTs, and then to extrapolate the treatment effect under optimum compliance. However, Albert and DeMets (1994) demonstrate that such modeling is very much dependent on the independence assumption, and results can be easily misleading when this assumption is not correct. However, for OS studies, researchers have no other choice than to model treatment effect based on the degree of intervention. This is one of the areas where RCTs and OS will differ due to adherence bias, and minimizing this bias is one of the strengths of the RCT if the analysis is strictly by intent to treat.

6. Post Mortems
Whenever the results of a trial do not turn out as expected, or are not consistent with previous observational studies, as was the case for the HT component, many individuals begin to speculate about possible flaws in the clinical trial. While perhaps some trials may have critical or fatal flaws, that is not likely to be the case in the WHI. The trial was well designed, despite its complexity, well conducted in the face of public and medical biases about the effects of the interventions being studied, and carefully analyzed.
Experience indicates that we should not expect perfect congruence between observational studies and clinical trials. Observational studies are best suited to identify possible risk factors, potentially modifiable, with the hope of risk reduction. Clinical trials are best suited to test rigorously whether modification of the risk factor in fact reduces the risk of the disease under consideration.
The biostatistician must resist being an advocate for the treatment and instead focus on whether the analysis of both the OS and the RCT is as rigorous as possible, recognizing the inherent limits of the OS design and the analysis assumptions. Objectivity must be maintained, with no interest in the direction of the outcome but only in ensuring that the results, whatever they are, can be defended rigorously. As soon as biostatisticians lose that objectivity and operate with a bias, they lose their professional effectiveness. The results of the HT arm of the WHI RCT are pretty clear.
Observational studies will always be a primary source for identifying risk factors, even in the new era of genomics and proteomics. Given recent concerns about drug safety, observational studies will most likely be the best method for assessing long-term safety once initial treatment effectiveness has been established.

References
Albert, J. M. and DeMets, D. L. (1994). On a model-based approach to estimating efficacy in clinical trials. Statistics in Medicine 13, 2323–2335.
Alpha-Tocopherol, Beta Carotene Cancer Prevention Study Group. (1994). The effect of vitamin E and beta carotene on the incidence of lung cancer and other cancers in male smokers. New England Journal of Medicine 330, 1029–1035.
Canner, P. L. (1991). Covariate adjustment of treatment effects in clinical trials. Controlled Clinical Trials 12, 359–366.
Coronary Drug Project Research Group. (1975). Clofibrate and niacin in coronary heart disease. Journal of the American Medical Association 231, 360–381.
Dawber, T. R., Meadors, G. F., and Moore, F. E. J. (1951). Epidemiological approaches to heart disease: The Framingham Study. American Journal of Public Health 41, 279–286.
Freedman, L., Anderson, G., Kipnis, V., Prentice, R., Wang, C. Y., Rossouw, J., Wittes, J., and DeMets, D. L. (1996). Approaches to monitoring results of long-term disease prevention trials: Examples from the Women’s Health Initiative. Controlled Clinical Trials 17, 509–525.
Hennekens, C. H., Buring, J. E., Manson, J. E., Stampfer, M., Rosner, B., Cook, N. R., Belanger, C., LaMotte, F., Gaziano, J. M., Ridker, P. M., Willett, W., and Peto, R. (1996). Lack of effect of long-term supplementation with beta carotene on the incidence of malignant neoplasms and cardiovascular disease. New England Journal of Medicine 334, 1145–1149.

Omenn, G. S., Goodman, G., Thornquist, M., et al. (1994). The beta-carotene and retinol efficacy trial (CARET) for chemoprevention of lung cancer in high risk populations: Smokers and asbestos-exposed workers. Cancer Research 54(7 suppl.), 2038s–2043s.
Scandinavian Simvastatin Survival Study. (1994). Randomized trial of cholesterol lowering in 4444 patients with coronary heart disease: Scandinavian Simvastatin Survival Study (4S). Lancet 344, 1383–1389.
Women’s Health Initiative Steering Committee. (2004). Effect of conjugated equine estrogen in postmenopausal women with hysterectomy: The Women’s Health Initiative randomized clinical trial. Journal of the American Medical Association 291, 1701–1712.
Women’s Health Initiative Study Group. (1998). Design of the Women’s Health Initiative clinical trial and observational study. Controlled Clinical Trials 19, 61–109.
Writing Group for the Women’s Health Initiative Investigators. (2002). Risks and benefits of estrogen plus progestin in healthy postmenopausal women: Principal results from the Women’s Health Initiative randomized controlled trial. Journal of the American Medical Association 288, 321–333.
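
Illustrative sketch (not part of the original discussion). Section 4 above describes a weighted log-rank test whose weights rise linearly from randomization to a lag of 3 years (cardiovascular disease and fracture) or 10 years (cancer). The Python sketch below shows one way such a statistic might be computed; it is not the WHI analysis code. The function names, the toy data, and the choice to hold the weight at 1 after the lag period are assumptions made for illustration.

```python
import math


def ramp_weight(t, lag_years):
    """Weight rising linearly from 0 at randomization to 1 at lag_years.

    Holding the weight at 1 after the lag is an assumption of this sketch.
    """
    return min(t / lag_years, 1.0)


def weighted_logrank_z(times, events, groups, lag_years):
    """Two-sample weighted log-rank z statistic.

    times  : follow-up time in years for each participant
    events : 1 if the outcome was observed, 0 if censored
    groups : 1 for the treatment arm, 0 for the control arm
    A positive value means more events in the treatment arm than expected.
    """
    num = 0.0  # weighted sum of (observed - expected) events in the treatment arm
    var = 0.0  # weighted hypergeometric variance
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        n = sum(1 for s in times if s >= t)                           # at risk, both arms
        n1 = sum(1 for s, g in zip(times, groups) if s >= t and g == 1)  # at risk, treatment
        d = sum(1 for s, e in zip(times, events) if s == t and e == 1)   # events at t
        d1 = sum(1 for s, e, g in zip(times, events, groups)
                 if s == t and e == 1 and g == 1)                     # treatment events at t
        w = ramp_weight(t, lag_years)
        num += w * (d1 - d * n1 / n)
        if n > 1:
            var += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return num / math.sqrt(var)


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    times = [0.5, 1.2, 2.0, 2.8, 3.5, 4.1, 5.0, 6.3]
    events = [1, 1, 0, 1, 1, 0, 1, 1]
    groups = [1, 0, 1, 0, 1, 0, 1, 0]
    print("3-year ramp:", weighted_logrank_z(times, events, groups, lag_years=3.0))
    print("unweighted :", weighted_logrank_z(times, events, groups, lag_years=1e-9))
```

Calling the function with a very small lag makes every weight equal to 1 and so recovers the unweighted log-rank statistic, which makes it straightforward to run weighted and unweighted analyses side by side.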

David A. Freedman
Department of Statistics
UC Berkeley
Berkeley, California 94720-3860, U.S.A.
email: freedman@stat.berkeley.edu
and

Diana B. Petitti
Kaiser Permanente Southern California
393 E. Walnut Street
Pasadena, California 91188, U.S.A.
email: diana.b.petitti@kp.org

We thank Ross Prentice and his colleagues for a rich and provocative paper that has generated many insights in a variety of methodological areas. We also thank our editor, Xihong Lin, for organizing this discussion. Ours is an age of specialization, and we propose to consider only the effect of hormone replacement therapy (HRT) on three cardiovascular endpoints: coronary heart disease, stroke, and venous thromboembolism.
First some background. Ideas of biological mechanism and evidence from observational epidemiology led many observers to conclude that HRT was protective, reducing cardiovascular death rates by a factor of 2 or more. According to Grodstein and Stampfer (1998, pp. 211, 217),

Consistent evidence from over 40 epidemiologic studies demonstrates that postmenopausal women who use estrogen therapy after the menopause have significantly lower rates of heart disease than women who do not take estrogen . . . the evidence clearly supports a clinically important protection against heart disease for postmenopausal women who use estrogen.

Also see Stampfer and Colditz (1991) and Grodstein et al. (1996).
Such findings profoundly influenced the practice of medicine. In the late 1990s, postmenopausal hormones were best-selling drugs worldwide. About 90 million prescriptions for HRT were issued annually in the United States, corresponding to 15 million HRT users (Hersh, Stefanick, and Stafford, 2004).
Some observers remained skeptical (see, for instance, Petitti, 1994; Posthuma, Westendorp, and Vandenbroucke, 1994; Vandenbroucke, 1995). Two large clinical trials were organized to resolve the issue: the Heart and Estrogen/Progestin Replacement Study (HERS) and the Women’s Health Initiative (WHI). Prentice and his colleagues were actively involved in the design and analysis of WHI. The experiments demonstrated no benefit from HRT, and some harm: WHI was stopped early, largely due to an increased risk of breast cancer among the HRT group.
Debate continues on these issues; for instance, a different mix of hormones administered along a different time path might be beneficial. See, for example, International Journal of Epidemiology (2004, 33, 441–467). However, the experiments led to another major change in medical practice. Today, HRT would rarely be prescribed to prevent cardiovascular disease.
WHI had two branches, an observational study and a randomized controlled experiment. By contrast with the experiment, the observational study, like many of the other observational studies, found a protective effect from HRT. What accounts for the discrepancy? Prentice and colleagues have two answers that we find persuasive.

1. Observational studies can be misleading. Therefore, it is important to adjust for confounding variables, including socioeconomic status. This may seem obvious. It is not. The Nurses’ Health Study on HRT did not adjust for socioeconomic status (Grodstein et al., 1996; Humphrey, Chan, and Sox, 2002).

2. In many contexts, including the present one, time is a crucial variable. Treatment and disease are dynamic, not static.

When arguing these points, Prentice, Pettinger, and Anderson could be read as suggesting that, if properly analyzed, the observational study agrees with the randomized controlled experiment. We would have several questions about such an interpretation.

1. Observational data can be adjusted in a variety of ways. Without experimental data, it will be unclear which adjustments to make, or how far to go.
2. Table 3 in Prentice, Pettinger, and Anderson only shows results on coronary heart disease and thromboembolism. However, even after all the modeling is done, there remains a large disparity with respect to an important cardiovascular endpoint, stroke (Prentice et al., 2005). Prentice, Pettinger, and Anderson mention stroke, but do not discuss the difficulties created by this endpoint.
3. Prentice, Pettinger, and Anderson chose for their null hypothesis equality between the two branches of WHI. However, statistical power is limited, and the choice of null greatly influences conclusions.

Power is limited because the women in the treatment arm of the clinical trial are mainly short-term users of HRT. By contrast, in the observational study, users have been taking hormones for a long time. (According to the conventions used by Prentice and colleagues, in the observational study, exposure prior to baseline is counted.)
To illustrate how substantive conclusions may be determined by apparently innocuous technical choices, we suggest the following null hypothesis: compared to the randomized controlled experiment, the observational study underestimates the risks of HRT by a factor in the range of 1.5–3, depending on risk group and endpoint (heart disease, stroke, thromboembolism). The data seem to be at least as compatible with our null hypothesis as with the null hypothesis of equivalence. These null hypotheses have rather different implications for bias in observational epidemiology.
Bias stems from incomplete adjustment. Adjustment must be incomplete, because relevant lifestyle factors are extraordinarily difficult to identify or measure. Here is one example. In observational studies, women on HRT are “compliers”: they follow a treatment regime prescribed by their doctors. But compliance, even by subjects assigned to placebo in a clinical trial, is associated with favorable outcomes. A factor of 2 for compliance bias is compatible with previous literature. Compliance is thoroughly confounded with treatment in observational studies of HRT. See Petitti (1994) and Barrett-Connor (1991) for additional discussion.
HRT comes in two forms: (1) unopposed (estrogen only) and (2) combined (estrogen plus progestin). WHI considered both forms (Tables 1 and 2 in Prentice, Pettinger, and Anderson). Modeling results are presented only for the combined form (Table 3 in Prentice, Pettinger, and Anderson). Hence our focus is on combined therapy.
We turn now to a policy issue. Although WHI is tax supported, its data are not available to us. Data from clinical trials are available only rarely, and conditions may be imposed that almost preclude independent analysis. Policies governing data dissemination need to be reconsidered, although due regard must be paid to patient confidentiality. Only by thorough scrutiny can error be avoided. Transparency is the best assurance of scientific quality. For additional discussion, see Geller et al. (2004).
We would sum up the methodological lessons as follows. Rigorous causal inferences have been made using observational data, from the time of John Snow on cholera and Ignaz Semmelweis on puerperal fever. Recent examples include the health effects of smoking, and the demonstration that cervical cancer is in part a sexually transmitted disease. Indeed, most of what we know about causation in the medical sciences comes from observational studies, because experiments are often unethical or impractical. We might even suggest that observation necessarily precedes experiment. What else could provide motivation, or help define protocols?
On the other hand, observational data need to be approached with caution. When there is a conflict between observational epidemiology and experiments (HRT not being an isolated case), we think that the experiments are the ones to watch. The gap between association and causation will not generally be bridged by proportional-hazard models, even with stratification and time-dependent exposure variables. For more discussion on the relative merits of experiment and observation, see Mill (1868, Book III, Chapters VII and X).
Prentice and his colleagues deserve our thanks for the paper, and their work on WHI.

References
Barrett-Connor, E. (1991). Postmenopausal estrogen and prevention bias. Annals of Internal Medicine 115, 455–456.
Geller, N. L., Sorlie, P., Coady, S., Fleg, J., and Friedman, L. (2004). Limited access data sets from studies funded by the National Heart, Lung, and Blood Institute. Clinical Trials 1, 517–524.
Grodstein, F. and Stampfer, M. J. (1998). The cardioprotective effects of estrogen. In The Management of the Menopause, Chapter 22, J. Studd (ed), 211–219. London: Parthenon.
Grodstein, F., Stampfer, M. J., Manson, J. E., Colditz, G. A., Willett, W. C., Rosner, B., Speizer, F. E., and Hennekens, C. H. (1996). Postmenopausal estrogen and progestin use and the risk of cardiovascular disease. New England Journal of Medicine 335, 453–461.
Hersh, I. L., Stefanick, M. L., and Stafford, R. S. (2004). National use of postmenopausal hormone therapy: Annual trends and response to recent evidence. Journal of the American Medical Association 291, 47–53.
Humphrey, L. L., Chan, B. K. S., and Sox, H. C. (2002). Postmenopausal hormone replacement therapy and the primary prevention of cardiovascular disease. Annals of Internal Medicine 137, 273–284.
Mill, J. S. (1868). A System of Logic, Ratiocinative and Inductive, 7th ed. (1st ed., 1843). London: Longmans, Green, Reader, and Dyer.

Petitti, D. B. (1994). Coronary heart disease and estrogen replacement therapy: Can compliance bias explain the results of observational studies? Annals of Epidemiology 4, 115–118.
Posthuma, W. F., Westendorp, R. G., and Vandenbroucke, J. P. (1994). Cardioprotective effect of hormone replacement therapy in postmenopausal women: Is the evidence biased? British Medical Journal 308, 1268–1269.
Prentice, R. L., Langer, R., Stefanick, M., et al. (2005). Combined postmenopausal hormone therapy and cardiovascular disease: Toward resolving the discrepancy between observational studies and the Women’s Health Initiative clinical trial. American Journal of Epidemiology 162, 404–414.
Stampfer, M. J. and Colditz, G. A. (1991). Estrogen replacement therapy and coronary heart disease: A quantitative assessment of the epidemiologic evidence. Preventive Medicine 20, 47–63. Reprinted in the International Journal of Epidemiology 2004, 33, 445–453.
Vandenbroucke, J. P. (1995). How much of the cardioprotective effect of postmenopausal estrogens is real? Epidemiology 6, 207–208.

Sander Greenland
Departments of Epidemiology and Statistics
University of California, Los Angeles, CA
email: lesdomes@ucla.edu

The randomized component of the Women’s Health Initiative (WHI) is an invaluable check on observational associations. The observational component could be equally important if it is analyzed thoroughly and imaginatively, from a variety of perspectives. Although valuable, the strategies described by Prentice, Pettinger, and Anderson (PPA) cover too narrow a range. Following standard practice, they take an underidentified problem (estimate a causal effect from observational data) and force identification via rather arbitrary constraints (encoded within their models). While everyone starts this way, the approach needs to be supplemented by more realistic uncertainty assessments, at least if the authors wish to draw defensible inferences about effects from the observational study component. There is also a multiple-comparisons problem that needs to be addressed using modern techniques. Other issues arise as well.
I was a bit amused by the comment in PPA that “in realistic situations, adherence-adjusted analyses are best regarded as sensitivity analyses.” I regard any causal analysis of observational data (or a randomized trial with major compliance problems) as just a piece of a sensitivity analysis; it is the piece in which results are obtained under the particular assumptions of that analysis. Because we never know that all the assumptions are correct (and in fact would wisely doubt them), we had better try more than one type of analysis. By seeing how results change as we vary our approach, we are doing a sensitivity analysis. If this variation in method is too broad, going beyond credible assumptions, we may inappropriately discount our results; conversely (and far more often), if this method variation is insufficiently broad, we may miss important sensitivities and become overconfident (Greenland, 1998). Given the potential contribution of the WHI, it seems that the planned method variation outlined by PPA is insufficient. I will suggest a few of many possible expansions. Perhaps more has been done or is planned for the analysis than PPA outlined, but in any case I should hope they address the following concerns.

1. The Need to Go beyond Hazard Ratios
One concern is the exclusive focus on hazard ratios in PPA. As a large cohort study, the WHI provides an uncommon opportunity to assess outcomes on an absolute-risk scale and on a time-to-event (years of life lost) scale. These scales can be far more relevant to decision making (both individual and administrative) than hazard ratios. A hazard ratio of 2 means something very different in terms of risk and benefits if the baseline risk is 1/100,000 versus 1/100. The difference is about 1 excess case versus 1,000 excess cases per 100,000 exposed, which is a 1,000-fold difference in health-care costs, and also a large difference in the (healthy) years of life lost. There is no clue in the tables of PPA what sort of base rates or case numbers the hazard ratios apply to, and so those results are unintelligible in absolute terms.
Even for the purposes of understanding the basic biology and biases, ratio comparisons can become obscure, especially when there is no biologic basis for assuming homogeneity of ratios across covariates. For example, PPA suggest that the inclusion of the covariate main effects zγ in their model (3) partially explains the discrepancy between OS and CT. How much explanation would be achieved by including treatment-covariate product terms in the model? Perhaps a complete explanation remains possible by allowing for more than just time variation in the ratios.

2. Limitations of Biomarkers
PPA focus on the use of biomarkers to calibrate certain short-term measures of intake and activity. This is laudable, but has limitations for the questions that ultimately motivate funding and public interest in such research, such as “what should I eat to minimize my risk of breast cancer?” and “what dietary guidelines should we promote?” One concern is that biomarkers are not good surrogates for the treatment variables (long-term dietary intakes) in these questions; no matter how well measured, long-term biomarkers (such as hair and nail contents) are affected by many poorly understood and mostly unmeasured vagaries of individual metabolism and exposures,
while the short-term biomarkers discussed by PPA reflect current diet and behavior.
Any disconnect between actual long-term diet or behavior and its biomarker is error in the biomarker for the diet. Hence, the comparison of measured diet and biomarker is a comparison of two very noisy measures of long-term diet (with presumably independent but unknown and very differently distributed error). Models (1) and (2) in PPA appear to relate short-term measures; even if the errors in these equations are zero, the results tell us nothing about the error due to dietary variation, and it is not clear from PPA how this error will be accounted for. In any case, one must turn to long-term repeat-questionnaire data to address that variation with all its sources of error as a measure of long-term intake and behavior. Addressing these sources of error will require general uncertainty assessments, as discussed below.

3. The Need for Empirical Bayes
Turning to issues of multiplicity and screening of genetic associations, it seems very odd to me that, in 2005, anyone could neglect use of empirical-Bayes (EB) and related hierarchical procedures. The landmark work of Efron and Morris (1975) on these methods included an epidemiologic application, and today Efron and others continue to advance these approaches into the very genetic problems that PPA discuss (e.g., Efron, 2004). Empirical-Bayes methodology is now textbook material, and theoretical, simulation, and case studies leave little doubt about the advantages of such techniques in multiple-inference problems (Carlin and Louis, 2000).
I have strongly advised that related random-coefficient methods be used for examining effects of multiple nutrients and other factors with hierarchical measurement structure (Greenland, 2000) as will be found in some of the WHI data. Note especially that measurement errors in nutrient intakes computed from questionnaires are compounds of at least two sources: those in questionnaire response and those in the diet-nutrient table as it applies to the foods actually eaten by the subjects (as opposed to those used to construct the table).
Another aspect of the WHI for which empirical-Bayes methods could be important is in examination of potential variation in effects across subgroups, as required for making recommendations and generalizations beyond the WHI cohorts. It has already been noted that the WHI is not representative of all targets. Even if it were, however, public-health and clinical/personal decisions are more accurately guided by differences in risks and life expectancies for multiple outcomes than by summaries across disparate groups and outcomes (Greenland, 2005a). Providing such guidance is a problem in highly multivariate prediction, for which again empirical-Bayes methods have proved their worth.

4. Bayesian and Monte Carlo Uncertainty Assessment
The more general neglect of Bayesian approaches in PPA is regrettable, as priors are needed to achieve identification of causal effects from observational data, and it is clear that PPA have priors and use them in their analyses. For example, in their analysis of E+P stopping times in the first 2 years, PPA generate times from a peculiarly rough two-step density “motivated by hormone therapy stopping rates in community studies.” Setting aside the unrealistic density, their approach here much resembles the sort of Monte Carlo sensitivity analyses (MCSA) that have recently made their way from risk assessment to epidemiology, and which closely parallel Bayesian risk assessment in their use of priors (see Greenland, 2001, 2003, 2005b for reviews and examples). I believe these methods are worth deploying to examine other sources of uncertainty in the WHI, such as residual measurement error, selection effects, and confounding. Such methods may be especially relevant for addressing the potential impact of measurement error in variables that lack validation and reliability data, including confounders such as smoking history.

5. To Summarize
The Women’s Health Initiative is a remarkable achievement, providing a much-needed resource for checking and challenging results of epidemiologic studies, and it will no doubt provide new leads of its own. While I think PPA have done a good job of planning the analysis within their areas of focus, a broader strategy is needed in both the choice of outcome measures and in approaches to multiple inference and uncertainty assessment. The WHI is too valuable a resource to underanalyze.

References
Carlin, B. and Louis, T. A. (2000). Bayes and Empirical-Bayes Methods for Data Analysis, 2nd edition. New York: Chapman and Hall.
Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. Journal of the American Statistical Association 99, 96–104.
Efron, B. and Morris, C. N. (1975). Data analysis using Stein’s estimator and its generalization. Journal of the American Statistical Association 70, 311–319.
Greenland, S. (1998). The sensitivity of a sensitivity analysis. In 1997 Proceedings of the Biometrics Section, 19–21. Alexandria, VA: American Statistical Association.
Greenland, S. (2000). When should epidemiologic regressions use random coefficients? Biometrics 56, 915–921.
Greenland, S. (2001). Sensitivity analysis, Monte Carlo risk analysis, and Bayesian uncertainty assessment. Risk Analysis 21, 579–583.
Greenland, S. (2003). The impact of prior distributions for uncontrolled confounding and response bias: A case study of the relation of wire codes and magnetic fields to childhood leukemia. Journal of the American Statistical Association 98, 47–54.
Greenland, S. (2005a). Epidemiologic measures and policy formulation: Lessons from potential outcomes (with discussion). Emerging Themes in Epidemiology 2, 1–4.
Greenland, S. (2005b). Multiple-bias modeling for analysis of observational data (with discussion). Journal of the Royal Statistical Society, Series A 168, 267–308.

Miguel A. Hernán,1 James M. Robins,1,2 and Luis A. García Rodríguez3
1 Department of Epidemiology
Harvard School of Public Health
Boston, Massachusetts 02115, U.S.A.
email: miguel_hernan@post.harvard.edu
2 Department of Biostatistics
Harvard School of Public Health
Boston, Massachusetts, U.S.A.
3 CEIFE–Spanish Center of Pharmacoepidemiologic Research
Madrid, Spain

1. Introduction
We thank Xihong Lin for the opportunity to discuss Ross Prentice and collaborators’ interesting paper. The Women’s Health Initiative (WHI) randomized hormone trials evaluated the effect of postmenopausal hormone therapy on the risk of various diseases (WHI Study Group, 1998). In the first WHI trial, women were randomly assigned to either estrogen plus progestin or placebo. The rate of coronary heart disease (CHD) in the hormone group was 1.24 times (95% CI: 0.97, 1.60) that in the placebo group (Manson et al., 2003). This result was surprising because large observational studies had previously suggested a reduced risk of CHD among hormone users. Among the largest of these studies were the Nurses’ Health Study (NHS) in the United States (Stampfer et al., 1991; Grodstein et al., 1996, 2000; Grodstein, Manson, and Stampfer, 2001) and a study based on the General Practice Research Database (GPRD) in the United Kingdom (Varas-Lorenzo et al., 2000).
We investigate possible sources of the discrepancy by reanalyzing the observational study data using an approach that mimics as closely as possible the published analyses of the WHI randomized trial. We then compare our approach with Prentice and collaborators’. Originally we had planned to provide reanalyses of both the NHS and GPRD data. Unfortunately, our reanalysis of the NHS data is not yet complete, so we report only the GPRD results. The GPRD is a research-oriented database that covers over 3 million residents in the United Kingdom. These individuals’ general practitioners register health-care and medical information about their patients in a standardized manner. The registered information includes demographic data, all medical diagnoses, consultant and hospital referrals, and a record of all prescriptions issued. Practitioners generate prescriptions directly from the computer, ensuring their automatic recording. Validation studies have shown that 90% of information present in the patients’ paper medical records, and 95% of newly prescribed drugs, are recorded in the database (García Rodríguez and Pérez Gutthann, 1998; Jick et al., 2003).
Several biologic and methodologic explanations for the discrepancy between the CHD results of the WHI randomized trial and the observational studies have been proposed (Grodstein, Clarkson, and Manson, 2003; Mendelsohn and Karas, 2005). We will focus this discussion on the impact of the following methodologic limitations of the observational studies (Grodstein et al., 2003):

1. Lack of comparability between women who initiated and did not initiate hormone therapy (healthy user bias or confounding by “indication”)
In the observational studies, women who started hormone therapy may not be comparable with those who did not start hormone therapy. On average, women who decide to initiate hormone therapy may have fewer risk factors for CHD than noninitiators. Under this hypothesis, initiation of hormone therapy would be associated with a lower risk of CHD even if hormone therapy itself has no preventive effect on the risk of CHD. That is, there would be confounding for the effect of treatment initiation.
The WHI result cannot be explained by confounding for treatment initiation because therapy initiation was assigned at random, and thus initiators are on average comparable with noninitiators.
2. Lack of comparability between women who continued and discontinued hormone therapy (“noncompliance” bias)
Even if there were no confounding for the effect of treatment initiation, participants in observational studies who stayed on hormone therapy for extended periods may be different from those who discontinued hormone therapy shortly after initiation. For example, women who stayed on therapy may be more health conscious than the others. Under this hypothesis, a longer duration of use of hormone therapy would be associated with a lower risk of CHD even if hormone therapy itself has no preventive effect on the risk of CHD. That is, there would be confounding for the effect of treatment discontinuation.
Similarly, WHI hormone users who stayed on hormone therapy for extended periods and those who discontinued hormone therapy shortly after initiation may not be comparable because treatment discontinuation was not randomized. The nonnull WHI results, however, cannot be explained by confounding for treatment discontinuation because the analysis was conducted under the intention-to-treat (ITT) principle. That is, the effect of hormone therapy was estimated by comparing the CHD
risk of those randomly assigned to hormone therapy and placebo, regardless of whether they complied with their assigned treatment. The ITT effect will generally be closer to the null than the effect had all women fully complied with their assigned treatment.
3. Imprecise ascertainment of the time of hormone therapy initiation
In some observational studies (e.g., the NHS), data on hormone use were collected by questionnaires mailed every 2 years and the time of therapy initiation within the 2-year interval is largely unknown. This uncertainty introduces bias in the effect estimates over any fixed (say, 2-year) interval after treatment initiation. For example, in previous analyses, women in the NHS were assigned to the hormone use group that they reported in the questionnaire returned at the onset of the 2-year interval. Thus women who initiated therapy during the interval were systematically misclassified as nonusers until the next questionnaire. If hormone therapy initiation causes a short-term increase in risk, then this misclassification would downwardly bias the effect estimate. In the WHI there is no uncertainty regarding the time of randomized therapy initiation.

In this article, we provide reanalyses of the GPRD that only suffer from limitation 1. Limitation 3 is not present in the GPRD study because exact dates of treatment initiation are recorded. We remove limitation 2 by reanalyzing the GPRD study using an ITT principle. This reanalysis requires conceptualizing the observational GPRD study as if it were a sequence of randomized trials in which the randomization probabilities are unknown. Our ITT effect estimates from the GPRD study are then compared to the ITT estimates from the WHI randomized trial.
In Section 2, we describe a study protocol for the GPRD trials that mimics as closely as possible that of the WHI trial. In Sections 3 and 4, we reanalyze the GPRD trials and obtain (i) estimates of the ITT effect of hormone therapy and (ii) estimates of the effect of continuous hormone therapy (i.e., in the absence of noncompliance). In the last section, we compare our approach with Prentice and collaborators’.

2. Study Protocol of the GPRD Trials
2.1 Eligibility Criteria
We defined inclusion and exclusion criteria in our GPRD trials to mimic the WHI criteria. Like the WHI trial, the GPRD trials include only women aged 50 years or more and with an intact uterus. We mimicked the WHI exclusion criteria (WHI, 1998) as closely as we could by excluding GPRD women with a past diagnosis of cancer (except nonmelanoma skin cancer), cardiovascular disease, and cerebrovascular disease (Varas-Lorenzo et al., 2000).

2.2 Baseline and Follow-Up
In the WHI, women were followed from the time of randomized treatment assignment (baseline) to the diagnosis of a CHD endpoint, death from causes other than CHD, loss to follow-up, or administrative end of follow-up, whichever came first.
In the GPRD cohort, we need to define the time of “randomized” treatment assignment (baseline). Because the follow-up of our cohort started in January 1991, we can define baseline as January 1991, apply the eligibility criteria to women in the cohort in January 1991, and compare the CHD risk of eligible women who reported treatment initiation with that of eligible women who did not report treatment initiation during January 1991. Alternatively, we can define the baseline as February 1991, or as any other subsequent time before the end of follow-up in December 2001. For each possible baseline time, we can apply the eligibility criteria to women in the cohort at that time, so women participating in the trial starting in January 1991 would not necessarily be the same women participating in the trial starting in, say, December 1994.
But rather than fixing a single baseline month for our GPRD trial, we can conduct all possible trials, pool the data, and obtain an estimate of effect with a narrower confidence interval (which appropriately accounts for correlations that may arise from using the same individuals in several trials). Let m denote month with m = 0, 1, . . . , 131 representing January 1991, February 1991, . . . , December 2001. We started a separate GPRD trial at each month m. Each woman may participate in a maximum of 132 trials. For each trial, follow-up started in month m (baseline) and ended at diagnosis of a CHD endpoint, death from causes other than CHD, loss to follow-up, or administrative end of follow-up (8 years, as in the WHI, or December 2001), whichever came first. We index trials by the month m in which they start.

2.3 Treatment Regimes
WHI participants were randomized to either oral estrogen (conjugated equine estrogens 0.625 mg/day) plus progestin (medroxyprogesterone acetate 2.5 mg/day) or placebo. There was a wash-out interval of 3 months before randomization. Our GPRD trials included women who either initiated oral therapy with estrogens plus progesterone or were nonusers of hormone therapy in month m. As an additional eligibility criterion, in each trial m, women were required to have been nonusers of any form of hormone therapy during the year before baseline (wash-out interval). (We chose a year rather than 3 months in the hope of obtaining a better match with the WHI on the distribution of “time since last hormone therapy.”) We refer to women eligible for trial m who did (did not) initiate hormone therapy in month m as “initiators” (noninitiators) in trial m.

2.4 Ascertainment of CHD Endpoints and Confounding Variables
As in the original GPRD analysis (Varas-Lorenzo et al., 2000), we defined the CHD endpoint in study m as the time of nonfatal myocardial infarction or fatal coronary disease between baseline (as defined above) and end of follow-up. The follow-up in the original GPRD study ended in December 1995. Our reanalyses extend follow-up to December 2001. In the original study, over 90% of CHD endpoints ascertained after review of computer records were confirmed by reviewing the patients’ paper medical records and using standardized diagnostic criteria.
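The sequence-of-trials construction of Sections 2.2 and 2.3 can be illustrated with a minimal sketch. The `Woman` record layout, the field names, and the function `expand_to_trials` are hypothetical and are not taken from the GPRD analysis. The sketch assumes monthly data, a 12-month wash-out, and an administrative end of follow-up at 96 months (8 years) or December 2001, as described in the text.

```python
from dataclasses import dataclass
from typing import List

N_TRIALS = 132        # trials m = 0..131 (January 1991 .. December 2001)
WASHOUT = 12          # months of required nonuse before baseline
MAX_FOLLOWUP = 96     # administrative end of follow-up: 8 years after baseline


@dataclass
class Woman:
    # Hypothetical record layout, not the GPRD schema.
    use: List[bool]             # use[i]: on hormone therapy in month i - WASHOUT
    meets_criteria: List[bool]  # meets_criteria[m]: other eligibility criteria at m
    last_month: int             # last month under observation (0..131)
    chd: bool                   # True if observation ended with a CHD endpoint


def expand_to_trials(w: Woman) -> List[dict]:
    """Return one row per trial m for which the woman is eligible."""
    rows = []
    for m in range(N_TRIALS):
        if m > w.last_month or not w.meets_criteria[m]:
            continue
        idx = m + WASHOUT                      # position of month m in w.use
        if any(w.use[idx - WASHOUT:idx]):      # hormone use during the wash-out year
            continue
        end = min(w.last_month, m + MAX_FOLLOWUP)
        rows.append({
            "trial": m,
            "initiator": w.use[idx],           # A(m): initiated therapy in month m
            "end_month": end,
            # the CHD endpoint counts only if it falls inside this trial's follow-up
            "chd": w.chd and end == w.last_month,
        })
    return rows
```

Under these assumptions, a woman who starts therapy in month 3 and stays on it contributes to trials 0 to 2 as a noninitiator and to trial 3 as an initiator, and is ineligible for later trials because of the wash-out requirement.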
In each trial m, we obtained at baseline (i.e., just prior to month m) data on the following potential confounders: age, calendar month, family history of CHD, high cholesterol, high blood pressure, diabetes, body mass index, smoking, alcohol intake, aspirin use, nonsteroidal anti-inflammatory drug use, and previous use of hormone therapy. Data on additional potential “lifestyle” confounders were unavailable.

3. Analytic Approach for the GPRD Trials
As discussed further below, our conceptualization of an observational study with a time-varying treatment as a sequence of trials can be viewed as a special case of g-estimation of nested structural models (Robins, 1989).

3.1 ITT Effect of Treatment
In each GPRD trial, we compared the CHD hazard rate of initiators and noninitiators, regardless of whether these women subsequently stopped or initiated therapy. Thus our approach is the observational equivalent of the ITT principle that guided the main analysis of the WHI trial. To the women eligible for each GPRD trial $m$, we fit the Cox proportional hazards model

$$\lambda_T[t \mid G(m) = 1, A(m), \bar{L}(m)] = \lambda_0[t] \exp[\alpha A(m) + \eta' \bar{L}(m)], \tag{1}$$

where $m$ indexes the trial (months from January 1991), $T$ is the time from baseline of trial $m$ to CHD, $G(m)$ is an indicator for eligibility for trial $m$ (1: yes, 0: no), $A(m)$ is hormone therapy initiation at $m$ (1: yes, 0: no), $\bar{L}(m)$ is a vector representing covariate history through baseline $m$, $\lambda_T[t \mid G(m) = 1, \bar{L}(m), A(m)]$ is the conditional hazard of CHD at time $t$, $\lambda_0[t]$ is the baseline hazard at $t$, and $\exp[\alpha]$ is the conditional ITT hazard ratio for hormone therapy initiation versus noninitiation at baseline $m$. We modeled $\bar{L}(m)$ by including the potential confounders described in the previous section as covariates. All covariates were categorical except age, alcohol intake, and calendar month. The age effect was modeled as cubic splines with 3 knots and with product terms of the age coefficients with diabetes and hypertension. To increase precision, we pooled all 132 GPRD trials in a single analysis. Because many women participate in more than one trial, we used the robust variance to account for within-person correlation. In addition to our main analyses, we conducted subgroup analyses by age (<60, ≥60 years) at baseline and investigated how the rate ratio $\exp[\alpha]$ was modified by the month $m$ of the trial and by time since initiation of therapy.
Under the assumption of no unmeasured confounders, our Cox model estimates the conditional ITT hazard ratio $\exp[\alpha]$ within levels of $\bar{L}(m)$, that is, the (conditional) hazard had everybody initiated treatment divided by the hazard had nobody initiated treatment in each GPRD trial. Note that when this analytic approach is applied to a closed cohort in which noneligible women never become eligible at later times, each trial is nested in the prior trial (Hernán et al., 2005) and we refer to the Cox model (1) pooled over all trials as a nested Cox model.

3.2 Effect of Continuous Treatment
The magnitude of the ITT hazard ratio in a study depends not only on the effect of hormone therapy but also on the degree of “compliance.” (In our GPRD trials, we defined the time to noncompliance in trial $m$ as the difference between $m$ and the month of first deviation from baseline treatment, i.e., discontinuation of hormone therapy for initiators, and initiation of hormone therapy for noninitiators.) The WHI and the GPRD differ markedly in their “time to noncompliance” distributions (see Section 5 below), which could cause their ITT hazard ratios to differ substantially. To disaggregate the effect of noncompliance from the effect of hormone therapy, we attempted to estimate for the GPRD trials the “continuous treatment hazard ratio” that would be observed under full compliance, that is, the hazard ratio comparing continuous treatment in all initiators versus no treatment in all noninitiators.
To do so, separately in each trial $m$, we censored women when they discontinued their baseline treatment. Because this censoring is potentially informative (i.e., noncompliance is nonrandom) and may lead to selection bias (Hernán, Hernández-Díaz, and Robins, 2004), a woman $i$ at risk (and thus uncensored) in month $k > m$ was upweighted by the inverse of her estimated probability of remaining uncensored from month $m$ through month $k$. Specifically, for each trial $m$ we fit logistic models

$$\operatorname{logit} \Pr[A(j) = a \mid G(m) = 1, A(j-1) = a, A(m) = a, \bar{L}(j), T > j] = \theta_{a0} + \theta_{a1}' \bar{L}(j) \quad \text{for } j > m, \tag{2}$$

for continuing compliance separately for initiators ($a = 1$) and noninitiators ($a = 0$). The estimated probability of continuing the baseline treatment through month $k > m$ for subject $i$ is the product $\prod_{j=m+1}^{k} \hat{P}_{mi}(j)$, where $\hat{P}_{mi}(j)$ is the predicted value

$$\hat{P}_{mi}(j) = G_i(m) \Pr[A(j) = a \mid G(m) = 1, A(j-1) = a, A(m) = a, \bar{L}_i(j), T > j] \Big|_{a = A_i(m)}$$

from the logistic models. We then estimated the rate ratio $\exp[\alpha]$ by refitting Cox model (1) after censoring women at the time of discontinuation of baseline treatment and weighting their contributions to the partial likelihood at time $k$ by the inverse probability weights (IPW) $\hat{W}_{m,i}(k) = \left[\prod_{j=m+1}^{k} \hat{P}_{mi}(j)\right]^{-1}$. Again, to increase precision we pooled all 132 GPRD trials in a single analysis. The assumptions required for the limit of $\exp[\hat{\alpha}]$ to be the “continuous treatment hazard ratio” are discussed in Section 5. To examine whether censoring due to noncompliance was “informative,” we repeated the above analysis without weights (i.e., we set all the $\hat{W}_{m,i}(k)$ to 1).
For comparison purposes, we will also fit a standard time-varying Cox model

$$\lambda_{T'}[t \mid G(0) = 1, A(t), \bar{L}(t)] = \lambda_0[t] \exp[\beta_c A_c(t) + \beta_p A_p(t) + \gamma' L(t)], \tag{3}$$

where $T'$ is the time from the first eligible trial (i.e., month) to CHD, $A_c$ is an indicator for being currently on treatment, $A_p$ is an indicator for being a past user at $t$ (past treatment), and $L(t)$ are the updated covariate values at $t$. The hazard ratios $\exp[\beta_c]$ and $\exp[\beta_p]$ compare the CHD incidence in

Table 1
Number of participants, hormone therapy initiators, and CHD events in each GPRD trial (for illustration purposes, only trials
25–50 are shown)

Trial Month Participants CHD events Initiators CHD events in initiators


25 January 1993 68,026 1134 218 1
26 February 1993 67,774 1112 193 1
27 March 1993 67,669 1085 239 1
28 April 1993 67,338 1060 201 1
29 May 1993 66,972 1030 200 1
30 June 1993 66,893 1009 170 1
31 July 1993 66,720 985 168 0
32 August 1993 66,655 966 192 0
33 September 1993 66,354 947 134 1
34 October 1993 66,301 928 132 0
35 November 1993 66,165 908 155 1
36 December 1993 65,983 884 98 0
37 January 1994 69,729 871 149 2
38 February 1994 69,592 858 185 2
39 March 1994 69,262 833 196 3
40 April 1994 69,019 813 168 0
41 May 1994 68,919 801 141 0
42 June 1994 68,442 785 146 1
43 July 1994 68,245 751 135 0
44 August 1994 68,053 736 158 0
45 September 1994 67,769 718 137 2
46 October 1994 67,681 689 135 1
47 November 1994 67,413 661 145 1
48 December 1994 67,151 648 97 0
49 January 1995 69,901 626 178 1
50 February 1995 69,500 618 146 1

current and past users at $t$, respectively, with that of never users within levels of the updated covariates $L(t)$. We investigated how the rate ratio at $t$ was modified by the duration $D(t)$ since the last reinitiation of hormone therapy (following a period of at least a year of nonuse) at an eligible month by adding, for example, $\beta_1 A_c(t) I(5 > D(t) > 2) + \beta_2 A_c(t) I(D(t) > 5) + \beta_3 A_c(t) N(t)$ to the model, where $N(t)$ is one if a subject never initiated therapy at an eligible month and zero otherwise.

4. Results from the GPRD Trials
Our analyses included 99,072 women who met the eligibility criteria for at least one GPRD trial. Of these women, 1889 had a CHD event and 606 died during the follow-up.
On average, each woman participated in 60.5 trials (standard deviation [SD]: 35.3, median: 59.0) and thus our analyses include 5,997,824 (nondistinct) women, 10,566 initiators, 64,583 CHD endpoints, and 20,815 deaths when all trials are pooled. The records of 16% of the initiators and 9% of the noninitiators indicated use of hormone therapy more than 1 year before baseline. Only 64 CHD endpoints occurred among initiators, thus limiting the precision of our analysis. As an example, Table 1 shows the number of participants, initiators, and CHD events in trials 25–50. The mean duration of follow-up across all trials was 4.1 years (SD: 2.6, median: 3.8 years), and the mean age at baseline was 54.6 years (SD: 4.5, median: 53.0) for the initiators and 62.0 years (SD: 6.7, median: 62.0) for the noninitiators.

4.1 Estimates of the ITT Effect
The estimated ITT hazard ratio (95% CI) of CHD for hormone therapy initiation versus no initiation from model (1) was 0.92 (0.73, 1.17). When an interaction term between treatment initiation at baseline A(m) and the month m that the trial began was added, the term’s estimated coefficient (95% CI) was 0.005 (−0.059, 0.158), indicating little evidence of trial time–treatment interaction. The estimated ITT hazard ratios (95% CI) were 0.97 (0.74, 1.27) for women younger than 60 years and 0.73 (0.44, 1.22) for women 60 years and older at baseline. Table 2 shows the estimates when we restricted the analysis to various periods of follow-up. A further breakdown shows hazard ratios of 0.82 (0.55, 1.21) in years 2–5, and 0.69 (0.38, 1.25) in years 5–10.
We also estimated the ITT effect of hormone therapy on mortality after replacing “time to CHD” by “time to death” in model (1). The estimated ITT hazard ratio (95% CI) of death for hormone therapy initiation versus no initiation was 0.89 (0.55, 1.46). When we restricted the duration of the GPRD trials, the respective estimated ITT hazard ratios (95% CI) were 1.27 (0.66, 2.43) for 2 years, 1.09 (0.67, 1.77) for 5 years, and 0.90 (0.75, 1.20) for 8 years.

4.2 Estimates of the Continuous Treatment Effect and Standard Covariate-Updated Analyses
The proportion of noninitiators who initiated therapy (Figure 1) and of initiators who discontinued therapy (Figure 2) increased over the follow-up period. By 6 years

Table 2
CHD hazard ratios and 95% confidence intervals for hormone therapy use in the GPRD trials

                    Initiators versus   Continuous versus                       Current versus
                    noninitiators       never users                             never users
                    Model (1)           Model (1)                               Model (3)
Years of follow-up  ITT                 IPW                 Unweighted          Updated covariates
0–2                 1.20 (0.84, 1.72)   1.33 (0.79, 2.22)   1.32 (0.82, 2.13)   1.02 (0.63, 1.65)
0–5                 0.99 (0.76, 1.28)   0.83 (0.52, 1.32)   0.98 (0.65, 1.49)   0.80 (0.52, 1.21)
0–8                 0.95 (0.75, 1.20)   0.95 (0.60, 1.51)   0.98 (0.67, 1.43)   0.88 (0.61, 1.28)
All                 0.92 (0.73, 1.17)   0.87 (0.55, 1.39)   0.97 (0.66, 1.42)   0.87 (0.60, 1.27)
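
As a rough aid to reading Table 2, the sketch below converts a reported hazard ratio interval to the log scale, where such intervals are typically symmetric, and backs out an approximate standard error. It assumes the intervals are Wald-type (estimate plus or minus 1.96 standard errors on the log hazard ratio scale), which the text does not state. The function name `log_scale_summary` is illustrative only.

```python
import math

def log_scale_summary(lower, upper, z=1.96):
    """Width and implied standard error of a hazard ratio CI on the log scale.

    Assumes a Wald-type interval, exp(log(HR) +/- z * SE). This is an
    assumption about how the interval was built, not something Table 2 states.
    """
    width = math.log(upper) - math.log(lower)
    se = width / (2.0 * z)
    return width, se

# "All" row of Table 2: (lower, upper) limits of each 95% CI.
table2_all = {
    "ITT": (0.73, 1.17),
    "IPW": (0.55, 1.39),
    "Unweighted": (0.66, 1.42),
    "Updated covariates": (0.60, 1.27),
}

for label, (lo, hi) in table2_all.items():
    width, se = log_scale_summary(lo, hi)
    print(f"{label:>18}: log-scale width {width:.2f}, implied SE {se:.2f}")
```

For the overall ITT interval (0.73, 1.17) this gives a log-scale width of about 0.47. The overall IPW interval is roughly twice as wide on the same scale.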

of follow-up, the proportion of noncompliance was 13% in noninitiators and 79% in initiators. In
the latter group, the steepest drop in hormone therapy use occurred during the first year after
baseline (Figure 2). The high discontinuation rate found in the GPRD reflects that of the general
British population (Bromley, de Vries, and Farmer, 2004).

Figure 1. Probability of initiating hormone therapy among noninitiators.

Figure 2. Probability of discontinuing hormone therapy among initiators.

Using IPW to adjust for informative censoring, the estimated hazard ratio of CHD for continuous
hormone therapy versus no therapy using weighted model (1) was 0.87 (0.55, 1.39). When we did not
weight, we obtained an estimated hazard ratio of 0.97 (0.66, 1.42). Table 2 shows the weighted
and unweighted estimates when we restricted the analysis to various periods of follow-up.

The standard updated-covariate analysis gave an estimated hazard ratio of 0.87 (0.60, 1.27) for
current therapy versus never exposed since the first eligible visit. The last column of Table 2
shows the corresponding covariate-updated estimates as a function of duration of treatment (since
the last eligible period). A further breakdown shows hazard ratios of 0.48 (0.21, 1.06) in years
2–5, and 1.34 (0.60, 3.01) in years 5–10.

5. Discussion and Comparison with Prentice et al.

Our ITT analysis of our GPRD trials suggests that initiation of estrogen plus progestin does not
have a substantial impact on the risk of CHD although, when compared with noninitiators, the CHD
incidence of initiators was 20% greater during the 2-year period after initiation and 5% lower
when averaged over the 8-year period after initiation. Neither estimate approached statistical
significance.

We did not find significant risk differences by age, but power was limited because few younger
women had a CHD endpoint and few older women initiated therapy. We could not stratify the
analysis by time since menopause because time of menopause is not systematically recorded in the
GPRD. When we further restricted eligibility in trial m by requiring no prior recorded hormone
use (rather than a year wash-out period), 91% of the previously eligible women remained eligible
and the ITT estimates showed little change (data not shown).

The ITT estimates from the GPRD trials are closer to the null than those of the WHI trial (WHI
overall hazard ratio: 1.24, 95% CI: 0.97, 1.60) (Manson et al., 2003). This attenuation may be a
consequence of the presence of unmeasured confounding for treatment initiation in the GPRD, a
higher proportion of noncompliance in the GPRD trials, random variability in both studies, or a
combination of these factors. The GPRD-WHI ITT differences cannot be explained by any uncertainty
in time of therapy initiation or by confounding by risk factors whose distribution differed in
women who continued versus discontinued therapy.

Our approach provides unbiased estimates of the ITT effect only under the assumption of no
unmeasured confounders for treatment initiation. Although this assumption cannot be directly
tested in observational studies, comparison between the adjusted and the unadjusted estimates can
be useful in assessing the hypothesis that substantial confounding by unmeasured factors remains.
When we repeated our ITT analysis without adjustment for baseline covariates (except age and

calendar month), the estimated hazard ratio was 0.85 (95% CI: 0.67, 1.08), which is only
moderately less than the fully adjusted estimate 0.92. Were sampling variability absent, it would
then follow that the magnitude of confounding due to unmeasured variables would have to exceed
the confounding due to measured variables to explain the full GPRD-WHI discrepancy. Given the
breadth of the measured variables, we believe this hypothesis is unlikely, although a downward
bias of perhaps 0.1 or 0.2 in our hazard ratio estimate is still plausible, especially in light
of the large sampling variability. Indeed large sampling variability is a major problem. For
example, the overall ITT hazard ratios from the GPRD and the WHI trials were estimated with
similarly low precision (width of the 95% CIs on the log scale: about 0.46 in WHI and 0.47 in
GPRD) with point estimates close to the null. This relatively low precision precludes drawing
strong conclusions from either study and produces overlapping confidence intervals for the GPRD
and the WHI estimates and a nonsignificant estimated difference in ITT effects. For the all-cause
mortality hazard our GPRD estimates were quite similar to the WHI estimate of 0.98 (95% CI: 0.70,
1.37).

Both the WHI and our primary GPRD analysis estimated the ITT effect of hormone therapy initiation
rather than the effect of continuing hormone therapy. Because the rate of noncompliance differed
between the GPRD and the WHI trials (Writing Group for the WHI Investigators, 2002), with 42%
(WHI) versus 79% (GPRD) in initiators and 11% (WHI) versus 13% (GPRD) in noninitiators at 6 years
of follow-up, our GPRD estimates may not be directly comparable with the WHI estimates. To
eliminate the effect of noncompliance, we attempted to estimate the “continuous treatment or full
compliance” hazard ratio (i.e., the ITT effect in the absence of noncompliance) in the GPRD
trials by IPW. As discussed by Robins and Finkelstein (2000) and Robins (1998), one should not
regard as noncompliant women whose deviation from their assigned therapy was for (not easily
palliated) adverse medical reasons. Prentice et al. make a similar point. However, in the GPRD
study, this option was not available to us, as data on why a woman stopped hormone therapy were
not routinely collected. Robins and Finkelstein (2000) showed that the IPW estimates are
consistent for the “continuous treatment” hazard ratio if (i) women who initiated and did not
initiate hormone therapy in each trial m were comparable conditionally on L̄(m) (no unmeasured
confounding for treatment), (ii) women who discontinued and did not discontinue their baseline
treatment in each month k were comparable conditionally on L̄(k) (no unmeasured confounding for
censoring), and (iii) model misspecification is absent. Our IPW methodology to correct for
informative censoring is a special case of the much more general methodology of IPW estimation of
marginal structural models. In the GPRD, the overall IPW hazard ratio estimate of 0.87 was close
to the overall ITT estimate of 0.92.

Further, by comparing the weighted and unweighted estimates of the continuous therapy effect in
Table 2, we can see that although censoring by noncompliance may have been moderately
informative, the observed differences are not statistically significant. It would be interesting
to conduct IPW analyses of censoring by noncompliance in the WHI trial as well.

Although we did not do so here, in the presence of unmeasured confounding for treatment
continuation (i.e., continued compliance), IPW estimators can be used to conduct a sensitivity
analysis as follows. Suppose, for the moment, the amount of unmeasured confounding were known, in
the sense that we could choose a parameter ω and a function q(j, m, a, L̄(j), T_ā) such that
their product ωq(j, m, a, L̄(j), T_ā) correctly quantifies the degree of dependence on the
log-odds scale between the probability of treatment continuation and the counterfactual survival
time T_ā under treatment history ā through the model

\[
\operatorname{logit} \Pr\bigl[A(j) = a \mid G(m) = 1,\ A(j-1) = a,\ A(m) = a,\ \bar{L}(j),\ T > j,\ T_{\bar{a}}\bigr]
  = \theta_{a0} + \theta_{a1}\bar{L}(j) + \omega\, q\bigl(j, m, a, \bar{L}(j), T_{\bar{a}}\bigr)
  \quad \text{for } j > m. \tag{4}
\]

This logistic model reduces to model (2) if there is no unmeasured confounding for continued
compliance (i.e., ωq(j, m, a, L̄(j), T_ā) = 0). Because the degree of unmeasured confounding is
actually unknown, we suggest a sensitivity analysis in which one plots estimates and confidence
intervals for the “continuous treatment” hazard ratio as a function of ω and
q(j, m, a, L̄(j), T_ā), where ω and q(j, m, a, L̄(j), T_ā) are allowed to vary over a plausible
range of values and functional forms (Scharfstein et al., 2001; Robins, 2002).

Prentice and collaborators also consider estimating the full compliance hazard ratio in the WHI
randomized trial by censoring subjects at the time of noncompliance, but do not use data on
evolving postrandomization covariates L̄(j) to reweight subjects. Prentice and collaborators
conjecture that any bias due to this failure to adjust for L̄(k) is likely small. The rather
modest differences between the weighted and the unweighted estimates in Table 2 serve as an
empirical test and partial confirmation of this conjecture in the GPRD. However, in observational
studies of the effect of drug therapy on time to AIDS or death in HIV-infected subjects, the
magnitude of confounding by time-varying covariates (e.g., CD4 cell count) is much larger than
for the effect of hormone therapy on CHD in the GPRD study. In these studies we have repeatedly
shown that standard analytic strategies fail; only “causal inference” methods (either IPW
estimation of marginal structural models or g-estimation of nested structural models)
successfully reproduce the results of randomized trials (Cole et al., 2003; Hernán et al., 2005;
Sterne et al., 2005). Because small bias cannot be assured a priori, we believe an analyst should
routinely correct (separately in each arm) for selection bias attributable to the measured
factors L̄(k) by using IPW and should, perhaps, also consider using IPW to investigate the
sensitivity of one’s inferences to confounding by unmeasured factors.

Prentice et al. mention the existence of methods for analyzing double-blind randomized trials
suffering from noncompliance that both (i), like an as-treated analysis, provide estimates of the
treatment effect under full compliance and yet (ii), like an ITT analysis, protect the α-level
under the null hypothesis of no treatment effect (without imposing any assumptions concerning
either the existence or magnitude of unmeasured confounding for treatment continuation).
Specifically, Prentice et al. reference methodologies proposed

by Cuzick, Edwards, and Segnan (1997) and Frangakis and Rubin (1999). However, these
methodologies only apply if compliance is of the “all or none” type, and censoring by end of
follow-up is independent conditional on complier type. But in the WHI compliance is complex and
time varying, with women repeatedly stopping and starting their assigned therapy. Further,
although less likely, censoring by end of follow-up may be dependent if secular changes in
baseline mortality risk have occurred over the trial accrual period. In this setting, as far as
we are aware, g-estimation of nested structural models (usually referred to as SNFTMs) is the
only general methodology available for the analysis of failure time data that satisfies both (i)
and (ii) (Mark and Robins, 1993). Of course, adequate data on actual treatment A(t) must be
available for analysis. The Appendix provides further detail.

We could have also used doubly robust g-estimation of an SNFTM rather than our IPW methodology to
estimate the effect of continuous hormone therapy on CHD in the GPRD study. Doubly robust
g-estimation provides consistent estimation of the effect of continuous hormone therapy if there
is no unmeasured confounding for treatment initiation, the SNFTM is correct, and one has
correctly specified either (but not necessarily both) a model for the conditional probability
that an eligible subject (i.e., G(m) = 1) initiates treatment in trial m given L̄(m) or a model
for the counterfactual regressions E[T_{m,0} | L̄(m), G(m) = 1, T > m], where T_{m,0} is a
subject’s possibly counterfactual time to CHD had the subject received her observed treatment
Ā(m − 1) up till month m − 1 and no treatment from m (these g-estimators are referred to as
doubly robust because of this latter property). The requirement for correct specification of the
SNFTM in g-estimation substitutes for the requirement for correct specification of model (4) in
IPW estimation.

Furthermore, we could have used doubly robust g-estimation of an SNFTM to estimate the ITT effect
of treatment in our GPRD trials but on a multiplicative survival scale rather than on a hazard
ratio scale. In this setting the simplest SNFTM is a nested AFT model (defined in the Appendix).
A nested AFT model (and more generally any ITT SNFTM) has certain theoretical advantages compared
with nested hazard ratio models such as the nested Cox model that we used. First, as remarked by
Prentice et al., and in contrast with nested AFT models, if the treatment and control ITT hazards
cross at some time t, the values of the parameters of even a correctly specified ITT hazard ratio
model do not determine when (or even whether) the survival curves also cross, unless combined
with an estimate of the baseline survivor function. (Only when the survival curves cross can one
logically conclude that treatment benefits some subjects and harms others.) Second, standard
hazard ratio models do not admit doubly robust estimators, although this shortcoming in
robustness can be alleviated by using marginal structural hazard ratio models.

Reading from Table 2, we see that there is no qualitative difference between IPW results and the
results from the standard updated-covariate analysis, especially in view of the substantial
sampling variability. Both analyses suggest a possible hazard ratio of less than 1 when the
duration of therapy is from 2 to 5 years. However, in light of the large sampling variability and
multiple comparison considerations, no definitive conclusions are possible. In contrast with the
qualitative agreement in the GPRD, in studies of the effect of highly active antiretroviral
therapy (HAART) on (i) time to AIDS or death and (ii) evolution of CD4 count in HIV-infected
subjects, IPW succeeded but standard updated-covariate analyses failed to reproduce results found
in randomized clinical trials. The problem with the standard updated-covariate analysis is that
it adjusts for covariates affected by earlier treatment, which can result in bias (Hernán et al.,
2004).

As mentioned in the introduction, the original standard updated-covariate analyses of the GPRD
reported a statistically significant hazard ratio of 0.72 (0.59, 0.89) for current versus never
exposed. However, the original 1995 GPRD analyses differed from ours in that (i) all hormone
users (including estrogen only users) were compared to never users, (ii) a subject was defined as
“currently” exposed at t if exposed any time in the 6 months before t (regardless of past use
history), and (iii) the maximum duration of follow-up was 5 years rather than 10 years. When we
repeated our analyses using definition (ii) of current exposure, effect estimates were little
changed (data not shown). As discussed above, our analyses suggest (but do not prove) that the
hazard ratio is modified by duration of exposure and thus presumably by duration of follow-up
when current exposure is coded simply as 1 or 0. Thus the difference between our results and
those of the original GPRD analyses is presumably due to (i) and perhaps to (iii).

Finally, five remaining differences may affect the GPRD-WHI comparability. First, individuals in
the GPRD trials were not blinded as to whether they did or did not receive hormone therapy. If
awareness of exposure status modified the behavior of either the women or their physicians in
ways that affected the risk of a CHD diagnosis, then the GPRD estimates would reflect the joint
effect of hormone therapy and these behavioral modifications. WHI participants were initially
blinded to treatment regime, although some of them may have become aware of it later on, and in
fact differential unblinding of hormone users has been suggested as a potential source of bias in
the WHI (Garbe and Suissa, 2004). Second, women with conditions inconsistent with adherence
(e.g., menopausal symptoms) were excluded in the WHI but not in our GPRD analysis. The GPRD and
WHI results might differ if, as the WHI results suggest (Manson et al., 2003), hormone therapy is
less harmful, or possibly beneficial, in the presence of menopausal symptoms. Third, women who
initiated hormone therapy in the GPRD were, on average, 8.6 years younger than initiators in the
WHI. Fourth, the particular drugs used for postmenopausal hormonal therapy in the WHI and in the
GPRD are different. Last, there is no guarantee that the GPRD and WHI noncompliers were
comparable. For example, many of the GPRD “noncompliers” stopped hormone therapy simply because
their physician prescribed the drug only for a brief period to combat menopausal symptoms. This
last concern could be partly alleviated by comparing the effects of continued hormone therapy in
both the GPRD and the WHI using either IPW or g-estimation methodology.
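
The remark above that crossing hazards do not by themselves locate a crossing of the survival curves can be illustrated with a small numerical sketch. Everything in it is hypothetical: a step-function hazard ratio of 1.5 before year 2 and 0.5 afterward, and a Weibull-type baseline cumulative hazard whose shape is varied. None of these values come from the GPRD or the WHI.

```python
import math

def survival_curves(t, shape, t_switch=2.0, hr_early=1.5, hr_late=0.5):
    """Return (S_control(t), S_treated(t)) under a step-function hazard ratio.

    Hypothetical setup: baseline cumulative hazard L0(t) = 0.05 * t**shape;
    treated hazard is hr_early * baseline before t_switch, hr_late * baseline after.
    """
    scale = 0.05
    lam0 = scale * t ** shape
    lam0_switch = scale * t_switch ** shape
    if t <= t_switch:
        lam1 = hr_early * lam0
    else:
        lam1 = hr_early * lam0_switch + hr_late * (lam0 - lam0_switch)
    return math.exp(-lam0), math.exp(-lam1)

def survival_crossing_time(shape, t_switch=2.0, hr_early=1.5, hr_late=0.5):
    """Time after t_switch at which the two survival curves are equal.

    Solves hr_early*L0(s) + hr_late*(L0(t) - L0(s)) = L0(t) for t, where
    s = t_switch and L0(t) is proportional to t**shape.
    Requires hr_early > 1 > hr_late so that a crossing exists.
    """
    ratio = (hr_early - hr_late) / (1.0 - hr_late)
    return t_switch * ratio ** (1.0 / shape)

# Same hazard ratio function, three different baseline shapes.
for shape in (0.5, 1.0, 2.0):
    t_cross = survival_crossing_time(shape)
    s0, s1 = survival_curves(t_cross, shape)
    print(f"shape={shape}: hazards cross at 2.0, survival curves cross at "
          f"{t_cross:.2f} (S0={s0:.3f}, S1={s1:.3f})")
```

With the same hazard ratio function, the survival curves cross at about 8, 4, and 2.83 years for shapes 0.5, 1, and 2, so a study with 5 years of follow-up would observe a crossing only in the last two cases.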

In conclusion, we have described an analytic approach for observational studies that mimics that
commonly used for randomized trials and that allows more direct comparisons between the results
of observational and randomized studies. Under our approach no clear beneficial effect or adverse
effect of combined hormone therapy is apparent in the GPRD, but we had little power to discover
small to moderate effects. The difference between the overall WHI ITT estimate of 1.24 and our
GPRD ITT estimate of 0.92 is consistent with random variability, although additional systematic
sources of small to moderate bias cannot be excluded in the GPRD. Unfortunately, because of the
large sampling variability in both the WHI trial and the GPRD study, our results shed little
light on the question of whether an (even correctly analyzed) observational study of a “lifestyle
exposure” can reliably discriminate among causal relative risks close to 1. Prentice et al. show
that when the hazard ratio is allowed to vary with duration of therapy, the WHI randomized trial
and the WHI observational study provide similar hazard ratio estimates. But these authors also
had little power to distinguish this similarity hypothesis from the hypothesis of a moderate
systematic difference between the hazard ratios, which raises the following counterfactual
questions that we hope the authors might respond to in their rejoinder. Had the WHI randomized
trial been cancelled and the only data been that from the WHI observational study, would Prentice
et al. have analyzed the data in the same way and reached the same conclusions as in their actual
paper? Further, what is their best explanation of the discrepancy between the results of their
WHI observational analysis and the results of the other observational studies that found a clear
benefit of hormone therapy on CHD? How certain are they that this explanation is correct? We ask
because, in our analyses of the GPRD and the NHS, we have often been unable to find clear and
convincing explanations for the variation observed in our effect estimates with elaboration of
the analytic model in different directions.

References

Bromley, S. E., de Vries, C. S., and Farmer, R. D. T. (2004). Utilisation of hormone replacement
    therapy in the United Kingdom. A descriptive study using the general practice research
    database. British Journal of Obstetrics and Gynaecology 111, 369–376.
Cole, S. R., Hernán, M. A., Robins, J. M., et al. (2003). Effect of highly active antiretroviral
    therapy on time to acquired immunodeficiency syndrome or death using marginal structural
    models. American Journal of Epidemiology 158, 687–694.
Cuzick, J., Edwards, R., and Segnan, N. (1997). Adjusting for non-compliance and contamination in
    randomized clinical trials. Statistics in Medicine 16, 1017–1029.
Frangakis, C. E. and Rubin, D. B. (1999). Addressing complications of intention-to-treat analysis
    in the combined presence of all-or-none treatment compliance and subsequent missing outcomes.
    Biometrika 86, 365–379.
Garbe, E. and Suissa, S. (2004). Hormone replacement therapy and acute coronary outcomes:
    Methodological issues between randomized and observational studies. Human Reproduction 19,
    8–13.
García Rodríguez, L. A. and Pérez Gutthann, S. (1998). Use of the UK General Practice Research
    Database for pharmacoepidemiology. British Journal of Clinical Pharmacology 45, 419–425.
Grodstein, F., Stampfer, M. J., Manson, J. E., Colditz, G. A., Willett, W. C., Rosner, B.,
    Speizer, F. E., and Hennekens, C. H. (1996). Postmenopausal estrogen and progestin use and
    the risk of cardiovascular disease. New England Journal of Medicine 335, 453–461. (Erratum in
    New England Journal of Medicine 1996, 335, 1406.)
Grodstein, F., Manson, J. E., Colditz, G. A., Willett, W. C., Speizer, F. E., and Stampfer, M. J.
    (2000). A prospective, observational study of postmenopausal hormone therapy and primary
    prevention of cardiovascular disease. Annals of Internal Medicine 133, 933–941.
Grodstein, F., Manson, J. E., and Stampfer, M. J. (2001). Postmenopausal hormone use and
    secondary prevention of coronary events in the Nurses’ Health Study. A prospective,
    observational study. Annals of Internal Medicine 135, 1–8.
Grodstein, F., Clarkson, T. B., and Manson, J. E. (2003). Understanding the divergent data on
    postmenopausal hormone therapy. New England Journal of Medicine 348, 645–650.
Hernán, M. A., Hernández-Díaz, S., and Robins, J. M. (2004). A structural approach to selection
    bias. Epidemiology 15, 615–625.
Hernán, M. A., Cole, S. R., Margolick, J. B., Cohen, M. H., and Robins, J. M. (2005). Structural
    accelerated failure time models for survival analysis in studies with time-varying
    treatments. Pharmacoepidemiology and Drug Safety 14, 477–491.
Jick, S. S., Kaye, J. A., Vasilakis-Scaramozza, C., García Rodríguez, L. A., Ruigómez, A., Meier,
    C. R., Schlienger, R. G., Black, C., and Jick, H. (2003). Validity of the General Practice
    Research Database. Pharmacotherapy 23, 686–689.
Manson, J. E., Hsia, J., Johnson, K. C., et al. and the Women’s Health Initiative Investigators.
    (2003). Estrogen plus progestin and the risk of coronary heart disease. New England Journal
    of Medicine 349, 523–534.
Mark, S. D. and Robins, J. M. (1993). A method for the analysis of randomized trials with
    compliance information: An application to the multiple risk factor intervention trial.
    Controlled Clinical Trials 14, 79–97.
Mendelsohn, M. E. and Karas, R. H. (2005). Molecular and cellular basis of cardiovascular gender
    differences. Science 308, 1583–1587.
Robins, J. M. (1989). The analysis of randomized and non-randomized AIDS treatment trials using a
    new approach to causal inference in longitudinal studies. In Health Services Research
    Methodology: A Focus on AIDS, L. Sechrest, H. Freeman, and A. Mulley (eds), 113–159.
    Washington, DC: U.S. Public Health Service, National Center for Health Services Research.
Robins, J. M. (1998). Correction for non-compliance in equivalence trials. Statistics in Medicine
    17, 269–302.
Robins, J. M. (2002). Comment on “Covariance adjustment in randomized experiments and
    observational studies” by Paul R. Rosenbaum. Statistical Science 17, 286–327.

Robins, J. M. and Finkelstein, D. (2000). Correcting for non-compliance and dependent censoring
    in an AIDS clinical trial with inverse probability of censoring weighted (IPCW) log-rank
    tests. Biometrics 56, 779–788.
Robins, J. M., Blevins, D., Ritter, G., and Wulfsohn, M. (1992). G-estimation of the effect of
    prophylaxis therapy for Pneumocystis carinii pneumonia on the survival of AIDS patients.
    Epidemiology 3, 319–336. (Erratum in Epidemiology 1993, 4, 189.)
Scharfstein, D. O., Robins, J. M., Eddings, W., and Rotnitzky, A. (2001). Inference in randomized
    studies with informative censoring and discrete time-to-event endpoints. Biometrics 57,
    404–413.
Stampfer, M. J., Colditz, G. A., Willett, W. C., Manson, J. E., Rosner, B., Speizer, F. E., and
    Hennekens, C. H. (1991). Postmenopausal estrogen therapy and cardiovascular disease. Ten-year
    follow-up from the Nurses’ Health Study. New England Journal of Medicine 325, 756–762.
Sterne, J. A. C., Hernán, M. A., Ledergerber, B., Tilling, K., Weber, R., Robins, J. M., and
    Egger, M., the Swiss HIV Cohort Study. (2005). Long-term effectiveness of potent
    antiretroviral therapy in preventing AIDS and death: The Swiss HIV Cohort Study. Lancet 366,
    378–384.
Varas-Lorenzo, C., García-Rodríguez, L. A., Pérez-Gutthann, S., and Duque-Oliart, A. (2000).
    Hormone replacement therapy and incidence of acute myocardial infarction. A population-based
    nested case-control study. Circulation 101, 2572–2578.
Women’s Health Initiative Study Group. (1998). Design of the Women’s Health Initiative clinical
    trial and observational study. Controlled Clinical Trials 19, 61–109.
Writing Group for the Women’s Health Initiative Investigators. (2002). Risks and benefits of
    estrogen plus progestin in healthy postmenopausal women: Principal results from the Women’s
    Health Initiative randomized controlled trial. Journal of the American Medical Association
    288, 321–333.

Appendix

G-Estimation of Nested Structural Models for Survival Analysis

The simplest structural nested failure time model (SNFTM) implies that for some unknown value ψ*
of ψ, the observable random variable

\[
H_m(\psi) = h_m\bigl(T, \bar{A}(T), \psi\bigr) = \int_m^T \exp\{\psi A(t)\}\, dt
\]

has a conditional distribution given (L̄(m), Ā(m), T > m) equal to that of T_{m,0}, where
T_{m,0} is defined in the main text. This model is related to the time-dependent accelerated
failure time model. It implies that for each m, (T_{m,1} − m) has the same distribution as
exp(−ψ*)(T_{m,0} − m), where T_{m,1} is a subject’s possibly counterfactual time to CHD had the
subject received her observed treatment Ā(m − 1) up to month m and continuous treatment from m
onward. In particular, continuous treatment from m = 0 scales the survival distribution by a
factor exp(−ψ*) compared to no treatment. The parameter ψ* is estimated with doubly robust
g-estimation. A general SNFTM posits H_m(ψ) = h_m(T, Ā(T), L̄(T), ψ) to be a known function of
(T, Ā(T), L̄(T), ψ) increasing in T and satisfying H_m(ψ) = T − m if ψ = 0 or A(u) = 0,
m ≤ u < ∞. Robins et al. (1992) extend g-estimation to allow for right censoring both by
administrative end of follow-up and by competing risks. Owing to double robustness and to the
fact that structural nested failure time models are guaranteed correct (with true ψ* = 0)
whenever a hormone effect on CHD is absent, g-estimation can be used to construct robust tests of
the null hypothesis of no effect of hormone therapy on CHD, whenever there is no unmeasured
confounding for treatment initiation.

To estimate the ITT effect of therapy, we simply redefine

\[
H_m(\psi) = \int_m^T \exp\{\psi A(m)\}\, dt = (T - m)\exp\{\psi A(m)\};
\]

then exp(−ψ) now has the meaning of the ITT effect of treatment, assumed to be the same for each
trial m. We refer to this model as a time-independent nested AFT model for the ITT effect. A
general ITT SNFTM has H_m(ψ) = h_m(T, Ā(m), L̄(m), ψ) with h_m(T, Ā(m), L̄(m), ψ) = T − m if
ψ = 0 or A(m) = 0.
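
To make the transformation in the Appendix concrete, the following sketch evaluates H_m(ψ), the integral of exp(ψA(t)) from m to T, for a treatment history recorded monthly, along with the ITT version in which A(t) is replaced by the baseline value A(m). It assumes, for illustration only, that A(t) is constant within each month and takes the values 0 or 1. The function names and the example history are hypothetical, and the estimation of ψ itself (g-estimation) is not shown.

```python
import math

def h_m(T, m, A, psi):
    """Integral of exp(psi * A(t)) over [m, T], with A constant on each month [k, k+1).

    T   : failure time in months (may be fractional)
    m   : integer baseline month of the trial
    A   : list of 0/1 treatment indicators, A[k] applying to month k
    psi : SNFTM parameter
    """
    total = 0.0
    k = m
    while k < T:
        upper = min(k + 1, T)              # last interval may be a partial month
        total += (upper - k) * math.exp(psi * A[k])
        k += 1
    return total

def h_m_itt(T, m, A, psi):
    """ITT version: treatment is frozen at its baseline value A[m]."""
    return (T - m) * math.exp(psi * A[m])

# Hypothetical history: initiates at month 3, discontinues at month 6.
history = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
T, psi = 10.5, 0.3

print(h_m(T, 3, history, 0.0))    # with psi = 0 this equals T - m
print(h_m(T, 3, history, psi))    # only the three treated months are rescaled
print(h_m_itt(T, 3, history, psi))  # whole interval rescaled by exp(psi * A[3])
```

The two functions agree when treatment is continuous from m onward or never started, and differ once the subject deviates from her baseline treatment, which is the distinction between the continuous-treatment and ITT models described above.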

Duncan C. Thomas
University of Southern California
Preventative Medicine (Division of Biostatistics)
Los Angeles, California, Los Angeles, CA
email: dthomas@usc.edu
In a typically masterful performance, Prentice, Pettinger, and Anderson have beautifully
summarized a broad range of statistical issues raised by one of the most important longitudinal
studies of our day, combining both randomized trial and observational epidemiology components. As
other commentators will address many of the clinical trial and observational epidemiology issues,
I will confine my remarks to the genetic issues raised in Section 2.2. In particular, I will
focus on the discussion of germline variation, although many of the problems associated with very
high density data arising in that context also apply to the proteomic data. This is frequently
referred to as the “p ≫ n problem,” meaning many more variables than observations.

To begin with, the focus of Prentice et al.’s discussion of germline variation is on detecting
the main effects of genetic variants on disease risk. Many such “genome-wide association scans”
(GWASs), first seriously proposed nearly a decade ago by Risch and Merikangas (1996), have
recently been proposed and some are already underway (see review of several such initiatives in
Thomas, Haile, and Duggan, 2005). Indeed, the first reports of such scans have started to appear
(Ozaki et al., 2002; Klein et al., 2005; Maraganore et al., 2005). Before discussing some of the
methodological issues involved in GWASs, it’s worth noting that much of the interest in the
pharmacogenomics world centers on genetic modifiers of the response to drug treatments (Need,
Motulsky, and Goldstein,

2005), or in the case of the Women’s Health Initiative (WHI), on genetic modifiers of chemopreventive agents. A timely reminder of the importance of such research is the approval by the U.S. FDA on June 16, 2005 of the drug BiDil (NitroMed, http://www.nitromed.com/index.asp) for treatment of congestive heart failure only in African-Americans. As noted by Francis Collins and other critics of this decision, it is highly unlikely that race or skin color per se is the relevant factor modifying the effectiveness in the treatment (if indeed there really is a racial difference), but rather some as-yet-undiscovered genetic variant that is more prevalent among African-Americans (Kahn, 2005). A search for such a modifier gene may prove to be a more daunting challenge than for a main effect, but as noted by the FDA panel’s chairman, Dr Steven E. Nissen, “It is very unusual; it is precedent-setting, but it is the case that we are moving forward to genome-based medicine. It’s going to happen.”

The generally greater difficulty in detecting interactions than main effects in observational studies (Smith and Day, 1984) is somewhat offset in the pharmacogenetics field, however, by the ability to randomize one of the interacting factors, in this case the exposure (drug). Although this does not alter the sample size requirements (other than ensuring a suitable balance of exposed and unexposed), it does provide a stronger basis for causal inference. In a similar vein, prospects for causal inference could be enhanced by exploiting the concept of “Mendelian randomization” (Davey Smith and Ebrahim, 2003; Little and Khoury, 2003; Thomas and Conti, 2004), meaning that genes are “assigned” to individuals randomly conditional on parental genotypes, so that issues of “reverse causation” (disease influencing an intermediate phenotype under study) and confounding can largely be eliminated as alternative explanations. To fully exploit this advantage, family-based designs that use such transmission information could be used, such as the transmission-disequilibrium test (TDT) (Spielman, McGinnis, and Ewens, 1993) or discordant sibship case–control design (Kraft and Thomas, 2004). While the latter design can be impractical in a randomized trial context, a TDT analysis could take the form of a comparison of transmitted genotypes between affected cases on differing treatment arms. Although this would require the availability of DNA from both parents of all cases, it would not require unaffected controls, thereby substantially reducing genotyping costs and ensuring freedom from population stratification bias (Thomas and Witte, 2002; Wacholder, Rothman, and Caporaso, 2002). The ideas of Mendelian randomization can be particularly useful in disentangling complex genetic pathways (Thomas, 2005) using proteomic or metabolomic technologies to directly measure some of the intermediate metabolites or individual pharmacokinetic rate parameters. Here, the potential for confounding or established disease to distort the intermediate phenotypes is considerable, but potentially surmountable by focusing instead on the relationship between genes and intermediates, which should be immune to these problems. In a latent variables model (Conti et al., 2003), one could imagine using control samples (preferably from a cohort if time-dependent exposures are involved) to build a model for the relationship between flawed measurements of the latent phenotypes and their genetic–environmental predictors and then apply these predicted latent variables and disease in case–control comparisons.

Setting aside the particular problems inherent in a search for modifier genes, Prentice et al. provide a thoughtful discussion of some of the challenges of study design and analysis in GWASs for main effects. In particular, I would like to commend them for their discussion of the study design challenges in the use of DNA pooling and the considerable advantages of using some form of the multistage sampling design proposed by Satagopan et al. (2002), Satagopan and Elston (2003), and Satagopan, Venkatraman, and Begg (2004). The use of DNA pooling is still in its infancy, with considerable controversy about experimental designs to allow for the various sources of error in pool construction and measurement, reviewed by Prentice et al. None of the currently ongoing or proposed studies summarized by Thomas et al. (2005) have elected to use this approach, fearing loss of power to detect modest differences in allele frequencies in the face of potential measurement error, even if later stages using individual genotyping are effective in eliminating false positive signals. The WHI will be one of the first to employ this technology in the first stage of their GWAS, so its performance will be closely watched by other investigators, as the potential savings in cost could be well worth a modest loss of power if that could be overcome by using sufficiently large or sufficiently many pools. It is worth noting, however, that even individual genotyping, while extremely accurate, is not immune to measurement error. It is only recently that the statistical genetics community has turned its attention to the problem of dealing with genotyping error (Rice and Holmans, 2003; Kang, Gordon, and Finch, 2004), using techniques that have been widely used in environmental epidemiology for years (Prentice, 1982; Thomas, Stram, and Dwyer, 1993). A disturbing account by David Clayton (see online supplement to Thomas et al., 2005 for details) showed that both genotype call rates and concordance across platforms were differential by case–control status in a recent GWAS for type II diabetes, presumably because of shifts in the distribution of readings due to sample handling, implying that the software for genotype calling might need to be calibrated separately for cases and controls. For further discussion of design and analysis issues in GWAS, the interested reader is referred to several other recent reviews (Hirschhorn and Daly, 2005; Palmer and Cardon, 2005; Thomas et al., 2005; Wang et al., 2005).

As an analytical challenge, if the p ≫ n problem were not bad enough for a genomewide search for genetic main effects, the mind boggles in considering a search for all possible gene–gene interactions (Marchini, Donnelly, and Cardon, 2005), associations with all possible extended haplotypes (Lin, Chakravarti, and Cutler, 2004), or germline variants affecting the expression of all possible genes (Schadt et al., 2003) or proteomic patterns! Motivated in large part by such high-volume genomic tools, there has been a resurgence of interest recently in various exploratory data analysis techniques, such as classification and regression trees (CART) and multivariate adaptive regression spline (MARS) (Cook, Zee, and Ridker, 2004), neural networks (Sebastiani, Yu, and Ramoni, 2003), multifactor dimensionality reduction (Ritchie et al., 2001), random forests (Bureau et al., 2003), Bayesian
network analysis (Friedman et al., 2000) and model selection (Sillanpaa and Corander, 2002), self-organizing maps (Tamayo et al., 1999), support vector machines (Byvatov and Schneider, 2003), sequential filtering and boosting (Yasui et al., 2003; Feng, Prentice, and Srivastava, 2004), and others (Hoh and Ott, 2003) for exploring large arrays of main effects and interactions. While classical hypothesis-driven methods still have an important role to play, I expect that the future will see a greater interplay between these two basic approaches to statistical analysis of high-volume genomic data.

Proteomic methods have potentially many uses (Sellers and Yates, 2003), ranging from etiologic (e.g., dissecting complex pathways, reducing etiologic heterogeneity by subphenotyping) to clinical research and practice (e.g., early detection). The recent report of a proteomic marker for ovarian cancer (Petricoin et al., 2002) has been controversial, with the possibility of differential measurement error due to differences in sample handling among the reasons being suggested for the apparent case–control difference in proteomic patterns (Diamandis, 2004). It is an open question whether this striking difference between postdiagnostic case and control patterns will be useful as a pre-diagnostic screening test. This question will require prospective evaluation. Given the rarity of ovarian cancer, this will presumably be feasible only in high-risk cohorts, such as relatives of BRCA1 mutation carriers.

In summary, the rapidly exploding availability of high-volume genomics technologies, including SNP genotyping, gene expression arrays, proteomics, metabolomics, and other “Omics” technologies, poses daunting challenges and opportunities that should be a high priority for statistical research.

References
Bureau, A., Dupuis, J., Hayward, B., Falls, K., and Van Eerdewegh, P. (2003). Mapping complex traits using random forests. BMC Genetics 4, S64.
Byvatov, E. and Schneider, G. (2003). Support vector machine applications in bioinformatics. Applied Bioinformatics 2, 67–77.
Conti, D. V., Cortessis, V., Molitor, J., and Thomas, D. C. (2003). Bayesian modeling of complex metabolic pathways. Human Heredity 56, 83–93.
Cook, N. R., Zee, R. Y., and Ridker, P. M. (2004). Tree and spline based association analysis of gene-gene interaction models for ischemic stroke. Statistics in Medicine 23, 1439–1453.
Davey Smith, G. and Ebrahim, S. (2003). ‘Mendelian randomization’: Can genetic epidemiology contribute to understanding environmental determinants of disease? International Journal of Epidemiology 32, 1–22.
Diamandis, E. P. (2004). Mass spectrometry as a diagnostic and a cancer biomarker discovery tool: Opportunities and potential limitations. Molecular and Cellular Proteomics 3, 367–378.
Feng, Z., Prentice, R., and Srivastava, S. (2004). Research issues and strategies for genomic and proteomic biomarker discovery and validation: A statistical perspective. Pharmacogenomics 5, 709–719.
Friedman, N., Linial, M., Nachman, I., and Pe’er, D. (2000). Using Bayesian networks to analyze expression data. Journal of Computational Biology 7, 601–620.
Hirschhorn, J. N. and Daly, M. J. (2005). Genome-wide association studies for common disease and complex traits. Nature Reviews Genetics 6, 95–108.
Hoh, J. and Ott, J. (2003). Mathematical multi-locus approaches to localizing complex human trait genes. Nature Reviews Genetics 4, 701–709.
Kahn, J. (2005). Misreading race and genomics after BiDil. Nature Genetics 37, 655–656.
Kang, S. J., Gordon, D., and Finch, S. J. (2004). What SNP genotyping errors are most costly for genetic association studies? Genetic Epidemiology 26, 132–141.
Klein, R. J., Zeiss, C., Chew, E. Y., et al. (2005). Complement factor H polymorphism in age-related macular degeneration. Science 308, 385–389.
Kraft, P. and Thomas, D. C. (2004). Case-sibling gene-association studies for diseases with variable age at onset. Statistics in Medicine 23, 3697–3712.
Lin, S., Chakravarti, A., and Cutler, D. J. (2004). Exhaustive allelic transmission disequilibrium tests as a new approach to genome-wide association studies. Nature Genetics 36, 1181–1188.
Little, J. and Khoury, M. J. (2003). Mendelian randomization: A new spin or real progress? Lancet 362, 390–391.
Maraganore, D. M., de Andrade, M., Lesnick, T. G., Strain, K. J., Farrer, M. J., Rocca, W. A., Krishna Pant, P. V., Frazer, K. A., Cox, D. R., and Ballinger, D. G. (2005). High-resolution whole-genome association study of Parkinson disease. American Journal of Human Genetics 77, 685–693.
Marchini, J., Donnelly, P., and Cardon, L. R. (2005). Genome-wide strategies for detecting multiple loci that influence complex diseases. Nature Genetics 37, 413–417.
Need, A. C., Motulsky, A. G., and Goldstein, D. B. (2005). Priorities and standards in pharmacogenetic research. Nature Genetics 37, 671–681.
Ozaki, K., Ohnishi, Y., Iida, A., et al. (2002). Functional SNPs in the lymphotoxin-alpha gene that are associated with susceptibility to myocardial infarction. Nature Genetics 32, 650–654.
Palmer, L. J. and Cardon, L. R. (2005). Shaking the tree: Mapping complex disease genes using linkage disequilibrium. Lancet 366, 1223–1234.
Petricoin, E. F., Ardekani, A. M., Hitt, B. A., et al. (2002). Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359, 572–577.
Prentice, R. L. (1982). Covariate measurement errors and parameter estimation in Cox’s regression model. Biometrika 69, 331–342.
Rice, K. M. and Holmans, P. (2003). Allowing for genotyping error in analysis of unmatched case-control studies. Annals of Human Genetics 67, 165–174.
Risch, N. and Merikangas, K. (1996). The future of genetic studies of complex human diseases. Science 273, 1616–1617.
Ritchie, M. D., Hahn, L. W., Roodi, N., Bailey, L. R., Dupont, W. D., Parl, F. F., and Moore, J. H. (2001). Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. American Journal of Human Genetics 69, 138–147.
Satagopan, J. M. and Elston, R. C. (2003). Optimal two-stage genotyping in population-based association studies. Genetic Epidemiology 25, 149–157.
Satagopan, J. M., Verbel, D. A., Venkatraman, E. S., Offit, K. E., and Begg, C. B. (2002). Two-stage designs for gene-disease association studies. Biometrics 58, 163–170.
Satagopan, J. M., Venkatraman, E. S., and Begg, C. B. (2004). Two-stage designs for gene-disease association studies with sample size constraints. Biometrics 60, 589–597.
Schadt, E. E., Monks, S. A., Drake, T. A., et al. (2003). Genetics of gene expression surveyed in maize, mouse and man. Nature 422, 297–302.
Sebastiani, P., Yu, Y. H., and Ramoni, M. F. (2003). Bayesian machine learning and its potential applications to the genomic study of oral oncology. Advances in Dental Research 17, 104–108.
Sellers, T. A. and Yates, J. R. (2003). Review of proteomics with applications to genetic epidemiology. Genetic Epidemiology 24, 83–98.
Sillanpaa, M. J. and Corander, J. (2002). Model choice in gene mapping: What and why. Trends in Genetics 18, 301–307.
Smith, P. G. and Day, N. E. (1984). The design of case-control studies: The influence of confounding and interaction effects. International Journal of Epidemiology 13, 356–365.
Spielman, R. S., McGinnis, R. E., and Ewens, W. J. (1993). Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM). American Journal of Human Genetics 52, 506–516.
Tamayo, P., Slonim, D., Mesirov, J., et al. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences of the United States of America 96, 2907–2912.
Thomas, D. C. (2005). The need for a comprehensive approach to complex pathways in molecular epidemiology (editorial). Cancer Epidemiology, Biomarkers and Prevention 14, 557–559.
Thomas, D. C. and Conti, D. V. (2004). Commentary: The concept of ‘Mendelian Randomization.’ International Journal of Epidemiology 33, 21–25.
Thomas, D. C. and Witte, J. S. (2002). Point: Population stratification: A problem for case-control studies of candidate gene associations? Cancer Epidemiology, Biomarkers and Prevention 11, 505–512.
Thomas, D. C., Stram, D., and Dwyer, J. (1993). Exposure measurement error: Influence on exposure-disease relationships and methods of correction. Annual Reviews of Public Health 14, 69–93.
Thomas, D. C., Haile, R. W., and Duggan, D. (2005). Design and analysis of genomewide association scans: A workshop report and review. American Journal of Human Genetics 77, 337–345.
Wacholder, S., Rothman, N., and Caporaso, N. (2002). Counterpoint: Bias from population stratification is not a major threat to the validity of conclusions from epidemiologic studies of common polymorphisms and cancer. Cancer Epidemiology, Biomarkers and Prevention 11, 513–520.
Wang, W. Y. S., Barratt, B. J., Clayton, D. G., and Todd, J. A. (2005). Genome-wide association studies: Theoretical and practical concerns. Nature Reviews Genetics 6, 109–118.
Yasui, Y., Pepe, M., Thompson, M. L., et al. (2003). A data-analytic strategy for protein biomarker discovery: Profiling of high-dimensional proteomic data for cancer detection. Biostatistics 4, 449–463.

Anastasios A. Tsiatis and Marie Davidian


Department of Statistics
Box 8203, North Carolina State University
Raleigh, North Carolina 27695-8203, U.S.A.
email: davidian@stat.ncsu.edu

1. Introduction
The Women’s Health Initiative (WHI) is a multifaceted public health undertaking of enormous scale involving intertwined interventional and observational components, a plethora of biological substudies, and the collection of an unparalleled data resource that will undoubtedly shape research on women’s health for years to come. The entire WHI project team deserves considerable recognition for their innovative efforts in designing and implementing such an ambitious and important study.

We congratulate Drs Prentice, Pettinger, and Anderson for a thought-provoking, well-written, and comprehensive discussion of the myriad statistical research challenges posed by a study of this magnitude and importance. In addition to identifying and elucidating these challenges, the authors have explicitly highlighted the vital role of statistical science in multidisciplinary public health research, rightly emphasizing that “the statistical role (is) on a par with that of other key disciplines.” We expect that this stimulating paper will inspire established and new statistical researchers alike to pursue methodological breakthroughs that will advance the scientific agenda in women’s health research, disease prevention research, and clinical research more generally.

We cannot hope, nor do we feel qualified, to comment on the vast and diverse set of challenges set forth in this paper. Accordingly, we limit our remarks to the following two topics.

2. Clinical Trial Monitoring and Reporting Methods
The issue of how a Data Safety and Monitoring Board (DSMB) can assess and weigh early observed risks against
potential future benefits of a treatment is a critical one for any large-scale trial, be it in the context of disease prevention or of treatment of chronic disease. The authors’ account of the preparations and procedures put in place in the WHI clinical trial (CT) anticipating such difficulty and of the high-profile early stopping of the combined hormonal trial is fascinating and highlights with clarity the complexity of the issue and the challenges facing a DSMB in this situation. As detailed by the authors, the WHI team rightly devoted considerable thought and effort prior to the CT to develop a cohesive monitoring strategy based on a combination of weighted and unweighted log-rank test statistics for not only the primary endpoints but also secondary and adverse outcomes and a “global index” determined by DSMB members’ reactions to various hypothetical trial scenarios.

This account, along with our experience serving on a number of DSMBs for chronic disease trials where the early risk/future benefit conundrum has arisen, has inspired us to formulate with greater specificity an idea for trial design and analysis that we have contemplated informally for some time. This idea is meant to apply generically to trials where there is concern that early differences in the primary endpoint could emerge.

As discussed by the authors, when a time to event is the primary endpoint, the log-rank test statistic is the most common basis for monitoring chronic disease clinical trials. In the design of such trials, it is furthermore routine to assume a proportional hazards relationship between treatments. This proves convenient mathematically, as under this assumption the log-rank test statistic behaves like a Brownian motion with drift parameter related to the log-hazard ratio of interest. This allows the use of standard calculations for sequential test statistics with Brownian motion structure to develop stopping boundaries.

The difficulty, of course, is that one does not know a priori the true relationship between the hazards. Consequently, it is possible, for example, that early treatment differences may lead to a test statistic that crosses a stopping boundary with the active treatment showing an increased number of primary events (possibly deaths). This in turn leads to discussion within the DSMB of termination of the study at a time when there is potentially not a great deal of patient follow-up. This is a complex dilemma for members of a DSMB. Faced with an increased number of events on the active treatment sufficiently large to have crossed a sequential boundary, which would dictate stopping the study, the DSMB must consider the difficult issue of whether it is ethical to continue the study nonetheless with the hope that long-term benefits of the treatment may emerge, which, of course, is unknown at the time of the decision. Conversely, an early difference favoring the active treatment resulting in a boundary crossing could be observed, again before adequate patient follow-up for assessing long-term benefit. In this case, adherence to the statistical procedure would dictate stopping the study. In some instances, DSMB members may instead regard the stopping boundaries as “guidelines” rather than strict criteria and opt to continue the study in order to enhance follow-up with an eye toward assessing long-term benefit. However, this may lead to difficulties in assessing the statistical operating characteristics of the sequential test.

We believe that it may be productive to contemplate this issue at the design stage by decoupling explicitly short- and long-term effects. In particular, it may be appropriate, based on the current state of scientific knowledge, perhaps through observational evidence, and of scientific interest to focus specifically on the long-term effect of treatment. If potential long-term benefits of treatment ultimately are of interest, which, indeed, is often the case for chronic disease, then, from a statistical point of view, a parameter that characterizes long-term effect may be a better reflection of the scientific objective than the log-hazard ratio, which represents overall effect throughout time. This reasoning suggests choosing the primary endpoint to target such a parameter. For example, if ultimately long-term survival is of scientific importance, we may base the primary endpoint on the difference in survival distributions at a key point in time $t^*$, such as $t^* = 4$ or 5 years, and focus on the parameter

$$\delta = S_1(t^*) - S_0(t^*),$$

where $S_k(x) = P(T \ge x)$ under treatment $k$, $k = 0$ (placebo) or 1 (active treatment), and $T$ is the time to event. Alternatively, we might consider area under the survival curve during some key time interval; i.e., the parameter

$$\delta = \int_{t_1}^{t_2} S_1(x)\,dx - \int_{t_1}^{t_2} S_0(x)\,dx$$

for $(t_1, t_2) = (2, 5)$ years, say. Some practitioners may be reluctant to adopt such a strategy under the perception that monitoring procedures based on estimators for such parameters may be prohibitively complicated to implement. However, this is not the case; as long as an efficient estimator for the parameter is available, the information-based monitoring theory of Scharfstein, Tsiatis, and Robins (1997) leads straightforwardly to feasible techniques for sequential testing.

With such primary endpoints, because the parameter of interest cannot be estimated until sufficient follow-up is available, sequential boundaries could not be crossed early and hence one could not stop the study until a sufficient number of patients were observed for the appropriate length of time. Focusing on such primary endpoints does not eliminate the possibility that a large number of events may be seen early in the trial and the attendant ethical considerations. Under this view, monitoring for early differences in numbers of events would still take place; however, this could be considered in the same spirit as monitoring for a safety endpoint and would be separate from monitoring the primary endpoint.

Of course, as is current practice, subjective judgment could be included in considerations for decision making under early treatment differences if desired. As advocated by the authors, data from outside sources, if they exist, could be a useful supplement in determining whether such early results are sufficiently worrisome as to warrant termination of the trial.

3. Intervention Adherence and Causal Inference Methods
In the face of noncompliance to assigned treatment, the authors note that so-called adherence-adjusted analysis has been advocated, where one attempts to estimate a parameter that represents treatment effect if in fact subjects were to practice
full compliance with their assigned treatments. We agree with the authors that focusing on such a parameter can be misplaced; the authors provide compelling examples of situations in which the reasons for non-adherence are treatment related and in fact involve adverse outcomes that may preclude further treatment.

Our view is that a more feasible and realistic objective is to adopt the perspective of Murphy, van der Laan, and Robins (2001) and focus instead on identifying the effects of relevant so-called dynamic treatment regimes. For example, one might view a need to stop treatment because of adverse outcomes to be part of the overall strategy or policy of administering the treatment rather than as representing lack of compliance to the treatment. Here, then, the objective of inference would be to compare different such strategies, possibly with the ultimate goal of identifying the optimal such strategy (Murphy, 2003). A simple example is given by Johnson and Tsiatis (2004) and Rebeiz et al. (2004), who adopted this perspective in estimating the effect of treatment duration on outcome when treatment must be terminated due to an adverse event.

More generally, it is indisputable that the intention-to-treat question is of central interest and importance in randomized trials. However, the actual treatment of patients in practice is often significantly more complex than suggested by baseline treatment assignment, with numerous treatment decisions made over time. Questions involving modification of assigned treatment, including those addressing adherence issues, although arising in the context of a randomized study, are observational in nature. Formal causal inference methodology provides a systematic framework for addressing such questions. As noted by the authors, use of this framework entails critical and unverifiable assumptions, although any attempt to address these questions of necessity would involve these or related assumptions. Nonetheless, we believe that viewing such problems through the lens of causal inference methods can help to clarify the questions and formalize exactly what is required in order to address them. Accordingly, we agree with the authors that issues of adherence modeling and interpretation deserve continued development, and we believe that progress can be made by casting them as we have described. As suggested by the authors, it is essential that this be carried out through specific applications such as the WHI so that the extent and nature of assumptions required can be made transparent and hence debated in a genuine scientific context.

4. Conclusion
We again applaud the authors for an excellent paper that provides the statistical community with a rich source of methodological challenges. We are grateful for the opportunity to comment on this important piece of work.

Acknowledgements
This work was supported by NIH grants R01-CA051962, R01-CA085848, R37-AI031789, and R21-DA019800.

References
Johnson, B. A. and Tsiatis, A. A. (2004). Estimating mean response as a function of treatment duration in an observational study, where duration may be informatively censored. Biometrics 60, 315–323.
Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society, Series B 65, 331–355.
Murphy, S. A., van der Laan, M. J., and Robins, J. M. (2001). Marginal models for dynamic regimes. Journal of the American Statistical Association 96, 1410–1423.
Rebeiz, A. G., Derry, J. P., Tsiatis, A. A., et al. (2004). Optimal duration of eptifibatide in percutaneous coronary intervention. American Journal of Cardiology 94, 926–929.
Scharfstein, D. O., Tsiatis, A. A., and Robins, J. M. (1997). Semiparametric efficiency and its implication on the design and analysis of group sequential studies. Journal of the American Statistical Association 92, 1342–1350.
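
As a rough numerical illustration of the two long-term-effect parameters δ proposed in Section 2 of the discussion above (the survival difference at a fixed time t∗ and the difference in area under the survival curves over an interval (t1, t2)), the following minimal Python sketch computes point estimates from right-censored data. It rests on assumptions not made in the discussion: the Kaplan-Meier estimator is used for each arm (it targets P(T > x), which coincides with P(T ≥ x) for continuous T), and all function names are hypothetical. It gives point estimates only and does not implement the information-based monitoring of Scharfstein, Tsiatis, and Robins (1997).

```python
import numpy as np


def kaplan_meier(time, event):
    """Kaplan-Meier estimate: (distinct event times, survival just after each)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    surv = np.empty(event_times.size)
    s = 1.0
    for i, t in enumerate(event_times):
        at_risk = np.sum(time >= t)              # under observation just before t
        failures = np.sum((time == t) & event)   # events observed exactly at t
        s *= 1.0 - failures / at_risk
        surv[i] = s
    return event_times, surv


def surv_at(event_times, surv, t):
    """Right-continuous KM step function at t (1.0 before the first event)."""
    idx = np.searchsorted(event_times, t, side="right")
    return 1.0 if idx == 0 else float(surv[idx - 1])


def area_under_surv(event_times, surv, t1, t2):
    """Exact integral of the KM step function over [t1, t2]."""
    inner = event_times[(event_times > t1) & (event_times < t2)]
    knots = np.concatenate(([t1], inner, [t2]))
    heights = np.array([surv_at(event_times, surv, k) for k in knots[:-1]])
    return float(np.sum(np.diff(knots) * heights))


def delta_at(time1, event1, time0, event0, t_star):
    """Estimate delta = S1(t*) - S0(t*); arm 1 is active, arm 0 is placebo."""
    km1 = kaplan_meier(time1, event1)
    km0 = kaplan_meier(time0, event0)
    return surv_at(*km1, t_star) - surv_at(*km0, t_star)


def delta_area(time1, event1, time0, event0, t1, t2):
    """Estimate delta = int_{t1}^{t2} S1(x) dx - int_{t1}^{t2} S0(x) dx."""
    km1 = kaplan_meier(time1, event1)
    km0 = kaplan_meier(time0, event0)
    return area_under_surv(*km1, t1, t2) - area_under_surv(*km0, t1, t2)
```

On a made-up toy data set with placebo event times 1, 2, ..., 6 years (all observed) and active-arm times 2, 4, 6, 6, 6, 6 years (the last four censored), `delta_at(..., t_star=4)` gives 1/3 and `delta_area(..., t1=2, t2=5)` gives 5/6. Neither quantity can be estimated until some participants have been followed to t∗ (or t2), which is the feature the discussants rely on to prevent very early boundary crossings.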

Rejoinder

Ross L. Prentice, Mary Pettinger
and Garnet L. Anderson

We would like to thank each of the persons who provided comments on our article. Our article was intended primarily to draw attention to topics having important statistical content where further methodology development is needed. Hence, we particularly appreciate the new modeling approaches that some commentators suggested and we will provide some reaction to these. Our article also provided an update on the Women’s Health Initiative and a description of the statistical methods we have employed to date in certain areas. We appreciate the critique of these methods. As we prepare to publish the principal results from the dietary modification and calcium and vitamin D components of the WHI clinical trial, and plan novel uses of the WHI specimen repository and database, this critique is like a thoughtful consultancy session, by a very high-priced group of statistical and epidemiological consultants!!

1. General Reviewer Comments
Several writers commented on the public health importance of the questions being addressed in the WHI, and on the appropriateness of the basic study design, comprising a multifaceted clinical trial and a companion cohort study. We appreciate these comments, while acknowledging that many people contributed to these and other design choices, including colleagues at NIH and at WHI clinical centers. It is also worth commenting that this type of enterprise is not at all
easy to initiate. Investigators around the country, both out- subjects are exposed to two or more known dietary intakes
side and within NIH, worked for years in attempts to launch for a nutrient and corresponding (lagged) blood concentra-
full-scale clinical trials of a low-fat eating pattern and of tions are recorded, could be used to estimate the parameters
postmenopausal hormone therapy. Partly because of the cost in Dr Carroll’s model (3). One might then transfer, for ex-
of such trials, there is typically a resistance to such pro- ample, the estimated measurement error correlation, derived
posals from the research community. It is perhaps unlikely from concentration marker measures on the same individual,
that either of these trials would have been launched had not from the feeding study to a cohort or clinical trial setting.
Dr Bernadine Healy sought and received the necessary funding Doing so could allow a separation of the (unknown) nutri-
as a special congressional appropriation shortly after assum- ent consumption from the person-specific bias in Dr Carroll’s
ing the NIH director position. We would also like to comment model (3) for self-report data calibration and disease associ-
that study of the research designs needed to obtain reliable ation testing.
public health information, the infrastructure needs for a vi- Dr Nick Day also brings a wealth of experience to the nu-
brant preventive intervention development program, and the need for suitable forums to promote the needed research and to advise funding organizations on topics where full-scale trials may be warranted, are extremely important issues for public health research progress, but are mostly beyond the scope of our article. See Prentice et al. (2004) for a discussion of these topics in the diet and physical activity epidemiology research areas.

Several commentators also resonated with our remarks about biostatistics being one of the fundamental disciplines in the public health research arena. As illustrated by the topics we chose to emphasize, methodologic issues are often crucial to progress in public health–oriented research. Our colleagues in areas such as clinical trials, epidemiology, nutrition, and genetics do excellent work on both substantive and methodologic topics, but statistically oriented investigators are often needed to help identify the research gaps and to develop sound quantitative approaches to filling these gaps. The commentator group includes excellent role models who have greatly impacted such research areas as nutritional epidemiology, genetic epidemiology, and clinical trial methodology, to name just a few.

2. Nutritional Epidemiology Methods

Turning to comments on dietary measurement error modeling and related topics, Dr Raymond Carroll, who has been an important contributor to this area, emphasizes the distinction between recovery biomarkers, which typically arise from urinary measures that reflect the body’s expenditure of a nutrient, and concentration biomarkers, which assess circulating levels of a nutrient in blood or other body compartments. The former measures may plausibly adhere to a classical measurement model, but are available for only a few nutrients, while the latter are available for many more nutrients, but will typically include person-specific biases that make them generally unsuitable for the purpose of calibrating dietary self-report estimates. Dr Carroll asks our opinion about two potential ways of using concentration biomarkers. The first involves focusing on disease association with the biomarker, rather than the nutrient consumed. This approach simplifies the analysis, and yields association parameters of interest, but the resulting information would not seem directly useful for the development of evidence-based dietary pattern recommendations. His second approach to using concentration biomarkers involves human feeding studies. There are rather few research groups configured for human feeding trials or exercise intervention trials. Nevertheless, we think this approach could be feasible and useful. For example, a study in which individual study

tritional epidemiology and biomarker development areas. He comments that diet and physical activity patterns appear to be key determinants of a range of health outcomes, and notes the extraordinary difficulty of related epidemiologic research. He asks very pertinent and timely questions concerning the ability to interpret the WHI dietary modification trial results, which are currently being prepared for publication, given the major uncertainties that attend the assessment of diet, and the dietary change induced by intervention.

Fundamentally, the DM trial assesses whether a dietary intervention program having certain goals (20% of energy from fat, 5 or more servings of fruits and vegetables per day, . . .), applied in a certain manner (nutritionist-led small group sessions having nutritional and behavioral content), can reduce the incidence of breast cancer, colorectal cancer, and the other designated outcomes over an average 8.1-year study period. As Dr Day suggests, a positive answer to this question will greatly advance public health aspects of nutrition, in spite of a limited ability to attribute the disease risk reduction to the specific dietary changes made (or to other changes that intervention group women may have chosen to a different extent than did control group women). If, however, a null or equivocal result arises, then trial interpretation may well depend on the extent of dietary differences between the intervention and control groups, and the ability to assess nutrient intakes, even at the group level, may require suitable biomarkers.

We agree with Dr Day that current technology does not allow a compelling response to these issues. The largest problem with the food frequency questionnaire, which is our basic tool for measuring dietary change in WHI, concerns total energy assessment. Our nearly completed Nutrient Biomarker Study will permit a calibrated estimate of total energy at baseline and at various subsequent time points, with a calibration procedure that depends on intervention group assignment. These calibrated estimates can be combined with FFQ percent energy from fat estimates to obtain total fat consumption estimates, and changes in total fat consumption that are expected to be an improvement over FFQ fat consumption estimates. We will examine the extent that these estimates of change mediate any intervention effect on disease outcome. The biomarker data will also allow a calibrated assessment of nonprotein calorie consumption. We have an interest in using the respiratory quotient from resting state indirect calorimetry in an attempt to separate (even if noisily) fat and carbohydrate consumption, but necessary funding has yet to be secured for this work. Another aspect of DM trial interpretation concerns study of the relationship between any observed
intervention effect and baseline dietary habits. Baseline FFQ nutrient consumption estimates are distorted by our use of the FFQ as a screening instrument (≥32% energy from fat for eligibility). To reduce our reliance on FFQ, we also plan to present analyses as a function of nutrient consumption based on four-day food records, using a case-only analysis.

In response to Dr Sander Greenland’s nutritional epidemiology comments, we note that both the self-report assessments and the biomarker assessments typically pertain to a short time period of a few days to a few months. In the WHI dietary modification trial, for example, we sought food frequency data at baseline and 1 year from randomization on all women, and subsequently, approximately every 3 years in a rotating sample basis. Our biomarker calibration equations derive from the FFQ and biomarker correspondence at about 8 years from randomization. These equations will be applied to the various FFQs collected for a woman, to give a biomarker-corrected dietary history over the average 8.1-year trial follow-up period.

3. Genomic and Proteomic Methods

We appreciate Dr Duncan Thomas’s comments on the challenges in detecting genetic modifiers of the effect of a treatment or intervention, reflecting his many years of contributions to the genetic epidemiology literature. Dr Thomas provides very up-to-date citations of initial reports from genome-wide association studies, as well as citations to recent articles discussing related methodologic issues. Since the time of writing our article we have proceeded with the early implementation of a genome-wide association study in collaboration with Dr David Cox (no, not that David Cox!) of Perlegen Sciences in Mountain View, California. For each of coronary heart disease, stroke, and breast cancer, this study will involve 250,000 SNPs and 1000 cases and 1000 pair-matched controls drawn from the WHI observational study in the first stage of a three-stage design. This first stage will involve eight pairs of DNA pools, each comprising equal volumes of DNA from 125 cases or 125 pair-matched controls. DNA from racial/ethnic minority cases will be placed in a single pool, and matching factors will include ethnicity as well as age, WHI enrollment date (to control for follow-up duration), and selected other factors. SNPs meeting a 1% significance level criterion for either a test based on an allele frequency difference statistic or an odds ratio statistic will move on to the second stage, which will involve individual SNP determinations for about 613–800 cases and controls (depending on the disease) also drawn from the observational study. SNPs meeting a 2% significance level criterion at this stage will be examined individually in estrogen plus progestin clinical trial cases and controls (258–349 cases and controls) with testing at the 5% significance level.

We have conducted simulation studies (Ross Prentice and Lihong Qi, submitted for publication, 2005) to indicate that such a design can be expected to have adequate power for detecting an odds ratio of 1.5 or greater for the minor SNP allele, provided the allele frequency is not too small (e.g., ≥0.2) and provided an additive or dominant genetic model prevails, with lesser power under a recessive model. Because this design implies an overall significance level of (0.01)(0.02)(0.05) = 0.00001, there will be only 2.5 expected false positives under the global null hypothesis, facilitating decision making as to whether disease-related SNPs have been identified. We also found, in these simulation studies, that a test statistic that combines data across the preceding stages with testing at 0.01, 0.0002, and 0.00001 levels, respectively, has improved power properties compared to a separate testing procedure at each design stage.

Both Dr Thomas and Dr David DeMets point to the need for statistical methodology development for the design and analysis of genome-wide association studies. In fact, the National Heart, Lung, and Blood Institute and several other NIH Institutes recently issued a request for application (RFA-HL-05-011) precisely to encourage this type of methodology development. The RFA calls for initiatives on a number of topics including tagging SNP selection, assessment of the utility of pooled DNA, study design and analysis choices, haplotype block formation methods, and methods for gene–gene or gene–environment interaction, and refers to the absence of suitable methods as the key bottleneck for the progress in this promising research area, given the impressive technical advances in high-throughput SNP genotyping of the past few years.

Dr Greenland finds it “very odd” that we “neglected” empirical Bayes and other shrinkage procedures in our discussion of this high-dimensional genomic and proteomic studies area. We find his criticism very odd in that our presentation involved only testing and identification procedures, and did not discuss any form of estimation procedure. Of course, there are interesting estimation methods issues in the context of the type of multistage studies mentioned above. For example, one may be interested in estimating the disease odds ratio associated with an SNP or a haplotype block that acknowledges both the high dimensionality and the selection procedure employed (e.g., Benjamini and Yekutieli, 2005). These methods will require some form of shrinkage for parameter estimation. The interesting article (Efron, 2004) that Dr Greenland cites is concerned with the choice of the null hypothesis for the high-dimensional testing problem. Our own plans for methodology development, included in responses to the RFA just mentioned, include an empirical comparison of the efficiency of empirical versus theoretical null hypothesis testing procedures in the WHI–Perlegen data, as well as an exploration of shrinkage options for parameter estimation.

4. Clinical Trial Monitoring Procedures

Drs DeMets and Day offer rather different views of the adequacy of prevailing clinical trial monitoring methods in complex settings such as the WHI, where study treatments or interventions plausibly affect multiple important clinical outcomes, each with its own time course and severity. Dr DeMets provides interesting insights into the deliberations of the WHI DSMB, some of which we, as WHI investigators, were not a party to. He describes aspects of the clinical trial monitoring plan developed by WHI-related statisticians in collaboration with the DSMB and concludes that, for the hormone therapy trials, the global index we defined as a supplementary statistic to that for the primary outcomes was “not as useful as originally intended” since the outcomes included in the global index were going in different directions. Dr DeMets comments further that “no additional statistical methodology” would have facilitated DSMB recommendations.

Dr Day, in contrast, finds it “troubling” that early results had such an influence in triggering the early stopping of the
hormone therapy trials. He offers the perspective that these trials, and others such as the breast cancer prevention trial of tamoxifen, should have been allowed to continue “sufficiently to generate data of unambiguous value for clinical or public health decisions,” and he concurs with our call for the formulation of stopping rules that “provide a more helpful balance between short- and longer-term effects.”

While it is clear that monitoring committees need to be in a position to exercise judgment beyond those implied by formal statistical monitoring procedures in such complex situations, we share with Dr Day the viewpoint that premature stopping is a serious concern in prevention trial monitoring. The development of more flexible and comprehensive statistical monitoring procedures seems to be a major tool toward realizing the scientific potential of prevention trials, while taking suitable account of ethical issues. Our own experience, both as members of monitoring committees and as recipients of their recommendations, suggests that structure is needed, in spite of committee member expertise and experience, since the exposure of members to trial data is typically brief and episodic, and since the reaction of members to trial data tends to be highly variable and dependent on personal research interests and perceptions of committee responsibilities.

Even when formal monitoring guidelines have been adopted, it is difficult for monitoring committees and sponsors to allow a trial to continue in the face of data indicating early harm. For example, in the WHI estrogen-only trial, the global index was almost exactly balanced between benefits and risks when the early stopping occurred. This type of situation, rather than a situation in which all the pertinent outcomes were going in the same direction, was the motivation for the inclusion of the global index in the monitoring plan, that is, to help prevent premature stopping if there is major uncertainty concerning overall benefits versus risks at a particular point in time in trial conduct.

The WHI trial monitoring plan may not have included sufficient provision for a changing course of benefits versus risks as the time from the initiation of treatment increases. As Dr Anastasios Tsiatis and Dr Marie Davidian point out, one usually does not know in advance of trial conduct the form of the treatment hazard ratio (HR) over time, and it may turn out that the proportional hazards assumption that underlies most statistical monitoring procedures is far from correct, as for coronary heart disease, venous thromboembolism, and breast cancer in the WHI estrogen-plus-progestin trial.

Hence, we concur with Drs Tsiatis and Davidian that monitoring procedures that attempt to disassociate short-term from long-term effects could be quite useful. One needs only to consider a trial of a surgical intervention having some short-term mortality along with putative longer-term benefit to realize that some form of disassociation may be essential to avoid certain premature stopping in a sufficiently large trial.

In response to the interesting specific proposals of Drs Tsiatis and Davidian, it seems to us that it may not be desirable to go so far as to relegate the early data on a primary outcome to a separate adverse events monitoring status, since an early effect does need to be an element of a primary endpoint benefit versus risk summary, over any specified treatment/follow-up period. Our own attempts to address this issue involved weighted log-rank statistics, with weights increasing with time from randomization. With this type of statistic the influence of the early data declines as longer-term data accumulate. In the case of the estrogen-plus-progestin trial, the breast cancer weighted log-rank statistic crossed a monitoring boundary, and the global index statistic also met adverse effects criteria sufficient to support an early trial-stopping consideration. This configuration, along with data on a number of other clinical outcomes that were monitored informally, seems to us to provide an answer to the public health question about the balance of risks and benefits in the study population over the 5.6-year average follow-up period. While it would be helpful to know longer-term effects (as well as effects in important subsets), the essential question addressed by this trial as to whether hormone therapy could be advocated in terms of benefits for the primary endpoint coronary heart disease and in terms of overall health benefits versus risks in populations like WHI had been substantially answered, and we have no argument with our DSMB concerning their early stopping recommendation. As Dr Day implied, these trial results were difficult for certain practicing and research communities to accept, though we see this more as a result of overinterpretation of the preceding observational study data than as being due to the absence of longer-term clinical trial data. The estrogen-alone trial, on the other hand, even though having a longer average follow-up time (7.1 years), seems to us to leave greater uncertainty concerning clinical and public health recommendations, as the stroke and venous thromboembolism elevations are offset by fracture reductions, a possible breast cancer reduction, and even a suggestion of a favorable trend in coronary heart disease toward the end of the trial. These are complex issues to assimilate and there seems to us to be a need for additional development and application of monitoring procedures that can adapt to HR changes over time for several outcomes, and for corresponding estimation procedures that acknowledge such complex monitoring systems.

5. Postmenopausal Hormone Therapy and Cardiovascular Disease

We appreciate the vigorous discussion of our combined analysis of WHI clinical trial and observational data on estrogen plus progestin and cardiovascular disease. As Dr Day describes, various commentators in other forums have pulled out various theories, which Dr Day describes as “empty rhetoric,” to explain this discrepancy. We do think that the availability in WHI of clinical trial and observational study data drawn from the same population, over the same time period, using essentially the same data collection instruments, provides a strong setting to examine this discrepancy. Dr David Freedman and Dr Diana Petitti note that “without experimental data, it will be unclear which adjustments to make, or how far to go.”

In analyses too detailed to present in Prentice et al. (2005) or in the present article, we examined various potential additional sources of confounding. These included interactions among established disease risk factors, treatment–covariate interactions (Dr Greenland’s question), changing risk factors across follow-up time, and classical measurement error “corrected” risk factors in the diet and physical activity areas. While it would be an overstatement to say that these analyses were exhaustive, we were impressed at the insensitivity
of hormone therapy HRs in the OS, and comparative HRs in the CT to OS to these refinements.

It is quite possible, however, as Drs Freedman and Petitti argue, that there is some residual confounding in the OS HR estimate for both coronary heart disease and venous thromboembolism, as is almost certainly the case for stroke (which we chose not to include for reasons of space; but see Prentice et al., 2005). We speculate that residual confounding in this setting is likely due to factors that are simply not being entertained, in spite of the unusually rich WHI database, or possibly due to the inability to adequately correct for important, but poorly measured dietary, physical activity, and other behavioral factors.

Drs Freedman and Petitti argue that, because of limited power to compare clinical trial and observational study HRs, their “null hypothesis” that the WHI observational study underestimates hormone therapy risks by a factor of 1.5 to 3 is as compatible with the WHI data as is a hypothesis of equality between WHI clinical and observational study risks. Our recent article (Prentice et al., 2005), which appeared with an invited commentary by Drs Petitti and Freedman, provided hazard estimates (95% confidence intervals) for the ratio of estrogen-plus-progestin HRs in the OS to those in the CT, following adjustment for the confounding factors listed and with time-from-initiation accommodation. These were 0.93 (0.64, 1.36) for coronary heart disease and 0.84 (0.54, 1.29) for venous thromboembolism—consistent with equality even though noisy, but hardly consistent with a 3-fold underestimation in the observational study! Even for stroke the corresponding numbers were 0.76 (0.49, 1.18). We would hasten to add, however, that our purpose was not to argue that observational data, carefully analyzed, can somehow obviate the need for clinical trial data. As we have noted in other places (e.g., Prentice et al., 2004), we think there is a need for organizational strengthening in the public health research community so that treatments or interventions having the most compelling rationale and public health potential are identified and receive the support of this community for evaluation in randomized controlled trials. Such structure is particularly needed in view of typical high research costs, and frequent relevance of a broad range of health outcomes that may, for example, cut across the foci of several NIH institutes. We disagree somewhat with Drs Freedman and Petitti when they “suggest that observation necessarily precedes experiment.” In our view, a vigorous prevention research program may well include interventions that are self-selected by few, if any, persons in populations of interest, in which case hypothesis development and initial testing may be primarily based on small-scale trials (e.g., dietary or physical activity trials) having intermediate outcome endpoints, and on related basic science research. Even when sizeable numbers of persons adopt a dietary or other lifestyle factor having preventive potential, it may be that observational studies are not sufficiently reliable to provide much guidance to the research enterprise. As Dr Day points out, “negative results can be at least as suspect as positive ones” in nutritional epidemiology settings.

We appreciate the nice contribution by Drs Miguel Hernán, James Robins, and Luis García Rodríguez (HRGR), which provides analysis of data from the General Practice Research Database (GPRD) on estrogen plus progestin in relation to coronary heart disease. As they mention, they originally intended to include an analysis of Nurses’ Health Study cohort data as well, which stimulated us to include some corresponding comments and sensitivity analyses near the end of our article. Specifically, we noted that the Nurses’ Health Study presentations relied on biennial snapshots of hormone therapy exposure, and we drew upon patterns of starting and stopping hormone therapy in community studies to illustrate that an early adverse effect as observed in the WHI trial could easily be missed with the data and analytic approaches used to date in the Nurses’ Health Study. This illustrates that the reasons for bias in observational studies can be diverse and also specific to a study environment. We will leave it to the reader to assess whether we have revealed ourselves to be closet Bayesians for having used these community data to inform our little simulation experiment, as Dr Greenland appears to suggest.

As defined by HRGR, the GPRD cohort includes 99,072 women of whom 1889 experienced CHD during follow-up. Apparently, a substantial number of these women initiated estrogen-plus-progestin therapy during the defined follow-up period, motivating their “intent-to-treat” (ITT) analyses of these data, but only 64 CHD outcomes occurred among the initiators (compared to 188 among women randomized to estrogen plus progestin in the WHI trial) so precision is limited. Their ITT estimated HRs (95% CIs) as a function of 0–2, 2–5, and >5 years from initiation of estrogen plus progestin (their Table 2) are 1.20 (0.84, 1.72), 0.82 (0.55, 1.21), and 0.69 (0.38, 1.25), which do not seem particularly discrepant with corresponding WHI trial HRs of 1.68 (1.15, 2.45), 1.25 (0.87, 1.79), and 0.66 (0.36, 1.21), and not at all discrepant with our WHI observational study HRs of 1.12 (0.46, 2.74), 1.05 (0.70, 1.58), and 0.83 (0.67, 1.01). These last HR estimates were, for simplicity, based on estrogen-plus-progestin usage at enrollment into the observational study since there were rather few hormone treatment initiators following cohort enrollment and we chose to address their influence on HR estimation as an element of our adherence sensitivity analyses.

We think the HRGR formulation of a data analysis approach that attempts to emulate an ITT analysis in a randomized controlled trial has attractive features. It would be a mistake, however, to think that this formulation leads to results of similar reliability to a randomized trial, largely because of the strong no unmeasured confounder assumption described by HRGR. In fact, to render the treated and untreated groups comparable for a causative inference, it is not only necessary that there are no unmeasured confounders, but the potential confounders need to be accurately measured (or appropriately adjusted for measurement error), and confounder relationship to outcome and treatment must be properly modeled. As Drs Freedman and Petitti note, observational data alone do not provide much guidance as to when there is a sufficient control for confounding. We do not find HRGR’s comparison of HRs adjusted for and without adjustment for available confounding factors to be at all compelling as a guide to whether or not substantial residual confounding remains. In fact, this criterion would seem to favor fewer confounding factors and less careful modeling since then adjusted and unadjusted HRs would be more similar.
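The concern about using agreement between adjusted and unadjusted estimates as a diagnostic can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not an analysis from HRGR or from WHI: it substitutes a linear outcome model for a hazard model, sets the true treatment effect to zero, and uses hypothetical names (`u` for an unmeasured confounder, `w` for a measured covariate unrelated to treatment) and arbitrary parameter values.

```python
import numpy as np


def simulate(n=200_000, seed=0):
    """Return (unadjusted, adjusted for w, adjusted for w and u) treatment coefficients."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)  # hypothetical unmeasured confounder
    w = rng.normal(size=n)  # hypothetical measured covariate, independent of treatment
    # Treatment uptake depends on the unmeasured confounder only.
    treat = (u + rng.normal(size=n) > 0).astype(float)
    # Assumed true treatment effect is zero; outcome depends on u and w.
    y = 0.0 * treat + 1.0 * u + 0.5 * w + rng.normal(size=n)

    def treat_coef(covariates):
        # Ordinary least squares with an intercept; treatment is column 1.
        X = np.column_stack([np.ones(n), treat] + covariates)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return beta[1]

    unadjusted = treat_coef([])
    adjusted_measured = treat_coef([w])   # what an analyst could actually compute
    adjusted_full = treat_coef([w, u])    # infeasible in practice: u is unobserved
    return unadjusted, adjusted_measured, adjusted_full


if __name__ == "__main__":
    una, adj, full = simulate()
    print(f"unadjusted:            {una:.3f}")
    print(f"adjusted for w:        {adj:.3f}")
    print(f"adjusted for w and u:  {full:.3f}")
```

In this toy setting the unadjusted estimate and the estimate adjusted for the measured covariate nearly coincide, yet both are far from the true value of zero, while only the infeasible adjustment for `u` recovers it. Close agreement between adjusted and unadjusted estimates is therefore uninformative about residual confounding here.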
HRGR’s analysis evidently regards a woman as an initiator each time that she starts or restarts estrogen-plus-progestin usage. We cannot tell from their presentation how long a usage gap was required before regarding a repeat user as a new initiator. Also, the apparent time-dependent relationship between HR and duration of use would seem to imply a need for an exquisite modeling of disease risk (HRGR model (1)) on prior hormone therapy usage pattern. Our WHI observational study analyses focused on the current estrogen-plus-progestin episode at baseline, with a prior usage gap of 1 year or more determining a new episode. This allowed simple analyses that excluded women who used these preparations prior to their baseline episode and excluded clinical trial women who used them prior to randomization. We do not claim that our simple methods for addressing these issues make optimal use of WHI data, and we suspect that HRGR made reasonable corresponding modeling choices as well. Of course, in addition to modeling the time course of prior estrogen-plus-progestin use, there are issues of dose levels, schedules, agents, and routes of administration that observational studies analyses of estrogen-plus-progestin treatment also need to address.

Before leaving the ITT type of analysis, we will respond to a couple of other issues mentioned in relation to our hormone therapy and cardiovascular disease analyses. Dr Greenland assails us for an “exclusive focus on HRs.” While HR estimates are an important data summary tool, and one that may be particularly valuable for comparing intervention effects across studies, we agree that HRs do not tell the complete story, especially concerning clinical and public health implications. We refer Dr Greenland to the primary estrogen-plus-progestin and estrogen-alone results articles (Journal of the American Medical Association, 2002, 2004) for absolute risk estimates from these trials and for estimates of numbers of events per 1000 person-years that may be added or subtracted by use of the preparations studied in a population like WHI. Drs Freedman and Petitti raise the policy issues that WHI data are not available to them, as taxpayers, for this reanalysis. We can report that a rather comprehensive data set from the estrogen-plus-progestin trial will be available to requestors through NHLBI around the end of 2005. Such a “limited access data set” can be analyzed under the auspices of your local IRB with the requirement that resulting manuscripts need to be submitted to NHLBI for review and comment prior to publication. The 3-year period between the initial trial publication and the availability of this data set is about the minimum needed for investigators and sponsors to publish principal trial findings. A similar data set from the estrogen-alone trial can be expected in 2007.

Turning now to our final topic of adherence-adjusted analyses, even a rigorously conducted randomized trial typically does not permit valid comparisons among persons having similar treatment adherence patterns, since additional untestable assumptions are needed before it can be claimed that persons in the same adherence strata are comparable in terms of baseline risk for targeted outcomes. Adopting, instead, the “dynamic treatment regimen” concept recommended by Drs Tsiatis and Davidian has a practical appeal.

We used a rather simple approach to adherence sensitivity analyses in the hormone therapy trials, censoring the follow-up time for a woman 6 months after she ceased to take 80% or more of study pills. We find it necessary to frequently remind our WHI colleagues that these adherence-adjusted HRs are not to be taken very seriously, since adherence in active and placebo groups likely differ in important disease risk-related manners as the trial progresses. Dr DeMets notes that this adherence bias issue constitutes an important advantage for trials in view of the basic ITT data analysis option. Drs Freedman and Petitti argue that adherence bias may be associated with a cardiovascular HR bias factor of 2, based on placebo group comparisons between adherers and nonadherers in prior clinical trials. Dr DeMets also offers evidence that covariate correction is unlikely to restore the desired comparability.

HRGR offer inverse probability-weighted estimation procedures to address the adherence bias issue. The method requires that women who continue or discontinue their treatment at each follow-up time are comparable conditional on available potential confounding factors for discontinuation. We agree with their approach in the sense that imposing this type of assumption may be the best one can do with available data, but as Drs Freedman and Petitti note, “adjustment must be incomplete, because relevant lifestyle factors are extraordinarily difficult to identify and measure.” Baseline characteristics would not plausibly be sufficient for this purpose in complex settings, and the use of posttreatment initiation variables raises additional issues of overcorrection if adherence adjustment factors are on a pathway linking treatment to the targeted diseases. Nevertheless, we appreciate HRGR’s comments on, and illustration of, the IPW approach and concur that it would be of interest to try these on the WHI hormone therapy data. This would also be the case for the associated augmented estimating equation approach, with its so-called double robust property. Our own simulation studies of these augmented estimating equations, and applications to other WHI data (Qi et al., 2005), attest to their desirable properties, provided critical modeling assumptions are met.

HRGR conclude their commentary by asking us whether we would have reached the same conclusions concerning estrogen-plus-progestin effects on cardiovascular disease as in our paper if only the observational data had been available. Our answer is “probably not.” We would have been unable to detect a trend in HR as a function of time-from-initiation based on observational data alone. Our efforts to control confounding would likely have been more limited without having the trial data to help develop the HR model. HRGR also ask for our comments on the results of other observational studies. We have offered a potentially important source of discrepancy between the Nurses’ Health Study and WHI trial; Drs Freedman and Petitti offer an additional suggestion stemming from lack of control in NHS for socioeconomic factors. Our companion publication (Prentice et al., 2005) offers brief suggestions in relation to other observational studies. HRGR also ask how sure are we that we have found “clear and convincing explanations.” We take some comfort in the fact that persons with a wealth of related experience find our arguments convincing. For example, Dr Day writes that after examining results as a function of time-from-initiation (and confounding), “the apparent discrepancy simply disappears,” while Drs Freedman and Petitti wrote that we provide “answers” that they find “persuasive.” On the other hand, we agree with HRGR and Drs Freedman and Petitti that precision is limited and it
is quite possible that residual confounding of the magnitude listed by HRGR remains, again attesting to the importance of randomized controlled trials when the public health importance is great.

References

Benjamini, Y. and Yekutieli, D. (2005). False discovery rate-adjusted multiple confidence intervals for selected parameters (with discussion). Journal of the American Statistical Association 100, 71–93.
Prentice, R. L., Willett, W. C., Greenwald, P., et al. (2004). Nutrition and physical activity and chronic disease prevention: Research strategies and recommendations. Journal of the National Cancer Institute 96, 1276–1287.
Prentice, R. L., Langer, R., Stefanick, M. L., et al. (2005). Combined postmenopausal hormone therapy and cardiovascular disease: Toward resolving the discrepancy between observational studies and the Women’s Health Initiative clinical trial. American Journal of Epidemiology 162, 1–11.
Qi, L., Wang, C. Y., and Prentice, R. L. (2005). Weighted estimates of proportional hazards regression with missing covariates. Journal of the American Statistical Association, in press.
