Registries That Show Efficacy: Good, but Not Good Enough
Mark N. Levine and Jim A. Julian, Department of Oncology, McMaster University; Juravinski Cancer Centre, Hamilton,
Ontario, Canada

When readers of Journal of Clinical Oncology (JCO) scan the Table of Contents of an issue, they are likely to recognize designs of interventional research studies such as phase I trials, phase II trials, and randomized phase II and III trials. We suspect, however, that the reader has much less familiarity with research involving observational study designs, administrative databases, phase IV studies, or registries. We have also noticed that these latter terms are often used interchangeably. Our goal in the present commentary is not to provide a primer on epidemiology but to focus on registries and address such questions as “What are they?”, “What can they be used for?”, and “What are some of their limitations?”

In recent years, there has been a marked increase in the number of medical registries and in publications using registry data. The word registry is derived from the Latin word registrum, which means list or catalogue. A medical registry can be defined as a systematic collection of a set of health and demographic data for patients with specific health characteristics held in a defined database for a predefined purpose.1-3 Registries were first used to describe disease incidence and as a resource for epidemiologic research. Subsequently, with advances in computer technology making it possible to store large amounts of data concerning treatment of patients with medical disorders, registries have been used to describe patterns of clinical care that generate inferences on quality of care and even effectiveness of therapy. They have even been used to assess safety of a drug after its approval by a regulatory agency after randomized trials (phase IV or postmarketing studies). Some registries contain data on all cases of a particular disease in a defined population such that the cases can be related to a population base. With this information, incidence rates can be calculated. Information on outcomes, such as remission and death, can be obtained if the patients are followed up routinely, either directly or through linkage with other registries or administrative databases (eg, cancer or death registries). There are also registries (such as hospital-based or disease-specific registries) that are not population based.

The intended purpose of a registry should always be prespecified and should define the necessary properties of the data to be collected. There are different types of data sources available that are collected.4-6 For example, administrative data are data that were originally collected for reasons other than research. Encounter data sets maintain a record of health care encounters and typically are maintained by payers to track reimbursement (eg, discharge database). Enrollment data allow the determination of a denominator population from which the encounter numerator is drawn (for example, census data).

The value of a registry greatly depends on the quality of its data. A primary concern is whether the data are accurate and complete. There can be many sources of data errors. Two examples are errors in programming and inaccurate transcription. To ensure the integrity of a registry’s data, a set of procedures should be in place both before data collection, to ensure the highest quality data at the time of collection, and after collection, to identify and correct sources of error.1,4 Examples of the former are a clear definition of data characteristics and study design and training of data collectors, and examples of the latter are data checks and site visits.

One example of a registry that has been used as a platform for a number of research publications in JCO is the Surveillance, Epidemiology, and End Results Program (SEER) database.7 The primary purpose of this registry is to provide information on cancer incidence and survival. It covers approximately 26% of the US population. Since January 2006, there have been 16 papers published in JCO that have involved research using SEER. SEER has also been used to study the cost of cancer care at a population level.8 This database is large and comprehensive, with information collected on tumor characteristics including stage, demographic data, surgical intervention, and whether radiation therapy was administered. SEER includes follow-up for vital status and cause of death. Data are collected primarily from an institutional source and not from physicians’ offices, so that information on chemotherapy and hormonal therapy is under-reported. A quality control program is conducted each year by the National Cancer Institute to evaluate the quality and completeness of the SEER data. The strengths and limitations of the SEER registry in general were discussed in a 2003 editorial in JCO by Linda Harlan,9 and the limitations of making inferences on therapy using this registry were also highlighted in more recent editorials.10,11

A suggested guide to the reader when reading a registry article is presented in Table 1. Answering the questions in Table 1 will help in critically appraising the quality of the article.

We will discuss some of the limitations of registries when they are used to compare interventions. First, the allocation of patients to the intervention is not random. Therefore, the intervention and comparison groups may differ in ways that affect the study outcome, potentially leading to biased overestimates of benefit (ie, selection bias). Second, follow-up is generally not as active or as standardized as in randomized trials; therefore, ascertainment of outcomes may be incomplete or inaccurate. Moreover, in some cases, outcomes may not
be assessed by individuals who are blinded to the intervention allocation, leading to further potential for bias. In other cases, follow-up is conducted passively through linkages to administrative databases. Because the coding of events in these data sets is not primarily for research purposes (and because coding is tied to other incentives such as reimbursement), these methods to ascertain outcomes, although powerful, may be prone to misclassification of outcomes as a result of coding errors and variations in coding practices. Third, because data collection for registries is often more passive than data collection in randomized trials, missing data may be a greater potential problem for registries. Finally, although registries are typically more generalizable to real-world practice because of their observational design, entry into a registry may not be as strictly monitored compared with randomized trials. This creates the potential for ineligible patients to enter the registry and may weaken the generalizability of findings obtained from analysis of registry data. To minimize these threats to validity, analyses of registry data should carefully select a comparison group that minimizes selection bias, describe differences in important prognostic factors between intervention and comparison groups, and statistically adjust for these differences.12-15 It is important to note, however, that even the most sophisticated statistical methods cannot correct for differences in unmeasured or unknown confounding factors. In addition, estimates of effectiveness may be highly sensitive to the analytic method used.16,17 Assessment of outcomes should be performed by individuals blinded to allocation where possible. This is often achieved through linkage to administrative data but, as discussed earlier, comes with its own set of limitations. To address these limitations, critical data elements from administrative databases can be validated through chart review. Finally, to optimize the quality of registry data (eg, minimizing missing data and ensuring adherence to eligibility criteria), regular random audits of registry data can be performed. Three recent publications in JCO provide examples of the potential pitfalls that can occur when registry data are used to suggest that an intervention works.18-20

Table 1. Methodology Criteria for Critically Appraising the Quality of a Registry Study

Criteria | Questions
Are the study results generalizable to my patients? | What is the population base of the registry? Is it well described? Are the patients highly selected?
Is the purpose of the registry clearly stated? | Can data from the registry answer the question being asked?
Are the data in the registry of high quality? | Are procedures in place to ensure accuracy and completeness (eg, checks and audits)?
Are the outcome measures reasonable? | Are objective criteria used? Is the assessment done in a blinded fashion? Is the assessment the same in groups being compared (potential for bias)?
What is the patient follow-up? | Is there missing data? Is the loss to follow-up stated?
Are groups in the registry being compared, and is this potentially problematic? | Do the types of patients participating in the comparison groups differ (potential for bias)? Does the information collected on the patients differ between groups (potential for bias)? Are there important factors that either have not been collected or have not been used in the analysis that can affect both group membership and outcome (confounder bias)? Is the analysis appropriate?

In a recent edition of JCO, Hillner et al18 reported the initial results from the National Oncologic PET Registry (NOPR). Considerable thought and effort went into the design, creation, and implementation of this registry, in which data are supplied by participating centers via the internet.21,22 The primary research goal was to assess the effect of positron emission tomography (PET) on referring physicians’ plans of intended patient management. Some procedures were implemented to ensure that the data were of high quality. There were a series of case report forms that were completed according to a schedule. Briefly, the referring physician requested the PET scan and completed a pre-PET form, which included the reason for the PET imaging, cancer type, performance status, and planned management if PET was not available. Patients were asked to provide consent to have their data included in the research database. Once the PET scan was completed, the PET report was uploaded to the database. The final step was the completion of a post-PET form by the referring physician. Fifteen priority areas for early assessment based on tumor type and indication were determined. The NOPR working group established a plan to analyze the data in three phases. In the current report, the first phase of the evaluation in which all cancers are aggregated together across all indications is presented.

Hillner et al18 should be congratulated on their remarkable achievement. It is less than 2 years since the registry opened, and results on 34,000 patients registered in the first year have already been analyzed and published. Overall, physicians changed their intended management in 36.5% of patients after PET. Whether the initial indication for PET was diagnosis, initial staging, restaging, or suspected recurrence, there was a major change in management of the patient in approximately one third of the cases.

At first pass, the large numbers and magnitude of the benefit associated with PET are impressive. However, an important question is, “How do the results apply to my patient in clinic?” In methodology jargon, this is referred to as generalizability. The results of the registry are presented aggregated by PET indication and not by tumor type, stage, or clinical scenario. We do not know what other imaging tests were performed (or not) before PET. There is no control group in the registry. So the question is, “PET changed management compared with what?” It is not known whether the same changes in management would have occurred with simpler, cheaper tests. This is an important question in many environments concerned about health care costs. In patients with planned biopsy before PET, biopsy was avoided in approximately 70%. The avoidance of a biopsy is important because it saves a patient from potential risk. However, it is possible that, in some cases, a biopsy would have been necessary and the PET-driven strategy may have been wrong. We do not know how often this occurred. Finally, in the NOPR, the referring physician indicated his or her intended change in plan, not what he or she actually did. There is abundant evidence from the literature that what physicians say they will do and what they actually do in practice is not always the same.23 Hence, it is possible that the 30% change in management is an overestimate of the true effect. Hillner et al plan to link their results with Centers for Medicare and Medicaid Services billing records to address this issue, and a more complete understanding of the meaning of these initial published data may then become more evident.

In the accompanying editorial to the NOPR article, Larson24 describes the strengths of the study and also points out the limitation
that the end point was based on intended patient management rather than actual management. He argues that the NOPR is a clinical trial and an alternative to the randomized trial, which he deems to be impractical in the setting of evaluating imaging technology.24 However, as discussed earlier, studies using observational data can be subject to unrecognized biases. Equipoise is the underpinning of the question addressed in any randomized trial: “I don’t know if PET improves patient management.” In the NOPR, it is possible that physicians who participated were already convinced that PET changes clinical management, and this could have influenced their selection of patients and their recording of change in management, resulting in a systematic overestimate of the magnitude of the benefit. With more than 20,000 PET patients enrolled onto the NOPR, there would have been sufficient patient numbers to carry out a series of carefully controlled trials in a number of indications. To underscore that randomized trials should be part of the process of technology assessment, the Ontario Ministry of Health and Long Term Care is funding randomized trials and prospective cohort trials to introduce PET into that province.25,26

Two studies based on registries are published in this issue of JCO. The BRiTE registry collected data on 1,953 patients with metastatic colon cancer who started chemotherapy plus bevacizumab between February 2004 and July 2005.19 The primary objective of BRiTE was to collect information on adverse events, and a secondary objective was to describe the effectiveness of bevacizumab in terms of progression-free survival (PFS) and overall survival. The BRiTE population includes a broader range of patients than in the pivotal randomized trial.27 There was no formal assessment of effectiveness end points according to prespecified guidelines.

The article by Grothey et al19 is a hypothesis-generating, unplanned analysis of the BRiTE trial that compares two subgroups of patients who experienced progression of their colon cancer: patients who continued on bevacizumab and those who did not. The survival of the patients who continued on bevacizumab was statistically significantly longer compared with the survival of those who did not. The authors conclude that “continued VEGF inhibition with bevacizumab beyond progressive disease could play an important role.” We would like to highlight a few important points. Yes, the registry did include a broader range of patients (including the elderly) and chemotherapy regimens than in the pivotal randomized trial.27 However, the characteristics of the patient population at the time of disease progression are unknown. Because time of disease progression was the start time for the analysis, it is important to know whether the overall disease burden of the two cohorts was similar at this time. Tumor burden could be reflected by sites of metastases, number of metastases, performance status, prothrombin time, and albumin. These characteristics of the two cohorts at the time of disease progression are not compared, and we do not know whether they differ in terms of important known or unknown confounders. Although the time from first metastasis, when patients were enrolled onto the registry, to progressive disease was the same for both cohorts, the time from first diagnosis of colon cancer to first metastasis for the two cohorts should be described to reassure that there is no underlying difference in the natural history of tumor growth between the two cohorts. The authors used a propensity score to adjust for potential imbalances between groups for important characteristics. However, there are limitations to this approach.17

Thus, on the basis of such a strong likelihood of bias from con- [...] atory in nature. The duration of bevacizumab therapy is an important question and can only be answered by a randomized trial, which has already been initiated (Southwest Oncology Group 0600); in this trial, patients experiencing progression on oxaliplatin-based chemotherapy plus bevacizumab in first-line therapy will be randomly assigned to either stopping bevacizumab or continuing it with second-line irinotecan-based chemotherapy.

The Monoclonal Antibody Erbitux in a European Pre-License (MABEL) study in this issue of JCO can also be considered a registry.20 In this study, 1,147 patients with metastatic colorectal cancer who had recently experienced progression on irinotecan and who expressed epidermal growth factor receptor received cetuximab plus irinotecan. The goals for the study were to provide safety and efficacy data on cetuximab plus irinotecan and to confirm the findings of the Bowel Oncology with Cetuximab Antibody (BOND) trial in a community setting. In the BOND trial, 329 patients with epidermal growth factor receptor-expressing metastatic colorectal cancer that had progressed within the previous 3 months on irinotecan were randomly assigned to cetuximab (n = 111) or cetuximab plus irinotecan (n = 218) in the same dose and schedule as before random assignment.28

In the study by Grothey et al,19 two groups within the BRiTE registry are compared. In contrast, in the study by Wilke et al,20 the MABEL cohort of patients is compared with patients in the BOND randomized trial.28 Is it reasonable to compare the results of this registry to those from a randomized trial?

The first question we ask is, “Are the two groups comparable?” The median age in MABEL was 62 years compared with 59 years in BOND. In both studies, approximately 64% of patients were male, and 80% had had two or more prior chemotherapy regimens. In the MABEL study, 1% of patients had a Karnofsky performance score less than 80 compared with 12% of patients in BOND. The most common dose of the irinotecan regimen was 180 mg/m2 every 2 weeks in both MABEL and BOND. Thus, the answer to the question on the comparability of the two patient populations is probably yes.

Our next question is, “Is it reasonable to compare the results of registry data with those from a randomized trial?” The answer to this question depends on the outcome measure. The eligibility criteria for a randomized trial can be restrictive. Accordingly, regulatory agencies are interested in additional safety information on a new agent after registration. Thus, a phase IV study can provide information on drug safety in a larger number of patients and in a broader range of patients. In addition, unrecognized toxicities can sometimes be identified. The MABEL investigators used standard criteria to document toxicity. It is reassuring that, as in BOND, the most common irinotecan-related toxicity was diarrhea, the most common cetuximab-related toxicity was acne-like rash, and the rates of these toxicities in the two cohorts were similar. The answer to the question of whether it is reasonable to compare MABEL with BOND for the outcomes of safety is yes.

However, there are major limitations in performing cross-study comparisons for the end point of efficacy. The two studies assessed response and PFS using different time intervals; these outcomes were adjudicated by a committee blinded to treatment allocation in the BOND trial compared with the treating physician in MABEL. The median PFS in MABEL was 3.2 months compared with 4.1 months in the BOND trial. These two rates cannot be compared statistically but only in a qualitative sense. The MABEL investigators interpret the PFS results from the two studies to be similar. However, there is an almost

