ISBN: 0-8247-9025-1

This book is printed on acid-free paper.

Headquarters
Marcel Dekker, Inc.
270 Madison Avenue, New York, NY 10016
tel: 212-696-9000; fax: 212-685-4540

Eastern Hemisphere Distribution
Marcel Dekker AG
Hutgasse 4, Postfach 812, CH-4001 Basel, Switzerland
tel: 41-61-261-8482; fax: 41-61-261-8896

World Wide Web
http://www.dekker.com

The publisher offers discounts on this book when ordered in bulk quantities. For more information, write to Special Sales/Professional Marketing at the headquarters address above.

Copyright 2001 by Marcel Dekker, Inc. All Rights Reserved.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without permission in writing from the publisher.

Current printing (last digit): 10 9 8 7 6 5 4 3 2 1

PRINTED IN THE UNITED STATES OF AMERICA

To my mother, who has been an inspiration to me for 55 years, and to others for 91.

The idea for this book came to me from Graham Garrett of Marcel Dekker, Inc., who, tragically, passed away while the book was in progress. I hope that the result lives up to his expectations. All royalties for the editor will go to Cancer Research And Biostatistics, a nonproﬁt corporation whose mission is to help conquer cancer through the application of biostatistical principles and data management methods.

Preface

This book is a compendium of statistical approaches to the problems facing those trying to make progress against cancer. As such, the focus is on cancer clinical trials, although several of the contributions also apply to observational studies, and many of the chapters generalize beyond cancer research. This field is approximately 50 years old, and it has been at least 15 years since such a summary appeared; because much progress has been made in recent decades, the time is propitious for this book. The intended audience is primarily but not exclusively statisticians working in cancer research; it is hoped that oncologists might benefit as well from reading this book.

The book has six sections:

1. Phase I Trials. This area has moved from art to science in the last decade, thanks largely to the contributors to this book.
2. Phase II Trials. Recent advances beyond the widely accepted two-stage design based on tumor response include designs based on toxicity and response and selection designs, meant to guide in the decisions regarding which of many treatments to move to phase III trials.
3. Phase III Trials. A comprehensive treatment is provided of sample size, as well as discussions of multiarm trials, equivalence trials, and early stopping.
4. Complementary Outcomes. Quality-of-life and cost of treatment have become increasingly important, but pose challenging analytical problems, as the chapters in this section describe.
5. Prognostic Factors and Exploratory Analysis. The statistical field of survival analysis has had its main impetus from cancer research, and the chapters in this section demonstrate the breadth and depth of activity in this field today.
6. Interpreting Clinical Trials. This section provides lessons—never outdated and seemingly always needing repeating—on what can and cannot be concluded from single or multiple clinical trials.

I would like to thank all the contributors to this volume.

John Crowley

Contents

Preface
Contributors

PHASE I TRIALS
1. Overview of Phase I Trials
   Lutz Edler
2. Dose-Finding Designs Using Continual Reassessment Method
   John O'Quigley
3. Choosing a Phase I Design
   Barry E. Storer

PHASE II TRIALS
4. Overview of Phase II Clinical Trials
   Stephanie Green
5. Designs Based on Toxicity and Response
   Gina R. Petroni and Mark R. Conaway
6. Phase II Selection Designs
   P. Y. Liu

PHASE III TRIALS
7. Power and Sample Size for Phase III Clinical Trials of Survival
   Jonathan J. Shuster
8. Multiple Treatment Trials
   Stephen L. George
9. Factorial Designs with Time-to-Event End Points
   Stephanie Green
10. Therapeutic Equivalence Trials
    Richard Simon
11. Early Stopping of Cancer Clinical Trials
    James J. Dignam, John Bryant, and H. Samuel Wieand
12. Use of the Triangular Test in Sequential Clinical Trials
    John Whitehead

COMPLEMENTARY OUTCOMES
13. Design and Analysis Considerations for Complementary Outcomes
    Bernard F. Cole
14. Health-Related Quality-of-Life Outcomes
    Benny C. Zee and David Osoba
15. Statistical Analysis of Quality of Life
    Andrea B. Troxel and Carol McMillen Moinpour
16. Economic Analysis of Cancer Clinical Trials
    Gary H. Lyman

PROGNOSTIC FACTORS AND EXPLORATORY ANALYSIS
17. Prognostic Factor Studies
    Martin Schumacher, Norbert Holländer, Guido Schwarzer, and Willi Sauerbrei
18. Statistical Methods to Identify Prognostic Factors
    Kurt Ulm, Hjalmar Nekarda, Pia Gerein, and Ursula Berger
19. Explained Variation in Proportional Hazards Regression
    John O'Quigley and Ronghui Xu
20. Graphical Methods for Evaluating Covariate Effects in the Cox Model
    Peter F. Thall and Elihu H. Estey
21. Graphical Approaches to Exploring the Effects of Prognostic Factors on Survival
    Peter D. Sasieni and Angela Winnett
22. Tree-Based Methods for Prognostic Stratification
    Michael LeBlanc

INTERPRETING CLINICAL TRIALS
23. Problems in Interpreting Clinical Trials
    Lillian L. Siu and Ian F. Tannock
24. Commonly Misused Approaches in the Analysis of Cancer Clinical Trials
    James R. Anderson
25. Dose-Intensity Analysis
    Joseph L. Pater
26. Why Kaplan-Meier Fails and Cumulative Incidence Succeeds When Estimating Failure Probabilities in the Presence of Competing Risks
    Ted A. Gooley, Wendy Leisenring, John Crowley, and Barry E. Storer
27. Meta-Analysis
    Luc Duchateau and Richard Sylvester

Index

Contributors

James R. Anderson, Ursula Berger, John Bryant, Bernard F. Cole, Mark R. Conaway, John Crowley, James J. Dignam, Luc Duchateau, Lutz Edler, Elihu H. Estey, Stephen L. George, Pia Gerein, Ted A. Gooley, Stephanie Green, Norbert Holländer, Michael LeBlanc, Wendy Leisenring, P. Y. Liu, Gary H. Lyman, Carol McMillen Moinpour, Hjalmar Nekarda, John O'Quigley, David Osoba, Joseph L. Pater, Gina R. Petroni, Peter D. Sasieni, Willi Sauerbrei, Martin Schumacher, Guido Schwarzer, Jonathan J. Shuster, Richard Simon, Lillian L. Siu, Barry E. Storer, Richard Sylvester, Ian F. Tannock, Peter F. Thall, Andrea B. Troxel, Kurt Ulm, John Whitehead, H. Samuel Wieand, Angela Winnett, Ronghui Xu, and Benny C. Zee.

1
Overview of Phase I Trials

Lutz Edler
German Cancer Research Center, Heidelberg, Germany

I. INTRODUCTION

The phase I clinical trial constitutes a research methodology for the search and establishment of new and better treatment of human diseases. It is the first of the three phases—phase I–III trials—that became a "gold standard" of medical research during the second half of the 20th century (1). The goal of the phase I trial is to define and to characterize the new treatment in humans to set the basis for later investigations of efficacy and superiority. Therefore, the safety and the feasibility of the treatment are at the center of interest. The phase I trial should define a standardized treatment schedule to be safely applied to humans and worth being further investigated for efficacy. For non-life-threatening diseases, phase I trials are usually conducted on human volunteers, at least as long as the expected toxicity is mild and can be controlled without harm. In life-threatening diseases such as cancer, AIDS, and so on, phase I studies are conducted with patients because of the aggressiveness and possible harmfulness of cytostatic treatments, because of possible systemic treatment effects, and because of the high interest in the new drug's efficacy in those patients directly. After failure of standard treatments or in the absence of a curative treatment for seriously chronically ill patients, the new drug may be the small remaining chance for treatment. A positive risk–benefit judgment should be expected such that possible harm of the treatment is outweighed by possible gain in cure, in suppression of the disease and its symptoms, and in improved quality of life and survival.

The phase I trial is the first instance where patients are treated experimentally with a new drug. Therefore, it has to be conducted unconditionally under the regulations of the Declaration of Helsinki (2) to preserve the patient's rights in an extreme experimental situation and to render the study ethically acceptable.

Biostatistical methodology for planning and analyzing phase I oncological clinical trials is presented. The next section provides an outline of the tasks, including the definition of the maximum tolerated dose (MTD), which is crucial for the design and analysis of a phase I trial. Basic assumptions underlying the conduct of the trial and basic definitions for the statistical task are given. The presentation of phase I designs in Section III distinguishes between the determination of the dose levels (action space) and the choice of the dose escalation scheme (decision options). Phase I designs proposed during the past 10 years are introduced there. The sample size per dose level is discussed separately. This constitutes the core of this chapter. Practical aspects of the conduct of a phase I trial, including the choice of a starting dose, are presented in Section IV. Individual dose adjustment and dose titration studies are also addressed. Section V exhibits standard methods of analyzing phase I data. Basic pharmacokinetic methods are outlined. Regulatory aspects and guidelines are dealt with in Section VI. Section VII addresses practical needs, problems occurring during the conduct of a phase I trial, and future research topics. Validations of phase I trials rely mostly on simulation studies because the designs cannot be compared competitively in practice. It will become clear that the methodology for phase I trials is far from being at an optimal level at present.

II. TASKS, ASSUMPTIONS, AND DEFINITIONS

A. Clinical Issues and Statistical Tasks

Clinical phase I studies in oncology are of pivotal importance for the development of new anticancer drugs and anticancer treatment regimens (3,4). If a new agent has successfully passed preclinical investigations (5) and is judged as being ready for application in patients, then the first application to humans should occur within the framework of a phase I clinical trial (6–9). At this early stage, an efficacious and safe dosing is unknown, and information is available at best from preclinical in vitro and in vivo studies (10). The objective is to determine the MTD (8) of a drug for a specified mode of administration and to characterize the DLT. Beginning treatment at a low dose very likely to be safe (starting dose), small cohorts of patients are treated at progressively higher doses (dose escalation) until drug-related toxicity reaches a predetermined level (dose limiting toxicity [DLT]). The goals in phase I trials are according to Von Hoff et al. (11):

1. Establishment of an MTD.
2. Determination of qualitative and quantitative toxicity and of the toxicity profile.
3. Characterization of DLT.
4. Investigation of basic clinical pharmacology.
5. Identification of antitumor activity.
6. Recommendation of a dose for phase II studies.

The primary goal is the determination of a maximum safe dose for a specified mode of treatment as basis for phase II trials. Activity against tumors is examined and assessed, but tumor response is not a primary end point. Important clinical issues in phase I trials are patient selection and identification of factors that determine toxicity. Important statistical issues are the design parameters (starting dose, dose levels, drug schedules, dose escalation), the estimation of the MTD, and the determination and assessment of target toxicity (6,11). Methods considered below apply to adult cancer patients with confirmed diagnosis of cancer not amenable to established treatment. Usually excluded are leukemias and tumors in children (9). Phase I studies in radiotherapy may require further consideration because of long delayed toxicity.

Inherent in a phase I trial is the ethical issue that anticancer treatment is potentially both harmful and beneficial to a degree that depends on dosage. The dose dependency is at that stage of research not known for humans (12). To be on the safe side, the treatment starts at low doses that are probably not high enough such that the drug can be sufficiently active to elicit a beneficial effect. Even worse, the experimental drug may appear finally—after having passed the clinical drug development program—as inefficacious and may have been of harm only; retrospectively seen, patients in phase I trials hardly had any benefit from the medical treatment. The dilemma of probably unknowingly underdosing patients in the early stages of a phase I trial has been of concern and has challenged the search for the best possible methodology for the design and conduct of a phase I trial. The goal is to obtain the most information on toxicity in the shortest possible time with the fewest patients (13).

B. Assumptions

Most designs for dose finding in phase I trials assume a monotone dose–toxicity relationship and a monotone dose–(tumor) response relationship (14). This idealized relationship is: biologically inactive dose ≤ biologically active dose ≤ highly toxic dose. The conduct of a phase I trial requires an excellently equipped oncological center with high-quality means for diagnosis and experimental treatment, for detection of toxicity, and for fast and adequate reaction in the case of serious adverse events. Furthermore, easy access to a pharmacological laboratory is needed for timely pharmacokinetic analyses. These requirements indicate the advisability of restricting a phase I trial to one or very few centers.

C. Definitions

Throughout this article we denote the set of dose levels at which patients are treated by D = {xi, i = 1, 2, . . .}, assuming xi < xi+1. Small cohorts of patients of size nk, 1 ≤ nk ≤ nmax, are treated on a timely consecutive sequence of doses x[k] ∈ D, where nmax is a theoretical limit of the number of patients treated per dose level (e.g., nmax = 8) and where k = 1, 2, . . . counts the time periods of treatment with dose x[k], independent of the number nk of patients per cohort at level x[k]. Notice that a dose xi ∈ D may be visited more than once with some time delay between visits; i.e., x[k] ≠ x[h] for k ≠ h is not assumed. If the treatment at each of these dose levels x[k], k = 1, 2, . . . lasts a fixed time length ∆t (e.g., 2 months), the duration of the phase I trial is then equal to ∆t times the number of those cohorts entering the trial. The dose unit is usually mg/m2 body surface area (15), but this choice does not have an impact on the methods, nor has the route of application.

It is assumed that the patients enter the study one after the other, numbered by j, j = 1, 2, . . ., and that treatment starts immediately after entry (informed consent assumed). Denote by x(j) the dose level of patient j. Toxic response of patient j is assumed to be described by the dichotomous random variable Yj, where Yj = 1 indicates the occurrence of a DLT and Yj = 0 the nonoccurrence. To comply with most articles we denote the dose–toxicity function by ψ(x, a) with a parameter (vector) a:

P(Y = 1 | Dose = x) = ψ(x, a)   (1)

ψ(x, a) is assumed as a continuous monotone nondecreasing function of the dose x, defined on the real line 0 ≤ x < ∞ with ψ(0, a) = 0 and ψ(∞, a) = 1.
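For illustration only, the following sketch renders this notation in executable form: a logistic curve stands in for ψ(x, a), and a patient's binary DLT indicator Y is drawn from it. The parameter values a0 = −3 and a1 = 0.05 are arbitrary assumptions for the example, not estimates from any trial.

```python
import math
import random

def psi(x, a0=-3.0, a1=0.05):
    """Illustrative logistic dose-toxicity function: P(Y = 1 | dose = x)."""
    return 1.0 / (1.0 + math.exp(-(a0 + a1 * x)))

def simulate_dlt(x, rng=random):
    """Draw the binary DLT indicator Y for one patient treated at dose x."""
    return 1 if rng.random() < psi(x) else 0

if __name__ == "__main__":
    random.seed(1)
    for dose in (10, 30, 60, 90):
        print(dose, round(psi(dose), 3), simulate_dlt(dose))
```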

D. Maximum Tolerated Dose

The notion of an MTD is defined unequivocally in terms of the observed toxicity data of the patients treated, using the notion of DLT under valid toxicity criteria (8). Drug toxicity is considered as tolerable if the toxicity is acceptable, manageable, and reversible. Drug safety has been standardized for oncological studies recently by the establishment of the common toxicity criteria (CTC) of the U.S. National Cancer Institute (NCI) (16). This is a large list of adverse events (AEs) subdivided into organ/symptom categories that can be related to the anticancer treatment. Each AE has been categorized into five classes:

1. CTC grade 0, no AE or normal.
2. CTC grade 1, mildly (elevated/reduced).
3. CTC grade 2, moderate.
4. CTC grade 3, serious/severe.
5. CTC grade 4, very serious or life threatening.

The CTC grade 5, fatal, is not used in the sequel because death is usually taken as a very serious adverse event preceded by a CTC grade 4 toxicity. Of course, a death related to treatment has to be counted as DLT. The list of CTC criteria has replaced the list of the World Health Organization (17), based on an equivalent 0–4 scale. Investigators planning a phase I trial have to identify in the CTC list a subset of candidate toxicities for dose limitation, and they have to fix the grade for which that toxicity is considered to be dose limiting such that treatment has to be either stopped or the dose has to be reduced. That identified subset of toxicities from the CTC list and the limits of grading define the DLTs for the investigational drug. Usually, a toxicity of grade 3 or 4 is considered dose limiting. Sometimes the list of DLTs is open such that any AE from the CTC catalogue of grade 3 and higher related to treatment is considered a DLT.

During cancer therapy patients may show symptoms from the candidate list of DLTs not caused by the treatment but by the cancer disease itself or by concomitant treatment. Therefore, the occurrence of any toxicity is judged by the clinician or study nurse for its relation to the investigational treatment. A commonly used assessment scale is as follows:

1. Not related.
2. Unclear/no judgment possible.
3. Possibly.
4. Probably.
5. Definitively related to treatment (18).

Often, a judgment of "possibly" and more (i.e., possibly, probably, or definitively) is considered as drug-related toxicity and called adverse drug reaction (ADR). Therefore, one may define the occurrence of DLT for a patient more strictly as if at least one toxicity of the candidate subset of the CTC criteria of grade 3 and higher has occurred that was judged as at least possibly treatment related. Obviously, this definition carries subjectivity: choice of the candidate list of CTCs for DLT, assessment of the grade of toxicity, and assessment of the relation to treatment (19). Uncertainty of the assessment of toxicity has been investigated (e.g., in 20, 21). When anticancer treatment is organized in treatment cycles—mostly for 3–4 weeks—DLT is usually assessed retrospectively before the start of a new treatment cycle. If at least one cycle exhibits at least one DLT, that patient is classified as having reached DLT. Often two cycles are awaited before a final assessment of the DLT is made. An unambiguous definition of the assessment rules of individual DLT is mandatory for the study protocol. For the statistical analysis, each patient should be assessable at his or her dose level either as having experienced a DLT (Y = 1) or not (Y = 0).

With the above definition of DLT, one can theoretically assign to each patient an individual MTD (I-MTD) as the highest dose that can be administered safely to that patient: the I-MTD is the highest dose x that can be given to a patient before a DLT occurs. Obviously, it is not possible to observe the I-MTD. Because a patient can be examined at only one dose, it is only observed whether the given dose x exceeded the I-MTD or not. No within-patient variability is considered at this instance (i.e., the I-MTD is nonrandom). It is implicitly assumed that all patients entering a phase I trial react in a statistical sense identically and independently from each other. A population of patients gives rise to a statistical distribution; one postulates the existence of a population-based random MTD (realized in I-MTDs) to describe the distribution of this MTD. This then becomes the adequate statistical model for describing the MTD. The probability that x exceeds the random MTD is

P(x ≥ MTD) = F(x)   (2)

and describes the proportion of the population showing a DLT when treated by dose x. This probabilistic approach, known as the tolerance distribution model for a quantal dose–response relationship (22), allows any reasonable cumulative distribution function F for the right side of Eq. (2). F is a nondecreasing function with values between 0 and 1. In practice one should allow F(0) > 0 as a "baseline toxicity" and also F(∞) < 1 for saturation of toxicity. Classes of well-known tolerance distributions are the probit, logit, and Weibull models (22).

Based on this probabilistic basis, a practicable definition of an MTD of a phase I trial is obtained as a percentile of the statistical distribution of the (random population) MTD as follows. Determine an acceptable proportion 0 < θ < 1 of tolerable toxicity in the patient population before accepting the new anticancer treatment. Define the MTD as that dose for which the proportion of patients exceeding the DLT is at least as large as θ: F(MTD) ≥ θ, or MTD = F⁻¹(θ) (Fig. 1). In phase I studies, there is a direct correspondence between F in Eq. (2) and ψ in Eq. (1) as

ψ(MTD) = P(Y = 1 | Dose = MTD) = F(MTD) = θ   (3)

If ψ(x) is monotone nondecreasing and continuous, the MTD for θ, denoted MTDθ, is the θ percentile:

MTDθ = ψ⁻¹(θ)   (4)
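A minimal sketch of Eq. (4), assuming the same illustrative logistic form of ψ as in the sketch above: the percentile MTDθ is obtained by inverting the curve at the target toxicity rate θ. The parameter values are again arbitrary.

```python
import math

def mtd_from_logit(theta, a0=-3.0, a1=0.05):
    """Invert psi(x) = 1 / (1 + exp(-(a0 + a1*x))) at the target rate theta:
    MTD_theta = psi^{-1}(theta) = (logit(theta) - a0) / a1."""
    logit = math.log(theta / (1.0 - theta))
    return (logit - a0) / a1

if __name__ == "__main__":
    for theta in (0.20, 1.0 / 3.0, 0.50):
        print(f"theta = {theta:.2f}   MTD = {mtd_from_logit(theta):.1f}")
```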

Figure 1  Schematic dose–toxicity relationship ψ(x, a) with model parameter a. The maximum tolerable dose MTDθ is defined as the θ percentile of the monotone increasing function ψ(x, a).

The choice of θ depends on the nature of the DLT and the type of the target tumor. For an aggressive tumor and a transient and non-life-threatening DLT, θ could be as high as 0.5. For persistent DLT and less aggressive tumors, it could be as low as 0.1 to 0.25. A commonly used value is θ = 1/3 ≈ 0.33.

E. Dose–Toxicity Modeling

The choice of an appropriate dose–toxicity model ψ(x) is important not only for the planning but also for the analysis of phase I data. Most applications use an extended logit model and apply logistic regression because of its flexibility, the ease of accounting for patient covariates (e.g., pretreatment, disease staging, performance, etc.), and the availability of computing software. A general class of dose–toxicity models is a two-parameter family:

ψ(x, a) = F(a0 + a1 h(x))   (5)

where F is a known cumulative distribution function, h a known dose metric, and a = (a0, a1) unknown parameters. Monotone increasing functions F and h are sufficient for a monotone increasing ψ. If h(x) = x, the MTD is

MTDθ = [F⁻¹(θ) − a0]/a1   (6)

Convenient functions F are the

PROBIT(x): Φ(x)
LOGIT(x): {1 + exp(−x)}⁻¹
HYPERBOLIC TANGENT(x): {[tanh(x) + 1]/2}^a2

with a further unknown parameter component a2. For the use of these models see O'Quigley et al. (23).

III. DESIGN

A phase I trial design has to determine which dose levels are applied to how many patients and in which sequel. This implies three tasks: determination of the possible set of dose levels, choosing the dose levels sequentially, and determining the number of patients per dose level.

A. Choice of the Dose Levels—Action Space

From previous information—mostly preclinical results—a range D of possible doses (action space) is assumed. One may distinguish between a continuous set DC of doses, a discrete finite action space DK = {x1 < ⋅⋅⋅ < xK}, or an infinite ordered increasing set D∞ = {x1 < x2 < ⋅⋅⋅}. Simple dose sets are the additive set

xi = x1 + (i − 1)∆x,  i = 1, 2, 3, . . .   (7)

and the multiplicative set

xi = x1 ⋅ f^(i−1),  i = 1, 2, . . .   (8)

where f denotes the factor by which the starting dose x1 is increased. A pure multiplicative set cannot be recommended and is not used in phase I trials because of its extreme danger of jumping from a nontoxic level directly to a highly toxic level. In use are modifications of the multiplicative scheme that start with a few large steps and slow down later.
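The two dose sets of Eqs. (7) and (8) can be generated as in the following sketch; the starting dose, increment, and factor in the example call are arbitrary choices for illustration.

```python
def additive_doses(x1, delta, k):
    """Additive action space, Eq. (7): x_i = x1 + (i - 1) * delta."""
    return [x1 + (i - 1) * delta for i in range(1, k + 1)]

def multiplicative_doses(x1, f, k):
    """Multiplicative action space, Eq. (8): x_i = x1 * f**(i - 1)."""
    return [x1 * f ** (i - 1) for i in range(1, k + 1)]

if __name__ == "__main__":
    print(additive_doses(10.0, 5.0, 6))        # 10, 15, 20, 25, 30, 35
    print(multiplicative_doses(10.0, 2.0, 6))  # 10, 20, 40, 80, 160, 320
```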

Such a modified action space could be the result of a mixture where the first steps of low doses are obtained multiplicatively and the remaining ones additively; such a scheme may start with f1 = 2 as a doubling dose from x1 to x2 and continue with 1 < fi < 2 for i = 2, 3, . . . Another, smoother set is obtained when the factors are decreasing with higher doses,

xi = fi−1 xi−1,  i = 1, 2, . . .   (9)

where {fi} is a nonincreasing sequence of factors. The modified Fibonacci scheme, described next, is of this general type.

1. Modified Fibonacci Dose Escalation

The most popular and most cited dose escalation scheme is the so-called modified Fibonacci dose escalation (MFDE) (Table 1). It has been in use from the beginning of systematic phase I research. A review of the literature for its origin and justification as a dose-finding procedure is difficult. A number of authors (8,24) refer to an article from 1975 of Goldsmith et al. (25), who present the MFDE as an "idealized modified Fibonacci search scheme" in multiples of the starting dose and as percent of increase. Two years earlier, Carter (3) summarized the study design principles for early clinical trials. For methodology he referred in a general way to O. Selawry, Chief of the Medical Oncology Branch at the NCI in the early seventies, "who has elucidated many of the phase I study principles." Carter stated that "this scheme has been used successfully in two Phase I studies performed by the Medical Oncology Branch" in 1970, one by Hansen (26) and one by Muggia (27). Both studies are published in the Proceedings of the American Association of Cancer Research without a bibliography.

Table 1  Evolution of the Modified Fibonacci Scheme from the Fibonacci Numbers fn, Defined Recursively by fn+1 = fn + fn−1

Fibonacci numbers fn:           1    2     3     5     8     13    21    34         55
Fibonacci multiples fn+1/fn:    —    2.0   1.5   1.67  1.60  1.63  1.62  1.62       1.62
Modified Fibonacci:             —    2     1.65  1.52  1.40  1.33  1.33  1.33       1.33
Smoothed modified Fibonacci:    —    2.0   1.67  1.5   1.40  1.35  1.35  1.30–1.35  1.30–1.33
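A sketch of a dose ladder built with modified-Fibonacci-type factors follows. The factor sequence used below (2, then roughly 1.65, 1.52, 1.40, and 1.33 thereafter, as in Table 1) is one commonly quoted version and is an assumption here, since protocols differ in the exact factors they apply.

```python
def modified_fibonacci_doses(x1, n_levels,
                             factors=(2.0, 1.65, 1.52, 1.40)):
    """Dose ladder x_i = f_{i-1} * x_{i-1} with decreasing escalation
    factors; after the listed factors, 1.33 is reused (Table 1 style)."""
    doses = [x1]
    for i in range(1, n_levels):
        f = factors[i - 1] if i - 1 < len(factors) else 1.33
        doses.append(doses[-1] * f)
    return doses

if __name__ == "__main__":
    print([round(d, 1) for d in modified_fibonacci_doses(10.0, 8)])
```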

However, the relation to the Fibonacci numbers is not clarified in these early studies; both refer to Schneiderman (10). The weakness of the foundation of the MFDE for phase I trials becomes even more evident when one looks deeper into the history. Fibonacci (Figlio Bonaccio, the son of Bonaccio), also known as Leonardo da Pisa or Leonardo Pisanus, lived from 1180 to 1250 as a mathematician at the court of Friedrich II in Sicilia, working in number theory and biological applications. The sequence of numbers named after Fibonacci is created by a simple recursive additive rule: each number is the sum of its two predecessors, fn+1 = fn + fn−1, and starts from f1 = 1 using f0 = 0 (Table 1). Fibonacci related this sequence to the "breeding of rabbits" problem in 1202 and also to the distribution of leaves about a stem. The sequence fn grows geometrically and is approximately equal to a[(1 + √5)/2]^n, where a = (1 + √5)/√5. The ratio of successive numbers fn/fn−1 converges to (1 + √5)/2 = (1 + 2.236)/2 = 1.618, the Golden Section, a famous principle of ancient and Renaissance architecture.

The Fibonacci numbers have been used in optimization and dynamic programming in the 1950s (30) for determining the maximum of an unimodal function. One application can be illustrated as follows: "How many meters long can a bent bridge be, such that one can always locate its maximum height in units of meters by measuring at most n times?" The solution is given by Bellman's theorem (30) as fn meters. The results say nothing on the placement of the measurements but only on the needed number of measurements. The optimization result is even better explained in Ref. 31 on page 152. In his article on methods for early clinical trials research, Schneiderman (10) showed that he was familiar with this work on optimization and cites Bellman's result, but actually with the wrong page number (correct is Ref. 30, page 34 and not 342). Schneiderman (10) tried to transpose the Fibonacci search by fixing an initial dose x1, a maximum possible dose xK, and the number n of steps for moving upward from x1 to xK. "By taking a Fibonacci series of length n + 1, inverting the order, and spacing the doses in proportion to the n intervals in the series" (10), Schneiderman obtained an increasing sequence of doses. In contrast to the MFDE, which is based on a multiplicative set of doses, this approach is somehow still additive. However, the steps obtained by Schneiderman's inversion are at the beginning very large and later very small, and it leads to smaller and smaller steps toward higher doses similar to the MFDE. This escalation is fixed to K doses and does not open to higher doses if the MTD was not reached at xK. Schneiderman discussed a restarting of the scheme if no toxicity was seen at xK and concluded that then "no guide seems to exist for the number of steps." The number of steps in this reversed Fibonacci scheme is strongly related to the escalation factor and so provides no guidance for dose escalation.

In 1977, Carter (28) refers to Schneiderman (10) and "a dose escalation based on a numeral series described by the famed 13th Century Italian mathematician Leonardo Pisano." In the same article, he cites, for example, De Vita et al. (32), in which a dose escalation with the factors of 2, 2, 1.5, 1.33, and 1.25 is used. He also reported on the use of the MFDE by Hansen et al. (29) when studying the antitumor effect of 1,3-bis(2-chloroethyl)-1-nitrosourea (BCNU) chemotherapy. A hint at the use of the inverse of a Fibonacci scheme, where the dose increments decrease with increasing numbers, is also given by Bodey and Legha (8), who refer to Ref. 25, "who examined the usefulness of the modified Fibonacci method as a guide to reaching the MTD." In summary, it seems that the idea of the so-called MFDE came up in the NCI in the sixties when the early clinical trials programs started there and was promoted by the scientists mentioned above. They searched for a dose escalation scheme that slows down from doubling the dose to smaller increases within a few steps. The MFDE (Table 1), slowing down the increase from 65% to 33% within the first five steps, seemed reasonable enough to be used in many trials. The method has been successful to the extent that MTDs have been determined through its use. From empirical evidence and the simulation studies performed later, however, the MFDE seems now to be too conservative in too many cases. Carter (3) is also the first source I am aware of where the so-called 3 + 3 rule (the traditional dose escalation scheme) is listed as a phase I study principle.

2. Starting Dose

The initial dose given to the first patients in a phase I study should be low enough to avoid severe toxicity but also high enough for a chance of activity and potential efficacy in humans. Extrapolation from preclinical animal data focused on the lethal dose 10% (LD10) of the mouse (dose with 10% drug-induced deaths), converted into equivalents in units of mg/m2 (33) of body surface area. The standard starting dose became 1/10 of the minimal effective dose level for 10% deaths (MELD10) of the mouse, after verification that no lethal and no life-threatening effects were seen in another species, for example, rats or dogs (7,34). Earlier recommendations had used higher portions of the MELD10 (mouse) or other characteristic doses as, for example, the lowest dose with toxicity (toxic dose low) in mammals (35).
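As a small illustration of the starting-dose rule just described (one-tenth of the mouse MELD10 expressed per body surface area): the mg/kg-to-mg/m2 conversion factor of 3 for mice used below is a commonly cited value and is an assumption here, not taken from this chapter.

```python
def starting_dose_mg_per_m2(mouse_meld10_mg_per_kg,
                            km_mouse=3.0, safety_fraction=0.1):
    """One-tenth of the mouse MELD10 converted to mg/m2 body surface area.
    km_mouse = 3 is an assumed mg/kg-to-mg/m2 factor for mice."""
    meld10_mg_per_m2 = mouse_meld10_mg_per_kg * km_mouse
    return safety_fraction * meld10_mg_per_m2

if __name__ == "__main__":
    print(starting_dose_mg_per_m2(30.0))   # 30 mg/kg in the mouse -> 9 mg/m2
```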

B. Dose Escalation Schemes

If a clinical action space has been defined as a set of dose levels D, the next step in designing a phase I trial consists of the establishment of a rule by which the doses of D are assigned to patients. Proceeding from a starting dose x1, the sequence of dosing has to be fixed in advance in a so-called dose escalation rule. This section starts with the traditional escalation rules (TER). Then we introduce the up-and-down rules (UaD) as fundamental but not directly applicable rules and turn from this to Bayesian rules and the intensively and sometimes also controversially discussed continual reassessment method (CRM). Methods for the determination of the MTD have to be addressed in this context also. The traditional rules are also known as "3 + 3" rules because it became usual to enter three patients at a new dose level and, when any toxicity was observed, to enter six patients in total at that dose level (11) before deciding to stop at that level or to increase the dose. Two versions of this 3 + 3 rule are described below as TER and strict TER (STER).

1. Traditional Escalation Rule

A long used standard phase I design has been the TER, where the dose escalates in DK or D∞ step by step from xi to xi+1, i = 1, 2, . . ., with three to six patients per dose level; see Table 2 for an example taken from Ref. 36. Using TER, patients are treated in cohorts of three, each receiving the same dose, say xi. If none of the three patients shows a DLT at level xi, the next cohort of three patients receives the next higher dose xi+1. Otherwise, a second cohort of three is treated at the same level xi again. If exactly one of the six patients treated at xi exhibits DLT, the trial continues at the next higher level xi+1. If two or more patients of the six exhibit DLT at the level xi, the escalation stops at that level. When the escalation has stopped, various alternatives of treating a few more patients are in use:

1. Treat a small number of additional patients at the stopping level xi, e.g., to a total of eight patients.
2. Treat another cohort of three patients at the next lower level xi−1 if six patients had not already been treated there.
3. Treat another cohort of three patients at all next lower levels xi−1, xi−2, . . ., possibly going down as far as x1.
4. Treat a limited number of patients at a level not previously included in D located between xi−1 and xi.

Table 2  Example of a Phase I Study Performed According to the Standard Design (TER)

Dose level   N   DLT   Observed toxicity grades
1            3   0     000
2            6   1     111311
3            6   1     041111
4            3   0     012
5            4   0     1122
6            3   0     111
7            3   1     113
8            3   0     111
9            4   1     0320
10           3   0     010
11           6   1     123122
12           5   4     33331
13           1   1     3

Each row shows the number of patients N of that dose level, the number of cases with DLT defined as grade 3–4, and the actually observed toxicity grades of the N patients. (From Ref. 36.)

A slightly more conservative escalation is implemented in a modified TER, denoted here as STER. Using STER, patients are treated in cohorts of three, each receiving the same dose, say xi. If none of the three patients shows a DLT at level xi, the next cohort of three patients receives the next higher dose xi+1. If one DLT is observed, three other patients are included at the same level xi and the procedure continues as TER. If two or three DLTs are observed among the first three of that cohort, escalation stops at xi and the dose is de-escalated to the next lower level xi−1, where a prefixed small number of cases is treated additionally according to one of the options 2–4 above.

STER can be described formally as follows. Assume that j = ji patients have been treated before the turn to the level xi at lower dose levels x(1), . . . , x(j) < xi, and that ni−1 patients had been treated at dose level xi−1. Denote by Sm^ji the number of patients with a DLT among m patients at dose level xi when ji patients have been treated before. Then

x(j + 1) = x(j + 2) = x(j + 3) = xi   (10)

and set

x(j + 4) = x(j + 5) = x(j + 6) = xi+1  if S3^ji = 0, and continue
x(j + 4) = x(j + 5) = x(j + 6) = xi    if S3^ji = 1, and continue
stop                                    if S3^ji ≥ 2   (11)

Then set next

x(j + 7) = x(j + 8) = x(j + 9) = xi+1  if S6^ji = 1
x(j + 7) = x(j + 8) = x(j + 9) = xi−1  if S6^ji ≥ 2 and ni−1 < 6
stop                                    if S6^ji ≥ 2 and ni−1 = 6   (12)
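The TER logic above can be simulated to see where a trial on a given dose ladder tends to stop. The sketch below assumes a vector of hypothetical true DLT probabilities per level and implements only the basic 3 + 3 walk, not the STER refinements of Eqs. (10)–(12).

```python
import random

def simulate_ter(true_dlt_probs, rng=random):
    """Simulate one 3+3 (TER) trial; true_dlt_probs[i] is the assumed true
    DLT probability at dose level i. Returns the recommended MTD level
    (the level below the stopping level), or None if level 0 is too toxic;
    if escalation never stops, the highest level is returned."""
    level = 0
    while level < len(true_dlt_probs):
        dlts = sum(rng.random() < true_dlt_probs[level] for _ in range(3))
        if dlts == 0:
            level += 1                      # 0/3 DLTs: escalate
            continue
        if dlts == 1:                       # 1/3 DLTs: expand to six patients
            dlts += sum(rng.random() < true_dlt_probs[level] for _ in range(3))
            if dlts == 1:
                level += 1                  # 1/6 DLTs: escalate
                continue
        return level - 1 if level > 0 else None   # >= 2 DLTs: stop
    return len(true_dlt_probs) - 1

if __name__ == "__main__":
    random.seed(2)
    print(simulate_ter([0.05, 0.10, 0.20, 0.35, 0.55]))
```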

If the escalation stops at a dose level xi, the next lower level xi−1 is then considered as the MTD. It is common practice that at the dose level of the MTD at least six patients are treated; for this reason, the options 2 and 3 above are often applied at the end of a phase I trial. Therefore, the MTD can be characterized as the highest dose level below the stopping dose level xi at which (at least) six patients have been treated with no more than one case of DLT. If no such dose level can be identified, the starting dose level x1 would be taken as MTD; if even that level would be the first of unacceptable DLT, one would conclude that the MTD is exceeded.

2. Random Walk (RW) Designs

A large class of escalation rules is based on the sequential assignment of doses to one patient after the other: a patient is assigned to the next higher, the same, or the next lower dose level with a probability that depends on the previous subject's response. Those rules have their origin in sequential statistical designs and in stochastic approximation theory. Early prototypes in statistical theory were the UaD (39), proposed originally for explosives testing, and the stochastic approximation method (SAM) (40). SAM has never been considered seriously for phase I trials. One reason may be the use of a continuum of dose levels leading to impracticable differentiation between doses; another reason could be the ambiguity of the adapting parameter sequence {aj} (but see Ref. 41). The main reason has been stated already in (10): "the up and down procedures and the usual overshooting are not ethically acceptable in an experiment on man." For this reason, the UaD rules were more recently adapted for medical applications by considering grouped entry, biased coin randomization, and Bayesian methods (38). RW designs operate mostly on the finite lattice of increasingly ordered dosages DK = {x1 < ⋅⋅⋅ < xK}. If the dose assignment to the current patient depends only on the result seen in the previous one, the assignment process becomes Markovian and performs a random walk on the action space D. RW designs are simple to implement, essentially nonparametric, and of known finite and asymptotic distribution behavior. A Markov chain representation of the random walk on D is given in Ref. 37. All optimality results of RW designs require that the set D of doses remains unchanged during the trial. RW designs have been applied to phase I studies (37,38). Basically, the elementary UaD has been reintroduced into phase I trials—cited as Storer's B design (37)—as a tool to construct more appropriate combination designs.

Elementary UaD. Given patient j has been treated on dose level x(j) = xi, the next patient j + 1 is treated at the next lower level xi−1 if a DLT was observed in patient j, otherwise at the next higher level xi+1:

x(j + 1) = xi−1  if x(j) = xi and DLT
x(j + 1) = xi+1  if x(j) = xi and no DLT   (13)

Two modifications of the elementary rule were proposed (37): "modified by two UaD" or Storer's C design (UaD-C) and "modified by three UaD" or Storer's D design (UaD-D), which is quite similar to the 3 + 3 rule:

UaD-C: Proceed as in UaD but escalate only if two consecutive patients are without DLT.
UaD-D: Three patients are treated at a new dose level. Escalate if no DLT and de-escalate if more than one DLT occurs. If exactly one patient shows a DLT, another three patients are treated at the same level and the rule is repeated.

The single-stage designs UaD, UaD-B, and UaD-C were not considered as sufficient, and only the two-stage combinations were proposed for use (37). The two designs were combined with the elementary UaD to Storer's BC and BD two-stage designs:

UaD-BC: Use the UaD until the first toxicity occurs and continue with the UaD-C at the next lower dose level.
UaD-BD: Use the UaD until the first toxicity occurs and continue with the UaD-D design.

Simulations revealed a superiority of the UaD-BD over the UaD-BC and the elementary UaD (37). A new sequential RW (38) is the so-called biased coin design (BCD), applicable for an action space DK = {x1 < ⋅⋅⋅ < xK}. Using BCD, given patient j has been treated on dose level x(j) = xi, the next patient j + 1 is treated at the next lower level xi−1 if a DLT was observed in patient j; otherwise the next patient is treated at the next higher level xi+1 with some probability p# not larger than 0.5, or stays at xi with probability 1 − p# (not smaller than 0.5). This design centers the dose allocation unimodally around the MTDθ for any θ, 0 < θ ≤ 0.5, if p# is chosen as θ/(1 − θ) and if the MTD is in the inner of the dose space DK (38). The target of the 33% percentile, MTD1/3, is obtained with a nonbiased coin of probability p# = (1/3)/(2/3) = 0.5. When reaching the boundaries x1 and xK, the procedure must stay there.

3. Continual Reassessment Method

To reduce the number of patients treated with possibly ineffective doses by the TER/STER or the RW type designs, a Bayesian based dose escalation rule was introduced by O'Quigley et al. (23). Using CRM, the next dose level is determined such that it is closest to the current estimate of an MTDθ. A starting dose is selected using a prior distribution of the MTD, and this distribution is updated with each patient's observation with regard to the absence or presence of DLT. Each patient is treated at the dose level closest to the currently estimated MTD; that is, the MTD is estimated after each patient's treatment and the next patient is then treated at that estimated level. We describe the CRM next and postpone a straightforward Bayesian design to the next subsection.

Assume a finite action space DK = {x1 < ⋅⋅⋅ < xK} of dose levels, a fixed sample size N, and a one-parameter dose–toxicity function ψ(x, a) as defined in Eq. (1), depending on the model parameter a. Estimation of the MTD is therefore equivalent to the estimation of a. Assume a unique solution a0 at the MTDθ:

ψ(MTD, a0) = θ   (14)

Let Yj denote the dichotomous response variable of the jth patient, j = 1, . . . , N, and summarize the sequentially collected dose–toxicity information up to the (j−1)th patient by

Ωj = {y1, . . . , yj−1}   (15)

The information upon the parameter a, given the previous information Ωj, is described by its density function f(a, Ωj). Usually, a is assumed as a scalar a > 0 and f normalized by ∫0∞ f(a, Ωj) da = 1. For each xi ∈ DK, given Ωj, the probability of a toxic response is calculated as

θij = P(xi, Ωj) = ∫0∞ ψ(xi, a) f(a, Ωj) da   (16)

and the dose level x(j) for the jth patient is selected from DK such that the distance of θij to the target toxicity rate θ becomes minimal: x(j) = xi if |θ − θij| is minimum. After observing the toxicity Yj at dose level x(j), the posterior density of the parameter a is obtained from the prior density f(a, Ωj) and the likelihood of the jth observation

L(yj, x(j), a) = ψ(x(j), a)^yj [1 − ψ(x(j), a)]^(1−yj)   (17)

using Bayes theorem as

f(a, Ωj+1) = L(yj, x(j), a) f(a, Ωj) / ∫0∞ L(yj, x(j), u) f(u, Ωj) du   (18)

The CRM starts with an a priori density g(a). The MTD is then estimated as MTD = x(N + 1), the dose for the (N + 1)st patient. For an algorithm see O'Quigley and Chevret (42). Consistency in the sense that the recommended dose converges to the target level was shown even under model misspecification (43), although there are cases where it converges to a close but not the closest level. The treatment of batches of patients per dose level has also been proposed (23).

4. Modifications of the CRM

The CRM was criticized mainly from three aspects: choice of the starting dose x(1) according to a prior g(a) that could result in a dose level in the middle and not in the lower dose region; allowance to jump over a larger part of the dose region, skipping intermediate levels; and lengthening the trial because of allowing treatment and examination of toxicity only for one patient after the other. The ethical argument of risking too high toxicity and the practical argument of undue duration elicited modifications that try to reduce that risk and keep at the same time the benefit of reaching the MTD with a smaller number of dose levels and treating more patients at effective doses. Modifications of the CRM were obtained (44–48) through restrictions on choosing x1 as starting dose, not skipping consecutive dose levels, and allowing groups of patients at one dose level.

Using modified CRM, start with one patient at dose level x1 and apply the CRM. Given patient j − 1 has been treated on dose level x(j − 1) = xi with information Ωj = {y1, . . . , yj−1}, the CRM predicts for the next patient j the dose level xCRM(j); the next dose level x(j) is then chosen as follows:

x(j) = xCRM(j)  if xCRM(j) ≤ x(j − 1)
x(j) = xi+1     if xCRM(j) > x(j − 1) and yj−1 = 0 (no DLT)
x(j) = x(j − 1) if xCRM(j) > x(j − 1) and yj−1 = 1 (DLT)   (19)

The main restriction in Eq. (19) is the start at x1 and the nonskipping: escalation (i.e., xCRM(j) > x(j − 1)) is restricted within DK = {x1 < ⋅⋅⋅ < xK} to one step only (restricted version in 46). Korn et al. (44) introduce additionally a dose level x0 < x1 for an estimate of the MTD if a formally proposed MTD equal to x1 would exhibit an unacceptable toxicity. Further modifications demonstrate the efforts of reducing anticonservatism in the CRM:

1. Modified CRM but stay with x(j) always one dose level below xCRM(j) (version 1 in 45).
2. Modified CRM but x(j) is not allowed to exceed the MTD estimate based on Ωj (version 2 in 45).
3. Modified CRM but use the starting level of the CRM and enter there three patients (version 4 in 45).
4. Modified CRM but stopping if the next level has been already visited by a predetermined number of patients (e.g., six) (44).
5. Modified CRM run in three variants of one, two, or three patients per cohort (48).
6. CRM but allow more than one patient at one step at a dose level and limit escalation to one step (47).
7. As in modification 5, except that a dose level cannot be passed after a DLT was observed.

Another approach to modify the CRM consisted in starting with the UaD until the first toxicity occurs and then switching to the CRM using all information obtained so far (46). Further modifications of this type with two and three patients per cohort were investigated (46,48). From these proposals evolved a design sketched in (49), relying on a suggestion of Faries (45) and on a previous proposal (44): this extended CRM starts at x1 and escalates by not more than one step until the first toxicity occurs, but it may cover more doses than the CRM. A simulation study showed a reduction of trial duration (50–70%) and a reduction of toxicity events (20–35%) compared with the CRM.

Modifications Using Toxicity of CTC Grade 2 (Secondary Grade Toxicity). A substantial modification of previously discussed phase I designs was the suggestion to use further toxicity information than that defined for the determination of the DLT (44,45,48). It was proposed to account for the observation of grade 2 toxicity (secondary toxicity) that does not contribute normally to the DLT criterion. The CRM was modified such that two patients were treated on the level where the first patient had exhibited secondary toxicity (version 3 in 45). Referring to this and to a previous proposal (44), Ahn (48) implemented a so-called secondary grade design similar to the UaD-BD design: the elementary UaD is used with two patients per cohort; if one patient of the two shows a DLT or if there is a second case of grade 2 toxicity, dose escalation continues with the standard STER design.

5. Bayesian Designs

Bayesian methods are attractive for phase I designs because they can be applied even if little data are present but prior information is available. Bayesian methods are excellently adapted to decision making, and they tend to allocate patients at higher doses after nontoxic and at lower doses after toxic responses. The cumulation of information and the tendency for allowing real-time decisions make Bayesian designs best suited when patients are treated one at a time and when toxicity information is available quickly.

A full Bayesian design (50) is set up by the data model ψ(x, a), the prior distribution g(a) of the model parameter a, and a continuous set of actions D. A gain function (negative loss) G is needed to characterize the gain of information if an action is taken given the true (unknown) value of a or, equivalently, of the MTD. Whitehead and Brunier (50) use the precision of the MTD estimate obtained from the next patient as gain function. Given the response Yj = yj, the Bayes rule determines the posterior density g(a | yj) of the parameter a as

g(a | yj) = ψ(yj, a) g(a) / ∫ ψ(yj, a) g(a) da   (20)

A new dose level is then selected by maximizing the posterior expected gain

E[G(a) | yj]   (21)

Given a, the likelihood of the data Ωj for the first j − 1 patients is a product of the terms (17) used already for the CRM:

fj−1(Ωj, a) = ∏ (s = 1, . . . , j − 1)  ψ(x(s), a)^ys [1 − ψ(x(s), a)]^(1−ys)   (22)

The number of patients is fixed in advance. For details see Ref. 50.

Gatsonis and Greenhouse (51) used a Bayesian method to estimate directly the dose–toxicity function and the MTDθ in place of the parameter a. The probability α = P(Dose > MTDθ) of overdosing a patient was used as target parameter for Bayesian estimation in the so-called escalation with overdose control (EWOC) method (41). Using EWOC, given the previous results Ωj of the first j − 1 patients, obtain the posterior cumulative distribution of the MTD as πj(x) = P(MTD ≤ x | Ωj), that is, the conditional probability of overdosing the jth patient with the dose x. Implicitly, the posterior density of ψ(x, a), given Ωj, is obtained from the prior distribution of a, and from there the marginal posterior distribution density π(MTD | Ωj). The criterion for determining the dose level for the jth patient is πj(x(j)) = α; the dose level x(j) is then calculated as x(j) = πj⁻¹(α).
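A sketch of the CRM updating rule of Eqs. (16)–(18) follows. It uses a one-parameter power working model ψ(xi, a) = pi^exp(a) on a "skeleton" of prior guesses pi and a normal prior for a; both choices are common in practice but are assumptions here, since the chapter leaves ψ and g(a) general. The posterior integrals are approximated on a grid.

```python
import math

# Working prior guesses ("skeleton") of the DLT probability at each level.
SKELETON = [0.05, 0.10, 0.20, 0.33, 0.50]

def psi(p_i, a):
    """One-parameter power working model: psi(x_i, a) = p_i ** exp(a)."""
    return p_i ** math.exp(a)

def prior(a, sd=1.34):
    """Normal(0, sd^2) prior density for a (an assumed, commonly used choice)."""
    return math.exp(-a * a / (2 * sd * sd)) / math.sqrt(2 * math.pi * sd * sd)

def crm_next_level(levels_given, dlts, theta=1.0 / 3.0):
    """Posterior-mean DLT probability per level (Eq. 16 analogue) via grid
    integration; recommend the level whose value is closest to theta."""
    grid = [-6 + 12 * k / 400 for k in range(401)]
    weights = []
    for a in grid:
        lik = 1.0
        for lvl, y in zip(levels_given, dlts):
            p = psi(SKELETON[lvl], a)
            lik *= p if y == 1 else (1.0 - p)
        weights.append(lik * prior(a))
    norm = sum(weights)
    post_tox = [sum(w * psi(p, a) for w, a in zip(weights, grid)) / norm
                for p in SKELETON]
    return min(range(len(SKELETON)), key=lambda i: abs(post_tox[i] - theta))

if __name__ == "__main__":
    # Three patients without DLT at level 0, one DLT at level 1.
    print(crm_next_level([0, 0, 0, 1], [0, 0, 0, 1]))
```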

C. Sample Size per Dose Level

The number of patients to be treated per dose level was often implicitly determined in the previously described designs; the number of patients is fixed in advance. Recommendations vary between one and eight patients per dose level; other suggestions were three to six patients per lower dose level (9) or a minimum of three per dose level and a minimum of five near the MTD (7). Statistical methods on optimal sample sizes per dose level seem to be missing. Calculations of sample sizes separately from the sequential design can be based on the binomial distribution and hypothesis testing of toxicity rates (52). Using the two probabilities PAT and PUAT of acceptable toxicity (AT) and unacceptable toxicity (UAT), tables for a fixed low PAT and a few PUAT values were given by Rademaker (52) with a straightforward algorithm. With n = 6 patients per level, the probabilities of correctly escalating when the true toxicity rate equals PAT and of correctly not escalating when it equals PUAT are both high, but they drop considerably (roughly from 0.85 to 0.69 and from 0.83 to 0.51) when n decreases from 6 to 5. Characteristically, the required sample size and the error rates increase rapidly if the two probabilities PAT and PUAT approach each other. Given the toxicity rate p, one may also determine a sample size by considering the probability POT(p) of overlooking this toxicity by the simple formula

POT(p) = (1 − p)^n   (53,54)

Given p = 0.33, POT(p) takes the values 0.67, 0.45, 0.30, 0.20, 0.13, 0.09, 0.06, 0.04, 0.03, 0.02, 0.012, and 0.008 for n = 1, 2, . . . , 12. This gives some quantitative aid in planning the sample size per selected dose level.

D. Validation and Comparison of Phase I Designs

Simulation studies were performed to compare new designs with the standard designs TER/STER or to compare the CRM with its modifications (23,44,45). The criteria of the comparisons were

1. Distribution of the MTD on the dose levels.
2. Distribution of the occurrence of the dose levels.
3. Percentage of correct MTD estimation.
4. Percentage treated at the MTD or one level below.
5. Toxicity probability (percentage treated above the MTD).
6. Average number of patients.
7. Average number of toxicities.
8. Average number of steps (cohorts treated on successive dose levels).

Roughly summarizing the findings of those studies and without going into details, the TER appears inferior to the UaD-BC and the UaD-BD (37) in terms of the fraction of successful trials, and the UaD-BD is superior to the UaD-BC at least in some situations. By examining the distribution of the occurrence of toxic dose levels, the UaD recommended toxic levels much more often than the CRM, and the single-stage UaD designs were inferior to the CRM (23). The BCD design from the RW class and the CRM performed very similarly (38). STER designs were inferior to some modified CRMs with respect to the percentage of correct estimation, the percentage treated at the MTD or one level below, and the percentage of patients treated with very low levels (44,45).
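Returning to the sample-size aid POT(p) = (1 − p)^n introduced above, a small helper computes it and finds the smallest cohort size that keeps the risk of overlooking a toxicity below a chosen bound; the bound in the example call is arbitrary.

```python
def prob_overlook_toxicity(p, n):
    """POT(p) = (1 - p)**n: probability that none of n patients at a dose
    level shows a toxicity whose true rate is p."""
    return (1.0 - p) ** n

def smallest_n(p, max_risk):
    """Smallest cohort size n with POT(p) <= max_risk."""
    n = 1
    while prob_overlook_toxicity(p, n) > max_risk:
        n += 1
    return n

if __name__ == "__main__":
    print(round(prob_overlook_toxicity(1.0 / 3.0, 6), 3))   # about 0.09
    print(smallest_n(0.33, 0.10))                            # 6
```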

Rule for completion of the sample size when stopping. Prior information on the MTD. the STER recommended more dose levels lower than the true MTD.45). and software has become available recently by Simon et al. the following design characteristics should be checked and deﬁned in the study protocol: 1. Sample size per dose level.11. If treatment with the experimental drug is without complication and if at least a status quo of the disease is retained. The assessment of toxicity can therefore be based on only few treatment cycles. On the other hand. The CRM and some modiﬁed CRMs needed on the average more steps than the STER. The STER provided no efﬁcient estimate of the MTD and did not stop at the speciﬁed percentile of the dose–response function (55). Escalation rule.D. From the same rationale by which the dose is decreased in the case of toxicity. Intraindividual dose escalation was proposed (6. Examples are found among the studies mentioned in Section II. It was recommended that pa- . A simulation study is recommended before the clinical start of the trial. 2. 7. but achievement of a partial remission of the tumor remains a realistic motivation. 5. and they showed the tendency of treating more patients at levels higher than the MTD. The study protocol usually makes provisions for individual dose reductions in case of non-DLT toxicity. Dose–Toxicity model. 4. Standard Requirements Before a phase I trial is initiated. Dose Titration Approach Patients are treated in a phase I trial in oncology mostly with the therapeutic aim of palliation and relief.56). IV. it should be allowed to increase in the case of nontoxicity and good tolerance.Phase I Trials 21 and the percentage of patients treated with very low levels (44. treatment normally continues.11) if a sufﬁcient time has elapsed after the last treatment course such that any existing or lately occurring toxicity could have been observed before an elevated dose is applied. Individual dose adjustment has been discussed repeatedly as a possible extension of the design of a phase I trial (6. B. Starting dose x1. the goal of cure would be unrealistic in most cases. 6. Dose levels D. 3. mostly two. 8. Stopping rule. (56). CONDUCT OF THE TRIAL A.12.

Two intraindividual strategies were considered: One uses no intraindividual dose escalation and only deescalation by one level per course in case of DLT or unacceptable toxicity. switch to the STER (second intraindividual strategy). unacceptable toxicity (grade 4). The goal was to deﬁne a dose satisfying both safety and efﬁcacy requirements. to stop early when it was likely that no such dose could be found. Y 1 if . conditionally acceptable toxicity (grade 2). switch to the STER with the second intraindividual strategy. If one DLT occurred in any cycle or if grade 2 toxicity occurred twice in any cycle. Therefore. Further consideration should be given to the development of tolerance in some patients and the risk of then treating new patients at too high levels (12). Toxicity and efﬁcacy outcomes were combined into a comprehensive ternary outcome variable Y with the values Y 0 if no effect and no toxicity. When planning intraindividual dose escalation. The number of patients should be large enough to estimate reliably both the toxicity rate and the response rate at the selected dose. Three new designs were formulated by using one of these two options given an action space D or DK: Speed-up Design: Escalate the dose after each patient by one level as long as no DLT or unacceptable toxicity occurs in the ﬁrst cycle and at most one patient shows grade 2 toxicity in the ﬁrst cycle. If a DLT occurs in the ﬁrst cycle or if grade 2 has occurred twice in the ﬁrst cycle. DLT (grade 3). all patients should be at least 3–4 weeks on their primarily scheduled dose before an intraindividual escalation is performed. Toxicity–Response Approach A design involving both dose ﬁnding and evaluation of safety and efﬁcacy in one early phase I/II trial was proposed by Thall and Russell (57). Accelerated Speed-up Design: Same as the Speed-up Design except that doubling dose escalation is used in the ﬁrst stage before switching to STER. undesired effect) are considered. the other uses escalation per course by one level as long as no DLT or unacceptable toxicity occurs and de-escalation as in the ﬁrst case. 56 by a dose titration design. one should weigh the advantage of dose increase and faster escalation in the patient population against the risks of cumulative toxicity in the individual patients. desired effect. Modiﬁed Accelerated Speed-up Design: Same as the Accelerated Speedup Design but no restrictions on the escalation with respect to the cycle.22 Edler tients escalated to a certain level are accompanied by ‘‘fresh’’ patients at that level to allow the assessment of cumulative toxic effects or accumulating tolerance. using again the second intraindividual strategy. The end point was deﬁned as a categorial variable with four levels: acceptable toxicity (grade 1). C. Two toxicity outcomes (absence or presence of toxicity) and three efﬁcacyrelated outcomes (no effect. and to continue if there was chance enough to ﬁnd one. The STER was modiﬁed as in Ref.

2 0. A comprehensive and transparent report of all toxicities observed in a phase I trial is an absolute must for both the producer’s (drug developer) and the consumer’s (patient) risks and beneﬁts.. ψ2(x)). Table 3 provides an example.50).. the assessment of the relation to treatment. Descriptive Statistical Analysis All results obtained in a phase I trial have to be reported in a descriptive statistical analysis that accounts for the dose levels. The evaluation of the response can usually be restricted to a case by case description of all those patients who exhibit a partial or complete response. The MTD is obtained through dose–toxicity modeling. e.the estimation of the MTD and the characterization of the DLT. 0. As dose–response model for (γ1. Evaluation of the DLT has to account for dose and time under treatment and should use pharmacokinetic modeling. 1. A description of the individual load of toxicity of each patient has been made separately using individual descriptions eventually supported by modern graphical methods of linking scatterplots for multivariate data. and two subtables that summarize for patients exhibiting toxicity the ADRs and DLTs deﬁned in section II. and no-toxicity probability is reasonably large. .g. The combined dose–response relationship is parameterized as γj (d ) P(Y j| Dose d ) for j 0. and in a two-dimensional dose–effect end point ψ(x) (ψ1(x). Multiplicity of DLTs in some patients can be so presented (see e.g. (59)).. 2. EVALUATION OF PHASE I DATA The statistical evaluation of the data resulting from a phase I trial has two primary objectives.33).g. It exhibits the complete toxicity observed. V. This results in three dose-dependent outcome probabilities ψj (x) P(Y j| Dose d ). γ2) the proportional odds regression model and the cumulative odds model are considered (58) and a strategy is developed to ﬁnd in DK that dose d* which satisﬁes both criteria. 2. Absolute and relative frequencies (related to the number of patients evaluable for safety) are reported for all toxicities of the CTC list by distinguishing the grading and the assessment of the relation to treatment. The goal is to ﬁnd a dose d such that the effect-andθ* (e. ψ1(d ) 1 that the undesired-effect-and-toxicity probability is limited ψ2(d ) θ* (e.Phase I Trials 23 desired effect and no toxicity. j 0. Besides these objectives.C. Benner et al. and Y 2 if undesired effect and toxicity occurred. Patients with a stable disease for a longer period or patients with a minor improvement not sufﬁcient for partial response may be described and emphasized also.g. phase I data have to be presented in full detail using transparent descriptive methods. 1. A. This is somehow cumbersome because each dose level has to be described as separate stratum.
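A summary of the kind shown in Table 3 below is easy to assemble from a per-patient toxicity listing. The following sketch uses invented example records (patient, CTC item, worst grade, assessed relation to treatment) to build the absolute and relative frequency table by grade, together with simple per-patient ADR and DLT counts. The column names, the data, and the choice of pandas are illustrative assumptions only.

```python
import pandas as pd

# hypothetical per-patient worst-grade records for one dose level (n = 6 patients)
records = pd.DataFrame({
    "patient":  [1, 2, 3, 4, 5, 6],
    "ctc_item": ["Vomiting"] * 6,
    "grade":    [0, 0, 1, 2, 3, 3],
    "related":  ["possible", "probable", "probable", "possible", "probable", "probable"],
})

n = records["patient"].nunique()

# absolute and relative frequencies by grade (one row per CTC item)
freq = (records.groupby(["ctc_item", "grade"]).size()
               .unstack(fill_value=0)
               .reindex(columns=range(5), fill_value=0))
rel = (100 * freq / n).round(0).astype(int)

summary = pd.concat({"n": freq, "%": rel}, axis=1)
summary["total_with_grade>=1"] = freq.loc[:, 1:].sum(axis=1)
print(summary)

# companion summaries: patients with any ADR (grade >= 1) and, as one possible
# working definition for the example, DLT taken as grade >= 3
per_patient = records.groupby("patient")["grade"].max()
print("patients with ADR:", int((per_patient >= 1).sum()), "of", n)
print("patients with DLT:", int((per_patient >= 3).sum()), "of", n)
```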

Table 3  Descriptive Evaluation of Phase I Toxicity Data

A. Absolute and Relative Frequencies of Toxicity

CTC item  | n (%)     | Grade 0 | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Total with grade 1–4
Vomiting  | 6 (100%)  | 2 (33)  | 1 (17)  | 1 (17)  | 2 (33)  | 0 (0)   | 4 (67)

B. Assessment of the Relation of the Toxicity to the Treatment by Case Type Listing (grades 1–4 classified as probable, possible, or unprobable; grade 0: —)

C. Summary Table of ADRs (patients exhibiting toxicity): no ADR 1, ADR 3

D. Summary Table of DLTs (patients exhibiting toxicity): no DLT 3, DLT 1

B. MTD Estimation

The estimation of the MTD has been part of most search designs from Section III, and an estimate of the MTD resulted often directly through the stopping criterion. In TER, the estimated MTD is by definition the dose level next lower to the unacceptable dose level at which a predefined proportion of patients (e.g., 33%) experienced DLT. An estimate of a standard error of the toxicity rate at the chosen dose level is impaired by the small number of cases (≤ 6) and also by the design.

A general method for analyzing dose–toxicity data is the logistic regression of the Yj on the actually applied doses x(j). The logistic regression takes all observed data (yj, x(j)), j = 1, . . . , n, of all patients treated in the trial without prejudice, assuming that they are independently sampled from the patient population and that the toxic responses per dose are identically distributed. This disregards any dependency of the dose–toxicity data on the design that had created the data and may therefore be biased. If the sampling is forced to choose doses below the true MTD, the estimated MTD may be biased toward lower values. The logistic model for quantal response is given by

P(Y_j = 1 \mid x(j)) = \{1 + \exp[-(a_0 + a_1 x(j))]\}^{-1}    (23)
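As an illustration of Eq. (23), the sketch below fits the logistic model to a small invented set of (dose, toxicity) pairs by maximum likelihood and derives the estimate of MTDθ as the θ percentile of the fitted curve, with a delta-method standard error of the kind given in Eqs. (24)–(25) below. The data, the target θ, and the use of statsmodels are illustrative choices only.

```python
import numpy as np
import statsmodels.api as sm

# invented example data: actually applied doses x(j) and DLT indicators y_j
dose = np.array([10, 10, 10, 20, 20, 20, 30, 30, 30, 40, 40, 40, 50, 50, 50])
tox  = np.array([ 0,  0,  0,  0,  0,  1,  0,  1,  0,  1,  1,  0,  1,  1,  1])
theta = 0.33                                   # targeted DLT rate

# maximum likelihood fit of Eq. (23): P(Y=1|x) = 1 / (1 + exp(-(a0 + a1*x)))
X = sm.add_constant(dose)
fit = sm.Logit(tox, X).fit(disp=0)
a0, a1 = np.asarray(fit.params)
V = np.asarray(fit.cov_params())               # asymptotic covariance of (a0, a1)

# MTD_theta as the theta percentile of the fitted tolerance distribution (Eq. 24)
mtd = (np.log(theta / (1 - theta)) - a0) / a1

# delta-method variance (Eq. 25): gradient of MTD_theta with respect to (a0, a1)
grad = np.array([-1.0 / a1, -mtd / a1])
se = float(np.sqrt(grad @ V @ grad))

print(f"a0 = {a0:.3f}, a1 = {a1:.3f}")
print(f"estimated MTD at theta = {theta}: {mtd:.1f}, approximate SE {se:.1f}")
```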

Standard logistic regression provides the maximum likelihood estimate (â0, â1) (22,60,61). MTDθ is then estimated as the θ percentile of the fitted tolerance distribution,

\widehat{MTD}_\theta = \frac{\mathrm{logit}(\theta) - \hat a_0}{\hat a_1} = \frac{\ln[\theta/(1-\theta)] - \hat a_0}{\hat a_1}    (24)

The large-sample variance of the estimate is given by

V(\widehat{MTD}_\theta) = \frac{1}{\hat a_1^2}\left[ V_{a_0} + 2\,\widehat{MTD}_\theta\, V_{a_0 a_1} + \widehat{MTD}_\theta^2\, V_{a_1} \right]    (25)

where (V_{a_0}, V_{a_0 a_1}, V_{a_1}) denotes the asymptotic variance–covariance matrix of the model parameter vector. Confidence limits can be obtained by the delta method, Fieller's theorem, or the likelihood ratio test (37). More generally, for the dose–toxicity model

\psi(x, a) = F(a_0 + a_1 x)    (26)

the estimate takes the form

\widehat{MTD}_\theta = \frac{F^{-1}(\theta) - \hat a_0}{\hat a_1}    (27)

C. Pharmacokinetic Phase I Data Analysis

An often neglected but important secondary objective of a phase I trial is the assessment of the distribution and elimination of the drug in the body. Drug concentration measurements ci(tr) of patient i at time tr are usually obtained from blood samples (additionally also from urine samples) taken regularly during medication and are analyzed using pharmacokinetic models (62). Specific parameters that describe the pharmacokinetics of the drug are the absorption and the elimination rate, the drug half-life, the peak concentration, and the area under the time–drug concentration curve (AUC). One- and two-compartment models have been used to estimate the pharmacokinetic characteristics, often in a two-step approach: first for the individual kinetics of each patient and then for the patient population using population kinetic models. In practice, statistical methodology for pharmacokinetic data analysis is primarily based on nonlinear curve fitting using least-squares methods or their extensions (63).

Subsequent to the criticism of the traditional methods for requiring too large a number of steps before reaching the MTD, the use of pharmacokinetic information to reduce this number of steps was suggested, and a pharmacokinetically guided dose escalation (PGDE) was proposed (24,64,65), based on the equivalence of drug blood levels in mouse and humans and on the pharmacodynamic hypothesis that equal toxicity is caused by equal drug plasma levels. It postulates that the DLT is determined by plasma drug concentrations and that AUC is a measure that holds across species (64).
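The individual-patient step of such a pharmacokinetic analysis can be as simple as fitting a one-compartment elimination curve to the measured concentrations by nonlinear least squares and integrating it to obtain the AUC. The sketch below does this for invented concentration-time data; the model C(t) = C0·exp(−k·t), the sampling times, and the measurements are purely illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

# invented plasma concentration measurements c_i(t_r) for one patient
t = np.array([0.5, 1, 2, 4, 8, 12, 24])             # hours after administration
c = np.array([9.1, 8.2, 6.6, 4.4, 1.9, 0.9, 0.1])   # concentration (mg/L)

# one-compartment model with first-order elimination: C(t) = C0 * exp(-k * t)
def one_compartment(t, c0, k):
    return c0 * np.exp(-k * t)

(c0, k), cov = curve_fit(one_compartment, t, c, p0=(10.0, 0.2))

half_life = np.log(2) / k       # elimination half-life implied by the fitted k
auc_model = c0 / k              # AUC from 0 to infinity under this model
auc_trapz = np.trapz(c, t)      # model-free AUC over the sampling interval
print(f"C0 = {c0:.2f} mg/L, k = {k:.3f} 1/h, half-life = {half_life:.1f} h")
print(f"AUC(0-inf, model) = {auc_model:.1f} mg*h/L; AUC(0-24 h, trapezoidal) = {auc_trapz:.1f}")
```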

Therefore. the PGDE has not often been used lateron in practice. .mouse) AUC(Starting Dose human) (28) was used to deﬁne a range of dose escalation. see Newell (67). Rules for the planning and conduct of phase I trials are speciﬁcally addressed by the European Agency for the Evaluation of Medicinal Products and its Note for Guidance on Evaluation of Anticancer Medicinal (18 March 1997) CPMP/EWP/205/25.e. i.. Recommended for Adoption at Step 4 on 17 July 1997 (16). The characterization of frequent ‘‘side effects’’ of the agent and their dose–response parameters. The determination of relevant main pharmacokinetic parameters. One tenth of MELD10 is usually taken as the starting dose x1. AUC(LD10. mouse) was considered as a target AUC. General Consideration for Clinical Trials. x2 √x1 Fx1 x1 √F. The determination of the MTD. and a ratio F AUC(LD10. Potential problems and pitfalls were discussed (66).26 Edler MTD for humans was found to be fairly equal to the AUC for mice if calculated at the LD10 (in mg/m2 equivalents. Then one continues with the MFDE. the ﬁrst steps are achieved by a doubling dose scheme as long as 40% of F has not been attained. the ﬁrst step from the starting dose x1 to the next dose x2 is equal to the geometric mean between x1 and the target dose x1 F. In the square root method. REGULATIONS BY GOOD CLINICAL PRACTICE AND INTERNATIONAL CONFERENCE ON HARMONIZATION (ICH) According to ICH Harmonized Tripartite Guideline. MELD10). Subsequent dose escalation continues with the MFDE. Primary objectives are 1. a phase I trial is most typically a study on human pharmacology ‘‘of the initial administration of an investigational new drug into humans. 2.’’ The deﬁnition of pharmacokinetics and pharmacodynamics is seen as a major issue and the study of activity or potential therapeutic beneﬁt as a preliminary and secondary objective. For a more recent discussion and appraisal. Two variants have been proposed.’’ It is considered as having nontherapeutic objectives ‘‘to determine the tolerability of the dose range expected to be needed for later clinical trials and to determine the nature of adverse reactions that can be expected. In the extended factors of two methods. Although usage and further evaluation was encouraged and guidelines for its conduct were proposed. 3. VI.
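To illustrate the two PGDE escalation variants just described, the sketch below generates dose sequences for a hypothetical starting dose and target ratio F: the square-root method takes the first step to the geometric mean of the starting dose and the target dose x1·F, while the extended factors-of-two method doubles the dose until 40% of the target has been attained. Thereafter both continue with a modified Fibonacci (MFDE) scheme, represented here by an arbitrary constant factor; all numerical values are invented for illustration.

```python
import numpy as np

x1 = 10.0            # starting dose (e.g., one tenth of MELD10), illustrative value
F = 40.0             # AUC(LD10, mouse) / AUC(starting dose, human), illustrative value
target = x1 * F      # dose corresponding to the target AUC ratio
mfde_factor = 1.33   # stand-in for the modified Fibonacci increments (illustrative)

def escalate(first_steps, target, factor):
    """Continue a partial escalation sequence with constant-factor (MFDE-like) steps."""
    doses = list(first_steps)
    while doses[-1] * factor < target:
        doses.append(doses[-1] * factor)
    return np.round(doses, 1)

# square-root method: first step to the geometric mean of x1 and x1*F, i.e. x1*sqrt(F)
sqrt_start = [x1, x1 * np.sqrt(F)]
print("square-root method:", escalate(sqrt_start, target, mfde_factor))

# extended factors-of-two method: double the dose until 40% of the target is attained
doses = [x1]
while doses[-1] < 0.4 * target:
    doses.append(doses[-1] * 2)
print("factors-of-two method:", escalate(doses, target, mfde_factor))
```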

Dose escalation is oriented at the MFDE or the PGDE scheme. Dose-Response Information to Support Drug Registration Recommended for Adoption Step 4 on 10 March 1994 that cover to some extent studies with patients suffering from life-threatening diseases such as cancer. a Phase I Trial Protocol may be organized as follows: Objectives and Preclinical Background Clinical Background Eligibility and In/Exclusion of Patients Treatment Dose-Limiting Toxicity and MTD Dose Levels and Dose Escalation Design Number of Patients per Dose Level End Points and Longitudinal Observations Toxicity. A minimum of two cycles at the same dose level is preferred. DISCUSSION AND OPEN PROBLEMS Phase I trials are by their primary goal dose-ﬁnding studies. Response Termination of the Study References Appendices on Case Report Forms Important Deﬁnitions Common Toxicity Criteria Informed Consent Form(s) Declaration of Helsinki VII. General Guidelines for obtaining and use of dose–response information for drug registration has been given in ICH Tripartite Guideline.7). The standard paradigm has been as follows: Start at a low dose that is likely to be safe and treat small cohorts of patients at progressively higher doses until drug-related toxicity reaches some predeﬁned level of maximum toxicity or until unexpected and unacceptable toxicity occurs. The starting dose is not ﬁxed. The inherent ethical problem that a treatment is given for which risks and beneﬁts are unknown and that presumably is applied at a suboptimal or inactive dose has to be accounted for by a design that treats each patient when entering the trial at the maximal dose known to be safe at this time.Phase I Trials 27 Single intravenous dosing every 3–4 weeks is recommended if nothing else is suggested. The number of patients per dose level varies around n 3 with an increase to six in case of overt toxicity. but they are performed under the twofold paradigm of the existence of monotone dose–toxicity and dose–beneﬁt relationships. Phase I trial designs have arisen . Based on these regulations and extending earlier suggestions (6.

The rather opaque appearance of the Fibonacci scheme in phase I research seems to be symptomatic of this situation. and the dose escalation scheme. Further complexity arises by population heterogeneity.g. such that they may have a therapeutic beneﬁt from the experimental drug. This dilemma has motivated a large number of modiﬁcations both of the S/TER and the CRM (lowering the starting dose and restricting the dose escalation to one step to ﬁnd a way between the conservatism and the anticonservatism). the pure Bayesian and the CRM rules are lacking because their optimality is connected with treating one patient after the other. The percentage of patients entered into phase I trials who beneﬁted from that treatment has rarely . subjectivity in judging toxicity.. and are strictly constraint by the ethical requirement in choosing doses conservatively. This has restricted the Bayesian dynamic of the CRM considerably.’’ Therefore. which is dose ﬁnding. but there seems to be an advantage using a modiﬁed CRM. there is evidence from simulations and from clinical experience that the standard designs are too conservative in terms of underdosing and needless prolongation. and given that the new experimental drug is ﬁnally proved to be efﬁcient (e. see Ref. see 44). 68. A driving force for the search of new designs and one of the most serious objections against the standard design was the argument that as many patients as possible should be treated at high dose levels best near the true MTD. All previous attempts to validate a design and to compare it with others have to use simulations because a trial can be performed only once with one escalation scheme and it cannot be ‘‘reperformed. packed with liposomes). including the choice of the starting dose. which may prolong a trial even more. One has to be reminded that efﬁcacy is not in accordance with the primary aim of a phase I study. A comprehensive comparison of all the methods proposed is despite a number of simulation results—mentioned in Section III. and that they run the risk of treating too many patients at toxic levels (for further criticism. it seems empirically obvious yet is a post-hoc argument. Dose escalation was presented in broad detail ranging from the more pragmatic standard rules (S/TER) to partial and full Bayesian methods. For the use of drug combinations.28 Edler rather empirically without strong statistical foundation and limited estimating precision (13).5—not available at present. take longer for observing the end point. all comparisons can only be as valid as the setup of the simulation study is able to cover clinical realty and complexity. This argument is absolutely correct. standard approaches as unrestricted UaD and SAM designs are not applicable. At the same time. Different from dose ﬁnding in bioassays. Therefore. Nevertheless. This challenges the use of prior information.g. phase I trials are smaller in sample sizes. the taxanes for breast cancer). The two main constituents of a phase I trial are the action space of the dose levels. and censoring because of early drop-out.. Phase I trials are not completely restricted to new drugs but may also be conducted for new schedules or for new formulations (e.B.

Nevertheless. Laboratory toxicity is.Phase I Trials 29 been estimated in serious studies. The deﬁnition of the toxicity criteria and their assessment rules are mostly beyond the statistical methodology but are perhaps more crucial than any other means during the conduct of trial. however. however. The MTD estimation above was restricted to a qualitative and at most categorical outcome measure. Interestingly. repeated phase I trials use variations of the schedule and administration. some restricted type of randomization may be feasible and should be considered. available on a quantitative scale. Those concerns are nourished by a seemingly missing possibility of randomization. the very early phase I trial of DeVita et al. Given multicentricity and enough patients per center. and the impact of treating at high doses should not be overemphasized. This poses ethical concerns. Statistical estimation of the MTD has been restricted above to basic procedures leaving aside the question of estimation bias when neglecting the design. it is the ethical concern to do as best as possible even in a situation with a very low chance. where y was a quantitative toxicity of myelosuppression (nadir of white blood cell count) and z was a covariate (pretreatment white blood cell . for example. randomly choosing the center for the ﬁrst patient for the next dose level. and this information could be used for the estimation of an MTD (70). drug development programs implement usually more than one phase I trial for the same drug. (41) noted that they can estimate the MTD using a different prior than used in the design and refer to work of others who suggested using a Bayesian scheme for design and a maximum likelihood estimate of the MTD.1%) were recorded among 610 patients in the 3-year review (69) of 23 trials of the MD Anderson Cancer Center from 1991 to 1993. for example. Mick and Ratain (71) proposed therefore ln( y) a0 a1x a2 ln(z) as a dosetoxicity model. but as long as there is no direct information exchange among those trials with respect to occurrence of toxicity. Therefore. Phase I trials are weak in terms of a generalization to a larger patient population because only a small number of selected patients is treated under some special circumstances by specialists. It has to be taken care that no changes in the assessment of toxicity occur progressively during the trial. Further improvement was put into the perspective of increasing the sample size of phase I trials and increasing so the information (13). Nineteen responses (3. (32) from 1965 used a randomized approach. Rough estimates give response rates—mostly partial response only—in the range of a few percent. Mostly. about the premises of the phase I being the trial under which the ﬁrst-time patients are treated. it would be unrealistic to expect a therapeutic beneﬁt even at higher dose levels for most patients. ethical concerns remain. Further concern arises if patients are selected for inclusion into the trial (12). Therefore. The repeat of phase I trials has not found much consideration in the past. Babb et al. by including at higher doses less pretreated cases because of the fear of serious toxicities.
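A pharmacodynamic model of the Mick and Ratain type, ln(y) = a0 + a1 x + a2 ln(z), can be fit by ordinary least squares once nadir counts, doses, and pretreatment counts are available. The sketch below does so for simulated data; the coefficients used to generate the data, the sample size, and the target nadir are invented for illustration and carry no clinical meaning.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
dose = rng.choice([100, 200, 300, 400, 500], size=n)      # administered dose (mg/m^2)
z = rng.lognormal(mean=np.log(6.0), sigma=0.25, size=n)   # pretreatment WBC (10^9/L)

# simulate nadir WBC from ln(y) = a0 + a1*dose + a2*ln(z) + error (invented coefficients)
a0, a1, a2, sigma = 1.0, -0.004, 0.8, 0.3
lny = a0 + a1 * dose + a2 * np.log(z) + rng.normal(0, sigma, size=n)

# ordinary least squares fit of the Mick-Ratain style regression
X = np.column_stack([np.ones(n), dose, np.log(z)])
coef, *_ = np.linalg.lstsq(X, lny, rcond=None)
print("estimated (a0, a1, a2):", np.round(coef, 4))

# dose predicted to give a chosen nadir y* for a patient with pretreatment count z*
y_star, z_star = 2.0, 6.0   # illustrative target nadir and covariate value
dose_star = (np.log(y_star) - coef[0] - coef[2] * np.log(z_star)) / coef[1]
print("dose giving the target nadir at z* = 6:", round(float(dose_star), 1))
```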

This would demand an isotonic relationship such that any grade 3 toxicity is preceded by a grade 2 toxicity and P(DLT2 | x) P(DLT3 |x) for all doses x. up to six dose escalations and three stages of recruiting one. how can the cumulative increase of information on the dose level x[k] be used to interfere with the dosing of the current patients or to decide on the dosing of the next cohort? Except for the dose-titration design in Section IV. three.-D. three) patients scheduled for one dose level x[k] are treated more or less in parallel and that the toxicity results become available at once when that kth stage has been ﬁnished. But it was not uniformly better.B. Further research and practical application is needed to corroborate the ﬁndings. A silent assumption in this article was that the cohort of nk (e.4. although it may have inﬂuenced recent research in titration designs. For technical assistance I thank Gudrun Friedrich for analyses and . or six patients at each dose level were planned.. Depending on severity of toxicity. If so. ACKNOWLEDGMENTS This work could not have been done without the long-standing extremely fruitful cooperation with the Phase I/II Study Group of the AIO in the German Cancer Society and the work done there with Wolfgang Queißer. This situation may be rather rare in practice. The authors also admitted that a multigrade design is harder to comprehend and use. and Heiner Fiebig. and to my knowledge that design has never been used in practice. E. for example. To comply with clinical practice. no formal methods have been developed to my knowledge that would acount for such an overlap of information. Kreuser. That would allow. such a design should be ﬂexible enough to allow both staggered entry distributed over 1–2 months and parallel entry in a few weeks.30 Edler count). and staggered information by treatment course and patient may be the rule. Extensions of the use of the toxicity grades 1–4 were addressed shortly in Section III.B. I also owe the support and statistical input from the Biometry Departments of Martin Schumacher (Freiburg) and Michael Schemper (Vienna).g. and my next door colleague Annette KoppSchneider. One may be tempted to deﬁne DLT speciﬁcally for different toxicity grades with the aim of a grade-speciﬁc MTD also using grade-speciﬁc acceptable tolerability θ. They showed that the multigrade design was superior to the standard design and that it could compete successfully with the two-stage designs. A multigrade dose escalation scheme was proposed (72) that allows escalation and reduction of dosage using knowledge of the grade of toxicity. tolerability θ2 for grade 2 toxicity higher than the tolerability θ3 for grade 3 toxicity. Harald Heinzl (Vienna). Bayesian approaches may be most promising to deal with this difﬁcult problem. Axel Hanauske. The multiplicity of courses and even more the multiplicity of toxicities assessed with the CTC scheme needs further research.

Sylvester RJ. Declaration of Helsinki (http:/ /www. Wittes RE. REFERENCES 1. 86:1662–1663. In: Proc. 27:1162–1168. In: Staquet MJ. Von Hoff DD. Schwartsmann G. Cancer Clinical Trials: Methods and Practice. 2. J Nat Cancer Inst 1994. Staquet MJ. ed. Bakowski MT. Rozencweig M. 11. Hellmann K. Eur J Cancer Clin Oncol 1985. Stat Med 1991. Carter SK. . 12. Ann Oncol 1994. et al. World Medical Association. Pinkel D. 10. Clark GM. 3rd ed. eds. Eur J Cancer 1991. Experimental bases for drug selection. Spreaﬁco F. 13. Cancer Clinical Trials. J Nat Cancer Inst 1993. 85:1637–1643. Brussels: Editions Scient Europ 1973:242–289. Carter SK. New York: John Wiley. Study design principles for the clinical evaluation of new drugs as developed by the chemotherapy programme of the National Cancer Institute. The use of body surface area as a criterion of drug dosage in cancer chemotherapy. The Design of Clinical Trials in Cancer Therapy. Leventhal BG. Wanders J. Mick R. 1987:153–174. Schneiderman MA. Dordrecht: Nijhoff The Netherlands. 8. 6.Phase I Trials 31 Regina Grunert and Renate Rausch for typing and the bibliography. 4th ed. Ratain MJ.aix-scientiﬁcs. eds. eds. 1988:41–59. Lelieveld P. Edelstein MB. 1987:29–31. 5th Berkeley Symp Math Statist Prob. Bakowski MT. methods and evaluation. 5. Research Methods in Clinical Oncology. eds. Kerr DJ. Methods and Practice. In: Leventhal BG. A decade of progress in statistical methodology for clinical trials. EORTC New Drug Development Ofﬁce Coordinating and Monitoring Programme for Phase I and II Trials with new anticancer agents. Sylvester RJ. Kuhn J. Oxford: Oxford University Press. 4. Siegler M. eds. I thank John Crowley for all the help and encouragement. EORTC guidelines for phase I trials with single agents in adults. 1984:193–209. In: Buyse ME. 9. The limited precision of phase I trials. In: Muggia FM. Oxford: Oxford University Press. Clinical Evaluation of Antitumor Therapy. Koier IJ. 7. Clinical trials in cancer chemotherapy. The phase I study: general objectives. Statistical and ethical issues in the design and conduct of phase I and II clinical trials of new anticancer Agents. Mouse to man: statistical problems in bringing a drug to clinical trial. Legha SS. 14. 10:1789–1817. Chemotherapy of Cancer. Schilsky RL. In: Buyse ME. Finally. Berkeley: University of California Press. Hellmann K. Design and conduct of phase I trials. Staquet MJ. Phase I clinical trials: adapting methodology to face new challenges. Korn EL. Phase I trials. 1967:855–866. 1984:210–220. 15. Cancer Res 1958. 21:1005–1007. Bodey GP. In: Carter SK. 5:S67–S70. EORTC New Drug Development Committee. New York: Raven Press. Wittes RE. com/). 18:853–856. 3. Christian MC. Simon RM.

Biometrics 1990. Repeated dosage and impulse control. 31:223–227. 12:1–5. Gold GL. 85:1138–1148. 19. Bellman RE. Physician-determined patient risk of toxic effects: impact on enrollment and decision making in phase I trials. Chabner BA. Morgan BJT. 35. Animal toxicology for early clinical trials with anticancer agents. 28. . DeVita VT. Daugherty C. Collins JM. III. The Design of Clinical Trials in Cancer Therapy. WHO Handbook for Reporting Results of Cancer Treatment. 24. 48. II.32 Edler 16. Bellman RE. ed. Krant MJ. Carter SK. J Nat Cancer Inst 1994. Continual reassessment method: a practical design for phase I clinical trials in cancer. Scheulen ME. Cancer Res 1975. Owens AH. Lane N. NSC-122819). 29. 1979. Dubbelman AC. 5:113–117. Dynamic Programming. et al. Zee B. Potential roles for preclinical pharmacology in phase I clinical trials. 86:1685–1693. 25. Cancer Treat Rep 1965. 32. Hansen HH. Goldin A. NSC-79037). Phase I study of 4′demethyl-epipodophyllotoxin-β d-thenylidene glycoside (PTG. Carter S. Toxicity grading systems. Kreuser ED. 25:1876–1881. 23. 30. In: Staquet MJ. Schein PS. Assessing the reliability of two toxicity scales: implications for interpreting toxicity data. Mick R. Walker MD. Clinical trials in cancer chemotherapy. Cancer 1977. World Health Organization (WHO). Staquet MJ. 34. 22. J Nat Cancer Inst 1993. A comparison between the WHO scoring system and the common toxicity criteria when used for nausea and vomiting. Hansen HH. 1973:58–81. Dedrick RL. NSC 409962. Clinical Trials with 1. 11:58. Proc Am Assoc Cancer Res 1970. Clinical studies with 1-(2-chloroethyl)-3-cyclohexyl-1-nitrosourea (NSC79037). Cancer Res 1971. Standard operating procedures and organization of German Phase I. Rozencweig M. Von Hoff DD. 27. 21(suppl 3). 20. Princeton: Princeton University Press. Fiebig HH. Math Biosci 1971. London: Chapman & Hall. 1962. Cancer Treat Rep 1986. 11:87. 70:73–80. 46:33–48. Carter SK.3-bis(2-chlorethyl)-1-nitrosourea. Cancer Chemother Pharmacol 1979. 1–22. WHO Offset Publication No. 40:544–557. 17. Topics in pharmacokinetics. Cancer Clin Trials 1981. Rozencweig M. 35:1354– 1364. Homan ER. Pater JL. Brussels: Editions Scient Europ. Selawry OS.nih.info. Carbone PP. et al. Fisher L.gov/ctc3/ctc. Slavik M. Penta JS. 18. Muggia FM. International Conference on Harmonisation: Guideline for Good Clinical Practice (http:/ /ctep. Onkol 1998. 26. Brundage MD. 31. Zaharko DS. Muggia FM. and Study Group of Pharmacology in Oncology and Hematology (APOH) of the Association for Medical Oncology (AIO) of the German Cancer Society. Pepe M. 21. Epmonson J. Quantitative comparison of toxicity in animals and man. Mouse and large-animal toxicology studies of twelve antitumor agents: relevance to starting dose for phase I clinical trials. Guarino AM.htm). Geneva: WHO. and III Study Groups. Franklin HR. Analysis of Quantal Response Data. Simonetti GPC. 33. Proc Am Assoc Cancer Res 1970. O’Quigley J. Ratain MJ. 4:21–28. Muggia FM. New Drug Development Group (AWO). 3:97–101. Clinical experience with 1-(2-chloroethyl) 3-cyclohexyl-1-nitrosourea (CCNU. Quantitative prediction of drug toxicity in humans from toxicology in small and large animals. Goldsmith MA. et al. Ann Oncol 1994. 1992.

44. 37. In American Statist. 29: 400–407. Phase I clinical trial of a combination of dipyridamole and acivicin based upon inhibition of nucleoside salvage. Cancer Invest 1984. 51. Tutsch K. Chen TT. 14:911–922. Freidlin B. 14:885–893. Edler L. 4:147–164. Flournoy N. 42. Earhardt RH. Ahn C. A comparison of two phase I trial designs. 45:925– 937. Biometrika 1996. Bruggink J. Bayesian methods for phase I clinical trials. Stat Med 1998. Christian MC. Chevret S. In: Kitsos CP. Storer BE. New clinical trial designs for phase I studies in hematology and oncology: principles and practice of the continual reassessment model. Piantadosi S. Simon K. Hanauske AR. Zacks S. Accel- . 54. 39. Simon RM. 53:745–760. Brunier H. Shen LZ. Heidelberg: Physica. Stat Med 1991. Edler L. Cancer phase I clinical trials: efﬁcient dose escalation with overdose control. O’Quigley J. 2:483–491. Whitehead J. Zahurak ML. Trump DL. Moller S. Mood AM. 47. Rubinstein LV. 49. 43. Babb J. 17:1103–1120. 48. 10:1647–1664. in order to investigate a greater range of doses. Stat Med 1994. Stat Med 1995. Hamilton RD. Korn EL. J Biopharm Stat 1994. Alberti D. Stat Med 1998. Dixon WJ. Rademaker AW.Phase I Trials 33 36. Rosenberger WF. Methods for dose ﬁnding studies in cancer clinical trials: a review and results of a Monte Carlo study. 17: 1537–1549. Biometr 1997. Faries D. Sample sizes in phase I toxicity studies. 41. 137–141. Stat Med 1995. Contributions to Statistics. 1997:221–232. 50. Geller NL. Stat Med 1992. Onkol 1996. 11:1377–1389. Tormey DC. (Alexandria. Stat Med 1995. Simon RM. Monro S. Christian MC. 19:404–409. Collins J. Some practical improvements in the continual reassessment method for phase I studies. Durham SD. 55. Goodman SN. Cancer Res 48:5585–5590. Biometr 1993. An evaluation of phase I cancer clinical trial designs. Fisher PH. 49:1117–1125. Greenhouse JB. 43:109–126. Koeller JM. 53. 52. Ranhosky A. 46. J Am Statist Assoc 1948. Rogatko A. 13:1799–1806. A random walk rule for phase I clinical trials. Robbins H. Design of phase I and II clinical trials in cancer: a statistician’s view. A stochastic approximation method. Practical modiﬁcations of the continual reassessment method for phase I clinical trials. Modeling and computation in pharmaceutical statistics when analysing drug safety. Willson JKV. 38. 40. VA) ASA Proc Biopharm Sect 1989. eds. A method for obtaining and analyzing sensivity data. Arbuck SG. Ann Math Stat 1951. Small-sample conﬁdence sets for the MTD in a phase I clinical trial. An extension of the continual reassessment methods using a preliminary up-and-down design in a dose ﬁnding study in cancer patients. 14:1149–1161. O’Quigley J. Assoc. Rubinstein LV. 56. 83:395–405. Midthune D. Storer BE. Biometr 1989. Gatsonis C. Edler L. Consistency of continual reassessment method under model misspeciﬁcation. Design and analysis of phase I clinical trials. Bayesian decision procedures for dose determining experiments. 45.

Design and results of phase I cancer clinical trials: three-year experience at M. 68. Grifﬁn. Russell KE. Simon RM. 23:1083–1087. 42:109– 112. Benner A. Lee JJ. J Nat Cancer Inst 1990. Eur J Cancer Clin Oncol 1987. Gordon NH. In: Millard S. Computational statistics for pharmacokinetic data analysis. Grieshaber CK. Krause A. 11:794–801. 2000. Proceedings in Computational Statistics. J Clin Oncol 1993. Collins JM. Kantarjian HM. 14:287–295. Model-guided determination of maximum tolerated dose in phase I clinical trials: evidence for increased precision. 89:1138–1147. 58. Edler L. Hartung G. 63. 82:1321– 1326. 60. Legha SS. Pharmacokinetically guided dose escalation in phase I clinical trials. 70. Heidelberg: Physica. Thall PF. EORTC Pharmacokinetics and Metabolism Group. Green P. Pharmacokinetics. Perrier D. Commentary and proposed guidelines. 72. 69. Statistical Method in Biological Assay. 1978. J Nat Cancer Inst 1993. Newell DR. A strategy for dose-ﬁnding and safety monitoring based on efﬁcacy and adverse outcomes in phase I/II clinical trials. 54:251– 264. Edler L. In: Payne R. J Clin Oncol 1996. J R Stat Soc B 1980. 11:2063–2075. Mick R. 80:790– 792. 1998:281–286. New York: Springer. COMPSTAT. 82:446–447.34 Edler erated titration designs for phase I clinical trials in oncology. J Nat Cancer Inst 1990. Finney DJ. New Drug Ther 1994. J Nat Cancer Inst 1988. The method of probits. Gibaldi M. Bliss CI. 57. 66. Regression models for ordinal data. Using the tolerable-dose diagram in the design of phase I combination chemotherapy trials. 65. 8:257–275. Pharmacology and drug development. Ratain MJ. 71. New York: Marcel Dekker. 61. Stat Med 1992. 67. Using toxicity grades in the design and analysis of cancer phase I clinical trials. 1982. Pharmacologically guided phase I clinical trials based upon preclinical drug development. . McCullagh P. Biometr 1998. Anderson Cancer Center. Science 1934. SPLUS support for analysis and design of phase I trials in clinical oncology. SPLUS in Pharmaceutical Industry. Phase I trials: a strategy of ongoing reﬁnement. Korn EL. Willson JKV. Egorin MJ. 59. Chabner BA. 64. Smith TL. J Nat Cancer Inst 1997. eds.D. 79:409–410. Raber MN. Pharmacologically based phase I trials in cancer chemotherapy. eds. London: C. 85: 217–223. 62. Collins JM.

La Jolla. speciﬁcally the use of graded information on toxicities. The absence of such deﬁnitions and the lack of clinically motivated exigencies have led to the use of a number of schemes. having properties 35 . has greatly gained in popularity in recent years. the properties of the method. although it seems clear that the CRM provides a structure around which such further developments can be carried out. as a tool for carrying out phase I clinical trials in cancer. and the possibility for substantial generalization. some important technical considerations. and the possibility of modeling within patient dose escalation. California I.2 Dose-Finding Designs Using Continual Reassessment Method John O’Quigley University of California at San Diego. A. Here I describe the basic ideas behind the method. in particular the up and down scheme. the incorporation of information on patient heterogeneity. CONTINUAL REASSESSMENT METHOD The continual reassessment method (CRM). recalled in a broad review by Storer (28). the incorporation of a stopping rule leading to further reductions in sample size. At the time of writing. few of these generalizations have been fully studied in any depth. Motivation The precise goals of a phase I dose-ﬁnding study in cancer have not always been clearly deﬁned. the incorporation of pharmacokinetics.

The fourth requirement is not an independent requirement and can be viewed as a partial re-expression of requirements 1 and 2. a concern for all types of clinical studies.e. let us ﬁrst look at the requirements themselves in the context of cancer dose-ﬁnding studies. Given that candidates for these trials have no other options concerning treatment.36 O’Quigley that can be considered undesirable in certain applications. There will always be hope in the therapeutic potential of the new experimental treatment. the correct level being deﬁned as the one having . This is because of the understandable desire to proceed quickly with a potentially promising treatment to the phase II stage. 3. becomes of paramount importance here where very small sample sizes are inevitable. We have to do the very best we can with the relatively few patients available. We should minimize the number of patients needed to complete the study (efﬁciency). patients treated at unacceptably low dose levels. will be accompanied by too high a probability of encountering unacceptable toxicity. We should minimize the number of undertreated patients. Most phase 1 cancer clinical trials are carried out on patients for whom all currently available therapies have failed. in which the design was constructed to respond to speciﬁc requirements of the phase I clinical investigation in cancer. Given this context. although offering in general better hope for treatment effect. Taken together. the CRM (15). i. the requirements point toward a method where we converge quickly to the correct level. rapidly escalating in the absence of indication of drug activity (toxicity) and rapidly de-escalating in the presence of unacceptably high levels of observed toxicity. may offer too little chance of seeing any beneﬁt at all. their inclusion appears contingent on maintaining some acceptable degree of control over the toxic side effects and trying to maximize treatment efﬁcacy (which translates as dose). translating the uncertainty of our ﬁnal recommendations based on such small samples. The third requirement. The method should respond quickly to inevitable errors in initial guesses. This consideration underscored the development of a different approach to such studies. We should minimize the number of patients treated at unacceptably high dose levels. Too low a dose. although avoiding this risk. requirements 1 and 2 appear immediate. 2. These requirements are the following: 1. but such hope is invariably tempered by the almost inevitable life-threatening toxicity accompanying the treatment. Before describing just how the CRM meets these requirements. and the statistician involved in such studies should also provide some ideas as to the error of our estimates. Too high a dose. 4. At the phase II stage the probability of observing treatment efﬁcacy is almost certainly higher than that for the phase I population of patients.

Figure 1  Typical trial histories. (Top) CRM; (bottom) standard design.

It generally does very well in addressing the more limited goal: to identify some single chosen percentile from this unknown curve. indirectly. The true unknown dose– toxicity curve is not well estimated overall by the underparameterized dose– toxicity curve taken from the CRM class. The cycle is continued until a ﬁxed . is reﬁtted after each inclusion. whereas those lower than θ unacceptably low in that they indicate. Patients enter sequentially. The next patient is then treated at this level.38 O’Quigley Figure 2 Fit of the working model. the correct targeted level. and a small number of discrete dose levels rather than a continuum. but. CRM will not do well if the task is to estimate the overall dose–toxicity curve. The curve is then inverted to identify which of the available levels has an associated estimated probability as close as we can get to the targeted acceptable toxicity level. at the point of main interest. taken from the CRM class (described below). the likelihood of too weak an antitumor effect. belonging to a particular class of models. The value θ is chosen by the investigator such that he or she considers probabilities of toxicity higher than θ to be unacceptably high. Figure 2 provides some insight into how the method behaves after having included a sufﬁcient number of patients into the study. The working dose–toxicity curve. a probability of toxicity as close as possible to some value θ. the main differences being the use of a nonlinear underparameterized model. the two curves nearly coincide. Figure 1 illustrates the comparative behavior of CRM with a ﬁxed-sample up and down design in which level 7 is the correct level. How does CRM work? The essential idea is close to that of stochastic approximation.

VI. Nonetheless. practically. Whether Bayesian or likelihood based.29).e. Operating Characteristics The above paragraphs outline how CRM works. V). both in terms of accuracy of ﬁnal recommendation and in terms of concentrating as large a percentage as possible of studied patients close to the target level. if any scheme fails to meet such basic statistical criteria as large sample convergence.. In the model and examples of O’Quigley et al. the likelihood in nonmonotone). (15). once the scheme is under way (i. large sample properties themselves will not be wholly convincing because. the theory of Markov chains enables us to carry out exact probabilistic calculations (23. B.2. For nonmonotone likelihood it is impossible to be at some level. a particularly simple model and how it worked when used to identify a target dose level having probability of toxicity as near as possible to 0.2. Furthermore. we need to investigate with great care its ﬁnite sample properties. the method skipped when de-escalating. assuming no further toxicities were seen. escalating to level 2. However. O’Quigley and Chevret (16). if the ﬁrst patient. Good- . it can be calculated and follows our intuition that a toxicity. Typical behavior is that shown in Fig. (15). we are inevitably faced with small to moderate sample sizes. observe a toxicity. As pointed out by Storer (30). and then for the model to recommend a higher level as claimed by some authors (see Sect. particularly early on where little information is available. Simulations in O’Quigley et al. when targeting lower percentiles such as 0. the absolute value of the change diminishing with the number of included patients. This translates directly into an operating characteristic whereby model-based escalation is relatively cautious and de-escalation more rapid. The technical details are provided below. Simulations were encouraging and showed striking improvement over the standard design. recommending level 1 for the subsequent two entered patients before. The original article on the method by O’Quigley et al. dose levels could never be skipped when escalating. treated at level 3. then it is readily shown that a nontoxicity always points in the direction of higher levels and a toxicity in the direction of lower levels. suffered a toxic side effect. unless pushed in such a direction by a strong prior. will have a much greater impact on the likelihood or posterior density. thereby minimizing the number of overtreated and undertreated patients. (15) considered.B). the level to which a CRM design converged will indeed be the closest to the target. The tool to use here is mostly that of simulation. as an illustration. After asking how CRM works it is natural to ask how CRM behaves. A large sample study (25) showed that under some broad conditions. 1. although for the standard up and down schemes.Continual Reassessment Method 39 number of subjects have been treated or until we apply some stopping rule (see Sect. occurring with a frequency a factor of 4 less than that for the nontoxicities.

already favorable to CRM. situation 6 in Table 2) used a model.40 O’Quigley man et al (9). II. failing the conditions outlined in Section 11. and their results. These doses are not necessarily ordered in terms of the d i themselves. described in the following section. being a combination of different treatments. However. are ordered whereby the probability of toxicity at level i is greater than that at level i′ whenever i i ′. The problem of dose spacing for single drug combinations. k). TECHNICAL ASPECTS The aim of CRM is to locate the most appropriate dose. all the dose information required to run a CRM trial is contained in the dose levels. but as far as extrapolation or interpolation of dose is concerned. . the precise deﬁnition of which is provided below. at present we lose no information when we replace d i by i. (11) worked with this same model. the relevant insights will come from pharmacokinetics. We assume monotonicity and we take monotonicity to mean that the dose levels. in particular since each d i may be a vector. . Both Faries (7) and Moller (13) assigned to early levels other than those indicated by the model. d 1. We return to this in Section VI. CRM can help with this afﬁrmation. and O’Quigley (1999) show the operating characteristics of CRM to be good. violation of the model requirements and allocation principle of CRM. possibly disastrous. equally well identiﬁed by their integer subscripts i (i 1. d k. Chevret (6. (11). (9) and Korn et al. often multidimensional. but rather in terms of the probability R(d i ) of encountering toxicity at each dose d i. For our purposes we assume that we have available k ﬁxed doses. Goodman et al. in terms of accuracy of ﬁnal recommendation. . would have been yet more favorable had a model not violating the basic requirements been used (19). that resulted in never recommending the correct level. a performance worse than we achieve by random guessing. . effect on operating characteristics. is beyond the scope of CRM. can have a negative. while simultaneously minimizing the numbers of overtreated and undertreated patients.A. The monotonicity requirement or the assumption that we can so order our available dose levels is thus important. often addressed via a modiﬁed Fibonacci design. This dose is taken from some given range of available doses. . The d i. It is important to be careful at this point since confusion can arise over the notation. . the so-called target dose. describe the actual doses or combinations of doses being used. The need to add doses may arise in practice when the toxicity frequency is deemed too low at one level but the next highest level is considered too toxic. Korn et al. in one case skipping nine levels after the inclusion of a single patient. leading to large skips in the dose allocation. Currently. . Without wishing to preclude the possibility of exploiting information contained in the doses d i and not in the dose levels i.

Choosing the dose levels amounts to selecting levels (treatment combinations) such that the lowest level hopefully has an associated toxic probability less than the target and the highest level possibly close or higher than the target. it is in principle possible to work with the actual dose. 0. This is all we need and we may prefer to write x j ∈ {1.2.4 (13). is that dose having an associated probability of toxicity as close as we can get to the target ‘‘acceptable’’ toxicity θ. A likelihood procedure will be unstable and may even break down. . under our monotonicity assumption. Values for the target toxicity level. We model R(x j ). if desired. We do not advise this since it removes. n). although there are studies in which this can be as high as 0. The dose for the jth entered patient.31) may work initially. Although a two-parameter model may appear more ﬂexible. . . We are close to something like nonidentiﬁability. This is true even when starting out at a low or the lowest level. the ‘‘target’’ dose. . initially working with an up and down design for early escalation. a).15. any design that ultimately concentrates all patients from a single group on some given level can ﬁt no more than a single parameter without running into problems of consistency. k}.25. the true probability of toxic response. whereas a two-parameter fully Bayesian approach (8.3. can be viewed as random taking values x j. a) for some one parameter model ψ(x j. X j. . before a CRM model is applied. In light of the remarks of the previous two paragraphs we can. k} via R(x j ) Pr(Y j 1|X j xj) E(Y j | x j ) ψ(x j. entirely suppress the notion of dose and retain only information pertaining to dose level. so many mg/m2 say. . most often discrete in which case x j ∈ {d 1. increasing when one of the constituent ingredients increases and. . . x ∈ R . 0. We work instead with some conceptual dose. but behave erratically as sample size increases (see also Sect. 1) where 1 denotes severe toxic response for the jth entered patient ( j 1. . . For the most common case of a single homogeneous group of patients. For multidrug or treatment combination studies there is no obvious univariate measure. d k} but possibly continuous where X j x. d k} or x j ∈ {1.35.Continual Reassessment Method 41 The actual amount of drug therefore. . . some of our modeling ﬂexibility. at X j x j. without operational advantages. although somewhat artiﬁcially. might typically be 0. . Let Y j be a binary random variable (0. is typically not used. the sequential nature of CRM together with its aim to put the included patients at a single correct level means that we will not obtain information needed to ﬁt two parameters. For a single-agent trial (see 13). translating itself as an increase in the probability of a toxic reaction. θ. 0. . we are obliged to work with an underparametrized model. Indeed. x j ∈ {d 1. VI). The value depends on the context and the nature of the toxic side effects. The most appropriate dose. . . notably a one-parameter model.

α 4 0. the working model had α 1 0. . where d 1 1. . .30. In that article this was expressed a little differently in terms of conceptual dose d i. . . a) α a.42. α 5 0. a) be monotonic increasing in x or. k) (1) where 0 α1 ⋅⋅⋅ αk 1 and 0 a ∞. For the six levels studied in the simulations by O’Quigley et al. i k).. Thus. a) ψ(d m. at least as far as maximum likelihood estimation is concerned (see Sect. The one-parameter logistic model. in which b is ﬁxed and where w exp(b ax). . III. k) (2) The above ‘‘tanh’’ model was ﬁrst introduced in this context by O’Quigley et al. k. and d 6 0. between adjacent α i will have an impact on operating characteristics. For given ﬁxed x we require that ψ(x.05. Pepe and Fisher (1990). however.50.10. The spacing. can be seen to fail the above requirements (25).69. a k.42 O’Quigley A. and α 6 0. . d 3 0. (15). Working with real doses corresponds to using some ﬁxed dose spacing. is replaced by α * (i 1.A). a) w/(1 w). . In other words. . . . . We call this a working model since we do not anticipate a single value of a to work precisely at every level. . We have obtained excellent results with the simple choice: ψ(d i. that ψ(d i. although not necessarily one with nice properties. . (i i 1. An investigation into how to choose the α i with the speciﬁc aim of improving certain aspects of performance has yet to be carried out. Note that. our one-parameter model has to be rich enough to model the true probability of toxicity at any given level. whatever treatment combination has been coded by x) is given by R(x). . d 4 0. a) be strictly monotonic in a. k). leading to potentially poor operating characteristics. α 2 0. ψ(x. . d 2 1.10. say a 1. (i 1. d 5 0.47. (15). . a) were described by O’Quigley. that is. For ﬁxed a we require that ψ(x. . .0. where α* α m for any real m 0. k). The spacings chosen here have proved satisfactory in terms of performance across a broad range of situations. working with model (1) is equivalent to working with a model in which α i (i 1. Model Requirements The restrictions on ψ(x. we cannot really attach any concrete i i meaning to the α i. Some obvious choices for a model can fail the above conditions. . . On the other hand. . in the usual case of discrete dose levels d i i 1.20. . . This extra generality is not usually needed since attention is focused on the few ﬁxed d i. a i ) R(d i ). . The true probability of toxicity at x (i. a) whenever i m.e. α 3 0. and we require that for the speciﬁc doses under study (d 1. . . we do not anticipate a 1 a 2 ⋅ ⋅ ⋅ a k a.42 obtained from a model in which αi (tanh d i 1)/2 (i 1. such that ψ(d i. the less intuitive . . . the idea being that tanh (x) increases monotonically from 0 to 1 as x increases from ∞ to ∞. . . Many choices are possible.70. d k ) there exists values of a. .

. . for example. . . incorporating cost or other considerations. the outcomes of the ﬁrst j experiments we obtain estimates ˆ R (d i ). . . . Similar ideas have been pursued by Babb et al. developed in the original paper by O’Quigley et al. x 1. . b ≠ 0. multiply |R (x j ) ˆ constant greater than 1 when R (x j ) θ. two-stage designs. . . (i 1. exp(a bx).Continual Reassessment Method 43 model obtained by redeﬁning w so that w CRM class. . x j ≠ d i ) Thus. (i 1. Bayesian ideas can nonetheless be very useful in addressing more complex questions such as patient heterogeneity and intrapatient escalation. Obtaining the initial data is partially described in these same sections as well as being the subject of its own subsection. To decide. y j. k). k. ˆ θ| by some We could also weight the distance. it turns out that standard procedures of estimation work. (i 1. k) at the k dose levels (see below). The dose or dose level x j assigned to the jth included patient is such that ˆ |R (x j ) θ| ˆ |R (d i ) θ|. . This would favor conservatism. the appropriate level at which to treat a patient. we need some estimate of the probability of toxic response at dose level d i. on the basis of available information and previous observations. (5). x j}. (15). will perform very similarly unless priors are strong. belongs to the III. . x j is the closest level to the target level in the above precise sense. such a design tending to experiment more often below the target than a design without weights. in particular since the ﬁrst entered patient or group of patients must be treated in the absence of any data based estimates of R(x 1 )? Even though our model is underparametrized. . . . Other choices of closeness could be made. We return to this in Section VII. Some care is needed to show this. The Bayesian estimator. We would currently recommend use of the maximum likelihood estimator (17) described in section III. The target dose level is that level having associated with it a probability of toxicity as close as we can get to θ. and we look at this in Section IV. The procedures themselves are described just below. (i 1. . .A. leading us into the area of misspeciﬁed models. . ˆ The estimates R (x j ) obtain from the one-parameter working model. The use of strong priors in the context of an underparametrized and misspeciﬁed model may require deeper study. k) of the true unknown probabilities R(d i ). . IMPLEMENTATION Once a model has been chosen and we have data in the form of the set Ω j {y 1. Two questions dealt with in this section arise: How do we estimate R(x j ) on the basis of Ω j 1 and how do we obtain the initial data. .

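The allocation rule itself reduces to a one-line minimization once the estimates R̂(d_i) are available. The Python sketch below is illustrative only; the α_i skeleton and the penalty constant used in the conservative variant are arbitrary choices, not values taken from the text.

```python
import numpy as np

alpha = np.array([0.05, 0.10, 0.20, 0.30, 0.50, 0.70])  # working-model skeleton

def next_level(a_hat, theta):
    """Allocate at the level whose estimated toxicity alpha_i ** a_hat
    is closest to the target theta."""
    r_hat = alpha ** a_hat
    return int(np.argmin(np.abs(r_hat - theta)))

def next_level_conservative(a_hat, theta, penalty=2.0):
    """Weighted-distance variant: distances above the target are inflated by a
    constant > 1, pushing the choice toward levels below the target."""
    r_hat = alpha ** a_hat
    dist = np.abs(r_hat - theta) * np.where(r_hat > theta, penalty, 1.0)
    return int(np.argmin(dist))

print(next_level(a_hat=0.715, theta=0.20))               # 0-based index of the level
print(next_level_conservative(a_hat=0.715, theta=0.20))
```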
A. Maximum Likelihood Implementation

After the inclusion of the first j patients, the log-likelihood can be written as

L_j(a) = Σ_{ℓ=1}^{j} y_ℓ log ψ(x_ℓ, a) + Σ_{ℓ=1}^{j} (1 - y_ℓ) log{1 - ψ(x_ℓ, a)}    (3)

and is maximized at a = â_j. Maximization of L_j(a) can easily be achieved with a Newton-Raphson algorithm or by visual inspection using some software package such as Excel. Once we have calculated â_j, we can next obtain an estimate of the probability of toxicity at each dose level d_i via R̂(d_i) = ψ(d_i, â_j), (i = 1, ..., k). On the basis of this formula, the dose to be given to the (j + 1)th patient is determined. We can also calculate an approximate 100(1 - α)% confidence interval for ψ(x_{j+1}, â_j) as (ψ_j^-, ψ_j^+), obtained by evaluating ψ(x_{j+1}, ·) at â_j - z_{1-α/2} v(â_j)^{1/2} and at â_j + z_{1-α/2} v(â_j)^{1/2}, where z_α is the αth percentile of a standard normal distribution and v(â_j) is an estimate of the variance of â_j. For the model of Eq. (1), this turns out to be particularly simple, and we can write

v^{-1}(â_j) = Σ_{ℓ ≤ j, y_ℓ = 0} ψ(x_ℓ, â_j)(log α_ℓ)^2 / {1 - ψ(x_ℓ, â_j)}^2

where α_ℓ denotes the working-model constant for the level at which the ℓth patient was treated. Although based on a misspecified model, these intervals turn out to be quite accurate, even for sample sizes as small as 16, and thus helpful in practice (14).

A requirement to be able to maximize the log-likelihood on the interior of the parameter space is that we have heterogeneity among the responses, that is, at least one toxic and one nontoxic response (27). Otherwise, the likelihood is maximized on the boundary of the parameter space and our estimates of R(d_i), (i = 1, ..., k), are trivially either zero or one or, depending on the model we are working with, may not even be defined. Thus, the experiment is considered as not being fully underway until we have some heterogeneity in the responses. Such heterogeneity could arise in a variety of different ways: use of the standard up and down approach, use of an initial Bayesian CRM as outlined below, or use of a design believed to be more appropriate by the investigator. Getting the trial underway, that is, achieving the necessary heterogeneity to carry out the above prescription, is largely arbitrary. Once we have achieved heterogeneity, the model kicks in and we continue as prescribed above (estimation-allocation). This feature is specific to the maximum likelihood implementation and such that it may well be treated separately. Indeed, this is our suggestion, and we describe it more fully below in the subsection Two-Stage Designs.
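A minimal Python sketch of this estimation step follows. The grid maximizer stands in for the Newton-Raphson iteration mentioned above, the toy data sequence is invented, and the variance expression is the sum over non-toxic observations given in the text.

```python
import numpy as np

alpha = np.array([0.05, 0.10, 0.20, 0.30, 0.50, 0.70])

def log_lik(a, levels, tox):
    """Log-likelihood of Eq. (3) under the working model psi(d_i, a) = alpha_i ** a."""
    p = alpha[levels] ** a
    return np.sum(tox * np.log(p) + (1 - tox) * np.log(1 - p))

def fit_a(levels, tox, grid=np.linspace(0.05, 5.0, 2000)):
    """Crude grid maximizer; Newton-Raphson or any 1-d optimizer would also do."""
    return grid[np.argmax([log_lik(a, levels, tox) for a in grid])]

# toy data: three non-toxicities at each of levels 1 and 2, then 2 of 3 toxic at level 3
levels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
tox    = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0])

a_hat = fit_a(levels, tox)
p_hat = alpha[levels] ** a_hat
info = np.sum((1 - tox) * p_hat * np.log(alpha[levels]) ** 2 / (1 - p_hat) ** 2)
se = 1 / np.sqrt(info)                      # estimate of v(a_hat) ** 0.5
print(round(a_hat, 3), np.round(alpha ** a_hat, 3), round(se, 3))
```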

B. Bayesian Implementation

Before describing more closely the Bayesian implementation, it is instructive to consider Fig. 3. It shows the situation after seeing the first observed toxicity at level 3, two other patients being treated at this same level, a further two one level below, and a single patient at the lowest level.

Figure 3  Likelihood and posterior densities for small samples.

0. d 3 0.2. but the illustration helps eliminate a potential concern that the maximum likelihood estimate may be too unstable for small samples.7. was uncertain and likely to be in error. the prior point estimates of toxic probabilities were 0.99).3. Very few further observations are required before the maximum likelihood estimator and the Bayesian estimator become.46 O’Quigley single patient at the lowest level. 0. 0. The simplest member of the gamma family. and this simple exponential formulation appeared to be fairly satisfactory for many cases (15). A more vague prior would help acceleration from the starting level to the highest level when we greatly overestimate the new treatment’s toxic potential. We considered that our point estimate 0. the standard exponential distribution with g(a) exp( a). corresponding to the mean of some prior distribution. 0. the family of gamma distributions in particular. and d 6 0. respectively. For the lowest level. In the model of Section II. in the light of all current available knowledge. d 5 0.2. is the positive real line. showed itself to be a prior sufﬁciently vague for a large number of situations. In O’Quigley et al.69.42 so that for a 0 1. (15) the targeted toxicity level was 0.7. for practical purposes. but such erratic behavior is avoided by the dampening effects of the one-parameter model. the interval becomes (0.26. Such knowledge.A. CRM stipulated that the ﬁrst entered patient would be treated at some level. the corresponding interval is (10 5.5.2. d 4 0. 0.A and VI. lay between 0. and so we gave consideration to distributions having support on . Even so. In addition.69.42. This could be the case for two-parameter models (see also Sect. and 0. The notion of uncertainty was expressed via a prior density on a having support on and called g(a). the likelihood estimate for a can be seen to be very close to that based on the Bayesian posterior. so that we had d 3 1. we had d 1 that satisﬁes this is 0. believed by the experimenter.05.47.96. Such a prior is therefore not vague at all levels and suggests that the highest level is likely to be too high. for the probability of toxicity at the starting dose.69. possibly together with his or her own subjective conviction. (2) the ‘‘dose’’ 0. d 2 1.0.1. This level was chosen to be level 3 in an experiment with six levels allowing the possibility of both escalation and de-escalation. to be the target level. In Eq. The starting level d i is such that we should have ψ(d i.C). We expect this for vague priors. it does not take long for the accumulating information to ‘‘override’’ the prior.93). VI. corresponding to a 1. led the experimenter to a ‘‘point estimate’’ of the probability of toxicity at the starting dose to be the same as the targeted toxic level.003 and 0. For this prior 95% Bayesian intervals. having a point prior estimate of 0. whereas for the highest level. In the original Bayesian setup (15).10. 0. indistinguishable. u)g(u)du θ .

This may be a difficult integral equation to solve, and practically we might take the starting dose to be obtained from ψ(d_i, µ_0) = θ, where

µ_0 = ∫_0^∞ u g(u) du

The other doses could then be chosen so that ψ{d_1, µ_0} = α_1, ..., ψ{d_k, µ_0} = α_k. These initial values for the toxicities may reflect the experimenter's best guesses about the potential for toxicities at the available doses. Note that, in contrast to the maximum likelihood approach, the α_i can be ascribed a more concrete meaning, in terms of probabilities, rather than simply being parameters to a model up to an arbitrary positive power transformation. Difficulties can arise if this procedure is not followed. For instance, suppose we decide to start out with a deliberately low dose, although according to the prior a higher dose would have been indicated. This can lead to undesirable behavior of CRM, an example being the potential occurrence of big jumps when escalating (7,13). Restricting escalation increments may appear to alleviate the problem (see MCRM, Sect. VI.B), but we do not recommend this in view of its ad hoc nature and since the problem can be avoided at the setup stage when the guidelines of O'Quigley et al. (15), recalled here, are carefully followed. Such difficulties do not arise with the maximum likelihood approach.

Given the set Ω_j we can calculate the posterior density for a as

f(a|Ω_j) = H_j^{-1} exp{L_j(a)} g(a)    (4)

where H_j is a sequence of normalizing constants,

H_j = ∫_0^∞ exp{L_j(u)} g(u) du    (5)

The dose x_{j+1} ∈ {d_1, ..., d_k} assigned to the (j + 1)th included patient is the dose minimizing

|θ - ∫_0^∞ ψ{x_{j+1}, u} f(u|Ω_j) du|

If there are many dose levels, it may be more computationally efficient to locate the dose level x_{j+1} ∈ {d_1, ..., d_k} satisfying

Q_ij (P_ij - 2θ) ≤ 0,  (i = 1, ..., k; x_{j+1} ≠ d_i)    (6)

where

Q_ij = ∫_0^∞ {ψ{x_{j+1}, u} - ψ{d_i, u}} f(u|Ω_j) du

and

P_ij = ∫_0^∞ {ψ{x_{j+1}, u} + ψ{d_i, u}} f(u|Ω_j) du
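The posterior calculations above reduce to one-dimensional numerical integrations. The Python sketch below approximates them on a grid, taking the standard exponential prior g(a) = exp(-a) discussed earlier; the grid limits and the toy data are arbitrary choices for illustration, not values from the text.

```python
import numpy as np

alpha = np.array([0.05, 0.10, 0.20, 0.30, 0.50, 0.70])
theta = 0.20
a_grid = np.linspace(1e-4, 10.0, 4000)
prior = np.exp(-a_grid)                           # g(a) = exp(-a)

def posterior_weights(levels, tox):
    """Discretized f(a | Omega_j) of Eq. (4), normalized over the grid."""
    levels, tox = np.asarray(levels), np.asarray(tox)
    p = alpha[levels][None, :] ** a_grid[:, None]
    loglik = np.sum(tox * np.log(p) + (1 - tox) * np.log(1 - p), axis=1)
    w = np.exp(loglik - loglik.max()) * prior
    return w / w.sum()

def next_level_bayes(levels, tox):
    """Allocate at the level whose posterior-expected toxicity is closest to theta."""
    w = posterior_weights(levels, tox)
    expected = np.array([np.sum(alpha[i] ** a_grid * w) for i in range(len(alpha))])
    return int(np.argmin(np.abs(expected - theta)))

# practical starting rule psi(d_i, mu_0) = theta with mu_0 = 1 (the prior mean)
print(int(np.argmin(np.abs(alpha ** 1.0 - theta))))
# allocation after one toxicity and two non-toxicities at level 3
print(next_level_bayes([2, 2, 2], [1, 0, 0]))
```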

Often it will make little difference if, rather than work with the expectations of the toxicities, we work with the expectation of a, thereby eliminating the need for k - 1 integral calculations. Thus, we treat the (j + 1)th included patient at level x_{j+1} ∈ {d_1, ..., d_k} such that |θ - ψ{x_{j+1}, µ_j}| is minimized, where µ_j = ∫ u f(u|Ω_j) du. As in the likelihood approach, we can calculate an approximate 100(1 - α)% Bayesian interval for ψ(x_{j+1}, µ_j) as (ψ_j^-, ψ_j^+), obtained by evaluating ψ(x_{j+1}, ·) at posterior quantiles α_j^- and α_j^+ chosen so that ∫_{α_j^-}^{α_j^+} f(u|Ω_j) du = 1 - α.

The Bayesian approach has the apparent advantage of being immediately operational in that it is not necessary to wait for patient heterogeneity before being able to assess to which level we should assign the successively entered patients. The early operating characteristics, however, lean on the prior and the model, quantities that are to a large extent arbitrarily defined. By modifying our prior and/or model, we can alter these early operating characteristics to mimic the kind of behavior we would like. In principle we could fine tune prior and model parameters to achieve, for example, rapid or less rapid escalation, or rapid initial escalation that is gradually dampened in the absence of observed toxicities. Such goals, however, are more readily accomplished via the two-stage designs of the following section.

C. Two-Stage Designs

It may be believed that we know so little before undertaking a given study that it is worthwhile to split the design into two stages: an initial exploratory escalation followed by a more refined homing in on the target. Such an idea was first proposed by Storer (28) in the context of the more classical up and down schemes. His idea was to enable more rapid escalation in the early part of the trial where we may be quite far from a level at which treatment activity could be anticipated. Moller (13) was the first to use this idea in the context of CRM designs. Her idea was to allow the first stage to be based on some variant of the usual up and down procedures. In the context of sequential likelihood estimation, the necessity of an initial stage was pointed out by O'Quigley and Shen (17), since the likelihood equation fails to have a solution on the interior of the parameter space unless some heterogeneity in the responses has been observed. Their suggestion was to work with any initial scheme, Bayesian CRM or up and down, and for any reasonable scheme the operating characteristics appear relatively insensitive to this choice. However, we believe there is something very natural and desirable in two-stage designs and that currently they could be taken as the designs of choice, from both the statistical and ethical angles.

Early behavior of the method, in the absence of heterogeneity (i.e., lack of toxic response), appears to be rather arbitrary. A decision to escalate after inclusion of three patients tolerating some level, or after a single patient tolerating a level, or according to some Bayesian prior, however constructed, may be somewhat artificial. The use of a working model at this point, as occurs for Bayesian estimation, is translating, directly in some cases and less directly for the Bayesian prescription, the simple desire to try a higher dose because thus far we have encountered no toxicity. Rather than lead the clinician into thinking that something subtle and carefully analytic is taking place, our belief is that it is preferable that he or she be involved in the design of the initial phase. Operating characteristics that do not depend on data ought be driven by clinical rather than statistical concerns. More importantly, the initial phase of the design, however constructed, can be made much more efficient, by allowing information on toxicity grade to determine the rapidity of escalation rather than by modifying our model parameters and/or our prior.

Here we describe an example of a two-stage design that has been used in practice. There were many dose levels, and the first included patient was treated at a low level. It was decided to use information on low-grade toxicities in the first stage of the design to allow rapid initial escalation, since it is possible that we be far below the target level. As long as we observe very low-grade toxicities, we escalate quickly, including only a single patient at each level. As soon as we encounter more serious toxicities, escalation is slowed down. Ultimately we encounter dose-limiting toxicities, at which time the second stage, based on fitting a CRM model, comes fully into play. This is done by integrating this information and that obtained on all the earlier non-dose-limiting toxicities to estimate the most appropriate dose level. Specifically, we define a grade severity variable

Table 1  Toxicity "grades" (severities) for trial

Severity    Degree of toxicity
0           No toxicity
1           Mild toxicity (non-dose-limiting)
2           Nonmild toxicity (non-dose-limiting)
3           Severe toxicity (non-dose-limiting)
4           Dose-limiting toxicity
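The escalation logic of this first stage, spelled out in the next paragraphs in terms of the running average severity S(i), can be summarized in a few lines of Python. This is a schematic rendering under my own simplifications (the cohort-of-three proviso and other refinements described below are not captured), with grades coded as in Table 1.

```python
def first_stage_step(severities):
    """Decide the next action in the initial escalation stage, given the toxicity
    grades (0-4, Table 1) observed so far at the current level."""
    if 4 in severities:
        return "dose-limiting toxicity seen: close first stage, fit the CRM model"
    s = sum(severities) / len(severities)          # average severity S(i) at this level
    if s < 2:
        return "escalate"
    return "include another patient at this level"

print(first_stage_step([0]))        # only low grades: escalate with single-patient steps
print(first_stage_step([2]))        # first grade-2 toxicity: require a further inclusion
print(first_stage_step([2, 1]))     # average back below 2: escalation may resume
print(first_stage_step([1, 4]))     # dose-limiting toxicity: switch to the second stage
```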

in the absence of information on such recently included patients. the sum of the severities at that level divided by the number of patients treated at that level. The question does arise. The delayed response can lead to grouping or we can simply decide on the grouping by design. then we escalate. in practice. and anything other than a 0 severity for this inclusion would require yet a further inclusion and a non-dose-limiting toxicity before being able to escalate. (15) and. The ﬁrst severity coded 2 necessitates a further inclusion at this same level. This is the level indicated by all the currently available information. then escalation to higher levels only occurs if each cohort of three patients does not experience dose-limiting toxicity. when the correct level was the highest available level and we start out at the lowest or a low level. one-by-one inclusion. no additional work required to deal with such situations. D. however. (15) described the situation of delayed response in which new patients become available to be included in the study while the toxicity results are still outstanding on already entered patients. this phase of the study (the initial escalation scheme) comes to a close and we proceed on the basis of CRM recommendation. the cohort size had little impact on operating characteristics and the accuracy of ﬁnal recommendation. This scheme means that. provided S(i) is less than 2. Although the initial phase is closed. The likelihood for this situation was written down by O’Quigley et al. operationally. and three were evaluated. the information on both dose-limiting and non-dose-limiting toxicities thereby obtained is used in the second stage. as long as we see only toxicities of severities coded 0 or 1. Grouped Designs O’Quigley et al. This can be helpful to avoid being handicapped by an outlier or an unanticipated and possibly not drug-related toxicity. is just the likelihood we obtain were the subjects to have been included one by one. two. apart from a constant term not involving the unknown parameter. .50 O’Quigley S(i) to be the average toxicity severity observed at dose level i. in which cohorts of one. once we have included three patients at some level. as to the performance of CRM in such cases. The more thorough study was that of Goodman et al. (9) and O’Quigley and Shen (17) studied the effects of grouping. The rule is to escalate. Goodman et al. O’Quigley and Shen (17) indicated that for groups of three and relatively few patients (n 16). we retain the capability of picking up speed (in escalation) should subsequent toxicities be of low degree (0 or 1). Furthermore. then we might anticipate some marked drop in performance when contrasted with. The suggestion was. that is. This design also has the advantage that should we be slowed down by a severe (severity 3) albeit non-dose-limiting toxicity. say. There is therefore. Once dose-limiting toxicity is encountered. Broadly speaking. that the logical course to take was to treat at the last recommended level.

Simple intuition would tell us this. One-by-one inclusion tends to maximize efficiency, but should stability throughout the study be an issue, then this extra stability can be obtained through grouping, and the cost of this extra stability in terms of efficiency loss appears to be generally small; any differences disappeared for samples of size 25. The findings of Goodman et al. (9), O'Quigley and Shen (17), and O'Quigley (19) contradict the conjecture of Korn et al. (11) that any grouping would lead to substantial efficiency losses.

E. Illustration

This brief illustration is recalled from O'Quigley and Shen (17). The study concerned 16 patients, and their toxic responses were simulated from a known dose-toxicity curve. There were six levels in the study, maximum likelihood was used, and the targeted toxicity was given by θ = 0.2. The true toxic probabilities were R(d_1) = 0.03, R(d_2) = 0.22, R(d_3) = 0.45, R(d_4) = 0.6, R(d_5) = 0.8, and R(d_6) = 0.95, indicating that the best level for the maximum tolerated dose (MTD) is given by level 2, where the true probability of toxicity is 0.22. The working model was that given by Eq. (1), where α_1 = 0.04, α_2 = 0.07, α_3 = 0.20, α_4 = 0.35, α_5 = 0.55, and α_6 = 0.70. The design was two stage: a grouped design was used until heterogeneity in toxic responses was observed, patients being included in groups of three, and the first entered patients were treated at the lowest level. The first three patients experienced no toxicity at level 1. Escalation then took place to level 2, and the next three patients treated at this level did not experience any toxicity either. Subsequently, two of the three patients treated at level 3 experienced toxicity. Given this heterogeneity in the responses, the maximum likelihood estimator for a now exists and, following a few iterations, could be seen to be equal to 0.715. We then have that R̂(d_1) = 0.101, R̂(d_2) = 0.149, R̂(d_3) = 0.316, R̂(d_4) = 0.472, R̂(d_5) = 0.652, and R̂(d_6) = 0.775. The 10th entered patient is then treated at level 2, for which R̂(d_2) = 0.149 since, from the available estimates, this is the closest to the target θ = 0.2. The 10th included patient does not suffer toxic effects, and the new maximum likelihood estimator becomes 0.759. Level 2 remains the level with an estimated probability of toxicity closest to the target. This same level is in fact recommended to the remaining patients, so that after 16 inclusions the recommended MTD is level 2. The estimated probability of toxicity at this level is 0.212, and a 90% confidence interval for this probability is estimated as (0.07, 0.39).

IV. STATISTICAL PROPERTIES

Recall that CRM is a class of methods rather than a single method, the members of the class depending on arbitrary quantities chosen by the investigator such as the form of the model, the spacing between the doses, the starting dose, and whether

. where |R (x 0 ) θ| |R(d i ) θ|. . . a i ∈ S(a 0 ). a j ). the function ψ′ (x. . that . . Before writing down the third condition. (i 1.. a): t t 1 and each x. the members nonetheless maintaining some of their own particularities. . B]. . k. k. A. ˆ ˆ The maximum likelihood estimate. (7) The condition we require is that. that is. . k. a i ) R(d i ). . a 0 ) 1. . under the assumptions on R(d i ) and ψ(x i. . there exists a unique a i such that ψ(d i. . a) 1 ψ 2. . R (d i ) ψ(d i. . We nonetheless require that the working model is not ‘‘too distant’’ from the true underlying dose toxicity curve. We also require the true unknown dose toxicity function. ψ(x. The ﬁrst condition is standard for estimating equations to have unique solutions. all the probability mass is not put on a single point. d k satisfy 0 R(d 1 ) . The statistical properties described in this section apply broadly to all members of the class. Convergence Convergence arguments are obtained from considerations of the likelihood.A. a) θ| |ψ(x i. a).52 O’Quigley single or grouped inclusions. . the initial dose escalation scheme in two stage designs or the prior density chosen for Bayesian formulations. x 0 ≠ d i ). R(x). For each 0 s(t. to satisfy the following conditions: 1. x. . . . in particular the condition that. for all d i ≠ x 0} 1. it will generally not be true that ψ(d i. Usual likelihood arguments break down since our models are misspeciﬁed. . . . . . . and this can be made precise with the help of the set S(a 0 ) {a: |ψ(x 0. is continuous and is strictly monotone in a. The same arguments apply to Bayesian estimation as long as the prior is other than degenerate. for i It can be shown (25). d k ). a) ψ (1 t) ψ′ (x. 2. Note that the a i s depend on the actual probabilities of toxicity and are therefore unknown. The probabilities of toxicity at d 1. for i 1. exists as soon as we have some heterogeneity in the responses (27). The parameter a belongs to a ﬁnite interval [A. a). We assume the dose toxicity function. 3. The target dose level is x 0 ∈ (d 1. R(d k ) 1. The second imposes no real practical restriction. to satisfy the conditions described in Section 11. note that since our model is R(d i ) for i misspeciﬁed. k. a) θ|. . . . We also require 1.

S(a_0) is an open and convex set. Define π_n(d_i) ∈ [0, 1] to be the frequency with which the level d_i has been used among the first n experiments. Letting

I_n(a) = (1/n) Σ_{j=1}^{n} [ y_j ψ'{x_j, a}/ψ{x_j, a} - (1 - y_j) ψ'{x_j, a}/{1 - ψ{x_j, a}} ]    (8)

and

Ĩ_n(a) = (1/n) Σ_{j=1}^{n} [ R{x_j} ψ'{x_j, a}/ψ{x_j, a} - {1 - R{x_j}} ψ'{x_j, a}/{1 - ψ{x_j, a}} ]    (9)

then

sup_{a ∈ [A, B]} |I_n(a) - Ĩ_n(a)| → 0,  almost surely.

This convergence result follows intuitively and can be demonstrated rigorously in a number of ways. For instance, (ψ'/ψ)(d_i, ·) and {ψ'/(1 - ψ)}(d_i, ·) are uniformly continuous in a over the finite interval [A, B], and Shen and O'Quigley (25) applied this, together with a sufficiently fine partition of the interval [A, B], to bound by arbitrarily small quantities the above differences. This result is the key to rewriting Ĩ_n(a) below in a way that is convenient. Since each x_j takes one of the values d_1, ..., d_k, we can rewrite Ĩ_n(a) as

Ĩ_n(a) = Σ_{i=1}^{k} π_n(d_i) [ R(d_i) ψ'(d_i, a)/ψ(d_i, a) - {1 - R(d_i)} ψ'(d_i, a)/{1 - ψ(d_i, a)} ]    (10)

Now, let ã_n be the solution to the equation Ĩ_n(a) = 0. To establish the consistency of â_n, observe that, for each 1 ≤ i ≤ k, the definition of a_i and condition 1 on the dose-toxicity function indicate that a_i is the unique solution to the equation

R(d_i) ψ'(d_i, a)/ψ(d_i, a) - {1 - R(d_i)} ψ'(d_i, a)/{1 - ψ(d_i, a)} = 0

The next important step is to consider the finite interval S_1(a_0) = [a_(1), a_(k)], in which a_(1) = min{a_1, ..., a_k} and a_(k) = max{a_1, ..., a_k}. The third condition on R(x) and the convexity of the set S(a_0) imply that S_1(a_0) ⊂ S(a_0). It follows that ã_n will fall into the interval S_1(a_0). Since â_n solves I_n(a) = 0, Eq. (8), the uniform convergence and uniform continuity ensure that, almost surely, â_n ∈ S(a_0) for n sufficiently large. Hence, for large n, â_n satisfies

|ψ(x_0, â_n) - θ| < |ψ(d_i, â_n) - θ|,  d_i ≠ x_0

The result then follows.

When faced with some potential realities or classes of realities. Efﬁciency ˆ ˆ ψ(x n 1. . with σ 2 {ψ′(x 0. θ 0 (1 θ 0 )}. What actually takes place in ﬁnite samples needs to be investigated on a case by case basis. except π n (x 0 ). Thus. say. will tend to the solution to R(x 0 ) ψ′ (x 0. we can ask ourselves questions such as what is the probability of toxicity for a randomly chosen patient that has been included in the study or. k) in Eq. B. Some cases studied showed evidence of superefﬁciency. σ2 ). what is the probability of toxicity for those patients entered into the study at the very beginning? . Nonetheless. translating nonnegligeable bias that happens to be in the right direction. being the solution for Eq.’’ Safety is in fact a statistical property of any method. for n large enough x n 1 x 0 so that at this dose level x n 1 satisﬁes |x n 1 x 0| |x n x 0| if x n ≠ x 0.54 O’Quigley Thus. Since there are only a ﬁnite number of dose levels. The estimate then provided by CRM is √n{θ n fully efﬁcient. π n (d i ). An application of the δ method shows that the asymptotic distribution of ˆ R(x 0 )} is N{0. (9) become negligible. This translates a central ethical concern. where a n is the maximum likelihood estimate. observe that. (8) again. that the asymptotic distribution of √n (a n a 0 ) is N(0. a n ) to estimate the probability of O’Quigley (14) proposes using θn ˆ toxicity at the recommended level x n 1. whereas a few others indicated efﬁciency losses large enough to suggest the potential for improvement. which tends to 1. (i 1. further. . were all subjects to be experimented at the correct level. a) ψ 0 The solution to the above equation is a 0. x n converges to x 0 almost surely. ˜ a n. This is what our intuition would suggest given the convergence properties of CRM. Applying Eq. x n will stay at x 0 ultimately. (10). . ˆ To establish the consistency of a n. A belief that CRM would tend to treat the early included patients in a study at high-dose levels convinced many investigators that without some modiﬁcation CRM was not ‘‘safe. as n tends to inﬁnity. the relatively broad range of cases studied by O’Quigley (14) show a mean squared error for the estimated probability of toxicity at the recommended level under CRM to correspond well with the theoretical variance for samples of size n. C. Safety In any discussion on a phase I design the word safety will arise. In other words. a 0 )} 2θ 0 (1 θ 0 ). a) ψ {1 R(x 0 )} 1 ψ′ (x 0. we obtain the ˆ ˆ consistency of a n and.
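Questions of this kind are typically answered by simulation. The following Python sketch simulates a simple likelihood-based CRM repeatedly under one hypothetical reality and reports the toxicity rate for a randomly chosen patient and for the first three patients. All numerical choices here, including the true curve, the two-stage start, the sample size, and the number of replications, are invented for illustration and are not taken from any study cited in the text.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha  = np.array([0.05, 0.10, 0.20, 0.30, 0.50, 0.70])   # working model
r_true = np.array([0.05, 0.11, 0.22, 0.40, 0.60, 0.80])   # one hypothetical reality
theta, n = 0.20, 16

def mle_a(levels, tox, grid=np.linspace(0.05, 5.0, 400)):
    p = alpha[np.asarray(levels)][None, :] ** grid[:, None]
    ll = np.sum(np.asarray(tox) * np.log(p) + (1 - np.asarray(tox)) * np.log(1 - p), axis=1)
    return grid[np.argmax(ll)]

def one_trial():
    levels, tox, level = [], [], 0                 # start at the lowest level
    for _ in range(n):
        levels.append(level)
        tox.append(int(rng.random() < r_true[level]))
        if 0 in tox and 1 in tox:                  # heterogeneity: model-based allocation
            level = int(np.argmin(np.abs(alpha ** mle_a(levels, tox) - theta)))
        elif tox[-1] == 0:
            level = min(level + 1, len(alpha) - 1) # initial stage: escalate after non-toxicity
    return np.array(tox)

results = np.array([one_trial() for _ in range(200)])
print("toxicity rate, randomly chosen patient:", round(results.mean(), 3))
print("toxicity rate, first three patients:  ", round(results[:, :3].mean(), 3))
```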

we may hope to capture some other effects that are necessarily ignored by the rough and ready up and down designs. but with some skill in model construction. which are obvious and transparent for up and down schemes and less transparent for model-based schemes such as CRM. An alternative way to enhance conservatism is rather than choose the closest available dose to the target. The following sections consider some examples.19. the probability of being treated at levels higher than the MTD was. MORE COMPLEX CRM DESIGNS The different up and down designs amount to a collection of ad hoc rules for making decisions when faced with accumulating observations. roughly halved.20 to 0. a model enables us to go further and accommodate greater complexity.10. although not providing a broad summary of the true underlying probabilistic phenomenon.23) indicate CRM to be a safer design than any of the commonly used up and down schemes in that for targets of less than θ 0.15–17. Theoretical work and extensive simulations (1. This ﬁnding is logical given that the purpose of CRM is to concentrate as much experimentation as possible around the prespeciﬁed target. If the deﬁnition of safety was to be widened to include the concept of treating patients at unacceptably low levels. although the impact such an approach might have on the reliability of ﬁnal estimation remains to be studied. levels at which the probability of toxicity is deemed too close to zero. Care is needed. in all the studied cases. Some study on this idea has been carried out by Babb et al.Continual Reassessment Method 55 Once we know the realities or classes of realities we are facing. This is an important point since it highlights the main advantages of the CRM over the standard designs in terms of ﬂexibility and the ability to be adapted to potentially different situations. In practice. that is. these calculations are involved. (5). on average. in view of its being underparametrized. it ought be emphasized that we can adjust the CRM to make it as safe as we require by changing the target level. In principle at least.30. does nonetheless provide structure enabling better control in an experimental situation. higher with the standard designs than with CRM. The CRM leans on a model that. and we may simply prefer to estimate them to any desired degree of accuracy via simulation. systematically take the dose immediately lower than the target or change the distance measure used to select the next level to recommend. V. In addition. the observed number of toxicities will be. For instance. the probability that a randomly chosen patient suffers a toxicity is lower. Furthermore. Safety ought be improved. then in principle we can calculate the probabilities mentioned above. the operating rules of the method. . then CRM does very much better than the standard designs. if we decreased the target from 0.

. (15) would be to use ˆ the Bayesian intervals. y n ). a j ). (ψ j . Such designs were used by Goodman et al. j. n Pr{x j 1 xj 2 ⋅⋅⋅ x n 1|Ω j } In words. The idea is based on the convergence of CRM and that as we reach a plateau. as in the preceding steps. 4. Repeat step 2. at the currently recommended level and when this interval falls within some prespeciﬁed range. compute the value of a j 1. Label the left child of the root with this dose level.n to refer to the tree constructed with this algorithm. However. to ﬁnd j. 3. . given Ω j we would like to say something about the levels at which the remaining patients. Label the remaining nodes of the tree level by level. ψ j ). given the convergence properties of CRM. The properties of such rules remain to be studied. (9) and Korn et al. it may occur in practice that we appear to have settled on a level before having included the full sample size n of anticipated patients. n is the probability that x j 1 is the dose recommended to all remaining patients in the trial and is the ﬁnal recommended dose. .n that starts at the root and ends at a leaf whose nodes all have the same label represents a trial where the recommended dose is unchanged be- . Another approach would be to stop after some ﬁxed number of subjects have been treated at the same level. We use the notation T j. Speciﬁcally. are likely to be treated. The quantity we are interested in is j.56 O’Quigley A. In such a case we may wish to bring the study to an early close. 1. we stop the study. (11) and have the advantage of great simplicity. ψ(x j 1. and we do not recommend their use at this point. The root is labeled with x j 1. Label the right child of the root with this dose level. ˆ Assuming that y j 1 1. j 1 to n. Thus. .n one needs to determine all the possible outcomes of the trial based on the results known for the ﬁrst j patients. 2. Determine the dose level that would be recommended to patient j 1 in this case. thereby enabling the phase II study to be undertaken more quickly. for the probability of toxicity. The following algorithm achieves the desired result. One stopping rule that has been studied in detail (18) is described here. the accumulating information can enable us to quantify this notion. this time with y j 1 0. Determine the dose level that would be recommended to patient j 1. Inclusion of a Stopping Rule The usual CRM design requires that a given sample size is determined in advance. Construct a complete binary tree on 2n j 1 1 nodes corresponding to all possible future outcomes (y j 1. One possible approach suggested by O’Quigley et al. Each path in T j.

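The tree construction and the path probabilities described here and just below lend themselves to a short recursive implementation. The Python sketch that follows is illustrative only: the working model, target, and data are invented, and the recursion re-fits the model along each branch of future outcomes while weighting branches by the fixed current estimate, as in the text.

```python
import numpy as np

alpha, theta = np.array([0.05, 0.10, 0.20, 0.30, 0.50, 0.70]), 0.20

def mle_a(levels, tox, grid=np.linspace(0.05, 5.0, 800)):
    p = alpha[np.asarray(levels)][None, :] ** grid[:, None]
    ll = np.sum(np.asarray(tox) * np.log(p) + (1 - np.asarray(tox)) * np.log(1 - p), axis=1)
    return grid[np.argmax(ll)]

def recommend(levels, tox):
    return int(np.argmin(np.abs(alpha ** mle_a(levels, tox) - theta)))

def settle_prob(levels, tox, n):
    """Estimate of the probability that the currently recommended level remains the
    recommendation for every one of the n - j remaining patients (and hence the final
    recommendation), summing over the binary tree of future toxicity outcomes."""
    target = recommend(levels, tox)
    p0 = alpha[target] ** mle_a(levels, tox)        # fixed current estimate of toxicity there

    def walk(lv, tx):
        if recommend(lv, tx) != target:             # a path on which the recommendation moves
            return 0.0
        if len(lv) == n:
            return 1.0
        return (p0 * walk(lv + [target], tx + [1]) +
                (1 - p0) * walk(lv + [target], tx + [0]))

    return walk(list(levels), list(tox))

levels = [0, 0, 0, 1, 1, 1, 2, 2, 2]               # nine patients observed so far
tox    = [0, 0, 0, 0, 0, 0, 1, 1, 0]
print(round(settle_prob(levels, tox, n=12), 3))     # three further patients planned
```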
Each such path represents a trial in which the recommended dose is unchanged between the (j + 1)st and the (n + 1)st patient. The probability of each such path is given by

{R(x_{j+1})}^τ {1 - R(x_{j+1})}^{n-j-τ}

where τ is the number of toxicities along the path. Using ψ{x_{j+1}, â_j}, the current estimate of the toxicity of x_{j+1}, we may estimate the probability of each path by

[ψ(x_{j+1}, â_j)]^τ [1 - ψ(x_{j+1}, â_j)]^{n-j-τ}

By exclusivity we can sum the probabilities of all such paths to obtain Π_{j,n}; adding up these path estimates yields an estimate of Π_{j,n}. Details are given in O'Quigley and Reiner (18).

B. Patient Heterogeneity

As in other types of clinical trials, we are essentially looking for an average effect. Patients of course differ in the way they may react to a treatment, and, although hampered by small samples, we may sometimes be in a position to specifically address the issue of patient heterogeneity, although we must remain realistic in terms of what is achievable in the light of the available sample sizes. One example occurs in patients with acute leukemia, where it has been observed that children will better tolerate more aggressive doses (standardized by their weight) than adults. Likewise, heavily pretreated patients are more likely to suffer from toxic side effects than lightly pretreated patients. In such situations we may wish to carry out separate trials for the different groups to identify the appropriate MTD for each group. Usually, clinicians carry out two separate trials or split a trial into two arms after encountering the first dose-limiting toxicities (DLTs) when it is believed that there are two distinct prognostic groups. This has the disadvantage of failing to use information common to both groups. Otherwise we run the risk of recommending an "average" compromise dose level, too toxic for a part of the population and suboptimal for the other. A two-sample CRM has been developed so that only one trial is carried out based on information from both groups (20). A multisample CRM is a direct generalization.

For clarity, we use the same notation as previously defined and, for the sake of simplicity, we suppose that the targeted probability is the same in both groups and is denoted by θ, although this assumption is not essential to our conclusions. Let I, taking value 1 or 2, be the indicator variable for the two groups. The dose-toxicity model is now the following:

Pr(Y = 1 | X = x, I = 1) = ψ_1(x, a)
Pr(Y = 1 | X = x, I = 2) = ψ_2(x, a, b)

where I_k indicates to which group the kth subject belongs, x_k is the dose level at which the kth subject is tested, and y_k indicates whether or not the kth subject suffered a toxic response. Parameter b measures to some extent the difference between the groups; a non-zero value for b indicates group heterogeneity. The functions ψ_1 and ψ_2 are selected in such a way that for each θ ∈ (0, 1) and each dose level x there exists (a_0, b_0) satisfying ψ_1(x, a_0) = θ and ψ_2(x, a_0, b_0) = θ. This condition is satisfied by many function pairs. An obvious generalization of the model of O'Quigley et al. (15), arising from Eq. (1), is that in which Eq. (2) applies to group 1 and

α_i = {tanh(d_i + b) + 1}/2,  (i = 1, ..., k)    (11)

to group 2. There are many other possibilities. The following model has also performed well in simulations:

ψ_1(x, a) = exp(a + x)/{1 + exp(a + x)},  ψ_2(x, a, b) = exp(a + b + x)/{1 + exp(a + b + x)}

To estimate the two parameters, one can use a Bayesian estimate or a maximum likelihood estimate as for a traditional CRM design. Let z_k = (x_k, y_k, I_k), k = 1, ..., j, be the outcomes of the first j patients. On the basis of the observations z_k, with the first j_1 indices denoting the patients in group 1 and the remaining j_2 patients belonging to group 2 (j_1 + j_2 = j), we can write down the likelihood as

L_j(a, b) = Π_{i=1}^{j_1} ψ_1(x_i, a)^{y_i} {1 - ψ_1(x_i, a)}^{1-y_i} Π_{i=j_1+1}^{j} ψ_2(x_i, a, b)^{y_i} {1 - ψ_2(x_i, a, b)}^{1-y_i}

If we denote by (â_j, b̂_j) the values of (a, b) maximizing this equation after the inclusion of j patients, then the estimated dose-toxicity relations are ψ_1(x, â_j) and ψ_2(x, â_j, b̂_j). The trial is carried out as usual: after each inclusion, our knowledge of the probabilities of toxicity at each dose level for either group is updated via the parameters. If the (j + 1)th patient belongs to group 1, he or she will be allocated at the dose level that minimizes |ψ_1(x_{j+1}, â_j) - θ|, with x_{j+1} ∈ {d_1, ..., d_k}. On the other hand, if the (j + 1)th patient belongs to group 2, the recommended dose level minimizes |ψ_2(x_{j+1}, â_j, b̂_j) - θ|. Note that it is not necessary that the two sample sizes be balanced, nor that entry into the study be alternating. It has been shown that, under some conditions, the recommendations will converge to the right dose level for both groups, as will the estimates of the true probabilities of toxicity at these two levels.
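A sketch of the two-sample fit is given below in Python, using the shifted-tanh working model of Eq. (11) for group 2. The grid ranges, the invented data, and the dose skeleton are illustrative assumptions rather than values from the text.

```python
import numpy as np
from itertools import product

# conceptual doses recovered from the single-sample skeleton; group 2 shifts them by b
d = np.arctanh(2 * np.array([0.05, 0.10, 0.20, 0.30, 0.50, 0.70]) - 1)
theta = 0.20

def psi(level, a, b=0.0):
    """Group-1 model for b = 0; Eq. (11) model for group 2 with shift b."""
    return ((np.tanh(d[level] + b) + 1) / 2) ** a

def fit_two_sample(levels, tox, group,
                   a_grid=np.linspace(0.1, 4.0, 160), b_grid=np.linspace(-1.0, 1.0, 81)):
    """Joint maximum likelihood for (a, b) over a grid; group entries are 1 or 2."""
    best, best_ll = (1.0, 0.0), -np.inf
    for a, b in product(a_grid, b_grid):
        p = np.array([psi(l, a, b if g == 2 else 0.0) for l, g in zip(levels, group)])
        ll = np.sum(np.where(np.asarray(tox) == 1, np.log(p), np.log(1 - p)))
        if ll > best_ll:
            best, best_ll = (a, b), ll
    return best

levels = [2, 2, 3, 2, 3, 3]
tox    = [0, 1, 1, 0, 0, 1]
group  = [1, 1, 1, 2, 2, 2]
a_hat, b_hat = fit_two_sample(levels, tox, group)
rec = [int(np.argmin([abs(psi(i, a_hat, s) - theta) for i in range(len(d))]))
       for s in (0.0, b_hat)]
print(round(a_hat, 2), round(b_hat, 2), "recommended levels (1-based):", rec[0] + 1, rec[1] + 1)
```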

that is. the groups were combined until evidence of heterogeneity began to manifest itself. group 2 recommendation being based on (a. ˆ 0) and ψ 2 (x.C. such as takes place with the CRM.C. a. The ﬁrst DLT in group 1 was encountered at dose level 6 and led to recommend a lower level to the next patient to be included. Implementation was based on likelihood estimation. The design called for shared initial escalation. Much more fully studied in the . Figure 4 illustrates a simulated trial carried out with a two-parameter model. 0) and group 1 continuing without a model as in Section III. necessitating nontoxicities and a toxicity in each group before the model could be fully ﬁt.Continual Reassessment Method 59 Figure 4 Results of a simulated trial for two groups. ˆ the trial was split into two arms. Before this. doselevel escalation followed an algorithm incorporating grade information paralleling that of Section III. is relatively recent. allocation for both groups leaned on the model together with the minimization algorithms described above. Note that there are many possible variants on this design. At this point. Pharmacokinetic Studies Statistical modeling of the clinical situation of phase I dose-ﬁnding studies. C. For the remainder of the study. The ﬁrst DLT occured in group 2 for the ﬁfth included patient.

in particular the logistic model. and the peak concentration. indicated by the retrospective analysis to have probabilities of toxicity much lower or much higher than suggested by the average estimate. A recommendation is made for this level. and a mechanistic approach based on a catch-all model is probably to be advised against. Most patients will have been studied at the recommended level and a smaller amount at adjacent levels. If so. There are many parameters of interest to the pharmacologist. a particular practical difﬁculty arises in the phase I context in which any such information only becomes available once the dose has been administered. requiring great statistical and/or pharmacological skill. The strength of CRM is to locate with relatively few patients the target dose level. Some pioneering work has been carried out here by Piantadosi et al. can be used to see if this information helps explain the toxicities. say one including all the relevant factors believed to inﬂuence the probability of encountering toxicity. most often blood plasma. for example. For our purposes. as in the case of patient heterogeneity. following the phase I clinical study. pharmacokinetics deals with the study of concentration and elimination characteristics of given compounds in speciﬁed organ systems. available before selecting the level at which the patient should be treated. we may be encouraged to carry out further studies at higher or lower levels for certain patient proﬁles. and this is where we see the main advantage of pharmacokinetics. (22). At this point we do not see the utility of a model in which all the different factors are included as regressors. . However. we will have responses and a great deal of pharmacokinetic information. such information will have a bearing on whether or not a given patient is likely to encounter dose-limiting toxicity or.60 O’Quigley phase I context are pharmacokinetics and pharmacodynamics. why some patients and not others were able to tolerate some given dose. In principle we can write down any model we care to hypothesize. The usual models. the area under the concentration time curve. Further studies. the rate of clearance of the drug. Most often then. the information will be of most use in terms of retrospectively explaining the toxicities. whereas pharmacodynamics focuses on how the compounds affect the body. Clearly. This is a vast subject referred to as PK/PD modeling. At any of these levels. We can then proceed to estimate the parameters. in retrospect. This can be viewed as the ﬁne tuning and may itself give rise to new more highly focused phase I studies. it is possible to have pharmacodynamic information and other patient characteristics relating to the patient’s ability to synthesize the drugs. This is a large ﬁeld awaiting further exploration. Roughly speaking. we must remain realistic in terms of what can be achieved given the maximum obtainable sample size. The remaining patients are then treated at this same level. indicating the potential for improved precision by the incorporation of pharmacokinetic information. However. can now be made. These further analyses are necessarily very delicate.

it would be possible to predict at some level the rate of occurence of dose-limiting toxicities without necessarily having observed very many. has been simpliﬁed when going from ﬁve levels to two and that it may help to use models accomodating multilevel responses. mostly grade 4 but possibly also certain kinds of grade 3. a. b) 1) 1 ψ 2 (x k. . At the opposite end of the model/hypothesis spectrum. a. Graded Toxicities Although we refer to dose-limiting toxicities as a binary (0. a) 2 or Y k 3) ψ 2 (x k. the highest being dose limiting. let us consider the case of three toxicity levels. In fact this is not the way we believe that progress is to be made. 3). The goal of the trial is still to identify a level of dose whose probability of severe toxicity is closest to a given percentile of the dose–toxicity curve. . The natural reaction for a statistician is to consider that the response variable. to 4. life-threatening toxicity. using a Bayesian prescription. the prediction leaning largely on the model.Continual Reassessment Method 61 D. . if only moderate. b) . we can make striking gains in efﬁciency since the more frequently observed lower grade toxicities carry a great deal of information on the potential occurence of dose-limiting toxicities. This idea is used implicitly in the two-stage designs described in Section III. without making strong assumptions. at least hypothetically. In this case it turns out that we neither lose nor gain efﬁciency. most studies record information on the degree of toxicity. from 0. in which very careful modeling can lead to efﬁciency improvements. The issue is not that of modeling a response (toxicity) at ﬁve levels but of controlling for dose-limiting toxicity. then we need models relating the occurence of dose-limiting toxicities to the occurence of lower grade toxicities. Lower grades are helpful in that their occurence indicates that we are approaching a zone in which the probability of encountering a dose-limiting toxicity is becoming large enough to be of concern. Such a situation would also allow gains in safety since. If we are to proceed more formally and hopefully extract yet more information from the observations. A working model for the CRM could be Pr(Y k Pr(Y k Pr(Y k 3) ψ 1 (x k. Let Y k denote the toxic response at level k. . toxicity. and the method behaves identically to one in which the only information we obtain is whether or not the toxicity is dose limiting. complete absence of side effects.1) variable. In the unrealistic situation in which we can accurately model the ratio of the probabilities of the different types of toxicity. These two situations suggest a middle road. we might decide we know nothing about the relative rates of occurrence of the different toxicity types and simply allow the accumulating observations to provide the necessary estimates. (k 1. To make this more precise.

a. random samples with mean zero and unit variance. a) when Y k 2. This is due to the parameter orthogonality. . . . This work is incomplete at the time of this writing. y n ). Let us imagine that the parameter b is known precisely. . . The modiﬁed continual reassessment method. . . With no prior information and maximizing the likelihood. . and ψ 2 (x k. This problem has its application in both industrial experiments and medical research. . ε n are i. RELATED APPROACHES A number of other designs for phase I studies have been suggested. such as that of Anbar (2– 4) and Wu (32. y n. The model need not be correctly speciﬁed. Stochastic Approximation The problem is considered from the angle of estimating the root of a regression function M(x) from observations (x 1. so that ξ 0 can be estimated consistently and efﬁciently from the corresponding observations y 1. . A. Errors can then be overwritten by the data. although b should maintain interpretation outside the model. b) when Y k 1. . It is described below. . y i ) have the following relation: yi M(x i ) σε i The errors ε 1. should we be wrong in our assumed value that this induces a noncorrectable bias. Let θ 0 be a real value and ξ 0 be the solution for M(x) θ 0. although there is the advantage of learning about the relationship between the different toxicity types. are in fact quite closely related. that is. . . expressed via a prior distribution. (x n. Following the pioneering work of Robbins and Monro (24).i. x n. where (x i. b) ψ 1 (x k. Schemes that predate the CRM. a) when Y k 3. We are interested in sequential determination of the design values x 1. . we obtain exactly the same results as with the more usual one-parameter CRM. Efﬁciency gains are then quite substantial (21). is clearly related to CRM.33). a. leaning on stochastic approximation. as its name suggests. ψ 1 (x k. . There is therefore no efﬁciency gain. To overcome this we have investigated a Bayesian setup in which we use prior information to provide a ‘‘point estimate’’ for b but having uncertainty associated with it. y 1 ). VI.62 O’Quigley The contributions to the likelihood are 1 ψ 2 (x k. This is not of direct practical interest since the assumption of no error in b is completely inﬂexible. but early results are encouraging. for instance some simple function of the ratio of grade 3 to grade 2 toxicities.d. .

stochastic approximation has been applied to this problem and has been studied by many authors. The Robbins-Monro procedure calculates the design values sequentially according to

x_{n+1} = x_n - (c/n)(y_n - θ_0)    (12)

where c is some constant. Under certain circumstances the procedure converges to ξ_0. Anbar (2,3) used Robbins-Monro in the context of phase I dose finding; considerably more observations were required to achieve relative stability than for the CRM (16), although it was clear that such designs were superior to the standard up and down design. Lai and Robbins (1979) pointed out the connection between Eq. (12) and the following procedure based on the ordinary linear regression applied to (x_1, y_1), ..., (x_n, y_n):

x_{n+1} = x_n - β̂_n^{-1}(y_n - θ_0)    (13)

where β̂_n is the least-squared estimate of β. Wu (33) gave a heuristic argument that Eq. (12) is equivalent to Eq. (13). More precisely, stochastic approximation can be considered as fitting an ordinary linear model to the existing data, treating the regression line as an approximation of M(·), and using it to calculate the next design point. As pointed out by Wu (32), the Robbins-Monro procedure is unstable at places where the function M(x) is flat. Wu (32) then proposed truncating β̂_n whenever it becomes too large. We believe, however, that the intrinsic source of instability lies in the use of a model with two parameters (the intercept and the slope) in the estimation of a single root. Imagine that after many experiments the design points have concentrated around the root ξ_0. With relatively few points outside the small region around the root, there are infinitely many pairs that fit the data equally well, since for every given intercept we can find a slope passing through the observations. It is then quite possible for the estimate of β to be unstable if a regression line is fitted to data with both intercept and slope. Intuitively, a one-parameter model would be sufficient for determining ξ_0 if most of the design values are around it. It should be possible to estimate ξ_0 by fitting the data with a linear model without an intercept, sequentially estimating the slope, irrespective of whether M(·) is flat or not. This stabilizes the estimate of β. The main concern is the consistency of the estimate, since the model becomes less flexible and thus it may be more difficult for it to capture the nature of the data. To see this, consider the simple case in which the data are generated according to an ordinary linear regression (M(x) = α + βx):

Y = α + βX + σε,  α ≠ 0, X > 0    (14)

Then ξ_0 is the solution of the equation α + βx - θ_0 = 0.
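For concreteness, the Python sketch below simulates the classic Robbins-Monro update of Eq. (12) and the underparametrized, through-the-origin recursion whose explicit form is derived just below. The response curve, starting values, and tuning constants are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, sigma = 0.5, 0.2
a_true, b_true = 0.3, 0.8                       # data follow Y = a + b X + noise
xi0 = (theta0 - a_true) / b_true                # root of a + b x = theta0

def robbins_monro(x1=2.0, c=2.0, n=200):
    """Eq. (12): x_{n+1} = x_n - (c/n) * (y_n - theta0)."""
    x = x1
    for step in range(1, n + 1):
        y = a_true + b_true * x + sigma * rng.standard_normal()
        x = x - (c / step) * (y - theta0)
    return x

def through_origin(x1=2.0, n=200):
    """No-intercept recursion: beta_hat = ybar / xbar and x_{n+1} = theta0 / beta_hat."""
    xs, ys = [x1], []
    for _ in range(n):
        ys.append(a_true + b_true * xs[-1] + sigma * rng.standard_normal())
        xs.append(theta0 * np.mean(xs) / np.mean(ys))
    return xs[-1]

print(round(robbins_monro(), 3), round(through_origin(), 3), "true root:", round(xi0, 3))
```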

Suppose that we have collected data (x_1, y_1), ..., (x_n, y_n). Then we can fit the underparametrized regression line going through the origin. This results in an estimate of the slope β̂_n = ȳ_n/x̄_n, where x̄_n = Σ x_i/n and ȳ_n = Σ y_i/n. Solving β̂_n x = θ_0 yields the recommended design value for the next experiment:

x_{n+1} = θ_0/β̂_n = (θ_0/ȳ_n) x̄_n    (15)

This process is repeated after observation of y_{n+1}. Note that the model used by the procedure is different from that generating the data. The design point x_n and the average x̄_n can both serve as estimates of ξ_0, and these estimates are called consistent if they converge to ξ_0 almost surely when n goes to infinity. The definition of x̄_n implies that its consistency is equivalent to that of x_n. The conditions for consistency have been identified by Shen and O'Quigley (25), and the main arguments are close to those showing the consistency of the CRM.

B. Modified, Extended, and Restricted CRM

The operating characteristics of CRM depend, as has been outlined in Section II, on certain arbitrary specifications. The modified continual reassessment method (7) was developed to deal with perceived problems in operating characteristics, in particular the possibility of jumping dose levels. Note, however, that once the working model and so forth have been specified, the levels recommended by the method are entirely deterministic: it is enough to specify given paths of visited levels and the associated responses to know exactly which level will be recommended by CRM. Before we carry out any given trial, we can therefore put ourselves in a position of knowing just what the behavior will be when faced with some particular circumstance, and there is no cause to be concerned about unanticipated, erratic, or aberrant behavior. If we wish to leave the paths as random, depending on the possible outcomes, then the appropriate tool to use is that of simulation. For any situation or class of situations we can obtain the operating characteristics and so decide on the most appropriate design within the class of CRM designs. This would be analogous to calculating sample size, degree of balance, and stratification issues in the more common design context of randomized phase III studies. However, such problems as the possibility of jumping dose levels do not arise when CRM is correctly implemented (Section III.B). There is enough flexibility in the CRM to obtain any reasonable characteristics we might believe necessary. Correct implementation is preferable to using schemes with ad hoc design modifications resulting in potentially poor

’’ in such circumstances. The usual CRM will never recommend escalation after an observed toxicity and will never skip doses when escalating. introduce two further rules. In addition. MCRM ﬁxes a problem that does not exist under the usual guidelines. as can be seen by studying Fig. MCRM. Note that none of the comparative studies in Tables 4. it is important to underline that the perceived difﬁculties arise from an implementation of CRM differing from that described in this current work as well as in the original paper by O’Quigley et al. the apparently alarming example that Moller quotes in which CRM recommends treating the second entered patient at level 10. However. ordered by increasing probabilities of toxic reaction (see also Sect. the relatively steep dose– toxicity working models of Section II. As described by O’Quigley et al. 1 of Faries (7). to our knowledge. Faries suggested ﬁxing these awkward operating characteristics by continuing with the same prescription but. A comparative study of MCRM and CRM.A may be required to dampen oscillations and reduce instability. 5. the ‘‘doses’’ are conceptual. The method’s prescription is overruled. following a nontoxicity for the ﬁrst patient at level 1.Continual Reassessment Method 65 operating characteristics. The ﬁrst of these is that dose escalation after an observed toxicity is not allowed. (15). if we are to use a Bayesian implementation of CRM and our prior knowledge is weak. We may conclude that there is then little to choose between CRM and MCRM. O’Quigley and Shen (17). The second rule is that escalation should never be more than a single level so that ‘‘skipping’’ is also overruled when indicated by the method. CRM does not work with actual doses. rather than use the recommended level. lacking the ﬂexibility we may need. Our few limited studies indicate that MCRM performs poorly when compared with CRM. does not behave as we might hope. (15). The priors selected by Faries (7). For studies with a large number of dose levels. The new modiﬁed method is called MCRM. sometimes referred to as MCRM. The consequences of the particular setup for CRM by Faries (7) were twofold: It was possible to observe dose escalation after an observed toxicity and it was possible to observe large skips in the dose levels after nontoxicities. For the standard case of six dose levels studied by O’Quigley et al. Essentially MCRM works with a different dose–toxicity function. II). working with the actual doses as does MCRM can be problematic. then this must be reﬂected in the choice of prior. and 6 of Faries compare MCRM with CRM. and Shen and O’Quigley (25). 12 or more (see 13). and although large sample behavior should be similar to CRM for small ﬁnite samples. would as with Faries. are informative. and we allocate at the same level. never arise had CRM been implemented according to O’Quigley . It should be no surprise at all that ‘‘CRM. (15). the problem of skipping really could arise. as does Faries. Before considering this method. has yet to be carried out. can be unstable. However. in the context of a one-parameter model where the dose is now the real dose.

III. as happened in Moller (13). we may wish to use a model that could jump a level. Certainly we would not skip from level 1 to level 10. (15). but as described above for MCRM.B) and Ahn (1). This can be relatively rapid or relatively slow. The model chosen by O’Quigley. As mentioned earlier. Furthermore. between 12 and 20 levels and relatively few subjects. It is easy to understand what is going on from Fig. conclusions based on the comparisons should not be given very much weight. 1 of Moller (13).66 O’Quigley et al. then the horizontal line at the target θ 0. assume the particular model to truly generate the data. given different models. The subsequent attempt to make the exponential prior noninformative by modifying the conceptual doses is certainly interesting albeit not the easiest way to go and would need further study before it could be recommended. we are of course inevitably ‘‘jumping’’ dose levels. for instance. This is therefore not a real issue. The comparisons of the respective performances of restricted CRM and original CRM are based on this rather unusual setup. Skipping doses is an issue that necessitates some thought. although there is some uncertainty in this quantiﬁed by our prior. the choice of an exponential prior would not seem a good one. such operating characteristics depend on the choice of model and given design features such as the number of dose levels and the value of θ. in the context of six doses is such that it is not possible to skip doses when escalating. (15). The prior uncertainty should be quantiﬁed by the function g(a). certainly not the CRM as described by O’Quigley et al. unlike others in this area. Essentially this particular Bayesian setup expresses the idea that we believe the correct level to be level 10. since there is always the conceptual possibility of having intermediary levels. If the experimenter wishes to modify these properties. (15). Moller also suggests the same rule as Faries. these comparisons. we then recommend level 10 to patient 2. it addresses problems equally well solved within the framework of the standard setup. et al. It is therefore perfectly natural that integrating some further small piece of information (the ﬁrst patient treated at the lowest level and tolerating the treatment). For larger number of doses or other choices of models. skipping could occur. we would not work with such models. and had this been chosen according to the prescription of O’Quigley et al. and the relevant issue is to understand at the trial outset just how we proceed through the available levels. However. The middle curve is the working model. To prevent such an occurrence. Typically. This design is called restricted CRM. The posterior and the prior are almost indistinguishable (see also Sect.4 would meet the lowest level rather than level 10 as in the ﬁgure. This is not realistic. depending on initial design and the chosen model. unless we made this a feature of the design. Indeed. then this . It should be viewed as an operating characteristic of the method. For these reasons. In Moller’s example.

(15). However. Whitehead and Williamson (31) worked with some of the more classic notions from optimal design for choosing the dose levels in a bid to establish whether much is lost by using suboptimal designs. for the sake of simplicity.Continual Reassessment Method 67 can be done by changing the model. Any prior information can subsequently be incorporated via the Bayes formula into a posterior density that also involves the actual current observations. This is a particular case of the two-stage designs described in Section III. But they are unlikely to come from ad hoc improvisations that have. In O’Quigley et al. observed from outside the trial and that sollicited from clinicians and/or pharmacologists. Gatsonis and Greenhouse (8) consider two-parameter probit and logit models for dose response and study the effect of different prior distributions. up and down schemes are grafted onto a methodology with a more solid foundation and whose operating characteristics can be anticipated. in which we quantify prior information. (15) ruled out criteria based on optimal design due to the ethical criterion of the need to attempt to assign the sequentially included . O’Quigley et al.C. and very poorly behaved. Hopefully Section III makes it clear that there is nothing especially Bayesian about the CRM (see also 30). Given the typically small sample sizes often used. have been suggested for use in the context of phase I trial designs. VII. (15) we could also work with informative priors. dominated the area of phase I trial design for many years. we believe the two-stage design to be the most ﬂexible and generally applicable design.’’ or other ‘‘improved’’ CRM designs where rules of thumb. and not simply a Bayesian estimator. This is in our view to be preferred over ‘‘modiﬁed. Unlike the setup described by O’Quigley et al. unfortunately. Moller (13) also refers to extended CRM. This is not to say that improvements cannot be made. BAYESIAN APPROACHES The CRM is often referred to as the Bayesian alternative to the classic up and down designs used in phase I studies. Bayesian estimators and vague priors were proposed. As we pointed out there. having their inspiration largely from the old. Decisions are made more formally using tools from decision theory. in particular the maximum likelihood estimator. Whitehead and Williamson (31) carried out similar studies but with attention focusing on logistic models and beta priors. a fully Bayesian approach has some appeal in that we would not wish to waste any relevant information at hand. More fully Bayesian approaches. By more fully we mean more in the Bayesian spirit of inference. there is nothing to prevent us from working with other estimators.’’ ‘‘restricted.

This was done by O’Quigley. Two-parameter CRM was seen to behave poorly (15) and is generally inconsistent (25). A promising area for Bayesian formulations is one where we may have little overall knowledge of any dose–toxicity relationship but we may have some. et al. it does provide some theoretical comfort and hints that for ﬁnite samples things might work out okay too. This same point was also emphasized by Whitehead and Williamson (31). The context is fully Bayesian. This is interesting in that it could be argued that the aim of the approach translates in some ways more directly the clinician’s objective than does CRM.A). we can conclude that a judicious choice of model and prior. (15) for the usual underparametrized CRM. not running into serious conﬂict with the subsequent observations. It may well be argued that large sample consistency is not very relevant in such typically small studies. knowledge of some secondary aspect of the problem. A quite different Bayesian approach has been proposed by Babb et al. Nonetheless. an unintuitive ﬁnding at ﬁrst glance but one that makes sense in the light of the comments in Section VI. Whitehead and Williamson (31) suggest that CRM could be viewed as a special case of their designs with their second parameter being assigned a degenerate prior and thereby behaving as a constant. However. Model misspeciﬁcation was not investigated and would be an interesting area for further research. possibly considerable. the aim here is to escalate as fast as possible toward the MTD while sequentially safeguarding against overdosing. then we might carry out large numbers of ﬁnite samples studies. Such comparisons remain to be done for the Bayesian methods discussed here. and the methodology may be a useful modiﬁcation of CRM when primary concern is on avoiding overdosing and we are in a position to have a prior on a two-parameter function. there may be concerns about large sample consistency when working with a design that tends to settle on some level. it can be misleading in that for the single sample case. We have to view the single parameter as necessary in the homogeneous case and not simply a parametric restriction to facilitate numerical integration (see also Sect. (5). simulating most often under realities well removed from our working model. Con- . if we fail to achieve large sample consistency. This was true even when the data were truly generated by the same two-parameter model. VI. may help inference in some special cases. two-parameter CRM and one-parameter CRM are fundamentally different.68 O’Quigley patients at the most appropriate level for the patient.A. The approach appears promising. As above. Although in some senses this view is technically correct. Rather than concentrate experimentation at some target level as does CRM. We do not believe it will be possible to demonstrate large sample consistency for either the Gatsonis and Greenhouse (8) approach or that of Whitehead and Williamson (31) as was done for CRM by Shen and O’Quigley (25). At the very least.

through bias. when the assumption are violated. Nonetheless. Regression models have been suggested to address issues of patient heterogeneity (26). As already mentioned in Section V. but given the typically small sample sizes and the implementation algorithms via underparametrized models. a Bayesian approach would allow us to specify the anticipated direction with high probability while enabling the accumulating data to override this assumed direction if the two run into serious conﬂict. a group weakened by extensive prior therapy. VIII. We do not believe that ad hoc modiﬁcations such as MCRM will be fruitful. Very often we do not know the number of levels that may be used. for instance. via two-stage designs. In our own applied work. Careful parametrization would enable this information to be included as a constraint.Continual Reassessment Method 69 sider the two-group case. a Bayesian setup opens up the possibility of compromise so that constraints become modiﬁable in the light of accumulating data. Uninformative Bayes or maximum likelihood would then seem appropriate. is most likely to have a level strictly less than that for the other group. Incorporating such information will improve efﬁciency. we can achieve clear efﬁciency gains. we have improvised when faced with such situations. A more systematic approach may be useful. But we may well know that one of the groups. concerning CRM. under some strong assumptions. This appears natural. relatively rapid escalation to the higher levels or models that require accumulating a lot of precision on the safety of some particular level before further escalation. for example. if the model being constructed at the beginning of the second stage. However. both practical and theoretical. great care is . A deep understanding of how certain types of changes to the model correspond to certain types of changes in operating characteristics could ultimately lead to selecting models on the basis of some particularly desired behavior. Once again. and it would clearly be helpful to have a better understanding of this. In such cases the model cannot be determined in advance. RESEARCH DIRECTIONS There are many outstanding questions. and we do not therefore see this as a useful research area. For the actual dose levels we are looking for we may know almost nothing. when incorporating information on the graded toxicities. operating characteristics depend to some extent on arbitrary speciﬁcation parameters such as the chosen model. Exactly the same idea could be used in a case where we believe there may be group heterogeneity but that it be very unlikely the correct MTDs differ by more than a single level. rather than work with a rigid and unmodiﬁable constraint. Such gains can be wiped out and even become negative.

Goodman S. Practical modiﬁcations of the continual reassessment method for phase I cancer clinical trials. Piantadosi S. 17:1103–1120. Nonetheless. 11:1377–1389. 2. some or all of which may not be available before deciding on the appropriate dose level for the patient. 3. Within-patient dose escalation is frequently undertaken in practice but not analyzed as such. Faries D. 4:147–164. 8. PK/PD information. 1:191–206. Rogatko A. in conjunction with careful modeling. and as the method is increasingly used these will require more attention from statisticians. The most interesting problems. 14:1149–1161. Finally. 7. 9. whether toxicities are cumulative or whether cumulative treatment provides some kind of protection. Deeper studies showing how and in which situations advantage can be drawn from a regression model are needed. as always. raise particular methodological questions. Stat Med 1992. The continual reassessment method in cancer phase I clinical trials: a simulation study. REFERENCES 1. . Anbar D. A stochastic Newton-Raphson method. 17: 1537–1549. Stochastic approximation methods and their use in bioassay and Phase I clinical trials. Indeed. 6. Anbar D. Commun Statist 1984. The application of stochastic methods to the bioassay problem. are those arising from practical applications. Anbar D. and efﬁciency of different stopping rules are just some of these.70 O’Quigley needed. J Stat Planning Inference 1977. Some practical improvements in the continual reassessment method for phase I studies. inference under misspeciﬁed models. Some of these are mentioned in the above paragraph. Stat Med 1998. 12:1093–1108. Modeling would necessarily be difﬁcult in view of the complex potential relationships governing the outcomes at different levels for the same patient. J Biopharm Stat 1994. 13:2451–2467. Gatsonis C. Bayesian approaches. Bayesian methods for phase I clinical trials. Greenhouse JB. There are numerous outstanding theoretical issues to be resolved: Conditions for convergence. Zahurak ML. will continue to be treated to the clinician’s best ability and often at doses higher than that initially given. Stat Med 1993. the patient. Chevret S. Statist Med 1995. Zacks S. J Stat Planning Inference 1978. incorporation of random allocation. 2:153–163. 4. Stat Med 1998. Cancer Phase I clinical trials: efﬁcient dose escalation with overdose control. Babb J. 5. open up many possibilities. Ahn C. regulatory agencies sometimes disallow the inclusion of information on doses other than that at which the patient was ﬁrst treated. An evaluation of phase I cancer clinical trial designs. optimality. even if off study.

JR Stat Soc B 1981. Small-sample conﬁdence sets for the MTD in a phase I clinical trial. 46:33–48. Korn EL. JNCI 1993. Storer BE. 14:911–922. 48:853–862. Robbins H. A comparison of two Phase I trial designs. 28. Storer BE. J Biopharm Stat 1999. On the existence of maximum likelihood estimators for the binomial response models. 27. 23. 19. Paoletti X. 12. 85:217– 223. Gamst A. Chen TT. Proc 6th Berkeley Symp 1967. 1998. University of California at San Diego. O’Quigley J. Simon R. Simon R. 21. . 30. Continual reassessment method: a practical design for Phase I clinical trials in cancer. 24. O’Quigley J. 22. Another look at two Phase I clinical trial designs (with commentary). 30:303–315. A stochastic approximation method. Improved designs for dose escalation studies using pharmacokinetic measurements. Ann Math Stat 1951. Estimating the probability of toxicity at the recommended dose following a Phase I clinical trial in cancer. Stat Med 1995. 49:1117–1125. Consistency of continual reassessment method in dose ﬁnding studies. 10:1647–1664. Storer BE. Biometrics 1996. Encylopedia of Biostatistics. Design and analysis of Phase I clinical trials. 1:221–233. Moller S. 15:1605–1618. Comp Stat Data Anal 1998. Biometrics 1993. Model-guided determination of maximum tolerated dose in Phase I clinical trials: evidence for increased precision. 85:741–748. 45:925– 937. O’Quigley J. Ratain MJ. Fisher L. Stat Med 1996. 15. Shen LZ. Mick R. An extension of the continual reassessment method using a preliminary up and down design in a dose ﬁnding study in cancer patients in order to investigate a greater number of dose levels. 83:395–406. Arbuck S. The behavior of maximum likelihood estimates under nonstandard conditions. 52:163–174. Pepe M. 1998. New York: Wiley. Biometrika 1996. Monro S. O’Quigley J. 26. O’Quigley J. Freidlin B. Liu G. O’Quigley J. O’Quigley J.Continual Reassessment Method 71 10. 13. 18. Accelerated titration designs for Phase I clinical trials in oncology. Stat Med 1999. 89:1138– 1147. Christian MC. Stat Med 1991. Biometrics 1990. Biometrics 1992. Reiner E. A stopping rule for the continual reassessment method. Biostatistics Group. Phase I clinical trials. Rubinstein LV. 9:17–44. Piantadosi S. Midthune D. Reiner E. 18:2683–2692. 16. Shen LZ. 29: 351–356. Chevret S. Biometrics 1989. Operating characteristics of the standard phase I clinical trial design. Collins J. 29. Huber PJ. Biometrika 1998. Continual reassessment method: a likelihood approach. 43:310–313. Stat Med 1994. 20. 17. 13:1799–1806. Paoletti X. O’Quigley J. O’Quigley J. Using Graded Toxicities in Phase I Trial Design. Technical Report 4. 14. O’Quigley J. Two sample continual reassessment method. Christian M. 11. Silvapulle MJ. Rubinstein L. Methods for dose ﬁnding studies in cancer clinical trials: a review and results of a Monte Carlo study. JNCI 1997. Shen L. 25.

31. Whitehead J, Williamson D. Bayesian decision procedures based on logistic regression models for dose-finding studies. J Biopharm Stat 1998; 8:445–467. 32. Wu CFJ. Efficient sequential designs with binary data. J Am Stat Assoc 1985; 80:974–984. 33. Wu CFJ. Maximum likelihood recursion and stochastic approximation in sequential designs. In: van Ryzin J, ed. Adaptive Statistical Procedures and Related Topics. Institute of Mathematical Statistics Monograph 8. Hayward, CA: Institute of Mathematical Statistics, 1986, pp 298–313. 34. Lai TL, Robbins H. Adaptive design and stochastic approximation. Annals of Statistics 1979; 7:1196–1221.

this dose is referred to as the maximum tolerable dose (MTD) and is presumably the dose that will be used in subsequent phase II trials evaluating efﬁcacy. Occasionally.3 Choosing a Phase I Design Barry E. Since the response of the patient will be unknown before the drug is given. given a 73 . Washington I. referred to as phase IB trials. The problem of deﬁning an acceptably toxic dose is complicated by the fact that patient response is heterogenous: At a given dose. some patients may experience little or no toxicity. one may encounter trials that are intermediate between phase I and phase II. This is a more heterogeneous group but typically includes trials evaluating some measure of biological efﬁcacy over a range of doses that have been found to have acceptable toxicity in a phase I (or phase IA) trial. but most drugs that will be evaluated in phase I trials will prove ineffective at any dose. INTRODUCTION AND BACKGROUND Although the term phase I is sometimes applied generically to almost any ‘‘early’’ trial. Storer Fred Hutchinson Cancer Research Center. whereas others may have severe or even fatal toxicity. Seattle. in cancer drug development it usually refers speciﬁcally to a dose-ﬁnding trial whose major end point is toxicity. There is an implicit assumption with most anticancer agents of a positive correlation between toxicity and efﬁcacy. What constitutes acceptable toxicity of course depends on the potential therapeutic beneﬁt of the drug. This chapter focuses exclusively on phase I trials with a toxicity end point. For example. acceptable toxicity is typically deﬁned with respect to the patient population as a whole. The goal is to ﬁnd the highest dose of a potential therapeutic agent that has acceptable toxicity.

When deﬁned in terms of the presence or absence of DLT. . one out of three patients would be expected to experience a grade 3 or worse toxicity. The latter limitation also has implications for the relevance of the MTD in subsequent phase II trials of efﬁcacy. . The pressure to use only small numbers of patients is large—literally dozens of drugs per year may come forward for evaluation. and for dose d we have ψ(d ) Pr(Y 1|d). if the agent in question is completely novel. so that one must start at a dose level believed almost certainly to be below the MTD and gradually escalate upward. Notationally. The latter is referred to as ‘‘dose-limiting toxicity’’ (DLT) and does not need to correspond to a deﬁnition of unacceptable toxicity in an individual patient. Initial Dose Level and Dose Spacing The initial dose level is generally derived either from animal experiments. then the trial is stopped. the MTD can be deﬁned as some quantile of a dose–response curve. and the dose spacing. where θ is the desired probability of toxicity. A. are discussed in more detail below. and is not driven traditionally by rigorous statistical considerations requiring a speciﬁed degree of precision in the estimate of MTD. . Furthermore. fatal). and each route of administration requires a separate trial. Beginning at the ﬁrst dose level. . moderate. the traditional phase I trial design uses a set of ﬁxed dose levels that have been speciﬁed in advance. the number of patients for whom it is considered ethically justiﬁed to participate in a trial with little evidence of efﬁcacy is limited. The second is the fact that the number of patients typically available for a phase I trial is relatively small. mild.74 Storer toxicity grading scheme ranging from 0 to 5 (none. and each combination with other drugs. depending on whether a patient does or does not experience DLT. The choice of the initial dose level d 1. d ∈ {d 1. life threatening. small numbers of patients are entered. d K}.1 There are two signiﬁcant constraints on the design of a phase I trial. then the MTD is deﬁned by ψ(dMTD ) θ. on average. Since the patient populations are different. one might seek the dose where. and the decision to escalate or not depends on a prespeciﬁed algorithm related to the occurrence of DLT. respectively. if Y is a random variable whose possible values are 1 and 0. When a dose level is reached with unacceptable toxicity. say 15–30. II. or by conservative consideration of previ- . The ﬁrst is the ethical requirement to approach the MTD from below. each schedule. that is. it is not clear that the MTD estimated in one population will yield the same result when implemented in another. typically three to six. d 2. severe. DESIGNS FOR PHASE I TRIALS As a consequence of the above considerations.

(B2) If at least two of six have DLT. . . then increase dose to d k 1 and go to (A). For purposes of illustration. then increase dose to d k 1 and go to (A). and thereafter d k 1 1. (B) Evaluate an additional three patients at d k: (B1) If one of six have DLT. route of administration. d 4 10d 2. that is. then go to (D). we describe the following. (C) Discontinue dose escalation. but implicitly 0. . d 3 10d 2. Traditional Escalation Algorithms A wide variety of dose escalation rules may be used.67d 2. d 2 2d 1. d 3 10d 1. if the agent in question has been used before but with a different schedule. adjusted for the size of the animal on a per kilogram basis or by some other method. . so that the MTD becomes the highest dose level where no more than one toxicity is observed in six patients. (A) Evaluate three patients at d k: (A1) If zero of three have DLT. if necessary.}. then go to (C). Subsequent dose levels are determined by increasing the preceding dose level by decreasing multiples. (A3) If three of three have DLT. so we could take θ 0. the dose levels may be determined by log. A common starting point based on the former is from 1/10 to 1/3 of the mouse LD 10. . If the trial is stopped. d 2 3d 1. and that process should procede downward.}. (A3) If at least two of three have DLT. . spacing. . then increase dose to d k 1 and go to (A). The actual θ that is desired is generally not deﬁned when such algorithms are used. {d 1. is described here.Choosing a Phase I Design 75 ous human experience. that is.4d 4. d 5 10d 3. the dose levels are equally spaced on a log scale. then go to (B). With some agents. d 2 10d 1. then go to (C). d 3 1. which is often referred to as the traditional ‘‘3 3’’ design. d 4 1. .5d 3. the dose that kills 10% of mice. then an additional three should be entered.d 4 10d 3.d 5 1.33. Beginning at k 1. (A2) If one or two of three have DLT. particularly biological agents. or with other concomitant drugs.25. or approximate half-log. (A2) If one of three have DLT. . then the dose level below that at which excessive DLT was observed is the MTD.} Such sequences are often referred to as ‘‘modiﬁed Fibonacci.’’2 Note that after the ﬁrst few increments. . Beginning at k 1. referred to as the ‘‘bestof-5’’ design. (A) Evaluate three patients at d k: (A1) If zero of three have DLT. {d 1. then go to (B).17 θ 0. B. Another example of a dose escalation algorithm. Some protocols may specify that if only three patients were evaluated at that dose level.33d k. a typical sequence being {d 1. for a total of six.
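Because the escalation rules just described are purely algorithmic, they can be written out directly. The sketch below is only an illustration of the 3 + 3 logic summarized in this section; the function names and cohort bookkeeping are mine rather than the chapter's, and the second helper computes the level-wise probability of escalating past a dose, the quantity used for the operating characteristics discussed next.

```python
from math import comb

def three_plus_three_decision(dlt_outcomes):
    """Decision rule of the traditional 3+3 design at one dose level.
    dlt_outcomes holds 0/1 DLT indicators for the patients evaluated so
    far at this level (a first cohort of 3, possibly expanded to 6)."""
    n, tox = len(dlt_outcomes), sum(dlt_outcomes)
    if n == 3:
        if tox == 0:
            return "escalate"
        if tox == 1:
            return "expand"        # treat three more patients at this level
        return "stop"              # 2 or 3 of 3 with DLT: do not escalate
    if n == 6:
        return "escalate" if tox <= 1 else "stop"
    raise ValueError("expected 3 or 6 evaluated patients at a level")

def p_escalate_3plus3(p):
    """Level-wise probability of escalating past a dose with true DLT
    probability p: either 0/3 DLT, or 1/3 DLT followed by 0/3 more."""
    b = lambda k, n: comb(n, k) * p ** k * (1 - p) ** (n - k)
    return b(0, 3) + b(1, 3) * b(0, 3)
```

For example, p_escalate_3plus3(1/3) is roughly 0.43, so at a dose whose true DLT probability is 1/3 the 3 + 3 rule escalates a little less than half of the time.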

only brief reﬂection is needed to see that the determination of MTD will have a rather tenuous statistical basis. Again. Crude comparisons among traditional dose escalation algorithms can be made by examining the level-wise operating characteristics of the design. 0–0. and d 3 is 0 of 3. Although the middle dose would be taken as the estimated MTD.54. (B3) If three of four have DLT.09–0. The probability of escalation can then be plotted over a range of ψ(d ).25. 1 for the two algorithms described above. Consider the outcome of a trial using the 3 3 design where the frequency of DLT for dose levels d 1. the probability of escalation is B(0. ψ(d )) is the probability of r successes (toxicities) out of n trials (patients) with underlying success probability at the current dose level ψ(d ). respectively.67.76 Storer (B) Evaluate an additional one patient at d k: (B1) If one of four have DLT. 1 of 6. ψ(d )) B(1. 3. and 2 of 6. C. but we could take θ 0. then go to (D). 3. the value of θ is not explicitly deﬁned.51. there is not even reasonably precise evidence that the toxicity rate for any of the three doses is either above or below the implied θ of approximately 0. Ignoring the sequential nature of the escalation procedure. respectively. the probability of escalating to the next dose level given an assumption regarding the underlying probability of DLT at the current dose level. in the 3 3 algorithm described above. then go to (D).40. Although traditional designs reﬂect an empirical common sense approach to the problem of estimating the MTD under the noted constraints. More useful approaches to choosing among traditional designs and the other designs described below are discussed in Section III. as is done in Fig. then increase dose to d k 1 and go to (A). 0. d 2. where B(τ. (C) Evaluate an additional one patient at d k: (C1) If two of ﬁve have DLT. the level-wise operating characteristics do not provide much useful insight into whether or not a particular design will tend to select an MTD that is close to the target. then go to (C). then increase dose to d k 1 and go to (A). (D) Discontinue dose escalation. this calculation is a function of simple binomial success probabilities. (C3) If three of ﬁve have DLT. 3. For example. Usually. A Bayesian Approach: The Continual Reassessment Method The small sample size and low information content in the data derived from traditional methods have suggested to some the usefulness of Bayesian methods . ψ(d )) B(0. 0. Although it is obvious from such a display that one algorithm is considerably more aggressive than another. (B2) If two of four have DLT. n. that is. ψ(d )). the pointwise 80% conﬁdence intervals for the rate of DLT at the three dose levels are.02–0.

. An example of such a function is ψ(d. is the most likely to have an associated probability of DLT equal to the desired θ. The probability of escalating to the next higher dose level is plotted as a function of the true probability of DLT at the current dose. In principle. In fact.. a) p and in particular that ψ(d MTD.2).Choosing a Phase I Design 77 Figure 1 Level-wise operating characteristics of two traditional dose escalation algorithms. d does not need to correspond literally to the dose of a drug. this approach allows one to combine any prior information available regarding the value of the MTD with subsequent data collected in the phase I trial to obtain an updated estimate reﬂecting both. say {d 1. Note that ψ is not assumed to be necessarily a dose–response function relating a characteristic of the dose levels to the probability of toxicity. to estimate the MTD. . The most clearly developed Bayesian approach to phase I design is the continual reassessment method (CRM) proposed by O’Quigley and colleagues (1. a 0 ) θ. the treatments at each of the dose levels may be completely unrelated. in this case d could be just the index . From among a small set of possible dose levels. based on all available information. It is assumed that there is a simple family of monotone dose–response functions ψ such that for any dose d and probability of toxicity p there exists a unique a where ψ(d. d 6}. as long as the probability of toxicity increases from each dose level to the next. . That is. experimentation begins at the dose level that the investigators believe. a) [(tanh d 1)/2]a.

A prior distribution g(a) is assumed for the parameter a such that for the ∞ θ or. a)g(a)da ∞ ψ(d 3. In practice. initial dose level.3 D. there is nothing in the approach that prohibits one from starting at the same low initial dose as would be common in traditional trials or from updating after groups of three patients rather than single patients. The next patient is then treated at the dose level minimizing some measure of the distance between the current estimate of the probability of toxicity and θ. µ a ) θ. The particular prior used should also reﬂect the degree of uncertainty present regarding the probability of toxicity at the starting dose level. In spite of these advantages. A further advantage is that unlike traditional algorithms. Storer’s Two-Stage Design Storer (6. in general. CRM will tend eventually to select the dose level that has a probability of toxicity closest to θ (3). the basic framework of CRM can easily be adapted to a non-Bayesian setting and can conform in practice more closely to traditional methods (5). the design is easily adapted to different values of θ. this will be quite vague. the latter have a tendency to become ‘‘stuck’’ and oscillate between dose levels when any data conﬁguration leads to a large estimate for the slope parameter. alternatively. although its practical performance should be evaluated in the small sample setting typical of phase I trials. The uniqueness constraint implies in general the use of oneparameter models and explicitly eliminates popular two-parameter dose–response models like the logistic. Even if the dose–response model used in updating is misspeciﬁed. the dose level selected as the MTD is the one that would be chosen for a hypothetical n 1st patient. where µ a ∫ 0 ag(a)da. either ∫ 0 ψ(d 3. However.78 Storer of the dose levels. After each patient is treated and the presence or absence of toxicity observed. the Bayesian prior can be abandoned entirely and the updating after each patient can be fully likelihood based. . Allowing for some ad hoc deterministic rules to start the trial off.7) explored a combination of more traditional methods implemented in such a way as to minimize the numbers of patients treated at low dose levels and to focus sampling around the MTD. After a ﬁxed number n of patients has been entered sequentially in this fashion. some practitioners object philosophically to the Bayesian approach. for example d 3. calculated by either method above (1). these methods also use an explicit dose– response framework to estimate the MTD. For example. the current distribution g(a) is updated along with the estimated probabilities of toxicity at each dose level. and it is clear in the phase I setting that the choice of prior can have a measurable effect on the estimate of MTD (4). An advantage of the CRM design is that it makes full use of all the data at hand to choose the next dose level.

then increase dose to d k 1 and go to (A). The ﬁrst stage assigns single patients at each dose level and escalates upward until a patient has DLT or downward until a patient does not have DLT. (A3) If at least one patient has and has not had DLT. Algorithmically. Note that the ﬁrst stage meets the requirement for heterogeneity in response needed to start off a likelihood-based CRM design and could be used for that purpose. To obtain a meaningful estimate of the MTD. then increase dose to d k 1 and go to (B). a target θ different from 1/3 would probably lead one to use a modiﬁed second-stage algorithm. Storer (7) also evaluated different methods of providing conﬁdence intervals for the MTD. Although other quantiles could be estimated from the same estimated dose–response curve. As noted. a dose–response model is ﬁt to the data and the MTD estimated by maximum likelihood or other method. one could use a logistic model where logit (ψ(d)) α β log(d ). then decrease dose to d k 1 and go to (A). then it is natural to use cohorts of size three. then if the current patient has not had DLT. (B2) If one of three have DLT. More accurate conﬁdence sets can be con- . beginning at k 1. If θ 1/3. otherwise. decrease the dose to d k 1 and go to (B). (B3) If at least two of three have DLT. then go to (B). (A) Evaluate one patient at d k: (A1) If no patient has had DLT. without markedly increasing the proportion of patients treated at dose levels where the probability of DLT is excessive.Choosing a Phase I Design 79 The design has two stages and uses a combination of simple dose-escalation algorithms. however. the algorithm described above is designed with a target θ 1/3 in mind. If this is not the case. two-parameter models have undesirable properties for purposes of dose ˆ escalation. go to (B). then one needs either to add additional cohorts of patients or substitute a more empirical estimate. and a likelihood ratio method) are often markedly anticonservative. Standard likelihood-based methods that ignore the sequential sampling scheme (a delta method. a method based on Fieller’s theorem. For example. Extensive simulation experiments using this trial design compared with more traditional designs demonstrated the possibility of reducing the variability of point estimates of the MTD and reducing the proportion of patients treated at very low dose levels. After completion of the second stage. A two-parameter model is used here to make fullest use of the ﬁnal sample of data. as noted above. such as the last dose level or hypothetical next dose level. then decrease dose to d k 1 and go to (B). The second stage incorporates a ﬁxed number of cohorts of patients. whence ˆ ˆ the estimated MTD is log(d MTD ) (logit(θ) α)/β. (A2) If all patients have had DLT. as follows: (B) Evaluate three patients at d k: (B1) If zero of three have DLT. one must have 0 β ∞.

This is also a two-stage design. the MTD might be deﬁned in terms of the mean response. the latter will be used as the MTD. the two-stage design described above has been implemented in a real phase I trial (8). we then have that d MTD (c α)/β. the dose for . in the judgement of the protocol chair. Ad hoc rules for dose escalation are determined by the toxicity experience in the current cohort. and a provision to add additional intermediate dose levels if. E. In the second stage of the study. For the same simple linear model above. Nevertheless. Furthermore. it is useful to consider the case where the major outcome deﬁning toxicity is a continuous measurement. the nadir white blood count (WBC). if necessary. however. however. the resulting conﬁdence intervals are often extremely wide. the methodology is purely frequentist and may be unable to account for minor variations in the implementation of the design when a trial is conducted. the use of a mean response to deﬁne MTD is not generalizable across drugs with different or multiple toxicities and consequently has received little attention in practice. The major modiﬁcations included a provision to add additional cohorts of three patients. suppose that DLT is determined by the outcome Y c. the nature or frequency of toxicity at a dose level precludes further patient accrual at that dose level. where WBC pre nadir. for example. With some practical modiﬁcations. and we have Y N(α βd. Alternatively. Then d MTD (c α Φ 1(θ)σ)/β has the traditional deﬁnition that the probability of DLT is θ. that is. the dose where E(Y) c. which for a hypothetical study of etoposide assumes a simple regression model relating dose to the WBC β 2d. The ﬁrst phase uses cohorts of two patients. Some sequential design strategies in this context have been described by Eichhorn et al. (9). A recent proposal for a design incorporating a continuous outcome is that of Mick and Ratain (11). Fewer distributional assumptions are needed to estimate d MTD. where c is a constant.05 level of signiﬁcance. the model is ﬁt each time and cohorts of two are added until ˆ at least eight patients have been treated and β 2 is signiﬁcantly different from 0 at the 0. and stochastic approximation techniques might be applied in the design of trials with such an end point (10). Continuous Outcomes Although not common in practice. The model is log (WBC) α β 1 log(WBC pre ) is the pretreatment WBC. The use of such a model in studies with small sample size makes some distributional assumption imperative.80 Storer structed by simulating the distribution of any of those test statistics at trial values of the MTD. a provision that if the estimated MTD is higher than the highest dose level at which patients have actually been treated. σ2 ). until the estimate of β in the ﬁtted logistic model becomes positive and ﬁnite. For example. This may or may not involve a fundamentally different deﬁnition of the MTD in terms of the occurrence of DLT.

Such calculations are fairly tedious. and do not accommodate designs with nonstationary transition probabilities. such as the MTD estimated after following Storer’s two-stage design. III. one can then calculate exactly many quantities of interest. the method applies only to situations where the DLT is a single continuous outcome. simulations studies are the only practical tool for evaluating phase I designs.5) α β 1 log(WBC pre ))/β 2. As with exact computations. Many simple designs for which the level-wise operating characteristics can be speciﬁed can be formulated as discrete Markov chains (6). for designs like CRM. The average sample size was also measurably smaller. Nor do they allow one to evaluate any quantity derived from all of the data.5. Useful evaluations of phase I designs must involve the entire dose–response curve. which of course is unknown. such as the number of patients treated at each dose level. This continues until ˆ 2 is signiﬁcantly different from 0 at least eight patients have been treated and β at the 0. only limited information regarding the suitability of a phase I design can be gained from the levelwise operating characteristics shown in Fig. with an absorbing state corresponding to the stopping of the trial. which depend on data from prior dose levels to determine the next dose level. For these reasons. Furthermore. Though such results are promising. d k 1 (log(2. that is.Choosing a Phase I Design 81 the next cohort of two patients is determined by ﬁtting the regression model to the accumulated data and estimating the dose that leads to a mean nadir WBC ˆ ˆ ˆ of 2. Furthermore. such as CRM. For various assumptions about the true dose–response curve. as compared with the MTD estimated from a more traditional design. The states in the chain refer to treatment of a patient or group of patients at a dose level. it is not even possible to specify a levelwise operating characteristic. CHOOSING A PHASE I DESIGN As noted above. 1. Here we give an example of such a study to illustrate the kinds of information that can be used in the evaluation and some of the considerations involved in the design of the study. . the simulation studies that are needed to establish the usefulness of the method in speciﬁc situations often require the use of human pharmacokinetic data that might not be available at the time the study was being planned.001 level of signiﬁcance. Simulation studies of this design using a pharmakinetic model and historic database demonstrated a clear increase in precision in the MTD estimated from the model-based dose-escalation method. however. from the appropriate quantities determined from successive powers of the transition probability matrix P. one needs to specify a range of possible dose–response scenarios and then simulate the outcome of a large number of trials under each scenario.
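As a concrete illustration of this kind of simulation, the sketch below repeatedly runs a simplified 3 + 3 trial over an assumed vector of true DLT probabilities and tabulates which level ends up being called the MTD. The dose–toxicity values, the number of replicates, and the simplification that the level below the stopping level is not expanded further are assumptions made for the example rather than specifications from the chapter.

```python
import random
from collections import Counter

def run_3plus3(tox_probs, rng):
    """Simulate one simplified 3+3 trial.  Returns the index of the level
    selected as the MTD (-1 if escalation stops at the lowest level)."""
    k = 0
    while k < len(tox_probs):
        tox = sum(rng.random() < tox_probs[k] for _ in range(3))
        if tox == 1:                   # expand to six patients at this level
            tox += sum(rng.random() < tox_probs[k] for _ in range(3))
        if tox <= 1:
            k += 1                     # 0/3 or 1/6 DLT: escalate
        else:
            return k - 1               # excessive DLT: MTD is the level below
    return len(tox_probs) - 1          # escalated past the highest level studied

rng = random.Random(2001)
tox_probs = [0.05, 0.10, 0.20, 0.33, 0.50]   # assumed true DLT probabilities
tally = Counter(run_3plus3(tox_probs, rng) for _ in range(5000))
for level in sorted(tally):
    print(level, round(tally[level] / 5000, 3))
```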

Specifying the Dose–Response Curve We follow the modiﬁed Fibonacci spacing described in Section II. . d 6 933. 2.4. An actual protocol might have an upper limit on the number of dose levels.4 Varying the probability of DLT at d 1 while holding the probability at d 5 ﬁxed at θ results in a sequence of dose–response curves ranging from relatively steep to relatively ﬂat. which obviously is unknown. that is. Traditional 3 3 Design This design is implemented as described in Section II. a 3 3 design might.20. Specifying the Designs This simulation will evaluate the two traditional designs described above. d 2 200. for example. Although this is an unlikely occurrence in practice. The starting dose is always d 1.0. We also deﬁne hypothetical dose levels 50. d 0 25.01) 0. and we assume that the true MTD is four dose levels higher. although it is rare for any design to escalate beyond d 10. To deﬁne a range of dose–response scenarios. a clinical protocol should specify any provision to decrease dose if the stopping criteria are met at the ﬁrst dose level. For example. whereas in practice. . which of course need not be exactly at one of the predetermined dose levels. Storer’s two-stage design. with θ 1/3. . d 1 below d 1 that successively halve the dose above. It is important to make the simulation as realistic as possible in terms of how an actual clinical protocol would be implemented or at least to recognize what differences might exist.0. The point is to study the sensitivity of the designs to features of the underlying dose–response curve. In the event that excessive toxicity occurs at d 1. forego the last patient in a cohort of three if the ﬁrst two patients had experienced DLT. 1. and a non-Bayesian CRM design. d 3 333. Similarly. An even greater range could be encompassed by also varying the number of dose levels between the starting dose and the true MTD. . d 7 1244. the simulation always evaluates a full cohort of patients. For example.0. with a provision for how to deﬁne the MTD if that limit is reached. the simulation does not place a practical limit on the highest dose level. we vary the probability of toxicity at d 1 from 0. B. The true dose–response curve is determined by assuming that a logistic model holds on the log scale. with the same rules applied to stopping at d 1. the MTD is taken to be d 0. Traditional Best-of-5 Design Again implemented as described in Section II. .01 (0.0.0. .3.3. d 4 500. we have d 1 100.82 Storer A.0. d 5 700. at d 5. . . . and graph our results as a function of that probability. where patients are more likely entered sequentially than simultaneously. in arbitrary units.

the results are presented in terms ˆ MTD ). The ﬁrst cohort is entered at the same dose level as for the second stage of the two-stage design. The horizontal line at 1/3 is a point of reference for the target θ. if this dose is above the highest dose at which patients have been treated. If it is not the case that 0 ˆ β ∞. all except the conservative 3 3 design perform fairly well across the range of dose–response curves. One should also note . the next level cannot be more than one dose level higher than the current highest level at which any patients have been treated. Storer’s Two-Stage Design Implemented as described in Section II. then the patient is taken to have experienced DLT. however. In either case.75. then the geometric mean of the last dose level used and the dose level that would have been assigned to the next cohort is used as the MTD. with a second-stage sample size of 24 patients. after that successive cohorts are entered using likelihood based updating of the dose–response curve. 1). if that dose is higher than the highest dose at which patients have actually been treated. If the number is less than that probability. The level that would be chosen for a hypothetical additional cohort is the MTD. Non-Bayesian CRM Design We start the design using the ﬁrst stage of the two-stage design as described above. The precision of the estiˆ mates. Simulation and Results The simulation is performed by generating 5000 sequences of patients and applying each of the designs to each sequence for each dose–response curve being evaluated. C. Figure 2 displays results of the simulation study above that relate to the ˆ estimate d MTD. the latter is taken as the MTD. is shown in Fig. taken as the root MSE of the probabilities ψ(d MTD ). the next cohort is treated at the dose level with estimated probability of DLT closest in absolute value to θ. 24 patients are entered in cohorts of three. A standard logistic model is ﬁt to the data. The sequence of patients is really a sequence of psuedo-random numbers generated to be Uniform (0. Each patient’s number is compared with the hypothetical true probability of DLT at the dose level the patient is entered at for the dose–response curve being evaluated. Since the dose scale is arbitrary. Although none of the designs is unbiased. For this purpose we use a single parameter logistic model—a two-parameter model with β ﬁxed at 0. Figure 2(a) displays the mean probability of DLT at the estimated of ψ(d MTD. then the latter is taken as the MTD. In this regard the CRM and two-stage designs perform better than the bestof-5 design over most settings of the dose–response curve.Choosing a Phase I Design 83 3. 2 (b). 4.5 After each updating. Once heterogeneity has been achieved. however.

plotted as a function of the probability of DLT at the starting dose level. Results are expressed in terms of ˆ p(MTD) ψ(d MTD ).84 Storer Figure 2 Results of 5000 simulated phase I trials according to four designs. with θ 1/3. The true MTD is ﬁxed at four dose levels above the starting dose. .

2.Choosing a Phase I Design 85 that. and in particular we ﬁnd that the two designs that perform the best in Fig.’’ In this case there are not large differences among the designs.20 or ψ(d MTD ) 0. the second stage in the two-stage or CRM designs above always uses eight cycles. Although the occurrence of DLT in and of itself is not necessarily undesirable. but this is because it tends to stop well below the target. Panels (a) and (b) present the overall fraction of patients that are treated below and above. For example. 2(c) do not carry an unduly large penalty. One of course could easily evaluate other limits if desired.50.20 is the level at which the odds of DLT are half that of θ. In addition to the average properties of the estimates. The two-stage and CRM designs perform best at avoiding treating patients at the lower dose levels. Figure 3(d) displays the mean number of ‘‘cycles’’ of treatment that are needed to complete the trial. Hence. where a cycle is the period of time over which a patient or group of patients needs to be treated and evaluated before a decision can be made as to the dose level for the next patient or group. The 3 3 design uses the smallest number of patients. each dose level in the 3 3 design uses either one or two cycles. to the extent that the target θ deﬁnes a dose with some efﬁcacy in addition to toxicity. it is also relevant to look at the extremes. Figure 2(c) and (d) present the fraction of trials where ˆ ˆ ψ(d MTD ) 0. the two-stage and CRM designs do best in this regard.6 The cutoff of 0. in absolute terms. respectively. Although this may not be an important consideration. Some results related to the outcome of the trials themselves are presented in Fig.’’ Because of their common ﬁrst-stage design that uses single patients at the initial dose levels. the best-of-5 design uses six to eight fewer patients than the two-stage or CRM design. On average. the fraction of trials below this arbitrary limit may represent cases in which the dose selected for subsequent evaluation in efﬁcacy trials is ‘‘too low. 3. Conversely. respectively. 3. Because they share a common ﬁrst stage and use the same ﬁxed number of patients in the second stage. although of course it does not do as well as the very conservative 3 3 design. Sample size considerations are evaluated in Fig. 2 (d) is the level at which the odds of toxicity are twice that of θ. the same limits as for the estimates in Fig. This is a consideration only for situations where the time needed to complete a phase I trial is not limited by the rate of patient accrual but by the . the two-stage design is somewhat better than the CRM design at avoiding treating patients at higher dose levels. the twostage and CRM designs yield identical results. (c) and (d). the cutoff used in Fig. and so on. the precision of the estimates is not high even for the best designs. the trials where the probability of DLT is above this arbitrary level may represent cases in which the dose selected as the MTD is ‘‘too high. Panel (c) shows the mean number of patients treated. as the probability of DLT increases there is likely a corresponding increase in the probability of very severe or even fatal toxicity.

. with θ 1/3. The true MTD is ﬁxed at four dose levels above the starting dose. plotted as a function of the probability of DLT at the starting dose level.86 Storer Figure 3 Results of 5000 simulated phase I trials according to four designs.

and of course the CRM design selects the next dose level based on the new target. Summary and Conclusion Based only on the results above.Choosing a Phase I Design 87 time needed to treat and evaluate each group of patients. The best-of-5 design would probably also be eliminated as well due to the lower precision and greater likelihood that the MTD will be well below the target. there is perhaps a slight advantage to the former in terms of greater precision and a smaller chance that the estimate will be too far above the target. There is a slight disadvantage in terms of precision. one would likely eliminate the 3 3 design from consideration. however. However. it could also be the case in this setting that using a smaller secondstage sample size would not adversely affect the two-stage and CRM designs. Exactly the same dose–response settings are used. the best-of-5 design uses fewer patients. the 3 3 design performs nearly as well. On the other hand. Additional simulations could be carried out that would vary also the distance between the starting dose and the true MTD or place the true MTD between dose levels instead of exactly at a dose level. D. but given that the mean sample size with the 3 3 design is nearly half that of the other two. we reiterate the point that the purpose of this simulation was to demonstrate some of the properties of phase I designs and of the process of . To illustrate further some of the features of phase I designs and the necessity of studying each situation on a case by case basis. The results for this simulation are presented in Fig. Of course.’’ Additionally. Finally. than the supposedly more sophisticated two-stage and CRM designs. 3(c). so that the results for the two traditional designs are identical to those shown previously. the difference is likely not important in practical terms and might vary under other dose–response conditions.20. Between the twostage and CRM designs. it would be reasonable to consider an additional simulation in which the second-stage sample size for the two-stage and CRM designs is reduced to.7 A desirable feature of the results shown is that both the relative and absolute properties of the designs do not differ much over the range of dose–response curves. the ﬁnal ﬁtted model estimates the MTD associated with the new target. 4 and 5. The two-stage design is modiﬁed to use ﬁve cohorts of ﬁve patients but follows essentially the same rule for selecting the next level described above with ‘‘three’’ replaced by ‘‘ﬁve. In this case the best-of-5 design is clearly eliminated as too aggressive. and one could see whether they continued to maintain an advantage in the other aspects. and perhaps surprisingly. 18 patients. or better. this may be a reasonable trade-off. In this case the results are qualitatively similar to that of Fig. This would put the average sample size for those designs closer to that of the best-of-5. we repeated the simulation study above using a target θ 0. say. If small patient numbers are a priority.


Figure 4 Results of 5000 simulated phase I trials according to four designs, plotted as a function of the probability of DLT at the starting dose level. The dose–response curves are identical to those used for Fig. 2 but with θ = 0.20. Results are expressed in terms of p̂(MTD) = ψ(d̂_MTD).

simulation itself, not to advocate any particular design. Depending on the particulars of the trial at hand, any one of the four designs might be a reasonable choice. An important point to bear in mind is that traditional designs must be matched to the desired target quantile and will perform poorly for other quantiles. CRM designs are particularly flexible in this regard; the two-stage design can only be modified to a lesser extent.
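The comparison metrics used in this section can be computed directly from a set of simulated trials. The helper below sketches one way to do so for output such as that produced by the run_3plus3 example earlier; scoring trials that stop below the lowest level as a DLT probability of 0 is my assumption, and the 0.20 and 0.50 cutoffs simply mirror the limits used in the figures.

```python
import math

def summarize_phase1_sim(selected_levels, tox_probs, theta=1/3,
                         low=0.20, high=0.50):
    """Mean DLT probability at the selected MTD, its root MSE about the
    target theta, and the fractions of trials whose selected dose has a
    true DLT probability <= low ("too low") or >= high ("too high")."""
    p_sel = [tox_probs[k] if k >= 0 else 0.0 for k in selected_levels]
    n = len(p_sel)
    mean_p = sum(p_sel) / n
    rmse = math.sqrt(sum((p - theta) ** 2 for p in p_sel) / n)
    frac_low = sum(p <= low for p in p_sel) / n
    frac_high = sum(p >= high for p in p_sel) / n
    return mean_p, rmse, frac_low, frac_high
```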


Figure 5 Results of 5000 simulated phase I trials according to four designs, plotted as a function of the probability of DLT at the starting dose level. The dose–response curves are identical to those used for Fig. 3 but with θ = 0.20.

ENDNOTES

1. Alternately, one could define Y to be the random variable representing the threshold dose at which a patient would experience DLT. The distribution of Y is referred to as a tolerance distribution and the dose–response curve is the cumulative distribution function for Y, so that the MTD would be defined by Pr(Y ≤ d_MTD) = θ. For a given sample size, the most effective way of estimating this quantile would be from a sample of threshold doses. Such data are nearly impossible to gather, however, as it is impractical to give each patient more than a small number of discrete doses. Further, the data obtained from sequential administration of different doses to the same patient would almost surely be biased, as one could never distinguish the cumulative effects of the different doses from the acute effects of the current dose level. Extended washout periods between doses are not a solution, since the condition of the patient and hence the response to the drug is likely to change rapidly for the typical patient in a phase I trial. For this reason, almost all phase I trials involve the administration of only a single dose level to each patient and the observation of the frequency of DLT in all patients treated at the same dose level.


2. In a true Fibonacci sequence, the increments would be approximately 2, 1.5, 1.67, 1.60, 1.63, and then 1.62 thereafter, converging on the golden ratio.
3. Without a prior, the dose–response model cannot be fit to the data until there is some heterogeneity in outcome, i.e., at least one patient with DLT and one patient without DLT. Thus, some simple rules are needed to guide the dose escalation until heterogeneity is achieved. Also, one may want to impose rules that restrict one from skipping dose levels during escalation, even if the fitted dose–response model would lead one to select a higher dose.

2. 3.

4.

5.

6.

7.

REFERENCES

1. O’Quigley J, Pepe M, Fisher L. Continual reassessment method: a practical design for Phase I clinical studies in cancer. Biometrics 1990; 46:33–48. 2. O’Quigley J, Chevret S. Methods for dose ﬁnding studies in cancer clinical trials: a review and results of a Monte Carlo study. Stat Med 1991; 10:1647–1664.


3. Shen LZ, O’Quigley J. Consistency of continual reassessment method under model misspeciﬁcation. Biometrika 1996; 83:395–405. 4. Gatsonis C, Greenhouse JB. Bayesian methods for Phase I clinical trials. Stat Med 1992; 11:1377–1389. 5. O’Quigley J, Shen LZ. Continual reassessment method: a likelihood approach. Biometrics 1996; 52:673–684. 6. Storer B. Design and analysis of Phase I clinical trials. Biometrics 1989; 45:925– 937. 7. Storer B. Small-sample conﬁdence sets for the MTD in a Phase I clinical trial. Biometrics 1993; 49:1117–1125. 8. Berlin J, Stewart JA, Storer B, Tutsch KD, Arzoomanian RZ, Alberti D, Feierabend C, Simon K, Wilding G. Phase I clinical and pharmacokinetic trial of penclomedine utilizing a novel, two-stage trial design. Clin Oncol 1998; 16:1142–1149. 9. Eichhorn BH, Zacks S. Sequential search of an optimal dosage. J Am Stat Assoc 1973; 68:594–598. 10. Anbar D. Stochastic approximation methods and their use in bioassay and Phase I clinical trials. Commun Stat Theory Methods 1984; 13:2451–2467. 11. Mick R, Ratain MJ. Model-guided determination of maximum tolerated dose in Phase I clinical trials: evidence for increased precision. J Nat Cancer Inst 1993; 85: 217–223. 12. Storer B. An evaluation of Phase I designs for continuous dose response. Stat Med (In press.)

4

Overview of Phase II Clinical Trials

Stephanie Green

Fred Hutchinson Cancer Research Center, Seattle, Washington

I. DESIGN

Standard phase II studies are used to screen new regimens for activity and to decide which ones should be tested further. To screen regimens efficiently, the decisions generally are based on single-arm studies using short-term end points (usually tumor response in cancer studies) in limited numbers of patients. The problem is formulated as a test of the null hypothesis H0: p ≤ p0 versus the alternative hypothesis HA: p ≥ pA, where p is the probability of response, p0 is the probability which, if true, would mean that the regimen was not worth studying further, and pA is the probability which, if true, would mean it would be important to identify the regimen as active and to continue studying it. Typically, p0 is a value at or somewhat below the historical probability of response to standard treatment for the same stage of disease, and pA is typically somewhat above. For ethical reasons, studies of new regimens usually are designed with two or more stages of accrual, allowing early stopping due to inactivity of the regimen. A variety of approaches to early stopping has been proposed. Although several of these include options for more than two stages, only the two-stage versions are discussed in this chapter. (In typical clinical settings it is difficult to manage more than two stages.) An early approach, due to Gehan (1), suggested stopping if 0/N responses were observed, where the probability of 0/N was less than 0.05 under a specific alternative. Otherwise accrual was to be continued until the sample size was large enough for estimation at a specified level of precision.
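The Gehan-type first stage just described amounts to a one-line calculation: find the smallest N for which 0/N responses would be improbable under the specific alternative. The sketch below is only an illustration; the function name and the alternative response probability of 0.20 used in the example are mine, not values from the text.

```python
def gehan_first_stage_n(p_alt, alpha=0.05):
    """Smallest first-stage N such that observing 0/N responses has
    probability below alpha when the true response probability is p_alt."""
    n = 1
    while (1.0 - p_alt) ** n >= alpha:
        n += 1
    return n

print(gehan_first_stage_n(0.20))   # 14, since 0.8**14 is about 0.044
```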


In 1982, Fleming (2) proposed stopping when results are inconsistent either with H0 or with HA′: p ≥ p′, where H0 is tested at level α and p′ is the alternative for which the procedure has power 1 − α. The bounds for stopping after the first stage of a two-stage design are the nearest integer to N1 p′ − Z1−α{Np′(1 − p′)}^1/2 (for concluding early that the regimen should not be tested further) and the nearest integer to N1 p0 + Z1−α{Np0(1 − p0)}^1/2 + 1 (for concluding early that the regimen is promising), where N1 is the first-stage sample size and N is the total after the second stage. At the second stage, H0 is accepted or rejected according to the normal approximation for a single-stage design.

Since then other authors, rather than proposing tests, have proposed choosing stopping boundaries to minimize the expected number of patients required, subject to level and power specifications. Chang et al. (3) proposed minimizing the average expected sample size under the null and alternative hypotheses. Simon (4), recognizing the ethical imperative of stopping when the agent is inactive, recommended stopping early only for unpromising results and minimizing the expected sample size under the null or, alternatively, minimizing the maximum sample size.

A problem with these designs is that sample size has to be accrued exactly for the optimality properties to hold, so in practice they cannot be carried out faithfully in many settings. Particularly in multi-institution settings, studies cannot be closed after a specified number of patients have been accrued. It takes time to get a closure notice out, and during this time more patients will have been approached to enter the trial. Patients who have been asked and have agreed to participate in a trial should be allowed to do so, and this means there is a period of time during which institutions can continue registering patients even though the study is closing. Furthermore, some patients may be found to be ineligible after the study is closed. It is rare to end up with precisely the number of patients planned, making application of fixed designs problematic.

To address this problem, Green and Dahlberg (5) proposed designs allowing for variable attained sample sizes. The approach is to accrue patients in two stages with about the same number of patients per stage, to have level approximately 0.05 and power approximately 0.9, and to stop early if the agent appears unpromising. Specifically, the regimen is concluded unpromising and the trial is stopped early if the alternative is rejected at the 0.02 level after the first stage of accrual, and the agent is concluded promising if H0 is rejected at the 0.055 level after the second stage of accrual. The level 0.02 was chosen to balance the concern of treating the fewest possible patients with an inactive agent against the concern of rejecting an active agent due to treating a chance series of poor-risk patients. Level 0.05 and power 0.9 are reasonable for solid tumors due to the modest percent of agents found to be active in this setting (6); less conservative values might be appropriate in more responsive diseases.
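The Fleming first-stage bounds can be computed directly from the expressions given earlier. The sketch below uses those formulas as reconstructed here (z = 1.645 for a level-0.05 test) and reproduces the Fleming rows of Table 1.

```python
from math import sqrt

Z_95 = 1.645  # z_{1-alpha} for alpha = 0.05

def fleming_stage1_bounds(n1, n_total, p0, p_alt, z=Z_95):
    """First-stage stopping bounds for a two-stage Fleming design:
    accept H0 early if responses <= a1, reject H0 early if responses >= b1."""
    a1 = round(n1 * p_alt - z * sqrt(n_total * p_alt * (1 - p_alt)))
    b1 = round(n1 * p0 + z * sqrt(n_total * p0 * (1 - p0)) + 1)
    return a1, b1

# Reproduces the Fleming rows of Table 1, e.g. 0.1 vs 0.3 with N1 = 20, N = 35:
print(fleming_stage1_bounds(20, 35, 0.1, 0.3))   # (2, 6)
```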

Phase II Trials

95

The design has the property that stopping at the first stage occurs when the estimate of the response probability is less than approximately p0, the true value that would mean the agent would not be of interest. At the second stage the agent is concluded to warrant further study if the estimate of the response probability is greater than approximately (pA + p0)/2, which typically would be equal to or somewhat above the historical probability expected from other agents and a value at which one might be expected to be indifferent to the outcome of the trial. However, there are no optimality properties.

Chen and Ng (7) proposed a different approach to flexible design by optimizing (with respect to expected sample size under p0) across possible attained sample sizes. They assumed a uniform distribution over sets of eight consecutive N1s and eight consecutive Ns; presumably if information is available on the actual distribution in a particular setting, then the approach could be used for a better optimization. Herndon (8) described another variation on the Green and Dahlberg designs. To address the problem of temporary closure of studies, an alternative approach is proposed that allows patient accrual to continue while results of the first stage are reviewed. Temporary closures are disruptive, so this approach might be reasonable for cases when accrual is relatively slow with respect to submission of information (if too rapid, the ethical aim of stopping early due to inactivity is lost).

Table 1 illustrates several of the design approaches mentioned above for level 0.05 and power 0.9 tests, including Fleming designs, Simon minimax designs, Green and Dahlberg designs, and Chen and Ng optimal design sets. Powers and levels are reasonable for all approaches. (Chen and Ng designs have correct level on average, although individual realizations have levels up to 0.075 among the tabled designs.) Of the four approaches, Green and Dahlberg designs are the most conservative with respect to early stopping for level 0.05 and power 0.9, whereas Chen and Ng designs are the least.

In another approach to phase II design, Storer (9) suggested a procedure similar to two-sided testing instead of the standard one-sided test. In this approach, the phase II trial is considered negative (HA: p ≥ pA is rejected) if the number of responses is sufficiently low, positive (H0: p ≤ p0 is rejected) if sufficiently high, and equivocal if intermediate (neither hypothesis rejected). For a value pm between p0 and pA, upper and lower rejection bounds (rU and rL) are chosen such that P(x ≥ rU | pm) ≤ γ and P(x ≤ rL | pm) ≤ γ, with pm and sample size chosen to have adequate power to reject HA under p0 or H0 under pA. When p0 = 0.1 and pA = 0.3, an example of a Storer design is to test pm = 0.193 with γ = 0.33 and power 0.8 under p0 and pA. For a two-stage design, N1, N, rL1, rU1, rL2, and rU2 are 18, 29, 1, 6, 4, and 7, respectively. If the final result is equivocal (5 or 6 responses in 29 for this example), the conclusion is that other information is necessary to make a decision.
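For the Storer approach, the rejection bounds at a given analysis are simply binomial tail cutoffs at pm. The sketch below finds them for a single sample size and reproduces the second-stage bounds (rL2 = 4, rU2 = 7) of the example above; the first-stage bounds of the published two-stage version involve additional considerations and are not derived here.

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(k, n, p):
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

def storer_bounds(n, p_m, gamma):
    """Rejection bounds with one-sided tail probability <= gamma at p_m:
    r_u = smallest r with P(X >= r) <= gamma,
    r_l = largest r with P(X <= r) <= gamma."""
    r_u = next(r for r in range(n + 1) if 1 - binom_cdf(r - 1, n, p_m) <= gamma)
    r_l = max(r for r in range(-1, n + 1) if binom_cdf(r, n, p_m) <= gamma)
    return r_u, r_l

# Final-analysis bounds for the example above (n = 29, p_m = 0.193, gamma = 0.33)
print(storer_bounds(29, 0.193, 0.33))   # (7, 4)
```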

Table 1  Examples of Designs

Design   H0 vs. HA     N1   a1   b1   N    a2   b2   Level   Power
Fleming  0.05 vs. 0.2  20   0    4    40   4    5    0.052   0.92
         0.1 vs. 0.3   20   2    6    35   6    7    0.053   0.92
         0.2 vs. 0.4   25   5    10   45   13   14   0.055   0.91
         0.3 vs. 0.5   30   9    16   55   22   23   0.042   0.91
Simon    0.05 vs. 0.2  29   1    —    38   4    5    0.039   0.90
         0.1 vs. 0.3   22   2    —    33   6    7    0.041   0.90
         0.2 vs. 0.4   24   5    —    45   13   14   0.048   0.90
         0.3 vs. 0.5   24   7    —    53   21   22   0.047   0.90
Green    0.05 vs. 0.2  20   0    —    40   4    5    0.047   0.92
         0.1 vs. 0.3   20   1    —    35   7    8    0.020   0.87
         0.2 vs. 0.4   25   4    —    45   13   14   0.052   0.91
         0.3 vs. 0.5   30   8    —    55   22   23   0.041   0.91
Chen (optimal design set; level and power are the average, with the range across the set in parentheses)
  0.05 vs. 0.2  N1: 17–24 (a1 = 1); N: 41–46 (a2 = 4, b2 = 5), 47–48 (a2 = 5, b2 = 6);
                level 0.046 (0.022–0.069); power 0.90 (0.845–0.946)
  0.1 vs. 0.3   N1: 12–14 (a1 = 1), 15–19 (a1 = 2); N: 36–39 (a2 = 6, b2 = 7), 40–43 (a2 = 7, b2 = 8);
                level 0.050 (0.029–0.075); power 0.90 (0.848–0.938)
  0.2 vs. 0.4   N1: 18–20 (a1 = 4), 21–24 (a1 = 5), 25 (a1 = 6); N: 48 (a2 = 13, b2 = 14), 49–51 (a2 = 14, b2 = 15), 52–55 (a2 = 15, b2 = 16);
                level 0.050 (0.034–0.073); power 0.90 (0.868–0.937)
  0.3 vs. 0.5   N1: 19–20 (a1 = 6), 21–23 (a1 = 7), 24–26 (a1 = 8); N: 55 (a2 = 21, b2 = 22), 56–58 (a2 = 22, b2 = 23), 59–60 (a2 = 23, b2 = 24), 61–62 (a2 = 24, b2 = 25);
                level 0.050 (0.035–0.064); power 0.90 (0.872–0.929)

N1 is the sample size for the first stage of accrual, N is the total sample size after the second stage of accrual, ai is the bound for accepting H0 at stage i, and bi is the bound for rejecting H0 at stage i (i = 1, 2). Designs are listed for Fleming (2), Simon (4), and Green and Dahlberg (5); the optimal design set is listed for Chen and Ng (7).
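The tabled levels and powers can be checked by direct enumeration over the two stages. The sketch below does this for the Green and Dahlberg design for 0.1 versus 0.3 (stop if 0 or 1 responses are seen in the first 20 patients, otherwise accrue to 35 and conclude the agent promising with more than a2 = 7 responses); the rejection threshold is taken from the table as reconstructed here.

```python
from math import comb

def bpmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def reject_prob(p, n1, a1, n, a2):
    """P(reject H0) for a two-stage design: continue past stage 1 only if
    responses > a1; reject H0 at the end if total responses > a2."""
    n2 = n - n1
    total = 0.0
    for x1 in range(a1 + 1, n1 + 1):            # trials that continue to stage 2
        need = a2 + 1 - x1                      # further responses required
        tail = sum(bpmf(j, n2, p) for j in range(max(need, 0), n2 + 1))
        total += bpmf(x1, n1, p) * tail
    return total

# Green and Dahlberg design for 0.1 vs 0.3 in Table 1: N1=20, a1=1, N=35, a2=7
print(round(reject_prob(0.1, 20, 1, 35, 7), 3))   # level, roughly 0.02
print(round(reject_prob(0.3, 20, 1, 35, 7), 3))   # power, roughly 0.87
```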


II. ANALYSIS OF STANDARD PHASE II DESIGNS

As noted in Storer, the hypothesis testing framework used in phase II studies is useful for developing designs and determining sample sizes. The resulting decision rules are not always meaningful, however, except as tied to hypothetical follow-up trials that in practice may or may not be done. Thus, it is important to present confidence intervals for phase II results, which can be interpreted appropriately regardless of the nominal "decision" made at the end of the trial as to whether further study of the regimen is warranted.

The main analysis issue after a multistage trial is how to generate a confidence interval, since the usual procedures assuming a single-stage design are biased. Various approaches to generating intervals have been proposed. These involve ordering the outcome space and inverting tail probabilities or test acceptance regions, as in estimation following single-stage designs; however, with multistage designs, the outcome space does not lend itself to any simple ordering. Jennison and Turnbull (10) order the outcome space by which boundary is reached, by the stage stopped at, and by the number of successes (stopping at stage i is considered more extreme than stopping at stage i + 1 regardless of the number of successes). A value p is not in the 1 − 2α confidence interval if the probability under p of the observed result or one more extreme according to this ordering is less than α (either direction). Chang and O'Brien (11) order the sample space instead based on the likelihood principle. For each p, the sample space for a two-stage design is ordered according to L(x, N*) = (x/N*)^x {(N* − x)/N*}^(N*−x) / [p^x (1 − p)^(N*−x)], where N* is N1 if x can only be observed at the first stage and N if at the second (x is the number of responses). P is not in the confidence "interval" if one half of the probability of the observed outcome plus the probability of a more extreme outcome according to this ordering is α or less. The confidence set is not always strictly an interval, but the authors state that the effect of discontinuous points is negligible. Chang and O'Brien intervals were shorter than those of Jennison and Turnbull, although this in part would be due to the fact that Jennison and Turnbull did not adjust for discreteness by assigning only 1/2 of the probability of the observed value to the tail as Chang and O'Brien did. Duffy and Santner (12) recommend ordering the sample space by success percent and also develop intervals of shorter length than Jennison and Turnbull intervals. Although they produce shorter intervals, these last two approaches have the major disadvantage of requiring knowledge of the final sample size to calculate an interval for a study stopped at the first stage; as noted above, this typically will be random. The Jennison and Turnbull approach can be used since it only requires knowledge up to the stopping time.
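The Chang and O'Brien ordering statistic is simple to compute for any hypothesized p. The sketch below evaluates it for the possible outcomes of a two-stage design and sorts them; this is only the ordering step, not the full construction of the confidence set, and the design used (stop if 0 or 1 responses in 20, otherwise continue to 35) is the one discussed below.

```python
def likelihood_ratio(x, n_star, p):
    """Chang-O'Brien ordering statistic L(x, N*): the binomial likelihood of
    the data at the MLE x/N* relative to its likelihood at the hypothesized p."""
    phat = x / n_star
    num = phat**x * (1 - phat)**(n_star - x)
    den = p**x * (1 - p)**(n_star - x)
    return num / den

# Outcomes of a design that stops at 20 patients if x <= 1 and otherwise
# continues to 35; order them under a hypothesized p = 0.2.
outcomes = [(x, 20) for x in range(2)] + [(x, 35) for x in range(2, 36)]
ranked = sorted(outcomes, key=lambda o: likelihood_ratio(o[0], o[1], 0.2),
                reverse=True)
print(ranked[:5])   # the five outcomes least consistent with p = 0.2
```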


However, it is not entirely clear how important it is to adjust confidence intervals for the multistage nature of the design. From the point of view of appropriately reflecting the activity of the regimen tested, the usual interval assuming a single-stage design may be sufficient. In this setting the length of the confidence interval is not of primary importance (sample sizes are small and all intervals are long). The primary concern is that the interval appropriately reflects the activity of the regimen. Similar to Storer's idea, it is assumed that if the confidence interval excludes p0, the regimen is considered active, and if it excludes pA, the regimen is considered insufficiently active. If it excludes neither, results are equivocal; this seems reasonable whether or not continued testing is recommended for the better equivocal results.

For Green and Dahlberg designs, the differences between Jennison and Turnbull and unadjusted tail probabilities are 0 if the trial stops at the first stage; if the trial stops at the second stage with x total responses, the difference is ∑_{i=0}^{a1} bin(i, N1, p) ∑_{j=x−i}^{N−N1} bin(j, N − N1, p) for the upper tail and ∑_{i=0}^{a1} bin(i, N1, p) ∑_{j=x−i+1}^{N−N1} bin(j, N − N1, p) for the lower tail. (a1 is the stopping bound for accepting H0 at the first stage.) Both the upper and lower confidence bounds are shifted to the right for Jennison and Turnbull intervals. These therefore will more often appropriately exclude p0 when pA is true and inappropriately include pA when p0 is true compared with the unadjusted interval. However, the tail differences are generally small, resulting in small differences in the intervals. The absolute value of the upper tail difference is less than approximately 0.003 when the lower bound of the unadjusted interval is p0 (normal approximation), whereas the lower tail difference is constrained to be at most 0.02 for p ≥ pA due to the early stopping rule. Generally, the shift in a Jennison and Turnbull interval is noticeable only for small x at the second stage. As Rosner and Tsiatis (13) note, such results (activity in the first stage, no activity in the second) are unlikely, possibly suggesting the independent identically distributed assumption was incorrect. For example, consider a common design for testing H0: p ≤ 0.1 versus HA: p ≥ 0.3: stop in favor of H0 at the first stage if 0 or 1 responses are observed in 20 patients and otherwise continue to a total of 35. Of the 36 possible trial outcomes (if planned sample sizes are achieved), the largest discrepancy in the 95% confidence intervals occurs if two responses are observed in the first stage and none in the second. For this outcome, the Jennison and Turnbull 95% confidence interval is from 0.013 to 0.25, whereas the unadjusted interval is from 0.007 to 0.19. Although not identical, both intervals lead to the same conclusion: The alternative is ruled out.

For the Fleming and Green and Dahlberg designs listed in Table 1, Table 2 lists the probabilities that the 95% confidence intervals lie above p0 (evidence the regimen is active), below pA (evidence the regimen has insufficient activity to pursue), or cover both p0 and pA (inconclusive). (In no case are p0 and pA both excluded.) Probabilities are calculated for p = p0 and p = pA and for adjusted (by the Jennison and Turnbull method) and unadjusted intervals. For the Green and Dahlberg designs considered, probabilities for the Jennison and Turnbull and unadjusted intervals are the same in most cases.


Table 2  Probabilities Under p0 and pA for Unadjusted and Jennison–Turnbull (J-T) Adjusted 95% Confidence Intervals

                                   P(95% CI above p0)   P(95% CI below pA)   P(95% CI includes p0 and pA)
H0 vs. HA     Design    Interval    p = p0    p = pA     p = p0    p = pA     p = p0    p = pA
0.05 vs. 0.2  Green     J-T         0.014     0.836      0.704     0.017      0.282     0.147
                        Unadjusted  0.014     0.836      0.704     0.017      0.282     0.147
              Fleming   J-T         0.024     0.854      0.704     0.017      0.272     0.129
                        Unadjusted  0.024     0.854      0.704     0.017      0.272     0.129
0.1 vs. 0.3   Green     J-T         0.020     0.866      0.747     0.014      0.233     0.120
                        Unadjusted  0.020     0.866      0.747     0.014      0.233     0.120
              Fleming   J-T         0.025     0.866      0.392     0.008      0.583     0.126
                        Unadjusted  0.025     0.866      0.515     0.011      0.460     0.123
0.2 vs. 0.4   Green     J-T         0.025     0.856      0.742     0.016      0.233     0.128
                        Unadjusted  0.025     0.856      0.833     0.027      0.142     0.117
              Fleming   J-T         0.023     0.802      0.421     0.009      0.556     0.189
                        Unadjusted  0.034     0.862      0.654     0.022      0.312     0.116
0.3 vs. 0.5   Green     J-T         0.022     0.859      0.822     0.020      0.156     0.121
                        Unadjusted  0.022     0.859      0.822     0.020      0.156     0.121
              Fleming   J-T         0.025     0.860      0.778     0.025      0.197     0.115
                        Unadjusted  0.025     0.860      0.837     0.030      0.138     0.110

The only discrepancy occurs for the 0.2 versus 0.4 design when the final outcome is 11 of 45 responses. In this case the unadjusted interval is from 0.129 to 0.395, whereas the Jennison and Turnbull interval is from 0.131 to 0.402. There are more differences between adjusted and unadjusted probabilities for Fleming designs, the largest for ruling out pA in the 0.2 versus 0.4 and 0.1 versus 0.3 designs. In these designs, no second-stage Jennison and Turnbull interval excludes the alternative, making this probability unacceptably low under p0. The examples presented suggest that adjusted confidence intervals do not necessarily result in more sensible intervals in phase II designs and in some cases are worse than not adjusting.
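The interval comparisons above can be reproduced numerically by inverting the tail probabilities, using the single-stage (unadjusted) tails and the Jennison and Turnbull ordering as described earlier. The sketch below does this for the outcome discussed above (two responses at the first stage, none at the second); small discrepancies from the quoted limits may reflect rounding or the convention used for the observed point.

```python
from math import comb

def bpmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def bcdf(k, n, p):
    return sum(bpmf(i, n, p) for i in range(k + 1))

def solve(f, lo=1e-6, hi=1 - 1e-6, tol=1e-7):
    """Bisection root-finder for a monotone f with a sign change on (lo, hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

N1, A1, N = 20, 1, 35        # stop at stage 1 if responses <= 1 of 20
x = 2                        # observed: 2 responses at stage 1, 0 at stage 2

# Unadjusted (single-stage) 95% interval for x out of N
lo_u = solve(lambda p: (1 - bcdf(x - 1, N, p)) - 0.025)
hi_u = solve(lambda p: 0.025 - bcdf(x, N, p))

# Jennison-Turnbull ordering: upper tail = P(continue and total >= x);
# lower tail = P(stop at stage 1) + P(continue and total <= x)
def jt_upper(p):
    return sum(bpmf(i, N1, p) * (1 - bcdf(x - i - 1, N - N1, p))
               for i in range(A1 + 1, N1 + 1))

def jt_lower(p):
    return bcdf(A1, N1, p) + sum(bpmf(i, N1, p) * bcdf(x - i, N - N1, p)
                                 for i in range(A1 + 1, N1 + 1))

lo_jt = solve(lambda p: jt_upper(p) - 0.025)
hi_jt = solve(lambda p: 0.025 - jt_lower(p))

print(f"unadjusted: ({lo_u:.3f}, {hi_u:.3f})")   # compare with 0.007-0.19 above
print(f"J-T:        ({lo_jt:.3f}, {hi_jt:.3f})") # compare with 0.013-0.25 above
```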

III. OTHER PHASE II DESIGNS

A. Multiarm Phase II Designs

Occasionally, the aim of a phase II study is not to decide whether a particular regimen should be studied further but to decide which of several new regimens


should be taken to the next phase of testing (assuming they cannot all be). In these cases selection designs are used, often formulated as follows: Take on to further testing the treatment arm observed to be best by any amount, where the number of patients per arm is chosen to be large enough such that if one treatment is superior by ∆ and the rest are equivalent, the probability of choosing the superior treatment is p. Simon et al. (14) published sample sizes for selection designs with response end points, whereas Liu et al. (15) provide sample sizes for survival end points. For survival the approach is to choose the arm with the smallest estimated β in a Cox model. Sample size is chosen so that if one treatment is superior with β = −ln(1 + ∆) and the others have the same survival, then the superior treatment will be chosen with probability p.

Theoretically this is all fine, but in reality the designs are not strictly followed. If response is poor in all arms, the conclusion is to pursue none of the regimens (not an option allowed in these designs). If a "striking" difference is observed, then the temptation is to bypass the confirmatory phase III. In a follow-up to the survival selection paper, Liu et al. (16) noted that the probability of an observed β̂ of at least ln(1.7), which cancer investigators consider striking, is not trivial: with two to four arms the probabilities are 0.07–0.08 when in fact there are no differences in the treatment arms.
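The "striking difference by chance" point can be illustrated with a binary-outcome analogue of the Liu et al. calculation: simulate several arms with identical true response rates and ask how often the observed best arm beats the runner-up by an apparently convincing margin. The arm sizes, response rate, and margin below are illustrative assumptions only.

```python
import random

def prob_striking_difference(n_per_arm, p_true, margin, n_arms=3,
                             sims=20_000, seed=1):
    """Chance that the best observed response rate beats the runner-up by at
    least `margin` when all arms share the same true rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        rates = sorted(sum(rng.random() < p_true for _ in range(n_per_arm))
                       / n_per_arm for _ in range(n_arms))
        if rates[-1] - rates[-2] >= margin:
            hits += 1
    return hits / sims

# Three arms of 50 patients each, identical 30% true response rates
print(prob_striking_difference(50, 0.30, 0.15))
```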

B. Phase II Designs with Multiple End Points

The selected primary end point of a phase II trial is just one consideration in the decision to pursue a new agent. Other end points (such as survival and toxicity if response is primary) must also be considered. For instance, a trial with a sufficient number of responses to be considered active may still not be of interest if too many patients experience life-threatening toxicity or if they all die quickly. On the other hand, a trial with an insufficient number of responses but a good toxicity profile and promising survival might still be considered for future trials. Designs have been proposed to incorporate multiple end points explicitly into phase II studies.

Bryant and Day (17) proposed an extension of Simon's approach, identifying designs that minimize the expected accrual when the regimen is unacceptable either with respect to response or toxicity. Their designs are terminated at the first stage if either the number of responses is CR1 or less or the number of patients without toxicity is CT1 or less (or both). The regimen is concluded useful if the number of patients with responses and the number without toxicity are greater than CR2 and CT2, respectively, at the second stage. N1, N, CR1, CT1, CR2, and CT2 are chosen such that the probability of recommending the regimen when the probability of no toxicity is acceptable (pT ≥ pT1) but response is unacceptable (pR ≤ pR0) is less than or equal to αR,


the probability of recommending the regimen when response is acceptable (pR ≥ pR1) but toxicity is unacceptable (pT ≤ pT0) is less than or equal to αT, and the probability of recommending the regimen when both are acceptable is 1 − β or better. The constraints are applied either uniformly over all possible correlations between toxicity and response or assuming independence of toxicity and response. Minimization is done subject to the constraints. For many practical situations, minimization assuming independence produces designs that perform reasonably well when the assumption is incorrect.

Conaway and Petroni (18) proposed similar designs assuming that a particular relationship between toxicity and response, an optimality criterion, and a fixed total sample size are all specified. Design constraints proposed include limiting the probability of recommending the regimen to α or less when both response and toxicity are unacceptable (pR ≤ pR0 and pT ≤ pT0) and to γ or less anywhere else in the null region (pR ≤ pR0 or pT ≤ pT0). The following year, Conaway and Petroni (19) proposed boundaries allowing for trade-offs between toxicity and response. Instead of dividing the parameter space as in Fig. 1a, it is divided according to investigator specifications, such as in Fig. 1b, allowing for fewer patients with no toxicity when the response probability is higher (and the reverse). The test proposed is to accept H0 when T(x) ≤ c1 at the first stage or T(x) ≤ c2 at the second, subject to maximum level α over the null region and power at least 1 − β when pR = pR1 and pT = pT1 for an assumed value for the association between response and toxicity. Here, T(x) = ∑ p*ij ln(p*ij/p̂ij), where ij indexes the cells of the 2 × 2 response–toxicity table, the p̂ij are the usual probability estimates, and the p*ij are the values achieving the infimum over H0 of ∑ pij ln(pij/p̂ij). (T(x) can be interpreted in some sense as a distance from p̂ to H0.) Interim stopping bounds are chosen to satisfy optimality criteria (the authors' preference is minimization of expected sample size under the null).
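Under the independence assumption mentioned above, the probability that a Bryant and Day-type design recommends the regimen factors into two univariate two-stage probabilities, one for responses and one for patients without toxicity. The sketch below computes this product; the design parameters shown are hypothetical illustrations, not values taken from the cited papers.

```python
from math import comb

def bpmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_stage_pass(p, n1, c1, n, c2):
    """P(count > c1 at stage 1 and cumulative count > c2 at the end)."""
    n2 = n - n1
    return sum(bpmf(x1, n1, p) *
               sum(bpmf(x2, n2, p) for x2 in range(max(c2 + 1 - x1, 0), n2 + 1))
               for x1 in range(c1 + 1, n1 + 1))

def prob_recommend(p_resp, p_notox, n1, cr1, ct1, n, cr2, ct2):
    """Probability a Bryant-Day-type design recommends the regimen, assuming
    response and absence of toxicity are independent."""
    return (two_stage_pass(p_resp, n1, cr1, n, cr2) *
            two_stage_pass(p_notox, n1, ct1, n, ct2))

# Hypothetical design parameters, for illustration only
print(prob_recommend(p_resp=0.35, p_notox=0.75,
                     n1=25, cr1=4, ct1=16, n=50, cr2=13, ct2=37))
```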

Figure 1 Division of parameter space for two approaches to bivariate phase II design. (a) An acceptable probability of response and an acceptable probability of no toxicity are each speciﬁed. (b) Acceptable probabilities are not ﬁxed at one value for each but instead allow for a trade-off between toxicity and response.


There are a number of practical problems with these designs. As for other designs relying on optimality criteria, they generally cannot be done faithfully in realistic settings. Even when they can be carried out, defining toxicity as a single yes–no variable is problematic, since typically several toxicities of various grades are of interest. Perhaps the most important issue is that of the response–toxicity trade-off. Any function specified is subjective and cannot be assumed to reflect the preferences of either investigators or patients in general.

IV. DISCUSSION

Despite the precise formulation of hypotheses and decision rules, phase II trials are not as objective as we would like. The small sample sizes used cannot support decision making based on all aspects of interest in a trial. Trials combining more than one aspect (such as toxicity and response) are fairly arbitrary with respect to the relative importance placed on each end point (including the 0 weight placed on the end points not included), so are subject to about as much imprecision in interpretation as results of single end point trials. Furthermore, a phase II trial would rarely be considered on its own. By the time a regimen is taken to phase III testing, multiple phase II trials have been done and the outcomes of the various trials weighed and discussed. Perhaps statistical considerations in a phase II design are most useful in keeping investigators realistic about how limited such studies are.

For similar reasons, optimality considerations both with respect to design and confidence intervals are not particularly compelling in phase II trials. Sample sizes in the typical clinical setting are small and variable, making it more important to use procedures that work reasonably well across a variety of circumstances rather than optimally in one. Also, there are various characteristics it would be useful to optimize; compromise is often in order.

A final practical note: choices of null and alternative hypotheses in phase II trials are often made routinely, with little thought, but phase II experience should be reviewed occasionally. As definitions and treatments change, old historical probabilities do not remain applicable.

REFERENCES

1. Gehan EA. The determination of number of patients in a follow up trial of a new chemotherapeutic agent. J Chronic Dis 1961; 13:346–353.
2. Fleming TR. One sample multiple testing procedures for Phase II clinical trials. Biometrics 1982; 38:143–151.


3. Chang MN, Therneau TM, Wieand HS, Cha SS. Designs for group sequential Phase II clinical trials. Biometrics 1987; 43:865–874.
4. Simon R. Optimal two-stage designs for Phase II clinical trials. Controlled Clin Trials 1989; 10:1–10.
5. Green SJ, Dahlberg S. Planned vs attained design in Phase II clinical trials. Stat Med 1992; 11:853–862.
6. Simon R. How large should a Phase II trial of a new drug be? Cancer Treatment Rep 1987; 71:1079–1085.
7. Chen T, Ng T-H. Optimal flexible designs in Phase II clinical trials. Stat Med 1998; 17:2301–2312.
8. Herndon J. A design alternative for two-stage, Phase II, multicenter cancer clinical trials. Controlled Clin Trials 1998; 19:440–450.
9. Storer B. A class of Phase II designs with three possible outcomes. Biometrics 1992; 48:55–60.
10. Jennison C, Turnbull BW. Confidence intervals for a binomial parameter following a multistage test with application to MIL-STD 105D and medical trials. Technometrics 1983; 25:49–58.
11. Chang MN, O'Brien PC. Confidence intervals following group sequential tests. Controlled Clin Trials 1986; 7:18–26.
12. Duffy DE, Santner TJ. Confidence intervals for a binomial parameter based on multistage tests. Biometrics 1987; 43:81–94.
13. Rosner G, Tsiatis AA. Exact confidence intervals following a group sequential trial: a comparison of methods. Biometrika 1988; 75:723–729.
14. Simon R, Wittes R, Ellenberg S. Randomized Phase II clinical trials. Cancer Treatment Rep 1985; 69:1375–1381.
15. Liu PY, Dahlberg S, Crowley J. Selection designs for pilot studies based on survival endpoints. Biometrics 1993; 49:391–398.
16. Liu PY, LeBlanc M, Desai M. False positive rates of randomized Phase II designs. Controlled Clin Trials 1999; 20:343–352.
17. Bryant J, Day R. Incorporating toxicity considerations into the design of two-stage Phase II clinical trials. Biometrics 1995; 51:1372–1383.
18. Conaway M, Petroni G. Bivariate sequential designs for Phase II trials. Biometrics 1995; 51:656–664.
19. Conaway M, Petroni G. Designs for phase II trials allowing for a trade-off between response and toxicity. Biometrics 1996; 52:1375–1386.

In practice. Ethical considerations. Virginia I. INTRODUCTION In principle. The number of patients in a phase I trial is small and the toxicity proﬁle of the new agent is estimated with little precision. there is a need to gather more information about toxicity rates before proceeding to a large comparative trial. Conaway University of Virginia. usually deﬁned in terms of the proportion of patients experiencing severe side effects. As a result.5 Designs Based on Toxicity and Response Gina R. Conaway and Petroni (1) and Bryant and Day (2) cite several reasons why toxicity considerations are important for phase II trials: 1. Charlottesville. experience some objective measure of disease improvement. Most phase II trials are designed to terminate the study early if it does not appear that the new agent is sufﬁciently 105 . 2.’’ that is. the separation between establishing the toxicity of a new agent in a phase I trial and establishing the response rate in a phase II trial is artiﬁcial. Most phase II trials are conducted not only to establish the response rate but also to gather additional information about the toxicity associated with the new agent. Sample sizes in phase I trials. has been established in a previous phase I trial. phase II trials evaluate whether a new agent is sufﬁciently promising to warrant a comparison with the current standard of treatment. An agent is considered sufﬁciently promising based on the proportion of patients who ‘‘respond. Petroni and Mark R. The toxicity of the new agent.

(5. X 22 ). promising to warrant a comparative trial. In these designs. where it is hypothesized that a more intensive therapy induces a greater chance of a response but also a greater chance of toxicity. The data are summarized in a 2 2 table where X ij is the number of patients with response classiﬁcation i and toxicity classiﬁcation j (Table 1). (p11 . X 12 .6) propose a Bayesian method for monitoring response and toxicity that can also incorporate a trade-off between response and toxicity rates. and consequently phase II trials should be designed with the possibility of early termination of the study if an excessive number of toxicities are observed. This consideration is particularly important in studies of intensive chemotherapy regimens. As a motivation for the multistage designs. p22 ). In this setting. binary variables representing response and toxicity are observed in each of N patients. p21 . With these considerations. Conaway and Petroni (1) and Bryant and Day (2) propose methods that extend the two-stage designs of Simon (3). a new agent is considered sufﬁciently promising if it exhibits both a response rate that is greater than that of the standard therapy and a toxicity rate that does not exceed that of the standard therapy. It is assumed that the cell counts in this table. X 21 . The characteristics of the patients enrolled in the previous phase I trials may be different from those of the patients to be enrolled in the phase II trial. a new agent with a greater toxicity rate might be considered sufﬁciently promising if it also has a much greater response rate than the standard therapy. using the notation in Conaway and Petroni (1). several proposals have been made for designing phase II trials that formally incorporate both response and toxicity end points. . p12 . DESIGNS FOR RESPONSE AND TOXICITY Conaway and Petroni (1) and Bryant and Day (2) present multistage designs that formally monitor response and toxicity. phase I trials often enroll patients for whom all standard therapies have failed. For example. (X 11. Thall et al. based on a trade-off between response and toxicity rates. have a multinomial distribution with underlying probabilities. Patients should be protected also from receiving agents with excessive rates of toxicity. In each of these methods.106 Petroni and Conaway 3. The observed number of responses is X R X 11 X 12 and the observed number of patients experiencing a severe toxicity is X T X 11 X 21. Conaway and Petroni (4) consider a different criterion. These designs are meant to protect patients from receiving substandard therapy. These patients are likely to have a greater extent of disease than patients who will be accrued to the phase II trial. we ﬁrst describe the methods for a ﬁxed sample design. II.

X T ): X R c R and X T c T}. Conaway and Petroni (1) and Bryant and Day (2) interpret the term ‘‘sufﬁciently promising’’ to mean that the new treatment has a greater response rate than the standard and that the toxicity rate with the new treatment is no greater than that of the standard treatment. X T ). a proportion.Designs Based on Toxicity and Response Table 1 Classiﬁcation of Patients by Response and Toxicity Toxicity Yes Response Yes No Total X 11 X 21 XT No X 12 X 22 N XT Total XR N N XR 107 That is. with a critical region of the form C {(X R . 1. Deﬁning pRo as the response rate with the standard treatment and pTo as the toxicity rate for the standard treatment. pij . A statistic for testing H o versus H a is (X R . the null hypothesis can be written as H o: pR H a: pR pRo or pT pTo pRo and pT pTo The null and alternative regions are displayed in Fig. With this notation the probability of a response is pR probability of a toxicity is pT p11 p21. in the population of patients to be treated with this new agent. We reject the null hypothesis and Table 2 Population Proportions for Response and Toxicity Classiﬁcations Toxicity Yes Response Yes No Total p11 p21 pT No p12 p22 1 pT Total pR 1 1 pR . would have response classiﬁcation i and toxicity classiﬁcation j (Table p11 p12 and the 2). The design is based on having sufﬁcient power to test the null hypothesis that the new treatment is ‘‘not sufﬁciently promising’’ to warrant further study against the alternative hypothesis that the new agent is sufﬁciently promising to warrant a comparative trial.

108 Petroni and Conaway Figure 1 Null and alternative regions. γ. Conaway and Petroni (1) choose the sample size. 3. The three error probabilities are 1. 2. We do not reject the null hypothesis if we observe too few responses or too many toxicities. The probability of declaring the treatment not promising at a particular point in the alternative region. The probability of incorrectly declaring the treatment promising when the response and toxicity rates for the new therapy are the same as those of the standard therapy. N. The probability of incorrectly declaring the treatment promising when the response rate for the new therapy is no greater than that of the standard or the toxicity rate for the new therapy is greater than that of the standard therapy. declare the treatment sufﬁciently promising if we observe many responses and few toxicities. and critical values (c R . respectively. The design should yield sufﬁcient power to reject the null hypothesis for a speciﬁc response and toxicity rate. . and 1 β. c T ) to constrain three error probabilities to be less than prespeciﬁed levels α.

γ 0. Therefore. k 1. the null hypothesis that the treatment is not sufﬁciently promising to warrant further study is rejected if there are a sufﬁcient number of observed . If the trial is stopped early. . Conaway and Petroni (1) compute the sample size and critical values by enumerating the distribution of (X R . The trial is designed to have approximately 90% power at the alternative determined by (pRa . 0.10. 1 β (1) (2) (3) c T | pR . The study continues to the next stage if the total number of responses observed up to and including the kth stage is at least as great as c Rk and the total number of toxicities up to and including the kth stage is no greater than c Tk. .30. The extension to multistage designs is straightforward. The odds ratio. θ) α. pTa ) is a prespeciﬁed point in the alternative region. As an example. At the ﬁnal stage. pT. Conaway and Petroni (1) present a proposed phase II trial of high-dose chemotherapy for patients with non-Hodgkin’s lymphoma. the point (pRa .50. Conaway and Petroni (1) assume that the study is to be carried out in K stages. for the general discussion. toxicity rate. K. is determined by the assumed response rate. In addition. At the end of the kth stage. XT c T | pR cR. and β 0. in Table 2. with pRa pRo and pTa pTo. Although most phase II trials are carried out in at most two stages. pT pTa .Designs Based on Toxicity and Response 109 where the response rate is greater than that of the standard therapy and the toxicity rate is less than that of the standard therapy. pT . and the conditional probability of experiencing a life-threatening toxicity given that patient had a complete response. At the end of the kth stage. θ) P(X R where these probabilities are computed for a prespeciﬁed value of the odds ratio. Conaway and Petroni (1) chose values α 0. pTo ) is assumed to be (0.05. (pRo . X T ) under particular values for (pR .0. the treatment is declared not sufﬁciently promising to warrant further study. . the decision to continue or terminate the study is governed by the boundaries (c Rk . c Tk ). XT P(X R cR. θ.75. these error probabilities are: P(X R sup pR pRo or pT pTo cR. a decision is made whether to enroll patients for the next stage or to stop the trial.15). θ) pRa . Mathematically. 0. The multistage designs allow for the early termination of a study if early results indicate that the treatment is not sufﬁciently effective or is too toxic. previous results indicated that approximately 35–40% of the patients who experienced a complete response also experienced life-threatening toxicities.30) and the odds ratio is assumed to be 2. pT pTo . . θ (p11 p22 )/(p12 p21 ). pTa ) (0. θ). XT c T | pR pRo . Results from earlier studies for this patient population have indicated that standard therapy results in an estimated response rate of 50% with approximately 30% of patients experiencing life-threatening toxicities. γ.

If Y R2 sponse. and C T2 are parameters to be determined by the design: 1. C R2 and Y T2 C T2 . can be used to select a design.’’ The principle for choosing the stage sample sizes and stage boundaries is the same as in Conaway and Petroni (1). C T1 . the goal is to choose sample sizes for the stages m 1 . C T1 . The stage sample sizes and boundaries can be chosen to give the minimum expected sample size at the response and toxicity rates for the standard therapy (pRo . They present optimal designs for twostage trials that extend the designs of Simon (3). pTo ) among all designs that satisfy the error requirements. Alternatively. m 2 . For a ﬁxed total sample size. . In the second stage. Y R1 patients respond and Y T1 patients do not experience toxicity. where N 2 . 2. Bryant and Day (2) also consider the problem of monitoring binary end points representing response and toxicity. c T1 ). continue to the second stage. . Although the methods differ in the particular . ‘‘not promising’’ due to excessive toxicity. one could choose the design that minimizes the maximum expected sample size over the entire null hypothesis region. N ∑k m k . C R1 . . N 2 N 1 patients are accrued. At the end of the ﬁrst stage. C R2 and Y T2 C T2 . 2. they evaluate the sensitivity of the designs to a misspeciﬁcation of the value for the odds ratio. terminate due to inadequate response. 3. . 4. If Y R2 If Y R2 If Y R2 C R2 and Y T2 C T2 . (c RK. θ. c T2 ). At the end of this stage. 3. 4. terminate due to excessive toxicity. the following rules govern the decision whether or not the new agent is sufﬁciently promising. terminate due to both factors. An additional criterion. where N 1 . In the ﬁrst stage. and C T1 are parameters to be chosen as part of the design speciﬁcation: 1. C R2 .110 Petroni and Conaway responses (at least c RK ) and sufﬁciently few observed toxicities (no more than c TK ). ‘‘sufﬁciently promising. . The design parameters are determined from prespeciﬁed error constraints. ‘‘not promising’’ due to inadequate re- C R2 and Y T2 C T2 . (c R2 . ‘‘not promising’’ due to both factors. . . c TK ) satisfying the error constraints listed above. m K and boundaries (c R1 . . C T1 . there may be many designs that satisfy the error requirements. N 1 patients are accrued and classiﬁed by response and toxicity. In designing the study. a decision to continue to the next stage or to terminate the study is made according to the following rules. such as one of those proposed by Simon (3) in the context of two-stage trials with a single binary end point. Through simulations. If If If If Y R1 Y R1 Y R1 Y R1 C R1 C R1 C R1 C R1 and and and and Y T1 Y T1 Y T1 Y T1 C T1 . Conaway and Petroni (1) compute the ‘‘optimal’’ designs for these criteria for two-stage and three-stage designs using a ﬁxed prespeciﬁed value for the odds ratio.

1. ϕ. Constraining the probability of recommending a treatment with an insufﬁcient response rate leads to α 10 (Q. i 0. Constraining the probability of recommending α R . so that an upper bound on α 00(Q. The stage sample sizes and boundaries for the optimal design depend on the value of the nuisance parameter. For an unspeciﬁed odds ratio. 1. Speciﬁcally. ϕ. j 0. Q. ϕ) and E 10(Q. One would like to limit the probability of recommending a treatment that has an insufﬁcient response rate or excessive toxicity rate. and an odds ratio. ϕ) and E 10 (Q. Bryant and Day (2) deﬁne the optimal design to be the one that minimizes the expected number of patients in a study of a treatment with an unacceptable response or toxicity rate. Similarly. among all designs meeting the error criteria. ϕ). one would like to constrain the probability of failing to recommend a treatment that is superior to the standard treatment in terms of both response and toxicity rates. There can be many designs that meet these speciﬁcations. given that the true response rate equals PRi and the true nontoxicity rate equals PTj . Q. Q (N 1 . Bryant and Day (2) note that α 00 (Q. where α T and β are prespeciﬁed constants. C T2 ). j 0. C R2 . Under any of the four combinations of acceptable or unacceptable rates of response and nontoxicity. among all . where E ij is the expected number of patients accrued when the true response rate equals P Ri and the true nontoxicity rate equals P Tj . ϕ) is less than either α 01(Q. that minimizes the maximum of E 01 (Q. Toxicity) P(Response. let α ij (Q. ϕ) α T . and ensuring a sufﬁciently high probability of recommending a truly superior treatment requires α 11 (Q. N 2 . ϕ) 1 β. ϕ) or α 10 (Q. No toxicity) P(No Response. Toxicity) Bryant and Day (2) parameterize the odds ratio in terms of response and no toxicity so ϕ corresponds to 1/θ in the notation of Conaway and Petroni (1). Among these designs. i 0. The association between response and toxicity is determined by the odds ratio. the optimal design is the one that minimizes the average number of patients treated with an ineffective therapy. ϕ). ϕ) be the probability of recommending the treatment. Bryant and Day (2) assume that the association between response and toxicity is constant. C R1 . ϕ). 1. ϕ P(No Response. 1. In choosing the design parameters. where a treatment with an insufﬁcient response rate leads to α 01 (Q. ϕ. No Toxicity) P(Response. in the 2 2 table cross-classifying response and toxicity. Finally. ϕ) α R is a prespeciﬁed constant. Bryant and Day (2) specify an acceptable (P R1 ) and an unacceptable (P R0 ) response rate along with an acceptable (P T1 ) and unacceptable (P T0 ) rate of nontoxicity. The expected value E 00 (Q. Bryant and Day (2) choose the design.Designs Based on Toxicity and Response 111 constraints considered. the motivation for these error constraints is the same. ϕ) does not play a role in the calculation of the optimal design because it is less than both E 01 (Q. ϕ) is implicit in these constraints. For a design. C T1 .

III. Conaway and Petroni (1) came to a similar conclusion. that would be acceptable if the treatment produced no toxicities. pTo . but different values for the assumed odds ratio led to similar designs.max . ϕ). By considering a number of examples.112 Petroni and Conaway designs that meet the error constraints. including the form for the alternative region. and the toxicity rate. that would be acceptable if the new treatment were to produce responses in all patients. The alternative hypothesis is that the new treatment is sufﬁciently effective and safe to warrant further study. E 10 (Q. a new treatment must show evidence of a greater response rate and a lesser toxicity rate than the standard treatment. One of the primary issues in the design is how to elicit the trade-off speciﬁcation.’’ In practice this can be difﬁcult to elicit. The terms ‘‘sufﬁciently safe’’ and ‘‘sufﬁciently effective’’ are relative to the response rate. DESIGNS THAT ALLOW A TRADE-OFF BETWEEN RESPONSE AND TOXICITY The designs for response and toxicity proposed by Conaway and Petroni (1) and Bryant and Day (2) share a number of common features. for the standard treatment. either due to an insufﬁcient response rate or excessive toxicity. A simpler method for obtaining the trade-off information is for the investigator to specify the maximum toxicity rate. pR. Bryant and Day (2) provide bounds that indicate that the characteristics of the optimal design for an unspeciﬁed odds ratio do not differ greatly from the optimal design found by assuming that response and toxicity are independent. Their designs are computed under a ﬁxed value for the odds ratio. max ϕ {max(E 01 (Q. pRo .min . In practice. the investigator would be asked to specify the minimum response rate. Conaway and Petroni (4) propose two-stage designs for phase II trials that allow early termination of the study if the new therapy is not sufﬁciently promising and allow a trade-off between response and toxicity. since one may be willing to allow greater toxicity to achieve a greater response rate or may be willing to accept a slightly lower response rate if lower toxicity can be obtained. Figure 2 illustrates the set of values for the true response . Assumptions about a ﬁxed value of the odds ratio lead to a simpler computational problem. the optimal design minimizes the maximum expected patient accruals under a treatment with an unacceptable response or toxicity rate. Similarly. The hypotheses are the same as those considered for the bivariate designs of the previous section. the trade-off between safety and efﬁcacy would be summarized as a function of toxicity and response rates that deﬁnes a treatment as ‘‘worthy of further study. ϕ))}. a trade-off could be considered in the design. pT. Ideally. In these designs. this is particularly true if response and toxicity are assumed to be independent (ϕ 1). The null hypothesis is that the new treatment is not sufﬁciently promising to warrant further study.

pTo ) and (1. which satisfy the null and alternative hypotheses. The values chosen for Fig.5.Designs Based on Toxicity and Response 113 Figure 2 Null and alternative regions for trade-offs. the equation of the line connecting (pRo . pR.min 0. rate (pR ) and true toxicity rate (p T ). pTo 0. where tan(ψR ) pTo /(pRo pR.4.max 0.min ) is given by the equation pT pTo tan(ψR )(pR pRo ).max ) is given by the equation pT pTo tan(ψT )(pR pRo ). pT. With ψT ψR . Similarly. pR. The line connecting the point (pRo . where tan(ψT ) (pT.max pTo )/(1 pRo ). 2 are p Ro 0.7.min ). and pT. although the basic principles in constructing the design and specifying the trade- .2. the null hypothesis is H o : pT pTo tan(ψT )(pR pRo ) or pT pTo tan(ψR )(pR pRo ) and the alternative hypothesis is H a : pT pTo tan(ψT )(pR pRo ) and pT pTo tan(ψR )(pR pRo ) The forms of the null and alternative are different for the case where ψT ψR . pTo ) and (1.

and power (1 β). To describe the trade-off designs for a ﬁxed sample size. The test statistic is denoted by T(p). X 21 .114 Petroni and Conaway off information remain the same (cf. θ) 1 β These probabilities are computed for a ﬁxed value of the odds ratio. Conaway and Petroni (4) determine sample size and critical values under an assumed value for the odds ratio between response and toxicity. For an appropriate choice of sample size (N). the value c can be chosen to constrain the probability of recommending a treatment that has an insufﬁcient response rate relative to the toxicity rate and ensure a high probability of recommending a treatment with response rate pRa and toxicity rate pTa. pT )∈Ho c | p R . pTa . where p (1/ N)(X 11 . The trade-off designs can be extended to two-stage designs that allow early termination of the study if the new treatment does not appear to be sufﬁciently promising. X 12 . Conaway and Petroni [4]). 1 β. we use the notation and assumptions for the ﬁxed sample size design described in Section II. signiﬁcance level (α). Special cases of 0 and ψR π/2 yield the these hypotheses have been used previously: ψT critical regions of Conaway and Petroni (1) and Bryant and Day (2). As in their earlier work. A vector of observed proportions p leads to rejection of the null hypothesis if T(p) c. The I error. m 2 ) and decision boundaries (c 1 . [7]). by enumerating the value of T(p) for all possible realizations of the multinominal vector (X 11 . Rejection of the null hypothesis results when the observed value of T(p) is ‘‘too far’’ from the null hypothesis region. pTa ) satisﬁes the constraints deﬁning the alternative hypothesis and represents the response and toxicity rates for a treatment considered to be superior to the standard treatment. the goal is to choose the stage sample sizes (m 1 . X 21 . θ) α and P(T(p) c | pRa . at a particular point pR point (pRa . X 22 ). Robertson et al. c 2 ) to satisfy error probability constraints similar to those in the ﬁxed sample size trade-off design. In designing the study. The sample size calculations require a speciﬁcation of a level of type pRa and pT pTa. X 12 . ψR ψT 0 yield hypotheses in terms of toxicity alone. . and ψR ψT π/2 yield hypotheses in terms of response alone. The critical value c is chosen to meet the error criteria: sup P(T(p) (pR. X 22 ). α. is the vector of sample proportions in the four cells of Table 1 and is based on computing an ‘‘I-divergence measure’’ (cf. and power. The test statistic has the intuitively appealing property of being roughly analogous to a ‘‘distance’’ from p to the region H o. θ. pT .

et al. for example. In cases where many designs meet the error requirements.6) outline a strategy for monitoring each end point in the trial. the chosen design minimizes the maximum expected sample size under the null hypothesis. They conclude that unless the odds ratio is badly misspeciﬁed. 2 and 3 in Conaway and Petroni (4) to illustrate the characteristics of the trade-off designs. If. In the example given above for a trial . a monitoring boundary based on prespeciﬁed targets for an improvement in efﬁcacy and an unacceptable increase in the rate of adverse events. (5. Through simulations. p2 ) c 2 | pR .Designs Based on Toxicity and Response 115 sup P(T1 (p1 ) (pR. T2 (p1 . a treatment that improves the response rate by 15 percentage points might be considered promising. Among all designs that meet the error constraints. even if its toxicity rate is 5 percentage points greater than the standard therapy. p Ta . (5. the investigators can better understand the implications of the tradeoff being proposed. Thall. The critical values for the test statistic are much harder to interpret than the critical values in Conaway and Petroni (1) or Bryant and Day (2). θ) 1 β where T 1 is the test statistic computed on the stage 1 observations and T 2 is the test statistic computed on the accumulated data in stages 1 and 2. similar to Figs. an optimal design is found according to the criterion in Bryant and Day (2) and Simon (3). This idea also motivated the Bayesian monitoring method of Thall et al. for each end point in the trial. We recommend two plots. the new therapy increases the toxicity rate by 10 percentage points. Conaway and Petroni (4) investigate the effect of ﬁxing the odds ratio on the choice of the optimal design. T 2 (p 1 . The ﬁrst is a display of the power of the test. however. so that the investigators can see the probability of recommending a treatment with true response rate pR and true toxicity rate pT. With these plots. which are counts of the number of observed responses and toxicities. p 2 ) c 2 | pRa . p T )∈Ho c 1 . They note that. these probabilities are computed for a ﬁxed value of the odds ratio and are found by enumerating all possible outcomes of the trial. The trade-off designs of Conaway and Petroni (4) were motivated by the idea that a new treatment could be considered acceptable even if the toxicity rate for the new treatment is greater than that of the standard treatment. The second plot displays the rejection region. pT . provided the response rate improvement is sufﬁciently large. As in the ﬁxed sample size design.6). so that the investigators can see the decision about the treatment that will be made for speciﬁc numbers of observed responses and toxicities. θ) α and P(T 1 (p 1 ) c 1 . the choice of the odds ratio has little effect on the properties of the optimal design. They deﬁne. it might not be considered an acceptable therapy.

m. pS22 ). Thall et al. In a typical phase II trial. A Dirichlet prior for the cell probabilities is particularly convenient in this setting. the trial should be stopped and the treatment declared ‘‘sufﬁciently promising.’’ For m j M. X j . since this induces a beta prior on p GR and p GT . Thall et al. for response. In addition to the prior distribution. pE21 . It continues until either a maximum number of patients. for G S or E. pE12 . The monitoring of the end points begins after a minimum number of patients. Under the standard therapy. the monitoring boundaries are P[P ER P[P ER P[pET P SR δ(R) | X j] P SR|X j] pU(R) pST δ(T) | X j] pL(R) pU(T) . M. pS21 .6) translate these rules into statements about the updated (posterior) distribution [pE | X j ] and the prior distribution pS . if there is strong evidence that the new treatment is superior to the standard treatment in terms of the targeted improvement for response. the cell probabilities are denoted PE (pE11 . have been observed. δ(T). If there is strong evidence that the new therapy does not meet the targeted improvement in response rate.6) take a Bayesian approach that allows for monitoring each end point on a patient by patient basis. After the response and toxicity classiﬁcation on j patients. whereas the distribution on the probabilities under the new therapy is updated each time a patient’s outcomes are observed. the distribution on the probabilities under the standard therapy remains constant throughout the trial. there are several possible decisions one could make. have been accrued or a monitoring boundary has been crossed. we simplify the discussion by considering only a single efﬁcacy event (response) and a single adverse event (toxicity) end point. then the trial should be stopped and the new treatment declared ‘‘not sufﬁciently promising.6) specify a target improvement.’’ In terms of toxicity. pS12 . the trial should be stopped if there is strong evidence of an excessive toxicity rate with the new treatment. where G stands for either S or E.’’ Alternatively. in which only the new therapy is used. Putting a prior distribution on the cell probabilities (p G11 . (5. using prespeciﬁed cutoff for what constitutes ‘‘strong evidence. and a maximum allowable difference. δ(R). p G21 .116 Petroni and Conaway with a single response end point and a single toxicity end point. they elicit a prior distribution on the cell probabilities in Table 2. have been observed. under the new experimental therapy. Although their methods allow for a number of efﬁcacy and adverse event end points. pE22 ). Before the trial begins. (5. (5. the cell probabilities are denoted PS (pS11 . p G12 . p G22 ) induces a prior distribution on p GR p G11 p G12 and on p GT p G11 p G21 . Thall et al. the targeted improvement in response rate is 15% and the allowance for increased toxicity is 5%. for toxicity.
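The posterior quantities used in monitoring rules of this kind can be approximated by straightforward Monte Carlo once beta distributions are adopted for the response rates, as a Dirichlet prior implies for the marginal probabilities. The priors and interim data in the sketch below are hypothetical and purely illustrative.

```python
import random

def post_prob_exceeds(a_e, b_e, a_s, b_s, delta, draws=100_000, seed=1):
    """Monte Carlo estimate of P(p_E > p_S + delta) for independent Beta
    distributions on the experimental and standard response rates."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(a_e, b_e) > rng.betavariate(a_s, b_s) + delta
               for _ in range(draws))
    return hits / draws

# Hypothetical inputs: standard rate ~ Beta(10, 10) (centered near 0.5);
# experimental prior Beta(1, 1) updated with 14 responses in 20 patients.
prob = post_prob_exceeds(a_e=1 + 14, b_e=1 + 6, a_s=10, b_s=10, delta=0.15)
print(round(prob, 3))   # compare against the prespecified cutoff probability
```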

2. Although there is no formal trade-off discussion in these article.Designs Based on Toxicity and Response 117 where pL (R). None of the methods use asymptotic approximations for distributions and are well suited for the small sample sizes encountered typically in phase II trials. The trade-off designs of Conaway and Petroni (4) have a trade-off strategy that permits the allowable level of toxicity to increase with the response rate. 51:1372–1383. but the choice of the Dirichlet prior makes the computations relatively easy. one needs to modify the hypotheses to be tested. this means that only a 5% increase in toxicity is allowable even if the response rate with the new treatment is as much as 30%. (5. the method can provide graphical representations of the probability associated with each of the decision rules. Bivariate sequential designs for phase II trials. For example. Day R.6). Biometrics 1995. The bivariate designs of Conaway and Petroni (1) and Bryant and Day (2) have critical values that are based on the observed number of responses and the observed number of toxicities. Incorporating toxicity considerations into the design of two-stage phase II clinical trials. pU (R). (5.6) have advantages in terms of being able to monitor outcomes on a patient by patient basis. Numerical integration is required to compute these probabilities. (5. Petroni GR. the null and alternative hypothesis could be changed to Ho : pR Ha : pR pRo pRo δR or pT pTo δ T δ R and pT pTo δ T for some prespeciﬁed δ R and δ T. the standard for ‘‘allowable toxicity’’ is greater for a treatment with a 30% improvement than for one with a 15% improvement. REFERENCES 1. At each monitoring point. and pU (T) are prespeciﬁed probability levels. Conaway MR. the general methods can be adapted to the kind of trade-off discussed in Thall et al. With the trade-off of Conaway and Petroni (4). The methods of Thall et al.6). . IV. SUMMARY All methods discussed in this chapter have advantages in monitoring toxicity in phase II trials. these statistics are easily calculated and interpreted by the investigators. in the trade-off example of Thall et al. Biometrics 1995. To do this. Because the allowance in toxicity is prespeciﬁed. Bryant J. a 5% increase in toxicity would be considered acceptable for a treatment with a 15% increase in response. In contrast. 51:656–664.

Ltd. Stat Med 1995. Simon RM. Biometrics 1996. 4. Designs for phase II trials allowing for trade-off between response and toxicity. 5. Chichester: John Wiley and Sons. Petroni GR. . 14:296–303. J Clin Oncol 1996. Estey EH.. Optimal two-stage designs for phase II clinical trials. Wright FT. 1988. Simon R. Order Restricted Statistical Inference. Thall PF. 52:1375–1386. Simon RM.118 Petroni and Conaway 3. 7. Thall PF. 10:1–10. Conaway MR. Bayesian sequential monitoring designs for singlearm clinical trials with multiple outcomes. Robertson T. Estey EH. 6. New statistical strategy for monitoring safety and efﬁcacy in single-arm clinical trials. Controlled Clin Trials 1989. 14:357–379. Dykstra RL.

6
Phase II Selection Designs

P. Y. Liu
Fred Hutchinson Cancer Research Center, Seattle, Washington

I. BASIC CONCEPT

When there are multiple promising new therapies in a disease setting, it may not be possible to test all of them against the standard treatment in a definitive phase III trial. The sample sizes required for a phase III study with more than three arms could be prohibitive (1). In addition, the analysis can be highly complex and prone to errors due to the large number of possible comparisons in a multiarm study (see Chap. 4). An alternative strategy is to screen the new therapies first and choose one to test against a standard treatment in a simple two-arm phase III trial. Selection designs can be used in such circumstances.

Simon et al. (2) first introduced statistical methods for ranking and selection to the oncology literature. In a selection design, patients are randomized to treatments involving new combinations or schedules of known active agents, or new agents for which activity against the disease in question has already been demonstrated in some limited setting. In other words, the regimens under testing have already shown promise. Now the aim is to narrow down the choice for formal comparisons with the standard therapy. Hypothesis tests are not performed. In this approach, one always selects the observed best treatment for further study, however small the advantage over the others may appear to be. Sample sizes are established so that if a treatment exists for which the underlying efficacy is superior to the others by a specified amount, it will be selected with a high probability. The required sample sizes are usually similar to those associated with pilot phase II trials.

Although the statistical principles for the selection design are simple, its application can be rather slippery. It is of vital importance to emphasize up front that a selection design serves merely as a precursor to the requisite definitive phase III comparison. Except in extreme cases, the observed best treatment could be truly superior, or it could appear so simply due to chance with no real advantage over the other treatments. The approach presumes the subsequent conduct of definitive phase III trials and makes no attempt, and therefore has no power, in distinguishing the former from the latter at the selection step. Results from selection trials are error prone when treated as ends in themselves. Therefore, the major abuse of the design is to treat the observed ranking as conclusive and forego the subsequent phase III testing. Falsely justified by the randomized treatment assignment, one treatment can look substantially better than the others, and there is often great temptation to treat the unproved results as final (3). A "treatment of choice" is often erroneously concluded in such situations. This practice is especially dangerous when a standard arm is included as the basis for comparison or when all treatment arms are experimental and a standard treatment does not exist for the particular disease. The false-positive rate of the misapplication is discussed in more detail in Section IV. Unless the follow-on phase III study is ensured by some external mechanism such as government regulations for new drug approval, selection designs can do more harm than good by their propensity for being misused.

II. SAMPLE SIZE REQUIREMENTS

A. Binary Outcomes

Table 1 shows sample size requirements for binary outcomes with K = 2, 3, and 4 groups from Simon et al. (2). With the listed N per group and true response rates, the correct selection probability should be approximately 0.90. The sample sizes were presumably derived by normal approximations to binomial distributions. A check by exact probabilities indicates that the actual correct selection probability ranges from 0.89 down to 0.86 when N is small. Increasing the sample size per group by five raises the correct selection probability to 0.90 in all cases and may be worth considering when N is less than 30. Table 1 indicates the sample size to be relatively insensitive to baseline response rates (i.e., response rates of groups 1 through K - 1). Since precise knowledge of the baseline rates is often not available, a common conservative approach is to always use the largest sample size for each K, that is, 37, 55, and 67 patients per group for K = 2, 3, and 4, respectively. Although a total N of 74 for two groups is in line with large phase II studies, the total number of patients required for four groups, close to 270, could render the design impractical for many applications.
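As a quick check of these operating characteristics (not part of the original reference), the correct selection probability for one row of Table 1 can be simulated directly; the configuration below is the (40%, 55%) row with K = 2 and N = 37, and ties are broken at random.

/* Sketch: simulated probability of correct selection for a binary-outcome */
/* selection design (K = 2, N = 37 per arm, response rates 40% vs. 55%).   */
data sel;
   call streaminit(20010);
   K = 2;  N = 37;
   array p[2] _temporary_ (0.40 0.55);      /* arm 2 is truly best          */
   nsim = 100000;  correct = 0;
   do sim = 1 to nsim;
      best = 1;  bestx = rand('BINOMIAL', p[1], N);
      do arm = 2 to K;
         x = rand('BINOMIAL', p[arm], N);
         if x > bestx or (x = bestx and rand('UNIFORM') < 0.5) then do;
            best = arm;  bestx = x;
         end;
      end;
      correct + (best = K);
   end;
   p_correct = correct / nsim;               /* close to 0.90               */
   put p_correct=;
run;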

Table 1  Sample Size per Treatment for Binary Outcomes and 0.90 Correct Selection Probability

Response rates                      N per group
P1, . . ., PK-1      PK        K = 2    K = 3    K = 4
10%                 25%          21       31       37
20%                 35%          29       44       52
30%                 45%          35       52       62
40%                 55%          37       55       67
50%                 65%          36       54       65
60%                 75%          32       49       59
70%                 85%          26       39       47
80%                 95%          16       24       29

From Ref. 2.

B. Survival Outcomes

For censored survival data, Liu et al. (4) suggested fitting the Cox proportional hazards model,

h(t, z) = h0(t) exp(beta'z),

to the data, where z is the (K - 1)-dimensional vector of treatment group indicators and beta = (beta1, . . ., betaK-1) is the vector of log hazard ratios. We proposed selecting the treatment with the smallest estimated betai (where betaK = 0) for further testing. Sample sizes for 0.90 correct selection probability were calculated based on the asymptotic normality of the estimated beta. The requirements for exponential survival and uniform censoring are reproduced in Table 2, where I and F are the respective cumulative distribution functions for censoring and survival times. With the integral of I dF as the proportion of censored observations, the actual numbers of expected events are the same for the different rows in Table 2. Table 2 covers scenarios where the patient enrollment period is similar to the median survival of the worst groups; it does not encompass situations where the two are different. Since the effective sample size for exponential survival distributions is the number of uncensored observations, some readers may find the expected event count more flexible for planning purposes. For 0.90 correct selection probability, Table 3 gives the approximate number of events needed per group for the worst groups. As with binary outcomes, the design becomes less practical when the hazard ratio between the worst and the best groups is smaller than 1.5 or when there are more than three groups. Simulation studies of robustness of the proportional hazards assumption found the correct selection probabilities to be above 0.80 for moderate departures from the assumption.

As an example. the .5 1 0 0.5* 51 31 26 69 40 31 87 48 36 1.5 Follow‡ 0 0. III.4* 72 44 36 96 56 44 121 68 51 1.5 1 1. Designs with Toxicity Acceptance Criteria Toxicity or side effects often are major concerns for cancer treatments.4 34 50 60 1. Treatments could even be stopped early to guard against excessive toxicity.5* 91 56 46 122 70 55 153 86 65 1. Table 3 Event Count per Group for the Worst Groups for Exponential Survival and 0. † Median survival in years for groups 1 through K 1. ‡ Additional follow-up in years after accrual completion. it is recommended that formal acceptance criteria should be established for them.122 Liu Table 2 Sample Size per Treatment for Exponential Survival Outcomes with 1-Year Accrual and 0. group K.3* 206 127 106 275 160 125 345 194 147 K 4 1.90 Correct Selection Probability K Median† 0.4* 107 66 54 143 83 65 180 101 76 1.5 24 36 43 * Hazard ratio of groups 1 through K 1 vs.3* 171 106 88 229 133 104 287 161 122 K 3 1.3 54 80 96 1.75 1 * Hazard ratio of groups 1 through K 1 vs. From Ref. VARIATIONS OF THE DESIGN A.5* 76 46 38 102 59 46 128 72 54 1.4* 128 79 65 172 100 78 216 121 92 0.3* 115 71 59 153 89 70 192 108 82 2 1. 4.5 1 0 0. Selection then takes place only among those with acceptable toxicity. group K.90 Correct Selection Probability HR* K 2 3 4 1. If the toxicity proﬁles of the treatments under evaluation are not well known.

Southwest Oncology Group study S8835 investigated mitoxantrone or floxuridine administered intraperitoneally in patients with minimal residual ovarian cancer after the second-look laparotomy (5). The study was designed with 37 patients per arm. Forty percent or more of the patients not tolerating the initial dose for at least two courses of treatment was considered not acceptable. A treatment would be dropped if 13 or more of the first 20 patients on that arm could not tolerate at least two courses of treatment at the starting dose, since the probability of 13 or more out of 20 is only 0.02 if the true proportion is 40%. Therefore, patient accrual to either treatment could be stopped early if unacceptable toxicity was observed. When more than one treatment arm is accepted, the treatment with a higher percent of patients free of disease progression or relapse at 1 year would be selected for more evaluation.

B. Designs with Minimum Activity Requirements

Though the selection design is most appropriate when acceptable levels of treatment efficacy are no longer in question, the idea of selection is sometimes applied to randomized phase II trials when anticancer activities have not been previously established for the treatments involved. Alternatively, the side effects of the treatments could be sufficiently severe that a certain activity level must be met to justify the therapy. In such cases, each treatment arm is designed as a stand-alone phase II trial with the same acceptance criteria for all arms. When more than one treatment arm is accepted, the observed best arm is selected for further study. Statistical properties of this approach have not been formally quantified. However, designing the study with the larger sample size between what is required for a standard phase II and that for a selection phase II would generally give a reasonable result.

For example, if a 20% response rate would not justify further investigating a treatment regimen whereas a 40% response rate definitely would, a standard phase II design based on the binomial distribution requires 44 patients in a single-stage study. Fourteen or more responses out of 44 would represent sufficient activity for continued pursuit (6). This design has a type I error of 0.04 (chance of observing 14 responses out of 44 when the true response rate is 20%) and a power of 0.90 (chance of observing 14 responses out of 44 when the true response rate is 40%). On the other hand, with binary data, for a selection design to have 0.90 correct selection probability when the response in one treatment is higher than the others by an absolute 15%, a sample size of 37 or 55 per group is required for K = 2 or 3, per Table 1. Therefore, one would design the study with 44 patients per arm when K = 2 and 55 patients per arm when K = 3. The acceptance criterion for N = 55 is adjusted to 17 or more responses.
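The binomial operating characteristics quoted above can be verified directly from exact tail probabilities; a short check follows (PROBBNML returns the lower tail P[X <= m]).

/* Check of the binomial cutoffs quoted above.                            */
data oc;
   /* S8835 toxicity rule: 13+ of the first 20 not tolerating treatment   */
   p_tox = 1 - probbnml(0.40, 20, 12);     /* approx. 0.02                */
   /* minimum-activity rule: 14+ responses out of 44                      */
   alpha = 1 - probbnml(0.20, 44, 13);     /* type I error, approx. 0.04  */
   power = 1 - probbnml(0.40, 44, 13);     /* power, approx. 0.90         */
   put p_tox= alpha= power=;
run;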

When activity levels are low for all treatments, the minimum response requirement would dominate and reject all treatments. When there are more than one treatment with acceptable response levels and one particularly superior treatment, the combined minimum activity/selection criteria would also result in the correct selection with high probability. Limited simulations were conducted for K = 2, N = 44 and K = 3, N = 55, with the minimum acceptance criteria stated above and one superior arm for which the response rate is higher than the rest by 15%. The results indicate the probability of the best arm meeting the minimum requirement and being selected is close to 0.20 when the true best response rate (pK) is 25%. The correct selection probability is in the low to mid 0.70 range when pK = 35% and approximately 0.90 when pK is 45% or higher.

Alternatively, for outcomes available in a short time, the trial could proceed in stages, with more patients enrolled for the selection purpose only when more than one treatment is accepted. Some authors suggest designing the activity screening with standard two-stage design sample sizes. If more than one treatment is accepted at the completion of the second stage, the observed best treatment would be selected for further study. This approach is not recommended because the decision to enroll more patients (or not) could be influenced by observed outcome data. The larger sample size of the two-stage design at full accrual and the selection design should still be used. The sample sizes at different stages and acceptance criteria could be adapted if needed (6).

C. Designs for Ordered Treatments

When the K (>= 3) treatments under consideration consist of increasing dose schedules of the same agents, the design can take advantage of this inherent order. A simple method is to fit regression models to the outcomes with treatment groups coded in an ordered manner according to the increasing dose levels. Logistic regression for binary outcomes and the Cox model for survival are obvious choices. A single independent variable with equally spaced scores for the treatments would be included in the regression (e.g., 1, 2, and 3 for three groups). If the sign of the observed slope is in the expected direction, the highest dose with acceptable toxicity is selected for further study. Otherwise, the lowest dose schedule would be selected. Compared with the nonordered design, this approach should require smaller sample sizes for the same correct selection probability. Limited simulations were conducted with the following results.

1. For binary data with K = 3, p1 = p2 = 40%, and p3 = 55%, approximately N = 35 per arm is needed for a 0.90 chance that the slope from the logistic regression is positive. Compared with N = 55 given in Table 1, this is a substantial reduction in sample size. Similarly, for K = 4, p1 = p2 = p3 = 40%, and p4 = 55%, 40 patients per arm are needed instead of 67.
2. For exponential survival data with a 1.5 hazard ratio between the worst groups and the best group, approximately 28 and 32 events per group are needed for the worst groups for K = 3 and 4, respectively, as compared with 34 and 41 given in Table 3.

IV. MISAPPLICATIONS AND RESULTING FALSE-POSITIVE RATES

As mentioned in the beginning, the principal misuse of the selection design is to treat the results as ends in themselves without the required phase III investigations. Liu et al. (3) previously published the false-positive rates of this misapplication. Briefly, for binary outcomes, when the true response rates are the same in all treatments and in the 10-20% range, the chance of observing an absolute 10% or higher difference between two treatments is roughly 0.20 for K = 2 and 0.30 for K = 3. When the shared response rate is in the 30-60% range, the chance of observing a 15% or greater difference is close to 0.20 to 0.35 when K = 2 and 0.30 to 0.45 when K = 3. There is a 0.10 chance for a greater than 20% difference if the true rate is 50% or 60% for either K. These are impressive looking differences arising with high frequencies purely out of chance. Similarly, with Table 2 sample sizes and the same exponential survival distribution for all groups, the chances of observing a hazard ratio greater than 1.3 and 1.5 are roughly 0.20 to 0.37 and 0.16 to 0.21, respectively. Observed hazard ratios of 1.3 to 1.5 often represent true treatment advances in large definitive phase III studies, but with selection sample sizes they appear with alarmingly high probabilities when there are no survival differences between treatments at all.

Some have proposed changing the selection criterion so that the observed best treatment will be further studied only when the difference over the worst group is greater than some positive Delta; otherwise, none of the treatments will be pursued. Although this approach may seem appealing at face value, the sample size requirement for the same correct selection probability increases quickly as Delta increases. To illustrate, for binary data with K = 2, p1 = 50% and p2 = 65%, 36 patients per treatment are required for Delta = 0 and a 0.90 correct selection probability per Table 1. With the same configuration and 0.90 correct selection probability, the required number of patients per group is 40, 55, 79, and 123 when Delta = 1%, 3%, 5%, and 7%, respectively. Clearly, when Delta > 5% the sample size required is impractical. Even with Delta = 5% and 79 patients per group, by design the correct selection probability remains 0.90 when the true response rates are 50% and 65%, yet the results are by no means definitive when a greater than 5% absolute difference is observed. When p1 = p2 = 50% and n1 = n2 = 79, the chances of observing |p1hat - p2hat| >= 5% and 10% are approximately 0.53 and 0.21, respectively. Again, this approach is not recommended because it may incur a false sense of confidence that a true superior treatment has been found when the observed results are positive.

We (3) also pointed out that performing hypothesis tests post hoc changes the purpose of the design.
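The last pair of figures can be reproduced with a few lines of SAS, assuming (as the numbers suggest) that they come from the usual normal approximation to the difference of two proportions.

/* With p1 = p2 = 50% and n = 79 per arm, how often does the observed     */
/* difference in response rates reach 5% or 10% purely by chance?         */
data falsepos;
   n = 79;  p = 0.5;
   se  = sqrt(2*p*(1-p)/n);
   p05 = 2*(1 - probnorm(0.05/se));   /* P(|p1hat - p2hat| >= 5%), ~0.53  */
   p10 = 2*(1 - probnorm(0.10/se));   /* P(|p1hat - p2hat| >= 10%), ~0.21 */
   put se= p05= p10=;
run;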

If the goal is to reach definitive answers, then a phase III comparison should have been designed with appropriate analyses and error rates. Testing hypotheses with selection sample sizes could be likened to conducting the initial interim analysis for phase III trials. It is well known that extremely stringent p values are required to "stop the trial" at this early stage.

V. CONCLUDING REMARKS

The statistical principles of the selection design are simple and adaptable to various situations present in cancer clinical research. Applied correctly, the design can serve a useful function in the long and arduous process of new treatment discovery. However, unless follow-on definitive evaluations are ensured by external means, perhaps due to the time and resources involved, there can be a tremendous tendency to stop short of the phase III testing and declare winners at less than a quarter of the distance of this marathon race. The resulting errors can send research into lengthy detours and do great disservice to cancer patients.

Finally, the inclusion of a standard or control treatment in a selection design is especially dangerous. When a control arm is included for randomization, the legitimacy for comparison is established and there is great temptation to interpret the results literally and "move on." If there are no efficacy differences between treatments, the chance of observing an experimental treatment better than the control is (K - 1)/K, that is, 1/2 for K = 2, 2/3 for K = 3, and so on. The chance of an impressive difference between an experimental treatment and the control treatment is as discussed above for K = 2 and higher than 2/3 of those discussed above for K = 3. In this case, false-negative conclusions are as damaging as false-positive ones because a new treatment with similar efficacy as the control but less severe side effects could be dismissed as ineffective. Without a control arm, any comparison between the standard and the observed best treatment from a selection trial is recognized as informal because the limitations of historical comparisons are widely accepted. Therefore, the absence of a control treatment may be the only imperfect safeguard for the design.

REFERENCES

1. Liu PY, Dahlberg S. Design and analysis of multiarm clinical trials with survival endpoints. Controlled Clin Trials 1995; 16:119-130.
2. Simon R, Wittes RE, Ellenberg SS. Randomized phase II clinical trials. Cancer Treatment Rep 1985; 69:1375-1381.
3. Liu PY, LeBlanc M, Desai M. False positive rates of randomized phase II designs. Controlled Clin Trials 1999; 20:343-352.
4. Liu PY, Dahlberg S, Crowley J. Selection designs for pilot studies based on survival. Biometrics 1993; 49:391-398.

5. Muggia FM, Liu PY, Alberts DS, et al. Intraperitoneal mitoxantrone or floxuridine: effects on time-to-failure and survival in patients with minimal residual ovarian cancer after second-look laparotomy: a randomized phase II study by the Southwest Oncology Group. Gynecol Oncol 1996; 61:395-402.
6. Green S, Dahlberg S. Planned versus attained design in phase II clinical trials. Stat Med 1992; 11:853-862.
7. Gibbons JD, Olkin I, Sobel M. Selecting and Ordering Populations: A New Statistical Methodology. New York: Wiley, 1977.


7
Power and Sample Size for Phase III Clinical Trials of Survival

Jonathan J. Shuster
University of Florida, Gainesville, Florida

I. INTRODUCTION

This chapter is devoted to two-treatment 1-1 randomized trials where the outcome measure is survival or, more generally, time until an adverse event. This chapter is concerned with the most common set of assumptions, proportional hazards, which implies that the ratio of the instantaneous probability of death at a given instant (treatment A: treatment B), for patients alive at time T, is the same for all values of T. The methodology is built on a two-piece exponential model initially, but this is later relaxed to cover proportional hazards in general. Sample size guidelines are presented first for situations in which there is no sequential monitoring. These are modified by an inflation factor to allow for O'Brien-Fleming type sequential monitoring (1).

One of the most important aspects of this chapter centers on the nonrobustness of the power and type I error properties when proportional hazards are violated. In practical terms, planning requires far more in the way of assumptions than trials whose results are accrued quickly. This is especially problematic when sequential methods are used. Statisticians are faced with the need for such methods on the one hand but with the reality that they are based on questionable forecasts. As a partial solution to this problem, a "leap-frog" approach is proposed, where a trial accrues a block of patients and is then put on hold.

In Section II, a basic sample size formulation for tests that are inversions of confidence intervals is presented. In Section III, results for exponential survival and estimation of the difference between hazard rates will be given. In Section IV, the connection between the exponential results and the logrank test as described in Peto and Peto (2) and Cox regression (3) is given. Section V is devoted to a generalization of the results for the exponential distribution to the two-piece exponential distribution with proportional hazards, and to a numerical example. In Section VI, the necessary sample size is developed for the O'Brien-Fleming method of pure sequential monitoring. In that section, it is shown that the maximum sample size inflation over trials with no monitoring is typically of the order of 7% or less; for group sequential plans, this inflation factor is even less. Section VII is devoted to ways of obtaining the "information fraction time," an essential ingredient for sequential monitoring. The fraction of failures observed (interim analysis to final planned analysis) is the recommended measure. If that measure is indeed used and if the trial is run to a fixed number of failures, then the piecewise exponential assumption can be further relaxed to proportional hazards. Section VIII deals with application of the methods to more complex designs, including multiple treatments and 2 x 2 factorial designs. Section IX is concerned with competing losses. Finally, Section X gives the practical conclusions, including major cautions about the use of sequential monitoring.

II. A BASIC SAMPLE SIZE FORMULATION

The following formulation is peculiar looking but very useful (see Shuster (4) for further details). Suppose the following Eq. (1), parts (a) and (b), hold under the null (H0) and alternate hypothesis (H1), respectively:

H0: Delta = Delta0  vs.  H1: Delta = Delta1

(a) P[(Deltahat - Delta0)/SE >= Zalpha] = alpha
(b) P[W + (Deltahat - Delta1)/SE >= -Zbeta] = 1 - beta     (1)

where W = (Zalpha + Zbeta)(S - SE)/SE, SE is a standard error estimate, calculated from the data only, valid under both the null and alternate hypothesis, S is a function of the parameters (including sample size) under the alternate hypothesis, and S satisfies the implicit sample size equation

S = (Delta1 - Delta0)/(Zalpha + Zbeta)

Then it follows under H1 that

P[(Deltahat - Delta0)/SE >= Zalpha] = 1 - beta     (2)

That is, the approximate alpha-level test has approximate power 1 - beta.
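As a quick numerical illustration of Eq. (2) (the particular values of alpha, beta, and the effect size are arbitrary): when SE equals the planning value S from the implicit equation, the alpha-level test has power exactly 1 - beta.

/* Illustration of Eq. (2) under a normal model for Deltahat.            */
data check;
   alpha = 0.05;  beta = 0.10;
   za = probit(1 - alpha);  zb = probit(1 - beta);
   delta0 = 0;  delta1 = 1;              /* any planning difference works */
   S = (delta1 - delta0)/(za + zb);      /* implicit sample size equation */
   /* power of the test that rejects when (Deltahat - delta0)/SE >= za,  */
   /* with Deltahat ~ N(delta1, S**2):                                   */
   power = 1 - probnorm(za - (delta1 - delta0)/S);
   put S= power=;                        /* power = 0.90 exactly          */
run;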

Note that Zalpha and Zbeta are usually (but need not always be) the 100alpha and 100beta upper percentiles of the standard normal cumulative distribution function (CDF). To validate the implicit sample size formula, you need only show that under H1, S/SE converges to one in probability. Suppose you have a "confidence interval inversion" test, set up so that Delta1 > Delta0. For two-sided tests, we replace alpha by alpha/2 in the above expression.

Binomial Example (one-sided test). For binomial trials with success rates P1 and P2 and equal sample size N per group,

Delta1 - Delta0 = P1 - P2

and the planning value of SE is sqrt{[P1(1 - P1) + P2(1 - P2)]/N}. Hence, from Eq. (1), each treatment will have sample size

N = [P1(1 - P1) + P2(1 - P2)]{(Zalpha + Zbeta)/(P1 - P2)}^2

This formula is useful for determining the sample size based on the Kaplan-Meier statistic (5). For the two-sided test, alpha is replaced by alpha/2.
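The binomial-example formula takes only a few lines to evaluate; the planning rates below (60% vs. 70%) are illustrative and merely echo the survival rates used later in the chapter.

/* Per-arm sample size for the binomial example, one-sided test.         */
data nbin;
   p1 = 0.70;  p2 = 0.60;            /* planning success rates            */
   alpha = 0.05;  beta = 0.10;       /* one-sided type I error, 1 - power */
   za = probit(1 - alpha);
   zb = probit(1 - beta);
   n  = (p1*(1-p1) + p2*(1-p2)) * ((za + zb)/(p1 - p2))**2;
   n  = ceil(n);
   put n=;                           /* about 386 per arm                 */
run;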

III. EXPONENTIAL SURVIVAL

If the underlying distribution is exponential, it can be shown that the stochastic process that plots the number of deaths observed (Y-axis) versus total accumulated time on test (X-axis) is a homogeneous Poisson process with hazard rate equal to the constant hazard of the exponential distribution. For treatments i = 1, 2, let lambda_i = hazard rate for treatment i, F_i = total number of failures, and T_i = total accumulated time on test. Then

lambdahat_i = F_i/T_i ~ Asy N[lambda_i, lambda_i^2/E(F_i)]

Let Deltahat = lambdahat_1 - lambdahat_2 and Delta = lambda_1 - lambda_2. One has

SE = [lambdahat_1^2/F_1 + lambdahat_2^2/F_2]^0.5

and

S = [lambda_1^2/E(F_1) + lambda_2^2/E(F_2)]^0.5     (3)

If patients are accrued uniformly (Poisson arrival) over calendar time (0, X) and followed until death or to the closure (time X + Y, with Y being the minimum follow-up time), then the probability of death for a random patient assigned to treatment i is easily obtained as P_i = 1 - Q_i, where

Q_i = exp(-lambda_i Y)[1 - exp(-lambda_i X)]/(lambda_i X)     (4)

The expected number of failures on treatment i is

E(F_i) = 0.5 Psi X P_i     (5)

where Psi is the accrual rate for the study (half assigned to each treatment). After substituting Eqs. (4) and (5) into Eq. (3) and equating S to the implicit sample size value, the resulting equation must be solved iteratively (bisection is the method of choice) to identify the accrual period, X, required for given planning values of the accrual rate Psi, minimum follow-up period Y, and values lambda_1 and lambda_2 (and hence of Delta) under the alternate hypothesis. If the allocations are unequal, with gamma_i assigned to treatment i (gamma_1 + gamma_2 = 1), simply replace the 0.5 in Eq. (5) by gamma_i. This is useful in planning studies of prognostic factors or clinical trials where the experimental treatment is very costly compared with the control.

Similar methods for exponential survival have been published by Bernstein and Lagakos (6), George and Desu (7), Lachin (8), Morgan (9), Rubinstein et al. (10), and Schoenfeld (11). These methods use various transformations and thus yield slightly different though locally equivalent results. Schoenfeld allowed the incorporation of covariates, whereas Bernstein and Lagakos allowed one to incorporate stratification.

IV. APPLICATIONS TO THE LOGRANK TEST AND COX REGRESSION

Two important observations extend the utility of the above sample size formulation to many settings under "proportional hazards."

1. Peto and Peto (2) demonstrated the full asymptotic local efficiency of the logrank test when the underlying survival distributions are exponential. This implies that the power and sample size formulas of Section III apply directly to the logrank test (as well as the likelihood-based test).

(See Appendix I for further discussion of the efficiency of the logrank test.) For two treatments, the test for no treatment effect in Cox regression (with no other covariates) is equivalent to the logrank test. Three articles offer software that can do the calculations: Halperin and Brown (12), Cantor (13), and Henderson et al. (14). These programs (or papers as in the original) can also investigate the robustness of the sample size to deviations from distributional assumptions.

2. Two distributions have proportional hazards if and only if there exists a continuous strictly monotonic increasing transformation of the time scale that simultaneously converts both to exponential distributions. This means that an investigator can plan any trial that assumes proportional hazards, as long as a planning transformation that converts the outcomes to exponential distributions is prespecified. This can be approximated well if historical control data are available. The only problems are to redefine the lambda_i and to evaluate E(F_i), since accrual may not be uniform in the transformed time scale. Another approach, used above, that is quite robust to failure in correctly identifying the transformation, presumes a two-piece exponential model. This is discussed in the next section along with a numerical example.

V. PLANNING A STUDY WITH PIECE-WISE EXPONENTIAL SURVIVAL

A. Input Parameters

Psi    Annual planning accrual rate, total (e.g., 210 patients per year are planned)
Y      Minimum follow-up time (e.g., Y = 3 years)
R1     Planning Y-year survival under the control treatment (e.g., R1 = 0.60)
R2     Planning Y-year survival under the experimental treatment (e.g., R2 = 0.70)
rho    Post-time-Y : pre-time-Y hazard ratio (e.g., 0.5)
side   Sidedness of test, one or two (e.g., one)
alpha  Type 1 error (e.g., 0.05)
pi     1 - beta, the power of the test (e.g., 0.80)

Note that lambda_i = -ln(R_i)/Y. Before the minimum follow-up Y, the hazard is lambda_i on treatment i; after Y, it is rho lambda_i.
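Before turning to the outputs, the whole planning calculation can be sketched in a short data step. The code below carries the example inputs just listed through Eqs. (3)-(5), using the piece-wise form of Q_i given in the next subsection (Eq. (6)), and steps the accrual period upward until the planning standard error falls to the implicit value S; the inputs, in particular rho = 0.5 and one-sided alpha = 0.05 with power 0.80, are this writer's reading of the example, and they reproduce the example outputs quoted below (about 2.33 years of accrual, 490 patients, and 196 expected failures).

/* Sketch of the piece-wise exponential planning calculation.            */
data plan;
   psi = 210;  y = 3;  r1 = 0.60;  r2 = 0.70;  rho = 0.5;
   alpha = 0.05;  pi = 0.80;                /* one-sided test             */
   za = probit(1 - alpha);  zb = probit(pi);
   lam1 = -log(r1)/y;  lam2 = -log(r2)/y;
   s = abs(lam1 - lam2)/(za + zb);          /* implicit equation          */
   x = 0;
   do until (se <= s);
      x  = x + 0.01;                        /* accrual period, years      */
      q1 = r1*(1 - exp(-rho*lam1*x))/(rho*lam1*x);   /* Eq. (6)           */
      q2 = r2*(1 - exp(-rho*lam2*x))/(rho*lam2*x);
      ef1 = 0.5*psi*x*(1 - q1);             /* Eq. (5)                    */
      ef2 = 0.5*psi*x*(1 - q2);
      se  = sqrt(lam1**2/ef1 + lam2**2/ef2);          /* Eq. (3)          */
   end;
   n_total  = ceil(psi*x);
   exp_fail = ef1 + ef2;
   put x= n_total= exp_fail=;     /* about X = 2.33, N = 490, 196 failures */
run;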

196) Based on note 2.134 Shuster 2. B. 5.g. Y) can be relaxed to require proportional hazards only.. 2. It is convenient to . plus the probability of surviving to time Y.5ΨX(1 R i ). is the probability of death by time Y. the greater the power of the test. with a strictly monotonic increasing function.g. 4.. the calculation of E(F i ) for Eq. Output Parameters X E(F 1 ) X Y ΨX E(F 2 ) Accrual period (e.g. R i. (From Eq. All other things being equal. D i (x). The expected failures on treatment i in the ﬁrst Y years of patient risk is 0. This in turn implies that the exponential assumption over the interval (0. Section IV.33 years) Total accrual required (e. (1 R i ). 490 patients) Total expected failures (e. where Ψ is the annual accrual rate. (3) implies that the larger the expected number of failures. but dying between times Y and x Y. power increases as ρ increases. the exponentiality assumption postY is not important (although proportional hazards may be). the time at risk is uniformly distributed over the period from Y to (X Y) and the probability of death. conditional on a time at risk. Eq. Key Observations 1. X is the accrual period.g. (3) and (5) is straightforward.) 3. (3). (4) and (5) is Qi R i[1 exp( λ iY) ρλ i X)]/(ρλ iX) (6) Note that λ i is deﬁned via the equation Ri exp( (7) 1) the value of Q i agrees with that below and hence for exponential data (ρ Eq.. the greater the hazard post-Y-years. For this piecewise exponential model and Poisson accrual process. and is not affected by the transformation. and R i is the planning Y-year survival on Treatment i. the expected number of failures that occurred before patient time Y would be unchanged. depends only on the ﬁxed Y-year survival on treatment i. Y x. If hazard rates are low after time Y. Y). 2. (4). that is.. D i (x) (1 Ri) R i (1 exp(ρλ ix))] The unconditional probability of death for a randomly selected patient on treatment i is found by taking the expectation of D i (x) over the uniform distribution for x from 0 to X. the deﬁnition of λ i ln(R i )/Y in Eq. (7). Y) onto (0.33 years) Total study duration (e. If one transformed the subset of the time scale (0. the value of Q i to be used in Eqs. Since the larger the value of ρ. For a randomly selected patient.

and 0. 407. 5.698. 0. the reciprocal of the accrual duration. 561 patients would be needed. SEQUENTIAL MONITORING BY THE O’BRIEN–FLEMING METHOD In this section. the study should accrue until 189 failures have occurred (nearly an identical number to the 196 expected failures derived from the piece-wise exponential model). but the macro allows for both equal or unequal allocation VI. one can use asymptotic Brownian motion to connect full sequential monitoring power to no monitoring power. since it enters the sample size equation only through the expected number of failures.589 probability of being in the control arm. and 561 for ρ 1.411 probability of being in the experimental arm versus 0. A Macintosh version exists but must invoke 16-bit architecture. namely (1/X2 ). respectively.0. The equal patient allocation of 50% to each arm is nearly optimal in terms of minimizing the smallest sample size.Power and Sample Size in Phase III Trials 135 think of ρ as an average hazard ratio (post:pre-year Y). Software for these calculations is available in Shuster (5). To further illustrate this point. This will be used to apply an inﬂation factor for sequential monitoring (Section VI). Under exponentiality. 6. . without regard to the piece-wise exponential assumption. is relatively modest. then from the results of Section II.5. the rate of change is approximately proportional to the derivative of (1/X). on a Windows platform. then the power holds up under proportional hazards. from elementary asymptotic considerations of the Poisson processes obtained in exponential survival. the patient requirements are 353. The variance (the square of the right-hand side of formula (3)) decreases more rapidly than (1/X). for small increments in X. whereas if ρ 0. Appendix II contains a SAS macro that also can be used for the calculations.698/1. if one used the above sample size methodology to approximate the accrual duration and minimum follow-up but actually ran the trial until the total failures equaled the total expected in the study: E(F 1 ) E(F 2 ). Under the same parametrization. the longer time at risk increases the probability of death for each entrant.0.698 0. the planning hazard ratio (HR) for survival rates of 60% versus 70% at 3 years is 0. However. 7. because although the number of patients increase linearly. in the above example. the actual case ρ 0. if each failure is considered to approximate an independent binomial event. 490 patients. In fact. but with an accrual of 60 patients per year instead of 210. which under the null hypothesis has a 50% probability of falling into each treatment and under the alternative a HR/(1 HR) 0.33 years and minimum follow-up is 3 years. where accrual is 2.5. The effect of ρ in the above example. The effect of ρ is much more striking if accrual is slow. Based on Shuster (15). 448 patients would be required (ρ 1).

026 ∆/S represents the detectable effect size.136 Shuster First.010 0. 0 θ maximum allowable time of completion. on the basis of the rejection region for testing ∆ 0 versus ∆ 0.926 O-F ∆/S 3. α) α. α) Φ[(∆/S) Z α] For example.168 2. and investigator 2 ran a slightly larger study sensitive to an effect size of ∆/S 2.025 0. such that ˆ ∆ θ is asymptotically N(∆.688 3.025 0.576 3. if investigator 1 ran a study sensitive to an effect size ∆/S 2.242 2. the power function for the O’Brien–Fleming test that rejects the null hypothesis if at any time θ ˆ θ ∆θ is Π(∆.80 0.238 2. ˆ except that the index θ was added to delineate the time scale) θ ∆ θ is asymptotically a Brownian motion process with drift ∆ and diffusion constant S 2.332 3. From ﬁrst passage time considerations (see Cox and Miller (16) for example).050 0.608 3.801 2.010 0. α) Φ[(∆/S) Z α/2] exp[(2∆Z α/2 )/S]Φ[ (∆/S) Z α/2] SZ α/2 where Φ standard normal CDF and Z p is the upper 100p percentile of Φ. S.050 Power 0. .576 but used continuous O’Brien–Fleming bounds for sequential monitoring. (3).90 None ∆/S 3.881 2. S. Continuous Monitoring by O’Brien–Fleming (OF) α 0.486 3.90 0. deﬁne a ‘‘time parameter’’ θ.486 and required no monitoring. Note that Π(0. For no monitoring. S. with θ 1 being the θ represents the ratio of variances of the estimate of effect size (ﬁnal to interim analysis) and S 2 is the variance of the estimate of effect size at the planned ﬁnal analysis (θ 1). From asymptotic considerations of the Poisson process (with the notation the same as the section on the exponential. S 2 /θ) 1.80 0.80 0. the power function is deﬁned by Π no (∆. the two studies would both have 80% power Table 1 Sensitivity of No Monitoring (None) vs.90 0. calculated per Eq.

where 490 patients were needed for a trial conducted without sequential monitoring. There is almost no penalty for this type of monitoring.6698]. an additional 7% would bring the necessary accrual to 524 (34 more entrants).5 years 0.5418]. 5.7785 [0. it was concluded that 2. 2. 3. All one needs to compute is the expected number of failures at an interim analysis. 1.0 years 0. and 5. above.0684]. since it will fall between no monitoring and continuous monitoring. θ. de- .1502].01). Group sequential monitoring by the O’Brien–Fleming method would require a smaller inﬂation factor for the maximum sample size. the fact is that for studies planned with survival differences of the order of 15% or less. Although the use of variance ratios Eq. 1. planning difference is a 10% improvement in 3-year survival from 60% (control) to 70% (experimental) and monitoring is handled by a continuous O’Brien–Fleming approach.49 years of accrual plus 3 years of minimal follow-up (total maximum duration of 5. The information fractions.05. 4. where the expected total failures is 50 or higher. note 5.5.5 years 0.49 years with monitoring versus 2.1504 [0. if the trial was to be sequentially monitored by the O’Brien–Fleming method. for Eq. 2.49 years 100% [100%]. 3. 3. 4.2607].5. respectively.5. 6% (α 0. increasing the accrual duration slightly means that the variance changes proportional to the change in the reciprocal of the accrual duration. Others use the ratio of expected failures (interim to ﬁnal). EVALUATION OF THE INFORMATION FRACTION It is of interest to note that for the piece-wise exponential model. (3) and for expected failures ratios in []: 1.Power and Sample Size in Phase III Trials 137 at α 0. where planning accrual is 210 per year.05).0. 3.0 years 0.0 years 0.0.5 years 0. VII. The inﬂation factors would be 4% (α 0.2611 [0.49 years are. the two are almost identical. But experience has shown it to be not much smaller than those derived for continuous monitoring. 2.0.0 years 0. whereas still others use the ratio of actual failures to total expected. These numbers are impressively close.025). and 7% (α 0.33 years without monitoring. (3) appears to be quite different from the ratio of expected failures when the hazard ratios are not close to one.0685 [0.7780]. 2.0.5424 [0. In the example above. one can derive the information fraction as the ratio of the ﬁnal to interim variance as in the strict deﬁnition (see Eq. The software package ‘‘East’’ (17) does not handle nonexponential data but can be used to derive an approximate inﬂation factor for true group sequential designs.6704 [0.3985 [0. (3)). This represents an accrual duration of at most 2. In the numerical example. Since as remarked in Section V. an approximate measure for the increase in accrual mandated by continuous O’Brien monitoring is approximately the square of the ratio of the entries in Table 1. for calendar times 1.3979]. even if indeed the study runs to completion.49 years) were needed.

If the assumption of no major interaction in the hazard ratios is reasonable. For example.698. are problematic unless they are treatment uninforma- .138 Shuster spite the fact that under the alternate hypothesis. the improvement over a completely randomized design is difﬁcult to quantify. Interventions should be carefully selected to minimize the potential for qualitative interaction. one could reach a different conclusion if the actual analysis had been corrected for the number of pair-wise comparisons in the original trial. As noted above. MULTIPLE TREATMENTS AND STRATIFICATION The methods can be applied to pair-wise comparisons in multiple treatment trials. competing losses. If a qualitative interaction is indeed anticipated. it is my opinion that the planning α level should not be altered for multiple comparisons. COMPETING LOSSES In general. for the same data.00. the planning hazard ratio (experimental to control) is 0. a situation where the superior treatment depends on which of the concomitant treatments is used. the most robust concept is to use actual failures and terminate at a maximum of the exected number. stratifying for the concomitant treatment. IX. In most applications. hardly a local alternative to the null value of 1. presuming proportional hazards within the strata hold. whether or not there was a third arm C. had a hypothetical trial of only A versus B been run and accrued the same data. The deﬁnition of power applies to each pair-wise comparison. This is because the inference about A versus B should be the same. if the study is a three-armed comparison and accrual is estimated to be 300 patients per year. the nonstratiﬁed plan represents a conservative estimate of patient needs when the design and analysis are in fact stratiﬁed. In general. the use of stratiﬁcation on a limited scale can increase the precision of the comparison. In other words. Note that the accrual rate would be computed for the pair of treatments being compared. However. VIII. this could be analyzed as a stratiﬁed logrank test. then the study should be designed as a four-treatment trial for the purposes of patient requirements. which can be shown to be 211 for this continuous O’Brien–Fleming design. If a 2 2 factorial study is conducted. a study planned as if it was a two-treatment nonstratiﬁed study will generally yield a sample size estimate slightly larger (but unquantiﬁably so) than needed in the stratiﬁed analysis. If one wishes to correct for multiple comparisons (a controversial issue). Hence. patients censored for reasons other than being alive at the time of the analysis. then one should use a corrected level of α to deal with this but keep the original power. 200 per year would be accrued to each pair-wise comparison.

Conduct the trial as planned until this many failures occur or until the study is halted for signiﬁcance by crossing an O’Brien–Fleming bound. For example. if L 0. if at an early interim analysis. (5). it is also an ethical dilemma. (4). (3) using expected values in Eq. one treatment (e. since statisticians are really being asked to forecast the future. It is recommended for studies where proportional hazards is considered to be a reasonable assumption that the study be planned as if the piece-wise exponential assumptions hold.10 (10%) are expected to be lost to competing reasons. PRACTICAL CONCLUSIONS The methods proposed herein allow a planner to think in terms of ﬁxed term survival rather than hazard rates or hazard ratios. and (6) for the piece-wise model to obtain the expected number of failures to be seen in the trial.g.Power and Sample Size in Phase III Trials 139 tive (the reason for the loss is presumed to be independent of the treatment assignment and. preferred by this author.9 to obtain a ﬁnal sample size. where the information fraction is calculated as actual failures/maximum number of failures that would occur at the planned ﬁnal analysis. Next. apply Eq. the actual test (logrank or Cox) is valid for testing the null hypothesis that the survival curves are identical. However. causing the study to be stopped early and reaching the incorrect conclusion. the sample size would be inﬂated by dividing the initial sample size calculation by (1 L) 0. apply the small inﬂation factor by taking the square of the ratio of the entry in the ∆/S columns per Table 1. a conservative approach. For example. at least conceptually. is to use a second ‘‘inﬂation factor’’ for competing losses. an early signiﬁcance favoring the wrong treatment may emerge. where they have little in the way of reliable information to work with. is to conduct accrual in stages. For example. Although sequential analysis is often an ethical necessity. it . One possible recommendation.. it might be very tempting but unwise to close the study for lack of efﬁcacy. as the failures to date divided by the total failures that would occur if the study is run to its completion. If the plan is to monitor the study by the O’Brien– Fleming method.’’ If a superior treatment is associated with a high propensity for early deaths. whether it is group sequential or pure sequential. for survival trials where there is concern about the model assumptions. bone marrow transplant) is more toxic and appears to have inferior outcome when compared with the other (chemotherapy alone). Although it is possible to build in these competing losses in a quantiﬁed way for sample size purposes [see (10)]. X. unrelated to the patient’s prognosis). the sample size calculation and power are sensitive to violations of the proportional hazards assumption and especially to ‘‘crossing hazards. Irrespective of the proportional hazards assumption. The use of a ﬁxed number of failures in implementation also allows for simplistic estimates of the information fraction.

that is.49 year accrual study (both with continuous O’Brien–Fleming bounds). analyzing the data 3.2 years instead of 5. respectively. In the numerical example initiated in Section V. EXPONENTIAL ESTIMATION— LOCAL OPTIMALITY For treatment i and day j. the information fraction at 1 year (about 40% of accrual) would be only 7%. would also allow the analyst to have a better handle on model diagnostics. irrespective of any interim results. APPENDIX I: LOGRANK VS. let N ij and F ij represent the number at risk at the start of the day and number of failures on that day.8 years. In addition. close the trial temporarily and begin accrual on a new trial (B) for a period of 1 year. for the O’Brien–Fleming design. trial B would be completed earlier by the leap-frog approach by about 0. This process might slow down completion of trial A but also might speed up completion of trials B and C. This leap-frog approach would continue in 1-year increments. is weighted inversely proportional to (N 1j 1 N 2j 1 ).49 years. The weight is zero if either group has no patient at risk starting day j. while fewer patients were at risk.2 years. F ij /N ij. if one had a 1year ‘‘time out’’ for accrual but ran the second stage for 1. ACKNOWLEDGMENT Supported in part by grant 29139 from the National Cancer Institute. whereas the information fraction at 2 years (about 80% of accrual completed) would be 26% (19% coming from the ﬁrst year of accrual plus only 7% coming from the second year of accrual). For the exponential test. for safety purposes. where the denominator is simply the standard error of the numerator) is F 1j N 1j (F 1j F 2j )/(N 1j N 2j ) {(F 1j /N 1j ) (F 2j /N 2j )}/(N 1j 1 N 2j 1 ) The day j estimate of each hazard. Offsetting this difference. the contributions to the estimates of the hazards for day j are .0 years after the end of accrual. For the logrank test. the study would have the same power as the 2. presuming the study ran to completion. there would be a much smaller disparity between calendar time and information time. Completion would occur after 6. the contribution to the observed minus expected for day j (to the numerator of the statistic. proportional to N 1j N 2j /(N 1j N 2j ). The slowdown.140 Shuster might be prudent to accrue patients onto a trial (A) for a period of 1 year and. A decision as to whether to continue accrual to trial (A) for another year or begin yet another trial (C) for a year could be made with more mature data than otherwise possible. In fact. generated by the leap-frog approach.

Note that there are two things helping the logrank test. Comment: The mathematical treatment of this chapter is somewhat unconventional in that it uses the difference in hazards for the exponential rather than the hazard ratio or log hazard ratio. Comment: SAS (Statistical Analysis Systems) macros for the logrank and stratiﬁed logrank tests are included in Appendix III.psi. APPENDIX II: SAS MACRO FOR SAMPLE SIZE AND ACCRUAL DURATION Note that this macro can handle unequal allocation of patients. See Section V.’’ which is represented by the exponential weights.alpha. On the negative side.99 or higher). the weights will be very similar except when both become very small. and as such. and so on.lfu). until relatively few are at risk. both weight the information inversely proportional to the variance.rho.r1.lfu).rho. by not using a transformation.Power and Sample Size in Phase III Trials 141 (N ij /N i. the connection to expected failures is not as direct as it would be under the transform to a hazard ratio.r1. and in practice. Usage %ssize(ddsetx. something useful for planning studies for the prognostic signiﬁcance of yes/no covariates.side.alpha. 2. a relative efﬁciency measure can easily be obtained for nonlocal alternatives.alloc. ddsetx user supplied name for data set containing the planning parameters. and exponentially.r2.pi.alloc.)(F ij /N ij ) where N i.5 for 1-1 randomized trials) Psi Annual Accrual Rate Y Minimum Follow-up R1 Control Group planned Y-Year Survival R2 Experimental Group planned Y-Year Survival . Under the null hypothesis.00 (typically 0.side. if the ratio of N 1j /N 2j remains fairly constant over time j 1.y. Second. Using the approximation that the N ij are ﬁxed rather than random. both tests are locally optimal.psi. is the total days on test for treatment i.pi. stratiﬁed estimation is robust against modest departures from the ‘‘optical allocation. First. the relative efﬁciency for studies planned as above with 30 or more failures will generally be only trivially less than 1. This enables a user to directly investigate the relative efﬁciency issue of the logrank to exponential test. Alloc Fraction allocated to control group (Use .y. %ssize(a.r2.

alloc &alloc.r2 &r2pha &alpha. q1 r1*(1 exp( rho*lam1*x))/(rho*lam1*x). if inc .rho &rho.r1.9 then do. pi &pi.y.na int(na . v1 ((lam1**2)/(alloc*psi*x*p1)). zb probit(pi).lfu).inc .999). if s2 s and inc . za probit(alpha/side). label alloc ‘Allocated to Control’ rho ‘Post to Pre y-Year Hazard’ psi ‘Accrual Rate’ y ‘Minimum Follow-up (MF)’ r1 ‘Planned control Survival at MF’ r2 ‘Planned Experimental Survival at MF’ side ‘Sidedness’ alpha ‘P-Value’ .goto aa.01. p2 1 q2.999 psi*x).r2. ex fail psi*x*(alloc*p1 (1 alloc)*p2).alpha.lfu &lfu.psi.rho. q2 r2*(1 exp( rho*lam2*x))/(rho*lam2*x).psi &psi. DATA DDSETX.alloc. %MACRO ssize(ddsetx. n int(. lam2 log(r2)/y.x x 1.side. if s2 2 then goto aa. OPTIONS NOSOURCE NONOTES. del abs(lam1 lam2).set &ddsetx. na n/(1 1fu). s2 sqrt(v1 v2). lam1 log(r1)/y.r1 &r1. v2 ((lam2**2)/((1 alloc)*psi*x*p2)).9 then goto aa. aa:x x inc. x 0. options ps 60 ls 75.142 Shuster side 1 (one-sided) or 2(two-sided) alpha size of type I error pi Power rho Average Post:pre Y-Year hazard ratio lfu Fraction expected to be lost to follow-up.inc 1.end.set ddsetx.pi. data ddsetx.y &y. p1 1 q1. s del/(za zb).

psi.8 . x1.5 210 3 . stratum) %logstr(a. ddset name of data set needing analysis.6 . Stratum categorical variable in ddset deﬁning stratum for stratiﬁed logrank test. class) %logrank(a.1 . input alloc psi y r1 r2 side alpha pi rho lfu.Power and Sample Size in Phase III Trials 143 pi ‘power’ lfu ‘Loss to Follow-up Rate’ x ‘Accrual Duration’ n ‘Sample size.7 1 .7 1 . cards. APPENDIX III: SAS MACRO FOR LOGRANK AND STRATIFIED LOGRANK TESTS Sample usage: %logrank(ddset. proc print label. time. rho.1 .6 .7 1 . time.05 .8 . y.6 . this program was run.1 %ssize(ddsetx. event.05 .6 . treat no. %mend. gender) for the stratiﬁed logrank test time.5 . side.7 1 .05 .8 . no losses’ na ‘Sample size with losses’ ex fail ‘Expected Failures’. and stratum are user-supplied names in the user-supplied data set called ddset. event. x1.000001 . time time variable in data set ddset (time 0 not allowed) event survival variable in data set ddset event 0 (alive) or event 1 (died) class categorical variable in ddset identifying treatment group.1 .5 60 3 .1 .1 . To produce the example in Section V.05 .7 1 .8 1 .6 .5 . lfu). . event.05 . alpha. data ddsetx. var alloc psi y r1 r2 side alpha pi rho lfu x n na ex fail.7 1 .8 1 . pi.5 210 3 . alloc.5 210 3 .6 . x2. x2. class.8 . .5 60 3 .05 . r2.000001 . treat no) for the unstratiﬁed logrank test %logstr(ddset. r1.5 60 3 . class.



REFERENCES

1. O'Brien PC, Fleming TR. A multiple testing procedure for clinical trials. Biometrics 1979; 35:549-556.
2. Peto R, Peto J. Asymptotically efficient rank invariant test procedures. J R Stat Soc 1972; A135:185-206.

3. Cox DR. Regression models and life tables [with discussion]. J R Stat Soc 1972; B34:187-220.
4. Shuster JJ. Practical Handbook of Sample Size Guidelines for Clinical Trials. Boca Raton, FL: CRC Press, 1992.
5. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958; 53:457-481.
6. Bernstein D, Lagakos SW. Sample size and power determination for stratified clinical trials. J Stat Comput Simul 1978; 8:65-73.
7. George SL, Desu MM. Planning the size and duration of a clinical trial studying the time to some critical event. J Chronic Dis 1974; 27:15-29.
8. Lachin JM. Introduction to sample size determination and power analysis for clinical trials. Controlled Clin Trials 1981; 2:93-113.
9. Morgan TM. Planning the duration of accrual and follow-up for clinical trials. J Chronic Dis 1985; 38:1009-1018.
10. Rubinstein LV, Gail MH, Santner TJ. Planning the duration of a comparative clinical trial with loss to follow-up and a period of continued observation. J Chronic Dis 1981; 34:469-479.
11. Schoenfeld DA. Sample size formula for the proportional-hazards regression model. Biometrics 1983; 39:499-503.
12. Halperin J, Brown BW. Designing clinical trials with arbitrary specification of survival functions and for the log rank test or generalized Wilcoxon test. Controlled Clin Trials 1987; 8:177-189.
13. Cantor AB. Power estimation for rank tests using censored data: conditional and unconditional. Controlled Clin Trials 1991; 12:462-473.
14. Henderson WG, Fisher SG, Weber L, Hammermeister KE, Sethi G. Conditional power for arbitrary survival curves to decide whether to extend a clinical trial. Controlled Clin Trials 1991; 12:304-313.
15. Shuster JJ. Fixing the number of events in large comparative trials with low event rates: a binomial approach. Controlled Clin Trials 1993; 14:198-208.
16. Cox DR, Miller HD. The Theory of Stochastic Processes. London: Methuen, 1965.
17. Mehta C. EaST: Early Stopping in Clinical Trials. Cambridge, MA: CyTEL Software Corporation.


8
Multiple Treatment Trials

Stephen L. George
Duke University Medical Center, Durham, North Carolina

I. INTRODUCTION

Randomized clinical trials involving more than two treatments present problems and challenges in their design and analysis that are absent in trials involving only two treatments. When there are only two treatments in a trial, inference is relatively straightforward since the only direct comparison is between the two treatments. When there are more than two treatments, inference is more complex, and this complexity affects the design, conduct, and analysis of the trial (4). The number of paired and subset comparisons increases rapidly as the number of treatments increases. For example, with three treatments, in addition to the global comparison of all three treatments simultaneously, there are three possible paired comparisons among the three treatments and an unlimited number of other comparisons arising from various pooled weightings of the treatment groups. In this setting, it is vitally important to specify clearly the objectives of the trial and to be certain that these objectives are reflected in an appropriate design and analysis.

These problems are part of the larger issues raised by multiplicities in clinical trials. Multiplicity (1-3) refers to issues concerning multiple outcome variables, subgroups, covariate adjustments, sequential analysis, and multiple treatments. This chapter deals with multiple treatments, although some issues raised here are relevant to other multiplicity topics.

The purpose of this chapter is to describe techniques of statistical design and analysis that address these issues. In the following sections, several common settings in which more than two treatments are involved are presented. These include multiple independent treatments, multiple experimental treatments compared with a control, factorial designs, and selection designs. In each setting, design strategies are presented along with their implications for sample size and for the conduct and analysis of the trial.

II. MULTIPLICITY AND ERROR RATES

A. Types of Errors

The primary difficulty raised by multiplicity, including multiple treatments, is that error rates may be elevated beyond their putative or nominal level. In the classic statistical hypothesis testing framework, two types of errors can occur in testing a specified null hypothesis against an alternative non-null hypothesis. Erroneously rejecting a true null hypothesis is a type I error. Erroneously failing to reject a false null hypothesis is a type II error. The error rates for these two types of errors are usually denoted by alpha and beta, respectively. The complement of the type II error rate (1 - beta) is referred to as the statistical power of the test procedure. These concepts apply equally to the overall experiment ("experimental" error rates) or to the individual comparisons within the experiment.

A simple and well-known example of this phenomenon arises in multiple significance testing. If we conduct N independent tests of significance, each at the nominal alpha level, the overall probability of finding at least one erroneous "significant" result when there are truly no differences is 1 - (1 - alpha)^N. For alpha = 0.05, this error probability becomes quite large even for moderate N. For N = 5, it is 0.23. For N = 10, it is 0.40. Multiple testing with no adjustment of individual error rates dramatically increases the probability of finding spurious, but significant, differences.

Thus, if we test all possible pairs of treatments and wish to set the experimental type I error rate to alpha, the individual type I error rates must each be less than alpha. But if we reduce the individual type I error rates without increasing the sample size, the individual type II error rates are increased. The sample size can be increased to avoid this problem, but this may require an unacceptably large sample size. However, as the calculation in the previous paragraph illustrates, proper control of the experimental type I error rate can lead to highly conservative procedures with low power to detect individual differences.
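The calculation in the previous paragraph is easily tabulated; the short data step below reproduces the 0.23 and 0.40 figures.

/* Probability of at least one spurious "significant" result among N     */
/* independent tests, each at nominal level alpha = 0.05.                */
data multi;
   alpha = 0.05;
   do n = 1 to 10;
      p_any = 1 - (1 - alpha)**n;    /* 0.23 at n = 5, 0.40 at n = 10     */
      output;
   end;
run;
proc print data=multi noobs; run;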

B. Bonferoni and Related Procedures

One of the older procedures used to control the experimental error rate is the Bonferoni procedure, applied to multiple significance tests, based on a simple form of one of the Bonferoni inequalities:

P ≤ Σ_{i=1}^{N} p_i

where P is the probability of at least one event occurring out of N possible events and p_i is the probability of the ith event (i = 1, . . . , N). If we want P ≤ α, one approach is to set p_i = α/N. That is, the simplest Bonferoni procedure is to use a significance level α/N for each individual test, where N is the number of tests and α is the overall type I error rate (5). This procedure is guaranteed to yield an overall error rate no larger than α when the global null hypothesis is true. However, it is a very conservative procedure, particularly where N is large and the tests are correlated. Application to treatment trials involving K treatments in which the tests are limited to all pair-wise comparisons yields a significance level of 2α/[K(K − 1)] for each individual test. For example, if α = 0.05 and K = 5, the significance level would be 0.005 for each of the 10 possible paired comparisons.

Modifications of the simple Bonferoni procedure have been proposed to mitigate the overly conservative nature of the procedure. One of these (6) uses a significance level of jα/N for the jth ordered test (j = 1, . . . , N). That is, ordering the p values p(1) ≤ ⋅⋅⋅ ≤ p(N), one rejects the global null hypothesis if p(j) ≤ jα/N for any j. This procedure is less conservative than the simple Bonferoni procedure. It has been extended by Hochberg (7), and Holm (8) defined a procedure that rejects the hypothesis H(i) corresponding to p(i) if and only if p(j) ≤ α/(N − j + 1) for all j ≤ i. This procedure is identical to Bonferoni for i = 1 but not as conservative otherwise. Hochberg and Benjamini (9) discuss these procedures and provide some suggested improvements.

To avoid complexity and to focus attention on problems of multiplicity arising from multiple treatments, in the remainder of this chapter it is assumed that the K treatments have a single primary outcome measure, x_i, which is normally distributed with an unknown mean µ_i and common variance σ². Thus, the various scenarios considered here are all expressed in terms of hypotheses involving the µ_i. Complications arising from unequal variances, non-normal distributions, censoring, drop-outs, stratification, and so on are avoided.
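The simple Bonferoni rule and the Holm step-down rule described in this section are easy to state in code. The sketch below is an illustration added here (not code from the chapter), using hypothetical p values.

    from typing import List

    def bonferroni_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
        """Simple Bonferoni procedure: test each hypothesis at level alpha/N."""
        n = len(p_values)
        return [p <= alpha / n for p in p_values]

    def holm_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
        """Holm step-down procedure: reject H(i) (ordered by p value) if and only if
        p(j) <= alpha / (N - j + 1) for all j <= i."""
        n = len(p_values)
        order = sorted(range(n), key=lambda k: p_values[k])
        reject = [False] * n
        for rank, k in enumerate(order, start=1):        # rank = j, 1-based
            if p_values[k] <= alpha / (n - rank + 1):
                reject[k] = True
            else:
                break                                    # once a hypothesis is retained, stop
        return reject

    if __name__ == "__main__":
        pvals = [0.001, 0.015, 0.034, 0.20]
        print("Bonferoni:", bonferroni_reject(pvals))
        print("Holm:     ", holm_reject(pvals))

With these inputs the Holm procedure rejects one more hypothesis than the simple Bonferoni rule, illustrating its reduced conservatism while still controlling the experimental error rate.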

Such additional considerations are of considerable practical importance in actual trials, but their introduction here would introduce complexities that would obscure the salient points of emphasis.

III. MULTIPLE INDEPENDENT TREATMENTS

A. Global Null Hypothesis

Perhaps the most common setting in clinical trials with multiple treatments involves the comparison of K independent treatments. That is, there are K different treatments with no implied ordering or other special relationship among the treatments. Typically, K is equal to three or four, but in some trials might be larger. In this setting, the primary hypothesis test is usually

H_0: µ_1 = µ_2 = ⋅⋅⋅ = µ_K
vs.
H_1: µ_i ≠ µ_j for some i, j with i ≠ j

That is, the global null hypothesis H_0 is that all of the µ_i are identical. The alternative H_1 is that at least one of the µ_i is not equal to the others. There are two general approaches in this setting:

(i) Hierarchical approach—First construct a global α-level test of H_0 (10). If H_0 is rejected at the first step, proceed with paired comparisons of the µ_i. If H_0 is not rejected, conclude in favor of H_0 with no further tests. The primary attraction of this approach is that control of the experimental error rate is simple and direct. However, a rather unsatisfying result is that in case of failure to reject the global null hypothesis, there is no further examination of the treatment differences. The conclusion is simply that the K treatments are not demonstrably different.

(ii) All possible paired comparisons—The second general approach is to conduct all possible paired comparisons (K(K − 1)/2 in number) but to do so in a way that preserves the experimental error rate α. As discussed above in connection with Bonferoni-type adjustments, this means that each comparison must be carried out at a significance level less than the nominal, or experimental, level α. Methods for testing all possible pairs of treatments in a study have been investigated for a very long time (11,12). Some of these techniques are Tukey's honestly significant difference and wholly significant difference tests, the Newman-Keuls test, the Duncan test, and the least significant difference test.

All provide approaches to analyses that control the overall type I error rate.

B. Sample Size Implications

For both of the general approaches described above, there is a cost in terms of the required sample size. In the case of two treatments with σ² known, the required (equal) sample size N for each treatment is

N = 2(z_{α/2} + z_β)² / ∆²

where z_x is the 100x percentile of the standard normal distribution and ∆ = (µ_1 − µ_2)/σ, the standardized difference in means that we desire to detect with power 1 − β. In practice, N is rounded up to the nearest integer value. For example, if α = 0.05, β = 0.10, and ∆ = 0.50, then N = 85 patients are required for each treatment.

If there are K > 2 treatments, the required sample size depends on the specification of the means in the alternative case. Denoting the ordered means by µ_(1) ≤ ⋅⋅⋅ ≤ µ_(K), the least favorable configuration of means for a given maximum difference ∆ = (µ_(K) − µ_(1))/σ occurs when µ_(i) = (µ_(1) + µ_(K))/2 for i ≠ 1, K. This is the configuration yielding the minimum power for a fixed sample size and is thus the configuration used to determine sample size. The exact sample size can be obtained by an iterative solution to an equation involving noncentral F distributions (13). Details are not given here, but a close approximation is the following:

N = 2{√[χ²_α(K − 1) − (K − 2)] + z_β}² / ∆²

The required N for K > 2 treatments is always greater than that required for K = 2 treatments. The relative increase is given in Table 1 for K = 3 to 10.

Table 1  Multiples of Sample Size for Two Treatments Required for K Treatments

                        Number of treatments (K)
(α, β)            3     4     5     6     7     8     9     10
(0.05, 0.10)    1.18  1.30  1.40  1.48  1.55  1.62  1.68  1.73
(0.05, 0.20)    1.21  1.35  1.46  1.56  1.65  1.73  1.80  1.87
(0.01, 0.10)    1.16  1.27  1.35  1.43  1.50  1.56  1.61  1.67
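A short calculation along these lines may help. The sketch below is an illustration added here, assuming the approximate formula quoted above and using scipy for the chi-square and normal percentiles; it is not code from the chapter.

    import math
    from scipy.stats import chi2, norm

    def patients_per_arm(delta: float, k: int, alpha: float = 0.05, beta: float = 0.10) -> int:
        """Approximate patients per arm for comparing k treatments under the least
        favorable configuration with standardized maximum difference delta.
        For k = 2 this reduces to the familiar 2 * (z_{alpha/2} + z_beta)**2 / delta**2."""
        crit = chi2.ppf(1.0 - alpha, df=k - 1)        # upper-alpha chi-square critical value
        n = 2.0 * (math.sqrt(crit - (k - 2)) + norm.ppf(1.0 - beta)) ** 2 / delta ** 2
        return math.ceil(n)

    if __name__ == "__main__":
        for k in (2, 3, 4, 5):
            print(f"K = {k}: about {patients_per_arm(0.50, k)} patients per arm")

For ∆ = 0.50, α = 0.05, and β = 0.10 this gives 85 patients per arm for K = 2 and about 99 per arm for K = 3, in agreement with the relative increases in Table 1.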

For example, with α = 0.05 and β = 0.10, we require 18% more patients per treatment for three treatments than for two treatments, and the more treatments, the greater the increase. Thus, if ∆ = 0.50 as in the earlier example, a two-treatment trial requires 2 × 85 = 170 patients, but a three-treatment trial requires 3 × 1.18 × 85 ≅ 301 patients, rather than the apparent, but erroneous, requirement of 3 × 85 = 255 patients based on the two-treatment situation. More complex situations are covered elsewhere (14–18). The salient point in all situations is that the number of patients per treatment group for K > 2 treatments must be increased rather substantially over the number required for two treatments.

IV. PRESPECIFIED COMPARISONS

In the previous section, it was assumed that comparison of all pairs of treatments was of interest. No specific comparison was assumed to be of more or less interest than any other comparison. In other settings, specific combinations of treatments or specific comparisons may be of primary or even exclusive interest. If this is the case, the number of comparisons can be limited and prespecified to reduce the inherent problems of multiplicity.

One common example in a clinical trial setting is the use of K experimental treatments and a control or standard treatment (or perhaps both). If the control treatment mean is denoted µ_0 and the experimental treatment means µ_i (i = 1, . . . , K), the hypotheses of interest may be, for example, µ_i vs. µ_0, or Σµ_i/K vs. µ_0, or related hypotheses. The number of such hypotheses may be far less than K(K − 1)/2, the total number of paired comparisons. In the particular case of K experimental treatments compared with a control treatment, the hypotheses of interest are

H_0: µ_i = µ_0    i = 1, . . . , K
H_1: µ_i ≠ µ_0    for at least one µ_i

This represents exactly K comparisons, one for each experimental treatment with the control. The experimental treatments themselves are not directly compared. The most common test procedure in this case is Dunnett's procedure (19), an example of a "many-one" test statistic. In brief, the procedure involves computing K t-statistics in the usual way:

t_i = (ȳ_i − ȳ_0) √N / (s√2)

where ȳ_i is the mean for the ith treatment group, s is the (pooled) standard deviation, and N is the number of observations in each treatment, assumed equal in this formulation. To preserve the experimental type I error rate, the statistics t_i must each be compared with a critical value larger than the value for K = 1 (i.e.,

73 2. it is natural to study their joint effects and interactions directly. Factorial designs are popular because they carry a promise of being able to exploit the relationships among the treatments in ways not possible in the case of K-independent treatments.41 2. in estimating s. And in a clinical setting in which the factors are different treatment options (drugs. the 2 2 factorial. neither A nor B. V. A and B.99 2. in a series of papers (20–22). OTHER DESIGNS A. considered comparing the experimental treatments that are found to differ from the control in the case when there is a prior preference ordering for the treatments based on considerations such as toxicity or cost.96 2 2.65 2.27 2. two-sided tests) Degrees of Freedom 10 20 60 120 ∞ Number of experimental treatments 1 2. Chen and Simon.05.23 2.44 5 2.51 2.58 2. In general there is no limit to the number of factors and the number of levels of each factor. Other work has extended Dunnett’s procedure in various ways. the design is an a b design and the number of treatment groups is K ab.38 2. Factorial Designs Another type of design involving multiple treatments is the factorial design (23.38 2. but in clinical trials the 2 2 design is the most common and introduces fewer complexities than higher level designs.24) in which two or more factors. ν (K 1)(N 1). B only.57 2. If the two factors have a and b levels.) that might plausibly be administered jointly.47 2.09 2. For example.00 1.Multiple Treatment Trials Table 2 Critical Values for Comparing K Experimental Treatments to a Single Control Treatment Using Dunnett’s Procedure (α 0. are combined together.89 2. say A and B.54 2.55 2. etc.98 1. .35 4 2. For example. two different drugs. Table 2 gives the required critical values for this purpose (11). This is an example of the simplest type of factorial design.76 2.21 3 2.24 2.51 155 one treatment. each with two or more levels. one control). may be either present or absent resulting in four different treatment groups: A only. modalities. This critical value is a function of N and the degrees of freedom.

Unfortunately, as in the case with K independent treatments, it is not possible to get something for nothing. The problem arises when the simple main effects are not equal. In this situation the effect of one treatment depends on whether or not the second treatment is present and is called an interaction between the treatments.

The primary issues are addressed in terms of the 2 × 2 design described earlier in which the four treatments are defined based on the presence or absence of the treatments A and B. Table 3 gives the four possible treatment group means as µ_ij, where i = 0, 1 and j = 0, 1 depending on the absence or presence of treatments A and B, respectively. The pooled means are simply the means pooled over the relevant categories. For example, µ_1• is the mean where treatment A is present pooled over the treatment B categories (absent, present). The "main effect" of treatment A is defined as µ_1• − µ_0•, and the "simple main effects" of treatment A are µ_10 − µ_00 and µ_11 − µ_01, the main effects of treatment A in the absence and in the presence of treatment B, respectively. Similar definitions apply for treatment B.

Table 3  Treatment Group Means in 2 × 2 Factorial Design

                          Treatment B
Treatment A        Absent      Present     Pooled
Absent             µ_00        µ_01        µ_0•
Present            µ_10        µ_11        µ_1•
Pooled             µ_•0        µ_•1        µ_••

The model can be written

µ_ij = µ_00 + αx + βy + γxy

where x = 0, 1 and y = 0, 1 depending on the absence or presence of treatments A and B, respectively, and α, β, and γ are parameters defining the effects of A and B and their interaction. Thus, the simple main effects of A are α and α + γ in the absence or presence of treatment B, respectively. If γ = 0, there is no interaction. If γ > 0, a positive interaction, the effect of treatment A is heightened when given with treatment B. If γ < 0, a negative interaction, the effect of A is decreased.

The primary advantage of a factorial design occurs when it can be assumed that the simple main effects of a treatment are equal. Then we can design the trial as if it were a two-treatment trial for treatment A and obtain a test for treatment B seemingly at no cost. Nonzero interactions are the source of potential difficulties in factorial designs. Even moderate interaction effects can have a profound impact on statistical power. For example, to detect an interaction

the treatment with the highest response rate is selected for further testing in a phase II trial. In the normal means case. The most highly publicized of this type of design applied to clinical trials is a randomized phase II design (27. These selection designs do not allow explicit comparisons among the treatments in the usual sense. one must give careful consideration to the plausible size of an interaction and design the study accordingly. and the purpose is to identify (‘‘select’’) the treatment associated with µ (K) with a high probability of correct selection P* whenever µ (K) µ (K 1) δ*. SUMMARY Multiple treatment trials are more difﬁcult to design and analyze than trials with only two treatments. However.28). in practice. Stated another way. The smaller sample sizes required by these designs are obtained by changing the objectives of the trial. Thus. Smaller interactions will require more patients for reliable detection.. The required sample sizes in this setting are surprisingly small for most values of δ*/σ. It is generally unreasonable to expect to detect a small or moderate interaction but reasonable to assume that they might exist. Thus. VI. we have an ordering of the means µ (1) µ (2) ⋅⋅⋅ µ (K) . Simon and Freedman (24) provide a Bayesian approach (see also Green [25] in this volume). At the end of the trial. a selection design is not appropriate.Multiple Treatment Trials 157 effect of the same magnitude as a treatment main effect requires four times the number of patients required to detect the main effect where no interaction is assumed (23). are instead considered in a single study with random assignment of the available treatments. B. One of the designs discussed earlier should be used (see also Liu [29] in this volume). One example is when the treatments differ only slightly (e. a negative interaction will produce smaller overall treatment effects and reduce the power of the statistical tests. If one wishes to compare the treatments in the usual way. in dose or intensity) so that the loss in selecting the wrong treatment is not excessive. such designs should be carefully applied only in very select circumstances. several treatments that might have been tested in a sequence of traditional phase II designs. Careful attention to detail and proper characterization of the objectives of the trial can minimize these difﬁculties. The treatment selected may be imperceptibly better than some or all of the competing treatments. In this setting the purpose is to identify the ‘‘best’’ treatment or the one with the largest or smallest value of some parameter. one for each treatment. Here. Selection Designs An additional design involving multiple treatments and occasionally used in clinical trials is a selection design (26).g. it is important .

16:119–130. Benedetti J. New York: Chapman and Hall. The required size and length of a clinical trial. Phillips A. Biometrika 1988. Scand J Statist 1979. Senn S. Sample size requirements for comparing time-to-failure among k treatment groups. eds. 8. 1986. Green S. 1990. 13. Hochberg Y. Simes RJ. Desu MM. 310:170. Drug Inform J 1998. 10:871–890. 9. A simple sequentially rejective multiple test procedure. Biometrika 1986. Control Clin Trials 1995. George SL. 50:1096–1121. Anderson SJ. 19. A multiple-step selection procedure with sequential protection of preferred treatments. 9:811–818. Sylvester RJ. Bauer P. 6. Holm S.158 George to recognize at the outset that such trials will of necessity be larger than twotreatment trials. 7. New York: Wiley. Sample size estimation when comparing more than two treatment groups. An improved Bonferroni procedure for multiple tests of signiﬁcance. Koch GG. 4. 14. Raghavarao D. Simon R. Clinical Trials in Oncology. Klockars AJ. London: Sage Publications. 1997. 3. New York: Springer-Verlag. 75:800–802. Altman DG. 11. Chen TT. Ahnn S. Sax G. Tukey JW. Miller RG. 2. Statistical considerations for multiplicity in conﬁrmatory protocols. 1981. Multiple testing in clinical trials. J Chron Dis 1982. San Diego: Academic Press. REFERENCES 1. Benjamini Y. 14:2273–2282. 15. Oxford: Oxford University Press. especially problems of multiplicity. Statistical Issues in Drug Development. 17. 32:193–199. Hochberg Y. Br Med J 1995. Simon RM. but there are always costs. Liu PY. Crowley J. 49:753–761. Stat Med 1990. Drug Inform J 1996. Design and analysis of multiarm clinical trials with survival endpoints. 12. Science 1977. 287–310. 35:861–867. Biometrics 1993. Staquet MJ. Multiple signiﬁcance tests: the Bonferroni method. . 1997. Cancer Clinical Trials. In many settings the advantages may outweigh the costs. 10. Stat Med 1991. 6:65–70. Sample Size Methodology. 20. pp. In: Buyse ME. 5. Sample size determination for comparing more than two survival distributions. 16. Makuch RW. J Am Stat Assoc 1955. 30:523–534. 73:751–754. Some thoughts on clinical trials. Gansky SA. Dahlberg S. 1984. Dunnett CW. Multiple Comparisons. 198:679–684. The considerations in this chapter can help in an assessment of when the advantages outweigh the costs. Stat Med 1995. Simultaneous Statistical Inference. More powerful procedures for multiple signiﬁcance testing. A multiple comparison procedure for comparing several treatments with a control. Bland JM. 18. A sharper Bonferroni procedure for multiple tests of signiﬁcance.

Dahlberg S. Freedman LS. 24. Sample size requirements and length of study for testing interaction in a 2 k factorial design when time-to-failure is the outcome. George SL.Multiple Treatment Trials 159 21. New York: Marcel Dekker. Ellenberg SS. New York: John Wiley & Sons. Biometrics 1997. Selection designs for pilot studies based on survival. 27. Liu PY. Control Clin Trials 1994. Liu PY. Biometrics 1993. Handbook of Statistics in Clinical Oncology. 26. 1977. 2001. Factorial designs with time-to-event end points. Chen TT. Green S. Crowley J. 29. Peterson B. Sobel M. ed. Wittes RE. In: Crowley J. In: Crowley J. Handbook of Statistics in Clinical Oncology. 14:511–522. Chen TT. 23. 53:456–464. Extension of one-sided test to multiple treatment trials. 25. Simon R. 2001. Bayesian design and analysis of two two factorial clinical trials. Gibbons JD. Selecting and Ordering Populations: A New Statistical Methodology. Olkin I. Randomized phase II clinical trials. Simon RM. Simon R. 49:391–398. Stat Med 1994. A multiple decision procedure in clinical trials. 15:124–134. ed. Controlled Clin Trials 1993. 28. 13:431–446. . 22. Phase II selection designs. Cancer Treatment Rep 1985. 69:1375–1381. New York: Marcel Dekker. Simon R.


9 Factorial Designs with Time-to-Event End Points

Stephanie Green
Fred Hutchinson Cancer Research Center, Seattle, Washington

FACTORIAL DESIGN

The frequent use of the standard two-arm randomized clinical trial is due in part to its relative simplicity of design and interpretation. Conclusions are straightforward: either the two arms are shown to be different or they are not. Complexities arise with more than two arms. For example, with four arms there are six possible pairwise comparisons, 19 ways of pooling and comparing two groups, 24 ways of ordering the arms, plus the global test of equality of all four arms. Some subset of these comparisons must be identified as of interest; each comparison has power, level, and magnitude considerations; and conclusions can be difficult, particularly if the comparisons specified to be of interest turn out to be the wrong ones.

Factorial designs are sometimes considered when two or more treatments, each of which has two or more dose levels (possibly including level 0, i.e., no treatment), are of interest alone or in combination. A factorial design assigns patients equally to each possible combination of levels of each treatment. If treatment i, i = 1, . . . , K, has l_i levels, the result is an l_1 × l_2 × ⋅⋅⋅ × l_K factorial. Generally, the aim is to study the effect of levels of each treatment separately by pooling across all other treatments. The assumption often is made that each treatment has the same effect regardless of assignment to the other treatments (no interaction). There has been a fair amount of recent interest in factorial designs. Byar (1) suggested potential benefit in use of factorials for studies with low event rates,

O plus treatment A (arm A) vs. again see Gail et al. To illustrate the issues in factorial designs. who discussed Baysian design and analysis of 2 2 factorials (allowing for some uncertainty in the assumption of no interaction). and unfavorable interaction (worse. who discussed testing ﬁrst for interaction when outcomes are normally distributed and interactions occur only if there are effects of both treatment arms.33)) with no effect of B. A effective and B detrimental (β ln(1. but clearly if the probability of at least one false-positive result is high. The simulated trial had 125 patients per arm accrued over 3 years with 3 additional years of follow-up.33)).162 Green such as screening studies. a single positive result from the experiment will be difﬁcult to interpret and may well be dismissed by many as inconclusive. see Gail et al. by Hung (4). no interaction. Starting with global testing followed by pairwise tests only if the global test is signiﬁcant is a common approach to limit the probability of false-positive results. The multiple comparisons problem is one of the issues that must be considered in factorial designs. From the point of view of individual tests. favorable interaction (AB hazard improved compared to expected. [6]). A concern even in this ideal case may be the joint power for both A and B. a simulation of a 2 2 factorial trial of control treatment O (control arm) vs. A theoretical discussion of factorials in the context of the proportional hazards model is presented by Slud (2).33)).33)). O plus A and B (arm AB) was performed. power calculations are straightforward under the assumption of no interaction—calculate power according to the number of patients in the combined groups (also typical.05 level test of A vs. Survival was exponentially distributed on each arm. and median survival was 1. no-A to have power 0.5)). If power to detect a speciﬁed effect .5 years on the control arm. There is disagreement on the issue of whether all primary questions should each be tested at level α or whether the experiment-wide level across all primary questions should be level α. If tests of each treatment are performed at level α (typical for factorial designs. γ ln(1. Power issues must also be considered. The sample size was sufﬁcient for a one-sided 0. Table 1 summarizes the cases considered. O plus treatment B (arm B) vs. Each of these was considered with no interaction (γ 0).33. and an A/O hazard ratio of 1/1. γ ln(1. then the experiment-wide level (probability that at least one comparison will be signiﬁcant under the null hypothesis) is greater than α. [6]). and both A and B effective (α and β both ln(1. and by Akritas and Brunner (5). Various cases were considered using the usual proportional hazards model λ λ0exp(αzA βzB γzA zB): neither A nor B effective (α and β 0).9 with no effect of B. Other recent contributions to the topic include those by Simon and Freedman (3). A effective (α ln(1. a Bonferoni approach (each of T primary tests performed at α/T ) is also an option. who proposed a nonparametric approach to analysis of factorial designs with censored data (making no assumptions about interaction).

α and β are not .Factorial Designs O A Used in the Simulation B AB Case 2—Effect of A No effect of B 1.5 1.5 1. not-A and B vs. A vs.5 1. The ﬁrst approach is to analyze assuming there are no interactions and to do only two one-sided tests. A vs.5 1.5 2 2 2 1.5 1.5 1.77 Each Case Has The Median of the Best Arm in Bold.5 1. not-A and B vs.67 2 2 2 3.5 2 2 2.5 1. then base conclusions on the tests of A vs. and B vs.5 1.5 1 2 1. not-A is not signiﬁcant. or 2. and the probabilities of choosing the best arm under alternatives of interest must be calculated. Several scenarios were considered in the simulation. the joint power to detect the effects of both is closer to 1 2β. the procedures for designating the preferred arm at the end of the trial (which generally is the point of a clinical trial) must be speciﬁed. not-B.5 1.13 1. If the interaction term is not signiﬁcant.33 2 1 2 1. The second approach is to test ﬁrst for interaction (two-sided) using the model λ λ0exp(αzA βzB γzAzB). not-B.5 1.5 1. power considerations are considerably more complicated. If it is signiﬁcant. then base conclusions on tests of the three terms in the model and on appropriate subset tests.5 1 1.5 2 4—Effect of A Detrimental effect of B 1. not-B is not signiﬁcant. γ is not signiﬁcant. The ‘‘best’’ arm must be speciﬁed for the possible true conﬁgurations.5 1. From the point of view of choosing the best arm.5 1.5 2 2.5 1 1.67 3—Effect of A Effect of B 1.5 2 1.5 2 1. then AB is assumed to be the best arm. If both A is better than not-A and B is better than not-B. γ is signiﬁcant and negative (favorable interaction).55 163 Table 1 Arm Medians Interaction No interaction Unfavorable interaction Favorable interaction 1—Null 1. The treatment of choice is as follows: Arm O if 1. of A is 1 β and power to detect a speciﬁed effect of B is also 1 β.

α is not signiﬁcant and β is not signiﬁcant in the three-parameter model. 2. or γ is signiﬁcant and favorable. but with the results for A and B reversed. 5. γ is not signiﬁcant. not-B are both signiﬁcant.) The third approach is to control the overall level of the experiment by ﬁrst doing an overall test of differences among the four arms and to proceed with the . Arm B if results are similar to A above. AB is not signiﬁcant. α is signiﬁcant and β is not signiﬁcant in the three-parameter model. AB is signiﬁcant. and the test of A vs. AB is signiﬁcant. and the test of A vs. or γ is signiﬁcant and positive (unfavorable interaction) and α and β are not signiﬁcant in the three-parameter model. α is signiﬁcant and β is not signiﬁcant in the three-parameter model. 4. or γ is signiﬁcant and favorable. signiﬁcant in the three-parameter model. AB is signiﬁcant. α is signiﬁcant and β is not signiﬁcant in the three-parameter model.164 Green 3. or γ is signiﬁcant and favorable. AB is not signiﬁcant. not-A is signiﬁcant. not-A and B vs. 3. γ is not signiﬁcant. and the test of A vs. 3. Arm A if 1. and the test of O vs. or γ is signiﬁcant and favorable. not-B is not signiﬁcant. α and β are signiﬁcant in the three-parameter model. B is signiﬁcant in favor of A. (Try putting this into the statistical considerations of a protocol. or γ is signiﬁcant and unfavorable. Arm A or Arm B if γ is signiﬁcant and unfavorable. and the test of A vs. α and β are signiﬁcant in the threeparameter model. Arm AB if 1. B is not signiﬁcant. and A vs. B vs. 4. or γ is signiﬁcant and favorable and α and β are both signiﬁcant in the three-parameter model. or γ is signiﬁcant and unfavorable. and the test of B vs. and A vs. 2. and the test of O vs. β is signiﬁcant and α is not signiﬁcant in the three-parameter model.

O vs. B vs.422 0. for the approach of testing for interaction. The global test was done at the 0.053 0 0.. If the interaction enhances the effectiveness of the best arm.562 0.095 0. approach 2 may or may not be superior.999 0. not-A.244 0.922 0.055 0.985 0 0 0 A or B 0 0 0 0 0 0 0 0 0 0 0 0 . case 4 Table 2 Simulated Probability of Conclusion with Approach 1: No Test of Interaction Conclusion Case.007 0 0. interaction 1.078 0. about as anticipated.g. and possibly insufﬁciently conservative.791 0.Factorial Designs 165 second approach above only if this test is signiﬁcant. none unfavorable favorable none unfavorable favorable none unfavorable favorable none unfavorable favorable O 0. O vs.001 0.578 0. One-sided tests were done at the 0.078 0.002 0 0. and for the approach of doing a global test before testing for interaction. B. The possible outcomes of a trial of O vs.998 B 0. it is better to test for interaction (e. for the approach of ignoring interaction.79. where the difference between A and not-A is diminished due to the interaction).259 0. not-B. A. In the best case of using approach 1 when in fact there is no interaction.187 0.05 level. B vs. If the overall test is not signiﬁcant. and model terms α and β were one-sided. approach 1 is best if there is no interaction. other tests were two-sided.g.243 0. Tables 2–4 show the simulated probabilities of making each of the conclusions in the 12 cases of Table 1.1. A.002 0.002 A 0. the experiment level is 0.231 0. 2. case 4 with an unfavorable interaction. 2. B. 1. If the interaction masks the effectiveness of the best regimen. or A or B but not AB. O vs. 2. 4.05 level.437 0. 3. A vs.006 0 0 0 AB 0. Tables 2–4 illustrate several points.010 0. 1. 4. Apart from that.311 0.208 0.002 0.049 0. testing is detrimental (e.369 0. AB. 3. 3.. AB. AB are to recommend one of O.316 0 0.001 0.890 0.867 0.104 0. 4. The probability of choosing the correct arm is reduced if approach 2 (testing ﬁrst for interaction) is used instead of approach 1 in all four cases with no interaction. If there is an interaction. For each table the probability of drawing the correct conclusion is in bold. two-sided at 0.009 0. Tests of A vs. then arm O is concluded to be the treatment of choice.627 0.11 and power when both A and B are effective is 0.

349 0. 2. 4.002 0 0 0 Green .167 0.810 0.004 0.990 0 0 0.116 0.002 0.003 0 0 0 AB 0.123 0.061 0.756 B 0.752 0.865 0.062 0.353 0.172 0. 2.467 0.424 0. 3.005 0 0.060 0. 1.384 0.659 0.463 0.122 0.472 0.019 0.128 0.001 0.046 A or B 0.036 0.601 0.166 Table 3 Simulated Probability of Conclusion with Approach 2: Test of Interaction Test for interaction.426 0.048 0 0. 3.441 Conclusion Case.446 0. 4.883 0.009 0.914 0.078 0.198 A 0.612 0.138 0.114 0.418 0.033 0.110 0.089 0.008 0.001 0.086 0. interaction 1. none unfavorable favorable none unfavorable favorable none unfavorable favorable none unfavorable favorable O 0.341 0.117 0. 2.006 0. probability of rejection 0. 4. 1.309 0.089 0.185 0 0.434 0. 3.003 0.017 0 0.

001 0 0.390 0.198 A 0.004 0. 3. 1.972 0.032 0. 2.00 0.286 0.997 1.018 0.057 0.004 0.046 A or B 0.684 0. 2.063 0. Probability of rejection 0. 3.932 0.528 0. 4.659 0.Factorial Designs Table 4 Simulated Probability of Conclusion with Approach 3: Global Test Followed by Approach 2 Conclusion Global test. 4.578 0.926 0.466 0 0.987 0.052 0.578 0. 2.503 0.069 0.756 B 0.049 0.039 0 0.109 0.432 0.002 0 0 0 167 .010 0. 1.068 0.535 1.011 0.558 0.001 0.329 0. interaction 1.003 0 0 0 AB 0.117 0.741 0.00 0. 3.003 0 0.611 0.074 0 0.026 0.882 0.341 0.015 0 0.014 0.990 0 0 0.067 0.541 0.999 Case. 4.374 0. none unfavorable favorable none unfavorable favorable none unfavorable favorable none unfavorable favorable O 0.072 0.059 0.

the difﬁculties become more evident. Potential drug interactions. It was assumed that PBI and chemotherapy would not affect each other. a test of whether PBI was superior to no PBI. The worst arm was PBI plus chemotherapy.168 Green with favorable interaction. however. and no deﬁnitive conclusion could be made concerning chemotherapy. the roles of both chemotherapy and prophylactic radiation to the brain were of interest. However. overlapping toxicities. the interactions were detected at most 47% of the time in these simulations. it was clear that the comparison of no further treatment vs. No other tests were speciﬁed. This is not ‘‘changing the rules’’ of the design. chemotherapy vs. in general A cannot be assumed to behave the same way in the presence of B as in the absence of B. Perhaps in studies where A and B have unrelated mechanisms of action and are being used to affect different outcomes. Once you admit your K J factorial study is not one K-arm study and one J-arm study (which happen to be in the same patients) but rather a K Jarm study with small numbers of patients per arm. PBI was to be tested by combining across the chemotherapy arms and chemotherapy was to be tested by combining across PBI arms.025 for two tests. chemotherapy was critical— but the study had seriously inadequate power for this test. Approach 3 does restrict the overall level (probability of not choosing O when there is no positive effect of A or B or AB). The probability of identifying the correct regimen is poor for all methods if the correct arm is not the control arm. . PBI was found to be detrimental to patient survival. PBI vs.1 level tests. Even using 0. it is acknowledging the reality of most clinical settings. Using the design criteria one would conclude that neither PBI nor chemotherapy should be used. Unfavorable interactions and detrimental effects happen: Study 8300 (similar to case 4 with an unfavorable interaction) is an unfortunate example (7). then no additional treatment. Investigators chose a Bonferoni approach to limit type 1 error: The trial design speciﬁed level 0. All patients received radiation to the chest and were randomized to receive prophylactic brain irradiation (PBI) plus chemotherapy vs. In all cases the power for detecting interactions is poor. differences in compliance. Approach 1. then chemotherapy alone. where the difference between A and not-A is larger due to the interaction whereas B is still clearly ineffective). no additional treatment. With this outcome. assuming there is no interaction. Unfortunately. is particularly poor. In this study in limited non-small cell lung cancer. followed by PBI. Unfavorable interactions are particularly devastating to a study. but this is at the expense of a reduced probability of choosing the correct arm when the four arms are not sufﬁciently different for the overall test to have high power. then assumptions of no interaction may not be unreasonable. and a test of whether chemotherapy was superior to no chemotherapy. and so on all make it more reasonable to assume there will be differences— and with small sample sizes per group it is unlikely these will be detected.
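To make the power loss from an unfavorable interaction concrete, the following small simulation is offered as an illustrative sketch (it is not the simulation used for Tables 2–4). It assumes exponential survival with no censoring and hypothetical medians in which treatment A prolongs median survival from 1.5 to 2.0 years when given alone but not when combined with B, and it compares the pooled A versus not-A logrank test with a simple pairwise comparison of the A arm against the control arm.

    import math
    import random

    def logrank_z(times1, times2):
        """Two-sample logrank Z statistic (all observations are events; no censoring)."""
        events = sorted([(t, 1) for t in times1] + [(t, 2) for t in times2])
        n1, n2 = len(times1), len(times2)
        o_minus_e, var = 0.0, 0.0
        i = 0
        while i < len(events):
            j, d1, d2, t = i, 0, 0, events[i][0]
            while j < len(events) and events[j][0] == t:   # handle tied event times
                if events[j][1] == 1:
                    d1 += 1
                else:
                    d2 += 1
                j += 1
            n, d = n1 + n2, d1 + d2
            if n > 1:
                o_minus_e += d1 - d * n1 / n               # observed minus expected, group 1
                var += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
            n1 -= d1
            n2 -= d2
            i = j
        return o_minus_e / math.sqrt(var)

    def simulate_power(n_per_arm=125, n_sim=500, seed=1):
        random.seed(seed)
        # Hypothetical medians: A helps alone, but an unfavorable interaction
        # removes the benefit when B is also given.
        medians = {"O": 1.5, "A": 2.0, "B": 1.5, "AB": 1.5}
        hits_pooled = hits_pairwise = 0
        for _ in range(n_sim):
            data = {arm: [random.expovariate(math.log(2) / m) for _ in range(n_per_arm)]
                    for arm, m in medians.items()}
            # Pooled one-sided 0.05-level test of A vs not-A (group 1 = not-A).
            z_pool = logrank_z(data["O"] + data["B"], data["A"] + data["AB"])
            # Simple pairwise comparison of the A arm with the control arm only.
            z_pair = logrank_z(data["O"], data["A"])
            hits_pooled += z_pool > 1.645
            hits_pairwise += z_pair > 1.645
        print("power, pooled A vs not-A:", hits_pooled / n_sim)
        print("power, A arm vs control: ", hits_pairwise / n_sim)

    if __name__ == "__main__":
        simulate_power()

Even though the pooled comparison uses twice as many patients, its power is lower than that of the pairwise comparison in this scenario, because half of the patients receiving A obtain no benefit from it.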

Liu et al. If assumptions are tested. and application of a ‘‘bubble sort’’ analysis (e. . Properties are good for this approach when each experimental arm is similar either to the control arm or the best arm but not when survivals are more evenly spread out among the control and other arms. with no way of ascertaining beforehand which is the case.Factorial Designs 169 OTHER APPROACHES TO MULTIARM STUDIES Various approaches to multiarm studies are available. These authors also point out the problems when power is considered in the broader sense of drawing the correct conclusions. focusing on appropriate global tests or on appropriate tests for subhypotheses. . A.. This approach is discussed in Chen and Simon (13). Any model assumption can result in problems when the assumptions are not correct. take the most costly only if signiﬁcantly better than the rest. multiple experimental arms would apply. . followed by α level pairwise tests if the global test is signiﬁcant) has good power for detecting the difference between a control arm and the best treatment. Southwest Oncology Group . Liu and Dahlberg (11) discuss design and provide sample size estimates (based on the least favorable alternative for the global test) for K-arm trials with timeto-event end points. procedures must be speciﬁed for when the assumptions are shown not to be met. A related approach includes a preference ordering. the problem of comparing control vs. i and pooled groups i 1 . K 1 K 1 1/2 T i 1 L (i) i 1 var(L (i) ) 2 i j cov(L (i). There is a long history of articles on this problem. resulting in the alternative O B A AB). Marcus et al. . (12) propose a modiﬁed logrank test for ordered alternatives.). K.g. Similar comments apply as to the more general case above. . B and AB. and Tang and Lin (10). say by expense of the regimens (which at least has a good chance of being speciﬁed correctly). . which changes the properties of the experiment and complicates sample size considerations. . testing other assumptions can either be beneﬁcial or detrimental. for instance Dunnet (8). this test is used as the global test before pairwise comparisons in this setting. L ( j)) where L (i) is the numerator of the one-sided logrank test between the pooled groups 1. (9). and the most costly is not signiﬁcantly better . with the additional problem that the test will not work well if the ordering is misspeciﬁed. The procedure investigated (a K-sample logrank test is performed at level α. Designs for ordered alternatives are another possibility (say for this example there are theoretical reasons to hypothesize superiority of A over B. the second most only if signiﬁcantly better than the less costly arms. As with testing for interactions. If the example study could be formulated as O vs.

Simon R. Factorial and reciprocal control design. Nonparametric methods for factorial designs with censored data. The correct balance between conservative assumptions vs. high-dose comparison. If power for speciﬁc pairwise comparisons is important for any outcome. Freedman L. CONCLUSION The motivation for simplifying assumptions in multiarm trials is clear. In the case of factorial designs. thereby eliminating what most view as the primary advantage to factorial designs. but disappointing experience tells us that too many are too often wrong. then the required sample size is larger. four times the sample size is required (15). 53:456–464. high-dose CDDP vs. Stat Med 1993. Biometrics 1994. is large. To detect an interaction of the same magnitude as the main effects in a 2 2 trial. This trial randomized patients to low-dose cisplatin (CDDP) vs. REFERENCES 1. . A beneﬁcial effect of adding mitomycin-C to high-dose CDDP could not be ruled out at the time. possible efﬁciencies is rarely clear. The trial was closed approximately half way through the planned accrual because survival on high-dose CDDP was convincingly shown not to be superior to standard-dose CDDP by the hypothesized 25% (in fact. Akritas M. An even larger sample size is needed if detection of interaction is of interest. it appeared to be worse). Stat Med 1990. Baysian design and analysis of two x two factorial clinical trials. 3. highdose CDDP plus mitomycin-C (with the obvious hypothesized ordering). 5. Brunner E. but this comparison became meaningless in view of the standarddose vs. Byar J. 50:25–38. 92:568–576. Two-stage tests for studying monotherapy and combination therapy in twoby-two factorial trials. while at the same time limiting the overall level of the experiment. Analysis of factorial survival experiments. Slud E.170 Green study S8738 (14) provides an example of incorrect assumptions. 2. 4. Likely not all simplifying assumptions are wrong. the small sample sizes resulting from oversimpliﬁcation lead to unacceptable chances of inconclusive results (and a tremendous waste of resources). Hung H. Unfortunately. Biometrics 1997. 9:55–64. J Am Stat Assoc 1997. The sample size required to have adequate power for multiple plausible alternatives. 12:645–660. combining treatment arms seems to be a neat trick—multiple answers for the price of one—until you start considering how to protect against the possibility that the assumptions allowing the trick are incorrect.

Chase E. Peterson B. A multiple comparisons procedure for comparing several treatments with a control. 14. Sample size requirements and length of study for testing interaction in a 2 K factorial design when time to failure is the outcome. Peritz E. Tsai W-Y. A randomized trial of chemotherapy and radiotherapy for stage III non-small cell lung cancer. 14:511–522. Hutchins L. Design and analysis of multiarm trial clinical trials with survival endpoints. Grunberg S. Li J-Y. Control Clin Trials 1993. J Am Stat Assoc 1955. Design and analysis for survival data under order restrictions with a modiﬁed logrank test. Chang Y-S. Simon R. Dunnet C. Evaluation of cisplatin in metastatic non-small cell lung cancer: A phase III study of the Southwest Oncology Group. George S. Gabriel K. Perez E. J Am Stat Assoc 1997. Control Clin Trials 1998. Natale R. Jin M-L. 9. 17:1469–1479. Dahlberg S. Liu P-Y. Mark S. Blot W. Rabkin C. Groves F. 19:352–369. Hutchins L. 13. Livingston R. Marcus R. Heinrich J. You W-C. Crowley J. 8. 12. Fraumeni J. Xu G-W. Taylor C. Control Clin Trials 1995. 60:573–583. Natale R. Braun T. Control Clin Trials 1994. . Chen T. Crowley J. Weiss G. Brown L. Factorial trial of three interventions to reduce the progression of precancerous gastric lesions in Sandong. Extension of one-sided test to multiple treatment trials. Liu W-D. 11. Balcerzak S. An approximate likelihood ratio test for comparing several treatments to a control. Cancer Ther 1998. Mira J. Roach R. Baker L. China: design issues and initial data. Liu P-Y. Neefe J. Gail M. 7. 63:655–660. 92:1155–1162. 11:873–878. 10. Schwartz J. Lin S. Wolf M.Factorial Designs 171 6. Gandara D. 16:119–130. On closed testing procedures with special reference to ordered analysis of variance. Ma J-L. Zhang L. Hu J. Livingston R. 15:124–134. Biometrika 1976. Tang D-I. J Clin Oncol 1993. 1:229–236. 15. Stat Med 1998. Miller T.


10 Therapeutic Equivalence Trials

Richard Simon
National Cancer Institute, National Institutes of Health, Bethesda, Maryland

I. INTRODUCTION

The objective of a therapeutic equivalence trial is generally to demonstrate that a new treatment is equivalent to a standard therapy with regard to a specified clinical end point. The new treatment may be less invasive and less debilitating or it may be more convenient; consequently, it would be attractive to patients if it is equivalent to the standard with regard to the primary efficacy end point. Usually, however, one is willing to exchange only very small reductions in efficacy for the advantages in secondary end points. In some therapeutic equivalence trials called active control trials, investigators would like to demonstrate that the new treatment is effective compared with no treatment, but because use of a no-treatment arm is not feasible, they attempt to demonstrate therapeutic equivalence to a standard treatment. Therapeutic equivalence trials are contrasted to bioequivalence trials where the objective is to demonstrate equivalence of serum concentrations of the active moiety. In this chapter we review some of the problems with therapeutic equivalence trials and provide recommendations for the design and analysis of such trials.

Failure to demonstrate nonequivalence is the ambiguous outcome and the outcome that leads to change in the treatment of future patients. This is a particular problem for therapeutic equivalence trials because it leads to failure to demonstrate nonequivalence as a result of inadequate statistical power. In conventional trials. Someone once said that everything looks like a nail to a person whose only tool is a hammer. Overreliance on statistical signiﬁcance testing is one of the problems with the conduct of therapeutic equivalence trials. A related problem with therapeutic equivalence trials is that large sample sizes are often needed. closeness of sample means or sample distributions and nonsigniﬁcant p values are convincing to a large part of the medical audience. Failure to reject the null hypothesis in conventional trials generally does not lead to change medical practice. For example. If the outcomes for the two treatments are very different. Unfortunately. Failure to demonstrate nonequivalence is often interpreted as a demonstration of therapeutic equivalence and grounds for adoption of the new regimen. This is not always the case but is for a large class of therapeutic equivalence trials where a standard effective treatment is compared with a shorter lower dose or less invasive regimen. Many problems with therapeutic equivalence trials are associated with reasons why failure to demonstrate nonequivalence should not be interpreted as demonstration of equivalence. PROBLEMS WITH THERAPEUTIC EQUIVALENCE TRIALS One inherent problem is that it is impossible to demonstrate equivalence.174 Simon II. Failure to reject the null hypothesis may be a result of inadequate sample size and not grounds for concluding equivalence. For therapeutic equivalence trials the situation is quite different. The implications of failure to reject the null hypothesis are often more difﬁcult to interpret. In the absence of demonstrating lack of equivalence. and rejection of the null hypothesis leads to change in the treatment of future patients. Consequently. The sample size for clinical trials is often determined on practical grounds based on patient availability over a limited time period or funding available. Tumor resection may have clear advantages with regard to . and hence the ambiguity associated with its interpretation is in some sense of less concern. many trials are undersized. consider a cancer trial evaluating tumor resection as an alternative to amputation of the organ containing the tumor in a setting where amputation is the standard therapy known to be curative in a large number of cases. rejection of the null hypothesis is usually established with substantial statistical reliability. then one can conclude that the two treatments are not therapeutically equivalent. however. one can only conclude that results are consistent with only small differences. however.


quality of life, but many patients would be interested in these advantages only if they were assured that the chance for cure they might give up would be very small. Hence, the appropriate trial should focus on distinguishing the null hypothesis that the new treatment is no worse than the standard, i.e., ∆ 0 from the hypothesis that the standard treatment is superior by at least some very small amount δ, i.e., ∆ δ. Consequently, this trial would have to be quite large. Some therapeutic equivalence trials compare a treatment regimen E to a control C when the advantage of C over placebo P or no treatment is small. Such trials must also be very large because they must demonstrate that the difference in efﬁcacy between E and C is no greater than a fraction of the difference between C and P. Another difﬁculty with the therapeutic equivalence trial is that there is no internal validation of the assumption that the control C is actually effective for the patient population at hand. It is not enough for E to be therapeutically equivalent to C, we want equivalence coupled with the effectiveness of E and C relative to P. A related problem is the difﬁculty in selecting the difference δ to be distinguished from the null hypothesis. In general, the difference δ should represent the largest difference that a patient is willing to give up in efﬁcacy of the standard treatment C for the secondary beneﬁts of the experimental treatment E. The difference δ must be no greater than the efﬁcacy of C relative to P and will in general be a fraction of this quantity δ c . Estimation of δ c requires review of clinical trials that established the effectiveness of C relative to P. δ c should not be taken as the maximum likelihood estimate (mle) of treatment effect from such trials because there is substantial probability that the true treatment effect in those trials was less than the point mle. We discuss later in this chapter quantitative methods for utilizing information that demonstrates the effectiveness of C relative to P in planning a therapeutic equivalence trial.
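One way to make the choice of δ concrete is to anchor it to a conservative estimate of the historical C versus P effect rather than to its point estimate. The fragment below is a hypothetical sketch of such a calculation on the log hazard ratio scale; the function name, the 50% retention fraction, and the input numbers are all assumptions chosen only for illustration.

    from statistics import NormalDist

    def conservative_margin(log_hr_P_vs_C: float, se: float,
                            fraction_retained: float = 0.5,
                            confidence: float = 0.975) -> float:
        """Equivalence margin delta on the log hazard ratio scale.

        log_hr_P_vs_C: estimated log hazard ratio of placebo relative to the active
        control C from historical trials (positive when C is effective), with
        standard error se.  The margin is a fraction of a conservative (lower
        confidence bound) estimate of that effect, not of the point estimate.
        """
        z = NormalDist().inv_cdf(confidence)
        conservative_effect = max(log_hr_P_vs_C - z * se, 0.0)   # lower bound on C-vs-P effect
        return (1.0 - fraction_retained) * conservative_effect

    if __name__ == "__main__":
        # Hypothetical historical summary: hazard ratio P/C = 1.40, se of log HR = 0.10.
        delta = conservative_margin(0.336, 0.10, fraction_retained=0.5)
        print(f"margin on the log hazard ratio scale: {delta:.3f}")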

III. DESIGN AND ANALYSIS Of the most frequent methods that have been proposed for the design and analysis of therapeutic equivalence trials, that based on conﬁdence intervals has important advantages (1–3). As indicated above, a signiﬁcance test of the null hypothesis can provide a very misleading interpretation of the results of a therapeutic equivalence trial. Testing an alternative hypothesis has been proposed as an alternative method (4). For an experiment that is too small to be informative, a test of an alternative hypothesis will be less likely to be interpreted as supporting therapeutic equivalence, but this approach will still not indicate the basic inadequacy of the experiment. A conﬁdence interval for the difference in efﬁcacy between E


and C will indicate the range of values that are consistent with the data. A conﬁdence interval is more informative than a hypothesis test and a statement of statistical power. Statistical power is relevant for planning the study, but it is not a good parameter for interpreting the results of a study because it ignores the data actually obtained. A conﬁdence interval incorporates both the size of the study and the results obtained. The well-known study by Frieman et al. (5) tabulated the power of 71 trials reported as negative. They found that 50 of these 71 trials had power less than 0.9 for detecting a 50% treatment effect. If one computes approximate conﬁdence intervals from those trials, however, one ﬁnds that 16 of these 50 trials with inadequate statistical power have conﬁdence limits that exclude 50% treatment effects and hence are deﬁnitively negative. Frieman et al. focused attention on statistical power of trials that claimed to be ‘‘negative.’’ This was useful, but calculation of conﬁdence intervals for treatment differences is a much more relevant and informative means of analysis. To encourage physicians to use such conﬁdence intervals, Simon (2) showed how to calculate such approximate conﬁdence intervals for commonly encountered end points. If one decides to use a conﬁdence interval as the method of analysis, the questions of one-sided versus two-sided and the conﬁdence coefﬁcient arise. Therapeutic equivalence trials are by nature asymmetric with regard to E and C. We are generally interested in whether E is about the same as or substantially worse than C. Often, there is information that makes it less likely that E will be superior to C. In any case, clinical decision making may be the same whether E is equivalent to C or if E is superior to C. Consequently, one-sided conﬁdence intervals are often justiﬁed. Since for some therapeutic equivalence trials it is possible that E is superior to C, two-sided conﬁdence intervals may also be desirable. I have recommended symmetric two-sided 90% conﬁdence intervals in many cases. This provides the same upper limit for the C E difference as a one-sided 95% conﬁdence interval and also provides a lower limit for evaluating whether E is actually superior to C. Another alternative is the 11/2-sided conﬁdence limit in which the lower limit is extended. For example, a 11/2-sided 95% conﬁdence limit would have 31/3% area above the upper limit and 12/3% area below the lower limit. ˆ Let ∆ denote the mle of the difference in treatment effects C E. We will ˆ assume that a positive value favors C and that ∆ is approximately normal with mean ∆ and standard deviation σ. An upper 1 α level conﬁdence limit for ∆ is approximately ∆ up ˆ ∆ z1 α σ (1)

In planning the trial a value δ must be specified, where δ represents the largest true C − E difference consistent with therapeutic equivalence. Once the data are obtained, if ∆_up ≤ δ, then one concludes that E is therapeutically equivalent to


C. Whether or not this condition is achieved, the confidence interval provides the range of relative effectiveness consistent with the data.

There are several approaches to planning the size of the study using the confidence interval as the basis for analysis. All the methods are based, however, on the fact that the σ that occurs in Eq. (1) is a function of the sample size. In the case of survival comparisons with proportional hazards, σ is a function of the number of events observed. One approach is to set σ so that under the null hypothesis, the probability that ∆_up ≤ δ is a specified value 1 − β. If σ is independent of the value of ∆, this leads to the familiar condition for sample size planning with normal distributions that

δ/σ ≥ z_{1−α} + z_{1−β}    (2)

For the two-sample normal case σ = √(2σ_0²/n), where σ_0 is the standard deviation per observation and n is the sample size per treatment group. For the two-sample normal case, this approach provides the same sample size as does the hypothesis testing framework, with proper definition of α and β. For the two-sample binomial or the two-sample time-to-event case, the correspondence is not exact because of dependence of σ on ∆. The correspondence is approximately the same, however. For example, Eq. (2) can be used for sample size planning in the two-sample time-to-event case with the approximation σ ≈ √(4/total events). An alternative approach to sample size planning is to use a symmetric two-sided confidence interval and require that

Pr_{∆=0}[∆̂ + z_{1−α}σ ≥ δ] ≤ β    and    Pr_{∆=δ}[∆̂ − z_{1−α}σ ≤ 0] ≤ β    (3)

In the case where σ is independent of ∆, satisfying condition (2) automatically satisﬁes both parts of condition (3). A more stringent approach to sample size planning is to require that the width of the two-sided conﬁdence interval be of size δ. This ensures that the conﬁdence interval will always exclude either 0 or δ. It requires substantially more patients, however. Interim analysis using conﬁdence intervals and group sequential methods is described by Durrleman and Simon (6).
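For the time-to-event setting, the confidence interval analysis and the sample size condition of Eq. (2) reduce to simple arithmetic. The sketch below is illustrative only; it uses the approximation σ ≈ √(4/total events) mentioned above to compute the one-sided upper confidence limit for the C − E log hazard ratio and the total number of events needed for a given margin δ.

    import math
    from statistics import NormalDist

    _z = NormalDist().inv_cdf

    def upper_confidence_limit(delta_hat: float, total_events: int, alpha: float = 0.05) -> float:
        """One-sided upper 1 - alpha confidence limit for the C - E difference
        (log hazard ratio scale), using sigma ~ sqrt(4 / total events)."""
        sigma = math.sqrt(4.0 / total_events)
        return delta_hat + _z(1.0 - alpha) * sigma

    def required_events(delta: float, alpha: float = 0.05, beta: float = 0.10) -> int:
        """Total events needed so that delta / sigma >= z_{1-alpha} + z_{1-beta}."""
        return math.ceil(4.0 * (_z(1.0 - alpha) + _z(1.0 - beta)) ** 2 / delta ** 2)

    if __name__ == "__main__":
        print("events needed for delta = 0.15:", required_events(0.15))
        print("upper limit with 1200 events and delta_hat = 0.05:",
              round(upper_confidence_limit(0.05, 1200), 3))

The event requirement grows rapidly as δ shrinks, which is the quantitative expression of the point that therapeutic equivalence trials must often be very large.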

IV. ANALYSIS TO ESTABLISH EQUIVALENCE AND EFFECTIVENESS In an important sense, none of the above approaches represents a satisfactory statistical framework for the design and analysis of therapeutic equivalence trials. These approaches depend on the speciﬁcation of a minimal difference δ in efﬁ-


cacy that one is willing to tolerate. None of the approaches deal with how δ is determined. Fleming (7) and Gould (8,9) have noted that the design and interpretation of equivalence trials must utilize information about previous trials of the active control. Fleming proposed that the new treatment is considered effective if an upper conﬁdence limit for the amount that the new treatment may be inferior to the active control does not exceed a reliable estimate of the improvement of the active control over placebo or no treatment. Gould provided a method for creating a synthetic placebo control group based on previous trials comparing the active control to placebo. Simon presented a general Bayesian approach to the utilization of information from previous trials in the design and analysis of an equivalence trial (10). Two major objectives can be distinguished. The ﬁrst is to determine whether the experimental treatment is effective relative to P. This requires explicit use of prior information about outcomes of trials comparing P to the active control. Meaningful interpretation of active control trials is impossible without consideration of such information. Establishing whether or not the experimental treatment is effective relative to P is a ﬁrst requirement. The second objective is to determine whether any medically important portion of the treatment effect for the active control is lost with the experimental treatment. In some cases this objective is unrealistic because the size of the treatment effect (relative to P) for the active control is imprecisely determined. We use the following model: y α βx γz ε, where y denotes the response of a patient, x 0 for placebo or the experimental treatment and 1 for the control treatment, z 0 for placebo or the control treatment and 1 for the experimental treatment, and ε is normally distributed experimental error. Hence the expected response for C is α β, the expected response for E is α γ, and the expected response for P is α. The likelihood function for the data (D) from the active controlled trial can be expressed as π(D| α, β, γ) π(y c |α, β)π(y e |α, γ), where the ﬁrst factor is the likelihood of the data for the control group and the second factor is the likelihood of the data for the experimental group. We use the notation π( ) informally to denote either probability density of observable data, prior probability density of a parameter, or posterior density of a parameter. The ﬁrst factor is N(α β, σ 2 ) and the second factor is N(α γ, σ 2 ), where σ is the standard error for the observed means. We assume that σ 2 is known, although it will generally be estimated. For the large sample sizes appropriate for active control trials, the additional variability caused by uncertainty in σ 2 should be very small. This assumption enables us to obtain simple analytical results, but a more exact treatment is possible using posterior distribution sampling methods. The posterior distribution of Θ (α, β, γ) has density proportional to π(D| α, β, γ)π(Θ). We shall assume that the parameters have independent normal prior densities π(α) N(µ α , σ 2 ), π(β) N(µ β , σ 2 ), and π(γ) N(µ γ , σ 2 ). Hence, α β γ


the posterior distribution of Θ is π(Θ|D) ∝ π(y_c|α, β)π(y_e|α, γ)π(α)π(β)π(γ). The posterior distribution can be shown to be multivariate normal. The covariance matrix is

\Sigma = \frac{\sigma^2}{K}
\begin{pmatrix}
(1+r_\beta)(1+r_\gamma) & -(1+r_\gamma) & -(1+r_\beta) \\
-(1+r_\gamma) & (1+r_\alpha)(1+r_\gamma)+r_\gamma & 1 \\
-(1+r_\beta) & 1 & (1+r_\alpha)(1+r_\beta)+r_\beta
\end{pmatrix}    (4)

where r_α = σ²/σ_α², r_β = σ²/σ_β², r_γ = σ²/σ_γ², and K = r_α(1 + r_β)(1 + r_γ) + r_β(1 + r_γ) + r_γ(1 + r_β). The mean vector η = (η_α, η_β, η_γ) of the posterior distribution is

η_α = [r_α(1 + r_β)(1 + r_γ)µ_α + r_β(1 + r_γ)(y_c − µ_β) + r_γ(1 + r_β)(y_e − µ_γ)]/K
η_β = [r_β(r_γ + (1 + r_α)(1 + r_γ))µ_β + r_α(1 + r_γ)(y_c − µ_α) + r_γ(y_c − y_e + µ_γ)]/K
η_γ = [r_γ(r_β + (1 + r_α)(1 + r_β))µ_γ + r_α(1 + r_β)(y_e − µ_α) + r_β(y_e − y_c + µ_β)]/K    (5)

This indicates that the posterior mean of α is a weighted average of three estimates of α. The first estimate is the prior mean µ_α. The second estimate is the observed y_c minus the prior mean for β; this makes intuitive sense, since the expectation of y_c is α + β. The third estimate in the weighted average is the observed y_e minus the prior mean for γ; the expectation of y_e is α + γ. The sum of the weights is K. The other posterior means are similarly interpreted. The marginal posterior distribution of γ is normal with mean η_γ and variance the (3, 3) element of Σ. The parameter γ represents the contrast of experimental treatment versus placebo. One can thus easily compute the posterior probability that γ < 0, which would be a Bayesian analog of a statistical significance test of the null hypothesis that the experimental regimen is no more effective than placebo (if negative values of the parameter represent effectiveness). The posterior distribution of γ − kβ is univariate normal with mean η_γ − kη_β and variance Σ_33 + k²Σ_22 − 2kΣ_23. Consequently, one can also easily compute the posterior probability that γ − kβ < 0. For k = 0.5, if β < 0 this represents the probability that the experimental regimen is at least half as effective as the active control. Since there may be positive probability that β > 0, it is more appropriate to compute the joint probability that β < 0 and γ < kβ to represent the probability that the experimental regimen is at least a kth as effective as the active control.
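Equations (4) and (5) can be checked numerically by assembling the posterior precision directly from the likelihood and the independent normal priors. The sketch below does this and compares η_γ with the closed form above; all numerical values are arbitrary illustrative inputs, not data from any trial discussed in this chapter.

import numpy as np

# y_c estimates alpha + beta, y_e estimates alpha + gamma; priors are independent normals.
sigma2 = 0.06**2
mu = np.array([0.1, -0.2, 0.0])                  # prior means (mu_alpha, mu_beta, mu_gamma)
tau2 = np.array([0.5**2, 0.1**2, 0.4**2])        # prior variances
y_c, y_e = 0.05, 0.12                            # hypothetical observed arm means

X = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
prec = X.T @ X / sigma2 + np.diag(1.0 / tau2)    # posterior precision
cov = np.linalg.inv(prec)                        # this is Sigma in Eq. (4)
eta = cov @ (X.T @ np.array([y_c, y_e]) / sigma2 + mu / tau2)

# Closed form for eta_gamma from Eq. (5):
r_a, r_b, r_g = sigma2 / tau2
K = r_a * (1 + r_b) * (1 + r_g) + r_b * (1 + r_g) + r_g * (1 + r_b)
eta_gamma = (r_g * ((1 + r_a) * (1 + r_b) + r_b) * mu[2]
             + r_a * (1 + r_b) * (y_e - mu[0])
             + r_b * (y_e - y_c + mu[1])) / K
print(eta[2], eta_gamma)                         # the two values should agree

Because the likelihood contributes only the two linear combinations α + β and α + γ, the posterior precision is simply X'X/σ² plus the diagonal prior precision, which is all the computation requires.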


In the special case where noninformative prior distributions are adopted for α and γ, one obtains

\Sigma = \sigma_\beta^2
\begin{pmatrix}
1+r_\beta & -1 & -(1+r_\beta) \\
-1 & 1 & 1 \\
-(1+r_\beta) & 1 & 1+2r_\beta
\end{pmatrix}    (6)

In this case the posterior distribution of β is N(µ_β, σ_β²), the same as the prior distribution, the posterior distribution of γ is N(µ_β + y_e − y_c, σ_β² + 2σ²), and the posterior distribution of α is N(y_c − µ_β, σ_β² + σ²). It can be seen that the clinical trial comparing C to E contains information about α if an informative prior distribution is used for β. One may permit correlation among the prior distributions. Let S denote the covariance matrix for the multinormal prior distribution for (α, β, γ). Then Σ⁻¹ = M + S⁻¹, where

M = \frac{1}{\sigma^2}
\begin{pmatrix}
2 & 1 & 1 \\
1 & 1 & 0 \\
1 & 0 & 1
\end{pmatrix}    (7)

and the posterior mean vector is the solution of Σ⁻¹η = (1/σ²)(y_•, y_c, y_e)′ + S⁻¹µ′, where µ = (µ_α, µ_β, µ_γ) and y_• = y_c + y_e. The above results can be applied to binary outcome data by approximating the log odds of failure by a normal distribution. The approach can also be extended in an approximate manner to the proportional hazards model. Let the hazard be written as

λ(t) = λ_0(t) exp(βx + γz)

where λ_0(t) denotes the baseline hazard function and the indicator variables x and z are the same as described above in Section IV. The data will be taken as the maximum likelihood estimate of the log hazard ratio for E relative to C for the active control study and will be denoted by y. Thus, for large samples y is approximately normally distributed with mean γ − β and variance σ² = 1/d_C + 1/d_E, where the d's are the numbers of events observed on C and E, respectively. Using normal priors for β and γ as above, the same reasoning results in the posterior distribution of the parameters (β, γ) being approximately normal with mean η = (η_β, η_γ) and covariance matrix Σ = (λ_ij)⁻¹, with

λ_11 = 1/σ² + 1/σ_β²,    λ_22 = 1/σ² + 1/σ_γ²,    λ_12 = −1/σ²


and mean vector determined by

\Lambda\eta = \begin{pmatrix} \mu_\beta/\sigma_\beta^2 - y/\sigma^2 \\ \mu_\gamma/\sigma_\gamma^2 + y/\sigma^2 \end{pmatrix}

If a noninformative prior is used for γ, then λ_22 = −λ_12, and we obtain that the posterior distribution of β is N(µ_β, σ_β²), the same as the prior distribution. In this case the posterior distribution of γ is N(µ_β + y, σ_β² + σ²). The posterior covariance of β and γ is σ_β². Hence, the posterior probability that the experimental treatment is effective relative to placebo is Φ(−(µ_β + y)/√(σ_β² + σ²)).
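The last expression is simple enough to evaluate directly. The sketch below does so for hypothetical inputs; the event counts, observed log hazard ratio, and prior parameters are assumptions chosen only to illustrate the calculation.

import math

# Hypothetical inputs (illustration only, not data from any study in this chapter):
d_C, d_E = 300, 300                  # events on control and experimental arms
y = 0.05                             # observed log hazard ratio, E relative to C
mu_beta, sd_beta = -0.25, 0.08       # prior for beta (control vs. placebo; negative = effective)

sigma2 = 1.0 / d_C + 1.0 / d_E       # approximate variance of y
# Posterior of gamma is N(mu_beta + y, sd_beta^2 + sigma2) when gamma has a flat prior.
post_mean = mu_beta + y
post_sd = math.sqrt(sd_beta**2 + sigma2)
# Pr(gamma < 0): probability the experimental regimen is effective relative to placebo.
prob_effective = 0.5 * (1.0 + math.erf(-post_mean / (post_sd * math.sqrt(2.0))))
print(round(prob_effective, 3))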

V. PLANNING TO ESTABLISH EQUIVALENCE AND EFFECTIVENESS

A minimal objective of the active controlled trial is to determine whether or not E is effective relative to P. Hence, we might require that if γ = β, then it should be very probable that the trial will result in data y = (y_e, y_c) such that Pr(γ < 0|y) ≥ 0.95, where γ < 0 represents effectiveness of the experimental treatment. Thus, we want

Pr[η_γ/√Σ_33 ≤ −1.645] ≥ ξ    (8)

where η_γ and Σ_33 are the posterior mean and variance of γ, the probability is calculated assuming γ = β and that β is distributed according to its prior distribution, and ξ is some appropriately large value such as 0.90. The posterior mean η_γ is a linear combination of the data and is thus itself normally distributed, with mean and variance denoted by ρ_γ and ζ_γ, respectively. Thus, Eq. (8) can be written

(−1.645√Σ_33 − ρ_γ)/√ζ_γ ≥ z_ξ    (9)

where z_ξ is the 100ξth percentile of the standard normal distribution. When γ = β, ρ_γ is obtained by taking the expectation of η_γ in Eq. (5) over the distribution of (y_c, y_e) implied by the prior,

ρ_γ = Σ_31[2(µ_α + µ_β)/σ² + µ_α/σ_α²] + Σ_32[(µ_α + µ_β)/σ² + µ_β/σ_β²] + Σ_33[(µ_α + µ_β)/σ² + µ_γ/σ_γ²]    (10)

and ζ_γ (Eq. (11)) is the corresponding variance of η_γ under that same distribution.


Hence, one can determine the value of σ² that satisfies Eq. (9); σ² represents the variance of the means y_e and y_c and hence is inversely proportional to the sample size per treatment arm in the active controlled trial. In the special case where noninformative prior distributions are adopted for α and γ, that is, σ_α², σ_γ² → ∞, the above results simplify: the mean of the predictive distribution is ρ_γ = µ_β with predictive variance ζ_γ = 2σ². Using these results in Eq. (9) and simplifying yields

(−µ_β/σ_β − 1.645√(1 + 2σ²/σ_β²)) / √(2σ²/σ_β²) ≥ z_ξ    (12)

The trial may be sized by finding the value of σ² that satisfies Eq. (12). It is of interest that −µ_β/σ_β is the "z value" for the evaluation of the active control versus placebo. The required sample size for the active control trial is very sensitive to that z value. For example, suppose that −µ_β/σ_β = 3. This represents substantial evidence that the active control is indeed effective relative to placebo. In this case, for ξ = 0.8 one requires that the ratio r = σ²/σ_β² ≤ 0.4 for Eq. (12) to be satisfied. Since σ_β² is known and since σ² represents the variance of the mean response per treatment arm in the active controlled trial, the sample size per arm can be determined. Alternatively, if there is less substantial evidence for the effectiveness of the active control, for example −µ_β/σ_β = 2, then one requires that the ratio r = σ²/σ_β² ≤ 0.05 to satisfy Eq. (12). This represents eight times the sample size required for the case in which −µ_β/σ_β = 3. When the evidence for the effectiveness of the active control is marginal, the active control design is neither feasible nor appropriate. For the binary response approximation described in Section III, we have approximately σ² = 1/(npq), where n is the sample size per treatment group in the active control trial. If there is one previous randomized trial of active control versus placebo on which to base the prior distribution of β, then we have approximately that σ_β² = 2/(n_0 pq), where n_0 denotes the average sample size per treatment group in that trial. Consequently, σ²/σ_β² = n_0/(2n). If −µ_β/σ_β = 3, then n_0/(2n) ≤ 0.4, that is, n ≥ 1.25n_0, and the sample size required for the active control trial is 25% larger than that required for the trial demonstrating the effectiveness of the active control. On the other hand, if −µ_β/σ_β = 2, then n_0/(2n) ≤ 0.05, that is, n ≥ 10n_0. Planning the trial to demonstrate that the new regimen is effective compared with placebo seems a minimal requirement. As indicated above, even establishing that objective may not be feasible unless the data demonstrating the effectiveness of the active control are definitive. One can be more ambitious and plan the trial to ensure with high probability that the results will support the conclusion that the new treatment is at least 100k% as effective as the active control when in fact the new treatment is equivalent to the active control.


That is, we would require that Pr(γ < kβ|y) ≥ 0.95. To achieve this, one obtains instead of Eq. (9) the requirement

((1 − k)(−µ_β/σ_β) − 1.645√((1 − k)² + 2σ²/σ_β²)) / √(2σ²/σ_β²) ≥ z_ξ    (13)

VI. EXAMPLE

As an example of the analysis of therapeutic equivalence trials, we consider two recently reported clinical trials of bolus t-PA (tissue plasminogen activator) for lysis of coronary artery thrombosis. Both trials, GUSTO III and COBALT, compared t-PA administered in two boluses separated by 30 min to standard t-PA administered in an accelerated infusion over 90 min (11,12). Heparin was administered intravenously in all cases. The GUSTO III trial used a recombinant mutant version of t-PA for the bolus group. Infusion t-PA was considered the standard treatment, but bolus administration is more convenient to administer. Thirty-day mortality results for the COBALT and GUSTO III trials are shown in Tables 1 and 2. In COBALT, the 30-day mortality for bolus was higher than that for infusion, but the difference was not statistically significant. The investigators concluded that "double-bolus alteplase was not shown to be equivalent, according to the prespecified criteria, to accelerated infusion with regard to 30-day mortality. There was also a slightly higher rate of intracranial hemorrhage with the double-bolus method. Therefore, accelerated infusion of alteplase over a period of 90 minutes remains the preferred regimen." The results of GUSTO III were similar to those for COBALT. The 30-day mortality for the bolus arm was slightly but not statistically significantly higher than for the infusion arm. In contrast to the COBALT result, the investigators implied that the two regimens were equivalent, although they indicated that the trial was not sized for demonstrating therapeutic equivalence since they expected the bolus regimen to be superior.

Table 1 COBALT

                              n (planned)    n (actual)    30-day mortality (%)
t-PA + IV heparin             4029           3584          7.53
Bolus t-PA + IV heparin       4029           3595          7.98

Table 2 GUSTO III

                                   n         30-day mortality (%)
t-PA + IV heparin                  4,921     7.24
Bolus reteplase + IV heparin       10,138    7.47


Using the logit approximation, the logit of the odds of 30-day mortality for the bolus regimen compared with infusion was 0.0621 with a standard error of 0.088 for COBALT and 0.0341 with a standard error of 0.067 for GUSTO III. A weighted average of these two results gives a log odds ratio of 0.044 with standard deviation of 0.053; for the infusion regimen relative to bolus this is a logit of −0.044. The negative logit reflects an odds ratio of 0.957, slightly favoring the infusion regimen. A 95% two-sided confidence interval for this log odds ratio is (−0.148, 0.060), which corresponds to a confidence interval for the odds ratio of (0.862, 1.062). The lower limit corresponds to a 14% lower 30-day mortality for the standard infusion regimen compared with the bolus regimen. The two arms of GUSTO I using infusion t-PA gave an average 30-day mortality rate of 6.65% based on 20,672 patients (13). The other two arms, involving streptokinase (SK), gave an average 30-day mortality of 7.30% based on 20,173 patients. The odds ratio for infusion t-PA relative to SK is 0.9046 and the logit is −0.10027 with a standard error of 0.039. The Z value for this comparison is 2.57, and an approximate 95% confidence interval for the odds ratio is (0.838, 0.976). Since the point estimate of the odds ratio for 30-day mortality for infusion t-PA versus SK is 0.9046, there is about a 50% chance that the reduction in risk is less than 10%.

The Bayesian analysis described previously was applied to these data in an approximate manner. Flat prior distributions were used for the intercept parameter (α) and for the effect of bolus t-PA relative to SK (γ). The prior distribution for the effect of infusion t-PA relative to SK was obtained from GUSTO I as indicated in the previous paragraph. That is, for β we used a normal prior with mean −0.10 and standard deviation of 0.039. This ignores any possible interstudy variability in the effectiveness of infusion t-PA; we could account for such additional variability by increasing the standard deviation of β. We incorporated the COBALT and GUSTO III data in a two-step manner. First, we summarized the result of COBALT using the empirical logit transform to be represented by y_c = −2.507, y_e = −2.445, and σ = 0.0625. The standard deviation was computed as the average of the standard deviations for the two treatment arms. Using these data, we computed the posterior distributions of the parameters. These results are shown in Table 4. We then summarized the results
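These quantities can be reproduced from the entries of Tables 1 and 2. The following sketch recomputes the empirical logits, their standard errors, and the inverse-variance weighted average; small discrepancies from the figures quoted above reflect rounding of the published mortality percentages.

import math

def logit_and_se(p, n):
    d = p * n                                   # approximate number of deaths
    return math.log(p / (1 - p)), math.sqrt(1 / d + 1 / (n - d))

def log_odds_ratio(p1, n1, p0, n0):
    l1, se1 = logit_and_se(p1, n1)
    l0, se0 = logit_and_se(p0, n0)
    return l1 - l0, math.sqrt(se1**2 + se0**2)

cobalt = log_odds_ratio(0.0798, 3595, 0.0753, 3584)     # bolus vs. infusion, COBALT
gusto3 = log_odds_ratio(0.0747, 10138, 0.0724, 4921)    # bolus vs. infusion, GUSTO III
# Inverse-variance weighted average of the two log odds ratios:
w1, w2 = 1 / cobalt[1]**2, 1 / gusto3[1]**2
pooled = (w1 * cobalt[0] + w2 * gusto3[0]) / (w1 + w2)
print(cobalt, gusto3, pooled, math.sqrt(1 / (w1 + w2)))
# roughly 0.062 (SE 0.088), 0.034 (SE 0.067), pooled 0.044 (SE 0.053)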

Table 3 GUSTO I

                           Sample size    30-day mortality (%)
t-PA + IV heparin          10,344         6.3
SK + IV heparin            10,377         7.4
SK + SC heparin            9,796          7.2
t-PA + SK + IV heparin     10,328         7.0


of GUSTO III in a similar manner as y_c = −2.5512, y_e = −2.517, and σ = 0.0472. In this study the sample sizes for the two arms are quite different, and it would be more accurate to generalize the results of the Bayesian approach for this; we have, however, approximated using an average standard deviation. For this second step of analysis we used the posterior distribution obtained from the COBALT data as a prior distribution for incorporating the GUSTO III data. It should be noted that there are substantial correlations among the parameters in the posterior distribution obtained from the COBALT data, and hence the generalized formula (7) was used. The last column of Table 4 shows the approximate posterior distributions obtained after incorporating both the COBALT and GUSTO III data. From the mean and standard deviation of the posterior distribution of γ we can compute that the posterior probability that γ < 0, that is, that bolus t-PA is more effective than SK, is 0.80. Hence, these data provide only suggestive, but not definitive, evidence that bolus t-PA is even more effective than SK. We also computed the posterior probability that both β < 0 and γ < 0.5β. This can be interpreted as the probability that infusion t-PA is more effective than SK and that bolus t-PA is at least 50% as effective as infusion t-PA. This probability was 0.54. Hence there appears to be little evidence from these trials that bolus t-PA is at least 50% as effective as infusion t-PA relative to SK.

Table 4 Distribution of Parameters

                 Prior       After COBALT    After COBALT and GUSTO III
α: mean          0           −2.41           −2.44
   SD            10          0.074           0.054
β: mean          −0.10       −0.10           −0.10
   SD            0.039       0.039           0.039
γ: mean          0           −0.038          −0.056
   SD            10          0.0966          0.066
ρ_αβ             0           −0.53           −0.72
ρ_αγ             0           −0.76           −0.82
ρ_βγ             0           0.40            0.59
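The two-step updating that produced Table 4 can be reproduced from the summary statistics given above. In the sketch below, the flat priors for α and γ are approximated by normal priors with standard deviation 10 (an assumption made only for the computation); the results should agree approximately with the last two columns of Table 4.

import numpy as np
from scipy.stats import norm

def update(y_c, y_e, sigma, prior_mean, prior_cov):
    """One Bayesian updating step for (alpha, beta, gamma)."""
    X = np.array([[1.0, 1.0, 0.0],      # y_c estimates alpha + beta
                  [1.0, 0.0, 1.0]])     # y_e estimates alpha + gamma
    y = np.array([y_c, y_e])
    prior_prec = np.linalg.inv(prior_cov)
    post_prec = X.T @ X / sigma**2 + prior_prec
    post_cov = np.linalg.inv(post_prec)
    post_mean = post_cov @ (X.T @ y / sigma**2 + prior_prec @ prior_mean)
    return post_mean, post_cov

# Prior: approximately flat for alpha and gamma, N(-0.10, 0.039^2) for beta (from GUSTO I).
m0 = np.array([0.0, -0.10, 0.0])
C0 = np.diag([10.0**2, 0.039**2, 10.0**2])

m1, C1 = update(-2.507, -2.445, 0.0625, m0, C0)      # incorporate COBALT
m2, C2 = update(-2.5512, -2.517, 0.0472, m1, C1)     # then incorporate GUSTO III

print(m2, np.sqrt(np.diag(C2)))
# Posterior probability that gamma < 0 (bolus t-PA more effective than SK), about 0.80:
print(norm.cdf(-m2[2] / np.sqrt(C2[2, 2])))
# Joint probability that beta < 0 and gamma < 0.5*beta, about 0.54 (by simulation):
draws = np.random.default_rng(0).multivariate_normal(m2, C2, 200_000)
print(np.mean((draws[:, 1] < 0) & (draws[:, 2] < 0.5 * draws[:, 1])))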

One can obtain from Eq. (13) the size of clinical trial needed to establish that a regimen is at least 50% as effective as infusion t-PA relative to SK. We used Eq. (13) with Z = 2.57 from GUSTO I. With k = 0.5 we found that a ratio R = σ²/σ_β² of 0.059 is required to make the left-hand side of Eq. (13) equal z_ξ = 0.84, corresponding to 80% power. This means that the sample size required for the planned equivalence trial would be on the order of 1/0.059, or about 17 times, that of GUSTO I. Even to perform an equivalence trial for establishing indirectly that a regimen is more effective than SK (k = 0), one obtains that a ratio R of 0.235 is required for 80% power. This corresponds to a sample size 4.25 times as large as for GUSTO I. One can conclude that infusion t-PA was not sufficiently better than SK, and the difference was not strongly enough established in GUSTO I, to make therapeutic equivalence trials practical.

VII. CONCLUSION

In this chapter we have attempted to clarify the serious limitations of therapeutic equivalence trials, and we have described a new approach to planning and analysis of such trials. The method is presented in a Bayesian context but has a frequentist interpretation if flat priors are used for the α and γ parameters. This new approach is based on the premise that a therapeutic equivalence trial is not interpretable unless one provides the quantitative evidence that the control treatment is effective. We have also tried to indicate that standard methods for the planning and analysis of such trials are problematic and potentially misleading. Conventional methods for planning therapeutic equivalence trials often miss this point because they take the maximum likelihood estimate of the effectiveness of the control treatment as if it were a value known with certainty. This ignores the fact that the degree of effectiveness of the control treatment is only imprecisely known unless the effect is overwhelmingly significant; if the effect is of borderline significance, then the confidence interval for the size of the effect almost includes zero. An important implication of the new approach is that reliable therapeutic equivalence trials are not practical unless the evidence of the effectiveness of the control treatment is overwhelming. Unless this is the case, the sample size needed for the equivalence trial is many times larger than the sample size needed to establish the effectiveness of the control treatment. Consequently, many planned therapeutic equivalence trials, even large multicenter trials, cannot demonstrate clinically relevant objectives. A corollary to these considerations is that superiority trials, rather than therapeutic equivalence trials with marginally effective control treatments, are strongly preferable whenever possible. The methods described here for the planning of such trials will hopefully help organizations to avoid such misdirected efforts.
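The sample size ratios quoted above can be recovered by solving Eq. (13) numerically for R = σ²/σ_β². The following sketch does so; the bracketing interval for the root finder is arbitrary, and the z value and power percentile are those quoted in the text for GUSTO I.

import math
from scipy.optimize import brentq

def lhs_minus_target(R, z, k, z_xi):
    # Left-hand side of Eq. (13) minus z_xi, as a function of R = sigma^2 / sigma_beta^2.
    return ((1 - k) * z - 1.645 * math.sqrt((1 - k)**2 + 2 * R)) / math.sqrt(2 * R) - z_xi

z, z_xi = 2.57, 0.84                       # GUSTO I z value; z_xi for 80% power
for k in (0.0, 0.5):
    R = brentq(lhs_minus_target, 1e-6, 5.0, args=(z, k, z_xi))
    print(k, round(R, 3))                  # approximately 0.235 and 0.059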

REFERENCES

1. Blackwelder W. Proving the null hypothesis in clinical trials. Control Clin Trials 1982;3:345-353.
2. Makuch R, Simon R. Sample size requirements for evaluating a conservative therapy. Cancer Treatment Rep 1978;62:1037-1040.
3. Frieman JA, Chalmers TC, Smith HJ. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: survey of 71 "negative" trials. N Engl J Med 1978;299:690-694.
4. Simon R. Confidence intervals for reporting results from clinical trials. Ann Intern Med 1986;105:429-435.
5. Simon R. Why confidence intervals are useful tools in clinical therapeutics. J Biopharm Stat 1993;3:243-248.
6. Durrleman S, Simon R. Planning and monitoring of equivalence studies. Biometrics 1990;46:329-336.
7. Fleming T. Evaluation of active control trials in AIDS. J Acquir Immune Defic Syndr 1990;3:S82-S87.
8. Gould AL. Another view of active-controlled trials. Control Clin Trials 1991;12:474-485.
9. Gould AL. Sample sizes for event rate equivalence trials using prior information. Stat Med 1993;12:2001-2023.
10. Simon R. Bayesian design and analysis of active control clinical trials. Biometrics 1999;55:484-487.
11. The GUSTO III Investigators. A comparison of reteplase with alteplase for acute myocardial infarction. N Engl J Med 1997;337:1118-1123.
12. The COBALT Investigators. A comparison of continuous infusion of alteplase with double bolus administration for acute myocardial infarction. N Engl J Med 1997;337:1124-1130.
13. The GUSTO Investigators. An international randomized trial comparing four thrombolytic strategies for acute myocardial infarction. N Engl J Med 1993;329:673-682.


11 Early Stopping of Cancer Clinical Trials

James J. Dignam
National Surgical Adjuvant Breast and Bowel Project and University of Pittsburgh, Pittsburgh, Pennsylvania, and University of Chicago, Chicago, Illinois

John Bryant and H. Samuel Wieand
National Surgical Adjuvant Breast and Bowel Project and University of Pittsburgh, Pittsburgh, Pennsylvania

I. INTRODUCTION

Most cancer clinical trials use formal statistical monitoring rules to serve as guidelines for possible early termination. Such rules provide for the possibility of early stopping in response to positive trends that are sufficiently strong to establish the treatment differences the clinical trial was designed to detect. At the same time, they guard against prematurely terminating a trial on the basis of initial positive results that may not be maintained with additional follow-up. We may also consider stopping a trial before its scheduled end point if current trends in the data indicate that eventual positive findings are unlikely. For example, consider a trial comparing a new treatment to an established control regimen. Early termination for negative results may be called for if the data to date are sufficient to rule out the possibility of improvements in efficacy that are large enough to be clinically relevant. Alternatively, it may have become clear that study accrual, follow-up compliance, drug compliance, or other factors have rendered the study incapable of discovering a difference, whether or not one exists. In this chapter we discuss methods for early stopping of cancer clinical trials. We focus in particular on situations where evidence suggests that differences in efficacy between treatments will not be demonstrated, as this aspect of trial monitoring has received less attention.

In this chapter, we restrict our attention to randomized clinical trials designed to compare two treatments, using survival (or slightly more generally, time to some event) as the primary criterion. However, the methods we discuss may be extended to other trial designs and efficacy criteria. We first describe group sequential approaches to trial monitoring and outline a general framework for designing group sequential monitoring rules. We then discuss the application of asymmetric monitoring boundaries to clinical trials in situations where it is appropriate to plan for the possibility of early termination in the face of negative results. Next, we consider various approaches to assessing futility in ongoing trials, including predictive methods such as stochastic curtailment. We then briefly examine Bayesian methods for trial monitoring and early stopping. National Surgical Adjuvant Breast and Bowel Project (NSABP) Protocol B-14 is presented as a detailed example illustrating the use of stochastic curtailment calculations and Bayesian methods. This approach is compared with a slight modification of a monitoring rule proposed by Wieand et al. We also give a second example to illustrate the use of a common asymmetric monitoring plan adapted for use in Southwest Oncology Group (SWOG) Protocol SWOG-8738. We conclude with a discussion of considerations relevant to the choice of a monitoring plan.

II. GROUP SEQUENTIAL MONITORING RULES

The most common statistical monitoring rules are based on group sequential procedures. For concreteness, consider a clinical trial designed to compare two treatments, A and B, using survival as the primary end point. In most applications, one treatment represents an established regimen for the disease and patient population in question, whereas the second is an experimental regimen to be tested by randomized comparison with this control. Suppose patients are accrued and assigned at random to receive either treatment A or B, and the comparison of treatments is based on the logrank statistic. The relative effectiveness of the two treatments can be summarized by the parameter δ = ln(λ_B(t)/λ_A(t)), the logarithm of the ratio of hazard rates λ_B(t) and λ_A(t); we assume that this ratio is independent of time t. Thus, the hypothesis that the two treatments are equivalent is H_0: δ = 0, whereas values of δ > 0 indicate the superiority of A to B and values of δ < 0 indicate the superiority of B to A. In a group sequential test of H_0, information is allowed to accumulate over time; at specified intervals an interim analysis is performed, and a decision is made whether to continue with the accumulation of additional information or to stop and make some decision based on the information collected to date. The accumulation of information is usually quantified by the total number of deaths. A large number of group sequential procedures have been proposed in this

. Most fall into a common general framework that we now describe: For k 1. randomized treatment assignment. K fk (w. Deﬁne t k m k /m K. . . . 2. . If Z k b k at the kth analysis. 2. we reject H 0 in favor of H A : δ 0. K 1. and a ﬁnal analysis is scheduled to occur after the m Kth death. This result permits the extension of sequential methods based on evolving sums of independent normal variates to the survival setting. . S K R 1. . and let W k appropriate conditions (roughly. if Z k ∈ C k . The stopping region for the Kth analysis is the entire real line. K. the W k behave asymptotically like Brownian motion: Deﬁning ∆t k t k t k 1 and η δ ⋅ √m K /2. .Stopping Clinical Trials 191 setting. we continue to the (k 1)st analysis.η) Ck fk 1( y. . For each k 1. stopping probabilities. In particular. a similar decision rule is applied except ZK b K . the increments W k W k 1 are approximately uncorrelated normal random variables with means η ⋅ ∆t k and variances ∆t k (1–3). etc. an interim analysis is scheduled to take place after m k total deaths have occurred on both treatment arms. . The b k are chosen to maintain a desired experiment-wise type I error rate Pr{Reject H 0 |H 0 } α. . H 0 is rejected in favor of H A : δ 0. k 1. let V k denote its variance. but if Z k ∈ S k. η) dPr{τ k. we accept H 0 rather than continuing to an additional that if b K analysis. loss to follow-up independent of entry time and treatment assignment).η) ⋅ [φ{(w 1 y η ⋅ ∆tk )/√∆tk }/√∆tk ]dy (1) From this result all operating characteristics of the group sequential procedure (size. If testing continues to the Kth and ﬁnal analysis. . and for k 2.η}/dw where τ represents the number of the terminal analysis: Letting φ{⋅} represent φ{w η ⋅ t 1 )/√t1}/√t 1 . K. sequential entry. whereas if Z k b k . Let L k denote the logrank statistic computed at the kth analysis. (4) may be used to compute the density fk (w. . .) may be obtained. and then setting mK 4 ⋅ η2 /δ2 A (2) . so that t k represents the fraction of the total information Z k ⋅ √t k . the standard normal density. and the maximum duration m K of the trial is selected to achieve power 1 β against a speciﬁed alternative H A :δ δ A. the recursive integration scheme of Armitage et al.η) 3. Under available at the kth analysis. k 1. the continuation regions are of the form C k {Z k : b k Z k b k }. . . 2.Wk w. In cases where a two-sided symmetric test of the hypothesis H 0: δ 0 is appropriate. K 1. 2. . f1 (w. we stop after the kth analysis. . and let Z k represent the corresponding standardized statistic Z k L k /√V k. by determining that value of η that yields Pr{Reject H 0 |η} 1 β. . . power. the real line R1 is partitioned into a continuation region C k and a stopping region S k R1 C k .

(10) boundaries retain the desirable property of the O’Brien–Fleming procedure that the nominal level of the Kth analysis is nearly equal to α but avoid the extremely conservative nature of that procedure for small k when K 3. DeMets and Ware (11) proposed the use of asymmetric Pocock boundaries: The lower boundary points a k are ﬁxed at some speciﬁed 0. K. K (3) . the method can be shown to achieve nearly the desired type I error rate. . O’Brien and Fleming (8). K 5) and a large critical value (z 3.0 a k value for the b k is determined by setting the type I error rate to α. (1) can be used. Since most phase III trials compare a new therapy A to an accepted standard B. After restricting the choice of a k and b k. . K 1. Early stopping rules proposed by Haybittle (5). so that b k constant/√t k . . it may often times be appropriate to consider one-sided hypothesis tests of H 0: δ 0 versus H A: δ 0 and to make use of asymmetric continuation regions of the form C k {Z k : a k Zk b k } or equivalently C k {W k : A k Wk B k }. . . The design of such asymmetric monitoring plans presents no signiﬁcant new computational difﬁculties. the W-critical values are constant. . 2. which are linear in information time: Ak (Z′ /η) L (η/2) ⋅ tk . Wang and Tsiatis (9) boundaries have the form bk constant ⋅ tk∆ 1/2. Bk (Z′ /η) U (η/2) ⋅ tk . . . Eq. . k 1. (10) all ﬁt into this general framework. . Eq. For the O’Brien–Fleming procedure.7). (1) is used (generally in an iterative fashion) to ﬁx the both the size of the monitoring procedure and its power against a suitable alternative or set of alternatives. For a moderate number of analyses (say. 2. This procedure is most easily expressed in terms of its W-critical values. . B k b k ⋅ √t k . Fleming et al. . .192 Dignam et al. and Fleming et al. where ∆ is a speciﬁed constant.5 is tabled) and then a constant value independent of k (the range 2. a constant large critical value is used for analyses k 1. whereas crossing the lower boundary results in trial termination in recognition that the new therapy is unlikely to be materially better than the accepted standard or that H 0 is unlikely to be rejected with further follow-up. In the method by Haybittle. where A k a k ⋅ √t k .05 level procedure). . . K. to some desired class of boundaries. The Pocock bounds are obtained by constraining the z-critical values to be identical for each k: b k constant. Wang and Tsiatis (9). A second suggestion was to use a test with boundaries that are motivated by their similarity to those of a sequential probability ratio test (12). k 1. 2.0 was suggested if one wishes to obtain an overall 0. Crossing the upper boundary results in rejection of H 0 in favor of H A . despite no adjustment to the ﬁnal test boundary value. k 1. To obtain the ﬁnal critical value that would yield precisely the desired level overall. . In this context. Equation (2) is used to determine the maximum duration of the trial in terms of observed deaths. Pocock (6.2. and the ﬁnal analysis is performed using a critical value corresponding to the desired overall type I error level.
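The boundary shapes just described can be explored numerically. The following sketch is not from the chapter; it simply uses the Brownian-motion approximation stated above, with four equally spaced looks (an assumption) and Monte Carlo simulation in place of the exact recursive integration of Eq. (1), to find the constant giving an overall two-sided size of 0.05 for Wang-Tsiatis power boundaries (∆ = 0 is O'Brien-Fleming-like, ∆ = 1/2 is Pocock-like).

import numpy as np

def crossing_prob(c, t, delta, n_sim=200_000, seed=1):
    # Two-sided probability that |Z_k| ever exceeds c * t_k**(delta - 0.5) under H0,
    # using independent Brownian increments W_k - W_{k-1} ~ N(0, t_k - t_{k-1}).
    rng = np.random.default_rng(seed)
    increments = rng.normal(0.0, np.sqrt(np.diff(np.concatenate(([0.0], t)))),
                            size=(n_sim, len(t)))
    W = np.cumsum(increments, axis=1)
    Z = W / np.sqrt(t)
    bounds = c * t ** (delta - 0.5)
    return np.mean((np.abs(Z) >= bounds).any(axis=1))

t = np.array([0.25, 0.5, 0.75, 1.0])              # four equally spaced information times
for delta, label in [(0.0, "O'Brien-Fleming-type"), (0.5, "Pocock-type")]:
    lo, hi = 1.5, 4.0                             # crude bisection for the boundary constant
    for _ in range(30):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if crossing_prob(mid, t, delta) > 0.05 else (lo, mid)
    print(label, round(0.5 * (lo + hi), 2))

With K = 4 looks the two constants come out near 2.0 and 2.4 on the Z scale, illustrating how much larger the early critical values are for the O'Brien-Fleming-type boundary (2.0/√0.25 = 4.0 at the first look) than for the Pocock-type boundary.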

. . If instead one wishes to achieve a power of 1 β ≠ 1 α against the alternative δ δ A . In a subsequent publication (13). η and Q are determined iteratively to satisfy type I and type II error constraints using Eq. The adjustment A A factor of 0.20). and Z u and η are chosen to satisfy type I and type II ′ L error requirements by iterative use of Eq. Then δ′ should be used in place of δ A in Eq. (1). Bk Q (η/4) ⋅ tk . k 1. . similar to Lan and DeMets (18). k k 1. (2). . W-critical values are of the form Ak Q ⋅ t∆ k η ⋅ tk .2. . . The W-critical values are Ak Q (3η/4) ⋅ tk . DeMets and Ware recommended that the Wald-like lower boundary should be retained but the upper boundary should be replaced by an O’Brien–Fleming boundary B k B. Pr{Reject H 0 | δ δ A} 1 β may be used to determine an alternative δ′ such that Pr{Reject H 0|δ A δ′ } 1 α. 2. Suppose ﬁrst that it is desired to achieve type I error rate α and power 1-α against the alternative δ δ A . (2). K where η satisﬁes η2 {2. (2). The maximum number of required deaths is given by Eq. K. K. . (2). .583/√K. t k k/K. These are speciﬁed in terms of spending functions. The maximum number of observed deaths required to achieve this is given by Eq. In general. ∆ 1/2 results in Pocock-like O’Brien–Fleming bound to test H A: δ boundaries. The power boundaries of Wang and Tsiatis (9) may be adapted for use in testing hypotheses of the form H 0 : δ 0 verses H A : δ δ A (19. The maximum number of observed deaths is given by Eq. Slightly more accurate results may be obtained by iteratively determining η and Q to satisfy type I and type II error constraints via Eq. (1). the operating characteristic curve of the ﬁxed sample size test satisfying Pr{Reject H 0 | δ 0} α. . 2. Although iteration is still required to determine η and B. Bk (η Q) ⋅ t ∆. . (1). Jennison (17) considered group A sequential tests that minimize expected number of events under various alternatives and presented parametric families of tests that are nearly optimal in this sense. . as before. The triangular test approximately minimizes the expected number of events at termination under the alternative δ δ′ /2. k 1. K where the constant ∆ is speciﬁed by the trial designer.Stopping Clinical Trials 193 Here Z′ ln((1 α)/β). larger values of ∆ correspond to a greater willingness to . . .16) may be adapted to the group sequential setting in which K analyses will be carried out at equally spaced intervals of information time. Whitehead and Stratton (14) indicate how the sequential triangular test (15.583/√K in the formula for Q is an approximate correction for exact results that hold in the case of a purely sequential procedure. k 1.2. ∆ 0 corresponds essentially to a design using an upper O’Brien–Fleming bound to test H 0:δ 0 and a lower δA . . the value of B is reasonably close to the symmetric O’Brien–Fleming bound at level 2α. . .332/√K } ⋅ η 8 ⋅ ln(1/2α) 0 and Q 2 ⋅ ln(1/2α)/ η 0.

if the event rate on the experimental arm exceeds that on the control arm. if this rule is superimposed on symmetric two-sided boundaries by replacing the lower boundary a k with 0 for any information time t k greater than or equal to one half and the result is viewed as an asymmetric group sequential procedure testing a one-sided hypothesis. III.05-level test of equality of hazards and results in a loss of power of 0. In this implementation. The stochastic curtailment approach requires a computation of conditional power functions. terminate at an earlier stage.02 for any alternative hypothesis indicating a treatment beneﬁt. there is almost no change in the operating characteristics. and O’Fallon (21) proposed a method for early termination of trials when there appears to be no beneﬁt after a substantive portion of total events has been observed. Stochastic Curtailment A commonly applied predictive approach to early stopping makes use of the concept of stochastic curtailment (24–27). which is tantamount to adopting asymmetric boundaries after sufﬁcient information has been obtained to guarantee high power against alternatives of interest. CONDITIONAL POWER METHODS A. Emerson and Fleming (19) compared the efﬁciencies of one-sided symmetric designs having power boundaries to the results of Jennison (17) and concluded that the restriction to boundaries of this form results in negligible loss of efﬁciency. deﬁned as . The method is an extension of earlier work by Ellenberg and Eisenberger (22) and Wieand and Therneau (23) and was ﬁrst considered for multistage trials in advanced disease. In its simplest form.194 Dignam et al. At that time. It can be shown that the adoption of this rule has essentially no effect on the size of a nominal 0. These procedures are similar in spirit to the double triangular test (14). then termination of the trial should be considered. Wieand. Similarly. compared with a ﬁxed sample size test of that same alternative at the scheduled deﬁnitive analysis (21). Both Emerson and Fleming (19) and Pampallona and Tsiatis (20) also consider two-sided group sequential procedures that allow for the possibility of early stopping in favor of the null hypothesis. the proposed rule calls for performing an interim analysis when one half of the required events have taken place.13). Pampallona and Tsiatis (20) provide a comparison of the operating characteristics of asymmetric one-sided designs based on power boundaries with the designs proposed by DeMets and Ware (11. the stopping rule calls for early termination if at any scheduled interim analysis at or after the halfway point the experimental treatment is observed to be no more efﬁcacious than the control. where patient outcomes are poor and there is likely to be substantial information regarding treatment efﬁcacy while accrual is still underway. Schroeder.

we may decide that continuation is futile because H 0 ultimately will not be rejected regardless of further observations. This criticism has motivated methods that take an unconditional predictive approach in assessing the consequences of continuing the trial (28–30). In any case.g. we condition on various alternatives in favor of the treatment to assess the potential for a trial to reverse from early interim analyses results unexpectedly favoring the control group. the method always depends on unknown information at the time of the decision. H a ) is sufﬁciently large. Predictive Power and Current Data Methods Stochastic curtailment has been criticized on the basis that it requires conditioning on the current data and at the same time an alternative hypothesis that may be unlikely to have given rise to that data. equivalently. In Section II it was noted that the normalized logrank statistics asymptotically behave like Brownian motion. and η is deﬁned in Section II. speciﬁed through a distribution Pr(Z(1) ∈ R | D) ∫ Pr(Z(1) ∈ R | D. if under a ‘‘realistic’’ alternative hypothesis H a this probability is sufﬁciently small or. This provides an easy way to compute conditional power over a range of alternatives (26. Z α is the critical value against which the ﬁnal test statistic Z(1) is to be compared. This is the case of interest when considering early stopping for negative results. H )Pr(H| D)dH (6) A Bayesian formulation is a natural setting for this approach. B. If this conditional probability is sufﬁciently large under H 0..27): C(t) 1 Φ Zα Z(t)√t √1 η(1 t t) (5) In Eq. If a noninformative prior distribution is used for the distribution of the parameter of interest (e. In an example presented later. if 1-γ Pr(Z(1) ∉ R | D. D represents current data. Φ(⋅) is the standard normal distribution function. Z(t) is the current standard normal variate associated with the logrank test.Stopping Clinical Trials 195 γ Pr(Z(1) ∈ R | D. and H denotes either the null hypothesis H 0 or an alternative hypothesis H a. R is the rejection region of this test. t is the fraction of total events for deﬁnitive analysis that have occurred to date (so-called information time. These socalled predictive power procedures use weighted averages of conditional power over values of the alternative. H . one may decide to stop and immediately reject H 0. since H a must be speciﬁed. this was deﬁned for prespeciﬁed increments as t k m k /m K in Sect. On the other hand. II). H ) (4) where Z(1) represents a test statistic to be computed at the end of the trial. (5).

IV. Nevertheless. difference or ratio of proportions. Alternatively. We assume a normal prior distribution with speciﬁed mean δ p and variance σ 2. in situations where early termination is considered. say.30). If the skeptic is reasonably open-minded. (5) to project power resulting from further follow-up according to the pattern of observations thus far (26. where m is the total number of events currently obmean δ served. A BAYESIAN APPROACH TO ASSESS EARLY TERMINATION FOR NO BENEFIT Recently. In this spirit.29. the current (observed) alternative could be used in the conditional power formulation in Eq. expressed as a difference of means.36). he or she would . then the posterior distribution in this weighted average of conditional power depends only on the current data. It is suggested that a skeptical member of the clinical community may adopt a prior for δ that is centered at 0. if the goal of any clinical trial is ultimately to inﬂuence clinical practice. When an informative prior distribution is used. an analysis of the robustness of conclusions over a range of priors thought to resemble the a priori beliefs of reasonable members of the clinical research community should provide insight into the impact that trial results might be expected to exert on clinical practice. and these parameters may be altered to reﬂect varying degrees of enthusiasm. This is often an overlooked aspect of trial monitoring. as early stopping can result in diminished impact of the ﬁndings and continued controversy and delay while results are debated and large expensive trials are replicated. δ δ A . the notion of ‘‘skeptical’’ and ‘‘optimistic’’ prior distributions is discussed by numerous authors (31–33. or hazard ratio). its results must be sufﬁciently strong to prove compelling to a community of clinical researchers whose prior opinions and experiences are diverse. reﬂecting the unfortunate fact that relatively few regimens tested lead to material improvements in outcome.196 Dignam et al. that they must believe is both clinically meaningful and relatively probable. described in the following section. the trial designers will have speciﬁed a planning alternative for δ. The values of δ p and σ 2 may be determined to reﬂect an individual’s p p prior level of enthusiasm regarding the efﬁcacy of a proposed regimen. Although Bayesian analyses entail the difﬁcult and sometimes controversial task of specifying prior information. Thus. interest has grown in the application of Bayesian statistical methodology to problems in clinical trial monitoring and early stopping (31–35). then we obtain a fully Bayesian approach. δ has an approximately normal likelihood with ˆ and variance 4/m. Bayesian calculations for clinical trial monitoring can be motivated by adopting the log hazard ratio δ deﬁned in Section II as a summary measure of relative treatment efﬁcacy (32). We denote the partial maximum likelihood estiˆ mate of the log hazard ratio as δ.

let m be the number of events observed thus far ˆ and let n be the number of additional events to be observed. that δ 0. p 0. . m n.02) (37). EXAMPLES A. Following well-known results from Bayesian inference using the normal distribution. . For a given prior distribution and the observed likelihood. with 69% of patients receiving tamoxifen remaining event free compared with 57% of placebo patients and has also showed a signiﬁcant survival advantage (at 10 years. A Trial Stopped Early for No Beneﬁt In 1982. say δ 0. contralateral breast cancer or other new primary cancer. a double-blind comparison of 5 years of tamoxifen (10 mg b. . It therefore may be reasonable to model an optimist’s prior by setting δ p δ A and σ p δ A /1. . Subsequent follow-up through 10 years has conﬁrmed this beneﬁt. or death from any cause.d. one might be inclined to consider the trial organizers as being among the most optimistic of its proponents. 80% tamoxifen vs. or δ δ ALT corresponding to some clinically relevant effect size δ ALT.Stopping Clinical Trials 197 be willing to admit some probability that this effect could be achieved. . perhaps on the order of 5%.645 is speciﬁed. V. The ﬁrst report of ﬁndings in 1989 indicated improved disease-free survival (DFS. NSABP initiated Protocol B-14. From the posterior distribution one can also formulate a predictive distribution to assess the consequences of continuing the trial for some ﬁxed additional number of failures. a skeptical prior with mean δ p 0 and standard deviation σ p δ A /1. 83% vs. and the current weight of evidence for beneﬁt can thus be assessed directly by observing the probability that the effect is in some 0. Using similar logic. Denote by δ n the log relative risk that maximizes that portion of the partial likelihood corresponding to ˆ failures m 1. i. but even they would be reasonably compelled to admit as much as a 5% chance that the proposed regimen will have no effect. m 2.e.’’ since p the information in the prior distribution is equivalent to that in an hypothetical trial yielding a log hazard ratio estimate of δ p based on this number of events.i. the posterior distribution for δ has mean and variance given by ˆ δpost (n0δp mδ)/(n0 m) and σ 2 4/(n0 m) post where n0 4/σ 2.) with placebo in patients having estrogen receptorpositive tumors and no axillary node involvement.. Using these considerations.645. indicating a beneﬁt. speciﬁed range of interest. a posterior density can be computed for δ. 77% event free at 4 years. As before. p 0.00001). The predictive distribution of δ n is normal 2 with the same mean as the posterior distribution and variance σ pred 4/(n 0 m) 4/n. This quantity is thought of as the prior ‘‘sample size. 76% placebo. deﬁned as time to either breast cancer recurrence.
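The posterior and predictive quantities just defined are straightforward to compute. The sketch below implements them; the observed log hazard ratio and the numbers of events are placeholder values for illustration, and the skeptical prior uses the δ_A/1.645 standard deviation described above.

import math

def posterior_delta(delta_hat, m, prior_mean, prior_sd):
    # Posterior mean and variance of the log hazard ratio delta,
    # using the normal approximation with likelihood variance 4/m.
    n0 = 4.0 / prior_sd**2                      # prior "sample size" in events
    post_mean = (n0 * prior_mean + m * delta_hat) / (n0 + m)
    post_var = 4.0 / (n0 + m)
    return post_mean, post_var

def predictive_var(post_var, n):
    # Variance of the estimate based on the next n events.
    return post_var + 4.0 / n

# Example: a skeptical prior centered at 0 with only a small prior probability of an
# effect as large as delta_A = ln(1.5), so prior SD = delta_A / 1.645 (an assumption).
delta_A = math.log(1.5)
post_mean, post_var = posterior_delta(delta_hat=0.30, m=100,
                                      prior_mean=0.0, prior_sd=delta_A / 1.645)
print(post_mean, math.sqrt(post_var), math.sqrt(predictive_var(post_var, n=100)))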

75 years).10 level. 1995). Results suggested that even under extremely optimistic assumptions concerning the true state of nature. In April 1987 a second randomization was initiated. nominal 2p 0. 1993. There had been six deaths on the placebo arm and nine among tamoxifen patients.0035) was not crossed (2p 0. (10) at the two-sided 0.’’ repeating conditional power calculations as if we had intended to observe a total of 229 events before ﬁnal analysis (this number of events would allow for the detection of a 30% reduction in event rate with a power of 85%). However. there were 32 events on the placebo arm and 56 on the treatment arm (relative risk 0. At the third interim analysis (data received as of June 30. By the second interim analysis (data received as of September 30. there were 24 events on the placebo arm and 43 on the tamoxifen arm (relative risk 0. then the null hypothesis could possibly be rejected (Fig. Conﬁdential interim end point analyses were to be compiled by the study statistician and presented to the independent Data Monitoring Committee of the NSABP. given the current data and a range of alternative hypotheses [Eq. 1172 patients were rerandomized. Four interim analyses were scheduled at approximately equal increments of information time.015). the logrank statistic would not approach . 1994). the null hypothesis could almost certainly not be rejected: Even under the assumption of a 67% reduction in failures. At that time. Stopping boundaries were obtained using the method of Fleming et al. At the ﬁrst interim analysis. we computed the conditional probability of rejecting the null hypothesis at the scheduled ﬁnal analysis (115 events). Four-year DFS was 92% for patients on placebo and 86% for patients on tamoxifen.05 level one-sided test with a power of at least 0. calculations showed that even if the remaining 27 events (of 115) all occurred on the placebo arm. more events had occurred in the tamoxifen group (28 events) than among those receiving placebo (18 events). based on all data received as of September 30. the conditional probability of eventual rejection was less than 5%. (5)]. Patients who had received tamoxifen and were event free through 5 years were rerandomized to either continue tamoxifen for an additional 5 years or to receive placebo. 3. To provide for a 0. Between April 1987 and December 1993. a total of 115 events would be required before deﬁnitive analysis. with 10 deaths on the placebo arm and 19 among patients receiving tamoxifen.85 under the assumed alternative of a 40% reduction in DFS failure rate.59).57. We also considered an ‘‘extended trial. Results indicated that if the trial was continued and the underlying relative risk was actually strongly in favor of tamoxifen. The boundary for early termination (2α 0. 1). Although there was concern regarding the possibility of a less favorable outcome for patients continuing tamoxifen.0030) and follow-up for most patients was relatively short (mean. we recommended that the trial be continued to the next scheduled interim analysis because the early stopping criterion was not achieved (2α 0.03).198 Dignam et al.

At this time. suppose the lower boundaries at the third. Reprinted from (41) with permission from Elsevier Science.8596. (21) discussed earlier. fourth. we also considered the early stopping rule proposed by Wieand et al. (5) (27). The solid line is based on Eq. The imbalance in deaths also persisted (13 placebo arm. the dashed line is based on a binomial calculation following from a Poisson occurrence assumption.11). Probabilities are conditional on the results of the second and third interim analysis and are graphed as a function of the assumed placebo/tamoxifen relative risk. To illustrate the consequences of superimposing this rule on the established monitoring boundaries of this trial.0501 to 0. Had such a conservative ‘‘futility’’ rule such as that described . 2p 0. 23 tamoxifen.0496 and the power under the alternative of a 40% reduction in event rate is reduced from 0.Stopping Clinical Trials 199 Figure 1 Conditional probability of ﬁnding a signiﬁcant beneﬁt for tamoxifen in the NSABP B-14 trial if deﬁnitive analysis was deferred to the 229th event. By this interim analysis. For the extended trial allowing follow-up to 229 events.8613 to 0. and ﬁnal analysis were replaced with zeros. Figure 1 shows that conditional power was now about 15% under the planning alternative of 40% reduction in relative risk and was only 50% under the more unlikely assumption of a twofold beneﬁt for continuing tamoxifen. signiﬁcance. Then the (upper) level of signiﬁcance is reduced from 0. considerably more events had occurred on the treatment arm than on the control arm.

Also shown is an ‘‘optimistic’’ prior distribution centered at δ p Figure 2 Prior distribution.586) 0. We subsequently also applied Bayesian methods for comparative purposes and to attempt to address the broader question of consensus in clinical trials. .534 and standard deviation 0. it would have suggested termination by this time.200 Dignam et al. As discussed. above been incorporated into the monitoring plan. the approaches taken in considering the early termination of the B-14 study were frequentist. Figure 2 shows the log hazard ratio likelihood for the B-14 data at the third interim analysis. and posterior distribution of the logged placebo/ tamoxifen hazard ratio after the third interim analysis of NSABP B-14. as the closure of the B-14 study had prompted some criticism from the cancer research community (38–40). likelihood. Reprinted from (41) with permission from Elsevier Science. under which the most probable treatment effect is a 40% reduction in risk.511. with only a 5% prior probability that the treatment provides no beneﬁt.213. having mean ln(0. δA 0. An ‘‘optimistic’’ normal prior distribution is assumed. The resulting posterior distribution contains about 13% probability mass to the right of ln(hazard ratio) δ 0.

85 among patients who had received continued tamoxifen and 50 among patients rerandomized to placebo (relative risk 0.686 0.199 and σ pred 0. By 1 year subsequent to publication (data through December 31. . The resulting posterior distribution has mean 0. 1996). we obtain µ pred 0. The prior standard deviation is σ p 1. then the ultimate result will be signiﬁcant at the 0. Subsequent follow-up of Protocol B-14 continues to support the ﬁndings that prompted early closure of this study. For the prior distribution speciﬁed above and the observations at the third interim analysis. this individual would now assign essentially no probability ( 3 10 5) to the possibility that the beneﬁt is as large as a 40% reduction in risk. If the trial were to be extended to allow a total of 229 events or 141 events beyond the third interim analysis.01).199/0.176) 0. Although there will sometimes be special circumstances that require analyses not speciﬁed a priori.60. 135 total events had occurred. it is preferable to determine in advance whether the considerations for stopping are asymmetric in nature and.199 and standard deviation 0. The predictive probability of a signiﬁcant treatment comparison following the 229th event is ˇ determined as follows: If δ m n denotes the estimated log relative risk based on ˆ all the data.645 0.Stopping Clinical Trials 201 corresponding to a 40% reduction in failures for patients continuing on tamoxifen δA/ relative to those stopping at 5 years.0001. Effect of Two Easily Applied Rules in the Adjuvant and Advanced Disease Setting The trial presented in the preceding example was not designed with an asymmetric rule for stopping in the face of negative results. The predictive probability of this occurrence is 1 Φ({0. There were 36 deaths among tamoxifen patients and 17 among control patients (nominal 2p 0. this individual would still assign a small but nonnegligible probability to the possibility that continued tamoxifen has some beneﬁt.311. To the degree that this prior distribution represents that of a clinical researcher who was initially very optimistic. It was partly for this reason that the data monitoring committee and investigators considered several methods of analysis before reaching a decision to stop the trial.645 ⋅ √(4/229) 0.13.686. we also computed the predictive distribution. to include an appropriate asymmetric stopping rule in the initial protocol design. if so. Since approximately δ (88δ 141δ ˆ this event requires that δ n 0. 1. On the other hand. A more extensive discussion of this case study has been published elsewhere (41).199}/0.243) 0.05 level is δ m n ˆm n ˆ ˆ n )/299. where Φ(⋅) is the standard normal distribution function. B. From this distribution one can determine that the posterior probability that δ 0 is 1 Φ(0. these calculations suggest that even in the face of the negative trial ﬁndings as of the third interim analysis. nominal 2p 0.217.243.176 (also shown).002).

suppose one had designed a trial to test the hypothesis H 0: δ 0 versus δ 0 to have 90% power versus the alternative H A: δ ln(1.g. SAS Proc PHREG [43]).5) at some small signiﬁcance level (e.25 at the second look.0025). In the framework of Section II.5) with a one-sided α of 0. . α 0. this rule is asymptotically equivalent to stopping if the standardized Z is 0. (10). and PEST3. Reading. and at the third look would be 0. Following Wieand et al. Alternatively. e. one may modify a ‘‘standard’’ symmetric rule (e. To illustrate this.5) versus δ ln(1. and 198 events.899. using a Fleming et al.74 at the second look. there would be a 0.93 at the ﬁrst look. a 0. Cambridge. and a 0. an advanced disease lung cancer trial). (21).025. (10) rule with three interim looks.5) from 0.132..900).25 chance of stopping at the second look. and its use has a negligible effect on the original operating characteristics of the group sequential procedure (α 0.28 at the third look (a fact that one does not need to know to use the rule. From Table 1a of Fleming et al. Computer packages (East. boundaries would be simply to replace the lower Z-critical values with zeroes at each interim analysis at or beyond the halfway point. It follows that if the experimental treatment adds no additional beneﬁt to the standard regimen. or 0. If a symmetric lower boundary were considered inappropriate. an alternative but equally simple way to modify the symmetric Fleming et al.g. if the experimental arm offered no additional beneﬁt to that of the standard regimen..0248. Cytel Software Corp. Using this approach.81 at the ﬁrst look.. 2.g. one might choose to replace it by simply testing the alternate hypothesis H A: δ ln(1.22 chance of stopping at the third look (Table 1). with ﬁnal analysis at 264 events.02. power 0. O’Brien–Fleming boundaries) by retaining the upper boundary for early stopping due to positive results but replacing the lower boundary to achieve a more appropriate rule for stopping due to negative results. the probability of stopping at the ﬁrst look would be very small (0.4975. It is often the case that this will alter the operating characteristics of the original plan so little that no additional iterative computations are required. since the alternative hypothesis can be tested directly using standard statistical software.18 chance of stopping at the ﬁrst look. or 2.67 at the third look. Again no special program is needed to implement this rule.902 to 0. Adding this rule does not signiﬁcantly change the experiment-wise type I error rate (α 0. but the probability of stopping at the second look would be 0. UK) are available to help with the design of studies using any of the rules discussed in Section 2 (see Emerson [42] for a review). Such a design would require interim looks when there had been 66. University of Reading. MA.. The null hypothesis would be rejected at the end of the trial if Z exceeded 2.202 Dignam et al.10 (Table 1).0247) and would only lower the power to detect a treatment effect of δ A ln(1. or 0. MPS Research Unit.005) at each interim look (this suggestion is adapted from the monitoring rules in SWOG Protocol SWOG-8738. one such rule would be to stop and conclude that the treatment was beneﬁcial if the standardized logrank statistic Z exceeded 2.
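For orientation, the number of events implied by a single-look (fixed-sample) version of this design can be computed from Eq. (2) with η = z_α + z_β; the group sequential design described here requires slightly more (264) events. A minimal sketch using the α and power values quoted above:

import math
from scipy.stats import norm

alpha, power = 0.025, 0.90
delta_A = math.log(1.5)
# Fixed-sample number of events: m = 4 * (z_alpha + z_beta)^2 / delta_A^2
m = 4 * (norm.ppf(1 - alpha) + norm.ppf(power)) ** 2 / delta_A ** 2
print(round(m))   # about 256; the three-look sequential design quoted above uses 264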

or third look.10 203 Probability of Stopping Under Alternative δ 1. If one designed the study to accrue 326 patients. SWOG. the potential beneﬁt of stopping in the face of negative results would be to prevent a substantial number of patients from receiving the apparently ineffective experimental regimen. Wieand. Southwest Oncology Group. including the likelihood that patients will still be receiving the treatment at the time of the early looks and whether it is likely that the experimental treatment would be used outside the setting of the clinical trial before its results are presented. second. respectively). we consider two scenarios.004 0.22 WSO 0.010 0.5).003 WSO 0. Under the assumption of constant hazards.Stopping Clinical Trials Table 1 Probability of Stopping at Each of Three Early Looks Using Two Easily Applied Rules Probability of Stopping If Treatments are Equivalent No.5 SWOG 0.5 months if δ δA ln(1.4975 0. which would take 26 months. and 7 months according to whether the trial stopped at the ﬁrst. To illustrate this.000 0. Suppose one would expect the accrual rate to such a study to be 150 patients per year.001 WSO. Early stopping after 66 deaths have occurred would prevent 180 patients from being entered to the trial and stopping after 132 deaths would prevent 99 patients from being entered. in addition to allowing early reporting of the results (the savings in time for reporting the results would be approximately 19. of events 66 132 198 SWOG 0.0025 0. Scenario 2: The treatment is being tested in an adjuvant trial where the expected hazard rate is 0. 12. 227 patients to be entered when 132 deaths have occurred. one would need to follow them for slightly less than 4. corresponding to a 5-year survival rate of slightly more than 87%. this is equivalent to the hypothesis δ A ln(1. Thus. If the experimental treatment offers no additional beneﬁt. one would expect 146 patients to have been entered when 66 deaths have occurred.0277 deaths/person-year. and 299 patients to have been entered when 198 deaths have occurred (Table 2).005 0. The decision of which rule to use will depend on several factors.5).18 0. O’Fallon (21). Schroeder.25 0.5 additional months to observe 264 deaths if the experimental regimen offers no additional beneﬁt to the standard regimen or an additional 7. If one is now looking for an alternative δ A . Scenario 1: The treatment is being tested in an advanced disease trial where the median survival with conventional therapy has been 6 months and the alternative of interest is to see if the experimental treatment results in at least a 9-month median survival.

VI. The second and third looks would occur approximately 3 and 15 months after the termination of accrual. so early stopping after these analyses would have no effect on the number of patients entering the trial. which would take approximately 39 months. or third look. The savings in time for reporting the results would be approximately 36. If there is little likelihood that the therapy will be used in future patients unless it can be shown to be efﬁcacious in the current trial. there may be little advantage to reporting early negative results. 1975 of the initiation if δ δA expected 2600 patients will have been entered by the time 66 events have occurred if the experimental regimen offers no additional beneﬁt to the standard regimen (Table 2). respectively. although it could permit early reporting of the results. of Patients Accrued Advanced Disease Trial 146 227 299 326 Adjuvant Disease Trial 1975 2600 2600 2600 No.204 Table 2 Effect of Early Stopping on Accrual and Reporting Time No. second.5}). a reasonable plan would be to accrue 2600 patients. and 12 months according to whether the trial stopped at the ﬁrst. SUMMARY Statistical monitoring procedures are used in cancer clinical trials to ensure the early availability of efﬁcacious treatments while at the same time preventing spurious early termination of trials for apparent beneﬁt that may later diminish. With this accrual and event rate. if the experimental regimen offers no additional beneﬁt to the standard regimen (75 months after ln{1. of Patients to be Accrued Advanced Disease Trial 180 99 27 0 Adjuvant Disease Trial 625 0 0 0 Dignam et al. and one might choose not to consider early stopping for negative results at any of these looks.5) (which would roughly correspond to increasing the 5-year survival rate to 91%) and if the accrual rate was approximately 800 patients per year. 24. of Events 66 132 198 264 ln(1. . which should occur approximately 66 months after initiation of the trial. and to analyze the data when 264 deaths have occurred. Time until Final Analysis (mo) Advanced Disease Trial 19 12 7 0 Adjuvant Disease Trial 36 24 12 0 No.

have had a careful mathematical development.. so that the expected number of events required to trigger reporting of results is reduced under such alternatives. By weighing these considerations against each other in the speciﬁc study situation at hand. For example. predictive power. The power boundaries of Wang and Tsiatis provide a convenient way to explore trade-offs between maximum event size and expected number of events to ﬁnal analysis by considering a variety of values of the tuning parameter ∆. These methods are quite practical. such evidence of a negative effect for an experimental therapy at the time of an interim analysis may be analyzed with the help of conditional power calculations. for the O’Brien–Fleming procedure. But the probability of early stopping under alternatives of signiﬁcant treatment effect is relatively high. it has often been considered to be most important to minimize the maximum number of required events.g.Stopping Clinical Trials 205 Properly designed. Other considerations (e. or fully Bayesian methods. In such circumstances. In the absence of a prespeciﬁed stopping rule that is sensitive to the possibility of early stopping for no beneﬁt or negative results. conserving resources and affording patients the opportunity to pursue other treatment options and avoid regimens that may have known and unknown risks while offering little beneﬁt. m K is only very slightly greater than the number of events that would be required if no interim analyses was to be performed. We discussed these approaches in Sections III and IV and applied several of them to data from the NSABP Protocol B-14. leading to the rather widespread use of the O’Brien–Fleming method and similar methods such as those of Haybittle and Fleming et al. The price paid for this is some loss of efﬁciency (in terms of expected number of events required for ﬁnal analysis) under alternatives of signiﬁcant treatment effect. and have well-studied operating characteristics. In contrast. the possibility of attenuation of treatment effect over time) also argue for the accumulation of a signiﬁcant number of events before deﬁnitive analysis and therefore favor policies that are rather conservative in terms of early stopping. a satisfactory monitoring procedure can be chosen. a study that was closed in the face of . and care should be taken to select a monitoring policy that is consistent with the goals and structure of a particular clinical trial. deaths or treatment failures) have occurred. and more emphasis has been placed on the use of interim analysis policies to prevent the premature disclosure of early results than on their use to improve efﬁciency by allowing the possibility of early reporting. these procedures can also provide support for stopping a trial early when results do not appear promising.g. The group sequential monitoring rules described in this chapter differ with respect to their operating characteristics.. the need to perform secondary subset analyses. In phase III cancer trials (particularly in the adjuvant setting). among symmetric rules. it is often the case that both accrual and treatment of patients are completed before a signiﬁcant number of clinical events (e. the Pocock procedure is associated with a relatively large maximum number of events (m K ) required for ﬁnal analysis.
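The trade-off between the Pocock and O'Brien-Fleming procedures can be illustrated numerically. The sketch below calibrates both boundary shapes by Monte Carlo for five equally spaced looks at a two-sided 5% level and then evaluates power at the drift a fixed-sample test would need for 90% power; the number of looks and the simulation itself are illustrative assumptions, not calculations taken from this chapter, but they show why the O'Brien-Fleming boundary requires only a slight increase in the maximum number of events while the Pocock boundary gives up more power.

```python
import numpy as np

rng = np.random.default_rng(2)
K, alpha = 5, 0.05
t = np.arange(1, K + 1) / K                     # equally spaced information fractions
n_sim = 400_000

# correlated standardized statistics built from independent score increments
dI = np.diff(np.concatenate(([0.0], t)))
incr = rng.normal(0.0, np.sqrt(dI), size=(n_sim, K))
Z0 = np.cumsum(incr, axis=1) / np.sqrt(t)       # null distribution of Z_1..Z_K

def type1(c_k):
    """Overall two-sided type I error for the boundary |Z_k| >= c_k."""
    return np.mean(np.any(np.abs(Z0) >= c_k, axis=1))

def calibrate(shape):
    """Bisection for c so that boundaries c*shape give overall size alpha."""
    lo, hi = 1.5, 4.0
    for _ in range(40):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if type1(mid * shape) > alpha else (lo, mid)
    return (lo + hi) / 2

pocock_shape = np.ones(K)                       # constant critical value
obf_shape = np.sqrt(K / np.arange(1, K + 1))    # O'Brien-Fleming: c*sqrt(K/k)

c_p, c_o = calibrate(pocock_shape), calibrate(obf_shape)
print("Pocock critical values         :", np.round(c_p * pocock_shape, 3))
print("O'Brien-Fleming critical values:", np.round(c_o * obf_shape, 3))

# power at the drift a fixed-sample 5% test needs for 90% power
drift = 1.959964 + 1.281552                     # z_{0.025} + z_{0.90}
Za = Z0 + drift * np.sqrt(t)                    # alternative shifts Z_k by drift*sqrt(t_k)
for name, shape, c in [("Pocock", pocock_shape, c_p), ("O'Brien-Fleming", obf_shape, c_o)]:
    power = np.mean(np.any(Za >= c * shape, axis=1))
    print(f"{name}: power at the fixed-sample drift = {power:.3f}")
```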

our preference has been to use simple approaches such as the Wieand et al. when one deviates from the original design. and we heartily . the standard 5 years). the use of O’Brien–Fleming boundaries leads to the method of Pampallona and Tsiatis (20) with ∆ 0. They were kind enough to provide us with an advance copy as one of us (Bryant) was about to teach a Group Sequential Monitoring course at the University of Pittsburgh. one may or may not require the procedure to be closed (i. the initial power and signiﬁcance level computations of the study are altered. The example in Section V. as accrual may be completed before the rule is applied. we recommend one of the approaches described in Sections III and IV.B showed that a rule applied in an advanced disease setting might prevent the accrual of a fraction of patients to an ineffective experimental regimen. require the upper and lower boundaries to join at the Kth analysis). including the anticipated morbidity of the experimental regimen. rules. whereas the same rule applied to an adjuvant trial is likely only to affect the timing of the presentation of results. Of course. After we completed this chapter. severity of the disease being studied. Our experience is that these approaches are easily explained to (and accepted by) our clinical colleagues. we became aware of a new volume by Jennison and Turnbull (44).206 Dignam et al. negative results for an experimental schedule of extended tamoxifen (10 years vs. If closure is required. it is important to know what effect the plan will have on the power and signiﬁcance level of the overall design. Even when one has had the foresight to include an asymmetric design. with examples of several asymmetric designs in current use. If such an approach is reasonable. It is preferable to include a plan for stopping in the face of negative results at the time the study protocol is developed. Many factors enter into the choice of a design. and the likely effect of early release of results on other studies.. using the O’Brien–Fleming or Fleming et al. one may gain further insight regarding unexpected results by applying some of these methods. We advocate that trial designers give serious thought to the suitability of asymmetric monitoring rules. type.e. Their work contains a comprehensive coverage of many of the methods and issues discussed in our chapter. In the latter case. the methods in Section II allow the statistician to develop a design that seems most appropriate for his or her situation. the expected accrual rate of the study. modiﬁcation of symmetric boundaries or the use of an upper boundary of the O’Brien–Fleming or Fleming et al. When asymmetric monitoring boundaries are required. In particular. coupled with a lower boundary derived by testing the alternative hypothesis H A: δ δ A. The mathematics required to create an appropriate group sequential design that incorporates asymmetric monitoring boundaries is presented in Section II. If at an interim look one is faced with negative results and has not designed the trial to consider this possibility.

In: Crowley J. 77:855–861. Biometrics 1979. Pocock SJ. CA: Institute of Mathematical Statistics Lecture Notes—Monograph Series. Asymmetric group sequential boundaries for monitoring clinical trials. REFERENCES 1. 5:348–361. 38:153–162. Haybittle JL. 287–301. Chichester: Ellis Horwood Ltd. Fleming TR. Pocock SJ. Whitehead J. Repeated signiﬁcance testing for a general class of statistic used in censored survival analysis. 7. Tsiatis AA. Wald A. Vol. DeMets DL. Lan KKG. 2. 74:155–165. 1947. Johnson RA. Repeated assessment of results in clinical trials in cancer treatments. 8. McPherson CK. Ware JH. Whitehead J. Biometrika 1979. 12. 64:191–199. J Royal Stat Soc Series A 1969. Biometrics 1982. 43:193–199. 13. 14. 15. Wang SK. New York: John Wiley and Sons. 10. 66:443–452. with application to group sequential boundaries. 3. Br J Radiol 1971. 35:549–556. Biometrika 1980. J Am Stat Assoc 1982. Fleming TR. Biometrics 1987. 67:651–660. Sequential Analysis. 4. 70:659–663. Survival Analysis. Gail MH. 1982. 9. pp. O’Brien PC. 2. 132:235–244. 16. Interim analyses for randomized clinical trials: the group sequential approach. O’Brien PC. 1983. Approximately optimal one-parameter boundaries for group sequential trials. Control Clin Trials 1984. Biometrics 1983. Tsiatis AA. 68:311–315. The analysis of sequential clinical trials. Ware JH. . Armitage P. Hayward. Rowe BC. The Design and Analysis of Sequential Clinical Trials. Jones D. Slud EV. Group sequential methods in the design and analysis of clinical trials. Harrington DP. 11. Discrete sequential boundaries for clinical trials. 18. The asymptotic joint distribution of the efﬁcient scores test for the proportional hazards model calculated over time. Simulation studies on increments of the two-sample logrank score test for survival time data. Biometrika 1987. Group sequential clinical trials with triangular continuation regions. 69:661–663. DeMets DL. 39:227–236. Stratton I. Efﬁcient group sequential tests with unpredictable group sizes. DeMets DL. eds. Tsiatis AA. Designs for group sequential tests. Repeated signiﬁcance tests on accumulating data. DeMets DL. Biometrika 1983. 6. Whitehead J.Stopping Clinical Trials 207 recommend the book to individuals who wish to expand their knowledge regarding early stopping procedures. A multiple testing procedure for clinical trials. Biometrika 1981. 17. Biometrika 1977. Biometrika 1982. Group sequential methods for clinical trials with a one-sided hypothesis. 5. 44:793–797. Jennison C.

Bayesian approaches to randomized trials [with discussion]. J Natl Cancer Inst 1996. 1996. Bayesian Biostatistics. Stat Sci 1989. 40. The B-value: a tool for monitoring data. 29. Statistical approaches to interim monitoring of medical trials: a review and commentary. Spiegelhalter DJ. Ungerleider RS. J Natl Cancer Inst 1996. 32. Cancer Treatment Rep 1985. Spiegelhalter DJ. Simon R. Johnson NJ. 3:311–323. 1:207–219. 45:905–923. Five years of tamoxifen—or more? J Natl Cancer Inst 1996. 24. O’Fallon JR. Lan KKG. Stat Med 1994. Current Trials Working Party of the Cancer Research Campaign Breast Cancer Trials Group. Tsiatis AA. Symmetric group sequential test designs. Jennison C. Monitoring clinical trials: conditional or predictive power? Control Clin Trials 1986. Assessing whether to perform a conﬁrmatory randomized clinical trial. 23. 88:1529–1542. Comment on ‘‘Investigating therapies of potentially great beneﬁt: ECMO’’ by J. Wittes J. J R Stat Soc A 1994. 88:1510– 1512. Stat Med 1994. Group sequential designs for one-sided and two-sided hypothesis testing with provision for early stopping in favor of the null hypothesis. 42:19–35. 13:1453–1458. J Stat Plan Infer 1994. 33. Ellenberg SS. Greenhouse J. Commun Stat Theory Meth 1991. Preliminary results from the Cancer Research Campaign trial evaluating tamoxifen duration in women aged ﬁfty years or older with breast cancer. 13:1371–1383. Biometrics 1989. 36. Fleming TR. Stopping when the experimental regimen does not appear to help. Control Clin Trials 1982. Tamoxifen: the long and short of it. Lan KKG. 35. Five versus more than ﬁve years of tamoxifen therapy for breast cancer patients with negative lymph nodes and estrogen receptorpositive tumors. 39. 34. Parmar MKB. Control Clin Trials 1987. Davis BR. Fisher B. Ware JH. 25. 14:1379–1391. Kass R. New York: Marcel Dekker. Hardy RJ. 30. Commun Stat Sequent Anal 1982. Ware. Bryant J. An aid to data monitoring in long-term clinical trials. Lan KKG. Dignam J. 26. Simon R. Halperin M. 88:1645–1651. Therneau T. 27. Stochastically curtailed tests in long-term clinical trials. 157:357–416. 45:317–323. 88:1791–1793. Stangl DK. 44: 579–585. Stat Sci 1990. 38. Schroeder G. 88:1834–1839. J Natl Cancer Inst 1996.208 Dignam et al. Pampallona S. 37. Greenhouse J. Parmar MKB. Parmar MKB. 22. Wieand S. Freedman LS. Robust Bayesian methods for monitoring clinical trials. Biometrics 1988. Berry DA. Upper bounds for type I and type II error rates in conditional power calculations. 8:20–28. 4:310–317. Wieand S. 20. Choi SC. 5:299–317. Turnbull BW. A two-stage design for randomized trials with binary outcomes. Swain SM. . DeMets DL. Stat Med 1994. Spiegelhalter DJ. Pepple PA. 28. 31. Blackburn PR. why and how of Bayesian clinical trials monitoring. et al. Wasserman L. 21. Biometrics 1989. Peto R. Freedman LS. J Natl Cancer Inst 1996. Freedman LS. 19. Halperin M. The what. eds. Eisenberger MA. Monitoring clinical trials based on predictive probability of signiﬁcance. An efﬁcient design for phase III studies of combination chemotherapies. 7:8–17. 19:3571–3584. 69:1147–1154. Emerson SS.

Wolmark N. Boca Raton: Chapman & Hall/CRC. Am Stat 1996. Early stopping of a clinical trial when there is evidence of no treatment beneﬁt: Protocol B-14 of the National Surgical Adjuvant Breast and Bowel Project. Emerson SS. Control Clin Trials 1998. 1991.. Jennison C. Version 6. 50: 183–192. NC: Sas Institute Inc. Dignam J. 19:575–588. Fisher B. 2000. Turnbull BW. Bryant J. 42. Wieand HS. SAS/STAT Software: the PHREG Procedure. Cary. Group Sequential Methods with Applications to Clinical Trials. SAS Technical Report P-217. 63 pp. Statistical packages for group sequential methods. 44.Stopping Clinical Trials 209 41. . 43.


12
Use of the Triangular Test in Sequential Clinical Trials

John Whitehead
The University of Reading, Reading, England

I. INTRODUCTION

A clinical trial is described as sequential if its design includes one or more interim analyses that could lead to a resolution of the primary therapeutic question. Thus it can be distinguished from a fixed-sample trial, in which there are no interim analyses and the necessary sample size is calculated in advance. Although a fixed-sample trial may not, in the event, achieve its target sample size, that will occur for practical or logistical reasons rather than as a consequence of the nature of the accumulating data. In a sequential trial, the sample size is unknown in advance and is determined in part by the nature of the emerging data. Trials with purely administrative looks or with interim assessments of safety only are not generally considered to be sequential, as the primary therapeutic question is not repeatedly addressed.

Sequential clinical trials are becoming increasingly common in clinical research because they offer the ethical and economic advantages of avoiding continuation in the face of mounting evidence against a treatment and of requiring relatively small sample sizes when the advantage of a treatment becomes quickly and clearly apparent. Most sequential designs currently being used are derived from either the boundaries approach or the α-spending approach, and this chapter concentrates on the most frequently implemented member of the former class: the triangular test.

The triangular test is an efficient form of sequential procedure that uses as small a sample size as possible while still maintaining the required precision of the testing procedure. It is an asymmetrical procedure in the sense that overwhelming evidence is required to reach an early conclusion that an experimental treatment is effective, whereas the trial will be stopped for lack of effect as soon as it is apparent that continuation is futile. These features are made more precise in subsequent sections. The following section is an introduction to the clinical trial context in which the triangular test can most easily be applied. Section III consists of a detailed account of a trial in renal cancer that used the triangular test. The history and mathematical properties of the method are given in Section IV, and rivals and variations to the approach are described in Section V. Section VI is a survey of recent applications of the triangular test.

II. COMPARATIVE CLINICAL TRIALS

Throughout this chapter it is assumed that patients are being randomized between one experimental treatment E and one control treatment C and that the primary therapeutic question concerns a single patient response. The symbol θ will be used to denote the advantage of E over C in the patient population as a whole, whereas Z will denote the observed advantage apparent from the current data. The amount of information about θ contained in Z will be denoted by V. The quantity θ is an unknown population parameter, whereas Z and V are observable sample statistics.

To clarify the meaning of each of the quantities above, two examples are given. Suppose that in a trial of cancer therapy, the primary patient response is the survival time of the patient from randomization to death. Then θ, the advantage of E over C, might be expressed as minus the log of the ratio of the hazard on E to that on C. The log is taken so that when hazards are equivalent, θ is equal to log(1) = 0, thus expressing zero advantage; the minus sign means that a reduction in hazard on E will show as a positive value of θ. The statistic Z is the logrank statistic, which can be thought of as the observed number of deaths on C minus the number expected under the null hypothesis of no advantage of E over C, so that a positive value of Z indicates an advantage of E. The statistic V will be the null variance of Z, which is approximately equal to one quarter of the number of deaths. The full formulae for Z and V are given in Section 3.4 of Whitehead (1).

For a second example, take a comparison of the types of mattress on which a patient lies during surgery, in which the primary patient response is the incidence of pressure sores. Here the control treatment is the standard mattress.

Figure 1 Maintaining a plot of Z against V at interim analyses.

Denote by pE and pC the probabilities of suffering a pressure sore on the experimental and control (standard) mattresses. Then θ will be taken to be the log-odds ratio:

θ = log{(1 − pE)/pE} − log{(1 − pC)/pC}.

Let SE, SC and FE, FC denote the numbers of successes (no pressure sores) and failures (pressure sores) on E and C, respectively, S = SE + SC, and F = FE + FC. Then

Z = SE − nE S/n  and  V = nE nC S F/n³,

where nE and nC denote the total number of patients on E and C, respectively, and n = nE + nC. The traditional χ² statistic for a 2 × 2 contingency table, usually expressed as Σ(O − E)²/E, can be shown to be equal to Z²/V.

The statistic Z is constructed so that its expected value is θV and its variance is V. In a sequential analysis, Z and V are computed from the available data. At each interim analysis of the trial, the value of Z is plotted against V, resulting in an expected linear path of gradient θ, with random variation about it quantified by the variance V; consequently, a positive value of Z is encouraging. This is illustrated in Figure 1.
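These formulas are simple to evaluate directly. The sketch below, using purely illustrative counts, computes Z and V for a 2 × 2 table, the score-based estimate θ̂ ≈ Z/V with standard error 1/√V, and confirms that the Pearson χ² statistic equals Z²/V.

```python
import numpy as np

def z_and_v(sE, fE, sC, fC):
    """Efficient score Z and information V for the log-odds ratio,
    using the 2x2-table formulas of Section II (successes s, failures f)."""
    nE, nC = sE + fE, sC + fC
    n, S, F = nE + nC, sE + sC, fE + fC
    Z = sE - nE * S / n
    V = nE * nC * S * F / n**3
    return Z, V

# purely illustrative counts: 8/40 pressure sores on E, 16/40 on C
sE, fE = 32, 8      # successes (no pressure sore) and failures on E
sC, fC = 24, 16     # successes and failures on C
Z, V = z_and_v(sE, fE, sC, fC)

theta_hat = Z / V               # approximate maximum likelihood estimate of theta
se_hat = 1 / np.sqrt(V)         # approximate standard error

# Pearson chi-square for the same table equals Z^2 / V
obs = np.array([[sE, fE], [sC, fC]], dtype=float)
exp = np.outer(obs.sum(1), obs.sum(0)) / obs.sum()
chi2 = ((obs - exp) ** 2 / exp).sum()

print(f"Z = {Z:.3f}, V = {V:.3f}")
print(f"theta_hat = {theta_hat:.3f} (SE {se_hat:.3f})")
print(f"Pearson chi-square = {chi2:.3f}, Z^2/V = {Z * Z / V:.3f}")
```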

Figure 2 Boundaries of the triangular test. ——, reject the null hypothesis; – – –, do not reject the null hypothesis.

Sequential designs deriving from the boundaries approach are defined by superimposing boundaries on the Z-V plane. A rising path of Z against V indicates growing evidence of an advantage of E over C, and the upper boundary is placed so that once the path crosses it, the trial can be stopped with a conclusion that E is significantly better than C. The trial is also stopped if the lower boundary is crossed.

In large samples the maximum likelihood estimate θ̂ of θ and its standard error SE(θ̂) are approximately equal to Z/V and 1/√V, respectively. One commonly used variation on the general scheme above is to use θ̂{SE(θ̂)}⁻² in place of Z and {SE(θ̂)}⁻² in place of V.

Sequential designs are constructed according to the same sort of power requirement as governs a conventional sample size calculation. The triangular test is illustrated in Figure 2. Crossing the upper boundary represents significant evidence that E is better than C; it is arranged that when θ = 0, this occurs with probability ½α, and for θ equal to some reference improvement θR > 0, it occurs with probability (1 − β). A region corresponding to significant disadvantage of E over C is indicated as the solid portion of the lower boundary in Figure 2.

the survival patterns of groups E and C were compared using a logrank test. Interim analyses were planned for 12 months after the start of the trial and at 6-month intervals thereafter. stratifying by center and by whether the patient had had a nephrectomy and by single or multiple metastases. an analysis must be performed. then the p value found is invalid. For survival data. At each interim. Treatment with alpha-interferon is both expensive and toxic. if either rejection region 2 is reached the ﬁnal two-sided p value will be less than α. A variety of techniques is now available to overcome these problems and to produce acceptable analyses based on the form of sequential design used. a triangular design was chosen. Mathematically. If a conventional analysis is applied. Standard treatment in this indication is hormonal therapy with medroxyprogesterone acetate. the point estimate of θ is biased. the parameter θ measuring the advantage of E over C was deﬁned in Section I to be minus the log of the ratio of the hazard on E to that on C. The treatment under investigation was biological therapy with alpha-interferon. and conﬁdence intervals for θ are too narrow. A CLINICAL TRIAL OF ALPHA-INTERFERON IN METASTATIC RENAL CARCINOMA The MRC Renal Cancer Collaborators (2) described a multicenter. It was certainly not believed necessary to have a high power of establishing a signiﬁcant disadvantage of E relative to C to dissuade clinicians from using E if the trial was negative. Once the trial has stopped. Following these considerations. and the patients randomized to the control group (C) received this therapy. consequently. which was based on overall target and anticipated survival patterns on E and C. Denote the hazard functions of patients on E and C by hE(t) and hC(t). The primary treatment comparison concerned survival time from randomization to death. It was anticipated that recruitment would proceed at the rate of 125 patients per year. controlled trial in patients with metastatic renal carcinoma. randomized. It was believed to be appropriate to continue the trial only as long as the emerging results were consistent with an outcome in favor of E. Patients were assigned to treatment using the method of minimization.Triangular Test in Sequential Clinical Trials 215 lower boundary corresponding to signiﬁcant evidence that E is worse than C is 1 reached with probability –α when θ 0. and patients receiving this treatment formed the experimental group (E). Full details of both design and analysis methods are given in Whitehead (1). . III. and the survivor functions by SE(t) and SC(t). stratiﬁed by whether the patient had had a nephrectomy before randomization. respectively. The stratiﬁcation was not accounted for in the design.

315V 14.2 was anticipated in the control group.2)} 0.000 0.000 No.900 Worse 0. for all t 0.32.345 SE(2) 0. SC(2) 0.28 0.320 Better 0. that is. A power of (1 β) 0.004 0.200 0.258 0.3.2.025 0.216 Whitehead θ log hE(t) .025 0. Substituting in Eq.2 0.293 0.173 0 0.28 0.708.2 SC(2) 0.2 0. (1) The assumption that this quantity is constant over all t is known as the proportional hazards assumption.105V respectively.173 0. log { log (0.2 0. (2) Based on results from Selli et al. of deaths at termination Median (90th %ile) 82 (124) 109 (176) 163 (284) 246 (381) 201 (341) Duration of trial (mo) Median (90th %ile) 17 (22) 21 (29) 28 (41) 38 (52) 34 (49) . then E will be declared to be signiﬁcantly inferior to C. The triangular design satisfying this power requirement has upper and lower boundaries Z and Z 14. An equivalent expression for θ is θ log { log SE(t)} log { log SC(t)}.000 0. (3).90 was set for achieving signiﬁcance at the 5% level (two-sided alternative) if alphainterferon increased this 2-year survival rate to SE(2) 0.2 0. a 2-year survival rate of 0.345. Table 1 shows the properties of the Table 1 Properties of the Triangular Test Used in the Renal Carcinoma Trial Probability of ﬁnding E signiﬁcantly θ 0.105 0.345 0. corresponding to a hazard ratio of hE(t)/hC(t) 0. If the lower boundary is crossed with V 17. for all t hC(t) 0.362 0.32)} log { log (2) gives a reference improvement of θR (0.103 0.148 0.
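The design quantities in Eqs. (1)-(3) can be reproduced in a few lines. The sketch below starts from the 2-year survival rates 0.20 (control) and 0.32 (hoped for on alpha-interferon), returns θR and the corresponding hazard ratio (about 0.345 and 0.708), and, as a check, the approximate number of deaths an equivalent fixed-sample logrank test would need. The final step uses the standard (z_{α/2} + z_β)²/θ² information formula with V ≈ deaths/4, which is an assumption here rather than the exact calculation behind Table 1.

```python
import numpy as np

def reference_improvement(surv_C, surv_E):
    """theta = -log(hazard ratio) = log{-log S_C(t)} - log{-log S_E(t)}
    under proportional hazards, evaluated at a common time point t."""
    return np.log(-np.log(surv_C)) - np.log(-np.log(surv_E))

S_C2, S_E2 = 0.20, 0.32            # 2-year survival rates used in the design
theta_R = reference_improvement(S_C2, S_E2)
hazard_ratio = np.exp(-theta_R)

print(f"theta_R      = {theta_R:.3f}")        # about 0.345
print(f"hazard ratio = {hazard_ratio:.3f}")   # about 0.708

# rough information needed by an equivalent fixed-sample logrank test
# (V ~ deaths/4), using the usual (z_{alpha/2} + z_beta)^2 / theta^2 formula
z_half_alpha, z_beta = 1.959964, 1.281552      # two-sided 5%, power 90%
V_fixed = (z_half_alpha + z_beta) ** 2 / theta_R ** 2
print(f"fixed-sample deaths ~ {4 * V_fixed:.0f}")   # close to the 353 quoted
```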

98 No nephrectomy Z 2. The recruitment rate averaged 60 per year throughout the trial.21 3. Details of the trial design were published by Fayers et al.63 22. depend only on the values of the (minus) log-hazard ratio θ indicated. If θ 0.33 1. Table 2 Interim Analyses for the Renal Carcinoma Trial Nephrectomy Date 29 Oct 93 22 Sept 94 20 Feb 95 14 Feb 96 10 Feb 97 1 Oct 97 n 69 122 158 222 293 335 d 20 37 67 130 190 236 Z 1.Triangular Test in Sequential Clinical Trials 217 design.2 being correct.10 . the probabilities of ﬁnding E signiﬁcantly better or worse. then the power of obtaining a signiﬁcant result is only 0. and these are summarized in Table 2. and on a steady entry rate of 125 patients per year.06 0.38 16. In Table 1.60 9. Recruitment to the trial began in February 1992.79 9.21 16.99 56. The Q statistic is Cochran’s test statistic for heterogeneity between strata and is given by Q ∑ (Z 2 /V) (∑ Z)2 /(∑ V).62 1.99 3. E is worse than C by a magnitude equal to the target improvement (on the log-hazards scale).54 15.68 Combined Z 1.84 2. This emphasises the asymmetrical nature of the design and represents the scientiﬁc loss due to reducing expected sample size. This formula is familiar from meta-analysis (see ref.44 4.69 10. (4).) The equivalent ﬁxed sample size design would require 353 deaths and would last for 49 months.66 Q 4.345.99 29.00 26. A total of six interim analyses were performed. on an exponential form of survival pattern in both groups. that is. If this were a ﬁxed sample study.07 2.69 4.01 2. the medians and 90th percentiles of duration depend on the pretrial estimate of SC(2) as 0.55 V 1.35 45. it is of course an ethical gain.3.30 0.27 V 2.73 23.68 9. On the other hand.75 6.37 19.33 14. the number of patients recruited (n). Recruitment was to continue until a boundary was reached or until 600 patients had entered the trial. (Although a scientiﬁc loss. less than half the anticipated rate.78 2.83 V 4.42 8.47 4. Q would follow the χ2 distribution on one degree of freedom: Here caution is required in interpretation due to the repeated interim analyses. It can be seen that a substantial reduction in trial duration is likely to be achieved by use of the triangular test.71 31.38 1.03 6. and the medians and 90th percentiles of the number of deaths.17 9. Each row ﬁrst gives the date of the interim analysis.61 7. 5). Then the logrank statistic (Z) and its null variance (V) are given separately for each stratum by nephrectomy and combined by summation. and the number of known deaths (d).
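Combining the strata and checking heterogeneity requires only the per-stratum (Z, V) pairs: the stratified logrank statistic sums Z and V, and Cochran's Q = Σ(Z²/V) − (ΣZ)²/ΣV is referred, cautiously given the repeated looks, to a χ² distribution with degrees of freedom one less than the number of strata. The values below are placeholders for illustration, not numbers read from Table 2.

```python
import numpy as np
from scipy import stats

def combine_strata(z, v):
    """Combine stratum-specific logrank statistics and compute Cochran's Q."""
    z, v = np.asarray(z, float), np.asarray(v, float)
    Z, V = z.sum(), v.sum()
    Q = np.sum(z**2 / v) - Z**2 / V
    return Z, V, Q

# placeholder (Z, V) pairs for the nephrectomy / no-nephrectomy strata
z_strata = [2.1, -1.3]
v_strata = [9.8, 4.6]

Z, V, Q = combine_strata(z_strata, v_strata)
theta_hat = Z / V
p_het = stats.chi2.sf(Q, df=len(z_strata) - 1)

print(f"combined Z = {Z:.2f}, V = {V:.2f}, theta_hat = {theta_hat:.2f}")
print(f"Cochran's Q = {Q:.2f}, heterogeneity p = {p_het:.3f} (interpret cautiously)")
```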

the stopping boundaries must be brought closer together. . The resulting boundaries are known as Christmas tree boundaries because of their shape. indicating an interaction between treatment and nephrectomy group. and return by the time of the next look. This ﬁgure displays a feature of the stopping rule. The outer triangular boundaries are calculated to achieve the required error probabilities when monitoring of the trial is continuous. the Q statistic was nominally signiﬁcant at the 5% level.218 Whitehead The combined values of Z and V are plotted against one another in Figure 3. Thus discrete monitoring makes stopping less likely. At each of the ﬁrst two interim analyses. Because the interim analyses are discrete. it is possible that the hypothetical sample path arising from continuous monitoring might cross the boundaries undetected between interim analyses. Relative to control. It is sufﬁcient for the plotted point to reach these inner boundaries for the trial to be stopped. To compensate and achieve the required error probabilities.583 times the square root of the increment in V. the more the boundaries are brought in. The jagged inner boundaries achieve this: The longer the gap between looks. The magnitude of the correction is 0. the experimental group appeared to be Figure 3 The ﬁnal sequential plot for the MRC trial of alpha-interferon. which has not yet been discussed. They can be used with any design based on straight-line boundaries but are especially accurate for the triangular test.
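In practice the Christmas tree correction is applied by pulling each straight-line boundary in by 0.583√(increment in V) at every look. The checker below is a generic sketch: the boundary intercept and slopes would in practice come from design software such as PEST, and the constants and interim (Z, V) values used here are arbitrary placeholders, not the actual MRC renal trial design.

```python
import numpy as np

def monitor(zs, vs, a, c_upper, c_lower):
    """Check interim (Z, V) points against straight-line boundaries
    Z = a + c_upper*V (upper) and Z = -a + c_lower*V (lower), each pulled in
    by the Christmas tree correction 0.583*sqrt(increment in V) at every look."""
    v_prev = 0.0
    for k, (z, v) in enumerate(zip(zs, vs), start=1):
        pull = 0.583 * np.sqrt(v - v_prev)
        upper = a + c_upper * v - pull
        lower = -a + c_lower * v + pull
        if z >= upper:
            return f"look {k}: stop, upper boundary crossed (Z={z:.2f} >= {upper:.2f})"
        if z <= lower:
            return f"look {k}: stop, lower boundary crossed (Z={z:.2f} <= {lower:.2f})"
        v_prev = v
    return "continue: no boundary crossed yet"

# placeholder design constants (lower slope three times the upper slope, as in
# a triangular design) and made-up interim values, for illustration only
a, c_upper, c_lower = 9.0, 0.10, 0.30
zs = [1.5, 3.9, 6.2, 10.8]
vs = [4.0, 11.0, 22.0, 40.0]

print(monitor(zs, vs, a, c_upper, c_lower))
```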

with a 95% conﬁdence interval of (0.600). This is just two thirds of the 353 calculated as required for a ﬁxed sample design at the beginning of the trial. 0.5 months on E being a modest advantage given the cost and toxicity of alpha-interferon.716 is obtained. and later analyses showed no trace of it at all. and this greater death rate has contributed to the reduction in duration of the trial. These ﬁrst two interim analyses. These authors both developed the sequential probability ratio test (SPRT). Transformed to the hazard ratio hE(t)/hC(t). 236 deaths had occurred.345 The value of θM is very close to the target improvement of θR under which the trial was powered.22 and 0. At the time of the sixth interim analysis. with p 0. a median unbiased estimate of 0. which allowed for the sequential nature of the design. taking informal account of multiple and repeated testing. the upper boundary was reached. the stopping boundaries for Z are a cV (upper) and b cV (lower).017 (twosided alternative).5 months on E and 6. By reacting to both the increased death rate relative to predictions and to the emerging advantage of E.549. Recruitment was closed on 30 November 1997.940). The analysis conducted on 1 October 1997. IV. The impression of heterogeneity thereafter subsided.334. Both are worse than anticipated. the triangular test appreciably reduced the duration of the renal carcinoma trial. However. The validity of proportional hazards and the consistency of the advantage of E over C over various stratiﬁcations of the patients were considered informally and found to be satisfactory. The trial protocol gave the Data Monitoring Committee the power to recommend stopping or continuing the trial. Two-year survival probabilities for E and C were estimated to be 0. The Data Monitoring Committee did recommend stopping. The median unbiased estimate of log-hazard ratio is θM 0. These parallel boundaries are . it was decided to take no action at either of the ﬁrst two interims. At the sixth interim analysis.Triangular Test in Sequential Clinical Trials 219 beneﬁted within the no nephrectomy stratum and disadvantaged within the nephrectomy stratum. and the decision was conﬁrmed by the MRC Renal Cancer Working Party.12. were presented to a Data Monitoring Committee. like all subsequent ones. Median survival times are estimated to be 8. HISTORY OF THE TRIANGULAR TEST Sequential analysis originated in the war-time work of Wald (6) and Barnard (7) concerning the inspection of newly manufactured batches of military hardware. and the apparent heterogeneity caused some concern.062. with 95% conﬁdence interval (0. the extra 2.0 months on C. respectively. Applied directly to the clinical trial context of Section II. 0. found that alpha-interferon is associated with signiﬁcantly better survival than medroxyprogesterone acetate. 0.
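For a statistic behaving like Z ~ N(θV, V), Wald's likelihood-ratio argument gives the SPRT's parallel boundaries just mentioned in closed form (ignoring overshoot): Z = a + cV and Z = −b + cV with c = θR/2, a = log{(1 − β)/α}/θR, and b = log{(1 − α)/β}/θR. These are standard continuous-monitoring approximations rather than constants quoted in this chapter, so the sketch below should be read as an assumption-laden illustration of why such boundaries are open-ended.

```python
import numpy as np

def sprt_boundaries(theta_R, alpha, beta):
    """Approximate Wald SPRT boundaries on the (Z, V) scale for testing
    theta = 0 against theta = theta_R (boundary overshoot ignored)."""
    a = np.log((1 - beta) / alpha) / theta_R    # upper intercept
    b = np.log((1 - alpha) / beta) / theta_R    # lower intercept (enters as -b)
    c = theta_R / 2.0                           # common slope
    return a, b, c

theta_R, alpha, beta = 0.345, 0.05, 0.10        # illustrative design values
a, b, c = sprt_boundaries(theta_R, alpha, beta)
print(f"upper: Z = {a:.2f} + {c:.3f} V")
print(f"lower: Z = {-b:.2f} + {c:.3f} V")

# the boundaries are parallel, hence open-ended: they never meet, so there is
# no fixed upper limit on the amount of information V that may be required
for V in (20, 50, 100):
    print(f"V = {V:3d}: continue while {(-b + c * V):6.2f} < Z < {(a + c * V):6.2f}")
```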

During the early 1950s. they become similar to triangular stopping regions. The recommendation of the triangular test as a means of minimizing expected sample size is justiﬁed by this result. The sequential designs of the mid-1970s suffered from four major limitations: . which occurs at 1 1 θ –θR . however. θR)—which will be equal to one another—reach the minimum value achievable by a sequential test satisfying the speciﬁed power requirement. then the optimality of 2 the SPRT implies that both E(V*. and skew plans that resemble the triangular test (see also Spicer [19]). The SPRT has an optimality property.18). θ) 2 quirement. and so there is no upper limit on the amount of information that is required. Jennison (14) reported numerical work seeking optimal designs when interim analyses are not especially frequent. E(V*.220 Whitehead open-ended. Now we return to the case of very frequent interim analyses and a power 1 set to be 1 –α. Under these circumstances b a 2 1 and c –θR. Following the publications of Wald and Barnard (6. which included the triangular test. and Kilpatrick and Oldham (16) implemented a sequential t-test (a form of SPRT) in a trial of bronchial dilators. the sample path has the maximum 2 2 propensity for wandering between the two parallel boundaries. restricted procedures. imagine that interim analyses are very 1 frequent and that the power is set at 1 –α. Included are double versions of the SPRT (popularly known as the trouser test). most based on the approximate properties of straight-line boundaries. its attractions for clinical research were quickly recognized. 1 (13) found that although α 0. –θR) for the speciﬁed power redesigns that minimize maxθ E(V*. with gradient of the lower boundary three times that of the upper boundary as described in Sections I and II above. If V* denotes the value of V at termination. in which the SPRT minimizes E(V*.7). He described these and found that as α → 0 they become the triangular test. –θR) can be reduced further by using a 2 triangular test that is longer and thinner than that illustrated in Figures 2 and 3. stated and proved by Wald and Wolfowitz (8). Lai (12) sought 1 E(V*. 0) and E(V*. the maximum value of E(V*. Bross (15) devised sequential medical plans. Excellent surveys of the emerging methodology are given in the two editions of the book by Armitage (17. When expressed on the Z-V diagram. Anderson (9) derived properties of a variety of procedures based on straight-line stopping boundaries. The 2-SPRT design of Lorden (10) and the minimum probability ratio test of Hall (11) are both alternative characterizations of the triangular test. many authors sought to modify the SPRT and in particular to overcome the open-ended nature of the boundaries. Although sequential analysis was ﬁrst introduced in the context of quality control. θ). 2 Despite this optimal property. θ) at θ 0 and θ θR. When θ –θR. which resemble the double triangular test in spirit. Huang et al. can be quite large. the reduction is very small. The latter edition describes a wide variety of designs. For the clinical trial context.

The concept was not of boundaries but of setting the null probability of stopping to reject the null hypothesis at or before each interim analysis. Although formulated in terms of error probabilities. Asymmetrical group sequential methods introduced by DeMets and Ware (25) began the development of an α-spending counterpart to the triangular test. ‘‘group sequential’’ designs could be displayed on the Z-V plane. The requirement that sequential methods involve very frequent interim analyses was a greater practical barrier to widespread implementation.21) examined the analogy between the test statistics Z and V and the sample sum and sample size of independent normal observations with mean θ and variance 1. and usually a choice had to be made from a limited repertoire of tabulated designs. two such functions being needed to characterize asymmetrical designs. Software for sequential methods was rudimentary. Continuous monitoring of data emerging from a trial was logistically difﬁcult. The group sequential designs of Pocock (22) and O’Brien and Fleming (23) were based on an approach that was totally different from much of what had been developed earlier. This allows many of the early procedures developed for the latter case to be applied far more generally. whereas adjustments for discrete monitoring (26) allowed the triangular test to be used with group sequential sampling and eventually led to the Christmas tree correction described in Section III above. and in the case of binary observations the artiﬁcial stratagem of matched pairs was usually necessary to eliminate nuisance parameters.Triangular Test in Sequential Clinical Trials 221 1. Two camps of sequential methodology having been established. bearing a general resemblance to restricted procedures. Approximations used in the theory were accurate only if interim analyses occurred after every individual response or matched pair of responses. each conducted after a new group of patients had responded. Allowance for discrete monitoring can be achieved by using the recursive numerical integration routines described by Armitage et al. so that the ﬁnal value was equal to the required α. 3. 4. all sequential designs have both a boundaries and an α-spending function representation. The latter has been developed only for straight-line boundaries. The idea was given ﬂexibility and generality through the α-spending function introduced by Lan and DeMets (24). For continuous monitoring. Only normally distributed and binary patient responses were catered for. where they were seen as symmetrical. 2. No special methods existed for producing a valid analysis once a sequential trial has been completed. It . More reasonable was a series of a few interim analyses. they rapidly began to converge. (27) or by corrections of continuous boundaries such as the Christmas tree method. My own work (20.
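The α-spending idea is easy to demonstrate by simulation: choose a spending function α*(t) and, at each look, set the critical value so that the cumulative null probability of having stopped equals the α spent so far. The sketch below uses the O'Brien-Fleming-type spending function α*(t) = 2{1 − Φ(z_{α/2}/√t)}, a common choice that is assumed here rather than taken from the text, and calibrates the boundaries by Monte Carlo instead of the exact recursive numerical integration of Armitage et al.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, K = 0.05, 4
t = np.array([0.25, 0.50, 0.75, 1.00])        # information fractions at the looks

def obf_spending(t):
    """O'Brien-Fleming-type two-sided alpha-spending function."""
    return 2 * (1 - stats.norm.cdf(stats.norm.ppf(1 - alpha / 2) / np.sqrt(t)))

n_sim = 1_000_000
dI = np.diff(np.concatenate(([0.0], t)))
S = np.cumsum(rng.normal(0, np.sqrt(dI), size=(n_sim, K)), axis=1)
absZ = np.abs(S) / np.sqrt(t)                 # |Z_k| under the null hypothesis

spent = obf_spending(t)
alive = np.ones(n_sim, dtype=bool)            # paths not yet stopped
bounds = []
for k in range(K):
    new_alpha = spent[k] - (spent[k - 1] if k else 0.0)    # alpha to spend now
    zk = absZ[alive, k]
    # choose c so the unconditional chance of a first crossing here ~ new_alpha
    c = np.quantile(zk, 1 - new_alpha * n_sim / alive.sum())
    bounds.append(round(float(c), 3))
    alive &= absZ[:, k] < c
print("cumulative alpha spent       :", np.round(spent, 5))
print("approx. two-sided boundaries :", bounds)
print("(the earliest boundaries rest on very few tail paths, so they are rough)")
```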

It is likely. The power properties noted in the alpha-interferon example of Section III hold quite generally: A triangular test designed to have high power of detecting a treatment advantage will have little power to detect a disadvantage of comparable magnitude. perhaps from previous studies. The renal carcinoma described in Section III above was designed and analyzed using PEST. illustrates two asymmetric designs with similar properties. see Chapter 5 of Whitehead (1).222 Whitehead is extremely accurate when used with the triangular test (28). RIVALS AND VARIATIONS TO THE TRIANGULAR TEST The triangular test is a design that will efﬁciently distinguish between superiority of an experimental treatment E over a control C and lack of superiority. For a summary of methods that lead to valid analyses. whereas Figure 4b shows a member of the class of designs described by Pampallona and Tsiatis (36). a truncated SPRT is indicated. Figure 4. In this case too. truncation should occur at quite a large value of V. It should lead to a small sample size if the prediction of substantial superiority proves true. V. Figure 4a shows a truncated SPRT. then the truncated SPRT might be chosen. The truncated SPRT is effective at reducing sample size if θ is negative or exceeds θR. that the triangular test would be preferred in most trials. however. for which the largest sample sizes occur. A sequential module to the language S plus is also available. see Emerson (31). This can be seen from the fact that for small V. The precise optimality results of the (untruncated) SPRT and the triangular test cited in Section IV motivate the following less formal considerations. The triangular test is effective in reducing sample size whatever the true value of θ might be and especially for moderate values lying between 0 and the reference improvement θR. the boundaries of the truncated SPRT are closer together than those of the triangular test. while providing a valid if more lengthy procedure if such optimism turns out not to be justiﬁed. so that the design resembles its untruncated counterpart. Truncation at the value of V 50% greater than the ﬁxed sample size for equivalent power should sufﬁce. which is outside the scope of this chapter. Software in the form of PEST 4 (29) and EaSt 2000 (30) is now available: For a comparative review of earlier versions of these two packages. . To achieve such optimality. If there is good reason to believe that θ exceeds θR. The triangular test was ready for application in the early 1980s and was ﬁrst used in a comparison of anesthetic techniques (32) and in a clinical trial in lung cancer (33–35). but less precise for restricted procedures or the SPRT. a and b. Sometimes there is reason to fear that θ 0 but nevertheless to wish to proceed with a trial because the potential for beneﬁt remains worthy of investigation. The issue of posttrial analysis has attracted a great deal of attention.

especially if there are to be no interim analyses when V is small. it is worth remarking that any such stopping rule can be mapped onto the Z-V plane as a stopping boundary for comparison with the procedures discussed here. Figure 4. Such a step is recommended so that an informed choice of design can be made. (a) Truncated SPRT.Triangular Test in Sequential Clinical Trials 223 Figure 4 Alternative sequential designs. This one is very similar to a triangular test of equivalent power. These and other approaches are described in the book by Jennison and Turnbull (39). c and d. (c) reverse triangular test. The designs are not derived from considerations of optimality. Other sequential alternatives to the triangular test for fulﬁllment of asymmetric power requirements include stochastic curtailment procedures based on conditional or predictive power. (d) double triangular test. It might be . a reverse triangular test and a double triangular test. (37). The reverse triangular test has high power of detecting inferiority of the experimental treatment but low power of showing advantage. (b) a design of Pampallona and Tsiatis. whereas Spiegelhalter et al. The Pampallona and Tsiatis design shown in Figure 4b is one of a family of asymmetric designs indexed by the curvature of the boundaries. respectively. Although outside the scope of this chapter. The use of conditional power was introduced by Lan et al. shows. (38) discussed predictive power.

Nixon et al. When departing from a conventional objective of demonstrating superior efﬁcacy. Within oncology. Examples include trials of corticosteroids for AIDS-induced pneumonia (41). This requirement also rules out the double triangular test.224 Whitehead used in cases in which E has clear nonefﬁcacy advantages such as cost. The truncated SPRT and the reverse and double triangular tests are implemented in the computer package PEST 4. 40). Storb et al. which will have small sample size only if a major treatment difference is apparent. The most desirable trial outcome is a claim that E is better than C. The double triangular test has also found application. Sometimes symmetrical designs are required. An interesting combination of the triangular test with the play-the-winner rule was applied in a study of spinal anesthesia during cesarean section (49). (50) described a triangular test of immunotherapy as a preparation for bone marrow transplantation in leukemia. respectively. of isradipine for the acute treatment of stroke (43). An evaluation of the drug Viagra in the treatment of erectile dysfunction after spinal injury also used the method (47). but a fall-back in which E is no different from C but has secondary advantages (not apparent in the sequential plot) might be worthwhile. RECENT CLINICAL TRIALS BASED ON THE TRIANGULAR TEST AND FURTHER WORK The triangular test has now been used in a wide variety of clinical studies concerned with many therapeutic areas. (51) . the values speciﬁed for power and for type I error rate must be chosen with caution (see ref. restricted procedures (1. with the maximum sample size being desirable otherwise. E no different from C. and it has been implemented in animal studies of medical techniques (48). the triangular design has been used to study the use of surfactant to alleviate respiratory distress in infants (45) and in a trial concerning gastrointestinal reﬂux (46). besides the renal and lung cancer trials mentioned in Sections III and IV. and of implanted deﬁbrillators in coronary heart disease (44). The double triangular test is also appropriate when demonstration of equivalence is the primary concern.18) or symmetrical designs based on the α-spending function approach described in Section IV could be implemented. In these situations. It is sometimes used when investigators wish to hedge their bets. or safety. whereas the Pampallona and Tsiatis designs (including ‘‘double’’ versions) are available in EaSt 2000. of enoxaparin for prevention of deep vein thrombosis resulting from hip replacement surgery (42). In pediatric medicine. ease of use. The double triangular test will potentially distinguish between three true situations: E better than C. VI. and E worse than C. This may be to allow sufﬁcient power to meet secondary objectives such as subgroup analysis or investigation of secondary end points. so that only proven inferiority will dissuade clinicians and patients from using it.

(Suppl. Stratiﬁcation of risk factors in renal cell carcinoma. 8): 1–26. Revised 2nd Edition. 1997. On the development of the medical research council trial of α-interferon in metastatic renal carcinoma. J R Statist Soc 1946. Paulson DF. in particular. Ann Math Stat 1960. Ritchie A. Cancer 1983. 3. applicable to a single stream of binary observations.rdg. A modiﬁcation of the sequential probability ratio test to reduce sample size. 31:165–197.Triangular Test in Sequential Clinical Trials 225 described such a design in a comparison of pressure sore rates after the use of two types of mattress during cancer surgery. Stat Med 1994. 2. The other is. Machin D. Methods for analyzing data after sequential trials of this type are also being developed. Selli C. 4. not by much. New York: Wiley. work is proceeding on survival responses with nonproportional hazards and longitudinal ordinal responses. and suitable alternatives exist. MRC Renal Cancer Collaborators. Fayers PM. Wald A. Hinshaw W. and its optimality makes it hard to improve. Other trials using triangular and related designs are given on the web page: http://www. A general parametric approach to the meta-analysis of randomized clinical trials. It is not appropriate for every situation. Whitehead A. Yuen P. Donaldson N. 10:1665–1677. 6. Interferon-α and survival in metastatic renal carcinoma: early results of a randomised controlled trial. two main challenges remain. Boden et al. 1947. Barnard GA. Lancet 1999. Oliver RTD. to spread its usage to all trials that can beneﬁt from its ethical and economic advantages. 8. Anderson TW. The Design and Analysis of Sequential Clinical Trials. 5. Whitehead J. Whitehead J. but in the case of the triangular test. Wolfowitz J. Cook PA. Chichester: Wiley. 9. 353:14–17. Whitehead J. 19:326–339. is described by Stallard and Todd (54). 13:2249–2260.ac. REFERENCES 1. Wald A. Ann Math Stat 1948. Optimum character of the sequential probability ratio test. (52) report a trial based on the design in cardiology and Yentis (53) described an application to a trial in elective Caesarian section. Sequential Analysis. quite simply. An exact version of the triangular test. One is to use the principles underlying the triangular design to help in the construction of optimal sequential designs for multivariate responses and for multiple treatment comparisons. There is a range of response types to which the triangular and related designs can be extended. Stat Med 1991. Sequential tests in industrial statistics. The Christmas tree correction could be improved on. In the future.htm The properties of the triangular test are now well understood. .uk/mps/mps_home/software/pest4/ practice. 52:899–903. Woodward BH. 7.

Sequential Medical Trials. Statistical packages for group sequential methods. Some new closed sequential designs for clinical trials. Pringle HM. Sequential Analysis. McPherson CK. EaSt 2000: A software package for the design and interim monitoring of group sequential clinical trials. 12. Lai TL. Armitage P. Group sequential methods for clinical trials with a one-sided hypothesis. 30. Br Med J 1954. Armitage P. Hall WJ. Sequential forms of the log rank and modiﬁed Wilcoxon tests for censored data. Optimal stopping and sequential tests which minimize the maximum expected sample size. Am Stat 1996. Oxford: Blackwell. Efﬁcient group sequential tests with unpredictable group sizes. 25. 27. Anaesthesia for out-patient termination of pregnancy.226 Whitehead 10. Br J Anaesth 1982. 1st ed. 54:865–870. Huang P. 23. 26. 32. 8:188–205. Oxford: Blackwell. 4:281–291. Biometrika 1981. Biometrics 1983. Whitehead J. Asymptotic design of symmetric triangular tests for the drift of Brownian motion. 132:235–244. Whitehead J. Jones DR. Ann Stat 1976. Repeated signiﬁcance tests on accumulating data. Emerson SS. Biometrics 1952. Cambridge MA: Cytel. 2000. Hackett GH. Plantevin M. 31. J Biopharm Stat 1996. 14. 28. 65:351–356. Armitage P. Biometrika 1980. PEST 4: Operating Manual. to appear. A multiple testing procedure for clinical trials. 70:659–663. Rowe BC. Lan KKG. ed. 18. Stratton I. 6:361– 373. Biometrika 1983. 16. The University of Reading: England. 2000. Ann Stat 1973. Sequential minimum probability ratio tests. DeMets DL.] 22. 15. 2nd ed. Facey KM. Stallard N. 19. 18:203–211. Large sequential methods with application to the analysis of 2 2 contingency tables. Harris MNE. 64:191–199. Avery A. 20. 13. Comparison of the spending function method and the Christmas tree correction for group sequential trials. Group sequential clinical trials with triangular continuation regions. Sequential medical plans. Calcium chloride and adrenaline as bronchial dilators compared by sequential analysis. a comparison of two anaesthetic techniques. Ware JH. [see correction. 2:1388–1391. Fleming TR. Pocock SJ. J R Stat Soc A 1969. MPS Research Unit. 29. Discrete sequential boundaries for clinical trials. 1:659–673. Cytel Software Corporation. 1975. 1980. Hall WJ. 2-SPRT’s and the modiﬁed Kiefer-Weiss problem of minimizing an expected sample size. Biometrika 1978. Kilpartick GS. Biometrika 1979. 50: 183–192. 74:155–165. New York: Academic Press. Asymptotic Theory of Statistical Tests and Estimation. O’Brien PC. 17. Oldham PD. 66:105–113. 39:227–236. 11. Garrioch DB. Group sequential methods in the design and analysis of clinical trials. Whitehead J. Biometrika 1977. 35:549–556. 68:576. Bross I. . 1960. Biometrics 1962. 24. DeMets DL. Lorden G. 21. Spicer CC. Biometrics 1979. Sequential Medical Trials. Jennison C. 67:651–660. Biometrika 1987. In: Chakravarti IM. Dragalin V.

51: 731–732. 27:733–740. Application of sequential methods to a phase III clinical trial in stroke. Stat Med 1983. Rout CC. Wheaton M. Hellwege H-H. Freedman LS. Daubert JP. Improved survival with an implanted deﬁbrillator in patients with coronary disease at high risk for ventricular arrhythmia. Clin Pharma Ther 1997. Guillot M. 6:179–191. 45. 1:207–219. Smith MD. 61:377–384. 113:14–20. Kuhls E. Duhamel J-F. Higgins SL. Bellisant E. Derry FA. Ruedy J. Tsiatis AA. Sharma J. Brown MW. Bauer J. The analysis of a sequential clinical trial for the comparison of two lung cancer treatments. Whitehead J. Reddy D. 36. 15:2703– 2715. Fraser M. Sequential designs for equivalence studies. Belzberg A. Moss AJ. Olive G. Bartmann P. 43. Spiegelhalter DJ. Whitehead J. 39. 83:135–141. Lewis RJ. Saksena S. Ellis SH. Porz F. Jones DR. Versmold H. 79:262–269. Cairns CB. Stat Med 1982. Jones DR. Simon R. 48. Stat Med 1996. Wilber D. Lan KKG. 41. Corticosteroids prevent early deterioration in patients with moderately severe pneumocystis carinii pneumonia and the acquired immunodeﬁciency syndrome (AIDS). Levitt N. Monitoring clinical trials: conditional or predictive power? Control Clin Trials 1986.Triangular Test in Sequential Clinical Trials 227 33. London: Chapman and Hall/CRC. Gardner BP. Circulation 1992. Group sequential designs for one-sided and two-sided hypothesis testing with provision for early stopping in favor of the null hypothesis. Turnbull BW. 42. 2:183–190. Glass CA. Pons G. Halperin M. Hall WJ. 49. The design of a sequential clinical trial for comparison of two lung cancer treatments. 85:281–287. Gouws E. Pampallona S. Johnson JR. A reevaluation of the role of crystalloid preload in the prevention of hypotention associated with spinal anesthesia for elective cesarean section. 7:8–17. Neurology 1998. Acta Paediatr 1994. Pohlandt F. Meiler B. Montaner JSG. 34. 42:19–35. Whitehead J. Anesthesiology 1993. Hentschel R. Sequential designs for pharmaceutical clinical trials. Lawson LM. Cannom DS. 37. Rocke DA. Dinsmore WW. Ann Intern Med 1990. The triangular test to assess the efﬁcacy of metoclopramide in gastroesophageal reﬂux. Hieronimi G. Stochastically curtailed tests in long-term clinical trials. 40. Newman CE. 47. Waldo AL. 335:1933–1940. 51:1629–1633. Whitehead J. Heo M. Treatment of prolonged ventricular ﬁbrillation: immediate countershock versus high dose epinephrine and CPR preceding countershock. Drug Inform J 1993. Pariente-Khayat A. Niemann TJ. Group Sequential Methods with Applications to Clinical Trials. Sequent Anal 1982. Gortner L. Whitehead J. Reiter H-L. 38. Maytom MC. Reduced survival with radiotherapy and razoxane compared with radiotherapy alone for inoperable lung cancer in a randomised double-blind trial. Cox R. . Ford CHJ. 46. Seitz RC. J Stat Plan Infer 1994. Klein H. 44. Jennison C. Bernsau U. 35. Pharm Med 1992. Levin J. Levine JH. 1:73–82. 2000. N Engl J Med 1996. Jones DR. Whitehead J. Newman CE. High-dose versus low-dose bovine surfactant treatment in very premature infants. Schechter MT. Jorch G. Efﬁcacy and safety of oral sildenaﬁl (Viagra) in men with erectile dysfunction caused by spinal cord injury. Blackburn PR. B J Cancer 1985.

355:1751–1756. Carlier MF. van Gilst WH. Hansen J. Lum L. Mason S. Brown J. Int J Obstet Anesth 2000. Whitehead A. Doney K. Santoni J-P. Bond S. McGufﬁn R. The effect of prophylactic glycopyrrolate on maternal haemodynamics following spinal anaesthesia for elective caesarian section. Buckner CD. 52. Bensinger W. Stallard N. Sanders J. Farewell V. 54.228 Whitehead 50. 51. A sequential randomised controlled trial comparing a dry visco-elastic polymer pad and standard operating table mattress in the prevention of post-operative pressure sores. Jenkins CS. Whitehead J. Bertrand ME. 9:156–159. Sullivan K. Pedersen OL. Hill R. Fox KM. Starkey IR. 314:729–735. Julian DG. Diltiazem in acute myocardial infarction treated with thrombolytic agents: a randomised placebo-controlled trial. Beatty P. Exact sequential tests for single samples of discrete responses using spending functions. Lie KI. Storb R. Barnes PK. Col JJ. Yee G. Deeg. Methotrexate and cyclosporine compared with cyclosporine alone for prophylaxis of acute graft versus host disease after bone marrow transplantation for leukemia. N Engl J Med 1986. Clift R. McElvenny D. Stewart P. Martin P. 19:3051–3064. J. Scheldewaert RG. Boden WE. 53. 35:193–203. Lucas DN. Thomas ED. Lancet 2000. Nixon J. Todd S. . Appelbaum F. Statistics in Medicine 2000. Yentis SM. Witherspoon R. Int J Nurs Stud 1998.

13
Design and Analysis Considerations for Complementary Outcomes

Bernard F. Cole
Dartmouth Medical School, Lebanon, New Hampshire

I. INTRODUCTION

In cancer clinical trials, data are routinely collected for various patient outcomes, including treatment-related adverse events and clinical response. Adverse event data are used to describe the risks associated with a new treatment, and the clinical response data are used to describe the benefits. Clinical response is often defined in terms of changes in tumor size, time until disease progression, or time until death. By compiling such data, researchers are able to make an objective evaluation of the safety and efficacy of a new therapy. Modern clinical trials often collect data in addition to these usual outcomes. The two most common of these ‘‘complementary outcomes’’ are quality of life (1–3) and factors related to economic cost (4). By including these outcomes in clinical trials, researchers are able to address questions regarding quality of life and monetary cost that may be raised by patients, physicians, payers, and policy makers considering the use of a new regimen. In this chapter, we provide an overview of design and analysis considerations relating to complementary outcomes in cancer clinical trials.

II. QUALITY-OF-LIFE ASSESSMENT

A. History

Early measures of quality of life in cancer focused on physical functioning. The Karnofsky performance status (KPS), introduced in 1948 (5,6), is generally considered to be the first such measure. KPS is measured on an 11-point scale from 0% to 100% (10% increments), where 0% denotes death, 100% denotes normal function, and other values denote "approximate percentage of normal physical performance." The KPS assessment is made by the clinician.

Subsequent efforts in quality-of-life assessment evaluated illnesses and therapeutic regimens from the patients' perspective. For example, in 1971 Izsak and Medalie (7) developed a multidimensional scale that measured physical, social, and psychological variables in cancer patients. In 1975 a trial for patients with acute myelogenous leukemia used a six-level assessment of quality of life ranging from "hospital stay throughout illness" to "no symptoms, normal life" (8).

Modern quality-of-life assessment in cancer clinical trials is generally cited to have begun in 1976 with Priestman and Baum's study of breast cancer treatment (9). Using a 10-question instrument, they assessed patients' general feeling of well-being, mood, level of activity, pain, nausea, appetite, ability to perform housework, social activities, general level of anxiety, and overall treatment experience. The assessments were based on patient reports of their symptoms and functioning. The results indicated that this instrument could be used to assess the subjective benefit of treatment in individual women, to detect changes over time, and to compare different treatments within a clinical trial.

B. Measuring Quality of Life

Many instruments are available for measuring quality of life in clinical trials. These can be divided into general and disease-specific instruments. Commonly used general instruments include the SF-36 (10), the Sickness Impact Profile (11), and the SCL-90-R (12). Each of these instruments includes general questions relating to a patient's health and functioning, and they can be applied in a wide range of disease settings. The goal of each instrument is to measure overall quality of life and various quality-of-life domains, such as physical functioning, social functioning, vitality, general health perceptions, and mental health. Other domains include disease symptoms and role functioning. A list of cancer-specific instruments is provided in Table 1.

Quality-of-life instruments usually include several individual questions, or items, pertaining to a particular domain, and the domain score (also called scale score) is obtained by summarizing the responses from the associated items (e.g., average of the item responses). Each instrument has its own rules regarding the computation of the domain scores, and these rules are established after careful testing.

Table 1 Cancer-Specific Quality-of-Life Measurement Instruments (instrument and reference, number of items, and domains assessed)

Breast Cancer Chemotherapy Questionnaire (BCQ) (43), 30 items: attractiveness, fatigue, physical symptoms, inconvenience, emotional, hope, social support.
Cancer Rehabilitation Evaluation System (CARES) (44), 93-132 items: physical, psychosocial, medical interaction, marital, sexual.
European Organization for Research and Treatment of Cancer scale (EORTC: QLQ-C30) (45), 42 items: five functional scales (physical, role, cognitive, emotional, social), three symptom scales (fatigue, pain, nausea), global quality of life, disease- and treatment-specific items.
Functional Assessment of Cancer Therapy (FACT) (46), 36-40 items: physical well-being, social/family, relationship with doctor, emotional, functional, disease-specific items.
Functional Living Index—Cancer (FLIC) (47), 22 items: physical functioning, symptoms, psychological, social, global well-being.
International Breast Cancer Study Group Quality of Life Questionnaire (IBCSG-QL) (48), 10 items: physical well-being, mood, appetite, coping, social support.
Linear Analogue Self-Assessment (LASA) (9), 25 items: well-being, disease symptoms, treatment and disease issues.
Quality of Life Index (QLI) (18), 5 items: physical activity, daily living, health perceptions, social support, outlook on life.

Each item is generally measured on a Likert scale or a linear analogue self-assessment (LASA) scale. The Likert scale is an ordered categorical scale consisting of a limited choice of clearly defined responses. The most frequently used scales have either four or five categories. In contrast, the LASA scale is an unmarked line, usually 10 cm long, with text at either end describing the extremes of the scale. Each patient is asked to place a mark on the line in a position that best reflects his or her response relative to the two labeled extreme points.

C. Measuring Patients' Preferences and Utilities

In addition to measuring descriptive quality of life, it is possible to measure a patient's preference, or utility, for particular health states. This can be accomplished by assessing a patient's value for one health state compared with another based on quality-of-life considerations. For example, two patients might report similar symptoms with similar frequency and duration, but they may differ on how important these symptoms are in their daily lives. Descriptive quality-of-life instruments will correctly provide similar scores for these two patients, whereas a measurement of preference or utility will differentiate them.

Utility is measured on a scale from 0 to 1, where 0 denotes a health state "as bad as death" and 1 denotes a health state "as good as perfect health." Values between 0 and 1 denote degrees between these extremes. A simple interpretation of a utility for a specific health state, A, is that the utility represents the amount of time in a state of perfect health that a patient values as equal to one unit of time in state A. For example, suppose that state A has a utility of 0.8. Then 1 month in state A is equivalent in value to 0.8 months of perfect health. This interpretation leads to the idea that quality-of-life-adjusted time can be obtained by multiplying a health state duration by its utility coefficient. For example, if a patient experiences 6 months of toxicity and has a utility weight of 0.7 for time with toxicity, then the quality-adjusted time spent with toxicity is 4.2 months. This adjustment allows treatments that have different impacts on quality of life to be compared in a meaningful way.

Classically, utility assessment is carried out using interview techniques. The "standard gamble" technique gives patients a choice between a chronic health state with certainty or an uncertain health state that is either perfect health (with probability p) or death (with probability 1 − p). The probability p is varied until the patient is indifferent between the certain and the uncertain choice, and the final p is taken as the utility value. The "time trade-off" technique gives patients a choice between living for a certain amount of time in a state of less than perfect health or a shorter amount of time in a state of perfect health. The duration of the "perfect health" state is varied until the patient expresses indifference to the choice. The utility is then taken as the ratio of the final health state durations. For a detailed overview of utility assessment, see Bennett and Torrance (13).

Interview techniques are cumbersome to use in practice. Fortunately, there are procedures for obtaining utility data from quality-of-life instruments using multiattribute utility theory (14). Generally, these procedures were developed by administering both the instrument and the interview to a study sample and building a statistical model for predicting the utility value from the instrument responses. Instruments that can be used for this purpose include the EuroQol (15), the Health Utilities Index (16), and the Q-tility Index (17), which uses Spitzer's Quality of Life Index (18).
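To make the utility-weighting arithmetic concrete, here is a minimal sketch in Python; the durations, utility weights, and function name are illustrative only and are not part of any published instrument or method.

```python
# Minimal sketch: quality-adjusted time as duration multiplied by utility.
# All numbers below are illustrative.
def quality_adjusted_time(durations_months, utilities):
    """Sum of health state durations weighted by their utility coefficients."""
    return sum(d * u for d, u in zip(durations_months, utilities))

# A patient with 6 months of toxicity (utility 0.7) followed by 18 months
# without toxicity or progression (utility 1.0):
qat = quality_adjusted_time([6.0, 18.0], [0.7, 1.0])
print(qat)  # 4.2 + 18.0 = 22.2 quality-adjusted months
```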

III. ANALYSIS OF QUALITY-OF-LIFE DATA

A. Overview

Quality-of-life data are generally collected longitudinally in cancer clinical trials. Often the data collection is most intense during the treatment phase of the study when patients have frequent clinic visits (e.g., 1-month intervals). Posttreatment measurements are generally collected at longer intervals corresponding to follow-up visits or are obtained using mailed surveys (e.g., 6-month intervals). Therefore, a longitudinal analysis procedure is appropriate. Standard methods include repeated-measures analysis of variance or more general mixed effects regression models. Other techniques include growth curve models and the construction of summary measures.

The main difficulty in analyzing longitudinal quality-of-life data is the appropriate handling of missing observations. Observations may be missing for many reasons, some of which may be considered missing at random, whereas others are related to the quality of life the patient is experiencing. For example, a patient may be too ill to complete the questionnaire, or the patient may be doing so well that he or she does not visit the clinic at the time when a quality-of-life assessment is due.

As with most statistical analyses, a graphical display of the data is a useful starting point. A common display for longitudinal quality-of-life data consists of a plot of mean scores over time according to treatment group for each quality-of-life domain measured. It is useful to indicate on these graphs which assessment time points occurred during the treatment phase of the study. It is also useful to indicate how many subjects provided data at each time point. These summaries will help to guide the modeling of the data and the interpretation of the modeling results.

B. Repeated-Measures Analysis of Variance

Repeated-measures analysis of variance is a commonly used and convenient approach to modeling longitudinal quality-of-life data. Generally, the model predicts a specific domain of quality of life using treatment group, time point, and a treatment group by time point interaction as independent factors. Other covariates, such as age, may be included as adjustment factors if confounding is a concern. In addition, specific contrasts of the regression parameters can be evaluated. For example, if a treatment by time interaction is found, the treatment effect can be evaluated at particular time points by testing the appropriate linear combination of the regression parameters.
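As a rough illustration of the modeling approach described above, the sketch below fits a mixed effects regression of a quality-of-life domain score on treatment, time, and their interaction; the variable names and the simulated data are assumptions made only for the example, and the random-intercept structure is just one of several reasonable covariance choices.

```python
# Minimal sketch (hypothetical variable names): mixed effects regression of a
# longitudinal quality-of-life score on treatment, time, and their interaction,
# with a random intercept per patient to account for repeated measurements.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for pid in range(40):
    arm = pid % 2                      # 0 = control, 1 = experimental (toy data)
    base = rng.normal(60, 8)           # patient-specific baseline score
    for month in (0, 3, 6, 9):
        score = base - 0.5 * month + 1.5 * month * arm + rng.normal(0, 5)
        rows.append({"patient_id": pid, "treatment": arm,
                     "month": month, "qol": score})
qol = pd.DataFrame(rows)

# Fixed effects: treatment, time, and treatment-by-time interaction; the random
# intercept induces within-patient correlation across assessment times.
fit = smf.mixedlm("qol ~ treatment * month", qol,
                  groups=qol["patient_id"]).fit()
print(fit.summary())
# The interaction coefficient estimates how the treatment effect changes per
# month; effects at specific time points are linear combinations of these terms.
```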

Standard software for this analysis usually assumes compound symmetry for the covariance matrix of the longitudinal assessments. The use of mixed effect regression analysis allows one to fit other forms for the covariance matrix or to leave the covariance matrix unspecified, in which case it is estimated from the data. Alternatively, one can model covariance as a function of the time difference between two assessments.

Generally, a separate analysis is performed for each quality-of-life measure obtained in a study. However, this approach can lead to inflated type I error due to the multiple testing. Other corrections for multiple testing can be used when this problem is present (e.g., the Bonferroni procedure). Although a multivariate analysis approach may be used, this uses only subjects with valid data on all subscales. Often patients are missing one or more subscales from an instrument, so that the multivariate approach uses only a fraction of the available subjects.

C. Other Techniques and Missing Data

Other techniques for analyzing longitudinal quality-of-life data include growth curve models and the construction of summary measures. Growth curve modeling generally involves fitting polynomials to the longitudinal data for an individual, with the analysis focusing on the fitted polynomials (19). The method of summary measures similarly collapses the multiple observations for an individual into a single outcome. Examples include the mean of the observations or the area under the curve of the plotted quality-of-life scores over time. An advantage of these methods is that missing data can be accommodated in a variety of ways.

The main difficulty with repeated-measures analysis of variance is accommodating missing observations. However, more recently developed methods for missing data can be applied when using a multivariate or repeated-measures model (e.g., imputation techniques). We refer to Fairclough (20) for an excellent review of these methods. An additional valuable reference in this area is the proceedings volume of the Workshop on Missing Data in Quality of Life Research in Cancer Clinical Trials: Practical and Methodological Issues (21). Finally, Bonetti et al. (22) recently developed a method-of-moments estimation procedure for evaluating quality-of-life data in the presence of nonignorable missing observations.
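The summary-measure idea can be sketched as follows; the column names and toy data are hypothetical, and simply dropping missing assessments, as done here, is only defensible when observations are missing at random.

```python
# Minimal sketch: collapse each patient's longitudinal quality-of-life scores
# into one summary measure, the area under the score-versus-time curve,
# computed with the trapezoidal rule.
import numpy as np
import pandas as pd

qol = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2],
    "treatment":  ["A", "A", "A", "B", "B", "B"],
    "month":      [0, 3, 6, 0, 3, 6],
    "qol":        [70, 65, 72, 68, np.nan, 60],   # one missing assessment
})

def qol_auc(group):
    g = group.dropna(subset=["qol"]).sort_values("month")
    return np.trapz(g["qol"], g["month"])   # area under the observed QOL curve

summary = qol.groupby(["patient_id", "treatment"]).apply(qol_auc).rename("auc")
print(summary)
```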

D. Example

Hürny et al. (23) evaluated the quality-of-life impact of various adjuvant cytotoxic therapy schedules in patients with breast cancer who were separately treated in two clinical trials conducted by the International Breast Cancer Study Group (trials VI and VII). One of these trials, trial VI, consisted of 1475 patients who were pre- or perimenopausal at the time of study recruitment and were randomized in a 2 × 2 factorial design to receive three or six initial cycles of chemotherapy with or without later reintroduction of three single cycles of chemotherapy administered at 3-month intervals. The quality-of-life instrument used in the study measured five indicators of quality of life: physical well-being, mood, appetite, perceived adjustment/coping, and emotional well-being. The first four indicators were measured using LASA scales, and the fifth was based on a 28-item adjective checklist. Measurements were obtained at baseline, 2 months later, then every 3 months until 24 months, and again 1 and 6 months after disease recurrence.

The analysis conducted by Hürny et al. (23) focused on the data recorded during the first 18 months after randomization. The number of patients who furnished usable data at each time point varied from 1022 patients at baseline to 797 patients at the 18-month time point. Analysis of variance was used to compare the treatment groups at each time point, and a repeated-measures model was used to make comparisons over time. Both models included the patient's language/culture as a covariate. Square-root transformations were used to stabilize the variance of all scales. Patients who had missing data at a particular time point were excluded only from the analysis of that time point.

The results indicated that mean quality-of-life scores increased over time for all treatment groups. Statistically significant (p < 0.05) differences were observed in the mean coping scores for the four groups at time points 6, 9, 12, and 15 months. At these time points, the group that received three cycles of initial therapy with no reintroduction therapy had the highest coping scores (indicating a better degree of coping). At the 18-month time point, all four treatment groups had similar mean coping scores. By this time all patients had completed their treatment.

IV. ANALYSIS OF QUALITY-ADJUSTED SURVIVAL

A. Introduction

Quality-of-life-adjusted survival time is a complementary outcome that is increasingly being used in cancer clinical research. It represents a patient's survival time weighted by the quality of life experienced, where the weightings are based on utility values. Because utility is measured on the unit interval (0,1), quality-adjusted survival time is in the same time units as overall survival. This allows comparisons of treatments that differ in their quality-of-life effects and their effects on survival using a metric that simultaneously accounts for both of these differences. The Quality-adjusted Time Without Symptoms of disease or Toxicity of treatment (Q-TWiST) method (24,25) is one technique for evaluating quality-adjusted survival in clinical trials.

Q-TWiST compares treatments by computing the time spent in a series of clinical health states that may impact a patient's quality of life. This is accomplished by partitioning the overall survival time into the health states for each treatment group separately. Each health state is then weighted by a utility value, and the Q-TWiST outcome is defined by the sum of the weighted health state durations. The three steps involved in a Q-TWiST analysis are described briefly below, followed by an illustrative example.

B. Step 1: Define Clinical Health States

The first step in the analysis is to define quality-of-life-oriented health states that are relevant for the disease setting and the treatments being studied. The health states should reflect changes in clinical status that may be associated with changes in quality of life (e.g., treatment-related adverse events, disease progression, late sequelae), and they should be progressive. That is, patients must move through the health states in order, although health states may be skipped. For example, in the adjuvant chemotherapy setting, health states may be defined as follows: S1 = time with treatment-related toxicity, S2 = time without toxicity and without disease progression, and S3 = time with disease recurrence until death. More generally, we may define a k-state model with health states S1, . . . , Sk, where the only possible transitions are from Si to Sj, 1 ≤ i < j ≤ k.

For each patient, the clinical trial data are used to define the exiting time for each health state. For example, in the adjuvant chemotherapy setting, the health state exiting times may be defined as follows: t1 = time from randomization until the end of treatment-related toxicity, t2 = time from randomization until disease progression or death, whichever occurs first, and t3 = time from randomization until death. If S1 is skipped, then t1 = 0. If a state Sj is skipped, then tj = tj−1. The exiting time from state Sk will be the time of death. For the k-state model, let ti denote the exiting time (measured from study entry or randomization time) from health state Si, i = 1, . . . , k. If any of the exiting times are censored, then the exiting times for all subsequent states will be similarly censored.

C. Step 2: Partition Overall Survival

The second step is to estimate the mean health state durations using the clinical trial data. The Kaplan-Meier method can be used for this purpose. In practice, however, censoring precludes one from estimating the entire survival curve. In this case, the partitioning is done up to a restriction time L. A common choice for L is the median follow-up duration. Let Ki(⋅) denote the Kaplan-Meier estimate corresponding to ti, i = 1, . . . , k. Then the mean health state duration, restricted to L, for health state S1 is

τ̂_1 = ∫_0^L K̂_1(u) du

and the mean health state duration for health state Si (2 ≤ i ≤ k) is

τ̂_i = ∫_0^L [K̂_i(u) − K̂_{i−1}(u)] du

This approach to the estimation of the health state durations provides consistent estimates, whereas averaging individual health state durations when censoring is present leads to biased results (26).

D. Step 3: Compare the Treatments

The third step is to compare the treatments using a weighted sum of the mean health state durations. For example, if u1, . . . , uk denote the utility coefficients for the respective health states S1, . . . , Sk, the Q-TWiST end point is given by

Q-TWiST = Σ_{i=1}^{k} u_i τ_i

Note that if all of the utility coefficients equal unity, Q-TWiST is equivalent to the mean survival time restricted to L. Q-TWiST is calculated separately for each treatment group, and the treatment effects are obtained by subtracting the Q-TWiST estimates corresponding to two treatment groups (e.g., experimental drug group vs. control group). Standard errors for the treatment effects can be obtained using the bootstrap method (24) or recently derived closed-form estimators (27–29).

If data are available for estimating u1, . . . , uk, then these data can be incorporated directly into the analysis. When utility data are not available, the treatment comparison can be evaluated by computing the treatment effect for varying values of the utility weights in a sensitivity analysis. When two utility weights are unknown and two treatments are being compared, the treatment comparison can be plotted for all possible values of the unknown utility coefficients in a two-dimensional graph called a "threshold utility plot." Contour lines can be used to indicate the magnitude of the Q-TWiST treatment effect associated with different pairs of utility values. The contour line corresponding to a treatment effect of zero is called the "threshold line." The threshold line indicates all utility value pairs for which the treatment effects are equal in terms of Q-TWiST. Confidence bounds for the threshold line can also be plotted to define regions of utility coefficient values for which the treatment effect difference is statistically significant.
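A simplified sketch of these calculations is given below; it is not the authors' software, and the toy input arrays, function names, and the assumption of one censoring indicator per exiting time are illustrative only. It estimates each K̂_i by the Kaplan-Meier method, integrates the curves up to the restriction time L, and forms Q-TWiST as the utility-weighted sum of the resulting mean health state durations.

```python
# Minimal sketch of a Q-TWiST calculation for one treatment group, assuming
# exiting times t1 <= t2 <= t3 (e.g., Tox, relapse, death), each with its own
# event indicator (1 = observed, 0 = censored).
import numpy as np

def km_curve(times, events):
    """Kaplan-Meier estimate; returns the event times and the survival steps."""
    times, events = np.asarray(times, float), np.asarray(events, int)
    uniq, surv, s = [], [], 1.0
    for t in np.unique(times):
        d = np.sum((times == t) & (events == 1))   # events at time t
        n = np.sum(times >= t)                     # number still at risk
        s *= 1.0 - d / n
        uniq.append(t)
        surv.append(s)
    return np.array(uniq), np.array(surv)

def restricted_mean(times, events, L):
    """Area under the Kaplan-Meier curve from 0 to the restriction time L."""
    t, s = km_curve(times, events)
    grid = np.concatenate(([0.0], t[t < L], [L]))
    step = np.concatenate(([1.0], s[t < L]))       # survival on each interval
    return float(np.sum(step * np.diff(grid)))

def q_twist(exit_times, exit_events, utilities, L):
    """exit_times/exit_events: lists of arrays for t1..tk; utilities: u1..uk."""
    means = [restricted_mean(t, e, L) for t, e in zip(exit_times, exit_events)]
    durations = np.diff([0.0] + means)             # tau_i = area between curves
    return float(np.dot(utilities, durations))

# Toy usage: 8 patients, times in months.
t_tox = np.array([2, 0, 3, 1, 2, 0, 4, 2]); e_tox = np.ones(8)
t_rel = np.array([10, 8, 15, 30, 12, 25, 9, 40]); e_rel = np.array([1, 1, 1, 0, 1, 1, 1, 0])
t_dth = np.array([18, 14, 22, 30, 20, 25, 16, 40]); e_dth = np.array([1, 1, 1, 0, 1, 0, 1, 0])
print(q_twist([t_tox, t_rel, t_dth], [e_tox, e_rel, e_dth],
              utilities=[0.5, 1.0, 0.5], L=24.0))
```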

For example, in the case where k = 3 and u2 = 1, the contour lines for the threshold utility plot are defined by

u_3 = (c − u_1 Δ_1 − Δ_2) / Δ_3

where c represents the Q-TWiST treatment effect and Δi is the treatment group difference for the duration of health state Si. Note that if c = 0, the above equation represents the threshold line. Contour lines can be obtained by setting appropriate values for c and plotting the resulting equations (e.g., c = −3, −2, −1, 0, 1, 2, 3 months).

E. Example

The Eastern Cooperative Oncology Group (ECOG) clinical trial EST1684 compared high-dose interferon alfa-2b therapy versus clinical observation for the adjuvant treatment of high-risk resected malignant melanoma in 280 patients (30,31). The health states defined for the Q-TWiST analysis were as follows: Tox = all time with severe or life-threatening side effects of high-dose interferon, TWiST = all time without severe or life-threatening treatment toxicity and without symptoms of disease relapse, and Rel = all time following disease relapse. These health states reflect the major clinical changes in quality of life that are important for evaluating the impact of high-dose interferon. The utility weight for TWiST in this analysis was assumed to be unity because TWiST represents a state of best possible quality of life.

Figure 1 shows the partitioning of overall survival into the health states according to treatment group based on the product limit method and restricted to the median follow-up interval of 84 months. This was accomplished by plotting on the same graph as OS a survival curve for the duration of severe or life-threatening side effects of interferon (Tox) and a survival curve for the time until disease relapse or death (RFS). For each graph, the area beneath the overall survival curve (OS) is partitioned into the health states Tox, TWiST, and Rel.

Table 2 shows the mean health state durations, the mean overall survival time (OS), and the mean relapse-free survival time (RFS) within the first 84 months from randomization in the study. The results indicate that patients in the interferon group experienced more time in TWiST and less time in Rel as compared with the observation group. However, the interferon group also experienced more time with severe or life-threatening toxicity. Table 3 shows the computation of Q-TWiST for two possible selections of the utility weights uTox and uRel. Note that mean OS is equivalent to mean Q-TWiST when uTox = uRel = 1 and that mean RFS is equivalent to mean Q-TWiST when uTox = 1 and uRel = 0.

Figure 2 illustrates the threshold plot for the treatment comparison. Note that in this case, the threshold line does not appear on the graph. Therefore, the interferon group experienced more quality-adjusted time than the control group regardless of the utility values used. However, the contour lines indicate that the Q-TWiST benefit for interferon ranges from 2 to 8 months depending on the selection of utility weight values.
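For the case of two unknown utility weights, the treatment effect and the threshold line are simple linear functions of uTox and uRel, as the following sketch shows; the Δ values below are placeholders chosen only for illustration, not the trial's results, which appear in Table 2.

```python
# Minimal sketch: Q-TWiST treatment effect and threshold contour when k = 3
# and the TWiST utility u2 is fixed at 1.  Delta values are hypothetical.
import numpy as np

d_tox, d_twist, d_rel = 6.0, 3.0, -2.0   # hypothetical group differences (months)

def q_twist_effect(u_tox, u_rel):
    """Treatment difference in Q-TWiST for given utility weights."""
    return u_tox * d_tox + d_twist + u_rel * d_rel

def contour_u_rel(u_tox, c=0.0):
    """u_rel on the contour where the Q-TWiST effect equals c (c = 0: threshold)."""
    return (c - u_tox * d_tox - d_twist) / d_rel

for u_tox in np.linspace(0.0, 1.0, 5):
    print(round(u_tox, 2),
          round(q_twist_effect(u_tox, 0.5), 2),
          round(contour_u_rel(u_tox), 2))
```

With these placeholder values the threshold line lies entirely above u_rel = 1, so it would not appear inside the unit square of the plot, which mirrors the situation described for the trial above.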

Figure 1 Partitioned survival plots for the ECOG clinical trial comparing (A) clinical observation (i.e., no therapy) and (B) interferon for patients with malignant melanoma. Each plot illustrates the overall survival curve (OS), the relapse-free survival curve (RFS), and a curve representing the duration of toxicity (Tox). The area between the OS and RFS curves represents the duration of the relapse health state (Rel), and the area between the Tox and RFS curves represents time without symptoms of relapse or toxicity (TWiST). (From Ref. 31.)

0 4. Source: Ref.2 0.5 0.3 14.8 6.1 10.0 7.4 0. time with severe or life-threatening side effects of treatment. time without severe or life-threatening side effect of treatment and without symptoms of disease relapse.4 49.0 to 15.03 * Tox.1 2. 31.240 Cole Table 2 Mean Time in Months for the Components of Q-TWiST Restricted to 84 Months of Median Follow-Up in the ECOG Trial EST 1684 Treatment Group Outcome* Tox TWiST Rel OS RFS Observation 0.07 0.5 0. Rel. overall survival time.’’ † Treatment difference corresponds to interferon minus observation and is given with a 95% conﬁdence interval (CI) and a two-sided p value based on a Z-test.6 0. relapse-free survival time.4 0.0 12.0 2. the Q-TWiST beneﬁt for interferon ranges from 2 to 8 months depending on the selection of utility weight values. In addition. Values of the utility weights above the dashed line correspond to a signiﬁcant (p 0. uRel.9 95% CI† 5.5 17.05 * uTox.8 to to to to to 6.4 to 12.001 0.0 8.0 p (two-sided)† 0. Each utility weight is measured on a scale from 0 ‘‘as bad as death’’ to 1 ‘‘as good as perfect health.2 34. .05) beneﬁt for interferon in terms of Q-TWiST and values below the dashed line indicate utility coefﬁcient Table 3 Mean Q-TWiST in Months Within 84 Months of Median Follow-Up in the ECOG Trial for Arbitrary Sets of Utility Weight Values Utility Values* uTox 0. the upper 95% conﬁdence band for the threshold line appears as a dashed line. time following disease relapse until death.5 0. † Treatment difference corresponds to interferon minus observation and is given with a 95% conﬁdence interval (CI) and a two-sided p value based on a Z-test.4 30. OS.6 95% CI† 2.4 42. the utility weight associated with disease relapse.3 38.0 7.9 Difference† 5.8 33.8 3.0 30.2 0. RFS. Source: Ref.1 p (two-sided)† 0.5 Difference† 5.9 Interferon Alfa-2b 41. TWiST.2 42.9 uRel 0. the utility weight associated with the severe or life-threatening side effects of treatment (TOX).4 Treatment Group Observation 36.7 11. 31.0 Interferon Alfa-2b 5.

Figure 2 Threshold utility analysis for the ECOG clinical trial. The graph illustrates the treatment comparison in terms of Q-TWiST for all possible values of the utility weights for toxicity and relapse. The parallel dotted lines represent contours for the treatment effect in terms of Q-TWiST (interferon minus observation) as the utility weights vary between 0 and 1. The positive numbers on the contour lines indicate that mean Q-TWiST for the interferon group is greater than for the observation group for all possible pairs of utility weights between 0 and 1. For utility value pairs above the heavy dashed line in the upper left corner of the plot, the Q-TWiST treatment effect is statistically significant (i.e., p < 0.05). (From Ref. 31.)

Utility value pairs below the dashed line are those for which the Q-TWiST comparison favored the interferon group but did not reach statistical significance.

F. Further Developments

Quality-of-life-adjusted survival in clinical trials is an area of active methodological research. A number of recently published papers hold promise for an increasing array of statistical tools. Parametric (32) and semiparametric (33) regression models have been developed for quality-adjusted survival. In addition, methods have been developed for forecasting treatment effects (34) and performing meta-analyses (35–37). Glasziou et al. (38) describe methods for combining longitudinal quality-of-life data with survival data using an integration-based approach. In particular, Zhao and Tsiatis (27) present a consistent estimator for the

distribution of quality-adjusted survival time and, in a second paper (28), describe techniques for estimating mean quality-adjusted lifetime based on their consistent estimator. Ongoing research focuses on developing statistical tools for the analysis of utility data in conjunction with quality-adjusted survival analysis.

V. COST-EFFECTIVENESS AND COST–UTILITY ANALYSIS

A. Overview

The economic evaluation of different treatment options is critically important for including medical costs in policy decisions and clinical practice decisions. Cost-effectiveness analysis and cost–utility analysis are two common methods for comparing the economic cost of treatments in a standardized way. Both methods result in ratios that relate the incremental cost of a therapy to the incremental clinical benefit of that therapy. The denominator in cost-effectiveness analysis is any measure of benefit, whereas in cost–utility analysis it is expressed in quality-adjusted time, where the quality-adjustment is made using utility weights.

For example, to evaluate the cost-effectiveness of drug A relative to drug B, the additional cost of using drug A is quantified and related to the clinical benefit of drug A relative to drug B. In particular, if cA and cB denote the total costs of using drug A and drug B, respectively, and if µA and µB denote the respective mean survival times for the two drugs, the cost-effectiveness ratio is given by

(cA − cB) / (µA − µB)

This ratio represents the additional cost of using drug A per unit of lifetime saved relative to drug B. The cost–utility ratio is obtained by replacing the denominator in the cost-effectiveness ratio with the quality-adjusted treatment effect. For example, the treatment effect can be expressed in terms of Q-TWiST in a cost–utility ratio.

However, in practice it is very difficult to measure the real costs of medical treatment. Moreover, costs can change drastically over the course of a study, and costs may vary widely from one geographic region to another. As a result, it is much more practical to collect data on resource utilization and later assign costs. The gathering of cost information is not necessary because utilization data can be assigned appropriate costs at a later time. The main challenge for clinical trials is to measure the utilization of resources related to treatment. At a minimum, data should be collected for all physician visits, diagnostic tests, drug treatments, and hospitalizations. The clinical centers involved with the trial can provide some of these data for the utilization at these clinical sites. However, it is important that data also are collected directly from the patients because many will receive care from other sites. The use of a patient diary or interval questionnaire, along with periodic telephone calls from a study coordinator, will facilitate accurate collection of the data.
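As a small illustration of the ratio defined above, the following sketch computes an incremental cost-effectiveness ratio from total costs and mean survival times, with a simple helper for discounting future costs at an annual rate; all figures and the 3% rate are hypothetical.

```python
# Minimal sketch: incremental cost-effectiveness ratio (cost per life-year
# gained), with yearly costs discounted to present value.
def present_value(amounts_by_year, rate=0.03):
    """Discount a list of yearly amounts (year 0 first) at an annual rate."""
    return sum(a / (1.0 + rate) ** year for year, a in enumerate(amounts_by_year))

def cost_effectiveness_ratio(cost_a, cost_b, mean_surv_a, mean_surv_b):
    """Additional cost of drug A per unit of additional mean survival vs. drug B."""
    return (cost_a - cost_b) / (mean_surv_a - mean_surv_b)

# Illustrative inputs: discounted total costs (dollars) and mean survival (years).
cost_a = present_value([40000, 20000, 10000, 8000])   # hypothetical drug A costs
cost_b = present_value([30000, 15000, 9000, 7000])    # hypothetical drug B costs
print(cost_effectiveness_ratio(cost_a, cost_b, 4.0, 3.5))  # dollars per life-year
```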

The statistical issues related to economic analysis are too complex to be covered in this chapter. The main difficulty is that cost-effectiveness ratios are derived from varied data sources, and some of the parameters (e.g., cost) may be point estimates and not based on sampled data. The result is that the ratios have an unknown distribution, making statistical inferences impossible. A common solution to this problem is to use sensitivity analysis to examine how the cost-effectiveness ratio varies as the parameters of the analysis vary (e.g., cost parameters can be varied within certain ranges). Nevertheless, in some cases it is possible to make statistical inferences on cost-effectiveness and cost–utility ratios (39).

B. Example

Hillner et al. (40) performed an economic evaluation of interferon treatment for malignant melanoma based on the ECOG clinical trial EST1684. They estimated that the total cost of medical care within the first 7 years after diagnosis was $91,656 for interferon-treated patients and $76,580 for patients not treated with interferon. These figures were discounted at an annual rate of 3%. This discounting allows the future (decreased) value of money to be expressed in current dollars. Therefore, the incremental cost was estimated at $15,076. Table 2 indicates that the increase in mean survival within the first 84 months associated with interferon was 7.0 months. The cost-effectiveness ratio is therefore given by $15,076/7.0 = $2152 per life-month saved ($25,848 per life-year saved). Note that the results presented by Hillner et al. differ slightly from these calculations because of rounding and because Hillner et al. discounted the survival time (as well as costs) by 3% per year.

VI. STUDY DESIGN IN CLINICAL TRIALS

As a gold standard study design we propose a cancer clinical trial including the following outcomes: (1) the usual clinical end points such as progression-free survival and overall survival; (2) the usual assessment of toxicity/adverse event frequency and grade; (3) measurements of the timing and duration of all toxicities and adverse events; (4) longitudinal assessment of quality of life using a general instrument and a disease-specific instrument; (5) a procedure for estimating patient utility or preference; and (6) a procedure for estimating health care utilization. By including all these components in a clinical trial, it becomes possible to address the clinical benefits of a new therapy and its impact on quality of life and whether it is cost effective. Of course, few studies will include all these components due to constrained resources. In addition, clinical trials that began

in the 1980s or early 1990s will generally not include components for measuring quality of life or utility, because methods for assessment were not as well established as they are today. As a result, researchers use other methods to address pressing clinical issues. One approach is to launch a smaller study that collects quality-of-life and utility data from a group of patients. Such a study can be longitudinal or cross-sectional. The advantage of the cross-sectional design is that the study can be completed more quickly. The disadvantage is that longitudinal effects on quality of life cannot be estimated. Another approach is to retrospectively evaluate the duration of major health states that are thought to impact quality of life (e.g., treatment-related toxicity, disease progression). By combining clinical outcome data with patient-level cycle-by-cycle toxicity data (both of which are typically collected in cancer clinical trials), it is often possible to obtain estimates of durations of the health states. Utility weights can then be assigned to the health states, and this health state utility model can be used to compare treatments in terms of quality-adjusted time. The utility weights can be estimated from a secondary cross-sectional study, or they may be left unspecified. In the latter case, the results of the analysis should be displayed for a wide variety of choices for the utility weight values and not for just one or two arbitrary selections. Inferior ancillary study designs include those that use proxy data for subjective quality-of-life domains.

For many clinical trials currently being designed where quality of life is an important end point, it is critical that quality-of-life components are prospectively incorporated. For randomized studies, a baseline assessment should take place before treatment randomization. At a minimum, a disease-specific quality-of-life instrument should be administered longitudinally. The timing of assessments should be designed to measure quality of life for the various clinical health states that a patient might experience both during and after therapy (e.g., treatment-related toxicity, disease progression). In addition, patients should be asked to self-report troublesome adverse events and symptoms and their durations using a diary. These data could be used to validate the physician-reported adverse event data typically collected. The patient diary idea is particularly appealing from a quality-of-life perspective because it is likely that a patient will self-report adverse events that cause distress and therefore represent decrements in quality of life. To fill this potential gap, patient-diary data are useful for estimating the duration of time spent with adverse events—an outcome necessary for a health-state utility model.

VII. CONCLUSION

In this chapter, we provide an overview of the basic components of quality-of-life research in cancer clinical trials. Unfortunately, in this short space, we cannot fully cover all aspects of this topic. However, there are a number of excellent references for further reading.

REFERENCES 1. Assessment of quality of life in clinical trials. The longitudinal use of a quality-of-life instrument is also strongly recommended. With these components in place. 11:127–135. ¨ 3. ACKNOWLEDGMENTS I thank Richard Gelber and Shari Gelber for helpful comments on this chapter. health policy issues. Bernhard J. cross-cultural and cross-national issues. et al. In particular. Methodological and statistical issues of qual- . the evaluation illustrates the need to consider quality-adjusted survival comparisons in clinical research and to develop more practical methods for assessing patient preferences for incorporation in the decision-making process. Another more compact reference is the chapter by Gelber and Gelber (42). Torfs K. Olschewski M. Hurny C. Gelber RD. which reviews methods used in clinical research and provides more detail regarding statistical analysis methods. future clinical trials should carefully collect data regarding toxicity grade and duration in addition to the usual clinical outcomes. Gore SM. and pharmacoeconomics.Complementary Outcomes 245 references for further reading. Quality of life assessment: can we keep it simple? J R Stat Soc A 155:353–393. Stat Med 1991. Simes RJ. At a minimum. Kiebert W. We also provided an example illustrating the use of quality-of-life-adjusted survival time (Q-TWiST) in cancer clinical research. J Nat Cancer Inst Monogr 1992. Neymark N. Schulgen G. The use of assessment tools and procedures similar to those described in this chapter is becoming increasingly important in cancer clinical research. 10:1915–1930. This book is particularly well suited to the quality-of-life researcher involved with study design and analysis. Moreover. a meaningful evaluation can be made of treatments in terms of clinical outcome. Supported in part by the American Cancer Society (RPG-90-013-08-PBP) and the National Cancer Institute (CA23108). Spiegelhalter DJ. The Q-TWiST analysis of the ECOG trial EST1684 improved the clinical usefulness of the information obtained from the clinical trial. and cost. analysis. Quality of life in clinical trials of adjuvant therapies. Cox DR. quality of life. Fletcher AE. Jones DJ. 4. the large volume edited by Spilker (41) is a thorough reference covering quality-of-life measurement. Fitzpatrick R. as is the tracking of individual health care costs over the course of a study. Schumacher M. 2. Goldhirsch A.

Quality of Life and Pharmacoeconomics in Clinical Trials. Zee B. Workshop on Missing Data in Quality of Life Research in Cancer Clinical Trials: Practical and Methodological Issues. Yates JW. Lancet 1975. Derogatis MF. pp. The use of nitrogen mustards in the palliative treatment of carcinoma. 2nd ed. 19. Medalie JH. In: Spilker B. 253–265. Measuring health state preferences and utilities: rating scale. Derogatis LR. In: Staquet M. pp. 2:621–624. et al. 45:2220–2224. ed. The ‘‘Q-tility index’’: a new tool for assessing health-related quality of life and utilities in clinical trials and clinical practice. Growth curve model analysis for quality of life data. 323–335. Fairclough D. ed. TIMS Studies Mgmt Sci 1977. 17: 757–766. Philadelphia: LippincottRaven. 6:59–89. 16. 24:179–191. 10. Stat Med 1998. ed. Oxford: Oxford University Press 1998. The sickness impact proﬁle. 15. pp. Methods of analysis for longitudinal studies of health-related quality of life. Quality of Life and Pharmacoeconomics in Clinical Trials. 239–252. In: Spilker B. Comprehensive follow-up of carcinoma patients. 1996. Eur J Cancer 1998. Priestman TJ. Izsak FC. 1996. Quality of Life and Pharmacoeconomics in Clinical Trials. 337–345. Philadelphia: LippincottRaven. 13. Torrance GW. In: Spilker B. time trade-off and standard gamble techniques. SCL-90-R and the BSI. ed. Hall J. J Chron Dis 1971. Philadelphia: Lippincott-Raven. O’Leary J. pp. 2nd ed. Prankerd TAJ. 1:634. pp. Quality of Life and Pharmacoeconomics in Clinical Trials. 8. In: Spilker B. Torrance GW. 11. 227–247. McKegney FP. 14. . ed. 1996. Proc ASCO 1994. 17. Measuring the quality of life of cancer patients: a concise QL-index for use by physicians. Quality of survival in acute myeloid leukemia. Gelber RD. Farquhar PH. 191–201. et al. 12. Lancet 1976. pp. 20. Philadelphia: Lippincott-Raven. Burchenal JH. Quality of Life and Pharmacoeconomics in Clinical Trials. Evaluation of patients with advanced cancer using the Karnofsky performance status. Dobson AJ. Abelmann WH. Karnofsky DA. Kind P. pp. Richards JDM. 13:436. Stat Med 1998. Philadelphia: Lippincott-Raven. In: Spilker B. In: Spilker B. Cancer 1980. Ware JE Jr. Burge PS.246 Cole ity of life and economic evaluation in cancer clinical trials: report of a workshop. A survey of multiattribute utility theory and applications. 5. Bennett KJ. Weeks J. Baum M. 2nd ed. Spitzer WO. Chalmer B. ed. Quality of Life and Pharmacoeconomics in Clinical Trials. ed. Feeny DH. Fairclough DL. 1996. 18. 7. 17: 511–796. 34:585–597. Bernhard J. eds. Philadelphia: Lippincott-Raven. 347–354. Furlong WJ. 21. 2nd ed. Damiano AM. J Chron Dis 1981. 2nd ed. Evaluation of quality of life in patients receiving treatment for advanced breast cancer. 6. Cancer 1948. et al. Craver LF. 34:1317–1333. 1996. 1:899–900. 1996. 2nd ed. The EuroQoL instrument: an index of health-related quality of life. 9. Health Utilities Index. The SF-36 health survey. Quality of Life Assessment in Clinical Trials.

Cox regression models for quality adjusted survival analysis. Quality-of-life adjusted survival analysis of interferon alfa-2b adjuvant treatment of high-risk resected cutaneous melanoma: an Eastern Cooperative Oncology Group Study. A consistent estimator for the distribution of quality adjusted survival time. 1:114–121. 14:7–17. 347:1279–1284. Cole BF. Zhao H. 1996. et al. Johnson ER. 45:781–795. Estimating mean quality adjusted lifetime with censored data. Cole BF. The Q-TWiST method. Ernstoff MS. Simes RJ. 56:173–182. Glasziou PP. 27. 14:2666– 2673. Gelber RD. Anderson KM. 33. 35. Variance and sample size calculations in quality-of-life adjusted survival analysis (Q-TWiST). Gelman RS. J Clin Oncol 1996. Cancer J Sci Am 1995. 32. Tsiatis AA. 38. Gelber RD. 36. Goldhirsch A. et al. Simes RJ. Gelber RD. Cole BF. ¨ 23. et al. Series B. Impact of adjuvant therapy on quality of life in women with node-positive breast cancer. Quality of Life and Pharmacoeconomics in Clinical Trials. 9:1259–1276. Gelber RD. Gelber RD. Glasziou P. J Clin Oncol 1996. Bonetti M. Atkins MB. Biometrics 1989. Gelber RD.Complementary Outcomes 247 22. Parametric approaches to quality adjusted survival analysis. Statistical versus quantitative signiﬁcance in the socioeconomic evaluation of medicines. Hilden J. O’Brien BJ. Hurny C. 2nd ed. Cole BF. Murray S. Philadelphia: Lippincott-Raven. PharmacoEconom 1994. Lancet 1996. Cole BF. Cole BF. Parametric extrapolation of survival estimates with applications to quality of life evaluation of treatments. Adjuvant chemotherapy plus tamoxifen compared with tamoxifen alone for postmenopausal breast cancer: a meta-analysis using quality-adjusted survival. 31. Stat Med 1995. Cole BF. 39. 25. Coates AS. 14:1771– 1784. Economic analysis of . Tsiatis AA. Interferon alpha-2b adjuvant therapy of high-risk resected cutaneous melanoma: The Eastern Cooperative Oncology Group Trial EST 1684. Smith TJ. Biometrika 1997. 94:1025–1034. Gelber RD. 437–444. Biometrics 1994. Goldhirsch A. Controlled Clin Trials 1993. 50:621–631. Kirkwood JM. In: Spilker B. Stat Med 1993. Lancet 1996. A method-of-moments estimation procedure for categorical quality-of-life data with non-ignorable missingness. A quality-adjusted survival meta-analysis of adjuvant chemotherapy for premenopausal breast cancer. 62. 17:1215–1229. Hunt Strawderman M. A quality-of-life oriented endpoint for comparing therapies. Cole BF. 30. Goldhirsch A. Part 1: 175–188. Gelber RD. Hillner BE. Goldhirsch A. 26. et al. Stat Med 1998. Sankhya 2000. 347:1066–1071. Gelber RD. Kirkwood JM. pp. Kirkwood JM. Gelber RD. et al. Gelber S. 40. Gelber RD. 14:485–499. 28. Zhao H. Gelber RD. Bernhard J. Quality adjusted survival analysis. 37. Goldhirsch A. Cole BF. Drummond MF. 29. 34. 24. ed. Goldhirsch A. Cole BF. J Am Stat Assoc 1999. Cole BF. 84:339–348. Quality-adjusted survival analysis with repeated quality-of-life measures. 5:389–398. Stat Med 1990. Goldhirsch A. 12:975–987. Biometrics 2000. Adjuvant chemotherapy for premenopausal breast cancer: a meta-analysis using quality-adjusted survival.

Philadelphia: Lippincott-Raven. ed. Gent M. Gelber RD. Schag CA. 111:231–249. In: Thall PF. 48. Quality of Life and Pharmacoeconomics in Clinical Trials. 1:19–29. Massashusetts: Kluwer Academic Publishers. pp. Spilker B. 44. ¨ Hurny C. A modular approach to quality-of-life assessment in cancer clinical trials. ed. The CARES: A generic measure of health-related quality of life for patients with cancer. Philadelphia: Lippincott-Raven. Quality-of-life assessment in clinical trials. ed. Recent Advances in Clinical Trial Design and Analysis. 15:2351–2358. 28: 118–124. 2nd ed. Aaronson NK. Quality of Life and Pharmacoeconomics in Clinical Trials. Ganz PA. . Levine MN. et al. et al. In: Spilker B. Quality of Life and Pharmacoeconomics in Clinical Trials. Norwell. The functional assessment of cancer therapy (FACT) and functional assessment of HIV infection (FAHI) quality of life measurement system. Philadelphia: Lippincott-Raven. (1996). Bonomi AE. Eur J Cancer 1992. 225–246. 1996. Recent Results Cancer Res 1988. Quality of life in stage II breast cancer: an instrument for clinical trials. 45. 2nd ed. 215–225. Ahmedzai S. 46. In: Spilker B. J Clin Oncol 1997. Quality of life measures for patients receiving adjuvant therapy for breast cancer: an international trial. pp. 43. ed. Bullinger M. Lee JJ. 6:1789–1810. Clinch JJ. Qual Life Res 1992. Gelber S. 203–214. 42.248 Cole adjuvant interferon alfa-2b in high-risk melanoma based on projections from Eastern Cooperative Oncology Group 1684. 41. pp. The functional living index—cancer: ten years later. Bernhard J. J Clin Oncol 1988. 2nd ed. Cella DF. 47. Gelber R. 1996. Guyatt GH.

14
Health-Related Quality-of-Life Outcomes

Benny C. Zee
National Cancer Institute of Canada, Kingston, Ontario, Canada

David Osoba
Quality of Life Consulting, West Vancouver, British Columbia, Canada

I. INTRODUCTION

Health-related quality of life (HRQL) is now included as a major end point, in addition to the traditional end points such as tumor response and survival, in many oncological clinical trials. The construct "quality of life" can be very broad and can include such dimensions as the air we breathe, family well-being, socioeconomic status, spirituality, and job satisfaction. However, in health care the term HRQL is restricted to how illness or its treatment affects patients' ability to function and their symptom burden (1). Thus, HRQL is not in itself only performance status, tumor measurement, toxicity ratings, or laboratory values. When HRQL was first proposed as an outcome to be measured in clinical trials, there was controversy over how to define the term HRQL and the breadth of the constructs to be included. Researchers have agreed that at a minimum, it should include physical, social, and emotional dimensions of life (2–4). Thus, HRQL in oncology is a multidimensional construct consisting of subjective indicators of health. These indicators should include physical concerns (symptoms and pain), functional ability (activity), emotional well-being, treatment satisfaction, and future orientation.

(2) cancers with a poor prognosis. However. (5) identiﬁcation of the full range of side effects and impact of treatment. Frequency of measurements and logistics of data collection should be considered to minimize potential problems with missing data. Food and Drug Administration recommends that the beneﬁcial effects on HRQL and/or survival be the basis of approval of new anticancer drugs. depending on the population being studied and the speciﬁc conditions of the study. they should be discussed in the protocol and the psychometric properties of the generic instruments should be referenced. to add disease-speciﬁc or situation-speciﬁc modules or checklists (6. and occupational functioning. we discuss some of these issues in clinical trials incorporating HRQL as an end point. (4) supportive care interventions. many antineoplastic therapies give rise to a number of distressing side effects with a presumed deterioration in HRQL. To perform a clinical trial using HRQL as end points. (3) treatment arms with similar survivals. Information obtained from quality-of-life assessments in phase II testing for new chemotherapeutic agents can also guide quality-of-life evaluations planned in future large randomized studies (5). sexuality. and the patients may have increased survival and improved HRQL (10). or a general (condition-speciﬁc). and (6) using quality of life as a predictor of response and survival. the motivation for adding these items should be addressed. The general approach in medical studies is to use a generic. questionnaire that assesses physical. the duration and frequency of the HRQL measurements should be clearly stated in the protocol. There are several reasons for applying HRQL assessments in oncology (8): (1) studies in which symptom control is the primary outcome.S. these side effects are usually reversible when the treatments have been completed. For example. When a supplemental disease-speciﬁc checklist is needed. Depending on the purpose of the study. In the following sections. whereas others are designed to assess the impact of both symptoms of disease and toxicity of treatment in a phase III randomized trial setting. When the objective of a study is to evaluate longer term effects. demonstration of a favorable effect on HRQL is more important than most other traditional measures of efﬁcacy (9). . it is important to develop a protocol with clearly deﬁned objectives and deﬁnitions of end points (7). when treatment does not have an impact on survival. so that the design of the study and follow-up schedule are developed to ensure good compliance (HRQL questionnaire completion rates). Eligibility requirements for the HQRL component should be given. not all dimensions are relevant to a particular study.250 Zee and Osoba tation. The choice of instruments is important. However. social functioning. emotional.7). the U. From the regulatory point of view. some quality-of-life assessments are targeted to obtain information for decision making in health care policy. and social functioning and then. Therefore.

This is critical when a study is comparing two different treatment modalities or treatment schedules. then survival time is an obvious end point. there are some additional practical issues that need to be considered in phase III trials with HRQL components. a study of dose-intensive chemotherapy versus standard alternating-dose chemotherapy for patients with extensive-stage small cell lung cancer may expect to have more treatment-related symptoms during the duration of chemotherapy. When the primary objective is to evaluate whether patients survive longer with a dose-intensive regimen. extensive-stage small cell lung cancer patients have a rather short median survival of about 1 year. . a study may permit patient entry to the treatment protocol without participation in the HRQL component. one of the most important questions is whether the HRQL end point is unambiguously stated and addresses the study question fully. For example. For many nonblinded cancer trials. the HRQL assessments should be patterned on the treatment schedules with an identical administration procedure (12). Such an outcome may indicate that one treatment is preferable to another. For example. It is also likely that dose-intensive treatment produces more side effects and may have an impact on patients HRQL. and the emergence and magnitude of a clinically signiﬁcant beneﬁt may be evident around 4–6 months.Health-related Quality of Life Outcomes 251 II. such as survival. shows equivalence in treatment effect but one treatment arm shows less toxicity or improved overall HRQL. blinding of the allocated treatment for the person administering the HRQL questionnaire(s) is as important as the blinding of the allocated treatment for the primary care personnel. In a double-blind randomized trial. Another example where HRQL is an important end point is the situation where the primary objective of the study. even when it does not extend survival. However. a standardized procedure must be used to reduce bias from the administration procedure. DESIGN CONSIDERATIONS During the design stage of a phase III randomized trial. The improvement in median survival time gained by using highly toxic dose-intensive chemotherapy needs to be justiﬁed by the HRQL outcomes to make an informed treatment decision (11). Randomization and Blinding The basic design issues when HRQL is included as an end point are similar to those of most conventional randomized controlled trials. However. and appropriate stratiﬁcation may be required so that the purpose of the randomization process to reduce bias in treatment selection is preserved. It is important to consider the proportion of nonparticipants in the HRQL components. A.

Test–retest reliability reﬂects the reproducibility of scores at two different time points between which HRQL is not expected to change. previous work on the psychometric properties of the instrument should be referenced in the protocol. a study that includes only patients who consent to take part in HRQL assessments may involve only welladjusted patients. using an instrument with appropriate length to minimize noncompliance due to deteriorating physical or emotional illness. high test–retest. and interrater reliability (if rater-completed HRQL assessment is used). On the other hand. culturally appropriate instruments should be used. Information about HRQL for this subset of patients may affect the generalizability of the study results. will be affected. The chosen instrument should have demonstrated good internal consistency. A strategy for collecting HRQL data that is appropriate to all patients should be considered.252 Zee and Osoba B. it is difﬁcult to know how much the generalizability of the results from the study population. which is a measure of the extent to which different items represent the same domain content. i. There are many good review articles on the choice of quality of life instruments. Eligibility The effect of a disproportionate number of nonparticipators in HRQL components between two treatment arms may introduce bias. Interrater reliability estimates the degree of agreement among . Once an instrument has been chosen for a study. However. and social well being are affected by medical intervention. Moinpour (13) discussed the operational deﬁnition of HRQL in clinical trials with respect to health care and the treatment of disease. Does the patient population include all those who consent to take part in the treatment aspect or include only those who consent to participate in the HRQL component? Exclusion of those who do not take part in HRQL assessments may limit the generalizability of the treatment outcome (10). allowing physicians to treat the quality-of-life assessments as optional and then to stratify for those who consent to take part in HRQL assessments versus those who do not may reduce the problem. Psychometric Properties The choice of instruments with well-established psychometric properties are important for proper interpretation of the HRQL results. C. The randomization and the stratiﬁcation procedure should account for this problem.e. translation to other languages for various ethnic groups. mental. as deﬁned in the eligibility criteria of the protocol. and the use of proxies who know the patients well are all important considerations. For example. how physical. assistance for patients with visual or auditory impairment. Internal consistency is usually measured by Cronbach’s alpha coefﬁcient (14). but the HRQL results of the study must be interpreted with caution since they represent a slightly different patient population than that of the survival end point. For example..

The concept of validity is more complex, since there is no gold standard for HRQL assessment and it is primarily the repeated use of an instrument in many trials over time that establishes validity. In general, an instrument that demonstrates appropriate content validity (including adequate relevant items in a specific domain) and convergent and divergent validity with other instruments is sufficient to justify its use in most clinical trials. Construct validity assesses whether the HRQL instrument relates to other observed variables or other constructs in a way that is consistent with a theoretical framework. Factor analysis, multitrait scaling analysis, and structural equation modeling are common methods to assess construct validity (15). Finally, criterion validity is not easy to establish in HRQL instruments. Sometimes a demonstration of criterion validity is possible if the criterion exists at the same time as the measure. For example, if we know that there is a significant difference between two groups of patients with respect to a criterion (e.g., performance status), criterion validity may be established if the quality of life domains correctly show a difference between groups.

D. Global Quality of Life

An important aspect of choosing HRQL instruments is to select one that contains relevant items for a specific purpose. For example, if we are to study overall health, a questionnaire with a separate global quality-of-life domain is an important consideration. Questionnaires that aggregate individual items into a total score do not necessarily represent global quality of life or general well-being. As Mor and Guadagnoli (16) pointed out, the interpretation of an aggregate score from a number of existing domains as equaling global quality of life may provide misleading results when the instrument is predominated by certain domains. This is important because the weighting of various domains in an instrument varies with the individual valuation.

For example, the Functional Living Index—Cancer (FLIC) and Cancer Rehabilitation Evaluation System (CARES) were used to assess the quality of life of patients treated with either modified radical mastectomy or segmental mastectomy (17). A global score for the FLIC was obtained from 22 visual analogue scales, including concerns related to pain, stress, global quality of life, general well-being, and the ability to work and do household chores. The CARES (18) instrument has 93 to 132 items with a five-point Likert scale, and a summary score is obtained from 5 higher order factors including physical, psychosocial, medical interaction, marital, and sexual domains. The Profile of Mood States, with 65 items and a five-point Likert scale response format, was also used, the average score representing a total mood disturbance. Both the CARES summary scores and the FLIC global score showed no significant difference between the two groups, but patients with segmental mastectomy had significantly more mood disturbance at one month than did those with total mastectomy.

However, at 1 year of follow-up, patients with segmental mastectomy had significantly fewer problems with clothing and body image as indicated in these domains in CARES. These are clear indications of differences in HRQL, whereas the conclusion based on aggregate scores indicates no significant difference. One explanation for these results may be that the chosen questionnaires did not include factors that patients believed to be important. Kemeny et al. (19) pointed out that the subject matter in identifying proper domains to study, such as body image, appearance, and femininity, is important. Another likely explanation is that the scores for all the domains in a questionnaire may not change in the same direction, and thus improvement in some domains may be canceled out by worsening in others, giving no change in the aggregate score. Statements about overall or global quality of life based on aggregate scores may be dangerous. Although some investigators have suggested weighting of certain domain scores (e.g., toward physical functioning or emotional functioning), the assigned weights are derived from observers' opinions and not necessarily those of patients (2). It is therefore very difficult to develop an instrument that provides an aggregate score applicable across different studies.

More importantly, in contrast to the method of aggregate scoring, one or two questions asking directly for an overall assessment of overall health/global quality of life represent an overall assessment incorporating an individual's own values. Once this information is obtained, the association of global quality of life with physical, emotional, and social domains and other symptoms can be determined. This would provide us with further information on the impact of specific dimensions on patients' global quality of life.

E. Symptom Checklists and HRQL

One of the design questions in HRQL studies is whether to incorporate extra symptom checklists to validated general HRQL questionnaires (20). Checklists (and modules) are developed to be disease and situation specific, that is, to capture additional information about the effects of a particular cancer, or its treatment, in a given clinical trial. They can provide additional information not provided by general (core) questionnaires. One difficulty is that the timing of the symptom checklists intended to capture only the side effects of treatment may differ from that of HRQL assessments. The effects of some symptoms, which are expected to peak within a day or two (i.e., nausea and vomiting after chemotherapy), may not be captured by a questionnaire with only a 7-day time frame, whereas the same questionnaire may be adequate for symptoms that occur over longer periods of time (e.g., fatigue) after treatment. There is no theoretical reason why symptom checklists cannot be completed at the same time as are general HRQL questionnaires. Symptom checklists, in the form of a self-administered daily diary, are often used to capture symptoms induced by treatment.

However. Thus. similar to a case report form. in a symptom control study of antiemetics. For example.Health-related Quality of Life Outcomes 255 symptoms induced by treatment. For example. the use and wording of a conditional lead-in question must be carefully considered. Pater et al. It has been generally accepted that HRQL should be assessed by patient self-administered questionnaires because of the subjective nature of the content. DATA COLLECTION It is widely recognized that there is disagreement between physician rating and patients’ self-assessed HRQL scores (22). When a symptom checklist is added to the core questionnaire. patients were capable of assimilating the symptomatic experience for the corresponding recall period. even when the questionnaires are in the form of a booklet (patients sometimes miss an entire page of questions if these are placed on the reverse side). A clear information sheet about the HRQL assessment and instructions about how to ﬁll out the questionnaires are useful. It is also possible to add additional HRQL assessments (including checklists) at other times to capture the effects of maximum toxicity (12). Printing should always be done on single page. A lead-in instruction was used in a nausea and vomiting checklist to identify patients who had experienced nausea and vomiting in the past 7 days. the design of the diary has to be short and simple so that the data collection burden for both patients and the data center is minimized but the chance of capturing crucial symptomatic data is maximized. It is also important that the instrument is brief to reduce patient burden as much as possible and questions are easy to understand without using technical terminology or jargon. The data management approach suitable for obtaining HRQL data has to be considered to minimize the chance of missing data. Some appropriately scheduled quality-of-life assessments have been shown to capture critical information when compared with a symptom checklist. HRQL Questionnaires The format and layout of the questionnaire should be spacious and clear. III. The following are a few points about the general principles of data management practices. the investigators wished to know how nausea and vomiting affected patients’ functioning and HRQL. (21) showed that the timing of assessment (either day 4 or day 8 after chemotherapy) and the recall period (either 3 days or 7 days) using the EORTC QLQ-C30 is associated closely with the occurrence of nausea and vomiting as captured by a daily diary using a visual analogue scale. using a larger than usual font for instructions and questions to avoid having double answers for one item and missing answers in the adjacent item. A. and they were then asked to assess the impact of that nausea and .

e. To develop a general computer program for monitoring compliance. B. they should be included within the protocol. it may be difﬁcult to monitor HRQL compliance when the schedules and frequency of collections vary between many trials. Special computer programs are needed to monitor compliance in speciﬁc studies. and possible solutions may be attempted. The schedule for HRQL assessments should coincide with the schedule of regular follow-up visits as much as possible to facilitate data collection. If the instructions are speciﬁc to a particular trial. Often there has been a training session on HRQL data collection provided for the clinical research personnel both at data collection centers and at central data processing and analysis centers.256 Zee and Osoba vomiting on their functioning. It was also noted that the nausea and vomiting domain based on two items from the core HRQL questionnaire had a high correlation with the nausea and vomiting scores from the daily diary (21). most research groups reported that the baseline compliance rates were above 90%. An ongoing interaction between the clinical research personnel in the clinics and the study coordinator of the central ofﬁce is useful to understand the problems a particular center might be facing. compliance (i. HRQL questionnaire completion rates) in later studies has improved. Standardized telephone interviews may also be used. in the 70% range. A clear set of instructions or a full manual should be developed for the clinical research personnel in the centers to follow. particularly if HRQL assessment is required at times not corresponding to clinic visits. There are a number of common characteristics of studies with good compliance. Compliance Some earlier studies reported difﬁculties collecting HRQL data (23). However. The general awareness of the value of HRQL data has been raised. The study coordinator can check this item with . It was discovered that some patients who had reported no experience of nausea and vomiting in their daily diary provided answers to the checklist of questions and some patients who should have answered did not answer the checklist. the National Surgical Adjuvant Breast and Bowel Project Breast Cancer Prevention Trial had a very high compliance rate for the ﬁrst 12month assessments on the placebo arm (23). For example. In a workshop on ‘‘Missing Data in Quality of Life Research in Cancer Clinical Trials’’ (30) held in 1997. The compliance rate while patients were receiving treatment was in the 80% range and after patients completed treatment. and more attention has been given to achieve high compliance (24–29). a simple question on the case report form asking whether HRQL should be performed at the current visit is extremely helpful. It is generally better to complete the HRQL questionnaires at a standardized time in the patient visit while patients are still in the clinic rather than have patients take questionnaires home and mail them back at speciﬁc time points. In the data center. However.. the core questionnaire in the same study had very good completion rates.

For each HRQL assessment, a front-page (cover sheet) form is required from the center. This front page contains information on the date and location of completion and whether assistance was required at the time of completion. If a quality-of-life assessment was scheduled but patients did not complete a questionnaire, then the reasons for noncompliance are documented (31).

IV. SAMPLE SIZE

One of the critical aspects in the design of a phase III clinical trial involving HRQL as an end point(s) is sample size calculation. In general, considering a comparative study between two treatment arms, the primary objective is to compare the study treatment with the control treatment to show the difference in effects on HRQL data. Since quality of life is a multidimensional construct, the hypothesis of interest should be clearly defined. The study may have a specific question in mind, for example, a study designed to evaluate the impact on cognitive functioning in patients with metastatic breast cancer receiving high-dose chemotherapy versus a treatment that may improve cognitive functioning. The hypothesis for this study is to focus specifically on cognitive functioning. Other studies may focus on different dimensions; for example, in symptom control studies, the primary end point may be a specific symptom such as nausea and vomiting induced by moderately or highly emetogenic chemotherapy, or fatigue. If the study question is not about a specific functioning domain or symptom, the global quality-of-life scores can be used as the end point for the sample size estimation. The study must be adequately powered to be able to make a firm conclusion about the HRQL hypothesis.

In this setting, a hypothesis-testing approach is used for the quality-of-life data. To determine the sample size required for a trial, several quantities must be considered: (1) the significance level at which we wish to perform our hypothesis test (usually 5%) and whether it is to be one-tailed or two-tailed; (2) the smallest clinically meaningful difference; (3) the power of the statistical test, which is the probability of rejecting the null hypothesis if the real effect is larger than the smallest clinically meaningful difference specified (usually 80% or 90%); and (4) the variability of the outcome measure estimated from previous studies. The most difficult quantity to justify in the sample size calculation is the smallest clinically meaningful difference. This is because the meaning of a change in HRQL depends on the perspective of the potential user of the information (31). The societal perspective considers the degree of importance at a population level, where small differences may be important because of the large number of individuals who may be affected.

At the institution level, one may consider a degree of change to be large enough if it leads to the adoption of certain health care policies. In a randomized clinical trial, the magnitude of change may be considered clinically worthwhile when the degree of change is large enough to cause most clinicians to consider using a specified study intervention in a given situation (e.g., discontinuation or alteration of treatment). In fact, none of the above definitions seems sensitive enough to provide us with a practical criterion that defines the smallest clinically meaningful difference. However, since HRQL is a subjective end point, the change perceived by patients as being meaningful should be considered important. Osoba et al. (31) used a Subjective Significance Questionnaire asking about physical, emotional, and social functioning and global quality of life to assess the degree of change in the EORTC QLQ-C30 scores that was perceptible to patients with breast cancer and small cell lung cancer. The results showed that patients with breast cancer and small cell lung cancer perceived a change of 6.9 to 10.7 (on a 0- to 100-point scale) for global quality of life, respectively.

From this information, a sample size formula can be derived for a randomized trial with two treatment arms, using global quality of life as the primary end point. The sample size formula reviewed by Lachin (32) can be used:

n = 2 (Z1-α/2 + Z1-β)² σ² / d²

where d is the smallest clinically meaningful difference and σ is the standard deviation for global quality of life, which can be determined from previous studies. For example, a sample size of 63 patients per arm is required to detect a difference of 10 points using a two-sided 5% level test with 80% power if the standard deviation is 20. To detect a difference of seven points, 128 patients per arm would be needed.

Another consideration in sample size estimation is the multidimensional aspect of HRQL. Here we are adopting the view that global quality of life should be treated as a separate domain and that this information cannot be assimilated from other existing domains and symptoms. Otherwise, when HRQL is considered as an end point, inclusion of other functioning domains or symptoms may inflate the type I error of the hypothesis tests. It is advisable to control for the increase in type I error using a Bonferroni-type adjustment for multiple end points when we estimate the required sample size. If global quality of life is not the only specific domain of interest, a global test statistic for multiple end points such as those proposed by O'Brien (33) or Tandon (34) may be more appropriate. A discussion about matching the clinical questions for multiple end points to appropriate statistical procedures can be found in O'Brien and Geller (35).
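Plugging numbers into this formula is a one-line calculation; the short Python sketch below (assuming the scipy package is available for the normal quantiles) reproduces the two worked examples, up to rounding of the quantiles.

```python
# Sketch of the two-arm sample size formula reviewed by Lachin (32):
# n per arm = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / d^2
from scipy.stats import norm

def n_per_arm(d, sigma, alpha=0.05, power=0.80):
    """Patients per arm to detect a mean difference d in a QOL score."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return 2 * (z_a + z_b) ** 2 * sigma ** 2 / d ** 2

print(n_per_arm(d=10, sigma=20))  # ~62.8, i.e., 63 patients per arm
print(n_per_arm(d=7, sigma=20))   # ~128.1, i.e., roughly 128 patients per arm
```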

In many clinical trials, the HRQL assessments are scheduled to be done repeatedly over time. If the analysis is to be done based on summary statistics for individuals' repeated measurements, the method by Dawson (36) can be used. Examples of summary statistics include average postrandomization HRQL assessments, the slope of HRQL scores along time, last observation, and area under the curve. For a vector Yij of (K + 1) observed repeated HRQL measurements for treatment i and subject j, the observed measures are represented by a random effects model:

Yij = µi + X′bij + eij

where µi = (µi0, µi1, ..., µiK) are the group means for the (K + 1) repeated measures, bij is a q × 1 vector of random subject effects distributed as N(0, D(q × q)), and eij is random error distributed as N(0, σ²e I(K+1)). Yijk is distributed as N(µik, x′k D xk + σ²e), where x′k is the (k + 1)th row of X. The summary statistic Sij = c′Yij can be written as a linear combination of the observations. The average of the summary statistics for the two treatment groups is denoted by Σj Sij /ni and is distributed as N(c′µi, [c′XDX′c + c′c σ²e]/ni). The sample sizes for the two treatment groups are n1 and n2 = mn1, respectively. Therefore,

n1 = (c′XDX′c + c′c σ²e)(1 + 1/m)(Z1-α/2 + Z1-β)² / [c′(µ1 − µ2)]²

For example, the last observation is obtained by defining c′ = (0, ..., 0, 1) and the slope by c′ = (−K/2, −K/2 + 1, ..., K/2). When there are missing data and the data can be separated into strata according to missing data patterns, the above sample size formula can be modified (36).

For analysis based on multivariate analysis of variance for repeated measures, the sample size can be determined using the method proposed by Rochon (37). Let y′ij = [yij1, ..., yijT] denote the set of repeated measures of quality of life for the jth individual in the ith treatment group, and assume that each yij ~ MVN(µi, Σ), a multivariate normal distribution with µ′i = [µi1, ..., µiT] representing the mean quality-of-life scores for treatment group i = 1, 2 at time t = 1 to T, and a variance–covariance matrix Σ that can be written as a function of a vector of anticipated standard deviations σ and the correlation matrix P, i.e., Σ = Dσ P Dσ, where Dσ is a diagonal matrix whose elements consist of σ′ = [σ1, ..., σT] and P is a function of ρ. To estimate the sample size, we need to assume either a compound symmetry (PCS) or autoregressive (PAR) correlation structure. Consider the hypothesis H0: Hδ = 0 vs. Ha: Hδ ≠ 0, where δ = µ1 − µ2 and H is an (h × T) matrix of full row rank, imposing h linearly independent restrictions on δ.

The parameters required to be specified include a vector of δ and a value for ρ. In a repeated-measures approach, we need to formulate the hypothesis of interest, an anticipated difference δ, and the matrix H. In particular, a test of whether the treatment difference is consistent from evaluation to evaluation is of interest, i.e., a test of treatment × time interaction, where H = [−1, I(T−1)] and I(T−1) is an identity matrix of dimension T − 1. The sample size can be determined using Hotelling's T² distribution with h and n1 + n2 − 2 degrees of freedom and the noncentrality parameter.

Instead of using the multivariate analysis of variance (MANOVA) approach, the method of Liu and Liang (38) may be used. They proposed a modification of the sample size and power formula of Self and Mauritsen (39) to handle correlated observations, based on the generalized estimating equation method and a quasiscore test statistic. This method requires specification of the regression model for the marginal mean and parameters for both null and alternative hypotheses. Since there is no closed-form expression, an iterative procedure is required.

V. ANALYSIS

The analysis of HRQL data presents a number of challenges to statisticians. Some of these problems have been discussed in previous sections and include (1) dealing with multiple end points due to the multidimensional nature of HRQL data, (2) the administration and collection of questionnaires, (3) the careful implementation of data management procedures and monitoring, (4) longitudinal data collection that may happen at irregular intervals due to delay in treatment and missing follow-up visits, (5) dealing with missing data, and (6) difficulty in the interpretation when missing data are nonrandom.

A. General Methods of Analysis for HRQL Data

The analysis of HRQL data at a cross-sectional time point is the simplest way of looking at the data. Simple, cross-sectional, descriptive statistics at each time interval can be used to describe the overall trend of the two treatment groups. This method is simple but may overlook important changes in HRQL along time. Comparisons using this method do not take into account within-patient variation and may inflate the type I error. The tests between mean values of two treatment groups at different time intervals may also hide significant individual changes and do not take into account different proportions of missing data in the two arms. An alternative method is to summarize the longitudinal data into a summary statistic before performing a between-arms comparison, as suggested by Dawson and Lagakos (40), but it suffers similar problems to the cross-sectional analysis.
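As a concrete illustration of the summary-statistic strategy, the sketch below reduces each patient's repeated scores to a single number (the area under the curve, computed by the trapezoidal rule) and then compares arms with an ordinary two-sample t test. The data are simulated solely for illustration.

```python
# Sketch of the summary-statistic approach: reduce each patient's repeated
# QOL scores to one number (AUC here), then compare arms. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
weeks = np.array([0, 6, 12, 18, 24])

def simulate_arm(n, slope):
    # baseline ~ N(70, 15) on a 0-100 scale, linear drift plus noise
    base = rng.normal(70, 15, size=(n, 1))
    scores = base + slope * weeks + rng.normal(0, 8, size=(n, weeks.size))
    return np.clip(scores, 0, 100)

arm_a = simulate_arm(60, slope=0.00)
arm_b = simulate_arm(60, slope=-0.25)

auc_a = np.trapz(arm_a, weeks, axis=1)   # one summary value per patient
auc_b = np.trapz(arm_b, weeks, axis=1)

t, p = stats.ttest_ind(auc_a, auc_b)
print(f"mean AUC: arm A {auc_a.mean():.1f}, arm B {auc_b.mean():.1f}, p = {p:.3f}")
```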

Ganz et al. (17) suggested assessing the general change of HRQL over time and called this approach ''pattern analysis.'' For each individual, the differences between two consecutive quality-of-life scores for a specific domain were calculated, and the sign of the differences was recorded (positive, zero, or negative). Based on the sign of the differences, the patterns of changes were classified into one of three categories: consistent increase over time, consistent decrease over time, or no consistent patterns of increase or decrease. The patterns were compared between the two treatment groups using ordered polytomous logistic regression in which the patterns were treated as ordered categorical variables. The treatment group assessment adjusted for confounding variables was then carried out. However, the definition of consistent increasing or decreasing patterns is rather arbitrary, especially for studies with long-term follow-up, and it only provides a rough estimate of the general changes in quality of life.

Other methods include univariate repeated measures and MANOVA as described in Zee and Pater (41) for continuous repeated measures. For categorical data, Koch et al. (42) suggested a weighted least-squares approach as described in Grizzle et al. (43) to determine a test statistic for testing the hypothesis. A review of other methods by Davis (44) includes generalized estimating equations of Liang and Zeger (45) and the two-stage method of Stram et al. (46). A different approach was proposed by Zwinderman (47) using a logistic latent trait model. However, when the number of time points is large, the number of parameters will increase and the estimation procedure for a conditional logistic model will again become complicated. All these methods require a transformation of the time axis into discrete time intervals to set up for the repeated-measures analysis. Since the number of repeated measures of HRQL in clinical trials may be quite large, the repeated-measures approach may require a large number of parameters in the model.

B. Methods of Analysis with Missing Data

For HRQL data with missing observations, the summary statistics for repeated measurements approach can be used. Dawson and Lagakos (40) proposed a stratified test for repeated-measures data that contain missing data for comparing two treatment groups in a randomized trial. Suppose that the quality-of-life measurements Yijk were measured at xk (k = 0, 1, ..., K) time points for subject j (j = 1, ..., ni) within group i (i = 1, 2). The null hypothesis of interest is that the outcome vectors are distributed equally in the two groups, H0: F1(y) = F2(y) vs. H1: F1(y) ≠ F2(y). The type I error of the test is properly retained when the distribution of missing patterns is the same and the distribution of outcome conditional on the pattern of missing data is also the same between the two treatment groups.

Suppose that the summary statistic S is some scalar function of Y. Under H0, the n1 + n2 vectors Y11, ..., Y1n1, Y21, ..., Y2n2 are independent and identically distributed, as are the corresponding values of S, say S11, ..., S1n1, S21, ..., S2n2. Consequently, any distribution-free test comparing the summary statistics in the two groups will have the correct size under H0. Shih and Quan (48) further assessed the situations when some sufficient conditions as proposed by Dawson and Lagakos are not met, for example, when data are not missing completely at random or the test statistics across strata indicate a treatment by missing data pattern interaction.

Curran et al. (49) gave a general review of the tests of missing data assumptions with respect to repeated HRQL data. They focused particularly on the method proposed by Ridout (50), using logistic regression to model the missing data pattern and examine whether the assumption of missing completely at random (MCAR) is valid. Park and Davis (51) provided a test of the missing data mechanism for repeated categorical data. Lipsitz et al. (52) extended the generalized estimating equations method proposed by Liang and Zeger (45) to incorporate repeated multinomial responses. However, this class of model requires the MCAR assumption. Little (53) gave a comprehensive review of the modeling approaches for various drop-out mechanisms in repeated-measures analysis and proposed the pattern-mixture model for testing various assumptions. Schluchter (54) proposed a joint mixed-effect and survival model to estimate the association between the repeated measures and the drop-out mechanism. Fairclough et al. (55) used these methods with respect to quality-of-life examples.

Another model-based approach was proposed by Zee (56) based on growth curve models stratified for health state, which is likely to have different missing data mechanisms. For example, the health states can be separated by the period of time when patients are receiving protocol treatment versus the period after patients are off protocol treatment. This model requires only the missing at random assumption to be satisfied within each of the health states. Methods such as the pattern-mixture model (53) can be adopted in the growth-curve method to assess the missing data mechanism for both the whole data set and within health states to evaluate the appropriateness of the assumption.

C. Other Methods

The incorporation of HRQL end points in a study may be motivated by a highly toxic treatment where clinical benefit with respect to relative gain in survival is discounted by the duration of toxicity experienced from treatment and duration of progression due to disease. There is a trade-off between the duration of quality lifetime gained by the study treatment and the duration of toxicity due to treatment and relapse due to disease.

For the purpose of comparison in a randomized controlled trial, a utility measure is needed to summarize the trade-off for individual patients, which is traditionally determined using standard gamble or time trade-off techniques (58). Goldhirsh et al. (57) proposed a Quality-adjusted Time Without Symptoms and Toxicity (Q-TWiST) model to assess the benefit of adjuvant treatment in breast cancer. This method can be used to incorporate both survival and HRQL data in a single comparison between treatment groups. It is not necessary to obtain patient-derived utility scores. Instead, a sensitivity analysis, called a threshold utility analysis, can be performed to assess the relative benefits between treatment groups for all combinations of utility scores. Approximate 95% confidence intervals for the differences in Q-TWiST between the two arms for each set of utility values can be determined by the bootstrap method (61). A major limitation of this method is that it does not address change in psychosocial and emotional functioning. One may try to expand the utility coefficient and the definition of toxic side effects to cover the psychosocial and emotional functioning. However, the advantage and usefulness of the Q-TWiST model is compromised when the summary measure becomes more complicated in its interpretation. Other alternative methods include the Health Utility Index (59) and the cancer-specific Q-tility Index (60). These methods are rather difficult to implement because they are conceptually complex tools.

D. Conclusion

In summary, two different approaches are available to measure HRQL. The first is to measure multidimensional domains of HRQL and follow patients longitudinally. This approach provides a clear picture of the experience of patients with respect to the HRQL domains being measured. The average HRQL scores can be assessed using summary statistics or longitudinal data analysis methods for comparing the effect of treatments. Model-based methods such as the mixed effect model or growth-curve model are appropriate. Missing data patterns should be verified to identify potential violation of assumptions. Another approach is to define a totally new end point that incorporates overall survival and various health states of patients during and after treatment, such as the Q-TWiST end point.

REFERENCES

1. Osoba D. Measuring the effect of cancer on quality of life. In: Osoba D, ed. Effect of Cancer on Quality of Life. Boca Raton: CRC Press, 1991, pp 26–40.

Baum M. Portenoy R. Hays RD. 13. Seidman A. 16:456a. Oxford: Oxford University Press. Guidelines for measuring health-related quality of life in clinical trials. pp. Ren L. In search of the ‘‘quality’’ in quality-of-life research. In: Scheurlen H. Bullinger M. Methods and Practice. Livingston R. . Structural Equations with Latent Variables. Bruner D. J Natl Cancer Inst Monogr 1996. Shepherd F. 5. Houston C. Usakewicz J. Yao T. and other generic health concepts. Quality of life in phase II trials: a study of methodology and predictive value in patients with advanced breast cancer treated with paclitaxel and granulocytecolony stimulating factor. Osoba D. 11. 20: 7–9. Gnecco C. Fayers PM. 17. A practical guide for selecting quality-of-life measures in clinical trials and practice. 69:1729–1738. Ahmedzai S. 1988. 8. et al. 1991. Ganz P. 89–104. Cost of quality of life research in South West Oncology Group. Berlin: Springer. Souhrada M. Coefﬁcient alpha and the internal structure of tests. Boca Raton: CRC Press. Lapore J. 111:231–249. Boca Raton: CRC Press. Osoba D. 16. Mont E. Tan S. 1991. McCabe M. J Clin Epidemiol 1988. 7–23. Till JE. 4. Moore T. Proc ASCO 1997. Psychometrika 1951. Mor V. Thaler H. 7. 9. A Modular approach to quality-of-life assessment in clinical trials. Quality of Life Assessment in Clinical Trials. Aaronson NK. Aaronson NK. JNCI 1992. Recent Results in Cancer Research. Kay R. A randomised study of CODE plus thoracic irradiation versus alternating CAV/EP for extensive stage small cell lung cancer (ESCLC) (abstr). Guadagnoli E. Polinsky M. ed. In: Osoba D. Cronbach L. Lee J. Int J Radiat Oncol Biol Phys 1995. 41:1055–1058. Effect of Cancer on Quality of Life. Quality of life assessment in cancer treatment protocols: research issues in protocol development. J Nat Cancer Inst 1995. Measuring functioning.264 Zee and Osoba 2. 1998. Measurement of quality of life in clinical trials. Kaasa S. Ware JE Jr. 12. Norton L. 49:288– 294. pp. Effect of Cancer on Quality of Life. Cancer Clinical Trials: A Critical Appraisal. well-being. Rationale for the timing of health-related quality-of-life assessments in oncological palliative therapy. 3. Schag A. Moinpour C. Beitz J. Gotay C. Cheson B. 14. ed. Korn E. In: Osoba D. Osoba D. Justice R. 87:316– 322. In: Staquet MJ. Onetto N. 10. 20:11–16. 84: 575–579. 19– 35. 15. Quality of life measurement: a psychometic tower of Babel. pp. J Natl Cancer Inst Monogr 1996. 16:297–334. Murray N. Cancer Treat Rev 1996 22 (suppl A): 69–73. Kortmansky J. eds. eds. 6. New York: John Wiley & Sons. Oncology 1992. Breast conservation versus mastectomy: is there a difference in psychological adjustment or quality of life in the year after surgery? Cancer 1992. Bollen K. McCabe M. Salvaggio M. Beltangady M. Quality of life endpoints in cancer clinical trials: the Food and Drug Administration perspectives. Grechko J. 31:191–192.

Pater J. Cancer Treat Rev 1993 19 (suppl A): 43– 51. Chin C. 21. Who should measure quality of life. 29. Quality-of-life assessment: patient compliance with questionnaire completion. Sadura A. Pater J. 17: 511–651. Stat Med 1998. Controlled Clin Trials 1997. Geller N. Bernhard J. O’Brien P. the doctor or the patient? Br J Cancer 1988.Health-related Quality of Life Outcomes 265 18. 57:109–112. Stat Med 1998. 27. Palmer M. 23. Pater J. Stat Med 1998. J Natl Cancer Inst Monogr 1996. Hahn E. Assessing problems of cancer patients: psychometric properties of the Cancer Inventory of Problems Situations. 84:1023–1026. Zee B. 25. The Quality of Life Committee of the Clinical Trials Group of the National Cancer Institute of Canada: organization and functions. Siau J. 7:273–278. Plant H. Missing data in quality of life research in Eastern Clinical Oncology Group (ECOG) clinical trials: problem and solutions. Cancer 1988. 9:83–102. Schaq C. Gelber RD. J Nat Cancer Inst 1992. Lofter W. Cella D. Effects of altering the time of administration and the time frame of quality of life assessments in clinical trials: an example of using the EORTC QLQ-C30 in a large antiemetic trial. Zee B. Psychosocial outcome in a randomized surgical trial for treatment of primary breast cancer. Biometrics 1984. Cancer 1988. 31. Figlin R. 40:1079–1087. 17:603–612. Ganz P. Estimating the quality of life in a clinical trial of patients with metastatic lung cancer using Karnofsky performance status and the Functional Living Index—Cancer. Zee B. Workshop on Missing Data in Quality of Life Research in Cancer Clinical Trials: Practical and Methodological Issues. Myles J. Osoba D. Lachin J. 18:222–227. 28. 20:107–111. Ganz P. 2:93–113. Osoba D. J Clin Oncol 1998. Wellisch D. 35. Aadland R. 32. Dempsey E. Tandon P. Haskell C. Gebski V. 34. Introduction to sample size determination and power analysis for clinical trials. Dancey J. 9:819–827. . Procedures for comparing samples with multiple endpoints. Coates A. Schain W. Interpreting test for efﬁcacy in clinical trials with multiple endpoints. 17:533–540. et al. 33. Fairclough D. Control Clin Trials 1981. 26. Self-rating symptom checklists: a simple method for recording and evaluating symptom control in oncology. Kemeny M. Health related quality of life studies of the National Cancer Institute of Canada Clinical Trials Group. et al. Qual Life Res 1998. Stat Med 1990. O’Brien P. 19. Interpreting the signiﬁcance of changes in health-related quality of life scores. Myles J. 24. Pater J. Osoba D. Osoba D. Heinrich R. eds. Stat Med 1998. 30. 62:1231–1237. Completion rates in health-related quality of life assessment: approach of the National Cancer Institute of Canada Clinical Trials Group. Lynch D. Health Psychol 1990. Zee B. 1:211– 218. Rodrigues G. 17:547–559. 16:139–144. Application of global statistics in analysing quality of life data. 20. Osoba D. Osoba D. Webster K. Osoba D. 22. Qual Life Res 1992. 61:849–856. Soto N. Slevin ML. Quality of life studies of the Australian New Zealand Breast Cancer Trials Group: approaches to missing data. Gore M.

Zee B. Sample size calculations for two-group repeated-measures experiments. Koch G. Powers/sample size calculations for generalized linear models. Landis R. 46. Stat Med 1998. Stat Med 1994. Ware J. Stram D. Liu G. Sample size calculation for studies with correlated observations. 17: 757–766. Stat Med 1991. Biometrics 1993. 40. 10:1959–1980. Liang K. Grizzle J. Shih J. Zhao L. A test of missing data mechanism for repeated categorical data. Size and power of two-sample tests of repeated measures data. Biometrics 1977. Starmer F. Analysis of repeated ordered categorical outcomes with possibly missing observations and time-dependent covariates. 48. Growth curve model analysis for quality of life data. Analysis of repeated categorical data using generalized estimating equations. Peterson H. 54:323–330. Lipsitz S.266 Zee and Osoba 36. Koch G. Zeger S. Davis C. Biometrics 1993. 54. 47. A general methodology for the analysis of experiments with repeated measurement of categorical data. Self S. Biometrika 1986. 90:1112–1121. Fayers P. Stat Med 1998. Rochon J. The measurement of change of quality of life in clinical trials. . Biometrics 1991. Cella D. Davis CS. 41. Zwinderman A. 49:631–638. Dawson JD. Stat Med 1992. 17:697–709. 1991. Sample size calculation based on slopes and other summary statistics. Modelling the drop-out mechanism in repeated-measures studies. Statistical analysis of trials assessing quality of life. pp. Biometrics 1988. 53. Semi-parametric and non-parametric methods for the analysis of repeated measurements with applications to clinical trials. Incomplete quality of life data in randomized trials: missing forms. Biometrics 1969. Methods for the analysis of informatively censored longitudinal data. Biometrics 1998. 11:1861–1870. 113–123. 47:1383–1398. J Am Stat Assoc 1995. 37. Kim K. Machin D. 51. eds. 44:79–46. Lehnen R. Boca Raton: CRC Press. Molenberghs G. 25:489–504. 49. Stratiﬁed testing for treatment effects with missing data. Fairclough D. 49:1022–1032. Comparison of several model-based methods for analysing incomplete quality of life data in clinical trials. Liang K. Mauritsen R. 56. Longitudinal data analysis using generalized linear models. 47:1617–1618. 33:133–158. Stat Med 1998. The Effect of Cancer on Quality of Life. Schluchter MD. Wei L. 73:13–22. 50. Biometrics 1991. Testing for random dropouts in repeated measurement data. 39. In: Osoba D. 17:781–796. Freeman D. Stat Med 1990. 45. Analysis of categorical data by linear models. Ridout M. Lagakos SW. 43. Zee B. Bonomi P. 38. 13:1149–1163. Quan H. Freeman J. Biometrics 1997. 42. 9:931–942. 83:631–637. J Am Stat Assoc 1988. Biometrics 1998. 55. Park T. Little R. Dawson JD. 44. Curran D. 53:937–947. 54:782–787. Pater J. 52.

13:436.Health-related Quality of Life Outcomes 267 57. Feeny D. Goldhirsh A. Simes RJ. Multi-attribute preference functions: health utilities index. Taking quality of life into account in health economic analysis. J Clin Oncol 1989. 20:23–27. Torrance G. Gelber R. Furlong W. Stat Med 1990. 58. Glasziou P. Proc ASCO 1994. Cost and beneﬁt of adjuvant therapy in breast cancer: a quality adjusted survival analysis. 60. The Q-tility index: a new tool for assessing the health-related quality of life and utilities in clinical trials and clinical practices. Pharmacoeconomics 1995. Simes J. Coates A. 59. Quality adjusted survival analysis. J Natl Cancer Inst Monogr 1996. Weinstein M. Gelber RD. Fairclough D. Weeks J. Weeks J. 61. 9:1259–1276. Boyle M. 7:503–520. O’Leary J. . 7: 36–44. Paltiel D. Glasziou PP.


15
Statistical Analysis of Quality of Life

Andrea B. Troxel
Joseph L. Mailman School of Public Health, Columbia University, New York, New York

Carol McMillen Moinpour
Fred Hutchinson Cancer Research Center, Seattle, Washington

I. INTRODUCTION

In randomized treatment trials for cancer or other chronic diseases, the primary reason for assessing quality of life (QOL) is to broaden the scope of treatment evaluation. We sometimes characterize QOL and cost outcomes as alternative or complementary because they add to information provided by traditional clinical trials' end points such as survival, disease-free survival, tumor response, and toxicity. The challenge lies in combining this information in the treatment evaluation context.

There is fairly strong consensus that, at least in the phase III setting, QOL should be measured comprehensively (1–3). Although a total or summary score is desirable for the QOL measure, it is equally important to have separate measures of basic domains of functioning (e.g., physical, emotional, social, role functioning) and symptom status. Symptoms specific to the cancer site and/or the treatments under evaluation are also usually included to monitor for toxicities and to gauge the palliative effect of the treatment on disease-related symptoms. In some trials, investigators may study additional areas such as financial concerns, family well-being, spirituality, and satisfaction with care. Specific components of QOL provide information not only on specific interpretation of treatment effects but also can identify areas in which cancer survivors need assistance in their return to daily functioning.

Table 1  Examples of Comprehensive QOL Questionnaires

SF-36 (Refs. 8–14)
QOL dimensions (# items): Physical functioning (10); Role-physical (4); Bodily pain (2); General health (5); Vitality (4); Social functioning (2); Role-emotional (4); Mental health (5); Health transition (1).
Scores: Physical component summary score (35); Mental component summary score (35); no total score.

EORTC QLQ-C30, Version 2 (Refs. 15–21)
QOL dimensions (# items): Core (30): Physical functioning (5); Role functioning (2); Cognitive functioning (2); Emotional functioning (4); Social functioning (2); Symptom scales: Fatigue (3), Pain (2), Nausea/vomiting (2); Single-item symptoms (5); Financial impact (1); Global HRQOL (1).
Modules: Cancer-specific: lung (13), breast (23), head and neck (35), esophageal (24), ovarian, colorectal (38); others in development: bladder, pancreatic, prostate, myeloma, ophthalmic, leukemia, body image. Treatment-specific: high-dose chemotherapy, palliative care.
Scores: No total score; module scores.

Functional Assessment of Cancer Therapy (FACT),† Versions 3 and 4 (Refs. 22–26)
QOL dimensions (# items): Core:* Physical (7); Functional (7); Social (7); Emotional (6).
Scores: FACT-G (core items); TOI (Trial Outcome Index, composed of physical, functional, and cancer-specific module scores); total score.
Additional concerns (modules): Cancer-specific: breast (9), lung (9), colorectal (9), ovarian (12), bladder (12), brain (19), cervix (15), head and neck (11), esophageal (17), hepatobiliary (18), CNS (12), endocrine (18), prostate (12). Treatment-specific: BMT (23), taxane (16), biologic response modifiers (13), neurotoxicity from systemic chemotherapy (11). Symptom-specific: anorexia/cachexia (12), fatigue (13), anemia/fatigue (20), diarrhea (11), fecal incontinence (12), urinary incontinence (11). Other modules/scales: Spirituality (12), FAMS (59), FAHI (47), FANLT (26), miscellaneous.

Cancer Rehabilitation Evaluation System—Short Form (CARES-SF) (Refs. 27–30)
QOL dimensions (# items): Physical (10); Psychosocial (17); Medical interaction (4); Marital (6); Sexual (3); miscellaneous items contributing to overall score (19).
Scores: Total score; 5 subscale scores.

Abbreviations: EORTC, European Organization for Research and Treatment of Cancer; BMT, bone marrow transplantation; CNS, central nervous system; TOI, Trial Outcome Index; FAMS, Functional Assessment of Multiple Sclerosis; FAHI, Functional Assessment of HIV Infection; FANLT, Functional Assessment of Non-Life Threatening Conditions.
* Relationship with doctor items no longer included in FACT scores.
† The Functional Assessment of Chronic Illness Therapy (FACIT) measurement system represents the current version (#4) of the FACT questionnaires.

Data on specific areas of functioning can also help suggest ways to improve cancer treatments. It is precisely in the treatment of advanced disease, however, that QOL data provide important outcomes; that is, they can document the extent of palliation achieved by an experimental treatment. Sugarbaker et al. (4) conducted a study in which the radiotherapy regimen was modified as a result of QOL data.

QOL data should be generated by patients in a systematic, standardized fashion. Interviews can be used to obtain these data, but self-administered questionnaires are usually more practical in the multiinstitution setting of clinical trials. Selected questionnaires must be reliable and valid (5) and sensitive to change over time (6,7). Good measurement properties, along with appropriate item content, ensure a more accurate picture of the patient's QOL. Table 1 describes four QOL questionnaires that meet these measurement criteria and are frequently used in cancer clinical trials. The FACT and EORTC QOL questionnaires have a core section and symptom modules specific to the disease or type of treatment. Others, like the SF-36 and CARES-SF, can be used with any cancer site but may require supplementation with a separate symptom measure to address concerns about prominent symptoms and side effects.

When QOL research is conducted in many and often widely differing institutions, quality control is critical to ensure clean, complete data. The first step is to make completion of a baseline QOL assessment a trial eligibility criterion. Even with the best quality control procedures, submission rates for follow-up QOL questionnaires can be less than desirable. Enforcement of the same requirements for both clinical and QOL follow-up data communicates the importance of the QOL data for the trial. Centralized monitoring of both submission rates and the quality of data submitted must also be considered. Ongoing training of clinical research associates is mandatory because QOL data are still not considered routine and there is a fair degree of turnover in data management staff. This effort requires substantial staff time and therefore cannot be done without adequate resources.

An increasing focus on QOL studies in the context of clinical trials has resulted in accumulation of QOL data in a wide variety of conditions and patient populations. Although this is a rich source of information, data analysis is often complicated by problems of missing information. Patients sometimes fail to complete QOL assessments because of negative events they experience, such as treatment toxicities, disease progression, or death, particularly in the advanced-stage disease setting. Because not all patients are subject to these missing observations at the same rate, especially when treatment failure or survival rates differ between arms, the set of complete observations is not always representative of the total group; analyses using only complete observations are therefore potentially biased.

Several methods have been developed to address this problem. They range in emphasis from the data collection stage, where attention focuses on obtaining the missing values, to the analysis stage, where the goal is adjustment to properly account for the missing values. We first describe methods that are appropriate for complete or nearly complete data and then move on to techniques for incomplete data sets.

II. ANALYSIS OPTIONS FOR ''COMPLETE'' DATA SETS

A. Longitudinal Methods

In general, the methods described below are applicable both to repeated measures on an individual over time and to measurements of different scales or scores on a given individual at the same point in time. Many studies of course use both designs, asking patients to fill out questionnaires comprising several subscales at repeated intervals over the course of the study.

1. Repeated-Measures ANOVA or MANOVA

Analysis of variance (ANOVA) and covariance (ANCOVA) or their multivariate versions (MANOVA) represent a very popular class of models for continuous QOL data. The models rely on an assumption of normally distributed data; the total variation in the data is then attributed to between-group and within-group portions. The groups may be defined by treatment arm, prognosis, grade, and so on. Hypotheses concerning differences among groups or between combinations of groups may be tested. If missing data are random (see below), unbiased estimates will be obtained.

2. Generalized Linear Models

A second general class of models is the likelihood-based generalized linear model (GLM) (31). This framework is attractive since it accommodates a whole class of data rather than being restricted to continuous Gaussian measurements; it allows a unified treatment of measurements of different types, with specification of an appropriate link function that determines the form of the mean and variance. Estimation proceeds by solving the likelihood score equations, usually using iteratively reweighted least-squares or Newton-Raphson algorithms. GLMs can be fit with generalized linear interactive modeling (GLIM) (32) or with Splus, using the glm() function (33). Generalized linear mixed models are a useful extension, allowing for the inclusion of random effects in the GLM framework. SAS macros are available to fit these models.
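The text cites GLIM, S-plus, and SAS for fitting these models; as a purely illustrative alternative, the sketch below fits a linear mixed model (random patient intercept, treatment-by-time fixed effects) with Python's statsmodels on simulated long-format data. The column names and simulated values are hypothetical.

```python
# Sketch: linear mixed model for repeated QOL scores (random patient intercept),
# one simple realization of the repeated-measures ideas above. Assumes the
# statsmodels package; data and column names are simulated/hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n, times = 80, [0, 3, 6]
df = pd.DataFrame({
    "patient": np.repeat(np.arange(n), len(times)),
    "arm": np.repeat(rng.integers(0, 2, n), len(times)),
    "month": np.tile(times, n),
})
subject_effect = np.repeat(rng.normal(0, 8, n), len(times))   # between-patient variation
df["qol"] = 70 - 2 * df["arm"] * df["month"] / 6 + subject_effect + rng.normal(0, 6, len(df))

model = smf.mixedlm("qol ~ arm * month", df, groups=df["patient"])
result = model.fit()
print(result.summary())
```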

3. Time-to-Event Analysis

If attainment of a particular QOL score or milestone is the basis of the experiment, time-to-event or survival analysis methods can be applied. Once the event has been clearly defined, the analysis tools can be directly applied. These include Kaplan-Meier estimates of ''survival'' functions (35), Cox proportional hazards regression models (36) to relate covariates to the probability of the event, and logrank and other tests for differences in the event history among comparison groups. The QOL database, however, supports few such milestones at this time.

4. Change-Score Analysis

Analysis of individual or group changes in QOL scores over time is often of great importance in longitudinal studies. Growth curve models can be used to accomplish this, either at a population level or using a two-stage model to allow individual rates of change that are then dependent on characteristics such as treatment group or demographic variables. Change-score analysis has the advantage of inherently adjusting for the baseline score but must also be undertaken with caution, as it is by nature sensitive to problems of regression to the mean (34).

III. TYPES OF MISSING DATA PROBLEMS

As mentioned briefly above, QOL data are often subject to missingness. Depending on the nature of the mechanism producing the missing data, analyses must be adjusted differently. Below, we list several types of missing data and provide general descriptions of the mechanisms along with their more formal technical names and terms.

A. Missing Completely at Random (MCAR)

This mechanism is sometimes termed ''sporadic.'' Missing data probabilities are independent of both observable and unobservable quantities; observed data are a random subsample of complete data. This type of mechanism rarely obtains in real data.

B. Missing at Random (MAR)

Missing data probabilities are dependent on observable quantities (such as covariates like age, sex, and stage of disease), and the analysis can generally be adjusted by weighting schemes or stratification. This type of mechanism can hold if subjects with poor baseline QOL scores are more prone to missing values later in the trial or if an external measure of health, such as the Karnofsky performance status, completely explains the propensity to be missing.
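If a milestone is defined (say, the first drop of at least 10 points from the baseline score), the resulting (time, event) pairs feed directly into the standard survival toolkit. The sketch below computes the Kaplan-Meier estimate from scratch for a small hypothetical data set; in practice a package such as lifelines, or SAS PROC LIFETEST, would normally be used instead.

```python
# Sketch: Kaplan-Meier estimate of time to a hypothetical QOL milestone
# (e.g., first drop of >= 10 points from baseline). Times are in weeks;
# event = 1 if the milestone was observed, 0 if the patient was censored.
import numpy as np

times  = np.array([4, 6, 6, 8, 12, 12, 16, 20, 24, 24])
events = np.array([1, 1, 0, 1,  1,  0,  1,  0,  1,  0])

surv = 1.0
print("week  at_risk  events  S(t)")
for t in np.unique(times[events == 1]):            # step down at event times only
    at_risk = np.sum(times >= t)
    d = np.sum((times == t) & (events == 1))
    surv *= 1.0 - d / at_risk
    print(f"{t:4d}  {at_risk:7d}  {d:6d}  {surv:.3f}")
```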

Because the missingness mechanism depends on observed data, analyses can be conducted that adjust properly for the missing observations. More precisely, likelihood-based and Bayesian inference are valid under both MCAR and MAR, but nonlikelihood-based inference is valid under MCAR only. Rubin (37) addressed the assumptions necessary to justify ignoring the missing data mechanism and established that the extent of ignorability depends on the inferential framework and the research question of interest. The research question is relevant when considering conditional analyses given complete data; results assuming MCAR and MAR may differ.

C. Missing Not at Random (MNAR), Nonrandom, or Nonignorable

Missing data probabilities are dependent on unobservable quantities, such as missing outcome values or unobserved latent variables describing outcomes such as general health and well-being. This type of mechanism is fairly common in QOL research. One example is treatment-based differences in QOL compliance, due to worse survival on one arm of the trial. Or, subjects having great difficulty coping with disease and treatment may be more likely to refuse to complete a QOL assessment.

D. Evaluating the Missing Data Problem

As noted, even with the best efforts at data collection, clinical trials studying QOL will likely suffer somewhat from missing QOL data. To determine which methods of statistical analysis will be appropriate, the analyst must first determine the patterns and amount of missing data and identify the mechanisms that generate missing data. Identification of missing data mechanisms in QOL research proceeds through two complementary avenues: collecting as much additional patient information as possible, and applying simple graphical techniques and using hypothesis testing to distinguish missing data processes. Many groups include a cover sheet with the QOL questionnaire to record the reason for incomplete assessments. Other covariates, such as survival and disease status and toxicities, are collected routinely on patients. This information can be used to model the missing data and QOL processes and determine the extent and nature of the missing data process. Graphical presentations can be crucial as a first step in elucidating the relationship of missing data to the outcome of interest and providing an overall summary of results that is easily understood by nonstatisticians. A clear picture of the extent of missing QOL assessments is necessary both for selection of the appropriate methods of analysis and for honest reporting of the trial with respect to reliability and generalizability.

In clinical trials, this means summarizing the proportions of patients in whom assessment is possible (e.g., surviving patients still on study) and then the pattern of assessments among these patients. Machin and Weeden (38) combine these two concepts in Figure 1, using the familiar Kaplan-Meier plot to indicate survival rates and a simple table describing QOL assessment compliance.

Figure 1  Kaplan-Meier estimates of the survival curves of patients with small cell lung cancer by treatment group (after MRC Lung Cancer Working Party, 1996). The times at which QOL assessments were scheduled are indicated beneath the time axis. The panel indicates the QOL assessments made for the seven scheduled during the first 6 months as a percentage of those anticipated from the currently living patients. (From Ref. 38, reproduced with permission, copyright John Wiley & Sons Limited.)

For this study of palliative treatment for patients with small cell lung cancer and poor prognosis, the Kaplan-Meier plot illustrates why the expected number of assessments is reduced by 60% at the time of the final assessment. The table further indicates the increase in missing data even among surviving subjects, from 25% at baseline to 71% among the evaluable patients at 6 months.

If the reasons for missing assessments differ over time or across treatment groups, it may be necessary to present additional details about the missing data. A useful technique is to present the available data separately for patients with different amounts of and reasons for drop-out. This is illustrated in Figure 2, due to Troxel (39), where estimates of average symptom distress in patients with advanced colorectal cancer are presented by reason for drop-out and duration of follow-up. Higher symptom distress is reported by patients who drop out due to death or illness, and the worsening of symptom status over time is more severe for these patients as well.

A particularly simple but informative display is given in Figure 3, due to Coates and Gebski (41). This illustrates the differences in the physician ratings of QOL between patients who completed and did not complete the QOL self-assessment. The average differences, with 95% confidence intervals, clearly communicate the consistent trends across all the time points and the statistical significance at specific points in time.

Finally, graphical presentations can convey results so that readers may individually balance the importance of early versus late differences among treatment arms or across the different domains of QOL. Since baseline QOL scores are often predictive of survival (42), the usual Kaplan-Meier plots, stratified by baseline QOL, can be very informative, especially in relation to patients' QOL.

A second step is describing the missing data mechanism.

1. Comparing MCAR and MAR

Assuming a monotone pattern of missing data, Diggle (43) and Ridout (44) proposed methods to compare MCAR and MAR drop-out. The former proposal involves testing whether scores from patients who drop out immediately after a given time point are a random sample of scores from all available patients at that assessment. The latter proposal centers on logistic regression analysis to test whether observed covariates affect the probability of dropout. Patients with a decreasing QOL score may also be more likely to drop out, as demonstrated by Curran et al. (40), where a change score between two previous assessments was predictive of drop-out.
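A minimal version of the logistic-regression check can be run with standard software. In the sketch below the data set, its column names, and the choice of covariates (previous QOL score, treatment arm) are hypothetical; the idea is simply to test whether observed quantities predict drop-out at the next assessment, which is evidence against MCAR in favor of MAR.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 400

# Hypothetical observed data just before a scheduled assessment.
prev_qol = rng.normal(65, 15, n)            # last observed QOL score
arm = rng.integers(0, 2, n)                 # randomized treatment arm
# Simulated drop-out that depends on the previous (observed) score: an MAR mechanism.
logit = 1.5 - 0.05 * prev_qol + 0.3 * arm
dropped = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(np.column_stack([prev_qol, arm]))
fit = sm.Logit(dropped, X).fit(disp=False)
print(fit.summary(xname=["const", "prev_qol", "arm"]))
# A clearly nonzero coefficient for prev_qol argues against MCAR.
```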

Figure 2 Average scores by type and length of follow-up: Symptom Distress Scale. Line types distinguish complete follow-up, lost-death, lost-illness, and lost-other. (From Ref. 39, reproduced with permission, copyright John Wiley & Sons Limited.)

2. Testing for MNAR

As mentioned earlier, if likelihood or Bayesian inference is used, then distinguishing between MCAR and MAR is often not the primary concern. Recall that if either MCAR or MAR holds, the missing data mechanism depends on observed quantities only and inferences on Y can be based solely on the observed data. The main issue for likelihood or Bayesian inference is distinguishing between MAR and MNAR. Unfortunately, testing the assumptions of MAR against a hypothesis of MNAR is not trivial; such a procedure rests on strong assumptions that are themselves untestable (40).

Figure 3 ANZ 8614: difference in the Quality of Life Index (QLI) score (assessed by the physician) between patients who did or did not comply with self-assessment of QOL at the relevant time point, indicated in weeks after randomization. Plots show mean difference and 95% confidence intervals. Negative scores indicate that patients who did not comply with self-assessment of QOL had worse QOL as assessed by the physician using the QLI. The possible range of difference in scores is ±10. *Number not complying; number complying with self-assessment. (From Ref. 41, reproduced with permission, copyright John Wiley & Sons Limited.)

When fitting a nonignorable model, certain assumptions are made in the specification of the model about the relationship between the missing data process and unobserved data. These assumptions are fundamentally untestable. Molenberghs et al. (45) provide examples where different models produce very similar fits to the observed data but yield completely different predictions for the unobserved data. Little (46), discussing pattern-mixture models, suggests that underidentifiability is a serious problem with MNAR missing data models and that problems may arise when estimating the parameters of the missing data mechanism simultaneously with the parameters of the underlying data model. Similar problems may exist in the selection model framework (47). Because of the difficulties in identifying the missing data mechanism, analysis of repeated measures with missing data is not trivial. This is especially true for QOL assessments, where data may be missing for several reasons. If a sufficient amount of data is collected relating to why QOL questionnaires have not been completed, as described above, analysts will have a more solid basis for missing data models.
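Because the MNAR assumptions cannot be checked from the data, many analysts report a sensitivity analysis rather than a single model. The sketch below is one common device (a delta adjustment; it is not a method advocated in the text, and all data and names are invented): the unobserved scores are assumed to sit a fixed amount below what an MAR imputation would give, and the treatment comparison is recomputed over a range of such offsets.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

arm = rng.integers(0, 2, n)
baseline = rng.normal(65, 12, n)
follow_up = baseline + 5 * arm + rng.normal(0, 8, n)
# MAR-style missingness: low baseline scores are more often unobserved at follow-up.
missing = rng.uniform(size=n) < 1 / (1 + np.exp((baseline - 55) / 5))

# MAR imputation model fit on completers: regress follow-up score on baseline and arm.
X = np.column_stack([np.ones(n), baseline, arm])
beta = np.linalg.lstsq(X[~missing], follow_up[~missing], rcond=None)[0]
imputed_mar = X @ beta

for delta in [0, -5, -10, -15]:           # assumed shortfall of the unobserved scores
    y = follow_up.copy()
    y[missing] = imputed_mar[missing] + delta
    diff = y[arm == 1].mean() - y[arm == 0].mean()
    print(f"delta = {delta:4d}: estimated treatment difference = {diff:5.2f}")
```

How far the conclusion moves as delta varies is a direct display of how much the analysis leans on untestable assumptions.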

E. Special Problems

Several special scenarios arise with respect to QOL data collection, and a few are described below.

1. Missing Data due to Illness

It is perhaps inevitable in a clinical trial studying QOL that some subjects will be at certain times too ill to complete their QOL assessments. Since health status and QOL are almost certainly not independent, this can result in nonignorable missing data. Although these issues are of some concern in their own right, they become increasingly problematic when there is differential missing data between two treatment arms. Differential rates of illness, death, or relapse will generally result in differing amounts of missing data for the study arms, making a valid comparison of study treatments extremely difficult with respect to the QOL end point. To facilitate modeling of the missing data process in this situation, as much information as possible should be collected regarding the patient's health status, toxicity episodes, and clinical status. Proxy measures of the patient's QOL can also be useful, although the poor correlation between patient and proxy measures is well documented (48).

2. Missing Data due to Death

Obviously subjects who die before completion of the study will have shortened QOL vectors. Because vital status and QOL are almost certainly not independent, this too results in nonignorable ''missing data.'' It makes little sense to discuss what a patient's QOL score would have been after death, had it been observed. In this situation, conditional analyses of QOL, given survival up to a certain point, may be the most appropriate. Models that jointly assess both the QOL and survival end points can also be useful.

3. Enforced Missingness due to Study Constraints

In some trials, patients who fail or relapse are taken off-study; in general they are still followed for vital status, but, although every effort should be made to complete the QOL assessment schedule, subsequent QOL assessments are not always obtained. As with patients who have missing QOL data due to illness, this results in nonignorable missingness. This situation is different from that of missing data due to illness described above, since in this case the patient's potentially available data has been simply truncated rather than not observed. It is potentially less severe, since all patients who fail are taken off-study rather than some being self-selected for inability to complete a QOL assessment. Nonetheless, differential failure rates on treatment arms can have a devastating effect on the analysis of the QOL data.

IV. METHODS

Several extensions to standard mixed models have been proposed in the context of longitudinal measurements in clinical trials. Zee (49) proposed growth curve models where the parameters relating to the polynomial in time are allowed to differ according to the various health states experienced by the patient (e.g., on treatment, off treatment, postrelapse, etc.), and these may be either constant or varying across health states. This method requires that the missing data be MAR and may be fit with standard packages by simply creating an appropriate variable to indicate health state; in essence it is a type of pattern-mixture model (see below). Care must be taken to ensure that enough patients remain in the various health states to properly estimate the extra parameters. If there are intermittent patterns of missing data, these must be assumed to be MCAR.

Schluchter (50) proposed a joint mixed effects model for the longitudinal assessments and the time to drop-out. Suppose the time to drop-out, or censoring, is denoted by Ti. The model is as follows:

\begin{pmatrix} b_i \\ \log(T_i) \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ \mu_t \end{pmatrix}, \begin{pmatrix} B & \sigma_{bt} \\ \sigma_{bt}' & \tau^2 \end{pmatrix} \right)

The joint model allows the Ti (or a function of the Ti, in this case the log) to be correlated with the random effects bi through the covariance parameter σbt. This model allows MNAR data in the sense that the time of drop-out is allowed to depend on the rate of change in the underlying measurements. For example, patients with steeper rates of decline in measurements over time (as measured by the random effects bi) may be more likely to fail early. The model may also contain other covariates, such as the baseline response value or other relevant clinical information. Software to fit this model is not readily available.

A. Generalized Estimating Equations (GEEs)

GEEs (51) provide a framework to treat disparate kinds of data in a unified way. Instead, they require specification of only the first two moments of the repeated measures, rather than the likelihood. Estimates are obtained by solving an estimating equation of the following form:

U = \sum_{i=1}^{n} D_i' V_i^{-1} (Y_i - \mu_i) = 0

Here \mu_i = E(Y_i | X_i, \beta) and D_i = \partial\mu_i/\partial\beta are the usual mean and derivative functions and V_i is a working correlation matrix. For Gaussian measurements, the estimating equations resulting from the GEE are equivalent to the usual score equations obtained from a multivariate normal maximum likelihood model.
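For a Gaussian outcome with an independence working correlation, the estimating equation above reduces to ordinary least squares, with a robust ("sandwich") variance accumulated patient by patient to respect within-patient correlation. The short implementation below makes that concrete; the simulated data set and variable names are illustrative only, and a production analysis would use a dedicated GEE routine that also supports nontrivial working correlation structures.

```python
import numpy as np

rng = np.random.default_rng(3)
n_pat, n_vis = 200, 4

pat = np.repeat(np.arange(n_pat), n_vis)
time = np.tile(np.arange(n_vis), n_pat)
arm = np.repeat(rng.integers(0, 2, n_pat), n_vis)
u = np.repeat(rng.normal(0, 6, n_pat), n_vis)          # shared patient effect -> correlation
y = 60 + 2 * time + 4 * arm + u + rng.normal(0, 5, len(pat))

X = np.column_stack([np.ones(len(pat)), time, arm])    # mu_i = X_i beta (identity link)

# Independence working correlation: solving sum_i X_i'(Y_i - X_i beta) = 0 is OLS.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Robust sandwich variance, accumulated one patient at a time.
bread = np.linalg.inv(X.T @ X)
meat = np.zeros((3, 3))
for i in range(n_pat):
    idx = pat == i
    ri = y[idx] - X[idx] @ beta
    si = X[idx].T @ ri                                  # patient i's contribution to U
    meat += np.outer(si, si)
se = np.sqrt(np.diag(bread @ meat @ bread))

for name, b, s in zip(["intercept", "time", "arm"], beta, se):
    print(f"{name:9s} {b:7.2f}  (robust SE {s:.2f})")
```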

If conditioning arguments are used, the same estimates will be obtained from either method. GEEs produce unbiased estimates for data that are MCAR. Software is available in the form of an SAS macro (52). Although software exists to fit GEEs, additional programming is required to fit a weighted version. Extensions to the GEE do exist for data that are MAR: weighted GEEs will produce unbiased estimates provided the weights are estimated consistently (53,54). When the missingness probabilities depend only on observed covariates, such as the stage of disease, or responses, such as the baseline QOL score, a logistic or probit model can be used to estimate missingness probabilities for every subject; the weights used in the analysis, Wi, are then the inverses of these estimated probabilities. Presented simply, the estimating equation takes the form

U = \sum_{i=1}^{n} D_i' V_i^{-1} \, \mathrm{diag}(R_i / \hat{\pi}_i) (Y_i - \mu_i) = 0

where \hat{\pi}_{ij} is an estimate of \pi_{ij} = P(R_{ij} = 1 | Y_{i0}, \alpha) and diag(Q) indicates a matrix of zeroes with the vector Q on the diagonal. Robins et al. (54) discuss these equations and their properties in detail. In this example the probabilities depend only on observed quantities, but it is possible to allow dependence on previous values as well; although the computations can be burdensome, the probabilities may even depend on the current, possibly unobserved measurement Yij, implying that the missing data may be MNAR. Provided that both parts of the model are correctly specified, the approach will produce unbiased estimates even in the face of MNAR data.

B. Joint Modeling of Measurement and Missingness Processes

One can model the joint distribution of the underlying complete data Yi and the missingness indicators Ri; two types of models can result. Selection models proceed by modeling the complete data and then modeling the behavior of the missingness probabilities conditional on the outcome data, so that the selection model is concerned with f(Yi)f(Ri|Yi). Pattern mixture models proceed by estimating the parameters of interest within strata defined by patterns of and/or reasons for missingness and then by combining the estimates, so that the pattern mixture model is concerned with f(Ri)f(Yi|Ri). The two approaches are discussed and compared in detail by Little (46). The type of missingness mechanism is controlled by the covariates and/or responses that are included in the model for the missingness probabilities; any parametric model, such as the logistic, can be used for the missing data probabilities. Models for continuous data have been proposed by Diggle and Kenward (47) and Troxel et al. (55). The selection models assume that the complete underlying responses are multivariate normal. The observed data likelihood is obtained by integrating the complete data likelihood over the missing values.
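Returning to the weighted GEE described above, its two steps, fitting a model for the probability that an assessment is observed and then solving an inverse-probability-weighted estimating equation, can be sketched in a few lines. Everything about the simulated data and the names below is hypothetical, and the weights are only valid if the drop-out model is correctly specified, as noted in the text.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 2000

y1 = rng.normal(65, 12, n)                         # baseline QOL, always observed
y2 = 0.6 * y1 + rng.normal(25, 8, n)               # follow-up QOL, sometimes missing
p_obs = 1 / (1 + np.exp(-(-6.5 + 0.1 * y1)))       # MAR: observation depends on baseline
R = rng.binomial(1, p_obs)                         # R = 1 if the follow-up form was returned

# Step 1: logistic model for P(R = 1 | observed data); weights are inverse fitted probabilities.
Z = sm.add_constant(y1)
pi_hat = sm.Logit(R, Z).fit(disp=False).predict(Z)
w = R / pi_hat

# Step 2: weighted estimating equation for the follow-up mean, sum_i w_i (y2_i - mu) = 0.
mu_ipw = np.sum(w * np.where(R == 1, y2, 0.0)) / np.sum(w)

print("true mean follow-up QOL      :", round(y2.mean(), 1))
print("complete-case (unweighted)   :", round(y2[R == 1].mean(), 1))
print("inverse-probability weighted :", round(mu_ipw, 1))
```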

Estimates for these selection models are usually obtained through direct maximization of the likelihood surface; numerical integration is generally required. Once estimates are obtained, inference is straightforward using standard likelihood techniques, treating the parameters of the missingness model as a nuisance. Software to fit the Diggle and Kenward (47) model is available (56). This method allows analysis of all the data, even when the missingness probabilities depend on potentially unobserved values of the response. The usual drawback with respect to nonignorable missingness applies, however: the resultant estimates are sensitive to the chosen model, and the assumptions governing that model are generally untestable due to the very nature of the missing data. Despite these drawbacks, these models can be very useful for investigation and testing of the missingness mechanism; the bias that results from assuming the wrong type of missingness mechanism may well be more severe than the bias that results from misspecification of a full maximum likelihood model.

For discrete data, methods allowing for nonignorable missing data have been proposed by Fay (57), Baker and Laird (58), and Conaway (59). Here, loglinear models are used for the joint probability of outcome and response variables conditional on covariates. The models can be fit using the EM algorithm (60).

C. Multiple Imputation

Imputation, or ''filling-in,'' of data sets is a way of converting an incomplete to a complete data set. Simple imputation consists of substituting a value for the missing observations, such as the mean of the existing values, and then adjusting the analysis to account for the fact that the substituted value was not obtained with the usual random variation. Multiple imputation (61) is similar in spirit to simple imputation but with added safeguards against underestimation of variance due to substitution. A model is required to obtain the imputed values. Several data sets are imputed, and the analysis in question is conducted on each of them, resulting in a set of estimates obtained from each imputed data set. These several results are then combined to obtain final estimates based on the multiple set. This method is attractive because once the imputation is conducted, the methods for complete data described in Section II can be applied. Multiple imputation can be conducted in the presence of all kinds of missingness mechanisms; in the presence of nonignorable missingness, however, the estimates are also likely to depend on modeling assumptions, most of which are untestable in the presence of MNAR missing data. Even worse, some question whether it is appropriate to impute values for subjects whose data are missing because of early death. Finally, such an approach allows an estimate of the study population's experience if the entire group stayed on study for the entire period of QOL assessment. QOL results are then generalizable to new patients who could be candidates for the treatments and whose length of survival is unknown.
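The mechanics of combining several imputed data sets are simple enough to show directly. In the sketch below the imputation model (a normal linear regression of the missing follow-up score on baseline score and arm), the way the imputation parameters are drawn, and all variable names are assumptions made for illustration; dedicated multiple-imputation software handles these steps more carefully.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 300, 20                                    # patients, number of imputations

arm = rng.integers(0, 2, n)
base = rng.normal(65, 12, n)
follow = 10 + 0.7 * base + 4 * arm + rng.normal(0, 8, n)
miss = rng.uniform(size=n) < 1 / (1 + np.exp((base - 55) / 6))   # MAR drop-out
follow_obs = np.where(miss, np.nan, follow)

# Imputation model fit on completers.
X = np.column_stack([np.ones(n), base, arm])
Xo, yo = X[~miss], follow_obs[~miss]
beta_hat, *_ = np.linalg.lstsq(Xo, yo, rcond=None)
resid = yo - Xo @ beta_hat
df = len(yo) - X.shape[1]
s2 = resid @ resid / df
cov_beta = s2 * np.linalg.inv(Xo.T @ Xo)

est, var = [], []
for _ in range(m):
    sigma2 = df * s2 / rng.chisquare(df)                   # draw the residual variance
    beta = rng.multivariate_normal(beta_hat, sigma2 / s2 * cov_beta)
    y = follow_obs.copy()
    y[miss] = X[miss] @ beta + rng.normal(0, np.sqrt(sigma2), miss.sum())
    diff = y[arm == 1].mean() - y[arm == 0].mean()          # analysis of each completed data set
    v = (y[arm == 1].var(ddof=1) / (arm == 1).sum()
         + y[arm == 0].var(ddof=1) / (arm == 0).sum())
    est.append(diff)
    var.append(v)

# Rubin's rules: combine the m estimates and their variances.
est, var = np.array(est), np.array(var)
q_bar = est.mean()
t_var = var.mean() + (1 + 1 / m) * est.var(ddof=1)
print(f"MI estimate of arm effect: {q_bar:.2f} (SE {np.sqrt(t_var):.2f})")
```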

V. AVOIDING PITFALLS: SOME COMMONLY USED SOLUTIONS

A. Substitution Methods

In general, methods that rely on substitution of some value are subject to bias and heavily subject to assumptions made in obtaining the substituted value. One of the most serious problems with substitution methods, especially when the worse score method is used, is that they can seriously damage the psychometric properties of a measure. These properties, such as reliability and validity, rely on variations in the scores to hold, either within a patient or within a group of patients (such as those on the same treatment arm). A second problem is that in substituting values and then conducting analyses based on that data, the variance of estimates will be underestimated, since the missing values, had they been observed, would carry with them random variation, which the substituted values do not. For these reasons, they should not be used to produce a primary analysis on which treatment or other decisions are based. Substitution methods can be useful, however, in conducting sensitivity analyses to determine the extent to which the analysis is swayed by differing data sets.

1. Last Value Carried Forward

This substitution method tries to use each patient's score to provide information about the imputed value. It assumes, however, that subjects who drop out do not have a changing QOL score, when in practice often it is the subjects in rapid decline who tend to drop out prematurely. For this reason, last value carried forward should be used with extreme care, if at all.

2. Worst Score

This method is often used in sensitivity analyses, since the implicit assumption is that the subjects who did not submit an assessment are all as worse off as they can possibly be with respect to QOL. This is usually the most extreme assumption possible, so an analysis robust to worst-score substitution has a strong defense. The comment raised above regarding the psychometric measurement properties warrants caution.

3. Average Score

Use of the average score, determined in a variety of ways, either within a patient or within a group of patients (such as those on the same treatment arm), is more closely related to classic imputation methods, but it does not necessarily force each subject's score to remain constant. Again, it assumes that the imputed values are no different from the observed values.
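When substitution is used at all, the text recommends confining it to sensitivity analyses. The sketch below (hypothetical data and names throughout) fills a longitudinal QOL matrix by last value carried forward and, separately, by the worst possible score, then reports how much the final-visit treatment comparison moves under each rule.

```python
import numpy as np

rng = np.random.default_rng(6)
n_pat, n_vis, worst = 200, 4, 0          # QOL scored 0 (worst) to 100 (best)

arm = rng.integers(0, 2, n_pat)
qol = 65 + 3 * arm[:, None] + np.cumsum(rng.normal(-1, 6, (n_pat, n_vis)), axis=1)
drop_at = rng.integers(1, n_vis + 1, n_pat)           # first visit with no assessment
for i, d in enumerate(drop_at):
    qol[i, d:] = np.nan

def locf(mat):
    out = mat.copy()
    for j in range(1, out.shape[1]):
        stale = np.isnan(out[:, j])
        out[stale, j] = out[stale, j - 1]             # carry the last observed value forward
    return out

def worst_score(mat):
    return np.where(np.isnan(mat), worst, mat)

for label, filled in [("available data", qol),
                      ("LOCF", locf(qol)),
                      ("worst score", worst_score(qol))]:
    last = filled[:, -1]
    diff = np.nanmean(last[arm == 1]) - np.nanmean(last[arm == 0])
    print(f"{label:15s}: arm difference at final visit = {diff:5.2f}")
```

A treatment conclusion that survives the worst-score rule is, as the text notes, the hardest to overturn.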

B. Adjusted Survival Analyses

Some authors (62,63) proposed analyses in which survival is treated as the primary outcome, but it is adjusted for the QOL experience of the patients. This is an extremely appealing idea, for it clarifies the inherent trade-off between length and quality of life that applies to most patients. It can be difficult to implement satisfactorily in practice, however, because of the difficulty of obtaining the appropriate values with which to weight survival in different periods. The two methods described below have gained some popularity.

1. Quality-adjusted Life Years

This method consists of estimating a fairly simple weighted average, in which designated periods of life are weighted according to some utility describing QOL. Because utilities are obtained using lengthy interviews or questionnaires focusing on time trade-offs or standard gambles, investigators commonly substitute utilities obtained from some standard population rather than any information obtained directly from the patient. For example, time spent suffering from treatment-induced toxicities may be rated at 50% of perfect health.

2. Q-TWiST

Q-TWiST (64), or quality-adjusted time without symptoms and toxicity, is a more detailed method of adjustment, though still one that relies on utilities. The patient's course through time is divided up into intervals in which the patient experiences toxicity due to treatment, toxicity due to disease (i.e., brought on by relapse), and no toxicity. These intervals may be somewhat arbitrary, determined not by the patient's actual experience with toxicity but by a predefined expectation of the average interval in which patients with a given disease receiving a given treatment experience problems due to treatment, the average interval they spend disease-free after the end of therapy, and the average time until relapse. To compound this arbitrariness, utilities for each period are chosen by the analyst. Again, this results in an analysis that reflects only a small amount of patient-derived data and a large number of parameters chosen by the investigator. This renders the analysis largely uninterpretable, in our view. Data from patient rating scales and Q-TWiST analyses can differ (65).
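A numerical sketch may make the Q-TWiST calculation, and its dependence on analyst-chosen utilities, concrete. The partition of follow-up into TOX (time with treatment toxicity), TWiST (time without symptoms or toxicity), and REL (time after relapse), the utility values, and the group averages below are all invented for illustration; a real analysis would estimate the mean state durations from the trial data within a fixed follow-up horizon.

```python
# Hypothetical mean months spent in each health state, by treatment arm.
mean_months = {
    "standard":  {"TOX": 2.0, "TWiST": 14.0, "REL": 6.0},
    "intensive": {"TOX": 5.0, "TWiST": 18.0, "REL": 3.0},
}

def q_twist(states, u_tox, u_rel):
    """Utility-weighted survival: TWiST counts fully, TOX and REL are discounted."""
    return u_tox * states["TOX"] + states["TWiST"] + u_rel * states["REL"]

# A small grid of utilities shows how the comparison depends on the analyst's choices.
for u_tox in (0.25, 0.5, 0.75):
    for u_rel in (0.25, 0.5, 0.75):
        gain = (q_twist(mean_months["intensive"], u_tox, u_rel)
                - q_twist(mean_months["standard"], u_tox, u_rel))
        print(f"u_tox={u_tox:.2f}, u_rel={u_rel:.2f}: Q-TWiST gain = {gain:+.1f} months")
```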

Med Care 1994. Snow KK. Surgery 1982. Bergman B. et al. Bullinger M. March 1–2. Aaronson NK. 1992. The EORTC QLQ-LC13: a modular supplement to the EORTC QLQ-C30 for use in lung cancer trials. Guyatt GH. Moinpour CM. Bethesda. 1993.286 Troxel and Moinpour REFERENCES 1. Quality of life end points in cancer clinical trials: review and recommendations. 10. Lu R. 1978. Nunnally J. Ahlner EM. Ware JE Jr. Recent Results Cancer Res 1988. Ganz PA. and reliability across diverse patient groups. SF-36 Physical and Mental Health Summary Scales: A User’s Manual. 16. Quality of life assessment of patients in extremity sarcoma clinical trials. Kirshner B. Conceptual framework and item selection. Tests of data quality. Report from a National Cancer Institute (USA) workshop on quality of life assessment in cancer clinical trials. New England Medical Center. 31:247–263. Gianola FJ. Barofsky I. Psychometric and clinical tests of validity in measuring physical and mental health constructs. Quality of life in clinical trials. Hailey B. Aaronson NK. Durham and London: Duke University Press. 1:27–36. McHorney C. McHorney C. Proceedings of a workshop held at the National Institute of Health. 81:485–495. 8. Ware JE Jr. Med Care 1992. 9. JE Jr. Cella D. Rosenberg SA. A methodological framework for assessing health indices. Kosinski M. 1:203–210. Kosinski MA. Measuring Functioning and Well-Being: The Medical Outcomes Approach. MD: National Institute of Health. Nayﬁeld S. Sherbourne CD. A modular approach to quality of life assessment in cancer clinical trials. Stewart AL. Bergman B. et al. 14. Ware. The MOS 36-item short-form health survey (SF36). 4. 18. Hayden KH. 15. Sugarbaker PH. Philadelphia: Lippincott-Raven. Sherbourne C. Eur J Cancer 1994. Ware JE Jr. Boston: The Health Institute. 2. National Cancer Institute (US). 1994. Charlson M. Keller SD. 30A: 635–642. scaling assumptions. New York: McGraw-Hill. Psychometric Theory. Guyatt GH. Raczek A. 111:231–249. 91:17–23. I. Feigl P. Development of an EORTC questionnaire module to . J Chronic Dis 1985. 30:473–483. 32:40–46. Aaronson NK. Boston: Nimrod Press. 1996. 2nd ed. Responsiveness and validity in health status measurement: a clariﬁcation. Quality of Life and Pharmacoeconomics in Clinical Trials. J Clin Epidemiol 1989. Mitchell A. 17. ed. The MOS 36-item short-form health survey (SF-36). 42: 403–408. SF-36 Health Survey: manual and Interpretation Guide. Meyskens FL Jr. Moinpour CM. Bjordal K. Med Care 1993. 12. J Natl Cancer Inst 1993. III. 5. 7. 1995. The SF-36 Health Survey. 6. Deyo RA. 1996. Metch B. Crowley J. In: Spilker B. Ahmedzai S. 3. The European Organization for Research and Treatment of Cancer QLQ-C30: a quality of life instrument for use in international clinical trials in oncology. Quality of Life 1992. Ware J. Gandek B. Ware. 85:365–373. Levine MN. II. J Natl Cancer Inst 1989. Ahmedzai S. The MOS 36-item short-form health survey (SF-36). et al. 13. Ware J. 11. JE Jr.

Ganz PA. Meier P. EORTC QLQ-C30 Reference Values. Baker RJ. Tulsky DS. Mo F. Living with cancer: the cancer inventory of problem situations. Ganz PA. The Functional Assessment of Cancer Therapy (FACTS) scale. D’Antonio LL. 40:972–980. . 75:1151–1161. Nelder JA. Cobleigh M. J Clin Oncol 1997. Manual of the Functional Assessment of Chronic Illness Therapy (FACIT Scales)—Version 4. Generalized linear models. Long S. New York: John Wiley and Sons. 1986. 22. Gelke C. 23. 2nd ed. The European Organization for Research and Treatment of Cancer approach to quality of life: guidelines for developing questionnaire modules. Heinrich RL. Levin V. 1997. Fleiss JL. Cella DF. Zimmerman GJ. Shiomoto G. Saraﬁan B. Ganz PA. 34. 28. Bonomi AE. J Psychosoc Oncol 1983. Sprangers M. Deasy S. 1997. Schag CAC. Pregibon D. Cella DF. 122:482–487. Cella D. Nelder JA. eds. J Am Stat Assoc 53:457–481. Cancer 1991. Development of a brain subscale and revalidation of the general version (FACT-G) in patients with primary brain tumors. 33. Brussels: EORTC Quality of Life Study Group. 20. In: Chambers JM. Tulsky DS. J Clin Psychol 1984. 68:1406–1413. Release 3. Cancer Rehabilitation Evaluation System— Short Form (CARES-SF): a cancer speciﬁc rehabilitation and quality of life instrument. 31. Development of a comprehensive quality of life measurement tool: CARES. Heinrich RL. 1998. be used in quality of life assessments in head and neck cancer patients.Statistical Analysis of Quality of Life 287 19. 11:570–579. Generalized Linear Interactive Modeling. EORTC Quality of Life Study Group. The cancer inventory of problem situations: an instrument for assessing cancer patients’ rehabilitation needs. Lloyd SR. Cella DF. 2:287–295. 26. 1989. 21. Heinrich RL. 35. Heinrich RL. Gray G. 30. 4:135–138. 24. Kaplan EL. 27. Bonomi AE. 32. 33:879–885. The GLIM System. Arch Otolaryngol Head Neck Surg 1996. The functional assessment of cancer therapy scale: development and validation of the general measure. Outcomes Research and Education (CORE). Byrne K. 1993. Brady MJ. 29. Cancer 1995. EORTC Quality of Life Study Group. Nonparametric estimator from incomplete observations. Quality of life and functional status measures in patients with head and neck cancer. EORTC QLQ-C30 Scoring Manual. London: Chapman and Hall. Cull A. et al. Schag CAC. Cella DF. Generalized Linear Models. Reliability and validity of the Functional Assessment of Cancer Therapy–Breast Quality-of-Life Instrument. J Clin Oncol 1993. Oxford: Numerical Algorithms Group. 1:11–24. Linn E. Qual Life Res 1993. McCullagh P. Meyers C. Schag CAC. London: Chapman and Hall. 1978. Schag CAC. The Design and Analysis of Clinical Experiments. Brussels: EORTC Quality of Life Study Group. Oncology 1990. Statistical Models in S. Hastie TJ. Hastie TJ. Acta Oncol 1994. 15:974–986. Evanston Northwestern Healthcare and Northwestern University. Weitzner M. 25.

Schluchter MD. 90:1112–1121. Schmitz SFH. JNCI 1998. Appl Stat 1994. Meyskens FL Jr. Sprangers MAG. 50. 1996. Liang KY. Kenward M. Moinpour CM. Blumenstein BA. Troxel AB. Coates A. 51. 17:711–724. Diggle PJ. Rubin DB. Identifying the types of missingness in quality of life data from clinical trials. England. 53. Technical Report. Testing for random dropouts in repeated measurement data. 17:767– 779. Inference and missing data. J Clin Epidemiol 1992. 39. Troxel AB. 49. Machin D. Regression models and life tables [with discussion]. University of Lancaster. Testing for random dropouts in repeated measurements data. Stat Med 1992. Zeger SL. 46. Biometrics 1991. Am Statist 1999. JASA 1995. 17:533–540. Rotnitzky A. Stat Med 1998. Germany. Non-random missingness in categorical data: strengths and limitations. 52. Robins JM. Harrington DP. 43:49–93. Biometrika 1986. Smith DM. Molenberghs G. 21:411–421. Veith RW. Quality of life in advanced prostate cancer: results of a randomized therapeutic trial. Skeel R. Molenberghs G. 55. Diggle P. Biometrics 1989. 11:1861–1870. Savage MJ. 47:1617–1621. 17: 697–710. 54. 45:1255–1258. 56. . The Oswald Manual. 90:1537–1544. Bacchi M.47:425–438. Informative drop-out in longitudinal analysis [with discussion]. 47. Yee M. Lipsitz SR. Universitaet Dortmund. 1994. Stat Med 1998. Zhao LP. Ridout M. The role of health care providers and signiﬁcant others in evaluating the quality of life of patients with chronic disease: a review. Statistics Group. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Lovato LC. JASA 90:122–129. 42. 17: 757–766. 73:13–22. Zee BC. JASA 1995. Eisenberger M. Fachbereich Statistik. Robins JM.288 Troxel and Moinpour 36. Curran D. Troxel AB. Semiparametric efﬁciency in multivariate regression models with missing data. 45:743–760. Sylvester RJ. 53:110–118. Lipsitz SR. Biometrika 1976. Goetghebeur EJT. 63:581–592. Quality of life studies of the Australian New Zealand Breast Cancer Trials Group: approaches to missing data. Stat Med 1998. Analysis of longitudinal data with non-ignorable non-monotone missing values. Methods for the analysis of informatively censored longitudinal data. Groemping U. Lancaster. Crawford ED. 40. A comparative analysis of quality of life data from a Southwest Oncology Group randomized trial of advanced colorectal cancer. 41. 44. Stat Med 1998. Modeling the drop-out mechanism in repeated-measures studies. 45. Rotnitzky A. Stat Med 1998. Higgins B. 38. 90:106–121. Longitudinal data analysis using generalized linear models. 48. Weeden S. Technical Report. Little RJA. J R Stat Soc B 1972. Suggestions for the presentation of quality of life data from clinical trials. 37. Cox DR. Aaronson NK. Growth curve model analysis for quality of life data. Gebski VJ. GEE: a SAS macro for longitudinal data analysis. 43. Appl Stat 1998.

49:161–169. 60. Conaway MR. Am Stat 1995. Multiple Imputation for Nonresponse in Surveys. Regression analysis for categorical variables with outcome subject to nonignorable nonresponse. Simes J. Dempster AP. Glasziou PP. Goldhirsch A. Gelber RD. 39:1–38. Cole BF. . 61. Causal models for patterns of nonresponse. 8:723–731. Rubin DB. 1987. Stat Med 1990. JASA 1992. Simes RJ. Glasziou P. 7:36–44. Rubin DB. Gelber RD. Maximum likelihood estimation from incomplete data via the EM algorithm [with discussion]. Fetting JH. Goldhirsch A. 64. Qual Life Res 2000. Quality adjusted survival analysis. Quality of life and quality adjusted survival for breast cancer patients receiving adjuvant therapy. 65. Costs and beneﬁts of adjuvant therapy in breast cancer: a quality-adjusted survival analysis. 62. The analysis of repeated categorical measurements subject to nonignorable nonresponse. 63. 87:817–824. Fay RE. 83:62–69. JASA 1988. Laird NM. Cella D.Statistical Analysis of Quality of Life 289 57. Baker SG. JCO 1989. New York: John Wiley and Sons. JASA 81:354–365. Moinpour CM. Comparing treatments using quality-adjusted survival: the Q-TWiST method. 9:1259–1276. J R Stat Soc B 1977. 58. Fairclough DL. Gelber RD. Wonson W. 59. Coates A. Laird NM.


16 Economic Analysis of Cancer Clinical Trials

Gary H. Lyman
Albany Medical College, and State University of New York at Albany School of Public Health, Albany, New York

I. INTRODUCTION

A. Costs of Cancer Care

Health care expenditures in the United States have risen dramatically, now exceeding one trillion dollars annually and constituting 14% of the gross domestic product (Fig. 1) (1,2). Approximately 10% of health care expenditures are allocated for cancer care, totaling more than $100 billion annually. More than 90% of medical costs for cancer are associated with five diagnoses: breast cancer (24%), colorectal cancer (24%), prostate cancer (18%), lung cancer (18%), and bladder cancer (8%) (3,4). Hospital care represents the largest single cost component, accounting for approximately 50% of total cancer care costs. Other major components of health care costs include physician/professional costs (25%) and pharmaceutical and home care costs (approximately 10% each). Cancer care costs vary over time and are generally greater during the period immediately after diagnosis and during the last few months before death (3).

B. Health Care Outcome Measures

There is increasing interest in the assessment of health care outcomes beyond traditional clinical measures of efficacy. Alternative measures of interest include

health-related quality of life and economic outcomes. The analysis of economic outcomes is complicated by the multiple outcomes, skewed distributions, and frequent missing data. Nevertheless, combined clinical and economic outcome measures permit more rational comparisons of different clinical strategies for purposes of medical decision making, patient counseling, clinical practice guideline development, and health care policy formulation. The primary economic measure in most economic studies is the mean cost or cost difference between treatment groups. To facilitate the comparison of different treatment strategies, combined measures have been developed that bring together clinical, quality of life, and economic outcomes into summary measures such as the quality-adjusted life year (QALY) and cost-effectiveness and cost-utility ratios.

Figure 1 Annual health care expenditures. Annual U.S. health care expenditures for selected years from 1960 to 1998 reported by the Health Care Financing Administration. Total annual expenditures are reported in units of $100 billion, whereas per capita expenditures are presented in $ thousands. Total U.S. health expenditures projected for the year 2007 are $2.1 trillion.

C. Economic Analyses

A number of different types of economic evaluations have been developed, including cost-minimization, cost-effectiveness, and cost-utility analyses.

Figure 2 compares published cost-effectiveness measures for several types of cancer treatment derived from CCTs. The variability in the cost measures and the lack of agreement on clinically meaningful cost differences further limit the conclusions derived from such studies.

Figure 2 Cost effectiveness of cancer treatment. Estimated cost effectiveness for various cancer treatment modalities adapted from Smith et al. (41). Cost effectiveness is expressed in terms of incremental cost ($U.S. thousands) per life year saved. Abbreviations: adj, adjuvant; adv, advanced; met, metastatic; ABMT, autologous bone marrow transplantation; AML, acute myelogenous leukemia; CAE, cyclophosphamide, adriamycin (doxorubicin), etoposide; CMF, cyclophosphamide, methotrexate, 5-fluorouracil; HD, Hodgkin's disease; IFN, interferon; NSCLC, non small cell lung cancer.

D. Economic Analysis in Controlled Clinical Trials

Performing economic analyses in association with controlled clinical trials (CCTs) has gained increasing enthusiasm in recent years. Such analyses, however, are associated with several important methodological challenges. Economic measures are often of secondary interest in such trials, lacking a priori hypotheses, with frequent missing data and inadequate sample size for valid statistical inference. The addition of economic outcomes to traditional measures of clinical efficacy increases the complexity and cost of CCTs. Economic analyses, therefore, should be limited to large phase III trials where important trade-offs between efficacy and cost are anticipated. This chapter focuses attention on the design, conduct, analysis, and

reporting of economic analyses in the setting of cancer clinical trials. The strengths and the limitations of such analyses are discussed, and guidelines are offered for the proper conduct, evaluation, and interpretation of such economic analyses.

II. HEALTH CARE OUTCOMES

A. Clinical Efficacy

Response and survival often represent the primary clinical end points for the assessment of efficacy upon which sample size and power calculations are based. Alternatively, clinical outcome can be measured in terms of life expectancy or the average number of years of life remaining at a given age. Changes in life expectancy are often used in economic analyses to express the efficacy of treatment. Such measures are limited by the difficulty in judging a clinically important gain in life years and extrapolating censored survival data beyond the trial period. The life expectancy of a population can be thought of as representing the area under the corresponding survival curve (5). The gain in life expectancy or life years saved with treatment represents the marginal efficacy and can be thought of as the area between the survival curves with and without intervention. This represents a more powerful method for assessing treatment effect than comparing median survivals or the proportion event free at a given time (Fig. 3).

B. Health-related Quality of Life (HRQOL)

In recent years, there has been increasing interest in the assessment of the impact of cancer and cancer treatment on quality of life. Health profiles derived from psychosocial theory attempt to assess HRQOL through one of a variety of scales addressing the relevant dimensions associated with quality of life, such as functional ability, emotional well-being, family well-being, sexuality/intimacy, treatment satisfaction, and social functioning (6). Alternatively, utility measures derived from economic and decision theory attempt to assess HRQOL by eliciting patient preferences for specific outcome states (7). Patient preferences can be assessed through a time trade-off method incorporating a standard reference gamble generating a single value of health status along a linear continuum from death (0) to full health (1). The major advantage of measures of patient preference or utility is that they can then be used to adjust measures of longevity such as life expectancy (e.g., quality-adjusted life years or QALYs). The QALY represents the time in full health considered by the patient equivalent to actual time in the diseased state. Serial measurement of patient preferences over time can be used to estimate the cumulative impact of treatment on HRQOL. The sum over all health care states of the product of the time spent in each state and the utility associated with the state will yield the quality-adjusted time without symptoms of disease or toxicity of treatment, or Q-TWIST, described by Gelber et al. (8).
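Both the life-expectancy and the quality-adjusted calculations described above reduce to areas under (possibly utility-weighted) survival curves. The short sketch below integrates two hypothetical survival curves by the trapezoidal rule to obtain restricted life expectancy in each group, the life years gained (the area between the curves), and a quality-adjusted version in which each interval is weighted by an assumed utility; every number in it is illustrative.

```python
import numpy as np

def area(y, x):
    """Trapezoidal area under a curve sampled at points (x, y)."""
    return float(np.sum((y[:-1] + y[1:]) / 2 * np.diff(x)))

# Hypothetical survival probabilities at yearly intervals (e.g., read off Kaplan-Meier curves).
years     = np.array([0, 1, 2, 3, 4, 5])
s_control = np.array([1.00, 0.70, 0.50, 0.35, 0.25, 0.18])
s_treated = np.array([1.00, 0.80, 0.62, 0.48, 0.38, 0.30])
utility   = np.array([0.90, 0.80, 0.75, 0.70, 0.70, 0.70])   # assumed QOL weight each year

le_control = area(s_control, years)    # restricted life expectancy = area under the curve
le_treated = area(s_treated, years)
print(f"life years (control): {le_control:.2f}")
print(f"life years (treated): {le_treated:.2f}")
print(f"life years gained   : {le_treated - le_control:.2f}")

qaly_gain = area(utility * s_treated, years) - area(utility * s_control, years)
print(f"QALYs gained        : {qaly_gain:.2f}")
```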

The assessment of HRQOL in conjunction with conventional clinical efficacy measures in CCTs has gained increasing interest over the past several years (9–11). Several authors have addressed the methodological challenges of HRQOL outcomes associated with the design and analysis of clinical trials (12,13). Guidelines have been proposed for the incorporating of HRQOL measures into CCTs (14). The value of such measures is limited by the time and cost involved in their assessment through direct patient encounters and the lack of elucidation of the multidimensional aspects of HRQOL.

Figure 3 Gain in life expectancy. Hypothetical survival curves of control and treatment subjects displaying the probability of survival over time since randomization, adapted from Naimark and from Wright and Weinstein (5). The gain in median survival and probability of 5-year survival are shown. The area between the curves represents the life years gained with the intervention.

C. Economic Outcomes

Economic outcome measures differ in several respects from traditional clinical outcome measures. The most important economic outcome of interest for clinical decision making and health policy formation is cumulative total cost, which considers both the activity level over time and unit costs. The activity level represents the amount of various resources used and the time expended in providing medical care.

Unit costs represent the cost associated with each unit of activity. The total cost of illness represents the weighted sum of the unit costs, where the weights are represented by the units of activity for each cost item, such that

\text{Total cost} = \sum_{i=1}^{n} [\text{unit activity}_i \times \text{unit cost}_i]

The major focus of such economic analyses relates to those resources and costs that might differ between treatment groups. Direct medical costs represent the costs of providing medical services for the prevention, diagnosis, treatment, follow-up, rehabilitation, and palliation of disease. These costs include those associated with hospitalization, professional services, radiologic and laboratory testing, pharmaceuticals, and home health care services. Direct nonmedical costs represent additional expenditures incurred while receiving medical care, such as transportation costs to and from the institution and child care expenses. Indirect costs include those associated with the morbidity of disease and treatment, such as days lost from work and the economic impact of lost economic output due to premature death. Intangible costs are those associated with pain and suffering and the loss of companionship. Although it is very difficult to express such concerns in monetary terms, these represent real social and emotional costs to the patient and family. Often economic outcome measures are combined with clinical and/or quality of life measures to provide a summary outcome measure reflecting the simultaneous difference in cost and the change in survival or quality-adjusted survival. The types of economic analyses associated with cancer CCTs are summarized in Table 1.

Table 1 Types of Economic Analysis

Methodology          Cost unit   Effect unit
Cost of illness      Monetary    —
Cost minimization    Monetary    Equal
Cost effectiveness   Monetary    LYS*
Cost utility         Monetary    QALYS†
Cost benefit         Monetary    Monetary

* Life years saved.
† Quality-adjusted life years saved.
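The total-cost expression above is just a weighted sum, which the fragment below evaluates for two hypothetical treatment arms; the resource categories, utilization counts, and unit prices are all made up for illustration, and a real trial would tabulate them per patient before averaging.

```python
# Hypothetical average resource use per patient and assumed unit costs (US$).
unit_cost = {"hospital day": 1200.0, "clinic visit": 150.0,
             "chemotherapy cycle": 900.0, "growth factor dose": 250.0}

mean_units = {
    "standard":  {"hospital day": 6.0, "clinic visit": 10.0,
                  "chemotherapy cycle": 6.0, "growth factor dose": 0.0},
    "intensive": {"hospital day": 9.0, "clinic visit": 12.0,
                  "chemotherapy cycle": 6.0, "growth factor dose": 11.0},
}

def total_cost(units):
    # Total cost = sum over items of (units of activity) x (unit cost).
    return sum(units[item] * unit_cost[item] for item in units)

for arm, units in mean_units.items():
    print(f"{arm:9s}: mean total cost per patient = ${total_cost(units):,.0f}")
marginal = total_cost(mean_units["intensive"]) - total_cost(mean_units["standard"])
print(f"marginal cost (intensive - standard) = ${marginal:,.0f}")
```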

therefore. Economic analyses. interventions that are associated with large or uncertain resource consequences and small or unclear efﬁcacy are most likely to be candidates for an economic analysis. A common approach is that of burden-of-illness or cost-of-illness studies where the cost of disease in a population is summarized by tabulating the incidence or prevalence of disease. 5). the associated morbidity or mortality. When clinical effectiveness is not an issue or is considered equal between therapeutic alternatives. however. Costeffectiveness analysis compares interventions based on the ratio of the marginal cost and the marginal effectiveness (marginal cost-effectiveness) expressed as the added cost per life year saved (Fig. the measures of interest are generally the additional cost of one strategy over another (marginal cost) and the additional clinical beneﬁt (marginal efﬁcacy) or quality-adjusted clinical beneﬁt (marginal utility). 2. the evaluation may be most reasonably based on differences in resource utilization or cost through a cost-minimization analysis where the strategy associated with the lowest total cost is identiﬁed. Clearly. Types of Economic Analysis 1.Economic Analysis of Clinical Trials 297 same or less but with less effectiveness. The proper timing of an economic evaluation in the development of a new intervention is important. and the total costs of illness. it is often preferable to combine economic measures with those of clinical efﬁcacy (Fig. Such an approach. When important differences in both clinical efﬁcacy and cost are anticipated. is limited by the requirement that a monetary value is placed on clinical and quality of life outcome measures. Noncomparative Evaluations Noncomparative (descriptive) economic studies generally are performed for either health administrative or public health purposes and do not involve explicit comparisons of treatment options. B. Comparative Evaluations Comparative economic studies evaluate possible interventions in cohorts of individuals comparing the beneﬁts and the costs. several types of economic evaluations are available (15–21). 4). Introduction too early in the process before efﬁcacy and standard procedures have been established may lead to the waste of limited resources. although implicit comparisons are often made. As shown in Table 1. Clinical beneﬁts are sometimes converted into the same economic measure in a cost-beneﬁt analysis to combine them into a single measure. should generally be limited to deﬁnitive or conﬁrmatory studies of promising approaches likely to have considerable economic consequences or for which a trade-off between efﬁcacy and cost is anticipated. Cost-utility analysis compares treat- . whereas incorporation too late in the process may limit the ability of the evaluation to alter the dissemination of the technology. In this situation.

Ultimately, these latter two approaches attempt to identify the most efficient approach, that is, the least costly strategy associated with the greatest effectiveness or utility.

Figure 4 Combined outcome measures. Relationship between clinical and economic outcome measures. Clinical measures such as survival or life expectancy and quality of life may be combined with economic outcome measures such as cost to simultaneously evaluate cost and efficacy in terms of cost-effectiveness or cost-utility ratios.

C. Limitations of Economic Analyses

The evaluation and interpretation of an economic analysis will often differ substantially depending on the perspective from which it was undertaken, for example, the patient or family, a health care provider or institution, a third party payor, or that of society as a whole. From the narrowest perspective, the lowest cost will be associated with the absence of care or no intervention and the shortest survival. Likewise, lifetime costs will often be less in those with the shortest life expectancy such as the elderly. From a more global perspective, public health efforts aimed at screening and early detection and disease prevention assume greater importance since these will ultimately improve clinical outcome. Indirect and intangible costs, although very important to the patient and family, may not even be considered in economic analyses from most other perspectives. In addition, marginal summary measures do not reflect the absolute benefit or cost of an intervention. A strategy associated with a lower absolute effectiveness may actually appear superior in terms of cost-effectiveness or cost-utility. It is important, therefore, to measure both the absolute and the marginal benefit and cost in such analyses.

Figure 5 Cost-effectiveness plane. Plane displaying the relationship between incremental cost (ordinate) and incremental effectiveness (abscissa). Any point on the plane represents cost effectiveness expressed as the ratio of incremental cost to incremental effectiveness. Interventions associated with greater effectiveness and lower cost (lower right) are always considered acceptable, whereas those associated with greater cost and less effectiveness (upper left) are always unacceptable. The straight line from the lower left of the plane to the upper right represents the maximum acceptable cost-effectiveness ratio determined by society. The acceptability of cost-effectiveness ratios in the other boxes depends on whether it lies below or above the maximum cost-effectiveness line. Any estimated cost effectiveness below that line represents an acceptable ratio, whereas those above the line are considered unacceptable.

IV. ECONOMIC ANALYSIS AND CCTs

A. Why Perform Economic Analyses in Association with Clinical Trials?

1. Strengths

The quality of an economic analysis depends upon the precision and validity of the underlying data best provided by CCTs. Just as CCTs are thought to represent

The study population also must adhere to clinical monitoring that may not be representative of clinical practice and will be associated with resource utilization and costs differing considerably from routine. conduct. Study design. 2. analysis. small sample size. Types of Studies As shown in Table 2 three general types of economic analysis related to CCTs are described that vary in the nature and source of the economic data. economic analyses with CCTs may have low external validity related to the lack of representativeness and limited generalizability due. randomized clinical trials. In type I . economic analysis will add to the cost and complexity of CCTs and should generally be limited to use with large. economic analyses based on such trials may represent the best means to evaluate the cost and costefﬁciency of treatment. requires careful attention to the proper design. and care must be utilized in selecting only the most relevant and objective measures of resource utilization for inclusion in the trial. conduct. The appropriate use of economic analyses in association with CCTs. especially when the resource consequences or costs are large. The care taken in the design. B. Such economic analyses will be based on the most reliable estimates of treatment efﬁcacy. therefore.300 Lyman the most deﬁnitive way to evaluate interventions for efﬁcacy. The importance of randomized controlled trials is evident in efforts of observational studies and nonrandomized trials to emulate their careful design and analysis procedures to achieve the same conclusions. Careful consideration should be given to the importance of the economic information and the appropriateness of the clinical trial design prior to incorporating economic assessment into a CCT. The same methodological rigor should be applied to the economic analysis as is commonly used in the assessment of therapeutic efﬁcacy. and they will facilitate the comprehensive comparison of therapeutic options. prospective. and frequent missing data. The costs involved with exploratory or early clinical trials may not be representative of what they would be with more experience. and reporting of such analyses (22–27). and analysis of such trials may provide the best available information on resource utilization and treatment efﬁcacy. Design Considerations 1. Economic analyses associated with CCTs should be sought before wide dissemination of new technologies. to strict eligibility criteria. Finally. in part. and planned analyses are generally detailed in a written protocol. Weaknesses Economic outcomes measured in association with clinical trials are often considered of secondary importance with no a priori hypothesis. phase III. data collection. Even when properly designed and conducted.

economic analyses, efficacy results are obtained prospectively while activity and unit cost data are collected retrospectively. Such studies can often be performed rapidly at relatively low cost, but there is little information on measure variability, and subsequent analysis is based on sensitivity analysis to assess the robustness of the assumptions. Missing data cannot be assumed to be missing at random. In type II economic studies, resource utilization is sampled concurrently with measures of clinical efficacy, usually at a few institutions; cost information is obtained either from an independent source in an unsampled fashion or from a subsample of study subjects. The amount of information collected often requires limiting sampling to a subgroup of the study population, which allows for the introduction of a measurement bias. Such an analysis has limited generalizability and requires considerable effort and justification addressing concerns about sampling and measurement bias. In type III economic studies, complete cost information, including resource utilization and unit costs, is obtained on the trial subjects. Such an approach provides information on variability for estimation and hypothesis testing but may limit generalizability to other economic environments and time periods.

Table 2 Economic Analyses Associated with Cancer Clinical Trials

Type   Efficacy      Activity       Unit cost
I      Prospective   Retrospective  Retrospective
II     Prospective   Prospective    Retrospective
III    Prospective   Prospective    Prospective

Efficacy, primary outcomes; Activity, resources used; Prospective, data from all or a sample of trial institutions; Retrospective, retrospective data from study institutions or other sources.

2. Study Hypotheses

The major study questions related to economic measures should be clearly stated in terms of testable hypotheses. All primary economic questions and secondary hypotheses relating to outcome differences among specified subgroups should be stated in advance of the trial. The clinical and economic relevancy of the study hypotheses should be clearly stated. The economic importance of specific interventions is likely to be greatest when considering diseases of clinical and public health significance and interventions associated with considerable cost trade-offs.

3. Study Design

The design of a clinical investigation, including any economic analysis, should attempt to minimize the potential for systematic error or bias, including that associated with subject selection, measurement, and confounding (28). Confounding represents the modification of the true treatment effect by a factor associated with

In small trials. The sample size necessary to adequately address primary study hypotheses should be stated in advance based on the likely treatment effect or the number of events anticipated. maximum tolerable alpha error (false positive). Sample Size The goal of a clinical trial is to conﬁrm the treatment effect accurately or to refute it unambiguously. and setting of the study should be fully detailed. The potential for confounding is most effectively addressed in the design of a trial by incorporating appropriate controls. The nature. location. should be presented. Randomization ensures that both known and unknown confounding factors will be distributed equally in the treatment groups on average. Confounding can obscure a true outcome difference when it exists or create an apparent difference that does not exist. it must be anticipated that a longer duration of accrual or follow-up may be needed. It is imperative that sufﬁcient numbers of subjects are included in the trial that a negative study is unlikely to be a false negative (29). When the primary outcomes represent failure time data (time-toevent). Study Population All subjects in the study should be described and accounted for. even relatively large and clinically important differences in outcome may be statistically insigniﬁcant because of low study power. A balance between narrow eligibility to enhance study power and limiting restrictions to increase generalizability should be sought. 4. measurement variation. The balance of important prognostic factors within treatment groups can be enhanced by randomization separately within subgroups (stratiﬁcation) but should be conﬁrmed in the analysis. Sample sizes large enough to achieve a power of 80–95% are generally considered desirable for detecting meaningful differences. Failure to consider censoring in an economic analysis may further compromise the power of the study. This longer period of observation may not always be justiﬁed or even ethical. When the sample size of the trial is appropriately targeted to the economic outcomes. basing treatment assignment on randomization. Eligibility criteria.302 Lyman both the outcome of interest and treatment group assignment. including any inclusion and exclusion criteria. 5. and maximum beta error (false negative) considered acceptable. Multiinstitutional CCTs may increase study accrual and sample size and external validity of both the clinical and the economic outcomes. In studies with insufﬁcient sample size to address subgroup . especially when meaningful differences in clinical outcome are already apparent. sample size estimation should consider the event (or cost) rate and the anticipated duration of observation and the expected censoring rate. and by blinding subjects and investigators to the assigned treatment (double blinding).

measurement. may be missing nonrandomly and bias group comparisons. which are often not available until later or even after trial completion. or noncompliance may result in either item nonresponse at a speciﬁc point in time or unit nonresponse where most information on a resource component is missing. of greater concern is the possibility that missing data. will not salvage an underpowered or biased clinical trial. and data recording are important to minimize bias and random error in a trial. Although the prospective concurrent collection of outcome data in a CCT generally reduces the potential for missing data. It may be difﬁcult to estimate sample size requirements given the limited information on what constitutes meaningful differences in economic outcomes. CCTs with primary economic hypotheses may require larger sample sizes to achieve the desired ability to demonstrate an economic effect. Missing data associated with death. they are seldom designed with early stopping rules based on secondary outcomes such as cost. even when randomly missing. The primary data analysis and any planned subgroup analysis should be described in advance in sufﬁcient detail to provide the reader with a full understanding of the planned analysis. The economic resource and cost information collected should be objective and comprehensive and yet limited to that needed to address prestated hypotheses matching clinical measures in style and frequency. standardized measurement of economic outcomes should be applied equally to treatment groups. The quality and completeness of observation. . results should be presented descriptively for purposes of hypothesis generation only.Economic Analysis of Clinical Trials 303 analyses. will reduce the power of the study analysis. skewed distributions and frequent missing data. Outcome Measures and Analysis Economic outcomes should include measures of activity level (including time and resources used) and unit costs of such activity. Sample size estimation in economic studies is complicated by the limited efﬁciency of conventional methods used with such data. Because of greater variability. As a ﬁrst-order approximation. Although interim analyses of large trials of expensive technologies might be desirable. however. Even the most elegant analysis. Sample size estimates based on ratios of cost and efﬁcacy should consider the variance and covariance of both measures and the desired level of precision. including subject withdrawal. Missing data. Sample size estimates should consider any adjustment needed for multiple testing due to the multiple outcome measures involved. 6. treatment delay. Resource utilization measures should be speciﬁed in advance and applied equally to each intervention group ideally by blinding both the patient and investigator to the treatment assignment. Where this is not feasible or ethical. However. sample size requirements can be estimated on the basis of the approximately log-normal distribution of cost data. loss to followup. methods for minimizing and dealing with missing data should be discussed in advance and explicitly handled in the analysis. disability.

economic data are often considered of secondary importance or are added to a clinical trial as an afterthought and relegated to a low level of importance.304 Lyman C. Study Conduct Considerations 1.31). It is essential to distinguish between resource utilization related to the intervention and that related to the conduct of the trial. including data collection and altered patterns of care and follow-up. It is also important that resources consumed and unit costs are measured separately since they vary quite differently. the use of more representative external unit cost data may be considered. that resource utilization and unit cost information are generally not independent of one another or of the clinical trial design. 2. Resource Utilization Data Patient monitoring and data collection procedures for conventional clinical outcomes in the conduct of a clinical trial are relatively standardized. When the focus is on internal validity and maintaining the direct association between resource utilization and cost. however. When the focus is on external validity or institutional data are not considered representative. concurrent and prospective collection of cost information should be considered. It is essential that the same systematic effort and precision are applied to the collection of economic outcomes as are used to measure clinical efﬁcacy. Unit Cost Data Costing methodology varies considerably between studies. Analysis Considerations 1. whereas unit costs vary considerably between institutions. The types of resource utilization generally considered in such studies are summarized in Table 3. Resource utilization depends primarily on the clinical situation. D. regions. including an economic evaluation with cost values suitable for statistical . Unfortunately. Even when concurrent costing is not feasible or desirable. It must always be kept in mind. A survey of published randomized trials. and health systems and over time. the use of site-speciﬁc cost information should be applied to the pooled resource data. Type of Study The type of analysis appropriate for an economic evaluation depends on the study design and the nature of the data (30. Economic analyses in association with CCTs often do not adequately address changes in resource utilization and cost that occur over time. Answers to economic questions depend on resource utilization and cost through the period of full recovery or death requiring longer patient monitoring than for the estimation of clinical efﬁcacy. Resource utilization data in a CCT is most accurate and complete when collected concurrently with efﬁcacy data.

Table 3  Sources of Resource Utilization in Economic Analyses Associated with Cancer Clinical Trials

Direct: medical
1. Hospitalization*
   Routine vs. intensive care
   Frequency
   Duration
   Physician/nursing services
   Laboratory/radiology services (type and number of tests)
   Pharmacy services (medications, chemotherapy)
   Radiation therapy services
   Drugs/treatments
   Surgical procedures
   Blood bank services (transfusions)
   Other services: support services
2. Ambulatory (clinic)
   Frequency
   Outpatient tests/procedures
   Outpatient treatment (surgery, chemotherapy, radiation, other)
3. Nursing home/hospice care
   Visits (M.D., R.N., etc.)

Direct: nonmedical
1. Transportation costs (distance traveled, time spent)
2. Out-of-pocket expenses

Indirect: medical
1. Medical/nursing services
   Home visits
   Interim testing
2. Social Services
3. Physical therapy
4. Other medical support services

Indirect: nonmedical
1. Lost wages
   Loss of work time by patient, family, and friends during treatment
   Impact on family resources
   Days lost from work

* Direct and indirect institutional expenditures including overhead (utilities, rent, equipment maintenance and depreciation, consumables, etc.)
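Table 3 lists the resource categories typically collected; as emphasized in this section, the quantities of resources used and their unit costs should be recorded and analyzed separately, which can be mirrored directly in the data layout. The minimal sketch below keeps one table of per-patient quantities and a separate price list, so that site-specific unit costs can later be exchanged for external ones. Resource names and prices are invented.

```python
import pandas as pd

# quantities of resources used per patient (hypothetical items and counts)
use = pd.DataFrame({
    "patient": [1, 1, 2, 2],
    "resource": ["hospital_day", "chemo_cycle", "hospital_day", "chemo_cycle"],
    "quantity": [8, 6, 3, 6],
})

# unit costs kept separately; swap in site-specific or external prices as needed
unit_cost = {"hospital_day": 900.0, "chemo_cycle": 1500.0}

use["cost"] = use["quantity"] * use["resource"].map(unit_cost)
total_cost = use.groupby("patient")["cost"].sum()   # per-patient total cost
print(total_cost)
```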

The source of unit cost information and any discounting considered should be justiﬁed. such as cost effectiveness. The most difﬁcult situation is that associated with informative missing .306 Lyman analysis. Measurement variability is often greater for indirect costs where missing or incomplete data are also more likely to be a problem. was recently reported (32). and the external validity or generalizability of results should be discussed. missing data. Missing Data Missing data may have an impact not only on study precision by reducing the number of subjects with complete data but also study validity by biasing outcome estimates if the missing data are associated with outcome measures or treatment group assignment. When missing data is missing at random but related to the observed data. The results of such economic studies should be viewed as exploratory or hypothesis generating and should be presented descriptively. In such analyses. The source of economic data in such studies is often derived from small subsamples or from separate nonsampled sources. efﬁcacy and cost outcomes. Calculated mean costs or combined measures of cost and efﬁcacy. frequently ignore the inherent variability between subjects relying on sensitivity analyses to assess the robustness of any conclusions. they are considered missing completely at random and can be dealt with by complete case analysis with some loss in power or by simple imputation of missing values such as the last observed or mean values with some underestimation of variance. If missing data are independent of observed and unobserved data. or multiple testing issues. it is often informative to review the distribution of each outcome measure along with some percentile range such as the interquartile range. the investigator controls the variation and range of parameters. missing data. When information on variability is available. sample size requirements. the same rigor of statistical analysis should be applied as is used for assessing clinical efﬁcacy. Economic measures are often skewed with frequent outliers and greater variability than most clinical measures. The relationship between missing data and treatment group assignment. Economic outcomes collected in the context of CCTs are often considered of secondary importance with limited attention given to prestated economic hypotheses. and robustness is arbitrarily deﬁned. Study evaluation should be based on an intention-to-treat analysis and appropriately powered to measure effect sizes of economic importance carefully considering measurement distributions. In larger studies incorporating a limited number of a priori economic hypotheses. or important covariates should be studied. Missing data can also complicate multivariate modeling which considers only cases for which data are available on all covariates considered. multiple imputation techniques and bootstrapping can provide more reasonable point estimates and variance. and multiple comparisons. 2. the potential interaction between parameters is ignored.
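To make the trade-offs above concrete, the toy example below contrasts complete-case analysis with single mean imputation of a missing cost item; as noted, single imputation of this kind understates the variance, which is why multiple imputation or bootstrap methods are generally preferred when data are missing at random. All numbers are invented.

```python
import numpy as np

costs = np.array([1200.0, 3400.0, np.nan, 800.0, 5600.0, np.nan, 2100.0])

# complete-case analysis: drop patients with a missing cost item (loses power)
cc = costs[~np.isnan(costs)]
print("complete case :", cc.mean(), cc.std(ddof=1))

# single mean imputation: fills the gaps but artificially shrinks the variance
imputed = np.where(np.isnan(costs), np.nanmean(costs), costs)
print("mean imputed  :", imputed.mean(), imputed.std(ddof=1))
```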

which may actually increase external validity with total costs estimated by regression or multiple imputation techniques.g. and low false-negative rate. large sample size. which depends on missing data or the parameters of interest. small variability.Economic Analysis of Clinical Trials 307 data.’’ and the inability to address interaction. In the analysis of a clinical trial. including estimation and hypothesis testing. Estimation summarizes the distribution of outcomes providing measures of central tendency. It is generally not considered necessary to have unit cost information on all subjects as long as resource utilization data are complete. the range of values considered. which increases the chance of observing a statistically signiﬁcant difference due to chance alone. Although it is sometimes useful to compare cumulative cost distributions between groups using a general nonparametric technique such as the Kolmogorov-Smirnov test. and measures of variability or precision. the lack of standard criteria for ‘‘robustness. more powerful methods exist for comparing speciﬁc distribution parameters such as mean and median costs. . 3. When dealing with very large data sets that are reasonably well behaved. Unit cost data can be collected on a subset of patients or from an independent data source. Cost Differences. the robustness of the assumptions may be assessed with a sensitivity analysis. A true difference is supported by a large treatment effect. Inferences on cost differences between treatment groups should be supported by measures of precision (e. such as conﬁdence intervals. Appropriate adjustment in signiﬁcance levels for multiple testing in the analysis is necessary.. The observed differences in outcome may represent either true effects or differences due to random error (variability) or systematic error (bias). Statistical inference in economic studies is most commonly based on differences in arithmetic mean costs between treatment groups since only these estimates permit ready calculation of the total costs of interest. Economic analyses are often faced with multiple outcome measures and repeated measures over time. however. Hypothesis testing involves an assessment of the probability of obtaining the observed difference in outcome under the null hypothesis of no true difference between the groups. When the variance of unit costs and the covariance between cost and resources used are not available. low false-positive rate. Such an analysis is limited by the potential bias in selecting variables for analysis. Cost data. such as means or proportions. are often highly skewed due to high costs incurred by a few patients. Statistical Analysis Economic outcomes of a clinical trial such as activity level or cost are seldom equal among the study groups. which represents the upper and lower bounds likely to contain the true value of a variable. random error is addressed through statistical inference. conﬁdence intervals) of the estimated difference in mean costs or appropriate hypothesis testing considering outcome distributions.

33). The observed data are treated as an empirical probability distribution that can be sampled repeatedly with replacement providing a distribution of outcomes from which conﬁdence limits and hypothesis testing can be developed. they compare the median and the distributions of costs rather than arithmetic mean cost differences. Rank procedures.308 Lyman greater power will be associated with the use of parametric analyses such as Student’s t test. censoring is informative with regard to costs and survival.36). however. generally assume that group distributions have the same variance and shape and replace economically relevant information with ranks. In addition. When cost data are censored before death. The product-limit estimation method of Kaplan and Meier represents a reasonable approach for dealing with cumulative costs over time. When faced with smaller data sets or unresolved skewed distributions. Methods that ignore censoring will potentially bias mean costs and cost-effectiveness ratios. inference based on log transformed costs compare geometric means. When dealing with the need for covariate adjustment. the log-rank test related to the proportional hazards regression method of Cox may have advantages. Recently. . Bayesian methods based on subjective prior beliefs have been proposed but the need to determine a priori distributions and computational complexity limit their applicability. Zhou and Gao (34) proposed a Zscore for differences in means when group variances are not equal since the log of the mean of the untransformed costs equals the mean of the log of the transformed costs plus one half of the variance. nonparametric bootstrap methods based on the original data have been proposed which make no assumption about the shape or equality of the underlying distributions (36). A number of difﬁculties may be encountered in assessing costs in failure time studies. However. The different scales for death and censoring can result in informative censoring even if no deaths are observed (27). including the truncation of outliers. This is illustrated by the nonconstant changes in cost over time and the informative relationship of costs to health status exempliﬁed by the increase in costs immediately before death. however. will result in loss of economically important information and may yield misleading results. Alternative methods. particularly when censoring is present (35. which do not address the primary issue of importance related to arithmetic mean cost differences. In addition. The assumption of independent censoring is often violated in cost-to-event type analyses. which may be reasonably applied with conﬁdence intervals calculated on the mean cost difference. analysis with nonparametric methods such as rank and log-rank tests is more appropriate. Log transformation of costs will reduce the impact of outliers and may be useful when it results in normal and similarly sized distributions (30. The Wilcoxon rank or Mann-Whitney tests are often used in this situation since they are much more efﬁcient for comparing asymmetric distributions and yet relatively efﬁcient even when comparing normal distributions.
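The nonparametric bootstrap mentioned above can be sketched in a few lines: resample patients with replacement within each treatment group and read a confidence interval for the difference in arithmetic mean costs off the percentiles of the resampled differences. The cost data below are simulated from skewed distributions purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
cost_a = rng.lognormal(mean=8.0, sigma=1.0, size=150)   # simulated skewed costs, arm A
cost_b = rng.lognormal(mean=8.2, sigma=1.0, size=150)   # simulated skewed costs, arm B

def boot_mean_diff(x, y, n_boot=5000):
    """Percentile bootstrap CI for the difference in arithmetic mean costs (y - x)."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(y, size=y.size, replace=True).mean()
                    - rng.choice(x, size=x.size, replace=True).mean())
    return np.percentile(diffs, [2.5, 97.5])

print("observed difference:", cost_b.mean() - cost_a.mean())
print("95% bootstrap CI   :", boot_mean_diff(cost_a, cost_b))
```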

Combined Outcome Measures. Statistical inference on combined measures of cost and effectiveness is complicated by the lack of information on the variance and covariance structure of costs and clinical efficacy. These are often dealt with conservatively by presenting variance or confidence limits around point estimates of efficacy and resource utilization or cost separately. Several methods for estimating confidence intervals for cost-effectiveness ratios based on the joint variance of cost and efficacy have been proposed, none of which is entirely satisfactory (38). A ''confidence box'' may be defined by estimating confidence limits separately for incremental effect and incremental cost, ignoring correlation between cost and benefit. The resulting confidence limits on the cost-effectiveness plane are generally considered overly conservative. Parametric estimation of the joint density of incremental effect and cost considers the covariance, generally defining an ellipse on the cost-effectiveness plane. The confidence limits are problematic, however, when the uncertainty includes different quadrants of the cost-effectiveness plane (37).

Acceptability curves can be used to summarize uncertainty in cost-effectiveness studies. Van Hout et al. (39) calculated the probability that the cost-effectiveness ratio falls below a defined maximum acceptable ratio on the cost-effectiveness plane (Fig. 4), which they claim is equal to integrating under the appropriate regions of the joint probability distribution f(E, C) around maximum likelihood point estimates for cost effectiveness, where E and C represent observed incremental mean effectiveness and mean cost, respectively. Assessing the probabilities associated with varying ceiling cost-effectiveness ratios defines an acceptability curve where the 50th percentile defines the point estimate. Such a curve crosses the probability axis at the one-sided p value for the incremental cost (∆C) and is asymptotic to 1 minus the one-sided p value for the incremental effectiveness (∆E). In addition, confidence limits may be defined for the ceiling cost-effectiveness ratio from the acceptability curve. Hlatky et al. (40) reported the use of the bootstrap technique to obtain a nonparametric estimate of the joint density based on the probability of results falling below a specified threshold level of cost effectiveness.

The net-benefit statistic offers some advantages for handling uncertainty in cost-effectiveness analysis, including sample size calculation (37). Confidence limits may also be derived from the net-benefit statistic, where the net benefit (NB) is defined as

NB = CER_ceiling · ∆E − ∆C

The net benefit can be shown to be approximately normally distributed with variance and confidence limits defined as

Var(NB) = CER²_ceiling · var(∆E) + var(∆C) − 2 · CER_ceiling · cov(∆E, ∆C)

Confidence limits = NB ± z_(α/2) · √Var(NB)

It has also been
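Under a normal approximation, the net-benefit statistic and an acceptability curve can be computed directly from trial summaries. The sketch below uses hypothetical values for the incremental effect, incremental cost, and their (co)variances, and evaluates NB = CER_ceiling · ∆E − ∆C over a grid of ceiling ratios; it is an illustration of the formulas above, not a general-purpose implementation.

```python
import numpy as np
from scipy.stats import norm

# hypothetical trial summaries (illustrative values, not from the text)
dE, dC = 0.08, 2500.0                    # incremental effectiveness and incremental cost
var_dE, var_dC, cov_EC = 0.03**2, 900.0**2, 5.0

def nb_and_variance(ceiling):
    nb = ceiling * dE - dC
    var_nb = ceiling**2 * var_dE + var_dC - 2 * ceiling * cov_EC
    return nb, var_nb

def nb_confidence_limits(ceiling, alpha=0.05):
    nb, var_nb = nb_and_variance(ceiling)
    half = norm.ppf(1 - alpha / 2) * np.sqrt(var_nb)
    return nb - half, nb + half

def acceptability(ceiling):
    """One point of the acceptability curve: P(net benefit > 0) under normality."""
    nb, var_nb = nb_and_variance(ceiling)
    return norm.cdf(nb / np.sqrt(var_nb))

for ceiling in (10_000, 30_000, 50_000):  # ceiling cost-effectiveness ratios
    print(ceiling, nb_and_variance(ceiling)[0],
          nb_confidence_limits(ceiling), round(acceptability(ceiling), 3))
```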

Nevertheless. Cost Discounting It is also important to adjust changes in cost or beneﬁt measures for changes over time and place. and must deal with the proportional hazards and linearity assumptions of the model. education. make no assumption about the distribution of costs for an individual. marital status). covariate adjustment is necessary to estimate absolute effects because of the heterogeneity in prognostic factors. If actual confounding has occurred. Even when relative treatment effects are the same across subgroups. The proportional hazards regression method of Cox has been proposed for skewed resource or cost data providing estimates of mean cost differences by including treatment assignment as a covariate (42). While multiple regression is commonly used in covariate adjustment. The outcomes of interest in economic analysis are the absolute cost difference and absolute treatment effect that depend on the control survival and the relative survival advantage with treatment. residence. Covariates of particular interest in economic analyses include demographic factors (age. 5. Price adjustment is necessary . the distribution of known prognostic factors within treatment groups should be evaluated. to increase validity by controlling for confounding bias. the apparent relationship between treatment and outcome will be either strengthened or weakened with adjustment through either stratiﬁed analysis or multivariate modeling. Adjustment Covariate adjustment is generally undertaken for one of three reasons: to increase precision or tighten conﬁdence intervals on estimates of treatment effect. they permit cost analyses to consider the issue of censoring that might otherwise result in low cost estimates when considering a severe illness or an intervention associated with high early mortality or withdrawal. type of health insurance and provider organization). Cost discounting considers preferences for immediate over future beneﬁt and for delaying present costs to the future. Despite efforts to minimize bias in the design and conduct of a clinical trial. the skewness of cost distributions may result in overestimates of variance and broad conﬁdence limits. race. family/caregiver status. Covariate adjustment of treatment effect and costs will nearly always increase power.310 Lyman suggested that a Bayesian approach allows a more direct method for estimating cost-effectiveness ratios (41). employment status. socioeconomic factors (income. sex. occupation. Regression on linear and logarithmic transformations of costs may not yield normal residuals limiting the interpretation of results. 4. prior treatment). Any covariate found to be associated with both treatment group assignment and the outcome of interest must be considered a possible confounding factor and addressed further in the analysis. and comorbidities (functional status. and to estimate outcomes in patient subgroups. Such models are complex.
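One common way to carry out the covariate adjustment described here is an ordinary least-squares regression of cost on treatment and the prognostic covariates, so that the treatment coefficient is the adjusted difference in mean costs; the text also mentions Cox regression as an alternative for skewed cost data. The variable names and simulated data below are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),        # randomized arm (0/1)
    "age": rng.normal(55, 10, n),
    "stage": rng.integers(1, 4, n),        # hypothetical prognostic factor
})
# simulated costs with skewed errors
df["cost"] = (20_000 + 3_000 * df["treat"] + 150 * df["age"]
              + 4_000 * df["stage"] + rng.exponential(8_000, n))

# adjusted difference in mean costs = coefficient of `treat`
fit = smf.ols("cost ~ treat + age + C(stage)", data=df).fit()
print(fit.params["treat"], fit.conf_int().loc["treat"].values)
```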

Economic Analysis of Clinical Trials 311 when observations extend over time ( 1 year) or geographical region to present economic results in a common framework. Subgroup analyses should include measures of variability in the effect measures such as conﬁdence limits. Ideally. multiple subgroup analyses should be discouraged and limited to those of major interest and stated in advance of the trial for which a difference in efﬁcacy or cost effectiveness might be anticipated (e. stratiﬁcation factors).g. prognostic or predictive factor). The present cost. and decision modeling of clinical and economic outcomes. Adjustment for confounding factors may improve both validity and precision providing more accurate estimates. strong evidence for such effect modiﬁcation should be provided. therefore. The Cost Discount Rate (CDR) represents the cost discount (future cost present cost) as a proportion of the present cost. cost adjustments are generally based on the Consumer Price Index or the Fixed Weight Index. Unless there are valid reasons to expect such subgroup differences in treatment effect. 7. such models should be externally validated on an independent data set and some measure of goodness-of-ﬁt of the model to the data reported. All future and past costs are generally expressed in terms of the present or some ﬁxed point in time. Clinical . 6. Subgroup Analyses Treatment effects and costs in a clinical trial often differ between subgroups of the study population. multiple testing in subgroup analyses is associated with an increase probability of ﬁnding signiﬁcant differences due to chance alone (type I error). Although such differences may represent an interaction between the intervention and a covariate (e.g.. Modeling Modeling of the relationship between treatment and outcome is used for a variety of purposes: adjustment for known confounding variables. Even when the treatment effect is uniform across subsets. Clinical prediction models for patient selection may improve the cost efﬁciency of an intervention. the observed differences may also be the result of random error or study bias. represents the future cost divided by (CDR 1) n when discounting is conducted over n years. Therefore. The best approach to subgroup analyses is to perform a test for interaction to assess the homogeneity of treatment effect across patient subsets rather than reporting difference in outcomes between subgroups. It is reasonable to view any differences with considerable skepticism utilizing more restrictive criteria for judging statistical signiﬁcance. Such models may also permit estimation of outcome differences within subgroups when heterologous. Statistically signiﬁcant treatment effects in one group and not in another may reﬂect the variation in power when one group has a more favorable outcome with fewer events and therefore lower power to show an effect. In the United States.. development of clinical prediction models.
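The discounting rule described in this section, present cost = future cost/(CDR + 1)^n, can be wrapped in a small helper for a stream of yearly costs; the 3% rate used below is only an example, not a recommendation from the text.

```python
def present_value(costs_by_year, rate=0.03):
    """Discount a list of future yearly costs back to the present:
    a cost incurred in year n contributes cost / (1 + rate) ** n."""
    return sum(c / (1 + rate) ** year for year, c in enumerate(costs_by_year))

# costs incurred now, in one year, and in two years
print(present_value([10_000, 4_000, 4_000], rate=0.03))
```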

Particular attention must be paid to the Markovian assumption that state transition probabilities are independent of previous health states requiring the use of a combination of distinct health states to model the medical history. cost-effectiveness analysis or cost-utility analysis). The generalizability of the results for patients outside of the context of the individual CCT should be discussed (47). Interpretation and Reporting Considerations 1. including beneﬁts and costs.46). In such models. which involves multiplying the estimated outcome value by the probability of that outcome occurring and summing over all branches of the immediately preceding chance event. Individual Studies The interpretation and reporting of economic analyses should always consider other possible explanations for the observed differences in outcome. The threshold probability of an event relates to the ratio of beneﬁts and costs reﬂected in the values or utilities incorporated into the model (See the Appendix). emphasis on descriptive and graphical displays is often more rewarding than any formal statistical testing. including low study power (sample size). When a decision point is reached. The costs measured and details of the cost analysis should be presented and discussed. E. and outcome values. and multiple comparisons (44. missing data. Sensitivity analyses based on such models permit an assessment of the robustness of the optimal strategy by assessing how changes in parameter values effect the expected value of the choices and the threshold where expected outcome values are equal. The analysis of decision models requires speciﬁcation of the model structure. including choices. The analysis of decision models is based on calculating the expected value of each choice by a process of folding back. probabilities of all chance events. This weighted sum then represents the expected value of the outcome.45. Despite certain limitations. differences in study populations. Markov modeling provides a valuable tool for economic evaluation of chronic diseases with simultaneous assessment of effectiveness and cost. the choice associated with the greatest expected beneﬁt or lowest expected cost represents the preferred choice.g. Economic analyses related to CCTs are subject to the same sources of varia- .312 Lyman decision models represent valuable methods for the economic evaluation of data from comparative studies of intervention strategies permitting simultaneous consideration of more than one type of outcome measure (e. including discounting with disease progression over time (43). measurement variability. which now becomes the outcome value for the immediately preceding step. and outcomes.. chance events. A review of cost analyses associated with randomized clinical trials revealed that only one half of the studies actually reported cost ﬁgures and few reported indirect costs or study-related costs (44).
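Folding back a decision tree is simply a probability-weighted sum computed from the leaves toward the root, and the threshold probability discussed here (and derived in the Appendix) follows from the same utilities. A minimal sketch for the two-strategy, disease/no-disease tree, with made-up utilities and disease probability:

```python
# hypothetical utilities U[strategy][state] on a 0-1 scale (illustrative only)
U = {
    "treat":    {"disease": 0.70, "no_disease": 0.95},
    "no_treat": {"disease": 0.40, "no_disease": 1.00},
}

def expected_value(strategy, p_disease):
    """Fold back one chance node: weight each outcome utility by its probability."""
    return (p_disease * U[strategy]["disease"]
            + (1 - p_disease) * U[strategy]["no_disease"])

def threshold_probability():
    """p* at which both strategies have equal expected value:
    p* = cost / (benefit + cost), with benefit = U_td - U_nd and cost = U_nn - U_tn."""
    benefit = U["treat"]["disease"] - U["no_treat"]["disease"]
    cost = U["no_treat"]["no_disease"] - U["treat"]["no_disease"]
    return cost / (benefit + cost)

p = 0.25                                   # assumed probability of disease
for s in U:
    print(s, round(expected_value(s, p), 3))
print("preferred strategy   :", max(U, key=lambda s: expected_value(s, p)))
print("threshold probability:", round(threshold_probability(), 3))
```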

a limited number of testable economic hypotheses along with their relevance. 5. Study rationale: The logical basis and importance of an economic analysis should be laid along with the rationale for conducting an economic evaluation in relationship to a CCT. 3. Multiple testing: Correct for multiple testing related to multiple outcomes and repeated measures over time. 4. Unit costs: Measure and record unit cost data including adjustments/conversions. Discounting: Any applied discount rate for inﬂation/time should be speciﬁed and justiﬁed. 3. 7. Planned analyses: The type of economic analysis should be speciﬁed and justiﬁed in advance including any subgroup analysis planned and any model to be used. 8. quality of life. 4. 3. Power: Estimate statistical power for evaluating group comparisons and assessing conﬁdence in reported results. Outcome measures: All pertinent clinical efﬁcacy. Separate analysis: Resources used and unit costs should be analyzed separately before any combined analysis. Study population: Deﬁne the source and nature of the study population (treatment and control groups) including eligibility and exclusion criteria. Missing data: Every effort should be made to minimize incomplete data. . and multiple comparisons. and economic outcomes should be measured using valid instruments speciﬁed in advance. Outcome measures: Measures of clinical efﬁcacy and economic cost should be speciﬁed in advance. repeated measures over time. calculate summary estimates and conﬁdence limits for treatment effect and resource quantities and unit costs. 5. Treatment assignment: Treatment assignment should be randomized or at least standardized and the rationale presented. Combined outcomes: Focus subsequent estimation on combined outcomes of incremental cost and incremental efﬁcacy (cost-effectiveness or utility ratios) 4. 2. 6. 7. Study hypotheses: Deﬁne before study initiation.Economic Analysis of Clinical Trials Table 4 313 Guidelines for Economic Analyses Associated with Cancer Clinical Trials Design 1. Perspective: The viewpoint from which the study is to be conducted and analyzed should be speciﬁed and justiﬁed. 2. Activity measures (quantities of resources used): These and unit costs (direct and indirect) should be collected and reported separately. variability. Data collection 1. 2. Estimation: After careful examination of distributions. Data analysis 1. 6. alpha and beta error (power). Sample size: Sample size should be sufﬁcient for valid conclusions concerning primary and major secondary hypotheses: effect size. Hypothesis testing: Apply appropriate method of statistical inference to group comparisons based on the observed data distributions.

g. Quality of Life: HRQOL measures should be reported separately and in combined measures with other clinical measures. analysis performed. tion in results as other clinical investigations. In a recent review of 45 randomized trials that included individual cost data. marginal cost effectiveness. including probabilities and outcome values. Results: Discuss results in the context of the primary and any secondary hypotheses. 4. and health policy formulation. The authors of this study concluded that only 36% provided conclusions justiﬁed on the basis of the data presented. Data interpretation and reporting 1. e. including generalizability to other settings. study population. Modeling: Any model used should be justiﬁed and model parameters. Relevance: Discuss importance of study question. Meta-analysis can form the basis of an economic evaluation by systematically summarizing the results of several . sample size originally planned. Sensitivity analysis: Sensitivity analyses should be based on valid models with justiﬁcation for the range of variable variation. 6. 11. 5.. 3. and interpretation of economic analyses associated with clinical trials are offered in Table 4. should be appropriately estimated and justiﬁed. study population. conduct. 9. Validity: Discuss the issues of internal and external validity. treatment assignment. Preliminary guidelines for the design. 2. analysis. Limitations: Discuss the limitations of the study design. 2. including relevance to clinical decision making. 25 (56%) presented statistical tests or measures of precision on the cost comparisons between groups. data analysis including statistical inference and modeling. outcome measures including resource utilization and cost estimates. Methods: Present methods fully including a priori hypotheses. Meta-Analysis If the existing information already suggests that the intervention in question is efﬁcacious. Cost-effectiveness/utility: Treatment groups should be compared on the basis of an incremental analysis. whereas only 9 (20%) reported adequate measures of variability (48). cost efﬁciency. then it may be reasonable to base an economic analysis on either a systematic review or formal meta-analysis. 10. measurements obtained.314 Table 4 Continued Lyman 8. Resource utilization: Present resources used and cost estimates separately and utilizing appropriate aggregate or combined measures.

studies of a given clinical intervention, providing greater confidence of treatment effect and resource utilization than individual studies. Such an analysis is limited by the type and quality of economic data collected or reported. Meta-analysis of economic evaluations related to clinical trials must consider the same methodological challenges as other meta-analyses. The principal difficulty consists in identifying and accessing all relevant results on a particular issue, considering publication bias due to failure to publish negative study results, studies independent of the primary trial, and studies commissioned for specific administrative purposes. Computerized literature searches are inadequate for identifying unpublished analyses. Clinical trial data banks may identify additional clinical trials with concurrent economic evaluations but will not detect independent economic evaluations. Any economic analysis based on the results of a meta-analysis is constrained by the potential bias from incomplete ascertainment and by the incomplete collection and reporting of resource use by most CCTs. Nevertheless, economic analyses based on such comprehensive data may provide powerful information on the cost efficiency of an intervention.

V. SUMMARY AND CONCLUSIONS

In conclusion, cancer care is associated with both clinical and economic outcomes of interest. Economic analyses have gained increasing importance in the evaluation of costly cancer treatments in the setting of limited resources (49–59). In many ways, CCTs, when properly designed and conducted, appear to represent an excellent source of carefully obtained information for incorporation into economic analyses; such trials provide a desirable environment for assessing complementary outcomes such as costs and quality of life in addition to measures of clinical efficacy. Recent reviews of the analysis and interpretation of economic data in randomized controlled trials reveal a lack of awareness about important statistical issues. Attention to the many important issues in planning, conducting, analyzing, or reporting an economic analysis related to a CCT discussed in this chapter will enhance the quality and validity of the study. Clearly, the investigator must first decide whether an economic analysis is needed and whether a CCT is a reasonable framework. When such analyses are warranted, the same methodological rigor in design, conduct, analysis, and reporting should be applied as used for conventional measures of clinical efficacy. Ideally, the economic analysis will be incorporated into a written and approved protocol, including a priori hypotheses, the population to be studied, the clinical and economic measurements to be obtained, and the planned statistical analysis. Guidelines have been provided here for the design and analysis of economic studies in association with CCTs. The investigator must ultimately decide how

to interpret and present the data and what it means in the broader health care setting. Perhaps of greatest importance is the generalizability of economic results of a clinical trial for the routine application of such interventions within a larger population. In the years to come, such analyses will play an increasingly important role in clinical decision making, individual patient counseling, evidence-based clinical guideline development, reimbursement, and national and international health policy formulation. The ability to properly measure and analyze such data will greatly aid clinicians and health care planners in providing optimal quality and cost-effective care to patients with cancer (60).

APPENDIX: DECISION MODEL THRESHOLD ANALYSIS BASED ON BENEFITS AND COSTS

Each possible outcome in a realistic clinical situation can be considered to have a certain value or utility (U) and a certain probability of disease (p). The expected value of the treatment and no-treatment strategies is therefore

EV_treatment = p · U_treat/disease + (1 − p) · U_treat/no disease
EV_no treatment = p · U_no treat/disease + (1 − p) · U_no treat/no disease

The treatment strategy associated with the greatest expected value should be chosen to optimize the likelihood of the best result. Most often, however, we are interested in determining the threshold probability at which point the expected values of the treatment strategies are equal, i.e., EV_treatment = EV_no treatment:

p · U_treat/disease + (1 − p) · U_treat/no disease = p · U_no treat/disease + (1 − p) · U_no treat/no disease

Solving for p gives the threshold probability

p_threshold = (U_no treat/no disease − U_treat/no disease) / [(U_treat/disease − U_no treat/disease) + (U_no treat/no disease − U_treat/no disease)]
            = cost/(benefit + cost)
            = 1/(1 + benefit/cost)

The benefits and costs can be derived from the utility estimates as shown:

Benefit of treatment = U_treat/disease − U_no treat/disease
Cost of treatment = U_no treat/no disease − U_treat/no disease

A sensitivity analysis could be conducted comparing the expected value functions as the probability of disease is varied. From such a relationship, it is evident that as the ratio of benefit to cost increases, the threshold probability of disease decreases. Above the threshold

Gotay CC. Detsky AS. 7. Hays RD.K. 3. 6. Quality-of-life assessment in cancer treatment protocols: research issues in protocol development. Wright JC. 13. Stephens R. 9. Ann Intern Med 1995. The indications for treatment therefore broaden as the ratio of beneﬁt to cost increases. 12. Stavins J. Quality of Life Assessment in Clinical Trials: Methods and Practice. J Chronic Dis 1987. . Oncology 1995. Outcomes of cancer treatment for technology assessment and cancer treatment guidelines. Staquet MJ. 15. 14. 5. Altman DG. Oncology 1995. Moore TD. Economic analyses of health care technology: a report on principles. Quality of life assessment in clinical trials—guidelines and a checklist for protocol writers: the U. 12:257S–265S. Health Administration Press 1989. 10. Fintor L. Naglie IG. J Natl Cancer Inst 1992. 14:671– 679. Resource allocation decisions in health care: a role for quality of life assessments. Quality-of-life-adjusted evaluation of adjuvant therapy for operable breast cancer. Med Care 1991. 33:20–28. Controlled Clin Trials 1991. 2. Girling DJ. Goldhirsch A. Baker MS. McCabe MS. 1998. Cella DF. 16. Pocock SJ. Harvey A. Quality of life assessment in clinical cancer research. A clinician’s guide to cost-effectiveness analysis. J Natl Cancer Inst 1990. 330:380–404. Brown ML. Weeks J. Drummond MF. Task Force on Principles for Economic Analyses of Health Care Technology. Measurement of utilities and quality-adjusted survival. N Engl J Med 1998. Oxford: Oxford University Press. Ann Intern Med 1991. Schumacher M. Korn EL. In: Cancer in Cancer Care and Cost. American Society of Clinical Oncology. Gains in life expectancy from medical interventionsstandardizing data on outcomes. 29:725–742. Gelber RD. Fayers PM. New York: Marcel Dekker. 40:605–616. A perspective on the role of quality-of-life assessment in clinical trials. Weinstein MC. Cheson BD. REFERENCES 1. 9: 67–70. Hopwood P. 84:575–579. 11. Medicare use in the last 90 days of life. 82:1811–1814. Eur J Cancer 1997. Bonomi AE. 122:61–70. Brown ML. J Clin Oncol 1996. 17. Site-speciﬁc treatment costs. The national economic burden of cancer: an update. Cavelli F. Ann Intern Med 1990. Kessler LC. 113:147–154. 8. Measuring quality of life: 1995 update. 4. treatment will be associated with a greater expected value and will therefore be the favored strategy. The economic burden of cancer. Br J Cancer 1994. 114:621–628. Gaumer GL. Schulgen G. Fayers PM. et al. Medical Research Council Experience. In: Cancer Prevention and Control. 1995. 70:1–5. Machin D. Olschewski M.Economic Analysis of Clinical Trials 317 probability of disease. 9: 47–60.

16: 783–790. Angers J-F. Conﬁdence intervals for log-normal means. 28. Westerman IL. Siegel JE. what. Russell LB. Kaplan EL. 22. 32. Tallman MS. 276:1330–1341. 34. Drummond MF. Davies L. Integrating economic analysis into cancer clinical trials: the National Cancer Institute-American Society of Clinical Oncology Economics Workbook. 31. Recommendations of the panel on cost-effectiveness in health and medicine. Analysis and interpretation of cost data in randomized controlled trials: review of published studies. Siegel JE. JAMA 1996. Glick HA. et al. Drummond MF. 17:1715–1723. Gao S. Harrell F. Drummond MF. Zhou XH. Issues for statisticians in pharmaco-economic evaluations. Nonparametric estimation from incomplete observations. How should cost data in pragmatic randomized trials be analyzed? Brit Med J 2000. and why? Oncology 1994. Economic analysis in phase III clinical cancer trials. Udvarhelyi IS. The use of the bootstrap statistical method for the pharmacoeconomic cost analysis of skewed data. 25. Bennett CL. Weinstein MC. 5:115–128. Castilloux A-M. 15:227–236. Labelle RJ. 320:1197–1200. Thompson SG. Gold MR. 276:1253– 1258. 27. Grieve AP. Bennett CL. 24:1–28. J Am Stat Assoc 1958. Drummond MF. Rutten-Van Molken MPMH. 9:169–175. Davies L. Thompson SG. Bennett CL. where. 116:238–244. Desgagne A. In search of power and signiﬁcance: issues in the design and analysis of stochastic cost-effectiveness studies in health care. Epstein AM. Stoddart GL. Statistical analysis of cost outcomes in a randomized controlled clinical trial. Trials and tribulations: emerging issues in designing economic evaluations alongside clinical trials. Rai A.318 Lyman 18. Kamiet MS. Weinstein MC. Revisiting the methodological issues. O’Brien BJ. JAMA 1996. Gulati S. Rowe JM. Economic analyses of phase III cooperative cancer group clinical trials: are they feasible? Cancer Invest 1997. 35. Russell LB. Recommendations for reporting cost-effectiveness analysis. Economic analysis and clinical trials. Buchner D. Brown M. Int J Technol Assess Health Care 1991. Stat Med 1998. Stat Med 1997. Van Doorslaer EKA. 23. 24. ¨ 30. Coyle D. Golub R. Russell LB. Colditz GA. 29. 20. . 19. Br Med J 1998. Gold MR. Armitage JL. Barber JA. Economic analysis during phase III clinical trials: who. Meier P. LeLorier J. 317:1195–1200. Pharmacoeconomics 1998. Controlled Clin Trials 1984. Willan A. Economic analysis alongside clinical trials. Gold MR. 276:1172–1177. Weinstein MC. when. Van Vliet RCJA. JAMA 1996. Cost-effectiveness and cost beneﬁt analyses in the medical literature: are the methods being used correctly? Ann Intern Med 1992. 53:457–481. Siegel JE. 14:135–144. The role of costeffectiveness analysis in health and medicine. Int J Technol Assess Health Care 1998. 7:561–573. 13:487–497. Waters TM. J Natl Cancer Inst Monogr 1998. 26. Health Econ 1994. Med Care 1994. Barber JA. 32:150–163. Cancer Invest 12:336–342. 3:333–345. 33. 36. Daniels N. 21.

Evans WK. Thompson SG. 48. J Clin Oncol 1988. 9:475–482. et al. et al. Cost-effectiveness analysis in oncology. 350:1025–1027. quality of life. Health Econ 1999. Health Econ 1998. 53. 321:697. Moskowitz AJ. Briggs A. Maiwenn JA. Dudley RA. Jaakkimainen L. Smith TJ. 38. Smith TJ. 56. Desch CE. on behalf of the BMJ Economic Evaluation Working Party. O’Brien BJ. TJ. Hlatky MA. J Clin Oncol 1993. 85:1460–1474. An introduction to Markov modelling for economic evaluation. 45. and health economics. Barber JA. Goodwin PJ. effects and C/E ratios alongside a clinical trial. 8:191–201. Brit Med J 2000. 49. Analysis and interpretation of cost data in randomised controlled trials: review of published studies. Guidelines for authors and peer reviewers of economic submissions to the B. Smith LR. Hillner BE. Desch CE. Coyle D. et al. Briggs AH. 8:1301–1309. Conﬁdence intervals or surfaces? Uncertainty on the cost effectiveness plane. Bayesian estimation of cost-effectiveness ratios from clinical trials. 3:309–319. J Natl Cancer Inst 1993. 40. Economic evaluation of a randomized clinical trial comparing vinorelbine. Interpreting cost analysis of clinical interventions. Evans WK. 51. 54. Reeves GAG. 279:54–57. 2:1405–1408. McSorley PA.Economic Analysis of Clinical Trials 319 37. 41. Boothroyd DB. 39. Earle CC. Balas EA. The efﬁcacy and cost-effectiveness of adjuvant therapy of early breast cancer in pre-menopausal women. 5:297–305. 42. Cost-effectiveness of cancer chemotherapy: an economic evaluation of a randomized trial in small-cell lung cancer. Ann Oncol 1998. JAMA 1998. Pater J. 46:261–271. 7:723–740. 6:1537–1547. 11:771–776. Feld R. Goodman PJ. Cost effectiveness in oncology. Lancet 1997. Efﬁcacy and cost-effectiveness of autologous bone . 55. Cost effectiveness calculations and sample size. Harrell FE. Generalization from phase III clinical trials: survival. Jefferson TO. Willan AR. Hillner BE. Johnstone IM. Pater J. et al. Whang W. 47. Campbell MK. Fayers PM. 43. 46. Health Econ 1996. Smith. Heitjan DF. 313:275–283. 13:397–409. Torgerson DS. Smith TJ. Health Econ 1994. Long-term cost-effectiveness of alternative management strategies for patients with life-threatening ventricular arrhythmias. 52. 50:185–193. Br Med J 1996. Neighbors DM. LeChevalier T. Rainer ACK. Lancet 1985. Sculpher M. J Clin Oncol 1995. Gnann W. 317:1195–1200.J. Efﬁcacy and cost-effectiveness of cancer treatment: rational allocation of resources based on decision analysis. J Clin Epidemiol 1997. Fenn P. 50. Costs. vinorelbine plus cisplatin. Counting the costs of chemotherapy in a National Cancer Institute of Canada randomized trial of non-small cell lung cancer. Conﬁdence intervals for cost-effectiveness ratios: an application of Fieller’s theorem. Hillner BE. Pharmacoeconomics 1998. Comparison of analytic models for estimating the effect of clinical factors on the cost of coronary artery bypass graft surgery. Br Med J 1998. J Clin Oncol 1990. Gilad S. and vindesine plus cisplatin for non-small cell lung cancer. Van Hout BA. Hillner BE. 13:2166–2173. Hand DJ. 44. J Clin Epidemiol 1993. Drummond MF.

330:540–544. 60. Kilore ML. JAMA 1992. Weeks JC. Adams JL. Emanuel LL. 267:2055–2061. Cost Aspects of palliative cancer care. . J Clin Oncol 2000. Berry SH. Emanuel EJ. Escarce JJ. Potsky AL. 59. The economics of dying: the illusion of cost savings at the end of life. Use of unequal randomization to aid the economic efﬁciency of clinical trials. 19:105–110. Torgerson DJ. Kaplan R. Figlin RA. Measuring the incremental cost of clinical cancer research. McCabe M.320 Lyman marrow transplantation in metastatic breast cancer: estimates using decision analysis while awaiting clinical trial results. Bailes JS. Lewis JH. Brit Med J 2000. 57. Wagle N. Semin Oncol 1995. Weidmer BA. N Engl J Med 1994. 22:64–66. Schoenbaum ML. 321:759. 58. Goldman DP. Campbell MK.

and Willi Sauerbrei Institute of Medical Biometry and Medical Informatics. Freiburg. more generally. Studies on prognostic factors attempt to determine survival probabilities or. sample sizes are often far too small to serve as a basis for reliable results. Furthermore. In general. and the evaluation of therapies. In contrast to therapeutic studies. Missing values in some or all prognostic factors constitute a serious problem that is often underestimated. the identiﬁcation and assessment of prognostic factors constitutes one of the major tasks in clinical cancer research. epidemiology. most studies investigating prognostic factors are based on historical data lacking precisely deﬁned selection criteria. University of Freiburg. As far as the statistical analysis is concerned. and to rank the relative importance of various factors. a proper multivariate analysis considering simultaneously the inﬂuence of various potential prognostic factors on overall or event-free survival of the patients is not always attempted. INTRODUCTION Besides investigations on etiology. Germany I. where statistical principles and methods are well developed and generally accepted. a prediction of the course of the disease for groups of patients deﬁned by the values of prognostic factors. however. Norbert Hollander.17 Prognostic Factor Studies ¨ Martin Schumacher. this is not the case for the evaluation of prognostic factors. the evaluation of prognostic factors based on historical data has the advantages that follow-up and other basic data of patients might be readily available in a database and that the values of new prognostic factors obtained 321 . Guido Schwarzer. Although some efforts toward an improvement of this situation have been undertaken (1–4).

we refer to the more theoretically oriented textbooks on survival analysis and counting processes (20–22). at least to some extent. In this disease. however. . (13). data from three prognostic factor studies in breast cancer serve as illustrative examples. for a deeper understanding why these methods work. the effects of more than 160 potential prognostic factors are currently controversially discussed. There have been some ‘‘classic’’ articles on statistical aspects of prognostic factors in oncology (6–10) that describe the statistical methods and principles that should be used to analyze prognostic factor studies. independent. To illustrate important statistical aspects in the evaluation of prognostic factors and to examine the problems associated with such an evaluation in more detail. why prognostic factors are discussed controversially and why prognostic models derived from such studies are often not accepted for practical use (5). independent means that the prognostic factor retains its prognostic value despite the addition of other prognostic factors. ‘‘DESIGN’’ OF PROGNOSTIC FACTOR STUDIES The American Joint Committee on Cancer has established three major criteria for prognostic factors: Factors must be signiﬁcant. According to Hermanek et al. Throughout this chapter we assume that the reader is familiar with standard statistical methods for survival data to that extent as is presented in more practically orientated textbooks (14–19).11. and clinically important implies clinical relevance. do not fully address the problem that statistical methods and principles are not adequately applied when analyzing and presenting the results of a prognostic factor study (4.322 Schumacher et al.5. from stored tissue or blood samples may be added retrospectively. such studies are particularly prone to some of the deﬁciencies mentioned above. and clinically important (23). These articles. It is therefore a general aim of this chapter not only to present updated statistical methodology but also to point out the possible pitfalls when applying these methods to prognostic factor studies. including insufﬁcient quality of data on prognostic factors and follow-up data and heterogeneity of the patient population due to different treatment strategies. signiﬁcance implies that the prognostic factor rarely occurs by chance. However.12). such as being capable (at least in principle) of inﬂuencing patient management and thus outcome. more than 500 papers have been published in 1997. These issues are often not mentioned in detail in the publication of prognostic studies but might explain. This illustrates the importance and the unsatisfactory situation in prognostic factors research. II.15). Statistical aspects of prognostic factor studies are also discussed in the monograph on prognostic factors in cancer (13) and in some recent textbooks on survival analysis (14. A substantial improvement of this situation seems possible with an improvement in the application of statistical methodology in this area.

usually require enormous resources and especially a long time until results will be available.24–26). For conﬁrmatory studies that may be seen comparable with phase III studies in therapeutic research. 11. Recognizing that these will be observational studies. the authors argue that they should be carried out in a way that the same careful design standards are adopted as are used in clinical trials. Documentation of intra. 9.and interlaboratory reproducibility of assays Blinded conduct of laboratory assays Deﬁnition and description of a clear inception cohort Standardization or randomization of treatment Detailed statement of hypotheses (in advance) Justiﬁcation of sample size based on power calculations Analysis of additional prognostic value beyond standard prognostic factors Adjustment of analyses for multiple testing Avoidance of outcome-orientated cutoff values Reporting of conﬁdence intervals for effect estimates Demonstration of subset-speciﬁc treatment effects by an appropriate statistical test . 10. treatment has to been given in a standard- Table 1 Requirements for Conﬁrmatory Prognostic Factor Studies According to Simon and Altman (4) 1. a third type of ‘‘design’’ is used in most prognostic factor studies that can be termed a ‘‘retrospectively deﬁned historical cohort’’ where stored tumor tissue or blood samples are available and basic and follow-up data of the patients are already documented in a database. various prognostic factors are investigated. however. 6. Both designs. 2. That is also emphasized by Simon and Altman (4). 8. Thus.Prognostic Factor Studies 323 From these criteria it becomes obvious that statistical aspects will play an important role in the investigation of prognostic factors (13. It is important in such a setting that the prognostic factors of interest are measured either in all patients enrolled into the clinical trial or in those patients belonging to a predeﬁned subset. they listed 11 important requirements. except for randomization. 7. To meet the requirements listed in Table 1 in such a situation. Thus. given in a somewhat shortened version in Table 1. it is clear that inclusion and exclusion criteria have to be carefully applied. Especially. 5. a prospective observational study where treatment is standardized and everything is planned in advance emerges as the most desirable study design. From these requirements it can be deduced that prognostic factors should be investigated in carefully planned prospective studies with sufﬁcient numbers of patients and sufﬁciently long follow-up to observe the end point of interest (usually eventfree or overall survival). who give a concise and thoughtful review on statistical aspects of prognostic factor studies in oncology. 3. 4. A slightly different design is represented by a randomized controlled clinical trial where in addition to some therapeutic modalities.

case-cohort studies. this will usually lead to a drastic reduction in the number of patients eligible for the study as compared with that number of patients originally available in the database. etc. If the requirements are followed in a consistent manner. or other study types often used in epidemiology [27]) have only been rarely used for the investigation of prognostic factors. Thus. special care is necessary to arrive at correct and reproducible results regarding the role of potential prognostic factors. patients for whom these requirements are not fulﬁlled have to be excluded from the study.) were deﬁned retrospectively.324 Schumacher et al. Some exclusion criteria (history of malignoma. nested case-control studies. at least to some sufﬁcient extent. In addition. T 4 and/or M 1 tumors according to the TNM classiﬁcation system of the International Union Against Cancer (13). if this design is applied. Besides age. This study is referred to as the Freiburg DNA study. Eight patients characteristics were investigated. It is interesting to note that other types of designs (e. and size of the primary tumor. older than 80 years. This left 139 of 218 patients originally investigated for the analysis.. III. number of positive lymph nodes. Their role and their potential use for prognostic factor research has not yet been fully explored. follow-up data are often not of such quality as should be the case in a wellconducted clinical trial or prospective study. Since this is clearly an investigation of treatment–covariate interactions. without adjuvant therapy after primary surgery. Freiburg DNA Study The database of the ﬁrst study consisted of all patients with primary previously untreated node-positive breast cancer who were operated between 1982 and 1987 in the Department of Gynecology at the University of Freiburg and whose tumor material was available for DNA investigations. this ideally should be performed in the setting of a large-scaled randomized trial where information on the potential predictive factor is recorded and analyzed by means of appropriate statistical methods (28–31). The three types of designs described above will also be represented by the three prognostic studies in breast cancer that we use as illustrative examples and that are dealt with in more detail in the next section. There is one situation where the randomized controlled clinical trial should be the design type of choice: the investigation of so-called predictive factors that indicate whether a speciﬁc treatment works in a subgroup of patients deﬁned by the predictive factor but not—or is even harmful—in another subgroup of patients. EXAMPLES: PROGNOSTIC STUDIES IN BREAST CANCER A. Otherwise. ized manner.g. the grading score according to Bloom and Richardson (32) and estrogen and progesterone receptor status were .

DNA ﬂow cytometry was used to measure ploidy status of the tumor (using a cutpoint of 1. distant metastasis.1 for the DNA index) and S-phase fraction. 76 events were observed for event-free survival. Further details of the study which we are using solely for illustrative purposes can be found elsewhere (33). which is the percentage of tumor cells in the DNA synthetizing phase obtained by cell cycle analysis.1–8.4 8. The distribution of these characteristics in the patient population is shown in Table 2A. Event-free survival was estimated as 50% after 5 years.1 3. second malignancy. which was deﬁned as the time from surgery to the ﬁrst of the following events: occurrence of locoregional recurrence. of positive lymph nodes Category 50 yr 50 yr 1–3 4–9 10 2 cm 2–5 cm 5 cm Missing 1 2 3 Missing 20 fmol 20 fmol Missing 20 fmol 20 fmol Missing Diploid Aneuploid 3. or death. The median follow-up was 83 months. Table 2A Patient Characteristics in the Freiburg DNA Breast Cancer Study Factor Age No. At the time of analysis.4 Missing n 52 87 66 42 31 25 73 36 5 3 81 54 1 32 99 8 34 98 7 61 78 27 55 27 30 (%) (37) (63) (48) (30) (22) (19) (54) (27) (2) (59) (39) (24) (76) (26) (74) (44) (56) (25) (50) (25) Tumor size Tumor grade Estrogen receptor Progesterone receptor Ploidy status S-phase fraction .Prognostic Factor Studies 325 recorded.

controlled. The study was designed as a comprehensive cohort study (35). tumor size. B. that is. that is. The study had a 2 2 factorial design with four adjuvant treatment arms: three versus six cycles of chemotherapy with and without hormonal treatment. Patients were not older than 65 years of age and presented with a Karnofsky index of at least 60. GBSG-2 Study The second study is a prospective. Histopathological classiﬁcation was reexamined. and grading was performed centrally by one reference pathologist for all cases. tumor grading according to Bloom and Richardson (32). this study is referred to as GBSG-2 study. randomized and nonrandomized patients who fulﬁlled the entry criteria were included and followed according to the study procedures. and number of involved lymph nodes. Primary local treatment was by a modiﬁed radical mastectomy (Patey) with en bloc axillary dissection with at least six identiﬁable lymph nodes. Prognostic factors evaluated in the trial were patient’s age. menopausal status. estrogen and progesterone receptor. of positive lymph nodes Progesterone receptor Estrogen receptor . The principal eligibility criterion was a histologically veriﬁed primary breast cancer of stage T1a-3aN MO. Event-free survival Table 2B Patient Characteristics in GBSG-2 Study Factor Age Category 45 yr 46–60 yr 60 yr Pre Post 20 mm 21–30 mm 30 mm 1 2 3 1–3 4–9 10 20 fmol 20 fmol 20 fmol 20 fmol n 153 345 188 290 396 180 287 219 81 444 161 376 207 103 269 417 262 424 (%) (22) (50) (27) (42) (58) (26) (42) (32) (12) (65) (24) (55) (30) (15) (39) (61) (38) (62) Menopausal status Tumor size Tumor grade No. histological tumor type. with positive regional lymph nodes but no distant metastases.326 Schumacher et al. clinical trial on the treatment of node-positive breast cancer patients conducted by the German Breast Cancer Study Group (GBSG) (34).

From 1984 to 1989, 720 patients were recruited, of whom about two thirds were randomized. Complete data on the seven standard prognostic factors as given in Table 2B were available for 686 patients (95.3%), who were taken as the basic patient population for this study. Event-free survival was defined as the time from mastectomy to the first occurrence of either locoregional or distant recurrence, contralateral tumor, secondary tumor, or death. After a median follow-up of nearly 5 years, 299 events for event-free survival and 171 deaths were observed; event-free survival was about 50% at 5 years. The data of this study as used in this chapter are available from http://www.blackwellpublishers.co.uk/rss/.

C. GBSG-4 Study

As a third example we use data from a prospective study in node-negative breast cancer conducted by the GBSG (36) that is referred to as the GBSG-4 study. During 6 years, 662 patients were enrolled into the study, all having mastectomy and one cycle of chemotherapy given perioperatively as standardized treatment. Age, menopausal status, tumor size, tumor grade, histological tumor type, and estrogen and progesterone receptor were recorded as prognostic factors; their distribution is summarized in Table 2C. We restrict ourselves to the 603 patients with complete data on the seven prognostic factors considered. The end point of primary interest is event-free survival, defined as the time from treatment to the first of the following events: locoregional recurrence, distant metastases, second cancer, or death. Median follow-up is about 5 years; 155 events have been observed so far, and the Kaplan-Meier estimate of event-free survival at 5 years is 0.73.

Table 2C Patient Characteristics in GBSG-4 Study

  Factor                  Category                  n     (%)
  Age                     ≤ 40 yr                    62   (10)
                          > 40 yr                   541   (90)
  Menopausal status       Pre                       215   (36)
                          Post                      388   (64)
  Tumor size              ≤ 10 mm                    45    (7)
                          11–20 mm                  236   (39)
                          21–30 mm                  236   (39)
                          31–50 mm                   74   (12)
                          > 50 mm                    12    (2)
  Estrogen receptor       < 20 fmol                 270   (45)
                          20–49 fmol                 98   (16)
                          50–299 fmol               181   (30)
                          ≥ 300 fmol                 54    (9)
  Progesterone receptor   < 20 fmol                 283   (47)
                          20–49 fmol                 81   (13)
                          50–299 fmol               175   (29)
                          ≥ 300 fmol                 64   (11)
  Tumor grade             1                         136   (23)
                          2                         325   (54)
                          3                         142   (24)
  Histologic tumor type   Solid                     300   (50)
                          Invasive ductal or lobular 124  (21)
                          Others                    179   (30)

IV. CUTPOINT MODEL

In prognostic factor studies, the values of the factors considered are often categorized in two or three categories. This may sometimes be done according to medical or biological reasons or may just reflect some consensus in the scientific community. If the covariate has been measured on a quantitative scale, the choice of such a categorization, represented by one or more cutpoints, is by no means obvious. When a ''new'' prognostic factor is investigated, often an attempt is made to derive such cutpoints from the data and to take those cutpoints that give the best separation in the data at hand. In the Freiburg DNA breast cancer study we consider the S-phase fraction (SPF) as a new prognostic factor, although it was indeed already some years old (11). For simplicity, we restrict ourselves to the problem of selecting only one cutpoint and to a so-called univariate analysis. This means that we consider only one covariate Z, in the Freiburg DNA breast cancer data the SPF, as a potential prognostic factor. Thus, the proportional hazards (37) cutpoint model is defined as

λ(t | Z ≥ µ) = exp(β) λ(t | Z < µ),   t ≥ 0

where λ(t | ·) = lim_{h→0} (1/h) Pr(t ≤ T < t + h | T ≥ t, ·) denotes the hazard function of the event-free survival time random variable T. The parameter θ = exp(β) is referred to as the relative risk of observations with Z ≥ µ with respect to observations with Z < µ and is estimated through θ̂ = exp(β̂) by maximizing the corresponding partial likelihood (37) with given cutpoint µ. The fact that µ is usually unknown makes this a problem of model selection where the cutpoint µ has to be estimated from the data too. A popular approach for such a data-dependent categorization is the so-called minimum p value method where, within a certain range of the distribution of Z (the selection interval), the cutpoint µ̂ is taken such that the p value for the comparison of observations below and above the cutpoint is a minimum.
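As a rough illustration of the minimum p value method, the following sketch scans all observed values of a factor inside the selection interval and keeps the cutpoint with the smallest logrank p value. The pandas DataFrame df and the column names time, event, and spf are assumptions made for this sketch; they are not part of the original analysis.

```python
import numpy as np
from lifelines.statistics import logrank_test

def minimum_p_value_cutpoint(df, factor="spf", time="time", event="event",
                             lower_q=0.10, upper_q=0.90):
    """Scan all observed values of the factor inside the selection interval
    and return the cutpoint with the smallest logrank p value."""
    lo, hi = df[factor].quantile([lower_q, upper_q])
    candidates = np.sort(df.loc[(df[factor] >= lo) & (df[factor] <= hi), factor].unique())
    best_mu, best_p = None, 1.0
    for mu in candidates:
        below, above = df[df[factor] < mu], df[df[factor] >= mu]
        if len(below) == 0 or len(above) == 0:
            continue
        res = logrank_test(below[time], above[time],
                           event_observed_A=below[event],
                           event_observed_B=above[event])
        if res.p_value < best_p:
            best_mu, best_p = mu, res.p_value
    return best_mu, best_p   # "optimal" cutpoint and minimum p value
```

The dichotomized covariate I(Z ≥ µ̂) could then be entered into a Cox model to obtain the ''optimal'' relative risk θ̂ discussed below.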

Applying this method to SPF in the Freiburg DNA breast cancer data, we obtain a cutpoint of µ̂ = 10.7, which is often also referred to as an optimal cutpoint, and a minimum p value of p_min = 0.007 when using the range between the 10% and the 90% quantile of the distribution of Z as the selection interval. Figure 1A shows the resulting p values, based on the logrank test, as a function of the possible cutpoints considered. Figure 1B displays the Kaplan-Meier estimates of the event-free survival functions of the groups defined by the estimated cutpoint µ̂ = 10.7. The difference in event-free survival looks rather impressive, and the estimated relative risk with respect to the dichotomized covariate I(Z ≥ µ̂) using the ''optimal'' cutpoint µ̂ = 10.7, θ̂ = 2.37, is quite large; the corresponding 95% confidence interval is [1.27, 4.44].

Simulating the null hypothesis of no prognostic relevance of SPF with respect to event-free survival (β = 0), we illustrate that the minimum p value method may lead to a drastic overestimation of the absolute value of the log-relative risk (38). By a random allocation of the observed values of SPF to the observed survival times, we simulate independence of these two variables, which is equivalent to the null hypothesis β = 0. This procedure was repeated 100 times, and in each repetition we selected a cutpoint by using the minimum p value method. In the 100 repetitions, we obtained 45 significant (p_min ≤ 0.05) results for the logrank test, corresponding well to theoretical results as outlined in Lausen and Schumacher (39).

Figure 1 p Values of the logrank test as a function of all possible cutpoints for S-phase fraction (A) and Kaplan-Meier estimates of event-free survival probabilities by S-phase fraction (B) in the Freiburg DNA study.

Thus, it is obvious that the minimum p value method cannot lead to correct results of the logrank test. The estimated optimal cutpoints of the 100 repetitions and the corresponding estimates of the log-relative risk are shown in Figure 2A. We obtained no estimates near the null hypothesis β = 0, as a result of the optimization process of the minimum p value approach. Because of the well-known problems resulting from multiple testing, this problem can be solved by using a corrected p value p_cor, as proposed in Lausen and Schumacher (39), which has been developed by taking the minimization process into account. The formula reads

p_cor = φ(u) (u − 1/u) log[ (1 − ε)² / ε² ] + 4 φ(u) / u

where φ denotes the probability density function and u is the (1 − p_min/2) quantile of the standard normal distribution. The selection interval is characterized by the proportion ε of smallest and largest values of Z that are not considered as potential cutpoints. It should be mentioned that other approaches of correcting the minimum p value could be applied; a comparison of three approaches can be found in an article by Hilsenbeck and Clark (40). Especially, if there are only a few cutpoints, an improved Bonferroni inequality can be applied (41–43).

Figure 2 Estimates of cutpoints and log-relative risks in 100 repetitions of randomly allocated observed SPF values to event-free survival times in the Freiburg DNA study before (A) and after (B) correction.
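The correction formula can be evaluated directly with scipy; the function below is a minimal sketch under the definitions just given (φ the standard normal density, u the (1 − p_min/2) quantile, ε the excluded proportion on each side of the distribution).

```python
import numpy as np
from scipy.stats import norm

def corrected_p_value(p_min, eps):
    """Corrected minimum p value of Lausen and Schumacher (39).
    eps is the proportion of smallest and largest values of Z that are
    excluded from the selection interval."""
    u = norm.ppf(1.0 - p_min / 2.0)      # (1 - p_min/2) quantile
    phi_u = norm.pdf(u)                  # standard normal density at u
    return (phi_u * (u - 1.0 / u) * np.log((1.0 - eps) ** 2 / eps ** 2)
            + 4.0 * phi_u / u)

# With p_min = 0.007 and eps = 0.10 this gives roughly 0.12, in line with
# the corrected p value of about 0.12 reported for the Freiburg data below.
```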

Using the correction formula in the 100 repetitions of our simulation experiment, we obtained four significant results (p_cor ≤ 0.05), corresponding well to the significance level of α = 0.05. Four significant results were also obtained with the usual p value when using the median of the empirical distribution of SPF in the original data as a fixed cutpoint in all repetitions. In general, it has to be recognized that the minimum p value method leads to a dramatic inflation of the type I error rate. Thus, the chance of declaring a quantitative factor as prognostically relevant, when in fact it does not have any influence on event-free survival, is about 50% when a level of 5% has been intended. Correction of p values is essential but leaves the problem of overestimation of the relative risk in absolute terms. The latter problem, which is especially relevant when sample sizes and/or effect sizes are of small or moderate magnitude, could at least partially be solved by applying some shrinkage method. To correct for overestimation, a so-called shrinkage factor has been proposed (44) to shrink the parameter estimates. Considering the cutpoint model, the log-relative risk should then be estimated by

β̂_cor = ĉ · β̂

where β̂ is based on the minimum p value method and ĉ is the estimated shrinkage factor. Values of ĉ close to one should indicate a minor degree of overestimation, whereas small values of ĉ should reflect a substantial overestimation of the log-relative risk. Obviously, with maximum partial likelihood estimation of c in a model

λ(t | SPF ≥ µ̂) = exp(c β̂) λ(t | SPF < µ̂)

using the original data, we get ĉ = 1, since β̂ is the maximum partial likelihood estimate. Several methods to estimate ĉ have been compared (45). In Figure 2B the results of the correction process in the 100 simulated studies are displayed when a heuristic estimate ĉ = (β̂² − var(β̂)) / β̂² was applied, where β̂ and var(β̂) result from the minimum p value method (46). This heuristic estimate performed quite well when compared with more elaborate cross-validation and resampling approaches (45). It should be noted, however, that the optimal cutpoint approach has further disadvantages. One of these is that in almost every study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible. Schumacher et al. (11) pointed out this problem for studies of the prognostic relevance of SPF in breast cancer published in the literature. They identified 19 different cutpoints used in the literature, and some of them were solely used because they emerged as the optimal cutpoint in a specific data set. Thus, other approaches, such as regression modeling, might be preferred.
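A minimal sketch of the heuristic shrinkage correction described above; the inputs β̂ and var(β̂) are assumed to come from the cutpoint model fitted at the selected cutpoint, and the illustrative standard error below is only back-calculated from the reported confidence interval.

```python
import numpy as np

def shrinkage_corrected_rr(beta_hat, var_beta_hat):
    """Heuristic shrinkage factor c = (beta^2 - var(beta)) / beta^2 and the
    corrected relative risk exp(c * beta)."""
    c = (beta_hat ** 2 - var_beta_hat) / beta_hat ** 2
    return c, np.exp(c * beta_hat)

# Illustration with the values reported above: theta_hat = 2.37 and a standard
# error of roughly 0.32 (implied by the 95% confidence interval [1.27, 4.44])
# give c close to 0.86 and a corrected relative risk close to 2.1.
c, rr_cor = shrinkage_corrected_rr(np.log(2.37), 0.32 ** 2)
```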

In the Freiburg DNA breast cancer data, we obtain a corrected p value of p_cor = 0.123, which provides no clear indication that S-phase is of prognostic relevance for node-positive breast cancer patients. The correction of the relative risk estimate by applying some shrinkage factor leads to a value of θ̂_cor = 2.1 for the heuristic method and to θ̂_cor = 2 for the cross-validation and bootstrap approaches. Unfortunately, confidence intervals are not straightforward to obtain; bootstrapping the whole model-building process, including the estimation of a shrinkage factor, would be one possibility. In contrast, taking S-phase as a continuous covariate with an assumed log-linear relationship in a conventional Cox regression model,

λ(t | Z) = λ_0(t) exp(β̃ Z),

leads to a p value of p̃ = 0.061 for testing the null hypothesis β̃ = 0. For comparison, the estimated log-relative risks for both approaches are displayed in Figure 3.

Figure 3 Log-relative risk for S-phase fraction in the Freiburg DNA study estimated by the minimum p value method, before and after correction, and by a Cox model assuming a log-linear relationship.

V. REGRESSION MODELING AND RELATED ASPECTS

The standard tool for analyzing the prognostic relevance of various factors, in more technical terms usually called covariates, is the Cox proportional hazards regression model (37,47). If we denote the prognostic factors under consideration by Z_1, Z_2, . . . , Z_k, then the model is given by

λ(t | Z_1, Z_2, . . . , Z_k) = λ_0(t) exp(β_1 Z_1 + β_2 Z_2 + ... + β_k Z_k)

where λ(t | ·) denotes the hazard function of the event-free or overall survival time random variable T and λ_0(t) is the unspecified baseline hazard. The estimated log-relative risks β̂_j can then be interpreted as estimated ''effects'' of the factors Z_j (j = 1, . . . , k); that is, if Z_j is a binary covariate, then exp(β̂_j) is simply the relative risk of category 1 to the reference category (Z_j = 0), which is assumed to be constant over the time range considered. If Z_j is measured on a quantitative scale, then exp(β̂_j) represents the increase or decrease in risk if Z_j is increased by one unit. It has to be noted that the ''final'' multivariate regression model is often the result of a more or less extensive model-building process that may involve the categorization and/or transformation of covariates and the selection of variables in an automatic or a subjective manner. This model-building process should in principle be taken into account when judging the results of a prognostic study; in practice, however, it is often neglected. We come back to this problem at several occasions below, especially in Sections VII and VIII.

We demonstrate various approaches with the data of the GBSG-2 study. The factors listed in Table 2B are investigated with regard to their prognostic relevance. Since all patients received adjuvant chemotherapy in a standardized manner and there appeared to be no difference between three and six cycles (34), chemotherapy is not considered any further. Because of the patients' preference in the nonrandomized part and because of a change in the study protocol concerning premenopausal patients, only about a third of the patients received hormonal treatment. Age and menopausal status had a strong influence on whether this therapy was administered. Since the impact of hormonal treatment is not of primary interest in this prognostic study, all analyses were adjusted for hormonal treatment; this was done by using a Cox regression model stratified for hormonal treatment, that is, the baseline hazard is allowed to vary between the two strata while keeping the regression coefficients of the other factors constant over strata.

In a first attempt all quantitative factors are included as continuous covariates assuming a log-linear relationship. Age is taken in years, tumor size in mm, and so on. Menopausal status is a binary covariate per se; specifically, we coded ''0'' for premenopausal and ''1'' for postmenopausal patients. Grade is considered as a quantitative covariate in this approach, that is, the relative risk between grade categories 1 and 2 is the same as that between grade categories 2 and 3.
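A minimal lifelines sketch of this first approach; the DataFrame df and the column names (time, event, age, meno, size, grade, nodes, pgr, er, hormone) are assumptions for illustration and do not reproduce the original computations.

```python
from lifelines import CoxPHFitter

# Assumed columns: time, event, age (years), meno (0/1), size (mm),
# grade (1-3, treated as quantitative), nodes, pgr, er, hormone (0/1)
covariates = ["age", "meno", "size", "grade", "nodes", "pgr", "er"]

cph = CoxPHFitter()
cph.fit(df[covariates + ["time", "event", "hormone"]],
        duration_col="time", event_col="event",
        strata=["hormone"])              # separate baseline hazard per treatment arm
print(cph.summary[["exp(coef)", "p"]])   # relative risks and Wald p values
```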

By taking lymph nodes as the covariate Z, the risk is increased by the factor exp(β̂) if the number of positive lymph nodes is increased from l to l + 1 for l = 1, 2, . . . . The results of this Cox regression model are given in Table 3 in terms of estimated relative risks and p values of the corresponding Wald tests under the heading ''full model.'' In a publication, this should at least be accompanied by confidence intervals for the relative risks, which we have omitted here in order not to present too many numbers. From this full model it can be seen that tumor size, tumor grade, the number of positive lymph nodes, and the progesterone receptor have a significant impact on event-free survival, for example, when a significance level of 5% is used. Age, menopausal status, and the estrogen receptor do not exhibit prognostic relevance. The full model has the advantage that the regression coefficients of the factors considered can be estimated in an unbiased fashion; it is, however, hampered by the fact that the assumed log-linear relationship for quantitative factors may be in sharp contrast to the real situation and that also irrelevant factors are included that will not be needed in subsequent steps, for example, in the formation of risk groups defined by the prognostic factors. In addition, correlation between various factors may lead to undesirable statistical properties of the estimated regression coefficients, such as inflation of standard errors or problems of instability caused by multicollinearity. It is therefore desirable to arrive at a simple and parsimonious ''final model'' that only contains those prognostic factors that strongly affect event-free survival (48). The three other columns of Table 3 contain the results of the Cox regression models obtained after backward elimination (BE) for three different selection levels (49). For selection of a single factor, backward elimination with a selection level of 15.7% (BE(0.157)) corresponds asymptotically to the well-known Akaike information criterion, whereas selection levels of 5% or even 1% lead to a more stringent selection of factors (50). In general, backward elimination can be recommended because of several advantages compared with other stepwise variable selection procedures (48,51,52). In the GBSG-2 study, tumor grade, lymph nodes, and progesterone receptor are selected for all three selection levels considered; when using 15.7% as the selection level, tumor size is included in addition. Thus, the results of the full model and the three backward elimination procedures do not differ too much in these particular data. One reason might be that there is a relatively clear-cut difference between three strong factors (and tumor size, which seems to have a borderline influence) and the others, which show only a negligible influence on event-free survival in this study; this, however, should not be expected in general. The previous approach implicitly assumes that the influence of a prognostic factor on the hazard function follows a log-linear relationship. This could be a questionable assumption, at least for large numbers of positive lymph nodes. For other factors even monotonicity of the log-relative risk may be violated, which could result in overlooking an important prognostic factor.
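lifelines has no built-in stepwise selection, so a backward elimination of the kind described above can be sketched as a simple loop over refitted Cox models, dropping the factor with the largest Wald p value until all remaining p values are below the selection level; data frame and column names are assumed as in the previous sketch.

```python
from lifelines import CoxPHFitter

def backward_elimination(df, covariates, alpha=0.157,
                         time="time", event="event", strata=("hormone",)):
    """Refit the Cox model repeatedly and drop the covariate with the
    largest Wald p value until all p values are <= alpha."""
    selected = list(covariates)
    while selected:
        cph = CoxPHFitter().fit(df[selected + [time, event] + list(strata)],
                                duration_col=time, event_col=event,
                                strata=list(strata))
        pvals = cph.summary["p"]
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            break
        selected.remove(worst)
    return selected

# BE(0.157) behaves roughly like AIC-based selection; alpha = 0.05 or 0.01
# gives the more stringent models discussed in the text.
```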

001 0.001 — 335 .006 0.051 0.001 0.001 — RR — — — 1.051 0.057 0.008 0.049 0.998 — BE (0.31 0.998 1.998 — BE (0.67 RR — — 1.325 1.006 0.007 1.05) p value — — — 0.061 0.998 — BE (0.008 1.14 0.340 1.001 0.057 0.001 0.001 — RR — — — 1.310 1.340 1.01) p value — — — 0.157) p value — — 0.001 0.009 0.000 p value 0. Quantitative Prognostic Factors Are Taken as Continuous Covariates Assuming a Log-Linear Relationship Full model Factor Age Menopausal status Tumor size Tumor grade Lymph nodes Progesterone receptor Estrogen receptor RR 0.Prognostic Factor Studies Table 3 Estimated Relative Risks (RR) and Corresponding p Values in the Cox Regression Models for the GBSG-2 Study.991 1.321 1.

respectively. 4–9. Elimination of only one dummy variable corresponding to a factor with three categories would correspond to an amalgamation of categories (8). lymph nodes. the categorization used in the second approach can always be criticized because of some degree of arbitrariness and subjectivity concerning the number of categories and the speciﬁc . the prognostic factors under consideration are often categorized and so-called dummy variables for the different categories are deﬁned. we could also test the two-dimensional vector of corresponding regression coefﬁcients to be zero.336 Schumacher et al. For age. tant prognostic factor. alternatively. The data of the GBSG-2 study suggest that grade categories 2 and 3 could be amalgamated into one category (grade 2–3). the categorization presented in Table 2B is used that was speciﬁed independently of the speciﬁc data set in accordance with the literature (34).745 (1.321) 2 for grade 2 and grade 3. the latter one would lead to estimated relative risks of 1. the results of the full model are supplemented by those obtained after backward elimination with three selection levels. in contrast to values of 1. age and menopausal status are also marginally signiﬁcant and are included into the model by backward elimination with a selection level of 15. and 10 positive nodes. Because of this uncertainty. and progesterone receptor show again the strongest effects. In these analyses where tumor grade. For those factors with three categories. there is some indication that linearity or even monotonicity of the log-relative risk may be violated. this would lead to an estimated relative risk of 1. one degree of freedom would be sufﬁcient. So.728 and a corresponding p value of 0.723 and 1. whereas when treating grade as a quantative covariate. On the other hand.7%. In Table 4 we give the p values of the Wald tests for the two dummy variables separately. In any case this needs two degrees of freedom. 1–3 positive nodes serves as the reference category. two binary dummy variables were deﬁned contrasting the corresponding category with the reference category chosen as that with the lowest values. thus avoiding that the categorized factors are treated as quantitative covariates. for example. In the GBSG-2 study.019. The use of dummy variables may also be the reason that grade is no longer included by backward elimination with a selection level of 1%. again. From this point of view the ﬁrst approach assuming all relationships to be log-linear may not be ﬂexible enough and may not capture important features of the relationship between various prognostic factors and event-free survival. Grade categories 2 and 3 do not seem well separated as is suggested by the previous approach presented in Table 3 where grade was treated as ordinal covariate. Table 4 displays the results of the Cox regression model for the categorized covariates.321 and 1.746 when using dummy variables. lymph nodes were categorized into 1–3. The results of the two approaches presented in Tables 3 and 4 show that model building within the framework of a prognostic study has to ﬁnd a compromise between sufﬁcient ﬂexibility with regard to the functional shape of the underlying log-relative risk functions and simplicity of the derived model to avoid problems with serious overﬁtting and instability.

037 — 0.036 — 0.01) p value — — — — — — — — — — — — 0.089 — 0.032 0.071 3.001 — 0.316 1 1.033 0.687 1 0.746 1 1.001 — 0.001 0.976 3.031 0.709 1.120 — 0.108 — 0. Prognostic Factors Are Categorized as in Table 2B Full model Factor Age 45 45–60 60 Pre Post 20 21–30 30 1 2 3 1–3 4–9 10 20 20 20 20 RR 1 0.994 p value — 0.783 1 2.240 1.778 1 2.001 0.001 — 0.304 — — — 1 1.307 1 1.001 — — RR — — — — — — — — 1 1.165 0.687 1 1.Prognostic Factor Studies Table 4 Estimated Relative Risks (RR) and Corresponding p Values in the Cox Regression Models for the GBSG-2 Study.120 — — — — 0.103 — 0.97 BE (0.741 1 0.001 — — Menopausal status Tumor size Tumor grade Lymph nodes Progesterone receptor Estrogen receptor 337 .110 3.05) p value — — — — — — — — — 0.494 — — BE (0.029 3.001 — — RR — — — — — — — — — — — 1 2.661 1 0.672 0.030 0.045 — 0.001 0.026 0.536 — — BE (0.001 0.692 1 1.512 1 0.157) RR 1 0.679 0.001 — 0.718 1.723 1.545 1 0.001 — 0.545 — — p value — 0.

5 Tumor grade 1 Tumor grade 2–3 exp ( 0.058 p Value 0.55).742 7. This simple extension of ordinary polynomials generates a considerable range of curve shapes while still preserving simplicity when compared with smoothing splines or other nonparametric techniques.12 ⋅ lymph nodes) where the factor 0. 3} and Z 0 is deﬁned as log Z.001 .517 1.981 0. 0.812 0 0. the powers p and q are taken from the set { 2. For age. 2. a larger number of cutpoints and corresponding dummy variables would be needed. for progesterone receptor.5 Regression coefﬁcient 1. Without going into the details of this model-building process reported elsewhere (54. This function is displayed in Figure 4A in comparison with the corresponding functions derived from the two other approaches.338 Schumacher et al.001 0. For lymph nodes a further restriction has been incorporated by assuming that the relationship should be monotone with an asymptote for large numbers of positive nodes. This was achieved by using the simple primary transformation exp( 0.001 — 0. For a quantitative covariate Z it uses functions β 0 β 1 Z p β 2 Z q to model the log-relative risk.026 0. it will not fully exploit the information available and will be associated with some loss in efﬁciency.12 was estimated from the data (54). The estimated power for this transformed variable was equal to one and a second power was not needed. Grade categories 2 and 3 have been amalgamated as has been pointed out above. For a more ﬂexible modeling of the functional relationship.12 ∗ Lymph nodes) (Progesterone receptor 1) 0. for example. Sauerbrei and Royston (54) extended the proposed multivariate FP approach to a model-building strategy considering transformation and selection of variables. Likewise.5. we summarize the results in Table 5. In addition. The method has been originally developed by Royston and Altman (53) and has been termed the ‘‘fractional polynomial’’ (FP) approach. a power of 0. It provides some further indication that there is a nonmonotonic relationship that would be overlooked by the log-linear approach.001 0. the powers 2 and 0. 1. 1. 0.5 was estimated that gives a signiﬁcant contri- Table 5 Estimated Regression Coefﬁcients and Corresponding p Values in the Final Cox Regression Model for the GBSG-2 Study Using the Fractional Polynomial Approach Factor/function (Age/50) 2 (Age/50) 0.5 have been estimated and provide signiﬁcant contributions to the logrelative risk function.5. cutpoints chosen. 0. We will therefore sketch a third approach that will provide more ﬂexibility while preserving simplicity of the ﬁnal model to an acceptable degree.
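A minimal sketch of the covariate transformations quoted above for the final fractional polynomial model; it only evaluates the named transformations and refits a Cox model, and is not a reimplementation of the full FP selection strategy of Sauerbrei and Royston. Column names are assumed as before, and the factor 0.12 is the value quoted in the text.

```python
import numpy as np
from lifelines import CoxPHFitter

fp = df.copy()
fp["age_m2"]   = (fp["age"] / 50.0) ** -2        # (age/50)^(-2)
fp["age_m05"]  = (fp["age"] / 50.0) ** -0.5      # (age/50)^(-0.5)
fp["grade23"]  = (fp["grade"] >= 2).astype(int)  # grade 2-3 vs. grade 1
fp["nodes_e"]  = np.exp(-0.12 * fp["nodes"])     # bounded, monotone in nodes
fp["pgr_sqrt"] = (fp["pgr"] + 1.0) ** 0.5        # (progesterone receptor + 1)^0.5

cph = CoxPHFitter().fit(
    fp[["age_m2", "age_m05", "grade23", "nodes_e", "pgr_sqrt",
        "time", "event", "hormone"]],
    duration_col="time", event_col="event", strata=["hormone"])
```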

Sauerbrei and Royston (54) provide a graphical comparison of the FP approach with generalized additive models (56) for the log-relative risk functions displayed in Figure 4. For lymph nodes. that there must always be a compromise .Prognostic Factor Studies 339 Figure 4 Estimated log-relative risk functions for age (A). lymph nodes (B). whereas it substantially overestimates it for very large numbers. there is a variety of other ﬂexible methods available that has not been presented here. A–C. some general comments are in order. At the end of this section. it suggests that the log-linear approach underestimates the increase in risk for small numbers of positive nodes. however. bution to the log-relative risk functions. and progesterone receptor (C) obtained by the FP. Figure 4 shows these functions for lymph nodes (B) and progesterone receptor (C) in comparison with those derived from the log-linear and from the categorization approach. First. categorization and log-linear approach in the GBSG-2 study. The categorization approach seems to provide a reasonable compromise for this factor. It should be stressed. Various other nonparametric methods could in principle be used.

58). In Table 6 the inclusion frequencies over 1000 bootstrap samples are given for the prognostic factors under consideration. If important assumptions appear to be seriously violated.2% 28. other features than fulﬁllment of model assumptions are getting more important. the whole model selection or building process is repeated and the results are summarized over the bootstrap samples. some alternative approaches are discussed in Sections VI and VIII. These frequencies underline that tumor grade. the log-linear relationship in a ‘‘standard’’ Cox model). We illustrate this procedure for backward elimination with a selection level of 5% in the Cox regression model with quantitative factors included as continuous covariates (Table 3). between ﬂexibility and simplicity and that simple models have the additional advantage that they can be interpreted by clinical colleagues more easily. lymph nodes.g.1% .1% 8. Third. The aspect of model complexity is discussed in more detail by Sauerbrei (48). the analyses presented for the GBSG-2 study concentrated on the Cox regression model. Some special aspects have already been addressed above (e.1% 62. Bootstrap resampling has been applied to investigate the stability of the selected ‘‘ﬁnal model’’ (59–61). however. One is stability and addresses the question whether we could replicate the selected ﬁnal model having different data... and progesterone receptor Table 6 Inclusion Frequencies over 1000 Bootstrap Samples Using the Backward Elimination Method (BE (0. Because of simplicity we do not consider the selection process including transformation of covariates as was used by Sauerbrei and Royston (54). Second. has important consequences: Model checking with regard to the assumptions of this model has to be carefully undertaken. numerous others can be found in textbooks and review articles on survival analysis (14.8% 38. This. when dealing with prognostic factor studies. extensions of the Cox model (e. with time-varying regression coefﬁcients) or other models should be taken into consideration.g.3% 100% 98.57. In each bootstrap sample. the standard statistical tool for prognostic studies and without any doubt the one that is most commonly used.05)) with a Selection Level of 5% in the GBSG-2 Study Factor Age Menopausal status Tumor size Tumor grade Lymph nodes Progesterone receptor Estrogen receptor Inclusion frequency 18.340 Schumacher et al.

are by far the strongest factors: lymph nodes are always included, progesterone receptor in 98% and tumor grade in 62% of the bootstrap samples. In 60.4% of the bootstrap samples a model is selected that contains these three factors, possibly with other selected factors. The percentage of bootstrap samples where exactly this model, containing these three factors only, is selected is 26.1%. These figures might be much lower in other studies where more factors with a weaker effect are investigated. Bootstrap resampling of this type also provides insight into the interdependencies between different factors by inspecting the bivariate inclusion frequencies (61).

VI. CLASSIFICATION AND REGRESSION TREES

Analysis by building a hierarchical tree is one approach for nonparametric modeling of the relationship between a response variable and several potential prognostic factors. In addition, the method leads directly to prognostic subgroups defined by the potential prognostic factors. Breiman et al. (62) give a comprehensive description of the method of classification and regression trees (CART), which has been modified and extended in various directions (63). We concentrate solely on the application to survival data (64–68) and use the abbreviation CART as a synonym for different types of tree-based analyses. Briefly, the idea of CART is to construct subgroups that are internally as homogeneous as possible with regard to the outcome and externally as separated as possible. This is achieved by a recursive tree-building algorithm. As in Section V, we start with k potential prognostic factors Z_1, Z_2, . . . , Z_k that may have an influence on the survival time random variable T. An allowable split is given by a cutpoint of a quantitative or an ordinal factor within a given range of the distribution of the factor, or by some bipartition of the classes of a nominal factor. We define a minimum number of patients within a subgroup, n_min say, and prespecify an upper bound p_stop for the p values of the logrank test statistic. Then the tree-building algorithm is defined by the following steps (42):

1. The minimal p value of the logrank statistic is computed for all k factors and all allowable splits within the factors.
2. The whole group of patients is split into two subgroups based on the factor and the corresponding cutpoint with the minimal p value, provided the minimal p value is smaller than or equal to p_stop.
3. For each of the two resulting subgroups, the procedure is repeated.
4. The partition procedure is stopped if no allowable split exists, if the minimal p value is greater than p_stop, or because the size of the subgroup is smaller than n_min.
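A minimal sketch of the recursive splitting algorithm for quantitative factors only; for brevity it uses uncorrected logrank p values (the p value correction and the restriction to prespecified cutpoints discussed in the text are omitted), and data frame and column names are assumptions.

```python
from lifelines.statistics import logrank_test

def build_tree(df, factors, time="time", event="event",
               n_min=20, p_stop=0.05, depth=0):
    """Recursive logrank-based splitting (steps 1-4 above), printing the tree."""
    best_f, best_mu, best_p = None, None, 1.0
    for f in factors:                              # step 1: best allowable split
        for mu in sorted(df[f].unique())[1:]:
            left, right = df[df[f] < mu], df[df[f] >= mu]
            if len(left) < n_min or len(right) < n_min:
                continue
            p = logrank_test(left[time], right[time],
                             event_observed_A=left[event],
                             event_observed_B=right[event]).p_value
            if p < best_p:
                best_f, best_mu, best_p = f, mu, p
    if best_f is None or best_p > p_stop:          # step 4: final node
        print("  " * depth + f"final node: n={len(df)}, events={int(df[event].sum())}")
        return
    print("  " * depth + f"split on {best_f} at {best_mu:g} (p={best_p:.4f})")  # step 2
    build_tree(df[df[best_f] < best_mu], factors, time, event, n_min, p_stop, depth + 1)   # step 3
    build_tree(df[df[best_f] >= best_mu], factors, time, event, n_min, p_stop, depth + 1)
```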

0001) yielding a subgroup of 583 patients with less than or equal to nine positive nodes (event rate 38. Thus. will allow 25 splits whereas the binary factor menopausal status will allow only 1 split. Lymph nodes will allow for 10 possible splits.6%) has been observed. the logrank test was not stratiﬁed for hormonal therapy that could have been done in principle. respectively. for simplicity. The factor with the smallest corrected p value is lymph nodes and the whole group is split at an estimated cutpoint of nine positive nodes ( p cor 0. the number of possible partitions will also be different.2%) and 60 patients (progesterone receptor 23. and we deﬁne n min 20 and p stop 0. and progesterone and estrogen receptor offer 182 and 177 possible cutpoints. In these two subgroups.8%) and a subgroup of 103 patients with more than nine positive patients (event rate 70. no further splits are possible because of the p stop criterion. Thus. We start with the whole group of 686 patients (the ‘‘root’’) where a total of 299 events (crude event rate 43. At this level. 2. tumor size will allow 32 possible splits and tumor grade. . Since the potential prognostic factors are usually measured on different scales. 9) and the right node (patients with number of positive lymph nodes. The procedure is then repeated with the left node (patients with number of positive lymph nodes. correction of p values and/or restriction to a set of few prespeciﬁed cutpoints may be useful to overcome the problem that factors allowing more splits have a higher chance of being selected by the tree-building algorithm because of multiple testing and may be preferred to binary factors with prognostic relevance.05. Likewise.0001) this yields a subgroup of 376 patients with less than or equal to three positive nodes (event rate 31.6%) and a subgroup of 207 patients with four to nine positive nodes (event rate 51. We illustrate the procedure by means of the GBSG-2 study. in the left node. progesterone receptor is associated with the smallest corrected p value and the cutpoint is obtained as 23 fmol ( p 0. respectively. and the minimal p value at each interior node. This yields subgroups with 43 patients (progesterone receptor 23. thus they are regarded as ﬁnal nodes.9%). then the factor age. As a splitting criterion we use the test statistic of the logrank test. For the right node (patients with more than nine positive nodes). for example. 9). we decide to use the p value correction as outlined in Section IV. With a cutpoint of three positive nodes ( p cor 0. This tree-building algorithm yields a binary tree with a set of patients. If we restrict the possible splits to the range between the 10% and 90% quantile of the empirical distribution of each factor. various quantities of interest. event rate 51. a splitting rule.342 Schumacher et al. For the patients in the resulting ﬁnal nodes that may again be combined by some amalgamation. as Kaplan-Meier estimates of event-free survival or relative risks with respect to some reference.0003). lymph nodes again appeared to be the strongest factor. event rate 85%). This leads to the problems that have already been extensively discussed in Section IV. can be computed.7%).

To protect against serious overﬁtting of the data—that in other algorithms is accomplished by tree pruning—we deﬁne various restrictions like the p stop and the n min criteria and the use of corrected p values. again. As already outlined above.70). respectively. p value correction but no prespeciﬁcation of cutpoints was used. . Because of the p stop criterion. pruning. we obtain the tree displayed in Figure 5 that is parsimonious in the sense that only the strongest factors. and amalgamation (62. This presentation allows an immediate visual impression about the resulting prognostic classiﬁcation obtained by the ﬁnal nodes of the tree. We present a somewhat different algorithm that concentrates on the tree-building process. progesterone receptor is the strongest factor with cutpoints of 90 fmol ( p cor 0. lymph nodes and the progesterone Figure 5 Classiﬁcation and regression tree obtained for the GBSG-2 study.006) and 55 fmol ( p cor 0. In this graphical representation. The result of the tree-building procedure is summarized in Figure 5. the size of the subgroups is taken proportionally to the width of the boxes. Applying these restrictions.0018). whereas the centers of the boxes correspond to the observed event rates.63. no further splits are possible and the resulting subgroups are considered as ﬁnal nodes too.Prognostic Factor Studies 343 The subgroups of patients with one to three and four to nine positive nodes allow further splits.69. a variety of deﬁnitions exists of CART type algorithms that usually consist of tree building.


receptor, are selected for the splits. However, the values of the cutpoints obtained for progesterone receptor (90, 55, and 23 fmol) are somewhat arbitrary and may not be reproducible and/or not comparable with those obtained in other studies. Thus, another useful restriction may be the deﬁnition of a set of prespeciﬁed possible cutpoints for each factor. In the GBSG-2 study we used 35, 40, 45, 50, 55, 60, 65, and 70 years for age; 10, 20, 30, and 40 mm for tumor size; and 5, 10, 20, 100, and 300 fmol for progesterone and estrogen receptors. The resulting tree is displayed in Figure 6A. It is only different from the one without this restriction in that the selected cutpoints for the progesterone receptor are now

Figure 6 Classiﬁcation and regression trees obtained for the GBSG-2 study. p Value correction and prespeciﬁcation of cutpoints (A); no p value correction with (B) and without (C) prespeciﬁcation of cutpoints.


100, 20, and 20 fmol in the ﬁnal nodes. For comparison, the trees without using the p value correction with and without prespeciﬁcation of a set of possible cutpoints are presented in Figure 6, B and C. Since lymph nodes and progesterone receptor are the dominating prognostic factors in this patient population, the resulting trees are identical at the ﬁrst two levels to those where the p values have been corrected. The ﬁnal nodes in the latter ones, however, will again be split, leading to a larger number of ﬁnal nodes. In addition, other factors like age, tumor size, and estrogen receptor are now used for the splits at subsequent nodes too. A more detailed investigation on the inﬂuence of p value correction and prespeciﬁcation of possible cutpoints on resulting trees and their stability is given by Sauerbrei (71).

VII. FORMATION AND VALIDATION OF RISK GROUPS

The final nodes of a regression tree define a prognostic classification scheme per se; to be useful in practice, however, some combination of final nodes to a prognostic subgroup might be indicated. This is especially important if the number of final nodes is large and/or if the prognosis of patients in different final nodes is comparable. So, for example, from the regression tree presented in Figure 6A (p value correction and predefined cutpoints), the prognostic classification given in Table 7 can be derived, which is very much in agreement with current knowledge about the prognosis of node-positive breast cancer patients. The definition of subgroups III and IV reflects that patients with more than nine positive lymph nodes can still be further separated by other prognostic factors, in particular by subdivision into progesterone-positive (≥ 20 fmol) and -negative patients (72), and that progesterone-negative patients with four to nine positive lymph nodes have a similarly poor prognosis as progesterone-positive patients with more than nine positive lymph nodes.

Table 7 Prognostic Classification Scheme Derived from the Regression Tree (p Value Correction and Predefined Cutpoints) in the GBSG-2 Study

  Prognostic subgroup   Definition of subgroup
  I                     LN ≤ 3 and PR ≥ 100
  II                    (LN ≤ 3 and PR < 100) or (LN 4–9 and PR ≥ 20)
  III                   (LN 4–9 and PR < 20) or (LN > 9 and PR ≥ 20)
  IV                    LN > 9 and PR < 20

LN, no. of positive lymph nodes; PR, progesterone receptor.
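The scheme in Table 7 translates into a simple grouping rule; in the sketch below, ln is the number of positive lymph nodes and pr the progesterone receptor value in fmol.

```python
def prognostic_subgroup(ln, pr):
    """Risk group I-IV according to Table 7."""
    if ln <= 3:
        return "I" if pr >= 100 else "II"
    if ln <= 9:
        return "II" if pr >= 20 else "III"
    # more than nine positive lymph nodes
    return "III" if pr >= 20 else "IV"
```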


Among the other patients, subgroup I with a relatively favorable prognosis can be defined by one to three positive lymph nodes and a ''markedly'' positive progesterone receptor (≥ 100 fmol). The results in terms of estimated event-free survival are displayed in Figure 7A; the Kaplan-Meier curves show a good separation of the four prognostic subgroups. Since in other studies or in clinical practice the progesterone receptor may often be recorded only as positive or negative, the prognostic classification scheme in Table 7 may be modified in the way that the definitions of subgroups I and II are replaced by

  I*:  (LN ≤ 3 and PR ≥ 20)
  II*: (LN ≤ 3 and PR < 20) or (LN 4–9 and PR ≥ 20)

respectively, where LN is the number of positive lymph nodes and PR is the progesterone receptor, since 20 fmol is a more commonly agreed cutpoint. The resulting Kaplan-Meier estimates of event-free survival are depicted in Figure 7B. For two of the regression approaches outlined in Section V, prognostic subgroups have been formed by dividing the distribution of the so-called prognostic index, β̂_1 Z_1 + ... + β̂_k Z_k, into quartiles. The results in terms of estimated event-free survival are displayed in Figure 8A (Cox regression model with continuous factors, BE(0.05), Table 3) and in Figure 8B (Cox regression model with categorized covariates, BE(0.05), Table 4). It should be noted that in the definition of the corresponding subgroups, tumor grade enters in addition to lymph nodes and progesterone receptor.

Figure 7 Kaplan-Meier estimates of event-free survival probabilities for the prognostic subgroups derived from the CART approach (A) and the modiﬁed CART approach (B) in the GBSG-2 study.


Figure 8 Kaplan-Meier estimates of event-free survival probabilities for the prognostic subgroups derived from a Cox model with continuous (A) and categorized (B) covariates and according to the Nottingham Prognostic Index (C) in the GBSG-2 study.

For comparison, Figure 8C shows the Kaplan-Meier estimates of event-free survival for the well-known Nottingham Prognostic Index (NPI) (73,74), which is the only prognostic classification scheme based on standard prognostic factors that enjoys widespread acceptance (75). This index is defined as

NPI = 0.02 × size (in mm) + lymph node stage + tumor grade

where lymph node stage is equal to 1 for node-negative patients, 2 for patients with one to three positive lymph nodes, and 3 if four or more lymph nodes were involved. It is usually divided into three prognostic subgroups: NPI-I (NPI ≤ 3.4), NPI-II (3.4 < NPI ≤ 5.4), and NPI-III (NPI > 5.4). Since it was developed for node-negative and node-positive patients, there seems to be room for improvement by taking other factors (e.g., progesterone receptor) into account (76).
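A minimal sketch of the index and its usual grouping; whether the boundary values 3.4 and 5.4 belong to the lower or the upper group is not spelled out above, so the assignment below is an assumption.

```python
def npi(size_mm, n_positive_nodes, grade):
    """Nottingham Prognostic Index = 0.02 * size(mm) + lymph node stage + grade."""
    if n_positive_nodes == 0:
        stage = 1
    elif n_positive_nodes <= 3:
        stage = 2
    else:
        stage = 3
    return 0.02 * size_mm + stage + grade

def npi_group(value):
    if value <= 3.4:
        return "NPI-I"
    return "NPI-II" if value <= 5.4 else "NPI-III"
```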


Since the NPI has been validated in various other studies (75), we can argue that the degree of separation displayed in Figure 8C could be achieved in general. This, however, is by no means true for the other proposals derived by regression modeling or CART techniques, where some shrinkage has to be expected (46,77,78). We therefore attempted to validate the prognostic classification schemes defined above with the data of an independent study, which in more technical terms is often referred to as a ''test set'' (79). As a test set we take the Freiburg DNA study, which covers the same patient population and in addition comprises the same prognostic factors as the GBSG-2 study. Some complications have to be resolved, however. Only progesterone and estrogen receptor status (positive, ≥ 20 fmol; negative, < 20 fmol) is recorded in the Freiburg DNA study, and the original values are not available. Thus, only those classification schemes where the progesterone receptor enters as positive or negative can be considered for validation. Furthermore, we restrict ourselves to those patients for whom the required information on prognostic factors is complete. Table 8A shows the estimated relative risks for the prognostic groups derived from the categorized Cox model and from the modified CART classification scheme defined above. The relative risks have been estimated by using dummy variables defining the risk groups and by taking the group with the best prognosis as reference. When applying the classification schemes to the data of the Freiburg DNA study, the definitions and

Table 8A Estimated Relative Risks for Various Prognostic Classification Schemes Derived in the GBSG-2 Study and Validated in the Freiburg DNA Study

  Estimated relative risks (no. of patients)

  Prognostic groups   GBSG-2 study    Freiburg DNA study
  Cox
    I                 1     (52)      1     (33)
    II                2.68  (218)     1.78  (26)
    III               3.95  (236)     3.52  (58)
    IV                9.92  (180)     7.13  (14)
  CART
    I*                1     (243)     1     (50)
    II*               1.82  (253)     1.99  (38)
    III               3.48  (133)     3.19  (33)
    IV                8.20  (57)      4.34  (11)
  NPI
    II                1     (367)     1     (46)
    III               2.15  (301)     2.91  (87)


categorization derived in the GBSG-2 study are used. Note that the categorization into quartiles of the prognostic index does not yield groups with an equal number of patients, since the prognostic index from the categorized Cox model takes only a few different values. From the values given in Table 8A, it can be seen that there is some shrinkage in the relative risks when estimated in the Freiburg DNA study that we used as a test set. This shrinkage is more pronounced in the modified CART classification scheme (reduction by the factor 0.53 in the high-risk group) as compared with the categorized Cox model (reduction by the factor 0.72 in the high-risk group). To get some idea of the amount of shrinkage that has to be anticipated in a test set, based on the original data where the classification scheme has been developed (the so-called training set (79)), cross-validation or other resampling methods can be used. For classification schemes derived by regression modeling, similar techniques as already outlined in Section IV can be used. These consist essentially in estimating a shrinkage factor for the prognostic index (44,46). The relative risks for the prognostic subgroups are then estimated by categorizing the shrunken prognostic index according to the cutpoints used in the original data. In the GBSG-2 study we obtained an estimated shrinkage factor of ĉ = 0.95 for the prognostic index derived from the categorized Cox model, indicating that we would not expect a serious shrinkage of the relative risks between the prognostic subgroups. Compared with the estimated relative risks in the Freiburg DNA study (Table 8A), it is clear that the shrinkage effect in the test set can only be predicted to a limited extent. This deserves at least two comments. First, we have used leave-one-out cross-validation, which possibly could be improved by bootstrap or other resampling methods (45); second, we did not take the variable selection process into account. By doing so, we would expect more realistic estimates of the shrinkage effect in an independent study. Similar techniques can in principle also be applied to classification schemes derived by CART methods. How to do this best, however, is still a matter of ongoing research (71).

VIII. ARTIFICIAL NEURAL NETWORKS

During the last years, the application of artificial neural networks (ANNs) for prognostic and diagnostic classification in clinical medicine has attracted growing interest in the medical literature. So, for example, a ''miniseries'' on neural networks that appeared in the Lancet contained three more or less enthusiastic review articles (80–82) and an additional commentary expressing some scepticism (83). In particular, feed-forward neural networks have been used extensively, often accompanied by exaggerated statements of their potential. In a recent review article (84), we identified a substantial number of articles with application of ANNs to prognostic classification in oncology.


The relationship between ANNs and statistical methods, especially logistic regression models, has been described in several articles (85–90). Briefly, the conditional probability that a binary outcome variable Y is equal to one, given the values of k prognostic factors Z_1, Z_2, . . . , Z_k, is given by a function f(Z, w). In feed-forward neural networks, this function is defined by

f(Z, w) = Λ( W_0 + Σ_{j=1}^{r} W_j · Λ( w_{0j} + Σ_{i=1}^{k} w_{ij} Z_i ) )

where w = (W_0, . . . , W_r, w_{01}, . . . , w_{kr}) are unknown parameters (called ''weights'') and Λ(·) denotes the logistic function (Λ(u) = (1 + exp(−u))^{−1}), called the ''activation function.'' The weights w can be estimated from the data via maximum likelihood, although other optimization procedures are often used in this framework. The ANN is usually introduced by a graphical representation like that in Figure 9. This figure illustrates a feed-forward neural network with one hidden layer. The network consists of k input units, r hidden units, and one output unit and corresponds to the ANN with f(Z, w) defined above. The arrows indicate the ''flow of information.'' If there is no hidden layer (r = 0), the ANN reduces to a common logistic regression model, which is also called the ''logistic perceptron.''
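A minimal numpy sketch of f(Z, w) for one hidden layer; array shapes and names are illustrative only.

```python
import numpy as np

def logistic(u):                      # the activation function Lambda
    return 1.0 / (1.0 + np.exp(-u))

def feed_forward(Z, W0, W, w0, w):
    """f(Z, w) for a network with k inputs and r hidden units.
    Z: (n, k) inputs; W0: scalar; W: (r,) output weights;
    w0: (r,) hidden biases; w: (k, r) hidden weights."""
    hidden = logistic(w0 + Z @ w)     # (n, r) hidden unit activations
    return logistic(W0 + hidden @ W)  # (n,) estimated P(Y = 1 | Z)

# Weight decay (see below) adds lambda * (sum(W**2) + sum(w**2)) as a penalty
# during estimation, shrinking the weights toward zero.
```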

Figure 9 Graphical representation of an artiﬁcial neural network with one input, one hidden, and one output layer.


In general, feed-forward neural networks with one hidden layer are universal approximators (91) and thus can approximate any function defined by the conditional probability that Y is equal to one given Z with arbitrary precision by increasing the number of hidden units. This flexibility can lead to serious overfitting, which can again be compensated by introducing some weight decay (79,92), that is, by adding a penalty term

λ ( Σ_{j=1}^{r} W_j² + Σ_{j=1}^{r} Σ_{i=1}^{k} w_{ij}² )

to the log-likelihood. The smoothness of the resulting function is then controlled by the decay parameter λ. It is interesting to note that in our literature review of articles published between 1991 and 1995, we have not found any application in oncology where weight decay has been used (84). Extension to survival data with censored observations is associated with various problems. Although there is a relatively straightforward extension of ANNs to handle grouped survival data (93), several naive proposals can be found in the literature. To predict outcome (death or recurrence) of individual breast cancer patients, Ravdin and Clark (94) and Ravdin et al. (95) used a network with only one output unit but using the number j of the time interval as additional input. Moreover, they consider the unconditional probability of dying before t j rather than the conditional one as output. Their underlying model then reads

P(T < t_j | Z) = Λ( w_0 + Σ_{i=1}^{k} w_i Z_i + w_{k+1} · j )

for j = 1, . . . , J. T denotes again the survival time random variable, and the time intervals are defined through t_{j−1} ≤ t < t_j, 0 = t_0 < t_1 < ... < t_J = ∞. This parameterization ensures monotonicity of the survival probabilities but also implies a rather stringent and unusual shape of the survival distribution, since in the case that no covariates are considered this reduces to

P(T < t_j) = Λ( w_0 + w_{k+1} · j )

for j = 1, . . . , J. Obviously, the survival probabilities do not depend on the length of the time intervals, which is a rather strange and undesirable feature. Including a hidden layer in this expression is a straightforward extension retaining all the features summarized above. De Laurentiis and Ravdin (96) call this type of neural network a ''time-coded model.'' Another form of neural network that has been applied to survival data is the so-called single time point model (96). Since it is identical to a logistic perceptron or a feed-forward neural network with a hidden layer, it corresponds to fitting logistic regression models or their generalizations to survival data. In practice, a single time point t* is fixed


and the network is trained to predict the survival probability. The corresponding model is given by

k

P (T

t*| Z )

Λ w0

i 1

wi Zi

or its generalization when introducing a hidden layer. This approach is used by Burke (97) to predict 10-year survival of breast cancer patients based on various patient and tumor characteristics at time of primary diagnosis. McGuire et al. (98) used this approach to predict 5-year event-free survival of patients with axillary node-negative breast cancer based on seven potentially prognostic variables. Kappen and Neijt (99) used it to predict 2-year survival of patients with advanced ovarian cancer obtained from 17 pretreatment characteristics. The neural network they actually used reduced to a logistic perceptron. Of course, such a procedure can be repeatedly applied for the prediction t2 ⋅⋅⋅ t J . For example, of survival probabilities at ﬁxed time points t 1 Kappen and Neijt (99) trained several (J 6) neural networks to predict survival of patients with ovarian cancer after 1, 2, . . . , 6 years. The corresponding model reads

k

P (T

tj | Z )

Λ w0j

i 1

wij Zi

in the case that no hidden layer is introduced. Note that without restriction on the parameters such an approach does not guarantee that the probabilities P (T t j |Z ) increase with j and hence may result in life-table estimators suggesting nonmonotone survival function. Closely related to such an approach are the socalled multiple time point models (96) where one neural network with J output units with or without a hidden layer is used. The common drawback of these naive approaches is that they do not allow one to incorporate censored observations in a straightforward manner, which is closely related to the fact that they are based on unconditional survival probabilities instead of conditional survival probabilities as is the Cox model. Neither omission of the censored observations—as suggested by Burke (97)—nor treating censored observations as uncensored are valid approaches but a serious source of bias, which is well known in the statistical literature. De Laurentiis and Ravdin (96) propose imputed estimated conditional survival probabilities for the censored cases from a Cox regression model, that is, they use a well-established statistical

Figure 10 Estimated event-free survival probabilities at 2 years vs. at 1 year for various artiﬁcial neural networks in the GBSG-2 study.

354

Schumacher et al.

respectively. We then increased the number of hidden units to 2 and 5 and varied the degree of weight decay resulting in severe violations of monotonicity of the estimated event-free survival probabilities for a considerable number of patients. The ﬁgures demonstrate the possible bias resulting from the omission of censored observations. In a second stage. We therefore come to a third approach that has been originally suggested by Faraggi and Simon (100) and extended by others (101). Both ways of handling censored observations are contrasted by the estimated event-free survival probabilities at 5 years derived from the ﬁnal FP Cox model (Sec. A and B. estimated event-free survival probabilities at 5 years obtained from the ﬁnal FP Cox model in the GBSG-2 study. we have used the approach of single time point models for the prediction of event-free survival at 1. It can be seen that in this model estimated event-free survival probabilities are still monotone. All seven prognostic factors were considered as quantitative covariates. the bias is smaller for the imputation method. V. This leads to a neural network generalization of the Cox regression model deﬁned by Figure 11 Estimated event-free survival probabilities at 5 years when censored observations are omitted (A) and replaced by imputed values (B) vs. respectively. who emphasized that the resulting bias may be negligible.Prognostic Factor Studies 355 procedure just to make an artiﬁcial neural network work. although ﬁve hidden units and a corresponding number of 42 additional parameters were introduced. 2. Figure 11. and were scaled to the interval [0. The idea is to replace the function exp (β 1 Z 1 ⋅ ⋅ ⋅ β kZ k ) in the deﬁnition of the Cox model by a more ﬂexible function motivated by the function f (Z. 3. w) used in ANNs. respectively. For this imputation. We illustrate some of the points made above with data from the GBSG-2 study. and hormonal therapy in addition were used as inputs for the neural nets. The latter approach is also used by Ripley (92). 4. Figure 10 shows the results of various ANNs in terms of estimated event-free survival at 2 years versus estimated event-free survival at 1 year for those 623 patients who were not censored before 2 years. The ANN with no hidden unit corresponds to an ordinary logistic regression model for event-free survival at 1 and 2 years. First. and 5 years.1 we then obtain nearly the same results as for the logistic regression model (9 parameters). shows the estimated event-free survival probabilities at 5 years when censored observations are omitted or replaced by imputed values from a Cox model. . Table 5). Censored observations occurring before the corresponding time points were omitted. they. we illustrate the impact of insufﬁcient handling of censored observation. but this may be still not considered a fully satisfactory method. For a decay parameter of λ 0. 1]. we used the Cox model presented in Table 4 (full model). except menopausal status.


λ(t | Z_1, . . . , Z_k) = λ_0(t) exp( f_FS(Z, w) )

where

f_FS(Z, w) = Σ_{j=1}^{r} W_j Λ( w_{0j} + Σ_{i=1}^{k} w_{ij} Z_i )

Note that the constant W_0 is omitted in the framework of the Cox model. Estimation of weights is then done by maximizing the partial likelihood, which includes the correct and usual handling of censored observations in that these patients contribute to the partial likelihood as long as they are at risk. Although the problem of censoring is satisfactorily solved in this approach, problems remain with potentially serious overfitting of the data, especially if the number r of hidden units is large.

For illustration, we again used the data of the GBSG-2 study where we applied some preselection of variables in that we took those factors that were included in the final FP model (Sec. V, Table 5). Thus, we used the four factors age, tumor grade, number of lymph nodes, and progesterone receptor (all scaled to the interval [0, 1]), and hormone therapy as inputs for the Faraggi and Simon (F&S) network. Figure 12 shows the results for various F&S networks compared with the FP approach in terms of Kaplan-Meier estimates of event-free survival in the prognostic subgroups defined by the quartiles of the corresponding prognostic indices. It should be noted that the F&S network contains 5 + (6 × 5) = 35 parameters when 5 hidden units are used and 20 + (6 × 20) = 140 when 20 hidden units are used. To highlight this phenomenon, we trained a slightly different F&S network where, in addition to age, tumor size, grade, and number of lymph nodes, estrogen and progesterone receptor status (positive or negative according to the 20 fmol cutpoint) were used as inputs. This network contained 20 hidden units (20 + (7 × 20) = 160 parameters) and showed a similar separation to the one where estrogen and progesterone receptor entered as quantitative inputs. The latter one must be suspected of serious overfitting, with a high chance that the degree of separation achieved could never be reproduced in other studies.

Figure 12  Kaplan-Meier estimates of event-free survival probabilities for the prognostic subgroups derived from various Faraggi & Simon networks and from the FP approach in the GBSG-2 study.
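As a concrete illustration of the Faraggi–Simon type predictor defined above, here is a minimal sketch (hypothetical Python/NumPy code with invented weights; the function names are assumptions, and Λ is taken to be the logistic function). It only evaluates f_FS(Z, w) for one patient; in an actual analysis the weights would be estimated by maximizing the Cox partial likelihood with this predictor in place of the linear term.

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def fs_predictor(Z, W, w0, w):
    """Faraggi-Simon predictor f_FS(Z, w) for one patient.

    Z  : array of k covariates (scaled, e.g., to [0, 1])
    W  : array of r output weights
    w0 : array of r hidden-unit intercepts w_0j
    w  : (r, k) array of hidden-unit weights w_ij
    """
    hidden = logistic(w0 + w @ Z)   # Λ(w_0j + Σ_i w_ij Z_i) for each of the r hidden units
    return float(W @ hidden)        # Σ_j W_j Λ(...); no constant W_0 in the Cox framework

# toy example with r = 2 hidden units and k = 3 covariates (weights are invented)
rng = np.random.default_rng(0)
Z = np.array([0.3, 0.7, 0.1])
W, w0, w = rng.normal(size=2), rng.normal(size=2), rng.normal(size=(2, 3))
print(fs_predictor(Z, W, w0, w))    # the patient's log relative hazard; hazard = λ_0(t)·exp(f_FS)
```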

Table 8B contrasts the results from the GBSG-2 study used as training set and the Freiburg DNA study used as test set in terms of estimated relative risks, where the predicted event-free survival probabilities are categorized in quartiles. In the training set, we observe a 20-fold increase in risk between the high-risk and the low-risk group, whereas the F&S network turns out to yield a completely useless prognostic classification scheme in the test set, where the estimated relative risks are not even monotone increasing. It is obvious that some restrictions, either in terms of a maximum number of parameters or by using some weight decay, are absolutely necessary to avoid such an amount of overfitting as observed in the two prognostic classification schemes based on F&S networks where weight decay was not applied. The results for an F&S network with five hidden units are very much comparable with the FP approach, especially when some weight decay is introduced. It should be noted that the FP approach contains at most eight parameters if we ignore the preselection of the four factors.

Table 8B  Estimated Relative Risks for Various Prognostic Classification Schemes Based on F&S Neural Networks Derived in the GBSG-2 Study and Validated in the Freiburg DNA Study. For each of four F&S networks (twenty hidden units without and with weight decay 0.1; five hidden units without and with weight decay 0.1), the estimated relative risks and numbers of patients are given for the prognostic groups I–IV, defined by the quartiles of the prognostic index, in the GBSG-2 study (training set) and in the Freiburg DNA study (test set).

Summarizing our experience with ANNs for prognostic classification, it should be emphasized that they have to be regarded as very flexible nonlinear regression models deserving the same careful model building as other statistical models of similar flexibility. In particular, the dangers of serious overfitting have to be taken into account. When applying ANNs to survival data, one has to take care that the standard requirements for such data, such as the proper incorporation of censored observations or the modeling of conditional survival probabilities, are met. In our literature survey we did not find a satisfactory application (84), although some progress has been made in recent methodological contributions (92,101,102).

IX. ASSESSMENT OF PROGNOSTIC CLASSIFICATION SCHEMES

Once a prognostic classification scheme is developed and defined, the question arises how its predictive ability can be assessed and how its performance can be compared with that of competitors. It is interesting to note that there is no commonly agreed approach available in the statistical literature, and most measures that are used have some ad hoc character. Suppose that a prognostic classification scheme consists of g prognostic groups—called risk strata or risk groups—then one common approach is to present the Kaplan-Meier estimates for event-free or overall survival in the g groups. This is the way in which we also presented the results of prognostic classification schemes derived by various statistical methods in previous sections. The resulting figures are often accompanied by p values of the logrank test for the null hypothesis that the survival functions in the g risk strata are equal. It is clear that a significant result is a necessary but not a sufficient condition for good predictive ability. Sometimes, a Cox model using dummy variates for the risk strata is fitted and the log-likelihood and/or estimated relative risks of risk strata with respect to a reference are given.

Recently, we proposed a summary measure of separation (36) defined as

SEP = exp( Σ_{j=1}^{g} (n_j / n) |β̂_j| )

where n_j denotes the number of patients in risk stratum j and β̂_j is the estimated log-hazard ratio or log-relative risk of patients in risk stratum j with respect to a baseline reference. SEP is the weighted geometric mean of "absolute" relative risks between strata and baseline, "absolute" meaning that 1/RR replaces RR for relative risks RR < 1 (103). We used the baseline reference estimated in a Cox model where the dummy variates for risk strata were centered to have mean zero; in fact, the pooled Kaplan-Meier estimate has been used previously as baseline reference (36). Often this model-based baseline reference turns out to be very similar to the estimated marginal distribution of T, i.e., to the pooled Kaplan-Meier estimate Ŝ(t). Thus, SEP essentially compares risks within strata with the risk in the entire population.
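The SEP computation itself is elementary once the stratum sizes and the estimated log-hazard ratios are available; a minimal sketch (hypothetical Python code, with invented numbers rather than the GBSG values):

```python
import numpy as np

def sep(n_strata, log_hr):
    """Summary measure of separation SEP = exp( sum_j (n_j / n) * |beta_hat_j| )."""
    n_strata = np.asarray(n_strata, dtype=float)
    log_hr = np.asarray(log_hr, dtype=float)
    return float(np.exp(np.sum(n_strata / n_strata.sum() * np.abs(log_hr))))

# three risk strata with estimated log-hazard ratios relative to a centered baseline
print(sep(n_strata=[120, 250, 130], log_hr=[-0.8, 0.1, 0.9]))   # about 1.61
```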

In this section, we use the GBSG-4 study for illustration. In the node-negative breast cancer patients we compare the NPI (73,74), which has already been defined in Section VII, and two classification schemes that have been derived from a Cox regression model and a CART approach, respectively. The first scheme is defined through a simplified prognostic index

COX = I(age 40 years) + I(size 20 mm) + grade

where I(⋅) denotes the indicator function being equal to 1 if "⋅" holds true and 0 otherwise. Three risk groups (I, II, and III, with numbers of patients in the GBSG-4 study) are defined through COX = 1, 2 (n_1 = 277), COX = 3 (n_2 = 121), and COX = 4, 5 (n_3 = 205). For the second one, four risk groups have been obtained (36), given as

CART I: grade 1 and age 60 years (n_1 = 78);
CART II: size 20 mm and [(grade 2–3 and age 40 years) or (grade 1 and age 60 years)] (n_2 = 222);
CART III: (age 40 years and grade 2–3) or (size 20 mm and age 60 years and grade 1) or (size 20 mm and grade 2–3 and estrogen receptor 300 fmol) (n_3 = 284);
CART IV: size 20 mm and grade 2–3 and estrogen receptor 300 fmol (n = 19).

As can be seen from the numbers of patients, this leads to two relatively large medium risk strata, a smaller low risk stratum, and a very small high risk stratum in the GBSG-4 study. Figure 13, A–C, shows the Kaplan-Meier estimates in the various risk strata corresponding to the three prognostic classification schemes. Table 9A summarizes the results of some ad hoc measures applied to the data of the GBSG-4 study. For all three prognostic classification schemes considered, the p values of the logrank test are highly significant (p < 0.0001). There is some improvement in the log-likelihood for the NPI and some further improvement for the simplified Cox Index; the CART classification scheme shows the best result. A formal comparison, however, is hampered by the fact that the corresponding regression models are not nested. Thus, this measure does not prove to be particularly useful, although the model-based approach may be preferable for formal reasons. The summary measure SEP yields an average absolute risk with respect to baseline of about 1.56 for the NPI and 1.55 for the simplified Cox Index, respectively. With a value of 1.71, the CART classification scheme again shows the best performance.

Figure 13  Kaplan-Meier estimates of event-free survival probabilities for the prognostic subgroups according to the Nottingham Prognostic Index (A), derived by a Cox model (B), and a CART approach (C) in the GBSG-4 study in node-negative breast cancer.

Table 9A  Ad Hoc Measures for Predictive Ability of Three Prognostic Classification Schemes in the GBSG-4 Study in Node-negative Breast Cancer

Ad hoc measure    Pooled Kaplan-Meier    NPI        COX index    CART index
p value           —                      <0.0001    <0.0001      <0.0001
−2 log L          1856                   1827       1817         1802
SEP               1                      1.559      1.550        1.710

Since these ad hoc measures are only of limited value, we now briefly outline some recent developments; a detailed description can be found elsewhere (103). First, it is of central importance to recognize that the time-to-event itself cannot adequately be predicted (104–108). The best one can do at t = 0 is to try to estimate the probability that the event of interest will not occur until a prespecified time horizon represented by some time point t*, given the available covariate information for a particular patient at t = 0. As outlined above, it is the aim of a prognostic classification scheme to provide estimated event-free probabilities Ŝ(t* | j) for patients in risk stratum j (j = 1, . . . , g). These estimated probabilities may be used as predictions of the event status Y = I(T > t*). Consequently, a measure of inaccuracy that is aimed to assess the value of a given prognostic classification scheme should compare the estimated event-free probabilities with the observed individual outcome, that is, the observed survival or event status at t*.

Thus, we consider an approach based directly on the estimates of event-free probabilities S(t* | Z = z) for patients with Z = z. To determine the mean square error of prediction in this case, Y = I(T > t*) has to be compared with the estimated probability Ŝ(t* | j), leading to

BS(t*) = (1/n) Σ_{i=1}^{n} ( I(T_i > t*) − Ŝ(t* | j_i) )²

where the sum goes over all n patients. This quantity is known as the quadratic score. Multiplied by a factor of 2 (omitted here for simplicity), it is equal to the Brier score, which was originally developed for judging the inaccuracy of probabilistic weather forecasts (109–111). The expected value of the Brier score may be interpreted as a mean square error of prediction if the event status at t* is predicted by the estimated event-free probabilities Ŝ(t* | j). In the extreme case where the estimated event-free probabilities are 0 or 1 for all patients—this corresponds to the assertion that the event-free status at t* can be predicted without error—BS(t*) will be zero if Ŝ(t* | j) coincides with the observed event status. It will attain its maximum value of 1 only if the estimated event-free probabilities happen to be equal to 1 minus the observed event status for all patients. In the absence of any knowledge about the disease under study, a trivial constant prediction Ŝ(t*) = 0.5 for all patients would be the most plausible approach. This yields a Brier score equal to 0.25.

If some closer relationship to the likelihood is intended, the so-called logarithmic score may be preferred. This is given by

LS(t*) = −(1/n) Σ_{i=1}^{n} { I(T_i > t*) log Ŝ(t* | j_i) + I(T_i ≤ t*) log (1 − Ŝ(t* | j_i)) }

where we adopt the conventions "0 ⋅ log 0 = 0" and "1 ⋅ log 0 = −∞". Hence LS(t*) is equal to zero in the extreme situation where the estimated event-free probabilities Ŝ(t* | j_i) are 0 or 1 for all patients and coincide with their observed event status I(T_i > t*). It will attain infinity if the estimated event-free probability happens to be equal to I(T_i ≤ t*) for at least one patient (111,112).
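Both scores are straightforward to compute once each patient's event status at t* and predicted event-free probability are at hand; a small sketch (hypothetical Python code with invented data, ignoring censoring for the moment, exactly as in the definitions above; the epsilon clipping is only a numerical safeguard against log 0):

```python
import numpy as np

def brier_score(status_eventfree, pred_surv):
    """BS(t*) = mean over patients of (I(T_i > t*) - S_hat(t*|j_i))^2."""
    y = np.asarray(status_eventfree, dtype=float)
    p = np.asarray(pred_surv, dtype=float)
    return float(np.mean((y - p) ** 2))

def log_score(status_eventfree, pred_surv, eps=1e-12):
    """LS(t*) = -mean of [ I(T > t*) log S_hat + I(T <= t*) log(1 - S_hat) ]."""
    y = np.asarray(status_eventfree, dtype=float)
    p = np.clip(np.asarray(pred_surv, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1, 0, 1, 1, 0])            # observed event-free status at t*
p = np.array([0.8, 0.3, 0.6, 0.9, 0.4])  # predicted event-free probabilities
print(brier_score(y, p), log_score(y, p))
print(brier_score(y, np.full(5, 0.5)))   # the trivial prediction 0.5 gives 0.25
```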

So far, censoring has not been taken into account in the definition of both measures of inaccuracy. If we do not wish to restrict ourselves to one fixed time point t*, we can consider both the Brier score and the logarithmic score as a function of time for t ∈ [0, t*]. This function can also be averaged over time, by integrating it with respect to some weight function W(t) (103). How to do that in a way that the resulting measures are still consistent estimates of the population quantities—in case of the Brier score, the mean square error of prediction—is not a trivial problem. It can, however, be solved by reweighting the individual contributions in a similar way as in the calculation of the Kaplan-Meier estimator. The reweighting of uncensored observations and of observations censored after t* is done by the reciprocal of the Kaplan-Meier estimate of the censoring distribution, whereas observations censored before t* get weight zero. With this weighting scheme, a Brier or a logarithmic score under random censorship can be defined that enjoys the desirable statistical properties (103). Using these scores, R²-type measures (113–116) can also be readily defined by relating the Brier or the logarithmic score for a prognostic classification scheme to that score where the pooled Kaplan-Meier estimate is used as "universal" prediction for all patients.

We calculated the Brier and the logarithmic score for the data of the GBSG-4 study in node-negative breast cancer. Table 9B summarizes the results of these measures of inaccuracy for the NPI, the simplified COX index, and the CART index. For the NPI, the Brier score at t* = 5 is equal to 0.184, which is not very much below 0.196, the value reached by the pooled Kaplan-Meier prediction for all patients; remember that the Brier score is equal to 0.25 when the trivial prediction Ŝ(t*) = 0.5 is made for all patients. For all measures, the simplified COX index performs better than the NPI, and some further improvement is achieved by the CART index.

Table 9B  Measures of Inaccuracy for Three Prognostic Classification Schemes in the GBSG-4 Study in Node-negative Breast Cancer

Measure of inaccuracy    Pooled Kaplan-Meier    NPI      COX index    CART index
BS (t* = 5)              0.196                  0.184    0.179        0.175
LS (t* = 5)              0.580                  0.549    0.538        0.529
R²                       0.0%                   6.1%     8.7%         10.4%
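A sketch of the reweighting idea and of the R²-type measure follows (hypothetical Python code; the function names are invented, the censoring-distribution estimate G is assumed to be available as a vectorized callable, e.g., a Kaplan–Meier fit to the censoring indicator, and the numbers in the usage line are made up rather than the published GBSG-4 values):

```python
import numpy as np

def ipcw_brier(time, event, pred_surv, t_star, G):
    """Censoring-weighted Brier score at t_star.

    G(t) : Kaplan-Meier estimate of the censoring distribution (callable).
    Events before t_star are weighted by 1/G(T_i), patients still under
    observation at t_star by 1/G(t_star); cases censored before t_star get weight 0.
    """
    time = np.asarray(time, float)
    event = np.asarray(event, int)
    p = np.asarray(pred_surv, float)
    had_event = (time <= t_star) & (event == 1)
    at_risk = time > t_star
    w = np.zeros_like(p)
    w[had_event] = 1.0 / np.asarray(G(time[had_event]), float)
    w[at_risk] = 1.0 / float(G(t_star))
    resid = np.where(at_risk, 1.0 - p, np.where(had_event, 0.0 - p, 0.0))
    return float(np.mean(w * resid ** 2))

def r2_measure(score_model, score_pooled_km):
    """R2-type measure: relative gain over the pooled Kaplan-Meier prediction."""
    return 1.0 - score_model / score_pooled_km

print(r2_measure(0.18, 0.20))   # 0.10 with these invented scores
```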

Relative to the prediction with the pooled Kaplan-Meier estimate for all patients, there is only a moderate gain of accuracy. For the GBSG-4 study in node-negative breast cancer, the R²-measure of explained residual variation based on the Brier score just reaches 10.4% for the best prognostic classification scheme. In general, it has to be acknowledged that measures of inaccuracy tend to be large, or, the other way round, R²-type values tend to be small, reflecting that predictions are far from being perfect (117).

We have assumed that the estimated probabilities Ŝ(t | j) of being event free up to time t have emerged from external sources. Actually, this is by no means true and only pretended for illustrative purposes. Even for the NPI that has been proposed in the literature (73,74), we have estimated the event-free probabilities from our data set and used them as predictions; the COX and the CART indices have even been derived from the same data set. Thus, it has to be mentioned that there may still be some overoptimism present when a measure of inaccuracy is calculated in the same data from which the prognostic classification scheme is derived. To reduce the resulting overoptimism, cross-validation and resampling techniques may be used in a similar way as for the estimation of error rates (118,119) or for the reduction of bias of effect estimates as outlined in Section IV. For definitive conclusions, however, the determination of measures of inaccuracy in an independent test data set is absolutely necessary (79).

X. SAMPLE SIZE CONSIDERATIONS

If the role of a new prognostic factor is to be investigated, a careful planning of an appropriate study is required. This includes an assessment of the power of the study in terms of sample sizes. The criterion of interest is survival or event-free survival of the patients. An adequate analysis of the independent prognostic effect of a new factor has to be adjusted for the existing standard factors (4,120). With survival or event-free survival as the end point, this will often be done with the Cox proportional hazards model. Sample size and power formulae in survival analysis have been developed for randomized treatment comparisons. In the analysis of prognostic factors, the covariates included are expected to be correlated with the factor of primary interest. In this situation, the existing sample size and power formulae are not valid and may not be applied. In this section we give an extension of Schoenfeld's formula (121) to the situation that a correlated factor is included in the analysis.

We consider the situation that we wish to study the prognostic relevance of a certain factor—denoted by Z_1—in the presence of a second factor Z_2, which can also be a score based on several other factors. We assume that the analysis of

the main effects of Z_1 and Z_2 is performed with the Cox proportional hazards model given by

λ(t | Z_1, Z_2) = λ_0(t) exp(β_1 Z_1 + β_2 Z_2)

where λ_0(t) denotes an unspecified baseline hazard function and β_1 and β_2 are the unknown regression coefficients representing the effects of Z_1 and Z_2, respectively. For sake of simplicity we assume that Z_1 and Z_2 are binary with p denoting the prevalence of Z_1 = 1, that is, P(Z_1 = 1) = p. The relative risk between the groups defined by Z_1 is then given by θ_1 = exp(β_1). Assume that the effect of Z_1 shall be tested by an appropriate two-sided test based on the partial likelihood derived from the Cox model with significance level α and power 1 − β to detect an effect that is given by a relative risk of θ_1. For independent Z_1 and Z_2, it was shown by Schoenfeld (121) that the total number of patients required is given by the following expression

N = (u_{1−α/2} + u_{1−β})² / [ (log θ_1)² ψ (1 − p) p ]

where ψ is the probability of an uncensored observation and u_γ denotes the γ-quantile of the standard normal distribution. This is the same formula as that used for a comparison of two populations as developed by George and Desu (122) for an unstratified and by Bernstein and Lagakos (123) for a stratified comparison of exponentially distributed survival times and by Schoenfeld (124) for the unstratified logrank test. By using the same approximations as Schoenfeld (121), one can derive a formula also for the case when Z_1 and Z_2 are correlated with correlation coefficient ρ; for details we refer to Schmoor et al. (125). This formula reads

N = (u_{1−α/2} + u_{1−β})² / [ (log θ_1)² ψ (1 − p) p ] ⋅ 1/(1 − ρ²)

The factor 1/(1 − ρ²) is usually called the variance inflation factor (VIF), and it is this factor that has to be taken into account. This formula is identical to a formula derived by Lui (126) for the exponential regression model in the case of no censoring. Obviously, the expected number of events—often also called the "effective sample size"—to achieve a prespecified power is minimal for p = 0.5, that is, the situation of a randomized clinical trial with equal probabilities for treatment allocation.

Table 10 gives for some situations the value of the VIF and the effective sample size Nψ, the number of events required to obtain a power of 0.8 to detect an effect of Z_1 of magnitude θ_1, as calculated by the formula given above. The sample size formula depends on p, the prevalence of Z_1 = 1. It shows that the required number of events for the case of two correlated factors may increase up to a factor of 50% in situations realistic in practice.
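The formula translates directly into code; the following sketch (hypothetical Python using scipy, with an invented function name) computes the effective sample size Nψ including the variance inflation factor and reproduces entries of the kind shown in Table 10:

```python
import numpy as np
from scipy.stats import norm

def n_events(theta1, p, rho=0.0, alpha=0.05, power=0.8):
    """Expected number of events N*psi needed to detect a relative risk theta1
    for a binary factor Z1 with prevalence p, adjusted for a correlated binary
    factor Z2 (correlation rho); Schoenfeld-type formula with VIF = 1/(1-rho^2)."""
    u = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    vif = 1.0 / (1.0 - rho ** 2)
    return u ** 2 / (np.log(theta1) ** 2 * p * (1 - p)) * vif

print(round(n_events(theta1=2.0, p=0.5, rho=0.0)))   # about 65 events
print(round(n_events(theta1=2.0, p=0.5, rho=0.6)))   # about 102 events
```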

Table 10  Variance Inflation Factors and Effective Sample Size Required to Detect an Effect of Z_1 of Magnitude θ_1 with Power 0.8, as Calculated by the Approximate Sample Size Formula, for Various Values of p, ρ, and θ_1 (α = 0.05)

                            Nψ
p      ρ      VIF      θ_1 = 1.5    θ_1 = 2    θ_1 = 4
0.5    0      1        191          65         16
0.5    0.2    1.04     199          68         17
0.5    0.4    1.19     227          78         19
0.5    0.6    1.56     298          102        26
0.3    0      1        227          78         19
0.3    0.2    1.04     237          81         20
0.3    0.4    1.19     271          93         23
0.3    0.6    1.56     355          122        30

Note that the formula given above is identical to that developed by Palta and Amini (127) for the situation that the effect of Z_1 is analyzed by a stratified logrank test, where Z_2 = 0 and Z_2 = 1 define the two strata.

The sample size formulae given above will now be illustrated by means of the GBSG-2 study. Suppose we want to investigate the influence of the progesterone receptor in the presence of tumor grade. The Spearman correlation coefficient of these two factors is ρ = 0.377; if they are categorized as binary variables, we find ρ = 0.248 from Table 11. Taking the prevalence of progesterone-positive tumors (p = 60%) into account, a number of 213 events is required to detect a relative risk of 0.67 and of 74 events to detect a relative risk of 0.5, with a power of 80% (significance level α = 5%).

Table 11  Distribution of Progesterone Receptor by Tumor Grade and Estrogen Receptor in the GBSG-2 Study

                            Tumor grade          Estrogen receptor
Progesterone receptor       1        2–3         <20 fmol    ≥20 fmol
<20 fmol                    5        264         190         79
≥20 fmol                    76       341         72          345
Correlation coefficient         0.248                  0.536
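The binary correlation coefficients quoted above can be recomputed directly from the two-by-two counts in Table 11; a short sketch (hypothetical Python code, the helper name is an assumption):

```python
import numpy as np

def phi_coefficient(a, b, c, d):
    """Absolute correlation (phi) coefficient of a 2x2 table with cells
    [[a, b], [c, d]] (rows: factor 1 categories, columns: factor 2 categories)."""
    a, b, c, d = map(float, (a, b, c, d))
    return abs(a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# progesterone receptor versus tumor grade and versus estrogen receptor,
# using the counts of Table 11
print(round(phi_coefficient(5, 264, 76, 341), 3))    # 0.248
print(round(phi_coefficient(190, 79, 72, 345), 3))   # 0.536
```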

The variance inflation factor in this situation is equal to 1.07. This has to be contrasted with the situation that both factors under consideration are uncorrelated; in this case the required number of events is 201 to detect a relative risk of 0.67 and 69 to detect a relative risk of 0.5, both with a power of 80% at a significance level of 5%, indicating that the correlation between the two factors has only little influence on power and required sample sizes. If we want to investigate the prognostic relevance of progesterone receptor in the presence of estrogen receptor, a higher correlation has to be considered. The Spearman correlation coefficient is equal to ρ = 0.598 if both factors are measured on a quantitative scale and ρ = 0.536 if they are categorized into positive and negative using the 20 fmol cutpoint, as given in Table 11. This leads to a variance inflation factor of 1.41 and a number of events of 284 and 97 required to detect a relative risk of 0.67 and of 0.5, respectively (power 80%, significance level α = 5%).

So from this aspect, the GBSG-2 study with 299 events does not seem too small to investigate the relevance of prognostic factors that exhibit at least a moderate effect (relative risk of 0.67 or 1.5). The question is whether it is large enough to permit the investigation of several prognostic factors. More precisely, it is the number of events per model parameter that matters, which is often overlooked. There have been some recommendations in the literature, based on practical experience or on results from simulation studies, regarding the event per variable relationship (58,128–131). These recommendations range from 10 to 25 events per model parameter to ensure stability of the selected model and of corresponding parameter estimates and to avoid serious overfitting.

The sample size formula given above addresses the situation of two binary factors. So if several factors should be included in the analysis, one practical solution is to prespecify a prognostic score based on the existing standard factors and to consider this score as the second covariate to be adjusted for. Another possibility would be to adjust for that prognostic factor for which the largest effect on survival and the highest correlation is anticipated. For more general situations (i.e., factors occurring on several levels or factors with continuous distribution), the required sample size may be calculated using a more general formula that can be developed according to the lines of Lubin and Gail (132). The anticipated situation has then to be specified in terms of the joint distribution of the factors under study and the size of corresponding effects on survival. It may be more difficult to pose the necessary assumptions than in the situation of only two binary factors, but in principle it is possible to base the sample size calculation on that formula. Numerical integration techniques are then required to perform the necessary calculation. Finally, it should be mentioned that a sample size formula for the investigation of interactive effects of two prognostic factors is also available (125,133,134).

XI. CONCLUDING REMARKS

In this chapter we consider statistical aspects of the evaluation of prognostic factors. In particular we highlight the situation that historical data might be available in a database. In contrast to therapeutic studies, where the comparison of treatments based on historical data is almost always a totally useless exercise (135,136), such data might provide a valuable source for studying the role of prognostic factors under certain conditions, but also here various problems have to be considered, requiring expert knowledge in medical statistics. The problems when dealing with historical data range from the probably insufficient quality of data and completeness of baseline and follow-up data to the heterogeneity of patients with respect to prognostic factors and therapy. By using data from three prognostic factor studies in breast cancer, some of these problems have been demonstrated in this chapter. Problems with completeness and quality of data can usually not be solved retrospectively; the only possibility is then a prospective collection of such data, including a regular follow-up. This has, for example, been done in a study on the role of DNA content in advanced ovarian cancer (137). Problems with the heterogeneity of the patient population can at least partially be avoided by definition of suitable inclusion and exclusion criteria, preferably in a similar study protocol as is common practice in prospective therapeutic studies. Here it is desirable to define a relatively homogeneous study population and to avoid a nontransparent mixture of patients from different stages of a disease and with substantially different treatments.

As far as the statistical analysis is concerned, a multivariate approach is absolutely essential. In prognostic factor studies in oncology where the end point is survival or event-free survival, the Cox regression model provides a flexible tool for such an analysis. We have shown that various approaches for such a model-building process exist, including the method of classification and regression trees and ANNs; further methodological research is needed about the use of such methods under various circumstances. The most important requirements one should keep in mind are to arrive at models that are as simple and parsimonious as possible and to avoid serious overfitting. Only if these requirements are acknowledged can generalizability for future patients be achieved. Some insight into the stability and generalizability of the derived models can be gained by cross-validation and resampling methods that, however, cannot be regarded to completely replace an independent validation study. Thus, validation in an independent study is an essential step in establishing a prognostic factor or a prognostic classification scheme. The lack of appropriate validation studies, in combination with insufficient design considerations and inadequate statistical analyses resulting in serious overfitting, has led to the situation of conflicting results and failures to establish many

"new" and even "old" prognostic factors (138,139). Thus, a careful planning of prognostic factor studies and a proper and thoughtful statistical analysis is an essential prerequisite for achieving an improvement of the current situation. In illustrating various approaches, however, we were not consistent in the sense that we have not attempted a complete analysis of a particular study according to some generally accepted guidelines. We have rather shown the strengths and weaknesses of various approaches to protect against potential pitfalls. For a concrete study, the statistical analysis should be carefully planned step by step, and the model-building process should at least principally be fixed in advance in a statistical analysis plan, as is required in much more detail for clinical trials according to international guidelines (140,141).

The problem of adequate sample sizes for prognostic factor studies has not been fully appreciated in the past. Three points that are essential for the calculation of sample sizes have to be recognized. First, in contrast to therapeutic studies, where one might argue that also very small differences associated with relative risks close to 1 are relevant for a comparison of therapies (142–144), one might accept the requirement that established prognostic factors should exhibit large relative risks. Thus, at a first glance, studies on prognostic factors seem to require smaller numbers of patients. Second, the distribution can be described by the prevalence of the factor, which might differ substantially from the optimal value of 0.5 usually arising in a randomized clinical trial. For small values of the prevalence, a study has to be larger than a comparable therapeutic trial using the same value of the relative risk as a clinically relevant difference. Third, it is the expected number of events and not the total number of patients that constitutes the quantity of central importance and that depends on the length and completeness of follow-up. In addition, it is the number of factors under consideration, or better the number of model parameters, that should affect the size of a study. Practical experience and results from simulation studies suggest as some rule of thumb that studies with less than 10 to 25 events per factor (or parameter) cannot be considered as an informative and reliable basis for the evaluation of prognostic factors. It has also to be recognized that the number of patients or events suitable for the final analysis might differ substantially from the number of patients available in a database when rigorous inclusion and exclusion criteria are applied. From these considerations it can be derived that small or premature studies are not informative and cannot lead to an adequate assessment of prognostic factors. They can only create hypotheses in an explorative sense and might even lead to more confusion on the role of prognostic factors because of various sources of bias, including the publication bias (145–148). To reach definitive conclusions, close cooperation between different centers or study groups might be necessary, which might lead to a meta-analysis type evaluation of prognostic factors (149). This approach would surely be associated with many additional

difficulties, but it would help to avoid some of the problems due to publication bias (146,149,150). It would also encourage the use of standard prognostic models for particular entities or stages of cancer. In addition, the size of an independent validation study should be large enough to allow valid and definitive conclusions. Thus, the Freiburg DNA study that we used as a validation study several times throughout this contribution was far too small from this point of view and should have served for illustrative purposes only.

A number of important topics have not, or have only been mentioned in passing, in this chapter. One of these topics is concerned with the handling of missing values in prognostic factors. We have always confined ourselves to a so-called complete case analysis that would lead to consistent estimates of the regression coefficients if some assumptions are met (151,152). However, this may not be a very efficient approach, especially if the missing rates are higher than in the three prognostic studies that we used for illustration. Thus, more sophisticated methods for dealing with missing values in some prognostic factors might be useful (152–154). For applying prognostic factors or prognostic classification schemes to future patients, one also has to be prepared that some factors may have not been measured. To arrive at a prediction of survival probabilities for such patients, surrogate definitions for the corresponding prognostic classification schemes are required. Throughout this chapter, we have also assumed that effects of prognostic factors are constant over time and that prognostic factors are recorded and known at time of diagnosis. These assumptions do not cover the situation of time-varying effects and of time-dependent covariates. If multiple end points or different events are of interest, the use of competing risk and multistate models may be indicated. For these topics that are also of importance for prognostic factor studies, we refer to more advanced textbooks in survival analysis (14,19,20,22) and current research papers.

In general, the methods and approaches presented here have at least in part been selected and assessed according to the subjective views of the authors; other approaches might also be seen as useful and adequate. What should not be a matter of controversy, however, is the need for a careful planning, conducting, and analyzing of prognostic factor studies to arrive at generalizable and reproducible results that could contribute to a better understanding and possibly to an improvement of the prognosis of cancer patients.

ACKNOWLEDGMENTS

We thank our colleagues Erika Graf and Claudia Schmoor for valuable contributions and Regina Gsellinger for her assistance in preparing the manuscript.

Gospodarowicz MK. In Buyse ME. A framework for evaluating and conducting prognostic studies: an application to cirrhosis of the liver. Hutter RVP. New York: Wiley. 14. Staquet MJ. Byar DP. 86:829–835. Breast Cancer Res Treat 1992. Altman DG. Survival Analysis: A Practical Approach. J Nat Cancer Inst 1994. 1984. Clark GM. 11. Marubini E. Special issue. New York: Springer. Use of regression models: statistical aspects. 1994. 1991. Henson DE. Br Med J 1995. 423–443. Love SB. George SL. 17. 7. 72:511–518. Lausen B. ed. Schumacher M. Cancer Clinical Trials: Methods and Practice. Villeneuve J-P. Prognostic models: clinically useful or quickly forgotten? Commentary. 15. 18. Sauerbrei W. 1984. Klein JP. 5:462–471. Modelling Survival Data in Medical Research. eds. J Natl Cancer Inst 1991. Chichester: Wiley. 42:791–805. Identiﬁcation of prognostic factors. 10. Machin D. In: Mike V. 69:979–985. 311:1539–1541. Esnaola S. Int J Cancer 1974. 1995. Chichester: Wiley. 1982. pp. Semin Oncol 1988. Statistical aspects of prognostic factor studies in oncology. Survival Analysis: Techniques for Censored and Truncated Data. Armitage P.Prognostic Factor Studies 371 REFERENCES 1. New York: Wiley. Sylvester RJ. 1995. Hermanek P. Dangers of using ‘‘optimal’’ cutpoints in the evaluation of prognostic factors. Stanley KE. Identiﬁcation and assessment of prognostic factors. Oxford: Oxford University Press. Review of survival analyses published in cancer journals. Moeschberger ML. 365–401. Altman DG. 2nd ed. In: Buyse ME. 13:16–36. London: Chapman & Hall. 444–466. Harris EK. Lee ET. Byar DP. Altman DG. Commentary. Statistical methods for the identiﬁcation and use of prognostic factors. Statistics in Medical Research. Infante-Rivard C. 6. Survivorship Analysis for Clinical Studies. Simon R. Gehan EA. Analysis of survival data: Cox and Weibull models with covariates. Analysing Survival Data from Clinical Trials and Observational Studies. Albert A. 16. New York: Marcel Dekker. Breast cancer prognostic factors: evaluation guidelines. Altman DG. Heidelberg–New York: Springer. 3. Prognostic factor integration. Stepniewska KA. 13. eds. 83:154–155. Collett D. 2. 9. Valsecchi MG. pp. . Parmar MKB. Br J Cancer 1994. pp. eds. 8. Simon R. De Stavola BL. Cancer Clinical Trials: Methods and Practice. 5. Statistical Methods for Survival Data Analysis. Oxford: Oxford University Press. Wyatt JC. 19. 1995. Prognostic Factors in Cancer. J Clin Epidemiol 1989. 1992. 22:185–293. 4. Sobin LH. Sylvester RJ. McGuire WL. 1997. Br J Cancer 1995. Staquet MJ. 12.

Beyerle C. Schumacher M. Neumann RLA. J Clin Oncol 1994. Menzel D. Rothman KJ. Sauerbrei W. Schumacher M. Prentice RL. Kiechle M. 71:2426–2429. Gill RD. Stat Med 1985. Philadelphia: Lippincott-Raven. Stat Med 1993. Stat Med 1996. Sauerbrei W. Ulm K. Sauerbrei W. Cancer 1992. Sauerbrei W. 2nd ed. 28. Modern Epidemiology. Fleming TR. Keiding N. Breast Cancer Res Treat 1998. Cancer 1993. Bastert G. 20. 1998. Olschewski M. Byar DP. DNA ﬂow cytometry in node positive breast cancer: prognostic value and correlation to morphological and clinical factors. Regression models and life tables (with discussion). 17:406–412. Hubner K. Henson DE. Counting Processes and Survival Analysis. Breast Cancer Res Treat 1997. 72:3131–3135. Schumacher M. Rauschecker HF for the German Breast Cancer Study Group. Reduction of bias caused by model . J R Stat Soc Ser B 1972. Randomized 2 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. Comparison of the Cox model and the regression tree procedure in analyzing a randomized clinical trial. Multiple prognostic factors and outcome analysis in patients with cancer. Schumacher M for the German Breast Cancer Study Group. Kommoss F. Pﬂeiderer A. 22. Schmoor C. The Statistical Analysis of Failure Time Data. 27. 1991. 25. Pﬁsterer J. 12:2086–2093. 30. [Correction. The future of prognostic factors in outcome prediction for patients with cancer. 29. Giese E.372 Schumacher et al. 14:473–482. Statistical Methods Based on Counting Processes. 32. Hubner K. Burke HB. 69:1639–1644. Assessing apparent treatment covariate interactions in randomized clinical trials. 42: 149–163. Validation of existing and development of new prognostic classiﬁcation schemes in node negative breast cancer. Schumacher M. Future directions for the American Joint Committee on Cancer. Randomized and non-randomized patients in clinical trials: experiences with comprehensive cohort studies. ¨ 38. Cancer 1993. Anal Quant Cytol Histol 1995.] 37. 23. Biometrics 1985. Cancer 1992. Andersen PK. 21. Cox DR. Br J Cancer 1957. Bojar H. Simon R. Gail M. Henson DE. Kalbﬂeisch JD. Patients subsets and variation in therapeutic efﬁcacy. 35. 15:263–271. 31. 33. New York: Wiley. Schmoor C. Fielding LP. Henson DE. New York: Wiley. Simon R. Testing for qualitative interactions between treatment effects and patient subsets. Criteria for prognostic factors and for an enhanced prognostic system. 41:361–372. Br J Cancer 1982. ¨ 36. ¨ 34. 48:191–192. Richardson WW. Freedman LS. Histological grading and prognosis in primary breast cancer. Bloom HJG. Harrington DP. Fenoglio-Preiser CM. Schmoor C. Schmoor C. Hilgarth M. 12: 2351–2366. 26. Borgan O. 24. 34:187–220. 1992. Greenland S. 2:359–377. 4:255–263. 1980. New York: Springer. Hollander N. Olschewski M. 70:2367–2377. Fielding LP.

Appl Stat 1994.Prognostic Factor Studies 373 39. 53. Royston P. 57. ¨ eds. 162:71–94. Worsley KJ. Bojar H. pp. Miller AJ. Van Houwelingen HC. Verweij P. 16:2813– 2827. Schumacher M. 12:621–625. 55. 69:297–302. 10:1931–1941. Silvestri D. Biometrika 1982. Modelling the effects of standard prognostic factors in node positive breast cancer. Sauerbrei W. 43. Comparison of variable selection procedures in regression models— a simulation study and practical examples. 48. Clark GM. Scand J Stat 1986. 13:159–171. New York: Chapman and Hall. 50. 1993. Sauerbrei W. Biometrie und Epide¨ miologie. Sauerbrei W. Schmoor C. Survival analysis 1982–1991: the second decade of the proportional hazards regression model. Le Cessie S. Stat Med 1993. In: Michaelis J. Appl Stat 1999. Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling (with discussion). Practical P-value adjustment for optimally selected cutpoints. Proceedings of the Statistical Computing Section. 9:1303–1325. Schumacher M. 52. Sauerbrei W. 41. 44. The use of resampling methods to simplify regression models in medical statistics. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. Comput Stat Data Analysis 1996. 54. Munchen: MMV. pp. Lausen B. 40. Evaluating the effect of optimized cutoff values in the assessment of prognostic factors. In: Dirschedl P. 43:429–467. Tibshirani RJ. Sauerbrei W. 47. Model selection criteria and model selection tests in regression models. Van Houwelingen HC. 1994. Valsecchi MG. Mellin I. Hilsenbeck SG. Biometrics 1992. Generalized Additive Models. 1996. ¨ Terasvirta T. 51. Stat Med 1997. Hommel G. Heidelberg: PhysicaVerlag. 1–7. 1990. Mantel N. Schumacher M. Hastie TJ. pp. ¨ Schumacher M. 48:73–85. Lausen B. Andersen PK. Why stepdown procedures in variable selection? Technometrics 1970. 15:103–112. Resampling and cross-validation techniques: a tool to reduce bias caused by model building. Lausen B. Evaluation of long-term survival: use of diagnostics . Royston P. Schumacher M. 108–113. 46. Hollander N. eds. 79:1752–1760. 42. Ostermann R. 1990. 48:313–329. Cross-validation in survival analysis. American Statistical Association. building. 49. Predictive value of statistical models. J R Stat Soc Ser A 1999. London: Chapman and Hall. 12:2305–2314. Stat Med 1990. Computational Statistics. Royston P. An improved Bonferroni inequality and applications. 1483–1496. Maximally selected rank statistics. Stat Med 1996. and the German Breast Cancer Study Group (GBSG). 56. Europaische Perspektiven der Medizinischen Informatik. Subset Selection in Regression. Altman DG. Stat Med 1991. Classiﬁcation and regression trees (CART) used for the exploration of prognostic factors measured on different scales. Br J Cancer 1999. 21:307–326. Wellek S. 45. Sauerbrei W.

Johnson J. 66. 8:771–783. Tree-structured statistical methods. Mark DB. The Nottingham Prognostic Index applied to 9. Grifﬁths K. Blamey RW. LeBlanc M.374 Schumacher et al. Ciampi A. A strategy for binary description and classiﬁcation. Balslev I. Stat Med 1989. 69:1065–1069. Computational Statistics. Regression trees for censored data. In: Everitt BS. 67. Chen CH. Br J Cancer 1982. Olshen R. Campbell FC. and measuring and reducing errors. 101–125. Multivariable prognostic models: issues in developing models. 69. Lou Z. and robust estimators with Cox’s proportional hazards model. Crowley J. 65. Zedeler K. 64. eds. 32:281–290. 61. JASA 1993. 1984. pp. In: Dodge Y. Andersen PK. Segal MR. 58. Stat Med 1996. Classiﬁcation and Knowledge Organization. Stat Med 1996. Olshen R. eds. pp. 1992. Haybittle JL. Schumacher M. Segal MR. LeBlanc M.149 patients from the studies of the Danish Breast Cancer Cooperative Group (DBCG). Chichester. Mouridsen HT. Sauerbrei W. 509–518. Biometrics 1992. Galea MH. Breast Cancer Res Treat 1992. eds. Stat Med 1985. Schmoor C. Tree-structured survival analysis in medical research. Friedman JH. Heidelberg. Wiley. A prognostic index in primary breast cancer. Cancer Treat Rep 1985. 63. Axelsson CK. 59. 22:207–219. 75. J Comput Graph Stat 1992. Sauerbrei W. 54:31–38. Survival trees by goodness of split. eds. Classiﬁcation and Regression Trees. New York: Springer. 74. 44:35–47. 68. 45:361–366. In: Armitage P. Dunn G. Hendricks L. Tree-structured survival analysis. Nicholson RI. 48:411–425. Tree-growing for the multivariate model: the RECPAM approach. Elston CE. Breast Cancer Res Treat 1999. London: Arnold. Crowley J. Sox HC. Carstensen B. Stat Med 1992. Tibshirani R. Rasmussen BB. evaluating assumptions and adequacy. The Nottingham Prognostic Index in primary breast cancer. 60. Biometrics 1988. . 70. Vol 1. Heidelberg: Physica-Verlag. 4561–4573. In: Klar R. 4:39–46. Altman DG. On the development and validation of classiﬁcation schemes in survival data. Bootstrap investigation of the stability of a Cox regression model. Doyle PJ. 1998. Schumacher M. 1998. Stone CJ. Elston CW. Methodological arguments for the necessity of randomized trials in high-dose chemotherapy for breast cancer. Harrell FE. A bootstrap resampling procedure for model building: application to the Cox regression model. 72. Zhang H. George SL. 15:361–387. 71. 1:3–20. Blamey RW. 1997. 11:2093–2109. Crowley J. 62. Whittaker J. Gordon L. Relative risk regression trees for censored survival data. Encyclopedia of Biostatistics. Opitz O. Lee KL. LeBlanc M. Breast Cancer Res Treat 1994. 15: 2763–2780. pp. Ellis IO. Wadsworth: Monterey. 88:457– 467. Olshen R. Berlin. The bootstrap and identiﬁcation of prognostic factors via Cox’s proportional hazards regression model. 73. Statistical Analysis of Medical Data: New Developments. Colton T. Breiman L.

Stat Methods Med Res 1997. 9:2–54. Vach W. Dybowski R. Pattern Recognition and Neural Networks. Stinchcombe M. On the misuses of artiﬁcial neural networks for prognostic and diagnostic classiﬁcation in oncology. Med Decis Making 1996. Comput Stat 1997. Multilayer feedforward networks are universal approximators. 92. 87. A practical application of neural network analysis for predicting outcome on individual breast cancer patients. Machle BO. Survival analysis and neural nets. Breast Cancer Res Treat 1998. Breast Cancer Res Treat 1992. Hornik K. Liestøl K. Cambridge: University Press. 94. Vach W. Neural networks in clinical medicine. Nervous about artiﬁcial neural networks? Lancet 1995. 95. 79. Ripley BD. 77. Lancet 1995. Owens MA. Jensen JL. Clark GM. Titterington DM. Cheng B. Baxt WG. Kennedy RL. 346:1203–1207. Ravdin PM. of Engineering Science. Neural networks: a review from a statistical perspective (with discussion). Harrison RF. Vach W. 89. Neural Networks 1989. 86. Copas JB. Stem HS. 1996. Stat Sci 1994. 50:284–293. Stat Med 2000. Breast Cancer Res Treat 1992. Artiﬁcial neural networks in pathology and medical laboratories. Part I. 346:1135–1138. Neural network models for breast cancer prognosis. Application of artiﬁcial neural networks to clinical medicine. 83. Schumacher M. 19:541– 561. Pandian MR. London: Chapman and Hall. Networks and Chaos—Statistical and Probabilistic Aspects. 82. 21:47–53. Survival analysis of censored data: neural network . 88. 81. Wyatt J. Clark GM. Ripley RM. 93. 90. 78. 346:1175– 1177. 91. McGuire WL. In: Barndorff Nielsen OE. Neural networks in applied statistics (with discussion). 12:279–292.D. Statistical aspects of neural networks. 80. Stat Med 1994. 38:205–220. Ph. University of Oxford. Introduction to neural networks. Collett K. Frost D. Andersen U. A demonstration that breast cancer recurrence can be predicted by neural network analysis. Neural networks and logistic regression. dissertation. Oxford. Ravdin PM. On the relation between the shrinkage effect and a shrinkage method. Lancet 1995. 6:167–183. 1993.Prognostic Factor Studies 375 76. 2:359–366. Cross SS. Vendely P. 22:285–293. Andersen PK. eds. 1998. Ravdin PM. 84. 85. 346:1075–1079. Technometrics 1996. Schwarzer G. Gant V. 96. 16:386–398. The prognostic contribution of estrogen and progesterone receptor status to a modiﬁed version of the Nottingham Prognostic Index. Ripley BD. Am Stat 1996. 21:661–682. Rossner R. 13:1189–1200. White H. Misra M. 48:1–9. Understanding neural networks as statistical tools. Dept. Hilsenbeck SG. Using regression models for prediction: shrinkage and regression to the mean. Warner B. De Laurentiis M. Skjaerven R. Penny W. Comput Stat Data Anal 1996. Schumacher M. Lancet 1995.

Burke HB. New York: Cambridge University Press. Allred. 10:73–79. The measurement of performance in probabilistic diagnosis III: methods based on continuous functions of the diagnostic probabilities. Scarpi E. 78:1–3. Simon R. Problems and prediction in survival-data analysis. Kappen HJ. The evaluation of clinical predictions. Biometrika 1990. 14:73–82. 100. Biganzoli E. 17:238–246. Methods Inform Med 1978. Lynn J. Neural networks as statistical methods in survival analysis. Br Med J 1972. Am Stat 1991. 107. 18:2529– 2545. 105. Brier GW. 110. Prediction of survival of patients terminally ill with cancer. 101. Clinical Applications of Artiﬁcial Neural Networks. Marubini E. Ravdin PM. 264:29–31. Prediction in survival analysis. Chichester: Wiley. explained risk and goodness of ﬁt. Stat Med 1999. 4(suppl):31–34. Arnoldi E. 3:143–152. Shapiro AR. A neural network model for survival data. eds. Galluci M. 113. The explained variation in proportional hazards regression. McGuire WL. model or medic. Lifetime Data: Models in Reliability and Survival Analysis. Ripley BD. Treatment decisions in axillary node-negative breast cancer patients. Clark GM. 102. Habbema JDF. Schmoor C. Graf E. 1995. Predicting life span for applicants to inpatient hospice. Indelli M. Accuracy of predictions of survival in later stages of cancer.] 97. Ting Lee ML. Veriﬁcation of forecasts expressed in terms of probability. 1997. Schumacher M. Stat Med 1995. 17:1169–1186. Ripley RM. Henderson R. Stat Med 1998. Parkes MC. 109. 108. Maltoni M. 77:216–218. Gant V. Sauerbrei W. 45:201–206. eds. Cancer 1995. Hand DJ. Marinari M. Faraggi D. . DC. Dordrecht: Kluwer Academic Publishers. Ann Oncol 1993. In: Dybowski R. Neural network analysis to predict treatment outcome. [Correction. analysis detection of complex interactions between variables. 11:173–180. Kimber AC. 106. In: Jewell NP. Neijt JP. Semin Surg Oncol 1994. Arch Intern Med 1988. 296:1509– 1514. Monogr Nat Cancer Inst 1992. Tandon AK.376 Schumacher et al. 99. 98. Piva L. Biometrika 1994. Frontini L. 32:113–118. Mariani L. Breast Cancer Res Treat 1994. Assessment and comparison of prognostic classiﬁcation schemes for survival data. 81:631. 2001 (in press). Construction and Assessment of Classiﬁcation Rules. 112. N Engl J Med 1997. Stat Med 1995. Feed forward neural networks for the analysis of censored survival data: a partial logistic regression approach. 75:2613–2622. 111. Amadori D. Schemper M. Korn EJ. Bjerregard B. Henderson R. 104. Chamness GC. 103. Jones M. Boracchi P. Pirovano M. Artiﬁcial neural networks for cancer research: outcome prediction. Forster LA. Hilden J. Monthly Weather Rev 1950. Simon R. 148:2540–2543. 114. Withmore GA. Explained residual variation.

18
Statistical Methods to Identify Prognostic Factors

Kurt Ulm, Hjalmar Nekarda, Pia Gerein, and Ursula Berger
Technical University of Munich, Munich, Germany

I. INTRODUCTION

In recent years the search for prognostic factors has stimulated increasing attention in medicine, especially in oncology and cardiology. Two reasons are mainly responsible for this trend. First, one is interested in getting more insight into the development of the disease (e.g., in the tumor biology). Second, there is a tendency away from a more or less uniform therapy toward an individual therapy. An improved estimate about prognosis can also be used to inform a patient more precisely about the further outcome of the disease. Further reasons to explore prognostic factors are discussed by Byar (1).

Risk stratification in oncology until now is based mainly on conventional factors of tumor staging (TNM classification: UICC [2]), like local tumor invasion or size, status of lymph nodes, and status of metastasis. But the outcome of these factors to answer the questions mentioned above is limited.

For a better stratification of patients for prognosis and therapy, the staging system needs to be more sophisticated and new factors have to be identified. For example, in breast cancer, presumably one of the leading fields in that research area, about 100 new factors are under discussion (3). The problems are not restricted to breast cancer. In other locations and in other medical disciplines there are the same problems (e.g., in stomach cancer [4,5] or in cardiology [6]). One of the related topics is the question about adjuvant therapy, especially who should be treated. Stomach cancer belongs to the category of tumors where the effect of an adjuvant chemotherapy has not been proven.

The search for important factors is a great challenge in medicine. The appropriate analysis, on the other hand, is by far not a routine task for statisticians. Comparable with medicine, where traditional factors were mainly used, in statistics classic methods like the logistic or the Cox regression have been used over decades. Parallel to the development of new prognostic factors in medicine, new statistical tools have been described. Recently, Harrell et al. (7) proposed a system that can be used to identify important factors. The aim of this chapter is to summarize and highlight some of these new developments. The proposals made here contain some other features and are concentrated to give an answer to the two questions mentioned at the beginning. Data on patients with stomach cancer are used to illustrate these new methods.

II. METHODS

A. Classic Method

The example used to illustrate the new developments contains censored data. For this type of data, mostly the Cox model is used in the literature. If the time of follow-up is denoted by t and the potential prognostic factors by Z = (Z_1, ..., Z_p), the model is usually given in the following form (8):

lambda(t|Z) = lambda(t|Z = 0) * exp(sum_j beta_j Z_j) = lambda_0(t) * exp(sum_j beta_j Z_j)    (1)

with lambda(t|Z) being the hazard rate at time t given the factors Z. Throughout the chapter very often the ratio of the hazard functions, or its logarithm, is considered:

ln[lambda(t|Z) / lambda_0(t)] = sum_{j=1}^{p} beta_j Z_j    (2)

which is also denoted as the relative hazard and can be interpreted as the logarithm of the relative risk (ln RR).
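To make Eq. (2) concrete, the following minimal Python sketch evaluates the linear predictor and the corresponding relative risk for one patient; the coefficients and covariate values are made up for illustration and are not estimates from the stomach cancer study.

    import numpy as np

    # Hypothetical coefficients beta_j and covariates Z for one patient (illustrative only)
    beta = np.array([0.03, 0.9, 0.5])   # e.g., age (per year), nodal group, PAI-1 group
    z = np.array([65.0, 1.0, 1.0])

    ln_rr = float(beta @ z)             # Eq. (2): ln[lambda(t|Z)/lambda_0(t)] = sum_j beta_j Z_j
    rr = np.exp(ln_rr)                  # relative risk (hazard ratio) versus the baseline Z = 0
    print(f"ln RR = {ln_rr:.2f}, RR = {rr:.2f}")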

B. Linearity Assumption

In the simplest form, each continuous factor Z is assumed, maybe after some transformation, to be linearly related to the outcome. There are several proposals in the literature on how to check this assumption of linearity. Considering one factor Z, one approach is to change the linear relationship beta * Z into a functional form beta(Z). The advantage of this approach is the representation of the effect in a functional form. We want to mention two methods to estimate beta(Z).

One approach is not to specify the function beta(Z); the only assumption is that beta(Z) has to be smooth. The method to estimate beta(Z) can be described in the context of a penalized log-likelihood function Lp(beta):

Lp(beta) = 2 * l(beta) - lambda * P(beta)    (3)

where l(beta) is the usual log-likelihood function, P(beta) is a roughness penalty penalizing deviations from smooth functions, and lambda is the weight of the penalty. The problem is the appropriate choice of lambda. Using the integrated squared second derivative as roughness penalty, P(beta) = integral of (beta''(Z))^2 dZ, the maximum of relation (3) leads to natural cubic splines (10). This approach leads to smoothing splines (9). The main problems for wider application of this approach are the use of special software like S and the lack of a simple statistic to test whether beta(Z) is equal to beta * Z or different.

A second option is the use of fractional polynomials (11). The idea is simply to construct a function beta(Z) consisting of up to some polynomials of the form Z^{p_i}, with p_i in {-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 3} and i in {1, 2}. If two polynomials are used, a variety of functional relationships can be described. Either one component or at most two components seems to be sufficient for most practical applications. The advantages of fractional polynomials are that standard software packages, like SPSS or SAS, can be applied and that common test statistics can be used for their determination.

In Figure 1 both methods, i.e., the result of using smoothing splines and fractional polynomials for one of the factors, PAI-1, of the example used in Section III, are shown. There is only a slight difference between the results of both methods. The deviance from linearity is obvious.

Figure 1  Influence of a prognostic factor (PAI-1) on the relative risk plotted on an ln-scale. The results of assuming a linear relationship (....), a smoothing spline (solid line) including the 95% CI (---), and fractional polynomials (long dashes) are shown.
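As a small sketch of how the fractional polynomial basis terms can be built (assuming the usual convention that the power 0 is read as ln Z, as in Royston and Altman [11]; the PAI-1 values below are invented), one might write:

    import numpy as np

    def fp_term(z, p):
        # One fractional polynomial basis term Z**p; p = 0 is taken to mean ln(Z),
        # so Z must be positive (shift the variable first if necessary).
        z = np.asarray(z, dtype=float)
        return np.log(z) if p == 0 else z ** p

    powers = [-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 3]   # candidate powers p_i
    pai1 = np.array([0.5, 1.2, 3.4, 4.9, 7.8])            # illustrative PAI-1 values
    basis = {p: fp_term(pai1, p) for p in powers}         # columns offered to the Cox model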

C. Proportional Hazards Assumption

One of the basic assumptions in using the Cox model is that of proportional hazards. The effect of a certain factor is assumed to be constant over the total follow-up period. On the other hand, it is often more natural to assume a change (e.g., a decrease in the effect if time is prolonged). The idea is to extend the linear assumption beta * Z into gamma(t) * Z. Now the influence of a certain factor Z on the hazard ratio can be described as a function of time. The hypothesis of interest is whether gamma(t) is constant or not (H_0: gamma(t) = gamma_0).

There is a long history of extending the classical Cox model. The first approach, by Cox himself, was based on using some predefined functions, e.g., a linear (gamma(t) = t) or a log-function (gamma(t) = ln t). Over the years several proposals have been made to extend this approach. One approach is again the use of smoothing splines (12) to analyze the time-varying effect of a certain factor Z. As an alternative, fractional polynomials can also be used by defining gamma(t) = sum_i beta_i * t^{p_i}. The advantages of fractional polynomials compared with smoothing splines are again the use of standard software packages and the direct use of a simple test statistic. One way to simplify the analysis is to restrict this form of relationship to binary variables; otherwise, some form of relationship between gamma(t) and Z has to be assumed. For the analysis, estimation methods for regression models with time-dependent covariates can be applied by performing the transformations X_i(t) = Z * t^{p_i}.
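A minimal Python sketch of this transformation (assuming, as one hypothetical choice, a 1/sqrt(t) shape and reading the power 0 as ln t; all numbers are invented) is:

    import numpy as np

    def time_dependent_terms(z, failure_times, p):
        # Transformed covariate X(t) = Z * t**p, evaluated at every observed failure time,
        # as needed when fitting gamma(t) = sum_i beta_i * t**p_i with standard
        # time-dependent-covariate software. p = 0 is read as ln t.
        t = np.asarray(failure_times, dtype=float)
        g = np.log(t) if p == 0 else t ** p
        return np.outer(np.asarray(z, dtype=float), g)   # one row per subject, one column per failure time

    z_star = np.array([0, 1, 1, 0])                      # a dichotomized factor Z*
    failure_times = np.array([3.0, 7.0, 12.0, 24.0])     # illustrative failure times
    X = time_dependent_terms(z_star, failure_times, p=-0.5)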

According to the selection of gamma(t), the values X_i(t) have to be calculated for all observed failure times.

Figure 2 shows an example with a time-varying effect. In this example the influence of age on the survival rate is considered (for details, see Sect. III). Age is divided into two groups according to the median of 65 years. In all the analyses a decrease of the effect during extended follow-up can be seen.

Figure 2  The time-varying effect of age (<65 years vs. older) using fractional polynomials (gamma(t) = 1.94/sqrt(t)) is shown.

D. Combination of beta(Z) and gamma(t) into One Model

In the context of a regression model both extensions can be combined into a model of the form

lambda(t|Z, Z*) = lambda_0(t) * exp(sum_j beta_j(Z_j) + sum_j gamma_j(t) * Z*_j)    (4)

Within this model the influence of certain factors on the event rate can be investigated to a greater extent compared with the classical Cox model.

E. Selection Procedure

We describe very briefly one option for the selection of factors in the context of fractional polynomials. The selection in connection with the use of smoothing splines is described in another article (13). In the first step, in a univariate analysis the "optimal" choices for beta(Z) and gamma(t) can be identified, based on the likelihood ratio statistics taking into account the degrees of freedom for beta(Z) or gamma(t). An alternative can be the use of the criterion proposed by Akaike, AIC = Dev + 2*nu, with Dev the deviance and nu the number of parameters used to describe beta(Z) or gamma(t).

For the division of a continuous factor Z into a binary variable Z*, two options are possible: either the use of predefined cutpoints or the selection of "optimal" cutpoints based on maximally selected test statistics. The second choice is connected with an inflated p value. However, no proof exists whether "optimal" binary coding has any influence on the selection of gamma(t).

In a stepwise forward procedure, either the factor Z_j or Z*_j is selected that provides the best fit, where Z denotes all the continuous factors and Z* all the binary covariates, either binary in a natural way, like gender, or coded. However, model (4) does not directly give an answer to the classification of a patient into a certain risk group. One approach is to divide the functional term PI = sum_j beta_j(Z_j) + sum_j gamma_j(t) * Z*_j into certain intervals. The problem is of course related to the cutpoints used. Therefore, another approach, the CART method, has attracted great attention, at least in the medical community (14).
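The Akaike criterion mentioned in the selection procedure above is simple to compute; as a rough illustration (the deviances and parameter counts below are invented, not results from the stomach cancer study):

    def akaike_criterion(deviance, n_parameters):
        # AIC = Dev + 2*nu, used here to choose among candidate beta(Z) or gamma(t)
        return deviance + 2 * n_parameters

    aic_linear = akaike_criterion(deviance=812.4, n_parameters=1)
    aic_fp2 = akaike_criterion(deviance=805.1, n_parameters=4)   # two powers plus two coefficients
    best = min(("linear", aic_linear), ("FP2", aic_fp2), key=lambda m: m[1])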

F. CART Method

The idea of this method is simply to split the whole data set into two subsamples with the greatest difference in the outcome (e.g., the survival rate). For this split all factors with all possible divisions into two groups are considered. For performing a split, a certain test statistic has to be selected; in the situation of failure time data, the log-rank test is often used. If a split is performed, both subsamples are further analyzed independently in the same way until no further split is recommended, for example, because the difference is too small or the number of patients in the subsample is too low. The CART method results in a variety of so-called terminal nodes with subgroups of patients at different risks. The clinicians can then identify the subgroups where different therapies should be applied.

There are several proposals on how to define the "optimal" tree (e.g., with the lowest misclassification rate). One way to get an optimal tree is to prune the tree (15). This means a large tree is constructed and afterward this large tree is cut back. Another problem is related to the selection of the optimal splits. Usually there is a mixture of continuous and discrete factors. It is well known that there is an inflation of the test statistics in analyzing continuous data, the so-called maximally selected test statistics. Therefore, the p value has to be adjusted; one can use a permutation test or some correction formulas (16).

III. RESULTS

A. Description of the Data

The study contains data of 295 completely resected patients with stomach cancer who underwent curative surgery between 1987 and 1996 (17). In addition to the traditional factors (TNM classification), new factors like uPA and PAI-1 were investigated (17). Table 1 gives a short description of the prognostic factors used in the analysis. Until now 108 patients had died. The follow-up period is between 3 months and 11 years (median, 41 months). Figure 3 shows the survival curve for the whole sample.

The classic Cox model gives the results as shown in Table 2. The continuous factors have not been divided into certain categories. In the multivariate analysis the percentage of positive lymph nodes (NODOS.PR) and the local tumor invasion (T.SUB) (18) turned out to be statistically significantly correlated with the survival rate.

B. Results Using Fractional Polynomials

1. Univariate Analysis

The first step was used to identify "optimal" functions for beta(Z) and gamma(t). The results for beta(Z) can be seen in Table 3. Only the continuous factors are included in this analysis. Three of five continuous factors show an association with the event rate (NODOS.PR, uPA, and PAI-1), comparable with the classic Cox model. However, the form of the relationship is better described in a nonlinear way. Two factors (AGE and NODOS.GE) show no association with the log of the hazard ratio even using a more flexible form of the relationship.

For the analysis of time-varying effects all continuous factors Z have been changed into binary variables Z*. The transformation was based either on predefined cutpoints (AGE, NODOS.GE) or on optimized cutpoints.

To make a fair comparison, model (4) has been considered.

Figure 3  Survival curve for the total sample (n = 295, 108 deaths) of patients with stomach cancer.

The results of the univariate analysis of time-varying effects (gamma(t)) can be seen in Table 4. For AGE, T.SUB, and NODOS.GE there is a significant change in the effect during follow-up; the remaining factors, in particular uPA and PAI-1, show a constant effect over time.

2. Multivariate Analysis

In the multivariate analysis, in a stepwise forward procedure all the results from the univariate analyses, either considering beta(Z) or gamma(t), have been included (Table 5). This means the functional form remained unchanged; only the parameters are newly estimated in this multivariate analysis. The selection procedure is based on the likelihood ratio statistics taking into account the degrees of freedom. The percentage of positive lymph nodes is the most important factor.

Table 1  Prognostic Factors Analyzed in the Stomach Cancer Study (factor; range; coding; interpretation):
AGE (28-90; 0: <65, 1: >=65 [median]; age at surgery)
NODOS.PR (0-97; 0: <20, 1: >=20 [cutoff]; percentage of positive lymph nodes)
T.SUB (1-7; 0: 1-4, 1: 5-7 [cutoff lamina subserosa]; local tumor invasion, Japanese staging system [17])
METAS (yes/no; 0: no, 1: yes; lymph node metastasis [no. 13 of comp. III (17)])
DIFF (1-4; 0: 1, 2; 1: 3, 4; grading)
NODOS.GE (6-105; 0: <42, 1: >=42 [median]; total number of removed lymph nodes)
uPA (0: below, 1: at or above the cutoff; urokinase-type plasminogen activator)
PAI-1 (0: below, 1: at or above the cutoff; plasminogen activator inhibitor type 1)

Table 2  Result of the Analysis of the Stomach Cancer Study Using the Classic Cox Model: univariate and multivariate hazard ratios e^beta with the corresponding p values for AGE, NODOS.PR, T.SUB (5-7 vs. 1-4), METAS (yes vs. no), DIFF (3, 4 vs. 1, 2), NODOS.GE, uPA, and PAI-1.

Table 3  "Optimal" Choices for the Fractional Polynomials beta(Z) in the Univariate Analysis (Only Continuous Factors Are Included): fitted functions beta(Z) with p values for H_0: beta(Z) = 0, for AGE, NODOS.PR, NODOS.GE, uPA, and PAI-1.

Table 4  "Optimal" Choices for the Fractional Polynomials gamma(t) Analyzing the Time-Varying Effect of All Dichotomized Factors (Univariate Analysis): fitted functions gamma(t) with p values for H_0: gamma(t) = gamma_0, for AGE, NODOS.PR, T.SUB, METAS, DIFF, NODOS.GE, uPA, and PAI-1; for uPA and PAI-1 the fitted effect is constant over time.

Table 5  Multivariate Analysis: Result of the Stepwise Selection Procedure (step; factor; selected term; likelihood ratio statistic R):
Step 1: NODOS.PR, beta_1(Z), R = 115
Step 2: T.SUB, gamma_2(t), R = 20
Step 3: AGE, gamma_3(t), R = 13
Step 4: NODOS.GE, gamma_4(t), R = 10
Step 5: PAI-1, beta_5(Z), R = 7
Total: R = 165

Figure 4  Results of the multivariate analysis: (a) time-varying effects gamma(t) for T.SUB, AGE, and NODOS.GE; (b) beta(Z) for NODOS.PR (b1) and PAI-1 (b2).

(c)

Figure 4  Continued.

The value of the likelihood function (-2 * ln L) is increased by a value of R = 115. The second factor selected is the local tumor invasion (T.SUB), showing a strong time-varying effect (R = 20). The influence of T.SUB increases within the first 3 years and then decreases. The next factor selected is age, with a dynamic effect (R = 13). Shortly after surgery the older patients (65 years and older) had a higher mortality rate. But the difference is declining, and after about 2 years of follow-up the situation changes and the younger patients seem to have the higher risk. The fourth factor selected is the total number of lymph nodes removed (R = 10). There is again a change of the effect over time. The patients with 42 or more lymph nodes removed have the higher risk at the beginning. But about 2 years after surgery, the risk in the group with fewer lymph nodes removed is increasing. Finally, the effect of PAI-1 is considered to be important (R = 7). The influence of PAI-1 is constant over time, but the value of PAI-1 seems important. The result can be seen in Figure 4. In contrast to the result of the classical Cox model, an additional effect of AGE, NODOS.GE, and PAI-1 could be identified.

C. CART Method

The most important factor was the percentage of positive lymph nodes. This factor was first divided into two categories (<20% and greater) based on clinical experience (19). The analysis of the continuous factor, percentage of positive lymph nodes, showed that the predefined cutpoint of 20% was close to the optimal cutpoint (Fig. 5). The cutpoint with the highest value of the test statistic was 12% (chi-square_LR = 124.5), followed by 21% (chi-square_LR = 121.3).

Seventy-three patients had more than 20% positive lymph nodes, of whom 54 have died so far. In the next step, this subsample with the high mortality rate was further divided by the same factor using a cutpoint of 70% into a group of 12 patients where all have died and another group where 42 of 61 have died. Among the remaining 222 patients with less than 20% positive lymph nodes, 54 have also died. In the subsample with less than 20% positive lymph nodes, the factor T.SUB shows the best discrimination. The next split is performed with uPA. Altogether six subgroups are identified with the optimal tree after pruning, given in Figure 6.

Figure 5  Log-rank test statistics to select the optimal cutoff value for the split of the continuous factor NODOS.PR into two groups.
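The cutpoint search behind Figure 5 can be sketched as follows; this is a minimal Python implementation of the two-sample log-rank statistic scanned over candidate cutpoints, written here for illustration (the adjustment of the resulting p value, e.g., by a permutation test or a correction formula [16], is not included).

    import numpy as np

    def logrank_chisq(time, event, group):
        # Two-sample log-rank chi-square statistic (1 degree of freedom); group is 0/1.
        time = np.asarray(time, float); event = np.asarray(event, int); group = np.asarray(group, int)
        o_minus_e, var = 0.0, 0.0
        for t in np.unique(time[event == 1]):
            at_risk = time >= t
            n = at_risk.sum()
            n1 = (at_risk & (group == 1)).sum()
            d = ((time == t) & (event == 1)).sum()
            d1 = ((time == t) & (event == 1) & (group == 1)).sum()
            o_minus_e += d1 - d * n1 / n
            if n > 1:
                var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        return o_minus_e ** 2 / var if var > 0 else 0.0

    def best_cutpoint(time, event, z, candidates):
        # Maximally selected log-rank statistic over candidate cutpoints of a continuous factor
        stats = {c: logrank_chisq(time, event, (np.asarray(z) >= c).astype(int)) for c in candidates}
        return max(stats.items(), key=lambda kv: kv[1]), stats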

There is a great difference in the mortality rate, ranging from 13 of 126 to 12 of 12. Four of the six subgroups have a higher mortality rate compared with the total sample. The two remaining subgroups, containing about half of all patients, have a low mortality rate. The relative risks given in Figure 6 for each of the terminal nodes represent the risk compared with the total sample; therefore, some of the values are below 1 and some are above 1.

Figure 6  Result of the CART analysis after pruning. At each split, the value of the log-rank test statistic, the total number of patients (n), and the number of deaths are given. For each terminal node additionally the relative risk (RR) compared with the total sample is calculated.

IV. DISCUSSION

Within the extended regression model, additionally the nonlinear influence of PAI-1 and the significant change in the effect of AGE, T.SUB, and NODOS.GE during follow-up could be detected.

Based on these results, a more detailed prognosis for an individual patient can be made. Fractional polynomials can be applied in connection with standard software packages like SPSS or SAS. The idea is simply to use available estimation methods for time-dependent covariates after having applied suitable transformations. Therefore, at each observed failure time the transformed value of the covariate, assuming a time-varying effect, has to be calculated. Especially the time-varying effect can be analyzed in an easy way. To simplify the analysis, the form of the relationship can be estimated in a univariate model. This form is then used in the multivariate analysis. The next step can be to extend this model for analyzing also interactions.

There is a discussion in the literature regarding adjuvant treatment in gastric cancer (20). Based on the result of CART, all patients except those in the two groups with the lowest risks (RR = 0.23 and RR = 0.47) seem to be candidates for some sort of adjuvant therapy. However, developing treatment decisions based on the result of regression analysis should be met with suspicion. It seems more appropriate to use the result of CART for the identification of subgroups of patients where certain strategies should be applied. A further impact of this analysis can be a different schedule for follow-up: patients with an increased risk at the beginning should be medically examined more frequently shortly after surgery. The result can also be used to investigate the factors in greater detail. But in any case it seems justified to use these extensions of the classic models to get more insight into the data and the disease.

For the selection of variables in the classic Cox model, a stepwise procedure, either forward or backward, is mostly used. On the other hand, the effect of the variables is sometimes overestimated. In recent years some procedures to correct these estimates have been proposed. One method is called shrinkage (21). A shrinkage factor lambda (lambda < 1), depending on the total number of regression parameters and the likelihood ratio statistic of the particular model, is calculated, and the estimated regression parameters beta have to be multiplied by lambda to give adequate values for these parameters. The shrinkage factor can also be estimated by using bootstrapping or cross-validation (7). Another approach, recently published by Tibshirani (22), is called the lasso. The idea there is to estimate the parameters beta under some constraints (sum_j |beta_j| <= c). The sum of the standardized regression parameters should be less than some predefined value c. The estimation of beta depends on the choice of c. A small value of c corresponds to a model with only a few parameters. The effect is that some of the factors Z_j, which are only borderline significant, are ignored in the model.

Breiman (23) made a proposal on how to improve the prediction using CART, called "bagging." The idea is to construct new samples by using bootstrap techniques and apply the CART method to all samples. For each sample a new tree or decision rule is obtained.
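For the shrinkage correction described above, one simple published recipe sets the factor equal to (chi-square of the model minus the number of parameters) divided by the chi-square of the model; the sketch below uses that heuristic, with invented numbers, and is not the only way the factor can be obtained (bootstrapping or cross-validation are alternatives).

    def heuristic_shrinkage(model_chisq, n_parameters):
        # Heuristic shrinkage factor lambda = (chi2_model - p) / chi2_model; the fitted
        # regression coefficients are then multiplied by lambda (21).
        return max(0.0, (model_chisq - n_parameters) / model_chisq)

    lam = heuristic_shrinkage(model_chisq=165.0, n_parameters=9)   # numbers for illustration only
    # shrunken coefficients: beta_shrunk = lam * beta_hat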

The result is a whole set of different decision rules. To classify a new patient, these decision rules have to be applied and the average has to be calculated. It can be shown that the misclassification rate can be reduced. The problem of this method, however, is that no simple decision rule is available.

In summary, the result of the extended Cox model gives more insight into the further development of the disease. The result can be used to give better information about prognosis and to define the appropriate schedule for further medical examinations. The CART method helps to identify risk groups and can be used directly for treatment decisions.

REFERENCES

1. Byar D. Identification of prognostic factors. In: Buyse M, et al., eds. Cancer Clinical Trials. Oxford Press, 1984.
2. Hermanek P, Henson DE, Hutter RUP, Sobin LH. UICC TNM Supplement: A Commentary on Uniform Use. Berlin: Springer, 1993.
3. McGuire WL, Clark GM. Prognostic factors and treatment decisions in axillary-node-negative breast cancer. N Engl J Med 1992;326:1756-1761.
4. Allgayer H, Heiss MM, Schildberg FW. Prognostic factors in gastric cancer. Br J Surg 1997;84:1651-1664.
5. Fink U, et al. Gastric cancer. Curr Probl Surg 1997;34:838-928.
6. Schmidt G, Malik M, et al. Heart rate chronotropy following ventricular premature beats predicts mortality after acute myocardial infarction. Lancet 1999;353:1390-1396.
7. Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996;15:361-387.
8. Cox DR. Regression models and life tables. J R Stat Soc B 1972;34:187-220.
9. Hastie T, Tibshirani R. Generalized Additive Models. London: Chapman and Hall, 1990.
10. Green P, Silverman B. Nonparametric Regression and Generalized Linear Models. London: Chapman and Hall, 1994.
11. Royston P, Altman DG. Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Appl Stat 1994;43:429-467.
12. Hastie T, Tibshirani R. Varying-coefficient models. J R Stat Soc B 1993;55:757-796.
13. Dannegger F, Klinger A, Ulm K. Identifying and modeling prognostic factors with censored data. Stat Med 1998.
14. Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees. 1984.
15. LeBlanc M, Crowley J. Survival trees by goodness of split. J Am Stat Assoc 1993;88:457-467.

16. Hilsenbeck SG, Clark GM. Practical p-value adjustment for optimally selected cutpoints. Stat Med 1996;15:103-112.
17. Nekarda H, Schmitt M, Ulm K, Wenninger A, Vogelsang H, Siewert JR, et al. Prognostic impact of urokinase-type plasminogen activator and its inhibitor PAI-1 in completely resected gastric cancer. Cancer Res 1994;54:2900-2907.
18. Roder JD, Bottcher K, Busch R, Wittekind C, Hermanek P, et al.; German Gastric Cancer Study Group. Classification of regional lymph node metastasis from gastric carcinoma. Cancer 1998;82:621-631.
19. Japanese Research Committee for Gastric Cancer. The general rules for gastric cancer study in surgery and pathology. Jpn J Surg 1981;11:127-138.
20. Bleiberg H, Di Leo A, Rougier P, Cunningham D, Sahmoud T. Adequate number of patients are needed to evaluate adjuvant treatment in gastric cancer. J Clin Oncol 1998;16:3714.
21. Van Houwelingen JC, Le Cessie S. Predictive value of statistical models. Stat Med 1990;9:1303-1325.
22. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B 1996;58:267-288.
23. Breiman L. Bagging predictors. Machine Learning 1996;26:123-140.


19
Explained Variation in Proportional Hazards Regression

John O'Quigley
University of California at San Diego, La Jolla, California

Ronghui Xu
Harvard School of Public Health and Dana-Farber Cancer Institute, Boston, Massachusetts

I. EXPLAINED VARIATION IN SURVIVAL

A. Motivation

For many survival studies based on the use of a regression model, in addition to the usual model fitting and diagnostic tools (the evaluation of relative and combined predictive effects), it is also desirable to present summary measures estimating the percentage of explained variation. Making precise the notion of explained variation in the particular context of proportional hazards regression requires some thought. But before considering more closely the specifics of the model, roughly speaking we know that any suitable measure would reflect the relative importance of the covariates. This relative importance applies to the data set in hand, but additionally any measure should be estimating some meaningful population counterpart, a population value that can be given a concrete and intuitively useful interpretation.

To give the ideas a more tangible framework, consider a study of 2174 breast cancer patients, followed over a period of 15 years at the Institut Curie in Paris, France. A large number of potential and known prognostic factors were recorded.

Detailed analyses of these data have been the subject of a number of communications. Let us suppose that we focus here on a subset of prognostic factors: age at diagnosis, progesterone receptor status, stage, histology grade, and tumor size. We would like to be able to say, for example, in terms of these percentages, that stage explains some 20% of survival but that once we have taken account of progesterone status, age, and grade, this figure drops to 5%. Or that by adding tumor size to a model in which the main prognostic factors are already included, the explained variation increases, specifically from 32% to 33%, a negligible amount. Or given that some variable can explain so much variation, then to what extent do we lose (or gain) by recoding a continuous prognostic variable, age at diagnosis for example, into discrete classes on the basis of cutpoints? Note that for this latter problem the models are nonnested and so the problem would be inherently more involved.

B. Explained Variation in Regression Models

Consider the pair of random variables (T, Z). Denote the marginal distribution functions by F(t) and G(z) and the conditional distribution functions by F(t|z) and G(z|t). A question of interest might relate to the reduction in variance of the random variable T by conditioning upon Z. The conditional variance of T given Z translates predictability, for a normal model directly in terms of prediction intervals and for other models if only by virtue of the Chebyshev inequality. Quite generally, independently of any model, we have that

Var(T) = E{Var(T|Z)} + Var{E(T|Z)}    (1)

The above identity enables us to write down an expression for the proportion of explained variation Omega^2 as

Omega^2(T|Z) = [Var(T) - E{Var(T|Z)}] / Var(T) = Var{E(T|Z)} / Var(T)    (2)

the notation Omega^2(T|Z) reminding us which way round we are conditioning. We may also be interested in Omega^2(Z|T), the two quantities coinciding for bivariate normal models. The above expression does not lean on any model. When a knowledge of Z makes T deterministic, then Var(T|Z) = 0 and Omega^2 = 1. When there is no reduction in variance by conditioning upon Z, then Var(T) = E{Var(T|Z)} and Omega^2 = 0. Intermediate values of Omega^2 have a precise interpretation in terms of percentages of explained variation as a consequence of Eq. (2).
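To see the decomposition of Eq. (1) and the definition of Eq. (2) numerically, a minimal Python sketch on simulated, uncensored data (an invented joint distribution, used only to illustrate the identity) is:

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.integers(0, 3, size=20000)                  # three covariate levels
    t = rng.exponential(scale=1.0 / (0.5 + 0.4 * z))    # survival times whose mean depends on z

    groups = [t[z == k] for k in np.unique(z)]
    var_t = t.var()
    e_var = sum(len(g) * g.var() for g in groups) / len(t)                 # E{Var(T|Z)}
    var_e = sum(len(g) * (g.mean() - t.mean()) ** 2 for g in groups) / len(t)  # Var{E(T|Z)}

    # Eq. (1): var_t equals e_var + var_e (up to floating point);
    # Eq. (2): the proportion of explained variation
    omega2 = 1.0 - e_var / var_t
    print(omega2, var_e / var_t)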

Apart from the marginal variance, Var(T), the relevant quantity we need to define Omega^2 can be expressed as

E{Var(T|Z)} = Int Int {t - Int t dF(t|z)}^2 dF(t|z) dG(z)    (3)

This elementary definition is helpful in highlighting two important and related points: first, the values t and z only enter into the equation as dummy variables and, second, consistent estimates for Omega^2 will follow if we can consistently estimate F(t|z) and G(z). Given the pairs of i.i.d. observations {(t_i, z_i), i = 1, ..., n}, to obtain R^2 as a consistent estimator of Omega^2 it is then only necessary to replace F(t), G(z), and F(t|z) in Eq. (2) by the empirical distribution functions F_n(t), G_n(z), and F_n(t|z). It will also be helpful to consider an equivalent expression for E{Var(T|Z)} arising from a simple application of Bayes theorem, when we may be able to estimate more readily the conditional distribution of Z given T rather than the other way around. Instead of Eq. (3) write

E{Var(T|Z)} = Int Int {t - [Int t g(z|u) dF(u)] / [Int g(z|u) dF(u)]}^2 dG(z|t) dF(t)    (4)

The above expression can be advantageous in certain estimation contexts.

In the main we are interested in regression models, the dependence being expressed via the conditional distribution of one of the variables given the other, and this dependence quantified by some parameter beta. We parameterize our regression model such that the special value beta = 0 indicates an absence of association between the variables; the special value beta = 0 thus corresponds to absence of association between T and Z, and the larger the value of beta in absolute value, the greater the degree of dependence for any given covariate distribution. The value of beta itself quantifies the strength of regression effect and thereby directly relates to Omega^2(beta), the reason for including it as an argument. Whereas the actual value of beta itself will depend on the scaling of the covariate, Omega^2(beta) will be invariant to location and scale changes in the covariate and will, in some sense, represent a standardized measure of strength of effect lying between 0 and 1. Typically we would not be interested in values of Omega^2(beta) elsewhere than at the true population value of beta, but the concept turns out to be useful. To avoid confusion when referring to beta as an argument of a function as opposed to some assumed population value, we may denote the fixed population value as beta_0.

Under the model we have

E{Var(T|Z, beta = 0)} = E{Var(T)} = Var(T)    (5)

leading to an expression for Omega^2(beta_0) in which the role of beta is readily understood:

Omega^2(T|Z, beta_0) = [E{Var(T|Z, beta = 0)} - E{Var(T|Z, beta_0)}] / E{Var(T|Z, beta = 0)} = Var{E(T|Z)} / Var(T)    (6)

One of the variables, most often Z, may have been assigned certain values by design. It may seem the more natural to work with Omega^2(T|Z), and we model the conditional distribution of T given Z, expressed via a model including an unknown baseline hazard function. This is the case with the proportional hazards model, where Z represents the covariate and T the elapsed time. However, if we wish to accommodate time-dependent covariates, this may not be the way to proceed, the definition Omega^2(Z|T) having some advantage in this context. More importantly, if we wish to accommodate time-dependent covariates, then Omega^2(T|Z, beta) is no longer even well defined. The reason is outlined in the following section.

C. Schoenfeld Residuals and Explained Variation in Proportional Hazards Models

Inference in the proportional hazards model remains invariant following monotonic increasing transformations on the time scale, a fundamental feature of the Cox model. Only the observed ranks of the failures matter, the actual values of the failure times themselves having no impact on parameter estimates and their associated variance. This is a fundamental property. It could be argued that such a property ought to be maintained for an appropriate Omega^2(beta) measure and its sample-based estimate R^2(beta-hat). It can then be argued that a suitable measure of explained variation for the Cox model would relate to the predictability of the failure ranks rather than the actual times: absence of effect should translate as 0%, perfect prediction of the survival ranks should translate as 100%, and intermediate values should be interpretable. Xu (1996) shows that a reduction in the conditional variance of Z given T translates as greater predictability of the failure rankings given Z. For the definition Omega^2(Z|T, beta) it can be seen that failure rank invariance is respected, whereas for the definition Omega^2(T|Z, beta) such invariance fails. The measure introduced by O'Quigley and Flandre (1994) comes under this heading and corresponds to Omega^2(Z|T, beta). These considerations lead to an Omega^2 of the form Omega^2(Z|T, beta) where

Omega^2(Z|T, beta) = [E{Var(Z(t)|T = t, beta = 0)} - E{Var(Z(t)|T = t, beta)}] / E{Var(Z(t)|T = t, beta = 0)}    (7)

When talking about proportional hazards regression, this form is assumed unless indicated otherwise. We therefore suppress the notation Z|T in the definition of Omega^2, although the dependence on beta may be indicated. We can write the above as

Omega^2(beta) = 1 - [Int E_beta{[Z(t) - E_beta(Z(t)|t)]^2 | t} dF(t)] / [Int E_beta{[Z(t) - E_0(Z(t)|t)]^2 | t} dF(t)]    (8)

where E_beta denotes expectation assuming the model to be true at the value beta and E_0 is the expected value under the null model. We return to this below but note that if we can consistently estimate all the quantities in Eq. (8), then our problem is solved. As it turns out (O'Quigley and Flandre 1994), the usual Schoenfeld residuals play a key role here, thereby providing a particularly simple expression. These residuals are typically an ingredient of any standard analysis, and it can be shown (Xu 1996) that

R^2(beta-hat) = 1 - [sum over delta_i = 1 of r_i^2(beta-hat) w_i] / [sum over delta_i = 1 of r_i^2(0) w_i]    (9)

where the r_i(.) are the Schoenfeld (1982) residuals, evaluated at beta-hat and at 0, is consistent for Omega^2(beta). In this expression the weights w_i are the decrements in the marginal Kaplan-Meier estimate; censoring is dealt with by correctly weighting the squared residuals. In many practical cases, ignoring the w_i by equating them all to one may have little impact. In the absence of censoring, the quantity sum_{i=1}^{n} r_i^2(beta)/n can be viewed as the average discrepancy between the observed covariate and its expected value under the model, whereas sum_{i=1}^{n} r_i^2(0)/n can be viewed as the average discrepancy without a model. For ordinary linear regression, one minus the usual R^2 is the ratio of the average of the squared residuals and the average squared deviations about the overall mean. The estimate R^2 here is of the same form, in which the squared residuals of linear regression are replaced by the squared Schoenfeld residuals and where the squared deviations about the overall mean are replaced by the squared deviations of Z about the overall mean values of Z sequentially conditional on the risk sets.
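Given Schoenfeld residuals from any standard Cox fit, together with the residuals at beta = 0 (the observed covariate minus the simple risk-set average) and the Kaplan-Meier jumps at the event times, Eq. (9) reduces to a few lines; a minimal sketch:

    import numpy as np

    def r2_schoenfeld(resid_beta, resid_null, km_jumps):
        # R^2 of Eq. (9): one minus the ratio of weighted sums of squared Schoenfeld
        # residuals under the fitted model and under beta = 0. All arrays refer to the
        # observed (uncensored) failure times; set km_jumps to all ones to ignore the weights.
        rb = np.asarray(resid_beta, float)
        r0 = np.asarray(resid_null, float)
        w = np.asarray(km_jumps, float)
        return 1.0 - np.sum(w * rb ** 2) / np.sum(w * r0 ** 2)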

II. ESTIMATION UNDER PROPORTIONAL HAZARDS

A. Model and Notation

Let T_1, T_2, ..., T_n be the failure times and C_1, C_2, ..., C_n the censoring times for the individuals i = 1, 2, ..., n. For each i we observe X_i = min(T_i, C_i) and delta_i = I(T_i <= C_i), where I(.) is the indicator function. Define the "at risk" indicator Y_i(t) = I(X_i >= t). We also use the counting process notation: let N_i(t) = I{T_i <= t, T_i <= C_i} and N(t) = sum_{i=1}^{n} N_i(t). All the results given here hold for an independent censorship model. The left continuous version of the Kaplan-Meier estimate of survival is denoted S-hat(t), and the Kaplan-Meier estimate of the distribution function by F-hat(t) = 1 - S-hat(t). Usually we are interested in the situation where each subject has related covariates, or explanatory variables, Z_i (i = 1, ..., n). The Cox (1972) proportional hazards model assumes that the hazard function lambda_i(t) (i = 1, 2, ..., n) for individuals with different covariates can be written

lambda_i(t) = lambda_0(t) exp{beta Z_i(t)}    (10)

where lambda_0(t) is a fixed unknown "baseline" hazard function and beta is a relative risk parameter to be estimated. Z in general could be time dependent, in which case it is assumed to be a predictable stochastic process and we will use the notation Z(t). Mostly, for ease of exposition, we assume the covariate Z to be one dimensional. Let Z(t) also denote the step function of t with discontinuities at the points X_i, at which the function takes the value Z_i(X_i).

B. Basis for Inference

First some basic definitions. Let

pi_i(beta, t) = K(t) Y_i(t) exp{beta Z_i(t)}    (11)

where K^{-1}(t) = sum_{i=1}^{n} Y_i(t) exp{beta Z_i(t)}. Under the model, pi_i(beta, t) is exactly the conditional probability that at time t, given all the individuals at risk and given that one failure occurs, it is precisely individual i who is selected to fail. Also, for fixed t, define the expectation of Z(t) under the distribution {pi_i(beta, t)}_{i=1}^{n} by

eps_beta(Z|t) = sum_{i=1}^{n} Z_i(t) pi_i(beta, t) = K(t) sum_{i=1}^{n} Y_i(t) Z_i(t) exp{beta Z_i(t)}    (12)

Statistical inference on beta is usually carried out by maximizing Cox's (1975) partial likelihood, which is equivalent to obtaining the value of beta satisfying

U_1(beta) = Int {Z(t) - eps_beta(Z|t)} dN(t) = 0    (13)

An alternative to the partial likelihood estimator, useful when beta may not be constant with time, is given by the solution to (Xu and O'Quigley 1998)

U_2(beta) = Int W(t){Z(t) - eps_beta(Z|t)} dN(t) = 0    (14)

where W(t) = S-hat(t){sum_{i=1}^{n} Y_i(t)}^{-1} and beta-hat denotes any consistent estimate of beta.
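The quantities in Eqs. (11) and (12) are easily computed at any fixed time point; a minimal Python sketch for a one-dimensional, time-fixed covariate (variable names are ours, not from the original text) is:

    import numpy as np

    def risk_set_expectation(beta, t, obs_times, covariates):
        # pi_i(beta, t) of Eq. (11) and eps_beta(Z|t) of Eq. (12)
        x = np.asarray(obs_times, float)
        z = np.asarray(covariates, float)
        at_risk = (x >= t).astype(float)          # Y_i(t)
        w = at_risk * np.exp(beta * z)
        pi = w / w.sum()                          # conditional probability that subject i fails at t
        return pi, float(np.sum(z * pi))          # eps_beta(Z|t)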

For practical calculation note that W(X_i) = F-hat(X_i) - F-hat(X_i-) = w_i at each observed failure time X_i, the jump of the Kaplan-Meier curve. This can be anticipated from the definition of W(X_i), whereby we can write

U_2(beta) = Int {Z(t) - eps_beta(Z|t)} dF-hat(t) = 0    (15)

This is also of theoretical interest since, under departures from proportional hazards, an estimate based on U_2(beta) has a solid interpretation as an average effect, whereas the estimate based on U_1(beta) cannot be interpreted in the presence of censoring (Xu 1996, Xu and O'Quigley 1998). Our purpose here is consistent estimation of Omega^2 and not robust estimation of beta, but the two estimates of Omega^2 of the following section are related in a way not dissimilar to the relationship between the above two estimators.

C. Estimating Omega^2

Our basic task is accomplished in this section via a main theorem and a series of corollaries. The proofs are not given here; they can be found in Xu (1996), where proofs of the statements of the following section can also be found.

Theorem 1  Under model (10), an independent censoring mechanism, and beta-hat any consistent estimate of beta, the conditional distribution function of Z(t) given T = t is consistently estimated by

F-hat_t(z|t) = P-hat(Z(t) <= z | T = t) = sum over {i: Z_i(t) <= z} of pi_i{beta-hat, t}    (16)

Corollary 1  Defining

eps_beta(Z^k|t) = sum_{i=1}^{n} Z_i^k(t) pi_i{beta, t} = K(t) sum_{i=1}^{n} Y_i(t) Z_i^k(t) exp{beta Z_i(t)},    k = 1, 2,

then eps_{beta-hat}(Z^k|t) provides consistent estimates of E_beta(Z^k(t)|T = t). In addition we have the following two results. Let

J(beta, b) = Int from 0 to infinity of W(t) eps_beta{[Z(t) - eps_b(Z(t)|t)]^2 | t} dN(t)    (17)

then

Corollary 2  J(beta-hat, beta-hat) converges in probability to Int E_beta{[Z(t) - E_beta(Z(t)|t)]^2 | t} dF(t)    (18)

Corollary 3  J(beta-hat, 0) converges in probability to Int E_beta{[Z(t) - E_0(Z(t)|t)]^2 | t} dF(t)    (19)

Theorem 2  Define

R_eps^2(beta) = 1 - J(beta, beta) / J(beta, 0)    (20)

Then R_eps^2(beta-hat) converges in probability to Omega^2(beta) in Eq. (8).

Theorem 3  Let I(b), for b = 0, beta-hat, be defined by

I(b) = sum_{i=1}^{n} Int from 0 to infinity of W(t){Z_i(t) - eps_b(Z|t)}^2 dN_i(t)    (21)

Then

R^2(beta-hat) = 1 - I(beta-hat) / I(0)    (22)

is a consistent estimate of Omega^2(beta_0) in Eq. (8). Notice that the above defined R^2(beta-hat) is the same as Eq. (9). Although R_eps^2(beta-hat) is of interest in its own right, our main purpose for studying it has been to develop certain statistical properties and to provide a simple way to construct confidence intervals for the population quantity Omega^2(beta). Finally, we can show that R^2(beta-hat) and R_eps^2(beta-hat) are asymptotically normal. Indeed, we can show that under the model |R^2(beta-hat) - R_eps^2(beta-hat)| converges to zero in probability. Our experience has been that when the proportional hazards model correctly generates the data, R^2 will be very close in value to R_eps^2. When discrepancies arise, this would seem to be indicative of a failure in model assumptions.

The coefficients R^2 and R_eps^2 and the population counterpart Omega^2 have a number of useful properties. Although R_eps^2 and Omega^2 are nonnegative, we cannot guarantee the same for R^2. Our experience is that R^2(beta-hat) will only be slightly negative in finite samples if beta-hat is very close to zero. This would nonetheless be unusual, corresponding to the case in which the best-fitting model, in a least-squares sense, provides a poorer fit than the null model. Both R^2 and R_eps^2 are invariant under linear transformations of Z and monotonically increasing transformations of T. We have R^2(0) = 0 and R^2(beta) <= 1. Viewed as a function of beta, R_eps^2(0) = 0, R_eps^2(beta) <= 1, R_eps^2(beta) increases monotonically with |beta|, and R_eps^2(beta) tends to 1 as |beta| tends to infinity; R^2(beta) reaches its maximum close to beta-hat, as illustrated in the example. This last property (as well as all the stated properties of R_eps^2) also applies to Omega^2(beta) and enables us to construct confidence intervals for Omega^2 using that of beta.

III. SUMS OF SQUARES INTERPRETATION

We have the following sums of squares decomposition for R_eps^2(beta):

eps_beta{[Z - eps_0(Z|X_i)]^2 | X_i} = eps_beta{[Z - eps_beta(Z|X_i)]^2 | X_i} + {eps_beta(Z|X_i) - eps_0(Z|X_i)}^2    (23)

on the basis of which we can rearrange Eq. (20) so that

R_eps^2(beta) = [sum_{i=1}^{n} delta_i W(X_i){eps_beta(Z|X_i) - eps_0(Z|X_i)}^2] / [sum_{i=1}^{n} delta_i W(X_i) eps_beta{[Z - eps_0(Z|X_i)]^2 | X_i}]    (24)

Furthermore, we can take sum_{i=1}^{n} delta_i W(X_i) r_i^2(beta) to be a residual sum of squares analogous to those from linear regression, whereas sum_{i=1}^{n} delta_i W(X_i) r_i^2(0) corresponds to the total sum of squares. Then

sum_i delta_i W(X_i){Z_i(X_i) - eps_0(Z|X_i)}^2 = sum_i delta_i W(X_i){Z_i(X_i) - eps_beta(Z|X_i)}^2 + sum_i delta_i W(X_i){eps_beta(Z|X_i) - eps_0(Z|X_i)}^2 + 2 sum_i delta_i W(X_i){eps_beta(Z|X_i) - eps_0(Z|X_i)}{Z_i(X_i) - eps_beta(Z|X_i)}

Now the last term in the above is a weighted score that, according to Proposition 1 of Xu (1996), is asymptotically zero with beta = beta-hat. So R^2 is asymptotically equivalent to the ratio of the regression sum of squares to the total sum of squares. So defining

SS_tot = sum_{i=1}^{n} delta_i W(X_i) r_i^2(0)
SS_res = sum_{i=1}^{n} delta_i W(X_i) r_i^2(beta-hat)
SS_reg = sum_{i=1}^{n} delta_i W(X_i){eps_{beta-hat}(Z|X_i) - eps_0(Z|X_i)}^2

we obtain an asymptotic decomposition of the total sum of squares into the residual sum of squares and the regression sum of squares, i.e.,

SS_tot = SS_res + SS_reg    (25)

holds asymptotically.

IV. MULTIVARIATE EXTENSION

Most often we are interested in explanatory variables Z of dimension greater than 1. Common classes of regression models consider the impact of a linear combination eta = beta'Z on T, where beta is a vector of the same dimension as Z and a'b denotes the usual inner product of a with b. For the multivariate normal model, F(t|Z = z) and F(t|eta = beta'z) are the same, and so it is only necessary to consider eta and not the actual values of z themselves. For other models we may not have such a result, but in as much as we consider the effect as essentially being summarized via beta'z, then we should consider Omega^2(eta|T, beta) rather than Omega^2(T|eta, beta); the latter quantity would not be invariant to monotonic transformations upon T, and it makes sense to consider the multiple coefficient of explained variation, known as the coefficient of determination in linear regression, assuming we deem this a requirement. The exact linear combination we would use for eta is of course unknown, and in practice we replace the vector beta by the vector beta-hat. Everything now follows through exactly as for the univariate case, in which we work with the fitted Schoenfeld residuals and the null residuals. The multiple coefficient is then

R^2(beta-hat) = 1 - [sum over delta_i = 1 of [beta-hat' r_i(beta-hat)]^2 w_i] / [sum over delta_i = 1 of [beta-hat' r_i(0)]^2 w_i]    (26)

V. OTHER SUGGESTED MEASURES

There have been other suggestions for suitable measures of explained variation under the proportional hazards model. The earliest suggestions date back to Harrell (1986). His measure depends heavily on censoring. Schemper (1990, 1994) introduced the concept of individual survival curves for each subject and evaluated the explained variation via an appropriate distance measuring the ratio of average dispersions with the model and without the model. Kent and O'Quigley (1988) developed a measure based on the Kullback-Leibler information gain, and this could be interpreted as the proportion of randomness explained in the observed survival times by the covariates. The principal difficulty with Kent and O'Quigley's measure was its complexity of calculation, although a very simple approximation was suggested and appeared to work well; the measure was also not able to accommodate time-dependent covariates. Korn and Simon (1990) suggested a class of potential functionals of interest, such as the conditional median. They have some advantage in generality in being applicable to a much wider class of models than the proportional hazards one. Their measures are not invariant to time transformation, nor could they accommodate time-dependent covariates.

Interpretation is difficult. As with the Harrell measure, the Schemper measures depend on censoring, even when the censoring mechanism is independent of the failure mechanism, the population model not being referred to in the work of Schemper and Kaider (1997). Schemper and Kaider (1997) suggested multiple imputation as a way to deal with censoring; the unavailable empirical estimators are replaced by estimators deriving from an iterative algorithm. Intuitively it appears that such an approach would come under the heading of providing estimators for the relevant population quantities of Section I. Currently this appears somewhat ad hoc, although it seems quite likely that the approach may be consistent, and further work is needed for this to be demonstrated and under what conditions. This is a promising idea that requires further study to anticipate the statistical properties of the approach. An alternative to the information gain measure of Kent and O'Quigley (1988), similar in spirit but leaning on Theorem 1 and the conditional distribution of Z given T rather than the other way around, leads to a coefficient with good properties (Xu and O'Quigley 1999). Other measures have also been proposed in the literature (see, e.g., Schemper and Stare 1996), but it is not our intention here to give a complete review of them.

VI. ILLUSTRATION

We illustrate the basic ideas on the well-known Freireich (1963) data, which record the remission times of 42 patients with acute leukemia treated by 6-mercaptopurine (6-MP) or placebo. The data set was used in the original Cox (1972) paper and has been studied by many other authors under the proportional hazards model. Our estimate of the regression coefficient is beta-hat = 1.53, with R^2(beta-hat) = 0.386 and R_eps^2(beta-hat) = 0.371. The 95% confidence interval for Omega^2(beta), obtained using the monotonicity of Omega^2(beta), i.e., plugging the two end points of the interval for beta into R^2(.), is (0.106, 0.614) using R^2 and (0.103, 0.628) using R_eps^2. Our 1000 bootstrap samples, using Efron's bias-corrected and accelerated (BCa) bootstrap method, give a confidence interval of (0.111, 0.631) using R^2. We see that these have very good agreement with the ones obtained through monotonicity; there are in fact theoretical reasons for this agreement. For practical use in calculating a confidence interval for Omega^2(beta), we recommend the "plug-in" method, which is the most computationally efficient.

The above R^2(beta-hat) can be compared with some of the suggestions of the previous section. For the same data the measure proposed by Kent and O'Quigley (1988) resulted in the value 0.37. In practical examples this coefficient based on information gain appears to give close agreement with the R^2 measure discussed here; assuming the data do not strongly contradict the proportional hazards assumption, we anticipate the two coefficients to be close to one another. The explained variation proposals of Schemper

(1990), based on empirical survival functions per subject, resulted in (his notation) V1 = 0.20 and V2 = 0.19, and Schemper's later correction (1994) resulted in V2 = 0.29. There is no obvious link between the Schemper measures and those presented here, since the Schemper measures depend on an independent censoring mechanism, unlike the measure suggested here and the partial likelihood estimator itself. The measure of Korn and Simon (1990), based on quadratic loss, gave the value 0.32. For these data the value 0.32 drops to 0.29 if the failure times are replaced by the square roots of the times; this is because their measure does not remain invariant to monotone increasing transformation of time. The Korn and Simon measure is most useful when the time variable provides more information than just an ordering, whereas rank ordering is the only assumption we need for the measure presented in this chapter. Finally, the measure of Schemper and Kaider (1997) is calculated to be 0.34, and the measure of Xu and O'Quigley (1999) turns out to be 0.40. Although there is some comfort to be gained by a value not dissimilar from that obtained above, again there appears to be no good grounds for investigating a potential association between the measures.

REFERENCES

Cox DR. Regression models and life tables (with discussion). J R Stat Soc B 1972;34:187-220.
Cox DR. Partial likelihood. Biometrika 1975;62:269-276.
Freireich EO, et al. The effect of 6-mercaptopurine on the duration of steroid induced remission in acute leukemia. Blood 1963;21:699-716.
Harrell FE. The PHGLM procedure. SAS Supplement Library User's Guide, Version 5. Cary, NC: SAS Institute Inc., 1986.
Kent JT, O'Quigley J. Measure of dependence for censored survival data. Biometrika 1988;75:525-534.
Korn EL, Simon R. Measures of explained variation for survival data. Stat Med 1990;9:487-503.
O'Quigley J, Flandre P. Predictive capability of proportional hazards regression. Proc Natl Acad Sci USA 1994;91:2310-2314.
Schemper M. The explained variation in proportional hazards regression. Biometrika 1990;77:216-218.
Schemper M. Correction: the explained variation in proportional hazards regression. Biometrika 1994;81:631.
Schemper M, Kaider A. A new approach to estimate correlation coefficients in the presence of censoring and proportional hazards. Comput Stat Data Anal 1997;23:467-476.
Schemper M, Stare J. Explained variation in survival analysis. Stat Med 1996;15:1999-2012.

Schoenfeld DA. Partial residuals for the proportional hazards regression model. Biometrika 1982;69:239-241.
Xu R. Inference for the Proportional Hazards Model. Ph.D. thesis, University of California, San Diego, 1996.
Xu R, O'Quigley J. Estimating average log relative risk under nonproportional hazards. ASA 1998 Proceedings of the Biometrics Section, 216-221.
Xu R, O'Quigley J. A R^2 type measure of dependence for proportional hazards models. J Nonparam Stat 1999;12:83-107.


20
Graphical Methods for Evaluating Covariate Effects in the Cox Model

Peter F. Thall and Elihu H. Estey
University of Texas M.D. Anderson Cancer Center, Houston, Texas

I. INTRODUCTION

In medicine, patient characteristics often have profound effects on prognosis. For example, in oncology a patient's age, extent of disease, or the presence of a particular cytogenetic or molecular abnormality typically have substantial effects on his or her survival. When comparing the effects of two or more treatments on patient outcome, a fundamental scientific problem is that apparent treatment differences may result not from the inherent superiority of one particular treatment over another but rather from differences between the patients in the treatment groups. This observation has led to use of the randomized clinical trial to ensure that groups given different treatments are on average similar with regard to characteristics that may be related to response ("covariates"). Although randomization is an essential tool in comparative treatment evaluation, it cannot guarantee that treatment groups are perfectly balanced with regard to all variables that may be related to outcome. This is especially true in small (e.g., 200 patient) randomized trials. In the more common and problematic setting where treatment comparisons are based on data from separate trials, as when evaluating data from two or more single-arm phase II trials of different treatments, the potential for the effects of unbalanced covariates to confound actual treatment differences is much greater.

Therefore, the use of statistical methods to account for variables that may influence patient outcome is critically important in evaluating both randomized and nonrandomized clinical trials. Accounting for individual patient characteristics when evaluating treatment effects entails some form of statistical regression analysis. The Cox regression model (1) is the most widely used tool for evaluating the relationship between covariates and time-to-event treatment outcomes, such as survival time or disease-free survival (DFS) time. Although unobserved "latent" effects are also an important consideration when combining data from multiple clinical centers or separate trials, we do not address this issue here; treatments of this problem are given by Li and Begg (2) and Stangl (3).

Unfortunately, the assumptions underlying the Cox model are often violated in practice, and many published results in the medical literature are based on fitted models for which no goodness-of-fit analysis has been performed. If such model criticism is not done and if the qualitative relationship between a covariate and patient outcome is different from that assumed by a particular model, then the statistical estimate of the covariate's effect under the fitted model may greatly misrepresent its actual effect. When this is the case, apparent covariate effects and treatment effects obtained from a fitted Cox model may be substantively misleading.

The purpose of this chapter is to illustrate some statistical methods for assessing goodness-of-fit under the Cox model and also for correcting poor model fit. Our goal is to discuss and illustrate by example some useful graphical displays and statistical tests in terms that can be understood by physicians or other nonmathematical readers. The graphical methods illustrate qualitative relationships between covariates and outcome that are not otherwise apparent, and they also lead to use of the extended Cox model, which allows the possibility of covariate effects that vary with time, to obtain an improved model fit. Because these methods provide more accurate and reliable evaluation of covariate and treatment effects on patient outcome, their application often leads in turn to profound changes in the substantive inferences formed from a particular data set.

We do not attempt to discuss all existing methods for assessing goodness-of-fit of the Cox model, since the current literature is quite extensive. Formal descriptions of the methods are given by Therneau et al. (4), Grambsch and Therneau (5), Grambsch (6), and in Chapter 4 of the important book by Fleming and Harrington (7). Some earlier references are Crowley and Hu (8), Crowley and Storer (9), Kay (10), Schoenfeld (11), Cain and Lange (12), and Harrell (13). An excellent, albeit somewhat more mathematical, explanation of the type of methods discussed here is given in Chapter 4.6 of Fleming and Harrington. These techniques are straightforward to implement using freely available computer programs in either Splus or SAS (13,14).

Each covariate may or may not be of value in predicting T. as is the case with patients who have not suffered the event in question when the trial is analyzed. especially in the analysis of medical data.’’ and ‘‘ATRA’’ data sets. for some patients the value of T may be right censored in that T is not observed but rather is known only to be no smaller than a censoring time. with a 26-week median DFS time. Most of our examples deal with the relationship between a single covariate and survival or DFS time. where 139 of 215 patients died with a median survival time of 28 weeks. Z ) . caspase 2 (C 2 ) and caspase 3 (C 3 ). Z ) of the event occurring at time t from baseline for a patient with covariates Z takes the form λ(t. Our goal is to bring these methods into more widespread use. . on survival were evaluated (17). where 415 of 530 patients had events (died or relapsed). In addition. COX REGRESSION MODEL Consider the common problem of assessing the relationships between each of a collection of covariates Z (Z 1 . We illustrate the methods using three data sets arising from clinical trials in acute myelogenous leukemia (AML) and myelodysplastic syndromes (MDS) conducted at M. where 116 of the 185 good-prognosis patients had events with a median DFS time of 82 weeks.’’ ‘‘caspase. we also discuss how properly modeling covariate effects may affect treatment effect estimates in a multivariate model (Sect. The covariates typically include one or more indicator variables denoting treatments given to patients. By far the most commonly used methods are based on the Cox regression model (1). Z k ) and the elapsed time T from a ‘‘baseline’’ usually deﬁned as the time of diagnosis or initiation of treatment to a particular event such as relapse or death.19). II. XII). . We refer to these as the ‘‘ﬂudarabine. VII) and also the use of conditional survival plots to assess interactions between two covariates (Sect. Anderson Cancer Center: a data set arising from several phase II trials of combination chemotherapies each involving ﬂudarabine (16). Many models and methods deal with this type of data (7. The Cox model assumes that the instantaneous hazard λ(t.D. The methods apply quite generally to any time-to-event outcome subject to right censoring. . and a data set arising from a four-arm randomized trial designed to evaluate the effects of all-trans retinoic acid (ATRA) and the growth factor granulocyte colony-stimulating factor on survival (18). a data set for which the effects of two proteins. and those covariates that are predictive typically differ substantially in their qualitative relationships and strength of association with T. However. For one example we also use simulated data having speciﬁc properties.14).Graphical Methods for the Cox Model 413 ward to implement using freely available computer programs in either Splus or SAS (13. usually due to the fact that the study ended without the patient experiencing the event.

. Since no model can be perfectly correct. that has motivated this chapter.414 Thall and Estey λ 0 (t) exp (β 1 Z1 . Due to the fact that β1Z1 . known as the linear component of the model. When the underlying model assumptions are not met in that the model does not ﬁt the data well. GOODNESS-OF-FIT The Cox model has proved to be an extremely useful statistical tool for evaluating covariate effects on events that occur over time. β k Z k . The expression β 1 Z 1 . Thus. however. Ideally. where λ 0 (t) is an underlying baseline hazard function not depending on the covariates and β1. then exp(βA ) is the relative risk or hazard ratio of the event for a patient given treatment A compared with B. and meta-analysis that go far beyond the present discussion. III.. however. . and above 1.. since it involves notions of Bayesian inference. the sort of analysis described above may be invalid. For this reason. Numerical values β A 0. A relative risk of 1 corresponds to the case where the risk of the event is the same with the two treatments. regardless of the patient’s other covariates. β k Zk log e{λ (t. cross-validation. β A 0. Consequently. . β k Z k ). equal to 1. commonly known as goodness-of-ﬁt analysis. If a particular β j 0. Thus. .. a value of β A signiﬁcantly less (greater) than 0 is the basis for inferring that A is superior (inferior) to B. . and 0 correspond to relative risks below 1. then the covariate Z j has no effect on the hazard. If one deﬁnes the binary indicator variable Z A 1 for treatment group A and Z A 0 for treatment B and includes β A ZA in the linear component. the relative risk exp(βA ) associated with treatment A vis a vis treatment B is the same at any time t. respecβA tively. . model criticism should also include consideration of ﬁts obtained with other similar data sets. for example. is typically the main focus of a Cox model analysis since the β j ’s quantify the covariate effects. . It is this danger. the Cox model is also called the proportional hazards model. β k are unknown parameters quantifying the covariate effects. the covariates are said to have a log linear effect on the hazard of the event. since it may easily produce ﬂawed or substantively misleading inferences. The point is that use of a statistical model without some goodness-of-ﬁt assessment is bad scientiﬁc practice. ... given the widespread use of the Cox model to analyze medical data. Two crucial assumptions underlying the Cox model are that the covariates have a log linear effect on the hazard of the event and that the value of each β j does not vary with time. the practical question is whether a given statistical model provides a reasonable ﬁt to the data at hand. practical application of any statistical model should include some form of data-driven model criticism. We do not pursue this issue here. . Z)/λ 0 (t)}.

IV. MARTINGALE RESIDUAL PLOTS

In linear regression analysis the residuals are the differences, one for each patient, between the observed outcome variable and the value of that variable predicted by the fitted regression model. These ''observed minus predicted'' values may be used to assess how well the regression model fits the data. A wide variety of methods for residual analyses is discussed in the statistical literature, and each method applies to a particular type of regression model (e.g., linear, logistic, Cox). The martingale residuals associated with a fitted Cox model are the analog of ordinary residuals associated with a linear regression model. Specifically, martingale residuals are numerical values, one for each patient in the data set, that quantify the excess risk of the event not explained by the model. A large positive (negative) martingale residual rm for a patient corresponds to a fitted model that underestimates (overestimates) the risk of the event for that patient. This fact may be exploited to assess goodness-of-fit in terms of a martingale residual plot.

The martingale residuals for this plot are computed from a Cox model that includes only a baseline hazard function but no covariates, so that rm essentially adjusts the patients' observed times for censoring. The plot is obtained by plotting a point for each patient that corresponds to the patient's value of rm on the vertical axis and the patient's value for the particular covariate, Z, on the horizontal axis. This produces a scattergram of points, as shown in Figure 1. Applying a local regression smoother (14,20) to create a line through the scattergram then allows one to examine visually the nature of the relationship between rm and the covariate Z. In most applications, the pattern revealed by the smoother is impossible to determine by visual inspection of the scattergram alone. For larger data sets, say with 1000 patients or more, a plot of the smoothed line alone may be visually clearer since the scattergram of points tends to overwhelm the picture. Smoothed scattergrams may be constructed easily using several widely available statistical software packages, including Splus (13,14).

If Z satisfies the proportional hazards assumption (i.e., if it has a log linear effect on the hazard), then aside from random variation the smoothed line will be straight. Nonlinear patterns correspond to violation of the proportional hazards assumption, and in this case the shape of the smoothed line indicates the relationship between outcome and Z that should produce a good fit. Thus, a simple, routine way to fit Cox models is to first examine the martingale residual plot for each covariate Z, if necessary fit a new model incorporating this relationship, examine a residual plot for the new model, and repeat this process until one obtains a good fit. Subsequently, after a model with an appropriately transformed version of Z has been fit, the plot on Z of the residual rm based on this new model should show no pattern other than random noise.
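As a concrete illustration of the routine just described, the sketch below computes martingale residuals from a Cox model with no covariates and smooths them against a single covariate, again in the syntax of Therneau's survival library; wbc and the other names are hypothetical placeholders.

    fit0 <- coxph(Surv(dfs.weeks, event) ~ 1, data = mydata)   # baseline hazard only, no covariates
    rm   <- residuals(fit0, type = "martingale")                # one residual per patient

    plot(mydata$wbc, rm, xlab = "WBC", ylab = "Martingale residual")
    lines(lowess(mydata$wbc, rm), lwd = 2)   # the smooth suggests the functional form of the covariate effect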

Figure 1   Martingale residual scatterplot on white blood cell count, from the caspase data.

V. TIME-VARYING COVARIATE EFFECTS

Time-to-event data often deviate from the usual Cox model in that the effect of a given covariate, including treatment, may vary over time. The extended Cox model allowing one or more of the βj's to vary with time has hazard function of the form

λ(t, Z) = λ0(t) exp[β1(t)Z1 + · · · + βk(t)Zk].

Under this extended model, the log-linear effect and corresponding risk associated with Zj at time t are given by βj(t)Zj and exp[βj(t)Zj], respectively. An application where this extension often is appropriate is in evaluating the effect of baseline performance status (PS) on survival in acute leukemia, since the risk of regimen-related death for patients with poor PS decreases once they survive chemotherapy and hence βPS(t) may become closer to 0 as t increases beyond the early period.

A related extension is that which allows covariates to be evaluated repeatedly over time rather than being recorded only at baseline (t = 0), so that Zj(t) denotes the value of the jth covariate at time t and the hazard function takes the extended form

λ(t, Z) = λ0(t) exp[β1Z1(t) + · · · + βkZk(t)].

These two extensions are computationally very similar, although we do not explore this point here.

When the effect of a covariate Z varies over time, it is essential to assess the form of β(t) to determine how Z actually affects patient survival. A graphical method for doing this, similar to the martingale residual plot, is the Grambsch-Therneau-Schoenfeld (GTS) residual plot (5,11), also known as a scaled Schoenfeld residual plot. A smoothed GTS plot provides a picture of β(t) as a function of t. This plot has an accompanying statistical test, due to Grambsch and Therneau (5), of whether β varies with time versus the null hypothesis that it is constant over time; hence, it takes the form of a particular goodness-of-fit test for the Cox model. This test is very general in that, for each of several transformations of the time axis, it tests whether the ordinary Cox model is appropriate.

We illustrate these methods for assessing goodness-of-fit by example, with emphasis on the graphical displays described above. Most of our examples simplify things by focusing on the effects of one covariate for the sake of illustration. In practice, each application described here would be followed by fitting and evaluating multivariate models incorporating whatever forms are determined in the univariate fits. A very important point in interpreting these graphs is that one may see patterns in any plot if one stares at it long enough; consequently, a statistical test should accompany any graphical method to determine whether an apparent pattern is real.

VI. A COVARIATE WITH NO PROGNOSTIC VALUE

We begin with an illustration of what a martingale residual plot looks like for a covariate that is of no value for predicting outcome. Figure 1 is a martingale residual plot from the caspase data. We plotted rm on white blood count (WBC) for each patient, which produced the scattergram of points, and ran a local weighted regression (''lowess'') smoother (18) through the points to obtain the solid line. Note that very few patients have very large WBC values; consequently, most of the points in the scattergram are forced into a small area in the left portion of the figure, whereas the right portion of the smoothed line is very sensitive to the locations of a few points. A simple way to deal with this common problem is to truncate the WBC domain by excluding a small number of patients with large WBC values. This produces Figure 2, which gives a clearer picture of the true relationship between WBC and DFS for most of the data points. In each illustration given below, we similarly truncate the domain of the covariate as appropriate. Another method for dealing with a scattergram in which most of the points occupy a small portion of the plot is to transform the covariate, for example, by replacing Z with log(Z).

Aside from random fluctuations, the smoothed line in Figure 2 is straight, suggesting that the assumption of a log linear hazard is appropriate. Here, the Cox model assumptions are reasonable since the p value of a Grambsch-Therneau goodness-of-fit test is p = 0.42.
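The GTS plot and the accompanying Grambsch-Therneau test are available through the cox.zph routine cited later in this chapter. A hedged sketch follows, again with hypothetical variable names; the "identity" option requests the untransformed timescale used in our examples.

    fit <- coxph(Surv(dfs.weeks, event) ~ wbc, data = mydata)
    zp  <- cox.zph(fit, transform = "identity")

    zp          # prints the Grambsch-Therneau test; a small p value suggests beta varies with time
    plot(zp)    # smoothed scaled Schoenfeld residuals: an estimate of beta(t) with a confidence band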

There is no relationship between WBC and DFS, since p = 0.83 for a test of βWBC = 0 versus the alternative hypothesis βWBC ≠ 0 under the usual Cox model with βWBC WBC as its linear component. Thus, WBC is of no value for predicting DFS in this data set.

Figure 2   Martingale residual scatterplot as in Figure 1, but with right truncation of the white blood count domain.

VII. A QUADRATIC EFFECT

The following example illustrates both the importance of including relevant patient covariates when evaluating treatment effects and the importance of using goodness-of-fit analyses to properly model covariate effects. A fit of the usual Cox model with linear component βATRA ATRA to the ATRA data, summarized as Model 1 in Table 1, yields a test of the hypothesis βATRA = 0 versus βATRA ≠ 0 having p value 0.055. The estimated relative risk exp(−0.329) = 0.72 seems to imply that ATRA reduces the risk of death or relapse in this patient group and that this reduction is both statistically and medically significant. Since baseline platelet count (platelets) often has a significant effect on DFS in treatment of hematologic diseases, it also seems reasonable to include platelets in the model.

045 0.172 0. the plot suggests that either a very low or very high platelet count is prognostically disadvantageous. for the model including the ATRA effect alone. as assumed when the standard Cox model is ﬁt.231 0. compared with only 3.022 6 4 419 Model LR (df) 3.085.204 0. Cytogenetics.69 on 1 df. A closer analysis leads to rather different conclusions.0 10 1.0014. and in general it is standard practice to include the lower .181 p 0.1 (2) 24.7 10 2.594 0.329 0.134 0.297 0.546 0. This is done by ﬁtting a Cox model with linear component β 1ATRA β 2platelets β 3platelets 2. the test of this hypothesis addresses the question of whether the bend in the line in Figure 3 is signiﬁcant or is merely an artifact of random variation in the data. This ﬁt indicates that a higher platelet count is a highly signiﬁcant predictor of better DFS and that after accounting for platelets the ATRA effect is still marginally signiﬁcant with p 0.18 9.136 0. Thus.055.055 0.170 0. In particular.69 (1) 13. the relationship between platelet count and DFS is not log linear.8 10 0. p 0.005 0.173 0.4 (3) 4 29. and the ATRA Effect Covariate ATRA ATRA Platelets ATRA Platelets Platelets 2 ATRA Platelets Platelets 2 m5m7 Estimated coefﬁcient 0. this model provides a better overall ﬁt since its likelihood ratio (LR) statistic is 13.4 (4) 5 4 The resulting ﬁtted model is summarized as Model 2 in Table 1. the risk of an event begins to increase. summarized as Model 3 in Table 1. The smoothed martingale residual plot given in Figure 3 indicates that the risk of relapse or death initially decreases as platelet count increases but that as platelet count rises above roughly 150 10 3.299 0. It is worth noting that the question of whether β 2 0 under this model is essentially irrelevant.044 0. but rather appears to be log parabolic.163 0. A Cox model that includes both platelets and platelets 2 as covariates would account for the parabolic shape suggested by Figure 3.174 0.105 0.085 0.Graphical Methods for the Cox Model Table 1 Model 1 2 3 Platelets. Moreover. Since the hypothesis β 3 0 reduces the log parabolic model to the simpler log linear model. p 0.1 10 4 corresponding to this test indicates that the log parabolic model is indeed appropriate.172 0.1 on 2 degrees of freedom (df).415 SE 0. The p value 1.24 5.1 10 0.
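The comparison between the log linear and log parabolic models for platelets (Models 2 and 3 in Table 1) can be sketched as follows; this is a hedged illustration only, with atra and plt as hypothetical names for the ATRA indicator and the platelet count.

    m2 <- coxph(Surv(dfs.weeks, event) ~ atra + plt,            data = mydata)  # log linear in platelets
    m3 <- coxph(Surv(dfs.weeks, event) ~ atra + plt + I(plt^2), data = mydata)  # log parabolic

    lr <- 2 * (m3$loglik[2] - m2$loglik[2])   # likelihood ratio statistic for beta3 = 0, 1 df
    1 - pchisq(lr, df = 1)                    # a small p value favors the log parabolic model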

This example illustrates the common scientiﬁc phenomenon that an apparently signiﬁcant treatment effect may disappear entirely once patient prognostic covariates are properly accounted for. This is underscored by the LR test for the overall ﬁt of Model 3 (LR 24. We take this analysis one step further by adding to the linear component the indicator. It is also notable how easily the defect in the log linear model was revealed and corrected.0 10 5 ). given in Figure 4. now that platelet count has been modeled properly the ATRA effect is no longer even marginally signiﬁcant ( p 0. from the ATRA data.4 on 3 df.24). A ﬁnal point is that there are . order term Z whenever Z 2 has a signiﬁcant coefﬁcient. Perhaps most importantly. which shows quantitatively that Model 3 provides a substantially better ﬁt than Model 2. The martingale residual plot based on the ﬁtted Model 3. indicating that the parabolic model has adequately described the relationship between platelets and DFS. that the patient has the cytogenetic abnormality characterized by the loss of speciﬁc regions of the ﬁfth and seventh chromosomes.18). shows a pattern consistent with random noise. This is summarized as Model 4 in Table 1.420 Thall and Estey Figure 3 Martingale residual scatterplot on platelets. which shows that m5m7 is a signiﬁcant predictor of worse DFS and that the inclusion of this covariate further reduces the prognostic signiﬁcance of ATRA ( p 0. p 2. m5m7.

One consequence of this practice is that in evaluating the statistical analyses of two published studies where different ‘‘optimal’’ cutpoints were used to deﬁne I c and the studies concluded. by replacing it with a binary indicator variable I c 0 if Z c and I c 1 if Z c for some cutpoint c. CUTPOINTS A common practice in the medical literature is to dichotomize a numerical-valued variable Z. One common practice is to set c equal to the mean or median of Z.Graphical Methods for the Cox Model 421 Figure 4 Martingale residual scatterplot on platelets. Another is to use the ‘‘optimal’’ cutpoint which gives the smallest p value. a number of models other than a parabola that may describe a curved line. we chose a parabolic function because it is simple and achieves the goal of providing a reasonable ﬁt to the data. The cutpoint may be determined in various ways. respectively. that ‘‘Z had a signiﬁcant effect on survival’’ and ‘‘Z did not have a signiﬁ- . for the test of β c 0 under the model with linear term β cIc. among all possible cutpoints. VIII. based on ﬁtted model including a parabolic model for platelets. such as white count or platelet count.

where IHG 1 if hemoglobin 7 and 0 if hemoglobin is 7 yields a p value of 0. which accounts for fact that multiple preliminary tests were conducted to locate the optimal cutpoint (22).422 Thall and Estey cant effect on survival.’’ it is impossible to determine whether the conﬂicting conclusions were due to a phenomenological difference between the two studies or were simply artifacts of random variation in the data manifested in application of the optimal cutpoint method. This plot shows that for a cutpoint c located somewhere between 6 and 8. illustrated by the martingale residual plot given in Figure 5. That is. A search over values of c in this range yields the optimal cutpoint 7 and a ﬁt of the Cox model with linear term β HG I HG . The use of a model containing I c in place of Z is appropriate only when it describes the actual relationship between Z and the outcome. Figure 5 Martingale residual scatterplot on hemoglobin. Other methods for correcting p values to account for an optimal cutpoint search have been given by Altman et al. A more appropriate test. (23) and Faraggi and Simon (24). from the caspase data. there is a ‘‘threshold effect’’ of hemoglobin on DFS. An illustration of this is provided by the effect of hemoglobin on DFS in the caspase data set.004 for the test of β HG 0 under this model.032. patients having baseline hemoglobin above c have a higher risk of relapse or death. . yields the corrected p value 0.

as is apparent from Figure 6. . The properly adjusted p value that accounts for this is 0.623.370. that white noise produces an apparently signiﬁcant predictor based on the optimal cutpoint. chosen because it gave the lowest p value among all possible values for c.623 and I c 0 if Z 0. Figure 6 Martingale residual scatterplot on white noise. In such cases. the ﬁnal ‘‘signiﬁcant’’ test is merely an artifact of this multiple testing procedure. More fundamentally.018. This appears to show that the binary variable I c 1 if test of β c Z 0. is because many tests of hypotheses were conducted to determine the optimal cutpoint. The optimal cutpoint c 0.Graphical Methods for the Cox Model 423 Unfortunately. because Figure 5 does not exhibit the sharp vertical rise that indicates an actual threshold effect. the practice of replacing a continuous variable Z with a binary indicator I c is just plain wrong. Figure 6 is a martingale residual plot obtained by simulating an artiﬁcial covariate Z according to a standard normal distribution. This anomaly. which correctly reﬂects the facts that the cutpoint is a artiﬁcial and that Z is nothing more than noise. it is inappropriate to ﬁt a cutpoint model for Z to the data. produces a Cox model with linear component β cI c under which the uncorrected p value for the 0 is 0. hence. in most cases where a cutpoint model is used the relationship between Z and outcome simply is not of the form given by Figure 5. so that Z is ‘‘white noise’’ and.623 is a ‘‘signiﬁcant’’ predictor. has no relationship whatsoever to the actual patient outcome data.
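The inflation produced by searching for an ''optimal'' cutpoint is easy to reproduce by simulation. The hedged sketch below generates a white-noise covariate unrelated to the outcome and reports the smallest Wald test p value over a range of candidate cutpoints; without the correction described above, this minimum is frequently ''significant.''

    set.seed(17)
    n    <- 200
    time <- rexp(n)                 # event times unrelated to the covariate
    stat <- rbinom(n, 1, 0.8)       # independent censoring indicator
    z    <- rnorm(n)                # "white noise" covariate

    cand  <- sort(z)[10:(n - 10)]   # candidate cutpoints, avoiding the extremes
    pvals <- sapply(cand, function(cpt) {
      f <- coxph(Surv(time, stat) ~ I(z > cpt))
      summary(f)$coefficients[1, "Pr(>|z|)"]   # uncorrected p value for the dichotomized covariate
    })
    min(pvals)                      # the "optimal cutpoint" p value before any adjustment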

A martingale residual plot on Z AHD for the ﬂudarabine data is given in Figure 7. While I AHD. We used the parametric function mini0. PLATEAU EFFECT It is well known that the presence of an antecedent hematologic disorder (AHD) is prognostically unfavorable in AML. and mum {β1 log(Z AHD this provides a reasonable ﬁt that agrees with the lowess smooth. This reduces the duration Z AHD of an AHD to the binary indicator variable I AHD 1 if Z AHD 1 month and I AHD 0 if Z AHD 1 month.442 for β 2 each have p values 10 8.204 for β 1 and 0. the lowess Figure 7 Martingale residual scatterplot on duration of antecedent hematological disorder. The estimates 0. with the lowess smooth given by the solid line and a parametric ﬁt. by the dashed line. There are many functions that describe this pattern. We have previously considered an AHD ‘‘present’’ if a documented abnormality in blood exceeded 1 month in duration before diagnosis of AML (25). β 2arctan(Z AHD )}. described below. illustrated by the dashed line. from the ﬂudarabine data. .5). Due to the plateau in the smoothed curve.424 Thall and Estey IX. the relationship between Z AHD and DFS is neither linear nor quadratic. Others have used a cutpoint of 3 months rather than 1 month. Figure 7 indicates that the risk of relapse or death increases sharply for values of Z AHD from 0 up to roughly 10 to 20 months and then stabilizes at a constant level for larger values.

under the Cox model the hazard associated with an age of 60 years at diagnosis is the same at either 1 month or 5 years after treatment. Figure 8 is the smoothed GTS plot on age. X. from the ﬂudarabine data. for example.0175] 2. at any time after start of treatment a 60-year-old patient has twice the risk of death or relapse compared with a 20-year-old patient.Graphical Methods for the Cox Model 425 function. since exp[(60 20) 0. This assumption frequently is not veriﬁed despite the fact that in some situations it might appear tenuous. This indicates that. PARABOLIC TIME-VARYING EFFECT As noted earlier. Figure 7 indicates that the cutpoint model. only approximates either the lowess or the parametric function. A ﬁt of the ordinary Cox model with linear term β AGE AGE to the ﬂudarabine data yields an estimate of β AGE equal to 0. the usual Cox model assumes that covariate effects are constant over time. For example. the clinician might suspect that older patients are at greater risk of death occurring during the ﬁrst few weeks of therapy rather than later on.0175 with p 0. and the parametric function each are highly signiﬁcant predictors of DFS with the associated p 10 7 in each case.001. including Figure 8 Grambsch-Therneau-Schoenfeld residual scatterplot on age. For example. IAHD. . in chemotherapy of hematologic malignancies.
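One way to fit the quadratic time-varying effect of age suggested by the smoothed GTS plot is the time-transform facility available in recent versions of Therneau's survival package; the following is a hedged sketch with hypothetical variable names, and the same extended model could instead be fit by splitting follow-up time into intervals.

    ## beta_age(t) = b1 + b2*t + b3*t^2, fit via the tt() argument of coxph
    fit.tt <- coxph(Surv(dfs.weeks, event) ~ age + tt(age), data = mydata,
                    tt = function(x, t, ...) cbind(age.t = x * t, age.t2 = x * t^2))
    summary(fit.tt)   # the coefficients for age, age.t and age.t2 estimate b1, b2 and b3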

shows that the effect of poor PS decreases during the ﬁrst 3 months and then reaches a small but nonzero plateau thereafter. The grouping of points into rows is characteristic of GTS plots for variables taking on a small number of values. the smoothed GTS plot indicates that β AGE (t) may be a parabolic function of t. whereas dots below the line indicate a deﬁcit of deaths. indicating that the proportional hazards assumption is untenable. 3. given in Figure 9. The respective estimates of β1. In particular.zph’’ computer subroutine of Therneau (14) using the ‘‘identity’’ (untransformed) timescale.001. β 2 . This is the case for pretreatment Zubrod PS in the ﬂudarabine data set.014. Thus. indicating that age really has a parabolic time-varying effect. which correspond respectively to the ﬁve groups of points in the plot.417 with p 0. although in practice we have found it most useful to apply the graphical and model-based methods together. 4. 2. PS take has possible values 0. A general . These conditional plots. XII. The ordinary Cox model ﬁt with linear term β PS Z PS gives an estimate for β PS of 0. and β 3 under this extended Cox model are 0. the effect of age on DFS cannot be quantiﬁed adequately by the single estimate 0. the horizontal line at β AGE 0 corresponds to age having no effect. In Figure 8. and 2.007. and 0. Dots above the horizontal line at β AGE 0 indicate an excess of deaths. The Grambsch-Therneau goodness-of-ﬁt test has p 10 7. with p 10 6.0175 under the ordinary Cox model.62 10 4. the GTS plot indicates that the proportional hazards assumption may not be appropriate. XI. This plot and the associated test were produced with the ‘‘cox.0175 noted above. A Grambsch-Therneau test (5) of the hypothesis that the data are compatible with the proportional hazards assumption has p 0. 1. conﬁrming the graphical results.45 10 6 respectively. Here.426 Thall and Estey a 95% conﬁdence band for the graphical estimate of β AGE (t). The GTS plot suggests that the effect of age on the risk of relapse or death may be described by the quadratic function (β 1 β 2t β 3t2 ) Z AGE . 1.0150.’’ may be used to make inferences without resorting to any parametric model ﬁt or conventional test of hypothesis. NONLINEAR TIME-VARYING EFFECT WITH PLATEAU The GTS plot is especially valuable when a time-dependent covariate effect is not described easily by a parametric function. or ‘‘coplots.2 10 4. 3. CONDITIONAL KAPLAN-MEIER PLOTS A useful method for assessing covariate effects on survival or DFS is to construct a set of conditional Kaplan-Meier (KM) survival plots. as previously. The GTS plot based on PS. Whereas the effect of age would be represented by a horizontal line at 0.

The purpose of this ﬁgure is to provide a visual representation of the joint effects of C 2 and C 3 on survival. middle one third. the three rows correspond to the lowest. from the ﬂudarabine data.69. Similarly. and upper thirds of the C3 sample values. discussion of conditional plots.07) and high C 3 values (1. for example. going from bottom to top. Thus. Moving from left to right. the three columns correspond to the lowest one third. is given by Cleveland (26). (17).07 C 2 for this data set. for example. in the context of analyzing trinary data. Each plot is thus the usual KM estimate of the survival probability curve but conditional on the patient having C 2 and C3 values in their speciﬁed ranges. 1. middle. constructed from a particular subset of the data.57. moving from left to right along the bottom row shows how survival changes with increasing C 2 given that . 0.69 and upper one third of the C 2 values.69 1. Each of the nine plots in Figure 10 is a usual KM plot. 0.09 C 3 1. the KM plot in the center of the top row is constructed C2 from the data of the 21 patients having intermediate C 2 values (0.09.Graphical Methods for the Cox Model 427 Figure 9 Grambsch-Therneau-Schoenfeld residual scatterplot on performance status.07. and 1. Thus. which are C 3 1.57 C 3 for these data. and 1.57 C 3). which happen to be C 2 C2 1. An application of coplots to the caspase data is given in Figure 10. which is reproduced from Estrov et al. This ﬁgure was constructed as follows.
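A display like Figure 10 can be built from ordinary Kaplan-Meier fits on subsets defined by tertiles of the two covariates. The sketch below is hedged and uses hypothetical names (c2, c3, surv.weeks, died) for the caspase data.

    c2.grp <- cut(mydata$c2, quantile(mydata$c2, c(0, 1/3, 2/3, 1)), include.lowest = TRUE)
    c3.grp <- cut(mydata$c3, quantile(mydata$c3, c(0, 1/3, 2/3, 1)), include.lowest = TRUE)

    par(mfrow = c(3, 3))
    for (r in rev(levels(c3.grp)))       # rows: highest C3 tertile on top
      for (s in levels(c2.grp)) {        # columns: C2 increasing from left to right
        keep <- c3.grp == r & c2.grp == s
        plot(survfit(Surv(surv.weeks, died) ~ 1, data = mydata[keep, ]),
             xlab = "Weeks", ylab = "Survival", main = paste("C2:", s, " C3:", r))
      }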

based on three nonoverlapping intervals for each covariate. each of which contains half of the C 2 data. with the ﬁrst interval running . C3 1. would not be revealed by the conventional approach of ﬁtting a Cox model with linear component including the terms β 2C 2 β 3C 3 β 23C 2 C 3. A slightly different way to construct this type of plot is to allow the adjacent subintervals of each covariate to overlap. in order to provide a smoother visual transition. The coplots clearly show that this is not the case. Figure 11 is obtained by ﬁrst deﬁning subintervals in the domain of C 2. C 3 ) values.09. It is important to bear in mind that the particular numerical cutoffs of the subintervals here are speciﬁc to this data set. since this parametric model assumes that the multiplicative interaction term β 23C 2C 3 is in effect over the entire domain of both covariates. This sort of interactive effect. The most striking message conveyed by this matrix of KM coplots is that survival is very poor for patients having both high C 2 and high C 3 .428 Thall and Estey Figure 10 Conditional KM plots for varying C 2 and C 3 values. manifested on a particular subdomain of the two-dimensional set of (C 2 . so that any patterns revealed by the plots may hold generally for a similar data set but the particular numerical values very likely will differ.

The particular numerical values of the C 2 and C 3 interval end points are given along the bottom and left side of Figure 11.Graphical Methods for the Cox Model 429 Figure 11 Conditional KM plots for varying C 2 and C 3 values. the second from the 1/6th to 4/6th percentile. whereas the . the third from the 2/6th to the 5/6th percentile. and the fourth from the 3/6th percentile to the maximum. Scanning each of the lower two rows from left to right shows that for lower values of C 3. using overlapping intervals for each covariate. survival improves with increasing C 2. and markedly in the top row. where survival drops as both C 2 and C 3 become large. Another advantage of using intervals that overlap is that the sample size for each KM plot is larger than if the intervals are disjoint. where survival seems to level off. Four subintervals of C 3 are deﬁned similarly. adjacent intervals share 1/3rd of the C 2 data. comprising 27% of the sample values. This pattern changes slightly at the end of the third row. Thus. The relatively poor survival shown in the upper right corner KM plot is notable in that this plot is based on 59 patients. from the minimum to the 3/6th percentile.

which is the case if its effect on patient risk is not log linear as assumed under the usual Cox model. which may be the case with the typical approach of ﬁtting a Cox or logistic regression model without performing any goodness-of-ﬁt analyses. In fact. . Formally. the p value of any such ﬁnal test should be adjusted for the process that produced the model. statistical tests. dichotomization of a numerical variable by use of a cutpoint without ﬁrst determining the actual form of the covariate’s effect typically leads to loss of information and in many cases is completely wrong. DISCUSSION The importance of prognostic factor analyses in clinical research is widely acknowledged. evaluation of the covariate’s effect may be misleading. The use of these methods helps to avoid ﬂawed inferences from being drawn. it applies more broadly to the entire model-ﬁtting process. Use of the optimum cutpoint often leads to spurious inferences arising from nothing more than random variation in the data. When a covariate is modeled incorrectly. This is due to the fact that both the model-ﬁtting process and the ﬁnal tests are based on the same data set. An important caveat to keep in mind when interpreting the results of any regression analysis is that the model-ﬁtting process. a particular regression model ﬁt to a given data set is not likely to provide as good a ﬁt to another data set based on a similar experiment. For example. recognizes this problem. Our examples illustrate that these problems may be addressed easily and effectively by the combined use of martingale residual plots. XIII. the adjusted p value computation required to test properly for an optimal cutpoint. which we do not pursue here. The practical point is that due to random variation. In particular. This consideration leads to notions of cross-validation and bootstrapping. In this chapter. we illustrate some general problems with these analyses as they are often conducted. We also provide examples of covariates having effects that change over time. is not accounted for by p values obtained using conventional methods based only on the ﬁnal ﬁtted model.430 Thall and Estey considerably more striking drop in the upper right corner plot of Figure 10 is based on a more extreme subsample of 31 patients. and transformation of covariates as appropriate. along with methods for revealing and formally evaluating such time-varying effects. model-based regression analyses may be augmented or even avoided entirely by the use of conditional KM plots. noted in Section 8. A basic reference is Efron and Tibshirani (27). The aim of these methods is to determine the true relationship between one or more covariates and patient outcome in a particular data set. including graphical methods and tests based on intermediate models. Finally. and we describe some graphical methods and tests to address these problems.

95–112. 14:2173–2190. ed. 10 and 11) due to a real biological phenomenon. J Am Stat Assoc 89:1523–1527. in turn they provide more reliable inferences regarding covariate and treatment effects on patient outcome. they lead quite easily to corrected models. pp. statistical research is constantly evolving. Finally. Regression models and life tables (with discussion). this ﬁnding may be illusory. because the methods often provide a greatly improved ﬁt of the statistical model to the data. such as those described here.000 associated with inferior DFS in patients with AML or MDS (Fig. Li Z. 4. Random effects models for combining results from controlled and uncontrolled studies in meta-analysis. Boston: Kluwer. Recent Advances in Clinical Trial Design and Analysis. Stangl DK. These methods are of value to medical researchers for at least three reasons. Therneau TM. Why. Proportional hazards tests and diagnostics based on weighted residuals. they suggest new directions for medical research. have decreased dramatically due to the widespread availability of high-speed computing platforms and ﬂexible statistical software packages. REFERENCES 1. First. 81:515–526. J Royal Stat Soc B 1972. Stat Med 1995. . but the p value of 0. 2. Modelling and decision making using Bayesian hierarchical models. Begg CB. Therneau TM.0001 for the quadratic term in Model 3 of Table 1 suggests otherwise. Is the sharp drop in survival for high levels of both C 2 and C 3 (Fig. 34:187–220. In: Thall PF. Cox DR. Biometrika 1990. for example. graphical methods provide a powerful means to determine if and how the Cox model assumptions are violated. Grambsch PM. 5. Grambsch PM. Difﬁculties in implementing graphical methods. 77:147–160. Martingale-based residuals for survival models. ACKNOWLEDGMENT We are grateful to Terry Therneau for his thoughtful comments on an earlier draft of this manuscript. and they are perhaps the best method available for communicating the results of a regression analysis to nonstatistical colleagues. Biometrika 1994. 6. 3. Powerful new techniques for modeling and analyzing data are currently becoming available at an ever increasing rate. 3)? Certainly. Goodness-of-ﬁt diagnostics for proportional hazards regression models.Graphical Methods for the Cox Model 431 Like medical research. is a platelet count above 200. or is it merely an artifact of random variation? Second. 1995. Grambsch PM. Fleming TR.

15. Kantarjian H. Cleveland WS. 19. J Am Stat Assoc 1979. Schumacher M. Fleming TR. Crowley L. Harris D. 26. 93:2478–2484. 72:27–36. 1991. Summit. Tibshirani RJ. The New S Language. 20. Sauerbrei W. Blood 1997. Thall PF. Randomized phase II study of ﬂudarabine cytosine arabinoside idarubicin all trans retinoic acid granulocyte-colony stimulating factor in poor prognosis newly diagnosed acute myeloid leukemia and myelodysplastic syndrome. Harrell FE. 92:3090–3097. 12:671–678. Inc. Covariance analysis of heart transplant survival data. 8. Mayo Foundation. Caspase 2 and caspase 3 protein levels as predictors of survival in acute myelogenous leukemia. New York: Chapman and Hall. A Package for Survival in S. J Natl Cancer Inst 1994. Approximate case inﬂuence for the proportional hazards regression model with censored data. Schoenfeld D. Thall PF. Laird and B. 11. Applied Survival Analysis and Logistic Regression. 86:829– 835. 67:145–153. The dangers of using ‘‘optimal’’ cutpoints in evaluation of prognostic factors. Wilks ARA. 47:1283–1296. Biometrics 1984. 1986. Cleveland WS. Applied Statistics 1977. Francis. Hu M. Harrell FE. Blood 1998. Harrington DP. Kantarjian HM. A simulation study of cross-validation for selecting an optimal cutpoint in univariate survival analysis. 1997. Efron B. 27. J Clin Oncol 1994. 16. Biometrika 1980. Effect of diagnosis (RAEB. 40:493–499. Becker RA. Practical p-value adjustments for optimally selected cutpoints. 1993. Talpaz M. during and after ﬂudarabine araC induction therapy of newly diagnosed AML or MDS: comparison with ﬂudarabine ara-C without G-CSF. Chambers RM. N. An Introduction to the Bootstrap. Charlottesville: University of Virginia. Cain KC. Thall PF. Walterscheid M. 90:2969–2977. RAEB-t. Altman DG. 14. Stat Med 1996. Therneau TM. Pierce S. Thall PF. Kornblau S. 15:103–112. Predicting Outcomes. Chi-squared goodness-of-ﬁt tests for the proportional hazards regression model. 10. New York: Wiley. J Am Stat Assoc 1977. Kay R. 1995. 17. Stat Med 1996. Blood 1999. 22. 1993. Estey EH. Biometrics 1991. 15:2203–2214. Lausen B. 26:227–237. 1988. NC: SAS Institute. 24. Estey EH. Hilsenbeck SG. 25. 21. 74:829–836. 12. 9. Beran M. J Am Stat Assoc 1983. 13. Cary. Gentleman R. Simon R. or AML) on outcome of AML-type chemotherapy. 23. et al. Proportional hazards regression models and the analysis of censored survival data. Keating M.432 Thall and Estey 7. NJ: Hobart Press. Version 5. Estrov Z. SAS Supplememntal User’s Guide. Estey EH. 18. Counting Processes and Survival Analysis. Estey EH. Pierce S. Andreeff M. 878:277–281. . Paciﬁc Grove. Faraggi D. Crowley JJ. Local full likelihood estimation for the proportional hazards model. Robust locally-weighted regression and smoothing scatterplots. Lange NT. Crowley JJ. Storer BE: Comment on ‘‘A reanalysis of the Stanford heart transplant data’’ by M. CA: Wadsworth. Van Q. The PHGLM procedure. Visualizing Data. Use of G-CSF before. Aitkin.

England. Recent interest in molecular and genetic markers create additional problems for the data analyst. INTRODUCTION In this chapter we are interested in exploratory data analysis rather than precise inference from a randomized controlled clinical trial. They will not be discussed in this chapter. The methods described in this chapter are illustrated using data from the Medical Research Council’s fourth and ﬁfth Myelomatosis Trials (1). Although the quality of data collection and follow-up is important. We are concerned with methods appropriate for analyzing a small number of prognostic factors. London. there is no need to have a randomized trial to study prognostic factors.21 Graphical Approaches to Exploring the Effects of Prognostic Factors on Survival Peter D. London. large series of patients receiving standard therapy are important sources of information. A total of *Current afﬁliation: Imperial College School of Medicine. but these are mostly concerned with multiple testing and test reliability. 433 . England I. Sasieni and Angela Winnett* Imperial Cancer Research Fund.

7 6.9 3.5 5.1 6. 0 736 255 822 Age log2 (sβ2m) log2 (serum creatinine) Variable ABCM Cuzick index int.’’ 0 otherwise Min.9 3rd quartile 6.4 7.7 6.5 Freq. 8. 0 otherwise 1 for Cuzick index ‘‘intermediate’’ or ‘‘poor.3 2.’’ 0 otherwise 1 for Cuzick index ‘‘poor.3 11. or poor Cuzick index poor Continuous variables 5.434 Table 1 Variable Summary of Prognostic Variables Units 10 years log(mg/l) log(mM) Description Indicator variables 1 for trial 5 with ABCM.7 3 2. 1 277 758 191 Max.1 Sasieni and Winnett . 1st quartile Median 6.1 1.0 Freq.

No signiﬁcant differences were found between survival of patients with the different treatments in trial 4 (3) or between survival of patients in trial 4 and patients with M7 treatment in trial 5. and melphalan (ABCM). so they have their limitations and survival analysis models are required. In the absence of censoring. Age in years was divided by 10 so that the difference in the interquartile range of each continuous variable was close to 1 (between 1 and 1. 1. Although such models are attractive. An advantage of this approach is the ease with which the results can be presented and interpreted. ‘‘medium death’’ (1–5 years). one might consider a single model for the ordered multinomial end points: ‘‘early death’’ ( 1 year). The patients in the fourth trial received treatment of either intermittent courses of melphalan and prednisone (MP) or MP with vincristine given on the ﬁrst day of each course.3). serum β2 microglobulin (sβ2m). A disadvantage of ﬁtting three logistic models is that although any patient who survives 5 years must have survived 1 year. and serum creatinine.3-bis(2-chloroethyl)-1-nitrosourea (BCNU). or ‘‘long-term survivor’’ ( 10 years). Example None of the survival times in the myeloma data were censored at less than 2 years. the models are not linked and estimation of the conditional probability of survival to 5 years given that the patient is alive at 1 year is not straightforward. However. or adriamycin. whereas 454 of the deaths occurred in the ﬁrst 2 years and 199 in . which is based on blood urea concentration. Survival times range from 1 day to over 7 years. A number of prognostic variables were recorded—age. In the ﬁfth trial patients received either intermittent oral melphalan (M7). one could use three logistic models to examine the effect of various potentially prognostic factors on each end point. hemoglobin. it is rare that one has uncensored long-term follow-up. ‘‘late death’’ (5–10 years). Also prognostic groups were deﬁned according to the Cuzick index. and long-term (10 year) survival. and clinical performance status (2). of whom 821 had died by the time the data set was compiled and 192 were censored. A summary of the variables is given in Table 1. short-term survival is often uncensored and logistic regression is a useful and underused technique for exploring the role of prognostic factors in such situations. so these three groups were pooled together. Logarithms (to base 2) were used for sβ2m and serum creatinine since otherwise they are very skewed. II. Not only can the importance of individual factors be evaluated in a multivariate model. medium-term (5 year). LOGISTIC REGRESSION Clinically one may be interested in short-term (1 year). Instead. but a prognostic score can be developed to quantify the probability that an individual with a given proﬁle will survive to each of the three time points. cyclophosphamide.Prognostic Factors and Survival 435 1013 patients are included.
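The logistic regressions described in this section can be reproduced with standard binomial GLM software. The sketch below is hedged; the variable names (myeloma, dead, surv.years, die2yr, age10, lsb2m, lcreat, abcm, cuz.ip, cuz.p) are placeholders for the coded variables of Table 1.

    ## death within 2 years (no survival times were censored before 2 years)
    myeloma$die2yr <- as.numeric(myeloma$dead == 1 & myeloma$surv.years <= 2)

    fit2 <- glm(die2yr ~ age10 + lsb2m + lcreat + abcm + cuz.ip + cuz.p,
                family = binomial, data = myeloma)
    exp(coef(fit2))                       # odds ratios, as reported in Table 2

    ## death within 2 years among patients still alive at 6 months
    fitc <- glm(die2yr ~ age10 + lsb2m + lcreat + abcm + cuz.ip + cuz.p,
                family = binomial, data = myeloma, subset = surv.years > 0.5)
    summary(fitc)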

02–1.90 965 1012 95% CI (0.08 1.56 0.84) (1.41–2. 1.72–1.06 0.37 0.R.28–1.30–3.31–2.44–0.39) (1.29) (1.90–1.99) (0.74 2.11–1.62 1.19 1.46–0.74–1.45) Covariate Age (per 10 years) log2 (sβ2m) log 2 (serum creatinine) ABCM Cuzick index int.63 1.23) Sasieni and Winnett .00) (0.71) (1.28) (0.54) (0.00) (0.38 1.15–2. or poor (vs.00) (0.R.23) (0.94–1.85–1.436 Table 2 Multivariate Logistic Regression Using Three End Points Compared with Longer Survival Death within 0–6 months 0–2 years O. good) Cuzick index poor (vs.65) 6 months–2 years O.46 860 1004 95% CI (1.68 1. or good) Deviance Null deviance O.50–1.62 1.94 0.89) (0.R. 1.95–2. int.27 1.55–1.20 0.33) (0.62 1. 1.94–2.11 1255 1393 95% CI (1.09) (1.90) (0.

16) 886 Covariate log 2 (serum creatinine) log 2 (sβ 2m) Deviance O. From the logistic regression results it can be seen that sβ2m is a strongly prognostic factor for survival both to 6 months and to 2 years. whereas sβ2m and ABCM treatment are associated with differences in survival from 6 months to 2 years and survival up to 6 months. the effect is smaller in magnitude and not statistically signiﬁcant.R.42–2. It is seen that neither age nor the Cuzick index have statistically signiﬁcant association with survival up to 2 years conditional on survival up to 6 months. The treatment ABCM can be seen to improve survival to 2 years. logistic models were used to study the effects of the prognostic variables on the probability of death in the ﬁrst 6 months and on the probability of death in the ﬁrst 2 years. where the two variables have been Table 3 Logistic Regression Models for Serum Creatinine and sβ 2 M only Logistic regression for death within 6 mo with serum creatinine only Logistic regression for death within 6 mo with serum creatinine and sβ2m O.68) (1. 2.Prognostic Factors and Survival 437 the ﬁrst 6 months. but these models by themselves do not indicate whether there is any association with survival to 2 years within those patients who survived to 6 months. The results of the logistic regressions are in Table 2. This can also be seen from Table 4. including only those patients who were still alive after 6 months. This was investigated in a further logistic model for survival up to 2 years. so that the strong prognostic value of serum creatinine by itself can be largely accounted for by the confounding effect of sβ2m. The Cuzick prognostic index ‘‘good’’ does indeed indicate improved survival. There is strong correlation between serum creatinine and sβ2m.33 1. whereas the prognostic value of sβ2m is highly statistically signiﬁcant even when serum creatinine has been included in the model. The results are also in Table 2. whereas the difference between the groups ‘‘poor’’ and ‘‘intermediate’’ is not statistically signiﬁcant. although its effect on 6-month survival is still beneﬁcial. The odds ratios for age and the Cuzick prognostic index are closer to one for survival to 2 years than for survival to 6 months. Therefore. at least as far as 6 months.75 95% CI (1. Conﬁdence intervals were calculated based on 1.05–1.14 95% CI (1.R. this is in contrast to what is seen if only serum creatinine is included in the model (Table 3).82–2. 1.52) 916 .96 standard error of the coefﬁcients. The effect of serum creatinine is not statistically signiﬁcant.

75 6.46 6.67 7. the proportional hazards regression model introduced by Cox (4) has become ubiquitous in medical journals. In particular. The prognostic value of sβ2m after adjusting for serum creatinine can be seen by the increasing proportions dead in each row (i.73 16 25% 33 21% 42 26% 70 26% 42 29% 203 26% 3.438 Sasieni and Winnett Table 4 Number of Individuals in Categories Deﬁned by Serum Creatinine and sβ2 m with Percentage Dying in the First 6 Months in Each Category log 2 (sβ 2 m) log2(serum creatinine) 6. The prognostic value of each variable by itself can be seen by the proportions dead by 6 months in the row and column total cells.04 7..93 33 18% 41 12% 51 22% 56 14% 16 19% 197 17% 2.04–7.44–2.75–7. The (conditional) hazard (of death) at time t for an individual with covariates Z is deﬁned by λ(t |Z) lim h↓0 P(T ∈ [t. as sβ2m increases in each category of serum creatinine). III. The correlation between the two variables can be seen by the high frequencies around the diagonal of the table.e.73 7 29% 5 20% 17 18% 31 35% 142 46% 202 41% Total 212 10% 197 13% 200 15% 202 21% 202 40% 1013 20% divided into ﬁve categories each with roughly equal numbers of individuals.44 59 7% 62 11% 57 7% 25 4% 1 0% 204 8% 2. which increase as the value of each variable increases. This contrasts with the strong increasing trend seen in the column of marginal totals.67 Total 1.46–6. Z)/h [1] .89 97 5% 56 9% 33 3% 20 20% 1 0% 207 7% 1. PROPORTIONAL HAZARDS Hazard-based models are naturally adapted for use with (right) censored data and have therefore become the standard approach for survival analysis. The lack of association between serum creatinine and survival after adjusting for sβ2m can be seen from the proportions dead which do not increase steadily in each (internal) column of Table 4.93–3.89–2. t h] |T t.

good) Cuzick index poor (vs.12 1.89) (1. the results are in Table 5.093 log partial likelihood (null model): 10.95–1. higher values of age and sβ2m are associated with worse prognosis. Even without questioning the form of the model. rarely observed in clinical studies. or poor (vs.38 1. As in the logistic models.36) Age (per 10 years) log 2 (sβ2 m) log 2 (serum creatinine) ABCM Cuzick index int. or good) 2 2 log partial likelihood (ﬁtted model): 10.15–1.91–1. We assume here that the goal is not simply prediction (in which case one may prefer to use a ridge or shrinkage approach over covariate selection (5)) but that the chosen model should be biologically plausible.21) (1. it makes quite strong assumptions on the form of the effects and it is always important to check the appropriateness of the model. In many situations. several of the covariates will be correlated. Constant hazards are. and it is a good idea to include certain basic covariates (factors known to be of prognostic value from previous studies) in any model. This model forms a good starting point for analysis of prognostic factors for censored survival data. as in the logistic models. However.03–1. It is certainly useful to properly document the model selection procedures employed and where possible to validate the ﬁnal model on a separate data set. int.236 . Table 2. ABCM treatment and Cuzick prog- Table 5 Covariate Cox Proportional Hazards Model Hazard ratio 1.11 95% CI (1.65) (0.37) (0. however.76 1. Constant hazards correspond to exponential random variables and are conveniently described in terms of one death every so many person-years. Example Cox regression was used to estimate the effects of prognostic variables on survival in the myeloma study.26 1.65–0. Here.Prognostic Factors and Survival 439 It is the death rate at time t among those who are alive (and uncensored) just prior to time t.06 0. one has to apply a sensible model-building strategy for selecting important prognostic factors from a pool of potentially relevant covariates.19) (0. The usual form of the proportional hazards model is λ(t| Z) λ 0 (t)exp(β T Z) [2] in which λ 0 (t) is an unspeciﬁed baseline hazard function (that corresponds to individuals with Z 0) and β is a vector of parameters. After that one may wish to consider a step forward or a step backward procedure to select a model.16–1.

with 95% conﬁdence intervals. but the difference between Cuzick prognostic index intermediate and poor is not statistically signiﬁcant.5. Cuzick index ‘‘good.5.’’ . The ﬁrst part of the ﬁgure is based on a patient with values of the other covariates Figure 1 Survival functions estimated from the Cox model. Notice that the hazard ratios are generally closer to one than the corresponding odds ratios from the logistic models in Table 2. Cuzick index ‘‘poor’’.54 and 4. for log 2 (sβ2 m) equal to 1. but this is to be expected from the relationship between odds ratios and hazard ratios. Prognostic value of the prognostic index After ﬁtting the Cox model one has ˆ a function β T Z that deﬁnes the effect of covariates on the baseline hazard. (b) age 57. higher sβ 2m is strongly associated with an increased hazard. Example From Table 5. This will most conveniently be described in terms of the effect on the median (or some other quantile) survival or on survival to some ﬁxed time.52 (the 10% and 90% quantiles). An advantage of this latter approach is that the proportional hazards model is only used to divide the population into subgroups with different prognoses: The actual survival of each subgroup is then estimated nonparametrically. log 2 (serum creatinine) 7. not ABCM treatment. The disadvantages are the potential bias and loss of information associated with discretizing a continuous prognostic factor and the loss of power resulting from abandonment of the model. log 2 (serum creatinine) 6.440 Sasieni and Winnett nostic index good are associated with a reduced hazard. ABCM treatment.54 and 4. Figure 1 shows estimates from the ﬁtted Cox model of the survival function for patients with log 2 (sβ 2 m) equal to 1. This is not itself particularly useful clinically but should be combined with the estimated baseline hazard to obtain estimates of the effects of the covariates on survival. but it is not clear what the clinical signiﬁcance of this effect would be. one can simply use the prognostic index to divide the study population into subgroups and estimate the survival of each group using standard Kaplan-Meier techniques. Alternatively.52 and (a) age 69.
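Curves like those in Figure 1 are obtained by combining the fitted Cox model with the estimated baseline hazard through survfit(). The sketch below is hedged and uses the same hypothetical variable names as above; the two covariate patterns differ only in log2(sβ2m), set to the 10% and 90% quantiles quoted in the text, and the remaining values are illustrative placeholders.

    fit <- coxph(Surv(surv.years, dead) ~ age10 + lsb2m + lcreat + abcm + cuz.ip + cuz.p,
                 data = myeloma)

    newpat <- data.frame(age10 = 6.9, lsb2m = c(1.54, 4.52), lcreat = 7.5,
                         abcm = 0, cuz.ip = 1, cuz.p = 1)
    plot(survfit(fit, newdata = newpat), lty = 1:2, xlab = "Years", ylab = "Survival")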

and Kaplan-Meier estimates of survival in each group were calculated. ˆ The prognostic index. which fall in the 1st and 4th of the 10 groups. 20%.98 and β T Z 2. 1-year survival probabilities for the four groups in Figure 1 are 51%.45–1.80). for individuals with ˆ ˆ prognostic index from the Cox model (a) β TZ in (1.36) and (1. The covariate values of Figˆ ˆ ure 1a correspond to β T Z 1. TRANSFORMATION OF COVARIATES Both the logistic model and the Cox model (2) impose a particular form on each continuous covariate.22) and (b) β TZ in (0. 22%. 76%. corresponding to a relatively poor prognosis. As with all regression models one should consider the .45 ˆ to 3. 71%. the sample was divided into 10 groups with equal numbers of individuals in each group. whereas in Figure 2 the survival functions are estimated nonparametrically but based on grouping large numbers of patients together. and 87% and in Figure 2 33%. 82%.52–3. Thus. For comparison. β TZ Figure 2 shows the Kaplan-Meier estimates for these groups. The two methods produce survival estimates with quite different shapes since Figure 1 is based on the assumption of proportional hazards and the speciﬁc form of the Cox model. β T Z.77. from the ﬁtted Cox model ranges from 0. and 94%. and the covariate values of Figure 1b correspond to ˆ ˆ 1. By partitioning the prognostic index β T Z. it is seen that deviations from the model ﬁt are greatest at the extremes of the prognostic index range. 16%. 9%.68–1.22.66. and 42%. The corresponding 5-year survival rates are 2%. which fall in the 6th and 10th of the 10 groups. whereas the second part of the ﬁgure is based on a patient with values of the other covariates corresponding to a relatively good prognosis. and 54% compared with 11%.91–2. IV.08 and β T Z 1. 71%.Prognostic Factors and Survival 441 Figure 2 Kaplan-Meier estimates with 95% conﬁdence intervals.03) and (2.
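The grouped Kaplan-Meier display of Figure 2 requires only the linear predictor from the fitted model. A hedged sketch, again with hypothetical names:

    fit <- coxph(Surv(surv.years, dead) ~ age10 + lsb2m + lcreat + abcm + cuz.ip + cuz.p,
                 data = myeloma)
    lp  <- predict(fit, type = "lp")          # estimated prognostic index beta'Z for each patient

    myeloma$pi.grp <- cut(lp, quantile(lp, seq(0, 1, 0.1)), include.lowest = TRUE)  # 10 equal-sized groups
    plot(survfit(Surv(surv.years, dead) ~ pi.grp, data = myeloma),
         xlab = "Years", ylab = "Survival")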

Usually there will be more than one covariate. The smoother used is a cubic smoothing spline with 7 degrees of freedom and the shaded histograms on the plots indicate the distribution of the covariates. death) against each covariate. . A. and the following sections describe methods for exploring the relationship between continuous covariates and survival within a logistic model or a Cox model.e. Since the logistic model assumes that the logit transformation of the probability of death given covariate vector Z is linear in each component of Z. it may be more useful to plot the logit transformation of a smooth of the response against a covariate..7). Many serum markers have positively skewed distributions and are traditionally logtransformed. In other situations one may need to consider whether an extreme value has undue inﬂuence on the parameter estimates or whether there is a (biologically plausible) nonmonotone covariate effect. A simple exploratory analysis of the effect of a single continuous covariate on survival can be done using smooth estimates of quantiles of the conditional survival function (6. Transformation of Covariates in the Logistic Model For the logistic model as in Section II the form of the covariate effects can be investigated graphically simply using scatter plot smoothers of the response (i.442 Sasieni and Winnett possibility of transforming covariates before entering them in the model. The conditional survival function for covariate value z is estimated using the Kaplan-Meier estimator with individual i weighted according to the distance between z and Zi. Figure 3 Logit transformation of smoothed indicator of death up to 2 years against prognostic variables. Example Figure 3 shows the logit transformation of a smooth of the indicator of death up to 2 years against the covariate values for sβ2m and serum creatinine.
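Figure 3 can be approximated by smoothing the binary death indicator against the covariate and transforming the smoothed probabilities to the logit scale. The sketch below is hedged, uses a cubic smoothing spline with 7 degrees of freedom as in the text, and assumes the die2yr indicator constructed in the earlier logistic-regression sketch.

    myeloma$die2yr <- as.numeric(myeloma$dead == 1 & myeloma$surv.years <= 2)

    sm   <- smooth.spline(myeloma$lsb2m, myeloma$die2yr, df = 7)
    phat <- pmin(pmax(sm$y, 0.001), 0.999)     # keep the smoothed values strictly inside (0, 1)
    plot(sm$x, log(phat / (1 - phat)), type = "l",
         xlab = "log2 serum beta-2 microglobulin", ylab = "logit of P(death within 2 years)")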

Notice that the probability of death within 2 years of diagnosis is very high (90%) for those with very high levels of log2(sβ2m) (greater than 5) and very low (20%) for those with extremely low values (less than 0). Despite this increasing trend, the relationship may not be monotone: there is certainly no evidence of increasing risk associated with log2(sβ2m) values of between 1.0 and 2.0. By contrast, the probability of death by 2 years does seem to be a monotone function of serum creatinine concentration (except possibly at the lowest few percent of concentrations). The spread of risk is, however, less than for sβ2m. The strong relationship between dying and serum creatinine is interesting in that it largely disappears after adjusting for sβ2m.

B. Transformation of Covariates in the Cox Model

The Cox model is hazard based and allows for censoring, so there is no simple response that can be plotted to investigate the form of the covariate effects in it, but other graphical methods have been developed. The simplest approach to investigating covariate transformations is to partition the covariate of interest to create about five "dummy variables." Cutpoints should be chosen so that there are roughly equal numbers of observations in each group wherever possible, but standard cutpoints may be preferred. A plot of the estimated parameters against the mean value of the observations in each interval is used to examine the appropriateness of the linear fit associated with the basic model. Discretizing a covariate and estimating a separate parameter for each interval is equivalent to fitting a piecewise constant function. With a continuous covariate, however, this is a very crude approximation to the logarithm of the hazard ratio, which may vary as a smooth function of the basic covariate.
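As a sketch of this piecewise constant check (again, not the authors' code), the snippet below cuts a continuous covariate into five roughly equal-sized groups, fits a Cox model to the resulting indicator variables, and returns the within-group means together with the estimated log hazard ratios, ready to be plotted against one another. It assumes the Python lifelines package and hypothetical column names time, death, and marker.

import pandas as pd
from lifelines import CoxPHFitter

def piecewise_constant_check(df, covariate="marker", n_groups=5):
    groups = pd.qcut(df[covariate], q=n_groups)                     # roughly equal-sized intervals
    dummies = pd.get_dummies(groups, prefix=covariate, drop_first=True).astype(float)
    data = pd.concat([df[["time", "death"]], dummies], axis=1)
    cph = CoxPHFitter()
    cph.fit(data, duration_col="time", event_col="death")
    interval_means = df.groupby(groups, observed=True)[covariate].mean()
    return interval_means, cph.params_                              # log hazard ratios (reference interval omitted)

Plotting the estimated coefficients (with zero for the omitted reference interval) against the interval means gives the crude step-function view of the covariate effect described above.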

Consider the additive Cox model

\[ \lambda(t \mid Z) = \lambda_0(t) \exp\Bigl\{ \sum_{j=1}^{p} s_j(Z_j) \Bigr\} \qquad [3] \]

in which the hazard ratio associated with the jth covariate is equal to the function exp{s_j(Z_j)} instead of simply exp(β_j Z_j). Techniques exist for estimating the s_j directly using local estimation (8–10), regression splines (11–13), or penalized partial likelihood (14,15). In this chapter we are more interested in diagnostic plots to investigate whether the chosen form exp(β_j Z_j) is reasonable. Methods based on residuals yield one-step approximations toward the underlying s_j and have the advantage of being easy to use and easy to apply with any software that can do Cox regression and smoothing.

A very crude estimate of the s_j can be obtained by applying a scatterplot smoother to a plot of the so-called martingale residuals against Z_j (16). These are defined as

\[ \hat M_i = \delta_i - \exp(\hat\beta^T Z_i)\, \hat\Lambda_0(T_i) \]

for individual i with covariate vector Z_i and survival time T_i, where $\exp(\hat\beta^T Z_i)\hat\Lambda_0(T_i)$ is an estimate of the cumulative hazard function for individual i at T_i,

\[ \hat\Lambda_0(T_i) = \sum_{T_j \le T_i} \frac{\delta_j}{\sum_{T_k \ge T_j} \exp(\hat\beta^T Z_k)}, \]

and δ_i = 1 if the observation on individual i is a death, δ_i = 0 if it is censored. Earlier approaches (17,18) included plotting the terms $\hat E_i = \exp(\hat\beta^T Z_i)\hat\Lambda_0(T_i) = \delta_i - \hat M_i$ against Z_i; the martingale residuals $\hat M_i$ are an improvement, as the terms δ_i provide an adjustment for censoring. These residuals can then be smoothed against each component of the covariate vector: $\hat M_i$ is smoothed against Z_ij to estimate the form of s_j. The resulting estimates of the s_j are not, however, the best available diagnostics. A better diagnostic plot can be obtained by adjusting each martingale residual $\hat M_i$ by $\hat E_i$ and plotting

\[ \text{smooth}\bigl( \hat M_i / \hat E_i \bigr) \text{ against } Z_{ij}, \text{ with smoothing weights } \hat E_i \qquad [4] \]

We call $\hat M_i/\hat E_i$ the adjusted martingale residual. Alternatively, a diagnostic plot can be obtained by smoothing both δ_i and $\hat E_i$ against the covariate values and plotting the logarithm of the ratio of the two smooths,

\[ \log \frac{\text{smooth}(\delta_i \text{ against } Z_{ij})}{\text{smooth}(\hat E_i \text{ against } Z_{ij})} \qquad [5] \]

(19). Motivation for these plots comes from the fact that, under the additive Cox model (3), $E(\delta_i \mid Z_i) \approx E[\Lambda_0(T_i)\exp\{\sum_j s_j(Z_{ij})\} \mid Z_i]$, whereas $E(\hat E_i \mid Z_i) \approx E\{\hat\Lambda_0(T_i)\exp(\sum_j \hat\beta_j Z_{ij}) \mid Z_i\}$. Thus an estimate of $E(\delta_i \mid Z_{ij})/E(\hat E_i \mid Z_{ij})$ approximately estimates the factor $\exp\{s_j(Z_{ij}) - \hat\beta_j Z_{ij}\}$, which leads to Eq. [5] as an estimate of $s_j(Z_{ij}) - \hat\beta_j Z_{ij}$. The two methods [4] and [5] are similar, as $\log\{E(\delta_i \mid Z_{ij})/E(\hat E_i \mid Z_{ij})\}$ is approximately equal to $E(\hat M_i \mid Z_{ij})/E(\hat E_i \mid Z_{ij})$ by the approximation log(1 + x) ≈ x for small x.

Note that the smooths should be mean based, since they are estimating expected values; a robust smoother is likely to be biased because of the skewness of the residuals. Smoothing is needed in any plot of martingale residuals, since they are generally very skewed and nearly uncorrelated; plots of the martingale residuals or adjusted martingale residuals themselves are not usually helpful. Some statisticians use martingale residuals before entering a new covariate into the Cox model. Since residual methods yield one-step approximations toward the underlying s_j, it is always advisable to start with at least a linear approximation.
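The quantities in Eqs. [3] to [5] are straightforward to compute once a Cox model has been fitted. The following minimal sketch (Python/NumPy, not the authors' code) computes the martingale residuals, the terms $\hat E_i$, and the adjusted residuals $\hat M_i/\hat E_i$ from the estimated coefficients; beta_hat is assumed to come from a previously fitted model, ties are handled only crudely, and the function name is illustrative.

import numpy as np

def adjusted_martingale_residuals(time, death, Z, beta_hat):
    """Martingale and adjusted martingale residuals for a fitted Cox model.
    time: follow-up times; death: 1 = death, 0 = censored; Z: covariate matrix;
    beta_hat: estimated Cox coefficients (assumed already available)."""
    time = np.asarray(time, float)
    death = np.asarray(death, int)
    risk = np.exp(np.asarray(Z, float) @ np.asarray(beta_hat, float))   # exp(beta' Z_i)
    order = np.argsort(time)
    t_s, d_s, r_s = time[order], death[order], risk[order]
    # Breslow estimate of the cumulative baseline hazard: increments of
    # 1 / sum_{k: T_k >= t_j} exp(beta' Z_k) at each death time (crude with ties).
    denom = np.cumsum(r_s[::-1])[::-1]
    Lambda0_s = np.cumsum(np.where(d_s == 1, 1.0 / denom, 0.0))
    Lambda0 = np.empty_like(Lambda0_s)
    Lambda0[order] = Lambda0_s                     # back to the original ordering
    E = risk * Lambda0                             # E_i = exp(beta' Z_i) * Lambda0_hat(T_i)
    M = death - E                                  # martingale residuals
    adjusted = np.where(E > 0, M / E, np.nan)      # adjusted residuals M_i / E_i
    return M, adjusted, E                          # E_i are the smoothing weights in Eq. [4]

Smoothing `adjusted` against a covariate column with weights `E` (any weighted scatterplot smoother will do) gives the plot of Eq. [4]; smoothing `death` and `E` separately against the covariate and taking the log of the ratio gives Eq. [5].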

If martingale residuals are based on a model including the covariate Z_j with coefficient $\hat\beta_j$, the function s_j(Z_j) is estimated by the residual estimate [4] or [5] added to the linear term $\hat\beta_j Z_j$. On the other hand, to determine whether the function s_j deviates from linear, it may be more useful to simply plot the residual estimate. The additive Cox model [3] only makes sense if there is some constraint on the functions s_j; for example, if λ_0 is taken to be the hazard function for an individual with Z = 0, then s_j(0) = 0 for each covariate j. In Figure 4, λ_0 corresponds to the minimum observed value of each covariate.

Confidence intervals

Confidence intervals are always important when looking at any estimate, and they are particularly important in the case of smoothed residual plots, since smoothers can make a plot appear to have some nonlinear structure even from random data with no underlying structure. Estimating confidence intervals for smoothed estimates presents various problems, such as bias correction, multiple testing, and determining the shape of an estimate as opposed to its value (see, for example, Hastie and Tibshirani, Sect. 3.8 (20)). Pointwise confidence intervals are relatively simple to estimate, at least if a linear smoother is used, and are certainly useful, although care should be taken in interpreting them; there is an additional problem due to correlation between adjusted residuals for different individuals and the variance due to adding the linear estimate from the Cox model. The variance of the adjusted residual $\hat M_i/\hat E_i$ can be estimated by $1/\hat E_i$. However, the variance of the linear estimate and the covariance of $\hat M_i/\hat E_i$ and $\hat M_{i'}/\hat E_{i'}$ for i ≠ i′ are usually small compared with the variance of $\hat M_i/\hat E_i$ for each individual, so approximate confidence intervals can be found by estimating the variance of the vector of adjusted residuals by the diagonal matrix with diagonal elements equal to $1/\hat E_i$. If the weighted smooth against the kth covariate is represented by the linear smoothing matrix L_k, the variance of the smooth estimate [4] can be estimated by this diagonal matrix premultiplied by L_k and postmultiplied by L_k^T.

Example. Figure 4 shows the weighted smooth of the adjusted martingale residuals [4], with the linear term added, for the continuous covariates age, sβ2m, and serum creatinine in the Cox model of Table 5. The smoother is a smoothing spline with 7 degrees of freedom, and the shaded histograms on the plots indicate the distribution of the covariates.

As a result of Figure 4, the linear terms for log2(sβ2m), log2(serum creatinine), and age in the Cox model were replaced by continuous piecewise linear functions. The variable log2(sβ2m) was split into four, for values up to 2, between 2 and 3, between 3 and 5, and greater than 5. Similarly, log2(serum creatinine) was split into two variables for values greater than and less than 10, and age was split into two at 55 years. These cutpoints are shown as vertical dashed lines on Figure 4. A new variable was defined equal to age when age is less than 55 and equal to 55 otherwise, and a second variable was defined equal to age when age is greater than 55 and equal to 55 otherwise; the sβ2m and serum creatinine variables were constructed similarly, as sketched below.
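A minimal sketch of this piecewise linear construction follows (Python; hypothetical column names, with the cutpoints taken from the text). For each segment the returned column equals the covariate inside the segment and is held constant at the nearest cutpoint outside it, matching the age-at-55 example above.

import numpy as np
import pandas as pd

def piecewise_linear_terms(x, cuts):
    """One column per segment: equal to x within the segment, constant outside it."""
    x = np.asarray(x, float)
    knots = [-np.inf] + list(cuts) + [np.inf]
    cols = {}
    for lo, hi in zip(knots[:-1], knots[1:]):
        cols[f"clip({lo}, {hi})"] = np.clip(x, lo, hi)
    return pd.DataFrame(cols)

# For example (hypothetical data frame df):
#   age_terms  = piecewise_linear_terms(df["age"], cuts=[55])
#   sb2m_terms = piecewise_linear_terms(np.log2(df["sb2m"]), cuts=[2, 3, 5])
# These columns replace the single linear term for each covariate in the Cox model.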

Figure 4 Estimate of covariate transformation using smoothed adjusted martingale residuals, with approximate 95% confidence intervals.

The resulting estimates are shown in Table 6. The value of minus twice the log partial likelihood for this model is 48 less than that for the model in Table 5, for the addition of five extra variables, so by a partial likelihood ratio test the new model is certainly a better fit, even allowing for the data-driven choice of cutpoints. Thus, according to this model, it is both very young and older patients that have an increased hazard. In general, the hazard seems to depend on the level of sβ2m in a fairly complicated way: an increase in sβ2m is associated with an increased hazard both for very high values and for more typical values, whereas very low values may not indicate a correspondingly lower hazard, but extremely high values of sβ2m seem to be associated with an additional increase in the hazard. After adjusting for sβ2m, serum creatinine has no statistically significant association with increased hazard, either in general or for extreme values. The model in Table 6 does not give a particularly good estimate of the shape of the functions β_j(z_j); Figure 4 itself is more appropriate for that. It does, however, give a better idea of the strength of the effects and the standard errors of the estimates.
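The partial likelihood ratio comparison quoted above is a routine chi-square calculation; a two-line check (Python, with scipy assumed) is

from scipy.stats import chi2

p_value = chi2.sf(48, df=5)            # drop of 48 in -2 log partial likelihood, 5 extra parameters
critical_value = chi2.ppf(0.95, df=5)  # about 11.07, so the observed drop of 48 is decisive

As noted above, the data-driven choice of cutpoints makes the nominal reference distribution optimistic, but the drop is large enough that the conclusion is unaffected.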

Table 6 Cox Proportional Hazards Model Using Continuous Piecewise Linear Covariate Effects. Hazard ratios with 95% confidence intervals are given for: age up to 55 years (per 10 years); age after 55 years (per 10 years); log2(sβ2m) up to 2; log2(sβ2m) between 2 and 3; log2(sβ2m) between 3 and 5; log2(sβ2m) above 5; log2(serum creatinine) up to 10; log2(serum creatinine) above 10; ABCM; Cuzick index int. or poor (vs. good); and Cuzick index poor (vs. int. or good). −2 log partial likelihood: 10.045.

Figure 5 shows adjusted martingale residual plots for sβ2m and serum creatinine based on martingale residuals calculated from a model without these covariates; that is, only age, hemoglobin, and the Cuzick index indicator variables are in the model. In contrast to Figure 4, the plot for serum creatinine indicates a strong effect (note that the scales on the y-axes are not the same as in Fig. 4). This is due to the correlation between serum creatinine and sβ2m, as discussed in Section II, and it illustrates the advantage of entering a covariate in a model at least as a linear term before calculating residuals. Recall from Table 4 that, of the 202 individuals with log2(serum creatinine) greater than 7, 184 (91%) had log2(sβ2m) greater than 2.

Figure 5 Estimate of covariate transformation using smoothed adjusted martingale residuals based on the model without sβ2m or serum creatinine.

V. NONPROPORTIONAL HAZARDS: TIME-VARYING COEFFICIENTS

In the Cox model, the hazard ratio of two individuals with covariates Z = z_1 and Z = z_2 is given by exp{β^T(z_2 − z_1)}, which does not vary with time. Covariate effects that are thought to change (on the hazard ratio scale) over time can be modeled by including user-defined time-dependent covariates, but that is not a particularly flexible approach. Rather, one may wish to consider the more general model

\[ \lambda(t \mid Z) = \lambda_0(t) \exp\{ \beta(t)^T Z \} \]

in which the hazard ratio exp{β(t)^T(z_2 − z_1)} is allowed to vary over time through the vector of functions β(t).

One simple approach is to estimate the parameters of the Cox model locally in time. To estimate the parameters at some point t*, one considers a window in time (t_0, t_1] containing t* and estimates β in a standard Cox model, left truncating the data at t_0 and right censoring at t_1, respectively. Although this is computationally intensive, it is conceptually simple and easily implemented in any package capable of doing Cox regression. A disadvantage is that, because the estimate is locally constant, there is bias toward the ends of the range of event times, in the same way as smoothing in general using a running mean smoother results in bias toward the ends. This approach is discussed more fully by Valsecchi et al. (23).

Standard software for fitting a Cox model with time-dependent covariates can be used to estimate β(t) if one is willing to use a parametric regression spline, so that a single covariate X is replaced by a vector Xb(t), where b is a vector basis for the spline (21,22). However, regression splines are not the most flexible approach, and here we are more interested in diagnostic plots that can be used to examine the form of the functions β(t) rather than in direct estimation.

In a similar way to the methods for transformations of covariates, one-step estimates of time-varying coefficients can be found by smoothing residuals against time. The appropriate residuals here are Schoenfeld residuals (24–26). Let t_(1), . . . , t_(d) be the unique event times; then the Schoenfeld residual at event time t_(i) is defined as

\[ \hat r_{(i)} = \sum_{j: T_j = t_{(i)}} Z_j - \frac{\hat S_1(\hat\beta, t_{(i)})}{\hat S_0(\hat\beta, t_{(i)})}, \qquad \hat S_k(\beta, t) = \sum_{j: T_j \ge t} Z_j^k \exp(\beta^T Z_j). \]

Note that the residuals are only defined at death times (there is not a residual that is identically zero at each censoring time), and note also that this is a vector-valued residual. Smoothing each component of the residuals against the event times leads to estimates of each component of the vector of functions β.
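As a concrete illustration of the definition just given, the sketch below (Python/NumPy, not from the chapter) computes the vector-valued Schoenfeld residual at each death time for a fitted Cox model with coefficients beta_hat; ties are ignored, and the inputs are assumed to be plain arrays.

import numpy as np

def schoenfeld_residuals(time, death, Z, beta_hat):
    time = np.asarray(time, float)
    death = np.asarray(death, int)
    Z = np.asarray(Z, float)
    risk = np.exp(Z @ np.asarray(beta_hat, float))
    event_times, residuals = [], []
    for i in np.where(death == 1)[0]:
        at_risk = time >= time[i]                              # risk set at this death time
        w = risk[at_risk]
        zbar = (w[:, None] * Z[at_risk]).sum(axis=0) / w.sum() # S1/S0 at t_(i)
        residuals.append(Z[i] - zbar)
        event_times.append(time[i])
    order = np.argsort(event_times)
    return np.asarray(event_times)[order], np.asarray(residuals)[order]

Smoothing each column of the returned residual matrix against the event times, after the standardization described next, estimates the corresponding component of β(t).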

The residual $\hat r_{(i)}$ has variance $V_{(i)}$ estimated by

\[ \hat V_{(i)} = \frac{\hat S_2(\hat\beta, t_{(i)})}{\hat S_0(\hat\beta, t_{(i)})} - \left\{ \frac{\hat S_1(\hat\beta, t_{(i)})}{\hat S_0(\hat\beta, t_{(i)})} \right\}^2, \qquad \hat S_2(\beta, t) = \sum_{j: T_j \ge t} Z_j^2 \exp(\beta^T Z_j) \]

(with Z_j^2 denoting the outer product Z_j Z_j^T when Z is vector valued). The adjusted residual $\hat V_{(i)}^{-1} \hat r_{(i)}$ then has variance approximately $\hat V_{(i)}^{-1}$. Theory suggests that the expected value of $\hat V_{(i)}^{-1} \hat r_{(i)}$ is approximately equal to $\{\beta(t_{(i)}) - \hat\beta\}$ (24). If one only had a single covariate, the shape (but not the magnitude) of β(t) could be estimated without standardizing, but because the different components are not in general independent, it is necessary to standardize the residuals by premultiplying by the inverse of their variance before smoothing against time. This variance can vary greatly.

Often $\hat V_{(i)}$ will be nearly the same for each event time, in which case the Schoenfeld residuals can be adjusted using $\bar V$, the mean of the $\hat V_{(i)}$s. This saves computational effort, particularly since $\bar V$ is equal to the inverse of the Cox model variance matrix divided by the number of events, and the adjusted Schoenfeld residuals are therefore available without any extra computation in statistics packages (such as S-Plus, Stata). On the other hand, if the risk set becomes very small at later event times or, rather, if the range of covariate values in the risk set becomes small at later event times, then $\hat V_{(i)}$ is likely to be smaller as well, and using the mean $\bar V$ can lead to bias (27); therefore, care must be taken in interpreting the smooths once the risk set is small.

The constant estimates from the Cox model can be added to the smoothed adjusted residuals to estimate the functions β, or the smoothed adjusted residuals can be plotted by themselves to estimate the deviation of β from constant. The residual plots are only one-step estimates, so the covariate should be entered in the Cox model initially at least as constant, since a one-step estimate starting from a constant estimate should be better than a one-step estimate starting from zero.

A number of points from Section IV also apply to the plots in this section. The motivation for the plots is based on using smoothing to estimate expected values; therefore, the smooths should be mean based and not robust, and, for the kth covariate, the inverses of the kth diagonal elements of the matrices $\hat V_{(i)}^{-1}$ should be used as smoothing weights. Smoothing is needed for looking at plots of Schoenfeld residuals, since the adjusted residuals themselves are often highly skewed and nearly uncorrelated. Additionally, trends in Schoenfeld residuals may be obscured by the residuals lying in "bands" for different values of categorical covariates.

Confidence intervals

In the same way as described in Section IV.B, confidence intervals are important for interpreting smooths of adjusted Schoenfeld residuals. Approximate pointwise confidence intervals can be estimated fairly easily if a linear smoother is used (26). Let $\hat\beta^{(i)}_1$ be the adjusted Schoenfeld residuals plus the constant estimate, $\hat\beta^{(i)}_1 = \hat V_{(i)}^{-1} \hat r_{(i)} + \hat\beta$.
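When the average variance $\bar V$ is used, the adjustment reduces to a single matrix product: since $\bar V$ is the inverse of the Cox model variance matrix divided by the number of events d, the adjusted residuals are d times Var(β̂) times r_(i), plus the constant estimate β̂ (the familiar scaled Schoenfeld residual of Grambsch and Therneau). A minimal sketch, assuming the residual matrix r and the estimated covariance matrix of β̂ are already available:

import numpy as np

def scaled_schoenfeld(r, cov_beta, beta_hat):
    """r: (d x p) Schoenfeld residuals, one row per event time;
    cov_beta: (p x p) estimated covariance matrix of beta_hat.
    Returns rows that estimate beta(t_(i)) under the V-bar approximation."""
    d = r.shape[0]                     # number of events
    return beta_hat + d * (r @ cov_beta)

As the text notes, these scaled residuals are available without extra computation in standard survival analysis packages, so in practice the calculation rarely needs to be done by hand.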

The variance of $\hat\beta^{(i)}_1$ can be estimated by $\hat V_{(i)}^{-1}$, and for i ≠ i′, $\hat\beta^{(i)}_1$ and $\hat\beta^{(i')}_1$ are approximately uncorrelated. Thus, for covariate k, the variance of the vector of estimates $\hat\beta^{(i)}_{1k}$ can be estimated by the diagonal matrix with (i)th diagonal element equal to the kth diagonal element of $\hat V_{(i)}^{-1}$. If the weighted smoothing for covariate k is represented by a linear smoothing matrix L_k, then the variance of the smoothed estimate can be estimated by premultiplying the diagonal variance matrix by L_k and postmultiplying by L_k^T.

Example. Figure 6 shows the weighted smooth of the adjusted Schoenfeld residuals, with the constant estimate added, for sβ2m from the fitted Cox model in Table 6: for patients with log2(sβ2m) between 2 and 3 in the first part and for those with log2(sβ2m) above 5 in the second part. The smoother is a smoothing spline with 7 degrees of freedom, and the shaded histograms indicate the distribution of observed deaths. From these plots it seems that high values of sβ2m are associated with an increased hazard initially, but the effect decreases over time, and possibly sβ2m has no prognostic value beyond 2 or 3 years.

Figure 6 Estimates of time-varying coefficients with 95% confidence intervals using smoothed Schoenfeld residuals.

A third Cox model was fitted in which the constant coefficients of log2(sβ2m) were replaced by a piecewise constant coefficient, allowed to have different values up to and beyond 2 years. The four variables for log2(sβ2m) were set to zero after 2 years, and a further variable was defined to have value zero up to 2 years and the value of log2(sβ2m) after 2 years. The results are in Table 7. Again, there does not seem to be any statistically significant association between sβ2m and increased hazard beyond 2 years. Thus, the Cox model gives an idea of the prognostic effect of sβ2m, with error estimates, in each of the two time intervals, but it does not give any idea of how quickly or in what manner the effect decreases over time; an idea of this can be seen from the figure instead.
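The time split used for Table 7 amounts to episode splitting: each subject contributes one record for follow-up before 2 years and, if still under observation, a second record afterward, with separate covariate columns for the two periods. The sketch below (Python/pandas; hypothetical column names, and a single marker column standing in for the four piecewise linear sβ2m terms) shows the data transformation; the resulting start-stop data can be fitted with any Cox routine that accepts time-varying covariates.

import pandas as pd

def split_at_two_years(df, time="time", death="death", marker="log2_sb2m", cut=2.0):
    """Episode splitting at `cut` years: marker_early carries the covariate before
    the cut (zero after), marker_late carries it after the cut (zero before)."""
    rows = []
    for _, r in df.iterrows():
        if r[time] <= cut:
            rows.append({"start": 0.0, "stop": float(r[time]), "death": int(r[death]),
                         "marker_early": r[marker], "marker_late": 0.0})
        else:
            rows.append({"start": 0.0, "stop": cut, "death": 0,
                         "marker_early": r[marker], "marker_late": 0.0})
            rows.append({"start": cut, "stop": float(r[time]), "death": int(r[death]),
                         "marker_early": 0.0, "marker_late": r[marker]})
    return pd.DataFrame(rows)

Fitting a Cox model to marker_early and marker_late then gives one hazard ratio for the first 2 years and one for the period beyond 2 years, as in Table 7.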

Table 7 Cox Model Using Continuous Piecewise Linear Covariate Effects with Coefficients Piecewise Constant in Time. Hazard ratios with 95% confidence intervals are given for: age up to 55 years (per 10 years); age after 55 years (per 10 years); log2(sβ2m) up to 2 in the first two years; log2(sβ2m) between 2 and 3 in the first two years; log2(sβ2m) between 3 and 5 in the first two years; log2(sβ2m) above 5 in the first two years; log2(sβ2m) after the first 2 years; log2(serum creatinine) up to 10; log2(serum creatinine) above 10; ABCM; Cuzick index int. or poor (vs. good); and Cuzick index poor (vs. int. or good). −2 log partial likelihood: 10.013.

Therefore it might be necessary to study the effect of a continuous prognostic variable without assuming either proportional hazards or a particular form with respect to the covariate.

VI.

Denote by X the discretised covariate of interest, and model the conditional hazard λ(t | Z, X) through a set of "baseline" hazard functions λ_k(t), one for each level of the discretised covariate. The functions λ_k(t) can be estimated using smoothing (28). However, estimating the set of functions λ_k now means estimating a function of both time and the covariate X. The value of minus twice the log