
Statistics in the Health Sciences
Theory, Applications, and Computing

By
Albert Vexler
The State University of New York at Buffalo, USA
Alan D. Hutson
Roswell Park Cancer Institute, USA
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not
warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB®
software or related products does not constitute endorsement or sponsorship by The MathWorks of a
particular pedagogical approach or particular use of the MATLAB® software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

International Standard Book Number-13: 978-1-138-19689-6 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying, microfilming, and recording, or in any information storage or retrieval
system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.
com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Names: Vexler, Albert, author. | Hutson, Alan D., author.


Title: Statistics in the health sciences : theory, applications, and
computing / Albert Vexler and Alan D. Hutson.
Description: Boca Raton, Florida : CRC Press, [2018] | Includes
bibliographical references and index.
Identifiers: LCCN 2017031841| ISBN 9781138196896 (hardback) | ISBN
9781315293776 (e-book) | ISBN 9781315293769 (e-book) | ISBN 9781315293752
(e-book) | ISBN 9781315293745 (e-book)
Subjects: LCSH: Medical statistics. | Medicine–Research–Statistical
methods. | Epidemiology–Statistical methods. | Statistical hypothesis
testing.
Classification: LCC R853.S7 V48 2018 | DDC 610.2/1–dc23
LC record available at https://lccn.loc.gov/2017031841

Visit the Taylor & Francis Web site at


http://www.taylorandfrancis.com

and the CRC Press Web site at


http://www.crcpress.com
To my parents, Octyabrina and Alexander, and my son, David

Albert Vexler

To my wife, Brenda, and three kids, Nick, Chance, and Trey

Alan D. Hutson
Table of Contents

Preface ................................................................................................................... xiii


Authors ................................................................................................................ xvii

1. Prelude: Preliminary Tools and Foundations ........................................... 1


1.1 Introduction ...........................................................................................1
1.2 Limits ...................................................................................................... 1
1.3 Random Variables .................................................................................2
1.4 Probability Distributions .....................................................................3
1.5 Commonly Used Parametric Distribution Functions ......................5
1.6 Expectation and Integration ................................................................6
1.7 Basic Modes of Convergence of a Sequence of
Random Variables .................................................................................8
1.7.1 Convergence in Probability ....................................................9
1.7.2 Almost Sure Convergence ......................................................9
1.7.3 Convergence in Distribution ..................................................9
1.7.4 Convergence in rth Mean ..................................................... 10
1.7.5 O(.) and o(.) Revised under Stochastic Regimes ................ 10
1.7.6 Basic Associations between the
Modes of Convergence .......................................................... 10
1.8 Indicator Functions and Their Bounds as Applied to
Simple Proofs of Propositions ........................................................... 10
1.9 Taylor’s Theorem ................................................................................. 12
1.10 Complex Variables .............................................................................. 14
1.11 Statistical Software: R and SAS ......................................................... 16
1.11.1 R Software ............................................................................... 16
1.11.2 SAS Software ..........................................................................22

2. Characteristic Function–Based Inference ................................................ 31


2.1 Introduction ......................................................................................... 31
2.2 Elementary Properties of Characteristic Functions ....................... 32
2.3 One-to-One Mapping .........................................................................34
2.3.1 Proof of the Inversion Theorem ........................................... 35
2.4 Applications ......................................................................................... 40
2.4.1 Expected Lengths of Random Stopping Times and
Renewal Functions in Light of
Tauberian Theorems .............................................................. 40
2.4.2 Risk-Efficient Estimation....................................................... 46
2.4.3 Khinchin’s (or Hinchin’s) Form of
the Law of Large Numbers ................................................... 48
2.4.4 Analytical Forms of Distribution Functions ...................... 49


2.4.5 Central Limit Theorem ......................................................... 50


2.4.5.1 Principles of Monte Carlo Simulations ............... 51
2.4.5.2 That Is the Question: Do Convergence Rates
Matter? .....................................................................54
2.4.6 Problems of Reconstructing the General Distribution
Based on the Distribution of Some Statistics .....................54
2.4.7 Extensions and Estimations of Families of
Distribution Functions .......................................................... 56

3. Likelihood Tenet ...........................................................................................59


3.1 Introduction ......................................................................................... 59
3.2 Why Likelihood? An Intuitive Point of View ................................. 61
3.3 Maximum Likelihood Estimation ....................................................63
3.4 The Likelihood Ratio .......................................................................... 69
3.5 The Intrinsic Relationship between the Likelihood Ratio
Test Statistic and the Likelihood Ratio of Test Statistics: One
More Reason to Use Likelihood ........................................................ 73
3.6 Maximum Likelihood Ratio .............................................................. 76
3.7 An Example of Correct Model-Based Likelihood Formation .......80

4. Martingale Type Statistics and Their Applications ..............................83


4.1 Introduction .........................................................................................83
4.2 Terminology .........................................................................................84
4.3 The Optional Stopping Theorem and Its
Corollaries: Wald’s Lemma and Doob’s Inequality........................ 90
4.4 Applications ......................................................................................... 93
4.4.1 The Martingale Principle for Testing
Statistical Hypotheses ........................................................... 93
4.4.1.1 Maximum Likelihood Ratio in Light of the
Martingale Concept ............................................... 96
4.4.1.2 Likelihood Ratios Based on
Representative Values ............................................ 97
4.4.2 Guaranteed Type I Error Rate Control of the
Likelihood Ratio Tests ......................................................... 100
4.4.3 Retrospective Change Point Detection Policies ............... 100
4.4.3.1 The Cumulative Sum (CUSUM) Technique ..... 101
4.4.3.2 The Shiryayev–Roberts (SR)
Statistic-Based Techniques .................................. 102
4.4.4 Adaptive (Nonanticipating) Maximum Likelihood
Estimation ............................................................................. 105
4.4.5 Sequential Change Point Detection Policies .................... 109
4.5 Transformation of Change Point Detection Methods into a
Shiryayev–Roberts Form .................................................................. 113
4.5.1 Motivation ............................................................................. 114
4.5.2 The Method........................................................................... 115

4.5.3 CUSUM versus Shiryayev–Roberts................................... 120


4.5.4 A Nonparametric Example ................................................. 123
4.5.5 Monte Carlo Simulation Study........................................... 124

5. Bayes Factor..................................................................................................129
5.1 Introduction ....................................................................................... 129
5.2 Integrated Most Powerful Tests ...................................................... 132
5.3 Bayes Factor........................................................................................ 136
5.3.1 The Computation of Bayes Factors .................................... 137
5.3.1.1 Asymptotic Approximations .............................. 142
5.3.1.2 Simple Monte Carlo, Importance
Sampling, and Gaussian Quadrature................ 147
5.3.1.3 Generating Samples from the Posterior ............ 147
5.3.1.4 Combining Simulation and Asymptotic
Approximations .................................................... 148
5.3.2 The Choice of Prior Probability Distributions ................. 149
5.3.3 Decision-Making Rules Based on the
Bayes Factor .......................................................................... 153
5.3.4 A Data Example: Application of the
Bayes Factor .......................................................................... 158
5.4 Remarks .............................................................................................. 161

6. A Brief Review of Sequential Methods ................................................. 165


6.1 Introduction ....................................................................................... 165
6.2 Two-Stage Designs ............................................................................ 167
6.3 Sequential Probability Ratio Test .................................................... 169
6.4 The Central Limit Theorem for a Stopping Time ......................... 177
6.5 Group Sequential Tests..................................................................... 179
6.6 Adaptive Sequential Designs .......................................................... 181
6.7 Futility Analysis ................................................................................ 182
6.8 Post-Sequential Analysis .................................................................. 182

7. A Brief Review of Receiver Operating


Characteristic Curve Analyses.................................................................185
7.1 Introduction ....................................................................................... 185
7.2 ROC Curve Inference ....................................................................... 186
7.3 Area under the ROC Curve ............................................................. 189
7.3.1 Parametric Approach .......................................................... 190
7.3.2 Nonparametric Approach................................................... 192
7.4 ROC Curve Analysis and Logistic Regression:
Comparison and Overestimation ................................................... 193
7.4.1 Retrospective and Prospective ROC ................................. 195
7.4.2 Expected Bias of the ROC Curve and
Overestimation of the AUC ................................................ 196
7.4.3 Example ................................................................................. 198

7.5 Best Combinations Based on Values of


Multiple Biomarkers ......................................................................... 201
7.5.1 Parametric Method .............................................................. 202
7.5.2 Nonparametric Method ...................................................... 202
7.6 Remarks .............................................................................................. 202

8. The Ville and Wald Inequality: Extensions and Applications .........205


8.1 Introduction ....................................................................................... 205
8.2 The Ville and Wald inequality ........................................................ 208
8.3 The Ville and Wald Inequality Modified in Terms of
Sums of iid Observations ................................................................. 209
8.4 Applications to Statistical Procedures ........................................... 213
8.4.1 Confidence Sequences and Tests with Uniformly
Small Error Probability for the Mean of a Normal
Distribution with Known Variance ................................... 213
8.4.2 Confidence Sequences for the Median.............................. 215
8.4.3 Test with Power One............................................................ 217

9. Brief Comments on Confidence Intervals and p-Values.....................219


9.1 Confidence Intervals ......................................................................... 219
9.2 p-Values ...............................................................................................223
9.2.1 The EPV in the Context of an ROC Curve Analysis ...... 226
9.2.2 Student’s t-Test versus Welch’s t-Test ................................. 227
9.2.3 The Connection between EPV and Power ....................... 229
9.2.4 t-Tests versus the Wilcoxon Rank-Sum Test ..................... 231
9.2.5 Discussion ............................................................................. 236

10. Empirical Likelihood .................................................................................239


10.1 Introduction ....................................................................................... 239
10.2 Empirical Likelihood Methods ....................................................... 239
10.3 Techniques for Analyzing Empirical Likelihoods ....................... 242
10.3.1 Practical Theoretical Tools for Analyzing ELs ................ 249
10.4 Combining Likelihoods to Construct Composite
Tests and to Incorporate the Maximum
Data-Driven Information ................................................................. 253
10.5 Bayesians and Empirical Likelihood .............................................254
10.6 Three Key Arguments That Support the Empirical
Likelihood Methodology as a Practical Statistical
Analysis Tool ..................................................................................... 255
Appendix..................................................................................................... 256

11. Jackknife and Bootstrap Methods ..........................................................259


11.1 Introduction ....................................................................................... 259
11.2 Jackknife Bias Estimation ................................................................ 261
11.3 Jackknife Variance Estimation ........................................................ 263

11.4 Confidence Interval Definition ....................................................... 263


11.5 Approximate Confidence Intervals ................................................ 264
11.6 Variance Stabilization ....................................................................... 265
11.7 Bootstrap Methods ............................................................................ 266
11.8 Nonparametric Simulation .............................................................. 267
11.9 Resampling Algorithms with SAS and R ...................................... 270
11.10 Bootstrap Confidence Intervals....................................................... 279
11.11 Use of the Edgeworth Expansion to Illustrate the
Accuracy of Bootstrap Intervals......................................................284
11.12 Bootstrap-t Percentile Intervals....................................................... 288
11.13 Further Refinement of Bootstrap Confidence Intervals .............. 291
11.14 Bootstrap Tilting................................................................................ 303

12 Examples of Homework Questions.........................................................305


12.1 Homework 1.......................................................................................305
12.2 Homework 2.......................................................................................306
12.3 Homework 3.......................................................................................306
12.4 Homework 4....................................................................................... 307
12.5 Homework 5.......................................................................................308
12.6 Homeworks 6 and 7 .......................................................................... 311

13 Examples of Exams ..................................................................................... 315


13.1 Midterm Exams ................................................................................. 315
Example 1 ...................................................................................................... 315
Example 2 ...................................................................................................... 316
Example 3 ...................................................................................................... 317
Example 4 ...................................................................................................... 318
Example 5 ...................................................................................................... 319
Example 6 ...................................................................................................... 320
Example 7 ...................................................................................................... 322
Example 8 ...................................................................................................... 324
Example 9 ...................................................................................................... 325
13.2 Final Exams........................................................................................ 326
Example 1 ...................................................................................................... 326
Example 2 ...................................................................................................... 328
Example 3 ...................................................................................................... 329
Example 4 ...................................................................................................... 331
Example 5 ...................................................................................................... 332
Example 6 ...................................................................................................... 333
Example 7 ...................................................................................................... 335
Example 8 ...................................................................................................... 336
Example 9 ...................................................................................................... 339
13.3 Qualifying Exams .............................................................................340
Example 1 ......................................................................................................340
Example 2 ......................................................................................................342

Example 3 ......................................................................................................345
Example 4 ......................................................................................................348
Example 5 ...................................................................................................... 350
Example 6 ......................................................................................................354
Example 7 ...................................................................................................... 357
Example 8 ...................................................................................................... 359

14 Examples of Course Projects ....................................................................363


14.1 Change Point Problems in the Model of Logistic Regression
Subject to Measurement Errors ....................................................... 363
14.2 Bayesian Inference for Best Combinations Based on Values
of Multiple Biomarkers ..................................................................... 363
14.3 Empirical Bayesian Inference for Best Combinations Based
on Values of Multiple Biomarkers...................................................364
14.4 Best Combinations Based on Log Normally Distributed
Values of Multiple Biomarkers ........................................................364
14.5 Empirical Likelihood Ratio Tests for Parameters of
Linear Regressions ............................................................................ 365
14.6 Penalized Empirical Likelihood Estimation ................................. 366
14.7 An Improvement of the AUC-Based Interference ........................ 366
14.8 Composite Estimation of the Mean based on
Log-Normally Distributed Observations ...................................... 366

References ........................................................................................................... 369


Author Index ....................................................................................................... 381
Subject Index ...................................................................................................... 387
Preface—Please Read!

In all likelihood, the Universe of Statistical Science is “relatively” or “privately” infinite and expanding in a similar manner to our Universe, satisfying the property of a science that is alive. Hence, it would be an absolute impossibility to include everything in one book written by humans. One of
our major goals is to provide the readers a small but efficient “probe” that
could assist in Statistical Space discoveries.
The primary objective of this book is to provide a compendium of statis-
tical techniques ranging from classical methods through bootstrap strategies
to modern recently developed statistical techniques. These methodologies
may be applied to various problems encountered in health-related studies.
Historically, initial developments in statistical science were induced by
real-life problems, when appropriate statistical instruments employed
empirical arguments. Perhaps, since the eighteenth century, the heavy con-
volution between mathematics, probability theory, and statistical methods
has provided the fundamental structures for correct statistical and biosta-
tistical techniques. However, we cannot ignore a recent trend toward redun-
dant simplifications of statistical considerations via so-called “intuitive” and
“applied” claims in complex statistical applications. It is our experience that
applied statisticians or users often neglect the underlying postulates when
implementing formal statistical procedures and with respect to the inter-
pretation of their results. A very important motivation towards writing this
book was to better refocus the scientist towards understanding the underpin-
nings of appropriate statistical inference in a well-rounded fashion. Maybe
now is the time to draw more attention from theoretical and applied researchers
to methodological standards in statistics and biostatistics? In contrast with


many biostatistical books, we focus on rigorous formal proof schemes and


their extensions regarding different statistical principles. We also show the
basic ingredients and methods for constructing and examining correct and
powerful statistical processes.
The material in the book should be appropriate for use both as a text and
as a reference. In our book readers can find classical and new theoretical
methods, open problems, and new procedures across a variety of topics for
their scholarly investigations. We present results that are novel to the current
set of books on the market and results that are even new with respect to the
modern scientific literature. Our aim is to draw the attention of theoretical
statisticians and practitioners in epidemiology and/or clinical research to
the necessity of new developments, extensions and investigations related to
statistical methods and their applications. In this context, for example, we would like to emphasize the following aspects for those who are interested in advanced topics. Chapter 1 lays out a variety of notations, techniques and
foundations basic to the material that is treated in this book. Chapter 2 intro-
duces the powerful analytical instruments that, e.g., consist of principles
of Tauberian theorems, including new results, with applications to convo-
lution problems, evaluations of sequential procedures, renewal functions,
and risk-efficient estimations. In this chapter we also consider problems of
reconstructing the general distribution based on the distribution of some
observed statistics. Chapter 3 shows certain nontrivial conclusions regard-
ing the parametric likelihood ratios. Chapter 4 is developed to demonstrate
a strong theoretical instrument based on martingales and their statistical
applications, which include the martingale principle for testing statistical
hypotheses and comparisons between the cumulative sum technique and
the Shiryayev–Roberts approach employed in change point detection poli-
cies. A part of the material shown in Chapter 4 can be found only in this book.
Chapter 5 can assist the statistician in developing and analyzing various
Bayesian procedures. Chapter 8 provides the fundamental components for
constructing unconventional statistical decision-making procedures with
power one. Chapter 9 proposes novel approaches to examine, compare, and
visualize properties of various statistical tests using correct p-value-based
mechanisms. Chapter 10 introduces the empirical likelihood methodol-
ogy. The theoretical propositions shown in Chapter 10 can lead to a quite
mechanical and simple way to investigate properties of nonparametric like-
lihood–based statistical schemes. Several of these results can be found only
in this book. In Chapter 14 one can discover interesting open problems. This
book also provides software code based on both the R and SAS statistical
software packages to exemplify the statistical methodological topics and
their applied aspects.
Indeed, we focused essentially on developing a very informative textbook
that introduces classical and novel statistical methods with respect to vari-
ous biostatistical applications. Towards this end we employ our experience
and relevant material obtained via our research and teaching activity across

10–15 years of biostatistical practice and training Master’s- and PhD-level stu-
dents in the department of biostatistics. This book is intended for gradu-
ate students majoring in statistics, biostatistics, epidemiology, health-related
sciences, and/or in a field where a statistics concentration is desirable, par-
ticularly for those who are interested in formal statistical mechanisms and
their evaluations. In this context Chapters 1–10 and 12–14 provide teach-
ing sources for a high level statistical theory course that can be taught in a
statistical/biostatistical department. The presented material evolved in con-
junction with teaching such a one-semester course at the State University of New York at Buffalo. This course, entitled “Theory of Statistical Inference,”
has belonged to a set of the four core courses required for our biostatistics
PhD program.
This textbook delivers a “ready-to-go” well-structured product to be
employed in developing advanced biostatistical courses. We offer lectures,
homework questions, and their solutions, examples of midterm, final, and
Ph.D. qualifying exams as well as examples of students’ projects.
One of the ideas regarding this book’s development is that we combine
presentations of traditional applied and theoretical statistical methods with
computationally extensive bootstrap type procedures that are relatively
novel data-driven statistical tools. Chapter 11 is designed to help instructors develop a statistical course that introduces the Jackknife and Bootstrap
methods. The focus is on the statistical functional as the key component of
the theoretical developments with applied examples provided to illustrate
the corresponding theory.
We strongly suggest beginning lectures by asking students to smile!
It is recommended to start each lecture class by answering students’ inqui-
ries regarding the previously assigned homework problems. In this course,
we assume that students are encouraged to present their work in class
regarding individually tailored research projects (e.g., Chapter 14). In this
manner, the material of the course can be significantly extended.
Our intent is not that this book competes with classical fundamental
guides such as, e.g., Bickel and Doksum (2007), Borovkov (1998), and Serfling
(2002). In our course, we encourage scholars to read the essential works of the
world-renowned authors. We aim to present different theoretical approaches
that are commonly used in modern statistics and biostatistics to (1) analyze
properties of statistical mechanisms; (2) compare statistical procedures; and
(3) develop efficient (optimal) statistical schemes. Our target is to provide
scholars with research seeds to spark new ideas. Towards this end, we dem-
onstrate open problems, basic ingredients in learning complex statistical
notations and tools as well as advanced nonconventional methods, even in
simple cases of statistical operations, providing “Warning” remarks to show
potential difficulties related to the issues that were discussed.
Finally, we would like to note that this book attempts to represent a part of
our life that definitely consists of mistakes, stereotypes, puzzles, and so on,
that we all love. Thus our book cannot be perfect. We truly thank the reader

for his/her participation in our life! We hope that the presented material can
play a role as prior information for various research outputs.

Albert Vexler
Alan D. Hutson
Authors

Albert Vexler, PhD, obtained his PhD degree in Statistics and Probability
Theory from the Hebrew University of Jerusalem in 2003. His PhD advi-
sor was Moshe Pollak, a fellow of the American Statistical Association and
Marcy Bogen Professor of Statistics at Hebrew University. Dr. Vexler was a
postdoctoral research fellow in the Biometry and Mathematical Statistics
Branch at the National Institute of Child Health and Human Development
(National Institutes of Health). Currently, Dr. Vexler is a tenured Full Professor
at the State University of New York at Buffalo, Department of Biostatistics.
Dr. Vexler has authored and co-authored various publications that contribute
to both the theoretical and applied aspects of statistics in medical research.
Many of his papers and statistical software developments have appeared in
statistical/biostatistical journals, which have top-rated impact factors and
are historically recognized as the leading scientific journals, and include:
Biometrics, Biometrika, Journal of Statistical Software, The American Statistician,
The Annals of Applied Statistics, Statistical Methods in Medical Research,
Biostatistics, Journal of Computational Biology, Statistics in Medicine, Statistics and
Computing, Computational Statistics and Data Analysis, Scandinavian Journal of
Statistics, Biometrical Journal, Statistics in Biopharmaceutical Research, Stochastic
Processes and Their Applications, Journal of Statistical Planning and Inference,
Annals of the Institute of Statistical Mathematics, The Canadian Journal of Statistics,
Metrika, Statistics, Journal of Applied Statistics, Journal of Nonparametric Statistics,
Communications in Statistics, Sequential Analysis, The STATA Journal; American
Journal of Epidemiology, Epidemiology, Paediatric and Perinatal Epidemiology,
Academic Radiology, The Journal of Clinical Endocrinology & Metabolism, Journal of
Addiction Medicine, and Reproductive Toxicology and Human Reproduction.
Dr. Vexler was awarded National Institutes of Health (NIH) grants to
develop novel nonparametric data analysis and statistical methodology. His
research interests are related to the following subjects: receiver operating
characteristic curves analysis; measurement error; optimal designs; regres-
sion models; censored data; change point problems; sequential analysis; sta-
tistical epidemiology; Bayesian decision-making mechanisms; asymptotic
methods of statistics; forecasting; sampling; optimal testing; nonparametric
tests; empirical likelihoods, renewal theory; Tauberian theorems; time series;
categorical analysis; multivariate analysis; multivariate testing of complex
hypotheses; factor and principal component analysis; statistical biomarker
evaluations; and best combinations of biomarkers. Dr. Vexler is Associate
Editor for Biometrics and Journal of Applied Statistics. These journals belong to
the first cohort of academic literature related to the methodology of biostatis-
tical and epidemiological research and clinical trials.


Alan D. Hutson, PhD, received his BA (1988) and MA (1990) in Statistics


from the State University of New York (SUNY) at Buffalo. He then worked
for Otsuka America Pharmaceuticals for two years as a biostatistician.
Dr. Hutson then received his MA (1993) and PhD (1996) in Statistics from the
University of Rochester. His PhD advisor was Professor Govind Mudholkar,
a world-renowned researcher in Statistics and Biostatistics. He was hired as a
biostatistician at the University of Florida in 1996 as a Research Assistant
Professor and worked his way to a tenured Associate Professor. He had
several roles at the University of Florida including Interim Director of
the Division of Biostatistics and Director of the General Clinical Research
Informatics Core. Dr. Hutson moved to the University at Buffalo in 2002 as
an Associate Professor and Chief of the Division of Biostatistics. He was the
founding chair of the new Department of Biostatistics in 2003 and became a
full professor in 2007. His accomplishments as Chair included the implemen-
tation of several new undergraduate and graduate degree programs and a
substantial growth in the size and quality of the department faculty and stu-
dents. In 2005, Dr. Hutson also became Chair of Biostatistics (now Biostatistics
and Bioinformatics) at Roswell Park Cancer Institute (RPCI), was appointed
Professor of Oncology, and became the Director of the Core Cancer Center
Biostatistics Core. Dr. Hutson helped implement the new Bioinformatics
Core at RPCI. He recently became the Deputy Group Statistician for the
NCI national NRG cancer cooperative group. Dr. Hutson is a Fellow of the
American Statistical Association. He is Associate Editor of Communications
in Statistics, Associate Editor of the Sri Lankan Journal of Applied Statistics, and
is a New York State NYSTAR Distinguished Professor. He has membership
on several data safety and monitoring boards and has served on several high
level scientific review panels. He has over 200 peer-reviewed publications. In
2013, Dr. Hutson was inducted into the Delta Omega Public Health Honor
Society, Gamma Lambda Chapter. His methodological work focuses on non-
parametric methods for biostatistical applications as it pertains to statistical
functionals. He has several years of experience in the design and analysis of
clinical trials.
1
Prelude: Preliminary Tools and Foundations

1.1 Introduction
The purpose of this opening chapter is to supplement the reader’s knowledge
of elementary mathematical analysis, probability theory, and statistics,
which the reader is assumed to have at his or her disposal. This chapter is
intended to outline the foundational components and definitions as treated
in subsequent chapters. The introductory literature, including various foun-
dational textbooks in the fields, can be easily found to complete this chap-
ter material in details. In this chapter we also present several important
comments regarding the implementation of mathematical constructs in the
proofs of the statistical theorems that will provide the reader with a rigorous
understanding of certain fundamental concepts. In addition to the literature
cited in this chapter we suggest the reader consult the book of Petrov (1995)
as a fundamental source. The book of Vexler et al. (2016 a) will provide the
readers an entree into the following cross-cutting topics: Data, Statistical
Hypotheses, Errors Related to the Statistical Testing Mechanism (Type
I and II Errors), P-values, Parametric and Nonparametric Approaches,
Bootstrap, Permutation Testing, Measurement Errors.
In this book we primarily deal with certain methodological and practical
tools of biostatistics. Before introducing powerful statistical and biostatisti-
cal methods it is necessary to outline some important fundamental concepts
such as mathematical limits, random variables, distribution functions, inte-
gration, convergence, complex variables and examples of statistical software
in the following sections.

1.2 Limits
The statistical discipline deals particularly with real-world objects but its
asymptotic theory oftentimes consists of concepts that employ the symbols
“→” (convergence) and “∞” (infinity). These concepts are not related directly
to a fantasy; they just represent formal notations associated with asymptotic
techniques relative to how a theoretical approximation may actually behave


in real-world applications with finite objects. For example, we will use the formal limit notation f(x) → a as x → ∞ (or lim_{x→∞} f(x) = a), where a is a constant. In this case we do not pretend that there is a real number such that x = ∞ or x has a value that is close to infinity. We formulate that for all ε > 0 (in the math language: ∀ε > 0) there exists a constant A (∃A > 0) such that the inequality ‖f(x) − a‖ < ε holds for all x > A. By the notation ‖·‖ we mean generally a measure, an operation that satisfies certain properties. In the example mentioned above ‖f(x) − a‖ can be considered as the absolute value of the distance f(x) − a, that is, ‖f(x) − a‖ = |f(x) − a|. In this context, one can say for any real value ε > 0 we can find a number A such that −ε < f(x) − a < ε for all x > A. In this framework it is not difficult to denote different asymptotic notational devices. For example, the form f(x) → ∞ as x → ∞ (or lim_{x→∞} f(x) = ∞) means ∀ε > 0: ∃A > 0 such that f(x) > ε for ∀x > A. The function f can denote a series or a sequence of variables, when f is a function of the natural number n = 1, 2, 3, .... In this case we use the notation f_n and can consider, e.g., the formal definition of lim_{n→∞} f_n in a similar manner to that of lim_{x→∞} f(x).
Oftentimes asymptotic propositions treat their results using the symbols
O (⋅) , o (⋅) , and ~, which are called “big Oh,” “little oh,” and “tilde/twiddle,”
respectively. These symbols provide comparisons between the magnitude
of two functions, say u(x) and v(x). The notation u(x) = O(v(x)) means that u(x)/v(x) remains bounded if the argument x is sufficiently near to some given limit L, which is not necessarily finite. In particular, O(1) stands for a bounded function. The notation u(x) = o(v(x)), as x → L, denotes that lim_{x→L} {u(x)/v(x)} = 0. In this case, o(1) means a function which tends to zero. By u(x) ~ v(x), as x → L, we define the relationship lim_{x→L} {u(x)/v(x)} = 1.

Examples
1. A function G(x) with x > 0 is called slowly varying (at infinity) if for all a > 0 we have G(ax) ~ G(x), as x → ∞; e.g., log(x) is a slowly varying function.
2. It is clear that we can obtain the different representations of the function sin(x), including sin(x) = O(x) as x → 0, sin(x) = o(x²) as x → ∞, and sin(x) = O(1).
3. x² = o(x³) as x → ∞.
4. f_n = 1 − (1 − 1/n)² = o(1) as n → ∞.
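These order relations can also be examined numerically. For example, a minimal R sketch (the object names and grid values below are chosen only for illustration) evaluates the relevant ratios for increasingly extreme arguments:

# Informal numerical checks of the o(.) and O(.) examples above (not proofs).
x.small <- 10^(-(1:6))                     # x -> 0
round(sin(x.small) / x.small, 6)           # sin(x) = O(x) as x -> 0: the ratio stays bounded (near 1)
x.large <- 10^(1:6)                        # x -> infinity
signif(sin(x.large) / x.large^2, 3)        # sin(x) = o(x^2) as x -> infinity: the ratio tends to 0
signif(x.large^2 / x.large^3, 3)           # x^2 = o(x^3): the ratio tends to 0
n <- 10^(1:6)
signif(1 - (1 - 1/n)^2, 3)                 # f_n = 1 - (1 - 1/n)^2 = o(1): tends to 0
a <- 5
round(log(a * x.large) / log(x.large), 4)  # log(x) is slowly varying: log(ax)/log(x) -> 1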

1.3 Random Variables


In order to describe statistical experiments in a mathematical framework
we first introduce a space of elementary events, say ω 1 , ω 2 ,..., that are

independent. For example, we observe a patient who can have red eyes that are caused by allergy, and then we can represent these traits as ω_1 = "has", ω_2 = "does not have". In general, the space Ω = {ω_1, ω_2, ...} can consist of a continuum of points. This scenario can be associated, e.g., with experiments where we measure a temperature or plot randomly distributed dots in an interval. We refer the reader to various probability theory textbooks that introduce definitions of underlying probability spaces, random variables, and probability functions in detail (e.g., Chung, 2000; Proschan and Shaw, 2016). In this section we briefly recall that a random variable ξ(ω): Ω → E is a measurable transformation of Ω elements, possible outcomes, into some set E. Oftentimes, E = R, the real line, or R^d, the d-dimensional
real space. The technical axiomatic definition requires both Ω and E to be
measurable spaces. In particular, we deal with random variables that work
in a “one-to-one mapping” manner, transforming each random event ω i to a
specific real number, ξ(ω i ) ≠ ξ(ω j ), i ≠ j.
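As a minimal illustration of this mapping, one may, for example, simulate the allergy experiment in R, coding ω_1 = "has" as 1 and ω_2 = "does not have" as 0; the probabilities used below are assumed purely for illustration:

# A random variable as a mapping from elementary events to real numbers:
# omega_1 = "has", omega_2 = "does not have"; xi(omega_1) = 1, xi(omega_2) = 0.
set.seed(1)
omega <- sample(c("has", "does not have"), size = 10, replace = TRUE, prob = c(0.3, 0.7))
xi <- ifelse(omega == "has", 1, 0)   # the one-to-one mapping into real numbers
data.frame(omega, xi)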

1.4 Probability Distributions


In clinical studies, researchers collect and analyze data with the goal of solic-
iting useful information and making inferences. Oftentimes recorded data
represents values (realizations) of random variables. However, in general,
experimental data cannot identically and fully characterize relevant ran-
dom variables. Full information associated with a random variable ξ can be
derived from its distribution function, denoted in the form Fξ ( x) = Pr (ξ ≤ x) .
In this example, we consider a right continuous distribution function of the
one-dimensional real-valued random variable. The reader should be familiar
with the ideas of left continuous distribution functions and the correspond-
ing definitions related to random vectors.
Assume we have a function F( x). In order for it to be a distribution function
for a random variable there are certain conditions that need to be satisfied.
For the sake of brevity we introduce the following notation: F(+∞) = lim_{x→∞} F(x), F(−∞) = lim_{x→−∞} F(x), F(x + 0) = lim_{h↓0} F(x + h), and F(x − 0) = lim_{h↓0} F(x − h), where the notation h ↓ 0 means that h > 0 tends to zero from above. Then we call a function F(x) a distribution function if F(x) satisfies the following conditions: (1) F(x) is nondecreasing, that is, F(x + h) ≥ F(x), for all h > 0; (2) F(x) is right-continuous, that is, F(x + 0) = F(x); and (3) F(+∞) = 1 and F(−∞) = 0. This definition ensures that distribution functions are bounded, monotone, and can have only discontinuities of the first kind, i.e., x_0-type arguments at which F(x_0 − 0) ≠ F(x_0 + 0).
Note that the definition mentioned above regarding a distribution func-
tion F(x) does not provide for the existence of dF(x)/dx in general cases. In

a scenario with an absolutely continuous distribution function F(x), one can define

F(x) = ∫_{−∞}^{x} f(t) dt with f(x) = dF(x)/dx,

where f(x) is the density function and the integral is assumed to be defined according to the chapter material presented below. In order to recognize a density function, one can use the definition of a distribution function to say that every function g(x) is the density function of an absolutely continuous distribution function if g(x) satisfies the conditions: (1) g(x) ≥ 0, for all x, and (2) F(+∞) = ∫_{−∞}^{+∞} g(t) dt = 1. The ability to describe random variables using density functions is restricted by opportunities to prove corresponding density functions exist. The ideas presented above pertain to univariate distribution functions and extend to higher dimensions termed multivariate distribution functions. An excellent review book on continuous multivariate distributions is presented by Kotz et al. (2000).
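These defining properties can be checked numerically for any candidate density; for example, a minimal R sketch using the standard normal density as g(x) is:

# Check that g(x) >= 0 and that it integrates to one, and inspect the induced
# distribution function F(x) = integral of g(t) dt from -infinity to x.
g  <- function(x) dnorm(x)                     # candidate density
Fx <- function(x) integrate(g, -Inf, x)$value  # corresponding distribution function
all(g(seq(-5, 5, by = 0.01)) >= 0)             # condition (1): nonnegativity
integrate(g, -Inf, Inf)$value                  # condition (2): total integral equal to 1
c(Fx(-10), Fx(0), Fx(1.96), Fx(10))            # nondecreasing, with limits near 0 and 1
Fx(1.96) - pnorm(1.96)                         # agrees with the built-in normal distribution function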
Note that in parallel with the role of density functions for continuous
random variables one can define probability mass functions in the con-
text associated with discrete random variables. Denote a set of a finite or countably infinite number of values u_1, u_2, u_3, .... Then a random variable ξ is discrete if Σ_j Pr(ξ = u_j) = 1. The function f(u) = Pr(ξ = u) is the probability mass function. In this case the right-continuous distribution function of ξ has the form F(x) = Pr(ξ ≤ x) = Σ_{u_j ≤ x} f(u_j).
Oftentimes, statistical propositions consider independent and identically
distributed (iid) random variables, say ξ 1 , ξ 2 , ξ 3 ,... . In these cases, a sequence
or other collection of random variables is iid if each random variable has
the same probability distribution as the others and all random variables are
mutually independent. That is to say

Pr(ξ1 ≤ t1 , ξ2 ≤ t2 , ξ3 ≤ t3 ,...) = Pr(ξ1 ≤ t1 )Pr(ξ2 ≤ t2 )Pr(ξ3 ≤ t3 )...


= F(t1 )F(t2 )F(t3 )...,

where the distribution function F(t) = Pr(ξ 1 ≤ t).


In order to summarize experimental data our desired goal can be to esti-
mate distribution functions, either nonparametrically, semi-parametrically,
or parametrically. In each setting the goal is to develop accurate, robust, and
efficient tools for estimation. In general, estimation of the mean (if it exists)
and/or the standard deviation (if it exists) of a random variable only partially
describe the underlying data distribution.

1.5 Commonly Used Parametric Distribution Functions


There are a multitude of distribution functions employed in biostatistical
studies such as the normal or Gaussian, log-normal, t, χ 2, gamma, F, bino-
mial, uniform, Wishart, and Poisson distributions. In reality there have
been thousands of functional forms of distributions that have been pub-
lished, studied, and available for use via software packages. Parametric dis-
tribution functions can be defined up to a finite set of parameters (Lindsey, 1996). For example, in the one-dimensional case, the normal (Gaussian) distribution has the notation N(μ, σ²), which corresponds to the density function defined as f(x) = (σ√(2π))^{−1} exp(−(x − μ)²/(2σ²)), where the parameters μ and σ² represent the mean and variance of the population and the support for x is over the entire real line. The shape of the density has been described famously as the bell-shaped curve. The values of the parameters μ and σ² may be assumed to be unknown. If the random variable X has a normal distribution, then Y = exp(X) has a log-normal distribution. Other examples include the gamma distribution, denoted Gamma(α, β), with shape parameter α and rate parameter β. The density function for the gamma distribution is given as (β^α/Γ(α)) x^{α−1} exp(−βx), x > 0, where Γ(α) is a complete gamma function. The exponential distribution, denoted exp(λ), has rate parameter λ with the corresponding density function given as λ exp(−λx), x > 0. Note that the exponential distribution is a special case of the
gamma distribution with α = 1. It is often the case that simpler well-known
models are nested within more complex models with additional parameters.
While the more complex models may fit the data better there is a trade-off in
terms of efficiency when a simpler model also fits the data well.
Figure 1.1 depicts the density functions of the N(0,1), Gamma(2,1) and
exp(1) distributions, for example.
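For example, the curves of Figure 1.1 can be reproduced with the standard R density functions dnorm, dgamma, and dexp; the plotting choices below are one possible option:

# Density functions of N(0,1), Gamma(2,1), and exp(1), as in Figure 1.1.
par(mfrow = c(1, 3))
x1 <- seq(-4, 4, length.out = 400)
plot(x1, dnorm(x1), type = "l", xlab = "x", ylab = "f(x)", main = "(a) N(0,1)")
x2 <- seq(0, 10, length.out = 400)
plot(x2, dgamma(x2, shape = 2, rate = 1), type = "l", xlab = "x", ylab = "f(x)", main = "(b) Gamma(2,1)")
plot(x2, dexp(x2, rate = 1), type = "l", xlab = "x", ylab = "f(x)", main = "(c) exp(1)")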
Note that oftentimes measurements related to biological processes fol-
low a log-normal distribution (Koch, 1966). For example, exponential
growth is combined with further symmetrical variation: with a mean con-
centration of, say, 10^6 bacteria, one cell division more or less will result in 2 × 10^6 or 5 × 10^5 cells. Therefore, the range will be multiplied or divided
by 2 around the mean, that is, the density is asymmetrical. Such skewed
distributions of these types of biological processes have long been known
to fit the log-normal distribution function (Powers, 1936; Sinnott, 1937).
Skewed distributions that often closely fit the log-normal distribution are
particularly common when mean values are low, variances are large, and
values of random variables cannot be negative, as is the case, for instance,
with species abundance and lengths of latent periods of infectious dis-
eases (Lee and Wang, 2013; Balakrishnan et al., 1994). Limpert et al. (2001)
discussed various applications of log-normal distributions in different


FIGURE 1.1
The density plots where the panels (a)–(c) display the density functions of the standard normal
distribution, of the Gamma(2,1) distribution, and of the exp(1) distribution, respectively.

fields of science, including geology, human medicine, environment, atmo-


spheric sciences, microbiology, plant physiology and the social sciences.
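For example, a minimal simulation sketch in R (with parameter values chosen arbitrarily to mimic the bacterial-concentration example) shows how symmetric normal variation on the log scale produces a right-skewed, log-normally distributed outcome:

# If X ~ N(mu, sigma^2), then Y = exp(X) follows a log-normal distribution.
# Here mu = log(10^6) and sigma = log(2), so one "division more or less" roughly
# multiplies or divides the concentration by 2.
set.seed(123)
X <- rnorm(100000, mean = log(1e6), sd = log(2))
Y <- exp(X)
c(mean = mean(Y), median = median(Y))      # mean > median: right (positive) skew
quantile(Y, c(0.16, 0.50, 0.84)) / 1e6     # about (1/2, 1, 2) times 10^6
hist(log(Y), breaks = 60, main = "log(Y) is approximately normal", xlab = "log(Y)")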

1.6 Expectation and Integration


Certain important properties of random variables may be represented suf-
ficiently in the context of mathematical expectations, or simply expectations.
For example, the terms standard deviation, covariance, and mean are examples
of expectations, which are frequently found in the statistical literature.
The expectation of a random variable ξ is denoted in the form E(ξ) and is defined as the integral of ξ with respect to its distribution function, that is, E(ξ) = ∫ u dPr(ξ ≤ u). In general, for a function ψ of the random variable ξ we can define E{ψ(ξ)} = ∫ ψ(u) dPr(ξ ≤ u), e.g., Var(ξ) = ∫ {u − E(ξ)}² dPr(ξ ≤ u).

When we have the density function f(x) = dF(x)/dx, F(x) = Pr(ξ ≤ x), the definition of expectation is clear. In this case, expectation is presented as the traditional Riemann–Stieltjes integral, e.g., accurately approaching E(ξ) by the sums

Σ_i min{u_{i−1} f(u_{i−1}), u_i f(u_i)} (u_i − u_{i−1}) ≤ E(ξ) = ∫ u f(u) du ≤ Σ_i max{u_{i−1} f(u_{i−1}), u_i f(u_i)} (u_i − u_{i−1}),

where u_i < u_j, for i < j, partition the integral support.
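For example, taking ξ to follow the exp(1) distribution, so that E(ξ) = 1, a minimal R sketch computes the lower and upper sums on a fine partition of a truncated support:

# Lower and upper Riemann-type sums bracketing E(xi) for xi ~ exp(1), E(xi) = 1.
f <- function(u) dexp(u)            # density of exp(1)
g <- function(u) u * f(u)           # integrand of E(xi) = integral of u f(u) du
u <- seq(0, 20, by = 0.01)          # partition u_0 < u_1 < ... (support truncated at 20)
du      <- diff(u)
g.left  <- g(u[-length(u)])         # values at the left endpoints u_{i-1}
g.right <- g(u[-1])                 # values at the right endpoints u_i
c(lower = sum(pmin(g.left, g.right) * du),
  upper = sum(pmax(g.left, g.right) * du))   # both are close to E(xi) = 1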



In many scenarios, formally speaking, the derivative dF(x)/dx does not


exist, for example, when a corresponding random variable is categorical
(discrete). Thus, what does the component d Pr (ξ ≤ u) in the E(ξ) definition
mean? In general, the definition of expectation consists of a Lebesgue–
Stieltjes integral. To outline this concept we define the operations infi-
mum and supremum. Assume that A ⊂ R is a set of real numbers. If
M ∈ R is an upper (a lower) bound of A satisfying M ≤ M ′ ( M ≥ M ′) for
every upper (lower) bound M ′ of A, then M is the supremum (infimum)
of A, that is, M = sup( A) (M = inf( A)). These notations are similar to the
maximum and minimum, but are more useful in analysis, since they can
characterize special cases which may have no minimum or maximum. For
instance, let A be the set {x ∈ R : a ≤ x ≤ b} for fixed variables a < b. Then
sup( A) = max( A) = b and inf( A) = min( A) = a. Defining A = {x ∈ R : 0 < x < b},
we have sup( A) = b and inf( A) = 0, whereas this set A does not have a max-
imum and minimum, because, e.g., any given element of A could sim-
ply be divided in half resulting in a smaller number that is still in A,
but there is exactly one infimum inf( A) = 0 ∉ A, which is smaller than all
the positive real numbers and greater than any other number which could
be used as a lower bound.
Lebesgue revealed the revolutionary idea to define the integration ∫_D^U ψ(u) du. He did not follow Newton, Leibniz, Cauchy, and Riemann, who partitioned the x-axis between D and U, the integral bounds. Instead, l = inf_{u∈[D,U]} ψ(u) and L = sup_{u∈[D,U]} ψ(u) were proposed to be defined in order to partition the y-axis between l and L. That is, instead of cutting the area under ψ using a finite number of vertical cuts, we use a finite number of horizontal cuts. The panels (a) and (b) of Figure 1.2 show an attempt to integrate the “problematic” function ψ(u) = u² I(u < 1) + (u² + 1) I(u ≥ 1), where I(.) = 0, 1 denotes the indicator function, using a Riemann-type approach via the sums Σ_i ψ(u_{i−1})(u_i − u_{i−1}) and Σ_i ψ(u_i)(u_i − u_{i−1}). Alternatively, the hatched area in the panel (c) visualizes Lebesgue integration.
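For example, a minimal R sketch computes the two Riemann-type sums for ψ over the interval [0.5, 1.5], corresponding to panels (a) and (b) of Figure 1.2:

# Left- and right-endpoint Riemann sums for the "problematic" function psi.
psi <- function(u) u^2 * (u < 1) + (u^2 + 1) * (u >= 1)
u  <- seq(0.5, 1.5, by = 0.001)                 # partition of [0.5, 1.5]
du <- diff(u)
left.sum  <- sum(psi(u[-length(u)]) * du)       # sum of psi(u_{i-1}) (u_i - u_{i-1})
right.sum <- sum(psi(u[-1]) * du)               # sum of psi(u_i) (u_i - u_{i-1})
exact <- (1.5^3 - 0.5^3) / 3 + 0.5              # integral of u^2 over [0.5, 1.5] plus the jump part
c(left = left.sum, right = right.sum, exact = exact)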


Thus, the main difference between the Lebesgue and Riemann inte-
grals is that the Lebesgue principle attends to the values of the function,
subdividing its range instead of just subdividing the interval on which
the function is defined. This fact makes a difference when the function
is “problematic,” e.g., when it has large oscillations or discontinuities.
For an excellent review of Lebesgue’s methodology we refer the reader to
Bressoud (2008).
We may note, for example, observing n independent and identically distributed data points ξ_1, ..., ξ_n, one can estimate their distribution function via the empirical distribution function F_n(t) = n^{−1} Σ_{i=1}^{n} I(ξ_i ≤ t).

FIGURE 1.2
Riemann-type integration (a, b) and Lebesgue integration (c).

Then the average ξ̄ = n^{−1} Σ_{i=1}^{n} ξ_i, an estimator of E(ξ), can be presented in the integral form ∫ t dF_n(t) using Lebesgue’s integration.
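This identity can be verified directly; for example, a minimal R sketch with simulated data places mass 1/n on each observation and compares ∫ t dF_n(t) with the sample mean:

# The sample mean equals the integral of t with respect to the empirical
# distribution function F_n, which assigns mass 1/n to each observed point.
set.seed(7)
xi <- rexp(50, rate = 1)                  # iid observations xi_1, ..., xi_n
n  <- length(xi)
Fn <- ecdf(xi)                            # empirical distribution function F_n(t)
Fn(median(xi))                            # approximately 0.5
c(integral = sum(xi * (1 / n)),           # integral of t dF_n(t)
  mean     = mean(xi))                    # the two quantities coincide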

1.7 Basic Modes of Convergence of a Sequence of Random Variables
In the context of limit definitions, the measure of a distance between functions and their limit values, denoted as ‖·‖, was mentioned in Section 1.2.
Regarding limits of sequences of random variables, this measure can be
defined differently, depending on the characterizations of random variables

in which we are interested. Thus, statistical asymptotic propositions can


employ the following forms of convergence.

1.7.1 Convergence in Probability


Let ξ1,..., ξn be a sequence of random variables. We write ξn →p ζ as n → ∞, if
lim_{n→∞} Pr(|ξn − ζ| < ε) = 1 for ∀ε > 0, where ζ can be a random variable.
Equivalently, we can formalize this statement in the following form: for ∀ε > 0 and
∀δ > 0, ∃N : ∀n > N we have Pr(|ξn − ζ| ≥ ε) ≤ δ. The formalism ξn →p ζ is named
convergence in probability.
In this framework, a result frequently applied in asymptotic analysis is the
Continuous Mapping Theorem, which states that for every continuous function h(⋅),
if ξn →p ζ then we also have h(ξn) →p h(ζ) (e.g., Serfling, 2002).
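For intuition, the following R sketch (the exponential model, tolerance ε, and sample
sizes are arbitrary choices) tracks Pr(|ξ̄n − μ| ≥ ε) as n grows, illustrating that the
sample mean converges in probability to μ and, by the Continuous Mapping Theorem,
that exp(ξ̄n) converges in probability to exp(μ):

set.seed(123)
mu <- 1; eps <- 0.05
for (n in c(10, 100, 1000, 10000)) {
  # 2000 Monte Carlo copies of the sample mean based on n observations
  xbar <- replicate(2000, mean(rexp(n, rate = 1 / mu)))
  cat("n =", n,
      " Pr(|mean-mu| >= eps) ~", mean(abs(xbar - mu) >= eps),
      " Pr(|exp(mean)-exp(mu)| >= eps) ~", mean(abs(exp(xbar) - exp(mu)) >= eps), "\n")
}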

1.7.2 Almost Sure Convergence


Let ξ1,..., ξn be a sequence of random variables and let ζ be a random variable.
Define the sequence |ξn − ζ|. Then ξ1,..., ξn is said to converge to ζ almost surely
(or almost everywhere, with probability one, strongly, etc.) if Pr(lim_{n→∞} ξn = ζ) = 1.
This can be written in the form ξn →a.s. ζ as n → ∞.
Almost sure convergence is a stronger mode of convergence than convergence in
probability. In fact, an equivalent condition for convergence almost surely is that
lim_{n→∞} Pr(|ξm − ζ| < ε, for all m ≥ n) = 1, for each ε > 0.
Almost sure convergence is essentially the pointwise convergence of a sequence of
functions. In particular, the random variables ξ1,..., ξn and ζ can be regarded as
functions of an element ω from the sample space S, that is, ξi ≡ ξi(ω), i = 1,..., n,
and ζ ≡ ζ(ω). Then ξn →a.s. ζ as n → ∞, if the function ξn(ω) converges to ζ(ω) for
all ω ∈ S, except for those ω ∈ W, W ⊂ S and Pr(W) = 0. Hence this mode of
convergence is named almost sure convergence.

1.7.3 Convergence in Distribution


Let ξ1,..., ξn and ζ be real-valued random variables. We say that ξn converges in
distribution (converges in law or converges weakly) to ζ, writing ξn →d ζ (ξn →L ζ),
if Pr(ξn ≤ x) → Pr(ζ ≤ x) as n → ∞ at every point x that is a continuity point of
the distribution function of ζ.
Convergence in distribution is one of the many ways of describing how
ξ n converges to the limiting random variable ζ. In particular, we are usually
interested in the case that ζ is a constant and the ξ i’s are sample means; this
situation is formally stated as the (weak) law of large numbers.
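A brief Monte Carlo sketch in R (the exponential model, sample size, and evaluation
points are arbitrary) compares the distribution function of the standardized sample
mean with the standard normal distribution function at several continuity points,
illustrating convergence in distribution:

set.seed(321)
n <- 200                                   # sample size in each replication
# standardized sample means of Exp(1) data (mean 1, standard deviation 1)
z <- replicate(5000, sqrt(n) * (mean(rexp(n)) - 1))
x <- c(-1.5, -0.5, 0, 0.5, 1.5)            # continuity points of the limit distribution
rbind(empirical = sapply(x, function(u) mean(z <= u)),
      normal    = pnorm(x))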

1.7.4 Convergence in rth Mean


For r > 0, we say that ξ1,..., ξn, a sequence of random variables, converges in the
rth mean to ζ if lim_{n→∞} E|ξn − ζ|^r = 0.

1.7.5 O(.) and o(.) Revised under Stochastic Regimes


For two sequences of random variables {Un} and {Vn}, the notation Un = Op(Vn)
denotes that Un/Vn = Op(1), where the subscript p means O(.) in probability.
Similarly, the notation Un = op(Vn) denotes that Un/Vn →p 0.

1.7.6 Basic Associations between the Modes of Convergence


The following results are frequently useful in theoretical calculations:

1. Convergence in probability implies convergence in distribution.


2. In the opposite direction, convergence in distribution implies con-
vergence in probability only when the limiting random variable ζ is
a constant.
3. Convergence in probability does not imply almost sure convergence.
4. Almost sure convergence implies convergence in probability, and
hence implies convergence in distribution. It is the notion of conver-
gence used in the strong law of large numbers.

The fundamental book of Serfling (2002) covers a broad range of useful


materials regarding asymptotic behavior of random variables and their
characteristics.

1.8 Indicator Functions and Their Bounds as


Applied to Simple Proofs of Propositions
In Section 1.6 we employed the indicator function I (.). For any event G, the
indicator function I (G) = 1, if G is true and I (G) = 0 , if G is false.
Consider four simple results that can be very helpful for various statistical
explanations. (1) I(G) = 1 − I(not G). (2) I(G1, G2) ≤ I(Gk), k = 1, 2, where I(G1, G2)
denotes the indicator of the event that both G1 and G2 hold. (3) The indicator
function I(G) denotes a random variable with two values 0 and 1. Therefore,
EI(G) = 1 × Pr{I(G) = 1} + 0 × Pr{I(G) = 0} = Pr{G is true}. For example, for a random
variable ξ, its distribution function F(x) = Pr(ξ ≤ x) satisfies F(x) = E{I(ξ ≤ x)}.
(4) It is clear that if a and b are positive then I(a ≥ b) ≤ a/b.
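Result (3) justifies, e.g., estimating F(x) = Pr(ξ ≤ x) by the sample mean of the
indicators I(ξi ≤ x). A minimal R sketch (the normal model and the point x are
chosen arbitrarily):

set.seed(7)
xi <- rnorm(5000)          # iid observations
x  <- 1.0                  # point at which F is evaluated
mean(xi <= x)              # average of the indicators I(xi <= x), estimating F(x)
pnorm(x)                   # true value of F(x) for the standard normal distribution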

Examples
1. Chebyshev's inequality plays one of the main roles in theoretical
   statistical inference. In this frame, supposing that a function ϕ(x) > 0
   is increasing, we can show that

   Pr(ξ ≥ x) = Pr{ϕ(ξ) ≥ ϕ(x)} = E[I{ϕ(ξ) ≥ ϕ(x)}] ≤ E{ϕ(ξ)/ϕ(x)},

   where it is assumed that E{ϕ(ξ)} < ∞. Defining x > 0, ϕ(x) = x, and
   ξ = |ζ − E(ζ)|^r, r > 0, we have Chebyshev's inequality. (A simple
   numerical check in R is sketched after these examples.)

2. Let ξ1,..., ξn be a sequence of iid positive random variables and
   Sn = ∑_{i=1}^{n} ξi. Then, for x > 0, we have

   Pr(Sn ≤ x) = Pr{exp(−Sn/x) ≥ exp(−1)} ≤ e E{exp(−Sn/x)} = e q^n,

   where q = E{exp(−ξ1/x)} ≤ 1 and e = exp(1).


3. A frequently used method for theoretical statistical evaluations is similar
   to the following strategy. For example, one may be interested in the
   approximation of the expectation E{ψ(An)}, where ψ is an increasing
   function and An is a statistic with the property An/n →p 1 as n → ∞.
   To get some intuition about the asymptotic behavior of E{ψ(An)},
   we have, for 0 < ε < 1,

   E{ψ(An)} = E{ψ(An) I(An ≤ n + n^ε)} + E{ψ(An) I(An > n + n^ε)}
            ≤ E{ψ(n + n^ε) I(An ≤ n + n^ε)} + E{ψ(An) I(An > n + n^ε)}
            ≤ ψ(n + n^ε) + [E{ψ²(An)} Pr(An > n + n^ε)]^{1/2},

   E{ψ(An)} ≥ E{ψ(An) I(An ≥ n − n^ε)} ≥ E{ψ(n − n^ε) I(An ≥ n − n^ε)}
            ≥ ψ(n − n^ε) − E{ψ(An) I(An < n − n^ε)}
            ≥ ψ(n − n^ε) − [E{ψ²(An)} Pr(An < n − n^ε)]^{1/2},

   where the Cauchy–Schwarz inequality, {E(ξζ)}² ≤ E(ξ²)E(ζ²) for random
   variables ξ and ζ, was applied. Now, in order to show that the remainder
   terms [E{ψ²(An)} Pr(An > n + n^ε)]^{1/2} and [E{ψ²(An)} Pr(An < n − n^ε)]^{1/2}
   potentially vanish to zero, a rough bound for E{ψ²(An)} might be shown
   to be in effect. If this is the case then a Chebyshev-type inequality could
   be applied to Pr(An > n + n^ε) and Pr(An < n − n^ε). Note that, since
   An ~ n, we choose the "vanished" events {An > n + n^ε} and {An < n − n^ε}
   to be present in the remainder terms. In the case where ψ is a decreasing
   function, in a similar manner to that shown above, one can start with

   E{ψ(An)} = E{ψ(An) I(An ≥ n − n^ε)} + E{ψ(An) I(An < n − n^ε)}
            ≤ E{ψ(n − n^ε) I(An ≥ n − n^ε)} + E{ψ(An) I(An < n − n^ε)},

   E{ψ(An)} ≥ E{ψ(An) I(An ≤ n + n^ε)} ≥ E{ψ(n + n^ε) I(An ≤ n + n^ε)}
            ≥ ψ(n + n^ε) − E{ψ(An) I(An > n + n^ε)}.
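As mentioned in Example 1, such inequalities are easy to check by simulation. A rough
R sketch (the gamma model and the constants x and r are arbitrary choices) compares
Pr(|ζ − E(ζ)| ≥ x) with the Chebyshev-type bound E|ζ − E(ζ)|^r / x^r:

set.seed(11)
zeta <- rgamma(1e5, shape = 2, rate = 1)      # an arbitrary random variable
x <- 1.5; r <- 2
lhs <- mean(abs(zeta - mean(zeta)) >= x)      # Pr(|zeta - E(zeta)| >= x), estimated
rhs <- mean(abs(zeta - mean(zeta))^r) / x^r   # Chebyshev-type bound
c(probability = lhs, chebyshev_bound = rhs)   # the probability does not exceed the bound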

1.9 Taylor’s Theorem


An essential toolkit for statisticians to prove various theoretical propositions
includes methods for expanding functions in powers of their arguments by
a formula similar to that provided by Taylor’s theorem–type results. In math-
ematics, a Taylor series is a representation of a function as an infinite sum
of terms that are calculated from the values of the function’s derivatives at a
single point. The basic Taylor’s theorem states:

(i) When a function g(x) has derivatives of all orders at x = a, the infinite
    Taylor series expanded about x = a is g(x) = ∑_{j=0}^{∞} g^{(j)}(a)(x − a)^j / j!,
    where, for an integer k, k! = ∏_{i=1}^{k} i and g^{(k)}(a) = d^k g(x)/dx^k evaluated
    at x = a.

(ii) Assume that a function g(x) has k + 1 derivatives on an open interval
     containing a. Then, for each x in the interval, we have

     g(x) = ∑_{j=0}^{k} g^{(j)}(a)(x − a)^j / j! + R_{k+1}(x),

     where the remainder term R_{k+1}(x) satisfies
     R_{k+1}(x) = g^{(k+1)}(c)(x − a)^{k+1}/(k + 1)! for some c between a and x,
     c = a + θ(x − a), 0 ≤ θ ≤ 1.
FIGURE 1.3
Illustration of Taylor's theorem.

To display some intuition regarding the proof of Taylor's theorem, we depict
Figure 1.3. Assume we would like to find the area under a function φ(x) with
x ∈ [A, D], using a rectangle. It is clear that the rectangle ABCD is too small and
the rectangle AFMD is too large to represent ∫_A^D φ(x) dx. The area of AUVD is
(U − A)(D − A) and is appropriate to calculate the integral. The segment UV should
cross the curve of φ(x), when x ∈ [A, D], at the point D + θ(A − D), θ ∈ (0,1).
Then ∫_A^D φ(x) dx = (D − A)φ(c), where c = D + θ(A − D). Define φ(x) = dg(x)/dx
to obtain Taylor's theorem with k = 0. Now, following the mathematical induction
principle and using integration by parts, one can obtain the Taylor result.
We remark that Taylor’s theorem can be generalized for multivariate func-
tions; for details see Mikusinski and Taylor (2002) and Serfling (2002).
Warning: It is critical to note that while applying the Taylor expansion one should
carefully evaluate the corresponding remainder terms. Unfortunately, if not
implemented carefully, Taylor's theorem is easily misapplied. Examples of errors are
numerous in the statistical literature. A typical example of a misstep using Taylor's
theorem can be found in several publications and is as follows: Let ξ1,..., ξn be a
sequence of iid random variables with Eξ1 = μ ≠ 0 and Sn = ∑_{i=1}^{n} ξi. Suppose
one is interested in evaluating the expectation E(1/Sn). Taking into account that
ESn = μn and then "using" the Taylor theorem applied to the function 1/x, we "have"
E(1/Sn) ≈ 1/(μn) − E{(Sn − μn)}/(μn)², which, e.g., "results in" nE(1/Sn) = O(1).
In this case, the expectation E(1/Sn) may not exist, e.g., when ξ1 is from a normal
distribution. This scenario belongs to cases that do not carefully consider the
corresponding remainder terms. Prototypes of the example above can be found in the
statistical literature that deals with estimators of regression models. In many cases,
complications related to evaluations of remainder terms can be comparable with those
of the corresponding initial problem statement. This may be a consequence of the
universal law of conservation of energy or mere sloppiness.
In the context of the example mentioned above, it is interesting that even when
E(1/Sn) does not exist, we obtain E(Sm/Sn) = m/n, for m ≤ n, since we have
1 = E(Sn/Sn) = ∑_{i=1}^{n} E(ξi/Sn) = nE(ξ1/Sn), and then E(ξ1/Sn) = 1/n implies
E(Sm/Sn) = ∑_{i=1}^{m} E(ξi/Sn) = mE(ξ1/Sn) = m/n.
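This identity is easily checked by simulation. A brief R sketch (the exponential model
and the values of m and n are arbitrary choices):

set.seed(5)
n <- 10; m <- 4
ratios <- replicate(1e5, {
  xi <- rexp(n)               # iid positive random variables
  sum(xi[1:m]) / sum(xi)      # S_m / S_n
})
c(simulated = mean(ratios), theoretical = m / n)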

Example
In several applications, for a function ϕ, we need to compare E{ϕ(ξ)} with ϕ(E(ξ)),
where ξ is a random variable. Applying Taylor's theorem to ϕ(ξ), we can obtain that

ϕ(ξ) = ϕ(E(ξ)) + {ξ − E(ξ)} ϕ′(E(ξ)) + {ξ − E(ξ)}² ϕ″(E(ξ) + θ{ξ − E(ξ)})/2, θ ∈ (0, 1).

Then, provided that ϕ″ is positive or negative, we can easily obtain
E{ϕ(ξ)} ≥ ϕ(E(ξ)) or E{ϕ(ξ)} ≤ ϕ(E(ξ)), respectively. For example, E(1/ξ) ≥ 1/E(ξ)
for a positive random variable ξ. This is a simple way of checking Jensen's
inequality, which states that if ξ is a random variable and ϕ is a convex or concave
function, then E{ϕ(ξ)} ≥ ϕ(E(ξ)) or E{ϕ(ξ)} ≤ ϕ(E(ξ)), respectively.

1.10 Complex Variables


It is just as easy to define a complex variable z = a + ib with i = √−1 as it is to
define real variables a and b. Complex variables consist of an imaginary unit i that
satisfies i² = −1. Mathematicians love to be consistent: if we can solve the
elementary equation x² − 1 = 0 then we should know how to solve the equation
x² + 1 = 0. This principle provides a possibility to denote the k roots of the general
equation ∑_{i=0}^{k} ai x^i = 0 with an integer k and given coefficients a0,..., ak.
Complex variables are involved in this process, adding a dimension to mathematical
analysis. Intuitively, we can consider a complex number z = a + ib as a vector with
components a and b.
Formally, for z = a + ib, z ∈ C, the real number a is called the real part of the
complex variable z and the real number b is called the imaginary part of z,
expressing Re(z) = a and Im(z) = b. Although complex variables are naturally
thought of as existing on a two-dimensional plane, there is no natural linear
ordering on the set of complex numbers. We cannot say that a complex variable z1 is
greater or less than a complex variable z2. However, many mathematical operations
require one to consider distances between variables and their comparisons, e.g., in
the context of asymptotic mechanisms or in different aspects of integration. Toward
this end, the absolute value of a complex variable is defined. The absolute value
(or modulus or magnitude) of a complex number z = a + ib is r = |z| = √(a² + b²).
The complex conjugate of the complex variable z = a + ib is defined to be a − ib
and is denoted as z̄.
The definitions mentioned above are simple, but in order to show
completeness of complex analysis, e.g., in the context of analytic
functions, we need to attend to the complicated question: do functions of
complex variables belong to C, having the form a + ib? If yes, then how does
one quantify their absolute values using only the definitions mentioned
above? For example, it is required to obtain a method for proving that log (z) =
Re {log (z)} + i Im {log (z)}, exp (z) = Re {exp (z)} + i Im {exp (z)}, and so on. In this
case Euler’s formula can be assigned to play a vital role.
In complex analysis, the following formula establishes the fundamental rela-
tionship between trigonometric functions and the complex exponential func-
tion. Euler’s formula states that, for any real number x, e ix = cos( x) + i sin( x) ,
where e is the base of the natural logarithm.

Proof. Note that i² = −1, i³ = −i, i⁴ = 1, and so on. Using the Taylor expansion,
we have

e^x = ∑_{j=0}^{∞} x^j / j!,  sin(x) = ∑_{j=0}^{∞} (−1)^j x^{2j+1} / (2j + 1)!,  and
cos(x) = ∑_{j=0}^{∞} (−1)^j x^{2j} / (2j)!.

This result implies that

e^{ix} = 1 + ix/1! − x²/2! + ... = cos(x) + i sin(x),

yielding Euler's formula.
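Euler's formula can also be verified directly in R, which supports complex arithmetic
(the value of x below is arbitrary):

x <- 0.7
exp(1i * x)                                  # the complex exponential e^{ix}
complex(real = cos(x), imaginary = sin(x))   # cos(x) + i sin(x): the same value
Mod(exp(1i * x))                             # the modulus equals 1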
Euler's formula shows that |e^{ixt}| = 1, for all x, t ∈ R. Then, for example, for
all random variables, E(e^{itξ}) exists, since |E(e^{itξ})| ≤ E|e^{itξ}| = 1.
Thus, one can show that complex variables can be presented as r exp(iw), where r
and w are real-valued variables. This helps to show that analytic functions of
complex variables are "a + ib" type complex variables and then to define their
absolute values.
Euler's formula has inspired mathematical passions and minds. For example, using
x = π/2, we have i = e^{iπ/2}, which leads to i^i = e^{−π/2}. Is this a "door" from
the "unreal" realm to the "real" realm? In this case, we remind the reader that
exp(x) = lim_{n→∞} (1 + x/n)^n and then interpretations of i^i = e^{−π/2} are left
to the reader's imagination.
Warning: In general, definitions and evaluations of a function of a complex variable
require very careful attention. For example, we have e^{ix} = cos(x) + i sin(x), but
we also can write e^{ix} = cos(x + 2πj) + i sin(x + 2πk), j = 1, 2,...; k = 1, 2,...,
taking into account the period of the trigonometric functions. Thus it is not a
simple task to define unique functions that are similar to log{ϕ(z)} and {ϕ(z)}^{1/p},
where p is an integer. In this context, we refer the reader to Finkelstein et al.
(1997) and Chung (1974) for examples of the relevant problems and their solutions.
In statistical terms, the issue mentioned above can make estimation of functions of
complex variables complicated since, e.g., a problem of unique reconstruction of
{ϕ(z)}^{1/p} based on ϕ(z) is in effect (see for example Chapter 2).

1.11 Statistical Software: R and SAS


In this section, we outline the elementary components of the R (R Development
Core Team, 2014) and SAS (Delwiche and Slaughter, 2012) statistical software
packages. Both packages allow the user to implement powerful built-in rou-
tines and user community developed code or employ programming features
for customization. For example, Desquilbet and Mariotti (2010) presented
the use of an SAS macro (customized SAS code) using data from the third
National Health and Nutrition Examination Survey to investigate adjusted
dose–response associations (with different models) between calcium intake
and bone mineral density (linear regression), folate intake and hyperhomo-
cysteinemia (logistic regression), and serum high-density lipoprotein cho-
lesterol and cardiovascular mortality (Cox model). Examples of R and SAS
programs are employed throughout the book.

1.11.1 R Software
R is an open-source, case-sensitive, command line-driven software for sta-
tistical computing and graphics. The R program is installed using the direc-
tions found at www.r-project.org.
Once R has been loaded, the main input window appears, showing a short
introductory message (Gardener, 2012). There may be slight differences in
appearance depending upon the operating system. For example, Figure 1.4
shows the main input window in the Windows operating system, which has
menu options available at the top of the page. Below the header is a screen
prompt symbol > in the left-hand margin indicating the command line.

FIGURE 1.4
The main input window of R in Windows.

Rules for names of variables and R data sets:


A syntactically valid R name consists of letters, numbers and the dot or
underline characters and starts with either a letter or the dot not followed by
a number. Reserved words are not syntactic names.
Comments in R:
Comments can be inserted starting with a pound (#) throughout the R
code without disrupting the program flow.
Inputting data in R:
R operates on named data structures. The simplest form is the numeric vec-
tor, which is a single entity consisting of an ordered collection of numbers.
One way to input data is through the use of the function c(), which takes an
arbitrary number of vector arguments and whose value is a vector got by con-
catenating its arguments end to end. For example, to set up a vector x, consist-
ing of four numbers, namely, 3, 5, 10, and 7, the assignment can be done with

> x<-c(3,5,10,7)

Notice that the assignment operator (<-), which consists of the two characters
less than (<) and minus (-) occurring strictly side by side, "points" to the object
receiving the value of the expression. In most contexts the equal operator (=) can
be used as an alternative. In order to display the value of x, we simply type in its
name after the prompt symbol > and press the Return key.

> x

As a result, R provides the following output:

[1] 3 5 10 7

Assignment can also be made using the function assign(). An equivalent


way of making the same assignment as above is with:

> assign("x", c(3,5,10,7))

The usual assignment operator (<-) can be thought of as a syntactic shortcut
to this.
In some cases the elements of a vector may not be completely known
and the vector is incomplete. When an element or value is “not available” or
a “missing value” in the statistical sense, a place within a vector may be
reserved for it by assigning it the special value NA. In general, without fur-
ther suboption specification, any operation on an NA becomes an NA.
Note that there is a second kind of “missing” values, which are produced
by numerical computation, the so-called Not a Number, NaN, values. The
following examples give NaN since the result cannot be defined sensibly:

> 0/0
> Inf - Inf

The function is.na(x) gives a logical vector of the same size as x with value TRUE
if and only if the corresponding element in x is NA or NaN. The function is.nan(x)
is only TRUE when the corresponding element in x is NaN.
Missing values are sometimes printed as <NA> when character vectors
are printed without quotes.
Manipulating data in R
In R, vectors can be used in arithmetic expressions, in which case the opera-
tions are performed element by element. The elementary arithmetic opera-
tors are the usual + (addition), – (subtraction), * (multiplication), / (division),
and ^ (exponentiation). The following example illustrates the assignment of
y, which equals x 2 + 3:

> y <- x^2 + 3

R can be programmed to perform simple statistical calculations as well as complex
computations. Table 1.1 shows some simple commands that produce descriptive
statistics of a vector x given a sample of measurements x1,..., xn. The setting
"na.rm = FALSE" is set to indicate a missing statistic value is returned if there
are missing values in the data vector.

TABLE 1.1
Selected R Commands That Produce the Descriptive Statistics of a Numerical Vector x

Syntax                      Definition
mean(x, na.rm = FALSE)      Gives the arithmetic mean
sd(x, na.rm = FALSE)        Gives the standard deviation
var(x, na.rm = FALSE)       Gives the variance
sum(x, na.rm = FALSE)       Gives the sum of the vector elements
Instead of using the built-in functions contained in R, such as those shown
in Table 1.1, customized functions can be created to carry out additional
specific tasks. For this purpose the function() command can be used. The
following example shows a simple function “mymean” that determines
the running mean of the first i, i = 1,… , n elements of a vector x, where n is
the number of elements in x. Results are shown for x, our toy dataset from
above, by applying the customized function mymean.

> # to define the function


> mymean <- function(x) {
+ tmp <- c()
+ for(i in 1:length(x)) tmp[i] <- mean(x[1:i])
+ return(tmp)
+ }
> # execute the function using the data defined above
> mymean(x)
[1] 3.00 4.00 6.00 6.25

Note that the symbol plus (+) is shown at the left-hand side of the screen
instead of the symbol greater than (>) to indicate the function commands
are being input. The built-in R functions “length,” “return,” and “mean” are
used within the new function mymean. For more details about R functions
and data structures, we refer the reader to Crawley (2012).
In addition to vectors, R allows manipulation of logical quantities. The
elements of a logical vector can be values TRUE (abbreviated as T), FALSE
(abbreviated as F), and NA (in the case of “not available”). Note however that
T and F are just variables that are set to TRUE and FALSE by default, but are
not reserved words and hence can be overwritten by the user. Therefore, it
would be better to use TRUE and FALSE.
Logical vectors are generated by conditions. For example,
> cond <- x > y
sets cond as a vector of the same length as x and y, with values FALSE
corresponding to elements of x and y if the condition is not met and TRUE where
it is. Table 1.2 shows some basic comparison operators in R.

TABLE 1.2
Selected R Comparison Operators

Symbolic    Meaning
==          equals
!=          not equal
>           greater than
<           less than
>=          greater than or equal
<=          less than or equal
%in%        determines whether specific elements are in a longer vector
A group of logical expressions, say cond1 and cond2, can be combined with the
symbol & (and) or | (or), where cond1 & cond2 is their intersection and
cond1 | cond2 is their union. !cond1 is the negation of cond1.
Often, the user may want to make choices and take action dependent on
a certain value. In this case, an if statement can be very useful, which takes
the following form:

if (cond) {statements}

Three components are contained in the if statement: if, the keyword;


(cond), a single logical value between parentheses (or an expression that
leads to a single logical value); and {statements}, a block of codes between
braces ({}) that has to be executed when the logical value is TRUE. When
there is only one statement, the braces can be omitted. For example, the following
statement sets the variable y equal to 10 if the variable Gender equals "F":

if (Gender == "F") y <- 10

To execute repetitive code statements for a particular number of times, a


“for” loop in R can be used. For loops are controlled by a looping vector. In
each iteration of the loop, one value in the looping vector is assigned to a
variable that can be used in the statements of the body of the loop. Usually,
the number of loop iterations is defined by the number of values stored in
the looping vector; they are processed in the same order as they are stored
in the looping vector. Generally, for loop construction takes the following
form:

for (variable in seq) {


statements
}

If the goal is to create a new vector, a vector to store member variables


should be set up before creating a loop. For example,

x <- NULL
for (j in 1:50) {
x[j] <- j^2
}

The program creates a vector of 50 observations x = j 2, where j ranges from


1 to 50.
Printing data in R
In order to display the data or the variable, we can simply type in its name
after the prompt symbol >, and press the Return key.
There are many routines in R developed by researchers around the
world that can be downloaded and installed from CRAN-like repositories
or local files using the command install.packages(“packagename”), where
packagename is the name of the package to be installed and must be in
quotes; single or double quotes are both fine as long as they are not mixed.
Once the package is installed, it can be loaded by issuing the command
library(packagename) after which commands specific to the package can be
accessed. Through an extensive help system built into R, a help entry for a
specified command can be brought up via the help(commandname) com-
mand. As a simple example, we introduce the command “EL.means” in the
EL library.
> install.packages("EL")
Installing package into 'C:/Users/xiwei/Documents/R/win-library/3.1'
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/
EL_1.0.zip'
Content type 'application/zip' length 53774 bytes (52 Kb)
opened URL
downloaded 52 Kb

package 'EL' successfully unpacked and MD5 sums checked

The downloaded binary packages are in


C:\Users\xiwei\AppData\Local\Temp\Rtmp4uRCPS\downloaded_packages
> library(EL)
> help(EL.means)

The EL.means function provides the software tool for implementing the
empirical likelihood tests we will introduce in detail in this book.
As another concrete example, we show the “mvrnorm” command in the
MASS library.

> install.packages("MASS")
Installing package into 'C:/Users/xiwei/Documents/R/win-library/3.1'
(as ‘lib’ is unspecified)

trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.1/MASS_7.3-34.zip'


Content type 'application/zip' length 1083003 bytes (1.0 Mb)
opened URL
downloaded 1.0 Mb

package 'MASS' successfully unpacked and MD5 sums checked

The downloaded binary packages are in


C:\Users\xiwei\AppData\Local\Temp\Rtmp4uRCPS\downloaded_packages
> library(MASS)
> help(mvrnorm)

The mvrnorm command is very useful for simulating data from a multivariate normal
distribution. To illustrate, we simulate bivariate normal data with mean vector
(0, 0)^T and an identity covariance matrix with a sample size of 5.
sample size of 5.

> n <- 5 # define the sample size


> mu <- c(0,0) # define the mean vector
> Sigma <- matrix(c(1,0,0,1), byrow=TRUE, ncol=2) # define covariance matrix
> set.seed(123) # define the seed to fix the sample
> X <- mvrnorm (n, mu=mu, Sigma=Sigma) # generate data
> X
[,1] [,2]
[1,] -1.7150650 -0.56047565
[2,] -0.4609162 -0.23017749
[3,] 1.2650612 1.55870831
[4,] 0.6868529 0.07050839
[5,] 0.4456620 0.12928774

For more details about the implementation of R, we refer the reader to, e.g.,
Gardener (2012) and Crawley (2012).

1.11.2 SAS Software


SAS software is widely used to analyze data from various clinical trials and
manage large datasets. SAS runs on a wide range of operating systems. More
information about SAS can be found at the website: http://www.sas.com/.
SAS can be run in both an interactive and batch mode. To run SAS interac-
tively type SAS at your system prompt (UNIX/LINUX), or click the SAS icon
(PC). Figure 1.5 shows the SAS interface in Microsoft Windows.
FIGURE 1.5
The interface of SAS.

In the interactive SAS environment, one can write, edit, and submit programs for
processing, as well as view and print the results.
A SAS program is a sequence of statements executed in order. A statement provides
instructions for SAS to execute and must be appropriately placed in the program.
Statements can consist of SAS keywords, SAS names, special characters, and
operators. SAS is a free format language in that SAS statements:

• Are not case-sensitive, except those inside of quoted strings
• Can start in any column
• Can continue on the next line or be on the same line as other statements.

The most important rule is that every SAS statement ends with a semicolon (;).
Rules for names of variables and SAS data sets
When making up names for variables and data sets (data set names follow
similar rules as variables, but they have a different name space), the follow-
ing rules should be followed:

• Names must be 32 characters or less in length, containing only let-


ters, digits, or the underscore character (_).
• Names must start with a letter or an underscore; however, it is a
good idea to avoid starting variable names with an underscore,
because special system variables are named that way.

In SAS, there are virtually no reserved words; it differentiates user-defined names
from keywords or special system variable names by context.
Comments in SAS
Note that there are two styles of comments in SAS: one starts with an asterisk (*)
and ends with a semicolon (;) as is shown in the example below. The other style
starts with a slash asterisk (/*) and ends with an asterisk slash (*/).

In the case of unmatched comments, SAS can’t read the entire program and
will stop in the middle of a program, much like unmatched quotation marks.
The solution, in batch mode, is to insert the missing part of comment and
resubmit the program.
Inputting data in SAS
In SAS, DATA steps are used to read and modify data, and can also be used to
simulate data. The DATA step is flexible relative to the various data formats.
DATA steps have an underlying matrix structure such that programming
statements will be executed for each row of the data matrix. There are multiple
ways to import data into SAS: you can type raw data directly in the SAS program,
link SAS to a database, or direct SAS to an external data file.
The following SAS program illustrates how to type raw data and use the
INPUT and DATALINES statement:

* Read internal data into SAS data set Namelists;


DATA Namelists;
INPUT Name $ Gender $ Age;
DATALINES;
Lincoln M 46
Sara F 32
Catherine F 18
Mike M 35
;
RUN;

The keywords, e.g., DATA, INPUT, DATALINES, and RUN, identify the
type of statement and instruct the execution in SAS. For example, the INPUT
statement, a part of the DATA step, indicates to SAS the variable names and
their formats. To write an INPUT statement using list input, simply list the
variable names after the INPUT keyword in the order they appear in the
data file. If the variable is a character, then leave a space and place a dollar
sign ($) after the corresponding variable name.
Separating the data from the program avoids the possibility that data will
accidentally be altered when editing the SAS program. For data contained
in external files, the INFILE statement can be used to direct SAS to the data
from ASCII files. The INFILE statement follows the DATA statement and
must precede the INPUT statement. After the INFILE keyword, the file path
and the file name are enclosed in quotation marks.
By default, the DATA step starts reading with the first data line and, if
SAS runs out of data on a line, it automatically goes to the next line to
read values for the rest of the variables. Most of the time this works fine,
but sometimes data files cannot be read using the default settings. In the
INFILE statement, the options placed after the filename can change the
way SAS reads raw data files. For instance, the FIRSTOBS= option tells SAS
at what line to begin reading data; the OBS= option can be used anytime
you want to read only a part of your data file; the MISSOVER option tells

SAS that if it runs out of data, do not go to the next data line but assign
missing values to any remaining variables instead; and the DELIMITER=,
or DLM=, option allows SAS to read data files with other delimiters (the
default is a blank space).
Moreover, SAS assumes that the number of characters including spaces
in a data line (termed a record length) of external files is no more than 256
in some operating environments, e.g., the Windows operating system. If the
data contains records that are longer than 256 characters, use the LRECL=
option in the INFILE statement to specify a record length at least as long as
the longest record in the data file. For more details about the INFILE options,
we refer the reader to Delwiche and Slaughter (2012). More complex data
import and export features are available in SAS, but go beyond the scope of
our applications.
To illustrate, the following program reads data from a tab-separated exter-
nal file into an SAS data set using the INFILE statement:

* Read internal data into SAS data set Namelists;


DATA Namelists;
INFILE 'c:\MyRawData\Namelists.dat' DLM = '09'X;
INPUT Name $ Gender $ Age;
RUN;

With SAS, there is commonly more than one way to accomplish the same
result, including the input of data. For more details about other ways of
inputting data, e.g., the IMPORT procedure, we refer the reader to Delwiche
and Slaughter (2012).
When a variable exists in SAS but does not have a value, the value is said
to be missing. SAS assigns a period (.) for numeric data and a blank for char-
acter data. By the MISSING statement, the user may specify other characters
instead of a period or a blank to be treated as missing data. To illustrate how
the missing data coding works, the following example declares that the char-
acter u is to be treated as a missing value for the character variable Gender
whenever it is encountered in a record:

DATA two;
MISSING u;
INPUT Gender $;
CARDS;
RUN;

Missing values have a value of false when used with logical operators such
as AND or OR.
Manipulating data in SAS
In SAS, the users can create and redefine variables with assignment state-
ments using the following basic form:

variable = expression;

On the left side of the equal sign is a variable name, either new or old.
On the right side of the equal sign can be a constant, another variable, or a
mathematical expression. The basic types of assignment statements may use
operators such as + (addition), - (subtraction), * (multiplication), / (division),
and ** (exponentiation). The following example illustrates the assignment of
Newvar, which equals the square of OldVar plus 3:

NewVar = OldVar ** 2 + 3;

In the case where a simple expression using only arithmetic operators is


not enough, SAS provides a set of numerous useful and built-in functions.
Table 1.3 shows selected SAS numeric functions; see Delwiche and Slaughter
(2012) for more numeric functions and character functions.
There are a variety of control statements that control the flow of execution of
statements in the data step.
If the user wants to conditionally execute a SAS statement based on certain
conditional logic, that is, to conduct computations under certain conditions,
the IF-THEN statement can be used, which takes the following general form:

IF condition THEN action;

The condition is an expression comparing arguments, and the action is


what SAS will execute when the expression is true, often an assignment
statement. For example,

IF Gender = 'F' THEN y=10;

This statement tells SAS to set the variable y equal to 10 whenever the
variable Gender equals F. The terms on either side of the comparison, sep-
arated by a comparison operator, may be constants, variables, or expres-
sions. The comparison operator may be either symbolic or mnemonic,
depending on the user’s preference. Table 1.4 shows some basic comparison
operators.
A single IF-THEN statement can only have one action. To specify multiple
conditions, we can combine the condition with the keywords AND (&) or OR
(|). For example,

TABLE 1.3
Selected SAS Numeric Functions
Syntax Definition

MEAN(arg-1,arg-2,…arg-n) Arithmetic mean of nonmissing values


STD(arg-1,arg-2,…arg-n) Standard deviation of nonmissing values
VAR(arg-1,arg-2,…arg-n) Variance of nonmissing values
SUM(arg-1,arg-2,…arg-n) Sum of nonmissing values

TABLE 1.4
Selected SAS Comparison Operators
Symbolic Mnemonic Meaning
= EQ equals
¬ =, ^ =, or ~ = NE not equal
> GT greater than
< LT less than
>= GE greater than or equal
<= LE less than or equal
IN determine whether a variable’s value is among a
list of values

IF condition AND condition THEN action;

A group of actions can be executed by adding the keywords DO and END:

IF condition THEN DO;


action;
action;
END;

The DO statement designates a group of statements to be executed as a


unit until a matching END statement appears. The DO statement, the match-
ing END statement, and all the statements between them define a DO loop.
There are several variations of the DO-END statement. The following exam-
ple presents an iterative DO statement that executes a group of SAS state-
ments repetitively between the DO and END statements:

DO index = start TO stop BY increment;


statements;
END;

The number of times statements are executed is determined as follows.


Initially the variable index is set to the value of start and statements are exe-
cuted. Next the value of increment is added to the index and the new value is
compared to the value of stop. The statements are executed again only if the
new value of index is less than or equal to stop. If no increment is specified,
the default is 1. The process continues iteratively until the value of index is
greater than the value of stop. We illustrate with a simple example:

DATA one;
DO j = 1 TO 50;
x = j**2;
OUTPUT one;
END;
DROP j;
RUN;

The program creates a SAS data set one with 50 observations and a variable x = j²,
where the index variable j ranges from 1 to 50. The DROP statement helps get rid of
the index variable j. The OUTPUT statement tells SAS to write the current observation
to the output data set before returning to the beginning of the DATA step to process
the next observation.
Printing data in SAS
The PROC PRINT procedure lists data in a SAS data set as a table of obser-
vations by variables. The following statements can be used with PROC
PRINT:
PROC PRINT DATA = SASdataset;
VAR variables;
ID variable;
BY variables;
TITLE 'Print SAS Data Set';
RUN;

where SASdataset is the name of the data set printed. If none is specified, then
the last SAS data set created will be printed. If no VAR statement is included,
then all variables in the data set are printed; otherwise only those listed,
and in the order in which they are listed, are printed. When an ID state-
ment is used, SAS prints each observation with the values of the ID variables
first instead of the observation number, the default setting. The BY statement
specifies the variable that the procedure uses to form BY groups; the obser-
vations in the data set must either be sorted by all the variables specified
(e.g., use the PROC SORT procedure), or they must be indexed appropriately,
unless the NOTSORTED option in the BY statement is used. For more details,
we refer the reader to Delwiche and Slaughter (2012).
Summarizing data in SAS
After reading the data and making sure it is correct, one may summarize
and analyze the data using built-in SAS procedures or PROCs. For example,
PROC MEANS provides a set of descriptive statistics for numeric variables.
Virtually all SAS PROCs have additional options or features, which can be accessed
with subcommands, e.g., one can use PROC MEANS to carry out a one-sample t-test.
As another example, the following program sorts the SAS data set Namelists
by gender using PROC SORT, and then summarizes the Age by Gender using
PROC MEANS with a BY statement (the MAXDEC option is set to zero, so no
decimal places will be printed.):

* Sort the data by Gender;


PROC SORT DATA = Namelists;
BY Gender;
* Calculate means by Gender for Age;
PROC MEANS DATA = Namelists MAXDEC = 0;
BY Gender;
VAR Age;
TITLE 'Summary of Age by Gender';
RUN;

Here are the results of the PROC MEANS by gender:

Summary of Age by Gender
The MEANS Procedure

Gender = F
Analysis Variable: Age
N    Mean    Std Dev    Minimum    Maximum
2    25      10         18         32

Gender = M
Analysis Variable: Age
N    Mean    Std Dev    Minimum    Maximum
2    41      8          35         46

For more details about data summarization by descriptive statistics and/or graphs,
we refer the reader to Delwiche and Slaughter (2012).
For a majority of the examples in later chapters, the reader is expected to
understand the basic commands of the DATA step, such as DO loops and the OUTPUT
statement, and be familiar with the various statistical PROCs that he or she might
use generally given parametric assumptions. Some knowledge of the basic SAS macro
language and PROC IML (a separate way to program in SAS) will also be helpful. It
is our goal that most of the code provided in this book will be easily modified to
handle a variety of problems found in practice. It should be noted that R routines
can be implemented within SAS PROC IML.
2
Characteristic Function–Based Inference

2.1 Introduction
In this chapter we present powerful characteristic function–based tools as an
approach for investigating the properties of distribution functions and their
associated quantities. As it turns out the information contained within the
characteristic function structure has a one-to-one correspondence with the
distribution function framework. In this context characteristic functions can
be extremely helpful in both theoretical and applied biostatistical work. In
particular, characteristic functions can be used to simplify many analytical
proofs in statistics. In certain situations, the form of a characteristic function
can easily define a family of distribution functions, thus generalizing known
conjugate families. This is very important, for example, in parametric statis-
tics and the construction of Bayesian priors. Oftentimes a relatively simple
characteristic function can mathematically represent a random variable in
terms of its given properties even when the distribution function does not
have an explicit analytical form.
In a similar manner to the analysis based on characteristic functions we
also introduce Laplace transformations of distribution functions in this
chapter. In the statistical context, oftentimes we need to estimate distribution
functions based on observations subject to different noise effects or that are
based on using data that are directly presented as sums or maximums. These
scenarios and various tasks of statistical sequential procedures, as well as in
the context of renewal theory, are examples of situations where characteristic
functions and Laplace transformations can play a main role in analytical
analyses.
We suggest that the reader who is interested in more details regarding
characteristic functions beyond those presented in this book consult the
book of Lukacs (1970) as a fundamental guide.
In Section 2.2 we consider the elementary properties of characteristic func-
tions as well as convolution statements in which characteristic functions can
be involved. The one-to-one mapping propositions related to characteristic
and distribution functions are presented in Section 2.3. In Section 2.4 we
outline various applications of characteristic functions to biostatistical topics,
e.g., those related to renewal functions, sequential procedures, Tauberian


theorems, risk-efficient estimation, the law of large numbers, the central
limit theorem, issues of reconstructing the general distribution based on the
distribution of some statistics, extensions and estimations of families of dis-
tribution functions. In this chapter we also attend to measurement error
problems, cost-efficient designs, and several principles pertaining to Monte
Carlo simulation.

2.2 Elementary Properties of Characteristic Functions


We define the characteristic function ϕξ(t), t ∈ R, of a distribution function
F(x) = Pr{ξ ≤ x}, where ξ is a random variable, by the following expression:

ϕξ(t) = ∫_{−∞}^{∞} e^{itx} dF(x) = E(e^{itξ}),  i = √−1.

Thanks to Euler's formula, the characteristic function can be presented as
ϕξ(t) = E{cos(tξ) + i sin(tξ)}, which leads immediately to the following properties:

Proposition 2.2.1. Let F(x) denote a distribution function with characteristic
function ϕξ(t). Then

1. ϕξ(0) = 1;
2. |ϕξ(t)| ≤ 1, for all t;
3. ϕ−ξ(t) = ϕ̄ξ(t), where the horizontal bar atop ϕξ(t) defines the complex
   conjugate of ϕξ(t).

For example, in order to obtain property (2) above we use the elementary inequality

|E(e^{itξ})| ≤ E|e^{itξ}| = E{cos²(tξ) + sin²(tξ)}^{1/2} = 1.
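These properties can also be visualized with an empirical (sample-based)
characteristic function; a small R sketch (the gamma sample and the grid of t values
are arbitrary choices):

set.seed(2)
xi <- rgamma(2000, shape = 3)                  # an arbitrary sample
phi.hat <- function(t) mean(exp(1i * t * xi))  # empirical characteristic function
phi.hat(0)                                     # equals 1, property (1)
max(sapply(seq(-5, 5, by = 0.25),
           function(t) Mod(phi.hat(t))))       # never exceeds 1, property (2)
# property (3): the cf of -xi equals the complex conjugate of the cf of xi
Mod(mean(exp(1i * 1.3 * (-xi))) - Conj(phi.hat(1.3)))   # numerically zero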

An attractive property of characteristic functions is related to the following set
of results. Consider independent random variables ξ1 and ξ2. The distribution
function of the sum

Fξ1+ξ2(x) = Pr(ξ1 + ξ2 ≤ x) = E{I(ξ1 + ξ2 ≤ x)}

has the form

Fξ1+ξ2(x) = ∫∫ I(u1 + u2 ≤ x) d Pr(ξ1 ≤ u1, ξ2 ≤ u2)
          = ∫∫ I(u1 + u2 ≤ x) dFξ1(u1) dFξ2(u2),

since ξ1 and ξ2 are independent. Then

Fξ1+ξ2(x) = ∫∫ I(u1 ≤ x − u2) dFξ1(u1) dFξ2(u2) = ∫_{−∞}^{∞} {∫_{−∞}^{x−u2} dFξ1(u1)} dFξ2(u2)
          = ∫_{−∞}^{∞} Fξ1(x − u2) dFξ2(u2),

since it is clear that F(u) = ∫_{−∞}^{u} dF(u) with F(−∞) = 0. In the case of
dependent random variables ξ1 and ξ2, using the definition of conditional probability
one can show that the distribution of the sum is given as

Fξ1+ξ2(x) = ∫_{−∞}^{∞} Pr(ξ1 ≤ x − u2 | ξ2 = u2) dFξ2(u2).        (2.1)

Intuitively, this equation means that in the right-hand side of the formula
above the random variable ξ 2 is fixed. However, we allow ξ 2 to vary accord-
ing to its probability distribution. Thus, Equation (2.1) holds.
The integral form (2.1) of Fξ1+ξ2(x) shown above is called a convolution. When we
have n > 1 independent random variables ξ1,..., ξn, it is clear that

Fξ1+ξ2+...+ξn(x) = ∫∫...∫ Fξ1(x − ∑_{i=2}^{n} ui) ∏_{i=2}^{n} dFξi(ui),

where the multiple integration is performed (n − 1) times.

Thus, the evaluation of distribution functions of sums of random variables


using their components’ distribution functions is not a simple task, even
when the random variables are iid.
In parallel to this issue, the characteristic function of a sum of independent random
variables ξ1,..., ξn satisfies the simple equation

ϕξ1+ξ2+...+ξn(t) = E{e^{it(ξ1+ξ2+...+ξn)}} = ∏_{i=1}^{n} E{e^{itξi}},

that is, ϕξ1+ξ2+...+ξn(t) = q^n with q = E(e^{itξ1}), when ξ1,..., ξn are iid, since
the random variables e^{itξi}, i = 1,..., n, are independent, that is,
E(e^{itξk + itξl}) = E(e^{itξk}) E(e^{itξl}), 1 ≤ k ≠ l ≤ n. Thus characteristic
functions can be used to derive the distribution of a sum of random variables.
Hence, we can bypass using the direct convolution method for determining the
distribution of the sum ξ1 + ... + ξn, thus simplifying the analysis.
Note that we can completely enjoy the benefits of employing characteristic
functions if we show that characteristic functions and distribution functions
are equivalent in the context of explaining the properties of random variables.
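The convolution-versus-product contrast can be seen numerically. In the R sketch
below (an arbitrary iid exponential example), the empirical characteristic function
of the sum ξ1 + ... + ξn is compared with q^n, where q = 1/(1 − it) is the
characteristic function of a standard exponential random variable:

set.seed(9)
n <- 5; t <- 0.8
sums <- replicate(20000, sum(rexp(n, rate = 1)))  # Monte Carlo copies of the sum
cf.sum.hat <- mean(exp(1i * t * sums))            # empirical cf of the sum at t
q <- 1 / (1 - 1i * t)                             # cf of a single Exp(1) variable at t
c(empirical = cf.sum.hat, theoretical = q^n)      # the two values are close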

2.3 One-to-One Mapping


The major reason for our interest in characteristic functions is that they
uniquely describe the corresponding distribution functions. In order to show
a one-to-one mapping between distribution functions and their correspond-
ing characteristic functions, we should begin by reporting that the conclu-
sion of part (2) of Proposition 2.2.1 states that characteristic functions exist for
all random variables. This is important to note, since distribution functions
do exist for all random variables by definition. It is clear that the definition
ϕξ(t) = ∫ e^{itx} dF(x) provides the characteristic function if the distribution
function is specified. The second stage is to prove that probabilities of inter-
vals can be recovered from the characteristic functions using the following
inversion theorem.

Theorem 2.3.1. (Inversion Theorem)

Suppose x and y are arguments of the distribution function F(u) = Pr (ξ ≤ u)


that is continuous at x and y. Then

F(y) − F(x) = lim_{σ→0} (1/(2π)) ∫_{−∞}^{∞} {(e^{−itx} − e^{−ity})/(it)} ϕ(t) e^{−t²σ²/2} dt,    (2.2)

where ϕ(t) = E(e^{itξ}) is the characteristic function.

Comments:

(i). The statement “the function F is continuous at x and y” does not


mean F is a continuous function. We assume there are no saltus
(jumps) of F specifically at x and y.
(ii). The integral at (2.2) may at first glance seem to be problematic around the
      point t = 0. However, even when σ = 0, by virtue of Taylor's theorem applied
      to e^{−itx} − e^{−ity} around t = 0, we have

      ∫_{−δ}^{δ} {(e^{−itx} − e^{−ity})/(it)} ϕ(t) dt = ∫_{−δ}^{δ} (y − x)ϕ(t) dt + O(δ)
with |ϕ(t)| ≤ 1, for all t and fixed δ > 0. Thus, while restricting F (or ϕ) to satisfy

∫_{δ}^{∞} |ϕ(t)/t| dt < ∞ and ∫_{−∞}^{−δ} |ϕ(t)/t| dt < ∞,    (2.3)

for some δ > 0, we could put lim_{σ→0} into the integral at (2.2), yielding

F(y) − F(x) = (1/(2π)) ∫_{−∞}^{∞} {(e^{−itx} − e^{−ity})/(it)} ϕ(t) dt.    (2.4)

It turns out that in the case when the density function f(u) = dF(u)/du exists we can
write ϕ(t) = ∫ e^{itu} f(u) du = ∫ f(u) d{e^{itu}/(it)} and then the reader can use
the method of integration by parts to obtain conditions on F in order to satisfy
(2.3). For example, ϕ(t) = 1/(1 + t²) (a Laplace distribution) and ϕ(t) = e^{−|t|}
(a Cauchy distribution) satisfy (2.3). Thus, we conclude that the class of
distribution functions of the form (2.4) is not empty.

(iii). Oftentimes, in the mathematical literature, the Inversion Theorem is presented
       in the form

       F(y) − F(x) = lim_{T→∞} (1/(2π)) ∫_{−T}^{T} {(e^{−itx} − e^{−ity})/(it)} ϕ(t) dt,

       which is equivalent to (2.2).

2.3.1 Proof of the Inversion Theorem


We outline the proof of Theorem 2.3.1 via the following steps:
Step 1: Assume the random variable ξ is well-behaved so as to have a char-
acteristic function that satisfies constraint (2.3) and is integrable. Then, com-
ment (ii) above yields

lim_{σ→0} (1/(2π)) ∫_{−∞}^{∞} {(e^{−itx} − e^{−ity})/(it)} ϕ(t) e^{−t²σ²/2} dt
     = (1/(2π)) ∫_{−∞}^{∞} {(e^{−itx} − e^{−ity})/(it)} ϕ(t) dt
     = (1/(2π)) E( ∫_{−∞}^{∞} {(e^{−itx} − e^{−ity})/(it)} e^{itξ} dt )
     = (1/(2π)) E{ ∫_{−∞}^{∞} (e^{−it(x−ξ)} − e^{−it(y−ξ)})/(it) dt }.

Since Euler's formula provides

e^{−it(x−ξ)} − e^{−it(y−ξ)} = cos(t(x − ξ)) − i sin(t(x − ξ)) − cos(t(y − ξ)) + i sin(t(y − ξ)),

we consider ∫_{−∞}^{∞} {cos(t)/t} dt. The fact that cos(t)/t = −{cos(−t)/(−t)} implies

∫_{−∞}^{∞} {cos(t)/t} dt = ∫_{−∞}^{0} {cos(t)/t} dt + ∫_{0}^{∞} {cos(t)/t} dt = 0.

Thus

(1/(2π)) E{ ∫_{−∞}^{∞} (e^{−it(x−ξ)} − e^{−it(y−ξ)})/(it) dt }
     = (1/(2π)) E{ ∫_{−∞}^{∞} [sin(t(y − ξ)) − sin(t(x − ξ))]/t dt }    (2.5)

and we need to attend to ∫_{−∞}^{∞} {sin(t)/t} dt. Note that this integral is
interesting in itself. Taking into account that sin(t) = O(1) and
∫_{δ}^{∞} (1/t) dt = ∞, for δ > 0, one concern is with respect to the finiteness of
∫_{−∞}^{∞} {sin(t)/t} dt. Toward this end, using the property sin(−t) = −sin(t),
we represent

∫_{−∞}^{∞} {sin(t)/t} dt = 2 ∫_{0}^{∞} {sin(t)/t} dt = 2 ∫_{0}^{∞} sin(t) ∫_{0}^{∞} e^{−ut} du dt
     = 2 ∫_{0}^{∞} ∫_{0}^{∞} e^{−ut} sin(t) dt du,

where integration by parts implies

∫_{0}^{∞} e^{−ut} sin(t) dt = −(1/u) ∫_{0}^{∞} sin(t) de^{−ut}
     = −(1/u) { e^{−ut} sin(t) |_{0}^{∞} − ∫_{0}^{∞} e^{−ut} d sin(t) }
     = (1/u) ∫_{0}^{∞} e^{−ut} cos(t) dt
     = −(1/u²) ∫_{0}^{∞} cos(t) de^{−ut}
     = −(1/u²) { e^{−ut} cos(t) |_{0}^{∞} − ∫_{0}^{∞} e^{−ut} d cos(t) }
     = 1/u² − (1/u²) ∫_{0}^{∞} e^{−ut} sin(t) dt,

that is, ∫_{0}^{∞} e^{−ut} sin(t) dt = (1 + u²)^{−1}. Thus we obtain

∫_{−∞}^{∞} {sin(t)/t} dt = 2 ∫_{0}^{∞} {sin(t)/t} dt = 2 ∫_{0}^{∞} ∫_{0}^{∞} sin(t) e^{−ut} du dt
     = 2 ∫_{0}^{∞} 1/(1 + u²) du.

This result resolves the problem regarding the existence of ∫_{−∞}^{∞} {sin(t)/t} dt.
It is well known that ∫_{0}^{∞} (1 + u²)^{−1} du = π/2 (e.g., Titchmarsh, 1976) so that

∫_{−∞}^{∞} {sin(t)/t} dt = 2 arctan(u) |_{0}^{∞} = π.

This is the Dirichlet integral, which can be easily applied to conclude that

∫_{−∞}^{∞} {sin(γt)/t} dt = { π, if γ > 0;  −π, if γ < 0;  0, if γ = 0 }
                          = π I(γ ≥ 0) − π I(γ < 0).

Then, reconsidering Equation (2.5), we have

(1/(2π)) E{ ∫_{−∞}^{∞} (e^{−it(x−ξ)} − e^{−it(y−ξ)})/(it) dt }
     = (1/(2π)) E[ π I{(y − ξ) ≥ 0} − π I{(y − ξ) < 0} − π I{(x − ξ) ≥ 0} + π I{(x − ξ) < 0} ]
     = (1/2) E[ I{ξ ≤ y} − (1 − I{ξ ≤ y}) − I{ξ ≤ x} + (1 − I{ξ ≤ x}) ]
     = Pr(ξ ≤ y) − Pr(ξ ≤ x)
     = F(y) − F(x),

which completes the proof of Theorem 2.3.1 in the case where ξ is well-behaved.


Step 2: Assume the random variable ξ is not as well-behaved as required in Step 1
above. In this case we will use a random variable in the neighborhood of ξ that does
satisfy the requirements seen in Step 1 above. The idea is that instead of
considering ξ we can focus on ζ = ξ + η, where the random variable η is continuous
and independent of ξ. In this setting the convolution principle can lead to

Fζ(u) = Pr(η + ξ ≤ u) = ∫ Pr(η ≤ u − x) d Pr(ξ ≤ x)

and then, for example, the density function fζ(u) = dFζ(u)/du = ∫ fη(u − x) dF(x)
with fη(u) = dFη(u)/du exists. Even when ξ = 1, 2,... is discrete, we have

Fζ(u) = ∑_{j=1}^{∞} Fη(u − j) Pr(ξ = j),  fζ(u) = ∑_{j=1}^{∞} fη(u − j) Pr(ξ = j).

In this case, in accordance with Step 1, it is clear that

Fζ(y) − Fζ(x) = (1/(2π)) ∫_{−∞}^{∞} {(e^{−itx} − e^{−ity})/(it)} E(e^{itζ}) dt
             = (1/(2π)) ∫_{−∞}^{∞} {(e^{−itx} − e^{−ity})/(it)} ϕ(t) E(e^{itη}) dt.    (2.6)

Then "Der η hat seine Arbeit getan, der η kann gehen" ("The η has done its work,
the η can go"; German: Schiller, 2013) and we let η go. To formalize this result we
start by proving the following lemma.

Lemma 2.3.1.
Let a random variable η be normally N(μ, σ²)-distributed. Then the characteristic
function of η is given by ϕη(t) = E(e^{itη}) = exp(iμt − σ²t²/2).

Proof. If we define the random variable η1 ~ N(0, 1) then its characteristic function
is given as ϕη1(t) = (2π)^{−1/2} ∫_{−∞}^{∞} e^{itu} e^{−u²/2} du, since the density
function of η1 is (2π)^{−1/2} e^{−u²/2}. Thus

dϕη1(t)/dt = (2π)^{−1/2} ∫_{−∞}^{∞} iu e^{itu − u²/2} du = −(2π)^{−1/2} i ∫_{−∞}^{∞} e^{itu} de^{−u²/2}.

Integrating the above expression by parts, we arrive at

dϕη1(t)/dt = −(2π)^{−1/2} t ∫_{−∞}^{∞} e^{itu − u²/2} du = −t ϕη1(t), that is,
{dϕη1(t)/dt}/ϕη1(t) = −t

or d ln(ϕη1(t))/dt = −t. This differential equation has the solution
ϕη1(t) = exp(−t²/2 + C), for all constants C. Taking into account that ϕη1(0) = 1,
we can admit only C = 0. Then ϕη1(t) = exp(−t²/2). The random variable
η = μ + ση1 ~ N(μ, σ²) and hence has the characteristic function
ϕη(t) = e^{iμt} E(e^{iσtη1}) = e^{iμt} ϕη1(σt).
Now the proof of Lemma 2.3.1 is complete.

Set η ~ N(0, σ²). Lemma 2.3.1 applied to Equation (2.6) yields

Fζ(y) − Fζ(x) = (1/(2π)) ∫_{−∞}^{∞} {(e^{−itx} − e^{−ity})/(it)} ϕ(t) e^{−σ²t²/2} dt,  ζ = ξ + η.

Since the distribution function F of ξ is continuous at x and y, we have that

lim_{σ→0} {Fζ(y) − Fζ(x)} = lim_{σ→0} (1/(2π)) ∫_{−∞}^{∞} {(e^{−itx} − e^{−ity})/(it)} ϕ(t) e^{−σ²t²/2} dt,

which leads to

F(y) − F(x) = lim_{σ→0} (1/(2π)) ∫_{−∞}^{∞} {(e^{−itx} − e^{−ity})/(it)} ϕ(t) e^{−σ²t²/2} dt,

since Pr(|η| > δ) → 0, for all δ > 0, as σ → 0 (here, presenting ηk ~ N(0, σk²), it
is then clear that we have ξ + ηk →p ξ and Fξ+ηk → F as σk → 0, k → ∞). This
completes the proof of Theorem 2.3.1.


Note that the difference F(y) − F(x) uniquely determines the distribution function F. Thus, we conclude that characteristic functions uniquely define their distribution functions and vice versa.
Assume the density f(x) of ξ exists. In view of Theorem 2.3.1, we can write


{F(x + h) − F(x)}/h = (1/(2π)) ∫_{−∞}^{∞} [(e^{−itx} − e^{−it(x+h)})/(iht)] ϕ(t) dt,

using y = x + h. From this, with h → 0 and the definitions f(x) = lim_{h→0} {F(x + h) − F(x)}/h and d(e^{−itx})/dx = −lim_{h→0} [(e^{−itx} − e^{−it(x+h)})/h], we have the following result.

Theorem 2.3.2.

Let the characteristic function ϕ(t) be integrable. Then

f(x) = (1/(2π)) ∫_{−∞}^{∞} e^{−itx} ϕ(t) dt.
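As a quick numerical illustration of Theorem 2.3.2 (our own sketch, not part of the original text), the following R code inverts the standard normal characteristic function ϕ(t) = exp(−t²/2) and compares the recovered density with dnorm; the function name invert_cf and the grid of x values are arbitrary choices.

# Numerical inversion of a characteristic function (Theorem 2.3.2):
# f(x) = (1/(2*pi)) * integral over the real line of exp(-i*t*x)*phi(t) dt
invert_cf <- function(x, phi) {
  integrand <- function(t) Re(exp(-1i*t*x) * phi(t))  # the imaginary part integrates to 0
  integrate(integrand, -Inf, Inf)$value/(2*pi)
}
phi_norm <- function(t) exp(-t^2/2)                   # characteristic function of N(0,1)
x <- c(-2, -1, 0, 1, 2)
cbind(x, recovered = sapply(x, invert_cf, phi = phi_norm), exact = dnorm(x))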

2.4 Applications
The common scheme to apply mechanisms based on characteristic functions
consists of the following algorithm:

Original problem based on distribution functions → Transformed problem based on characteristic functions → Transformed solution → (inverse transformation) → Solution.

We apply this algorithm within the set of examples given below.

2.4.1 Expected Lengths of Random Stopping Times and Renewal


Functions in Light of Tauberian Theorems
Assume, for example, that H denotes the hours that a physician can serve
patients during a specific time span. The patients are surveyed sequentially
and each patient, i, requires X i hours to be observed. Suppose X i > 0, i = 1, 2,...,
are iid random variables. Then the physician will observe a random number
of patients denoted as

N(H) = min{ n ≥ 1 : Σ_{i=1}^{n} X_i ≥ H }.

In this case, the physician will end his/her shift when Σ_{i=1}^{n} X_i overshoots the threshold H > 0.

In statistical applications, constructions similar to N(H) above fall in the realm of Renewal Theory (e.g., Cox, 1962). The integer random variable N(H) is a stopping time and its expectation E{N(H)} is called a renewal function.
In general, X_i, i = 1, 2,... can be negative. In this case the stopping time has the form N(H) = inf{ n ≥ 1 : Σ_{i=1}^{n} X_i ≥ H } and we need to require that Pr{N(H) < ∞} = 1 in order to stop at all.
In this section we analyze the expectation E{N(H)}, when X_i > 0, i = 1, 2,... are continuous random variables. Toward this end we begin with the simple expression

E{N(H)} = Σ_{j=1}^{∞} j Pr{N(H) = j} = Σ_{j=1}^{∞} j Pr{ Σ_{i=1}^{j−1} X_i < H, Σ_{i=1}^{j} X_i ≥ H },

where we define Σ_{i=1}^{0} X_i = 0 and use the rule that the process stops the first time at j when the event { Σ_{i=1}^{j−1} X_i < H, Σ_{i=1}^{j} X_i ≥ H } occurs. (Note that if not all X_i > 0, i = 1, 2,..., then we should consider {N(H) = j} = { max_{1≤k≤j−1} (Σ_{i=1}^{k} X_i) < H, max_{1≤k≤j} (Σ_{i=1}^{k} X_i) ≥ H }; see Figure 2.1 for details.) Thus

FIGURE 2.1
The sum Σ_{i=1}^{n} X_i crosses the threshold H at j when (a) all X_i > 0, i = 1, 2,... and (b) we have several negative observations.

E{N(H)} = Σ_{j=1}^{∞} j [ Pr{ Σ_{i=1}^{j−1} X_i < H } − Pr{ Σ_{i=1}^{j−1} X_i < H, Σ_{i=1}^{j} X_i < H } ]

= Σ_{j=1}^{∞} j [ Pr{ Σ_{i=1}^{j−1} X_i < H } − Pr{ Σ_{i=1}^{j} X_i < H } ]

= Σ_{j=1}^{∞} j Pr{ Σ_{i=1}^{j−1} X_i < H } − Σ_{j=1}^{∞} j Pr{ Σ_{i=1}^{j} X_i < H }        (2.7)

= 1 + Σ_{j=1}^{∞} (j + 1) Pr{ Σ_{i=1}^{j} X_i < H } − Σ_{j=1}^{∞} j Pr{ Σ_{i=1}^{j} X_i < H }

= 1 + Σ_{j=1}^{∞} Pr{ Σ_{i=1}^{j} X_i < H }.

This result leads to the non-asymptotic upper bound of the renewal function, since by (2.7) we have

E{N(H)} − 1 = Σ_{j=1}^{∞} Pr{ exp(−(1/H) Σ_{i=1}^{j} X_i) > e^{−1} }

= Σ_{j=1}^{∞} E[ I{ exp(−(1/H) Σ_{i=1}^{j} X_i) > e^{−1} } ]

≤ e Σ_{j=1}^{∞} E{ exp(−(1/H) Σ_{i=1}^{j} X_i) } = e Σ_{j=1}^{∞} Π_{i=1}^{j} E(e^{−X_i/H})

= e Σ_{j=1}^{∞} { E(e^{−X_1/H}) }^{j},   e = exp(1),

where the fact that I(a > b) ≤ a/b, for all positive a and b, was used. Noting that E(e^{−X_1/H}) ≤ 1 and using the sum formula for a geometric progression, we obtain

E{N(H)} ≤ 1 + e E(e^{−X_1/H}) { 1 − E(e^{−X_1/H}) }^{−1}.

One can apply Taylor's theorem to E(e^{−X_1/H}), considering 1/H around 0 as H → ∞, to easily show that the upper bound for E{N(H)} is asymptotically linear with respect to H, when E(X_1²) < ∞.
Now let us study the asymptotic behavior of E{N(H)} as H → ∞. Toward this end we consider the following nonformal arguments: (1) Characteristic functions correspond to distribution functions that increase and are bounded by 1. Certainly, the properties of a characteristic function will be satisfied if we assume the corresponding “distribution functions” are bounded by 2 or, more generally, by linear-type functions. In this context the function E{N(H)} increases monotonically and is bounded by a linear-type function. Thus we can focus on the transformation ψ(t) = ∫_0^∞ e^{itu} d[E{N(u)}] of E{N(u)} > 0 in a similar manner to that of characteristic functions. (2) Intuitively, relatively small values of t allow ψ(t) to represent the behavior of E{N(H)} when H is large, because schematically it may be displayed that

ψ(t → 0) ~ ∫_0^∞ e^{iu(t→0)} d[E{N(u)}] ~ E{N(u → ∞)}.

In a rigorous manner, these arguments can be found in the context of Tauberian theorems (e.g., Korevaar, 2004; Subkhankulov, 1976; Yakimiv, 2005) that can associate formally ψ(t), as t → 0, with E{N(H)}, as H → ∞.
In this chapter we will prove a simple Tauberian theorem to complete the
asymptotic evaluation of E {N ( H )} (see Proposition 2.4.1 below).
By virtue of (2.7), the function ψ(t) can be presented as

ψ(t) = ∫_0^∞ e^{itu} d[ 1 + Σ_{j=1}^{∞} Pr{ Σ_{i=1}^{j} X_i < u } ] = Σ_{j=1}^{∞} ∫_0^∞ e^{itu} d Pr{ Σ_{i=1}^{j} X_i < u }

= Σ_{j=1}^{∞} E{ exp(it Σ_{i=1}^{j} X_i) } = Σ_{j=1}^{∞} q^{j}

with q = E(e^{itX_1}) and |q| < 1. The sum formula for a geometric progression provides

ψ(t) = q/(1 − q).

Applying Taylor's theorem to q = E(e^{itX_1}) with respect to t around 0, we have q = 1 + E(itX_1) + o(t), when we assume E(X_1²) < ∞, and then ψ(t) ~ −1/{itE(X_1)} = i/{tE(X_1)} as t → 0. The transformation i/{tE(X_1)} corresponds to the function A(u) = u/E(X_1), u > 0, since (see Theorem 2.3.2)

dA(u)/du = (1/(2π)) ∫_{−∞}^{∞} e^{−itu} [ i/{tE(X_1)} ] dt = (1/(2πE(X_1))) ∫_{−∞}^{∞} [ i cos(tu) + sin(tu) ]/t dt = 1/E(X_1)

(see the proof of Theorem 2.3.1 to obtain values of the integrals ∫_{−∞}^{∞} {cos(ut)/t} dt and ∫_{−∞}^{∞} {sin(ut)/t} dt). Thus E{N(H)} ~ H/E(X_1) as H → ∞.
Let us derive the asymptotic result about E{N(H)} given above in a formal way. By the definition of N(H) we obtain

Σ_{i=1}^{N(H)} X_i ≥ H and then E{ Σ_{i=1}^{N(H)} X_i } ≥ H.

Using Wald's lemma (this result will be proven elsewhere in this book, see Chapter 4), we note that

E{ Σ_{i=1}^{N(H)} X_i } = E(X_1) E{N(H)} and then E{N(H)} ≥ H/E(X_1).        (2.8)

Consider the following scenario where we define g(u) > 0 to be an increasing function. Then the real-valued Laplace transformation of g(u) has the form G(s) = ∫_0^∞ e^{−su} dg(u), s > 0. The form of the Laplace transformation belongs to a family of transformations that includes the characteristic functions considered above. In this case, we have the Tauberian-type result given as follows:

Proposition 2.4.1. Assume a > 0 is a constant and the increasing function g(u) satisfies the conditions (i) g(u) ≥ au and (ii) g(u) = o(e^u) as u → ∞. If G(s) < ∞, for all s > 0, then

g(u) ≤ [G(1/u) − au] e^{w}/(w − 1) + auw e^{w−1},

for all u > 0 and w > 1.

Proof. By virtue of condition (ii), integrating by parts, we obtain G(s) = s ∫_0^∞ e^{−sy} g(y) dy.

Since g(u) increases and g(u) ≥ au, for s > 0, we have

G(s) = s ∫_0^u e^{−sy} g(y) dy + s ∫_u^{wu} e^{−sy} g(y) dy + s ∫_{wu}^∞ e^{−sy} g(y) dy

≥ sa ∫_0^u e^{−sy} y dy + s g(u) ∫_u^{wu} e^{−sy} dy + sa ∫_{wu}^∞ e^{−sy} y dy.

Thus

G(s) ≥ (a/s){ 1 − e^{−su}(1 + su) } + g(u)( e^{−su} − e^{−swu} ) + (a/s){ e^{−swu}(1 + suw) }.

Then, setting s = 1/u, we have

g(u) ≤ [G(1/u) − au]/(e^{−1} − e^{−w}) + au [ 2e^{−1} − (1 + w)e^{−w} ]/(e^{−1} − e^{−w}).        (2.9)

Taylor's theorem provides e^{−1} − e^{−w} = (w − 1) exp[−{1 + θ1(w − 1)}] and 2e^{−1} − (1 + w)e^{−w} = (w − 1){1 + θ2(w − 1)} exp[−{1 + θ2(w − 1)}] with 0 < θ1, θ2 < 1, where e^{−1} − e^{−w} and 2e^{−1} − (1 + w)e^{−w} are evaluated as functions of w around w = 1. Therefore e^{−1} − e^{−w} ≥ (w − 1) exp(−w) and 2e^{−1} − (1 + w)e^{−w} ≤ (w − 1) w e^{−1}. This modifies inequality (2.9) to the form

g(u) ≤ [G(1/u) − au]/{(w − 1)e^{−w}} + auw e^{−1+w}.

This completes the proof.


Proposition 2.4.1 with g(u) = E{N(u)} and a = {E(X_1)}^{−1} implies

1/E(X_1) ≤ E{N(u)}/u ≤ [G(1/u) − u/E(X_1)] e^{w}/{(w − 1)u} + w e^{w−1}/E(X_1),

where, in a similar manner to the analysis of ψ(t) = ∫_0^∞ e^{itu} d[E{N(u)}] = q/(1 − q) shown above, we have G(1/u) = Σ_{j=1}^{∞} E{ exp(−Σ_{i=1}^{j} X_i/u) } = v/(1 − v), v = E(e^{−X_1/u}) = 1 − E(X_1)/u + o(1/u) as u → ∞ and E(X_1²) < ∞. This means that

1/E(X_1) ≤ lim_{u→∞} E{N(u)}/u ≤ w e^{w−1}/E(X_1).

Considering w → 1, we conclude that lim_{u→∞} (1/u) E{N(u)} = {E(X_1)}^{−1}.

The asymptotic result E{N(H)} ~ H/E(X_1), as H → ∞, is called the elementary renewal theorem. Naïvely, taking into account that E(Σ_{i=1}^{n} X_i) = nE(X_1) and the definition N(H) = min{ n ≥ 1 : Σ_{i=1}^{n} X_i ≥ H }, one can decide that it is obvious that E{N(H)} ~ H/E(X_1). However, we invite the reader to employ probabilistic arguments in a manner different from that mentioned above in order to prove this approximation, so as to appreciate the simplicity and structure of the proposed analysis. Proposition 2.4.1 provides non-asymptotically a very accurate upper bound of the renewal function. Perhaps this proposition can be found only in this book. The demonstrated proof scheme can be extended to more complicated cases with, e.g., dependent random variables, as well as improved to obtain more accurate results, e.g., regarding the expression lim_{u→∞} [ u^{−ω} [ E{N(u)} − {E(X_1)}^{−1} u ] ] with 0 ≤ ω ≤ 1 (for example, in Proposition 2.4.1 one can define w = 1 + u^{−ω} log(u)). The stopping time N(H) will also be studied in several forthcoming sections of this book.
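As a quick numerical illustration of the elementary renewal theorem (our own sketch, not part of the original text), the following R code simulates N(H) for exponentially distributed X_i and compares the Monte Carlo estimate of E{N(H)} with the approximation H/E(X_1); the function name sim_N and the chosen parameter values are arbitrary.

# Simulate the stopping time N(H) = min{n >= 1 : X_1 + ... + X_n >= H}
sim_N <- function(H, rgen) {
  n <- 0; s <- 0
  while (s < H) {             # keep sampling until the running sum overshoots H
    n <- n + 1
    s <- s + rgen(1)
  }
  n
}
set.seed(1)
H <- 50; mean_X <- 2                      # E(X_1) = 2 for the exponential distribution
N_values <- replicate(5000, sim_N(H, function(k) rexp(k, rate = 1/mean_X)))
c(MonteCarlo = mean(N_values), ElementaryRenewal = H/mean_X)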
Warning: Tauberian-type propositions should be applied very carefully. For example, define the function v(x) = sin(x). The corresponding Laplace transformation is J(s) = ∫_0^∞ e^{−su} d sin(u) = s/(1 + s²), s > 0. This does not lead us to conclude that, since J(s) → 0 as s → 0, we have v(x) → 0 as x → ∞. In this case we remark that v(x) is not an increasing function for x ∈ (−∞, ∞).

2.4.2 Risk-Efficient Estimation


An interesting problem in statistical estimation is that of constructing an estimate that minimizes a risk function, which is typically defined as the sum of the estimate's mean squared error and the cost of sampling. Robbins (1959) initiated the study of such risk-efficient estimates, and the idea of assessing the performance of sequential estimates using a risk function became extremely popular in statistics. Certain developments in this area of estimation are summarized in Lai (1996).
Let θ̂n denote an estimate of an unknown parameter θ computed from a sample of fixed size n. Assume that nE(θ̂n − θ)² → σ², as n → ∞. If the cost of taking a single observation equals C, then the risk function associated with θ̂n is given by

Rn(C) = E(θ̂n − θ)² + Cn.

Thus we want to minimize the mean squared error of the estimate balanced by the attempt to request larger and larger sets of data points, which is restricted by the requirement to pay for each observation. For example, consider the simple estimator of the mean θ = E(X_1) based on iid observations X_i, i = 1,..., n, given by θ̂n = X̄n = Σ_{i=1}^{n} X_i/n. In this case Rn(C) = σ²/n + Cn with σ² = var(X_1).
Making use of the definition of the risk function one can easily show that Rn(C) is minimized when n = n*_C = σ/C^{1/2}. Since the optimal fixed sample size n*_C depends on σ, which is commonly unknown, one can resort to sequential sampling and consider the stopping rule within the estimation procedure of the form

N(C) = inf{ n ≥ n0 : n ≥ σ̂n/C^{1/2} },

where n0 ≥ 1 is an initial sample size and σ̂n is a consistent estimate of σ > 0. We sample the observations sequentially (one by one) and stop buying data points at N(C), obtaining our final estimate θ̂_{N(C)}. The method presented in Section 2.4.1 can be adapted to show that the expected sample size E{N(C)} is asymptotically equivalent to the optimal fixed sample size n*_C in the sense that E{N(C)}/n*_C → 1 as C → 0 (Vexler and Dmitrienko, 1999).
Consider, for another example, iid observations X_i > 0, i = 1,..., n, from an exponential distribution with the density function f(u) = θ^{−1} exp(−uθ^{−1}), θ > 0. In this case, the maximum likelihood estimator of the parameter θ = E(X_1) is θ̂n = X̄n = Σ_{i=1}^{n} X_i/n and E(θ̂n − θ)² = θ²/n. Assume, depending on parameter values, it is required to pay θ⁴H for one data point. Then Rn(H) = θ²/n + Hθ⁴n and we have Rn(H) − R_{n−1}(H) = −θ²/{(n − 1)n} + Hθ⁴. The equation Rn(H) − R_{n−1}(H) = 0 gives the optimal fixed sample size n*_H ≈ θ^{−1}/H^{1/2}, where θ is unknown. In this framework, the stopping rule is

N(H) = inf{ n ≥ n0 : n ≥ (θ̂n)^{−1}/H^{1/2} } = min{ n ≥ n0 : Σ_{i=1}^{n} X_i ≥ 1/H^{1/2} }

and the estimator is θ̂_{N(H)} = Σ_{i=1}^{N(H)} X_i/N(H). It is straightforward to apply the technique mentioned in Section 2.4.1 to evaluate E{N(H)}, where the fact that E(e^{−sX_1}) = θ^{−1} ∫_0^∞ e^{−su} e^{−u/θ} du = {1 + θs}^{−1} can be taken into account.

Note that naïvely one can estimate nC* using the estimator σ̂/C1/2 . This is
very problematic, since in order to obtain the estimator σ̂ of σ we need a
sample with the size we try to approximate.
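The stopping rule in the exponential example above is easy to examine by simulation. The following R sketch (our own illustration; the function name run_stop, the parameter values and the choice n0 = 1 are arbitrary assumptions) draws observations one by one, stops once Σ X_i ≥ 1/H^{1/2}, and compares the average stopping time with the optimal fixed sample size n*_H ≈ θ^{−1}/H^{1/2}.

# Sequential stopping rule N(H) = min{n >= n0 : X_1 + ... + X_n >= 1/sqrt(H)}
run_stop <- function(H, theta, n0 = 1) {
  n <- 0; s <- 0
  repeat {
    n <- n + 1
    s <- s + rexp(1, rate = 1/theta)       # exponential observations with E(X_1) = theta
    if (n >= n0 && s >= 1/sqrt(H)) break
  }
  n
}
set.seed(123)
theta <- 2; H <- 0.0001                    # a small H mimics a small per-observation cost
N_sim <- replicate(2000, run_stop(H, theta))
c(MeanStoppingTime = mean(N_sim), OptimalFixedSize = 1/(theta*sqrt(H)))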

2.4.3 Khinchin’s (or Hinchin’s) Form of the Law of Large Numbers


In this section we assume that random variables ξ1,..., ξn are iid with E(ξ1) = θ. The average ξ̄n = n^{−1}Sn, Sn = Σ_{i=1}^{n} ξi, can be considered as a simple estimator of θ. For all ε > 0, Chebyshev's inequality states

Pr(|ξ̄n − θ| > ε) = Pr(|ξ̄n − θ|^k > ε^k) ≤ E(|ξ̄n − θ|^k)/ε^k,   k > 0.

Consider for example k = 2. It may be verified that

E{(Sn − nθ)²} = E[ { Σ_{i=1}^{n} (ξi − θ) }² ] = E[ Σ_{i=1}^{n} Σ_{j=1}^{n} (ξi − θ)(ξj − θ) ]

= E[ Σ_{i=1}^{n} (ξi − θ)² ] + E[ Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} (ξi − θ)(ξj − θ) ]

= Σ_{i=1}^{n} E(ξi − θ)² + Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} E(ξi − θ) E(ξj − θ) = n E(ξ1 − θ)²,

since the cross terms vanish by independence. This yields that Pr(|ξ̄n − θ| > ε) ≤ E(ξ1 − θ)²/(nε²) → 0 as n → ∞, proving the law of large numbers, ξ̄n →p θ as n → ∞.

It is clear that, to conduct proofs in a manner similar to the algorithm above, it is required that E(|ξ1|^{1+δ}) < ∞ for some δ > 0. Let us apply a method based on characteristic functions to prove the property ξ̄n →p θ as n → ∞. We have the characteristic function of ξ̄n in the form

ϕ_ξ̄(t) = E(e^{itSn/n}) = { E(e^{itξ1/n}) }^{n} = exp( n ln( E(e^{itξ1/n}) ) ).

Define the function l(t) = ln( E(e^{itξ1}) ), rewriting

ϕ_ξ̄(t) = exp( n l(t/n) ) = exp( t { l(t/n) − l(0) }/(t/n) ),

where l(0) = 0. The derivative dl(t)/dt is { E(e^{itξ1}) }^{−1} E(iξ1 e^{itξ1}) and then dl(u)/du|_{u=0} = iE(ξ1) = iθ. By the definition of the derivative,

{ l(t/n) − l(0) }/(t/n) → dl(u)/du|_{u=0} = iθ as n → ∞.

Thus ϕ_ξ̄(t) → e^{itθ} as n → ∞, where e^{itθ} corresponds to a degenerate distribution function at θ. This justifies that ξ̄n →p θ as n → ∞, requiring only that E(|ξ1|) < ∞.
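To illustrate numerically that only E(|ξ1|) < ∞ is needed (our own sketch, not from the text), the following R code tracks the running mean of iid Pareto-type observations with finite mean but infinite variance; the tail index 1.5 and the sample sizes are arbitrary choices.

# Pareto variables with minimum 1 and tail index alpha = 1.5: finite mean, infinite variance
set.seed(7)
alpha <- 1.5
xi <- (1 - runif(1e5))^(-1/alpha)        # inverse-transform sampling from the Pareto distribution
running_mean <- cumsum(xi)/seq_along(xi)
running_mean[c(1e2, 1e3, 1e4, 1e5)]      # slowly approaches E(xi_1) = alpha/(alpha - 1) = 3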

2.4.4 Analytical Forms of Distribution Functions


Characteristic functions can be employed in order to derive analytical forms
of distribution functions. Consider the following examples.

1. Let independent random variables ξ1,..., ξn be normally distributed. The sum Sn = ξ1 + ... + ξn is normally distributed with E(Sn) = Σ_{i=1}^{n} E(ξi) and var(Sn) = Σ_{i=1}^{n} var(ξi). Lemma 2.3.1 shows this fact, since the characteristic function of Sn satisfies

ϕ_{Sn}(t) = Π_{i=1}^{n} E(e^{itξi}) = Π_{i=1}^{n} exp( itE(ξi) − var(ξi)t²/2 ) = exp( it Σ_{i=1}^{n} E(ξi) − (t²/2) Σ_{i=1}^{n} var(ξi) ).

2. Gamma distributions: Let ξ have the density function

f(u) = γ^{λ} u^{λ−1} e^{−γu} / ( ∫_0^∞ x^{λ−1} e^{−x} dx ), u ≥ 0, and f(u) = 0, u < 0,

with the parameters λ > 0 (a shape) and γ > 0 (a rate = 1/scale). In a similar manner to the proof of Lemma 2.3.1, using integration by parts, we have that the characteristic function of ξ satisfies d ln(ϕ_ξ(t))/dt = d( −λ ln(γ − it) )/dt. Then, taking into account ϕ_ξ(0) = 1, we obtain ϕ_ξ(t) = (1 − it/γ)^{−λ}.

3. By virtue of the definition of the χ²_n distribution function with n degrees of freedom, the random variable ζn = Σ_{i=1}^{n} ξi² has a χ²_n distribution if iid random variables ξ1,..., ξn are N(0,1) distributed. Note that

Pr(ζ1 < u) = Pr(|ξ1| < √u) = (2/√(2π)) ∫_0^{√u} e^{−x²/2} dx.

Thus, the density function of ζ1 is (2π)^{−1/2} e^{−u/2} u^{−1/2}, which corresponds to the gamma distribution from Example 2 above with parameters λ = 1/2 and γ = 1/2. This means the characteristic function of ζn is (1 − 2it)^{−n/2}. Therefore we conclude that ζn is gamma distributed with parameters λ = n/2 and γ = 1/2.

2.4.5 Central Limit Theorem

Let ξ1,..., ξn be iid random variables with a = E(ξ1) and 0 < σ² = var(ξ1) < ∞. Define the random variables Sn = Σ_{i=1}^{n} ξi and ζn = (σ²n)^{−1/2}(Sn − an). The central limit theorem states that ζn has asymptotically (as n → ∞) the N(0,1) distribution. We will prove this theorem using characteristic functions.
Toward this end, we begin with the representation of ζn in the form ζn = n^{−1/2} Σ_{i=1}^{n} (ξi − a)/σ, where E{(ξ1 − a)/σ} = 0 and var{(ξ1 − a)/σ} = 1, which allows us to assume a = 0, σ = 1 without loss of generality. In this case, we should show that the characteristic function of ζn, say ϕζn(t), converges to e^{−t²/2}. It is clear that

ϕζn(t) = E( e^{it Σ_{i=1}^{n} ξi/√n} ) = Π_{i=1}^{n} E( e^{itξi/√n} ) = { ϕξ(t/√n) }^{n},

that is, ln(ϕζn(t)) = n ln( ϕξ(t/√n) ). When n → ∞, t/√n → 0, which leads us to apply Taylor's theorem to the following objects: (1) ϕξ(t/√n), considering t/√n around 0, and (2) the logarithm function, considering its argument around 1. That is,

ln(ϕζn(t)) = n ln( 1 − t²/(2n) + o(t²/n) ) = n { −t²/(2n) + o(t²/n) } → −t²/2,   n → ∞.

The proof is complete.
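A brief numerical check of the central limit theorem (our own sketch, not part of the original text) can be run in R by generating standardized sums of exponential variables; the number of summands and replications are arbitrary choices.

# Standardized sums of iid Exp(1) variables compared with the N(0,1) limit
set.seed(9)
n <- 200                                               # number of summands
zeta <- replicate(10000, (sum(rexp(n)) - n)/sqrt(n))   # a = 1 and sigma = 1 for Exp(1)
c(mean = mean(zeta), sd = sd(zeta),
  P_leq_1.96 = mean(zeta <= 1.96), Phi_1.96 = pnorm(1.96))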


The simple proof scheme shown above can be easily adapted when ζn has a more complicated structure or ξ1,..., ξn are not iid. For example, when ξ1,..., ξn are independent but not identically distributed, we have ln(ϕζn(t)) = Σ_{i=1}^{n} ln( E(e^{itξi/√n}) ) and then the asymptotic evaluation of ln(ϕζn(t)) can be provided in a similar manner to that demonstrated in this section.

2.4.5.1 Principles of Monte Carlo Simulations


The outcomes of the law of large numbers and the central limit theorem yield
central techniques applied in a context of Monte Carlo simulations (e.g.,
Robert and Casella, 2004).


Consider the problem regarding calculation of the integral J = ∫_a^b g(u) du, where g is a known function and the bounds a, b are not necessarily finite. One can rewrite the integral in the form

J = ∫_{−∞}^{∞} [ {g(u)/f(u)} I{u ∈ [a, b]} ] f(u) du,

using a density function f. Then, we have J = E[ {g(ξ)/f(ξ)} I{ξ ∈ [a, b]} ] with ξ, which is a random variable distributed according to the density function f. In this case, we can approximate J using

J_n = (1/n) Σ_{i=1}^{n} [ {g(ξi)/f(ξi)} I{ξi ∈ [a, b]} ],

where ξ1,..., ξn are independent and simulated from f. The law of large numbers provides J_n → J as n → ∞.
The central limit theorem can be applied to evaluate how accurate this approximation is, as well as to define a value of the needed sample size n. That is, asymptotically we have

√n (J_n − J) [ E( [ {g(ξ1)/f(ξ1)} I{ξ1 ∈ [a, b]} − J ]² ) ]^{−1/2} ~ N(0, 1)

and then, for example,

Pr{ |J_n − J| ≥ δ } = 1 − Pr{ J_n − J < δ } + Pr{ J_n − J < −δ } ≈ 1 − Φ( δ√n/Δ ) + Φ( −δ√n/Δ ),

where Φ(u) = ∫_{−∞}^{u} e^{−x²/2} dx/√(2π) and Δ is an estimator of

[ E( [ {g(ξ1)/f(ξ1)} I{ξ1 ∈ [a, b]} − J ]² ) ]^{1/2},

obtained, e.g., by generating, for a relatively large integer M, iid random variables η1,..., ηM with the density function f, and defining

Δ² = (1/M) Σ_{i=1}^{M} [ {g(ηi)/f(ηi)} I{ηi ∈ [a, b]} − (1/M) Σ_{j=1}^{M} {g(ηj)/f(ηj)} I{ηj ∈ [a, b]} ]²

= (1/M) Σ_{i=1}^{M} [ {g(ηi)/f(ηi)} I{ηi ∈ [a, b]} ]² − [ (1/M) Σ_{j=1}^{M} {g(ηj)/f(ηj)} I{ηj ∈ [a, b]} ]².

Thus, denoting presumed values of the errors δ1 and δ2, we can derive the sample size n that satisfies

Pr{ |J_n − J| ≥ δ1 } ≈ 1 − Φ( δ1√n/Δ ) + Φ( −δ1√n/Δ ) ≤ δ2.
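As a concrete sketch of this Monte Carlo scheme (our own example; the integrand g, the interval [0, 2] and the choice f = standard normal density are arbitrary assumptions), the following R code computes J_n, the estimated standard error Δ/√n, and the exact value of J = ∫_0^2 exp(−u²) du for comparison.

# Monte Carlo approximation of J = integral of g over [a, b],
# using J = E[{g(xi)/f(xi)} I{xi in [a,b]}] with xi drawn from the density f
set.seed(11)
g <- function(u) exp(-u^2)
a <- 0; b <- 2
n <- 1e5
xi <- rnorm(n)                                   # draws from f = dnorm
w  <- (g(xi)/dnorm(xi)) * (xi >= a & xi <= b)    # the averaged terms
J_n   <- mean(w)                                 # Monte Carlo estimate of J
Delta <- sd(w)                                   # estimate of the asymptotic standard deviation
c(estimate = J_n,
  std_error = Delta/sqrt(n),
  exact = sqrt(pi) * (pnorm(2*sqrt(2)) - 0.5))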
Note that, in order to generate random samples from any density function f ,
we can use the following lemma.

Lemma 2.4.1. Let F denote a continuous distribution function of a random variable ξ. Then the random variable η = F(ξ) is uniformly [0,1] distributed.

Proof. Consider the distribution function

Pr(η ≤ u) = Pr{0 ≤ F(ξ) ≤ u} = Pr{ξ ≤ F^{−1}(u)} I(0 ≤ u ≤ 1) + I(u > 1)

= F(F^{−1}(u)) I(0 ≤ u ≤ 1) + I(u > 1) = u I(0 ≤ u ≤ 1) + I(u > 1),

where F^{−1} defines the inverse function. This completes the proof of Lemma 2.4.1.
From this result, it is apparent that, generating η ~ Uniform[0,1], we can compute ξ as a root of the equation η = F(ξ) to obtain the random variable ξ distributed according to F. We remark that an R function (R Development Core Team, 2014), uniroot (or optimize applied to minimize the function {η − F(ξ)}²), can be easily used to find numerical solutions for η = F(ξ) with respect to ξ.
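A minimal R sketch of this inverse-transform idea (our own illustration; the logistic distribution function, the search bounds and the helper name r_from_F are arbitrary assumptions) uses uniroot to solve F(ξ) = η for each generated uniform value.

# Generate xi with distribution F by solving F(xi) = eta, eta ~ Uniform[0,1] (Lemma 2.4.1)
F_target <- function(x) 1/(1 + exp(-x))           # logistic CDF, used here only as an example
r_from_F <- function(m, F, lower = -50, upper = 50) {
  eta <- runif(m)
  sapply(eta, function(u) uniroot(function(x) F(x) - u, lower = lower, upper = upper)$root)
}
set.seed(3)
x <- r_from_F(10000, F_target)
c(mean = mean(x), sd = sd(x))   # logistic(0,1): mean 0, sd = pi/sqrt(3), approximately 1.81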
In many cases, the choice of the density function f in the approximation to J given above is intuitive, and the form of f can be chosen close to that of the integrated function g. The function f is assumed to be well-behaved over the support [a, b]. For example, to approximate J = ∫_{−1}^{1} g(u) du, we might select f associated with the Uniform[−1,1] distribution; to approximate J = ∫_{−∞}^{∞} g(u) du, we might select f associated with a normal distribution.

As an alternative to the approach shown above we can employ the Newton–
Raphson method, which is a technique for calculating integrals numerically.
In several scenarios, the Newton–Raphson method can outperform the
Monte Carlo strategy. However, it is known that in various multidimensional
cases, e.g., related to ∫∫∫ g(u1, u2, u3) du1 du2 du3, the Monte Carlo approach is

simpler, more accurate and faster than Newton–Raphson type algorithms.


Regarding multivariate statistical simulations we refer the reader to the book
of Johnson (1987).

Example
Assume T (X 1 ,.., X n ) denotes a statistic based on n iid observations
X 1 ,.., X n with a known distribution function. In various statistical inves-
tigations it is required to evaluate the probability p = Pr {T (X 1 ,..., X n ) > H}
for a fixed threshold, H . This problem statement is common in the con-
text of testing statistical hypotheses when examining the Type I error
rate and the power of a statistical decision-making procedure (e.g.,
Vexler et al., 2016a). In order to employ the Monte Carlo approach, one
can generate X 1 ,.., X n from the known distribution, calculating
ν1 = I {T (X 1 ,..., X n ) > H}. This procedure can be repeated M times to in
turn generate the iid variables ν1 ,..., ν M. Then the probability p can be


estimated as p̂ = Σ_{i=1}^{M} νi/M. The central limit theorem concludes that for large values of M we can expect

Pr{ p ∈ [ p̂ − 1.96 {p̂(1 − p̂)/M}^{0.5}, p̂ + 1.96 {p̂(1 − p̂)/M}^{0.5} ] } ≈ 0.95,

where it is applied that var(ν1) = p(1 − p) and the standard normal distribution function Φ(1.96) = ∫_{−∞}^{1.96} e^{−x²/2} dx/√(2π) ≈ 0.975. Thus, for example,
when p is anticipated to be around 0.05 (e.g., in terms of the Type I error rate analysis), it is reasonable to require 1.96 {0.05(1 − 0.05)/M}^{0.5} < 0.005 and then we can choose M > 8000, whereas, when p is anticipated to be around 0.5 (e.g., in terms of an analysis of the power of a test), it is reasonable to require 1.96 {0.5(1 − 0.5)/M}^{0.5} < 0.05 and then we can recommend choosing M > 500.


For example, to evaluate the Type I error rate of the t-test for H 0:
E (X 1) = 0 versus H 1: E (X 1) ≠ 0 (e.g., Vexler et al., 2016a), one can apply the
following simple R code:

MC<-10000  #the number of the Monte Carlo generations
n<-25      #sample size per generated data set
Decision<-array()
TestStatistic<-array()
for(i in 1:MC)
{
X<-rnorm(n)   #data generated under H0: E(X1)=0
TestStatistic[i]<-t.test(X,alternative = c("two.sided"))$statistic
Decision[i]<-1*(TestStatistic[i]>qt(0.95,n))  #reject when the statistic exceeds the critical value
}
mean(Decision)   #Monte Carlo estimate of the Type I error rate

In this code we can change MC<-10000 to MC<-1000 and X<-rnorm(n)


to X<-rnorm(n,0.1,1) in order to obtain the Monte Carlo power, when the
alternative parameter is assumed to be E (X 1) = 0.1. Note that, assuming a
value of the alternative parameter and a target value of the power, we
can employ the R code above to determine n. This statement corresponds
to the problem of the power calculation (or the sample size calculation)
that is a common issue of study designs.

2.4.5.2 That Is the Question: Do Convergence Rates Matter?


Sums of random variables play vital roles in statistics. For example, loga-
rithms of likelihood functions considered in later chapters have forms of
sums of random variables. The law of large numbers as well as the central
limit theorem are partial solutions to a general problem of analyzing asymp-
totic behaviors of sums of random variables. For example, when ξ1,..., ξn are iid random variables with E(ξ1) = a, var(ξ1) = 1, we have (ξ1 + ... + ξn)/n → a and (ξ1 + ... + ξn − an)/n^{1/2} → ζ as n → ∞, where a is the constant, but ζ is a N(0,1) random variable. Informally, one can say: “the sum (ξ1 + ... + ξn) grows approximately at the same rate as an.” The question is what is happening “in between” the law of large numbers (when we divide the sum by n) and the central limit theorem (when we divide the sum by n^{1/2}). Specifically, if we consider ωn = (ξ1 + ... + ξn − an)/bn, where bn is intermediate in size between n^{1/2} and n, is ωn asymptotically a constant or a random variable? This is a
nontrivial question. We invite the reader to think about this problem that
will be discussed in several forthcoming sections of this book.

2.4.6 Problems of Reconstructing the General Distribution Based


on the Distribution of Some Statistics
In various practical applications, researchers aim to estimate the distribution function of iid random variables, say X1,.., Xn, observing only statistics based on X1,.., Xn, e.g., in the forms Σ_{i=1}^{m} X_i, X_i + ε_i or max(X_i). Consider, for example, the following practical problem. One of the main issues of epidemiologi-
cal research for the last several decades has been the relationship between
biological markers (biomarkers) and disease risk. Commonly, a measure-
ment process yields operating characteristics of biomarkers. However, the
high cost associated with evaluating these biomarkers as well as measure-
ment errors corresponding to a measurement process can prohibit further
epidemiological applications (e.g., Armstrong, 2015: Chapter 31). When, for
example, analysis is restricted by the high cost of assays, following
Schisterman et al. (2011), we suggest applying an efficient pooling design to the collection of data (see Figure 2.2 for an illustrative purpose).

FIGURE 2.2
Pooling: p individual specimens are physically combined to create a single mixed (“pooled”) sample; each pooled measurement is the average of the corresponding individual specimens.
In order to allow for the instrument sensitivity problem corresponding to
measurement errors, we formulate models with additive measurement errors.
Obviously, these issues require assumptions on a biomarker distribution,
under which operating characteristics of a biomarker can be evaluated.
The existence of measurement error in exposure data potentially affects
any inferences regarding variability and uncertainty because the distribu-
tion representing the observed data set deviates from the distribution that
represents an error-free data set. Methodologies for improving the character-
ization of variability and uncertainty with measurement errors in data are
proposed in many biostatistical manuscripts. Thus, the model, which corresponds to observing a biomarker of interest plus a measurement error, does not need an extensive description here. However, note that, since values of a
biomarker are functionally convolute with noisy measurement errors, usu-
ally distribution functions of measurement errors are assumed to be known.
Moreover, on account of the complexity of extracting the biomarker distribu-
tion from observed convoluted data (say deconvolution), practically, nor-
mality assumptions related to the biomarker distribution are assumed.
Under these assumptions parameters of an error distribution can be evalu-
ated by applying an auxiliary reliability study of biospecimens (e.g., a cycle
of remeasuring of biospecimens) (Schisterman et al. 2001b).
Another stated problem dealing with the deconvolution is the exploration
based on pooled data. Without focusing on situations where pooled data is
an organic output of a study, we touch on pooling in the context of study
design. The concept of a pooling-based design is extensively dealt with in the
statistical literature starting with publications related to cost-efficient testing
of World War II recruits for syphilis. In order to reduce cost or labor inten-
siveness of a study, a pooling strategy may be employed, whereby 2 or more
(say p) individual specimens are physically combined into a single “pooled”
unit for analysis. Thus, applying pooling design provides a p-fold decrease

of the number of measured assays. Each pooled sample test result is assumed
to be the average of the individual unpooled samples; this is most often the
case because many tests are expressed per unit of volume.
Thus, assuming that iid unobserved variables X1, X2,... represent measurements of a biomarker of interest, one can consider scenarios in which the distribution function of X1 should be estimated based on the observations Zj = p^{−1} Σ_{i=p(j−1)+1}^{pj} X_i (pooling) or Wj = Xj + εj (measurement error), where ε1, ε2,... are iid random variables with a known distribution function; see for details, e.g., Vexler et al. (2008b, 2010a). In these cases the following idea can be used. The characteristic functions of Z1 and W1 satisfy ϕ_Z(t) = {ϕ_X(t/p)}^{p} and ϕ_W(t) = ϕ_X(t) ϕ_ε(t), respectively, where ϕ_X(t) is the characteristic function of X1 and ϕ_ε(t) is the known characteristic function of ε1. The empirical estimators of ϕ_Z(t) or ϕ_W(t) based on samples {Zj}_{j≥1} or {Wj}_{j≥1} can be easily obtained (e.g., Feuerverger and Mureika, 1977), which yields a possibility to estimate ϕ_X(t) under certain assumptions regarding the distribution function of X1 (Vexler et al., 2008b, 2010a), e.g., ϕ̂_X(t) = ϕ̂_W(t)/ϕ_ε(t), where ϕ̂_W(t) is the empirical estimator of ϕ_W(t). Theorem 2.3.2 can be applied to develop an estimator of the distribution function of X1, using an estimated ϕ_X(t). For example, the estimator of the density function of X1 can be presented as f̂(x) = (1/(2π)) ∫_{−T}^{T} e^{−itx} ϕ̂_X(t) dt, for a fixed T > 0. In practice, values of T can be chosen to be 0.6, 0.7, 1.2, 1.3, 1.66, 1.7, etc.
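A rough R sketch of this deconvolution estimator in the measurement-error setting (our own illustration: the simulated biomarker distribution, the error variance, the sample size, the cutoff T = 1.7 and the function names are all arbitrary assumptions) is as follows.

# Deconvolution density estimate of X from W = X + eps with a known error distribution
set.seed(5)
n <- 2000
X <- rnorm(n, mean = 1, sd = 1)                # unobserved biomarker values (for illustration only)
W <- X + rnorm(n, sd = 0.5)                    # observed values, contaminated by measurement error
phi_W_hat <- function(t) mean(exp(1i*t*W))     # empirical characteristic function of W
phi_eps   <- function(t) exp(-0.5^2 * t^2/2)   # known characteristic function of the N(0, 0.25) error
f_X_hat <- function(x, Tcut = 1.7) {
  integrand <- function(t) Re(exp(-1i*t*x) * phi_W_hat(t)/phi_eps(t))
  integrate(Vectorize(integrand), -Tcut, Tcut)$value/(2*pi)
}
x <- c(0, 1, 2)
cbind(x, estimated = sapply(x, f_X_hat), true = dnorm(x, mean = 1, sd = 1))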
Note that, in several situations, we observe Zj = max_{p(j−1)+1 ≤ i ≤ pj}(X_i). A method for reconstructing the distribution function of X1 based on {Zj}_{j≥1} is considered in Belomestnyi (2005).

2.4.7 Extensions and Estimations of Families of


Distribution Functions
Relatively simple forms of characteristic functions can represent sets of families of distribution functions. For example, in Chapter 1 we introduce the notation N(μ, σ²) to define a family of normally distributed random variables with the parameters μ and σ² that correspond to the mean and variance of the random variables. Consider, for instance, the parametric characteristic function

ϕ(t; α, β, γ, a) = exp{ iat − γ|t|^{α} ( 1 + iβ (t/|t|) ω(|t|, α) ) },   t ∈ (−∞, ∞),

ω(|t|, α) = tan(πα/2), if α ≠ 1;   ω(|t|, α) = (2/π) ln|t|, if α = 1,

γ ≥ 0, α ∈ (0, 2], |β| ≤ 1, a ∈ (−∞, ∞).



We note that ω(|t|, 2) = 0, so that the class of distribution functions associated with ϕ(t; α, β, γ, a) covers normal distribution functions for α = 2. The parameters α, β, γ, and a are called the characteristic exponent, a measure of skewness, the scale parameter and the location parameter, respectively. The definition of ϕ(t; α, β, γ, a) is quite general and corresponds to many practical distributions, for example, the normal, α = 2; the Cauchy, α = 1; and the stable law of characteristic exponent α = 1/2. Thus, by changing values of the parameters we can modify the classical form of the density function corresponding to ϕ(t; α, β, γ, a). In cases with unknown α, β, γ, a, one can use Theorem 2.3.2 to construct their maximum likelihood estimators (Vexler et al., 2008b).
Although we consider the parametric approach to describe the character-
istic function, estimation of the unknown parameters by using directly den-
sity functions corresponding to ϕ (t ; α , β, γ , a) is difficult; it is complicated by
the fact that their densities are not generally available in closed analytical
forms for all possible values of α , β, γ , a, making it difficult to apply conven-
tional estimation methods.
Schematic R codes: The characteristic function ϕ (t ; α , β, γ , a) can be coded in
the form

fi<- function(t,alpha,beta,gamma,a){
if (alpha==1) w<-2*log(abs(t))/pi else w<-tan(pi*alpha/2)
return(exp(1i*a*t-(gamma*abs(t)^(alpha))*(1+1i*beta*sign(t)*w))) }

where (alpha,beta,gamma,a) are the parameters. The corresponding density


function is
fs<-function(u,alpha,beta,gamma,a){
integ1<-function(t) exp(-1i*u*t)*fi(t,alpha,beta,gamma,a)
integ1R<-function(t) Re(integ1(t))
integrate(integ1R,-Inf,Inf)[[1]]/(2*pi) }
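As a brief usage example of the fi and fs functions above (our own addition; the chosen parameter values and evaluation points are arbitrary), note that α = 2 reproduces a normal-type density, while intermediate values of α produce stable densities with no simple closed form.

# Example use of fs: alpha = 2 corresponds to a normal-type density
u <- c(-1, 0, 1)
round(sapply(u, fs, alpha = 2, beta = 0, gamma = 1, a = 0), 4)   # exp(-t^2) inverts to the N(0, 2) density
round(dnorm(u, mean = 0, sd = sqrt(2)), 4)
# An intermediate stable case (alpha = 1.5) without a simple closed-form density
round(sapply(u, fs, alpha = 1.5, beta = 0, gamma = 1, a = 0), 4)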

Consider the following scenarios, which are left to the reader's imagination. Suppose, for example, we estimate α as 1.1, or 1.2, or 1.5. Is this a “drift” from one classical parametric family to another one, e.g., from Cauchy to Normal, or a “stay” between the families? Suppose we assume data follow a normal N(μ, σ²) distribution and estimate the parameters (μ, σ²). In this case, it can be a vital mistake if the real data follow a Cauchy distribution. Perhaps we can suggest involving the ϕ(t; α, β, γ, a)-based inference to reduce this risk.
3
Likelihood Tenet

3.1 Introduction
When the forms of data distributions are assumed to be known, the likelihood
principle plays a prominent role in statistical methods research for develop-
ing powerful statistical inference tools. The likelihood method, or simply the
likelihood, is arguably the most important concept for inference in parametric
modeling when the underlying data are subject to various assumed stochas-
tic mechanisms and restrictions related to biomedical and epidemiological
studies.
Without loss of generality, consider a situation in which n data points
X 1 ,..., X n are distributed according to the joint density function f ( x1 ,..., xn ; θ)
with the parameter θ, where x1 ,..., xn are arguments of f . Various useful
modern statistical concepts are focused on the likelihood function of θ,
L(θ) = f (X 1 ,..., X n ; θ)c, where c can depend on X 1 ,..., X n but not on θ (usu-
ally c = 1). If X 1 ,..., X n are discrete then for each θ, L(θ) is defined as the
probability of observing a realization of X 1 ,..., X n. The likelihood function
measures the relative plausibility of the range of values for θ given obser-
vations X 1 ,..., X n, i.e., the likelihood informs us as to which θ most likely
generated the observed data provided that model assumptions are correctly
stated.
Beginning from the mid-1700s the problem of determining the most
probable position for the object of observations, including determining the
most likely parameter from a distribution that generated the data, was
extensively treated in terms of mathematical descriptions by such historical
figures as Thomas Simpson, Thomas Bayes, Johann Heinrich Lambert, and
Joseph Louis Lagrange. Statistical literature has credited past figures with
their contributions to likelihood theory, in particular Thomas Bayes in terms
of the well-known Bayes theorem. However, it was not until Ronald Fisher
(1922) revisited the likelihood principle via groundbreaking mathematical
arguments that the approach exploded onto the statistical research land-
scape. Fisher suggested comparing various competing values of θ relative
to their likelihood ratios, which in turn provides a measure of the relative


plausibility of the parameter. The essence of the modern argument of Fisher


can be described in terms of sufficiency.
In modern statistical developments a statistic is defined as sufficient
with respect to a model and its corresponding unknown parameter if “no
other statistic that can be calculated from the same sample provides any
additional information as to the value of the parameter” (Fisher, 1922). The
function L(θ) provides a sufficient statistic with respect to θ. Assume that
there are two estimates of θ from L(θ), say θˆ S and θˆ A. Let θˆ S be a sufficient
statistic. Roughly, we can anticipate that θˆ S and θˆ A are approximately normal
given large samples (in Section 3.3 we rigorously attend to the asymptotic
normality of a L(θ)-based estimator of θ). In accordance with Fisher (1922),
we can consider the situation when θˆ S and θˆ A have a bivariate normal
distribution (e.g., Balakrishnan and Lai, 2009; Borovkov, 1998, p. 109) with
parameters that correspond to the following statements: E(θ̂S) = E(θ̂A) = θ, var(θ̂S) = σS², var(θ̂A) = σA² and E{(θ̂S − θ)(θ̂A − θ)}/(σSσA) = ρ. Intuitively, the assumption E(θ̂S) = E(θ̂A) = θ represents our aim to provide the consistency of the estimates: θ̂S → θ and θ̂A → θ with large samples. Then the classical property of the bivariate normal distribution shows that the conditional expectation is given as E(θ̂A | θ̂S = u) = θ + ρ(σA/σS)(u − θ). This statistic cannot be associated with θ, since θ̂S is sufficient. Thus ρ(σA/σS) = 1 or
σ S = ρσ A ≤ σ A , that is, θˆ S cannot have a larger mean squared error than any
other such estimate θˆ A (if we accept the use of “approximate normality” of the
estimators, in an informal manner). Thus, Fisher concluded that “Sufficiency
implies optimality, at least when combined with consistency and asymptotic
normality.”
In this book, for clarity of explanation, we will assume c = 1 in the defi-
nition of the likelihood function and provide our own arguments (see, for
example, Section 3.5) to position the likelihood function as an essential and
optimal tool in biostatistical inference. We suggest that the reader who is
interested in more details regarding likelihood methodology and its epic
story consult fundamental guides presented in many publications, includ-
ing Berger et al. (1988), Reid (2000) and Stigler (2007).
Likelihood methodology is addressed extensively in the literature. There
are multitudes of books that consider different likelihood-based methods
and their applications to biostatistical problems. In this chapter, we briefly
introduce several principles and aspects of the maximum likelihood estima-
tion and its applications (Sections 3.2 and 3.3), likelihood ratio–based proce-
dures and their optimality (Sections 3.4 and 3.5), and maximum likelihood
ratio tests and their properties (Section 3.6). In Section 3.7 we introduce an
example related to correct model-based likelihood formations.

3.2 Why Likelihood? An Intuitive Point of View


Let us assume that we observe X1,..., Xn. Then the widely used least squares method for finding the likely parameter b that corresponds to the observed data is via the estimate of b given as b̂ = arg min_β Σ_{i=1}^{n} (Xi − β)². The idea

behind this estimation method is clear and easily understandable in the


context of minimization of the distances between data points and the esti-
mated parameter location. While focusing on L(θ), statisticians measure
how “likely” θ is relative to generating the observations, oftentimes aim-
ing to detect values of θ that maximize L(θ). Avoiding considerations based
on Bayesian perspectives (e.g., Bickel and Doksum, 2007, p. 114; Carlin and
Louis, 2011 as well as Section 5.1), one can ponder the following questions:
Why does the concept of max θ L(θ) seem to be a simple and widely applica-
ble idea? Why maximize the joint density over the parameter space? Why is
maximizing the likelihood often the optimal strategy as compared to other
methods such as least squares or the method of moments? Why not rise to
argue in favor of a method of minimum likelihood or even mediocre likeli-
hood? Even though conceptually the likelihood principle is straightforward
on the face of it, the history of the topic shows that this “simple idea” is really
anything but simple.
In most likelihood-based procedures it is anticipated that true values of
the parameter are located around a maximum of the likelihood function. For
pathological cases where the likelihood method fails we refer the reader to
Smith (1985). To get some intuition about the likelihood principle we first con-
sider the expectation of the log likelihood ratio given as Eθ0 log (L (θ1) / L (θ0 )) , { }
where the operator Eθ0 means that the expectation should be derived pro-
vided that θ0 reflects a true value of θ. In this case, Jensen’s inequality yields
the following inequality:

Eθ0{ log( L(θ1)/L(θ0) ) } = Eθ0{ log( f(X1,..., Xn; θ1)/f(X1,..., Xn; θ0) ) }

≤ log( Eθ0[ f(X1,..., Xn; θ1)/f(X1,..., Xn; θ0) ] )

= log ∫ ... ∫ [ f(x1,..., xn; θ1)/f(x1,..., xn; θ0) ] f(x1,..., xn; θ0) dx1 ... dxn

= log ∫ ... ∫ f(x1,..., xn; θ1) dx1 ... dxn = log(1) = 0.

Similarly,

Eθ1{ log( L(θ1)/L(θ0) ) } = −Eθ1{ log( f(X1,..., Xn; θ0)/f(X1,..., Xn; θ1) ) }

≥ −log( Eθ1[ f(X1,..., Xn; θ0)/f(X1,..., Xn; θ1) ] )

= −log ∫ ... ∫ [ f(x1,..., xn; θ0)/f(x1,..., xn; θ1) ] f(x1,..., xn; θ1) dx1 ... dxn

= −log ∫ ... ∫ f(x1,..., xn; θ0) dx1 ... dxn = −log(1) = 0.
Moreover,

Eθ0{ L(θ1)/L(θ0) } = ∫ ... ∫ [ f(x1,..., xn; θ1)/f(x1,..., xn; θ0) ] f(x1,..., xn; θ0) dx1 ... dxn = 1,

whereas

Eθ1{ L(θ1)/L(θ0) } = Eθ1{ 1/( L(θ0)/L(θ1) ) } ≥ 1/Eθ1{ L(θ0)/L(θ1) }

= [ ∫ ... ∫ [ f(x1,..., xn; θ0)/f(x1,..., xn; θ1) ] f(x1,..., xn; θ1) dx1 ... dxn ]^{−1} = 1.

In a probability context, these results support the evaluation of Prθ0{ L(θ0) ≤ L(θ1) }, where the subscript θ0 shows that the probability is considered provided that the true value of θ is θ0. Toward this end, we remark that L(θ) = Π_{i=1}^{n} f(Xi; θ) when n data points X1,..., Xn are iid and X1 is distributed according to the density function f(x1; θ) with the parameter θ. Note that in the general case the function L(θ) can also be presented in the product-type chain rule form

L(θ) = f(X1,..., Xn; θ) = f(Xn | X1,..., X_{n−1}; θ) f(X1,..., X_{n−1}; θ)

= f(Xn | X1,..., X_{n−1}; θ) f(X_{n−1} | X1,..., X_{n−2}; θ) f(X1,..., X_{n−2}; θ) = ...

= Π_{i=1}^{n} f(Xi | X1,..., X_{i−1}; θ).

We highlight this form since commonly theoretical evaluations of statistical


procedures employ propositions related to sums of random variables due to

their relative theoretical tractability. In this case a convenient form to work


with is the log likelihood function l(θ) = log (L(θ)), which is a sum.
Consider the case with iid observations, in which

Prθ0{ L(θ0) ≤ L(θ1) } = Prθ0{ (1/n) Σ_{i=1}^{n} ( log(f(Xi; θ1)) − log(f(Xi; θ0)) ) ≥ 0 }

= Prθ0{ (1/n) Σ_{i=1}^{n} ( ( log(f(Xi; θ1)) − a1 ) − ( log(f(Xi; θ0)) − a0 ) ) ≥ a0 − a1 },

where, for k = 0, 1, log(f(Xi; θk)), i = 1,..., n, are iid random variables with Eθ0{ log(f(X1; θk)) } = ak. Since a0 ≥ a1 as shown above, it is clear that the analysis mentioned in Section 2.4.3, where Chebyshev's inequality is applied, leads to Prθ0{ L(θ0) ≤ L(θ1) } → 0 as n → ∞.
Thus, it is “most likely” “on average” and in probability that the true value θ0 of the parameter θ satisfies L(θ0) ≈ max_θ L(θ).

3.3 Maximum Likelihood Estimation


When the likelihood depends on an unknown parameter (or unknown
parameters), it can be suggested that ranges of plausible values for the param-
eter (or parameters) can be directly obtained from the likelihood function,
first by determining the maximum likelihood estimate. For n iid observa-
tions X1,..., Xn the maximum likelihood estimate is any value θ̂n such that Π_{i=1}^{n} f(Xi; θ̂n) = max_θ L(θ). Note that such a θ̂n need not exist, and when it
does, it usually depends on what form of the density f ( x ; θ) was assumed. In
this section we assert that in regular cases θˆ n is asymptotically normal. We
begin by first outlining the following regularity conditions necessary for the
asymptotic normality of the parameter estimates to hold. In order to consider
θˆ n we need its existence and uniqueness, then let

(i) possible values of θ belong to an open interval (not necessarily finite),


where the term “open interval” means an interval that has no mini-
mum and maximum, e.g., (1, 2), ( − ∞, 0), and
(ii) the set A = {x : f ( x ; θ) > 0} be independent of θ.

For example, we cannot determine a maximum likelihood estimator in the situation when Π_{i=1}^{n} f(Xi; θ) is 0 for all values of θ.

Chapter 2 introduces tools that can be applied to show the asymptotic nor-
mality of likelihood estimators via evaluating sums of iid random variables.
The log maximum likelihood

l(θ̂n) = log( Π_{i=1}^{n} f(Xi; θ̂n) ) = Σ_{i=1}^{n} log( f(Xi; θ̂n) )

is a sum, but its summands are dependent random variables log(f(Xi; θ̂n)), i = 1,..., n. Then the tools shown in Chapter 2 cannot be directly applied to analyze the sum l(θ̂n). However, Section 3.2 supports that we can anticipate that

(iii) θ̂n →p θ as n → ∞, where θ represents a true value of the parameter.

Then Taylor's theorem can be employed to evaluate an l(θ̂n)-based construction, considering θ̂n around θ. Toward this end we should require that

(iv) ∂^k log(f(x; θ))/∂θ^k, k = 1, 2, exist, for all x ∈ A.

Using Taylor's arguments, remainder terms should be taken into account and then the following condition should be satisfied:

(v) There exists a positive number c and a function M(x) such that |∂³ log(f(x; θ0))/∂θ0³| ≤ M(x) for all x, θ0 ∈ (θ − c, θ + c), and Eθ{M(X1)} < ∞.

To state that the random variable ∂log(f(X1; θ))/∂θ has finite positive variance, we denote the Fisher Information I(θ) = Eθ{∂log(f(X1; θ))/∂θ}², conditioning

(vi) 0 < I(θ) = Eθ{∂log(f(X1; θ))/∂θ}² = −Eθ{∂² log(f(X1; θ))/∂θ²} < ∞.
A more rigorous and detailed treatment regarding the regularity conditions


may be found in Lehmann and Casella (1998, pp. 440–450).
The central limit theorem related to the maximum likelihood estimation
has the following form.

Theorem 3.3.1
Assume that X1,..., Xn are iid and satisfy the conditions (i–vi) mentioned above. Then the distribution function of √n(θ̂n − θ) satisfies

Prθ( √n(θ̂n − θ) < u ) → Pr(ξ < u) as n → ∞,

where ξ is a normally distributed random variable with E (ξ) = 0 and


var (ξ) = 1/ I (θ).
Proof. The idea of the proof is to apply Taylor’s theorem in order to repre-
sent a l(θˆ n )-based statistic as a sum of iid random variables and then use the
technique shown in Chapter 2.
Consider the score function given as the derivative l'(θ̂n) = ∂l(u)/∂u|_{u=θ̂n}. Expanding the score function l'(θ̂n) about θ, a true value of the unknown parameter, we have

l'(θ̂n) = l'(θ) + (θ̂n − θ) l''(θ) + 0.5 (θ̂n − θ)² l'''(θn)

with l''(θ) = ∂²l(u)/∂u²|_{u=θ}, l'''(θn) = ∂³l(u)/∂u³|_{u=θn} and θn that lies between θ and θ̂n. Since θ̂n maximizes the log likelihood, l'(θ̂n) = 0. This provides

0 = l'(θ) + (θ̂n − θ){ l''(θ) + 0.5 (θ̂n − θ) l'''(θn) },

resulting in

√n(θ̂n − θ) = [ l'(θ)/√n ] / [ −l''(θ)/n − 0.5 (θ̂n − θ) l'''(θn)/n ].        (3.1)

Consider the remainder term 0.5(θ̂n − θ) l'''(θn)/n. Condition (v) implies

| 0.5(θ̂n − θ) l'''(θn)/n | ≤ 0.5 |θ̂n − θ| Σ_{i=1}^{n} M(Xi)/n,

where M(X1),..., M(Xn) are iid random variables, and then, using the results presented in Section 2.4.3 and condition (iii), we obtain that 0.5(θ̂n − θ) l'''(θn)/n = 0.5(θ̂n − θ) Op(1) = op(1) as n → ∞, since Σ_{i=1}^{n} M(Xi)/n →p Eθ{M(X1)}. Similarly, because l''(θ) is a sum of iid random variables {∂² log(f(Xi; θ))/∂θ²}, i = 1,..., n, we apply condition (vi) to conclude that −l''(θ)/n →p I(θ) as n → ∞, where

I(θ) = Eθ{ ∂log(f(X1; θ))/∂θ }² = −Eθ[ { f(X1; θ) ∂²f(X1; θ)/∂θ² − (∂f(X1; θ)/∂θ)² } / { f(X1; θ) }² ]

and

Eθ[ f(X1; θ) ∂²f(X1; θ)/∂θ² / { f(X1; θ) }² ] = ∫ { ∂²f(u; θ)/∂θ² } du = (∂²/∂θ²) ∫ f(u; θ) du = ∂²(1)/∂θ² = 0.

Formally speaking, the regularity conditions (i–vi) ensure that we can move the operator ∂²/∂θ² outside the integration as was done above.
Thus Equation (3.1) can be rewritten as

√n(θ̂n − θ) = [ l'(θ)/√n ] / [ I(θ) + op(1) ],

where l'(θ) is a sum of iid random variables {∂log(f(Xi; θ))/∂θ}, i = 1,..., n, with

Eθ{ ∂log(f(X1; θ))/∂θ } = ∫ f(u; θ) ( ∂log f(u; θ)/∂θ ) du = ∫ f(u; θ) [ {∂f(u; θ)/∂θ}/f(u; θ) ] du = (∂/∂θ) ∫ f(u; θ) du = ∂(1)/∂θ = 0.

Then by virtue of the central limit theorem (Section 2.4.5) we complete the proof.
The maximum likelihood estimator θ̂n is a point estimator. One of the statistical applications of Theorem 3.3.1 is related to the construction of large sample confidence interval estimators. Toward this end, we consider a situation in which we are interested in obtaining an interval, say [a, b], based on X1,..., Xn, with coverage probability given as Prθ{ θ ∈ [ a(X1,..., Xn), b(X1,..., Xn) ] } = 1 − α for a prespecified level α. In this case, for large n, we have

Prθ{ θ ∈ [ θ̂n − u_{1−α/2} {I(θ)n}^{−1/2}, θ̂n + u_{1−α/2} {I(θ)n}^{−1/2} ] } ≈ Φ(u_{1−α/2}) − Φ(−u_{1−α/2}),

where the constant u_{1−α/2} is given as the root of Φ(u_{1−α/2}) = 1 − α/2, Φ(−u_{1−α/2}) = α/2, and Φ(u) = ∫_{−∞}^{u} e^{−x²/2} dx/√(2π) denotes the distribution function of a standard normal random variable. For example, u_{1−α/2} ≈ 1.96, if α = 0.05. Then, calculating a point estimator of I(θ) as Î(θ̂n), we can obtain the equal-tailed (1 − α)100% confidence interval estimator given as [ a = θ̂n − u_{1−α/2} {Î(θ̂n)n}^{−1/2}, b = θ̂n + u_{1−α/2} {Î(θ̂n)n}^{−1/2} ].
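A minimal R sketch of such a large-sample confidence interval (our own example, not from the original text) uses exponential data with mean θ, for which the maximum likelihood estimator is the sample mean and I(θ) = 1/θ²; the simulated parameter values are arbitrary assumptions.

# Wald-type confidence interval based on Theorem 3.3.1 for exponential data with mean theta
set.seed(2)
theta <- 2; n <- 100; alpha <- 0.05
X <- rexp(n, rate = 1/theta)
theta_hat <- mean(X)                 # maximum likelihood estimate of theta
I_hat <- 1/theta_hat^2               # plug-in estimate of the Fisher information
u <- qnorm(1 - alpha/2)
c(lower = theta_hat - u/sqrt(I_hat*n), upper = theta_hat + u/sqrt(I_hat*n))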

Theorem 3.3.1 can be used to show that in various scenarios the maximum
likelihood method provides asymptotically normal estimators with mini-
mum variances (e.g., Lehmann and Casella, 1998).
Warning: The statement mentioned above is well addressed in textbooks related to elementary courses in statistics. In general, the maximum likelihood estimators might be outperformed by related superefficient estimation procedures; however, such improvements are possible only on special sets of values of θ with Lebesgue measure zero (Le Cam, 1953). Consider the following example. Let n iid observations X1,..., Xn be N(θ, 1) distributed. In this case, the maximum likelihood estimator θ̂n = Σ_{i=1}^{n} Xi/n of θ satisfies √n(θ̂n − θ) ~ N(0, 1). Define an estimator of θ in the form vn = θ̂n I{|θ̂n| ≥ n^{−1/4}} + aθ̂n I{|θ̂n| < n^{−1/4}} with 0 < a < 1. When θ ≠ 0, we have that the distribution function Prθ{√n(vn − θ) < u} satisfies

Prθ{ √n(vn − θ) < u }

= Prθ{ √n(vn − θ) < u, |θ̂n| ≥ n^{−1/4} } + Prθ{ √n(vn − θ) < u, |θ̂n| < n^{−1/4} }

= Prθ{ √n(θ̂n − θ) < u, |θ̂n| ≥ n^{−1/4} } + Prθ{ √n(aθ̂n − θ) < u, |θ̂n| < n^{−1/4} }

= Prθ{ √n(θ̂n − θ) < u } − Prθ{ √n(θ̂n − θ) < u, |θ̂n| < n^{−1/4} } + Prθ{ √n(aθ̂n − θ) < u, |θ̂n| < n^{−1/4} }

≤ Prθ{ √n(θ̂n − θ) < u } + Prθ{ √n(aθ̂n − θ) < u, |θ̂n| < n^{−1/4} }

≤ Prθ{ √n(θ̂n − θ) < u } + Prθ{ |θ̂n| < n^{−1/4} }

as well as

Prθ{ √n(vn − θ) < u } ≥ Prθ{ √n(θ̂n − θ) < u } − Prθ{ |θ̂n| < n^{−1/4} }.

Since

Prθ{ |θ̂n| < n^{−1/4} } = Prθ{ −n^{−1/4} < Σ_{i=1}^{n} Xi/n < n^{−1/4} }

= Prθ{ Σ_{i=1}^{n} (Xi − θ)/n < n^{−1/4} − θ } − Prθ{ Σ_{i=1}^{n} (Xi − θ)/n < −n^{−1/4} − θ }

= Prθ{ Σ_{i=1}^{n} (θ − Xi) > θn − n^{3/4} } − Prθ{ Σ_{i=1}^{n} (θ − Xi) > n^{3/4} + θn }

and, for large values of n and fixed θ, θn − n^{3/4} > 0, applying Chebyshev's inequality in a similar manner to that of Section 2.4.3, we obtain Prθ{ |θ̂n| < n^{−1/4} } → 0 as n → ∞. Then asymptotically √n(vn − θ) ~ N(0, 1), if θ ≠ 0.

It is clear that when θ = 0 we have

Pr_{θ=0}{ √n(vn − θ) < u } ≤ Pr_0{ |θ̂n| ≥ n^{−1/4} } + Pr_0{ √n(aθ̂n) < u } and

Pr_{θ=0}{ √n(vn − θ) < u } ≥ −Pr_0{ |θ̂n| ≥ n^{−1/4} } + Pr_0{ √n(aθ̂n) < u },

where Pr_0{ |θ̂n| ≥ n^{−1/4} } = Pr_0{ |Σ_{i=1}^{n} Xi| ≥ n^{3/4} } → 0 as n → ∞.

Thus √n(vn − θ) ~ N(0, a²), if θ = 0, and √n(vn − θ) ~ N(0, 1), if θ ≠ 0, as n → ∞ and 0 < a < 1. This states that, in this case, the estimator vn is more efficient asymptotically than the maximum likelihood estimator (var(vn) ≤ var(θ̂n), for large n), at least when θ = 0.
Remark 1. In general cases related to different data distributions we need to derive maximum likelihood estimators numerically. For example, consider n iid observations X1,..., Xn from the Cauchy distribution with density function f(u; θ) = [ π{1 + (u − θ)²} ]^{−1}. The maximum likelihood estimator θ̂n of the shift parameter θ is a root of the equation l'(θ̂n) = 0, i.e.,

Σ_{i=1}^{n} (Xi − θ̂n)/{ 1 + (Xi − θ̂n)² } = 0,

which can be solved numerically, e.g., using the Newton–Raphson method (see, e.g., R procedure uniroot, R Development Core Team, 2014). In this Cauchy framework, the regularity conditions are satisfied and

∂log(f(u; θ))/∂θ = 2(u − θ)/{ 1 + (u − θ)² },   Eθ{ ∂log(f(X1; θ))/∂θ }² = (4/π) ∫_{−∞}^{∞} u²/{1 + u²}³ du = 1/2.

Thus, following Theorem 3.3.1, we have √n(θ̂n − θ) ~ N(0, 2) as n → ∞.
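A minimal R sketch of this numerical maximum likelihood computation (our own example; the simulated data, the use of optimize on the negative log likelihood and the search interval around the sample median are arbitrary choices) is as follows.

# Maximum likelihood estimation of the Cauchy location parameter
set.seed(4)
theta <- 1; n <- 200
X <- rcauchy(n, location = theta)
neg_loglik <- function(b) -sum(dcauchy(X, location = b, log = TRUE))
theta_hat <- optimize(neg_loglik, interval = c(median(X) - 5, median(X) + 5))$minimum
c(estimate = theta_hat, asymptotic_sd = sqrt(2/n))   # Theorem 3.3.1: variance approximately 2/n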
Remark 2. In cases of models with several unknown parameters, a multidi-
mensional version of Theorem 3.3.1 can be obtained under appropriate exten-
sions of the regularity conditions in higher dimensions (e.g., Serfling, 2002).
Remark 3. Properties of the maximum likelihood estimation where stan-
dard regularity conditions are violated are considered in Borovkov (1998).

3.4 The Likelihood Ratio


Historically the likelihood concept has been associated with the likelihood
ratio principle, which is the basis for developing powerful statistical infer-
ence tools for use in clinical experiments when the forms of data distribu-
tions are assumed to be specified.
Let us start by first outlining the likelihood ratio principle applied to sta-
tistical decision-making theory. Likelihood ratio–based testing was first pro-
posed and formulated in a series of foundational papers published in the
period of 1928–1938 by Neyman and Pearson (1928, 1933, 1938). In 1928, the
authors introduced the generalized likelihood ratio test and its association
with chi-squared statistics. Five years later, the Neyman–Pearson lemma
was introduced, showing the optimality of the likelihood ratio test. These
seminal works provided us with the familiar notions of simple and compos-
ite hypotheses and errors of the first and second kind, thus defining formal
decision-making rules for testing (e.g., Vexler et al., 2016a).
Without loss of generality, the principle idea of the proof of the Neyman–
Pearson lemma can be shown by using the following statement: Assume that
data points {X i , i = 1, … , n} are observed presenting for example biomarker
measurements. In general, the observations do not need to represent values
of iid random variables. We would like to classify X 1 , … , X n corresponding
to hypotheses of the following form: H 0 : {X i , i = 1, … , n} are from a joint
density function f0, versus H 1 : {X i , i = 1, … , n} are from a joint density func-
tion f1. For example, one can define f k (x1 , ..., xn) = f (x1 , ..., xn ; θ k ), where k = 0,1
and θ0 , θ1 are different values of the parameter of a joint density function f .
In this context, in order to construct the likelihood ratio test statistic, we
should consider the ratio between the joint density function of {X 1 , … , X n}
obtained under H 1 and the joint density function of {X 1 , … , X n} obtained
under H 0. We then define the likelihood ratio (LR) as

LRn = f1 (X 1 ,..., X n ) f0 (X 1 ,..., X n ).

In this case the likelihood ratio test based decision rule is to reject H 0 if and
only if LRn ≥ B, where B is a prespecified test threshold that does not depend
on the observations. This test is uniformly most powerful. In order to clar-
ify this fact we note that the term “most powerful” induces us to formally
define how to compare statistical tests. According to the Neyman–Pearson
concept of testing statistical hypotheses, since the ability to control the Type
I error rate of statistical test has an essential role in statistical decision-
making, we compare tests with equivalent probabilities of the Type I error
PrH0 {test rejects H 0} = α, where the subscript H 0 indicates that we consider
the probability given that the null hypothesis is true. The level of significance
70 Statistics in the Health Sciences: Theory, Applications, and Computing

α is the probability of making a Type I error. In practice, the researcher


should choose a value of α, e.g., α = 0.05, before performing the test. Thus, we
should compare the likelihood ratio test with δ, any decision rule based on
{X i , i = 1, … , n}, setting up PrH0 {δ rejects H 0} = α and PrH0 {LRn ≥ B} = α. This
comparison is with respect to the power PrH1 {test rejects H 0}. Notice that
to derive the mathematical expectation, in the context of a problem related
to testing statistical hypotheses, one must define whether the expectation
should be conducted under an H 0- or H 1 -regime. For example,

EH1 {ϕ(X 1 , X 2 ,… X n )} = ∫ ... ∫ ϕ(x , x ,…x ) f (x , x ,…x )dx dx …dx ,


1 2 n 1 1 2 n 1 2 n

where the expectation is considered under the alternative hypothesis. In this


framework we consider the trivial inequality

(A − B){I (A ≥ B) − ν} ≥ 0, (3.2)

for all A, B, where ν ∈[0,1] and I ( A ≥ B) denotes the indicator function.


Taking into account the comments mentioned above, we derive the expec-
tation under H 0 of inequality (3.2), where A = LRn, B is a test threshold, and
ν = δ = δ (X 1 , ..., X n) represents any decision rule based on {X i , i = 1, … , n}.
One can assume that δ = 0,1, and when δ = 1 we reject H 0. Thus, we obtain

EH0 {(LRn − B) I (LRn ≥ B)} ≥ EH0 {(LRn − B) δ}.

And hence,

⎧⎪ f (X ,..., X n ) ⎫⎪
EH0 ⎨ 1 1 I (LRn ≥ B)⎬ − BEH0 {I (LRn ≥ B)}
⎪⎩ f0 (X 1 ,..., X n ) ⎪⎭
(3.3)
⎧⎪ f (X ,..., X n ) ⎫⎪
≥ EH0 ⎨ 1 1 δ(X 1 ,..., X n )⎬ − BEH0 {δ(X 1 ,..., X n )} ,
⎩⎪ f0 (X 1 ,..., X n ) ⎭⎪

where EH0 (δ) = EH0 {I (δ = 1)} = PrH0 {δ = 1} = PrH0 {δ rejects H 0}. Since we com-
pare the tests with the fixed level of significance

EH0 {I (LRn ≥ B)} = PrH0 {LRn ≥ B} = PrH0 {δ rejects H 0} = α,


Likelihood Tenet 71

inequality (3.3) leads to

⎪⎧ f (X ,..., X n ) ⎪⎫ ⎪⎧ f (X ,..., X n ) ⎪⎫
EH0 ⎨ 1 1 I (LRn ≥ B)⎬ ≥ EH0 ⎨ 1 1 δ(X 1 ,..., X n )⎬ . (3.4)
⎪⎩ f0 (X 1 ,..., X n ) ⎪⎭ ⎪⎩ f0 (X 1 ,..., X n ) ⎪⎭

Consider

⎧⎪ f (X ,..., X n ) ⎫⎪
EH0 ⎨ 1 1 δ(X 1 ,..., X n )⎬
⎩⎪ f0 (X 1 ,..., X n ) ⎭⎪

= ∫ ... ∫ ff ((xx ,,……,, xx )) δ (x ,…, x ) f (x ,…, x ) dx …dx


1

0
1

1
n

n
1 n 0 1 n 1 n (3.5)

= ∫ ... ∫ δ (x ,…, x ) f (x ,…, x ) dx …dx


1 n 1 1 n 1 n = EH1 (δ) = PrH1 {δ rejects H 0} .

Since δ represents any decision rule based on {X i , i = 1,… , n}, including the
likelihood ratio–based test, Equation (3.5) implies

⎪⎧ f (X ,..., X n ) ⎪⎫
E H0 ⎨ 1 1 I (LRn ≥ B)⎬ = PrH1 {LRn ≥ B} .
⎩⎪ f0 (X 1 ,..., X n ) ⎭⎪

Applying this equation and Equations (3.5) to (3.4), we complete the proof
that the likelihood ratio test is the most powerful statistical decision rule.
This simple proof technique was used to show optimal aspects of differ-
ent statistical decision-making policies based on the likelihood ratio concept
(Vexler and Wu, 2009; Vexler et al., 2010b; Vexler and Gurevich, 2009, 2011).

Remark. The approach shown above can be extended to be considered


with respect to decision-making procedures that satisfy the requirement
PrH1 (reject H 0 ) ≥ α for all levels α’s of significance. It is interesting to note
that if one proposes a most powerful test, say V, in a set of tests that satisfy
PrH1 (reject H 0 ) ≥ α for all α, then it is possible to construct a test, say G, that does
not satisfy PrH1 (reject H 0 ) ≥ α for some α, but the power of G outperforms that
of V. In this context, we refer the reader to, e.g., Bross et al. (1994), Suissa and
Shuster (1984), and Section 9.2.

Warning: Relativity of Optimality of Statistical Tests. Commonly, in


accordance with a given risk function of hypothesis testing, investigators try
to derive an optimal property of a test. Inequality (3.2) helps us to show the
optimal property of the likelihood ratio test statistic in terms of maximizing
72 Statistics in the Health Sciences: Theory, Applications, and Computing

the power at a fixed Type I error rate. However, in the case when we have a
given test statistic, say ϒ, instead of the likelihood ratio LRn, inequality (3.2)
yields (ϒ − B) {I (ϒ ≥ B) − δ} ≥ 0, for any decision rule δ ∈[0,1]. Therefore, the
test rule ϒ > B also has an optimal property that can be derived using (3.2).
The problem is how to obtain a reasonable interpretation of the optimal-
ity? Thus, in scenarios when one wishes to investigate the optimality of a
given test in a general context, inequalities similar to (3.2) can be obtained
by focusing on the structure of a test. Such inequalities could report an opti-
mal property of the test. However, translation of that property in terms of
the quality of tests is the issue. Consider the following simple example. Let
iid data points X 1 ,… , X n be observed. We would like to test for H 0 versus
H 1. Randomly select X k from the observations and define the test: reject H 0
if X k > B, where the threshold B > 0. By virtue of inequality (3.2), we have
(X k − B)I {X k > B} ≥ (X k − B)δ, where δ = 0,1 is any decision rule based on
X 1 ,… , X n and the event {δ = 1} rejects H 0 . Since

EH0 {(X k − B)I (X k > B)} ≥ EH0 {(X k − B)δ} ,

we have

EH0 {X k I (X k > B)} − B PrH0 {X k > B} ≥ EH0 (X k δ) − B PrH0 {δ rejects H 1} .

Thus, considering decision-making procedures that satisfy


EH0 {X 1I (δ rejects H 0)} = γ, for a fixed γ , we conclude that the test based on X k
has the smallest Type I error rate, since the inequality above yields

γ − B PrH0 {X k > B} ≥ γ − B PrH0 {δ rejects H 1} .

In this instance we would like to preserve the condition that if we incorrectly


reject H 0, the expectation of X k is fixed at a presumed level.
Now, we define the test with the decision rule to reject H 0 if
X k > − B . Then (X k + B)I {X k > − B} ≥ (X k + B)δ and EH1 {X k I (X k > −B)} +
B PrH1 {X k > −B} ≥ EH1 (X k δ) + B PrH1 {δ rejects H 1} . Then consider tests with
EH1 {X k I (δ rejects H 0)} = γ , for a fixed γ . It follows that the test X k > − B
is the most powerful rule in a set of tests with the fixed characteristic
EH1 {X k I (δ rejects H 0)} = γ . This allows us to consider preserving the condi-
tions such as if we correctly reject H 0, the expectation of X k is fixed on a pre-
sumed level.
The results above demonstrate that criteria for which a given test is opti-
mal can be declared by the structure of this test. Hence, probably almost any
reasonable tests are in general optimal.
It is a common situation in statistical practice that we have several out-
comes of different tests regarding one test statement. Perhaps, in such cases,
it is reasonable to attempt to focus on the optimal aspects of the tests in order
to make an appropriate decision.
Likelihood Tenet 73

In this book we will show several examples of determining the optimal


properties of test procedures via (3.2)-type inequalities.

3.5 The Intrinsic Relationship between the Likelihood


Ratio Test Statistic and the Likelihood Ratio of Test
Statistics: One More Reason to Use Likelihood
The Neyman–Pearson testing concept, given by fixing the probability of a
Type I error, has come under some criticism by epidemiologists and others.
One of the critical points is about the importance of paying attention to
Type II error rates, PrH1 {test does not reject H 0}. For example, Freiman et al.
(1978) pointed out the results of 71 clinical trials that reported no statistically
“significant” differences between the compared treatments. The authors
found that in the great majority of these trials, the strong effects of new treat-
ment were clinically significant. It was argued that the investigators in these
trials failed to reject the null hypothesis when it appeared the alternative
hypothesis was more likely the underlying truth, which probably resulted in
an increase in the Type II error. In the context of likelihood ratio–based tests,
we present the following result that demonstrates an association between
the probabilities of Type I and II errors.
Suppose we would like to test for H 0 versus H 1, employing the likelihood
ratio LR = f H1 (D) /f H0 (D) based on data D, where f H defines a density func-
tion that corresponds to the data distribution under the hypothesis H. Say,
for simplicity, we reject H 0 if LR > C , where C is a presumed threshold. In this
case, we can then show the following proposition.

Proposition 3.5.1. Let f HL (u) define the density function of the likelihood
ratio LR test statistic under the hypothesis H. Then

f HL1 (u) = u f HL0 (u) (3.6)

with u > 0.

Proof. In order to obtain the property (3.6), we consider, for nonrandom vari-
ables u and s, the probability

PrH1 {u − s ≤ LR ≤ u} = EH1 I {u − s ≤ LR ≤ u} = ∫ I {u − s ≤ LR ≤ u} f
H1

= ∫ I {u − s ≤ LR ≤ u} ff H1

H0
f = ∫ I {u − s ≤ LR ≤ u} (LR) f
H0 H0 .
74 Statistics in the Health Sciences: Theory, Applications, and Computing

This implies the inequalities


PrH1 {u − s ≤ LR ≤ u} ≤ I {u − s ≤ LR ≤ u}(u) f H0 = u PrH0 {u − s ≤ LR ≤ u}

and

PrH1 {u − s ≤ LR ≤ u} ≥ ∫ I {u − s ≤ LR ≤ u}(u − s) f H0 = (u − s) PrH0 {u − s ≤ LR ≤ u} .

Dividing these inequalities by s and employing s → 0, we obtain


f HL1 (u) = f HL0 (u)u, where f HL0 (u) and f HL1 (u) are the density functions of the sta-
tistic LR = f H1 /f H0 under H 0 and H 1, respectively.
The proof is complete.
Thus, we can obtain the probability of a Type II error in the form of
Pr {the test does not reject H0|H1 is true} = Pr{LR ≤ C|H1 is true} =

∫ ∫
C C
f HL1 (u) du = uf HL0 (u) du. Now, if, in order to control the Type I error, the
0 0
density function f HL0 (u) is assumed to be known, then the probability of the
Type II error can be easily computed.
The likelihood ratio property f HL1 (u) f HL0 (u) = u can be applied to solve dif-
ferent issues related to performing the likelihood ratio test. For example, in
terms of the bias of the test, one can request to find a value of the threshold
C that maximizes

Pr {the test rejects H 0 |H1 is true} − Pr {the test rejects H 0 |H 0 is true} ,

where the probability Pr {the test rejects H 0 |H1 is true} depicts the power of
the test. This equation can be expressed as

⎛ ⎞ ⎛ ⎞
∫ ∫
C C
Pr {LR > C|H 1 is true} − Pr {LR > C|H 0 is true} = ⎜⎜ 1 − f HL1 (u) du⎟⎟ − ⎜⎜ 1 − f HL0 (u) du⎟⎟ .
⎝ 0 ⎠ ⎝ 0 ⎠

Set the derivative of the above formula to be zero and solve the following
equation:

d ⎡⎛ ⎞ ⎛ ⎞⎤
∫ ∫
C C
⎢ 1− f HL1 (u) du⎟ − ⎜ 1 − f HL0 (u) du⎟ ⎥ = − f HL1 (C) + f HL0 (C) = 0.
dC ⎣⎜⎝ 0 ⎠ ⎝ 0 ⎠⎦
Likelihood Tenet 75

By virtue of the property (3.6), this implies −Cf HL0 (C) + f HL0 (C) = 0 and then
C = 1, which provides the maximum discrimination between the power and
the probability of a Type I error in the likelihood ratio test.

Likelihood Argument of Optimality: Assume that an investigator provides


a test for hypotheses H 0 versus H 1, calculating a test statistic, say Τ, based on
the underlying data. Suppose we would like to improve Τ in terms of relative
efficiency. Towards this end, we made creative efforts to derive the density
functions f HT0 (u) and f HT1 (u) of the statistic Τ under H 0 and H 1, respectively.
According to Section 3.4, the statistic f HT1 (Τ) f HT0 (Τ) should outperform Τ.
In the case when Τ represents a likelihood ratio, Proposition 3.5.1 shows
that one cannot improve the test statistic using only the values of Τ. In
other words, the interesting fact is that the likelihood ratio f HL1 /f HL0 based
on the likelihood ratio LR = f H1 /f H0 comes to be the likelihood ratio, that is,
f HL1 (LR)/f HL0 (LR) = LR. We leave it to the reader’s imagination to interpret this
statement in terms of the information that this result provides.

Remark: Change of Measure. In this chapter we apply a very helpful tool


for analyzing theoretical characteristics of statistical procedures. This tool
is based on a change of measure. Consider for example a test statistic Tn
that suggests rejecting H 0 for large values of Tn when n data points are
observed. It is reasonable to assume that under the null hypothesis, when
n increases, the behavior of Tn is difficult to anticipate, e.g., oftentimes we
cannot expect any approximate trend of Tn. Thus, in many situations, the
analysis of characteristics similar to EH0 {ψ (Tn )}, where ψ is a function,
including the Type I error rate, is a very complicated issue. Commonly,
under the alternative hypothesis H 1, a deterministic trend of Tn can be eas-
ily derived for relatively large n, e.g., Tn ~ cn, for a constant c, when we
could expect the test is power one as n → ∞. Then, to evaluate EH0 {ψ (Tn )}
one can change the measure

⎧⎪ f (T )⎫⎪
EH0 {ψ(Tn )} = ∫ ψ(u) f H0 (u) du = ∫ ψ(u) ff H0 (u)
H1 (u)
f H1 (u)d u = EH1 ⎨ ψ(Tn ) H0 n ⎬ ,
⎪⎩ f H1 (Tn )⎪⎭

where f H k (u) is a density function of Tn under the hypothesis H k , k = 0,1, and


focus on properties of Tn under H 1. For example, Taylor’s arguments consid-
ering Tn ~ cn can be employed.
Formally in this framework we shall use the Radon–Nikodym theorem
(e.g., Chung, 2000; Borovkov, 1998). In this context the likelihood ratio can be
referred to as the Radon–Nikodym derivative of the two measures at time n.
An interesting example of an application of the algorithm mentioned above
to examine properties of a statistical procedure for change point detection
policies can be found in Yakir (1995).
76 Statistics in the Health Sciences: Theory, Applications, and Computing

3.6 Maximum Likelihood Ratio


Various real-world data problems require consideration of statistical hypoth-
eses with structures that depend on unknown parameters. In this case, the
maximum likelihood method proposes to approximate the most powerful
likelihood ratio test, employing a proportion of the maximum likelihoods,
where the maximizations are over values of the unknown parameters
belonging to distributions of observations under the corresponding null and
alternative hypotheses. We shall assume the existence of essential maximum
likelihood estimators. The influential Wilks’ theorem provides the basic
rationale as to why the maximum likelihood ratio approach has had tre-
mendous success in statistical applications (Wilks, 1938). Wilks showed that
under the regularity conditions, asymptotic null distributions of maximum
likelihood ratio test statistics are independent of the nuisance parameters.
That is, Type I error rates of the maximum likelihood ratio tests can be con-
trolled asymptotically, and approximations to the corresponding p-values
can be computed.
In order to present Wilks’ theorem we suppose that X = {X 1 ,..., X n } are iid
observations each with the density function f ( x ; θ), where θ is a real-valued
vector of parameters. We are interested in testing if the vector θ is in a speci-
fied subset Ω 0 of the parametric space Ω, i.e., H 0 : θ ∈ Ω0 versus H 1 : θ ∈ Ωc0 ,

n
where Ω c0 is a complement to Ω 0 in Ω. Let L(θ|x ) = f (X i ; θ) be the like-
i =1
lihood function. In this case, the maximum likelihood ratio test statistic is
MLRn = supθ∈Ωc L(θ|x ) supθ∈Ω0 L(θ|x ). Then, under the null hypothesis, the
statistic 2 log (MLRn) follows a χ 2d distribution as n → ∞, where the degrees of
freedom d equal to the difference in dimensionality of Ωc and Ω 0. This prop-
osition requires that the regularity conditions hold. For simplicity and clar-
ity of exposition, we state Wilks’ theorem in the one-dimension case, where
we are interested in testing H 0 : θ = θ0 versus H 1 : θ ≠ θ0, where θ is a scalar.

Theorem 3.6.1
Let l(θ|x ) = log (L(θ|x )) and θ̂ be the maximum likelihood estimator of θ
under H 1, that is, θˆ = arg max θ l(θ|x ), provided that it exists. Then, under
{
the null hypothesis, the distribution of 2 log (MLRn) = 2 l(θˆ |x ) − l(θ0 |x ) is }
asymptotically a χ 2d = 1 distribution as n → ∞.

Proof. Expand l(θ0 |x ) in a Taylor series around θ̂, that is,

∂l(θ|x ) 1 ∂2 l(θ|x )
l(θ0 |x ) = l(θˆ |x ) + (θ0 − θˆ ) + (θ0 − θˆ )2 + R,
∂θ θ=θˆ 2 ∂θ2 θ=θˆ
Likelihood Tenet 77

1 ∂3 l(θ|x )
where the remainder term R = (θ0 − θˆ )3 , for some θ between θ̂
6 ∂θ3 θ=θ
and θ. Based on regularity condition (v)|∂3 log f ( x ; θ) ∂θ3 |≤ M( x) for all x ∈ A ,
θ0 − c < θ < θ 0 + c with Eθ0 {M(X )} < ∞ and the fact that ⎢θ0 − θˆ ⎢= Op (n−1/2 + ε ),
ε > 0 (Section 3.3), we can obtain R = Op (n−3/2+3 ε n) = Op (n−1/2+3 ε ) = op (1) , where
1 ∂l(θ|x )
ε can be chosen as ε < . Noting that = 0, we have
6 ∂θ θ=θˆ

1 ∂2 l(θ|x )
l(θ0 |x ) = l(θˆ |x ) + (θ0 − θˆ )2 + op (1).
2 ∂θ2 θ=θˆ

By substituting l(θ0 |x ) with its corresponding Taylor expansion, we con-


clude that

∂2 l(θ|x )
{ }
2 l(θˆ |x ) − l(θ0 |x ) = −(θ0 − θˆ )2
∂θ2 θ=θˆ
+ op (1)

2⎧
⎪ ∂2 l(θ|x ) ⎫⎪
= { }
n (θ0 − θˆ ) ⎨−n−1 ⎬ + op (1)
∂θ2 θ=θˆ ⎭⎪
⎩⎪
⎡ ⎫⎪ 1/2⎤
2
⎧⎪
⎢ −1 ∂ l(θ|x ) ⎥
2
ˆ
= ⎢ n (θ0 − θ) ⎨−n ⎬ ⎥ + op (1).
⎢⎣ ⎩⎪ ∂θ 2
θ=θˆ ⎭⎪ ⎥

Under the regularity conditions, we have that n (θ0 − θˆ ) → N (0,1 I (θ0 ))


∂2 l(θ|x )
(Theorem 3.3.1) and − n−1 → I (θ0 ) in probability (Chapters 1 and 2 can
∂θ2 θ=θˆ 1/2
⎧⎪ −1 ∂2 l(θ|x ) ⎫⎪
provide arguments to show this fact). Then n (θ0 − θ) ⎨− n ˆ ⎬
⎩⎪ ∂θ2 θ=θˆ ⎭⎪

{
converges to N (0,1) in distribution, and therefore 2 l(θˆ |x ) − l(θ0 |x ) con- }
verges to χ12 in distribution. (We also suggest the reader consult Slutsky’s
theorem (Grimmett and Stirzaker, 1992) in this context.)
The proof is complete.
Note that, in the general situation with MLRn = supθ∈Ωc L(θ|x )
supθ∈Ω0 L(θ|x ), the proof mentioned above can be easily extended. In this
case, under H 0 : θ ∈Ω 0 , the statistic 2 log (MLRn) follows asymptotically a χ 2d
distribution as n → ∞, where the degrees of freedom d equal to the differ-
ence in dimensionality of Ωc and Ω 0 and are clarified by the amount of items
1/2 2
⎡ ⎧⎪ −1 ∂2 l(θ|x ) ⎫⎪ ⎤
similar to ⎢ n (θ0 − θ) ⎨− n
ˆ ⎬ ⎥ ⎯n⎯⎯ → {N (0, 1)} that should be
2
→∞
⎢ ⎩⎪ ∂θ 2
⎪ ⎦
θ=θˆ ⎭


presented in the corresponding log likelihood expansion.
78 Statistics in the Health Sciences: Theory, Applications, and Computing

Comments: Thus, if certain key assumptions are met, one can show that
parametric likelihood methods are very powerful and efficient statistical
tools. We should emphasize that the role of the discovery of the likelihood
ratio methodology in statistical developments can be compared to the devel-
opment of the assembly line technique of mass production. The likelihood
ratio principle gives clear instructions and technique manuals on how to
construct efficient statistical decision rules in various complex problems
related to clinical experiments. For example, Vexler et al. (2011) developed a
likelihood ratio test for comparing populations based on incomplete longitu-
dinal data subject to instrumental limitations.
Although many statistical publications continue to contribute to the likeli-
hood paradigm and are very important in the statistical discipline (an excel-
lent account can be found in Lehmann and Romano, 2006), several significant
questions naturally arise about the general applicability of the maximum
likelihood approach. Conceptually, there is an issue specific to classifying
maximum likelihoods in terms of likelihoods that are given by joint den-
sity (or probability) functions based on data. Integrated likelihood functions,
with respect to arguments related to data points, are equal to one, whereas
accordingly integrated maximum likelihood functions often have values that
are indefinite. Thus, while likelihoods present full information regarding the
data, the maximum likelihoods might lose information conditional on the
observed data. Consider this simple example: Suppose we observe X 1, which
is assumed to be from a normal distribution N ( μ, 1) with mean parameter
( )
μ. In this case, the likelihood has the form (2 π) exp −(X 1 − μ)2 /2 and, cor-
−0.5

respondingly,
∫ (2π)
−0.5
( )
exp −(X 1 − μ) /2 dX1 = 1, whereas the maximum like-
2

lihood, i.e., the likelihood evaluated at the estimate of μ, μˆ = X 1 , is (2 π) ,


−0.5

which clearly does not represent the data and is not a proper density. This
demonstrates that since the Neyman–Pearson lemma is fundamentally
founded on the use of the density-based constitutions of likelihood ratios,
maximum likelihood ratios cannot be optimal in general cases. That is, the
likelihood ratio principle is generally not robust when the hypothesis tests
have corresponding nuisance parameters to consider, e.g., testing a hypoth-
esized mean given an unknown variance.
An additional inherent difficulty of the maximum likelihood ratio test
occurs when a clinical experiment is associated with an infinite-dimensional
problem and the number of unknown parameters is relatively large. In this
case, Wilks’ theorem should be re-evaluated, and nonparametric approaches
should be considered in the contexts of reasonable alternatives to the para-
metric likelihood methodology (Fan et al., 2001).
The ideas of likelihood and maximum likelihood ratio testing may not be
fiducial and applicable in general nonparametric function estimation/testing
settings. It is also well known that when key assumptions are not met, para-
metric approaches may be suboptimal or biased as compared to their robust
Likelihood Tenet 79

counterparts across the many features of statistical inferences. For example,


in a biomedical application, Ghosh (1995) proved that the maximum like-
lihood estimators for the Rasch model are inconsistent, as the number of
nuisance parameters increases to infinity (Rasch models are often employed
in clinical trials that deal with psychological measurements, e.g., abilities,
attitudes, and personality traits). Due to the structure of likelihood functions
based on products of densities, or conditional density functions, relatively
insignificant errors in classifications of data distributions can lead to vital
problems related to the applications of likelihood ratio type tests (Gurevich
and Vexler, 2010). Moreover, one can note that, given the wide variety and
complex nature of biomedical data (e.g., incomplete data subject to instru-
mental limitations or complex correlation structures), parametric assump-
tions are rarely satisfied. The respective formal tests are complicated, or
oftentimes not readily available.

Author’s Note: In this book we will employ martingale-based arguments


to demonstrate that in general situations the maximum likelihood ratios are
vastly different from the likelihood ratio and thus cannot provide optimal
non-asymptotic properties similar to those related to the likelihood ratios.
Using the idea that commonly in the context of hypothesis testing we do not
need to estimate unknown parameters, we only shall decide to reject or not
to reject the null hypothesis, we will propose decision-making procedures,
applying representative values instead of unknown parameters or more gen-
eral Bayes factor concepts in order to obtain optimal tests based on samples
with fixed sizes.
Brief Example of the Maximum Likelihood Ratio MLRn. Assume that a sam-
ple of iid observations X 1 , X 2 ,..., X n follow an exponential distribution with
the rate parameter λ, that is, X 1 ,… , X n ~ f ( x) = λexp(−λx). We then describe
the maximum likelihood ratio test statistic for the composite hypothesis
H 0 : λ = 1 versus H 1 : λ ≠ 1. The log-likelihood function is

∑ ∑
n n
l(λ) = l(λ|X 1 , X 2 ,..., X n ) = log f (X i ) = n log λ − λ Xi .
i =1 i =1

To calculate the maximum likelihood estimation, we solve the equation


dl(λ) n

n
= − X i = 0 . The maximum likelihood estimator of λ is
dλ λ i=1


n
ˆλ = X −1 = n X i . Therefore, the maximum likelihood ratio test statistic is
i=1

{ }
2 log (MLRn) = 2 l(λˆ ) − l(1) = 2 n{− log(X ) − 1 + X}. The distribution of 2 log (MLRn)
under H 0 is approximately χ12 and χ1,0.95
2
= 3.84(Pr( χ12 > χ1,2 0.95 ) = 0.05). We
reject H 0 if 2 log (MLRn) > 3.84 at the 0.05 significance level.
80 Statistics in the Health Sciences: Theory, Applications, and Computing

Real data examples and their corresponding R codes for implementation


are presented in Vexler et al. (2016a).

3.7 An Example of Correct Model-Based Likelihood


Formation
Generally speaking, when data are available and model assumptions are fit-
ted, it is not a trivial issue to formulate correctly the corresponding likeli-
hood function. Let us include the following example to illustrate that we
need carefully attend to the likelihood statement. Consider n data points
X 1 ,..., X n that are assumed to satisfy the measurement model with autore-
gressive errors

X i = μ + εi , εi = βεi −1 + ξi , ε0 = 0, i = 1,..., n,

where μ and β are parameters, the errors ε1 ,..., ε n are unobserved and depen-
dent corresponding to the model ε i = βε i − 1 + ξi , that is called as autoregres-
sion with iid noise ξ1 ,..., ξn from a specified density function fξ.
By virtue of the model assumption, we have

ξ1 = X 1 − μ , ξi = X i − μ(1 − β) − βX i −1 , i = 2,..., n.

Then one can write the likelihood function in the form

L(μ , β) = fξ (X 1 − μ) ∏ f (X − μ(1 − β) − βX
ξ i i −1 ).
i=2

Although this form is correct, the approach applied above might lead to
incorrect likelihood forms in several general scenarios, since by definition
the likelihood function should be constructed using distribution functions
of observations that are X 1 ,..., X n in this example.
Let us demonstrate the formal exercise to derive the likelihood function.
To this end we start by noting that
n

L(μ , β) = f (X 1 ,..., X n ) = ∏ f (X |X ,..., X


i 1 i −1 ).
i =1

Taking into account the autoregressive form of the model, we obtain


Likelihood Tenet 81

L(μ , β) = f (X 1 ) ∏ f (X |X
i=2
i i−1 ).

Now we derive the joint density function of (X i , X i − 1), fXi , Xi−1 (u, v), that is

∂2
fXi , Xi−1 (u, v) = Pr (X i < u, X i −1 < v)
∂u ∂v
∂2
= Pr (ξi + μ(1 − β) + βX i −1 < u, X i −1 < v)
∂u ∂v

=
∂2
∂u ∂v ∫ Pr (ξ + μ(1 − β) + βX
−∞
i i −1 < u, X i −1 < v|X i −1 = t) fXi−1 (t) dt

=
∂2
∂u ∂v ∫ Pr (ξ + μ(1 − β) + βt < u, t < v|X
−∞
i i −1 = t) fXi−1 (t) dt

=
∂2
∂u ∂v ∫ Pr (ξ + μ(1 − β) + βt < u|X
−∞
i i −1 = t) fXi−1 (t) dt

v

∫ Pr (ξ + μ(1 − β) + βt < u) f
2
= i X i−1 (t) dt
∂u ∂v
−∞

= Pr (ξi < u − μ(1 − β) − βv) fXi−1 ( v) = fξ (u − μ(1 − β) − βv) fXi−1 ( v),
∂u

where the convolution principle (see Chapter 2) is used and fXi−1 ( v) =


d
Pr(X i −1 < v). Thus using conditional probability theory we conclude that
dv

f (X i |X i −1 ) = fξ (X i − μ(1 − β) − βX i −1 ), i ≥ 2,

and then L(μ , β) = fξ (X 1 − μ) ∏ f (X − μ(1 − β) − βX


i=2
ξ i i−1 ).

An example of a potential use of the method shown above is related to a


likelihood function that implements the autoregressive process of outcomes,
incorporating the characteristics of the problematic longitudinal data in bio-
medical research, e.g., regarding mouse tumor experiments (Vexler et al.,
82 Statistics in the Health Sciences: Theory, Applications, and Computing

2011). For example, one can consider a study with l treatment groups. For
treatment group k ( k = 1,..., l), there are nk experimental units. Each experi-
mental unit is followed by m equally spaced time points by a prespeci-
fied schedule. Let X ijk denote the jth measurement of the ith unit in group
k ( j = 1,..., m; i = 1,..., nk ; k = 1,..., l) . The variable X ijk may reflect a tumor size in
a tumor development model with mice. We can model successive measure-
ments within an experimental unit as

X ijk = μ jk + εijk , εijk = ∑β ε v i, j− v, k + ξijk ,


v =1

for some integer r < j, where ξijk are iid with the mean of 0 and the β v ’s
describe the relationship between ε ijk ’s. This model depicts the rth order of
autoregressive associations between the observations.
4
Martingale Type Statistics and Their
Applications

4.1 Introduction
Probability theory has several fundamental branches that have been dealt
with extensively in the theoretical literature associated with martingale
techniques. Oftentimes, modern concepts of stochastic processes analysis
are presented in the framework of martingale constructions. Historically,
gambling theory and econometrics have employed martingale theory to
model fair stochastic games when information regarding past events does
not provide us the ability to predict the mean of future winnings.
There are a multitude of books that cover different martingale based
methods. In this chapter, we will focus on the following three aspects that
have straightforward connections with biostatistical theory and its applica-
tions: (1) A martingale can represent a statistic, say Mn, based on n data
points such that the expectation of a future outcome Mn+1, which is calcu-
lated given knowledge regarding the present statistic Mn, is equal to the
observed value Mn. For example, intuitively, if a test statistic, say Tn, based on
n observations satisfies this martingale property, under the null hypothesis
H 0 , then on average Tn values are relatively stable under H 0 (Tn does not
increase on average, EH0 (Tn+1 |T1 , ..., Tn ) = Tn, with respect to sizes of underlin-
ing data). Thus, this can reduce the chances that the test statistic Tk , k > n,
will overshoot a corresponding test threshold when more data points are
available and H 0 is true. (2) In previous chapters, we demonstrated that
propositions based on sums of iid random variables can provide general
proof schemes in the evaluation of various statistical procedures. The mar-
tingale machinery can extend different concepts based on sums of iid sum-
mands by taking into account things such as dependency structures between
observations. (3) Martingale methodology plays a crucial role in the devel-
opment and examination of various statistical sequential procedures.
Common sequential statistical schemes involve random numbers of obser-
vations according to random stopping times (rules) for surveying data

83
84 Statistics in the Health Sciences: Theory, Applications, and Computing

(see, e.g., Chapter 2 for examples of sequential statistical procedures). In this


context, martingales present a rather wide class of statistics, which are
invariant relative to such operations as a continuous change of a measure
and random change of time. Thus, we can extend statistical procedures
based on statistics with a fixed number of data points to approaches based on
statistics with random numbers of observations.
We suggest that the reader who is interested in further details regarding
martingales to consult the essential works of Chow et al. (1971), Chung (1974,
2000), Edgar and Sucheston (2010), Liptser and Shiryayev (1989) and Williams
(1991).
This chapter will outline the following topics: Definitions and examples of
martingale related components, including σ-algebras, conditional expecta-
tions with respect to σ-algebras, martingale type objects and stopping times
(Section 4.2); The optional stopping theorem and its corollaries in the forms
of Wald’s lemma and Doob’s inequality (Section 4.3); Applications of martin-
gale theory for developing the principles of efficient retrospective and
sequential statistical procedures, including decision-making schemes, adap-
tive estimators, and change point detection policies (Section 4.4). The mate-
rial presented in Section 4.5 is optional to be demonstrated when Chapter 4
is to be used in a course text development. The reader can consider Section 4.5
as an advanced example of a scholarly statistical research presentation. We
emphasize that Section 4.5 displays a novel discovery in martingale transfor-
mations of testing strategies and martingale based comparisons between
decision-making procedures. The authors are indebted to Professor Moshe
Pollak and Dr. Aiyi Liu for many helpful discussions and comments related
to the results shown in Section 4.5.

4.2 Terminology
To begin our chapter we need to first provide some of the key definitions
used throughout its development.
In Section 4.1 we referred to the terms “information regarding past events”
and “given knowledge regarding the present statistic.” In order to formalize
these, we provide the following series of fundamental definitions:

Definition 4.2.1: Let A be a class of subsets. Then A defines an algebra if it


satisfies the following conditions:

(i) the set of elementary events Ω (see Section 1.3, taking into account
that we require Pr (Ω) = 1) belongs to A, Ω ⊆ A;
Martingale Type Statistics and Their Applications 85

(ii) the facts A ⊆ A and B ⊆ A imply A ∪ B ⊆ A and A ∩ B ⊆ A ;


(iii) if A ⊆ A then A c ⊆ A, where A c is the complement to A.

Definition 4.2.2: The class of subsets ℑ is a σ -algebra if ℑ is an algebra


such that wherever Fn ⊆ ℑ then ∩
Fn ⊆ ℑ.
n=1
Note that in Definition 4.2.1 using condition (iii), we can require that
A ∪ B ⊆ A or A ∩ B ⊆ A , since (A ∪ B) = A c ∩ Bc. Similarly, in Definition
c

4.2.2, one can apply ∪ F ⊆ ℑ instead of ∩ F ⊆ ℑ.


n=1
n
n=1
n

We shall associate the term “knowledge” mentioned above with informa-


tion provided by a statistic. Toward this end let us define a σ-algebra gener-
ated by a random variable X or simply based on X.

Definition 4.2.3: If the σ-algebra ℑ consists of subsets A’s in the form of


A = {ω : X (ω) ∈ B}, where B’s are Borel sets, then ℑ is a σ-algebra generated
by the random variable X.
In general, the elements of B can be quite complicated. However, if the ran-
dom variable X is a statistic with real values we often can employ Definition
4.2.3 to denote the σ -algebra based on X, noting it as σ (X). In an intuitive
manner (not a standard notation), it is very easy to understand that ℑ = σ (X)
corresponds to a collection of all possible analytical functions of X. Similarly,
if we have m statistics X 1 , ..., X m the σ-algebra ℑm = σ(X 1 , ..., X m ) includes all
possible analytic operations based upon X 1 , ..., X m . Some examples
are given as: 1 = (X1 )0 ∈ ℑ m ; X1 + X m ∈ ℑ m ; X m− 1X m ∈ ℑ m ; exp (t1X1 + t2 X 2 ) ∈ ℑ m,
where m ≥ 2 and t1 , t2 are constants. However, in this setting we should
be very careful with, e.g., X 1 + X m+1, since in general if X m+1 cannot be calcu-
lated using onlyX 1 , ..., X m , we have X 1 + X m+1 ∉ℑm. More formally the sym-
bols “∈” and “∉” should be interpreted in terms of measurability, i.e., for
example the notation X 1 + X m ∈ℑm means X 1 + X m is ℑm-measurable for all
subsets A ⊂ ℑm and here the symbol “⊂” indicates that ℑm includes A. Actu-
ally the “confusion” regarding the terms “to belong ∈” and “to be measur-
able” is not vital in the context of the materials presented in this chapter and
we guess the meaning of, e.g., X 1 + X m ∈ℑm is more understandable than that
of X 1 + X m is ℑm- measurable for all subsets A ⊂ ℑm, in the framework of this
book.
Martingale methodology employs the conditional expectation concept.
Consider, for example, a scenario where we observe data points ξ1 , ..., ξn and
a statistic X is calculated using values of ξ1 , ..., ξn, i.e., X = X (ξ1 , ..., ξn ) . Then
{ }
the conditional expectation E X |σ (ξl , ..., ξ k ) , 1 ≤ l ≤ k ≤ n can be considered
in intuitive fashion as a derivative of the expectation E {X} at the moment
when we condition on ξl , ..., ξ k to be fixed with respect to this expectation.
86 Statistics in the Health Sciences: Theory, Applications, and Computing

This leads us to conclude that E {X |σ (ξl , ..., ξ k )} ∈σ (ξl , ..., ξ k ), since in


{ }
E X |σ (ξl , ..., ξ k ) values of ξ1 , ..., ξl−1 and ξ k +1 , ..., ξn are integrated out with
respect to their joint distribution. For example,
{ }
E ξl |σ (ξl , ..., ξ k ) = ξl ∈σ (ξl , ..., ξ k ) ; { } {
E ξ1ξ2 |σ (ξ1) = ξ1E ξ2 |σ (ξ1) ∈σ (ξ1), }
where { }
E ξ2 |σ (ξ1) = E (ξ2 ) ∈σ (ξ1) if ξ1 and ξ2 are independent;
⎪⎧ ⎪⎫ ⎪⎧ ⎪⎫
∑ ∑ ∑
n k n
E⎨ ξ i |σ (ξ1 ,..., ξ k )⎬ = ξi + E ⎨ ξ i |σ (ξ1 ,..., ξ k )⎬ ∈ σ (ξ1 ,..., ξ k ) . Note
⎩⎪ i =1 ⎭⎪ i =1 ⎩⎪ i = k +1 ⎭⎪
{
that, to be a bit formal, since, in general, E X |σ (ξl , ..., ξ k ) is a σ (ξl , ..., ξ k ) }
-measurable random variable, it would be “elegant” to write “almost sure”
(see Section 1.7.2) in the examples shown above, e.g.,
{ } { }
E ξ 1ξ 2 |σ (ξ 1) = ξ 1E ξ 2 |σ (ξ 1) a.s. Nevertheless we shall omit such obvious
“a.s.” from now on. A general formal definition of a conditional expectation
has the form shown below.

Definition 4.2.4: Let X be a random variable with E X < ∞ and ℑ denote a


σ-algebra. The conditional expectation of X with respect to ℑ is the random
variable Y = E (X |ℑ) , which has the following properties: (i) Y ∈ℑ ; (ii) for
each A ⊂ ℑm, we have E {Y I (A)} = E {X I (A)}, where I(.) is the indicator
function.
Property (ii) of Definition 4.2.4 leads to a very useful rule of Probability
Theory that oftentimes is called as the law of total expectation or the law
of iterated expectation and formulated as

E(X ) = E {E(X |ℑ)} . (4.1)

For example, consider the simple scenarios when (1) X ∈ℑ then E(X |ℑ) = X and
it is clear that E {E(X |ℑ)} = E(X ), and (2) X is independent of ℑthen E(X |ℑ) = E(X )
and it is clear that E {E(X |ℑ)} = E {E(X )} = E(X ). Note that one can show that the
convolution principle (2.1) shown in Chapter 2 is a corollary of Equation (4.1).
Now, given our background material above, we can define a martingale in
an appropriate manner relative to the main purpose of this chapter.

Definition 4.2.5: Let Mn denote a statistic based on n data points and


ℑ1 , ..., ℑn be a sequence of σ-algebras such that ℑ1 ⊂ ℑ2 ⊂ ... ⊂ ℑn , and
E Mn < ∞. The couple (Mn , ℑn ) is called

a martingale, if it satisfies E (Mn |ℑn−1) = Mn−1;


a submartingale, if it satisfies E (Mn |ℑn−1) ≥ Mn−1 ;
a supermartingale, if it satisfies E (Mn |ℑn−1) ≤ Mn−1 .
Martingale Type Statistics and Their Applications 87

It is clear that if (Mn , ℑn) is a martingale then, for example, Mn 2 , ℑn is ( )


a submartingale, whereas (log (Mn) , ℑn) is a supermartingale, since
( )
E Mn 2 |ℑ n−1 ≥ {E (Mn |ℑ n−1)} = Mn−12 and E (log (Mn)|ℑ n−1) ≤ log E (Mn |ℑ n−1) = log (Mn−1). ( )
2

In these cases Jensen's inequality was used.

Warning: It is very important to draw the reader’s attention to Definition 4.2.5,


which focuses on the pair (Mn , ℑn ). Consider the example where we define

∑ ξ i of iid random variables ξ 1 ,..., ξ n with E (ξ1) = 0 and let


n
Mn as the sum
i=1

∑ ∑
n−1 n−1
ℑn = σ (ξ1 , ..., ξn ). Then E (Mn |ℑ n− 1) = ξ i + E (ξ n |ℑ n − 1) = ξ i + E (ξ n) = Mn − 1 + 0 .
i =1 i =1

This implies that (Mn , ℑn) is a martingale. However, defining ℑ 'n = σ (y1 ,..., y n),
where iid random variables y1 ,..., y n are independent of ξ 1 ,..., ξ n , we obtain
E (Mn |ℑ‘n−1) = ∑ E (ξi ) = 0 and then (Mn , ℑ'n) is not a martingale.
n

i=1
In a general aspect, one can consider martingale type statistics to extend
the notion of a statistic consisting of a sum of iid random variables. For
example, assume that ξ 0 ,..., ξ n are iid random variables with E (ξ 1) = a. In this
case, it is clear that ⎛ ∑ ⎛
ξi − an, σ (ξ1 , ..., ξn)⎞ = ⎝ ∑
(ξi − a), σ (ξ1 , ..., ξn )⎞⎠ is
n n

⎝ i=1 ⎠ i=1

a martingale as shown above. It turns out that by defining the statistic

∑ (ξ i − 1 − a)(ξ i − a) we obtain
n
Mn =
i=1

⎪⎧ ⎫
∑ (ξ − a)(ξ − a)|σ (ξ ,..., ξ )⎪⎬⎭⎪
n
E {Mn |σ (ξ 0 ,..., ξ n− 1)} = E ⎨ i −1 n−1
⎩⎪
i 0
i =1

= ∑ (ξ − a)(ξ − a) + E {(ξ − a)(ξ − a)|σ (ξ ,..., ξ


n−1
i −1 i n−1 n 0 n−1 )}
i =1

= ∑ (ξ − a)(ξ − a) + (ξ − a) E (ξ − a)
n−1
i −1 i n−1 n
i =1

= ∑ (ξ − a)(ξ − a) = M
n−1
i −1 i n−1
i =1

( )
and then Mn , σ (ξ 0 ,..., ξ n − 1) is a martingale, where Mn represents a sum of
dependent random variables (certainly, we assumed that E ξ1 < ∞ to apply
Definition 4.2.5.). As another example, via taking into account Section 3.7,
we propose analyzing observations ε1 , ..., ε n that satisfy the autoregres-
sion model ε i = βε i−1 + ξi , i = 1, ..., n, with the iid noise ξ 1 ,..., ξ n and
E (ξ1) = 0, ε 0 = 0 .
88 Statistics in the Health Sciences: Theory, Applications, and Computing

In this case we have ε k = ξ k + β (ξ k − 1 + βε k − 2 ) = ... = ∑


k
β k − j ξ j , which yields
j=1

⎡ ⎤

k −1
E {β− kε k |σ (ξ1 ,..., ξ k −1)} = β− k ⎢ βk − j ξ j + E {ξ k |σ (ξ1 ,..., ξ k −1)}⎥
⎢⎣ j =1 ⎥⎦
⎧⎪ ⎫⎪
∑ ∑
k −1 k −1
= β− k ⎨ βk − j ξ j + E (ξ k )⎬ = β− k βk − j ξ j
⎩⎪ j =1 ⎭⎪ j =1


k −1
= β−( k −1) β( k −1)− j ξ j = β−( k −1)ε k −1 , k > 1.
j =1

( )
This implies that β − n ε n , σ (ξ1 , ..., ξn ) is a martingale, where the observations
ε1 , ..., ε n are sums of independent and not identically distributed random
variables.
Chapter 2 introduced statistical strategies that employ random numbers of
data points. In forthcoming sections of this book we will introduce several
procedures based on non retrospective mechanisms as statistical sequen-
tial schemes. In order to extend retrospective statistical tools based on data
with fixed sample sizes to procedures stopped at random times we introduce
the following definition:

Definition 4.2.6: A random variable τ is a stopping time with respect to


σ-algebra ℑn ( ℑ1 ⊂ ℑ2 ⊂ ... ⊂ ℑn ) if it satisfies (i) {τ ≤ n} ⊂ ℑn , for all n ≥ 1;
(ii) Pr (τ < ∞) = 1.
In the statistical interpretation, the idea is that only information in ℑn at time n
should lead us to stop without knowledge of the future, whether or not time τ
has arrived. For example, assume τ is a stopping time. Then τ + 1 is a stopping
time, since we can stop at τ observing ℑn and decide finally to stop our proce-
dure at the next stage τ + 1, i.e., by the definition of τ and the σ-algebra, we
have {τ + 1 ≤ n} = {τ ≤ n − 1} ⊂ ℑn−1 ⊂ ℑn. However τ − 1 is not a stopping time,
since we cannot stop at τ observing ℑn and decide that we should stop our
procedure at the previous stage τ − 1, i.e., {τ − 1 ≤ n} = {τ ≤ n + 1} ⊂ ℑn+1 ⊄ ℑn.

Examples:
1. It is clear that a fixed integer N is a stopping time.
2. Suppose τ and ν are stopping times. Then min( τ , ν) and max( τ, ν)
are stopping times. Indeed, min( τ, ν) and max( τ, ν) satisfy con-
dition (ii) of Definition 4.2.6, since τ and ν are stopping times.
In this case, the events {min(τ, ν) ≤ n} = {τ ≤ n} ∪ {ν ≤ n} ⊂ ℑn and
{max(τ, ν) ≤ n} = {τ ≤ n} ∩ {ν ≤ n} ⊂ ℑn .
Martingale Type Statistics and Their Applications 89

3. Let integers τ and ν define stopping times. Then τ + ν is a stop-




ping time, since {τ + ν ≤ n} = {τ ≤ n − ν} = {τ ≤ n − ν} ∩ {ν = i} =
i=1


n
{τ ≤ n − i} ∩ {ν = i} ⊂ ℑn
i =1
where {τ ≤ 0} = ∅ .
4. In a similar manner to the example above, one can show that
τ − ν is not a stopping time.
5. Consider the random variable ϖ = inf {n ≥ 1 : X n ≥ an}, where
X 1 , X 2 , X 3 ,... are independent and identically uniformly [0, 1]
(
distributed random variables and ai = 1 − 1/(i + 1)2 , i ≥ 1 . Defining )
ℑn = σ (X 1 ,..., X n ), we have {ϖ ≤ n} = {max 1≤ k ≤ n (X k − ak ) ≥ 0} ⊂ ℑn.
Let us examine, for a nonrandom N ,

Pr (ϖ > N) = Pr {max 1≤ k ≤ N (X k − ak ) ≤ 0}
= Pr {(X 1 − a1) ≤ 0,..., (X N − aN ) ≤ 0}
N N

= ∏ Pr {(X k − ak ) ≤ 0} = ∏ Pr {X k ≤ ak}
k =1 k =1
N

= ∏a . k
k =1

In this case,

⎛ N ⎞ N N

log ⎜⎜


∏ ak ⎟⎟ =


∑ log(1 − 1/( k + 1)2 ) ≈ ∑{−1/(k + 1) } , 2

k =1 k =1 k =1

where we apply the Taylor argument to represent log(1 − s) =


− s + s 2 /2 + O( s 3 ) with s = 1/( k + 1)2. Thus Pr (ϖ > N) not → 0 as
N → ∞ and then Pr (ϖ < ∞) < 1. This means ϖ is not a stopping
time.
6. Defining ai = (1 − 1/ (i + 1)) , i ≥ 1, in Example (5) above, we obtain
⎛ N
⎞ N N

log ⎜

∏ k =1
ak ⎟ =

∑k =1
log (1 − 1/( k + 1)) ≈ ∑ (−1/(k + 1)) ≈ − log(N ) → −∞
k =1

as N → ∞ and then Pr (ϖ > N) → 0 that implies ϖ is a stopping


time, in this case.
7. Consider the random variable W = inf n ≥ 1 : Sn ≥ 2 n2 (n + 1) , { }

n
where Sn = X i and X 1 > 0, X 2 > 0, X 3 > 0,... are iid random
i
variables with E (X 1) < 1 . Let us examine
90 Statistics in the Health Sciences: Theory, Applications, and Computing

Pr (W < ∞) = ∑ Pr (W = n)
n=1

= ∑ Pr {S < 4,..., S
1 n−1 < 2(n − 1)2 n, Sn ≥ 2 n2 (n + 1)}
n=1
∞ ∞

≤ ∑ Pr {S ≥ 2n (n + 1)} ≤ ∑ 2nE((nS +) 1)
n
2
2
n

n=1 n=1

E (X 1)
= ∑ 2n(n + 1)
n=1

1
= E (X 1) < 1,
2
where Chebyshev’s inequality is used. Thus, although (W ≤ n) ⊂
ℑn = σ (X 1 ,..., X n ), W is not a stopping time.

4.3 The Optional Stopping Theorem and Its Corollaries:


Wald’s Lemma and Doob’s Inequality
Historically, martingale type objects were associated with Monte Carlo rou-
lette games. Suppose X n shows our capital obtained at a stage n while we are
playing the Monte Carlo roulette game without any arbitrage, i.e., no one
outside monitors the roulette wheel and provides us more money.
Martingale Type Statistics and Their Applications 91

It turns out that, in a fair game, (X n , ℑn = σ (X 1 , ..., X n)) is a martingale. Then


{ } {
on average E (X n) = E E (X n |ℑn− 1) = E (X n− 1) = E E (X n− 1 |ℑn− 2 ) = ... = E (X1), }
where the rule (4.1) is applied and X 1 displays our initial capital. One can
wonder about a “smart” strategy, when the game will be stopped depending
on the history of the game, e.g., at the time min {N , inf (n ≥ 1 : X n ≥ $100)},
where N is a fixed integer. Towards this end, consider the following result:

Proposition 4.3.1 (The Optional Stopping Theorem). Let (Mn , ℑn ) be a


martingale and τ ≥ 1 be a stopping time that corresponds to ℑn. Assume that
(
E (τ) < ∞ and, for all n ≥ 1 on the set {τ ≥ n} ⊂ ℑn−1, E Mn − Mn− 1 |ℑn− 1 ≤ c , )
where c is a constant. Then E (Mτ ) = E (M1).

Proof. For an integer N, define the stopping time τ N = min( N , τ). It is clear
that

⎧⎪ N ⎫⎪
E (Mτ N ) = E ⎨
⎩⎪ i=1

Mi I (τ N = i)⎬
⎪⎭
⎡ N ⎤
= E⎢
⎢⎣ i=1

Mi {I (τ N ≤ i) − I (τ N ≤ i − 1)}⎥
⎥⎦
N N

= ∑ E {M I (τ
i=1
i N ≤ i)} − ∑ E {M I (τ
i=1
i N ≤ i − 1)} .

In this equation, by virtue of property (4.1) and the martingale and stopping
time definitions, we have

E {Mi I (τ N ≤ i − 1)} = EE {Mi I (τ N ≤ i − 1)|ℑi−1}

{
= E I (τ N ≤ i − 1) E (Mi |ℑi−1) }
Thus = E {Mi−1I (τ N ≤ i − 1)} .

N N

E (M τ N ) = ∑E{M I (τ i N ≤ i)} − ∑E{M i −1 I (τ N ≤ i − 1)}


i =1 i =1

= E {MN I (τ N ≤ N)}
= E (MN ), where M0 I (τ N ≤ 0) = 0.

{ } {
Since E (MN ) = E E (MN |ℑN − 1) = E (MN − 1) = E E (MN − 1 |ℑN − 2 ) = ... = E (M1), we }
conclude with that E (Mτ N ) = E (M1). Now, considering N → ∞, we complete
the proof.
92 Statistics in the Health Sciences: Theory, Applications, and Computing

In Section 2.4.1 we applied Wald’s lemma to evaluate the expectation of


the renewal function. Now we present Wald’s lemma in the following form:

Proposition 4.3.2. Assume that ξ 1 ,..., ξ n are iid random variables with
E (ξ 1) = a and E ξ1 < ∞. Let τ ≥ 1 be a stopping time with respect to
ℑn = σ (ξ 1 ,..., ξ n). If E (τ) < ∞, then

⎛ τ ⎞
E ⎜⎜


∑ξ ⎟⎟⎟⎠ = aE (τ).
i
i =1

Proof. Consider

⎛ n
⎞ n− 1

E⎜


i=1
ξi − an|ℑn−1⎟ =

∑ ξ − an + E (ξ |ℑ
i=1
i n n− 1 )
n− 1

= ∑ ξ − an + a
i=1
i

n− 1

= ∑ ξ − a(n − 1).
i=1
i

⎛ n

This implies that ⎜

∑ ξ − an, ℑ ⎟⎠ is a martingale. Therefore the use of Propo-
i=1
i n

⎛ τ ⎞ ⎛ τ ⎞
sition 4.3.1 provides E ⎜
⎝ i=1

ξi − aτ⎟ = E (ξ1 − a) = 0 and then E ⎜

ξi ⎟ = aE (τ),
⎝ i=1 ⎠

which completes the proof.
As has been mentioned previously, Chebyshev’s inequality plays a very
useful role in theoretical statistical analysis. The next proposition can be
established as an aspect of extending Chebyshev’s approach.

Proposition 4.3.3 (Doob’s Inequality). Let (Mn ≥ 0, ℑn ) be a martingale.


Then, for all ε > 0,

(
Pr max Mk ≥ ε ≤
1≤ k ≤ n
) E (M1)
ε
.

Proof. Define the random variable ν = inf {n ≥ 1 : Mn ≥ ε}. Since it is possible


that Pr (ν < ∞) < 1, ν may not be a stopping time, whereas νn = min( ν, n)
defines a stopping time with respect to ℑn. Next, consider the probability
Martingale Type Statistics and Their Applications 93

Pr (Mνn ≥ ε) = Pr (Mνn ≥ ε, ν ≤ n) + Pr (Mνn ≥ ε, ν > n) .

It is clear that νn = ν , if ν ≤ n ; νn = n , provided that ν > n . Thus

Pr (Mνn ≥ ε) = Pr (Mν ≥ ε, ν ≤ n) + Pr (Mn ≥ ε, ν > n) ,

where, by the definition of ν, Mν ≥ ε a.s. and the event {Mn ≥ ε, ν > n} = ∅,


since {Mn ≥ ε} means {ν > n} is not true. This provides

(
Pr (Mνn ≥ ε) = Pr( ν ≤ n) = Pr max Mk ≥ ε .
1≤ k ≤ n
)
( ) {
That is Pr max Mk ≥ ε = Pr (Mνn ≥ ε) = E I (Mνn ≥ ε) , where the indicator
1≤ k ≤ n
}
function I (a ≥ b) satisfies the inequality I (a ≥ b) ≤ a/b, where a and b are pos-

( )
itive. Then Pr max Mk ≥ ε ≤ E (Mνn ) / ε and the use of Proposition 4.3.1 com-
1≤ k ≤ n
pletes the proof.
Note that Proposition 4.3.3 displays a nontrivial result: for Mn ≥ 0 one
could anticipate that max Mk increases, providing Pr max Mk ≥ ε = 1 at
1≤ k ≤ n
( 1≤ k ≤ n
)
least for large values of n, however Doob’s inequality bounds max Mk in
1≤ k ≤ n
probability. We might expect that by using Proposition 4.3.3, an accurate

( )
bound for Pr max Mk ≥ ε can be obtained. While evaluating Pr max Mk ≥ ε ,
1≤ k ≤ n
( 1≤ k ≤ n
)
we could directly employ Chebyshev’s approach, obtaining Pr max Mk ≥ ε ≤ ( )
( )
1≤ k ≤ n

E max Mk / ε , but that is probably not useful.


1≤ k ≤ n

4.4 Applications
The main theme of this section is to demonstrate that the martingale meth-
odology has valuable substance when it is applied in developing and exam-
ining statistical procedures.

4.4.1 The Martingale Principle for Testing Statistical Hypotheses


Suppose we have a set of n observations X 1 ,..., X n which are planned to be
employed for testing hypotheses H 0 versus H 1. In this case, Section 3.4
suggests applying the likelihood ratio LRn = f1 (X1 ,..., X n ) f0 (X 1 ,..., X n ) if the
94 Statistics in the Health Sciences: Theory, Applications, and Computing

joint density functions f1 , under H 1, and f0, under H 0 , are available. Defining
ℑn = σ (X 1 ,..., X n), we use the material of Section 3.2 to derive
⎧ n+ 1 ⎫



∏ f1 (X i |X 1 ,..., X i −1 ) ⎪


EH0 (LRn+1 |ℑ n) = EH0 ⎨ i =1
n+ 1 |ℑ n⎬
⎪ ⎪

⎪⎩
∏ f0 (X i |X 1 ,..., X i −1 ) ⎪
⎪⎭
i =1
n

∏ f (X |X ,..., X
1 i 1 i −1 )
⎧⎪ f (X |X ,..., X n ) ⎫⎪
= i =1
EH0 ⎨ 1 n+1 1 |ℑ n⎬
⎩⎪ f0 (X n+1 |X 1 ,..., X n ) ⎭⎪
n

∏ f (X |X ,..., X
0 i 1 i −1 )
i =1
n

∏ f (X |X ,..., X
1 i 1 i −1 )
= i =1
n ∫ ff ((uu||XX ,...,
1 1 X )n
f (u|X ,..., X ) du
0 1 n

∏ f (X |X ,..., X
0 ,..., X )
1 n
0 i 1 i −1 )
i =1
n

∏ f (X |X ,..., X
1 i 1 i −1 )
= i =1
n = LRn ,
∏ f (X |X ,..., X
0 i 1 i −1 )
i =1

where the subscript H 0 indicates that we consider the expectation given that
the hull hypothesis is correct. Then, following Section 4.2, the pair (LRn , ℑn)
is an H 0 -martingale and the pair (log (LRn ) , ℑn ) is an H 0 -supermartingale.
This says—at the moment in intuitive fashion—that while increasing the
amount of observed data points under H 0 , we do not enlarge chances to
reject H 0 for large values of the likelihood ratio, in average.
Now, under H 1, we consider

∏ f (X |X ,..., X
1 i 1 i −1 )
⎪⎧ f (X |X ,..., X n ) ⎪⎫
EH1 (LRn+1 |ℑ n) = i =1
EH1 ⎨ 1 n+1 1 |ℑ n⎬
⎩⎪ f0 (X n+1 |X 1 ,..., X n ) ⎭⎪
n

∏ f (X |X ,..., X
0 i 1 i −1 )
i =1
n

∏ f (X |X ,..., X
1 i 1 i −1 )
⎪⎧ 1 ⎪⎫
= i =1
EH1 ⎨ |ℑ n⎬
⎪⎩ f0 (X n+1 |X 1 ,..., X n )/ f1 (X n+1 |X 1 ,..., X n ) ⎪⎭
n

∏ f (X |X ,..., X
0 i 1 i −1 )
i =1
Martingale Type Statistics and Their Applications 95

∏ f (X |X ,..., X
1 i 1 i −1 )
1
≥ i =1
n
EH1 { f0 (X n+1 |X 1 ,..., X n )/ f1 (X n+1 |X 1 ,..., X n )|ℑ n}
∏ f (X |X ,..., X
0 i 1 i −1 )
i =1
n

∏ f (X |X ,..., X
1 i 1 i −1 )
⎧⎪ ⎫⎪ −1
= i =1
n ⎨
⎩⎪
∫ f0 (u|X 1 ,..., X n )
f1 (u|X 1 ,..., X n ) du⎬ = LRn ,
⎭⎪
∏ f (X |X ,..., X
f1 (u|X1 ,..., X n )
0 i 1 i −1 )
i =1

where Jensen’s inequality is used and the subscript H 1 indicates that we con-
sider the expectation given that the alternative hypothesis is true. Regarding
log (LRn + 1), we have

EH1 {log (LRn+1)|ℑ n}


⎧ ⎛ n+ 1 ⎞ ⎫




⎜ ∏ f0 (X i |X 1 ,..., X i −1 )⎟⎟
⎟ |ℑ ⎪⎬


= −EH1 ⎨log ⎜⎜ i =1
n+ 1 ⎟ n⎪




⎜⎜

∏ f1 (X i |X 1 ,..., X i −1 )⎟⎟




⎩ i =1 ⎭
⎛ ⎛ n+ 1 ⎞⎞




⎜ ∏ f0 (X i |X 1 ,..., X i −1 ) ⎟⎟
⎟⎟

≥ − log ⎜ EH1 ⎜⎜ i =1
n+ 1 |ℑ n⎟⎟ ⎟

⎜⎜

⎜⎜

∏ f1 (X i |X 1 ,..., X i −1 ) ⎟⎟
⎟⎟ ⎟⎟
⎠⎠
⎝ i =1

⎛ n ⎞

⎜ ∏ f0 (X i |X 1 ,..., X i −1 )
⎛ ⎞


= − log ⎜⎜ |ℑ n⎟⎟ ⎟⎟
f0 (X n+1 |X 1 ,..., X n )
i =1
n EH1 ⎜⎜
∏ f (X |X ,..., X ⎝ ⎠⎟
⎜ f1 (X n+1 |X 1 ,..., X n )
⎜⎜ 1 i 1 i −1 ) ⎟⎟
⎝ i =1 ⎠
= log (LRn) .

Then the pairs (LRn , ℑn) and (log (LRn) , ℑn) are H 1-submartingales. Thus,
intuitively, under H 1, this shows that by adding more observations to be
used in the test procedure, we provide more power to the likelihood
ratio test.
96 Statistics in the Health Sciences: Theory, Applications, and Computing

Conclusion: The likelihood ratio test statistic is most powerful given all of
the assumptions are met. Therefore, generally speaking, it is reasonable to consider
developing test statistics or their transformed versions that could have properties of
H 0 -martingales (supermartingales) and H 1-submartingales, anticipating an opti-
mality of the corresponding test procedures. Note that in many testing scenarios
the alternative hypothesis H 1 is “not H 0 ” and hence does not have a specified
form. For example, the composite statement regarding a parameter θ can be
stated as H 0 : θ = θ0 versus H 1 : θ ≠ θ0, where θ0 is known. In this case,
EH1 -type objects do not have unique shapes. Then we can aim to preserve
only the H 0 -martingale (supermartingales) property of a corresponding
rational test statistic, following the Neyman–Pearson concept of statistical
tests, in which H 0 -characteristics of decision-making strategies are vital
(e.g., Vexler et al., 2016a).

4.4.1.1 Maximum Likelihood Ratio in Light of the Martingale Concept


In Section 3.6 we pointed out a significant difference between the likelihood
ratio and the maximum likelihood ratio test statistics. In order to provide a
martingale interpretation of this fact, we assume that observations X 1 ,..., X n
are from the density function f ( x1 , ..., xn ; θ) with a parameter θ. We are inter-
ested in testing the hypothesis H 0 : θ = θ0 versus H 1 : θ ≠ θ0, where θ0 is
known. The maximum likelihood ratio is

MLRn = f (X 1 ,..., X n ; θˆ n ) f (X 1 ,..., X n ; θ0 ), where θˆ n = arg max θ f (X 1 ,..., X n ; θ).

Noting that θˆ n + 1 should maximize f (X 1 ,..., X n + 1 ; θ), i.e., f (X 1 , ..., X n ; θˆ n+1 ) ≥


f (X 1 , ..., X n ; θˆ n ), where θˆ n ∈ℑn = σ (X 1 ,..., X n), we obtain

EH0 {MLRn + 1 |ℑn = σ (X1 ,..., X n )}

⎧ n+ 1



∏ f (X |X ,..., X
i 1 i−1 ; θˆ n + 1 ) ⎪

= EH0 ⎨ i=1
n+ 1 |ℑn = σ (X1 ,..., X n )⎬



∏i=1
f (X i |X1 ,..., X i − 1 ; θ0 ) ⎪


n+ 1
⎧ ⎫


∏ f (X |X ,..., X
i=1
i 1 i−1 ; θˆ n ) ⎪

≥ EH0 ⎨ n+ 1 |ℑn ⎬




i=1
f (X i |X1 ,..., X i − 1 ; θ0 ) ⎪


Martingale Type Statistics and Their Applications 97

∏ f (X |X ,..., X
i 1 i−1 ; θˆ n )
⎪⎧ f (X n + 1 |X 1 ,..., X n ; θˆ n ) ⎪⎫
= i=1
EH0 ⎨ |ℑn ⎬
⎪⎩ f (X n + 1 |X 1 ,..., X n ; θ0 )
n


i=1
f (X i |X 1 ,..., X i − 1 ; θ0 )
⎪⎭

∏ f (X |X ,..., X
i 1 i−1 ; θˆ n )
f (u|X 1 ,..., X n ; θˆ n (X 1 ,..., X n ))
= i=1
n ∫ f (u|X 1 ,..., X n ; θ0 )
f (u|X 1 ,..., X n ; θ0 ) du

i=1
f (X i |X1 ,..., X i − 1 ; θ0 )

= MLRn .

Thus the pair (MLRn , ℑn) is an H 0 -submartingale (not a martingale as wanted).


This reinforces a concern that, in general, the maximum likelihood ratio statis-
tic can provide an optimal decision-making strategy based on data with fixed
sample sizes. It follows that while increasing the amount of observed data
points under H 0 , we do increase the chances to reject H 0 on average for large
values of the maximum likelihood ratio. One may then conclude that the viola-
tion of the H 0 -martingale principle shown above yields that maximum likeli-
hood–based testing procedures are not very useful when sample sizes are
very large. Almost no null hypothesis is exactly true in practice. Consequently,
when sample sizes are large enough, almost any null hypothesis will have a
tiny p-value, and hence will be rejected at conventional levels (Marden, 2000).

4.4.1.2 Likelihood Ratios Based on Representative Values


Let us repeat the idea that oftentimes in the context of hypothesis testing we
do not need to estimate unknown parameters when we only are interested in
the decision to reject or not to reject the null hypothesis. In this section we
consider decision-making procedures based upon representative values that
can substitute for unknown parameters.
Without loss of generality, we consider a straightforward example that
introduces the basic ingredients for further explanations in this book.
Assume that we observe iid data points X 1 , ..., X n and are interested in test-
ing the hypothesis

H 0 : X 1 ~ f0 ( x) versus H 1 : X 1 ~ f1 ( x) = f0 ( x)exp(θ1x + θ2 ), (4.2)

where f0 is a density function (known or unknown), and quantities θ1, θ2 are


unknown, θ2 can depend on θ1. Let the sign of θ1 be known, e.g., θ1 > 0.
The statement of the problem above is quite general. For example, when
X 1 ~ N (μ , σ 2 ) with unknown σ 2 , the considered hypothesis could have the
form of H 0 : μ = 0, σ = η versus H 1 : μ ≠ 0, σ = η , for some unknown η > 0. In
98 Statistics in the Health Sciences: Theory, Applications, and Computing

(
this case, we have f0 ( x) = exp − x 2 (2 η2 ) ) 2 πη2 and f1 ( x) = f0 ( x)exp(θ1x + θ2 )
with θ1 = μ/(η2 ) and θ2 = −μ 2 /(2 η2 ), respectively.
By virtue of the statement (4.2), the likelihood ratio test statistic is

⎛ ⎞

n
LRn = exp ⎜⎜ θ1 X i + nθ2⎟⎟ .
⎝ i =1 ⎠

Since θ1, θ2 are unknown, we denote the statistic

⎛ ⎞

n
Tn = exp ⎜⎜ a X i + b⎟⎟ ,
⎝ i =1 ⎠

where a and b can be chosen arbitrarily and a is of the same sign of θ1, i.e.,
a > 0 in this example, e.g., a = 7, b = 0 . The decision-making policy is to reject
H 0 if Tn > C, where C > 0 is a test threshold.
Suppose, only for the following theoretical exercise, that
c=
∫ f (u)exp(au + b) du < ∞. Then
0

⎧ n+ 1 ⎫

⎛ 1 ⎞ 1



f ∏
0 (X i )exp (aX i + b)



EH0 ⎜ n+1 Tn+1 |ℑ n = σ (X 1 ,..., X n)⎟ = n+1 EH0 ⎨ i =1 n+ 1 | ℑ n⎬
⎝c ⎠ ⎪ ⎪

c
⎪ f 0 (X i ) ⎪
⎪⎩ i =1
⎪⎭

T ⎧⎪ f0 (X n+1 )exp (aX n+1 + b)⎫⎪


= nn+1 EH0 ⎨ ⎬
c ⎩⎪ f 0 ( X n+ 1 ) ⎭⎪
f0 (u)exp (au + b)
=
Tn
c n+ 1 ∫ f0 (u)
f0 (u) du

Tn
= ,
cn

( )
i.e., Tn / c n , ℑn is an H 0 -martingale, where c n is independent of the underly-
ing data. Thus we can anticipate that the test statistic Tn can be optimal. This
is displayed in the following result:
The statistic Tn provides the most powerful test of the hypothesis testing
at (4.2).

Proof. Note again that the likelihood ratio test statistic LRn = exp ⎛⎜θ1 ∑ X i + nθ2⎞⎟
n

⎝ i=1 ⎠
is most powerful. Following the Neyman–Pearson concept, in order to com-
pare the test statistics LRn and Tn, we define test thresholds CαT and CαLR to
satisfy α = PrH0 (LRn > CαLR ) = PrH0 (Tn > CαT ) for a prespecified significance
Martingale Type Statistics and Their Applications 99

level α, since the Type I error rates of the tests should be equivalent. In this
case, using the definitions of the test statistics, we have

⎡ ⎤ ⎡ ⎤
∑ { ( ) } ∑ { ( ) }
n n
α = PrH0 ⎢ X i > log CαLR − nθ2 θ1⎥ = PrH0 ⎢ X i > log CαT − b a⎥ , θ1 > 0, a > 0.
⎢⎣ i =1 ⎥⎦ ⎢⎣ i=1 ⎥⎦

That is, {log (CαLR ) − nθ2} θ1 = {log (CαT ) − b} a and then log (CαT ) = a {log (CαLR ) − nθ2} θ1 + b.

Therefore, under H 1, the power

⎡ ⎤
∑ { ( ) }
n
PrH1 (Tn > CαT ) = PrH1 ⎢ X i > log CαT − b a⎥
⎣ i =1 ⎦
⎡ ⎤
∑ { ( ) }
n
= PrH1 ⎢ X i > log CαLR − nθ2 θ1⎥
⎣ i =1 ⎦
= PrH1 (LRn > CαLR ),

i.e., the statistic Tn is the most powerful test statistic. (In this book, the nota-
tion PrH k ( A) means that we consider the probability of ( A) given that the
hypothesis H k is true, k = 0, 1.) This completes the proof.
Since θ1 and θ2 are unknown, one can propose to estimate θ1 and θ2 by
using the maximum likelihood method in order to approximate the likelihood
ratio test statistic. In this case, the efficiency of the maximum likelihood ratio
test may be lost completely. The optimal method based on the test statistic
exp ⎛ a ∑ X i + b⎞ can be referred as a simple representative method, since
n

⎝ i=1 ⎠
we employ arbitrarily chosen numbers to represent the unknown parame-
ters θ1 and θ2.
In this section, we present the simple representative method, which can be
easily extended by integrating test statistics through variables that represent
unknown parameters with respect to functions that can display weights cor-
responding to values of the variables. This approach can be referred to as a
Bayes factor type decision-making mechanism that will be described for-
mally in forthcoming sections of this book. For example, observing X 1 ,..., X n,
one can denote the test statistic

BFn =
∫ f (X , ..., X ; θ)π(θ) dθ
1 1 n

f0 (X 1 , ..., X n ; θ0 )

to test for H 0 : X 1 , ..., X n ~ f0 ( x1 , ..., xn ; θ0 ) versus H 1 : X 1 ,..., X n ~ f1 ( x1 ,..., xn ; θ1 ),


θ1 ≠ θ0 with known θ0 and unknown θ1, where π(θ) represents our level of
belief (probability-based weights) on the possible consequences of the deci-
sions and/or the possible values of θ under the alternative hypothesis. In this
case, it is clear that (BFn , ℑn = σ (X 1 , ..., X n)) is an H 0 -martingale.
100 Statistics in the Health Sciences: Theory, Applications, and Computing

4.4.2 Guaranteed Type I Error Rate Control of the Likelihood Ratio Tests

As shown in Section 4.4.1 the pair (LRn > 0, ℑn = σ (X1 , ..., Xn)) is an
H 0 -martingale. Thus Proposition 4.3.3 implies PrH0 (max 1≤ k ≤ n LRk > C) ≤ C −1,
since EH0 (LR1) =
∫ { f (u)/ f (u)} f (u) du = 1. This result insures that we can
1 0 0

non-asymptotically control maximum values of the likelihood ratio test sta-


tistic in probability. For example, setting the test threshold C = 20 , we guar-
antee that the Type I error rate of the likelihood ratio test does not exceed 5%,
for all fixed n ≥ 1 . This result could be conservative, however, we should note
that the inequality PrH0 (max 1≤ k ≤ n LRk > C) ≤ C −1 was obtained without any
requirements on the data density functions, e.g., we did not restrict the obser-
vations to be iid. It is clear that when the form of LRn is specified (in general,
this form can be very complicated) one can try to obtain an accurate evalua-
tion of the corresponding Type I error rates, e.g., via Monte Carlo approxima-
tions (see Section 2.4.5.1). In this case, Doob’s inequality suggests the initial
values for the evaluations.
Assume we observe sequentially (one-by-one) iid data points
X 1 , X 2 , X 3 , ... and need to decide X 1 , X 2 , X 3 , ... ~ f0 or X 1 , X 2 , X 3 , ... ~ f1,
provided that if X 1 , X 2 , X 3 , ... ~ f0 it is not required to stop the sampling. In
this case, we can define the stopping time τ(C) = min (k ≥ 1 : LRk ≥ C). Then an
error is to stop the sampling under X 1 , X 2 , X 3 ,... ~ f0 with the rate
Pr (τ ≤ n|X 1 ~ f0) = Pr (max 1≤ k ≤ n LRk > C|X 1 ~ f0) ≤ C −1 , for all n ≥ 1. This deter-
mines the Type I error rate of the sequential procedure τ(C) to be C −1-bounded.
(In this context, see Section 4.4.4 and the forthcoming material of the book
related to sequential statistical procedures.)

4.4.3 Retrospective Change Point Detection Policies


Change point problems originally arose in quality control applications,
when one typically observes the output of production processes and should
indicate violations of acceptable specifications. In general, the problem of
detecting changes is found in quite a variety of experimental sciences. For
example, in epidemiology one may be interested in testing whether the inci-
dence of a disease has remained constant over time, and if not, in estimating
the time of change in order to suggest possible causes, e.g., increases in
asthma prevalence due to a new factory opening.
In health studies there may be a distributional shift, oftentimes in terms of
a shift in location either due to random factors or due to some known factors
at a fixed point in time that is either known or needs to be estimated. For
example, biomarker levels may be measured differently between two labora-
tories and a given research organization may switch laboratories. Then there
may be a shift in mean biomarker levels simply due to differences in sample
processing. As another example, the speed limit on many expressways in the
Martingale Type Statistics and Their Applications 101

United States was increased from 55 miles per hour to 65 miles per hour in
1987. In order to investigate the effect of this increased speed limit on high-
way traveling, one may study the change in the traffic accident death rate
after the modification of the 55 mile per hour speed limit law. Problems
closely related to the example above are called change point problems in the
statistical literature.
We will concentrate in this section on retrospective change point prob-
lems, where inference regarding the detection of a change occurs retrospec-
tively, i.e., after the data has already been collected. Various applications, e.g.,
in the areas of biological, medical and economics, tend to generate retrospec-
tive change point problems. For example, in genomics, detecting chromo-
somal DNA copy number changes or copy number variations in tumor cells
can facilitate the development of medical diagnostic tools and personalized
treatment regimens for cancer and other genetic diseases; e.g., see Lucito
et al. (2000). The retrospective change point detection methods are also use-
ful in studying the variation (over time) of share prices on the major stock
exchanges. Various examples of change point detection schemes and their
applications are introduced in Csörgö and Horváth (1997) and Vexler et al.
(2016a).
Let X 1 , X 2 , ..., X n be independent continuous random variables with fixed
sample size n. In the formal context of hypotheses testing, we state the
change point detection problem as testing for

H 0 : X 1 , X 2 ,..., X n ~ f0 versus H 1 : X i ~ f0 , X j ~ f1 , i = 1,..., v − 1, j = v,..., n,

where f0 ≠ f1 are density functions. Note that v ∈ (1, n] is an unknown param-


eter. This simple change point model is termed a so-called at most one
change-point model. In the parametric fashion, efficient detection methods
for the classical simple change point problem include the cumulative sum
(CUSUM) and Shiryayev–Roberts approaches (e.g., Vexler et al., 2016a).

4.4.3.1 The Cumulative Sum (CUSUM) Technique


When the parameter v can be assumed to be known, the likelihood ratio
statistic Λ nv provides the most powerful test by virtue of the proof shown in
Section 3.4, where

Πni = k f1 (X i )
Λ nk = .
Πni = k f0 (X i )

We refer the reader to Chapter 3 for details regarding likelihood ratio tests.
When the parameter v is unknown, the maximum likelihood estimator
v̂ = arg max 1≤ k ≤ n Λ nk of the parameter v can be applied. By plugging the maxi-
mum likelihood estimator v̂ of the change point location v, we have
102 Statistics in the Health Sciences: Theory, Applications, and Computing

Λ n = Λ vˆ = max Λ nk ,
1≤ k ≤ n

which has the maximum likelihood ratio form. Note that log (Λ n) =


n
max log ( f1 (X i )/ f0 (X i )), which corresponds to the well-known cumula-
1≤ k ≤ n i= k

tive sum (CUSUM) type test statistic. The null hypothesis is rejected for large
values of the CUSUM test statistic Λ n. It turns out that, even in this simple
case, evaluations of H 0 -properties of the CUSUM test are very complicated
tasks.
Consider the σ-algebra ℑnk = σ (X k , ..., X n) and the expectation

⎛ Π n f (X ) ⎞
( )
EH0 Λ nk −1 |ℑ nk = EH0 ⎜⎜ ni = k −1 1 i |ℑ nk ⎟⎟
⎝ Π i = k − 1 f 0 (X i ) ⎠
Π f (X )
n ⎛ f (X ) ⎞
= ni = k 1 i EH0 ⎜⎜ 1 k −1 |ℑ nk ⎟⎟
Π i = k f 0 (X i ) ⎝ f 0 (X k − 1 ) ⎠
= Λ nk .

( )
This shows that Λ nk , ℑnk is an H 0 -martingale that can be called an
H 0 -nonnegative reverse martingale. Then a simple modification of Proposi-
tion 4.3.3 (Gurevich and Vexler, 2005) implies

( )
PrH0 (Λ n > C) ≤ EH0 Λ nn / C = 1/C

that provides an upper bound of the corresponding Type I error rate. The
arguments mentioned below Proposition 4.3.3 support that the bound 1/C
might be very close to the actual Type I error rate. The non-asymptotic Type
I error rate monitoring via 1/C was obtained without any requirements on
the density functions f0 , f1. In several scenarios with specified f0 , f1-forms,
complex asymptotic propositions can be used to approximate PrH0 (Λ n > C)
(Csörgö and Horváth, 1997) as well as one can also employ Monte Carlo
approximations to the Type I error rate. In these cases, the bound 1/C sug-
gests the initial values for the evaluations.

4.4.3.2 The Shiryayev–Roberts (SR) Statistic-Based Techniques


In the previous subsection, we considered the case where the change point
location is maximum likelihood estimated, which leads to the CUSUM
scheme. Alternatively taking into account Section 4.4.1.2, we can propose the
Π ni = j f1 (X i )
statistic Λ nj = n , where an integer j can be chosen arbitrarily.
Π i = j f 0 (X i )
Martingale Type Statistics and Their Applications 103


n
Extending this approach, we obtain the test statistic SRn = w j Λ nj , where
j =1
the deterministic weights w j ≥ 0, j = 1, ..., n, are known. For simplicity and
clarity of exposition, we redefine


n
SRn = Λ nj .
j =1

The statistic SRn can be considered in the context of a Bayes factor type test-
ing, when we integrate the unknown change point with respect to the uni-
form prior information regarding a point where the change is occurred. In
this change point problem, the integration will correspond to simple sum-
mation. This approach is well-addressed in the change point literature as the
Shiryayev–Roberts (SR) scheme (Vexler et al., 2016a). The null hypothesis is
rejected if SRn / n > C , for a fixed test threshold C > 0.
Noting that SRn = 1 n (SRn− 1 + 1) , we consider the pair (SRn − n, ℑn =
f (X )
f 0 (X n )
σ (X 1 , ..., X n)) that satisfies

⎪⎧ f (X )⎪⎫
EH0 (SRn − n|ℑ n−1) = (SRn−1 + 1) EH0 ⎨ 1 n ⎬ − n = SRn−1 − (n − 1).
⎩⎪ f0 (X n )⎭⎪

Then (SRn − n, ℑn) is an H 0 -martingale. Similarly, since

⎧⎪ f (X )⎫⎪ 1
EH1 ⎨ 1 n ⎬ ≥ = 1,
⎩⎪ f0 (X n )⎭⎪ EH1 { f0 (X n )/ f1 (X n )}

one can show that (SRn − n, ℑn) is an H 1 -submartingale. Thus, according to


Section 4.4.1, it is reasonable to attempt deriving an optimal property of the
retrospective SR change point detection policy. (In this context, we would
like to direct the reader’s attention again to the “Warning” in Section 3.4.)
It turns out that the retrospective change point detection policy based on
the SR statistic is non-asymptotically optimal in the sense of average most
powerful (via v = 1,..., n).

Proof. To prove this statement, one can use in equality (3.2) in the form

⎛1 ⎞ ⎧⎪ ⎛ 1 ⎞ ⎫⎪
⎜ SRn − C⎟ ⎨ I ⎜ SRn ≥ C⎟ − δ⎬ ≥ 0,
⎝n ⎠ ⎩⎪ ⎝ n ⎠ ⎪⎭
104 Statistics in the Health Sciences: Theory, Applications, and Computing

where δ = δ (X 1 , ..., X n) represents any decision rule based on {X i , i = 1, … , n} ,


δ = 0,1, and when δ = 1 we reject H 0 . Then in a similar manner to the analysis
of the likelihood ratio statistic shown in Section 3.4, we obtain

⎪⎧ Πik=−11 f0 (X i )Πin=k f1 (X i ) ⎛ 1 ⎞ ⎪⎫
n
⎛1 ⎞
1
n ∑E H0 ⎨
⎪⎩ Πni=1 f0 (X i )
I ⎜ SRn ≥ C⎟ ⎬
⎝n ⎠ ⎪⎭
⎜ SRn ≥ C⎟
⎝n ⎠
k =1

n
⎧⎪ Πik=−11 f0 (X i )Πni=k f1 (X i ) ⎫⎪

1
n ∑E H0 ⎨
⎩⎪ Π n
f
i=1 0 ( X i )
I (δ(X1 , ..., X n ) rejects H 0 )⎬ − C PrH0 (δ(X1 , ..., X n ) rejects H 0 ) ,
⎭⎪
k =1

∑ ∫ ... ∫ I ⎜⎝ n1 SR
⎛ ⎞

1 n ⎛1 ⎞
n ≥ C⎟ f0 (x1 ,..., xk − 1) f1 (xk ,..., xn) dxi − C PrH0 ⎜ SRn ≥ C⎟
n ⎠ i =1 ⎝n ⎠
k =1

∑ ∫ ... ∫ I (δ(x ,..., x ) rejects H ) f (x ,..., x ) f1 (xk ,..., xn) ∏


n
1
≥ 1 n 0 0 1 k −1 dxi −
n i =1
k =1

C PrH0 (δ rejects H 0) .

This leads to the conclusion

n n

∑ ∑Pr
1 ⎛1 ⎞ 1
PrH1 ⎜ SRn ≥ C|ν = i⎟ ≥
⎝n ⎠ n
H1 (δ rejects H 0 |ν = i) ,
n
i =1 i =1

when the Type I error rates PrH0 (SRn / n ≥ C) and PrH0 (δ rejects H 0) are equal.
The proof is complete.

Remark 1. In the considered statement of the problem, the change point


parameter ν ∈ (1, n] is unknown. In this case, how does one calculate the
power of change point detection procedures? It is clear the test power is a
function of ν, so is it unknown, in general? Suppose, for example, when
n > 10 a student defines a test statistic Λ 10n
. Then his/her test statistic can be
powerless when, e.g., ν = 1, 2, however when ν = 10 no one can outperform
the power of Λ 10
n
, since Λ 10
n
is the likelihood ratio, in this case. Thus it is inter-
esting to ask the question: what is the sense of “most powerful testing” in the
change point problem? In this context, the meaning of the statement “most
powerful in average” is clear.

Remark 2. Vexler et al. (2016a) presented a literature review, data examples-


and relevant R codes in order to compare the CUSUM and SR retrospective
change point detection procedures and their extended forms. The authors
discussed properties of the maximum likelihood estimator νˆ = arg max 1≤ k ≤ n Λ nk
Martingale Type Statistics and Their Applications 105

of the change point parameter ν. Section 4.5 introduces a novel martingale


based concept to compare the CUSUM and SR statistics.

4.4.4 Adaptive (Nonanticipating) Maximum Likelihood Estimation


In this subsection we introduce modifications of the maximum likelihood
estimations that preserve the H 0 -martingale properties of test statistics.
Consider the following scenarios:

(i) Let X 1 , ..., X n be from the density function f ( x1 , ..., xn ; θ) with a param-
eter θ. We are interested in testing the hypothesis H 0 : θ = θ0 versus
H 1 : θ ≠ θ0, where θ0 is known.
Define the maximum likelihood estimators of θ as

θˆ 1, k = arg max θ f (X 1 ,..., X n ; θ), θˆ 1,0 = θ0 , k = 0,..., n.

In this case, the maximum likelihood ratio test statistic MLRn =


f (X 1 , ..., X n ; θˆ 1, n ) f (X 1 , ..., X n ; θ0 ) is
n

∏ f (X |X ,..., X i 1 i −1 ; θˆ n )
MLRn = i =1
n .
∏ f (X |X ,..., X i 1 i −1 ; θ0 )
i =1

Then the adaptive (nonanticipating) maximum likelihood ratio test statistic


can be denoted in the form
n

∏ f (X |X ,..., X i 1 i −1 ; θˆ 1, i −1 )
ALRn = i =1
n .
∏ f (X |X ,..., X i 1 i −1 ; θ0 )
i =1

It turns out that (ALRn > 0, ℑn = σ (X1 , ..., Xn)) is an H 0 -martingale, since
θˆ 1, i − 1 ∈ℑi − 1 and
n

∏ f (X |X ,..., X
i 1 i −1 ; θˆ 1, i − 1 )
⎛ f (X |X ,..., X ; θˆ ) ⎞
EH0 (ALRn+ 1 |ℑ n) = i =1
EH0 ⎜⎜ n+ 1 1 n 1, n
|ℑ n⎟⎟
⎝ f (X n+ 1 |X 1 ,..., X n ; θ0 )
n

∏ f (X |X ,..., Xi 1 i −1 ; θ0 )

i =1

= ALRn ∫ f (u|X 1 ,..., X n ; θˆ 1, n )


f (u|X 1 ,..., X n ; θ0 )
f (u|X 1 ,..., X n ; θ0 )du = ALRn .
106 Statistics in the Health Sciences: Theory, Applications, and Computing

Thus Proposition 4.3.3 implies PrH0 (max 1≤ k ≤ n ALRk > C) ≤ C −1.


(ii) Let X 1 , ..., X n be from the density function f ( x1 , ..., xn ; θ, λ) with parame-
ters θ and λ. We are interested in testing the hypothesis H 0 : θ = θ0 versus
H 1 : θ ≠ θ0 , where θ0 is known. The parameter λ is a nuisance parameter that
can have different values under H 0 and H 1 , respectively.
A testing strategy, which can be applied in this case, is to try finding
invariant transformations of the observations, thus eliminating the nui-
sance parameter from the statement of the testing problem. For example,
when we observe the iid data points X 1 , ..., X n ~ N (μ , σ 2 ) to test for
H 0 : σ 2 = 1 versus H 1 : σ 2 ≠ 1, the observations can be transformed to be
Y2 = X 2 − X 1 , Y3 = X 3 − X 1 ..., Yn = X n − X 1 and the likelihood ratio statistic
based on Y2 , ..., Yn, where Y2 , ..., Yn are iid given X 1, can be applied with-
out any attention to μ ’s values. (In this context, one can also use


k
Y1 = X 1 − g k , Y2 = X 2 − g k ..., Yn = X n − g k , where g k = X i /k with a fixed
i=1
k < n or, for simplicity, Y1 = X 2 − X1 , Y2 = X 4 − X 3 ,....) Testing strategies based on
invariant statistics are very complicated in general and cannot be employed
in various situations, depending on forms of f ( x1 , ..., xn ; θ, λ).
It is interesting to note that classical statistical methodology is directed
toward the use of decision-making mechanisms with fixed significant levels.
Classic inference requires special and careful attention of practitioners relative
to the Type I error rate control since it can be defined by different functional
forms. In the testing problem with the nuisance parameter λ, for any presumed
significance level, α, the Type I error rate control may take the following forms:

(1). supλ PrH0 (reject H 0 ) = α ;


(2). supλ lim PrH0 (reject H 0 based on observations {X 1 ,..., X n}) = α ;
n →∞
(3). lim supλ PrH0 (reject H 0 based on observations {X 1 ,..., X n}) = α ;
n →∞

(4).
∫ Pr H0 (reject H 0 )π(λ) d λ = α, π>0: ∫ π(λ) dλ = 1.
Form (1) from above is the classical definition (e.g., Lehmann and Romano,
2006). However, in some situations, it may occur that supλ PrH0 (reject H 0 ) = 1.
If this is the case the Type I error rate is not controlled appropriately. In addi-
tion, the formal notation of the supremum is mathematically complicated in
terms of derivations and calculations, either analytically or via Monte Carlo
approximations. Form (2) above is commonly applied when we deal with
maximum likelihood ratio tests as it relates to Wilks’ theorem (Wilks, 1938).
However, since the form at (2) is an asymptotical large sample consideration,
the actual Type I error rate PrH0 (reject H 0) may not be close to the expected
level α given small or moderate fixed sample sizes n. The form given above
at (3) is rarely applied in practice. The form above at (4) is also introduced in
Martingale Type Statistics and Their Applications 107

Lehmann and Romano (2006). This definition of the Type I error rate control
depends on the choice of the function π(θ) in general.
Now let us define the maximum likelihood estimators of θ and λ as
(
θˆ , λˆ
1, k= arg max
1, k ) f (X ,..., X ; θ, λ), θˆ = θ , λˆ = λ , k = 0,..., n and
(θ , λ) 1 n 1,0 0 1,0 10

λˆ 0,1, k = arg max λ f (X 1 ,..., X n ; θ0 , λ), k = 1,..., n,

where λ 10 is a fixed reasonable variable. Then the adaptive maximum likeli-


hood ratio test statistic can be denoted in the form
n

∏ f (X |X ,..., X i 1 i −1 ; θˆ 1, i −1 , λˆ 1, i −1 )
ALRn = i =1
n .
∏ f (X i |X 1 ,..., X i −1 ; θ0 , λˆ 0,1, n )
i =1
n

By virtue of the definition of λ̂ 0,1, k , we have


n
∏ f (X |X , ..., X
i=1
i 1 i−1 ; θ0 , λˆ 0,1, n ) ≥

∏i=1
f (X i |X 1 , ..., X i − 1 ; θ0 , λ H0 ) , where λ H0 is a true value of λ under H 0 . Then

∏ f (X |X ,..., X i 1 i −1 ; θˆ 1, i −1 , λˆ 1, i −1 )
ALRn = i =1
n ≤ Mn with
∏ f (X i |X 1 ,..., X i −1 ; θ0 , λˆ 0,1, n )
i =1

∏ f (X |X ,..., X
i 1 i −1 ; θˆ 1, i −1 , λˆ 1, i −1 )
Mn = i =1
n ,
∏ f (X |X ,..., X i 1 i −1 ; θ0 , λ H0 )
i =1

where it is clear that (Mn > 0, ℑn = σ (X 1 , ..., X n)) is an H 0 -martingale. Thus


Proposition 4.3.3 implies

supλ H0 PrH0 (max 1≤ k ≤ n ALRk > C) ≤ supλ H0 PrH0 (max 1≤ k ≤ n Mk > C) ≤ C −1 .

This result is very general and does not depend on forms of f ( x1 , ..., xn ; θ, λ) .
(iii) Let X 1 ,..., X n be from the density function f ( x1 , ..., xn ; θ, λ) with parame-
ters θ and λ. Assume that we are interested in testing the hypothesis
H 0 : θ ∈(θL , θU ) versus H 1 : θ ∉(θL , θU ) , where θL , θU are known. The parame-
ter λ is a nuisance parameter that can have different values under H 0 and H 1,
respectively.
108 Statistics in the Health Sciences: Theory, Applications, and Computing

In this case, the adaptive maximum likelihood ratio test statistic can be
denoted in the form
n

∏ f (X |X ,..., X
i 1 i −1 ; θˆ 1, i −1 , λˆ 1, i −1 )
ALRn = i =1
n ,
∏ f (X i |X 1 ,..., X i −1 ; θˆ 0,1, n , λˆ 0,1, n )
i =1

where

(θˆ 1, k )
, λˆ 1, k = arg max(θ∉(θL , θU ), λ) f (X 1 , ..., X n ; θ, λ), θˆ 1,0 = θ10 , λˆ 1,0 = λ 10 , k = 0, ..., n
and

(θˆ 0,1, k )
, λˆ 0,1, k = arg max(θ∈(θL , θU ), λ) f (X 1 , ..., X n ; θ, λ), k = 1, ..., n

with θ10 , λ 10 that have fixed reasonable values.


In a similar manner to that shown in Scenario (ii) above, we obtain

sup{θ∈ (θL , θU ) , λ} PrH0 (max 1≤ k ≤ n ALRk > C) ≤ C −1 .

(iv) Let X 1 , X 2 ,..., X n be independent continuous random variables with fixed


sample size n. We state the change point problem as testing for

H 0 : X 1 , X 2 ,..., X n ~ f0 (u; θ0) versus H 1 : X i ~ f0 (u; θ0) , X j ~ f1 (u; θ1) ,

i = 1,..., ν − 1, j = ν,..., n,

where θ0 is an known parameter, θ1 is an unknown parameter and forms of


the density functions f0 , f1 can be equivalent if θ0 ≠ θ1. In many practical situ-
ations, we can assume that θ0 is known, for example, when the stable regime
under the null hypothesis has been observed for a long time. Applications of
invariant data transformations (e.g., see Scenario (ii) above) can also lead
us to a possibility to assume θ0 is known to be, e.g., zero. In this case, the
adaptive CUSUM test statistic is

⎧⎪ Πn f (X ; θˆ )⎫⎪
Λ n = max ⎨ i =nk 1 i i + 1, n ⎬ , θˆ l , n = arg max θ {Πni = l f1 (X i ; θ)} , θˆ n+ 1, n = θ0
1≤ k ≤ n ⎪ Π i = k f 0 (X i ; θ 0 ) ⎪
⎩ ⎭

(
that implies the pair Λ n , ℑnk = σ (X k ,..., X n) is an H 0-nonnegative reverse )
martingale (Section 4.4.3.1). Then, by virtue of Proposition 4.3.3, the Type I
error rate of the considered change point detection procedure satisfies
PrH0 (Λ n > C) ≤ 1/ C.
Martingale Type Statistics and Their Applications 109

(v) Let X 1 , X 2 ,..., X n be independent continuous random variables with fixed


sample size n. Assume that the change point problem is to test for

H 0 : X 1 , X 2 ,..., X n ~ f0 (u; θ0) versus H 1 : X i ~ f0 (u; θ0) , X j ~ f1 (u; θ1) ,

i = 1,..., ν − 1, j = ν,..., n,

where θ0 and θ1 are unknown.


In this case, the adaptive CUSUM test statistic is

⎧⎪ Πn f (X ; θˆ )⎫⎪
Λ n = max ⎨ i =n k 1 i 1, i+1, n ⎬ , θˆ s, l , n = arg max θ {Πin=l f s (X i ; θ)} ,
1≤ k ≤ n ⎪ Π ˆ
⎩ i = k f0 (X i ; θ0, k , n ) ⎭⎪

s = 0,1, θˆ 1, n+1, n = θ10 ,

where θ10 is a fixed reasonable variable. Then, using the techniques presented
in Scenarios (i), (iii), and (iv) above, we conclude that the Type I error rate of
this change point detection procedure satisfies the very general inequality
supθ0 PrH0 (Λ n > C) ≤ 1/ C.

Remark. In the context of general retrospective change point detection pro-


cedures based on martingale type statistics, we refer the reader, e.g., to Vexler
(2006, 2008) and Vexler et al. (2009b), where the following issues are presented:
multiple change point detection statements, martingale type associations
between the CUSUM and SR techniques, martingale type transformations of
test statistics, Monte Carlo comparisons related to the CUSUM and SR tech-
niques, and biostatistical data examples associated with change point detec-
tion problems.

4.4.5 Sequential Change Point Detection Policies


In previous sections we focused on offline change point detection or retro-
spective change point analysis, where inference regarding the detection of
a change occurs retrospectively, i.e., after the data has already been col-
lected. In contrast to the retrospective change point problem, the so-called
online (sequential) change point problem or an online surveillance change
problem features methods in which a prospective analysis is performed
sequentially. In this case, in order to detect a change as soon as possible,
such that the consequences of such a change can be tackled effectively, the
detection method is implemented after every new observation is collected.
The online change point problems are widely presented and studied in
fields such as statistical quality control, public health surveillance, and
110 Statistics in the Health Sciences: Theory, Applications, and Computing

signal processing. For instance, in a continuous production process, the


quality of the products is expected to remain stable. However, in practice,
for some known or unknown reasons, the process may fail to produce prod-
ucts of equal quality. Therefore, one may be interested in investigating when
the quality of a product starts to deteriorate. Under such circumstances,
online change point analysis can be used in the form of control charts to
monitor output of industrial processes. Typically, control charts have a cen-
tral line (the mean) and upper and lower lines representing control limits,
which are usually set at three-sigma (standard deviations) detection limits
away from the mean. Any data points that fall outside these limits or
unusual patterns (determined by various run tests) on the control chart sug-
gest that systematic causes of variation are present. Under such circum-
stance, the process is said to be out of control and actions are to be taken to
find, and possibly eliminate the corresponding causes. A process is declared
to be in control if all points charted lie randomly within the control limits.
In order to illustrate the construction and operation of control charts we
consider the following two data examples.

Example 1: We consider a data example containing one-at-a-time


measurements of a continuous process variable presented in Gavit et al.
(2009). Figure 4.1 shows the X-chart or control chart for the data. The
horizontal upper and lower dashed lines represent the upper and lower
three-sigma control limits, respectively. The middle solid horizontal

xbar.one chart for dat

UCL
Group summary statistics

12

11

CL
10

8
LCL

1
Group
Number of groups = 23
Center = 10.23913 LCL = 7.724624 Number beyond limits = 0
StdDev = 0.8381689 UCL = 12.75364 Number violating runs = 1

FIGURE 4.1
The X-chart for a data example containing one-at-a-time measurements of a continuous pro-
cess variable presented in Gavit, P. et al., BioPharm International, 22, 46–55, 2009.
Martingale Type Statistics and Their Applications 111

line is the mean. Based on the X-chart shown in Figure 4.1, there is no
evidence showing that the process is out of control.

Implementation of Figure 4.1 is shown in the following R code:

> # install.packages("qcc") # install it for the first time


> library(qcc)
> dat.raw <- c(10.9,9.7,8.6,9.3,9.2,10.4,9.6,10.0,8.8,11.0,10.3,9
.3,11.1,9.9,8.9,10.2,10.6,12.2,10.7,11.2,10.9,11.2,11.5)
> dat <- qcc.groups(dat.raw, sample=1:length(dat.raw))
> obj <- qcc(dat, type="xbar.one",nsigmas = 3)

Example 2: We consider another data example obtained from the inside


diameter measurements of piston rings for an automotive engine
produced by a forging process (Montgomery, 1991). The inside diameter
of the rings manufactured by the process is measured on 25 samples,
each of size 5, for control phase I.
Figure 4.2 presents the X-bar chart, a type of Shewhart control chart
that is used to monitor the arithmetic means of successive samples of
constant size, for this calibration data. This control chart shows the cen-
ter line (CL) as a horizontal solid line and the upper limits (UCL) and
lower control limits (LCL) as dashed lines, and the sample group statis-
tics are drawn as points connected with lines.

xbar Chart for phase I and phase II


Calibration data in phase I New data in phase II

74.020
Group summary statistics

UCL
74.010

74.000 CL

73.990
LCL
1 3 5 7 9 12 15 18 21 24 27 30 33 36 39
Group
Number of groups = 40
Center = 74.00118 LCL = 73.98805 Number beyond limits = 3
StdDev = 0.009785039 UCL = 74.0143 Number violating runs = 1

FIGURE 4.2
The Shewhart chart (X-bar chart) for both calibration data and new data, where all the statistics
and the control limits are solely based on the calibration data, i.e., the first 25 samples.
112 Statistics in the Health Sciences: Theory, Applications, and Computing

Implementation of Figure 4.2 is shown in the following R code:

> # install.packages(“qcc”) # install it for the first time


> library(qcc)
> data(pistonrings)
> diameter <- qcc.groups(pistonrings$diameter,
pistonrings$sample)
>
> # Plot sample means to control the mean level of a continuous variable

> `Phase I` <- diameter[1:25,]


> `Phase II` <- diameter[26:40,]
> obj <- qcc(`Phase I`, type="xbar", newdata=`Phase II`, nsigmas = 3)

In contrast to the simple strategies mentioned above, we consider a very


efficient sequential change point detection policy based on the Shiryayev–
Roberts technique. Assume we sequentially (one-by-one) observe indepen-
dent continuous random variables X 1 , X 2 ,.... The statement of the sequential
change point detection problem consists of providing an indication (an
alarm) regarding a possible change in distribution of the data points. We are
interested in a sequential test for

H 0 : X 1 , X 2 , ... ~ f0 versus H 1 : X 1 , ..., X ν−1 ~ f0 , X ν , X ν+1 , ... ~ f1 ,

where f0 ≠ f1 are density functions.


In this case, the proposed change point detection procedure is based on
the stopping rule:

n
Π ni= j f1 (X i )
N (C) = min {n ≥ 1 : SRn > C} , SRn = ∑ Λ ,Λ
j=1
n
j
n
j =
Π ni= j f0 (X i )
,

C > 0 is a test threshold,

determining when one should stop sampling and claim a change has
occurred. In order to perform the N (C) strategy, we need to provide instruc-
tions regarding how to select values of C > 0 . This issue is analogous to that
related to defining test thresholds in retrospective decision-making mecha-
nisms with controlled Type I error rates. Certainly, we can define relatively
small values of C, yielding a very powerful detection policy, but if no change
has occurred (under H 0 ) most probably the procedure will stop when we do
not need to stop and the stopping rule will be not useful. Then it is vital to
monitor the average run length to false alarm EH0 {N (C)} in the form of
determining C > 0 to satisfy EH0 {N (C)} ≥ D, for a presumed level D. Thus
defining, e.g., D = 100 and stopping at N (C) = 150 we could expect that a false
alarm is in effect and then just rerun the detection procedure after additional
Martingale Type Statistics and Their Applications 113

investigations. In this context, we note that (SRn − n, ℑn = σ (X1 , ..., X n)) is an


H 0 -martingale (Section 4.4.3.2) and then Proposition 4.3.1 concludes with
EH0 {SRN (C ) − N (C)} = EH0 {SR1 − 1} = 0, i.e., EH0 {N (C)} = EH0 {SRN (C )} . Thus,
since, by virtue of the definition of N (C) , SRN (C ) ≥ C , we have EH0 {N (C)} ≥ C .
This result gives a rule to select values of C, controlling the average run
length of the procedure to false alarm. It is clear that Monte Carlo simulation
schemes can help us to derive accurate approximations to EH0 {N (C)} in spec-
ified scenarios of the data distributions. In these cases, the result EH0 {N (C)} ≥ C
is still valuable in the context of initial values that can be applied in terms of
the Monte Carlo study.
Note that the techniques described in Section 4.4.4 can be easily applied to
extend the statement of the detection problem mentioned above preserving
⎡⎣EH0 {N (C)} ≥ C⎤⎦ -type properties.

Remark. In order to obtain detailed information regarding sequential


change point detection mechanisms and their applications, the reader can
consult outstanding publications of Professor Moshe Pollak and Professor
Tze Lai (e.g., Pollak, 1985, 1987; Lai, 1995).

4.5 Transformation of Change Point Detection Methods into a


Shiryayev–Roberts Form
The Shiryayev–Roberts statistic SRn for detection of a change has the appeal-
ing property that, when the process is in control, SRn − n is a martingale
with zero expectation, thus the average run lengths to false alarm EH0 (N) of
a stopping time N can be evaluated via EH0 (N) = EH0 (SRN ).
We apply an idea of Brostrom (1997) and show that in certain cases it is pos-
sible to transform a surveillance statistic into a Shiryayev–Roberts form. We
demonstrate our method with several examples, both parametric as well as
nonparametric. In addition, we propose an explanation of the phenomenon
that simple CUSUM and Shiryayev–Roberts schemes have similar properties.
The Shiryayev–Roberts change point detection technique is usually based
on N (C) = the first crossing time of a certain statistic SRn over a prespecified
threshold C. The statistic SRn has the property that SRn − n is a martingale
with zero expectation when the process is in control. Therefore, the average
run lengths (ARL) to false alarm EH0 {N (C)} satisfies EH0 {N (C)} = EH0 {SRN (C )},
and since SRN (C ) ≥ C one easily obtains the inequalityEH0 {N (C)} ≥ C . (Renewal-
theoretic arguments can be applied often to obtain an approximation
EH0 {N (C)} ≈ aC , where a is a constant.) Thus, if a constraint EH0 {N (C)} ≥ B on
114 Statistics in the Health Sciences: Theory, Applications, and Computing

the false alarm rate is to be satisfied, choosing C = B (or C = B/a ) does the job
(see Pollak, 1987, and Yakir, 1995, for the basic theory).
Alternative approaches do not have this property. When a method is
based on a sequence of statistics {Tn} that is an H 0 -submartingale when
the process is in control (such as CUSUM), we follow Brostrom (1997) in
applying Doob’s decomposition to {Tn} and extract the martingale com-
ponent. (The anticipating component depends on the past and does not
contain new information about the future.) We transform the martingale
component into a Shiryayev–Roberts form. We apply this procedure to
several examples. Among others, we show that our approach can be used
to obtain nonparametric Shiryayev–Roberts procedures based on ranks.
Another application is to the case of estimation of unknown postchange
parameters.

4.5.1 Motivation
To motivate the general method, consider the following example. Suppose
that observations Y1 , Y2 , ... are independent, Y1 , ..., Yν− 1 ~ N (0, 1) when the
process is in control, Yν , Yν+ 1 , ... ~ N (θ, 1) when the process is out of control,
where ν is the change point and θ is unknown. Let Pν and Eν denote the
probability measure and expectation when the change occurs at ν. The value
of ν is set to ∞ if no change ever takes place in the sequence of observations.
Where θ is known, the classical Shiryayev–Roberts procedure would declare
a change to be in effect if SRnI ≥ C for some prespecified threshold C, where

n
⎧⎪ n ⎫⎪
SR = I
n
k =1

exp ⎨
⎪⎩ i = k
∑(
θYi − θ2 / 2 ⎬
⎪⎭
) (4.3)

(e.g., Pollak, 1985, 1987). When θ is unknown, it is natural to estimate θ. In


order to preserve the P∞-martingale structure, Lorden and Pollak (2005) and


i−1
Vexler (2006) propose to replace θ in the k th term of SRnI by Yj /(i − k ).
j= k
(The same idea appears in Robbins and Siegmund (1973) and Dragalin (1997)
in a different context.) A criticism of this approach is that foregoing the use
of Yk ,..., Yi − 1 without Yi ,..., Yn when estimating θ in the k th term is inefficient
(e.g., Lai, 2001, p. 398).
An alternative is to estimate θ in the k th term of SRnI by the Pν= k maximum


n
likelihood estimator θˆ k , n = Yj /(n − k + 1) , so that SRnI becomes
j= k

n− 1
⎧⎪ ⎫
∑ Y − (n − k + 1) (θˆ ) / 2⎪⎬⎭⎪ + 1
n


2
SRnI = exp ⎨θˆ k ,n i k ,n (4.4)
k =1 ⎩⎪ i= k
Martingale Type Statistics and Their Applications 115

∑ = 0 (the k = n term is mechanically replaced by


0
with the convention that
1

( ) ( )
1 in order that EH0 SRnI = E∞ SRnI < ∞ ). Note that SRnI , ℑn = σ (Y1 , ..., Yn) is a ( )
P∞-submartingale, for

⎧⎪n− 1

∑ Y − (n − k + 1) (θˆ ) / 2⎪⎬⎪⎭ + 1
n


2
SR ≥
I
n exp ⎨θˆ k , n− 1 i k , n− 1
k =1 ⎪⎩ i= k

(because the maximum is attained at θˆ k , n), so

(
E∞ SRnI |ℑ n− 1 )
n− 2 ⎧ n−1 ⎫ ⎛ ⎧ n−1 ⎫ ⎞
∑exp ⎪⎨⎪⎩θˆ ∑Y − 21 (n − k) (θˆ ) ⎪⎬⎪⎭E ⎜⎜⎜ exp ⎪⎨⎪⎩θˆ ∑Y − 21 (θˆ ) ⎪⎬⎪⎭ |ℑ ⎟
2 2
≥ k , n−1 i k , n−1 ∞ Y
k , n−1 n i k , n−1 n−1 ⎟

k =1 i=k ⎝ i=k ⎠

+1 = SRnI− 1 .

Generally, one can show that applying the maximum likelihood estima-
tion of the unknown postchange parameters to the initial Shiryayev–Roberts
statistics of the parametric detection schemes leads to the same submartin-
gale property.
Our goal is to transform SRnI into SRn such that (SRn − n, ℑn) is a P∞-martin-
gale with zero expectation, in such a manner that relevant information will
not be lost. Rather than continue with this example, we present a general
theory.

4.5.2 The Method

We assume that initially we have a series SRnI of test statistics that is a { }


P∞-submartingale. Let SR0I = 0 and define recursively

(SR ) W + 1 ,
I

Wn =
n− 1

E (SR |ℑ ) I
n− 1
( )
SRn = SRnI Wn , W0 = 0, n = 1, 2, ... (4.5)
∞ n n− 1

It is clear that Wn ∈ ℑ n−1 and therefore E∞ (SRn |ℑ n− 1) = SRn + 1 so that (SRn − n, ℑ n)


is a P∞ -martingale with zero expectation. (If SRnI − n, ℑn is a P∞ -martingale ( )
with zero expectation, then Equation (4.5) provides Wn = 1, n ≥ 1.) We will now
116 Statistics in the Health Sciences: Theory, Applications, and Computing

show that {SRn − n} is a P∞-martingale transform of the martingale compo-


nent of SRnI. Towards this end, we present the Doob’s decomposition result,
representing several notations that appear in this section.

Lemma 4.5.1.

Let Sn be a submartingale with respect to ℑn then Sn can be uniquely decomposed as


Sn = S n + S n with the following properties:
M P

(
S n is a martingale component of Sn (i.e., S n , ℑn is a martingale);
M M
)
is ℑn − 1-measurable, having the predictable property, S n ≥ S n− 1 (a.s.),
P P P
S n
S 1 = 1 and S n − S n− 1 = E (Sn − Sn− 1 |ℑn− 1) .
P P P

Now we formulate the widely known definition of martingale transforms.


M
If G n is a martingale with respect to ℑn, then a martingale transform G ' n
M

M
of G n is given by

G'
M
n
= G'
M
n−1 (
+ an G n − G n − 1 ,
M M
) (4.6)

where an ∈ℑn− 1. Such transformations have a long history and interesting


interpretations in terms of gambling. An interpretation is if we have a fair
game, we can choose the size and side of our bet at each stage based on the
prior history and the game will continue to be fair. An association of a sto-
chastic game with a change point detection is presented, for example, by
Ritov (1990).
M
Thus, if SR n is a martingale transform (“transition: martingale-
M M M ⎧
martingale”) of SR I n ( SR n ∈ℑn , but SR n ∉ ⎨ℑn− 1 ∩ SR I

M n− 1⎫
k k =1
⎬), then

{ }
this implies that the information in SRn about a change having or not having
M
taken place is the same as that contained in SR I n , which in turn is the same
as the information contained in the initial sequence SRnI . { }

( )
Proposition 4.5.1. Let SRnI , ℑn be a nonnegative P∞-submartingale
( )
SR0I = 0 . Then, under P∞-regime,

( )
1. SRn = SRnI Wn is a nonnegative submartingale with respect to ℑn;
M
M
2. SR n is a martingale transform of SR I n
, where Wn is defined
by (4.5).
Martingale Type Statistics and Their Applications 117

Proof. By applying the definition of Wn ∈ℑn−1, we obtain E∞ (SRn |ℑn−1) = Wn−1


( ) ( )
SRnI−1 + 1 ≥ SRn−1 hence SRn = SRnI Wn is a nonnegative submartingale with
respect to ℑn. It now follows from Lemma 4.5.1 that

SR n = SRn − SR n = SRn − n.
M P

Therefore,

SR n − SR n−1 = SRn − SRn−1 − 1 = SRnI Wn − SRnI−1 Wn−1 − 1,


M M
( ) ( )
( ) { ( )}
where SRnI− 1 Wn− 1 = E∞ SRnI |ℑn− 1 Wn − 1 by the definition of Wn . Conse-
quently,

{
SR n − SR n−1 = SRnI − E∞ SRnI |ℑ n−1 Wn .
M M
( )}
On the other hand, by Lemma 4.5.1 one can show that

SR I M
n
− SR I M
n−1
= SR − I
n ∑{E (SR |ℑ ∞
I
k ) − SR } − SR
k −1
I
k −1
I
n−1
k

n−1

− ∑{E (SR |ℑ ∞
I
k k −1 ) − SR } I
k −1
k

= SRnI − E∞ SRnI |ℑ n−1 .( )


Since Wn ∈ℑn− 1 , by definition (4.6) the proof of Proposition 4.5.1 is now com-
plete.
Consider an inverse version of Proposition 4.5.1.

Proposition 4.5.2. Assume ν = ∞ and an ∈ℑn − 1 exists, such that

SR '
M
n
= SR '
M
n−1
+ an SR I ( M

n
− SR I
M

n−1 ),
where SR 'n ∈ℑn is a submartingale with respect to ℑn and E∞ (SR 'n |ℑ n−1) =
SR 'n−1 + 1. Then SR 'n = (SRnI) Wn + ς n, where Wn ∈ℑn− 1 is defined by (4.5),
(ς n , ℑ n) is a martingale with zero expectation, and ς n is a martingale trans-
M
form of SR I n
.
118 Statistics in the Health Sciences: Theory, Applications, and Computing

( )
Proof. We have SR 'n = SRnI Wn + ς n, where ς n = SR 'n − SRnI Wn ∈ℑn. Consider ( )
the conditional expectation

(( )
E∞ (SR 'n |ℑ n−1) = E∞ SRnI Wn |ℑ n−1 + E∞ (ς n |ℑ n−1) )
( )
= SRnI−1 Wn−1 + 1 + E∞ (ς n |ℑ n−1)
= SR 'n−1 − ς n−1 + 1 + E∞ (ς n |ℑ n−1) .

On the other hand, by the statement of this proposition, E∞ (SR 'n |ℑ n−1) =
SR 'n−1 + 1. Therefore, we conclude with E∞ (ς n |ℑn− 1) = ς n− 1.
From the basic definitions and Lemma 4.5.1 (in a similar manner to the
proof of Proposition 4.5.1) it follows that

SR '
M
n
− SR '
M
n−1 ( ) ) (
= SRnI Wn + ς n − SRnI − 1 Wn − 1 − ς n − 1 − SR ' n + SR '
P P
n−1

= (SR ) W
I
n n + ς − W E (SR |ℑ ) + 1 − ς
n n ∞
I
n n−1 n−1 − E∞ (SR 'n − SR 'n − 1 |ℑ n − 1)

= ς n − ς n−1 + {SR − E (SR |ℑ )} W


I
n ∞
I
n n−1 n

= ς n − ς n − 1 + SR{ I M
n
− SR I M
}W .
n−1
n

Hence

ς n − ς n−1 = (an − Wn) SR I { } , (a − W ) ∈ ℑ


M M
− SR I n n n−1 .
n n−1

This completes the proof of Proposition 4.5.2.

M
Interpretation: We set out to extract the martingale component SR I n
of
I M
SR and to get a martingale transform SR = SRn − n of SR n such that
I M
n n
SRn − n is a P∞-martingale with zero expectation. However, if we are given
any SR 'n with the property that SR 'n − n is a P∞-martingale with zero expec-
tation, then Proposition 4.5.2 states that SR 'n = SRn + ς n where ς n is a martin-
M
gale transform of SR I n . Therefore, note that the information in SRn and in
SR 'n is the same. In other words, from an information point of view, our
procedure is equivalent to any other procedure based on a sequence SRnI { }
that has the same martingale properties as {SRn}.
An obvious property of our procedure is:

Corollary 4.5.1. Let N (C) = min {n ≥ 1 : SRn ≥ C} , where SRn is defined by


(4.5). Then E∞ {N (C)} ≥ C.
Returning to the example of the previous section, we obtain (from
Equation (4.4)) by a standard calculation:
Martingale Type Statistics and Their Applications 119

∑ exp {(n − k + 1) (θˆ ) /2} + 1,


n− 1
2
SR = I
n k ,n
k =1

{ }
n− 1

∑ ⎛⎜⎝ n n− −k +k 1⎞⎟⎠ ( )
1/2

( )
2
E∞ SR |ℑn− 1 =
I
n exp (n − k ) θˆ k , n− 1 /2 + 1
k =1

and {Wn} , {SRn} are obtained from (4.5).


Additional examples:
1. Let the prechange and postchange distributions of the independent
observations Y1 , Y2 ,... be respectively the standard exponential (i.e.,
Y1 ,..., Yν− 1 ~ f 0 (u) = exp(− u) ) and the exponential with an unknown parameter


n
θ > 0 (i.e., Yν , Yν+ 1 ,... ~ f1 (u; θ) = θ exp(−θu)). Therefore, θˆ k ,n = (n − k + 1)/ Yj is
j= k
the maximum likelihood estimator and the initial Shiryayev–Roberts
statistic is
n−1 n−1
⎪⎧ ⎞ ⎪⎫
n

∑∏ ∑( )
f1 (Yi ; θˆ k , n ) ⎛
( )
n− k +1 −1
SRnI = +1= θˆ k , n exp ⎨(n − k + 1) ⎜ θˆ k , n − 1⎟ ⎬ + 1.
k =1 i = k
f 0 (Yi )
k =1
⎩⎪ ⎝ ⎠ ⎭⎪

Since
⎛ n ⎞
E∞ ⎜
⎜ ∏
⎝ i= k
f1 (Yi ; θˆ k , n )
f0 (Yi )
|ℑ n−1⎟


( )
∫⎛
−1
ˆ
n − k + 1 ( n − k ) θk , n−1 −( n − k + 1) eu
= (n − k + 1) e e − u du
−1⎞ n − k + 1
0
⎝ (
⎜u + (n − k ) θˆ k , n−1 ⎟
⎠ )
{ }
n− k +1
⎛ n − k + 1⎞
(θˆ ) ( )
n− k −1
=⎜ ⎟ k , n−1 exp (n − k ) θˆ k , n−1 − (n − k + 1) ,
⎝ n− k ⎠

we have

{ }
n− 1 n− k + 1

( ) ∑ ⎛⎜⎝ n n− −k +k 1⎞⎟⎠ (θˆ ) ( )


n− k −1
E∞ SRnI |ℑn − 1 = k , n− 1 exp (n − k ) θˆ k , n − 1 − (n − k + 1) + 1
k =1

and {Wn} , {SRn} are obtained from (4.5).


2. Consider the simple segmented linear regression model

Yi = θxi I (i ≥ ν) + εi , i ≥ 1. (4.7)
where θ is the unknown regression parameter, xi , i ≥ 1, are fixed predictors,
ε i , i ≥ 1, are independent random disturbance terms with standard normal
120 Statistics in the Health Sciences: Theory, Applications, and Computing

density and I(⋅) is the indicator function. In this case, the maximum likeli-
hood estimator of the unknown parameter is


n
Yj x j
j= k
θˆ k , n = . (4.8)

n
x 2j
j= k

Since the conditional expectation of the estimator of the likelihood ratio is

⎧⎛ ⎞ 2⎫
{( } ∑
⎡ ⎤ ⎛ ⎞ 1/2

n−1

) ⎪⎜ Yj x j⎟⎟ ⎪
2 n
⎢ exp − Yi − θˆ k , n xi / 2 ⎥ ⎜ x 2j ⎟ ⎪⎪ ⎜⎝ ⎠ ⎪⎪
n

E∞ ⎢⎢ ∏ { }

|ℑ n − 1⎥⎥ = ⎜
j=k ⎟
⎟ exp ⎨
j=k
⎬,
∑ ∑
n−1 n−1
exp − (Yi) / 2 ⎪ 2 x 2j ⎪⎪
2
⎢ i=k ⎥ ⎜⎜ x 2j ⎟⎟ ⎪
⎢⎣ ⎥⎦ ⎝ j=k ⎠ ⎪⎩ j=k
⎪⎭

the transformed detection scheme for the model (4.7) has form (4.5), where

⎧⎛ ⎞ 2⎫

n
⎪⎜ Yj x j⎟⎟ ⎪
n−1
⎪⎪ ⎜⎝ ⎠ ⎪⎪
SRn =
I
∑ exp ⎨
j=k
⎬ + 1,
⎪ 2
∑ x 2j ⎪⎪
n
k =1
⎪ j=k
⎪⎩ ⎪⎭

⎧⎛
(4.9)
⎞ 2⎫


⎞ 1/2

n−1
⎪⎜ Yj x j ⎟⎟ ⎪
n
n−1 ⎜ x ⎟
2
⎪⎪ ⎜⎝ ⎠ ⎪⎪

j
⎜ ⎟
( ) j=k j=k
E∞ SRnI |ℑ n − 1 = ⎜ ⎟ exp ⎨ ⎬ + 1.
∑ ∑
n−1 n−1
k =1 ⎜ x 2j ⎟⎟ ⎪ 2 x 2j ⎪⎪
⎜ ⎪
⎝ j=k ⎠ ⎪⎩ j=k
⎪⎭

Note that, if the parameters of a segmented linear regression similar to (4.7),


under regime P∞ are unknown (e.g., the intercept, under P∞ , is unknown),
then the observations can be invariantly transformed into the case consid-
ered in this example (e.g., Krieger et al., 2003; Vexler, 2006).

4.5.3 CUSUM versus Shiryayev–Roberts


It is well accepted that the change point detection procedures based on
Shiryayev–Roberts statistics and the schemes founded on CUSUM statistics
have almost equivalent optimal statistical properties (e.g., Krieger et al., 2003:
Section 1). Here we establish an interesting link between the CUSUM and
Shiryayev–Roberts detection policies.
We assume that prior to a change the observations Y1 , Y2 ,... are iid with
density f0. Post-change they are iid with density f1 . The simple CUSUM pro-
cedure stops at
Martingale Type Statistics and Their Applications 121

∏ f (Y )
f1 (Yi )
M(C) = min {n ≥ 1 : Λ n ≥ C} , where Λ n = max
1≤ k ≤ n 0 i
i= k

and the simple Shiryayev–Roberts procedure stops at

n n

N (C) = min {n ≥ 1 : SRn ≥ C} , SRn = ∑∏ ff ((YY )). 1

0
i

i
k =1 i = k

In accordance with the proposed methodology, we present the following


proposition.

M
Proposition 4.5.3. The P∞-martingale component SR n of the Shiryayev-
Roberts statistic SRn is a martingale transform of Λ n , which is the
M

P∞-martingale component of the CUSUM statistic Λ n (and vice versa).

Proof. We have

⎧ n n ⎫

Λ n = max ⎨ ∏
f1 (Yi )
, ∏
f1 (Yi )
⎪⎩ i =1 f0 (Yi ) i = 2 f0 (Yi )
f (Y )⎪
,..., 1 n ⎬
f0 (Yn )⎪⎭
⎡ ⎧ n n n ⎫ ⎤
= max ⎢⎢ max ⎨
⎢⎣


f1 (Yi )
, ∏
f1 (Yi )
⎪⎩ i =1 f0 (Yi ) i = 2 f0 (Yi )
,..., ∏
f1 (Yi )⎪ f1 (Yn )⎥
⎬ ,
f0 (Yi )⎪⎭ f0 (Yn )⎥⎥
i = n−1 ⎦ (4.10)
⎡ ⎧ n−1 n−1 n−1 ⎫ ⎤
=
f1 (Yn )
f0 (Yn )
⎢ ⎪
max ⎢ max ⎨
⎪⎩
∏ f1 (Yi )
f0 (Yi )
, ∏ f1 (Yi )
f0 (Yi )
,..., ∏ f1 (Yi )⎪ ⎥
⎬ ,1
f0 (Yi )⎪ ⎥⎥
⎣⎢ i =1 i=2 i = n−1 ⎭ ⎦
f (Y )
= 1 n max (Λ n ,1) .
f0 (Yn )

Since E∞ { f1 (Yn )/f0 (Yn )|ℑn− 1} = 1 and Λ n− 1 ∈ℑn− 1, Λ n is a P∞-submartingale


with respect to ℑn. By applying Lemma 4.5.1, we obtain the P∞-martingale
component of Λ n in the form
n

Λ
M
n
= Λn − ∑{max (Λ k −1 ,1) − Λ k −1}.
k

Hence, by (4.10)
Λ − Λ = Λ n − Λ n−1 − max (Λ n−1 ,1) + Λ n−1
M M
n n−1

⎧⎪ f (Y ) ⎫⎪
= max (Λ n−1 ,1) ⎨ 1 n − 1⎬ .
⎩⎪ f0 (Yn ) ⎭⎪
122 Statistics in the Health Sciences: Theory, Applications, and Computing

On the other hand, it is clear that SRn is a P∞-submartingale with respect to


ℑn and

f1 (Yn )
SR n − SR n−1 = SRn − n − SRn−1 + (n − 1) = (SRn−1 + 1) − (SRn−1 + 1)
M M

f0 (Yn )

⎪⎧ f (Y ) ⎪⎫
= (SRn−1 + 1) ⎨ 1 n − 1⎬ .
⎩⎪ f0 (Yn ) ⎭⎪

By virtue of an = max (Λ n− 1 , 1) / (SRn− 1 + 1) ∈ ℑ n− 1 (or an = (SRn−1 + 1) / max (Λ n−1 , 1) ∈ ℑ n−1),


the proof of Proposition 4.5.3 now follows from definition (4.6).
Thus, following our methodology, we can transform the CUSUM method
into a martingale-based procedure (i.e., Equation (4.5) with SR 'n = Λ n). Here
Λ n−1Wn−1 + 1
SR 'n = Λ nWn , Wn = , W0 = 0, Λ 0 = 0. (4.11)
max (Λ n−1 ,1)
where max (Λ n− 1 , 1) = E∞ (Λ n |ℑn− 1) . Certainly, in a similar manner to
Corollary 4.5.1, the stopping time N '(C) = min {n ≥ 1 : SR 'n ≥ C} satisfies
E∞ {N '(C)} ≥ C. Moreover, we have the following proposition.

Proposition 4.5.4. The classical Shiryayev–Roberts statistic SRn is the


transformed CUSUM statistic (4.11) (and vice versa).

Proof. By substituting (4.10) in (4.11), we obtain

SR 'n =
(Λ n−1Wn−1 + 1) Λ
max (Λ n−1 ,1)
n

(Λ n−1Wn−1 + 1) f1 (Yn ) max Λ ,1


= ( n−1 )
max (Λ n−1 ,1) f0 (Yn )
f1 (Yn )
= (SR 'n + 1) , SR '0 = 0.
f0 (Yn )
Since

f1 (Yn )
SRn = (SRn + 1) , SR0 = 0,
f0 (Yn )

the proof of Proposition 4.5.4 is now complete.


What this means is that the CUSUM scheme and the Shiryayev–Roberts
policy are based on the same information. The difference between them
stems from their different predictable components: that of Shiryayev–Roberts

{max (Λ k −1 , 1) − Λ k −1}. In other words,
n
is n whereas that of CUSUM is
k =1
CUSUM makes use of superfluous information contained in the observations.
Martingale Type Statistics and Their Applications 123

∑ {max (Λ k −1 , 1) − Λ k −1} is increasing prior to a change, shorten-


n
Note that
k =1
ing the ARL to false alarm, but its increase is approximately zero postchange,
so that it does not contribute toward shortening the average delay to detec-
tion. This explains why Shiryayev–Roberts is somewhat better than CUSUM.

4.5.4 A Nonparametric Example*


Consider the sequential change point problem where nothing is known
about both the prechange and postchange distributions other than that they
are continuous. In this case, an obvious approach would be to base surveil-
lance on the sequence of sequential ranks of the observations. Several proce-
dures have been constructed for this problem, e.g., Gordon and Pollak (1995).
Here we consider a problem of a slightly different nature. Suppose one feels
that the prechange observations may be N 0, σ 2 and the postchange obser- ( )
( )
vations N θσ , σ 2 (θ, σ are possibly unknown), but is not sure of this. If a
nonparametric method could be constructed in a way that will be efficient if
( )
the observations were truly N 0, σ 2 and N θσ , σ 2 as above, one may well ( )
prefer the nonparametric scheme, as it promises validity of ARL to false
alarm even if the normality assumption is violated. It is this problem that we
address here.
Formally, let Y1 , Y2 ,... be independent random variables. When the process is
in control, {Yi , i ≥ 1} are iid. If the process goes out of control at time ν, then
{Yi , i ≥ ν} are iid and have a distribution deferent from the in-control state. Define
n

rj , n = ∑ I (Y ≤ Y ), ρ
i j n = rn, n /(n + 1), Zn = Φ−1 (ρn ), ℑ n = σ (ρ1 ,..., ρn) ,
i =1

where Φ −1 is the inverse of the standard normal distribution function. Note


that the in-control distribution of the sequence {ρi , i ≥ 1} does not depend on
the distribution of the Y’s as long as it is continuous, and ρ1 ,..., ρn are inde-
pendent. Clearly, under P∞, ρn converges in distribution to U[0,1] and there-
fore Zn → N (0,1), as n→∞. Fix θ ≠ 0 and define recursively
n

∑e ∑
n

( )
2 θ Zi −θ2 ( n − k + 1)/2
SRθI , n = e θZn −θ /2
SRθI , n−1 + 1 = i= k
, SRθI ,0 = 0. (4.12)
k =1

(Note that if Z1 , Z2 ,... were exactly standard normal, then (4.12) is the same as
(4.3).) Applying our methodology, letting Wθ ,0 = 0, define recursively

* This subsection was prepared by Professor Moshe Pollak.


124 Statistics in the Health Sciences: Theory, Applications, and Computing

( )
SRθ ,n = SRθI ,n Wθ ,n , Wθ ,n =
(SR )W + 1 I
θ , n− 1 θ , n− 1

E (SR |ℑ ) ∞
I
θ,n n− 1
(4.13)
=
(SR )W + 1
I
θ , n− 1 θ , n− 1
,
∑ ⎧ i ⎞ ⎫
1
exp ⎨Φ ⎛⎜ ⎟ θ⎬ (SR )
− θ2 1 n
−1
e 2 I
θ , n− 1 +1
n i=1 ⎩ ⎝ n + 1⎠ ⎭

{
N θ (C) = min n ≥ 1 : SRθ , n ≥ C }
and therefore E∞ {N θ (C)} ≥ C.

Remark 1. For the sake of clarity, the example above was developed to give a
robust detection method if one believes the observations to be approximately
normal (and one is on guard against a change of mean). This method can be
applied to other families of distributions in a similar manner.

Remark 2. Instead of fixing θ, one can fix a prior G for θ, and define the Bayes
factor type procedure

( )
SRn = SRnI Wn , SRnI = SRθI ,n dG(θ),

Wn =
(SR )W I
n− 1 n− 1 +1
, (4.14)
( 1
) ∑ ∫ ⎧
exp ⎨Φ −1 ⎛⎜
i ⎞ 1 2⎫
n
SRnI−1 + 1
⎝ ⎟⎠ θ − θ ⎬ dG(θ)
n i=1 ⎩ n+1 2 ⎭
SR = W0 = 0.
I
0

Note that if G is a conjugate prior (e.g., G = N (.,.)), then the integral in the
denominator of Wn can be calculated explicitly. An alternative approach to Equa-
tion (4.14) is simply to define SRn =
∫ SR θ, n dG(θ). However, the calculation of

∫ SR θ, n dG(θ) involves an arduous numerical integration (after every observation).

4.5.5 Monte Carlo Simulation Study


Proposed transformation versus application of nonanticipating estimation.
Let the simulated samples satisfy (4.7) with xi = sin(i). In accordance with
Krieger et al. (2003) and Lorden and Pollak (2005), denote the stopping rule

N^{(1)}(A) = min{ n ≥ 1 : Σ_{k=1}^{n} ∏_{i=k}^{n} exp{ −(Y_i − θ̂_{k,i−1} x_i)²/2 + Y_i²/2 } ≥ A }

based on the application of the nonanticipating estimators {θ̂_{k,i−1}}, where the
maximum likelihood estimator (4.8) defines θ̂_{k,l} (θ̂_{k,l} = 0 if l < k). Alternatively, let

N (2) ( A) = min {n ≥ 1 : SRn ≥ A} ,

where SRn is defined by (4.5) with (4.9).


We ran 15,000 repetitions of the model for each procedure and at each
point A = 50, 75, 100, 125,…. Figure 4.3 depicts the Monte Carlo averages of
N (1) ( A) and N (2) ( A) in this case, where every simulated observation is from
N (0,1) (i.e., ν = ∞ and this simulation run is independent of θ).
It appears that E∞{N^{(1)}(A)}/A behaves nonconstantly over the range of A.
Note that, in the case where x_i = 1, we have lim_{A→∞} [E∞{N^{(1)}(A)}/A] = const (see
Lorden and Pollak, 2005). Figure 4.4 corresponds to the definition of
the model (4.7), where ν = 1 and θ = 1/4, 1/2.
In accordance with Figures 4.3 and 4.4, the detecting strategy
N^{(2)}(A) tends to be a quicker detecting scheme (with estimated thresholds
A_1, A_2 such that E∞{N^{(1)}(A_1)} ≈ E∞{N^{(2)}(A_2)}). Moreover, in this experiment E_1{N^{(1)}(A)}
is not a linear function of log(A), as it usually is for parametric change point
detection schemes.

[Figure 4.3 appears here: a plot against the threshold A, 50 ≤ A ≤ 250.]

FIGURE 4.3
Comparison between the Monte Carlo estimator of E∞{N^{(1)}(A)}/A ( ) and the Monte Carlo estimator of E∞{N^{(2)}(A)}/A (—∘—). The curve (--∘--) images the Monte Carlo estimator of E∞{N^{(1)}(A)}/E∞{N^{(2)}(A)}.

Modification of CUSUM. Next, we investigate one conclusion of Section 4.5.3


regarding the CUSUM detection scheme. In particular, Section 4.5.3 suggests
that the stopping rule

N′(A) = min{ n ≥ 0 : Λ^M_n = Λ_n − Σ_{k=1}^{n} {max(Λ_{k−1}, 1) − Λ_{k−1}} ≥ A }

is somewhat better than the classical CUSUM stopping time


M(A) = min{n ≥ 1 : Λ_n ≥ A}, where Λ_n = max_{1≤k≤n} ∏_{i=k}^{n} f_1(Y_i)/f_0(Y_i).
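As a concrete illustration of these two stopping times, the following R sketch computes the classical CUSUM statistic via its usual recursion, Λ_n = max(Λ_{n−1}, 1) f_1(Y_n)/f_0(Y_n), together with the modified statistic Λ^M_n, for the normal mean shift f_0 = N(0, 1), f_1 = N(0.5, 1) used in the simulations below; the threshold and stream length are illustrative assumptions.

# Sketch comparing the classical CUSUM statistic and its modification; illustrative settings.
set.seed(1)
theta1 <- 0.5; A <- 100; nu <- 50; n.max <- 1000   # illustrative choices
Y <- c(rnorm(nu - 1, 0, 1), rnorm(n.max - nu + 1, theta1, 1))

LR <- function(y) exp(theta1 * y - theta1^2 / 2)   # f1(y)/f0(y) for N(theta1,1) vs N(0,1)

Lambda <- 0; correction <- 0
M.stop <- Nprime.stop <- NA
for (n in 1:n.max) {
  correction <- correction + (max(Lambda, 1) - Lambda)   # accumulates max(Lambda_{k-1},1) - Lambda_{k-1}
  Lambda <- max(Lambda, 1) * LR(Y[n])                    # Lambda_n = max(Lambda_{n-1},1) f1(Y_n)/f0(Y_n)
  Lambda.M <- Lambda - correction                        # modified statistic Lambda_n^M
  if (is.na(M.stop)      && Lambda   >= A) M.stop      <- n
  if (is.na(Nprime.stop) && Lambda.M >= A) Nprime.stop <- n
  if (!is.na(M.stop) && !is.na(Nprime.stop)) break
}
c(M = M.stop, N.prime = Nprime.stop)   # stopping times M(A) and N'(A) for this stream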

[Figure 4.4 appears here: two panels, (a) for 50 ≤ A ≤ 125 and (b) for 50 ≤ A ≤ 250, plotted against the threshold A.]

FIGURE 4.4
Comparison between the Monte Carlo estimator of E_1{N^{(1)}(A)}/(5 log(A)) ( ) and the Monte Carlo estimator of E_1{N^{(2)}(A)}/(5 log(A)) (—∘—). The curve (- - - -) images the Monte Carlo estimator of E_1{N^{(1)}(A)}/E_1{N^{(2)}(A)}. The graphs (a) and (b) correspond to the model (4.7), where θ = 1/4, 1/2, respectively.

We executed 15,000 repetitions of independent observations
Y_1, ..., Y_{ν−1} ~ f_0 = N(0, 1); Y_ν, ... ~ f_1 = N(θ = 0.5, 1) for each Monte Carlo simulation. In the case of ν = ∞ and A = 50, 100, 150, 200, 250, the Monte Carlo
estimator of E∞{N′(A)}/A behaves approximately constantly and we evaluate E∞{N′(A)}/A ≈ 768.2. At that rate, we can conclude with
E∞{M(A)}/A ≈ 13.8. Figure 4.5 illustrates the accuracy of the considered conclusion of Section 4.5.3.

[Figure 4.5 appears here: a plot against the threshold A, 100 ≤ A ≤ 500.]

FIGURE 4.5
The curved lines (—∘—) and (——) denote the Monte Carlo estimators of E_50{N′(A)} and E_50{M(A)}, respectively.

[Figure 4.6 appears here: two panels, B = 350 and B = 550, plotted against the change point ν.]

FIGURE 4.6
The curved lines (—∘—) and (⋯⋅) denote the Monte Carlo estimators of L(N_{θ=0.5}(B/1.3301), ν) and L(N_o(B/1.3316), ν), respectively, plotted against ν.

Nonparametric procedure. Here, based on simulations of the process
{Y_1, ..., Y_{ν−1} ~ N(0, 1); Y_ν, ... ~ f_1 = N(θ = 0.5, 1)}, we evaluate the efficiency of the
policy (4.13). We employ the loss function L(N(A), ν) = E_ν{N(A) − ν | N(A) ≥ ν}
regarding the stopping time N_θ(A) from (4.13) and the optimal stopping time
N_o(A) = min{n ≥ 1 : SR^I_n ≥ A}, where SR^I_n is given by (4.3). Applying the theoretical results by Pollak (1987) and Yakir (1995) leads to the estimated
E∞{N_o(A)} ≈ 1.3316 A. Running 15,000 repetitions of N_θ(A) at each point A = 50,
75, 100, …, 550 (with ν = ∞) shows that the Monte Carlo estimator of E∞{N_θ(A)}/A
behaves approximately constantly and E∞{N_θ(A)}/A ≈ 1.3301.
Fixing B = 350 and B = 550, we calculate the Monte Carlo estimators of
L (N o (B/1.3316), ν) and L (N θ (B/1.3301), ν) based on 15,000 repetitions of
N o (B/1.3316) and N θ (B/1.3301) at each point ν = 30, 40, 50,…. We depict the
results of these simulations in Figure 4.6.
From these results we conclude that the proposed nonparametric method
is asymptotically (as A → ∞ and ν → ∞) optimal in the considered case.
5
Bayes Factor

5.1 Introduction
Over the years statistical science has developed two schools of thought as it
applies to statistical inference. One school is termed frequentist methodol-
ogy and the other school of thought is termed Bayesian methodology. There
has been a decades-long debate between both schools of thought on the opti-
mality and appropriateness of each other’s approaches. The fundamental
difference between the two schools of thought is philosophical in nature
regarding the definition of probability, which ultimately one could argue is
axiomatic in nature.
The frequentist approach denotes probability in terms of long run aver-
ages, e.g., the probability of getting heads when flipping a coin infinitely
many times would be ½ given a fair coin. Inference within the frequentist
context is based on the assumption that a given experiment is one realization
of an experiment that could be repeated infinitely many times, e.g., the notion
of the Type I error rate discussed earlier is that if we were to repeat the same
experiment infinitely many times we would falsely reject the null hypothesis
a fixed percentage of the time.
The Bayesian approach defines probability as a measure of an individu-
al’s objective view of scientific truth based on his or her current state of
knowledge. The state of knowledge, given as a probability measure, is then
updated through new observations. The Bayesian approach to inference is
appealing as it treats uncertainty probabilistically through conditional
distributions. In this framework, it is assumed that a priori information
regarding the parameters of interest is available in order to weight the cor-
responding likelihood functions toward the prior beliefs via the use of Bayes
theorem.
Although Bayesian procedures are increasingly popular, expressing accu-
rate prior information can be difficult, particularly for nonstatisticians and/
or in a multidimensional setting. In practice one may be tempted to use the
underlying data in order to help determine the prior through estimation of
the prior information. This mixed approach is usually referred to as empiri-
cal Bayes in the literature (e.g., Carlin and Louis, 2011).


We take the approach that the Bayesian, empirical Bayesian, and frequen-
tist methods are useful and that in general there should not be much dis-
agreement between the scientific conclusions that are drawn in the practical
setting. Our ultimate goal is to provide a roadmap for tools that are efficient
in a given setting. In general, Bayesian approaches are more computationally
intensive than frequentist approaches. However, due to the advances in com-
puting power, Bayesian methods have emerged as an increasingly effective
and practical alternative to the corresponding frequentist methods. Perhaps,
empirical Bayes methods somewhat bridge the gap between pure Bayesian
and frequentist approaches.
This chapter provides a basic introduction to the Bayesian view on statisti-
cal testing strategies with a focus on Bayes factor–type principles.
As an introduction to the Bayesian approach, in contrast to frequentist meth-
ods, first consider the simple hypothesis test H 0 : θ = θ0 versus H 1 : θ = θ1, where
the parameters θ0 and θ1 are known. Given an observed random sample
X = {X 1 ,..., X n }, we can then construct the likelihood ratio test statistic
LR = f(X|θ_1)/f(X|θ_0) for the purpose of determining which hypothesis is
more probable. (Although, in previous chapters, we used the notation f (X ; θ)
for displaying density functions, in this chapter we also state f (X |θ) as the
density notation in order to be consistent with the Bayesian aspect that — “the
data distribution is conditional on the parameter, where the parameter is a
random variable with a prior distribution.”) The decision-making procedure
is to reject H 0 for large values of LR. In this case, the decision-making rule
based on the likelihood ratio test is uniformly most powerful. Although this
classical hypothesis testing approach has a long and celebrated history in the
statistical literature and continues to be a favorite of practitioners, it can be
applied straightforwardly only in the case of simple null and alternative
hypotheses, i.e., when the parameter under the alternative hypothesis, θ1, is
known.
Various practical hypothesis testing problems involve scenarios where the
parameter under the alternative θ1 is unknown, e.g., testing the composite
hypothesis H 0 : θ = θ0 versus H 1 : θ ≠ θ0. In general, when the alternative
parameter is unknown, the parametric likelihood ratio test is not applicable,
since it is not well defined. When a hypothesis is composite, Neyman and
Pearson suggested replacing the density at a single parameter value with the
maximum of the density over all parameters in that hypothesis. As a result
of this groundbreaking theory the maximum likelihood ratio test became a
key method in statistical inference (see Chapter 3).
It should be noted that several criticisms pertaining to the maximum like-
lihood ratio test may be made. First, in a decision-making process, we do not
generally need to estimate the unknown parameters under H 0 or H 1. We just
need to make a binary decision regarding whether to reject or not to reject
the null hypothesis. Another criticism is that in practice we rely on the func-
tion f (X |θˆ ) in place of f (X |θ), which technically does not yield a likelihood
function, i.e., it is not a proper density function. Therefore, the maximum

likelihood ratio test may lose efficiency as compared to the likelihood ratio
test of interest (see details in Chapters 3 and 4).
Alternatively, one can provide testing procedures by substituting the
unknown alternative parameter by variables that do not depend on the
observed data (see details in Chapter 4). Such approaches can be extended
to provide test procedures within a Bayes factor type framework. We can
integrate test statistics through variables that represent the unknown
parameters with respect to a function commonly called the prior distribu-
tion. This approach can be generalized to more complicated hypotheses and
models.
A formal Bayesian concept for analyzing the collected data X involves the
specification of a prior distribution for all parameters of interest, say θ,
denoted by π(θ), and a distribution for X, denoted by Pr(X|θ). Then the pos-
terior distribution of θ given the data X is

Pr(θ|X) = Pr(X|θ)π(θ) / ∫ Pr(X|θ)π(θ) dθ,

which provides a basis for performing formal Bayesian analysis. Note that
maximum likelihood estimation (Chapter 3) in light of the posterior distri-
bution function, where Pr(X |θ) is expressed using the likelihood function
based on X, has a simple interpretation as the mode of the posterior distribu-
tion, representing a value of the parameter θ that is “most likely” to have
produced the data.
Consider an interesting example to illustrate advantages of the Bayesian
methodology. Assume we observe a sample X = {X 1 ,..., X n } of iid data points
from a density function f (x|θ) , where θ is the parameter of interest. In order
to define the posterior density


g(θ|X) = ∏_{i=1}^{n} f(X_i|θ) π(θ) / ∫ ∏_{i=1}^{n} f(X_i|θ) π(θ) dθ,

a parametric form of f (x|θ) should be known. Let us relax this requirement


by assuming that the distribution function F of X 1 is unknown, but
F ∈{F1 ,..., FT }, where F1 ,..., FT have known parametric forms and correspond
to density functions f1 ,..., fT , respectively. Then the posterior joint density
function can be written as

g(θ, k|X) = ∏_{i=1}^{n} f_k(X_i|θ) π(θ) G_k / Σ_{k=1}^{T} G_k ∫ ∏_{i=1}^{n} f_k(X_i|θ) π(θ) dθ,

where Gk denotes a prior probability weight regarding the event F = Fk .


Suppose that, for simplicity, F ∈ {Exp(1/θ), N(θ, 1)} and then using the poste-
rior density function we can calculate the posterior probabilities p1, p2 , p3 ,
and p4 of {θ ∈(0.1, 0.2) , k = 1}, {θ ∈(0.1, 0.2) , k = 2}, {θ ∈(0.2, 0.25) , k = 1} and
{θ ∈(0.2, 0.25) , k = 2}, respectively. Assume that p1 > p2 and p3 < p4. Thus one
can conclude that when θ ∈(0.1, 0.2) we probably have the data from an expo-
nential distribution, whereas when θ ∈(0.2, 0.25) we probably have the data
from a normal distribution. For different intervals of θ values we can “pre-
fer” different forms of data distributions. In contrast, the common paramet-
ric frequentist approach suggests fixing one distribution function, fitting the
data distribution, and evaluating different values of θ.
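A minimal R sketch of this calculation is given below; the prior π(θ) (uniform on (0.01, 1)), the equal model weights G_1 = G_2 = 1/2, and the simulated data are illustrative assumptions used only to show how the posterior probabilities p_1, ..., p_4 can be evaluated numerically.

# Sketch: posterior probabilities of (theta-interval, model) events for F in {Exp(1/theta), N(theta,1)}.
set.seed(123)
X <- rexp(30, rate = 1/0.15)          # illustrative data, generated from Exp(1/theta) with theta = 0.15

f1 <- function(x, theta) dexp(x, rate = 1/theta)   # model k = 1: Exp(1/theta), mean theta
f2 <- function(x, theta) dnorm(x, mean = theta)    # model k = 2: N(theta, 1)
prior <- function(theta) dunif(theta, 0.01, 1)     # illustrative prior pi(theta)
G     <- c(0.5, 0.5)                               # illustrative prior model weights G_1, G_2

# non-normalized posterior: prod_i f_k(X_i | theta) * pi(theta) * G_k, vectorized over theta
h <- function(theta, k) {
  f <- if (k == 1) f1 else f2
  sapply(theta, function(t) prod(f(X, t)) * prior(t)) * G[k]
}
norm.const <- integrate(h, 0.01, 1, k = 1)$value + integrate(h, 0.01, 1, k = 2)$value

post.prob <- function(lo, hi, k) integrate(h, lo, hi, k = k)$value / norm.const
p1 <- post.prob(0.10, 0.20, 1); p2 <- post.prob(0.10, 0.20, 2)
p3 <- post.prob(0.20, 0.25, 1); p4 <- post.prob(0.20, 0.25, 2)
round(c(p1 = p1, p2 = p2, p3 = p3, p4 = p4), 4)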
We suggest that the reader who is interested in further details regarding
Bayesian theory and its applications consult the essential works of Berger
(1985) and Carlin and Louis (2011).
In the following sections, we describe and explore topics regarding the
use of Bayes factor principles. Under Bayes factor type statistical
decision-making mechanisms, external information is incorporated into
the evaluation of evidence about a hypothesis and functions that represent
possible parameter values under the alternative hypothesis are considered.
These functions can be interpreted in light of our current state of knowl-
edge, such as belief where we would like to be most powerful with regard
to the external, or prior information on the parameter of interest under the
alternative hypothesis. For example, a physician may expect the median
survival rate for a new compound to fall in a certain range with a given
degree of probability. Without loss of generality, we focus on basic scenar-
ios of testing that can be easily extended to more complicated situations of
statistical decision-making procedures. In Section 5.2 we show an optimal-
ity property relative to using Bayes factor statistics for testing statistical
hypotheses. In Section 5.3 we introduce Bayes factor principles in general
scenarios, including considerations of the relevant computational aspects,
asymptotic approximations, prior selection issues and we provide brief
illustrations. In Section 5.4 we provide some remarks regarding a relation-
ship between Bayesian (Bayes factor) and frequentist (p-values) strategies
for statistical testing.

5.2 Integrated Most Powerful Tests


In Section 4.4.1.2 we studied the simple representative method, which can be
easily extended by integrating test statistics through variables that represent
unknown parameters with respect to a function that links the corresponding

weights to the variables. This approach can be applied to develop Bayes


factor type decision-making mechanisms that will be described formally
below.
Without loss of generality, assume for the purpose of explanation that we
have a sample X = {X 1 ,..., X n } and are interested in the composite hypothesis
H 0 : θ = θ0 versus H 1 : θ ≠ θ0. Frequentist statistics are operationalized solely
by the use of sample information (the data obtained from the statistical
investigation) in terms of making inferences about θ, without in general
employing prior knowledge of the parameter space (i.e., the set of all possible
values of the unknown parameter). On the other hand, in decision-making
processes attempts are made to combine the sample information with other
relevant facets of the problem in order to make an optimal decision.
Typically, two sorts of external information are relevant to Bayesian
methodologies. The first source of information is a set of rules regarding
the possible cost of any given decision. Usually this information, referred
to as the loss function, can be quantified by determining the incurred loss
for each possible decision as a function of θ. The second source of informa-
tion can be termed prior information, that is, prior belief (weights) on the
possible values of θ being true under the alternative hypothesis across the
parameter space. This is information about the parameter of interest, θ,
that is obtained from sources other than the statistical investigation, e.g.,
information comes from past experience about similar situations involving
a similar parameter of interest; see Berger (1980) for more details in terms
of eliciting prior information. In the clinical trial setting, one may for
example gain insight into the behavior of the process in the early phases of
clinical investigations and can then apply this information to the later
phases of investigation.
More formally, let π(θ) represent our level of belief (probability based
weights) for the likely values of θ under the alternative hypothesis satisfying

∫ π(θ) dθ = 1. One can extend the technique of Section 4.4.1.2 in order to pro-
pose the test statistic

B_n = ∫ f_{H1}(X_1, ..., X_n|θ) π(θ) dθ / f_{H0}(X_1, ..., X_n|θ_0),

where f H1 , f H0 denote the joint density functions of {X 1 ,… , X n} under H 1 and


H 0 respectively. The decision rule is to reject the null hypothesis for large
values of the test statistic.
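As a small illustration, the following R sketch evaluates B_n for testing H_0: θ = 0 versus H_1: θ ≠ 0 based on a normal sample with known variance; the N(0.5, 1) prior π(θ) and the data-generating values are illustrative assumptions, and the conjugate closed form is included only as a numerical check.

# Sketch: Bayes factor type statistic Bn for H0: theta = 0 vs H1: theta != 0, N(theta, 1) data.
set.seed(7)
n <- 25
X <- rnorm(n, mean = 0.4, sd = 1)      # illustrative data
theta0 <- 0

loglik <- function(theta) sum(dnorm(X, mean = theta, sd = 1, log = TRUE))
prior  <- function(theta) dnorm(theta, mean = 0.5, sd = 1)   # illustrative pi(theta)

# integrate the likelihood ratio against the prior (more stable than integrating the raw likelihood)
integrand <- function(th) sapply(th, function(t) exp(loglik(t) - loglik(theta0))) * prior(th)
Bn <- integrate(integrand, lower = -Inf, upper = Inf)$value

# closed-form check for the conjugate normal prior N(m, tau^2) with sigma = 1
m <- 0.5; tau <- 1; xbar <- mean(X)
a <- n + 1/tau^2; b <- n*xbar + m/tau^2
Bn.closed <- exp(b^2/(2*a) - m^2/(2*tau^2)) / (tau*sqrt(a))
c(numerical = Bn, closed.form = Bn.closed)   # large values support H1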
It turns out that (Bn , ℑn = σ (X 1 ,..., X n)) is an H 0 -martingale, since

E_{H0}(B_{n+1}|ℑ_n) = ∫ E_{H0}{ f_{H1}(X_1, ..., X_{n+1}|θ) / f_{H0}(X_1, ..., X_{n+1}|θ_0) | ℑ_n } π(θ) dθ

= ∫ E_{H0}{ ∏_{i=1}^{n+1} f_{H1}(X_i|X_1, ..., X_{i−1}, θ) / ∏_{i=1}^{n+1} f_{H0}(X_i|X_1, ..., X_{i−1}, θ_0) | ℑ_n } π(θ) dθ

= ∫ [ ∏_{i=1}^{n} f_{H1}(X_i|X_1, ..., X_{i−1}, θ) / ∏_{i=1}^{n} f_{H0}(X_i|X_1, ..., X_{i−1}, θ_0) ] E_{H0}{ f_{H1}(X_{n+1}|X_1, ..., X_n, θ) / f_{H0}(X_{n+1}|X_1, ..., X_n, θ_0) | ℑ_n } π(θ) dθ

= ∫ [ ∏_{i=1}^{n} f_{H1}(X_i|X_1, ..., X_{i−1}, θ) / ∏_{i=1}^{n} f_{H0}(X_i|X_1, ..., X_{i−1}, θ_0) ] { ∫ [ f_{H1}(u|X_1, ..., X_n, θ) / f_{H0}(u|X_1, ..., X_n, θ_0) ] f_{H0}(u|X_1, ..., X_n, θ_0) du } π(θ) dθ

= ∫ [ ∏_{i=1}^{n} f_{H1}(X_i|X_1, ..., X_{i−1}, θ) / ∏_{i=1}^{n} f_{H0}(X_i|X_1, ..., X_{i−1}, θ_0) ] π(θ) dθ = B_n.

Thus, according to Section 4.4.1, we can anticipate that the test statistic Bn can
be optimal. This is displayed in the following result: the decision rule based
on Bn provides the integrated most powerful test with respect to the func-
tion π(θ).
Proof. Taking into account the inequality ( A − B) {I ( A ≥ B) − δ} ≥ 0, where δ
could be either 0 or 1, described in Section 3.4, we define A = Bn and B = C (C
is a test threshold: the event {Bn > C} implies one should reject H 0 ), and δ
represents a rejection rule for any test statistic based on the observations; if
δ = 1 we reject the null hypothesis. Then it follows that
EH0 {Bn I (Bn ≥ C)} − CEH0 {I (Bn ≥ C)} ≥ EH0 (Bnδ − Cδ),

EH0 {Bn I (Bn ≥ C)} − C PrH0 (Bn ≥ C) ≥ EH0 (Bnδ) − C PrH0 (δ = 1) .

Since we control the Type I error rate of the tests to be Pr {Test rejects H 0 |H 0} = α ,
we have

EH0 {Bn I (Bn ≥ C)} ≥ EH0 (Bnδ).


Thus

∫ ⋯ ∫ [ ∫ f_{H1}(u_1, ..., u_n|θ)π(θ) dθ / f_{H0}(u_1, ..., u_n|θ_0) ] f_{H0}(u_1, ..., u_n|θ_0) I(B_n ≥ C) du_1 ⋯ du_n

≥ ∫ ⋯ ∫ [ ∫ f_{H1}(u_1, ..., u_n|θ)π(θ) dθ / f_{H0}(u_1, ..., u_n|θ_0) ] f_{H0}(u_1, ..., u_n|θ_0) δ(u_1, ..., u_n) du_1 ⋯ du_n.

That is,

∫ { ∫ ⋯ ∫ f_{H1}(u_1, ..., u_n|θ) I(B_n ≥ C) du_1 ⋯ du_n } π(θ) dθ ≥ ∫ { ∫ ⋯ ∫ f_{H1}(u_1, ..., u_n|θ) I(δ(u_1, ..., u_n) = 1) du_1 ⋯ du_n } π(θ) dθ.

This concludes with

∫ Pr{B_n rejects H_0 | H_1 with the alternative parameter θ = θ_1} π(θ_1) dθ_1 ≥ ∫ Pr{δ rejects H_0 | H_1 with the alternative parameter θ = θ_1} π(θ_1) dθ_1,

completing the proof.


Note that the definition of the test statistic Bn corresponding to the point null
hypothesis θ = θ0 is widely addressed in the Bayesian statistical literature, e.g.,
Berger (1985, pp. 148–150, Equation 4.15), Bernardo and Smith (1994, pp. 391–
392), and Kass (1993).
Following this result, we can reconsider π(θ) with respect to the area under
which it would yield the most powerful decision-making rule. For example,
if π(θ) = I {θ ∈[G1 , G2 ]} /(G2 − G1 ), G2 > G1, where I {⋅} is the indicator function,
then Bn will provide the integrated most powerful test with respect to the
interval [G1 , G2 ]. The values of G1 and G2 can be set up based on the practical
meaning of the parameter θ under the alternative hypothesis.
Remark. The function π(θ) can be chosen in a conservative fashion, often-
times known as a flat prior, or it can be chosen in an anti-conservative fash-
ion such that we put more weight on a narrow range of values for θ or it can
be chosen anywhere between. The benefit of the anti-conservative approach
is that if our prior belief about the location of θ is corroborated by the
observed data values we have a very powerful decision rule. In general, the
following principles can be considered: (1) the function π(θ) depicts our
belief on how good values of θ represent the unknown parameter, weight-
ing the θ values over the corresponding range of the parameter; (2) π(θ)
provides a functional form of the prior information; and (3) the function
π(θ) can be chosen depending on the area under which we would like to
obtain the most powerful decision rule by virtue of the result regarding the
integrated most powerful property of the Bayes factor based test statistics
shown above. We also refer the reader to Section 5.3.2 for more details
regarding the choice of π(θ).

5.3 Bayes Factor


The Bayesian approach for hypothesis testing was developed by Jeffreys
(1935, 1961) as a major part of his program for scientific theory. A large
statistical literature has been devoted to the development of Bayesian meth-
ods on various testing problems. The Bayes factor, as the centerpiece in
Bayesian evaluation of evidence, closely integrates human thought pro-
cesses of decision-making mechanisms into formal statistical tests. In a
probabilistic manner, the Bayes factor efficiently combines information
about a parameter based on data (or likelihood) with prior knowledge
about the parameter.
Assume that the data, X, have arisen from one of the two hypotheses H 0
and H 1 according to a probability density Pr(X |H 0 ) or Pr(X|H 1 ). Given
a priori (prior) probabilities Pr( H 0 ) and Pr( H 1 ) = 1 − Pr( H 0 ), the data produce
a posteriori (posterior) probabilities of the form Pr(H 0|X ) and
Pr(H 1|X ) = 1 − Pr(H 0|X ). Based on Bayes theorem, we then have

Pr(H_1|X)/Pr(H_0|X) = [Pr(X|H_1)/Pr(X|H_0)] × [Pr(H_1)/Pr(H_0)].

The transformation from the prior probability to the posterior probability is


simply multiplication by

B_{10} = Pr(X|H_1) / Pr(X|H_0),   (5.1)

which is named the Bayes factor. Therefore, posterior odds (odds = proba-
bility/ (1−probability)) is the product of the Bayes factor and the prior odds.
In other words, the Bayes factor is the ratio of the posterior odds of H 1 to its
prior odds, regardless of the value of the prior odds. When there are unknown
parameters under either or both of the hypotheses, the densities Pr(X |H k )
(k = 0, 1) can be obtained by integrating (not maximizing) over the param-
eter space, i.e.,

Pr(X|H_k) = ∫ Pr(X|θ_k, H_k) π(θ_k|H_k) dθ_k,   k = 0, 1,

where θ k is the parameter under H k , π(θ k |H k ) is its prior density, and


Pr(X |θ k , H k ) is the probability density of X, given the value of θ k, or the like-
lihood function of θ k. Here θ k may be a vector, and in what follows we will
denote its dimension by dk . When multiple hypotheses are involved, we
denote Bjk as the Bayes factor for H k versus H j . In this case, as a common

practice, one of the hypotheses is considered the null and is denoted by H 0 .


Kass and Raftery’s (1995) outstanding paper provided helpful details related
to the Bayes factor mechanisms.

5.3.1 The Computation of Bayes Factors


Bayes factors involve computation of integrals usually solved via numerical
methods. Many integration techniques have been adapted to problems of
Bayesian inference, including the computation of Bayes factors.
Define the density (integral) in the numerator or the denominator of the
Bayes factor described in Equation (5.1) as

I = Pr(X|H) = ∫ Pr(X|θ, H) π(θ|H) dθ.
For simplicity and for ease of exposition, the subscript k ( k = 0,1) as the hypo-
thesis indicator is eliminated in the notation H here.
In some cases, the density (integral) I may be evaluated analytically. For
example, an exact analytic evaluation of the integral I is possible for expo-
nential family distributions with conjugate priors, including normal linear
models (DeGroot, 2005; Zellner, 1996). Exact analytic evaluation is best in the
sense of accuracy and computational efficiency, but it is only feasible for a
narrow class of models.
More often, numerical integration, also called “quadrature,” is required
when the analytic evaluation of the integral I is intractable. Generally,
most relevant software is inefficient and of little use for the computation of
these integrals. One reason is that when sample sizes are moderate or
large, the integrand becomes highly peaked around its maximum, which
may be found more efficiently by other techniques. In this instance, gen-
eral purpose quadrature methods that do not incorporate knowledge of
the likely maximum are likely to have difficulty finding the region where
the integrand mass is accumulating. In order to demonstrate this intui-
tively, we first show that an important approximation of the log-likelihood
function is based on a quadratic function of the parameter, e.g., θ. We do
not attempt a general proof but provide a brief outline in the unidimen-
sional case. Suppose the log-likelihood function l(θ) is three times differ-
entiable. Informally, consider a Taylor approximation to the log-likelihood
function l(θ) of second order around the maximum likelihood estimator
(MLE) θ̂, that is,

l(θ) ≈ l(θ̂) + (θ − θ̂) l′(θ̂) + ((θ − θ̂)²/2!) l″(θ̂).

Note that at the MLE θ̂, the first derivative of the log-likelihood function
l ′(θˆ ) = 0. Thus, in the neighborhood of the MLE θ̂, the log-likelihood function
l (θ) can be expressed in the form

l(θ) ≈ l(θ̂) + ((θ − θ̂)²/2!) l″(θ̂).
(We required that l ''' (θ) exists, since the remainder term of the approximation
above consists of a l ''' component.) The following example provides some
intuition about the quadratic approximation of the log-likelihood function.

Example
Let X = {X 1 ,..., X n } denote a random sample from a gamma distribution
with unknown shape parameter θ and known scale parameter κ 0. The
density function of the gamma distribution is
f(x; θ, κ_0) = x^{θ−1} exp(−x/κ_0)/(Γ(θ) κ_0^θ), where the gamma function
Γ(θ) = ∫_0^∞ t^{θ−1} exp(−t) dt. The log-likelihood function of θ is

l(θ) = (θ − 1) Σ_{i=1}^{n} log(X_i) − n log(Γ(θ)) − nθ log(κ_0) − Σ_{i=1}^{n} X_i/κ_0.

The first and the second derivatives of the log-likelihood function are


l′(θ) = −n Γ′(θ)/Γ(θ) + Σ_{i=1}^{n} log(X_i) − n log(κ_0),

l″(θ) = −n (d²/dθ²) log(Γ(θ)) = n ∫_0^1 [t^{θ−1} log(t)/(1 − t)] dt < 0,

respectively. In this case, there is no closed-form solution of the MLE of
the shape parameter θ, which can be found by solving l′(θ̂) = 0 numerically. Correspondingly, the value of the second derivative of the log-likelihood function at the MLE, l″(θ̂), can be computed. Figure 5.1 presents
the plot of log-likelihood (solid line) and its quadratic approximation
(dashed line) versus values of the shape parameter θ based on samples of
sample sizes n = 10, 25, 35, 50 from a gamma distribution with the true
shape parameter of θ = θ0 = 2 and a known scale parameter κ 0 = 2 . It is
obvious from Figure 5.1 that as the sample size n increases, the quadratic
approximation to the log-likelihood function l(θ) performs better in the
neighborhood of the MLE of θ.

Implementation of Figure 5.1 is shown in the following R code:

> n.seq<-c(10,25,35,50)
> N<-max(n.seq)
> shape0<-2
> scale0<-2
> X<-rgamma(N,shape=shape0,scale=scale0)
>
> # Quadratic approximation to the log-likelihood function
> par(mar=c(3.5, 4, 3.5, 1),mfrow=c(2,2),mgp=c(2,1,0))
> for (n in n.seq){
+ x<-X[sample(1:N,n,replace=FALSE)]
+ # log-likelihood as a function of the shape parameter theta
+ loglike<-function(theta) (theta-1)*sum(log(x))-n*log(gamma(theta))-n*theta*log(scale0)-sum(x)/scale0
+ neg.loglike<-function(theta) -loglike(theta)
+
+ # the first derivative of the log-likelihood
+ loglike.D<-function(theta) sum(log(x))-n*digamma(theta)-n*log(scale0)
+ # the second derivative of the log-likelihood
+ loglike.DD<-function(theta) -n*trigamma(theta)
+
+ # one-dimensional optimization that maximizes the log-likelihood with respect to the shape parameter
+ init<-mean(x)^2/var(x) # set initial value
+ theta.mle<-optim(init,neg.loglike,method="L-BFGS-B",lower=init-6*((-loglike.DD(init))^(-1/2))/sqrt(n),upper=init+6*((-loglike.DD(init))^(-1/2))/sqrt(n))$par
+
+ loglike.DD.mle<-loglike.DD(theta.mle) # second derivative at the MLE
+
+ # quadratic approximation of the log-likelihood
+ loglike.est<-function(theta) loglike(theta.mle)+loglike.DD.mle*((theta-theta.mle)^2)/2
+
+ # plotting limits for the x-axis
+ tc<-theta.mle
+ rg<-6*((-loglike.DD.mle)^(-1/2))/sqrt(n) # half range
+
+ # plot the true log-likelihood
+ plot(Vectorize(loglike),c(tc-rg,tc+rg),xlim=c(tc-rg,tc+rg),type="l",lty=1,lwd=2,cex.lab=1.2,xlab=expression(theta),ylab=expression(paste("Log-likelihood(",theta,")")),main=paste0("n=",n))
+ # plot the quadratic approximation to the log-likelihood
+ curve(loglike.est,from=tc-rg,to=tc+rg,type="l",lty=2,lwd=2,add=TRUE,col=2)
+ lines(c(theta.mle,theta.mle),c(-1e+5,loglike(theta.mle)),lty=3)
+ }
>

It can be easily shown that the log-likelihood function l(θ) is highly


peaked near the MLE θ̂, e.g., in the case of large samples. In such cases, the
posterior density function Pr(θ|X) is highly peaked about its maximum θ̃,
that is, the posterior mode (“generalized MLE”), at which the log posterior
is maximized and has slope zero. Assume that θ̃ exists and the prior π(θ)
(positive), as well as the log-likelihood function l(θ), are three times differentiable near the posterior mode θ̃. Then the posterior density Pr(θ|X)
for a large sample size n can be approximated by a normal distribution
with mean θ̃ and covariance matrix Σ̃ = (I_{l̃}(θ̃))^{−1}, where the “generalized”
observed Fisher information matrix I_{l̃}(θ̃) = −H_{l̃}(θ̃), i.e., minus the Hessian

[Figure 5.1 appears here: four panels, n = 10, 25, 35, 50, each plotting the log-likelihood against θ.]

FIGURE 5.1
The plot of log-likelihood (solid line) and its quadratic approximation (dashed line) versus the shape parameter θ based on samples of sizes n = 10, 25, 35, 50 from a gamma distribution with the true shape parameter of θ = θ0 = 2 and the known scale parameter 2.

matrix (second derivative matrix) of the log-posterior evaluated at θ̃. Con-


sequently, in such cases, for relatively large sample size n, the Bayes factor
is approximately equivalent to the maximum likelihood ratio (for details,
see the analysis shown below). In addition, the asymptotic distribution of
the posterior mode depends on the Fisher information and not on the prior
(Le Cam, 1986). Intuitively and consistent with the Bayesian concept, when
the sample size n is relatively large, the impact of prior information has
less effect on the form of the Bayes factor, i.e., large amounts of data mini-
mize the weights of the prior knowledge in terms of the actual behavior of
the likelihood function, since large data provide so much information that
we do not need any prior knowledge.
For ease of exposition, and in an intuitive manner, we outline the proof of
the normal approximation, considering a unidimensional case. Denote the
log nonnormalized posterior

l̃(θ) = log(h(θ)), where h(θ) = Pr(X|θ, H)π(θ|H).



Similar to the way of obtaining the approximation to the log-likelihood


function, applying a Taylor expansion of l̃(θ) of second order at the posterior mode θ̃, we have

Pr(θ|X) ∝ Pr(X|θ, H)π(θ|H) = exp(l̃(θ)) ≈ exp{l̃(θ̃) + l̃″(θ̃)(θ − θ̃)²/2},

where ∝ represents “proportionality.” Note that the posterior mode always


lies between the peak of the prior and that of the likelihood, because it com-
bines information from the prior density and the likelihood. This posterior
density approximation technique is often referred to as the modal approxi-
mation because θ is estimated by the posterior mode.
For relatively large samples the choice of the prior density is generally not
that relevant, and in fact the prior π(θ) might be ignored in the above
derivation. Generally speaking, relatively large data provide so much
information that we do not need any prior knowledge. In such cases, the
posterior mode θ is replaced by the MLE θ̂ and the generalized observed
Fisher information is replaced by the observed Fisher information −l ′′(θˆ ).

Then l̃″(θ̃)(θ − θ̃)² = {n^{1/2}(θ − θ̃)}² l̃″(θ̃)/n ≈ {n^{1/2}(θ − θ̂)}² l″(θ̂)/n ~ Normal, provided that θ is a true fixed value of the parameter (see Chapter 3). In this
scenario, since θ̃ maximizes l̃(θ), i.e., l̃′(θ̃) = 0, the Taylor argument applied to
l̃′(θ̃) shows that

0 = (d/dθ) l̃(θ) |_{θ=θ̃} = l′(θ̂) + (d/dθ){log(π(θ))} |_{θ=θ̂} + (θ̃ − θ̂) (d²/dθ²) l̃(θ) |_{θ=θ̄},  with l′(θ̂) = 0,

and then

θ̃ = θ̂ − [ (d/dθ){log(π(θ))} |_{θ=θ̂} ] { (d²/dθ²) l̃(θ) |_{θ=θ̄} }^{−1} = θ̂ + o_p(1),

where θ̄ ∈ (θ̂, θ̃) and (d²/dθ²) l̃(θ) |_{θ=θ̄} = O_p(n) as a sum of random variables. This

may also be considered a case where frequentist and Bayesian methods align
in some regards.
It is interesting to note that asymptotic arguments similar to those shown
above and in the forthcoming sections can also be employed to compare the
Bayesian and empirical Bayesian concepts (Petrone et al., 2014).
The fact that some problems are of high dimension may also lead to intrac-
table analytic evaluation of integrals. In this case Monte Carlo methods may
be used, but these too need to be adapted to the statistical context. Bleistein

and Handelsman (1975) as well as Evans and Swartz (1995) reviewed various
numerical integration strategies for evaluating integral I. In this chapter we
outline several techniques available to evaluate the Bayes factor, including
the asymptotic approximation method, the simple Monte Carlo technique,
importance sampling, and Gaussian quadrature strategies, as well as
approaches based on generating samples from the posterior distributions.
For general discussion and references, see, e.g., Kass and Raftery (1995) and
Han and Carlin (2001).

5.3.1.1 Asymptotic Approximations


(1) Laplace’s method. The Laplace approximation (De Bruijn, 1970;
Tierney and Kadane, 1986) to the marginal density
I = Pr(X|H) = ∫ Pr(X|θ, H)π(θ|H) dθ of the data is obtained by approximating the posterior with a normal distribution. It is assumed that the posterior density, which is proportional to the nonnormalized posterior
h(θ) = Pr(X|θ, H)π(θ|H), is highly peaked about its maximum θ̃, the posterior mode. This assumption will usually be satisfied if the likelihood function Pr(X|θ, H) is highly peaked near its maximum θ̂, e.g., in the case of large
samples. Denote l̃(θ) = log(h(θ)). Expanding l̃(θ) as a quadratic about θ̃ and
then exponentiating the resulting expansion yields an approximation to h(θ)
that has the form of a normal density with mean θ̃ and covariance matrix
Σ̃ = (−H_{l̃}(θ̃))^{−1}, where H_{l̃}(θ̃) is the Hessian matrix of second derivatives of the
log-posterior evaluated at θ̃. More specifically, component-wise,
H_{l̃}(θ̃)_{ij} = [∂²l̃(θ)/∂θ_i ∂θ_j] |_{θ=θ̃}. Integrating the approximation to h(θ) with respect
to θ yields

Î_L = (2π)^{d/2} |Σ̃|^{1/2} Pr(X|θ̃, H)π(θ̃|H),

where d is the dimension of θ. Under conditions specified by Kass et al.


(1990), it can be derived that I = IˆL (1 + O (n−1 )) as n → ∞ , that is, the relative
error of Laplace’s approximation to I is O (n−1 ). Therefore when Laplace’s
method is applied to both the numerator and denominator of B10 in Equation
5.1, the resulting approximation leads to a relative error of order O (n−1 ).
Laplace’s method yields accurate approximations and is often computa-
tionally efficient. By employing Laplace’s methods, one can show that the
asymptotic properties of Bayes factor type procedures based on data are
close to those based on maximum-likelihood estimation methods.
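A minimal R sketch of Laplace’s approximation Î_L is given below for a unidimensional example (normal data with a normal prior on the mean), where the answer can be checked by direct numerical integration; the prior parameters and data-generating values are illustrative assumptions, and for this conjugate model the Hessian of the log-posterior is available analytically.

# Sketch: Laplace's approximation to I = Pr(X|H) = integral of Pr(X|theta, H) pi(theta|H) d theta.
set.seed(3)
n <- 30; X <- rnorm(n, mean = 1, sd = 1)        # illustrative data, sigma = 1 known
m0 <- 0; tau0 <- 2                              # illustrative N(m0, tau0^2) prior on the mean

log.h <- function(theta)                        # log non-normalized posterior
  sum(dnorm(X, theta, 1, log = TRUE)) + dnorm(theta, m0, tau0, log = TRUE)

# posterior mode (theta tilde) found numerically; minus the Hessian is n + 1/tau0^2 here
opt <- optimize(log.h, interval = c(-10, 10), maximum = TRUE)
theta.tilde <- opt$maximum
Sigma.tilde <- 1 / (n + 1/tau0^2)               # minus inverse Hessian of the log-posterior

I.laplace <- sqrt(2*pi) * sqrt(Sigma.tilde) * exp(log.h(theta.tilde))   # d = 1

# numerical check of the marginal density
I.numeric <- integrate(function(th) sapply(th, function(t) exp(log.h(t))),
                       lower = -Inf, upper = Inf)$value
c(Laplace = I.laplace, numerical = I.numeric)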
(2) Variants on Laplace’s method. Laplace’s method may be applied in
alternative forms by omitting part of the integrand from the exponent when

performing the expansion. (For the general formulation, see Kass and Vaid-
yanathan (1992), which followed Tierney et al. (1989).) An important variant
on the approximation Î to I is

IˆMLE = (2 π)d/2 |Σˆ |1/2 Pr(X |θˆ , H )π(θˆ |H ),

where Σˆ −1 is the observed information matrix, that is, the negative Hessian
matrix of the log-likelihood evaluated at the maximum likelihood estimator
θ̂. This approximation again has relative error of order O(n−1 ).
Define G to be the set of all possibilities that satisfy hypothesis H , and G '
to be the set of all possibilities that satisfy hypothesis H '. We will call H ' a
nested hypothesis within H if G ' ⊂ G.
Consider nested hypotheses, where there must be some parameterization
under H 1 of the form θ = (β, ψ )T such that H 0 is obtained from H 1 when ψ = ψ 0
for some ψ 0 , with parameter (β, ψ ) having prior π(β, ψ ) under H 1 and then
H 0 : ψ = ψ 0 with prior π(β|H 0 ). For instance, it is desired to check whether or
not the collected data, denoted by X, could be described as a random sample
from some normal distribution with mean μ 0 , assuming that they may be
described as a random sample from some normal distribution with unknown
mean μ and unknown variance σ 2 . In this case, notation-wise, ψ 0 = μ 0 , ψ = μ,
and β = σ 2 .
Applying the approximation IˆMLE, one can show that

2 log B_{10} ≈ Λ + log π(β̂, ψ̂|H_1) − log π(β̂*|H_0) + (d_1 − d_0) log(2π) + log|Σ̂_1| − log|Σ̂_0|,

where Λ = 2{log Pr(X|(β̂, ψ̂), H_1) − log Pr(X|β̂*, H_0)} is the log-likelihood ratio
statistic having approximately a chi-squared distribution with degrees of
freedom (d1 − d0 ), a difference between the numbers of the estimated param-
eters under H 1 and H 0 , and β̂* denotes the MLE under H 0 . Here the covari-
ance matrices Σ̂ k could be either observed or expected information, under
which case the approximation of 2 log B10 has relative error of order O(n−1 ) or
O(n−1/2 ), respectively. For more information, we refer the reader to Kass and
Raftery (1995).
Alternatively, we may consider the following MLE-based approach. We do
not attempt a general proof but we instead provide an informal outline for
the unidimensional case when regarding parameter θ we have the hypoth-
esis H 0 : θ = θ0 versus H 1 : θ ≠ θ0, where θ0 is known. Note that in a Bayesian
context, we are interested in testing for the model, where θ has a degenerate
prior distribution with mass at θ0 versus the model that states θ has a prior
distribution π(θ). Recall that the log-likelihood ratio function lr(θ) = l(θ) − l(θ0 )
attains its maximum at the MLE θ̂, satisfying l '(θˆ ) = 0, and the likelihood
function l(θ) increases up to the MLE θ̂ and decreases afterward. Define
φ_n = c n^{−(1/2−γ)}, where c is a constant and γ < 1/6. Then the marginal ratio (e.g.,
see the test statistic B_n defined in Section 5.2 when the observations are iid)
can be divided into three components:

B_n = I/exp(l(θ_0)) = ∫_{−∞}^{∞} exp(lr(θ))π(θ) dθ = I_1 + I_2 + ∫_{θ̂−φ_n}^{θ̂+φ_n} exp(lr(θ))π(θ) dθ,

where I_1 = ∫_{−∞}^{θ̂−φ_n} exp(lr(θ))π(θ) dθ and I_2 = ∫_{θ̂+φ_n}^{∞} exp(lr(θ))π(θ) dθ. We first consider the component I_1. Based on the fact that the log-likelihood ratio function lr(θ) is a nondecreasing function of θ for θ ∈ (−∞, θ̂ − φ_n) and that
lr(θ̂ − φ_n) → −∞, under the null hypothesis, as n → ∞, it can be derived that

I_1 ≤ exp(lr(θ̂ − φ_n)) ∫ π(θ) dθ = exp(lr(θ̂ − φ_n)) → 0, as n → ∞,

since the Taylor arguments can provide

lr(θ̂ − φ_n) ≈ lr(θ̂) − φ_n lr′(θ̂) + φ_n² lr″(θ̂)/2! = lr(θ̂) + φ_n² lr″(θ̂)/2

(here φ_n is considered around 0), where, under H_0, 2lr(θ̂) has approximately
a chi-squared distribution with one degree of freedom and

φ_n² lr″(θ̂) = −n^{2γ} {−lr″(θ̂)/n} ≈ −n^{2γ} {−lr″(θ_0) − (θ̂ − θ_0) lr‴(θ_0)}/n = −n^{2γ} O_p(1) → −∞

(here θ̂ is considered around θ_0) with (θ̂ − θ_0) → 0 and lr″(θ_0) ∼ n, lr‴(θ_0) ∼ n; lr″(θ_0), lr‴(θ_0) are sums of iid random variables.
This behavior is similar for the component I_2: since the log-likelihood
function l(θ) is a non-increasing function of θ for θ ∈ (θ̂ + φ_n, ∞) and
lr(θ̂ + φ_n) → −∞, under the null hypothesis, as n → ∞, we have

I_2 ≤ exp(lr(θ̂ + φ_n)) ∫ π(θ) dθ = exp(lr(θ̂ + φ_n)) → 0, as n → ∞.

Therefore, we have that the Bayes factor B_n = I/exp(l(θ_0)) satisfies approximately

I/exp(l(θ_0)) ≈ ∫_{θ̂−φ_n}^{θ̂+φ_n} exp(lr(θ))π(θ) dθ, as n → ∞.

In this approximation,

lr(θ) = l(θ) − l(θ_0) = l(θ̂) − l(θ_0) + (θ − θ̂) l′(θ̂) + (θ − θ̂)² l″(θ̂)/2 + R(θ)
= l(θ̂) − l(θ_0) + (θ − θ̂)² l″(θ̂)/2 + R(θ),

where the remainder term R(θ) = (θ − θ̂)³ l‴(θ̄)/6, θ̄ ∈ (θ, θ̂), satisfies R(θ) = O_p(φ_n³) O_p(n), that is,
R(θ) = O_p(n^{−1/2+3γ}) = o_p(1), when θ ∈ [θ̂ − φ_n, θ̂ + φ_n], φ_n = c n^{−(1/2−γ)} and γ < 1/6. Then we
conclude with

I/exp(l(θ_0)) ≈ ∫_{θ̂−φ_n}^{θ̂+φ_n} exp(l(θ̂) − l(θ_0) + (θ − θ̂)² l″(θ̂)/2) π(θ) dθ

≈ exp(l(θ̂) − l(θ_0)) ∫ exp(−(θ − θ̂)²(−l″(θ̂))/2) π(θ) dθ,

where ∫ exp(−(θ − θ̂)²(−l″(θ̂))/2) π(θ) dθ is a Gaussian-type integral. This
2

result can be easily applied to control asymptotically the Type I error rate of
the test statistic Bn = I / exp (l(θ0 ))in the frequentist manner.
In a more formal manner, Vexler et al. (2014) used the asymptotic principle
shown above for constructing and evaluating nonparametric Bayesian proce-
dures, where the empirical likelihood method was employed instead of the
parametric likelihood. (Regarding the empirical likelihood approach, see
Chapter 10.)
(3) The Schwarz criterion. It is possible to avoid the introduction of the
prior densities π k (θ k |H k ) in Equation (5.1) by using

S = {log Pr(X|θ̂_1, H_1) − log Pr(X|θ̂_0, H_0)} − (1/2)(d_1 − d_0) log(n),

where θ̂ k is the MLE under H k, dk is the dimension of θ k, k = 0,1 and n is the


sample size. The second term in S acts as a penalty term that corrects for dif-
ferences in dimension between H k, k = 0,1. The quantity S is often called the
Schwarz criterion. The Bayesian information criterion (BIC) can be defined
as minus twice the Schwarz criterion; sometimes an arbitrary constant is

added. The BIC is a criterion for model selection among a finite set of mod-
els. In this case, the model with the lowest BIC is preferred (e.g., Carlin and
Louis, 2011). Schwarz (1978) showed that, under certain assumptions on the
data distribution,

(S − log B_{10}) / log B_{10} → 0, as n → ∞.

Thus, the quantity exp(S) provides a rough approximation to the Bayes factor
that is independent of the priors on the θ k. Kass and Wasserman (1995) pro-
vided excellent discussion regarding the method mentioned above.
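The following R sketch computes the Schwarz criterion S, the rough approximation exp(S) to the Bayes factor, and the corresponding BIC values for testing H_0: μ = 0 versus H_1: μ ≠ 0 with normal data and known variance; the data-generating values are illustrative assumptions.

# Sketch: Schwarz criterion S and exp(S) as a rough Bayes factor approximation.
set.seed(11)
n <- 50; X <- rnorm(n, mean = 0.3, sd = 1)   # illustrative data, sigma = 1 known

loglik <- function(mu) sum(dnorm(X, mean = mu, sd = 1, log = TRUE))
mu.hat <- mean(X)          # MLE under H1; under H0 the mean is fixed at 0
d1 <- 1; d0 <- 0           # numbers of free parameters under H1 and H0

S <- (loglik(mu.hat) - loglik(0)) - 0.5 * (d1 - d0) * log(n)
BIC0 <- -2 * loglik(0)                     # BIC under H0 (no estimated parameters)
BIC1 <- -2 * loglik(mu.hat) + d1 * log(n)  # BIC under H1; note S = -(BIC1 - BIC0)/2
c(S = S, approx.BF = exp(S), BIC0 = BIC0, BIC1 = BIC1)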
The benefits of asymptotic approximations in this context include the fol-
lowing: (1) numerical integration is replaced with numerical differentiation,
which is computationally more stable; (2) asymptotic approximations do not
involve random numbers, and consequently, two different analysts can pro-
duce a common answer based on the same dataset, model, and prior distri-
bution; and (3) in order to investigate the sensitivity of the result to modest
changes in the prior or the likelihood function, the computational complex-
ity is greatly reduced; for more detail, we refer the reader to Carlin and Louis
(2011). Among the asymptotic approximations described above, the Schwarz
criterion is the easiest approximation to compute and requires no specifica-
tion of prior distributions. In addition, as long as the number of degrees of
freedom involved in the comparison is reasonably small relative to sample
size, the analysis based on the Schwarz criterion will not mislead in a quali-
tative sense.
However, asymptotic approximations also have the following limita-
tions: (1) in order to obtain a valid approximation, the posterior distribu-
tion must be unimodal, or at least nearly unimodal; (2) the size of the
dataset must be fairly large, while it is hard to judge how large is large
enough; (3) the accuracy of asymptotic approximations cannot be improved
without collecting additional data; (4) the correct parameterization, on
which the accuracy of the approximation depends, e.g., θ versus log(θ),
may be difficult to ascertain; (5) when the dimension of θ is moderate or
high, say, greater than 10, Laplace’s method becomes unstable, and numer-
ical computation of the associated Hessian matrices will be prohibitively
difficult; and (6) when the number of degrees of freedom involved in the
comparison is large and the prior is very different from that for which the
approximation is best, the asymptotic approximation based on the Schwarz
criterion can be very poor; e.g., McCulloch and Rossi (1991) illustrated the
poor approximation by the Schwarz criterion with an example of 115
degrees of freedom. For more details, we refer the reader to, e.g., Kass and
Raftery (1995) as well as Carlin and Louis (2011). Therefore, in such cases,
researchers may turn to alternative methods, e.g., Monte Carlo sampling,

importance sampling, and Gaussian quadrature. These methods typically


require longer runtimes, but can be programmed easily and can be applied
in more general scenarios, which are the subject of the following section in
this chapter.

5.3.1.2 Simple Monte Carlo, Importance Sampling, and


Gaussian Quadrature
Dropping the notational dependence on the hypothesis indicator H k , the


integral I becomes I = Pr(X) = ∫ Pr(X|θ)π(θ) dθ. The simplest Monte Carlo
integration estimate of I is

P̂r_1(X) = (1/m) Σ_{i=1}^{m} Pr(X|θ^{(i)}),

where {θ( i ) : i = 1,..., m} is a sample from the prior distribution; this is the
average of the likelihoods of the sampled parameter values, e.g., Hammersley
and Handscomb (1964).
The precision of simple Monte Carlo integration can be improved by
importance sampling. It involves generating a sample {θ( i ) : i = 1,..., m} from
a density π * (θ). Under general conditions, a simulation-consistent estimate
of I is


Î_MC = Σ_{i=1}^{m} w_i Pr(X|θ^{(i)}) / Σ_{i=1}^{m} w_i,

where wi = π(θ( i ) )/ π * (θ( i ) ), the function π * (θ) is known as the importance


sampling function. Then the approximation of Bayes factor can be computed
accordingly; for general discussion, see Geweke (1989). Although the Monte
Carlo integration and importance sampling methods are less precise and
more computationally demanding, they may be the only applicable methods
in complex models.
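A short R sketch of both estimators follows, for the normal-mean model with a normal prior so that the answer can also be checked by direct numerical integration; the prior, the importance sampling function π*(θ), and the number of draws m are illustrative assumptions.

# Sketch: simple Monte Carlo and importance sampling estimates of I = Pr(X).
set.seed(5)
n <- 20; X <- rnorm(n, mean = 0.7, sd = 1)       # illustrative data, sigma = 1 known
lik <- function(theta) sapply(theta, function(t) exp(sum(dnorm(X, t, 1, log = TRUE))))
prior.r <- function(m) rnorm(m, 0, 2)            # illustrative prior N(0, 2^2): sampler
prior.d <- function(theta) dnorm(theta, 0, 2)    # ... and its density

m <- 1e5
# (i) simple Monte Carlo: average likelihood over draws from the prior
I.mc <- mean(lik(prior.r(m)))
# (ii) importance sampling with pi*(theta) centered near the likelihood peak
pi.star.r <- function(m) rnorm(m, mean(X), 1/sqrt(n))
pi.star.d <- function(theta) dnorm(theta, mean(X), 1/sqrt(n))
th <- pi.star.r(m)
w  <- prior.d(th) / pi.star.d(th)                # importance weights w_i
I.is <- sum(w * lik(th)) / sum(w)
# check by direct numerical integration
I.quad <- integrate(function(t) lik(t) * prior.d(t), -Inf, Inf)$value
c(simple.MC = I.mc, importance = I.is, quadrature = I.quad)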
Genz and Kass (1997) demonstrated the evaluation of integrals that are
peaked around a dominant mode and provided an adaptive Gaussian
quadrature method. This method is efficient especially when the dimension-
ality of the parameter space is modest.

5.3.1.3 Generating Samples from the Posterior


Several methods can be applied to generate samples from posterior distribu-
tions, including direct simulation and rejection sampling for the simplest

cases. In more complex cases, Markov chain Monte Carlo (MCMC) methods
(e.g., Smith and Roberts, 1993) particularly the Metropolis–Hastings algo-
rithm and the Gibbs sampler, as well as the weighted likelihood bootstrap,
provide general schemes.
These methods provide a sample approximately drawn from the posterior
density Pr(θ|X) = Pr(X|θ)π(θ)/Pr(X). Substituting into Î_MC defined in Section 5.3.1.2 yields as an estimate for Pr(X),

P̂r_2(X) = { (1/m) Σ_{i=1}^{m} Pr(X|θ^{(i)})^{−1} }^{−1},
⎪⎩m i=1 ⎪⎭
which is the harmonic mean of the likelihood values (Newton and Raftery,
1994). This converges almost surely to the correct value, Pr(X ), as m → ∞, but
it does not generally satisfy a Gaussian central limit theorem. See Kass and
Raftery (1995) for additional modifications and discussion. Note that the
MCMC methods have not yet been applied in many demanding problems
and may require large numbers of likelihood function evaluations, resulting
in difficulty in some cases.
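A brief R sketch of the harmonic mean estimator is shown below; for simplicity the posterior sample is drawn directly from the known conjugate posterior rather than generated by MCMC, and all numerical values are illustrative assumptions. As noted above, the estimator can be unstable, so the sketch serves only to illustrate the formula.

# Sketch: harmonic mean estimator of Pr(X) from a posterior sample (conjugate normal model).
set.seed(9)
n <- 20; X <- rnorm(n, mean = 0.7, sd = 1)      # illustrative data, sigma = 1 known
m0 <- 0; tau0 <- 2                              # illustrative N(m0, tau0^2) prior on the mean

# conjugate posterior of the mean (used here in place of an MCMC sample)
post.var  <- 1 / (n + 1/tau0^2)
post.mean <- (sum(X) + m0/tau0^2) * post.var
theta.post <- rnorm(2e4, post.mean, sqrt(post.var))

loglik <- function(t) sum(dnorm(X, t, 1, log = TRUE))
inv.lik <- exp(-sapply(theta.post, loglik))      # Pr(X|theta^(i))^{-1}
Pr2.hat <- 1 / mean(inv.lik)                     # harmonic mean of the likelihood values
Pr2.hat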

5.3.1.4 Combining Simulation and Asymptotic Approximations


DiCiccio et al. (1997) provided a simulation based version of Laplace’s method
using generated observations from the posterior distributions by employing
Markov chain Monte Carlo or other techniques. Following the previous notations of θ̃ and Σ̃ described in Section 5.3.1.1 (Laplace’s method), let θ̃ be the
posterior mode and let the covariance matrix Σ̃ be minus the inverse of the Hessian of the log-posterior evaluated at θ̃. If no analytical form can be obtained,
then θ̃ and Σ̃ can be estimated via simulation. The normal approximation to
the posterior is ϕ(⋅) = ϕ(⋅; θ̃, Σ̃), where ϕ(⋅; μ, Σ) denotes a normal density with
mean vector μ and covariance matrix Σ. Let B = {θ ∈ Θ : (θ − θ̃)ᵀ Σ̃^{−1} (θ − θ̃) < δ²},
which has volume υ = δ^p π^{p/2} |Σ̃|^{1/2}/Γ(p/2 + 1), where Γ(u) = ∫_0^∞ t^{u−1} exp(−t) dt
and aᵀ = [a_1 ⋯ a_k] for a column vector a = (a_1, ..., a_k)ᵀ. Define Pr(B) = ∫_B Pr(θ|X) dθ and
α = ∫_B ϕ(u; θ̃, Σ̃) du.
)
A modification of Laplace’s method can be obtained that simply estimates
the unknown value of the posterior probability density at the mode using
the simulated probability assigned to a small region around the mode
divided by its area. Observing that

I = Pr(X|θ̃)π(θ̃)/Pr(θ̃|X) = [Pr(X|θ̃)π(θ̃)/ϕ(θ̃)] [ϕ(θ̃)/Pr(θ̃|X)] ≈ [Pr(X|θ̃)π(θ̃)/ϕ(θ̃)] [α/Pr(B)],

the volume-corrected estimator has the form

Î_L* = Pr(X|θ̃)π(θ̃) α / (ϕ(θ̃) P̂),

where P̂ is the Monte Carlo estimate of Pr(B), that is, a proportion of the
sampled values inside B. DiCiccio et al. (1997) demonstrated that the simu-
lated version of Laplace’s method with a local volume correction approxi-
mates the Bayes factor accurately, which is especially useful in the case of
costly likelihood function evaluations.
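The following unidimensional R sketch illustrates the volume-corrected estimator Î_L*; the posterior sample again stands in for MCMC output (it is drawn from the known conjugate posterior), and δ and the other numerical values are illustrative assumptions.

# Sketch: volume-corrected (simulated) Laplace estimator I_L* in one dimension.
set.seed(21)
n <- 30; X <- rnorm(n, mean = 1, sd = 1)        # illustrative data, sigma = 1 known
m0 <- 0; tau0 <- 2                              # illustrative N(m0, tau0^2) prior on the mean

log.h <- function(theta) sum(dnorm(X, theta, 1, log = TRUE)) + dnorm(theta, m0, tau0, log = TRUE)

# posterior mode and curvature (analytic in this conjugate model)
Sigma.tilde <- 1 / (n + 1/tau0^2)
theta.tilde <- (sum(X) + m0/tau0^2) * Sigma.tilde

# posterior sample standing in for MCMC output
theta.post <- rnorm(5e4, theta.tilde, sqrt(Sigma.tilde))

delta <- 1                                       # illustrative radius defining the region B
inB   <- (theta.post - theta.tilde)^2 / Sigma.tilde < delta^2
P.hat <- mean(inB)                               # Monte Carlo estimate of Pr(B)
alpha <- pchisq(delta^2, df = 1)                 # normal-approximation probability of B
phi.tilde <- dnorm(theta.tilde, theta.tilde, sqrt(Sigma.tilde))   # phi at the mode

I.star <- exp(log.h(theta.tilde)) * alpha / (phi.tilde * P.hat)
I.quad <- integrate(function(t) sapply(t, function(u) exp(log.h(u))), -Inf, Inf)$value
c(volume.corrected = I.star, quadrature = I.quad)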
The importance sampling techniques can be modified by restricting
them to small regions about the mode. To improve the accuracy of approxi-
mations, Laplace’s method can be combined with the simple bridge sam-
pling technique proposed by Meng and Wong (1996). For detailed information,
we refer the reader to DiCiccio et al. (1997).

5.3.2 The Choice of Prior Probability Distributions


Implementation of Bayes factor approaches requires the specification of pri-
ors, as is the case in all Bayesian analysis. In principle, priors formally
represent available external information, providing a way to combine other
information about the values of the target parameter with the data.
Typically, prior distributions are specified based on information accumu-
lated from past studies or from the opinions of subject-area experts. The elic-
ited prior is a means of drawing information from subject-area experts, who
have a great deal of information about the substantive question but are not
involved in the model construction process, with the goal of constructing a
probability structure that quantifies their specific knowledge and experien-
tial intuition about the studied effects.
In order to simplify the subsequent computational burden, e.g., make com-
putation of the posterior distribution easier, experimenters often limit the
choice of priors by restricting π(θ) to some familiar distributional family. In
choosing a prior belonging to a specific distributional family, some choices
may have more computational advantages than others. In particular, if the
prior probability distribution π(θ) is in the same distributional family as the
resulting posterior distribution Pr(θ|X ) , then the prior and posterior are
called conjugate distributions, and the prior is called a conjugate prior for
the likelihood function. For example, the Gaussian family is conjugate to
itself (or self-conjugate) with respect to a Gaussian likelihood function. That
is, if the likelihood function is Gaussian, choosing a Gaussian prior over the
mean will guarantee the Gaussian posterior distribution.
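A short R sketch of this conjugate update is given below; the prior parameters and data are illustrative assumptions.

# Sketch: Gaussian likelihood with a Gaussian prior on the mean yields a Gaussian posterior.
set.seed(2)
n <- 15; X <- rnorm(n, mean = 1.2, sd = 1)   # illustrative data, sigma = 1 known
m0 <- 0; tau0 <- 1                           # illustrative N(m0, tau0^2) prior on the mean
post.var  <- 1 / (n + 1/tau0^2)              # posterior variance
post.mean <- post.var * (sum(X) + m0/tau0^2) # posterior mean
c(post.mean = post.mean, post.sd = sqrt(post.var))   # the posterior is N(post.mean, post.var)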
Often in practice, there is no reliable prior information regarding θ that
exists, or it is desired to make an inference based solely on the data. In such
cases, the noninformative prior plays a major role in Bayesian analyses.

A noninformative prior can be constructed by some formal rule that con-


tains “no information” with respect to θ in the sense that no one θ value is
favored over another, provided all values are logically possible. The simplest
and oldest rule to determine a noninformative prior is the principle of indif-
ference, which assigns equal probabilities to all possibilities. Noninforma-
tive priors can express objective information such as “the variable is within
a certain range.” For example, if we have a bounded continuous parameter
space, say, Θ = [ a, b], −∞ < a < b < ∞ , then the uniform distribution
π(θ) = I{a ≤ θ ≤ b}/(b − a), where I{⋅} is the indicator function, is arguably noninformative for θ. If the parameter space is unbounded, i.e., Θ = (−∞, ∞), then
the appropriate uniform prior may be π(θ) = c, c > 0. It is worth emphasizing
that even if this distribution is improper in that ∫ π(θ) dθ = ∞, Bayesian inference is still possible provided that ∫ Pr(X|θ) dθ = K, K < ∞. Then

Pr(θ|X) = Pr(X|θ)c / ∫ Pr(X|θ)c dθ = Pr(X|θ)/K.

Since ∫ Pr(θ|X) dθ = 1, the posterior density is indeed proper, and hence
Bayesian inference may proceed as usual. Note that when employing
improper priors, proper posteriors will not always result and extra care must
be taken.
Krieger et al. (2003) have proposed several forms of a prior π(θ) and the corresponding distribution H(θ) = ∫_{−∞}^{θ} π(t) dt in the context of sequential change
point detection (see Chapter 4). This method of choices of a prior can be
adapted for the problem stated in this chapter. For example, if we suspect
that the observations under the alternative hypothesis have a distribution
that differs greatly from the distribution of the observations under the null,
the prior distribution could be chosen as
H(θ) = [Φ(μ/σ)]^{−1} (Φ((θ − μ)/σ) − Φ(−μ/σ))^{+},   (5.2)

where Φ is the standard normal distribution function and a + = aI ( a ≥ 0). A


somewhat broader prior is

H(θ) = (1/2) (Φ((θ − μ)/σ) + Φ((θ + μ)/σ)).   (5.3)

Note that the parameters μ and σ > 0 of the distributions H (θ) can be chosen
arbitrarily. Krieger et al. (2003) recommended the specification of μ = 0, and
σ = 1 so that two forms of H (θ) specified above are simplified. In accordance

with the rules shown in Marden (2000), where the author reviewed the Bayes-
ian approach applied to hypothesis testing, the function H can be defined,
e.g., for a fixed σ > 0, in the form

H ∈ Ψ = { (1/2) (Φ((θ − μ)/σ) + Φ((θ + μ)/σ)), μ ∈ (μ_lower, μ_upper) }.   (5.4)

Here Ψ is a set of distribution functions. The following R code shows the


way to plot the distribution functions of priors specified in Equations (5.2)
through (5.4) as shown in Figure 5.2.
> # The distribution function defined by Equation (5.2).
> H1<-function(theta,mu,sigma) {
+ tmp<-pnorm((theta-mu)/sigma)-pnorm(-mu/sigma)
+ H<-ifelse(tmp>0,tmp,0)/pnorm(mu/sigma)
+ return(H)
+ }
>
> # The distribution function defined by Equations (5.3) and (5.4).
> H2<-function(theta,mu,sigma) {
+ H<-(pnorm((theta-mu)/sigma)+pnorm((theta+mu)/sigma))/2
+ return(H)
+ }
> par(mfrow=c(1,2),mar=c(4,4,2,2))
> theta.seq<-seq(0,4,.1)
> H1.dist<-sapply(theta.seq,H1,mu=0,sigma=1)
> plot(theta.seq,H1.dist,type="l",xlim=c(-4,4),lty=1,lwd=2,xlab="theta",ylab=
"Probability")
>
> theta.seq2<-seq(-4,4,.1)
> H2.dist<-sapply(theta.seq2,H2,mu=0,sigma=1)
> lines(theta.seq2,H2.dist,lty=2,lwd=2)
>
> # fix sigma=1
> mu.seq<-seq(-5,5,1)
> H2.dist<-sapply(theta.seq2,H2,mu=mu.seq[1],sigma=1)
> plot(theta.seq2,H2.dist,lty=3,lwd=2,type="l",xlim=c(-4,4),ylim=c(0,1),xlab=
"theta",ylab="Probability")
> for (i in 2:length(mu.seq)){
+ H2.dist<-sapply(theta.seq2,H2,mu=mu.seq[i],sigma=1)
+ lines(theta.seq2,H2.dist,lty=i+2,lwd=2)
+ }

In Figure 5.2, the solid line and the dashed line in the left panel present H (θ)
defined by (5.2) and (5.3) with μ = 0 and σ = 1, respectively, and the right
panel presents H (θ) defined by Equation (5.4) for μ changing from −5 to 5.

Example
As a specific example of the choice of the prior function, Gönen et al.
(2005) provided a case study. Assuming the data points Yir ( i = 1, 2,
r = 1,..., ni ) are independent and normally distributed with means μ i and
FIGURE 5.2
The cumulative distribution functions that correspond to the priors (5.2) through (5.4). The
solid line and the dashed line in the left panel present H (θ) defined by (5.2) and (5.3) with μ = 0
and σ = 1, respectively. The right panel presents H (θ) defined by (5.4), where μ = −5,..., 5 .

a common variance σ 2 , the goal is to test H 0 : δ = μ 1 − μ 2 = 0 against


H 1 : δ ≠ 0 . In order to obtain the usual two-sample t statistic, prior
knowledge is modeled for δ/σ instead of δ, which can be specified as δ/σ | {μ, σ², δ/σ ≠ 0} ~ N(λ, σ_δ²). For sample size calculations, prior infor-
mation to suggest the expected effect size λ is routinely used. The large-
sample size calculation formula for two-sample tests is given by

n = 2(z_{1−α/2} + z_{1−β})² / (δ/σ)²,  with  (2π)^(−1/2) ∫_{−∞}^{z_{1−α/2}} e^(−u²/2) du = 1 − α/2  and  (2π)^(−1/2) ∫_{−∞}^{z_{1−β}} e^(−u²/2) du = 1 − β,

where n = n1 = n2 = 2n_δ is the sample size per group, δ/σ is the prespecified anticipated standardized effect size, and α, β are the Type I and Type II error probabilities, respectively. Note that the value σ_δ can be expressed as a function of the prior probability that the effect is in the wrong direction. For example, in the case of α = 0.05, β = 0.2, λ = (1.96 + 0.84)n_δ^(−1/2) = 2.80n_δ^(−1/2), and one thinks Pr(δ < 0|δ ≠ 0) = 0.10, one can obtain σ_δ = 2.19n_δ^(−1/2) using normal distribution calculations.
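The following R sketch reproduces these numbers (a hypothetical check using the values α = 0.05, β = 0.2, and Pr(δ < 0|δ ≠ 0) = 0.10 stated above; the standardized effect size 0.5 used in the last line is an assumed illustration, not a value from the text):

# Reproducing lambda = 2.80*n_delta^(-1/2) and sigma_delta = 2.19*n_delta^(-1/2)
alpha <- 0.05; beta <- 0.2
lambda.coef <- qnorm(1 - alpha/2) + qnorm(1 - beta)    # 2.80 (coefficient of n_delta^(-1/2))
sigma.delta.coef <- lambda.coef/qnorm(1 - 0.10)        # 2.19, from Pr(delta < 0 | delta != 0) = 0.10
round(c(lambda.coef, sigma.delta.coef), 2)
# Large-sample size per group for an assumed standardized effect size of 0.5
effect <- 0.5
ceiling(2*(qnorm(1 - alpha/2) + qnorm(1 - beta))^2/effect^2)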
These calculations involved the choice of zero for the tenth percentile of the prior on δ/σ. Note that other percentiles could have been selected as well. Another calibration would involve selection of σ_δ based on a prior assumed value for Pr(δ/σ > 2λ|δ ≠ 0) = 0.10. In order to ensure con-
sistency, it would be helpful to try several such values. The remaining


parameter that needs to be specified is given in terms of the probability
that H 0 is true, that is, π 0 = Pr(δ = 0). Observing that it is unethical to
randomize patients when the outcome is certain, the quantities Pr(δ ≤ 0)
and Pr(δ > 0) should be roughly comparable. One may set π 0 = 0.5 as an
“objective” value (Berger and Sellke, 1987), which, in conjunction with
Pr(δ < 0|δ ≠ 0) = 0.10 , yields Pr(δ ≤ 0) = 0.5 + 0.10 × 0.5 = 0.55. Alternati-
vely, the prior π 0 can be assigned to reflect prior belief in the null; one
may set Pr(δ ≤ 0) = 0.5 , which implies π 0 = 0.444 in conjunction with
Pr(δ < 0|δ ≠ 0) = 0.10 .

Note that a common criticism of Bayesian methods is that the priors may be chosen arbitrarily in some cases or subjectively in others. Kass (1993) discussed that the value of a Bayes factor may be sensitive to the choice of priors on parameters in the competing models. A broad range of literature has discussed the selection of priors and various techniques for constructing priors, including Jeffreys' rules and their variants; see, e.g.,
Berger (1980, 1985), Kass and Wasserman (1996) and Sinharay and Stern
(2002) for additional information.

5.3.3 Decision-Making Rules Based on the Bayes Factor


The Bayes factor is a summary of the evidence provided by the data in favor
of one scientific theory, represented by a statistical model, as opposed to a
competing model. Jeffreys (1961) provided a scale of interpretation for the
Bayes factor in terms of evidence against the null hypothesis; see Table 5.1 for
details. It is suggested to interpret B10 in half-units on the log 10 scale.
It is interesting to note that in the frequentist manner of testing statistical
hypotheses, mostly in the period before the Neyman–Pearson concept of set-
ting the Type I error rates to be fixed was widely accepted, decisions based
on the maximum likelihood ratio statistics were reported in a similar man-
ner to those shown in Table 5.1 (e.g., Reid, 2000).
Probability itself provides a meaningful scale defined by betting, and so
these categories are not a calibration of the Bayes factor, but rather a rough
descriptive statement about standards of evidence in a scientific investiga-
tion; see Kass and Raftery (1995) for a detailed review.

TABLE 5.1
Bayes Factor as Evidence against the Null Hypothesis
log10(B10)          B10          Evidence against H0
0–1/2 1–3.2 Weak
1/2–1 3.2–10 Substantial
1–2 10–100 Strong
>2 >100 Decisive
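For convenience, the scale in Table 5.1 can be coded directly; the helper below is an illustrative sketch (the function name evidence is ours, not from the text):

# Map log10(B10) to the evidence categories of Table 5.1
evidence <- function(log10.B10) {
  cut(log10.B10, breaks = c(0, 0.5, 1, 2, Inf), right = FALSE,
      labels = c("Weak", "Substantial", "Strong", "Decisive"))
}
evidence(c(0.3, 0.7, 1.5, 3.2))   # Weak, Substantial, Strong, Decisive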
Asymptotic approximations to the Bayes factor are easy to compute using


the output from standard packages that maximize likelihoods. Recall that
we described the asymptotic behavior of the Bayes factor in Section 5.3.1.1,
e.g., the Schwarz criterion, given as

S = log Pr(X|θ̂1, H1) − log Pr(X|θ̂0, H0) − (1/2)(d1 − d0) log(n),

which provides a rough approximation to the logarithm of the Bayes factor as


n → ∞. Note that Λ = 2{log Pr(X|θ̂1, H1) − log Pr(X|θ̂0, H0)} is the maximum log-likelihood ratio statistic, with approximately a χ²_{d1−d0} distribution. Conse-
quently, Bayes factor type tests can be viewed as frequentist tests due to the
asymptotic results, where corresponding critical values and powers can be
obtained accordingly.

Example
Consider a straightforward but common example. Let X = {X 1 ,..., X n } be
a random sample from some normal distribution with mean μ and vari-
ance σ 2 . It is desired to test the one-sided hypothesis H 0 : μ ≤ μ X 0 versus
H1 : μ > μ_X0, where μ_X0 is a prespecified number of interest. Let X̄ denote the sample mean, X̄ = n^(−1) Σ_{i=1}^n Xi. We then consider two scenarios with
known/unknown value of variance σ 2 .

Scenario 1: (The variance σ 2 is assumed to be known.) Let the prior distribu-


tion function of μ be chosen in a conjugate manner to be normal,
N(μ 0 , τ 20 ). Then the posterior density is

Pr(μ|X) ∝ π(μ) Pr(X|μ) ∝ exp( −(1/2){ (μ − μ0)²/τ0² + Σ_{i=1}^n (Xi − μ)²/σ² } ),

which is associated with a normal density function with the mean and
the variance as

μn = (μ0/τ0² + n X̄/σ²) / (1/τ0² + n/σ²),   τn² = 1 / (1/τ0² + n/σ²),

respectively. Therefore, the posterior probability of H 0, denoted by q0, is


Pr(μ ≤ μ_X0|X) = Φ((μ_X0 − μn)/τn), where Φ is the standard normal distribution function. Denote the value of the prior probability of H 0, Pr(μ ≤ μ_X0) = Φ((μ_X0 − μ0)/τ0), by p0. Thus, the Bayes factor, the ratio of
the posterior odds to the prior odds, has the form

B10 = Pr(X|H1)/Pr(X|H0) = [ Pr(μ > μ_X0|X)/Pr(μ ≤ μ_X0|X) ] / [ Pr(μ > μ_X0)/Pr(μ ≤ μ_X0) ] = (1 − q0)p0 / ((1 − p0)q0).
Note that the prior precision, 1/τ0², and the data precision, n/σ², play equivalent roles in the posterior distribution. Thus, for a large sample size n, the posterior distribution is largely determined by σ² and the sample mean X̄; see Gelman et al. (2003) for detailed discussion.
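A short R sketch of Scenario 1 is given below (all numerical values are assumed for illustration: prior μ ~ N(0, 1), known σ = 1, μ_X0 = 0, and hypothetical data generated with true mean 0.3):

# Scenario 1 sketch: conjugate normal prior, known variance
set.seed(1)
n <- 20; sigma <- 1
x <- rnorm(n, mean = 0.3, sd = sigma)                              # hypothetical data
mu0 <- 0; tau0 <- 1; muX0 <- 0
mu.n  <- (mu0/tau0^2 + n*mean(x)/sigma^2)/(1/tau0^2 + n/sigma^2)   # posterior mean
tau.n <- sqrt(1/(1/tau0^2 + n/sigma^2))                            # posterior sd
q0 <- pnorm((muX0 - mu.n)/tau.n)                                   # posterior Pr(mu <= muX0 | X)
p0 <- pnorm((muX0 - mu0)/tau0)                                     # prior Pr(mu <= muX0)
B10 <- ((1 - q0)*p0)/((1 - p0)*q0)                                 # posterior odds / prior odds
B10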
Scenario 2: (The variance σ 2 is assumed to be unknown.) Let the conditional
distribution of μ given σ² be normal and the marginal distribution of σ² be a scaled-inverse-χ² distribution, that is,

μ|σ² ~ N(μ0, σ²/κ0),   σ² ~ Inverse-χ²(ν0, σ0²),

where the probability density function of the scaled-inverse-χ²(ν0, σ0²) distribution is f(y; ν0, σ0²) = (σ0²ν0/2)^(ν0/2) / Γ(ν0/2) × y^(−(1+ν0/2)) exp( −ν0σ0²/(2y) ), y > 0. Then
the joint prior density, a product form of Pr(μ|σ 2 )Pr(σ 2 ), is

Pr(μ, σ²) ∝ σ^(−1) (σ²)^(−(ν0/2 + 1)) exp( −(ν0σ0² + κ0(μ0 − μ)²)/(2σ²) ),

which is denoted as Normal-Inverse-χ²(μ0, σ²/κ0; ν0, σ0²). Let s² denote the sample variance, that is, s² = (n − 1)^(−1) Σ_{i=1}^n (Xi − X̄)².
Correspondingly, the joint posterior density is

Pr(μ, σ²|X) ∝ σ^(−1) (σ²)^(−(ν0/2 + 1)) exp( −(ν0σ0² + κ0(μ0 − μ)²)/(2σ²) ) (σ²)^(−n/2) exp( −((n − 1)s² + n(X̄ − μ)²)/(2σ²) ),

which is Normal-Inverse-χ²(μn, σn²/κn; νn, σn²), where

μn = κ0/(κ0 + n) μ0 + n/(κ0 + n) X̄,   κn = κ0 + n,   νn = ν0 + n,
νn σn² = ν0σ0² + (n − 1)s² + κ0n/(κ0 + n) (X̄ − μ0)².

To sample from the joint posterior distribution, we first draw σ 2 from its
marginal posterior distribution Inverse-χ 2 ( νn , σ 2n ), then draw μ from its
normal conditional posterior distribution N(μn, σ²/κn), using the gener-
ated value of σ 2.
The marginal posterior density of σ 2 is a scaled-inverse- χ 2 distribu-
tion, i.e.,

σ 2 |X ~ Inverse − χ 2 ( νn , σ n2 ).

The conditional posterior density of μ, given σ 2, is proportional to the joint


posterior density in Scenario 1, where the value of σ 2 is assumed to be
known, i.e.,

μ|σ², X ~ N(μn, σ²/κn).
Integration of the joint posterior density with respect to σ 2 shows that


the marginal posterior density for μ follows a nonstandardized Stu-
dent’s t distribution,

Pr(μ|X) ∝ ( 1 + κn(μn − μ)²/(νn σn²) )^(−(νn + 1)/2) = tνn(μn, σn²/κn),

where the probability density function of the nonstandardized Student's t distribution tν(μ, σ) is f(y; ν, μ, σ) = Γ((ν + 1)/2) / ( Γ(ν/2) (πν)^(1/2) σ ) ( 1 + (1/ν)((y − μ)/σ)² )^(−(ν + 1)/2), with Γ(u) = ∫_0^∞ t^(u − 1) exp(−t) dt. Denote the corresponding distribution function as T(y; ν, μ, σ) = ∫_{−∞}^y f(t; ν, μ, σ) dt. In this case, the posterior probability of H 0, denoted by q0, is Pr(μ ≤ μ_X0|X) = T(μ_X0; νn, μn, σn²/κn).


The value of the prior probability of H 0, denoted by p0, is Pr(μ ≤ μ_X0) = ∫_{−∞}^{μ_X0} ∫_0^∞ Pr(μ, σ²) dσ² dμ, which can be computed based on the previous derivation. Thus, the Bayes factor has the form of B10 = (1 − q0)p0 / ((1 − p0)q0).
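The two-step sampling scheme and the closed-form posterior probability q0 of Scenario 2 can be checked numerically; the sketch below uses assumed hyperparameters (μ0 = 0, κ0 = 1, ν0 = 1, σ0² = 1, μ_X0 = 0) and hypothetical data:

# Scenario 2 sketch: Normal-Inverse-chi^2 prior, unknown variance
set.seed(2)
x <- rnorm(25, mean = 0.4, sd = 1.2)                   # hypothetical data
n <- length(x); xbar <- mean(x); s2 <- var(x)
mu0 <- 0; kappa0 <- 1; nu0 <- 1; sigma0.sq <- 1; muX0 <- 0
kappa.n <- kappa0 + n; nu.n <- nu0 + n
mu.n <- (kappa0*mu0 + n*xbar)/kappa.n
sigma.n.sq <- (nu0*sigma0.sq + (n - 1)*s2 + kappa0*n*(xbar - mu0)^2/kappa.n)/nu.n
# draw sigma^2 from Inverse-chi^2(nu.n, sigma.n.sq), then mu | sigma^2 from N(mu.n, sigma^2/kappa.n)
M <- 10000
sigma.sq <- nu.n*sigma.n.sq/rchisq(M, df = nu.n)
mu <- rnorm(M, mean = mu.n, sd = sqrt(sigma.sq/kappa.n))
q0.mc <- mean(mu <= muX0)                              # Monte Carlo estimate of Pr(mu <= muX0 | X)
q0 <- pt((muX0 - mu.n)/sqrt(sigma.n.sq/kappa.n), df = nu.n)   # via the nonstandardized t distribution
c(q0.mc, q0)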
The results based on the one-sample problem can be easily generalized to the two-sample case. A classic example would be to compare the means of two normal distributions with unknown variances. Let X1 = {X11,..., X1n1} and X2 = {X21,..., X2n2} be two independent random samples drawn from normal distributions N(μ1, σ1²) and N(μ2, σ2²), respectively, where μ1, μ2, σ1², and σ2² are unknown. We are interested in testing H0 : μ1 = μ2, σ1² = σ2² versus H1 : μ1 ≠ μ2, σ1² ≠ σ2².
As was mentioned above, Bayes factor type test statistics can be used in the
context of traditional tests via a control of the corresponding Type I error
rate, e.g., employing the asymptotic evaluations of the Bayes factor struc-
tures. We illustrate this principle in the following example.

Example
Let X = {X 1 ,..., X n } , n = 30, be a random sample drawn from Y − 1, where
Y ~ exp(λ), i.e., f ( x ; λ) = λ exp( −λ( x + 1)), λ > 0, x > −1, is the density func-
tion of the observations. Denote the mean of X by θ (θ = 1/λ − 1). Suppose
it is of interest to test H 0 : θ = 0 versus H 1 : θ ≠ 0 . Under the alternative
hypothesis, the prior function for θ is assumed to be the Uniform[0, 2]-dis-
tribution function. In this case, based on the Schwarz criterion approxi-
mation, the logarithm of the relevant maximum likelihood ratio can be well approximated by log B10 + log(n)/2 as n → ∞. Twice the corresponding maximum log-likelihood ratio has an asymptotic χ1² distribution under the null hypothesis. Comparing 2 log B10 + log(n) with the critical values related to the χ1² distribution, one can provide an approach to make statistical decision rules from the
traditional frequentist test perspective. Implementation of the testing
procedure is presented in the following R code:

> n<-30
> lambda.true<-1/2
> 1/lambda.true-1 # theta.true
[1] 1
>
> set.seed(123456)
> x<-rexp(n,lambda.true)-1
> x
[1] 0.19113728 0.01426038 1.51633983 1.11874795 1.27664726 2.94610447
-0.86056815 4.24772281 -0.76769938
[10] 0.22111551 -0.72606026 1.06624316 0.10882527 2.32608560 3.73953110
1.89499414 2.52307509 1.35740018
[19] -0.76989244 -0.90812499 -0.83491881 0.29580491 3.72774643 -0.01274517
-0.57838195 2.79788529 0.74200029
[28] -0.93858993 1.18364253 -0.28730946
>
> theta0<-0 # EX under H_0
> lambda0<-(theta0+1)^(-1)
>
> neg.loglike<-function(x,theta){
+ length(x)*log(theta+1)+(sum(x)+length(x))/(theta+1)
+ }
>
> # Bayes factor
> # likelihood
> like<-function(x,theta){
+ exp(-(length(x)*log(theta+1)+(sum(x)+length(x))/(theta+1)))
+ }
>
> # prior
> ll<-0;uu<-2
> prior<-function(theta) 1/(uu-ll)*(theta<=uu)*(theta>=ll)
>
> integrand<-function(theta) like(theta,x=x) #*prior(theta)
> BF<-integrate(Vectorize(integrand), ll, uu)$value/like(theta0,x=x)
> log10(BF)
[1] 3.227941
> stat.est<-log(BF)+1/2*log(n)
> p.BF<-1-pchisq(2*stat.est,df=1)
> p.BF
[1] 1.920634e-05
> p.t<-t.test(x,mu=theta0)$p.val
> p.t
[1] 0.003940047

In this example, the value of log10(B10) is 3.2279, greater than 2, suggesting decisive evidence against H 0. From the traditional testing viewpoint, the
approach based on the Bayes factor leads to a p-value far below 0.0001,
suggesting the rejection of the null hypothesis, which is consistent with the
result based on the t-test shown in the R code above.
The following points about Bayes factors should be emphasized: (1)
Bayes factors provide a way to evaluate evidence in favor of a null hypoth-
esis through the incorporation of external information; (2) Bayes factor based
decision-making procedures can be easily applied with respect to non-
nested hypotheses (see Section 5.3.1.1 for the definition of nested
hypotheses); (3) Technically, Bayes factors are simpler to compute as com-
pared to deriving non-Bayesian significance tests based on “nonstandard”
statistical models that do not satisfy common regularity conditions; (4)
When estimation or prediction is of interest, Bayes factors can be con-
verted to weights corresponding to various models so that a composite
estimate or prediction may be obtained that takes into account structural
or model uncertainty; (5) The use of Bayes factors allows us to feasibly assess the sensitivity of our conclusions relative to the choice of prior dis-
tribution; and (6) As shown in Section 5.2, Bayes factor type procedures
can provide optimal decision rules as compared to other approaches. For
more details, we refer the reader to Kass and Raftery (1995) as well as Vex-
ler et al. (2010b).

5.3.4 A Data Example: Application of the Bayes Factor


In order to illustrate a use of the Bayes factor mechanism, we consider a
study of survival among turtles where the question of comparing two
nested models is of interest. In this study, 244 turtle eggs of the same age
from 31 clutches (or families) were removed from their nests in a site in Illi-
nois on the bank of the Mississippi River and taken to the laboratory where
they were incubated and hatched (Janzen et al., 2000). Several days later, the
baby turtles were released from the same place where the eggs were found.
The turtles that traveled successfully to the water were marked as “sur-
vived.” Five days after their release, turtles not identified as having sur-
vived were assumed dead. The birth weight of each turtle was collected as
a covariate. The objective is to assess the effect of birth weight on survival
and to determine whether there is any clutch effect on survival. Figure 5.3
shows a scatterplot of the birth-weight effect on survival status as well as
the clutch effect on survival. The clutches are numbered from 1 through 31
in an increasing order of average birth weight of the turtles. Figure 5.3 sug-
gests that the heaviest turtles tend to survive and the lightest ones tend to
die. It also suggests some variability in survival rates across clutches. For
example, in one clutch (with average birth weight 5.41), only 2 turtles out of
12 survived while in a second clutch (with average birth weight 7.50), 9 tur-
tles out of 11 survived.
FIGURE 5.3
The snapshot from the paper by Sinharay and Stern (2002) regarding the scatterplot with the
clutches sorted by average birth weight.

Let y ij denote the response (survival status with 1 denoting survival) and
xij the birth weight of the jth turtle in the ith clutch, i = 1, 2...m = 31, j = 1, 2,...ni.
The model we fit to the data is

y ij |pij ~ bern( pij ), i = 1, 2...m = 31, j = 1, 2,...ni ;

pij = Φ(α 0 + α 1xij + bi ), i = 1, 2...m = 31, j = 1, 2,...ni ;


bi|σ² ~ i.i.d. N(0, σ²), i = 1, 2,..., m,   Φ(u) = (2π)^(−1/2) ∫_{−∞}^u e^(−t²/2) dt,

where the bi’s are random effects for clutch (family). The clutch effects are
assumed to be the same for all birth weights. To assess the importance of
clutch effects, we compare our alternative model, the probit regression
model with random effects, with the null model, a simple probit regression
model with no random effects.
Suppose p0 (α) is the prior distribution for α under the null model, while
the prior distribution for (α , θ) under the alternative model is p(α , θ) =
p(σ 2 )p1 (α|σ 2 ). Then, the Bayes factor for comparing the alternative model
(M1 ) against the null model (M 0 ) is BF10 = p( y|M1 )/ p( y|M 0 ), with
p(y|M1) = ∫ p(y|α, b) p(b|σ²) p1(α|σ²) p(σ²) db dα dσ²,

p(y|M0) = ∫ p(y|α, b = 0) p0(α) dα.

For this example,

p(y|α, b) = ∏_{i=1}^m ∏_{j=1}^{ni} {Φ(α0 + α1xij + bi)}^(yij) {1 − Φ(α0 + α1xij + bi)}^(1 − yij),

p(y|α, b = 0) = ∏_{i=1}^m ∏_{j=1}^{ni} {Φ(α0 + α1xij)}^(yij) {1 − Φ(α0 + α1xij)}^(1 − yij),

p(b|σ²) ∝ (σ²)^(−m/2) exp{ −Σ_{i=1}^m bi²/(2σ²) }.

Furthermore, we use a proper vague prior distribution on α, a bivariate nor-


mal distribution with mean 0 and variance 20.1 under both models.
The approximated value of the Bayes factor, obtained using the impor-
tance sampling method, with an importance sample size of 5,000 under both
models, is 0.31 with an estimated standard deviation of 0.01. The value of the
Bayes factor indicates some evidence in favor of the null model.
We then investigate the sensitivity of the Bayes factor to the choice of prior
distributions. The prior distributions on α under the two models do not affect
the Bayes factor much because there are 244 data points in the dataset. How-
ever, the prior distribution for the variance component σ 2 is influential on any
posterior inference because our ability to learn about σ 2 is determined by the
number of clutches rather than the number of animals. Furthermore, there is
little prior information available for σ 2 because very few studies like the tur-
tle study have been performed. To study the effect of the choice of the prior
distribution for σ 2 on the Bayes factor, let BF10, σ 2 denote the “point mass prior
Bayes factor” comparing the probit regression model with random effects,
where the variance component is fixed at σ 2 , against the simple probit regres-
sion model without any variance component. For grid values of σ 2 , the

approximated values of the Bayes factor, BF̂_{10,σ²}, obtained by the importance sampling method are computed. Figure 5.4 presents a plot of BF̂_{10,σ²} against σ². The figure demonstrates that BF̂_{10,σ²} = 1 when σ² = 0, implying that the
two models are identical at that value. The estimated Bayes factor then
increases with increase in σ 2 until it reaches its maximum value of about 4.55
at around σ 2 = 0.09. This is sensible because the restricted maximum likeli-
hood estimate (REML) of σ 2 is 0.091. Also, when it comes to choosing between
a small hypothesized value (by small here we mean less than 0.3) of σ 2 and σ 2
FIGURE 5.4
The snapshot from the paper by Sinharay and Stern (2002) regarding the plot of BF̂_{10,σ²} against σ².

equal to zero, the approximated Bayes factor favors the small positive value of
σ 2 . However, when it comes to choosing between a large value of σ 2 and σ 2
equal to zero, the approximated Bayes factor favors the σ 2 equal to zero; we
refer the reader to Sinharay and Stern (2002) for more details.

5.4 Remarks
The Bayes factor, the ratio of the posterior odds (not probability) of a hypoth-
esis to the prior odds, is equal to a likelihood ratio in the simplest case (and
only then); see Good (1992) for details. It can be described as a multiplicative
weight of evidence. In the frequentist setting of testing statistical hypothe-
ses, problems may arise when inference about the truth or believability of
the null hypothesis is based on the classical formulation of a p-value (see
forthcoming material in this book regarding p-values). In this context, it
should be noted that the p-value is not the probability that the null hypoth-
esis is true. Unfortunately, many practitioners and students have tried to use
the p-value to measure the evidence of the null hypothesis as the probabil-
ity that the null hypothesis is true. This is the wrong concept, since, in
general, p-values are random variables. Formal arguments about incorrect
interpretations of p-values are provided in Chapter 9. We also remark
again (as in Chapter 4) that almost no null hypothesis is exactly true in
practice. Consequently, when sample sizes are large enough, almost any
null hypothesis will have a tiny p-value, and hence will be rejected at con-
ventional levels. As alternatives to the classical formulation of p-values for
testing hypotheses, Bayes factors have been suggested as a measure of the
extent to which the observed data align with a given hypothesis by incorpo-
rating prior information about the parameter of interest. The use of the p-value should not be discarded altogether. A small p-value does not
necessarily mean that we should reject the null hypothesis and leave it at
that. Rather, it is a red flag indicating that something is up; the null hypoth-
esis may be false, possibly in a substantively uninteresting way, or maybe we
got unlucky. On the other hand, a large p-value does mean that there is not
much evidence against the null. For example, in many settings the Bayes fac-
tor is bounded above by 1/p-value. Even Bayesian analyses can benefit from
classical testing at the model-checking stage (see Box (1980) or the Bayesian
p-values in Gelman et al. (2013)). In this context, for more details, we refer the
reader to Marden (2000).
Warning: Assume the parameter of interest is θ. In the frequentist manner,
the meaning of testing statistical hypotheses, say H 0 : θ = θ0 versus
H 1 : θ = θ1 ≠ θ0 , is clear. In the Bayesian setting, we believe that θ is random.
Then it is possible that the statement Pr(θ = θ0) = Pr(θ = θ1) = 0 makes sense. If
your statistical world is completely Bayesian, you compare the model
H 0 : θ ~ π 0 with the model H 1 : θ ~ π1 , where π 0 and π1 are prior density func-
tions of θ. In this case, we can still write H 0 : θ = θ0 , assuming that π 0 is an
appropriate degenerate density function (e.g., Berger, 1985; Vexler et al.,
2010b). Let us consider briefly the following simple scenario, in which using
the Bayes factor procedure we would like to compare the following models:
M0 : q ~ π0 = Unif[0,1] versus M1 : q ~ π1 = Unif[1,1.5], where q is a quantile
that satisfies F(q) = α ≡ 0.1 for the independent and identically F-distributed
data points X1,..., Xn. Let X1,..., Xn be exponentially distributed with the density function f(x|λ_q), where λ_q satisfies ∫_0^q f(u|λ_q) du = α = 0.1. Then the Bayes factor statistic is
BF = [ ∫_1^1.5 ∏_{i=1}^n f(Xi|λ_q) dq ] / [ ∫_0^1 ∏_{i=1}^n f(Xi|λ_q) dq ].

For illustrative purposes, we can simulate 5,000 times the scenario men-
tioned above under the model M0 with n = 25 using the following R code:

alpha<-0.1
n<-25
MC<-5000
BF<-array()
for(mc in 1:MC)
{
a0<-0
b0<-1
a1<-b0
b1<-b0+0.5
q<-runif(1,a0,b0) # model M0
#q<-runif(1,a1,b1)# model M1
QQ<-function(u) pexp(q,u)-alpha
lambda<-uniroot(QQ,c(0,1000000))$root # to find lambda corresponding to the q’s value
x<-rexp(n,lambda)
Intr <-function(uu){
fix<-uu
QQ1<-function(u) pexp(fix,u)-alpha
QQ1V<-Vectorize(QQ1)
lambda1<-uniroot(QQ1V,c(0,100000000))$root
S<-sum(log(dexp(x,lambda1)))
return(exp(S))
}
# Intr(uu) is the function prod_{i=1}^{n} f(X_i | lambda_q), where lambda_q corresponds to the argument uu
Inegr<-Vectorize(Intr)
Num<-integrate(Inegr,a1,b1)[[1]]
Den<-integrate(Inegr,a0,b0,stop.on.error = FALSE)[[1]]

BF[mc]<-Num/Den

print(length(BF[BF<1]))
print(length(BF[(BF>=1)&(BF<10^0.5)]))
print(length(BF[(BF>=10^0.5)&(BF<10^1)]))
print(length(BF[(BF>=10^1)&(BF<10^1.5)]))
}

We would like to invite the reader to run the code above and obtain the
results, evaluating them with respect to Table 5.1.
Note again that, from the frequentist perspective, the approaches mentioned in
this chapter show that Bayes factor type procedures can provide very effi-
cient decision-making mechanisms.
6
A Brief Review of Sequential Methods

6.1 Introduction
Retrospective studies are generally derived from already existing databases or
combining existing pieces of data, e.g., using electronic health records to exam-
ine risk or protection factors in relation to a clinical outcome, where the out-
come may have already occurred prior to the start of the analysis. The
investigator collects data from past records, with no or minimal patient follow-
up, as is the case with a prospective study. Many valuable studies, such as the
first major case-control study published by Lane-Claypon (1926), investigating
risk factors for breast cancer, were retrospective investigations. Prospective
studies may be analyzed similar to retrospective studies or they may have a
sequential element to them, which we will describe below, that is, data may be
analyzed continuously or in stages during the course of study.
In retrospective studies it is common to have sources of error due to con-
founding and various biases such as selection bias, e.g., bias due to missing
data elements, misclassification, or information bias. In addition, the inferen-
tial decision procedures for retrospective studies are based on data sets,
where the corresponding sample sizes are fixed in advance and can often-
times be quite large. It is possible in these settings to have underpowered or
overpowered designs. For example, medical claims data may lead to an over-
powered design, i.e., a study that will detect small differences from the null
hypothesis that may not be scientifically interesting. Consequently, retro-
spective studies may induce unnecessarily high financial and/or human cost
or statistically significant results that are not necessarily meaningful. Armit-
age (1981) stated that

The classical theory of experimental design deals predominantly with


experiments of predetermined size, presumably because the pioneers of
the subject, particularly R.A. Fisher, worked in agricultural research,
where the outcome of a field trial is not available until long after the
experiment has been designed and started.

In fact, in an experiment in which data accumulate steadily over a period


of time, it is natural to monitor results as they occur, with a view toward

taking action such as certain modifications or early termination of the study.


For example, a disease prevention trial or an epidemiological cohort study
of occupational exposure may run on a time scale of tens of years. It is not
uncommon for phase III cancer clinical trials, where the primary objective
is final confirmation of safety and efficacy, with a survival endpoint, to run
10 years.
From a logistical and cost-effectiveness point of view it is natural to employ
sequential statistical procedures. Sequential analysis can be described as
“a method allowing hypothesis tests to be conducted on a number of occa-
sions as data accumulate through the course of a trial. A trial monitored in
this way is usually called a sequential trial” (Everitt and Palmer, 2010). In
fully sequential approaches statistical conclusions are conducted after the
collection of every observation. In group sequential testing, tests are con-
ducted after batches of data are observed. Both approaches allow one to
draw decisions during the data collection and possibly reach a final
conclusion at a much earlier stage as is the case in classical hypothesis test-
ing. In classical retrospective hypothesis testing where the sample size is
fixed at the beginning of the experiment and the data collection is conducted
without considering the data and/or the analysis, the sample size in sequen-
tial analysis is a random variable (see Chapters 2 and 4 for several examples).
The sequential analysis methodology possesses many advantages,
including economic savings in sample size, time and cost. It also has advan-
tages in terms of ethical considerations and monitoring, e.g., a clinical trial
might be stopped early if a new treatment is more efficacious than originally
assumed. Sequential analysis methods have different operating characteris-
tics as compared to corresponding fixed sample size procedures. Given the
potential cost savings and the interesting theoretical properties of sequential
tests, many corresponding variations have become well established and
thoroughly investigated. There are formal guidelines for use of sequential
statistical procedures in many research areas (DeGroot, 2005; Dmitrienko
et al., 2005). For example, in terms of government regulated trials in the
United States there are formal published requirements pertaining to interim
analyses and the reporting of the corresponding statistical results. It is stated
in a Food and Drug Administration Guideline (1988) that

The process of examining and analyzing data accumulating in a clinical trial,


either formally or informally, can introduce bias. Therefore all interim analyses,
formal or informal, by any study participant, sponsor staff member, or data
monitoring group should be described in full even if treatment groups were not
identified. The need for statistical adjustment because of such analyses should be
addressed. Minutes of meetings of the data monitoring group may be useful (and
may be requested by the review division).

Armitage (1975, 1991) argued that ethical considerations demand a trial be


stopped as soon as possible when there is clear evidence of the preference for
one or more treatments over the standard of care or placebo, which logically
leads to the use of a sequential trial. In his publications cited above, Armit-
age described a number of sequential testing methods and their application
to trials comparing two alternative treatments.
Sequential methods are well suited for use in clinical trials with short-term
outcomes, e.g., a one-month follow-up value. When dealing with human sub-
jects, regular examinations of accumulating results and early termination of
the study are ethically desirable (Armitage, 1975). Sequential methods touch
much of modern statistical practice and hence we cannot possibly include all
relevant theory and examples. In this chapter, we will outline several of the
more well-applied sequential testing procedures, including two-stage
designs, the sequential probability ratio test, group sequential tests, and
adaptive sequential designs.
Warning: (1) Note that the scheme of sequential testing may result in very com-
plicated estimation issues in the post-sequential analyses of the data. Investiga-
tors should be careful applying standard estimation techniques to data obtained
via a sequential data collection procedure. For example, estimation of the sample
mean and variance of the data can be very problematic in terms of obtaining
unbiased estimates (Liu and Hall, 1999). (2) In the retrospective setting, one can
pretest parametric assumptions regarding underlying data and implement a
parametric statistical procedure, e.g., estimation, adjusting its results with respect
to the pretest. Oftentimes, parametric sequential procedures suppose paramet-
ric distribution assumptions even before the data is observed. (3) Common sta-
tistical sequential schemes stop at random times with lengths that depend on
observations. Thus, in general, data obtained after sequential analyses cannot be
evaluated for goodness-of-fit using the classical tests developed for retrospective
testing. For example, the Shapiro–Wilk test for normality (e.g., Vexler et al., 2016a)
has a Type I error rate that does not depend on the data distribution. This prop-
erty is not held if the number of observations is random with a distribution that
depends on underlying data characteristics.
In Sections 6.2, 6.5, 6.6, and 6.7 we introduce two-stage sequential proce-
dures, concepts of group sequential tests, adaptive sequential designs and
futility analysis. The classical sequential probability ratio test is detailed in
Section 6.3. In Section 6.4 we show a technique that can be applied to evalu-
ate asymptotic properties of stopping times used in statistical sequential
schemes. In Section 6.8 we provide some remarks regarding post-sequential
procedures data evaluations.

6.2 Two-Stage Designs


The rudiments of sequential analysis can be traced to the works of Huygh-
ens, Bernoulli, DeMoivre, and Laplace with respect to the gambler’s ruin
problem (Lai, 2001). The formal applications of sequential procedures started
in the late 1920s in the area of statistical quality control in manufacturing


production. The early two-stage designs, which can be extended to multi-
stage designs, were proposed for industrial acceptance sampling of a pro-
duction lot where the decision was that the lot met specification or the lot
was defective. In a two-stage acceptance sampling plan, there are six param-
eters: the sample sizes for each stage ( n1 and n2 ), acceptance numbers ( c1
and c2), and rejection numbers (d1 and d2), where d1 > c1 + 1 and d2 = c2 + 1. To
implement the plan, one first takes an initial sample of n1 items, and if this
contains c1 defective items or fewer, the lot is accepted; but if d1 defective
items or more are found, the lot is rejected. Otherwise, the decision is deferred
until a second sample of size n2 is inspected. The lot is then accepted if the
total cumulative number of defective items is less than or equal to c2 and
rejected if this number is greater than or equal to d2; see Dodge and Romig
(1929) for more details. The two-stage sequential testing idea can be general-
ized to a multistage sampling plan where up to K stages are permitted (see
Bartky, 1943, for details). This approach was subsequently developed by
Freeman et al. (1948) to form the basis of the U.S. military standards for
acceptance sampling.
A similar problem arises in the early stages of drug screening and in
Phase II clinical trials, where the primary objective is to determine whether
a new drug or regimen has minimal level of therapeutic efficacy, i.e., suf-
ficient biological activity against the disease under study, to warrant more
extensive development. Such trials are often conducted in a multi-
institution setting where designs of more than two stages are difficult to
manage. Simon (1989) proposed an optimal two-stage design, which has
the goal to “minimize the expected sample size if the new drug or regi-
men has low activity subject to constraints upon the size of the Type I
error rate (α) and the Type II error rate (β).” These types of interim stopping
rules are often termed a futility analysis; we will discuss futility analysis
in detail in Section 6.7.
Simon’s two-stage designs are based on testing for the true response prob-
ability p that H 0 : p ≤ p0 versus H 1 : p ≥ p1, for some desirable target levels p0
and p1. Each Simon’s two-stage design is indexed by four numbers n1, n2 , r1,
and r, where n1 and n2 denote the numbers of patients studied in the first and
second stage, and r1 and r denote the stopping boundaries in the first and
second stage, respectively. Let the total sample size be n = n1 + n2 . The study
is terminated at the end of the first stage and the drug is rejected for further
investigation if r1 or fewer responses out of n1 participants are observed; oth-
erwise, the study proceeds to the second stage, with a total sample size n,
and the drug is rejected for further development if r or fewer responses are
observed at the end of the second stage, i.e., at the end of the study. A Type I
error occurs when there are more than r1 responses at the end of the first
stage and more than r responses at the end of the study when p ≤ p0. A Type
II error occurs if there are r1 or fewer responses in the first stage or there are
r or fewer responses at the end of the study when p ≥ p1. The values of n1, n2 , r1,
and r are found for fixed values of p0 , p1, α (the Type I error rate), and β (the
Type II error rate), and are determined as follows.
The decision of whether or not to terminate after the first or the second
stage is based on the number of responses observed. The number of
responses X , given a true response rate p and a sample size m, follows a
binomial distribution, that is, Pr(X = x) = b( x ; p , m) and Pr(X ≤ x) = B( x ; p , m),
where b and B denote the binomial probability mass function, and the bino-
mial cumulative distribution function, respectively. Therefore, the total
sample size is random. The probability of early termination, and the
expected sample size depend on the true probability of response p. By using
exact binomial probabilities, the probability of early termination after the first stage, denoted as PET(p), has the form B(r1; p, n1), and the expected sample size based on a true response probability p for this design can be defined as E(N|p) = n1 + {1 − PET(p)} n2. The probability of rejecting a drug
with a true response probability p, denoted as R( p) , is


R(p) = B(r1; p, n1) + Σ_{x = r1+1}^{min{n1, r}} b(x; p, n1) B(r − x; p, n2).

For pre-specified values of the parameters p0 , p1, α and β, if the null hypoth-
esis is true, then we require that the probability that the drug should be
accepted for further study in other clinical trials should be less than α, i.e.,
concluding that the drug is sufficiently promising. We also require that the
probability of rejecting the drug for further study should be less than β
under the alternative. That is, an acceptable design is one that satisfies the
error probability constraints R(p0) ≥ 1 − α and R(p1) ≤ β. Let Ω be the set of
all such designs. A grid search is used to go through every combination of n1,
n2 , r1 and r, with an upper limit for the total sample size n, usually between
0.85 and 1.5 times the sample size for a single stage design (Lin and Shih,
2004). The optimal design under H 0 is the one in Ω that has the smallest
expected sample size E( N |p0 ). For more details, we refer the reader to Simon
(1989). Extensions of this design for testing both efficacy and/or futility after
the first set of subjects has enrolled are well developed, e.g., see Kepner and
Chang (2003).
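The probability of early termination, the expected sample size, and R(p) defined above are straightforward to evaluate in R; the sketch below uses illustrative design parameters (n1 = 10, n2 = 19, r1 = 1, r = 5 with p0 = 0.1 and p1 = 0.3 are assumed values for demonstration, not an optimal design derived here):

# Operating characteristics of a two-stage design (n1, n2, r1, r)
simon.oc <- function(p, n1, n2, r1, r) {
  PET <- pbinom(r1, n1, p)                                   # Pr(early termination) = B(r1; p, n1)
  EN  <- n1 + (1 - PET)*n2                                   # expected sample size E(N | p)
  x   <- (r1 + 1):min(n1, r)
  R   <- PET + sum(dbinom(x, n1, p)*pbinom(r - x, n2, p))    # Pr(reject the drug)
  c(PET = PET, EN = EN, R = R)
}
simon.oc(p = 0.1, n1 = 10, n2 = 19, r1 = 1, r = 5)           # behavior under p0 = 0.1
simon.oc(p = 0.3, n1 = 10, n2 = 19, r1 = 1, r = 5)           # behavior under p1 = 0.3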

6.3 Sequential Probability Ratio Test


The modern theory of sequential analysis originates from the research of
Barnard (1946) and Wald (1947). In particular, the method of the sequential
probability ratio test (SPRT) has been the predominant influence of the
subsequent developments in the area. Inspired by the Neyman–Pearson
lemma (Chapter 3), Wald (1947) proposed the SPRT, which provides a method
of constructing efficient sequential statistical schemes.
The intuitive idea is this: while testing statistical hypotheses in many situ-
ations the decision-making procedures produce outcomes with respect to
conclusions of the form to reject or not to reject the corresponding null
hypothesis. Assume we have n = 100 observations and the appropriate likeli-
hood ratio test statistic based on first 35 data points is relatively too small.
Then we will have a very rare chance to reject the null hypotheses using the
full data. In this case we do not need to observe 100 data points to make the
test decision. Similarly, we can consider situations when the likelihood ratio
has relatively large values. Thus, one can terminate the testing procedure
when the likelihood ratio is greater or smaller than presumed threshold val-
ues instead of calculating the test statistic based upon n = 100 observations.
Let X 1 , X 2 ,... be a sequence of data points with joint probability density
function f . Consider a basic form for testing a simple null hypothesis
H 0 : f = f0 against a simple alternative hypothesis H 1 : f = f1, where f0 and f1
are known. Recall that the likelihood ratio has the form

LRn = f1(X1,..., Xn) / f0(X1,..., Xn).

The goal of the SPRT is to inform a decision as to which hypothesis is more


likely as soon as possible relative to the desired Type I and Type II error rates.
To accomplish this goal, observations are collected sequentially one at a time;
when a new observation has been made, a decision rule has to be made
among the following options: (1) not to reject the null hypothesis and stop
sampling; (2) reject the null hypothesis and stop sampling; or (3) collect
another observation as a piece of information and repeat (1) and (2). Towards
this end, we specify boundaries (thresholds) for the decision process given as
0 < A < 1 < B < ∞ , and sample X 1 , X 2 ,... sequentially until the random time N,
where

N = N A , B = inf{n ≥ 1 : LRn ∉( A, B)}.

In other words, we stop sampling at time N and decide not to reject H 0 if


LRN ≤ A or decide to reject H 0 if LRN ≥ B; see Figure 6.1 for an illustration.
The determination of A and B is given below.
Graphically we plot the likelihood value as a function of the sample size
and examine the time series process as to whether it crossed either the A or
B boundary. The R code for producing the plot is:

> # For any pre-specified A and B (0<A<1<B). For example, A=.11, B=9.
> A=.11
> B=9
FIGURE 6.1
The sampling process via the SPRT; the solid line shows the case where the sampling process
stops when LRN ≤ A and that leads to the decision not to reject H 0, and the dashed line repre-
sents the case where the sampling process stops when LRN ≥ B and we decide to reject H 0. The
shown overshoots have the values of LRN − A or LRN − B, respectively.

> muX0<-0 # A simple example where X~i.i.d. N(0,1) under H0


> muX1<-0.5 # A simple example where X~i.i.d. N(0.5,1) under H1
> L<-1
> n<-0
> L.seq.A<-c()
> while(L>A & L<B){
+ x<-rnorm(1,muX0,1)
+ L<-L*exp((muX1-muX0)*x+(muX0^2-muX1^2)/2) #dnorm(x,muX1,1)/dnorm(x,muX0,1)
+ L.seq.A<-c(L.seq.A,L)
+ n<-n+1
+ }
>
> L<-1
> n<-0
> L.seq.B<-c()
> while(L>A & L<B){
+ x<-rnorm(1,muX1,1)
+ L<-L*exp((muX1-muX0)*x+(muX0^2-muX1^2)/2)
+ L.seq.B<-c(L.seq.B,L)
+ n<-n+1
+ }
> # to plot
> par(mar=c(4,4,2,2))
> plot(L.seq.A,type="l",lty=1,ylim=c(0,9.3),yaxt="n",xlab="n",ylab="L")
> lines(L.seq.B,type="l",lty=2)
> abline(h=c(A,B))
> axis(2,c(A,1,B),label=c("A","1","B"),las=2)

The SPRT stopping boundaries: The thresholds A and B are chosen so that
the Type I error and Type II error probabilities are approximately equal
(bounded appropriately) to the prespecified values α and β, respectively, and
formally defined as

α = PrH0(LRN ≥ B) ≤ B^(−1)(1 − β)   and   β = PrH1(LRN ≤ A) ≤ A(1 − α).

Proof. We have

α = PrH0(LRN ≥ B) = Σ_{n=1}^∞ PrH0(LRn ≥ B, N = n)
= Σ_{n=1}^∞ EH0{I(LRn ≥ B, N = n)} = Σ_{n=1}^∞ ∫ I(LRn ≥ B, N = n) f0(x1,..., xn) dx1 ... dxn
= Σ_{n=1}^∞ ∫ I(LRn ≥ B, N = n) [f0(x1,..., xn)/f1(x1,..., xn)] f1(x1,..., xn) dx1 ... dxn
= Σ_{n=1}^∞ ∫ I(LRn ≥ B, N = n) [f1(x1,..., xn)/LRn] dx1 ... dxn
≤ B^(−1) Σ_{n=1}^∞ ∫ I(LRn ≥ B, N = n) f1(x1,..., xn) dx1 ... dxn = B^(−1) Σ_{n=1}^∞ EH1{I(LRn ≥ B, N = n)}
= B^(−1) PrH1(LRN ≥ B) = B^(−1)(1 − β).

Similarly, it can be derived that β = PrH1 (LRN ≤ A) ≤ A PrH0 (LRN ≤ A) =


A (1 − PrH0 (LRN ≥ B)) = A(1 − α). The proof is complete.
Note that the inequalities shown above account for values of the likelihood
ratio that may “overshoot” the boundary at any given decision point; e.g., see
Figure 6.1. Treating the above inequalities as approximate equalities and
solving for α and β leads to the useful error rate approximations

α ≈ (1 − A)/(B − A)   and   β ≈ A(B − 1)/(B − A);

or equivalently, solving for A and B leads to the approximations of the thresh-


old values
A ≈ β/(1 − α)   and   B ≈ (1 − β)/α,
where the symbol “≈” means that the overshoot values are disregarded.
The Wald approximation to the average sample number (ASN): The
expected stopping time E( N ) is often called the average sample number
(ASN). Being able to calculate the ASN is practically relevant in terms of the
logistics of planning a given study. To study the ASN and the asymptotic
properties of the stopping time N , we consider the following reformulation
and make an additional assumption that the observations X i , i ≥ 1, are iid.
Therefore, the log-likelihood ratio log (LRn) is a sum of iid random variables
Zi = log{f1(Xi)/f0(Xi)}, i.e., log(LRn) = Σ_{i=1}^n Zi = Sn, where we define log(LRn) as Sn for simplicity. Then the stopping time can be rewritten as

N = N c1 , c2 = inf{n ≥ 1 : Sn ≤ −c1 or Sn ≥ c2 },

where c1 = − log (A) and c2 = log (B) are positive thresholds.


Denote μ_{Zi} = Ei(Z1) = ∫ log{f1(x)/f0(x)} fi(x) dx and σ²_{Zi} = Ei(Z1 − μ_{Zi})², under
the hypothesis H i, i = 0,1. By Wald’s fundamental identity (see Proposition 4.3.2
in Chapter 4 as well as Wald, 1947), the Wald approximation to the expected
sample size can be expressed in the form

E0N ≅ μ_{Z0}^(−1) { α log((1 − β)/α) + (1 − α) log(β/(1 − α)) }, under H0,

and

E1N ≅ μ_{Z1}^(−1) { (1 − β) log((1 − β)/α) + β log(β/(1 − α)) }, under H1,

where μ Zk ≠ 0, k = 0, 1 . These approximations are obtained using Proposition


4.3.2, which provides

Ek{log(LRN)} = Ek{ Σ_{i=1}^N Zi } = Ek(N) μ_{Zk}, k = 0, 1,

where, ignoring values of the overshoots, we can approximate

log (LRN ) ≈ log(B)I (LRN ≥ B) + log( A)I (LRN ≤ A) ,


Ek {log (LRN )} ≈ log(B) PrH k (LRN ≥ B) + log( A) PrH k (LRN ≤ A) , k = 0, 1,

α = PrH0 (LRN ≥ B) = 1 − PrH0 (LRN ≤ A) , β = PrH1 (LRN ≤ A) = 1 − PrH1 (LRN ≥ B) ,

since LRN can take only values that satisfy LRN ≥ B, LRN ≈ B, or LRN ≤ A, LRN ≈ A, with A ≈ β(1 − α)^(−1) and B ≈ (1 − β)α^(−1).
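As a quick numerical check of these approximations, the sketch below evaluates E0N and E1N for the normal example considered later in this section (assumed values α = 0.05, β = 0.2, μ_{Z0} = −1/2, μ_{Z1} = 1/2):

# Wald approximations to the ASN under H0 and H1
alpha <- 0.05; beta <- 0.2
muZ0 <- -1/2; muZ1 <- 1/2
E0N <- (alpha*log((1 - beta)/alpha) + (1 - alpha)*log(beta/(1 - alpha)))/muZ0
E1N <- ((1 - beta)*log((1 - beta)/alpha) + beta*log(beta/(1 - alpha)))/muZ1
round(c(E0N, E1N), 4)   # approximately 2.6832 and 3.8129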
−1

Note that the Wald approximation to Ek ( N ), k = 0,1 underestimates the


true ASN. The accuracy of the Wald approximation to E( N ) is mostly deter-
mined by the amount that SN will tend to overshoot − c1 or c2. If this over-
shoot tends to be small, the approximations will be quite good; otherwise,
the approximation can be poor (Berger, 1985).

Asymptotic properties of the stopping time N : Define c = min(c1 , c2 ) and


suppose that E(Z1 ) = μ ≠ 0 , where c1 and c2 are defined above. It is clear that,


under H0, μ = μ_{Z0} = E0(Z1) = ∫ log{f1(x)/f0(x)} f0(x) dx < 0, whereas, under H1, μ = μ_{Z1} = E1(Z1) = ∫ log{f1(x)/f0(x)} f1(x) dx > 0 (see Section 3.2 for details).
Based on the central limit theorem result shown by Siegmund (1968), it is
easy to check that for μ > 0,

(c2/μ)^(−1/2) [N − (c2/μ)] →_d N(0, σ²/μ²), as c → ∞.

Similarly, if μ < 0, then (c1/|μ|)^(−1/2) [N − (c1/|μ|)] →_d N(0, σ²/μ²) as c → ∞.

See Martinsek (1981) as well as the next section for more detail. Additionally,
Martinsek (1981) presented the following result:

Theorem 6.3.1.

Let Z1 , Z2 , ... be iid with E(Z1 ) = μ ≠ 0, var(Z1 ) = σ 2 ∈(0, ∞), and assume
E|Z1 |p < ∞ , where p ≥ 2 . Then

(1) If μ > 0 and c2 = o(c1) as c → ∞, then {c2^(−p/2) |N − (c2/μ)|^p : c ≥ 1} is uniformly integrable and hence E|N − (c2/μ)|^r ~ (c2/μ)^(r/2) (σ/μ)^r E|N(0, 1)|^r as c → ∞, for 0 < r ≤ p;
(2) If μ < 0 and c1 = o(c2) as c → ∞, then {c1^(−p/2) |N − (c1/|μ|)|^p : c ≥ 1} is uniformly integrable and hence E|N − (c1/|μ|)|^r ~ (c1/|μ|)^(r/2) (σ/|μ|)^r E|N(0, 1)|^r as c → ∞, for 0 < r ≤ p. (Here the notation N(0, 1) presents a normally (0, 1) distributed random variable.)

When p = 2, the asymptotic behavior of var( N ) is shown in the following


corollary (Martinsek, 1981):
Corollary 6.3.1. Let Z1 , Z2 ,... be iid with E(Z1 ) = μ ≠ 0 , var(Z1 ) = σ 2 ∈(0, ∞).
Then
(1) If μ > 0 and c2 = o(c1 ) as c → ∞ , then var ( N ) ~ c2 σ 2 /μ 3 as c → ∞ ;
(2) If μ < 0 and c1 = o(c2) as c → ∞, then var(N) ~ c1σ²/|μ|³ as c → ∞,
where c = min (c1,c2).

Example:

As a typical example of the SPRT, we consider the situation where


X ~ N (μ , σ 2 ) , σ 2 = 1 , and it is desired to test H 0 : μ = μ 0 versus H 1 : μ = μ 1
with α = 0.05 and β = 0.2, where μ 0 = −1/ 2 and μ 1 = 1/ 2. For purpose of
the comparison of the required sample size, in addition to the SPRT, we
also considered the nonsequential likelihood ratio test (LRT), that is, the


fixed sample size test. Define X̄n = n^(−1) Σ_{i=1}^n Xi. In this case, it is clear that
the likelihood ratio is

LRn = exp( n(μ1 − μ0)X̄n + n(μ0² − μ1²)/2 ),
and

α = Pr(LRn > Cα|H0) = Pr(X̄n > log(Cα)/(n(μ1 − μ0)) + (μ1 + μ0)/2 | H0),

β = 1 − Pr(LRn > Cα|H1) = 1 − Pr(X̄n > log(Cα)/(n(μ1 − μ0)) + (μ1 + μ0)/2 | H1).

Therefore, since X n ~ N (μ , σ 2 /n), one can easily obtain that


log(Cα)/(n(μ1 − μ0)) + (μ1 + μ0)/2 = Φ^(−1)_{(μ0, σ²/n)}(1 − α), where Φ^(−1)_{(μ0, σ²/n)}(1 − α) denotes the (1 − α)th quantile of a normal distribution with a mean of μ0 and a variance of σ²/n. As a consequence, β = Pr(X̄n < Φ^(−1)_{(μ0, σ²/n)}(1 − α)|H1), and hence the fixed sample size n in the nonsequential LRT test can be obtained by solving Φ^(−1)_{(μ1, σ²/n)}(β) = Φ^(−1)_{(μ0, σ²/n)}(1 − α). In this example, assum-
ing α = 0.05 and β = 0.2, the nonsequential LRT test requires a sample
size that is approximately 6.1826 (6.1826 corresponds to how much
sample information is required to achieve the desired error probabili-
ties). Here we use the uniroot function in R to estimate the sample size.
The R code to obtain the results is shown below:

## set up parameters
> alpha <- 0.05 # the significance level
> beta <- 0.2 # power=1-beta=0.8
> muX0 <- -1/2
> muX1 <- 1/2
>
> ### Likelihood ratio test


> # Type II error rate (i.e., 1 - power) as a function of the number of observations n
> get.power <- function(n){
+ right.part <- qnorm(1-alpha,mean=muX0,sd=1/sqrt(n))
+ pnorm(right.part,mean=muX1,sd=1/sqrt(n))
+ }
> power <- 1-beta
> n.LR <- uniroot(function(n) get.power(n)-beta, c(0, 10000))$root
> n.LR
[1] 6.182566

For the sequential test, it can be easily obtained that the stopping
boundaries of the SPRT are A ≈ 4/19 and B ≈ 16 based on α = 0.05 and β = 0.2. Note first by simple calculation that μ_{Zi} = μi and σ²_{Zi} = 1, i = 0, 1, and thus the Wald approximation to the ASN is 2.6832 and 3.8129 under H0 and H1,
respectively. The ASNs obtained via the Monte Carlo simulations (we refer
the reader to Chapter 2 for more details related to Monte Carlo studies)
are 4.1963 and 5.6990 under the null and the alternative, respectively. This
proves that the Wald approximations are indeed underestimates of the ASN
(however, considerably smaller than the sample size required in the nonse-
quential LRT test). The R code to obtain the results is shown below:

> ##### the function to get the ASN via Monte Carlo simulations ######
> # case: "H0" for the null hypothesis; "H1" or "Ha" for the alternative
> # alpha: the type I error rate, i.e., presumed significance level
> # beta: the type II error rate, where the power is 1-beta
> # muX0: the mean specified under the null hypothesis
> # muX1: the mean specified under the alternative hypothesis
> # MC: the number of Monte Carlo repetitions
>
> get.n<-function(case="H0",alpha=0.05,beta=0.2,muX0=-1/2,muX1=1/2,MC=5000){
+ if (missing(alpha)|missing(beta)) stop("missing alpha value or beta value")
+ if (missing(muX0)|missing(muX1)) stop("missing mean value under H_0 or H_1") else {
+ if (case=="H0") mu<-muX0 else if (case=="H1"|case=="Ha") mu<-muX1 else stop("wrong specification of case")
+ }
+ if (!missing(alpha) & !missing(beta) & !missing(muX0) & !missing(muX1) & case%in%c("H0","H1","Ha")){
+ if (missing(MC)) MC<-5000
+ A<-beta/(1-alpha)
+ B<-(1-beta)/alpha
+ n.seq<-c()
+ for (i in 1:MC){
+ L<-1
+ n<-0
+ while(L>A & L<B){
+ x<-rnorm(1,mu,1)
+ L<-L*exp(-(2*(muX0-muX1)*x+muX1^2-muX0^2)/2)
+ n<-n+1
+ }
+ n.seq[i]<-n
+ }
+ return(n.seq)
+ }
+ }
>
> ## SPRT under H_0
> n.seq.0<-get.n(case="H0",alpha=0.05,beta=0.2,muX0=-1/2,muX1=1/2,MC=50000)
> mean(n.seq.0)
[1] 4.1963
> var(n.seq.0)
[1] 11.2889
>
> ## SPRT under H_1
> n.seq.1<-get.n(case="H1",alpha=0.05,beta=0.2,muX0=-1/2,muX1=1/2,MC=50000)
> mean(n.seq.1)
[1] 5.6990
> var(n.seq.1)
[1] 13.5441

The SPRT is very simple to apply in practice and this procedure typically
leads to lower sample sizes on average than fixed sample size tests given
fixed α and β.
For this example the asymptotic ASN calculated based on Theorem 6.3.1
under H 0 and H 1 is 3.1163 and 5.5452, respectively. The result is closer to the ASN
obtained via the Monte Carlo simulations as compared to the Wald approxima-
tion to the ASN. The asymptotic var ( N ) calculated based on Corollary 6.3.1
under H 0 and H 1 is 12.4652 and 22.1807, respectively, whereas the one obtained
via the Monte Carlo simulations is 11.2889 and 13.5441, respectively.
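These asymptotic values can be verified directly from Theorem 6.3.1 and Corollary 6.3.1; a short check, using c1 = −log A and c2 = log B with A = 4/19, B = 16 and σ² = 1, is given below:

# Asymptotic ASN and var(N) for the SPRT example
c1 <- -log(4/19); c2 <- log(16)
mu0 <- -1/2; mu1 <- 1/2; sigma2 <- 1
round(c(ASN.H0 = c1/abs(mu0), ASN.H1 = c2/mu1), 4)                     # 3.1163, 5.5452
round(c(var.H0 = c1*sigma2/abs(mu0)^3, var.H1 = c2*sigma2/mu1^3), 4)   # 12.4652, 22.1807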
In conclusion, there are a variety of fully sequential tests. The SPRT is the-
oretically optimal in the sense that it attains the smallest possible expected
sample size before a decision is made as compared to all sequential tests that
do not have larger error probabilities than the SPRT (Wald and Wolfowitz,
1948). However, it should be noted that the SPRT is an open scheme with an
unbounded sample size; as a consequence, the non-asymptotic distribution
of sample size can be skewed with a large variance. For more detail, we refer
the reader to Martinsek (1981) and Jennison and Turnbull (2000).

6.4 The Central Limit Theorem for a Stopping Time


In Chapter 2 we considered the stopping time

N(H) = min{ n ≥ 1 : Σ_{i=1}^n Xi ≥ H },
where X i > 0, i = 1, 2,..., are iid random variables. The random process N ( H )
has not only a central role in renewal theory, but also is widely applied in
statistical sequential schemes. Evaluations of N ( H ) can demonstrate several
basic principles that can be adapted to analyze various stopping rules used
in sequential procedures.
In Chapter 2 we obtained the result E{N(H)} ~ H/E(X1) as H → ∞. In order to derive an asymptotic distribution of N(H), we define a = E(X1), σ² = var(X1) and provide the central limit theorem for the variable ζ(H) = (N(H) − H/a)(Hσ²/a³)^(−1/2) in the following proposition.

Proposition 6.4.1. If σ 2 < ∞, then the distribution function of ζ( H )


converges to the N (0,1) distribution as H → ∞.

Proof. Following previous material of this book, we know how to deal with
sums of iid random variables. Therefore we are interested in associating the
distribution function of ζ( H ) with that of a sum of iid random variables. To
this end, we note that

Pr{N(H) ≥ k} = Pr{ Σ_{i=1}^k Xi ≤ H } = Pr{ Σ_{i=1}^k (Xi − a)/(σ k^(1/2)) ≤ (H − ak)/(σ k^(1/2)) }.

It is clear that when k → ∞ and H → ∞ are restricted so that
(H − ak)(σ^2 k)^{-1/2} = u, where u is a fixed value, the central limit theorem
(Chapter 2) provides
\[
\Pr\{N(H) \ge k\} = \Pr\left\{ \sum_{i=1}^{k} X_i \le H \right\}
= \Pr\left\{ \frac{\sum_{i=1}^{k} (X_i - a)}{\sigma k^{1/2}} \le u \right\} \to \Phi(u),
\]

where Φ(u) is the standard normal distribution function. Then let us solve
(H − ak)(σ^2 k)^{-1/2} = u with respect to k:
\[
k = \frac{H}{a} + \frac{u\sigma\left\{ u\sigma \pm 2(aH)^{1/2}\left(1 + u^2\sigma^2/(4Ha)\right)^{1/2} \right\}}{2a^2}
= \frac{H}{a} \pm \frac{u\sigma (aH)^{1/2}}{a^2}\left\{ 1 + O\left(H^{-1/2}\right) \right\}.
\]

Thus, since ζ(H) = (N(H) − H/a)(Hσ^2/a^3)^{-1/2}, we have
\[
\Pr\{N(H) \ge k\} = \Pr\left\{ \zeta(H) \ge \pm u + O\left(H^{-1/2}\right) \right\} \to \Phi(u).
\]
Since the function Φ(u) increases when its argument increases, we should
choose Pr{N(H) ≥ k} = Pr{ζ(H) ≥ −u + O(H^{-1/2})} so that Pr{ζ(H) ≥ −u}
increases when u increases. Therefore, we have Pr{ζ(H) ≥ −u} → Φ(u) as H → ∞.
Since Φ(u) is a symmetric distribution function, we can rewrite

Pr {ζ( H ) ≥ −u} → Φ(u) = 1 − Φ(−u) or 1 − Pr {ζ( H ) ≤ −u} → 1 − Φ(−u).

Defining v = − u, we conclude with Pr {ζ( H ) ≤ v} → Φ( v). This completes the


proof of Proposition 6.4.1.
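Proposition 6.4.1 can be illustrated numerically. The following R code is a hedged sketch (the Exp(1) distribution, the threshold H and the number of Monte Carlo repetitions are arbitrary illustrative choices) that simulates N(H) for iid Exp(1) observations, so that a = σ^2 = 1, and examines the standardized variable ζ(H):

set.seed(1)
H<-500; MC<-10000
get.N<-function(H){
  s<-0; n<-0
  while (s<H) {s<-s+rexp(1); n<-n+1}
  return(n)
}
N<-replicate(MC,get.N(H))
zeta<-(N-H/1)/sqrt(H*1/1^3)   # (N(H)-H/a)/(H*sigma^2/a^3)^(1/2) with a=sigma^2=1
c(mean(zeta),var(zeta))       # should be close to 0 and 1
hist(zeta,freq=FALSE); curve(dnorm(x),add=TRUE)  # informal comparison with N(0,1)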

6.5 Group Sequential Tests


In practice, especially in the context of clinical trials, it is convenient to
analyze the data after collecting groups of observations, as opposed to a
fully sequential test in which the collection of a single observation at a time
and continuous data monitoring can be a serious practical burden.
The introduction of group sequential tests has led to a wide use of
sequential methods, and such designs achieve most of the efficiency gains
of fully sequential tests. The group sequential designs and corresponding
methods of analysis are particularly useful in clinical trials, where it is
standard practice that monitoring committees meet at regular intervals to
assess the progress of a study and to add formal interim analyses of the
primary patient response. Group sequential tests address the ethical and
efficiency concerns in clinical trials. They can be conducted conveniently
with most of the benefits of fully sequential tests in terms of lower expected
sample sizes and shorter average study lengths (Jennison and Turnbull,
2000).
In this section, we introduce the general idea of group sequential tests. In
group sequential designs, subjects are allocated in up to K groups of an equal
group size m, according to a constrained randomization scheme, which
ensures m subjects receive each treatment in every group and the accumulat-
ing data are analyzed after each group of 2m responses. The experiment can
stop early to reject the null hypothesis if the observed difference is suffi-
ciently large. For each k = 1,...,K, assume that Sk is the test statistic computed
from the first k groups of observations, Ck is the corresponding threshold,
and the decision rule is to reject H 0 when Sk > Ck ; the test terminates if H 0 is

rejected, i.e., Sk > Ck . If the test continues to the Kth analysis and SK < CK , it
stops at that point and H 0 is not rejected.
The problem in which many authors are interested in the field of sequen-
tial analysis is the calculation of the sequence of critical values, {C1 ,..., CK }, to
give an overall Type I error rate. Various different types of group sequential
test give rise to different sequences. In choosing the appropriate number of
groups K and the group size m there may be external influences such as a
predetermined cost or length of trial to fix mK or a natural interval between
analyses to fix m. However, it is useful to evaluate the statistical power
attained by the possible designs under consideration. Note that when signifi-
cance tests at a fixed level are repeated at stages during the accumulation of
data, the probability of obtaining a significant result rises above the nominal
significance level in the case that the null hypothesis is true. Therefore,
repeated significance testing of the accumulated data is applied after each
group is evaluated, with critical boundaries adjusted for multiple testing
(e.g., Dmitrienko et al., 2010). For more details, we refer the reader to, e.g.,
Jennison and Turnbull (2000).
As a specific example, consider a clinical trial with two treatments, A and
B, where the response variable for each patient to treatment A or B is nor-
mally distributed with known variance, σ 2 , and unknown mean, μ A or μ B,
respectively. The group sequential procedure for testing H_0: μ_A = μ_B versus
H 1 : μ A ≠ μ B is specified by the maximum number of stages, K, the number
of patients, m, to be accrued on A and B at each stage (assume equal sample
sizes for each stage and for A and B), and decision boundaries a1 ,..., aK . The
sequential procedure is as follows: upon completion of each stage k of sam-
pling (1 ≤ k ≤ K), compute the test statistic

\[
S_k = \sum_{i=1}^{k} \frac{\sqrt{m}\,\left(\bar{X}_{Ai} - \bar{X}_{Bi}\right)}{\sqrt{2}\,\sigma},
\]
where \bar{X}_{Ai} and \bar{X}_{Bi} denote the sample means for treatments A and B at the i-th
stage of m patients, respectively. If |S_k| < a_k and k < K, continue to stage
k + 1; if |S_k| ≥ a_k or k = K, stop sampling. Let M denote the stopping stage. The
null hypothesis is accepted if |S_M| < a_M or rejected if |S_M| ≥ a_M. Note that
S_1, S_2 − S_1, ..., S_K − S_{K−1} are independent and identically distributed normal
random variables with mean μ = √m (μ_A − μ_B)/(√2 σ) and variance one.
Pocock (1977) proposed the boundaries a_k = a k^{1/2}, k = 1, 2, ..., K, where the
constant a satisfies
\[
\alpha = 1 - \Pr\left( |S_1| < a,\ |S_2| < a\,2^{1/2}, \ldots, |S_K| < a\,K^{1/2} \,\middle|\, \mu = 0 \right).
\]

This group sequential design can sometimes be statistically superior to stan-


dard sequential designs. And the results based on a normal response can be
easily adapted to other types of response data. For more details and discus-
sion regarding other types of response data, we refer the reader to Pocock
(1977).
The assumption of equal numbers of observations in each group can be
relaxed and then more flexible methods for handling unequal and unpre-
dictable group sizes have been developed (Jennison and Turnbull, 2000).
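As a hedged illustration of this construction, the Pocock constant a can be approximated by Monte Carlo methods, using the fact that under H_0 the increments S_1, S_2 − S_1, ... are iid N(0,1); the function name and the simulation size below are illustrative choices:

pocock.const<-function(K=5,alpha=0.05,MC=100000){
  set.seed(123)
  M<-replicate(MC,{
    S<-cumsum(rnorm(K))    # S_1,...,S_K under H_0
    max(abs(S)/sqrt(1:K))  # max_k |S_k|/k^(1/2)
  })
  quantile(M,1-alpha)      # the boundary constant a
}
pocock.const(K=5,alpha=0.05)  # close to Pocock's tabulated value of about 2.41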

6.6 Adaptive Sequential Designs


In recent years, there has been a great deal of interest in the work of adaptive
sequential designs, an alternative and somewhat dissimilar methodology to
the sequential approaches described above. Adaptive designs allow the
modification of the design and the sample size after a portion of the data has
been collected. This can be done using sequentially observed and estimated
treatment effects at interim analyses to modify the maximal statistical infor-
mation to be collected or re-examine the variance assumptions about the
original design based on estimates.
For each k = 1,...,K, assume that Sk is the test statistic computed from the
first k groups of observations. Essentially, the technique of the adaptive
design is based on the assumption of multivariate normality for S1 ,..., SK
and was first described by Bauer and Kohne (1994). The authors focused
on a two-stage design and assumed that the data from each stage are inde-
pendent of those from the previous stage. We outline the basic idea with a
simple two-arm parallel clinical trial as follows: suppose that it is desired
to test the null hypothesis H 0 : θ = 0, where θ represents the treatment dif-
ference between the experimental and control groups. Based on the data
obtained from each of the two stages, two p-values, p1 and p2 , can be
obtained. Using Fisher’s combination method, Bauer and Kohne (1994)
showed that −2 log( p1 p2 ) follows a chi-square distribution on four degrees
of freedom under the null hypothesis, giving the ability to combine the
data from the two stages in a single test. The approach only assumes the
independence of data from the two stages, leading to great flexibility in
the design and analysis of trials without inflating the Type I error rate. The
adaptation methodology can be extended to trials with greater numbers of
stages. The adaptations can be based on unblinded data collected in a
trial, as well as external information. In addition, the adaptation rules
need not be specified in advance.
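The combination rule itself is straightforward to implement. The following R code is a minimal, hedged sketch of Fisher's combination of two independent stage-wise p-values (the function name and the p-values shown are hypothetical):

fisher.comb<-function(p1,p2){
  stat<- -2*log(p1*p2)           # chi-square with 4 degrees of freedom under H_0
  c(statistic=stat,p.value=1-pchisq(stat,df=4))
}
fisher.comb(p1=0.07,p2=0.04)     # combined evidence from the two stages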
Note that the adaptive design approach makes use of a test statistic, which
is not a sufficient statistic for the treatment difference, resulting in a lack of
power for the test. However, the enhanced flexibility of the adaptive design
makes it extremely attractive and important.
182 Statistics in the Health Sciences: Theory, Applications, and Computing

6.7 Futility Analysis


We introduced Simon’s two-stage designs in Section 6.2, which provide a
simple futility rule for an early stop in clinical trials. The term futility refers
to the inability of a clinical trial to achieve its objectives. Futility analyses
involve the decision-making process to terminate a trial prior to completion
conditional on the data accumulated so far when the interim results suggest
that it is unlikely to achieve statistical significance. It can save resources that
could be used on more promising research. One can combine the sequential
testing concept with futility analysis.
Stochastic curtailment is one approach to futility analysis. It refers to a
decision to terminate the trial based on an assessment of the conditional
power, where conditional power is the probability that the final result will be
statistically significant conditional on the data observed thus far and a spe-
cific assumption about the pattern of the data to be observed in the remain-
der of the study. Common assumptions include the original design effect, or
the effect estimated from the current data, or under the null hypothesis.
While a conditional power computation could be used as the basis for termi-
nating a trial when a positive effect emerges, a group sequential procedure is
usually employed for such decisions. A conditional power assessment
is usually used to assess the futility of continuing a trial when there is little
evidence of a beneficial effect. In some studies such monitoring for condi-
tional power is done in an ad hoc manner, whereas in others a futility-
stopping criterion is specified.
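For illustration, a commonly used form of the conditional power computation for a one-sided z-test can be sketched as follows; this is a hedged sketch on the Brownian motion (score) scale, where t1 denotes the observed information fraction, z1 the interim z-statistic and delta the assumed drift (the expected value of the final z-statistic under the assumed effect), and the numerical inputs are hypothetical:

cond.power<-function(z1,t1,delta,alpha=0.025){
  zc<-qnorm(1-alpha)
  1-pnorm((zc-z1*sqrt(t1)-delta*(1-t1))/sqrt(1-t1))
}
# weak interim trend at half the information, original design effect assumed
cond.power(z1=-0.2,t1=0.5,delta=qnorm(0.975)+qnorm(0.8))  # low conditional power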
There are other approaches proposed to assess futility, such as group
sequential methods described in Section 6.5, predictive power, and predictive
probability. For more details, we refer the reader to, e.g., Snapinn et al. (2006).

6.8 Post-Sequential Analysis


In the previous sections, we discussed the advantages of sequential proce-
dures. However, in the framework of sequential experiments, once data have
been collected, the hypothesis testing paradigm may no longer be the directly
useful one for the following statistical analysis; see, e.g., the discussion in
Cutler et al. (1966). In practice, medical studies usually require a more com-
plete analysis rather than the simple “accept” or “reject” decision of a hypoth-
esis test. Once a sequential procedure is executed, generally speaking, we
will meet the disadvantages of post-sequential analysis. The problem is that
all simple statistical approaches based on retrospectively collected data
should be completely modified when data is collected via sequential schemes.

Moreover, analyzing the data arising from sequential experiments commonly
requires very complicated adjustments.
The probability properties of the statistics obtained in a trial that is stopped
early at only n samples are different from those attained in a similar trial that
is run for a predetermined number of trials, even if they end up collecting
the same number of samples. Because the sampling procedure stops ran-
domly during a sequential study, classical methods related to the fixed sam-
ple size test do not work for the sequential designs, in many situations. If
these issues are not accounted for in the interpretation of the sequential trial,
the results of analysis of data collected via sequential procedures will be
biased. For a simple example, we assume N is a stopping time based on
sequentially obtained iid observations X_1, X_2, ...; then the sample mean
(X_1 + ... + X_N)/N may not be an unbiased estimator of E(X_1), that is,
E{(X_1 + ... + X_N)/N} ≠ E(X_1). Piantadosi
(2005) argued that the estimate of a treatment effect will be biased when
a trial is terminated at an early stage: the earlier the decision, the larger the
bias. Whitehead (1986) investigated the bias of maximum likelihood esti-
mates calculated at the end of a sequential procedure. Liu and Hall (1999)
showed that in a group sequential test about the drift θ of a Brownian motion
X (t) stopped at time T, the sufficient statistic S = (T , X (T )) is not complete
for θ. In addition, there exist infinitely many unbiased estimators of θ and
none has uniformly minimum variance.
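This bias is easy to demonstrate by simulation. The following R code is a hedged sketch in which X_i ~ N(0,1), so that E(X_1) = 0, and sampling stops the first time the running mean becomes positive (truncated at n = 50); the stopping rule and the simulation size are arbitrary illustrative choices:

set.seed(321)
MC<-20000
mean.at.N<-replicate(MC,{
  x<-rnorm(50)
  xbar<-cumsum(x)/(1:50)
  N<-which(xbar>0)[1]
  if (is.na(N)) N<-50   # truncate if the running mean never becomes positive
  xbar[N]
})
mean(mean.at.N)         # noticeably larger than E(X_1)=0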
Furthermore, most sequential analyses involve parametric approaches,
especially in the group sequential designs. In contrast with the analysis of
data obtained retrospectively, the parametric assumptions are posed
before data points are observed. Even if we have strong reasons to assume
parametric forms of the data distribution, it will be extremely hard, for
example, to test for the goodness of fit of data distributions after the execu-
tion of sequential procedures. In this case, for example, the well-known Shapiro–
Wilk test for normality will not be exact and its critical value will depend on
the underlying data distribution.
Therefore, it is important that proper methodology is followed in order to
avoid invalid conclusions based on sequential data. For more information
about the analysis following a sequential test, we refer the reader to Jennison
and Turnbull (2000).
7
A Brief Review of Receiver Operating
Characteristic Curve Analyses

7.1 Introduction
Receiver operating characteristic (ROC) curve analysis is a popular tool for
visualizing, organizing, and selecting classifiers based on their classification
accuracy. The ROC curve methodology was originally developed during
World War II to analyze classification accuracy in differentiating signal from
noise in radar detection (Lusted, 1971). Recently, the ROC methodology has
been extensively adapted to medical areas heavily dependent on screening
and diagnostic tests (Lloyd, 1998; Zhou and McClish, 2002; Pepe, 2000, 2003),
in particular, radiology (Obuchowski, 2003; Eng, 2005), bioinformatics (Lasko
et al., 2005), epidemiology (Green and Swets, 1966; Shapiro, 1999), and labora-
tory testing (Campbell, 1994). For example, in laboratory diagnostic tests,
which are central in the practice of modern medicine, common uses of ROC-
based methods include screening a specific population for evidence of dis-
ease and confirming or ruling out a tentative diagnosis in an individual
patient, where the interpretation of a diagnostic test result depends on the
ability of the test to distinguish between diseased and non-diseased sub-
jects. In cardiology, diagnostic testing and ROC curve analysis plays a fun-
damental role in clinical practice, e.g., using serum markers to predict
myocardial necrosis and using cardiac imaging tests to diagnose heart dis-
ease. ROC curves are increasingly used in the machine-learning field, due in
part to the realization that simple classification accuracy is often a poor stan-
dard for measuring performance (Provost and Fawcett, 1997). In addition to
being a generally useful graphical method for visualizing classification
accuracy, ROC curves have properties that make them especially useful for
domains with skewed discriminating distributions and/or unequal
classification error costs. These characteristics have become increasingly
important as research continues into the areas of cost-sensitive learning and
learning in the presence of unbalanced classes (Fawcett, 2006).
ROC curve analysis can also be applied generally for evaluating the accu-
racy of goodness-of-fit tests of statistical models (e.g., logistic regression) that


classify subjects into two categories such as diseased or non-diseased (e.g., in


the context of a linear discriminant analysis). For example, in cardiovascular
research, predictive modeling to evaluate expected outcomes, such as mor-
tality or adverse cardiac events, as functions of patient risk characteristics is
common. In this setting, ROC curve analysis is very useful in terms of sort-
ing out important risk factors from less important risk factors.
The ROC curve technique has been widely used in disease classification
with low-dimensional biomarkers because (1) it accommodates case-control
designs and (2) it allows treating discriminatory accuracy of a biomarker
(diagnostic test) for distinguishing between two populations in an efficient
and relatively simple manner.
This chapter will outline the following ROC curve topics: ROC Curve
Inference, Area under the ROC Curve, ROC curve Analysis and Logistic
Regression, Best Combinations Based on Values of Multiple Biomarkers.
These themes are considered in the contexts of definitions, parametric/
nonparametric estimation/testing and the relevant literature in the field of
the ROC curve analyses.

7.2 ROC Curve Inference


In this chapter we assume, without loss of generality, that X 1 , … , X n and
Y1 , … , Ym are iid observations from diseased and non-diseased populations,
respectively and let F and G denote the corresponding continuous
distribution functions of X and Y. The ROC curve R(t) is defined as
R(t) = 1 − F(G −1 (1 − t)), where t ∈[0, 1] and the function G −1 defines the inverse
function of G, i.e., G −1 (t) : G(G −1 (t)) = t (e.g., Pepe, 2003). The ROC curve is a
plot of sensitivity (true positive rate, 1 − F(t)) against one minus specificity
(the false positive rate, 1 − G(t)) for different values of the threshold t. Note that
the ROC curve is a special case of a probability-probability plot (P-P plot)
(e.g., Vexler et al., 2016a). As an example, we consider three biomarkers with
their corresponding ROC curves presented in Figure 7.1, whose underlying
distributions are F1 ~ N (0,1), G1 ~ N (0,1) for biomarker A (the diagonal line),
F2 ~ N (1,1), G2 ~ N (0,1) for biomarker B (in a dashed line), and F3 ~ N (10,1),
G3 ~ N (0,1) for biomarker C (in a dotted line), respectively.
The following R code is used to plot the ROC curves shown in Figure 7.1:

> t<-seq(0,1,0.001)
> R1<-1-pnorm(qnorm(1-t,0,1),0,1) # biomarker A
> R2<-1-pnorm(qnorm(1-t,1,1),0,1) # biomarker B
> R3<-1-pnorm(qnorm(1-t,10,1),0,1) # biomarker C
> plot(R1,t,type="l",lwd=1.5,lty=1,cex.lab=1.1,ylab="Sensitivity",xlab="1-Spe
cificity")
> lines(R2,t,lwd=1.5,lty=2)
> lines(R3,t,lwd=1.5,lty=3)

FIGURE 7.1
The ROC curves related to the biomarkers. The solid diagonal line corresponds to the ROC
curve of biomarker A, where F1 ~ N (0, 1) and G1 ~ N (0, 1). The dashed line displays the ROC
curve of biomarker B, where F2 ~ N (1, 1) and G2 ~ N (0, 1). The dotted line close to the upper left
corner plots the ROC curve for biomarker C, where F3 ~ N (10, 1) and G3 ~ N (0, 1).

It can be seen that the farther apart the two distributions F and G are in
terms of location, the more the ROC curve shifts to the top left corner. A near
perfect biomarker would have an ROC curve coming close to the top left
corner, and a biomarker without any discriminability would result in an
ROC curve that is a diagonal line from the points (0,0) to (1,1).
The ROC curve displays a distance between two distribution functions.
The ROC curve is a well-accepted statistical tool for evaluating the discrimi-
natory ability of biomarkers (e.g., Shapiro, 1999). It is a convenient way to
compare diagnostic biomarkers because the ROC curve places tests (bio-
markers values) on the same scale where they can be compared for accuracy.
There exists extensive research on estimating ROC curves from the para-
metric and nonparametric perspectives (e.g., Pepe, 2003; Hsieh and Turnbull,
1996; Wieand et al., 1989). For example, assuming that both the diseased and
non-diseased populations are normally distributed, that is, F ~ N (μ 1 , σ 12 ) and
G ~ N (μ 2 , σ 22 ), the corresponding ROC curve can be expressed as

R(t) = Φ( a + bΦ−1 (t)),

where a = (μ 1 − μ 2 )/σ 1, b = σ 2 /σ 1 , and Φ is the standard normal cumulative


distribution function. This is oftentimes referred to as the bi-normal ROC
curve. The estimated ROC curve is obtained by substituting the maximum
likelihood estimators (MLE) of the parameters μ 1, μ 2, σ 1, and σ 2 into the for-
mula above. The nonparametric estimation of the ROC curve incorporates

empirical distribution functions in place of their parametric counterparts.


Toward this end we define the empirical distribution function of F based on
iid observations X 1 ,..., X n from F as

\[
\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^{n} I\{X_i \le t\},
\]

where I {⋅} denotes the indicator function. The empirical distribution function
Ĝm of G can be defined similarly, using iid observations Y1 ,..., Ym from G. Esti-
mating F and G by their corresponding empirical estimates F̂n and Ĝm,
respectively, gives the empirical estimator of the ROC curve in the form

\[
\hat{R}(t) = 1 - \hat{F}_n\left(\hat{G}_m^{-1}(1 - t)\right),
\]

which can be shown to converge to R(t) for large sample sizes n and m.
Figure 7.2 presents the nonparametric estimators of the ROC curves based on
data related to generated values (n = m = 1000) of the three biomarkers
described with respect to Figure 7.1, i.e., F1 ~ N (0,1), G1 ~ N (0,1) for bio-
marker A (the diagonal line), F2 ~ N (1,1), G2 ~ N (0,1) for biomarker B (in a

FIGURE 7.2
The nonparametric estimators of the ROC curves related to three different biomarkers based
on n = 1000 and m = 1000 data points. The solid diagonal line corresponds to the nonparametric
estimator of the ROC curve of biomarker A, where F1 ~ N (0, 1) and G1 ~ N (0, 1) . The dashed line
displays the nonparametric estimator of the ROC curve of biomarker B, where F2 ~ N (1, 1) and
G2 ~ N (0, 1). The dotted line close to the upper left corner plots the nonparametric estimator of
the ROC curve for biomarker C, where F3 ~ N (10, 1) and G3 ~ N (0, 1).

dashed line), and F3 ~ N (10,1), G3 ~ N (0,1) for biomarker C (in a dotted line),
respectively, using the following R code:

> if(!("pROC" %in% rownames(installed.packages()))) install.packages("pROC")


> library(pROC)
> n<-1000
> set.seed(123) # set the seed
> # Simulate data from the normal distribution
> X1<-rnorm(n,1,1)
> Y1<-rnorm(n,0,1)
> group<-c(rep(1,n),rep(0,n)) # group labels: 1 = case, 0 = control
> measures<-c(X1,Y1)
> roc1<-roc(group, measures)
> plot(1-roc1$specificities,roc1$sensitivities,type="l",ylab="Sensitivity",x
lab="1-Specificity")
> abline(a=0,b=1,col="grey") # add the diagonal line for reference

It should be noted that for large sample sizes, the ROC curves are well
approximated by the nonparametric estimators.
In health-related studies, the ROC curve methodology is commonly related
to case-control studies. As a type of observational study, case-control stud-
ies differentiate and compare two existing groups differing in outcome on
the basis of some supposed causal attributes. For example, based on factors
that may contribute to a medical condition, subjects can be grouped as cases
(subjects with a condition/disease) and controls (subjects without the
condition/disease), e.g., cases could be subjects with breast cancer and
controls may be healthy subjects. For independent populations, e.g., cases and
controls, various parametric and nonparametric approaches have been
proposed to evaluate the performance of biomarkers (e.g., Pepe, 1997; Hsieh
and Turnbull, 1996; Wieand et al., 1989; Pepe and Thompson, 2000; McIntosh
and Pepe, 2002; Bamber, 1975; Metz et al., 1998).

7.3 Area under the ROC Curve


A rough idea of the performance of a set of biomarkers, in terms of their
diagnostic accuracy, can be obtained through visual examination of the
corresponding ROC curves. However, judgments based solely on visual
inspection of the ROC curve are far from enough to precisely describe the
diagnostic accuracy of biomarkers. The area under the ROC curve (AUC)
is a common summary index of the diagnostic accuracy of a binary bio-
marker. The AUC measures the ability of the marker to discriminate
between the case and control groups (Pepe and Thompson, 2000; McIntosh
and Pepe, 2002).
Bamber (1975) noted that the AUC is equal to Pr(X > Y ).

Proof. By the definition of the ROC curve, R(t) = 1 − F(G^{-1}(1 − t)), where
t ∈ [0, 1] and F and G are the distribution functions of X and Y, respectively, the
AUC can be expressed as
\[
\int_0^1 R(t)\,dt = \int_0^1 \left(1 - F(G^{-1}(1 - t))\right) dt
= \int_{-\infty}^{\infty} \left(1 - F(w)\right) dG(w)
= 1 - \int_{-\infty}^{\infty} F(w)\, dG(w) = 1 - \Pr(X \le Y) = \Pr(X > Y).
\]

The proof is complete.


Values of the AUC can range from 0.5, in the case of no differentiation
between the case and control distributions, to 1, where the case and control
distributions are perfectly separated. For more details and a wide discussion
regarding evaluations of AUC-type objectives, see Kotz et al. (2003).

7.3.1 Parametric Approach


Under binormal assumptions, where for the diseased population X ~ N (μ 1 , σ 12 )
and for the non-diseased population Y ~ N (μ 2 , σ 22 ), a closed form of the AUC
is presented as

\[
A = \Phi\!\left( \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}} \right), \qquad
\Phi(u) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{u} \exp\!\left(-\frac{z^2}{2}\right) dz.
\]

Noting that X and Y are independent, we obtain that
\[
A = \Pr(X > Y) = \Pr(X - Y > 0)
= 1 - \Pr\left\{ \frac{(X - Y) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2 + \sigma_2^2}} \le -\frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}} \right\}
= 1 - \Phi\!\left( -\frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}} \right)
= \Phi\!\left( \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}} \right).
\]

The index A ≥ 0.5 when μ_1 ≥ μ_2. By substituting maximum likelihood
estimators for μ_i and σ_i^2, i = 1, 2, into the above formula, the maximum likelihood
estimator of the AUC can be obtained directly. Given the estimator of the
AUC under binormal distributional assumptions, one can easily construct
large-sample confidence interval-based tests for the AUC using the delta
method (we outline this method below). For nonnormal data a transforma-
tion of observations to normality may first be applied, e.g., the Box–Cox trans-
formation (Box and Cox, 1964), prior to the parametric ROC curve approach
being applied. In general, when data distributions are assumed to have

parametric forms different from the normal distribution function for F
and G, the AUC can be expressed as Pr(X > Y) = ∫ G(x) dF(x), in a similar
manner to the technique shown above (Kotz et al., 2003).

When parametric forms of the distribution functions F and G are specified
in nonnormal shapes, methods for evaluating the AUC can be proposed in a
similar manner to those shown above. The maximum likelihood estimation
of A = Pr(X > Y) = ∫ G(x) dF(x) can be provided by expressing one of the parameters
of F and G as a function of A. For example, the following approach
can be easily extended to be appropriate for different parametric forms of
F and G: when
\[
A = \Phi\!\left( \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}} \right),
\]
we have μ_1 = (σ_1^2 + σ_2^2)^{1/2} Φ^{-1}(A) + μ_2, where Φ(Φ^{-1}(u)) = u, and then, using
the likelihood function based on observations from
X ~ N((σ_1^2 + σ_2^2)^{1/2} Φ^{-1}(A) + μ_2, σ_1^2) and Y ~ N(μ_2, σ_2^2), the maximum
likelihood estimator of A can be calculated by maximizing the corresponding
likelihood with respect to A ∈ (0, 1), μ_2, σ_1 > 0 and σ_2 > 0. It is clear that the properties
of the maximum likelihood estimator of A can be obtained via the material of
Chapter 3 (see, e.g., Vexler et al., 2008a as well as Vexler et al., 2008b, for more
details). Note that statistical software packages such as R, SAS, or SPlus
allow us to numerically perform the maximization of the corresponding log-likelihood
functions without using closed forms of the estimators of the
unknown parameters. The basic procedure “optim” in R (R Development
Core Team, 2014) can be carried out with respect to minimizing the negative
log-likelihoods, and the procedure “multiroot” can help in finding these minimizations.
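The following R code is a hedged sketch of this reparameterized maximum likelihood computation using “optim” (the function name AUC.mle is illustrative; the data are those of the example below):

AUC.mle<-function(x,y){
  negloglik<-function(par){
    A<-plogis(par[1]); mu2<-par[2]       # A is kept in (0,1)
    s1<-exp(par[3]); s2<-exp(par[4])     # sigma_1, sigma_2 > 0
    mu1<-sqrt(s1^2+s2^2)*qnorm(A)+mu2    # the reparameterization described above
    -sum(dnorm(x,mu1,s1,log=TRUE))-sum(dnorm(y,mu2,s2,log=TRUE))
  }
  fit<-optim(c(0,mean(y),log(sd(x)),log(sd(y))),negloglik,control=list(maxit=5000))
  plogis(fit$par[1])                     # maximum likelihood estimator of A
}
AUC.mle(x=c(0.39,1.97,1.03,0.16),y=c(0.42,0.29,0.56,-0.68,-0.54))  # close to 0.84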
The delta method for evaluating estimators of the AUC, e.g., in terms of
their asymptotic variances and distributions, can be employed similarly to
the following scheme: when X ~ N(μ_1, σ_1^2) and Y ~ N(μ_2, σ_2^2), one can estimate
the parameters μ_1, σ_1^2, μ_2, σ_2^2 as μ̂_1, σ̂_1^2, μ̂_2, σ̂_2^2, obtaining the estimators
δ̂ = (μ̂_1 − μ̂_2)/(σ̂_1^2 + σ̂_2^2)^{1/2} of δ = (μ_1 − μ_2)/(σ_1^2 + σ_2^2)^{1/2} and Â = Φ(δ̂) of A. Then Taylor's theorem
yields the approximation
\[
\hat{A} \approx \Phi(\delta) + (\hat{\delta} - \delta)\,\Phi'(\delta) = A + (\hat{\delta} - \delta)\,\Phi'(\delta);
\]
see, e.g., Kotz et al. (2003) for details.

Example:
Assume that biomarker levels were measured from diseased and healthy
populations, with iid observations X 1 = 0.39, X 2 = 1.97 , X 3 = 1.03, and
X 4 = 0.16, which are assumed to be from a normal distribution N (μ 1 , σ 12 ),
and iid observations Y_1 = 0.42, Y_2 = 0.29, Y_3 = 0.56, Y_4 = −0.68, and Y_5 =
−0.54, which are assumed to be from a normal distribution N(μ_2, σ_2^2),
respectively. In this case, the maximum likelihood estimators of the
parameters are μ̂_1 = 0.8875, μ̂_2 = 0.01, σ̂_1^2 = 0.4922, and σ̂_2^2 = 0.2655. Based
on the definition described in Section 7.2, â = (μ̂_1 − μ̂_2)/σ̂_1 = 1.251 and
b̂ = σ̂_2/σ̂_1 = 0.734, and therefore the ROC curve can be estimated as
R̂(t) = Φ(â + b̂ Φ^{-1}(t)) = Φ(1.251 + 0.734 Φ^{-1}(t)). The AUC can be estimated as
Â = Φ((μ̂_1 − μ̂_2)/(σ̂_1^2 + σ̂_2^2)^{1/2}) = 0.8433, which summarizes the discriminating
ability of the biomarker with respect to the disease. The interpretation
of the AUC is that this particular marker accurately predicts a case
to be a case and a control to be a control 84% of the time and that 16% of the
time the marker will misclassify the groupings.
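The closed-form quantities in this example can be reproduced with a few lines of R code (a hedged sketch; note that the maximum likelihood variance estimators use the divisor n rather than n − 1):

x<-c(0.39,1.97,1.03,0.16); y<-c(0.42,0.29,0.56,-0.68,-0.54)
mu1<-mean(x); mu2<-mean(y)
s1.2<-mean((x-mu1)^2); s2.2<-mean((y-mu2)^2)  # MLEs of sigma_1^2 and sigma_2^2
a<-(mu1-mu2)/sqrt(s1.2); b<-sqrt(s2.2/s1.2)
AUC<-pnorm((mu1-mu2)/sqrt(s1.2+s2.2))
round(c(a=a,b=b,AUC=AUC),3)  # approximately 1.251, 0.734 and 0.843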

7.3.2 Nonparametric Approach


Conversely, a nonparametric estimator for the AUC can be obtained as
\[
\hat{A} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} I(X_i > Y_j),
\]

where X i , i = 1, … , m and Yj , j = 1, … , n are the observations for diseased and


non-diseased populations, respectively (Zhou et al., 2011). It is equivalent to
the well-known Mann–Whitney statistic, and the variance of this empirical
estimator can be obtained using U-statistic theory (Serfling, 2002). Replacing
the indicator function by a kernel function, one can obtain a smoothed ROC
curve (Zou et al., 1997). For details regarding nonparametric evaluations and
comparisons of ROC curves and AUCs we refer the reader to Vexler et al.
(2016a).

Example:
Biomarker levels were measured from diseased and healthy popula-
tions, providing iid observations X 1 = 0.39, X 2 = 1.97 , X 3 = 1.03, and
X 4 = 0.16 , which are assumed to be from a continuous distribution,
and iid observations Y1 = 0.42, Y2 = 0.29 , Y3 = 0.56, Y4 = −0.68 , and
Y5 = −0.54, which are also assumed to be from a continuous distribution,
respectively. In this case, the empirical estimates of the distribution
functions F and G have the forms F̂_4(t) = (1/4) Σ_{i=1}^{4} I(X_i ≤ t) and
Ĝ_5(t) = (1/5) Σ_{j=1}^{5} I(Y_j ≤ t), respectively, and the empirical estimate of the
ROC curve is R̂(t) = 1 − F̂_4(Ĝ_5^{-1}(1 − t)). The AUC can be estimated as
Â = (1/(4 × 5)) Σ_{i=1}^{4} Σ_{j=1}^{5} I(X_i > Y_j) = 0.75, suggesting a moderate discriminat-
ing ability of the biomarker with respect to discriminating the disease
population from the health population. Note that in this case, the non-
parametric estimator of the AUC is smaller than the estimator of the
AUC under the normal assumption (Section 7.3.1). For such a small data-
set used in our examples the difference in AUC estimates is likely due
more to the discreteness of the nonparametric method than any real dif-
ference between the approaches.
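The empirical AUC of this example can also be verified directly in R; the following hedged sketch uses both a direct pairwise count and the equivalent Mann–Whitney form:

x<-c(0.39,1.97,1.03,0.16); y<-c(0.42,0.29,0.56,-0.68,-0.54)
mean(outer(x,y,">"))                               # 0.75
wilcox.test(x,y)$statistic/(length(x)*length(y))   # the same value via the Mann-Whitney statistic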

7.4 ROC Curve Analysis and Logistic Regression:


Comparison and Overestimation
In general, the discriminant ability of a continuous covariate, e.g., biomarker
measurements, can be considered based on logistic regression or the ROC
curve methodology (Pepe, 2003). Logistic regression has been proposed as a
means of modeling the probability of disease given several test results (e.g.,
Richards et al., 1996). It is often used to find a combination of covariates that
discriminates between two groups or populations, for example, diseased
and non-diseased populations. Suppose we have a data set consisting of a
binary outcome, q ∈ {0,1}, which indicates the membership of the individual,
and m components in covariate vector z. Table 7.1 illustrates the connection
between the ROC curve and logistic regression with a case-control
study. Let X and Y represent the measurements of the biomarker in the
case and control group, respectively. Individuals in the case group
(q = 1) have values of z given q = 1, that is, X = z|q = 1, while individuals in
the control group (q = 0) have values of z given q = 0, that is, Y = z|q = 0.
The logistic regression models
\[
\Pr(q = 1 \mid z) = \frac{e^{\beta^T z}}{1 + e^{\beta^T z}},
\]

where the covariate vector z can include a component with the value of 1 to
present an intercept coefficient of the model and the vector β consists of the
parameters of the regression, the operator T denotes the transpose.
Logistic regression relies only on an assumption about the form of the con-
ditional probability for disease given the covariate vector z and does not

TABLE 7.1
The outcome levels and the biomarker values z, where X and Y represent values of
measurements of the biomarker in the case and control group, respectively

Class              z (Biomarker values)
Case (q = 1)       X = z|q = 1 (values of z given q = 1)
Control (q = 0)    Y = z|q = 0 (values of z given q = 0)

require specification of the much more complex joint distribution of the


covariate vector z.
It can be emphasized that logistic regression focuses on Pr(q = 1|z) and the
covariate z is assumed not to be a random variable, whereas the ROC curve
methodology attends to Pr( z|q) and z is assumed to be a random variable.
In this section, we present the fact that using the same data to fit both the
discriminant score and to estimate its ROC curve leads to an overly optimistic
estimate of the accuracy of the test as compared to how the model would
perform on samples of future cases (Copas and Corbett, 2002). In general, for
large studies data are split into training samples and validation samples.
The training sample is used to fit the model and the validation sample is
used to assess the performance of the model fit relative to how it would func-
tion on a future set of patients. This is discussed further below.
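The phenomenon can be illustrated by a simple training/validation experiment. The following R code is a hedged sketch (the sample size, the number of noise covariates and the effect size are arbitrary illustrative choices) that fits a logistic score on a training sample and compares its apparent AUC with the AUC obtained on an independent validation sample:

set.seed(7)
n<-100; d<-10                                       # small sample, several noise covariates
make.data<-function(n,d){
  z<-matrix(rnorm(n*d),n,d)
  q<-rbinom(n,1,plogis(0.5*z[,1]))                  # only the first covariate matters
  data.frame(q=q,z)
}
emp.auc<-function(score,q) mean(outer(score[q==1],score[q==0],">"))
train<-make.data(n,d); valid<-make.data(n,d)
fit<-glm(q~.,family=binomial,data=train)
auc.train<-emp.auc(predict(fit,train),train$q)      # retrospective (training) AUC
auc.valid<-emp.auc(predict(fit,valid),valid$q)      # prospective (validation) AUC
round(c(training=auc.train,validation=auc.valid),3)
# the training AUC is typically noticeably larger than the validated AUC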
The ROC curve is a standard way of illustrating and evaluating the perfor-
mance of a discriminant score or screening marker. Let s be such a score, e.g.,
s = βT z, where z is the covariate vector. And we have two groups or popula-
tions indexed by the binary outcome q ∈ {0,1}, e.g., which represents the
membership of the diseased or non-diseased populations. Then threshold u
gives the false positive rate

F0 (u) = Pr(s ≥ u|q = 0)

and the true positive rate

F1 (u) = Pr(s ≥ u|q = 1).

The ROC curve, R, is the graph of F1 (u) against F0 (u) as u ranges over all pos-
sible values,

R = {( F1 (u), F0 (u)), −∞ < u < +∞}.

It is important to recognize that an empirical ROC curve, such as that


shown in Figure 7.2, is a retrospective calculation, using the same data to
estimate the score and to assess its performance. What we really want to
know is how well this particular score would perform if it were to be adopted
in practice. This would involve a prospective assessment of how well the
score discriminates between future and independent cases with q = 1 and
q = 0. Thus we distinguish between the retrospective (training set) ROC,
R̂(βˆ ), the curve with s = βˆ T z and with F0 and F1 taken as the empirical distri-
butions of s in the sample, and the prospective (validation set) ROC, R(βˆ ), the
curve with the same score s = β̂^T z but with F_0 and F_1 taken as the true population
distributions of s. The term overestimation in the title of this section


refers to the difference
\[
\hat{R}(\hat{\beta}) - R(\hat{\beta}),
\]

where the difference between two curves denotes the curve of differences in
vertical coordinates graphed against common values for the horizontal coor-
dinate. In other words, this is the difference in true positive rates having
chosen the thresholds to match the false positive rates. Typically, the term
overestimation is positive, that is, the retrospective ROC gives an inflated
assessment of the true performance of the score.
Consider the classifier qˆ = 1, if β̂T z ≥ u, and qˆ = 0, if β̂T z < u. Then it is well
known that the retrospective error rate, namely the proportion of cases in the
sample for which qˆ ≠ q, is a downward biased estimate of the true prospec-
tive error rate, which would be obtained if the classifier q̂ were to be applied
to the whole population. Efron (1986) derives an asymptotic approximation
to the expected bias. Efron’s formula is consistent with the approximation
based on the ROC curve methodology, which we will present in the follow-
ing subsection.

7.4.1 Retrospective and Prospective ROC


Suppose we have a total sample size of n individuals with data ( zi , qi ), where
q denotes the group membership as before and there are m components in
covariate vector z, including the intercept term z1 = 1. We assume through-
out that the data fit well to the logistic regression model

T
eβ z
Pr(q = 1|z) = T .
1 + eβ z

Let β̂ be the maximum likelihood estimator of β. Then, if we apply the scores


û_i = β̂^T z_i to the data against a threshold u, the observed proportions of false
and true positives are
\[
\hat{F}_0(u) = \frac{1}{n(1 - \bar{q})} \sum_i H(\hat{u}_i - u)(1 - q_i), \qquad
\hat{F}_1(u) = \frac{1}{n\bar{q}} \sum_i H(\hat{u}_i - u)\, q_i,
\]

respectively, where q̄ = Σ_i q_i/n and H is the Heaviside function: H(z) = 1 if
z ≥ 0, and H(z) = 0 if z < 0. The sums in these and subsequent expressions
run from 1 to n. The ROC curve R̂ = R̂(β̂) is then the graph of F̂_1(u)
against F̂_0(u).
For the prospective ROC fit, suppose the random q’s in these data are rep-
licated a large number of times. Then, at each z_i, we expect a proportion p_i of
replicated cases to have q = 1, where
\[
p_i = \frac{e^{u_i}}{1 + e^{u_i}}, \qquad u_i = \beta^T z_i.
\]

If the scores ûi are applied to the replicated data against a threshold v, the
future proportions of false and true positives are

\[
F_0(v) = \frac{1}{n(1 - \bar{p})} \sum_i H(\hat{u}_i - v)(1 - p_i), \qquad
F_1(v) = \frac{1}{n\bar{p}} \sum_i H(\hat{u}_i - v)\, p_i,
\]
where p̄ = Σ_i p_i/n. Then R = R(β̂) is the graph of F_1(v) against F_0(v).

7.4.2 Expected Bias of the ROC Curve and Overestimation of the AUC
To simplify the notation, let ε_i = q_i − p_i, H_i = H(û_i − u), and H_i^* = H(û_i − v).
Then we can obtain q̄ = p̄ + ε̄, and
\[
n\{\hat{F}_0(u) - F_0(v)\} = \sum_i H_i \left( \frac{1 - p_i - \varepsilon_i}{1 - \bar{p} - \bar{\varepsilon}} - \frac{1 - p_i}{1 - \bar{p}} \right)
- \frac{1}{1 - \bar{p}} \sum_i (H_i^* - H_i)(1 - p_i).
\]

For large n, values of p_i will be close to p_u, which is the true value of
Pr(q = 1 | β^T z = u). Hence
\[
n\{\hat{F}_0(u) - F_0(v)\} \approx \sum_i H_i \left( \frac{1 - p_i - \varepsilon_i}{1 - \bar{p} - \bar{\varepsilon}} - \frac{1 - p_i}{1 - \bar{p}} \right)
- \frac{1 - p_u}{1 - \bar{p}} \sum_i (H_i^* - H_i).
\]

Similarly, we can obtain
\[
n\{\hat{F}_1(u) - F_1(v)\} \approx \sum_i H_i \left( \frac{p_i + \varepsilon_i}{\bar{p} + \bar{\varepsilon}} - \frac{p_i}{\bar{p}} \right)
- \frac{p_u}{\bar{p}} \sum_i (H_i^* - H_i).
\]

Define v_u to be the value of v such that n\{\hat{F}_0(u) - F_0(v)\} = 0. We have
\[
\hat{F}_1(u) - F_1(v_u) \approx n^{-1} \sum_i H_i\, A(i, u), \quad \text{where} \quad
A(i, u) = \left( \frac{p_i + \varepsilon_i}{\bar{p} + \bar{\varepsilon}} - \frac{p_i}{\bar{p}} \right)
- \frac{(1 - \bar{p})\, p_u}{\bar{p}\,(1 - p_u)} \left( \frac{1 - p_i - \varepsilon_i}{1 - \bar{p} - \bar{\varepsilon}} - \frac{1 - p_i}{1 - \bar{p}} \right).
\]

Note that ε̄ = O_p(n^{-1/2}). Let Ω = n^{-1} Σ_i p_i(1 - p_i) z_i z_i^T and d_i^2 = z_i^T Ω^{-1} z_i; then
we can obtain
\[
S(u) = E\{\hat{F}_1(u) - F_1(v_u)\}
= \frac{p_u}{n^{3/2}\,\bar{p}} \sum_i \left\{ d_i - \frac{1}{\bar{p}(1 - \bar{p})\, d_i} \right\}
\varphi\!\left( \frac{n^{1/2}(u_i - u)}{d_i} \right),
\qquad \varphi(u) = (2\pi)^{-1/2} \exp(-u^2/2).
\]

The terms pu and ui in the right-hand side are functions of the true parameter
vector β, and the terms di and p depend on pi and hence are also functions of
β. Estimating these in the obvious way using β̂ gives the corresponding esti-
mator Sˆ (u). The corrected ROC curve

\[
R^* = \left\{ \hat{F}_1(u) - \hat{S}(u),\ \hat{F}_0(u) \right\}
\]

is an estimator of the ROC curve that would be obtained if the fitted score
β̂T z were to be validated on a large replicated sample. As expected, this indi-
cates that this score discriminates between the two populations noticeably
less well than the retrospective analysis seems to suggest.
The estimated AUC is

\[
\int \hat{F}_1(u)\, d\hat{F}_0(u) = \frac{1}{n(1 - \bar{q})} \sum_j (1 - q_j)\, \hat{F}_1(\hat{u}_j).
\]

Similar to the overestimation in the ROC curve described above, the overes-
timation in the AUC is

\[
\int \left\{ \hat{F}_1(u) - F_1(v_u) \right\} d\hat{F}_0(u)
= \frac{1}{n(1 - \bar{q})} \sum_j (1 - q_j)\left\{ \hat{F}_1(\hat{u}_j) - F_1(v_{\hat{u}_j}) \right\}
= \frac{1}{n^2(1 - \bar{q})} \sum_{i,j} H(\hat{u}_i - \hat{u}_j)(1 - q_i)\, A(i, \hat{u}_j)
\approx \frac{1}{2 n^{5/2}\, \bar{p}(1 - \bar{p})} \sum_{i \ne j} p_i (1 - p_i)\, b_{ij}\,
\varphi\!\left( \frac{n^{1/2}(u_j - u_i)}{b_{ij}} \right),
\]

where a_{ij} = (z_j − z_i)^T Ω^{-1} z_i and b_{ij}^2 = a_{ij} + a_{ji} = (z_i − z_j)^T Ω^{-1} (z_i − z_j). Therefore, the
overestimation in the AUC is approximately
\[
\frac{1}{n\,\bar{p}(1 - \bar{p})}\, E\!\left[ p_U (1 - p_U)\, f(U)\, \mathrm{tr}\{\Omega^{-1} \mathrm{var}(z \mid U)\} \right],
\]
where U = β^T z has probability density function f(u), and E(·) and var(·) denote the
empirical expectation and variance over the empirical distribution of the
sample covariate vectors z_1, ..., z_n. Note that the overestimation is again of
the order O(n^{-1}). For more details see Copas and Corbett (2002).

7.4.3 Example
Let X1 ,..., X m and Y1 ,..., Yn denote biomarker measurements from non-diseased
and diseased populations, respectively. Assume that we would like to exam-
ine the discriminant ability of the biomarker based on the AUC, that is, test-
ing for H 0 : AUC = 1/ 2 versus H 1 : AUC ≠ 1/ 2 as well as based on the logistic
regression, i.e., testing for the coefficient, β , H 0 : β = 0 versus H 1 : β ≠ 0 .
As described in Section 7.3.2, a nonparametric estimator of the AUC is
equivalent to the well-known Mann–Whitney statistic, which follows an
asymptotic normal distribution, and the variance of this empirical estimator
can be obtained using U-statistic theory (Serfling, 2002). Then the nonparametric
estimator of the AUC is Â = (1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} I(X_i > Y_j). Let R_i be the rank

of X i’s (the i th ordered value among X i ’s) in the combined sample of X i ’s and
Yj’s, and Sj be the rank of Yj’s (the jth ordered value among Yj’s) in the com-
bined sample of X i ’s and Yj’s. Define

\[
S_{10}^2 = \frac{1}{(m-1)n^2} \left[ \sum_{i=1}^{m} (R_i - i)^2 - m \left( \frac{1}{m}\sum_{i=1}^{m} R_i - \frac{m+1}{2} \right)^{2} \right],
\]
\[
S_{01}^2 = \frac{1}{(n-1)m^2} \left[ \sum_{j=1}^{n} (S_j - j)^2 - n \left( \frac{1}{n}\sum_{j=1}^{n} S_j - \frac{n+1}{2} \right)^{2} \right].
\]

Then the estimated variance of this empirical estimator of the AUC is

\[
S^2 = \left( \frac{m S_{01}^2 + n S_{10}^2}{m + n} \right) \Big/ \left( \frac{mn}{m + n} \right)
= \frac{S_{10}^2}{m} + \frac{S_{01}^2}{n}.
\]

Based on the asymptotic normal distribution of the empirical estimator of


the AUC, the corresponding p-value can be easily obtained and the 95% con-
fidence interval of the AUC is [ Aˆ − 1.96 × S, Aˆ + 1.96 × S].
The test for H 0 : β = 0 versus H 1 : β ≠ 0 is based on the asymptotic distribu-
tion of the estimated coefficient β from the logistic regression fit (Hosmer
and Lemeshow, 2004). It should be noted that the test H 0 : β = 0 versus H 1 : β ≠ 0
is a test of association, which is a less stringent measure than prediction.
In order to evaluate the test procedures described above, we conducted
10,000 Monte Carlo generations of X1 ,..., X m and Y1 ,..., Yn . In the simulation set-
ting, we assume that both the case group X and the control group Y follow a
log-normal distribution log N (μ , σ 2 ) with different μ’s and a common σ 2 ,
where E(Y ) = 3 , E(X ) = 3 + δ , and δ ranges from 0 to 2.
Figure 7.3 demonstrates the experimental comparison of the Monte Carlo
powers between the test for H 0 : AUC = 1/ 2 versus H 1 : AUC ≠ 1/ 2 (upper
curves) and the test based on the logistic regression H 0 : β = 0 versus H 1 : β ≠ 0


FIGURE 7.3
The experimental comparisons based on the Monte Carlo powers related to the test for
H 0 : AUC = 1/2 versus H 1 : AUC ≠ 1/2 (upper curves) and the test for H 0 : β = 0 versus H 1 : β ≠ 0
(lower curves). The tests were performed in order to detect the discriminant ability of the bio-
marker at the 0.05 significance level. The left and the right panels correspond to the sample
sizes n = m = 50 and n = m = 100 , respectively.

(lower curves) to detect the discriminant ability of the biomarker at the 0.05
significance level, where the left and the right panels correspond to the sam-
ple sizes n = m = 50 and n = m = 100, respectively. In these cases, the test based
on the AUC concept demonstrates better discriminant ability of the bio-
marker compared to the test based on the logistic regression.
The R code to implement the procedures is presented below:

> # empirical AUC (Mann-Whitney stat), returns p-value for H_0: AUC=1/2
versus H_1: AUC!=1/2
> AUC.pval<-function(x,y) {
+ m<-length(x)
+ n<-length(y)
+ x<-sort(x)
+ y<-sort(y)
+ rankall<-rank(c(x,y))
+ R<-rankall[1:m]
+ S<-rankall[-c(1:m)] # ranks of y
+
+ S10<-1/((m-1)*n^2)*(sum((R-1:m)^2)-m*(mean(R)-(m+1)/2)^2)
+ S01<-1/((n-1)*m^2)*(sum((S-1:n)^2)-n*(mean(S)-(n+1)/2)^2)
+ S2<-(m*S01+n*S10)/(m+n) # so that S2/(m*n/(m+n)) equals S10/m + S01/n
+
+ a<-wilcox.test(y,x)$statistic/(n*m)
+ AUC<-as.numeric(ifelse(a>=0.5,a,1-a)) # empirical AUC
+ pval<- 2*(1-pnorm(abs(AUC-0.5)/sqrt(S2/(m*n/(m+n)))))
+ return(pval)
+ }
>
> # logistic regression


> logistic.pval<-function(x,y){
+ if (!is.null(ncol(x))) {
+ Z<-rbind(x,y)
+ m<-nrow(x)
+ n<-nrow(y)
+ } else{
+ Z<-c(x,y)
+ m<-length(x)
+ n<-length(y)
+ }
+ q<-c(rep(0,m),rep(1,n))
+ newdata<-data.frame(cbind(q,Z))
+ glm.out = glm(q ~Z, family=binomial(logit), data=newdata)
+ return(summary(glm.out)$coef[-1,c(1,4)])
+ }
>
> MC<-10000
> alpha<-0.05
>
> Ex=3
> sigma2.x<-sigma2.y<-1
> delta.seq<-seq(0,2,.2)
> n.seq<-c(50,100)
>
> powers.all<-lapply(n.seq,function(n){
+ m<-n
+ powers<-sapply(delta.seq,function(delta){
+ Ey=Ex+delta
+ ux<-log(Ex)-0.5*sigma2.x
+ uy<-log(Ey)-0.5*sigma2.y
+ tmp<-sapply(1:MC,function(b){
+ x<-exp(rnorm(m,ux,sqrt(sigma2.x))) #x is lognormal
+ y<-exp(rnorm(n,uy,sqrt(sigma2.y))) #y is lognormal
+ pvals<-c(logistic.pval(x,y)[2]<alpha,AUC.pval(x,y)<alpha)
+ names(pvals)<-c("logistic","AUC")
+ return(pvals)
+ })
+ pow.delta<-apply(tmp,1,mean)
+ return(pow.delta)
+ })
+ colnames(powers)<-delta.seq
+ return(powers)
+ })
>
> names(powers.all)<-paste0("m=n=",n.seq)
> powers.all
$`m=n=50`
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
logistic 0.0326 0.0332 0.0548 0.0796 0.1273 0.1695 0.2208 0.2743 0.3371 0.4046 0.4692
AUC 0.0540 0.0687 0.0998 0.1494 0.2286 0.2955 0.3845 0.4738 0.5499 0.6237 0.7055
$`m=n=100`
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
logistic 0.0350 0.0501 0.0850 0.1423 0.2272 0.3317 0.4277 0.5252 0.6270 0.7122 0.7781
AUC 0.0538 0.0796 0.1414 0.2419 0.3710 0.5107 0.6464 0.7595 0.8397 0.9042 0.9422

Note that in the case where both the case group X and the control group Y
follow a log-normal distribution log N (μ , σ 2 ) with a common value for μ and
we vary the values for σ 2 between the case group and the control group, the
AUC does not allow us to detect any significant difference between the two
groups. This is because a simple log transformation of observed data points
shows that the AUC is equal to 0.5 in this case. We refer the reader to Section
7.3.1 for the computation of the AUC under normal assumptions.

7.5 Best Combinations Based on Values of


Multiple Biomarkers
In practice, different biomarker levels are usually associated with disease in
various magnitudes and in different directions. For example, low levels of
high-density lipoprotein (HDL) cholesterol and high levels of thiobarbituric
acid reacting substances (TBARS), biomarkers of oxidative stress and anti-
oxidant status, are indicators of coronary heart disease (Schisterman et al.,
2001a). When multiple biomarkers are available, it is of great interest to seek
a combination of biomarkers to improve diagnostic accuracy (e.g., Liu et al.,
2011). Due to the simplicity in practical applications, we will attend to the
best linear combination (BLC) of biomarkers, such that the combined score
achieves the maximum AUC or the maximum treatment effect over all pos-
sible linear combinations. We refer the reader to Pepe (2003) and Pepe and
Thompson (2000) for information regarding general methods related to best
combinations of biomarkers that improve diagnostic accuracy.
Warning: Note that the implementation of logistic regression based on sev-
eral biomarkers does not aim to maximize the AUC, its objective is to maxi-
mize the likelihood function. However, we motivate the reader, e.g., via the
Monte Carlo simulations, to compare the AUC based on BLC described in
this section with the AUC based on the linear combinations of biomarkers
obtained from logistic regression.
Consider a study with d continuous-scale biomarkers yielding measure-
ments X i = (X 1i ,..., X di )T , i = 1,..., n, on n diseased patients, and measurements
Yj = (Y1 j ,..., Ydj )T , j = 1,..., m, on m non-diseased patients, respectively, where T
denotes the transpose. It is of interest to construct effective one-dimensional
combined scores of biomarker measurements, i.e., X (a) = aT X and Y (a) = aT Y,
such that the AUC based on these scores is maximized over all possible lin-
ear combinations of biomarkers. Define A(a) = Pr(X (a) > Y (a)); the statistical
problem is to estimate the maximum AUC defined as A = A(a 0 ), where the
vector a 0 consists of the BLC coefficients satisfying a 0 = arg max a A(a). For
simplicity, we assume that the first component of the vector a equals 1. For
example, in the case of two biomarkers, i.e., d = 2 , the AUC can be defined as
A( a) = Pr(X 1 + aX 2 > Y1 + aY2 ).

7.5.1 Parametric Method


Assuming X_i ~ N(μ_X, Σ_X), i = 1, ..., n, and Y_j ~ N(μ_Y, Σ_Y), j = 1, ..., m, Su and
Liu (1993) derived the BLC coefficients a_0 ∝ Σ_C^{-1} μ and the corresponding optimal
AUC as Φ(ω^{1/2}), where μ = μ_X − μ_Y, Σ_C = Σ_X + Σ_Y, ω = μ^T Σ_C^{-1} μ, and Φ is
the standard normal cumulative distribution function.
Based on Su and Liu’s point estimator, we can derive the confidence inter-
val estimation for the BLC-based AUC under multivariate normality assump-
tions (e.g., Reiser and Faraggi, 1997).
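For illustration, Su and Liu's solution can be computed directly from sample estimates of the mean vectors and covariance matrices; the following R code is a hedged sketch (the function name and the simulated two-biomarker data are illustrative, and the MASS package is used only to generate multivariate normal samples):

suliu.blc<-function(X,Y){        # rows: subjects, columns: biomarkers
  mu<-colMeans(X)-colMeans(Y)
  SigmaC<-cov(X)+cov(Y)
  a0<-solve(SigmaC,mu)           # BLC coefficients (defined up to proportionality)
  AUC<-pnorm(sqrt(drop(t(mu)%*%solve(SigmaC,mu))))
  list(coefficients=a0,AUC=AUC)
}
set.seed(11)
X<-MASS::mvrnorm(50,c(1,0.5),diag(2))  # "diseased" measurements
Y<-MASS::mvrnorm(60,c(0,0),diag(2))    # "non-diseased" measurements
suliu.blc(X,Y)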

7.5.2 Nonparametric Method


In the nonparametric context, the BLC-based maximum AUC can be evalu-
ated using different nonparametric schemes to estimate the probability
Pr(X (a) > Y (a)) as a function of the vector a (e.g., Vexler et al., 2006; Chen et al.,
2015).

7.6 Remarks
Medical diagnoses usually involve the classification of patients into two or
more categories. When subjects are categorized in a binary manner, i.e.,
non-diseased and diseased, the ROC curve methodology is an important
statistical tool for evaluating the accuracy of continuous diagnostic tests,
and the AUC is one of the common indices used for overall diagnostic
accuracy. In many situations, the diagnostic decision is not limited to a
binary choice. For example, a clinical assessment, NPZ-8, of the presence of
HIV-related cognitive dysfunction (AIDS Dementia Complex [ADC]) would
discriminate between patients exhibiting clinical symptoms of ADC (com-
bined stages 1–3), subjects exhibiting minor neurological symptoms (ADC
stage 0.5) and neurologically unimpaired individuals (ADC stage 0) (Nakas
and Yiannoutsos, 2004). For such disease processes with three stages,
binary statistical tools such as the ROC curve and AUC need to be extended.
In this case of three ordinal diagnostic categories, the ROC surface and the
volume under the surface can be applied to assess the accuracy of tests. For
details, we refer the reader to, for example, Nakas and Yiannoutsos (2004).
In contrast, logistic regression can be easily generalized, e.g., to multino-
mial logistic regression, in order to deal with multiclass problems.
There are several measurements that are often used in conjunction with
the ROC curve technique. For example: (1) the Youden index is a summary
measure of the ROC curve. It both measures the effectiveness of a diagnostic
A Brief Review of Receiver Operating Characteristic Curve Analyses 203

marker and enables the selection of an optimal threshold value (cutoff point)
for the biomarker. The Youden index, J, is the point on the ROC curve that is
farthest from the line of equality (the diagonal line), that is, using the definitions
applied in Section 7.2, one can show that J = max −∞<c <∞ {Pr (X ≥ c) + Pr (Y ≤ c) − 1}
(e.g., Fluss et al., 2005; Schisterman et al., 2005). (2) The partial AUC (pAUC)
has been proposed as an alternative measure to the full AUC. When using
the partial AUC, one considers only those regions of the ROC space where
data have been observed, or which correspond to clinically relevant values of


test sensitivity or specificity, that is, pAUC = ∫_a^b R(t) dt for prespecified
0 ≤ a < b ≤ 1. Assume, for example, random variables Y_D and Y_D̄ are from the
distribution functions F_{Y_D} and F_{Y_D̄} that correspond to the biomarker's measurements
from diseased (D) and non-diseased (D̄) subjects, respectively.
The partial area under the ROC curve is the area under a portion of the
ROC curve, oftentimes defined as the area between two false positive rates
(FPRs). For example, the pAUC with two fixed a priori values for FPRs t0
and t1 is

\[
\mathrm{pAUC} = \int_{t_0}^{t_1} R(t)\, dt
= \Pr\left\{ Y_D > Y_{\bar{D}},\; Y_{\bar{D}} \in \left( S_{Y_{\bar{D}}}^{-1}(t_1),\, S_{Y_{\bar{D}}}^{-1}(t_0) \right) \right\},
\]
where S_{Y_D} and S_{Y_D̄} are the survival functions of the diseased and healthy
group, respectively. To simplify this notation, we denote q_0 = S_{Y_D̄}^{-1}(t_0) and
q_1 = S_{Y_D̄}^{-1}(t_1). Then
\[
\mathrm{pAUC} = \Pr\left\{ Y_D > Y_{\bar{D}},\; Y_{\bar{D}} \in (q_0, q_1) \right\}.
\]
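Simple empirical versions of these two summary measures can be sketched in R as follows (a hedged illustration: the helper functions and the simulated data are ours, and the partial AUC is approximated by a Riemann sum over the empirical ROC curve):

youden<-function(x,y){                 # x: diseased, y: non-diseased measurements
  cuts<-sort(unique(c(x,y)))
  J<-sapply(cuts,function(cv) mean(x>=cv)+mean(y<=cv)-1)
  c(J=max(J),cutoff=cuts[which.max(J)])
}
pAUC<-function(x,y,a=0,b=0.2){
  t<-seq(a,b,length=1000)
  R<-sapply(t,function(u) mean(x>quantile(y,1-u)))  # empirical R(t)=1-F(G^{-1}(1-t))
  mean(R)*(b-a)                                     # Riemann approximation of the integral
}
set.seed(5)
x<-rnorm(100,1); y<-rnorm(100,0)
youden(x,y); pAUC(x,y,0,0.2)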


8
The Ville and Wald Inequality:
Extensions and Applications

8.1 Introduction
The purpose of this chapter is twofold with respect to: (1) Our desire to intro-
duce a very powerful theoretical scheme for constructing statistical proce-
dures, and (2) We would like to show again arguments against statistical
stereotypes. For example, one generally can state that if a statistical test has
a reasonable fixed Type I error rate then the test cannot be of power one. In
this chapter, a test with power one is presented.
In previous chapters we employed the law of large numbers and the cen-
tral limit theorem to propose and examine statistical tools. In this chap-
ter we apply the law of the iterated logarithm. The material presented in
this chapter may seem overly technical, however we encourage the reader
to study the methodological technique introduced below in order to obtain
beneficial skills in developing advanced statistical procedures.
The Ville and Wald inequality: In Chapter 4, Doob’s inequality was intro-
duced with a focus on developing a bound for the distribution of a statistic,
which can be anticipated to be increasing when the sample size n increases.
In this case, we concluded that Doob’s inequality can be expected to be
very accurate in terms of such a bound. The right side of Doob’s inequal-
ity (the bound) does not depend on n. In the asymptotic framework this
result begs the following question: Can we apply the symbolic expression
"lim_{n→∞} Pr(·) = Pr(lim_{n→∞} ·) ≤ The Bound" with respect to Doob's inequality?
In general, since the probability can be written as Pr(·) = E{I(·)} = ∫ I(·), where
I(·) denotes the indicator function, the action "lim_{n→∞} Pr(·) = Pr(lim_{n→∞} ·)"
corresponds to complicated issues related to possibly plugging the opera-
tor lim n→∞ into the integral. In this context, the Fatou–Lebesgue theorem
(Titchmarsh, 1976) can justify the operation “ lim n→∞ Pr (.) = Pr (lim n→∞ )”, pro-
vided that the conditions of the theorem are satisfied. In this chapter, when
n = ∞, we introduce a simple extension of Doob’s inequality that is entitled
the Ville and Wald inequality. This result was presented by Ville in 1939


as well as by Wald in 1947 (Ville, 1939; Wald, 1947). As mentioned above, the Ville and Wald inequality can be assumed to be extremely accurate. We aim to formulate the inequality in terms of sums of iid random variables, offering general opportunities to construct various powerful statistical
procedures (see also Section 2.4.5.2 in this context) including, e.g., tests with
power one.
The law of the iterated logarithm: In Section 2.4.5.2 we stated a question regarding the asymptotic behavior of the statistic $S_n/b_n$, where $S_n = \sum_{i=1}^{n} X_i$ is a sum of $n$ iid random variables with zero expectation and a deterministic sequence $b_n \to \infty$ as $n \to \infty$, "in between" the law of large numbers ($b_n = n$) and the central limit theorem ($b_n = n^{1/2}$). The law of the iterated logarithm,

$$\Pr\left\{\limsup_{n\to\infty}\frac{S_n}{\left(2n\,\mathrm{var}(X_1)\log(\log(n))\right)^{1/2}} = 1\right\} = 1 \quad (\log(e) = 1),$$

precisely assesses the asymptotic behavior of Sn /bn . Thus we need to “press”


on Sn a “little bit” more than bn = n1/2 to force Sn /bn to not be considered
as a random variable even with respect to its maximum values as n → ∞ .
Note that, in the statistical context and in many scenarios, the quality of a statistical tool depends on its ability to predict deterministic bounds of the random variable Sn. In this context we have the following
scheme: (1) having the rule to control Sn /n, one can provide several statistical
procedures, e.g., consistent estimation; (2) having the rule to control Sn /n1/2 ,
one can evaluate properties of statistical procedures at Item (1) above and
develop efficient tests (see previous chapters in this book); and (3) the law of
the iterated logarithm detects very accurately deterministic bounds for Sn ,
providing extremely efficient statistical algorithms (e.g., a test with power
one we will show in this chapter) based on an anticipated deterministic
(data-free) forecasting related to $S_n$'s values. It turns out that the sequence $\left(2n\,\mathrm{var}(X_1)\log(\log(n))\right)^{1/2}$ increases about as slowly as is possible for sums of iid random variables with mean 0,

$$\Pr\left\{\max_{1\le n}\left(S_n - \left(2n\,\mathrm{var}(X_1)\log(\log(n))\right)^{1/2}\right) \ge 0\right\} = \Pr\left\{S_n \ge \left(2n\,\mathrm{var}(X_1)\log(\log(n))\right)^{1/2} \text{ for some } n \ge 1\right\} < 1.$$
For example, when X 1 ,..., X n are iid data points it is reasonable to reject the
hypothesis E (X 1) = 0, if Sn does not follow the law of the iterated logarithm
related to the case with E (X 1) = 0. For illustrative purposes, we generated 5,000
iid random variables $X_i = \xi_i - 1 \ge -1$ with $\Pr(\xi_1 < u) = 1 - \exp(-u)$ and computed the sequence $S_1 = X_1$, $S_2 = (X_1 + X_2)/2^{1/2}$, $S_3 = (X_1 + X_2 + X_3)/3^{1/2}, \ldots$. Values of $S_n/n^{1/2}$ are depicted in Figure 8.1 against $n = 12, \ldots, 5000$ using "◦" symbols. In this figure, panel (a) represents the bounds $\pm 1$ (the central limit theorem) and panel (b) represents the bounds $\pm\left(2\log(\log(n))\right)^{1/2}$ (the law of the iterated logarithm).

FIGURE 8.1
Values of $S_n/n^{1/2}$ (◦◦◦) plotted against $n = 12, \ldots, 5000$. Panel (a) represents the bounds $\pm 1$ and panel (b) represents the bounds $\pm\left(2\log(\log(n))\right)^{1/2}$.
It is clear that, in terms of Type I error rate control, the law of the iterated logarithm could ensure the use of a test procedure based on $S_n/n^{1/2}$ in a better manner than the use of the central limit theorem. The "payment" for this improvement is only $\left(2\log(\log(n))\right)^{1/2}$, which clearly cannot significantly impact the power of the test when $S_n \sim nE(X_1)$ with $E(X_1) \ne 0$.
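The following R code is a minimal sketch that reproduces an experiment of the type shown in Figure 8.1: it generates iid centered exponential variables and plots Sn/n^(1/2) together with the bounds ±1 and ±(2 log(log(n)))^(1/2); the seed and the plotting details are arbitrary assumptions.

set.seed(123)
n.max <- 5000
x <- rexp(n.max) - 1                    # X_i = xi_i - 1 with Pr(xi_1 < u) = 1 - exp(-u)
n <- 12:n.max
Sn.scaled <- cumsum(x)[n] / sqrt(n)     # values of S_n / n^(1/2)

plot(n, Sn.scaled, pch = 1, cex = 0.3, ylim = c(-2, 2),
     xlab = "n", ylab = expression(S[n]/sqrt(n)))
abline(h = c(-1, 1), lty = 2)           # the bounds +/-1 (central limit theorem), panel (a)
lines(n,  sqrt(2 * log(log(n))))        # the bounds +/-(2 log(log(n)))^(1/2), panel (b)
lines(n, -sqrt(2 * log(log(n))))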


In this chapter we will apply the law of the iterated logarithm to judge the
proposed techniques.
The reader has all ingredients, including those provided in this book, to
prove the law of the iterated logarithm. However, in actuality the proof is
quite technical and thus we omit it, referring the reader to Petrov (1975).
We suggest that the reader who is interested in more details regarding this
chapter’s material consult the fundamental publications of Professor Herbert
Robbins (e.g., Robbins, 1970).
In Section 8.2 we consider the Ville and Wald inequality. This inequality is
extended and rewritten in terms of sums of iid random variables in Section 8.3.
The statistical significance of the obtained results is introduced in Section 8.4.

8.2 The Ville and Wald inequality


Consider two assumptions H and H′. Under H the random variables (not necessarily iid) X 1 ,..., X n with a specified joint distribution function P have
a density function g n ( x1 ,..., xn ) , whereas under H′ for each n ≥ 1 the
sequence X 1 ,..., X n is from P′, any other joint probability distribution,
with a density function g n′ ( x1 ,..., xn ) . In this case, the likelihood ratio is
LRn = g n′ (X 1 ,..., X n )/g n (X 1 ,..., X n ) when g n (X 1 ,..., X n ) > 0. Then, under H , for
any ε > 0,

PrH (LRn ≥ ε for some n ≥ 1) ≤ 1/ε.

Proof. Define the stopping rule $N = \inf\{n \ge 1 : LR_n \ge \varepsilon\}$ with $N = \infty$ if no such $n$ with $LR_n \ge \varepsilon$ exists. Thus, we have

$$\Pr_H(LR_n \ge \varepsilon \text{ for some } n \ge 1) = \Pr_H(N < \infty) = \sum_{j=1}^{\infty}\Pr_H(N = j) = \sum_{j=1}^{\infty}E_H\{I(N = j)\}$$
$$= \sum_{j=1}^{\infty}\int I(N = j)\, g_j(x_1,\ldots,x_j)\,dx_1\cdots dx_j$$
$$= \sum_{j=1}^{\infty}\int I\left(\max_{1\le k<j} LR_k < \varepsilon,\ LR_j \ge \varepsilon\right)\frac{g_j(x_1,\ldots,x_j)}{g'_j(x_1,\ldots,x_j)}\, g'_j(x_1,\ldots,x_j)\,dx_1\cdots dx_j$$
$$= \sum_{j=1}^{\infty}\int I\left(\max_{1\le k<j} LR_k < \varepsilon,\ LR_j \ge \varepsilon\right)\frac{1}{LR_j}\, g'_j(x_1,\ldots,x_j)\,dx_1\cdots dx_j$$
$$\le \sum_{j=1}^{\infty}\int I\left(\max_{1\le k<j} LR_k < \varepsilon,\ LR_j \ge \varepsilon\right)\frac{1}{\varepsilon}\, g'_j(x_1,\ldots,x_j)\,dx_1\cdots dx_j$$
$$= \frac{1}{\varepsilon}\sum_{j=1}^{\infty}\int I(N = j)\, g'_j(x_1,\ldots,x_j)\,dx_1\cdots dx_j = \frac{1}{\varepsilon}\Pr_{H'}(N < \infty) \le \frac{1}{\varepsilon}.$$

This completes the proof.


Note that $\Pr_H(LR_n \ge \varepsilon \text{ for some } n \ge 1) = \Pr_H\left(\max_{1\le k\le\infty} LR_k \ge \varepsilon\right)$ and $LR_n$ with the
sigma algebra based on (X 1 ,..., X n) is an H -martingale. This supports our
conclusion presented in Chapter 4 that statistics with martingale properties
can provide efficient decision-making schemes. It turns out that even maxi-
mum values of the likelihood ratio are in control over all n ≥ 1.
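To illustrate the inequality numerically, the following R code is a minimal Monte Carlo sketch in which the data are iid N(0,1) under H and H′ corresponds to iid N(0.5, 1); it approximates Pr_H(LR_n ≥ ε for some n up to a truncation point), which should not exceed 1/ε. The values of ε, the truncation point n.max, and the alternative mean are assumptions chosen only for illustration.

set.seed(321)
MC <- 5000; n.max <- 200                 # number of repetitions and truncation point (assumptions)
eps <- 20; theta1 <- 0.5                 # threshold and the alternative mean (assumptions)
cross <- 0
for (r in 1:MC) {
  x <- rnorm(n.max, 0, 1)                # data generated under H: iid N(0,1)
  logLR <- cumsum(theta1 * x - theta1^2 / 2)   # log LR_n of N(theta1,1) versus N(0,1)
  if (any(logLR >= log(eps))) cross <- cross + 1
}
cross / MC     # Monte Carlo estimate of Pr_H(LR_n >= eps for some n <= n.max)
1 / eps        # the Ville and Wald upper bound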

8.3 The Ville and Wald Inequality Modified in Terms of Sums of iid Observations
As discussed in Section 8.1, it is reasonable to study several techniques pre-
sented as examples of reformulations of the Ville and Wald inequality in terms
of sums of iid observations. Towards this end, we will employ an approach
that can be associated with the material presented in Chapter 5 when the
likelihood function is proposed to be integrated over different values of the
parameter. Let the observations $X_1,\ldots,X_n$ be iid $N(0,1)$ under H. In this case, $g_n(x_1,\ldots,x_n) = \prod_{i=1}^{n}\varphi(x_i)$ with $\varphi(x) = (2\pi)^{-1/2}\exp(-x^2/2)$. Assume that H′ corresponds to a mixture of distribution functions $P_\theta$, where $P_\theta$ denotes that $X_1,\ldots,X_n$ are iid $N(\theta,1)$, defining $g'_n(x_1,\ldots,x_n) = \int_{-\infty}^{\infty}\prod_{i=1}^{n}\varphi(x_i - \theta)\,dF(\theta)$, for an arbitrary distribution function F. Thus we obtain the likelihood ratio

$$LR_n = g'_n(X_1,\ldots,X_n)/g_n(X_1,\ldots,X_n) = \int_{-\infty}^{\infty}\exp\left(\theta S_n - n\theta^2/2\right)dF(\theta), \quad S_n = \sum_{i=1}^{n}X_i.$$

This likelihood ratio can be interpreted in the contexts shown in Chapter 5. Section 5.2 provides the reader with a necessary proof scheme to extend the Ville and Wald inequality to be appropriate for the Bayes factor, when the null hypothesis H states that the parameter equals zero. For reasons which will become apparent in the next sections related to applications of the presented method, we replace $F(\theta)$ by $F(\theta m^{1/2})$, where $m$ is an arbitrary positive constant, since $F$ is an arbitrary distribution function. Then we rewrite

$$LR_n = \int_{-\infty}^{\infty}\exp\left(\theta S_n - n\theta^2/2\right)dF\left(\theta m^{1/2}\right).$$

Define the function

$$f(x,t) = \int_{-\infty}^{\infty}\exp\left(yx - ty^2/2\right)dF(y).$$

Then

$$LR_n = \int_{-\infty}^{\infty}\exp\left(\theta S_n - n\theta^2/2\right)dF\left(\theta m^{1/2}\right) = \int_{-\infty}^{\infty}\exp\left(yS_n m^{-1/2} - ny^2(2m)^{-1}\right)dF(y) = f\left(S_n m^{-1/2},\ n/m\right).$$

Thus the Ville and Wald inequality states

$$\Pr_H\left\{f\left(S_n m^{-1/2}, n/m\right) \ge \varepsilon \text{ for some } n \ge 1\right\} = \Pr_{\theta=0}\left\{f\left(S_n m^{-1/2}, n/m\right) \ge \varepsilon \text{ for some } n \ge 1\right\} \le 1/\varepsilon \quad (m > 0,\ \varepsilon > 0).$$

The following examples illustrate the significance of the result above, consid-
ering particular choices of F.

Example 1. Suppose the distribution function F corresponds to random variables having support over $(0,\infty)$. Then $f(x,t) = \int_0^{\infty}\exp\left(yx - ty^2/2\right)dF(y)$ is an increasing function of $x$. Hence

$$f(x,t) \ge \varepsilon \text{ if and only if } x \ge A(t,\varepsilon), \text{ for all } t > 0,$$

where $A(t,\varepsilon)$ is the positive solution $x$ of the equation $f(x,t) = \varepsilon$, i.e., $f(x = A(t,\varepsilon), t) = \varepsilon$. In this case, the Ville and Wald inequality shows that

$$\Pr_H\left\{S_n \ge m^{1/2}A(n/m,\varepsilon) \text{ for some } n \ge 1\right\} = \Pr_{\theta=0}\left\{S_n \ge m^{1/2}A(n/m,\varepsilon) \text{ for some } n \ge 1\right\} \le 1/\varepsilon.$$

This result remains valid if instead of being iid N (θ,1) the observations are
any iid random variables that satisfy E {exp (uX 1)} < ∞ for all 0 < u < ∞ (Robbins,
1970). This comment is applicable regarding the following examples too.

Example 2. Suppose the distribution function F is degenerate with mass one


assigned to the point 2 a > 0. This scenario can be associated with the hypoth-
esis H : θ = 0 versus H ' : θ = 2 a. As in Example 1 above, we have

$$f(x,t) = \exp\left(2ax - 2a^2t\right) \ge \varepsilon \text{ if and only if } x \ge at + \{\log(\varepsilon)\}/(2a), \text{ for all } t > 0.$$

Thus

PrH {Sn ≥ an / m1/2 + dm1/2 for some n ≥ 1}

= Prθ= 0 {Sn ≥ an / m1/2 + dm1/2 for some n ≥ 1} ≤ e −2 ad ,

where d = log(ε)/(2 a) > 0, a > 0 and m > 0.
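As a numerical illustration of Example 2, the following R sketch approximates the crossing probability of the boundary an/m^(1/2) + dm^(1/2) under θ = 0 by Monte Carlo (up to a truncation point) and compares it with the bound e^(−2ad); the values of a, d, m, and n.max are assumptions.

set.seed(7)
MC <- 5000; n.max <- 500
a <- 0.3; d <- 3; m <- 25                      # assumed constants: a > 0, d > 0, m > 0
cn <- a * (1:n.max) / sqrt(m) + d * sqrt(m)    # the boundary an/m^(1/2) + dm^(1/2)
hit <- 0
for (r in 1:MC) {
  Sn <- cumsum(rnorm(n.max, 0, 1))             # S_n under theta = 0
  if (any(Sn >= cn)) hit <- hit + 1
}
hit / MC          # Monte Carlo estimate of the crossing probability (up to n.max)
exp(-2 * a * d)   # the upper bound exp(-2ad) from Example 2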



Example 3. Suppose the distribution function F is defined by $F(y) = I(y > 0)\left(\frac{2}{\pi}\right)^{1/2}\int_0^{y}e^{-u^2/2}\,du$. Then for $t > -1$ we have

$$f(x,t) = \left(\frac{2}{\pi}\right)^{1/2}\int_0^{\infty}e^{yx - (t+1)y^2/2}\,dy = \frac{2\exp\left\{\tfrac{1}{2}x^2/(t+1)\right\}}{(t+1)^{1/2}}\,\Phi\!\left(\frac{x}{(t+1)^{1/2}}\right), \quad \Phi(u) = \frac{1}{(2\pi)^{1/2}}\int_{-\infty}^{u}e^{-y^2/2}\,dy.$$

Defining $h^{-1}(u)$ as the inverse function to $h(u) = u^2 + 2\log(\Phi(u))$, one can obtain $A(t,\varepsilon) = (t+1)^{1/2}h^{-1}\left(2\log(\varepsilon/2) + \log(t+1)\right)$, which yields

$$\Pr_{\theta=0}\left\{S_n \ge (m+n)^{1/2}h^{-1}\left(h(a) + \log\left(\frac{n}{m}+1\right)\right) \text{ for some } n \ge 1\right\} \le \frac{1}{2\Phi(a)}e^{-a^2/2},$$

where $0 < a$ satisfies $\varepsilon = 2\Phi(a)e^{a^2/2}$, $m > 0$, and it is known that $h(u) \sim u^2$, $h^{-1}(u) \sim u^{1/2}$ as $u \to \infty$.

Example 4. Omitting technical details, one can show that the choice of

$$F(y) = I\left(0 < y < e^{-e}\right)\,\delta\int_0^{y}\frac{du}{u\,\log(1/u)\left\{\log\left(\log(1/u)\right)\right\}^{1+\delta}} \quad \text{for all } \delta > 0$$

provides that $\Pr_H\{S_n \ge c_n \text{ for some } n \ge 1\} = \Pr_{\theta=0}\{S_n \ge c_n \text{ for some } n \ge 1\} \le 1/\varepsilon$, $c_n = m^{1/2}A(n/m,\varepsilon)$, with $c_n \sim \left(2n\log(\log(n))\right)^{1/2}$ as $n \to \infty$ (Robbins and Siegmund, 1970).

Example 5. Define the distribution function F to be symmetric ($F(-u) = 1 - F(u)$, $u \in (-\infty,\infty)$). Then $f(x,t) \ge \varepsilon$ if and only if $|x| \ge A(t,\varepsilon)$, for all $t > 0$. For example, if $F(u) = \Phi(u)$ we obtain

$$\Pr_{\theta=0}\left\{|S_n| \ge (m+n)^{1/2}\left(a^2 + \log\left(\frac{n}{m}+1\right)\right)^{1/2} \text{ for some } n \ge 1\right\} \le e^{-a^2/2},$$

where $a > 0$ and $m > 0$. As a one-sided version of this inequality, noting that for all $t > 1$, $\left(a^2 + \log(t)\right)^{1/2} > h^{-1}\left(h(a) + \log(t)\right)$ (the reader can simply plot the functions $\left(a^2 + \log(t)\right)^{1/2}$ and $h^{-1}\left(h(a) + \log(t)\right)$ against $t > 1$, where $h$ and $h^{-1}$ are defined in Example 3 above), we can use the output of Example 3 to obtain

$$\Pr_{\theta=0}\left\{S_n \ge (m+n)^{1/2}\left(a^2 + \log\left(\frac{n}{m}+1\right)\right)^{1/2} \text{ for some } n \ge 1\right\} \le e^{-a^2/2}/(2\Phi(a)).$$

In a similar manner to that shown above, Robbins (1970) and Robbins and Siegmund (1970) concluded with

$$\Pr_{\theta=0}\left\{|S_n| \ge n^{1/2}\left(a^2 + \log\left(\frac{n}{m}\right)\right)^{1/2} \text{ for some } n \ge m\right\} \le 2\left(1 - \Phi(a) + a\varphi(a)\right), \quad \varphi(u) = (2\pi)^{-1/2}e^{-u^2/2}.$$

Actually, in the above inequality we can see the role of m > 0, when it is
stated that “ for some n ≥ m.” The integer m can define a pre-sample size of
observations after which we analyze max m≤ n<∞ Sn .
The preceding examples show how to transform the Ville and Wald inequality to the form $\Pr_H\{S_n \ge c_n \text{ for some } n\} \le 1/\varepsilon$ or $\Pr_H\{|S_n| \ge c_n \text{ for some } n\} \le 1/\varepsilon$, where $c_n$ depends on $\varepsilon > 0$. Example 2 derives $c_n = an/m^{1/2} + dm^{1/2} \sim an/m^{1/2}$, Example 3 derives $c_n = (m+n)^{1/2}h^{-1}\left(h(a) + \log\left(\frac{n}{m}+1\right)\right) \sim (n\log(n))^{1/2}$ and Example 4 concludes with $c_n \sim \left(2n\log(\log(n))\right)^{1/2}$ as $n \to \infty$. Let us focus again on the law of the iterated logarithm. The sequence $c_n \sim \left(2n\log(\log(n))\right)^{1/2}$ increases about as slowly as is possible for sums of iid random variables with mean 0 and variance 1, $\Pr_H\{S_n \ge c_n \text{ for some } n \ge 1\} < 1$. Assume, for example, a researcher attempts to improve upon $c_n \sim \left(2n\log(\log(n))\right)^{1/2}$, with the aim of providing a limit to the expression $\Pr_H\left\{S_n \ge c_n\left(\log(\log(\log(n)))\right)^{1/2}/\left(\log(\log(n))\right)^{1/2} \text{ for some } n\right\}$. In this case, the law of the iterated logarithm implies that this approach would be invalid given that

$$\Pr_H\left\{S_n \ge c_n\left(\log(\log(\log(n)))\right)^{1/2}/\left(\log(\log(n))\right)^{1/2} \text{ for some } n\right\} = 1.$$

In order to demonstrate this point, we generated 10,000 iid random variables $X_i \sim N(0,1)$ and computed the sequence $S_1 = X_1$, $S_2 = X_1 + X_2$, $S_3 = X_1 + X_2 + X_3, \ldots$. Values of $S_n$ are plotted in Figure 8.2 against $n = 150, \ldots, 10000$ using "◦" symbols. In this figure, we present the bounds $\pm\left(2n\log(\log(n))\right)^{1/2}$ (curves —) and $\pm\left(2n\log(\log(\log(n)))\right)^{1/2}$ (curves - - - -).
FIGURE 8.2
Values of $S_n$ (◦◦◦) plotted against $n = 150, \ldots, 10000$. The figure presents the bounds $\pm\left(2n\log(\log(n))\right)^{1/2}$ (curves —) and $\pm\left(2n\log(\log(\log(n)))\right)^{1/2}$ (curves - - - -).

8.4 Applications to Statistical Procedures


Consider the statistical significance of the results mentioned in Section 8.3.

8.4.1 Confidence Sequences and Tests with Uniformly Small Error Probability for the Mean of a Normal Distribution with Known Variance
Let the observations X 1 ,..., X n , X n + 1 ,... be independent and identically N (θ,1)
distributed, where θ is an unknown parameter, −∞ < θ < ∞. We are inter-
ested in obtaining the intervals I n = ((Sn − cn )/n, (Sn + cn )/n) , where n ≥ 1,
Sn = X 1 + ... + X n and cn is a sequence of positive constants, such that we can
control the coverage probability Prθ (θ ∈ I n for every n ≥ 1).
It is clear that

$$\begin{aligned}
\Pr_\theta(\theta \in I_n \text{ for every } n \ge 1) &= \Pr_\theta(S_n - c_n < \theta n < S_n + c_n \text{ for every } n \ge 1)\\
&= \Pr_\theta(S_n - c_n < \theta n,\ S_n + c_n > \theta n \text{ for every } n \ge 1)\\
&= \Pr_\theta(S_n - \theta n < c_n,\ S_n - \theta n > -c_n \text{ for every } n \ge 1)\\
&= \Pr_\theta\left(\left|\sum_{i=1}^{n}(X_i - \theta)\right| < c_n \text{ for every } n \ge 1\right)\\
&= \Pr_{\theta=0}(|S_n| < c_n \text{ for every } n \ge 1).
\end{aligned}$$

Note that, in the equation above, we changed the measure $\Pr_\theta$ to be $\Pr_{\theta=0}$, since, under the assumption $X_i \sim N(\theta,1)$, the expression $\frac{S_n - c_n}{n} < \theta < \frac{S_n + c_n}{n}$ is equivalent to $|S_n| < c_n$, when $X_i \sim N(0,1)$. Thus

$$\Pr_\theta(\theta \in I_n \text{ for every } n \ge 1) = \Pr_0(|S_n| < c_n \text{ for every } n \ge 1) = 1 - \Pr_0(|S_n| \ge c_n \text{ for some } n \ge 1).$$

Define $c_n = [(n+m)(a^2 + \log(n/m+1))]^{1/2}$, with $0 < c_n/n \to 0$ as $n \to \infty$. In this case, Example 5 in Section 8.3 provides

$$\Pr_0(|S_n| \ge c_n \text{ for some } n \ge 1) \le e^{-a^2/2}.$$

Hence

$$\Pr_\theta(\theta \in I_n \text{ for every } n \ge 1) = \Pr_\theta\left(\theta \in \bigcap_{i=1}^{\infty} I_i\right) \ge 1 - e^{-a^2/2},$$

which can be made near 1 by choosing $a$ sufficiently large, e.g., $a^2 = 6$ yields $1 - \exp(-a^2/2) \cong 0.95$. Thus for $a^2 = 6$ and any $m > 0$, the sequence $I_n$ with $c_n$ defined above forms a 95% confidence sequence for the unknown θ with coverage probability ≥ 0.95.
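The following R code is a minimal sketch that computes the confidence sequence In with cn = [(n + m)(a² + log(n/m + 1))]^(1/2) for a stream of N(θ,1) observations and checks whether θ is contained in In at every stage; the values of θ, m, and the number of observations are assumptions used only for illustration.

set.seed(10)
theta <- 2; a2 <- 6; m <- 1; n.max <- 1000      # a^2 = 6 gives coverage >= 1 - exp(-3), about 0.95
x  <- rnorm(n.max, theta, 1)
n  <- 1:n.max
Sn <- cumsum(x)
cn <- sqrt((n + m) * (a2 + log(n / m + 1)))     # c_n = [(n + m)(a^2 + log(n/m + 1))]^(1/2)
lower <- (Sn - cn) / n                          # I_n = ((S_n - c_n)/n, (S_n + c_n)/n)
upper <- (Sn + cn) / n
all(lower < theta & theta < upper)              # TRUE if theta is covered at every stage n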
Following the conventional concept of confidence interval constructions (see Chapter 9 in this context), one can use the fact $S_n = X_1 + \ldots + X_n \sim N(\theta n, n)$ to propose, for any fixed $n$,

$$\Pr_\theta\left\{(S_n - 1.96n^{1/2})/n < \theta < (S_n + 1.96n^{1/2})/n\right\} = \Pr_\theta\left\{\theta < (S_n + 1.96n^{1/2})/n\right\} - \Pr_\theta\left\{\theta < (S_n - 1.96n^{1/2})/n\right\}$$
$$= 1 - \Pr_\theta\left\{(S_n - \theta n)/n \le -1.96n^{-1/2}\right\} - 1 + \Pr_\theta\left\{(S_n - \theta n)/n \le 1.96n^{-1/2}\right\} = \Phi(1.96) - \Phi(-1.96) \cong 0.95.$$

However, by virtue of the law of the iterated logarithm, we conclude that

$$\Pr_\theta\left\{(S_n - 1.96n^{1/2})/n < \theta < (S_n + 1.96n^{1/2})/n \text{ for every } n \ge 1\right\} = 0.$$

The advantage of the confidence sequence $I_n$ as compared to a fixed sample size confidence interval is that it allows us to "follow" the unknown parameter throughout the whole sequence $X_1,\ldots$ with an interval $I_n$ whose length shrinks to 0 as the sample size increases, in such a way that with probability ≥ 0.95 the interval $I_n$ contains the parameter at every stage. This advantage costs us a widening of the intervals $\left((S_n - 1.96n^{1/2})/n,\ (S_n + 1.96n^{1/2})/n\right)$ to $\left((S_n - c_n)/n,\ (S_n + c_n)/n\right)$ with respect to an asymptotic order of $O\left(\{\log(n)/n\}^{1/2}\right)$ as $n \to \infty$.

8.4.2 Confidence Sequences for the Median


Let Z1 , Z2 ,... be iid data points with median M that satisfy Pr(Z1 ≤ M) = 0.5.
The usual confidence interval for M, provided that the sample size n is rela-
tively large, is

$$\Pr\left\{Z_{a_1}^{(n)} \le M \le Z_{a_2}^{(n)}\right\} \cong 2\Phi(a) - 1,$$

where $Z_1^{(n)} \le \ldots \le Z_n^{(n)}$ denote the ordered values $Z_1,\ldots,Z_n$ and

$$a_1 = a_1(n) = \text{largest integer} \le 0.5(n - an^{0.5}), \quad a_2 = a_2(n) = \text{smallest integer} \ge 0.5(n + an^{0.5}).$$

Proof. As used in various proof schemes shown in this book, the idea of this proof is to rewrite the stated problem to a formulation in terms of sums of iid random variables and then to employ an appropriate theorem related to sums of iid random variables. In this case, we aim to use the central limit theorem (Section 2.4.5) via the following algorithm: Define the empirical distribution function $F_n(u) = \sum_{i=1}^{n} I(Z_i \le u)/n$. The function $F_n(u)$ increases, providing

$$\Pr\left\{Z_{a_1}^{(n)} \le M\right\} = \Pr\left\{F_n\left(Z_{a_1}^{(n)}\right) \le F_n(M)\right\} = \Pr\left\{a_1/n \le F_n(M)\right\},$$

since $F_n\left(Z_{a_1}^{(n)}\right) = \sum_{i=1}^{n} I\left(Z_i \le Z_{a_1}^{(n)}\right)/n = a_1/n$. Thus

$$\Pr\left\{Z_{a_1}^{(n)} \le M\right\} = \Pr\left\{a_1/n \le F_n(M)\right\} = 1 - \Pr\left\{\sum_{i=1}^{n}\left\{I(Z_i \le M) - 0.5\right\}/\left(0.5n^{1/2}\right) < -a\right\},$$

where $I(Z_1 \le M),\ldots,I(Z_n \le M)$ are iid random variables with $E\{I(Z_1 \le M)\} = \Pr(Z_1 \le M) = 0.5$ and $\mathrm{var}\{I(Z_1 \le M)\} = E\{I(Z_1 \le M) - 0.5\}^2 = 0.5^2$. Then $\Pr\left\{Z_{a_1}^{(n)} \le M\right\} \to 1 - \Phi(-a) = \Phi(a)$ as $n \to \infty$. Similarly, we have $\Pr\left\{M \le Z_{a_2}^{(n)}\right\} \to \Phi(a)$ as $n \to \infty$. Since

$$\Pr\left\{Z_{a_1}^{(n)} \le M \le Z_{a_2}^{(n)}\right\} = \Pr\left\{M \le Z_{a_2}^{(n)}\right\} - \Pr\left\{M \le Z_{a_1}^{(n)}\right\} = \Pr\left\{M \le Z_{a_2}^{(n)}\right\} - 1 + \Pr\left\{M > Z_{a_1}^{(n)}\right\} \cong \Phi(a) - 1 + \Phi(a),$$

the proof is complete.

In order to construct a confidence sequence for M, we define

$$X_i = \begin{cases} 1, & \text{if } Z_i \le M\\ -1, & \text{if } Z_i > M, \end{cases} \quad S_n = X_1 + \ldots + X_n$$

and

$$b_1 = b_1(n) = \text{largest integer} \le 0.5(n - c_n), \quad b_2 = b_2(n) = \text{smallest integer} \ge 0.5(n + c_n),$$

where $c_n$ is some sequence of positive constants. Now we will obtain that

$$\Pr\left(Z_{b_1}^{(n)} \le M \le Z_{b_2}^{(n)} \text{ for every } n \ge m > 0\right) \ge 1 - \Pr\left(|S_n| \ge c_n \text{ for some } n \ge m > 0\right).$$

Towards this end, we consider the event $A = \left\{Z_{b_1}^{(n)} > M \text{ or } Z_{b_2}^{(n)} < M\right\}$, $\bar A = \left\{Z_{b_1}^{(n)} \le M \le Z_{b_2}^{(n)}\right\}$ (here $\bar A$ is the complement of $A$). If $Z_{b_1}^{(n)} > M$ then $M < Z_{b_1}^{(n)} \le Z_{b_1+1}^{(n)} \le Z_{b_1+2}^{(n)} \le \ldots \le Z_n^{(n)}$ and we have $(n - b_1)$ times the value of $(-1)$ present in $X_1,\ldots,X_n$. The values of the $X$'s, which do not correspond to $Z_{b_1}^{(n)},\ldots,Z_n^{(n)}$, can be equal to 1 or $(-1)$ but must be $\le 1$. This implies

$$S_n = X_1 + \ldots + X_n \le (-1)(n - b_1) + (1)b_1 = 2b_1 - n \le -c_n.$$

If $Z_{b_2}^{(n)} < M$ then $Z_1^{(n)} < \ldots < Z_{b_2}^{(n)} < M$ and we have $b_2$ times the value of 1 present in $X_1,\ldots,X_n$. In this case, the values of the $X$'s, which do not correspond to $Z_1^{(n)},\ldots,Z_{b_2}^{(n)}$, can be equal to 1 or $(-1)$ but must be $\ge (-1)$. This implies

$$S_n = X_1 + \ldots + X_n \ge (1)b_2 + (-1)(n - b_2) = 2b_2 - n \ge c_n.$$

Therefore $A \subseteq \{|S_n| \ge c_n\}$, leading to $\Pr(\bar A) = 1 - \Pr(A) \ge 1 - \Pr\{|S_n| \ge c_n\}$, for all $n$. Thus

$$\Pr\left(Z_{b_1}^{(n)} \le M \le Z_{b_2}^{(n)} \text{ for every } n \ge m > 0\right) \ge 1 - \Pr\left(|S_n| \ge c_n \text{ for some } n \ge m > 0\right).$$

It is clear that $E\{\exp(uX_1)\} < \infty$, for all $0 < u < \infty$. Then, following Example 5 in Section 8.3, we use the sequence $c_n = [n(a^2 + \log(n/m))]^{1/2}$, obtaining $\Pr(|S_n| \ge c_n \text{ for some } n \ge m) \le 2(1 - \Phi(a) + a\varphi(a))$, which provides

$$\Pr\left(Z_{b_1}^{(n)} \le M \le Z_{b_2}^{(n)} \text{ for every } n \ge m > 0\right) \ge 1 - 2\left(1 - \Phi(a) + a\varphi(a)\right) = 2\Phi(a) - 1 - 2a\varphi(a).$$
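A minimal R sketch of this confidence sequence for the median is given below: at each n ≥ m the interval is [Z_(b1), Z_(b2)] with cn = [n(a² + log(n/m))]^(1/2). The data-generating distribution, the constants a and m, and the range of n are assumptions, and the guards on b1 and b2 are added only to keep the order-statistic indices valid.

set.seed(5)
a <- 2.5; m <- 30; n.max <- 500          # assumed constants and range of n
z <- rexp(n.max)                         # hypothetical data; the true median is M = log(2)
M <- log(2)
covered <- TRUE
for (n in m:n.max) {
  cn <- sqrt(n * (a^2 + log(n / m)))     # c_n = [n(a^2 + log(n/m))]^(1/2)
  b1 <- max(1, floor(0.5 * (n - cn)))    # largest integer <= 0.5(n - c_n), guarded to be >= 1
  b2 <- min(n, ceiling(0.5 * (n + cn)))  # smallest integer >= 0.5(n + c_n), guarded to be <= n
  zs <- sort(z[1:n])
  if (zs[b1] > M || M > zs[b2]) covered <- FALSE
}
covered    # TRUE if [Z_(b1), Z_(b2)] contains the median at every n >= m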

Note that, comparing the method based on $\Pr\left\{Z_{a_1}^{(n)} \le M \le Z_{a_2}^{(n)}\right\} \cong 2\Phi(a) - 1$ with the result above, one can compute, for large m, that

a      2Φ(a) − 1    2Φ(a) − 1 − 2aφ(a)
2      0.9546       0.7386
2.5    0.9876       0.9001
2.8    0.9948       0.9506
3      0.9973       0.971
3.5    0.9996       0.9933

We would like to emphasize again that, by virtue of the law of the iterated logarithm, we have

$$\Pr\left(Z_{a_1}^{(n)} \le M \le Z_{a_2}^{(n)} \text{ for every } n \ge m\right) = 0$$

and in many practical situations, we do not know if the sample size $n$ is large enough for the asymptotic proposition $\Pr\left\{Z_{a_1}^{(n)} \le M \le Z_{a_2}^{(n)}\right\} \cong 2\Phi(a) - 1$ to hold.
8.4.3 Test with Power One
Assume that we can observe the independent and identically N (θ,1)-
distributed data points X 1 ,..., X n , X n + 1 ,... in order to test for H 0 : θ ≤ 0 versus
H 1 : θ > 0. Define

$$N = \begin{cases}\text{first } n \ge 1 \text{ such that } S_n = X_1 + \ldots + X_n \ge c_n\\ \infty \text{ if no such } n \text{ occurs}\end{cases}$$

to apply the test procedure: when $N < \infty$, we stop sampling with $X_N$ and reject $H_0$ in favor of $H_1$; if $N = \infty$, we continue sampling indefinitely and do not reject $H_0$. In this case, the Type I error rate has the form

$$\Pr_{\theta\le0}(\text{we reject } H_0) = \Pr_{\theta\le0}(N < \infty) = \Pr_{\theta\le0}\left\{\max_{1\le k<\infty}\left(\sum_{i=1}^{k}X_i - c_k\right) \ge 0\right\} = \Pr_{\theta\le0}\left\{\max_{1\le k<\infty}\left(\sum_{i=1}^{k}(X_i - \theta) + \theta k - c_k\right) \ge 0\right\}.$$

That is to say

$$\Pr_{\theta\le0}(\text{we reject } H_0) \le \Pr_{\theta\le0}\left\{\max_{1\le k<\infty}\left(\sum_{i=1}^{k}(X_i - \theta) - c_k\right) \ge 0\right\},$$

since $\theta k \le 0$. Then

$$\Pr_{\theta\le0}(\text{we reject } H_0) \le \Pr_{\theta=0}\left\{\max_{1\le k<\infty}\left(\sum_{i=1}^{k}X_i - c_k\right) \ge 0\right\} = \Pr_{\theta=0}(N < \infty).$$

As in Section 8.4.1, here we changed the measure $\Pr_\theta$ to be $\Pr_{\theta=0}$. If we are using $c_n = [(n+m)(a^2 + \log(n/m+1))]^{1/2}$, as presented in Example 5 of Section 8.3, then we have

$$\Pr_{\theta\le0}(\text{we reject } H_0) \le \Pr_{\theta=0}\left\{\max_{1\le k<\infty}\left(\sum_{i=1}^{k}X_i - c_k\right) \ge 0\right\} = \Pr_{\theta=0}(S_n \ge c_n \text{ for some } n \ge 1) \le \frac{1}{2\Phi(a)}e^{-a^2/2},$$

which controls the Type I error probability of the test.


The Type II error probability of the test is

Prθ> 0 (we do not reject H 0 ) = Prθ> 0 (Sn < cn for all n ≥ 1)

= Prθ> 0 (Sn /n < cn /n for all n ≥ 1).

By virtue of the law of large numbers, under the alternative hypothesis, we


have Sn /n → θ > 0 , whereas cn /n → 0 as n → ∞ . Then one can find K such
that for n > K, Prθ> 0 (Sn /n < cn /n) = 0 (see Section 1.2 for details). Thus

1 − Power = Prθ> 0 (we do not reject H 0 ) = Prθ> 0 (Sn /n < cn /n for all n ≥ 1) = 0

and the test has power one against the alternative θ > 0.
Regarding the expected sample size Eθ> 0 ( N ), it can be shown that for any
stopping rule N based on X 1 ,... the inequality Eθ> 0 ( N ) ≥ −2 log (Pr0 ( N < ∞)) /θ2
must hold for every θ > 0 (see Chapter 6). Thus if we are willing to tolerate an
N for which Pr0 ( N < ∞) = 0.05, then necessarily Eθ> 0 ( N ) ≥ 6 / θ2 for every θ > 0;
however, no such N will minimize Eθ ( N ) uniformly for all θ > 0. For N applied
in this section, where cn = [(n + m)( a 2 + log(n/m + 1))]1/2 , a simple Monte Carlo
study can evaluate Eθ> 0 ( N ) for different values of θ > 0 and fixed m > 0 .
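The following R code is a minimal sketch of such a Monte Carlo study: it applies the stopping rule N with cn = [(n + m)(a² + log(n/m + 1))]^(1/2), truncating sampling at a large n.max, and approximates the Type I error rate (θ = 0) and the expected sample size under an alternative θ > 0. The values of a², m, θ, n.max, and the number of repetitions are assumptions.

set.seed(11)
a2 <- 6; m <- 1; n.max <- 5000; MC <- 2000          # assumptions; n.max truncates the sampling
cn <- sqrt(((1:n.max) + m) * (a2 + log((1:n.max) / m + 1)))

run.N <- function(theta) {                          # stopping time N (n.max + 1 codes "no stop")
  Sn <- cumsum(rnorm(n.max, theta, 1))
  k  <- which(Sn >= cn)[1]
  if (is.na(k)) n.max + 1 else k
}

N0 <- replicate(MC, run.N(0))        # behavior under theta = 0
mean(N0 <= n.max)                    # Monte Carlo Type I error rate; bounded above by exp(-3)/(2*pnorm(sqrt(6))), about 0.025
N1 <- replicate(MC, run.N(0.5))      # behavior under the alternative theta = 0.5
mean(N1)                             # Monte Carlo estimate of E(N) under the alternative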
9
Brief Comments on Confidence
Intervals and p-Values

9.1 Confidence Intervals


We assume the reader is familiar with the conventional principles of
confidence interval estimation that is of a type given by an interval of a
scalar parameter, which in turn can be generalized to the concept of a confi-
dence region for a set of parameters (or a vector of parameters).
In this section, we will essentially focus on the Bayesian approach for confi-
dence interval estimation because of its efficiency and natural interpretation,
which provides wide applicability in statistical practice. However, the com-
ments shown in this chapter can be easily adapted to the frequentist frame-
work of confidence interval (or region) estimation. Two suggested detailed
references pertaining to the core frequentist theory, as put forward by R.A.
Fisher and J. Neyman, are Schweder and Hjort (2002) and Singh et al. (2005).
The Bayesian display of the upper and lower bounds, which contains a
large fraction of the posterior mass (typically 95%) related to a functional
parameter, is an analog of the frequentist confidence interval and commonly
termed in the literature as a credible set or simply “confidence interval”
(e.g., Carlin and Louis, 2011). There is a rich statistical literature regarding the
theoretical and applied aspects of the Bayesian confidence interval estima-
tion (e.g., Broemeling, 2007; Gelman et al., 2013). To outline this technique,
we assume that the dataset X consists of n independent and identically
distributed observations X = (X 1 ,..., X n ) from density function f ( x|θ) with
an unknown parameter θ of interest. For simplicity, we suppose θ is sca-
lar. In the Bayesian framework (Chapter 5) we define the prior distribution
π(θ), which represents the prior information about θ mapped onto a prob-
ability space. The prior information is updated conditionally on the observed data using the likelihood function $\prod_{i=1}^{n} f(X_i|\theta)$ via Bayes theorem to obtain the posterior density function of θ,

$$h(\theta|X) = \prod_{i=1}^{n} f(X_i|\theta)\pi(\theta)\Big/\int\prod_{i=1}^{n} f(X_i|\theta)\pi(\theta)\,d\theta.$$


We assume that $h(\theta|X)$ is unimodal. The $(1-\alpha)100\%$ Bayesian confidence interval estimate for θ can be presented as an interval $[q_L, q_U]$ such that the posterior probability $\Pr(q_L < \theta < q_U|X) = 1 - \alpha$ (or simply $\Pr(q_L < \theta < q_U) = 1 - \alpha$ in this section) at a fixed significance level α, that is, $\int_{q_L}^{q_U} h(\theta|X)\,d\theta = 1 - \alpha$. In this case, the problem is that we have only one equation to find values of the two unknown quantities $q_L, q_U$. Consider, for example, Figure 9.1, where, for $\alpha = 0.05$, we display the scenarios related to the intervals $[q_L, q_U]$ that satisfy

$$\text{Panel (a):}\quad \frac{\alpha}{2} = \int_{-\infty}^{q_L} h(\theta|X)\,d\theta, \quad \frac{\alpha}{2} = \int_{q_U}^{\infty} h(\theta|X)\,d\theta;$$

$$\text{Panel (b):}\quad \frac{\alpha}{1.2} = \int_{-\infty}^{q_L} h(\theta|X)\,d\theta, \quad \alpha\left(1 - \frac{1}{1.2}\right) = \int_{q_U}^{\infty} h(\theta|X)\,d\theta$$

and

$$\text{Panel (c):}\quad \int_{q_L}^{q_U} h(\theta|X)\,d\theta = 1 - \alpha, \quad h(q_L|X) = h(q_U|X).$$

Warning: Similar to the issue regarding the general Type I error rate defi-
nition mentioned in Section 4.4.4, it is reasonable to raise the question,
“What kind of confidence interval definition do you use?” in terms of
what is actually presented.
FIGURE 9.1
The confidence interval definitions related to the following scenarios: (a) the posterior probabilities $\Pr(\theta \le q_L) = \Pr(\theta \ge q_U) = \alpha/2$ and (b) the posterior probabilities $\Pr(\theta \le q_L) = \alpha/1.2$, $\Pr(\theta \ge q_U) = \alpha(1 - 1/1.2)$, where $\alpha = 0.05$. Panel (c) presents the interval $[q_L, q_U]$ calculated using the equations $\Pr(q_L < \theta < q_U) = 1 - \alpha$ and $h(q_L|X) = h(q_U|X)$.

In order to calculate the interval $[q_L, q_U]$, a common approach found in the statistical literature suggests employing the following two strategies:

Method (I): The bounds $q_L$ and $q_U$ are computed as roots of the equations

$$\frac{\alpha}{2} = \int_{-\infty}^{q_L} h(\theta|X)\,d\theta \quad \text{and} \quad \frac{\alpha}{2} = \int_{q_U}^{\infty} h(\theta|X)\,d\theta.$$

This implies the Bayesian equal-tailed confidence interval estimation.

Method (II): We can derive values for the bounds $q_L$ and $q_U$ as roots of the equations

$$h(q_L|X) = h(q_U|X) \quad \text{and} \quad \int_{q_L}^{q_U} h(\theta|X)\,d\theta = 1 - \alpha.$$

This implies the Bayesian highest posterior density confidence interval estimation (see, e.g., panel (c) of Figure 9.1). Method (I) is computationally simple and oftentimes used in practice. The practical implementation of Method (II) usually requires using complicated computation schemes based on Markov Chain Monte Carlo techniques (e.g., Chen and Shao, 1999).
The length $q_U - q_L$ can characterize the quality of the confidence interval estimation. This means that the rule for constructing the confidence interval should make as much use of the information in the data set and prior as possible. In this case, one way of assessing optimality is by the length of the interval, so that a rule for constructing a confidence interval is judged better than another if it leads to intervals whose lengths are typically shorter. For example, it seems that panel (c) of Figure 9.1 depicts the shortest confidence interval. Let us derive $q_U$ and $q_L$ that minimize $q_U - q_L$, satisfying $\int_{q_L}^{q_U} h(\theta|X)\,d\theta = 1 - \alpha$. To this end, we employ the Lagrange method, which implies solving the equations

$$\frac{\partial}{\partial q_U}\left[q_U - q_L + \lambda\left(1 - \alpha - \int_{q_L}^{q_U} h(u|X)\,du\right)\right] = 0,$$

$$\frac{\partial}{\partial q_L}\left[q_U - q_L + \lambda\left(1 - \alpha - \int_{q_L}^{q_U} h(u|X)\,du\right)\right] = 0,$$

where λ is the Lagrange multiplier, with respect to $q_U$ and $q_L$. Then we have $1 - \lambda h(q_U|X) = 0$, $-1 + \lambda h(q_L|X) = 0$, obtaining $h(q_L|X) = h(q_U|X)$.

Thus, we conclude that (a) in general, the highest posterior density confidence interval does have the shortest length, and (b) in the case where h is a symmetric function, the equal-tailed setup is the optimal definition of the confidence interval in the context of its length.
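To illustrate the two strategies numerically, the following R code is a minimal sketch that computes the equal-tailed interval (Method I) and the highest posterior density interval (Method II) for an assumed unimodal, right-skewed posterior, here taken to be a Gamma density purely for illustration, and compares their lengths.

alpha <- 0.05
shape <- 3; rate <- 0.4                 # an assumed unimodal, right-skewed posterior: Gamma(3, 0.4)

# Method (I): the equal-tailed interval
eq.tail <- qgamma(c(alpha / 2, 1 - alpha / 2), shape, rate)

# Method (II): the highest posterior density interval. For a unimodal posterior, the HPD
# interval is the shortest interval [q(p), q(p + 1 - alpha)] over lower tail levels p in (0, alpha).
len <- function(p) diff(qgamma(c(p, p + 1 - alpha), shape, rate))
p.opt <- optimize(len, interval = c(1e-6, alpha - 1e-6))$minimum
hpd <- qgamma(c(p.opt, p.opt + 1 - alpha), shape, rate)

eq.tail; diff(eq.tail)                  # equal-tailed interval and its length
hpd; diff(hpd)                          # HPD interval and its (shorter) length

For this skewed posterior the HPD interval is noticeably shorter than the equal-tailed one, in line with conclusion (a) above; for a symmetric unimodal posterior the two definitions coincide, in line with conclusion (b).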
Warning: In order to apply the above methods in practice, the form of the den-
sity function f ( x|θ) needs to be specified. Daniels and Hogan (2008) showed sig-
nificant issues relative to verifying the parametric assumptions for various cases
as to when the Bayesian confidence interval estimation can be applied efficiently.
Zhou and Reiter (2010) demonstrated that when parametric assumptions are not
met exactly, the posterior estimators are generally biased. The statistical literature
has displayed many examples when parametric forms of data distributions are not
available and there are vital concerns relative to using the Bayesian parametric con-
fidence interval approach. In this context, Vexler et al. (2016b) developed a robust
nonparametric method for confidence interval estimation in the Bayesian manner.
Toward this end, empirical likelihood functions (Chapter 10) were employed to
replace the corresponding unknown parametric likelihood functions in the posterior
probability construction.
In the frequentist framework, the use of confidence intervals or regions
for decision making has a one-to-one mapping to hypothesis testing about
a parameter, set of parameters or functions. For a single parameter, e.g., the
population mean, one can construct a confidence interval based on manipu-
lating the probability statement that a parameter lies between a lower and
upper bound, which are functions of the data (e.g., Section 3.3). In many
instances the concept of a pivot can be used to manipulate the terms of the
probability statement in order to obtain a confidence interval. Confidence
intervals can be based on both parametric and nonparametric approaches,
as well as large sample approximations.
In terms of confidence intervals and their relationship to hypoth-
esis tests, we would reject H 0 for a two-sided test if the parameter
under H 0 fell outside the confidence interval bounds. Consider, for
example, a simple scenario when a test statistic TS(θ0 ), as a function
of θ0 based on data, is proposed for testing the hypothesis H 0 : θ = θ0 ,
where θ is the parameter of interest. Assume we reject the null hypothesis
for large values of TS(θ0 ). In order to control the Type I error rate, we
should derive an exact or asymptotic distribution of $TS(\theta_0)$ under $H_0$, say $\Pr_{H_0}\{TS(\theta_0) \le C\} = G(C)$ (or converging to $G(C)$), for a fixed argument C. Then we reject the hypothesis $H_0$ when $TS(\theta_0) \ge g_{1-\alpha}$, where $g_{1-\alpha}$ is the $100(1-\alpha)\%$ percentile
of the distribution G, and α is the significance level. By virtue of the rela-
tion between the testing and confidence interval estimation, we can obtain
the confidence interval estimator of θ in the form of CI1-α = {θ : TS(θ) ≤ g1−α},
assuming that the nominal coverage probability is specified as 1 − α. In this
context, shorter confidence intervals can correspond to greater statistical
power in the parallel statistical hypothesis test.

In various situations, Monte Carlo experiments (Section 2.4.5.1) can be used to


evaluate and compare the qualities of proposed confidence interval estimators.
In such cases, commonly, the Monte Carlo coverage probabilities (e.g., the aver-
age value of indicators I {θ0 ∈ CI1−α} based on Monte Carlo generated data
under H 0 , in the terms of the example mentioned above) and the Monte Carlo
average lengths of the considered confidence intervals (e.g., the average value of
max {θ : TS(θ) ≤ g1−α} – min {θ : TS(θ) ≤ g1−α} based on Monte Carlo generated
data under H 0 ) are basic factors to be computed. The Monte Carlo coverage probabilities can be compared to the expected coverage probability 1 − α.
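As a minimal sketch of such a Monte Carlo experiment, the following R code evaluates the coverage probability and the average length of the standard t-based confidence interval for a normal mean; the sample size, the true parameter values, and the number of Monte Carlo repetitions are assumptions chosen only for illustration.

set.seed(100)
MC <- 10000; n <- 25; mu0 <- 1; sigma <- 2; alpha <- 0.05   # assumed settings
cover <- logical(MC); len <- numeric(MC)
for (r in 1:MC) {
  x  <- rnorm(n, mu0, sigma)                                # data generated under H0: mu = mu0
  ci <- t.test(x, conf.level = 1 - alpha)$conf.int          # the standard t-based interval
  cover[r] <- (ci[1] <= mu0) && (mu0 <= ci[2])
  len[r]   <- ci[2] - ci[1]
}
mean(cover)   # Monte Carlo coverage probability, to be compared with 1 - alpha = 0.95
mean(len)     # Monte Carlo average length of the confidence interval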

9.2 p-Values
The p-value has long played a role in scientific research as a key decision-
making tool with respect to hypothesis testing and dates back to Laplace
in the 1770s (Stigler, 1986). The concept of the p-value was popularized by
Fisher as an inferential tool and is where the first occurrence of the term
“statistical significance” is found (Fisher, 1925). In the frequentist hypothesis
testing framework a test is deemed statistically significant if the p-value,
which is a statistic taking values in the interval (0,1), is below
some threshold known as the critical value. In a vast majority of studies the
critical value is set at 0.05.
A majority of traditional testing procedures are designed to draw a
conclusion (or make an action) with respect to the binary decision of reject-
ing or not rejecting the null hypothesis H 0 , depending on locations of the
corresponding values of the observed test statistics, that is, detecting whether
test statistic values belong to a fixed sphere or interval. In simple cases, test
procedures require us to compare corresponding test statistics values based
on observed data with test thresholds. P-values can serve as an alternative
data-driven approach for testing statistical hypotheses based on using the
observed values of test statistics as the thresholds in the theoretical prob-
ability of the Type I error. P-values can themselves also serve as a summary
type result based on data in that they provide meaningful experimental
data-based evidence about the null hypothesis.
For example, suppose that the test statistic L has a distribution function F
under H 0 . Under a one-sided upper-tailed alternative H 1 the p-value is the ran-
dom variable 1 − F(L), which is uniformly distributed under H 0 (e.g., Vexler
et al., 2016a).
The obvious correct use of the p-value is to simply draw a conclusion of
reject or do not reject the null hypothesis. This principle simplifies and stan-
dardizes statistical decision-making policies. In this manner, for example,
different algorithms for combining decision-making rules using their
p-values as test statistics can be naturally derived (e.g., Vexler et al., 2017).

Warning: The p-value is oftentimes misused and misinterpreted in the applied


scientific literature where statistical decision-making procedures are involved. Many
scientists misinterpret smaller p-values as providing stronger theoretical evidence
against a null hypothesis relative to larger p-values. For example, some researchers
draw conclusions regarding comparisons of associations between a disease and dif-
ferent factors using values of the corresponding p-values. In a hypothetical study,
consider evaluating associations between a disease, say D, and two biomarkers, say
A and B. It is not uncommon for scientists to conclude that the association between
D and A is stronger than that between D and B if the p-value regarding the asso-
ciation between A and D is smaller than that of the association between B and D.
This example demonstrates a careless use of the p-value concept, since data obtained in a different but relevant experiment might provide the contradictory conclusion simply due to the stochastic nature of the p-value. These types of
issues have led several scientific journals to discourage the use of p-values, with
some scientists and statisticians encouraging their abandonment (Wasserstein and
Lazar, 2016). For example, the editors of the journal entitled Basic and Applied Social
Psychology announced that the journal would no longer publish papers contain-
ing p-value-based studies since the statistics were too often used to support lower-
quality research (Trafimow and Marks, 2015). Misinterpreting the magnitude of the
p-value as strength for or against the null hypothesis or as a probability statement
about the null hypothesis can lead to a misinterpretation of the results of a statisti-
cal test. Several common weaknesses or misinterpretations of the p-value are as fol-
lows: (1) employing the p-value as the probability that the null hypothesis is true can
be wrong or even far from reasonable, as the p-value strongly depends on current
data in use; (2) the p-value is not very useful for large sample sizes, because almost no
null hypothesis is exactly true when examined using real data, and when the sample
size is relatively large, almost any null hypothesis will have a tiny p-value, leading
to the rejection of the null at conventional significance levels; and (3) model selection
is difficult, e.g., simply choosing the wrong model among a class of models or using
a model that is not robust to statistical assumptions will lead to an incorrect p-value
and/or inflated Type I error rates.

The p-value is a function of the data and hence it is a random variable,


which too has a probability distribution. The subtlety in terms of those that
try to interpret the magnitude of the relative p-value is that the distribution
of the p-value is conditional on either the null hypothesis being true or not.
Under the null hypothesis, typically, p-values are exactly (or asymptotically)
Uniform[0,1] distributed. However, if the null hypothesis is false p-values
have a non-Uniform[0,1] distribution for which the shape of the distribution
varies across several factors including sample size and the distance of the
parameter of interest from the hypothesized value (null). Hence, for the same
exact null and alternative values the distribution of the p-value may be small
or large simply as a function of the sample size (statistical power). In the era
of “big data” it would not be unusual to constantly find extremely small
p-values simply as a function of a massively large sample size, with nothing

really to do with the scientific question. Likewise a large observed p-value


may simply be due to a very small sample size.
Statisticians have long recognized the deficiencies in terms of interpreting
p-values relative to their stochastic nature and have tried to develop remedies
to aid scientists in the interpretation of their data. For example, Lazzeroni
et al. (2014) developed prediction intervals for p-values in replication studies.
This approach has certain critical points regarding the following problems:
(1) in the frequentist context it is uncommon to create confidence intervals
of random variables; (2) under the null hypothesis p-values are distributed
according to a Uniform[0,1] distribution, whereas in many scenarios, if we
are sure the alternative hypothesis is in effect, the prediction interval for the
p-value is not needed; and (3) prediction intervals for p-values can be directly
associated with those for corresponding test statistic values, linking to just
rejection sets of the test procedures.
The stochastic aspect of the p-value has been well studied by Dempster
and Schatzoff (1965), who introduced the concept of the expected signifi-
cance level. Sackrowitz and Samuel-Cahn (1999) developed the approach
further and renamed it as the expected p-value (EPV). The authors pre-
sented the great potential of using EPVs in various aspects of hypothesis
testing.
Comparisons of different test procedures, e.g., a Wilcoxon rank-sum test
versus Student’s t-test, based on their statistical power is oftentimes prob-
lematic in terms of deeming one method being the preferred test over a
range of scenarios. One reason for this issue to occur is that the comparison
between two or more testing procedures is dependent upon the choice of a
prespecified significance level α. One test procedure may be more or less
powerful than the other one depending on the choice of α. Alternatively,
one can consider the EPV concept in order to compare test procedures. In
this chapter, we show that the EPV corresponds to the integrated power of
a test via all possible values of α ∈(0,1). Thus, the performance of the test
procedure can be evaluated globally using the EPV concept. Smaller values
of EPV show better test qualities in a more universal fashion. This method is
an alternative approach to the Neyman–Pearson concept of testing statistical
hypotheses. The famous Neyman–Pearson lemma (Section 3.4) introduced
us to the concept that a reasonable statistical testing procedure controls the
Type I error rate at a prespecified significance level, α, in conjunction with
maximizing the power in a uniform fashion. Thus, for different values of α
we may obtain different superior test procedures. On the other hand, the
EPV-based approach allows us to compare between decision-making rules in
a more objective manner. The global test performance of testing procedures
can be measured by one number, the EPV, and hence tests can be more easily
rank-ordered.
We further advance the concept of the EPV. We prove that there is a strong
association between the EPV concept and the well-known receiver operating
characteristic (ROC) curve methodology (Chapter 7). It turns out that we can

use well-established ROC curve and AUC methods to evaluate and visualize
the properties of various decision-making procedures in the p-value-based
context. Further, in parallel with partial AUCs (Section 7.6), we develop a
partial expected p-value (pEPV) and introduce a novel method for visual-
izing the properties of statistical tests in an ROC curve framework. We prove
that the conventional power characterization of tests is a partial aspect of the
presented EPV/ROC technique.
In order to exemplify additional applications of the EPV/ROC methodol-
ogy, we refer the reader to Vexler et al. (2017).

9.2.1 The EPV in the Context of an ROC Curve Analysis


In this section we present the following material: the formal definition of
the EPV and the association between the EPV and the AUC. We also pro-
vide a new quantity called the partial EPV (pEPV), which characterizes a
property of decision-making procedures using the concept of partial AUCs
(see Section 7.6).
Let the random variable T (D) represent a test statistic depending on data
D. Assume Fi defines the distribution function of T (D) under the hypoth-
esis H i , i = 0, 1 , where the subscript i indicates the null (i = 0) and alternative
(i = 1) hypotheses, respectively. Given Fi is continuous we can denote Fi−1
to represent the inverse or quantile function of Fi , such that Fi ( Fi−1 ( γ )) = γ ,
where 0 < γ < 1 and i = 0,1. In this setting, in order to concentrate upon the
main issues, we will only focus on tests of the form: the event T (D) > C
rejects H 0 , where C is a prefixed test threshold. Thus the p-value has the form
1 − F0 (T (D)). Sackrowitz and Samuel-Cahn (1999) proved that the expected
p-value E(1 − F0 (T (D))|H 1 ) is

EPV = Pr(T 0 ≥ T A ), (9.1)

where independent random variables T 0 and T A are distributed according to


F0 and F1, respectively.

Proof. We have $E(1 - F_0(T(D))|H_1) = \int\{1 - F_0(u)\}\,dF_1(u) = \int \Pr(T^0 \ge u)\,dF_1(u) = \Pr(T^0 \ge T^A)$, which completes the proof.
The simple example of the EPV is when $T^0 \sim N(\mu_1, \sigma_1^2)$ and $T^A \sim N(\mu_2, \sigma_2^2)$. Then the EPV can be expressed as

$$EPV = \Phi\left((\mu_1 - \mu_2)\left(\sigma_1^2 + \sigma_2^2\right)^{-1/2}\right),$$

where Φ is the cumulative standard normal distribution.
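The following R code is a minimal sketch that evaluates this normal-theory EPV formula and verifies it by Monte Carlo through the relation EPV = Pr(T0 ≥ TA); the parameter values are assumptions used only for illustration.

mu1 <- 0; sigma1 <- 1          # assumed distribution of T^0 under H0
mu2 <- 1.5; sigma2 <- 1.2      # assumed distribution of T^A under H1
# Closed-form EPV = Phi((mu1 - mu2)/sqrt(sigma1^2 + sigma2^2))
pnorm((mu1 - mu2) / sqrt(sigma1^2 + sigma2^2))
# Monte Carlo check of EPV = Pr(T^0 >= T^A)
set.seed(2017)
MC <- 1e6
mean(rnorm(MC, mu1, sigma1) >= rnorm(MC, mu2, sigma2))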
Note that the formal notation (9.1) is similar to that for the area under the
corresponding ROC curve (Section 7.3). In this case, the ROC curve can assist

in measuring a distance between the distribution functions F0 and F1. In


this context, one can reconsider the EPV in terms of the area under the ROC
curve (AUC), obtaining that the EPV is 1 − AUC. This connection between
the EPV and the AUC induces new techniques for evaluating statistical test
qualities via the well-established ROC curve methodology.

9.2.2 Student’s t-Test versus Welch’s t-Test


Consider, for illustrative purposes, the following example related to appli-
cations of Student’s and Welch’s t-tests. The recent biostatistical literature
has extensively discussed which test, Student’s t- or Welch’s t-test, to use in
practical applications. The questions in this setting are: What is the risk of
using Student’s t-test when variances of the two populations are different?
Also, what is the loss in power when using Welch’s t-test when the variances
of the two populations are equal (Julious, 2005; Zimmerman and Zumbo,
2009)? In order to apply the ROC curve analysis based on the EPV concept,
we denote Student’s t-test statistic as

$$T_S = \left(\bar X - \bar Y\right)\left[S_p\left(n^{-1} + m^{-1}\right)^{1/2}\right]^{-1},$$

and Welch's t-test statistic as

$$T_W = \left(\bar X - \bar Y\right)\left(S_1^2 n^{-1} + S_2^2 m^{-1}\right)^{-1/2},$$

where $\bar X$ is the sample mean based on the independent normally distributed data points $X_1,\ldots,X_n$, $\bar Y$ is the sample mean based on the independent normally distributed observations $Y_1,\ldots,Y_m$, $S_1^2 = \sum_{i=1}^{n}(X_i - \bar X)^2/(n-1)$ and $S_2^2 = \sum_{j=1}^{m}(Y_j - \bar Y)^2/(m-1)$ are the unbiased estimators of the variances $\sigma_1^2 = \mathrm{Var}(X_1)$ and $\sigma_2^2 = \mathrm{Var}(Y_1)$, respectively, and $S_p^2 = \left\{(n-1)S_1^2 + (m-1)S_2^2\right\}/(n+m-2)$ is the pooled sample variance. Figure 9.2 depicts the ROC curves, $ROC(t) = 1 - F_1\left\{F_0^{-1}(1-t)\right\}$, $t \in (0,1)$, for each test when the distribution functions $F_0, F_1$ of the t-test statistic (Student's t-test statistic or Welch's t-test statistic) are correctly specified corresponding to underlying distributions of observations with $\delta = EX_1 - EY_1 = 0.7$ and 1 under $H_1$. These graphs show that there are no significant differences between the relative curves. In the scenario $n = 10$, $m = 50$, $\sigma_1^2 = 4$, $\sigma_2^2 = 1$, $\delta = 1$, Student's t-test demonstrates slightly better performance than Welch's t-test. By using different values of $n, m, \sigma_1^2, \sigma_2^2, \delta$, we attempt to detect cases when significant differences between the relevant curves are in effect. Applying different combinations of $n = 10, 20, \ldots, 150$; $m = 10, 20, \ldots, 150$; $\sigma_1^2 = 1, 2^2, \ldots, 10^2$; and $\sigma_2^2 = 1$, we could not derive a scenario when one of the considered tests

1.0 1.0 1.0

0.8 0.8 0.8

0.6 0.6 0.6

0.4 0.4 0.4

0.2 0.2 0.2

0.0 0.0 0.0


0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
(a) (b) (c)
1.0 1.0 1.0

0.8 0.8 0.8

0.6 0.6 0.6

0.4 0.4 0.4

0.2 0.2 0.2

0.0 0.0 0.0


0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
(d) (e) (f)

FIGURE 9.2
The ROC curves related to Student’s t-test (“______”) and Welch’s t-test (“- - - -”), where graph
(a) represents the case of n = 10 , m = 50 , σ 12 = 1, σ 22 = 1, δ = 0.7 ; graph (b) represents the case
of n = 20, m = 40, σ 12 = 1, σ 22 = 1, δ = 0.7 ; graph (c) represents the case of n = 40, m = 20 , σ 12 = 4 ,
σ 22 = 1, δ = 0.7 ; graph (d) represents the case of n = 40, m = 20 , σ 12 = 9 , σ 22 = 1, δ = 0.7 ; graph (e)
represents the case of n = 10 , m = 50 , σ 12 = 4 , σ 22 = 1, δ = 1; and graph (f) represents the case of
n = 20, m = 40, σ 12 = 9, σ 22 = 1, δ = 0.7 .

TABLE 9.1
The Areas Under the ROC Curves of Student's t-Test (AUC_S) and Welch's t-Test (AUC_W)

n     m     σ1²   σ2²   δ     AUC_S    AUC_W
10    50    1     1     0.7   0.9217   0.9172
20    40    1     1     0.7   0.9619   0.9611
40    20    4     1     0.7   0.8959   0.8962
40    20    9     1     0.7   0.8269   0.8271
10    50    4     1     1.0   0.8614   0.8569
20    40    9     1     0.7   0.7631   0.7626

clearly outperforms the other one with respect to the EPV. The corresponding
AUC = (1 − EPV) values are given in Table 9.1. Thus, if the Type I error rates
of the tests are correctly controlled, there are no critical differences between
Student’s t-test and Welch’s t-test.

The results shown above can be obtained analytically, since the distribu-
tion functions of the statistics TS and TW under H 0 and H 1 have specified
forms. Alternatively, we provide the following R code that can be easily
modified in order to evaluate various decision-making procedures using the
accurate Monte Carlo approximations to the EPV/ROC instruments.
library(pROC)
N<-75000 # Number of Monte Carlo generations
n<-20
m<-20
T0E<-array() #Values of T_S under H_0
T0NE<-array() #Values of T_W under H_0
T1E<-array() #Values of T_S under H_1
T1NE<-array() #Values of T_W under H_1

for(i in 1:N){
x0<-rnorm(n,0,1)# X’s under H_0
y0<-rnorm(m,0,1)# Y’s under H_0

x1<-rnorm(n,1,1) #X’s under H_1


y1<-rnorm(m,0,4) #Y’s under H_1

T0E[i]<-t.test(x0,y0,alternative = c("greater"),var.equal = TRUE)$stat[[1]]


T1E[i]<-t.test(x1,y1,alternative = c("greater"),var.equal = TRUE)$stat[[1]]

T0NE[i]<-t.test(x0,y0,alternative = c("greater"),var.equal = FALSE)$stat[[1]]


T1NE[i]<-t.test(x1,y1,alternative = c("greater"),var.equal = FALSE)$stat[[1]]
}
par(mfrow=c(1,3))
Ind<-c(array(1,N),array(0,N))
TE<-c(T0E,T1E)
TNE<-c(T0NE,T1NE)
plot.roc(Ind, TE, legacy.axes=TRUE) #ROC curve (1-EPV) for T_S
lines.roc(Ind,TNE,col='red') #ROC curve (1-EPV) for T_W

TEroc<-roc(Ind, TE) #ROC curve (1-EPV) for T_S
TNEroc<-roc(Ind, TNE) #ROC curve (1-EPV) for T_W

##############Partial EPVs
#Values of (1-the partial EPVs) for T_S depending on alpha:
TEauc<-function(alpha) 1-auc(TEroc, partial.auc=c(0, alpha))
#Values of (1-the partial EPVs) for T_W depending on alpha:
TNEauc<-function(alpha) 1-auc(TNEroc, partial.auc=c(0, alpha))
TEaucV<-Vectorize(TEauc)
TNEaucV<-Vectorize(TNEauc)
plot(TEaucV,0,1)
plot(TNEaucV,0,1,col='red')

9.2.3 The Connection between EPV and Power


The value of 1 – EPV can be expressed in the form of the statistical power of
a test through integration uniformly over the significance level α from 0 to 1;
that is,

$$EPV = \Pr(T^0 \ge T^A) = \int_{-\infty}^{\infty}\Pr(T^A \le t)\,dF_0(t) = \int_{-\infty}^{\infty}\Pr\left\{F_0(T^A) \le F_0(t)\right\}dF_0(t)$$
$$= \int_{1}^{0}\Pr\left\{1 - F_0(T^A) \ge \alpha\right\}d(1-\alpha) = \int_{0}^{1}\left[1 - \Pr\left\{1 - F_0(T^A) \le \alpha\right\}\right]d\alpha = 1 - \int_{0}^{1}\Pr(p\text{-value} \le \alpha|H_1)\,d\alpha.$$

The above expression of the EPV considers the weight of the significance
level α from 0 to 1. It may appear to suffer from the defect of assigning most
of its weight to relatively uninteresting values of α not typically used in
practice, e.g., α ≥ 0.1. Alternatively, we can focus on significance levels of α in
a specific interesting range by considering the pEPV; that is

$$pEPV = 1 - \int_{0}^{\alpha_U}\Pr\{p\text{-value} \le \alpha|H_1\}\,d\alpha = 1 - \int_{0}^{\alpha_U}\Pr\left\{1 - F_0(T^A) \le \alpha\right\}d\alpha$$
$$= 1 + \int_{0}^{\alpha_U}\Pr\left\{F_0(T^A) \ge 1 - \alpha\right\}d(1-\alpha) = 1 + \int_{1}^{1-\alpha_U}\Pr\left\{F_0(T^A) \ge z\right\}dz$$
$$= 1 - \int_{1-\alpha_U}^{1}\Pr\left\{F_0(T^A) \ge z\right\}dz = 1 - \int_{F_0^{-1}(1-\alpha_U)}^{\infty}\Pr\left\{F_0(T^A) \ge F_0(t)\right\}dF_0(t)$$
$$= 1 - \int_{F_0^{-1}(1-\alpha_U)}^{\infty}\Pr\left\{T^A \ge t\right\}dF_0(t) = 1 - \Pr\left\{T^A \ge T^0,\ T^0 \ge F_0^{-1}(1-\alpha_U)\right\}$$

at a fixed upper level αU ≤ 1 . In the R code presented in Section 9.2.2, we


provided an example of pEPV computations.


In general, one can define the function $pEPV(\alpha_L, \alpha_U) = 1 - \int_{\alpha_L}^{\alpha_U}\Pr\{p\text{-value} \le u|H_1\}\,du$ and focus on $\frac{d}{d\alpha}\{-pEPV(0,\alpha)\}$. Then, in this case, the $\frac{d}{d\alpha}\{-pEPV(0,\alpha)\}$ implies
the power at a significance level of α. An essential property of efficient statis-
tical tests is unbiasedness (Lehmann and Romano, 2006). In this context, we
can denote an unbiased statistical test when the probability of committing
a Type I error is less than the significance level and we have a proper power
function, that is, Pr(reject H 0 |H 0 ) ≤ α and Pr(reject H 0 |H 0 is not true) ≥ α.
In parallel with this definition, it is natural to consider the inequality

$$pEPV(0,\alpha) \le 1 - \int_{0}^{\alpha}\Pr\{p\text{-value} \le u|H_0\}\,du = 1 - \alpha^2/2,$$

since $p\text{-value} \sim \mathrm{Uniform}[0,1]$ (i.e., $\Pr\{p\text{-value} \le u|H_0\} = u$, $u \in [0,1]$) under $H_0$ and we assume $H_1 \ne H_0$. In this case,

$$\frac{d}{d\alpha}\{pEPV(0,\alpha)\} = -\Pr(\text{reject } H_0|H_0 \text{ is not true}) \quad \text{and} \quad \frac{d}{d\alpha}\left\{1 - \alpha^2/2\right\} = -\alpha.$$

However, it is clear that the requirement $pEPV(0,\alpha) \le 1 - \alpha^2/2$ is weaker than that of $\Pr(p\text{-value} < \alpha|H_0 \text{ is not true}) \ge \alpha$.
Thus, the EPV based concept extends the conventional power characteriza-
tion of tests.
Remark. The Neyman–Pearson lemma framework for comparing, for exam-
ple, two test statistics, say M1 and M2, provides the following scheme: the Type
I error rates of the tests should be fixed at a prespecified significance level α,
Pr (M1 rejects H 0 |H 0) ≤ α and Pr (M2 rejects H 0 |H 0 ) ≤ α; then M1 is superior
with respect to M2, if Pr (M1 rejects H 0 |H 1) > Pr (M2 rejects H 0 |H 1). In gen-
eral the power functions, Pr (M1 rejects H 0 |H 1) and Pr (M2 rejects H 0 |H 1),
depend on α. Thus, for different values of α we may theoretically obtain
diverse conclusions regarding the preferable test procedure. The EPV and
pEPV methods make the comparison more objective in a global sense. The
test performance can be measured employing just the EPV or pEPV value
by itself. Smaller values of the EPV or pEPV indicate a more preferable test
procedure when comparing two or more tests. The definitions of EPV and
pEPV show that for a most powerful test (e.g., the likelihood ratio test), the
EPV and pEPV will be the minimum as compared to any other tests with the
same H 0 versus H 1.

9.2.4 t-Tests versus the Wilcoxon Rank-Sum Test


In this section we compare and contrast the properties of two well-known
two-sample test statistics, namely Welch’s t-test and the Wilcoxon rank-
sum test. In order to be concrete, we exemplify our points using a case-
control study’s statement of problem. The central idea of the case-control
study is the comparison of a group having the outcome of interest to a
control group with regard to one or more characteristics. In health-related
experiments, the case group usually consists of individuals with a given
disease, whereas the control group is disease-free. In such instances, the
use of biomarkers to assist medical decision making and/or the diagnosis
and prognosis of individuals with a given disease, is increasingly common
in both clinical settings and epidemiological research. This has spurred an
increase in exploration for and development of new biomarkers. For exam-
ple, myocardial infarction (MI) is commonly caused by blood clots block-
ing the blood flow of the heart leading to heart muscle injury. Heart disease
is a leading cause of death, affecting about 20% of population, regardless

of ethnicity, according to the Centers for Disease Control and Prevention


(e.g., Schisterman et al., 2001a, 2002). The biomarker “high density lipo-
protein (HDL) cholesterol” is often used as a discriminant factor between
individuals with and without MI disease. The HDL cholesterol levels can
be examined from a 12-hour fasting blood specimen for biochemical anal-
ysis at baseline, providing values of measurements regarding HDL bio-
markers to be collected on cases who survived an MI and on controls who
had no previous MI. Note that oftentimes measurements related to bio-
logical processes follow a log-normal distribution (see for details Limpert
et al., 2001; Vexler et al., 2016a, pp. 13–14). Thus, one may be interested in how often a log-transformed HDL cholesterol level of the case group, say X, "outperforms" a log-transformed HDL cholesterol level of the control group, Y. Typically, this research statement is associated with the mea-
sure Pr(X > Y ) that is assumed to be examined using n independent and
normally distributed data points X 1 ,..., X n as well as m independent and
normally distributed observations Y1 ,..., Ym (e.g., Vexler et al., 2008a&b).
In this scenario, in order to test the hypothesis H 0 : Pr (X > Y) = 0.5 versus
H 1 : Pr (X > Y) > 0.5 the traditional statistical literature strongly suggests
using t-test type procedures (e.g., Browne, 2010). It is common that scholars
are encouraged to apply t-test-type decision-making mechanisms when
the underlying data follow a normal distribution.
We compare the one-sided two-sample Student’s t- and Welch’s t-tests
with the corresponding Wilcoxon rank-sum test (e.g., Ahmad, 1996) in
terms of their relative strengths and weaknesses. The Wilcoxon rank-sum
test, which is a permutation test, is generally recommended when the data
are assumed to be from a nonnormal distribution. In order to compare the
tests we provide the following R code (R Development Core Team, 2014)
that can be easily modified and applied to evaluate various decision-mak-
ing procedures using accurate Monte Carlo approximations to the EPV/
ROC instruments.

library(pROC)
N<-100000 #Number of Monte Carlo data generations
W0<-array()
T0<-array()
W1<-array()
T1<-array()
n<-25 #The sample size n
m<-25 #The sample size m

for(i in 1:N){
x0<-rnorm(n,0,1) #Values of X from N(0,1) generated under the hypothesis H0
y0<-rnorm(m,0,1) #Values of Y from N(0,1) generated under the hypothesis H0
x1<-rnorm(n,0,1) #Values of X from N(0,1) generated under the hypothesis H1
Brief Comments on Confidence Intervals and p-Values 233

y1<-rnorm(m,-0.5,10) #Values of Y from N(–0.5,10) generated under the hypothesis H1

W0[i]<-wilcox.test(x0,y0,alternative = c("greater"))$stat[[1]]/(n*m)
#Values of the Wilcoxon test statistic under H0
W1[i]<-wilcox.test(x1,y1,alternative = c("greater"))$stat[[1]]/(n*m)
#Values of the Wilcoxon test statistic under H1

EV<-FALSE #This parameter indicates the use of Welch’s t-test statistic


#EV<-TRUE #This parameter indicates the use of Student's t-test statistic

T0[i]<-t.test(x0,y0,alternative = c("greater"),var.equal = EV)$stat[[1]]


#Values of the t-test statistic under H0
T1[i]<-t.test(x1,y1,alternative = c("greater"),var.equal = EV)$stat[[1]]
#Values of the t-test statistic under H1
}

#Plotting the ROC cureves


Ind<-c(array(1,N),array(0,N))
W<-c(W0,W1)
T<-c(T0,T1)
plot.roc(Ind, W,type="l", legacy.axes=TRUE,xlab="t",ylab="ROC(t)")
lines.roc(Ind,T,col="red",lty=2)

To exemplify the proposed approach for comparing the test statistics, this R
code provides simulation-based evaluations of the ROC curves based on val-
ues of the test statistics related to the null and alternative hypotheses, using
the R built-in procedure pROC. In this scenario, we assume, for example, that
X 1 ,..., X n ~ N (0,1) and Y1 ,..., Ym ~ N (0,1) under H 0 , whereas X 1 ,..., X n ~ N (0,1)
and Y1 ,..., Ym ~ N ( −0.5,10). Figure 9.3 shows the ROC curves (ROCT (t), t) and
(ROCW (t), t), where
ROCT (t) = 1 − FT1 { FT−01 (1 − t)} and ROCW (t) = 1 − FW1 { FW−01 (1 − t)}, t ∈(0,1)

with the distribution functions FTk and FWk that correspond to the t-test statistic
and the Wilcoxon rank-sum test statistic distributions under H k , k = 0,1,
respectively.
According to the executed R Code (the parameter EV<-FALSE), we focus
on Welch’s t-test statistic, which is reasonable in the considered setting
of data distributions’ parameters. It is interesting to remark that when
examining Student’s t-test statistic, setting the parameter EV<-TRUE the
graphs show that there are no significant differences between the rela-
tive curves. Similar observations are in effect regarding the considerations
shown below.
Analyzing EPVs for the one-sided, two-sample t- and Wilcoxon tests
based on normally distributed data points ( n = m = 10, 20, 50 ), Sackrowitz
and Samuel-Cahn (1999) concluded that “The t test is best both for the Normal
234 Statistics in the Health Sciences: Theory, Applications, and Computing

1.0
0.8
0.6
ROC (t)
0.4
0.2
0.0

0.0 0.2 0.4 0.6 0.8 1.0


t

FIGURE 9.3
Values of the functions ROCT (t) (curve “- - - -”) and ROCW (t) (curve “------”) plotted against
t ∈ (0, 1).

distribution (not surprising!) and the Uniform distribution.” In order to com-


pute the EPVs corresponding to the considered example, we can execute
the code

Troc<-roc(Ind, T)
Wroc<-roc(Ind, W)
EPV_t<-1-auc(Troc) # EPV of the t-test
EPV_W<-1-auc(Wroc) # EPV of the Wilcoxon test

Indeed, the computed EPV of the t-test is 0.431, which is smaller than the
0.439 of that related to the Wilcoxon rank-sum test. However, Figure 9.3 dem-
onstrates that for t ∈(0, 0.5) the Wilcoxon test is somewhat better that the
t-test. This motivates us to employ the pEPV for this analysis. Toward this
end, we denote the function

G (α) = { pEPVW (0, α) − pEPVt (0, α)}/pEPVt (0, α),


Brief Comments on Confidence Intervals and p-Values 235

where pEPVT (0, α) and pEPVW (0, α) are the function pEPV (0, α) defined in
Section 9.2.3 and computed with respect to the t-test and the Wilcoxon test,
respectively. In order to depict the result we use the following code:
pEPV_t<-function(u) 1-auc(Troc, partial.auc=c(0,u))[[1]]
pEPV_W<-function(u) 1-auc(Wroc, partial.auc=c(0,u))[[1]]
G<-function(u) (pEPV_W(u)-pEPV_t(u))/(pEPV_t(u))
GV<-Vectorize(G)
plot(GV,0,1,xlab='alpha',ylab='G(alpha)')

Figure 9.4 presents the curve of G(α).


In this case, it is clear that the Wilcoxon test outperforms the t-test, when the
significance level α < 0.8.
Let us fix α = 0.05 and calculate the corresponding powers of the tests
using the following code:
Wc<-quantile(W0,0.95) # the 95% critical value of the Wilcoxon test
Tc<-quantile(T0,0.95) # the 95% critical value of the t-test
PowW<-mean(1*(W1>=Wc)) #the power of the Wilcoxon test
PowT<-mean(1*(T1>=Tc)) #the power of the t-test
print(c(PowT,PowW))
0.015
0.010
0.005
G(α)
0.000
–0.005
–0.010

0.0 0.2 0.4 0.6 0.8 1.0


α

FIGURE 9.4
The relative comparison between the t-test and the Wilcoxon test using their pEPVs via the
function G(α) plotted against α ∈(0, 1).
236 Statistics in the Health Sciences: Theory, Applications, and Computing

We obtain that the power of the Wilcoxon test is 0.12, whereas the power
of the t-test is 0.08.
Remark 1. Assume we are interested in the measure Pr(X > Y ) .
A fast way to evaluate Pr(X > Y ) can be based on the R command
wilcox.test(X,Y,alternative = c("greater"))$stat[[1]]. For example, in the setting of
the example mentioned in this section, one can use

nn<-5000000
x1<-rnorm(nn,0,1)
y1<-rnorm(nn,-0.5,10)
wilcox.test(x1,y1,alternative = c("greater"))$stat[[1]]/(nn*nn)

to obtain the approximated value of Pr(X > Y ) under H 1 as 0.5192665, which


0.5
corresponds to Pr(X > Y ) = (2 π)
−1/2 2
e−z /2
dz ≈ 0.5181275.
−∞

Remark 2. In various scenarios with Var(X 1 ) = Var(Y1 ) under H 1, it was


observed using the function G(α), α < 0.1, that the t-test procedures and the
Wilcoxon rank-sum test provide approximately the same properties.
Remark 3. The R code presented in this section can be easily modified to
provide the EPV/ROC analysis based on a real data. To this end the variables
x0, y0 can be simulated corresponding to H 0 (e.g., as X 1 ,..., X n ~ N (0,1) and
Y1 ,..., Ym ~ N (0,1)) and the variables x1, y1 can be sampled from observed
X 1 ,..., X n and Y1 ,..., Ym in a bootstrap manner at each loop iteration in “for(i in
1:N){…” (see for details Vexler et al., 2017 as well as Chapter 11).

9.2.5 Discussion
We have seen that the EPV is a very useful and succinct measurement
tool of the performance of decision-making mechanisms. We have dem-
onstrated a novel methodology to analyze and visualize characteristics
of test procedures. Toward this end the “EPV/ROC” concept has been
introduced. This approach provides us new and efficient perspectives
toward developing and examining statistical decision-making policies,
including those related to the partial EPVs, associations between the EPV
and the power of tests, visualization of properties of testing mechanisms,
development of optimal tests based on minimizing the EPVs, and cre-
ation of new methods for optimally combining multiple test statistics.
Many possible avenues of research can be done based on the concept we
introduced in this chapter. For example, a large sample theory can be
developed to evaluate the EPVs in several parametric and nonparamet-
ric scenarios. In addition, Bayesian-type methods can be developed in
order to evaluate test properties in the “EPV/ROC” frame. The proposed
Brief Comments on Confidence Intervals and p-Values 237

technique can be easily applied to obtain confidence region estimation


of vector parameters based on confidence interval estimates for the indi-
vidual elements. These topics warrant further strong empirical and meth-
odological investigations.
Remark. We shall note that there are different approaches to define p-value
type mechanisms, see, for example, Bayarri and Berger (2000), Berger and
Boos (1994), Berger (2003), Hwang and Yang (2001).
10
Empirical Likelihood

10.1 Introduction
Over the long course of statistical methodological developments there been a
shift from classic parametric likelihood methods to a focus towards robust
and efficient nonparametric and semiparametric developments of various
“artificial” or “approximate” likelihood techniques. These methods have a
wide variety of applications related to biostatistical experiments. Many non-
parametric and semiparametric approximations to powerful parametric likeli-
hood procedures have been used routinely in both statistical theory and
practice. Well-known examples include the quasi-likelihood method, which
are approximations of parametric likelihoods via orthogonal functions, tech-
niques based on quadratic artificial likelihood functions, and the local maxi-
mum likelihood methodology (e.g., Wedderburn, 1974; Claeskens and Hjort,
2004; Fan et al., 1998). Various studies have shown that artificial or approximate
likelihood-based techniques efficiently incorporate information expressed
through the data, and have many of the same asymptotic properties as those
derived from the corresponding parametric likelihoods. The empirical likeli-
hood (EL) method is one of a growing array of artificial or approximate
likelihood-based methods currently in use in statistical practice (e.g., Owen,
2001; Vexler et al., 2009a, 2014). Interest and the resulting impact in EL methods
continue to grow rapidly. Perhaps more importantly, EL methods now have
various vital applications in an expanding number of health related studies.
In Sections 10.2 and 10.3, we outline basic components related to EL techniques
and their theoretical evaluations. Sections 10.4 and 10.5 demonstrate valuable
examples of EL applications. In Section 10.6, we provide arguments that can be
accepted in favor of EL methods in order to be applied in statistical practice.

10.2 Empirical Likelihood Methods


As background for the development of EL-type techniques, we first out-
line the conventional EL approach. The classical EL takes the form

239
240 Statistics in the Health Sciences: Theory, Applications, and Computing


n

i=1
(F(X i ) − F(X i −)), which is a functional of the cumulative distribution
function F and iid observations X i , i = 1,… , n. In the distribution-free setting,

n
an empirical estimator of the likelihood takes the form of Lp = pi ,
i =1
where the components pi, i = 1,… , n, estimators of the probability weights,

n
should maximize the likelihood Lp , provided that pi = 1 and empirical
i=1
constraints based on X 1 ,… , X n hold. For example, suppose we would like to
test the hypothesis

H 0 : E {g(X 1 , θ)} = 0 versus H 1 : E {g(X 1 , θ)} ≠ 0 ,

where g(.,.) is a given function and θ is a parameter. Then, in a nonparametric



n
fashion, we define the EL function of the form EL(θ) = L(X 1 ,..., X n |θ) = pi,
i=1


n
where pi = 1.
i=1
Under the null hypothesis, the maximum likelihood approach requires
one to find the values of 0 < p1 ,..., pn < 1 that maximize the EL given the

∑ ∑
n n
empirical constraints pi = 1 and pi g(X i , θ) = 0 that present an
i=1 i=1
empirical version of the condition under H0 that E {g(X 1 , θ)} = 0. In situations,
when there are no 0 < p1 ,… , pn < 1 to satisfy the empirical constraints, it is
assumed that the null hypothesis is rejected in this stage. Further, following
the Lagrange method, we define the function
n
⎛ n
⎞ ⎛ n

W (p1 ,..., pn) = ∑i=1
log (pi ) + λ 1 ⎜ 1 −


i=1
pi ⎟ + λ ⎜ −
⎠ ⎝
∑ i=1
pi g(X i , θ)⎟ ,

where λ 1 and λ are Lagrange multipliers. Then we have


∂W (p1 ,..., pn) / ∂ pk = pk−1 − λ 1 − λg(X k , θ) . Hence, to maximize W (p1 ,..., pn),
0 < p1 ,… , pn < 1 should satisfy ∂W (p1 ,..., pn) / ∂ pk = 0 or equivalently

∑ {
pk ∂W (p1 ,..., pn) / ∂ pk = 0. }
n
pk ∂W (p1 ,..., pn) / ∂ pk = 0, k = 1,..., n. This implies
k =1

∑ ∑
n n
Thus, since pk = 1, we obtain n − λ 1 − λ pk g(X k , θ) = 0, where,
k =1 k =1


n
under H0, it is assumed that pk g(X k , θ) = 0. This leads to n − λ 1 = 0.
k =1
Therefore, using the equation ∂W (p1 ,..., pn) / ∂ pk = 0, we obtain
pk = (n + λg(X k , θ)) , k = 1,..., n. That is to say, the Lagrange method provides
−1

n n

∏ ∏ (n + λg(X , θ))
−1
EL(θ) = sup pi = i ,
0 < p1 , p2 ,..., pn < 1, ∑ pi = 1, ∑ pi g ( X i , θ ) = 0 i=1 i=1

∑ g(X , θ)(n + λg(X , θ))


−1
where λ is a root of i i = 0. Note that the derivate
Empirical Likelihood 241


∑ g(X , θ)(n + λg(X , θ)) ∑ g(X , θ) (n + λg(X , θ))
−1 −2
i i =− i
2
i ≤ 0.
∂λ

∑ g(X , θ)(n + λg(X , θ))


−1
Then, with respect to λ, the function i i monotonically
decreases. Taking into account this fact, one can show that if
min i = 1,..., n (g(X i , θ)) < 0 < max i = 1,..., n (g(X i , θ)) then one unique solution for λ can
be found (see Owen, 2001, for details). Commonly, a numerical method is
∑ g(X i , θ) (n + λg(X i , θ)) = 0 with respect to λ.
−1
required to solve the equation
Since under H 1, the only constraint under consideration is
similar manner to that shown above, we have
∑ p = 1, in a
i

n n

EL = sup
0 < p1 , p2 ,..., pn < 1, ∑
∏ ∏n
pi = 1 i = 1
pi =
i=1
−1
= (n) .
−n

Finally, we obtain the EL ratio (ELR) test statistic

ELR (θ) = EL / EL(θ)

for the hypothesis test of H 0 versus H 1. For example, when the function
g(u, θ) = u − θ, the null hypothesis corresponds to the expectation E(X 1 ) = θ,
for fixed θ. The EL ratio test strategy is that we reject H 0 for large values of
ELR (θ).
Owen (1988, 1990, 1991) showed that the nonparametric test statistic
2 log (ELR (θ)) has an asymptotic chi-square distribution under the null hypo-
3
thesis when E g(X 1 , θ) < ∞ . This result illustrates that Wilks’ theorem–type
results continue to hold in the context of this infinite-dimensional problem
(n, the number of p1 ,..., pn , is assumed to be large, n → ∞). The proof of this
proposition is outlined in Section 10.3. Thus, we reject H 0 when

2 log {ELR (θ)} ≥ χ12 (1 − α),

where χ12 (1 − α) is the 100(1 − α)% percentile of the chi-square distribution


with degree of freedom one, and α is the significance level.
Consequently, there are techniques for correcting forms of ELRs to improve
the convergence rate of the respective null distributions of the test statistics to
chi-square distributions. These techniques are similar to those applied in the
field of parametric maximum likelihood ratio procedures (Vexler et al., 2009a).
The statement of the hypothesis testing above can easily be inverted with
respect to providing nonparametric confidence interval estimators (Chapter 9).
In terms of the accessibility of this method, it should be noted that the num-
ber of EL software packages continues to expand, particularly the R-based
software packages. For example, library(emplik) and library(EL) are R packages
242 Statistics in the Health Sciences: Theory, Applications, and Computing

that include the R functions el.test() and EL.test(). These simple R functions
can be very useful for the EL analysis of data from statistical studies.

10.3 Techniques for Analyzing Empirical Likelihoods


The analysis presented in this section is relatively clear, and has the basic
ingredients for more general cases. We posit that the following results can be
associated with deriving different properties of EL-type procedures includ-
ing the power and Type I error rate analysis of the ELR test.
Properties of many statistical quantities based on parametric likelihoods
can be studied by using the fact that parametric likelihood functions are
often highly peaked about their maximum values (e.g., Chapters 3 and 5).
The modern statistical literature considers a variety of semi- and nonpara-
metric procedures created by proposing to employ EL functions in efficient
parametric schemes instead of parametric likelihoods. For example, in this
context, the results of Qin and Lawless (1994) have a remarkable use with
respect to operations with ELs in a similar manner to those related to para-
metric maximum likelihoods.
The following lemma illustrates a strong similarity between behaviors of
empirical and parametric likelihood functions. Suppose the function g( x , θ)
appeared in the EL definition is once differentiable with respect to the sec-
ond argument θ. We then have the following.

Lemma 10.3.1.


n
Let θ M be a root of the equation n−1 g(X i , θ M ) = 0, where ∂ g(X i , θ)/ ∂θ < 0
i=1
(or ∂ g(X i , θ)/ ∂θ > 0), for all i=1, 2,…, n. Then the argument θ M is a global
maximum of the function

EL(θ) = max {∏ n

i=1
pi : 0 < pi < 1, ∑
n

i=1
pi = 1, ∑
n

i=1 }
pi g(X i , θ) = 0 ,

which increases and decreases monotonically for θ < θ M and θ > θ M , respectively.


n
Proof. It is clear that the argument θ M , a root of n−1 g(X i , θ M ) = 0, maximizes
i=1
the function EL (θ), since in this case EL (θ M ) = n −n
with pi = n−1 , i = 1,… , n,
∏ ∑
n n
which maximizes pi given the sole constraint pi = 1, 0 ≤ pi ≤ 1, i = 1, … , n.
i=1 i=1

Using the Lagrange method, one can represent EL (θ) as

∏p ,
1
EL(θ) = i 0 < pi = < 1, i = 1,… , n,
i=1
n + λg(X i , θ)
Empirical Likelihood 243

where the Lagrange multiplier λ is a root of the equation


∑ g(X , θ)(n + λg(X , θ))
−1
i i = 0 (e.g., Owen, 2001). This then yields the follow-
ing expression:

d log (EL(θ))
n n n


= −λ ∑
i=1
∂ g(X i , θ)/ ∂θ
n + λg(X i , θ)
− ∑i=1
g(X i , θ) ∂λ
n + λg(X i , θ) ∂θ
= −λ ∑ ∂ng+(Xλg,(θX)/, θ∂θ) ,
i=1
i

where without loss of generality we assume ∂ g(X i , θ)/ ∂θ > 0, i = 1,… , n.


Now define the function

∑ g(X i , θ) (n + λg(X i , θ)) .


n
L (λ) =
−1

i=1

Since dL (λ) / dλ < 0, the function L (λ) decreases with respect to λ and has just
one root relative to solving L (λ) = 0. Consider the scenario with θ > θ M . In
this case when λ 0 = 0 we can conclude that

L (λ 0 ) = ∑ ∑
n n
g(X i , θ) (n) ≥ g(X i , θ M ) (n) = 0,
−1 −1
i=1 i=1

since g(X i , θ) increases with respect to θ ( ∂ g(X i , θ)/ ∂θ > 0).


The function L (λ) decreases. This implies that the root of L (λ) = 0 should
be located on the right side from λ 0 = 0 and then this root is positive. For
a graphical representation of this case see Figure 10.1a below.

L(λ) > 0
λ0 = 0
| >
0

λ<0 λ
θ > θM θ < θM
n n
n n
L(λ0) = ∑ g(Xi, θ)/n >∑ g (Xi, θM)/n = 0
o i=1 i=1 oL(λ0) =∑ g(Xi, θ)/n < ∑ g (Xi, θM)/n = 0
i=1 i=1
0 L(λ)

L(λ) > 0
L(λ)

λ0 = 0
| >
λ>0 λ

L(λ) < 0 L(λ) < 0

0 λ 0 λ
(a) (b)

FIGURE 10.1
The schematic behaviors of L(λ) plotted against λ (the axis of abscissa), when (a) θ > θ M and (b)
θ < θ M, respectively.
244 Statistics in the Health Sciences: Theory, Applications, and Computing

Thus, by virtue of

d log (EL(θ))
n
∂ g(X i , θ)/ ∂θ

= −λ ∑ n + λg(X , θ) ,
i=1 i

we prove that the function EL(θ) decreases, when θ > θ M .


Taking the same approach, one can show that the root of L (λ) = 0 should be
to the left of λ 0 = 0, when θ < θ M . For a graphical representation of this case
see Figure 10.1b. This result combined with d log (EL(θ)) = −λ
n
∂ g(X i , θ)/ ∂θ
dθ i=1
n + λg(X i , θ) ∑
completes the proof of Lemma 10.3.1.
Note that θ M plays a role that is similar to that of parametric maxi-
mum likelihood estimators. For example, when g(u, θ) = u − θ, we obtain

n
X i ; if, for a given function z(u), g(u, θ) = z(u) − θ , θ > 0 then
2
θ M = X = n−1
i=1

∑ z(X ).
n
θ M 2 = n−1 i
i=1
Turning to the task of developing asymptotic methods based
on ELs, we provide the proposition below. Without loss of generality
and for simplicity of notation, we set g(u, θ) = u − θ in the definition of the


n
empirical likelihood ratio function ELR (θ). Thus, EL(θ) = pi ,
i=1


n
log (ELR(θ)) = log{1 + λ(X i − θ)/ n} with pi = {n + λ(X i − θ)} , where λ is −1
i=1
a root of

∑ (X i − θ)(n + λ(X i − θ))−1 = 0 .


n

i=1

Note that λ = λ(θ) is a function of θ. Then, defining λ‘= dλ(θ)/dθ, λ" = d 2 λ(θ)/dθ2,

∑ (X i − θ)(n + λ(X i − θ))−1 = 0 , one


n
and λ ( k ) = d k λ(θ)/dθ k , k = 3, 4 and using
i=1
can show:

Proposition 10.3.1. We have the following equations:

d {log ELR(θ)}/dθ = −λ(θ), λ ' = − n ∑ ∑


n n
pi 2 / (X i − θ)2 pi 2 ,
i=1 i=1

{
λ '' = − 2(λ ')2 ∑
n

i=1
(X i − θ)3 pi3 + 4n(λ ') ∑
n

i=1
(X i − θ)pi3

−2 nλ∑
n

i=1 }∑
pi3 /
n

i=1
(X i − θ)2 pi2 ,

= {6λ ' λ '' ∑ ∑ (X − θ)p − 6 λ ' ∑ (X − θ) p


n n n
λ (3) (X i − θ)3 pi3 + 6nλ '' i
3
i
3
i
4 4
i
i=1 i=1 i=1

− 18 n( λ ') ∑ (X − θ) p + 12 n(λ ')∑ p


n n
2 2 4 3
i i i
i=1 i=1

− (18n ( λ ' ) + 6nλ )∑ p } /∑ (X − θ) p ,


n n
2 2 4 2 2
i i i
i=1 i=1
Empirical Likelihood 245

{
λ (4) = (8(λ ')λ (3) + 6(λ '')2 ) ∑
n

i=1
(X i − θ)3 pi 3 + 8nλ (3) ∑
n

i=1
(X i − θ)pi 3


n
− 36(λ ')2 λ '' (X i − θ)4 pi 4
i=1

∑ ∑ ∑
n n n
− 72 nλ ' λ '' (X i − θ)2 pi 4 − (36n2 λ ''− 18nλλ ') pi 4 + 24nλ '' pi 3
i=1 i=1 i=1

∑ ∑ (X − θ) p − 72n(λ ')∑ (X − θ)p


n n n
+ 24(λ ')4 (X i − θ)5 pi 5 + 96n(λ ')3 i
3
i
5
i i
4
i=1 i=1 i=1

+ (144n (λ ') + 24nλ λ ')∑ (X − θ)p


n
2 2 2 5
i i

− (72 n λ(λ ') + 24nλ )∑ p } /∑ (X − θ) p .


i=1
n n
2 3 5 2 2
i i i
i=1 i=1

The proof of Proposition 10.3.1 is based on technical computations.


This proposition can support a variety of evaluations of ELR (θ)-type proce-
dures. To show the relevant examples, we should note that log {ELR (θ M )} = 0,
since, in this case, pi = 1/n, for all i, and EL(θ M ) = EL. It is also clear that,
when θ = θ M , λ(θ) = 0, since pi = 1/ n, i = 1,… , n maximize EL ≥ EL(θ), for all
θ, and satisfy automatically the constraint pi g(X i , θ) = 0, when θ = θ M , by ∑
virtue of the definition of θ M . Thus, for example in the case of g(u, θ) = u − θ,
one can use Proposition 10.3.1 to obtain the Taylor expansion for the function
log {ELR (θ)} at argument θ M = X , in the form

log ELR (θ) ≈ log ELR X + (θ − X ) ()


d log ELR X ()

θ= X

+
2
(θ − X )2 d log ELR X ( ) +
3
(θ − X )3 d log ELR X ( )
2! dθ2 3! dθ3
θ= X θ= X

( ) ∑
1 2 ⎡1 n

= n0.5 (θ − X ) / ⎢ (X i − X ) ⎥ 2
2 ⎣n i=1 ⎦


3
1 3 ⎡1 2⎤
n
+ n(θ − X ) / ⎢ (X i − X ) ⎥ ,
3 ⎣n i=1 ⎦
as n → ∞. This approximation depicts Wilks’ theorem when θ = EX 1. (In the case
of θ = EX 1 and n → ∞, n0.5 (θ − X ) has a normal distribution, n(θ − X )3 → 0, and
p


n
(X i − X )2 / n → var(X 1 ).) Using Proposition 10.3.1 to figure more terms in
i=1
the Taylor expansion can provide high order approximations to the null distri-
bution of the ELR test, e.g., to obtain the Bartlett correction of the ELR structure
(e.g., Vexler et al., 2009a). Under the alternative hypothesis θ ≠ EX 1, the approxi-
mation above shows the power of the ELR test. In this context, we note that Lazar
and Mykland (1999) considered a general form of ELRs and the case where
( )
θ − EX 1 ~ O n−0.5 . The authors compared the local power of ELR to that of an
ordinary parametric likelihood ratio. Their notable research shows that there is
no loss of efficiency in using an EL model up to a second-order approximation.
246 Statistics in the Health Sciences: Theory, Applications, and Computing

In a similar manner to Proposition 10.3.1, more complicated ELR structures


can be analyzed. For example, one can consider the null hypothesis:
H 0 : E(X 1 ) = θ1 and E(X 12 ) = θ2 . In this case, under the null hypothesis, the EL
function is given as
⎧ n ⎫

∏ ∑ ∑ ⎪

n n n
EL (θ1 , θ2) = max ⎨ pi : pi = 1, pi X i = θ1 , pi X i 2 = θ 2 ⎬ .
⎪⎩ i =1 i =1 i =1 i =1 ⎪⎭
Then using the Lagrangian

∑ ∑ ∑ ∑
n n n n
Λ= log pi + λ(1 − pi ) + λ 1 (θ1 − pi X i ) + λ 2 (θ2 − pi X i 2 ),
i=1 i=1 i=1 i=1

we obtain pi = {n + λ 1 (X i − θ1 ) + λ 2 (X i 2 − θ2 )} −1, where λ 1 and λ 2 are roots of


∑ (X − θ ) p = 0 and ∑ (X
i 1 i i
2
)
− θ2 pi = 0. The Appendix presents several results
that are similar to those related to the evaluations of ELR (θ) mentioned above.
To exemplify how close the empirical likelihood is to its parametric coun-
terparts, we consider the following example. Let {X 1 ,..., X n } denote a random
sample from the following distributions: (1) normal distribution N (θ, 1), and
(2) Exponential(θ), where θ is the rate parameter. Define, in the normal case,
the minus maximum log-likelihood ratio test statistic as

∑ ∑
n n
M(θ) = − log { MLR ( θ )} = − (X i − θ)2 /2 + (X i − X )2 /2
i =1 i =1

and the minus log-empirical likelihood ratio test statistic for E(X 1 ) = θ as

∑ log {1 + λ (X i − θ)/n },
n
E (θ) = − log {ELR ( θ )} = −
i =1

∑ ∑
n n
(X i − θ) (n + λ(X i − θ)) = 0 and X = n −1
−1
where λ is a root of X i . In
i =1 i =1
the case of the exponential distribution, the minus maximum log-likelihood
ratio test statistic is
M (θ) = − log { MLR ( θ )} = n log θ − n θX + n log X + n
and the minus log-empirical likeilhood ratio test statistic for E (X 1 ) = 1/θ is

∑ log {1 + λ(X i − 1/θ)/ n }.


n
E (θ) = − log {ELR ( θ )} = −
i =1

Generating X1,…,Xn ~ N(1,1) and X1,…,Xn ~ Exp(1), we obtained Figure 10.2.


This figure presents the plots of the parametric maximum log-likelihood ratios
and the log-empirical likelihood ratios versus values of the parameter θ based
on samples of sizes n = 25, 50,150, where the solid line and the dashed line
represent the functions E(θ) and M(θ) when the underlying distribution is nor-
mal, respectively, while the dotted line and dot-dash line represent the func-
tions E(θ) and M(θ) when the underlying distribution is exponential, respectively.
Figure 10.2 illustrates Lemma 10.3.1. This figure shows that the empirical
n=25 n=50 n=150
0.0 0.0 0.0

−0.5 −0.5 −0.5

−1.0 −1.0 −1.0

−1.5

E(θ), M(θ)
−1.5 −1.5
Empirical Likelihood

log−EL(log−PL) ratio (θ)


log−EL(log−PL) ratio (θ)
−2.0 −2.0 −2.0

−2.5 −2.5 −2.5


0.8 1.0 1.2 1.4 0.7 0.8 0.9 1.0 1.1 1.2 0.85 0.90 0.95 1.00 1.05 1.10 1.15
θ θ θ
n=25 n=50 n=150
0 0 0

−1 −1 −1

−2 −2 −2

E(θ), M(θ)
−3 −3 −3

log−EL(log−PL) ratio (θ)


log−EL(log−PL) ratio (θ)

−4 −4 −4

0.8 1.0 1.2 1.4 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.00 1.05 1.10 1.15 1.20 1.25 1.30
θ θ θ
Normal, EL Normal, PL Exp, EL Exp, PL

FIGURE 10.2
The log-empirical likelihood ratios and the maximum log-likelihood ratios based on samples of sample sizes n = 25,50,150, where the solid line and the
dashed line correspond to the log-empirical likelihood ratios and the maximum log-likelihood ratios when the underlying distribution is normal, respec-
tively, while the dotted line and the dot-dash line represent the log-empirical likelihood ratio and the maximum log-likelihood ratio when the underlying
247

distribution is exponential, respectively.


248 Statistics in the Health Sciences: Theory, Applications, and Computing

likelihood ratio behaves in a similar manner to the parametric maximum


likelihood ratio and approaches to the parametric maximum likelihood ratio
asymptotically. Furthermore, as the sample size n increases, the log-empirical
likelihood ratio approximates the maximum log-likelihood ratio well in the
neighborhood of the maximum likelihood estimators. The log-empirical like-
lihood (ratio) increases monotonically up to the maximum likelihood estima-
tor and then decreases monotonically.
The R code for implementation of the procedure is shown below.
library(emplik)
theta0<-1
n.seq<-c(25, 50, 150)

# normal
plot.norm<-function(X){
get.elr<-function(theta) sapply(theta,function(pp) el.test(X,mu=pp)$'-2LLR'/(-2)) # log EL
get.plr<-function(theta) sapply(theta,function(pp) sum(-(X-pp)^2/2)+sum((X-mean(X))^2/2)) #log(PL)
rg<-6/sqrt(length(X))
curve(get.elr,xlim=c(max(0,mean(X)-rg),mean(X)+rg),type="l",lty=1,lwd=2,add=TRUE,col=1)
curve(get.plr,xlim=c(max(0,mean(X)-rg),mean(X)+rg),type="l",lty=2,lwd=2,add=TRUE,col=2)
}

#exponential
plot.exp<-function(X){
get.elr.exp<-function(theta) sapply(theta,function(pp) el.test(X,mu=1/pp)$'-2LLR'/(-2)) #log EL
get.l.exp<-function(theta) sum(log(dexp(X,rate=theta)))
get.plr.exp<-function(theta) sapply(theta,function(pp) get.l.exp(pp)-get.l.exp(1/mean(X))) #log(PL)
rg<-6/sqrt(length(X))
curve(get.elr.exp,xlim=c(max(0.1,1/mean(X)-rg),1/mean(X)+rg)type="l",lty=3,lwd=2,add=TRUE,col=3)
curve(get.plr.exp,xlim=c(max(0.1,1/mean(X)-rg),1/mean(X)+rg),type="l",lty=4,lwd=2,add=TRUE,col=4)
}
#add legend for the plot
add_legend <- function(...) {
opar <- par(fig=c(0, 1, 0, 1), oma=c(0, 0, 0, 0),
mar=c(0, 0, 0, 0), new=TRUE)
on.exit(par(opar))
plot(0, 0, type='n', bty='n', xaxt='n', yaxt='n')
legend(...)
}
par(mar=c(5.5, 4, 3.5, 1.5),mfrow=c(2,3),mgp=c(2,1,0))
# normal
for (n in n.seq){
rg<-2/sqrt(n)
X<-rnorm(n,theta0,1)
plot(theta0,1,xlim=c(max(0,mean(X)-rg),mean(X)+rg),xlab=expression(theta), ylim=c(-
2.5,0.005),ylab=expression(paste(“E(“,theta,”),M(“,theta,”)”)), main=paste0("n=",n),cex.axis=1.2)
plot.norm(X)
abline(v=mean(X),lty=2,col="grey")
}
# exponential
for (n in n.seq){
rg<-2/sqrt(n)
X<-rexp(n,rate=theta0)
plot(theta0,1,xlim=c(max(0,1/mean(X)-rg),1/mean(X)+rg),xlab=expression(theta), ylim=c(-
4.5,0.005),ylab=expression(paste(“E(“,theta,”),M(“,theta,”)”)), main=paste0("n=",n),cex.axis=1.2)
plot.exp(X)
abline(v=1/mean(X),lty=2,col="grey")
}
add_legend("bottom", legend=c("Normal, EL","Normal, PL","Exp, EL","Exp, PL"), lwd=2,
lty=1:4, col=1:4, horiz=TRUE, bty='n', cex=1.1)
Empirical Likelihood 249

10.3.1 Practical Theoretical Tools for Analyzing ELs


In this section we show several simple propositions that can assist in analyz-
ing empirical likelihood type procedures. For example, the remainder term
in the approximation

∑ ∑
3
log {ELR (θ)} ≈ (
n (θ − X ) / ⎢⎡
1 0.5 1
) (X i − X )2 ⎤⎥ + n(θ − X )3 / ⎡⎢
1 1
(X i − X )2 ⎤⎥
2 n n

2 ⎣n i=1 ⎦ 3 ⎣n i=1 ⎦

mentioned below Proposition 10.3.1 can be evaluated using the propositions


presented in this section. The analysis is relatively clear, and has the basic
ingredients for more general cases.

Proposition 10.3.2. Assume E X 1 ( ) < ∞. Then max


3
i = 1,..., n (
X i = op n1/3 + ε , for )
all ε > 0.

Proof. Chebyshev’s inequality (Chapter 1) provides that

n n

∑ ( ∑
3 3

( ) )
E Xi E X1
Pr max i =1,..., n X i ≥ n1/3+ε ≤ Pr X i ≥ n1/3+ε ≤ = 3ε ⎯⎯
⎯ ⎯ ⎯
⎯ → 0.
n1+ 3ε n n →∞
i =1 i =1

This completes the proof.


In many cases, while analyzing EL-type procedures, we need to evaluate
forms similar to λ(θ)X i , λ '(θ)X i , and so on for all θ. Proposition 10.3.2 shows
( )
X i = op n1/3 + ε , i = 1,..., n. Proposition 10.3.1 connects λ ( k ) (θ) with λ(θ), k = 1,.., 4.
The following results examine λ(θ).
Consider as in Section 10.2 the empirical likelihood

n n

EL(θ) = sup
0 < p1 , p2 ,..., pn < 1, ∑ pi = 1, ∑ pi g ( X i , θ )= 0
∏ p = ∏(n + λg(X , θ))
i i
−1
,
i =1 i =1

∑ g(X , θ)(n + λg(X , θ))


−1
where λ is a root of i i = 0. Then we obtain the follow-
ing propositions.

Proposition 10.3.3. We have that λ ≥ 0 if and only if


n
∑ g (X , θ) ≥ 0, and
i

∑ g (X , θ) < 0.
i=1
λ < 0 if and only if i
i=1
250 Statistics in the Health Sciences: Theory, Applications, and Computing

Proof. The forms of pi = {n + λg (X i , θ)} , i = 1,.., n, and n− n ≥ ∏ p (here


−1
i
n

∏ p provided that only 0 < p ,..., p < 1, ∑ p = 1)


i =1
pi = n−1 , i = 1,.., n, maximize i 1 n i
imply that i =1

⎛ n ⎞ n

0 ≤ log ⎜⎜ n− n /


∏ pi⎟⎟ =


∑log {1 + λg (X , θ)/n}. i
i =1 i =1

Using the inequality log (1 + s) ≤ s for s > −1, we obtain

n n n

0≤ ∑ log {1 + λg (X i , θ) /n} ≤ ∑ λg (X i , θ) /n = λ ∑ g (X , θ)/n. i


i =1 i =1 i =1

This completes the proof.

Proposition 10.3.4. The Lagrange multiplier λ satisfies


n n

λ= ∑ g (X i , θ)/ ∑{g (X , θ)} p .


i
2
i
i =1 i =1

∑ g (X i , θ) pi = 0 with pi = {n + λg (X i , θ)} , i = 1,.., n,


n −1
Proof. The constraint
i=1
implies that
n n n

∑ g (X i , θ) = ∑{g (Xi , θ) (1 − pi)} + ∑{g (Xi , θ) pi}


i =1 i =1 i =1
n

= ∑ g (X , θ) (1 − p )i i
i =1

⎧ ⎫
∑ g (X , θ) ⎪⎨⎩⎪ n n+ +λgλ(gX(X, θ,)θ−) 1⎪⎬⎭⎪
n

=
i
i
i
i =1
n n n

= n ∑ g (X , θ) p + λ∑ g (X , θ) p −∑ g (X , θ) p
i i i
2
i i i
i =1 i =1 i =1
n

= λ ∑{g (X , θ)} p . i
2
i
i =1

This completes the proof.


Empirical Likelihood 251

n ⎡ n ⎤ −1
Proposition 10.3.5. If λ ≥ 0, we have 0 ≤ λ ≤ n ∑ g (X i , θ) ⎢⎢ ∑ {g (X i , θ)} I {g (X i , θ) < 0}⎥⎥ ;
2

i=1 ⎢⎣ i=1 ⎥⎦
−1
n
⎡ n

∑ ∑
g (X i , θ) ⎢ {g (X i , θ)} I {g (X i , θ) > 0}⎥ ≤ λ < 0, where
2
if λ < 0, we have n
i=1 ⎢⎣ i = 1 ⎥⎦

I {}
. is the indicator function.

Proof. With λ ≥ 0, we obtain

n n

∑ {g (Xi , θ)} pi ≥ ∑ ⎡⎣g (X , θ) p I {g (X , θ) < 0}⎤⎦ =


2 2
i i i
i=1 i=1

⎡ 2 I {g (X i , θ) < 0} ⎤
n n

∑ ⎢{g (X i , θ)} ∑ ⎡⎢⎣{g (X , θ)} 1


I {g (X i , θ) < 0}⎤⎥,
2
⎥≥
n + λg (X i , θ) ⎥⎦
i
i=1 ⎢⎣ i=1
n ⎦

where pi = {n + λg (X i , θ)} , i = 1, … , n.
−1

Applying this result and Proposition 10.3.3 to Proposition 10.3.4 yields

−1
n
⎡ n ⎤
∑ ∑
g (X i , θ) ⎢ {g (X i , θ)} I (g (X i , θ) < 0)⎥ .
2
0≤λ<n
i=1 ⎢⎣ i = 1 ⎥⎦

It follows similarly that when λ < 0,

−1
n
⎡ n ⎤
∑ ∑
g (X i , θ) ⎢ {g (X i , θ)} I {g (X i , θ) > 0}⎥ < λ < 0 .
2
n
i=1 ⎢⎣ i = 1 ⎥⎦

This completes the proof.


Proposition 10.3.5 provides the exact non-asymptotic bounds for λ. Owen
(1988) used complicated considerations to obtain the approximate bounds for
λ as n → ∞. Proposition 10.3.5 immediately demonstrates that, for all ε > 0,
( )
λ = op n1/2 +ε under H 0 , since the bounds presented in Proposition 10.3.5 are
based on the sums of iid random variables. The propositions above can be
useful in the context of numerical computations of ELs, providing, e.g., the
exact bounds for λ, where λ is a numerical solution of
∑ g(X , θ)(n + λg(X , θ))
−1
i i = 0.
252 Statistics in the Health Sciences: Theory, Applications, and Computing

−1
Proposition 10.3.6. Since the probability weights pi = (n + λg(X i , θ)) , i = 1,..., n,

∑ ∑ ∑
n n2 n
we have the properties g(X i , θ)2 pi2 = − 2 + 2 pi2 and pi2 ≥ n−1.
λ λ i=1

Proof. By virtue of ∑ p = 1, it is clear that


i

0≤ ∑ g(X i , θ)2 pi2 =


1
λ2 ∑ {n + λg(X , θ)}
λ 2 g(X i , θ)2
i
2

∑ λ g(X {, nθ)+ +λg2(nXλg, θ(X)} , θ) + n


2 2 2
1
= i i
λ2 i
2

∑ 2{nnλ+gλ(Xg(X, θ), θ+)}n


2
1
− i
λ2 i
2

=
n
λ 2
λ
1
− 2 ∑ {n + λg(X , θ)}
2 nλg(X i , θ) + n2
i
2 ,

where

2 n {λg(X i , θ) + n} − 2 n2 + n2
n
λ 2
λ
1
− 2 ∑ {n + λg(Xi , θ)}2 n

λ
n 2n
= 2− 2
λ ∑ n2
pi + 2
λ ∑p 2
i and ∑ p = 1.
i=1
i

∑p
n n2
Thus, we obtain − + 2
i ≥ 0.
λ2 λ2
This completes the proof.
∑ g(X , θ)(n + λg(X , θ))
−1
For example, consider the equation i i = 0. We have

g(X i , θ)
∑ (n + λg(X , θ)) = 0,
d
dθ i

which implies that

g '(X i , θ) (n + λg(X i , θ)) − λ ' g(X i , θ)2 − λg '(X i , θ) g '(X i , θ)


∑ (n + λg(Xi , θ))2
= 0.

Then

g '(X i , θ)n − λ ' g(X i , θ)2


∑ (n + λg(X i , θ))2
= 0.
Empirical Likelihood 253

This result leads to


n

∑ g '(X , θ)p
n i
2
i

λ' = i=1
n .
∑ g(X , θ) p
i=1
i
2 2
i

In the case with g(u, θ) = u − θ, one can use Proposition 10.3.6 to obtain the
inequality
n

n∑p 2
i
1
λ' = − n
i=1
≤− n .

i=1
(X i − θ)2 pi2 ∑
i=1
(X i − θ)2 pi2

10.4 Combining Likelihoods to Construct Composite


Tests and to Incorporate the Maximum Data-Driven
Information
Strictly speaking, EL techniques and parametric likelihood methods are
closely related concepts. This provides the impetus for an impressive expan-
sion in the number of EL developments, based on combinations of likeli-
hoods of different types (e.g., Qin, 2000).
Consider a simple example, where we assume to observe independent cou-
ples given as (X , Y ). In this case, the likelihood function can be denoted as
L(X , Y ). Suppose that the data points related to X’s are observed completely,
whereas a proportion of the observed data for the Y’s is incomplete. Assume
a model of Y given X, i.e., Y |X , is well defined, e.g., Yi = βX i + ε i, where β
denotes the model parameter and ε i is a normally distributed error term, for
i = 1,… , n. Then, we refer to Bayes theorem to represent the joint likelihood
L(X , Y ) = L(Y |X )L(X ), where L(X ) can be substituted by the EL to avoid para-
metric assumptions regarding the distribution of X’s.
In this context, Qin (2000) shows an inference on incomplete bivariate data,
using a method that combines the parametric model and ELs. This method
also incorporates auxiliary information from variables in the form of con-
straints, which can be obtained from reliable resources such as census
reports. This approach makes it possible to use all available bivariate data,
whether completely or incompletely observed. In the context of a group com-
parison, constraints can be formed based on null and alternative hypotheses,
and these constraints are incorporated into the EL. This result was extended
and applied to the following practical issues.
254 Statistics in the Health Sciences: Theory, Applications, and Computing

Malaria remains a major epidemiological problem in many developing


countries. In endemic areas, an individual may have symptoms attributable
either to malaria or to other causes. From a clinical viewpoint, it is important
to attend to the next tasks: (1) to correctly diagnose an individual who has
developed symptoms, so that the appropriate treatments can be given; and
(2) to determine the proportion of malaria-affected cases in individuals who
have symptoms, so that policies on intervention program can be developed.
Once symptoms have developed in an individual, the diagnosis of malaria
can be based on the analysis of the parasite levels in blood samples. How-
ever, even a blood test is not conclusive, as in endemic areas many healthy
individuals can have parasites in their blood slides. Therefore, data from this
type of study can be viewed as coming from a mixture distribution, with the
components corresponding to malaria and nonmalaria cases. Qin and Leung
(2005) constructed new EL procedures to estimate the proportion of clinical
malaria using parasite-level data from a group of individuals with symp-
toms attributable to malaria.
Yu et al. (2010) proposed two-sample EL techniques based on incomplete
data to analyze a Pneumonia Risk Study in an ICU setting. In the context of
this study, the initial detection of ventilator-associated pneumonia (VAP) for
inpatients at an intensive care unit requires composite symptom evaluation,
using clinical criteria such as the clinical pulmonary infection score (CPIS).
When CPIS is above a threshold value, bronchoalveolar lavage (BAL) is per-
formed, to confirm the diagnosis by counting actual bacterial pathogens.
Thus, CPIS and BAL results are closely related, and both are important indi-
cators of pneumonia, whereas BAL data are incomplete. Yu et al. (2010) and
Vexler et al. (2010c) derived EL methods to compare the pneumonia risks
among treatment groups for such incomplete data.
In semi- and nonparametric contexts, including EL settings, Qin and
Zhang (2005) showed that the full likelihood can be decomposed into the
product of a conditional likelihood and a marginal likelihood, in a similar
manner to the parametric likelihood considerations. These techniques aug-
ment the study’s power by enabling researchers to use any observed data
and relevant information.

10.5 Bayesians and Empirical Likelihood


The statistical literature has shown that Bayesian methods (e.g., Chapter 5)
can be applied for various tasks for the analysis of health-related experi-
ments. Commonly, the application of a Bayesian approach requires the
assumption of functional forms corresponding to the distribution of the
underlying data and parameters of interest. However, in cases with data sub-
ject to complex missing data problems, parametric estimation is complicated
Empirical Likelihood 255

and formal tests for the relevant goodness-of-fit are often not available. The
statistical literature has shown that tests derived from empirical likelihood
methodology possess many of the same asymptotic properties as those
based on parametric likelihoods. This leads naturally to the idea of using the
empirical likelihood instead of the parametric likelihood as the basis for
Bayesian inference.
Lazar (2003) demonstrated the potential for constructing nonparametric
Bayesian inference based on ELs. The key idea is to substitute the parametric
likelihood (PL) with the EL in the Bayesian likelihood construction relative
to the component of the likelihood used to model the observed data. It is
demonstrated that the EL function is a proper likelihood function and can
serve as the basis for robust and accurate Bayesian inference. This Bayesian
empirical likelihood method provides a robust nonparametric data-driven
alternative to the more classical Bayesian procedures.
Vexler et al. (2014) developed the nonparametric Bayesian posterior expec-
tation fashion by incorporating the EL methodology into the posterior likeli-
hood construction. The asymptotic forms of the EL-based Bayesian posterior
expectation are shown to be similar to those derived in the well-known para-
metric Bayesian and Frequentist statistical literature. In the case when the
prior distribution function depends on unknown hyper-parameters, a
nonparametric version of the empirical Bayesian method, which yields double
empirical Bayesian estimators, can be obtained. This approach yields a non-
parametric analog of the well-known James–Stein estimation that has been
well addressed in the literature dealing with multivariate-normal observa-
tions (e.g., Stein, 1956).
Vexler et al. (2016b) provided an EL-based technique for incorporating
prior information into the equal-tailed and highest posterior density confi-
dence interval estimators (Chapter 9) in the Bayesian manner. It was demon-
strated that the proposed EL Bayesian approach may correct confidence
regions with respect to skewness of the data distribution.
The EL Bayesian procedures can serve as a powerful approach to incorpo-
rating external information into the inference process about given data, in a
distribution-free manner.

10.6 Three Key Arguments That Support the Empirical


Likelihood Methodology as a Practical Statistical
Analysis Tool
One of the important advantages of EL techniques is their general applicabil-
ity and an assessment of their performance under conditions that are com-
monly unrestricted by parametric assumptions. When one is in doubt about
256 Statistics in the Health Sciences: Theory, Applications, and Computing

the best strategy for constructing statistical decision rules, the following
arguments can be accepted in favor of EL methods:

Argument 1: The EL methodology employs the likelihood concept in a


simple nonparametric fashion in order to approximate optimal para-
metric procedures. The benefit of using this approach is that the EL
techniques are often robust as well as highly efficient. In this con-
text, we also may apply EL functions to replace parametric likeli-
hood functions in known and well-developed constructions.
Argument 2: Similarly to the parametric likelihood concept, the EL
methodology gives relatively simple systematic directions for con-
structing efficient statistical tests that can be applied in various com-
plex statistical experiments.
Argument 3: Perhaps the extreme generality of EL methods and their
wide scope of usefulness partly follow from the ability to easily set
up EL-type statistics as components of composite parametric, semi-
parametric, and nonparametric likelihood–based systems, efficiently
attending to any observed data and relevant information. Paramet-
ric, semiparametric, and empirical likelihood methods play roles
complementary to one another, providing powerful statistical proce-
dures for complicated practical problems.

In conclusion, we note that EL-based methods are employed in much of


modern statistical practice, and we cannot describe all relevant theory and
examples. The reader interested in the EL methods will find more details
and many pertinent articles across the statistical literature.

Appendix
ELR (θ1 , θ2): Several results that are similar to those related to the evalua-
tions of ELR (θ).
One can show that the logarithm of the ELR test statistic for the null hypoth-
esis: H 0 : E(X 1 ) = θ1 and E(X 12 ) = θ2 has the form


n
log ELR (θ1 , θ2 ) = log{1 + λ 1 (X i − θ1 )/n + λ 2 (X i 2 − θ2 )/n},
i=1

where λ 1 and λ 2 are Lagrange multipliers that satisfy the equations


n n

∑i=1
( X i − θ1 )
n + λ 1 ( X i − θ1 ) + λ 2 ( X i 2 − θ 2 )
= 0 and ∑
i=1
(X i 2 − θ 2 )
n + λ 1 ( X i − θ1 ) + λ 2 ( X i 2 − θ 2 )
= 0.

In this case, ∂log (ELR) / ∂θ1 = −λ 1 and ∂log ELR / ∂θ2 = −λ 2. At point
⎛θ = X =
∑ ∑ X i2 /n⎞⎟ , we have log {ELR (θ1 , θ2 )} = λ 1 = λ 2 = 0,
n n
⎜⎝ 1 X i /n , θ2 = X 2 =
i=1 i=1 ⎠
pi = 1/n and the following derivative values:
Empirical Likelihood 257

∑ ∑ ∑
n n n
n2 (X i2 − X 2 )2 n2 ( X i − X )2 n2 (X i − X )(X i2 − X 2 )
∂λ 1 ∂λ 2 ∂λ 1 ∂λ 2
= i=1
, = i=1
, = = i=1
.
∂θ1 Δ ∂θ2 Δ ∂θ2 ∂θ1 Δ

∂2 λ 1 1
∑ (X − X ){ ∂λ ∂λ

n n
= [2 (X − X ) + i
2 2 1
i
2
(X i 2 − X 2 )}2 (X i − X )(X i2 − X 2 )
∂θ12 nΔ ∂θ
i=1 ∂θ 1 1 i=1

− 2∑ (X − X ){ ∑
n ∂λ ∂λ n
(X − X ) +
i (X 1
i
2
i
2
− X 2 )}2 (X i 2 − X 2 )2 ],
∂θ
i=1 ∂θ 1 1 i=1

∂2 λ 2 1
∑ (X − X ){ ∂λ ∂λ
(X − X )} ∑ (X − X )(X
n n
= [2 (X − X ) + i
1
i
2
i
2 2 2
i
2
i − X2)
∂θ12 nΔ ∂θ i=1 ∂θ 1 1 i=1

− 2∑ (X − X ){ (X − X )} ∑ (X − X ) ],
n ∂λ ∂λ n
(X − X ) +
i
2 2 1
i
2
i
2 2 2
i
2
∂θ
i=1 ∂θ 1 1 i=1

∂2 λ 1 ∂2 λ 2 1

∂λ 1 ∂λ

n n
= = [2 (X i 2 − X 2 ){
(X i − X ) + 2 (X i 2 − X 2 )}2 (X i − X )(X i2 − X 2 )
∂θ2 2 ∂θ1 ∂θ2 nΔ ∂θ2 i=1 ∂θ2 i=1

∑ ∂λ ∂λ

n n
−2 (X i − X ){ 1 (X i − X ) + 2 (X i 2 − X 2 )}2 (X i 2 − X 2 )2 ],
i=1 ∂θ2 ∂θ2 i=1

∂2 λ 2 ∂2 λ 1 1

∂λ 1 ∂λ

n n
= = [2 (X i − X ){
(X i − X ) + 2 (X i 2 − X 2 )}2 (X i − X )(X i2 − X 2 )
∂θ1 2
∂θ1 ∂θ2 nΔ ∂θ1 i=1 ∂θ1 i=1

∑ ∂λ ∂λ

n n
−2 (X i 2 − X 2 ){ 1 (X i − X ) + 2 (X i 2 − X 2 )}2 (X i − X )2 ],
i=1 ∂θ1 ∂θ1 i=1

∑ ∑ ∑
n n n
where Δ = { (X i − X )(X i2 − X 2 )}2 − ( X i − X )2 (X i2 − X 2 )2.
i =1 i =1 i =1

Then, in a similar manner to the analysis shown in Section 10.3, one can
show that, under H 0 , 2 log ELR (θ1 , θ2 ) ~ χ 22 , as n → ∞.
11
Jackknife and Bootstrap Methods

11.1 Introduction
Jackknife and bootstrap methods and other “computationally intensive”
techniques of statistical inference have a long history dating back to the per-
mutation test introduced in the 1930s by R.A. Fisher. Ever since that time
there has been work towards developing efficient and practical nonparamet-
ric models that did not rely upon the classical normal-based model assump-
tions. In the 1940s Quenouille introduced the method of deleting “one
observation at a time” for reducing a bias of estimation. This method was
further developed by Tukey for standard error estimation and coined by him
as the “jackknife” method (Tukey, 1958). This early work was followed by
many variants up to the 1970s. The jackknife method was then extended to
what is now referred to as the bootstrap method. Even though there were
predecessors towards the development of the bootstrap method, e.g., see Harti-
gan (1969, 1971, 1975), it is generally agreed upon by statisticians that Efron’s
(1979) paper in which the term “bootstrap” was coined, in conjunction with
the development of high-speed computers, was a cornerstone event towards
popularizing this particular methodology. Following Efron’s paper there
was an explosion of research on the topic of bootstrap methods. At one time
nonparametric bootstrapping was thought to be a statistical method that
would solve almost all problems in an efficient easy-to-use nonparametric
manner. So why do we not have a PROC BOOTSTRAP in SAS, and why are
bootstrap methods not used more widely? One answer lies in the fact that
there is not a general approach to bootstrapping along the lines of fitting a
generalized linear model with normal error terms. Oftentimes the bootstrap
approach to solving a problem requires writing new pieces of sometimes-
difficult SAS code that only pertains to a specific problem. Another reason
for the bootstrap method’s failure to take off as a general approach is that it
sometimes does not work well in certain situations, and that the method for
correcting these deficient procedures are oftentimes difficult or tedious.
Our focus will be on problems where the simple bootstrap methods are
known to work well. Even though the bootstrap method is primarily
considered a nonparametric method, it can also be used in conjunction with

259
260 Statistics in the Health Sciences: Theory, Applications, and Computing

parametric models as well. We will touch on how to employ the parametric


bootstrap method as well.
Complicated problems in practice where the bootstrap method is a reason-
able approach are situations such as repeated measures analyses, where one
wishes to treat the correlation structure as a nuisance parameter, as opposed
to assuming something unreasonable, or the case where there are unequal
numbers of repeated measurements per subject.
The most common use of the bootstrap method is to approximate the sam-
pling distribution of a statistic such as the mean, median, regression slope,
correlation coefficient, and so on. Once the sampling distribution has been
approximated via the bootstrap method, estimation and inference involving
the given statistic follows in a straightforward manner. Note however that the
bootstrap does not provide exact answers. It provides approximate variance
estimates and approximate coverage probabilities for confidence intervals. As
with most statistical methods these approximations improve for increasing
sample sizes. The estimated probability distribution of a given statistic based
upon the bootstrap method is obtained by conditioning on the observed data-
set and replacing the population distribution function with its estimate in
some statistical function such as the expected value. The method may be car-
ried out using either a parametric or nonparametric estimate of the distribu-
tion function. For example, the parametric bootstrap distribution of the sample
mean x under normality assumptions is given simply by Fˆ ( x) = Φ( n ( x − x )/ s),
where Φ denotes the probit function, n is the sample size, and x and s are the
sample mean and sample standard deviation, respectively. The 95% paramet-
ric bootstrap percentile confidence interval is given simply by the 2.5th and
97.5th percentiles of Fˆ ( x) or simply x + s × 1.96 / n . This should look familiar
to anybody who has opened an introductory statistics book as the approxi-
mate normal theory confidence interval for the sample mean.
Oftentimes the calculations are not so straightforward or we wish to apply
the bootstrap method in a nonparametric fashion, that is, we don’t wish to con-
strain ourselves to a parametric functional form such as the normal distribu-
tion when specifying the distribution function F( x). In the majority of cases we
need to approximate the bootstrap method through the generation of boot-
strap replications of the statistic of interest. Using a resampling procedure we
will be able to approximate bootstrap estimates for quantities such as standard
errors, p-values, and confidence intervals for a complicated statistic without
relying on traditional approximations based upon asymptotic normality.
There are many well-known books on bootstrapping such as Efron and
Tibshirani (1994), Davison and Hinkley (1997), and Shao and Tu (1995) that
we can suggest to the reader who is interested in further details regarding
jackknife and bootstrap methods.
This chapter will outline the following topics: jackknife bias estimation,
jackknife variance estimation, confidence interval definition, approximate
confidence intervals, variance stabilization, bootstrap methods, nonpara-
metric simulation, resampling algorithms with SAS and R, bootstrap
Jackknife and Bootstrap Methods 261

confidence intervals, use of the Edgeworth expansion to illustrate the accu-


racy of bootstrap intervals, bootstrap-t percentile intervals, further refine-
ment of bootstrap confidence intervals, and bootstrap tilting (Sections 11.2
through 11.14, respectively).

11.2 Jackknife Bias Estimation


Quenouille (1949, 1956) developed a nonparametric estimate of the large
sample bias of a given statistic that has first order bias of 1/n. The methodol-
ogy Quenouille developed was coined the jackknife method by Tukey (1958).
In terms of demonstrating the method let X 1 , X 2 , … , X n be a set of iid obser-
vations from a distribution function F with a density function f ( x ; θ) and let
θˆ = θˆ (X 1 , X 2 , … , X n ) be an estimator of some population quantity θ = t( F ),
where t( F ) is what is referred to as a statistical functional, e.g., the expecta-
tion, quantile, etc. An estimate of t( F ) is given by θˆ = t( Fˆ ), where

n
Fˆ ( x) = I (X i ≤ x) / n is the empirical distribution function and I(.) denotes


i=1
the indicator function. For example, if t (F) = xdF is the expectation then
n

∫ ∑
ˆθ = t( Fˆ ) = xdFˆ ( x) =
i=1
X i / n is the sample mean (see also Section 1.6 in this
context).
The bias of the statistic is given by

bias = E[t( Fˆ ) − t( F )].

Quenouille derived an estimate of the bias by deleting one observation X i


and then recalculating θˆ = θ( Fˆ ), given as

()
θˆ ( i ) = t Fˆi = θˆ (X 1 , X 2 , … , X i − 1, , X i + 1, … , X n, )

along with the average θˆ (.) = ∑θˆ


i=1
(i) / n. Then the estimate of the bias is given

as (n − 1) (θˆ (.) − θˆ ) such that the bias-corrected jackknife estimator for θ is


given as

θ = n ⋅ θˆ − (n − 1)θˆ (.).

Note however, there is a statistical price to pay for such a correction in terms
of bias-variance trade-offs. This approach to bias correction is most useful
for statistics that have expectations of the form
262 Statistics in the Health Sciences: Theory, Applications, and Computing

() a ( F ) a2 ( F )
E θˆ = θ + 1
n
+ 2 + O(n−3 ),
n

when the constants a1 ( F ) and a2 ( F ) do not depend on n. Oftentimes expecta-


tions of maximum likelihood estimators have values of the form above, e.g.,
see Firth (1993) for a general description of the problem.
The mechanism for which the jackknife method works follows by noting
that

( ) a (F )
E θˆ (.) = θ + 1
a (F )
+ 2 2 + O(n−3 ).
n − 1 (n − 1)

Hence,

() a (F)
E θ = θ − 2
n(n − 1)
+ O(n−3 )

such that θ has bias of order O(n−2 ), whereas θ̂ has bias of order O(n−1 ). As an
example denote the population variance as

t (F) =
∫ (u − E (X)) dF(u)
2

with the corresponding moment estimator given as

∫ (u − X) ∑(X − X ) / n,
2
t( Fˆ ) = dFˆ (u) = i
2

i=1

where X = ∑ ∑(X − X) /(n (n − 1))


n 2
X i / n. Then an estimate of the bias is given as i
i=1
i=1
n

such that the jackknife bias-corrected estimator is given as θ = ∑(X i − X)


2
/(n − 1).
In this case θ is precisely unbiased. i=1

A variant of the delete one observation at a time approach towards bias


reduction is the grouped jackknife approach. Let n = gh, where g and h are
integers such that we can remove g distinct blocks of size h from X 1 , X 2 , … , X n
given as X 1* , X 2* , … , X h* then

θ = gθˆ − (g − 1) θˆ (). ,
Jackknife and Bootstrap Methods 263

⎛ ⎞
where θˆ (.) = ∑θˆ / ⎜⎝ nh ⎟⎠
i
i and ∑ denotes the summation over all possible
i
subsets. The blocked jackknife biased corrected estimator has smaller vari-
ance than the leave one out at a time classic estimators (Shao and Wu, 1989).

11.3 Jackknife Variance Estimation


Tukey (1958) put forth the idea that the jackknife method is a useful tool for
estimating the variance of θˆ = θ( Fˆ ) in the iid setting. The variance estimate

∑( )
n
2
 (θˆ ) =
for θ̂ takes the form Var θˆ − θˆ / n. In general, the variance estima-
(i) (.)
i=1
tor works for a class of statistics that are so-called smooth statistics and not
ˆ (θ ) ≠ 0,
so well in other instances, that is, in many instances lim n Var(θˆ ) − Var
n →∞
{( )}
where Var(X ) = E X − (EX) ( )
2 2
{ }
 (θˆ ) > Var(θˆ ). We outline the con-
. In fact, E Var
cept of smoothness in the bootstrap methods section below.
n

For a simple example, let θˆ = t( Fˆ ) = xdFˆ =


∫ ∑
X i / n be the sample mean;
i=1
n θˆ − X i ˆ θ − Xi
then θˆ ( i ) = , θ(.) = θˆ and θˆ ( i ) − θˆ (.) = . This yields
n−1 n−1
n

∑( )
 ˆ 1 2 S2
Var(θ) = X i − X = . There may in general be a temptation to
n (n − 1) i = 1 n

 (θˆ ) for a confidence interval about θ. However, intervals


use θˆ ± tn − 1, α/2 Var
of this type have notoriously poor coverage probabilities. We will illustrate
that the bootstrap approach described below provides a more suitable
approach for generating approximate confidence intervals.

11.4 Confidence Interval Definition


Let X denote a random variable and let a and b denote two positive real num-
bers. Then
P (a < X < b) = P (a < X and X < b)
⎛ bX ⎞
= P⎜ > b and X < b⎟
⎝ a ⎠
bX
= P(X < b < ).
a
264 Statistics in the Health Sciences: Theory, Applications, and Computing

The interval G(X) = (X, bX/a) is then a random interval and assumes the value G(x) whenever X assumes the value x. The interval G(X) contains the value b with a certain fixed probability. Now in this framework denote θ ∈ Θ ⊆ R. For 0 < α < 1 a function θ_L(X), not depending on θ, satisfying P(θ_L(X) ≤ θ) ≥ 1 − α, ∀θ, is called a lower confidence bound at level 1 − α, where X is a random vector of observations. A similar definition holds for the upper confidence bound θ_U(X). A family of subsets S(X) of Θ is said to constitute a family of confidence sets at confidence level 1 − α if P(θ ∈ S(X)) ≥ 1 − α, ∀θ ∈ Θ. A special case S(X) = (θ_L(X), θ_U(X)) is called a confidence interval provided P(θ_L(X) < θ < θ_U(X)) ≥ 1 − α, ∀θ.
A key towards generating a standard confidence interval used in applications is to develop a so-called pivot. For example, let X_1, X_2, …, X_n be iid observations from a density function f(x; θ) and denote the statistic T(X_1, X_2, …, X_n; θ) = T(X; θ). The goal in confidence interval generation is to start by choosing λ_1 and λ_2 such that the probability statement

P(λ_1 < T(X; θ) < λ_2) = 1 − α

may be written in the form discussed above and given as

P(θ_L(X) < θ < θ_U(X)) ≥ 1 − α.

This can be accomplished by finding a pivot within the form of the statistic T(X; θ) about the parameter θ. If the distribution function of T(X; θ) has a form that is independent of θ then T(X; θ) is called a pivot. For example, let X_1, X_2, …, X_n ~ N(θ, σ^2); then the pivot T(X; θ) = √n (x̄ − θ)σ^{-1} ~ N(0, 1), where x̄ = ∑_{i=1}^n X_i/n, that is, T(X; θ) has a distribution that is θ-free. Hence we can pivot about θ to arrive at the exact 1 − α confidence interval for θ with the classic form P_θ(x̄ − z_{1−α/2} σ/√n < θ < x̄ + z_{1−α/2} σ/√n) ≥ 1 − α.
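A minimal R sketch of this exact pivot-based z-interval (simulated data with a known σ, for illustration only):

# Exact z-interval from the pivot sqrt(n)(xbar - theta)/sigma ~ N(0, 1)
set.seed(7)
sigma <- 2; n <- 25; alpha <- 0.05
x <- rnorm(n, mean = 5, sd = sigma)      # sigma treated as known
xbar <- mean(x)
c(lower = xbar - qnorm(1 - alpha/2) * sigma / sqrt(n),
  upper = xbar + qnorm(1 - alpha/2) * sigma / sqrt(n))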

11.5 Approximate Confidence Intervals


When the distribution of the statistic T( X ; θ) is unknown or analytically
intractable oftentimes it can be approximated with large sample approxima-
tions given certain regularity conditions, e.g., the existence of moments. The
reason as to why the distribution of T( X ; θ) may be unknown is that we do

not wish to assume a form for a density function f ( x ; θ) or we simply do not


know what it may be based on experience.
A well-known approximation is the case when T(X; θ) takes the form of a sum of iid observations. Then we have in the general case the asymptotic approximation T(X; θ) ~ AN(θ, σ^2_T/n), where Var(T(X; θ)) = σ^2_T/n. The classic large sample confidence interval or z-interval then takes the form T(X; θ) ± z_{1−α/2} σ_T/√n. The general problem with this approximate confidence interval is that in general the nuisance parameter σ^2_T is unknown. Typically, some consistent estimator of σ^2_T is used in the confidence interval approximation. Now, however, unlike the classic case when X_1, X_2, …, X_n ~ N(θ, σ^2), where we can apply the t-distribution given that σ̂^2 is the sample variance to generate a precise confidence interval, the confidence interval T(X; θ) ± z_{1−α/2} σ̂_T/√n is an approximation of an approximation. In general, if T(X; θ) ~ AN(θ, σ^2_T/n) and σ̂^2_T →_P σ^2_T then T(X; θ) ± z_{1−α/2} σ̂_T/√n has coverage probability that converges to 1 − α as the sample size n → ∞. However, in many instances approximate confidence intervals of this type may have poor coverage in small finite samples and should be used with caution.
In certain instances these types of approximate intervals may be improved
upon using some common techniques such as variance stabilization trans-
formations, symmetry transformations (e.g., the log transform) or incorpo-
rating some functional knowledge about f ( x ; θ).

11.6 Variance Stabilization


When the large sample variance of the statistic T(X; θ) = θ̂ depends on θ, that is, θ̂ ~ AN(θ, h(θ)σ^2_T/n), where h(.) is a continuously differentiable function, then one may consider a function g such that g(θ̂) ~ AN(g(θ), cσ^2_T/n). For illustration purposes consider X_1, X_2, …, X_n to be iid data points from an exponential density function f(x; θ) = exp(−x/θ)/θ. Then we know E(X) = θ, Var(X) = θ^2, and x̄ ~ AN(θ, θ^2/n). One variant of a (1 − α) large sample confidence interval for θ would be x̄ ± z_{1−α/2} x̄/√n due to the variance being dependent on the parameter of interest. Alternatively, consider that log x̄ ~ AN(log θ, 1/n) with variance that does not depend on θ. Hence a 1 − α large sample confidence interval for log θ would be log x̄ ± z_{1−α/2} √(1/n). This interval can then be back-transformed to the original scale by exponentiating the lower and upper bounds.
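A minimal R sketch of this back-transformation idea for the exponential example (our own simulated data and code):

# Variance-stabilized (log-scale) CI for an exponential mean versus the direct interval
set.seed(42)
theta <- 3
x <- rexp(40, rate = 1/theta)
n <- length(x); z <- qnorm(0.975)

direct <- mean(x) + c(-1, 1) * z * mean(x) / sqrt(n)      # variance depends on theta
logged <- exp(log(mean(x)) + c(-1, 1) * z / sqrt(n))      # back-transformed log-scale CI

rbind(direct, logged)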

11.7 Bootstrap Methods


The term “bootstrap” was coined by Efron (1979). Hartigan (1969) first dis-
cussed what we now refer to as the bootstrap method. Bootstrap methods
are a very valuable inference tool if properly used. The primary usage of this
method is around bias estimation, variance estimation, and confidence inter-
val generation and inference. In some sense, bootstrap methods generalize
the jackknife described earlier. The bootstrap method is a resampling method
based on sampling from the original dataset with replacement. For small
samples we can enumerate all possible subsamples, which we will refer to as
the exact bootstrap method. For larger samples Monte Carlo approaches are
generally applied. The key idea is to extract “extra” information from the
data via sampling with replacement. An alternative approach termed per-
mutation methods will be described in a later section, which in its most basic
form is a sampling without replacement approach.
For the purpose of introducing the method we will start with the univariate standard case where we assume X_1, X_2, …, X_n is a set of iid observations from f(x; θ). The empirical distribution function F̂(x) = ∑_{i=1}^n I(X_i ≤ x)/n plays a key role in classic bootstrap methods. More formally, F̂(x) = ∑_{i=1}^n H(x − X_i)/n, where H(u) is the unit step function defined as

H(u) = 0 for u < 0, and H(u) = 1 for u ≥ 0.

Note that H(u) is in a sense also a discrete-valued distribution function. Hence, by convention the derivative of H(u), H′(u) = h(u), is a degenerate density function with point mass at u = 0. Note also that we will use the empirical quantile estimator as well to illustrate bootstrap methods and aid in resampling approaches. Let X_(1) < X_(2) < … < X_(n) denote the order statistics. Then the empirical quantile function estimator in the absolutely continuous case for F^{-1}(u) = Q(u) is given as Q̂(u) = X_([nu]+1), where [.] denotes the floor function. The alternative definition of the empirical distribution function in terms of H provides a more formal method for calculating quantities, e.g., ∫ g(x) dF̂(x), which if done analytically provides so-called exact bootstrap solutions, where g(x) denotes a generic function corresponding to the population quantity of interest, e.g., g(x) = x corresponds to the calculation of E(X).
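A minimal R sketch of the empirical distribution and quantile function estimators just described (the helpers Fhat and Qhat are our own names, for illustration):

set.seed(11)
x <- rexp(15)
n <- length(x)

Fhat <- function(t) mean(x <= t)                # F-hat(t) = sum I(X_i <= t)/n
Qhat <- function(u) sort(x)[floor(n * u) + 1]   # Q-hat(u) = X_([nu]+1), 0 < u < 1

Fhat(1)       # proportion of observations at or below 1
Qhat(0.5)     # empirical quantile under this convention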
The heart of bootstrap methods revolves around understanding statistical
functionals. Notationally, a statistical functional has the form θ = t( F ) such that
the estimator has the form θˆ = t( Fˆ ). The function θ = t( F ) is central to nonpara-
metric estimation. The key assumption that we will operate under is that some
characteristic of F is defined by θ. Note that F and θ may be multidimensional.

In general, bootstrap methods work well in terms of the performance character-


istics when the statistical functional of interest has the form θ = t( F ). In other
scenarios caution must be taken when using bootstrap methods, e.g., estimat-
ing properties of a threshold parameter. As a general rule-of-thumb if θˆ = t( Fˆ )
can be assumed to be asymptotically normal and consistent then the bootstrap
methods that we will outline will work well.
As a straightforward example of the relationship between a statistical functional and its corresponding estimator, consider the statistical functional θ = t(F) = ∫ x dF, which is the expectation. Then the estimator is given by “plugging-in” F̂ for F and given as

θ̂ = t(F̂) = ∫ x dF̂(x) = ∫ x d(∑_{i=1}^n H(x − X_i)/n) = (1/n) ∑_{i=1}^n ∫ x dH(x − X_i) = (1/n) ∑_{i=1}^n X_i = X̄,

that is, the sample average. Oftentimes in terms of direct calculations the quantile form of the statistical functional is worth considering, e.g., for the average we have

θ = t(F) = ∫ x dF = ∫_0^1 Q(u) du,

and hence θ̂ = ∫_0^1 Q̂(u) du = ∑_{i=1}^n ∫_{(i−1)/n}^{i/n} Q̂(u) du = (1/n) ∑_{i=1}^n X_(i) = X̄, where Q̂(u) = X_([nu]+1) and [.] denotes the floor function.

11.8 Nonparametric Simulation


In many scenarios θ = t (F) can be estimated directly via the plug-in method
()
by θˆ = t Fˆ or via Monte Carlo simulation. The reason one may wish to use

simulation is that direct calculations may be complex. Many quantities of interest around the statistical functional, such as confidence intervals, are difficult to calculate directly. For example, say we are interested in the statistical functional pertaining to the αth quantile of X. In this case

θ = t(F) = arg min_θ E[|X − θ| + (2α − 1)(X − θ)],

where α denotes the quantile level of X, 0 < α < 1. This quantity can be calculated directly for small samples, but becomes computationally and analytically infeasible for even moderate sample sizes. Or as another complex example, say we were interested in the statistical functional t(F) = Var(S/x̄), where S is the sample standard deviation. In this case direct calculation is not possible. However, a simulation approach based on sampling with replacement provides a powerful tool for calculating such quantities.
Let us take a simple example of estimating the variance for the coefficient of variation. In this case θ = t(F) = 100 × √Var(X)/E(X) such that θ̂ = t(F̂) = 100 × S/X̄. Deriving the estimator Var̂(100 × S/X̄) of the variance of the coefficient of variation is nontrivial even when the parametric form of the density is assumed known. Suppose we observe a sample of size n = 10 with observed data X = {0.25, 0.40, 0.46, 0.27, 2.51, 1.24, 4.11, 6.11, 0.46, 1.35}. The estimate in this case is θ̂ = t(F̂) = 114.6. For the nonparametric bootstrap approach we sample with replacement from F̂ in order to form bootstrap replicates. One single randomly sampled bootstrap replicate in this example was x* = {0.25, 0.40, 0.46, 0.27, 1.29, 1.29, 4.11, 6.11, 0.25, 1.37} such that the bootstrap replicate statistic was θ̂* = t(F̂*) = 124.29, where F̂* is the empirical distribution function based on the resampled values. If we repeat this process, say with B = 10,000 bootstrap replications, we obtain the approximate distribution for θ̂ = t(F̂) = 100 × S/X̄ illustrated in Figure 11.1. Similar to the jackknife, a nonparametric estimate of the variance of θ̂ = t(F̂) is

Var̂(t(F̂)) ≈ (1/(B − 1)) ∑_{i=1}^B (t_i(F̂*) − t̄(F̂*))^2,

where t̄(F̂*) = (1/B) ∑_{i=1}^B t_i(F̂*). For our example, Var̂(t(F̂)) = 514.6. Note that there are two sources of variance in the approximation, Monte Carlo error and stochastic error. The Monte Carlo error goes to 0 as B → ∞. We will demonstrate that this resampling approach can be extended to generating approximate bootstrap confidence intervals for θ = t(F).
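A minimal R sketch of this Monte Carlo calculation (our own code; results will differ slightly from the values quoted above because of Monte Carlo variation and because sd() uses the n − 1 divisor):

# Bootstrap estimate of the variance of the coefficient of variation
x <- c(0.25, 0.40, 0.46, 0.27, 2.51, 1.24, 4.11, 6.11, 0.46, 1.35)
cv <- function(z) 100 * sd(z) / mean(z)          # t(F-hat) = 100 * S / xbar

set.seed(2023)
B <- 10000
cv.star <- replicate(B, cv(sample(x, replace = TRUE)))   # bootstrap replicates

cv(x)          # point estimate of the CV
var(cv.star)   # bootstrap variance estimate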
The general steps one needs to follow in order to utilize the power of clas-
sic bootstrap methods are as follows:

(1) Have a clear definition of θ = t(F).
(2) Ensure that the estimator for θ = t(F) has the form θ̂ = t(F̂).
(3) Generate B bootstrap samples of size n from F̂ with replacement, denoted X_1*, X_2*, …, X_n*.
(4) Calculate bootstrap replicates of the statistic t_1(F̂*), t_2(F̂*), …, t_B(F̂*).
(5) Utilize t_1(F̂*), t_2(F̂*), …, t_B(F̂*) to approximate the distribution of θ̂ = t(F̂) in order to calculate the quantity of interest, e.g., an estimate of the variance of θ̂ = t(F̂).

FIGURE 11.1
B = 10,000 bootstrap replications for the coefficient of variation example (histogram of the bootstrap replicates: percent versus the coefficient of variation).

11.9 Resampling Algorithms with SAS and R


The majority of nonparametric bootstrap procedures discussed from this point
forward rely upon the ability to sample from the data with replacement. In this
chapter we will discuss the pieces of code needed to carry out the simple boot-
strap in the univariate setting. When we get to more complicated repeated mea-
sures and cluster-related datasets later in the book we will use the code provided
below as a jumping off point. The basic idea is that once you understand the
syntax provided in this chapter you should be able to easily modify it for most
bootstrap-related problems that would be encountered in practice.
It turns out that resampling from the data with replacement can be accom-
plished relatively easily through the generation of a vector or set of multino-
mial random variables. The algorithms outlined below will be the “front-end”
for most of the bootstrap procedures we intend to discuss. The key to pro-
gramming the nonparametric bootstrap procedure is to consider carefully
how data is managed internally within SAS. The point of the procedures
outlined below is to avoid having to create a multitude of arrays or special-
ized macros, and to try to maximize efficiency, thus making it easier for the
novice to carry out the bootstrap procedures. In other instances we will be
able to take advantage of PROC SURVEYSELECT, which is an already built-
in procedure that allows us to resample the data with replacement. Note
however, for certain problems one may not be able to avoid using macros or
arrays. We will deal with a few specialized cases later in the book.
Let us start with an example dataset X = {0.5, 0.4, 0.6, 0.2} of size n = 4. If we sample from this dataset once with replacement, we might obtain a bootstrap replicate denoted x* = {0.5, 0.5, 0.4, 0.2}. Note that this is equivalent to keeping track of the set of multinomial counts c = {2, 1, 0, 1} corresponding to the original dataset x as listed and outputting each observation X_i the corresponding c_i number of times, where X_i denotes the ith observation and c_i denotes the corresponding count, e.g., X_2 = 0.4 and c_2 = 1. In other words, sampling with replacement is exactly equivalent to generating a set of random multinomial counts with marginal probabilities 1/n, and outputting each data value the corresponding number of times. One way to do this in SAS easily
and efficiently is to generate random binomial variables conditioned on the
margins as described in Davis (1993) and given by the following formula:


C_i ~ B( n − ∑_{j=0}^{i−1} C_j , 1/(n − i + 1) ),     (11.1)

for i = 1, …, n − 1, where C_0 = p_0 = 0, C_n = n − ∑_{j=0}^{n−1} C_j, and B(n, p) denotes a
standard binomial distribution. Random binomial variables may be gener-
ated using the RANBIN function in SAS. Even though this formula may look

intimidating at first glance, it is relatively straightforward to program. Again,


note that we are sticking with the convention that a capital “C” represents a
random quantity, and a lowercase “c” represents an observed value. There-
fore, in bootstrap parlance, the data X is considered fixed or observed for a
given bootstrap resampling procedure, and the multinomial counts C vary
randomly every bootstrap replication. The theoretical properties of boot-
strap estimators can be examined fairly straightforward within this frame-
work by noting that the multinomial counts C are independent of the data X.
For our simple example of n = 4 observations we would have for a single
bootstrap resample:

(1) C1 is a random binomial variable B(4,1/ 4).


(2) C2 is a random binomial variable B(4 − C1 ,1/ 3).
(3) C3 is a random binomial variable B(4 − C1 − C2 ,1/ 2).
(4) C4 = 4 − C1 − C2 − C3 (a minimal R translation of these steps is sketched below).
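In the following R sketch, rbinom plays the role of SAS's RANBIN; the helper rboot.counts is our own and is not one of the text's programs:

# Conditional binomial generation of multinomial resampling counts, as in Equation (11.1)
rboot.counts <- function(n) {
  counts <- integer(n)
  remaining <- n
  for (i in seq_len(n - 1)) {
    counts[i] <- if (remaining > 0) rbinom(1, remaining, 1 / (n - i + 1)) else 0
    remaining <- remaining - counts[i]
  }
  counts[n] <- remaining          # C_n = n minus the earlier counts
  counts
}

set.seed(5)
x <- c(0.5, 0.4, 0.6, 0.2)
cnt <- rboot.counts(length(x))
cnt                      # multinomial counts summing to n = 4
rep(x, times = cnt)      # the corresponding bootstrap resample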

We would then want to output the value X_i a total of C_i times for one bootstrap resample. Note that if n − ∑_{j=0}^{i−1} C_j reaches 0 then we stop; all subsequent C_i's should be set to “0”. The algorithm is easily accomplished in SAS a multiple number of times in a variety of ways. For the first example let us focus on the
most straightforward approach. Let us generate counts for X={0.5,0.4,0.6,0.2},
B = 3 bootstrap replications without doing anything fancy. We need to gener-
ate random binomial variables using the function RANBIN(seed, n, p), where
n and p are the parameters of the binomial distribution. The seed dictates the
random number stream; a “0” produces a different random number stream
each successive run of the program. Note that there are practical reasons for
choosing a nonzero seed, such as the need to re-create a given analyses. In
this case one may choose a positive integer such as 123453. Seeds less than
“0” may also be employed. We refer the reader to the SAS technical docu-
ments for further details (see Section 1.11).
In order to understand the basic resampling code written below, one
must be familiar with the SORT command, in conjunction with the FIRST
and LAST command. In addition, knowledge of the RETAIN command is
needed. The function RANBIN will be used to generate random binomial
variables.

data original;
n=4; /*set the sample size*/
input x_i @@;
do b=1 to 3; /*output the data B times*/
output;
end;
cards;
0.5 0.4 0.6 0.2;
proc sort;by b;

This first piece of code in data original just outputs the same data-
set three times in a row. We then need to sort the values by b. Now all
we do in the next set of code is carry out the generation of multinomial
random variables sequentially via a marginal random binomial number
generator.

/*SAS PROGRAM 11.1*/


data resample;
set original;by b;
retain sumc i; /*need to add the previous
count to the total */
if first.b then do;
sumc=0;
i=1; /*set counters*/
end;
p=1/(n-i+1); /*p is the probability of a "success"*/

if ^last.b then do;


if n>sumc then c_i=ranbin(0,n-sumc,p); /*generate the binomial variate*/
else c_i=0;
sumc=sumc+c_i;
end;
i=i+1;
if last.b then c_i=n-sumc;
proc print;
var b c_i x_i;

The statement “else c_i=0” is just a shortcut needed for the cases where we hit the limit of the constraint that n − ∑_{j=0}^{i−1} C_j must be positive. All the other statements are needed for counting purposes and updating the marginal binomial probabilities.
Basically what we have done is keep track of n − ∑_{j=0}^{i−1} C_j and 1/(n − i + 1) for B = 1, and then we repeated the resampling scheme for B = 2, and for B = 3. The output for a given run is below.

OBS B C_I X_I


1 1 3 0.5
2 1 0 0.4
3 1 0 0.6
4 1 1 0.2
5 2 2 0.5
6 2 0 0.4
7 2 0 0.6
8 2 2 0.2
9 3 1 0.5
10 3 0 0.4
11 3 0 0.6
12 3 3 0.2

What this program above does is basically the nuts and bolts of most uni-
variate bootstrapping procedures. It is easily modified for the multivariate
procedures. Note that if there are missing values they should be eliminated
within the first data step.
As an example consider how the program would work for a simple statistic such as the sample mean X̄ = ∑_{i=1}^n X_i/n. A bootstrap replication of the mean would simply be x̄* = ∑_{i=1}^n C_i X_i/n, or for our example x̄* = (3 × 0.5 + 1 × 0.2)/4 = 0.425, or for those of you who prefer matrix notation, X̄* = XC′/n. To carry out this process in SAS simply add the following
code to the end of SAS PROGRAM 11.1:

proc means;by b;
var x_i;
weight c_i;
output out=bootstat mean=xbarstar;

For the sample mean we can take advantage of the WEIGHT statement
built in to PROC MEANS. Since the sample size of n = 4 is small we would
basically only need to increase the value of B up to say B = 100 in this exam-
ple in order to get a very accurate approximation of the distribution of the
sample mean for this dataset. Guidelines for the number of resamples will be
discussed later. Through the use of the WEIGHT statement PROC MEANS is “automatically” calculating x̄* = ∑_{i=1}^n C_i X_i/n for each value of B via the BY statement. In a variety of problems from regression to generalized linear models we will utilize the fact that the WEIGHT utility is built in to a number of SAS procedures. From a statistical theory standpoint, in the case of the sample mean one can easily prove that if E(X) = μ then E(X̄*) = μ as well. Note that

E(X̄*) = E(∑_{i=1}^n C_i X_i/n) = ∑_{i=1}^n E(C_i)E(X_i)/n = μ,

given E(C_i) = 1, that is, remarkably all bootstrap resampled estimates of the mean are unbiased estimates of μ.
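The same weighting idea is easy to check directly in R (a sketch of our own, using rmultinom to generate the counts in one step):

# Bootstrap means computed from multinomial counts used as weights
x <- c(0.5, 0.4, 0.6, 0.2)
n <- length(x)

set.seed(9)
B <- 100
cnts <- rmultinom(B, size = n, prob = rep(1/n, n))   # each column is one resample's counts
xbar.star <- colSums(cnts * x) / n                   # weighted means, one per resample

mean(xbar.star)   # close to mean(x), in line with E(Xbar*) = mu
mean(x)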
Note that the number of bootstrap replications is typically recommended
to be B = 1000 or higher for moderately sized datasets, e.g., see Booth and
Sarkar (1998). However, if you can feasibly carry out a higher number of rep-
lications the precision of the bootstrap procedure can only increase, that is,
from a theoretical statistical point of view you can never take too many boot-
strap replications. From a practical point of view you may stretch the mem-
ory capacity and speed of your given computer. In general, one should shoot

for B = 1000. If a value for B >1000 can be practically used then by all means
use it. Obviously, the memory requirements for the above algorithm can
expand rapidly for moderately sized datasets and values of B around 1000.
The program (SAS PROGRAM 11.1) will run very rapidly, however it is a
“space hog” with respect to disk space.
In some instances we can keep the code even more simple by replacing the
data-step resample in the previous approach with PROC SURVEYSELECT as
seen in the following example.

/* SAS PROGRAM 11.2 */


proc surveyselect data=original
method=urs sampsize=4 out=resample;
by b;

proc print noobs;


var b numberhits x_i;

data restar;set resample;


do ii=1 to numberhits;
output;
end;

The R code is as follows:

> data <- c(0.5, 0.4, 0.6, 0.2)
> resamples <- lapply(1:3, function(i)
+   sample(data, replace = T))
> b1 <- c(4, mean(resamples[[1]]), sd(resamples[[1]]), min(resamples[[1]]), max(resamples[[1]]))
> names(b1) <- c("N", "Mean", "Std Dev", "Minimum", "Maximum")
> b2 <- c(4, mean(resamples[[2]]), sd(resamples[[2]]), min(resamples[[2]]), max(resamples[[2]]))
> names(b2) <- c("N", "Mean", "Std Dev", "Minimum", "Maximum")
> b3 <- c(4, mean(resamples[[3]]), sd(resamples[[3]]), min(resamples[[3]]), max(resamples[[3]]))
> names(b3) <- c("N", "Mean", "Std Dev", "Minimum", "Maximum")

The method=urs implies unrestricted random sampling with replacement


and needs to be specified. Also, note that the sampsize=4 corresponds to the
sample size of n = 4 for this example and needs to be modified manually for
different sized datasets. Finally, the output from PROC SURVEYSELECT for
this example will look as follows:

Number
b Hits x_i
1 2 0.5
1 2 0.4
2 1 0.4
2 1 0.6
2 2 0.2
3 1 0.5
3 1 0.4
3 2 0.2

Therefore, we additionally need to output each observation in the dataset resample a total of numberhits times into the dataset restar. The data from restar
is what would then be used in order to calculate the resampled statistic of
interest. This approach is slightly more inefficient than the straight
generation of conditional multinomial counts but may be somewhat more
straightforward for the novice user. The multinomial approach is what
would be recommended if we were carrying out the bootstrap method
within PROC IML.
As an alternative to outputting the dataset B number of times in the
input dataset we can use SAS macros, which will carry out the same task
as before albeit somewhat slower. Note however macros tend to be more
efficient in terms of space requirements. You do not need to be an expert
in macro programming in order to modify your own programs. The same
basic macro language concepts will be used throughout the book without
too much modification. We refer the reader to SAS® 9.3 Macro Language
(SAS Institute Inc., 2011) Reference for further details. From our point of
view a macro processor basically simplifies repetitive data entry and data
manipulation tasks. The macro facility makes it possible to define complex
input and output “subroutines.” However with respect to bootstrapping
we need to know only a few basic concepts:

1. All macros have a user-defined macro name, a beginning and an


end.
2. Variables can be passed to macros upon the macro being called.
3. It is possible to repeat a data step over and over again via looping
within a macro.
4. Macro variables can be passed to data steps and certain PROCs.

The key macro statements that we will use in the bootstrapping program
below and throughout the book consist of the following:

1. %macro multiwt(brep); The beginning of the macro is defined by


the %macro statement. We have chosen to name the macro multiwt
and pass it the variable named brep, which represents the number of
bootstrap replications that we have chosen. For this example the
number of bootstrap replications B = 3.
2. %do bsim=1 %to &brep; The beginning of the macro do loop is
given by the %do command. The macro variable &brep is given at
the macro call. The variable bsim is the looping index variable.
3. b=&bsim; The macro variable &bsim is passed to the data step in
order to indicate the current bootstrap resample. An ampersand
indicates to the compiler that bsim is a macro variable.
4. %end; The %end closes the macro do loop.

5. %mend; The %mend command signifies the end of the macro mul-
tiwt.
6. %multiwt(3); The command %multiwt(3) is the macro call which
passes the value of 3 to the macro variable brep.

Based on these definitions we developed a macro version of the bootstrap


program contained earlier in this chapter. In addition, we “automatically”
determine the value for the sample size n in this program via the NOBS com-
mand. The macro variable brep is used to represent the number of bootstrap
replications B. All that has basically changed as compared to the previous
code is that we are generating the c’s one macro iteration at a time. Later on
we will illustrate what additional calculations will be carried out within the
macro multiwt and what calculations will be carried out outside the scope of
the macro.

/* SAS PROGRAM 11.3 */


data original;
input x_i @@;
cards;
0.5 0.4 0.6 0.2;

data sampsize;
set original nobs=nobs; /*automatically obtain the sample
size*/
n=nobs;

%macro multiwt(brep);
%do bsim=1 %to &brep;

data resample;
set sampsize;
retain sumc i; /*need to add the previous
count to the total */
b=&bsim;
if _n_=1 then do;
sumc=0;
i=1;
end;
p=1/(n-i+1); /*p is the probability of a "success"*/

if _n_<n then do;


if n>sumc then c_i=ranbin(0,n-sumc,p); /*generate the binomial variate*/
else c_i=0;
sumc=sumc+c_i;
end;
i=i+1;
if _n_=n then c_i=n-sumc;

proc append data=resample out=total;



%end;
%mend;
%multiwt(3);

proc print data=total;


var b c_i x_i;

The same program may be rewritten using PROC SURVEYSELECT. Note


that we had to create a variable called dummy in order to automatically input
the sample size into PROC SURVEYSELECT. We also no longer have the abil-
ity to weight observations. As noted earlier this may be slightly more ineffi-
cient than the previous code.

/* SAS PROGRAM 11.4 */


options ls=65;
data original;
input x_i @@;
dummy=1; /*need a dummy variable for surveyselect*/
cards;
0.5 0.4 0.6 0.2
;

data sampsize;
set original nobs=nobs;
_nsize_=nobs; /*control variable for surveyselect*/
if _n_=1 then output;

%macro multiwt(brep);

%do bsim=1 %to &brep;

proc surveyselect data=original noprint


method=urs sampsize=sampsize out=resample;
strata dummy;

data restar;set resample;


b=&bsim;
do ii=1 to numberhits;
output;
end;

proc append data=restar out=total;

%end;
%mend;
%multiwt(3);

proc print;
var b x_i;

Oftentimes for simple univariate samples it is much more efficient in


terms of storage issues to use PROC IML in order to carry out the bootstrap

resampling methodology. From SAS/IML User’s Guide we have the following


description: “(SAS/IML) is a multilevel, interactive programming language
with dynamic capabilities. You can use commands to tune in to the finest
detail or to reach out to grand operations that can process thousands of val-
ues.” With respect to bootstrap methods, PROC IML is very useful through
the efficient use of its array structure. We can use IML in conjunction with
other PROCs, macros, or as a stand-alone programming language. Carrying
out the multiwt macro from above in similar fashion using PROC IML con-
sists of the following code.

/* SAS PROGRAM 11.5 */


data original;
input x_i @@;
cards;
0.5 0.4 0.6 0.2;

proc iml;
use original;
read all into data;

/*Create a n by 1 vector of data original


called data*/

n=nrow(data); /*Calculate the sample size*/


bootdata=1:n; /*Array to hold one bootstrap resample*/

brep=3; /*Set the number of bootstrap resamples*/


do i=1 to brep;
do j=1 to n;
index=int(ranuni(0)*n)+1;
bootdata[j]=data[index];
end;
print i bootdata;
end;

quit;

/*********Listing File******************/
I BOOTDATA
1 0.5 0.4 0.5 0.5
I BOOTDATA
2 0.4 0.5 0.4 0.5
I BOOTDATA
3 0.6 0.5 0.4 0.2

The USE and READ statements simply retrieve SAS datasets and import
them into PROC IML. Alternatively, data may be entered directly in PROC
IML with a statement such as

data={0.5, 0.4, 0.6, 0.2};



There are infinitely many possibilities with respect to utilizing this


basic IML code. We may want to carry out some calculations within IML
and/or outside IML. We will modify the code below for specific cases
throughout the book. The main concept is the generation of a random
indexing function labeled index in the IML code. Basically we are gener-
ating a random integer from 1 to n. This allows us to easily resample from
the data original using a simple array structure. We see the results for
three successive bootstrap resamples with x ∗1 = {0.5, 0.4, 0.5, 0.5},
x ∗2 = {0.4, 0.5, 0.5, 0.5}, and x ∗3 = {0.6, 0.5, 0.4, 0.2}. The only real disadvantage
of utilizing PROC IML is that it requires an additional SAS programming
skill set.

11.10 Bootstrap Confidence Intervals


Recall from the previous section that the goal in confidence interval estimation is to find λ_1 and λ_2 such that P(λ_1 < T(X; θ) < λ_2) = 1 − α given T(X; θ) is monotone in θ. In terms of statistical functionals we can write T(X; θ) = t(F̂; θ) − t(F; θ), e.g., X̄ − E(X), where we are essentially centering the statistic of interest about the parameter of interest to form a pivot. If the distribution function for t(F̂; θ) is known we could find a t_α such that P(t(F̂; θ) − t(F; θ) ≤ t_α) = α. Then a one-sided confidence interval for t(F; θ) is (−∞, t(F̂; θ) − t_α). In the nonparametric bootstrap setting we do not assume a form for the distribution function of t(F̂; θ). In this case we approximate the confidence interval machinery by replacing t(F̂; θ) − t(F; θ) with the bootstrap approximation t(F̂*; θ) − t(F̂; θ), and t_α with t*_α estimated from bootstrap resampling. The pivotal one-sided interval is then (−∞, t(F̂; θ) − t*_α), where t*_α is the αth quantile of t(F̂*; θ) − t(F̂; θ) obtained from B bootstrap resamples. As shown in Shao and Tu (1995) we then have

P(t(F̂; θ) − t*_α > t(F; θ)) = 1 − α + O(n^{-1/2}).

The result can be proven via an Edgeworth expansion technique. Note that the nuisance parameter about scale is “built in” to the bootstrap resampling process. The two-sided 100 × (1 − α)% confidence interval becomes (t(F̂; θ) − t*_{1−α/2}, t(F̂; θ) − t*_{α/2}).
Returning to our coefficient of variation example with α = 0.10 we arrive at t*_{α/2} = 71.6 − 114.6 and t*_{1−α/2} = 146.3 − 114.6, such that the two-sided confidence interval is (114.6 − (146.3 − 114.6), 114.6 − (71.6 − 114.6)) = (82.9, 157.6). For moderate to large sample sizes this approximation works well. Also, unlike traditional confidence intervals we are not constrained that the interval is symmetric about the test statistic of interest.
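A minimal R sketch of the pivotal and percentile intervals for this example (our own code; the endpoints will differ somewhat from the values above due to Monte Carlo variation and the divisor used in sd()):

x <- c(0.25, 0.40, 0.46, 0.27, 2.51, 1.24, 4.11, 6.11, 0.46, 1.35)
cv <- function(z) 100 * sd(z) / mean(z)

set.seed(101)
B <- 10000; alpha <- 0.10
cv.star <- replicate(B, cv(sample(x, replace = TRUE)))
q <- quantile(cv.star - cv(x), c(alpha/2, 1 - alpha/2))   # quantiles of t(F*) - t(F-hat)

c(lower = cv(x) - q[2], upper = cv(x) - q[1])    # pivotal interval
quantile(cv.star, c(alpha/2, 1 - alpha/2))       # percentile interval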
Improvements and modifications can be made to this approach based on various additional approximations that we will touch on. Under symmetry assumptions regarding the sampling distribution of t(F̂; θ), which hold approximately when we can assume that t(F̂; θ) has an asymptotic normal distribution, we can simplify the calculations above to what is known as the percentile confidence interval, given as (b*_{α/2}, b*_{1−α/2}), where b*_α is the αth quantile from the resampling distribution of t(F̂*; θ). The percentile interval is probably the most used technique in general practice and is given as (71.6, 146.3) for our coefficient of variation example.
The general lemma around percentile intervals is as follows. Suppose (t(F̂; θ) − t(F; θ))/σ̂_n →_d T and (t(F̂*; θ) − t(F̂; θ))/σ̂*_n →_d T, where Var(t(F̂; θ)) = σ^2_n. Then the random variable T does not need to follow a normal distribution, but does need to be symmetric. The percentile interval is then asymptotically correct at level 1 − α. As a general rule, if t(F̂; θ) has an asymptotic normal distribution then bootstrap methods fit the general lemma and will be correct in the asymptotic sense.
mined if we as can assume an expansion of t F̂ of the form ()
()
n


1
t Fˆ = t (F) + φ F (X i ) + Rn. One can examine the quadratic term in the
n i=1

()
n


1
expansion of t F̂ . If the quadratic term of φ F (X i ) does not “dominate”
n i=1
the series typically of the order o(n−1 ) the statistic is “smooth.” There is no set

definition of smoothness. A von Mises type expansion having the form


()
n

∑ t ' (H ( x − X( i ) ) − F) + Rn would be one expansion for which to


1
t Fˆ = t (F) +
n i=1
examine smoothness when the derivative t' exists. The sample mean is an
example of a smooth statistics whereas the sample median based on a single
order statistic is not.
Let us illustrate the bootstrap percentile confidence interval method
using CD4 counts obtained from newborns infected with HIV, e.g., see
Sleasman et al. (1999) for a detailed description. Suppose that we are inter-
ested in calculating the 95% percentile confidence interval for the CD4
count data using a less standard measure, but more robust measure of
location, say the α-trimmed mean statistic. The trimmed mean statistic is

defined so as to “slice off” α × 100 percent of the extreme observations from


both tails of the empirical distribution and takes the form of Equation (11.2)
below. For symmetric distributions this statistic has the same expectation
as the sample mean. For asymmetric distributions the trimmed mean will
be less influenced by observations in the tail, that is, in general its expecta-
tion will be to the right or left of that of the sample mean in a direction
opposite that of the tail. The median is a specific case of the trimmed mean
as α → 1/ 2 in the limit of the general statistic. One may calculate the
trimmed mean within SAS’s PROC UNIVARIATE using the TRIMMED
command along with its approximate 95% confidence interval under the
assumption that the data are sampled from a symmetric population. The
actual trimmed mean statistic is given as


t(F̂) = X̄_α = (1/(n − 2k)) ∑_{i=k+1}^{n−k} X_{(i)},     (11.2)

where X( i ) denotes the ith ordered observation, α denotes the trimming pro-
portion and k = [nα], and the function [.] again denotes the floor function.
Suppose we specify for our continuing example the trimming proportion of
α = 0.05 with a sample size of n = 44. That implies that k = [44 × 0.05] = 2
observations are trimmed from each tail. Given that this trimming strategy
seems very straightforward we need to step back and ask: what is the popu-
lation parameter that we are estimating? It is very important to have an
understanding of what you are estimating prior to carrying out any boot-
strap estimation procedure. The α-trimmed mean statistic is an estimator of
the population trimmed mean:

t(F) = E_α(X) = (1/(1 − 2α)) ∫_{F^{-1}(α)}^{F^{-1}(1−α)} x dF(x),

that is, the population parameter in this instance is Eα (X ), not the expected
value of X, E(X ). For the specific case where the distribution is symmetric
about the mean Eα (X ) = E(X ). This is not the case for asymmetric distribu-
tions. Therefore, in this case our goal is to obtain a 95% confidence interval
for Eα (X ).
Within SAS PROGRAM 11.6 corresponding to our example the CD4 count
data from n = 44 subjects is read into data original. As an alternative to
what is provided in PROC UNIVARIATE we can calculate the bootstrap per-
centile confidence interval nonparametrically and compare the results. Note
that the “guts” of the sample program is the same as what we have been
utilizing throughout for simple bootstrapping. Within the simple bootstrap
algorithm we can calculate the trimmed mean in PROC UNIVARIATE using

the TRIMMED command and output the value via SAS’s ODS system to the
dataset tr, that is, there is no real additional programming to be done other
than to use the already built-in calculations within PROC UNIVARIATE.
The lower and upper bootstrap percentiles are then calculated and output
via a separate run of PROC UNIVARIATE using the command pctlpts=2.5
97.5 pctlpre=p.
For our example dataset the trimmed mean and its standard deviation
(of the trimmed mean) turned out to be x.05 = 304.9 ± 76.8 as compared to
the sample mean x = 377.7 ± 74.3. The difference between the two statis-
tics again illustrates the asymmetry in the data, and reinforces the fact
that the built-in confidence interval procedure should probably be
avoided. For this example the 95% semi-parametric confidence interval
for the trimmed mean, as calculated in PROC UNIVARIATE under a symmetry assumption, turned out to be (149.3, 460.6). For B = 10,000 resamples the indices for the percentile interval are λ_1 = 250 and λ_2 = 9750. Therefore, the 95% bootstrap interval would be the 250th and 9750th ordered resampled values, or namely (T*_{(250)}(F_n(X)), T*_{(9750)}(F_n(X))). After
B = 10000 resamples the 95% bootstrap percentile interval was calculated
as (181.8, 466.0). Also note that the percentile interval is not constrained to
be symmetric about the statistic x.05 = 304.9 , thus making it much more
flexible with respect to underlying assumptions. Given the moderately
large sample size of n = 44 and the smoothness of the trimmed mean sta-
tistic we should feel fairly confident about the coverage accuracy of the
bootstrap percentile confidence interval for this example. We will compare this
interval with other more complicated bootstrap methods given later on.
At the minimum the percentile method is useful for examining the
assumptions of parametric or semi-parametric methods in terms of their
validity. As noted earlier, in general the bootstrap-t method is what we
would generally recommend. It will be described in the next section.
Below is the basic SAS program used to carry out the trimmed mean per-
centile interval calculations for our example.

/* SAS PROGRAM 11.6*/


data original;
label x_i='CD4 counts';
input x_i @@;
cards;
1397 471 64 1571 663 1128 719 407
480 1147 362 2022 10 175 494 31
202 751 30 27 118 8 181 1432 31 18
105 23 8 70 12 504 252 0 100 4
545 226 390 230 28 5 0 176;

proc univariate trimmed=.05;


var x_i;
ods listing select trimmedmeans;

/***********Begin Standard Resampling Algorithm*****************/


data sampsize;
set original nobs=nobs;
n=nobs;
do b=1 to 10000; /*output the data B times*/
output;
end;

proc sort data=sampsize;by b;

data resample;
set sampsize;by b;
retain sumc i; /*need to add the previous
count to the total */
if first.b then do;
sumc=0;
i=1;
end;
p=1/(n-i+1); /*p is the probability of a "success"*/

if ^last.b then do;


if n>sumc then c_i=ranbin(0,n-sumc,p); /*generate the binomial variate*/
else c_i=0;
sumc=sumc+c_i;
end;
i=i+1;
if last.b then c_i=n-sumc;

data main;set resample;


if c_i >0 then do;
do i=1 to c_i;
output;
end;
end;

/*************End Standard Resampling Algorithm*****************/


/******Calculate 10000 Trimmed Means**************************/

proc univariate trimmed=.05;by b;


var x_i;
ods listing close;
ods output trimmedmeans=tr;

/******Calculate Percentile Interval**************************/

proc univariate data=tr noprint;


var mean;
output out=quantile pctlpts=2.5 97.5 pctlpre=p;

/******Print Results**************************/

proc print data=quantile;


title '95% Bootstrap Percentile Confidence Interval';

The R code, here computing the bootstrap percentile interval for the sample mean, is as follows:

> data<-c(1397, 471, 64, 1571, 663, 1128, 719, 407,


+ 480, 1147, 362, 2022, 10, 175, 494, 31,
+ 202, 751, 30, 27, 118, 8, 181, 1432, 31, 18,
+ 105, 23, 8, 70, 12, 504, 252, 0, 100, 4,
+ 545, 226, 390, 230, 28, 5, 0, 176)
> data_sort<-sort(data)
> resamples <- lapply(1:1000, function(i)
+ sample(data, replace = T))
> each_mean<-lapply(1:1000, function(i)
+ mean(resamples[[i]]))
> vector<-c()
> for (i in 1:1000){
+ vector<-c(vector,each_mean[[i]])
+ i<-i+1
+ }
> quantile(vector,c(0.025,0.975))
2.5% 97.5%
243.0193 530.7216

11.11 Use of the Edgeworth Expansion to Illustrate the


Accuracy of Bootstrap Intervals
We will apply the Edgeworth expansion technique (Hall, 1992) to illustrate
the accuracy of the coverage probabilities of bootstrap confidence intervals.
The Edgeworth expansion is a series that approximates a probability
distribution in terms of its cumulants. The rationale for using the Edgeworth
expansion over standard asymptotic normal methods is that it provides
more precise statements about the error terms pertaining to the bootstrap
approximations.
The version of the classic Edgeworth expansion for an approximation to the distribution function of t(F̂; θ) is given as

F_{t(F̂;θ)}(x) = Φ(x) − φ(x)[ (1/6)γ_1 H_2(x) + (1/24)γ_2 H_3(x) + (1/72)γ_1^2 H_5(x) ] + o(n^{-1}),

where Φ(x) and φ(x) denote the standard normal cdf and pdf, H_2(x) = x^2 − 1, H_3(x) = x^3 − 3x and H_5(x) = x^5 − 10x^3 + 15x are Hermite polynomials, and γ_1 and γ_2 denote the skewness and kurtosis, respectively. In general, H_r(x) = (−1)^r e^{x^2/2} (d^r/dx^r) e^{−x^2/2}, and additional series terms can be added to improve the degree of accuracy of the approximation.
A classic example for the Edgeworth approximation is as follows. Let iid observations X_1, X_2, …, X_n ~ N(0, 1) and let V = ∑_{i=1}^n X_i^2; then it is well known that V ~ χ^2_n. Asymptotically we have that T = (V − n)/√(2n) ~ AN(0, 1), such that the approximate distribution function for T is F_T(x) ≅ Φ(x), that is, the first term in the Edgeworth expansion above. More precisely, we know that in this example γ_1 = 2√2/√n and γ_2 = 12/n. Hence, a more precise approximation for the distribution function of T is given as

F_T(x) ≅ Φ(x) − φ(x)[ (1/3)√(2/n) (x^2 − 1) + (1/(2n)) (x^3 − 3x) + (1/(9n)) (x^5 − 10x^3 + 15x) ],

with error of order o(n^{-1}).
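A minimal R check of this example (our own code), comparing the exact probability, the normal approximation, and the Edgeworth-corrected approximation for a single choice of x and n:

n <- 10
x <- 1.0

exact     <- pchisq(n + x * sqrt(2 * n), df = n)   # exact P(T <= x) since T = (V - n)/sqrt(2n)
normal    <- pnorm(x)
edgeworth <- pnorm(x) - dnorm(x) * ((1/3) * sqrt(2/n) * (x^2 - 1) +
               (1/(2 * n)) * (x^3 - 3 * x) + (1/(9 * n)) * (x^5 - 10 * x^3 + 15 * x))

c(exact = exact, normal = normal, edgeworth = edgeworth)
# the Edgeworth-corrected value is typically closer to the exact probability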


To prove the bootstrap consistency of our bootstrap pivotal quantity t(F̂*; θ) − t(F̂; θ) we start by examining the pivotal quantity t(F̂; θ) − t(F; θ). Equivalently, we can use the pivotal quantity √n (t(F̂; θ) − t(F; θ)), which aids in the technical components of the proofs. Let K = K(X_1, X_2, …, X_n | F) = √n (t(F̂; θ) − t(F; θ)). Denote the distribution function of K as G_{K|F}(k) = P(K < k | F). Denote the bootstrap replication of K as K* = K(X_1*, X_2*, …, X_n* | F̂) = √n (t(F̂*; θ) − t(F̂; θ)), with distribution function G_{K*|F̂}(k) = Pr(K* < k | F̂). If the bootstrap confidence interval works well in a given scenario then G_{K*|F̂}(k) should be “close” to G_{K|F}(k).
In order to prove the consistency of the bootstrap distribution function estimator we need to show that

P(|G_{K*|F̂}(k) − G_{K|F}(k)| > ε) → 0,  n → ∞.
For specific cases this proof can be straightforward. However the general
result is technically challenging. Specific cases often require a key piece of
knowledge about the given problem to complete the proof. For example, take t(F; θ) = E(X|F) = θ. Then t(F̂; θ) = x̄ and t(F̂*; θ) = ∑_{i=1}^n C_i X_i / n = ∑_{i=1}^n x_i* / n, where the C_i's
i=1 i=1
can be considered multinomial random variables, that is, the data are fixed
and the resampling counts follow a multinomial distribution. The classic

proof of bootstrap consistency in this case utilizes Mallow’s distance. If we


have two distribution functions with X ~ H and Y ~ G then Mallow's distance measure is denoted as ρ(H, G) = inf_{τ_xy} (E||X − Y||^r)^{1/r} = ρ(X, Y), where τ_xy is the set of all possible joint distribution functions for X and Y whose marginal distributions are H and G, respectively, and ||x|| denotes the Euclidean norm in p-dimensional real space (Mallows, 1972).
marginal distributions are H and G, respectively, where x is the Euclidean
norm in p-dimensional real space (Mallows, 1972).
In the bootstrap setting we apply the known Mallow's distance results

ρ(F̂, F) →_{a.s.} 0,   ρ(∑_{i=1}^n X_i, ∑_{i=1}^n Y_i)^2 ≤ ∑_{i=1}^n ρ(X_i, Y_i)^2,   ρ(H, G) = ρ(X, Y),
ρ(X, Y)^2 = ρ(X − E(X), Y − E(Y))^2 + ||E(X) − E(Y)||^2,  and  ρ(aX, aY) = |a| ρ(X, Y),

e.g., see Shao and Tu (1995). For example, using the notation from above let K = K(X_1, X_2, …, X_n | F) = √n (t(F̂; θ) − t(F; θ)) = √n (X̄ − θ) and K* = K(X_1*, X_2*, …, X_n* | F̂) = √n (t(F̂*; θ) − t(F̂; θ)) = √n (X̄* − X̄). Then we have

ρ(G_{K*|F̂}(k), G_{K|F}(k))^2 = ρ(K*, K)^2
  = ρ(√n (X̄* − X̄), √n (X̄ − θ))^2
  = (1/n) ρ(∑_{i=1}^n (X_i* − X̄), ∑_{i=1}^n (X_i − θ))^2 ≤ (1/n) ∑_{i=1}^n ρ((X_i* − X̄), (X_i − θ))^2
  = ρ((X* − X̄), (X − θ))^2
  = ρ(X*, X)^2 − ||θ − X̄||^2
  = ρ(F̂, F)^2 − ||θ − X̄||^2 = o(1) a.s.,

given X̄ → θ a.s., in combination with the known Mallow's distance results outlined above. Hence, G_{K*|F̂}(k) is ρ-consistent for this example, which implies it can be used to develop approximate confidence intervals that converge to the appropriate α level for large samples. The consistency of G_{K*|F̂}(k) can oftentimes be shown via the Berry–Esséen inequality.
In the general case, why does the simple percentile interval generate a valid confidence interval with coverage that converges to 100 × (1 − α)% as n → ∞? As before let θ̂* = t(F̂*; θ) and θ̂ = t(F̂; θ). For the more general case we will employ Edgeworth expansion techniques, which provide less precise statements about the large sample coverage of bootstrap percentile intervals.
As in the specific case using Mallow's distance the goal is to prove a large sample equivalence between Pr(θ̂ − θ ≤ x | F) and Pr(θ̂* − θ̂ ≤ x | F̂). In this

regard let us first examine the Edgeworth expansions of each of these quan-
tities such that we have

( ⎝ σ⎠)
P θ − θ ≤ x|F = Φ ⎛⎜ ⎞⎟ +
x 1
n
q1 ⎛⎜ |F⎞⎟ φ ⎛⎜ ⎞⎟ + ⎛⎜ |F⎞⎟ φ ⎛⎜ ⎞⎟ + r ,

x
σ ⎠ ⎝
x
σ ⎠
1 x

q2 σ ⎠ ⎝
x
σ⎠

and

P(θˆ * − θˆ ≤ x|Fˆ ) = Φ ⎛⎜ ⎞⎟ + q1 ⎛⎜ |Fˆ ⎞⎟ φ ⎛⎜ ⎞⎟ + q2 ⎛⎜ |Fˆ ⎞⎟ φ ⎛⎜ ⎞⎟ + r *,


x 1 x x 1 x x
⎝ σˆ ⎠ n ⎝ σ
ˆ ⎠ ⎝ σ
ˆ ⎠ n ⎝ σ
ˆ ⎠ ⎝ σˆ ⎠

where q1 and q2 are a rearrangement of the H i’s from our introduction of the
Edgeworth expansion above and r * denotes the remainder term. Then we have

sup_x |P(θ̂ − θ ≤ x | F) − P(θ̂* − θ̂ ≤ x | F̂)|
  ≤ sup_x { (1/√n) (q_1(x/σ | F) − q_1(x/σ̂ | F̂)) + (1/n) (q_2(x/σ | F) − q_2(x/σ̂ | F̂)) } φ(x/σ) + O_p(n^{-1/2}),

given the assumption that σ̂ − σ = O_p(n^{-1/2}), which yields φ(x/σ̂) − φ(x/σ) = O_p(n^{-1/2}). Hence, we have that

sup_x |P(θ̂ − θ ≤ x | F) − P(θ̂* − θ̂ ≤ x | F̂)| = O_p(n^{-1}),

where P(θ̂* − θ̂ ≤ x | F̂) is a random quantity. It is important to note that just
because the asymptotic theory is mathematically correct that there is no
guarantee that the finite sample coverage will be accurate for these types of
intervals. An alternative approach to the Edgeworth expansion for these
problems is to use Cornish–Fisher expansions (Cornish and Fisher, 1937).
The general lemma for the percentile method is as follows. Suppose (θ̂ − θ)/σ̂ →_d T and (θ̂* − θ̂)/σ̂* →_d T. If we assume σ̂ → σ and σ̂* → σ̂, and T is distributed according to a symmetric distribution function, then the percentile confidence interval (b*_{α/2}, b*_{1−α/2}) described above is asymptotically correct at level 1 − α. Note that if T is asymptotically normal then we know that the distribution of T will be symmetric asymptotically. Hence, as a rule of thumb, asymptotically normally distributed T's will provide valid large sample percentile bootstrap confidence intervals for the statistical functional t(F; θ).

Warning. A common mistake of practitioners is to start with the statistic and not consider exactly what the functional it links to is. Hence, they end up interpreting their confidence intervals about the wrong parameter. For example, suppose t(F; θ) = E(X_(n) | F). Then the T of interest is not the sample maximum X_(n) but is E(X_(n) | F̂), which is a weighted linear combination of the order statistics. We can refine the percentile method to have better accuracy in finite samples using the approaches below.

11.12 Bootstrap-t Percentile Intervals


The bootstrap-t percentile interval is a slight modification of the percentile method used in order to get better finite sample coverage probabilities. The key idea is to add a correction by replacing the quantity t(F̂*; θ) − t(F̂; θ) with (σ_{t(F̂)}/σ_{t(F̂*)}) (t(F̂*; θ) − t(F̂; θ)), where σ^2_{t(F̂;θ)} is the variance estimate for t(F̂; θ), e.g., S^2/n for X̄, and σ^2_{t(F̂*;θ)} = ∑_{i=1}^B (t_i(F̂*; θ) − t̄(F̂*; θ))^2, where t̄(F̂*; θ) is the average of the bootstrap resampled values. If an analytical expression for σ^2_{t(F̂;θ)} is unavailable or not easily calculated a double bootstrap approach is necessary, e.g., see SAS PROGRAM 11.9. The heuristic idea is that percentile intervals tend to suffer from undercoverage in small samples; the scalar correction σ_{t(F̂)}/σ_{t(F̂*)} provides slightly wider intervals and hence better coverage probabilities. Similar to the percentile interval proofs let θ̂* = t(F̂*; θ) and θ̂ = t(F̂; θ). Then one can use an Edgeworth expansion technique to show that

sup_x | P((θ̂ − θ)/σ_{t(F̂)} ≤ x | F) − P((θ̂* − θ̂)/σ_{t(F̂*)} ≤ x | F̂) | = O_p(n^{-1}),

as compared to O_p(n^{-1/2}) for the percentile interval. There is no guarantee that the bootstrap-t percentile approach will work better than the percentile method, but it has been our experience that it is worth the additional effort in the small sample setting. The tricky issue here is that there are oftentimes no closed form solutions for the correction factor σ_{t(F̂)}/σ_{t(F̂*)}. In this case a double bootstrap resampling method is used to approximate σ_{t(F̂)}/σ_{t(F̂*)} with σ_{t(F̂*)}/σ_{t(F̂**)}, where σ_{t(F̂**)} is obtained from subsamples of the original resamples.
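A minimal R sketch of the bootstrap-t idea for the simple case of the sample mean, where the analytic standard error S/√n is available and plays the role of σ_{t(F̂)} (and its resample analogue the role of σ_{t(F̂*)}); this is our own simplified illustration, not the text's SAS program:

set.seed(77)
x <- rexp(30, rate = 1/3)
n <- length(x); B <- 5000; alpha <- 0.05

se <- sd(x) / sqrt(n)                              # sigma_t(F-hat) for the mean
t.star <- replicate(B, {
  xb <- sample(x, replace = TRUE)
  (mean(xb) - mean(x)) / (sd(xb) / sqrt(n))        # studentized bootstrap replicate
})
q <- quantile(t.star, c(alpha/2, 1 - alpha/2))

c(lower = mean(x) - q[2] * se, upper = mean(x) - q[1] * se)

Scaling the centered replicates by σ_{t(F̂)}/σ_{t(F̂*)} and studentizing as above are algebraically equivalent ways of forming the same interval.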

Example. Let us revisit our example from the percentile interval section
involving the trimmed mean x̄_α. As you recall we were interested in measures of location for CD4 count data. Previously we calculated 95% bootstrap percentile confidence intervals for E_.05(X) and E_.25(X) based upon the trimmed means T(F_n(X)) = x̄_.05 and T(F_n(X)) = x̄_.25, respectively. In order to recalculate
these intervals for the same dataset using the bootstrap-t method, as com-
pared to the percentile interval approach, we need some additional program-
ming steps. These are given in the example SAS PROGRAM 11.7.
The first step in modifying the existing programs is to create a dummy
variable used to merge bootstrap quantities labeled dummy. For this exam-
ple dummy = 1 for every data and resample value. To calculate the trimmed
mean and its standard deviation we employed the SAS ODS system with the
command ods output trimmedmeans=trbase in conjunction with PROC
UNIVARIATE. This allows us to output the trimmed mean and standard
deviation to the dataset trbase within our sample program. It is important to
note that the standard deviation estimate is labeled as the standard error in
SAS, that is, what is calculated is the standard error estimate for X and the
standard deviation for the statistic Xα .
After outputting the trimmed mean and its standard deviation from PROC
UNIVARIATE we reassigned them new variable names within the dataset
rename. This was necessary so that values were not overwritten when
merged with the bootstrap resampled trimmed means and standard devia-
tions of the same name. As with the original trimmed mean calculation the
bootstrap resampled statistics were also calculated using PROC UNIVARI-
ATE and output using ODS to dataset combine. In addition, within dataset
σ ˆ
( ( ) ( ))
combine we calculated the t( F ) t Fˆ * ; θ − t F;θ ’s needed in order to cal-
σ t( Fˆ * )
culate the interval. The bootstrap-t interval based on the ordered
σ t( Fˆ )
σ t( Fˆ * ) ( ( ) ( )) ;θ ’s was then processed via PROC UNIVARIATE. We also
t Fˆ * ; θ − t F

recalculated the percentile interval within this program for comparison. The
intervals were distinguished by the percentile prefixes pp and pt, for the
percentile and bootstrap-t methods, respectively.
For our example run B = 10000 resamples were used. The 95% percentile
interval turned out to be (179.9, 471.1) as compared to the previous run
given in the earlier section of (181.8, 466.0). This gives you an idea of how
much the intervals might change from successive runs, even with B = 10000
resamples. The bootstrap-t 95% confidence interval turned out to be
∗ ∗
(S(250) (⋅), S(9750) (⋅)) = (170.5, 537.1). Recall also that the semi-parametric interval
built in to SAS, calculated under the assumption of symmetry, yielded the
interval (149.3, 460.6). Given the apparent skewness in our example CD4
count dataset the bootstrap-t interval may be shown to be theoretically
preferable to the semi-parametric interval calculated by SAS in this
instance. The bootstrap-t interval for this example is reflective of the skew-
ness of the original dataset, and is wider than the percentile method since

it was scaled more appropriately. Also, notice how the bootstrap-t interval
is shifted slightly to the right as compared to the semi-parametric interval,
again due to the asymmetry of the original dataset.
Note that the symmetry assumption may be relaxed somewhat if the trim-
ming proportion is increased. Let us examine the effect of the choice of trim-
ming proportion as well, e.g., say we wished to use a more robust measure of
location. We can rerun the example with a trimming proportion of α = .25 , as
compared to α = .05 above, that is, in this case we will eliminate 25% of the
extreme observations from either tail of the distribution prior to estimating
the mean. For our CD4 count data x.25 = 210.1 ± 61.5 as compared to
x.05 = 304.9 ± 76.8 , and x = 377.7 ± 74.3 . The confidence intervals in turn for x.25
were (108.0, 351.1), (99.3, 397.2), and (82.3, 338.0) for the percentile, bootstrap-t,
and semi-parametric method, respectively. The intervals are not too dissimi-
lar. In general, the more trimming that takes place the more we can relax the
assumption of an underlying symmetrical distribution.
/* SAS PROGRAM 11.7 */
data original;
label x_i='CD4 counts';
input x_i @@;
dummy=1; /*define a dummy variable to merge on*/
cards;

1397 471 64 1571 663 1128 719 407


480 1147 362 2022 10 175 494 31
202 751 30 27 118 8 181 1432 31 18
105 23 8 70 12 504 252 0 100 4
545 226 390 230 28 5 0 176
;

proc univariate trimmed=.05;by dummy;


var x_i;
ods listing select trimmedmeans;
ods output trimmedmeans=trbase; /*output trimmed mean and standard deviation*/

data rename;set trbase; /*re-label the variables*/


keep dummy tbase stdbase;
tbase=mean;
stdbase=stdmean;

/***********Begin Standard Resampling Algorithm*****************/

data sampsize;
set original nobs=nobs;
n=nobs;
do b=1 to 10000; /*output the data B times*/
output;
end;

proc sort data=sampsize;by b;

data resample;

set sampsize;by b;
retain sumc i; /*need to add the previous
count to the total */
if first.b then do;
sumc=0;
i=1;
end;

p=1/(n-i+1); /*p is the probability of a "success"*/

if ^last.b then do;


if n>sumc then c_i=ranbin(0,n-sumc,p); /*generate the binomial variate*/
else c_i=0;
sumc=sumc+c_i;
end;
i=i+1;
if last.b then c_i=n-sumc;

data main;set resample;


if c_i >0 then do;
do i=1 to c_i;
output;
end;
end;

/*************End Standard Resampling Algorithm*****************/


proc univariate trimmed=.05;by dummy b;
var x_i;
ods output trimmedmeans=tr;

data combine;merge tr rename;by dummy;


s=tbase-stdbase*(mean-tbase)/stdmean;

proc gplot data=combine;


plot stdmean*mean;

proc gchart;
vbar s;

proc univariate data=combine noprint;


var mean s;
output out=quantile pctlpts=2.5 97.5 pctlpre=pp pt;

proc print data= quantile;


title '95% Bootstrap Percentile and t Confidence Intervals';
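For readers working in R, a minimal sketch (ours, not part of the SAS programs in the text) of the same percentile and bootstrap-t calculations for the 5% trimmed mean might take the following form; the helper trim.se (a Winsorized-standard-deviation-based standard error), the value of B, and the seed are illustrative choices, so the limits will differ somewhat from the PROC UNIVARIATE output above.

# Minimal R sketch (ours): percentile and bootstrap-t intervals for the 5% trimmed mean
cd4 <- c(1397, 471, 64, 1571, 663, 1128, 719, 407,
         480, 1147, 362, 2022, 10, 175, 494, 31,
         202, 751, 30, 27, 118, 8, 181, 1432, 31, 18,
         105, 23, 8, 70, 12, 504, 252, 0, 100, 4,
         545, 226, 390, 230, 28, 5, 0, 176)

trim.se <- function(x, alpha = 0.05) {
  # standard error of the trimmed mean via the Winsorized standard deviation
  n <- length(x); k <- floor(alpha * n); xs <- sort(x)
  xw <- c(rep(xs[k + 1], k), xs[(k + 1):(n - k)], rep(xs[n - k], k))
  sd(xw) / ((1 - 2 * alpha) * sqrt(n))
}

B <- 10000
t.hat  <- mean(cd4, trim = 0.05)   # trimmed mean of the original sample
se.hat <- trim.se(cd4)
t.star <- s.star <- numeric(B)
set.seed(123)
for (b in 1:B) {
  xb <- sample(cd4, replace = TRUE)       # bootstrap resample
  t.star[b] <- mean(xb, trim = 0.05)
  s.star[b] <- trim.se(xb)
}
quantile(t.star, c(0.025, 0.975))                   # percentile interval
z.star <- (t.star - t.hat) / s.star                 # studentized resampled statistics
t.hat - quantile(z.star, c(0.975, 0.025)) * se.hat  # bootstrap-t interval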

11.13 Further Refinement of Bootstrap Confidence Intervals


The two key features that tend to drive bootstrap confidence interval accu-
racy in finite samples are symmetry and scale adjustments, particularly as

they pertain to the percentile interval. The bootstrap-t interval provides a


scale adjustment, but does not overcome potential asymmetry of the test sta-
tistic distribution in finite samples. In general, if some monotone function
exists such that

$$P\left(\phi\big(t(\hat{F};\theta)\big) - \phi\big(t(F;\theta)\big) < x\right) = \psi(x)$$

holds for all possible F's, where $\psi(x)$ is a continuous and symmetric function, that is, $\psi(x) = 1 - \psi(-x)$ is assumed continuous; then the lower bound $= \phi^{-1}\big(\phi\big(t(\hat{F};\theta)\big) - z_{\alpha/2}\big)$, where $z_{\alpha/2} = \psi^{-1}(\alpha/2)$ and $\phi(x)$ is a monotone increasing transformation. As a simple example let $\phi(x) = x$, $t(\hat{F};\theta) = \bar{X}$, $t(F;\theta) = \theta$ with $\psi(x) = \Phi(\sqrt{n}\,x/\sigma)$, where $\sigma$ is a scale parameter. So, for the relationship above to hold the big assumption is symmetry. The bootstrap percentile interval version of this framework is

$$P\left(\phi\big(t(\hat{F}^{*};\theta)\big) - \phi\big(t(\hat{F};\theta)\big) < x\right) = \psi^{*}(x),$$

where the key assumption is that $\phi\big(t(\hat{F};\theta)\big)$ is symmetric given large samples, such that the percentile interval holds approximately to a first-order approximation.
Using this framework second-order finite sample corrections can be devel-
oped. Now consider

$$P\left(\phi\big(t(\hat{F};\theta)\big) - \phi\big(t(F;\theta)\big) + z_0 < x\right) = \psi(x),$$

where $z_0$ is a constant. In order to illustrate the concept we start with a nonbootstrap example around Fisher's z-transformation, e.g., see Fisher (1921). In this case $\phi(x) = \sqrt{n}\tanh^{-1}(x)$, $t(\hat{F};\theta) = r$, and $t(F;\theta) = \rho$, where r is Pearson's sample correlation and $\rho$ is the population correlation. Then $\sqrt{n}\big(\tanh^{-1}(r) - \tanh^{-1}(\rho)\big)$ has a more symmetric variance-stabilized distribution than $\sqrt{n}(r - \rho)$. However, in finite samples $\sqrt{n}\big(\tanh^{-1}(r) - \tanh^{-1}(\rho)\big) - \rho/(2\sqrt{n})$ is even better approximated by the limiting normal distribution than $\sqrt{n}\big(\tanh^{-1}(r) - \tanh^{-1}(\rho)\big)$ in general if $\phi$, $\rho$, and $z_0$ are known, e.g., see Konishi (1978). Then the lower bound for the confidence interval for $\rho$ in this case can be written in the form

$$\phi^{-1}\big(\phi\big(t(\hat{F};\theta)\big) - z_{\alpha/2} + z_0\big).$$
The bootstrap variant of this approach, termed the bootstrap bias-corrected
approach, starts with a calibration step where we write

$$P\left(\phi\big(t(\hat{F}^{*};\theta)\big) - \phi\big(t(\hat{F};\theta)\big) + z_0 < x\right) = \psi^{*}(x),$$

where $z_0 = \psi^{-1}\big(K^{*}\big(t(\hat{F}^{*};\theta)\big)\big)$ and $K^{*}\big(t(\hat{F}^{*};\theta)\big)$ is the distribution function of $\phi\big(t(\hat{F}^{*};\theta)\big) - \phi\big(t(\hat{F};\theta)\big)$. If the distribution of $t(\hat{F}^{*};\theta)$ is symmetric about $\phi\big(t(\hat{F};\theta)\big)$, then $z_0 = 0$, and we have the same result as the standard percentile interval. Calculating $z_0$ in practice is difficult since in general we do not know the form of $\psi(x)$. In applications we can consider approximating $\psi(x)$ with $\Phi(x)$, the standard normal distribution function. Then the estimator is $\hat{z}_0 = \Phi^{-1}\big(K^{*}\big(t(\hat{F}^{*};\theta)\big)\big)$. If the distribution of $t(\hat{F}^{*};\theta)$ is symmetric about $\phi\big(t(\hat{F};\theta)\big)$, then $\hat{z}_0 \approx 0$, that is, $K^{*}\big(t(\hat{F}^{*};\theta)\big) \approx 1/2$. The confidence bounds for $t(F;\theta)$ then become

$$\text{lower} = \hat{F}^{*-1}_{t(\hat{F}^{*};\theta)}\left[\Phi\big(2\hat{z}_0 + \Phi^{-1}(\alpha/2)\big)\right],$$

$$\text{upper} = \hat{F}^{*-1}_{t(\hat{F}^{*};\theta)}\left[\Phi\big(2\hat{z}_0 + \Phi^{-1}(1-\alpha/2)\big)\right].$$

The example below is to calculate the 95% confidence interval for Tukey’s
trimean:

/* SAS PROGRAM 11.8 */


*********************************;
** Bias Corrected Percentile Method****;
*********************************;
* The program below estimates the trimean and its confidence interval;

data counts;
label pl_ct='platelet count';
input pl_ct @@;
cards;
1397 471 64 1571 663 1128 719 407
480 1147 362 2022 10 175 494 31
202 751 30 27 118 8 181 1432 31 18
105 23 8 70 12 504 252 0 100 4
545 226 390 230 28 5 0 176
;

proc iml;
use counts;
read all into data;

*reset print;

n=nrow(data);
brep=5000; * Number of bootstrap replications;

total=n*brep;
bootdata=j(total,3, 0);

* Create bootstrap replications;


do i=1 to total;
index=int(ranuni(0)*n)+1;
bootdata[i,1]=data[index]; bootdata[i,2]=int((i-1)/n)+1; * This is
the number of replications;
end;

call sortndx(ndx, bootdata, {2, 1});


* Sort data by replication number and values;
bootdata=bootdata[ndx,];

do i=1 to total;
bootdata[i,3]=mod(i,n) + (mod(i,n)=0)*n;
* This is the order of each replicated data;
end;

* Find trimmed mean;

subset=bootdata[loc(bootdata[,3]=int(n/4)+1|bootdata[,3]=int(n/2)+1
|bootdata[,3]=int(3*n/4)+1),1];
matrix1=I(brep);
matrix2={0.25 0.5 0.25};
matrix3=matrix1 @ matrix2;
tukey = matrix3 * subset;

call sort(tukey, {1});


tukey=tukey;

* Find trimmed mean for the original data;


call sortndx(ndx, data, {1});
data=data[ndx,];
trimmean=1/4 * data[int(n/4)+1,1]+1/2 * data[int(n/2)+1,1]
+ 1/4 * data[int(n*3/4)+1,1];

*Calculate b;
subset = tukey[loc(tukey[,1] <= trimmean),];
p=nrow(subset)/brep;
b=quantile('NORMAL', p); * b=z_0;

* Calculate confidence interval;


alpha=0.05;
z_alpha_2 = quantile('NORMAL', alpha/2);
Q_l=int(brep*probnorm(2*b + z_alpha_2));
Q_u=int(brep*probnorm(2*b - z_alpha_2));

if Q_l=0 then Lower=0;


else Lower=tukey[Q_l];
Upper=tukey[Q_u];

CI=trimmean||p||b||Lower||Upper;

colname={'trimmean' 'P' 'b' 'lower' 'upper'};


print CI[colname=colname];

quit;

The results of SAS program 11.8 are given in Table 11.1.


If the statistic $t(\hat{F}^{*};\theta)$ cannot be estimated directly, a double-bootstrap strategy may be needed.

TABLE 11.1
The Estimated Confidence Interval for the Population Trimean Based on 5000
Replications
Trimean P b lower upper
223.5 0.4388 –0.154012 112.25 386.75

The corresponding macro version of the SAS program is:

/* SAS PROGRAM 11.9 */


data original;
label x_i='CD4 counts';
input x_i @@;
dummy=1; /*define a dummy variable to merge on*/
cards;
1397 471 64 1571 663 1128 719 407
480 1147 362 2022 10 175 494 31
202 751 30 27 118 8 181 1432 31 18
105 23 8 70 12 504 252 0 100 4
545 226 390 230 28 5 0 176
;

ods listing close;

proc univariate trimmed=.05;by dummy;


var x_i;
ods listing select trimmedmeans;
ods output trimmedmeans=trbase;

data rename;set trbase;


keep dummy tbase ;
tbase=mean;

ods printer ps file='trim_BC.ps';



data sampsize;
set original nobs=nobs;
n=nobs;
do b=1 to 10000; /*output the data B times*/
output;
end;

proc sort data=sampsize;by b;

data resample;
set sampsize;by b;
retain sumc i; /*need to add the previous
count to the total */
if first.b then do;
sumc=0;
i=1;
end;
p=1/(n-i+1); /*p is the probability of a "success"*/

if ^last.b then do;


if n>sumc then c_i=ranbin(0,n-sumc,p); /*generate the binomial variate*/
else c_i=0;
sumc=sumc+c_i;
end;
i=i+1;
if last.b then c_i=n-sumc;

data main;set resample;


if c_i >0 then do;
do i=1 to c_i;
output;
end;
end;

proc univariate trimmed=.05;by dummy b;


var x_i;
ods output trimmedmeans=tr;

data combine;merge tr rename;by dummy;


ind=0;
if mean<tbase then ind=1;

proc means mean noprint;


var ind;
output out=bias mean=phat;

%macro bc;

data percents;set bias;


alpha=0.05;
z0=probit(phat);
lower=100*probnorm(2*z0+probit(alpha/2));
upper=100*probnorm(2*z0+probit(1-alpha/2)); /*need percentiles*/

call symput('lll',lower);
call symput('uuu',upper);

proc univariate noprint pctldef=4 data=combine;


var mean;
output out=quantile pctlpts= &lll &uuu pctlpre=p;

%mend;
%bc;

proc print;

ods listing;

The 95% bias corrected confidence interval is (179.92, 471.49).
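Continuing the R sketch given after SAS Program 11.7 (it reuses the objects cd4, t.star, and t.hat produced there), a hedged illustration of the bias-corrected percentile computation is as follows.

# Hedged R sketch (ours) of the bias-corrected percentile interval
alpha <- 0.05
p.hat <- mean(t.star < t.hat)   # proportion of resampled estimates below the original estimate
z0    <- qnorm(p.hat)           # estimated bias-correction constant
probs <- pnorm(2 * z0 + qnorm(c(alpha / 2, 1 - alpha / 2)))
quantile(t.star, probs)         # bias-corrected percentile bounds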


The next extension for finite samples involves two correction factors and is
known as the bias-corrected and accelerated (BCa) interval; whereas above
bias refers to the first-order term in the bias of $E\big(t(\hat{F};\theta)\big)$, in this case we write our corrected distribution function as

$$P\left(\frac{\phi\big(t(\hat{F}^{*};\theta)\big) - \phi\big(t(\hat{F};\theta)\big)}{1 + a\,\phi\big(t(\hat{F};\theta)\big)} + z_0 < x\right) = \psi^{*}(x),$$

where a is what is known as the acceleration constant, which adjusts for finite sample skewness for the distribution of $\phi\big(t(\hat{F}^{*};\theta)\big) - \phi\big(t(\hat{F};\theta)\big)$. In this case we again approximate $z_0$ and a such that our approximate upper and lower bounds become

$$\text{lower} = \hat{F}^{*-1}_{t(\hat{F}^{*};\theta)}\left[\Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + \Phi^{-1}(\alpha/2)}{1 - \hat{a}\big(\hat{z}_0 + \Phi^{-1}(\alpha/2)\big)}\right)\right],$$

$$\text{upper} = \hat{F}^{*-1}_{t(\hat{F}^{*};\theta)}\left[\Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + \Phi^{-1}(1-\alpha/2)}{1 - \hat{a}\big(\hat{z}_0 + \Phi^{-1}(1-\alpha/2)\big)}\right)\right],$$

where $\hat{z}_0 = \Phi^{-1}\big(K^{*}\big(t(\hat{F}^{*};\theta)\big)\big)$, $\hat{a} = \dfrac{\sum_{i=1}^{B} w_i^{3}}{6\left(\sum_{i=1}^{B} w_i^{2}\right)^{3/2}}$, and $w_i = t_i(\hat{F}^{*};\theta) - \bar{t}(\hat{F}^{*};\theta)$. In

general, the bias-corrected and bias-accelerated intervals may not perform as


expected due to the fact that we are estimating z0 and a.
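A corresponding hedged R sketch (ours, not from the text) of the BCa bounds, again reusing t.star and t.hat from the earlier sketch and computing the acceleration constant directly from the bootstrap replicates as in the display above, is:

# Hedged R sketch (ours) of the BCa bounds
alpha <- 0.05
w     <- t.star - mean(t.star)
a.hat <- sum(w^3) / (6 * sum(w^2)^1.5)   # acceleration constant from the bootstrap replicates
z0    <- qnorm(mean(t.star < t.hat))     # bias-correction constant
zq    <- qnorm(c(alpha / 2, 1 - alpha / 2))
probs <- pnorm(z0 + (z0 + zq) / (1 - a.hat * (z0 + zq)))
quantile(t.star, probs)                  # BCa bounds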

/* SAS PROGRAM 11.10*/


*********************************;
** Bias-Corrected and Accelerated Method****;
*********************************;
* The program below estimates the trimean and its confidence
interval;

data counts;
label pl_ct='platelet count';
input pl_ct @@;
cards;
1397 471 64 1571 663 1128 719 407
480 1147 362 2022 10 175 494 31
202 751 30 27 118 8 181 1432 31 18
105 23 8 70 12 504 252 0 100 4
545 226 390 230 28 5 0 176
;
/*
data counts;
label pl_ct='platelet count';
input pl_ct @@;
cards;
230 222 179 191 103 293 316 520 143 226
225 255 169 204 99 107 280 226 143 259
;
*/
proc iml;
use counts;
read all into data;

*reset print;

n=nrow(data);
brep=500; * Number of bootstrap replications;

total=n*brep;
bootdata=j(total,3, 0);

* Create bootstrap replications;


do i=1 to total;
index=int(ranuni(0)*n)+1;
bootdata[i,1]=data[index]; bootdata[i,2]=int((i-1)/n)+1; *
This is the number of replications;
end;

call sortndx(ndx, bootdata, {2, 1});


* Sort data by replication number and values;
bootdata=bootdata[ndx,];

do i=1 to total;
bootdata[i,3]=mod(i,n) + (mod(i,n)=0)*n;
* This is the order of each replicated data;
end;

* Calculate a;
call sort(data, {1});

col=(1:n)`;
data_perm=data||col;

do i=1 to n;
permut=data_perm[loc(data_perm[,2] ^= i),];
col2=(1:n-1)`;
permut=permut||col2;

trim_perm=1/4 * permut[int((n-1)/4)+1,1] + 1/2 *


permut[int((n-1)/2)+1, 1]
+ 1/4 * permut[int((n-1)*3/4) +1,1];
perm_a=perm_a//trim_perm;
end;
print perm_a;

a_vector=perm_a||col||col||col;
a_vector[,4] = a_vector[+,1]/n;
a_vector[,2] = (a_vector[,1] - a_vector[,4])##3;
a_vector[,3] = (a_vector[,1] - a_vector[,4])##2;
a=a_vector[+,2] /6/a_vector[+,3]##1.5; *a;

* Find trimmed mean;


subset=bootdata[loc(bootdata[,3]=int(n/4)+1|bootdata[,3]=int(n/2)+1
|bootdata[,3]=int(3*n/4)+1),1];
matrix1=I(brep);
matrix2={0.25 0.5 0.25};
matrix3=matrix1 @ matrix2;
tukey = matrix3 * subset;

call sort(tukey, {1});


tukey=tukey;

* Find trimmed mean for the original data;


call sortndx(ndx, data, {1});
data=data[ndx,];
trimmean=1/4 * data[int(n/4)+1,1] + 1/2 * data[int(n/2)+1, 1]
+ 1/4 * data[int(n*3/4) +1,1];

*Calculate b;
subset = tukey[loc(tukey[,1] <= trimmean),];
p=nrow(subset)/brep;
b=fuzz(quantile('NORMAL', p)); * b=z_0;

* Calculate confidence interval;


alpha=0.05;
z_alpha_2=quantile('NORMAL', alpha/2);
Q_l=int(brep *probnorm(b + (z_alpha_2 + b)/(1 - a * (z_alpha_2 + b))));
Q_u=int(brep*probnorm(b + (- z_alpha_2 + b)/(1 - a * (- z_alpha_2 + b))));

if Q_l=0 then Lower=0;


else Lower=tukey[Q_l];
Upper=tukey[Q_u];

CI=trimmean||p||a||b||Lower||Upper;
colname={'trimmean' 'P' 'a' 'b' 'lower' 'upper'};
print CI[colname=colname];
quit;

The results are shown in Table 11.2


TABLE 11.2
The Estimated Confidence Interval for the Trimmed Mean Based on 5000
Replications
Trimean P a b lower upper
223.5 0.4348 –0.011665 –0.164167 116 376.25

The BCa method produces a narrower confidence interval than the BC method.


A corresponding macro version of SAS PROGRAM 11.10 follows:

/* SAS PROGRAM 11.11 */


data original;
label x_i='platelet count';
input x_i@@;
dummy=1;
cards;
1397 471 64 1571 663 1128 719 407
480 1147 362 2022 10 175 494 31
202 751 30 27 118 8 181 1432 31 18
105 23 8 70 12 504 252 0 100 4
545 226 390 230 28 5 0 176
;

* The trimmed mean of the original data;


proc univariate data=original trimmed=.05;by dummy;
var x_i;
ods listing select trimmedmeans;
ods output trimmedmeans=trbase;
quit;

data rename;set trbase;


keep dummy tbase ;
tbase=mean;

ods printer ps file='trim_BC.ps';

data sampsize;
set original nobs=nobs;
n=nobs;
do b=1 to 10000; /*output the data B times*/
output;
end;

proc sort data=sampsize;by b;

*Bootstrap sampling;
data resample;
set sampsize;by b;
retain sumc i; /*need to add the previous
count to the total */

if first.b then do;


sumc=0;
i=1;
end;
p=1/(n-i+1); /*p is the probability of a "success"*/

if ^last.b then do;


if n>sumc then c_i=ranbin(0,n-sumc,p); /*generate the binomial variate*/
else c_i=0;
sumc=sumc+c_i;
end;
i=i+1;
if last.b then c_i=n-sumc;

data main;set resample;


if c_i >0 then do;
do i=1 to c_i;
output;
end;
end;
run;

* The estimated trimmed mean of the bootstrap samples;


proc univariate data=main trimmed=.05;by dummy b;
var x_i;
ods output trimmedmeans=tr;
run;

data combine;merge tr rename;by dummy;


ind=0;
if mean<tbase then ind=1;
run;

* Calculating z0, the bias;


proc means data=combine mean noprint;
var ind;
output out=bias mean=phat;
run;

* Calculating a, the acceleration;


data temp;
set original;
id=_n_;
run;

%macro jackknife();
%do i=1 %to 44;
dm "log;clear;output;clear";
proc univariate data=temp trimmed=.05;by dummy;
var x_i;
where id ne &i;
ods output trimmedmeans=jack;
run;

proc append data=jack out=jackknife;



run;
%end;
%mend;
%jackknife;

proc means data=jackknife noprint;


var mean;
by dummy;
output out=summary mean=mean_trimean;
run;

data jk;
merge jackknife summary;
by dummy;
num=( mean_trimean - mean )**3;
den=( mean_trimean - mean )**2;
keep mean mean_trimean num den;
run;

proc means data=jk;


var num den;
output out=summary sum=sum_num sum_den;
run;

data bias;
merge bias summary;
ahat=sum_num/6/sum_den**1.5;
run;

%macro bc;

data percents;set bias;


alpha=0.05;
z0=probit(phat);
lower=100*probnorm(z0 + (z0 + probit(alpha/2))
/(1 - ahat*(z0 +probit(alpha/2) )));
upper=100*probnorm(z0 + (z0 + probit(1-alpha/2))
/(1 - ahat*(z0 +probit(1-alpha/2) )));
/*need percentiles*/
call symput('lll',lower);
call symput('uuu',upper);

proc univariate noprint pctldef=4 data=combine;


var mean;
output out=quantile pctlpts= &lll &uuu pctlpre=p;
run;
%mend;
%bc;

proc print;
run;

ods listing;

The 95% confidence interval is (186.09, 482.38).



The R code is as follows:

> data<-c(1397, 471, 64, 1571, 663, 1128, 719, 407,
+ 480, 1147, 362, 2022, 10, 175, 494, 31,
+ 202, 751, 30, 27, 118, 8, 181, 1432, 31, 18,
+ 105, 23, 8, 70, 12, 504, 252, 0, 100, 4,
+ 545, 226, 390, 230, 28, 5, 0, 176)
> library(boot)
> my.mean = function(x, indices) {
+ return( mean( x[indices] ) )
+ }
> time.boot = boot(data, my.mean, 5000)
> boot.ci(time.boot,type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 5000 bootstrap replicates

CALL :
boot.ci(boot.out = time.boot, type = "bca")

Intervals :
Level BCa
95% (253.1, 550.7 )
Calculations and Intervals on Original Scale

11.14 Bootstrap Tilting


The direct linkage between empirical likelihood (EL) methods outlined in
Chapter 10 and bootstrap methodologies is through a specific technique
developed by Efron (1981) who coined the term nonparametric tilting,
which now is referred to as bootstrap tilting. In fact one could make the
case that bootstrap tilting is essentially one of the forerunners to
nonparametric EL methods. As we outline the procedure for bootstrap tilt-
ing we note that this is an approach developed by Efron (1981) for hypothesis
testing via bootstrap methodology with an eye towards improved Type I
error control of test procedures as compared to more straightforward boot-
strap resampling schemes.
The general bootstrap tilting approach is given as follows: Suppose we are interested in testing $H_0: \theta = \theta_0$, equivalently written in terms of the notation of statistical functionals as $H_0: t(F) = t(F_0)$ versus $H_1: t(F) > t(F_0)$, as opposed to confidence interval generation. The idea behind tilting is to use a nonparametric estimate of $F_0$ relative to maintaining the Type I error control, that is, minimize statistical functional-based distances such as $\big|\Pr\big(t(\hat{F}_0) > t(F_0)\,|\,H_0\big) - \alpha\big|$ under $H_0$ for a desired $\alpha$-level such as 0.05. This approach implies using a form of the empirical estimator constrained under $H_0$ of the form $F_{0,n}(x) = \sum_{i=1}^{n} w_{0,i}\, I\{X_i \le x\}$. The $w_{0,i}$'s are chosen to minimize the distance from $w_i = 1/n$, $i = 1, 2, \dots, n$, under the null hypothesis constraint $t(\hat{F}_0) = \theta_0$ and constraints on the weights of $\sum_{i=1}^{n} w_{0,i} = 1$ given all $w_{0,i} > 0$, $i = 1, 2, \dots, n$. One approach for determining the weights is via the
Kullback–Leibler distance, e.g., see Efron (1981). Other distance measures
such as those based on entropy concepts may also be used. The weights
obtained using the Kullback–Leibler distance-based approach are identical
to those obtained via the EL method described in this entry. Once the w0, i ’s
are determined the null distribution can be estimated via Monte Carlo resa-
mpling B times from F̂0 . In rare cases the null distribution may be obtained
directly, e.g. when the parameter of interest is the population quantile given
as θ = F −1 (u). The approximate bootstrap p-value estimated via Monte Carlo
resampling is given as #(θˆ *0 ' s > θ0 ) / B, where θ̂*0 denotes the estimator
obtained from a bootstrap resample from F̂0 . For the specific example

where $\theta = t(F) = \int x\,dF$ is the mean, the weights $w_{0,i}$ are chosen to minimize the Kullback–Leibler distance $D(w_0, w) = \sum_{i=1}^{n} w_{0,i}\log(n w_{0,i})$ subject to the constraints $\sum_{i=1}^{n} w_{0,i}x_i = \theta_0$, $\sum_{i=1}^{n} w_{0,i} = 1$, and all $w_{0,i} > 0$, $i = 1, 2, \dots, n$. Parametric
i=1 i=1
alternatives follow similarly to those described relative to confidence inter-
val generation given above, that is, resample from Fμˆ 0, where μ̂ 0 is estimated
under H 0 . An interesting example of use of the bootstrap tilting approach in
clinical trials is illustrated in Chuang and Lai (2000) relative to estimating
confidence intervals for trials with group sequential stopping rules where a
natural pivot does not exist. The R package tilt.boot (R package version 1.1–1.3)
provides a straightforward approach for carrying out these calculations.
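As a rough illustration of the tilting idea, the sketch below (ours, not from the text) uses the fact that a convenient numerical route for the constrained minimization above is to search over weights of the exponential form $w_{0,i} \propto \exp(\tau x_i)$, choosing $\tau$ so that the weighted mean equals $\theta_0$. The null value theta0 = 300, the seed, and B are arbitrary choices for demonstration, and the p-value is approximated by comparing the resampled means with the observed sample mean.

# Hedged R sketch (ours) of bootstrap tilting for H0: mean = theta0
x <- c(1397, 471, 64, 1571, 663, 1128, 719, 407,
       480, 1147, 362, 2022, 10, 175, 494, 31,
       202, 751, 30, 27, 118, 8, 181, 1432, 31, 18,
       105, 23, 8, 70, 12, 504, 252, 0, 100, 4,
       545, 226, 390, 230, 28, 5, 0, 176)
theta0 <- 300
n <- length(x)

# choose tau so that the tilted (weighted) mean equals theta0
tilted.mean <- function(tau) { w <- exp(tau * x); sum(w * x) / sum(w) }
tau.hat <- uniroot(function(tau) tilted.mean(tau) - theta0, c(-0.01, 0.01))$root
w0 <- exp(tau.hat * x); w0 <- w0 / sum(w0)   # tilted weights defining F0.hat

# Monte Carlo estimate of the null distribution and a one-sided p-value
B <- 5000
set.seed(321)
theta.star <- replicate(B, mean(sample(x, n, replace = TRUE, prob = w0)))
mean(theta.star > mean(x))                   # approximate bootstrap p-value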
12
Examples of Homework Questions

In this chapter we demonstrate examples of homework questions related to


our course. In order to solve several tasks shown below, students will be
encouraged to read additional relevant literature. It is suggested to start each
lecture class by answering students’ inquiries regarding the previously
assigned homework problems. In this manner, the material of the course can
be extended. In the following sections, in certain cases, we provide com-
ments regarding selected homework questions.

12.1 Homework 1
1. Provide an example when the random sequence $\xi_n \xrightarrow{\ p\ } \xi$, but $\xi_n$ does not satisfy $\xi_n \xrightarrow{\ a.s.\ } \xi$.
2. Using Taylor's theorem, show that $\exp(it) = \cos(t) + i\sin(t)$, where $i = \sqrt{-1}$ satisfies $i^2 = -1$.
3. Find $i^{i}$.
4. Find $\int_{0}^{\infty}\exp(-sx)\sin(x)\,dx$. (Hint: Use the rule $\int v\,du = vu - \int u\,dv$.)
5. Obtain the value of the integral $\int_{0}^{\infty}\dfrac{\sin(t)}{t}\,dt$.
6. Let the stopping time $N(H)$ have the form $N(H) = \inf\left\{n: \sum_{i=1}^{n} X_i \ge H\right\}$, where $X_1, X_2, \dots$ are iid random variables with $E(X_1) > 0$. Show that $EN(H) = \sum_{j=1}^{\infty} P\{N(H) \ge j\}$. Is the equation $EN(H) = \sum_{j=1}^{\infty} P\left\{\sum_{i=1}^{j} X_i < H\right\}$ correct? If not, what do we require to correct this equation?
7. Download and install the R software. Try to perform simple R pro-
cedures, e.g., hist, mean, etc.


12.2 Homework 2
1. Fisher information ɩ(.):
Let X ~ N (μ , 1), where μ is unknown. Calculate ɩ(μ).
Let X ~ N (0, σ 2 ), where σ 2 is unknown. Calculate ɩ(σ 2 ).
Let X ~ Exp(λ), where λ is unknown. Calculate ɩ(λ).
2. Let $X_i$, $i \ge 1$, be iid with $EX_1 = 0$. Show an example of a stopping rule N such that $E\left(\sum_{i=1}^{N} X_i\right) \ne 0$.
3. Assume $X_1,\dots,X_n$ are independent, $E(X_i) = 0$, $\mathrm{var}(X_i) = \sigma_i^2 < \infty$, $E|X_i|^3 < \infty$, $i = 1,\dots,n$. Prove the central limit theorem (CLT) based result regarding $\frac{1}{B_n}\sum_{i=1}^{n} X_i$, $B_n = \left(\sum_{i=1}^{n}\sigma_i^2\right)^{1/2}$, as $n \to \infty$. In this context, what are the conditions on $\sigma_i^2$ that we need to assume in order for the result to hold?
4. Suggest a method for using R to evaluate numerically (experimen-
tally) the theoretical result related to Question 3 above, employing
the Monte Carlo concept. Provide intuitive ideas and the relevant R
code.

12.3 Homework 3
1. Let the random variables $X_1,\dots,X_n$ be dependent and denote the sigma algebra as $\Im_n = \sigma\{X_1,\dots,X_n\}$. Assume that $E(X_1) = \mu$ and $E(X_i\,|\,\Im_{i-1}) = \mu$, $i = 2,\dots,n$, where $\mu$ is a constant. Please find $E\left(\prod_{i=1}^{n} X_i\right)$.
2. The likelihood ratio (LR) with a corresponding sigma algebra is
an H0-martingale and an H1-submartingale. Please evaluate in
this context the statistic log (LRn) based on dependent data points
X 1 ,… , X n, when the hypothesis H0 says X 1 ,… , X n ~ f0 versus H1 :
X 1 ,… , X n ~ f1, where f0 and f1 are joint density functions.

3. Assume $(X_n, \Im_n)$ is a martingale. What are $\left(|X_n|, \Im_n\right)$ and $\left(X_n^2, \Im_n\right)$?
4. Using the Monte Carlo method calculate numerically the value of $J = \int_{-\infty}^{\infty} \exp\left(-|u|^{1.7}\right) u^3 \sin(u)\,du$. Compare the obtained result with a result that can be conducted using the R function integrate. Find approximately the sample size N such that the Monte Carlo estimator $J_N$ of J satisfies $\Pr\left\{|J_N - J| > 0.001\right\} \cong 0.05$.
5. Using the Monte Carlo method, calculate numerically $J = \int_{-4}^{5}\int_{-\infty}^{\infty} \exp\left(-|u|^{1.7} - t^2\right) u^3 \sin(u\cos(t))\,dt\,du$, approximating this integral via the Monte Carlo estimator $J_N$. Plot values of $|J_N - J|$ against ten different values of N (e.g., N = 100, 250, 350, 500, 600, etc.), where N is the size of a generated sample that you employ to compute $J_N$ in the Monte Carlo manner.

12.4 Homework 4
The Wald martingale.

1. Let observations $X_1,\dots,X_n$ be iid. Denote $\phi(t)$ to be a characteristic function of $X_1$. Show that $W_j = \exp\left((-1)^{1/2}\, t \sum_{k=1}^{j} X_k\right)\{\phi(t)\}^{-j}$ is a martingale (here define an appropriate $\sigma$-algebra).


Goodness-of-fit tests:
2. Let observations X 1 ,… , X n be iid with a density function f. We would
like to test for H0 : f = f0 versus H1 : f ≠ f0 , where a form of f0 is known.
2.1 Show that in this statement of the problem there are no most
powerful tests.
2.2 Assume f0 is completely known; prove that to test for H0 is
equivalent to testing the uniformity of f.
2.3 Research one test for normality and explain a principle on
which the test is based.
Measurement model with autoregressive errors:
3. Assume we observe X1 ,… , X n from the model X i = μ + ei , ei = β ei −1 + ε i ,
e0 = 0, i = 1,… , n, where ε i , i = 1,....,n, are iid random variables with a
density function f and μ is a constant.
3.1 Derive the likelihood function based on the observations
X 1 ,… , X n.
3.2 Assuming ε i ~ N (0, 1), derive the most powerful test for H0 : μ = 0,
β = 0.5 versus H1 : μ = 0.5, β = 0.7 (the proof should be presented).
3.3 Assuming ε i ~ N (0,1), find the maximum likelihood estimators
of μ , β in analytical closed (as possible) forms.

3.4 Using Monte Carlo generated data (setup the true μ = 0.5, β = 0.7 ),
test for the normality of distributions of the maximum likeli-
hood estimators obtained above, for n = 5,7,10,12,25,50. Show
your conclusions regarding this Monte Carlo type study.

Comments. Regarding Question 2.2, we can note that if X 1 ,… , X n ~ F0 then


F0 (X 1) ,… , F0 (X n) ~ U (0,1). Regarding Question 2.3, it can be suggested to read
the Introduction in Vexler and Gurevich (2010). As an example, the following
test strategy based on characterization of normality can be demonstrated.
Assume we observe X 1 ,… , X n that are iid. In order to construct a test for
normality, Lin and Mudholkar (1980) used the fact that X 1 ,… , X n ~ N (μ , σ 2 )
if and only if the sample mean and the sample variance based on
X 1 ,… , X n are independently distributed. In this case one can define
$Y_i = n^{-1}\left(\sum_{j=1:\,j\ne i}^{n} X_j^2 - (n-1)^{-1}\left(\sum_{j=1:\,j\ne i}^{n} X_j\right)^{2}\right)$ and consider the sample correlation between $X_i - (n-1)^{-1}\sum_{j=1:\,j\ne i}^{n} X_j$ and $Y_i$, or between $X_i - (n-1)^{-1}\sum_{j=1:\,j\ne i}^{n} X_j$ and $Y_i^{1/3}$.
The H 0 -distribution of the test statistic proposed by Lin and Mudholkar
(1980) does not depend on (μ , σ 2 ). Then the corresponding critical values of
the test can be tabulated using the Monte Carlo method.
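A hedged R sketch (ours, not from the text) of this type of statistic, with its null distribution tabulated by Monte Carlo, might look as follows; the sample size n = 30 and the number of replications are illustrative choices.

# Hedged R sketch (ours) of a Lin and Mudholkar type statistic and its Monte Carlo critical value
lm.stat <- function(x) {
  n <- length(x)
  y <- sapply(seq_len(n), function(i) {
    xi <- x[-i]
    (sum(xi^2) - sum(xi)^2 / (n - 1)) / n    # leave-one-out dispersion Y_i
  })
  u <- sapply(seq_len(n), function(i) x[i] - mean(x[-i]))
  cor(u, y^(1/3))                            # correlation with Y_i^(1/3)
}

set.seed(1)
n <- 30
null.stats <- replicate(10000, lm.stat(rnorm(n)))  # H0-distribution, free of (mu, sigma^2)
quantile(abs(null.stats), 0.95)                    # e.g., a two-sided 5% critical value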

12.5 Homework 5
Two-sample nonparametric testing.

1. Assume we have iid observations X 1 ,… , X n ~ FX that are indepen-


dent of iid observations Y1 ,… , Ym ~ FY , where the distribution func-
tions FX , FY are unknown. We would like to test for FX = FY , using the
ranks-based test statistic $G_{nm} = \left|\dfrac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} I\{X_i \ge Y_j\} - \dfrac{1}{2}\right|$. For large values of this statistic, we reject the null hypothesis.


1.1. What is a simple method that can be used to obtain critical
values of the test? (Hint: Use the Monte Carlo method, generating
samples of X 1 ,… , X n ~ FX and Y1 ,… , Ym ~ FY to calculate approxi-
mate values of C0.05 : PrFX = FY {Gnm ≥ C0.05} = 0.05 based on
X , Y ~ N (0, 1); X , Y ~ Unif (−1, 1); X , Y ~ N (100, 7) with, e.g., n = m = 20.
Make a conclusion and try to prove theoretically your guess.)

1.2. Conduct simulations to obtain the Monte Carlo powers of the


test via the following scenarios:

n = m = 10, X ~ N (0, 1), Y ~ Unif [−1, 1]; n = m = 15, X ~ N (0, 1), Y ~ Unif [−1, 1];

n = m = 40, X ~ N (0, 1), Y ~ Unif [−1, 1]; n = 10, m = 30, X ~ N (0, 1), Y ~ Unif [−1, 1];

n = 30, m = 10, X ~ N (0, 1), Y ~ Unif [−1, 1]; n = m = 10, X ~ Gamma(1, 2), Y ~ Unif [0, 5]

at α = PrFX = FY {Gnm ≥ C0.05} = 0.05 (please formulate your results in


the form of a table).
Power calculations:
2. Consider the simple linear regression model

Yi = βX i + ε i , ε i iid ~ N (0,1), i = 1,… , n.

Assume that in a future study we plan to test for β = 0 versus β ≠ 0,


applying the corresponding maximum likelihood ratio test MLRn
based on (Y|X) (here, the symbol “|” means “given”). X’s are expected
to be close to be from the Unif(-1,1) distribution. To design the future
study, please fix the Type I error rate as 0.05 and fill out the following
table, presenting the Monte Carlo powers of the test with respect to
the situations shown in the table below.
Expected Effect β

β = 0.1 β = 0.5 β =1
n = 10
n = 20
n = 50
n = 100

Sequential Probability Ratio Test (SPRT):


3. Assume that we survey sequentially iid observations Y1 , Y2 ….Then
answer the following:
3.1 Write the SPRT for H 0 : Y ~ N (0,1) versus H 1 : Y ~ N (0.5,1).
3.2 Calculate values of the corresponding thresholds to use in the
SPRT, provided that we want to preserve the Type I and II errors
rates as α = 0.05, β = 0.15, respectively.
3.3 Using Monte Carlo techniques evaluate the expectation of the
corresponding stopping time under the null hypothesis H0 as
well as under the alternative hypothesis H1.
3.4 Suppose that we observe retrospectively iid data points X 1 ,… , X n
and the likelihood ratio (LR) based on X 1 ,… , X n is applied to test
for H 0 : Y ~ N (0,1) versus H 1 : Y ~ N (0.5,1) at α = 0.05 . What

should the sample size n be to obtain the power 1 − β = 1 − 0.15?


(Hint: You can use the CLT to obtain the critical value of
the log(LR) test statistic as well as to evaluate the needed n
under H1.)
3.5 Compare the results of 3.4 with 3.3.

Comments. Regarding Question 1.1, one can note that

$$\Pr_{F_X=F_Y}\{G_{nm} \ge C_{0.05}\} = \int I\left\{\left|\frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} I\{X_i \ge Y_j\} - \frac{1}{2}\right| \ge C_{0.05}\right\} d\Psi_{nm},$$

where $\Psi_{nm}$ is the joint distribution function of the random variables $I\{X_i \ge Y_j\}$, $i = 1,\dots,n$, $j = 1,\dots,m$. Under the hypothesis $F_X = F_Y$, we have $I\{X_i \ge Y_j\} = I\{F_X(X_i) \ge F_Y(Y_j)\}$, $i = 1,\dots,n$, $j = 1,\dots,m$, and then their joint distribution does not depend on $F_X = F_Y$. Then the test statistic is exact, meaning that its distribution is independent of the underlying data distribution, under the hypothesis $F_X = F_Y$. Regarding Question 3.4, we suggest application of the Monte Carlo approach. One can also use the distribution function of the log-likelihood ratio $\sum_{i=1}^{n}\xi_i$, where $\xi_i = \log\left(\dfrac{f_{H_1}(X_i)}{f_{H_0}(X_i)}\right)$, $i = 1,\dots,n$, with $f_{H_1}$ and $f_{H_0}$ the density functions related to $N(0.5,1)$ and $N(0,1)$, respectively. Alternatively, defining the test threshold as C, we note that

$$\Pr_{H_0}\left\{\frac{\sum_{i=1}^{n}\xi_i - nE_{H_0}(\xi_1)}{n^{1/2}\big(\mathrm{Var}_{H_0}(\xi_1)\big)^{1/2}} \ge \frac{C - nE_{H_0}(\xi_1)}{n^{1/2}\big(\mathrm{Var}_{H_0}(\xi_1)\big)^{1/2}}\right\} = 0.05 \quad\text{and}$$

$$\Pr_{H_1}\left\{\frac{\sum_{i=1}^{n}\xi_i - nE_{H_1}(\xi_1)}{n^{1/2}\big(\mathrm{Var}_{H_1}(\xi_1)\big)^{1/2}} \ge \frac{C - nE_{H_1}(\xi_1)}{n^{1/2}\big(\mathrm{Var}_{H_1}(\xi_1)\big)^{1/2}}\right\} = 0.85.$$

Then, using the CLT, we can derive n and C, solving the equations

$$1 - \Phi\left(\frac{C - nE_{H_0}(\xi_1)}{n^{1/2}\big(\mathrm{Var}_{H_0}(\xi_1)\big)^{1/2}}\right) = 0.05, \quad 1 - \Phi\left(\frac{C - nE_{H_1}(\xi_1)}{n^{1/2}\big(\mathrm{Var}_{H_1}(\xi_1)\big)^{1/2}}\right) = 0.85,$$

where $\Phi(u) = (2\pi)^{-1/2}\int_{-\infty}^{u} e^{-z^{2}/2}\,dz$. It can be suggested to compare these three approaches.
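For the specific hypotheses of Question 3, a small R sketch (ours, not from the text) of the third, CLT-based calculation of n and C is given below; here $\xi_1 = 0.5X_1 - 0.125$, so $E_{H_0}(\xi_1) = -0.125$, $E_{H_1}(\xi_1) = 0.125$, and $\mathrm{Var}_{H_0}(\xi_1) = \mathrm{Var}_{H_1}(\xi_1) = 0.25$.

# Hedged R sketch (ours) of the CLT-based choice of n and C for N(0,1) versus N(0.5,1)
e0 <- -0.125; e1 <- 0.125; v <- 0.25
alpha <- 0.05; power <- 0.85
f <- function(n) {
  C0 <- n * e0 + qnorm(1 - alpha) * sqrt(n * v)   # threshold giving size alpha
  C1 <- n * e1 + qnorm(1 - power) * sqrt(n * v)   # threshold giving the target power
  C0 - C1                                         # the two thresholds must coincide
}
n.star <- uniroot(f, c(1, 1000))$root
n <- ceiling(n.star)
C <- n * e0 + qnorm(1 - alpha) * sqrt(n * v)
c(n = n, C = C)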

12.6 Homeworks 6 and 7


In order to work on the tasks presented in this section, students should use
real-world data. In our course, we consider data from a study evaluating bio-
markers related to atherosclerotic coronary heart disease. In this study, free
radicals have been implicated in the atherosclerotic coronary heart disease
process. Well-developed laboratory methods may grant an ample number of
biomarkers of individual oxidative stress and antioxidant status. These
markers quantify different phases of the oxidative stress and antioxidant sta-
tus process of an individual. A population-based sample of randomly
selected residents of Erie and Niagara counties of the state of New York,
USA., 35–79 years of age, was the focus of this investigation. The New York
State Department of Motor Vehicles driver’s license rolls were utilized as the
sampling frame for adults between the ages of 35 and 65; the elderly sample
(age 65–79) was randomly selected from the Health Care Financing Admin-
istration database. A cohort of 939 men and women were selected for the
analyses yielding 143 cases (individuals with myocardial infarction, MI=1)
and 796 controls (MI=0). Participants provided a 12-hour fasting blood spec-
imen for biochemical analysis at baseline, and a number of parameters were
examined from fresh blood samples.
We evaluate measurements related to the biomarker TBARS (see for details
Schisterman et al., 2001).

Homework 6

1. Using the nonparametric test defined in Section 12.5, Question 1, test


the hypothesis that a distribution function of TBARS measurements
related to MI=0 is equal to a distribution function of TBARS mea-
surements related to MI=1.
2. Please provide a test for normality based on measurements of TBARS
that correspond to MI=0, MI=1, and the full data, respectively. Plot
relevant histograms and Q-Q plots.
3. Consider the power (Box–Cox) transformation of TBARS measure-
ments, that is,
$$h(X, \lambda) = \begin{cases} \mathrm{sign}(X)\,\dfrac{|X|^{\lambda} - 1}{\lambda}, & \lambda \ne 0, \\ \mathrm{sign}(X)\log|X|, & \lambda = 0, \end{cases}$$

where X 1 ,… , X n are observations and {h(X 1 , λ),… , h(X n , λ)} is the


transformed data.
Assume h(X , λ) ~ N (μ , σ 2 ), where the parameters λ , μ , σ 2 are
unknown.
3.1 Calculate values of the maximum likelihood estimators $\hat{\lambda}_0$, $\hat{\lambda}_1$, $\hat{\lambda}_f$
of λ based on the data with MI=0, MI=1, and the full data,
respectively (using numerical calculations).
3.2 Calculate the asymptotic 95% confidence intervals for λˆ 0 , λˆ 1 , λˆ f .
3.3 Test transformed observations h(TBARS, λˆ 0 ) (on data with MI=0),
h(TBARS, $\hat{\lambda}_1$) (on data with MI=1), and h(TBARS, $\hat{\lambda}_f$) (on the full
data) for normality, e.g., applying the Shapiro–Wilk test.
4. Calculate a value of the 2log maximum likelihood ratio

$$2\log\left(\prod_{MI=0} f\big(h(TBARS, \hat{\lambda}_0)\big)\prod_{MI=1} f\big(h(TBARS, \hat{\lambda}_1)\big)\Big/\prod_{MI=0,1} f\big(h(TBARS, \hat{\lambda}_f)\big)\right)$$

to test for the hypothesis from Question 1. Compare results related


to Questions 1 and 4.

Comments. Regarding Question 3.1, we suggest obtaining the estimators of


μ and σ 2 in explicit forms, whereas the estimator of λ can be derived by
using a numerical method, e.g., via the R built-in operator uniroot or optimize.
The corresponding schematic R code can have the form

G <- function(lambda){
  # X: vector of observations, n = length(X); h() is the Box-Cox type transform above
  h <- sign(X) * (abs(X)^lambda - 1) / lambda
  mu.hat <- sum(h) / n
  sigma2.hat <- sum((h - mu.hat)^2) / n
  # profile log-likelihood of lambda with mu and sigma^2 replaced by their MLEs;
  # the constant factor exp(-n/2) does not affect the maximizer
  log((2 * pi * sigma2.hat)^(-n / 2) * exp(-n / 2))
}
GV <- Vectorize(G)
lambda.hat <- optimize(GV, interval = c(-2, 2), maximum = TRUE)$maximum  # search interval chosen for illustration

Homework 7
Let us perform the following Jackknife-type procedure.
Denote TBARS measurements related to MI=1 as Data A with the sample
size N A and TBARS measurements related to MI=0 as Data B with the sample
size N B.

Consider the next algorithm:


Define n = 0.
Step 1: Randomly select N A − n observations from Data A; and N B − n
observations from Data B. (The R operator sample can be used in this
step.)
Step 2: Use the selected observations in Step 1 to calculate values of the
maximum likelihood estimators λˆ 0 , λˆ 1 of λ (see Homework 6, Ques-
tion 3) and then test transformed observations h(TBARS, λˆ 0 ) (based
on the selected sample from Data A) and h(TBARS, λˆ 1 ) (based on the
selected sample from Data B) for normality, e.g., applying the
Shapiro–Wilk test. Record the corresponding p-values as p0 and p1,
respectively.
Step 3: Repeat Steps 1–2 5000 times, obtaining average values based on
the $p_0$'s and $p_1$'s, say $P^{A}_{N_A-n} = \bar{p}_0$ and $P^{B}_{N_B-n} = \bar{p}_1$.
Redefine n = n + 10 and perform Steps 1–3.
The outputs of the procedure above are the average p-values $P^{A}_{N_A}, P^{A}_{N_A-10}, P^{A}_{N_A-20}, P^{A}_{N_A-30},\dots$ and $P^{B}_{N_B}, P^{B}_{N_B-10}, P^{B}_{N_B-20}, P^{B}_{N_B-30},\dots$ that can be plotted against the sample sizes $N_A, N_A-10, N_A-20, N_A-30,\dots$. This can be used to evaluate models of functional dependencies between $P^{k}_{u}$ and u, where k = A, B. For example, we can estimate the models $P^{k}_{u} = \dfrac{\exp(a_k + b_k u)}{1 + \exp(a_k + b_k u)}$, where $a_k$ and $b_k$ are coefficients, k = A, B.
One can employ the obtained models to extrapolate values of $P^{k}_{u}$ for $u > N_k$, k = A, B. In this case, assume that $\alpha'$ denotes a fixed level. Then we can consider the scheme below.
If $P^{k}_{N_k} > \alpha'$ and the extrapolated values of $P^{k}_{u}$ increase, we can conclude that probably while increasing the sample size we could not reject the corresponding null hypothesis.
If $P^{k}_{N_k} > \alpha'$ and the extrapolated values of $P^{k}_{u}$ decrease, we can forecast the sample size $u_0: \left\{1 + \exp(-a_k - b_k u_0)\right\}^{-1} = \alpha'$, when we could expect to reject the corresponding null hypothesis.


The procedure demonstrated above can be easily modified for application
as an experimental data-driven sample size calculation method in various
practical situations.
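A compact R sketch (ours, not from the text) of the above procedure for one group is given below; dataA is a placeholder for the TBARS measurements with MI = 1, and the grid of n values, the search interval for λ, and the number of repetitions are illustrative choices, with the logit-scale regression serving only as a crude way to estimate a_k and b_k.

# Hedged R sketch (ours) of the subsampling and extrapolation procedure for one group
bc <- function(x, lam) if (abs(lam) > 1e-8) sign(x) * (abs(x)^lam - 1) / lam else sign(x) * log(abs(x))
lam.hat <- function(x) {
  prof <- function(lam) { h <- bc(x, lam); -length(x) / 2 * log(mean((h - mean(h))^2)) }
  optimize(prof, c(-2, 2), maximum = TRUE)$maximum
}
avg.pvalue <- function(x, n.drop, nrep = 5000) {
  mean(replicate(nrep, {
    xs <- sample(x, length(x) - n.drop)          # Step 1: random subsample
    shapiro.test(bc(xs, lam.hat(xs)))$p.value    # Step 2: Shapiro-Wilk p-value after transformation
  }))
}

drops <- seq(0, 40, by = 10)
pbar  <- sapply(drops, function(d) avg.pvalue(dataA, d))   # Step 3: average p-values
sizes <- length(dataA) - drops
fit <- lm(log(pbar / (1 - pbar)) ~ sizes)                  # crude estimates of a_k and b_k
coef(fit)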
13
Examples of Exams

In this chapter we demonstrate examples of midterm, final, and qualifying


exams. In the following sections, in certain cases, we provide comments
regarding a subset of selected exam questions.

13.1 Midterm Exams

Example 1:
11 AM–12:20 PM
This exam has 4 questions, some of which have subparts. Each ques-
tion has an indicated point value. The total points for the exam are 100
points.

Show all your work. Justify all answers.


Good luck!

Question 1 (30 points) Characteristic functions


Provide a proof about the one-to-one mapping proposition related to the
result: “For every characteristic function there is a corresponding unique
distribution function.” (Hint: Show that for all random variables its
characteristic function φ exists. Formulate and prove the inversion theo-
rem that shows F( y ) − F( x) is equal to a functional of φ(t), where F and φ are
the distribution and characteristic functions, respectively. Consider cases
when we can operate with dF(u) du and when we cannot use dF(u) du. If
you need, derive the characteristic function of a normally distributed ran-
dom variable.)


Question 2 (30 points)


Assume data points X 1 ,… , X n are from a joint density function f ( x1 ,… , xn ; θ),
where θ is an unknown parameter. Suppose we want to test for H 0 : θ = 0
versus H 1 : θ ≠ 0 then complete the following:

2.1 Write the maximum likelihood ratio MLn and formulate the maxi-
mum likelihood ratio test. Write the corresponding Bayes factor type
likelihood ratio, BLn , test statistic. Show an optimality of BLn in the
context of “most powerful decision rules” and interpret the
optimality.
2.2 Derive the asymptotic distribution function of the statistic MLn,
under H0. How can this asymptotic result about the MLR test be
used in practice?

Question 3 (20 points)


Consider the test statistics MLn and BLn from Question 2. Under H 0 , are they
a martingale, submartingale, or supermartingale? Can you provide relevant
conclusions regarding the expected performance of the tests based on MLn
and BLn test statistics?

Question 4 (20 points)


Formulate the SPRT (Wald sequential test), evaluating the Type I and II errors
rates of the test.

Example 2:
11 AM–12:20 PM
This exam has 4 questions, some of which have subparts. Each question
has an indicated point value. The total points for the exam are 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (20 points) Characteristic functions


1.1 Define the characteristic function φ(t) of a random variable X. Show
φ(t) exists, for all X and t. Assume X has a normal density function
$f(u) = \left(2\pi\sigma^{2}\right)^{-1/2}\exp\left(-u^{2}/(2\sigma^{2})\right)$. What is a form of φ(t)? Assume X
has a density function f (u) = exp( − u)I (u > 0). What is a form of φ(t)?

1.2 Let $X_1,\dots,X_n$ be iid with $EX_1 = \mu$ ($E|X_1|^{1+\varepsilon} < \infty$, $\varepsilon > 0$ is not required). Prove that $(X_1 + \dots + X_n)/n \to \mu$ as $n \to \infty$.

Question 2 (30 points)


See Question 2, Section 13.1, Example 1.

Question 3 (15 points)


3.1 Provide the definition of a stopping time, presenting an example of
a stopping time (prove formally that your statistic is a stopping
time).
3.2 Let v and u be stopping times. Are min(v −1, u), max(v , u), u + 1 stop-
ping times? (Relevant proofs should be shown.)

Question 4 (35 points)


4.1 Formulate and prove the optional stopping theorem for martingales
(provide formal definitions of objects that you use in this theorem).
4.2 Prove the Doob theorem (inequality) based on nonnegative martin-
gales.
4.3 Sequential change point detection. Suppose we observe sequentially
iid measurements X 1 , X 2 …. To detect a change point in the data
distribution (i.e., to test for H 0 : X 1 , X 2 … ~ f0 vs. H 1 : X 1 , X 2 ,… ,
X ν− 1 ~ f0 ; X ν , X ν+ 1 ,… ~ f1 , where the densities f0 , f1 are known, the
change point ν is unknown) we use the stopping time
$N(H) = \inf\left\{n: SR_n = \sum_{k=1}^{n}\prod_{i=k}^{n}\dfrac{f_1(X_i)}{f_0(X_i)} > H\right\}$ (we stop and reject $H_0$). Derive the lower bound for the expression $E_{H_0}\{N(H)\}$. Why do we need this inequality in practice?

Example 3:
11 AM–12:20 PM
This exam has 4 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (37 points) Characteristic functions


See Question 1, Section 13.1, Example 1.

Question 2 (19 points)


Assume data points X 1 ,… , X n are iid with a density function f ( x|θ) , where
θ is an unknown parameter. Define the maximum likelihood estimator of
θ to be θ̂n , which is based on X 1 ,… , X n. Derive the asymptotic distribution
of $n^{0.5}\left(\hat{\theta}_n - \theta\right)$ as $n \to \infty$. Show the relevant conditions and the correspond-
ing proof.

Question 3 (25 points)


Assume data points X 1 ,… , X n are from a joint density function f ( x1 ,… , xn ; θ),
where θ is an unknown parameter. Suppose we want to test H 0 : θ = 0 versus
H 1 : θ ≠ 0. Complete the following:

3.1 Write the Bayes factor type likelihood ratio BLRn; Show the optimal-
ity of BLRn in the context of “most powerful decision rules.”
3.2 Find the asymptotic ( n → ∞ ) distribution function of the statistic
MLRn, the maximum likelihood ratio, under H0.

Question 4 (19 points)


See Question 4.1, Section 13.1, Example 2.

Example 4:
11 AM–12:20 PM
This exam has 3 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (41 points) Characteristic functions


See Question 1, Section 13.1, Example 1.

Question 2 (36 points)


Let $\varepsilon_1,\dots,\varepsilon_d$ be iid random variables from N(0,1). Assume we have data points $X_1,\dots,X_n$ that are iid observations and $X_1$ has a density function f that has the form of the density function of $\sum_{i=1}^{d}(\varepsilon_i)^2$, where the integer parameter d is unknown and $d \le 10$.
d is unknown and d ≤ 10.

2.1 Write the maximum likelihood ratio (MLR) statistic, MLRn, and
formulate the MLR test for H 0 : EX 1 = θ0 versus H 1 : EX 1 ≠ θ0, where
θ0 is known. (Hint: Use characteristic functions to derive the form
of f .)
2.2 Propose an integrated most powerful test for H 0 : EX 1 = θ0 versus
H 1 : EX 1 ≠ θ0 (θ0 is known), provided that we are interested in obtain-
ing the maximum integrated power of the test when values of the
alternative parameter can have all possible values with no prefer-
ence. (Hint: Use a Bayes factor type procedure.). Formally prove that
your test is integrated most powerful.

Question 3 (23 points)


(Here provide formal definitions of objects that you use to answer the fol-
lowing questions, e.g., the definition of a martingale.)
See Questions 4.1–4.2, Section 13.1, Example 2.

Example 5:
11 AM–12:20 PM
This exam has 4 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (17 points) Characteristic functions


See Question 1, Section 13.1, Example 2.

Question 2 (19 points)


See Question 2, Section 13.1, Example 3.

Question 3 (27 points)


Assume data points X 1 ,… , X n are from a joint density function f ( x1 ,… , xn ; θ),
where θ is an unknown parameter. Suppose we want to test for H 0 : θ = 0
versus H 1 : θ ≠ 0 then answer the following:

3.1 See Question 3.1, Section 13.1, Example 3.


3.2 Assume the observations are iid random variables with an unknown
distribution. Define the corresponding empirical likelihood ratio
test for H 0 : θ = 0 versus H 1 : θ ≠ 0, where θ = EX 1. Show (and prove)
the asymptotic H 0 -distribution of the 2log of the empirical likeli-
hood ratio.

Question 4 (37 points)


4.1 Let v and u be stopping times. Are min(v −1, u), max(v , u), u + 1,
u/10, 10u stopping times?
4.2–4.3 See Questions 4.1, 4.3, Section 13.1, Example 2.
4.4 Assume iid random variables X 1 > 0, X 2 > 0,.... Defining the inte-
ger random variable $N(H) = \inf\left\{n: \sum_{i=1}^{n} X_i > H\right\}$, show: (1) N(H) is a stopping time; (2) the non-asymptotic upper bound for EN(H).

Example 6:
11 AM–12:20 PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (35 points) Characteristic functions


See Question 1, Section 13.1, Example 1.

Question 2 (8 points)
Let the statistic LRn define the likelihood ratio to be applied to test for H 0
versus $H_1$. Prove that $f_{H_1}^{LR}(u)/f_{H_0}^{LR}(u) = u$, where $f_{H}^{LR}$ denotes the density function of $LR_n$ under $H = H_0$ or $H = H_1$, respectively.

Question 3 (21 points)


Let ξ 1 ,… , ξ d be iid random variables from N (0,1), η1 ,… , ηd denote iid random
variables from N (0,1), and ω 1 ,… , ω d be iid random variables that just have
the two values 0 and 1 with Pr {ω 1 = 0} = 0.5. Let the random variables ξ1 , … , ξ d,
η1 , … , ηd and ω 1 , … , ω d be independent and ε i = ω i ξ i + (1 − ω i )ηi , i = 1, … , d.
Assume we have data points $X_1,\dots,X_n$ that are iid observations and $X_1$ has a density function f that has the form of the density function of $\sum_{i=1}^{d}(\varepsilon_i)^2$, where the integer parameter d is unknown and $d \le 10$.
where the integer parameter d is unknown and d ≤ 10.

3.1 Write the maximum likelihood ratio (MLR) statistic, MLRn, and
formulate the MLR test for H 0 : EX 1 = θ0 versus H 1 : EX 1 ≠ θ0, where
θ0 is known. (Hint: Use characteristic functions to derive the form
of f .)
3.2 Propose, giving the corresponding proof, the integrated most pow-
erful test for H 0 : EX 1 = θ0 versus H 1 : EX 1 ≠ θ0 (here θ0 is known),
provided that we are interested to obtain the maximum integrated
power of the test when values of the alternative parameter can have
all possible values with no preference.

Question 4 (27 points)


(Here provide the formal definitions of objects that you use to answer the
following questions, e.g., the definition of a martingale.)
4.1–4.2 See Questions 4.1–4.2, Section 13.1, Example 2.
4.3 Suppose iid random variables X1 > 0,… , X n > 0,... have EX1 = Var(X1 ) = 1.
Define

$$\tau = \inf\left\{n > 0: \sum_{i=1}^{n} X_i \ge n - n^{0.7}\right\}.$$

Is τ a stopping time? (Hints: consider the event $\{\tau \ge N\}$ via an event based on $\sum_{i=1}^{N} X_i$, use the Chebyshev inequality to analyze the needed probability, and let $N \to \infty$.)

Question 5 (9 points)
Define $\tau(H) = \inf\left\{n > 0: \sum_{i=1}^{n} X_i \ge H\right\}$, where iid random variables $X_1 > 0,\dots,X_n > 0,\dots$ have $EX_1 = a$. Show how to obtain a non-asymptotic upper bound for $E\{\tau(H)\}$ that is linear with respect to H as $H \to \infty$.

Comments. Regarding Question 4.3, we have

$$\Pr(\tau \ge N) = \Pr\left(\max_{n \le N}\left(\sum_{i=1}^{n} X_i - n + n^{0.7}\right) < 0\right) \le \Pr\left(\sum_{i=1}^{N}(1 - X_i) > N^{0.7}\right) \le \frac{E\left(\sum_{i=1}^{N}(1 - X_i)\right)^{2}}{N^{1.4}} = \frac{N}{N^{1.4}} \underset{N\to\infty}{\longrightarrow} 0.$$

Example 7:
11 AM–12:20 PM
This exam has 4 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Questions 1 (42 points) Characteristic functions


1.1 See Question 1, Section 13.1, Example 1.
1.2 Assume f ( x) and φ(t) are the density and characteristic functions of
a random variable, respectively. Let φ(t) be a real function (φ(t) has
real values for real values of t). Using a proposition that you should
show with the proof to answer Question 1.1, prove that


$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\cos(tx)\,\phi(t)\,dt,$$

provided that φ(t) is an integrable function.

Question 2 (17 points)


2.1 See Question 2, Section 13.1, Example 6.
2.2 See Question 2, Section 13.1, Example 3.

Question 3 (17 points)


3.1. Let X 1 ,… , X n be iid observations. Suppose we are interested in test-
ing for H 0 : X 1 ~ f0 versus H 1 : X 1 ~ f0 (u)exp (θ1u + θ2 ) , where f0 (u) is a
density function, and the H 1-parameters θ1 > 0, θ2 are unknown. In
this case, propose the appropriate most powerful test, justifying the
proposed test’s properties.
3.2. See Question 3.2, Section 13.1, Example 3.

Question 4 (24 points)


4.1 Formulate and prove the optional stopping theorem.
4.2 Suppose we observe sequentially X 1 , X 2 , X 3 … that are iid data points
from the exponential distribution F(u) = 1 − exp (− u / λ) , u ≥ 0; F(u) = 0, u < 0 .
To observe one X we should pay A dollars depending on values of λ.
Suppose we want to estimate λ using the maximum likelihood esti-
mator λ̂ n based on n observations.
4.2.1 Propose an optimal procedure to minimize the risk function

$L(n) = A\lambda^{4} n + E\left(\hat{\lambda}_n - \lambda\right)^{2}$.
4.2.2 Show a non-asymptotic upper bound of $E\left(\sum_{i=1}^{N} X_i\right)$, where N is a length (a number of needed observations) of the procedure defined in the context of Question 4.2.1. The obtained form of
the upper bound should be a direct function of A and λ. Should
you use propositions, these propositions must be proven.

Comments. Regarding Question 1.2, we have $\phi(t) = E\left(e^{it\xi}\right) = E\{\cos(t\xi)\}$. Then

$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\,\phi(t)\,dt = \frac{1}{2\pi}\,E\int_{-\infty}^{\infty}\left(\cos(tx) - i\sin(tx)\right)\cos(t\xi)\,dt,$$

where $\int_{-\infty}^{\infty}\sin(tx)\cos(t\xi)\,dt = 0$, since $\sin(-tx)\cos(-t\xi) = -\sin(tx)\cos(t\xi)$.

Example 8:
11 AM–12:20 PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (29 points) Characteristic functions


1.1 See Question 1, Section 13.1, Example 1.
1.2 Let X and Y be independent random variables. Assume X − Y and
X + Y are identically distributed. Show that the characteristic func-
tion of Y is a real-valued function.

Question 2 (21 points)


See Question 4.2, Section 13.1, Example 7.

Question 3 (18 points)


Assume we observe iid data points X 1 ,… , X n and X 1 is from a density func-
tion f ( x). Suppose we want to test H 0 : f = f0 versus H 1 : f ( x) = f0 (u)exp(θu + φ),
where θ > 0.
If the alternative parameters θ and φ are known, write the corresponding
most powerful test statistic, providing a relevant proof.

Question 4 (15 points)


4.1 Let iid observations X 1 ,… , X n be from a N (θ,1) distribution. Suppose
we want to test for H 0 : θ = θ0 versus H 1 : θ = θ1, using the most powerful
test statistic. Provide an explicit formula to calculate the expected p-value
related to the corresponding test, showing the corresponding proof.
4.2 Let TSn define a likelihood ratio. Obtain the likelihood ratio based on
TSn, showing a needed proof.

Question 5 (17 points)


5.1 Write the definition of a stopping time and provide an example.
Write the definition of a martingale and provide an example.
5.2 Formulate and prove the optional stopping theorem.

Comments. Regarding Question 1.2, we have E exp (it (X − Y)) = E exp (it (X + Y)).
Then E exp (it (−Y)) = E exp (it (Y)) . Thus E (cos (tY) − i sin (tY)) = E (cos (tY) + i sin (tY)).
This implies E (sin (tY)) = 0 .

Example 9:
11 AM–12:20 PM
This exam has 6 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (33 points)


See Question 1, Section 13.1, Example 1.

Question 2 (7 points)
See Question 2, Section 13.1, Example 6.

Question 3 (9 points)
Assume that a random variable ξ has a gamma distribution with density
function given as
$$f(u;\alpha,\gamma) = \begin{cases} \alpha^{\gamma} u^{\gamma-1} e^{-\alpha u}/\Gamma(\gamma), & u \ge 0, \\ 0, & u < 0, \end{cases}$$

where $\alpha > 0$, $\gamma$ are parameters, and $\Gamma(\gamma)$ is the Euler gamma function $\int_{0}^{\infty} u^{\gamma-1}e^{-u}\,du$. Derive an explicit form of the characteristic function of ξ. (Hint: The approach is similar to that for normally distributed random variables.)

Question 4 (15 points)


See Question 3.1, Section 13.1, Example 6.
(Hint: To give a complete answer, you can use the characteristic func-
tion based method to derive the form of f , a density function of the
observations.)

Question 5 (27 points)


See Question 4, Section 13.1, Example 6.

Question 6 (9 points)
See Question 5, Section 13.1, Example 6.

13.2 Final Exams

Example 1:
11:45 AM–02:45 PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (25 points)


1.1 Describe the basic components of receiver operating characteristic
curve analyses based on continuous measurements of biomarkers.
1.2 Assume a measurement model with autoregressive error terms of
the form

X i = μ + εi , εi = β εi−1 + ξi , ξ0 = 0, i = 1,… , n,

where ξ i , i = 1,… , n, are independent identically distributed


(iid) random variables with the distribution function


$\left(2\pi\sigma^{2}\right)^{-1/2}\int_{-\infty}^{u}\exp\left(-z^{2}/(2\sigma^{2})\right)dz$; $\mu$, $\beta$, $\sigma$ are parameters. Suppose we
are interested in the maximum likelihood ratio test for
H 0 : μ = 0, β = 0, σ = 1 versus H 1 : μ ≠ 0, β ≠ 0, σ = 1. Write the maxi-
mum likelihood ratio test in an analytical closed (as possible) form.
Provide a schematic algorithm to evaluate the power of the test

based on a Monte Carlo study, when n is fixed, and we focus on the


test power, when μ = 0.1, β = 0.5.

Question 2 (25 points). Formulate and prove the following


propositions
2.1 The optional stopping theorem for martingales;
2.2 The Doob theorem (inequality) based on nonnegative martingales;
2.3 The Ville and Wald inequality.

Question 3 (20 points)


3.1 Write a definition of a stopping time, providing an example of a sta-
tistical inference based on a stopping time.
3.2 Let v and u be stopping times. Are min( v − 1, u), max( v , u), u + 2
stopping times (provide relevant proofs)?
3.3 Prove the Wald theorem regarding the expectation of a random sum
of iid random variables. Can you use the Wald theorem to calculate
⎛ τ− 1 ⎞
E⎜ ∑ X i⎟ , where X 1 ,… , X n are iid with EX 1 = 1 and τ is a stopping
⎝ i=1 ⎠
time with Eτ = 25?

Question 4 (15 points)


Prove the following propositions.

4.1 Let $X_1,\dots,X_n$ be iid with $EX_1 = \mu$, $\mathrm{Var}(X_1) = \sigma^2$, $E|X_1|^3 < \infty$. Then the statistic $(X_1 + \dots + X_n - a_n)/b_n$ has asymptotically (as $n \to \infty$) a stan-


dard normal distribution, where please define the forms of an and bn.
4.2 See Question 1.2, Section 13.1, Example 2.

Question 5 (15 points)


Let X 1 ,… , X n be iid observations with EX 1 = μ . Suppose that we are inter-
ested in testing H 0 : μ = 0 vs. H 1 : μ ≠ 0 . In this case, provide the corresponding
empirical likelihood ratio test. Derive a relevant asymptotic proposition
about this test that can be used in order to control the Type I error rate of
the test.

Example 2:
03:30 PM–05:00 PM
This exam has 4 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (40 points)


See Question 1, Section 13.1, Example 1.

Question 2 (35 points)


Assume data points X 1 ,… , X n are iid. Suppose we are interested in testing
the hypothesis H 0 : EX 1 = θ0 versus H 1 : EX 1 ≠ θ0, where θ0 is known.

2.1 Let X ~ Gamma(α , β ). Derive the maximum likelihood ratio (MLR)


test statistic MLn and formulate the MLR test for H 0 : E (X 1) = θ0 , β = β 0
versus H 1 : E(X 1 ) ≠ θ0 , β ≠ β 0, where θ0 , β 0 are known. Find the asymp-
totic distribution function of the test statistic 2 log( MLn ) under H 0
(Wilks’ theorem).
2.2 See Question 3.2, Section 13.1, Example 5. Show (and prove) the
asymptotic distribution of the empirical likelihood ratio test statistic
under H 0 (the nonparametric version of Wilks’ theorem).

Question 3 (10 points)


Formulate and prove the Ville and Wald inequality.

Question 4 (15 points)


Assume the measurement model with autoregressive error terms of the form

Xi = μ + εi,  εi = β εi−1 + ξi,  ε0 = 0,  i = 1, …, n,

where ξi, i = 1, …, n, are independent identically distributed (iid) random variables with the distribution function Pr{ξ1 < u} = ∫_{−∞}^{u} (2π)^{−1/2} exp(−u²/2) du; μ, β are parameters.
Show how to obtain the likelihood function based on {Xi, i = 1, …, n}, in a general case. Derive the likelihood function based on observed X1, …, Xn.

Example 3:
04:30 PM–07:30 PM
This exam has 7 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (29 points) Characteristic functions


1.1 Define the characteristic function φ(t) of a random variable X. Show φ(t) exists, for all X and t. Assume X has the density function f(u) = (2πσ²)^{−1/2} exp(−(u − μ)²/(2σ²)) with parameters μ and σ². Write out the expression for φ(t).
1.2 See Question 1, Section 13.1, Example 1.
1.3 See Question 1.2, Section 13.1, Example 2.

Question 2 (8 points)
Formulate and prove the central limit theorem related to the stopping time N(H) = inf{n : ∑_{i=1}^{n} Xi ≥ H}, where X1 > 0, X2 > 0, ... are iid, EX1 = a, Var(X1) = σ² < ∞ and H → ∞. (Here an asymptotic distribution of N(H) should be shown.)

Question 3 (21 points)


Assume data points X 1 ,… , X n are from a joint density function f ( x1 ,… , xn ; θ),
where θ is an unknown parameter. Suppose we want to test H 0 : θ = 0 versus
H 1 : θ ≠ 0 . Complete the following:

3.1 Write the maximum likelihood ratio MLRn and formulate the MLR
test. Write the Bayes factor type likelihood ratio BLRn. Show an
optimality of BLRn in the context of “most powerful decision rules”
and interpret the optimality.

3.2 Find the asymptotic ( n → ∞ ) distribution function of the statistic


2 log( MLRn ), under H 0 .
3.3 Assume X1, …, Xn are iid and the parameter θ = EX1. Let f(x1, …, xn; θ) be unknown. To test for H0: θ = 0 versus H1: θ ≠ 0, formulate the empirical likelihood ratio test statistic ELRn. Find the asymptotic (n → ∞) distribution function of the statistic 2 log(ELRn), under H0, provided that E|X1|³ < ∞.

Question 4 (5 points)
What are Martingale, Submartingale and Supermartingale? Consider
the test statistics MLRn and BLRn from Question 3. Under H 0 , are these
statistics Martingale, Submartingale, or Supermartingale? Justify your
response.

Question 5 (11 points)


5.1 Formulate and prove: the Doob theorem (inequality) based on nonnegative martingales and the Ville and Wald inequality.
5.2 See Question 4.3, Section 13.1, Example 2.

Question 6 (10 points)


See Question 4, Section 13.1, Example 1.

Question 7 (16 points)


7.1 Show (in detail with all needed proofs) that

P{∑_{i=1}^{n} Xi ≥ an/m^{0.5} + d m^{0.5}, for some n ≥ 1} ≤ exp{−2ad},  d = log(ε)/(2a),

for all ε > 0 and a > 0, where X1, …, Xn are iid and X1 ~ N(0,1). Here a proof of the general inequality

P{∑_{i=1}^{n} Xi ≥ m^{0.5} A(n/m, ε) for some n ≥ 1} ≤ 1/ε

(A(·,·) is a function that you should define) is required to be presented.



7.2 Suppose we are interested in confidence sequences with uniformly


small error probability for the mean of a normal distribution N(θ,1).
Carry out the following:
7.2.1 Define the relevant confidence sequences.
7.2.2 Show the corresponding proofs.
7.2.3 Show how to operate with the required result.
7.2.4 Compare this approach with the conventional non-uniform confi-
dence interval estimation.

Example 4:
11 AM–12:20 PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (25 points)


Prove the following theorem (the inversion formula). Let f define a density function of a random variable X. Then

f(x) = (1/(2π)) ∫_{−∞}^{∞} e^{−itx} E(e^{itX}) dt,  i² = −1, i = √(−1).

Question 2 (19 points)


Assume we have data points X 1 ,… , X n that are iid observations from a den-
sity function f and EX 1 = θ. We would like to test for H 0 : θ = θ0 ver-
sus H 1 : θ = θ1. Let f and θ1 be unknown. Write the empirical likelihood ratio
test statistic ELRn. Find the asymptotic ( n → ∞ ) distribution function of the
statistic 2 log ELRn, under H 0 .
How can one use this asymptotic result to carry forth the ELR test in practice?

Question 3 (19 points)


See Question 2, Section 13.2, Example 3.

Question 4 (17 points)


4.1 Prove the Wald lemma regarding the expectation of a random sum of iid random variables. Can you use the Wald lemma to calculate E(∑_{i=1}^{τ−1} Xi), where X1, …, Xn are iid with EX1 = 1 and τ is a stopping time with Eτ = 25? Relevant formal definitions of objects that you use (e.g., of a σ-algebra) should be presented.
4.2 See Question 4.3, Section 13.1, Example 2.

Question 5 (20 points)


Define statistics an and An based on a sample z1 ,… , zn of iid observations with
the median M, such that the probability Pr {an ≤ M ≤ An for every n ≥ m} can
be monitored (m is an integer number). The relevant proof should be presented.

Example 5:
11:45 AM–1:05 PM
This exam has 6 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (12 points)


See Question 2, Section 13.2, Example 3.

Question 2 (16 points)


Assume that data points X 1 ,… , X n are from a joint density function
f ( x1 ,… , xn ; θ), where θ is an unknown parameter. Suppose we want to test
for H 0 : θ = 0 versus H 1 : θ ≠ 0 then answer the following:

2.1 Write the maximum likelihood ratio MLRn and formulate the MLR
test. Find the asymptotic ( n → ∞ ) distribution function of the statis-
tic 2 log MLRn, under H 0 . Show how this asymptotic result can be
used for the MLR test application.

2.2 Assume we would like to show how well the asymptotic distribution obtained in Question 2.1 above approximates the actual (real) distribution of the statistic 2 log MLRn when the sample size n is fixed. In this case, suggest an algorithm to compare the actual distribution of the test statistic with the asymptotic approximation to the actual distribution, when n is, say, 10, 20, 30.
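A sketch of such an algorithm for one concrete parametric setting (an Exponential(mean θ) sample with H0: θ = 1; the model, the number of Monte Carlo replications, and the nominal 0.05 level are illustrative assumptions, not part of the question):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)

def two_log_mlr(x, theta0=1.0):
    # 2*log(MLR) for H0: E(X1) = theta0 in an Exponential(mean theta) model;
    # the MLE of theta is the sample mean
    xbar = np.mean(x)
    return 2.0 * len(x) * (xbar / theta0 - 1.0 - np.log(xbar / theta0))

reps, crit = 20000, chi2.ppf(0.95, df=1)   # chi-square(1) approximation under H0
for n in (10, 20, 30):
    stats = np.array([two_log_mlr(rng.exponential(1.0, n)) for _ in range(reps)])
    # actual (Monte Carlo) rejection rate vs the nominal level of the asymptotic test
    print(f"n={n}: Pr{{2 log MLR > chi2_(0.95,1)}} = {np.mean(stats > crit):.3f} (nominal 0.05)")
```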

Question 3 (19 points)


Relevant formal definitions of objects that you use to answer the ques-
tion below (e.g., of a stopping time) should be presented. Formulate and
prove:

3.1 The Doob theorem (inequality).


3.2 The Ville and Wald inequality.

Question 4 (13 points)


See Question 4, Section 13.1, Example 1.

Question 5 (14 points)


Confidence sequences for the median. See Question 5, Section 13.2, Example 4.

Question 6 (26 points)


See Question 3.3, Section 13.2, Example 3.

Example 6:
11.45 AM–1:05 PM
This exam has 7 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (19 points)


Assume we observe iid data points X1, …, Xn and iid data points Y1, …, Yn. It is known that var(X1) = σ²_X < ∞, var(Y1) = σ²_Y < ∞, Pr{Xi ≤ u, Yj ≤ v} = Pr{Xi ≤ u} Pr{Yj ≤ v} for i ≠ j, and cov(Xi, Yi) = σ_XY, where σ²_X, σ²_Y and σ_XY are known. An investigator wants to use the test statistic Gn = n^{1/2}((1/n)∑_{i=1}^{n} Xi − (1/n)∑_{i=1}^{n} Yi) to test for the hypothesis H0: EX1 = EY1. Using the method based on characteristic functions, propose an approach to control asymptotically, as n → ∞, the Type I error rate of the test: reject H0 for large values of Gn. The corresponding proof should be shown. Do we need additional assumptions on the data distributions?

Question 2 (15 points)


See Question 2, Section 13.1, Example 3.

Question 3 (29 points)


Assume we have data points X 1 ,… , X n that are iid observations and
EX 1 = θ. Suppose we would like to test H 0 : θ = θ0 versus H 1 : θ = θ1 . Com-
plete the following:

3.1 Let X 1 have a density function f with a known form that depends on
θ. When θ1 is assumed to be known, write the most powerful test
statistic (the relevant proof should be shown).
3.2 Assuming θ1 is unknown, write the maximum likelihood ratio MLRn
and formulate the MLR test based on X 1 ,… , X n. Find the asymptotic
( n → ∞ ) distribution function of the statistic 2 log MLRn, under H 0 .
3.3 Assume f and θ1 are unknown. Write the empirical likelihood ratio test statistic ELRn. Find the asymptotic (n → ∞) distribution function of the statistic 2 log ELRn, under H0, provided that E|X1|³ < ∞.

Question 4 (10 points)


See Question 4, Section 13.1, Example 1.

Question 5 (9 points)
See Question 4.3, Section 13.1, Example 2.

Question 6 (9 points)
If it is possible, formulate a test with power 1. Relevant proofs should be
shown.

Question 7 (9 points)
See Question 4, Section 13.2, Example 2.

Example 7:
11:00 AM–12:20 PM
This exam has 7 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (15 points)


1.1 Let X1, …, Xn be independent identically Cauchy-distributed random variables with the characteristic function exp(−|t|) of X1. What is the distribution function of (X1 + … + Xn)/n?
1.2 See Question 1.2, Section 13.1, Example 2.

Question 2 (14 points)


See Question 2, Section 13.3, Example 3.

Question 3 (14 points)


Let ξ1, …, ξd be iid random variables from N(0, 1), η1, …, ηd denote iid random variables from N(0, 1), and ω1, …, ωd be iid random variables that have only the two values 0 and 1 with Pr{ω1 = 0} = 0.5. The random variables ξ1, …, ξd, η1, …, ηd, and ω1, …, ωd are independent and εi = ωi ξi + (1 − ωi)ηi, i = 1, …, d.
Assume we have data points X1, …, Xn that are iid observations and X1 has a density function f that has the form of the density function of ∑_{i=1}^{d} (εi)², where the integer parameter d is unknown and d ≤ 10. Propose, giving the corresponding proof, the integrated most powerful test for H0: EX1 = θ0 versus H1: EX1 ≠ θ0 (here θ0 is known), provided that we are interested in obtaining the maximum integrated power of the test when values of the alternative parameter can have all possible values with no preference. Should you need to use f, the analytical form of f should be derived. (Hint: Use characteristic functions to derive the form of f.)

Question 4 (19 points)


4.1 Define a martingale, submartingale, and supermartingale.
4.2 See Question 4.1, Section 13.1, Example 5.
4.3 Let (X n > 0, ℑ n ) denote a martingale with EX1 = 1. Is τ(H ) = min {n > 0 : X n > H},
where H > 1, a stopping time? (Corresponding proof should be
shown.)

Question 5 (9 points)
Formulate and prove the Doob theorem (inequality).

Question 6 (14 points)


See Question 4.3, Section 13.1, Example 2.

Question 7 (15 points)


Confidence sequences for the median. See Question 5, Section 13.2, Example 4.

Example 8:
11:00 AM–12:20 PM
This exam has 9 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (18 points)


1.1 See Question 1.2, Section 13.1, Example 2.
1.2 Prove that the characteristic function of a symmetric (around 0) ran-
dom variable is real-valued and even.

Question 2 (9 points)
Assume that X 1 ,… , X n are iid random variables and
X 1 ~ Exp(1), Sn = (X 1 + … + X n) . Let N define an integer random variable
that satisfies Pr {N < ∞} = 1, {N ≥ n} ∈σ (X 1 ,… , X n) , where σ (X 1 ,… , X n)
is the σ-algebra based on (X 1 ,… , X n). Calculate a value of
E exp (i tSN + N log(1 − i t)) , i2 = −1 , where t is a real number. Proofs of the
theorems you use to solve this problem do not need to be shown.

Question 3 (9 points)
Biomarker levels were measured from disease and healthy populations, pro-
viding the iid observations X 1 = 0.39, X 2 = 1.97, X 3 = 1.03, X 4 = 0.16 that are
assumed to be from a normal distribution as well as iid observations
Y1 = 0.42, Y2 = 0.29, Y3 = 0.56, Y4 = −0.68, Y5 = −0.54 that are assumed to be from
a normal distribution, respectively.

Define the receiver operating characteristic (ROC) curve and the area
under the curve (AUC).
Obtain a formal notation of the AUC.
Estimate the AUC.
What can you conclude regarding the discriminating ability of the bio-
marker with respect to the disease?

(Values that may help you to approximate the estimated AUC: Pr{ξ < x} ≈ 0.56, when x = 1, ξ ~ N(0.7, 4); Pr{ξ < x} ≈ 0.21, when x = 0, ξ ~ N(0.8, 1); Pr{ξ < x} ≈ 0.18, when x = 1, ξ ~ N(0.9, 1); Pr{ξ < x} ≈ 0.18, when x = 0, ξ ~ N(0.9, 1).)
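As a complement, a minimal sketch of the nonparametric (Mann–Whitney type) estimate of the AUC from these data. The convention AUC = Pr{X > Y}, with X denoting the disease group, is an assumption here; the normal-based estimate targeted by the hint values above is not computed in this sketch.

```python
import numpy as np

x = np.array([0.39, 1.97, 1.03, 0.16])                 # disease group
y = np.array([0.42, 0.29, 0.56, -0.68, -0.54])         # healthy group

# Empirical estimate of AUC = Pr{X > Y}; ties, if any, count as 1/2
d = x[:, None] - y[None, :]
auc_hat = (np.sum(d > 0) + 0.5 * np.sum(d == 0)) / d.size
print(f"Nonparametric AUC estimate: {auc_hat:.2f}")    # 15/20 = 0.75 for these data
```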

Question 4 (11 points)


See Question 2, Section 13.1, Example 3.

Question 5 (14 points)


See Question 2, Section 13.2, Example 4.

Question 6 (11 points)


See Question 3, Section 13.2, Example 4.

Question 7 (10 points)


See Question 4, Section 13.1, Example 1.

Question 8 (9 points)
See Question 4.3, Section 13.1, Example 2.

Question 9 (9 points)
Confidence sequences for the median. See Question 5, Section 13.2, Example 4.

Comments. Regarding Question 1.2, we have F(u) = 1 − F(−u). Then

φ(t) = E(e^{itξ}) = ∫_{−∞}^{∞} e^{itu} dF(u) = ∫_{∞}^{−∞} e^{−itu} dF(−u) = ∫_{∞}^{−∞} e^{−itu} d(1 − F(u)) = ∫_{−∞}^{∞} e^{−itu} d(F(u)) = E(e^{−itξ}).

Thus

∫_{−∞}^{∞} (cos(tu) + i sin(tu)) dF(u) = ∫_{−∞}^{∞} (cos(−tu) + i sin(−tu)) dF(u),

where cos(tu) = cos(−tu) and sin(−tu) = −sin(tu). This leads to E{sin(tξ)} = 0 and then

φ(t) = E(e^{itξ}) = E{cos(tξ)}.

Regarding Question 2, we consider

E{exp(i t Sn + n log(1 − i t)) | σ(X1, ..., Xn−1)},

for a fixed n. It is clear that

E{exp(i t Sn + n log(1 − i t)) | σ(X1, ..., Xn−1)} = exp(i t Sn−1 + n log(1 − i t)) ϕ(t),

where ϕ(t) is the characteristic function of X1. This implies

E{exp(i t Sn + n log(1 − i t)) | σ(X1, ..., Xn−1)} = exp(i t Sn−1 + (n − 1) log(1 − i t)).

Then {exp(i t Sn + n log(1 − i t)), σ(X1, ..., Xn)} is a martingale and we can apply the optional stopping theorem.
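A small numerical check of this martingale argument may be useful. Since the characteristic function of Exp(1) is (1 − it)^{−1}, the optional stopping argument gives E exp(i t SN + N log(1 − i t)) = E exp(i t S1 + log(1 − i t)) = 1; the stopping rule N = inf{n : Sn > 5}, the value t = 0.7, and the number of replications below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
t, reps, vals = 0.7, 50000, []

for _ in range(reps):
    s, n = 0.0, 0
    while s <= 5.0:            # hypothetical stopping time N = inf{n : S_n > 5}
        s += rng.exponential(1.0)
        n += 1
    vals.append(np.exp(1j * t * s + n * np.log(1.0 - 1j * t)))

print(np.mean(vals))           # should be close to 1 + 0j
```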

Example 9:
11.45 AM–1:10 PM
This exam has 8 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (12 points)


Assume a receiver operating characteristic (ROC) curve corresponds to
values of a biomarker measured with respect to a disease population,
say X, and a healthy population, say Y. Let X and Y be independent
and log-normally distributed, such that X ~ exp(ξ1 ), Y ~ exp(ξ2 ), where
ξ1 ~ N ( μ 1 , σ 12 ), ξ2 ~ N ( μ 2 , σ 22 ) with known σ 12 , σ 22 and unknown μ 1 , μ 2. In
order to estimate the area under the ROC curve (AUC), we observe iid data
points X 1 ,… , X n and iid data points Y1 ,… , Ym corresponding to X and Y,
respectively.
Propose an estimator of the corresponding AUC. Using the δ-method,
evaluate the asymptotic variance of the proposed estimator.

Question 2 (15 points)


See Question 2, Section 13.1, Example 3.

Question 3 (21 points)


See Question 3, Section 13.2, Example 6.

Question 4 (10 points)


See Question 4, Section 13.1, Example 1.

Question 5 (8 points)
See Question 4.3, Section 13.1, Example 2.

Question 6 (9 points)
If it is possible, formulate a test with power one. Relevant proofs should be
shown.

Question 7 (10 points)


See Question 4, Section 13.2, Example 2.

Question 8 (15 points)


Assume iid random variables X1, X2, … are from the Uniform[0,1] distribution. Define the random variable τ = inf{n ≥ 1 : Xn > an}.
Is τ a stopping time with respect to the σ-algebra ℑn = σ(X1, ..., Xn), when ai = 1 − 1/(i + 1)², i ≥ 1?
Is τ a stopping time with respect to ℑn = σ(X1, ..., Xn), when ai = (1 − 1/(i + 1)), i ≥ 1?
Corresponding proofs should be shown.

13.3 Qualifying Exams

Example 1:
The PhD theory qualifying exam, Part I
8:30 AM–12:30 PM
This exam has 4 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (20 points) Characteristic functions


1.1 Define the characteristic function φ(t) of a random variable X. Show (and prove) that φ(t) exists, for all probability distributions of X and all real t. Assume X has the density function f(u) = (2πσ²)^{−1/2} exp(−u²/(2σ²)). What is a form of φ(t)? Assume X has the density function f(u) = exp(−u), u > 0. What is a form of φ(t)?
1.2 Using characteristic functions, derive the analytical form of the dis-
tribution function of a χ 2d-distributed random variable ξ, where d is
the degree of freedom. (Hint: Use Gamma distributions.)
1.3 See Question 1.2, Section 13.1, Example 2.

Question 2 (30 points) Asymptotic results


2.1 Let X 1 ,… , X n be iid observations. Suppose X 1 has a density function
f that can be presented in a known parametric form depending on
an unknown parameter θ. Define the maximum likelihood estima-
tor of θ. Formulate and prove the theorem regarding an asymptotic
distribution of the maximum likelihood estimator. (Please do not
forget the assumptions of the theorem.)
2.2 Suppose we want to test for H 0 : θ = 0 versus H 1 : θ ≠ 0 based on
observations from Question 2.1 above.
2.2.1 Towards this end, define the maximum likelihood ratio statistic
MLn and formulate the maximum likelihood ratio (MLR) test.
2.2.2 Using the theorem from Question 2.1, derive the asymptotic dis-
tribution function of the statistic 2 log MLn , under H 0 .
2.2.3 Assume the sample size n is not large. Can you suggest a compu-
tational algorithm as an alternative to the asymptotic result from
Question 2.2.2? If yes, provide the algorithm.

Question 3 (35 points) Testing


3.1 Let X 1 ,… , X n be observations that have a joint density function f . f
can be presented in a known parametric form depending on an
unknown parameter θ.
3.1.1 Propose (and prove) the most powerful test for H 0 : θ = 0 versus
H 1 : θ = 1.
3.1.2 Propose (and prove) the integrated most powerful test for H0: θ = 0 versus H1: θ ≠ 0 with respect to which we are interested in obtaining the maximum integrated power of the test when values of the alternative parameter follow the uniform density π(θ) = 1 for θ ∈ [0, 1] and π(θ) = 0 otherwise. (Hint: Use a Bayes factor type procedure.)

3.1.3 Consider the test statistics from Questions 2.2.1, 3.1.1 and 3.1.2,
say ML, L, and BL, respectively. Are the ML, L, and BL test sta-
tistics a martingale, submartingale or supermartingale under
H 0 ? Are the log-transformed values, log(ML), log(L), and log(BL)
a martingale, submartingale or supermartingale under H0?
How about under H1? Can you draw conclusions regarding the
test statistics?

3.2 Assume a measurement model with autoregressive errors in the form of

Xi = μ + εi,  εi = β εi−1 + ξi,  ξ0 = 0,  i = 1, …, n,

where ξi, i = 1, …, n, are independent identically distributed (iid) with the distribution function ∫_{−∞}^{u} (2π)^{−1/2} exp(−u²/2) du. Suppose we are interested in the maximum likelihood ratio test for H0: μ = −1 and εi, i = 1, …, n, are independent versus H1: μ = 0 and εi, i = 1, …, n, are dependent. Then complete the following:
Write the relevant test in an analytical closed form.
Provide a schematic algorithm to evaluate the power of the test based on a Monte Carlo study.
3.3 See Question 4, Section 13.1, Example 1.

Question 4 (15 points)


4.1 Formulate and prove the optional stopping theorem for martingales
(here provide formal definitions of objects that you use in this theo-
rem, e.g., the definition of a stopping time).
4.2 Prove the Doob theorem (inequality) based on a nonnegative
martingale.
4.3 Prove the Ville and Wald inequality.

Example 2:
The PhD theory qualifying exam, Part I
8:30 AM–12:30 PM
This exam has 8 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (31 points)


1.1 Let x, y denote real variables and i² = −1, i = √(−1).
Show (prove) that exp(ix) = cos(x) + i sin(x).
Compute (i)^i in the form of a real variable.

1.2 Define the characteristic function φ(t) of a random variable X. Prove


that φ(t) exists for all probability distributions and all real t.

1.3 Prove the following theorem (the inversion formula). Let y > x be real arguments of the distribution function F(u) = Pr{X ≤ u}, where F(.) is a continuous function at the points x and y. Then

F(y) − F(x) = (1/(2π)) lim_{σ→0} ∫_{−∞}^{∞} [(exp(−itx) − exp(−ity))/(it)] φ(t) exp(−t²σ²/2) dt.

1.4 See Question 1.2, Section 13.1, Example 2.

Question 2 (16 points)


Let ε1, …, εd be iid random variables from N(0, 1). Assume we have data points X1, …, Xn that are iid observations and X1 has the density function f in a form of the density function of ∑_{i=1}^{d} (εi)², where the integer parameter d is unknown and d ≤ 10.

2.1 Write the maximum likelihood ratio (MLR) statistic, MLn, and for-
mulate the MLR test for H 0 : EX 1 = θ0 versus H 1 : EX 1 ≠ θ0, where θ0 is
known. (Hint: Use characteristic functions to derive the form of f .)
2.2 Propose, providing a proof, the integrated most powerful test for
H 0 : EX 1 = θ0 versus H 1 : EX 1 ≠ θ0 (here θ0 is known), provided that
we are interested in obtaining the maximum integrated power of
the test when values of the alternative parameter can have all pos-
sible values with no preference. (Hint: Use a Bayes factor type
procedure.)
2.3 Assume the sample size n is not large. How can we apply the tests
from Questions 2.1 and 2.2, in practice, without analytical evalua-
tions of distribution functions of the test statistics? (i.e., assuming
that you have a data set, how do you execute the tests?) Can you sug-
gest computational algorithms? If yes, write the relevant algorithms.
(Justify all answers.)

Question 3 (8 points)
Assume we observe iid data points X1, …, Xn that follow a joint density function f(x1, …, xn; θ), where θ is an unknown parameter. Suppose we want to test for the hypothesis H0: θ = θ0 versus H1: θ = θ1, where θ0, θ1 are known. In this case, propose (and give the relevant proof) the most powerful test. Examine theoretically the asymptotic distributions of log(LRn)/n (here LRn is the test statistic proposed by you) under H0 and H1, respectively. Show that log(LRn)/n → a in probability as n → ∞, where the constant a ≤ 0 under H0 and a ≥ 0 under H1. (Hint: Use the central limit theorem, the proof of which should be presented.)

Question 4 (14 points)


4.1 Formulate and prove the optional stopping theorem for martingales
(here provide formal definitions of objects that you use, e.g., the def-
inition of a martingale).
4.2 Prove the Wald theorem regarding the expectation of a random sum of iid random variables. Can you use the Wald theorem to calculate E(∑_{i=1}^{τ−1} Xi), E(∑_{i=1}^{τ+1} Xi) and E(∑_{1≤i≤τ/2} Xi), where X1, …, Xn are iid with EX1 = −1 and τ is a stopping time with Eτ = 4.5? If yes, calculate the expectations.
4.3 Formulate and prove the following inequalities:
The Doob theorem (inequality) based on nonnegative martingales.
The Ville and Wald inequality.

Question 5 (7 points)
See Question 4, Section 13.1, Example 1.

Question 6 (8 points)
Assume the measurement model with autoregressive errors in the form of

Xi = μ + εi,  εi = β εi−1 + ξi,  ξ0 = 0,  i = 1, …, n,

where ξi, i = 1, …, n, are independent identically distributed (iid) with the distribution function ∫_{−∞}^{u} (2π)^{−1/2} exp(−u²/2) du. We are interested in the maximum likelihood ratio test for

H0: μ = −1 and εi, i = 1, …, n, are independent

versus

H1: μ = 0 and εi, i = 1, …, n, are dependent.

Write the relevant test in an analytical closed form (here needed operations with joint densities and obtained forms of the relevant likelihoods should be explained).
Is the obtained test statistic, say L, a martingale/submartingale/supermartingale, under H0/H1?
Is log(L) a martingale/submartingale/supermartingale, under H0?
(The proofs should be shown.)

Question 7 (8 points)
See Question 3.3, Section 13.2, Example 3.

Question 8 (7 points)
Let the stopping time have the form of N(H) = inf{n : ∑_{i=1}^{n} xi ≥ H}, where xi are iid random variables with Ex1 > 0.

8.1 Show that EN(H) = ∑_{j=1}^{∞} P{N(H) ≥ j}. Is EN(H) = ∑_{j=1}^{∞} P{∑_{i=1}^{j} xi < H}? If not, what do we need to correct this equation?
8.2 Formulate (no proof is required) the central limit theorem related to the stopping time N(H) = inf{n : ∑_{i=1}^{n} Xi ≥ H}, where Xi > 0, i > 0, EX1 = a > 0, Var(X1) = σ² < ∞, and H → ∞. (Here the asymptotic distribution of N(H) should be shown.)
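A simulation sketch related to the limit result in Question 8.2 (the Exp(1) increments, so a = σ² = 1, the thresholds H, and the number of replications are illustrative assumptions); the standardization (N(H) − H/a)/(σ²H/a³)^{1/2} is the classical renewal-theory choice, and its simulated distribution should look approximately standard normal for large H.

```python
import numpy as np

rng = np.random.default_rng(5)

def n_of_h(h):
    # N(H) = inf{n : X_1 + ... + X_n >= H} with X_i ~ Exp(1) (so a = sigma^2 = 1)
    s, n = 0.0, 0
    while s < h:
        s += rng.exponential(1.0)
        n += 1
    return n

a = sigma2 = 1.0
for h in (50.0, 500.0):
    z = np.array([(n_of_h(h) - h / a) / np.sqrt(sigma2 * h / a**3) for _ in range(20000)])
    print(f"H={h:.0f}: mean {z.mean():+.3f}, sd {z.std():.3f} (standard normal limit: 0, 1)")
```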

Example 3:
The PhD theory qualifying exam, Part I
8:30AM–12:30PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (19 points)


1.1 Let x, y denote real variables and i² = −1, where i denotes an imaginary number. Compare i^(2i) with exp(−π) + i exp(−π) (i.e., "=", "<", ">"). (Here the proof and needed definitions should be provided.)
1.2 Obtain the characteristic function φ(t) of a random variable X that
has a normal distribution N ( μ , σ 2 ), where EX = μ , var(X ) = σ 2. (Jus-
tify your answer.)
1.3 See Question 1.3, Section 13.3, Example 2.

Question 2 (32 points)


2.1 Find, showing the proof, the asymptotic distribution of ∑_{i=1}^{n} Xi / n^{1/2} as n → ∞, where the random variables X1, …, Xn are iid (independent identically distributed) with EX1 = 0, var(X1) = σ².
2.2 See Question 2, Section 13.1, Example 3.
2.3 Consider the stopping time N(H) = inf{n : ∑_{i=1}^{n} Xi ≥ H}, where X1 > 0, …, Xn > 0 are iid, EX1 = a < ∞, Var(X1) = σ² < ∞. Find, showing a proof, an asymptotic (H → ∞) distribution of N(H).
2.4 Assume we observe iid data points X1, …, Xn that are from a known joint density function f(x1, …, xn; θ), where θ is an unknown parameter. We want to test for the hypothesis H0: θ = θ0 versus H1: θ = θ1, where θ0, θ1 are known. In this case, propose (and give the relevant proof) the most powerful test statistic, say, LRn. Prove the asymptotic consistency of the proposed test: log(LRn)/n → a in probability as n → ∞, where the constant a ≤ 0, under H0, and a ≥ 0, under H1, just using that ∫_{−∞}^{∞} log(f_{H1}(u)/f_{H0}(u)) f_{H0}(u) du < ∞ and ∫_{−∞}^{∞} log(f_{H1}(u)/f_{H0}(u)) f_{H1}(u) du < ∞ (but without the constraint ∫_{−∞}^{∞} |log(f_{H1}(u)/f_{H0}(u))|^{1+ε} f_{Hj}(u) du < ∞, j = 0, 1, ε > 0), where f_{H0} and f_{H1} are the density functions of X1 under H0 and H1, respectively. (You should define the limit value a and show a ≤ 0, under H0, and a ≥ 0, under H1.)

Question 3 (21 points)


3.1 See Question 4, Section 13.2, Example 2.
3.2 Assume you save your money in a bank and each day, i, you obtain a random benefit ri > 5.5 that is independent of r1, …, ri−1; you also can obtain money via a risky game having each day Zi = 10 dollars with Pr{Zi = 10} = 1/3 or Zi = −5 dollars with Pr{Zi = −5} = 2/3, i = 0, 1, … (random variables Zi are iid and independent of r1, …, ri). You decide to stop the game when you collect $100, that is, when ∑_{i=1}^{n} (Zi + ri) ≥ 100 the first time. When you stop, what will be your total expected benefit from the game (i.e., E(∑ Zi), the sum being taken over the days until you stop)? In general, formulate and prove the needed theorems to illustrate this result. Relevant formal definitions of objects that you use (e.g., of a stopping time) should be presented. When you apply a theorem the corresponding conditions should be checked.
3.3 See Question 4.3, Section 13.3, Example 2.

Question 4 (7 points)
See Question 4, Section 13.1, Example 1.

Question 5 (21 points)


5.1 See Question 7.1, Section 13.2, Example 3.
5.2 See Question 7.2, Section 13.2, Example 3.
5.3 Assume we survey X 1 ,… , X n ,... iid observations and X 1 ~ N (θ,1).
Can we test for θ ≤ 0 versus θ > 0 with power 1? If yes, propose the
test and formal arguments (proofs).

Comments. Regarding Question 3.2, we note that, defining the random variable τ = inf{n ≥ 1 : ∑_{i=1}^{n} (Zi + ri) ≥ 100}, since (Zi + ri) > 0, we have, for a fixed N,

Pr{τ > N} = Pr{∑_{i=1}^{N} (Zi + ri) < 100} = Pr{exp(−(1/100) ∑_{i=1}^{N} (Zi + ri)) > e^{−1}}
≤ e E exp(−(1/100) ∑_{i=1}^{N} (Zi + ri)) = e (E exp(−(Z1 + r1)/100))^N → 0 as N → ∞. Thus Pr{τ < ∞} = 1.

Then τ is a stopping time.



Example 4:
The PhD theory qualifying exam, Part I
8:30 AM–12:30 PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (23 points)


1.1 See Question 1.1, Section 13.3, Example 1.
1.2 Assume independent random variables ξ1 and ξ2 are identically distributed with Pr{ξ1 < u} = 1 for u > 1, Pr{ξ1 < u} = u^{0.5} for 0 ≤ u ≤ 1, and Pr{ξ1 < u} = 0 for u < 0.
Derive the characteristic function φ(t) of η = max(ξ1, ξ2).


1.3 Using characteristic functions, derive the analytical form of the dis-
tribution function of a χ 2d-distributed random variable ξ, where d is
the degree of freedom. (Hint: Use Gamma distributions.)
1.4 See Question 1.2, Section 13.1, Example 2.
1.5 See Question 1, Section 13.2, Example 4.

Question 2 (23 points)


Assume we have data points X 1 ,… , X n that are iid observations and EX 1 = θ.
We would like to test for the hypothesis H 0 : θ = θ0 versus H 1 : θ = θ1.

2.1 Let X 1 have a density function f with a known form that depends on
θ. When θ1 is assumed to be known, propose the most powerful test
statistic (a relevant proof should be shown).
Is the couple (the corresponding test statistic and the sigma algebra
σ {X 1 ,… , X n}) a martingale, submartingale, or supermartingale,
under H 0 /under H 1? (Justify your answer.)
Show (prove) the asymptotic distributions of n^{−1/2} × log of the test statistic, under H0 and H1, respectively. (Should you use a theorem, show a relevant proof of the theorem.)
How can one use the asymptotic result obtained under H0 to perform the test in practice? If possible, suggest an alternative method to evaluate the H0-distribution of the test statistic.
2.2 Write the maximum likelihood ratio MLRn and formulate the MLR
test based on X 1 ,… , X n. Find the asymptotic ( n → ∞ ) distribution
function of the statistic 2 log( MLRn ), under H 0 .
Is the couple (the corresponding test statistic and sigma algebra
σ {X 1 ,… , X n}) a martingale, submartingale or supermartingale under
H 0 ? (Justify your answer.)
Illustrate how one can use this asymptotic result to perform the test
in practice? Propose (if it is possible) an alternative to this use of the
H 0 -asymptotic distribution of the test statistic.
2.3 Assume f, the density function of X1, …, Xn, and θ1 are unknown. Formulate the empirical likelihood ratio test statistic ELRn. Find the asymptotic (n → ∞) distribution function of the statistic 2 log(ELRn), under H0, provided that E|X1|³ < ∞. How can one use this asymptotic result to perform the corresponding test, in practice? Propose (if it is possible) an alternative to this use of the H0-asymptotic distribution of the test statistic.

Question 3 (11 points)


Assume a measurement model with autoregressive error terms in the form

Xi = μ + εi , ε i = β ε i−1 + ξi , ε 0 = 0, X 0 = 0, i = 1,… , n,

where ξi , i = 1,… , n are independent identically distributed (iid) with a


known density function fε (u). Our data is based on observed X 1 ,… , X n.

3.1 Obtain the likelihood function (the joint density function)


f (X 1 ,… , X n ), providing relevant formal arguments. The target likeli-
hood should be presented in terms based on fε . (Suggestion: Show

that f (X 1 ,… , X n ) = f (X ).)
3.2 Let μ = 1 and suppose that we would like to test that ε i , i = 1,… , n, are
independent identically distributed random variables. Propose an
integrated most powerful test, showing the relevant proof, when our
interest is for β ∈ (0,1] uniformly under the alternative hypothesis.
Is the couple, the corresponding test statistic and the sigma algebra
σ {X 1 ,… , X n} , a martingale, submartingale, or supermartingale under
H 0 ? (Justify your answer.)
Can you suggest a method for obtaining critical values of the test statistic?

Question 4 (31 points)


4.1 Formulate and prove the optional stopping theorem. Relevant for-
mal definitions of objects that you use (e.g., of martingales, a stop-
ping time) should be presented.
4.2 Prove the Wald theorem regarding the expectation of a random sum
of iid random variables.
4.3 Suppose X1 > 0, …, Xn > 0 are iid with EX1 = 1 and τ is a stopping time with Eτ = 25. Can you use the Wald theorem to calculate

E(∑_{i=1}^{τ+1} Xi), E(∑_{i=1}^{τ/3} Xi), E(∑_{i: i ≤ X1+X25} Xi), E(∑_{i=1}^{ν1} Xi), E(∑_{i=1}^{ν2+τ} Xi), E(∑_{i=1}^{ν2−τ} Xi),

where ν1 = min{n : Xn > 2n} and ν2 = min{n : ∑_{i=1}^{n} Xi > 10000}? Justify your answers.
4.4 See Question 4.3, Section 13.3, Example 2.
4.5 See Question 2, Section 13.2, Example 3.
4.6 See Question 4, Section 13.1, Example 1.

Question 5 (12 points)


5.1 See Question 7.1, Section 13.2, Example 3.
5.2 See Question 5.3, Section 13.3, Example 3.

Example 5:
The PhD theory qualifying exam, Part I
8:30 AM–12:30 PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (8 points)
1.1 Assume we observe X1, X2, X3, …, which are independent identically distributed (iid) data points with EX1 = μ and an unknown variance. To observe one X we should pay A dollars. We want to estimate μ using X̄_N = ∑_{i=1}^{N} Xi / N. Propose a procedure to estimate μ minimizing the risk function L(n) = An + E(X̄_n − μ)² with respect to n.
1.2 Biomarker levels were measured from disease and healthy populations,
providing iid observations X 1 = 0.39, X 2 = 1.97, X 3 = 1.03, X 4 = 0.16
that are assumed to be from a continuous distribution as well as iid
observations Y1 = 0.42, Y2 = 0.29, Y3 = 0.56, Y4 = −0.68, Y5 = −0.54 that
are assumed to be from a continuous distribution, respectively. Define
the receiver operating characteristic (ROC) curve and the area under
the curve (AUC). Estimate nonparametrically the AUC. What can you
conclude regarding the discriminating ability of the biomarker with
respect to the disease?

Question 2 (28 points)


Define i 2 = −1, where i is an imaginary number, i = −1.

2.1 See Questions 1.1–1.3, Section 13.3, Example 2.


2.2 Suppose the random variable η has a continuous density function and
the characteristic function φ(t) = (exp(it − t 2 / 2) − exp(−t 2 / 2)) /(it), i 2 = −1.
We know that η = η1 + η2 , where η1 ~ N (0,1) and η2 has a continuous
density function. Derive the density function of η2. Show how to
formally calculate the density function using the inversion formula.
(Suggestion: Use the Dirichlet integrals that are employed to prove
the inversion formula.)
2.3 Using characteristic functions, derive an analytical form of the dis-
tribution function of a χ 2d-distributed random variable ξ, where d is
the degree of freedom. (Hint: Use Gamma distributions.)
2.4 See Question 1.2, Section 13.1, Example 2.

Question 3 (20 points)


3.1 Assume we have two independent samples with data points
X 1 ,… , X n and Y1 ,… , Ym, respectively. Let X 1 ,… , X n be iid random
observations with density function f x ( x|θX ) and let Y1 ,… , Ym denote
iid random observations with density function f y ( x|θY ), where θX
and θY are unknown parameters. Define θˆ Xn and θˆ Ym to be the maxi-
mum likelihood estimators of θX and θY based on X 1 ,… , X n and
Y1 ,… , Ym, respectively. Given that θX = θY , derive the asymptotic dis-
tribution of (n + m)0.5 (θˆ Xn − θˆ Ym ), when m / n → γ , for a known fixed γ,
as n → ∞ . Show the relevant conditions and the proof. (Hint: Use
that a − b = ( a − c) + (c − b).)

3.2 Assume we have data points X 1 ,… , X n that are iid observations and
EX 1 = θ. Suppose we would like to test for the hypothesis H 0 : θ = θ0.
Write the maximum likelihood ratio test statistic MLRn and formu-
late the MLR test based on X 1 ,… , X n. Find the asymptotic ( n → ∞ )
distribution function of the statistic 2 log( MLRn ), under H 0 .
Is the couple (the test statistic and sigma algebra σ {X 1 ,… , X n}) a martin-
gale, submartingale, or supermartingale, under H 0 ? (Justify your answer.)
Illustrate how to apply this asymptotic result in practice.
3.3 Suppose that we would like to test that iid observations X 1 ,… , X n are
from the density function f0 ( x) versus X 1 ,… , X n ~ f1 ( x|θ), where the
parameter θ > 0 is unknown. Assume that f1 ( x|θ) = f0 ( x)exp (θx − φ(θ)),
where φ is an unknown function. Write the most powerful test sta-
tistic and provide the corresponding proof.

Question 4 (28 points)


4.1 Formulate and prove the optional stopping theorem. Relevant for-
mal definitions of objects that you use (e.g., martingales, a stopping
time) should be presented.
4.2 Prove the Wald theorem regarding the expectation of a random sum
of iid random variables.
4.3 Suppose X1 > 0, …, Xn > 0, ... are iid random variables with EX1 = 1 and τ is a stopping time with Eτ = 25. Can you use the Wald theorem to calculate

E(∑_{i=1}^{τ+1} Xi), E(∑_{i=1}^{τ/3} Xi), E(∑_{i: i ≤ X1+X25} Xi), E(∑_{i=1}^{ν1} Xi), E(∑_{i=1}^{ν2+τ} Xi), E(∑_{i=1}^{ν2−τ} Xi),

where ν1 = min{n ≥ 1 : ∑_{i=1}^{n} Xi > 2n²(n + 1)} and ν2 = min{n : ∑_{i=1}^{n} Xi > 10000}? Justify your answers. (Suggestion: While analyzing ν1, you can use Chebyshev's inequality and that lim_{n→∞} ∑_{k=1}^{n} 1/(k(k + 1)) = lim_{n→∞} n/(n + 1) = 1.)
4.4 Show the non-asymptotic inequality

E(∑_{i=1}^{N(t)} Xi) ≤ exp(1) (EX1) E(exp(−X1/t)) / (1 − E(exp(−X1/t))),

where X1 > 0, …, Xn > 0, ... are iid random variables with EX1 < ∞ and N(t) = min{n : ∑_{i=1}^{n} Xi > t}.

4.5 See Question 4.3, Section 13.3, Example 2.


4.6 See Question 2, Section 13.2, Example 3.
4.7 See Question 4, Section 13.1, Example 1.

Question 5 (16 points)


5.1 See Question 7.1, Section 13.2, Example 3.
5.2 See Question 5.3, Section 13.3, Example 3.
5.3 See Question 7.2, Section 13.2, Example 3.

Comments. Regarding Question 2.2, we can use that exp(itx) = cos(tx) + i sin(tx) and the proof scheme related to the inversion theorem

F(y) − F(x) = (1/(2π)) lim_{σ→0} ∫_{−∞}^{∞} [(exp(−itx) − exp(−ity))/(it)] φ(t) exp(−t²σ²) dt

in order to obtain the integrals ∫ cos(it(1 − x))/(it), ∫ sin(it(1 − x))/(it). Regarding Question 4.3, one can note that

Pr(ν1 < ∞) = ∑_{j=1}^{∞} Pr(ν1 = j) = ∑_{j=1}^{∞} Pr(max_{n<j} [∑_{i=1}^{n} Xi − 2n²(n + 1)] < 0, ∑_{i=1}^{j} Xi > 2j²(j + 1))
≤ ∑_{j=1}^{∞} Pr(∑_{i=1}^{j} Xi > 2j²(j + 1)) ≤ ∑_{j=1}^{∞} E(∑_{i=1}^{j} Xi)/(2j²(j + 1)) = (1/2) ∑_{j=1}^{∞} 1/(j(j + 1)) = 1/2.

Example 6:
The PhD theory qualifying exam, Part I
8:30 AM–12:30 PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (10 points)


1.1 See Question 3, Section 13.2, Example 8.
1.2 See Question 4.2, Section 13.1, Example 7.

Question 2 (29 points)


2.1 Let X and Y be iid random variables with the characteristic function φ(t). Show that |φ(t)|² is a characteristic function of X − Y.
2.2 Assume the random variables X 1 , X 2 , X 3 … are iid. The function φ(t)
is a characteristic function of X 1. Define the characteristic function
of X 1 + X 2 + X 3 + … + X ν, where ν is an integer random variable
independent of X 1 , X 2 , X 3 … with Pr {ν = k} = pk , k = 1, 2,…
2.3 See Questions 1.1–1.3, Section 13.3, Example 2.
2.4 Suppose the random variable η has a continuous density function
and the characteristic function f (t) = (exp(it − t 2 / 2) − exp(−t 2 / 2)) /(it), i 2 = −1.
We know that η = η1 + η2 , where η1 ~ N (0,1) and η2 has a continuous
density function and η1 , η2 are independent random variables.
Derive the density function of η2.
Show how to formally calculate the density function of η2 using the
inversion formula. (Suggestion: Use the Dirichlet integrals that are
employed to prove the inversion formula.)
2.5 Let X1, …, Xn be independent random variables with EXi = 0, EXi² = σi² < ∞, i = 1, …, n. Use the method based on characteristic functions to define the asymptotic distribution of ∑_{i=1}^{n} Xi / (∑_{i=1}^{n} σi²)^{0.5} as n → ∞. What are the conditions on σi² and the moments of X1, …, Xn you need, in order to show the result regarding the asymptotic distribution of ∑_{i=1}^{n} Xi / (∑_{i=1}^{n} σi²)^{0.5}?

Question 3 (31 points)


3.1 Suppose we observe iid data points X 1 ,… , X n that are from a known
joint density function f ( x1 ,… , xn ; θ), where θ is an unknown
parameter and we want to test the hypothesis H 0 : θ = θ0 versus
H 1 : θ = θ1 , where θ0 , θ1 are known. Complete the following:
3.1.1 Derive (and give the relevant proof) the most powerful test statis-
tic, say, LRn for H 0 .
3.1.2 See Question 2.4, Section 13.3, Example 3.
3.2 See Question 2, Section 13.1, Example 3.
3.3 Let X 1 ,… , X n be observations that have joint density function f . f
can be presented in a known parametric form depending on an
unknown parameter θ.
3.3.1 Propose (and prove) the integrated most powerful test for H0: θ = 0 versus H1: θ ≠ 0 with respect to obtaining the maximum integrated power of the test when values of the alternative parameter satisfy the uniform distribution with the density π(θ) = 1 for θ ∈ [0, 1] and π(θ) = 0 otherwise.
3.3.2 Write the maximum likelihood ratio MLRn and formulate the
MLR test for H 0 : θ = 0 versus H 1 : θ ≠ 0 based on X 1 ,… , X n. Find
the asymptotic (n → ∞ ) distribution function of the statis-
tic 2 log MLRn, under H 0 .
Illustrate how one can use this asymptotic result in practice.
3.3.3 Is the couple (MLRn and sigma algebra σ {X 1 ,… , X n}) a martingale,
submartingale, or supermartingale under H 0 ? (Justify your answer.)
3.4 Assume f , the density function of X 1 ,… , X n, is unknown, X 1 ,… , X n
are iid, and θ = EX 1. Complete the following:
3.4.1 Formulate the empirical likelihood ratio test statistic ELRn.
3.4.2 Find the asymptotic (n → ∞) distribution function of the statistic 2 log(ELRn), under H0, provided that E|X1|³ < ∞.

3.4.3 How does one use this asymptotic result to perform the corre-
sponding test, in practice?
3.5 See Question 3.3, Section 13.3, Example 5.

Question 4 (25 points)


4.1 Assume you save your money in a bank each day, i, and you obtain a random benefit ri > 5.5, where r1, …, ri, ... are iid. You also can obtain money via a risky game having each day Zi = 10 dollars with Pr{Zi = 10} = 1/3 or Zi = −5 dollars with Pr{Zi = −5} = 2/3, i = 0, 1, … (random variables Zi are iid and independent of r1, …, ri).
You decide to stop the game when you will collect $100, that is, when ∑_{i=1}^{n} (Zi + ri) ≥ 100 the first time.
It is known that E exp(−s ri) = 0.5 [exp(−10s)/3 + 2 exp(5s)/3]^{−1} with s = 1/100.
4.1.1 When you stop, what will be your expected benefit from the risky game (i.e., E(∑ Zi), the sum being taken over the days until you stop)? When you apply the needed theorem, corresponding conditions should be checked. (Here the equation Pr{ξ < u} = Pr{exp(−ξ/u) > exp(−1)} and Chebyshev's inequality can be applied.)
4.1.2 Formulate and prove the needed theorems in general. Relevant
formal definitions of objects that you use (e.g., of a stopping time)
should be presented.
4.2 Prove the Wald lemma regarding the expectation of a random sum
of iid random variables.
4.3 See Question 4.3, Section 13.3, Example 2.
4.4 See Question 2, Section 13.2, Example 3.
4.5 See Question 4, Section 13.1, Example 1.

Question 5 (5 points)
Confidence sequences for the median. See Question 5, Section 13.2, Example 4.
Compare this result with the corresponding well-known non-uniform
confidence interval estimation.

Comments. Regarding Question 4.1, we note that, defining the random variable τ = inf{n ≥ 1 : ∑_{i=1}^{n} (Zi + ri) ≥ 100}, since (Zi + ri) > 0, we have, for a fixed N,

Pr{τ > N} = Pr{∑_{i=1}^{N} (Zi + ri) < 100} = Pr{exp(−(1/100) ∑_{i=1}^{N} (Zi + ri)) > e^{−1}}
≤ e E exp(−(1/100) ∑_{i=1}^{N} (Zi + ri)) = e (E exp(−(Z1 + r1)/100))^N → 0 as N → ∞. Thus Pr{τ < ∞} = 1.

Then τ is a stopping time.

Example 7:
The PhD theory qualifying exam, Part I
8:30–12:30
This exam has 10 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (9 points)
Assume a statistician has observed data points X 1 ,… , X n and developed a test
statistic Tn to test the hypothesis H 0 versus H 1 = not H 0 based on X 1 ,… , X n. We
do not survey X 1 ,… , X n, but we know the value of Tn and that Tn is distributed
according to the known density functions f0(u) and f1(u) under H0 and H1,
respectively.

1.1 Can you propose a transformation of Tn (i.e., say, a function G (Tn)) that
will improve the power of the test, when any other transformations,
e.g., K (Tn) based on a function K (u), could provide less powerful tests,
compared with the test based on G (Tn)? If you can construct such
function G, you should provide a corresponding proof.
1.2 What is a form of Tn that satisfies G (Tn) = Tn ? The corresponding
proof must be shown here.
(Hint: The concept of the most powerful testing can be applied here.)

Question 2 (5 points)
Assume that X1, …, Xn are independent and identically distributed (iid) random variables and X1 ~ N(0,1), Sn = (X1 + … + Xn). N, an integer random variable, defines a stopping time. Calculate a value of E exp(i t SN + t²N/2), where i² = −1 and t is a real number. Proofs of the theorems you use in this question are not required to be shown.

Question 3 (4 points)
See Question 3, Section 13.2, Example 8.

Question 4 (7 points)
See Question 4.2, Section 13.1, Example 7.

Question 5 (14 points)


5.1 Assume independent random variables ξ1 and ξ2 are identically
Exp(1)-distributed. Derive the characteristic function φ(t) of
η = min(ξ1 , ξ2 ).
5.2 See Question 1.2, Section 13.1, Example 2.
5.3 See Question 1, Section 13.2, Example 4.

Question 6 (3 points)
Let the observations X 1 ,… , X n be distributed with parameters (θ, σ), where σ
is unknown. We want to test parametrically for the parameter θ (θ = 0 under
H 0 ). Define the Type I error rate of the test. Justify your answer.

Question 7 (20 points)


See Question 2, Section 13.3, Example 4.

Question 8 (10 points)


See Question 3, Section 13.3, Example 4.

Question 9 (23 points)


9.1 Formulate and prove the optional stopping theorem. Relevant for-
mal definitions of objects that you use (e.g., martingales, a stopping
time) should be present.
9.2 See Question 4.3, Section 13.3, Example 2.
9.3 See Question 2, Section 13.2, Example 3.
9.4 See Question 4, Section 13.1, Example 1.

Question 10 (5 points)
See Question 5.3, Section 13.3, Example 3.

Comments. Regarding Question 2, we consider E{exp(i t Sn + t²n/2) | σ(X1, ..., Xn−1)}, for a fixed n. It is clear that E{exp(i t Sn + t²n/2) | σ(X1, ..., Xn−1)} = exp(i t Sn−1 + t²n/2) ϕ(t), where ϕ(t) is the characteristic function of X1. This implies

E{exp(i t Sn + t²n/2) | σ(X1, ..., Xn−1)} = exp(i t Sn−1 + t²(n − 1)/2).

Then {exp(i t Sn + t²n/2), σ(X1, ..., Xn)} is a martingale and we can apply the optional stopping theorem.

Example 8:
The PhD theory qualifying exam, Part I
8:30AM–12:30PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.

Show all your work. Justify all answers.


Good luck!

Question 1 (17 points)


1.1 Assume we have a test statistic, T, that is distributed as N(1, σ 2 = 4) and
N(2, σ 2 = 5) under the null and alternative hypothesis, respectively.
Define the expected p-value of the test statistic. Obtain the simple
formal notation of the corresponding expected p-value, approximat-
ing it numerically via the use of the standard normal distribution
curve below:

0.8

0.7

0.6
F (x)

0.5

0.4

0.3

0.2

−1.0 −0.5 0.0 0.5 1.0


x

1.2 Suppose we observe values of the likelihood ratio test statistics


LRi , i = 1,..., m. The ratio LRi was calculated corresponding to the
hypotheses H 0 i versus H 1i , i = 1,..., m. It is assumed that
LRi , i = 1,..., m, are independent and, for all i = 1,..., m, H ki is a compo-
nent of the hypothesis H k , k = 0,1. Let the density functions f LRi (u)
of LRi , i = 1,..., m, satisfy f LRi (u|H k ) = f LRi (u|H ki ), k = 0,1, i = 1,..., m.

Derive a likelihood ratio test statistic based on the observations


LR1 ,..., LRm to test for H 0 versus H 1, providing relevant proofs of
used propositions.
1.3 Biomarker levels were measured from disease and healthy popula-
tions, providing the independent and identically distributed (iid)
observations X 1 = 0.39, X 2 = 1.97, X 3 = 1.03, X 4 = 0.16 that are assu-
med to be from a continuous distribution FX as well as iid observa-
tions Y1 = 0.42, Y2 = 0.29, Y3 = 0.56, Y4 = −0.68, Y5 = −0.54 that are assu-
med to be from a continuous distribution FY , respectively. Define the
receiver operating characteristic (ROC) curve and an area under the
curve (AUC). Estimate the AUC nonparametrically.
Estimate the AUC, assuming FX and FY are normal distribution func-
tions (the graph in Question 1.1 can help you).
What can you conclude regarding the discriminating ability of the
biomarker with respect to the disease?
1.4 See Question 4.2, Section 13.1, Example 7.

Question 2 (29 points)


2.1 See Question 2.1, Section 13.3, Example 6.
2.2 Let X and Y be independent random variables. Assume X − Y and
X + Y are identically distributed. Show the characteristic function of
Y is a real value function.
2.3 See Question 2, Section 13.2, Example 8.
2.4 What is ln (i) , i 2 = −1?
2.5 See Question 1.3, Section 13.3, Example 2.
2.6 Suppose the random variable η has a continuous density function
( )
and the characteristic function φ(t ) = exp(i t − t 2 / 2) − exp( − t 2 / 2) / (i t ), i 2 = −1.
We know that η = η1 + η2 , where η1 ~ N (0,1) and η2 has a continuous
density function. Derive the density function of η2. Show formally
how to obtain (calculate) the density function, using the inversion
formula. (Suggestion: Use the Dirichlet integrals that are employed
to prove the inversion formula.)
2.7 See Question 2.5, Section 13.3, Example 6.

Question 3 (23 points)


3.1 We observe iid data points X 1 ,… , X n that are from a known joint
density function f ( x1 ,… , xn ; θ), where θ is an unknown parameter.
We want to test for the hypothesis H 0 : θ = θ0 versus H 1 : θ = θ1, where
θ0 , θ1 are known.

3.1.1 Propose (and give a relevant proof) the most powerful test
statistic, say, LRn for H 0 versus H 1.
3.1.2 See Question 2.4, Section 13.3, Example 3.
3.2. Let X 1 ,… , X n be observations that have a joint density function f
that can be presented in a known parametric form, depending on an
unknown parameter θ.

3.2.1 Propose (and prove) an integrated most powerful test for H0: θ = 0 versus H1: θ ≠ 0 with respect to obtaining the maximum integrated power of the test when values of the alternative parameter satisfy the uniform distribution function with the density π(θ) = 1 for θ ∈ [0, 1] and π(θ) = 0 otherwise.

3.2.2 Write the maximum likelihood ratio MLRn and formulate the
MLR test for H0: θ = 0 versus H1: θ ≠ 0 based on X1, …, Xn. Find the
asymptotic (n → ∞ ) distribution function of the statistic
2 log (MLRn), under H 0 .
3.2.3 Is the couple (MLRn and sigma algebra σ {X 1 ,… , X n}) a martin-
gale, submartingale, or supermartingale, under H 0 ? (Justify your
answer.)
3.3 Assume f , the density function of X 1 ,… , X n, is unknown, X 1 ,… , X n
are iid and θ = EX 1.
3.3.1 Define the empirical likelihood ratio test statistic ELRn.
3.3.2 Find the asymptotic (n → ∞) distribution function of the statistic 2 log(ELRn), under H0, provided that E|X1|³ < ∞.

3.4 See Question 3.3, Section 13.3, Example 5.

Question 4 (26 points)


4.1 Assume you save your money in a bank account each day, i, and you obtain a random benefit ri > 5.5, where r1, …, ri, ... are iid. You also can obtain money via a risky game having each day Zi = 11 dollars with Pr{Zi = 11} = 1/3 or Zi = −5 dollars with Pr{Zi = −5} = 2/3, i = 0, 1, … (random variables Zi are iid and independent of r1, …, ri).
You decide to stop the game when you will collect $100, that is, at the first time n: ∑_{i=1}^{n} (Zi + ri) ≥ 100.
It is known that, for all t > 0 and j > 0, Pr{∑_{i=1}^{j} (Zi + ri) ≤ t} = t/(j(j + 1)), when t/(j(j + 1)) ≤ 1, and Pr{∑_{i=1}^{j} (Zi + ri) ≤ t} = 1, when t/(j(j + 1)) ≥ 1.
4.1.1 What will be your expected benefit from the risky game stopped at the first time n: ∑_{i=1}^{n} (Zi + ri) ≥ 100, that is, E(∑ Zi), the sum being taken over the days played? When you apply a needed theorem, corresponding conditions should be checked. (Suggestion: Use that ∑_{j=1}^{∞} 1/(j(j + 1)) = 1.)

4.1.2 Formulate and prove needed theorems in general. Relevant for-


mal definitions of objects that you use (e.g., of a stopping time)
should be presented.
4.2 Prove the optional stopping theorem.
4.3 Formulate and prove the Doob theorem (inequality) based on non-
negative martingales.
4.4 See Question 3, Section 13.2, Example 4.
4.5 See Question 4, Section 13.1, Example 1.
4.6 See Question 4.3, Section 13.1, Example 2.

Question 5 (5 points)
See Question 5.3, Section 13.3, Example 3.

Comments. Several relevant comments can be found with respect to the examples shown in the sections above. Regarding Question 4.1, we note that, defining the random variable τ = inf{n ≥ 1 : ∑_{i=1}^{n} (Zi + ri) ≥ 100}, since (Zi + ri) > 0, we have, for a fixed N,

Pr{τ > N} = Pr{∑_{i=1}^{N} (Zi + ri) < 100} = Pr{exp(−(1/100) ∑_{i=1}^{N} (Zi + ri)) > e^{−1}}
≤ e E exp(−(1/100) ∑_{i=1}^{N} (Zi + ri)) = e (E exp(−(Z1 + r1)/100))^N → 0 as N → ∞. Thus Pr{τ < ∞} = 1.

Then τ is a stopping time with E(τ) = ∑_{j=1}^{∞} Pr{∑_{i=1}^{j} (Zi + ri) < 100}.
14
Examples of Course Projects

In this chapter we outline examples of projects that can be proposed for


individuals or groups of students relative to the course material contained in
this textbook. In order to work on the projects, students will be encouraged
to state correctly the corresponding problems, read the relevant literature
and provide a research result. The minimal expectation is that a Monte Carlo
study is performed in the context of the corresponding project. We believe
that interesting solutions related to the proposed projects have the potential
to be published in statistical journals. This can motivate students to try to
develop methodological approaches regarding the course projects.

14.1 Change Point Problems in the Model of Logistic Regression Subject to Measurement Errors
Assume we observe independent data points {Y_i ∈ {0, 1}, i = 1, …, n} that satisfy the model

Pr(Y_i = 1 | X_i) = (1 + exp(−α_0 − β_0 X_i))^{−1} I(ν < i) + (1 + exp(−α_1 − β_1 X_i))^{−1} I(ν ≥ i),

where α_0, α_1, β_0, β_1, and ν are parameters. The issue examined in this study is a change point problem of hypothesis testing, where H_0: ν ∉ [1, n] versus H_1: 1 ≤ ν ≤ n, ν > 0 is unknown. In this statement we consider scenarios when {X_i, i = 1, …, n} are unobserved, whereas the change point detection schemes should be based on {Y_i, Z_i = X_i + ε_i, i = 1, …, n}, where ε_1, …, ε_n denote the measurement error terms.
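For instance, data from this model can be generated as follows (a minimal R sketch; the parameter values, the covariate distribution, and the measurement error variance are illustrative assumptions only):

set.seed(1)
n <- 200; nu <- 100                       # change point location (unknown in practice)
alpha0 <- 0; beta0 <- 1                   # regime used when nu < i
alpha1 <- 1; beta1 <- -1                  # regime used when i <= nu
X   <- rnorm(n)                           # unobserved covariates X_i
eps <- rnorm(n, sd = 0.5)                 # measurement errors eps_i
Z   <- X + eps                            # observed surrogates Z_i = X_i + eps_i
p   <- ifelse(1:n > nu, plogis(alpha0 + beta0 * X), plogis(alpha1 + beta1 * X))
Y   <- rbinom(n, size = 1, prob = p)      # observed binary responses Y_i
# Proposed detection schemes would be based on (Y_i, Z_i), i = 1, ..., n, only.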

14.2 Bayesian Inference for Best Combinations Based on Values of Multiple Biomarkers
In Section 7.5, we considered the best combinations of multiple biomarkers in
the context of the ROC analysis. The proposed method extends the Su and


Liu (1993) concept to cases when prior information related to the parameters
of biomarker distributions is assumed to be specified.

14.3 Empirical Bayesian Inference for Best Combinations Based on Values of Multiple Biomarkers
In the case of multivariate normally distributed data, Stein (1956) proved that
when the dimension of the observed vectors is greater than or equal to three,
the maximum likelihood estimators (MLEs) are inadmissible estimators of
the corresponding parameters. James and Stein (1961) provided another esti-
mator that yields a frequentist risk (MSE) smaller than that of the MLEs.
Efron and Morris (1972) showed that the James–Stein estimator belongs to a
class of posterior empirical Bayes (PEB) point estimators in the Gaussian/
Gaussian model (Carlin and Louis, 2011). Su and Liu (1993) proposed the
MLEs of best combinations based on multivariate normally distributed val-
ues of biomarkers. In this project, we evaluate an improvement of Su and
Liu’s estimation scheme, using the PEB concept instead of the MLE.
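To fix ideas, the shrinkage effect underlying the PEB approach can be illustrated in a simplified setting (a minimal R sketch for a p-dimensional normal mean with identity covariance; the project itself would adapt this idea to the Su and Liu combination of biomarkers):

set.seed(1)
p  <- 10
mu <- rnorm(p)                                 # true mean vector
X  <- rnorm(p, mean = mu, sd = 1)              # single observed vector, X ~ N(mu, I)
mle <- X                                       # maximum likelihood estimator of mu
js  <- (1 - (p - 2) / sum(X^2)) * X            # James-Stein estimator (shrinkage toward 0)
c(loss.mle = sum((mle - mu)^2), loss.js = sum((js - mu)^2))

Repeating this experiment over many replications typically shows a smaller average squared-error loss for the James-Stein estimator when p ≥ 3, in line with the results cited above.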

14.4 Best Combinations Based on Log-Normally Distributed Values of Multiple Biomarkers
Oftentimes measurements related to biological processes follow a log-normal distribution (see Limpert et al., 2001; Vexler et al., 2016a, pp. 13–14, for details). In this context, we assume, without loss of generality, that two biomarkers are measured, presenting log-normally distributed values of (X_1, X_2) and (Y_1, Y_2) corresponding to the case and control populations. In general, X_1 and X_2 can be dependent, and Y_1 can be dependent on Y_2. In order to apply the ROC curve analysis, one can use the following approaches: (1) the Su and Liu (1993) method, obtaining the best linear combination based on the normally distributed (log(X_1), log(X_2)) and (log(Y_1), log(Y_2)) variables; (2) constructing the best combination of the biomarker values using the distribution of (X_1, X_2) and (Y_1, Y_2) (e.g., Pepe and Thompson, 2000); (3) the best linear combination based on the log-normally distributed (X_1, X_2) and (Y_1, Y_2) variables; (4) a technique that approximates the distribution function of a sum of log-normally distributed random variables via a log-normal distribution. The research question in this project is to compare these techniques, making suggestions on when to employ method (1), (2), (3), or (4) with respect to different values of the data distribution parameters, taking into account the relative complexities of the methods.
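As a starting point, approach (1) can be prototyped as follows (a minimal R sketch; the means and covariance matrices on the log scale are illustrative assumptions, and the combination coefficients are the Su and Liu (1993) type best-linear-combination weights estimated from the data):

set.seed(1)
library(MASS)                                       # for mvrnorm
mu.x <- c(0.6, 0.4); mu.y <- c(0, 0)                # means of the log-transformed biomarkers
S.x  <- matrix(c(1, 0.3, 0.3, 1), 2)                # case covariance (log scale)
S.y  <- matrix(c(1, 0.1, 0.1, 1), 2)                # control covariance (log scale)
logX <- mvrnorm(500, mu.x, S.x)                     # cases:    (log X1, log X2)
logY <- mvrnorm(500, mu.y, S.y)                     # controls: (log Y1, log Y2)
a  <- solve(cov(logX) + cov(logY)) %*% (colMeans(logX) - colMeans(logY))
cx <- as.vector(logX %*% a); cy <- as.vector(logY %*% a)
mean(outer(cx, cy, ">"))                            # empirical AUC of the combined score

The same simulated data could then be fed to the competing approaches (2)–(4) in order to compare the resulting AUC values across different parameter configurations.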

14.5 Empirical Likelihood Ratio Tests for Parameters of Linear Regressions
Consider, for example, the following relatively simple scenario. Observe independent data points (Y_1, X_1), …, (Y_n, X_n) that satisfy the linear regression model E(Y_i | X_i) = βX_i, i = 1, …, n, where β is a parameter. Assume we are interested in the empirical likelihood ratio test of H_0: β = β_0 versus H_1: β ≠ β_0. Following the empirical likelihood methodology (Chapter 10), under the null hypothesis, the empirical likelihood can be presented as

max_{0 < p_1, …, p_n < 1} { ∏_{i=1}^{n} p_i : ∑_{i=1}^{n} p_i = 1, ∑_{i=1}^{n} p_i (Y_i − β_0 X_i) = 0 },

when the empirical version of E(Y_i − βX_i) = E[E(Y_i − βX_i | X_i)] = 0, i = 1, …, n, is taken into account. However, we also have

E(X_i (Y_i − βX_i)) = E[E(X_i (Y_i − βX_i) | X_i)] = E[X_i E(Y_i − βX_i | X_i)] = 0, i = 1, …, n.
Then, one can propose the H_0-empirical likelihood in the form

max_{0 < p_1, …, p_n < 1} { ∏_{i=1}^{n} p_i : ∑_{i=1}^{n} p_i = 1, ∑_{i=1}^{n} p_i (Y_i − β_0 X_i) = 0, ∑_{i=1}^{n} p_i X_i (Y_i − β_0 X_i) = 0 }.

Moreover, it is clear that E(X_i^k (Y_i − βX_i)) = 0, i = 1, …, n, k = 0, 1, 2, 3, …. Therefore, in general, we can define the H_0-empirical likelihood as

L_k(β_0) = max_{0 < p_1, …, p_n < 1} { ∏_{i=1}^{n} p_i : ∑_{i=1}^{n} p_i = 1, ∑_{i=1}^{n} p_i (Y_i − β_0 X_i) = 0, ∑_{i=1}^{n} p_i X_i (Y_i − β_0 X_i) = 0, …, ∑_{i=1}^{n} p_i X_i^k (Y_i − β_0 X_i) = 0 }.

In this case, the nonparametric version of Wilks' theorem regarding the empirical likelihood ratio test statistic n^{−n}/L_k(β_0) can provide the result that 2 log(n^{−n}/L_k(β_0)) has an asymptotic chi-square distribution with k + 1 degrees
of freedom under the null hypothesis. It can be anticipated that, for a fixed
sample size n, relatively small values of k provide a “good” Type I error rate
control via Wilks’ theorem, since k defines the number of constraints to be in
effect when the empirical likelihood is derived. However, relatively large
values of k will increase the power of the test. The question to be evaluated
in this study is the following: What are the optimal values of k that, depend-
ing on n, lead to an appropriate Type I error rate control and maximize the
power of the empirical likelihood ratio test? Theoretically, this issue corre-
sponds to higher order approximations to the Type I error rate and the power
of the test.
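A direct way to experiment with this question is to compute the test statistic for several values of k using standard empirical likelihood software (a minimal R sketch; the use of the emplik package and the data-generating choices below are assumptions made for illustration, not the only possible implementation):

library(emplik)                                     # provides el.test()
set.seed(1)
n <- 100; beta0 <- 1
X <- runif(n, 1, 2)
Y <- beta0 * X + rnorm(n)                           # data generated under H0: beta = beta0
k <- 2                                              # highest power used in the constraints
G <- sapply(0:k, function(m) X^m * (Y - beta0 * X)) # n x (k+1) matrix of estimating functions
out <- el.test(G, mu = rep(0, k + 1))               # empirical likelihood test of mean zero
c(stat = out$`-2LLR`, p.value = 1 - pchisq(out$`-2LLR`, df = k + 1))

Here out$`-2LLR` corresponds to 2 log(n^{−n}/L_k(β_0)); repeating the computation over many Monte Carlo replications for a grid of (n, k) values provides empirical Type I error rates and powers for the comparison described above.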

14.6 Penalized Empirical Likelihood Estimation


The penalized parametric likelihood methodology is well addressed in the
modern statistical literature. In parallel with the penalized parametric likeli-
hood techniques, this research proposes and examines their nonparametric
counterparts, using the material of Chapters 5, 10, Section 10.5, and methods
shown in Qin and Lawless (1994).

14.7 An Improvement of the AUC-Based Inference


In Section 7.3.1, it is shown that, for the diseased population X ~ N(μ_1, σ_1²) and for the non-diseased population Y ~ N(μ_2, σ_2²), a closed form of the AUC is

A = Φ( (μ_1 − μ_2) / (σ_1² + σ_2²)^{1/2} ).

Then, in the case with μ_1 = μ_2, the AUC cannot assist in detecting differences between the distributions of X and Y when σ_1² ≠ σ_2². Towards this end, we propose considering the AUC based on the random variables Z(X, X²) and Z(Y, Y²), where X², Y² are chi-square distributed and the function Z is chosen to maximize the AUC, Pr{Z(X, X²) > Z(Y, Y²)} (see Section 7.5 in this context). As a first stage in this statement of the problem, we can consider Z(u, v) = u + av, where a = argmax_λ Pr(X + λX² > Y + λY²).
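The gain available from the quadratic term can be examined by a simple grid search (a minimal Monte Carlo sketch in R; the choices μ_1 = μ_2 = 0, σ_1 = 2, σ_2 = 1, the sample sizes, and the grid for λ are illustrative assumptions):

set.seed(1)
x <- rnorm(10000, mean = 0, sd = 2)            # diseased values:     equal means,
y <- rnorm(10000, mean = 0, sd = 1)            # non-diseased values: unequal variances
auc <- function(lam) mean(x + lam * x^2 > y + lam * y^2)   # Pr{Z(X, X^2) > Z(Y, Y^2)}
lam.grid <- seq(-2, 2, by = 0.05)
auc.grid <- sapply(lam.grid, auc)
c(auc.at.lambda0 = auc(0), best.lambda = lam.grid[which.max(auc.grid)], best.auc = max(auc.grid))

In this setting the AUC at λ = 0 is close to 0.5, whereas positive values of λ can raise it noticeably, illustrating the motivation stated above.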

14.8 Composite Estimation of the Mean Based on Log-Normally Distributed Observations
Let iid data points X_1, …, X_n be observed. Assume that X_1 is distributed as the random variable e^ξ, where ξ ~ N(μ, σ²) with unknown parameters μ, σ². The problem is to estimate θ = EX_1. Since θ = exp(μ + σ²/2), maximum likelihood estimation provides the estimator θ̂ = exp(μ̂ + σ̂²/2) of θ, where μ̂, σ̂² are the maximum likelihood estimators of μ, σ². However, it can be anticipated that the simple estimator θ̃ = ∑_{i=1}^{n} X_i / n of θ = EX_1 can outperform θ̂ for relatively small values of n. The research issue in this project is related to how to combine θ̂ and θ̃ to optimize the confidence interval estimation of θ, depending on n.
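A first Monte Carlo look at this comparison can be obtained as follows (a minimal R sketch; the values of μ, σ, n and the number of replications are illustrative assumptions):

set.seed(1)
mu <- 0; sigma <- 1; n <- 15
theta <- exp(mu + sigma^2 / 2)                       # true mean of the log-normal distribution
est <- replicate(5000, {
  x  <- rlnorm(n, meanlog = mu, sdlog = sigma)
  m  <- mean(log(x)); s2 <- mean((log(x) - m)^2)     # MLEs of mu and sigma^2
  c(mle = exp(m + s2 / 2), simple = mean(x))         # theta-hat versus the sample mean
})
rowMeans((est - theta)^2)                            # Monte Carlo MSE of each estimator

Extending this sketch to interval estimation, and to combinations of the two estimators, is the subject of the project.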
References

Ahmad, I. A. (1996). A class of Mann-Whitney-Wilcoxon type statistics. American Statistician, 50, 324–327.
Armitage, P. (1975). Sequential Medical Trials. New York, NY: John Wiley & Sons.
Armitage, P. (1981). Importance of prognostic factors in the analysis of data from
clinical trials. Controlled Clinical Trials, 1(4), 347–353.
Armitage, P. (1991). Interim analysis in clinical trials. Statistics in Medicine, 10(6),
925–937.
Armstrong, D. (2015). Advanced Protocols in Oxidative Stress III. New York, NY:
Springer.
Balakrishnan, N., Johnson, N. L. and Kotz, S. (1994). Continuous Univariate Distribu-
tions. New York, NY: Wiley Series in Probability and Statistics.
Balakrishnan, N. and Lai, C.-D. (2009). Continuous Bivariate Distributions. New York,
NY: Springer.
Bamber, D. (1975). The area above the ordinal dominance graph and the area below
the receiver operating characteristic graph. Journal of Mathematical Psychology,
12(4), 387–415.
Barnard, G. A. (1946). Sequential tests in industrial statistics. Supplement to the Journal
of the Royal Statistical Society, 8(1), 1–26.
Bartky, W. (1943). Multiple sampling with constant probability. Annals of Mathematical
Statistics, 14(4), 363–377.
Bauer, P. and Kohne, K. (1994). Evaluation of experiments with adaptive interim anal-
yses. Biometrics, 50(4), 1029–1041.
Bayarri, M. J. and Berger, J. O. (2000). P values for composite null models. Journal of the
American Statistical Association, 95, 1127–1142.
Belomestnyi, D. V. (2005). Reconstruction of the general distribution by the distribu-
tion of some statistics. Theory of Probability & Its Applications, 49, 1–15.
Berger, J. O. (1980). Statistical Decision Theory: Foundations, Concepts, and Methods. New
York, NY: Springer-Verlag.
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. New York, NY:
Springer-Verlag.
Berger, J. O. (2000). Could Fisher, Jeffreys and Neyman have agreed on testing? Statis-
tical Science, 18. 1–12.
Berger, J. O. and Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of
P values and evidence. Journal of the American Statistical Association, 82(397), 112–122.
Berger, J. O., Wolpert, R. L., Bayarri, M. J., DeGroot, M. H., Hill, B. M., Lane, D. A. and
LeCam, L. (1988). The likelihood principle. Lecture Notes—Monograph Series, 6,
iii–v, vii–xii, 1–199.
Berger, R. L. and Boss, D. D. (1994). P values maximized over a confidence set for the
nuisance parameter. Journal of the American Statistical Association, 89, 1012–1016.
Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. Toronto: Wiley.
Bickel, P. J. and Doksum, K. A. (2007). Mathematical Statistics: Basic Ideas and Selected
Topics. 2nd ed., Vol. I, Updated Printing. Upper Saddle River, NJ: Pearson
Prentice Hall.


Bleistein, N. and Handelsman, R. A. (1975). Asymptotic Expansions of Integrals. New


York, NY: Courier Corporation.
Booth, J. G. and Sarkar, S. (1998). Monte Carlo approximation of bootstrap variances.
American Statistician, 52(4), 354–357.
Borovkov, A. A. (1998). Mathematical Statistics. The Netherlands: Gordon & Breach
Science Publishers.
Box, G. E. (1980). Sampling and Bayes’ inference in scientific modelling and robustness.
Journal of the Royal Statistical Society. Series A (General), 143(4), 383–430.
Box, G. E. and Cox, D. R. (1964). An analysis of transformations. Journal of the Royal
Statistical Society. Series B (Methodological), 26(2), 211–252.
Bressoud, D. M. (2008). A Radical Approach to Lebesgue’s Theory of Integration. New
York, NY: Cambridge University Press.
Broemeling, L. D. (2007). Bayesian Biostatistics and Diagnostic Medicine. New York, NY:
Chapman & Hall.
Brostrom, G. (1997). A martingale approach to the changepoint problem. Journal of the
American Statistical Association, 92, 1177–1183.
Bross, I. D., Gunter, B., Snee, R. D., Berger, R. L., et al. (1994). Letters to the editor.
American Statistician, 48, 174–177.
Browne, R. H. (2010). The t-test p value and its relationship to the effect size and
P(X>Y). American Statistician, 64(1), 30–33.
Campbell, G. (1994). Advances in statistical methodology for the evaluation of diag-
nostic and laboratory tests. Statistics in Medicine, 13(5–7), 499–508.
Carlin, B. P. and Louis, T. A. (2011). Bayesian Methods for Data Analysis. Boca Raton, FL:
CRC Press.
Chen, M. H. and Shao, Q. M. (1999). Monte Carlo estimation of Bayesian credible and
HPD intervals. Journal of Computational and Graphic Statistics, 8, 69–92.
Chen, X., Vexler, A. and Markatou, M. (2015). Empirical likelihood ratio confidence
interval estimation of best linear combinations of biomarkers. Computational
Statistics & Data Analysis, 82, 186–198.
Chow, Y. S., Robbins, H. and Siegmund, D. (1971). Great Expectations: The Theory of
Optimal Stopping. Boston, MA: Houghton Mifflin.
Chuang, C.-S. and Lai, T. L. (2000). Hybrid resampling methods for confidence inter-
vals. Statistica Sinica, 10(1), 1–32.
Chung, K. L. (1974). A Course in Probability Theory. San Diego, CA: Academic Press.
Chung, K. L. (2000). A Course in Probability Theory. 2nd ed. London: Academic Press.
Claeskens, G. and Hjort, N. L. (2004). Goodness of fit via non‐parametric likelihood
ratios. Scandinavian Journal of Statistics, 31(4), 487–513.
Copas, J. and Corbett, P. (2002). Overestimation of the receiver operating characteristic
curve for logistic regression. Biometrika, 89(2), 315–331.
Cornish, E. A. and Fisher, R. A. (1937). Moments and cumulants in the specification of
distributions. Review of the International Statistical Institute, 5, 307–322.
Cox, R. (1962). Renewal Theory. New York, NY: Methuen & Company.
Crawley, M. J. (2012). The R Book. United Kingdom: John Wiley & Sons.
Csörgö, M. and Horváth, L. (1997). Limit Theorems in Change-Point Analysis. New York,
NY: Wiley.
Cutler, S., Greenhouse, S., Cornfield, J. and Schneiderman, M. (1966). The role of
hypothesis testing in clinical trials: Biometrics seminar. Journal of Chronic Dis-
eases, 19(8), 857–882.

Daniels, M. and Hogan, J. (2008). Missing Data in Longitudinal Studies: Strategies for
Bayesian Modeling and Sensitivity Analysis. New York, NY: Chapman & Hall.
Davis, C. S. (1993). The computer generation of multinomial random variates. Compu-
tational Statistics & Data Analysis, 16(2), 205–217.
Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cam-
bridge: Cambridge University Press.
De Bruijn, N. G. (1970). Asymptotic Methods in Analysis. Amsterdam: Dover Publications.
DeGroot, M. H. (2005). Optimal Statistical Decisions. Hoboken, NJ: John Wiley & Sons.
Delwiche, L. D. and Slaughter, S. J. (2012). The Little SAS Book: A Primer: A Program-
ming Approach. Cary, NC: SAS Institute.
Dempster, A. P. and Schatzoff, M. (1965). Expected significance level as a sensitivity
index for test statistics. Journal of the American Statistical Association, 60, 420–436.
Desquilbet, L. and Mariotti, F. (2010). Dose-response analyses using restricted cubic
spline functions in public health research. Statistics in Medicine, 29(9), 1037–1057.
DiCiccio, T. J., Kass, R. E., Raftery, A. and Wasserman, L. (1997). Computing Bayes
factors by combining simulation and asymptotic approximations. Journal of the
American Statistical Association, 92(439), 903–915.
Dmitrienko, A., Molenberghs, G., Chuang-Stein, C. and Offen, W. (2005). Analysis of
Clinical Trials Using SAS: A Practical Guide. Cary, NC: SAS Institute.
Dmitrienko, A., Tamhane, A. C. and Bretz, F. (2010). Multiple Testing Problems in Phar-
maceutical Statistics. New York, NY: CRC Press.
Dodge, H. F. and Romig, H. (1929). A method of sampling inspection. Bell System
Technical Journal, 8(4), 613–631.
Dragalin, V. P. (1997). The sequential change point problem. Economic Quality Control,
12, 95–122.
Edgar, G. A. and Sucheston, L. (2010). Stopping Times and Directed Processes (Encyclopedia
of Mathematics and Its Applications). New York, NY: Cambridge University Press.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics,
7(1), 1–26.
Efron, B. (1981). Nonparametric standard errors and confidence intervals. The Cana-
dian Journal of Statistics/La Revue Canadienne de Statistique, 9(2), 139–158.
Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of
the American Statistical Association, 81(394), 461–470.
Efron, B. and Morris, C. (1972). Limiting risk of Bayes and empirical Bayes estimators.
Journal of the American Statistical Association, 67, 130–139.
Efron, B. and Tibshirani, R. J. (1994). An Introduction to the Bootstrap. Boca Raton, FL:
CRC Press.
Eng, J. (2005). Receiver operating characteristic analysis: A primer. Academic Radiology,
12(7), 909–916.
Evans, M. and Swartz, T. (1995). Methods for approximating integrals in statistics with
special emphasis on Bayesian integration problems. Statistical Science, 10(3), 254–272.
Everitt, B. S. and Palmer, C. (2010). Encyclopaedic Companion to Medical Statistics.
Hoboken, NJ: John Wiley & Sons.
Fan, J., Farmen, M. and Gijbels, I. (1998). Local maximum likelihood estimation and
inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
60(3), 591–608.
Fan, J., Zhang, C. and Zhang, J. (2001). Generalized likelihood ratio statistics and
Wilks phenomenon. Annals of Statistics, 29(1), 153–193.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8),


861–874.
Feuerverger, A. and Mureika, R. (1977). The empirical characteristic function and its
applications. Annals of Statistics, 5, 88–97.
Finkelestein, M., Tucker, H. G. and Veeh, J. A. (1997). Extinguishing the distinguished
logarithm problems. Proceedings of the American Mathematical Society, 127, 2773–2777.
Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80, 27–38.
Fisher, R. (1925). Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd.
Fisher, R. A. (1921). On the “probable error” of a coefficient of correlation deduced
from a small sample. Metron, 1, 3–32.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philo-
sophical Transactions of the Royal Society A, 222, 309–368.
Fluss, R., Faraggi, D. and Reiser, B. (2005). Estimation of the Youden index and its
associated cutoff point. Biometrical Journal, 47, 458–472.
Food and Drug Administration. (1988). Guideline for the Format and Content of the Clin-
ical and Statistical Sections of New Drug Applications. FDA, U.S. Department of
Health and Human Services, Rockville, MD.
Freeman, H. A., Friedman, M., Mosteller, F. and Wallis, W. A. (1948). Sampling Inspec-
tion. New York, NY: McGraw-Hill.
Freiman, J. A., Chalmers, T. C., Smith, H. Jr. and Kuebler, R. R. (1978). The importance
of beta, the type II error and sample size in the design and interpretation of the
randomized control trial. Survey of 71 “negative” trials. New England Journal of
Medicine, 299(13), 690–694.
Gardener, M. (2012). Beginning R: The Statistical Programming Language. Canada: John
Wiley & Sons.
Gavit, P., Baddour, Y. and Tholmer, R. (2009). Use of change-point analysis for process
monitoring and control. BioPharm International, 22(8), 46–55.
Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2003). Bayesian Data Analysis.
Boca Raton, FL: CRC Press.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. and Rubin, D. B.
(2013). Bayesian Data Analysis. Boca Raton, FL: CRC Press.
Genz, A. and Kass, R. E. (1997). Subregion-adaptive integration of functions having a
dominant peak. Journal of Computational and Graphical Statistics, 6(1), 92–111.
Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo inte-
gration. Econometrica: Journal of the Econometric Society, 57(6), 1317–1339.
Ghosh, M. (1995). Inconsistent maximum likelihood estimators for the Rasch model.
Statistics & Probability Letters, 23(2), 165–170.
Gönen, M., Johnson, W. O., Lu, Y. and Westfall, P. H. (2005). The Bayesian two-sample
t-test. American Statistician, 59(3), 252–257.
Good, I. (1992). The Bayes/non-Bayes compromise: A brief review. Journal of the Amer-
ican Statistical Association, 87(419), 597–606.
Gordon, L. and Pollak, M. (1995). A robust surveillance scheme for stochastically
ordered alternatives. Annals of Statistics, 23, 1350–1375.
Green, D. M. and Swets, J. A. (1966). Signal Detection Theory and Psychophysics. New
York, NY: Wiley.
Grimmett, G. and Stirzaker, D. (1992). Probability and Random Processes. Oxford:
Oxford University Press.
Gurevich, G. and Vexler, A. (2005). Change point problems in the model of logistic
regression. Journal of Statistical Planning and Inference, 131(2), 313–331.

Gurevich, G. and Vexler, A. (2010). Retrospective change point detection: From para-
metric to distribution free policies. Communications in Statistics—Simulation and
Computation, 39(5), 899–920.
Hall, P. (1992). The Bootstrap and Edgeworth Expansions. New York, NY: Springer.
Hammersley, J. M. and Handscomb, D. C. (1964). Monte Carlo Methods. New York, NY:
Springer.
Han, C. and Carlin, B. P. (2001). Markov chain Monte Carlo methods for computing
Bayes factors. Journal of the American Statistical Association, 96(455), 1122–1132.
Hartigan, J. (1971). Error analysis by replaced samples. Journal of the Royal Statistical
Society. Series B (Methodological), 33(2), 98–110.
Hartigan, J. A. (1969). Using subsample values as typical values. Journal of the American
Statistical Association, 64(328), 1303–1317.
Hartigan, J. A. (1975). Necessary and sufficient conditions for asymptotic joint
normality of a statistic and its subsample values. Annals of Statistics, 3(3),
573–580.
Hosmer, D. W. Jr. and Lemeshow, S. (2004). Applied Logistic Regression. Hoboken, NJ:
John Wiley & Sons.
Hsieh, F. and Turnbull, B. W. (1996). Nonparametric and semiparametric estimation of
the receiver operating characteristic curve. Annals of Statistics, 24(1), 25–40.
Hwang, J. T. G. and Yang, M-c. (2001). An optimality theory for MID p-values in
2 x 2 contingency tables. Statistica Sinica, 11, 807–826.
James, W. and Stein, C. (1961). Estimation with quadratic loss. Proceedings of the Fourth
Berkeley Symposium on Mathematical Statistics and Probability, 1, 361–379.
Janzen, F. J., Tucker, J. K. and Paukstis, G. L. (2000). Experimental analysis of an early
life-history stage: Selection on size of hatchling turtles. Ecology, 81(8), 2290–2304.
Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability.
Mathematical Proceedings of the Cambridge Philosophical Society, 31(2), 203–222.
Jeffreys, H. (1961). Theory of Probability. Oxford: Oxford University Press.
Jennison, C. and Turnbull, B. W. (2000). Group Sequential Methods with Applications to
Clinical Trials. New York, NY: CRC Press.
Johnson, M. E. (1987). Multivariate Statistical Simulations. New York, NY: John Wiley &
Sons.
Julious, S. A. (2005). Why do we use pooled variance analysis of variance? Pharmaceu-
tical Statistics, 4, 3–5.
Kass, R. E. (1993). Bayes factors in practice. Statistician, 42(5), 551–560.
Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical
Association, 90(430), 773–795.
Kass, R. E. and Vaidyanathan, S. K. (1992). Approximate Bayes factors and orthogonal
parameters, with application to testing equality of two binomial proportions.
Journal of the Royal Statistical Society. Series B (Methodological), 54(1), 129–144.
Kass, R. E. and Wasserman, L. (1995). A reference Bayesian test for nested hypotheses
and its relationship to the Schwarz criterion. Journal of the American Statistical
Association, 90(431), 928–934.
Kass, R. E. and Wasserman, L. (1996). The selection of prior distributions by formal
rules. Journal of the American Statistical Association, 91(435), 1343–1370.
Kass, R. E., Tierney, L. and Kadane, J. B. (1990). The validity of posterior asymptotic
expansions based on Laplace’s method. Bayesian and Likelihood Methods in Statis-
tics and Econometrics: Essays in Honor of George A. Barnard. Geisser, S., Hodges, J.
S., Press, S. J. and ZeUner, A. (Eds.), New York, NY: North-Holland.

Kepner, J. L. and Chang, M. N. (2003). On the maximum total sample size of a group
sequential test about binomial proportions. Statistics & Probability Letters, 62(1),
87–92.
Koch, A. L. (1966). The logarithm in biology 1. Mechanisms generating the log-normal
distribution exactly. Journal of Theoretical Biology, 12(2), 276–290.
Konish, S. (1978). An approximation to the distribution of the sample correlation coef-
ficient. Biometrika, 65, 654–656.
Korevaar, J. (2004). Tauberian Theory. New York, NY: Springer.
Kotz, S., Balakrishnan, N. and Johnson, N. L. (2000). Continuous Multivariate Distribu-
tions, Models and Applications. New York, NY: John Wiley & Sons.
Kotz, S., Lumelskii, Y. and Pensky, M. (2003). The Stress-Strength Model and Its General-
izations: Theory and Applications. Singapore: World Scientific Publishing Com-
pany.
Krieger, A. M., Pollak, M. and Yakir, B. (2003). Surveillance of a simple linear regres-
sion. Journal of the American Statistical Association, 98, 456–469.
Lai, T. L. (1995). Sequential changepoint detection in quality control and dynamical
systems. Journal of the Royal Statistical Society. Series B (Methodological), 57(4),
613–658.
Lai, T. L. (1996). On uniform integrability and asymptotically risk-efficient sequential
estimation. Sequential Analisis, 15, 237–251.
Lai, T. L. (2001). Sequential analysis: Some classical problems and new challenges.
Statistica Sinica, 11, 303–408.
Lane-Claypon, J. E. (1926). A Further Report on Cancer of the Breast with Special Reference
to Its Associated Antecedent Conditions. Ministry of Health. Reports on Public
Health and Medical Subjects (32), London.
Lasko, T. A., Bhagwat, J. G., Zou, K. H. and Ohno-Machado, L. (2005). The use of
receiver operating characteristic curves in biomedical informatics. Journal of Bio-
medical Informatics, 38(5), 404–415.
Lazar, N. A. (2003). Bayesian empirical likelihood. Biometrika, 90(2), 319–326.
Lazar, N. A. and Mykland, P. A. (1999). Empirical likelihood in the presence of nui-
sance parameters. Biometrika, 86(1), 203–211.
Lazzeroni, L. C., Lu, Y. and Belitskaya-Levy, I. (2014). P-values in genomics: Apparent
precision masks high uncertainty. Molecular Psychiatry, 19, 1336–1340.
Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates and
related Bayes’ estimates. University of California Publications in Statistics, 1, 277–330.
Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. New York, NY:
Springer-Verlag.
Lee, E. T. and Wang, J. W. (2013). Statistical Methods for Survival Data Analysis. Hoboken,
NJ: John Wiley & Sons.
Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation. New York, NY:
Springer-Verlag.
Lehmann, E. L. and Romano, J. P. (2006). Testing Statistical Hypotheses. New York, NY:
Springer-Verlag.
Limpert, E., Stahel, W. A. and Abbt, M. (2001). Log-normal distributions across the
sciences: Keys and clues on the charms of statistics, and how mechanical models
resembling gambling machines offer a link to a handy way to characterize log-
normal distributions, which can provide deeper insight into variability and
probability—Normal or log-normal: That is the question. BioScience, 51(5),
341–352.

Lin, C.-C. and Mudholkar, G. S. (1980). A simple test for normality against asymmetric
alternatives. Biometrika, 67(2), 455–461.
Lin, Y. and Shih, W. J. (2004). Adaptive two‐stage designs for single‐arm phase IIA
cancer clinical trials. Biometrics, 60(2), 482–490.
Lindsey, J. (1996). Parametric Statistical Inference. Oxford: Oxford University Press.
Liptser, R. Sh. and Shiryayev, A. N. (1989). Theory of Martingales. London: Kluwer
Academic Publishers.
Liu, A. and Hall, W. (1999). Unbiased estimation following a group sequential test.
Biometrika, 86(1), 71–78.
Liu, C., Liu, A. and Halabi, S. (2011). A min–max combination of biomarkers to
improve diagnostic accuracy. Statistics in Medicine, 30(16), 2005–2014.
Lloyd, C. J. (1998). Using smoothed receiver operating characteristic curves to sum-
marize and compare diagnostic systems. Journal of the American Statistical Asso-
ciation, 93(444), 1356–1364.
Lorden, G. and Pollak, M. (2005). Nonanticipating estimation applied to sequential
analysis and changepoint detection. The Annals of Statistics, 33, 1422–1454.
Lucito, R., West, J., Reiner, A., Alexander, J., Esposito, D., Mishra, B., Powers, S., Nor-
ton, L. and Wigler, M. (2000). Detecting gene copy number fluctuations in tumor
cells by microarray analysis of genomic representations. Genome Research, 10(11),
1726–1736.
Lukacs, E. (1970). Characteristic Functions. London: Charles Griffin & Company.
Lusted, L. B. (1971). Signal detectability and medical decision-making. Science,
171(3977), 1217–1219.
Mallows, C. L. (1972). A note on asymptotic joint normality. Annals of Mathematical
Statistics, 39, 755–771.
Marden, J. I. (2000). Hypothesis testing: From p values to Bayes factors. Journal of the
American Statistical Association, 95, 1316–1320.
Martinsek, A. T. (1981). A note on the variance and higher central moments of the stop-
ping time of an SPRT. Journal of the American Statistical Association, 76(375), 701–703.
McCulloch, R. and Rossi, P. E. (1991). A Bayesian approach to testing the arbitrage
pricing theory. Journal of Econometrics, 49(1), 141–168.
McIntosh, M. W. and Pepe, M. S. (2002). Combining several screening tests: Optimality
of the risk score. Biometrics, 58(3), 657–664.
Meng, X.-L. and Wong, W. H. (1996). Simulating ratios of normalizing constants via a
simple identity: A theoretical exploration. Statistica Sinica, 6(4), 831–860.
Metz, C. E., Herman, B. A. and Shen, J. H. (1998). Maximum likelihood estimation of
receiver operating characteristic (ROC) curves from continuously‐distributed
data. Statistics in Medicine, 17(9), 1033–1053.
Mikusinski, P. and Taylor, M. D. (2002). An Introduction to Multivariable Analysis from
Vector to Manifold. Boston, MA: Birkhäuser.
Montgomery, D. C. (1991). Introduction to Statistical Quality Control. New York, NY:
Wiley.
Nakas, C. T. and Yiannoutsos, C. T. (2004). Ordered multiple‐class ROC analysis with
continuous measurements. Statistics in Medicine, 23(22), 3437–3449.
Newton, M. A. and Raftery, A. E. (1994). Approximate Bayesian inference with the
weighted likelihood bootstrap. Journal of the Royal Statistical Society. Series B
(Methodological), 56(1), 3–48.
Neyman, J. and Pearson, E. S. (1928). On the use and interpretation of certain test cri-
teria for purposes of statistical inference: Part II. Biometrika, 20A(3/4), 263–294.

Neyman, J. and Pearson, E. S. (1933). The testing of statistical hypotheses in relation


to probabilities a priori. Mathematical Proceedings of the Cambridge Philosophical
Society, 29(4), 492–510.
Neyman, J. and Pearson, E. S. (1938). Contributions to the theory of testing statistical
hypotheses. Journal Statistical Research Memoirs (University of London), 2, 25–58.
Obuchowski, N. A. (2003). Receiver operating characteristic curves and their use in
radiology 1. Radiology, 229(1), 3–8.
Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single func-
tional. Biometrika, 75(2), 237–249.
Owen, A. B. (1990). Empirical likelihood ratio confidence regions. The Annals of Statis-
tics, 18(1), 90–120.
Owen, A. B. (1991). Empirical likelihood for linear models. The Annals of Statistics,
19(4), 1725–1747.
Owen, A. B. (2001). Empirical Likelihood. Boca Raton, FL: CRC Press.
Pepe, M. S. (1997). A regression modelling framework for receiver operating charac-
teristic curves in medical diagnostic testing. Biometrika, 84(3), 595–608.
Pepe, M. S. (2000). Receiver operating characteristic methodology. Journal of the Amer-
ican Statistical Association, 95(449), 308–311.
Pepe, M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Predic-
tion. Oxford: Oxford University Press.
Pepe, M. S. and Thompson, M. L. (2000). Combining diagnostic test results to increase
accuracy. Biostatistics, 1(2), 123–140.
Petrone, S., Rousseau, J. and Scricciolo, C. (2014). Bayes and empirical Bayes: Do they
merge? Biometrika, 101(2), 285–302.
Petrov, V. V. (1975). Sums of Independent Random Variables. New York, NY: Springer.
Piantadosi, S. (2005). Clinical Trials: A Methodologic Perspective. Hoboken, NJ: John
Wiley & Sons.
Pocock, S. J. (1977). Group sequential methods in the design and analysis of clinical
trials. Biometrika, 64(2), 191–199.
Pollak, M. (1985). Optimal detection of a change in distribution. The Annals of Statistics,
13, 206–227.
Pollak, M. (1987). Average run lengths of an optimal method of detecting a change in
distribution. The Annals of Statistics, 5(2), 749–779.
Powers, L. (1936). The nature of the interaction of genes affecting four quantitative
characters in a cross between Hordeum deficiens and Hordeum vulgare. Genet-
ics, 21(4), 398.
Proschan, M. A. and Shaw, P. A. (2016). Essentials of Probability Theory for Statisticians.
New York, NY: Chapman & Hall/CRC.
Provost, F. J. and Fawcett, T. (1997). Analysis and visualization of classifier perfor-
mance: Comparison under imprecise class and cost distributions. In Proceedings
of the Third International Conference on Knowledge Discovery and Data Mining,
AAAI Press, 43–48.
Qin, J. (2000). Combining parametric and empirical likelihoods. Biometrika, 87(2),
484–490.
Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations.
The Annals of Statistics, 22(1), 300–325.
Qin, J. and Leung, D. H. (2005). A semiparametric two‐component “compound” mix-
ture model and its application to estimating malaria attributable fractions. Bio-
metrics, 61(2), 456–464.

Qin, J. and Zhang, B. (2005). Marginal likelihood, conditional likelihood and empirical
likelihood: Connections and applications. Biometrika, 92(2), 251–270.
Quenouille, M. H. (1949). Problems in plane sampling. The Annals of Mathematical
Statistics, 20(3), 355–375.
Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43(3–4), 353–360.
R Development Core Team. (2014). R: A Language and Environment for Statistical Com-
puting. R Foundation for Statistical Computing, Vienna, Austria.
Reid, N. (2000). Likelihood. Journal of the American Statistical Association, 95, 1335–1340.
Reiser, B. and Faraggi, D. (1997). Confidence intervals for the generalized ROC crite-
rion. Biometrics, 53(2), 644–652.
Richards, R. J., Hammitt, J. K. and Tsevat, J. (1996). Finding the optimal multiple-test
strategy using a method analogous to logistic regression the diagnosis of hepa-
tolenticular degeneration (Wilson’s disease). Medical Decision Making, 16(4),
367–375.
Ritov, Y. (1990). Decision theoretic optimality of the CUSUM procedure. The Annals of
Statistics, 18, 1464–1469.
Robbins, H. (1959). Sequential estimation of the mean of a normal population. Proba-
bility and Statistics (Harold Cramér Volume). Stockholm: Almquist and Wiksell, pp.
235–245.
Robbins, H. (1970). Statistical methods related to the law of the iterated logarithm. The
Annals of Mathematical Statistics, 41, 1397–1409.
Robbins, H. and Siegmund, D. (1970). Boundary crossing probabilities for the Wiener
process and sample sums. The Annals of Mathematical Statistics, 41, 1410–1429.
Robbins, H. and Siegmund, D. (1973). A class of stopping rules for testing parametric
hypotheses. In Proceedings of Sixth Berkeley Symposium of Mathematical Statistics
and Probability, University of California Press, 37–41.
Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods. 2nd ed., New York,
NY: Springer.
Sackrowitz, H. and Samuel-Cahn, E. (1999). P values as random variables-expected p
values. The American Statistician, 53, 326–331.
SAS Institute Inc. (2011). SAS® 9.3 Macro Language: Reference. Cary, NC: SAS Institute.
Schiller, F. (2013). Die Verschwörung des Fiesco zu Genua. Berliner: CreateSpace
Independent Publishing Platform.
Schisterman, E. F., Faraggi, D., Browne, R., Freudenheim, J., Dorn, J., Muti, P., Arm-
strong, D., Reiser, B. and Trevisan, M. (2001a). Tbars and cardiovascular disease
in a population-based sample. Journal of Cardiovascular Risk, 8(4), 219–225.
Schisterman, E. F., Faraggi, D., Browne, R., Freudenheim, J., Dorn, J., Muti, P., Arm-
strong, D., Reiser, B. and Trevisan, M. (2002). Minimal and best linear combina-
tion of oxidative stress and antioxidant biomarkers to discriminate cardiovascular
disease. Nutrition, Metabolism, and Cardiovascular Disease, 12, 259–266.
Schisterman, E. F., Faraggi, D., Reiser, B. and Trevisan, M. (2001b). Statistical inference
for the area under the receiver operating characteristic curve in the presence of
random measurement error. American Journal of Epidemiology, 154, 174–179.
Schisterman, E. F., Perkins, N. J., Liu, A. and Bondell, H. (2005). Optimal cut-point and
its corresponding Youden index to discriminate individuals using pooled blood
samples. Epidemiology, 16, 73–81.
Schisterman, E. F., Vexler, A., Ye, A. and Perkins, N. J. (2011). A combined efficient
design for biomarker data subject to a limit of detection due to measuring
instrument sensitivity. The Annals of Applied Statistics, 5, 2651–2667.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2),
461–464.
Schweder, T. and Hjort, N. L. (2002). Confidence and likelihood. Scandinavian Journal
of Statistics, 29(2), 309–332.
Serfling, R. J. (2002). Approximation Theorems of Mathematical Statistics. Canada: John
Wiley & Sons.
Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. New York, NY: Springer Science
& Business Media.
Shao, J. and Wu, C. F. J. (1989). A general theory for jackknife variance estimation. The
Annals of Statistics, 17, 1176–1197.
Shapiro, D. E. (1999). The interpretation of diagnostic tests. Statistical Methods in Med-
ical Research, 8(2), 113–134.
Siegmund, D. (1968). On the asymptotic normality of one-sided stopping rules. The
Annals of Mathematical Statistics, 39(5), 1493–1497.
Simon, R. (1989). Optimal two-stage designs for phase II clinical trials. Controlled Clin-
ical Trials, 10(1), 1–10.
Singh, K., Xie, M. and Strawderman, W. E. (2005). Combining information from inde-
pendent sources through confidence distributions. The Annals of Statistics, 33(1),
159–183.
Sinharay, S. and Stern, H. S. (2002). On the sensitivity of Bayes factors to the prior
distributions. The American Statistician, 56(3), 196–201.
Sinnott, E. W. (1937). The relation of gene to character in quantitative inheritance. Pro-
ceedings of the National Academy of Sciences of the United States of America, 23(4), 224.
Sleasman, J. W., Nelson, R. P., Goodenow, M. M., Wilfret, D., Hutson, A., Baseler, M.,
Zuckerman, J., Pizzo, P. A. and Mueller, B. U. (1999). Immunoreconstitution
after ritonavir therapy in children with human immunodeficiency virus infec-
tion involves multiple lymphocyte lineages. The Journal of Pediatrics, 134,
597–606.
Smith, A. F. and Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler
and related Markov chain Monte Carlo methods. Journal of the Royal Statistical
Society. Series B (Methodological), 55(1), 3–23.
Smith, R. L. (1985). Maximum likelihood estimation in a class of nonregular cases.
Biometrika, 72(1), 67–90.
Snapinn, S., Chen, M. G., Jiang, Q. and Koutsoukos, T. (2006). Assessment of futility
in clinical trials. Pharmaceutical Statistics, 5(4), 273–281.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate
normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical
Statistics and Probability, 1, 197–206.
Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900.
Cambridge, MA: Belknap Press of Harvard University Press.
Stigler, S. M. (2007). The epic story of maximum likelihood. Statistical Science, 22, 598–620.
Subkhankulov, M. A. (1976). Tauberian Theorems with Remainder Terms. Moscow: Nauka.
Su, J. Q. and Liu, J. S. (1993). Linear combinations of multiple diagnostic markers.
Journal of the American Statistical Association, 88(424), 1350–1355.
Suissa, S. and Shuster, J. J. (1984). Are Uniformly Most Powerful Unbiased Tests
Really Best? The American Statistician, 38, 204–206.
Tierney, L. and Kadane, J. B. (1986). Accurate approximations for posterior moments
and marginal densities. Journal of the American Statistical Association, 81(393),
82–86.

Tierney, L., Kass, R. E. and Kadane, J. B. (1989). Fully exponential Laplace approxima-
tions to expectations and variances of nonpositive functions. Journal of the Amer-
ican Statistical Association, 84(407), 710–716.
Titchmarsh, E. C. (1976). The Theory of Functions. New York, NY: Oxford University Press.
Trafimow, D. and Marks, M. (2015). Editorial in basic and applied social psychology.
Basic and Applied Social Pschology, 37, 1–2.
Tukey, J. W. (1958). Bias and confidence in not quite large samples. Annals of Mathe-
matical Statistics, 29, 614–623.
Vexler, A. (2006). Guaranteed testing for epidemic changes of a linear regression
model. Journal of Statistical Planning and Inference, 136(9), 3101–3120.
Vexler, A. (2008). Martingale type statistics applied to change points detection. Com-
munications in Statistics—Theory and Methods, 37(8), 1207–1224.
Vexler, A. and Dmitrienko, A. (1999). Approximations to expected stopping times with applications to sequential estimation. Sequential Analysis, 18, 165–187.
Vexler, A. and Gurevich, G. (2009). Average most powerful tests for a segmented
regression. Communications in Statistics—Theory and Methods, 38(13), 2214–2231.
Vexler, A. and Gurevich, G. (2010). Density-based empirical likelihood ratio change
point detection policies. Communications in Statistics—Simulation and Computa-
tion, 39(9), 1709–1725.
Vexler, A. and Gurevich, G. (2011). A note on optimality of hypothesis testing. Journal
MESA, 2(3), 243–250.
Vexler, A., Hutson, A. D. and Chen, X. (2016a). Statistical Testing Strategies in the Health
Sciences. New York, NY: Chapman & Hall/CRC.
Vexler, A., Liu, A., Eliseeva, E. and Schisterman, E. F. (2008a). Maximum likelihood
ratio tests for comparing the discriminatory ability of biomarkers subject to
limit of detection. Biometrics, 64(3), 895–903.
Vexler, A., Liu, A. and Schisterman, E. F. (2010a). Nonparametric deconvolution of
density estimation based on observed sums. Journal of Nonparametric Statistics,
22, 23–39.
Vexler, A., Liu, A., Schisterman, E. F. and Wu, C. (2006). Note on distribution-free
estimation of maximum linear separation of two multivariate distributions.
Nonparametric Statistics, 18(2), 145–158.
Vexler, A., Liu, S., Kang, L. and Hutson, A. D. (2009a). Modifications of the empirical
likelihood interval estimation with improved coverage probabilities. Communi-
cations in Statistics—Simulation and Computation, 38(10), 2171–2183.
Vexler, A., Schisterman, E. F. and Liu, A. (2008b). Estimation of ROC based on stably
distributed biomarkers subject to measurement error and pooling mixtures. Sta-
tistics in Medicine, 27, 280–296.
Vexler, A., Tao, G. and Hutson, A. (2014). Posterior expectation based on empirical
likelihoods. Biometrika, 101(3), 711–718.
Vexler, A. and Wu, C. (2009). An optimal retrospective change point detection policy.
Scandinavian Journal of Statistics, 36(3), 542–558.
Vexler, A., Wu, C., Liu, A., Whitcomb, B. W. and Schisterman, E. F. (2009b). An exten-
sion of a change point problem. Statistics, 43(3), 213–225.
Vexler, A., Wu, C. and Yu, K. F. (2010b). Optimal hypothesis testing: From semi to fully
Bayes factors. Metrika, 71(2), 125–138.
Vexler, A., Yu, J. and Hutson, A. D. (2011). Likelihood testing populations modeled by
autoregressive process subject to the limit of detection in applications to longi-
tudinal biomedical data. Journal of Applied Statistics, 38(7), 1333–1346.

Vexler, A., Yu, J., Tian, L. and Liu, S. (2010c). Two‐sample nonparametric likelihood
inference based on incomplete data with an application to a pneumonia study.
Biometrical Journal, 52(3), 348–361.
Vexler, A., Yu, J., Zhao, Y., Hutson, A. D. and Gurevich, G. (2017). Expected P-values in
light of an ROC curve analysis applied to optimal multiple testing procedures.
Statistical Methods in Medical Research. doi:10.1177/0962280217704451. In Press.
Vexler, A., Zou, L. and Hutson, A. D. (2016b). Data-driven confidence interval estima-
tion incorporating prior information with an adjustment for skewed data. Amer-
ican Statistician, 70, 243–249.
Ville, J. (1939). Étude critique de la notion de collectif. Monographies des Probabilités (in
French). 3. Paris: Gauthier-Villars.
Wald, A. (1947). Sequential Analysis. New York, NY: Wiley.
Wald, A. and Wolfowitz, J. (1948). Optimum character of the sequential probability
ratio test. Annals of Mathematical Statistics, 19(3), 326–339.
Wasserstein, R. L. and Lazar, N. (2016). The ASA’s statement on p-values: Context,
process, and purpose. American Statistician, 70, 129–133.
Wedderburn, R. W. (1974). Quasi-likelihood functions, generalized linear models,
and the Gauss–Newton method. Biometrika, 61(3), 439–447.
Whitehead, J. (1986). On the bias of maximum likelihood estimation following a
sequential test. Biometrika, 73(3), 573–581.
Wieand, S., Gail, M. H., James, B. R. and James, K. L. (1989). A family of nonparamet-
ric statistics for comparing diagnostic markers with paired or unpaired data.
Biometrika, 76(3), 585–592.
Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing
composite hypotheses. Annals of Mathematical Statistics, 9(1), 60–62.
Williams, D. (1991). Probability with Martingales. New York, NY: Cambridge University
Press.
Yakimiv, A. L. (2005). Probabilistic Applications of Tauberian Theorems. Boston, MA: Mar-
tinus Nijhoff Publishers and VSP.
Yakir, B. (1995). A note on the run length to false alarm of a change-point detection
policy. Annals of Statistics, 23, 272–281.
Yu, J., Vexler, A. and Tian, L. (2010). Analyzing incomplete data subject to a threshold
using empirical likelihood methods: An application to a pneumonia risk study
in an ICU setting. Biometrics, 66(1), 123–130.
Zellner, A. (1996). An Introduction to Bayesian Inference in Econometrics. New York, NY:
Wiley.
Zhou, X. and Reiter, J. P. (2010). A note on Bayesian inference after multiple imputation.
American Statistician, 64, 159–163.
Zhou, X.-H. and McClish, D. K. (2002). Statistical Methods in Diagnostic Medicine. New York, NY: Wiley.
Zhou, X.-H., Obuchowski, N. A. and McClish, D. K. (2011). Statistical Methods in Diag-
nostic Medicine. Hoboken, NJ: John Wiley & Sons.
Zimmerman, D. W. and Zumbo, B. D. (2009). Hazards in choosing between pooled and separate-variance t tests. Psicológica, 30, 371–390.
Zou, K. H., Hall, W. and Shapiro, D. E. (1997). Smooth non‐parametric receiver oper-
ating characteristic (ROC) curves for continuous diagnostic tests. Statistics in
Medicine, 16(19), 2143–2156.
Author Index

A Chen, M. G., 182


Chen, M. H., 221
Abbt, M., 5, 232
Chen, X., 1, 53, 69, 80, 96, 101, 103, 104,
Ahmad, I. A., 232
167, 186, 192, 202, 222, 223, 232,
Alexander, J., 101
364
Armitage, P., 165–167
Chow, Y. S., 84
Armstrong, D., 54, 55, 201, 232, 311
Chuang, C.-S., 304
Chuang-Stein, C., 166
B Chung, K. L., 3, 16, 75, 84
Claeskens, G., 239
Baddour, Y., 110
Copas, J., 194, 197
Balakrishnan, N., 4, 5, 60
Corbett, P., 194, 197
Bamber, D., 189
Cornfield, J., 182
Barnard, G. A., 169
Cornish, E. A., 287
Bartky, W., 168
Cox, D. R., 190
Baseler, M., 280
Cox, R., 41
Bauer, P., 181
Crawley, M. J., 19, 22
Bayarri, M. J., 60
Csörgö, M., 101, 102
Belitskaya-Levy, I., 225
Cutler, S., 182
Belomestnyi, D. V., 56
Berger, J. O., 60, 132, 133, 135, 153, 162,
174 D
Bernardo, J. M., 135
Daniels, M., 222
Bhagwat, J. G., 185
Davis, C. S., 270
Bickel, P. J., 61
Davison, A. C., 260
Bleistein, N., 141
De Bruijn, N. G., 142
Bondell, H., 203
DeGroot, M. H., 60, 137, 166
Booth, J. G., 273
Delwiche, L. D., 16, 25, 26, 28, 29
Borovkov, A. A., 60, 68, 75
Dempster, A. P., 225
Box, G. E., 162, 190
Desquilbet, L., 16
Bressoud, D. M., 7
DiCiccio, T. J., 148, 149
Bretz, F., 180
Dmitrienko, A., 166, 180
Broemeling, L. D., 219
Dodge, H. F., 168
Brostrom, G., 113, 114
Doksum, K. A., 61
Browne, R., 55, 201, 232, 311
Dorn, J., 55, 201, 232, 311
Dragalin, V. P., 114
C Dunson, D. B., 162, 219
Campbell, G., 185
Carlin, B. P., 61, 129, 132, 142, 146, 219, 364
E
Carlin, J. B., 155, 162, 219
Casella, G., 51, 64, 67 Edgar, G. A., 84
Chalmers, T. C., 73 Efron, B., 195, 259, 260, 266, 303,
Chang, M. N., 169 304, 364


Eliseeva, E., 191 Hammersley, J. M., 147


Eng, J., 185 Hammitt, J. K., 193
Esposito, D., 101 Han, C., 142
Evans, M., 142 Handelsman, R. A., 142
Everitt, B. S., 166 Handscomb, D. C., 147
Hartigan, J. A., 259, 266
Herman, B. A., 189
F Hill, B. M., 60
Fan, J., 78, 239 Hinkley, D. V., 260
Faraggi, D., 55, 201–203, 232, 311 Hjort, N. L., 219, 239
Farmen, M., 239 Hogan, J., 222
Fawcett, T., 185 Horvath, L., 101, 102
Feuerverger, A., 56 Hosmer, D. W. Jr., 198
Finkelestein, M., 16 Hsieh, F., 187, 189
Firth, D., 262 Hutson, A. D., 1, 53, 69, 71, 78, 80, 81, 96,
Fisher, R. A., 59, 60, 165, 219, 223, 259, 101, 103, 104, 145, 167, 186, 192,
287, 292 222, 223, 226, 232, 236, 239, 255,
Fluss, R., 203 280, 364
Freeman, H. A., 168
Freiman, J. A., 73
J
Freudenheim, J., 201, 232, 311
Friedman, M., 168 James, B. R., 187, 189
James, K. L., 187, 189
James, W., 364
G Janzen, F. J., 158
Gail, M. H., 187, 189 Jeffreys, H., 136, 153
Gardener, M., 16, 22 Jennison, C., 177, 179–181, 183
Gavit, P., 110 Jiang, Q., 182
Gelman, A., 155, 162, 219 Johnson, M. E., 53
Genz, A., 147 Johnson, N. L., 4, 5
Geweke, J., 147 Johnson, W. O., 151
Ghosh, M., 79 Julious, S. A., 227
Gijbels, I., 239
Gönen, M., 151
Good, I., 161 K
Goodenow, M. M., 280 Kadane, J. B., 142–143
Gordon, L., 123 Kang, L., 239
Green, D. M., 185 Kass, R. E., 135, 137, 142–143, 146–149,
Greenhouse, S., 182 153, 158
Grimmett, G., 77 Kepner, J. L., 169
Gurevich, G., 71, 78–79, 81, 102, 223, 226, Koch, A. L., 5
236, 241, 245, 308 Kohne, K., 181
Konish, S., 292
Korevaar, J., 43
H
Kotz, S., 4, 5, 190, 191
Halabi, S., 201 Koutsoukos, T., 182
Hall, P., 284 Krieger, A. M., 120, 124, 150
Hall, W., 167, 183, 192 Kuebler, R. R., 73

L Molenberghs, G., 166


Montgomery, D. C., 111
Lai, C.-D., 60
Morris, C., 364
Lai, T. L., 46, 113, 114, 167, 304
Mosteller, F., 168
Lane, D. A., 60
Mudholkar, G. S., 308
Lane-Claypon, J. E., 165
Mueller, B. U., 280
Lasko, T. A., 185
Mureika, R., 56
Lawless, J., 242, 366
Muti, P., 201, 232, 311
Lazar, N. A., 224, 245, 255
Mykland, P. A., 245
Lazzeroni, L. C., 225
Le Cam, L., 67, 140
Lee, E. T., 5 N
Lehmann, E. L., 64, 67, 78, 106, 107, 230
Lemeshow, S., 198 Nakas, C. T., 202
Leung, D. H., 254 Nelson, R. P., 280
Limpert, E., 5, 232 Neman, J., 219
Lin, C.-C., 308 Newton, M. A., 148
Lin, Y., 169 Neyman, J., 69
Lindsey, J., 5 Norton, L., 101
Liptser, R. Sh., 84
Liu, A., 167, 183, 191, 192, 202, 203 O
Liu, C., 201
Liu, J. S., 202, 364 Obuchowski, N. A., 185, 192
Liu, S., 239, 254 Offen, W., 166
Lloyd, C. J., 185 Ohno-Machado, L., 185
Lorden, G., 114, 124, 125 Owen, A. B., 239, 241, 243, 251
Louis, T. A., 61, 129, 132, 142, 146, 219,
364 P
Lu, Y., 151, 225
Lucito, R., 101 Palmer, C., 166
Lukacs, E., 31 Paukstis, G. L., 158
Lumelskii, Y., 190, 191 Pearson, E. S., 69
Lusted, L. B., 185 Pensky, M., 190, 191
Pepe, M. S., 185–187, 189, 193, 201, 364
Perkins, N. J., 55, 203
M Petrov, V. V., 1, 207
Mallows, C. L., 286 Piantadosi, S., 183
Marden, J. I., 97, 151, 162 Pizzo, P. A., 280
Mariotti, F., 16 Pocock, S. J., 180, 181
Markatou, M., 202 Pollak, M., 84, 113, 114, 123–125, 128, 150
Marks, M., 224 Powers, L., 5
Martinsek, A. T., 174, 177 Powers, S., 101
McClish, D. K., 185, 192 Proschan, M. A., 3
McCulloch, R., 146 Provost, F. J., 185
McIntosh, M. W., 189
Meng, X.-L., 149
Q
Metz, C. E., 189
Mikusinski, P., 13 Qin, J., 242, 253–254, 366
Mishra, B., 101 Quenouille, M. H., 259, 261

R Smith, A. F. M., 135, 148


Smith, H. Jr., 73
Raftery, A. E., 137, 142, 143, 146, 148, 149,
Smith, R. L., 61
153, 158
Snapinn, S., 182
R Development Core Team, 16, 52, 191,
Stahel, W. A., 5, 232
232
Stein, C., 255, 364
Reid, N., 60, 153
Stern, H. S., 153, 155, 159, 161, 162, 219
Reiner, A., 101
Stigler, S. M., 60, 223
Reiser, B., 55, 201–203, 232, 311
Stirzaker, D., 77
Reiter, J. P., 222
Strawderman, W. E., 219
Richards, R. J., 193
Su, J. Q., 202, 363–364
Ritov, Y., 116
Subkhankulov, M. A., 43
Robbins, H., 46, 84, 114, 207, 210–212
Sucheston, L., 84
Robert, C. P., 51
Swartz, T., 142
Roberts, G. O., 148
Swets, J. A., 185
Romano, J. P., 78, 106, 107, 230
Romig, H., 168
Rossi, P. E., 146 T
Rubin, D. B., 155, 162, 219
Tamhane, A. C., 180
Tao, G., 145, 239, 255
S Taylor, M. D., 13
Sackrowitz, H., 225, 226, 233 Tholmer, R., 110
Samuel-Cahn, E., 225, 226, 233 Thompson, M. L., 189, 201, 364
Sarkar, S., 273 Tian, L., 254
SAS Institute Inc., 275 Tibshirani, R. J., 260
Schatzoff, M., 225 Tierney, L., 142, 143
Schiller, F., 38 Titchmarsh, E. C., 37, 205
Schisterman, E. F., 191, 201–203, 232, Trafimow, D., 224
311 Trevisan, M., 55, 201, 232, 311
Schneiderman, M., 182 Tsevat, J., 193
Schwarz, G., 146 Tu, D., 260, 279, 286
Schweder, T., 219 Tucker, H. G., 16
Sellke, T., 153 Tucker, J. K., 158
Serfling, R. J., 9, 10, 13, 68, 192, 198 Tukey, J. W., 259, 261, 263
Shao, J., 260, 263, 279, 286 Turnbull, B. W., 177, 179–181, 183,
Shao, Q. M., 221 187, 189
Shapiro, D. E., 185, 187, 192
Shaw, P. A., 3
V
Shen, J. H., 189
Shih, W. J., 169 Vaidyanathan, S. K., 143
Shiryayev, A. N., 84, 103 Veeh, J. A., 16
Siegmund, D., 84, 114, 174, 211, 212 Vehtari, A., 162, 219
Simon, R., 168, 169 Vexler, A., 1, 47, 53, 56, 57, 69, 71, 78–81,
Singh, K., 219 96, 101–104, 109, 114, 120, 145,
Sinharay, S., 153, 159, 161 158, 162, 167, 186, 191, 192,
Sinnott, E. W., 5 202, 222, 223, 226, 232, 236,
Slaughter, S. J., 16, 25, 26, 28, 29 239, 241, 245, 254–255, 308, 364
Sleasman, J. W., 280 Ville, J., 205, 206

W Y
Wald, A., 169, 170, 173, 206 Yakimiv, A. L., 43
Wallis, W. A., 168 Yakir, B., 75, 114, 120, 124, 128,
Wang, J. W., 5 150
Wasserman, L., 146, 148, 149, 153 Ye, A., 55
Wasserstein, R. L., 224 Yiannoutsos, C. T., 202
Wedderburn, R. W., 239 Yu, J., 222, 223, 226, 236, 254
West, J., 101 Yu, K. F., 71
Westfall, P. H., 151
Whitehead, J., 183
Z
Wieand, S., 187, 189
Wigler, M., 101 Zellner, A., 137
Wilfret, D., 280 Zhang, B., 254
Wilks, S. S., 76, 106 Zhang, C., 78
Williams, D., 84 Zhao, Y., 223, 226, 236
Wolfowitz, J., 177 Zhou, X., 222
Wolpert, R. L., 60 Zhou, X.-H., 185
Wong, W. H., 149 Zimmerman, D. W., 227
Wu, C., 71, 79, 109 Zou, K. H., 185, 192
Wu, C. F. J., 263 Zou, L., 222, 255
Zuckerman, J., 280
Zumbo, B. D., 227
X
Xie, M., 219
Subject Index

A
Absolute value, 2, 15, 16
Adaptive CUSUM test statistic, 108, 109
Adaptive Gaussian quadrature method, 147
Adaptive maximum likelihood estimation, 105–109
Adaptive maximum likelihood ratio, 107, 108
Adaptive sequential designs, 181
Almost sure convergence, 9, 10
Anti-conservative approach, 135
A posteriori probabilities, 136
Approximate confidence intervals, 264–265
Approximate likelihood technique, 239
A priori information, 129
A priori probabilities, 136
Area Under the ROC curve (AUC), 189–190, 202, 227
   estimated, 197
   improvement of, 366
   nonparametric approach, 192–193
   overestimation of, 196–197
   parametric approach, 190–192
   partial, 203, 226
   of student’s t-test, 228
Artificial likelihood-based techniques, 239
ASN, see Average sample number
Asymmetric distributions, 281
Asymptotic approximations, 265
   Bayes factor, 142, 154
   benefits of, 146
   Laplace’s method, 142
   Schwarz criterion, 145–147
   simulation and, 148–149
   variants on Laplace’s method, 142–145
Asymptotic properties of stopping time, 174–177
At most one change-point model, 101
AUC, see Area Under the ROC curve
Autoregressive errors, measurement model with, 80, 307
Average run length (ARL) to false alarm, 112, 113
Average sample number (ASN), 173
   Wald approximation, 173–174, 176

B
Bartlett correction, 245
Bayes factor, 79, 99, 103, 124, 136, 145
   applications, 158–161
   computation, 137–142
      asymptotic approximations, see Asymptotic approximations
      importance sampling, 147
      samples from posterior distributions, 147–148
      simple Monte Carlo, 147
   decision-making rules, 153–158
   implications, 161–162
   principles, 132
   prior distributions, 149–153
   R code, 162–163
   Ville and Wald inequality, 209
Bayesian approach, 129
   confidence interval estimation, 219, 220
   and empirical likelihood, 254–255
   equal-tailed confidence interval estimation, 221
   frequentist method vs., 130
   highest posterior density confidence interval estimation, 221
   for hypothesis testing, 136, 151
   a priori information, 129
Bayesian inference, 150
   best combination of multiple biomarkers, 363–364
   empirical, 364
   nonparametric, 255
Bayesian information criterion (BIC), 145–146
Bayesian methodology, 129
   advantages of, 131
   criticism of, 153
   information source, 133
Bayesian priors, 31
Bayes theorem, 59, 129, 136, 219, 253
Berry–Esséen inequality, 286
Best linear combination (BLC) of biomarkers, 201–202
Bias estimation, jackknife method, 261–263
BIC, see Bayesian information criterion
“Big Oh” notation, 2, 10
Biomarker(s), 54
   high density lipoprotein cholesterol, 232
   in medical decision making, 231–232
   multiple, best combination of, 201–202
      Bayesian inference, 363–364
      empirical Bayesian inference, 364
      log-normal distribution, 364
Biostatistical theory, 83
Bivariate normal distribution, 60
Bootstrap consistency, 285, 286
Bootstrap method, 259–260, 266–268
   bias-corrected and accelerated interval, 297–299
   bootstrap-t percentile interval, 280, 288–292
   confidence intervals, 279–284, 291–303
   Edgeworth expansion technique, 279, 284–288
   nonparametric simulation, 267–269
   resampling algorithms with SAS and R codes, 270–279
Bootstrap replications, 260, 268–271, 273, 275, 276
Bootstrap tilting, 303–304
Bootstrap-t percentile interval, 280, 288–292

C
Case-control studies, 164, 189, 231
Cauchy distribution, 35, 57, 68
Cauchy–Schwarz inequality, 11
Central limit theorem (CLT), 174, 205, 215, 310
   characteristic functions, 50–54
   Gaussian, 148
   maximum likelihood estimation, 64–66
   for stopping time, 177–179
Change of measure, 75
Change point detection method
   retrospective, 100–105
   sequential, 109–113
   transformation of, see Shiryayev–Roberts technique
Change point problems, 100, 101, 103, 109, 363
Characteristic exponent, 57
Characteristic function
   algorithm, 40
   central limit theorem, 50–54
   distribution functions
      analytical forms of, 49–50
      reconstruction issues, 54–56
   elementary properties of, 32–34
   law of large numbers, 48–49
   one-to-one mapping, 34–39
   parametric approach, 56–57
   risk-efficient estimation, 46–47
   Tauberian theorems, 40–46
Chebyshev’s inequality, 11, 12, 48, 63, 68, 92, 249
Chi-square distribution, 181, 241, 365
Classical retrospective hypothesis testing, 166
Clinical pulmonary infection score (CPIS), 254
Coefficient of variation, 268, 269
Complex conjugate, 15, 32
Complex variables, 14–16
Composite estimation, of mean, 366–367
Composite hypothesis, 130
Conditional expectation, 85
Conditional power computation, 182
Conditional probability theory, 81
Confidence band, 264
Confidence interval estimation
   Bayesian approach, 219, 220
   equal-tailed, 221
   highest posterior density, 221, 222
Confidence interval estimators, 66, 222
Confidence intervals, 215
   approximate, 264–265
   bootstrap, 279–284, 291–303
   constructions, 214, 216
   definition, 263–264
   estimation by SAS program, 289–291, 293–302
   hypothesis testing, 222
   posterior probabilities, 220
Confidence sequences, 213, 216
   advantage of, 214
   for median, 215–217
Conjugate distributions, 149
Conjugate priors, 137, 149
Consistency, 60
Continuous mapping theorem, 9
Continuous random variables, 4, 41, 108, 109
Control limits, 110, 111
Control statements, SAS, 26–28
Convergence
   almost sure, 9
   associative properties, 10
   in distribution, 9
   in probability, 9
   rate, 54
   in rth mean, 10
Convolution, 33, 81, 86, see also Deconvolution
Correct model-based likelihood formation, 80–82
Course projects, examples of, 363–367
Coverage probability, 66
CPIS, see Clinical pulmonary infection score
Credible set, 219
Cumulative distribution functions, 152
Cumulative sum (CUSUM) technique
   change point detection, 101–102
   modification of, 126–128
   Shiryayev–Roberts vs., 120–123

D
Decision(-making) rule, 98, 133, 134, 256
   Bayes factor, 153–158
   likelihood ratio–based test, 69–71, 78, 130
   most powerful, 135
   null hypothesis rejection, 133
   sequential probability ratio test, 170
Deconvolution, 55
Delta method, 190, 191
Density function, 4–6
Dirichlet integral, 37
Discrete random variables, 4
Distribution function, 3, 31, 43
   analytical forms of, 49–50
   characteristic function of, 32
   continuous, 4
   convergence in, 9, 10
   empirical, 7, 188, 261, 266, 268
   extensions and estimations, 56–57
   Laplace transformations of, 31
   of measurement errors, 55
   one-to-one mapping, 34–39
   parametric, 5–6
   reconstruction issues, 54–56
   of standard normal random variable, 66
   sums of random variables, 32, 33
DO-END statement, 27
Doob’s decomposition, 114, 116
Doob’s inequality, 92–93, 100, 205

E
Econometrics, 83
Edgeworth expansion technique, 279, 284–288
Efron’s formula, 195
Elementary renewal theorem, 46
Elicited prior, 133, 149
EL methods, see Empirical likelihood methods
Empirical Bayes(ian), 129
   concept, 130, 141
   inference, 364
   nonparametric version, 255
Empirical distribution function, 7, 188, 261, 266, 268
Empirical likelihood functions, 222
Empirical likelihood (EL) methods, 303
   Bayesian approach and, 254–255
   classical, 239–241
   and parametric likelihood methods, 242, 253–254
   practical theoretical tools for analyzing, 249–253
   software packages, 241–242
   as statistical analysis tool, 255–256
   techniques for analyzing, 242–248
Empirical likelihood ratio tests, 365–366
Empirical likelihood ratio test statistic, 241, 244–246, 256–257
EPV, see Expected p-value
Equal-tailed confidence interval estimation, 66, 221
Errors, see also Type I error; Type II error
   autoregressive, 80
   measurement, 55
   Monte Carlo, 268
Estimation, see also Overestimation
   bias, 261–263
   confidence interval, see Confidence interval estimation
   James–Stein, 255, 364
   maximum likelihood, see Maximum likelihood estimation
   mean
      composite, 366–367
      trimmed, see Trimmed mean estimation
   nonanticipating, 105–109, 124–125
   risk-efficient, 46–47
   unbiased, 167, 183, 273
Euler’s formula, 15–16, 32, 36
Exam questions, examples
   final exams, 326–340
   midterm exams, 326–340
   qualifying exams, 326–340
Expectation, 70
   conditional, 85
   definition of, 6
   and integration, 6–8
   law of, 86
Expected bias
   asymptotic approximation to, 195
   of ROC curve, 196–197
Expected p-value (EPV), 225, see also Partial expected p-value
   power and, 229–231
   ROC curve analysis, 226–227, 236–237
   of t-test, 233–234
Expected significance level, 225
Exponential distribution, 47, 132
   density function for, 5, 6

F
False alarm, average run length to, 112, 113
False positive rate (FPR), 194, 195, 203
Fatou–Lebesgue theorem, 205
Fisher information, 64, 139–141
Fisher’s combination method, 181
Fisher’s z-transformation, 292
Flat prior, 135
FPR, see False positive rate
Frequentist approach, 129, 130, 133
Futility analysis, 168, 182

G
Gambling theory, 83
Gamma distribution, 140
   analytical forms of, 49
   density function, 5, 6, 138
Gaussian quadrature, 142, 147
Gaussian-type integral, 145
Geometric progression, sum formula for, 42, 43
Goodness-of-fit tests, 307
Grouped jackknife approach, 262
Group sequential tests, 166, 179–181

H
Hessian matrix, 139–140, 143
Highest posterior density confidence interval estimation, 221, 222
Highly peaked integrand, 137
Hinchin’s form, law of large numbers, 48–49
Homework questions, examples of, 305–313
Hypothesis
   composite, 130
   formulation of, 69
   nested, 143
   null, see Null hypothesis
Hypothesis testing, 75
   Bayesian approach for, 136, 151
   classical, 166
   composite statement, 96
   martingale principle, 93–99
   most powerful test, 98
   problems, 130

I
IF-THEN statement, 26
Importance sampling, 147, 160
Improper distribution, 150
In control process, 110, 113, 114, 123
Independent and identically distributed (iid) random variables
   distribution function of, 54, 56
   probability distribution, 4
   sum of, 65, 66, 173, 178
   in Ville and Wald inequality, 209–213
Indicator function, 7, 10–12, 70, 135, 150
Inference
   AUC-based, 366
   Bayesian, 150, 255, 363–364
   empirical Bayesian, 364
   frequentist context, 129
   on incomplete bivariate data, 253
   ROC curve, 186–189
INFILE statement, 24–25
Infimum, 7
Integrated most powerful test, 132–135
Integration, expectation and, 6–8
Invariant data transformations, 108
Invariant statistics, 106
Invariant transformations, 106
Inversion theorem, 34–39

J
Jackknife method, 259
   bias estimation, 261–263
   confidence interval
      approximate, 264–265
      definition, 263–264
   homework questions, 312–313
   nonparametric estimate of variance, 268
   variance estimation, 263
   variance stabilization, 265
James–Stein estimation, 255, 364
Jensen’s inequality, 14, 61, 87, 95

K
Kernel function, 192
Khinchin’s form, law of large numbers, 48–49
Kullback–Leibler distance, 304

L
Lagrange method, 221, 240, 242
Lagrange multiplier, 221, 240, 243, 250, 256
Laplace approximation, 142
Laplace distribution, 35
Laplace’s method, 142
   modification of, 148
   simulated version of, 149
   variants on, 142–145
Laplace transformation
   of distribution functions, 31
   real-valued, 44
Law of iterated expectation, 86
Law of large numbers, 32, 48, 205
   Khinchin’s (Hinchin’s) form of, 48–49
   outcomes of, 51, 54
   strong, 10
   weak, 9
Law of the iterated logarithm, 205, 206–207, 214, 217
Law of total expectation, 86
LCL, see Lower control limit
Least squares method, 61
Lebesgue principle, 7
Lebesgue–Stieltjes integral, 7, 8
Likelihood function, 59, 60
Likelihood method, 59–60
   artificial/approximate, 239
   correct model-based, 80
   intuition about, 61–63
   parametric, 78
Likelihood ratio (LR), 69–73, 97–99, 170, see also Log-likelihood ratio; Maximum likelihood ratio
Likelihood ratio test
   bias, 74
   decision rule, 69–71
   optimality of, 69, 71–72
   power, 70, 74
   type I error control, 100
Likelihood ratio test statistic, 69, 73–75, 96
Limits, 1–2, 8
Linear regressions, empirical likelihood ratio tests for, 365–366
“Little oh” notation, 2, 10
Log-empirical likelihood ratios, 246–248
Logistic regression, 185
   change point problems in, 363
   ROC curve analysis, 193–195
      example, 198–201
      expected bias of ROC, 196–197
      retrospective and prospective ROC, 195–196
Log likelihood function, 63, 79, 137–141
Log-likelihood ratio, 61, 143, 144, 173, 310
Log maximum likelihood, 64
Log-normal distribution, 5–6
Log-normally distributed values
   composite estimation of mean, 366–367
   of multiple biomarkers, 364
Lower confidence band, 264
Lower control limit (LCL), 111
LR, see Likelihood ratio

M
Magnitude of complex number, 15
Mallow’s distance, 286–287
Mann–Whitney statistic, 192, 198
Markov chain Monte Carlo (MCMC) method, 148, 221
Martingale-based arguments, 79
Martingale transforms, 116–118, 121
Martingale type statistics, 83
   adaptive estimators, 105–109
   change point detection, see Change point detection method
   conditions for, 86
   Doob’s inequality, 92–93
   hypotheses testing, 93–99
   likelihood ratios, 97–99
   maximum likelihood ratio, 96–97
   optional stopping theorem, 91
   reverse martingale, 102, 108
   σ-algebra, 85–86
   stopping time, 88–90
   terminologies, 84–90
   type I error rate control, 100, 106–107
   Wald’s lemma, 92
Maximum likelihood estimation, 63–68, 131, 142, 183
Maximum likelihood estimator (MLE), 47, 137, 143–144, 187, 191, 364, 366–367
   adaptive, 105–109
   example, 67–68
   log-likelihood function, 137–139
   as point estimator, 66
Maximum likelihood ratio (MLR), 76–80, 96–97
Maximum likelihood ratio test statistic, 76, 79, 96, 105
MCMC method, see Markov chain Monte Carlo method
Mean
   composite estimation of, 366–367
   rth mean, convergence in, 10
Measurability, 85
Measurable random variable, 86
Measurement errors, 32, 54–56, 363
Measurement model, with autoregressive errors, 80, 307
Metropolis–Hastings algorithm, 148
Minus maximum log-likelihood ratio test statistic, 246–248
MLE, see Maximum likelihood estimator
Model selection, 146
Modulus of complex number, 15
Monte Carlo (MC)
   approximations, 102, 106, 229, 232
   coverage probabilities, 223
   error, 268
   Markov chain, 148
   resampling, 304
   roulette game, 90
   simplest, 147
Monte Carlo power, 54, 198–199, 309
Monte Carlo simulation, 113, 177, 267–268, 308, 310
   principles of, 51–54
   Shiryayev–Roberts technique, 124–128
Most powerful decision rule, 135
Most powerful test
   integrated, 132–135
   likelihood ratio test, 69, 71
Most probable position, 59
Multiple biomarkers, best combination of, 201–202
   Bayesian inference, 363–364
   empirical Bayesian inference, 364
   log-normal distribution, 364
Multiroot procedure, 191
“mvrnorm” command, 21–22

N
Newton–Raphson method, 53, 68
Neyman–Pearson lemma, 69, 78, 225, 231
Neyman–Pearson testing concept, 69, 73, 96, 98
Nonanticipating estimation
   maximum likelihood estimation, 105–109
   Monte Carlo simulation, 124–125
Non-asymptotic upper bound of renewal function, 42
Noninformative prior, 149–150
Nonparametric approach
   area under the ROC curve, 192–193, 198
   best linear combination of biomarkers, 202
Nonparametric estimation of ROC curve, 187–188
Nonparametric method
   Monte Carlo simulation scheme, 128
   Shiryayev–Roberts technique, 123–124
Nonparametric simulation, 267–269
Nonparametric testing, 308
Nonparametric tilting, see Bootstrap tilting
Not a number (NaN) values, 18
Null hypothesis, 97
   Bayes factor, 153
   CUSUM test statistic, 102
   empirical likelihood function, 246
   p-value, 161–162, 224, 225
   rejection of, 73, 79, 102, 130, 158, 170, 222
Numerical integration, 137, 142, 146

O
One-to-one mapping, 31, 34–39
Online change point problem, 109
Optimality, 69, 221
   likelihood argument of, 75
   of statistical tests, 71–72
“Optim” procedure, 191
Optional stopping theorem, 91
Out of control process, 110, 111, 114, 123
Overestimation, 194–195
   area under the ROC curve, 196–197

P
p-value, 76, 97, 132, 223–226, 237
   expected, see Expected p-value
   formulation of, 161–162
   null hypothesis, 161–162, 224, 225
   prediction intervals for, 225
   stochastic aspect of, 225
Parametric approach
   area under the ROC curve, 190–192
   best linear combination of biomarkers, 202
Parametric characteristic function, 56–57
Parametric distribution functions, 5–6
Parametric likelihood (PL) method, 78, 255
   empirical likelihood and, 253–254
Parametric sequential procedures, 167
Parametric statistical procedure, 167
Partial AUC (pAUC), 203, 226
Partial expected p-value (pEPV), 226, 230, 231, 234–235
pAUC, see Partial AUC
Penalized empirical likelihood estimation, 366
pEPV, see Partial expected p-value
Percentile confidence interval, 280–282, 288–291
Pivot, 264
PL method, see Parametric likelihood method
Point estimator, 66
Pooling design, 55–56
Posterior distribution, 131, 146–149, 155
Posterior empirical Bayes (PEB), 364
Posterior odds, 136, 161
Post-sequential analysis, 182–183
Power, 53
   calculation, 54
   expected p-value and, 229–231
   of likelihood ratio test, 70, 74
Prior (probability) distribution, 131, 149–153, 159–160, 219, 255
Probability, convergence in, 9, 10
Probability distribution, 3–4, 149–153
Probability mass function, 4
Probability theory, 3, 83, 86
Probit regression model, 159, 160
Problematic function, 7
PROC IML, 29, 275, 277–279
PROC MEANS
   procedure, 28–29
   WEIGHT statement, 273
PROC SURVEYSELECT, 270, 274, 277
PROC UNIVARIATE, 281–282, 289
Product type chain rule form, 62

Q
Quasi-likelihood method, 239

R
Radon–Nikodym derivative, 75
Radon–Nikodym theorem, 75
RANBIN function, 270, 271
Random variables, 2–3
   binomial, 270–271
   continuous, 4, 41, 108, 109
   discrete, 4
   measurable, 86
Rasch model, 79
Rate of convergence, 54
R code, 17, 53–54, 57, 111, 112, 138–139, 151, 157, 162–163, 170–172, 176–177, 186, 189, 199–200, 229, 232–233, 248, 270–279, 284, 303
Receiver operating characteristic (ROC) curve analysis, 185, 364
   area under the ROC curve, 189–193
   in cardiovascular research, 185, 186
   case-control studies, 189
   curve inference, 186–189
   expected bias of, 196–197
   expected p-value and, 225, 226–227
   laboratory diagnostic tests, 185
   logistic regression and, 193–201
   nonparametric estimation of, 187–188
   student’s t-test, 227, 228
Regularity conditions, 63, 64, 66, 68, 76, 77, 264
Remainder term, 11, 12, 14, 64, 65, 77, 145, 249, 287
Renewal function, 40–46, 92
Renewal theory, 31, 41, 178
Replications (Bootstrap), 260, 268–271, 273, 275, 276
Representative method, 99, 132
Resampling algorithms, 270–279
Restricted maximum likelihood estimate (REML), 160
Retrospective change point detection, 100–101
   cumulative sum technique, 101–102
   Shiryayev–Roberts technique, 102–105
Retrospective error rate, 195
Retrospective mechanism, 88
Retrospective studies, 165
Reverse martingale, 102, 108
Riemann–Stieltjes integral, 6, 8
Risk-efficient estimation, 46–47
Risk function, 46, 47, 71
R software, 16, see also SAS software
   comparison operators, 20
   data manipulation, 18–21
   input window, 16–17
   printing data in, 21–22
rth mean, convergence in, 10

S
Sample size
   calculation, 54, 152
   expected, 47, 168, 169, 173, 218
Sampling
   importance, 147, 160
   techniques, 149
SAS/IML, 278
SAS software, 16, 22
   comments in, 23–24
   comparison operators, 27
   control statements, 26–28
   data import, 24–25
   data manipulation, 25–28
   data printing, 28
   interface, 22, 23
   macros, 275–276
   numeric functions, 26
   resampling algorithms with, 270–279
   summarizing data in, 28–29
   trimmed mean and confidence interval estimation, 289–291, 293–302
   variables and data sets, 23
Schwarz criterion, 145–147, 154
Semi-parametric confidence interval, 282
Sequential analysis methodology
   adaptive sequential designs, 181
   advantages, 166
   central limit theorem for stopping time, 177–179
   futility analysis, 182
   group sequential tests, 179–181
   post-sequential analysis, 182–183
   regulated trials, 166–167
   sequential element, 165
   sequential probability ratio test, 169–177
   two-stage designs, 167–169
Sequential change point detection, 109–113
Sequential element, 165
Sequential methods, 167, 179
Sequential probability ratio test (SPRT), 169, 309–310
   asymptotic properties of stopping time, 174–177
   goal of, 170
   sampling process, 171
   stopping boundaries, 172–173
   Wald approximation to average sample number, 173–174
Sequential procedures, 83
   guidelines for, 166
   testing, 167
Sequential trial, 166, 167, 183
Shapiro–Wilk test for normality, 167, 183
Shewhart chart (X-bar chart), 111
Shiryayev–Roberts (SR) technique
   change point detection, 102–105, 113
   cumulative sum vs., 120–123
   method, 115–120
   Monte Carlo simulation study, 124–128
   motivation, 114–115
   nonparametric example, 123–124
σ-algebra, 85–86
Simon’s two-stage designs, 168–169
Simple representative method, 99, 132
Simplest Monte Carlo integration, 147
Skewed distributions, 5
Slutsky’s theorem, 77
Smooth statistics, 263, 280
SPRT, see Sequential probability ratio test
SR technique, see Shiryayev–Roberts technique
Stabilization, variance, 265
Standardizing decision-making policies, 223
Statistical decision-making theory, 69
Statistical functionals, 261, 266–268, 287, 303
Statistical hypotheses testing, martingale principle, 93–99
Statistical literature, 59
Statistical sequential schemes, 88, 167, 178
Statistical significance, 223
Stochastic curtailment approach, 182
Stochastic processes analysis, 83
Stopping times (rules), 41, 47, 83, 88–89, 92, 112, 124, 126, 173
   asymptotic properties, 174–177
   central limit theorem, 177–179
   Ville and Wald inequality, 208
Student’s t-test, 227–229
Study designs, 54, 55
Submartingale
   condition, 86–87
   H0-submartingale, 97, 114
   H1-submartingale, 95, 96, 103
   P∞-submartingale, 115, 116, 121, 122
Sufficiency, notion of, 60
Sum of iid random variables, 65, 66, 173, 178
Supermartingale, 86–87
Supremum, 7, 106
Symmetric distribution, 179, 211, 281, 287

T
Tauberian theorems, 32, 40–46
Tauberian-type propositions, 44–46
Taylor’s theorem, 12–14, 43, 45, 64, 65, 191
TBARS measurement, 311–313
Test statistic
   adaptive cumulative sum, 108, 109
   adaptive maximum likelihood ratio, 107, 108
   definition of, 99
   empirical likelihood ratio, 241, 244–246, 256–257
   likelihood ratio, 69, 73–75, 96
   maximum likelihood ratio, 76, 79, 96, 105
   minus maximum log-likelihood ratio, 246
Test with power one, 217–218
“Tilde/twiddle” notation, 2
Training samples, 194
Trimmed mean estimation
   PROC UNIVARIATE, 289–290
   SAS program, 289–291, 293–302
Trimmed mean statistic, 280–282
Trivial inequality, 70
True positive rate, 194, 195
t-tests, Wilcoxon rank-sum test vs., 231–236
Two-sample nonparametric testing, 308–309
Two schools of thought, 129
Two-stage acceptance sampling plan, 168
Two-stage designs, sequential analysis, 167–169
Type I error rate(s), 53, 167
   empirical likelihood ratio test, 242
   of maximum likelihood ratio test, 76
   notion of, 129
   probability of, 70, 73
   Simon’s two-stage designs, 168
   in statistical decision-making, 69, 72
   test with power one, 217–218
Type I error rate control, 100, 106–107, 134, 145, 156, 207, 365–366
Type II error probability, 73, 74
   test with power one, 218
Type II error rate, 168
   Simon’s two-stage designs, 168–169

U
Unbiased statistical test, 230
Uniroot function, 52, 68, 175, 312
Unit step function, 266
Upper control limit (UCL), 111
U-statistic theory, 192, 198

V
Validation samples, 194
VAP, see Ventilator-associated pneumonia
Variance, 155
   Bayes factor vs., 160, 161
   estimation, 263, 268
   population, 262
   stabilization, 265
Variants on Laplace’s method, 142–145
Ventilator-associated pneumonia (VAP), 254
Ville and Wald inequality, 205–206, 208
   Bayes factor, 209
   distribution function, 210–211
   statistical significance, 213–218
   in sums of iid observations, 209–213
   symmetric distribution function, 211–213

W
Wald approximation to average sample number, 173–174, 176
Wald martingale, 307
Wald’s lemma, 44, 92
WEIGHT statement, 273
Welch’s t-test, 227–229
Wilcoxon rank-sum test, 225, 231–236
Wilks’ theorem, 76, 78, 106, 241, 245, 365

X
X-chart, 110

Y
Youden index, 202, 203

Z
Z-interval, 265
