Statistics in the Health Sciences: Theory, Applications, and Computing

By

Albert Vexler
The State University of New York at Buffalo, USA

Alan D. Hutson
Roswell Park Cancer Institute, USA
Table of Contents

5. Bayes Factor
   5.1 Introduction
   5.2 Integrated Most Powerful Tests
   5.3 Bayes Factor
       5.3.1 The Computation of Bayes Factors
           5.3.1.1 Asymptotic Approximations
           5.3.1.2 Simple Monte Carlo, Importance Sampling, and Gaussian Quadrature
           5.3.1.3 Generating Samples from the Posterior
           5.3.1.4 Combining Simulation and Asymptotic Approximations
       5.3.2 The Choice of Prior Probability Distributions
       5.3.3 Decision-Making Rules Based on the Bayes Factor
       5.3.4 A Data Example: Application of the Bayes Factor
   5.4 Remarks

Example 3
Example 4
Example 5
Example 6
Example 7
Example 8
Preface—Please Read!
…10–15 years of biostatistical practice and training Master's- and PhD-level students in the Department of Biostatistics. This book is intended for graduate students majoring in statistics, biostatistics, epidemiology, health-related sciences, or a field where a statistics concentration is desirable, particularly for those who are interested in formal statistical mechanisms and their evaluations. In this context, Chapters 1–10 and 12–14 provide teaching sources for a high-level statistical theory course that can be taught in a statistics/biostatistics department. The presented material evolved in conjunction with teaching such a one-semester course at The State University of New York at Buffalo. This course, entitled "Theory of Statistical Inference," has belonged to the set of four core courses required for our biostatistics PhD program.
This textbook delivers a "ready-to-go," well-structured product to be employed in developing advanced biostatistical courses. We offer lectures, homework questions and their solutions, examples of midterm, final, and PhD qualifying exams, as well as examples of students' projects.
One of the ideas behind this book's development is to combine presentations of traditional applied and theoretical statistical methods with computationally intensive bootstrap-type procedures, which are relatively novel data-driven statistical tools. Chapter 11 is designed to help instructors develop a statistical course that introduces the jackknife and bootstrap methods. The focus is on the statistical functional as the key component of the theoretical developments, with applied examples provided to illustrate the corresponding theory.
We strongly suggest beginning lectures by asking students to smile! It is recommended to start each lecture class by answering students' inquiries regarding the previously assigned homework problems. In this course, we assume that students are encouraged to present their work in class regarding individually tailored research projects (e.g., Chapter 14). In this manner, the material of the course can be significantly extended.
Our intent is not that this book compete with classical fundamental guides such as, e.g., Bickel and Doksum (2007), Borovkov (1998), and Serfling (2002). In our course, we encourage scholars to read the essential works of these world-renowned authors. We aim to present different theoretical approaches that are commonly used in modern statistics and biostatistics to (1) analyze properties of statistical mechanisms; (2) compare statistical procedures; and (3) develop efficient (optimal) statistical schemes. Our target is to provide scholars with research seeds to spark new ideas. Toward this end, we demonstrate open problems and basic ingredients in learning complex statistical notations and tools, as well as advanced nonconventional methods, even in simple cases of statistical operations, providing "Warning" remarks to show potential difficulties related to the issues discussed.
Finally, we would like to note that this book attempts to represent a part of our life that definitely consists of mistakes, stereotypes, puzzles, and so on, that we all love. Thus our book cannot be perfect. We truly thank the reader for his or her participation in our life! We hope that the presented material can play a role as prior information for various research outputs.
Albert Vexler
Alan D. Hutson
Authors
Albert Vexler, PhD, obtained his PhD degree in Statistics and Probability Theory from the Hebrew University of Jerusalem in 2003. His PhD advisor was Moshe Pollak, a fellow of the American Statistical Association and Marcy Bogen Professor of Statistics at Hebrew University. Dr. Vexler was a postdoctoral research fellow in the Biometry and Mathematical Statistics Branch at the National Institute of Child Health and Human Development (National Institutes of Health). Currently, Dr. Vexler is a tenured Full Professor at the State University of New York at Buffalo, Department of Biostatistics. Dr. Vexler has authored and co-authored various publications that contribute to both the theoretical and applied aspects of statistics in medical research. Many of his papers and statistical software developments have appeared in statistical and biostatistical journals with top-rated impact factors that are historically recognized as leading scientific journals, including: Biometrics, Biometrika, Journal of Statistical Software, The American Statistician, The Annals of Applied Statistics, Statistical Methods in Medical Research, Biostatistics, Journal of Computational Biology, Statistics in Medicine, Statistics and Computing, Computational Statistics and Data Analysis, Scandinavian Journal of Statistics, Biometrical Journal, Statistics in Biopharmaceutical Research, Stochastic Processes and Their Applications, Journal of Statistical Planning and Inference, Annals of the Institute of Statistical Mathematics, The Canadian Journal of Statistics, Metrika, Statistics, Journal of Applied Statistics, Journal of Nonparametric Statistics, Communications in Statistics, Sequential Analysis, The STATA Journal, American Journal of Epidemiology, Epidemiology, Paediatric and Perinatal Epidemiology, Academic Radiology, The Journal of Clinical Endocrinology & Metabolism, Journal of Addiction Medicine, Reproductive Toxicology, and Human Reproduction.

Dr. Vexler was awarded National Institutes of Health (NIH) grants to develop novel nonparametric data analysis and statistical methodology. His research interests are related to the following subjects: receiver operating characteristic curve analysis; measurement error; optimal designs; regression models; censored data; change point problems; sequential analysis; statistical epidemiology; Bayesian decision-making mechanisms; asymptotic methods of statistics; forecasting; sampling; optimal testing; nonparametric tests; empirical likelihoods; renewal theory; Tauberian theorems; time series; categorical analysis; multivariate analysis; multivariate testing of complex hypotheses; factor and principal component analysis; statistical biomarker evaluations; and best combinations of biomarkers. Dr. Vexler is Associate Editor for Biometrics and Journal of Applied Statistics. These journals belong to the first cohort of academic literature related to the methodology of biostatistical and epidemiological research and clinical trials.
1
Prelude: Preliminary Tools and Foundations

1.1 Introduction
The purpose of this opening chapter is to supplement the reader's knowledge of elementary mathematical analysis, probability theory, and statistics, which the reader is assumed to have at his or her disposal. This chapter is intended to outline the foundational components and definitions treated in subsequent chapters. Introductory literature, including various foundational textbooks in these fields, can easily be found to complement this chapter's material in more detail. In this chapter we also present several important comments regarding the implementation of mathematical constructs in the proofs of statistical theorems that will provide the reader with a rigorous understanding of certain fundamental concepts. In addition to the literature cited in this chapter, we suggest the reader consult the book of Petrov (1995) as a fundamental source. The book of Vexler et al. (2016a) will provide the reader an entrée into the following cross-cutting topics: data, statistical hypotheses, errors related to the statistical testing mechanism (Type I and II errors), p-values, parametric and nonparametric approaches, bootstrap, permutation testing, and measurement errors.

In this book we primarily deal with certain methodological and practical tools of biostatistics. Before introducing powerful statistical and biostatistical methods, it is necessary to outline some important fundamental concepts, such as mathematical limits, random variables, distribution functions, integration, convergence, and complex variables, with examples of statistical software, in the following sections.
1.2 Limits
The statistical discipline deals particularly with real-world objects, but its asymptotic theory oftentimes consists of concepts that employ the symbols "→" (convergence) and "∞" (infinity). These concepts are not related directly to a fantasy; they just represent formal notations associated with asymptotic techniques relative to how a theoretical approximation may actually behave in real-world applications with finite objects. For example, we will use the formal limit notation $f(x) \to a$ as $x \to \infty$ (or $\lim_{x\to\infty} f(x) = a$), where $a$ is a constant. In this case we do not pretend that there is a real number such that $x = \infty$ or that $x$ has a value that is close to infinity. We formulate that for all $\varepsilon > 0$ (in the math language: $\forall \varepsilon > 0$) there exists a constant $A$ ($\exists A > 0$) such that the inequality $\|f(x) - a\| < \varepsilon$ holds for all $x > A$. By the notation $\|\cdot\|$ we mean generally a measure, an operation that satisfies certain properties. In the example mentioned above, $\|f(x) - a\|$ can be considered as the absolute value of the distance $f(x) - a$, that is, $\|f(x) - a\| = |f(x) - a|$. In this context, one can say that for any real value $\varepsilon > 0$ we can find a number $A$ such that $-\varepsilon < f(x) - a < \varepsilon$ for all $x > A$. In this framework it is not difficult to denote different asymptotic notational devices. For example, the form $f(x) \to \infty$ as $x \to \infty$ (or $\lim_{x\to\infty} f(x) = \infty$) means $\forall \varepsilon > 0\ \exists A > 0:\ f(x) > \varepsilon$ for all $x > A$. The function $f$ can denote a series or a sequence of variables, when $f$ is a function of the natural number $n = 1, 2, 3, \ldots$. In this case we use the notation $f_n$ and can consider, e.g., the formal definition of $\lim_{n\to\infty} f_n$ in a similar manner to that of $\lim_{x\to\infty} f(x)$.
Oftentimes asymptotic propositions treat their results using the symbols $O(\cdot)$, $o(\cdot)$, and $\sim$, which are called "big Oh," "little oh," and "tilde/twiddle," respectively. These symbols provide comparisons between the magnitudes of two functions, say $u(x)$ and $v(x)$. The notation $u(x) = O(v(x))$ means that $u(x)/v(x)$ remains bounded as the argument $x$ approaches some given limit $L$, which is not necessarily finite. In particular, $O(1)$ stands for a bounded function. The notation $u(x) = o(v(x))$, as $x \to L$, denotes that $\lim_{x\to L}\{u(x)/v(x)\} = 0$. In this case, $o(1)$ means a function that tends to zero. By $u(x) \sim v(x)$, as $x \to L$, we define the relationship $\lim_{x\to L}\{u(x)/v(x)\} = 1$.
Examples

1. A function $G(x)$ with $x > 0$ is called slowly varying (at infinity) if for all $a > 0$ we have $G(ax) \sim G(x)$ as $x \to \infty$; e.g., $\log(x)$ is a slowly varying function.
2. It is clear that we can obtain different representations of the function $\sin(x)$, including $\sin(x) = O(x)$ as $x \to 0$, $\sin(x) = o\left(x^2\right)$ as $x \to \infty$, and $\sin(x) = O(1)$.
3. $x^2 = o\left(x^3\right)$ as $x \to \infty$.
4. $f_n = 1 - (1 - 1/n)^2 = o(1)$ as $n \to \infty$.
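As a quick numerical illustration of Example 4 (a sketch of ours, not part of the original text), one can tabulate $f_n$ in R:

n <- c(10, 100, 1000, 10000)
f_n <- 1 - (1 - 1/n)^2   # equals 2/n - 1/n^2
round(f_n, 6)            # values shrink toward 0, consistent with f_n = o(1)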
…independent. For example, we observe a patient who can have red eyes caused by allergy, and then we can represent these traits as $\omega_1 = $ "has", $\omega_2 = $ "does not have". In general, the space $\Omega = \{\omega_1, \omega_2, \ldots\}$ can consist of a continuum of points. This scenario can be associated, e.g., with experiments in which we measure a temperature or plot randomly distributed dots in an interval. We refer the reader to various probability theory textbooks that introduce definitions of underlying probability spaces, random variables, and probability functions in detail (e.g., Chung, 2000; Proschan and Shaw, 2016). In this section we briefly recall that a random variable $\xi(\omega): \Omega \to E$ is a measurable transformation of the elements of $\Omega$, the possible outcomes, into some set $E$. Oftentimes $E = R$, the real line, or $R^d$, the $d$-dimensional real space. The technical axiomatic definition requires both $\Omega$ and $E$ to be measurable spaces. In particular, we deal with random variables that work in a "one-to-one mapping" manner, transforming each random event $\omega_i$ to a specific real number, $\xi(\omega_i) \ne \xi(\omega_j)$, $i \ne j$.
$$F(x) = \int_{-\infty}^{x} f(t)\,dt \quad \text{with } f(x) = dF(x)/dx,$$
FIGURE 1.1
The density plots, where panels (a)–(c) display the density functions of the standard normal distribution, the Gamma(2,1) distribution, and the Exp(1) distribution, respectively.
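The plotted curves themselves are easy to reproduce; a minimal R sketch (ours; the exact plotting style of the original figure is an assumption) is:

# density curves of the standard normal, Gamma(2,1), and Exp(1) distributions
par(mfrow = c(1, 3))
curve(dnorm(x), from = -4, to = 4, ylab = "f(x)", main = "(a)")
curve(dgamma(x, shape = 2, rate = 1), from = 0, to = 8, ylab = "f(x)", main = "(b)")
curve(dexp(x, rate = 1), from = 0, to = 6, ylab = "f(x)", main = "(c)")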
$E(\xi) = \int u\, d\Pr(\xi \le u)$. In general, for a function $\psi$ of the random variable $\xi$ we can define
$$E\{\psi(\xi)\} = \int \psi(u)\, d\Pr(\xi \le u), \quad \text{e.g., } \operatorname{Var}(\xi) = \int \{u - E(\xi)\}^2\, d\Pr(\xi \le u).$$

$$\sum_i \min\{f(u_{i-1}), f(u_i)\}(u_i - u_{i-1}) \le E(\xi) = \int f(u)\, du \le \sum_i \max\{f(u_{i-1}), f(u_i)\}(u_i - u_{i-1}),$$
$\int_D^U \psi(u)\, du$. He did not follow Newton, Leibniz, Cauchy, and Riemann, who partitioned the x-axis between $D$ and $U$, the integral bounds. Instead, $l = \inf_{u\in[D,U]} \psi(u)$ and $L = \sup_{u\in[D,U]} \psi(u)$ were proposed to be defined in order to partition the y-axis between $l$ and $L$. That is, instead of cutting the area under $\psi$ using a finite number of vertical cuts, we use a finite number of horizontal cuts. Panels (a) and (b) of Figure 1.2 show an attempt to integrate the "problematic" function $\psi(u) = u^2 I(u < 1) + \left(u^2 + 1\right) I(u \ge 1)$, where $I(\cdot) \in \{0, 1\}$ denotes the indicator function, using a Riemann-type approach.
FIGURE 1.2
Riemann-type integration (a, b) and Lebesgue integration (c).
…convergence in probability. In this framework, a result frequently applied in asymptotic analysis is the Continuous Mapping Theorem, which states that for every continuous function $h(\cdot)$, if $\xi_n \xrightarrow{p} \zeta$ then we also have $h(\xi_n) \xrightarrow{p} h(\zeta)$ (e.g., Serfling, 2002).
Examples

1. Chebyshev's inequality plays one of the main roles in theoretical statistical inference. In this frame, we suppose that a function $\varphi(x) > 0$ increases, to show that
$$\Pr(\xi \ge x) = \Pr\{\varphi(\xi) \ge \varphi(x)\} = E\left[I\{\varphi(\xi) \ge \varphi(x)\}\right] \le E\left\{\frac{\varphi(\xi)}{\varphi(x)}\right\},$$
where it is assumed that $E\{\varphi(\xi)\} < \infty$. Defining $x > 0$, $\varphi(x) = x^r$, and $\xi = |\zeta - E(\zeta)|$, $r > 0$, we have Chebyshev's inequality (a small simulation check appears after these examples).
2. …$S_n = \sum_{i=1}^{n} \xi_i$. Then, for $x > 0$, we have…
$$E\{\psi(A_n)\} = E\left\{\psi(A_n) I\left(A_n \le n + n^\varepsilon\right)\right\} + E\left\{\psi(A_n) I\left(A_n > n + n^\varepsilon\right)\right\}$$
$$\le E\left\{\psi\left(n + n^\varepsilon\right) I\left(A_n \le n + n^\varepsilon\right)\right\} + E\left\{\psi(A_n) I\left(A_n > n + n^\varepsilon\right)\right\}$$
$$\le \psi\left(n + n^\varepsilon\right) + \left[E\left\{\psi(A_n)^2\right\} \Pr\left(A_n > n + n^\varepsilon\right)\right]^{1/2},$$
$$E\{\psi(A_n)\} \ge E\left\{\psi(A_n) I\left(A_n \ge n - n^\varepsilon\right)\right\} \ge E\left\{\psi\left(n - n^\varepsilon\right) I\left(A_n \ge n - n^\varepsilon\right)\right\}$$
$$\ge \psi\left(n - n^\varepsilon\right) - E\left\{\psi(A_n) I\left(A_n < n - n^\varepsilon\right)\right\}$$
$$\ge \psi\left(n - n^\varepsilon\right) - \left[E\left\{\psi(A_n)^2\right\} \Pr\left(A_n < n - n^\varepsilon\right)\right]^{1/2},$$
where $\left[E\left\{\psi(A_n)^2\right\} \Pr\left(A_n < n - n^\varepsilon\right)\right]^{1/2}$ can potentially vanish to zero; a rough…
$$E\{\psi(A_n)\} = E\left\{\psi(A_n) I\left(A_n \ge n - n^\varepsilon\right)\right\} + E\left\{\psi(A_n) I\left(A_n < n - n^\varepsilon\right)\right\}$$
$$\le E\left\{\psi\left(n - n^\varepsilon\right) I\left(A_n \ge n - n^\varepsilon\right)\right\} + E\left\{\psi(A_n) I\left(A_n < n - n^\varepsilon\right)\right\},$$
$$E\{\psi(A_n)\} \ge E\left\{\psi(A_n) I\left(A_n \le n + n^\varepsilon\right)\right\} \ge E\left\{\psi\left(n + n^\varepsilon\right) I\left(A_n \le n + n^\varepsilon\right)\right\}$$
$$\ge \psi\left(n + n^\varepsilon\right) - E\left\{\psi(A_n) I\left(A_n > n + n^\varepsilon\right)\right\}.$$
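Chebyshev's inequality from Example 1 above can be examined numerically; the following R sketch (ours, with $\zeta \sim N(0, 1)$ and $r = 2$ chosen arbitrarily) compares an empirical tail probability with the bound:

set.seed(1)
zeta <- rnorm(1e5)     # simulated zeta with E(zeta) = 0, var(zeta) = 1
x <- 2
mean(abs(zeta) >= x)   # empirical Pr(|zeta - E(zeta)| >= x), about 0.046
1 / x^2                # Chebyshev bound E{(zeta - E(zeta))^2}/x^2 = 0.25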
where, for an integer $k$, $k! = \prod_{i=1}^{k} i$ and $g^{(k)}(a) = d^k g(x)/dx^k\big|_{x=a}$,
$$g(x) = \sum_{j=0}^{k} \frac{g^{(j)}(a)}{j!}(x - a)^j + R_{k+1}(x),$$
where the remainder term $R_{k+1}(x)$ satisfies
$$R_{k+1}(x) = \frac{g^{(k+1)}(c)}{(k+1)!}(x - a)^{k+1}$$
for some $c$ between $a$ and $x$, $c = a + \theta(x - a)$, $0 \le \theta \le 1$.
FIGURE 1.3
Illustration of Taylor's theorem.
…the rectangle AFMD is too large to represent $\int_A^D \varphi(x)\, dx$. The area of AUVD is $(U - A)(D - A)$ and is appropriate to calculate the integral. The segment UV should cross the curve of $\varphi(x)$, when $x \in [A, D]$, at the point $D + \theta(A - D)$, $\theta \in (0, 1)$. Then $\int_A^D \varphi(x)\, dx = (D - A)\varphi(c)$, where $c = D + \theta(A - D)$. Define $\varphi(x) = dg(x)/dx$ to obtain Taylor's theorem with $k = 0$. Now, following the mathematical induction principle and using integration by parts, one can obtain the Taylor result.
We remark that Taylor's theorem can be generalized to multivariate functions; for details see Mikusinski and Taylor (2002) and Serfling (2002).

Warning: It is critical to note that while applying the Taylor expansion one should carefully evaluate the corresponding remainder terms. Unfortunately, if not implemented carefully, Taylor's theorem is easily misapplied. Examples of errors are numerous in the statistical literature. A typical example of a misstep using Taylor's theorem can be found in several sources. Let $\xi_1, \ldots, \xi_n$ be iid random variables with $E\xi_1 = \mu \ne 0$ and $S_n = \sum_{i=1}^{n} \xi_i$. Suppose one is interested in evaluating the expectation $E(1/S_n)$. Taking into account that $ES_n = \mu n$ and then "using" the Taylor theorem applied to the function $1/x$, we "have"
$$E(1/S_n) \approx 1/(\mu n) - E\left\{(S_n - \mu n)/(\mu n)^2\right\},$$
which, e.g., "results in" $nE(1/S_n) = O(1)$. In this case, the expectation $E(1/S_n)$ may not exist, e.g., when $\xi_1$ is from a normal distribution. This scenario belongs to cases that do not carefully consider the corresponding remainder terms. Prototypes of the example above can be found in the statistical literature that deals with estimators of regression models. In many cases, complications related to evaluations of remainder terms can be comparable with those of the corresponding initial problem statement. This may be a consequence of the universal law of conservation of energy or mere sloppiness.
In the context of the example mentioned above, it is interesting that even when $E(1/S_n)$ does not exist, we obtain $E(S_m/S_n) = m/n$ for $m \le n$, since we have $1 = E(S_n/S_n) = \sum_{i=1}^{n} E(\xi_i/S_n) = nE(\xi_1/S_n)$, and then $E(\xi_1/S_n) = 1/n$ implies $E(S_m/S_n) = \sum_{i=1}^{m} E(\xi_i/S_n) = mE(\xi_1/S_n) = m/n$.
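The identity $E(S_m/S_n) = m/n$ is easy to check by simulation; in the following R sketch (ours; the Exp(1) choice for the $\xi_i$ is arbitrary, taken positive so that $S_n$ stays away from zero) we use $m = 3$, $n = 10$:

set.seed(2)
m <- 3; n <- 10
ratios <- replicate(1e5, {
  xi <- rexp(n)            # iid positive observations
  sum(xi[1:m]) / sum(xi)   # S_m / S_n
})
mean(ratios)               # close to m/n = 0.3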
Example

In several applications, for a function $\varphi$, we need to compare $E\{\varphi(\xi)\}$ with $\varphi(E(\xi))$, where $\xi$ is a random variable. Applying Taylor's theorem to $\varphi(\xi)$, we can obtain
$$\varphi(\xi) = \varphi(E(\xi)) + \{\xi - E(\xi)\}\varphi'(E(\xi)) + \{\xi - E(\xi)\}^2 \varphi''\left(E(\xi) + \theta\{\xi - E(\xi)\}\right)/2, \quad \theta \in (0, 1).$$
…of the general equation $\sum_{i=0}^{k} a_i x^i = 0$ with an integer $k$ and given coefficients $a_0, \ldots, a_k$. Complex variables are involved in this process, adding a dimension…
Proof. Note that $i^2 = -1$, $i^3 = -i$, $i^4 = 1$, and so on. Using the Taylor expansion, we have
$$e^x = \sum_{j=0}^{\infty} \frac{x^j}{j!}, \quad \sin(x) = \sum_{j=0}^{\infty} (-1)^j \frac{x^{2j+1}}{(2j+1)!}, \quad \text{and} \quad \cos(x) = \sum_{j=0}^{\infty} (-1)^j \frac{x^{2j}}{(2j)!}.$$
…functions of complex variables are complex variables of the "$a + ib$" type, and then to define their absolute values.

Euler's formula has inspired mathematical passions and minds. For example, using $x = \pi/2$, we have $i = e^{i\pi/2}$, which leads to $i^i = e^{-\pi/2}$. Is this a "door" from the "unreal" realm to the "real" realm? In this case, we remind the reader that $\exp(x) = \lim_{n\to\infty}(1 + x/n)^n$, and then interpretations of $i^i = e^{-\pi/2}$ are left to the reader's imagination.
Warning: In general, definitions and evaluations of a function of a complex variable require very careful attention. For example, we have $e^{ix} = \cos x + i\sin x$, but we also can write $e^{ix} = \cos(x + 2\pi j) + i\sin(x + 2\pi k)$, $j = 1, 2, \ldots$; $k = 1, 2, \ldots$, taking into account the period of the trigonometric functions. Thus it is not a simple task to define unique functions similar to $\log\{\varphi(z)\}$ and $\{\varphi(z)\}^{1/p}$, where $p$ is an integer. In this context, we refer the reader to Finkelstein et al. (1997) and Chung (1974) for examples of the relevant problems and their solutions. In statistical terms, the issue mentioned above can make estimation of functions of complex variables complicated since, e.g., a problem of unique reconstruction of $\{\varphi(z)\}^{1/p}$ based on $\varphi(z)$ is in effect (see, for example, Chapter 2).
1.11.1 R Software

R is an open-source, case-sensitive, command-line-driven software environment for statistical computing and graphics. The R program is installed using the directions found at www.r-project.org.
Once R has been loaded, the main input window appears, showing a short introductory message (Gardener, 2012). There may be slight differences in appearance depending upon the operating system. For example, Figure 1.4 shows the main input window in the Windows operating system, which has menu options available at the top of the page. Below the header is a screen prompt symbol > in the left-hand margin indicating the command line.

FIGURE 1.4
The main input window of R in Windows.
> x<-c(3,5,10,7)
Notice that the assignment operator (<-), which consists of the two characters less than (<) and minus (-) occurring strictly side by side, "points" to the object receiving the value of the expression. In most contexts the equal operator (=) can be used as an alternative. In order to display the value of x, we simply type its name after the prompt symbol > and press the Return key.
> x
[1] 3 5 10 7
> 0/0
[1] NaN
> Inf - Inf
[1] NaN
The function is.na(x) gives a logical vector of the same size as x, with value TRUE if and only if the corresponding element in x is NA or NaN. The function is.nan(x) is only TRUE when the corresponding element in x is NaN. Missing values are sometimes printed as <NA> when character vectors are printed without quotes.
Manipulating data in R
In R, vectors can be used in arithmetic expressions, in which case the operations are performed element by element. The elementary arithmetic operators are the usual + (addition), - (subtraction), * (multiplication), / (division), and ^ (exponentiation). The following example illustrates the assignment of y, which equals x^2 + 3:

> y <- x^2 + 3
TABLE 1.1
Selected R Commands That Produce the Descriptive Statistics of a Numerical Vector x

Syntax    Definition
Note that the symbol plus (+) is shown at the left-hand side of the screen instead of the symbol greater than (>) to indicate that function commands are being input. The built-in R functions "length," "return," and "mean" are used within the new function mymean. For more details about R functions and data structures, we refer the reader to Crawley (2012).
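The definition of mymean itself is not shown in this excerpt; a plausible minimal reconstruction (an assumption on our part, built from the built-in functions the text mentions) is:

mymean <- function(x) {
  n <- length(x)           # sample size, via the built-in length
  if (n == 0) return(NA)   # guard against an empty vector
  return(mean(x))          # built-in mean does the actual averaging
}
mymean(c(3, 5, 10, 7))     # returns 6.25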
In addition to vectors, R allows manipulation of logical quantities. The elements of a logical vector can be the values TRUE (abbreviated as T), FALSE (abbreviated as F), and NA (in the case of "not available"). Note, however, that T and F are just variables that are set to TRUE and FALSE by default; they are not reserved words and hence can be overwritten by the user. Therefore, it is better to use TRUE and FALSE.

Logical vectors are generated by conditions. For example,
> cond <- x > y

sets cond as a vector of the same length as x and y, with values FALSE corresponding to elements of x and y where the condition is not met and TRUE where it is. Table 1.2 shows some basic comparison operators in R.

TABLE 1.2
Selected R Comparison Operators

Symbolic    Meaning
==          equals
!=          not equal
>           greater than
<           less than
>=          greater than or equal
<=          less than or equal
%in%        determines whether specific elements are in a longer vector

A group of logical expressions, say cond1 and cond2, can be combined with the symbol & (and) or | (or), where cond1 & cond2 is their intersection and cond1 | cond2 is their union; !cond1 is the negation of cond1.
Often, the user may want to make choices and take action depending on a certain value. In this case, an if statement can be very useful; it takes the following form:

if (cond) {statements}
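For instance (a toy example of ours, not from the original text):

x <- 5
if (x > 3) {
  message("x exceeds 3")     # executed only when the condition holds
} else {
  message("x is at most 3")
}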
x <- NULL
for (j in 1:50) {
x[j] <- j^2
}
The EL.means function provides the software tool for implementing the
empirical likelihood tests we will introduce in detail in this book.
As another concrete example, we show the “mvrnorm” command in the
MASS library.
> install.packages("MASS")
Installing package into 'C:/Users/xiwei/Documents/R/win-library/3.1'
(as ‘lib’ is unspecified)
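The demonstration that followed in the original is not reproduced in this excerpt; a minimal usage sketch (with our own choice of mean vector and covariance matrix) is:

library(MASS)
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)  # 2 x 2 covariance matrix
mvrnorm(n = 5, mu = c(0, 0), Sigma = Sigma)   # five draws from a bivariate normal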
For more details about the implementation of R, we refer the reader to, e.g.,
Gardener (2012) and Crawley (2012).
1.11.2 SAS Software

FIGURE 1.5
The interface of SAS.
The most important rule is that every SAS statement ends with a semicolon (;).
Rules for names of variables and SAS data sets

When making up names for variables and data sets (data set names follow similar rules to variable names, but they have a different name space), the following rules should be followed:
In the case of unmatched comments, SAS cannot read the entire program and will stop in the middle of the program, much like unmatched quotation marks. The solution, in batch mode, is to insert the missing part of the comment and resubmit the program.
Inputting data in SAS

In SAS, DATA steps are used to read and modify data, and can also be used to simulate data. The DATA step is flexible relative to the various data formats. DATA steps have an underlying matrix structure such that programming statements will be executed for each row of the data matrix. There are multiple ways to import data into SAS: you can type raw data directly in the SAS program, link SAS to a database, or direct SAS to the data file. The following SAS program illustrates how to type raw data and use the INPUT and DATALINES statements:
The keywords, e.g., DATA, INPUT, DATALINES, and RUN, identify the type of statement and instruct the execution in SAS. For example, the INPUT statement, a part of the DATA step, indicates to SAS the variable names and their formats. To write an INPUT statement using list input, simply list the variable names after the INPUT keyword in the order in which they appear in the data file. If a variable is a character variable, then leave a space and place a dollar sign ($) after the corresponding variable name.
Separating the data from the program avoids the possibility that data will
accidentally be altered when editing the SAS program. For data contained
in external files, the INFILE statement can be used to direct SAS to the data
from ASCII files. The INFILE statement follows the DATA statement and
must precede the INPUT statement. After the INFILE keyword, the file path
and the file name are enclosed in quotation marks.
By default, the DATA step starts reading with the first data line and, if SAS runs out of data on a line, it automatically goes to the next line to read values for the rest of the variables. Most of the time this works fine, but sometimes data files cannot be read using the default settings. In the INFILE statement, the options placed after the filename can change the way SAS reads raw data files. For instance, the FIRSTOBS= option tells SAS at what line to begin reading data; the OBS= option can be used anytime you want to read only a part of your data file; the MISSOVER option tells SAS that if it runs out of data, it should not go to the next data line but instead assign missing values to any remaining variables; and the DELIMITER=, or DLM=, option allows SAS to read data files with other delimiters (the default is a blank space).
Moreover, SAS assumes that the number of characters, including spaces, in a data line (termed the record length) of an external file is no more than 256 in some operating environments, e.g., the Windows operating system. If the data contain records longer than 256 characters, use the LRECL= option in the INFILE statement to specify a record length at least as long as the longest record in the data file. For more details about the INFILE options, we refer the reader to Delwiche and Slaughter (2012). More complex data import and export features are available in SAS, but they go beyond the scope of our applications.
To illustrate, the following program reads data from a tab-separated external file into a SAS data set using the INFILE statement:
With SAS, there is commonly more than one way to accomplish the same
result, including the input of data. For more details about other ways of
inputting data, e.g., the IMPORT procedure, we refer the reader to Delwiche
and Slaughter (2012).
When a variable exists in SAS but does not have a value, the value is said to be missing. SAS assigns a period (.) for numeric data and a blank for character data. With the MISSING statement, the user may specify other characters, instead of a period or a blank, to be treated as missing data. To illustrate how the missing data coding works, the following example declares that the character u is to be treated as a missing value for the character variable Gender whenever it is encountered in a record:

DATA two;
MISSING u;
INPUT Gender $;
CARDS;
RUN;
Missing values have a value of false when used with logical operators such
as AND or OR.
Manipulating data in SAS

In SAS, users can create and redefine variables with assignment statements of the following basic form:

variable = expression;

On the left side of the equal sign is a variable name, either new or old. On the right side of the equal sign can be a constant, another variable, or a mathematical expression. The basic types of assignment statements may use operators such as + (addition), - (subtraction), * (multiplication), / (division), and ** (exponentiation). The following example illustrates the assignment of NewVar, which equals the square of OldVar plus 3:

NewVar = OldVar ** 2 + 3;
IF Gender = 'F' THEN y = 10;

This statement tells SAS to set the variable y equal to 10 whenever the variable Gender equals F. The terms on either side of the comparison, separated by a comparison operator, may be constants, variables, or expressions. The comparison operator may be either symbolic or mnemonic, depending on the user's preference. Table 1.4 shows some basic comparison operators.

A single IF-THEN statement can only have one action. To specify multiple conditions, we can combine conditions with the keywords AND (&) or OR (|). For example,
TABLE 1.3
Selected SAS Numeric Functions
Syntax Definition
TABLE 1.4
Selected SAS Comparison Operators

Symbolic           Mnemonic    Meaning
=                  EQ          equals
¬=, ^=, or ~=      NE          not equal
>                  GT          greater than
<                  LT          less than
>=                 GE          greater than or equal
<=                 LE          less than or equal
                   IN          determines whether a variable's value is among a list of values
DATA one;
DO j = 1 TO 50;
x = j**2;
OUTPUT one;
END;
DROP j;
RUN;

The program creates a SAS data set one with 50 observations and a variable x = j**2, where the index variable j ranges from 1 to 50. The DROP statement helps get rid of the index variable j. The OUTPUT statement tells SAS to write the current observation to the output data set before returning to the beginning of the DATA step to process the next observation.
Printing data in SAS
The PROC PRINT procedure lists data in a SAS data set as a table of obser-
vations by variables. The following statements can be used with PROC
PRINT:
PROC PRINT DATA = SASdataset;
VAR variables;
ID variable;
BY variables;
TITLE 'Print SAS Data Set';
RUN;
where SASdataset is the name of the data set to be printed. If none is specified, then the last SAS data set created will be printed. If no VAR statement is included, then all variables in the data set are printed; otherwise only those listed, and in the order in which they are listed, are printed. When an ID statement is used, SAS prints each observation with the values of the ID variables first, instead of the observation number (the default setting). The BY statement specifies the variable that the procedure uses to form BY groups; the observations in the data set must either be sorted by all the variables specified (e.g., using the PROC SORT procedure) or be indexed appropriately, unless the NOTSORTED option in the BY statement is used. For more details, we refer the reader to Delwiche and Slaughter (2012).
Summarizing data in SAS

After reading the data and making sure they are correct, one may summarize and analyze the data using built-in SAS procedures, or PROCs. For example, PROC MEANS provides a set of descriptive statistics for numeric variables. Virtually all SAS PROCs have additional options or features, which can be accessed with subcommands; e.g., one can use PROC MEANS to carry out a one-sample t-test.
As another example, the following program sorts the SAS data set Namelists by gender using PROC SORT, and then summarizes Age by Gender using PROC MEANS with a BY statement (the MAXDEC option is set to zero, so no decimal places will be printed):

PROC SORT DATA = Namelists; BY Gender; RUN;

PROC MEANS DATA = Namelists MAXDEC = 0;
BY Gender;
VAR Age;
TITLE 'Summary of Age by Gender';
RUN;
Gender = F
Analysis Variable: Age
N    Mean    Std Dev    Minimum    Maximum
2    25      10         18         32

Gender = M
Analysis Variable: Age
N    Mean    Std Dev    Minimum    Maximum
2    41      8          35         46
2
Characteristic Function–Based Inference
2.1 Introduction
In this chapter we present powerful characteristic function–based tools as an approach for investigating the properties of distribution functions and their associated quantities. As it turns out, the information contained within the characteristic function structure has a one-to-one correspondence with the distribution function framework. In this context characteristic functions can be extremely helpful in both theoretical and applied biostatistical work. In particular, characteristic functions can be used to simplify many analytical proofs in statistics. In certain situations, the form of a characteristic function can easily define a family of distribution functions, thus generalizing known conjugate families. This is very important, for example, in parametric statistics and the construction of Bayesian priors. Oftentimes a relatively simple characteristic function can mathematically represent a random variable in terms of its given properties even when the distribution function does not have an explicit analytical form.

In a similar manner to the analysis based on characteristic functions, we also introduce Laplace transformations of distribution functions in this chapter. In the statistical context, oftentimes we need to estimate distribution functions based on observations subject to different noise effects, or based on data that are directly presented as sums or maximums. These scenarios, various tasks of statistical sequential procedures, and problems arising in renewal theory are examples of situations where characteristic functions and Laplace transformations can play a main role in analytical analyses.

We suggest that the reader who is interested in more details regarding characteristic functions beyond those presented in this book consult the book of Lukacs (1970) as a fundamental guide.
In Section 2.2 we consider the elementary properties of characteristic functions, as well as convolution statements in which characteristic functions can be involved. The one-to-one mapping propositions relating characteristic and distribution functions are presented in Section 2.3. In Section 2.4 we outline various applications of characteristic functions to biostatistical topics.
$$\varphi_\xi(t) = \int_{-\infty}^{\infty} e^{itx}\, dF(x) = E\left(e^{it\xi}\right), \quad i = \sqrt{-1}.$$

1. $\varphi_\xi(0) = 1$;
2. $\left|\varphi_\xi(t)\right| \le 1$, for all $t$;
3. $\varphi_{-\xi}(t) = \overline{\varphi_\xi(t)}$, where the horizontal bar atop $\varphi_\xi(t)$ denotes the complex conjugate of $\varphi_\xi(t)$.

For example, in order to obtain property (2) above we use the elementary inequality
$$\left|E\left(e^{it\xi}\right)\right| \le E\left|e^{it\xi}\right| = E\left\{\cos^2(t\xi) + \sin^2(t\xi)\right\}^{1/2} = 1.$$
$$F_{\xi_1+\xi_2}(x) = \Pr(\xi_1 + \xi_2 \le x) = E\left\{I(\xi_1 + \xi_2 \le x)\right\} = \iint I(u_1 + u_2 \le x)\, dF_{\xi_1}(u_1)\, dF_{\xi_2}(u_2),$$
$$F_{\xi_1+\xi_2}(x) = \iint I(u_1 \le x - u_2)\, dF_{\xi_1}(u_1)\, dF_{\xi_2}(u_2) = \int \left\{\int_{-\infty}^{x-u_2} dF_{\xi_1}(u_1)\right\} dF_{\xi_2}(u_2) = \int_{-\infty}^{\infty} F_{\xi_1}(x - u_2)\, dF_{\xi_2}(u_2),$$
since it is clear that $F(u) = \int_{-\infty}^{u} dF(u)$ with $F(-\infty) = 0$. In the case of dependent random variables $\xi_1$ and $\xi_2$, using the definition of conditional probability one can show that the distribution of the sum is given by
$$F_{\xi_1+\xi_2}(x) = \int_{-\infty}^{\infty} \Pr(\xi_1 \le x - u_2 \mid \xi_2 = u_2)\, dF_{\xi_2}(u_2). \qquad (2.1)$$
Intuitively, this equation means that on the right-hand side of the formula above the random variable $\xi_2$ is fixed. However, we allow $\xi_2$ to vary according to its probability distribution. Thus, Equation (2.1) holds.
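As a numerical illustration of the convolution formula (ours, not from the original text): for independent $\xi_1, \xi_2 \sim \text{Exp}(1)$ the convolution yields the Gamma(2,1) distribution function, which can be checked in R:

set.seed(3)
x <- 2
# convolution integral: F_{xi1+xi2}(x) = int_0^x F_xi1(x - u) dF_xi2(u)
conv <- integrate(function(u) pexp(x - u) * dexp(u), lower = 0, upper = x)$value
conv                               # about 0.594, equal to pgamma(x, shape = 2, rate = 1)
mean(rexp(1e5) + rexp(1e5) <= x)   # simulation-based check of the same probability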
The integral form (2.1) of $F_{\xi_1+\xi_2}(x)$ shown above is called a convolution. When we have $n > 1$ independent random variables $\xi_1, \ldots, \xi_n$, it is clear that the distribution function of $\sum_{i=1}^{n} \xi_i$ can be represented as an $(n-1)$-times iterated convolution; that is, $\varphi_{\xi_1+\xi_2+\cdots+\xi_n}(t) = q^n$ with $q = E\left(e^{it\xi_1}\right)$, when $\xi_1, \ldots, \xi_n$ are iid, since the random variables $e^{it\xi_i}$, $i = 1, \ldots, n$, are independent, that is, $E\left(e^{it\xi_k + it\xi_l}\right) = E\left(e^{it\xi_k}\right)E\left(e^{it\xi_l}\right)$, $1 \le k \ne l \le n$. Thus characteristic functions can be used to derive the…

…where $\varphi(t) = E\left(e^{it\xi}\right)$ is the characteristic function.
Comments:
$$\int_{-\delta}^{\delta} \frac{e^{-itx} - e^{-ity}}{it}\varphi(t)\, dt = \int_{-\delta}^{\delta} (y - x)\varphi(t)\, dt + O(\delta)$$
with $|\varphi(t)| \le 1$, for all $t$ and fixed $\delta > 0$. Thus, while restricting $F$ (or $\varphi$) so that
$$\int_{\delta}^{\infty} \left|\varphi(t)/t\right| dt < \infty \quad \text{and} \quad \int_{-\infty}^{-\delta} \left|\varphi(t)/t\right| dt < \infty, \qquad (2.3)$$
for some $\delta > 0$, we could put $\lim_{\sigma\to 0}$ into the integral at (2.2), yielding
$$F(y) - F(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{e^{-itx} - e^{-ity}}{it}\varphi(t)\, dt. \qquad (2.4)$$
It turns out that in the case when the density function $f(u) = dF(u)/du$ exists, we can write $\varphi(t) = \int e^{itu} f(u)\, du = \left\{\int f(u)\, de^{itu}\right\}/(it)$, and then the reader can use the method of integration by parts to obtain conditions on $F$ in order to satisfy (2.3). For example, $\varphi(t) = 1/\left(1 + t^2\right)$ (a Laplace distribution) and $\varphi(t) = e^{-|t|}$ (a Cauchy distribution) satisfy (2.3). Thus, we conclude that the class of distribution functions of the form (2.4) is not empty.
$$F(y) - F(x) = \lim_{T\to\infty}\frac{1}{2\pi}\int_{-T}^{T} \frac{e^{-itx} - e^{-ity}}{it}\varphi(t)\, dt,$$
$$\lim_{\sigma\to 0}\frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{e^{-itx} - e^{-ity}}{it}\varphi(t)e^{-t^2\sigma^2/2}\, dt = \frac{1}{2\pi}\int_{-\infty}^{\infty}\frac{e^{-itx} - e^{-ity}}{it}\varphi(t)\, dt$$
$$\frac{1}{2\pi} E\left(\int_{-\infty}^{\infty} \frac{e^{-itx} - e^{-ity}}{it}\, e^{it\xi}\, dt\right) = \frac{1}{2\pi} E\left\{\int_{-\infty}^{\infty} \frac{e^{-it(x-\xi)} - e^{-it(y-\xi)}}{it}\, dt\right\}.$$
$$e^{-it(x-\xi)} - e^{-it(y-\xi)} = \cos\left(t(x-\xi)\right) - i\sin\left(t(x-\xi)\right) - \cos\left(t(y-\xi)\right) + i\sin\left(t(y-\xi)\right),$$
we consider $\int_{-\infty}^{\infty}\{\cos(t)/t\}\, dt$. The fact that $\cos(t)/t = -\{\cos(-t)/(-t)\}$ implies
$$\int_{-\infty}^{\infty} \frac{\cos(t)}{t}\, dt = 0.$$
Thus
$$\int_{-\infty}^{\infty} \frac{\sin(t)}{t}\, dt = 2\int_{0}^{\infty} \frac{\sin(t)}{t}\, dt = 2\int_{0}^{\infty}\int_{0}^{\infty} \sin(t)e^{-ut}\, du\, dt = 2\int_{0}^{\infty}\int_{0}^{\infty} e^{-ut}\sin(t)\, dt\, du,$$
$$\int_{0}^{\infty} e^{-ut}\sin(t)\, dt = -\frac{1}{u}\int_{0}^{\infty}\sin(t)\, de^{-ut} = -\frac{1}{u}\left\{e^{-ut}\sin(t)\Big|_{0}^{\infty} - \int_{0}^{\infty} e^{-ut}\, d\sin(t)\right\}$$
$$= \frac{1}{u}\int_{0}^{\infty} e^{-ut}\cos(t)\, dt = -\frac{1}{u^2}\int_{0}^{\infty}\cos(t)\, de^{-ut} = -\frac{1}{u^2}\left\{e^{-ut}\cos(t)\Big|_{0}^{\infty} - \int_{0}^{\infty} e^{-ut}\, d\cos(t)\right\}$$
$$= \frac{1}{u^2} - \frac{1}{u^2}\int_{0}^{\infty} e^{-ut}\sin(t)\, dt,$$
that is, $\int_{0}^{\infty} e^{-ut}\sin(t)\, dt = \left(1 + u^2\right)^{-1}$. Thus we obtain
$$\int_{-\infty}^{\infty} \frac{\sin(t)}{t}\, dt = 2\int_{0}^{\infty} \frac{\sin(t)}{t}\, dt = 2\int_{0}^{\infty}\int_{0}^{\infty}\sin(t)e^{-ut}\, du\, dt = 2\int_{0}^{\infty}\frac{1}{1 + u^2}\, du.$$
This result resolves the problem regarding the existence of $\int_{-\infty}^{\infty}\{\sin(t)/t\}\, dt$. It is well known that $\int_{0}^{\infty}\left(1 + u^2\right)^{-1} du = \pi/2$ (e.g., Titchmarsh, 1976), so that
$$\int_{-\infty}^{\infty} \frac{\sin(t)}{t}\, dt = 2\arctan(u)\Big|_{0}^{\infty} = \pi.$$
This is the Dirichlet integral, which can be easily applied to conclude that
$$\int_{-\infty}^{\infty} \frac{\sin(\gamma t)}{t}\, dt = \begin{cases} \pi, & \text{if } \gamma > 0, \\ -\pi, & \text{if } \gamma < 0, \\ 0, & \text{if } \gamma = 0, \end{cases} \quad = \pi I(\gamma > 0) - \pi I(\gamma < 0).$$
Then, reconsidering Equation (2.5), we have
$$\frac{1}{2\pi} E\left\{\int_{-\infty}^{\infty} \frac{e^{-it(x-\xi)} - e^{-it(y-\xi)}}{it}\, dt\right\}$$
$$= \frac{1}{2\pi} E\left[\pi I\{(y-\xi) \ge 0\} - \pi I\{(y-\xi) < 0\} - \pi I\{(x-\xi) \ge 0\} + \pi I\{(x-\xi) < 0\}\right]$$
$$= \frac{1}{2} E\left[I\{\xi \le y\} - \left(1 - I\{\xi \le y\}\right) - I\{\xi \le x\} + \left(1 - I\{\xi \le x\}\right)\right]$$
$$= \Pr(\xi \le y) - \Pr(\xi \le x) = F(y) - F(x).$$
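The inversion formula (2.4) can be illustrated numerically (a sketch of ours): take $\varphi(t) = \exp\left(-t^2/2\right)$, the characteristic function of $N(0, 1)$; only the real part of the integrand, $\{\sin(ty) - \sin(tx)\}/t$, survives the integration, since the imaginary part is odd in $t$:

phi <- function(t) exp(-t^2 / 2)   # characteristic function of N(0,1)
x <- -1; y <- 1
integrand <- function(t) ifelse(t == 0, y - x, (sin(t * y) - sin(t * x)) / t) * phi(t)
integrate(integrand, -Inf, Inf)$value / (2 * pi)   # about 0.6827
pnorm(y) - pnorm(x)                                # exact F(y) - F(x)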
$$F_\zeta(u) = \Pr(\eta + \xi \le u) = \int \Pr(\eta \le u - x)\, d\Pr(\xi \le x)$$
and then, for example, the density function $f_\zeta(u) = dF_\zeta(u)/du = \int f_\eta(u - x)\, d\Pr(\xi \le x)$ with $f_\eta(u) = dF_\eta(u)/du$ exists. Even when $\xi = 1, 2, \ldots$ is discrete, we have
$$F_\zeta(u) = \sum_{j=1}^{\infty} F_\eta(u - j)\Pr(\xi = j), \quad f_\zeta(u) = \sum_{j=1}^{\infty} f_\eta(u - j)\Pr(\xi = j).$$
$$F_\zeta(y) - F_\zeta(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\frac{e^{-itx} - e^{-ity}}{it}\, E\left(e^{it\zeta}\right) dt = \frac{1}{2\pi}\int_{-\infty}^{\infty}\frac{e^{-itx} - e^{-ity}}{it}\,\varphi(t)\, E\left(e^{it\eta}\right) dt. \qquad (2.6)$$
Then "Der $\eta$ hat seine Arbeit getan, der $\eta$ kann gehen" ("The $\eta$ has done its work, the $\eta$ can go"; German, after Schiller, 2013), and we let $\eta$ go. To formalize this result we start by proving the following lemma.
Lemma 2.3.1.
Let a random variable $\eta$ be normally $N\left(\mu, \sigma^2\right)$-distributed. Then the characteristic function of $\eta$ is given by $\varphi_\eta(t) = E\left(e^{it\eta}\right) = \exp\left(i\mu t - \sigma^2 t^2/2\right)$.
$$d\varphi_{\eta_1}(t)/dt = (2\pi)^{-1/2}\int_{-\infty}^{\infty} iu\, e^{itu - u^2/2}\, du = -(2\pi)^{-1/2} i\int_{-\infty}^{\infty} e^{itu}\, de^{-u^2/2}.$$
Integration by parts then yields
$$d\varphi_{\eta_1}(t)/dt = -(2\pi)^{-1/2} t\int_{-\infty}^{\infty} e^{itu - u^2/2}\, du = -t\varphi_{\eta_1}(t), \quad \text{that is,} \quad \left\{d\varphi_{\eta_1}(t)/dt\right\}\big/\varphi_{\eta_1}(t) = -t,$$
or $d\ln\left(\varphi_{\eta_1}(t)\right)/dt = -t$. This differential equation has the solution $\varphi_{\eta_1}(t) = \exp\left(-t^2/2 + C\right)$, for a constant $C$. Taking into account that $\varphi_{\eta_1}(0) = 1$, we obtain $C = 0$.
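Lemma 2.3.1 is also easy to examine empirically (our sketch): the empirical characteristic function of simulated $N\left(\mu, \sigma^2\right)$ data should approach $\exp\left(i\mu t - \sigma^2 t^2/2\right)$:

set.seed(4)
mu <- 1; sigma <- 2; t <- 0.7
x <- rnorm(1e5, mean = mu, sd = sigma)
mean(exp(1i * t * x))                    # empirical characteristic function at t
exp(1i * mu * t - sigma^2 * t^2 / 2)     # value stated by Lemma 2.3.1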
$$F_\zeta(y) - F_\zeta(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\frac{e^{-itx} - e^{-ity}}{it}\,\varphi(t)\, e^{-\sigma^2 t^2/2}\, dt, \quad \zeta = \xi + \eta.$$
$$\lim_{\sigma\to 0}\left\{F_\zeta(y) - F_\zeta(x)\right\} = \lim_{\sigma\to 0}\frac{1}{2\pi}\int_{-\infty}^{\infty}\frac{e^{-itx} - e^{-ity}}{it}\,\varphi(t)\, e^{-\sigma^2 t^2/2}\, dt,$$
which leads to
$$F(y) - F(x) = \frac{1}{2\pi}\lim_{\sigma\to 0}\int_{-\infty}^{\infty}\frac{e^{-itx} - e^{-ity}}{it}\,\varphi(t)\, e^{-\sigma^2 t^2/2}\, dt,$$
since $\Pr\left(|\eta| > \delta\right) \to 0$, for all $\delta > 0$, as $\sigma \to 0$ (here, presenting $\eta_k \sim N\left(0, \sigma_k^2\right)$, it is then clear that we have $\xi + \eta_k \xrightarrow{p} \xi$ and $F_{\xi+\eta_k} \to F$ as $\sigma_k \to 0$, $k \to \infty$). This yields
$$\{F(x + h) - F(x)\}/h = \frac{1}{2\pi}\int_{-\infty}^{\infty}\frac{e^{-itx} - e^{-it(x+h)}}{iht}\,\varphi(t)\, dt,$$
using $y = x + h$. From this, with $h \to 0$ and the definitions $f(x) = \lim_{h\to 0}\{F(x+h) - F(x)\}/h$ and $de^{-itx}/dx = -\lim_{h\to 0}\left[\left\{e^{-itx} - e^{-it(x+h)}\right\}/h\right]$, we have the following result.
Theorem 2.3.2.
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\varphi(t)\, dt.$$
2.4 Applications

The common scheme for applying mechanisms based on characteristic functions consists of the following algorithm:

(Scheme: problem → transformation → transformed solution → inverse transformation → solution.)
$$N(H) = \min\left\{n \ge 1:\ \sum_{i=1}^{n} X_i \ge H\right\}.$$
In this case, the physician will end his or her shift when $\sum_{i=1}^{n} X_i$ overshoots the threshold $H > 0$.
$$E\{N(H)\} = \sum_{j=1}^{\infty} j\Pr\{N(H) = j\} = \sum_{j=1}^{\infty} j\Pr\left\{\sum_{i=1}^{j-1} X_i < H,\ \sum_{i=1}^{j} X_i \ge H\right\},$$
where we define $\sum_{i=1}^{0} X_i = 0$ and use the rule that the process stops the first time at $j$ when the event $\left\{\sum_{i=1}^{j-1} X_i < H,\ \sum_{i=1}^{j} X_i \ge H\right\}$ occurs. (Note that if not all $X_i > 0$, $i = 1, 2, \ldots$, then we should consider $\{N(H) = j\} = \left\{\max_{1\le k\le j-1}\left(\sum_{i=1}^{k} X_i\right) < H,\ \max_{1\le k\le j}\left(\sum_{i=1}^{k} X_i\right) \ge H\right\}$; see Figure 2.1 for details.) Thus
FIGURE 2.1
The sum $\sum_{i=1}^{n} X_i$ crosses the threshold $H$ at $j$ when (a) all $X_i > 0$, $i = 1, 2, \ldots$, and (b) we have several negative observations.
$$E\{N(H)\} = \sum_{j=1}^{\infty} j\left[\Pr\left\{\sum_{i=1}^{j-1} X_i < H\right\} - \Pr\left\{\sum_{i=1}^{j-1} X_i < H,\ \sum_{i=1}^{j} X_i < H\right\}\right]$$
$$= \sum_{j=1}^{\infty} j\left[\Pr\left\{\sum_{i=1}^{j-1} X_i < H\right\} - \Pr\left\{\sum_{i=1}^{j} X_i < H\right\}\right]$$
$$= \sum_{j=1}^{\infty} j\Pr\left\{\sum_{i=1}^{j-1} X_i < H\right\} - \sum_{j=1}^{\infty} j\Pr\left\{\sum_{i=1}^{j} X_i < H\right\} \qquad (2.7)$$
$$= 1 + \sum_{j=1}^{\infty} (j + 1)\Pr\left\{\sum_{i=1}^{j} X_i < H\right\} - \sum_{j=1}^{\infty} j\Pr\left\{\sum_{i=1}^{j} X_i < H\right\}$$
$$= 1 + \sum_{j=1}^{\infty} \Pr\left\{\sum_{i=1}^{j} X_i < H\right\}.$$
This result leads to a non-asymptotic upper bound for the renewal function, since by (2.7) we have
$$E\{N(H)\} - 1 = \sum_{j=1}^{\infty}\Pr\left\{\exp\left(-\frac{1}{H}\sum_{i=1}^{j} X_i\right) > e^{-1}\right\} = \sum_{j=1}^{\infty} E\left[I\left\{\exp\left(-\frac{1}{H}\sum_{i=1}^{j} X_i\right) > e^{-1}\right\}\right]$$
$$\le e\sum_{j=1}^{\infty} E\left\{\exp\left(-\frac{1}{H}\sum_{i=1}^{j} X_i\right)\right\} = e\sum_{j=1}^{\infty}\prod_{i=1}^{j} E\left(e^{-X_i/H}\right) = e\sum_{j=1}^{\infty}\left\{E\left(e^{-X_1/H}\right)\right\}^j, \quad e = \exp(1),$$
where the fact that $I(a > b) \le a/b$, for all positive $a$ and $b$, was used. Noting that $E\left(e^{-X_1/H}\right) < 1$ and using the sum formula for a geometric progression, we obtain
$$E\{N(H)\} \le 1 + eE\left(e^{-X_1/H}\right)\left\{1 - E\left(e^{-X_1/H}\right)\right\}^{-1}.$$
One can apply Taylor's theorem to $E\left(e^{-X_1/H}\right)$, considering $1/H$ around 0 as $H \to \infty$, to easily show that the upper bound for $E\{N(H)\}$ is asymptotically linear with respect to $H$, when $E\left(X_1^2\right) < \infty$.
Now let us study the asymptotic behavior of $E\{N(H)\}$ as $H \to \infty$. Toward this end we consider the following nonformal arguments: (1) Characteristic functions correspond to distribution functions that increase and are bounded by 1. Certainly, the properties of a characteristic function will be satisfied if we assume the corresponding "distribution functions" are bounded by 2 or, more generally, by linear-type functions. In this context the function $E\{N(H)\}$ increases monotonically and is bounded by a linear function. Thus we can focus on the transformation $\psi(t) = \int_{0}^{\infty} e^{itu}\, d\left[E\{N(u)\}\right]$ of $E\{N(u)\} > 0$ in a similar manner to that of characteristic functions. (2) Intuitively, relatively small values of $t$ can provide $\psi(t)$ to represent the behavior of $E\{N(H)\}$ when $H$ is large, because schematically it may be displayed that
$$\psi(t \to 0) \sim \int_{0}^{\infty} e^{iu(t\to 0)}\, d\left[E\{N(u)\}\right] \sim E\{N(u \to \infty)\}.$$
$$\psi(t) = \int_{0}^{\infty} e^{itu}\, d\left[1 + \sum_{j=1}^{\infty}\Pr\left\{\sum_{i=1}^{j} X_i < u\right\}\right] = \sum_{j=1}^{\infty}\int_{0}^{\infty} e^{itu}\, d\Pr\left\{\sum_{i=1}^{j} X_i < u\right\} = \sum_{j=1}^{\infty} E\left\{\exp\left(it\sum_{i=1}^{j} X_i\right)\right\} = \sum_{j=1}^{\infty} q^j$$
with $q = E\left(e^{itX_1}\right)$ and $|q| < 1$. The sum formula for a geometric progression provides
$$\psi(t) = \frac{q}{1 - q}.$$
$\psi(t) \sim -1/\{tiE(X_1)\} = i/\{tE(X_1)\}$ as $t \to 0$. The transformation $i/\{tE(X_1)\}$ corresponds to the function $A(u) = u/E(X_1)$, $u > 0$, since (see Theorem 2.3.2)
$$\frac{dA(u)}{du} = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itu}\,\frac{i}{tE(X_1)}\, dt = \frac{i}{2\pi E(X_1)}\int_{-\infty}^{\infty}\frac{\cos(tu) - i\sin(tu)}{t}\, dt = \frac{1}{E(X_1)}$$
(see the proof of Theorem 2.3.1 to obtain the values of the integrals $\int_{-\infty}^{\infty}\{\cos(ut)/t\}\, dt$ and $\int_{-\infty}^{\infty}\{\sin(ut)/t\}\, dt$). Thus $E\{N(H)\} \sim H/E(X_1)$ as $H \to \infty$.
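The approximation $E\{N(H)\} \sim H/E(X_1)$ is easy to examine by simulation; a sketch (ours, with $X_i \sim \text{Exp}(1)$ so that $E(X_1) = 1$):

set.seed(5)
stop_time <- function(H) {
  s <- 0; n <- 0
  while (s < H) {        # stop the first time the partial sum reaches H
    n <- n + 1
    s <- s + rexp(1)     # X_i ~ Exp(1)
  }
  n
}
H <- 50
mean(replicate(2000, stop_time(H)))   # close to H/E(X_1) = 50 (about 51 here)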
Let us derive the asymptotic result about $E\{N(H)\}$ given above in a formal manner. By the definition of $N(H)$ we obtain
$$\sum_{i=1}^{N(H)} X_i \ge H \quad \text{and then} \quad E\left\{\sum_{i=1}^{N(H)} X_i\right\} \ge H.$$
Using Wald's lemma (this result will be proven elsewhere in this book; see Chapter 4), we note that
$$E\left\{\sum_{i=1}^{N(H)} X_i\right\} = E(X_1)E\{N(H)\} \quad \text{and then} \quad E\{N(H)\} \ge H/E(X_1). \qquad (2.8)$$
$$g(u) \le \frac{G(1/u) - au}{w - 1}\, e^{w} + auwe^{w-1},$$
$$G(s) = s\int_{0}^{u} e^{-sy}g(y)\, dy + s\int_{u}^{wu} e^{-sy}g(y)\, dy + s\int_{wu}^{\infty} e^{-sy}g(y)\, dy$$
$$\ge sa\int_{0}^{u} e^{-sy}y\, dy + sg(u)\int_{u}^{wu} e^{-sy}\, dy + sa\int_{wu}^{\infty} e^{-sy}y\, dy.$$
Thus
$$G(s) \ge \frac{a}{s}\left\{1 - e^{-su}(1 + su)\right\} + g(u)\left(e^{-su} - e^{-swu}\right) + \frac{a}{s}\left\{e^{-swu}(1 + suw)\right\}.$$
$$g(u) \le \frac{G(1/u) - au}{e^{-1} - e^{-w}} + au\,\frac{2e^{-1} - (1 + w)e^{-w}}{e^{-1} - e^{-w}}. \qquad (2.9)$$
$$g(u) \le \frac{G(1/u) - au}{(w - 1)e^{-w}} + auwe^{w-1}.$$
Using the representation $G(1/u) = \sum_{j=1}^{\infty} E\left\{\exp\left(-\sum_{i=1}^{j} X_i/u\right)\right\} = v/(1 - v)$ shown above, where $v = E\left(e^{-X_1/u}\right) = 1 - E(X_1)/u + o(1/u)$ as $u \to \infty$ and $E\left(X_1^2\right) < \infty$, this means that
$$\frac{1}{E(X_1)} \le \lim_{u\to\infty}\frac{E\{N(u)\}}{u} \le \frac{we^{w-1}}{E(X_1)}.$$
…and the definition $N(H) = \min\left\{n \ge 1:\ \sum_{i=1}^{n} X_i \ge H\right\}$, one can decide that it is obvious that $E\{N(H)\} \sim H/E(X_1)$. However, we invite the reader to employ probabilistic arguments in a manner different from that mentioned above in order to prove this approximation, recognizing the simplicity and structure of the proposed analysis. Proposition 2.4.1 provides, non-asymptotically, a very accurate upper bound for the renewal function. Perhaps this proposition can be found only in this book. The demonstrated proof scheme can be extended to more complicated cases with, e.g., dependent random variables, as well as improved to obtain more accurate results, e.g., regarding the expression $\lim_{u\to\infty}\left[u^{-\omega}\left[E\{N(u)\} - \{E(X_1)\}^{-1}u\right]\right]$ with $0 \le \omega \le 1$ (for example, in Proposition 2.4.1 one can define $w = 1 + u^{-\omega}\log(u)$). The stopping time $N(H)$ will also be studied in several forthcoming sections of this book.
Warning: Tauberian-type propositions should be applied very carefully. For example, define the function $v(x) = \sin(x)$. The corresponding Laplace transformation is $J(s) = \int_{0}^{\infty} e^{-su}\, d\sin(u) = s/\left(1 + s^2\right)$, $s > 0$. This does not lead us to conclude that, since $J(s) \to 0$ as $s \to 0$, we have $v(x) \to 0$ as $x \to \infty$. In this case we remark that $v(x)$ is not an increasing function for $x \in (-\infty, \infty)$.
$$R_n(C) = E\left(\hat{\theta}_n - \theta\right)^2 + Cn.$$
Thus we want to minimize the mean squared error of the estimate, balanced against the desire to request larger and larger sets of data points, which is restricted by the requirement to pay for each observation. For example, consider the simple estimator of the mean $\theta = E(X_1)$ based on iid observations $X_i$, $i = 1, \ldots, n$: $\hat{\theta}_n = \bar{X}_n = \sum_{i=1}^{n} X_i/n$. In this case $R_n(C) = \sigma^2/n + Cn$ with $\sigma^2 = \operatorname{var}(X_1)$.
Making use of the definition of the risk function, one can easily show that $R_n(C)$ is minimized when $n = n_C^* = \sigma/C^{1/2}$. Since the optimal fixed sample size $n_C^*$ depends on $\sigma$, which is commonly unknown, one can resort to sequential sampling and consider the stopping rule within the estimation procedure of the form
$$N(C) = \inf\left\{n \ge n_0:\ n \ge \hat{\sigma}_n/C^{1/2}\right\},$$
…$\hat{\theta}_n = \bar{X}_n = \sum_{i=1}^{n} X_i/n$ and $E\left(\hat{\theta}_n - \theta\right)^2 = \theta^2/n$. Assume, depending on parameter values, it is required to pay $\theta^4 H$ for one data point. Then $R_n(H) = \theta^2/n + H\theta^4 n$ and we have $R_n(H) - R_{n-1}(H) = -\theta^2/\{(n-1)n\} + H\theta^4$. The equation $R_n(H) - R_{n-1}(H) = 0$ gives the optimal fixed sample size $n_H^* \approx \theta^{-1}/H^{1/2}$, where $\theta$ is unknown. In this framework, the stopping rule is
$$N(H) = \inf\left\{n \ge n_0:\ n \ge \hat{\theta}_n^{-1}/H^{1/2}\right\} = \min\left\{n \ge n_0:\ \sum_{i=1}^{n} X_i \ge 1/H^{1/2}\right\}$$
and the estimator is $\hat{\theta}_{N(H)} = \sum_{i=1}^{N(H)} X_i/N(H)$. It is straightforward to apply the technique mentioned in Section 2.4.1 to evaluate $E\{N(H)\}$, where the fact that $E\left(e^{-sX_1}\right) = \theta^{-1}\int_{0}^{\infty} e^{-su}e^{-u/\theta}\, du = \{1 + \theta s\}^{-1}$ can be taken into account.

Note that naïvely one can estimate $n_C^*$ using the estimator $\hat{\sigma}/C^{1/2}$. This is very problematic, since in order to obtain the estimator $\hat{\sigma}$ of $\sigma$ we need a sample of the size we are trying to approximate.
The average $\bar{\xi}_n = n^{-1}S_n$, $S_n = \sum_{i=1}^{n} \xi_i$, can be considered as a simple estimator of $\theta$. For all $\varepsilon > 0$, Chebyshev's inequality states
$$\Pr\left(\left|\bar{\xi}_n - \theta\right| > \varepsilon\right) = \Pr\left(\left|\bar{\xi}_n - \theta\right|^k > \varepsilon^k\right) \le \frac{E\left|\bar{\xi}_n - \theta\right|^k}{\varepsilon^k}, \quad k > 0.$$
$$E\left\{(S_n - n\theta)^2\right\} = E\left[\left\{\sum_{i=1}^{n}(\xi_i - \theta)\right\}^2\right] = E\left\{\sum_{i=1}^{n}\sum_{j=1}^{n}(\xi_i - \theta)(\xi_j - \theta)\right\}$$
$$= E\left\{\sum_{i=1}^{n}(\xi_i - \theta)^2\right\} + E\left\{\sum_{i=1}^{n}\sum_{j=1, j\ne i}^{n}(\xi_i - \theta)(\xi_j - \theta)\right\}$$
$$= \sum_{i=1}^{n} E(\xi_i - \theta)^2 + \sum_{i=1}^{n}\sum_{j=1, j\ne i}^{n} E(\xi_i - \theta)(\xi_j - \theta) = nE(\xi_1 - \theta)^2.$$
This yields that $\Pr\left(\left|\bar{\xi}_n - \theta\right| > \varepsilon\right) \le E(\xi_1 - \theta)^2/\left(n\varepsilon^2\right) \to 0$ as $n \to \infty$, proving the weak law of large numbers.
$$\varphi_{\bar{\xi}}(t) = E\left(e^{itS_n/n}\right) = \left\{E\left(e^{it\xi_1/n}\right)\right\}^n = \exp\left(n\ln\left(E\left(e^{it\xi_1/n}\right)\right)\right).$$
Define the function $l(t) = \ln\left(E\left(e^{it\xi_1}\right)\right)$, rewriting
$$\varphi_{\bar{\xi}}(t) = \exp\left(n\, l(t/n)\right) = \exp\left(t\left(\frac{l(t/n) - l(0)}{t/n}\right)\right),$$
where $l(0) = 0$. The derivative $dl(t)/dt$ is $\left\{E\left(e^{it\xi_1}\right)\right\}^{-1} E\left(i\xi_1 e^{it\xi_1}\right)$, and then $dl(u)/du\big|_{u=0} = iE(\xi) = i\theta$. By the definition of the derivative…
$$\varphi_{S_n}(t) = \prod_{i=1}^{n} E\left(e^{it\xi_i}\right) = \prod_{i=1}^{n} \exp\left(itE(\xi_i) - \operatorname{var}(\xi_i)t^2/2\right) = \exp\left(it\sum_{i=1}^{n} E(\xi_i) - \frac{t^2}{2}\sum_{i=1}^{n}\operatorname{var}(\xi_i)\right).$$
$$f(u) = \begin{cases}\gamma^{\lambda} u^{\lambda - 1} e^{-\gamma u}\Big/\displaystyle\int_{0}^{\infty} x^{\lambda - 1}e^{-x}\, dx, & u \ge 0,\\[4pt] 0, & u < 0.\end{cases}$$
…of freedom, the random variable $\zeta_n = \sum_{i=1}^{n} \xi_i^2$ has a $\chi_n^2$ distribution if the iid random variables $\xi_1, \ldots, \xi_n$ are $N(0, 1)$ distributed. Note that
$$\Pr(\zeta_1 < u) = \Pr\left(\left|\xi_1\right| < \sqrt{u}\right) = \frac{2}{\sqrt{2\pi}}\int_{0}^{\sqrt{u}} e^{-x^2/2}\, dx.$$
Let $\xi_1, \ldots, \xi_n$ be iid random variables with $a = E\xi_1$ and $0 < \sigma^2 = \operatorname{var}(\xi_1) < \infty$. Define the random variables $S_n = \sum_{i=1}^{n} \xi_i$ and $\zeta_n = \left(\sigma^2 n\right)^{-1/2}(S_n - an)$. The central limit theorem states that $\zeta_n$ has asymptotically (as $n \to \infty$) the $N(0, 1)$ distribution. We will prove this theorem using characteristic functions. Toward this end, we begin with the representation of $\zeta_n$ in the form $\zeta_n = n^{-1/2}\left\{\sum_{i=1}^{n}(\xi_i - a)/\sigma\right\}$, where $E(\xi_1 - a)/\sigma = 0$ and $\operatorname{var}\left\{(\xi_1 - a)/\sigma\right\} = 1$, which allows us to assume $a = 0$, $\sigma = 1$ without loss of generality. In this case, we should show that the characteristic function of $\zeta_n$, say $\varphi_{\zeta_n}(t)$, converges to $e^{-t^2/2}$. It is clear that $\varphi_{\zeta_n}(t) = E\left(e^{it\sum_{i=1}^{n}\xi_i/\sqrt{n}}\right) = \prod_{i=1}^{n} E\left(e^{it\xi_i/\sqrt{n}}\right) = \left\{\varphi_\xi\left(t/\sqrt{n}\right)\right\}^n$, that is, $\ln\left(\varphi_{\zeta_n}(t)\right) = n\ln\left(\varphi_\xi\left(t/\sqrt{n}\right)\right)$. When $n \to \infty$, $t/\sqrt{n} \to 0$, which leads us to apply the Taylor theorem to the following objects: (1) $\varphi_\xi\left(t/\sqrt{n}\right)$, considering $t/\sqrt{n}$ around 0, and (2) the logarithm function, considering its argument around 1. That is,
$$\ln\left(\varphi_{\zeta_n}(t)\right) = n\ln\left(1 - \frac{t^2}{2n} + o\left(\frac{t^2}{n}\right)\right) = n\left\{-\frac{t^2}{2n} + o\left(\frac{t^2}{n}\right)\right\} \to -\frac{t^2}{2}, \quad n \to \infty.$$
Consider the problem regarding calculation of the integral J = ∫ₐᵇ g(u) du, where g is a known function and the bounds a, b are not necessarily finite. One can rewrite the integral in the form
$$J = \int_{-\infty}^{\infty}\left[\left\{\frac{g(u)}{f(u)}\right\} I\{u \in [a,b]\}\right] f(u)\, du,$$
using a density function f. Then we have
$$J = E\left[\left\{\frac{g(\xi)}{f(\xi)}\right\} I\{\xi \in [a,b]\}\right],$$
where ξ is a random variable distributed according to the density function f. In this case, we can approximate J using
$$J_n = \frac{1}{n}\sum_{i=1}^{n}\left[\left\{\frac{g(\xi_i)}{f(\xi_i)}\right\} I\{\xi_i \in [a,b]\}\right],$$
where ξ₁,...,ξₙ are independent and simulated from f. The law of large numbers provides Jₙ → J as n → ∞.
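As an illustration, the following minimal R sketch approximates J = ∫₀¹ exp(−u²) du using a standard normal density f; the choices of g, f, and n here are ours, made only for the example.

# Monte Carlo approximation J_n of J = integral of g over [a, b],
# sampling xi_1, ..., xi_n from the density f (standard normal here).
set.seed(1)
g <- function(u) exp(-u^2)
a <- 0; b <- 1; n <- 1e5
xi <- rnorm(n)
Jn <- mean(g(xi)/dnorm(xi) * (xi >= a & xi <= b))
Jn                        # Monte Carlo estimate
integrate(g, a, b)$value  # deterministic benchmark: about 0.7468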
The central limit theorem can be applied to evaluate how accurate this approximation is, as well as to define a value of the needed sample size n. That is, we have asymptotically
$$\sqrt{n}\,(J_n - J)\left[E\left(\left\{\frac{g(\xi_1)}{f(\xi_1)}\right\} I\{\xi_1 \in [a,b]\} - J\right)^2\right]^{-1/2} \sim N(0,1)$$
and then, for example,
$$\Pr\left\{|J_n - J| \ge \delta\right\} = 1 - \Pr\{J_n - J < \delta\} + \Pr\{J_n - J < -\delta\} \approx 1 - \Phi\left(\delta\sqrt{n}/\Delta\right) + \Phi\left(-\delta\sqrt{n}/\Delta\right),$$
where Φ(u) = ∫_{−∞}^{u} e^{−x²/2} dx/√(2π) and Δ is an estimator of
$$\left[E\left(\left\{\frac{g(\xi_1)}{f(\xi_1)}\right\} I\{\xi_1 \in [a,b]\} - J\right)^2\right]^{1/2},$$
obtained, e.g., by generating, for a relatively
large integer M, iid random variables η1 ,..., ηM with the density function f ,
and defining
$$\Delta^2 = \frac{1}{M}\sum_{i=1}^{M}\left[\left\{\frac{g(\eta_i)}{f(\eta_i)}\right\} I\{\eta_i \in [a,b]\} - \frac{1}{M}\sum_{j=1}^{M}\left\{\frac{g(\eta_j)}{f(\eta_j)}\right\} I\{\eta_j \in [a,b]\}\right]^2$$
$$= \frac{1}{M}\sum_{i=1}^{M}\left[\left\{\frac{g(\eta_i)}{f(\eta_i)}\right\} I\{\eta_i \in [a,b]\}\right]^2 - \left[\frac{1}{M}\sum_{j=1}^{M}\left\{\frac{g(\eta_j)}{f(\eta_j)}\right\} I\{\eta_j \in [a,b]\}\right]^2.$$
Thus, denoting presumed values of the errors δ₁ and δ₂, we can derive the sample size n that satisfies
$$\Pr\left\{|J_n - J| \ge \delta_1\right\} \approx 1 - \Phi\left(\delta_1\sqrt{n}/\Delta\right) + \Phi\left(-\delta_1\sqrt{n}/\Delta\right) \le \delta_2.$$
Note that, in order to generate random samples from any density function f, we can use the following lemma: if ξ has a continuous distribution function F, then η = F(ξ) is Uniform[0,1] distributed; equivalently, F^{−1}(U) with U ~ Uniform[0,1] has the distribution function F. Indeed,
$$\Pr(\eta \le u) = \Pr\{0 \le F(\xi) \le u\} = \Pr\left\{\xi \le F^{-1}(u)\right\} I(0 \le u \le 1) + I(u > 1)$$
$$= F\left(F^{-1}(u)\right) I(0 \le u \le 1) + I(u > 1) = u\, I(0 \le u \le 1) + I(u > 1),$$
which is the Uniform[0,1] distribution function.
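For instance, the following brief sketch applies this inverse transform scheme to the exponential distribution, where F^{−1}(u) = −log(1 − u)/λ has a closed form; the value of λ and the sample size are illustrative.

# Inverse transform sampling: if U ~ Uniform[0,1], then F^{-1}(U) has
# distribution function F. Exponential example with rate lambda.
set.seed(1)
lambda <- 2
U <- runif(1e5)
X <- -log(1 - U)/lambda    # F^{-1}(u) = -log(1 - u)/lambda
c(mean(X), 1/lambda)       # sample mean vs. theoretical mean
ks.test(X, "pexp", lambda)$statistic  # simple distributional check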
In choosing f, it is reasonable to select a density whose support contains [a,b]. For example, to approximate J = ∫_{−1}^{1} g(u) du, we might select f associated with the Uniform[−1,1] distribution; to approximate J = ∫_{−∞}^{∞} g(u) du, we might select f associated with a normal distribution.
As an alternative to the approach shown above, we can employ deterministic numerical quadrature (e.g., Newton–Cotes type formulas) for calculating integrals numerically. In several scenarios, numerical quadrature can outperform the Monte Carlo strategy. However, it is known that in various multidimensional cases, e.g., related to ∫∫∫ g(u₁,u₂,u₃) du₁du₂du₃, the Monte Carlo approach is preferable.
Example
Assume T (X 1 ,.., X n ) denotes a statistic based on n iid observations
X 1 ,.., X n with a known distribution function. In various statistical inves-
tigations it is required to evaluate the probability p = Pr {T (X 1 ,..., X n ) > H}
for a fixed threshold, H . This problem statement is common in the con-
text of testing statistical hypotheses when examining the Type I error
rate and the power of a statistical decision-making procedure (e.g.,
Vexler et al., 2016a). In order to employ the Monte Carlo approach, one
can generate X 1 ,.., X n from the known distribution, calculating
ν1 = I {T (X 1 ,..., X n ) > H}. This procedure can be repeated M times to in
turn generate the iid variables ν1 ,..., ν M. Then the probability p can be
estimated as $\hat{p} = \sum_{i=1}^{M} \nu_i / M$. The central limit theorem concludes that for large values of M we can expect
$$\Pr\left\{p \in \left[\hat{p} - 1.96\{\hat{p}(1-\hat{p})/M\}^{0.5},\ \hat{p} + 1.96\{\hat{p}(1-\hat{p})/M\}^{0.5}\right]\right\} \approx 0.95,$$
where it is applied that var(ν₁) = p(1 − p) and the standard normal distribution function satisfies Φ(1.96) = ∫_{−∞}^{1.96} e^{−x²/2} dx/√(2π) ≈ 0.975. Thus, for example, when p is anticipated to be around 0.05 (e.g., in terms of the Type I error rate analysis), it is reasonable to require 1.96(0.05(1 − 0.05)/M)^{0.5} < 0.005, i.e., M > 7299. The following R code exemplifies such a Monte Carlo evaluation for the one-sample t-test:
M <- 10000   # reconstructed setup lines; M and n are illustrative values
n <- 25
TestStatistic <- Decision <- numeric(M)
for (i in 1:M) {
  X <- rnorm(n)                                          # data under H0
  TestStatistic[i] <- t.test(X, alternative = "two.sided")$statistic
  Decision[i] <- 1*(TestStatistic[i] > qt(0.95, n - 1))  # t-statistic has n - 1 df
}
mean(Decision)   # Monte Carlo estimate of the Type I error rate
In various applications, the observed data may consist only of functions based on X₁,...,Xₙ, e.g., in the forms Σᵢ₌₁^m Xᵢ, Xᵢ + εᵢ, or max(Xᵢ). Consider, for example, the following practical problem.
cal research for the last several decades has been the relationship between
biological markers (biomarkers) and disease risk. Commonly, a measure-
ment process yields operating characteristics of biomarkers. However, the
high cost associated with evaluating these biomarkers as well as measure-
ment errors corresponding to a measurement process can prohibit further
epidemiological applications (e.g., Armstrong, 2015: Chapter 31). When, for
example, analysis is restricted by the high cost of assays, a pooling strategy can be followed (see Figure 2.2): several individual specimens are physically combined to create a single mixed sample, so that cost is reduced via reduction of the number of measured assays. Each pooled sample test result is assumed to be the average of the individual unpooled samples; this is most often the case because many tests are expressed per unit of volume.

FIGURE 2.2 Pooling: physically combining several individual specimens to create a single mixed sample.
Thus, assuming that iid unobserved variables X₁, X₂,... represent measurements of a biomarker of interest, one can consider scenarios in which the distribution function of X₁ should be estimated based on the observations
$$Z_j = p^{-1}\sum_{i=p(j-1)+1}^{pj} X_i \ \text{(pooling)} \quad \text{or} \quad W_j = X_j + \varepsilon_j \ \text{(measurement error)},$$
where ε₁, ε₂,... are iid random variables with a known distribution function; see for details, e.g., Vexler et al. (2008b, 2010a). In these cases the following idea can be used. The characteristic functions of Z₁ and W₁ satisfy φ_Z(t) = {φ_X(t/p)}^p and φ_W(t) = φ_X(t)φ_ε(t), respectively, so that an estimator φ̂_X of the characteristic function of X₁ can be derived from the observed data, and the density of X₁ can be estimated via the inversion-type formula
$$\hat{f}(x) = \frac{1}{2\pi}\int_{-T}^{T} e^{-itx}\, \hat{\varphi}_X(t)\, dt, \quad \text{for a fixed } T > 0.$$
In practice, values of T can be chosen to be 0.6, 0.7, 1.2, 1.3, 1.66, 1.7, etc.
Note that, in several situations, we observe Z_j = max_{p(j−1)+1 ≤ i ≤ pj}(Xᵢ), and related characteristic function–based methods can be considered. Characteristic functions also describe important families of distributions whose density functions have no closed form, e.g., the stable laws with
$$\varphi(t; \alpha, \beta, \gamma, a) = \exp\left\{iat - \gamma|t|^\alpha\left(1 + i\beta\,\mathrm{sign}(t)\,\omega(|t|,\alpha)\right)\right\}, \quad t \in (-\infty, \infty),$$
$$\omega(|t|,\alpha) = \begin{cases} \tan(\pi\alpha/2), & \text{if } \alpha \ne 1, \\ 2\ln|t|/\pi, & \text{if } \alpha = 1. \end{cases}$$
The following R function computes this characteristic function:
# Characteristic function of the stable law phi(t; alpha, beta, gamma, a)
fi <- function(t, alpha, beta, gamma, a){
  if (alpha == 1) w <- 2*log(abs(t))/pi else w <- tan(pi*alpha/2)
  exp(1i*a*t - (gamma*abs(t)^alpha)*(1 + 1i*beta*sign(t)*w))
}
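A hypothetical usage sketch follows: the stable density can be recovered by numerically inverting fi() over a truncated range [−T, T]; the function name stable.density, the truncation T, and the grid size m are our illustrative assumptions, not quantities from the text.

# Numerical inversion of fi() on a Riemann grid (an even grid size
# avoids t = 0 exactly, which matters for the alpha = 1 branch).
stable.density <- function(x, alpha, beta, gamma, a, T = 40, m = 2^14){
  t <- seq(-T, T, length.out = m)
  Re(mean(exp(-1i*t*x)*fi(t, alpha, beta, gamma, a)))*(2*T)/(2*pi)
}
# alpha = 2 corresponds to N(a, 2*gamma): the density at 0 is 1/sqrt(4*pi)
stable.density(0, alpha = 2, beta = 0, gamma = 1, a = 0)
1/sqrt(4*pi)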
3.1 Introduction
When the forms of data distributions are assumed to be known, the likelihood
principle plays a prominent role in statistical methods research for develop-
ing powerful statistical inference tools. The likelihood method, or simply the
likelihood, is arguably the most important concept for inference in parametric
modeling when the underlying data are subject to various assumed stochas-
tic mechanisms and restrictions related to biomedical and epidemiological
studies.
Without loss of generality, consider a situation in which n data points
X 1 ,..., X n are distributed according to the joint density function f ( x1 ,..., xn ; θ)
with the parameter θ, where x1 ,..., xn are arguments of f . Various useful
modern statistical concepts are focused on the likelihood function of θ,
L(θ) = f (X 1 ,..., X n ; θ)c, where c can depend on X 1 ,..., X n but not on θ (usu-
ally c = 1). If X 1 ,..., X n are discrete then for each θ, L(θ) is defined as the
probability of observing a realization of X 1 ,..., X n. The likelihood function
measures the relative plausibility of the range of values for θ given obser-
vations X 1 ,..., X n, i.e., the likelihood informs us as to which θ most likely
generated the observed data provided that model assumptions are correctly
stated.
Beginning from the mid-1700s the problem of determining the most
probable position for the object of observations, including determining the
most likely parameter from a distribution that generated the data, was
extensively treated in terms of mathematical descriptions by such historical
figures as Thomas Simpson, Thomas Bayes, Johann Heinrich Lambert, and
Joseph Louis Lagrange. Statistical literature has credited past figures with
their contributions to likelihood theory, in particular Thomas Bayes in terms
of the well-known Bayes theorem. However, it was not until Ronald Fisher
(1922) revisited the likelihood principle via groundbreaking mathematical
arguments that the approach exploded onto the statistical research land-
scape. Fisher suggested comparing various competing values of θ relative to their likelihood ratios, which in turn provides a measure of the relative plausibility of these values given the observed data.
For example, letting θ₀ denote the true value of the parameter and θ₁ a competing value, Jensen's inequality yields
$$E_{\theta_0}\left\{\log\left(L(\theta_1)/L(\theta_0)\right)\right\} = E_{\theta_0}\left\{\log\left(f(X_1,\ldots,X_n;\theta_1)/f(X_1,\ldots,X_n;\theta_0)\right)\right\}$$
$$\le \log\left(E_{\theta_0}\left[f(X_1,\ldots,X_n;\theta_1)/f(X_1,\ldots,X_n;\theta_0)\right]\right)$$
$$= \log\int\ldots\int \frac{f(x_1,\ldots,x_n;\theta_1)}{f(x_1,\ldots,x_n;\theta_0)}\, f(x_1,\ldots,x_n;\theta_0)\, dx_1\ldots dx_n = \log(1) = 0.$$
Similarly,
$$E_{\theta_1}\left\{\log\left(L(\theta_1)/L(\theta_0)\right)\right\} = -E_{\theta_1}\left\{\log\left(f(X_1,\ldots,X_n;\theta_0)/f(X_1,\ldots,X_n;\theta_1)\right)\right\}$$
$$\ge -\log\left(E_{\theta_1}\left[f(X_1,\ldots,X_n;\theta_0)/f(X_1,\ldots,X_n;\theta_1)\right]\right)$$
$$= -\log\int\ldots\int \frac{f(x_1,\ldots,x_n;\theta_0)}{f(x_1,\ldots,x_n;\theta_1)}\, f(x_1,\ldots,x_n;\theta_1)\, dx_1\ldots dx_n = -\log(1) = 0.$$
Moreover,
$$E_{\theta_0}\left\{L(\theta_1)/L(\theta_0)\right\} = \int\ldots\int \frac{f(x_1,\ldots,x_n;\theta_1)}{f(x_1,\ldots,x_n;\theta_0)}\, f(x_1,\ldots,x_n;\theta_0)\, dx_1\ldots dx_n = 1,$$
whereas
$$E_{\theta_1}\left\{L(\theta_1)/L(\theta_0)\right\} = E_{\theta_1}\left\{\frac{1}{L(\theta_0)/L(\theta_1)}\right\} \ge \frac{1}{E_{\theta_1}\left\{L(\theta_0)/L(\theta_1)\right\}} = 1,$$
by Jensen's inequality applied to the convex function 1/u.
Note that
$$L(\theta) = \prod_{i=1}^{n} f(X_i;\theta)$$
when the n data points X₁,...,Xₙ are iid and X₁ is distributed according to the density function f(x₁;θ) with the parameter θ. Note that in the general case the function L(θ) can also be presented in the product-type chain rule form
$$L(\theta) = f(X_1,\ldots,X_n;\theta) = \prod_{i=1}^{n} f(X_i | X_1,\ldots,X_{i-1};\theta).$$
Further,
$$\Pr_{\theta_0}\left\{L(\theta_0) \le L(\theta_1)\right\} = \Pr_{\theta_0}\left\{\frac{1}{n}\sum_{i=1}^{n}\left(\log f(X_i;\theta_1) - \log f(X_i;\theta_0)\right) \ge 0\right\}$$
$$= \Pr_{\theta_0}\left\{\frac{1}{n}\sum_{i=1}^{n}\left(\left(\log f(X_i;\theta_1) - a_1\right) - \left(\log f(X_i;\theta_0) - a_0\right)\right) \ge a_0 - a_1\right\},$$
where, for k = 0, 1, log f(Xᵢ;θ_k), i = 1,...,n, are iid random variables with E_{θ0}{log f(X₁;θ_k)} = a_k. Since a₀ ≥ a₁, as shown above, it is clear that the analysis mentioned in Section 2.4.3, where Chebyshev's inequality is applied, leads to Pr_{θ0}{L(θ₀) ≤ L(θ₁)} → 0 as n → ∞.
Thus, it is "most likely," "on average," and in probability that the true value θ₀ of the parameter θ satisfies L(θ₀) ≈ max_θ L(θ).
Chapter 2 introduces tools that can be applied to show the asymptotic normality of likelihood estimators via evaluating sums of iid random variables. The log maximum likelihood is
$$l(\hat{\theta}_n) = \log\left(\prod_{i=1}^{n} f(X_i;\hat{\theta}_n)\right) = \sum_{i=1}^{n}\log f(X_i;\hat{\theta}_n).$$
Among the standard regularity conditions used below are:
(iii) θ̂ₙ →_p θ as n → ∞, where θ represents a true value of the parameter.
To state that the random variable ∂log f(X₁;θ)/∂θ has finite positive variance, we denote the Fisher information I(θ) = E_θ{∂log f(X₁;θ)/∂θ}² and impose the condition
(vi) 0 < I(θ) = E_θ{∂log f(X₁;θ)/∂θ}² = −E_θ{∂²log f(X₁;θ)/∂θ²} < ∞.
Theorem 3.3.1
Assume that X₁,...,Xₙ are iid and satisfy the conditions (i–vi) mentioned above. Then the distribution function of √n(θ̂ₙ − θ) satisfies
$$\Pr_\theta\left(\sqrt{n}\,(\hat{\theta}_n - \theta) < u\right) \to \Pr(\xi < u) \quad \text{as } n \to \infty,$$
where ξ is a N(0, 1/I(θ)) distributed random variable.
Proof. Consider the Taylor expansion
$$l'(\hat{\theta}_n) = l'(\theta) + \left(\hat{\theta}_n - \theta\right) l''(\theta) + 0.5\left(\hat{\theta}_n - \theta\right)^2 l'''(\tilde{\theta}_n),$$
with l″(θ) = ∂²l(u)/∂u²|_{u=θ}, l‴(θ̃ₙ) = ∂³l(u)/∂u³|_{u=θ̃ₙ}, and θ̃ₙ that lies between θ and θ̂ₙ. Since l′(θ̂ₙ) = 0 at the maximizer θ̂ₙ,
$$0 = l'(\theta) + \left(\hat{\theta}_n - \theta\right)\left\{l''(\theta) + 0.5\left(\hat{\theta}_n - \theta\right) l'''(\tilde{\theta}_n)\right\},$$
resulting in
$$\sqrt{n}\left(\hat{\theta}_n - \theta\right) = \frac{l'(\theta)/\sqrt{n}}{-l''(\theta)/n - 0.5\left(\hat{\theta}_n - \theta\right) l'''(\tilde{\theta}_n)/n}. \tag{3.1}$$
Consider the remainder term 0.5(θ̂ₙ − θ)l‴(θ̃ₙ)/n. Condition (v) implies
$$\left|0.5\left(\hat{\theta}_n - \theta\right) l'''(\tilde{\theta}_n)/n\right| \le 0.5\left|\hat{\theta}_n - \theta\right| \sum_{i=1}^{n} M(X_i)/n,$$
where M(X₁),...,M(Xₙ) are iid random variables, and then, using the results presented in Section 2.4.3 and condition (iii), we obtain that 0.5(θ̂ₙ − θ)l‴(θ̃ₙ)/n = 0.5(θ̂ₙ − θ)O_p(1) = o_p(1) as n → ∞, since Σᵢ₌₁ⁿ M(Xᵢ)/n converges in probability to E_θ{M(X₁)} < ∞. Further, −l″(θ)/n is an average of iid random variables with mean I(θ), by condition (vi), where one uses
$$E_\theta\left[\frac{\partial^2 f(X_1;\theta)/\partial\theta^2}{f(X_1;\theta)}\right] = \int\left\{\partial^2 f(u;\theta)/\partial\theta^2\right\} du = \frac{\partial^2}{\partial\theta^2}\int f(u;\theta)\, du = \frac{\partial^2}{\partial\theta^2} 1 = 0.$$
Formally speaking, the regularity conditions (i–vi) ensure that we can move the operator ∂²/∂θ² outside the integration as was done above.
Thus Equation (3.1) can be rewritten as
$$\sqrt{n}\left(\hat{\theta}_n - \theta\right) = \frac{l'(\theta)/\sqrt{n}}{I(\theta) + o_p(1)},$$
where l′(θ) is a sum of the iid random variables ∂log f(Xᵢ;θ)/∂θ, i = 1,...,n, with
$$E_\theta\left\{\partial\log f(X_1;\theta)/\partial\theta\right\} = \int f(u;\theta)\,\frac{\partial\log f(u;\theta)}{\partial\theta}\, du = \frac{\partial}{\partial\theta}\int f(u;\theta)\, du = \frac{\partial}{\partial\theta} 1 = 0$$
and variance I(θ). Then, by virtue of the central limit theorem (Section 2.4.5), we complete the proof.
The maximum likelihood estimator θ̂ₙ is a point estimator. One of the statistical applications of Theorem 3.3.1 is related to the construction of large sample confidence interval estimators. Toward this end, we consider a situation in which we are interested in obtaining an interval, say [a,b], based on X₁,...,Xₙ with coverage probability given as Pr_θ{θ ∈ [a(X₁,...,Xₙ), b(X₁,...,Xₙ)]} = 1 − α for a prespecified level α. In this case, for large n, we have
$$\Pr_\theta\left\{\theta \in \left[\hat{\theta}_n - u_{1-\alpha/2}\{I(\theta)n\}^{-1/2},\ \hat{\theta}_n + u_{1-\alpha/2}\{I(\theta)n\}^{-1/2}\right]\right\} \approx \Phi(u_{1-\alpha/2}) - \Phi(-u_{1-\alpha/2}) = 1 - \alpha,$$
where Φ(−u_{1−α/2}) = α/2 and Φ(u) = ∫_{−∞}^{u} e^{−x²/2} dx/√(2π) denotes the distribution function of a standard normal random variable. For example, u_{1−α/2} ≈ 1.96 if α = 0.05. Then, calculating a point estimator of I(θ) as Î(θ̂ₙ), we can obtain the equal tailed (1 − α)100% confidence interval estimator given as
$$\left[a = \hat{\theta}_n - u_{1-\alpha/2}\{\hat{I}(\hat{\theta}_n)n\}^{-1/2},\ b = \hat{\theta}_n + u_{1-\alpha/2}\{\hat{I}(\hat{\theta}_n)n\}^{-1/2}\right].$$
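A small illustrative sketch of this interval, assuming exponential data with rate λ, for which I(λ) = λ^{−2} and λ̂ₙ = 1/X̄ₙ; the data, sample size, and level are our choices for the example.

# Large-sample (1 - alpha) Wald-type confidence interval for an
# exponential rate: I(lambda) = lambda^{-2}, so the standard error of
# the MLE is lambda-hat/sqrt(n).
set.seed(1)
n <- 200; lambda <- 1.5
X <- rexp(n, rate = lambda)
lambda.hat <- 1/mean(X)
se <- lambda.hat/sqrt(n)    # {I(lambda.hat) * n}^{-1/2}
alpha <- 0.05
lambda.hat + c(-1, 1)*qnorm(1 - alpha/2)*se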
Theorem 3.3.1 can be used to show that in various scenarios the maximum
likelihood method provides asymptotically normal estimators with mini-
mum variances (e.g., Lehmann and Casella, 1998).
Warning: The statement mentioned above is well addressed in textbooks related to elementary courses in statistics. In general, however, maximum likelihood estimators can be outperformed by related superefficient estimation procedures; this can occur only on special sets of values of θ with Lebesgue measure zero (Le Cam, 1953). Consider the following example.
Let n iid observations X₁,...,Xₙ be N(θ,1) distributed. In this case, the maximum likelihood estimator θ̂ₙ = Σᵢ₌₁ⁿ Xᵢ/n of θ satisfies √n(θ̂ₙ − θ) ~ N(0,1). Define an estimator of θ in the form
$$v_n = \hat{\theta}_n\, I\left\{|\hat{\theta}_n| \ge n^{-1/4}\right\} + a\hat{\theta}_n\, I\left\{|\hat{\theta}_n| < n^{-1/4}\right\}$$
with 0 < a < 1. When θ ≠ 0, we have that the distribution function Pr_θ{√n(vₙ − θ) < u} satisfies
$$\Pr_\theta\left\{\sqrt{n}(v_n - \theta) < u\right\} \le \Pr_\theta\left\{\sqrt{n}(\hat{\theta}_n - \theta) < u\right\} + \Pr_\theta\left\{|\hat{\theta}_n| < n^{-1/4}\right\}$$
as well as
$$\Pr_\theta\left\{\sqrt{n}(v_n - \theta) < u\right\} \ge \Pr_\theta\left\{\sqrt{n}(\hat{\theta}_n - \theta) < u\right\} - \Pr_\theta\left\{|\hat{\theta}_n| < n^{-1/4}\right\}.$$
Since
$$\Pr_\theta\left\{|\hat{\theta}_n| < n^{-1/4}\right\} = \Pr_\theta\left\{-n^{-1/4} < \sum_{i=1}^{n} X_i/n < n^{-1/4}\right\}$$
$$= \Pr_\theta\left\{\sum_{i=1}^{n}(X_i - \theta)/n < n^{-1/4} - \theta\right\} - \Pr_\theta\left\{\sum_{i=1}^{n}(X_i - \theta)/n < -n^{-1/4} - \theta\right\}$$
$$= \Pr_\theta\left\{\sum_{i=1}^{n}(\theta - X_i) > \theta n - n^{3/4}\right\} - \Pr_\theta\left\{\sum_{i=1}^{n}(\theta - X_i) > n^{3/4} + \theta n\right\}$$
and, for large values of n and fixed θ > 0, θn − n^{3/4} > 0 (the case θ < 0 is treated symmetrically), applying Chebyshev's inequality in a similar manner to that of Section 2.4.3, we obtain Pr_θ{|θ̂ₙ| < n^{−1/4}} → 0 as n → ∞. Then asymptotically √n(vₙ − θ) ~ N(0,1), if θ ≠ 0.
When θ = 0, we have
$$\Pr_{\theta=0}\left\{\sqrt{n}(v_n - \theta) < u\right\} \le \Pr_0\left\{|\hat{\theta}_n| \ge n^{-1/4}\right\} + \Pr_0\left\{\sqrt{n}\,(a\hat{\theta}_n) < u\right\}$$
and
$$\Pr_{\theta=0}\left\{\sqrt{n}(v_n - \theta) < u\right\} \ge -\Pr_0\left\{|\hat{\theta}_n| \ge n^{-1/4}\right\} + \Pr_0\left\{\sqrt{n}\,(a\hat{\theta}_n) < u\right\},$$
where
$$\Pr_0\left\{|\hat{\theta}_n| \ge n^{-1/4}\right\} = \Pr_0\left\{\left|\sum_{i=1}^{n} X_i\right| \ge n^{3/4}\right\} \to 0 \ \text{ as } n \to \infty.$$
Thus √n(vₙ − θ) ~ N(0, a²) if θ = 0, and √n(vₙ − θ) ~ N(0,1) if θ ≠ 0, as n → ∞ and 0 < a < 1. This states that, in this case, the estimator vₙ is asymptotically more efficient than the maximum likelihood estimator (var(vₙ) ≤ var(θ̂ₙ) for large n), at least when θ = 0.
Remark 1. In general cases related to different data distributions we need to derive maximum likelihood estimators numerically. For example, consider n iid observations X₁,...,Xₙ from the Cauchy distribution with density function f(u;θ) = [π{1 + (u − θ)²}]^{−1}. The maximum likelihood estimator θ̂ₙ of the shift parameter θ is a root of the equation l′(θ̂ₙ) = 0, i.e.,
$$\sum_{i=1}^{n}\frac{X_i - \hat{\theta}_n}{1 + \left(X_i - \hat{\theta}_n\right)^2} = 0,$$
that can be solved numerically, e.g., using the Newton–Raphson method (see, e.g., the R procedure uniroot, R Development Core Team, 2014). In this Cauchy framework, the regularity conditions are satisfied and the Fisher information is I(θ) = 1/2. Thus, following Theorem 3.3.1, we have √n(θ̂ₙ − θ) ~ N(0,2) as n → ∞.
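A short sketch of this numerical step for a simulated Cauchy sample follows; the score function below drops the constant factor 2 (which does not change the root), and the root-bracketing interval around the sample median is an assumption of the example.

# Numerical maximum likelihood for the Cauchy shift parameter theta,
# solving l'(theta) = 0 with uniroot(), as in Remark 1.
set.seed(1)
n <- 100; theta0 <- 3
X <- rcauchy(n, location = theta0)
score <- function(theta) sum((X - theta)/(1 + (X - theta)^2))
# a sign change of the score inside median(X) +/- 2 is assumed here
theta.hat <- uniroot(score, interval = median(X) + c(-2, 2))$root
c(theta.hat, median(X))   # MLE vs. the median as a crude comparison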
Remark 2. In cases of models with several unknown parameters, a multidi-
mensional version of Theorem 3.3.1 can be obtained under appropriate exten-
sions of the regularity conditions in higher dimensions (e.g., Serfling, 2002).
Remark 3. Properties of the maximum likelihood estimation where stan-
dard regularity conditions are violated are considered in Borovkov (1998).
Consider testing a simple null hypothesis H₀: the data follow f₀, versus a simple alternative H₁: the data follow f₁, based on the likelihood ratio LRₙ = f₁(X₁,...,Xₙ)/f₀(X₁,...,Xₙ). In this case the likelihood ratio test–based decision rule is to reject H₀ if and only if LRₙ ≥ B, where B is a prespecified test threshold that does not depend on the observations. This test is uniformly most powerful. In order to clarify this fact, we note that the term "most powerful" induces us to formally define how to compare statistical tests. According to the Neyman–Pearson concept of testing statistical hypotheses, since the ability to control the Type I error rate of a statistical test has an essential role in statistical decision-making, we compare tests with equivalent probabilities of the Type I error Pr_{H0}{test rejects H₀} = α, where the subscript H₀ indicates that we consider the probability given that the null hypothesis is true. The level of significance α is fixed in advance. Now, let δ = δ(X₁,...,Xₙ) ∈ [0,1] denote any decision rule with E_{H0}(δ) = α, where the event {δ = 1} rejects H₀. Note that, for all A, B and ν ∈ [0,1],
(A − B){I (A ≥ B) − ν} ≥ 0, (3.2)
Setting A = LRₙ = f₁(X₁,...,Xₙ)/f₀(X₁,...,Xₙ) and ν = δ(X₁,...,Xₙ) in (3.2) and taking the expectation under H₀ yields
$$E_{H_0}\left\{\frac{f_1(X_1,\ldots,X_n)}{f_0(X_1,\ldots,X_n)}\, I(LR_n \ge B)\right\} - B\, E_{H_0}\left\{I(LR_n \ge B)\right\} \tag{3.3}$$
$$\ge E_{H_0}\left\{\frac{f_1(X_1,\ldots,X_n)}{f_0(X_1,\ldots,X_n)}\,\delta(X_1,\ldots,X_n)\right\} - B\, E_{H_0}\left\{\delta(X_1,\ldots,X_n)\right\},$$
where E_{H0}(δ) = E_{H0}{I(δ = 1)} = Pr_{H0}{δ = 1} = Pr_{H0}{δ rejects H₀}. Since we compare the tests with the fixed level of significance, E_{H0}{I(LRₙ ≥ B)} = E_{H0}(δ) = α, and hence
$$E_{H_0}\left\{\frac{f_1(X_1,\ldots,X_n)}{f_0(X_1,\ldots,X_n)}\, I(LR_n \ge B)\right\} \ge E_{H_0}\left\{\frac{f_1(X_1,\ldots,X_n)}{f_0(X_1,\ldots,X_n)}\,\delta(X_1,\ldots,X_n)\right\}. \tag{3.4}$$
Consider
$$E_{H_0}\left\{\frac{f_1(X_1,\ldots,X_n)}{f_0(X_1,\ldots,X_n)}\,\delta(X_1,\ldots,X_n)\right\} = \int\ldots\int \frac{f_1(x_1,\ldots,x_n)}{f_0(x_1,\ldots,x_n)}\,\delta(x_1,\ldots,x_n)\, f_0(x_1,\ldots,x_n)\, dx_1\ldots dx_n$$
$$= E_{H_1}\left\{\delta(X_1,\ldots,X_n)\right\} = \Pr_{H_1}\left\{\delta \ \text{rejects } H_0\right\}. \tag{3.5}$$
Since δ represents any decision rule based on {Xᵢ, i = 1,...,n}, including the likelihood ratio–based test, Equation (3.5) implies
$$E_{H_0}\left\{\frac{f_1(X_1,\ldots,X_n)}{f_0(X_1,\ldots,X_n)}\, I(LR_n \ge B)\right\} = \Pr_{H_1}\left\{LR_n \ge B\right\}.$$
Applying this equation and Equation (3.5) to (3.4), we obtain Pr_{H1}{LRₙ ≥ B} ≥ Pr_{H1}{δ rejects H₀}, which completes the proof that the likelihood ratio test is the most powerful statistical decision rule.
This simple proof technique was used to show optimal aspects of differ-
ent statistical decision-making policies based on the likelihood ratio concept
(Vexler and Wu, 2009; Vexler et al., 2010b; Vexler and Gurevich, 2009, 2011).
Here optimality is understood as maximizing the power at a fixed Type I error rate. However, in the case when we have a given test statistic, say ϒ, instead of the likelihood ratio LRₙ, inequality (3.2) yields (ϒ − B){I(ϒ ≥ B) − δ} ≥ 0, for any decision rule δ ∈ [0,1]. Therefore, the test rule ϒ > B also has an "optimal" property that can be derived using (3.2). The problem is how to obtain a reasonable interpretation of such optimality. Thus, in scenarios when one wishes to investigate the optimality of a given test in a general context, inequalities similar to (3.2) can be obtained by focusing on the structure of the test. Such inequalities could report an optimal property of the test. However, translating that property into a statement about the quality of the test is the issue. Consider the following simple example. Let iid data points X₁,...,Xₙ be observed. We would like to test for H₀ versus H₁. Randomly select X_k from the observations and define the test: reject H₀ if X_k > B, where the threshold B > 0. By virtue of inequality (3.2), we have (X_k − B)I{X_k > B} ≥ (X_k − B)δ, where δ = 0, 1 is any decision rule based on X₁,...,Xₙ and the event {δ = 1} rejects H₀. Since (3.2)-type arguments apply equally to this naive test, we have that a formal "optimal" property by itself does not guarantee a reasonable procedure; such properties require careful interpretation.
Proposition 3.5.1. Let f_H^L(u) define the density function of the likelihood ratio LR test statistic under the hypothesis H. Then
$$f_{H_1}^{L}(u) = u\, f_{H_0}^{L}(u) \quad \text{with } u > 0. \tag{3.6}$$
Proof. In order to obtain the property (3.6), we consider, for nonrandom values u and s, the probability
$$\Pr_{H_1}\left\{u - s \le LR \le u\right\} = E_{H_1} I\left\{u - s \le LR \le u\right\} = \int I\left\{u - s \le LR \le u\right\} f_{H_1}$$
$$= \int I\left\{u - s \le LR \le u\right\}\frac{f_{H_1}}{f_{H_0}}\, f_{H_0} = \int I\left\{u - s \le LR \le u\right\}(LR)\, f_{H_0}.$$
Hence
$$\Pr_{H_1}\left\{u - s \le LR \le u\right\} \le \int I\left\{u - s \le LR \le u\right\}(u)\, f_{H_0} = u\Pr_{H_0}\left\{u - s \le LR \le u\right\},$$
and similarly Pr_{H1}{u − s ≤ LR ≤ u} ≥ (u − s)Pr_{H0}{u − s ≤ LR ≤ u}. Dividing by s and letting s → 0 yields (3.6).
Consequently, for any threshold C > 0,
$$\int_0^C f_{H_1}^{L}(u)\, du = \int_0^C u f_{H_0}^{L}(u)\, du.$$
Now, if, in order to control the Type I error rate, the density function f_{H0}^L(u) is assumed to be known, then the probability of the Type II error can be easily computed.
The likelihood ratio property f_{H1}^L(u)/f_{H0}^L(u) = u can be applied to solve different issues related to performing the likelihood ratio test. For example, in terms of the bias of the test, one can request to find a value of the threshold C that maximizes
$$\Pr\left\{\text{the test rejects } H_0 | H_1 \text{ is true}\right\} - \Pr\left\{\text{the test rejects } H_0 | H_0 \text{ is true}\right\},$$
where the probability Pr{the test rejects H₀|H₁ is true} depicts the power of the test. This difference can be expressed as
$$\Pr\left\{LR > C | H_1 \text{ is true}\right\} - \Pr\left\{LR > C | H_0 \text{ is true}\right\} = \left(1 - \int_0^C f_{H_1}^{L}(u)\, du\right) - \left(1 - \int_0^C f_{H_0}^{L}(u)\, du\right).$$
Set the derivative of the above formula with respect to C to zero and solve the following equation:
$$\frac{d}{dC}\left[\left(1 - \int_0^C f_{H_1}^{L}(u)\, du\right) - \left(1 - \int_0^C f_{H_0}^{L}(u)\, du\right)\right] = -f_{H_1}^{L}(C) + f_{H_0}^{L}(C) = 0.$$
By virtue of the property (3.6), this implies −Cf_{H0}^L(C) + f_{H0}^L(C) = 0 and then C = 1, which provides the maximum discrimination between the power and the probability of a Type I error in the likelihood ratio test.
More generally, for a test statistic Tₙ with density functions f_{H0} and f_{H1} under the corresponding hypotheses, and any function ψ, we have the change-of-measure identity
$$E_{H_0}\left\{\psi(T_n)\right\} = \int \psi(u) f_{H_0}(u)\, du = \int \psi(u)\frac{f_{H_0}(u)}{f_{H_1}(u)}\, f_{H_1}(u)\, du = E_{H_1}\left\{\psi(T_n)\frac{f_{H_0}(T_n)}{f_{H_1}(T_n)}\right\}.$$
Theorem 3.6.1
Let l(θ|x) = log(L(θ|x)) and θ̂ be the maximum likelihood estimator of θ under H₁, that is, θ̂ = arg max_θ l(θ|x), provided that it exists. Then, under the null hypothesis H₀: θ = θ₀, the distribution of 2log(MLRₙ) = 2{l(θ̂|x) − l(θ₀|x)} is asymptotically a χ²_{d=1} distribution as n → ∞.
Proof. A Taylor expansion gives
$$l(\theta_0|x) = l(\hat{\theta}|x) + (\theta_0 - \hat{\theta})\,\frac{\partial l(\theta|x)}{\partial\theta}\bigg|_{\theta=\hat{\theta}} + \frac{1}{2}(\theta_0 - \hat{\theta})^2\,\frac{\partial^2 l(\theta|x)}{\partial\theta^2}\bigg|_{\theta=\hat{\theta}} + R,$$
where the remainder term is R = (1/6)(θ₀ − θ̂)³ ∂³l(θ|x)/∂θ³|_{θ=θ̄}, for some θ̄ between θ̂ and θ₀. Based on regularity condition (v), |∂³log f(x;θ)/∂θ³| ≤ M(x) for all x ∈ A, θ₀ − c < θ < θ₀ + c with E_{θ0}{M(X)} < ∞, and the fact that |θ₀ − θ̂| = O_p(n^{−1/2+ε}), ε > 0 (Section 3.3), we can obtain R = O_p(n^{−3/2+3ε}·n) = O_p(n^{−1/2+3ε}) = o_p(1), where ε can be chosen such that ε < 1/6. Noting that ∂l(θ|x)/∂θ|_{θ=θ̂} = 0, we have
$$l(\theta_0|x) = l(\hat{\theta}|x) + \frac{1}{2}(\theta_0 - \hat{\theta})^2\,\frac{\partial^2 l(\theta|x)}{\partial\theta^2}\bigg|_{\theta=\hat{\theta}} + o_p(1).$$
Therefore,
$$2\left\{l(\hat{\theta}|x) - l(\theta_0|x)\right\} = -(\theta_0 - \hat{\theta})^2\,\frac{\partial^2 l(\theta|x)}{\partial\theta^2}\bigg|_{\theta=\hat{\theta}} + o_p(1) = \left\{\sqrt{n}\,(\theta_0 - \hat{\theta})\right\}^2\left\{-n^{-1}\frac{\partial^2 l(\theta|x)}{\partial\theta^2}\bigg|_{\theta=\hat{\theta}}\right\} + o_p(1)$$
$$= \left[\sqrt{n}\,(\theta_0 - \hat{\theta})\left\{-n^{-1}\frac{\partial^2 l(\theta|x)}{\partial\theta^2}\bigg|_{\theta=\hat{\theta}}\right\}^{1/2}\right]^2 + o_p(1).$$
By Theorem 3.3.1 and condition (vi), the quantity √n(θ₀ − θ̂){−n^{−1}∂²l(θ|x)/∂θ²|_{θ=θ̂}}^{1/2} converges to N(0,1) in distribution, and therefore 2{l(θ̂|x) − l(θ₀|x)} converges to χ₁² in distribution. (We also suggest the reader consult Slutsky's theorem (Grimmett and Stirzaker, 1992) in this context.)
The proof is complete.
Note that, in the general situation with MLRₙ = sup_{θ∈Ω_c} L(θ|x) / sup_{θ∈Ω₀} L(θ|x), the proof mentioned above can be easily extended. In this case, under H₀: θ ∈ Ω₀, the statistic 2log(MLRₙ) follows asymptotically a χ²_d distribution as n → ∞, where the degrees of freedom d equal the difference in dimensionality of Ω_c and Ω₀ and are clarified by the number of terms similar to
$$\left[\sqrt{n}\,(\theta_0 - \hat{\theta})\left\{-n^{-1}\frac{\partial^2 l(\theta|x)}{\partial\theta^2}\bigg|_{\theta=\hat{\theta}}\right\}^{1/2}\right]^2 \xrightarrow[n\to\infty]{} \left\{N(0,1)\right\}^2$$
that are presented in the corresponding log likelihood expansion.
Comments: Thus, if certain key assumptions are met, one can show that
parametric likelihood methods are very powerful and efficient statistical
tools. We should emphasize that the role of the discovery of the likelihood
ratio methodology in statistical developments can be compared to the devel-
opment of the assembly line technique of mass production. The likelihood
ratio principle gives clear instructions and technique manuals on how to
construct efficient statistical decision rules in various complex problems
related to clinical experiments. For example, Vexler et al. (2011) developed a
likelihood ratio test for comparing populations based on incomplete longitu-
dinal data subject to instrumental limitations.
Although many statistical publications continue to contribute to the likeli-
hood paradigm and are very important in the statistical discipline (an excel-
lent account can be found in Lehmann and Romano, 2006), several significant
questions naturally arise about the general applicability of the maximum
likelihood approach. Conceptually, there is an issue specific to classifying
maximum likelihoods in terms of likelihoods that are given by joint den-
sity (or probability) functions based on data. Integrated likelihood functions,
with respect to arguments related to data points, are equal to one, whereas
accordingly integrated maximum likelihood functions often have values that
are indefinite. Thus, while likelihoods present full information regarding the
data, the maximum likelihoods might lose information conditional on the
observed data. Consider this simple example: Suppose we observe X₁, which is assumed to be from a normal distribution N(μ,1) with mean parameter μ. In this case, the likelihood has the form (2π)^{−0.5}exp(−(X₁ − μ)²/2) and, correspondingly, ∫(2π)^{−0.5}exp(−(X₁ − μ)²/2) dX₁ = 1, whereas the maximum likelihood is max_μ (2π)^{−0.5}exp(−(X₁ − μ)²/2) = (2π)^{−0.5}, attained at μ̂ = X₁, with ∫(2π)^{−0.5} dX₁ = ∞,
which clearly does not represent the data and is not a proper density.
demonstrates that since the Neyman–Pearson lemma is fundamentally
founded on the use of the density-based constitutions of likelihood ratios,
maximum likelihood ratios cannot be optimal in general cases. That is, the
likelihood ratio principle is generally not robust when the hypothesis tests
have corresponding nuisance parameters to consider, e.g., testing a hypoth-
esized mean given an unknown variance.
An additional inherent difficulty of the maximum likelihood ratio test
occurs when a clinical experiment is associated with an infinite-dimensional
problem and the number of unknown parameters is relatively large. In this
case, Wilks’ theorem should be re-evaluated, and nonparametric approaches
should be considered in the contexts of reasonable alternatives to the para-
metric likelihood methodology (Fan et al., 2001).
The ideas of likelihood and maximum likelihood ratio testing may not be feasible and applicable in general nonparametric function estimation/testing settings. It is also well known that when key assumptions are not met, parametric approaches may be suboptimal or biased as compared to their robust nonparametric counterparts. Consider the following example.
Let X₁,...,Xₙ be iid observations from an exponential distribution with density f(x) = λexp(−λx), x ≥ 0, and consider testing H₀: λ = 1 versus H₁: λ ≠ 1. The log likelihood is
$$l(\lambda) = l(\lambda|X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n}\log f(X_i) = n\log\lambda - \lambda\sum_{i=1}^{n} X_i,$$
which is maximized at λ̂ = X̄^{−1} = n/Σᵢ₌₁ⁿ Xᵢ. Therefore, the maximum likelihood ratio test statistic is
$$2\log(MLR_n) = 2\left\{l(\hat{\lambda}) - l(1)\right\} = 2n\left\{-\log(\bar{X}) - 1 + \bar{X}\right\}.$$
The distribution of 2log(MLRₙ) under H₀ is approximately χ₁², and χ²₁,₀.₉₅ = 3.84 (Pr(χ₁² > χ²₁,₀.₉₅) = 0.05). We reject H₀ if 2log(MLRₙ) > 3.84 at the 0.05 significance level.
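A minimal sketch of this test in R follows; the simulated data (here generated under the alternative with rate 1.5) and sample size are illustrative choices.

# Maximum likelihood ratio test of H0: lambda = 1 for exponential data:
# 2*log(MLR_n) = 2*n*(-log(Xbar) - 1 + Xbar), compared with qchisq(0.95, 1).
set.seed(1)
n <- 50
X <- rexp(n, rate = 1.5)   # data generated under H1 in this illustration
Xbar <- mean(X)
stat <- 2*n*(-log(Xbar) - 1 + Xbar)
c(statistic = stat, critical = qchisq(0.95, 1), reject = stat > qchisq(0.95, 1))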
Consider now observations following the model
$$X_i = \mu + \varepsilon_i, \quad \varepsilon_i = \beta\varepsilon_{i-1} + \xi_i, \quad \varepsilon_0 = 0, \quad i = 1,\ldots,n,$$
where μ and β are parameters and the errors ε₁,...,εₙ are unobserved and dependent, corresponding to the model εᵢ = βεᵢ₋₁ + ξᵢ, which is called autoregression with iid noise ξ₁,...,ξₙ from a specified density function f_ξ. By virtue of the model assumption, we have
$$\xi_1 = X_1 - \mu, \quad \xi_i = X_i - \mu(1-\beta) - \beta X_{i-1}, \quad i = 2,\ldots,n,$$
which suggests the likelihood form
$$L(\mu,\beta) = f_\xi(X_1 - \mu)\prod_{i=2}^{n} f_\xi\left(X_i - \mu(1-\beta) - \beta X_{i-1}\right).$$
Although this form is correct, the approach applied above might lead to incorrect likelihood forms in several general scenarios, since by definition the likelihood function should be constructed using distribution functions of the observations, which are X₁,...,Xₙ in this example. Let us demonstrate the formal exercise to derive the likelihood function. To this end we start by noting that
$$L(\mu,\beta) = f(X_1)\prod_{i=2}^{n} f(X_i | X_{i-1}),$$
where the conditional densities given the whole past reduce to f(Xᵢ|Xᵢ₋₁) by the Markov structure of the model.
Now we derive the joint density function of (Xᵢ, Xᵢ₋₁), f_{Xᵢ,Xᵢ₋₁}(u,v), that is
$$f_{X_i,X_{i-1}}(u,v) = \frac{\partial^2}{\partial u\,\partial v}\Pr\left(X_i < u,\ X_{i-1} < v\right) = \frac{\partial^2}{\partial u\,\partial v}\Pr\left(\xi_i + \mu(1-\beta) + \beta X_{i-1} < u,\ X_{i-1} < v\right)$$
$$= \frac{\partial^2}{\partial u\,\partial v}\int_{-\infty}^{\infty}\Pr\left(\xi_i + \mu(1-\beta) + \beta X_{i-1} < u,\ X_{i-1} < v \,|\, X_{i-1} = t\right) f_{X_{i-1}}(t)\, dt$$
$$= \frac{\partial^2}{\partial u\,\partial v}\int_{-\infty}^{v}\Pr\left(\xi_i + \mu(1-\beta) + \beta t < u \,|\, X_{i-1} = t\right) f_{X_{i-1}}(t)\, dt$$
$$= \frac{\partial^2}{\partial u\,\partial v}\int_{-\infty}^{v}\Pr\left(\xi_i + \mu(1-\beta) + \beta t < u\right) f_{X_{i-1}}(t)\, dt$$
$$= \frac{\partial}{\partial u}\Pr\left(\xi_i < u - \mu(1-\beta) - \beta v\right) f_{X_{i-1}}(v) = f_\xi\left(u - \mu(1-\beta) - \beta v\right) f_{X_{i-1}}(v),$$
where the independence of ξᵢ and Xᵢ₋₁ is used. Hence
$$f(X_i | X_{i-1}) = f_\xi\left(X_i - \mu(1-\beta) - \beta X_{i-1}\right), \quad i \ge 2,$$
which confirms the likelihood form displayed above.
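The following sketch evaluates this likelihood numerically for normal noise; treating σ as known, the noise density, the parameter values, and the use of optim() are our illustrative assumptions.

# Sketch: exact likelihood of X_i = mu + eps_i with AR(1) errors and
# iid N(0, sigma^2) noise xi_i (sigma treated as known for brevity).
negloglik <- function(par, X, sigma = 1){
  mu <- par[1]; beta <- par[2]
  xi <- c(X[1] - mu, X[-1] - mu*(1 - beta) - beta*X[-length(X)])
  -sum(dnorm(xi, sd = sigma, log = TRUE))
}
set.seed(1)
n <- 500; mu <- 2; beta <- 0.5
eps <- as.numeric(filter(rnorm(n), beta, method = "recursive"))  # eps_0 = 0
X <- mu + eps
optim(c(0, 0), negloglik, X = X)$par   # numerical estimates of (mu, beta)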
2011). For example, one can consider a study with l treatment groups. For treatment group k (k = 1,...,l), there are n_k experimental units. Each experimental unit is followed over m equally spaced time points by a prespecified schedule. Let X_{ijk} denote the jth measurement of the ith unit in group k (j = 1,...,m; i = 1,...,n_k; k = 1,...,l). The variable X_{ijk} may reflect a tumor size in a tumor development model with mice. We can model successive measurements within an experimental unit as autoregressive deviations around a group mean, e.g.,
$$X_{ijk} = \mu_{jk} + \varepsilon_{ijk}, \quad \varepsilon_{ijk} = \sum_{v=1}^{r}\beta_v\,\varepsilon_{i(j-v)k} + \xi_{ijk},$$
for some integer r < j, where the ξ_{ijk} are iid with mean 0 and the β_v's describe the relationship between the ε_{ijk}'s. This model depicts the rth order of autoregressive associations between the observations.
4 Martingale Type Statistics and Their Applications
4.1 Introduction
Probability theory has several fundamental branches that have been dealt
with extensively in the theoretical literature associated with martingale
techniques. Oftentimes, modern concepts of stochastic processes analysis
are presented in the framework of martingale constructions. Historically,
gambling theory and econometrics have employed martingale theory to
model fair stochastic games when information regarding past events does
not provide us the ability to predict the mean of future winnings.
There are a multitude of books that cover different martingale based
methods. In this chapter, we will focus on the following three aspects that
have straightforward connections with biostatistical theory and its applica-
tions: (1) A martingale can represent a statistic, say Mn, based on n data
points such that the expectation of a future outcome Mn+1, which is calcu-
lated given knowledge regarding the present statistic Mn, is equal to the
observed value Mₙ. For example, intuitively, if a test statistic, say Tₙ, based on n observations satisfies this martingale property under the null hypothesis H₀, then on average the Tₙ values are relatively stable under H₀ (Tₙ does not increase on average, E_{H0}(Tₙ₊₁|T₁,...,Tₙ) = Tₙ, with respect to sizes of the underlying data). Thus, this can reduce the chances that the test statistic T_k, k > n,
will overshoot a corresponding test threshold when more data points are
available and H 0 is true. (2) In previous chapters, we demonstrated that
propositions based on sums of iid random variables can provide general
proof schemes in the evaluation of various statistical procedures. The mar-
tingale machinery can extend different concepts based on sums of iid sum-
mands by taking into account things such as dependency structures between
observations. (3) Martingale methodology plays a crucial role in the devel-
opment and examination of various statistical sequential procedures.
Common sequential statistical schemes involve random numbers of obser-
vations according to random stopping times (rules) for surveying data
4.2 Terminology
To begin our chapter we need to first provide some of the key definitions
used throughout its development.
In Section 4.1 we referred to the terms “information regarding past events”
and “given knowledge regarding the present statistic.” In order to formalize
these, we provide the following series of fundamental definitions:
(i) the set of elementary events Ω (see Section 1.3, taking into account that we require Pr(Ω) = 1) belongs to A, i.e., Ω ∈ A;
For example, consider the simple scenarios when (1) X ∈ ℑ; then E(X|ℑ) = X and it is clear that E{E(X|ℑ)} = E(X), and (2) X is independent of ℑ; then E(X|ℑ) = E(X) and it is clear that E{E(X|ℑ)} = E{E(X)} = E(X). Note that one can show that the convolution principle (2.1) shown in Chapter 2 is a corollary of Equation (4.1).
Now, given our background material above, we can define a martingale in an appropriate manner relative to the main purpose of this chapter.
For example, let ξ₁,...,ξₙ be iid random variables with E(ξ₁) = 0, define Mₙ = Σᵢ₌₁ⁿ ξᵢ, and set ℑₙ = σ(ξ₁,...,ξₙ). Then
$$E(M_n | \Im_{n-1}) = \sum_{i=1}^{n-1}\xi_i + E(\xi_n | \Im_{n-1}) = \sum_{i=1}^{n-1}\xi_i + E(\xi_n) = M_{n-1} + 0.$$
This implies that (Mₙ, ℑₙ) is a martingale. However, defining ℑ′ₙ = σ(y₁,...,yₙ), where the iid random variables y₁,...,yₙ are independent of ξ₁,...,ξₙ, we obtain E(Mₙ|ℑ′ₙ₋₁) = Σᵢ₌₁ⁿ E(ξᵢ) = 0, and then (Mₙ, ℑ′ₙ) is not a martingale.
In a general aspect, one can consider martingale type statistics to extend the notion of a statistic consisting of a sum of iid random variables. For example, assume that ξ₀,...,ξₙ are iid random variables with E(ξ₁) = a. In this case, it is clear that
$$\left(\sum_{i=1}^{n}\xi_i - an,\ \sigma(\xi_1,\ldots,\xi_n)\right) = \left(\sum_{i=1}^{n}(\xi_i - a),\ \sigma(\xi_1,\ldots,\xi_n)\right)$$
is a martingale. Also, defining Mₙ = Σᵢ₌₁ⁿ (ξᵢ₋₁ − a)(ξᵢ − a), we obtain
$$E\left\{M_n | \sigma(\xi_0,\ldots,\xi_{n-1})\right\} = E\left\{\sum_{i=1}^{n}(\xi_{i-1}-a)(\xi_i-a)\,\Big|\,\sigma(\xi_0,\ldots,\xi_{n-1})\right\}$$
$$= \sum_{i=1}^{n-1}(\xi_{i-1}-a)(\xi_i-a) + (\xi_{n-1}-a)\,E(\xi_n - a) = \sum_{i=1}^{n-1}(\xi_{i-1}-a)(\xi_i-a) = M_{n-1},$$
and then (Mₙ, σ(ξ₀,...,ξₙ₋₁)) is a martingale, where Mₙ represents a sum of dependent random variables (certainly, we assumed that E|ξ₁| < ∞ to apply Definition 4.2.5). As another example, via taking into account Section 3.7, we propose analyzing observations ε₁,...,εₙ that satisfy the autoregression model εᵢ = βεᵢ₋₁ + ξᵢ, i = 1,...,n, with the iid noise ξ₁,...,ξₙ and E(ξ₁) = 0, ε₀ = 0.
In this case ε_k = Σ_{j=1}^{k} β^{k−j} ξ_j, and
$$E\left\{\beta^{-k}\varepsilon_k \,|\, \sigma(\xi_1,\ldots,\xi_{k-1})\right\} = \beta^{-k}\left[\sum_{j=1}^{k-1}\beta^{k-j}\xi_j + E\left\{\xi_k | \sigma(\xi_1,\ldots,\xi_{k-1})\right\}\right]$$
$$= \beta^{-k}\left\{\sum_{j=1}^{k-1}\beta^{k-j}\xi_j + E(\xi_k)\right\} = \beta^{-(k-1)}\sum_{j=1}^{k-1}\beta^{(k-1)-j}\xi_j = \beta^{-(k-1)}\varepsilon_{k-1}, \quad k > 1.$$
This implies that (β^{−n}εₙ, σ(ξ₁,...,ξₙ)) is a martingale, where the observations ε₁,...,εₙ are sums of independent and not identically distributed random variables.
Chapter 2 introduced statistical strategies that employ random numbers of data points. In forthcoming sections of this book we will introduce several procedures based on non-retrospective mechanisms, such as statistical sequential schemes. In order to extend retrospective statistical tools based on data with fixed sample sizes to procedures stopped at random times, we introduce the following definition:
Examples:
1. It is clear that a fixed integer N is a stopping time.
2. Suppose τ and ν are stopping times. Then min( τ , ν) and max( τ, ν)
are stopping times. Indeed, min( τ, ν) and max( τ, ν) satisfy con-
dition (ii) of Definition 4.2.6, since τ and ν are stopping times.
In this case, the events {min(τ, ν) ≤ n} = {τ ≤ n} ∪ {ν ≤ n} ⊂ ℑn and
{max(τ, ν) ≤ n} = {τ ≤ n} ∩ {ν ≤ n} ⊂ ℑn .
3. Suppose τ and ν are stopping times. Then τ + ν is a stopping time, since
$$\{\tau + \nu \le n\} = \bigcup_{i=1}^{n}\left(\{\tau \le n - i\} \cap \{\nu = i\}\right) \subset \Im_n,$$
where {τ ≤ 0} = ∅.
4. In a similar manner to the example above, one can show that τ − ν is not a stopping time.
5. Consider the random variable ϖ = inf{n ≥ 1 : Xₙ ≥ aₙ}, where X₁, X₂, X₃,... are independent and identically uniformly [0,1] distributed random variables and aᵢ = 1 − 1/(i+1)², i ≥ 1. Defining ℑₙ = σ(X₁,...,Xₙ), we have {ϖ ≤ n} = {max_{1≤k≤n}(X_k − a_k) ≥ 0} ⊂ ℑₙ. Let us examine, for a nonrandom N,
$$\Pr(\varpi > N) = \Pr\left\{\max_{1\le k\le N}(X_k - a_k) \le 0\right\} = \Pr\left\{(X_1 - a_1) \le 0,\ldots,(X_N - a_N) \le 0\right\} = \prod_{k=1}^{N}\Pr\{X_k \le a_k\} = \prod_{k=1}^{N} a_k.$$
In this case,
$$\log\left(\prod_{k=1}^{N} a_k\right) = \sum_{k=1}^{N}\log\left(1 - 1/(k+1)^2\right) \approx \sum_{k=1}^{N}\left\{-1/(k+1)^2\right\},$$
which converges as N → ∞; hence Pr(ϖ > N) does not vanish, so that Pr(ϖ = ∞) > 0 and ϖ is not finite with probability one. In contrast, if we set a_k = 1 − 1/(k+1), then
$$\log\left(\prod_{k=1}^{N} a_k\right) = \sum_{k=1}^{N}\log\left(1 - 1/(k+1)\right) \approx \sum_{k=1}^{N}\left(-1/(k+1)\right) \approx -\log(N) \to -\infty,$$
so that Pr(ϖ > N) → 0 and ϖ is finite with probability one.
6. Let X₁, X₂,... be iid nonnegative random variables with E(X₁) < 2, let Sₙ = Σᵢ₌₁ⁿ Xᵢ, and define W = inf{n ≥ 1 : Sₙ ≥ 2n²(n+1)}, with W = ∞ if no such n exists. Then
$$\Pr(W < \infty) = \sum_{n=1}^{\infty}\Pr(W = n) = \sum_{n=1}^{\infty}\Pr\left\{S_1 < 4,\ldots,S_{n-1} < 2(n-1)^2 n,\ S_n \ge 2n^2(n+1)\right\}$$
$$\le \sum_{n=1}^{\infty}\Pr\left\{S_n \ge 2n^2(n+1)\right\} \le \sum_{n=1}^{\infty}\frac{E(S_n)}{2n^2(n+1)} = \sum_{n=1}^{\infty}\frac{E(X_1)}{2n(n+1)} = \frac{1}{2}E(X_1) < 1,$$
where Chebyshev's inequality and Σₙ 1/{n(n+1)} = 1 are used. Thus, although (W ≤ n) ⊂ ℑₙ = σ(X₁,...,Xₙ), W is not finite with probability one and hence fails the stopping time requirement Pr(W < ∞) = 1.
Proposition 4.3.1. Let (Mₙ, ℑₙ) be a martingale and τ a stopping time with respect to ℑₙ for which the optional stopping conditions hold (e.g., appropriate uniform integrability). Then E(M_τ) = E(M₁).
Proof. For an integer N, define the stopping time τ_N = min(N, τ). It is clear that
$$E\left(M_{\tau_N}\right) = E\left\{\sum_{i=1}^{N} M_i\, I(\tau_N = i)\right\} = E\left[\sum_{i=1}^{N} M_i\left\{I(\tau_N \le i) - I(\tau_N \le i-1)\right\}\right]$$
$$= \sum_{i=1}^{N} E\left\{M_i\, I(\tau_N \le i)\right\} - \sum_{i=1}^{N} E\left\{M_i\, I(\tau_N \le i-1)\right\}.$$
In this equation, by virtue of property (4.1) and the martingale and stopping time definitions, we have
$$E\left\{M_i\, I(\tau_N \le i-1)\right\} = E\left\{I(\tau_N \le i-1)\, E(M_i | \Im_{i-1})\right\} = E\left\{M_{i-1}\, I(\tau_N \le i-1)\right\}.$$
Thus the two sums telescope, giving
$$E\left(M_{\tau_N}\right) = E\left\{M_N\, I(\tau_N \le N)\right\} = E(M_N), \quad \text{where } M_0 I(\tau_N \le 0) = 0.$$
Since E(M_N) = E{E(M_N|ℑ_{N−1})} = E(M_{N−1}) = E{E(M_{N−1}|ℑ_{N−2})} = ... = E(M₁), we conclude that E(M_{τ_N}) = E(M₁). Now, considering N → ∞, we complete the proof.
Proposition 4.3.2. Assume that ξ₁,...,ξₙ are iid random variables with E(ξ₁) = a and E|ξ₁| < ∞. Let τ ≥ 1 be a stopping time with respect to ℑₙ = σ(ξ₁,...,ξₙ). If E(τ) < ∞, then
$$E\left(\sum_{i=1}^{\tau}\xi_i\right) = a\, E(\tau).$$
Proof. Consider
$$E\left(\sum_{i=1}^{n}\xi_i - an \,\Big|\, \Im_{n-1}\right) = \sum_{i=1}^{n-1}\xi_i - an + E(\xi_n | \Im_{n-1}) = \sum_{i=1}^{n-1}\xi_i - an + a = \sum_{i=1}^{n-1}\xi_i - a(n-1).$$
This implies that (Σᵢ₌₁ⁿ ξᵢ − an, ℑₙ) is a martingale. Therefore the use of Proposition 4.3.1 provides
$$E\left(\sum_{i=1}^{\tau}\xi_i - a\tau\right) = E(\xi_1 - a) = 0 \quad \text{and then} \quad E\left(\sum_{i=1}^{\tau}\xi_i\right) = a\, E(\tau),$$
which completes the proof.
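A quick Monte Carlo check of this identity (often called Wald's identity) is sketched below; the threshold 10, the summand distribution, and the number of runs are illustrative assumptions.

# Monte Carlo check of E(S_tau) = a*E(tau) for the stopping time
# tau = min{n : S_n >= 10} with nonnegative iid exponential summands.
set.seed(1)
one.run <- function(){
  S <- 0; n <- 0
  while (S < 10) { S <- S + rexp(1, rate = 2); n <- n + 1 }
  c(S = S, tau = n)
}
res <- replicate(5000, one.run())
c(mean(res["S", ]), 0.5*mean(res["tau", ]))  # a = E(X_1) = 1/2; both agree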
As has been mentioned previously, Chebyshev’s inequality plays a very
useful role in theoretical statistical analysis. The next proposition can be
established as an aspect of extending Chebyshev’s approach.
Proposition 4.3.3 (Doob's inequality). Let (M_k, ℑ_k) be a nonnegative martingale. Then, for all ε > 0 and n ≥ 1,
$$\Pr\left(\max_{1\le k\le n} M_k \ge \varepsilon\right) \le \frac{E(M_1)}{\varepsilon}.$$
Proof. Define the stopping time ν = inf{k ≥ 1 : M_k ≥ ε} and set νₙ = min(ν, n). Then
$$\Pr\left(M_{\nu_n} \ge \varepsilon\right) = \Pr(\nu \le n) = \Pr\left(\max_{1\le k\le n} M_k \ge \varepsilon\right).$$
That is, Pr(max_{1≤k≤n} M_k ≥ ε) = Pr(M_{νₙ} ≥ ε) = E{I(M_{νₙ} ≥ ε)}, where the indicator function I(a ≥ b) satisfies the inequality I(a ≥ b) ≤ a/b for positive a and b. Then Pr(max_{1≤k≤n} M_k ≥ ε) ≤ E(M_{νₙ})/ε, and the use of Proposition 4.3.1 completes the proof.
Note that Proposition 4.3.3 displays a nontrivial result: for Mₙ ≥ 0 one could anticipate that max_{1≤k≤n} M_k increases, providing Pr(max_{1≤k≤n} M_k ≥ ε) = 1 at least for large values of n; however, Doob's inequality bounds max_{1≤k≤n} M_k in probability, so we might expect that Proposition 4.3.3 yields an accurate bound for Pr(max_{1≤k≤n} M_k ≥ ε). Indeed, evaluating Pr(max_{1≤k≤n} M_k ≥ ε) by a direct Chebyshev-type argument gives only
$$\Pr\left(\max_{1\le k\le n} M_k \ge \varepsilon\right) \le \sum_{k=1}^{n}\Pr(M_k \ge \varepsilon) \le \sum_{k=1}^{n}\frac{E(M_k)}{\varepsilon} = \frac{nE(M_1)}{\varepsilon},$$
which is substantially weaker than the bound of Proposition 4.3.3.
4.4 Applications
The main theme of this section is to demonstrate that the martingale meth-
odology has valuable substance when it is applied in developing and exam-
ining statistical procedures.
Assume that data points X₁, X₂,... are observed sequentially and that their joint density functions, f₁ under H₁ and f₀ under H₀, are available. Defining ℑₙ = σ(X₁,...,Xₙ), we use the material of Section 3.2 to derive
$$E_{H_0}\left(LR_{n+1} | \Im_n\right) = E_{H_0}\left\{\frac{\prod_{i=1}^{n+1} f_1(X_i|X_1,\ldots,X_{i-1})}{\prod_{i=1}^{n+1} f_0(X_i|X_1,\ldots,X_{i-1})}\,\Big|\,\Im_n\right\}$$
$$= \frac{\prod_{i=1}^{n} f_1(X_i|X_1,\ldots,X_{i-1})}{\prod_{i=1}^{n} f_0(X_i|X_1,\ldots,X_{i-1})}\; E_{H_0}\left\{\frac{f_1(X_{n+1}|X_1,\ldots,X_n)}{f_0(X_{n+1}|X_1,\ldots,X_n)}\,\Big|\,\Im_n\right\}$$
$$= \frac{\prod_{i=1}^{n} f_1(X_i|X_1,\ldots,X_{i-1})}{\prod_{i=1}^{n} f_0(X_i|X_1,\ldots,X_{i-1})}\int \frac{f_1(u|X_1,\ldots,X_n)}{f_0(u|X_1,\ldots,X_n)}\, f_0(u|X_1,\ldots,X_n)\, du = LR_n,$$
where the subscript H₀ indicates that we consider the expectation given that the null hypothesis is correct. Then, following Section 4.2, the pair (LRₙ, ℑₙ) is an H₀-martingale and the pair (log(LRₙ), ℑₙ) is an H₀-supermartingale. This says, for the moment in an intuitive fashion, that while increasing the amount of observed data points under H₀, we do not, on average, enlarge the chances to reject H₀ for large values of the likelihood ratio.
Now, under H₁, we consider
$$E_{H_1}\left(LR_{n+1} | \Im_n\right) = \frac{\prod_{i=1}^{n} f_1(X_i|X_1,\ldots,X_{i-1})}{\prod_{i=1}^{n} f_0(X_i|X_1,\ldots,X_{i-1})}\; E_{H_1}\left\{\frac{1}{f_0(X_{n+1}|X_1,\ldots,X_n)/f_1(X_{n+1}|X_1,\ldots,X_n)}\,\Big|\,\Im_n\right\}$$
$$\ge \frac{\prod_{i=1}^{n} f_1(X_i|X_1,\ldots,X_{i-1})}{\prod_{i=1}^{n} f_0(X_i|X_1,\ldots,X_{i-1})}\;\frac{1}{E_{H_1}\left\{f_0(X_{n+1}|X_1,\ldots,X_n)/f_1(X_{n+1}|X_1,\ldots,X_n)\,|\,\Im_n\right\}}$$
$$= \frac{\prod_{i=1}^{n} f_1(X_i|X_1,\ldots,X_{i-1})}{\prod_{i=1}^{n} f_0(X_i|X_1,\ldots,X_{i-1})}\left\{\int \frac{f_0(u|X_1,\ldots,X_n)}{f_1(u|X_1,\ldots,X_n)}\, f_1(u|X_1,\ldots,X_n)\, du\right\}^{-1} = LR_n,$$
where Jensen’s inequality is used and the subscript H 1 indicates that we con-
sider the expectation given that the alternative hypothesis is true. Regarding
log(LRₙ₊₁), we have
$$E_{H_1}\left\{\log\left(LR_{n+1}\right) | \Im_n\right\} = \log\left(LR_n\right) - E_{H_1}\left\{\log\left(\frac{f_0(X_{n+1}|X_1,\ldots,X_n)}{f_1(X_{n+1}|X_1,\ldots,X_n)}\right)\Big|\,\Im_n\right\}$$
$$\ge \log\left(LR_n\right) - \log\left(E_{H_1}\left\{\frac{f_0(X_{n+1}|X_1,\ldots,X_n)}{f_1(X_{n+1}|X_1,\ldots,X_n)}\,\Big|\,\Im_n\right\}\right) = \log\left(LR_n\right).$$
Then the pairs (LRn , ℑn) and (log (LRn) , ℑn) are H 1-submartingales. Thus,
intuitively, under H 1, this shows that by adding more observations to be
used in the test procedure, we provide more power to the likelihood
ratio test.
Conclusion: The likelihood ratio test statistic is most powerful given all of
the assumptions are met. Therefore, generally speaking, it is reasonable to consider
developing test statistics or their transformed versions that could have properties of
H 0 -martingales (supermartingales) and H 1-submartingales, anticipating an opti-
mality of the corresponding test procedures. Note that in many testing scenarios
the alternative hypothesis H 1 is “not H 0 ” and hence does not have a specified
form. For example, the composite statement regarding a parameter θ can be
stated as H 0 : θ = θ0 versus H 1 : θ ≠ θ0, where θ0 is known. In this case,
EH1 -type objects do not have unique shapes. Then we can aim to preserve
only the H 0 -martingale (supermartingales) property of a corresponding
rational test statistic, following the Neyman–Pearson concept of statistical
tests, in which H 0 -characteristics of decision-making strategies are vital
(e.g., Vexler et al., 2016a).
In contrast, consider the maximum likelihood ratio MLRₙ = Πᵢ₌₁ⁿ f(Xᵢ|X₁,...,Xᵢ₋₁;θ̂ₙ)/Πᵢ₌₁ⁿ f(Xᵢ|X₁,...,Xᵢ₋₁;θ₀), where θ̂ₙ = θ̂ₙ(X₁,...,Xₙ) maximizes the likelihood. Under H₀, since θ̂ₙ₊₁ maximizes the numerator,
$$E_{H_0}\left(MLR_{n+1} | \Im_n\right) = E_{H_0}\left\{\frac{\prod_{i=1}^{n+1} f(X_i|X_1,\ldots,X_{i-1};\hat{\theta}_{n+1})}{\prod_{i=1}^{n+1} f(X_i|X_1,\ldots,X_{i-1};\theta_0)}\,\Big|\,\Im_n = \sigma(X_1,\ldots,X_n)\right\}$$
$$\ge E_{H_0}\left\{\frac{\prod_{i=1}^{n+1} f(X_i|X_1,\ldots,X_{i-1};\hat{\theta}_{n})}{\prod_{i=1}^{n+1} f(X_i|X_1,\ldots,X_{i-1};\theta_0)}\,\Big|\,\Im_n\right\}$$
$$= \frac{\prod_{i=1}^{n} f(X_i|X_1,\ldots,X_{i-1};\hat{\theta}_{n})}{\prod_{i=1}^{n} f(X_i|X_1,\ldots,X_{i-1};\theta_0)}\; E_{H_0}\left\{\frac{f(X_{n+1}|X_1,\ldots,X_n;\hat{\theta}_n)}{f(X_{n+1}|X_1,\ldots,X_n;\theta_0)}\,\Big|\,\Im_n\right\}$$
$$= \frac{\prod_{i=1}^{n} f(X_i|X_1,\ldots,X_{i-1};\hat{\theta}_{n})}{\prod_{i=1}^{n} f(X_i|X_1,\ldots,X_{i-1};\theta_0)}\int \frac{f(u|X_1,\ldots,X_n;\hat{\theta}_n(X_1,\ldots,X_n))}{f(u|X_1,\ldots,X_n;\theta_0)}\, f(u|X_1,\ldots,X_n;\theta_0)\, du = MLR_n,$$
i.e., MLRₙ behaves as an H₀-submartingale: on average it does not decrease with n even when the null hypothesis is true.
As an example, consider testing the hypothesis that X₁,...,Xₙ are iid with the N(0,η²) density f₀ versus the N(μ,η²) density f₁, where η² is known and μ > 0; call this testing problem (4.2). In this case, we have f₀(x) = exp(−x²/(2η²))/√(2πη²) and f₁(x) = f₀(x)exp(θ₁x + θ₂) with θ₁ = μ/η² and θ₂ = −μ²/(2η²), respectively.
By virtue of the statement (4.2), the likelihood ratio test statistic is
$$LR_n = \exp\left(\theta_1\sum_{i=1}^{n} X_i + n\theta_2\right).$$
Instead, consider the statistic
$$T_n = \exp\left(a\sum_{i=1}^{n} X_i + b\right),$$
where a and b can be chosen arbitrarily and a is of the same sign as θ₁, i.e., a > 0 in this example, e.g., a = 7, b = 0. The decision-making policy is to reject H₀ if Tₙ > C, where C > 0 is a test threshold.
Suppose, only for the following theoretical exercise, that
$$c = \int f_0(u)\exp(au + b)\, du < \infty.$$
Then
$$E_{H_0}\left(\frac{1}{c^{n+1}} T_{n+1}\,\Big|\,\Im_n = \sigma(X_1,\ldots,X_n)\right) = \frac{1}{c^{n+1}} E_{H_0}\left\{\frac{\prod_{i=1}^{n+1} f_0(X_i)\exp(aX_i + b)}{\prod_{i=1}^{n+1} f_0(X_i)}\,\Big|\,\Im_n\right\} = \frac{T_n}{c^n},$$
i.e., (Tₙ/cⁿ, ℑₙ) is an H₀-martingale, where cⁿ is independent of the underlying data. Thus we can anticipate that the test statistic Tₙ can be optimal. This is displayed in the following result:
The statistic Tₙ provides the most powerful test for the hypothesis testing problem (4.2).
Proof. Note again that the likelihood ratio test statistic LRₙ = exp(θ₁Σᵢ₌₁ⁿXᵢ + nθ₂) is most powerful. Following the Neyman–Pearson concept, in order to compare the test statistics LRₙ and Tₙ, we define test thresholds C_α^T and C_α^{LR} to satisfy α = Pr_{H0}(LRₙ > C_α^{LR}) = Pr_{H0}(Tₙ > C_α^T) for a prespecified significance level α, since the Type I error rates of the tests should be equivalent. In this case, using the definitions of the test statistics, we have
$$\alpha = \Pr_{H_0}\left[\sum_{i=1}^{n} X_i > \left\{\log\left(C_\alpha^{LR}\right) - n\theta_2\right\}\Big/\theta_1\right] = \Pr_{H_0}\left[\sum_{i=1}^{n} X_i > \left\{\log\left(C_\alpha^{T}\right) - b\right\}\Big/a\right], \quad \theta_1 > 0,\ a > 0.$$
That is, {log(C_α^{LR}) − nθ₂}/θ₁ = {log(C_α^T) − b}/a, and then log(C_α^T) = a{log(C_α^{LR}) − nθ₂}/θ₁ + b. Hence
$$\Pr_{H_1}\left(T_n > C_\alpha^T\right) = \Pr_{H_1}\left[\sum_{i=1}^{n} X_i > \left\{\log\left(C_\alpha^{T}\right) - b\right\}\Big/a\right] = \Pr_{H_1}\left[\sum_{i=1}^{n} X_i > \left\{\log\left(C_\alpha^{LR}\right) - n\theta_2\right\}\Big/\theta_1\right] = \Pr_{H_1}\left(LR_n > C_\alpha^{LR}\right),$$
i.e., the statistic Tₙ is the most powerful test statistic. (In this book, the notation Pr_{Hk}(A) means that we consider the probability of (A) given that the hypothesis H_k is true, k = 0, 1.) This completes the proof.
Since θ₁ and θ₂ are unknown, one can propose to estimate θ₁ and θ₂ by using the maximum likelihood method in order to approximate the likelihood ratio test statistic. In this case, the efficiency of the maximum likelihood ratio test may be lost completely. The optimal method based on the test statistic exp(aΣᵢ₌₁ⁿXᵢ + b) can be referred to as a simple representative method, since we employ arbitrarily chosen numbers to represent the unknown parameters θ₁ and θ₂.
In this section, we presented the simple representative method, which can be easily extended by integrating test statistics over variables that represent unknown parameters with respect to functions that display weights corresponding to values of those variables. This approach can be referred to as a Bayes factor type decision-making mechanism that will be described formally in forthcoming sections of this book. For example, observing X₁,...,Xₙ, one can denote the test statistic
$$BF_n = \frac{\int f_1(X_1,\ldots,X_n;\theta)\,\pi(\theta)\, d\theta}{f_0(X_1,\ldots,X_n;\theta_0)},$$
where π is a prior-type weight function.
4.4.2 Guaranteed Type I Error Rate Control of the Likelihood Ratio Tests
As shown in Section 4.4.1, the pair (LRₙ > 0, ℑₙ = σ(X₁,...,Xₙ)) is an H₀-martingale. Thus Proposition 4.3.3 implies Pr_{H0}(max_{1≤k≤n} LR_k > C) ≤ C^{−1}, since E_{H0}(LR₁) = ∫{f₁(u)/f₀(u)}f₀(u) du = 1. This result ensures that we can control, in a guaranteed non-asymptotic manner, the Type I error rate of likelihood ratio–based procedures, even when the data are monitored over time.
For instance, the speed limit on rural interstate highways in the United States was increased from 55 miles per hour to 65 miles per hour in 1987. In order to investigate the effect of this increased speed limit on highway traveling, one may study the change in the traffic accident death rate after the modification of the 55 mile per hour speed limit law. Problems closely related to the example above are called change point problems in the statistical literature.
We will concentrate in this section on retrospective change point prob-
lems, where inference regarding the detection of a change occurs retrospec-
tively, i.e., after the data have already been collected. Various applications, e.g., in the biological, medical, and economic areas, tend to generate retrospective change point problems.
somal DNA copy number changes or copy number variations in tumor cells
can facilitate the development of medical diagnostic tools and personalized
treatment regimens for cancer and other genetic diseases; e.g., see Lucito
et al. (2000). The retrospective change point detection methods are also use-
ful in studying the variation (over time) of share prices on the major stock
exchanges. Various examples of change point detection schemes and their
applications are introduced in Csörgö and Horváth (1997) and Vexler et al.
(2016a).
Let X₁, X₂,...,Xₙ be independent continuous random variables with fixed sample size n. In the formal context of hypotheses testing, we state the change point detection problem as testing for
$$H_0: X_i \sim f_0,\ i = 1,\ldots,n \quad \text{versus} \quad H_1: X_i \sim f_0,\ i = 1,\ldots,\nu-1;\ X_i \sim f_1,\ i = \nu,\ldots,n,$$
where f₀ and f₁ are density functions and ν is an unknown change point location. For a fixed value ν = k, the likelihood ratio is
$$\Lambda_k^n = \frac{\prod_{i=k}^{n} f_1(X_i)}{\prod_{i=k}^{n} f_0(X_i)}.$$
We refer the reader to Chapter 3 for details regarding likelihood ratio tests. When the parameter ν is unknown, the maximum likelihood estimator ν̂ = arg max_{1≤k≤n} Λ_k^n of the parameter ν can be applied. By plugging in the maximum likelihood estimator ν̂ of the change point location ν, we have
$$\Lambda_n = \Lambda_{\hat{\nu}}^n = \max_{1\le k\le n}\Lambda_k^n,$$
which has the maximum likelihood ratio form. Note that
$$\log(\Lambda_n) = \max_{1\le k\le n}\sum_{i=k}^{n}\log\left(f_1(X_i)/f_0(X_i)\right),$$
which corresponds to the well-known cumulative sum (CUSUM) type test statistic. The null hypothesis is rejected for large values of the CUSUM test statistic Λₙ. It turns out that, even in this simple case, evaluations of H₀-properties of the CUSUM test are very complicated tasks.
Consider the σ-algebra ℑ_k^n = σ(X_k,...,Xₙ) and the expectation
$$E_{H_0}\left(\Lambda_{k-1}^n \,|\, \Im_k^n\right) = E_{H_0}\left(\frac{\prod_{i=k-1}^{n} f_1(X_i)}{\prod_{i=k-1}^{n} f_0(X_i)}\,\Big|\,\Im_k^n\right) = \frac{\prod_{i=k}^{n} f_1(X_i)}{\prod_{i=k}^{n} f_0(X_i)}\; E_{H_0}\left(\frac{f_1(X_{k-1})}{f_0(X_{k-1})}\,\Big|\,\Im_k^n\right) = \Lambda_k^n.$$
This shows that (Λ_k^n, ℑ_k^n) is an H₀-martingale that can be called an H₀-nonnegative reverse martingale. Then a simple modification of Proposition 4.3.3 (Gurevich and Vexler, 2005) implies
$$\Pr_{H_0}(\Lambda_n > C) \le E_{H_0}\left(\Lambda_n^n\right)\big/C = 1/C,$$
which provides an upper bound on the corresponding Type I error rate. The arguments mentioned below Proposition 4.3.3 support that the bound 1/C might be very close to the actual Type I error rate. The non-asymptotic Type I error rate monitoring via 1/C was obtained without any requirements on the density functions f₀, f₁. In several scenarios with specified f₀-, f₁-forms, complex asymptotic propositions can be used to approximate Pr_{H0}(Λₙ > C) (Csörgö and Horváth, 1997), and one can also employ Monte Carlo approximations to the Type I error rate. In these cases, the bound 1/C suggests the initial values for the evaluations.
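Such a Monte Carlo check is sketched below for f₀ = N(0,1) and f₁ = N(1,1); the sample size, threshold, and number of repetitions are illustrative choices.

# Monte Carlo sketch of the CUSUM statistic Lambda_n under H0 and the
# guaranteed bound Pr_H0(Lambda_n > C) <= 1/C.
set.seed(1)
n <- 100; C <- 20
Lambda <- replicate(10000, {
  X <- rnorm(n)                     # all data generated under H0
  lr <- dnorm(X, 1)/dnorm(X, 0)     # f1/f0 per observation
  max(rev(cumprod(rev(lr))))        # max over k of prod_{i=k}^{n} lr_i
})
c(MC.TypeI = mean(Lambda > C), bound = 1/C)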
Extending this approach, we obtain the test statistic SRₙ = Σⱼ₌₁ⁿ wⱼΛⱼⁿ, where the deterministic weights wⱼ ≥ 0, j = 1,...,n, are known. For simplicity and clarity of exposition, we redefine
$$SR_n = \sum_{j=1}^{n}\Lambda_j^n.$$
The statistic SRₙ can be considered in the context of Bayes factor type testing, when we integrate the unknown change point with respect to uniform prior information regarding the point where the change occurred. In this change point problem, the integration corresponds to simple summation. This approach is well addressed in the change point literature as the Shiryayev–Roberts (SR) scheme (Vexler et al., 2016a). The null hypothesis is rejected if SRₙ/n > C, for a fixed test threshold C > 0.
Noting that SRₙ = (SRₙ₋₁ + 1)f₁(Xₙ)/f₀(Xₙ), we consider the pair (SRₙ − n, ℑₙ = σ(X₁,...,Xₙ)) that satisfies
$$E_{H_0}\left(SR_n - n \,|\, \Im_{n-1}\right) = \left(SR_{n-1} + 1\right) E_{H_0}\left\{\frac{f_1(X_n)}{f_0(X_n)}\right\} - n = SR_{n-1} - (n-1),$$
i.e., (SRₙ − n, ℑₙ) is an H₀-martingale. Under the alternative, Jensen's inequality yields
$$E_{H_1}\left\{\frac{f_1(X_n)}{f_0(X_n)}\right\} \ge \frac{1}{E_{H_1}\left\{f_0(X_n)/f_1(X_n)\right\}} = 1,$$
so that SRₙ − n is an H₁-submartingale.
Proposition: Among all tests with the same Type I error rate, the Shiryayev–Roberts rule based on SRₙ/n maximizes the average power (1/n)Σᵢ₌₁ⁿ Pr_{H1}(test rejects H₀|ν = i).
Proof. To prove this statement, one can use inequality (3.2) in the form
$$\left(\frac{1}{n}SR_n - C\right)\left\{I\left(\frac{1}{n}SR_n \ge C\right) - \delta\right\} \ge 0,$$
where δ = δ(X₁,...,Xₙ) ∈ [0,1] is any decision rule and the event {δ = 1} rejects H₀.
Taking expectations under H₀ yields
$$\frac{1}{n}\sum_{k=1}^{n} E_{H_0}\left\{\frac{\prod_{i=1}^{k-1} f_0(X_i)\prod_{i=k}^{n} f_1(X_i)}{\prod_{i=1}^{n} f_0(X_i)}\, I\left(\frac{1}{n}SR_n \ge C\right)\right\} - C\Pr_{H_0}\left(\frac{1}{n}SR_n \ge C\right)$$
$$\ge \frac{1}{n}\sum_{k=1}^{n} E_{H_0}\left\{\frac{\prod_{i=1}^{k-1} f_0(X_i)\prod_{i=k}^{n} f_1(X_i)}{\prod_{i=1}^{n} f_0(X_i)}\, I\left(\delta(X_1,\ldots,X_n) \ \text{rejects } H_0\right)\right\} - C\Pr_{H_0}\left(\delta(X_1,\ldots,X_n) \ \text{rejects } H_0\right).$$
Since, for any event defined by the data,
$$E_{H_0}\left\{\frac{\prod_{i=1}^{k-1} f_0(X_i)\prod_{i=k}^{n} f_1(X_i)}{\prod_{i=1}^{n} f_0(X_i)}\, I(\cdot)\right\} = \int\ldots\int I(\cdot)\, f_0(x_1,\ldots,x_{k-1}) f_1(x_k,\ldots,x_n)\prod_{i=1}^{n} dx_i = \Pr_{H_1}\left(\cdot \,|\, \nu = k\right),$$
it follows that
$$\frac{1}{n}\sum_{i=1}^{n}\Pr_{H_1}\left(\frac{1}{n}SR_n \ge C \,\Big|\, \nu = i\right) \ge \frac{1}{n}\sum_{i=1}^{n}\Pr_{H_1}\left(\delta \ \text{rejects } H_0 \,|\, \nu = i\right),$$
when the Type I error rates Pr_{H0}(SRₙ/n ≥ C) and Pr_{H0}(δ rejects H₀) are equal.
The proof is complete.
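For concreteness, a brief sketch of computing the retrospective CUSUM and SR statistics for one data set follows; the densities f₀ = N(0,1), f₁ = N(1,1), the change location, and the sample size are illustrative.

# Retrospective CUSUM and Shiryayev-Roberts statistics for one sample.
set.seed(1)
n <- 100; nu <- 61
X <- c(rnorm(nu - 1), rnorm(n - nu + 1, mean = 1))  # change at time nu
lr <- dnorm(X, 1)/dnorm(X, 0)
Lambda.k <- rev(cumprod(rev(lr)))                   # Lambda_k^n, k = 1..n
c(CUSUM = max(Lambda.k), SR.over.n = mean(Lambda.k))  # reject if SR_n/n > C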
(i) Let X₁,...,Xₙ be from the density function f(x₁,...,xₙ;θ) with a parameter θ. We are interested in testing the hypothesis H₀: θ = θ₀ versus H₁: θ ≠ θ₀, where θ₀ is known. Define the maximum likelihood estimators of θ as θ̂ₙ = arg max_θ f(X₁,...,Xₙ;θ) and, sequentially, θ̂_{1,k} = arg max_θ f(X₁,...,X_k;θ), k = 1,...,n, with θ̂_{1,0} = θ₁₀ a fixed initial value. The maximum likelihood ratio is
$$MLR_n = \frac{\prod_{i=1}^{n} f(X_i|X_1,\ldots,X_{i-1};\hat{\theta}_n)}{\prod_{i=1}^{n} f(X_i|X_1,\ldots,X_{i-1};\theta_0)},$$
whereas the adaptive, sequentially estimated likelihood ratio is
$$ALR_n = \frac{\prod_{i=1}^{n} f(X_i|X_1,\ldots,X_{i-1};\hat{\theta}_{1,i-1})}{\prod_{i=1}^{n} f(X_i|X_1,\ldots,X_{i-1};\theta_0)}.$$
It turns out that (ALRₙ > 0, ℑₙ = σ(X₁,...,Xₙ)) is an H₀-martingale, since θ̂_{1,i−1} ∈ ℑᵢ₋₁ and
$$E_{H_0}\left(ALR_{n+1} | \Im_n\right) = \frac{\prod_{i=1}^{n} f(X_i|X_1,\ldots,X_{i-1};\hat{\theta}_{1,i-1})}{\prod_{i=1}^{n} f(X_i|X_1,\ldots,X_{i-1};\theta_0)}\; E_{H_0}\left(\frac{f(X_{n+1}|X_1,\ldots,X_n;\hat{\theta}_{1,n})}{f(X_{n+1}|X_1,\ldots,X_n;\theta_0)}\,\Big|\,\Im_n\right) = ALR_n.$$
Hence, by Proposition 4.3.3, Pr_{H0}(max_{1≤k≤n} ALR_k > C) ≤ 1/C.
(Alternatively, one can attempt to eliminate nuisance parameters by transforming the data into invariant statistics, e.g., Y₁ = X₁ − g_k, Y₂ = X₂ − g_k,..., Yₙ = Xₙ − g_k, where g_k = Σᵢ₌₁ᵏ Xᵢ/k with a fixed k < n or, for simplicity, Y₁ = X₂ − X₁, Y₂ = X₄ − X₃,....) Testing strategies based on invariant statistics are very complicated in general and cannot be employed in various situations, depending on forms of f(x₁,...,xₙ;θ,λ).
It is interesting to note that classical statistical methodology is directed toward the use of decision-making mechanisms with fixed significance levels. Classical inference requires special and careful attention from practitioners relative to the Type I error rate control, since it can be defined by different functional forms. In the testing problem with the nuisance parameter λ, for any presumed significance level α, the Type I error rate control may take the following forms:
(1) sup_λ Pr_{H0}(reject H₀) = α;
(2) lim_{n→∞} Pr_{H0}(reject H₀) = α;
(3) Pr_{H0}(reject H₀) = α for all values of λ;
(4) ∫Pr_{H0}(reject H₀)π(λ)dλ = α, for a weight function π > 0 with ∫π(λ)dλ = 1.
Form (1) from above is the classical definition (e.g., Lehmann and Romano, 2006). However, in some situations it may occur that sup_λ Pr_{H0}(reject H₀) = 1; if this is the case, the Type I error rate is not controlled appropriately. In addition, the formal notation of the supremum is mathematically complicated in terms of derivations and calculations, either analytically or via Monte Carlo approximations. Form (2) above is commonly applied when we deal with maximum likelihood ratio tests as it relates to Wilks' theorem (Wilks, 1938). However, since the form at (2) is an asymptotic large sample consideration, the actual Type I error rate Pr_{H0}(reject H₀) may not be close to the expected level α given small or moderate fixed sample sizes n. The form given above at (3) is rarely applied in practice. The form above at (4) is also introduced in Lehmann and Romano (2006). This definition of the Type I error rate control depends on the choice of the function π(λ) in general.
Now let us define the maximum likelihood estimators of θ and λ as
$$\left(\hat{\theta}_{1,k}, \hat{\lambda}_{1,k}\right) = \arg\max_{(\theta,\lambda)} f(X_1,\ldots,X_k;\theta,\lambda), \quad \hat{\theta}_{1,0} = \theta_{10},\ \hat{\lambda}_{1,0} = \lambda_{10}, \quad k = 0,\ldots,n,$$
and λ̂_{0,1,n} = arg max_λ f(X₁,...,Xₙ;θ₀,λ). The adaptive maximum likelihood ratio test statistic is
$$ALR_n = \frac{\prod_{i=1}^{n} f(X_i|X_1,\ldots,X_{i-1};\hat{\theta}_{1,i-1},\hat{\lambda}_{1,i-1})}{\prod_{i=1}^{n} f(X_i|X_1,\ldots,X_{i-1};\theta_0,\hat{\lambda}_{0,1,n})}.$$
Since λ̂_{0,1,n} maximizes the denominator, f(X₁,...,Xₙ;θ₀,λ̂_{0,1,n}) ≥ ∏ᵢ₌₁ⁿ f(Xᵢ|X₁,...,Xᵢ₋₁;θ₀,λ_{H₀}), where λ_{H₀} is a true value of λ under H₀. Then
$$ALR_n \le M_n \quad \text{with} \quad M_n = \frac{\prod_{i=1}^{n} f(X_i|X_1,\ldots,X_{i-1};\hat{\theta}_{1,i-1},\hat{\lambda}_{1,i-1})}{\prod_{i=1}^{n} f(X_i|X_1,\ldots,X_{i-1};\theta_0,\lambda_{H_0})},$$
where (Mₙ, ℑₙ) is an H₀-martingale by the argument of Scenario (i); hence Pr_{H0}(ALRₙ > C) ≤ Pr_{H0}(Mₙ > C) ≤ 1/C, whatever the value of λ_{H₀}.
This result is very general and does not depend on forms of f ( x1 , ..., xn ; θ, λ) .
(iii) Let X 1 ,..., X n be from the density function f ( x1 , ..., xn ; θ, λ) with parame-
ters θ and λ. Assume that we are interested in testing the hypothesis
H 0 : θ ∈(θL , θU ) versus H 1 : θ ∉(θL , θU ) , where θL , θU are known. The parame-
ter λ is a nuisance parameter that can have different values under H 0 and H 1,
respectively.
108 Statistics in the Health Sciences: Theory, Applications, and Computing
In this case, the adaptive maximum likelihood ratio test statistic can be
denoted in the form
n
∏ f (X |X ,..., X
i 1 i −1 ; θˆ 1, i −1 , λˆ 1, i −1 )
ALRn = i =1
n ,
∏ f (X i |X 1 ,..., X i −1 ; θˆ 0,1, n , λˆ 0,1, n )
i =1
where
(θˆ 1, k )
, λˆ 1, k = arg max(θ∉(θL , θU ), λ) f (X 1 , ..., X n ; θ, λ), θˆ 1,0 = θ10 , λˆ 1,0 = λ 10 , k = 0, ..., n
and
(θˆ 0,1, k )
, λˆ 0,1, k = arg max(θ∈(θL , θU ), λ) f (X 1 , ..., X n ; θ, λ), k = 1, ..., n
i = 1,..., ν − 1, j = ν,..., n,
⎧⎪ Πn f (X ; θˆ )⎫⎪
Λ n = max ⎨ i =nk 1 i i + 1, n ⎬ , θˆ l , n = arg max θ {Πni = l f1 (X i ; θ)} , θˆ n+ 1, n = θ0
1≤ k ≤ n ⎪ Π i = k f 0 (X i ; θ 0 ) ⎪
⎩ ⎭
(
that implies the pair Λ n , ℑnk = σ (X k ,..., X n) is an H 0-nonnegative reverse )
martingale (Section 4.4.3.1). Then, by virtue of Proposition 4.3.3, the Type I
error rate of the considered change point detection procedure satisfies
PrH0 (Λ n > C) ≤ 1/ C.
Martingale Type Statistics and Their Applications 109
i = 1,..., ν − 1, j = ν,..., n,
⎧⎪ Πn f (X ; θˆ )⎫⎪
Λ n = max ⎨ i =n k 1 i 1, i+1, n ⎬ , θˆ s, l , n = arg max θ {Πin=l f s (X i ; θ)} ,
1≤ k ≤ n ⎪ Π ˆ
⎩ i = k f0 (X i ; θ0, k , n ) ⎭⎪
where θ10 is a fixed reasonable variable. Then, using the techniques presented
in Scenarios (i), (iii), and (iv) above, we conclude that the Type I error rate of
this change point detection procedure satisfies the very general inequality
supθ0 PrH0 (Λ n > C) ≤ 1/ C.
UCL
Group summary statistics
12
11
CL
10
8
LCL
1
Group
Number of groups = 23
Center = 10.23913 LCL = 7.724624 Number beyond limits = 0
StdDev = 0.8381689 UCL = 12.75364 Number violating runs = 1
FIGURE 4.1
The X-chart for a data example containing one-at-a-time measurements of a continuous pro-
cess variable presented in Gavit, P. et al., BioPharm International, 22, 46–55, 2009.
Martingale Type Statistics and Their Applications 111
line is the mean. Based on the X-chart shown in Figure 4.1, there is no
evidence showing that the process is out of control.
74.020
Group summary statistics
UCL
74.010
74.000 CL
73.990
LCL
1 3 5 7 9 12 15 18 21 24 27 30 33 36 39
Group
Number of groups = 40
Center = 74.00118 LCL = 73.98805 Number beyond limits = 3
StdDev = 0.009785039 UCL = 74.0143 Number violating runs = 1
FIGURE 4.2
The Shewhart chart (X-bar chart) for both calibration data and new data, where all the statistics
and the control limits are solely based on the calibration data, i.e., the first 25 samples.
112 Statistics in the Health Sciences: Theory, Applications, and Computing
n
Π ni= j f1 (X i )
N (C) = min {n ≥ 1 : SRn > C} , SRn = ∑ Λ ,Λ
j=1
n
j
n
j =
Π ni= j f0 (X i )
,
determining when one should stop sampling and claim a change has
occurred. In order to perform the N (C) strategy, we need to provide instruc-
tions regarding how to select values of C > 0 . This issue is analogous to that
related to defining test thresholds in retrospective decision-making mecha-
nisms with controlled Type I error rates. Certainly, we can define relatively
small values of C, yielding a very powerful detection policy, but if no change
has occurred (under H 0 ) most probably the procedure will stop when we do
not need to stop and the stopping rule will be not useful. Then it is vital to
monitor the average run length to false alarm EH0 {N (C)} in the form of
determining C > 0 to satisfy EH0 {N (C)} ≥ D, for a presumed level D. Thus
defining, e.g., D = 100 and stopping at N (C) = 150 we could expect that a false
alarm is in effect and then just rerun the detection procedure after additional
Martingale Type Statistics and Their Applications 113
the false alarm rate is to be satisfied, choosing C = B (or C = B/a ) does the job
(see Pollak, 1987, and Yakir, 1995, for the basic theory).
Alternative approaches do not have this property. When a method is
based on a sequence of statistics {Tn} that is an H 0 -submartingale when
the process is in control (such as CUSUM), we follow Brostrom (1997) in
applying Doob’s decomposition to {Tn} and extract the martingale com-
ponent. (The anticipating component depends on the past and does not
contain new information about the future.) We transform the martingale
component into a Shiryayev–Roberts form. We apply this procedure to
several examples. Among others, we show that our approach can be used
to obtain nonparametric Shiryayev–Roberts procedures based on ranks.
Another application is to the case of estimation of unknown postchange
parameters.
4.5.1 Motivation
To motivate the general method, consider the following example. Suppose
that observations Y1 , Y2 , ... are independent, Y1 , ..., Yν− 1 ~ N (0, 1) when the
process is in control, Yν , Yν+ 1 , ... ~ N (θ, 1) when the process is out of control,
where ν is the change point and θ is unknown. Let Pν and Eν denote the
probability measure and expectation when the change occurs at ν. The value
of ν is set to ∞ if no change ever takes place in the sequence of observations.
Where θ is known, the classical Shiryayev–Roberts procedure would declare
a change to be in effect if SRnI ≥ C for some prespecified threshold C, where
n
⎧⎪ n ⎫⎪
SR = I
n
k =1
∑
exp ⎨
⎪⎩ i = k
∑(
θYi − θ2 / 2 ⎬
⎪⎭
) (4.3)
∑
i−1
Vexler (2006) propose to replace θ in the k th term of SRnI by Yj /(i − k ).
j= k
(The same idea appears in Robbins and Siegmund (1973) and Dragalin (1997)
in a different context.) A criticism of this approach is that foregoing the use
of Yk ,..., Yi − 1 without Yi ,..., Yn when estimating θ in the k th term is inefficient
(e.g., Lai, 2001, p. 398).
An alternative is to estimate θ in the k th term of SRnI by the Pν= k maximum
∑
n
likelihood estimator θˆ k , n = Yj /(n − k + 1) , so that SRnI becomes
j= k
n− 1
⎧⎪ ⎫
∑ Y − (n − k + 1) (θˆ ) / 2⎪⎬⎭⎪ + 1
n
∑
2
SRnI = exp ⎨θˆ k ,n i k ,n (4.4)
k =1 ⎩⎪ i= k
Martingale Type Statistics and Their Applications 115
( ) ( )
1 in order that EH0 SRnI = E∞ SRnI < ∞ ). Note that SRnI , ℑn = σ (Y1 , ..., Yn) is a ( )
P∞-submartingale, for
⎧⎪n− 1
⎫
∑ Y − (n − k + 1) (θˆ ) / 2⎪⎬⎪⎭ + 1
n
∑
2
SR ≥
I
n exp ⎨θˆ k , n− 1 i k , n− 1
k =1 ⎪⎩ i= k
(
E∞ SRnI |ℑ n− 1 )
n− 2 ⎧ n−1 ⎫ ⎛ ⎧ n−1 ⎫ ⎞
∑exp ⎪⎨⎪⎩θˆ ∑Y − 21 (n − k) (θˆ ) ⎪⎬⎪⎭E ⎜⎜⎜ exp ⎪⎨⎪⎩θˆ ∑Y − 21 (θˆ ) ⎪⎬⎪⎭ |ℑ ⎟
2 2
≥ k , n−1 i k , n−1 ∞ Y
k , n−1 n i k , n−1 n−1 ⎟
⎟
k =1 i=k ⎝ i=k ⎠
+1 = SRnI− 1 .
Generally, one can show that applying the maximum likelihood estima-
tion of the unknown postchange parameters to the initial Shiryayev–Roberts
statistics of the parametric detection schemes leads to the same submartin-
gale property.
Our goal is to transform SRnI into SRn such that (SRn − n, ℑn) is a P∞-martin-
gale with zero expectation, in such a manner that relevant information will
not be lost. Rather than continue with this example, we present a general
theory.
(SR ) W + 1 ,
I
Wn =
n− 1
E (SR |ℑ ) I
n− 1
( )
SRn = SRnI Wn , W0 = 0, n = 1, 2, ... (4.5)
∞ n n− 1
Lemma 4.5.1.
(
S n is a martingale component of Sn (i.e., S n , ℑn is a martingale);
M M
)
is ℑn − 1-measurable, having the predictable property, S n ≥ S n− 1 (a.s.),
P P P
S n
S 1 = 1 and S n − S n− 1 = E (Sn − Sn− 1 |ℑn− 1) .
P P P
M
of G n is given by
G'
M
n
= G'
M
n−1 (
+ an G n − G n − 1 ,
M M
) (4.6)
( )
Proposition 4.5.1. Let SRnI , ℑn be a nonnegative P∞-submartingale
( )
SR0I = 0 . Then, under P∞-regime,
( )
1. SRn = SRnI Wn is a nonnegative submartingale with respect to ℑn;
M
M
2. SR n is a martingale transform of SR I n
, where Wn is defined
by (4.5).
Martingale Type Statistics and Their Applications 117
SR n = SRn − SR n = SRn − n.
M P
Therefore,
{
SR n − SR n−1 = SRnI − E∞ SRnI |ℑ n−1 Wn .
M M
( )}
On the other hand, by Lemma 4.5.1 one can show that
SR I M
n
− SR I M
n−1
= SR − I
n ∑{E (SR |ℑ ∞
I
k ) − SR } − SR
k −1
I
k −1
I
n−1
k
n−1
− ∑{E (SR |ℑ ∞
I
k k −1 ) − SR } I
k −1
k
SR '
M
n
= SR '
M
n−1
+ an SR I ( M
n
− SR I
M
n−1 ),
where SR 'n ∈ℑn is a submartingale with respect to ℑn and E∞ (SR 'n |ℑ n−1) =
SR 'n−1 + 1. Then SR 'n = (SRnI) Wn + ς n, where Wn ∈ℑn− 1 is defined by (4.5),
(ς n , ℑ n) is a martingale with zero expectation, and ς n is a martingale trans-
M
form of SR I n
.
118 Statistics in the Health Sciences: Theory, Applications, and Computing
( )
Proof. We have SR 'n = SRnI Wn + ς n, where ς n = SR 'n − SRnI Wn ∈ℑn. Consider ( )
the conditional expectation
(( )
E∞ (SR 'n |ℑ n−1) = E∞ SRnI Wn |ℑ n−1 + E∞ (ς n |ℑ n−1) )
( )
= SRnI−1 Wn−1 + 1 + E∞ (ς n |ℑ n−1)
= SR 'n−1 − ς n−1 + 1 + E∞ (ς n |ℑ n−1) .
On the other hand, by the statement of this proposition, E∞ (SR 'n |ℑ n−1) =
SR 'n−1 + 1. Therefore, we conclude with E∞ (ς n |ℑn− 1) = ς n− 1.
From the basic definitions and Lemma 4.5.1 (in a similar manner to the
proof of Proposition 4.5.1) it follows that
SR '
M
n
− SR '
M
n−1 ( ) ) (
= SRnI Wn + ς n − SRnI − 1 Wn − 1 − ς n − 1 − SR ' n + SR '
P P
n−1
= (SR ) W
I
n n + ς − W E (SR |ℑ ) + 1 − ς
n n ∞
I
n n−1 n−1 − E∞ (SR 'n − SR 'n − 1 |ℑ n − 1)
= ς n − ς n − 1 + SR{ I M
n
− SR I M
}W .
n−1
n
Hence
M
Interpretation: We set out to extract the martingale component SR I n
of
I M
SR and to get a martingale transform SR = SRn − n of SR n such that
I M
n n
SRn − n is a P∞-martingale with zero expectation. However, if we are given
any SR 'n with the property that SR 'n − n is a P∞-martingale with zero expec-
tation, then Proposition 4.5.2 states that SR 'n = SRn + ς n where ς n is a martin-
M
gale transform of SR I n . Therefore, note that the information in SRn and in
SR 'n is the same. In other words, from an information point of view, our
procedure is equivalent to any other procedure based on a sequence SRnI { }
that has the same martingale properties as {SRn}.
An obvious property of our procedure is:
{ }
n− 1
∑ ⎛⎜⎝ n n− −k +k 1⎞⎟⎠ ( )
1/2
( )
2
E∞ SR |ℑn− 1 =
I
n exp (n − k ) θˆ k , n− 1 /2 + 1
k =1
∑
n
θ > 0 (i.e., Yν , Yν+ 1 ,... ~ f1 (u; θ) = θ exp(−θu)). Therefore, θˆ k ,n = (n − k + 1)/ Yj is
j= k
the maximum likelihood estimator and the initial Shiryayev–Roberts
statistic is
n−1 n−1
⎪⎧ ⎞ ⎪⎫
n
∑∏ ∑( )
f1 (Yi ; θˆ k , n ) ⎛
( )
n− k +1 −1
SRnI = +1= θˆ k , n exp ⎨(n − k + 1) ⎜ θˆ k , n − 1⎟ ⎬ + 1.
k =1 i = k
f 0 (Yi )
k =1
⎩⎪ ⎝ ⎠ ⎭⎪
Since
⎛ n ⎞
E∞ ⎜
⎜ ∏
⎝ i= k
f1 (Yi ; θˆ k , n )
f0 (Yi )
|ℑ n−1⎟
⎟
⎠
∞
( )
∫⎛
−1
ˆ
n − k + 1 ( n − k ) θk , n−1 −( n − k + 1) eu
= (n − k + 1) e e − u du
−1⎞ n − k + 1
0
⎝ (
⎜u + (n − k ) θˆ k , n−1 ⎟
⎠ )
{ }
n− k +1
⎛ n − k + 1⎞
(θˆ ) ( )
n− k −1
=⎜ ⎟ k , n−1 exp (n − k ) θˆ k , n−1 − (n − k + 1) ,
⎝ n− k ⎠
we have
{ }
n− 1 n− k + 1
Yi = θxi I (i ≥ ν) + εi , i ≥ 1. (4.7)
where θ is the unknown regression parameter, xi , i ≥ 1, are fixed predictors,
ε i , i ≥ 1, are independent random disturbance terms with standard normal
120 Statistics in the Health Sciences: Theory, Applications, and Computing
density and I(⋅) is the indicator function. In this case, the maximum likeli-
hood estimator of the unknown parameter is
∑
n
Yj x j
j= k
θˆ k , n = . (4.8)
∑
n
x 2j
j= k
⎧⎛ ⎞ 2⎫
{( } ∑
⎡ ⎤ ⎛ ⎞ 1/2
∑
n−1
) ⎪⎜ Yj x j⎟⎟ ⎪
2 n
⎢ exp − Yi − θˆ k , n xi / 2 ⎥ ⎜ x 2j ⎟ ⎪⎪ ⎜⎝ ⎠ ⎪⎪
n
E∞ ⎢⎢ ∏ { }
⎜
|ℑ n − 1⎥⎥ = ⎜
j=k ⎟
⎟ exp ⎨
j=k
⎬,
∑ ∑
n−1 n−1
exp − (Yi) / 2 ⎪ 2 x 2j ⎪⎪
2
⎢ i=k ⎥ ⎜⎜ x 2j ⎟⎟ ⎪
⎢⎣ ⎥⎦ ⎝ j=k ⎠ ⎪⎩ j=k
⎪⎭
the transformed detection scheme for the model (4.7) has form (4.5), where
⎧⎛ ⎞ 2⎫
∑
n
⎪⎜ Yj x j⎟⎟ ⎪
n−1
⎪⎪ ⎜⎝ ⎠ ⎪⎪
SRn =
I
∑ exp ⎨
j=k
⎬ + 1,
⎪ 2
∑ x 2j ⎪⎪
n
k =1
⎪ j=k
⎪⎩ ⎪⎭
⎧⎛
(4.9)
⎞ 2⎫
⎛
∑
⎞ 1/2
∑
n−1
⎪⎜ Yj x j ⎟⎟ ⎪
n
n−1 ⎜ x ⎟
2
⎪⎪ ⎜⎝ ⎠ ⎪⎪
∑
j
⎜ ⎟
( ) j=k j=k
E∞ SRnI |ℑ n − 1 = ⎜ ⎟ exp ⎨ ⎬ + 1.
∑ ∑
n−1 n−1
k =1 ⎜ x 2j ⎟⎟ ⎪ 2 x 2j ⎪⎪
⎜ ⎪
⎝ j=k ⎠ ⎪⎩ j=k
⎪⎭
∏ f (Y )
f1 (Yi )
M(C) = min {n ≥ 1 : Λ n ≥ C} , where Λ n = max
1≤ k ≤ n 0 i
i= k
n n
0
i
i
k =1 i = k
M
Proposition 4.5.3. The P∞-martingale component SR n of the Shiryayev-
Roberts statistic SRn is a martingale transform of Λ n , which is the
M
Proof. We have
⎧ n n ⎫
⎪
Λ n = max ⎨ ∏
f1 (Yi )
, ∏
f1 (Yi )
⎪⎩ i =1 f0 (Yi ) i = 2 f0 (Yi )
f (Y )⎪
,..., 1 n ⎬
f0 (Yn )⎪⎭
⎡ ⎧ n n n ⎫ ⎤
= max ⎢⎢ max ⎨
⎢⎣
⎪
∏
f1 (Yi )
, ∏
f1 (Yi )
⎪⎩ i =1 f0 (Yi ) i = 2 f0 (Yi )
,..., ∏
f1 (Yi )⎪ f1 (Yn )⎥
⎬ ,
f0 (Yi )⎪⎭ f0 (Yn )⎥⎥
i = n−1 ⎦ (4.10)
⎡ ⎧ n−1 n−1 n−1 ⎫ ⎤
=
f1 (Yn )
f0 (Yn )
⎢ ⎪
max ⎢ max ⎨
⎪⎩
∏ f1 (Yi )
f0 (Yi )
, ∏ f1 (Yi )
f0 (Yi )
,..., ∏ f1 (Yi )⎪ ⎥
⎬ ,1
f0 (Yi )⎪ ⎥⎥
⎣⎢ i =1 i=2 i = n−1 ⎭ ⎦
f (Y )
= 1 n max (Λ n ,1) .
f0 (Yn )
Λ
M
n
= Λn − ∑{max (Λ k −1 ,1) − Λ k −1}.
k
Hence, by (4.10)
Λ − Λ = Λ n − Λ n−1 − max (Λ n−1 ,1) + Λ n−1
M M
n n−1
⎧⎪ f (Y ) ⎫⎪
= max (Λ n−1 ,1) ⎨ 1 n − 1⎬ .
⎩⎪ f0 (Yn ) ⎭⎪
122 Statistics in the Health Sciences: Theory, Applications, and Computing
f1 (Yn )
SR n − SR n−1 = SRn − n − SRn−1 + (n − 1) = (SRn−1 + 1) − (SRn−1 + 1)
M M
f0 (Yn )
⎪⎧ f (Y ) ⎪⎫
= (SRn−1 + 1) ⎨ 1 n − 1⎬ .
⎩⎪ f0 (Yn ) ⎭⎪
SR 'n =
(Λ n−1Wn−1 + 1) Λ
max (Λ n−1 ,1)
n
f1 (Yn )
SRn = (SRn + 1) , SR0 = 0,
f0 (Yn )
rj , n = ∑ I (Y ≤ Y ), ρ
i j n = rn, n /(n + 1), Zn = Φ−1 (ρn ), ℑ n = σ (ρ1 ,..., ρn) ,
i =1
∑e ∑
n
( )
2 θ Zi −θ2 ( n − k + 1)/2
SRθI , n = e θZn −θ /2
SRθI , n−1 + 1 = i= k
, SRθI ,0 = 0. (4.12)
k =1
(Note that if Z1 , Z2 ,... were exactly standard normal, then (4.12) is the same as
(4.3).) Applying our methodology, letting Wθ ,0 = 0, define recursively
( )
SRθ ,n = SRθI ,n Wθ ,n , Wθ ,n =
(SR )W + 1 I
θ , n− 1 θ , n− 1
E (SR |ℑ ) ∞
I
θ,n n− 1
(4.13)
=
(SR )W + 1
I
θ , n− 1 θ , n− 1
,
∑ ⎧ i ⎞ ⎫
1
exp ⎨Φ ⎛⎜ ⎟ θ⎬ (SR )
− θ2 1 n
−1
e 2 I
θ , n− 1 +1
n i=1 ⎩ ⎝ n + 1⎠ ⎭
{
N θ (C) = min n ≥ 1 : SRθ , n ≥ C }
and therefore E∞ {N θ (C)} ≥ C.
Remark 1. For the sake of clarity, the example above was developed to give a
robust detection method if one believes the observations to be approximately
normal (and one is on guard against a change of mean). This method can be
applied to other families of distributions in a similar manner.
Remark 2. Instead of fixing θ, one can fix a prior G for θ, and define the Bayes
factor type procedure
( )
SRn = SRnI Wn , SRnI = SRθI ,n dG(θ),
∫
Wn =
(SR )W I
n− 1 n− 1 +1
, (4.14)
( 1
) ∑ ∫ ⎧
exp ⎨Φ −1 ⎛⎜
i ⎞ 1 2⎫
n
SRnI−1 + 1
⎝ ⎟⎠ θ − θ ⎬ dG(θ)
n i=1 ⎩ n+1 2 ⎭
SR = W0 = 0.
I
0
Note that if G is a conjugate prior (e.g., G = N (.,.)), then the integral in the
denominator of Wn can be calculated explicitly. An alternative approach to Equa-
tion (4.14) is simply to define SRn =
∫ SR θ, n dG(θ). However, the calculation of
∑ ∏ exp {− 21 (Y − θˆ
) − 21 (Y ) } ≥ A⎪⎬⎪
⎧⎪ n n
2 ⎫
N (1) ( A) = min ⎨n ≥ 1 :
2
i k ,i−1 i x i
⎪⎩ k =1 i= k ⎭
based on the application of the nonanticipating estimation {θˆ } , where the k ,i−1
ˆ ˆ
maximum likelihood estimator 4.8 defines θ k , l (θ k , l = 0, if l < k ). Alterna-
tively, let
Martingale Type Statistics and Their Applications 125
FIGURE 4.3
{ }
Comparison between the Monte Carlo estimator of E∞ N (1) ( A) /A ( ) and the Monte Carlo
{ }
estimator of E∞ N (2) ( A) /A (—∘—). The curve (--∘--) images the Monte Carlo estimator of
{ } { }
E∞ N (1) ( A) /E∞ N (2) ( A) .
126 Statistics in the Health Sciences: Theory, Applications, and Computing
⎡ n
⎤
N '( A) = min ⎢n ≥ 0 : Λ n = Λ n −
⎢⎣
M
∑ {max (Λ
k =1
k −1 , 1) − Λ k − 1} ≥ A⎥
⎥⎦
0
i
i
i= k
24 24
22 22
20 20
18 18
16 16
14 14
12 12
10 10
8 8
6 6
4 4
2 2
FIGURE 4.4
{ }
Comparison between the Monte Carlo estimator of E1 N (1) ( A) /(5 log( A)) ( ) and the
{ }
Monte Carlo estimator of E1 N (2) ( A) /(5 log( A)) (—∘—). The curve (- - - -) images the Monte
Carlo estimator of E {N ( A)} /E {N
1
(1)
1
(2)
}
( A) . The graphs (a) and (b) correspond to the model
(4.7), where θ = 1/4, 1/2 , respectively.
Martingale Type Statistics and Their Applications 127
90
85
80
75
FIGURE 4.5
The curved lines (—∘—) and (——) denote the Monte Carlo estimators of E50 {N '( A)} and
E50 {M( A)}, respectively.
128 Statistics in the Health Sciences: Theory, Applications, and Computing
26
100
25
24
80
23
60
22
21
40
29
20
100 150 200 250 100 300 500
nu nu
B = 350 B = 550
FIGURE 4.6
The curved lines (—∘—) and (⋯⋅) denote the Monte Carlo estimators of L (N θ=0.5 (B/1.3301), ν)
and L (N o (B/1.3316), ν) , respectively, plotted against ν.
5.1 Introduction
Over the years statistical science has developed two schools of thought as it
applies to statistical inference. One school is termed frequentist methodol-
ogy and the other school of thought is termed Bayesian methodology. There
has been a decades-long debate between both schools of thought on the opti-
mality and appropriateness of each other’s approaches. The fundamental
difference between the two schools of thought is philosophical in nature
regarding the definition of probability, which ultimately one could argue is
axiomatic in nature.
The frequentist approach denotes probability in terms of long run aver-
ages, e.g., the probability of getting heads when flipping a coin infinitely
many times would be ½ given a fair coin. Inference within the frequentist
context is based on the assumption that a given experiment is one realization
of an experiment that could be repeated infinitely many times, e.g., the notion
of the Type I error rate discussed earlier is that if we were to repeat the same
experiment infinitely many times we would falsely reject the null hypothesis
a fixed percentage of the time.
The Bayesian approach defines probability as a measure of an individu-
al’s objective view of scientific truth based on his or her current state of
knowledge. The state of knowledge, given as a probability measure, is then
updated through new observations. The Bayesian approach to inference is
appealing as it treats uncertainty probabilistically through conditional
distributions. In this framework, it is assumed that a priori information
regarding the parameters of interest is available in order to weight the cor-
responding likelihood functions toward the prior beliefs via the use of Bayes
theorem.
Although Bayesian procedures are increasingly popular, expressing accu-
rate prior information can be difficult, particularly for nonstatisticians and/
or in a multidimensional setting. In practice one may be tempted to use the
underlying data in order to help determine the prior through estimation of
the prior information. This mixed approach is usually referred to as empiri-
cal Bayes in the literature (e.g., Carlin and Louis, 2011).
129
130 Statistics in the Health Sciences: Theory, Applications, and Computing
We take the approach that the Bayesian, empirical Bayesian, and frequen-
tist methods are useful and that in general there should not be much dis-
agreement between the scientific conclusions that are drawn in the practical
setting. Our ultimate goal is to provide a roadmap for tools that are efficient
in a given setting. In general, Bayesian approaches are more computationally
intensive than frequentist approaches. However, due to the advances in com-
puting power, Bayesian methods have emerged as an increasingly effective
and practical alternative to the corresponding frequentist methods. Perhaps,
empirical Bayes methods somewhat bridge the gap between pure Bayesian
and frequentist approaches.
This chapter provides a basic introduction to the Bayesian view on statisti-
cal testing strategies with a focus on Bayes factor–type principles.
As an introduction to the Bayesian approach, in contrast to frequentist meth-
ods, first consider the simple hypothesis test H 0 : θ = θ0 versus H 1 : θ = θ1, where
the parameters θ0 and θ1 are known. Given an observed random sample
X = {X 1 ,..., X n }, we can then construct the likelihood ratio test statistic
LR = f (X |θ1 ) f (X |θ0 ) for the purpose of determining which hypothesis is
more probable. (Although, in previous chapters, we used the notation f (X ; θ)
for displaying density functions, in this chapter we also state f (X |θ) as the
density notation in order to be consistent with the Bayesian aspect that — “the
data distribution is conditional on the parameter, where the parameter is a
random variable with a prior distribution.”) The decision-making procedure
is to reject H 0 for large values of LR. In this case, the decision-making rule
based on the likelihood ratio test is uniformly most powerful. Although this
classical hypothesis testing approach has a long and celebrated history in the
statistical literature and continues to be a favorite of practitioners, it can be
applied straightforwardly only in the case of simple null and alternative
hypotheses, i.e., when the parameter under the alternative hypothesis, θ1, is
known.
Various practical hypothesis testing problems involve scenarios where the
parameter under the alternative θ1 is unknown, e.g., testing the composite
hypothesis H 0 : θ = θ0 versus H 1 : θ ≠ θ0. In general, when the alternative
parameter is unknown, the parametric likelihood ratio test is not applicable,
since it is not well defined. When a hypothesis is composite, Neyman and
Pearson suggested replacing the density at a single parameter value with the
maximum of the density over all parameters in that hypothesis. As a result
of this groundbreaking theory the maximum likelihood ratio test became a
key method in statistical inference (see Chapter 3).
It should be noted that several criticisms pertaining to the maximum like-
lihood ratio test may be made. First, in a decision-making process, we do not
generally need to estimate the unknown parameters under H 0 or H 1. We just
need to make a binary decision regarding whether to reject or not to reject
the null hypothesis. Another criticism is that in practice we rely on the func-
tion f (X |θˆ ) in place of f (X |θ), which technically does not yield a likelihood
function, i.e., it is not a proper density function. Therefore, the maximum
Bayes Factor 131
likelihood ratio test may lose efficiency as compared to the likelihood ratio
test of interest (see details in Chapters 3 and 4).
Alternatively, one can provide testing procedures by substituting the
unknown alternative parameter by variables that do not depend on the
observed data (see details in Chapter 4). Such approaches can be extended
to provide test procedures within a Bayes factor type framework. We can
integrate test statistics through variables that represent the unknown
parameters with respect to a function commonly called the prior distribu-
tion. This approach can be generalized to more complicated hypotheses and
models.
A formal Bayesian concept for analyzing the collected data X involves the
specification of a prior distribution for all parameters of interest, say θ,
denoted by π(θ), and a distribution for X, denoted by Pr(X|θ). Then the pos-
terior distribution of θ given the data X is
Pr(X |θ)π(θ)
Pr(θ|X ) = ,
∫ Pr(X |θ)π(θ) dθ
which provides a basis for performing formal Bayesian analysis. Note that
maximum likelihood estimation (Chapter 3) in light of the posterior distri-
bution function, where Pr(X |θ) is expressed using the likelihood function
based on X, has a simple interpretation as the mode of the posterior distribu-
tion, representing a value of the parameter θ that is “most likely” to have
produced the data.
Consider an interesting example to illustrate advantages of the Bayesian
methodology. Assume we observe a sample X = {X 1 ,..., X n } of iid data points
from a density function f (x|θ) , where θ is the parameter of interest. In order
to define the posterior density
∏
n
f (X i |θ)π(θ)
g(θ|X ) = i=1
,
∫∏
n
f (X i |θ)π(θ) dθ
i=1
∏ f (X |θ)π(θ)G ,
n
k i k
g(θ, k|X ) = i=1
∑ G ∫ ∏ f (X |θ)π(θ) dθ
T n
k k i
k =1 i=1
132 Statistics in the Health Sciences: Theory, Applications, and Computing
∫ π(θ) dθ = 1. One can extend the technique of Section 4.4.1.2 in order to pro-
pose the test statistic
Bn =
∫f H1 (X 1 ,..., X n |θ)π(θ) dθ
,
f H0 (X 1 ,..., X n |θ0 )
⎧ n+ 1
⎫
⎪
⎪
∏f H1 (X i |X 1 , ..., X i−1 , θ) ⎪
⎪
=
∫E H0 ⎨
i=1
n+ 1 |ℑn ⎬ π(θ) dθ
⎪
⎪
⎩
∏
i=1
f H0 (X i |X 1 , ..., X i−1 , θ0 ) ⎪
⎪
⎭
n
∏f H1 (X i |X 1 , ..., X i−1 , θ)
⎪⎧ f H (X n+1 |X 1 , ..., X n , θ) ⎫⎪
=
∫ i=1
n EH0 ⎨ 1
⎩⎪ f H0 (X n+1 |X 1 , ..., X n , θ0 )
|ℑn ⎬ π(θ) dθ
⎭⎪
∏f
i=1
H0 (X i |X 1 , ..., X i−1 , θ0 )
∏f H1 (X i |X 1 , ..., X i−1 , θ)
⎪⎧ f H1 (u|X 1 , ..., X n , θ) ⎫⎪
=
∫ i=1
n ⎨∫f
⎩⎪ H 0 (u|X 1 , ..., X n , θ 0 )
f H0 (u|X 1 , ..., X n , θ0 ) du⎬ π(θ)dθ
⎭⎪
∏f
i=1
H0 (X i |X 1 , ..., X i−1 , θ0 )
n
∏f H1 (X i |X 1 , ..., X i−1 , θ)
=
∫ i=1
n π(θ) dθ = Bn .
∏
i=1
f H 0 (X i |X 1 , ..., X i−1 , θ0 )
Thus, according to Section 4.4.1, we can anticipate that the test statistic Bn can
be optimal. This is displayed in the following result: the decision rule based
on Bn provides the integrated most powerful test with respect to the func-
tion π(θ).
Proof. Taking into account the inequality ( A − B) {I ( A ≥ B) − δ} ≥ 0, where δ
could be either 0 or 1, described in Section 3.4, we define A = Bn and B = C (C
is a test threshold: the event {Bn > C} implies one should reject H 0 ), and δ
represents a rejection rule for any test statistic based on the observations; if
δ = 1 we reject the null hypothesis. Then it follows that
EH0 {Bn I (Bn ≥ C)} − CEH0 {I (Bn ≥ C)} ≥ EH0 (Bnδ − Cδ),
Since we control the Type I error rate of the tests to be Pr {Test rejects H 0 |H 0} = α ,
we have
That is,
∫ {∫ … ∫ f H1 }
(u1 ,..., un |θ)I (Bn ≥ C)du1 dun π(θ) dθ
≥ ∫ {∫ … ∫ f H1 }
(u1 ,..., un |θ)I (δ(u1 ,..., un ) = 1)du1 dun π(θ)dθ.
≥
∫ Pr {δ rejects H |H 0 1 with the alternative parameter θ=θ1}π(θ1 )θ1 ,
Pr(X |H 1 )
B10 = , (5.1)
Pr(X |H 0 )
which is named the Bayes factor. Therefore, posterior odds (odds = proba-
bility/ (1−probability)) is the product of the Bayes factor and the prior odds.
In other words, the Bayes factor is the ratio of the posterior odds of H 1 to its
prior odds, regardless of the value of the prior odds. When there are unknown
parameters under either or both of the hypotheses, the densities Pr(X |H k )
(( k = 0,1)) can be obtained by integrating (not maximizing) over the param-
eter space, i.e.,
Pr(X |H k ) =
∫ Pr(X|θ , H )π(θ |H ) dθ , k = 0, 1,
k k k k k
I = Pr(X |H ) =
∫ Pr(X|θ, H )π(θ|H ) dθ.
For simplicity and for ease of exposition, the subscript k ( k = 0,1) as the hypo-
thesis indicator is eliminated in the notation H here.
In some cases, the density (integral) I may be evaluated analytically. For
example, an exact analytic evaluation of the integral I is possible for expo-
nential family distributions with conjugate priors, including normal linear
models (DeGroot, 2005; Zellner, 1996). Exact analytic evaluation is best in the
sense of accuracy and computational efficiency, but it is only feasible for a
narrow class of models.
More often, numerical integration, also called “quadrature,” is required
when the analytic evaluation of the integral I is intractable. Generally,
most relevant software is inefficient and of little use for the computation of
these integrals. One reason is that when sample sizes are moderate or
large, the integrand becomes highly peaked around its maximum, which
may be found more efficiently by other techniques. In this instance, gen-
eral purpose quadrature methods that do not incorporate knowledge of
the likely maximum are likely to have difficulty finding the region where
the integrand mass is accumulating. In order to demonstrate this intui-
tively, we first show that an important approximation of the log-likelihood
function is based on a quadratic function of the parameter, e.g., θ. We do
not attempt a general proof but provide a brief outline in the unidimen-
sional case. Suppose the log-likelihood function l(θ) is three times differ-
entiable. Informally, consider a Taylor approximation to the log-likelihood
function l(θ) of second order around the maximum likelihood estimator
(MLE) θ̂, that is,
() ()
l (θ) ≈ l θˆ + (θ − θˆ )l′ θˆ +
(θ − θˆ )2 ˆ
2!
()
l′′ θ .
138 Statistics in the Health Sciences: Theory, Applications, and Computing
Note that at the MLE θ̂, the first derivative of the log-likelihood function
l ′(θˆ ) = 0. Thus, in the neighborhood of the MLE θ̂, the log-likelihood function
l (θ) can be expressed in the form
()
l (θ) ≈ l θˆ +
(θ − θˆ )2 ˆ
2!
l′′ θ . ()
(We required that l ''' (θ) exists, since the remainder term of the approximation
above consists of a l ''' component.) The following example provides some
intuition about the quadratic approximation of the log-likelihood function.
Example
Let X = {X 1 ,..., X n } denote a random sample from a gamma distribution
with unknown shape parameter θ and known scale parameter κ 0. The
density function of the gamma distribution is
f (x ; θ, κ 0) = x θ− 1 exp(− x κ 0 ) (Γ(θ)κθ0 ), where the gamma function
∞
Γ (θ) =
∫ 0
t θ − 1 exp( −t) dt. The log-likelihood function of θ is
∑ ∑
n n
l(θ) = (θ − 1) log(X i ) − n log(Γ(θ)) − nθ log(κ 0 ) − Xi κ 0 .
i=1 i=1
The first and the second derivatives of the log-likelihood function are
∑
n
l′(θ) = −n Γ′(θ) Γ(θ) + log(X i ) − n log(κ 0 ),
i =1
∫
1 θ−1
d2 t log(t)
l′′(θ) = −n log(Γ(θ)) = n dt < 0,
dθ 2 0 1− t
respectively. In this case, there is no closed-form solution of the MLE of
the shape parameter θ, which can be found by solving l ′(θˆ ) = 0 numeri-
cally. Correspondingly, the value of the second derivative of the log-
()
likelihood function at the MLE l ′′ θˆ can be computed. Figure 5.1 presents
the plot of log-likelihood (solid line) and its quadratic approximation
(dashed line) versus values of the shape parameter θ based on samples of
sample sizes n = 10, 25, 35, 50 from a gamma distribution with the true
shape parameter of θ = θ0 = 2 and a known scale parameter κ 0 = 2 . It is
obvious from Figure 5.1 that as the sample size n increases, the quadratic
approximation to the log-likelihood function l(θ) performs better in the
neighborhood of the MLE of θ.
> n.seq<-c(10,25,35,50)
> N<-max(n.seq)
Bayes Factor 139
> shape0<-2
> scale0<-2
> X<-rgamma(N,shape=shape0,scale = scale0)
>
> # Quadratic approximation to the log-likelihood function
> par(mar=c(3.5, 4, 3.5, 1),mfrow=c(2,2),mgp=c(2,1,0))
> for (n in n.seq){
+ x<-X[sample(1:N,n,replace=FALSE)]
+ # log-likelihood as a function of the shape parameter theta
+ loglike<-function(theta) (theta-1)*sum(log(x))-n*log(gamma(theta))-n*theta
*log(scale0)-sum(x)/scale0
+ neg.loglike<-function(theta)-loglike(theta)
+
+ # the first derivative of loglikelihoodlog
+ loglike.D<-function(theta) sum(log(x)) - n*digamma(theta)-n*log(scale0)
+ # the second derivative of log-likelihood
+ loglike.DD<-function(theta) -n*trigamma(theta)
+
+ # one-dimensional optimization that maximize log-likelihood with respect to
the shape parameter
+ init<-mean(x)^2/var(x) # set initial value
+ theta.mle<-optim(init,neg.loglike,method="L-BFGS-B", lower=init-6*(-loglike.
DD(init)^(-1/2))/sqrt(n), upper=init+6*(-loglike.DD(init)^(-1/2))/sqrt(n))$par
+
+ loglike.DD.mle <- loglike.DD(theta.mle) # second derivative at MLE
+
+ # approximation of the log-likelihood
+ loglike.est<-function(theta) loglike(theta.mle)+loglike.DD.mle*((theta-theta.
mle)^2)/2
+
+ # plot: the plotting limit of x-axis
+ tc<-theta.mle
+ rg<-6*((-loglike.DD.mle)^(-1/2) )/sqrt(n) # half range
+
+ # plot the true log-likehood
+ plot(Vectorize(loglike),c(tc-rg,tc+rg),xlim=c(tc-rg,tc+rg),type="l",lty=1,
lwd=2,cex.lab=1.2,xlab=expression(theta),ylab=expression(paste("Log-likeli-
hood( ",theta,")")), main=paste0("n=",n))
+ # plot the approximation to the log-likehood
+ curve(loglike.est,from=tc-rg,to=tc+rg,type="l",lty=2,lwd=2,add=TRUE,col=
2)
+ lines(c(theta.mle,theta.mle), c(-1e+5,loglike(theta.mle)),lty=3)
+ }
>
n = 10 n = 25
−23.5
Log−likelihood (θ)
Log−likelihood (θ)
–58.2
–24.5 –25.5
–58.6
1.5 2.0 2.5 3.0 2.0 2.1 2.2 2.3 2.4 2.5 2.6
θ θ
n = 35 n = 50
–79.6
Log−likelihood (θ)
Log−likelihood (θ)
–114.5
–79.8
–80.0
–114.7
1.9 2.0 2.1 2.2 1.90 1.95 2.00 2.05 2.10 2.15 2.20
θ θ
FIGURE 5.1
The plot of log-likelihood (solid line) and its quadratic approximation (dashed line) versus the
shape parameter θ based on samples of sizes n = 10, 25, 35, 50 from a gamma distribution with
the true shape parameter of θ = θ0 = 2 and the known scale parameter 2.
{ ( )} { ( )}
2 2
Then l′′(θ )(θ − θ )2 = n1/2 θ − θ l′′(θ )/ n n1/2 θ − θˆ l ′′(θ)/ n ~Normal, pro-
vided that θ is a true fixed value of the parameter (see Chapter 3). In this
scenario, since θ maximizes l(θ), i.e., l '(θ ) = 0, the Taylor argument applied to
l '(θ ) shows that
0=
d
dθ
l (θ)
θ=θ
= l '(θˆ ) + {log ( π(θ))}
d
d θ=θˆ
( )
+ θ − θˆ
d2
dθ2
l (θ)
θ=θ
, with l '(θˆ ) = 0
and then
−1
⎡d ⎤ ⎧ d2 ⎫
θ = θˆ − ⎢ {log ( π(θ))} ⎥ ⎨ 2 l (θ) ⎬ = θˆ + op (1),
θ=θˆ ⎦ ⎩ dθ
θ=θ
⎣d ⎭
( )
where θ ∈ θˆ , θ and
d2
dθ2
l (θ)
θ=θ
= Op (n) as a sum of random variables. This
may also be considered a case where frequentist and Bayesian methods align
in some regards.
It is interesting to note that asymptotic arguments similar to those shown
above and in the forthcoming sections can also be employed to compare the
Bayesian and empirical Bayesian concepts (Petrone et al., 2014).
The fact that some problems are of high dimension may also lead to intrac-
table analytic evaluation of integrals. In this case Monte Carlo methods may
be used, but these too need to be adapted to the statistical context. Bleistein
142 Statistics in the Health Sciences: Theory, Applications, and Computing
and Handelsman (1975) as well as Evans and Swartz (1995) reviewed various
numerical integration strategies for evaluating integral I. In this chapter we
outline several techniques available to evaluate the Bayes factor, including
the asymptotic approximation method, the simple Monte Carlo technique,
importance sampling, and Gaussian quadrature strategies, as well as
approaches based on generating samples from the posterior distributions.
For general discussion and references, see, e.g., Kass and Raftery (1995) and
Han and Carlin (2001).
∫
I = Pr(X |H ) = Pr(X |θ, H )π(θ|H ) dθ of the data is obtained by approxi-
mating the posterior with a normal distribution. It is assumed that the pos-
terior density, which is proportional to the nonnormalized posterior
h(θ) = Pr(X |θ, H )π(θ|H ), is highly peaked about its maximum θ, the poste-
rior mode. This assumption will usually be satisfied if the likelihood func-
tion Pr(X |θ, H ) is highly peaked near its maximum θ̂, e.g., in the case of large
samples. Denote l(θ) = log( h(θ)). Expanding l(θ) as a quadratic about θ and
then exponentiating the resulting expansion yields an approximation to h(θ)
that has the form of a normal density with mean θ and covariance matrix
Σ = (−Hl(θ ))−1 , where Hl(θ ) is the Hessian matrix of second derivatives of the
log-posterior evaluated at θ. More specifically, component-wise,
⎡ ∂ log l(θ)⎤
2
Hl(θ )ij = ⎢ ⎥ . Integrating the approximation to h(θ) with respect
⎢⎣ ∂θi ∂θ j ⎥⎦
θ=θ
to θ yields
performing the expansion. (For the general formulation, see Kass and Vaid-
yanathan (1992), which followed Tierney et al. (1989).) An important variant
on the approximation Î to I is
where Σˆ −1 is the observed information matrix, that is, the negative Hessian
matrix of the log-likelihood evaluated at the maximum likelihood estimator
θ̂. This approximation again has relative error of order O(n−1 ).
Define G to be the set of all possibilities that satisfy hypothesis H , and G '
to be the set of all possibilities that satisfy hypothesis H '. We will call H ' a
nested hypothesis within H if G ' ⊂ G.
Consider nested hypotheses, where there must be some parameterization
under H 1 of the form θ = (β, ψ )T such that H 0 is obtained from H 1 when ψ = ψ 0
for some ψ 0 , with parameter (β, ψ ) having prior π(β, ψ ) under H 1 and then
H 0 : ψ = ψ 0 with prior π(β|H 0 ). For instance, it is desired to check whether or
not the collected data, denoted by X, could be described as a random sample
from some normal distribution with mean μ 0 , assuming that they may be
described as a random sample from some normal distribution with unknown
mean μ and unknown variance σ 2 . In this case, notation-wise, ψ 0 = μ 0 , ψ = μ,
and β = σ 2 .
Applying the approximation IˆMLE, one can show that
2 log B10 ≈ Λ + log π(βˆ , ψˆ |H 1 ) − log π(βˆ *, H 0 ) + (d1 − d0 )log(2 π) + log|Σˆ 1 |− log|Σˆ 0 |,
{ (
where Λ = 2 log Pr(X |(βˆ , ψ ) ( )}
ˆ ), H 1 ) − log Pr(X|βˆ *,H 0 ) is the log-likelihood ratio
statistic having approximately a chi-squared distribution with degrees of
freedom (d1 − d0 ), a difference between the numbers of the estimated param-
eters under H 1 and H 0 , and β̂* denotes the MLE under H 0 . Here the covari-
ance matrices Σ̂ k could be either observed or expected information, under
which case the approximation of 2 log B10 has relative error of order O(n−1 ) or
O(n−1/2 ), respectively. For more information, we refer the reader to Kass and
Raftery (1995).
Alternatively, we may consider the following MLE-based approach. We do
not attempt a general proof but we instead provide an informal outline for
the unidimensional case when regarding parameter θ we have the hypoth-
esis H 0 : θ = θ0 versus H 1 : θ ≠ θ0, where θ0 is known. Note that in a Bayesian
context, we are interested in testing for the model, where θ has a degenerate
prior distribution with mass at θ0 versus the model that states θ has a prior
distribution π(θ). Recall that the log-likelihood ratio function lr(θ) = l(θ) − l(θ0 )
attains its maximum at the MLE θ̂, satisfying l '(θˆ ) = 0, and the likelihood
144 Statistics in the Health Sciences: Theory, Applications, and Computing
∞ ˆ φn
θ+
Bn = I / exp (l(θ0 )) =
∫ −∞
exp(lr(θ))π(θ) dθ = I1 + I 2 +
∫ ˆ φn
θ−
exp(lr(θ))π(θ) dθ,
ˆ n
θ−φ ∞
where I1 =
∫−∞
exp(lr(θ))π(θ) dθ and I 2 =
∫ ˆ n
θ+φ
exp(lr(θ))π(θ) dθ. We first con-
sider the component I1 . Based on the fact that the log-likelihood ratio func-
tion l(θ) is a nondecreasing function of θ for θ ∈( −∞ , θˆ − φn ) and that
lr(θˆ − φn ) → −∞, under the null hypothesis, as n → ∞ , it can be derived that
∞
I1 ≤ exp(lr(θˆ − φn ))
∫ −∞
π(θ) dθ = exp(lr(θˆ − φn )) → 0, as n → ∞ ,
{ } { ( )
φn2lr ''(θˆ ) = − n2 γ −lr ''(θˆ ) / n − n2 γ −lr ''(θ0 ) − θˆ − θ0 lr '''(θ0 ) / n }
= − n2 γ Op (1) → −∞
∞
I 2 ≤ exp(lr(θˆ + φn ))
∫ −∞
π(θ) dθ = exp(lr(θˆ + φn )) → 0, as n → ∞.
Bayes Factor 145
Therefore, we have that the Bayes factor Bn = I / exp (l(θ0 )) satisfies approxi-
mately
ˆ φn
θ+
I / exp (l(θ0 )) ≈
∫ ˆ φn
θ−
exp(lr(θ))π(θ) dθ, as n → ∞.
In this approximation,
ˆ φn
θ+
I / exp (l(θ0 )) ≈
∫
ˆ φn
θ−
exp(l(θˆ ) − l(θ0 ) + (θ − θˆ )2 l ''(θˆ )/ 2)π(θ) dθ
≈ exp(l(θˆ ) − l(θ0 ))
∫ exp(−(θ − θˆ ) (−l ''(θˆ ))/ 2)π(θ) dθ,
2
where
∫ exp(−(θ − θˆ ) (−l ''(θˆ ))/ 2)π(θ) dθ is a Gaussian-type integral. This
2
result can be easily applied to control asymptotically the Type I error rate of
the test statistic Bn = I / exp (l(θ0 ))in the frequentist manner.
In a more formal manner, Vexler et al. (2014) used the asymptotic principle
shown above for constructing and evaluating nonparametric Bayesian proce-
dures, where the empirical likelihood method was employed instead of the
parametric likelihood. (Regarding the empirical likelihood approach, see
Chapter 10.)
(3) The Schwarz criterion. It is possible to avoid the introduction of the
prior densities π k (θ k |H k ) in Equation (5.1) by using
{
S = log Pr(X |θˆ 1 , H 1 ) − log Pr(X |θˆ 0 , H 0 ) } 1
− (d1 − d0 )log(n),
2
added. The BIC is a criterion for model selection among a finite set of mod-
els. In this case, the model with the lowest BIC is preferred (e.g., Carlin and
Louis, 2011). Schwarz (1978) showed that, under certain assumptions on the
data distribution,
S − log B10
→ 0, as n → ∞.
log B10
Thus, the quantity exp(S) provides a rough approximation to the Bayes factor
that is independent of the priors on the θ k. Kass and Wasserman (1995) pro-
vided excellent discussion regarding the method mentioned above.
The benefits of asymptotic approximations in this context include the fol-
lowing: (1) numerical integration is replaced with numerical differentiation,
which is computationally more stable; (2) asymptotic approximations do not
involve random numbers, and consequently, two different analysts can pro-
duce a common answer based on the same dataset, model, and prior distri-
bution; and (3) in order to investigate the sensitivity of the result to modest
changes in the prior or the likelihood function, the computational complex-
ity is greatly reduced; for more detail, we refer the reader to Carlin and Louis
(2011). Among the asymptotic approximations described above, the Schwarz
criterion is the easiest approximation to compute and requires no specifica-
tion of prior distributions. In addition, as long as the number of degrees of
freedom involved in the comparison is reasonably small relative to sample
size, the analysis based on the Schwarz criterion will not mislead in a quali-
tative sense.
However, asymptotic approximations also have the following limita-
tions: (1) in order to obtain a valid approximation, the posterior distribu-
tion must be unimodal, or at least nearly unimodal; (2) the size of the
dataset must be fairly large, while it is hard to judge how large is large
enough; (3) the accuracy of asymptotic approximations cannot be improved
without collecting additional data; (4) the correct parameterization, on
which the accuracy of the approximation depends, e.g., θ versus log(θ),
may be difficult to ascertain; (5) when the dimension of θ is moderate or
high, say, greater than 10, Laplace’s method becomes unstable, and numer-
ical computation of the associated Hessian matrices will be prohibitively
difficult; and (6) when the number of degrees of freedom involved in the
comparison is large and the prior is very different from that for which the
approximation is best, the asymptotic approximation based on the Schwarz
criterion can be very poor; e.g., McCulloch and Rossi (1991) illustrated the
poor approximation by the Schwarz criterion with an example of 115
degrees of freedom. For more details, we refer the reader to, e.g., Kass and
Raftery (1995) as well as Carlin and Louis (2011). Therefore, in such cases,
researchers may turn to alternative methods, e.g., Monte Carlo sampling,
Bayes Factor 147
∫
integral I becomes I = Pr(X ) = Pr(X |θ)π(θ) dθ. The simplest Monte Carlo
integration estimate of I is
m
∑ Pr(X|θ
∧ 1
Pr 1 (X ) = (i)
),
m i=1
where {θ( i ) : i = 1,..., m} is a sample from the prior distribution; this is the
average of the likelihoods of the sampled parameter values, e.g., Hammersley
and Handscomb (1964)
The precision of simple Monte Carlo integration can be improved by
importance sampling. It involves generating a sample {θ( i ) : i = 1,..., m} from
a density π * (θ). Under general conditions, a simulation-consistent estimate
of I is
∑
m
wi Pr(X |θ( i ) )
IˆMC = i=1
,
∑
m
wi
i=1
cases. In more complex cases, Markov chain Monte Carlo (MCMC) methods
(e.g., Smith and Roberts, 1993) particularly the Metropolis–Hastings algo-
rithm and the Gibbs sampler, as well as the weighted likelihood bootstrap,
provide general schemes.
These methods provide a sample approximately drawn from the posterior
density Pr(θ|X ) = Pr(X |θ)π(θ)/ Pr(X ). Substituting into Î MC defined in Sec-
tion 5.3.1.2 yields as an estimate for Pr(X ),
−1
⎧⎪ 1 m
⎫⎪
∑
∧
( i ) −1
Pr 2 (X ) = ⎨ Pr(X |θ ) ⎬ ,
⎪⎩m i=1 ⎪⎭
which is the harmonic mean of the likelihood values (Newton and Raftery,
1994). This converges almost surely to the correct value, Pr(X ), as m → ∞, but
it does not generally satisfy a Gaussian central limit theorem. See Kass and
Raftery (1995) for additional modifications and discussion. Note that the
MCMC methods have not yet been applied in many demanding problems
and may require large numbers of likelihood function evaluations, resulting
in difficulty in some cases.
α= ∫ (
ϕ u; θ∑ du.
B
)
A modification of Laplace’s method can be obtained that simply estimates
the unknown value of the posterior probability density at the mode using
the simulated probability assigned to a small region around the mode
divided by its area. Observing that
Pr(X |θ )π(θ ) Pr(X |θ )π(θ ) ϕ(θ ) Pr(X |θ )π(θ ) α
I= = ≈ ,
Pr(θ |X ) ϕ(θ ) Pr(θ |X ) ϕ(θ ) Pr(B)
Bayes Factor 149
where P̂ is the Monte Carlo estimate of Pr(B), that is, a proportion of the
sampled values inside B. DiCiccio et al. (1997) demonstrated that the simu-
lated version of Laplace’s method with a local volume correction approxi-
mates the Bayes factor accurately, which is especially useful in the case of
costly likelihood function evaluations.
The importance of sampling techniques can be modified by restricting
them to small regions about the mode. To improve the accuracy of approxi-
mations, Laplace’s method can be combined with the simple bridge sam-
pling technique proposed by Meng and Wong (1996). For detailed information,
we refer the reader to DiCiccio et al. (1997).
Since
∫ Pr(θ|X ) dθ = 1, the posterior density is indeed proper, and hence
Bayesian inference may proceed as usual. Note that when employing
improper priors, proper posteriors will not always result and extra care must
be taken.
Krieger et al. (2003) have proposed several forms of a prior π(θ) and the cor-
θ
responding distribution H (θ) =
∫ π(θ) dθ in the context of sequential change
point detection (see Chapter 4). This method of choices of a prior can be
adapted for the problem stated in this chapter. For example, if we suspect
that the observations under the alternative hypothesis have a distribution
that differs greatly from the distribution of the observations under the null,
the prior distribution could be chosen as
−1 +
⎡ μ ⎞ ⎤ ⎛ θ − μ⎞ μ ⎞⎞
H(θ) = ⎢Φ ⎛⎜ ⎟ ⎥ ⎜ Φ ⎛⎜ ⎟ − Φ ⎛⎜ − ⎟ ⎟ , (5.2)
⎣ ⎝ σ⎠⎦ ⎝ ⎝ σ ⎠ ⎝ σ ⎠⎠
1 ⎛ ⎛ θ − μ⎞ θ + μ ⎞⎞
H(θ) = Φ⎜ + Φ ⎛⎜
⎜ ⎟ ⎝ σ ⎟⎠ ⎟⎠
. (5.3)
2⎝ ⎝ σ ⎠
Note that the parameters μ and σ > 0 of the distributions H (θ) can be chosen
arbitrarily. Krieger et al. (2003) recommended the specification of μ = 0, and
σ = 1 so that two forms of H (θ) specified above are simplified. In accordance
Bayes Factor 151
with the rules shown in Marden (2000), where the author reviewed the Bayes-
ian approach applied to hypothesis testing, the function H can be defined,
e.g., for a fixed σ > 0, in the form
⎧1 ⎛ θ − μ⎞ θ + μ ⎞⎞ ⎫
H ∈Ψ = ⎨ ⎜ Φ ⎛⎜ ⎟ + Φ ⎛⎜ ⎟ ⎟ , μ ∈(μ lower , μ upper )⎬ . (5.4)
⎩2 ⎝ ⎝ σ ⎠ ⎝ σ ⎠⎠ ⎭
In Figure 5.2, the solid line and the dashed line in the left panel present H (θ)
defined by (5.2) and (5.3) with μ = 0 and σ = 1, respectively, and the right
panel presents H (θ) defined by Equation (5.4) for μ changing from −5 to 5.
Example
As a specific example of the choice of the prior function, Gönen et al.
(2005) provided a case study. Assuming the data points Yir ( i = 1, 2,
r = 1,..., ni ) are independent and normally distributed with means μ i and
152 Statistics in the Health Sciences: Theory, Applications, and Computing
1.0 1.0
0.8 0.8
Probability
0.6
Probability
0.6
0.4 0.4
0.2 0.2
0.0 0.0
–4 –2 0 2 4 –4 –2 0 2 4
θ θ
FIGURE 5.2
The cumulative distribution functions that correspond to the priors (5.2) through (5.4). The
solid line and the dashed line in the left panel present H (θ) defined by (5.2) and (5.3) with μ = 0
and σ = 1, respectively. The right panel presents H (θ) defined by (5.4), where μ = −5,..., 5 .
2( z1−α/2 + z1−β )2
∫
z1−α 2
1 2
n= with e − u 2 du = 1 − α 2
(δ σ )2 (2π )1 2 −∞
∫
z1− β
1 2
and e − u 2 du = 1 − β ,
(2π )1 2 −∞
Note that a common criticism of Bayesian methods is that the priors may
be chosen arbitrarily in some cases or subjective in a special way in other
cases. Kass (1993) discussed that the value of a Bayes factor may be sensitive
to the choice of priors on parameters in the competing models. A broad range
of literature has discussed the selection of priors and various techniques for
constructing priors, including Jeffrey’s rules and their variants; see, e.g.,
Berger (1980, 1985), Kass and Wasserman (1996) and Sinharay and Stern
(2002) for additional information.
TABLE 5.1
Bayes Factor as Evidence against the Null Hypothesis
log 10 (B10 ) (B10 ) Evidence against H 0
0–1/2 1–3.2 Weak
1/2–1 3.2–10 Substantial
1–2 10–100 Strong
>2 >100 Decisive
154 Statistics in the Health Sciences: Theory, Applications, and Computing
{ }
1
S = log Pr(X |θˆ 1 , H 1 ) − log Pr(X |θˆ 0 , H 0 ) − (d1 − d0 )log(n),
2
Example
Consider a straightforward but common example. Let X = {X 1 ,..., X n } be
a random sample from some normal distribution with mean μ and vari-
ance σ 2 . It is desired to test the one-sided hypothesis H 0 : μ ≤ μ X 0 versus
H 1 : μ > μ X 0, where μ X 0 is a prespecified number of interest. Let X denote
∑
n
the sample mean, X = n−1 X i . We then consider two scenarios with
i=1
known/unknown value of variance σ 2 .
⎛ 1 ⎛ (μ − μ )2 n
⎞⎞
∑ (X σ− μ) ⎟⎠⎟⎟⎠ ,
2
Pr(μ|X ) ∝ π(μ)Pr(X |μ) ∝ exp ⎜ − ⎜ 0
+ i
⎜⎝ 2 ⎝ τ 20 i=1
2
which is associated with a normal density function with the mean and
the variance as
μ 0 τ 20 + n X σ 2 2 1
μn = , τn = ,
1 τ 20 + n σ 2 1 τ 20 + n σ 2
Note that the prior precision, 1 τ 20 , and the data precision, n σ 2 , play
equivalent roles in the posterior distribution. Thus, for a large sample
size n, the posterior distribution is largely determined by σ 2 and the
sample mean X; see Gelman et al. (2003) for detailed discussion.
Scenario 2: (The variance σ 2 is assumed to be unknown.) Let the conditional
distribution of μ given σ 2 be normal and the marginal distribution of σ 2
be scaled-inverse- χ 2 distribution, that is,
⎛ ν σ 2 + κ 0 (μ 0 − μ)2 ⎞
Pr(μ , σ 2 ) ∝ σ −1 (σ 2 )−( ν0 /2 + 1) exp ⎜ − 0 0 ⎟⎠ ,
⎝ 2σ 2
∑
n
Let s 2 denote the sample variance, that is, s 2 = (n − 1)−1 (X i − X )2 .
i=1
Correspondingly, the joint posterior density is
κ0 n
μn = μ0 + X , κ n = κ 0 + n, νn = ν0 + n,
κ0 + n κ0 + n
κ 0n
νnσ n2 = ν0σ 20 + (n − 1)s 2 + ( X − μ 0 )2 .
κ0 + n
To sample from the joint posterior distribution, we first draw σ 2 from its
marginal posterior distribution Inverse-χ 2 ( νn , σ 2n ), then draw μ from its
normal conditional posterior distribution N ( μn , σ 2 κ n ), using the gener-
ated value of σ 2.
The marginal posterior density of σ 2 is a scaled-inverse- χ 2 distribu-
tion, i.e.,
σ 2 |X ~ Inverse − χ 2 ( νn , σ n2 ).
μ|σ 2 , X ~ N (μ n , σ 2 κ n ).
156 Statistics in the Health Sciences: Theory, Applications, and Computing
− ( νn /2 + 1)
⎛ κ (μ − μ)2 ⎞
Pr(μ|X ) ∝ ⎜ 1 + n n 2 ⎟ = tνn (μ n , σ 2n κ n ) ,
⎝ ν nσ n ⎠
∫
y
function as T ( y ; ν, μ , σ ) = f (t ; ν, μ , σ ) dt. In this case, the posterior
−∞
H 1 : μ 1 ≠ μ 2 , σ 12 ≠ σ 22 .
As was mentioned above, Bayes factor type test statistics can be used in the
context of traditional tests via a control of the corresponding Type I error
rate, e.g., employing the asymptotic evaluations of the Bayes factor struc-
tures. We illustrate this principle in the following example.
Example
Let X = {X 1 ,..., X n } , n = 30, be a random sample drawn from Y − 1, where
Y ~ exp(λ), i.e., f ( x ; λ) = λ exp( −λ( x + 1)), λ > 0, x > −1, is the density func-
tion of the observations. Denote the mean of X by θ (θ = 1 λ − 1). Suppose
it is of interest to test H 0 : θ = 0 versus H 1 : θ ≠ 0 . Under the alternative
hypothesis, the prior function for θ is assumed to be the Uniform[0, 2]-dis-
tribution function. In this case, based on the Schwarz criterion approxi-
mation, the relevant maximum likelihood ratio can be well approximation
to log B10 + log(n)/ 2, as n → ∞ . The corresponding maximum likelihood
ratio has an asymptotic χ12 distribution under the null hypothesis. Com-
paring 2 log B10 + log(n) with the critical values related to χ12 distribution
Bayes Factor 157
one can provide an approach to make statistical decision rules from the
traditional frequentist test perspective. Implementation of the testing
procedure is presented in the following R code:
> n<-30
> lambda.true<-1/2
> 1/lambda.true-1 # theta.true
[1] 1
>
> set.seed(123456)
> x<-rexp(n,lambda.true)-1
> x
[1] 0.19113728 0.01426038 1.51633983 1.11874795 1.27664726 2.94610447
-0.86056815 4.24772281 -0.76769938
[10] 0.22111551 -0.72606026 1.06624316 0.10882527 2.32608560 3.73953110
1.89499414 2.52307509 1.35740018
[19] -0.76989244 -0.90812499 -0.83491881 0.29580491 3.72774643 -0.01274517
-0.57838195 2.79788529 0.74200029
[28] -0.93858993 1.18364253 -0.28730946
>
> theta0<-0 # EX under H_0
> lambda0<-(theta0+1)^(-1)
>
> neg.loglike<-function(x,theta){
+ length(x)*log(theta+1)+(sum(x)+length(x))/(theta+1)
+ }
>
> # Bayes factor
> # likelihood
> like<-function(x,theta){
+ exp(-(length(x)*log(theta+1)+(sum(x)+length(x))/(theta+1)))
+ }
>
> # prior
> ll<-0;uu<-2
> prior<-function(theta) 1/(uu-ll)*(theta<=uu)*(theta>=ll)
>
> integrand<-function(theta) like(theta,x=x) #*prior(theta)
> BF<-integrate(Vectorize(integrand), ll, uu)$value/like(theta0,x=x)
> log10(BF)
[1] 3.227941
> stat.est<-log(BF)+1/2*log(n)
> p.BF<-1-pchisq(2*stat.est,df=1)
> p.BF
[1] 1.920634e-05
> p.t<-t.test(x,mu=theta0)$p.val
> p.t
[1] 0.003940047
In this example, the value of log 10 (B10 ) is 3.2279, greater than 2, suggesting a
decisive evidence against H 0 . From the traditional testing viewpoint, the
approach based on the Bayes factor leads to a p-value far below 0.0001,
158 Statistics in the Health Sciences: Theory, Applications, and Computing
suggesting the rejection of the null hypothesis, which is consistent with the
result based on the t-test shown in the R code above.
The following points about Bayes factors should be emphasized: (1)
Bayes factors provide a way to evaluate evidence in favor of a null hypoth-
esis through the incorporation external information; (2) Bayes factor based
decision-making procedures can be easily applied with respect to non-
nested hypotheses (see Section 5.3.1.1 for the definition of nested
hypotheses); (3) Technically, Bayes factors are simpler to compute as com-
pared to deriving non-Bayesian significance tests based on “nonstandard”
statistical models that do not satisfy common regularity conditions; (4)
When estimation or prediction is of interest, Bayes factors can be con-
verted to weights corresponding to various models so that a composite
estimate or prediction may be obtained that takes into account structural
or model uncertainty; (5) The use of Bayes factors allow us to feasibly
assess the sensitivity of our conclusions relative to the choose of prior dis-
tribution; and (6) As shown in Section 5.2, Bayes factor type procedures
can provide optimal decision rules as compared to other approaches. For
more details, we refer the reader to Kass and Raftery (1995) as well as Vex-
ler et al. (2010b).
7
Birth-weight
4 X: Died
0: Survied
3
0 10 20 30
Clutch
FIGURE 5.3
The snapshot from the paper by Sinharay and Stern (2002) regarding the scatterplot with the
clutches sorted by average birth weight.
Let y ij denote the response (survival status with 1 denoting survival) and
xij the birth weight of the jth turtle in the ith clutch, i = 1, 2...m = 31, j = 1, 2,...ni.
The model we fit to the data is
∫
u
i.i.d. 1 2
bi |σ 2 ~ N (0, σ 2 ), i = 1, 2,..., m, Φ(u) = e−u /2
du,
(2 π)1/2 −∞
where the bi’s are random effects for clutch (family). The clutch effects are
assumed to be the same for all birth weights. To assess the importance of
clutch effects, we compare our alternative model, the probit regression
model with random effects, with the null model, a simple probit regression
model with no random effects.
Suppose p0 (α) is the prior distribution for α under the null model, while
the prior distribution for (α , θ) under the alternative model is p(α , θ) =
p(σ 2 )p1 (α|σ 2 ). Then, the Bayes factor for comparing the alternative model
(M1 ) against the null model (M 0 ) is BF10 = p( y|M1 )/ p( y|M 0 ), with
160 Statistics in the Health Sciences: Theory, Applications, and Computing
p( y|a, b) = ∏ ∏ {Φ(a + a x
i=1 j=1
0 1 ij + bi )} ij {1 − Φ(a0 + a1xij + bi )}
y 1− y ij
,
m ni
p( y|α , b = 0) = ∏ ∏ {Φ(a + a x )}
i=1 j=1
0 1 ij
y ij
{1 − Φ( a0 + a1xij )}
1− y ij
,
i=1 }
bi2 (2σ 2 ) .
3
Bayes factor
0
0.0 0.5 1.0 1.5
Variance component
FIGURE 5.4 ∧
The snapshot from the paper by Sinharay and Stern (2002) regarding the plot of BF 10,σ 2
against σ .
2
equal to zero, the approximated Bayes factor favors the small positive value of
σ 2 . However, when it comes to choosing between a large value of σ 2 and σ 2
equal to zero, the approximated Bayes factor favors the σ 2 equal to zero; we
refer the reader to Sinharay and Stern (2002) for more details.
5.4 Remarks
The Bayes factor, the ratio of the posterior odds (not probability) of a hypoth-
esis to the prior odds, is equal to a likelihood ratio in the simplest case (and
only then); see Good (1992) for details. It can be described as a multiplicative
weight of evidence. In the frequentist setting of testing statistical hypothe-
ses, problems may arise when inference about the truth or believability of
the null hypothesis is based on the classical formulation of a p-value (see
forthcoming material in this book regarding p-values). In this context, it
should be noted that the p-value is not the probability that the null hypoth-
esis is true. Unfortunately, many practitioners and students have tried to interpret the p-value as a measure of evidence for the null hypothesis, i.e., as the probability that the null hypothesis is true. This concept is wrong since, in general, p-values are random variables. Formal arguments about incorrect
interpretations of p-values are provided in Chapter 9. We also remark
again (as in Chapter 4) that almost no null hypothesis is exactly true in
practice. Consequently, when sample sizes are large enough, almost any
null hypothesis will have a tiny p-value, and hence will be rejected at con-
ventional levels. As alternatives to the classical formulation of p-values for
testing hypotheses, Bayes factors have been suggested as a measure of the
extent to which the observed data align with a given hypothesis by incorpo-
rating prior information about the parameter of interest. The use of the p-value should not be discarded altogether. A small p-value does not
necessarily mean that we should reject the null hypothesis and leave it at
that. Rather, it is a red flag indicating that something is up; the null hypoth-
esis may be false, possibly in a substantively uninteresting way, or maybe we
got unlucky. On the other hand, a large p-value does mean that there is not
much evidence against the null. For example, in many settings the Bayes fac-
tor is bounded above by 1/p-value. Even Bayesian analyses can benefit from
classical testing at the model-checking stage (see Box (1980) or the Bayesian
p-values in Gelman et al. (2013)). In this context, for more details, we refer the
reader to Marden (2000).
Warning: Assume the parameter of interest is θ. In the frequentist manner, the meaning of testing statistical hypotheses, say H_0: θ = θ_0 versus H_1: θ = θ_1 ≠ θ_0, is clear. In the Bayesian setting, we believe that θ is random. Then it is possible that the statement Pr(θ = θ_0) = Pr(θ = θ_1) = 0 makes sense. If your statistical world is completely Bayesian, you compare the model H_0: θ ~ π_0 with the model H_1: θ ~ π_1, where π_0 and π_1 are prior density functions of θ. In this case, we can still write H_0: θ = θ_0, assuming that π_0 is an appropriate degenerate density function (e.g., Berger, 1985; Vexler et al., 2010b). Let us consider briefly the following simple scenario, in which, using the Bayes factor procedure, we would like to compare the following models: M_0: q ~ π_0 = Unif[0, 1] versus M_1: q ~ π_1 = Unif[1, 1.5], where q is a quantile that satisfies F(q) = α ≡ 0.1 for the independent and identically F-distributed
data points X_1, ..., X_n. Let X_1, ..., X_n be exponentially distributed with the density function f(x|λ_q), where λ_q satisfies ∫_0^q f(u|λ_q) du = α = 0.1. Then the Bayes factor statistic is

BF = ∫_1^{1.5} ∏_{i=1}^{n} f(X_i|λ_q) dq / ∫_0^1 ∏_{i=1}^{n} f(X_i|λ_q) dq.
For illustrative purposes, we can simulate 5,000 times the scenario men-
tioned above under the model M0 with n = 25 using the following R code:
alpha<-0.1
n<-25
MC<-5000
BF<-array()
for(mc in 1:MC)
{
a0<-0
b0<-1
a1<-b0
b1<-b0+0.5
q<-runif(1,a0,b0) # model M0
#q<-runif(1,a1,b1)# model M1
QQ<-function(u) pexp(q,u)-alpha
lambda<-uniroot(QQ,c(0,1000000))$root # to find lambda corresponding to the q's value
x<-rexp(n,lambda)
Intr<-function(uu){ # the likelihood of x treated as a function of the quantile value uu
fix<-uu
QQ1<-function(u) pexp(fix,u)-alpha
QQ1V<-Vectorize(QQ1)
lambda1<-uniroot(QQ1V,c(0,100000000))$root
S<-sum(log(dexp(x,lambda1)))
return(exp(S))
}
# numerator and denominator of the Bayes factor defined above,
# computed via numerical integration over the priors' supports
Num<-integrate(Vectorize(Intr),a1,b1)$value
Den<-integrate(Vectorize(Intr),a0,b0)$value
BF[mc]<-Num/Den
}
print(length(BF[BF<1]))
print(length(BF[(BF>=1)&(BF<10^0.5)]))
print(length(BF[(BF>=10^0.5)&(BF<10^1)]))
print(length(BF[(BF>=10^1)&(BF<10^1.5)]))
We would like to invite the reader to run the code above and obtain the
results, evaluating them with respect to Table 5.1.
Note again that, from the frequentist perspective, the approaches mentioned in this chapter show that Bayes factor type procedures can provide very efficient decision-making mechanisms.
6
A Brief Review of Sequential Methods
6.1 Introduction
Retrospective studies are generally derived from already existing databases or by combining existing pieces of data, e.g., using electronic health records to examine risk or protective factors in relation to a clinical outcome, where the outcome may have already occurred prior to the start of the analysis. The investigator collects data from past records, with no or minimal patient follow-up, in contrast to a prospective study. Many valuable studies, such as the first major case-control study published by Lane-Claypon (1926), investigating risk factors for breast cancer, were retrospective investigations. Prospective studies may be analyzed similarly to retrospective studies, or they may have a sequential element to them, which we will describe below; that is, data may be analyzed continuously or in stages during the course of the study.
In retrospective studies it is common to have sources of error due to confounding and various biases such as selection bias, e.g., bias due to missing data elements, misclassification, or information bias. In addition, the inferential decision procedures for retrospective studies are based on data sets where the corresponding sample sizes are fixed in advance and can oftentimes be quite large. It is possible in these settings to have underpowered or overpowered designs. For example, medical claims data may lead to an overpowered design, i.e., a study that will detect small differences from the null hypothesis that may not be scientifically interesting. Consequently, retrospective studies may induce unnecessarily high financial and/or human costs or statistically significant results that are not necessarily meaningful. Armitage (1981) discussed this issue in the context of clinical trials, which typically aim to demonstrate the superiority of one or more treatments over the standard of care or placebo, and which logically lead to the use of a sequential trial. In his publications cited above, Armitage described a number of sequential testing methods and their application to trials comparing two alternative treatments.
Sequential methods are well suited for use in clinical trials with short-term
outcomes, e.g., a one-month follow-up value. When dealing with human sub-
jects, regular examinations of accumulating results and early termination of
the study are ethically desirable (Armitage, 1975). Sequential methods touch
much of modern statistical practice and hence we cannot possibly include all relevant theory and examples. In this chapter, we will outline several of the most widely applied sequential testing procedures, including two-stage designs, the sequential probability ratio test, group sequential tests, and adaptive sequential designs.
Warning: (1) Note that the scheme of sequential testing may result in very complicated estimation issues in post-sequential analyses of the data. Investigators should be careful when applying standard estimation techniques to data obtained via a sequential data collection procedure. For example, estimation of the sample mean and variance of the data can be very problematic in terms of obtaining unbiased estimates (Liu and Hall, 1999). (2) In the retrospective setting, one can pretest parametric assumptions regarding the underlying data and then implement a parametric statistical procedure, e.g., estimation, adjusting its results with respect to the pretest. Oftentimes, parametric sequential procedures require parametric distributional assumptions even before the data are observed. (3) Common statistical sequential schemes stop at random times whose distributions depend on the observations. Thus, in general, data obtained after sequential analyses cannot be evaluated for goodness-of-fit using the classical tests developed for retrospective testing. For example, the Shapiro–Wilk test for normality (e.g., Vexler et al., 2016a) has a Type I error rate that does not depend on the data distribution. This property does not hold if the number of observations is random with a distribution that depends on underlying data characteristics.
In Sections 6.2, 6.5, 6.6, and 6.7 we introduce two-stage sequential proce-
dures, concepts of group sequential tests, adaptive sequential designs and
futility analysis. The classical sequential probability ratio test is detailed in
Section 6.3. In Section 6.4 we show a technique that can be applied to evalu-
ate asymptotic properties of stopping times used in statistical sequential
schemes. In Section 6.8 we provide some remarks regarding evaluations of data obtained via sequential procedures.
The design parameters n_1, n_2, r_1, and r are found for fixed values of p_0, p_1, α (the Type I error rate), and β (the Type II error rate), and are determined as follows.
The decision of whether or not to terminate after the first or the second
stage is based on the number of responses observed. The number of
responses X , given a true response rate p and a sample size m, follows a
binomial distribution, that is, Pr(X = x) = b( x ; p , m) and Pr(X ≤ x) = B( x ; p , m),
where b and B denote the binomial probability mass function, and the bino-
mial cumulative distribution function, respectively. Therefore, the total
sample size is random. The probability of early termination, and the
expected sample size depend on the true probability of response p. By using
exact binomial probabilities, the probability of early termination after the
first stage, denoted as PET( p), has the form B(r1 ; p , n2 ) , and the expected
sample size based a true response probability p for this design can be
defined as E( N |p) = n1 + {1 − PET( p)} n2. The probability of rejecting a drug
with a true response probability p, denoted as R( p) , is
R(p) = B(r_1; p, n_1) + Σ_{x=r_1+1}^{min{n_1, r}} b(x; p, n_1) B(r − x; p, n_2).
For pre-specified values of the parameters p_0, p_1, α, and β, if the null hypothesis is true, then we require that the probability that the drug is accepted for further study in other clinical trials, i.e., the probability of concluding that the drug is sufficiently promising, should be less than α. We also require that the probability of rejecting the drug for further study should be less than β under the alternative. That is, an acceptable design is one that satisfies the error probability constraints R(p_0) ≥ 1 − α and R(p_1) ≤ β. Let Ω be the set of
all such designs. A grid search is used to go through every combination of n1,
n2 , r1 and r, with an upper limit for the total sample size n, usually between
0.85 and 1.5 times the sample size for a single stage design (Lin and Shih,
2004). The optimal design under H 0 is the one in Ω that has the smallest
expected sample size E( N |p0 ). For more details, we refer the reader to Simon
(1989). Extensions of this design for testing both efficacy and/or futility after
the first set of subjects has enrolled are well developed, e.g., see Kepner and
Chang (2003).
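To make these quantities concrete, the following R sketch (our own illustration, not code from Simon (1989)) evaluates PET(p), E(N|p), and R(p) for p_0 = 0.1, p_1 = 0.3, α = 0.05, and β = 0.2; the design values n_1 = 10, r_1 = 1, n_2 = 19, and r = 5 are assumed here for illustration:

> PET<-function(p,n1,r1) pbinom(r1,n1,p) # probability of early termination
> EN<-function(p,n1,n2,r1) n1+(1-PET(p,n1,r1))*n2 # expected sample size E(N|p)
> R<-function(p,n1,n2,r1,r){ # probability of rejecting the drug
+ x<-(r1+1):min(n1,r)
+ pbinom(r1,n1,p)+sum(dbinom(x,n1,p)*pbinom(r-x,n2,p))
+ }
> R(0.1,10,19,1,5) # about 0.95, so R(p0) >= 1-alpha holds
> R(0.3,10,19,1,5) # about 0.19, so R(p1) <= beta holds
> EN(0.1,10,19,1,5) # expected sample size under H0, about 15 patients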
Based on the fundamental Neyman–Pearson lemma (Chapter 3), Wald (1947) proposed the SPRT, which provides a method of constructing efficient sequential statistical schemes.
The intuitive idea is this: in many situations, testing statistical hypotheses produces decisions of the form to reject or not to reject the corresponding null hypothesis. Assume we have n = 100 observations and the appropriate likelihood ratio test statistic based on the first 35 data points is very small. Then there is only a very rare chance of rejecting the null hypothesis using the full data. In this case we do not need to observe 100 data points to make the test decision. Similarly, we can consider situations when the likelihood ratio takes relatively large values. Thus, one can terminate the testing procedure when the likelihood ratio is greater or smaller than presumed threshold values instead of calculating the test statistic based upon all n = 100 observations.
Let X 1 , X 2 ,... be a sequence of data points with joint probability density
function f . Consider a basic form for testing a simple null hypothesis
H 0 : f = f0 against a simple alternative hypothesis H 1 : f = f1, where f0 and f1
are known. Recall that the likelihood ratio has the form

LR_n = f_1(X_1, ..., X_n)/f_0(X_1, ..., X_n).
> # For any pre-specified A and B (0<A<1<B). For example, A=0.11, B=9.
> A=.11
> B=9
FIGURE 6.1
The sampling process via the SPRT; the solid line shows the case where the sampling process stops when LR_N ≤ A, which leads to the decision not to reject H_0, and the dashed line represents the case where the sampling process stops when LR_N ≥ B and we decide to reject H_0. The shown overshoots have the values of LR_N − A or LR_N − B, respectively.
> # one sample path generated under H0 (L.seq.A) and one under H1 (L.seq.B)
> muX0<- -1/2
> muX1<- 1/2
> L<-1; L.seq.A<-c(); n<-1
> while(L>A & L<B){
+ x<-rnorm(1,muX0,1)
+ L<-L*exp((muX1-muX0)*x+(muX0^2-muX1^2)/2)
+ L.seq.A<-c(L.seq.A,L)
+ n<-n+1
+ }
> L<-1; L.seq.B<-c(); n<-1
> while(L>A & L<B){
+ x<-rnorm(1,muX1,1)
+ L<-L*exp((muX1-muX0)*x+(muX0^2-muX1^2)/2)
+ L.seq.B<-c(L.seq.B,L)
+ n<-n+1
+ }
> # to plot
> par(mar=c(4,4,2,2))
> plot(L.seq.A,type="l",lty=1,ylim=c(0,9.3),yaxt="n",xlab="n",ylab="L")
> lines(L.seq.B,type="l",lty=2)
> abline(h=c(A,B))
> axis(2,c(A,1,B),label=c("A","1","B"),las=2)
The SPRT stopping boundaries: The thresholds A and B are chosen so that the Type I and Type II error probabilities are approximately equal (bounded appropriately) to the prespecified values α and β, respectively, formally defined as

α = Pr_{H_0}(LR_N ≥ B) and β = Pr_{H_1}(LR_N ≤ A),

where N denotes the SPRT stopping time, i.e., the first n such that LR_n ∉ (A, B).
Proof. We have

α = Pr_{H_0}(LR_N ≥ B) = Σ_{n=1}^{∞} Pr_{H_0}(LR_n ≥ B, N = n)

= Σ_{n=1}^{∞} ∫ I(LR_n ≥ B, N = n) {f_0(x_1, ..., x_n)/f_1(x_1, ..., x_n)} f_1(x_1, ..., x_n) dx_1 ... dx_n

= Σ_{n=1}^{∞} ∫ I(LR_n ≥ B, N = n) {f_1(x_1, ..., x_n)/LR_n} dx_1 ... dx_n.

Treating LR_N as approximately equal to B when LR_N ≥ B (and, similarly, to A when LR_N ≤ A), this argument and its analog for β lead to

α ≈ (1 − A)/(B − A) and β ≈ A(B − 1)/(B − A);

solving these two equations for A and B gives

A ≈ β/(1 − α) and B ≈ (1 − β)/α,

where the symbol "≈" means that the overshoot values are disregarded.
The Wald approximation to the average sample number (ASN): The
expected stopping time E( N ) is often called the average sample number
(ASN). Being able to calculate the ASN is practically relevant in terms of the
logistics of planning a given study. To study the ASN and the asymptotic
properties of the stopping time N , we consider the following reformulation
and make an additional assumption that the observations X i , i ≥ 1, are iid.
Therefore, the log-likelihood ratio log(LR_n) is a sum of iid random variables Z_i = log{f_1(X_i)/f_0(X_i)}, i.e., log(LR_n) = Σ_{i=1}^{n} Z_i = S_n, and we can define the stopping time as

N = N_{c_1,c_2} = inf{ n ≥ 1 : S_n ≤ −c_1 or S_n ≥ c_2 },

where c_1 = −log(A) > 0 and c_2 = log(B) > 0. The Wald approximations to the ASN are
E_0 N ≅ μ_{Z_0}^{−1} { α log((1 − β)/α) + (1 − α) log(β/(1 − α)) }, under H_0,

and

E_1 N ≅ μ_{Z_1}^{−1} { (1 − β) log((1 − β)/α) + β log(β/(1 − α)) }, under H_1,

which follow from Wald's identity

E_k{ log(LR_N) } = E_k{ Σ_{i=1}^{N} Z_i } = E_k(N) μ_{Z_k}, k = 0, 1,

since LR_N can take only values that satisfy LR_N ≥ B, LR_N ≈ B, or LR_N ≤ A, LR_N ≈ A, with A ≈ β(1 − α)^{−1} and B ≈ (1 − β)α^{−1}. Here, under H_0, μ = μ_{Z_0} = E_0(Z_1) = ∫ log{f_1(x)/f_0(x)} f_0(x) dx < 0, whereas, under H_1, μ = μ_{Z_1} = E_1(Z_1) = ∫ log{f_1(x)/f_0(x)} f_1(x) dx > 0 (see Section 3.2 for details).
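As a quick numerical illustration of these approximations, consider the following sketch (our own code; the parameter values anticipate the example presented later in this section):

> wald.asn<-function(alpha,beta,mu0,mu1){
+ A<-beta/(1-alpha); B<-(1-beta)/alpha # the SPRT stopping boundaries
+ E0<-(alpha*log(B)+(1-alpha)*log(A))/mu0 # approximate ASN under H0
+ E1<-((1-beta)*log(B)+beta*log(A))/mu1 # approximate ASN under H1
+ c(E0,E1)
+ }
> wald.asn(0.05,0.2,-1/2,1/2) # returns approximately 2.6832 and 3.8129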
Based on the central limit theorem result shown by Siegmund (1968), it is easy to check that, for μ > 0, the standardized stopping time (N − c_2/μ)/(c_2 σ²/μ³)^{1/2}, where σ² = var(Z_1), converges in distribution to a standard normal random variable as c_2 → ∞. See Martinsek (1981) as well as the next section for more detail. Additionally, Martinsek (1981) presented the following result:
Theorem 6.3.1. Let Z_1, Z_2, ... be iid with E(Z_1) = μ ≠ 0, var(Z_1) = σ² ∈ (0, ∞), and assume E|Z_1|^p < ∞, where p ≥ 2. Then

(1) If μ > 0 and c_2 = o(c_1) as c → ∞, then { c_2^{−p/2} |N − (c_2/μ)|^p : c ≥ 1 } is uniformly integrable and hence E|N − (c_2/μ)|^r ~ (c_2/μ)^{r/2} (σ/μ)^r E|N(0, 1)|^r as c → ∞, for 0 < r ≤ p;

(2) If μ < 0 and c_1 = o(c_2) as c → ∞, then { c_1^{−p/2} |N − (c_1/|μ|)|^p : c ≥ 1 } is uniformly integrable and hence E|N − (c_1/|μ|)|^r ~ (c_1/|μ|)^{r/2} (σ/|μ|)^r E|N(0, 1)|^r as c → ∞, for 0 < r ≤ p.
Corollary 6.3.1. Let Z_1, Z_2, ... be iid with E(Z_1) = μ ≠ 0, var(Z_1) = σ² ∈ (0, ∞). Then

(1) If μ > 0 and c_2 = o(c_1) as c → ∞, then var(N) ~ c_2 σ²/μ³ as c → ∞;

(2) If μ < 0 and c_1 = o(c_2) as c → ∞, then var(N) ~ c_1 σ²/|μ|³ as c → ∞,

where c = min(c_1, c_2).
Example: Assume we observe iid N(μ, 1) data points sequentially and want to test H_0: μ = μ_0 versus H_1: μ = μ_1 > μ_0, comparing the SPRT with the corresponding fixed sample size test. Define X̄_n = n^{−1} Σ_{i=1}^{n} X_i. In this case, it is clear that the likelihood ratio is

LR_n = exp( n(μ_1 − μ_0) X̄_n + n(μ_0² − μ_1²)/2 ),

and
## set up parameters
> alpha <- 0.05 # the significance level
> beta <- 0.2 # power=1-beta=0.8
> muX0 <- -1/2
> muX1 <- 1/2
>
For the sequential test, it can be easily obtained that the stopping boundaries of the SPRT are A ≈ 4/19 and B ≈ 16 based on α = 0.05 and β = 0.2. Note first, by simple calculation, that μ_{Z_i} = μ_i and σ²_{Z_i} = 1, i = 0, 1, and thus the Wald approximation to the ASN is 2.6832 and 3.8129 under H_0 and H_1, respectively. The ASNs obtained via the Monte Carlo simulations (we refer the reader to Chapter 2 for more details related to Monte Carlo studies) are 4.1963 and 5.6990 under the null and the alternative, respectively. This illustrates that the Wald approximations are indeed underestimates of the ASN (which is, however, considerably smaller than the sample size required by the corresponding nonsequential likelihood ratio test). The R code to obtain the results is shown below:
> ##### the function to get the ASN via Monte Carlo simulations ######
> # case: "H0" for the null hypothesis; "H1" or "Ha" for the alternative
> # alpha: the type I error rate, i.e., presumed significance level
> # beta: the type II error rate, where the power is 1-beta
> # muX0: the mean specified under the null hypothesis
> # muX1: the mean specified under the alternative hypothesis
> # MC: the number of Monte Carlo repetitions
>
> get.n<-function(case="H0",alpha=0.05,beta=0.2,muX0=-1/2,muX1=1/2,MC=5000){
+ if (missing(alpha)|missing(beta)) stop("missing alpha value or beta value")
+ if (missing(muX0)|missing(muX1)) stop("missing mean value under H_0 or H_1") else {
+ if (case=="H0") mu<-muX0 else if (case=="H1"|case=="Ha") mu<-muX1 else
+ stop("wrong specification of case")
+ }
+ }
+ if (!missing(alpha) & !missing(beta) & !missing(muX0) & !missing(muX1) &
case%in%c("H0","H1","Ha")){
+ if (missing(MC)) MC<-5000
+ A<-beta/(1-alpha)
+ B<-(1-beta)/alpha
+ n.seq<-c()
+ for (i in 1:MC){
+ L<-1
+ n<-0
+ while(L>A & L<B){
+ x<-rnorm(1,mu,1)
+ L<-L*exp(-(2*(muX0-muX1)*x+muX1^2-muX0^2)/2)
+ n<-n+1
+ }
+ n.seq[i]<-n
+ }
+ return(n.seq)
+ }
+ }
>
> ## SPRT under H_0
>
> n.seq.0<-get.n(case="H0",alpha=0.05,beta=0.2,muX0=-1/2,muX1=1/2,MC=50000)
> mean(n.seq.0)
[1] 4.1963
> var(n.seq.0)
[1] 11.2889
>
> ## SPRT under H_1
>
> n.seq.1<-get.n(case="H1",alpha=0.05,beta=0.2,muX0=-1/2,muX1=1/2,MC=50000)
> mean(n.seq.1)
[1] 5.6990
> var(n.seq.1)
[1] 13.5441
The SPRT is very simple to apply in practice and this procedure typically
leads to lower sample sizes on average than fixed sample size tests given
fixed α and β.
For this example the asymptotic ASN calculated based on Theorem 6.3.1
under H 0 and H 1 is 3.1163 and 5.5452, respectively. The result is closer to the ASN
obtained via the Monte Carlo simulations as compared to the Wald approxima-
tion to the ASN. The asymptotic var ( N ) calculated based on Corollary 6.3.1
under H 0 and H 1 is 12.4652 and 22.1807, respectively, whereas the one obtained
via the Monte Carlo simulations is 11.2889 and 13.5441, respectively.
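These asymptotic values can be reproduced directly (a small sketch we add, using c_1 = −log(A) and c_2 = log(B) together with μ_{Z_0} = −1/2, μ_{Z_1} = 1/2, and σ² = 1):

> A<-0.2/0.95; B<-0.8/0.05 # boundaries from alpha=0.05, beta=0.2
> c1<--log(A); c2<-log(B) # boundaries on the log-likelihood ratio scale
> c(c1/0.5,c2/0.5) # asymptotic ASN: 3.1163 under H0 and 5.5452 under H1
> c(c1/0.5^3,c2/0.5^3) # asymptotic var(N): 12.4652 under H0 and 22.1807 under H1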
In conclusion, there are a variety of fully sequential tests. The SPRT is theoretically optimal in the sense that it attains the smallest possible expected sample size before a decision is made as compared to all sequential tests that do not have larger error probabilities than the SPRT (Wald and Wolfowitz, 1948). However, it should be noted that the SPRT is an open scheme with an unbounded sample size; as a consequence, the non-asymptotic distribution of the sample size can be skewed with a large variance. For more detail, we refer the reader to Martinsek (1981) and Jennison and Turnbull (2000).
Consider the stopping time

N(H) = min{ n ≥ 1 : Σ_{i=1}^{n} X_i ≥ H },
where X_i > 0, i = 1, 2, ..., are iid random variables. The random process N(H) not only plays a central role in renewal theory, but is also widely applied in statistical sequential schemes. Evaluations of N(H) can demonstrate several basic principles that can be adapted to analyze various stopping rules used in sequential procedures.
In Chapter 2 we obtained the result E{N(H)} ~ H/E(X_1) as H → ∞. In order to derive an asymptotic distribution of N(H), we define a = E(X_1), σ² = var(X_1) and provide the central limit theorem for the variable ζ(H) = (N(H) − H/a)(Hσ²/a³)^{−1/2} in the following proposition.

Proposition. ζ(H) converges in distribution to a standard normal random variable as H → ∞.
Proof. Following previous material of this book, we know how to deal with
sums of iid random variables. Therefore we are interested in associating the
distribution function of ζ( H ) with that of a sum of iid random variables. To
this end, we note that
Pr{ N(H) ≥ k } = Pr{ Σ_{i=1}^{k} X_i ≤ H } = Pr{ [Σ_{i=1}^{k} (X_i − a)]/(σ k^{1/2}) ≤ (H − ak)/(σ k^{1/2}) }.

By the central limit theorem, if k = k(H, u) is chosen such that (H − ak)(σ² k)^{−1/2} = u for a fixed u, then

Pr{ N(H) ≥ k } = Pr{ [Σ_{i=1}^{k} (X_i − a)]/(σ k^{1/2}) ≤ u } → Φ(u),

where Φ(u) is the standard normal distribution function. Then let us solve (H − ak)(σ² k)^{−1/2} = u with respect to k:

k = H/a + uσ{ uσ ± 2(aH)^{1/2} (1 + u²σ²/(4Ha))^{1/2} }/(2a²) = H/a ± (uσ/a²)(aH)^{1/2} {1 + O(H^{−1/2})}.
Thus, since ζ(H) = (N(H) − H/a)(Hσ²/a³)^{−1/2}, we have

Pr{ N(H) ≥ k } = Pr{ ζ(H) ≥ ±u + O(H^{−1/2}) } → Φ(u).

Since the function Φ(u) increases when its argument increases, we should choose the minus sign, i.e., Pr{N(H) ≥ k} = Pr{ζ(H) ≥ −u + O(H^{−1/2})}, so that the event {ζ(H) ≥ −u} likewise becomes more probable as u increases. Therefore, we have Pr{ζ(H) ≥ −u} → Φ(u) as H → ∞. Since Φ(u) is a symmetric distribution function, we can rewrite this as Pr{ζ(H) ≤ u} → Φ(u) as H → ∞, which completes the proof.
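The proposition is easy to visualize via a small Monte Carlo experiment (a sketch we add, assuming X_i ~ Exp(1), so that a = σ² = 1):

> set.seed(1)
> H<-100; MC<-10000
> N<-replicate(MC,{s<-0; k<-0; while(s<H){s<-s+rexp(1,1); k<-k+1}; k})
> c(mean(N),H/1) # E{N(H)} ~ H/a
> c(var(N),H*1/1^3) # var{N(H)} ~ H*sigma^2/a^3
> hist((N-H/1)/sqrt(H*1/1^3)) # approximately standard normal, as the proposition states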
Consider a group sequential test with a maximum of K analyses: at each interim analysis k, the test stops and H_0 is rejected if the accumulated test statistic exceeds the corresponding critical value, i.e., S_k > C_k. If the test continues to the Kth analysis and S_K < C_K, it stops at that point and H_0 is not rejected.
A problem of great interest in the field of sequential analysis is the calculation of the sequence of critical values, {C_1, ..., C_K}, that yields a prescribed overall Type I error rate. Different types of group sequential tests give rise to different sequences. In choosing the appropriate number of groups K and the group size m, there may be external influences, such as a predetermined cost or length of trial that fixes mK, or a natural interval between analyses that fixes m. However, it is useful to evaluate the statistical power achieved by the possible designs under consideration. Note that when significance tests at a fixed level are repeated at stages during the accumulation of data, the probability of obtaining a significant result rises above the nominal significance level when the null hypothesis is true, as the sketch below illustrates. Therefore, repeated significance testing of the accumulated data is applied after each group is evaluated, with critical boundaries adjusted for multiple testing (e.g., Dmitrienko et al., 2010). For more details, we refer the reader to, e.g., Jennison and Turnbull (2000).
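The following Monte Carlo sketch (our own illustration, assuming K = 5 equally sized groups of m = 20 standard normal observations) quantifies this inflation when each interim test is naively performed at the two-sided 0.05 level:

> set.seed(1)
> K<-5; m<-20; MC<-10000
> reject<-replicate(MC,{
+ z<-cumsum(rnorm(K*m))[(1:K)*m]/sqrt((1:K)*m) # standardized statistic at each look
+ any(abs(z)>1.96) # naive fixed-level testing at every interim analysis
+ })
> mean(reject) # about 0.14, far above the nominal 0.05 level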
As a specific example, consider a clinical trial with two treatments, A and
B, where the response variable for each patient to treatment A or B is nor-
mally distributed with known variance, σ 2 , and unknown mean, μ A or μ B,
respectively. The group sequential procedure for testing H_0: μ_A = μ_B versus H_1: μ_A ≠ μ_B is specified by the maximum number of stages, K, the number
of patients, m, to be accrued on A and B at each stage (assume equal sample
sizes for each stage and for A and B), and decision boundaries a1 ,..., aK . The
sequential procedure is as follows: upon completion of each stage k of sam-
pling (1 ≤ k ≤ K), compute the test statistic
S_k = Σ_{i=1}^{k} m^{1/2} (X̄_{Ai} − X̄_{Bi}) / (2^{1/2} σ),

where X̄_{Ai} and X̄_{Bi} denote the sample means of the m responses accrued on treatments A and B, respectively, at stage i. If |S_k| ≥ a_k, the trial stops and H_0 is rejected; otherwise sampling proceeds to the next stage, and H_0 is not rejected if the final stage K is reached without crossing a boundary. Although formulated for normally distributed responses, the procedure is easily adapted to other types of response data. For more details and discussion regarding other types of response data, we refer the reader to Pocock (1977).
The assumption of equal numbers of observations in each group can be
relaxed and then more flexible methods for handling unequal and unpre-
dictable group sizes have been developed (Jennison and Turnbull, 2000).
For example, if a statistical procedure stops at a random time N determined by sequentially obtained iid observations X_1, X_2, ..., then the post-sequential sample mean Σ_{i=1}^{N} X_i / N may be a biased estimator of E(X_1). ...dosi (2005) argued that the estimate of a treatment effect will be biased when
a trial is terminated at an early stage: the earlier the decision, the larger the
bias. Whitehead (1986) investigated the bias of maximum likelihood esti-
mates calculated at the end of a sequential procedure. Liu and Hall (1999)
showed that in a group sequential test about the drift θ of a Brownian motion
X (t) stopped at time T, the sufficient statistic S = (T , X (T )) is not complete
for θ. In addition, there exist infinitely many unbiased estimators of θ and
none has uniformly minimum variance.
Furthermore, most sequential analyses involve parametric approaches, especially in group sequential designs. In contrast with the analysis of retrospectively obtained data, the parametric assumptions are posed before the data points are observed. Even if we have strong reasons to assume parametric forms of the data distribution, it will be extremely hard, for example, to test for goodness of fit of the data distribution after the execution of a sequential procedure. In this case, for example, the well-known Shapiro–Wilk test for normality will not be exact and its critical value will depend on the underlying data distribution.
Therefore, it is important that proper methodology is followed in order to
avoid invalid conclusions based on sequential data. For more information
about the analysis following a sequential test, we refer the reader to Jennison
and Turnbull (2000).
7
A Brief Review of Receiver Operating
Characteristic Curve Analyses
7.1 Introduction
Receiver operating characteristic (ROC) curve analysis is a popular tool for
visualizing, organizing, and selecting classifiers based on their classification
accuracy. The ROC curve methodology was originally developed during
World War II to analyze classification accuracy in differentiating signal from
noise in radar detection (Lusted, 1971). Recently, the ROC methodology has
been extensively adapted to medical areas heavily dependent on screening and diagnostic tests (Lloyd, 1998; Zhou and McClish, 2002; Pepe, 2000, 2003),
in particular, radiology (Obuchowski, 2003; Eng, 2005), bioinformatics (Lasko
et al., 2005), epidemiology (Green and Swets, 1966; Shapiro, 1999), and labora-
tory testing (Campbell, 1994). For example, in laboratory diagnostic tests,
which are central in the practice of modern medicine, common uses of ROC-
based methods include screening a specific population for evidence of dis-
ease and confirming or ruling out a tentative diagnosis in an individual
patient, where the interpretation of a diagnostic test result depends on the
ability of the test to distinguish between diseased and non-diseased sub-
jects. In cardiology, diagnostic testing and ROC curve analysis play a fundamental role in clinical practice, e.g., using serum markers to predict
myocardial necrosis and using cardiac imaging tests to diagnose heart dis-
ease. ROC curves are increasingly used in the machine-learning field, due in
part to the realization that simple classification accuracy is often a poor stan-
dard for measuring performance (Provost and Fawcett, 1997). In addition to
being a generally useful graphical method for visualizing classification
accuracy, ROC curves have properties that make them especially useful for
domains with skewed discriminating distributions and/or unequal
classification error costs. These characteristics have become increasingly
important as research continues into the areas of cost-sensitive learning and
learning in the presence of unbalanced classes (Fawcett, 2006).
ROC curve analysis can also be applied generally for evaluating the accuracy of goodness-of-fit tests of statistical models (e.g., logistic regression) that are used to classify subjects. For example, the ROC curves of the three biomarkers specified in the caption of Figure 7.1 can be traced out using the following R code:
> t<-seq(0,1,0.001)
> R1<-1-pnorm(qnorm(1-t,0,1),0,1) # biomarker A
> R2<-1-pnorm(qnorm(1-t,1,1),0,1) # biomarker B
> R3<-1-pnorm(qnorm(1-t,10,1),0,1) # biomarker C
> plot(R1,t,type="l",lwd=1.5,lty=1,cex.lab=1.1,ylab="Sensitivity",xlab="1-Spe
cificity")
> lines(R2,t,lwd=1.5,lty=2)
> lines(R3,t,lwd=1.5,lty=3)
FIGURE 7.1
The ROC curves related to the biomarkers. The solid diagonal line corresponds to the ROC curve of biomarker A, where F_1 ~ N(0, 1) and G_1 ~ N(0, 1). The dashed line displays the ROC curve of biomarker B, where F_2 ~ N(1, 1) and G_2 ~ N(0, 1). The dotted line close to the upper left corner plots the ROC curve for biomarker C, where F_3 ~ N(10, 1) and G_3 ~ N(0, 1).
It can be seen that the farther apart the two distributions F and G are in
terms of location, the more the ROC curve shifts to the top left corner. A near
perfect biomarker would have an ROC curve coming close to the top left
corner, and a biomarker without any discriminability would result in an
ROC curve that is a diagonal line from the points (0,0) to (1,1).
The ROC curve displays a distance between two distribution functions.
The ROC curve is a well-accepted statistical tool for evaluating the discrimi-
natory ability of biomarkers (e.g., Shapiro, 1999). It is a convenient way to
compare diagnostic biomarkers because the ROC curve places tests (bio-
markers values) on the same scale where they can be compared for accuracy.
There exists extensive research on estimating ROC curves from the para-
metric and nonparametric perspectives (e.g., Pepe, 2003; Hsieh and Turnbull,
1996; Wieand et al., 1989). For example, assuming that both the diseased and non-diseased populations are normally distributed, that is, F ~ N(μ_1, σ_1²) and G ~ N(μ_2, σ_2²), the corresponding ROC curve can be expressed as

R(t) = Φ( {μ_1 − μ_2 + σ_2 Φ^{−1}(t)}/σ_1 ), t ∈ [0, 1].

Alternatively, F and G can be estimated nonparametrically. Based on iid observations X_1, ..., X_n from F, the empirical distribution function of F is

F̂_n(t) = (1/n) Σ_{i=1}^{n} I{X_i ≤ t},

where I{⋅} denotes the indicator function. The empirical distribution function Ĝ_m of G can be defined similarly, using iid observations Y_1, ..., Y_m from G. Estimating F and G by their corresponding empirical estimates F̂_n and Ĝ_m, respectively, gives the empirical estimator of the ROC curve in the form

R̂(t) = 1 − F̂_n( Ĝ_m^{−1}(1 − t) ),

which can be shown to converge to R(t) for large sample sizes n and m.
Figure 7.2 presents the nonparametric estimators of the ROC curves based on generated values (n = m = 1000) of the three biomarkers described with respect to Figure 7.1, i.e., F_1 ~ N(0, 1), G_1 ~ N(0, 1) for biomarker A (the diagonal line), F_2 ~ N(1, 1), G_2 ~ N(0, 1) for biomarker B (the dashed line), and F_3 ~ N(10, 1), G_3 ~ N(0, 1) for biomarker C (the dotted line), respectively.

FIGURE 7.2
The nonparametric estimators of the ROC curves related to three different biomarkers based on n = 1000 and m = 1000 data points. The solid diagonal line corresponds to the nonparametric estimator of the ROC curve of biomarker A, where F_1 ~ N(0, 1) and G_1 ~ N(0, 1). The dashed line displays the nonparametric estimator of the ROC curve of biomarker B, where F_2 ~ N(1, 1) and G_2 ~ N(0, 1). The dotted line close to the upper left corner plots the nonparametric estimator of the ROC curve for biomarker C, where F_3 ~ N(10, 1) and G_3 ~ N(0, 1).
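A minimal R sketch (our own code, with an assumed helper emp.roc implementing R̂(t) = 1 − F̂_n(Ĝ_m^{−1}(1 − t))) that generates such data and plots the empirical ROC curves:

> set.seed(1)
> n<-m<-1000
> x1<-rnorm(n,0,1); x2<-rnorm(n,1,1); x3<-rnorm(n,10,1) # cases for biomarkers A, B, C
> y<-rnorm(m,0,1) # controls, G ~ N(0,1)
> emp.roc<-function(x,y,t) 1-ecdf(x)(quantile(y,1-t,type=1)) # empirical ROC at FPR t
> t<-seq(0,1,0.001)
> plot(t,emp.roc(x1,y,t),type="l",lty=1,xlab="1-Specificity",ylab="Sensitivity")
> lines(t,emp.roc(x2,y,t),lty=2)
> lines(t,emp.roc(x3,y,t),lty=3)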
It should be noted that for large sample sizes, the ROC curves are well
approximated by the nonparametric estimators.
In health-related studies, the ROC curve methodology is commonly related
to case-control studies. As a type of observational study, case-control stud-
ies differentiate and compare two existing groups differing in outcome on
the basis of some supposed causal attributes. For example, based on factors
that may contribute to a medical condition, subjects can be grouped as cases
(subjects with a condition/disease) and controls (subjects without the
condition/disease), e.g., cases could be subjects with breast cancer and controls may be healthy subjects. For independent populations, e.g., cases and
controls, various parametric and nonparametric approaches have been
proposed to evaluate the performance of biomarkers (e.g., Pepe, 1997; Hsieh
and Turnbull, 1996; Wieand et al., 1989; Pepe and Thompson, 2000; McIntosh
and Pepe, 2002; Bamber, 1975; Metz et al., 1998).
Proof. By the definition of the ROC curve, R(t) = 1 − F(G^{−1}(1 − t)), where t ∈ [0, 1] and F and G are the distribution functions of X and Y, respectively, the AUC can be expressed as

∫_0^1 R(t) dt = ∫_0^1 {1 − F(G^{−1}(1 − t))} dt = ∫_{−∞}^{∞} {1 − F(w)} dG(w)

= 1 − ∫_{−∞}^{∞} F(w) dG(w) = 1 − Pr(X ≤ Y) = Pr(X > Y),

using the substitution w = G^{−1}(1 − t).
When X ~ N(μ_1, σ_1²) and Y ~ N(μ_2, σ_2²) are independent, the AUC has the closed form

A = Φ( {μ_1 − μ_2}/(σ_1² + σ_2²)^{1/2} ), Φ(u) = (2π)^{−1/2} ∫_{−∞}^{u} exp(−z²/2) dz,

since

A = Pr(X > Y) = Pr(X − Y > 0) = 1 − Pr{ [(X − Y) − (μ_1 − μ_2)]/(σ_1² + σ_2²)^{1/2} ≤ −(μ_1 − μ_2)/(σ_1² + σ_2²)^{1/2} }

= 1 − Φ( −{μ_1 − μ_2}/(σ_1² + σ_2²)^{1/2} ) = Φ( {μ_1 − μ_2}/(σ_1² + σ_2²)^{1/2} ).
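As a quick numerical check of this closed form (a sketch we add, using biomarker B from Figure 7.1, i.e., μ_1 = 1, μ_2 = 0, σ_1 = σ_2 = 1):

> R<-function(t) 1-pnorm(qnorm(1-t,0,1),1,1) # ROC curve of biomarker B
> integrate(R,0,1)$value # numerically integrated AUC, about 0.7602
> pnorm(1/sqrt(2)) # Phi((mu1-mu2)/sqrt(sigma1^2+sigma2^2)), also about 0.7602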
A maximum likelihood estimator of A = Pr(X > Y) = ∫ G(x) dF(x) can be provided by expressing one of the parameters of F and G as a function of A. For example, the following approach can be easily extended to be appropriate for different parametric forms of F and G: when A = Φ( {μ_1 − μ_2}/(σ_1² + σ_2²)^{1/2} ), we have μ_1 = (σ_1² + σ_2²)^{1/2} Φ^{−1}(A) + μ_2, where Φ(Φ^{−1}(u)) = u, and then, using the likelihood function based on observations from X ~ N( (σ_1² + σ_2²)^{1/2} Φ^{−1}(A) + μ_2, σ_1² ) and Y ~ N(μ_2, σ_2²), the maximum likelihood estimator of A can be derived.
Example:
Assume that biomarker levels were measured from diseased and healthy populations, with iid observations X_1 = 0.39, X_2 = 1.97, X_3 = 1.03, and X_4 = 0.16, which are assumed to be from a normal distribution N(μ_1, σ_1²), and iid observations Y_1 = 0.42, Y_2 = 0.29, Y_3 = 0.56, Y_4 = −0.68, and Y_5 = −0.54, which are assumed to be from a normal distribution N(μ_2, σ_2²). The maximum likelihood estimates are μ̂_1 = 0.8875, μ̂_2 = 0.01, σ̂_1² ≅ 0.4922, and σ̂_2² ≅ 0.2655, so that Â = Φ( {μ̂_1 − μ̂_2}/(σ̂_1² + σ̂_2²)^{1/2} ) ≅ Φ(1.008) ≅ 0.84, suggesting a moderate discriminating ability of the biomarker with respect to the disease. The interpretation of the AUC is that this particular marker accurately predicts a case to be a case and a control to be a control 84% of the time and that 16% of the time the marker will misclassify the groupings.
Example:
Biomarker levels were measured from diseased and healthy popula-
tions, providing iid observations X 1 = 0.39, X 2 = 1.97 , X 3 = 1.03, and
X 4 = 0.16 , which are assumed to be from a continuous distribution,
and iid observations Y1 = 0.42, Y2 = 0.29 , Y3 = 0.56, Y4 = −0.68 , and
Y5 = −0.54, which are also assumed to be from a continuous distribution,
respectively. In this case, the empirical estimates of the distribution functions F and G have the forms F̂_4(t) = (1/4) Σ_{i=1}^{4} I(X_i < t) and Ĝ_5(t) = (1/5) Σ_{j=1}^{5} I(Y_j < t), respectively, and the empirical estimate of the ROC curve is R̂(t) = 1 − F̂_4( Ĝ_5^{−1}(1 − t) ). The AUC can be estimated as

Â = {1/(4 × 5)} Σ_{i=1}^{4} Σ_{j=1}^{5} I(X_i > Y_j) = 0.75,

suggesting a moderate discriminating ability of the biomarker with respect to discriminating the disease population from the healthy population. Note that in this case, the non-
parametric estimator of the AUC is smaller than the estimator of the
AUC under the normal assumption (Section 7.3.1). For such a small data-
set used in our examples the difference in AUC estimates is likely due
more to the discreteness of the nonparametric method than any real dif-
ference between the approaches.
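Both estimates from the two examples above can be reproduced in a few lines of R (a sketch we add; the parametric line uses the maximum likelihood variance estimates):

> x<-c(0.39,1.97,1.03,0.16)
> y<-c(0.42,0.29,0.56,-0.68,-0.54)
> mean(outer(x,y,">")) # nonparametric (Mann-Whitney) AUC estimate: 0.75
> pnorm((mean(x)-mean(y))/sqrt(mean((x-mean(x))^2)+mean((y-mean(y))^2))) # about 0.84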
The logistic regression model specifies the conditional probability of disease as

Pr(q = 1|z) = e^{βᵀz}/(1 + e^{βᵀz}),

where the covariate vector z can include a component with the value of 1 to represent an intercept coefficient of the model, the vector β consists of the parameters of the regression, and the operator T denotes the transpose. Logistic regression relies only on an assumption about the form of the conditional probability for disease given the covariate vector z and does not require a model for the marginal distribution of z.
TABLE 7.1
The outcome levels and the biomarker values z , where X
and Y represent values of measurements of the biomarker
in the case and control group, respectively
Class z (Biomarker values)
The ROC curve, R, is the graph of F_1(u) against F_0(u) as u ranges over all possible values. The overestimation of the ROC curve can then be expressed as the difference

R̂(β̂) − R(β̂),

where the difference between two curves denotes the curve of differences in vertical coordinates graphed against common values for the horizontal coordinate. In other words, this is the difference in true positive rates having chosen the thresholds to match the false positive rates. Typically, the overestimation is positive; that is, the retrospective ROC gives an inflated assessment of the true performance of the score.
Consider the classifier qˆ = 1, if β̂T z ≥ u, and qˆ = 0, if β̂T z < u. Then it is well
known that the retrospective error rate, namely the proportion of cases in the
sample for which qˆ ≠ q, is a downward biased estimate of the true prospec-
tive error rate, which would be obtained if the classifier q̂ were to be applied
to the whole population. Efron (1986) derives an asymptotic approximation
to the expected bias. Efron’s formula is consistent with the approximation
based on the ROC curve methodology, which we will present in the follow-
ing subsection.
Recall that, under the logistic regression model,

Pr(q = 1|z) = e^{βᵀz}/(1 + e^{βᵀz}),

and, for the ith subject,

p_i = e^{u_i}/(1 + e^{u_i}), u_i = βᵀz_i.

If the scores û_i are applied to the replicated data against a threshold v, the future proportions of false and true positives are

F̄_0(v) = {1/(n(1 − p̄))} Σ_i H(û_i − v)(1 − p_i), F̄_1(v) = {1/(n p̄)} Σ_i H(û_i − v) p_i,

where p̄ = n^{−1} Σ_i p_i and H(·) denotes the Heaviside function (H(x) = 1 for x ≥ 0, and 0 otherwise).
7.4.2 Expected Bias of the ROC Curve and Overestimation of the AUC
To simplify the notation, let ε_i = q_i − p_i, H_i = H(û_i − u), and H_i* = H(û_i − v). Then we can obtain

q̄ = p̄ + ε̄

and

n{F̂_0(u) − F̄_0(v)} = Σ_i H_i { (1 − p_i − ε_i)/(1 − p̄ − ε̄) − (1 − p_i)/(1 − p̄) } − {1/(1 − p̄)} Σ_i (H_i* − H_i)(1 − p_i).

For large n, values of p_i will be close to p_u, which is the true value of Pr(y = 1|βᵀz = u). Hence

n{F̂_0(u) − F̄_0(v)} ≈ Σ_i H_i { (1 − p_i − ε_i)/(1 − p̄ − ε̄) − (1 − p_i)/(1 − p̄) } − {(1 − p_u)/(1 − p̄)} Σ_i (H_i* − H_i),

and, similarly,

n{F̂_1(u) − F̄_1(v)} ≈ Σ_i H_i { (p_i + ε_i)/(p̄ + ε̄) − p_i/p̄ } − (p_u/p̄) Σ_i (H_i* − H_i).

Combining these two approximations leads to an expression of the vertical bias of the ROC curve in terms of the quantities

A(i, u) = { (p_i + ε_i)/(p̄ + ε̄) − p_i/p̄ } − {(1 − p̄) p_u}/{p̄ (1 − p_u)} { (1 − p_i − ε_i)/(1 − p̄ − ε̄) − (1 − p_i)/(1 − p̄) }.
Taking expectations and applying a normal approximation yields an expression, S(u), for the expected overestimation of the ROC curve, involving the standard normal density

φ(u) = (2π)^{−1/2} exp(−u²/2).

The terms p_u and u_i in the right-hand side are functions of the true parameter vector β, and the terms d_i and p̄ depend on p_i and hence are also functions of β. Estimating these in the obvious way using β̂ gives the corresponding estimator Ŝ(u). The corrected ROC curve, obtained by removing the estimated overestimation Ŝ(u) from the retrospective curve, is an estimator of the ROC curve that would be obtained if the fitted score β̂ᵀz were to be validated on a large replicated sample. As expected, this indicates that this score discriminates between the two populations noticeably less well than the retrospective analysis seems to suggest.
The estimated AUC is obtained analogously from the corrected curve. Similar to the overestimation in the ROC curve described above, the overestimation in the AUC is

{1/(n²(1 − q̄))} Σ_{i,j} H(û_i − û_j)(1 − q_i) A(i, û_j) ≈ {1/(2 n^{5/2} p̄(1 − p̄))} Σ_{i≠j} p_i(1 − p_i) b_{ij} φ{ n^{1/2}(u_j − u_i)/b_{ij} },

which can be further approximated by

{1/(n p̄(1 − p̄))} E[ p_v(1 − p_v) f(u) tr{Ω^{−1} var(z|U)} ],

where U = βᵀz has probability density function f(u), and E(·) and var(·) denote the empirical expectation and variance over the empirical distribution of the sample covariate vectors z_1, ..., z_n. Note that the overestimation is again of the order O(n^{−1}). For more details see Copas and Corbett (2002).
7.4.3 Example
Let X1 ,..., X m and Y1 ,..., Yn denote biomarker measurements from non-diseased
and diseased populations, respectively. Assume that we would like to exam-
ine the discriminant ability of the biomarker based on the AUC, that is, test-
ing for H 0 : AUC = 1/ 2 versus H 1 : AUC ≠ 1/ 2 as well as based on the logistic
regression, i.e., testing for the coefficient, β , H 0 : β = 0 versus H 1 : β ≠ 0 .
As described in Section 7.3.2, a nonparametric estimator of the AUC is
equivalent to the well-known Mann–Whitney statistic, which follows an
asymptotic normal distribution, and the variance of this empirical estimator
can be obtained using U-statistic theory (Serfling, 2002). Then the nonparametric estimator of the AUC is Â = {1/(mn)} Σ_{i=1}^{m} Σ_{j=1}^{n} I(X_i > Y_j). Let R_i be the rank of X_i (the ith ordered value among the X_i's) in the combined sample of X_i's and Y_j's, and S_j be the rank of Y_j (the jth ordered value among the Y_j's) in the combined sample of X_i's and Y_j's. Define
S²_{10} = {1/((m − 1)n²)} [ Σ_{i=1}^{m} (R_i − i)² − m ( (1/m) Σ_{i=1}^{m} R_i − (m + 1)/2 )² ],

S²_{01} = {1/((n − 1)m²)} [ Σ_{j=1}^{n} (S_j − j)² − n ( (1/n) Σ_{j=1}^{n} S_j − (n + 1)/2 )² ],

S² = ( m S²_{01} + n S²_{10} )/(m + n).

The variance of Â can then be estimated by S²/{mn/(m + n)}, and the test statistic (Â − 1/2)/[S²/{mn/(m + n)}]^{1/2} is asymptotically N(0, 1)-distributed under H_0: AUC = 1/2.
FIGURE 7.3
The experimental comparisons based on the Monte Carlo powers related to the test for H_0: AUC = 1/2 versus H_1: AUC ≠ 1/2 (upper curves) and the test for H_0: β = 0 versus H_1: β ≠ 0 (lower curves). The tests were performed in order to detect the discriminant ability of the biomarker at the 0.05 significance level. The left and the right panels correspond to the sample sizes n = m = 50 and n = m = 100, respectively (powers plotted against δ).
Figure 7.3 presents the Monte Carlo powers of the test for H_0: AUC = 1/2 versus H_1: AUC ≠ 1/2 (upper curves) and of the test for H_0: β = 0 versus H_1: β ≠ 0 (lower curves) to detect the discriminant ability of the biomarker at the 0.05 significance level, where the left and the right panels correspond to the sample sizes n = m = 50 and n = m = 100, respectively. In these cases, the test based on the AUC concept demonstrates better power to detect the discriminant ability of the biomarker compared to the test based on the logistic regression.
The R code to implement the procedures is presented below:
> # empirical AUC (Mann-Whitney stat), returns p-value for H_0: AUC=1/2
versus H_1: AUC!=1/2
> AUC.pval<-function(x,y) {
+ m<-length(x)
+ n<-length(y)
+ x<-sort(x)
+ y<-sort(y)
+ rankall<-rank(c(x,y))
+ R<-rankall[1:m]
+ S<-rankall[-c(1:m)] # ranks of y
+
+ S10<-1/((m-1)*n^2)*(sum((R-1:m)^2)-m*(mean(R)-(m+1)/2)^2)
+ S01<-1/((n-1)*m^2)*(sum((S-1:n)^2)-n*(mean(S)-(n+1)/2)^2)
+ S2<-(m*S01+n*S10)/(m+n) # combine the variance components as in the formula for S^2 above
+
+ a<-wilcox.test(y,x)$statistic/(n*m)
+ AUC<-as.numeric(ifelse(a>=0.5,a,1-a)) # empirical AUC
+ pval<- 2*(1-pnorm(abs(AUC-0.5)/sqrt(S2/(m*n/(m+n)))))
+ return(pval)
+ }
>
Note that in the case where both the case group X and the control group Y follow log-normal distributions logN(μ, σ²) with a common value of μ and differing values of σ² between the case group and the control group, the AUC does not allow us to detect any significant difference between the two groups. This is because a simple log transformation of the observed data points shows that the AUC is equal to 0.5 in this case: log X − log Y is normally distributed with mean zero, so that Pr(X > Y) = Pr(log X > log Y) = 0.5. We refer the reader to Section 7.3.1 for the computation of the AUC under normal assumptions.
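A two-line Monte Carlo confirmation of this remark (a sketch we add, with hypothetical choices σ = 1 for controls and σ = 2 for cases):

> set.seed(1)
> x<-rlnorm(2000,0,2); y<-rlnorm(2000,0,1) # equal mu, different sigma
> mean(outer(x,y,">")) # empirical AUC, close to 0.5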
For example, in the case of two biomarkers, i.e., d = 2, the AUC can be defined as A(a) = Pr(X_1 + aX_2 > Y_1 + aY_2).
7.6 Remarks
Medical diagnoses usually involve the classification of patients into two or
more categories. When subjects are categorized in a binary manner, i.e.,
non-diseased and diseased, the ROC curve methodology is an important
statistical tool for evaluating the accuracy of continuous diagnostic tests,
and the AUC is one of the common indices used for overall diagnostic
accuracy. In many situations, the diagnostic decision is not limited to a
binary choice. For example, a clinical assessment, NPZ-8, of the presence of
HIV-related cognitive dysfunction (AIDS Dementia Complex [ADC]) would
discriminate between patients exhibiting clinical symptoms of ADC (com-
bined stages 1–3), subjects exhibiting minor neurological symptoms (ADC
stage 0.5) and neurologically unimpaired individuals (ADC stage 0) (Nakas
and Yiannoutsos, 2004). For such disease processes with three stages,
binary statistical tools such as the ROC curve and AUC need to be extended.
In this case of three ordinal diagnostic categories, the ROC surface and the
volume under the surface can be applied to assess the accuracy of tests. For
details, we refer the reader to, for example, Nakas and Yiannoutsos (2004).
In contrast, logistic regression can be easily generalized, e.g., to multino-
mial logistic regression, in order to deal with multiclass problems.
There are several measurements that are often used in conjunction with the ROC curve technique. For example: (1) the Youden index is a summary measure of the ROC curve. It both measures the effectiveness of a diagnostic marker and enables the selection of an optimal threshold value (cutoff point) for the biomarker. The Youden index, J, is the point on the ROC curve that is farthest from the line of equality (diagonal line); that is, using the definitions applied in Section 7.2, one can show that J = max_{−∞<c<∞} { Pr(X ≥ c) + Pr(Y ≤ c) − 1 } (e.g., Fluss et al., 2005; Schisterman et al., 2005). (2) The partial AUC (pAUC) has been proposed as an alternative measure to the full AUC (a numerical sketch of both measures follows at the end of this section). When using the partial AUC, one considers only those regions of the ROC space where data have been observed, or which correspond to clinically relevant values of test sensitivity or specificity, that is, pAUC = ∫_a^b R(t) dt for prespecified
0 ≤ a < b ≤ 1. Assume, for example, random variables Y_D and Y_D̄ are from the distribution functions F_{Y_D} and F_{Y_D̄} that correspond to a biomarker's measurements from diseased (D) and non-diseased (D̄) subjects, respectively. The partial area under the ROC curve is the area under a portion of the ROC curve, oftentimes defined as the area between two false positive rates (FPRs). For example, the pAUC with two fixed a priori values for FPRs t_0 and t_1 is

pAUC = ∫_{t_0}^{t_1} R(t) dt = Pr{ Y_D > Y_D̄, Y_D̄ ∈ ( S_{Y_D̄}^{−1}(t_1), S_{Y_D̄}^{−1}(t_0) ) },

where S_{Y_D} and S_{Y_D̄} are the survival functions of the diseased and healthy groups, respectively. To simplify this notation, we denote q_0 = S_{Y_D̄}^{−1}(t_0) and q_1 = S_{Y_D̄}^{−1}(t_1). Then

pAUC = Pr( Y_D > Y_D̄, q_1 < Y_D̄ < q_0 ).
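As promised above, here is a minimal R sketch (our own illustration, under the binormal setting of biomarker B from Figure 7.1) evaluating the Youden index and a pAUC:

> R<-function(t) 1-pnorm(qnorm(1-t,0,1),1,1) # ROC curve for X ~ N(1,1) versus Y ~ N(0,1)
> t<-seq(0,1,0.001)
> max(R(t)-t) # Youden index J = max{Se + Sp - 1}, about 0.38
> integrate(R,0.1,0.3)$value # pAUC over the FPR range [0.1, 0.3]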
8
The Ville and Wald Inequality: Extensions and Applications

8.1 Introduction
The purpose of this chapter is twofold: (1) we desire to introduce a very powerful theoretical scheme for constructing statistical procedures, and (2) we would like to show again arguments against statistical stereotypes. For example, one generally can state that if a statistical test has a reasonable fixed Type I error rate then the test cannot be of power one. In this chapter, a test with power one is presented.
In previous chapters we employed the law of large numbers and the cen-
tral limit theorem to propose and examine statistical tools. In this chap-
ter we apply the law of the iterated logarithm. The material presented in
this chapter may seem overly technical, however we encourage the reader
to study the methodological technique introduced below in order to obtain
beneficial skills in developing advanced statistical procedures.
The Ville and Wald inequality: In Chapter 4, Doob’s inequality was intro-
duced with a focus on developing a bound for the distribution of a statistic,
which can be anticipated to be increasing when the sample size n increases.
In this case, we concluded that Doob’s inequality can be expected to be
very accurate in terms of such a bound. The right side of Doob’s inequal-
ity (the bound) does not depend on n. In the asymptotic framework this
result begs the following question: can we apply the symbolic expression "lim_{n→∞} Pr(·) = Pr(lim_{n→∞} ·) ≤ the bound" with respect to Doob's inequality? In general, since the probability can be written as Pr(·) = E{I(·)} = ∫ I(·), where I(·) denotes the indicator function, the action "lim_{n→∞} Pr(·) = Pr(lim_{n→∞} ·)" corresponds to complicated issues related to possibly plugging the operator lim_{n→∞} into the integral. In this context, the Fatou–Lebesgue theorem (Titchmarsh, 1976) can justify the operation "lim_{n→∞} Pr(·) = Pr(lim_{n→∞} ·)", provided that the conditions of the theorem are satisfied. In this chapter, when
n = ∞, we introduce a simple extension of Doob’s inequality that is entitled
the Ville and Wald inequality. This result was presented by Ville in 1939.

The law of the iterated logarithm: This classical result concerns the asymptotic behavior of the statistic S_n/b_n, where S_n = Σ_{i=1}^{n} X_i is a sum of n iid random variables with zero expectation and b_n → ∞ as n → ∞ is a deterministic sequence "in between" the law of large numbers (b_n = n) and the central limit theorem (b_n = n^{1/2}). The law of the iterated logarithm states that

Pr{ lim sup_{n→∞} S_n/(2n var(X_1) log(log(n)))^{1/2} = 1 } = 1 (log(e) = 1),

whereas

Pr{ max_{1≤n<∞} ( S_n − (2n var(X_1) log(log(n)))^{1/2} ) ≥ 0 } = Pr{ S_n ≥ (2n var(X_1) log(log(n)))^{1/2} for some n ≥ 1 } < 1.
For example, when X_1, ..., X_n are iid data points, it is reasonable to reject the hypothesis E(X_1) = 0 if S_n does not follow the law of the iterated logarithm related to the case with E(X_1) = 0. For illustrative purposes, we generated 5,000 iid N(0, 1)-distributed data points and plotted the corresponding values of S_n/n^{1/2} against n in Figure 8.1, where panel (b) displays the bounds related to the law of the iterated logarithm.
FIGURE 8.1
Values of S_n/n^{1/2} (◦◦◦) plotted against n = 12, ..., 5000. Panel (a) represents the bounds ±1 and panel (b) represents the bounds ±(2 log(log(n)))^{1/2}.
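A short sketch that reproduces a plot of this type (our own code, not the authors' original listing):

> set.seed(3)
> n<-12:5000
> S<-cumsum(rnorm(5000))[n]
> par(mfrow=c(1,2))
> plot(n,S/sqrt(n),cex=0.3,ylim=c(-2,2),xlab="n",ylab=""); abline(h=c(-1,1))
> plot(n,S/sqrt(n),cex=0.3,ylim=c(-2,2),xlab="n",ylab="")
> lines(n,sqrt(2*log(log(n)))); lines(n,-sqrt(2*log(log(n))))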
It is clear that, in terms of the Type I error rate control, the law of the iterated logarithm could ensure the use of a test procedure based on S_n/n^{1/2} in a better manner than the use of the central limit theorem. And the "payment" for this improvement is only the factor (2 log(log(n)))^{1/2}, which clearly cannot significantly inflate the corresponding test thresholds.

The Ville and Wald inequality states that, for the likelihood ratio LR_n = g′_n(X_1, ..., X_n)/g_n(X_1, ..., X_n) based on the joint density functions g′_n and g_n corresponding to hypotheses H′ and H, respectively, we have Pr_H{LR_n ≥ ε for some n ≥ 1} ≤ 1/ε, for all ε > 0. Indeed, defining the stopping time N = inf{n ≥ 1 : LR_n ≥ ε} and noting that g_j(x_1, ..., x_j) ≤ ε^{−1} g′_j(x_1, ..., x_j) on the event {N = j}, we obtain

Pr_H{LR_n ≥ ε for some n ≥ 1} = Pr_H(N < ∞) = Σ_{j=1}^{∞} ∫ I(N = j) g_j(x_1, ..., x_j) dx_1 ... dx_j

≤ (1/ε) Σ_{j=1}^{∞} ∫ I(N = j) g′_j(x_1, ..., x_j) dx_1 ... dx_j = (1/ε) Pr_{H′}(N < ∞) ≤ 1/ε.
To illustrate, assume that, under H, the observations are iid N(0, 1), so that

g_n(x_1, ..., x_n) = ∏_{i=1}^{n} φ(x_i), with φ(x) = (2π)^{−1/2} exp(−x²/2).

Assume that H′ corresponds to a mixture of distribution functions P_θ, where P_θ denotes that X_1, ..., X_n are iid N(θ, 1), defining

g′_n(x_1, ..., x_n) = ∫_{−∞}^{∞} ∏_{i=1}^{n} φ(x_i − θ) dF(θ),

for an arbitrary distribution function F. Thus we obtain the likelihood ratio

LR_n = g′_n(X_1, ..., X_n)/g_n(X_1, ..., X_n) = ∫_{−∞}^{∞} exp(θ S_n − nθ²/2) dF(θ), S_n = Σ_{i=1}^{n} X_i.

Substituting θ = y m^{−1/2}, where y has the distribution function F and m > 0 is fixed, gives

LR_n = ∫_{−∞}^{∞} exp( y S_n m^{−1/2} − n y² (2m)^{−1} ) dF(y) = f( S_n m^{−1/2}, n/m ),

where f(x, t) = ∫ exp(yx − ty²/2) dF(y). Consequently, by the Ville and Wald inequality,
Pr_H{ f(S_n m^{−1/2}, n/m) ≥ ε for some n ≥ 1 } = Pr_{θ=0}{ f(S_n m^{−1/2}, n/m) ≥ ε for some n ≥ 1 } ≤ 1/ε (m > 0, ε > 0).
The following examples illustrate the significance of the result above, consid-
ering particular choices of F.
This result remains valid if instead of being iid N (θ,1) the observations are
any iid random variables that satisfy E {exp (uX 1)} < ∞ for all 0 < u < ∞ (Robbins,
1970). This comment is applicable regarding the following examples too.
Example 2. Let F be the distribution function degenerate at the point 2a, a > 0. Then

f(x, t) = exp(2ax − 2a²t) ≥ ε if and only if x ≥ at + {log(ε)}/(2a).

Thus

Pr_{θ=0}{ S_n ≥ an/m^{1/2} + m^{1/2} {log(ε)}/(2a) for some n ≥ 1 } ≤ 1/ε.
Example 3. Let F correspond to the half-normal distribution,

F(y) = I(y > 0) (2/π)^{1/2} ∫_0^y e^{−u²/2} du.

Then, for t > −1, we have

f(x, t) = (2/π)^{1/2} ∫_0^∞ e^{yx − (t+1)y²/2} dy.

Defining h^{−1}(u) as the inverse function to h(u) = u² + 2 log(Φ(u)), one can obtain A(t, ε) = (t + 1)^{1/2} h^{−1}( 2 log(ε/2) + log(t + 1) ), which yields

Pr_{θ=0}{ S_n ≥ (m + n)^{1/2} h^{−1}( h(a) + log(n/m + 1) ) for some n ≥ 1 } ≤ e^{−a²/2}/(2Φ(a)),

where a > 0 satisfies 0 < ε = 2Φ(a) e^{a²/2}, m > 0, and it is known that h(u) ~ u², h^{−1}(u) ~ u^{1/2} as u → ∞.
Example 4. Omitting technical details, one can show that the choice of

F(y) = I(0 < y < e^{−e}) ∫_0^y δ du / [ u log(1/u) {log(log(1/u))}^{1+δ} ], for all δ > 0,

provides that

Pr_H{ S_n ≥ c_n for some n ≥ 1 } = Pr_{θ=0}{ S_n ≥ c_n for some n ≥ 1 } ≤ 1/ε, c_n = m^{1/2} A(n/m, ε),

with c_n ~ (2n log(log(n)))^{1/2} as n → ∞ (Robbins and Siegmund, 1970).
Example 5. Similarly, one can obtain

Pr_{θ=0}{ |S_n| ≥ (m + n)^{1/2} ( a² + log(n/m + 1) )^{1/2} for some n ≥ 1 } ≤ e^{−a²/2},

where a > 0 and m > 0. As a one-sided version of this inequality, noting that for all t > 1, ( a² + log(t) )^{1/2} > h^{−1}( h(a) + log(t) ) (the reader can simply plot the functions ( a² + log(t) )^{1/2} and h^{−1}( h(a) + log(t) ) against t > 1, where h and h^{−1} are defined in Example 3), we have

Pr_{θ=0}{ S_n ≥ (m + n)^{1/2} ( a² + log(n/m + 1) )^{1/2} for some n ≥ 1 } ≤ e^{−a²/2}/(2Φ(a)).
In a similar manner to that shown above, Robbins (1970) and Robbins and Siegmund (1970) concluded with

Pr_{θ=0}{ |S_n| ≥ n^{1/2} ( a² + log(n/m) )^{1/2} for some n ≥ m } ≤ 2(1 − Φ(a) + aφ(a)), φ(u) = (2π)^{−1/2} e^{−u²/2}.
Actually, in the above inequality we can see the role of m > 0, when it is stated that "for some n ≥ m." The integer m can define a pre-sample size of observations after which we analyze max_{m≤n<∞} S_n.

The preceding examples show how to transform the Ville and Wald inequality to the form Pr_H{S_n ≥ c_n for some n} ≤ 1/ε or Pr_H{|S_n| ≥ c_n for some n} ≤ 1/ε, where c_n depends on ε > 0. Example 2 derives c_n = an/m^{1/2} + dm^{1/2} ~ an/m^{1/2}, Example 3 derives c_n = (m + n)^{1/2} h^{−1}( h(a) + log(n/m + 1) ) ~ (n log(n))^{1/2}, and Example 4 concludes with c_n ~ (2n log(log(n)))^{1/2} as n → ∞. Let us focus again on the law of the iterated logarithm. The sequence c_n ~ (2n log(log(n)))^{1/2} increases about as slowly as is possible for sums of iid random variables with mean 0 and variance 1 while preserving Pr_H{S_n ≥ c_n for some n ≥ 1} < 1. Assume, for example, a researcher
attempts to improve upon c_n ~ (2n log(log(n)))^{1/2}, with the aim of providing a limit to the expression Pr_H{ S_n ≥ c_n (log(log(log(n))))^{1/2}/(log(log(n)))^{1/2} for some n }. In this case, the law of the iterated logarithm implies that this approach would be invalid, given that the modified sequence is of the order (2n log(log(log(n))))^{1/2}, so that

Pr_H{ S_n ≥ (2n log(log(log(n))))^{1/2} for some n ≥ 1 } = 1.
FIGURE 8.2
Values of S_n (◦◦◦) plotted against n = 150, ..., 10000. The figure presents the bounds ±(2n log(log(n)))^{1/2} (curves —) and ±(2n log(log(log(n))))^{1/2} (curves - - - -).
Note that, in the equation above, we changed the measure Pr_θ to be Pr_{θ=0}, since, under the assumption X_i ~ N(θ, 1), the expression (S_n − c_n)/n < θ < (S_n + c_n)/n is equivalent to |S_n| < c_n when X_i ~ N(0, 1). Thus

Pr_θ{ (S_n − c_n)/n < θ < (S_n + c_n)/n for every n ≥ 1 } ≥ 1 − e^{−a²/2},

which can be made near 1 by choosing a sufficiently large value of a, e.g., a² = 6 yields 1 − exp(−a²/2) ≅ 0.95. Thus for a² = 6 and any m > 0, the sequence of intervals I_n = ( (S_n − c_n)/n, (S_n + c_n)/n ) with c_n defined above forms a 95% confidence sequence for an unknown θ with coverage probability ≥ 0.95.
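A small simulation sketch of this confidence sequence (our own illustration, with the hypothetical choices θ = 0.3, m = 10, and a² = 6):

> set.seed(1)
> m<-10; a2<-6
> n<-1:1000; S<-cumsum(rnorm(1000,0.3,1))
> cn<-sqrt((n+m)*(a2+log(n/m+1)))
> all((S-cn)/n<0.3 & 0.3<(S+cn)/n) # TRUE with probability at least 0.95 over repetitions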
Following the conventional concept of confidence interval constructions (see Chapter 9 in this context), one can use the fact S_n = X_1 + ... + X_n ~ N(θn, n) to propose, for any fixed n,

Pr_θ{ (S_n − 1.96 n^{1/2})/n < θ < (S_n + 1.96 n^{1/2})/n } ≅ 0.95;

in contrast to the confidence sequence above, this statement is valid only at a single, prespecified sample size n and not simultaneously over all n.
Consider now a nonparametric confidence interval for the median M of iid continuous random variables Z_1, ..., Z_n, where Pr(Z_1 ≤ M) = 0.5. For a fixed a > 0, define a_1 = a_1(n) as the largest integer ≤ 0.5(n − an^{1/2}) and a_2 = a_2(n) as the smallest integer ≥ 0.5(n + an^{1/2}). Then, for large n,

Pr{ Z_{a_1}^{(n)} ≤ M ≤ Z_{a_2}^{(n)} } ≅ 2Φ(a) − 1,

where Z_1^{(n)} ≤ ... ≤ Z_n^{(n)} denote the ordered values Z_1, ..., Z_n.
Proof. As used in various proof schemes shown in this book, the idea of this proof is to rewrite the stated problem as a formulation in terms of sums of iid random variables and then to employ an appropriate theorem related to sums of iid random variables. In this case, we aim to use the central limit theorem (Section 2.4.5) via the following algorithm. Define the empirical distribution function F_n(u) = Σ_{i=1}^{n} I(Z_i ≤ u)/n. The function F_n(u) increases in u, providing

Pr{ Z_{a_1}^{(n)} ≤ M } = Pr{ F_n(Z_{a_1}^{(n)}) ≤ F_n(M) } = Pr{ a_1/n ≤ F_n(M) },

since F_n(Z_{a_1}^{(n)}) = Σ_{i=1}^{n} I(Z_i ≤ Z_{a_1}^{(n)})/n = a_1/n. Thus

Pr{ Z_{a_1}^{(n)} ≤ M } = 1 − Pr{ Σ_{i=1}^{n} {I(Z_i ≤ M) − 0.5}/(0.5 n^{1/2}) < −a },

where I(Z_1 ≤ M), ..., I(Z_n ≤ M) are iid random variables with E{I(Z_1 ≤ M)} = Pr(Z_1 ≤ M) = 0.5 and var{I(Z_1 ≤ M)} = E{I(Z_1 ≤ M) − 0.5}² = 0.5². Then, by the central limit theorem, Pr{ Z_{a_1}^{(n)} ≤ M } → 1 − Φ(−a) = Φ(a) as n → ∞. Similarly, we have Pr{ M ≤ Z_{a_2}^{(n)} } → Φ(a) as n → ∞. Since

Pr{ Z_{a_1}^{(n)} ≤ M ≤ Z_{a_2}^{(n)} } = Pr{ Z_{a_1}^{(n)} ≤ M } + Pr{ M ≤ Z_{a_2}^{(n)} } − Pr{ Z_{a_1}^{(n)} ≤ M ∪ M ≤ Z_{a_2}^{(n)} } ≅ Φ(a) − 1 + Φ(a),

the proof is complete.
Now, to construct a confidence sequence for the median M, define X_i = 1 if Z_i ≤ M and X_i = −1 if Z_i > M, S_n = X_1 + ... + X_n, and

b_1 = b_1(n) = largest integer ≤ 0.5(n − c_n), b_2 = b_2(n) = smallest integer ≥ 0.5(n + c_n).

We show that

Pr( Z_{b_1}^{(n)} ≤ M ≤ Z_{b_2}^{(n)} for every n ≥ m > 0 ) ≥ 1 − Pr( |S_n| ≥ c_n for some n ≥ m > 0 ).

If Z_{b_2}^{(n)} < M, then Z_1^{(n)} < ... < Z_{b_2}^{(n)} < M and we have at least b_2 values of 1 present in X_1, ..., X_n. In this case, the values of the X's that do not correspond to Z_1^{(n)}, ..., Z_{b_2}^{(n)} can be equal to 1 or (−1) but must be ≥ (−1). This implies S_n ≥ b_2 − (n − b_2) = 2b_2 − n ≥ c_n; arguing similarly for the event Z_{b_1}^{(n)} > M, the event A that the interval [Z_{b_1}^{(n)}, Z_{b_2}^{(n)}] fails to cover M at some n ≥ m satisfies A ⊆ {|S_n| ≥ c_n for some n ≥ m}, leading to Pr(Ā) = 1 − Pr(A) ≥ 1 − Pr(|S_n| ≥ c_n for some n ≥ m), which is the stated result.

It is clear that E{exp(uX_1)} < ∞ for all 0 < u < ∞. Then, following Example 5 in Section 8.3, we use the sequence c_n = [n( a² + log(n/m) )]^{1/2}, obtaining Pr(|S_n| ≥ c_n for some n ≥ m) ≤ 2(1 − Φ(a) + aφ(a)), which provides

Pr( Z_{b_1}^{(n)} ≤ M ≤ Z_{b_2}^{(n)} for every n ≥ m > 0 ) ≥ 1 − 2(1 − Φ(a) + aφ(a)) = 2Φ(a) − 1 − 2aφ(a).
Note that, comparing the method based on Pr{ Z_{a_1}^{(n)} ≤ M ≤ Z_{a_2}^{(n)} } ≅ 2Φ(a) − 1 with the result above, one can compute, for large m, that

a      2Φ(a) − 1      2Φ(a) − 1 − 2aφ(a)
2.0    0.9546         0.7386
2.5    0.9876         0.9001
2.8    0.9948         0.9506
3.0    0.9973         0.9710
3.5    0.9996         0.9933
We would like to emphasize again that the bound above, obtained by virtue of the law of the iterated logarithm machinery, holds uniformly in n ≥ m, whereas in many practical situations we do not know whether the sample size n is large enough for the asymptotic proposition Pr{Z_{(a_1)}^{(n)} ≤ M ≤ Z_{(a_2)}^{(n)}} ≅ 2Φ(a) − 1 to hold.
8.4.3 Test with Power One

Assume that we can observe the independent and identically N(θ, 1)-distributed data points X_1, ..., X_n, X_{n+1}, ... in order to test for H_0: θ ≤ 0 versus H_1: θ > 0. Define the stopping time

N = min{n ≥ 1 : S_n ≥ c_n} (N = ∞ if no such n exists), S_n = X_1 + ... + X_n,

to apply the test procedure: when N < ∞, we stop sampling with X_N and reject H_0 in favor of H_1; if N = ∞, we continue sampling indefinitely and do not reject H_0. In this case, the Type I error rate has the form
Pr_{θ≤0}(we reject H_0) = Pr_{θ≤0}(N < ∞) = Pr_{θ≤0}{max_{1≤k<∞} (Σ_{i=1}^k X_i − c_k) ≥ 0}
= Pr_{θ≤0}{max_{1≤k<∞} (Σ_{i=1}^k (X_i − θ) + θk − c_k) ≥ 0}.
That is to say,

Pr_{θ≤0}(we reject H_0) ≤ Pr_{θ≤0}{max_{1≤k<∞} (Σ_{i=1}^k (X_i − θ) − c_k) ≥ 0},
since θk ≤ 0. Then

Pr_{θ≤0}(we reject H_0) ≤ Pr_{θ=0}{max_{1≤k<∞} (Σ_{i=1}^k X_i − c_k) ≥ 0} = Pr_{θ=0}(N < ∞).
As in Section 8.4.1, here we changed the measure Pr_θ to be Pr_{θ=0}. If we are using c_n = [(n + m)(a² + log(n/m + 1))]^{1/2}, as presented in Example 5 of Section 8.3, then we have

Pr_{θ≤0}(we reject H_0) ≤ Pr_{θ=0}{max_{1≤k<∞} (Σ_{i=1}^k X_i − c_k) ≥ 0}
= Pr_{θ=0}(S_n ≥ c_n for some n ≥ 1) ≤ {2Φ(a)}^{−1} e^{−a²/2},
Regarding the power, since c_n/n → 0 as n → ∞ while, by the strong law of large numbers, S_n/n → θ > 0 almost surely under the alternative, we have

1 − Power = Pr_{θ>0}(we do not reject H_0) = Pr_{θ>0}(S_n/n < c_n/n for all n ≥ 1) = 0,

and the test has power one against the alternative θ > 0.
Regarding the expected sample size Eθ> 0 ( N ), it can be shown that for any
stopping rule N based on X 1 ,... the inequality Eθ> 0 ( N ) ≥ −2 log (Pr0 ( N < ∞)) /θ2
must hold for every θ > 0 (see Chapter 6). Thus if we are willing to tolerate an
N for which Pr0 ( N < ∞) = 0.05, then necessarily Eθ> 0 ( N ) ≥ 6 / θ2 for every θ > 0;
however, no such N will minimize Eθ ( N ) uniformly for all θ > 0. For N applied
in this section, where cn = [(n + m)( a 2 + log(n/m + 1))]1/2 , a simple Monte Carlo
study can evaluate Eθ> 0 ( N ) for different values of θ > 0 and fixed m > 0 .
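For instance, a minimal R sketch of such a Monte Carlo study (our illustration; the truncation bound nmax is an assumption needed to make each simulated run finite) is as follows.

# Monte Carlo evaluation of E(N) for the power-one test,
# with c_n = [(n + m)(a^2 + log(n/m + 1))]^(1/2)
set.seed(1)
a2 <- 6; m <- 1; nmax <- 10^5            # a^2 = 6 gives Pr_0(N < infinity) near 0.05
expected.N <- function(theta, reps = 1000) {
  Ns <- replicate(reps, {
    Sn <- 0; N <- nmax                   # N = nmax flags a truncated (unstopped) run
    for (n in 1:nmax) {
      Sn <- Sn + rnorm(1, theta, 1)
      if (Sn >= sqrt((n + m) * (a2 + log(n / m + 1)))) { N <- n; break }
    }
    N
  })
  mean(Ns)
}
sapply(c(0.5, 1, 2), expected.N)         # compare with the lower bound 6/theta^2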
9
Brief Comments on Confidence
Intervals and p-Values
h(θ|X) = Π_{i=1}^n f(X_i|θ) π(θ) / ∫ Π_{i=1}^n f(X_i|θ) π(θ) dθ.
this section) at a fixed significance level α, that is, ∫_{q_L}^{q_U} h(θ|X) dθ = 1 − α. In this case, the problem is that we have only one equation from which to find values of the two unknown quantities q_L, q_U. Consider, for example, Figure 9.1, where, for α = 0.05, we display the scenarios related to the intervals [q_L, q_U] that satisfy

Panel (a): α/2 = ∫_{−∞}^{q_L} h(θ|X) dθ, α/2 = ∫_{q_U}^{∞} h(θ|X) dθ;

Panel (b): α/1.2 = ∫_{−∞}^{q_L} h(θ|X) dθ, α(1 − 1/1.2) = ∫_{q_U}^{∞} h(θ|X) dθ;

and

Panel (c): ∫_{q_L}^{q_U} h(θ|X) dθ = 1 − α, h(q_L|X) = h(q_U|X).
Warning: Similar to the issue regarding the general Type I error rate defi-
nition mentioned in Section 4.4.4, it is reasonable to raise the question,
“What kind of confidence interval definition do you use?” in terms of
what is actually presented.
FIGURE 9.1
The confidence interval definitions related to the following scenarios: (a) the posterior probabilities Pr(θ ≤ q_L) = Pr(θ ≥ q_U) = α/2 and (b) the posterior probabilities Pr(θ ≤ q_L) = α/1.2, Pr(θ ≥ q_U) = α(1 − 1/1.2), where α = 0.05. Panel (c) presents the interval [q_L, q_U] calculated using the equations Pr(q_L < θ < q_U) = 1 − α and h(q_L|X) = h(q_U|X).
α/2 = ∫_{−∞}^{q_L} h(θ|X) dθ and α/2 = ∫_{q_U}^{∞} h(θ|X) dθ.

h(q_L|X) = h(q_U|X) and ∫_{q_L}^{q_U} h(θ|X) dθ = 1 − α.
Suppose now that we seek the shortest interval [q_L, q_U] satisfying ∫_{q_L}^{q_U} h(θ|X) dθ = 1 − α. To this end, we employ the Lagrange method, which implies solving the equations

∂/∂q_U [q_U − q_L + λ(1 − α − ∫_{q_L}^{q_U} h(u|X) du)] = 0,
∂/∂q_L [q_U − q_L + λ(1 − α − ∫_{q_L}^{q_U} h(u|X) du)] = 0,

where λ is the Lagrange multiplier, with respect to q_U and q_L. Then we have 1 − λh(q_U|X) = 0, −1 + λh(q_L|X) = 0, obtaining h(q_L|X) = h(q_U|X).
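To make the distinction between the panels concrete, the following R sketch (our illustration, assuming a Beta(3, 9) posterior h(θ|X)) computes the equal-tailed interval of panel (a) and the shortest, highest-posterior-density interval of panel (c); by the Lagrange argument above, the latter equalizes the posterior density at the two endpoints.

# Equal-tailed versus highest-posterior-density (HPD) 95% intervals
# for an assumed Beta(3, 9) posterior h(theta | X)
alpha <- 0.05; sh1 <- 3; sh2 <- 9
equal.tailed <- qbeta(c(alpha/2, 1 - alpha/2), sh1, sh2)
# HPD: among all [qL, qU] with posterior content 1 - alpha, pick the shortest
len <- function(qL) {
  qU <- qbeta(pbeta(qL, sh1, sh2) + 1 - alpha, sh1, sh2)
  qU - qL
}
qL <- optimize(len, c(0, qbeta(alpha, sh1, sh2)))$minimum
qU <- qbeta(pbeta(qL, sh1, sh2) + 1 - alpha, sh1, sh2)
rbind(equal.tailed, hpd = c(qL, qU))
c(h.at.qL = dbeta(qL, sh1, sh2), h.at.qU = dbeta(qU, sh1, sh2))  # approximately equal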
9.2 p-Values
The p-value has long played a role in scientific research as a key decision-
making tool with respect to hypothesis testing and dates back to Laplace
in the 1770s (Stigler, 1986). The concept of the p-value was popularized by
Fisher as an inferential tool and is where the first occurrence of the term
“statistical significance” is found (Fisher, 1925). In the frequentist hypothesis
testing framework a test is deemed statistically significant if the p-value, a statistic taking values in the interval (0, 1), is below some fixed threshold (the significance level). In the vast majority of studies this threshold is set at 0.05.
A majority of traditional testing procedures are designed to draw a conclusion (or take an action) with respect to the binary decision of rejecting or not rejecting the null hypothesis H_0, depending on the locations of the
corresponding values of the observed test statistics, that is, detecting whether
test statistic values belong to a fixed sphere or interval. In simple cases, test
procedures require us to compare corresponding test statistics values based
on observed data with test thresholds. P-values can serve as an alternative
data-driven approach for testing statistical hypotheses based on using the
observed values of test statistics as the thresholds in the theoretical prob-
ability of the Type I error. P-values can themselves also serve as a summary
type result based on data in that they provide meaningful experimental
data-based evidence about the null hypothesis.
For example, suppose that the test statistic L has a distribution function F
under H 0 . Under a one-sided upper-tailed alternative H 1 the p-value is the ran-
dom variable 1 − F(L), which is uniformly distributed under H 0 (e.g., Vexler
et al., 2016a).
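This uniformity is easy to check numerically; e.g., the small R experiment below (our illustration, using the one-sample t-test) generates data under H_0 and inspects the empirical distribution of the resulting p-values.

# Under H0 the p-value 1 - F(L) is Uniform(0,1); check via simulation
set.seed(1)
pv <- replicate(10000, t.test(rnorm(25), alternative = "greater")$p.value)
hist(pv, breaks = 20, freq = FALSE, main = "p-values under H0")  # flat histogram
ks.test(pv, "punif")  # no evidence against uniformity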
The obvious correct use of the p-value is to simply draw a conclusion of
reject or do not reject the null hypothesis. This principle simplifies and stan-
dardizes statistical decision-making policies. In this manner, for example,
different algorithms for combining decision-making rules using their
p-values as test statistics can be naturally derived (e.g., Vexler et al., 2017).
use well-established ROC curve and AUC methods to evaluate and visualize
the properties of various decision-making procedures in the p-value-based
context. Further, in parallel with partial AUCs (Section 7.6), we develop a
partial expected p-value (pEPV) and introduce a novel method for visual-
izing the properties of statistical tests in an ROC curve framework. We prove
that the conventional power characterization of tests is a partial aspect of the
presented EPV/ROC technique.
In order to exemplify additional applications of the EPV/ROC methodol-
ogy, we refer the reader to Vexler et al. (2017).
EPV = Φ((μ_1 − μ_2)(σ_1² + σ_2²)^{−1/2}),

where Φ is the standard normal cumulative distribution function.
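For instance, writing T^0 ~ N(μ_1, σ_1²) for the test statistic under the null hypothesis and T^A ~ N(μ_2, σ_2²) under the alternative (an assumption about the intended labeling), the expression above can be verified directly in R (our sketch; the parameter values are arbitrary):

# EPV = Pr(T0 >= TA) for T0 ~ N(mu1, s1^2) and TA ~ N(mu2, s2^2)
set.seed(1)
mu1 <- 0; s1 <- 1; mu2 <- 2; s2 <- 1.5
T0 <- rnorm(10^6, mu1, s1); TA <- rnorm(10^6, mu2, s2)
c(simulated = mean(T0 >= TA),
  formula   = pnorm((mu1 - mu2) / sqrt(s1^2 + s2^2)))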
Note that the formal notation (9.1) is similar to that for the area under the
corresponding ROC curve (Section 7.3). In this case, the ROC curve can assist
T_S = (X̄ − Ȳ)[S_p(n^{−1} + m^{−1})^{1/2}]^{−1}, T_W = (X̄ − Ȳ)(S_1² n^{−1} + S_2² m^{−1})^{−1/2},

where X̄ and Ȳ denote the sample means of the independent normally distributed observations X_1, ..., X_n and Y_1, ..., Y_m, S_1² = Σ_{i=1}^n (X_i − X̄)²/(n − 1) and S_2² = Σ_{j=1}^m (Y_j − Ȳ)²/(m − 1) are the unbiased estimators of the variances σ_1² = Var(X_1) and σ_2² = Var(Y_1), respectively, and S_p² = {(n − 1)S_1² + (m − 1)S_2²}/(n + m − 2) is the pooled sample variance. Figure 9.2 depicts the ROC curves,
ROC(t) = 1 − F_1{F_0^{−1}(1 − t)}, t ∈ (0, 1), for each test when the distribution functions F_0, F_1 of the t-test statistic (Student's t-test statistic or Welch's t-test statistic) are correctly specified, corresponding to underlying distributions of the observations with δ = EX_1 − EY_1 = 0.7 and 1 under H_1. These graphs show that there are no significant differences between the respective curves. In the scenario n = 10, m = 50, σ_1² = 4, σ_2² = 1, δ = 1, Student's t-test demonstrates slightly better performance than Welch's t-test. Using different values of n, m, σ_1², σ_2², and δ, we attempted to detect cases in which significant differences between the relevant curves are in effect. Applying different combinations of n = 10, 20, ..., 150; m = 10, 20, ..., 150; σ_1² = 1, 2², ..., 10²; and σ_2² = 1, we could not derive a scenario in which one of the considered tests
FIGURE 9.2
The ROC curves related to Student's t-test (“——”) and Welch's t-test (“- - - -”), where graph (a) represents the case of n = 10, m = 50, σ_1² = 1, σ_2² = 1, δ = 0.7; graph (b) represents the case of n = 20, m = 40, σ_1² = 1, σ_2² = 1, δ = 0.7; graph (c) represents the case of n = 40, m = 20, σ_1² = 4, σ_2² = 1, δ = 0.7; graph (d) represents the case of n = 40, m = 20, σ_1² = 9, σ_2² = 1, δ = 0.7; graph (e) represents the case of n = 10, m = 50, σ_1² = 4, σ_2² = 1, δ = 1; and graph (f) represents the case of n = 20, m = 40, σ_1² = 9, σ_2² = 1, δ = 0.7.
TABLE 9.1
The Areas Under the ROC Curves of the Student’s t-Test ( AUCS ) and Welch’s
t-Test ( AUCW )
n m σ 12 σ 22 δ AUCS AUCW
clearly outperforms the other one with respect to the EPV. The corresponding
AUC = (1 − EPV) values are given in Table 9.1. Thus, if the Type I error rates
of the tests are correctly controlled, there are no critical differences between
Student’s t-test and Welch’s t-test.
The results shown above can be obtained analytically, since the distribu-
tion functions of the statistics TS and TW under H 0 and H 1 have specified
forms. Alternatively, we provide the following R code that can be easily
modified in order to evaluate various decision-making procedures using the
accurate Monte Carlo approximations to the EPV/ROC instruments.
library(pROC)
N<-75000 # Number of Monte Carlo generations
n<-20
m<-20
T0E<-array() #Values of T_S under H_0
T0NE<-array() #Values of T_W under H_0
T1E<-array() #Values of T_S under H_1
T1NE<-array() #Values of T_W under H_1
for(i in 1:N){
x0<-rnorm(n,0,1)# X's under H_0
y0<-rnorm(m,0,1)# Y's under H_0
x1<-rnorm(n,0.7,1)# X's under H_1 (delta=0.7 assumed for illustration)
y1<-rnorm(m,0,1)# Y's under H_1
T0E[i]<-t.test(x0,y0,var.equal=TRUE)$stat[[1]] # T_S under H_0
T0NE[i]<-t.test(x0,y0,var.equal=FALSE)$stat[[1]] # T_W under H_0
T1E[i]<-t.test(x1,y1,var.equal=TRUE)$stat[[1]] # T_S under H_1
T1NE[i]<-t.test(x1,y1,var.equal=FALSE)$stat[[1]] # T_W under H_1
} # (loop body reconstructed; the original inner lines were lost)
Ind<-c(rep(0,N),rep(1,N)) # hypothesis labels
TEroc<-roc(Ind,c(T0E,T1E)) # ROC curve for T_S
TNEroc<-roc(Ind,c(T0NE,T1NE)) # ROC curve for T_W
##############Partial EPVs
#Values of (1-the partial EPVs) for T_S depending on alpha:
TEauc<-function(alpha) 1-auc(TEroc, partial.auc=c(0, alpha))
#Values of (1-the partial EPVs) for T_W depending on alpha:
TNEauc<-function(alpha) 1-auc(TNEroc, partial.auc=c(0, alpha))
TEaucV<-Vectorize(TEauc)
TNEaucV<-Vectorize(TNEauc)
plot(TEaucV,0,1)
plot(TNEaucV,0,1,col='red')
EPV = Pr(T^0 ≥ T^A) = ∫_{−∞}^{∞} Pr(T^A ≤ t) dF_0(t) = ∫_{−∞}^{∞} Pr{F_0(T^A) ≤ F_0(t)} dF_0(t)
= ∫_0^1 Pr{1 − F_0(T^A) ≥ α} dα = ∫_0^1 [1 − Pr{1 − F_0(T^A) ≤ α}] dα = 1 − ∫_0^1 Pr(p-value ≤ α | H_1) dα.
The above expression of the EPV considers the weight of the significance
level α from 0 to 1. It may appear to suffer from the defect of assigning most
of its weight to relatively uninteresting values of α not typically used in
practice, e.g., α ≥ 0.1. Alternatively, we can focus on significance levels of α in
a specific interesting range by considering the pEPV; that is
pEPV = 1 − ∫_0^{α_U} Pr{p-value ≤ α | H_1} dα = 1 − ∫_0^{α_U} Pr{1 − F_0(T^A) ≤ α} dα
= 1 + ∫_0^{α_U} Pr{F_0(T^A) ≥ 1 − α} d(1 − α) = 1 + ∫_1^{1−α_U} Pr{F_0(T^A) ≥ z} dz
= 1 − ∫_{1−α_U}^1 Pr{F_0(T^A) ≥ z} dz = 1 − ∫_{F_0^{−1}(1−α_U)}^{∞} Pr{F_0(T^A) ≥ F_0(t)} dF_0(t)
= 1 − ∫_{F_0^{−1}(1−α_U)}^{∞} Pr{T^A ≥ t} dF_0(t) = 1 − Pr{T^A ≥ T^0, T^0 ≥ F_0^{−1}(1 − α_U)}.
In general, one can define the function pEPV(α_L, α_U) = 1 − ∫_{α_L}^{α_U} Pr{p-value ≤ u | H_1} du and focus on (d/dα){−pEPV(0, α)}. Then, in this case, (d/dα){−pEPV(0, α)} gives
the power at a significance level of α. An essential property of efficient statistical tests is unbiasedness (Lehmann and Romano, 2006). In this context, a statistical test is unbiased when the probability of committing a Type I error does not exceed the significance level and the power function is proper, that is, Pr(reject H_0 | H_0) ≤ α and Pr(reject H_0 | H_0 is not true) ≥ α.
In parallel with this definition, it is natural to consider the inequality

pEPV(0, α) ≤ 1 − ∫_0^α Pr{p-value ≤ u | H_0} du = 1 − α²/2,

noting that (d/dα){pEPV(0, α)} = −Pr(reject H_0 | H_0 is not true) and (d/dα){1 − α²/2} = −α.
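For instance, for a one-sided normal-shift alternative, Pr(p-value ≤ u | H_1) = Φ(Φ^{−1}(u) + δ), so the inequality can be inspected numerically (our R sketch; δ = 1 is an arbitrary choice):

# Check pEPV(0, alpha) <= 1 - alpha^2/2 for the normal shift alternative
delta <- 1
pEPV <- function(alpha)
  1 - integrate(function(u) pnorm(qnorm(u) + delta), 0, alpha)$value
alphas <- c(0.05, 0.1, 0.5, 1)
cbind(alpha = alphas, pEPV = sapply(alphas, pEPV), bound = 1 - alphas^2/2)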
library(pROC)
N<-100000 #Number of Monte Carlo data generations
W0<-array()
T0<-array()
W1<-array()
T1<-array()
n<-25 #The sample size n
m<-25 #The sample size m
EV<-FALSE #EV=TRUE: Student's t-test; EV=FALSE: Welch's t-test
for(i in 1:N){
x0<-rnorm(n,0,1) #Values of X from N(0,1) generated under the hypothesis H0
y0<-rnorm(m,0,1) #Values of Y from N(0,1) generated under the hypothesis H0
x1<-rnorm(n,0,1) #Values of X from N(0,1) generated under the hypothesis H1
y1<-rnorm(m,-0.5,10) #Values of Y from N(-0.5,10) generated under H1
T0[i]<-t.test(x0,y0,alternative = c("greater"),var.equal=EV)$stat[[1]]
#Values of the t-test statistic under H0 (t-test lines reconstructed)
T1[i]<-t.test(x1,y1,alternative = c("greater"),var.equal=EV)$stat[[1]]
#Values of the t-test statistic under H1
W0[i]<-wilcox.test(x0,y0,alternative = c("greater"))$stat[[1]]/(n*m)
#Values of the Wilcoxon test statistic under H0
W1[i]<-wilcox.test(x1,y1,alternative = c("greater"))$stat[[1]]/(n*m)
#Values of the Wilcoxon test statistic under H1
}
To exemplify the proposed approach for comparing the test statistics, this R code provides simulation-based evaluations of the ROC curves based on values of the test statistics related to the null and alternative hypotheses, using the R package pROC. In this scenario, we assume, for example, that X_1, ..., X_n ~ N(0, 1) and Y_1, ..., Y_m ~ N(0, 1) under H_0, whereas X_1, ..., X_n ~ N(0, 1) and Y_1, ..., Y_m ~ N(−0.5, 10) under H_1. Figure 9.3 shows the ROC curves (ROC_T(t), t) and
(ROC_W(t), t), where

ROC_T(t) = 1 − F_{T1}{F_{T0}^{−1}(1 − t)} and ROC_W(t) = 1 − F_{W1}{F_{W0}^{−1}(1 − t)}, t ∈ (0, 1),

with the distribution functions F_{Tk} and F_{Wk} that correspond to the t-test statistic and the Wilcoxon rank-sum test statistic distributions under H_k, k = 0, 1, respectively.
According to the executed R code (the parameter EV<-FALSE), we focus on Welch's t-test statistic, which is reasonable in the considered setting of the data distributions' parameters. It is interesting to remark that, when examining Student's t-test statistic by setting the parameter EV<-TRUE, the graphs show that there are no significant differences between the respective curves. Similar observations are in effect regarding the considerations shown below.
Analyzing EPVs for the one-sided, two-sample t- and Wilcoxon tests based on normally distributed data points (n = m = 10, 20, 50), Sackrowitz and Samuel-Cahn (1999) concluded that “The t test is best both for the Normal …”
FIGURE 9.3
Values of the functions ROCT (t) (curve “- - - -”) and ROCW (t) (curve “------”) plotted against
t ∈ (0, 1).
Ind<-c(rep(0,N),rep(1,N)) # hypothesis labels (line reconstructed)
T<-c(T0,T1); W<-c(W0,W1)  # pooled statistic values (line reconstructed)
Troc<-roc(Ind, T)
Wroc<-roc(Ind, W)
EPV_t<-1-auc(Troc) # EPV of the t-test
EPV_W<-1-auc(Wroc) # EPV of the Wilcoxon test
Indeed, the computed EPV of the t-test is 0.431, which is smaller than the 0.439 related to the Wilcoxon rank-sum test. However, Figure 9.3 demonstrates that for t ∈ (0, 0.5) the Wilcoxon test is somewhat better than the t-test. This motivates us to employ the pEPV for this analysis. Toward this end, we define the function

G(α) = {pEPV_W(0, α) − pEPV_T(0, α)}/pEPV_T(0, α),

where pEPV_T(0, α) and pEPV_W(0, α) are the function pEPV(0, α) defined in Section 9.2.3 computed with respect to the t-test and the Wilcoxon test, respectively. In order to depict the result we use the following code:
pEPV_t<-function(u) 1-auc(Troc, partial.auc=c(0,u))[[1]]
pEPV_W<-function(u) 1-auc(Wroc, partial.auc=c(0,u))[[1]]
G<-function(u) (pEPV_W(u)-pEPV_t(u))/(pEPV_t(u))
GV<-Vectorize(G)
plot(GV,0,1,xlab='alpha',ylab='G(alpha)')
FIGURE 9.4
The relative comparison between the t-test and the Wilcoxon test using their pEPVs via the
function G(α) plotted against α ∈(0, 1).
We obtain that the power of the Wilcoxon test is 0.12, whereas the power
of the t-test is 0.08.
Remark 1. Assume we are interested in the measure Pr(X > Y). A fast way to evaluate Pr(X > Y) can be based on the R command wilcox.test(X,Y,alternative = c("greater"))$stat[[1]], dividing the result by the product of the sample sizes. For example, in the setting of the example mentioned in this section, one can use
nn<-5000000
x1<-rnorm(nn,0,1)
y1<-rnorm(nn,-0.5,10)
wilcox.test(x1,y1,alternative = c("greater"))$stat[[1]]/(nn*nn)
This quantity corresponds to

Pr(X > Y) = (2π)^{−1/2} ∫_{−∞}^{0.5/(σ_X² + σ_Y²)^{1/2}} e^{−z²/2} dz ≈ 0.5181275,

where σ_X² and σ_Y² denote the variances of X and Y.
9.2.5 Discussion

We have seen that the EPV is a very useful and succinct measure of the performance of decision-making mechanisms. We have demonstrated a novel methodology to analyze and visualize characteristics of test procedures. Toward this end the “EPV/ROC” concept has been introduced. This approach provides new and efficient perspectives for developing and examining statistical decision-making policies, including those related to the partial EPVs, associations between the EPV and the power of tests, visualization of properties of testing mechanisms, development of optimal tests based on minimizing the EPVs, and creation of new methods for optimally combining multiple test statistics. Many avenues of research can be pursued based on the concept we introduced in this chapter. For example, a large sample theory can be developed to evaluate the EPVs in several parametric and nonparametric scenarios. In addition, Bayesian-type methods can be developed in order to evaluate test properties in the “EPV/ROC” frame.
10
Empirical Likelihood
10.1 Introduction
Over the long course of statistical methodological developments there has been a shift from classical parametric likelihood methods toward robust and efficient nonparametric and semiparametric developments of various “artificial” or “approximate” likelihood techniques. These methods have a wide variety of applications related to biostatistical experiments. Many nonparametric and semiparametric approximations to powerful parametric likelihood procedures have been used routinely in both statistical theory and practice. Well-known examples include the quasi-likelihood method, approximations of parametric likelihoods via orthogonal functions, techniques based on quadratic artificial likelihood functions, and the local maximum likelihood methodology (e.g., Wedderburn, 1974; Claeskens and Hjort, 2004; Fan et al., 1998). Various studies have shown that artificial or approximate
likelihood-based techniques efficiently incorporate information expressed
through the data, and have many of the same asymptotic properties as those
derived from the corresponding parametric likelihoods. The empirical likeli-
hood (EL) method is one of a growing array of artificial or approximate
likelihood-based methods currently in use in statistical practice (e.g., Owen,
2001; Vexler et al., 2009a, 2014). Interest and the resulting impact in EL methods
continue to grow rapidly. Perhaps more importantly, EL methods now have
various vital applications in an expanding number of health related studies.
In Sections 10.2 and 10.3, we outline basic components related to EL techniques and their theoretical evaluations. Sections 10.4 and 10.5 demonstrate valuable examples of EL applications. In Section 10.6, we provide arguments in favor of applying EL methods in statistical practice.
The likelihood based on the data can be presented as Π_{i=1}^n (F(X_i) − F(X_i−)), which is a functional of the cumulative distribution function F and the iid observations X_i, i = 1, ..., n. In the distribution-free setting, an empirical estimator of the likelihood takes the form L_p = Π_{i=1}^n p_i, where the components p_i, i = 1, ..., n, estimators of the probability weights, should maximize the likelihood L_p, provided that Σ_{i=1}^n p_i = 1 and empirical constraints based on X_1, ..., X_n hold. For example, suppose we would like to test the hypothesis H_0: E{g(X_1, θ)} = 0, for a given function g, using probability weights p_i, where Σ_{i=1}^n p_i = 1.
i=1
Under the null hypothesis, the maximum likelihood approach requires one to find the values of 0 < p_1, ..., p_n < 1 that maximize the EL given the empirical constraints Σ_{i=1}^n p_i = 1 and Σ_{i=1}^n p_i g(X_i, θ) = 0, which present an empirical version of the condition under H_0 that E{g(X_1, θ)} = 0. In situations where there are no 0 < p_1, ..., p_n < 1 that satisfy the empirical constraints, the null hypothesis is rejected at this stage. Further, following the Lagrange method, we define the function
W(p_1, ..., p_n) = Σ_{i=1}^n log(p_i) + λ_1(1 − Σ_{i=1}^n p_i) + λ(−Σ_{i=1}^n p_i g(X_i, θ)),

where λ_1 and λ are the Lagrange multipliers. Setting ∂W(p_1, ..., p_n)/∂p_k = 0 gives p_k ∂W(p_1, ..., p_n)/∂p_k = 0, k = 1, ..., n, and hence Σ_{k=1}^n p_k ∂W(p_1, ..., p_n)/∂p_k = 0.
Thus, since Σ_{k=1}^n p_k = 1, we obtain n − λ_1 − λ Σ_{k=1}^n p_k g(X_k, θ) = 0, where, under H_0, it is assumed that Σ_{k=1}^n p_k g(X_k, θ) = 0. This leads to n − λ_1 = 0. Therefore, using the equation ∂W(p_1, ..., p_n)/∂p_k = 0, we obtain p_k = (n + λg(X_k, θ))^{−1}, k = 1, ..., n. That is to say, the Lagrange method provides
EL(θ) = sup_{0 < p_1, ..., p_n < 1, Σp_i = 1, Σp_i g(X_i, θ) = 0} Π_{i=1}^n p_i = Π_{i=1}^n (n + λg(X_i, θ))^{−1},

where λ = λ(θ) is a root of Σ_{i=1}^n g(X_i, θ)(n + λg(X_i, θ))^{−1} = 0; this root is unique, since

(∂/∂λ) Σ_{i=1}^n g(X_i, θ)(n + λg(X_i, θ))^{−1} = −Σ_{i=1}^n g(X_i, θ)² (n + λg(X_i, θ))^{−2} ≤ 0.
When only the constraint Σ_{i=1}^n p_i = 1 is imposed, we have

EL = sup_{0 < p_1, ..., p_n < 1, Σp_i = 1} Π_{i=1}^n p_i = Π_{i=1}^n n^{−1} = n^{−n}.

Thus one can define the empirical likelihood ratio test statistic ELR(θ) = n^{−n}/EL(θ) for the hypothesis test of H_0 versus H_1. For example, when the function g(u, θ) = u − θ, the null hypothesis corresponds to the expectation E(X_1) = θ, for fixed θ. The EL ratio test strategy is that we reject H_0 for large values of ELR(θ).
Owen (1988, 1990, 1991) showed that the nonparametric test statistic 2log(ELR(θ)) has an asymptotic chi-square distribution under the null hypothesis when E|g(X_1, θ)|³ < ∞. This result illustrates that Wilks' theorem-type results continue to hold in the context of this infinite-dimensional problem (n, the number of p_1, ..., p_n, is assumed to be large, n → ∞). The proof of this proposition is outlined in Section 10.3. Thus, we reject H_0 when 2log(ELR(θ)) exceeds the corresponding chi-square quantile. Implementations are available in R packages that include the R functions el.test() and EL.test(). These simple R functions can be very useful for the EL analysis of data from statistical studies.
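For example, using the emplik package (an assumption on our part: el.test() is provided by emplik, and EL.test() by the EL package), the EL ratio test of H_0: E(X_1) = θ can be run as follows.

# EL ratio test of H0: E(X_1) = theta using emplik::el.test
library(emplik)
set.seed(1)
X <- rexp(50, rate = 1)            # true mean is 1
el.test(X, mu = 1)$`-2LLR`         # compare with qchisq(0.95, df = 1)
el.test(X, mu = 1.5)$`-2LLR`       # larger value; H0: E(X)=1.5 likely rejected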
Lemma 10.3.1.
Let θ_M be a root of the equation n^{−1} Σ_{i=1}^n g(X_i, θ_M) = 0, where ∂g(X_i, θ)/∂θ < 0 (or ∂g(X_i, θ)/∂θ > 0), for all i = 1, 2, ..., n. Then the argument θ_M is a global maximum of the function

EL(θ) = max{Π_{i=1}^n p_i : 0 < p_i < 1, Σ_{i=1}^n p_i = 1, Σ_{i=1}^n p_i g(X_i, θ) = 0},

which increases and decreases monotonically for θ < θ_M and θ > θ_M, respectively.
Proof. It is clear that the argument θ_M, a root of n^{−1} Σ_{i=1}^n g(X_i, θ_M) = 0, maximizes the function EL(θ), since in this case EL(θ_M) = n^{−n} with p_i = n^{−1}, i = 1, ..., n, which maximizes Π_{i=1}^n p_i given the sole constraint Σ_{i=1}^n p_i = 1, 0 ≤ p_i ≤ 1, i = 1, ..., n. In general,

EL(θ) = Π_{i=1}^n p_i, 0 < p_i = 1/(n + λg(X_i, θ)) < 1, i = 1, ..., n,
and hence

d log(EL(θ))/dθ = −λ Σ_{i=1}^n [∂g(X_i, θ)/∂θ]/(n + λg(X_i, θ)) − Σ_{i=1}^n [g(X_i, θ)/(n + λg(X_i, θ))] ∂λ/∂θ
= −λ Σ_{i=1}^n [∂g(X_i, θ)/∂θ]/(n + λg(X_i, θ)),

where the second sum vanishes by virtue of the constraint Σ_{i=1}^n p_i g(X_i, θ) = 0.
Define L(λ) = Σ_{i=1}^n g(X_i, θ)(n + λg(X_i, θ))^{−1}. Since dL(λ)/dλ < 0, the function L(λ) decreases with respect to λ and has just one root relative to solving L(λ) = 0. Consider the scenario with θ > θ_M (in the case ∂g(X_i, θ)/∂θ > 0). In this case, when λ_0 = 0, we can conclude that

L(λ_0) = Σ_{i=1}^n g(X_i, θ)/n ≥ Σ_{i=1}^n g(X_i, θ_M)/n = 0,

so that the root λ of L(λ) = 0 satisfies λ ≥ 0.
FIGURE 10.1
The schematic behaviors of L(λ) plotted against λ (the axis of abscissa), when (a) θ > θ M and (b)
θ < θ M, respectively.
Thus, by virtue of

d log(EL(θ))/dθ = −λ Σ_{i=1}^n [∂g(X_i, θ)/∂θ]/(n + λg(X_i, θ)),

the function EL(θ) decreases for θ > θ_M and, by the symmetric argument, increases for θ < θ_M. For example, when g(u, θ) = z(u) − θ for a given function z, the global maximum is attained at θ_M = n^{−1} Σ_{i=1}^n z(X_i).
i=1
Turning to the task of developing asymptotic methods based on ELs, we provide the proposition below. Without loss of generality and for simplicity of notation, we set g(u, θ) = u − θ in the definition of the empirical likelihood ratio function ELR(θ). Thus, EL(θ) = Π_{i=1}^n p_i,

log(ELR(θ)) = Σ_{i=1}^n log{1 + λ(X_i − θ)/n} with p_i = {n + λ(X_i − θ)}^{−1},

where λ is a root of Σ_{i=1}^n (X_i − θ)(n + λ(X_i − θ))^{−1} = 0.
Note that λ = λ(θ) is a function of θ. Then, defining λ' = dλ(θ)/dθ, λ'' = d²λ(θ)/dθ², and so on, one can show that

λ'' = −{2(λ')² Σ_{i=1}^n (X_i − θ)³ p_i³ + 4nλ' Σ_{i=1}^n (X_i − θ) p_i³ − 2nλ Σ_{i=1}^n p_i³} / Σ_{i=1}^n (X_i − θ)² p_i²,

λ^{(4)} = {(8λ'λ^{(3)} + 6(λ'')²) Σ_{i=1}^n (X_i − θ)³ p_i³ + 8nλ^{(3)} Σ_{i=1}^n (X_i − θ) p_i³
− 36(λ')²λ'' Σ_{i=1}^n (X_i − θ)⁴ p_i⁴ − 72nλ'λ'' Σ_{i=1}^n (X_i − θ)² p_i⁴ − (36n²λ'' − 18nλλ') Σ_{i=1}^n p_i⁴ + 24nλ'' Σ_{i=1}^n p_i³ + ⋯} / Σ_{i=1}^n (X_i − θ)² p_i²,

with an analogous (lengthier) expression for λ^{(3)}.
Expanding log ELR(θ) about θ = X̄, where X̄ = n^{−1} Σ_{i=1}^n X_i, and noting that log ELR(X̄) = 0 and d log ELR(θ)/dθ|_{θ=X̄} = 0, we obtain

log ELR(θ) ≈ [(θ − X̄)²/2!] d² log ELR(θ)/dθ²|_{θ=X̄} + [(θ − X̄)³/3!] d³ log ELR(θ)/dθ³|_{θ=X̄}
= (1/2)(n^{0.5}(θ − X̄))² / [n^{−1} Σ_{i=1}^n (X_i − X̄)²] + (1/3) n(θ − X̄)³ [n^{−1} Σ_{i=1}^n (X_i − X̄)³] / [n^{−1} Σ_{i=1}^n (X_i − X̄)²]³,
as n → ∞. This approximation depicts Wilks' theorem when θ = EX_1. (In the case of θ = EX_1 and n → ∞, n^{0.5}(θ − X̄) has a normal distribution, n(θ − X̄)³ → 0, and Σ_{i=1}^n (X_i − X̄)²/n →_p var(X_1).) Using Proposition 10.3.1 to obtain more terms in
i=1
the Taylor expansion can provide high order approximations to the null distri-
bution of the ELR test, e.g., to obtain the Bartlett correction of the ELR structure
(e.g., Vexler et al., 2009a). Under the alternative hypothesis θ ≠ EX 1, the approxi-
mation above shows the power of the ELR test. In this context, we note that Lazar
and Mykland (1999) considered a general form of ELRs and the case where
( )
θ − EX 1 ~ O n−0.5 . The authors compared the local power of ELR to that of an
ordinary parametric likelihood ratio. Their notable research shows that there is
no loss of efficiency in using an EL model up to a second-order approximation.
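The Wilks-type behavior is easy to verify by simulation; e.g., a short R sketch (our illustration, again relying on emplik::el.test):

# Null distribution of 2 log ELR(theta) at theta = E(X_1): approximately chi^2_1
library(emplik)
set.seed(1)
stat <- replicate(2000, el.test(rexp(100, 1), mu = 1)$`-2LLR`)
mean(stat > qchisq(0.95, 1))  # empirical Type I error rate, close to 0.05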
Λ = Σ_{i=1}^n log p_i + λ(1 − Σ_{i=1}^n p_i) + λ_1(θ_1 − Σ_{i=1}^n p_i X_i) + λ_2(θ_2 − Σ_{i=1}^n p_i X_i²),
In order to compare the empirical and parametric likelihood ratios, consider, in the case of the normal distribution, the minus maximum log-likelihood ratio test statistic for E(X_1) = θ,

M(θ) = −log{MLR(θ)} = −Σ_{i=1}^n (X_i − θ)²/2 + Σ_{i=1}^n (X_i − X̄)²/2,

and the minus log-empirical likelihood ratio test statistic for E(X_1) = θ as

E(θ) = −log{ELR(θ)} = −Σ_{i=1}^n log{1 + λ(X_i − θ)/n},

where λ is a root of Σ_{i=1}^n (X_i − θ)(n + λ(X_i − θ))^{−1} = 0 and X̄ = n^{−1} Σ_{i=1}^n X_i. In the case of the exponential distribution, the minus maximum log-likelihood ratio test statistic is

M(θ) = −log{MLR(θ)} = n log θ − nθX̄ + n log X̄ + n,

and the minus log-empirical likelihood ratio test statistic for E(X_1) = 1/θ is defined via E(θ) above, with the mean value 1/θ in place of θ.
FIGURE 10.2
The log-empirical likelihood ratios and the maximum log-likelihood ratios based on samples of sizes n = 25, 50, 150, where the solid line and the dashed line correspond to the log-empirical likelihood ratio and the maximum log-likelihood ratio when the underlying distribution is normal, respectively, while the dotted line and the dot-dash line represent the log-empirical likelihood ratio and the maximum log-likelihood ratio when the underlying distribution is exponential, respectively.
library(emplik) # provides el.test(); assumed to be installed
n.seq<-c(25,50,150) # sample sizes shown in Figure 10.2
theta0<-1 # true parameter value (assumed for this illustration)
# normal
plot.norm<-function(X){
get.elr<-function(theta) sapply(theta,function(pp) el.test(X,mu=pp)$'-2LLR'/(-2)) # log EL
get.plr<-function(theta) sapply(theta,function(pp) sum(-(X-pp)^2/2)+sum((X-mean(X))^2/2)) #log(PL)
rg<-6/sqrt(length(X))
curve(get.elr,xlim=c(max(0,mean(X)-rg),mean(X)+rg),type="l",lty=1,lwd=2,add=TRUE,col=1)
curve(get.plr,xlim=c(max(0,mean(X)-rg),mean(X)+rg),type="l",lty=2,lwd=2,add=TRUE,col=2)
}
#exponential
plot.exp<-function(X){
get.elr.exp<-function(theta) sapply(theta,function(pp) el.test(X,mu=1/pp)$'-2LLR'/(-2)) #log EL
get.l.exp<-function(theta) sum(log(dexp(X,rate=theta)))
get.plr.exp<-function(theta) sapply(theta,function(pp) get.l.exp(pp)-get.l.exp(1/mean(X))) #log(PL)
rg<-6/sqrt(length(X))
curve(get.elr.exp,xlim=c(max(0.1,1/mean(X)-rg),1/mean(X)+rg),type="l",lty=3,lwd=2,add=TRUE,col=3)
curve(get.plr.exp,xlim=c(max(0.1,1/mean(X)-rg),1/mean(X)+rg),type="l",lty=4,lwd=2,add=TRUE,col=4)
}
#add legend for the plot
add_legend <- function(...) {
opar <- par(fig=c(0, 1, 0, 1), oma=c(0, 0, 0, 0),
mar=c(0, 0, 0, 0), new=TRUE)
on.exit(par(opar))
plot(0, 0, type='n', bty='n', xaxt='n', yaxt='n')
legend(...)
}
par(mar=c(5.5, 4, 3.5, 1.5),mfrow=c(2,3),mgp=c(2,1,0))
# normal
for (n in n.seq){
rg<-2/sqrt(n)
X<-rnorm(n,theta0,1)
plot(theta0,1,xlim=c(max(0,mean(X)-rg),mean(X)+rg),xlab=expression(theta),
ylim=c(-2.5,0.005),ylab=expression(paste("E(",theta,"),M(",theta,")")), main=paste0("n=",n),cex.axis=1.2)
plot.norm(X)
abline(v=mean(X),lty=2,col="grey")
}
# exponential
for (n in n.seq){
rg<-2/sqrt(n)
X<-rexp(n,rate=theta0)
plot(theta0,1,xlim=c(max(0,1/mean(X)-rg),1/mean(X)+rg),xlab=expression(theta),
ylim=c(-4.5,0.005),ylab=expression(paste("E(",theta,"),M(",theta,")")), main=paste0("n=",n),cex.axis=1.2)
plot.exp(X)
abline(v=1/mean(X),lty=2,col="grey")
}
add_legend("bottom", legend=c("Normal, EL","Normal, PL","Exp, EL","Exp, PL"), lwd=2,
lty=1:4, col=1:4, horiz=TRUE, bty='n', cex=1.1)
log{ELR(θ)} ≈ (1/2)(n^{0.5}(θ − X̄))² / [n^{−1} Σ_{i=1}^n (X_i − X̄)²] + (1/3) n(θ − X̄)³ [n^{−1} Σ_{i=1}^n (X_i − X̄)³] / [n^{−1} Σ_{i=1}^n (X_i − X̄)²]³.
n n
∑ ( ∑
3 3
( ) )
E Xi E X1
Pr max i =1,..., n X i ≥ n1/3+ε ≤ Pr X i ≥ n1/3+ε ≤ = 3ε ⎯⎯
⎯ ⎯ ⎯
⎯ → 0.
n1+ 3ε n n →∞
i =1 i =1
Recall that

EL(θ) = sup_{0 < p_1, ..., p_n < 1, Σp_i = 1, Σp_i g(X_i, θ) = 0} Π_{i=1}^n p_i = Π_{i=1}^n (n + λg(X_i, θ))^{−1},

and note that

λ < 0 if and only if Σ_{i=1}^n g(X_i, θ) < 0.
Indeed, since EL(θ) ≤ n^{−n},

0 ≤ log(n^{−n} / Π_{i=1}^n p_i) = Σ_{i=1}^n log{1 + λg(X_i, θ)/n}.

Moreover, using the constraint Σ_{i=1}^n p_i g(X_i, θ) = 0,

Σ_{i=1}^n g(X_i, θ) = Σ_{i=1}^n g(X_i, θ)(1 − n p_i) = Σ_{i=1}^n g(X_i, θ) λg(X_i, θ)/(n + λg(X_i, θ)) = λ Σ_{i=1}^n {g(X_i, θ)}² p_i,

and, since Σ_{i=1}^n {g(X_i, θ)}² p_i > 0, the sign of λ coincides with that of Σ_{i=1}^n g(X_i, θ).
Proposition 10.3.5. If λ ≥ 0, we have

0 ≤ λ ≤ n Σ_{i=1}^n g(X_i, θ) [Σ_{i=1}^n {g(X_i, θ)}² I{g(X_i, θ) < 0}]^{−1};

if λ < 0, we have

n Σ_{i=1}^n g(X_i, θ) [Σ_{i=1}^n {g(X_i, θ)}² I{g(X_i, θ) > 0}]^{−1} ≤ λ < 0,

where I{.} is the indicator function.
Proof. Recall the identity Σ_{i=1}^n g(X_i, θ) = λ Σ_{i=1}^n {g(X_i, θ)}² p_i, where p_i = {n + λg(X_i, θ)}^{−1}, i = 1, ..., n. If λ ≥ 0 and g(X_i, θ) < 0 then p_i > n^{−1}, so that Σ_{i=1}^n g(X_i, θ) ≥ (λ/n) Σ_{i=1}^n {g(X_i, θ)}² I(g(X_i, θ) < 0), and hence

0 ≤ λ < n Σ_{i=1}^n g(X_i, θ) [Σ_{i=1}^n {g(X_i, θ)}² I(g(X_i, θ) < 0)]^{−1}.

The case λ < 0 is treated symmetrically, using p_i > n^{−1} for g(X_i, θ) > 0, giving

n Σ_{i=1}^n g(X_i, θ) [Σ_{i=1}^n {g(X_i, θ)}² I{g(X_i, θ) > 0}]^{−1} < λ < 0.
Proposition 10.3.6. Since the probability weights p_i = (n + λg(X_i, θ))^{−1}, i = 1, ..., n, we have the properties

Σ_{i=1}^n g(X_i, θ)² p_i² = −n/λ² + (n²/λ²) Σ_{i=1}^n p_i² and Σ_{i=1}^n p_i² ≥ n^{−1}.

Proof. Note that λg(X_i, θ)p_i = 1 − np_i, so that

Σ_{i=1}^n g(X_i, θ)² p_i² = (1/λ²) Σ_{i=1}^n (1 − np_i)² = n/λ² − (2n/λ²) Σ_{i=1}^n p_i + (n²/λ²) Σ_{i=1}^n p_i²,

where Σ_{i=1}^n p_i = 1. Thus Σ_{i=1}^n g(X_i, θ)² p_i² = −n/λ² + (n²/λ²) Σ_{i=1}^n p_i², and, since the left-hand side is nonnegative, we obtain −n/λ² + (n²/λ²) Σ_{i=1}^n p_i² ≥ 0, i.e., Σ_{i=1}^n p_i² ≥ n^{−1}.
This completes the proof.
For example, consider the equation Σ_{i=1}^n g(X_i, θ)(n + λg(X_i, θ))^{−1} = 0. We have

(d/dθ) Σ_{i=1}^n g(X_i, θ)(n + λg(X_i, θ))^{−1} = 0.

Then

λ' = n Σ_{i=1}^n g'(X_i, θ) p_i² / Σ_{i=1}^n g(X_i, θ)² p_i²,

where g'(u, θ) = ∂g(u, θ)/∂θ.
In the case with g(u, θ) = u − θ, one can use Proposition 10.3.6 to obtain the inequality

λ' = −n Σ_{i=1}^n p_i² / Σ_{i=1}^n (X_i − θ)² p_i² ≤ −1 / Σ_{i=1}^n (X_i − θ)² p_i².
and formal tests for the relevant goodness-of-fit are often not available. The
statistical literature has shown that tests derived from empirical likelihood
methodology possess many of the same asymptotic properties as those
based on parametric likelihoods. This leads naturally to the idea of using the
empirical likelihood instead of the parametric likelihood as the basis for
Bayesian inference.
Lazar (2003) demonstrated the potential for constructing nonparametric
Bayesian inference based on ELs. The key idea is to substitute the parametric
likelihood (PL) with the EL in the Bayesian likelihood construction relative
to the component of the likelihood used to model the observed data. It is
demonstrated that the EL function is a proper likelihood function and can
serve as the basis for robust and accurate Bayesian inference. This Bayesian
empirical likelihood method provides a robust nonparametric data-driven
alternative to the more classical Bayesian procedures.
Vexler et al. (2014) developed nonparametric Bayesian posterior expectations by incorporating the EL methodology into the posterior likelihood construction. The asymptotic forms of the EL-based Bayesian posterior expectation are shown to be similar to those derived in the well-known parametric Bayesian and frequentist statistical literature. In the case when the prior distribution function depends on unknown hyper-parameters, a nonparametric version of the empirical Bayesian method, which yields double empirical Bayesian estimators, can be obtained. This approach yields a nonparametric analog of the well-known James–Stein estimation that has been well addressed in the literature dealing with multivariate-normal observations (e.g., Stein, 1956).
Vexler et al. (2016b) provided an EL-based technique for incorporating
prior information into the equal-tailed and highest posterior density confi-
dence interval estimators (Chapter 9) in the Bayesian manner. It was demon-
strated that the proposed EL Bayesian approach may correct confidence
regions with respect to skewness of the data distribution.
The EL Bayesian procedures can serve as a powerful approach to incorpo-
rating external information into the inference process about given data, in a
distribution-free manner.
the best strategy for constructing statistical decision rules, the following
arguments can be accepted in favor of EL methods:
Appendix
ELR (θ1 , θ2): Several results that are similar to those related to the evalua-
tions of ELR (θ).
One can show that the logarithm of the ELR test statistic for the null hypoth-
esis: H 0 : E(X 1 ) = θ1 and E(X 12 ) = θ2 has the form
log ELR(θ_1, θ_2) = Σ_{i=1}^n log{1 + λ_1(X_i − θ_1)/n + λ_2(X_i² − θ_2)/n},

where λ_1 and λ_2 are roots of

Σ_{i=1}^n (X_i − θ_1)/{n + λ_1(X_i − θ_1) + λ_2(X_i² − θ_2)} = 0 and Σ_{i=1}^n (X_i² − θ_2)/{n + λ_1(X_i − θ_1) + λ_2(X_i² − θ_2)} = 0.
In this case, ∂log(ELR)/∂θ_1 = −λ_1 and ∂log(ELR)/∂θ_2 = −λ_2. At the point (θ_1 = X̄ = Σ_{i=1}^n X_i/n, θ_2 = X̄² = Σ_{i=1}^n X_i²/n, the bar denoting the sample mean of the X_i²), we have log{ELR(θ_1, θ_2)} = λ_1 = λ_2 = 0, p_i = 1/n, and the following derivative values:
∂λ_1/∂θ_1 = n² Σ_{i=1}^n (X_i² − X̄²)² / Δ, ∂λ_2/∂θ_2 = n² Σ_{i=1}^n (X_i − X̄)² / Δ,
∂λ_1/∂θ_2 = ∂λ_2/∂θ_1 = n² Σ_{i=1}^n (X_i − X̄)(X_i² − X̄²) / Δ.

The second-order derivatives follow by further differentiation of these expressions; for example,

∂²λ_1/∂θ_1² = (1/(nΔ)) [2 Σ_{i=1}^n (X_i² − X̄²){(∂λ_1/∂θ_1)(X_i − X̄) + (∂λ_2/∂θ_1)(X_i² − X̄²)}² Σ_{i=1}^n (X_i − X̄)(X_i² − X̄²)
− 2 Σ_{i=1}^n (X_i − X̄){(∂λ_1/∂θ_1)(X_i − X̄) + (∂λ_2/∂θ_1)(X_i² − X̄²)}² Σ_{i=1}^n (X_i² − X̄²)²],

with analogous expressions for the remaining second-order and mixed derivatives of λ_1 and λ_2 with respect to θ_1 and θ_2, where

Δ = {Σ_{i=1}^n (X_i − X̄)(X_i² − X̄²)}² − Σ_{i=1}^n (X_i − X̄)² Σ_{i=1}^n (X_i² − X̄²)².
Then, in a similar manner to the analysis shown in Section 10.3, one can
show that, under H 0 , 2 log ELR (θ1 , θ2 ) ~ χ 22 , as n → ∞.
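This bivariate result can likewise be examined numerically (our sketch; el.test accepts a matrix of estimating-function values and tests that its mean vector is zero):

# 2 log ELR(theta1, theta2) is approximately chi^2_2 under H0
library(emplik)
set.seed(1)
X <- rnorm(200, mean = 0, sd = 1)   # E(X) = 0, E(X^2) = 1 under H0
theta1 <- 0; theta2 <- 1
G <- cbind(X - theta1, X^2 - theta2)
el.test(G, mu = c(0, 0))$`-2LLR`    # compare with qchisq(0.95, df = 2)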
11
Jackknife and Bootstrap Methods
11.1 Introduction
Jackknife and bootstrap methods and other “computationally intensive”
techniques of statistical inference have a long history dating back to the per-
mutation test introduced in the 1930s by R.A. Fisher. Ever since that time
there has been work towards developing efficient and practical nonparamet-
ric models that did not rely upon the classical normal-based model assump-
tions. In the 1940s Quenouille introduced the method of deleting “one
observation at a time” for reducing the bias of an estimator. This method was further developed by Tukey for standard error estimation and coined by him as the “jackknife” method (Tukey, 1958). This early work was followed by
many variants up to the 1970s. The jackknife method was then extended to
what is now referred to as the bootstrap method. Even though there were predecessors to the development of the bootstrap method, e.g., see Hartigan (1969, 1971, 1975), it is generally agreed upon by statisticians that Efron's
(1979) paper in which the term “bootstrap” was coined, in conjunction with
the development of high-speed computers, was a cornerstone event towards
popularizing this particular methodology. Following Efron’s paper there
was an explosion of research on the topic of bootstrap methods. At one time
nonparametric bootstrapping was thought to be a statistical method that
would solve almost all problems in an efficient easy-to-use nonparametric
manner. So why do we not have a PROC BOOTSTRAP in SAS, and why are
bootstrap methods not used more widely? One answer lies in the fact that
there is not a general approach to bootstrapping along the lines of fitting a
generalized linear model with normal error terms. Oftentimes the bootstrap
approach to solving a problem requires writing new pieces of sometimes-
difficult SAS code that only pertains to a specific problem. Another reason
for the bootstrap method's failure to take off as a general approach is that it sometimes does not work well in certain situations, and the methods for correcting these deficient procedures are oftentimes difficult or tedious.
Our focus will be on problems where the simple bootstrap methods are
known to work well. Even though the bootstrap method is primarily
considered a nonparametric method, it can also be used in conjunction with
the empirical distribution function F̂(x) = Σ_{i=1}^n I(X_i ≤ x)/n, where I(·) denotes the indicator function. For example, if t(F) = ∫ x dF is the expectation, then θ̂ = t(F̂) = ∫ x dF̂(x) = Σ_{i=1}^n X_i/n is the sample mean (see also Section 1.6 in this context).
The bias of the statistic is given by bias(θ̂) = E(θ̂) − θ. Toward correcting for this bias, define the leave-one-out estimators

θ̂_{(i)} = t(F̂_i) = θ̂(X_1, X_2, ..., X_{i−1}, X_{i+1}, ..., X_n), i = 1, ..., n,

and θ̂_{(.)} = n^{−1} Σ_{i=1}^n θ̂_{(i)}. The jackknife bias-corrected estimator is then

θ̃ = n·θ̂ − (n − 1)θ̂_{(.)}.

Note, however, that there is a statistical price to pay for such a correction in terms of bias-variance trade-offs. This approach to bias correction is most useful for statistics that have expectations of the form
E(θ̂) = θ + a_1(F)/n + a_2(F)/n² + O(n^{−3}),

E(θ̂_{(.)}) = θ + a_1(F)/(n − 1) + a_2(F)/(n − 1)² + O(n^{−3}).

Hence,

E(θ̃) = θ − a_2(F)/{n(n − 1)} + O(n^{−3}),
such that θ̃ has bias of order O(n^{−2}), whereas θ̂ has bias of order O(n^{−1}). As an example, denote the population variance as

t(F) = ∫ (u − E(X))² dF(u),

so that

t(F̂) = ∫ (u − X̄)² dF̂(u) = Σ_{i=1}^n (X_i − X̄)²/n;

in this case the jackknife-corrected estimator θ̃ reduces to the usual unbiased sample variance Σ_{i=1}^n (X_i − X̄)²/(n − 1).
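A short R sketch (our illustration) shows the leave-one-out correction in action for the plug-in variance estimator; the corrected estimator coincides exactly with the usual unbiased sample variance.

# Jackknife bias correction of the plug-in variance estimator sum((X-Xbar)^2)/n
set.seed(1)
X <- rnorm(30); n <- length(X)
theta.hat <- function(x) mean((x - mean(x))^2)          # biased plug-in estimator
loo <- sapply(1:n, function(i) theta.hat(X[-i]))        # leave-one-out values
theta.tilde <- n * theta.hat(X) - (n - 1) * mean(loo)   # n*theta.hat - (n-1)*theta.hat(.)
c(jackknife = theta.tilde, unbiased = var(X))           # identical values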
More generally, for the blocked jackknife,

θ̃ = gθ̂ − (g − 1)θ̂_{(.)},

where θ̂_{(.)} = Σ_i θ̂_i / C(n, h), C(n, h) denotes the binomial coefficient “n choose h,” and Σ_i denotes the summation over all possible subsets. The blocked jackknife bias-corrected estimator has smaller variance than the leave-one-out-at-a-time classic estimators (Shao and Wu, 1989).
The jackknife variance estimator for θ̂ takes the form

Var̂(θ̂) = ((n − 1)/n) Σ_{i=1}^n (θ̂_{(i)} − θ̂_{(.)})².

In general, the variance estimator works for a class of statistics that are so-called smooth statistics and not so well in other instances; that is, in many instances lim_{n→∞} n{Var̂(θ̂) − Var(θ̂)} ≠ 0, where Var(X) = E{(X − E(X))²}. In fact, E{Var̂(θ̂)} > Var(θ̂). We outline the concept of smoothness in the bootstrap methods section below.
For the sample mean, the jackknife variance estimator reduces to

Var̂(θ̂) = (1/(n(n − 1))) Σ_{i=1}^n (X_i − X̄)² = S²/n.

There may in general be a temptation to …
The interval G(X) = (X, bX/a) is then a random interval and assumes the value G(x) whenever X assumes the value x. The interval G(X) contains the value b with a certain fixed probability. Now in this framework denote θ ∈ Θ ⊆ R.
For 0 < α < 1 a function satisfying P(θ̲(X) ≤ θ) ≥ 1 − α, ∀θ, is called a lower confidence band at level α, where θ̲(X) does not depend on θ and X is a random vector of observations. A similar definition holds for the upper confidence band θ̄(X). A family of subsets S(X) of Θ is said to constitute a family of confidence sets at confidence level 1 − α if P(θ ∈ S(X)) ≥ 1 − α, ∀θ ∈ Θ. A special case S(X) = (θ̲(X), θ̄(X)) is called a confidence interval provided

P(θ̲(X) < θ < θ̄(X)) ≥ 1 − α, ∀θ.
A key to generating a standard confidence interval used in applications is to develop a so-called pivot. For example, let X_1, X_2, ..., X_n be iid observations from a density function f(x; θ) and denote the statistic T(X_1, X_2, ..., X_n; θ) = T(X; θ). The goal in confidence interval generation is to start by choosing λ_1 and λ_2 such that the probability statement P(λ_1 < T(X; θ) < λ_2) ≥ 1 − α can be inverted into the form

P(θ̲(X) < θ < θ̄(X)) ≥ 1 − α.
This can be accomplished by finding a pivot within the form of the statistic T(X; θ) about the parameter θ. If the distribution function of T(X; θ) has a form that is independent of θ, then T(X; θ) is called a pivot. For example, let X_1, X_2, ..., X_n ~ N(θ, σ²); then the pivot T(X; θ) = n^{1/2}(x̄ − θ)σ^{−1} ~ N(0, 1), where x̄ = Σ_{i=1}^n X_i/n; that is, T(X; θ) has a distribution that is θ free. Hence we can pivot about θ to arrive at the exact 1 − α confidence interval for θ with the classic form P_θ(x̄ − z_{1−α/2}σ/√n < θ < x̄ + z_{1−α/2}σ/√n) ≥ 1 − α.
from f(x; θ). The empirical distribution function F̂(x) = Σ_{i=1}^n I(X_i ≤ x)/n plays a central role in bootstrap methods; it provides a more formal method for calculating quantities, e.g., ∫ g(x) dF̂(x), which, if done analytically, provides so-called exact bootstrap solutions, where g(x) denotes a generic function corresponding to the population quantity of interest, e.g., g(x) = x corresponds to the calculation of E(X).
The heart of bootstrap methods revolves around understanding statistical
functionals. Notationally, a statistical functional has the form θ = t( F ) such that
the estimator has the form θˆ = t( Fˆ ). The function θ = t( F ) is central to nonpara-
metric estimation. The key assumption that we will operate under is that some
characteristic of F is defined by θ. Note that F and θ may be multidimensional.
For the expectation functional we have

θ̂ = t(F̂) = ∫ x dF̂(x) = ∫ x d(Σ_{i=1}^n H(x − X_i)/n) = (1/n) Σ_{i=1}^n ∫ x dH(x − X_i) = (1/n) Σ_{i=1}^n X_i = X̄,

that is, the sample average, where H(·) denotes the unit step function. Oftentimes in terms of direct calculations the quantile form of the statistical functional is worth considering; e.g., for the average we have

θ = t(F) = ∫ x dF = ∫_0^1 Q(u) du,
where Q(α) = F^{−1}(α) denotes the α-quantile of X, 0 < α < 1. This quantity can be calculated directly for small samples, but becomes computationally and analytically infeasible for even moderate sample sizes. Or, as another complex example, say we
were interested in the statistical functional t(F) = Var(S/x̄), where S is the sample standard deviation. In this case, direct calculation is not possible. However, a simulation approach based on sampling with replacement provides a powerful tool for calculating such quantities.
Let us take a simple example of estimating the variance for the coefficient of variation. In this case θ = t(F) = 100 × {Var(X)}^{1/2}/E(X) such that θ̂ = t(F̂) = 100 × S/X̄. Deriving the variance estimator Var̂(100 × S/X̄) of the coefficient of variation is nontrivial even when the parametric form of the density is assumed known. Suppose we observe a sample of size n = 10 with observed data X = {0.25, 0.40, 0.46, 0.27, 2.51, 1.24, 4.11, 6.11, 0.46, 1.35}. The estimate in this case is θ̂ = t(F̂) = 114.6. For the nonparametric bootstrap approach we sample
with replacement from F̂ in order to form bootstrap replicates. One single randomly sampled bootstrap replicate in this example was x* = {0.25, 0.40, 0.46, 0.27, 1.29, 1.29, 4.11, 6.11, 0.25, 1.37} such that the bootstrap replicate statistic was θ̂* = t(F̂*) = 124.29, where F̂* is the empirical distribution function based on the resampled values. If we repeat this process, say with B = 10,000 bootstrap replications, we obtain the approximate distribution for θ̂ = t(F̂) = 100 × S/X̄ illustrated in Figure 11.1. Similar to the jackknife, a nonparametric estimate of the variance of θ̂ = t(F̂) is

Var̂(t(F̂)) ≈ (1/(B − 1)) Σ_{i=1}^B (t_i(F̂*) − t̄(F̂*))²,

where t̄(F̂*) = (1/B) Σ_{i=1}^B t_i(F̂*). For our example, Var̂(t(F̂)) = 514.6. Note that there
are two sources of variance in the approximation, Monte Carlo error and
stochastic error. The Monte Carlo error goes to 0 as B → ∞. We will demon-
strate that this resampling approach can be extended to generating approxi-
mate bootstrap confidence intervals for θ = t (F).
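The same resampling workflow can be sketched in a few lines of R (our illustration paralleling the SAS development used throughout this chapter), using the data listed above:

# Nonparametric bootstrap for the coefficient of variation 100 * S / Xbar
set.seed(1)
X <- c(0.25, 0.40, 0.46, 0.27, 2.51, 1.24, 4.11, 6.11, 0.46, 1.35)
cv <- function(x) 100 * sd(x) / mean(x)
B <- 10000
cv.star <- replicate(B, cv(sample(X, replace = TRUE)))  # bootstrap replicates
c(estimate = cv(X), boot.var = var(cv.star))            # estimate and its bootstrap variance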
The general steps one needs to follow in order to utilize the power of clas-
sic bootstrap methods are as follows:
FIGURE 11.1
B = 10,000 bootstrap replications for the coefficient of variation example.
C_i ~ B(n − Σ_{j=0}^{i−1} C_j, 1/(n − i + 1)) (11.1)

for i = 1, ..., n − 1, where C_0 = p_0 = 0, C_n = n − Σ_{j=0}^{n−1} C_j, and B(n, p) denotes a standard binomial distribution. Random binomial variables may be generated using the RANBIN function in SAS. Even though this formula may look complicated, the counts C_i are easy to generate sequentially.
We would then want to output the value X_i a total of C_i times for one bootstrap resample. Note that if n − Σ_{j=0}^{i−1} C_j = 0, then stop; all subsequent C_i's are equal to 0.
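Before turning to the SAS implementation below, note that the same sequential-binomial construction of the counts in (11.1) can be sketched in R (our illustration; rbinom plays the role of SAS's RANBIN):

# Generate multinomial resampling counts C_1, ..., C_n via Equation (11.1)
boot.counts <- function(n) {
  C <- numeric(n); left <- n                 # 'left' = n minus the sum of previous counts
  for (i in seq_len(n - 1)) {
    C[i] <- rbinom(1, size = left, prob = 1 / (n - i + 1))
    left <- left - C[i]
    if (left == 0) break                     # all subsequent counts are zero
  }
  C[n] <- left                               # remainder goes to the last cell
  C
}
set.seed(1)
boot.counts(4)                               # counts for one bootstrap resample of size 4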
data original;
n=4; /*set the sample size*/
input x_i @@;
do b=1 to 3; /*output the data B times*/
output;
end;
cards;
0.5 0.4 0.6 0.2;
proc sort;by b;
This first piece of code in data original just outputs the same data-
set three times in a row. We then need to sort the values by b. Now all
we do in the next set of code is carry out the generation of multinomial
random variables sequentially via a marginal random binomial number
generator.
The statement “else c_i=0” is just a shortcut needed for the cases where we have already allocated all n resampling slots. The RETAIN statements are needed for counting purposes and for updating the marginal binomial probabilities. Basically, what we have done is keep track of n − Σ_{j=0}^{i−1} C_j and 1/(n − i + 1) for B = 1, and then we repeated the resampling scheme for B = 2 and for B = 3. The output for a given run is below.
What this program above does is basically the nuts and bolts of most uni-
variate bootstrapping procedures. It is easily modified for the multivariate
procedures. Note that if there are missing values they should be eliminated
within the first data step.
As an example, consider how the program would work for a simple statistic such as the sample mean X̄ = Σ_{i=1}^n X_i/n. A bootstrap replication of the mean would simply be x̄* = Σ_{i=1}^n C_i X_i/n, or for our example x̄* = (3 × 0.5 + 1 × 0.2)/4 = 0.425, or, for those of you who prefer matrix notation, X̄* = XC′/n. To carry out this process in SAS simply add the following code to the end of SAS PROGRAM 11.1:
proc means;by b;
var x_i;
weight c_i;
output out=bootstat mean=xbarstar;
For the sample mean we can take advantage of the WEIGHT statement built in to PROC MEANS. Since the sample size of n = 4 is small we would basically only need to increase the value of B up to, say, B = 100 in this example in order to get a very accurate approximation of the distribution of the sample mean for this dataset. Guidelines for the number of resamples will be discussed later. Through the use of the WEIGHT statement, PROC MEANS is “automatically” calculating x̄* = Σ_{i=1}^n C_i X_i/n for each value of B via the BY statement. In a variety of problems from regression to generalized linear models we will utilize the fact that the WEIGHT utility is built in to a number of SAS procedures. From a statistical theory standpoint, in the case of the sample mean one can easily prove that if E(X) = μ then E(X̄*) = μ as well. Note that

E(X̄*) = E(Σ_{i=1}^n C_i X_i/n) = Σ_{i=1}^n E(C_i)E(X_i)/n = μ,

given E(C_i) = 1 and the independence of the resampling counts and the data; that is, remarkably, all bootstrap resampled estimates of the mean are unbiased estimates of μ.
Note that the number of bootstrap replications is typically recommended
to be B = 1000 or higher for moderately sized datasets, e.g., see Booth and
Sarkar (1998). However, if you can feasibly carry out a higher number of rep-
lications the precision of the bootstrap procedure can only increase, that is,
from a theoretical statistical point of view you can never take too many boot-
strap replications. From a practical point of view you may stretch the mem-
ory capacity and speed of your given computer. In general, one should shoot
for B = 1000. If a value for B >1000 can be practically used then by all means
use it. Obviously, the memory requirements for the above algorithm can
expand rapidly for moderately sized datasets and values of B around 1000.
The program (SAS PROGRAM 11.1) will run very rapidly, however it is a
“space hog” with respect to disk space.
In some instances we can keep the code even more simple by replacing the
data-step resample in the previous approach with PROC SURVEYSELECT as
seen in the following example.
b   NumberHits   x_i
1   2            0.5
1   2            0.4
2   1            0.4
2   1            0.6
2   2            0.2
3   1            0.5
3   1            0.4
3   2            0.2
The key macro statements that we will use in the bootstrapping program
below and throughout the book consist of the following:
5. %mend; The %mend command signifies the end of the macro mul-
tiwt.
6. %multiwt(3); The command %multiwt(3) is the macro call which
passes the value of 3 to the macro variable brep.
data sampsize;
set original nobs=nobs; /*automatically obtain the sample
size*/
n=nobs;
%macro multiwt(brep);
%do bsim=1 %to &brep;
data resample;
set sampsize;
retain sumc i; /*need to add the previous
count to the total */
b=&bsim;
if _n_=1 then do;
sumc=0;
i=1;
end;
p=1/(n-i+1); /*p is the probability of a "success"*/
%end;
%mend;
%multiwt(3);
data sampsize;
set original nobs=nobs;
_nsize_=nobs; /*control variable for surveyselect*/
if _n_=1 then output;
%macro multiwt(brep);
%end;
%mend;
%multiwt(3);
proc print;
var b x_i;
proc iml;
use original;
read all into data;
quit;
/*********Listing File******************/
I BOOTDATA
1 0.5 0.4 0.5 0.5
I BOOTDATA
2 0.4 0.5 0.4 0.5
I BOOTDATA
3 0.6 0.5 0.4 0.2
The USE and READ statements simply retrieve SAS datasets and import
them into PROC IML. Alternatively, data may be entered directly in PROC
IML with a statement such as
form for the distribution function of t(F; θ). In this case we approximate the confidence interval machinery by replacing t(F̂; θ) − t(F; θ) with the bootstrap approximation t(F̂*; θ) − t(F̂; θ), and t_α with t*_α estimated from bootstrap resampling. The pivotal one-sided interval is then (−∞, t(F̂; θ) − t*_α), where t*_α is the α-level quantile of the bootstrap distribution, such that

P(t(F̂; θ) − t*_α > t(F; θ)) = 1 − α + O(n^{−1/2}).
The result can be proven via an Edgeworth expansion technique. Note that the nuisance parameter about scale is “built in” to the bootstrap resampling process. The two-sided 100 × (1 − α)% confidence interval becomes

(t(F̂; θ) − t*_{1−α/2}, t(F̂; θ) − t*_{α/2}).
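A compact R sketch of the bootstrap-t construction for the sample mean (our illustration; the studentized bootstrap replicates estimate the quantiles t*_α):

# Bootstrap-t (pivotal) 90% confidence interval for the mean
set.seed(1)
X <- rexp(40); n <- length(X); B <- 5000
t.star <- replicate(B, {
  xs <- sample(X, replace = TRUE)
  (mean(xs) - mean(X)) / (sd(xs) / sqrt(n))  # studentized bootstrap pivot
})
q <- quantile(t.star, c(0.95, 0.05))          # t*_{1-alpha/2}, t*_{alpha/2}
mean(X) - q * sd(X) / sqrt(n)                 # (lower, upper) interval endpoints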
Returning to our coefficient of variation example with α = 0.10, we arrive at the corresponding interval estimate. More generally, consider the expansion

t(F̂) = t(F) + (1/n) Σ_{i=1}^n φ_F(X_i) + R_n,

where φ_F denotes the leading influence-type term and R_n the remainder. One can examine the quadratic term in the expansion of t(F̂). If the quadratic term, typically of the order o(n^{−1}), does not “dominate” the series, the statistic is “smooth.” There is no set rule for verifying smoothness in general.
As an example, consider the α-trimmed mean,

t(F̂) = X̄_α = (1/(n − 2k)) Σ_{i=k+1}^{n−k} X_{(i)}, (11.2)

where X_{(i)} denotes the ith ordered observation, α denotes the trimming proportion, and k = [nα], where the function [.] again denotes the floor function.
Suppose we specify for our continuing example the trimming proportion of
α = 0.05 with a sample size of n = 44. That implies that k = [44 × 0.05] = 2
observations are trimmed from each tail. Given that this trimming strategy
seems very straightforward we need to step back and ask: what is the popu-
lation parameter that we are estimating? It is very important to have an
understanding of what you are estimating prior to carrying out any boot-
strap estimation procedure. The α-trimmed mean statistic is an estimator of
the population trimmed mean:
t(F) = E_α(X) = (1/(1 − 2α)) ∫_{F^{−1}(α)}^{F^{−1}(1−α)} x dF(x),
that is, the population parameter in this instance is Eα (X ), not the expected
value of X, E(X ). For the specific case where the distribution is symmetric
about the mean Eα (X ) = E(X ). This is not the case for asymmetric distribu-
tions. Therefore, in this case our goal is to obtain a 95% confidence interval
for Eα (X ).
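A percentile-interval sketch in R (our illustration; since the CD4 data are not reproduced here, placeholder skewed data are used, and R's mean(x, trim = alpha) computes the α-trimmed mean):

# Bootstrap percentile 95% interval for the trimmed mean E_alpha(X)
set.seed(1)
X <- rexp(44, rate = 1/300)                 # placeholder skewed data (n = 44)
B <- 10000; alpha <- 0.05
tm.star <- replicate(B, mean(sample(X, replace = TRUE), trim = alpha))
quantile(tm.star, c(0.025, 0.975))          # percentile interval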
Within SAS PROGRAM 11.6 corresponding to our example the CD4 count
data from n = 44 subjects is read into data original. As an alternative to
what is provided in PROC UNIVARIATE we can calculate the bootstrap per-
centile confidence interval nonparametrically and compare the results. Note
that the “guts” of the sample program is the same as what we have been
utilizing throughout for simple bootstrapping. Within the simple bootstrap
algorithm we can calculate the trimmed mean in PROC UNIVARIATE using
the TRIMMED command and output the value via SAS’s ODS system to the
dataset tr, that is, there is no real additional programming to be done other
than to use the already built-in calculations within PROC UNIVARIATE.
The lower and upper bootstrap percentiles are then calculated and output
via a separate run of PROC UNIVARIATE using the command pctlpts=2.5
97.5 pctlpre=p.
For our example dataset the trimmed mean and its standard deviation
(of the trimmed mean) turned out to be x.05 = 304.9 ± 76.8 as compared to
the sample mean x = 377.7 ± 74.3. The difference between the two statis-
tics again illustrates the asymmetry in the data, and reinforces the fact
that the built-in confidence interval procedure should probably be
avoided. For this example the 95% semi-parametric confidence interval
for the trimmed mean, as calculated in PROC UNIVARIATE under an
asymmetry assumption, turned out to be (149.3, 460.6). For B = 10,000 resamples the indices for the percentile interval are λ_1 = 250 and λ_2 = 9750. Therefore, the 95% bootstrap interval would be the 250th and 9750th ordered resampled values, or namely (T*_{(250)}(F_n(X)), T*_{(9750)}(F_n(X))). After
B = 10000 resamples the 95% bootstrap percentile interval was calculated
as (181.8, 466.0). Also note that the percentile interval is not constrained to
be symmetric about the statistic x.05 = 304.9 , thus making it much more
flexible with respect to underlying assumptions. Given the moderately
large sample size of n = 44 and the smoothness of the trimmed mean sta-
tistic we should feel fairly confident about the coverage accuracy of the
bootstrap percentile confidence for this example. We will compare this
interval with other more complicated bootstrap methods given later on.
At the minimum the percentile method is useful for examining the
assumptions of parametric or semi-parametric methods in terms of their
validity. As noted earlier, in general the bootstrap-t method is what we
would generally recommend. It will be described in the next section.
Below is the basic SAS program used to carry out the trimmed mean per-
centile interval calculations for our example.
data resample;
set sampsize;by b;
retain sumc i; /*need to add the previous
count to the total */
if first.b then do;
sumc=0;
i=1;
end;
p=1/(n-i+1); /*p is the probability of a "success"*/
/******Print Results**************************/
where Φ(x) and φ(x) denote the standard normal cdf and pdf, and H_2(x) = x² − 1, H_3(x) = x³ − 3x, and H_5(x) = x⁵ − 10x³ + 15x are Hermite polynomials, defined in general via H_r(x)φ(x) = (−1)^r (d^r/dx^r) φ(x); additional series terms can be added to improve the degree of accuracy of the approximation.
A classic example for the Edgeworth approximation is as follows. Let X_1, \ldots, X_n be iid N(0, 1) random variables such that V = \sum_{i=1}^{n} X_i^2 \sim \chi^2_n. Asymptotically we have that

T = \frac{V - n}{\sqrt{2n}} \sim AN(0, 1),

such that the approximate distribution function for T is F_T(x) \cong \Phi(x), that is,
the first term in the Edgeworth expansion above. More precisely, we know
that in this example \gamma_1 = 2\sqrt{2}/\sqrt{n} and \gamma_2 = 12/n. Hence, a more precise approximation for the distribution function for T is given as

F_T(x) \cong \Phi(x) - \phi(x)\left[\frac{\sqrt{2}}{3\sqrt{n}}(x^2 - 1) + \frac{1}{2n}(x^3 - 3x) + \frac{1}{9n}(x^5 - 10x^3 + 15x)\right].
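As a quick numerical check of this expansion (our addition, under the chi-square setup above), the following R sketch compares the exact distribution of T with the one-term and Edgeworth-corrected approximations:

# R sketch: normal vs. Edgeworth approximation for T = (V - n)/sqrt(2n), V ~ chi^2_n
n <- 10; x <- 1.5
exact  <- pchisq(n + x * sqrt(2 * n), df = n)   # P(T <= x), exact
normal <- pnorm(x)                              # leading term only
edge   <- pnorm(x) - dnorm(x) * (sqrt(2)/(3*sqrt(n)) * (x^2 - 1) +
          (x^3 - 3*x)/(2*n) + (x^5 - 10*x^3 + 15*x)/(9*n))
c(exact = exact, normal = normal, edgeworth = edge)

The Edgeworth-corrected value typically lands noticeably closer to the exact probability than Φ(x) for moderate n.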
P\left(\left|G_{K^*|\hat F}(k) - G_{K|F}(k)\right| > \varepsilon\right) \to 0, \quad n \to \infty.
For specific cases this proof can be straightforward. However, the general
result is technically challenging, and specific cases often require a key piece of
knowledge about the given problem to complete the proof. For example, take
t(F; θ) = E(X|F) = θ. Then t(\hat F; \theta) = \bar x and

t(\hat F^*; \theta) = n^{-1}\sum_{i=1}^{n} C_i X_i = n^{-1}\sum_{i=1}^{n} x_i^*,

where the C_i's
can be considered multinomial random variables, that is, the data are fixed
and the resampling counts follow a multinomial distribution. The classic
Mallow's distance results are

\rho(\hat F, F) \xrightarrow{a.s.} 0, \quad \rho\Big(\sum_{i} X_i, \sum_{i} Y_i\Big)^2 \le \sum_{i} \rho(X_i, Y_i)^2, \quad \rho(H, G) = \rho(X, Y),

\rho(X, Y)^2 = \rho(X - E(X), Y - E(Y))^2 + \|E(X) - E(Y)\|^2, \quad \text{and} \quad \rho(aX, aY) = |a|\,\rho(X, Y),
e.g., see Shao and Tu (1995). For example, using the notation from above let
K = K(X_1, X_2, \ldots, X_n|F) = \sqrt{n}\,\big(t(\hat F; \theta) - t(F; \theta)\big) = \sqrt{n}\,(\bar X - \theta) \quad \text{and}

K^* = K(X_1^*, X_2^*, \ldots, X_n^*|\hat F) = \sqrt{n}\,\big(t(\hat F^*; \theta) - t(\hat F; \theta)\big) = \sqrt{n}\,(\bar X^* - \bar X).

Then we have

\rho(K^*, K) = \rho\big(\sqrt{n}(\bar X^* - \bar X), \sqrt{n}(\bar X - \theta)\big)
= \frac{1}{\sqrt{n}}\,\rho\Big(\sum_{i=1}^{n}(X_i^* - \bar X), \sum_{i=1}^{n}(X_i - \theta)\Big)
\le \Big(\frac{1}{n}\sum_{i=1}^{n}\rho\big((X_i^* - \bar X), (X_i - \theta)\big)^2\Big)^{1/2}
= \rho\big((X^* - \bar X), (X - \theta)\big),
\quad \rho\big((X^* - \bar X), (X - \theta)\big)^2 = \rho(X^*, X)^2 - \|\theta - \bar X\|^2,
which converges to zero almost surely, given \bar X \to \theta a.s., in combination with the known Mallow's distance results outlined above. Hence, G_{K^*|\hat F}(k) is ρ-consistent for this example, which implies it
can be used to develop approximate confidence intervals that converge to the appropriate α level for large samples. The consistency of G_{K^*|\hat F}(k) can
oftentimes be shown via the Berry–Esseen inequality.
In the general case, why does the simple percentile interval generate a
valid confidence interval with coverage that converges to 100 × (1 − α)% as
n → ∞? As before let \hat\theta^* = t(\hat F^*; \theta) and \theta = t(F; \theta). For the more general case
we will employ Edgeworth expansion techniques, which provide less precise
statements about the large sample coverage of bootstrap percentile
intervals.
As in the specific case using Mallow's distance, the goal is to prove a large
sample equivalence between Pr(\hat\theta - \theta \le x|F) and Pr(\hat\theta^* - \hat\theta \le x|\hat F). In this
regard let us first examine the Edgeworth expansions of each of these quantities such that we have
P(\hat\theta - \theta \le x|F) = \Phi\Big(\frac{x}{\sigma}\Big) + \frac{1}{\sqrt{n}}\,q_1\Big(\frac{x}{\sigma}\Big|F\Big)\phi\Big(\frac{x}{\sigma}\Big) + \frac{1}{n}\,q_2\Big(\frac{x}{\sigma}\Big|F\Big)\phi\Big(\frac{x}{\sigma}\Big) + r,

and

P(\hat\theta^* - \hat\theta \le x|\hat F) = \Phi\Big(\frac{x}{\hat\sigma}\Big) + \frac{1}{\sqrt{n}}\,q_1\Big(\frac{x}{\hat\sigma}\Big|\hat F\Big)\phi\Big(\frac{x}{\hat\sigma}\Big) + \frac{1}{n}\,q_2\Big(\frac{x}{\hat\sigma}\Big|\hat F\Big)\phi\Big(\frac{x}{\hat\sigma}\Big) + r^*,
where q1 and q2 are a rearrangement of the H i’s from our introduction of the
Edgeworth expansion above and r * denotes the remainder term. Then we have
\sup_x \Big|P(\hat\theta - \theta \le x|F) - P(\hat\theta^* - \hat\theta \le x|\hat F)\Big|
\le \frac{1}{\sqrt{n}}\sup_x\Big\{\Big|q_1\Big(\frac{x}{\sigma}\Big|F\Big) - q_1\Big(\frac{x}{\hat\sigma}\Big|\hat F\Big)\Big| + n^{-1/2}\Big|q_2\Big(\frac{x}{\sigma}\Big|F\Big) - q_2\Big(\frac{x}{\hat\sigma}\Big|\hat F\Big)\Big|\Big\}\phi\Big(\frac{x}{\sigma}\Big) + O_p(n^{-1}),

given the assumption that \hat\sigma - \sigma = O_p(n^{-1/2}), which yields \phi(x/\hat\sigma) - \phi(x/\sigma) = O_p(n^{-1/2}).
Hence, we have that

\sup_x \Big|P(\hat\theta - \theta \le x|F) - P(\hat\theta^* - \hat\theta \le x|\hat F)\Big| = O_p(n^{-1}),
where P(\hat\theta^* - \hat\theta \le x|\hat F) is a random quantity. It is important to note that even
though the asymptotic theory is mathematically correct, there is no
guarantee that the finite sample coverage will be accurate for these types of
intervals. An alternative approach to the Edgeworth expansion for these
problems is to use Cornish–Fisher expansions (Cornish and Fisher, 1937).
The general lemma for the percentile method is as follows. Suppose

\frac{\hat\theta - \theta}{\hat\sigma} \xrightarrow{d} T \quad \text{and} \quad \frac{\hat\theta^* - \hat\theta}{\hat\sigma^*} \xrightarrow{d} T.

If we assume \hat\sigma \to \sigma and \hat\sigma^* \to \hat\sigma, and T is distributed according to a symmetric distribution function, then the percentile confidence interval (b^*_{\alpha/2}, b^*_{1-\alpha/2}) described above is asymptotically correct at
level 1 − α. Note that if T is asymptotically normal then we know that the
distribution of T will be symmetric asymptotically. Hence, as a rule of
thumb, asymptotically normally distributed T's will provide valid large
sample percentile bootstrap confidence intervals for the statistical functional t(F; θ).
A caveat is that practitioners may end up interpreting their confidence intervals about the wrong parameter, e.g., suppose t(F; θ) = E(X_{(n)}|F). Then the T of interest is not the sample maximum
X_{(n)} but E(X_{(n)}|\hat F), which is a weighted average of a linear combination of
order statistics. We can refine the percentile method to have better accuracy
in finite samples using the approaches below.
for \bar X, and

\sigma^2_{t(\hat F^*; \theta)} = B^{-1}\sum_{i=1}^{B}\Big(t(\hat F_i^*; \theta) - \bar t(\hat F^*; \theta)\Big)^2,

where \bar t(\hat F^*; \theta) is the average of the
bootstrap resampled values. If an analytical expression for \sigma^2_{t(F; \theta)} is unavailable or not easily calculated, a double bootstrap approach is necessary, e.g., see
SAS PROGRAM 11.9. The heuristic idea is that percentile intervals tend to
suffer undercoverage in small samples; the scalar correction \sigma_{t(\hat F)}/\sigma_{t(\hat F^*)} provides
slightly wider intervals and hence better coverage probabilities. Similar to the
percentile interval proofs, let \hat\theta^* = t(\hat F^*; \theta) and \theta = t(F; \theta). Then one can use an
Edgeworth expansion technique to show that

\sup_x \left|P\Big(\frac{\hat\theta - \theta}{\sigma_{t(\hat F)}} \le x\Big|F\Big) - P\Big(\frac{\hat\theta^* - \hat\theta}{\sigma_{t(\hat F^*)}} \le x\Big|\hat F\Big)\right| = O_p(n^{-1}).
Example. Let us revisit our example from the percentile interval section
involving the trimmed mean \bar x_\alpha. As you recall, we were interested in measures of location for CD4 count data. Previously we calculated 95% bootstrap
percentile confidence intervals for E_{.05}(X) and E_{.25}(X) based upon the trimmed
means T(F_n(X)) = \bar x_{.05} and T(F_n(X)) = \bar x_{.25}, respectively.
these intervals for the same dataset using the bootstrap-t method, as com-
pared to the percentile interval approach, we need some additional program-
ming steps. These are given in the example SAS PROGRAM 11.7.
The first step in modifying the existing programs is to create a dummy
variable used to merge bootstrap quantities labeled dummy. For this exam-
ple dummy = 1 for every data and resample value. To calculate the trimmed
mean and its standard deviation we employed the SAS ODS system with the
command ods output trimmedmeans=trbase in conjunction with PROC
UNIVARIATE. This allows us to output the trimmed mean and standard
deviation to the dataset trbase within our sample program. It is important to
note that the standard deviation estimate is labeled as the standard error in
SAS, that is, what is calculated is the standard error estimate for X and the
standard deviation for the statistic Xα .
After outputting the trimmed mean and its standard deviation from PROC
UNIVARIATE we reassigned them new variable names within the dataset
rename. This was necessary so that values were not overwritten when
merged with the bootstrap resampled trimmed means and standard devia-
tions of the same name. As with the original trimmed mean calculation the
bootstrap resampled statistics were also calculated using PROC UNIVARI-
ATE and output using ODS to dataset combine. In addition, within dataset
combine we calculated the quantities

\frac{\sigma_{t(\hat F)}}{\sigma_{t(\hat F^*)}}\big(t(\hat F^*; \theta) - t(\hat F; \theta)\big)

needed in order to calculate the interval. The bootstrap-t interval based on the ordered
(\sigma_{t(\hat F)}/\sigma_{t(\hat F^*)})(t(\hat F^*; \theta) - t(\hat F; \theta))'s was then processed via PROC UNIVARIATE. We also
recalculated the percentile interval within this program for comparison. The
intervals were distinguished by the percentile prefixes pp and pt, for the
percentile and bootstrap-t methods, respectively.
For our example run B = 10000 resamples were used. The 95% percentile
interval turned out to be (179.9, 471.1) as compared to the previous run
given in the earlier section of (181.8, 466.0). This gives you an idea of how
much the intervals might change from successive runs, even with B = 10000
resamples. The bootstrap-t 95% confidence interval turned out to be
(S^*_{(250)}(\cdot), S^*_{(9750)}(\cdot)) = (170.5, 537.1). Recall also that the semi-parametric interval
built in to SAS, calculated under the assumption of symmetry, yielded the
interval (149.3, 460.6). Given the apparent skewness in our example CD4
count dataset the bootstrap-t interval may be shown to be theoretically
preferable to the semi-parametric interval calculated by SAS in this
instance. The bootstrap-t interval for this example is reflective of the skew-
ness of the original dataset, and is wider than the percentile method since
it was scaled more appropriately. Also, notice how the bootstrap-t interval
is shifted slightly to the right as compared to the semi-parametric interval,
again due to the asymmetry of the original dataset.
Note that the symmetry assumption may be relaxed somewhat if the trim-
ming proportion is increased. Let us examine the effect of the choice of trim-
ming proportion as well, e.g., say we wished to use a more robust measure of
location. We can rerun the example with a trimming proportion of α = .25 , as
compared to α = .05 above, that is, in this case we will eliminate 25% of the
extreme observations from either tail of the distribution prior to estimating
the mean. For our CD4 count data x.25 = 210.1 ± 61.5 as compared to
x.05 = 304.9 ± 76.8 , and x = 377.7 ± 74.3 . The confidence intervals in turn for x.25
were (108.0, 351.1), (99.3, 397.2), and (82.3, 338.0) for the percentile, bootstrap-t,
and semi-parametric method, respectively. The intervals are not too dissimi-
lar. In general, the more trimming that takes place the more we can relax the
assumption of an underlying symmetrical distribution.
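Before turning to the SAS code, here is a rough R analogue of the bootstrap-t calculation (our sketch, with a hypothetical data vector x; the standard deviation of the trimmed mean is itself estimated by a small inner bootstrap, i.e., a double bootstrap):

# R sketch: bootstrap-t CI for the 5% trimmed mean via a double bootstrap
set.seed(11)
x <- rgamma(44, shape = 2, scale = 150)       # hypothetical data
trim.sd <- function(v, B = 100)               # inner-bootstrap SD of the trimmed mean
  sd(replicate(B, mean(sample(v, replace = TRUE), trim = 0.05)))
theta.hat <- mean(x, trim = 0.05)
sd.hat <- trim.sd(x)
S <- replicate(1000, {                        # modest outer B; the text uses 10000
  xs <- sample(x, replace = TRUE)
  sd.hat / trim.sd(xs) * (mean(xs, trim = 0.05) - theta.hat)
})
theta.hat + quantile(S, c(0.025, 0.975))      # bootstrap-t style interval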
/* SAS PROGRAM 11.7 */
data original;
label x_i='CD4 counts';
input x_i @@;
dummy=1; /*define a dummy variable to merge on*/
cards;
data sampsize;
set original nobs=nobs;
n=nobs;
do b=1 to 10000; /*output the data B times*/
output;
end;
data resample;
set sampsize;by b;
retain sumc i; /*need to add the previous
count to the total */
if first.b then do;
sumc=0;
i=1;
end;
proc gchart;
vbar s;
Suppose the relationship

P\big(\phi(t(\hat F; \theta)) - \phi(t(F; \theta)) < x\big) = \psi(x)

holds for all possible F's, where ψ(x) is a continuous and symmetric function,
that is, ψ(x) = 1 − ψ(−x) is assumed continuous; then the lower bound =
\phi^{-1}\big(\phi(t(\hat F; \theta)) - z_{\alpha/2}\big), where z_{\alpha/2} = \psi^{-1}(\alpha/2) and φ(x) is a monotone increasing
transformation. As a simple example let φ(x) = x, t(\hat F; \theta) = \bar X, t(F; \theta) = \theta with
\psi(x) = \Phi(\sqrt{n}\,x/\sigma), where σ is a scale parameter. So, for the relationship above
to hold, the big assumption is symmetry. The bootstrap percentile interval
version of this framework is

P\big(\phi(t(\hat F^*; \theta)) - \phi(t(\hat F; \theta)) < x\big) = \psi^*(x),

where the key assumption is that the distribution of \phi(t(\hat F^*; \theta)) is symmetric given large samples,
such that the percentile interval holds approximately to a first-order approximation.
Using this framework, second-order finite sample corrections can be developed. Now consider

P\big(\phi(t(\hat F; \theta)) - \phi(t(F; \theta)) + z_0 < x\big) = \psi(x),

such that the lower bound takes the form

\phi^{-1}\big(\phi(t(\hat F; \theta)) - z_{\alpha/2} + z_0\big).
The bootstrap variant of this approach, termed the bootstrap bias-corrected
approach, starts with a calibration step where we write

P\big(\phi(t(\hat F^*; \theta)) - \phi(t(\hat F; \theta)) + z_0 < x\big) = \psi^*(x),
where z_0 = \psi^{-1}\big(K^*(t(\hat F; \theta))\big) and K^*(\cdot) is the distribution function of t(\hat F^*; \theta). If the distribution of t(\hat F^*; \theta) is symmetric about \phi(t(\hat F; \theta))
then z_0 = 0, and we have the same result as the standard percentile interval.
Calculating z_0 in practice is difficult since in general we do not know the
form of ψ(x). In applications we can consider approximating ψ(x) with Φ(x),
the standard normal distribution function. Then the estimator becomes
\hat z_0 = \Phi^{-1}\big(K^*(t(\hat F; \theta))\big). If the distribution of t(\hat F^*; \theta) is symmetric about \phi(t(\hat F; \theta))
then \hat z_0 \approx 0, that is, K^*(t(\hat F; \theta)) \approx 1/2. The confidence bounds for t(F; θ) then become

\text{lower} = \hat F^{*-1}_{t(\hat F^*; \theta)}\big[\Phi\big(\hat z_0 + \hat z_0 + \Phi^{-1}(\alpha/2)\big)\big], \quad \text{upper} = \hat F^{*-1}_{t(\hat F^*; \theta)}\big[\Phi\big(\hat z_0 + \hat z_0 + \Phi^{-1}(1 - \alpha/2)\big)\big],

i.e., the bias-corrected bounds given below with the acceleration constant a set to zero.
The example below calculates the 95% bias-corrected confidence interval for Tukey's
trimean:
data counts;
label pl_ct='platelet count';
input pl_ct @@;
cards;
1397 471 64 1571 663 1128 719 407
480 1147 362 2022 10 175 494 31
202 751 30 27 118 8 181 1432 31 18
105 23 8 70 12 504 252 0 100 4
545 226 390 230 28 5 0 176
;
proc iml;
use counts;
read all into data;
*reset print;
n=nrow(data);
brep=5000; * Number of bootstrap replications;
total=n*brep;
bootdata=j(total,3, 0);
do i=1 to total;
bootdata[i,3]=mod(i,n) + (mod(i,n)=0)*n;
* This is the order of each replicated data;
end;
subset=bootdata[loc(bootdata[,3]=int(n/4)+1|bootdata[,3]=int(n/2)+1
|bootdata[,3]=int(3*n/4)+1),1];
matrix1=I(brep);
matrix2={0.25 0.5 0.25};
matrix3=matrix1 @ matrix2;
tukey = matrix3 * subset;
*Calculate b;
subset = tukey[loc(tukey[,1] <= trimmean),];
p=nrow(subset)/brep;
b=quantile('NORMAL', p); * b=z_0;
CI=trimmean||p||b||Lower||Upper;
quit;
TABLE 11.1
The Estimated Confidence Interval for the Population Trimean Based on 5000
Replications

  Trimean     P         b            lower     upper
  223.5       0.4388    -0.154012    112.25    386.75
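The same bias-corrected calculation can also be sketched in R (our illustration; note that we compute the trimean from the quantile function, a slight variant of the order-statistic positions used in the IML code):

# R sketch: bias-corrected (BC) percentile interval for Tukey's trimean
pl <- c(1397, 471, 64, 1571, 663, 1128, 719, 407, 480, 1147, 362, 2022,
        10, 175, 494, 31, 202, 751, 30, 27, 118, 8, 181, 1432, 31, 18,
        105, 23, 8, 70, 12, 504, 252, 0, 100, 4, 545, 226, 390, 230,
        28, 5, 0, 176)
trimean <- function(v) sum(quantile(v, c(.25, .5, .75)) * c(.25, .5, .25))
set.seed(11)
tstar <- replicate(5000, trimean(sample(pl, replace = TRUE)))
z0 <- qnorm(mean(tstar <= trimean(pl)))          # bias-correction constant z_0
probs <- pnorm(2 * z0 + qnorm(c(0.025, 0.975)))  # adjusted percentile levels
quantile(tstar, probs)                           # 95% BC interval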
data sampsize;
set original nobs=nobs;
n=nobs;
do b=1 to 10000; /*output the data B times*/
output;
end;
data resample;
set sampsize;by b;
retain sumc i; /*need to add the previous
count to the total */
if first.b then do;
sumc=0;
i=1;
end;
p=1/(n-i+1); /*p is the probability of a "success"*/
%macro bc;
call symput('lll',lower);
call symput('uuu',upper);
%mend;
%bc;
proc print;
ods listing;
P\left(\frac{\phi(t(\hat F^*; \theta)) - \phi(t(\hat F; \theta))}{1 + a\,\phi(t(\hat F; \theta))} + z_0 < x\right) = \psi^*(x),

\text{lower} = \hat F^{*-1}_{t(\hat F^*; \theta)}\left[\Phi\left(\hat z_0 + \frac{\hat z_0 + \Phi^{-1}(\alpha/2)}{1 - \hat a\,(\hat z_0 + \Phi^{-1}(\alpha/2))}\right)\right],

\text{upper} = \hat F^{*-1}_{t(\hat F^*; \theta)}\left[\Phi\left(\hat z_0 + \frac{\hat z_0 + \Phi^{-1}(1 - \alpha/2)}{1 - \hat a\,(\hat z_0 + \Phi^{-1}(1 - \alpha/2))}\right)\right],

where \hat z_0 = \Phi^{-1}\big(K^*(t(\hat F; \theta))\big), \quad \hat a = \frac{\sum_{i=1}^{B} w_i^3}{6\big(\sum_{i=1}^{B} w_i^2\big)^{3/2}}, \quad \text{and} \quad w_i = t_i(\hat F^*; \theta) - \bar t(\hat F^*; \theta).

The example below repeats the trimean calculation with the acceleration constant included:
data counts;
label pl_ct='platelet count';
input pl_ct @@;
cards;
1397 471 64 1571 663 1128 719 407
480 1147 362 2022 10 175 494 31
202 751 30 27 118 8 181 1432 31 18
105 23 8 70 12 504 252 0 100 4
545 226 390 230 28 5 0 176
;
/*
data counts;
label pl_ct='platelet count';
input pl_ct @@;
cards;
230 222 179 191 103 293 316 520 143 226
225 255 169 204 99 107 280 226 143 259
;
*/
proc iml;
use counts;
read all into data;
*reset print;
n=nrow(data);
brep=500; * Number of bootstrap replications;
total=n*brep;
bootdata=j(total,3, 0);
do i=1 to total;
bootdata[i,3]=mod(i,n) + (mod(i,n)=0)*n;
* This is the order of each replicated data;
end;
* Calculate a;
call sort(data, {1});
col=(1:n)`;
data_perm=data||col;
do i=1 to n;
permut=data_perm[loc(data_perm[,2] ^= i),];
col2=(1:n-1)`;
permut=permut||col2;
a_vector=perm_a||col||col||col;
a_vector[,4] = a_vector[+,1]/n;
a_vector[,2] = (a_vector[,1] - a_vector[,4])##3;
a_vector[,3] = (a_vector[,1] - a_vector[,4])##2;
a=a_vector[+,2] /6/a_vector[+,3]##1.5; *a;
*Calculate b;
subset = tukey[loc(tukey[,1] <= trimmean),];
p=nrow(subset)/brep;
b=fuzz(quantile('NORMAL', p)); * b=z_0;
CI=trimmean||p||a||b||Lower||Upper;
colname={'trimmean' 'P' 'a' 'b' 'lower' 'upper'};
print CI[colname=colname];
quit;
data sampsize;
set original nobs=nobs;
n=nobs;
do b=1 to 10000; /*output the data B times*/
output;
end;
*Bootstrap sampling;
data resample;
set sampsize;by b;
retain sumc i; /*need to add the previous
count to the total */
%macro jackknife();
%do i=1 %to 44;
dm "log;clear;output;clear";
proc univariate data=temp trimmed=.05;by dummy;
var x_i;
where id ne &i;
ods output trimmedmeans=jack;
run;
run;
%end;
%mend;
%jackknife;
data jk;
merge jackknife summary;
by dummy;
num=( mean_trimean - mean )**3;
den=( mean_trimean - mean )**2;
keep mean mean_trimean num den;
run;
data bias;
merge bias summary;
ahat=sum_num/6/sum_den**1.5;
run;
%macro bc;
proc print;
run;
ods listing;
CALL :
boot.ci(boot.out = time.boot, type = "bca")
Intervals :
Level BCa
95% (253.1, 550.7 )
Calculations and Intervals on Original Scale
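Output in this format is produced by the boot package in R; a minimal sketch follows (the object name time.boot and the data are hypothetical, standing in for whatever resampled statistic generated the call above):

# R sketch: producing a BCa interval with the boot package
library(boot)
set.seed(11)
y <- rexp(30, rate = 1/400)                      # hypothetical data
time.boot <- boot(y, statistic = function(d, i) mean(d[i]), R = 2000)
boot.ci(boot.out = time.boot, type = "bca")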
\hat F_0(x) = \sum_{i=1}^{n} w_{0,i}\,I(X_i \le x), \quad w_{0,i} > 0, \quad i = 1, 2, \ldots, n. One approach for determining the weights is via the
Kullback–Leibler distance, e.g., see Efron (1981). Other distance measures
such as those based on entropy concepts may also be used. The weights
obtained using the Kullback–Leibler distance-based approach are identical
to those obtained via the EL method described in this entry. Once the w0, i ’s
are determined the null distribution can be estimated via Monte Carlo resa-
mpling B times from F̂0 . In rare cases the null distribution may be obtained
directly, e.g., when the parameter of interest is the population quantile given
as θ = F −1 (u). The approximate bootstrap p-value estimated via Monte Carlo
resampling is given as #(θˆ *0 ' s > θ0 ) / B, where θ̂*0 denotes the estimator
obtained from a bootstrap resample from \hat F_0. For the specific example where
\theta = t(F) = \int x\,dF is the mean, the weights w_{0,i} are chosen to minimize the
Kullback–Leibler distance subject to the null mean constraint \sum_{i=1}^{n} w_{0,i} X_i = \theta_0.
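As an illustration (ours, not the book's program) of computing such null-restricted weights for the mean, the standard empirical likelihood form w_{0,i} = 1/[n(1 + λ(X_i − θ_0))] can be solved for λ numerically in R:

# R sketch: EL/Kullback-Leibler weights under H0: mean = theta0
set.seed(1)
x <- rexp(40); theta0 <- 1.2                     # hypothetical data and null value
stopifnot(min(x) < theta0, theta0 < max(x))      # theta0 must lie inside the data range
g <- function(lam) sum((x - theta0) / (1 + lam * (x - theta0)))
n <- length(x)
lo <- (1/n - 1) / (max(x) - theta0)              # bounds keep all weights positive
hi <- (1/n - 1) / (min(x) - theta0)
lam <- uniroot(g, lower = lo + 1e-8, upper = hi - 1e-8)$root
w0 <- 1 / (n * (1 + lam * (x - theta0)))
c(sum(w0), sum(w0 * x))                          # checks: ~1 and ~theta0
# resampling from F0-hat: sample(x, replace = TRUE, prob = w0)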
12.1 Homework 1
1. Provide an example when the random sequence \xi_n \xrightarrow{p} \xi, but \xi_n \not\xrightarrow{a.s.} \xi.
2. Using Taylor’s theorem, show that exp(it) = cos(t) + i sin(t), where
i = −1 satisfies i2 = −1.
3. Find ii.
4. Find \int_0^{\infty} \exp(-sx)\sin(x)\,dx. (Hint: Use the rule \int v\,du = vu - \int u\,dv.)
5. Obtain the value of the integral \int_0^{\infty} \frac{\sin(t)}{t}\,dt.
6. Let the stopping time N(H) have the form N(H) = \inf\Big\{n: \sum_{i=1}^{n} X_i \ge H\Big\},
where X_1, X_2, \ldots are iid random variables with E(X_1) > 0. Show that
EN(H) = \sum_{j=1}^{\infty} P\{N(H) \ge j\}. Is the equation EN(H) = \sum_{j=1}^{\infty} P\Big\{\sum_{i=1}^{j} X_i < H\Big\}
correct? If not, what do we require to correct this equation?
7. Download and install the R software. Try to perform simple R pro-
cedures, e.g., hist, mean, etc.
12.2 Homework 2
1. Fisher information I(·):
Let X ~ N(μ, 1), where μ is unknown. Calculate I(μ).
Let X ~ N(0, σ²), where σ² is unknown. Calculate I(σ²).
Let X ~ Exp(λ), where λ is unknown. Calculate I(λ).
2. Let X_i, i \ge 1, be iid with EX_1 = 0. Show an example of a stopping
rule N such that E\Big(\sum_{i=1}^{N} X_i\Big) \ne 0.
3. Assume X_1, \ldots, X_n are independent, E(X_i) = 0, \mathrm{var}(X_i) = \sigma_i^2 < \infty,
E|X_i|^3 < \infty, i = 1, \ldots, n. Prove the central limit theorem (CLT) based
result regarding B_n^{-1}\sum_{i=1}^{n} X_i, where B_n = \Big(\sum_{i=1}^{n} \sigma_i^2\Big)^{1/2}, as n \to \infty. In this context,
what are the conditions on \sigma_i^2 that we need to assume in order for the
result to hold?
4. Suggest a method for using R to evaluate numerically (experimen-
tally) the theoretical result related to Question 3 above, employing
the Monte Carlo concept. Provide intuitive ideas and the relevant R
code.
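One possible sketch for Question 4 (our illustration; the heteroscedastic normal summands are assumed purely for concreteness):

# R sketch: Monte Carlo check of the CLT result in Question 3
set.seed(1)
n <- 200
sigma <- sqrt(1 + (1:n) %% 3)                # bounded, nonvanishing variances
Bn <- sqrt(sum(sigma^2))
z <- replicate(5000, sum(rnorm(n, 0, sigma)) / Bn)
hist(z, freq = FALSE, main = "Standardized sums vs N(0,1)")
curve(dnorm(x), add = TRUE)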
12.3 Homework 3
1. Let the random variables X_1, \ldots, X_n be dependent and denote the
sigma algebra \Im_n = \sigma\{X_1, \ldots, X_n\}. Assume that E(X_1) = \mu and
E(X_i|\Im_{i-1}) = \mu, i = 2, \ldots, n, where μ is a constant. Please find E\Big(\prod_{i=1}^{n} X_i\Big).
2. The likelihood ratio (LR) with a corresponding sigma algebra is
an H0-martingale and an H1-submartingale. Please evaluate in
this context the statistic log (LRn) based on dependent data points
X 1 ,… , X n, when the hypothesis H0 says X 1 ,… , X n ~ f0 versus H1 :
X 1 ,… , X n ~ f1, where f0 and f1 are joint density functions.
3. Assume (X_n, \Im_n) is a martingale. What are (|X_n|, \Im_n) and (X_n^2, \Im_n)?
4. Using the Monte Carlo method, calculate numerically the value of
J = \int_{-\infty}^{\infty} \exp(-|u|^{1.7})\,u^3 \sin(u)\,du. Compare the obtained result with a
result that can be obtained using the R function integrate. Find
approximately the sample size N such that the Monte Carlo estimator
J_N of J satisfies \Pr\{|J_N - J| > 0.001\} \cong 0.05.
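A possible sketch for this item (ours; the uniform sampling window is an assumption justified by the rapid decay of the integrand):

# R sketch: Monte Carlo vs. deterministic evaluation of J
f <- function(u) exp(-abs(u)^1.7) * u^3 * sin(u)
J.int <- integrate(f, -Inf, Inf)$value       # deterministic benchmark
set.seed(1)
N <- 1e6
u <- runif(N, -10, 10)                       # integrand is negligible beyond +/- 10
J.mc <- mean(f(u)) * 20                      # width of the sampling interval
c(integrate = J.int, monte.carlo = J.mc)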
5. Consider J = \int_{-4}^{4}\int_{-\infty}^{\infty} \exp(-|u|^{1.7} - t^2)\,u^3\sin(u\cos(t))\,dt\,du, approximating this integral
via the Monte Carlo estimator J_N. Plot values of |J_N - J| against
ten different values of N (e.g., N = 100, 250, 350, 500, 600, etc.), where
N is the size of a generated sample that you employ to compute J_N in
the Monte Carlo manner.
12.4 Homework 4
The Wald martingale.
3.4 Using Monte Carlo generated data (set the true μ = 0.5, β = 0.7),
test for the normality of the distributions of the maximum likelihood
estimators obtained above, for n = 5, 7, 10, 12, 25, 50. Present
your conclusions regarding this Monte Carlo type study.
Y_i = n^{-1}\Big(\sum_{j=1:\,j\ne i}^{n} X_j^2 - (n-1)^{-1}\Big(\sum_{j=1:\,j\ne i}^{n} X_j\Big)^2\Big)

and consider the sample correlation between X_i - (n-1)^{-1}\sum_{j=1:\,j\ne i}^{n} X_j and Y_i, or between
X_i - (n-1)^{-1}\sum_{j=1:\,j\ne i}^{n} X_j and Y_i^{1/3}.
The H 0 -distribution of the test statistic proposed by Lin and Mudholkar
(1980) does not depend on (μ , σ 2 ). Then the corresponding critical values of
the test can be tabulated using the Monte Carlo method.
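For instance, the tabulation might be sketched in R as follows (our reconstruction of the statistic from the display above; location and scale can be fixed at 0 and 1 since the H_0-distribution does not depend on them):

# R sketch: Monte Carlo critical values for the Lin-Mudholkar type statistic
lm.stat <- function(x) {
  n <- length(x)
  yi <- sapply(1:n, function(i) (sum(x[-i]^2) - sum(x[-i])^2/(n - 1)) / n)
  xi <- sapply(1:n, function(i) x[i] - mean(x[-i]))
  cor(xi, yi^(1/3))                          # correlation with cube-root leave-one-out variances
}
set.seed(1)
null.stats <- replicate(5000, lm.stat(rnorm(25)))   # H0: normal data, n = 25
quantile(null.stats, c(0.025, 0.975))               # two-sided 5% critical values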
12.5 Homework 5
Two-sample nonparametric testing.
n = m = 10, X ~ N (0, 1), Y ~ Unif [−1, 1]; n = m = 15, X ~ N (0, 1), Y ~ Unif [−1, 1];
n = m = 40, X ~ N (0, 1), Y ~ Unif [−1, 1]; n = 10, m = 30, X ~ N (0, 1), Y ~ Unif [−1, 1];
n = 30, m = 10, X ~ N (0, 1), Y ~ Unif [−1, 1]; n = m = 10, X ~ Gamma(1, 2), Y ~ Unif [0, 5]
           β = 0.1    β = 0.5    β = 1
n = 10
n = 20
n = 50
n = 100
\Pr_{F_X = F_Y}\{G_{nm} \ge C_{0.05}\} = \int I\left\{\left|\frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} I\{X_i \ge Y_j\} - \frac{1}{2}\right| \ge C_{0.05}\right\} d\Psi_{nm},

where the distribution \Psi_{nm} does not depend on F_X = F_Y. Then the test statistic is exact, meaning
that its distribution is independent of the underlying data distribution, under
the hypothesis F_X = F_Y. Regarding Question 3.4, we suggest application of the
Monte Carlo approach. One can also use the distribution function of the log-
likelihood ratio \sum_{i=1}^{n} \xi_i, where \xi_i = \log\Big(\frac{f_{H_1}(X_i)}{f_{H_0}(X_i)}\Big), i = 1, \ldots, n, with f_{H_1} and f_{H_0},
which are the density functions related to N (0,1) and N (0.5,1) , respectively.
Alternatively, defining the test threshold as C, we note that

\Pr_{H_0}\left\{\frac{\sum_{i=1}^{n}\xi_i - nE_{H_0}(\xi_1)}{n^{1/2}(\mathrm{Var}_{H_0}(\xi_1))^{1/2}} \ge \frac{C - nE_{H_0}(\xi_1)}{n^{1/2}(\mathrm{Var}_{H_0}(\xi_1))^{1/2}}\right\} = 0.05 \quad \text{and}

\Pr_{H_1}\left\{\frac{\sum_{i=1}^{n}\xi_i - nE_{H_1}(\xi_1)}{n^{1/2}(\mathrm{Var}_{H_1}(\xi_1))^{1/2}} \ge \frac{C - nE_{H_1}(\xi_1)}{n^{1/2}(\mathrm{Var}_{H_1}(\xi_1))^{1/2}}\right\} = 0.85.
Then, using the CLT, we can derive n and C by solving these equations via the
standard normal distribution function \Phi(u) = (2\pi)^{-1/2}\int_{-\infty}^{u} e^{-z^2/2}\,dz. It can be suggested to compare these
three approaches.
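For the CLT-based approach, n and C can be obtained in closed form; a sketch follows (taking, for concreteness, ξ_i built from N(0,1) under H_0 and N(0.5,1) under H_1, for which E_{H_0}(ξ_1) = −0.125, E_{H_1}(ξ_1) = 0.125, and Var(ξ_1) = 0.25 under either hypothesis):

# R sketch: solving the two CLT equations for n and C
mu0 <- -0.125; mu1 <- 0.125; s <- 0.5            # moments of xi under H0 and H1
z95 <- qnorm(0.95); z15 <- qnorm(0.15)
n <- ceiling((s * (z95 - z15) / (mu1 - mu0))^2)  # equate the two expressions for C
C <- n * mu0 + s * sqrt(n) * z95
c(n = n, C = C)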
12.6 Homework 6
2\log\left(\frac{\prod_{MI=0} f\big(h(\mathrm{TBARS}, \hat\lambda_0)\big)\,\prod_{MI=1} f\big(h(\mathrm{TBARS}, \hat\lambda_1)\big)}{\prod_{MI=0,1} f\big(h(\mathrm{TBARS}, \hat\lambda)\big)}\right)
G<-function(λ){...}, a function that computes

\hat\mu = \sum_{i=1}^{n} h(X_i, \lambda)/n \quad \text{and} \quad \hat\sigma^2 = \sum_{i=1}^{n}\big(h(X_i, \lambda) - \hat\mu\big)^2/n.
12.7 Homework 7
Let us perform the following Jackknife-type procedure.
Denote TBARS measurements related to MI=1 as Data A with the sample
size N A and TBARS measurements related to MI=0 as Data B with the sample
size N B.
Example 1:
11 AM–12:20 PM
This exam has 4 questions, some of which have subparts. Each ques-
tion has an indicated point value. The total points for the exam are 100
points.
2.1 Write the maximum likelihood ratio MLn and formulate the maxi-
mum likelihood ratio test. Write the corresponding Bayes factor type
likelihood ratio, BLn , test statistic. Show an optimality of BLn in the
context of “most powerful decision rules” and interpret the
optimality.
2.2 Derive the asymptotic distribution function of the statistic MLn,
under H0. How can this asymptotic result about the MLR test be
used in practice?
Example 2:
11 AM–12:20 PM
This exam has 4 questions, some of which have subparts. Each question
has an indicated point value. The total points for the exam are 100 points.
1.2 Let X_1, \ldots, X_n be iid with EX_1 = \mu (E|X_1|^{1+\varepsilon} < \infty, \varepsilon > 0, is not required).
Prove that (X_1 + \ldots + X_n)/n \to \mu as n \to \infty.
Example 3:
11 AM–12:20 PM
This exam has 4 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
3.1 Write the Bayes factor type likelihood ratio BLRn; Show the optimal-
ity of BLRn in the context of “most powerful decision rules.”
3.2 Find the asymptotic ( n → ∞ ) distribution function of the statistic
MLRn, the maximum likelihood ratio, under H0.
Example 4:
11 AM–12:20 PM
This exam has 3 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
has a form of the density function of \sum_{i=1}^{d} (\varepsilon_i)^2, where the integer parameter
d is unknown and d \le 10.
2.1 Write the maximum likelihood ratio (MLR) statistic, MLRn, and
formulate the MLR test for H 0 : EX 1 = θ0 versus H 1 : EX 1 ≠ θ0, where
θ0 is known. (Hint: Use characteristic functions to derive the form
of f .)
2.2 Propose an integrated most powerful test for H 0 : EX 1 = θ0 versus
H 1 : EX 1 ≠ θ0 (θ0 is known), provided that we are interested in obtain-
ing the maximum integrated power of the test when values of the
alternative parameter can have all possible values with no prefer-
ence. (Hint: Use a Bayes factor type procedure.) Formally prove that
your test is integrated most powerful.
Example 5:
11 AM–12:20 PM
This exam has 4 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
Example 6:
11 AM–12:20 PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
Question 2 (8 points)
Let the statistic LRn define the likelihood ratio to be applied to test for H 0
versus H_1. Prove that f^{LR}_{H_1}(u)/f^{LR}_{H_0}(u) = u, where f^{LR}_{H} denotes the density function of LR_n under H = H_0 or H = H_1, respectively.
a density function f that has a form of the density function of \sum_{i=1}^{d} (\varepsilon_i)^2,
where the integer parameter d is unknown and d \le 10.
3.1 Write the maximum likelihood ratio (MLR) statistic, MLRn, and
formulate the MLR test for H 0 : EX 1 = θ0 versus H 1 : EX 1 ≠ θ0, where
θ0 is known. (Hint: Use characteristic functions to derive the form
of f .)
3.2 Propose, giving the corresponding proof, the integrated most pow-
erful test for H 0 : EX 1 = θ0 versus H 1 : EX 1 ≠ θ0 (here θ0 is known),
provided that we are interested in obtaining the maximum integrated
power of the test when values of the alternative parameter can have
all possible values with no preference.
⎧ n ⎫
⎪
τ = inf ⎨n > 0 :
⎪⎩
∑ ⎪
X i ≥ n − n0.7 ⎬ .
⎪⎭
i =1
let N → ∞.)
Question 5 (9 points)

\Pr(\tau \ge N) = \Pr\left(\max_{n \le N}\Big(\sum_{i=1}^{n} X_i - n + n^{0.7}\Big) < 0\right) \le \Pr\left(\sum_{i=1}^{N}(1 - X_i) > N^{0.7}\right)

\le \frac{E\Big(\sum_{i=1}^{N}(1 - X_i)\Big)^2}{N^{1.4}} = \frac{N}{N^{1.4}} \xrightarrow{N \to \infty} 0.
Example 7:
11 AM–12:20 PM
This exam has 4 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \cos(tx)\,\varphi(t)\,dt,

L(n) = A\lambda^{4} n + E\big(\hat\lambda_n - \lambda\big)^2.
4.2.2 Show a non-asymptotic upper bound of E\Big(\sum_{i=1}^{N} X_i\Big), where N is
the length (the number of needed observations) of the procedure
defined in the context of Question 4.2.1. The obtained form of
the upper bound should be a direct function of A and λ. Should
you use propositions, these propositions must be proven.
Comments. Regarding Question 1.2, we have \varphi(t) = E(e^{it\xi}) = E\{\cos(t\xi)\}. Then
the representation above follows, where \int_{-\infty}^{\infty} \sin(tx)\cos(t\xi)\,dt = 0, since \sin(-tx)\cos(-t\xi) = -\sin(tx)\cos(t\xi).
Example 8:
11 AM–12:20 PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
Comments. Regarding Question 1.2, we have E exp (it (X − Y)) = E exp (it (X + Y)).
Then E exp (it (−Y)) = E exp (it (Y)) . Thus E (cos (tY) − i sin (tY)) = E (cos (tY) + i sin (tY)).
This implies E (sin (tY)) = 0 .
Example 9:
11 AM–12:20 PM
This exam has 6 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
Question 2 (7 points)
See Question 2, Section 13.1, Example 6.
Question 3 (9 points)
Assume that a random variable ξ has a gamma distribution with density
function given as
f(u; \alpha, \gamma) = \begin{cases} \alpha^{\gamma} u^{\gamma-1} e^{-\alpha u}/\Gamma(\gamma), & u \ge 0 \\ 0, & u < 0, \end{cases}

where \Gamma(\gamma) = \int_0^{\infty} u^{\gamma-1} e^{-u}\,du. Derive an explicit form of the characteristic function of ξ. (Hint:
The approach is similar to that used for normally distributed random variables.)
Question 6 (9 points)
See Question 5, Section 13.1, Example 6.
Example 1:
11:45 AM–02:45 PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
X_i = \mu + \varepsilon_i, \quad \varepsilon_i = \beta\varepsilon_{i-1} + \xi_i, \quad \xi_0 = 0, \quad i = 1, \ldots, n,

where \Pr\{\xi_1 < u\} = \int_{-\infty}^{u} (2\pi\sigma^2)^{-1/2}\exp(-z^2/(2\sigma^2))\,dz; \mu, \beta, \sigma are parameters. Suppose we
are interested in the maximum likelihood ratio test for
H_0: \mu = 0, \beta = 0, \sigma = 1 versus H_1: \mu \ne 0, \beta \ne 0, \sigma = 1. Write the maximum likelihood ratio test in an analytical closed (as possible) form.
Provide a schematic algorithm to evaluate the power of the test based on a Monte Carlo study.
Example 2:
03:30 PM–05:00 PM
This exam has 4 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
X_i = \mu + \varepsilon_i, \quad \varepsilon_i = \beta\varepsilon_{i-1} + \xi_i, \quad \varepsilon_0 = 0, \quad i = 1, \ldots, n,

where \Pr\{\xi_1 < u\} = (2\pi)^{-1/2}\int_{-\infty}^{u} \exp(-z^2/2)\,dz; \mu, \beta are parameters.
Example 3:
04:30 PM–07:30 PM
This exam has 7 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
Question 2 (8 points)
Formulate and prove the central limit theorem related to the stopping time
N(H) = \inf\Big\{n: \sum_{i=1}^{n} X_i \ge H\Big\}, where X_1 > 0, X_2 > 0, \ldots are iid, EX_1 = a, \mathrm{Var}(X_1) = \sigma^2 < \infty.
3.1 Write the maximum likelihood ratio MLRn and formulate the MLR
test. Write the Bayes factor type likelihood ratio BLRn. Show an
optimality of BLRn in the context of “most powerful decision rules”
and interpret the optimality.
Question 4 (5 points)
What are Martingale, Submartingale and Supermartingale? Consider
the test statistics MLRn and BLRn from Question 3. Under H 0 , are these
statistics Martingale, Submartingale, or Supermartingale? Justify your
response.
P\Big\{\sum_{i=1}^{n} X_i \ge \frac{an}{m^{0.5}} + d\,m^{0.5} \text{ for some } n \ge 1\Big\} \le \exp\{-2ad\}, \quad d = \frac{\log(\varepsilon)}{2a},

P\Big\{\sum_{i=1}^{n} X_i \ge m^{0.5} A\Big(\frac{n}{m}, \varepsilon\Big) \text{ for some } n \ge 1\Big\} \le \frac{1}{\varepsilon}

(A(\cdot, \cdot) is a function that you should specify).
Example 4:
11 AM–12:20 PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\,E\big(e^{itX}\big)\,dt, \quad i^2 = -1, \quad i = \sqrt{-1}.
Example 5:
11:45 AM–1:05 PM
This exam has 6 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
2.1 Write the maximum likelihood ratio MLRn and formulate the MLR
test. Find the asymptotic ( n → ∞ ) distribution function of the statis-
tic 2 log MLRn, under H 0 . Show how this asymptotic result can be
used for the MLR test application.
Example 6:
11:45 AM–1:05 PM
This exam has 7 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
3.1 Let X 1 have a density function f with a known form that depends on
θ. When θ1 is assumed to be known, write the most powerful test
statistic (the relevant proof should be shown).
3.2 Assuming θ1 is unknown, write the maximum likelihood ratio MLRn
and formulate the MLR test based on X 1 ,… , X n. Find the asymptotic
( n → ∞ ) distribution function of the statistic 2 log MLRn, under H 0 .
3.3 Assume f and θ1 are unknown. Write the empirical likelihood ratio
test statistic ELRn. Find the asymptotic ( n → ∞ ) distribution function
of the statistic 2\log ELR_n, under H_0, provided that E|X_1|^3 < \infty.
Question 5 (9 points)
See Question 4.3, Section 13.1, Example 2.
Question 6 (9 points)
If it is possible, formulate a test with power 1. Relevant proofs should be
shown.
Question 7 (9 points)
See Question 4, Section 13.2, Example 2.
Example 7:
11:00 AM–12:20 PM
This exam has 7 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
a density function f that has a form of the density function of \sum_{i=1}^{d} (\varepsilon_i)^2,
where the integer parameter d is unknown and d \le 10. Propose, giving the
corresponding proof, the integrated most powerful test for H_0: EX_1 = \theta_0 versus H_1: EX_1 \ne \theta_0.
Question 5 (9 points)
Formulate and prove the Doob theorem (inequality).
Example 8:
11:00 AM–12:20 PM
This exam has 9 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
Question 2 (9 points)
Assume that X 1 ,… , X n are iid random variables and
X 1 ~ Exp(1), Sn = (X 1 + … + X n) . Let N define an integer random variable
that satisfies Pr {N < ∞} = 1, {N ≥ n} ∈σ (X 1 ,… , X n) , where σ (X 1 ,… , X n)
is the σ-algebra based on (X 1 ,… , X n). Calculate a value of
E exp (i tSN + N log(1 − i t)) , i2 = −1 , where t is a real number. Proofs of the
theorems you use to solve this problem do not need to be shown.
Question 3 (9 points)
Biomarker levels were measured from disease and healthy populations, pro-
viding the iid observations X 1 = 0.39, X 2 = 1.97, X 3 = 1.03, X 4 = 0.16 that are
assumed to be from a normal distribution as well as iid observations
Y1 = 0.42, Y2 = 0.29, Y3 = 0.56, Y4 = −0.68, Y5 = −0.54 that are assumed to be from
a normal distribution, respectively.
Define the receiver operating characteristic (ROC) curve and the area
under the curve (AUC).
Obtain a formal notation of the AUC.
Estimate the AUC.
What can you conclude regarding the discriminating ability of the bio-
marker with respect to the disease?
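A hedged R sketch of the estimation steps (the nonparametric Mann–Whitney form of the AUC estimator is assumed; the disease/healthy labeling follows the question statement):

# R sketch: empirical AUC for the biomarker data above
x <- c(0.39, 1.97, 1.03, 0.16)                    # disease group
y <- c(0.42, 0.29, 0.56, -0.68, -0.54)            # healthy group
mean(outer(x, y, ">") + 0.5 * outer(x, y, "=="))  # (1/nm) sum I(X_i > Y_j), ties get 1/2

This returns 0.75 for the data shown, suggesting moderate discriminating ability.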
Question 8 (9 points)
See Question 4.3, Section 13.1, Example 2.
Question 9 (9 points)
Confidence sequences for the median. See Question 5, Section 13.2, Example 4.
\varphi(t) = E\big(e^{it\xi}\big) = \int_{-\infty}^{\infty} e^{itu}\,dF(u) = \int_{\infty}^{-\infty} e^{-itu}\,dF(-u) = \int_{\infty}^{-\infty} e^{-itu}\,d(1 - F(u)) = \int_{-\infty}^{\infty} e^{-itu}\,dF(u) = E\big(e^{-it\xi}\big).
Thus

E\{\cos(t\xi)\} + iE\{\sin(t\xi)\} = E\{\cos(t\xi)\} - iE\{\sin(t\xi)\},

where \cos(tu) = \cos(-tu) and \sin(-tu) = -\sin(tu). This leads to E\{\sin(t\xi)\} = 0
and then

\varphi(t) = E\big(e^{it\xi}\big) = E\{\cos(t\xi)\}.
E\{\exp(itS_n + n\log(1 - it))|\sigma(X_1, \ldots, X_{n-1})\} = \exp(itS_{n-1} + n\log(1 - it))\,\varphi(t),

where \varphi(t) = E\,e^{itX_1} = (1 - it)^{-1} for X_1 \sim Exp(1), so that

E\{\exp(itS_n + n\log(1 - it))|\sigma(X_1, \ldots, X_{n-1})\} = \exp\big(itS_{n-1} + (n - 1)\log(1 - it)\big).

Then \{\exp(itS_n + n\log(1 - it)), \sigma(X_1, \ldots, X_n)\} is a martingale and we can apply
the optional stopping theorem.
Example 9:
11:45 AM–1:10 PM
This exam has 8 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
Question 5 (8 points)
See Question 4.3, Section 13.1, Example 2.
Question 6 (9 points)
If it is possible, formulate a test with power one. Relevant proofs should be
shown.
Example 1:
The PhD theory qualifying exam, Part I
8:30 AM–12:30 PM
This exam has 4 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
X_i = \mu + \varepsilon_i, \quad \varepsilon_i = \beta\varepsilon_{i-1} + \xi_i, \quad \xi_0 = 0, \quad i = 1, \ldots, n,

where \xi_1, \xi_2, \ldots are iid with the distribution function (2\pi)^{-1/2}\int_{-\infty}^{u} \exp(-z^2/2)\,dz. Suppose we
are interested in the maximum likelihood ratio test for H 0 : μ = −1
and ε i , i = 1,… n, are independent versus H 1 : μ = 0 and εi , i = 1,… n,
are dependent. Then complete the following:
Write the relevant test in an analytical closed form.
Provide a schematic algorithm to evaluate the power of the test
based on a Monte Carlo study.
3.3 See Question 4, Section 13.1, Example 1.
Example 2:
The PhD theory qualifying exam, Part I
8:30 AM–12:30 PM
This exam has 8 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
1.3 Prove the following theorem (the inversion formula). Let y > x be real
arguments of the distribution function F(u) = \Pr\{X \le u\}, where F(\cdot) is
a continuous function at the points x and y. Then

F(y) - F(x) = \frac{1}{2\pi}\lim_{\sigma \to 0}\int_{-\infty}^{\infty} \frac{\exp(-itx) - \exp(-ity)}{it}\,\varphi(t)\exp\Big(-\frac{t^2\sigma^2}{2}\Big)\,dt.
in a form of the density function of \sum_{i=1}^{d} (\varepsilon_i)^2, where the integer parameter d
is unknown and d \le 10.
2.1 Write the maximum likelihood ratio (MLR) statistic, MLn, and for-
mulate the MLR test for H 0 : EX 1 = θ0 versus H 1 : EX 1 ≠ θ0, where θ0 is
known. (Hint: Use characteristic functions to derive the form of f .)
2.2 Propose, providing a proof, the integrated most powerful test for
H 0 : EX 1 = θ0 versus H 1 : EX 1 ≠ θ0 (here θ0 is known), provided that
we are interested in obtaining the maximum integrated power of
the test when values of the alternative parameter can have all pos-
sible values with no preference. (Hint: Use a Bayes factor type
procedure.)
2.3 Assume the sample size n is not large. How can we apply the tests
from Questions 2.1 and 2.2, in practice, without analytical evalua-
tions of distribution functions of the test statistics? (i.e., assuming
that you have a data set, how do you execute the tests?) Can you sug-
gest computational algorithms? If yes, write the relevant algorithms.
(Justify all answers.)
Question 3 (8 points)
Assume we observe iid data points X 1 ,… , X n that follow a joint density
function f ( x1 ,… , xn ; θ), where θ is an unknown parameter. Suppose we
want to test for the hypothesis H 0 : θ = θ0 versus H 1 : θ = θ1 , where θ0 , θ1 are
known. In this case, propose (and give the relevant proof) the most powerful test. Examine theoretically the asymptotic distributions of \log(LR_n)/n
(here LR_n is the test statistic proposed by you) under H_0 and H_1, respectively.
Show that \log(LR_n)/n \xrightarrow{p} a as n \to \infty, where the constant a \le 0 under H_0 and a \ge 0
under H_1. (Hint: Use the central limit theorem, the proof of which should
be presented.)
Question 5 (7 points)
See Question 4, Section 13.1, Example 1.
Question 6 (8 points)
Assume the measurement model with autoregressive errors in the form of
X i = μ + εi , εi = β εi −1 + ξi , ξ0 = 0, i = 1,… , n,
where \Pr\{\xi_1 < u\} = (2\pi)^{-1/2}\int_{-\infty}^{u} \exp(-z^2/2)\,dz. We are interested in the
maximum likelihood ratio test for H_0 versus H_1.
Write the relevant test in an analytical closed form (here needed opera-
tions with joint densities and obtained forms of the relevant likelihoods
should be explained).
Is the obtained test statistic, say L, a martingale/submartingale/supermartingale, under H_0/H_1?
Is log(L) a martingale/submartingale/supermartingale, under H 0 ?
(The proofs should be shown.)
Question 7 (8 points)
See Question 3.3, Section 13.2, Example 3.
Question 8 (7 points)
Let the stopping time have the form of N(H) = \inf\Big\{n: \sum_{i=1}^{n} x_i \ge H\Big\}, where the x_i
are iid random variables with Ex_1 > 0.
8.1 Show that EN(H) = \sum_{j=1}^{\infty} P\{N(H) \ge j\}. Is EN(H) = \sum_{j=1}^{\infty} P\Big\{\sum_{i=1}^{j} x_i < H\Big\}? If
not, what do we need to correct this equation?
8.2 Formulate (no proof is required) the central limit theorem related to
the stopping time N(H) = \inf\Big\{n: \sum_{i=1}^{n} X_i \ge H\Big\}, where X_i > 0, i > 0,
EX_1 = a > 0, \mathrm{Var}(X_1) = \sigma^2 < \infty, and H \to \infty. (Here the asymptotic distribution of N(H) should be shown.)
Example 3:
The PhD theory qualifying exam, Part I
8:30AM–12:30PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
2.1 Find, showing the proof, the asymptotic distribution of \sum_{i=1}^{n} X_i/n^{1/2}
as n \to \infty, where the random variables X_1, \ldots, X_n are iid (independent
identically distributed) with EX_1 = 0, \mathrm{var}(X_1) = \sigma^2.
2.2 See Question 2, Section 13.1, Example 3.
2.3 Consider the stopping time N(H) = \inf\Big\{n: \sum_{i=1}^{n} X_i \ge H\Big\}, where
X_1 > 0, \ldots, X_n > 0 are iid, EX_1 = a < \infty, \mathrm{Var}(X_1) = \sigma^2 < \infty. Find, showing
a proof, an asymptotic (H \to \infty) distribution of N(H).
2.4 Assume we observe iid data points X 1 ,… , X n that are from a known
joint density function f ( x1 ,… , xn ; θ), where θ is an unknown param-
eter. We want to test for the hypothesis H 0 : θ = θ0 versus H 1 : θ = θ1,
where θ0 , θ1 are known. In this case, propose (and give the relevant
proof) the most powerful test statistic, say, LRn. Prove the asymptotic
consistency of the proposed test: \log(LR_n)/n \xrightarrow{p} a as n \to \infty, where the constant a \le 0 under H_0 and a \ge 0 under H_1, just using that

\int_{-\infty}^{\infty} \log\Big(\frac{f_{H_1}(u)}{f_{H_0}(u)}\Big) f_{H_0}(u)\,du < \infty \quad \text{and} \quad \int_{-\infty}^{\infty} \log\Big(\frac{f_{H_1}(u)}{f_{H_0}(u)}\Big) f_{H_1}(u)\,du < \infty

(but without the constraint \int_{-\infty}^{\infty} \Big|\log\Big(\frac{f_{H_1}(u)}{f_{H_0}(u)}\Big)\Big|^{1+\varepsilon} f_{H_j}(u)\,du < \infty, j = 0, 1, \varepsilon > 0).
when \sum_{i=1}^{n}(Z_i + r_i) \ge 100 the first time. When you stop, what will be
your total expected benefit from the game (i.e., E\big(\sum_{i} Z_i\big))?
Question 4 (7 points)
See Question 4, Section 13.1, Example 1.
Comments. Regarding Question 3.2, we note that, defining the random variable \tau = \inf\Big\{n \ge 1: \sum_{i=1}^{n}(Z_i + r_i) \ge 100\Big\}, since (Z_i + r_i) > 0, we have, for a fixed N,

\Pr\{\tau > N\} = \Pr\Big\{\sum_{i=1}^{N}(Z_i + r_i) < 100\Big\} = \Pr\Big\{\exp\Big(-\frac{1}{100}\sum_{i=1}^{N}(Z_i + r_i)\Big) > e^{-1}\Big\}

\le e\,E\exp\Big(-\frac{1}{100}\sum_{i=1}^{N}(Z_i + r_i)\Big) = e\Big(E\exp\Big(-\frac{1}{100}(Z_1 + r_1)\Big)\Big)^{N}.
Example 4:
The PhD theory qualifying exam, Part I
8:30 AM–12:30 PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
2.1 Let X 1 have a density function f with a known form that depends on
θ. When θ1 is assumed to be known, propose the most powerful test
statistic (a relevant proof should be shown).
Is the couple (the corresponding test statistic and the sigma algebra
σ {X 1 ,… , X n}) a martingale, submartingale, or supermartingale,
under H 0 /under H 1? (Justify your answer.)
Show (prove) the asymptotic distributions of n^{-1/2}\log of the
test statistic, under H_0 and H_1, respectively. (Should you use a theorem, show a relevant proof of the theorem.)
How can one use the asymptotic result obtained under H 0 to per-
form the test in practice? If possible, suggest an alternative method
to evaluate the H 0 -distribution of the test statistic.
2.2 Write the maximum likelihood ratio MLRn and formulate the MLR
test based on X 1 ,… , X n. Find the asymptotic ( n → ∞ ) distribution
function of the statistic 2 log( MLRn ), under H 0 .
Is the couple (the corresponding test statistic and sigma algebra
σ {X 1 ,… , X n}) a martingale, submartingale or supermartingale under
H 0 ? (Justify your answer.)
Illustrate how one can use this asymptotic result to perform the test
in practice? Propose (if it is possible) an alternative to this use of the
H 0 -asymptotic distribution of the test statistic.
2.3 Assume f , the density function of X 1 ,… , X n, and θ1 are unknown.
Formulate the empirical likelihood ratio test statistic ELRn. Find the
asymptotic (n → ∞ ) distribution function of the statistic 2 log(ELRn ),
under H_0, provided that E|X_1|^3 < \infty. How can one use this asymptotic result in practice?
Xi = μ + εi , ε i = β ε i−1 + ξi , ε 0 = 0, X 0 = 0, i = 1,… , n,
What are

E\Big(\sum_{i=1}^{\tau+1} X_i\Big), \quad E\Big(\sum_{i=1}^{\tau/3} X_i\Big), \quad E\Big(\sum_{i:\,i \le X_1 + X_{25}} X_i\Big), \quad E\Big(\sum_{i=1}^{\nu_1} X_i\Big), \quad E\Big(\sum_{i=1}^{\nu_2+\tau} X_i\Big), \quad E\Big(\sum_{i=1}^{\nu_2-\tau} X_i\Big),

where \nu_1 = \min\{n: X_n > 2n\} and \nu_2 = \min\Big\{n: \sum_{i=1}^{n} X_i > 10000\Big\}? Justify your
answers.
4.4 See Question 4.3, Section 13.3, Example 2.
4.5 See Question 2, Section 13.2, Example 3.
4.6 See Question 4, Section 13.1, Example 1.
Example 5:
The PhD theory qualifying exam, Part I
8:30 AM–12:30 PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
Question 1 (8 points)
1.1 Assume we observe X_1, X_2, X_3, \ldots, which are independent identically
distributed (iid) data points with EX_1 = \mu and an unknown variance.
Suppose \mu is estimated using \bar X_N = \sum_{i=1}^{N} X_i/N. Propose a procedure to estimate \mu, minimizing the risk function L(n) = An + E\big(\bar X_n - \mu\big)^2 with respect to n.
1.2 Biomarker levels were measured from disease and healthy populations,
providing iid observations X 1 = 0.39, X 2 = 1.97, X 3 = 1.03, X 4 = 0.16
that are assumed to be from a continuous distribution as well as iid
observations Y1 = 0.42, Y2 = 0.29, Y3 = 0.56, Y4 = −0.68, Y5 = −0.54 that
are assumed to be from a continuous distribution, respectively. Define
the receiver operating characteristic (ROC) curve and the area under
the curve (AUC). Estimate nonparametrically the AUC. What can you
conclude regarding the discriminating ability of the biomarker with
respect to the disease?
3.2 Assume we have data points X 1 ,… , X n that are iid observations and
EX 1 = θ. Suppose we would like to test for the hypothesis H 0 : θ = θ0.
Write the maximum likelihood ratio test statistic MLRn and formu-
late the MLR test based on X 1 ,… , X n. Find the asymptotic ( n → ∞ )
distribution function of the statistic 2 log( MLRn ), under H 0 .
Is the couple (the test statistic and sigma algebra σ {X 1 ,… , X n}) a martin-
gale, submartingale, or supermartingale, under H 0 ? (Justify your answer.)
Illustrate how to apply this asymptotic result in practice.
3.3 Suppose that we would like to test that iid observations X 1 ,… , X n are
from the density function f0 ( x) versus X 1 ,… , X n ~ f1 ( x|θ), where the
parameter θ > 0 is unknown. Assume that f1 ( x|θ) = f0 ( x)exp (θx − φ(θ)),
where φ is an unknown function. Write the most powerful test sta-
tistic and provide the corresponding proof.
where \nu_1 = \min\Big\{n \ge 1: \sum_{i=1}^{n} X_i > 2n^2(n+1)\Big\} and \nu_2 = \min\Big\{n: \sum_{i=1}^{n} X_i > 10000\Big\}?
where X_1 > 0, \ldots, X_n > 0, \ldots are iid random variables with EX_1 < \infty and

N(t) = \min\Big\{n: \sum_{i=1}^{n} X_i > t\Big\}.
Comments. Regarding Question 2.2, we can use that exp(itx) = cos(tx) + i sin(tx)
and the proof scheme related to the inversion theorem
F(y) - F(x) = \frac{1}{2\pi}\lim_{\sigma \to 0}\int_{-\infty}^{\infty} \frac{\exp(-itx) - \exp(-ity)}{it}\,\varphi(t)\exp(-t^2\sigma^2)\,dt.
\Pr(\nu_1 < \infty) = \sum_{j=1}^{\infty}\Pr(\nu_1 = j) = \sum_{j=1}^{\infty}\Pr\Big(\max_{n < j}\Big(\sum_{i=1}^{n} X_i - 2n^2(n+1)\Big) < 0,\ \sum_{i=1}^{j} X_i > 2j^2(j+1)\Big)

\le \sum_{j=1}^{\infty}\Pr\Big(\sum_{i=1}^{j} X_i > 2j^2(j+1)\Big) \le \sum_{j=1}^{\infty}\frac{E\Big(\sum_{i=1}^{j} X_i\Big)}{2j^2(j+1)} = \frac{1}{2}\sum_{j=1}^{\infty}\frac{1}{j(j+1)} = \frac{1}{2}.
Example 6:
The PhD theory qualifying exam, Part I
8:30 AM–12:30 PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
3.4.3 How does one use this asymptotic result to perform the corre-
sponding test, in practice?
3.5 See Question 3.3, Section 13.3, Example 5.
It is known that E\exp(-sr_i) = 0.5\big[\exp(-10s)/3 + 2\exp(5s)/3\big]^{-1} with
s = 1/100.
4.1.1 When you stop, what will be your expected benefit from the risky
game (i.e., E\big(\sum_{i} Z_i\big))? When you apply the needed theorem, corresponding conditions should be checked.
Question 5 (5 points)
Confidence sequences for the median. See Question 5, Section 13.2, Example 4.
Compare this result with the corresponding well-known non-uniform
confidence interval estimation.
Comments. Regarding Question 4.1, we note that, defining the random variable \tau = \inf\Big\{n \ge 1: \sum_{i=1}^{n}(Z_i + r_i) \ge 100\Big\}, since (Z_i + r_i) > 0, we have, for a fixed N,

\Pr\{\tau > N\} = \Pr\Big\{\sum_{i=1}^{N}(Z_i + r_i) < 100\Big\} = \Pr\Big\{\exp\Big(-\frac{1}{100}\sum_{i=1}^{N}(Z_i + r_i)\Big) > e^{-1}\Big\}

\le e\,E\exp\Big(-\frac{1}{100}\sum_{i=1}^{N}(Z_i + r_i)\Big) = e\Big(E\exp\Big(-\frac{1}{100}(Z_1 + r_1)\Big)\Big)^{N}.
Example 7:
The PhD theory qualifying exam, Part I
8:30–12:30
This exam has 10 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
Question 1 (9 points)
Assume a statistician has observed data points X 1 ,… , X n and developed a test
statistic Tn to test the hypothesis H 0 versus H 1 = not H 0 based on X 1 ,… , X n. We
do not survey X 1 ,… , X n, but we know the value of Tn and that Tn is distributed
according to the known density functions f_0(u) and f_1(u) under H_0 and H_1,
respectively.
1.1 Can you propose a transformation of Tn (i.e., say, a function G (Tn)) that
will improve the power of the test, when any other transformations,
e.g., K (Tn) based on a function K (u), could provide less powerful tests,
compared with the test based on G(T_n)? If you can construct such
a function G, you should provide a corresponding proof.
1.2 What is a form of Tn that satisfies G (Tn) = Tn ? The corresponding
proof must be shown here.
(Hint: The concept of the most powerful testing can be applied here.)
Question 2 (5 points)
Assume that X 1 ,… , X n are independent and identically distributed (iid) ran-
dom variables and X 1 ~ N (0,1), Sn = (X 1 + … + X n). N , an integer random
variable, defines a stopping time. Calculate a value of E\exp\big(itS_N + t^2N/2\big),
where i 2 = −1 and t is a real number. Proofs of the theorems you use in this
question are not required to be shown.
Question 3 (4 points)
See Question 3, Section 13.2, Example 8.
Question 4 (7 points)
See Question 4.2, Section 13.1, Example 7.
Question 6 (3 points)
Let the observations X 1 ,… , X n be distributed with parameters (θ, σ), where σ
is unknown. We want to test parametrically for the parameter θ (θ = 0 under
H 0 ). Define the Type I error rate of the test. Justify your answer.
Question 10 (5 points)
See Question 5.3, Section 13.3, Example 3.
Comments. Regarding Question 2, we consider E\{\exp(itS_n + t^2n/2)|\sigma(X_1, \ldots, X_{n-1})\},
for a fixed n. It is clear that

E\{\exp(itS_n + t^2n/2)|\sigma(X_1, \ldots, X_{n-1})\} = \exp(itS_{n-1} + t^2n/2)\,\varphi(t),

where \varphi(t) is the characteristic function of X_1. This implies

E\{\exp(itS_n + t^2n/2)|\sigma(X_1, \ldots, X_{n-1})\} = \exp\big(itS_{n-1} + t^2(n-1)/2\big).

Then \{\exp(itS_n + t^2n/2), \sigma(X_1, \ldots, X_n)\} is a martingale and we can apply the
optional stopping theorem.
Example 8:
The PhD theory qualifying exam, Part I
8:30AM–12:30PM
This exam has 5 questions, some of which have subparts. Each question
indicates its point value. The total is 100 points.
(Figure: a plot of the distribution function F(x).)
3.1.1 Propose (and give a relevant proof) the most powerful test
statistic, say, LR_n for H_0 versus H_1.
3.1.2 See Question 2.4, Section 13.3, Example 3.
3.2. Let X 1 ,… , X n be observations that have a joint density function f
that can be presented in a known parametric form, depending on an
unknown parameter θ.
\pi(\theta) = \begin{cases} 1, & \theta \in [0, 1] \\ 0, & \theta < 0 \\ 0, & \theta > 1. \end{cases}
3.2.2 Write the maximum likelihood ratio MLR_n and formulate the
MLR test for H_0: \theta = 0 versus H_1: \theta \ne 0 based on X_1, \ldots, X_n. Find the
asymptotic (n \to \infty) distribution function of the statistic
2\log(MLR_n), under H_0.
3.2.3 Is the couple (MLRn and sigma algebra σ {X 1 ,… , X n}) a martin-
gale, submartingale, or supermartingale, under H 0 ? (Justify your
answer.)
3.3 Assume f , the density function of X 1 ,… , X n, is unknown, X 1 ,… , X n
are iid and θ = EX 1.
3.3.1 Define the empirical likelihood ratio test statistic ELRn.
3.3.2 Find the asymptotic (n \to \infty) distribution function of the statistic
2\log(ELR_n), under H_0, provided that E|X_1|^3 < \infty.
at the first time n such that \sum_{i=1}^{n}(Z_i + r_i) \ge 100.
It is known that, for all t > 0 and j > 0, \Pr\Big\{\sum_{i=1}^{j}(Z_i + r_i) \le t\Big\} = t/(j(j+1)) when
t/(j(j+1)) \le 1, and \Pr\Big\{\sum_{i=1}^{j}(Z_i + r_i) \le t\Big\} = 1 when t/(j(j+1)) \ge 1.
4.1.1 What will be your expected benefit from the risky game stopped
at the first time n such that \sum_{i=1}^{n}(Z_i + r_i) \ge 100, that is, E\big(\sum_{i=1}^{\tau} Z_i\big), where \tau denotes this
stopping time? When you apply a needed theorem, corresponding conditions should be
checked. (Suggestion: Use that \sum_{j=1}^{\infty} 1/(j(j+1)) = 1.)
Question 5 (5 points)
See Question 5.3, Section 13.3, Example 3.
\Pr\{\tau > N\} = \Pr\Big\{\sum_{i=1}^{N}(Z_i + r_i) < 100\Big\} = \Pr\Big\{\exp\Big(-\frac{1}{100}\sum_{i=1}^{N}(Z_i + r_i)\Big) > e^{-1}\Big\}

\le e\,E\exp\Big(-\frac{1}{100}\sum_{i=1}^{N}(Z_i + r_i)\Big) = e\Big(E\exp\Big(-\frac{1}{100}(Z_1 + r_1)\Big)\Big)^{N}.
\Pr(Y_i = 1|X_i) = \big(1 + \exp(-\alpha_0 - \beta_0 X_i)\big)^{-1} I(\nu < i) + \big(1 + \exp(-\alpha_1 - \beta_1 X_i)\big)^{-1} I(\nu \ge i),
Liu (1993) concept to cases when prior information related to the parameters
of biomarker distributions is assumed to be specified.
\max_{0 < p_1, \ldots, p_n < 1}\left(\prod_{i=1}^{n} p_i:\ \sum_{i=1}^{n} p_i = 1,\ \sum_{i=1}^{n} p_i(Y_i - \beta_0 X_i) = 0\right).

Note that

E\big(X_i(Y_i - \beta X_i)\big) = E\big[E\big(X_i(Y_i - \beta X_i)|X_i\big)\big] = E\big[X_i E\big((Y_i - \beta X_i)|X_i\big)\big] = 0, \quad i = 1, \ldots, n.
Then, one can propose the H 0 − empirical likelihood in the form
\max_{0 < p_1, \ldots, p_n < 1}\left(\prod_{i=1}^{n} p_i:\ \sum_{i=1}^{n} p_i = 1,\ \sum_{i=1}^{n} p_i(Y_i - \beta_0 X_i) = 0,\ \sum_{i=1}^{n} p_i X_i(Y_i - \beta_0 X_i) = 0\right).
Moreover, it is clear that E\big(X_i^k(Y_i - \beta X_i)\big) = 0, i = 1, \ldots, n, k = 0, 1, 2, 3, \ldots. Therefore,
in general, we can define the H_0-empirical likelihood as

L_k(\beta_0) = \max_{0 < p_1, \ldots, p_n < 1}\left(\prod_{i=1}^{n} p_i:\ \sum_{i=1}^{n} p_i = 1,\ \sum_{i=1}^{n} p_i(Y_i - \beta_0 X_i) = 0,\ \sum_{i=1}^{n} p_i X_i(Y_i - \beta_0 X_i) = 0, \ldots, \sum_{i=1}^{n} p_i X_i^k(Y_i - \beta_0 X_i) = 0\right).
The research issue in this study is the following: What are the optimal values of k that, depending on n, lead to an appropriate Type I error rate control and maximize the
power of the empirical likelihood ratio test? Theoretically, this issue corresponds to higher order approximations to the Type I error rate and the power
of the test.
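For numerical experimentation with L_k(β_0), one option (ours; it assumes the CRAN package emplik, whose el.test maximizes ∏p_i subject to moment constraints of this form) is:

# R sketch: -2 log empirical likelihood ratio with k + 1 moment constraints
library(emplik)
set.seed(1)
n <- 100; beta0 <- 1
x <- runif(n, 1, 2); y <- beta0 * x + rnorm(n)        # hypothetical data, generated under H0
k <- 2
g <- sapply(0:k, function(kk) x^kk * (y - beta0 * x)) # columns: X^k (Y - beta0*X)
out <- el.test(g, mu = rep(0, k + 1))
out$"-2LLR"                                           # compare with qchisq(0.95, k + 1)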
θ̂ for relatively small values of n. The research issue in this project is related
to how to combine θ̂ and θ to optimize the confidence interval estimation of
θ, depending on n.
References
Daniels, M. and Hogan, J. (2008). Missing Data in Longitudinal Studies: Strategies for
Bayesian Modeling and Sensitivity Analysis. New York, NY: Chapman & Hall.
Davis, C. S. (1993). The computer generation of multinomial random variates. Computational Statistics & Data Analysis, 16(2), 205–217.
Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge: Cambridge University Press.
De Bruijn, N. G. (1970). Asymptotic Methods in Analysis. Amsterdam: Dover Publications.
DeGroot, M. H. (2005). Optimal Statistical Decisions. Hoboken, NJ: John Wiley & Sons.
Delwiche, L. D. and Slaughter, S. J. (2012). The Little SAS Book: A Primer: A Programming Approach. Cary, NC: SAS Institute.
Dempster, A. P. and Schatzoff, M. (1965). Expected significance level as a sensitivity
index for test statistics. Journal of the American Statistical Association, 60, 420–436.
Desquilbet, L. and Mariotti, F. (2010). Dose-response analyses using restricted cubic
spline functions in public health research. Statistics in Medicine, 29(9), 1037–1057.
DiCiccio, T. J., Kass, R. E., Raftery, A. and Wasserman, L. (1997). Computing Bayes
factors by combining simulation and asymptotic approximations. Journal of the
American Statistical Association, 92(439), 903–915.
Dmitrienko, A., Molenberghs, G., Chuang-Stein, C. and Offen, W. (2005). Analysis of
Clinical Trials Using SAS: A Practical Guide. Cary, NC: SAS Institute.
Dmitrienko, A., Tamhane, A. C. and Bretz, F. (2010). Multiple Testing Problems in Pharmaceutical Statistics. New York, NY: CRC Press.
Dodge, H. F. and Romig, H. (1929). A method of sampling inspection. Bell System
Technical Journal, 8(4), 613–631.
Dragalin, V. P. (1997). The sequential change point problem. Economic Quality Control,
12, 95–122.
Edgar, G. A. and Sucheston, L. (2010). Stopping Times and Directed Processes (Encyclopedia
of Mathematics and Its Applications). New York, NY: Cambridge University Press.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics,
7(1), 1–26.
Efron, B. (1981). Nonparametric standard errors and confidence intervals. The Canadian Journal of Statistics/La Revue Canadienne de Statistique, 9(2), 139–158.
Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of
the American Statistical Association, 81(394), 461–470.
Efron, B. and Morris, C. (1972). Limiting risk of Bayes and empirical Bayes estimators.
Journal of the American Statistical Association, 67, 130–139.
Efron, B. and Tibshirani, R. J. (1994). An Introduction to the Bootstrap. Boca Raton, FL:
CRC Press.
Eng, J. (2005). Receiver operating characteristic analysis: A primer. Academic Radiology,
12(7), 909–916.
Evans, M. and Swartz, T. (1995). Methods for approximating integrals in statistics with
special emphasis on Bayesian integration problems. Statistical Science, 10(3), 254–272.
Everitt, B. S. and Palmer, C. (2010). Encyclopaedic Companion to Medical Statistics.
Hoboken, NJ: John Wiley & Sons.
Fan, J., Farmen, M. and Gijbels, I. (1998). Local maximum likelihood estimation and
inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology),
60(3), 591–608.
Fan, J., Zhang, C. and Zhang, J. (2001). Generalized likelihood ratio statistics and
Wilks phenomenon. Annals of Statistics, 29(1), 153–193.
Gurevich, G. and Vexler, A. (2010). Retrospective change point detection: From para-
metric to distribution free policies. Communications in Statistics—Simulation and
Computation, 39(5), 899–920.
Hall, P. (1992). The Bootstrap and Edgeworth Expansions. New York, NY: Springer.
Hammersley, J. M. and Handscomb, D. C. (1964). Monte Carlo Methods. New York, NY:
Springer.
Han, C. and Carlin, B. P. (2001). Markov chain Monte Carlo methods for computing
Bayes factors. Journal of the American Statistical Association, 96(455), 1122–1132.
Hartigan, J. (1971). Error analysis by replaced samples. Journal of the Royal Statistical
Society. Series B (Methodological), 33(2), 98–110.
Hartigan, J. A. (1969). Using subsample values as typical values. Journal of the American
Statistical Association, 64(328), 1303–1317.
Hartigan, J. A. (1975). Necessary and sufficient conditions for asymptotic joint
normality of a statistic and its subsample values. Annals of Statistics, 3(3),
573–580.
Hosmer, D. W. Jr. and Lemeshow, S. (2004). Applied Logistic Regression. Hoboken, NJ:
John Wiley & Sons.
Hsieh, F. and Turnbull, B. W. (1996). Nonparametric and semiparametric estimation of
the receiver operating characteristic curve. Annals of Statistics, 24(1), 25–40.
Hwang, J. T. G. and Yang, M.-C. (2001). An optimality theory for mid p-values in
2 × 2 contingency tables. Statistica Sinica, 11, 807–826.
James, W. and Stein, C. (1961). Estimation with quadratic loss. Proceedings of the Fourth
Berkeley Symposium on Mathematical Statistics and Probability, 1, 361–379.
Janzen, F. J., Tucker, J. K. and Paukstis, G. L. (2000). Experimental analysis of an early
life-history stage: Selection on size of hatchling turtles. Ecology, 81(8), 2290–2304.
Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability.
Mathematical Proceedings of the Cambridge Philosophical Society, 31(2), 203–222.
Jeffreys, H. (1961). Theory of Probability. Oxford: Oxford University Press.
Jennison, C. and Turnbull, B. W. (2000). Group Sequential Methods with Applications to
Clinical Trials. New York, NY: CRC Press.
Johnson, M. E. (1987). Multivariate Statistical Simulations. New York, NY: John Wiley &
Sons.
Julious, S. A. (2005). Why do we use pooled variance analysis of variance? Pharmaceu-
tical Statistics, 4, 3–5.
Kass, R. E. (1993). Bayes factors in practice. Statistician, 42(5), 551–560.
Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical
Association, 90(430), 773–795.
Kass, R. E. and Vaidyanathan, S. K. (1992). Approximate Bayes factors and orthogonal
parameters, with application to testing equality of two binomial proportions.
Journal of the Royal Statistical Society. Series B (Methodological), 54(1), 129–144.
Kass, R. E. and Wasserman, L. (1995). A reference Bayesian test for nested hypotheses
and its relationship to the Schwarz criterion. Journal of the American Statistical
Association, 90(431), 928–934.
Kass, R. E. and Wasserman, L. (1996). The selection of prior distributions by formal
rules. Journal of the American Statistical Association, 91(435), 1343–1370.
Kass, R. E., Tierney, L. and Kadane, J. B. (1990). The validity of posterior asymptotic
expansions based on Laplace’s method. Bayesian and Likelihood Methods in Statis-
tics and Econometrics: Essays in Honor of George A. Barnard. Geisser, S., Hodges, J.
S., Press, S. J. and Zellner, A. (Eds.), New York, NY: North-Holland.
Kepner, J. L. and Chang, M. N. (2003). On the maximum total sample size of a group
sequential test about binomial proportions. Statistics & Probability Letters, 62(1),
87–92.
Koch, A. L. (1966). The logarithm in biology 1. Mechanisms generating the log-normal
distribution exactly. Journal of Theoretical Biology, 12(2), 276–290.
Konishi, S. (1978). An approximation to the distribution of the sample correlation coef-
ficient. Biometrika, 65, 654–656.
Korevaar, J. (2004). Tauberian Theory. New York, NY: Springer.
Kotz, S., Balakrishnan, N. and Johnson, N. L. (2000). Continuous Multivariate Distribu-
tions, Models and Applications. New York, NY: John Wiley & Sons.
Kotz, S., Lumelskii, Y. and Pensky, M. (2003). The Stress-Strength Model and Its General-
izations: Theory and Applications. Singapore: World Scientific Publishing Com-
pany.
Krieger, A. M., Pollak, M. and Yakir, B. (2003). Surveillance of a simple linear regres-
sion. Journal of the American Statistical Association, 98, 456–469.
Lai, T. L. (1995). Sequential changepoint detection in quality control and dynamical
systems. Journal of the Royal Statistical Society. Series B (Methodological), 57(4),
613–658.
Lai, T. L. (1996). On uniform integrability and asymptotically risk-efficient sequential
estimation. Sequential Analysis, 15, 237–251.
Lai, T. L. (2001). Sequential analysis: Some classical problems and new challenges.
Statistica Sinica, 11, 303–408.
Lane-Claypon, J. E. (1926). A Further Report on Cancer of the Breast with Special Reference
to Its Associated Antecedent Conditions. Ministry of Health. Reports on Public
Health and Medical Subjects (32), London.
Lasko, T. A., Bhagwat, J. G., Zou, K. H. and Ohno-Machado, L. (2005). The use of
receiver operating characteristic curves in biomedical informatics. Journal of Bio-
medical Informatics, 38(5), 404–415.
Lazar, N. A. (2003). Bayesian empirical likelihood. Biometrika, 90(2), 319–326.
Lazar, N. A. and Mykland, P. A. (1999). Empirical likelihood in the presence of nui-
sance parameters. Biometrika, 86(1), 203–211.
Lazzeroni, L. C., Lu, Y. and Belitskaya-Levy, I. (2014). P-values in genomics: Apparent
precision masks high uncertainty. Molecular Psychiatry, 19, 1336–1340.
Le Cam, L. (1953). On some asymptotic properties of maximum likelihood estimates and
related Bayes’ estimates. University of California Publications in Statistics, 1, 277–330.
Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. New York, NY:
Springer-Verlag.
Lee, E. T. and Wang, J. W. (2013). Statistical Methods for Survival Data Analysis. Hoboken,
NJ: John Wiley & Sons.
Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation. New York, NY:
Springer-Verlag.
Lehmann, E. L. and Romano, J. P. (2006). Testing Statistical Hypotheses. New York, NY:
Springer-Verlag.
Limpert, E., Stahel, W. A. and Abbt, M. (2001). Log-normal distributions across the
sciences: Keys and clues on the charms of statistics, and how mechanical models
resembling gambling machines offer a link to a handy way to characterize log-
normal distributions, which can provide deeper insight into variability and
probability—Normal or log-normal: That is the question. BioScience, 51(5),
341–352.
Lin, C.-C. and Mudholkar, G. S. (1980). A simple test for normality against asymmetric
alternatives. Biometrika, 67(2), 455–461.
Lin, Y. and Shih, W. J. (2004). Adaptive two‐stage designs for single‐arm phase IIA
cancer clinical trials. Biometrics, 60(2), 482–490.
Lindsey, J. (1996). Parametric Statistical Inference. Oxford: Oxford University Press.
Liptser, R. Sh. and Shiryayev, A. N. (1989). Theory of Martingales. London: Kluwer
Academic Publishers.
Liu, A. and Hall, W. (1999). Unbiased estimation following a group sequential test.
Biometrika, 86(1), 71–78.
Liu, C., Liu, A. and Halabi, S. (2011). A min–max combination of biomarkers to
improve diagnostic accuracy. Statistics in Medicine, 30(16), 2005–2014.
Lloyd, C. J. (1998). Using smoothed receiver operating characteristic curves to sum-
marize and compare diagnostic systems. Journal of the American Statistical Asso-
ciation, 93(444), 1356–1364.
Lorden, G. and Pollak, M. (2005). Nonanticipating estimation applied to sequential
analysis and changepoint detection. The Annals of Statistics, 33, 1422–1454.
Lucito, R., West, J., Reiner, A., Alexander, J., Esposito, D., Mishra, B., Powers, S., Nor-
ton, L. and Wigler, M. (2000). Detecting gene copy number fluctuations in tumor
cells by microarray analysis of genomic representations. Genome Research, 10(11),
1726–1736.
Lukacs, E. (1970). Characteristic Functions. London: Charles Griffin & Company.
Lusted, L. B. (1971). Signal detectability and medical decision-making. Science,
171(3977), 1217–1219.
Mallows, C. L. (1972). A note on asymptotic joint normality. Annals of Mathematical
Statistics, 39, 755–771.
Marden, J. I. (2000). Hypothesis testing: From p values to Bayes factors. Journal of the
American Statistical Association, 95, 1316–1320.
Martinsek, A. T. (1981). A note on the variance and higher central moments of the stop-
ping time of an SPRT. Journal of the American Statistical Association, 76(375), 701–703.
McCulloch, R. and Rossi, P. E. (1991). A Bayesian approach to testing the arbitrage
pricing theory. Journal of Econometrics, 49(1), 141–168.
McIntosh, M. W. and Pepe, M. S. (2002). Combining several screening tests: Optimality
of the risk score. Biometrics, 58(3), 657–664.
Meng, X.-L. and Wong, W. H. (1996). Simulating ratios of normalizing constants via a
simple identity: A theoretical exploration. Statistica Sinica, 6(4), 831–860.
Metz, C. E., Herman, B. A. and Shen, J. H. (1998). Maximum likelihood estimation of
receiver operating characteristic (ROC) curves from continuously‐distributed
data. Statistics in Medicine, 17(9), 1033–1053.
Mikusinski, P. and Taylor, M. D. (2002). An Introduction to Multivariable Analysis from
Vector to Manifold. Boston, MA: Birkhäuser.
Montgomery, D. C. (1991). Introduction to Statistical Quality Control. New York, NY:
Wiley.
Nakas, C. T. and Yiannoutsos, C. T. (2004). Ordered multiple‐class ROC analysis with
continuous measurements. Statistics in Medicine, 23(22), 3437–3449.
Newton, M. A. and Raftery, A. E. (1994). Approximate Bayesian inference with the
weighted likelihood bootstrap. Journal of the Royal Statistical Society. Series B
(Methodological), 56(1), 3–48.
Neyman, J. and Pearson, E. S. (1928). On the use and interpretation of certain test cri-
teria for purposes of statistical inference: Part II. Biometrika, 20A(3/4), 263–294.
Qin, J. and Zhang, B. (2005). Marginal likelihood, conditional likelihood and empirical
likelihood: Connections and applications. Biometrika, 92(2), 251–270.
Quenouille, M. H. (1949). Problems in plane sampling. The Annals of Mathematical
Statistics, 20(3), 355–375.
Quenouille, M. H. (1956). Notes on bias in estimation. Biometrika, 43(3–4), 353–360.
R Development Core Team. (2014). R: A Language and Environment for Statistical Com-
puting. R Foundation for Statistical Computing, Vienna, Austria.
Reid, N. (2000). Likelihood. Journal of the American Statistical Association, 95, 1335–1340.
Reiser, B. and Faraggi, D. (1997). Confidence intervals for the generalized ROC crite-
rion. Biometrics, 53(2), 644–652.
Richards, R. J., Hammitt, J. K. and Tsevat, J. (1996). Finding the optimal multiple-test
strategy using a method analogous to logistic regression: The diagnosis of
hepatolenticular degeneration (Wilson's disease). Medical Decision Making, 16(4),
367–375.
Ritov, Y. (1990). Decision theoretic optimality of the CUSUM procedure. The Annals of
Statistics, 18, 1464–1469.
Robbins, H. (1959). Sequential estimation of the mean of a normal population. Proba-
bility and Statistics (Harold Cramér Volume). Stockholm: Almquist and Wiksell, pp.
235–245.
Robbins, H. (1970). Statistical methods related to the law of the iterated logarithm. The
Annals of Mathematical Statistics, 41, 1397–1409.
Robbins, H. and Siegmund, D. (1970). Boundary crossing probabilities for the Wiener
process and sample sums. The Annals of Mathematical Statistics, 41, 1410–1429.
Robbins, H. and Siegmund, D. (1973). A class of stopping rules for testing parametric
hypotheses. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics
and Probability, University of California Press, 37–41.
Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods. 2nd ed., New York,
NY: Springer.
Sackrowitz, H. and Samuel-Cahn, E. (1999). P values as random variables—Expected
P values. The American Statistician, 53, 326–331.
SAS Institute Inc. (2011). SAS® 9.3 Macro Language: Reference. Cary, NC: SAS Institute.
Schiller, F. (2013). Die Verschwörung des Fiesco zu Genua. Berliner: CreateSpace
Independent Publishing Platform.
Schisterman, E. F., Faraggi, D., Browne, R., Freudenheim, J., Dorn, J., Muti, P., Arm-
strong, D., Reiser, B. and Trevisan, M. (2001a). TBARS and cardiovascular disease
in a population-based sample. Journal of Cardiovascular Risk, 8(4), 219–225.
Schisterman, E. F., Faraggi, D., Browne, R., Freudenheim, J., Dorn, J., Muti, P., Arm-
strong, D., Reiser, B. and Trevisan, M. (2002). Minimal and best linear combina-
tion of oxidative stress and antioxidant biomarkers to discriminate cardiovascular
disease. Nutrition, Metabolism, and Cardiovascular Disease, 12, 259–266.
Schisterman, E. F., Faraggi, D., Reiser, B. and Trevisan, M. (2001b). Statistical inference
for the area under the receiver operating characteristic curve in the presence of
random measurement error. American Journal of Epidemiology, 154, 174–179.
Schisterman, E. F., Perkins, N. J., Liu, A. and Bondell, H. (2005). Optimal cut-point and
its corresponding Youden index to discriminate individuals using pooled blood
samples. Epidemiology, 16, 73–81.
Schisterman, E. F., Vexler, A., Ye, A. and Perkins, N. J. (2011). A combined efficient
design for biomarker data subject to a limit of detection due to measuring
instrument sensitivity. The Annals of Applied Statistics, 5, 2651–2667.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2),
461–464.
Schweder, T. and Hjort, N. L. (2002). Confidence and likelihood. Scandinavian Journal
of Statistics, 29(2), 309–332.
Serfling, R. J. (2002). Approximation Theorems of Mathematical Statistics. Canada: John
Wiley & Sons.
Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. New York, NY: Springer Science
& Business Media.
Shao, J. and Wu, C. F. J. (1989). A general theory for jackknife variance estimation. The
Annals of Statistics, 17, 1176–1197.
Shapiro, D. E. (1999). The interpretation of diagnostic tests. Statistical Methods in Med-
ical Research, 8(2), 113–134.
Siegmund, D. (1968). On the asymptotic normality of one-sided stopping rules. The
Annals of Mathematical Statistics, 39(5), 1493–1497.
Simon, R. (1989). Optimal two-stage designs for phase II clinical trials. Controlled Clin-
ical Trials, 10(1), 1–10.
Singh, K., Xie, M. and Strawderman, W. E. (2005). Combining information from inde-
pendent sources through confidence distributions. The Annals of Statistics, 33(1),
159–183.
Sinharay, S. and Stern, H. S. (2002). On the sensitivity of Bayes factors to the prior
distributions. The American Statistician, 56(3), 196–201.
Sinnott, E. W. (1937). The relation of gene to character in quantitative inheritance. Pro-
ceedings of the National Academy of Sciences of the United States of America, 23(4), 224.
Sleasman, J. W., Nelson, R. P., Goodenow, M. M., Wilfret, D., Hutson, A., Baseler, M.,
Zuckerman, J., Pizzo, P. A. and Mueller, B. U. (1999). Immunoreconstitution
after ritonavir therapy in children with human immunodeficiency virus infec-
tion involves multiple lymphocyte lineages. The Journal of Pediatrics, 134,
597–606.
Smith, A. F. and Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler
and related Markov chain Monte Carlo methods. Journal of the Royal Statistical
Society. Series B (Methodological), 55(1), 3–23.
Smith, R. L. (1985). Maximum likelihood estimation in a class of nonregular cases.
Biometrika, 72(1), 67–90.
Snapinn, S., Chen, M. G., Jiang, Q. and Koutsoukos, T. (2006). Assessment of futility
in clinical trials. Pharmaceutical Statistics, 5(4), 273–281.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate
normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical
Statistics and Probability, 1, 197–206.
Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900.
Cambridge, MA: Belknap Press of Harvard University Press.
Stigler, S. M. (2007). The epic story of maximum likelihood. Statistical Science, 22, 598–620.
Subkhankulov, M. A. (1976). Tauberian Theorems with Remainder Terms. Moscow: Nauka.
Su, J. Q. and Liu, J. S. (1993). Linear combinations of multiple diagnostic markers.
Journal of the American Statistical Association, 88(424), 1350–1355.
Suissa, S. and Shuster, J. J. (1984). Are uniformly most powerful unbiased tests
really best? The American Statistician, 38, 204–206.
Tierney, L. and Kadane, J. B. (1986). Accurate approximations for posterior moments
and marginal densities. Journal of the American Statistical Association, 81(393),
82–86.
Tierney, L., Kass, R. E. and Kadane, J. B. (1989). Fully exponential Laplace approxima-
tions to expectations and variances of nonpositive functions. Journal of the Amer-
ican Statistical Association, 84(407), 710–716.
Titchmarsh, E. C. (1976). The Theory of Functions. New York, NY: Oxford University Press.
Trafimow, D. and Marks, M. (2015). Editorial. Basic and Applied Social Psychology,
37, 1–2.
Tukey, J. W. (1958). Bias and confidence in not quite large samples. Annals of Mathe-
matical Statistics, 29, 614–623.
Vexler, A. (2006). Guaranteed testing for epidemic changes of a linear regression
model. Journal of Statistical Planning and Inference, 136(9), 3101–3120.
Vexler, A. (2008). Martingale type statistics applied to change points detection. Com-
munications in Statistics—Theory and Methods, 37(8), 1207–1224.
Vexler, A. and Dmitrienko, A. (1999). Approximations to expected stopping times
with applications to sequential estimation. Sequential Analysis, 18, 165–187.
Vexler, A. and Gurevich, G. (2009). Average most powerful tests for a segmented
regression. Communications in Statistics—Theory and Methods, 38(13), 2214–2231.
Vexler, A. and Gurevich, G. (2010). Density-based empirical likelihood ratio change
point detection policies. Communications in Statistics—Simulation and Computa-
tion, 39(9), 1709–1725.
Vexler, A. and Gurevich, G. (2011). A note on optimality of hypothesis testing. Journal
MESA, 2(3), 243–250.
Vexler, A., Hutson, A. D. and Chen, X. (2016a). Statistical Testing Strategies in the Health
Sciences. New York, NY: Chapman & Hall/CRC.
Vexler, A., Liu, A., Eliseeva, E. and Schisterman, E. F. (2008a). Maximum likelihood
ratio tests for comparing the discriminatory ability of biomarkers subject to
limit of detection. Biometrics, 64(3), 895–903.
Vexler, A., Liu, A. and Schisterman, E. F. (2010a). Nonparametric deconvolution of
density estimation based on observed sums. Journal of Nonparametric Statistics,
22, 23–39.
Vexler, A., Liu, A., Schisterman, E. F. and Wu, C. (2006). Note on distribution-free
estimation of maximum linear separation of two multivariate distributions.
Nonparametric Statistics, 18(2), 145–158.
Vexler, A., Liu, S., Kang, L. and Hutson, A. D. (2009a). Modifications of the empirical
likelihood interval estimation with improved coverage probabilities. Communi-
cations in Statistics—Simulation and Computation, 38(10), 2171–2183.
Vexler, A., Schisterman, E. F. and Liu, A. (2008b). Estimation of ROC based on stably
distributed biomarkers subject to measurement error and pooling mixtures. Sta-
tistics in Medicine, 27, 280–296.
Vexler, A., Tao, G. and Hutson, A. (2014). Posterior expectation based on empirical
likelihoods. Biometrika, 101(3), 711–718.
Vexler, A. and Wu, C. (2009). An optimal retrospective change point detection policy.
Scandinavian Journal of Statistics, 36(3), 542–558.
Vexler, A., Wu, C., Liu, A., Whitcomb, B. W. and Schisterman, E. F. (2009b). An exten-
sion of a change point problem. Statistics, 43(3), 213–225.
Vexler, A., Wu, C. and Yu, K. F. (2010b). Optimal hypothesis testing: From semi to fully
Bayes factors. Metrika, 71(2), 125–138.
Vexler, A., Yu, J. and Hutson, A. D. (2011). Likelihood testing populations modeled by
autoregressive process subject to the limit of detection in applications to longi-
tudinal biomedical data. Journal of Applied Statistics, 38(7), 1333–1346.
Vexler, A., Yu, J., Tian, L. and Liu, S. (2010c). Two‐sample nonparametric likelihood
inference based on incomplete data with an application to a pneumonia study.
Biometrical Journal, 52(3), 348–361.
Vexler, A., Yu, J., Zhao, Y., Hutson, A. D. and Gurevich, G. (2017). Expected P-values in
light of an ROC curve analysis applied to optimal multiple testing procedures.
Statistical Methods in Medical Research. doi:10.1177/0962280217704451. In Press.
Vexler, A., Zou, L. and Hutson, A. D. (2016b). Data-driven confidence interval estima-
tion incorporating prior information with an adjustment for skewed data. Amer-
ican Statistician, 70, 243–249.
Ville, J. (1939). Étude critique de la notion de collectif. Monographies des Probabilités (in
French). 3. Paris: Gauthier-Villars.
Wald, A. (1947). Sequential Analysis. New York, NY: Wiley.
Wald, A. and Wolfowitz, J. (1948). Optimum character of the sequential probability
ratio test. Annals of Mathematical Statistics, 19(3), 326–339.
Wasserstein, R. L. and Lazar, N. (2016). The ASA’s statement on p-values: Context,
process, and purpose. American Statistician, 70, 129–133.
Wedderburn, R. W. (1974). Quasi-likelihood functions, generalized linear models,
and the Gauss–Newton method. Biometrika, 61(3), 439–447.
Whitehead, J. (1986). On the bias of maximum likelihood estimation following a
sequential test. Biometrika, 73(3), 573–581.
Wieand, S., Gail, M. H., James, B. R. and James, K. L. (1989). A family of nonparamet-
ric statistics for comparing diagnostic markers with paired or unpaired data.
Biometrika, 76(3), 585–592.
Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing
composite hypotheses. Annals of Mathematical Statistics, 9(1), 60–62.
Williams, D. (1991). Probability with Martingales. New York, NY: Cambridge University
Press.
Yakimiv, A. L. (2005). Probabilistic Applications of Tauberian Theorems. Boston, MA: Mar-
tinus Nijhoff Publishers and VSP.
Yakir, B. (1995). A note on the run length to false alarm of a change-point detection
policy. Annals of Statistics, 23, 272–281.
Yu, J., Vexler, A. and Tian, L. (2010). Analyzing incomplete data subject to a threshold
using empirical likelihood methods: An application to a pneumonia risk study
in an ICU setting. Biometrics, 66(1), 123–130.
Zellner, A. (1996). An Introduction to Bayesian Inference in Econometrics. New York, NY:
Wiley.
Zhou, X. and Reiter, J. P. (2010). A note on Bayesian inference after multiple imputation.
American Statistician, 64, 159–163.
Zhou, X.-H. and McClish, D. K. (2002). Statistical Methods in Diagnostic Medicine. New
York, NY: Wiley.
Zhou, X.-H., Obuchowski, N. A. and McClish, D. K. (2011). Statistical Methods in Diag-
nostic Medicine. Hoboken, NJ: John Wiley & Sons.
Zimmerman, D. W. and Zumbo, B. D. (2009). Hazards in choosing between pooled
and separate-variance t tests. Psicológica, 30, 371–390.
Zou, K. H., Hall, W. and Shapiro, D. E. (1997). Smooth non‐parametric receiver oper-
ating characteristic (ROC) curves for continuous diagnostic tests. Statistics in
Medicine, 16(19), 2143–2156.
Author Index
W
Wald, A., 169, 170, 173, 206
Wallis, W. A., 168
Wang, J. W., 5
Wasserman, L., 146, 148, 149, 153
Wasserstein, R. L., 224
Wedderburn, R. W., 239
West, J., 101
Westfall, P. H., 151
Whitehead, J., 183
Wieand, S., 187, 189
Wigler, M., 101
Wilfret, D., 280
Wilks, S. S., 76, 106
Williams, D., 84
Wolfowitz, J., 177
Wolpert, R. L., 60
Wong, W. H., 149
Wu, C., 71, 79, 109
Wu, C. F. J., 263

X
Xie, M., 219

Y
Yakimiv, A. L., 43
Yakir, B., 75, 114, 120, 124, 128, 150
Ye, A., 55
Yiannoutsos, C. T., 202
Yu, J., 222, 223, 226, 236, 254
Yu, K. F., 71

Z
Zellner, A., 137
Zhang, B., 254
Zhang, C., 78
Zhao, Y., 223, 226, 236
Zhou, X., 222
Zhou, X.-H., 185
Zimmerman, D. W., 227
Zou, K. H., 185, 192
Zou, L., 222, 255
Zuckerman, J., 280
Zumbo, B. D., 227
Subject Index
  composite statement, 96
  martingale principle, 93–99
  most powerful test, 98
  problems, 130

I
IF-THEN statement, 26
Importance sampling, 147, 160
Improper distribution, 150
In control process, 110, 113, 114, 123
Independent and identically distributed (iid) random variables
  distribution function of, 54, 56
  probability distribution, 4
  sum of, 65, 66, 173, 178
  in Ville and Wald inequality, 209–213
Indicator function, 7, 10–12, 70, 135, 150
Inference
  AUC-based, 366
  Bayesian, 150, 255, 363–364
  empirical Bayesian, 364
  frequentist context, 129
  on incomplete bivariate data, 253
  ROC curve, 186–189
INFILE statement, 24–25
Infimum, 7
Integrated most powerful test, 132–135
Integration, expectation and, 6–8
Invariant data transformations, 108
Invariant statistics, 106
Invariant transformations, 106
Inversion theorem, 34–39

J
Jackknife method, 259
  bias estimation, 261–263
  confidence interval
    approximate, 264–265
    definition, 263–264
  homework questions, 312–313
  nonparametric estimate of variance, 268
  variance estimation, 263
  variance stabilization, 265
James–Stein estimation, 255, 364
Jensen's inequality, 14, 61, 87, 95

K
Kernel function, 192
Khinchin's form, law of large numbers, 48–49
Kullback–Leibler distance, 304

L
Lagrange method, 221, 240, 242
Lagrange multiplier, 221, 240, 243, 250, 256
Laplace approximation, 142
Laplace distribution, 35
Laplace's method, 142
  modification of, 148
  simulated version of, 149
  variants on, 142–145
Laplace transformation
  of distribution functions, 31
  real-valued, 44
Law of iterated expectation, 86
Law of large numbers, 32, 48, 205
  Khinchin's (Hinchin's) form of, 48–49
  outcomes of, 51, 54
  strong, 10
  weak, 9
Law of the iterated logarithm, 205, 206–207, 214, 217
Law of total expectation, 86
LCL, see Lower control limit
Least squares method, 61
Lebesgue principle, 7
Lebesgue–Stieltjes integral, 7, 8
Likelihood function, 59, 60
Likelihood method, 59–60
  artificial/approximate, 239
  correct model-based, 80
  intuition about, 61–63
  parametric, 78
Likelihood ratio (LR), 69–73, 97–99, 170, see also Log-likelihood ratio; Maximum likelihood ratio
Likelihood ratio test
  bias, 74
  decision rule, 69–71
  optimality of, 69, 71–72
  power, 70, 74
  type I error control, 100
Likelihood ratio test statistic, 69, 73–75, 96
Limits, 1–2, 8
Linear regressions, empirical likelihood ratio tests for, 365–366
"Little oh" notation, 2, 10
Log-empirical likelihood ratios, 246–248
Logistic regression, 185
  change point problems in, 363
  ROC curve analysis, 193–195
    example, 198–201
    expected bias of ROC, 196–197
    retrospective and prospective ROC, 195–196
Log likelihood function, 63, 79, 137–141
Log-likelihood ratio, 61, 143, 144, 173, 310
Log maximum likelihood, 64
Log-normal distribution, 5–6
Log-normally distributed values
  composite estimation of mean, 366–367
  of multiple biomarkers, 364
Lower confidence band, 264
Lower control limit (LCL), 111
LR, see Likelihood ratio

M
Magnitude of complex number, 15
Mallow's distance, 286–287
Mann–Whitney statistic, 192, 198
Markov chain Monte Carlo (MCMC) method, 148, 221
Martingale-based arguments, 79
Martingale transforms, 116–118, 121
Martingale type statistics, 83
  adaptive estimators, 105–109
  change point detection, see Change point detection method
  conditions for, 86
  Doob's inequality, 92–93
  hypotheses testing, 93–99
  likelihood ratios, 97–99
  maximum likelihood ratio, 96–97
  optional stopping theorem, 91
  reverse martingale, 102, 108
  σ-algebra, 85–86
  stopping time, 88–90
  terminologies, 84–90
  type I error rate control, 100, 106–107
  Wald's lemma, 92
Maximum likelihood estimation, 63–68, 131, 142, 183
Maximum likelihood estimator (MLE), 47, 137, 143–144, 187, 191, 364, 366–367
  adaptive, 105–109
  example, 67–68
  log-likelihood function, 137–139
  as point estimator, 66
Maximum likelihood ratio (MLR), 76–80, 96–97
Maximum likelihood ratio test statistic, 76, 79, 96, 105
MCMC method, see Markov chain Monte Carlo method
Mean
  composite estimation of, 366–367
  rth mean, convergence in, 10
Measurability, 85
Measurable random variable, 86
Measurement errors, 32, 54–56, 363
Measurement model, with autoregressive errors, 80, 307
Metropolis–Hastings algorithm, 148
Minus maximum log-likelihood ratio test statistic, 246–248
MLE, see Maximum likelihood estimator
Model selection, 146
Modulus of complex number, 15
Monte Carlo (MC)
  approximations, 102, 106, 229, 232
  coverage probabilities, 223
  error, 268
  Markov chain, 148
  resampling, 304
  roulette game, 90
  simplest, 147
Monte Carlo power, 54, 198–199, 309
Monte Carlo simulation, 113, 177, 267–268, 308, 310
  principles of, 51–54
  Shiryayev–Roberts technique, 124–128
Most powerful decision rule, 135
Most powerful test
  of hypothesis testing, 98
  integrated, 132–135
  likelihood ratio test, 69, 71
Most probable position, 59
Multiple biomarkers, best combination of, 201–202
  Bayesian inference, 363–364
  empirical Bayesian inference, 364
  log-normal distribution, 364
Multiroot procedure, 191
"mvrnorm" command, 21–22

N
Nested hypothesis, 143
Newton–Raphson method, 53, 68
Neyman–Pearson lemma, 69, 78, 225, 231
Neyman–Pearson testing concept, 69, 73, 96, 98
Nonanticipating estimation
  maximum likelihood estimation, 105–109
  Monte Carlo simulation, 124–125
Non-asymptotic upper bound of renewal function, 42
Noninformative prior, 149–150
Nonparametric approach
  area under the ROC curve, 192–193, 198
  best linear combination of biomarkers, 202
Nonparametric estimation of ROC curve, 187–188
Nonparametric method
  Monte Carlo simulation scheme, 128
  Shiryayev–Roberts technique, 123–124
Nonparametric simulation, 267–269
Nonparametric testing, 308
Nonparametric tilting, see Bootstrap tilting
Not a number (NaN) values, 18
Null hypothesis, 97
  Bayes factor, 153
  CUSUM test statistic, 102
  empirical likelihood function, 246
  p-value, 161–162, 224, 225
  rejection of, 73, 79, 102, 130, 158, 170, 222
Numerical integration, 137, 142, 146

O
One-to-one mapping, 31, 34–39
Online change point problem, 109
Optimality, 69, 221
  likelihood argument of, 75
  of statistical tests, 71–72
"Optim" procedure, 191
Optional stopping theorem, 91
Out of control process, 110, 111, 114, 123
Overestimation, 194–195
  area under the ROC curve, 196–197

P
p-value, 76, 97, 132, 223–226, 237
  expected, see Expected p-value
  formulation of, 161–162
  null hypothesis, 161–162, 224, 225
  prediction intervals for, 225
  stochastic aspect of, 225
Parametric approach
  area under the ROC curve, 190–192
  best linear combination of biomarkers, 202
Parametric characteristic function, 56–57
Parametric distribution functions, 5–6
Parametric likelihood (PL) method, 78, 255
  empirical likelihood and, 253–254
Parametric sequential procedures, 167
Parametric statistical procedure, 167
Partial AUC (pAUC), 203, 226
Partial expected p-value (pEPV), 226, 230, 231, 234–235
pAUC, see Partial AUC
Penalized empirical likelihood estimation, 366
pEPV, see Partial expected p-value
Percentile confidence interval, 280–282, 288–291
Pivot, 264
PL method, see Parametric likelihood method
Point estimator, 66
Pooling design, 55–56
Posterior distribution, 131, 146–149, 155
Posterior empirical Bayes (PEB), 364