
Appendix A

Measure and Integration Theory

This appendix contains an introduction to the theory of measure and integration.


The first section is an overview. It could serve either as a refresher for those who
have previously studied the material or as an informal introduction for those who
have never studied it.

A.1 Overview
A.1.1 Definitions
In many introductory statistics and probability courses, one encounters discrete
and continuous random variables and vectors. These are all special cases of a
more general type of random quantity that we will study in this text. Before we
can introduce the more general type of random quantity, we need to generalize
the sums and integrals that figure so prominently in the distributions of discrete
and continuous random variables and vectors. The generalization is through the
concept of a measure (to be defined shortly), which is a way of assigning numerical
values to the "sizes" of sets.
Example A.1. Let S be a nonempty set, and let A ⊆ S. Define μ(A) to be the number of elements of A. Then μ(S) > 0, μ(∅) = 0, and if A₁ ∩ A₂ = ∅, μ(A₁ ∪ A₂) = μ(A₁) + μ(A₂). Note that μ(A) = ∞ is possible if S has infinitely many elements. The measure μ described here is called counting measure on S.
Example A.2. Let A be an interval of real numbers. If A is bounded, let μ(A) be the length of A. If A is unbounded, let μ(A) = ∞. It is easy to see that μ(ℝ) = ∞,¹ μ(∅) = 0, and if A₁ ∩ A₂ = ∅ and A₁ ∪ A₂ is an interval, then μ(A₁ ∪ A₂) = μ(A₁) + μ(A₂). The measure μ described here is called Lebesgue measure.

¹By ℝ, we mean the set of real numbers.
Example A.3. Let f : ℝ → ℝ⁺ be a continuous function.² Define, for each interval A, μ(A) = ∫_A f(x)dx. Then μ(ℝ) > 0, μ(∅) = 0, and if A₁ ∩ A₂ = ∅ and A₁ ∪ A₂ is an interval, then μ(A₁ ∪ A₂) = μ(A₁) + μ(A₂).

²By ℝ⁺, we mean the open interval (0, ∞).
Since measure will be used to give sizes to sets, the domain of a measure will
be a collection of sets. In general, we cannot assign sizes to all sets, but we need
enough sets so that we can take unions and complements. A collection of sets
that is closed under taking complements and finite unions is called a field. A field
that is closed under taking countable unions is called a σ-field.
Example A.4. Let S be any set. Let A = {S, ∅}. This σ-field is called the trivial σ-field. As a second example, let A ⊂ S, and let A = {S, A, Aᶜ, ∅}. Let B be another subset of S, and let A = {S, ∅, A, B, Aᶜ, Bᶜ, A ∩ B, A ∩ Bᶜ, ...}. Such examples grow rapidly. The largest σ-field is the collection of all subsets of S, called the power set of S and denoted 2^S.
Example A.5. One field of subsets of ℝ is the collection of all unions of finitely many disjoint intervals (unbounded intervals are allowed). This collection is not a σ-field, however; for example, the countable union ∪_{n=1}^∞ (n, n + 1/2) of disjoint intervals is not a union of finitely many intervals.

It is easy to prove that the intersection of an arbitrary collection of σ-fields is itself a σ-field. Since 2^S is a σ-field, it is easy to see that, for every collection of subsets C of S, there is a smallest σ-field A that contains C, namely the intersection of all σ-fields that contain C. This smallest σ-field is called the σ-field generated by C.
The most commonly used σ-field in this book will be the one generated by the collection C of open subsets of a topological space.³ This σ-field is called the Borel σ-field. It is easy to see that the Borel σ-field B¹ for ℝ is the σ-field generated by the intervals of the form [b, ∞). It is also the σ-field generated by the intervals of the form (−∞, a] and the σ-field generated by the intervals of the form (a, b). Since multidimensional Euclidean spaces are topological spaces, they also have Borel σ-fields.

³A space X is a topological space if it has a collection V of subsets, called a topology, which satisfies the following conditions: ∅ ∈ V, X ∈ V, the intersection of finitely many elements of V is in V, and the union of arbitrarily many elements of V is in V. The sets in V are called open sets.
An alternative way to generate the Borel σ-fields of the spaces ℝᵏ is by means of product spaces. The σ-field generated by all product sets (one factor from each σ-field) in a product space is called the product σ-field. In ℝᵏ, the product σ-field of one-dimensional Borel sets B¹ is the same as the Borel σ-field Bᵏ in the k-dimensional space (Proposition A.13).
Sometimes, we need to extend ℝ to include points at infinity. The extended real numbers are the points in ℝ ∪ {∞, −∞}. The Borel σ-field B⁺ of the extended real numbers consists of B¹ together with all sets of the form B ∪ {∞}, B ∪ {−∞}, and B ∪ {∞, −∞} for B ∈ B¹. It is easy to check that B⁺ is a σ-field. (See Problem 4 on page 603.)


If A is a σ-field of subsets of a set S, then a measure μ on S is a function from A to the nonnegative extended real numbers that satisfies
• μ(∅) = 0,
• {Aₙ}_{n=1}^∞ mutually disjoint implies μ(∪_{i=1}^∞ Aᵢ) = Σ_{i=1}^∞ μ(Aᵢ).
If μ is a measure, the triple (S, A, μ) is called a measure space. If (S, A, μ) is a measure space and μ(S) = 1, then μ is called a probability and (S, A, μ) is called a probability space.
Some examples of measures were given earlier. The Carathéodory extension theorem A.22 shows how to construct measures by first defining countably additive set functions on fields and then extending them to the generated σ-field. Lebesgue measure is defined in this manner by starting with length for unions of disjoint intervals.
Sets with measure zero are ubiquitous in measure theory, so there is a special term that allows us to refer to them more easily. If E is some statement concerning the points in S, and μ is a measure on S, we say that E is true almost everywhere with respect to μ, written a.e. [μ], if the set of s such that E is not true is contained in a set A with μ(A) = 0. If μ is a probability, then almost everywhere is often expressed as almost surely and denoted a.s. [μ].
Example A.6. It is well known that a nondecreasing function can have at most a countable number of discontinuities. Since countable sets have Lebesgue measure (length) 0, it follows that nondecreasing functions are continuous almost everywhere with respect to Lebesgue measure.

Infinite measures are difficult to deal with unless they behave like finite measures in certain important ways. If there exists a countable partition of the set S such that each element of the partition has finite μ measure, then we say that μ is σ-finite. When an abstract measure is mentioned in this text, it will generally be safe to assume that it is σ-finite unless the contrary is clear from context.
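For example, Lebesgue measure on ℝ is σ-finite: the intervals (n, n + 1] for n = 0, ±1, ±2, ... form a countable partition of ℝ, and each has measure 1. Counting measure on an uncountable set, by contrast, is not σ-finite, since every countable partition of such a set must contain at least one infinite set, which has infinite counting measure.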

A.1.2 Measurable Functions


There are certain types of functions with which we will be primarily concerned. Suppose that S is a set with a σ-field A of subsets, and let T be another set with a σ-field C of subsets. Suppose that f : S → T is a function. We say f is measurable if for every B ∈ C, f⁻¹(B) ∈ A. When there are several possible σ-fields of subsets of either S or T, we will need to say explicitly with respect to which σ-field f is measurable. If f is measurable, one-to-one, and onto and f⁻¹ is measurable, we say that f is bimeasurable. If the two sets S and T are topological spaces with Borel σ-fields, a measurable function is called Borel measurable.
As examples, all continuous functions are Borel measurable. But many discontinuous functions are also measurable. For example, step functions are measurable. All monotone functions are measurable. In fact, it is very difficult to describe a nonmeasurable function without using some heavy mathematics.
If S and T are sets, C is a σ-field of subsets of T, and f : S → T is a function, then it is easy to show that f⁻¹(C) is a σ-field of subsets of S. In fact, it is the smallest σ-field of subsets of S such that f is measurable, and it is called the σ-field generated by f.

Some useful properties of measurable functions are in Theorem A.38. To summarize, multivariate functions with measurable coordinates are measurable; compositions of measurable functions are measurable; sums, products, and ratios of measurable functions are measurable; limits, suprema, and infima of sequences of measurable functions are measurable.
As an application of the preceding results, we have Theorem A.42, which says that one function g is a function of another f if and only if g is measurable with respect to the σ-field generated by f.
Many theorems about measurable functions are proven first for a special class of
measurable functions called simple functions and then extended to all measurable
functions using some limit theorems. A measurable function f is called simple
if it assumes only finitely many distinct values. The most fundamental limit
theorem is Theorem A.41, which says that every nonnegative measurable function
can be approached from below (pointwise) by a sequence of nonnegative simple
functions.

A.1.3 Integration
The integral of a function with respect to a measure is a way to generalize the Riemann integral. The interested readers should be able to convince themselves that the integral as defined here is an extension of the Riemann integral. That is, if the Riemann integral of a function over a closed and bounded interval exists, then so does the integral as defined here, and the two are equal. We define the integral in stages. We start with nonnegative simple functions. If f is a nonnegative simple function represented as f(s) = Σ_{i=1}^n aᵢ I_{Aᵢ}(s), with the aᵢ distinct and the Aᵢ mutually disjoint, then the integral of f with respect to μ is ∫ f(s)dμ(s) = Σ_{i=1}^n aᵢ μ(Aᵢ). If 0 times ∞ occurs in such a sum, the result is 0 by convention. The integral of a nonnegative simple function is allowed to be ∞.
For general nonnegative measurable functions, we define the integral of f with respect to μ as ∫ f(s)dμ(s) = sup_{g≤f, g simple} ∫ g(s)dμ(s). For general functions f, let f⁺(s) = max{f(s), 0} and f⁻(s) = −min{f(s), 0} (the positive and negative parts of f, respectively). Then f(s) = f⁺(s) − f⁻(s). The integral of f with respect to μ is

∫ f(s)dμ(s) = ∫ f⁺(s)dμ(s) − ∫ f⁻(s)dμ(s),

if at least one of the two integrals on the right is finite. If both are infinite, the integral is undefined. We say that f is integrable if the integral of f is defined and is finite. The integral is defined above in terms of its values at all points in S. Sometimes we wish to consider only a subset A ⊆ S. The integral of f over A with respect to μ is

∫_A f(s)dμ(s) = ∫ I_A(s)f(s)dμ(s).
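For example, let μ be Lebesgue measure and f = 2I_{(0,1]} + 5I_{(1,3]}, a nonnegative simple function. Then ∫ f(s)dμ(s) = 2μ((0,1]) + 5μ((1,3]) = 2 × 1 + 5 × 2 = 12.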

Several important properties of integrals will be needed in this text. Proposition A.49 and Theorem A.53 state a few of the simpler ones, namely that functions that are almost everywhere equal have the same integral, that the integral of a linear combination of functions is the linear combination of the integrals, that smaller functions have smaller integrals, and that two integrable functions that have the same integral over every set are equal almost everywhere. Another useful property, given in Theorem A.54, is that a nonnegative integrable function leads to a new measure ν by means of the equation ν(A) = ∫_A f(s)dμ(s).
The most important theorems concern the interchange of limits with integration. Let {fₙ}_{n=1}^∞ be a sequence of measurable functions such that fₙ(x) → f(x) a.e. [μ]. The monotone convergence theorem A.52 says that if the fₙ are nonnegative and fₙ(x) ≤ f(x) a.e. [μ], then

lim_{n→∞} ∫ fₙ(x)dμ(x) = ∫ f(x)dμ(x).    (A.7)

The dominated convergence theorem A.57 says that if there exists an integrable function g such that |fₙ(x)| ≤ g(x), a.e. [μ], then (A.7) holds.
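The existence of an integrable dominating function cannot simply be dropped. For example, if μ is Lebesgue measure and fₙ = nI_{(0,1/n)}, then fₙ(x) → 0 for every x, but ∫ fₙ(x)dμ(x) = 1 for all n, so (A.7) fails; the smallest function dominating every fₙ is supₙ fₙ, which is not integrable.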
Part 1 of Theorem A.38 says that measurable functions into each of two measurable spaces combine into a jointly measurable function. Measures and integration can also be extended from several spaces into the product space. For example, suppose that μᵢ is a measure on the space (Sᵢ, Aᵢ) for i = 1, 2. To define a measure on (S₁ × S₂, A₁ ⊗ A₂), we can proceed as follows. For each product set A = A₁ × A₂, define μ₁ × μ₂(A) = μ₁(A₁)μ₂(A₂). The Carathéodory extension theorem A.22 allows us to extend this definition to all of the product space. Lebesgue measure on ℝ², denoted dx dy, is such a product measure. Not every measure on a product space is a product measure. Product probability measures will correspond to independent random variables.
Extending integration to product spaces proceeds through two famous theorems. Tonelli's theorem A.69 says that a nonnegative function f satisfies

∫ f(x, y)dμ₁ × μ₂(x, y) = ∫ [∫ f(x, y)dμ₁(x)] dμ₂(y) = ∫ [∫ f(x, y)dμ₂(y)] dμ₁(x).

Fubini's theorem A.70 says that the same equations hold if f is integrable with respect to μ₁ × μ₂. These results also extend to finite product spaces S₁ × ··· × Sₙ.
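For example, if μ₁ = μ₂ is Lebesgue measure and f(x, y) = I_{(0,1]}(x)I_{(0,2]}(y), then ∫ f(x, y)dμ₂(y) = 2I_{(0,1]}(x), and integrating again gives 2; in the other order, ∫ f(x, y)dμ₁(x) = I_{(0,2]}(y), and integrating again also gives 2, which is μ₁ × μ₂((0,1] × (0,2]).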

A.1.4 Absolute Continuity

A special type of relationship between two measures on the same space is called absolute continuity. If μ₁ and μ₂ are two measures on the same space, we say that μ₂ is absolutely continuous with respect to μ₁, denoted μ₂ ≪ μ₁, if μ₁(A) = 0 implies μ₂(A) = 0. When μ₂ ≪ μ₁, we say that μ₁ is a dominating measure for μ₂. Here are some examples:
Example A.8.
• Let f be any nonnegative measurable function and let μ₁ be a measure. Define μ₂(A) = ∫_A f(s)dμ₁(s). (See Theorem A.54.) Then, μ₂ ≪ μ₁.
• Let S be the natural numbers and let a₁, a₂, ... be any sequence of nonnegative numbers. Define μ₁ to be counting measure on S, and let μ₂(A) = Σ_{i∈A} aᵢ. Then μ₂ ≪ μ₁.
• Let μ₁, μ₂, ... be a collection of measures on the same space (S, A). Let a₁, a₂, ... be a collection of positive numbers. Then μ = Σ_{i=1}^∞ aᵢμᵢ is a measure and μᵢ ≪ μ for all i.
The last example above is important because it tells us that for every countable collection of measures, there is a single measure such that all measures in the collection are absolutely continuous with respect to it.
The Radon-Nikodym theorem A.74 says that the first part of Example A.8 is the most general form of absolute continuity with respect to σ-finite measures. That is, if μ₁ is σ-finite and μ₂ ≪ μ₁, then there exists an extended real-valued measurable function f such that μ₂(A) = ∫_A f(x)dμ₁(x). In addition, if g is μ₂ integrable, then ∫ g(x)dμ₂(x) = ∫ g(x)f(x)dμ₁(x). The function f is called the Radon-Nikodym derivative of μ₂ with respect to μ₁ and is usually denoted (dμ₂/dμ₁)(s).
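For example, if μ₁ is Lebesgue measure on ℝ and μ₂ is a probability having a density f in the usual calculus sense, such as the standard normal density f(x) = (2π)^{−1/2} exp(−x²/2), then μ₂(A) = ∫_A f(x)dμ₁(x), so μ₂ ≪ μ₁ and (dμ₂/dμ₁)(x) = f(x). In this case ∫ g(x)dμ₂(x) reduces to the familiar ∫ g(x)f(x)dx.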
A similar theorem, A.81, relates integrals with respect to measures on two different spaces. It says that a function f : S₁ → S₂ induces a measure on the range S₂. If μ₁ is a measure on S₁, then define μ₂(A) = μ₁(f⁻¹(A)). Integrals with respect to μ₂ can be written as integrals with respect to μ₁ in the following way: ∫ g(y)dμ₂(y) = ∫ g(f(x))dμ₁(x). The measure μ₂ is called the measure induced on S₂ by f from μ₁.
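For example, let μ₁ be Lebesgue measure on S₁ = (0, 1] and f(x) = x². For 0 ≤ a < b ≤ 1, f⁻¹((a, b]) = (√a, √b], so the induced measure satisfies μ₂((a, b]) = √b − √a.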

A.2 Measures
A measure is a way of assigning numerical values to the "sizes" of sets. The collection of sets whose sizes are given by a measure is a σ-field. (See Examples A.4 and A.5 on page 571.)
Definition A.9. A nonempty collection of subsets A of a set S is called a field if
• A ∈ A implies⁴ Aᶜ ∈ A,
• A₁, A₂ ∈ A implies A₁ ∪ A₂ ∈ A.
A field A is called a σ-field if {Aₙ}_{n=1}^∞ ∈ A implies ∪_{i=1}^∞ Aᵢ ∈ A.

⁴The symbol Aᶜ stands for the complement of the set A.
Proposition A.10. Let N be an arbitrary set of indices, and let Y = {A_α : α ∈ N} be an arbitrary collection of σ-fields of subsets of a set S. Then ∩_{α∈N} A_α is also a σ-field of subsets of S.
Because of Proposition A.10 and the fact that 2^S is a σ-field, it is easy to see that, for every collection of subsets C of S, there is a smallest σ-field A that contains C, namely the intersection of all σ-fields that contain C.
Definition A.11. Let C be the collection of intervals in ℝ. The smallest σ-field containing C is called the Borel σ-field. In general, if S is a topological space, and B is the smallest σ-field that contains all of the open sets, then B is called the Borel σ-field.


In addition to the Borel σ-field, the product σ-field is also generated by a simple collection of sets.
Definition A.12.
• Let N be an index set, and let {S_α}_{α∈N} be a collection of sets. Define S = Π_{α∈N} S_α. We call S a product space.
• For each α ∈ N, let A_α be a σ-field of subsets of S_α. Define the product σ-field as follows: ⊗_{α∈N} A_α is the smallest σ-field that contains all sets of the form Π_{α∈N} A_α, where A_α ∈ A_α for all α and all but finitely many A_α are equal to S_α.
In the special case in which N = {1, 2}, we use the notation S = S₁ × S₂, and the product σ-field is denoted A₁ ⊗ A₂.
Proposition A.13.⁵ The Borel σ-field Bᵏ of ℝᵏ is the same as the product σ-field of k copies of (ℝ, B¹).

⁵This proposition is used in the proof of Theorem A.38.
There are other types of collections of sets that are related to σ-fields. Sometimes it is easier to prove results about these other collections and then use the theorems that follow to infer similar results about σ-fields.
Definition A.14. Let S be a set. A collection Π of subsets of S is called a π-system if A, B ∈ Π implies A ∩ B ∈ Π. A collection A is called a λ-system if S ∈ A, A ∈ A implies Aᶜ ∈ A, and {Aₙ}_{n=1}^∞ ∈ A with Aᵢ ∩ Aⱼ = ∅ for i ≠ j implies ∪_{i=1}^∞ Aᵢ ∈ A.
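For example, the collection of all intervals of the form (−∞, a] is a π-system, since (−∞, a] ∩ (−∞, b] = (−∞, min{a, b}]; so is the collection of all product sets in a product space, since the intersection of two product sets is again a product set.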
As in Proposition A.10, the intersection of arbitrarily many π-systems is a π-system, and so too with λ-systems. The following propositions are also easy to prove.
Proposition A.15. If S is a set and C is a collection of subsets of S such that C is a π-system and a λ-system, then C is a σ-field.
Proposition A.16. If S is a set and A is a λ-system of subsets, then A, A ∩ B ∈ A implies A ∩ Bᶜ ∈ A.

The following lemma is the key to a useful uniqueness theorem.


Lemma A.17 (π-λ theorem).⁶ Suppose that Π is a π-system, that A is a λ-system, and that Π ⊆ A. Then the smallest σ-field containing Π is contained in A.

⁶This lemma is used in the proofs of Theorems A.26 and B.46 and Lemma A.61.
PROOF. Define λ(Π) to be the smallest λ-system containing Π, and define σ(Π) to be the smallest σ-field containing Π. For each A ⊆ S, define G_A to be the collection of all sets B ⊆ S such that A ∩ B ∈ λ(Π).
First, we show that G_A is a λ-system for each A ∈ λ(Π). To see this, note that A ∩ S ∈ λ(Π), so S ∈ G_A. If B ∈ G_A, then A ∩ B ∈ λ(Π), and Proposition A.16 says that A ∩ Bᶜ ∈ λ(Π), so Bᶜ ∈ G_A. Finally, {Bₙ}_{n=1}^∞ ∈ G_A with the Bₙ disjoint implies that A ∩ Bₙ ∈ λ(Π) with the A ∩ Bₙ disjoint, so their union is in λ(Π). But their union is A ∩ (∪_{n=1}^∞ Bₙ). So ∪_{n=1}^∞ Bₙ ∈ G_A.
Next, we show that λ(Π) ⊆ G_C for every C ∈ λ(Π). Let A, B ∈ Π, and notice that A ∩ B ∈ Π, so B ∈ G_A. Since G_A is a λ-system containing Π, it must contain λ(Π). It follows that A ∩ C ∈ λ(Π) for all C ∈ λ(Π). If C ∈ λ(Π), it then follows that A ∈ G_C. So, Π ⊆ G_C for all C ∈ λ(Π). Since G_C is a λ-system containing Π, it must contain λ(Π).
Finally, if A, B ∈ λ(Π), we just proved that B ∈ G_A, so A ∩ B ∈ λ(Π) and hence λ(Π) is also a π-system. By Proposition A.15, λ(Π) is a σ-field containing Π and hence must contain σ(Π). Since λ(Π) ⊆ A, the proof is complete. □
We are now in a position to give a precise definition of measure.
Definition A.18.
• A pair (S, A), where S is a set and A is a σ-field, is called a measurable space.
• A function μ : A → [0, ∞] is called a measure if
  – μ(∅) = 0,
  – {Aₙ}_{n=1}^∞ mutually disjoint implies μ(∪_{i=1}^∞ Aᵢ) = Σ_{i=1}^∞ μ(Aᵢ).
• A function μ : A → [−∞, ∞] that satisfies the above two conditions and does not assume both of the values ∞ and −∞ is called a signed measure.⁷
• If μ is a measure, the triple (S, A, μ) is called a measure space.
• If (S, A, μ) is a measure space and μ(S) = 1, then μ is called a probability and (S, A, μ) is called a probability space.

⁷Signed measures will only be used in Section A.6.
Some examples of measures were given in Section A.1.
Theorem A.19.⁸ If (S, A, μ) is a measure space and {Aₙ}_{n=1}^∞ is a monotone sequence,⁹ then μ(lim_{i→∞} Aᵢ) = lim_{i→∞} μ(Aᵢ) if either of the following holds:
• the sequence is increasing,
• the sequence is decreasing and μ(A₁) < ∞.

⁸This theorem is used in the proofs of Theorems A.50 and B.90 and Lemma A.72.
⁹A sequence of sets {Aₙ}_{n=1}^∞ is monotone if either A₁ ⊆ A₂ ⊆ ··· or A₁ ⊇ A₂ ⊇ ···. In the first case, we say that the sequence is increasing and lim_{n→∞} Aₙ = ∪_{i=1}^∞ Aᵢ. In the second case, we say that the sequence is decreasing and lim_{n→∞} Aₙ = ∩_{i=1}^∞ Aᵢ.
PROOF. If the sequence is increasing, then let B₁ = A₁ and Bₖ = Aₖ \ A_{k−1} for k > 1.¹⁰ Then {Bₙ}_{n=1}^∞ are disjoint and the following are true:

∪_{i=1}^k Bᵢ = Aₖ,   ∪_{i=1}^∞ Bᵢ = lim_{k→∞} Aₖ,   μ(Aₖ) = Σ_{i=1}^k μ(Bᵢ),

lim_{k→∞} μ(Aₖ) = Σ_{i=1}^∞ μ(Bᵢ) = μ(∪_{i=1}^∞ Bᵢ) = μ(lim_{k→∞} Aₖ).

If the sequence is decreasing, then let Bᵢ = Aᵢ \ A_{i+1}, for i = 1, 2, .... It follows that

A₁ = (lim_{k→∞} Aₖ) ∪ (∪_{i=1}^∞ Bᵢ),

and all of the sets on the right-hand side are disjoint. It follows that

Aₖ = A₁ \ ∪_{i=1}^{k−1} Bᵢ,
μ(A₁) = μ(lim_{k→∞} Aₖ) + Σ_{i=1}^∞ μ(Bᵢ),
μ(Aₖ) = μ(A₁) − Σ_{i=1}^{k−1} μ(Bᵢ),
lim_{k→∞} μ(Aₖ) = μ(A₁) − Σ_{i=1}^∞ μ(Bᵢ) = μ(lim_{k→∞} Aₖ). □

¹⁰The symbol A \ B is another way of saying A ∩ Bᶜ.

Another useful theorem concerning sequences of sets is the following.


Theorem A.20 (First Borel-Cantelli lemma).¹¹ If Σ_{n=1}^∞ μ(Aₙ) < ∞, then μ(∩_{i=1}^∞ ∪_{n=i}^∞ Aₙ) = 0.
PROOF. Let Bᵢ = ∪_{n=i}^∞ Aₙ and B = ∩_{i=1}^∞ Bᵢ. Since B ⊆ Bᵢ for each i, it follows that μ(B) ≤ μ(Bᵢ) for all i. Since μ(Bᵢ) ≤ Σ_{n=i}^∞ μ(Aₙ), it follows that lim_{i→∞} μ(Bᵢ) = 0. Hence μ(B) = 0. □

¹¹This theorem is used in the proofs of Lemma A.72 and Theorems B.90 and 1.61. There is a second Borel-Cantelli lemma, which involves probability measures, but we will not use it in this text. See Problem 20 on page 663. The set whose measure is the subject of this theorem is sometimes called Aₙ infinitely often because it is the set of points that are in infinitely many of the Aₙ.
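For example, let μ be Lebesgue measure and Aₙ = (0, 2⁻ⁿ). Then Σ_{n=1}^∞ μ(Aₙ) = 1 < ∞, and indeed ∩_{i=1}^∞ ∪_{n=i}^∞ Aₙ = ∩_{i=1}^∞ (0, 2⁻ⁱ) = ∅, which has measure 0.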
Theorem A.22 below is used in several places for extending measures defined on a field to the smallest σ-field containing the field. A definition is required first.
Definition A.21. Let S be a set, A a collection of subsets of S, and μ : A → ℝ ∪ {±∞} a set function. Suppose that S = ∪_{i=1}^∞ Aᵢ with μ(Aᵢ) < ∞ for each i. Then we say μ is σ-finite. If μ is a σ-finite measure on (S, A), then (S, A, μ) is called a σ-finite measure space.
The proof of Theorem A.22 is adapted from Royden (1968).
Theorem A.22 (Carathéodory extension theorem).¹² Let μ be a set function defined on a field C of subsets of a set S that is σ-finite, nonnegative, extended real-valued, and countably additive and satisfies μ(∅) = 0. Then there is a unique extension of μ to a measure on a measure space¹³ (S, A, μ*). (That is, C ⊆ A and μ(A) = μ*(A) for all A ∈ C.)

¹²This theorem is used to prove the existence of many common measures (including product measure) and in the proofs of Lemma A.24 and of Theorems B.118, B.131, and B.133.
¹³The usual statement of this theorem includes the additional claim that the measure space (S, A, μ*) is complete. A measure space is complete if every subset of every set with measure 0 is in the σ-field.
PROOF. The proof will proceed as follows. First, we will define μ* and A. Then we will show that μ* is monotone and subadditive, that C ⊆ A, that A is a σ-field, that μ* is countably additive on A, that μ* extends μ, and finally that μ* is the unique extension.
For each B ∈ 2^S, define

μ*(B) = inf Σ_{i=1}^∞ μ(Aᵢ),    (A.23)

where the inf is taken over all {Aᵢ}_{i=1}^∞ such that B ⊆ ∪_{i=1}^∞ Aᵢ and Aᵢ ∈ C for all i. Let

A = {A ⊆ S : μ*(C) = μ*(C ∩ A) + μ*(C ∩ Aᶜ) for all C ∈ 2^S}.
First, we show that μ* is monotone and subadditive. Clearly, μ*(A) ≤ μ(A) for all A ∈ C, and B₁ ⊆ B₂ implies μ*(B₁) ≤ μ*(B₂). It is also easy to see that μ*(B₁ ∪ B₂) ≤ μ*(B₁) + μ*(B₂) for all B₁, B₂ ∈ 2^S. In fact, if {Bₙ}_{n=1}^∞ ∈ 2^S, then μ*(∪_{i=1}^∞ Bᵢ) ≤ Σ_{i=1}^∞ μ*(Bᵢ). The proof is to notice that the collection of numbers whose inf is μ* of the union includes all of the sums of the numbers whose infima are the μ* values being added together.
Next, we show that C ⊆ A. Let A ∈ C and C ∈ 2^S. Since μ* is subadditive, we only need to show that μ*(C) ≥ μ*(C ∩ A) + μ*(C ∩ Aᶜ). If μ*(C) = ∞, this is clearly true. So let μ*(C) < ∞. From the definition of μ*, for every ε > 0, there exists a collection {Aᵢ}_{i=1}^∞ of elements of C such that Σ_{i=1}^∞ μ(Aᵢ) < μ*(C) + ε. Since μ(Aᵢ) = μ(Aᵢ ∩ A) + μ(Aᵢ ∩ Aᶜ) for every i, we have

μ*(C) + ε > Σ_{i=1}^∞ μ(Aᵢ ∩ A) + Σ_{i=1}^∞ μ(Aᵢ ∩ Aᶜ) ≥ μ*(C ∩ A) + μ*(C ∩ Aᶜ).

Since this is true for every ε > 0, it must be that μ*(C) ≥ μ*(C ∩ A) + μ*(C ∩ Aᶜ), hence A ∈ A.
Next, we show that A is a σ-field. It is clear that ∅ ∈ A and that A ∈ A implies Aᶜ ∈ A by the symmetry in the definition of A. Let A₁, A₂ ∈ A and C ∈ 2^S. We can write

μ*(C) = μ*(C ∩ A₁) + μ*(C ∩ A₁ᶜ)
      = μ*(C ∩ A₁) + μ*(C ∩ A₁ᶜ ∩ A₂) + μ*(C ∩ A₁ᶜ ∩ A₂ᶜ)
      ≥ μ*(C ∩ [A₁ ∪ A₂]) + μ*(C ∩ [A₁ ∪ A₂]ᶜ),

where the first two equalities follow from A₁, A₂ ∈ A, and the last follows from the subadditivity of μ*. So, A₁ ∪ A₂ ∈ A. Let {Aₙ}_{n=1}^∞ ∈ A; then we can write

A = ∪_{i=1}^∞ Aᵢ = ∪_{i=1}^∞ Bᵢ, where each Bᵢ ∈ A and the Bᵢ are disjoint. (This just makes use of complements and finite unions of elements of A being in A.) Let Dₙ = ∪_{i=1}^n Bᵢ and C ∈ 2^S. Since Aᶜ ⊆ Dₙᶜ and Dₙ ∈ A for each n, we have

μ*(C) = μ*(C ∩ Dₙ) + μ*(C ∩ Dₙᶜ)
      ≥ μ*(C ∩ Dₙ) + μ*(C ∩ Aᶜ)
      = Σ_{i=1}^n μ*(C ∩ Bᵢ) + μ*(C ∩ Aᶜ).

Since this is true for every n,

μ*(C) ≥ Σ_{i=1}^∞ μ*(C ∩ Bᵢ) + μ*(C ∩ Aᶜ) ≥ μ*(C ∩ A) + μ*(C ∩ Aᶜ),

where the last inequality follows from subadditivity. So, A is a σ-field.
Next, we show that μ* is countably additive when restricted to A. If A₁, A₂ are disjoint elements of A, then A₁ = (A₁ ∪ A₂) ∩ A₁ and A₂ = (A₁ ∪ A₂) ∩ A₁ᶜ. It follows that

μ*(A₁ ∪ A₂) = μ*(A₁) + μ*(A₂).

By induction, μ* is finitely additive on A. Let A = ∪_{i=1}^∞ Aᵢ, where each Aᵢ ∈ A and the Aᵢ are disjoint. Since ∪_{i=1}^n Aᵢ ⊆ A, we have, for every n, μ*(A) ≥ Σ_{i=1}^n μ*(Aᵢ), which implies μ*(A) ≥ Σ_{i=1}^∞ μ*(Aᵢ). By subadditivity, we get the reverse inequality, hence μ* is countably additive on A.
Next, we prove that μ* extends μ. Let B ∈ C. Since C is a field, any countable cover of B by elements of C can be replaced by a disjoint cover {Aₙ}_{n=1}^∞ ∈ C with a sum no larger, so it suffices to consider disjoint covers. Then B = ∪_{n=1}^∞ (Aₙ ∩ B), and Σ_{n=1}^∞ μ(Aₙ) ≥ Σ_{n=1}^∞ μ(Aₙ ∩ B) = μ(B), since μ is countably additive on C. It follows that μ*(B) ≥ μ(B); since μ*(B) ≤ μ(B) was noted above, μ*(B) = μ(B).
To prove uniqueness, suppose that μ′ also extends μ to A. Then μ′(B) ≤ Σ_{n=1}^∞ μ(Aₙ) if B ⊆ ∪_{n=1}^∞ Aₙ. Hence, μ′(B) ≤ μ*(B) for all B ∈ A. If there exists B such that μ′(B) < μ*(B), let {Aₙ}_{n=1}^∞ ∈ C be disjoint and such that μ(Aₙ) < ∞ and ∪_{n=1}^∞ Aₙ = S. Then, there exists n such that μ′(B ∩ Aₙ) < μ*(B ∩ Aₙ). Since μ′(Aₙ) = μ*(Aₙ), it must be that μ′(Bᶜ ∩ Aₙ) > μ*(Bᶜ ∩ Aₙ), but this is a contradiction. □
Here are some examples:
• Let S = ℝ and let B be the Borel σ-field. Define μ((a, b]) = b − a for intervals, and extend μ to finite unions of disjoint intervals by addition. Theorem A.22 will extend μ to the σ-field B. This measure is called Lebesgue measure on the real line.
• Let F be any monotone increasing function on ℝ which is continuous from the right. Let S = ℝ and let B be the Borel σ-field. Define μ((a, b]) = F(b) − F(a). This can be extended to all of B. In particular, if F is a CDF, then μ is a probability.
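For instance, if F(x) = 0 for x < 0, F(x) = x for 0 ≤ x ≤ 1, and F(x) = 1 for x > 1 (the CDF of the uniform distribution on [0, 1]), then μ((a, b]) = F(b) − F(a) is the length of (a, b] ∩ [0, 1], and the extension of μ to B is a probability.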
In the examples above, the claim was made that μ could be extended to the Borel σ-field. To do this by way of the Carathéodory extension theorem A.22, we need μ to be defined on a field, countably additive, and σ-finite. For the cases described above, this can be arranged as follows. Suppose that μ is defined on intervals of the form (a, b] with a = −∞ and/or b = ∞ possible.¹⁴ The collection C of all unions of finitely many disjoint intervals of this form is easily seen to be a field. If (a₁, b₁], ..., (aₙ, bₙ] are mutually disjoint, set

μ(∪_{i=1}^n (aᵢ, bᵢ]) = Σ_{i=1}^n μ((aᵢ, bᵢ]).

It is not hard to see that this extension of μ to C is well defined. This means that if ∪_{i=1}^n (aᵢ, bᵢ] = ∪_{i=1}^m (cᵢ, dᵢ], where (c₁, d₁], ..., (cₘ, dₘ] are also mutually disjoint, then Σ_{i=1}^n μ((aᵢ, bᵢ]) = Σ_{i=1}^m μ((cᵢ, dᵢ]). If μ is finite for every interval, then it is σ-finite. To see that μ is countably additive on C, suppose that μ((a, b]) = F(b) − F(a), where F is nondecreasing and continuous from the right. If {(aₙ, bₙ]}_{n=1}^∞ is a sequence of disjoint intervals and (a, b] is an interval such that ∪_{n=1}^∞ (aₙ, bₙ] ⊆ (a, b], then it is not difficult to see that Σ_{n=1}^∞ μ((aₙ, bₙ]) ≤ μ((a, b]). If (a, b] ⊆ ∪_{n=1}^∞ (aₙ, bₙ], we can also prove that Σ_{n=1}^∞ μ((aₙ, bₙ]) ≥ μ((a, b]) (see Problem 7 on page 603). Together these facts will imply that μ is countably additive on C.

¹⁴If b = ∞, we mean (a, ∞) by (a, b]. That is, we do not intend ∞ to be a point in the space S.
The proof of Theorem A.22 leads us to the following useful result. Its proof is adapted from Halmos (1950).
Lemma A.24.¹⁵ Let (S, A, μ) be a σ-finite measure space. Suppose that C is a field such that A is the smallest σ-field containing C. Then, for every A ∈ A and ε > 0, there is C ∈ C such that μ(C △ A) < ε.¹⁶

¹⁵This lemma is used in the proof of the Kolmogorov zero-one law B.68.
¹⁶The symbol △ here refers to the symmetric difference operator on pairs of sets. We define C △ A to be (C ∩ Aᶜ) ∪ (Cᶜ ∩ A).

PROOF. Clearly, μ and C satisfy the conditions of Theorem A.22, so that μ is equal to the μ* in the proof of that theorem. Let A ∈ A and ε > 0 be given. It follows from (A.23) that there exists a sequence {Aᵢ}_{i=1}^∞ in C such that A ⊆ ∪_{i=1}^∞ Aᵢ and

μ(A) > Σ_{i=1}^∞ μ(Aᵢ) − ε/2.

Since μ is countably additive,

μ(∪_{i=1}^∞ Aᵢ) ≤ Σ_{i=1}^∞ μ(Aᵢ) < μ(A) + ε/2,

so that there exists n such that

Σ_{i=n+1}^∞ μ(Aᵢ) < ε/2.

Let C = ∪_{i=1}^n Aᵢ, which is clearly in C. Now

μ(C ∩ Aᶜ) ≤ μ(∪_{i=1}^∞ Aᵢ) − μ(A) < ε/2.

Similarly,

μ(Cᶜ ∩ A) ≤ μ(∪_{i=n+1}^∞ Aᵢ) ≤ Σ_{i=n+1}^∞ μ(Aᵢ) < ε/2.

It now follows that μ(A △ C) < ε. □


Sets with measure zero are ubiquitous in measure theory, so there is a special definition that allows us to refer to them more easily.
Definition A.25. Let E be some statement concerning the points in S such that for each point s ∈ S, E is either true or false but not both. Suppose that there exists a set A ∈ A such that μ(A) = 0 and that for all s ∈ Aᶜ, E is true. Then we say that E is true almost everywhere with respect to μ, written a.e. [μ]. If μ is a probability, then almost everywhere is often expressed as almost surely and denoted a.s. [μ].
The following theorem implies uniqueness of measures with certain properties.
Theorem A.26.¹⁷ Suppose that μ₁ and μ₂ are measures on (S, A) and A is the smallest σ-field containing the π-system Π. If μ₁ and μ₂ are both σ-finite on Π and they agree on Π, then they agree on A.

¹⁷This theorem is used in the proofs of Theorems B.32, B.46, B.118, B.131, and 1.115, Lemma A.64, and Corollary B.44.

PROOF. First, let C ∈ Π be such that μ₁(C) = μ₂(C) < ∞, and define G_C to be the collection of all B ∈ A such that μ₁(B ∩ C) = μ₂(B ∩ C). Using simple properties of measures, we see that G_C is a λ-system that contains Π, hence it equals A by Lemma A.17. (For example, if B ∈ G_C,

μ₁(Bᶜ ∩ C) = μ₁(C) − μ₁(B ∩ C) = μ₂(C) − μ₂(B ∩ C) = μ₂(Bᶜ ∩ C),

so Bᶜ ∈ G_C.)
Next, if μ₁ and μ₂ are not finite, there exists a sequence {Cₙ}_{n=1}^∞ ∈ Π such that μ₁(Cₙ) = μ₂(Cₙ) < ∞, and S = ∪_{n=1}^∞ Cₙ. (Since Π is only a π-system, we cannot assume that the Cₙ are disjoint.) For each A ∈ A,

μⱼ(A) = lim_{n→∞} μⱼ(∪_{i=1}^n [Cᵢ ∩ A]),  j = 1, 2.

Since μⱼ(∪_{i=1}^n [Cᵢ ∩ A]) can be written as a linear combination of values of μⱼ at sets of the form A ∩ C, where C ∈ Π is the intersection of finitely many of C₁, ..., Cₙ, it follows from A ∈ G_C that μ₁(∪_{i=1}^n [Cᵢ ∩ A]) = μ₂(∪_{i=1}^n [Cᵢ ∩ A]) for all n, hence μ₁(A) = μ₂(A). □

A.3 Measurable Functions


There are certain types of functions with which we will be primarily concerned.


Definition A.27. Suppose that S is a set with a σ-field A of subsets, and let T be another set with a σ-field C of subsets. Suppose that f : S → T is a function. We say f is measurable if for every B ∈ C, f⁻¹(B) ∈ A. If f is measurable, one-to-one, and onto and f⁻¹ is measurable, we say that f is bimeasurable. If T = ℝ, the real numbers, and C = B, the Borel σ-field, then if f is measurable, we say that f is Borel measurable.

Proposition A.28. Suppose that (S, A) and (T, C) are measurable spaces. Suppose that f : S → T is a function.
• If A = 2^S, then f is measurable.
• If C = {T, ∅}, then f is measurable.
• If A = {S, ∅}, {y} ∈ C for every y ∈ T, and f is measurable, then f is constant.
As examples, if S = T = ℝ and A = B is the Borel σ-field, then all continuous functions are measurable. But many discontinuous functions are also measurable. For example, step functions are measurable. All monotone functions are measurable. In fact, it is very difficult to describe a nonmeasurable function without using some heavy mathematics.
The following theorems make it easier to show that a function is measurable.
Theorem A.29.¹⁸ Let N, S, and T be arbitrary sets. Let {A_α : α ∈ N} be a collection of subsets of T, and let A be an arbitrary subset of T. Let f : S → T be a function. Then

f⁻¹(∪_{α∈N} A_α) = ∪_{α∈N} f⁻¹(A_α),
f⁻¹(∩_{α∈N} A_α) = ∩_{α∈N} f⁻¹(A_α),
f⁻¹(Aᶜ) = f⁻¹(A)ᶜ.

PROOF. For the union, if s ∈ f⁻¹(∪_{α∈N} A_α), then f(s) ∈ ∪_{α∈N} A_α, hence there exists α such that f(s) ∈ A_α, so s ∈ f⁻¹(A_α) and s ∈ ∪_{α∈N} f⁻¹(A_α). If s ∈ ∪_{α∈N} f⁻¹(A_α), then there exists α such that s ∈ f⁻¹(A_α), hence f(s) ∈ A_α, hence f(s) ∈ ∪_{α∈N} A_α, hence s ∈ f⁻¹(∪_{α∈N} A_α). This proves the first equality. The second is almost identical in that "there exists α" is merely replaced by "for all α" in the above proof. For the complement, if s ∈ f⁻¹(Aᶜ), then f(s) ∈ Aᶜ and f(s) ∉ A. Hence, s ∉ f⁻¹(A) and s ∈ f⁻¹(A)ᶜ. If s ∈ f⁻¹(A)ᶜ, then s ∉ f⁻¹(A) and f(s) ∉ A. So, f(s) ∈ Aᶜ and s ∈ f⁻¹(Aᶜ). □

¹⁸This theorem is used in the proof of Theorem A.34.

Corollary A.30.¹⁹ If S and T are sets and C is a σ-field of subsets of T and f : S → T is a function, then f⁻¹(C) is a σ-field of subsets of S. In fact, it is the smallest σ-field of subsets of S such that f is measurable.

¹⁹This corollary is used in the proof of Theorem A.42, and it is used to define the σ-field generated by a function.

Definition A.31. The σ-field f⁻¹(C) in Corollary A.30 is called the σ-field generated by f.
A measurable function also generates a σ-field of subsets of its image.
Proposition A.32. Let (T, C) be a measurable space. Let U ⊆ T be arbitrary (possibly not even in C). Define C_U = {U ∩ B : B ∈ C}. Then C_U is a σ-field of subsets of U.
Definition A.33. The σ-field C_U in Proposition A.32 is called the restriction of the σ-field C to U. If f : S → T and U = f(S), then C_U is called the image σ-field of f.
Theorem A.34.²⁰ Let (S, A) be a measurable space and let f : S → T be a function. Let C₀ be a nonempty collection of subsets of T, and let C be the smallest σ-field that contains C₀. If f⁻¹(C₀) ⊆ A, then f⁻¹(C) ⊆ A.

²⁰This theorem is used in the proofs of Lemma A.35, Proposition A.36, Corollary A.37, Theorems A.38, B.75, and B.133, and to prove that stochastic processes are measurable.

PROOF. Let C₂ be the collection of all subsets B of T such that f⁻¹(B) ∈ A. By assumption, C₀ ⊆ C₂. We will now prove that C₂ is a σ-field; hence it must contain C, which implies the conclusion of the theorem. Clearly, C₂ is nonempty, since C₀ is nonempty. Let A ∈ C₂. Theorem A.29 implies f⁻¹(Aᶜ) = f⁻¹(A)ᶜ ∈ A, since A is a σ-field. This means that Aᶜ ∈ C₂. Let A₁, A₂, ... ∈ C₂. Then Theorem A.29 implies

f⁻¹(∪_{i=1}^∞ Aᵢ) = ∪_{i=1}^∞ f⁻¹(Aᵢ) ∈ A,

since A is a σ-field. So C₂ is a σ-field. □


To use this theorem to show that a function f : S → T is measurable when T has a σ-field of subsets C, we can find a smaller collection of subsets C₀ such that C is the smallest σ-field containing C₀ and prove that f⁻¹(C₀) ⊆ A. Theorem A.34 would then imply f⁻¹(C) ⊆ A and f is measurable. As an example, consider the next lemma.
Lemma A.35.²¹ Let (S, A) be a measurable space, and let f : S → ℝ be a function. Then f is measurable if and only if f⁻¹((b, ∞)) ∈ A for all b ∈ ℝ.
PROOF. The "only if" part is trivial. For the "if" part, let C₀ be the collection of all subsets of ℝ of the form (b, ∞). The smallest σ-field containing these is the Borel σ-field B, so f⁻¹(B) ⊆ A by Theorem A.34. □

²¹This lemma is used in the proofs of Theorems A.38 and A.74.
There are versions of Lemma A.35 that apply to intervals of the form (−∞, a] and those of the form (a, b), and so on. Similarly, there is a version for general topological spaces.
Proposition A.36.²² Let (S, A) be a measurable space, and let (T, C) be a topological space with Borel σ-field. Then f : S → T is measurable if and only if f⁻¹(C) ∈ A for all open C (or for all closed C).

²²This proposition is used in the proof of Theorem A.38.

Another example of the use of Theorem A.34 is the proof that all continuous functions are measurable. The result follows because the Borel σ-field is the smallest σ-field containing open sets.
Corollary A.37. Let (S, A) and (T, B) be topological spaces with their Borel σ-fields. If f : S → T is continuous, then f is measurable.
Here are some properties of measurable functions that will prove useful.
Theorem A.38. Let (S, A) be a measurable space.
1. Let N be an index set, and let {(T_α, C_α)}_{α∈N} be a collection of measurable spaces. For each α ∈ N, let f_α : S → T_α be a function. Define f : S → Π_{α∈N} T_α by f(s) = {f_α(s)}_{α∈N}. Then f is measurable (with respect to the product σ-field) if and only if each f_α is measurable.
2. If (V, C₁) and (U, C₂) are measurable spaces and f : S → V and g : V → U are measurable, then g(f) : S → U is measurable.
3. Let f and g be measurable functions from S to ℝⁿ, and let a be a constant scalar and let b ∈ ℝⁿ be constant. Then the following functions are also measurable: f + g and af + b. If n = 1, then f·g and f/g are also measurable, where f/g can be set equal to an arbitrary constant when g = 0.
4. If, for each n, fₙ is a measurable, extended real-valued function, then supₙ fₙ, infₙ fₙ, limsupₙ fₙ, and liminfₙ fₙ are all measurable.
5. Let (T, C) be a metric space with Borel σ-field. If fₖ : S → T is a measurable function for each k = 1, 2, ... and lim_{k→∞} fₖ(s) = f(s) for all s, then f is measurable.
6. Let (T, C) be a metric space with Borel σ-field, and let μ be a measure on (S, A). If fₖ : S → T is a measurable function for each k = 1, 2, ... and lim_{k→∞} fₖ(s) exists a.e. [μ], then there is a measurable f : S → T such that lim_{k→∞} fₖ(s) = f(s), a.e. [μ].
PROOF. (1) Suppose that f is measurable. To show that f_α is measurable, let B_α ∈ C_α and let B_β = T_β for β ≠ α. Set C = Π_{β∈N} B_β, which is in the product σ-field, because all but finitely many B_β equal the entire space T_β. Then f_α⁻¹(B_α) = f⁻¹(C). Since f is measurable, f⁻¹(C) ∈ A. Now, suppose that each f_α is measurable, and let B = Π_{α∈N} B_α, with B_α ∈ C_α for all α and all but finitely many B_α (say B_{α₁}, ..., B_{αₙ}) equal to T_α. Then f⁻¹(B) = ∩_{i=1}^n f_{αᵢ}⁻¹(B_{αᵢ}) ∈ A. Since the sets of the form B generate the product σ-field, f⁻¹(B) ∈ A for all B in the product σ-field according to Theorem A.34.
(2) Let A ∈ C₂. We need to prove that g(f)⁻¹(A) ∈ A. First, note that g(f)⁻¹ = f⁻¹(g⁻¹). Since g is measurable, g⁻¹(A) ∈ C₁. Since f is measurable, f⁻¹(g⁻¹(A)) ∈ A. So g(f)⁻¹(A) ∈ A.
(3) The arithmetic parts of the theorem are all similar. They all follow from parts 2 and 1. For example, h(x, y) = x + y is a measurable function from ℝ² to ℝ, so h(f, g) = f + g is measurable. For the quotient, a little more care is needed. Let h(x, y) = x/y when y ≠ 0 and let it be an arbitrary constant when y = 0. Then h is measurable since {(x, y) : y = 0} is in B². It follows that h(f, g) is measurable.
(4) Let f = supₙ fₙ. Then, for each finite b, {s : f(s) ≤ b} = ∩_{n=1}^∞ {s : fₙ(s) ≤ b} ∈ A. Also {s : f(s) = −∞} = ∩_{n=1}^∞ {s : fₙ(s) = −∞} ∈ A, and {s : f(s) = ∞} = ∩_{i=1}^∞ ∪_{n=1}^∞ {s : fₙ(s) > i} ∈ A. Similar arguments work for inf. Since limsupₙ fₙ = inf_k sup_{n≥k} fₙ and liminfₙ fₙ = sup_k inf_{n≥k} fₙ, these are also measurable.
(5) Let d be the metric in T. For each closed set C ∈ C, and each m, let Cₘ = {t : d(t, C) < 1/m}. For each closed C, define

A*(C) = ∩_{m=1}^∞ ∪_{n=1}^∞ ∩_{k=n}^∞ fₖ⁻¹(Cₘ).    (A.39)

It is easy to see that A*(C) ∈ A is the set of all s such that lim_{n→∞} fₙ(s) ∈ C. Obviously, f⁻¹(C) consists of those s such that lim_{n→∞} fₙ(s) ∈ C. Hence, f⁻¹(C) = A*(C) ∈ A, and Proposition A.36 says that f is measurable.
(6) Let G = {s : lim_{k→∞} fₖ(s) does not exist}, and let G ⊆ C with μ(C) = 0. Let t ∈ T, and define f(s) = t for s ∈ C and f(s) = lim_{k→∞} fₖ(s) for s ∈ Cᶜ. Apply part 5 to the restrictions of the functions {fₖ}_{k=1}^∞ to Cᶜ to conclude that f restricted to Cᶜ (call the restriction g) is measurable. If A ∈ C, f⁻¹(A) = g⁻¹(A) ∈ A if t ∉ A and f⁻¹(A) = g⁻¹(A) ∪ C ∈ A if t ∈ A. So f is measurable. □
Part 6 is particularly useful in that it allows us to treat the limit of a sequence
of measurable functions as a measurable function even if the limit only exists
almost everywhere. This is only useful, however, if we can show that functions
that are equal almost everywhere have similar properties.
Many theorems about measurable functions are proven first for a special class of
measurable functions called simple functions and then extended to all measurable
functions using some limit theorems.
Definition A.40. A measurable function f is called simple if it assumes only
finitely many distinct values.
A simple function is often expressed in terms of its values. Let f be a simple function taking values in ℝⁿ for some n. Suppose that {a₁, ..., aₖ} are the distinct values assumed by f, and let Aᵢ = f⁻¹({aᵢ}). Then f(s) = Σ_{i=1}^k aᵢ I_{Aᵢ}(s).
The most fundamental limit theorem is the following.
Theorem A.41. If f is a nonnegative measurable function, then there exists a sequence of simple functions {fᵢ}_{i=1}^∞ such that for all s ∈ S, fᵢ(s) ↑ f(s).
PROOF. For k = 1, ..., i2^i, let A_{k,i} = {s : (k − 1)/2^i ≤ f(s) < k/2^i}. Define A_{0,i} = {s : f(s) ≥ i}. Then A_{0,i}, A_{1,i}, ..., A_{i2^i,i} are disjoint and their union is S. Define

fᵢ(s) = (k − 1)/2^i if s ∈ A_{k,i} for k > 0, and fᵢ(s) = i if s ∈ A_{0,i}.

It is clear that fᵢ(s) ≤ f(s) for all i and s, and each fᵢ is a simple function. Since, for k > 0, A_{k,i} = A_{2k−1,i+1} ∪ A_{2k,i+1}, and A_{0,i} = A_{0,i+1} ∪ A_{i2^{i+1}+1,i+1} ∪ ··· ∪ A_{(i+1)2^{i+1},i+1}, it is easy to see that fᵢ(s) ≤ f_{i+1}(s) for all i and all s. It is also easy to see that, for each s with f(s) < ∞, there exists n such that for i ≥ n, |f(s) − fᵢ(s)| ≤ 2^{−i}; and if f(s) = ∞, then fᵢ(s) = i for all i. Hence fᵢ(s) ↑ f(s). □
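The approximation in this proof is easy to compute: for f(s) < i, fᵢ(s) = ⌊2^i f(s)⌋/2^i. The following small sketch in Python illustrates the pointwise increase of fᵢ(s) toward f(s); the particular choice f = exp is only an illustration, not part of the theorem.

    import math

    def dyadic_approx(f, i):
        # The simple function f_i from the proof of Theorem A.41:
        # f_i(s) = (k - 1)/2^i on A_{k,i} for k > 0, and f_i(s) = i on A_{0,i}.
        def f_i(s):
            y = f(s)
            if y >= i:
                return i          # s is in A_{0,i} = {s : f(s) >= i}
            # (k - 1)/2^i <= y < k/2^i, so k - 1 = floor(y * 2^i)
            return math.floor(y * 2**i) / 2**i
        return f_i

    f = math.exp  # an illustrative nonnegative measurable function
    for i in (1, 2, 4, 8):
        print(i, dyadic_approx(f, i)(1.0))
    # Prints 1, 2, 2.6875, 2.71484375, increasing toward f(1.0) = e = 2.71828...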
The following theorem will be very useful throughout the study of statistics. It says that one function g is a function of another f if and only if g is measurable with respect to the σ-field generated by f.

Theorem A.42. Let (S₁, A₁), (S₂, A₂), and (S₃, A₃) be measurable spaces such that A₃ contains all singletons. Suppose that f : S₁ → S₂ is measurable. Let A_{1f} be the σ-field generated by f. Let T be the image of f and let A_T be the image σ-field of f. Let g : S₁ → S₃ be a measurable function. Then g is A_{1f} measurable if and only if there is a measurable function h : T → S₃ such that for each s ∈ S₁, g(s) = h(f(s)).
PROOF. For the "if" part, assume that there is a measurable h : T → S₃ such that g(s) = h(f(s)) for all s ∈ S₁. Let B ∈ A₃. We need to show that g⁻¹(B) ∈ A_{1f}. Since h is measurable, h⁻¹(B) ∈ A_T, so h⁻¹(B) = T ∩ A for some A ∈ A₂. Since f⁻¹(A) = f⁻¹(T ∩ A) and g⁻¹(B) = f⁻¹(h⁻¹(B)), it follows that g⁻¹(B) = f⁻¹(A) ∈ A_{1f}.
For the "only if" part, assume that g is A_{1f} measurable. For each t ∈ S₃, let Cₜ = g⁻¹({t}). Since g is measurable with respect to A_{1f}, let Aₜ ∈ A₂ be such that Cₜ = f⁻¹(Aₜ). (Such Aₜ exists because of Corollary A.30.) Define h(s) = t for all s ∈ Aₜ ∩ T. (Note that if t₁ ≠ t₂, then A_{t₁} ∩ A_{t₂} ∩ T = ∅, so h is well defined.) To see that g(s) = h(f(s)), let g(s) = t, so that s ∈ Cₜ = f⁻¹(Aₜ). This means that f(s) ∈ Aₜ ∩ T, which in turn implies h(f(s)) = t = g(s).
To see that h is measurable, let A ∈ A₃. We must show that h⁻¹(A) ∈ A_T. Since g is A_{1f} measurable, g⁻¹(A) ∈ A_{1f}, so there is some B ∈ A₂ such that g⁻¹(A) = f⁻¹(B). We will show that h⁻¹(A) = B ∩ T ∈ A_T to complete the proof. If s ∈ h⁻¹(A), then t = h(s) ∈ A and s = f(x) for some x ∈ Cₜ ⊆ g⁻¹(A) = f⁻¹(B), so f(x) ∈ B. Hence, s ∈ B ∩ T. This implies that h⁻¹(A) ⊆ B ∩ T. Lastly, if s ∈ B ∩ T, then s = f(x) for some x ∈ f⁻¹(B) = g⁻¹(A) and h(s) = h(f(x)) = g(x) ∈ A. So, h(s) ∈ A and s ∈ h⁻¹(A). This implies B ∩ T ⊆ h⁻¹(A). □
The condition that A₃ contain singletons is needed to avoid the situation in the following example.
Example A.43. Let S₁ = S₂ = S₃ = ℝ and let A₁ = A₂ be the Borel σ-field, while A₃ is the trivial σ-field. Then every function g : S₁ → S₃ is A_{1f} measurable no matter what f is, for example, g(s) = s. If f(s) = s², then g is not a function of f.

A.4 Integration
The integral of a function with respect to a measure is a way to generalize the notion of weighted average. We define the integral in stages. We start with nonnegative simple functions.
Definition A.44. Let f be a nonnegative simple function represented as f(s) = Σ_{i=1}^n aᵢ I_{Aᵢ}(s), with the aᵢ distinct and the Aᵢ mutually disjoint. Then, the integral of f with respect to μ is ∫ f(s)dμ(s) = Σ_{i=1}^n aᵢ μ(Aᵢ). If 0 times ∞ occurs in such a sum, the result is 0 by convention.
The integral of a nonnegative simple function is allowed to be ∞. It turns out that the formula for the integral of a nonnegative simple function is more general than in Definition A.44.

Proposition A.45.²³ If (S, A, μ) is a measure space, Aᵢ ∈ A and aᵢ ≥ 0 for i = 1, ..., n, and f(s) = Σ_{i=1}^n aᵢ I_{Aᵢ}(s), then ∫ f(s)dμ(s) = Σ_{i=1}^n aᵢ μ(Aᵢ).

²³This proposition is used in the proof of Theorem A.53.
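For example, with μ equal to Lebesgue measure, let f = I_{[0,2]} + I_{[1,3]}, a representation in which the Aᵢ overlap. Proposition A.45 gives ∫ f(s)dμ(s) = μ([0,2]) + μ([1,3]) = 4, which agrees with the value computed from the canonical representation f = 2I_{[1,2]} + I_{[0,1)∪(2,3]} of Definition A.44, namely 2 × 1 + 1 × 2 = 4.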
Next, we consider general nonnegative measurable functions. If f is a nonnegative simple function, then for every nonnegative simple function g ≤ f, it follows easily from Definition A.44 that ∫ g(s)dμ(s) ≤ ∫ f(s)dμ(s). Hence, the following definition contains no contradiction with Definition A.44.
Definition A.46. If f is a nonnegative measurable function, then the integral of f with respect to μ is ∫ f(s)dμ(s) = sup_{g≤f, g simple} ∫ g(s)dμ(s).
For general functions f, define the positive part as f⁺(s) = max{f(s), 0} and define the negative part as f⁻(s) = −min{f(s), 0}. Then f(s) = f⁺(s) − f⁻(s). If f ≥ 0, then f⁻ ≡ 0 and ∫ f⁻(s)dμ(s) = 0; hence the following definition contains no contradiction with the previous definitions.
Definition A.47. If f is a measurable function, then the integral of f with respect to μ is

∫ f(s)dμ(s) = ∫ f⁺(s)dμ(s) − ∫ f⁻(s)dμ(s),

if at least one of the two integrals on the right is finite. If both are infinite, the integral is undefined. We say that f is integrable if the integral of f is defined and is finite.
The integral is defined above in terms of its values at all points in S. Sometimes we wish to consider only a subset of S.
Definition A.48. If A ⊆ S and f is measurable, the integral of f over A with respect to μ is

∫_A f(s)dμ(s) = ∫ I_A(s)f(s)dμ(s).

Here are a few simple facts about integrals.
Proposition A.49. Let (S, A, μ) be a measure space, and let f, g : S → ℝ be measurable.
1. If f = g a.e. [μ], then ∫ f(s)dμ(s) = ∫ g(s)dμ(s) if either integral is defined.
2. If ∫ f(s)dμ(s) is defined and a is a constant, then ∫ af(s)dμ(s) = a ∫ f(s)dμ(s).
3. If f and g are integrable with respect to μ, and f ≤ g, a.e. [μ], then ∫ f(s)dμ(s) ≤ ∫ g(s)dμ(s).
4. If f and g are integrable and ∫_A f(s)dμ(s) = ∫_A g(s)dμ(s) for all A ∈ A, then f = g, a.e. [μ].

The proofs of the next few theorems are essentially borrowed from Royden (1968).
Theorem A.50 (Fatou's lemma).²⁴ Let {fₙ}_{n=1}^∞ be a sequence of nonnegative measurable functions. Then

∫ liminf_{n→∞} fₙ(s)dμ(s) ≤ liminf_{n→∞} ∫ fₙ(s)dμ(s).

²⁴This theorem is used in the proofs of Theorems A.52, A.57, A.60, B.117, and 7.80.

PROOF. Let f(s) = liminf_{n→∞} fₙ(s). Since

∫ f(s)dμ(s) = sup_{simple φ ≤ f} ∫ φ(s)dμ(s),

we need only prove that, for every simple φ ≤ f,

∫ φ(s)dμ(s) ≤ liminf_{n→∞} ∫ fₙ(s)dμ(s).

Since this is clearly true if φ(s) = 0, a.e. [μ], we will assume that μ(A) > 0, where A = {s : φ(s) > 0}. Let φ ≤ f be simple, let ε > 0, and let δ and M be the smallest and largest positive values that φ assumes. For each n, define

Aₙ = {s ∈ A : fₖ(s) > (1 − ε)φ(s), for all k ≥ n}.

Since (1 − ε)φ(s) < f(s) for all s ∈ A, ∪_{n=1}^∞ Aₙ = A and Aₙ ⊆ A_{n+1} for all n. Let Bₙ = A ∩ Aₙᶜ. Then

∫ fₙ(s)dμ(s) ≥ ∫_{Aₙ} fₙ(s)dμ(s) ≥ (1 − ε) ∫_{Aₙ} φ(s)dμ(s).    (A.51)

If μ(Bₙ) = ∞ for n = n₀, then μ(A) = ∞ and ∫ φ(s)dμ(s) = ∞, since φ takes on only finitely many different values. The rightmost integral in (A.51) is at least δμ(Aₙ), which goes to ∞ as n increases, hence liminf_{n→∞} ∫ fₙ(s)dμ(s) = ∞ and the result is true. So, assume μ(Bₙ) < ∞ for all n. Since ∩_{n=1}^∞ Bₙ = ∅, it follows from Theorem A.19 that lim_{n→∞} μ(Bₙ) = 0. So, there exists N such that n > N implies μ(Bₙ) < ε. Since

∫ φ(s)dμ(s) = ∫_A φ(s)dμ(s) = ∫_{Aₙ} φ(s)dμ(s) + ∫_{Bₙ} φ(s)dμ(s) ≤ ∫_{Aₙ} φ(s)dμ(s) + Mε,

(A.51) implies that, for n > N,

∫ fₙ(s)dμ(s) ≥ (1 − ε) ∫ φ(s)dμ(s) − ε(1 − ε)M.

If ∫ φ(s)dμ(s) = ∞, the result is true again. If ∫ φ(s)dμ(s) = K < ∞, then for every n > N,

∫ fₙ(s)dμ(s) ≥ ∫ φ(s)dμ(s) − ε[(1 − ε)M + K],

hence

liminf_{n→∞} ∫ fₙ(s)dμ(s) ≥ ∫ φ(s)dμ(s) − ε[(1 − ε)M + K].

Since this is true for every ε > 0,

liminf_{n→∞} ∫ fₙ(s)dμ(s) ≥ ∫ φ(s)dμ(s). □
Theorem A.52 (Monotone convergence theorem). Let {fₙ}_{n=1}^∞ be a sequence of measurable nonnegative functions, and let f be a measurable function such that fₙ(x) ≤ f(x) a.e. [μ] and fₙ(x) → f(x) a.e. [μ]. Then,

lim_{n→∞} ∫ fₙ(x)dμ(x) = ∫ f(x)dμ(x).

PROOF. Since fₙ ≤ f a.e. [μ] for all n, ∫ fₙ(x)dμ(x) ≤ ∫ f(x)dμ(x) for all n. Hence

liminf_{n→∞} ∫ fₙ(x)dμ(x) ≤ limsup_{n→∞} ∫ fₙ(x)dμ(x) ≤ ∫ f(x)dμ(x).

By Fatou's lemma A.50, ∫ f(x)dμ(x) ≤ liminf_{n→∞} ∫ fₙ(x)dμ(x). □


Theorem A.53. If ∫ f(s)dμ(s) and ∫ g(s)dμ(s) are defined and they are not both infinite and of opposite signs, then ∫ [f(s) + g(s)]dμ(s) = ∫ f(s)dμ(s) + ∫ g(s)dμ(s).
PROOF. If f, g ≥ 0, then by Theorem A.41, there exist sequences of nonnegative simple functions {fₙ}_{n=1}^∞ and {gₙ}_{n=1}^∞ such that fₙ ↑ f and gₙ ↑ g. Then (fₙ + gₙ) ↑ (f + g) and ∫ [fₙ(s) + gₙ(s)]dμ(s) = ∫ fₙ(s)dμ(s) + ∫ gₙ(s)dμ(s) by Proposition A.45. The result now follows from the monotone convergence theorem A.52. For integrable f and g, note that (f + g)⁺ + f⁻ + g⁻ = (f + g)⁻ + f⁺ + g⁺. What we just proved for nonnegative functions implies that

∫ (f + g)⁺(s)dμ(s) + ∫ f⁻(s)dμ(s) + ∫ g⁻(s)dμ(s)
= ∫ [(f + g)⁺(s) + f⁻(s) + g⁻(s)]dμ(s)
= ∫ [(f + g)⁻(s) + f⁺(s) + g⁺(s)]dμ(s)
= ∫ (f + g)⁻(s)dμ(s) + ∫ f⁺(s)dμ(s) + ∫ g⁺(s)dμ(s).

Rearranging the terms in the first and last expressions gives the desired result. If both f and g have infinite integral of the same sign, then it follows easily using Proposition A.49 that f + g has infinite integral of the same sign. Finally, if only one of f and g has infinite integral, it also follows easily from Proposition A.49 that f + g has infinite integral of the same sign. □
A nonnegative function can be used to create a new measure.
Theorem A.54. Let (S, A, μ) be a measure space, and let f : S → ℝ be nonnegative and measurable. Then ν(A) = ∫_A f(s)dμ(s) is a measure on (S, A).
PROOF. Clearly, ν is nonnegative and ν(∅) = 0, since f(s)I_∅(s) = 0, a.e. [μ]. Let {Aₙ}_{n=1}^∞ be disjoint. For each n, define gₙ(s) = f(s)I_{Aₙ}(s) and fₙ(s) = Σ_{i=1}^n gᵢ(s). Define A = ∪_{n=1}^∞ Aₙ. Then 0 ≤ fₙ ≤ f I_A, a.e. [μ] and fₙ converges to f I_A, a.e. [μ]. So, the monotone convergence theorem A.52 says that

lim_{n→∞} ∫ fₙ(s)dμ(s) = ν(A).    (A.55)

Also, ν(Aᵢ) = ∫ gᵢ(s)dμ(s), for each i. It follows from Theorem A.53 that

∫ fₙ(s)dμ(s) = Σ_{i=1}^n ∫ gᵢ(s)dμ(s) = Σ_{i=1}^n ν(Aᵢ).    (A.56)

Take the limit as n → ∞ of the first and last terms in (A.56) and compare to (A.55) to see that ν is countably additive. □

Theorem A.57 (Dominated convergence theorem). Let {fₙ}_{n=1}^∞ be a sequence of measurable functions, and let f and g be measurable functions such that fₙ(x) → f(x) a.e. [μ], |fₙ(x)| ≤ g(x) a.e. [μ], and ∫ g(x)dμ(x) < ∞. Then,

lim_{n→∞} ∫ fₙ(x)dμ(x) = ∫ f(x)dμ(x).

PROOF. We have −g(x) ≤ fₙ(x) ≤ g(x) a.e. [μ], hence

g(x) + fₙ(x) ≥ 0, a.e. [μ],
g(x) − fₙ(x) ≥ 0, a.e. [μ],
lim_{n→∞} [g(x) + fₙ(x)] = g(x) + f(x) a.e. [μ],
lim_{n→∞} [g(x) − fₙ(x)] = g(x) − f(x) a.e. [μ].

It follows from Fatou's lemma A.50 and Theorem A.53 that

∫ [g(x) + f(x)]dμ(x) ≤ liminf_{n→∞} ∫ [g(x) + fₙ(x)]dμ(x) = ∫ g(x)dμ(x) + liminf_{n→∞} ∫ fₙ(x)dμ(x),

hence ∫ f(x)dμ(x) ≤ liminf_{n→∞} ∫ fₙ(x)dμ(x). Similarly, it follows that

∫ [g(x) − f(x)]dμ(x) ≤ liminf_{n→∞} ∫ [g(x) − fₙ(x)]dμ(x) = ∫ g(x)dμ(x) − limsup_{n→∞} ∫ fₙ(x)dμ(x),

hence ∫ f(x)dμ(x) ≥ limsup_{n→∞} ∫ fₙ(x)dμ(x). Together, these imply the conclusion of the theorem. □


An alternate version of the dominated convergence theorem is the following.
Proposition A.58.²⁵ Let {fₙ}_{n=1}^∞, {gₙ}_{n=1}^∞ be sequences of measurable functions such that |fₙ(x)| ≤ gₙ(x), a.e. [μ]. Let f and g be measurable functions such that lim_{n→∞} fₙ(x) = f(x) and lim_{n→∞} gₙ(x) = g(x), a.e. [μ]. Suppose that lim_{n→∞} ∫ gₙ(x)dμ(x) = ∫ g(x)dμ(x) < ∞. Then, lim_{n→∞} ∫ fₙ(x)dμ(x) = ∫ f(x)dμ(x).
The proof is the same as the proof of Theorem A.57, except that gₙ replaces g in the first three lines and wherever g appears with fₙ and a limit is being taken.
For σ-finite measure spaces, the minimal condition that guarantees convergence of integrals is uniform integrability.
Definition A.59. A sequence of integrable functions {fₙ}_{n=1}^∞ is uniformly integrable (with respect to μ) if lim_{c→∞} supₙ ∫_{{x:|fₙ(x)|>c}} |fₙ(x)|dμ(x) = 0.

²⁵This proposition is used in the proof of Scheffé's theorem B.79.
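For example, let μ be Lebesgue measure on (0, 1) and fₙ = nI_{(0,1/n)}. For every c > 0 and every n > c, {x : |fₙ(x)| > c} = (0, 1/n) and ∫_{(0,1/n)} |fₙ(x)|dμ(x) = 1, so the sup in the definition never falls below 1 and {fₙ}_{n=1}^∞ is not uniformly integrable. This is the same sequence for which fₙ → 0 a.e. [μ] while ∫ fₙ(x)dμ(x) = 1 for all n.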

Theorem A.60.²⁶ Let μ be a finite measure. Let {fₙ}_{n=1}^∞ be a sequence of integrable functions such that lim_{n→∞} fₙ = f a.e. [μ]. Then lim_{n→∞} ∫ fₙ(x)dμ(x) = ∫ f(x)dμ(x) if {fₙ}_{n=1}^∞ is uniformly integrable.²⁷

²⁶This theorem is used in the proofs of Theorems 1.121 and B.118.
²⁷One could replace "if" by "if and only if," but we will never need the "only if" part of the theorem in this book.

PROOF. Let fₙ⁺, fₙ⁻, f⁺, and f⁻ be the positive and negative parts of fₙ and f. We will prove that the result holds for nonnegative functions and take the difference to get the general result. Let ε > 0 and let c be large enough so that supₙ ∫_{{x:fₙ(x)>c}} fₙ(x)dμ(x) < ε. The functions

gₙ(x) = fₙ(x) if fₙ(x) ≤ c,  gₙ(x) = c if fₙ(x) > c

converge a.e. [μ] to

g(x) = f(x) if f(x) ≤ c,  g(x) = c if f(x) > c.

We now have

∫ f(x)dμ(x) ≥ ∫ g(x)dμ(x) = lim_{n→∞} ∫ gₙ(x)dμ(x) ≥ limsup_{n→∞} ∫ fₙ(x)dμ(x) − ε,

where the equality follows from the dominated convergence theorem A.57 and the final inequality from our choice of c. Since this is true for every ε, we have ∫ f(x)dμ(x) ≥ limsup_{n→∞} ∫ fₙ(x)dμ(x). Combining this with Fatou's lemma A.50 gives

∫ f(x)dμ(x) = lim_{n→∞} ∫ fₙ(x)dμ(x). □

A.5 Product Spaces


In Definition A.12, we introduced product spaces and product σ-fields. We would like to be able to define measures on (S₁ × S₂, A₁ ⊗ A₂) in terms of measures on (S₁, A₁) and (S₂, A₂). The derivation of product measure given here resembles the derivation in Billingsley (1986, Section 18).
Lemma A.61.²⁸ Let (S₁, A₁, μ₁) and (S₂, A₂, μ₂) be σ-finite measure spaces, and let A₁ ⊗ A₂ be the product σ-field.
• For every B ∈ A₁ ⊗ A₂ and every x ∈ S₁, B_x = {y : (x, y) ∈ B} ∈ A₂, and μ₂(B_x) is a measurable function from (S₁, A₁) to ℝ ∪ {∞}.
• For every B ∈ A₁ ⊗ A₂ and every y ∈ S₂, B^y = {x : (x, y) ∈ B} ∈ A₁, and μ₁(B^y) is a measurable function from (S₂, A₂) to ℝ ∪ {∞}.

²⁸This lemma is used in the proofs of Lemmas A.64 and A.67 and Theorems A.69 and B.46.
PROOF. Clearly, we need only prove one of the two sets of assertions. First, let B = A₁ × A₂ with Aᵢ ∈ Aᵢ for i = 1, 2 and x ∈ S₁. Then

  B_x = { A₂  if x ∈ A₁,
        { ∅   otherwise.

So, B_x ∈ A₂. Let C be the collection of all sets B ⊆ S₁ × S₂ such that B_x ∈ A₂. If B ∈ C, then (B^c)_x = {y : (x,y) ∉ B} = (B_x)^c, so B^c ∈ C. Let {B_n}_{n=1}^∞ ∈ C. Then it is easy to see that

  (∪_{n=1}^∞ B_n)_x = {y : (x,y) ∈ ∪_{n=1}^∞ B_n} = ∪_{n=1}^∞ {y : (x,y) ∈ B_n} = ∪_{n=1}^∞ (B_n)_x ∈ A₂,  (A.62)

so ∪_{n=1}^∞ B_n ∈ C. Clearly, S₁ × S₂ ∈ C, so C is a σ-field containing all product sets; hence it contains A₁ ⊗ A₂. Next, let f_B(x) = µ₂(B_x) for B ∈ A₁ ⊗ A₂. Write S₁ × S₂ = ∪_{n=1}^∞ E_n with E_n = A_{1n} × A_{2n} and µᵢ(A_{in}) < ∞ for all n and i = 1, 2 and with the E_n disjoint. Then let f_{B,n} = µ₂((B ∩ E_n)_x). It follows that f_B = Σ_{n=1}^∞ f_{B,n}. If we can show that f_{B,n} is measurable for each n, then so is f_B, since they are nonnegative and the sum is well defined. If B = B₁ × B₂, then f_{B,n}(x) = I_{A_{1n}∩B₁}(x)µ₂(A_{2n} ∩ B₂), which is a measurable function. Let D be the collection of all sets D ⊆ S₁ × S₂

²⁸This lemma is used in the proofs of Lemmas A.64 and A.67 and Theorems A.69 and B.46.

such that f_{D,n} is measurable. If D ∈ D, then f_{D^c,n}(x) = I_{A_{1n}}(x)µ₂(A_{2n}) − f_{D,n}(x), which is measurable, so D^c ∈ D. If {D_m}_{m=1}^∞ ∈ D with the D_m disjoint, then

  µ₂((∪_{m=1}^∞ (D_m ∩ E_n))_x) = Σ_{m=1}^∞ µ₂((D_m ∩ E_n)_x) = Σ_{m=1}^∞ f_{D_m,n}(x),

which is a measurable function, so ∪_{m=1}^∞ D_m ∈ D. Clearly, S₁ × S₂ ∈ D, so D is a λ-system (see Definition A.14) that contains the π-system of product sets. By the π-λ theorem A.17, D contains A₁ ⊗ A₂.  □
The following corollary to Lemma A.61 is a sort of dual to part 1 of Theorem A.38.

Corollary A.63. Let (S₁,A₁), (S₂,A₂), and (X,B) be measurable spaces. If f : S₁ × S₂ → X is measurable, then for every s₁ ∈ S₁, f_{s₁}(s₂) = f(s₁,s₂) is a measurable function from S₂ to X.

Lemma A.64.²⁹ Suppose that (S₁,A₁,µ₁) and (S₂,A₂,µ₂) are σ-finite measure spaces. For each x ∈ S₁, y ∈ S₂, and B ∈ A₁ ⊗ A₂, define B_x and B^y as in Lemma A.61. Then ν₁(B) = ∫_{S₁} µ₂(B_x)dµ₁(x) and ν₂(B) = ∫_{S₂} µ₁(B^y)dµ₂(y) both define the same measure on (S₁ × S₂, A₁ ⊗ A₂). If Aᵢ ∈ Aᵢ for i = 1, 2, then ν₁(A₁ × A₂) = µ₁(A₁)µ₂(A₂).
PROOF. First, prove that ν₁ is a measure. The proof that ν₂ is a measure is identical. Clearly, ν₁(B) ≥ 0 for all B and ν₁(∅) = 0. If {B_n}_{n=1}^∞ are disjoint, then

  ν₁(∪_{n=1}^∞ B_n) = ∫_{S₁} Σ_{n=1}^∞ µ₂((B_n)_x)dµ₁(x) = Σ_{n=1}^∞ ∫_{S₁} µ₂((B_n)_x)dµ₁(x) = Σ_{n=1}^∞ ν₁(B_n),

where the first equality follows from the definition of ν₁, the fact that µ₂ is countably additive, and (A.62); the second equality follows from the monotone convergence theorem A.52 and the fact that Σ_{n=1}^m µ₂((B_n)_x) ≤ Σ_{n=1}^∞ µ₂((B_n)_x) for all m; and the last equality follows from the definition of ν₁. This proves that ν₁ (and so too ν₂) is a measure. Note that if B = A₁ × A₂, then

  ν₁(B) = ∫_{S₁} I_{A₁}(x)µ₂(A₂)dµ₁(x) = µ₁(A₁)µ₂(A₂) = ∫_{S₂} I_{A₂}(y)µ₁(A₁)dµ₂(y) = ν₂(B).

So ν₁ = ν₂ on the π-system consisting of product sets. Since each of µ₁ and µ₂ is σ-finite, there exists a countable collection of product sets whose union is S₁ × S₂

²⁹This lemma is used in the proof of Lemma A.67.



and such that each one has finite ν₁ = ν₂ measure. By Theorem A.26, ν₁ agrees with ν₂ on all of A₁ ⊗ A₂.  □
Definition A.65. Let (Sᵢ,Aᵢ,µᵢ) for i = 1, 2 be σ-finite measure spaces. Define the product measure µ₁ × µ₂ on (S₁ × S₂, A₁ ⊗ A₂) as the common value of the two measures ν₁ and ν₂ in Lemma A.64.

Lebesgue measure on ℝ², denoted dxdy, is a product measure. Not every measure on a product space is a product measure. Product probability measures will correspond to independent random variables. (See Theorem B.66.)
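For intuition, here is a finite sketch (Python; the spaces, mass vectors, and the set B are made-up illustrative choices). It checks the two iterated formulas of Lemma A.64 against the direct product-mass computation.

```python
import numpy as np

# S1 has three points, S2 has two; mu1 and mu2 are given by mass vectors.
mu1 = np.array([0.5, 1.0, 2.0])        # mu1({x}) for x in S1
mu2 = np.array([0.3, 0.7])             # mu2({y}) for y in S2

B = np.array([[1, 0],                  # indicator of an arbitrary set B,
              [1, 1],                  # rows indexed by x, columns by y
              [0, 1]], dtype=float)

nu1 = (B * mu2).sum(axis=1) @ mu1      # ∫ mu2(B_x) dmu1(x)
nu2 = (B.T * mu1).sum(axis=1) @ mu2    # ∫ mu1(B^y) dmu2(y)
direct = (B * np.outer(mu1, mu2)).sum()
print(nu1, nu2, direct)                # all equal 2.55
```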
Proposition A.66. Let µ be a measure on a product space (S₁ × S₂, A₁ ⊗ A₂). Then µ is a product measure if and only if there exist set functions µᵢ : Aᵢ → ℝ for i = 1, 2 such that, for every A₁ ∈ A₁ and A₂ ∈ A₂, µ(A₁ × A₂) = µ₁(A₁)µ₂(A₂).
Lemma A.67.³⁰ Let f be a measurable function from S₁ × S₂ to ℝ such that either {x ∈ S₁ : ∫ |f(x,y)|dµ₂(y) = ∞} ⊆ A ∈ A₁, where µ₁(A) = 0, or f ≥ 0. Then there is a measurable (possibly extended real-valued) function g : S₁ → ℝ ∪ {±∞} such that g(x) = ∫ f(x,y)dµ₂(y), a.e. [µ₁]. If f is the indicator of a measurable set B, then

  ∫ g(x)dµ₁(x) = µ₁ × µ₂(B).  (A.68)

PROOF. For each B ∈ A₁ ⊗ A₂, note that ∫ I_B(x,y)dµ₂(y) = µ₂(B_x), where B_x is defined in Lemma A.61. It was shown there that µ₂(B_x) is a measurable function of x. It follows from Lemma A.64 that (A.68) holds. It now follows from the linearity of integrals that if f is a nonnegative simple function, then g(x) = ∫ f(x,y)dµ₂(y) is a measurable function of x. If f is a nonnegative measurable function, let {f_n}_{n=1}^∞ be a sequence of nonnegative simple functions such that f_n ≤ f for all n and lim_{n→∞} f_n(x,y) = f(x,y) for all (x,y). Then the monotone convergence theorem A.52 says that lim_{n→∞} ∫ f_n(x,y)dµ₂(y) = ∫ f(x,y)dµ₂(y) = g(x) for all x. By part 5 of Theorem A.38, g is measurable. If µ₁({x ∈ S₁ : ∫ |f(x,y)|dµ₂(y) = ∞}) = 0, then the argument just given applies to both f⁺ and f⁻, and the difference ∫ f⁺(x,y)dµ₂(y) − ∫ f⁻(x,y)dµ₂(y) is defined a.e. [µ₁] and equals ∫ f(x,y)dµ₂(y), a.e. [µ₁]. If we let g(x) = ∫ f⁺(x,y)dµ₂(y) − ∫ f⁻(x,y)dµ₂(y) for all x ∉ A, and let g(x) be constant on A, then g(x) = ∫ f(x,y)dµ₂(y), a.e. [µ₁], and g is measurable.  □
The following two theorems will be used many times in the study of product spaces.

Theorem A.69 (Tonelli's theorem). Let (S₁,A₁,µ₁) and (S₂,A₂,µ₂) be σ-finite measure spaces. Let f : S₁ × S₂ → ℝ be a nonnegative measurable function. Then

  ∫ f(x,y)dµ₁ × µ₂(x,y) = ∫ [∫ f(x,y)dµ₁(x)] dµ₂(y) = ∫ [∫ f(x,y)dµ₂(y)] dµ₁(x).

³⁰This lemma is used in the proofs of Theorem A.70 and of Lemmas 6.48 and B.46.


PROOF. As in the proof of Lemma A.67, let {f_n}_{n=1}^∞ be a sequence of nonnegative simple functions such that f_n ≤ f for all n and lim_{n→∞} f_n(x,y) = f(x,y) for all (x,y). If f_n(x,y) = Σ_{i=1}^{k_n} a_{i,n} I_{B_{i,n}}(x,y), then ∫ f_n(x,y)dµ₂(y) = Σ_{i=1}^{k_n} a_{i,n} µ₂(B_{i,n,x}) by Lemma A.61 and

  ∫ [∫ f_n(x,y)dµ₂(y)] dµ₁(x) = ∫ f_n(x,y)dµ₁ × µ₂(x,y)

by (A.68). Since 0 ≤ ∫ f_n(x,y)dµ₂(y) ≤ ∫ f(x,y)dµ₂(y) for all x and n, and lim_{n→∞} ∫ f_n(x,y)dµ₂(y) = ∫ f(x,y)dµ₂(y) as in the proof of Lemma A.67, it follows from the monotone convergence theorem A.52 that

  ∫ f(x,y)dµ₁ × µ₂(x,y) = lim_{n→∞} ∫ f_n(x,y)dµ₁ × µ₂(x,y)
                        = lim_{n→∞} ∫ [∫ f_n(x,y)dµ₂(y)] dµ₁(x)
                        = ∫ [lim_{n→∞} ∫ f_n(x,y)dµ₂(y)] dµ₁(x)
                        = ∫ [∫ f(x,y)dµ₂(y)] dµ₁(x).

The proof that the iterated integrals can be calculated in the other order is similar.  □

Theorem A.70 (Fubini's theorem). Let (S₁,A₁,µ₁) and (S₂,A₂,µ₂) be σ-finite measure spaces. If f : S₁ × S₂ → ℝ is integrable with respect to µ₁ × µ₂, then

  ∫ f(x,y)dµ₁ × µ₂(x,y) = ∫ [∫ f(x,y)dµ₁(x)] dµ₂(y) = ∫ [∫ f(x,y)dµ₂(y)] dµ₁(x).

PROOF. Let g(x) = ∫ |f(x,y)|dµ₂(y), which is measurable a.e. [µ₁] by Lemma A.67. Then

  ∫ g(x)dµ₁(x) = ∫ [∫ |f(x,y)|dµ₂(y)] dµ₁(x) = ∫ |f(x,y)|dµ₁ × µ₂(x,y) < ∞

follows from Tonelli's theorem A.69 applied to |f|. It follows that {x : ∫ |f(x,y)|dµ₂(y) = ∞} is contained in a set A ∈ A₁ with µ₁(A) = 0. Apply Tonelli's theorem A.69 to f⁺ and f⁻ and note that the set of all x such that ∫ f⁺(x,y)dµ₂(y) − ∫ f⁻(x,y)dµ₂(y) is undefined is a subset of {x : ∫ |f(x,y)|dµ₂(y) = ∞}. It follows that this difference of integrals is defined a.e. [µ₁] and the integral (with respect to µ₁) of the difference (which equals ∫ [∫ f(x,y)dµ₂(y)]dµ₁(x)) is the difference of the integrals (which equals ∫ f(x,y)dµ₁ × µ₂(x,y)).  □
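As a quick illustration of the two theorems (the integrand and grid are our arbitrary choices), the following sketch approximates both iterated integrals and the double integral for a bounded function on [0,1]²; for finite sums the equality is exact, which is the counting-measure case of the result.

```python
import numpy as np

# Midpoint grids in each coordinate; f is an arbitrary bounded (hence
# integrable) function on the unit square.
n = 2000
x = (np.arange(n) + 0.5) / n
y = (np.arange(n) + 0.5) / n
F = np.exp(-np.add.outer(x, y)) * np.sin(3.0 * np.outer(x, y))  # f(x_i, y_j)

double  = F.sum() / n**2                  # ≈ ∫ f d(mu1 × mu2)
y_first = (F.sum(axis=1) / n).sum() / n   # ≈ ∫ [∫ f(x,y) dy] dx
x_first = (F.sum(axis=0) / n).sum() / n   # ≈ ∫ [∫ f(x,y) dx] dy
print(double, y_first, x_first)           # all three agree
```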
All of the results of this section can be extended to finite product spaces S₁ × ⋯ × S_n by simple inductive arguments.

A.6 Absolute Continuity


It is also common to consider two different measures on the same space.

Definition A.71. Let µ₁ and µ₂ be two measures on the same space (S,A). Suppose that, for all A ∈ A, µ₁(A) = 0 implies µ₂(A) = 0. Then we say that µ₂ is absolutely continuous with respect to µ₁, denoted µ₂ ≪ µ₁. When µ₂ ≪ µ₁, we say that µ₁ is a dominating measure for µ₂.

Consider next a function f and a measure µ such that ∫ f(x)dµ(x) is defined. Then ν(A) = ∫_A f(x)dµ(x) is defined for all measurable A. If f takes on negative values with positive measure, then ν is not a measure because it assigns negative values to some sets, such as A = {x : f(x) < 0}. However, ν is still a signed measure.

If one of the two measures is finite, there is a necessary and sufficient condition for absolute continuity which resembles the definition of continuity of functions.
Lemma A.72.³¹ Let µ₁ and µ₂ be measures on a space (S,A). Consider the following condition:

  For every ε > 0, there is δ such that µ₁(A) < δ implies µ₂(A) < ε.  (A.73)

• If condition (A.73) holds, then µ₂ ≪ µ₁.
• If µ₂ ≪ µ₁ and µ₂ is finite, then condition (A.73) holds.

PROOF. For the first part, let ε > 0 and suppose that µ₁(A) = 0. Then µ₁(A) < δ and µ₂(A) < ε. Since this is true for all ε > 0, µ₂(A) = 0. For the second part, suppose that µ₂ ≪ µ₁, that µ₂ is finite, and that (A.73) fails. Then there exists ε > 0 such that, for every integer n, there is A_n with µ₁(A_n) < 1/n² but µ₂(A_n) ≥ ε. Let A = ∩_{k=1}^∞ ∪_{n=k}^∞ A_n. By the first Borel–Cantelli lemma A.20, µ₁(A) = 0, so µ₂(A) = 0. Since µ₂ is finite, Theorem A.19 implies that

  µ₂(A) = lim_{k→∞} µ₂(∪_{n=k}^∞ A_n) ≥ ε.

This is a contradiction.  □
The following theorem says that the first part of Example A.8 on page 574 is the most general form of absolute continuity with respect to σ-finite measures. The proof is mostly borrowed from Royden (1968).

Theorem A.74 (Radon–Nikodym theorem). Let µ₁ and µ₂ be measures on (S,A) such that µ₂ ≪ µ₁ and µ₁ is σ-finite. Then there exists an extended real-valued measurable function f : S → [0,∞] such that for every A ∈ A,

  µ₂(A) = ∫_A f(x)dµ₁(x).  (A.75)

³¹This lemma is used in the proof of Lemma B.119.



Also, if g : S → ℝ is µ₂ integrable, then

  ∫ g(x)dµ₂(x) = ∫ g(x)f(x)dµ₁(x).  (A.76)

The function f is called the Radon–Nikodym derivative of µ₂ with respect to µ₁, and it is unique a.e. [µ₁]. The Radon–Nikodym derivative is sometimes denoted (dµ₂/dµ₁)(s). If µ₂ is σ-finite, then f is finite a.e. [µ₁].
PROOF. First, we prove uniqueness a.e. [µ₁]. Suppose that such an f exists. Let g be another function such that f and g are not a.e. [µ₁] equal. Let A_n = {x : f(x) > g(x) + 1/n} and B_n = {x : f(x) < g(x) − 1/n}. Since f and g are not equal a.e. [µ₁], there exists n such that either µ₁(A_n) > 0 or µ₁(B_n) > 0. Let A be a subset of either A_n or B_n with finite positive measure. Then ∫_A f(x)dµ₁(x) ≠ ∫_A g(x)dµ₁(x). Hence g ≠ dµ₂/dµ₁.
The proof of existence proceeds as follows. First, we show that we can reduce to the case in which µ₁ is finite. Then, we create a collection of signed measures ν_α indexed by a real number α. For each α we find a set A^α such that every subset of A^α has nonnegative ν_α measure and every subset of the complement B^α has nonpositive ν_α measure. We then show that B^β ⊆ B^α for β ≥ α, which allows us to define f(x) = sup{α : x ∈ B^α}. Finally, we show that f satisfies (A.75) and (A.76).

Now, we prove that we need only consider finite µ₁. Since µ₁ is σ-finite, let {A_i}_{i=1}^∞ be disjoint elements of A such that µ₁(A_i) < ∞ and S = ∪_{i=1}^∞ A_i. Let µ_{j,i} be µ_j restricted to A_i for j = 1, 2 and each i. Then µ_{2,i} ≪ µ_{1,i} for each i and each µ_{1,i} is finite. Suppose that for each i we can find f_i as in the theorem with µ_j replaced by µ_{j,i} for j = 1, 2. Then f(x) = Σ_{i=1}^∞ I_{A_i}(x)f_i(x) is the function required by the theorem as stated. Hence, we prove the theorem only for the case in which µ₁ is finite.
Suppose that µ₁ is finite, and define the signed measure ν_α = αµ₁ − µ₂ for each nonnegative rational number α. (Note that ν_α(A) never equals ∞, although it may equal −∞.) For each α, define

  P_α = {A ∈ A : ν_α(B) ≥ 0, for every B ⊆ A},
  λ_α = sup_{A∈P_α} ν_α(A).

That is, λ_α is the supremum of the signed measures of sets all of whose subsets have nonnegative signed measure.³² Since ∅ ∈ P_α, λ_α ≥ 0. Let {A_i}_{i=1}^∞ ∈ P_α be such that λ_α = lim_{i→∞} ν_α(A_i), and let A^α = ∪_{i=1}^∞ A_i. Since every subset of A^α can be written as a union of subsets of the A_i, it follows that A^α ∈ P_α, hence λ_α ≥ ν_α(A^α). Since A^α \ A_i ⊆ A^α, it follows that ν_α(A^α \ A_i) ≥ 0 for all i and ν_α(A^α) = ν_α(A^α \ A_i) + ν_α(A_i) ≥ ν_α(A_i) for all i. It follows that λ_α ≤ ν_α(A^α). Hence λ_α = ν_α(A^α) < ∞. Define B^α = (A^α)^c.

Next, we prove that every subset of B^α has nonpositive signed measure.³³ If not, let B ⊆ B^α be such that ν_α(B) > 0.

³²The sets in P_α are often called the positive sets relative to the signed measure ν_α.
³³Such sets are called negative sets relative to the signed measure ν_α.

If B has no subsets with negative signed measure, then B ∪ A^α ∈ P_α and ν_α(A^α ∪ B) > λ_α, a contradiction. So, let n₁ be the smallest positive integer such that there is a subset B₁ ⊆ B with ν_α(B₁) < −1/n₁. For each k > 1, let n_k be the smallest positive integer such that there exists a subset B_k ⊆ B \ ∪_{i=1}^{k−1} B_i with ν_α(B_k) < −1/n_k. Now, let C = B \ ∪_{k=1}^∞ B_k. Clearly ν_α(C) > 0. If we prove that C has no subsets with negative signed measure, then C ∈ P_α and we have another contradiction. So, suppose that D ⊆ C has ν_α(D) = −ε < 0. Since ν_α(B) > 0, it must be that Σ_{k=1}^∞ ν_α(B_k) > −∞. Hence lim_{k→∞} n_k = ∞. So, there is k such that 1/(n_{k+1} − 1) < ε. Notice that D ⊆ C ⊆ B \ ∪_{i=1}^k B_i. Since ν_α(D) < −1/(n_{k+1} − 1), this contradicts the definition of n_{k+1}.
If β > α, we have

  ν_α(A^α ∩ B^β) ≥ 0,  ν_β(A^α ∩ B^β) ≤ 0.

Subtract the first inequality from the second to get (β − α)µ₁(A^α ∩ B^β) ≤ 0, from which it follows that µ₁(A^α ∩ B^β) = 0. Since ν_β(A) ≥ ν_α(A) for β ≥ α, we can assume that A^α ⊆ A^β if β ≥ α. It follows that B^β ⊆ B^α for β ≥ α, and we can define f(x) = sup{α : x ∈ B^α}. Since B⁰ = S, f(x) ≥ 0 for all x. It is easy to see that f(x) ≥ α if x ∈ B^α and f(x) ≤ α if x ∈ A^α. It is also easy to see that {x : f(x) > b} = ∪_{α>b} B^α. Since this is a countable union of measurable sets, it is measurable. By Lemma A.35, f is measurable.
Next, we prove that (A.75) holds for every A ∈ A. Let A ∈ A be arbitrary and let ε > 0 be given. Let N > µ₁(A)/ε be a positive integer. Define E_k = A ∩ B^{k/N} ∩ A^{(k+1)/N} for k = 0, 1, 2, ... and E_∞ = A \ ∪_{k=1}^∞ A^{k/N}. Then A = ∪_{k=0}^∞ E_k ∪ E_∞ and the E_k are all disjoint. So µ₂(A) = µ₂(E_∞) + Σ_{k=0}^∞ µ₂(E_k). By construction, f(x) ∈ [k/N, (k+1)/N] for all x ∈ E_k and f(x) = ∞ for all x ∈ E_∞. Since ν_{k/N}(E_k) ≤ 0 and ν_{(k+1)/N}(E_k) ≥ 0, we have, for finite k,

  |µ₂(E_k) − ∫_{E_k} f(x)dµ₁(x)| ≤ (1/N)µ₁(E_k).  (A.77)

If µ₁(E_∞) > 0, then µ₂(E_∞) = ∞ since ν_α(E_∞) ≤ 0 for all α. If µ₁(E_∞) = 0, then µ₂(E_∞) = 0 by absolute continuity. Either way, µ₂(E_∞) = ∫_{E_∞} f(x)dµ₁(x). Adding this into the sum of (A.77) over all finite k gives

  |µ₂(A) − ∫_A f(x)dµ₁(x)| ≤ (1/N)µ₁(A) < ε.

Since this is true for every ε > 0, (A.75) is established.


To prove (A.76), we note that it is true if g is an indicator function, hence it is true for all simple functions. By the monotone convergence theorem A.52, it is true for all nonnegative functions, and by subtraction it is true for all integrable functions.

Finally, if f(x) = ∞ for all x ∈ A with µ₁(A) > 0, then µ₂(B) = ∞ for every B ⊆ A such that µ₁(B) > 0. It is now impossible for µ₂ to be σ-finite.  □
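On a finite space the theorem is transparent: the derivative is just a ratio of point masses. The following sketch (Python; the masses are made-up numbers) checks (A.75) on an event and (A.76) for a function.

```python
import numpy as np

# mu2 << mu1 on a four-point space; f = dmu2/dmu1 is the pointwise mass
# ratio, set to 0 off the support of mu1.
mu1 = np.array([0.2, 0.5, 0.0, 1.3])
mu2 = np.array([0.6, 0.25, 0.0, 2.6])            # vanishes wherever mu1 does

f = np.divide(mu2, mu1, out=np.zeros_like(mu2), where=mu1 > 0)

A = np.array([True, False, False, True])
print(mu2[A].sum(), (f * mu1)[A].sum())          # (A.75): both 3.2

g = np.array([1.0, -2.0, 5.0, 0.5])
print((g * mu2).sum(), (g * f * mu1).sum())      # (A.76): both 1.4
```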
In statistical applications, we will often have a class of measures, each of which
is absolutely continuous with respect to a single a-finite measure. It would be
nice if the single dominating measure were in the original class or could be con-
structed from the class. The following theorem addresses this problem. The proof
is borrowed from Lehmann (1986).

Theorem A.78.³⁴ Let µ be a σ-finite measure on (S,A). Suppose that N is a collection of measures on (S,A) such that for every ν ∈ N, ν ≪ µ. Then there exists a sequence of nonnegative numbers {c_i}_{i=1}^∞ and a sequence of elements of N, {ν_i}_{i=1}^∞, such that Σ_{i=1}^∞ c_i = 1 and ν ≪ Σ_{i=1}^∞ c_i ν_i for every ν ∈ N.

PROOF. If N is a countable collection, the result is trivially true. If µ is finite, let λ = µ. If µ is not finite, then there exists a countable partition of S into {S_i}_{i=1}^∞ such that 0 < µ(S_i) = d_i < ∞. For each B ∈ A, let λ(B) = Σ_{i=1}^∞ µ(B ∩ S_i)/(2^i d_i). In either case λ is finite and ν ≪ λ for every ν ∈ N. Define Q to be the collection of all measures of the form Σ_{i=1}^∞ a_i ν_i, where the a_i are nonnegative, Σ_{i=1}^∞ a_i = 1, and each ν_i ∈ N. Clearly β ∈ Q implies β ≪ λ.
Next, let D be the collection of sets C in A such that there exists Q ∈ Q satisfying λ({x ∈ C : dQ/dλ(x) = 0}) = 0 and Q(C) > 0. To see that D is nonempty, let ν be a measure in N that is not identically 0 and let C = {x : dν/dλ(x) > 0}. Then with Q = ν, we have {x ∈ C : dQ/dλ(x) = 0} = ∅ and Q(C) = ν(C) = ν(S) > 0, so C ∈ D. Since λ is finite, sup_{C∈D} λ(C) = c < ∞, so there exist {C_n}_{n=1}^∞ such that lim_{n→∞} λ(C_n) = c and C_n ∈ D for all n. Let C₀ = ∪_{n=1}^∞ C_n and let Q_n ∈ Q be such that Q_n(C_n) > 0 and λ({x ∈ C_n : dQ_n/dλ(x) = 0}) = 0. Let Q₀ = Σ_{n=1}^∞ 2^{−n}Q_n ∈ Q, so that dQ₀/dλ = Σ_{n=1}^∞ 2^{−n}dQ_n/dλ and

  {x ∈ C₀ : dQ₀/dλ(x) = 0} ⊆ ∪_{n=1}^∞ {x ∈ C_n : dQ_n/dλ(x) = 0},

which implies that C₀ ∈ D and λ(C₀) = c.

Since Q₀ ∈ Q, we now need only prove that ν ≪ Q₀ for all ν ∈ N to finish the proof. Suppose that Q₀(A) = 0 and ν ∈ N. We must prove ν(A) = 0. Since Q₀(A ∩ C₀) = 0 and dQ₀/dλ(x) > 0 for almost all [λ] x ∈ C₀, it follows that λ(A ∩ C₀) = 0 and hence ν(A ∩ C₀) = 0. Let C = {x : dν/dλ(x) > 0}. Then ν(A ∩ C₀^c ∩ C^c) = 0 since dν/dλ(x) = 0 for x ∈ C^c. Let D = A ∩ C₀^c ∩ C, which is disjoint from C₀. If λ(D) > 0, then D ∈ D; it follows easily that C₀ ∪ D ∈ D, and λ(C₀ ∪ D) > λ(C₀) contradicts λ(C₀) = c. Hence λ(D) = 0 and ν(D) = 0, which implies ν(A) = ν(A ∩ C₀) + ν(A ∩ C₀^c ∩ C^c) + ν(D) = 0.  □
There is a chain rule for Radon–Nikodym derivatives.

Theorem A.79 (Chain rule).³⁵ Let ν and η be σ-finite measures and suppose that µ ≪ ν ≪ η. Then

  (dµ/dη)(s) = (dµ/dν)(s)·(dν/dη)(s), a.e. [η].  (A.80)
PROOF. It is easy to see that µ ≪ η so that dµ/dη exists. For every set A, it follows from (A.76) that

  µ(A) = ∫_A (dµ/dν)(s)dν(s) = ∫_A (dµ/dν)(s)(dν/dη)(s)dη(s).

³⁴This theorem is used in the proofs of Lemmas 2.15 and 2.24. It appears as Theorem 2 in Appendix 3 of Lehmann (1986) and is attributed to Halmos and Savage (1949).
³⁵This theorem is used in the proof of Lemma 2.15.

By the uniqueness of Radon–Nikodym derivatives, (A.80) holds.  □
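On a finite space the chain rule reduces to a pointwise identity between mass ratios. The sketch below (Python; the masses are made-up) verifies (A.80) directly.

```python
import numpy as np

# mu << nu << eta on a four-point space; each Radon-Nikodym derivative is
# a ratio of masses (set to 0 off the support of the denominator).
eta = np.array([1.0, 2.0, 0.5, 4.0])
nu  = np.array([0.5, 0.0, 1.0, 2.0])    # nu << eta
mu  = np.array([0.25, 0.0, 0.0, 3.0])   # mu << nu

ratio = lambda a, b: np.divide(a, b, out=np.zeros_like(a), where=b > 0)
print(ratio(mu, eta))                   # dmu/deta:              [0.25 0. 0. 0.75]
print(ratio(mu, nu) * ratio(nu, eta))   # (dmu/dnu)(dnu/deta):   the same, a.e. [eta]
```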
The Radon–Nikodym theorem A.74 relates integrals with respect to two different measures on the same space. There are also theorems that relate integrals with respect to two different measures on two different spaces.
Theorem A.81. A measurable function f from a measure space (S₁,A₁,µ₁) to a measurable space (S₂,A₂), f : S₁ → S₂, induces a measure on the range S₂. For each A ∈ A₂, define µ₂(A) = µ₁(f⁻¹(A)). Integrals with respect to µ₂ can be written as integrals with respect to µ₁ in the following way: If g : S₂ → ℝ is integrable, then

  ∫ g(y)dµ₂(y) = ∫ g(f(x))dµ₁(x).  (A.82)

PROOF. What needs to be proven is that µ₂ is indeed a measure and that (A.82) holds. To see that µ₂ is a measure, note that if A, B ∈ A₂ are disjoint, then so too are f⁻¹(A) and f⁻¹(B). The fact that µ₂ is nonnegative and countably additive now follows directly from the same fact about µ₁.

If g : S₂ → ℝ is the indicator function of a set A, then

  ∫ g(y)dµ₂(y) = µ₂(A) = µ₁(f⁻¹(A)) = ∫ I_{f⁻¹(A)}(x)dµ₁(x) = ∫ g(f(x))dµ₁(x).

That (A.82) is true for all nonnegative simple functions follows by adding the far ends of this equation (multiplied by positive constants). The monotone convergence theorem A.52 allows us to extend the equality to all nonnegative integrable functions. By subtraction, we can extend to all integrable functions.  □
Definition A.83. The measure µ₂ in Theorem A.81 is called the measure induced on (S₂,A₂) by f from µ₁.
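A Monte Carlo sketch of (A.82) (Python; the choice of µ₁ as the standard normal law on S₁ = ℝ, f(x) = x², and the test function g are our illustrative assumptions): the induced measure µ₂ is then the chi-squared law with one degree of freedom, and E_{µ₂}[g] can be computed entirely on the domain side.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)          # draws from mu1
g = lambda y: np.exp(-y)                # integrable g on S2 = [0, inf)

lhs = g(x**2).mean()                    # ≈ ∫ g(f(x)) dmu1(x)
print(lhs, 1 / np.sqrt(3))              # exact ∫ g dmu2 = 3**(-1/2) ≈ 0.577
```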

If the measure µ₁ in Theorem A.81 is not finite, and the function f is not one-to-one, the measure µ₂ may not be very interesting.
Example A.84. Let S₁ = ℝ², S₂ = ℝ, µ₁ equal Lebesgue measure on ℝ², and f(x,y) = x. Let the two σ-fields be Borel σ-fields. The measure µ₂ that f induces on (S₂,A₂) from µ₁ is the following. If A ∈ A₂ and the Lebesgue measure of A is 0, then µ₂(A) = 0. Otherwise, µ₂(A) = ∞. Although µ₂ is absolutely continuous with respect to Lebesgue measure, it is not σ-finite. The only functions g that are integrable with respect to µ₂ are those that are almost everywhere 0.
If µ₁ is σ-finite, there is a way to avoid the problem in Example A.84 by making use of the following result.

Theorem A.85.³⁶ A measure µ on a space (S,A) is σ-finite if and only if there exists an integrable function f : S → ℝ such that f > 0, a.e. [µ].

³⁶This theorem is used in the proof of Theorem B.46.



PROOF. For the "if" part, let f be as in the statement of the theorem. Let 0 < ∫ f(s)dµ(s) = c < ∞. Let A_n = {s : 1/n ≤ f(s) < 1/(n−1)}, for n = 1, 2, ..., where A₁ = {s : f(s) ≥ 1}. Since f > 0 a.e. [µ], S = ∪_{n=1}^∞ A_n except possibly for a set of µ-measure 0. We can write

  c = ∫ f(s)dµ(s) ≥ ∫_{A_n} f(s)dµ(s) ≥ (1/n)µ(A_n).

It follows that µ(A_n) ≤ nc for all n. Hence µ is σ-finite.

For the "only if" part, assume that µ is σ-finite, and let {A_n}_{n=1}^∞ be mutually disjoint sets such that S = ∪_{n=1}^∞ A_n and µ(A_n) < ∞ for all n. Define f(s) to equal 2^{−n}/µ(A_n) for all s ∈ A_n and for all n such that µ(A_n) > 0. For n such that µ(A_n) = 0, set f(s) = 0 if s ∈ A_n. Then f > 0, a.e. [µ], and

  ∫ f(s)dµ(s) = Σ_{{n:µ(A_n)>0}} 2^{−n} ≤ 1 < ∞,

so f is integrable.  □
Example A.86 (Continuation of Example A.84; see page 601). Let h(x,y) = exp(−[x² + y²]/2). It is known that h is integrable with respect to µ₁ and h is everywhere strictly positive. Let µ₁*(C) = ∫_C h(x,y)dµ₁(x,y). Then µ₁* ≪ µ₁ and µ₁ ≪ µ₁*. The measure µ₂* induced on (S₂,A₂) from µ₁* by f(x,y) = x is µ₂*(B) = √(2π) ∫_B exp(−x²/2)dx. A function g : S₂ → ℝ is integrable with respect to µ₂* if and only if exp(−x²/2)g(x) is integrable with respect to Lebesgue measure.

As a sort of reverse version of Theorem A.81, functions from a measurable space to a measure space induce measures on the domain space.

Proposition A.87. Let f be a measurable function from a measurable space (S₁,A₁) to a measure space (S₂,A₂,µ₂), f : S₁ → S₂. Let A₁f ⊆ A₁ be the σ-field generated by f, and let T be the image of f. Suppose that T ∈ A₂. Then f induces a measure µ₁ on (S₁,A₁f) defined by µ₁(A) = µ₂(T ∩ B) if A = f⁻¹(B). Furthermore, if g : (S₁,A₁f) → ℝ is integrable with respect to µ₁, then

  ∫ g(x)dµ₁(x) = ∫_T h(y)dµ₂(y),  (A.88)

where h satisfying h(f(x)) = g(x) is guaranteed to exist by Theorem A.42.

A.7 Problems
Section A.2:

1. Let S be a set and let A be the collection of all subsets of S that either are countable or have countable complement. Prove that A is a σ-field.
2. Prove Proposition A.10 on page 575.

3. Prove Proposition A.13 on page 576. (Hint: First, show that every open ball in ℝᵏ is the union of countably many open rectangles. Then prove that the smallest σ-field containing open balls must be the same as the smallest σ-field containing open rectangles.)
4. Prove that B⁺ defined on page 571 is a σ-field of subsets of the extended real numbers.
5. Prove Proposition A.15 on page 576.
6. Prove Proposition A.16 on page 576.
7. *Let F : ℝ → ℝ be a nondecreasing function that is continuous from the right. For each interval (a,b], define µ((a,b]) = F(b) − F(a).
   (a) Suppose that {(a_n,b_n]}_{n=1}^∞ is a sequence of disjoint intervals such that ∪_{n=1}^∞ (a_n,b_n] ⊆ (a,b]. Prove that Σ_{n=1}^∞ µ((a_n,b_n]) ≤ µ((a,b]). (Hint: Prove it for finite collections and take a limit.)
   (b) Suppose that {(a_n,b_n]}_{n=1}^∞ is a sequence of disjoint intervals such that (a,b] ⊆ ∪_{n=1}^∞ (a_n,b_n]. Prove that Σ_{n=1}^∞ µ((a_n,b_n]) ≥ µ((a,b]). (Hint: First, prove it for finite collections by induction. For infinite collections, let µ((a,b]) > ε > 0. Cover a compact interval [a+δ, b] with finitely many open intervals (a_n, b_n+δ_n) such that |µ((a,b]) − µ((a+δ,b])| < ε/2 and |Σ_{n=1}^∞ µ((a_n,b_n]) − Σ_{n=1}^∞ µ((a_n,b_n+δ_n])| < ε/2. This can be done by using continuity from the right.)
   (c) Prove that µ is countably additive on the smallest field containing intervals of the form (a,b]. (Hint: Deal separately with finite and semi-infinite intervals.)
8. A measure space (S,A,µ) is complete if A ⊆ B ∈ A and µ(B) = 0 implies A ∈ A. Let (S,C,µ) be a measure space, and let D = {D : ∃A,C ∈ C with D△A ⊆ C and µ(C) = 0}. For each D ∈ D, define µ*(D) = µ(A), where D△A ⊆ C and µ(C) = 0. Show that µ* is well defined and that (S,D,µ*) is a complete measure space.
Section A.3:

9. Prove Proposition A.28 on page 583.
10. Prove Proposition A.32 on page 584.
11. Prove Proposition A.36 on page 584.
12. Let (S,A,µ) be a measure space, and let {f_n}_{n=1}^∞ be a sequence of measurable functions from S to ℝ. Suppose that for every ε > 0, Σ_{n=1}^∞ µ({s : |f_n(s)| > ε}) < ∞. Prove that lim_{n→∞} f_n(s) = 0, a.e. [µ]. (Hint: Use the first Borel–Cantelli lemma A.20.)
13. Let (S_j,A_j) for j = 0, 1, 2, 3 be measurable spaces. Let f_j : S₀ → S_j be measurable and onto for j = 1, 2, 3. Let A_{0,j} be the σ-field generated by f_j for j = 1, 2. Prove that f₃ is measurable with respect to A_{0,1} ∩ A_{0,2} if and only if there exist measurable g_j : S_j → S₃ for j = 1, 2 such that f₃ = g₁(f₁) = g₂(f₂).

Section A.4:

14. If f ≥ 0 is measurable and ∫ f(s)dµ(s) = 0, then show that f(s) = 0, a.e. [µ].
15. If f(s) > 0 for all s ∈ A and µ(A) > 0, prove that ∫_A f(s)dµ(s) > 0.
16. Prove Proposition A.45 on page 588. (Hint: Use induction on n.)
17. Prove Proposition A.49 on page 588. (Hint: For part 4, use Problem 14 on page 604.)
18. Let S = ℝ and let A be the σ-field of sets that are either countable or have countable complement. (See Problem 1 on page 602.) Let µ be Lebesgue measure. Suppose that f : S → ℝ is integrable. Prove that f = 0, a.e. [µ].
19. Let (S,A) be a measurable space, and let f be a bounded measurable function. (That is, there exist a and b such that a ≤ f(x) ≤ b for all x ∈ S.)
    (a) Let µ be a measure on (S,A) such that µ(S) = 1. Prove that a ≤ ∫ f(x)dµ(x) ≤ b.
    (b) Let ε > 0. Prove that there exists a simple function g such that for all measures µ satisfying µ(S) = 1, |∫ f(x)dµ(x) − ∫ g(x)dµ(x)| < ε.
20. Prove the following alternative type of monotone convergence theorem: Let {f_n}_{n=1}^∞ be a sequence of integrable functions such that f_n(x) converges monotonically to f(x) a.e. [µ]. Then ∫ f(x)dµ(x) is defined and ∫ f(x)dµ(x) = lim_{n→∞} ∫ f_n(x)dµ(x). (Hint: Use the dominated convergence theorem A.57 on the positive parts of f_n and the monotone convergence theorem A.52 on the negative parts, or vice versa, depending on whether the convergence is from above or below.)
21. Let (S,A,µ) be a measure space, let {g_n}_{n=1}^∞ be a sequence of integrable functions that converges a.e. [µ], and let g be another integrable function. Suppose that for all C ∈ A,

      lim_{n→∞} ∫_C g_n(s)dµ(s) = ∫_C g(s)dµ(s).

    Prove that lim_{n→∞} g_n = g, a.e. [µ].

Section A.5:

22. Prove Proposition A.66 on page 595.
23. Let (S₁,A₁) and (S₂,A₂) be measurable spaces, and define the product space (S₁ × S₂, A₁ ⊗ A₂). Prove that A × B ∈ A₁ ⊗ A₂ with A ⊆ S₁ and B ⊆ S₂ implies A ∈ A₁ and B ∈ A₂. (Hint: For each C ∈ A₁ ⊗ A₂, define C_y = {x : (x,y) ∈ C}. Then let C = {C : C_y ∈ A₁, for all y ∈ S₂}. Prove that C is a σ-field containing all product sets.)

Section A.6:

24. Suppose that µ₁ ≪ µ₂ and µ₂ ≪ µ₁.
    (a) Show that a.e. [µ₁] means the same thing as a.e. [µ₂].
    (b) Show that

        (dµ₁/dµ₂)(s) = ((dµ₂/dµ₁)(s))⁻¹, a.e. [µ₁] and a.e. [µ₂].

25. If µ₁ is a measure and f is a nonnegative measurable function, then define the measure µ₂ by µ₂(A) = ∫_A f(s)dµ₁(s). Prove that µ₂ ≪ µ₁.
26. Let λ be Lebesgue measure on ℝ and define µ(A) = λ(A) + c I_A(x₀) for some fixed c > 0 and x₀ ∈ ℝ.
    (a) Prove that µ is a measure.
    (b) Show that λ ≪ µ, but that µ is not absolutely continuous with respect to λ.
    (c) Show that ∫ f(x)dµ(x) = ∫ f(x)dλ(x) + c f(x₀).
27. *In the proof of Theorem A.74, we proved the Hahn decomposition theorem for signed measures, namely that if ν is a signed measure on (S,A), then there exists A ∈ A such that A is a positive set and A^c is a negative set relative to ν.
    (a) Let ν be a signed measure on (S,A). Suppose that there are two different Hahn decompositions. That is, A₁ and A₂ are both positive sets and A₁^c and A₂^c are both negative sets. Prove that every measurable subset B of A₁ ∩ A₂^c has ν(B) = 0.
    (b) If ν is a signed measure on (S,A), use the Hahn decomposition theorem to create definitions for the following:
        i. The integral with respect to ν of a measurable function.
        ii. When a function is integrable with respect to ν.
    (c) If there are two different Hahn decompositions for a signed measure ν, prove that the definition of integral with respect to ν produces the same value for both decompositions.
28. In the statement of Proposition A.87 on page 602, prove that the measure µ₁ is well defined. (That is, suppose that A = f⁻¹(B₁) = f⁻¹(B₂), and prove that µ₂(B₁ ∩ T) = µ₂(B₂ ∩ T).) Also prove that µ₁ is a measure.
29. In the statement of Proposition A.87 on page 602, assuming that µ₁ is a well-defined measure, prove that (A.88) holds.
APPENDIX B

Probability Theory

This appendix builds on Appendix A but is otherwise self-contained. It contains


an introduction to the theory of probability. The first section is an overview.
It could serve either as a refresher for those who have previously studied the
material or as an informal introduction for those who have never studied it.

B.1 Overview

B.1.1 Mathematical Probability

The measure theoretic definition of probability is that a measure space (S,A,µ) is called a probability space and µ is called a probability if µ(S) = 1. Each element of A is called an event. A measurable function X from S to some other space (X,B) is called a random quantity. The most popular type of random quantity is a random variable, which occurs when X = ℝ with the Borel σ-field. The probability measure µ_X induced on (X,B) by X from µ is called the distribution of X.

Example B.1. Let S = X = ℝ with Borel σ-field. Let f be a nonnegative function such that ∫ f(x)dx = 1. Define µ(A) = ∫_A f(x)dx and X(s) = s. Then X is a continuous random variable with density f, and µ_X = µ. If we let ν denote Lebesgue measure, then µ_X ≪ ν with dµ_X/dν = f.

Example B.2. Let S = ℝ with Borel σ-field. Let X = {x₁, x₂, ...} be a countable set. Let f be a nonnegative function defined on X such that Σ_{i=1}^∞ f(x_i) = 1. Define µ(A) = Σ_{{i:x_i∈A}} f(x_i). Then X is a discrete random variable with probability mass function f, and µ_X = µ. If we let ν denote counting measure on X, then µ ≪ ν with dµ/dν = f.
B.l. Overview 607

In both of these examples, we will say that f is the density of X with respect to ν.
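A toy version of Example B.2 (the support points and masses are our own illustrative numbers): the distribution µ_X of any event is the sum of the mass function f over that event, and f is the density of X with respect to counting measure.

```python
import numpy as np

x_vals = np.array([0.0, 1.0, 2.0, 3.0])       # the countable set X (finite here)
f_mass = np.array([0.1, 0.4, 0.3, 0.2])       # mass function; sums to 1

A = np.isin(x_vals, [1.0, 3.0])               # the event {X in {1, 3}}
print("mu_X(A) =", f_mass[A].sum())           # 0.6
print("E(X)    =", (x_vals * f_mass).sum())   # 1.6
```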
When there is one probability space (S,A,µ) from which all other probabilities are induced by way of random quantities, then the probability in that one space will be denoted Pr. So, for example, if µ_X is the distribution of a random quantity X and if B ∈ B, then Pr(X ∈ B) = µ(X⁻¹(B)) = µ_X(B).

The expected value or mean or expectation of a random variable X is defined (and denoted) as E(X) = ∫ x dµ_X(x), if the integral exists, where µ_X is the distribution of X. If X is a vector of random variables (called a random vector), then E(X) will stand for the vector with coordinates equal to the means of the coordinates of X.

The (in)famous law of the unconscious statistician, B.12, is very useful for calculating means of functions of random quantities. It says that E[f(X)] = ∫ f(x)dµ_X(x). For example, the variance of a random variable X with mean c is Var(X) = E([X − c]²), which can be calculated as ∫ (x − c)²dµ_X(x). The covariance between two random variables X and Y with means c_X and c_Y, respectively, is Cov(X,Y) = E([X − c_X][Y − c_Y]).

B.1.2 Conditioning

We begin with a heuristic derivation of the important concepts using the special case of discrete random quantities. Afterwards, we define the important terms in a more rigorous way.

Consider the case of two random quantities X and Y, each of which assumes at most countably many distinct values, X ∈ X = {x₁, ...} and Y ∈ Y = {y₁, ...}. Let p_{ij} = Pr(X = x_i, Y = y_j). Then

  Pr(X = x_i) = Σ_{j=1}^∞ p_{ij} = p_{i·},  and
  Pr(Y = y_j) = Σ_{i=1}^∞ p_{ij} = p_{·j}.

These equations give the marginal distributions of X and Y, respectively. We can define the conditional probability that X = x_i given Y = y_j by

  Pr(X = x_i | Y = y_j) = p_{ij}/p_{·j} = p_{i|j}.

Note that, for each j, Σ_{i=1}^∞ p_{i|j} = 1, so that the numbers {p_{i|j}}_{i=1}^∞ define a probability distribution on X known as the conditional distribution of X given Y = y_j. We can calculate the conditional mean (expectation) of a function f of X given Y = y_j by

  E(f(X)|Y = y_j) = Σ_{i=1}^∞ f(x_i)p_{i|j}.

From the conditional distribution, we could define a measure on (X, 2^X) by

  µ_{X|Y}(A|y_j) = Σ_{x_i∈A} p_{i|j}.

It follows that, for each j, E(f(X)|Y = y_j) = ∫ f(x)dµ_{X|Y}(x|y_j). We can think of this conditional mean as a function of y:

  g(y) = E(f(X)|Y = y).

The marginal distribution of Y is a measure on (Y, 2^Y) defined by

  µ_Y(B) = Σ_{y_j∈B} p_{·j}, for all B ∈ 2^Y.

Similarly, the joint distribution of (X,Y) induces a measure on (X × Y, 2^X ⊗ 2^Y) by µ_{X,Y}(C) = Σ_{(x_i,y_j)∈C} p_{ij}, for all C ∈ 2^X ⊗ 2^Y. The point of all of these measures and distributions is the following. We can write the integral of g over any set B ∈ 2^Y as

  ∫_B g(y)dµ_Y(y) = Σ_{y_j∈B} g(y_j)p_{·j} = Σ_{y_j∈B} Σ_{i=1}^∞ f(x_i)p_{i|j}p_{·j}
                  = ∫ f(x)I_B(y)dµ_{X,Y}(x,y) = E(f(X)I_B(Y)).

The overall equation

  ∫_B g(y)dµ_Y(y) = E(f(X)I_B(Y))

will be used as the property that defines conditional expectation in general. Through the definition of conditional expectation, we will define conditional probability and conditional distributions in general.

Theorem B.21 says that, in general, if a random variable X has finite mean and if C is a sub-σ-field of A, then a function g : S → ℝ exists which is measurable with respect to the σ-field C and such that

  E(X I_B) = ∫_B g(s)dµ(s), for all B ∈ C.  (B.3)

This is the general version of what we worked out above for discrete random variables, in which C was the σ-field generated by Y. We will use the symbol E(X|C) to stand for the function g. The two important features that E(X|C) possesses are that it is measurable with respect to the σ-field C and that it satisfies (B.3). Any function that equals E(X|C) a.s. [µ] will also satisfy (B.3), so there may be many functions that satisfy the definition of conditional expectation. All such functions are called versions of the conditional expectation. When we say that a random variable equals E(X|C), we will mean that it is a version of E(X|C).

Notice that we can set B = S in (B.3) and the equation becomes E(X) = E[E(X|C)]. This result is called the law of total probability. A useful generalization is given in Theorem B.70.

If C is the σ-field generated by another random quantity Y, then the symbol E(X|Y) is usually used instead of E(X|C). For the case in which C is the σ-field generated by Y, some special notation is introduced. We saw in Theorem A.42 that a function is measurable with respect to the σ-field generated by Y if and only if it is a function of Y. Hence, there is a function h defined on the space Y where Y takes its values such that E(X|Y) = h(Y). We use the notation E(X|Y = t) to stand for h(t). (See Corollary B.22.) In this notation, we have, for all B ∈ C, E(X I_B(Y)) = ∫_B E(X|Y = t)dµ_Y(t), where µ_Y is the distribution of Y.
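In the discrete setting of Section B.1.2, the defining property (B.3) can be checked directly in a few lines of code. In the sketch below (Python; the joint mass table, the function f, and the set B are all made-up), both sides of the defining equation agree.

```python
import numpy as np

# p[i, j] = Pr(X = x_i, Y = y_j); rows index x-values, columns y-values.
p = np.array([[0.10, 0.05, 0.15],
              [0.20, 0.05, 0.10],
              [0.05, 0.20, 0.10]])
x_vals = np.array([-1.0, 0.5, 2.0])
f = lambda x: x**2

p_dot_j = p.sum(axis=0)                       # marginal of Y: p_{.j}
p_i_given_j = p / p_dot_j                     # p_{i|j}; each column sums to 1
cond_mean = f(x_vals) @ p_i_given_j           # E(f(X)|Y = y_j) for each j

B = np.array([True, False, True])             # a set of y-values
lhs = (cond_mean * p_dot_j)[B].sum()          # ∫_B E(f(X)|Y=y) dmu_Y(y)
rhs = (f(x_vals)[:, None] * p)[:, B].sum()    # E(f(X) I_B(Y))
print(lhs, rhs)                               # equal
```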
Example B.4. Let S = ℝ² and let A be the two-dimensional Borel sets. Let

  µ(A) = ∫_A (1/(√3 π)) exp{−(2/3)(s₁² + s₂² − s₁s₂)} ds₁ ds₂.

Suppose that X(s) = s₁ and Y(s) = s₂ when s = (s₁,s₂). Now E(|X|) = √(2/π) < ∞. We claim that g(s) = s₂/2 and h(t) = t/2 satisfy the conditions required to be E(X|Y)(s) and E(X|Y = t), respectively. First, note that the σ-field generated by Y is A_Y = {ℝ × C : C is Borel measurable}, and µ_Y is the measure with density exp(−t²/2)/√(2π). It is clear that any measurable function of s₂ alone is A_Y measurable. Let B = ℝ × C, so that E(X I_B) equals

  ∫_{−∞}^∞ ∫_C s₁ (1/(√3 π)) exp{−(2/3)(s₁² + s₂² − s₁s₂)} ds₂ ds₁
    = ∫_C ∫_{−∞}^∞ s₁ √(2/(3π)) exp{−(2/3)(s₁ − s₂/2)²} (1/√(2π)) exp{−s₂²/2} ds₁ ds₂
    = ∫_C (s₂/2)(1/√(2π)) exp{−s₂²/2} ds₂
    = ∫_C ∫_{−∞}^∞ (s₂/2) √(2/(3π)) exp{−(2/3)(s₁ − s₂/2)²} (1/√(2π)) exp{−s₂²/2} ds₁ ds₂
    = ∫_B (s₂/2)(1/(√3 π)) exp{−(2/3)(s₁² + s₂² − s₁s₂)} ds₁ ds₂ = ∫_B g(s)dµ(s).

Note also that the third line in the above string equals ∫_C h(s₂)dµ_Y(s₂).

It is easy to see that if X is already measurable with respect to C, then E(X|C) = X.
Conditional probability turns out to be the special case of conditional expectation in which X = I_A. That is, we define Pr(A|C) = E(I_A|C). A conditional probability is regular if Pr(·|C)(s) is a probability measure for all s. It turns out that, under very general conditions (see Theorem B.32), we can choose the functions Pr(A|C)(·) in such a way that they are regular conditional probabilities. In particular, the space (X,B) needs to be sufficiently like the real numbers with the Borel σ-field. Such spaces are called Borel spaces as defined in Definition B.31. All of the most common spaces are Borel spaces, in particular ℝᵏ for all finite k and ℝ^∞. For those readers with more mathematical background, complete separable metric spaces are also Borel spaces. Also, finite and countable products of Borel spaces are Borel spaces.

In the future, we will assume that all versions of conditional probabilities are regular when they are on Borel spaces. If C is the σ-field generated by Y, then Pr(A|Y = y) will be used to stand for E(I_A|Y = y).

If X : S → X is a random quantity, its conditional distribution is the collection of conditional probabilities on X induced from the restriction of conditional probabilities on S to the σ-field generated by X. If the P(·|C) are regular conditional

probabilities, then we say that the version of the conditional distribution of X


given C is a regular conditional distribution. When we refer to a conditional dis-
tribution without the word "version," we will mean a version of the conditional
distribution. Occasionally, we will need to choose a version that satisfies some
other condition. In those cases, we will try to be explicit about versions.
Because conditional distributions are probability measures, many of the theo-
rems from Appendix A which apply to such measures apply to conditional dis-
tributions. For example, the monotone convergence theorem A.52 and the dom-
inated convergence theorem A.57 apply to conditional means because limits of
measurable functions are still measurable. Also, most of the properties of proba-
bility measures from this appendix apply as well.
We now turn our attention to the existence and calculation of densities for
conditional distributions. If the joint distribution of two random quantities has
a density with respect to a product measure, then the conditional distributions
have densities that can be calculated in the usual way, as the joint density divided
by the marginal density of the conditioning variable. Theorem B.46 allows us to
extend this result to joint distributions that are not absolutely continuous with
respect to product measures, such as when one of the quantities is a function of
the other. Here, we merely give an example of how such conditional densities are
calculated.
Example B.5. Let X = (X₁,X₂) have a bivariate normal distribution with density f_X with respect to Lebesgue measure on ℝ², with means µ₁ and µ₂, variances σ₁² and σ₂², and correlation ρ. The marginal density of Y = X₁ + X₂ with respect to Lebesgue measure is

  f_Y(y) = (1/(√(2π) σ)) exp(−(1/(2σ²))[y − (µ₁ + µ₂)]²),

where σ² = σ₁² + σ₂² + 2ρσ₁σ₂. The pair (X,Y) does not have a joint density with respect to Lebesgue measure on ℝ³, but it does have a joint density with respect to the measure ν on ℝ³ defined as follows. For each A ⊆ ℝ³, let A′ = {(x₁,x₂) : (x₁,x₂,x₁+x₂) ∈ A}. Let ν(A) = λ₂(A′), where λ_k is Lebesgue measure on ℝᵏ for k = 1, 2. Then f_{X,Y}(x,y) = f_X(x) is the joint density of (X,Y) with respect to ν, and

  f_X(x)/f_Y(y) = (1/(√(2π) σ*)) exp(−(1/(2σ*²))(x₁ − µ₁ − [σ₁² + ρσ₁σ₂](y − µ₁ − µ₂)/σ²)²),

if y = x₁ + x₂, where σ*² = σ₁² − [σ₁² + ρσ₁σ₂]²/σ², is the conditional density of X given Y = y with respect to the measure ν_{X|Y}(A|y) = λ₁(A′_y), where A′_y = {x₁ : (x₁, y − x₁) ∈ A}.
The concept of conditional independence will turn out to be central to the development of statistical models. A collection {X_n}_{n=1}^∞ of random quantities is conditionally independent given another quantity Y if the conditional distribution (given Y) of every finite subset is a product measure. If, in addition, Y is constant almost surely, we say that {X_n}_{n=1}^∞ are independent. We will call random quantities (conditionally) IID if they are (conditionally) independent and they all have the same conditional distribution.

B.1.3 Limit Theorems

There are three types of convergence which we consider for sequences of random quantities: almost sure convergence, convergence in probability, and convergence in distribution. The weakest of these is the last. (See Theorem B.90.) A sequence {X_n}_{n=1}^∞ converges in distribution to X if lim_{n→∞} E(f(X_n)) = E(f(X)) for every bounded continuous function f. We denote this type of convergence X_n →ᴰ X. If X = ℝ, a more common way to express X_n →ᴰ X is that lim_{n→∞} F_n(x) = F(x) for all x at which F is continuous, where F_n is the CDF of X_n and F is the CDF of X.¹

If X is a metric space with metric d, we say that a sequence {X_n}_{n=1}^∞ converges in probability to X if, for every ε > 0, lim_{n→∞} Pr(d(X_n,X) > ε) = 0. We write this as X_n →ᴾ X. Almost sure convergence is the same as almost everywhere convergence of functions, and it is the strongest of the three. That is, X_n → X, a.s. means that {s : X_n(s) does not converge to X(s)} ⊆ E with Pr(E) = 0.

A popular method for proving convergence in distribution involves the use of characteristic functions. The characteristic function of a random vector X is the complex-valued function

  φ_X(t) = E(exp[i tᵀX]).

It is easy to see that the characteristic function exists for every random vector and has complex absolute value at most 1 for all t. Other facts that follow directly from the definition are the following. If Y = aX + b, then φ_Y(t) = φ_X(at)exp(i tᵀb). If X and Y are independent, φ_{X+Y} = φ_X φ_Y.

The importance of characteristic functions is that they characterize distributions (see the uniqueness theorem B.106) and they are "continuous" as a function of the distribution in the sense of convergence in distribution (see the continuity theorem B.93).

Two of the more useful limit theorems are the weak law of large numbers B.95 and the central limit theorem B.97. If {X_n}_{n=1}^∞ are IID random variables with finite mean µ, then the weak law of large numbers says that the sample average X̄_n = Σ_{i=1}^n X_i/n converges in probability to µ. If, in addition, they have finite variance σ², the central limit theorem B.97 says that √n(X̄_n − µ) →ᴰ N(0,σ²), the normal distribution with mean 0 and variance σ².

¹See Problem 25 on page 664. If X = ℝᵏ, the same idea can be used. That is, X_n →ᴰ X if and only if the joint CDFs F_n of X_n converge to the joint CDF F of X at all points at which F is continuous. Since we will not need to use this characterization, we will not prove it.
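A simulation sketch of both theorems (Python; the Exponential(1) population, with µ = σ² = 1, and the sample sizes are arbitrary choices): sample averages concentrate at µ, and the scaled deviations look standard normal.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 1000, 10000
# reps independent sample averages of n IID Exponential(1) variables.
xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)

# Weak law: the averages concentrate at mu = 1.
print("Pr(|Xbar_n - 1| > 0.1) ≈", (np.abs(xbar - 1.0) > 0.1).mean())  # tiny

# CLT: sqrt(n)(Xbar_n - mu) is approximately N(0, sigma^2) = N(0, 1).
z = np.sqrt(n) * (xbar - 1.0)
print("mean ≈ %.3f, var ≈ %.3f, Pr(Z <= 1.96) ≈ %.3f"
      % (z.mean(), z.var(), (z <= 1.96).mean()))    # ≈ 0, 1, 0.975
```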

B.2 Mathematical Probability


In this chapter, we will present the basic framework of the measure theoretic probability calculus. Most of the concepts, like random quantities, distributions, and so forth, will be special cases of measure theoretic concepts introduced in Appendix A.

B.2.1 Random Quantities and Distributions


We begin by introducing the basic building blocks of probability theory.

Definition B.6. A probability space is a measure space (S,A,µ) with µ(S) = 1. Each element of A is called an event. If (S,A,µ) is a probability space, (X,B) is a measurable space, and X : S → X is measurable, then X is called a random quantity. If X = ℝ and B is the Borel or Lebesgue σ-field, then X is called a random variable. Let µ_X be the probability measure induced on (X,B) by X from µ (see Definition A.83). This probability measure is called the distribution of X. The distribution of X is said to be discrete if there exists a countable set A ⊆ X such that µ_X(A) = 1. The distribution of X is continuous if µ_X({x}) = 0 for all x ∈ X.

The distribution of X is easily seen to be equivalent to the restriction of µ to the σ-field generated by X, A_X.

When there is one probability space from which all other probabilities are induced by way of random quantities, then the probability in that one space will be denoted Pr. So, for example, in the above definition of the distribution of a random quantity X, if B ∈ B, then Pr(X ∈ B) = µ(X⁻¹(B)) = µ_X(B).

The distribution of a random variable can be described by its cumulative distribution function.
Definition B.7. A function F is a (cumulative) distribution function (CDF) if it has the following properties:
• F is nondecreasing;
• lim_{x→−∞} F(x) = 0;
• lim_{x→∞} F(x) = 1;
• F is continuous from the right.

Proposition B.8. If X is a random variable, then the function F_X(x) = Pr(X ≤ x) is a CDF. In this case, F_X is called the CDF of X.

A distribution function F can be used to create a measure on (ℝ,B) as follows. Set µ((a,b]) = F(b) − F(a), and extend this to the whole σ-field using the Carathéodory extension theorem A.22.²

We can also construct a distribution function from a probability measure on the real numbers. If µ is a probability measure on (ℝ,B¹), the CDF associated with it is F(x) = µ((−∞,x]). If f is a Borel measurable function from ℝ to ℝ, we will write ∫ f(x)dF(x) and ∫ f(x)dµ(x) interchangeably.

²See the discussion on page 581 and Problem 7 on page 603.



If µ is a probability measure on (ℝⁿ,Bⁿ), a joint CDF can be defined as

  F(x₁, ..., x_n) = µ((−∞,x₁] × ⋯ × (−∞,x_n]),

the measure of an orthant. For every joint CDF, there is a random vector X with that CDF, and we call the CDF F_X.

Definition B.9. Let (S,A,µ) be a probability space, and let (X,B,ν) be a measure space. Suppose that X : S → X is measurable. Let µ_X be the measure induced on (X,B) by X from µ. Suppose that µ_X ≪ ν. Then we call the Radon–Nikodym derivative f_X = dµ_X/dν the density of X with respect to ν.

Proposition B.10. If h : X → ℝ is measurable and f_X is the density of X with respect to ν, then ∫ h(x)dF_X(x) = ∫ h(x)f_X(x)dν(x).

Definition B.11. If X is a random variable with CDF F_X(·), then the expected value (or mean, or expectation) of X is E(X) = ∫ x dF_X(x). If X is a random vector, then E(X) will stand for the vector with coordinates equal to the means of the coordinates of X.
The following theorem is often called the law of the unconscious statistician, because some people forget that it is not really the definition of expected value.

Theorem B.12.³ If X : S → X is a random quantity and f : X → ℝ is a measurable function, then E[f(X)] = ∫ f(x)dµ_X(x), where µ_X is the distribution of X.

PROOF. If we let Y = f(X), then Y induces a measure (with CDF F_Y) on (ℝ,B¹) according to Theorem A.81. The definition of E(Y) is ∫ y dF_Y(y), and Theorem A.81 says that ∫ y dF_Y(y) = ∫ f(x)dF_X(x).  □
Definition B.13. If X is a random variable with finite mean c, then the variance of X is the mean of (X − c)² and is denoted Var(X). If X is a random vector with finite mean vector c, then the covariance matrix of X is the mean of (X − c)(X − c)ᵀ and is also denoted Var(X). The covariance of two random variables X and Y with finite means c_X and c_Y is E([X − c_X][Y − c_Y]) and is denoted Cov(X,Y).

It is possible for a random variable to have finite mean and infinite variance.

Proposition B.14. If X has finite mean µ, then Var(X) = E(X²) − µ².

B.2.2 Some Useful Inequalities

Although there are theoretical formulas for calculating means of functions of random variables, often they are not analytically tractable. We may, on the other hand, only need to know that a mean is less than some value. For this reason, we present some well-known inequalities concerning means of random variables.

³This theorem is used in making sense of the notation E_θ when introducing parametric models.

Theorem B.15 (Markov inequality).⁴ Suppose that X is a nonnegative random variable with finite mean µ. Then, for all c > 0, Pr(X ≥ c) ≤ µ/c.

PROOF. Let F be the CDF of X. Then, we can write

  µ = ∫ x dF(x) ≥ ∫_{[c,∞)} x dF(x) ≥ c ∫_{[c,∞)} dF(x) = c Pr(X ≥ c).

Divide the extreme parts by c to get the result.  □

The following well-known inequality follows trivially from the Markov inequality B.15.

Corollary B.16 (Tchebychev's inequality).⁵ Suppose that X is a random variable with finite variance σ² and finite mean µ. Then, for all c > 0,

  Pr(|X − µ| ≥ c) ≤ σ²/c².
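A quick numeric sanity check of both bounds (Python; the choice of a Gamma(2,1) population, with mean 2 and variance 2, is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.gamma(2.0, 1.0, size=10**6)     # E(X) = 2, Var(X) = 2

for c in [2.0, 4.0, 8.0]:               # Markov: Pr(X >= c) <= mu/c
    print(f"c={c}: Pr(X >= c) ≈ {(x >= c).mean():.4f}  <=  {2.0/c:.4f}")
for c in [2.0, 4.0]:                    # Tchebychev: Pr(|X-mu| >= c) <= var/c^2
    print(f"c={c}: Pr(|X-2| >= c) ≈ {(np.abs(x - 2.0) >= c).mean():.4f}"
          f"  <=  {2.0/c**2:.4f}")
```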
Another well-known inequality involves convex functions.⁶ The proof of this theorem resembles the proofs in Ferguson (1967) and Berger (1985).

Theorem B.17 (Jensen's inequality).⁷ Let g be a convex function defined on a convex subset X of ℝᵏ and suppose that Pr(X ∈ X) = 1. If E(X) is finite, then E(X) ∈ X and g(E(X)) ≤ E(g(X)).

PROOF. First, we prove that E(X) ∈ X by induction on the dimension of X. Without loss of generality, we can assume that E(X) = 0, since we can subtract E(X) from X and from every element of X, and E(X) ∈ X if and only if 0 ∈ X − E(X). If k = 0, then X = {0} and E(X) = 0. Suppose that 0 ∈ X for all X with dimension strictly less than m ≤ k. Now suppose that X and X have dimension m and 0 ∉ X. Since X and {0} are disjoint convex sets, the separating hyperplane theorem C.5 says that there is a nonzero vector v and a constant c such that, for every x ∈ X, vᵀx ≤ c and 0 ≥ c.⁸ If we let Y = vᵀX, then we have Pr(Y ≤ c) = 1 and E(Y) = 0 ≥ c. It follows that Pr(Y = c) = 1 and c = 0. Hence, X lies in the (m−1)-dimensional convex set Z = X ∩ {x : vᵀx = 0}. It follows that 0 ∈ Z ⊆ X.

Next, we prove the inequality by induction on k. For k = 0, E(g(X)) = g(E(X)), since X is degenerate. Suppose that the inequality holds for all dimensions up to m − 1 < k. Let X have dimension m. Define the subset of ℝ^{m+1},

  X′ = {(x,z) : x ∈ X, z ∈ ℝ, and g(x) ≤ z}.

Let (x₁,z₁) and (x₂,z₂) be in X′ and define

  (y,w) = (αx₁ + (1−α)x₂, αz₁ + (1−α)z₂).

⁴This theorem is used in the proofs of Corollary B.16 and Lemma 1.61.
⁵This corollary is used in the proof of Theorem 1.59.
⁶Let X be a linear space. A function f : X → ℝ is convex if f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y) for all x, y ∈ X and all λ ∈ [0,1].
⁷This theorem is used in the proofs of Lemma B.114 and Theorems B.118 and 3.20.
⁸The symbol vᵀ stands for the transpose of the vector v.

Since αg(x₁) + (1−α)g(x₂) ≥ g(y) and w ≥ αg(x₁) + (1−α)g(x₂), it follows that (y,w) ∈ X′, so X′ is convex. It is also clear that (E(X), g(E(X))) is a boundary point of X′. The supporting hyperplane theorem C.4 says that there is a vector v = (v_x, v_z) such that, for all (x,z) ∈ X′, v_xᵀx + v_z z ≥ v_xᵀE(X) + v_z g(E(X)). Since (x,z₁) ∈ X′ implies (x,z₂) ∈ X′ for all z₂ > z₁, it cannot be that v_z < 0, since then lim_{z→∞} v_xᵀx + v_z z = −∞, a contradiction. Since (x, g(x)) ∈ X′ for all x ∈ X, it follows that v_xᵀX + v_z g(X) ≥ v_xᵀE(X) + v_z g(E(X)), from which we conclude

  v_z g(E(X)) ≤ v_xᵀ[X − E(X)] + v_z g(X).  (B.18)

Taking expectations of both sides of this gives v_z g(E(X)) ≤ v_z E(g(X)). If v_z > 0, the proof is complete. If v_z = 0, then (B.18) becomes 0 ≤ v_xᵀ[X − E(X)], which implies v_xᵀ[X − E(X)] = 0 with probability 1. Hence X lies in an (m−1)-dimensional space, and the induction hypothesis finishes the proof.  □
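A one-line numeric illustration (the convex g and the distribution are arbitrary choices): with g(x) = eˣ and X ~ Uniform(0,1), Jensen's inequality reads e^{1/2} ≤ e − 1.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=10**6)
# g(E X) vs E g(X) for the convex g(x) = exp(x)
print(np.exp(x.mean()), np.exp(x).mean())   # ≈ 1.649  <=  ≈ 1.718
```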
The famous Cauchy–Schwarz inequality for vectors⁹ has a probabilistic version.

Theorem B.19 (Cauchy–Schwarz inequality).¹⁰ Let X₁ and X₂ be two random vectors of the same dimension such that E(‖Xᵢ‖²) is finite for i = 1, 2. Then

  [E(|X₁ᵀX₂|)]² ≤ E(‖X₁‖²)E(‖X₂‖²).  (B.20)

PROOF. Let Z = 1 if X₁ᵀX₂ ≥ 0 and Z = −1 if X₁ᵀX₂ < 0. Let Y = ‖X₁ + cZX₂‖², where c = −√(E‖X₁‖²/E‖X₂‖²). Then Y ≥ 0 and Z² = 1. So

  0 ≤ E(Y) = E‖X₁‖² + c²E‖X₂‖² + 2cE(|X₁ᵀX₂|)
           = 2E‖X₁‖² − 2E(|X₁ᵀX₂|)√(E‖X₁‖²)/√(E‖X₂‖²).

The desired result follows immediately from this inequality.  □

B.3 Conditioning

B.3.1 Conditional Expectations

Section B.1.2 contains a heuristic derivation of the important concepts in conditioning using the special case of discrete random quantities. We now turn to a more general presentation.

Theorem B.21.¹¹ Let (S,A,µ) be a probability space, and suppose that X : S → ℝ is a measurable function with E(|X|) < ∞. Let C be a sub-σ-field of A. Then there exists a C measurable function g : S → ℝ which satisfies

  E(X I_B) = ∫_B g(s)dµ(s), for all B ∈ C.

⁹That is, if x₁ and x₂ are vectors, then |x₁ᵀx₂| ≤ ‖x₁‖‖x₂‖.
¹⁰This theorem is used in the proofs of Theorems 3.44, 5.13, and 5.18.
¹¹This theorem is used to help define the general concept of conditional expectation.

PROOF. Use Theorem A.54 to construct two measures µ₊ and µ₋ on (S,C):

  µ₊(B) = ∫_B X⁺(s)dµ(s),  µ₋(B) = ∫_B X⁻(s)dµ(s).

It is clear that µ₊ ≪ µ and µ₋ ≪ µ. The Radon–Nikodym theorem A.74 tells us that there are C measurable functions g₊ and g₋ such that

  µ₊(B) = ∫_B g₊(s)dµ(s),  µ₋(B) = ∫_B g₋(s)dµ(s).

Since E(X I_B) = µ₊(B) − µ₋(B), the result follows with g = g₊ − g₋.  □


We will use the symbol E(X|C) to stand for the function g. If C is the σ-field generated by another random quantity Y, then the symbol E(X|Y) is usually used instead of E(X|C). For the case in which C is the σ-field generated by Y, the next corollary follows from Theorem B.21 with the help of Theorem A.42.

Corollary B.22. Let (S,A,µ) be a probability space, and let (Y,C) be a measurable space such that C contains all singletons. Suppose that X : S → ℝ and Y : S → Y are measurable functions and E(|X|) < ∞. Let µ_Y be the measure induced on (Y,C) by Y from µ (see Theorem A.81). Let A_Y be the sub-σ-field of A generated by Y. Then there exists a function h : Y → ℝ that satisfies the following: If B ∈ A_Y equals Y⁻¹(C) for C ∈ C, then E(X I_B) = ∫_C h(t)dµ_Y(t).

We will use the symbol E(X|Y = t) to stand for the h(t) in Corollary B.22. At this point the reader might wish to review Example B.4 on page 609.

To summarize the above results, we state the following.
Definition B.23. Let (S,A,µ) be a probability space, and suppose that X : S → ℝ is measurable and E(|X|) < ∞. Let C be a sub-σ-field of A. We define the conditional mean (conditional expectation) of X given C, denoted E(X|C), to be any C measurable function g : S → ℝ that satisfies

  E(X I_B) = ∫_B g(s)dµ(s), for all B ∈ C.

Each such function is called a version of the conditional mean. If Y : S → Y and C is the sub-σ-field generated by Y, then E(X|C) is also called the conditional mean of X given Y, denoted E(X|Y). If, in addition, the σ-field of subsets of Y contains singletons, let h : Y → ℝ be the function such that g = h(Y). Then h(t) is denoted by E(X|Y = t).

When we say that a random variable equals E(X|Y), we will mean that it is a version of E(X|Y). The following propositions are immediate consequences of the above definitions.

Proposition B.24. Let (S,A,µ) be a probability space, and let (Y,C) be a measurable space such that C contains singletons. Let X : S → ℝ and Y : S → Y be measurable. Let µ_Y be the measure on Y induced from µ by Y. A function g : Y → ℝ is a version of E(X|Y = t) if and only if for all B ∈ C, ∫_B g(t)dµ_Y(t) = E(X I_B(Y)).

Proposition B.25.
• If Z and W are both versions of E(X|C), then Z = W, a.s.
• If X is C measurable, then E(X|C) = X, a.s.

Proposition B.26. If C = {S, ∅}, the trivial σ-field, then E(X|C) = E(X).

Proposition B.27.¹² Let (S,A,µ) be a probability space, and let (Y,C) be a measurable space. Let X : S → ℝ and Y : S → Y be measurable, and let g : Y → ℝ be such that g(Y)X is integrable. Let µ_Y be the measure on Y induced from µ by Y. Then E(g(Y)X) = ∫ g(t)E(X|Y = t)dµ_Y(t).

Proposition B.28.¹³ Let (S,A,µ) be a probability space and let X : S → ℝ, Y : S → (Y,B₁), and Z : S → (Z,B₂) be measurable functions. Let µ_Y and µ_Z be the measures induced on Y and Z by Y and Z, respectively, from µ. Suppose that E(|X|) < ∞ and that Z is a one-to-one function of Y, that is, there exists a bimeasurable h : Y → Z such that Z = h(Y). Then E(X|Y = y) = E(X|Z = h(y)), a.s. [µ_Y].
Conditional probability is the special case of conditional expectation in which X = I_A.

Definition B.29. Let (S,A,µ) be a probability space. For each A ∈ A, the conditional probability of A given C (or given Y if C is the σ-field generated by Y) is Pr(A|C) = E(I_A|C). If Pr(·|C)(s) is a probability on (S,A) for all s ∈ S, then the conditional probabilities given C are called regular conditional probabilities.

It turns out that under very general conditions (see Theorem B.32), we can choose the functions Pr(A|C) in such a way that they are regular conditional probabilities. In the future, we will assume that this is done in all such cases. If C is the σ-field generated by Y, then Pr(A|Y = y) will be used to stand for E(I_A|Y = y) as in the discussion following Corollary B.22.

If X : S → X is a random quantity, its conditional distribution is the collection of conditional probabilities on X induced from the restriction of conditional probabilities on S to the σ-field generated by X.

Definition B.30. Let (S,A,µ) be a probability space and let (X,B) be a measurable space. Suppose that X : S → X is a measurable function. Let P be the probability on (X,B) induced by X from µ. Let C be a sub-σ-field of A. For each B ∈ B, let P(B|C) = Pr(A|C), where A = X⁻¹(B). We say that any set of functions from S to [0,1] of the form

  {P(B|C)(·), for all B ∈ B}

is a version of the conditional distribution of X given C. If C is the σ-field generated by another random quantity Y : S → Y, a version of the conditional

¹²This proposition is used in the proof of Theorem B.64.

¹³This proposition is used to facilitate the transition from spaces of probability
measures to subsets of Euclidean space when parametric models are introduced.
It is also used in the proof of Theorem 2.114.

distribution of X given Y is specified by any collection of probability functions
of the form

{Pr(·|Y = t), for all t ∈ Y}.
If the P(·|C) are regular conditional probabilities, then we say that the version
of the conditional distribution of X given C is a regular conditional distribution.

When we refer to a conditional distribution without the word "version," we
will mean a version of the conditional distribution. Occasionally, we will need to
choose a version that satisfies some other condition. In those cases, we will try
to be explicit about versions.

If X is sufficiently like the real numbers, there will be versions of conditional
distributions that are regular. We make that precise with the following definition.

Definition B.31. Let (X, B) be a measurable space. If there exists a bimeasurable
function φ : X → R, where R is a Borel subset of ℝ, then (X, B) is called a
Borel space.

In particular, we can show that all Euclidean spaces with their Borel σ-fields
are Borel spaces. (See Lemma B.36.) First, we prove that regular conditional
distributions exist on Borel spaces. The proof is borrowed from Breiman (1968,
Section 4.3).
Theorem B.32. Let (S, A, μ) be a probability space and let C be a sub-σ-field of
A. Let (X, B) be a Borel space. Let X : S → X be a random quantity. Then there
exists a regular conditional distribution of X given C.
PROOF. Let φ : X → R be the function guaranteed by Definition B.31. Define
the random variable Z = φ(X) : S → R ⊆ ℝ. First we prove that the σ-field
generated by X, A_X, is contained in the σ-field generated by Z, A_Z. Let
B ∈ A_X; then there is C ∈ B such that B = X⁻¹(C). Since φ is one-to-one,
φ⁻¹(φ(C)) = C. Since φ⁻¹ is measurable, φ(C) is a Borel subset of R. Now,
Z⁻¹(φ(C)) = X⁻¹(C) = B, hence B ∈ A_Z. It is also easy to see that A_Z is
contained in A_X, so they are equal. If Z has a regular conditional distribution,
then so does X. The remainder of the proof is to show that Z has a regular
conditional distribution.

For each rational number q, choose a version of Pr(Z ≤ q|C) and let

M_{q,r} = {s : Pr(Z ≤ q|C)(s) < Pr(Z ≤ r|C)(s)},   M = ∪_{q>r} M_{q,r}.

According to Problem 3 on page 662 and countable additivity, μ(M) = 0. Next,
define

N_q = {s : lim_{r↓q, r rational} Pr(Z ≤ r|C)(s) ≠ Pr(Z ≤ q|C)(s)},   N = ∪_{all q} N_q.

We can use Problem 3 on page 662 once again to prove that μ(N_q) = 0 for all q,
hence μ(N) = 0. Similarly, we can show that μ(L) = 0, where L is the set

L = {s : lim_{r→−∞, r rational} Pr(Z ≤ r|C)(s) ≠ 0} ∪ {s : lim_{r→∞, r rational} Pr(Z ≤ r|C)(s) ≠ 1}.

If G is an arbitrary CDF, we can define

F(z|C)(s) = { G(z)                                        if s ∈ M ∪ N ∪ L,
            { lim_{r↓z, r rational} Pr(Z ≤ r|C)(s)        otherwise.

F(·|C)(s) is a CDF for every s (see Problem 2 on page 661), and it is easy to
check that F(z|C) is a version of Pr(Z ≤ z|C) for every z. If we extend F(·|C)(s)
to a probability measure η(·; s) on the Borel σ-field for every s, we only need to
check that, for every Borel set B, η(B; ·) is a version of Pr(Z ∈ B|C). That is, for
every C ∈ C, we need

∫_C η(B; s) dμ(s) = Pr({Z ∈ B} ∩ C).   (B.33)

By construction, (B.33) is true if B is an interval of the form (−∞, z]. Such
intervals form a π-system Π such that B is the smallest σ-field containing Π. If
we define

Q₁(B) = [∫_C η(B; s) dμ(s)] / Pr(C),   Q₂(B) = Pr({Z ∈ B} ∩ C) / Pr(C),

we see that Q₁ and Q₂ agree on Π. Tonelli's theorem A.69 can be used to see
that Q₁ is countably additive, while Q₂ is clearly a probability. It follows from
Theorem A.26 that Q₁ and Q₂ agree on B. □
Note that the only condition required for regular conditional distributions to
exist is a condition on the space of the random quantity for which we desire a
regular conditional distribution. The σ-field C, or the random quantity on which
we condition, can be quite general. In the future, if we assume that (X, B) is a
Borel space, we can construct regular conditional distributions given anything we
wish. Also, since the function in the definition of Borel space is one-to-one and
the Borel σ-field of ℝ contains singletons, it follows that the σ-field of a Borel
space contains singletons (cf. Theorem A.42).

B.3.2 Borel Spaces*


In this section we prove that there are lots of Borel spaces. First, we prove that
every space satisfying some general conditions is a Borel space, and then we
show that Euclidean spaces satisfy those conditions. Then, we show that finite
and countable products of Borel spaces are Borel spaces. The most general type
of Borel space in which we shall be interested is a complete separable metric
space (sometimes called a Polish space).

Definition B.34. Let X be a topological space. A subset D of X is dense if, for
every x ∈ X and every open set U containing x, there is an element of D in U.
If there exists a countable dense subset of X, then X is separable. Suppose that
X is a metric space with metric d. A sequence {x_n}_{n=1}^∞ is Cauchy if, for every ε > 0,
there exists N such that m, n ≥ N implies d(x_n, x_m) < ε. A metric space X is
complete if every Cauchy sequence converges. A complete and separable metric
space is called a Polish space.

*This section may be skipped without interrupting the flow of ideas.



We would like to prove that all Polish spaces are Borel spaces. First, we prove
that ℝ^∞ is a Borel space (Lemma B.36). Then we prove that there exist bimeasurable
maps between Polish spaces and measurable subsets of ℝ^∞ (Lemma B.40).
The following simple proposition pieces these results together.

Proposition B.35. If X is a Borel space and there exists a bimeasurable function
f : Y → X, then Y is a Borel space.

Lemma B.36. The infinite product space ℝ^∞ is a Borel space.

PROOF. The idea of the proof¹⁴ is the following. We start by transforming each
coordinate to the interval (0,1) using a continuous function with continuous
inverse. For each number in (0,1) we find a base 2 expansion, which is a sequence
of 0s and 1s. We then take these sequences (one for each coordinate) and merge
them into a single sequence, which we then interpret as the base 2 expansion of
a number in (0,1). If this sequence of transformations is bimeasurable, we have
our function φ.

Let ψ : ℝ^∞ → (0,1)^∞ be defined by

ψ(x₁, x₂, ...) = (1/2 + tan⁻¹(x₁)/π, 1/2 + tan⁻¹(x₂)/π, ...),

which is bimeasurable. For each x ∈ [0,1), set Y₀(x) = x and, for j = 1, 2, ...,
define

Z_j(x) = { 1 if 2Y_{j−1}(x) ≥ 1,
         { 0 if not,
Y_j(x) = 2Y_{j−1}(x) − Z_j(x).

For each j, Z_j is a measurable function. It is easy to see that Z_j(x) is the jth digit
in a base 2 expansion of x with infinitely many 0s. Note also that Y_j(x) ∈ [0,1)
for all j and x.
Create the following triangular array of integers:

1
2 3
4 5 6
7 8 9 10
11 12 13 14 15

Let the jth integer from the top of the ith column be ℓ(i,j). Then

ℓ(i,j) = i(i+1)/2 + i(j−1) + (j−1)(j−2)/2.

¹⁴This proof is adapted from Breiman (1968, Theorem A.47).



Clearly, each integer t appears once and only once as ℓ(i,j) for some i and j.¹⁵
Define

h(x₁, x₂, ...) = Σ_{i=1}^∞ Σ_{j=1}^∞ Z_j(x_i) 2^{−ℓ(i,j)}.   (B.37)

Then h is clearly a measurable function from (0,1)^∞ to a subset R of (0,1). There
is a countable subset of (0,1) which is not in the image of h. These are the numbers
with only finitely many 0s in one or more of the subsequences {ℓ(i,j)}_{j=1}^∞ of their
base 2 expansion, for i = 1, 2, .... For example, the number c = Σ_{i=0}^∞ 2^{−i(i+1)/2−1}
is not in R.¹⁶ Since the complement of a countable set is measurable, the set R
is measurable.

We define φ = h ∘ ψ. If we can show that h has a measurable inverse, the proof
is complete. For each x ∈ R, define

φ_i(x) = Σ_{j=1}^∞ Z_{ℓ(i,j)}(x) / 2^j.   (B.38)

Clearly, each φ_i is measurable. Note that, for each i and j,

Z_j(φ_i(x)) = Z_{ℓ(i,j)}(x).   (B.39)

Combining (B.37), (B.38), (B.39), and the fact that every integer appears once
and only once as ℓ(i,j) for some i and j, we see that h(φ₁(x), φ₂(x), ...) = x, so
that (φ₁, φ₂, ...) is the inverse of h and it is measurable. □

Lemma B.40. If (X, B) is a Polish space with the Borel σ-field and metric d,
then it is a Borel space.¹⁷

PROOF. All we need to prove is that there exists a bimeasurable f : X → G, where
G is a measurable subset of ℝ^∞. We then use Lemma B.36 and Proposition B.35.
Let {x_n}_{n=1}^∞ be a countable dense subset of X, and let d be the metric on X.
Define the function f : X → ℝ^∞ by

f(x) = (d(x, x₁), d(x, x₂), ...).


We will first show that f is continuous, which will make it measurable. Suppose
that {y_n}_{n=1}^∞ is a sequence in X that converges to y ∈ X. The kth coordinate of
f(y_n) is d(y_n, x_k), which converges to d(y, x_k) because the metric is continuous.
Hence, each coordinate of f is continuous, and f is continuous. Next, we prove
that f is one-to-one. Suppose that f(x) = f(y). Then d(x, x_n) = d(y, x_n) for

¹⁵It is easy to check the following. For each integer t, let k = inf{n : t ≤
n(n+1)/2}. Then r(t) = 1 + k(k+1)/2 − t and s(t) = k + 1 − r(t) have the
property that ℓ(r(t), s(t)) = t, r(ℓ(i,j)) = i, and s(ℓ(i,j)) = j.

¹⁶This number corresponds to having 1s in the first column of the triangular
array but nowhere else. Clearly, 0 < c < 1, but it is impossible to have 1s in the
entire first column, since this would require x₁ = 1. Even if x₁ = 1 had been
allowed, its base 2 expansion would have ended in infinitely many 0s rather than
infinitely many 1s.

¹⁷This proof is adapted from p. 219 of Billingsley (1968) and Theorem 15.8 of
Royden (1968).

all n. Since {x_n}_{n=1}^∞ is dense, there exists a subsequence {x_{n_j}}_{j=1}^∞ such that
lim_{j→∞} x_{n_j} = x. It follows that 0 = lim_{j→∞} d(x, x_{n_j}) = lim_{j→∞} d(y, x_{n_j}); hence
lim_{j→∞} x_{n_j} = y, and y = x.

Next, we prove that f⁻¹ : f(X) → X is continuous. Suppose that a sequence
of points {f(y_n)}_{n=1}^∞ converges to f(y). Let lim_{j→∞} x_{n_j} = y. Then
lim_{j→∞} d(y, x_{n_j}) = 0. But d(y, x_{n_j}) is the n_j-th coordinate of f(y), which in turn
is the limit (as n → ∞) of the n_j-th coordinate of f(y_n). For each j, d(y_n, y) ≤
d(y_n, x_{n_j}) + d(y, x_{n_j}). Let ε > 0 and let j be large enough so that d(y, x_{n_j}) < ε/4.
Now, let N be large enough so that n ≥ N implies d(y_n, x_{n_j}) < d(y, x_{n_j}) + ε/2. It
follows that, if n ≥ N, d(y_n, y) < ε. Hence lim_{n→∞} y_n = y and f⁻¹ is continuous,
hence measurable.

Finally, we will prove that the image G of f is a measurable subset of ℝ^∞. We
will do this by proving that G is the intersection of its closure Ḡ with countably
many open sets.¹⁸ Let G_n be the following set:

{x ∈ ℝ^∞ : there exists O_x, a neighborhood of x, with d(a,b) ≤ 1/n for all a, b ∈ f⁻¹(O_x)}.

Since O_x ⊆ G_n for each x ∈ G_n, G_n is open. Also, since f and f⁻¹ are continuous,
it is easy to see that G ⊆ G_n for all n. Let G′ = Ḡ ∩ ∩_{n=1}^∞ G_n. For each x ∈ G′,
let O_{x,n} ⊆ G_n be such that O_{x,1} ⊇ O_{x,2} ⊇ ⋯ and that d(a,b) ≤ 1/n for
all a, b ∈ f⁻¹(O_{x,n}). Note that f⁻¹(O_{x,n}) ⊇ f⁻¹(O_{x,n+1}) for all n. If y_n ∈
f⁻¹(O_{x,n}) for every n, then {y_n}_{n=1}^∞ is a Cauchy sequence, since n, m ≥ N
implies d(y_n, y_m) ≤ 1/N. Hence, there is a limit y to the sequence. It is easy to
see that if there were two such sequences with limits y and y′, then d(y, y′) < ε
for all ε > 0, hence y = y′. So we can define a function h : G′ → X by h(x) = y.
If x ∈ G, then clearly h(x) = f⁻¹(x). If x′ ∈ O_{x,n}, then d(h(x), h(x′)) ≤ 1/n, so
h is continuous. We now prove that G′ ⊆ G, which implies that G = G′ and the
proof will be complete. Let x ∈ G′, and let x_n ∈ G be such that x_n → x. (This is
possible since G′ ⊆ Ḡ.) Since h is continuous, f⁻¹(x_n) → h(x). If y_n = f⁻¹(x_n)
and y = h(x), then y_n → y and f(y_n) → f(y) ∈ G, since f is continuous. But
f(y_n) = x_n, so f(y) = x, and the proof is complete. □
Next, we show that products of Borel spaces are Borel spaces.

Lemma B.41. Let (X_n, B_n) be a Borel space for each n. The product spaces
∏_{i=1}^n X_i for all finite n and ∏_{n=1}^∞ X_n with product σ-fields are Borel spaces.

PROOF. We will prove the result for the infinite product. The proofs for finite
products are similar. If X_n = ℝ for all n, the result is true by Lemma B.36. For
general X_n, let φ_n : X_n → R_n and φ_* : ℝ^∞ → R_* be bimeasurable, where R_n
and R_* are measurable subsets of ℝ. Then, it is easy to see that

φ : ∏_{n=1}^∞ X_n → φ_*(∏_{n=1}^∞ R_n)

is bimeasurable, where φ(x₁, x₂, ...) = φ_*(φ₁(x₁), φ₂(x₂), ...). □


Next, we show that the set of bounded continuous functions from [0,1] to the
real numbers is also a Polish space.

¹⁸We use the symbol Ḡ to stand for the closure of the set G. The closure of a
subset G of a topological space is the smallest closed set containing G. A set is
closed if and only if its complement is open.

Lemma B.42.¹⁹ Let C[0,1] be the set of all bounded continuous functions from
[0,1] to ℝ. Let ρ(f,g) = sup_{x∈[0,1]} |f(x) − g(x)|. Then ρ is a metric on C[0,1],
and C[0,1] is a Polish space.

PROOF. That ρ is a metric is easy to see. To see that C[0,1] is separable, let D_k be the
set of functions that take on rational values at the points 0, 1/k, ..., (k−1)/k, 1
and are linear between these values. Let D = ∪_{k=1}^∞ D_k. The set D is countable.
Every continuous function on a compact set is uniformly continuous, so let f ∈
C[0,1] and ε > 0. Let δ be small enough so that |x−y| < δ implies |f(x) − f(y)| <
ε/4, and let k be large enough that 1/k < δ. There exists g ∈ D_k such that
|g(i/k) − f(i/k)| < ε/4 for each i = 0, ..., k. For i/k ≤ x ≤ (i+1)/k,
|f(x) − f(i/k)| < ε/4 and |g(x) − g(i/k)| < ε/2, so |f(x) − g(x)| < ε. To see
that C[0,1] is complete, let {f_n}_{n=1}^∞ be a Cauchy sequence. Then, for all x,
{f_n(x)}_{n=1}^∞ is a Cauchy sequence of real numbers that converges to some number
f(x). We need to show that the convergence of f_n to f is uniform. To the
contrary, assume that there exists ε such that, for each n there is x_n such that
|f_n(x_n) − f(x_n)| > ε. We know that there exists n such that m > n implies
|f_n(x) − f_m(x)| < ε/2 for all x. In particular, |f_n(x_n) − f_m(x_n)| < ε/2 for all
m > n. Since lim_{m→∞} f_m(x_n) = f(x_n), it follows that there exists m such that
|f_m(x_n) − f(x_n)| < ε/2, a contradiction. □
Because Borel spaces have σ-fields that look just like the Borel σ-field of the
real numbers, their σ-fields are generated by countably many sets. The countable
field that generates the Borel σ-field of ℝ is the collection of all sets that are
unions of finitely many disjoint intervals (including degenerate ones and infinite
ones) with rational endpoints.

Proposition B.43.²⁰ Let (X, B) be a Borel space. Then there exists a countable
field C such that B is the smallest σ-field containing C.

Because a field is a π-system, Theorem A.26 and Proposition B.43 imply the
following.

Corollary B.44. Let (X, B) be a Borel space, and let C be a countable field that
generates B. If μ₁ and μ₂ are σ-finite measures on B that agree on C, then they
agree on B.

B.3.3 Conditional Densities


Because conditional distributions are probability measures, many of the theorems
from Appendix A which apply to such measures apply to conditional distributions.
For example, the monotone convergence theorem A.52 and the dominated
convergence theorem A.57 apply to conditional means because limits of measurable
functions are still measurable. Also, most of the properties of probability
measures from this appendix apply as well. In this section, we focus on the
existence and calculation of densities for conditional distributions.

If the joint distribution of two random quantities has a density with respect
to a product measure, then the conditional distributions have densities that can

¹⁹This lemma is used in the proof of Lemma 2.121.

²⁰This proposition is used in the proofs of Lemmas 2.124 and 2.126 and
Theorem 3.110.

be calculated in the usual way.

Proposition B.45. Let (S, A, μ) be a probability space and let (X, B₁, ν_X) and
(Y, B₂, ν_Y) be σ-finite measure spaces. Let X : S → X and Y : S → Y be
measurable functions. Let μ_{X,Y} be the probability induced on (X × Y, B₁ ⊗ B₂)
by (X, Y) from μ. Suppose that μ_{X,Y} ≪ ν_X × ν_Y. Let the density be f_{X,Y}(x,y).
Let the probability induced on (Y, B₂) by Y from μ be denoted μ_Y. Then μ_Y is
absolutely continuous with respect to ν_Y with density

f_Y(y) = ∫_X f_{X,Y}(x,y) dν_X(x),

and the conditional distribution of X given Y has densities

f_{X|Y}(x|y) = f_{X,Y}(x,y) / f_Y(y)

with respect to ν_X.


This proposition can be proven directly using Tonelli's theorem A.69 or as a
special case of Theorem B.46 (see Problem 15 on page 663).
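The recipe of Proposition B.45 can be checked numerically when the joint density is known. The Python sketch below (a bivariate normal with correlation 0.5, discretized on a grid; the example and tolerances are ours) computes f_Y by integrating out x and verifies that each f_{X|Y}(·|y) integrates to 1 and has conditional mean ρy, the familiar normal-theory answer.

# Illustrative sketch: marginal and conditional densities from a joint
# density, as in Proposition B.45, computed on a grid.
import numpy as np

rho = 0.5
xs = np.linspace(-8, 8, 801)
ys = np.linspace(-8, 8, 801)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, ys, indexing="ij")
f_xy = np.exp(-(X**2 - 2*rho*X*Y + Y**2) / (2*(1 - rho**2))) \
       / (2*np.pi*np.sqrt(1 - rho**2))

f_y = f_xy.sum(axis=0) * dx          # f_Y(y) = int f_{X,Y}(x,y) dnu_X(x)
f_x_given_y = f_xy / f_y             # f_{X|Y}(x|y) = f_{X,Y} / f_Y

# each conditional density integrates to 1 in x ...
print(np.allclose(f_x_given_y.sum(axis=0) * dx, 1.0))
# ... and its mean matches the normal-theory answer rho * y
cond_mean = (xs[:, None] * f_x_given_y).sum(axis=0) * dx
print(np.allclose(cond_mean, rho * ys, atol=1e-3))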
Theorem B.46. Let (X, B₁) be a Borel space, let (Y, B₂) be a measurable space,
and let (X × Y, B₁ ⊗ B₂, ν) be a σ-finite measure space. Then, there exists a measure
ν_Y on (Y, B₂) and, for each y ∈ Y, there exists a measure ν_{X|Y}(·|y) on (X, B₁)
such that for each integrable or nonnegative h : X × Y → ℝ, ∫ h(x,y) dν_{X|Y}(x|y)
is B₂ measurable and

∫ h(x,y) dν(x,y) = ∫ [∫ h(x,y) dν_{X|Y}(x|y)] dν_Y(y).   (B.47)

PROOF. Let f be the strictly positive integrable function guaranteed by Theorem A.85.
Without loss of generality, assume that ∫ f(x,y) dν(x,y) = 1. The
measure μ(A) = ∫_A f(x,y) dν(x,y) is a probability, ν ≪ μ, and (dν/dμ)(x,y) =
1/f(x,y). Let μ_{X|Y} be a regular conditional distribution on (X, B₁) constructed
from μ, and let ν_Y be the marginal distribution on (Y, B₂). Define

ν_{X|Y}(A|y) = ∫_A [1/f(x,y)] dμ_{X|Y}(x|y).

Note that

∫ I_{A×B}(x,y) dμ_{X|Y}(x|y) = I_B(y) μ_{X|Y}(A|y),   (B.48)

which is a measurable function of y because μ_{X|Y} is a regular conditional distribution.
Just as in the proof of Lemma A.61, we can use the π-λ theorem A.17
to show that ∫ g dμ_{X|Y} is measurable if g is the indicator of an element of
the product σ-field. It follows that ∫ g dμ_{X|Y} is measurable for every nonnegative
simple function g. By the monotone convergence theorem A.52, letting
{g_n}_{n=1}^∞ be nonnegative simple functions increasing to g everywhere, it follows
that ∫ g(x,y) dμ_{X|Y}(x|y) is measurable for all nonnegative measurable functions,
and hence ∫ h dν_{X|Y} = ∫ (h/f) dμ_{X|Y} is measurable if h is nonnegative.

Next, define a probability η on (X × Y, B₁ ⊗ B₂) by

η(C) = ∫ [∫ I_C(x,y) dμ_{X|Y}(x|y)] dν_Y(y).

It follows from (B.48) that η and μ agree on the collection of all product sets
(a π-system that generates B₁ ⊗ B₂). Theorem A.26 implies that they agree on
B₁ ⊗ B₂. By linearity of integrals and the monotone convergence theorem A.52,
if g is nonnegative, then

∫ g(x,y) dη(x,y) = ∫ [∫ g(x,y) dμ_{X|Y}(x|y)] dν_Y(y)
                 = ∫ [∫ g(x,y) f(x,y) dν_{X|Y}(x|y)] dν_Y(y).   (B.49)

For every nonnegative h,

∫ h(x,y) dν(x,y) = ∫ [h(x,y)/f(x,y)] f(x,y) dν(x,y) = ∫ [h(x,y)/f(x,y)] dμ(x,y)
                 = ∫ [h(x,y)/f(x,y)] dη(x,y) = ∫ [∫ h(x,y) dν_{X|Y}(x|y)] dν_Y(y),   (B.50)

where the second equality follows from the fact that dμ/dν = f, the third follows
from the fact that μ and η are the same measure, and the fourth follows
from (B.49). If h is integrable with respect to ν, then (B.50) applies to h⁺,
h⁻, and |h|, and all three results are finite. Also, ∫ |h(x,y)| dν_{X|Y}(x|y) is measurable
and ν_Y({y : ∫ |h(x,y)| dν_{X|Y}(x|y) = ∞}) = 0. So ∫ h⁺(x,y) dν_{X|Y}(x|y)
and ∫ h⁻(x,y) dν_{X|Y}(x|y) are both finite almost surely, and their difference is
∫ h(x,y) dν_{X|Y}(x|y), a measurable function. It now follows that (B.47) holds. □
The measures ν_Y and ν_{X|Y} in Theorem B.46 are not unique. In the proof, we
could easily have defined ν_Y several ways, such as ν_Y(A) = ∫_A g(y) dμ_Y(y) for any
strictly positive function g with finite μ_Y integral. A corresponding adjustment
would have to be made to the definition of ν_{X|Y}.

In the special case in which ν is a product measure ν₁ × ν₂, it is easy to show
that ν₁ can play the role of ν_{X|Y}(·|y) for all y and that ν₂ can play the role of ν_Y
in Theorem B.46. (See Problem 15 on page 663.)

There is a familiar application of Theorem B.46 to cases in which X and Y are
Euclidean spaces but ν is concentrated on a lower-dimensional manifold defined
by a function y = g(x).
Proposition B.51. Suppose that X = ℝⁿ and Y = ℝᵏ, with k < n. Let g :
X → Y be such that there exists h : X → ℝ^{n−k} such that v(x) = (g(x), h(x))
is one-to-one, is differentiable, and has a differentiable inverse. For y ∈ ℝᵏ and
w ∈ ℝ^{n−k}, define J(y,w) to be the Jacobian, that is, the determinant of the
matrix of partial derivatives of the coordinates of v⁻¹(y,w) with respect to the
coordinates of y and of w. Let λ_i be Lebesgue measure on ℝⁱ, for each i. Define
a measure ν on X × Y by ν(C) = λ_n({x : (x, g(x)) ∈ C}). Then ν_Y equal to
Lebesgue measure on ℝᵏ and ν_{X|Y}(A|y) = ∫_{A_y^v} J(y,w) dλ_{n−k}(w) satisfy (B.47),
where A_y^v = {w : v⁻¹(y,w) ∈ A}.
We are now in a position to derive a formula for conditional densities in general.²¹
Theorem B.52. Let (S, A, μ) be a probability space, let (X, B₁) be a Borel space,
let (Y, B₂) be a measurable space, and let (X × Y, B₁ ⊗ B₂, ν) be a σ-finite measure
space. Let ν_Y and ν_{X|Y} be as guaranteed by Theorem B.46. Let X : S → X and
Y : S → Y be measurable functions. Let μ_{X,Y} be the probability induced on
(X × Y, B₁ ⊗ B₂) by (X, Y) from μ. Suppose that μ_{X,Y} ≪ ν. Let the density be
f_{X,Y}(x,y). Let the probability induced on (Y, B₂) by Y from μ be denoted μ_Y.
Then μ_Y ≪ ν_Y; for each y ∈ Y,

(dμ_Y/dν_Y)(y) = f_Y(y) = ∫_X f_{X,Y}(x,y) dν_{X|Y}(x|y);   (B.53)

and the conditional distribution of X given Y = y has density

f_{X|Y}(x|y) = f_{X,Y}(x,y) / f_Y(y)   (B.54)

with respect to ν_{X|Y}(·|y).

PROOF. It follows from Theorem B.46 that, for all B ∈ B₂,

μ_Y(B) = ∫ I_B(y) f_{X,Y}(x,y) dν(x,y)
       = ∫ I_B(y) [∫ f_{X,Y}(x,y) dν_{X|Y}(x|y)] dν_Y(y).

The fact that μ_Y ≪ ν_Y and (B.53) both follow from this equation. Let μ_{X|Y}(·|y)
denote a regular conditional distribution of X given Y = y. For each A ∈ B₁
and B ∈ B₂, apply Theorem B.46 with h(x,y) = I_A(x) I_B(y) f_{X|Y}(x|y) f_Y(y) to
conclude

∫_B μ_{X|Y}(A|y) dμ_Y(y) = ∫_B [∫_A f_{X|Y}(x|y) dν_{X|Y}(x|y)] dμ_Y(y).

Since this is true for all B ∈ B₂, we conclude that

μ_{X|Y}(A|y) = ∫_A f_{X|Y}(x|y) dν_{X|Y}(x|y).

Hence (B.54) gives the density of μ_{X|Y}(·|y) with respect to ν_{X|Y}(·|y). □

The point of Theorem B.52 is that we can calculate conditional densities for
random quantities even if the measure that dominates the joint distribution is
not a product measure. When the joint distribution is dominated by a product

²¹The condition that the joint distribution have a density with respect to a
measure ν in Theorem B.52 is always met, since ν can be taken equal to the joint
distribution. The theorem applies even if ν is not the joint distribution, however.

measure, the conditional distributions are all dominated by the same measure.
(See Problem 15 on page 663.) In general, however, the conditional distribution
of X given Y = y is dominated by a measure that depends on y. For example, if
Y = g(X), the joint distribution of (X, Y) is not dominated by a product measure
even if the distribution of X is dominated. (See also Problem 7 on page 662.)
Nevertheless, we have the following result.
Corollary B.55.²² Let (S, A, μ) be a probability space, let (Y, B₂) be a measurable
space such that B₂ contains all singletons, and let (X, B₁) be a Borel space with
ν_X a σ-finite measure on (X, B₁). Let X : S → X and g : X → Y be measurable
functions. Let Y = g(X). Suppose that the distribution of X has density f_X with
respect to ν_X. Define ν on (X × Y, B₁ ⊗ B₂) by ν(C) = ν_X({x : (x, g(x)) ∈ C}).
Let μ_{X,Y} be the probability induced on (X × Y, B₁ ⊗ B₂) by (X, Y) from μ. Let the
probability induced on (Y, B₂) by Y from μ be denoted μ_Y. Then μ_{X,Y} ≪ ν with
Radon–Nikodym derivative f_{X,Y}(x,y) = f_X(x) I_{{g(x)}}(y). Also, the conditions of
Theorem B.46 hold, and we can write

(dμ_Y/dν_Y)(y) = f_Y(y) = ∫_X I_{{g(x)}}(y) f_X(x) dν_{X|Y}(x|y),

f_{X|Y}(x|y) = { f_X(x)/f_Y(y)  if y = g(x),
             { 0               otherwise.

Also, the conditional distribution of Y given X is given by μ_{Y|X}(C|x) = I_C(g(x)).

PROOF. Since ν_X is σ-finite, ν is also. Since Y is a function of X, Theorem A.81
implies that for all integrable h, ∫ h(x,y) dν(x,y) = ∫ h(x, g(x)) dν_X(x). The facts
that f_{X,Y} has the specified form and that μ_{Y|X} is the conditional distribution of
Y given X follow easily from this equation. □

The point of Corollary B.55 is that if Y = g(X), then we can assume that the
conditional distribution of X given Y = y is concentrated on g⁻¹({y}).
Example B.56.²³ Let f be a spherically symmetric density with respect to λ_n,
Lebesgue measure on ℝⁿ. That is, f(x) = h(xᵀx) for some function h : ℝ → [0,∞)
with ∫ h(xᵀx) dλ_n(x) = 1. Let X have density f and let V = XᵀX. Let R = V^{1/2},
and transform to spherical coordinates:

x₁ = r cos(θ₁),
x₂ = r sin(θ₁) cos(θ₂),
⋮
x_{n−1} = r sin(θ₁) ⋯ cos(θ_{n−1}),
x_n = r sin(θ₁) ⋯ sin(θ_{n−1}).

The Jacobian is r^{n−1} j(θ), where j is some function of θ alone. The Jacobian for
the transformation to v and θ is v^{(n/2)−1} j(θ)/2. The integral of j(θ) over all θ
values is 2π^{n/2}/Γ(n/2). So, the marginal density of V is

f_V(v) = π^{n/2} v^{(n/2)−1} h(v) / Γ(n/2).

The conditional density of X given V = v is then

f_{X|V}(x|v) = [Γ(n/2) v^{1−(n/2)} / π^{n/2}] I_{{v}}(xᵀx)

with respect to the measure ν_{X|V}(C|v) = ∫_{C′} v^{(n/2)−1} j(θ) dλ_{n−1}(θ)/2, where

C′ = {θ : v^{1/2}(cos(θ₁), ..., sin(θ₁) ⋯ sin(θ_{n−1})) ∈ C}.

It follows that the conditional distribution of X given V = v is given by

μ_{X|V}(C|v) = [Γ(n/2) / (2π^{n/2})] ∫_{C′} j(θ) dλ_{n−1}(θ).

It is easy to see that μ_{X|V}(·|v) is the uniform distribution over the sphere of
radius v^{1/2} in n dimensions.

²²This corollary is used in the proof of Theorem 2.86 and in Example 3.106.

²³The calculation in this example is used again in Example 4.121.
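A Monte Carlo check of this example is easy when h is the standard normal kernel h(v) = (2π)^{−n/2} e^{−v/2}, in which case f_V above reduces to the chi-squared density with n degrees of freedom. The Python sketch below (illustrative only, with n = 3) compares a histogram of V = XᵀX with f_V and notes that the normalized vector X/V^{1/2} has mean near the origin, as the uniform distribution on the sphere requires.

# Illustrative sketch: the marginal of V = X^T X and the direction
# X / V^{1/2} for a spherically symmetric (standard normal) density.
import numpy as np
from math import gamma, pi

rng = np.random.default_rng(0)
n, N = 3, 200_000
X = rng.standard_normal((N, n))
V = (X**2).sum(axis=1)

def f_V(v):     # pi^{n/2} v^{n/2-1} h(v) / Gamma(n/2), h as above
    return pi**(n/2) * v**(n/2 - 1) * (2*pi)**(-n/2) * np.exp(-v/2) \
           / gamma(n/2)

hist, edges = np.histogram(V, bins=60, range=(0, 20), density=True)
mids = (edges[:-1] + edges[1:]) / 2
print(np.max(np.abs(hist - f_V(mids))))    # small (Monte Carlo noise)

U = X / np.sqrt(V)[:, None]                # direction X / V^{1/2}
print(U.mean(axis=0))                      # near 0: uniform on the sphere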

Another example was given in Example B.5 on page 610.

B.3.4 Conditional Independence


The concept of conditional independence will turn out to be central to the
development of statistical models.

Definition B.57. Let N be an index set, let Y and {X_i}_{i∈N} be random quantities,
and let A_i be the σ-field generated by X_i. We say that {X_i}_{i∈N} are conditionally
independent given Y if, for every n, every set of distinct indices i₁, ..., i_n, and
every collection of sets A₁ ∈ A_{i₁}, ..., A_n ∈ A_{i_n}, we have

Pr(∩_{i=1}^n A_i | Y) = ∏_{i=1}^n Pr(A_i | Y), a.s.   (B.58)

If, in addition, Y is constant almost surely, we say {X_i}_{i∈N} are independent.
Under the same conditions as above, if all of the conditional distributions of
the X_i given Y are the same, then we say {X_i}_{i∈N} are conditionally IID given
Y. If, in addition, Y is constant almost surely, we say {X_i}_{i∈N} are IID.
Example B.59. Let F be a joint CDF of n random variables X₁, ..., X_n, and
let μ be the corresponding measure on ℝⁿ. Then μ is a product measure if and
only if X₁, ..., X_n are independent (see Proposition B.66).
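A small discrete example may help distinguish conditional independence from independence. In the Python sketch below (parameters chosen arbitrarily), X₁ and X₂ are conditionally IID given Y by construction, and the computation confirms that they are nevertheless dependent marginally.

# Illustrative sketch: conditionally independent given Y, yet
# marginally dependent.
import itertools

pY = {0: 0.5, 1: 0.5}
p = {0: 0.2, 1: 0.8}            # Pr(X_i = 1 | Y = y)

def joint(x1, x2, y):
    return pY[y] * (p[y] if x1 else 1 - p[y]) * (p[y] if x2 else 1 - p[y])

# conditional independence given Y (built in by construction):
for y, x1, x2 in itertools.product((0, 1), repeat=3):
    lhs = joint(x1, x2, y) / pY[y]
    rhs = (p[y] if x1 else 1 - p[y]) * (p[y] if x2 else 1 - p[y])
    assert abs(lhs - rhs) < 1e-15

# but not independence: Pr(X1=1, X2=1) != Pr(X1=1) Pr(X2=1)
pX1X2 = sum(joint(1, 1, y) for y in (0, 1))
pX1 = sum(joint(1, x2, y) for x2 in (0, 1) for y in (0, 1))
print(pX1X2, pX1 * pX1)         # 0.34 versus 0.25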
Example B.60 (Continuation of Example B.56; see page 627).²⁴ Transform to
(Y, V), where Y = X/V^{1/2}. Then the conditional distribution of Y given V is
given by

μ_{Y|V}(D|v) = [Γ(n/2) / (2π^{n/2})] ∫_{D′} j(θ) dλ_{n−1}(θ),

where D′ = {θ : (cos(θ₁), ..., sin(θ₁) ⋯ sin(θ_{n−1})) ∈ D}. We note that this
formula does not depend on v; hence Y is independent of V. In addition, it is
easy to see that μ_{Y|V}(·|v) is just the uniform distribution over the sphere of
radius 1 in n dimensions.

²⁴This calculation is used again in Example 4.121.
The use of conditional independence in predictive inference is based on the
following theorem.

Theorem B.61.²⁵ Let N be an index set, let Y and {X_i}_{i∈N} be a collection
of random quantities, and let A_i be the σ-field generated by X_i. Then {X_i}_{i∈N}
are conditionally independent given Y if and only if for every n and m,
every set of distinct indices i₁, ..., i_n, j₁, ..., j_m, and every collection of sets A₁ ∈
A_{i₁}, ..., A_n ∈ A_{i_n}, we have

Pr(∩_{i=1}^n A_i | Y, X_{j₁}, ..., X_{j_m}) = Pr(∩_{i=1}^n A_i | Y), a.s.   (B.62)

PROOF. For the "if" part, we will assume (B.62) and prove (B.58) by induction
on n. For n = 1, there is nothing to prove. Assuming (B.58) is true for all n ≤ k,
we now prove it for n = k + 1. Let A_j ∈ A_{i_j} for j = 1, ..., k+1. According to
(B.62) and (B.58) for n = k, we have

Pr(∩_{i=1}^k A_i | Y, X_{i_{k+1}}) = Pr(∩_{i=1}^k A_i | Y) = ∏_{i=1}^k Pr(A_i | Y), a.s.

It follows that, for all B ∈ A_Y, the σ-field generated by Y,

Pr(B ∩ ∩_{i=1}^{k+1} A_i) = Pr(B ∩ A_{k+1} ∩ ∩_{i=1}^k A_i)
 = ∫_{B∩A_{k+1}} Pr(∩_{i=1}^k A_i | Y, X_{i_{k+1}})(s) dμ(s)
 = ∫_{B∩A_{k+1}} ∏_{i=1}^k Pr(A_i | Y)(s) dμ(s) = ∫_B I_{A_{k+1}}(s) ∏_{i=1}^k Pr(A_i | Y)(s) dμ(s)
 = ∫_B Pr(A_{k+1} | Y)(s) ∏_{i=1}^k Pr(A_i | Y)(s) dμ(s) = ∫_B ∏_{i=1}^{k+1} Pr(A_i | Y)(s) dμ(s).

The equality of the first and last terms above for all B ∈ A_Y means that
∏_{i=1}^{k+1} Pr(A_i | Y) = Pr(∩_{i=1}^{k+1} A_i | Y), a.s., which is what we need to complete the
induction.

For the "only if" part, we will assume (B.58) and prove (B.62). For a function
g to be the left-hand side of (B.62), it must be measurable with respect to the
σ-field A_{Y,m} generated by Y, X_{j₁}, ..., X_{j_m}, and satisfy

∫_C g(s) dμ(s) = Pr(C ∩ ∩_{i=1}^n A_i),   (B.63)

²⁵This theorem is used in the proofs of Theorems 2.14 and 2.20.



for all C ∈ A_{Y,m}. Clearly, the right-hand side of (B.62) is measurable with respect
to A_{Y,m}. If C = C_Y ∩ C_X, where C_Y ∈ A_Y and C_X is in the σ-field generated by
X_{j₁}, ..., X_{j_m}, then

Pr(C ∩ ∩_{i=1}^n A_i) = ∫_{C_Y} Pr(C_X ∩ ∩_{i=1}^n A_i | Y)(s) dμ(s)
 = ∫_{C_Y} I_{C_X}(s) Pr(∩_{i=1}^n A_i | Y)(s) dμ(s)
 = ∫_C Pr(∩_{i=1}^n A_i | Y)(s) dμ(s).

This means that (B.63) holds with g = Pr(∩_{i=1}^n A_i | Y) so long as C is of the
specified form. To show that it holds for all C ∈ A_{Y,m}, we first note that A_{Y,m} is
the smallest σ-field containing all sets of the specified form. Clearly, (B.63) holds
for all sets that are unions of finitely many disjoint sets of the specified form, by
linearity of integrals. These sets form a field C. According to Lemma A.24, for
each ε > 0, there is C_ε ∈ C such that Pr(C_ε Δ C) < ε/2. The following facts follow
trivially:

|Pr(C ∩ ∩_{i=1}^n A_i) − Pr(C_ε ∩ ∩_{i=1}^n A_i)| < ε/2,

|∫_C g(s) dμ(s) − ∫_{C_ε} g(s) dμ(s)| < ε/2.

Combining these gives that |∫_C g(s) dμ(s) − Pr(C ∩ ∩_{i=1}^n A_i)| < ε. Since ε is arbitrary,
(B.63) holds for all C ∈ A_{Y,m}. □
A particular case of interest involves three random quantities. Theorem B.64
says that when there are only two Xs in Theorem B.61, we can check conditional
independence by checking only one of the equations of the form (B.62).

Theorem B.64.²⁶ Let X, Y, and Z be three random quantities, and let A_X, A_Y,
and A_Z be the σ-fields generated by each of them. Suppose that for all A ∈ A_X,
Pr(A|Y,Z) = Pr(A|Y). Then X and Z are conditionally independent given Y.

PROOF. We need to check that, for every A ∈ A_X and B ∈ A_Z, Pr(A ∩ B|Y) =
Pr(A|Y) Pr(B|Y). Equivalently, for all such A and B, and all C ∈ A_Y, we must
show

Pr(A ∩ B ∩ C) = ∫ I_C(s) Pr(A|Y)(s) Pr(B|Y)(s) dμ(s).   (B.65)

²⁶This theorem is used in the proofs of Theorems 2.14 and 2.20.

Since we have assumed that Pr(A|Y,Z) = Pr(A|Y), we have that, for all B ∈ A_Z
and C ∈ A_Y,

Pr(A ∩ B ∩ C) = ∫ I_C(s) I_B(s) Pr(A|Y)(s) dμ(s).

We can use Proposition B.27 with g(Y) = I_C Pr(A|Y) and X = I_B to see that

∫ I_C(s) I_B(s) Pr(A|Y)(s) dμ(s) = ∫ I_C(s) Pr(A|Y)(s) Pr(B|Y)(s) dμ(s).

Together, these last two equations prove (B.65). □

The following result relates product measure on a product space to independent
random variables.
Proposition B.66. Let (S, A, μ) be a probability space and let (T_i, B_i) (i =
1, ..., n) be measurable spaces. Let X_i : S → T_i be measurable for i = 1, ..., n. Let
μ_i be the measure that X_i induces on T_i for each i, and let Tⁿ = T₁ × ⋯ × T_n,
Bⁿ = B₁ ⊗ ⋯ ⊗ B_n. Let μ_* be the measure that (X₁, ..., X_n) induces on (Tⁿ, Bⁿ)
from μ. Then μ_* is the product measure μⁿ = μ₁ × ⋯ × μ_n if and only if the X_i
are independent.
The same result holds for conditional independence.

Corollary B.67. Random quantities X₁, ..., X_n are conditionally independent
given Y if and only if the product measure of the conditional distributions of
X₁, ..., X_n given Y is a version of the conditional distribution of (X₁, ..., X_n)
given Y.
There is an interesting theorem that applies to sequences of independent random
variables, even if they are not identically distributed.

Theorem B.68 (Kolmogorov zero-one law).²⁷ Suppose that (S, A, μ) is a
probability space. Let {X_n}_{n=1}^∞ be a sequence of independent random quantities.
For each n, let C_n be the σ-field generated by (X_n, X_{n+1}, ...) and let C = ∩_{n=1}^∞ C_n.
Then every set in C has probability 0 or probability 1.

PROOF. Let A_n be the σ-field generated by (X₁, ..., X_n). Then C_* = ∪_{n=1}^∞ A_n is
a field. It is easy to see that C is contained in the smallest σ-field containing C_*.
Let A ∈ C. By Lemma A.24, for every k > 0, there exist n and C_k ∈ A_n such
that μ(A Δ C_k) < 1/k. It follows that

lim_{k→∞} μ(C_k) = μ(A),   lim_{k→∞} μ(C_k ∩ A) = μ(A).   (B.69)

Since A ∈ C, it follows that A ∈ C_{n+1}; hence A and C_k are independent for
every k. It follows that μ(C_k ∩ A) = μ(A)μ(C_k). It follows from (B.69) that
μ(A) = μ(A)², and hence either μ(A) = 0 or μ(A) = 1. □

²⁷This theorem is used in the proofs of Corollary 1.63 and Lemma 7.83, and in
the discussion of "sampling to a foregone conclusion" in Section 9.4.

The σ-field C in Theorem B.68 is often called the tail σ-field of the sequence
{X_n}_{n=1}^∞. An interesting feature of the tail σ-field is that limits are measurable
with respect to it.²⁸ (See Problem 21 on page 663.)

B.3.5 The Law of Total Probability


Next, we introduce some theorems that are very simple to state for discrete
random variables but appear to be rather unwieldy in the general case. We will,
however, need them often.

Theorem B.70 (Law of total probability). Let (S, A, μ) be a probability
space, and let Z be a random variable with E(|Z|) < ∞. Let C ⊆ B be sub-σ-fields
of A. Then E(Z|C) = E(E(Z|B)|C), a.s. [μ].

PROOF. Define T = E(Z|B) : S → ℝ, which is any B measurable function
satisfying E(ZI_B) = ∫_B T(s) dμ(s), for all B ∈ B. We need to show that E(Z|C) =
E(T|C), a.s. [μ]. The function E(T|C) is any C measurable function satisfying
∫_C E(T|C)(s) dμ(s) = E(TI_C), for all C ∈ C. But, since C ⊆ B, C ∈ C implies
C ∈ B. So, for C ∈ C,

∫_C E(T|C)(s) dμ(s) = E(TI_C) = ∫ I_C(s) T(s) dμ(s) = ∫_C T(s) dμ(s) = E(ZI_C),

where the last equality follows since T = E(Z|B) and C ∈ B. Since E(T|C) is C
measurable, equating the first and last entries of the above string of equalities
means that E(T|C) satisfies the condition required for it to equal E(Z|C). □
When B and C are the σ-fields generated by two random quantities X and
Y, respectively, C ⊆ B means Y is a function of X. So, Theorem B.70 can be
rewritten in this case.

Corollary B.71. Let X : S → U₁, Y : S → U₂, and Z : S → ℝ be measurable
functions such that E(|Z|) < ∞. Suppose that Y is a function of X. Then
E(Z|Y) = E{E(Z|X)|Y}, a.s. [μ].

The most popular special case of this corollary occurs when Y is constant.

Corollary B.72.²⁹ Let (S, A, μ) be a probability space. Let X : S → U₁ and
Z : S → ℝ be measurable functions such that E(|Z|) < ∞. Then E(Z) =
E{E(Z|X)}.

This is the special case of Theorem B.70 in which C is the trivial σ-field.
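On a finite space the identity E(Z) = E{E(Z|X)} can be verified directly, as in the following Python sketch (an arbitrary eight-point example, not from the text).

# Illustrative sketch: E(Z) = E(E(Z|X)) on a finite probability space,
# as in Corollary B.72.
S = [(s, 1/8) for s in range(8)]        # uniform probability on 8 points
Z = lambda s: (s - 3.5) ** 3
X = lambda s: s // 3                    # X takes values 0, 1, 2

EZ = sum(Z(s) * w for s, w in S)
def E_Z_given_X(x):
    cell = [(s, w) for s, w in S if X(s) == x]
    return sum(Z(s) * w for s, w in cell) / sum(w for _, w in cell)
E_EZX = sum(E_Z_given_X(X(s)) * w for s, w in S)
print(EZ, E_EZX)                        # equal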
The following theorem implies that if a conditional mean given X depends on
X only through h(X), then it is also the conditional mean given h(X).

Theorem B.73.³⁰ Let (S, A, μ) be a probability space and let B and C be sub-σ-fields
of A with C ⊆ B. Let Z : S → ℝ be measurable such that E(|Z|) <
∞. Then there exists a version of E(Z|B) that is C measurable if and only if
E(Z|B) = E(Z|C), a.s. [μ].

²⁸The tail σ-field will play a role in the proofs of Corollary 1.63 and
Theorem 1.49.

²⁹This corollary is used in the proof of Theorem B.75.

³⁰This theorem is used in the proofs of Theorems 1.49 and 2.6.
PROOF. For the "if" direction, if E(Z|B) = E(Z|C), a.s. [μ], then E(Z|C) is
measurable with respect to both C and B, and hence it is a C measurable version
of E(Z|B). For the "only if" direction, if W is a C measurable version of E(Z|B),
then W = E(W|C), a.s. [μ] by the second part of Proposition B.25. By the law
of total probability B.70, E(W|C) = E(Z|C), a.s. [μ]. □
A useful corollary is the following.

Corollary B.74.³¹ Let (S, A, μ) be a probability space. Let (S₁, A₁) and (S₂, A₂)
be measurable spaces, and let X : S → S₁ and h : S₁ → S₂ be measurable
functions. Let Z : S → ℝ be measurable such that E(|Z|) < ∞. Define Y = h(X).
Then E(Z|X = x, Y = y) = E(Z|X = x) a.s. with respect to the measure on
(S₁ × S₂, A₁ ⊗ A₂) induced by (X, Y) : S → S₁ × S₂ from μ.
The following theorem deals with conditioning on two random quantities at
the same time. In words, it says that the conditional mean of a random variable
Z given two random quantities X₁ and X₂ can be calculated two ways. One is
to condition on both X₁ and X₂ at once, and the other is to condition on one
of them, say X₂, and then find the conditional mean of Z given X₁, but starting
from the conditional distribution of (Z, X₁) given X₂.

Theorem B.75.³² Let (S, A, μ) be a probability space and let (X_i, B_i) for i = 1, 2
be measurable spaces. Let X_i : S → X_i for i = 1, 2 and Z : S → ℝ be random
quantities such that E(|Z|) < ∞. Let μ_{1,2,Z} denote the measure on (X₁ × X₂ ×
ℝ, B₁ ⊗ B₂ ⊗ B) induced by (X₁, X₂, Z) from μ. (Here, B denotes the Borel σ-field.)
For each (x,y) ∈ X₁ × X₂, let g(x,y) denote E(Z|(X₁,X₂) = (x,y)). For
each A ∈ A and y ∈ X₂, let μ_(2)(A|y) denote Pr(A|X₂ = y). For each y ∈ X₂,
let h(x,y) denote the conditional mean of Z given X₁ = x calculated in the
probability space (S, A, μ_(2)(·|y)). Then h = g a.s. [μ_{1,2,Z}].
PROOF. Saying that h = g a.s. [μ_{1,2,Z}] is equivalent to saying that

E(Z|X₁, X₂) = h(X₁, X₂), a.s. [μ].

To prove this, we first note that f(s) = h(X₁(s), X₂(s)) is measurable with respect
to the σ-field generated by (X₁, X₂), A_{X₁,X₂}. All that remains is to show that
it satisfies the integral condition required to be E(Z|X₁, X₂). That is, for all
C ∈ A_{X₁,X₂},

E(ZI_C) = ∫_C f(s) dμ(s).   (B.76)

Let μ₂ be the measure on (X₂, B₂) induced by X₂ from μ. First, suppose that
C = A ∩ B, where A ∈ A_{X₁} and B ∈ A_{X₂}. The last hypothesis of the theorem
says that, for all A ∈ A_{X₁}, E(ZI_A | X₂ = y) = ∫_A h(X₁(s), y) dμ_(2)(s|y). If μ_{1|2}(·|y)
is the probability on (X₁, B₁) induced by X₁ from μ_(2)(·|y), then μ_{1|2}(·|y) is also
the conditional distribution of X₁ given X₂ = y as in Theorem B.46. Suppose

³¹This corollary is used in the proof of Theorem 2.14.

³²This theorem is used in the proof of Lemma 2.120, and it is used in making
sense of the notation E_θ when introducing parametric models.

that A = X₁⁻¹(D) and B = X₂⁻¹(F). Then A ∩ B = (X₁, X₂)⁻¹(D × F) and
E(ZI_A | X₂ = y) = ∫_D h(x, y) dμ_{1|2}(x|y). By Corollary B.72 and Theorem B.46, we
can write

E(ZI_A I_B) = ∫_F ∫_D h(x, y) dμ_{1|2}(x|y) dμ₂(y)
 = ∫_{D×F×ℝ} h(x, y) dμ_{1,2,Z}(x, y, z) = ∫_{A∩B} f(s) dμ(s).

This proves (B.76) for C = A ∩ B. Let C be the collection of all sets C in A such
that (B.76) holds. Clearly S ∈ C. If C ∈ C, then Cᶜ ∈ C since ∫ f(s) dμ(s) =
E(Z). By additivity of integrals, if {C_i}_{i=1}^∞ ⊆ C are pairwise disjoint, then
∪_{i=1}^∞ C_i ∈ C; hence C contains the smallest σ-field containing all sets of the form
A ∩ B for A ∈ A_{X₁} and B ∈ A_{X₂}. Theorem A.34 can be used to show that this
σ-field is A_{X₁,X₂}. □
If a random variable has finite second moment, then there is a concept of
conditional variance.

Definition B.77. Let X : S → ℝᵏ have finite second moment, and let C be a
sub-σ-field of A. Then the conditional covariance matrix of X given C is defined
as Var(X|C) = E[(X − E(X|C))(X − E(X|C))ᵀ | C].

The following result is easy to prove.

Proposition B.78.³³ Let X : S → ℝᵏ have finite second moment, and let C be
a sub-σ-field of A. Then Var(X) = E[Var(X|C)] + Var[E(X|C)].
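The decomposition in Proposition B.78 can be checked by simulation. In the Python sketch below (a normal location mixture chosen for convenience, with C generated by Y), E(X|Y) = Y and Var(X|Y) = 1, so both sides should be near 5.

# Illustrative sketch: Var(X) = E[Var(X|Y)] + Var[E(X|Y)] for
# X | Y = y ~ N(y, 1) with Y ~ N(0, 4).
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000
Y = 2.0 * rng.standard_normal(N)        # E(X|Y) = Y, Var(X|Y) = 1
X = Y + rng.standard_normal(N)

total = X.var()
within = 1.0                            # E[Var(X|Y)], known exactly here
between = Y.var()                       # Var[E(X|Y)]
print(total, within + between)          # both close to 5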

B.4 Limit Theorems
There are several types of convergence that will be of interest to us. They involve
sequences of random quantities or sequences of distributions.

B.4.1 Convergence in Distribution and in Probability


The simplest type of convergence occurs when the distributions have densities
with respect to a common measure. The following theorem is due to Scheffé
(1947).

Theorem B.79 (Scheffé's theorem).³⁴ Let {p_n}_{n=1}^∞ and p be nonnegative
functions from a measure space (X, B, ν) to ℝ such that the integral of each
function is 1 and lim_{n→∞} p_n(x) = p(x), a.e. [ν]. Then

lim_{n→∞} ∫_B p_n(x) dν(x) = ∫_B p(x) dν(x), for all B ∈ B.

PROOF. Let δ_n(x) = p_n(x) − p(x), and let δ_n⁺ and δ_n⁻ be its positive and negative
parts. Clearly, both lim_{n→∞} δ_n⁺ = 0 and lim_{n→∞} δ_n⁻ = 0, a.e. [ν]. Since
0 ≤ δ_n⁻ ≤ p is true, it follows from the dominated convergence theorem A.57
that lim_{n→∞} ∫_B δ_n⁻(x) dν(x) = 0 for all B. Since both p_n and p are densities,
∫_X δ_n(x) dν(x) = 0 for all n. It follows that lim_{n→∞} ∫_X δ_n⁺(x) dν(x) = 0. Since
I_B(x) δ_n⁺(x) ≤ δ_n⁺(x) for all x, it follows from Proposition A.58 that

lim_{n→∞} ∫_B δ_n⁺(x) dν(x) = 0.

So, lim_{n→∞} ∫_B [p_n(x) − p(x)] dν(x) = 0 for all B. □
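In fact, the proof shows that sup_B |∫_B p_n dν − ∫_B p dν| ≤ ∫ δ_n⁺ dν = ½ ∫ |p_n − p| dν, so the convergence is uniform in B. The Python sketch below (normal densities with a shrinking location shift, integrated numerically; an illustration only) displays this L¹ distance tending to 0.

# Illustrative sketch: the L1 distance between p_n = N(1/n, 1) and
# p = N(0, 1) densities shrinks to 0, as Scheffé's theorem requires.
import numpy as np

xs = np.linspace(-10, 10, 20001)
dx = xs[1] - xs[0]
p = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)
for n in (1, 2, 5, 10, 100):
    pn = np.exp(-(xs - 1/n)**2 / 2) / np.sqrt(2 * np.pi)
    print(n, np.abs(pn - p).sum() * dx)   # decreases toward 0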
Since defining convergence requires a topology, the following definitions require
that the random quantities lie in various types of topological spaces.

Definition B.80. Let {X_n}_{n=1}^∞ be a sequence of random quantities and let X
be another random quantity, all taking values in the same topological space X.
If lim_{n→∞} E(f(X_n)) = E(f(X)) for every bounded continuous function
f : X → ℝ, then we say that X_n converges in distribution to X, which is
written X_n →ᴰ X.

Convergence in distribution is sometimes defined in terms of probability measures.
The reason is that if X_n →ᴰ X, the actual values of X_n and of X do not
play any role in the convergence. All that matters is the distributions of X_n and
of X.

Definition B.81. Let {P_n}_{n=1}^∞ be a sequence of probability measures on a topological
space (X, B), where B contains all open sets. Let P be another probability
on (X, B). We say that P_n converges weakly³⁵ to P (denoted P_n →ʷ P) if, for each
bounded continuous function g : X → ℝ, lim_{n→∞} ∫ g(x) dP_n(x) = ∫ g(x) dP(x).

³⁵This is not exactly the same as the concept of weak convergence in normed
linear spaces [see, for example, Dunford and Schwartz (1957), p. 419]. The collection
of all probability measures on a space (X, B) can be considered a subset
of a normed linear space C consisting of all finite signed measures ν (see Definition A.18)
with the norm being sup_{B∈B} |ν(B)|. Weak convergence of a sequence
{ν_n}_{n=1}^∞ in this space would require the convergence of L(ν_n) for every bounded
linear functional L on C. Every bounded measurable function g on (X, B) determines
a bounded linear functional L_g on C by L_g(ν) = ∫ g(x) dν(x), where the
integral with respect to a signed measure can be defined as in Problem 27 on
page 605. Hence, weak convergence of a sequence of probability measures would
require convergence of the means of all bounded measurable functions. In particular,
lim_{n→∞} P_n(B) = P(B) for all measurable sets B, not just those for which
P assigns 0 probability to the boundary (see the portmanteau theorem B.83 on
page 636). Alternatively, we can consider the set of bounded continuous functions
f : X → ℝ as a normed linear space N with ‖f‖ = sup_x |f(x)|. Then the set of
finite signed measures C is a set of bounded linear functionals on N using the
definition ν(f) = ∫ f(x) dν(x). Weak* convergence of a sequence {ν_n}_{n=1}^∞ in C to
ν is defined as the convergence of ν_n(f) to ν(f) for all f ∈ N. This is precisely
convergence in distribution. Hence, it would make more sense to call convergence
in distribution weak* convergence rather than weak convergence. Since the tradition
in probability theory is to call it weak convergence, we will continue to do
so.

It is easy to see that these two types of convergence are the same.

Proposition B.82. Let P_n be the distribution of X_n, and let P be the distribution
of X. Then X_n →ᴰ X if and only if P_n →ʷ P.

Since we will usually be dealing with X spaces that are metric spaces, there are
some equivalent ways to define convergence in distribution or weak convergence.
The proofs of Theorems B.83 and B.88 are adapted from Billingsley (1968).

Theorem B.83 (Portmanteau theorem).³⁶ The following are all equivalent
in a metric space:
1. P_n →ʷ P;
2. limsup_{n→∞} P_n(B) ≤ P(B) for each closed B;
3. liminf_{n→∞} P_n(A) ≥ P(A) for each open A;
4. lim_{n→∞} P_n(C) = P(C) for each C with P(∂C) = 0.³⁷
PROOF. Let d be the metric in the metric space. First, assume (1) and let B be
a closed set. Let δ > 0 be given. For each ε > 0, define C_ε = {x : d(x,B) ≤ ε},
where d(x,B) = inf_{y∈B} d(x,y). Since |d(x,B) − d(y,B)| ≤ d(x,y), we see that
d(x,B) is continuous in x. Each C_ε is closed and ∩_{ε>0} C_ε = B. Let ε be small
enough so that P(C_ε) ≤ P(B) + δ. Let f : ℝ → ℝ be

f(t) = { 1 if t ≤ 0,
       { 1 − t if 0 < t < 1,
       { 0 if t ≥ 1,

and define g_ε(x) = f(d(x,B)/ε). Then g_ε is bounded and continuous. So,

lim_{n→∞} ∫ g_ε(x) dP_n(x) = ∫ g_ε(x) dP(x).

It is easy to see that 0 ≤ g_ε(x) ≤ 1, g_ε(x) = 1 for all x ∈ B, and g_ε(x) = 0 for all
x ∉ C_ε. Hence, for every δ > 0,

P_n(B) = ∫ I_B(x) dP_n(x) ≤ ∫ g_ε(x) dP_n(x) → ∫ g_ε(x) dP(x)
       ≤ ∫ I_{C_ε}(x) dP(x) = P(C_ε) ≤ P(B) + δ.

It follows that limsup_{n→∞} P_n(B) ≤ P(B), which is (2).

That (2) and (3) are equivalent follows easily from the facts that if A is open,
then B = Aᶜ is closed and P_n(A) = 1 − P_n(B). It is also easy to see that (2) and
(3) together imply (4). Next assume (4), let B be a closed set, and define C_ε as
above. The boundary of C_ε is a subset of {x : d(x,B) = ε}. There can be at most
countably many ε such that these sets have positive probability. Hence, there
³⁶This theorem is used in the proofs of Theorem B.88 and Lemma 7.19.

³⁷We use the symbol ∂ in front of the name of a subset of a topological space
to refer to the boundary of the set. The boundary of a set C in a topological space
is the intersection of the closure of the set with the closure of the complement.

exists a sequence {ε_k}_{k=1}^∞ converging to 0 such that P(d(X,B) = ε_k) = 0 for all
k. It follows that lim_{n→∞} P_n(C_{ε_k}) = P(C_{ε_k}) for all k. Since P_n(B) ≤ P_n(C_{ε_k})
for every n and k, we have, for every k,

limsup_{n→∞} P_n(B) ≤ P(C_{ε_k}).

Since P(B) = lim_{k→∞} P(C_{ε_k}), we have (2). So, (2), (3), and (4) are equivalent
and (1) implies (2).
All that remains is to prove that (2) implies (1). Assume (2), and let f be
a bounded continuous function. Let m < f(x) < M for all x. For each k, let
F_{i,k} = {x : f(x) ≤ m + (M−m)i/k} for i = 1, ..., k. Let F_{0,k} = ∅. Each F_{i,k} is
closed, since f is continuous. Let G_{i,k} = F_{i,k} \ F_{i−1,k} for i = 1, ..., k. It is easy
to see that, for every probability Q,

m + (M−m) Σ_{i=1}^k [(i−1)/k] Q(G_{i,k}) < ∫ f(x) dQ(x) ≤ m + (M−m) Σ_{i=1}^k (i/k) Q(G_{i,k}).

Since Q(G_{i,k}) = Q(F_{i,k}) − Q(F_{i−1,k}) for every i and k, we get

M − [(M−m)/k] Σ_{i=1}^k Q(F_{i,k}) < ∫ f(x) dQ(x) ≤ M + (M−m)/k − [(M−m)/k] Σ_{i=1}^k Q(F_{i,k}).   (B.84)

For each i,

limsup_{n→∞} P_n(F_{i,k}) ≤ P(F_{i,k}).   (B.85)

It follows that, for every k,

∫ f(x) dP(x) ≤ M + (M−m)/k − [(M−m)/k] Σ_{i=1}^k P(F_{i,k})
 ≤ M + (M−m)/k − [(M−m)/k] Σ_{i=1}^k limsup_{n→∞} P_n(F_{i,k})
 ≤ (M−m)/k + liminf_{n→∞} ∫ f(x) dP_n(x),

where the first inequality follows from the second inequality in (B.84) with Q = P,
the second inequality follows from (B.85), and the third inequality follows from
the first inequality in (B.84) with Q = P_n. Letting k be arbitrarily large, we get

∫ f(x) dP(x) ≤ liminf_{n→∞} ∫ f(x) dP_n(x).   (B.86)

Now, apply the same reasoning to −f to get

−∫ f(x) dP(x) ≤ liminf_{n→∞} ∫ −f(x) dP_n(x) = −limsup_{n→∞} ∫ f(x) dP_n(x),

that is,

∫ f(x) dP(x) ≥ limsup_{n→∞} ∫ f(x) dP_n(x).   (B.87)

Together, (B.86) and (B.87) imply (1). □



Theorem B.88 (Continuous mapping theorem).³⁸ Let {X_n}_{n=1}^∞ be a sequence
of random quantities, and let X be another random quantity, all taking
values in the same metric space X. Suppose that X_n →ᴰ X. Let Y be a metric
space and let g : X → Y. Define

C_g = {x : g is continuous at x}.

Suppose that Pr(X ∈ C_g) = 1. Then g(X_n) →ᴰ g(X).

PROOF. Let P_n be the distribution of g(X_n) and let P be the distribution of
g(X). Let B be a closed subset of Y, and let D = g⁻¹(B), with closure D̄. If
x ∈ D̄ but x ∉ D, then g is not continuous at x. It follows that D̄ ⊆ D ∪ C_gᶜ.
Now write

limsup_{n→∞} P_n(B) = limsup_{n→∞} Pr(X_n ∈ D) ≤ limsup_{n→∞} Pr(X_n ∈ D̄)
 ≤ Pr(X ∈ D̄) ≤ Pr(X ∈ D) + Pr(X ∈ C_gᶜ) = Pr(X ∈ D) = P(B),

and the result now follows from the portmanteau theorem B.83. □
Another type of convergence is convergence in probability.

Definition B.89. If {X_n}_{n=1}^∞ and X are random quantities in a metric space
with metric d, and if, for every ε > 0, lim_{n→∞} Pr(d(X_n, X) > ε) = 0, then we
say that X_n converges in probability to X, which is written X_n →ᴾ X.
The following theorem is useful in that it relates convergence in distribution,
convergence in probability, and the simpler concept of convergence almost surely.

Theorem B.90.³⁹ Let {X_n}_{n=1}^∞ be a sequence of random vectors and let X be a
random vector.
1. If lim_{n→∞} X_n = X a.s., then X_n →ᴾ X.
2. If X_n →ᴾ X, then X_n →ᴰ X.
3. If X is degenerate and X_n →ᴰ X, then X_n →ᴾ X.
4. If X_n →ᴾ X, then there is a subsequence {n_k}_{k=1}^∞ such that lim_{k→∞} X_{n_k} =
X, a.s.

PROOF. First, assume that X_n converges a.s. to X. For each n and ε, let A_{n,ε} =
{s : d(X_n(s), X(s)) ≤ ε}. Then X_n(s) converges to X(s) if and only if

s ∈ ∩_{ε>0} (∪_{N=1}^∞ [∩_{n=N}^∞ A_{n,ε}]).

Since this set must have probability 1, then so too must ∪_{N=1}^∞ (∩_{n=N}^∞ A_{n,ε}) for
all ε. By Theorem A.19, it follows that for every ε, lim_{N→∞} Pr(∩_{n=N}^∞ A_{n,ε}) = 1.

³⁸This theorem is used to provide a short proof of DeFinetti's representation
theorem for Bernoulli random variables in Example 1.82 on page 46.

³⁹This theorem is used in the proofs of Theorems B.95, 1.49, 7.26, and 7.78.

Hence, for each ε > 0, lim_{n→∞} Pr(A_{n,ε}ᶜ) = 0, which is precisely what it means to
say that X_n →ᴾ X.

Next assume that X_n →ᴾ X. Let g : X → ℝ be bounded and continuous with
|g(x)| ≤ K for all x. Let ε > 0, and let A be a compact set with Pr(X ∈ A) > 1 −
ε/[6K]. A continuous function (like g) on a compact set is uniformly continuous.
So let δ > 0 be such that x ∈ A and d(x, y) < δ implies |g(x) − g(y)| < ε/3. Since
X_n →ᴾ X, there exists N such that n ≥ N implies Pr(d(X_n, X) < δ) > 1 − ε/[6K].
Let B = {X ∈ A, d(X_n, X) < δ}. It follows that |g(X)I_B − g(X_n)I_B| < ε/3 and,
for all n ≥ N, Pr(B) > 1 − ε/[3K]. Also, note that n ≥ N implies

|Eg(X) − E[g(X)I_B]| < ε/3,   |Eg(X_n) − E[g(X_n)I_B]| < ε/3.

So, n ≥ N implies

|Eg(X) − Eg(X_n)| ≤ |Eg(X) − E[g(X)I_B]| + |E[g(X)I_B] − E[g(X_n)I_B]|
 + |E[g(X_n)I_B] − Eg(X_n)|
 < ε/3 + ε/3 + ε/3 = ε.

Thus, lim_{n→∞} Eg(X_n) = Eg(X), and we have proven X_n →ᴰ X.

Next, suppose that X is degenerate at x₀ and X_n →ᴰ X. Let ε > 0, and define

g(x) = { 1 if d(x, x₀) ≤ ε/2,
       { 0 if d(x, x₀) ≥ ε,
       { 2 − 2d(x, x₀)/ε otherwise.

Since g is bounded and continuous, Eg(X_n) converges to Eg(X). But Eg(X) = 1
since Pr(g(X) = 1) = 1, and Eg(X_n) ≤ Pr(d(X_n, x₀) < ε), since 0 ≤ g(x) ≤ 1 for
all x. So lim_{n→∞} Pr(d(X_n, x₀) < ε) = 1, and X_n →ᴾ X.

Finally, assume that X_n →ᴾ X. Let n_k be such that n ≥ n_k implies

Pr(d(X_n, X) ≥ 1/k) < 2⁻ᵏ.

Define A_k = {d(X_{n_k}, X) ≥ 1/k}. By the first Borel–Cantelli lemma A.20, we
have Pr(B) = 0, where B = ∩_{i=1}^∞ ∪_{k=i}^∞ A_k. It is easy to check that B is the
event that d(X_{n_k}, X) is at least 1/k for infinitely many different k. Hence Bᶜ ⊆
{lim_{k→∞} X_{n_k} = X}, and lim_{k→∞} X_{n_k} = X, a.s. □
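The relations in Theorem B.90 are strict: convergence in probability does not imply almost sure convergence, which is why part 4 yields only a subsequence. The Python sketch below (the standard "typewriter" sequence of dyadic-interval indicators on [0,1), a classical example not taken from this appendix) shows Pr(X_n ≠ 0) → 0 while every point s is covered infinitely often.

# Illustrative sketch: the typewriter sequence converges to 0 in
# probability but not almost surely.
from fractions import Fraction

def interval(n):
    # the n-th dyadic interval: with k = floor(log2 n) and j = n - 2^k
    k = n.bit_length() - 1
    j = n - 2**k
    return Fraction(j, 2**k), Fraction(j + 1, 2**k)

# Pr(X_n != 0) = length of the n-th interval -> 0 (convergence in prob.)
for n in (1, 2, 4, 8, 1024):
    a, b = interval(n)
    print(n, float(b - a))

# but every s lies in one interval of each dyadic level, so X_n(s) = 1
# for infinitely many n and X_n(s) does not converge to 0 for any s
s = Fraction(1, 3)
hits = [n for n in range(1, 2**12) if interval(n)[0] <= s < interval(n)[1]]
print(hits[:6])   # one hit per dyadic level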

B.4.2 Characteristic Functions


There is a very important method for proving convergence in distribution which
involves the use of characteristic functions.

Definition B.91. Let X be a random vector. The complex-valued function

φ_X(t) = E(exp[i tᵀX])

is called the characteristic function of X. If F is a k-dimensional distribution
function, the function φ_F(t) = ∫ exp[i tᵀx] dF(x) is called the characteristic
function of F.

Example B.92. Let X have a standard normal distribution. Then

φ_X(t) = ∫ exp(itx) (1/√(2π)) exp(−x²/2) dx
       = (1/√(2π)) ∫ exp(−([x − it]² + t²)/2) dx = exp(−t²/2).

Similarly, for other normal distributions N(μ, σ²), the characteristic functions
are φ_X(t) = exp(−σ²t²/2 + itμ).
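A quick Monte Carlo check of this calculation (illustrative only) follows.

# Illustrative sketch: the empirical mean of exp(itX) for standard
# normal X approximates exp(-t^2/2), as in Example B.92.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal(500_000)
for t in (0.5, 1.0, 2.0):
    phi_hat = np.exp(1j * t * X).mean()
    print(t, phi_hat, np.exp(-t**2 / 2))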

By Theorem B.12, if X has CDF F, then φ_X = φ_F. It is easy to see that the
characteristic function exists for every random vector and it has complex absolute
value at most 1 for all t. Other facts that follow directly from the definition are
the following. If Y = aX + b, then φ_Y(t) = φ_X(at) exp(itb). If X and Y are
independent, φ_{X+Y} = φ_X φ_Y.

The reason that characteristic functions are so useful for proving convergence
in distribution is twofold. First, for each characteristic function φ, there is
only one CDF F such that φ_F = φ. (See the uniqueness theorem B.106.) Second,
characteristic functions are "continuous" as a function of the distribution
in the sense of convergence in distribution. That is, X_n →ᴰ X if and only if
lim_{n→∞} φ_{X_n}(t) = φ_X(t) for all t.⁴⁰ (See the continuity theorem B.93.)

Theorem B.93 (Continuity theorem).⁴¹ For finite-dimensional random vectors,
convergence in distribution is equivalent to convergence of characteristic
functions. That is, X_n →ᴰ X if and only if lim_{n→∞} φ_{X_n}(t) = φ_X(t) for all t.

PROOF. The "only if" part follows from Definition B.80 and the fact that one
can write exp(i tᵀx) as two bounded, continuous, real-valued functions of x for
every t.

For the "if" part, suppose that X is k-dimensional and that lim_{n→∞} φ_{X_n}(t) =
φ_X(t) for all t. To prove that for each bounded continuous g, lim_{n→∞} Eg(X_n) =
Eg(X), we will truncate g to a bounded rectangle and then approximate the
truncated function by a function g′ whose mean is a linear combination of values
of the characteristic function. The mean of g′(X_n) will then converge to the mean
of g′(X). We then need to show that the means of g′(X) and g′(X_n) approximate
the means of g(X) and g(X_n), respectively.

First, we need to find a bounded rectangle on which to do the truncation. For
each coordinate X^l of X, we will show that if a and b are continuity points of
the CDF F_{X^l} of X^l, and F_{X^l}(b) − F_{X^l}(a) > q, then there are b′ > b and a′ < a
such that liminf_{n→∞} [F_{X^l_n}(b′) − F_{X^l_n}(a′)] ≥ q. For each a, b, δ, define

f_{a,b,δ}(x) = { 1 if a < x < b,
             { 1 − (a − x)/δ if a − δ < x ≤ a,   (B.94)
             { 1 − (x − b)/δ if b ≤ x < b + δ,
             { 0 otherwise.

⁴⁰This presentation is a hybrid of the presentations given by Breiman (1968,
Chapter 8) and Hoel, Port, and Stone (1971, Chapter 8).

⁴¹This theorem is used in the proofs of Theorems B.95, B.97, and 7.20.
41This theorem is used in the proofs of Theorems B.95, B.97, and 7.20.

Note that this function has equal values at a − δ and b + δ. Consider the interval
[a − δ, b + δ] as a circle, identifying the two endpoints. Now, use the Stone–
Weierstrass theorem C.3 to approximate f_{a,b,δ} uniformly to within ε on the circle
by f*_{a,b,δ,ε}(x) = Σ_{j=−ℓ}^{ℓ} b_j exp(2πijx/c), where c = b − a + 2δ. If Y is a random
variable, then E f*_{a,b,δ,ε}(Y) is a linear combination of values of the characteristic
function of Y. So, we have lim_{n→∞} E f*_{a,b,δ,ε}(X_n) = E f*_{a,b,δ,ε}(X). Let q > 0, and
let a and b be continuity points of F_{X^l} such that F_{X^l}(b) − F_{X^l}(a) = v > q. Let
w = v − q. Let δ > 0 be arbitrary, and define a′ = a − δ and b′ = b + δ. Let N be
large enough so that n ≥ N implies |E f*_{a,b,δ,w/3}(X^l_n) − E f*_{a,b,δ,w/3}(X^l)| < w/3. If
n ≥ N, then

F_{X^l_n}(b′) − F_{X^l_n}(a′) ≥ E f_{a,b,δ}(X^l_n) ≥ E f*_{a,b,δ,w/3}(X^l_n) − w/3
 ≥ E f*_{a,b,δ,w/3}(X^l) − 2w/3 ≥ E f_{a,b,δ}(X^l) − w
 ≥ F_{X^l}(b) − F_{X^l}(a) − w = q.

Now, let g be a bounded continuous function, and suppose that |g(x)| < K for
all x. Let ε > 0. For each coordinate x^l of X, let a^l and b^l be continuity points
of F_{X^l} such that F_{X^l}(b^l) − F_{X^l}(a^l) > 1 − ε/(7[K + ε/7]k). Let δ > 0 be arbitrary,
and define a′^l = a^l − δ, b′^l = b^l + δ, and g*(x) = g(x) ∏_{l=1}^k f_{a^l,b^l,δ}(x^l). Use the
Stone–Weierstrass theorem C.3 to uniformly approximate g* to within ε/7 on the
rectangle {x : a^l − δ ≤ x^l ≤ b^l + δ} by

g′(x) = Σ_{j₁=−m₁}^{m₁} ⋯ Σ_{j_k=−m_k}^{m_k} a_{j₁,...,j_k} exp(2πi jᵀx),

where j is the vector with lth coordinate j_l/[b^l − a^l + 2δ]. Then,

lim_{n→∞} E g′(X_n) = E g′(X).

Let N₁ be large enough so that n ≥ N₁ implies F_{X^l_n}(b′^l) − F_{X^l_n}(a′^l) ≥ 1 − ε/(7[K +
ε/7]k) for all l. Let N₂ be large enough so that n ≥ N₂ implies |E g′(X_n) −
E g′(X)| < ε/7. Let R be the rectangle R = {x : a′^l < x^l ≤ b′^l}. Since g′ is periodic
in every coordinate, it is bounded by K + ε/7 on all of ℝᵏ. If n ≥ max{N₁, N₂},
then |Eg(X_n) − Eg(X)| is no greater than

E|g(X_n)I_{Rᶜ}(X_n)| + E|g(X)I_{Rᶜ}(X)| + E|g′(X_n)I_{Rᶜ}(X_n)|
 + |E[g(X_n)I_R(X_n)] − E[g′(X_n)I_R(X_n)]| + |E g′(X_n) − E g′(X)|
 + E|g′(X)I_{Rᶜ}(X)| + |E[g′(X)I_R(X)] − E[g(X)I_R(X)]| ≤ ε. □

We will prove two more limit theorems that make use of the continuity theo-
rem B.93. Suppose that X has finite mean. Since Iexp(itx} - 11 :5 min{ltxl,2}
for all t,z,42 and
. exp(itx} -1
I1m .
t ..... o . t = lX

42See Problem 26 on page 664.


642 Appendix B. Probability Theory

for all x, it follows from the dominated convergence theorem that

Similarly, if X has finite variance, it can be shown that

;t22 <px(t)lt=o = _E(X2).


Using these two facts, we can prove the weak law of large numbers and the centrol
limit theorem.
Theorem B.95 (Weak law of large numbers). Suppose that {Xn}~=l are
IID mndom variables with finite mean p,. Then, Xn = L:~=1 Xiln converges in
probability to p,.
PROOF. First, we will prove that the characteristic function of X n -p, converges to
1 for all t. Let Y; = Xi -p,. Since <PY; (0) = 1, log <PY, (t) exists and is differentiable
near t = 0, and we know that

!!..log<py (0) = 0 = lim log¢x,(t) (B.96)


dt' t-+O t
The characteristic function of X n - p, is ¢' (t) = ¢Y; (tint. For fixed t, let n be
°
large enough so that tin is close enough to for log ¢Y, (tin) to be well defined.
We know that
t) log ¢Y; ( .!. )
log¢.(t) = n log ¢Y, ( ;;: = t .!. n •
n
The limit of this quantity, as n - t 00, is 0 by (B.96). It follows that for all t,
lim n -+ oo ¢.(t) = 1. By the continuity theorem B.93, Xn - f.t E. o. By Theo-
- p
rem B.90, Xn - f.t - t O. 0
In Chapter 1, we prove a strong law of large numbers 1.62, which has a stronger
conclusion and a weaker hypothesis. There is also a weak law of large numbers
for the case of infinite means. (See Problem 27 on page 664.)
The following theorem is very useful for approximating distributions.
Theorem B.97 (Centrailimit theorem). Suppose that {Xi}~l is a sequence
that is lID with finite mean f.t and finite variance (]'2. Let X n be the avemge of
thefirstn XiS. Then fo(Xn-P,) E. N(0,(]'2), the normal distribution with mean
o and variance (]'2.
PROOF. Set Yn = foe X n -p,). We might as well assume that f.t = 0, since we have
just subtracted it from each Xi. Since the second derivative of the characteristic
function at t = 0 of each Xi is _(]'2, we can apply I'Hopital's rule twice to conclude

(B.98)

The characteristic function of Yn is ¢Yn (t) = ¢x; (tl v'nt· We will prove that
this converges to exp(-t 2(]'2/2) for each t. Since log<Pyn(t) = nlog¢x;(tlfo),
B.4. Limit Theorems 643

we use (B.98) to note that

It follows that limn -+ oo CPY" (t) = exp( _t 2 u 2 /2), and the continuity theorem B.93
finishes the proof. 0
There is also a multivariate version of the central limit theorem.
Theorem B.99 (Multivariate central limit theorem).43 Let {Xn}~=l be a
sequence of lID mndom vectors in IRP with mean p. and covariance matrix E.
Then v'n(X" - p.) Eo Np(O, E), a multivariate normal distribution.

PROOF.
-
Let Yn = Vri(X n - p.) and let Y '" Np(O,E). Then Yn -+ Y if and
v
only if the characteristic function of Yn converges to that of Y. That is, if and
only if, for each A E JRP, E exp { iA TYn} -+ E exp { iA TY }. This occurs if and
only if, for each A, ATYn Eo ATy. The distribution of ATy is N(O, ATEA), and
ATYn is .;n times the average of the AT (Xn - p.). By the univariate central limit
theorem B.97, ATYn Eo ATy. 0
There are inversion formulas for characteristic functions which allow us to
obtain or approximate the original distributions from the characteristic functions.
Example B.100 (Continuation of Example B.92j see page 640). Let X have
J
distribution N(O, ( 2 ). Then Icpx(t)ldt < 00. In fact,

2~ J exp(-ixt)cpx(t)dt = ~
2~
Jexp ( - u 2 [t + ~X]2 __
2 u2
1 X2) dt
2u 2

~u exp ( - 2~2X2) = fx(x).


Example B.100 says that the following inversion formula applies to normal
distributions with 0 mean. It is equally easy to see that it applies to NIe(O, lie)
distributions. 44
Lemma B.IOI (Continuous inversion formula).45 Let X E IR" have inte-
grable chamcteristic function. Then the distribution of X has a bounded density
f x with respect to Lebesgue measure given by

fx(x) = (2!)" J exp(-it T x)CPx(t)dt. (B.102)

PROOF. Clearly, the function in (B.102) is bounded since cpx is integrable. Let Y"
have N,,(O, u 2 lie) distribution. The characteristic function of X + Y" is CPxCPy".

(2!)" Jexp(-it T x)cpx(t)cpy" (t)dt

43This theorem is used in the proofs of Theorems 7.35 and 7.57.


44We use the symbol lie to stand for the k x k identity matrix.
45This lemma is used in the proofs of Lemma B.105 and Corollary B.106.
644 Appendix B. Probability Theory

(2!)k JJ exp(-it T x)exp(it T z)</>y,,(t)dFx(z)dt (B.103)

J h, (x - z)dFx(z) = fx+y" (x),

where the second equality follows from the fact that (B.102) applies to normal
distributions. Now suppose that we let (f go to zero. Since </>x is integrable and
</>y,,(t) goes to 1 for all t, it follows that the left-hand side of (B.103) converges to
the right-hand side of (B.102). It also follows that fx+y" is bounded uniformly
in (f and x. Let B be a hypercube such that the probability is 0 that X is in the
boundary of B. Then

r,,-0 fx+Yu (x)dx


} B
lim
,,-oJrB fx+Y" (x)dx = } {B fx(x)dx,
= lim (B. 104)

where the first equality follows from the boundedness of f x +Yu' and the second
is proven as follows. The difference between fx+y" (x)dx and IB fx(x)dx is IB
the sum over the 2k corners of the hypercube B of terms like
k

L Pr(b i - Y",i < Xi :::; bi , Y",i > 0) + Pr(bi < Xi :::; b; - y",;, Y",i < 0),
i=l

where bi is the ith coordinate of the corner. We can write

Pr(bi - Y",i < Xi ~ bi, Y",i > 0) = 1 00


Pr(bi - y < Xi :::; bi, y > O)dFY".i (y).

This last expression goes to 0 as (f - 0 since b; is a continuity point for FXi'


A similar argument applies to the other probability. The equality of the first
and last expressions in (B.104) is what it means to say that lim,,_o fx+y" (x)
is the density of X with respect to Lebesgue measure. This, in turn, equals the
right-hand side of (B.102). 0

Lemma B.I05. 46 Let Y be a mndom variable such that </>y is integmble. Let X
be an arbitmry mndom variable independent of Y. For all finite a < band c,

Pr(a < X + cY ~ b) = ~
211"
f (exp( -ibt) -. exp( -iat») </>x(t)</>y(ct)dt.
-~t

PROOF. Since </>y is integrable and </>x+cy(t) = </>x (t)</>y (ct), it follows that
X + cY has integrable characteristic function. Lemma B.101 says that (B.102)
applies to X + cY, hence

~ j</>x(t)</>y(ct)exP(-itX)dt
211"

Pr(a < X + cY :::; b) lb fx+cy(x)dx

46This lemma is used in the proof of Corollary B.106.


B.5. Stochastic Processes 645

2~ 1b J </Jx(t)</Jy(ct)exp(-itx)dtdx

2~ J </Jy(et)</Jx(t) 1b exp(-itx)dxdt

2~ ! </Jy(ct)</Jx(t) (exp( -itb) ':texp (-ita)) dt. 0

Corollary B.106 (Uniqueness theorem).47 Let F and G be two univariate


CDFs such that </JF = </JG. Then F = G.
PROOF. In the proof of Lemma B.101, we proved that ifY '" N(O, 1), and if a and
b are continuity points of F, and X has CDF F, then limc_o Pr(a < X + cY ::;
b) = Pr(a < X::; b). The same is true of G. Hence, F = G by Lemma B.105. 0
An obvious consequence of the uniqueness theorem is the following.
Corollary B.107. 48 Suppose that F and G are k-dimensional CDFs such that
J
lor every bounded continuous I, l(x)dF(x) = l(x)dG(x). Then F = G. J

B.5 Stochastic Processes


B.5.1 Introduction
Sometimes we wish to specify a joint distribution for an infinite sequence of
random variables. Let (S, A, 1-') be a probability space. If Xn : S --+ IR for every
n and each Xn is measurable with respect to the Borel u-field B, we can define
a u-field of subsets of IR oo such that the infinite sequence X = (X l ,X2 , ... ) is
measurable. Let BOO be the smallest u-field that contains all finite-dimensional
orthants, that is, every set B of the form

{x: Xil ::; el, ... ,Xi n ::; Cn , for some n and some integers il, ... ,in

and some numbers CI, ••• , Cn }.

It is clear that X-I (B) E A since it is the intersection of finitely many sets in
A. By Theorem A.34, it follows that X-I (BOO) ~ A, so X is measurable with
respect to this u-field.

B.5.2 Martingales+
A particular type of stochastic process that is sometimes of interest is a martin-
gale. [For more discussion of martingales, see Doob (1953), Chapter VII.]

47This corollary is used in the proof of Theorem 2.74.


48This corollary is used in the proof of DeFinetti's representation theorem 1.49.
+This section contains results that rely on the theory of martingales. It may
be skipped without interrupting the flow of ideas.
646 Appendix B. Probability Theory

Definition B.lOS. Let (8, A, 1-1) be a probability space. Let N be a set of consec-
utive integers. For each n EN, let F,. be a sub-u-field of A such that Fn ~ F,.+l
for all n such that n and n + 1 are in N. Let {X,,},.E.N' be a sequence of ran-
dom variables such that Xn is measurable with respect to Fn for all n. The
sequence of pairs {(Xn,F,.)}nE.N' is called a martingale if, for all n such that n
and n + 1 are in N, E(Xn+lIFn) = Xn. It is called a submartingale if, for every
n, E(Xn+lIFn) ~ X n .
Note that a martingale is also a submartingale.
Example B.I09. A simple example of a martingale is the following. Let N =
{1, 2, ..:1,. and let {Yn}~l be independent random variables with mean O. Let
Xn = E:=l Yi. Let Fn be the u-field generated by YI, ... , Yn . Then,

E(YI + ... + Yn+ll.rn) = YI + ... + Yn = X n ,


since E(Yn+lIFn) = 0 by independence. If each Yi has nonnegative finite mean,
then E(Xn+lIF,.) ~ Xn, and we have a submartingale.
Example B.llO. Another example of a martingale is the following. Let N be a
collection of consecutive integers, and let {F,,},.E.N' be an increasing sequence of
u-fields. Let X be a random variable with E(IXI) < 00. Set X,. = E(XIF,.). By
the law of total probability B. 70,

E(X"+lIF,.) = E[E(XIF,.+dIF,.) = E(XIFn) = X n ,


so {(X,., Fn)}nE.N' is a martingale.
Example B.llI. If {(Xn, Fn)}"E.N' is a martingale, then

IXnl = IE(X"+IIFn)1 :5 E(lXn+lIlFn), (B.112)


hence {(IXnl,Fn)}nE.N' is a submartingale.

The following result is proven using the same argument as in Example B.111.
Proposition B.ll3. 49 If {(Xn, Fn)}nE.N' is a martingale, then EIXnl is nonde-
creasing in n.
The reader should note that if {(Xn, Fn)}nE.N' is a submartingale and if M ~
N is a string of consecutive integers, then {(Xn, .rn)}nEM is also a submartingale.
Similarly, if k is an integer (positive or negative) and M = {n: n+k EN}, then
{(X~, F~)}nEM is a submartingaie, where X~ = Xn+A: and .r:..
= Fn+lc. This
latter is just a shifting of the index set.
There are important convergence theorems that a.pply to many martingales
and submartingales. They say that if the set N is infinite, then limit random
variables exist. A lemma is needed to prove these theorems. 50 It puts a bound on
how often a submartingale can cross an interval between two numbers. It is used
to show that such crossings cannot occur infinitely often with high probability.
(Infinitely many crossings of a nondegenerate interval would imply divergence of
the submartingale.)

49This proposition is used in the proof of Theorem B.122.


50This lemma is proven by Doob (1953, Theorem VII, 3.3).
B.5. Stochastic Processes 647

Lemma B.114 (Upcrossing lemma).5l Let.N = {I, ... , N}, and suppose
that {(X",.rn)}~=l is a submartingale. Let r < q, and define V to be the number
of times that the sequence Xl, ... , XN crosses from below r to above q. Then

E(V) :5 _1_ (EIXNI


q-r
+ Ir!). (B.115)

PROOF. Let Yn = max{O,Xn - r} for every n. Since g(x) = max{O, x} is a non-


decreasing convex function of x, it is easy to see (using Jensen's inequality B.17)
that {Yn , .rn}~=l is a submartingale. Note that a consecutive set of Xi(S) cross
from below r to above q if and only if the corresponding consecutive set of Yi(s)
cross from 0 to above q - r. Let To(s) = 0 and define Tm for m = 1,2, ... as
Tm(s) = inf{k:5 N: k > Tm-l(s), Yk(s) = O}, if m is odd,
Tm(s) = inf{k:5 N: k > Tm-l(s), Yk(s) ;::: q - r}, if m is even,
Tm(s) = N + 1, if the corresponding set above is empty.

Now V(s) is one-half of the largest even m such that Tm(s) :5 N. Define, for
i= 1, ... ,N,

Ri(S) = {Io if Tm(s) < i :5 Tm+l(S) for m odd,


otherwIse.

Then (q - r)V(s) :5 E;:'l Ri(S)(Yi(S) - Yi-l(S» = X, where Yo == 0 for conve-


nience. First, note that for all m and i, {Tm(s) :5 i} E .ri. Next, note that for
every i,

{s:Ri(s)=l}= U ({Tm:5i-1}n{Tm+l:5i-1}C)E.ri-l. (B.116)


m odd

E(X) = L:J. . _
N

i=l {s.R,{s)-l}
(Yi(S) - Yi-l(s»dl-£(s)

= L J . . _ (E(YiI.ri-t}(S) - Yi-l(s»dJL(s)
i=l {s.R,{s)-l}
N
:5 L !(E(YiI.ri-t}(S) - Yi_l(s»dJL(s)
i=l
N
L(E(Yi) - E(Yi_t}) = E(YN),
i=l
where the second equality follows from (B.116) and the inequality follows from
the fact that {Yn, Fn}~=l is a submartingale. It follows that (q-r)E(V) :5 E(YN).
Since E(YN) :5lrl + E(IXN!), it follows that (B.115) holds. 0
The proof of the following convergence theorem is adapted from Chow, Rob-
bins, and Siegmund (1971).

5lThis lemma is used in the proofs of Theorems B.U7 and B.122.


648 Appendix B. Probability Theory

Theorem B.117 (Martingale convergence theorem: part 1).52 Suppose


that {(Xn,Fn)}~l is a submartingale such that sUPnEIXnl < 00. Then X =
limn _ oo Xn exists a.s. and EIXI < 00.
PROOF. Let X· = limsuPn-+oo Xn and X. = liminfn-+oo Xn. Let B = {s :
X.{s) < X·(s)}. We will prove that J.L(B) = O. We can write

B=
r < q,
u
r, q rational
{s : X·{s) ~q >r ~ X.(s)}.

Now, X·{s) > q > r ~ X.{s) if and only if the values of Xn(S) cross from being
below r to being above q infinitely often. For fixed rand q, we now prove that
this has probability OJ hence J.L(B) = O. Let Vn equal the number of times that
Xl, ... , Xn cross from below r to above q. According to Lemma B.114,

supE(Vn) $ _1_ (SuPE{IXnl) + Irl) < 00.


n q-r n

The number of times the values of {Xn(S)}~=l cross from below r to above q
equals limn-+oo Vn(s). By the monotone convergence theorem A.52,

00 > supE(Vn) = E{ lim Vn ).


n n-oo

It follows that J.L{{s: limn-+ co Vn{s) = oo}) = O.


Since J.L(B) = 0, we have that X = lim n-+ oo Xn exists a.s. Fatou's lemma A.50
says E(IXi) $ liminfn-+co E(lXnl) $ sUPn E(lXnl) < 00. 0
For the particular martingale in which Xn = E(XIFn) for a single X, we have
an expression for the limit.
Theorem B.118 (Levy's theorem: part 1).53 Let {Fn}~=l be an increasing
sequence of u-fields. Let Foo be the smallest u-field containing all of the Fn. Let
E(\XD < 00. Define Xn = E(X\Fn) and Xoo = E(X\Foo). Then limn-+oo Xn =
Xoo , a.s.
The proof of this theorem requires a lemma that will also be needed later.
Lemma B.119. 54 Let {Fn }~=l be a sequence of u-fields. Let E(\XI) < 00. Define
Xn = E(X\Fn). Then {Xn}~=l is a uniformly integmble sequence.
PROOF. Since E(XIFn) = E{X+IFn) - E(X-IFn), and the sum of uniformly
integrable sequences is uniformly integrable, we will prove the result for nonneg-
ative X. Let Ae,n = {Xn ~ c} E Fn. So fAc.n Xn{s)dJ.L{s) = fAc,n X{s)dJ.L{s). If
we can find, for every E > 0, a C such that fA X (s)dJ.L(s) < E for all n and all
c ~ C, we are done. Define 71{A) = fAX{s)dJ.Lt~). We have 71 «J.L and 71 is finite.

52This theorem is used in the proof of Theorems B.118 and 1.121.


53This theorem is used in the proofs of Theorem 7.78 and Lemma 7.124.
54This lemma is used in the proofs of Theorems B.U8, B.122, and B.124. It is
borrowed from Billingsley (1986, Lemma 35.2).
B.5. Stochastic Processes 649

By Lemma A.72, we have that for every E > 0 there exists li such that JL(A) < 6
implies T/(A) < E. By the Markov inequality B.15,
1 1
JL(Ac,n) :$ -E(Xn) = -E(X),
c c
for all n. Let C = 2E(X)/li. Then c ~ C implies JL(Ac,n) < 6 for all n, so
T/(Ac,n) < € for all n. 0
PROOF OF THEOREM B.ll8. By Lemma B.119, {Xn}~=1 is a uniformly integrable
sequence. Let Y be the limit of the martingale guaranteed by Theorem B.117.
Since Y is a limit of functions of the X n , it is measurable with respect to Foo. It
follows from Theorem A.60 that for every event A, lim n _ oo E(XnIA) = E(Y IA)'
Next, note that, for every A E F n ,

fA
Y(s)dJ.&(s) = n_oo
lim f
A
E(XIFn)(s)dJ.&(s) = f
A
X(s)dJL(S),

where the last equality follows from the definition of conditional expectation.
Since this is true for every n and every A E F n , it is true for all A in the field
F = U~=IFn. Since IXI is integrable, we can apply Theorem A.26 to conclude
that the equality holds for all A E F oo , the smallest IT-field containing F. The
equality E(XIA) = E(YIA) for all A E Foo together with the fact that Y is Foo
measurable is precisely what it means to say that Y = E(XIFoo) = Xoo. 0
For negatively indexed martingales, there is also a convergence theorem. Some
authors refer to negatively indexed martingales in a different fashion, which is
often more convenient.
Definition B.120. Let (S, A, JL) be a probability space. For each n = 1,2, ... ,
let Fn be a sub-IT-field of A such that Fn+l ~ Fn for all n. Let {Xn}~=1 be a
sequence of random variables such that Xn is measurable with respect to Fn for
all n. The sequence of pairs {(Xn,Fn)}~=1 is called a reversed martingale if for
all n E(XnIFn+l) = X n + 1 .
Example B.121. As in Example B.110, we can let {Fn}~=1 be a decreasing
sequence of IT-fields, and let E(IX!) < 00. Define Xn = E(XIFn). It follows from
the law of total probability B.70 that {(Xn,Fn)}~=l is a reversed martingale.
The following theorem is proven by Doob (1953, Theorem VII 4.2).
Theorem B.122 (Martingale convergence theorem: part 11).55 Suppose
that {(Xn, Fn)}n<o is a martingale. Then X = limn _ - oo Xn exists a.s. and has
finite mean.
PROOF. Just as in the proof of Theorem B.117, we let Vn be the number of times
that the finite sequence Xn,Xn+1, ... ,X- 1 crosses from below a rational r to
above another rational q (for n < 0). The upcrossing lemma B.114 says that
1
E{Vn) :$ -
q-r
(E(IX_ 1 !) + Ir!) < 00.

55This theorem is used in the proof of Theorem B.124.


650 Appendix B. Pcobability Theory

As in the proof of Theorem B.117, it follows that X = limn_-oo Xn exists with


probability 1. From (B.112) and Lemma B.119, it follows that

lim E(IXnl)
n--oo
= E(IXI).
By Proposition B.113, it follows that E(lXI) < 00, and so X has finite mean. 0
It is usually more convenient to express Theorem B.122 in terms of reversed
martingales.
Corollary B.123. 56 If {(Xn, Fn)}~l is a reversed martingale, then limn_co Xn
exists a.s. and has finite mean.
There is also a version of Levy's theorem B.118 for reversed martingales.
Theorem B.124 (Levy's theorem: part 11).57 Let {Fn}~=l be a decreasing
sequence of a-fields. Let Foo be the intersection n~=lFn. Let E(IXI) < 00. Define
Xn = E(XIFn) and Xoo = E(XIFoo). Then limn_oo Xn = Xoo a.s.
PROOF. It is easy to see that {(Xn' Fn)}~=l is a reversed martingale and that
E(IX11) < 00. By Theorem B.122, it follows that lim n__ oo Xn = Y exists and is
finite a.s. To prove that Y = Xoo a.s., note that Xoo = E(X1IFoo) since .1'00 ~ Fl.
So, we must show that Y = E(X1IFoo). Let A E .1'00' Then

i Xn(s)dl./-(s) = i Xl (s)dl./-(S) ,

since A E Fn and Xn = E(XdFn). Once again, using (B.112) and Lemma B.119,
it follows that fA Y(s)dl./-(s) = fA X1 (s)dl./-(s); hence Y = E(XdFoo). 0

B.5.3 Markov Chains·


Another type of stochastic process we will occasionally meet is a Markov chain. 58
Definition B.125. Let {Xn}~=l be a sequence of random variables taking val-
ues in a space X with a-field B. The sequence is called a Markov chain (with
stationary transition distributionsl 9 if there exists a function p : B x X -+ [0, I]
such that
• for all x EX, pC x) is a probability measure on B;
• for all B E B, p(B,·) is B measurable;

56This corollary is used in the proof of Theorem B.124.


57This theorem is used in the proofs of Theorem 1.62, Corollary 1.63,
Lemma 2.121, and Lemma 7.83.
"This section may be skipped without interrupting the flow of ideas.
581n this text, we only use Markov chains as occasional examples of sequences
of random variables that are not exchangeable.
59There are more general definitions of Markov chains and Markov processes
in which the transition distribution from Xn to Xn+l is allowed to depend on n.
We will not need these more general processes in this book.
B.5. Stochastic Processes 651

• for each n and each BE 13,


p(B, x) = Pr(Xn+1 E BIXI = Xl, X2 = X2,···, Xn-l = Xn-l, Xn = X),
almost surely with respect to the joint distribu tion of (X I, ..• , X n ).
The last condition in the definition of a Markov chain says that the
conditional
distribu tion of Xn+l given the past depends only on the most recent
past X n . In
other words, Xn+l is conditionally independent of Xl, ... , Xn-l given n
X .
Examp le B.126. A sequence {Xn}~=l of lID random variables
is a Markov
chain with p(B,x) = Pr(Xi E B) for all x.
Examp le B.121. Let {Xn}~=l be Bernoulli random variables such
that
Pr(Xn+1 = 11XI = Xl,·" ,Xn = Xn) = px,.,l,
for Xn E {O, 1}. The entire joint distribu tion of the sequence is determi
ned by the
numbers PO,1, Pl,l, and Pr(XI = 1).

B.5.4 General Stochastic Processes


Occasionally, we will have to deal with more complicated stochas
tic processes.
What makes them more complicated is that they consist of more than
countably
many random quantities.
Examp le B.128. Let :F be a set of real-valued functions of a real
vector. That
is, there exists k such that FE :F means F : JRk -+ JR. Suppose that
X : 8 -+ :F
is a random quantity whose values are functions themselves. We would
like to
be able to discuss the distribu tion of X. We will need a u-field of
subsets of :F
in order to discuss measurability. A natural u-field is the smalles
t u-field that
contains all sets of the form At,x = {F E :F : F( t) :$ x}, for all
t E JRk and
all X E JR. It can be shown (see below) that X is measurable with
respect to
this u-field if, for every t E 1Rk, the real-valued function G t : 8 -+
1R is Borel
measurable, where Gt(s) = F(t) when Xes) = F.
A general stochastic process can be defined, and it resembles the above
example
in all importa nt aspects.
Definit ion B.129. Let (8, A, 11) be a probability space, and let R
be some set.
For each r E R, let (Xr, 13r ) be a Borel space, and let Xr : 8 -+ Xr be
measurable.
The collection of random variables X = {Xr : r E R} is called
a stochastic
process.
Examp le B.130. If every (Xr ,13r ) is the same space (X,13), then
X can be
thought of as a "random function" from R to X as follows. For each
s E 8, define
=
the function F. : R -+ 8 by F.(r) Xr(S). In order to make this a
function, we need a u-field on the set of functions from R to X. Since
true random
this set of
functions is the product set X R , a natural u-field is the product u-field
BR. The
product u-field is easily seen to be the smallest u-field containing all
sets of the
form Ar,B = {F : F(r) E B}, for r E Rand B E 13. Now, let F :
8 -+ XR be
defined by F(s) == F•. Then F is measurable because

F-I(Ar, B) = {s: F.(r) E B} = {s : Xr{s) E B} E A,


because Xr is measurable.
652 Appendix B. Probability Theory

The important theorem about stochastic processes is that their distribution is


determined by the joint distributions of all finite collections of the X r •
Theorem B.131.6o Let R be a set and, for each r E R, let (Xr,Br) be a Borel
space. Let X = {Xr : r E R} and X' = {X; : r E R} be two stochastic processes.
Suppose that for every k and every k-tuple (rl' ... , rk) E Rk, the joint distribution
of (Xr1 , · · • ,Xrk ) is the same as that of (X;l' ... , X;k). Then the distribution of
X is the same as that of x' .

PROOF. Define X = TIrER Xr and let B be the product u-field. Say that a set
C E B is a finite-dimensional cylinder set if there exists k and rl, ... , rk E R and
a measurable D ~ TI:=l Xr , such that

C = {x EX: (x r1 , ••. ,Xrk) ED}.


It is easy to see that if {rl, ... ,rk} ~ {tl, ... , t m } for m ~ k, then there exists a
measurable subset D' of n;:lXSj such that

C = {x EX: (X S l> ... ,xs m ) ED'},


by taking the Cartesian product of D times the product of those Xr for r E
{Sl, ... , Sm} \ {rl, ... , rk} and then possibly rearranging the coordinates of all
points in this set to match the order of rl, ... , rk among Sl, ... , Sm. So, if C and
G are both finite-dimensional cylinder sets with

G = {x EX: (Xhl, ••• ,Xht) E E},


then we can let {tl, ... , t m } = {rl, ... , rk} U {hI, ... , ht} and write
C {XEX:(Xtl> ... ,Xt m )ED'},
G {XEX:(xtl> ... ,Xt m )EG'}.
It follows that

enG = {x EX: (Xtll ... , Xt m ) E D' n G'} .


So the finite-dimensional cylinder sets form a 1r-system. By assumption, the dis-
tributions of X and X' agree on this 1r-system. Since X = {x EX: Xr E X r } for
arbitrary r E R and since the distributions of X and X' are finite measures, we
can apply Theorem A.26 to conclude that the distributions are the same. 0
Another important fact about general stochastic processes is that it is possible
to specify a joint distribution for the entire process by merely specifying all
of the finite-dimensional joint distributions, so long as they obey a consistency
condition.
Definition B.132. Let X = TIrER Xr with the product u-field, where (Xr , Br) is
a Borel space for every r. For each finite k and each k-tuple (iI, ... , ik) of distinct
elements of R, let Pilo ... ,ik be a probability measure on n:=l
Xi;- We say that
these probabilities are consistent if the following conditions hold for each k and
distinct iI, ... , ik E R and each A in the product u-field of TI;=l Xi; :

6°This theorem is used in the proofs of Theorem B.133 and DeFinetti's repre-
sentation theorem 1.49.
B.5. Stochas tic Processes 653

• For each permuta tion 7r of k items, Pil, ... ,ik (A) = H,,(1) , ... ,i,,(k) (B), where
B = {(X".(I),'" ,X".(k»: (Xl, ... ,Xk) E A} .
• For each f. E R \ {il, ... , ik}, H1, ... ,ik (A) = Pil, ... ,ik,l(B), where

B = {(Xl, ... ,Xk, Xk+I) : (XI, .. " Xk) E A, Xk+l EXt}.


Since the set R may not be ordered, the first condition ensures that
it does not
matter in what order one writes a finite set of indices. The second
condition is
the substan tive one, and it ensures that the margina l distribu tions
of subsets of
coordinates are the probability measures associated with those subsets.
To avoid excessive notation , it will be convenient to refer to PJ as
the proba-
bility measure associated with a finite subset J ~ R without specifyi
ng the order
of the elements of J. When the consistency conditions in Definition
B.132 hold,
this should not cause any confusion.
The proof of the following theorem is adapted from Loeve (1977,
pp. 94-5).
The theorem says that consistent finite-dimensional distribu tions
determi ne a
unique joint distribu tion on the product space.
Theore m B.133. 61 Let X = TIrER Xr with the product u-field, where
Xr is a
Borel space for every r. For each finite subset J ~ R, let PJ be
a probability
measure on TIrEJ X r • Suppose that the PJ are consistent as defined
in Defini-
tion B.132. Then there exists a unique distribution on X with finite-di
mensional
marginals given by the PJ.
PROOF. The uniqueness follows from Theorem B.131, if we can prove
existence.
First, suppose that Xr = 1R for all r. Let C be the class of all unions
of finitely
many finite-dimensional cylinder sets of the form C = TIrER C r , where
all but
finitely many of the C r equallR and the others are unions of finitely
many inter-
vals. The class C is a field. For C of the above form, define P(C) = PJ(TIrE
J Cr ).
The consistency assumption implies that P can be uniquely extende
d to a finitely
additive probability on C. To show that P is countab ly additive,
we will show
that if {An}~=1 is a decreasing sequence of elements of C such that
P(An) > 10
for all n, then A = n~=IAn is nonempty. Suppose that P(An) > 10
for all n. Let
I n be the set of all subscrip ts involved in AI, ... , An and J be the
union of these
sets. Let An = Bn X TIrOn X r. Then P(A .. ) = PJ" (Bn), and Bn is
the union of
finitely many product s of intervals. For each product of intervals H
that consti-
tutes B n , we can find a product of bounde d closed intervals contain
ed in H such
that the PJn probability of the union of these H is as close as we wish
to PJ" (B n ).
Let C n be a finite union of product s of closed bounde d intervals contain
ed in Bn
such that PJn (Bn \ Cn ) < €/2n+l. If Dn is the cylinder set correspo
nding to Cn ,
then
PJn (An \ Dn) = PJn (Bn \ en) < 2n~1 .
Now, let En = Ann~1 Di, so that P(An \En) < 10/2. It follows that
PeEn) > €/2,
so each En is nonempty. Let xn :::: (x~,x~, ... ) E En. Since El 2
E2 2 "', it
follows that for every k 2: 0, x n + k E En ~ D ... Hence (x?+kji E I
n ) E Cn. Since

61This theorem is used in the proof of Lemma 2.123.


654 Appendix B. Probability Theory

each en is bounded, there is a subsequence of {(xf; i E Jl)}~=1 that converges to


a point (Xiii E JI) Eel. Let the subsequence be {(x~~;i E J1)}%"=I' Then there
e
is a subsequence of{(x~~; i E h)}%"=l that converges to a point (Xi; i E h) E 2 •
Continue extracting subsequences to get a limit point XJ = (Xi; i E J) E Dn for
all n. Hence, every point that extends XJ to an element of X is in An for all
n, and A is nonempty. Now apply the Caratheodory extension theorem A.22 to
extend P to the entire product a-field.
For general Borel spaces, let CPr : Xr -+ Fr be a bimeasurable mapping to a
Borel subset of rn. for each r. It follows easily by using Theorem A.34 that the
function cp: X -+ TIrERFr is bimeasurable, where cp(x) = (CPr(Xr);r E R). For
each finite s.ubset J, cP induces a probability on TIiEJ IR from PJ, and these are
clearly consistent. By what we have already proven there is a probability P on
TITER IR with the desired marginals. Then cp-l induces a probability on X from
P with the desired marginals. 0

B.6 Subjective Probability


It is not obvious for what purpose a mathematical probability, as described in
this chapter and defined in Definition A.18, would ever be useful. In this section,
we try to show how the mathematical definition of probability is just what one
would want to use to describe one's uncertainty about unknown quantities if one
were forced to gamble on the outcomes of those unknown quantities. 62
DeFinetti (1974) suggests that probability be defined in terms of those gam-
bles an agent is willing to accept. Others, like DeGroot (1970), would only require
that probabilities be subjective degrees of belief. Either way, we might ask, "Why
should degrees of belief or gambling behavior satisfy the measure theoretic defini-
tion of probability?" In this section, we will try to motivate the measure theoretic
definition of probability by considering gambling behavior. We begin by adopting
the viewpoint of DeFinetti (1974}.63
For the purposes of this discussion, let a mndom variable be any number about
which we are uncertain. For each bounded random variable X, assume that there
is some fair price p such that an agent is indifferent between all gambles that pay
c(X - p), where c is in some sufficiently small symmetric interval around 0 such
that the maximum loss is still within the means of the agent to pay. For example,
suppose that X = X is observed. If c(x - p) > 0, then the agent would receive
this amount. If c(x - p) < 0, then the agent would lose -c(x - p). It must be
that -c(x - p} is small enough for the agent to be able to pay. Surely, for X in a

621n Section 3.3, we give a much more elaborate motivation for the entire
apparatus of Bayesian decision theory, which includes mathematical probab~lity
as one of its components. An alternative derivation of mathematical probability
from operational considerations is given in Chapter 6 of DeGroot (1970).
63There are a few major differences between the approach in this section and
DeFinetti's approach, which DeFinetti, were he alive, would be quick to po~nt
out. Out of respect for his memory and his followers, we will also try to pomt
out these differences as we encounter them.
B.6. Subjective Probability 655

bounded set, C can be made small enough for this to hold, so long as the agent
has some funds available.
Definition B.134. The fair price p of a random quantity is called its prevision
and is denoted P(X). It is assumed, for a bounded random quantity X, that the
agent is indifferent between all gambles whose net gain (loss if negative) to the
agent is c(X - P(X)) for all c in some symmetric interval around O.
The symmetric interval around 0 mentioned in the definition of prevision may
be different for different random variables. For example, it might stand to reason
that the interval corresponding to the random variable 2X would be half as wide
as the interval corresponding to X.
Another assumption we make is that if an agent is willing to accept each of a
countable collection of gambles, then the agent is willing to accept all of them
at once, so long as the maximum possible loss is small enough for the agent to
pay.64 An example of countably many gambles, each of which is acceptable but
cannot be accepted together, is the famous St. Petersburg paradox.
Example B.135. Suppose that a fair coin is tossed until the first head appears.
Let N be the number of tosses until the first head appears. For n = 1,2, ... ,
define
if N = n,
otherwise.
Suppose that our agent says that P(Xn } = 1 for all n. For each n, there is Cn < 0
such that the agent is willing to accept cn(Xn - 1). If - 2::""=1cn 2n is too big,
however, the agent cannot accept all of the gambles at once. Similarly, there are
Cn > 0 such that the agent is willing to accept Cn (Xn - 1). If 2:::'-1
Cn is too
big, the agent cannot accept all of these gambles. The St. Petersburg paradox
corresponds to the case in which Cn = 1 for all n. In this case, the agent pays 00
and only receives 2N in return. We have ruled out this possibility by requiring
that the agent be able to afford the worst possible loss.

The following example illustrates how it is possible to accept infinitely many


gambles at once.

Example B.136. Suppose that a random quantity X could possibly be anyone


of the positive integers. For each positive integer x, let

if X = x,
if not.

Suppose that our agent is indifferent between all gambles of the form c(Ix _ 2-"')
for all -1 :s :s
c 1 and all integers x. Then, we assume that the agent is also
indifferent between all gambles of the form 2:::'=1
c",(I", - T X ), so long as -1:s
:s 1 for all x. (Note that the largest possible loss is no more than 1.) Let
:L:
C",
y = 1 cx!x with -1 :5 Cx :5 1 for all x. Note that Y is a bounded random

64DeFinetti would not require an agent to accept count ably many gambles at
once, but rather only finitely many. We introduce this stronger requirement to
avoid mathematical problems that arise when the weaker assumption holds but
the stronger one does not. Schervish, Seidenfeld, and Kadane (1984) describe one
such problem in detail.
656 Appendix B. Probability Theory

quantity, and that the agent has implicitly agreed to accept all gambles of the
form c(Y - ,.,,) for -1 ::; c ::; 1, where ,." = E::'=l cz2- z . If the agent were
foolish enough to be indifferent between all gambles of the form dey - p) for
-a ::; d ::; a where p i= ,.", then a clever opponent could make money with no risk.
For example, if p > ,.", let f = min{l, a}. The opponent would ask the agent to
accept the gamble fey -p) as well as the gambles - fcz(Iz _2- Z ) for x = 1,2, ....
The net effect to the agent of these gambles is - f(p -,.,,) < 0, no matter what
value X takes! A similar situation arises if p < ,.". Only p = ,." protects the agent
from this sort of problem, which is known as Dutch book.

To avoid Dutch book, we introduce the following definition.


Definition B.137. Let {X" : el E A} be a collection of bounded random vari-
ables. Suppose that, for each el, an agent gives a prevision P(X,,) and is indifferent
between all gambles of the form c(X" - P(X,,» for -d" ::; c ::; d" with d" =
min{maxx(:':~:P(x,.), p(X,.)MminX ,.} for some M > o. These previsions are coherent
if there exist no countable subset B ~ A and {Cb : -db::; Cb ::; db, for all bE B}
such that -M ::; LbEB Cb(Xb - P(Xb» < 0 under all circumstances. 65 If a
collection of previsions is not coherent, we say that it is incoherent.
The value M is the maximum amount the agent is willing to lose. Coherence of
a sufficiently rich collection of previsions is equivalent to a probability assignment.
Theorem B.13S. 66 Let (B,A) be a measurable space. Suppose that, Jor each
C E A, the agent assigns a prevision P(Ic), where Ic is the indicator oj C.
Define ,." : A -+ IR by ,.,,( C) = P(lc). Then the previsions are coherent if and
only if,." is a probability on (8, A).
PROOF. Without loss of generality, suppose that the agent is indifferent between
all gambles of the form c(Ic - P(lc», for all -1 ::; c ::; 1. For the "if" part,
assume that,." is a probability. Let {Cn}::'=l E A and Ci E [-1,1) be such that
with
=L »,
00

X en(lcn - ,.,,(Cn
n=l
the maximum losses from X and from -X are small enough for the agent to
afford. Since this makes X bounded, it follows from Fubini's theorem A.70 that
E(X) = OJ hence it is impossible that X < 0 under all circumstances, and the
previsions are coherent.
For the "only if' part, assume that the previsions are coherent. Clearly, ,.,,(0) =
0, since 10 = 0 and -c,.,,(0) ~ 0 for both positive and negative c. It is also easy to
see that ,.,,(A) ~ 0 for all A. If ,.,,(A) < 0, then for all negative c, c(IA -,.,,(A» < 0
and we have incoherence. Countable additivity follows in a similar fashion. Let
{An}~=l be mutually disjoint, and let A = U~=lAn. If ,.,,(A) < E:=l,.,,(A n ),

65When only finitely many gambles are required to be combined at once, as by


DeFinetti (1974), incoherence requires that the sum .be s~rictly less than s.ome
negative number under all circumstances. That is, DeFmettl would allow a stnctly
negative gamble to be called coherent, so long as the least upper bound was O.
66This theorem is used in the proof of Theorem B.139.
B.6. Subjective Probabi lity 657

then the following gamble is always negative:


00

If J.t(A) >E::"=l J.t(An), then the negative of the above gamble is always negativ~
Either way there is incoherence.
Theorem B.138 says that if an agent insists on dealing with a l7-field
of sub-
sets of some set 8, then expressing coherent previsions for gambles
on events is
equivalent to choosing probabilities. 67 Similar claims can be made about
bounde d
random variables.
Theore m B.139. Let C be the collection of all bounded measura
ble function s
from a measurable space (8, A) to lR. Suppose that, for each X
E C, an agent
assigns a prevision P(X). The previsions are coherent if and only
if there exists
a probability /1 on (8, A) such that P(X) = E(X) for all X E C.
PROOF. Suppose that the agent is indifferent between
all gambles of the form
c(X - P(X» for -dx ::; c S dx. For the "if" direction, the proof
is virtuall y
identical to the corresponding part of the proof of Theorem B.138.
For the "only
if" part, note that IA E C for every A E A. It follows from Theorem
B.138 that a
probabi lity J.t exists such that J.t(A) = P(lA) for all A E A. Hence P(X)
for all simple functions X. Let X > 0 and let Xl S X 2 S ... be simple
= E(X)
functions
less than or equal toX such that lim n _ oo Xn = X. Then X =
so
E::"=l(Xn+l -Xn ),
00

P(X) = LP(Xn +1 - Xn) = }~~ E(X n +l) = E(X),


n=l
from coherence and the monoto ne convergence theorem A.52. For
general X,
let X+ and X- be, respectively, the positive and negative parts
of X. Since
P(X) = P(X+) - P(X-) follows easily from coherence, the proof
is complete. 0
We conclude this "motivation" of probability theory from gambling
considera-
tions by trying to motivat e conditional probability. Suppose that,
in addition to
assigning previsions to gambles involving arbitrar y bounde d random
variables,
the agent is also required to assign conditional previsions in the followin
g way.
Let C be a sub-l7-field of A, and suppose that gambles of the form cIA
(X - p), for
all nonemp ty A E C, are being considered. 68 The fair price would
be that value
of p, denoted P(XIA) , such that the agent was indifferent between
all gambles of
the form cIA (X - P(XIA) ) for all c in some symmet ric interval around
than choose a different P(XIA) for each A, the agent has the option
o. Rather
of choosing
a single function Q : 8 -+ 1R such that Q is measura ble with respect
to the l7-field
C. The conditio nal gambles would then be cIA(X - Q).
Examp le B.140. For the simple case in which C = {0, A, A G , 8},
Q is measur-
able if and only if it takes on only two values, one on A and the other
on AG. In

671n the theory of DeFine tti (1974), one obtains finitely additive
probabilities
without assuming that probabilities have been assigned to all element
s of a 17-
field.
68DeFinetti (1974) would only require that such conditional gambles
be con-
sidered one at a time rather than a l7-field at a time.
658 Appendix B. Probability Theory

this case, there are only two sets of conditional gambles (other than the "uncondi-
tional" gambles c[X-P(X)]) , namely cIA (X -P(XIA» and cIAc(X -P(XIAc ».
Here, Q = P(XIA)IA + P(XIAc)IX, Note that the previsions P(XIA) and
P(IA) = ",(A) are already expressed. It is easy to see that
cIA(X - P(XIA»
= c(XIA - E(XIA» - cP(XIA)(IA -",(A» + c[P(XIA)",(A) - E(XIA)].
Clearly, the only coherent choices of P(XIA) satisfy P(XIA)",(A) = E(XIA). If
",(A) > 0, then P(XIA) = E(XIA)/",(A), the usual conditional mean of X given
A. Similarly, P(XIAc)",(Ac ) = E(XI~) must hold.
The general situation is not much different from Example B.140.
Theorem B.141. Suppose that an agent must choose a function Q that is mea-
sumble with respect to a sub-a-field C so that for each nonempty A E C, he or
she is indifferent between all gambles of the form cIA(X - Q). The choice of Q
is coherent if and only ifE(QIA) = E(XIA), for all A E C.
PROOF. As in Example B.140, note that
cIA(X - Q) = c(XIA - E(XIA» - c(QIA - E(QIA» + c[E(QIA) - E(XIA)]'
The choice of Q can be coherent if and only if E(QIA) = E(XIA). 0
The reader should note the similarity between the conditions in Theorem B.141
and Definition B.23. The function Q must be a version of the conditional mean
of X given C.
Example B.142. Let (X, Y) be random variables with a traditional joint density
with respect to Lebesgue measure Ix,Y. That is, for all C E JR.2,

Pr«X, Y) E C) = fa IX,y(x,y)dxdy,

and for all bounded measurable functions 9 : JR.2 --+ JR.,

E(g(X, Y» = j g(x, y)fx,y(x, y)dxdy. (B.143)

Let C be the a-field generated by Y. That is, C = {y-I(A) : A E B}, where B is


the Borel a-field of subsets of JR.. It is straightforward to check that for all A E c,
E(XIA) = E(QIA), where Q(s) = h(Y(s», and

h(y) = jxfx,y(X,Y) dx,


fy(y)
J
and fy(y) = fX,Y(x,y)dx is the usual marginal density of Y. (Just apply
(B.143) with g(x,y) = xh(y) and with g(x,y) = xlc(y), where A = y-I(C).)
What we have done in this section is give a motivation for the use of the math-
ematical probability calculus to express uncertainty for the purposes of gambling.
We assume that an agent chooses which gambles to accept in such a way that he
or she is not subject to Dutch book, which is a combination of acceptable gam-
bles that produces a loss no matter what happens. We were also able to use this
approach to motivate the mathematical definition of conditional expectatio~. by
introducing conditional gambles and requiring that the same coherence condltlOn
apply to conditional and unconditional gambles alike.
B.7. Simulation 659

B.7 Simulation *
Several times in this text, we will want to generat e observations
that have a
desired distribu tion. Such observations will be called pseudorandom
numbers be-
cause samples appear to have the properties of random variables,
but they are
actually generat ed by a complicated determi nistic process. We will
not go into
detail on how pseudor andom numbers with uniform U(O, 1) distribu
tion are gen-
erated. In this section, we wish to prove a couple of useful theorem s
about how to
generat e pseudor andom number s with other distribu tions under the
assump tion
that pseudor andom number s with U(O, 1) distribu tion can be generat
ed.
Theore m B.144. Let F be a CDF and define the inverse of F by

F- 1 ( } _ { inf{x: F(x) ~ q} ifq > 0,


q - sup{x : F(x) > O} if q = O.
If U has U(O, 1) distribution, then X = F- 1 (U) has CDF F.
PROOF. We will calculate Pr(X ~ t) for all t. First, let t be a continu
ity point of
F. Then
Pr(X ~ t) = Pr(F-1(U) ~ t) = Pr(U ~ F(t» = F(t),
where the second equality follows from the fact that, at a continu
ity point t,
X ~ t if and only if U ~ F(t), and the third equality follows
from the fact
that U has U(0,1} distribu tion. Finally, let t be a jump point
of F and let
F(t) -lim"'r t F(x) = c. Then X = t if and only if t - c < U ~ t, so

Pr(X = t) = Pr(t - c < U ~ t) = c.


So, X has CDF F at continu ity points of F and its distribu tion
has the same
sized jumps as F at the same points. So the CDF of X is F.
0
This theorem allows us to generat e pseudor andom variables with
arbitrar y
CDF F, if we can find F- 1 • The method described in this theorem
is called the
probability integral transform. Note that the probabi lity integral transfor
m has a
surprising theoreti cal implication.
Propos ition B.145. Let U have U(O,1) distribution, and let X
be a random
quantity taking values in a Borel space X. Then there exists a measura
ble function
f : 10, 1] ~ X such that feU) has the same distribution as X.
The next theorem allows us to find pseudor andom variables with
arbitrar y
density I if we can generat e pseudor andom variables with another density
9 such
that I(x) ~ kg(x} for some number k and all x.
Theore m B.146 (Accep tance-r ejectio n). Let f be a nonnegative
integrable
function, and let 9 be a density junction. Let k > 0 and suppose that f
(x) ~ kg( x)
for all x. Suppose that {Y;}~1 and {Ui}~l are all independent and
that the Y;
have density 9 and the Ui are U(O, 1). Define Z = YN, where

N = min {i .
U. < fey;} }
. • - kg(Y;) .

*This section may be skipped without interrup ting the flow of ideas.
660 Appendix B. Probability Theory

Then Z has density proportional to f.


PROOF. We can write the CDF of Z as

Pr(Z ~ t) = ( IUi::; - -
Pr Yi::; t
f(Yi»)
kg(Yi)
Pr (Yi ::; t, U. ::; tK~i))
= --'--.,....---....,--!...
Pr (u. < 1llil)
• - k9(YJ

E [pr ( Yi ::; t, Ui ::; ~ IYi) ]


=
E [pr ( Ui ::; k~\W) IYi ) ]
where we have used the law of total probability B.70 in the last equation. The
conditional probability in the numerator is

Pr (Yi• <- , U·• <- k IYi)


t f(Yi)
9 (Yi)'

= { 0 if Yi
> t,
k9(YJ l . lv".i <
1.!Y:il.·f _ t•

The mean of this is

jt -00
fey)
kg(y)g(y)d y =k
1jt -00 f(y)dy,

since Yi has PDF g(.). Similarly, the denominator conditional probability can be
written as
Pr (Ui < f(Yi)
- kg(Yi)
IYi) = f(Yi) .
kg(Yi)
The mean of this is likewise seen to be J f(y)dy/k. The ratio of these is
Pr(Z < t) = J~oo f(y)dy
- J f(y)dy ,

hence Z has density proportional to f. 0


Next, we prove a theorem that allows us to simulate from distributions with
bounded densities and sufficiently thin tails even when we only know the density
up to a normalizing constant. The theorem is due to Kinderman and Monahan
(1977).
Theorem B.147 (Ratio of uniforms method). Let f : IR -> [0,(0) be an
integrable function. Define

If (U, V) has uniform distribution over the set A, then V /U has density propor-
tional to f.
PROOF. Let (U, V) be uniformly distributed on the set A. Then fu,v(u,v) =
lA(u,v)/c, where c is the area of A. Define X = U and Y V/U. The Jacobian =
for the transformation is x and the joint density of (X, Y) is

/x,y(x,y)
x
= ~IA(x,xy) x
= ~I[O,v'f'"iY)l (x ) .
B.8. Problems 661

J
It follows that fy(y) = ov'fu5 ~dx = icf(y). 0
If both f(x) :::; b and a :::; xV f(x) :::; b for all x, then A is contained in the
rectangle with opposite corners (0, a) and (b, c). We can then generate U '" U(O, b)
and V '" U(a, c). We set X = V/U, and if U2 :5 f(X), take X as our desired
random variable. If U 2 < f(X), try again.
An important application of simulation is to the numerical integration tech-
nique called importance sampling. Suppose that we wish to know the value of the
ratio of two integrals
J v(8)h(8)d8
(B.148)
J h(8)d8 '
where 8 can be a vector. Suppose that f is a density function such that h/ f is
nearly constant and it is easy to generate pseudorandom numbers with density
/. Let {Xi}~l be an IID sequence of pseudorandom numbers with density f.
Then

J h(8)d8 = E (h(Xi»)
f(X i ) ,

J v(8)h(9)d9 h(X;»)
E ( v(Xd /(Xi) ,

where the expectations are with respect to the pseudo-distribution of Xi. If we let
Wi = h(Xi)/ /(Xi ) and Zi = V(Xi)Wi , then the weak law of large numbers B.95
says that Zn/W n converges in probability to (B.148).69 The reason that we want
h/ / to be nearly constant is so that the variance of Wi is small. In Section 7.1.3,
we will show how to approximate the variance of Zn/W n as an estimate of
(B.148).

B.8 Problems
Section B.2:

1. Suppose that an urn contains m ~ 3 white balls and n ~ 3 black balls.


Suppose that the urn is well mixed so that at any time, the probability
that anyone of the remaining balls in the urn is as likely to be drawn as
any other. We will draw three balls without replacement and set Xi = 1 if
°
the ith ball drawn is black, Xi = if the ith ball is white. Show that
Pr(X1 = 0,X2 = 1,X3 = 1)
Pr(X1 = 1, X 2 = 1, X3 = 0).
2. Suppose that H is a nondecreasing function and
F(x) = inf H(t).
t >x
t rational

69The strong law of large numbers 1.63 says that Zn/W n converges a.s. to
(B.148).
662 Appendix B. Probability Theory

(a) Prove that F is continuous from the right.


(b) Prove that infall x H(x) = infall x F(x).
(c) Prove that sUPall x H(x) = sUPall x F(x).
Section B.3:

3. Using the definition of conditional probability, show that AnB = 0 implies


Pr(AIC) + Pr(BIC) = Pr(A U BIC), a.s.

Use this to help prove that {An}~=l disjoint implies

4. *Let Xl and X 2 be lID random variables with U(O, 1) distribution. Let

Using the definition of conditional distribution, show that the conditional


distribution of Xl given T = t is a mixture of a point mass at t and a
U(O, t) distribution. Also, find the mixture.
5. Let (S, A, IJ.) be a probability space. Let C be a sub-q-field such that IJ.(C) E
{O, I} for all C E C. Let EIXI < 00. Prove that E(XIC) = E(X), a.s. [IJ.].
6. Let (S, A, IJ.) be a probability space. Let {An}~l be a partition of S,
and let C be the smallest q-field containing {An}~l. Let X be a random
variable. Show that E(XIC) = E~ IAn W n , where

Wn = {:;('A:;P if IJ.(An) > 0,


otherwise.

7. Let <l> denote the standard normal CDF, and let the joint CDF of random
variables (X, Y) be

< y + 1,
Fx,Y(x, y) = { ~~y)
~ if Y - 1 ~ x
if x ~ Y + 1,
otherwise.

(a) Find the conditional distribution of X given Y.


(b) Find the conditional distribution of Y given X.
8. Prove Proposition B.25 on page 617. (Hint: Use part 4 of Proposition A.49.)
9. Prove Proposition B.26 on page 617.
10. Prove Proposition B.27 on page 617. (Hint: Prove it for 9 an indicator func-
tion, then for simple functions, then for nonnegative measurable functions,
then for all integrable functions.)
11. Prove Proposition B.28 on page 617.
B.8. Problems 663

12. Suppose that Xl, ... , Xn are independent, each with distribution N(e, 1).
Find the conditional distribution of Xl"'" Xn given Xn = x, where Xn =
E~lXi/n.
13. Let 81 ~ 82 ~ ... be a sequence of u-fields, and let X ~ o. Suppose that
E(XI8n ) = Y for all n. Let 8 be the smallest u-field containing all of the
8 n. Show that E(XI8) = Y, a.s. (Hint: Show that the union of the 8 n is a
1I'-system, and use Theorem A.26.)
14. Prove Proposition B.43 on page 623.
15. Assume the conditions of Theorem B.46. Also, suppose that (X,8l,Vt)
and ()l, 82,112) are u-finite measure spaces and v = VI X V2. Prove that VI
can play the role of vXly(·ly) for all y and that V2 can play the role of vy
in the statement of Theorem B.46.
16. Prove Proposition B.51 on page 625. (Hint: Notice that IA(V-l(y,w» =
IA;(w).)
17. Prove Proposition B.66 on page 631. (Hint: Prove the result for product
sets first, and then use Theorem A.26.)
18. Prove Corollary B.67 on page 631.
19. Prove Corollary B.74 on page 633.
20. Prove the second Borel-Cantelli lemma: If {An}~=l are mutually inde-
pendent and E:=l Pr(An) = 00, then Pr(n~l U~=i An) = 1. (This set
is sometimes called An infinitely often.) (Hint: Find the probability of the
complement by using the fact that 1 - x:5 exp(-x) for 0:5 x :5 1.)
21. *Suppose that (8, A, It) is a measure space. Let {fn}~=l be a sequence of
measurable functions In : 8 --+ T, where (T,8) is a metric space with Borel
u-field. Let C be the tail u-field of {fn}~=l' If limn_oo fn(s) = f(s), for all
s, then prove that f is measurable with respect to C. (Hint: Refer to the
proof of part 5 of Theorem A.3B. Show that the set A. E C by showing
that the union in (A.39) does not need to start at 1.)
22. Let (8, A, It) be a probability space, and let C be the tail u-field of a se-
quence of random quantities {Xn}~=lt where Xn : 8 --+ X for all n. Let
V be the u-field generated by {Xn}~=l' Let X = (Xl,X2, ... ) E Xoo.
If 11' is a permutation of a finite set of integers {1, ... , n}, let 11'X =
(X"(l), ... , X,.(n) , X n + 1, ... ). We say that A E V is symmetric if A =
X-I (B) and for every permutation 11' of finitely many coordinates, A =
(1I'X)-l(B) as well.
(a) Prove that every C E C is symmetric.
(b) Show that there can be symmetric events that are not in C.
23. Prove Proposition B.78 on page 634.
664 Appendix B. Probability Theory

Section B.4:

24. Find a sequence of random variables that converges in probability to 0


but does not converge a.s. to O. (Hint: Consider the countable collection
of all subsets of [0,1] of the form [k/2 n , (k + 1)/2n] with k and n integers.
Arrange them in an appropriate sequence.)
25. Let {Xn}~=l be a sequence of random variables, and let X be another
random variable. Let Fn be the CDF of Xn and let F be the CDF of X.
Prove that Xn E. X if and only if limn-->oo Fn(x) = F(x) for every x such
that F is continuous at x.
26. Prove that I exp(iy) -11 ~ min{lyl, 2} for all y. (Hint: Show that exp(iy) =
1+i J:
exp( is )ds for y 2: 0 and a similar formula for y < 0.)
27. Prove the weak law of large numbers for infinite means: Suppose that
{Xi}~l are IID with mean 00. Then, for all real x, limn-->oo Pr(Xn >
x) = 1, where Xn = 2::~=1 Xi/no (Hint: Define Yi.t = min{Xi' t}. Prove
that E(Yi.t) < 00 for all t, but limt-->oo E(Yi.t) = 00.)
28. *Suppose that X is a random vector having bounded density with respect to
Lebesgue measure. Prove that the characteristic function of X is integrable.
(Hint: Run the proof of Lemma B.101 in reverse.)
Section B.5:

29. Let {in}~=l be a sequence of numbers in {O, I}. Suppose that {Xn}~=1 is
a sequence of Bernoulli random variables such that

Pr(X1 =il, ... ,Xn =i n )= x~2(n~4)'


x+2

where x = 2::7=1 ij. Show that this specifies a consistent set of joint distri-
butions for n = 1,2, ....
30. Let /-L be a finite measure on (1R, B), where B is the Borel u-field. Suppose
that {X(t) : -00 < t < oo} is a stochastic process such that X(t) has
Beta(/-L(-oo,t],/-L(t,oo)) distribution for each t, X(t) > Xes) ift > s, and
X(·) is continuous from the right.
(a) Prove that Pr(limt-->oo X(t) = 1) = 1.
(b) Let U = inf{t : X(t) 2: 1/2}. Prove that the median of U is inf{t :
IL(-oo,t]2: /-L(t,oo)}. (Hint: Write {U $ s} in terms of X(·).)
31. Let R be a set, and let (Xr, Br) be a Borel space for every r E R. Let X =
n Xr and let B be the product u-field. For each r E R, let Xr : X -+ Xr
bert'~ projection function Xr(X) = Xr . Prove that B is the union of all of
the u-fields generated by all of the countable collections of Xr functions.
That is, let Q be the set of all countable subsets of R, and for each q E Q
let xq = {Xr lrEq and let Bq be the u-field generated by xq. Then show
that B = uqEQBq.

Section B. 7:

32. Prove Proposition B.145 on page 659.


ApPENDIX C
Mathematical Theorems Not
Proven Here

There are several theorems of a purely mathematical nature which we use on


occasion in this text, but which we do not wish to prove here because their proofs
involve a great deal of mathematical background of which we will not make use
anywhere else.

C.1 Real Analysis


Theorem C.1 (Taylor's theorem).1 Suppose that f : ℝ^m → ℝ has con-
tinuous partial derivatives of all orders up to and including k + 1 with respect
to all coordinates in a convex neighborhood D of a point x_0. For x ∈ D and
i = 1, ..., k + 1, define

D^(i)(f; x, y) = Σ_{j_1=1}^m ··· Σ_{j_i=1}^m ( ∂^i f(z) / ∂z_{j_1} ··· ∂z_{j_i} |_{z=x} ) ∏_{s=1}^i y_{j_s},

where we allow notation like ∂^3/∂z_1∂z_1∂z_4 to stand for ∂^3/∂z_1^2∂z_4. Then, for
x ∈ D,

f(x) = f(x_0) + Σ_{i=1}^k (1/i!) D^(i)(f; x_0, x − x_0) + (1/(k+1)!) D^(k+1)(f; x*, x − x_0),

where x* is on the line segment joining x and x_0.

1This theorem is used in the proofs of Theorems 7.63, 7.89, 7.108, and 7.125.
For a proof (with m = 2), see Buck (1965), Theorem 16 on page 260.
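A numerical illustration (ours, assuming NumPy) of the case m = 2, k = 1: the first-order expansion f(x_0) + D^(1)(f; x_0, x − x_0) misses f(x) by a remainder of order ‖x − x_0‖^2, as the theorem asserts.

```python
import numpy as np

f = lambda x: np.sin(x[0]) * np.exp(x[1])
grad = lambda x: np.array([np.cos(x[0]) * np.exp(x[1]),
                           np.sin(x[0]) * np.exp(x[1])])

x0 = np.array([0.3, -0.2])
direction = np.array([1.0, 2.0])
for h in [1e-1, 1e-2, 1e-3]:
    x = x0 + h * direction
    # the D^(1) term is the dot product of the gradient with (x - x0)
    remainder = f(x) - (f(x0) + grad(x0) @ (x - x0))
    print(f"h = {h:g}: remainder = {remainder:.3e}")   # shrinks like h^2
```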

Theorem C.2 (Inverse function theorem).2 Let f be a continuously differ-
entiable function from an open set in ℝ^n into ℝ^n such that ((∂f_i/∂x_j)) is a
nonsingular matrix at a point x. If y = f(x), then there exist open sets U and V
such that x ∈ U, y ∈ V, f is one-to-one on U, and f(U) = V. Also, if g : V → U
is the inverse of f on U, then g is continuously differentiable on V.
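The theorem guarantees only a local inverse; numerically, g can be evaluated by Newton's method, as in this sketch (our example function, assuming NumPy):

```python
import numpy as np

def f(x):                       # f : R^2 -> R^2, continuously differentiable
    return np.array([x[0] + x[1] ** 2, x[0] * x[1]])

def jacobian(x):                # ((df_i/dx_j)); nonsingular at x = (1, 1)
    return np.array([[1.0, 2.0 * x[1]], [x[1], x[0]]])

y = np.array([2.05, 1.02])      # a point of V near f(1, 1) = (2, 1)
x = np.array([1.0, 1.0])        # start inside U
for _ in range(25):             # Newton iteration for g(y) = f^{-1}(y)
    x = x - np.linalg.solve(jacobian(x), f(x) - y)
print(x, f(x) - y)              # f(x) reproduces y to machine precision
```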
Theorem C.3 (Stone-Weierstrass theorem).3 Let A be a collection of con-
tinuous complex functions defined on a compact set C and satisfying these con-
ditions:
• If f ∈ A, then the complex conjugate of f is in A.
• If x_1 ≠ x_2 ∈ C, then there exists f ∈ A such that f(x_1) ≠ f(x_2).
• If f, g ∈ A, then f + g ∈ A and fg ∈ A.
• If f ∈ A and c is a constant, then cf ∈ A.
• For each x ∈ C, there exists f ∈ A such that f(x) ≠ 0.
Then, for every continuous complex function f on C, there exists a sequence
{f_n}_{n=1}^∞ in A such that f_n converges uniformly to f on C.
Theorem C.4 (Supporting hyperplane theorem).4 If S is a convex subset
of a finite-dimensional Euclidean space, and x_0 is a boundary point of S, then
there is a nonzero vector v such that for every x ∈ S, v^T x ≤ v^T x_0.
Theorem C.5 (Separating hyperplane theorem).5 If S_1 and S_2 are disjoint
convex subsets of a finite-dimensional Euclidean space, then there is a nonzero
vector v and a constant c such that for every x ∈ S_1, v^T x ≤ c and for every
y ∈ S_2, v^T y ≥ c.
Theorem C.6 (Bolzano-Weierstrass theorem).6 Suppose that B is a closed
and bounded subset of a finite-dimensional Euclidean space. Then every infinite
subset of B has a cluster point in B.

C.2 Complex Analysis


Theorem C.7.7 Let f be an analytic function in a neighborhood of a point z.
Then the derivatives of f of every order exist and are analytic in a neighborhood

2This theorem is used in the proof of Theorem 7.57. For a proof, see Rudin
(1964), Theorem 9.17.
3This theorem is used in the proofs of DeFinetti's representation theorems 1.49
and 1.47 and Theorem B.93. For a proof, see Rudin (1964), Theorem 7.31.
4This theorem is used in the proof of Theorem B.17. For a proof, see Berger
(1985), Theorem 12 on page 341, or Ferguson (1967), Theorem 1 on page 73.
5This theorem is used in the proof of Theorems B.17, 3.77, and 3.95. For a
proof, see Berger (1985), Theorem 13 on page 342, or Ferguson (1967), Theorem 2
on page 73.
6This theorem is used in the proof of Theorem 3.77. For a proof, see Dugundji
(1966), Theorems 3.2 and 4.3 of Chapter XI.
7This theorem is used to show that certain estimators are UMVUE, and in
the proof of Theorem 2.74. For a proof, see Churchill (1960), Sections 52 and 56.

of z. If f^(k) denotes the kth derivative of f, then

f(x) = Σ_{k=0}^∞ (x − z)^k f^(k)(z)/k!

for all x in some circle around z.
Theorem C.8 (Maximum modulus theorem).8 Let f be an analytic func-
tion in an open set D which is continuous on the closure of D. Let the maximum
value of |f(z)| for z in the closure of D be c. Then |f(z)| < c for all z ∈ D unless
f is constant on D.
Theorem C.9 (Cauchy's equation).9 Let G be a Borel subset of ℝ^k with
positive Lebesgue measure. Let f : G → ℝ be measurable. Let H_1 = G and
H_n = H_{n−1} + G for each n. For each n, let g_n : H_n → ℝ be measurable such
that g_n(Σ_{i=1}^n x_i) = Σ_{i=1}^n f(x_i) for almost all (x_1, ..., x_n) ∈ G^n. Then there is
a real number a and a vector b ∈ ℝ^k such that f(x) = a + b^T x a.e. in G.

C.3 Functional Analysis


Theorem C.10.10 If T is an operator with finite norm on the Hilbert space L^2(μ)
given by T(f)(x) = ∫ K(x′, x) f(x′) dμ(x′), then T is of Hilbert-Schmidt type if and
only if

∫∫ |K(x′, x)|^2 dμ(x′) dμ(x) < ∞.

Theorem C.11.11 Every operator of Hilbert-Schmidt type is completely contin-
uous.
Theorem C.12.12 If T is a completely continuous self-adjoint operator, then T
has an eigenvalue λ with |λ| = ‖T‖.
Theorem C.13.13 If T is a linear operator with finite norm and T* is its adjoint
operator, then ‖T*T‖ = ‖T‖^2.

8This theorem is used in the proof of Theorem 2.64. For a proof, see Churchill
(1960), Section 54, or Ahlfors (1966), Theorem 12′ on page 134.
9This theorem is used in the proof of Theorem 2.114. For a proof, see Diaconis
and Freedman (1990), Theorem 2.1.
10This theorem is used in the proof of Theorem 8.40. For a proof, see Sec-
tion XI.6 of Dunford and Schwartz (1963). By L^2(μ) we mean {f : ∫ f^2(x) dμ(x) <
∞}.
11This theorem is used in the proof of Theorem 8.40. For a proof, see Theo-
rem 6 of Section XI.6 of Dunford and Schwartz (1963). The reader should note
that Dunford and Schwartz (1963) use the term compact instead of completely
continuous.
12This theorem is used in the proof of Theorem 8.40. For a proof, see Lemma 1
in Section VIII.3 of Berberian (1961).
13This theorem is used in the proof of Theorem 8.40. For a proof, see part (5)
of Theorem 2 on p. 132 of Berberian (1961).
APPENDIX D
Summary of Distributions

The distributions used in this book are listed here. We give the name and sym-
bol used to describe each distribution. Each distribution is absolutely continuous
with respect to some measure or other. In most cases the mean and variance are
given. In some cases, the symbol for the CDF is given.

D.1 Univariate Continuous Distributions


Alternate noncentral beta
Symbol: ANCB(q, a, γ)
Density: f_X(x) = Σ_{k=0}^∞ [Γ(a/2+k)/(k! Γ(a/2))] γ^k (1−γ)^{a/2} [Γ((q+a)/2+2k)/(Γ(q/2+k) Γ(a/2+k))] x^{q/2+k−1} (1−x)^{a/2+k−1}
Dominating measure: Lebesgue measure on [0, 1]

Alternate noncentral chi-squared
Symbol: ANCχ²(q, a, γ)1
Density: f_X(x) = Σ_{k=0}^∞ [Γ(a/2+k)/(k! Γ(a/2))] γ^k (1−γ)^{a/2} [x^{q/2+k−1}/(2^{q/2+k} Γ(q/2+k))] exp(−x/2)
Dominating measure: Lebesgue measure on [0, ∞)
Mean: q + aγ/(1−γ)
Variance: 2[q + aγ(2−γ)/(1−γ)²]

1This distribution was derived without a name by Geisser (1967). It was named
L2 by Lecoutre and Rouanet (1981).

Alternate noncentral F
Symbol: ANCF(q, a, γ)2
Dominating measure: Lebesgue measure on [0, ∞)
Mean: (1−γ) a/(a−2) + γ a/q, if a > 2
Variance: 2a²(a−2+q)(1−γ)²/[(a−2)²(a−4)q] + 4a²γ(1−γ)/[(a−2)q²], if a > 4

Beta
Symbol: Beta(α, β)
Density: f_X(x) = [Γ(α+β)/(Γ(α)Γ(β))] x^{α−1}(1−x)^{β−1}
Dominating measure: Lebesgue measure on [0, 1]
Mean: α/(α+β)
Variance: αβ/[(α+β)²(α+β+1)]

Cauchy
Symbol: Cau(μ, σ²)
Density: f_X(x) = [πσ(1 + (x−μ)²/σ²)]^{−1}
Dominating measure: Lebesgue measure on (−∞, ∞)
Mean: Does not exist
Variance: Does not exist

Chi-squared
Symbol: χ²_a
Density: f_X(x) = [x^{a/2−1}/(2^{a/2} Γ(a/2))] exp(−x/2)
Dominating measure: Lebesgue measure on [0, ∞)
Mean: a
Variance: 2a

2The alternate noncentral F distribution, with a different scaling factor, was
called the ψ² distribution by Rouanet and Lecoutre (1983). See also Lecoutre
(1985). The distribution was derived without a name by Ferrandiz (1985).
Schervish (1992) gives additional details concerning the ANCχ², ANCB, and
ANCF distributions.

Exponential
Symbol: Exp(θ)
Density: f_X(x) = θ exp(−xθ)
Dominating measure: Lebesgue measure on [0, ∞)
Mean: 1/θ
Variance: 1/θ²

F
Symbol: F_{q,a}
Density: f_X(x) = [Γ((q+a)/2) q^{q/2} a^{a/2} / (Γ(q/2)Γ(a/2))] x^{q/2−1} (a + qx)^{−(q+a)/2}
Dominating measure: Lebesgue measure on [0, ∞)
Mean: a/(a−2), if a > 2
Variance: 2a²(q+a−2)/[q(a−4)(a−2)²], if a > 4

Gamma
Symbol: Γ(α, β)
Density: f_X(x) = [β^α/Γ(α)] x^{α−1} exp(−βx)
Dominating measure: Lebesgue measure on [0, ∞)
Mean: α/β
Variance: α/β²

Inverse gamma
Symbol: Γ^{−1}(α, β)
Density: f_X(x) = [β^α/Γ(α)] x^{−α−1} exp(−β/x)
Dominating measure: Lebesgue measure on [0, ∞)
Mean: β/(α−1), if α > 1
Variance: β²/[(α−1)²(α−2)], if α > 2

Laplace
Symbol: Lap(μ, σ)
Density: f_X(x) = [1/(2σ)] exp(−|x−μ|/σ)
Dominating measure: Lebesgue measure on ℝ
Mean: μ
Variance: 2σ²

Noncentral beta
Symbol: NCB(α, β, ψ)
Density: f_X(x) = Σ_{k=0}^∞ [(ψ/2)^k exp(−ψ/2)/k!] [Γ(α+β+k)/(Γ(α+k)Γ(β))] x^{α+k−1}(1−x)^{β−1}
Dominating measure: Lebesgue measure on [0, 1]

Noncentral chi-squared
Symbol: NCχ²_q(ψ)
Density: f_X(x) = Σ_{k=0}^∞ [(ψ/2)^k exp(−ψ/2)/k!] [x^{q/2+k−1}/(2^{q/2+k} Γ(q/2+k))] exp(−x/2)
Dominating measure: Lebesgue measure on [0, ∞)
Mean: q + ψ
Variance: 2q + 4ψ
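The series above is a Poisson(ψ/2) mixture of central χ²_{q+2k} densities; the sketch below (ours, assuming NumPy and SciPy) checks that form, and the moment formulas, against scipy's implementation.

```python
import numpy as np
from scipy import stats

q, psi = 3.0, 2.5
x = np.linspace(0.01, 30.0, 300)

k = np.arange(200)
w = stats.poisson.pmf(k, psi / 2.0)       # mixing weights (psi/2)^k e^{-psi/2} / k!
density = sum(wk * stats.chi2.pdf(x, q + 2 * kk) for kk, wk in zip(k, w))

assert np.allclose(density, stats.ncx2.pdf(x, q, psi))
print(stats.ncx2.mean(q, psi), q + psi)           # mean     = q + psi
print(stats.ncx2.var(q, psi), 2 * q + 4 * psi)    # variance = 2q + 4psi
```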

Noncentral F
Symbol: NCF(q, a, ψ)
Dominating measure: Lebesgue measure on [0, ∞)
Mean: (1 + ψ/q) a/(a−2), if a > 2
Variance: 2(a/q)² [(q+ψ)² + (q+2ψ)(a−2)]/[(a−2)²(a−4)], if a > 4

Noncentral t
Symbol: NCt_a(δ)
Density: f_X(x) = Σ_{k=0}^∞ [(δx)^k/k!] exp(−δ²/2) [Γ((a+k+1)/2)/(√(aπ) Γ(a/2))] (2/a)^{k/2} (1 + x²/a)^{−(a+k+1)/2}
Dominating measure: Lebesgue measure on ℝ
Mean: δ [Γ((a−1)/2)/Γ(a/2)] √(a/2), if a > 1
Variance: a(δ²+1)/(a−2) − (aδ²/2)[Γ((a−1)/2)/Γ(a/2)]², if a > 2
CDF: NCT_a(·; δ)

Normal
Symbol: N(μ, σ²)
Density: f_X(x) = (√(2π) σ)^{−1} exp(−(x−μ)²/(2σ²))
Dominating measure: Lebesgue measure on (−∞, ∞)

Mean: μ
Variance: σ²
CDF: Φ(·) (For N(0, 1) distribution)

Pareto
Symbol: Par(α, c)
Density: f_X(x) = αc^α/x^{α+1}
Dominating measure: Lebesgue measure on [c, ∞)
Mean: cα/(α−1), if α > 1
Variance: c²α/[(α−2)(α−1)²], if α > 2

t
Symbol: t_a(μ, σ²)
Density: f_X(x) = [Γ((a+1)/2)/(Γ(a/2)√(aπ) σ)] (1 + (x−μ)²/(aσ²))^{−(a+1)/2}
Dominating measure: Lebesgue measure on (−∞, ∞)
Mean: μ, if a > 1
Variance: σ² a/(a−2), if a > 2
CDF: T_a(·) (For t_a(0, 1) distribution)

Uniform
Symbol: U(a, b)
Density: f_X(x) = (b−a)^{−1}
Dominating measure: Lebesgue measure on [a, b]
Mean: (a+b)/2
Variance: (b−a)²/12

D.2 Univariate Discrete Distributions


Bernoulli
Symbol: Ber(p)
Density: f_X(x) = p^x(1−p)^{1−x}
Dominating measure: Counting measure on {0, 1}
Mean: p
Variance: p(1−p)

Binomial
Symbol: Bin(n, p)
Density: f_X(x) = C(n, x) p^x (1−p)^{n−x}
Dominating measure: Counting measure on {0, ..., n}
Mean: np
Variance: np(1−p)

Geometric
Symbol: Geo(p)
Density: f_X(x) = p(1−p)^x
Dominating measure: Counting measure on {0, 1, 2, ...}
Mean: (1−p)/p
Variance: (1−p)/p²

Hypergeometric
Symbol: Hyp(N, n, k)
Density: f_X(x) = C(k, x) C(N−k, n−x) / C(N, n)
Dominating measure: Counting measure on
{max{0, n−N+k}, ..., min{n, k}}
Mean: nk/N
Variance: n (k/N) ((N−k)/N) ((N−n)/(N−1))

Negative binomial
Symbol: Negbin(α, p)
Density: f_X(x) = C(α+x−1, x) p^α (1−p)^x
Dominating measure: Counting measure on {0, 1, 2, ...}
Mean: α(1−p)/p
Variance: α(1−p)/p²

Poisson
Symbol: Poi(λ)
Density: f_X(x) = exp(−λ) λ^x/x!
Dominating measure: Counting measure on {0, 1, 2, ...}
Mean: λ
Variance: λ

D.3 Multivariate Distributions


Dirichlet
Symbol: Dir_k(α_1, ..., α_k)
Density: f_{X_1,...,X_{k−1}}(x_1, ..., x_{k−1}) = [Γ(α_0)/(Γ(α_1)···Γ(α_k))] x_1^{α_1−1} ··· x_{k−1}^{α_{k−1}−1} (1 − x_1 − ··· − x_{k−1})^{α_k−1}, where α_0 = Σ_{i=1}^k α_i
Dominating measure: Lebesgue measure on
{(x_1, ..., x_{k−1}) : all x_i ≥ 0 and x_1 + ··· + x_{k−1} ≤ 1}
Mean: E(X_i) = α_i/α_0
Variance: Var(X_i) = α_i(α_0−α_i)/[α_0²(α_0+1)]
Covariance: Cov(X_i, X_j) = −α_iα_j/[α_0²(α_0+1)], for i ≠ j

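These moments are easy to check by simulation via the standard gamma representation X_i = G_i/(G_1 + ··· + G_k) with independent G_i ~ Γ(α_i, 1); the code below is our sketch (assuming NumPy), not part of the text.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([2.0, 3.0, 5.0])
a0 = alpha.sum()

g = rng.gamma(alpha, size=(200_000, alpha.size))   # G_i ~ Gamma(alpha_i, 1)
x = g / g.sum(axis=1, keepdims=True)               # rows ~ Dir_3(2, 3, 5)

print(x.mean(axis=0), alpha / a0)                  # E(X_i) = alpha_i / alpha_0
print(np.cov(x[:, 0], x[:, 1])[0, 1],
      -alpha[0] * alpha[1] / (a0 ** 2 * (a0 + 1))) # Cov(X_1, X_2), i != j
```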
Multinomial
Symbol: Mult_k(n, p_1, ..., p_k)
Density: f_{X_1,...,X_k}(x_1, ..., x_k) = [n!/(x_1! ··· x_k!)] p_1^{x_1} ··· p_k^{x_k}
Dominating measure: Counting measure on
{(x_1, ..., x_k) : all x_i ∈ {0, ..., n} and x_1 + ··· + x_k = n}
Mean: E(X_i) = np_i
Variance: Var(X_i) = np_i(1−p_i)
Covariance: Cov(X_i, X_j) = −np_ip_j

Multivariate Normal
Symbol: N_p(μ, Σ)
Density: f_X(x) = (2π)^{−p/2}|Σ|^{−1/2} exp(−(1/2)(x−μ)^T Σ^{−1}(x−μ))
Dominating measure: Lebesgue measure on ℝ^p
Mean: E(X_i) = μ_i
Variance: Var(X_i) = Σ_{ii}
Covariance: Cov(X_i, X_j) = Σ_{ij}
References

AHLFORS, L. (1966). Complex Analysis (2nd ed.). New York: McGraw-Hill.


AITCHISON, J. and DUNSMORE, I. R. (1975). Statistical Prediction Analysis.
Cambridge: Cambridge University Press.
ALBERT, J. H. and CHIB, S. (1993). Bayesian analysis of binary and polychotomous
response data. Journal of the American Statistical Association, 88,
669-679.
ALDOUS, D. J. (1981). Representations for partially exchangeable random vari-
ables. Journal of Multivariate Analysis, 11, 581-598.
ALDOUS, D. J. (1985). Exchangeability and related topics. In P. L. HENNEQUIN
(Ed.), École d'Été de Probabilités de Saint-Flour XIII—1983 (pp. 1-198).
Berlin: Springer-Verlag.
ANDERSON, T. W. (1984). An Introduction to Multivariate Statistical Analysis
(2nd ed.). New York: Wiley.
ANSCOMBE, F. J. and AUMANN, R. J. (1963). A definition of subjective proba-
bility. Annals of Mathematical Statistics, 34, 199-205.
ANTONIAK, C. E. (1974). Mixtures of Dirichlet processes with applications to
Bayesian nonparametric problems. Annals of Statistics, 2, 1152-1174.
BAHADUR, R. R. (1957). On unbiased estimates of uniformly minimum variance.
Sankhya, 18, 211-224.
BARNARD, G. A. (1970). Discussion on paper by Dr. Kalbfleisch and Dr. Sprott.
Journal of the Royal Statistical Society (Series B), 32, 194-195.
BARNARD, G. A. (1976). Conditional inference is not inefficient. Scandinavian
Journal of Statistics, 3, 132-134.
BARNDORFF-NIELSEN, O. E. (1988). Parametric Statistical Models and Likeli-
hood. Berlin: Springer-Verlag.
BARNETT, V. (1982). Comparative Statistical Inference (2nd ed.). New York:
Wiley.
BARRON, A. R. (1986). Discussion of "On the consistency of Bayes estimates"
by Diaconis and Freedman. Annals of Statistics, 14, 26-30.
BARRON, A. R. (1988). The exponential convergence of posterior probabilities
with implications for Bayes estimators of density functions. Technical Re-
port 7, Department of Statistics, University of Illinois, Champaign, IL.
BASU, D. (1955). On statistics independent of a complete sufficient statistic.
Sankhya, 15, 377-380.
BASU, D. (1958). On statistics independent of sufficient statistics. Sankhya, 20,
223-226.

BAYES, T. (1764). An essay toward solving a problem in the doctrine of chances.
Philosophical Transactions of the Royal Society of London, 53, 370-418.
BECKER, R. A., CHAMBERS, J. M., and WILKS, A. R. (1988). The New S
Language: A Programming Environment for Data Analysis and Graphics.
Pacific Grove, CA: Wadsworth and Brooks/Cole.
BERBERIAN, S. K. (1961). Introduction to Hilbert Space. New York: Oxford
University Press.
BERGER, J. O. (1985). Statistical Decision Theory and Bayesian Analysis (2nd
ed.). New York: Springer-Verlag.
BERGER, J. O. (1994). An overview of robust Bayesian analysis (with discussion).
Test, 3, 5-124.
BERGER, J. O. and BERRY, D. A. (1988). The relevance of stopping rules in
statistical inference (with discussion). In S. S. GUPTA and J. O. BERGER
(Eds.), Statistical Decision Theory and Related Topics IV (pp. 29-72). New
York: Springer-Verlag.
BERGER, J. O. and SELLKE, T. (1987). Testing a point null hypothesis: The
irreconcilability of P values and evidence (with discussion). Journal of the
American Statistical Association, 82, 112-122.
BERK, R. H. (1966). Limiting behavior of posterior distributions when the model
is incorrect. Annals of Mathematical Statistics, 37, 51-58.
BERKSON, J. (1942). Tests of significance considered as evidence. Journal of the
American Statistical Association, 37,325-335.
BERTI, P., REGAZZINI, E., and RIGo, P. (1991). Coherent statistical inference
and Bayes theorem. Annals of Statistics, 19, 366-381.
BICKEL, P. J. and FREEDMAN, D. A. (1981). Some asymptotic theory for the
bootstrap. Annals of Statistics, 9, 1196-1217.
BILLINGSLEY, P. (1968). Convergence of Probability Measures. New York: Wiley.
BILLINGSLEY, P. (1986). Probability and Measure (2nd ed.). New York: Wiley.
BISHOP, Y. M. M., FIENBERG, S. E., and HOLLAND, P. W. (1975). Discrete
Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.
BLACKWELL, D. (1947). Conditional expectation and unbiased sequential esti-
mation. Annals of Mathematical Statistics, 18, 105-110.
BLACKWELL, D. (1973). Discreteness of Ferguson selections. Annals of Statistics,
1,356-358.
BLACKWELL, D. and DUBINS, L. (1962). Merging of opinions with increasing
information. Annals of Mathematical Statistics, 33, 882-886.
BLACKWELL, D. and RAMAMOORTHI, R. V. (1982). A Bayes but not classically
sufficient statistic. Annals of Statistics, 10, 1025-1026.
BLYTH, C. R. (1951). On minimax statistical decision procedures and their
admissibility. Annals of Mathematical Statistics, 22, 22-42.
BONDAR, J. V. (1988). Discussion of "Conditionally acceptable frequentist so-
lutions" by George Casella. In S. S. GUPTA and J. O. BERGER (Eds.),
Statistical Decision Theory and Related Topics IV (pp. 91-93). New York:
Springer-Verlag.

BORTKIEWICZ, L. V. (1898). Das Gesetz der Kleinen Zahlen. Leipzig: Teubner.
BOX, G. E. P. and COX, D. R. (1964). An analysis of transformations (with
discussion). Journal of the Royal Statistical Society (Series B), 26, 211-246.
BOX, G. E. P. and TIAO, G. C. (1968). A Bayesian approach to some outlier
problems. Biometrika, 55, 119-129.
BREIMAN, L. (1968). Probability. Reading, MA: Addison-Wesley.
BRENNER, D., FRASER, D. A. S., and MCDUNNOUGH, P. (1982). On asymp-
totic normality of likelihood and conditional analysis. Canadian Journal of
Statistics, 10, 163-172.
BROWN, L. D. (1967). The conditional level of Student's t test. Annals of
Mathematical Statistics, 38, 1068-1071.
BROWN, L. D. (1971). Admissible estimators, recurrent diffusions, and insoluble
boundary value problems. Annals of Mathematical Statistics, 42, 855-903.
(See also correction, Annals of Statistics, 1, 594-596.)
BROWN, L. D. and HWANG, J. T. (1982). A unified admissibility proof. In S. S.
GUPTA and J. O. BERGER (Eds.), Statistical Decision Theory and Related
Topics III (pp. 205-230). New York: Academic Press.
BUCK, C. (1965). Real Analysis (2nd ed.). New York: McGraw-Hill.
BUEHLER, R. J. (1959). Some validity criteria for statistical inferences. Annals
of Mathematical Statistics, 30, 845-863.
BUEHLER, R. J. and FEDDERSON, A. P. (1963). Note on a conditional property
of Student's t. Annals of Mathematical Statistics, 34, 1098-1100.
CASELLA, G. and BERGER, R. L. (1987). Reconciling Bayesian and frequentist
evidence in the one-sided testing problem (with discussion). Journal of the
American Statistical Association, 82, 106-111.
CHALONER, K., CHURCH, T., LOUIS, T. A., and MATTS, J. P. (1993). Graphical
elicitation of a prior distribution for a clinical trial. The Statistician, 42,
341-353.
CHANG, T. and VILLEGAS, C. (1986). On a theorem of Stein relating Bayesian
and classical inferences in group models. Canadian Journal of Statistics, 14,
289-296.
CHAPMAN, D. and ROBBINS, H. (1951). Minimum variance estimation without
regularity assumptions. Annals of Mathematical Statistics, 22, 581-586.
CHEN, C.-F. (1985). On asymptotic normality of limiting density functions with
Bayesian implications. Journal of the Royal Statistical Society (Series B),
47, 540-546.
CHOW, Y. S., ROBBINS, H., and SIEGMUND, D. (1971). Great Expectations: The
Theory of Optimal Stopping. New York: Houghton Mifflin.
CHURCHILL, R. V. (1960). Complex Variables and Applications (2nd ed.). New
York: McGraw-Hill.
CLARKE, B. S. and BARRON, A. R. (1994). Jeffreys' prior is asymptotically least
favorable under entropy risk. Journal of Statistical Planning and Inference,
41, 37-60.

CORNFIELD, J. (1966). A Bayesian test of some classical hypotheses—with ap-
plications to sequential clinical trials. Journal of the American Statistical
Association, 61, 577-594.
COX, D. R. (1958). Some problems connected with statistical inference. Annals
of Mathematical Statistics, 29, 357-372.
Cox, D. R. (1977). The role of significance tests. Scandinavian Journal of
Statistics, 4, 49-70.
Cox, D. R. and HINKLEY, D. V. (1974). Theoretical Statistics. London: Chap-
man and Hall.
CRAMER, H. (1945). Mathematical Methods of Statistics. Princeton: Princeton
University Press.
CRAMER, H. (1946). Contributions to the theory of statistical estimation. Skan-
dinavisk Aktuarietidskrift, 29, 85-94.
DAVID, H. A. (1970). Order Statistics. New York: Wiley.
DAWID, A. P. (1970). On the limiting normality of posterior distributions. Pro-
ceedings of the Cambridge Philosophical Society, 67, 625-633.
DAWID, A. P. (1982). Intersubjective statistical models. In G. KOCH and
F. SPIZZICHINO (Eds.), Exchangeability in Probability and Statistics (pp.
217-232). Amsterdam: North-Holland.
DAWID, A. P. (1984). Statistical theory: The prequential approach. Journal of
the Royal Statistical Society (Series A), 147, 278-292.
DAWID, A. P., STONE, M., and ZIDEK, J. V. (1973). Marginalization paradoxes
in Bayesian and structural inference. Journal of the Royal Statistical Society
(Series B), 35, 189-233.
DEFINETTI, B. (1937). Foresight: Its logical laws, its subjective sources. In H. E.
KYBURG and H. E. SMOKLER (Eds.), Studies in Subjective Probability (pp.
53-118). New York: Wiley.
DEFINETTI, B. (1974). Theory of Probability, Vols. I and II. New York: Wiley.
DEGROOT, M. H. (1970). Optimal Statistical Decisions. New York: Wiley.
DEMoIVRE, A. (1756). The Doctrine of Chance (3rd ed.). London: A. Millar.
DIACONIS, P. and FREEDMAN, D. A. (1980a). Finite exchangeable sequences.
Annals of Probability, 8, 745-764.
DIACONIS, P. and FREEDMAN, D. A. (1980b). DeFinetti's generalizations of
exchangeability. In R. C. JEFFREY (Ed.), Studies in Inductive Logic and
Probability, II (pp. 233-249). Berkeley: University of California.
DIACONIS, P. and FREEDMAN, D. A. (1980c). DeFinetti's theorem for Markov
chains. Annals of Probability, 8, 115-130.
DIACONIS, P. and FREEDMAN, D. A. (1984). Partial exchangeability and suf-
ficiency. In J. K. GHOSH and J. Roy (Eds.), Statistics: Applications and
New Directions (pp. 205-236). Calcutta: Indian Statistical Institute.
DIACONIS, P. and FREEDMAN, D. A. (1986a). On the consistency of Bayes
estimates (with discussion). Annals of Statistics, 14, 1-26.

DIACONIS, P. and FREEDMAN, D. A. (1986b). On inconsistent Bayes estimates
of location. Annals of Statistics, 14, 68-87.
DIACONIS, P. and FREEDMAN, D. A. (1990). Cauchy's equation and DeFinetti's
theorem. Scandinavian Journal of Statistics, 17, 235-250.
DIACONIS, P. and YLVISAKER, D. (1979). Conjugate priors for exponential fam-
ilies. Annals of Statistics, 7, 269-281.
DICKEY, J. M. (1980). Beliefs about beliefs, a theory of stochastic assessments
of subjective probabilities. In J. M. BERNARDO, M. H. DEGROOT, D. V.
LINDLEY, and A. F. M. SMITH (Eds.), Bayesian Statistics (pp. 471-487).
Valencia, Spain: University Press.
DOOB, J. L. (1949). Application of the theory of martingales. In Le Calcul des
Probabilités et ses Applications (pp. 23-27). Paris: Colloques Internationaux
du Centre National de la Recherche Scientifique.
DOOB, J. L. (1953). Stochastic Processes. New York: Wiley.
DUBINS, L. E. and FREEDMAN, D. A. (1963). Random distribution functions.
Bulletin of the American Mathematical Society, 69, 548-551.
DUGUNDJI, J. (1966). Topology. Boston: Allyn and Bacon.
DUNFORD, N. and SCHWARTZ, J. T. (1957). Linear Operators, Part I: General
Theory. New York: Interscience.
DUNFORD, N. and SCHWARTZ, J. T. (1963). Linear Operators, Part II: Spectral
Theory. New York: Interscience.
EBERHARDT, K. R., MEE, R. W., and REEVE, C. P. (1989). Computing factors
for exact two-sided tolerance limits for a normal distribution. Communica-
tions in Statistics—Simulation and Computation, 18, 397-413.
EDWARDS, W., LINDMAN, H., and SAVAGE, L. J. (1963). Bayesian statistical
inference for psychological research. Psychological Review, 70, 193-242.
EFRON, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of
Statistics, 7, 1-26.
EFRON, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans.
Philadelphia: Society for Industrial and Applied Mathematics.
EFRON, B. and HINKLEY, D. V. (1978). Assessing the accuracy of the max-
imum likelihood estimator: Observed versus expected Fisher information.
Biometrika, 65, 457-487.
EFRON, B. and MORRIS, C. N. (1975). Data analysis using Stein's estimator
and its generalizations. Journal of the American Statistical Association, 70,
311-319.
EFRON, B. and TIBSHIRANI, R. J. (1993). An Introduction to the Bootstrap.
London: Chapman and Hall.
ESCOBAR, M. D. (1988). Estimating the Means of Several Normal Populations
by Nonparametric Estimation of the Distribution of the Means. Ph.D. thesis,
Yale University.
FABIUS, J. (1964). Asymptotic behavior of Bayes' estimates. Annals of Mathe-
matical Statistics, 35, 846-856.

FERGUSON, T. S. (1967). Mathematical Statistics: A Decision Theoretic Ap-
proach. New York: Academic Press.
FERGUSON, T. S. (1973). A Bayesian analysis of some nonparametric problems.
Annals of Statistics, 1, 209-230.
FERGUSON, T. S. (1974). Prior distributions on spaces of probability measures.
Annals of Statistics, 2, 615-629.
FERRANDIZ, J. R. (1985). Bayesian inference on Mahalanobis distance: An al-
ternative to Bayesian model testing. In J. M. BERNARDO, M. H. DEG-
ROOT, D. V. LINDLEY, and A. F. M. SMITH (Eds.), Bayesian Statistics
2: Proceedings of the Second Valencia International Meeting (pp. 645-653).
Amsterdam: North Holland.
FIELLER, E. C. (1954). Some problems in interval estimation. Journal of the
Royal Statistical Society (Series B), 16, 175-185.
FISHBURN, P. C. (1970). Utility Theory for Decision Making. New York: Wiley.
FISHER, R. A. (1922). On the mathematical foundations of theoretical statistics.
Philosophical Transactions of the Royal Society of London, Series A, 222A,
309-368.
FISHER, R. A. (1924). The conditions under which X2 measures the discrepancy
between observation and hypothesis. Journal of the Royal Statistical Society,
87, 442-450.
FISHER, R. A. (1925). Theory of statistical estimation. Proceedings of the Cam-
bridge Philosophical Society, 22, 700-725.
FISHER, R. A. (1934). Two new properties of mathematical likelihood. Proceed-
ings of the Royal Society of London, A, 144, 285-307.
FISHER, R. A. (1935). The fiducial argument in statistical inference. Annals of
Eugenics, 6, 391-398.
FISHER, R. A. (1936). Has Mendel's work been rediscovered? Annals of Science,
1, 115-137.
FISHER, R. A. (1943). Note on Dr. Berkson's criticism of tests of significance.
Journal of the American Statistical Association, 38, 103-104.
FISHER, R. A. (1966). The Design of Experiments (8th ed.). New York: Hafner.
FRASER, D. A. S. and McDuNNOUGH, P. (1984). Further remarks on asymp-
totic normality of likelihood and conditional analyses. Canadian Journal of
Statistics, 12, 183-190.
FREEDMAN, D. A. (1963). On the asymptotic behavior of Bayes' estimates in
the discrete case. Annals of Mathematical Statistics, 34, 1386-1403.
FREEDMAN, D. A. (1977). A remark on the difference between sampling with
and without replacement. Journal of the American Statistical Association,
72,681.
FREEDMAN, D. A. and DIACONIS, P. (1982). On inconsistent M-estimators.
Annals of Statistics, 10, 454-461.
FREEDMAN, L. S. and SPIEGELHALTER, D. J. (1983). The assessment of subjec-
tive opinion and its use in relation to stopping rules of clinical trials. The
Statistician, 32, 153-160.

FREEMAN, P. R. (1980). On the number of outliers in data from a linear model. In
J. M. BERNARDO, M. H. DEGROOT, D. V. LINDLEY, and A. F. M. SMITH
(Eds.), Bayesian Statistics (pp. 349-365). Valencia, Spain: University Press.
GABRIEL, K. R. (1969). Simultaneous test procedures—some theory of multiple
comparisons. Annals of Mathematical Statistics, 40, 224-250.
GARTHWAITE, P. and DICKEY, J. (1988). Quantifying expert opinion in linear
regression problems. Journal of the Royal Statistical Society (Series B), 50,
462-474.
GARTHWAITE, P. H. and DICKEY, J. M. (1992). Elicitation of prior distributions
for variable-selection problems in regression. Annals of Statistics, 20, 1697-
1719.
GAVASAKAR, U. K. (1984). A Study of Elicitation Procedures by Modelling the
Errors in Responses. Ph.D. thesis, Carnegie Mellon University.
GEISSER, S. (1967). Estimation associated with linear discriminants. Annals of
Mathematical Statistics, 38, 807-817.
GEISSER, S. and EDDY, W. F. (1979). A predictive approach to model selection.
Journal of the American Statistical Association, 74, 153-160.
GELFAND, A. E. and SMITH, A. F. M. (1990). Sampling-based approaches to cal-
culating marginal densities. Journal of the American Statistical Association,
85, 398-409.
GEMAN, S. and GEMAN, D. (1984). Stochastic relaxation, Gibbs distributions
and the Bayesian restoration of images. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 6, 721-741.
GNANADESIKAN, R. (1977). Methods for Statistical Data Analysis of Multivariate
Observations. New York: Wiley.
GOOD, I. J. (1956). Discussion of "Chance and control: Some implications of
randomization" by G. Spencer Brown. In C. CHERRY (Ed.), Information
Theory: Third London Symposium (pp. 13-14). London: Butterworths.
HALL, P. (1992). The Bootstrap and Edgeworth Expansion. New York: Springer-
Verlag.
HALL, W. J., WIJSMAN, R. A., and GHOSH, J. K. (1965). The relationship
between sufficiency and invariance with applications in sequential analysis.
Annals of Mathematical Statistics, 36, 575-614.
HALMOS, P. R. (1950). Measure Theory. New York: Van Nostrand.
HALMOS, P. R. and SAVAGE, L. J. (1949). Application of the Radon-Nikodym
theorem to the theory of sufficient statistics. Annals of Mathematical Statis-
tics, 20, 225-241.
HAMPEL, F. R., RONCHETTI, E. M., ROUSSEEUW, P. J., and STAHEL, W. A.
(1986). Robust Statistics: The Approach Based on Influence Functions. New
York: Wiley.
HARTIGAN, J. (1983). Bayes Theory. New York: Springer-Verlag.
HEATH, D. and SUDDERTH, W. D. (1976). DeFinetti's theorem on exchangeable
variables. American Statistician, 30, 188-189.

HEATH, D. and SUDDERTH, W. D. (1989). Coherent inference from improper
priors and from finitely additive priors. Annals of Statistics, 17, 907-919.
HEWITT, E. and SAVAGE, L. J. (1955). Symmetric measures on cartesian prod-
ucts. Transactions of the American Mathematical Society, 80, 470-501.
HEYDE, C. C. and JOHNSTONE, I. M. (1979). On asymptotic posterior normality
for stochastic processes. Journal of the Royal Statistical Society (Series B),
41, 184-189.
HILL, B. M. (1965). Inference about variance components in the one-way model.
Journal of the American Statistical Association, 60, 806-825.
HILL, B. M., LANE, D., and SUDDERTH, W. D. (1987). Exchangeable urn pro-
cesses. Annals of Probability, 15, 1586-1592.
HOEL, P. G., PORT, S. C., and STONE, C. J. (1971). Introduction to Probability
Theory. Boston: Houghton Mifflin.
HOGARTH, R. M. (1975). Cognitive processes and the assessment of subjective
probability distributions (with discussion). Journal of the American Statis-
tical Association, 70, 271-294.
HUBER, P. J. (1964). Robust estimation of a location parameter. Annals of
Mathematical Statistics, 35, 73-101.
HUBER, P. J. (1967). The behaviour of maximum likelihood estimates under
nonstandard conditions. In L. M. LECAM and J. NEYMAN (Eds.), Pro-
ceedings of the Fifth Berkeley Symposium on Mathematical Statistics and
Probability, volume 1 (pp. 221-233). Berkeley: University of California.
HUBER, P. J. (1977). Robust Statistical Procedures. Philadelphia: Society for
Industrial and Applied Mathematics.
HUBER, P. J. (1981). Robust Statistics. New York: Wiley.
JAMES, W. and STEIN, C. M. (1960). Estimation with quadratic loss. In J. NEY-
MAN (Ed.), Proceedings of the Fourth Berkeley Symposium on Mathematical
Statistics and Probability, volume 1 (pp. 361-379). Berkeley: University of
California.
JAYNES, E. T. (1976). Confidence intervals vs. Bayesian intervals (with dis-
cussion). In W. L. HARPER and C. A. HOOKER (Eds.), Foundations of
Probability Theory, Statistical Inference, and Statistical Theories of Science
(pp. 175-257). Dordrecht: D. Reidel.
JEFFREYS, H. (1961). Theory of Probability (3rd ed.). Oxford: Oxford University
Press.
JOHNSTONE, I. M. (1978). Problems in limit theory for martingales and posterior
distributions from stochastic processes. Master's thesis, Australian National
University.
KADANE, J. B., DICKEY, J. M., WINKLER, R. L., SMITH, W., and PETERS,
S. C. (1980). Interactive elicitation of opinion for a normal linear model.
Journal of the American Statistical Association, 75, 845-854.
KADANE, J. B., SCHERVISH, M. J., and SEIDENFELD, T. (1985). Statistical
implications of finitely additive probability. In P. GOEL and A. ZELLNER

(Eds.), Bayesian Inference and Decision Techniques with Applications: Es-
says in Honor of Bruno DeFinetti (pp. 59-76). Amsterdam: Elsevier Science
Publishers.
KADANE, J. B., SCHERVISH, M. J., and SEIDENFELD, T. (1996). Reasoning to a
foregone conclusion. Journal of the American Statistical Association, 91, to
appear.
KAGAN, A. M., LINNIK, Y. V., and RAO, C. R. (1965). On a characterization of
the normal law based on a property of the sample average. Sankhya, Series
A, 32, 37-40.
KAHNEMAN, D., SLOVIC, P., and TVERSKY, A. (Eds.) (1982). Judgment Un-
der Uncertainty: Heuristics and Biases. Cambridge: Cambridge University
Press.
KASS, R. E. and RAFTERY, A. E. (1995). Bayes factors. Journal of the American
Statistical Association, 90, 773-795.
KASS, R. E. and STEFFEY, D. (1989). Approximate Bayesian inference in condi-
tionally independent hierarchical models (parametric empirical Bayes mod-
els). Journal of the American Statistical Association, 84, 717-726.
KASS, R. E., TIERNEY, L., and KADANE, J. B. (1988). Asymptotics in Bayesian
computation. In J. M. BERNARDO, M. H. DEGROOT, D. V. LINDLEY,
and A. F. M. SMITH (Eds.), Bayesian Statistics 3 (pp. 261-278). Oxford:
Clarendon Press.
KASS, R. E., TIERNEY, L., and KADANE, J. B. (1990). The validity of posterior
expansions based on Laplace's method. In S. GEISSER, J. S. HODGES, S. J.
PRESS, and A. ZELLNER (Eds.), Bayesian and Likelihood Methods in Statis-
tics and Econometrics (pp. 473-488). Amsterdam: Elsevier (North Holland).
KIEFER, J. and WOLFOWITZ, J. (1956). Consistency of the maximum likelihood
estimator in the presence of infinitely many incidental parameters. Annals
of Mathematical Statistics, 27, 887-906.
KERRIDGE, D. (1963). Bounds for the frequency of misleading Bayes inferences.
Annals of Mathematical Statistics, 34, 1109-1110.
KINDERMAN, A. J. and MONAHAN, J. F. (1977). Computer generation of ran-
dom variables using the ratio of uniform deviates. ACM Transactions on
Mathematical Software, 3, 257-260.
KINGMAN, J. F. C. (1978). Uses of exchangeability. Annals of Probability, 6,
183-197.
KNUTH, D. E. (1984). The TeXbook. Reading, MA: Addison-Wesley.
KRAFT, C. H. (1964). A class of distribution function processes which have
derivatives. Journal of Applied Probability, 1, 385-388.
KRASKER, W. and PRATT, J. W. (1986). Discussion of "On the consistency
of Bayes estimates" by Diaconis and Freedman. Annals of Statistics, 14,
55-58.
KREM, A. (1963). On the independence in the limit of extreme and central
order statistics. Publications of the Mathematical Institute of the Hungarian
Academy of Science, 8, 469-474.

KSHIRSAGAR, A. M. (1972). Multivariate Analysis. New York: Marcel Dekker.


KULLBACK, S. (1959). Information Theory and Statistics. New York: Wiley.
LAMPORT, L. (1986). LaTeX: A Document Preparation System. Reading, MA:
Addison-Wesley.
LAURITZEN, S. L. (1984). Extreme point models in statistics (with discussion).
Scandinavian Journal of Statistics, 11, 65-91.
LAURITZEN, S. L. (1988). Extremal Families and Systems of Sufficient Statistics.
Berlin: Springer-Verlag.
LAVINE, M. (1992). Some aspects of Polya tree distributions for statistical mod-
elling. Annals of Statistics, 20, 1222-1235.
LAVINE, M., WASSERMAN, L., and WOLPERT, R. L. (1991). Bayesian inference
with specified prior marginals. Journal of the American Statistical Associa-
tion, 86, 964-971.
LAVINE, M., WASSERMAN, L., and WOLPERT, R. L. (1993). Linearization of
Bayesian robustness problems. Journal of Statistical Planning and Inference,
37,307-316.
LECAM, L. M. (1953). On some asymptotic properties of maximum likelihood
estimates and related Bayes estimates. University of California Publications
in Statistics, 1, 277-330.
LECAM, L. M. (1970). On the assumptions used to prove asymptotic normality
of maximum likelihood estimates. Annals of Mathematical Statistics, 41,
802-828.
LECOUTRE, B. (1985). Reconsideration of the F-test of the analysis of variance:
The semi-Bayesian significance test. Communications in Statistics—Theory
and Methods, 14, 2437-2446.
LECOUTRE, B. and ROUANET, H. (1981). Deux structures statistiques fonda-
mentales en analyse de la variance univariée et multivariée. Mathématiques
et Sciences Humaines, 75, 71-82.
LEHMANN, E. L. (1958). Significance level and power. Annals of Mathematical
Statistics, 29, 1167-1176.
LEHMANN, E. L. (1983). Theory of Point Estimation. New York: Wiley.
LEHMANN, E. L. (1986). Testing Statistical Hypotheses (2nd ed.). New York:
Wiley.
LEHMANN, E. L. and SCHEFFE, H. (1955). Completeness, similar regions and
unbiased estimates. Sankhya, 10, 305-340. (Also 15, 219-236, and correction
17, 250.)
LINDLEY, D. V. (1957). A statistical paradox. Biometrika, 44, 187-192.
LINDLEY, D. V. and NOVICK, M. R. (1981). The role of exchangeability in
inference. Annals of Statistics, 9, 45-58.
LINDLEY, D. V. and PHILLIPS, L. D. (1976). Inference for a Bernoulli process
(a Bayesian view). American Statistician, 30, 112-119.
LINDLEY, D. V. and SMITH, A. F. M. (1972). Bayes estimates for the linear
model. Journal of the Royal Statistical Society (Series B), 34, 1-41.

LOEVE, M. (1977). Probability Theory I (4th ed.). New York: Springer-Verlag.


MAULDIN, R. D., SUDDERTH, W. D., and WILLIAMS, S. C. (1992). Polya trees
and random distributions. Annals of Statistics, 20, 1203-1221.
MAULDIN, R. D. and WILLIAMS, S. C. (1990). Reinforced random walks and
random distributions. Proceedings of the American Mathematical Society,
110, 251-258.
MENDEL, G. (1866). Versuche über Pflanzenhybriden. Verhandlungen Natur-
forschender Vereines in Brünn, 10, 1.
METIVIER, M. (1971). Sur la construction de mesures aléatoires presque sûrement
absolument continues par rapport à une mesure donnée. Zeitschrift für
Wahrscheinlichkeitstheorie, 20, 332-344.
MORRIS, C. N. (1983). Parametric empirical Bayes inference: Theory and appli-
cations (with discussion). Journal of the American Statistical Association,
78,47-65.
NACHBIN, L. (1965). The Haar Integral. Princeton: Van Nostrand.
NEYMAN, J. (1935). Su un teorema concernente le cosiddette statistiche suffici-
enti. Giornale dell'Istituto Italiano degli Attuari, 6, 320-334.
NEYMAN, J. and PEARSON, E. S. (1933). On the problem of the most efficient test
of statistical hypotheses. Philosophical Transactions of the Royal Society of
London, Series A, 231, 289-337.
NEYMAN, J. and SCOTT, E. L. (1948). Consistent estimates based on partially
consistent observations. Econometrica, 16, 1-32.
PEARSON, K. (1900). On the criterion that a given system of deviations from the
probable in the case of a correlated system of variables is such that it can
be reasonably supposed to have arisen from random sampling. Philosoph-
ical Magazine (5th Series), 50, 339-357. (See also correction, Philosophical
Magazine (6th Series), 1, 670-671.)
PERLMAN, M. (1972). On the strong consistency of approximate maximum like-
lihood estimators. In L. M. LECAM, J. NEYMAN, and E. L. SCOTT (Eds.),
Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and
Probability, volume 1 (pp. 263-281). Berkeley: University of California.
PIERCE, D. A. (1973). On some difficulties in a frequency theory of inference.
Annals of Statistics, 1, 241-250.
PITMAN, E. (1939). The estimation of location and scale parameters of a contin-
uous population of any given form. Biometrika, 30, 391-421.
PRATT, J. W. (1961). Review of "Testing Statistical Hypotheses" by E. L.
Lehmann. Journal of the American Statistical Association, 56, 163-167.
PRATT, J. W. (1962). Discussion of "On the foundations of statistical inference"
by Allan Birnbaum. Journal of the American Statistical Association, 57,
314-316.
RAO, C. R. (1945). Information and the accuracy a.ttainable in the estimation
of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37,
81-91.

RAO, C. R. (1973). Linear Statistical Inference and Its Applications (2nd ed.).
New York: Wiley.
ROBBINS, H. (1951). Asymptotically subminimax solutions of compound sta-
tistical decision problems. In J. NEYMAN (Ed.), Proceedings of the Second
Berkeley Symposium on Mathematical Statistics and Probability (pp. 131-
148). Berkeley: University of California.
ROBBINS, H. (1955). An empirical Bayes approach to statistics. In J. NEY-
MAN (Ed.), Proceedings of the Third Berkeley Symposium on Mathematical
Statistics and Probability, volume 1 (pp. 157-164). Berkeley: University of
California.
ROBBINS, H. (1964). The empirical Bayes approach to statistical decision prob-
lems. Annals of Mathematical Statistics, 35, 1-20.
ROBERT, C. P. (1993). A note on Jeffreys-Lindley paradox. Statistica Sinica, 3,
601-608.
ROBERTS, H. V. (1967). Informative stopping rules and inferences about popu-
lation size. Journal of the American Statistical Association, 62, 763-775.
ROUANET, H. and LECOUTRE, B. (1983). Specific inference in ANOVA: From
significance tests to Bayesian procedures. British Journal of Mathematical
and Statistical Psychology, 36, 252-268.
ROYDEN, H. L. (1968). Real Analysis. London: Macmillan.
RUBIN, D. B. (1981). The Bayesian bootstrap. Annals of Statistics, 9, 130-134.
RUDIN, W. (1964). Principles of Mathematical Analysis (2nd ed.). New York:
McGraw-Hill.
SAVAGE, L. J. (1954). The Foundations of Statistics. New York: Wiley.
SAVAGE, L. J. (1962). The Foundations of Statistical Inference. London:
Methuen.
SCHEFFE, H. (1947). A useful convergence theorem for probability distributions.
Annals of Mathematical Statistics, 18, 434-438.
SCHERVISH, M. J. (1983). User-oriented inference. Journal of the American
Statistical Association, 78, 611-615.
SCHERVISH, M. J. (1992). Bayesian analysis of linear models (with discussion).
In J. M. BERNARDO, J. O. BERGER, A. P. DAWID, and A. F. M. SMITH
(Eds.), Bayesian Statistics 4: Proceedings of the Second Valencia Interna-
tional Meeting (pp. 419--434). Oxford: Clarendon Press.
SCHERVISH, M. J. (1994). Discussion of "Bootstrap: More than a stab in the
dark?" by G. A. Young. Statistical Science, 9, 408-410.
SCHERVISH, M. J. (1996). P-values: What they are and what they are not.
American Statistician, 50, to appear.
SCHERVISH, M. J. and CARLIN, B. P. (1992). On the convergence of successive
substitution sampling. Journal of Computational and Graphical Statistics,
1, 111-127.
SCHERVISH, M. J. and SEIDENFELD, T. (1990). An approach to consensus and
certainty with increasing evidence. Journal of Statistical Planning and In-
ference, 25,401-414.

SCHERVISH, M. J., SEIDENFELD, T., and KADANE, J. B. (1984). The extent
of non-conglomerability of finitely additive probabilities. Zeitschrift für
Wahrscheinlichkeitstheorie, 66, 205-226.
SCHERVISH, M. J., SEIDENFELD, T., and KADANE, J. B. (1990). State dependent
utilities. Journal of the American Statistical Association, 85, 840-847.
SCHWARTZ, L. (1965). On Bayes procedures. Zeitschrift für Wahrscheinlichkeit-
stheorie, 4, 10-26.
SEIDENFELD, T. and SCHERVISH, M. J. (1983). A conflict between finite addi-
tivity and avoiding Dutch Book. Philosophy of Science, 50, 398-412.
SEIDENFELD, T., SCHERVISH, M. J., and KADANE, J. B. (1995). A representation
of partially ordered preferences. Annals of Statistics, 23, 2168-2217.
SERFLING, R. J. (1980). Approximation Theorems of Mathematical Statistics.
New York: Wiley.
SETHURAMAN, J. (1994). A constructive definition of Dirichlet priors. Statistica
Sinica, 4, 639-650.
SINGH, K. (1981). On the asymptotic accuracy of Efron's bootstrap. Annals of
Statistics, 9, 1187-1195.
SMITH, A. F. M. (1973). A general Bayesian linear model. Journal of the Royal
Statistical Society, Ser. B, 35, 67-75.
SPJØTVOLL, E. (1983). Preference functions. In P. J. BICKEL, K. DOKSUM,
and J. L. HODGES, JR. (Eds.), A Festschrift for Erich L. Lehmann (pp.
409-432). Belmont, CA: Wadsworth.
STATSCI (1992). S-PLUS, Version 3.1 (software package). Seattle: StatSci Divi-
sion, MathSoft, Inc.
STEIN, C. M. (1946). A note on cumulative sums. Annals of Mathematical
Statistics, 17, 498-499.
STEIN, C. M. (1956). Inadmissibility of the usual estimator for the mean of
a multivariate normal distribution. In J. NEYMAN (Ed.), Proceedings of
the Third Berkeley Symposium on Mathematical Statistics and Probability,
volume 1 (pp. 197-206). Berkeley: University of California.
STEIN, C. M. (1965). Approximation of improper prior measures by prior proba-
bility measures. In J. NEYMAN and L. M. LECAM (Eds.), Bernoulli, Bayes,
Laplace: Anniversary Volume (pp. 217-240). New York: Springer-Verlag.
STEIN, C. M. (1981). Estimation of the mean of a multivariate normal distribu-
tion. Annals of Statistics, 9, 1135-1151.
STIGLER, S. M. (1986). The History of Statistics: The Measurement of Uncer-
tainty before 1900. Cambridge, MA: Belknap.
STONE, M. (1976). Strong inconsistency from uniform priors. Journal of the
American Statistical Association, 71, 114-125.
STONE, M. and DAWID, A. P. (1972). Un-Bayesian implications of improper
Bayes inference in routine statistical problems. Biometrika, 59, 369-375.

STRASSER, H. (1981). Consistency of maximum likelihood and Bayes estimates.
Annals of Statistics, 9, 1107-1113.
STRAWDERMAN, W. E. (1971). Proper Bayes minimax estimators of the multi-
variate normal mean. Annals of Mathematical Statistics, 42, 385-388.
TAYLOR, R. L., DAFFER, P. Z., and PATTERSON, R. F. (1985). Limit Theorems
for Sums of Exchangeable Random Variables. Totowa, NJ: Rowman and
Allanheld.
TIERNEY, L. (1994). Markov chains for exploring posterior distributions (with
discussion). Annals of Statistics, 22,1701-1762.
TIERNEY, L. and KADANE, J. B. (1986). Accurate approximations for poste-
rior moments and marginal densities. Journal of the American Statistical
Association, 81, 82-86.
VENN, J. (1876). The Logic of Chance (2nd ed.). London: Macmillan.
VERDINELLI, I. and WASSERMAN, L. (1991). Bayesian analysis of outlier problems
using the Gibbs sampler. Statistics and Computing, 1, 105-117.
VON MISES, R. (1957). Probability, Statistics and Truth. London: Allen and
Unwin.
VON NEUMANN, J. and MORGENSTERN, O. (1947). Theory of Games and Eco-
nomic Behavior (2nd ed.). Princeton: Princeton University Press.
WALD, A. (1947). Sequential Analysis. New York: Wiley.
WALD, A. (1949). Note on the consistency of the maximum likelihood estimate.
Annals of Mathematical Statistics, 20, 595-601.
WALD, A. and WOLFOWITZ, J. (1948). Optimum character of the sequential
probability ratio test. Annals of Mathematical Statistics, 19, 326-339.
WALKER, A. M. (1969). On the asymptotic behaviour of posterior distributions.
Journal of the Royal Statistical Society (Series B), 31, 80-88.
WALLACE, D. L. (1959). Conditional confidence level properties. Annals of
Mathematical Statistics, 30,864-876.
WELCH, B. L. (1939). On confidence limits and sufficiency, with particular ref-
erence to parameters of location. Annals of Mathematical Statistics, 10,
58-69.
WEST, M. (1984). Outlier models and prior distributions in Bayesian linear
regression. Journal of the Royal Statistical Society (Series B), 46,431-439.
WILKS, S. S. (1941). Determination of sample sizes for setting tolerance limits.
Annals of Mathematical Statistics, 12, 91-96.
YOUNG, G. A. (1994). Bootstrap: More than a stab in the dark? (with discus-
sion). Statistical Science, 9, 382-415.
ZELLNER, A. (1971). An Introduction to Bayesian Inference in Econometrics.
New York: Wiley.
Notation and Abbreviation Index

0 (vector of 0s), 385
1 (vector of 1s), 345
Δ (symmetric difference), 581
2^S (power set), 571
≪ (absolutely continuous), 574, 597
a.e. (almost everywhere), 582
ANCB(·,·,·) (distribution), 668
ANCχ²(·,·,·) (distribution), 668
ANCF(·,·,·) (distribution), 669
ANOVA (analysis of variance), 384
ARE (asymptotic relative efficiency), 413
a.s. (almost surely), 582
A_X (σ-field generated by X), 51, 82
ℵ (action space), 144
𝒜 (action space σ-field), 144
\ (remove one set from another), 577
B̄ (closure of set B), 622
Ber(·) (distribution), 672
Beta(·,·) (distribution), 669
Bin(·,·) (distribution), 673
B^k (Borel σ-field), 576
B (Borel σ-field), 575
Cau(·,·) (distribution), 669
CDF (cumulative distribution function), 612
c_g (constant related to RHM), 367
C_g (constant related to LHM), 367
χ²_a (distribution), 669
→^D (converges in distribution), 635
→^P (converges in probability), 638
→^w (converges weakly), 635
Cov_θ(·,·) (conditional covariance given Θ = θ), 19
Cov (covariance), 613
C_P (σ-field on set of probability measures), 27
A^C (complement of set A), 575
Dir(·) (Dirichlet process), 54
Dir_k(·,...,·) (distribution), 674
dμ_2/dμ_1 (Radon-Nikodym derivative), 575, 598
E_θ(·) (conditional mean given Θ = θ), 19
Exp(·) (distribution), 670
E(·) (expected value), 607, 613
E(·|·) (conditional mean), 616
f^+ (positive part), 588
f^- (negative part), 588
f_{X|Θ} (conditional density of X given Θ), 13
f_{X|Y} (conditional density), 13
F_{q,a} (distribution), 670
Γ^{-1}(·,·) (distribution), 670
Γ(·,·) (distribution), 670
Geo(·) (distribution), 673
HPD (highest posterior density), 327
Hyp(·,·,·) (distribution), 673
IID (independent and identically distributed), 611
I_k (identity matrix), 643
I_{X|T}(·;·|·) (conditional Kullback-Leibler information), 115
I_{X|T}(·|·) (conditional Fisher information), 111
I_X(·) (Fisher information), 111
I_X(·;·) (Kullback-Leibler information), 115
I_A(·) (indicator function), 9
Lap(·,·) (distribution), 670
λ_g (measure constructed from LHM), 367
LHM (left Haar measure), 363
LMP (locally most powerful), 245
LMVUE (locally minimum variance unbiased estimator), 300

LR (likelihood ratio), 274
MC (most cautious), 230
MLE (maximum likelihood estimator), 307
MLR (monotone likelihood ratio), 239
MP (most powerful), 230
MRE (minimum risk equivariant), 347
Mult_k(·,...,·) (distribution), 674
μ_{Θ|X}(·|·) (posterior distribution), 16
NCB(·,·,·) (distribution), 671
NCχ²(·) (distribution), 671
NCF(·,·,·) (distribution), 671
NCt_a(·) (distribution), 671
NCT_a(·;·) (CDF of NCt distribution), 671
Negbin(·,·) (distribution), 673
ν (dominating measure), 13
N(·,·) (distribution), 671
N_p(·,·) (distribution), 674
N̄ (integers plus ∞), 537
Ω (parameter space), 13, 82
o_P (stochastic small order), 396
O_P (stochastic large order), 396
o (small order), 394
O (large order), 394
P_Θ (parametric family), 50, 82
Par(·,·) (distribution), 672
∂B (boundary of set B), 636
Φ(·) (CDF of normal distribution), 672
P_n (empirical probability measure), 12
Poi(·) (distribution), 673
Pr(·) (probability), 612
Pr(·|·) (conditional probability), 617
P_{θ,T}(·) (conditional distribution of T given Θ = θ), 84
P′_θ(·) (conditional probability given Θ = θ), 51, 83
P_θ(·) (conditional distribution given Θ = θ), 51, 83
P (random probability measure), 25
P (set of all probability measures), 27
Q_n(·|x) (conditional distribution given n observations), 539
ℝ (real numbers), 570
ℝ^+ (positive reals), 571
ℝ^{+0} (nonnegative reals), 627
ρ_g (measure constructed from RHM), 367
RHM (right Haar measure), 363
r(η, δ) (Bayes risk), 149
R(θ, δ) (risk function), 149
(S, A, μ) (measure space), 577
SPRT (sequential probability ratio test), 549
SSS (successive substitution sampling), 507
τ (parameter space σ-field), 13
Θ′ (parametric index), 50
Θ (parameter), 51
T (statistic), 84
T_a(·) (CDF of t distribution), 672
t_a(·,·) (distribution), 672
v^T (transpose of vector v), 614
UMA (uniformly most accurate), 317
UMAU (uniformly most accurate unbiased), 321
UMC (uniformly most cautious), 230
UMCU (uniformly most cautious unbiased), 254
UMP (uniformly most powerful), 230
UMPU (uniformly most powerful unbiased), 254
UMPUAI (uniformly most powerful unbiased almost invariant), 384
UMVUE (uniformly minimum variance unbiased estimator), 297
USC (upper semicontinuous), 417
U(·,·) (distribution), 672
Var_θ(·,·) (conditional variance given Θ = θ), 19
Var (variance), 613
x (element of sample space), 82
X (sample space), 13, 82
Name Index

Ahlfors, L., 667, 675
Aitchison, J., 325, 675
Albert, J., 519, 675
Aldous, D., 46, 79, 482, 675
Anderson, T., 386, 675
Andrews, C., ix
Anscombe, F., 181, 675
Antoniak, C., 59, 675
Aumann, R., 181, 675
Bahadur, R., 94, 675
Barnard, G., 320, 420, 675
Barndorff-Nielsen, O., 307, 675
Barnett, V., vii, 675
Barron, A., 434-435, 446, 675, 677
Basu, D., 99-100, 675
Bayes, T., 16, 29, 676
Becker, R., x, 676
Berberian, S., 507, 667, 676
Berger, J., 22, 173, 284, 525, 565, 614, 666, 676
Berger, R., 283, 677
Berk, R., 417, 430, 432, 676
Berkson, J., 218, 281, 676
Berry, D., 565, 676
Berti, P., 21, 676
Bhattacharyya, A., 305
Bickel, P., 330-331, 676
Billingsley, P., 46, 621, 636, 648, 676
Bishop, Y., 462, 676
Blackwell, D., 56, 86, 152, 455, 676
Blyth, C., 158, 676
Bohrer, R., x
Bondar, J., 236, 676
Bortkiewicz, L., 462, 677
Box, G., 21, 521, 677
Breiman, L., 618, 640, 677
Brenner, D., 435, 677
Brown, L., 99, 160, 167, 677
Buck, C., 665, 677
Buehler, R., 99, 677
Carlin, B., 507, 686
Casella, G., 283, 677
Chaloner, K., 24, 677
Chambers, J., x, 676
Chang, T., 379, 677
Chapman, D., 303, 677
Chen, C., 435, 677
Chib, S., 519, 675
Chow, Y., 647, 677
Church, T., 24, 677
Churchill, R., 666-667, 677
Clarke, B., 446, 677
Cornfield, J., 563, 565, 678
Cox, D., 21, 218, 424, 521, 677-678
Cramer, H., 301, 678
Daffer, P., 33, 688
David, H., 404, 678
Dawid, A., 21, 125, 435, 521, 678, 687
DeFinetti, B., ix, 6, 21, 25, 28, 654, 656-657, 678
DeGroot, M., ix, 91, 98, 181, 362, 536, 654, 678
DeMoivre, A., 8, 678
Diaconis, P., ix, 15, 28, 41, 46, 108, 123, 126, 426, 434, 479-480, 667, 678-680
Dickey, J., 24, 679, 681-682
Doob, J., 36, 429, 507, 645-646, 679
Doytchinov, B., ix
Dubins, L., 70, 455, 676, 679
Dugundji, J., 666, 679
Dunford, N., 507, 635, 667, 679
Dunsmore, I., 325, 675
Eberhardt, K., 326, 679
Eddy, W., 521, 681
Edwards, W., 222, 284, 679
Efron, B., 166, 330-331, 335-336, 423, 679
Escobar, M., 60, 679
Fabius, J., 61, 679
Fedderson, A., 99, 677

Ferguson, T., 52, 56, 61, 173, 179, 181, 248, 258, 614, 666, 680
Ferrandiz, J., 669, 680
Fieller, E., 321, 680
Fienberg, S., 462, 676
Fishburn, P., 181, 680
Fisher, R., 89, 96, 217-218, 307, 370, 373, 522, 680
Fraser, D., 435, 677, 680
Freedman, D., 15, 28, 40-41, 46, 61, 70, 123, 126, 330-331, 426, 434, 479-480, 667, 676, 678-680
Freedman, L., 24, 680
Freeman, P., 524, 681
Gabriel, K., 252, 681
Garthwaite, P., 24, 681
Gavasakar, U., 24, 681
Geisser, S., 521, 668, 681
Gelfand, A., 507, 681
Geman, D., 507, 681
Geman, S., 507, 681
Ghosh, J., 381-382, 681
Gnanadesikan, R., 22, 681
Good, I., 565, 681
Hadjicostas, P., ix
Hall, P., 337-338, 681
Hall, W., 381-382, 681
Halmos, P., 364, 600, 681
Hampel, F., 315, 681
Hartigan, J., 20-21, 33, 681
Heath, D., 21, 46, 681-682
Hewitt, E., 46, 682
Heyde, C., 435, 682
Hill, B., 9, 484, 682
Hinkley, D., 218, 423, 678-679
Hodges, J., 414
Hoel, P., 640, 682
Hogarth, R., 24, 682
Holland, P., 462, 676
Huber, P., 310, 315, 428, 682
Hwang, J., 160, 677
James, W., 163, 682
Jaynes, E., 379, 682
Jeffreys, H., 122, 229, 284, 682
Jiang, T., ix
Johnstone, I., 435, 682
Kadane, J., 21, 24, 183-184, 446, 564, 655, 682-683, 687-688
Kagan, A., 349, 683
Kahneman, D., 23, 683
Kass, R., ix, 226, 446, 505, 683
Kerridge, D., 564, 683
Kiefer, J., 417, 420, 683
Kinderman, A., 660, 683
Kingman, J., 36, 683
Knuth, D., x, 683
Kraft, C., 66, 683
Krasker, W., 56, 683
Krem, A., 408, 683
Kshirsagar, A., 386, 684
Kullback, S., 116, 684
Lamport, L., x, 684
Lane, D., 9, 682
Lauritzen, S., 28, 123, 481, 684
Lavine, M., 69, 526, 684
LeCam, L., 414, 437, 684
Lecoutre, B., 668-669, 684, 686
Lehmann, E., 231, 280, 285, 298, 350, 684
Lévy, P., 648, 650
Lindley, D., 6, 229, 284, 479, 684
Lindman, H., 222, 284, 679
Linnik, Y., 349, 683
Loève, M., 34, 653, 685
Louis, T., 24, 677
Matts, J., 24, 677
Mauldin, R., 66, 69, 685
McDunnough, P., 435, 677, 680
Mee, R., 326, 679
Mendel, G., 217, 685
Métivier, M., 66, 685
Monahan, J., 660, 683
Morgenstern, O., 181-182, 688
Morris, C., 166, 500, 679, 685
Nachbin, L., 364, 685
Neyman, J., 89, 175, 231, 247, 420, 685
Nobile, A., ix
Novick, M., ix, 6, 684
Oue, S., ix

Patterson, R., 33, 688
Pearson, E., 175, 231, 247, 685
Pearson, K., 216, 685
Perlman, M., 430, 685
Peters, S., 24, 682
Phillips, L., 6, 684
Pierce, D., 99, 685
Pitman, E., 347, 685
Port, S., 640, 682
Portnoy, S., x
Pratt, J., 56, 98, 683, 685
Raftery, A., 226, 683
Ramamoorthi, R., 86, 676
Rao, C., 152, 301, 349, 683, 685-686
Reeve, C., 326, 679
Regazzini, E., 21, 676
Rigo, P., 21, 676
Robbins, H., 303, 647, 677, 686
Robert, C., 225, 686
Roberts, H., 565, 686
Ronchetti, E., 315, 681
Rouanet, H., 668-669, 684, 686
Rousseeuw, P., 315, 681
Royden, H., 578, 589, 597, 621, 686
Rubin, D., 332, 686
Rudin, W., 666, 686
Savage, L., 46, 181, 222, 284, 565, 600, 679, 681-682, 686
Scheffe, H., 298, 634, 684, 686
Schervish, M., v
Schwartz, J., 507, 635, 667, 679
Schwartz, L., 429, 687
Scott, E., 420, 685
Seidenfeld, T., ix, 21, 183-184, 187, 429, 564, 655, 682-683, 686-687
Sellke, T., 284, 676
Serfling, R., 413, 687
Sethuraman, J., 56, 687
Short, T., ix
Shurlow, N., v
Siegmund, D., 647, 677
Singh, K., 331, 687
Slovic, P., 23, 683
Smith, A., 479, 507, 681, 684, 687
Smith, W., 24, 682
Spiegelhalter, D., 24, 680
Spjøtvoll, E., 283, 687
Stahel, W., 315, 681
Steffey, D., 505, 683
Stein, C., 163, 379, 382, 568, 682, 687
Stigler, S., 8, 687
Stone, C., 640, 682
Stone, M., 21, 678, 687
Strasser, H., 430, 688
Strawderman, W., ix, 166, 688
Sudderth, W., 9, 21, 46, 66, 69, 681-682, 685
Taylor, R., 33, 688
Tiao, G., 521, 677
Tibshirani, R., 336, 679
Tierney, L., 225, 446, 507, 683, 688
Tversky, A., 23, 683
Venn, J., 8, 688
Verdinelli, I., 524, 688
Villegas, C., 379, 677
Von Mises, R., 10, 688
Von Neumann, J., 181-182, 688
Wald, A., 415, 549, 552, 557, 688
Walker, A., 435, 442, 688
Wallace, D., 99, 688
Wasserman, L., ix, 524, 526, 684, 688
Welch, B., 320, 688
West, M., 524, 688
Wijsman, R., x, 381-382, 681
Wilks, A., x, 676
Wilks, S., 325, 688
Williams, S., 66, 69, 685
Winkler, R., 24, 682
Wolfowitz, J., 417, 420, 557, 683, 688
Wolpert, R., 526, 684
Ylvisaker, D., 108, 679
Young, G., 329, 688
Zellner, A., 16, 688
Zidek, J., 21, 678
Subject Index*

*Italicized page numbers indicate where a term is defined.

Abelian group, 353
Absolutely continuous, 574, 597, 668
Absolutely continuous function, 211
Accept hypothesis, 214
Acceptance-rejection, 659
Action space, 144
Admissible, 154-157, 162, 167, 174
    λ-, 154-156, 162
Almost everywhere, 572, 582
Almost invariant function, 383
Almost surely, 572, 582
Alternate noncentral beta distribution, 668
Alternate noncentral χ² distribution, 668
Alternate noncentral F distribution, 669
Alternative, 2, 214
    composite, 215
    simple, 215, 233
Analysis of variance, 384, 491
Analytic function, 105
Ancillary statistic, 95, 99, 119
    maximal, 97
ANOVA, 384, 491
Archimedean condition, 192
ARE, 413
Asymptotic distribution, 399
Asymptotic efficiency, 413
Asymptotic relative efficiency, 413
Asymptotic variance, 402
Autoregression, 141
Autoregressive process, 441
Axioms of decision theory, 183-184, 296

Backward induction, 537
Bahadur's theorem, 94
Base measure, 54
Base of test, 215-216
Basu's theorem, 99
Bayes factor, 221, 238, 262-263, 274
Bayes risk, 149
Bayes rule, 150, 154-155, 167-168, 178
    extended, 169
    formal, 146, 150, 157, 348, 351, 369
    generalized, 156-157
    partial, 147, 150
Bayes' theorem, 4, 16
Bayesian bootstrap, 332
Bernoulli distribution, 672
Beta distribution, 54, 669
Bhattacharyya lower bounds, 305
Bias, 296
Bimeasurable function, 572, 583, 618
Binomial distribution, 673
Bolzano-Weierstrass theorem, 666
Bootstrap, 329
    Bayesian, 332
    nonparametric, 329
    parametric, 330
Borel σ-field, 571, 575
Borel space, 609, 618
Borel-Cantelli lemma:
    first, 578
    second, 663
Boundary, 636
Boundedly complete statistic, 94, 99
Box-Cox transformations, 521

Called-off preference, 184
Carathéodory extension theorem, 578
Cauchy distribution, 669
Cauchy sequence, 619
Cauchy's equation, 667
Cauchy-Schwarz inequality, 615
CDF, 612
    empirical, 404-405, 408
Central limit theorem, 642
    multivariate, 643
Chain rule, 600
Chapman-Robbins lower bound, 304
Characteristic function, 611, 639
Chi-squared distribution, 669
Chi-squared test of independence, 467
Closed set, 622
Closure, 622
Coherent tests, 252
Complete class, 174
    essentially, 174, 244, 251, 256
        minimal, 174
    minimal, 174-175
Complete class theorem, 179
Complete measure space, 579, 603
Complete metric space, 619
Complete statistic, 94, 298
    boundedly, 94, 99
Composite alternative, 215
Composite hypothesis, 215
Conditional distribution, 13, 16, 607, 609, 617
    regular, 610, 618
    version, 617
Conditional expectation, 19, 607, 616
    version, 608, 616
Conditional Fisher information, 111, 119
Conditional independence, 9, 610, 628
Conditional Kullback-Leibler information, 115, 119
Conditional mean, 607, 616
    version, 616
Conditional preference, 185
    consistent, 186
Conditional probability, 607, 609, 617
    regular, 609, 617
Conditional score function, 111
Conditionally sufficient statistic, 95
Confidence coefficient, 315, 325
Confidence interval, 3
    fixed-width, 559
    sequential, 559
Confidence sequence, 569
Confidence set, 279, 315, 379
    conservative, 315
    exact, 315
    randomized, 316
    UMA, 317
    UMAU, 321
Conjugate prior, 92
Conservative confidence set, 315
Conservative prediction set, 324
Conservative tolerance set, 325
Consistent, 397, 412
Consistent conditional preference, 186
Consistent distributions, 652
Contingency table, 467
Continuity axiom, 184
Continuity theorem, 640
Continuous distribution, 612
Continuous mapping theorem, 638
Convergence:
    pointwise, 184
    weak, 399, 635
Convergence in distribution, 399, 611, 635
Convergence in probability, 396, 611, 638
Convex function, 614
Counting measure, 570
Covariance, 607, 613
Cramér-Rao lower bound, 301
    multiparameter, 306
Credible set, 327
Cumulative distribution function (see CDF), 612
Cylinder set, 652

Data, 82
Decide optimally after stopping, 540
Decision rule, 145
    maximum, 541
    nonrandomized, 145, 151, 153
    nonrandomized sequential, 537
    randomized, 145, 151
    randomized sequential, 537
    regular, 540-541
    sequential, 537
        nonrandomized, 537
        randomized, 537
    terminal, 537
    truncated, 542
Decision theory, 144, 181
    axioms, 183-184
Decreasing sequence of sets, 577
DeFinetti's representation theorem, 28
Degenerate exponential family, 104
Degenerate weak order, 183
Delta method, 401, 464, 466
Dense, 619
Density, 607, 613
Dirichlet distribution, 52, 54, 674
Dirichlet process, 52, 54, 332, 434
Discrete distribution, 612
Distribution:
    alternate noncentral beta, 668
    alternate noncentral χ², 668
    alternate noncentral F, 669
    asymptotic, 399
    Bernoulli, 672
    beta, 54, 669
    binomial, 673
    Cauchy, 669
    chi-squared, 669
    conditional, 13, 16
    consistent, 652
    continuous, 612
    Dirichlet, 52, 54, 674
    discrete, 612
    empirical, 12, 38
    exponential, 670
    F, 670
    fiducial, 370, 373
    gamma, 670
    geometric, 673
    half-normal, 389
    hypergeometric, 673
    inverse gamma, 670
    Laplace, 670
    least favorable, 168
    marginal, 14
    multinomial, 674
    multivariate normal, 643, 674
    negative binomial, 673
    noncentral beta, 289, 671
    noncentral χ², 671
    noncentral F, 289, 671
    noncentral t, 289, 325, 671
    normal, 21, 349, 611, 640, 642, 671
        multivariate, 643, 674
    Pareto, 672
    Poisson, 673
    posterior, 16
    predictive, 14
        posterior, 18
        prior, 14
    prior, 13
        improper, 20
    t, 672
    uniform, 659, 672
Distribution function (see CDF), 612
Dominance axiom, 185
Dominated convergence theorem, 591
Dominates, 154
Dominating measure, 574, 597
Dutch book, 656

Efficiency:
    asymptotic, 413
    asymptotic relative, 413
    second-order, 414
Elicitation of probabilities, 22-23
Empirical Bayes, 166, 420, 500
Empirical CDF, 404-405, 408
Empirical distribution, 12, 38
Empirical probability measure, 12
ε-contamination class, 524, 526, 528
Equal-tailed test, 263
Equivalence class, 140
Equivalence relation, 140
Equivariant rule, 357
    location, 346-347, 351
    minimum risk (see MRE), 347
    scale, 350
Essentially complete class, 174, 244, 251, 256
    minimal, 174
Estimator, 3, 296
    maximum likelihood, 3, 307
    MRE, 347, 351, 363
    Pitman, 347, 363
    point, 3, 296
    unbiased, 3, 296
Event, 606, 612
Exact confidence set, 315
Exact prediction set, 324
Exchangeable, 7, 27-28
    partially, 125, 479
    row and column, 482
Expectation, 607, 613
    conditional, 19, 616
Expected Fisher information, 423
Expected loss principle, 146, 181
Expected value (see Expectation), 613
Exponential distribution, 670
Exponential family, 102-103, 105, 109, 155, 239, 249
    degenerate, 104
    nondegenerate, 104
Extended Bayes rule, 169
Extended real numbers, 571
Extremal family, 123, 125

F distribution, 670
Fatou's lemma, 589
FI regularity conditions, 111
Fiducial distribution, 370, 373
Field, 571, 575
Finite population sampling, 74
Finitely additive probability, 21, 281, 564, 657
Fisher information, 111, 113, 301, 412, 463
    conditional, 111, 119
    expected, 423
    observed, 226, 424, 435
Fisher-Neyman factorization theorem, 89
Fixed point, 505
Fixed-point problem, 505
Fixed-width confidence interval, 559
Floor of test, 215-216
Formal Bayes rule, 146, 150, 157, 348, 351, 369
Fubini's theorem, 596
Function:
    absolutely continuous, 211
    bimeasurable, 583
    measurable, 572, 583
    simple, 586

Gamma distribution, 670
General linear group, 354
Generalized Bayes rule, 156-157, 159
Generalized Neyman-Pearson lemma, 247
Generated σ-field, 571-572, 584
Geometric distribution, 673
Gibbs sampling, 507
Goodness of fit test, 218, 461
Gross error sensitivity, 312
Group, 353, 355-356
    abelian, 353
    general linear, 354
    location, 354
    location-scale, 354, 357, 368
    permutation, 355
    scale, 354

Haar measure:
    left, 363
        related, 366
    right, 363
        related, 366
Hahn decomposition theorem, 605
Half-normal distribution, 389
Hierarchical model, 166, 476
Highest posterior density region (see HPD), 327
Hilbert space, 507
Hilbert-Schmidt-type operator, 507, 667
Horse lottery, 182
Hotelling's T², 388
HPD region, 327, 329, 343
Hypergeometric distribution, 673
Hyperparameters, 477
Hypothesis, 2, 214
    composite, 215
    one-sided, 241
    simple, 215, 233
Hypothesis test, 2
    predictive, 219, 325
    randomized, 3
Hypothesis-testing loss, 214

Identity element of group, 353
Ignorable statistic, 142
IID, 2, 8, 611, 628
    conditionally, 9-10, 83, 611, 628
Image σ-field, 584
Importance sampling, 403, 661
Improper prior, 20, 122, 223, 263
Inadmissible, 154
Increasing sequence of sets, 577
Independence, 610, 628
    conditional, 9, 610, 628
Indifferent, 183
Induced measure, 575, 601
Infinitely often, 578, 663
Influence function, 311
Information:
    Fisher, 111, 113, 463
    Kullback-Leibler, 115-116
Integrable, 588
    uniformly, 592
Integral, 573, 587-588
    over a set, 588
Invariance of distributions, 355
Invariant function, 357
    almost, 383
Invariant loss, 356
    location, 346
    scale, 350-351
Invariant measure, 363
Inverse function theorem, 666
Inverse gamma distribution, 670
Inverse of group element, 353

Jacobian, 625
James-Stein estimator, 163, 486
Jeffreys' prior, 122, 446
Jensen's inequality, 614

Kolmogorov zero-one law, 631
Kullback-Leibler divergence, 116
Kullback-Leibler information, 115-116
    conditional, 115, 119

Lévy's theorem, 648, 650
λ-admissible, 154-156, 162
Laplace approximation, 226, 446
Laplace distribution, 670
Large order, 394
    stochastic, 396
Law of large numbers:
    strong, 34-36
    weak, 642
Law of the unconscious statistician, 607, 613
Law of total probability, 632
Least favorable distribution, 168
Lebesgue measure, 571, 580
Left Haar measure, 363
    related, 366
Lehmann-Scheffé theorem, 298
L-estimator, 410
Level of test, 215-216
LHM, 363
Likelihood function, 2, 13, 307
Likelihood ratio test (see LR test), 274
Linear regression, 276, 321
LMP test, 245, 265, 289
LMPU test, 265, 292
LMVUE, 300
Locally minimum variance unbiased estimator, 300
Locally most powerful test (see LMP), 245
Location equivariant rule, 346
Location estimation, 346
Location group, 354
Location invariant function, 346
Location invariant loss, 346
Location parameter, 344
Location-scale group, 354
Location-scale parameter, 345
Look-ahead decision rule, 546
Loss function, 144, 162, 189, 296
    convex, 349
    hypothesis-testing, 214
    squared-error, 146, 297
    0-1, 215
    0-1-c, 215, 218
Lower boundary, 170, 179, 233-235, 287
LR test, 223, 273-274, 458-459

Marginal distribution, 14, 607
Marginalization paradox, 21
Markov chain, 15, 507, 650
Markov chain Monte Carlo, 507
Markov inequality, 614
Martingale, 645-646
    reversed, 33, 649
Martingale convergence theorem, 648-649
Maximal ancillary, 97
Maximal invariant, 358
Maximin strategy, 168
Maximin value, 168
Maximum likelihood estimator, 3, 307, 415, 418-421
Maximum modulus theorem, 667
Maximum of decision rules, 541
MC test, 230
Mean, 607, 613
    conditional, 616
    trimmed, 314
Measurable function, 572, 583
Measure, 570, 572, 575, 577
    induced, 601
    Lebesgue, 571, 580
    product, 595
    σ-finite, 572, 578, 601
    signed, 577, 597
Measure space, 572, 577
M-estimator, 313-315, 424-428, 434
Method of Laplace, 226, 446
Method of moments, 340
Mill's ratio, 470
Minimal complete class, 174-175
Minimal essentially complete class, 174
Minimal sufficient statistic, 92
Minimax principle, 167, 189
Minimax rule, 167-169
Minimax theorem, 172
Minimax value, 168
Minimum risk equivariant (see MRE), 347
MLE, 3, 307, 415, 418-421
MLR, 239-244
Monotone convergence theorem, 590
Monotone likelihood ratio, 239-244
Monotone sequence of sets, 577
Most cautious test, 230
Most powerful test, 230
MP test, 230
MRE, 347, 349, 351, 363
Multinomial distribution, 674
Multiparameter Cramér-Rao lower bound, 306
Multivariate central limit theorem, 643
Multivariate normal distribution, 643, 674

Natural parameter, 103, 105
Natural parameter space, 103, 105
Natural sufficient statistic, 103
Negative binomial distribution, 673
Negative part, 573, 588
Negative set, 598
Neyman structure, 266
Neyman-Pearson fundamental lemma, 175, 231
NM-lottery, 182
Noncentral beta distribution, 289, 671
Noncentral χ² distribution, 671
Noncentral F distribution, 289, 671
Noncentral t distribution, 289, 325, 671
Nondegenerate exponential family, 104
Nondegenerate weak order, 183
Nonnull states, 184
Nonparametric, 52
Nonparametric bootstrap, 329
Nonrandomized decision rule, 145, 151, 153
Nonrandomized sequential decision rule, 537
Normal distribution, 21, 349, 611, 640, 642, 671
    multivariate, 643, 674
Null states, 184

Observed Fisher information, 226, 424, 435
One-sided hypothesis, 241
One-sided test, 239, 243
Open set, 571
Operating characteristic, 215
Orbit, 358
Order statistics, 86
Outliers, 521

Parameter, 1, 6, 50-51, 82
    location, 344
    location-scale, 345
    natural, 103, 105
    scale, 345
Parameter space, 1, 50, 82
    natural, 103, 105
Parametric bootstrap, 330
Parametric family, 1, 50, 102
Parametric index, 33, 50
Parametric models, 12, 49
Pareto distribution, 672
Partial Bayes rule, 147, 150
Partially exchangeable, 125, 479
Percentile-t bootstrap confidence interval, 336
Permutations, 355
π-λ theorem, 576
Pitman's estimator, 347, 363
Pivotal, 316, 370, 373
Point estimation, 296
Point estimator, 296
Pointwise convergence, 184
Poisson distribution, 673
Polish space, 619
Polya tree distribution, 69
Polya urn scheme, 9
Portmanteau theorem, 636
Positive part, 573, 588
Positive set, 598
Posterior distribution, 4, 16
    asymptotic normality, 435, 437, 442-443
    consistency, 429-430
Posterior predictive distribution, 18
Posterior risk, 146, 150
Power function, 2, 215, 240
Power set, 571
Prediction set, 324-325
    conservative, 324
    exact, 324
Predictive distribution, 14, 455
    posterior, 18
    prior, 14
Predictive hypothesis test, 219, 325
Preference, 182
    conditional, 185
        consistent, 186
Prevision, 655
Prior distribution, 4, 13
    improper, 20, 223, 263
    natural conjugate family, 92
Prize, 181
Probability, 572, 577
    empirical, 12
    random, 27
Probability integral transform, 519, 659
Probability space, 572, 577, 606, 612
Product measure, 595
Product σ-field, 576
Product space, 576
Pseudorandom numbers, 659
Pure significance test, 217
P-value, 279, 375, 380

Quantile:
    sample, 404-405, 408

Radon-Nikodym derivative, 575, 598
Radon-Nikodym theorem, 597
Random probability measure, 27
Random quantity, 82, 606, 612
Random variables, 606, 612
    exchangeable, 27
    IID, 8
Randomized confidence set, 316
Randomized decision rule, 145, 151
Randomized sequential decision rule, 537
Randomized test, 3
Rao-Blackwell theorem, 152
Ratio of uniforms, 660
Regression, 276, 321, 519
Regular conditional distribution, 610, 618
Regular conditional probabilities, 609, 617
Regular decision rule, 540
Reject hypothesis, 214
Rejection region, 2
Related LHM, 366
Related RHM, 366
Relative rate of convergence, 413, 470
Restriction of σ-field, 584
Reversed martingale, 649
RHM, 363
Right Haar measure, 363
    related, 366
Risk function, 149-150, 153, 155, 167, 216, 233, 297-298
Risk set, 170-172, 179, 233, 235, 287
Robustness, 310
    Bayesian, 524
Row and column exchangeable, 482

Sample quantile, 404-405, 408
Sample space, 2, 82
Scale equivariant rule, 350
Scale estimation, 350
Scale group, 354
Scale invariant function, 350
Scale invariant loss, 350-351
Scale parameter, 345
Scheffé's theorem, 634
Score function, 111, 122, 302, 305
    conditional, 111
Second-order efficiency, 414
Sensitivity analysis, 524
Separable space, 619
Separating hyperplane theorem, 666
Sequential decision rule, 537
Sequential probability ratio test, 549
Sequential test, 548
Set estimation, 296
Shrinkage estimator, 163
σ-field, 575
    Borel, 571, 575
    generated, 571-572, 584
    image, 584
    restriction, 584
    tail, 632
σ-finite measure, 572, 578, 601
Signed measure, 577, 597, 605, 635
Significance probability, 217, 228, 280
Significance test, 217
Simple alternative, 215
Simple function, 586
Simple hypothesis, 215
Size of test, 2, 215-216
Small order, 394
    stochastic, 396
SPRT, 549
Squared-error loss, 146, 297
√n-consistent, 401
SSS, 507
St. Petersburg paradox, 655
State independence, 184, 205
State-dependent utility, 205-206
States of Nature, 181, 189, 205
Statistic, 83
    ancillary, 95, 99, 119
    boundedly complete, 94
    complete, 94, 298
    sufficient, 84-85-86, 99, 103, 150-151, 298
Stein estimator (see James-Stein estimator), 163
Stochastic large order, 396
Stochastic small order, 396
Stone-Weierstrass theorem, 666
Stopping time, 537, 548, 552, 554
Strict preference, 183
Strong law of large numbers, 34-36
Strongly unimodal, 329
Submartingale, 646
Successive substitution, 505-506, 545
Successive substitution sampling, 507
Sufficient statistic, 84-85-86, 99, 103, 109, 150-151, 298
    conditionally, 95
    minimal, 92
    natural, 103
Superefficiency, 414
Supporting hyperplane theorem, 666
Sure-thing principle, 184

t distribution, 672
Tail σ-field, 632
Tailfree process, 60
Taylor's theorem, 665
Tchebychev's inequality, 614
Terminal decision rule, 537
Test:
    goodness of fit, 218, 461
    one-sided, 239, 243
    two-sided, 256, 273
Test function, 175, 215
Theorem:
    Bahadur, 94
    Basu, 99
    Bayes, 4, 16
    Bhattacharyya lower bounds, 305
    Bolzano-Weierstrass, 666
    Carathéodory extension, 578
    Cauchy's equation, 667
    central limit, 642
        multivariate, 643
    chain rule, 600
    Chapman-Robbins bound, 304
    complete class, 179
    continuity, 640
    continuous mapping, 638
    Cramér-Rao lower bound, 301
    DeFinetti, 27-28
    dominated convergence, 591
    Fatou's lemma, 589
    Fisher-Neyman, 89
    Fubini, 596
    Hahn decomposition, 605
    inverse function, 666
    Kolmogorov zero-one law, 631
    Lévy, 648, 650
    law of total probability, 632
    Lehmann-Scheffé, 298
    martingale convergence, 648-649
    maximum modulus, 667
    minimax, 172
    monotone convergence, 590
    multivariate central limit, 643
    Neyman-Pearson, 175, 231
        generalized, 247
    π-λ, 576
    portmanteau, 636
    Radon-Nikodym, 597
    Rao-Blackwell, 152
    Scheffé, 634
    separating hyperplane, 666
    Stone-Weierstrass, 666
    strong law of large numbers, 36
    supporting hyperplane, 666
    Taylor, 665
    Tonelli, 595
    uniqueness, 645
    upcrossing, 647
    weak law of large numbers, 642
Tolerance coefficient, 325
Tolerance set, 219, 325
    conservative, 325
Tonelli's theorem, 595
Topological space, 571, 575
Topology, 571
Transformation, 354
Transition kernel, 124
Trimmed mean, 314
Trivial σ-field, 571
Truncated decision rule, 542
Two-sided alternative, 246
Two-sided hypothesis, 246
Two-sided test, 256, 273
Type I error, 214
Type II error, 214

UMA confidence set, 317
UMAU confidence set, 321
UMC test, 230-231, 239, 244, 255, 257
UMCU test, 254-256
UMP test, 230, 240, 243-244, 255, 257
UMPU test, 254-256
UMPUAI test, 384
UMVUE, 297-299
Unbiased estimator, 3, 296-302
Unbiased test, 254
Uniform distribution, 659, 672
Uniformly integrable, 592
Uniformly minimum variance unbiased estimator (see UMVUE), 297
Uniformly most accurate confidence set (see UMA), 317
Uniformly most accurate unbiased confidence set (see UMAU), 321
Uniformly most cautious test (see UMC), 230
Uniformly most cautious unbiased test (see UMCU), 254
Uniformly most powerful test (see UMP), 230
Uniformly most powerful unbiased test (see UMPU), 254
Uniqueness theorem, 645
Upcrossing lemma, 647
Upper semicontinuous, 417
USC, 417
Utility function, 181, 188
    state-dependent, 205-206

Variance, 607, 613
Variance components, 484
Variance stabilizing transformation, 402
Version of conditional distribution, 617
Version of conditional expectation, 608, 616
Version of conditional mean, 616

Wald's lemma, 552
Weak convergence, 399, 635
Weak* convergence, 635
Weak law of large numbers, 642, 664
Weak order, 183, 216-217, 280
    degenerate, 183
    nondegenerate, 183
Weak preference, 182
Springer Series in Statistics
(continued from p. ii)

Pollard: Convergence of Stochastic Processes.


Pratt/Gibbons: Concepts of Nonparametric Theory.
Read/Cressie: Goodness-of-Fit Statistics for Discrete Multivariate Data.
Reinsel: Elements of Multivariate Time Series Analysis.
Reiss: A Course on Point Processes.
Reiss: Approximate Distributions of Order Statistics: With Applications
to Non-parametric Statistics.
Rieder: Robust Asymptotic Statistics.
Rosenbaum: Observational Studies.
Ross: Nonlinear Estimation.
Sachs: Applied Statistics: A Handbook of Techniques, 2nd edition.
Särndal/Swensson/Wretman: Model Assisted Survey Sampling.
Schervish: Theory of Statistics.
Seneta: Non-Negative Matrices and Markov Chains, 2nd edition.
Shao/Tu: The Jackknife and Bootstrap.
Siegmund: Sequential Analysis: Tests and Confidence Intervals.
Simonoff: Smoothing Methods in Statistics.
Small: The Statistical Theory of Shape.
Tanner: Tools for Statistical Inference: Methods for the Exploration of Posterior
Distributions and Likelihood Functions, 3rd edition.
Tong: The Multivariate Normal Distribution.
van der Vaart/Wellner: Weak Convergence and Empirical Processes: With
Applications to Statistics.
Vapnik: Estimation of Dependences Based on Empirical Data.
Weerahandi: Exact Statistical Methods for Data Analysis.
West/Harrison: Bayesian Forecasting and Dynamic Models.
Wolter: Introduction to Variance Estimation.
Yaglom: Correlation Theory of Stationary and Related Random Functions I:
Basic Results.
Yaglom: Correlation Theory of Stationary and Related Random Functions II:
Supplementary Notes and References.
