
PART A

Appendix
A.1
Inequalities

This section presents several inequalities for sums of independent stochastic processes. For a stochastic process {X_t : t ∈ T} indexed by some arbitrary index set, the notation ‖X‖ is an abbreviation for the supremum sup_t |X_t|. The stochastic processes need not be measurable maps into a Banach space. Independence of the stochastic processes X_1, X_2, ... is understood in the sense that each of the processes is defined on a product probability space ∏_{i=1}^∞ (Ω_i, U_i, P_i) with X_i depending on the ith coordinate of (ω_1, ω_2, ...) only. The process X_i is called symmetric if X_i and −X_i have the same distribution. In the case of nonmeasurability, the symmetry of independent processes X_1, X_2, ... may be understood in the sense that outer probabilities remain the same if one or more X_i are replaced by −X_i.†

† This is the case, for instance, if each (Ω_i, U_i, P_i) is a product (X_i, A_i, P_i)² of two identical probability spaces and X_i(x_1, x_2) = Z_i(x_1) − Z_i(x_2) for some stochastic process Z_i.

Throughout the chapter S_n equals the partial sum X_1 + ... + X_n of given stochastic processes X_1, X_2, ....

A.1.1 Proposition (Ottaviani's inequality). Let X_1, ..., X_n be independent stochastic processes indexed by an arbitrary set. Then for λ, µ > 0,

P*( max_{k≤n} ‖S_k‖ > λ + µ ) ≤ P*( ‖S_n‖ > λ ) / ( 1 − max_{k≤n} P*( ‖S_n − S_k‖ > µ ) ).

Proof. Let A_k be the event that ‖S_k‖* is the first ‖S_j‖* that is strictly greater than λ + µ. The event on the left is the disjoint union of A_1, ..., A_n. Since ‖S_n − S_k‖* is independent of ‖S_1‖*, ..., ‖S_k‖*,

P(A_k) min_{j≤n} P( ‖S_n − S_j‖* ≤ µ ) ≤ P( A_k, ‖S_n − S_k‖* ≤ µ ) ≤ P( A_k, ‖S_n‖* > λ ),

since ‖S_k‖* > λ + µ on A_k. Sum up over k to obtain the result.
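The inequality is easy to probe numerically. The following sketch, not part of the original text, checks it by Monte Carlo for real-valued summands, where ‖·‖ is the absolute value and outer probabilities are ordinary probabilities; the normal distribution and the levels λ, µ are arbitrary illustrative choices.

```python
# Monte Carlo check of Ottaviani's inequality for real-valued X_i,
# with ||.|| the absolute value.  All distributional choices are
# illustrative, not from the text.
import numpy as np

rng = np.random.default_rng(0)
n, reps, lam, mu = 20, 100_000, 6.0, 4.0

X = rng.standard_normal((reps, n))      # X_1, ..., X_n i.i.d. N(0, 1)
S = np.cumsum(X, axis=1)                # partial sums S_1, ..., S_n

left = np.mean(np.abs(S).max(axis=1) > lam + mu)
num = np.mean(np.abs(S[:, -1]) > lam)                     # P(|S_n| > lam)
# 1 - max_k P(|S_n - S_k| > mu); note S_n - S_n = 0
denom = 1 - (np.abs(S[:, -1:] - S) > mu).mean(axis=0).max()

print(f"P(max_k |S_k| > lam + mu) = {left:.4f}")
print(f"Ottaviani bound           = {num / denom:.4f}")
```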

A.1.2 Proposition (Lévy's inequalities). Let X_1, ..., X_n be independent, symmetric stochastic processes indexed by an arbitrary set. Then for every λ > 0 we have the inequalities

P*( max_{k≤n} ‖S_k‖ > λ ) ≤ 2 P*( ‖S_n‖ > λ ),

P*( max_{k≤n} ‖X_k‖ > λ ) ≤ 2 P*( ‖S_n‖ > λ ).

Proof. Let A_k be the event that ‖S_k‖* is the first ‖S_j‖* that is strictly greater than λ. The event on the left in the first inequality is the disjoint union of A_1, ..., A_n. Write T_n for the sum of the sequence X_1, ..., X_k, −X_{k+1}, ..., −X_n. By the triangle inequality, 2‖S_k‖* ≤ ‖S_n‖* + ‖T_n‖*. It follows that

P(A_k) ≤ P( A_k, ‖S_n‖* > λ ) + P( A_k, ‖T_n‖* > λ ) = 2 P( A_k, ‖S_n‖* > λ ),

since X_1, ..., X_n are symmetric. Sum up over k to obtain the first inequality.
For the second inequality, let A_k be the event that ‖X_k‖* is the first ‖X_j‖* that is strictly greater than λ. Write T_n for the sum of the variables −X_1, ..., −X_{k−1}, X_k, −X_{k+1}, ..., −X_n. By the triangle inequality, 2‖X_k‖* ≤ ‖S_n‖* + ‖T_n‖*. Proceed as before.
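A quick numerical illustration, again for real-valued summands so that ‖·‖ is the absolute value; the symmetric t-distribution and the level below are arbitrary choices, not from the text.

```python
# Monte Carlo check of Levy's inequalities for symmetric real
# summands; t-distributed summands and the level are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, reps, lam = 30, 100_000, 8.0

X = rng.standard_t(df=3, size=(reps, n))     # symmetric summands
S = np.cumsum(X, axis=1)

rhs = 2 * np.mean(np.abs(S[:, -1]) > lam)    # 2 P(|S_n| > lam)
print(f"P(max_k |S_k| > lam) = {np.mean(np.abs(S).max(axis=1) > lam):.4f} <= {rhs:.4f}")
print(f"P(max_k |X_k| > lam) = {np.mean(np.abs(X).max(axis=1) > lam):.4f} <= {rhs:.4f}")
```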

An interesting consequence of Lévy's inequalities is the following theorem concerning the convergence of random series of independent processes.

A.1.3 Proposition (Lévy-Itô-Nisio). Let X_1, X_2, ... be independent stochastic processes with sample paths in ℓ∞(T). Then the following statements are equivalent:
(i) {S_n} converges outer almost surely to a tight limit;
(ii) {S_n} converges in outer probability to a tight limit;
(iii) {S_n} converges weakly to a tight limit.

Proof. The implications (i) ⇒ (ii) ⇒ (iii) are true for general random sequences. We must prove the implications in the converse direction.
(iii) ⇒ (ii). Since any Cauchy sequence is convergent, it suffices to show that S_{n_{k+1}} − S_{n_k} converges in outer probability to zero as k → ∞ for any sequence n_1 < n_2 < ···. The sequence (S_{n_{k+1}}, S_{n_k}) is asymptotically tight and asymptotically measurable. By Prohorov's theorem, every subsequence has a further subsequence along which Y_k = S_{n_{k+1}} − S_{n_k} converges in distribution to a tight limit Y. Then Y_k(t) ⇝ Y(t) for every t. For every real number s at which the characteristic function of the weak limit of S_n(t) is nonzero, in particular for every s sufficiently close to zero,

E e^{isY_k(t)} = E e^{isS_{n_{k+1}}(t)} / E e^{isS_{n_k}(t)} → 1.

Thus the characteristic function of Y(t) is 1 in a neighborhood of zero. Conclude that Y = 0 almost surely.
(ii) ⇒ (i). Write S for the limit in probability of S_n. First assume that the processes X_j are symmetric. There exists a subsequence n_1 < n_2 < ··· such that P*( ‖S_{n_k} − S‖ > 2^{−k} ) < 2^{−k} for every k. By the Borel-Cantelli lemma, S_{n_k} → S outer almost surely as k → ∞. By a Lévy inequality,

P*( max_{n_k<n≤n_{k+1}} ‖S_n − S_{n_k}‖ > 2^{−k+1} ) ≤ 2 P*( ‖S_{n_{k+1}} − S_{n_k}‖ > 2^{−k+1} ).

The right side is smaller than a multiple of 2^{−k}. Hence max_{n_k<n≤n_{k+1}} ‖S_n − S_{n_k}‖* converges almost surely to zero by the Borel-Cantelli lemma. This concludes the proof that S_n converges outer almost surely for symmetric X_i.
Given general processes, construct an independent copy Y_1, Y_2, ... defined on a copy of the original probability space (Ω, U, P), and let T_n be the corresponding partial sums. Then the elements of the sequence S_n − T_n are the partial sums of the symmetric variables X_i − Y_i and converge in outer probability. (The sequence is formally defined on (Ω, U, P) × (Ω, U, P).) By the preceding paragraph, S_n − T_n converges outer almost surely. By Fubini's theorem there exists ω such that S_n − T_n(ω) converges outer almost surely. Then it converges also in distribution. Since S_n converges in distribution as well, it follows that the sequence T_n(ω) is convergent. Consequently, S_n = (S_n − T_n(ω)) + T_n(ω) converges outer almost surely.

A.1.4 Proposition (Hoffmann-Jørgensen inequalities). Let X_1, ..., X_n be independent stochastic processes indexed by an arbitrary set. Then for any λ, η > 0,

P*( max_{k≤n} ‖S_k‖ > 3λ + η ) ≤ ( P*( max_{k≤n} ‖S_k‖ > λ ) )² + P*( max_{k≤n} ‖X_k‖ > η ).

If X_1, ..., X_n are independent and symmetric, then also

P*( ‖S_n‖ > 2λ + η ) ≤ 4 ( P*( ‖S_n‖ > λ ) )² + P*( max_{k≤n} ‖X_k‖ > η ).

Proof. Let A_k be the event that ‖S_k‖* is the first ‖S_j‖* that is strictly greater than λ. The (disjoint) union of A_1, ..., A_n is the event that max_{k≤n} ‖S_k‖* is greater than λ. By the triangle inequality, ‖S_j‖* ≤ ‖S_{k−1}‖* + ‖X_k‖* + ‖S_j − S_k‖* for every j ≥ k. On A_k the first term on the right is bounded by λ. Conclude that on A_k

max_{j≥k} ‖S_j‖* ≤ λ + max_{k≤n} ‖X_k‖* + max_{j>k} ‖S_j − S_k‖*.

On A_k this remains true if the maximum on the left is taken over all ‖S_j‖*. Since the processes are independent, we obtain for every k

P( A_k, max_{k≤n} ‖S_k‖* > 3λ + η ) ≤ P( A_k, max_{k≤n} ‖X_k‖* > η ) + P(A_k) P( max_{j>k} ‖S_j − S_k‖* > 2λ ).

In the probability on the far right the variable max_{j>k} ‖S_j − S_k‖* can be bounded by 2 max_{k≤n} ‖S_k‖*. Next sum over k to obtain the first inequality of the proposition.
To prove the second inequality, first establish by the same method that

P( A_k, ‖S_n‖* > 2λ + η ) ≤ P( A_k, max_{k≤n} ‖X_k‖* > η ) + P(A_k) P( ‖S_n − S_k‖* > λ ).

In the probability on the far right, bound the variable ‖S_n − S_k‖* by max_{k≤n} ‖S_n − S_k‖*. Next sum over k to obtain

P( ‖S_n‖* > 2λ + η ) ≤ P( max_{k≤n} ‖X_k‖* > η ) + P( max_{k≤n} ‖S_k‖* > λ ) P( max_{k≤n} ‖S_n − S_k‖* > λ ).

The processes S_k and S_n − S_k are the partial sums of the symmetric processes X_1, ..., X_n and X_n, ..., X_2, respectively. Apply Lévy's inequality to both probabilities on the far right to conclude the proof.
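As with the previous inequalities, the first bound can be checked by simulation for real-valued summands; the heavy-tailed t-distribution and the levels below are illustrative choices, not from the text.

```python
# Monte Carlo check of the first Hoffmann-Jorgensen inequality for
# real-valued summands (||.|| = absolute value).
import numpy as np

rng = np.random.default_rng(2)
n, reps, lam, eta = 25, 200_000, 10.0, 6.0

X = rng.standard_t(df=3, size=(reps, n))
S = np.cumsum(X, axis=1)
max_S = np.abs(S).max(axis=1)
max_X = np.abs(X).max(axis=1)

lhs = np.mean(max_S > 3 * lam + eta)
rhs = np.mean(max_S > lam) ** 2 + np.mean(max_X > eta)
print(f"P(max_k |S_k| > 3*lam + eta) = {lhs:.5f} <= {rhs:.5f}")
```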

A.1.5 Proposition (Hoffmann-Jørgensen inequalities for moments). Let 0 < p < ∞ and suppose that X_1, ..., X_n are independent stochastic processes indexed by an arbitrary index set T. Then there exist constants C_p and 0 < u_p < 1 such that

E* max_{k≤n} ‖S_k‖^p ≤ C_p ( E* max_{k≤n} ‖X_k‖^p + F^{−1}(u_p)^p ),

where F^{−1} is the quantile function of the random variable max_{k≤n} ‖S_k‖*. Furthermore, if X_1, ..., X_n are symmetric, then there exist constants K_p and 0 < v_p < 1 such that

E* ‖S_n‖^p ≤ K_p ( E* max_{k≤n} ‖X_k‖^p + G^{−1}(v_p)^p ),

where G^{−1} is the quantile function of the random variable ‖S_n‖*. For p ≥ 1, the last inequality is also valid for mean-zero processes (with different constants).

Proof. Take λ = η = t in the first inequality of the preceding proposition to find that, for any τ > 0,

E* max_{k≤n} ‖S_k‖^p = 4^p ∫_0^∞ P( max_{k≤n} ‖S_k‖* > 4t ) d(t^p)
  ≤ (4τ)^p + 4^p ∫_τ^∞ ( P( max_{k≤n} ‖S_k‖* > t ) )² d(t^p) + 4^p ∫_τ^∞ P( max_{k≤n} ‖X_k‖* > t ) d(t^p)
  ≤ (4τ)^p + 4^p P( max_{k≤n} ‖S_k‖* > τ ) E* max_{k≤n} ‖S_k‖^p + 4^p E* max_{k≤n} ‖X_k‖^p.

Choosing τ such that 4^p P( max_{k≤n} ‖S_k‖* > τ ) is bounded by 1/2, and rearranging terms, yield the claimed inequality. The second inequality can be proved in a similar manner, this time using the second inequality of the preceding proposition.
The inequality for mean-zero processes follows from the inequality for symmetric processes by symmetrization and desymmetrization: by Jensen's inequality, E* ‖S_n‖^p is bounded by E* ‖S_n − T_n‖^p if T_n is the sum of n independent copies of X_1, ..., X_n.

Hoffmann-Jørgensen’s inequality strengthens probability statements


about sums to statements concerning expectations or p-moments, pro-
vided the corresponding moment of the maximum of the individual terms
can be controlled. Thus it can be used to “invert” the consequences of
Markov-type inequalities under anP additional condition. A typical appli-
n
cation
Pn is to a sequence of sums i=1 Xni . Boundedness in probability
∗ −1
k i=1 Xni k = OP (1) implies that Pn n (u) ∗= O(1) for the sequence of
G
quantile functions ofP the variables k i=1 Xni k . Conclude that the sequence
n
of expectations E∗ k i=1 Xni kp is O(1) if the same is true for the sequence
E∗ max1≤i≤n kXni kp .
Hoffmann-Jørgensen’s inequality may also be used to bound higher
moments of sums in terms of a first moment plus a higher moment of the
maximum of the individual terms. The first part of the following proposition
is in this spirit, although its constant is much smaller than the constant
obtainable from Hoffmann-Jørgensen’s inequality.

A.1.6 Proposition. Let X_1, ..., X_n be independent, mean-zero stochastic processes indexed by an arbitrary index set T. Then

‖ ‖S_n‖* ‖_{P,p} ≤ K (p / log p) [ ‖ ‖S_n‖* ‖_{P,1} + ‖ max_{1≤i≤n} ‖X_i‖* ‖_{P,p} ]   (p > 1),

‖ ‖S_n‖* ‖_{ψ_p} ≤ K_p [ ‖ ‖S_n‖* ‖_{P,1} + ‖ max_{1≤i≤n} ‖X_i‖* ‖_{ψ_p} ]   (0 < p ≤ 1),

‖ ‖S_n‖* ‖_{ψ_p} ≤ K_p [ ‖ ‖S_n‖* ‖_{P,1} + ( Σ_{i=1}^n ‖ ‖X_i‖* ‖_{ψ_p}^q )^{1/q} ]   (1 < p ≤ 2).

Here 1/p + 1/q = 1, and K and K_p are a universal constant and a constant depending on p only, respectively.

Proof. The first part with the inferior constant 24(2^{1/p})(3^p) follows from the preceding Hoffmann-Jørgensen inequality by noting that (1 − v) G^{−1}(v) ≤ ∫_v^1 G^{−1}(s) ds ≤ E* ‖S_n‖ for every v, and then substitution of v_p = 1 − 3^{−p}/8 and K_p = 2(3^p). The proofs of the first part with the improved constant p/log p and of the second and third parts are long and rely on the isoperimetric methods developed by Talagrand (1989). See Ledoux and Talagrand (1991), pages 172–175.

Another consequence of Hoffmann-Jørgensen's inequality is the following equivalence of moments of the supremum of sums of independent stochastic processes and the same moment of the maximal summand.

A.1.7 Proposition. Let 0 < p < ∞ and suppose that X_1, X_2, ... are independent stochastic processes indexed by an arbitrary set T. If sup_{n≥1} ‖S_n‖* < ∞ almost surely, then

E* sup_{n≥1} ‖S_n‖^p < ∞

if and only if

E* sup_{n≥1} ‖X_n‖^p < ∞.

Proof. Since ‖X_n‖ ≤ ‖S_n‖ + ‖S_{n−1}‖, finiteness of the first expectation clearly implies finiteness of the second expectation. For any fixed n, it follows from Proposition A.1.5 that

E* max_{k≤n} ‖S_k‖^p ≤ C_p ( E* max_{k≤n} ‖X_k‖^p + F^{−1}(u_p)^p ),

where F^{−1} is the quantile function of the random variable max_{k≤n} ‖S_k‖*, and 0 < u_p < 1. Since sup_{n≥1} ‖S_n‖ is assumed finite almost surely, there is an M so that F^{−1}(u_p) ≤ M for all n. Hence letting n → ∞ in the last display shows that finiteness of the second expectation implies finiteness of the first expectation.

An interesting corollary of this proposition for normalized sums of independent stochastic processes is as follows.

A.1.8 Corollary. Let 0 < p < ∞ and let {a_n} be a sequence of positive numbers that increases to infinity. Let X_1, X_2, ... be independent stochastic processes indexed by an arbitrary set T. If sup_{n≥1} ‖S_n‖*/a_n < ∞ almost surely, then

E* sup_{n≥1} ‖S_n‖^p / a_n^p < ∞

if and only if

E* sup_{n≥1} ‖X_n‖^p / a_n^p < ∞.

Proof. See Ledoux and Talagrand (1991), page 159.

Verification of the second statement in the preceding corollary can be carried out by use of Problem 2.3.5.

A.1.9 Proposition (Marcinkiewicz inequalities). Let X_1, X_2, ... be independent, mean-zero stochastic variables. Then for p ≥ 1 there exist constants A_p and B_p such that

A_p E( Σ_{i=1}^n X_i² )^{p/2} ≤ E| Σ_{i=1}^n X_i |^p ≤ B_p E( Σ_{i=1}^n X_i² )^{p/2}.

Proof. This can be proved using symmetrization followed by Khinchine's inequality, Proposition A.6.10. Alternatively see Chow and Teicher (1978), page 356, or Giné and de la Peña (1999), Lemma 1.4.13.
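A minimal numerical illustration, not part of the original text: for mean-zero summands the ratio of the middle term to the outer terms should remain bounded away from zero and infinity as n grows. The uniform distribution and the exponent p are arbitrary choices.

```python
# Monte Carlo look at the Marcinkiewicz inequalities: the ratio
# E|S_n|^p / E(sum X_i^2)^{p/2} stays bounded in n.
import numpy as np

rng = np.random.default_rng(3)
p, reps = 3.0, 200_000

for n in (5, 20, 80):
    X = rng.uniform(-1, 1, size=(reps, n))          # mean-zero summands
    num = np.mean(np.abs(X.sum(axis=1)) ** p)       # E|X_1 + ... + X_n|^p
    den = np.mean((X ** 2).sum(axis=1) ** (p / 2))  # E(sum X_i^2)^{p/2}
    print(f"n={n:3d}  ratio={num / den:.3f}")
```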

A.1.10 Proposition (Hoeffding's inequality). Let {c_1, ..., c_N} be elements of an arbitrary vector space V, and let U_1, ..., U_n and V_1, ..., V_n denote samples of size n ≤ N drawn without and with replacement, respectively, from {c_1, ..., c_N}. Then, for every convex function φ: V → ℝ,

E φ( Σ_{j=1}^n U_j ) ≤ E φ( Σ_{j=1}^n V_j ).

Proof. See Hoeffding (1963) and Marshall and Olkin (1979), Corollary A.2.e, page 339.
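The following sketch, not from the original text, illustrates the proposition on the real line with the convex function φ(s) = s²; the population and the sample size are arbitrary choices.

```python
# Monte Carlo check of Hoeffding's inequality: sampling without
# replacement gives the smaller expectation for convex phi(s) = s^2.
import numpy as np

rng = np.random.default_rng(4)
c = np.arange(1, 21) - 10.5          # population c_1, ..., c_N, N = 20
n, reps = 8, 50_000

U = np.array([rng.choice(c, size=n, replace=False).sum() for _ in range(reps)])
V = rng.choice(c, size=(reps, n), replace=True).sum(axis=1)

print(f"E (sum U_j)^2 = {np.mean(U**2):.2f} <= E (sum V_j)^2 = {np.mean(V**2):.2f}")
```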

A.1.11 Proposition (Contraction principle). Let X_1, ..., X_n be arbitrary stochastic processes and γ_1, ..., γ_n arbitrary, real-valued (measurable) random variables with 0 ≤ γ_i ≤ 1. Let ξ_1, ..., ξ_n be independent, real random variables with zero means, independent of (X_1, ..., X_n, γ_1, ..., γ_n). Then

E* ‖ Σ_{i=1}^n ξ_i γ_i X_i ‖ ≤ E* ‖ Σ_{i=1}^n ξ_i X_i ‖.

Proof. Since the γ_i can be taken out one at a time, it suffices to show that

E* ‖ξγX + Y‖ ≤ E* ‖ξX + Y‖

whenever ξ has zero mean and is independent of (γ, X, Y). By the triangle inequality, the left side is bounded by

E* ‖(ξX + Y)γ‖ + E* ‖Y(1 − γ)‖ = E* ‖ξX + Y‖γ + E* ‖Y‖(1 − γ).

The second term on the right can be written

E*_{Y,γ} ‖Y + (E_ξ ξ)X‖ (1 − γ).

By Jensen's inequality and Fubini's theorem, it is bounded by E* ‖Y + ξX‖ (1 − γ). The result follows since γ is measurable.
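A Monte Carlo illustration, not part of the original text, with deterministic "processes" x_i on a finite index set, Gaussian multipliers ξ_i, and uniform coefficients γ_i drawn independently of the ξ_i, as the proposition requires; all these choices are illustrative.

```python
# Monte Carlo check of the contraction principle A.1.11, with
# ||.|| the supremum over a finite index set of size T.
import numpy as np

rng = np.random.default_rng(5)
n, T, reps = 10, 50, 100_000

x = rng.standard_normal((n, T))                # fixed processes x_i(t)
xi = rng.standard_normal((reps, n))            # mean-zero multipliers
gamma = rng.uniform(0, 1, size=(reps, n))      # 0 <= gamma_i <= 1

lhs = np.abs((xi * gamma) @ x).max(axis=1).mean()   # E||sum xi_i gamma_i x_i||
rhs = np.abs(xi @ x).max(axis=1).mean()             # E||sum xi_i x_i||
print(f"{lhs:.3f} <= {rhs:.3f}")
```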

Problems and Complements

1. In the first inequality of Proposition A.1.5, the constants C_p = 2(4^p) and u_p = 1 − 4^{−p}/2 will do. The second inequality is true for symmetric processes with K_p = 2(3^p) and v_p = 1 − 3^{−p}/8; for mean-zero processes with K_p = 4(6^p) and v_p = 1 − 3^{−p}/16.
A.2
Gaussian Processes

Since Gaussian processes arise as the limits in distribution of empirical processes, their properties are of importance for many of the developments in Parts 2 and 3. Inequalities for Gaussian processes have also been used in a number of proofs involving multiplier processes with Gaussian multipliers. In this chapter we discuss the most important results.

A.2.1 Inequalities and Gaussian Comparison

A stochastic process X = (X_t : t ∈ T), indexed by a semimetric space T, is Gaussian if all finite linear combinations Σ_i a_i X_{t_i} are normally distributed, and mean-zero if EX_t = 0 for all t ∈ T. For a separable process we write ‖X‖ for the supremum sup_{t∈T} |X_t|, and define the median M(X) of ‖X‖ by the two requirements

P( ‖X‖ ≤ M(X) ) ≥ 1/2,   P( ‖X‖ ≥ M(X) ) ≥ 1/2.

It can be shown that this pair of inequalities defines a unique number M(X) (see the proof of the following theorem). Furthermore, we define

σ²(X) = sup_{t∈T} var X_t.

The following proposition shows that the distribution of the supremum of a zero-mean Gaussian process has sub-Gaussian tails, whenever it is finite.

A.2.1 Proposition (Borell's inequality). Let X be a mean-zero, separable Gaussian process with finite median. Then for every λ > 0,

P( |‖X‖ − M(X)| ≥ λ ) ≤ exp( −λ² / 2σ²(X) ),

P( |‖X‖ − E‖X‖| ≥ λ ) ≤ 2 exp( −λ² / 2σ²(X) ),

P( ‖X‖ ≥ λ ) ≤ 2 exp( −λ² / 8E‖X‖² ).

Proof. The proof is based on finite-dimensional approximation and the concentration inequalities for the finite-dimensional, standard normal distribution given in the lemma ahead.
First, consider the case that the index set T consists of finitely many points t_1, ..., t_k. Then the process X can be represented as AZ, for Z a standard normal vector and A the symmetric square root of the covariance matrix of X. By the Cauchy-Schwarz inequality,

max_{1≤i≤k} |(Az)_i| ≤ max_{1≤i≤k} ‖A_{i·}‖ ‖z‖ = max_{1≤i≤k} ( (A²)_{ii} )^{1/2} ‖z‖.

This implies that the function f(z) = max_i |(Az)_i| is Lipschitz of norm bounded by sup_t σ(X_t) = σ(X). Apply Lemma A.2.2 to both f and −f, and combine the results to obtain the theorem for the case where the index set is finite.
Since X is separable by assumption, the supremum ‖X‖ is the almost-sure, monotone limit of a sequence of finite suprema. The supremum M of the sequence of medians can be seen to be a median (the smallest) of ‖X‖. Approximate ‖X‖ − M(X) by a sequence of similar objects for finite suprema to obtain the first inequality of the theorem with M = M(X). The proof of this inequality is complete if it is shown that M is the only median of ‖X‖.
Since the median of |X_t| is bounded above by the median of ‖X‖, it follows that P( |X_t| ≤ M(X) ) ≥ 1/2 for every t. Taking into account the normal distribution of X_t, we obtain that σ(X_t) is bounded above by M(X)/Φ^{−1}(3/4). Therefore, σ(X) is finite and the exponential inequality obtained previously is nontrivial. The argument actually shows that both the left-tail probability P( ‖X‖ ≤ M − λ ) and the right-tail probability P( ‖X‖ ≥ M + λ ) are bounded by 1/2 times the exponential upper bound. This means that these probabilities are strictly less than 1/2 for λ > 0, whence M is a unique median of ‖X‖.
To obtain the second inequality of the theorem, we note first that E‖X‖ is finite in view of the exponential tail bound for ‖X‖ − M(X). Next we use the following lemma and take limits along finite subsets as before.
The third inequality is trivially satisfied if 0 ≤ λ < 2E‖X‖, because in that case the exponential is larger than exp(−1/2) ≥ 0.6. For λ > 2E‖X‖, the probability is bounded by P( ‖X‖ > E‖X‖ + λ/2 ), which can be further bounded by the second inequality.
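The second inequality can be visualized for a finite-dimensional Gaussian vector X = AZ. The sketch below is not from the original text; the covariance AAᵀ is an arbitrary illustrative choice, and the empirical mean stands in for E‖X‖.

```python
# Monte Carlo illustration of Borell's inequality (second form) for
# a finite-dimensional Gaussian vector X = A Z.
import numpy as np

rng = np.random.default_rng(6)
k, reps = 40, 200_000

A = rng.standard_normal((k, k)) / np.sqrt(k)
X = np.abs(rng.standard_normal((reps, k)) @ A.T).max(axis=1)   # ||X||

sigma2 = (A ** 2).sum(axis=1).max()      # sup_t var X_t = max_i ||A_i.||^2
mean = X.mean()
for lam in (0.5, 1.0, 1.5, 2.0):
    emp = np.mean(np.abs(X - mean) >= lam)
    print(f"lam={lam:.1f}: {emp:.5f} <= {2 * np.exp(-lam**2 / (2 * sigma2)):.5f}")
```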

A.2.2 Lemma. Let Z be a d-dimensional random vector with the standard normal distribution. Then for every Lipschitz function f: ℝ^d → ℝ with ‖f‖_Lip ≤ 1,

P( f(Z) − med f(Z) > λ ) ≤ (1/2) exp( −λ²/2 ),

P( f(Z) − Ef(Z) > λ ) ≤ exp( −λ²/2 ).

Proof. For the first inequality, consider the set A = {z: f(z) ≤ med f(Z)}. Since f has Lipschitz norm bounded by 1, the set A^λ of points at distance at most λ from A is contained in the set {z: f(z) ≤ med f(Z) + λ}. It follows that

P( f(Z) ≤ med f(Z) + λ ) ≥ P( Z ∈ A^λ ).

By definition of the median, the set A has probability at least 1/2 under the standard normal distribution. According to the isoperimetric inequality for the normal distribution, Corollary A.2.9, a half-space H with boundary at the origin is an extreme set in the sense that P(Z ∈ A^λ) ≥ P(Z ∈ H^λ) for every λ > 0, for any other measurable set A with probability at least 1/2. The proof of the first inequality of the lemma is complete upon noting that P(Z ∈ H^λ) = Φ(λ) and 1 − Φ(λ) ≤ (1/2) exp(−λ²/2).
For the proof of the second inequality, assume without loss of generality that Ef(Z) = 0. An arbitrary Lipschitz function f can be approximated by arbitrarily smooth functions with compact supports and of no larger Lipschitz norms. Therefore, it is no loss of generality to assume that f is sufficiently regular. In particular assume without loss of generality that f is differentiable with gradient ∇f uniformly norm bounded by 1.
Let Z_t be normally distributed with mean zero and covariance matrix (1 − e^{−2t}) I. For t ≥ 0, define functions P_t f by

P_t f(x) = E f( e^{−t} x + Z_t ).

The operators P_t form a semigroup (P_0 = I and P_s P_t = P_{s+t}) on a domain that includes all sufficiently regular functions: the Ornstein-Uhlenbeck or Hermite semigroup defined via Mehler's formula. The generator of the semigroup is the derivative A = (d/dt) P_t |_{t=0} and has the property that (d/dt) P_t = A P_t. We shall need the integration-by-parts formula

−E f(Z) A g(Z) = E ⟨∇f(Z), ∇g(Z)⟩.

This is valid for sufficiently regular functions f and g.
For fixed r > 0, define G(t) = E exp( r P_t f(Z) ). Then G(0) = E exp( r f(Z) ) is the moment-generating function of f(Z) and G(∞) = 1. Differentiating under the expectation and using the integration-by-parts formula, we obtain

−G′(t) = −r E e^{r P_t f(Z)} A P_t f(Z) = r E ⟨ ∇e^{r P_t f(Z)}, ∇P_t f(Z) ⟩ = r² E e^{r P_t f(Z)} ‖∇P_t f(Z)‖².

Differentiating under the expectation in the definition of P_t f, we see that the gradient ∇P_t f(x) is equal to E ∇f( e^{−t} x + Z_t ) e^{−t} and its norm is bounded above by sup_x ‖∇f(x)‖ e^{−t}. Substitution in the preceding display yields the inequality −G′(t) ≤ r² e^{−2t} G(t). Equivalently,

(log G)′(t) ≥ ( (1/2) r² e^{−2t} )′.

Since the functions log G and (1/2) r² e^{−2t} are both zero at ∞ (and hence equal there), it follows that log G is smaller than (1/2) r² e⁰ = (1/2) r² at zero. This concludes the proof that E exp( r f(Z) ) = G(0) ≤ exp( (1/2) r² ).
By Markov's inequality, it follows that P( f(Z) ≥ λ ) ≤ exp( (1/2) r² − rλ ) for every r > 0. Optimize over r (take r = λ) to conclude the proof of the lemma.

The process X having bounded sample paths is the same as ‖X‖ being a finite random variable. In that case the median M(X) is certainly finite and σ(X) is finite by the argument in the proof of the preceding proposition. Next, the inequalities in the preceding proposition show that ‖X‖ has moments of all orders. In fact, we have the following proposition.

A.2.3 Proposition. Let X be a mean-zero, separable Gaussian process such that ‖X‖ is finite almost surely. Then σ²(X) < ∞ and

E exp( α‖X‖² ) < ∞ if and only if 2ασ²(X) < 1.

Proof. It is shown in the preceding proof that σ(X) ≤ M(X)/Φ^{−1}(3/4). The sufficiency of the condition 2ασ²(X) < 1 follows by integrating Borell's inequality. The necessity may be proved by considering the individual X_t, whose normal distributions can be handled explicitly.

Thus, for Gaussian processes, finiteness of lower moments implies finiteness of higher moments. Higher moments can even be bounded by lower ones up to universal constants. The following proposition reverses the usual Liapunov inequality for moments.

A.2.4 Proposition. There exist constants K_{p,q} depending on 0 < p ≤ q < ∞ only such that

( E‖X‖^q )^{1/q} ≤ K_{p,q} ( E‖X‖^p )^{1/p},

for any separable Gaussian process X for which ‖X‖ is finite almost surely.
The distribution of a mean-zero Gaussian process is completely determined by its covariance function. According to Anderson's lemma, Lemma 3.12.4, a smaller covariance function indicates that the process is more concentrated near zero. The following theorem is in the same spirit and shows that smaller variances of differences imply a stochastically smaller maximum value.

A.2.5 Proposition (Slepian, Fernique, Marcus, and Shepp). Let X and Y be separable, mean-zero Gaussian processes indexed by a common index set T such that

E(X_s − X_t)² ≤ E(Y_s − Y_t)², for all s, t ∈ T.

Then E sup_{t∈T} X_t ≤ E sup_{t∈T} Y_t, and also E Φ( sup_{s,t} |X_s − X_t| ) ≤ E Φ( sup_{s,t} |Y_s − Y_t| ), for every convex, increasing Φ: [0, ∞) → [0, ∞). If, moreover, EX_t² = EY_t² for every t ∈ T, then E‖X‖ ≤ 2 E‖Y‖ and

P( sup_{t∈T} X_t ≥ λ ) ≤ P( sup_{t∈T} Y_t ≥ λ ), for all λ > 0.

Finally, if T is a compact metric space and Y has a version with continuous sample paths, then so does X.

Proof. For the probability bound, see Ledoux and Talagrand (1991), Corollary 3.12. This implies the bound E sup_{t∈T} X_t ≤ 2 E sup_{t∈T} Y_t, as shown in Corollary 3.14 of the same reference. The improved inequality without the factor 2 and the comparison of the moduli require more elaborate arguments, which can be found in the original papers.
If X_t = 0 for some t ∈ T, then sup_t |X_t| ≤ sup_{s,t} (X_s − X_t) and hence E sup_t |X_t| ≤ 2 E sup_t X_t, by the symmetry of the normal distribution. If the variances of the processes X and Y are equal (or at least EX_t² ≤ EY_t²), then we can add zero variables X_∞ and Y_∞ to the processes, giving processes X̄ and Ȳ with extended index set T̄ = T ∪ {∞} such that E(X̄_s − X̄_t)² ≤ E(Ȳ_s − Ȳ_t)², for every s, t ∈ T̄. Then E sup_t |X_t| ≤ 2 E sup_t X̄_t ≤ 2 E sup_t Ȳ_t, by the first assertion.
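A simulation illustrating the first comparison, not part of the original text: two stationary covariances with equal variances, where Y has the larger increment variances and hence the larger expected supremum. The exponential covariances are arbitrary illustrative choices.

```python
# Monte Carlo illustration of the Slepian comparison: larger
# increment variances give a larger expected supremum.
import numpy as np

rng = np.random.default_rng(7)
k, reps = 30, 100_000
t = np.arange(k)

CX = np.exp(-np.abs(t[:, None] - t[None, :]) / 20.0)   # X: more correlated
CY = np.exp(-np.abs(t[:, None] - t[None, :]) / 5.0)    # Y: less correlated

LX, LY = np.linalg.cholesky(CX), np.linalg.cholesky(CY)
Z = rng.standard_normal((reps, k))
print(f"E sup X_t = {(Z @ LX.T).max(axis=1).mean():.3f}")
print(f"E sup Y_t = {(Z @ LY.T).max(axis=1).mean():.3f}")
```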

For a centered Gaussian process {X_t : t ∈ T}, let ρ denote its standard deviation semimetric on T, defined by

ρ(s, t) = σ(X_s − X_t), s, t ∈ T.

Let N(ε, T, ρ) be the covering number of T with respect to ρ. By Corollary 2.2.8, the expectation E‖X‖ can be bounded above by an entropy integral with respect to ρ. Sudakov's inequality gives a bound in the opposite direction.

A.2.6 Proposition (Sudakov's inequality). For every mean-zero, separable Gaussian process X,

sup_{ε>0} ε √( log N(ε, T, ρ) ) ≤ √(2π log 2) E sup_{t∈T} X_t.

Moreover, if almost all sample paths of X are bounded and uniformly continuous on (T, ρ), then ε √( log N(ε, T, ρ) ) → 0 as ε → 0.

Proof. Let t_1, ..., t_N be a maximal set of points in T with ρ(t_i, t_j) ≥ ε, for all i ≠ j. Then the balls of radius ε around the t_i cover T, and hence N(ε, T, ρ) ≤ N. If Z_1, ..., Z_N are i.i.d. standard normal random variables, then E( εZ_i/√2 − εZ_j/√2 )² = ε² ≤ E( X_{t_i} − X_{t_j} )², for 1 ≤ i ≠ j ≤ N, and hence Slepian's inequality, Proposition A.2.5, gives that E sup_{t∈T} X_t ≥ E max_{1≤i≤N} εZ_i/√2.
Thus the problem is reduced to proving that E max_{1≤i≤N} Z_i ≥ c_N := (π log 2)^{−1/2} (log N)^{1/2}. For N = 1 this is clear, while for N = 2 the inequality can be verified (with equality) by direct calculation. By Corollary A.2.8 applied to the convex function z ↦ max_i z_i, the expected value of max_{1≤i≤N} Z_i is bounded below by its median, which can be directly calculated to be equal to Φ^{−1}(2^{−1/N}). Thus the proof can be completed by showing that P(Z_1 ≥ c_N) > 1 − 2^{−1/N}, for all N ≥ 3.
For the final assertion, see Lemma 2.10.17.
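Sudakov's inequality is easy to check on the extreme case used in the proof: T = {1, ..., N} with i.i.d. standard normal coordinates, where ρ(s, t) = √2 for s ≠ t, and hence N(ε, T, ρ) = N for ε < √2. The sketch below is not from the original text, and the values of N are arbitrary.

```python
# Numerical illustration of Sudakov's inequality for N i.i.d.
# standard normal coordinates.
import numpy as np

rng = np.random.default_rng(8)
reps = 5_000
for N in (10, 100, 1000):
    lhs = np.sqrt(2) * np.sqrt(np.log(N))   # sup_eps eps sqrt(log N(eps, T, rho))
    esup = rng.standard_normal((reps, N)).max(axis=1).mean()
    rhs = np.sqrt(2 * np.pi * np.log(2)) * esup
    print(f"N={N:5d}: {lhs:.3f} <= {rhs:.3f}")
```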

A.2.7 Proposition (Ehrhard, Latala). If Z is a standard normal vector in ℝ^d, then, for any convex Borel set A ⊂ ℝ^d and Borel set B ⊂ ℝ^d, and any µ ∈ [0, 1],

Φ^{−1}( P(Z ∈ µA + (1−µ)B) ) ≥ µ Φ^{−1}( P(Z ∈ A) ) + (1−µ) Φ^{−1}( P(Z ∈ B) ).

Proof. Ehrhard (1993) proved the inequality when both A and B are convex. The generalization given here is due to Latala (1996).

A.2.8 Corollary. If f: ℝ^d → ℝ is convex, then the function t ↦ Φ^{−1}( P(f(Z) ≤ t) ) is concave. Furthermore, the median of the variable f(Z) is smaller than its mean Ef(Z), if the latter exists.

Proof. The sets C(t) = {z: f(z) ≤ t} are convex and, by convexity of f, satisfy C( µs + (1−µ)t ) ⊃ µC(s) + (1−µ)C(t), for any s, t ∈ ℝ and µ ∈ [0, 1]. The given function is h(t) = Φ^{−1}( P(Z ∈ C(t)) ), and hence h( µs + (1−µ)t ) ≥ Φ^{−1}( P(Z ∈ µC(s) + (1−µ)C(t)) ). This is lower bounded by µh(s) + (1−µ)h(t), by Proposition A.2.7, applied with the sets A = C(s) and B = C(t).
The median m of f(Z) satisfies P(f(Z) < m) ≤ 1/2 ≤ P(f(Z) ≤ m), which implies that h(m−) ≤ 0 ≤ h(m). Since the function h is continuous in view of its concavity, it must satisfy h(m) = 0. Because h is increasing from −∞ to ∞, its gradients are positive and hence there exists b > 0 such that h(t) ≤ b(t − m), for every t ∈ ℝ. Then P( f(Z) ≤ t ) = Φ( h(t) ) ≤ Φ( b(t − m) ), for every t, whence f(Z) is stochastically larger than a Gaussian variable with mean m. In particular Ef(Z) ≥ m.

For A ⊂ ℝ^d and λ > 0, let A^λ = {x ∈ ℝ^d: ‖x − A‖ < λ} be the λ-enlargement of A. The Gaussian isoperimetric inequality says that half-spaces minimize, among sets of equal measure, the growth of their probability under the standard normal distribution when enlarging the set. A half-space is understood to be a set of the form {x ∈ ℝ^d: ⟨x, u⟩ ≤ r}, for some u ∈ ℝ^d and r ∈ ℝ.

A.2.9 Corollary (Isoperimetric inequality). For any Borel set A ⊂ ℝ^d and half-space H ⊂ ℝ^d such that P(Z ∈ A) = P(Z ∈ H), we have P(Z ∈ A^λ) ≥ P(Z ∈ H^λ), for any λ > 0. In particular, if P(Z ∈ A) ≥ 1/2, then P(Z ∉ A^λ) ≤ 1 − Φ(λ), for any λ > 0.

Proof. Let B be the unit ball in ℝ^d. Since λB = (1−µ)( λ/(1−µ) )B and ( λ/(1−µ) )B is convex, Proposition A.2.7 gives, for µ ∈ (0, 1) and λ > 0,

Φ^{−1}( P(Z ∈ µA + λB) ) ≥ µ Φ^{−1}( P(Z ∈ A) ) + (1−µ) Φ^{−1}( P(Z ∈ (λ/(1−µ))B) ).

As µ ↑ 1, the sets µA + λB increase to A + λB = A^λ, and hence the left side increases to Φ^{−1}( P(Z ∈ A^λ) ). The first term on the right clearly increases to Φ^{−1}( P(Z ∈ A) ), while the second term on the right is (1−µ) Φ^{−1}( P( ‖Z‖ ≤ λ/(1−µ) ) ), and tends to λ as µ ↑ 1, as follows from the fact that Φ^{−1}(v) ∼ F^{−1}(v) ∼ ( 2 log_+ 1/(1−v) )^{1/2}, as v ↑ 1, for F the distribution function of ‖Z‖, which is the square root of a chi-square variable, or by repeated application of l'Hôpital's rule. Thus, letting µ ↑ 1 in the preceding display, and applying Φ left and right of the limits, we conclude that P(Z ∈ A^λ) ≥ Φ( Φ^{−1}( P(Z ∈ A) ) + λ ).
By the rotation symmetry of the standard normal distribution it is not a loss of generality that the half-space in the assumption takes the form H = {z ∈ ℝ^d: z_1 ≤ r}. In that case P(Z ∈ H) = Φ(r), and H^λ = {z ∈ ℝ^d: z_1 ≤ r + λ}, so that P(Z ∈ H^λ) = Φ(r + λ). If P(Z ∈ A) = P(Z ∈ H), then the right side of the final inequality in the preceding paragraph reduces to P(Z ∈ H^λ).
The final statement of the corollary follows from the inequality P(Z ∈ A^λ) ≥ Φ(r + λ), where r ≥ 0 since P(Z ∈ A) ≥ 1/2, so that Φ(r + λ) ≥ Φ(λ).

A.2.10 Corollary (Anderson's inequality in the Gaussian case). For any convex, symmetric, measurable subset C ⊂ ℝ^d and any a ∈ ℝ^d, we have P(Z ∈ C + a) ≤ P(Z ∈ C).

Proof. Since (1/2)(C + a) + (1/2)(C − a) = C and the sets are convex by assumption, Proposition A.2.7 gives Φ^{−1}( P(Z ∈ C) ) ≥ (1/2) Φ^{−1}( P(Z ∈ C + a) ) + (1/2) Φ^{−1}( P(Z ∈ C − a) ). The symmetry of C and Z gives that P(Z ∈ C − a) = P(Z ∈ C + a), so that the right side is equal to Φ^{−1}( P(Z ∈ C + a) ). The corollary follows upon applying Φ left and right of the inequality.
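A simulation of Anderson's inequality, not part of the original text: shifting a symmetric convex set (here a Euclidean ball, an arbitrary choice) away from the origin can only decrease its standard normal measure.

```python
# Monte Carlo check of Anderson's inequality: P(Z in C + a)
# decreases as the shift |a| grows, for C a centered ball.
import numpy as np

rng = np.random.default_rng(9)
d, reps, r = 5, 500_000, 2.0
Z = rng.standard_normal((reps, d))

a = np.zeros(d)
for shift in (0.0, 0.5, 1.0, 2.0):
    a[0] = shift
    p = np.mean(((Z - a) ** 2).sum(axis=1) <= r ** 2)   # P(Z in C + a)
    print(f"|a| = {shift:.1f}: P(Z in C + a) = {p:.4f}")
```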

The isoperimetric inequality for the standard normal distribution extends to Gaussian random elements with values in a general separable Banach space. To every such Gaussian random element corresponds a so-called reproducing kernel Hilbert space, which may be considered embedded in the Banach space. Let H_1 be the closed unit ball of this reproducing kernel Hilbert space.

A.2.11 Proposition (Geometric Borell's inequality). For any zero-mean Gaussian random element X in a separable Banach space B and any measurable set A ⊂ B, we have P( X ∈ A + λH_1 ) ≥ Φ( Φ^{−1}( P(X ∈ A) ) + λ ), for any λ > 0.


Proof. We can represent the Gaussian variable as X = Σ_{j=1}^∞ Z_j h_j, for i.i.d. standard normal variables Z_1, Z_2, ... and an orthonormal basis h_1, h_2, ... of the reproducing kernel Hilbert space, where the series converges in B, almost surely. (See e.g. Van der Vaart and Van Zanten (2008), Section 4.) Then the reproducing kernel Hilbert space is the set of all series Σ_{j=1}^∞ ξ_j h_j, with ξ = (ξ_j) ∈ ℓ_2, normed by the ℓ_2-norm: ‖ξ‖_2 =: ‖Σ_{j=1}^∞ ξ_j h_j‖_H. Thus Z = (Z_j) belongs with probability one to the set B_0 ⊂ ℝ^∞ of all x ∈ ℝ^∞ such that Σ_{j=1}^∞ x_j h_j converges in B. Let A_0 ⊂ B_0 be the set of coefficients x of such converging series with limit belonging to A. From the measurability of the maps x ↦ Σ_{j∈J} x_j h_j for finite sets J of indices, it can be ascertained that both B_0 and A_0 are measurable in ℝ^∞. Clearly P(X ∈ A) = P(Z ∈ A_0), and also P( X ∈ A + λH_1 ) ≥ P( ‖Z − A_0‖_2 ≤ λ ), since ‖x − A_0‖_2 ≤ λ implies that x = a + λξ, for a ∈ A_0 and ξ with ‖ξ‖_2 ≤ 1, so that ‖Σ_j x_j h_j − Σ_j a_j h_j‖_H ≤ ‖λξ‖_2 ≤ λ. Thus it suffices to show that P( ‖Z − A_0‖_2 ≤ λ ) ≥ Φ( Φ^{−1}( P(Z ∈ A_0) ) + λ ), for any λ > 0.
If A_0 is compact in ℝ^∞, then λ_n := inf_{a∈A_0} Σ_{j=1}^n (z_j − a_j)² ↑ ‖z − A_0‖_2², as n → ∞, for any z ∈ ℝ^∞. Indeed, the numbers λ_n are nondecreasing in n and ‖z − A_0‖_2² is certainly not smaller than lim_n λ_n. A sequence a_n ∈ A_0 constructed such that Σ_{j=1}^n (z_j − a_{n,j})² ≤ λ_n + 1/n, for every n, must have a converging subsequence a_{n′} → a ∈ A_0, and then Σ_{j=1}^m (z_j − a_j)² = lim_{n′} Σ_{j=1}^m (z_j − a_{n′,j})² ≤ lim_{n′} (λ_{n′} + 1/n′), for every fixed m. By monotone convergence ‖z − a‖_2² ≤ lim_n λ_n.
It follows that, for compact A_0 ⊂ ℝ^∞, by the portmanteau theorem,

P( ‖Z − A_0‖_2 ≤ λ ) ≥ limsup_{n→∞} P( inf_{a∈A_0} Σ_{j=1}^n (Z_j − a_j)² ≤ λ² ).

The right side is P( ‖Z⃗_n − A_n‖_2 ≤ λ ), for Z⃗_n = (Z_1, ..., Z_n), and A_n = {(a_1, ..., a_n): a ∈ A_0}, the projection of A_0 on its first n coordinates. By Corollary A.2.9 (see its proof for the equivalent formulation needed) this is bounded below by Φ( Φ^{−1}( P(Z⃗_n ∈ A_n) ) + λ ), for any λ > 0. Since the event {Z⃗_n ∈ A_n} contains the event {Z ∈ A_0}, this is further bounded below by Φ( Φ^{−1}( P(Z ∈ A_0) ) + λ ), for any λ > 0.
This concludes the proof when A_0 is compact. Because every Borel measure on ℝ^∞ is inner regular, for a general A_0 there exists a sequence of compact sets K_n ⊂ A_0 with P(Z ∈ K_n) ↑ P(Z ∈ A_0). The argument of the preceding paragraph applied with K_n instead of A_0 gives that P( ‖Z − K_n‖_2 ≤ λ ) ≥ Φ( Φ^{−1}( P(Z ∈ K_n) ) + λ ). Since K_n ⊂ A_0, the left side is smaller than P( ‖Z − A_0‖_2 ≤ λ ), for every n, while the right side increases to Φ( Φ^{−1}( P(Z ∈ A_0) ) + λ ), as n → ∞.

A.2.2 Exponential Bounds

Obtaining exponential bounds for ‖X‖ = sup_{t∈T} |X_t| without the centering by M(X) or E‖X‖, as in Proposition A.2.1, requires hypotheses giving control of E‖X‖ (or equivalently of M(X)). The following theorem, due to Talagrand, is an example of results of this type.
For a given Gaussian process X indexed by an arbitrary set T, we denote by ρ its "natural semimetric" ρ(s, t) = σ(X_s − X_t). Recall that σ(X) is the supremum of the standard deviations σ(X_t). Let Φ̄(z) = ∫_z^∞ φ(x) dx ≤ z^{−1} φ(z) be the tail probability of a standard normal variable.

A.2.12 Proposition. Let X be a separable, mean-zero Gaussian process such that, for some K > σ(X), some V > 0, and some 0 < ε_0 ≤ σ(X),

N(ε, T, ρ) ≤ ( K/ε )^V, 0 < ε < ε_0.

Then there exists a universal constant D such that, for all λ ≥ σ²(X)(1 + √V)/ε_0,

P( sup_{t∈T} X_t ≥ λ ) ≤ ( DKλ / (√V σ²(X)) )^V Φ̄( λ/σ(X) ).

This is closely related to Theorems 2.14.25 and 2.14.29 for empirical processes. Another result of Talagrand for Gaussian processes, which is parallel to Theorem 2.14.30 in the case of empirical processes, is as follows.

A.2.13 Proposition. Let X be a separable, zero-mean Gaussian process, and for δ > 0, let

T_δ = { t ∈ T: EX_t² ≥ σ²(X) − δ² }.

Let V ≥ W ≥ 1, and suppose that, for all 0 < δ² ≤ σ²(X) and all 0 < ε ≤ δ(1 + √V)/√W,

(A.2.14) N(ε, T_δ, ρ) ≤ K δ^W ε^{−V}.

Then there exists a universal constant D such that, for λ ≥ 2σ(X)√W,

P( sup_{t∈T} X_t ≥ λ ) ≤ K^{V/2} D^{V+W} ( W/V )^{W/2} ( λ/σ²(X) )^{V−W} Φ̄( λ/σ(X) ).

For more results in this vein, see Talagrand (1994), Samorodnitsky (1991), Adler (1990), and Adler and Brown (1986). For particular Gaussian random fields, both the renewal theoretic methods of Siegmund (1988, 1992) and the "Poisson clump heuristic" methods of Aldous (1989) yield results on the asymptotic behavior of P( sup_{t∈T} X_t ≥ λ ), as λ → ∞, that often seem to provide accurate approximations for λ on the order of 1.5σ(X). Here are some particular examples of the above theorems with connections to the results of Hogan and Siegmund (1986), Siegmund (1988, 1992), and Aldous (1989).

A.2.15 Example (Brownian sheet on [0,1]²). The standard Brownian sheet B is a zero-mean Gaussian process indexed by T = [0,1]², with covariance function

cov( B(s), B(t) ) = (s_1 ∧ t_1)(s_2 ∧ t_2).

Then σ²(B) = 1, and (A.2.14) holds with V = 4 and W = 4. Hence, for some constant M, and λ > 0,

P( sup_{t∈T} B_t ≥ λ ) ≤ M λ^{−1} exp( −λ²/2 ).

In fact, it was shown by Goodman (1976) that, for all λ > 0,

4 ∫_λ^∞ z Φ̄(z) dz ≤ P( sup_{t∈T} B_t ≥ λ ) ≤ 4 Φ̄(λ).

These inequalities imply that, for λ → ∞,

P( sup_{t∈T} B_t ≥ λ ) ∼ 4 Φ̄(λ) = 4 P( B(1,1) ≥ λ ) ∼ √(8/π) λ^{−1} exp( −λ²/2 ).
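The discrete-grid simulation below, not part of the original text, compares the supremum of a Brownian sheet with Goodman's upper bound 4Φ̄(λ) = 2 erfc(λ/√2). Grid size and levels are arbitrary choices, and discretization biases the simulated supremum slightly downward.

```python
# Monte Carlo comparison of the Brownian-sheet supremum on a grid
# with Goodman's upper bound 4*Phibar(lambda).
import math
import numpy as np

rng = np.random.default_rng(10)
m, chunks, chunk = 50, 10, 1000              # 50 x 50 grid, 10,000 paths
lams = np.array([1.5, 2.0, 2.5])
hits = np.zeros(len(lams))

for _ in range(chunks):
    W = rng.standard_normal((chunk, m, m)) / m   # i.i.d. N(0, 1/m^2) increments
    B = W.cumsum(axis=1).cumsum(axis=2)          # sheet values on the grid
    sup = B.max(axis=(1, 2))
    hits += (sup[:, None] >= lams).sum(axis=0)

for lam, p in zip(lams, hits / (chunks * chunk)):
    print(f"lam={lam}: {p:.4f} <= {2 * math.erfc(lam / math.sqrt(2)):.4f}")
```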

A.2.16 Example (Brownian bridge on [0,1]²). The Brownian sheet B⁰ pinned down to 0 at t = (1,1) is a zero-mean Gaussian process indexed by [0,1]², with covariance function

cov( B⁰(s), B⁰(t) ) = (s_1 ∧ t_1)(s_2 ∧ t_2) − s_1 s_2 t_1 t_2.

Alternatively, B⁰ can be defined from the Brownian sheet of Example A.2.15, by setting

B⁰(t_1, t_2) = B(t_1, t_2) − t_1 t_2 B(1,1).

This is just the P-Brownian bridge process G_P indexed by the collection of sets { [0, t]: t ∈ [0,1]² }, with P equal to the uniform distribution (Lebesgue measure) on [0,1]². Then σ²(B⁰) = 1/4, and this supremum is achieved for every t ∈ [0,1]² with t_1 t_2 = 1/2. It can be shown that (A.2.14) holds with V = 4 and W = 1. Therefore, for some constant M,

P( sup_{t∈T} B⁰_t ≥ λ ) ≤ M λ² exp( −2λ² ).

It has been shown by Hogan and Siegmund (1986) and also by Aldous (1989), page 202, that as λ → ∞,

P( sup_{t∈T} B⁰_t ≥ λ ) ∼ (4 log 2) λ² exp( −2λ² ).

In this case Goodman (1976) established the lower bound

P( sup_{t∈T} B⁰_t ≥ λ ) ≥ (1 + 2λ²) exp( −2λ² ), λ > 0.

The asymptotic formula of Hogan and Siegmund becomes greater than Goodman's lower bound for λ ≥ ( 2(log 4 − 1) )^{−1/2} = 1.138.... For this and some numerical comparisons, see Adler (1990), Chapter V.

A.2.17 Example (Tucked Brownian sheet on [0,1]²). The tucked Brownian sheet is the zero-mean Gaussian process indexed by [0,1]², with covariance function

cov( B″(s), B″(t) ) = (s_1 ∧ t_1 − s_1 t_1)(s_2 ∧ t_2 − s_2 t_2).

This can be obtained from the Brownian sheet given in Example A.2.15, by pinning it down to 0 on the entire boundary of [0,1]², i.e.

B″(t_1, t_2) = B(t_1, t_2) − t_1 B(1, t_2) − t_2 B(t_1, 1) + t_1 t_2 B(1,1).

This process arises in testing independence and in connection with the estimation of copula functions; see Chapters 3.9 and 3.10. In this case σ²(B″) = 1/16, and this supremum is uniquely achieved for t = (1/2, 1/2). It can be shown that (A.2.14) holds with V = 4 and W = 2. Therefore, for some constant M,

P( sup_{t∈T} B″_t ≥ λ ) ≤ M λ exp( −8λ² ).

It follows from the methods of Siegmund (1988, 1992), or alternatively from Aldous (1989), that as λ → ∞,

P( sup_{t∈T} B″_t ≥ λ ) ∼ 4√(2π) λ exp( −8λ² ).
A.2.18 Example (Kiefer-Müller process on [0,1]²). The Kiefer-Müller process Z on [0,1]² can be obtained from a Brownian sheet B by way of

Z(t_1, t_2) = B(t_1, t_2) − t_2 B(t_1, 1).

Then Z has covariance function

cov( Z(s_1, t_1), Z(s_2, t_2) ) = (s_1 ∧ s_2)(t_1 ∧ t_2 − t_1 t_2).

This process is a special case of the P-Kiefer process in Chapter 2.12 with P the uniform distribution on [0,1] and F = { 1_{[0,t]}: t ∈ [0,1] }. In this case σ²(Z) = 1/4, and this is uniquely achieved for (s, t) = (1, 1/2). It can be shown that (A.2.14) holds with V = 4 and W = 3. Therefore, for some constant M,

P( sup_{t∈T} Z_t ≥ λ ) ≤ M exp( −2λ² ).

It follows from the methods of Siegmund (1988, 1992), or alternatively from Aldous (1989), that as λ → ∞,

P( sup_{t∈T} Z_t ≥ λ ) ∼ 2 exp( −2λ² ).

A.2.19 Example (Brownian bridge indexed by convex subsets of [0,1]²). Consider the P-Brownian bridge process with P uniform on [0,1]², indexed by the collection C of all convex subsets of [0,1]². It follows from Corollary 2.7.11 (with d = r = 2) that

(A.2.20) log N_{[ ]}( ε, C, L_2(P) ) ≤ K (1/ε),

and hence C is P-pre-Gaussian. Although neither Proposition A.2.12 nor A.2.13 applies in this case, the methods of Samorodnitsky (1991) apply, and show that for some constants C and D,

(A.2.21) P( sup_{C∈C} G_P(C) ≥ λ ) ≤ C exp( Dλ^{2/3} ) exp( −2λ² ).

More generally, the methods of Samorodnitsky (1991) show that if the power 1 in (A.2.20) is replaced by V, then the power 2/3 in (A.2.21) must be replaced by 2V/(V + 2).
A.2.3 Majorizing Measures

A separable, zero-mean Gaussian process {X_t : t ∈ T} indexed by an arbitrary index set T is sub-Gaussian with respect to its standard deviation semimetric ρ(s, t) = σ(X_s − X_t). Thus Corollary 2.2.8 yields, for every δ > 0,

E sup_{ρ(s,t)<δ} |X_s − X_t| ≲ ∫_0^δ √( log N(ε, T, ρ) ) dε,

where N(ε, T, ρ) is the covering number of T with respect to ρ. If the entropy integral on the right is finite, then it follows immediately that the process is uniformly ρ-continuous in mean. This conclusion may be strengthened to the uniform continuity of almost all sample paths with the help of the Borel-Cantelli lemma (Problem 2.2.17).
While a finite entropy integral is sufficient for the sample-path continuity of a Gaussian process, it is not necessary. Majorizing measures may be considered a refinement of entropy numbers and yield a necessary and sufficient condition for sample-path continuity.

A.2.22 Proposition. Let {X_t : t ∈ T} be a separable zero-mean Gaussian process with standard deviation semimetric ρ. Then almost all sample paths t ↦ X_t are uniformly ρ-continuous and bounded if and only if

(A.2.23) lim_{δ↓0} sup_{t∈T} ∫_0^δ √( log 1/µ( B(t, ε) ) ) dε = 0,

for some Borel probability measure µ on (T, ρ). Here B(t, ε) is the ρ-ball of radius ε around t.

A measure µ for which the integral in the theorem is finite is called a majorizing measure. A further result on majorizing measures is that finiteness of the integral characterizes the boundedness of the sample paths. Among the bounded processes the uniformly continuous processes are characterized by majorizing measures with the additional property that the majorizing measure integral is continuous at zero, as in (A.2.23). A chaining argument may establish that for any probability measure µ and every δ, η > 0,‡

E sup_t |X_t − X_{t_0}| ≲ sup_t ∫_0^∞ √( log 1/µ( B(t, ε) ) ) dε,

E sup_{ρ(s,t)<δ} |X_s − X_t| ≲ sup_t ∫_0^η √( log 1/µ( B(t, ε) ) ) dε + δ √( log N(η, T, ρ) ).

‡ Ledoux and Talagrand (1991), pages 320–321, and (11.15) on page 317.
This explains the name "majorizing measure" and readily yields one direction of the preceding proposition. In the other direction, the proposition is harder to prove. This converse part is used in the proof of Theorem 2.11.11 in combination with the following lemma.

A.2.24 Lemma. Let T be a semimetric space and µ be a Borel probability measure on T such that (A.2.23) holds. Then there exists a probability measure m on T and a sequence of nested, measurable partitions T = ∪_i T_{qi} such that diam T_{qi} ≤ 2^{−q} for every i and

lim_{q_0→∞} sup_{t∈T} Σ_{q>q_0} 2^{−q} √( log 1/m(T_q t) ) = 0,

where T_q t is the partitioning set T_{qi} at level q to which t belongs.

Proof. The set T is totally bounded: the existence of infinitely many disjoint balls B(t_i, δ) of fixed radius δ > 0 would require that µ( B(t_i, δ) ) → 0 as i → ∞. However,

sup_t δ √( log 1/µ( B(t, δ) ) ) ≤ sup_t ∫_0^δ √( log 1/µ( B(t, ε) ) ) dε.

The right-hand side is finite by assumption.
Fix δ > 0. Find a point t_1 that (nearly) maximizes µ( B(t, δ) ) over t ∈ T. Precisely, let µ( B(t_1, δ) ) ≥ r sup_t µ( B(t, δ) ) for some fixed 0 < r ≤ 1. Set m{t_1} = µ( B(t_1, δ) ), and let T_1 = B(t_1, 2δ) be the ball of radius 2δ around t_1. For every t ∈ T_1 (even for every t ∈ T),

r µ( B(t, δ) ) ≤ µ( B(t_1, δ) ) = m{t_1} ≤ m(T_1).

Next, find a t_2 that nearly maximizes µ( B(t, δ) ) over t ∈ T − T_1. Set m{t_2} = µ( B(t_2, δ) ), and let T_2 = B(t_2, 2δ) − T_1. For every t ∈ T_2,

r µ( B(t, δ) ) ≤ µ( B(t_2, δ) ) = m{t_2} ≤ m(T_2).

Continue this process, constructing disjoint sets T_1, T_2, ..., T_I, until the set of t with ρ(t, t_i) ≥ 2δ for every i is empty. Then T = ∪_i T_i is a finite partition of T in sets of diameter at most 2δ and m(T_i) ≥ r µ( B(t, δ) ) for every t ∈ T_i.
Apply this construction for δ = 2^{−q−1} and every natural number q. This yields a sequence of partitions T = ∪_i T̄_{qi} and measures m̄_q such that

Σ_{q>q_0} 2^{−q} √( log 1/m̄_q( T̄_q t ) ) ≤ Σ_{q>q_0} 2^{−q} √( log 1/( r µ( B(t, 2^{−q−1}) ) ) ).

The right side is bounded by a multiple of

∫_0^{2^{−q_0}} √( log 1/µ( B(t, ε) ) ) dε

if µ( B(t, ε) ) is bounded away from 1, which may be assumed.
The present sequence of partitions need not be nested. Replace the qth partition by the partition in the sets ∩_{p=1}^q T̄_{p,i_p}, where i_1, ..., i_q range over all possible values. For each set in the qth partition, define

m_q( ∩_{p=1}^q T̄_{p,i_p} ) = ∏_{p=1}^q m̄_p( T̄_{p,i_p} ).

(The exact location of the mass is irrelevant.) Next, set m = Σ_q 2^{−q} m_q. Then m is a subprobability measure and

Σ_{q>q_0} 2^{−q} √( log 1/m(T_q t) ) ≤ Σ_{q>q_0} 2^{−q} √( q log 2 + Σ_{p=1}^q log 1/m̄_p( T̄_p t ) ).

This converges to zero uniformly in t as q_0 → ∞.

A.2.4 Further Results

Many M-estimators are asymptotically distributed as the point of maximum of a Gaussian process. This point of maximum is not always easy to evaluate. The following proposition shows that a point of absolute maximum is unique, if it exists.

A.2.25 Proposition. Let {X_t : t ∈ T} be a Gaussian process with continuous sample paths, indexed by a σ-compact metric space T. If var(X_s − X_t) ≠ 0 for every s ≠ t, then almost every sample path achieves its supremum at at most one point.

Proof. See Kim and Pollard (1990), Lemma 2.6.



Problems and Complements

1. For every separable Gaussian process, σ²(X) ≤ M²(X)/Φ^{−1}(3/4)².
[Hint: See the proof of Proposition A.2.1.]

2. For every separable Gaussian process, E‖X‖^p ≤ K_p M(X)^p for a constant depending on p only.
[Hint: Integrate Borell's inequality to bound E|‖X‖ − M(X)|^p. Next, use the preceding problem.]

3. For every separable Gaussian process, |E‖X‖ − M(X)| ≤ √(π/2) σ(X).

4. Suppose that F and G are pre-Gaussian classes of measurable functions. Use majorizing measures to show that F ∪ G is pre-Gaussian.

5. Every tight, Borel measurable, centered Gaussian variable X in ℓ∞(T) satisfies P( ‖X‖ < ε ) > 0 for every ε > 0.
[Hint: By Anderson's lemma, P( ‖X − x‖ ≤ ε ) ≤ P( ‖X‖ ≤ ε ) for every x. The support of X can be covered by countably many balls.]

6. The Brownian sheet B satisfies the tail bound, for every λ > 0,

P( sup_{0≤t_1,t_2≤1} B(t_1, t_2) ≥ λ ) ≤ 4( 1 − Φ(λ) ).

[Hint: Use Lévy's inequality twice to upper bound the probability by 2 P( sup_{0≤t_2≤1} B(1, t_2) ≥ λ ) ≤ 4 P( B(1,1) ≥ λ ). (The reflection principle for Brownian motion implies that the second inequality actually holds with equality.)]

7. The Kiefer-Müller process Z satisfies the tail bound, for every λ > 0,

P( sup_{0≤t_1,t_2≤1} Z(t_1, t_2) ≥ λ ) ≤ 2 exp( −2λ² ).

[Hint: Use Lévy's inequality to bound the tail probability from above by 2 P( sup_{0≤t_2≤1} Z(1, t_2) ≥ λ ). Since the process ( Z(1, t_2): 0 ≤ t_2 ≤ 1 ) is a standard Brownian bridge process on [0,1], this can be further bounded using the known form of the distribution of the one-sided supremum (see Doob (1949), or Shorack and Wellner (1986), page 34).]
A.3
Rademacher Processes

Let ε_1, ..., ε_n be Rademacher variables and x_1, ..., x_n arbitrary real functions on some "index" set T. The Rademacher process ( Σ_{i=1}^n ε_i x_i(t): t ∈ T ) shares several properties with Gaussian processes. In this chapter we restrict ourselves to two propositions that have been used in Part 2. See, for instance, Ledoux and Talagrand (1991) for a further discussion.
Let ‖x‖ be the supremum norm sup_{t∈T} |x(t)| for a given function x: T → ℝ. Define

σ² = sup_{t∈T} var Σ_{i=1}^n ε_i x_i(t) = ‖ Σ_{i=1}^n x_i² ‖.

The tail probabilities of the norm of a Rademacher process are comparable to those of Gaussian processes. The following inequalities are similar to the Borell inequalities, although it appears likely that the constants can be improved.

A.3.1 Proposition. Let M be a median of the norm of the Rademacher process Σ_{i=1}^n ε_i x_i. Then there exists a universal constant C such that, for every λ > 0,

P( | ‖Σ ε_i x_i‖ − M | > λ ) ≤ 4 exp( −λ²/8σ² ),

P( | ‖Σ ε_i x_i‖ − E‖Σ ε_i x_i‖ | > λ ) ≤ C exp( −λ²/9σ² ).


Proof. For the first inequality, see Ledoux and Talagrand (1991), Theorem 4.7, on page 100.
To obtain the second inequality, set µ = E‖Σ ε_i x_i‖. Integrate over the first inequality to obtain |µ − M| ≤ 4√(2π) σ. For cλ > |µ − M|, the event in the second inequality is contained in the event { | ‖Σ ε_i x_i‖ − M | > (1 − c)λ }. Choose c > 0 sufficiently small to bound the probability of this event by 4 exp( −λ²/9σ² ). For cλ ≤ |µ − M|, the right side in the second inequality is never smaller than C exp( −(µ − M)²/9c²σ² ), which is at least 1 for sufficiently large C.

A.3.2 Proposition (Contraction principle). For each i, let x_i take its values in the interval [−1, 1], and let φ_i: [−1, 1] → ℝ be Lipschitz of norm 1 with φ_i(0) = 0. Then

E ‖ Σ ε_i φ_i(x_i) ‖ ≤ 2 E ‖ Σ ε_i x_i ‖.

Proof. See Ledoux and Talagrand (1991), Theorem 4.12, on page 112.

The last proposition with the choices φ_i(x) = (1/2)x² is applied in Chapter 2.14.5 to obtain E ‖ Σ ε_i x_i² ‖ ≤ 4 E ‖ Σ ε_i x_i ‖.
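A numerical check of this consequence, not from the original text; the [−1, 1]-valued functions x_i on a finite index set are arbitrary illustrative choices.

```python
# Monte Carlo check of E||sum eps_i x_i^2|| <= 4 E||sum eps_i x_i||,
# the phi_i(x) = x^2 / 2 instance of the contraction principle.
import numpy as np

rng = np.random.default_rng(11)
n, T, reps = 20, 40, 100_000

x = rng.uniform(-1, 1, size=(n, T))            # x_i: T -> [-1, 1]
eps = rng.choice([-1.0, 1.0], size=(reps, n))  # Rademacher signs

lhs = np.abs(eps @ (x ** 2)).max(axis=1).mean()   # E||sum eps_i x_i^2||
rhs = 4 * np.abs(eps @ x).max(axis=1).mean()      # 4 E||sum eps_i x_i||
print(f"{lhs:.3f} <= {rhs:.3f}")
```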
A.4
Isoperimetric Inequalities for Product Measures

Let (𝒳^n, 𝒜^n) be the product of n copies of a measurable space (𝒳, 𝒜). Given points x = (x_1, ..., x_n) and y = (y_1, ..., y_n), say that y controls a subset of coordinates {x_i: i ∈ I} of x if the corresponding coordinates of y are the same: y_i = x_i for every i ∈ I. Given vectors y¹, ..., y^q, let f(y¹, ..., y^q, x) be the number of coordinates of x that are not controlled by any y^j. Given subsets A_1, ..., A_q of 𝒳^n, define f(A_1, ..., A_q, x) to be the minimum of the numbers f(y¹, ..., y^q, x) when the y^j range over A_j. Thus n − f(A_1, ..., A_q, x) is the maximal number of coordinates of x that can be controlled by the clever choice of y^j from A_j. Also, if f(A_1, ..., A_q, x) = k, then there exist points y^j ∈ A_j such that the number of indices i ∈ {1, ..., n} with x_i ∉ {y_i¹, ..., y_i^q} is k (and a choice of y^j giving more control is impossible).
A basic inequality giving an upper bound on the size of the control functions f(A_1, ..., A_q, x) is as follows.

A.4.1 Proposition. If the coordinates of X = (X_1, ..., X_n) are i.i.d. random elements in 𝒳, then for any measurable sets A_1, ..., A_q in 𝒜^n and any integer q ≥ 1,

E* q^{f(A_1,...,A_q,X)} ≤ ∏_{l=1}^q P(X ∈ A_l)^{−1}.

Consequently, for every set A ∈ 𝒜^n with P(X ∈ A) ≥ 1/2,

P*( f(A, ..., A, X) ≥ k ) ≤ 2^q / q^k ≤ (2/q)^k, k ≥ q.

Here q copies of A appear on the left side.
Proof. It will first be shown that, for any random variables U_1, ..., U_q taking values in [0, 1],

E( q ∧ min_{1≤l≤q} U_l^{−1} ) ∏_{l=1}^q E U_l ≤ 1.

Here 1/0 is understood to be infinite. Indeed, the random variable U defined by U^{−1} = q ∧ min_{1≤l≤q} U_l^{−1} satisfies 1/q ≤ U ≤ 1. Since U_l ≤ U for every l, the left side is bounded above by E U^{−1} (EU)^q. Since log x ≤ x − 1, the logarithm of this expression is bounded by EU^{−1} − 1 + q EU − q. This is bounded above by zero, because on the interval [1/q, 1] the function z ↦ z^{−1} + qz attains a maximum value of 1 + q (at the boundary points).
The proof of the theorem proceeds by induction on n. For n = 1 the control number f(A_1, ..., A_q, X) equals 0 if X is contained in some A_l and 1 otherwise. Thus

q^{f(A_1,...,A_q,X)} = q ∧ min_{1≤l≤q} 1_{A_l}(X)^{−1}.

The inequality in the first paragraph yields the result.


Suppose the theorem is true for n, and consider a vector X and subsets A_l in 𝒳^{n+1}. Write Y and Z for the first n coordinates and the last coordinate of X = (Y, Z), respectively. Furthermore, for a subset A of 𝒳^{n+1}, define the sections and projection relative to the first n coordinates by

A(Z) = { Y ∈ 𝒳^n: (Y, Z) ∈ A },
B = { Y ∈ 𝒳^n: (Y, Z) ∈ A, for some Z }.

If y¹, ..., y^q control all except k coordinates of Y, then (y¹, z¹), ..., (y^q, z^q) control all except at most k + 1 coordinates of (Y, Z); they control all except k coordinates of (Y, Z) if one z^j is chosen equal to Z. Therefore, it follows that, for every l ≤ q,

f( A_1, ..., A_q, X ) ≤ 1 + f( B_1, ..., B_q, Y ),
f( A_1, ..., A_q, X ) ≤ f( B_1, ..., B_{l−1}, A_l(Z), B_{l+1}, ..., B_q, Y ).

Write E_Y for the expectation taken with respect to the first n coordinates only and let P be the distribution of Y. The first inequality of the display yields

E_Y q^{f(A_1,...,A_q,X)} ≤ q E_Y q^{f(B_1,...,B_q,Y)} ≤ q ∏_{l=1}^q P(B_l)^{−1},

by the induction hypothesis. Similarly, the second inequality together with the induction hypothesis implies that, for every 1 ≤ l ≤ q,

E_Y q^{f(A_1,...,A_q,X)} ≤ ∏_{j≠l} P(B_j)^{−1} P( A_l(Z) )^{−1}.

Combine the last two displays to see that their common left side is bounded above by q ∧ min_{1≤l≤q} Z_l^{−1} times ∏_{l=1}^q P(B_l)^{−1}, for the random variables Z_l = P( A_l(Z) )/P(B_l). An application of the inequality in the first paragraph completes the proof.
Actually the preceding proof ignores issues of measurability and is false in general, because maps of the type X ↦ f( A_1, ..., A_q, X ) need not be measurable, as required for the application of Fubini's theorem.
With some effort it can be seen that this problem does not arise in the case that the sample space is Polish with Borel σ-field and compact sets A_1, ..., A_q. This follows from the identity

{ x: n − f( A_1, ..., A_q, x ) ≤ k } = ∩_{|∪I_i|>k} ∪_{i=1}^q { x: π_{I_i} x ∉ π_{I_i} A_i },

where the intersection is taken over all collections of q subsets I_1, ..., I_q of {1, 2, ..., n} and π_I is the projection π_I: 𝒳^n → 𝒳^I on the Ith coordinates. (Note that π_I x ∈ π_I A if and only if there exists y ∈ A that controls {x_i: i ∈ I}.) Projections of compact sets are compact, hence measurable. Thus, in this special situation the proof is correct.
Next, the case of general Borel sets in a Polish (product) space can be obtained by approximation from within by compact sets. By replacing every set A_i by a compact K_i ⊂ A_i, the control numbers increase (there is less control), whence the left side of the theorem increases. Next, apply the theorem and note that the resulting upper bound decreases to the right side of the theorem if the probabilities P(A_i − K_i) are made to decrease to zero. The latter is possible by the regularity of Borel measures on Polish spaces.
The case of Polish spaces is sufficient for most applications but can be extended to arbitrary sample spaces. We sketch the extension.
Since any event in 𝒜^n is contained in the completion of the product of sub-σ-fields of 𝒜 that are countably generated, it is no loss of generality to assume that 𝒜 is countably generated, say with generator the sets A_1, A_2, .... Then the map φ: 𝒳 → {0,1}^∞ given by

φ(x) = ( 1_{A_1}(x), 1_{A_2}(x), ... )

is Borel measurable into the Polish space {0,1}^∞ ≡ R and 𝒜 = φ^{−1}(ℬ) for the Borel sets ℬ. Let φ_n: 𝒳^n → R^n be the map (x_1, ..., x_n) ↦ ( φ(x_1), ..., φ(x_n) ), and write P for the underlying measure on 𝒳.
By construction there exists for every A ∈ 𝒜^n a Borel set B ∈ ℬ^n with A = φ_n^{−1}(B). Suppose that there exists a Borel set B′ ⊂ B of the same measure under (P ∘ φ^{−1})^n with the property that for every z′ ∈ B′, there exists z ∈ B ∩ φ(𝒳)^n such that z_i = z_i′ for all i such that z_i′ ∈ φ(𝒳). Thus, the coordinates of z′ that are not in φ(𝒳) can be changed, meanwhile leaving the other coordinates the same, so as to obtain a point in B with all coordinates in φ(𝒳). A vector in B with all coordinates in φ(𝒳) is the image of some element in A and it can be seen that

{ x: f( B_1′, ..., B_q′, φ_n(x) ) ≤ k } ⊂ { x: f( A_1, ..., A_q, x ) ≤ k }.

Consequently,

(P^n)*( f( A_1, ..., A_q, x ) > k ) ≤ ( (P ∘ φ^{−1})^n )*( z: f( B_1′, ..., B_q′, z ) > k ).

Since (P ∘ φ^{−1})^n(B′) = P^n(A), the problem has been reduced to the case of a Polish sample space.
The existence of the sets B′ can be argued as follows. For every partition {1, ..., n} = I ∪ J, define

B_{I,J} = { x ∈ B: (P ∘ φ^{−1})^J( B(π_I x) ) = 0 }.

Here B(π_I x) is the section of B at π_I x, which is contained in R^J. By Fubini's theorem, each set B_{I,J} is a (P ∘ φ^{−1})^n null set. Thus, B′ = B − ∪_{I,J} B_{I,J} has the same measure as B. If z′ ∈ B′, then

(P ∘ φ^{−1})^J( B(π_I z′) ) > 0; ( (P ∘ φ^{−1})^J )*( φ(𝒳)^J ) = 1.

This means that the set B(π_I z′) ∩ φ(𝒳)^J cannot be empty. If the coordinates {z_i′: i ∈ J} are not contained in φ(𝒳), then they can be changed in the desired manner.

A typical application of the control numbers f(A_1, ..., A_q, X) is as
follows. For natural numbers n, let S_n: X^n → ℝ be permutation-symmetric
functions that satisfy

    S_n(x) ≤ S_{n+m}(x, y),             for every x ∈ X^n, y ∈ X^m,
(A.4.2)
    S_{n+m}(x, y) ≤ S_n(x) + S_m(y),    for every x ∈ X^n, y ∈ X^m.

An example is the functions S_n = E_ε ‖ Σ_{i=1}^n ε_i f(x_i) ‖_F for
Rademacher random variables ε_i. Given sets A_1, ..., A_q in X^n, the
elements of a random sample X_1, ..., X_n in X can be split into q + 1 sets,
the first one consisting of the k = f(A_1, ..., A_q, X) variables not
controlled by the sets A_l and the other q sets consisting of variables that
also occur in some Y^l ∈ A_l for l = 1, ..., q. By the "triangle inequality"
in (A.4.2), the number S_n(X_1, ..., X_n) can be bounded by the sum of q + 1
terms. By the first inequality in (A.4.2), the last q terms can be bounded
by S_n(Y^l) for l = 1, ..., q. It follows that on the set
f(A_1, ..., A_q, X) = k,

    S_n(X) ≤ sup_{x ∈ X^k} S_k(x) + Σ_{l=1}^q sup_{y ∈ A_l} S_n(y).

The first term on the right should be small if k is small, while the size of
the second term can be influenced by the choice of the control sets A_l.†
† The full force of the definition of f(A_1, ..., A_q, X) is not used for this
application. The control numbers could be reduced to the minimal k such that
there exist y^l ∈ A_l with at most k of the X_i not occurring in the set
{ y^l_i : l = 1, ..., q, i = 1, ..., n }.
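The Rademacher example can be made concrete by simulation. The sketch below
(our own illustration, not part of the text) estimates
S_n(x) = E_ε ‖ Σ_i ε_i f(x_i) ‖_F for a small ad hoc class F by Monte Carlo
and checks the two properties in (A.4.2) numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
F = [np.sin, np.cos, np.tanh]          # ad hoc finite class of functions

def S(x, reps=20_000):
    """Monte Carlo estimate of E_eps sup_{f in F} |sum_i eps_i f(x_i)|."""
    vals = np.array([f(x) for f in F])                   # |F| x n matrix
    eps = rng.choice([-1.0, 1.0], size=(reps, len(x)))   # Rademacher signs
    return np.abs(eps @ vals.T).max(axis=1).mean()

x, y = rng.normal(size=10), rng.normal(size=5)
xy = np.concatenate([x, y])
# Up to simulation error: S(x) <= S(xy) and S(xy) <= S(x) + S(y), as in (A.4.2).
print(S(x), S(xy), S(x) + S(y))
```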

A.4.3 Lemma. Let S_n: X^n → [0, n] be permutation-symmetric functions
satisfying (A.4.2). Then, for every t > 0 and n,

    P∗( S_n ≥ t ) ≤ exp( −½ t log( t / (12 (E∗S_n ∨ 1)) ) ).

This is true for any S_n = S_n(X) and any vector X = (X_1, ..., X_n) with
i.i.d. coordinates.

Proof. Set µ = E∗S_n. By Markov's inequality, the set A = {S_n ≤ 2µ}∗ has
probability at least ½. Since S_n ≤ n by assumption, the argument preceding
the lemma shows that S_n(X) ≤ f(A, ..., A, X) + 2qµ for any integer q ≥ 1.
Thus, the probability in the lemma is bounded by

    P∗( f(A, ..., A, X) ≥ t − 2qµ ) ≤ 2^q / q^{t − 2qµ} ≤ exp( q(2µ + 1) log q − t log q ),

for q ≥ 2 by Theorem A.4.1. For t ≥ 12(µ ∨ 1) and q = ⌊ t/(6(µ ∨ 1)) ⌋, the
exponent is bounded above by −½ t log⌊ t/(6(µ ∨ 1)) ⌋. The lemma follows,
since ⌊x⌋ ≥ ½x for x ≥ 1. For t ≤ 12(µ ∨ 1) the lemma is trivially true,
because the right side is larger than 1.
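For a quick numerical illustration of the lemma (our own sketch, not from
the text), take the permutation-symmetric choice S_n(x) = Σ_i x_i with
coordinates in {0, 1}; it takes values in [0, n] and satisfies (A.4.2), the
second inequality with equality:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 200, 0.002, 500_000

S = rng.binomial(n, p, size=reps).astype(float)  # S_n(X) = sum of 0/1 coordinates
mu_or_1 = max(S.mean(), 1.0)                     # E*S_n v 1 (here roughly max(np, 1))

for t in [5.0, 15.0, 25.0]:
    empirical = (S >= t).mean()
    bound = np.exp(-0.5 * t * np.log(t / (12 * mu_or_1)))
    print(f"t={t}: P(S_n >= t) ~ {empirical:.1e}, bound = {bound:.2e}")
# For t <= 12(mu v 1) the bound exceeds 1 and is trivially valid.
```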
A.5
Some Limit Theorems

In this chapter we present the strong law of large numbers and the central
limit theorem uniformly in the underlying measure, as well as the rank
central limit theorem.
Let X̄n be the average of the first n variables from a sequence
X1 , X2 , . . . of independent and identically distributed random vectors in
Rd . Let P be a class of underlying probability measures. For instance, let
Xi be the ith coordinate projection of (Rd )∞ , and let P consist of Borel
probability measures on Rd .

A.5.1 Proposition. Let X_1, X_2, ... be i.i.d. random variables with
distribution P ∈ P such that

    lim_{M→∞} sup_{P∈P} E_P |X_1| 1{ |X_1| ≥ M } = 0.

Then the strong law of large numbers holds uniformly in P ∈ P in the sense
that, for every ε > 0,

    lim_{n→∞} sup_{P∈P} P_P( sup_{m≥n} |X̄_m − E_P X_1| ≥ ε ) = 0.

Proof. See Chung (1951).
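For intuition, the uniform law can be probed by simulation. A rough sketch
(our own, with an ad hoc two-element family P and the supremum over m ≥ n
truncated at a finite horizon):

```python
import numpy as np

rng = np.random.default_rng(2)
N, reps, n0, eps = 2000, 500, 1000, 0.1
family = {"Exp(1) - 1": lambda s: rng.exponential(1.0, s) - 1.0,
          "Uniform(-1, 1)": lambda s: rng.uniform(-1.0, 1.0, s)}

for name, draw in family.items():
    exceed = 0
    for _ in range(reps):
        x = draw(N)                                   # centered draws, E X_1 = 0
        xbar = np.cumsum(x) / np.arange(1, N + 1)     # Xbar_m for m = 1, ..., N
        exceed += np.abs(xbar[n0 - 1:]).max() >= eps  # sup over n0 <= m <= N
    print(name, exceed / reps)                        # small for large n0
```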

The Prohorov distance π between probability laws on R^d dominates
twice the bounded Lipschitz distance and generates the weak topology. The
following theorem gives an explicit upper bound on the Prohorov distance
between the distribution of √n X̄_n and the limiting normal distribution.

This can easily be converted into a central limit theorem uniformly in P
ranging over P. Write π(X, Y) for the Prohorov distance between the
distributions of random vectors X and Y.

A.5.2 Proposition. Let X_1, ..., X_n be i.i.d. random vectors with mean
zero and finite covariance matrix Σ. Then for every ε > 0,

    π( √n X̄_n, N(0, Σ) ) ≤ 2ε^{−2} g(ε√n) ∨ ε + 4^{1/3} g(ε√n)^{1/3}
                            + C ( dεg(0) )^{1/4} ( 1 + log( εg(0)/d ) )^{1/2}.

Here C is an absolute constant and g(ε) = E‖X‖² 1{ ‖X‖ ≥ ε }.

Proof. For a given ε ∈ (0, 1) and 1 ≤ i ≤ n, let Y_i be the truncated
variable X_i 1{ ‖X_i‖ ≤ ε√n } and Z_i the centered variable Y_i − EY_i. By
Strassen's theorem, it follows that

    π( √n X̄_n, √n Ȳ_n ) ∨ π( √n Ȳ_n, √n Z̄_n ) ≤ (ε^{−2} ∨ ε^{−1}) E‖X‖² 1{ ‖X‖ ≥ ε√n } ∨ ε.

Furthermore, E‖Z_i‖³ ≤ 8E‖Y_i‖³ ≤ 8ε√n E‖X_i‖². Thus, by Theorem 1 of
Yurinskii (1977), as corrected in Theorem B on page 395 of Dehling (1983),

    π( √n Z̄_n, N(0, Σ_Z) ) ≲ ( dεg(0) )^{1/4} ( 1 + log( 8εg(0)/d ) )^{1/2},

where Σ_Z is the covariance matrix of Z_1. Finally,

    π( N(0, Σ_Z), N(0, Σ) )³ ≤ 4E‖X‖² 1{ ‖X‖ ≥ ε√n },

by Lemma 2.1, page 402, of Dehling (1983).
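The bound is explicit enough to evaluate numerically. A sketch (our own
illustration; g is estimated by Monte Carlo, and since the absolute constant
C is not specified, C = 1 is used purely for scale):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, C = 2, 10_000, 1.0
X = rng.standard_t(df=5, size=(500_000, d))   # mean zero, finite covariance
norms2 = (X ** 2).sum(axis=1)

def g(r):  # g(r) = E ||X||^2 1{||X|| >= r}, estimated empirically
    return (norms2 * (norms2 >= r * r)).mean()

for eps in [0.3, 0.5, 0.8]:  # chosen so that the log term is nonnegative
    t1 = max(2 * eps ** -2 * g(eps * np.sqrt(n)), eps)
    t2 = 4 ** (1 / 3) * g(eps * np.sqrt(n)) ** (1 / 3)
    t3 = C * (d * eps * g(0)) ** 0.25 * np.sqrt(1 + np.log(eps * g(0) / d))
    print(f"eps={eps}: bound = {t1 + t2 + t3:.3f}")
```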

For each n, let an1 , . . . , ann and bn1 , . . . , bnn be real numbers, and let
(Rn1 , . . . , Rnn ) be a random vector that is uniformly distributed on the n!
permutations of {1, . . . , n}. Consider the rank statistic

    S_n = Σ_{i=1}^n b_{ni} a_{n,R_{ni}}.

The mean and variance of S_n are equal to ES_n = n ā_n b̄_n and
var S_n = A_n² B_n² / (n − 1), where A_n² and B_n² are the sums of squared
deviations from the mean of the numbers a_{n1}, ..., a_{nn} and
b_{n1}, ..., b_{nn}, respectively. Thus A_n² = Σ_{i=1}^n (a_{ni} − ā_n)².

A.5.3 Proposition (Rank central limit theorem). Suppose that

    max_{1≤i≤n} |a_{ni} − ā_n| / A_n → 0;    max_{1≤i≤n} |b_{ni} − b̄_n| / B_n → 0.

Then the sequence (S_n − ES_n)/σ(S_n) converges in distribution to a
standard normal distribution if and only if

    Σ_{(i,j): √n |a_{ni} − ā_n| |b_{nj} − b̄_n| > ε A_n B_n} |a_{ni} − ā_n|² |b_{nj} − b̄_n|² / (A_n² B_n²) → 0,    for every ε > 0.

Proof. See Hájek (1961).
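The theorem is easy to visualize by simulation; a sketch (our own, with ad
hoc score sequences satisfying the negligibility conditions):

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 20_000
a = np.sqrt(np.arange(1, n + 1) / n)   # ad hoc scores a_{n1}, ..., a_{nn}
b = np.linspace(-1.0, 1.0, n)          # ad hoc coefficients b_{n1}, ..., b_{nn}

A2 = ((a - a.mean()) ** 2).sum()
B2 = ((b - b.mean()) ** 2).sum()
mean = n * a.mean() * b.mean()         # E S_n = n abar_n bbar_n
sd = np.sqrt(A2 * B2 / (n - 1))        # sigma(S_n)

S = np.array([b @ a[rng.permutation(n)] for _ in range(reps)])
Z = (S - mean) / sd
print(Z.mean(), Z.std(), (Z <= 1.96).mean())   # ~ 0, ~ 1, ~ Phi(1.96) = 0.975
```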


A.6
More Inequalities

This chapter collects well-known and less-known inequalities for binomial,
multinomial, and Rademacher random variables. Some of the results are
related to the inequalities obtained by other means in Chapters 2.2 and 2.14.

A.6.1 Binomial Random Variables


Throughout this subsection Ȳn is the average of independent Bernoulli vari-
ables Y1 , . . . , Yn with success probability p.
One of the simplest inequalities for the tails of sums of bounded random
variables —in particular, for the binomial distribution— is that of Hoeffding
(1963).

A.6.1 Proposition (Hoeffding's inequality). Let X_1, ..., X_n be
independent random variables taking values in [0, 1]. Let µ = EX̄_n. Then
for 0 < t < 1 − µ,

    P( X̄_n − µ ≥ t ) ≤ ( ( µ/(µ + t) )^{µ+t} ( (1 − µ)/(1 − µ − t) )^{1−µ−t} )^n
                      ≤ exp( −n t² g(µ) )
                      ≤ exp( −2n t² ).

Here g(µ) is equal to (1 − 2µ)^{−1} log(1/µ − 1) for 0 < µ < 1/2, and to
( 2µ(1 − µ) )^{−1} for 1/2 ≤ µ < 1.

Proof. See Hoeffding (1963).



Applied to Bernoulli variables, Hoeffding's inequality yields

    P( √n |Ȳ_n − p| ≥ λ ) ≤ 2 exp( −2λ² ),    λ > 0.

This inequality is valid for any n and p. For p close to zero or one, it is
possible to improve on the factor 2 in the exponent considerably.
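A quick Monte Carlo check of this Bernoulli bound (our own sketch, assuming
NumPy):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, reps = 100, 0.3, 1_000_000
Ybar = rng.binomial(n, p, size=reps) / n

for lam in [0.5, 1.0, 1.5]:
    emp = (np.sqrt(n) * np.abs(Ybar - p) >= lam).mean()
    print(f"lam={lam}: empirical {emp:.4f} <= 2exp(-2lam^2) = "
          f"{2 * np.exp(-2 * lam ** 2):.4f}")
```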

A.6.2 Proposition (Bennett's inequality). For all n and 0 < p < 1,

    P( √n |Ȳ_n − p| ≥ λ ) ≤ 2 exp( −(λ²/(2p)) ψ( λ/√(np) ) ),    λ > 0.

Here ψ(x) = 2h(1 + x)/x², for h(x) = x(log x − 1) + 1.

Proof. See Bennett (1962), Hoeffding (1963), or Shorack and Wellner (1986),
page 440.

A.6.3 Corollary (Kiefer's inequality). For all n and 0 < p < e^{−1},

    P( √n |Ȳ_n − p| ≥ λ ) ≤ 2 exp( −λ² ( log(1/p) − 1 ) ).

Proof. Since the probability on the left side is zero if λ > √n, we may
assume that λ ≤ √n. Then ψ( λ/√(np) ) ≥ ψ(1/p), because ψ is decreasing.
Now apply Bennett's inequality and finish the proof by noting that
ψ(1/p)/(2p) ≥ log(1/p) − 1 for 0 < p < e^{−1}.

For p ↓ 0, the function log(1/p) − 1 increases to infinity. Thus Kiefer's
inequality gives bounds of the type 2 exp( −Cλ² ) with large constants C for
sufficiently small p: for example, 2 exp( −11λ² ) for p ≤ e^{−12}.
For p bounded away from zero and one, Hoeffding’s inequality can be
improved as well—at least for large values of λ.

A.6.4 Proposition (Talagrand's inequality). There exist constants K_1
and K_2 depending only on p_0, such that for p_0 ≤ p ≤ 1 − p_0 and all n,

    P( √n (Ȳ_n − p) = λ ) ≤ (K_1/√n) exp( −( 2λ² + λ⁴/(4n) ) ),    λ > 0,
    P( √n (Ȳ_n − p) ≥ t ) ≤ (K_2/λ) exp( −( 2λ² + λ⁴/(4n) ) ) exp( 5λ(λ − t) ),    0 < t < λ,
    P( √n (Ȳ_n − p) ≥ λ ) ≤ (K_2/λ) exp( −( 2λ² + λ⁴/(4n) ) ),    λ > 0.

Proof. Stirling's formula with bounds asserts that

    √(2πn) (n/e)^n e^{1/(12n+1)} ≤ n! ≤ √(2πn) (n/e)^n e^{1/(12n)}.

Hence it follows that, for 1 ≤ k ≤ n − 1,

(A.6.5)    (n choose k) p^k (1 − p)^{n−k} ≤ (1/√(2π)) ( n/(k(n − k)) )^{1/2} e^{−nΨ(u,p)}.

Here set u = k/n − p and define

    Ψ(u, p) = (u + p) log( (u + p)/p ) + ( 1 − (u + p) ) log( (1 − (u + p))/(1 − p) ).

The function Ψ satisfies Ψ(0, p) = 0, (∂/∂u)Ψ(0, p) = 0, and

    (∂²/∂u²) Ψ(u, p) = 4 / ( 1 − 4( u − (½ − p) )² ) ≥ 4( 1 + 4( u − (½ − p) )² ).

Integrate this inequality to obtain

    (∂/∂u) Ψ(u, p) ≥ 4u + (16/3)( ( u − (½ − p) )³ + (½ − p)³ ) ≥ 4u + (4/3)u³.

In the last step, use the fact that the function β ↦ (u − β)³ + β³ takes
its minimal value at β = u/2. Integrate again to obtain the inequality
Ψ(u, p) ≥ 2u² + u⁴/3.
For the proof of the first part of the proposition, put k = √n λ + np.
Then k ≥ np_0. Furthermore, for k ≤ n(1 − p_0/2), we have n − k ≥ np_0/2,
and the right-hand side of (A.6.5) is bounded above by

    (1/√(2π)) ( np_0²/2 )^{−1/2} e^{−n( 2u² + u⁴/3 )}.

For n(1 − p_0/2) ≤ k < n, we have u = k/n − p ≥ p_0/2, and the right-hand
side of (A.6.5) is bounded by

    ( 2π(1 − p_0/2) )^{−1/2} e^{−n( 2u² + u⁴/3 )} ≤ (K/√n) e^{−n( 2u² + u⁴/4 )},
        K = ( 2π(1 − p_0/2) )^{−1/2} sup_{n≥1} √n e^{−n(p_0/2)⁴/12}.

Finally, for k = n, the inequality of the proposition can be reduced to

    −log(1/p) + 2(1 − p)² + ¼(1 − p)⁴ ≤ (1/n) log( K_1/√n ).
The function h(p) on the left side of this last inequality is negative for
0 < p < 1, with h(0) = −∞ and h(1) = 0, and has a local maxi-
mum of −0.160144... at 0.344214... and a local minimum of −0.184428... at
0.598428... . It follows that the supremum of the left side over any interval
p_0 ≤ p ≤ 1 − p_0 is strictly less than zero. On the other hand, the infimum
of the right side over n is bounded below by −1/(2eK_1²). Thus, the
inequality is valid for sufficiently large K_1.
For the proof of the second part of the proposition, consider the function
h(x) = 2x² + x⁴/4. The derivative h′(x) = 4x + x³ is bounded below by 4x
and bounded above by 5x for x ≤ 1. Since h is convex,
h(x) ≥ h(u) + (x − u)h′(u) for all x. Apply the first inequality of the
proposition to find that, with k_0 = ⌈np + √n t⌉,

    P( √n (Ȳ_n − p) ≥ t ) ≤ Σ_{k≥k_0} (K_1/√n) e^{−n( h(u) + (k/n − p − u)h′(u) )}
        = (K_1/√n) e^{−nh(u)} Σ_{k≥k_0} e^{( nu − (k − np) ) h′(u)}
        ≤ (K_1/√n) e^{−nh(u)} e^{( nu − (k_0 − np) ) h′(u)} / ( 1 − e^{−h′(u)} )
        ≤ (K_1/√n) e^{−nh(u)} (1/(4u)) e^{5nu( u − t/√n )}.
This concludes the proof of the proposition.

A.6.2 Multinomial Random Vectors


In this section (N1 , . . . , Nk ) is a multinomially distributed vector with pa-
rameters n and (p1 , . . . , pk ). It is helpful to relate this to empirical processes.
Let {Ci }ki=1 be a partition of a set X and P the probability measure on
the σ-field C generated by the partition such that P (Ci ) = pi . If Pn is the
empirical measure corresponding  to a sample of size n from P , then the
vector nPn (C1 ), . . . , nPn (Ck ) has the same distribution as (N1 , . . . , Nk ).
The first two propositions concern the “L_1-” or “total variation
distance” Σ_{i=1}^k |N_i − np_i|. In the representation using the empirical
process, this is equivalent to the Kolmogorov-Smirnov distance

    ‖P_n − P‖_C = sup_{C∈C} (P_n − P)(C) = ½ Σ_{i=1}^k |P_n(C_i) − P(C_i)|.

A.6.6 Proposition (Bretagnolle–Huber-Carol inequality). If the random
vector (N_1, ..., N_k) is multinomially distributed with parameters n and
(p_1, ..., p_k), then

    P( Σ_{i=1}^k |N_i − np_i| ≥ 2√n λ ) ≤ 2^k exp( −2λ² ),    λ > 0.

Proof. In terms of the Kolmogorov-Smirnov statistic, the left side is equal
to P( ‖G_n‖_C ≥ λ ). Each of the probabilities P( √n (P_n − P)(C) > λ ) can
be bounded by exp( −2λ² ) by a one-sided version of Hoeffding's inequality.
There are 2^k sets in C. Alternatively, first reduce C by dropping all sets
with probability more than 1/2; this does not change ‖P_n − P‖_C. Now
the two-sided version of Hoeffding's inequality can be applied to the 2^{k−1}
remaining sets.
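A simulation sketch of the inequality for one multinomial configuration
(our own illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 500, 200_000
p = np.array([0.1, 0.2, 0.3, 0.4])
k = len(p)

N = rng.multinomial(n, p, size=reps)            # reps draws of (N_1, ..., N_k)
L1 = np.abs(N - n * p).sum(axis=1)              # total variation statistic

for lam in [1.0, 1.5, 2.0]:
    emp = (L1 >= 2 * np.sqrt(n) * lam).mean()
    print(f"lam={lam}: empirical {emp:.4f} <= 2^k exp(-2 lam^2) = "
          f"{2 ** k * np.exp(-2 * lam ** 2):.4f}")
```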

The following inequality results from combining Kiefer's inequality,
Corollary A.6.3, and Talagrand's inequality, Proposition A.6.4.

A.6.7 Proposition. If the random vector (N_1, ..., N_k) is multinomially
distributed with parameters n and (p_1, ..., p_k), then there is a universal
constant K such that

    P( Σ_{i=1}^k |N_i − np_i| ≥ 2√n λ ) ≤ (K/λ) 2^k exp( −2λ² ),    λ > 0.

Proof. The proof is slightly easier in the language of empirical processes
as introduced above. Let C_0 be the class of sets C ∈ C such that
P(C) < e^{−12}, and let C_1 be the class of sets with e^{−12} ≤ P(C) ≤ 1/2.
Then ‖G_n‖_C = ‖G_n‖_{C_0} ∨ ‖G_n‖_{C_1}. By the remark following Kiefer's
inequality, Corollary A.6.3, applied to ‖G_n‖_{C_0}, and Talagrand's
inequality, Proposition A.6.4, applied to ‖G_n‖_{C_1}, there exists a
universal constant K_2 such that

    P( ‖G_n‖_C ≥ λ ) ≤ 2^k · 2 exp( −11λ² ) + (K_2/λ) 2^k exp( −2λ² ).

The proposition follows easily.

The last inequality for multinomial random vectors concerns a slight
modification of the classical chi-square statistic (or weighted L_2-distance).

A.6.8 Proposition (Mason and van Zwet inequality). Let the random
vector (N_1, ..., N_k) be multinomially distributed with parameters n and
(p_1, ..., p_k) such that p_i > 0 for i < k. Then for every C > 0 and δ > 0
there exist constants a, b, c > 0, such that for all n ≥ 1 and
λ, p_1, ..., p_{k−1} satisfying λ ≤ Cn min{ p_i : 1 ≤ i ≤ k − 1 } and
Σ_{i=1}^{k−1} p_i ≤ 1 − δ, we have

    P( Σ_{i=1}^{k−1} (N_i − np_i)² / (np_i) > λ ) ≤ a exp( bk − cλ ).

We do not give a full derivation of this inequality here, but note that
the chi-square statistic is an example of a supremum over an elliptical class
of functions. Therefore, the preceding inequality can be deduced from Ta-
lagrand’s general empirical process inequality, Theorem 2.15.5.

A.6.3 Rademacher Sums


The following inequality improves Hoeffding’s inequality for Rademacher
sums, Lemma 2.2.7, in much the same way that Talagrand’s inequal-
ity improves on Hoeffding’s inequality. Let ε1 , . . . , εn denote independent
Rademacher variables.

A.6.9 Proposition (Pinelis-Bentkus-Dzindzalieta). For any numbers
a_1, ..., a_n with Σ_{i=1}^n a_i² = 1 and for c∗ = ( 4(1 − Φ(√2)) )^{−1} ≈ 3.18,

    P( Σ_{i=1}^n a_i ε_i ≥ λ ) ≤ c∗ ( 1 − Φ(λ) ) ≤ c∗ φ(λ)/λ,    λ > 0.
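A Monte Carlo sketch of the proposition for one unit vector (our own
illustration; c∗ is computed from the formula in the proposition, and only
the error function from the math module is assumed):

```python
import numpy as np
from math import erf, sqrt

def Phi(x):  # standard normal distribution function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(7)
n, reps = 50, 200_000
a = np.full(n, 1.0 / np.sqrt(n))                 # sum of a_i^2 = 1
S = rng.choice([-1.0, 1.0], size=(reps, n)) @ a  # Rademacher sums

c_star = 1.0 / (4.0 * (1.0 - Phi(sqrt(2.0))))    # ~ 3.18
for lam in [1.0, 2.0, 3.0]:
    emp = (S >= lam).mean()
    print(f"lam={lam}: empirical {emp:.2e} <= {c_star * (1 - Phi(lam)):.2e}")
```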

A.6.10 Proposition (Khinchine's inequality). For any real numbers
a_1, ..., a_n and any p ≥ 1, there exist constants A_p and B_p such that

    A_p ‖a‖_2 ≤ ‖ Σ_{i=1}^n a_i ε_i ‖_p ≤ B_p ‖a‖_2.

Problems and Complements


1. The function g(p) = ψ(1/p)/(2p) is bounded below by log(1/p) − 1 for
0 < p < e^{−1}. It is not bounded below by log(1/p) − c for any c < 1.
[Hint: ψ(1/p)/(2p) − (log(1/p) − 1) → 0 as p → 0.]

2. (Bennett’s inequality plus Jensen equals Bernstein’s inequality)


Show that the function ψ in Proposition A.6.2 is decreasing, ψ(0) = 1, xψ(x)
is increasing, and ψ(x) ≥ 1/(1 + x/3).
[Hint: Let f(x) = h(1 + x) = (x + 1) log(x + 1) − x, and write

    f(x) = ∫_0^1 x f′(xt) dt = x² ∫_0^1 ∫_0^1 1{0 ≤ s ≤ t ≤ 1} f″(xs) ds dt,

where f′(x) = log(1 + x) and f″(x) = 1/(1 + x). Hence

    ψ(x) = 2 ∫_0^1 ∫_0^1 1{0 ≤ s ≤ t ≤ 1} f″(xs) ds dt.

To prove the last inequality, interpret the right side of this identity as an
expectation.]

3. The probability in Kiefer's inequality is bounded below by
exp( −λ² (1 − p)^{−2} log(1/p) ) when λ = √n (1 − p).
4. The constant K resulting from the proof of Proposition A.6.7 may be min-
imized by choosing an optimal cut-off point instead of e−12 . This involves
understanding the dependence of the constant K2 in Proposition A.6.4 on
p0 .
5. Independent binomial(1, p_i) variables X_1, ..., X_n satisfy

    P( Σ_{i=1}^n X_i > k ) ≤ exp( −n p̄_n h( k/(n p̄_n) ) ) < ( e n p̄_n / k )^k,

for the function h(x) = x(log x − 1) + 1 and p̄_n the average of the success
probabilities.
A
Notes

A.1. Ottaviani (1939) proves his inequality for the case of real-valued
random variables. Hoffmann-Jørgensen (1974) proves his inequalities for
Banach-valued random variables. For both cases Dudley (1984) gives careful
proofs for random elements in nonseparable Banach spaces.
Proposition A.1.6 is due to Talagrand (1989).

A.2. The first inequality of Proposition A.2.1 is a consequence of finer
isoperimetric results obtained independently by Borell (1975) and Sudakov
and Tsirel’son (1978). The second inequality was found by Ibragimov, Su-
dakov, and Tsirel’son (1976) and rediscovered by Maurey and Pisier (see
Pisier (1986)). The present proof is based on the exposition by Ledoux
(1996). See Ledoux and Talagrand (1991), page 59, for the third inequality
of Proposition A.2.1 as well as other interesting results.
The moment inequality Proposition A.2.4 is given by Pisier (1986).
Adler (1990) gives a good summary of many of the Gaussian com-
parison results, including Sudakov’s minorization inequality, Dudley’s ma-
jorization inequality, and the more recent work on majorizing measures.
The inequalities of Proposition A.2.5 are due to Fernique after earlier work
by Slepian (1962), Marcus and Shepp (1971) and Landau and Shepp (1970).
The last assertion of Proposition A.2.5 was noted by Marcus and Shepp and
is stated explicitly in Jain and Marcus (1978).
The section on exponential bounds for Gaussian processes is based
mostly on Talagrand (1994). The interested reader should also consult
Samorodnitsky (1991), Adler and Samorodnitsky (1987), Adler (1990), and

Goodman (1976). The renewal theoretic large deviation methods of Hogan
and Siegmund (1986), Siegmund (1988, 1992) give precise results on the
asymptotic tail behavior for the suprema of particular Gaussian fields, espe-
cially those satisfying what Aldous (1989) calls the “uncorrelated orthogo-
nal increments” property. See Aldous (1989), Chapter J, pages 190–219 for a
variety of fascinating asymptotic formulas. Computation of the asymptotic
behavior in Example A.2.17 was carried out by David Siegmund (personal
communication).
The section on majorizing measures records work by Preston (1972),
Fernique (1974), and Talagrand (1987c). The sufficiency of the existence
of a (continuous) majorizing measure for the boundedness (and continu-
ity) of a process was established by Fernique (1974) following preliminary
results of Preston (1972). The necessity was proved for stationary Gaus-
sian processes by Fernique (1974) and for general Gaussian processes by
Talagrand (1987c). A relatively simple proof is given by Talagrand (1992).
Lemma A.2.24 is adapted from Talagrand (1987c) and Ledoux and Tala-
grand (1991), who formulate similar results in terms of ultrametrics. For
complete expositions of majorizing measures, see Adler (1990), pages 80ff.,
and Ledoux and Talagrand (1991), Chapters 11 and 12.

A.4. The basic inequality in this chapter was first proved by Talagrand
(1989) and Ledoux and Talagrand (1991). The present elegant proof by
induction is taken from Talagrand (1995), who discusses many extensions
and applications. The name “isoperimetric inequality” appears not entirely
appropriate. It resulted from the analogy with exponential inequalities for
Gaussian variables, which can be derived from isoperimetric inequalities for
the standard normal distribution. The latter assert that, among all sets A with
P(A) equal to a fixed number, the probability P(A^ε) increases the least as ε
increases from zero for half-spaces A of the form {x: ⟨x, u⟩ > c}. See Ledoux and Talagrand
(1991), Ledoux (1996) and Ledoux (2001) for further references.

A.5. In connection with Proposition A.5.2, consult the papers by Yurinskii
(1977) and Dehling (1983).

A.6. Bennett’s inequality was proved by Bennett (1962); the function


ψ was introduced by Shorack (1980). For Kiefer’s inequality, see Kiefer
(1961); the present proof from Bennett’s inequality is new. For Talagrand’s
inequality, see Talagrand (1994), and for the L1 -inequality for multinomial
variables, see Bretagnolle and Huber (1978). Mason and Van Zwet (1987)
prove inequality A.6.8 and use it to obtain an interesting refinement of
the Komlós-Major-Tusnády inequality. Proposition A.6.9 with the (opti-
mal) constant c∗ is due to Bentkus and Dzindzalieta (2015); Pinelis (1994)
and Pinelis (2007) proved the inequality with slightly worse constants; the
results improve on conjectures of Eaton (1974). Problem A.6.2 was com-
municated to us by David Pollard.
