
Weak convergence and empirical processes
Soumendu Sundar Mukherjee
Indian Statistical Institute, Kolkata
April 24, 2019

Warning: These course notes (for M.Stat. second year students) have not been sub-
ject to very careful scrutiny, and so there may be various typos/mistakes here and
there. Please let me know at soumendu041@gmail.com if you find one.

Contents

1 Review of metric topology
  1.1 Metric spaces
  1.2 Open and closed sets
  1.3 Continuity
  1.4 Limit points and accumulation points
  1.5 Denseness
  1.6 Convergence, Cauchy sequences and completeness
  1.7 Compactness
  1.8 Connectedness

2 Probability measures on metric spaces
  2.1 Preliminaries
  2.2 Weak convergence
  2.3 The mapping theorem and Skorohod's representation
  2.4 Relative compactness and Prohorov's theorem

3 Weak convergence in C[0, 1]
  3.1 Tightness in C[0, 1]
  3.2 Wiener measure and Donsker's theorem
  3.3 Some consequences of Donsker's theorem
  3.4 The (standard) Brownian bridge
  3.5 Brownian motion on [0, ∞)

4 Weak convergence in D[0, 1]
  4.1 The Skorohod topology

5 More on Empirical Processes
  5.1 Concentration inequalities
  5.2 Covering numbers and bracketing numbers
  5.3 Symmetrization
  5.4 Discretization and chaining
1 Review of metric topology


We will do a very condensed review of metric space topology here. There will be
numerous exercises which you should attempt to do on your own. Having a good
grasp of the concepts discussed in this section is necessary for this course. You may
consult Kumaresan (2005) and Munkres (1975) for more details.

1.1 Metric spaces


Let M be a non-empty set. ρ : M × M → R+ (= [0, ∞)) is called a metric if it satisfies
1. ρ( x, y) = ρ(y, x )∀ x, y ∈ M.
2. ρ( x, y) = 0 if and only if x = y.
3. ρ( x, y) ≤ ρ( x, z) + ρ(z, y)∀ x, y, z ∈ M.
Together the pair (M, ρ) is called a metric space.

Example 1.1. M = Rd, ρ = ℓp(x, y) = (∑_{i=1}^d |xi − yi|^p)^{1/p}, with 1 ≤ p < ∞. For p = ∞,
ℓ∞(x, y) = max_{1≤i≤d} |xi − yi|. Important special cases are the ℓ1, ℓ2, and ℓ∞ metrics. ℓ2 is the standard Euclidean metric on Rd.

Exercise 1.2. Show that ℓp(x, y) → ℓ∞(x, y) as p → ∞, for all x, y ∈ Rd.
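The following small Python sketch (added here for illustration; not part of the original development, numpy assumed) shows this convergence numerically: the ℓp distances decrease to the ℓ∞ distance as p grows.

```python
import numpy as np

def lp_dist(x, y, p):
    """l_p distance between two vectors; p = np.inf gives the max metric."""
    if np.isinf(p):
        return np.max(np.abs(x - y))
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([1.0, -2.0, 3.0]), np.array([0.0, 1.5, -1.0])
for p in [1, 2, 4, 8, 32, 128]:
    print(p, lp_dist(x, y, p))
print("inf", lp_dist(x, y, np.inf))  # the l_p values decrease to this limit
```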
Example 1.3. M is any non-empty set. Let

ρ(x, y) = 0 if x = y, and ρ(x, y) = 1 otherwise.

Then ρ is called the discrete metric, and (M, ρ) is called a discrete metric space.
Example 1.4. Let M = C [0, 1], the space of all real-valued continuous functions on [0, 1].
M becomes a metric space with the metric
ρ(f, g) = (∫₀¹ |f(t) − g(t)|^p dt)^{1/p},

1 ≤ p < ∞. For p = ∞, we get the uniform metric

ρ(f, g) = max_{t∈[0,1]} |f(t) − g(t)|.

C[0, 1] with the uniform metric will be a central object in this course. We will be interested in various topological properties of this metric space.
Example 1.5. Let M = (V, E) be a finite graph and define

ρ(v1 , v2 ) = length of the shortest path joining v1 and v2 .

If no such path exists, we define ρ(v1 , v2 ) = +∞ in accordance with the convention inf ∅ =
+∞.
When M is also a linear space (vector space), more structure, such as norms or inner products, can be given beyond just a metric.


1.2 Open and closed sets


Let x ∈ M and e > 0. Then

• B( x; e) := {y ∈ M | ρ( x, y) < e} is the open ball of radius e around x.

• B[ x; e] := {y ∈ M | ρ( x, y) ≤ e} is the closed ball of radius e around x.

• G ⊂ M is called open if for every point x ∈ G, there is an open ball B(x; e_x) ⊂ G.

• F ⊂ M is called closed if F c is open.

• An open subset U containing x ∈ M is called an (open) neighbourhood of x.

How do open and closed sets behave under union and intersections?

• Arbitrary union (resp. intersection) of open (resp. closed) sets is open (resp.
closed).

• Finite intersection (resp. union) of open (resp. closed) sets is open (resp. closed).

Let A ⊂ M. The closure of A, denoted by Ā, is the set ∩_{F⊃A, F closed} F, i.e. Ā is the “smallest” closed set containing A. The interior of A, denoted by A◦, is the set ∪_{G⊂A, G open} G, i.e. A◦ is the “largest” open set contained in A. The boundary of A, denoted by ∂A, is the set Ā \ A◦.

Exercise 1.6. Prove or disprove: (a) the closure of B(x; e) equals B[x; e], (b) B[x; e]◦ = B(x; e), and (c) ∂B[x; e] = ∂B(x; e) = S(x; e) := {y ∈ M | ρ(y, x) = e}.

Together, the collection of all open sets of M defines its topology.


Side remark: Countable intersections (resp. unions) of open (resp. closed) sets are called Gδ (resp. Fσ) sets. What about countable unions/intersections of Gδ or Fσ sets? One can actually build a hierarchy of sets in this way: Gδσ, Gδσδ, . . . , Fσδ, Fσδσ, . . . and so on.

Exercise 1.7. Show that every open (resp. closed) set is Fσ (resp. Gδ ).

Exercise 1.8. What are all the open (closed) sets in a discrete metric space?

1.3 Continuity
Let M1 , M2 be metric spaces, and f : M1 → M2 . We say that f is continuous at x if,
for any neighbourhood V of f ( x ) in M2 , there exists a neighbourhood U of x in M1
such that U ⊂ f −1 (V ).

Exercise 1.9. Reformulate this as the usual (e, δ)-definition.

If f is continuous at all points x ∈ M1 , then we simply say that f is continuous.

Exercise 1.10. Show that f : M1 → M2 is continuous if and only if, for all open V ⊂ M2 ,
f −1 (V ) is open in M1 .

Side remark. If, for all open U ⊂ M1, f(U) is open in M2, then we call f an open map. An invertible map that is continuous as well as open is called a homeomorphism.


Exercise 1.11. What are the continuous functions when M1 is discrete?


A function f : M1 → M2 is called uniformly continuous, if, for all e > 0, ∃δ > 0 such
that ρ1 ( x, y) < δ =⇒ ρ2 ( f ( x ), f (y)) < e.
Exercise 1.12. Of course, all uniformly continuous functions are continuous. Give an example
of a function that is continuous but not uniformly continuous.
Let A ⊂ M, x ∈ M. Define the distance of x from A as

ρ(x, A) := inf_{y∈A} ρ(x, y).

Exercise 1.13. Show that for any x, y ∈ M,

|ρ(x, A) − ρ(y, A)| ≤ ρ(x, y).

Hence, the map x ↦ ρ(x, A) is uniformly continuous.
Lemma 1.14 (Urysohn’s lemma). Let A, B be two disjoint closed subsets of M. Then there exists a continuous function f : M → R such that f ≡ 0 on A and f ≡ 1 on B. Thus, in particular, we can separate points by continuous functions¹.

Proof. The function x ↦ ρ(x, A)/(ρ(x, A) + ρ(x, B)) fits the bill.

Exercise 1.15. Let F ⊂ M be closed. The (uniformly) continuous function f_e(x) = (1 − ρ(x, F)/e)₊ satisfies

f_e(x) = 1 if x ∈ F, and f_e(x) = 0 if x ∉ F_e,

where F_e = {y ∈ M | ρ(y, F) < e} is the e-enlargement of F. Note also that I_F(x) ≤ f_e(x) ≤ I_{F_e}(x). Construct another such function using the proof of Urysohn’s lemma.

1.4 Limit points and accumulation points


Let X ⊂ M.
• x ∈ M is called a limit point of X if, for any open neighbourhood Ux of x, Ux ∩ X ≠ ∅.
• x ∈ M is called an accumulation point (also called a cluster point) of X if, for any open neighbourhood Ux of x, (Ux \ {x}) ∩ X ≠ ∅.
• Limit points that are not accumulation points are called isolated points.
• A set is closed if and only if it contains all its limit points (exercise).
• A set is called discrete if all its points are isolated points.
• A closed set is called perfect if all its points are accumulation points.
Exercise 1.16. Give examples of each of these in Rd (equipped with the standard metric).
Exercise 1.17. Show that every subset of a discrete metric space is discrete and closed. Is every discrete subset of a general metric space closed?
Exercise 1.18. Can a discrete subset of Rd be uncountable? Can a perfect subset of Rd be countable?
1 Topological spaces where one can do this are called completely Hausdorff (or completely T2 ) spaces.


1.5 Denseness
X ⊂ M is dense in M if, for all x ∈ M and for any open neighbourhood Ux of x, Ux ∩ X ≠ ∅. A metric space is called separable if it contains a countable dense subset.

Example 1.19. Rd is separable because it contains the countable dense subset Qd .

Exercise 1.20. Show that C [0, 1] with the uniform metric is separable. (Hint: Stone-Weierstrass!)

Why do we care about separability? Well, measurability plays well with respect to
countable operations. So if one wants to do measure theory on metric spaces, sepa-
rability helps to restrict attention to countable operations (such as taking sup over a
countable dense set).

1.6 Convergence, Cauchy sequences and completeness


Note that every metric space M is Hausdorff (also called T2 ), i.e. any two distinct
points can be separated by disjoint open sets, i.e., ∃ neighbourhoods Ux , Uy of x, y ∈
M such that Ux ∩ Uy = ∅. One can thus define the concept of convergence of a
sequence.

• Let {xn} be a sequence in M. We say that xn converges to x (written xn → x) if, for any neighbourhood Ux of x, ∃ N ≥ 1 such that for all n ≥ N, xn ∈ Ux. Note how this encompasses the (e, N)-definition of convergence. The Hausdorff property ensures that the limit of a sequence is unique.

• Let {xn} be a sequence in M. We say that {xn} is Cauchy if, for any e > 0, ∃ N ≥ 1 such that for all m, n ≥ N, ρ(xm, xn) < e.

Exercise 1.21. Show that all convergent sequences are Cauchy. Give a counterexample to the
converse statement.

Exercise 1.22. What are all the Cauchy sequences in a discrete metric space?

X ⊂ M is called complete if all Cauchy sequences in X converge to some element of X.


For example, Rd is complete.

Exercise 1.23. Any discrete metric space is complete.

Exercise 1.24. C [0, 1] is complete in the uniform metric.

Exercise 1.25. Any closed subset of a complete metric space is itself complete.

The fact that C [0, 1] is complete with the uniform metric will be very useful to us.
Complete, separable metric spaces are also called Polish spaces. Thus C [0, 1] is a Polish
space under the uniform metric.

Exercise 1.26. f : M1 → M2 is continuous if and only if for any sequence {xn} in M1 converging to x, the sequence {f(xn)} in M2 converges to f(x).


1.7 Compactness
X ⊂ M is called compact if any open cover {Uα }α∈ I admits a finite subcover. X is
called bounded if ∃ an open ball B( x, e) ⊃ X.
Exercise 1.27. Let diam( A) := supx,y∈ A ρ( x, y) be the diameter of A. Show that A is
bounded if and only if A is either empty or diam( A) is finite.
Exercise 1.28. If X is compact, then it is closed and bounded. In Euclidean spaces the converse
also holds (the so-called Heine-Borel theorem).
Exercise 1.29. What are the compact subsets of a discrete metric space?
Exercise 1.30. Suppose K ⊂ M1 is compact and f : K → M2 is continuous. Then f is uniformly continuous.
Exercise 1.31. Suppose K ⊂ M1 is compact and f : M1 → M2 is continuous, then f (K ) is
compact (hence bounded).
X ⊂ M is called totally bounded if, for any e > 0, ∃ points x1, . . . , xk ∈ M, where k = k(e) ≥ 1, such that X ⊂ ∪_{i=1}^k B(xi; e)².

Exercise 1.32. Prove that the closure of a totally bounded set is totally bounded.
For metric spaces there are equivalent characterisations of compactness.
Lemma 1.33. The following are equivalent:
(a) K is compact.

(b) K is complete and totally bounded.

(c) Every infinite subset of K has an accumulation point.

(d) Every sequence in K has a convergent subsequence (this is also called sequential com-
pactness).
A set X ⊂ M is called precompact or relatively compact if X̄ is compact.
Exercise 1.34. In complete metric spaces, the precompact sets are precisely the totally bounded
sets.
Exercise 1.35. Is the closed ball B[ x; e] compact?
As we will be working with C [0, 1] extensively, let us characterize its compact sub-
sets. For this we need to define the notion of equicontinuity of a family of functions.
Let F be a family of functions from M1 to M2 . F is called equicontinuous at x ∈ M1 ,
if, for any e > 0, ∃δ = δ( x, e) such that ρ1 (y, x ) < δ =⇒ ρ2 ( f (y), f ( x )) < e for all
f ∈ F . If F is equicontinuous at all x ∈ M1 , then we simply say that the family F is
equicontinuous.
Exercise 1.36. Define the notion of uniform equicontinuity and then show that if M1 is
compact, then F is uniformly equicontinuous.
² Such a finite collection of e-balls covering a space is called an e-net. We will encounter them time and again when we start talking about empirical process theory.


Exercise 1.37. Let M1 be compact and f : M1 × M1 → M2 be continuous. Then the family F = {f_y(x) = f(x, y) | y ∈ M1} is uniformly equicontinuous.
Exercise 1.38. Let M > 0. Let F be a family of functions from a convex and open set X ⊂ Rd to R^{d′} such that ‖Df(x)‖ ≤ M for all x ∈ X and for all f ∈ F. Then F is uniformly equicontinuous.

Theorem 1.39 (Arzelà-Ascoli). Let K ⊂ M be compact. Consider the metric space C(K) of all continuous functions on K with the uniform metric. S ⊂ C(K) is compact if and only if S is closed, bounded and equicontinuous.

The modulus of continuity at level δ > 0 of a function x : M1 → M2 is defined as

w_x(δ) = sup_{ρ1(s,t)≤δ} ρ2(x(s), x(t)).

Theorem 1.40 (Arzelà-Ascoli for C [0, 1]). S ⊂ C ([0, 1]) has compact closure if and
only if the following two conditions hold:

1. supx∈S | x (0)| < ∞.

2. limδ→0 supx∈S wx (δ) = 0.

1.8 Connectedness

We will not need this notion for now.

2 Probability measures on metric spaces


Let (M, ρ) be a metric space. Consider the Borel σ-field B of (M, ρ) (i.e. the σ-field generated by all open sets). We will be interested in defining a notion of convergence in distribution on (M, ρ). Our main reference for this section is Billingsley (2013), along with the previous editions of that book.

2.1 Preliminaries
Lemma 2.1 (Regularity of probability measures). Every probability measure µ on (M, B)
is regular, i.e. for any Borel set A, and for any e > 0, one can find G open and F closed such
that F ⊂ A ⊂ G, and µ( G \ F ) < e.
Proof. Use the “good set” principle: show that the statement is true when A is a closed set, and then show that the collection of all sets for which it holds forms a σ-field. Fill in the details.
Notation: Let µ be a measure and f ∈ Cb(M). Then µf := ∫ f dµ. For a probability measure P, Pf is thus the same as E(f(X)) where X ∼ P.

A class of functions F on (M, B) is called (probability) measure determining if µf = νf, ∀ f ∈ F, implies µ = ν.


Lemma 2.2. {I_F | F closed} is a probability determining class. The same goes for {I_G | G open}.

Proof. For a general Borel set A, by regularity, one can construct two sequences of closed sets F_{in} ⊂ A such that Pi(F_{in}) → Pi(A), i = 1, 2. But then Pi(F_{1n} ∪ F_{2n}) → Pi(A), i = 1, 2, which implies that P1(A) = P2(A).

Lemma 2.3. Cb(M), the class of bounded continuous functions on M, is probability determining. The same goes for Cbu(M), the class of bounded uniformly continuous functions on M.

Proof. For every closed F, approximate I_F by f_e to get Pi(F) ≤ Pi f_e ≤ Pi(F_e), i = 1, 2. But Pi(F_e) ↓ Pi(F) as e ↓ 0, i = 1, 2, and P1 f_e = P2 f_e. This shows that P1(F) = P2(F). As the functions I_F are probability determining, we are done.

Lemma 2.4. Any π-system A generating the Borel σ-field B of M is a determining class of
sets.

Proof. The class S = {A ∈ B | P(A) = Q(A)} is a λ-system containing the π-system A. Hence the desired conclusion follows from the π-λ theorem.
A probability measure P on (M, B) is called tight, if, for any e > 0, there exists a
compact set K ⊂ M such that P(K ) > 1 − e.

Exercise 2.5. Probability measures on σ-compact metric spaces (i.e. metric spaces that are countable unions of compact sets) are tight.

Lemma 2.6. Probability measures on Polish spaces are tight.

Proof. Let {x1, x2, . . .} be a countable dense subset of M. Let U_{ik} = B(xi; 1/k). Then ∪_{i≥1} U_{ik} covers M for any k ≥ 1. Choose n_k ≥ 1 such that P(∪_{1≤i≤n_k} U_{ik}) > 1 − e/2^k. Now set A = ∩_{k≥1}(∪_{1≤i≤n_k} U_{ik}). Note that P(A^c) = P(∪_{k≥1}(∪_{1≤i≤n_k} U_{ik})^c) ≤ ∑_{k≥1} P((∪_{1≤i≤n_k} U_{ik})^c) ≤ ∑_{k≥1} e/2^k = e. Hence P(A) > 1 − e.

Note also that, by its very construction, A is totally bounded. As M is complete, Ā is complete. But the closure of a totally bounded set is also totally bounded. Hence, Ā is compact. But P(Ā) ≥ P(A) > 1 − e, which completes the proof.

2.2 Weak convergence


By P(M) we will denote the set of all probability measures on M. Let P, Pn, n ≥ 1, be in P(M). We say that Pn converges weakly to P, denoted Pn →w P, if Pn f → Pf for all f ∈ Cb(M). The following lemma gives important equivalent characterizations.

Lemma 2.7 (Portmanteau). The following are equivalent:

(a) Pn →w P.

(b) Pn f → Pf, ∀ f ∈ Cbu(M).

(c) lim sup_{n→∞} Pn(F) ≤ P(F), ∀ F closed.

(d) lim inf_{n→∞} Pn(G) ≥ P(G), ∀ G open.


(e) Pn(A) → P(A), ∀ A such that P(∂A) = 0³.


Proof. “(a) =⇒ (b)”: Obvious!

“(b) =⇒ (c)”: We have, for any e > 0,

lim sup Pn ( F ) ≤ lim sup Pn f e = lim Pn f e = P f e ≤ P( Fe ).

Now let e ↓ 0.

“(c) ⇐⇒ (d)”: For the “=⇒” direction, note that, by (c), lim sup Pn(G^c) ≤ P(G^c), i.e. lim sup(1 − Pn(G)) ≤ 1 − P(G), i.e. 1 − lim inf Pn(G) ≤ 1 − P(G), which is the same as what we want. Similarly for “(d) =⇒ (c)”.

“(c) =⇒ (e)”: By (c) and (d),

P( Ā) ≥ lim sup Pn ( Ā) ≥ lim sup Pn ( A) ≥ lim inf Pn ( A) ≥ lim inf Pn ( A◦ ) ≥ P( A◦ ).

Now P(∂A) = 0 implies that P( Ā) = P( A◦ ) = P( A) and hence we are done.

“(e) =⇒ (a)”: Let f ∈ Cb(M). By scaling and translating suitably, we may assume that 0 < f < 1. Now Pf = ∫₀¹ P(f > t) dt. Note that P(f > t) = 1 − F_{f(X)}(t), where F_{f(X)} is the CDF of the (real-valued) random variable f(X). As the set of jumps of a CDF is at most countable, and ∂{f > t} ⊂ {f = t} (by continuity of f), we conclude that Pn(f > t) → P(f > t) for all but countably many t. Hence an application of the BCT gives that Pn f = ∫₀¹ Pn(f > t) dt → ∫₀¹ P(f > t) dt = Pf.
Recall that xn → x if and only if every subsequence of {xn} contains a further subsequence that converges to x. We have an analogue of this for weak convergence.

Lemma 2.8. Pn →w P if and only if every subsequence of {Pn} contains a further subsequence that converges weakly to P.
Proof. Only the “if” part is non-trivial. Let f ∈ Cb(M). Consider the real sequence {Pn f}. Take a subsequence {P_{n_k} f}. By hypothesis, the subsequence {P_{n_k}} of {Pn} has a further subsequence {P_{n_{k_ℓ}}} that converges weakly to P. Hence {P_{n_{k_ℓ}} f} converges to Pf. So any subsequence of {Pn f} contains a further subsequence that converges to Pf. It follows that Pn f → Pf, and we are done.
Exercise 2.9. Show that δ_{xn} →w δ_x if and only if xn → x.

Exercise 2.10. Show that, if δ_{xn} →w P, then P = δ_x for some x.
A class of sets A ⊂ B is called convergence determining if Pn(A) → P(A) for all P-continuity sets A ∈ A implies that Pn →w P.
Exercise 2.11. Any convergence determining class is a probability determining class.
Lemma 2.12. Let A be a π-system such that every open set G can be written as a countable union of sets from A. Then Pn(A) → P(A) for all A ∈ A implies that Pn →w P.
3 Such sets A are called P-continuity sets.


Lemma 2.13. Suppose M is separable. Let A be a π-system such that for any x ∈ M and e > 0, there exists A_{x,e} ∈ A with x ∈ A_{x,e}◦ ⊂ A_{x,e} ⊂ B(x; e). Then Pn(A) → P(A) for all A ∈ A implies that Pn →w P.

Let A_{x,e} denote the class of sets A such that x ∈ A◦ ⊂ A ⊂ B(x; e). Let ∂A_{x,e} = {∂A | A ∈ A_{x,e}}.

Lemma 2.14. Suppose M is separable. Let A be a π-system such that each ∂A_{x,e} either contains ∅ or has uncountably many disjoint elements. Then A is a convergence determining class.
Example 2.15. On Rd, rectangles of the form (a, b] = (a1, b1] × · · · × (ad, bd] form a convergence determining class.

Example 2.16. On R^∞, finite dimensional cylinders π_{i1,...,id}^{−1}((a, b]) form a convergence determining class.
Example 2.17. On C[0, 1], finite dimensional cylinders π_{t1,...,tk}^{−1}((a, b]) do not form a convergence determining class. However, they do form a determining class. To see this, note that the sequence of functions zn(t) = nt I(t ≤ 1/n) + (2 − nt) I(1/n < t ≤ 2/n) converges pointwise to the 0 function. However, ‖zn‖∞ = 1, so δ_{zn} does not converge weakly to δ0. Nevertheless, since π_{t1,...,tk} zn = (zn(t1), . . . , zn(tk)) → (0, . . . , 0) for any fixed t1, . . . , tk, the finite dimensional distributions of δ_{zn} converge to those of δ0 (prove this rigorously).

2.3 The mapping theorem and Skorohod’s representation


As is true for random variables, if h is a continuous mapping M1 → M2 and Pn →w P, then Pn ◦ h^{−1} →w P ◦ h^{−1}. In fact, we can say something stronger.

Theorem 2.18 (The mapping theorem). Let h : M1 → M2 be such that P(D_h) = 0, where D_h is the set of discontinuity points of h. Then Pn →w P implies Pn ◦ h^{−1} →w P ◦ h^{−1}.

Proof. Let F be a closed subset of M2, and let C denote the closure of h^{−1}(F) in M1. Note that

lim sup_{n→∞} Pn ◦ h^{−1}(F) ≤ lim sup_{n→∞} Pn(C) ≤ P(C) = P(C ∩ D_h^c) ≤ P ◦ h^{−1}(F),

where the equality uses P(D_h) = 0, and the last inequality follows from the set inclusion C ∩ D_h^c ⊂ h^{−1}(F). This inclusion is true because, if x belongs to the LHS, then there is a sequence xn ∈ h^{−1}(F) such that xn → x. But, since x ∈ D_h^c, this means that h(xn) → h(x). As F is closed and h(xn) ∈ F, we get h(x) ∈ F, i.e. x ∈ h^{−1}(F).
Exercise 2.19. Another tool for your portmanteau: Pn →w P if and only if for any bounded measurable function h with P(D_h) = 0 one has Pn h → Ph.
Exercise 2.20. Using the previous exercise, give an alternative proof of the mapping theorem.


There is a theorem of Skorohod which constructs random elements Xn, X on a common probability space such that Xn ∼ Pn, X ∼ P and Xn → X almost surely. For this, P needs to have separable support. This result is very useful in practice because almost sure convergence is easy to work with and implies weak convergence.

Before we embark on a proof in the general case, let us prove it for random variables.
Lemma 2.21. Let Fn, F be univariate CDFs with Fn →d F. Then there exist random variables Xn ∼ Fn, X ∼ F defined on the same probability space (Ω, A, Q) such that Xn →a.s. X.

2.4 Relative compactness and Prohorov’s theorem


A family Π of probability measures is called relatively compact if any sequence {Pn} in Π has a weakly convergent subsequence. Thus, a sequence {Pn} is relatively compact if any subsequence {P_{n_k}} has a further subsequence {P_{n_{k_ℓ}}} that converges weakly.

Lemma 2.22. Consider probability measures on C[0, 1]. If {Pn} is relatively compact, and all finite dimensional distributions converge weakly, then there exists a probability measure P on C[0, 1] such that Pn →w P.
Proof. Consider a subsequence {P_{n_ℓ}} converging weakly to some probability measure Q. Then P_{n_ℓ} ◦ π_{t1,...,tk}^{−1} →w Q ◦ π_{t1,...,tk}^{−1}. Since all finite dimensional distributions of {Pn} converge, it follows that Pn ◦ π_{t1,...,tk}^{−1} →w Q ◦ π_{t1,...,tk}^{−1}. As finite dimensional cylinders form a determining class, it also follows that all subsequential weak limits of {Pn} are equal to Q. Since, by relative compactness, any subsequence has a further subsequence that converges weakly (to Q), we conclude that Pn →w Q.
How does one show relative compactness of a family? A sufficient condition is
tightness of the family—a family Π is called tight if for any e > 0 there exists a compact
set Ke such that P(Ke ) > 1 − e for all P ∈ Π.

Theorem 2.23 (Prohorov). If a family Π is tight, then it is relatively compact.

Proof. To be written up.

Theorem 2.24. Suppose that M is Polish. Then relative compactness of a family Π implies its tightness.

Proof. The proof is similar to the proof of Lemma 2.6, which establishes tightness of a single measure. It suffices to show that if M can be written as an increasing union of open sets Gi, then for any e > 0 there exists n_e ≥ 1 such that P(G_{n_e}) > 1 − e for all P ∈ Π. Assume not; then, for some e0 > 0, we can find a sequence {Pn} in Π such that Pn(Gn) ≤ 1 − e0. By relative compactness, there is a subsequence {P_{n_k}} converging weakly to some probability measure Q. Now for any m ≥ 1,

Q(Gm) ≤ lim inf P_{n_k}(Gm) ≤ lim inf P_{n_k}(G_{n_k}) ≤ 1 − e0.

Let m → ∞ to get Q(M) ≤ 1 − e0, a contradiction.

To use this, write M = ∪_{n≥1} B(xn; 1/k), where {x1, x2, . . .} is a countable dense set. There is n_k such that P(∪_{n=1}^{n_k} B(xn; 1/k)) > 1 − e/2^k for all P ∈ Π. Hence, the totally bounded set A = ∩_{k≥1} ∪_{n=1}^{n_k} B(xn; 1/k) satisfies P(A) > 1 − e for all P ∈ Π. Ā is the required compact set.


3 Weak convergence in C [0, 1]


3.1 Tightness in C[0, 1]

By Arzelà-Ascoli we can characterize the compact sets of C[0, 1]. This enables us to formulate an equivalent criterion for tightness.

Lemma 3.1. Let {Pn} be a sequence of Borel probability measures on C[0, 1]. Then {Pn} is tight if and only if the following two conditions hold:

(i) For all η > 0, ∃ a ≥ 0 such that, for all n ≥ 1, one has Pn(|x(0)| ≥ a) ≤ η. Note that this condition is equivalent to saying that {Pn ◦ π0^{−1}} is tight.

(ii) For all η > 0, e > 0, ∃ δ ∈ (0, 1), n0 ≥ 1 such that, for all n ≥ n0, one has Pn(w_x(δ) ≥ e) ≤ η.

Now we already know that tightness plus convergence of finite dimensional distributions implies weak convergence. Combining this with the above lemma we can say the following.

Lemma 3.2. Let {Pn} be a sequence of Borel probability measures on C[0, 1]. Suppose

(i) Pn ◦ π_{t1,...,tk}^{−1} converges weakly, for any t1, . . . , tk ∈ [0, 1], k ≥ 1.

(ii) For all e > 0,

lim_{δ→0} lim sup_{n→∞} Pn(w_x(δ) ≥ e) = 0.

Then there is a Borel probability measure P on C[0, 1] such that Pn →w P. The converse is also true.

3.2 Wiener measure and Donsker’s theorem


Let ξ1, ξ2, . . . be i.i.d. with Eξ1 = 0, Eξ1² = σ², living on some probability space (Ω, A, P). Let Sn = ∑_{i=1}^n ξi, S0 ≡ 0. Construct a random function Xn : (Ω, A, P) → (C[0, 1], B(C[0, 1])) as follows:

Xn(ω)(t) ≡ Xn(ω, t) = S_i(ω)/(σ√n) if t = i/n, and
Xn(ω, t) = α S_i(ω)/(σ√n) + (1 − α) S_{i+1}(ω)/(σ√n) if t = α(i/n) + (1 − α)(i + 1)/n, 0 ≤ α ≤ 1, 0 ≤ i < n.

When not necessary, we will drop ω for notational simplicity. Note that we can compactly write Xn(t) as

Xn(t) = S_{⌊nt⌋}/(σ√n) + (nt − ⌊nt⌋) ξ_{⌊nt⌋+1}/(σ√n).
Exercise 3.3. Show that Xn defined above is indeed a random function, i.e., Xn is A/B(C [0, 1])
measurable.
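To build intuition for this construction, here is a small simulation sketch (Python with numpy; an illustration added to these notes, not part of the formal development). It draws the ξi, forms the partial sums, and evaluates the linearly interpolated path Xn on a fine grid.

```python
import numpy as np

def donsker_path(xi, grid, sigma=1.0):
    """Evaluate X_n(t) = S_[nt]/(sigma sqrt(n)) + (nt - [nt]) xi_{[nt]+1}/(sigma sqrt(n))."""
    n = len(xi)
    S = np.concatenate([[0.0], np.cumsum(xi)])     # S_0, S_1, ..., S_n
    k = np.minimum((n * grid).astype(int), n - 1)  # floor(nt), capped so t = 1 works
    frac = n * grid - k
    return (S[k] + frac * xi[k]) / (sigma * np.sqrt(n))

rng = np.random.default_rng(0)
n = 1000
xi = rng.choice([-1.0, 1.0], size=n)    # Rademacher steps, so sigma = 1
t = np.linspace(0.0, 1.0, 2001)
path = donsker_path(xi, t)              # one approximate Brownian path
print(path[-1], xi.sum() / np.sqrt(n))  # both equal X_n(1) = S_n / sqrt(n)
```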
We will show that {Pn = P ◦ Xn^{−1}} is tight and that Pn ◦ π_{t1,...,tk}^{−1} converges weakly to a certain k-variate Gaussian, for any t1, . . . , tk ∈ [0, 1], k ≥ 1. This will establish the existence of a measure on C[0, 1] with Gaussian finite dimensional distributions.


Theorem 3.4 (Existence of Wiener measure and Donsker’s theorem). There exists, on C[0, 1], a Borel probability measure W such that, for any t1, . . . , tk ∈ [0, 1], k ≥ 1, one has W ◦ π_{t1,...,tk}^{−1} ≡ N(0, Σ^{(t1,...,tk)}), where Σ^{(t1,...,tk)}_{ij} = ti ∧ tj. Moreover, Pn →w W.ᵃ

ᵃ W is called the Wiener measure after mathematician Norbert Wiener, who was the first to construct Brownian motion rigorously.

Proof. Note that Xn(0) = 0. For t1, t2, . . . , tk ∈ (0, 1], k ≥ 1,

(Xn(t1), Xn(t2) − Xn(t1), . . . , Xn(tk) − Xn(t_{k−1}))
= (S_{⌊nt1⌋}/(σ√n) + o_P(1), (S_{⌊nt2⌋} − S_{⌊nt1⌋})/(σ√n) + o_P(1), . . . , (S_{⌊ntk⌋} − S_{⌊nt_{k−1}⌋})/(σ√n) + o_P(1))
→d (Y1, . . . , Yk)   (by the CLT),

where the Yj ∼ N(0, tj − t_{j−1}), t0 ≡ 0, are independent. If we interpret X(0) = 0 ≡ N(0, 0), then the above is true for any t1, t2, . . . , tk ∈ [0, 1], k ≥ 1. It follows that (prove)

Pn ◦ π_{t1,...,tk}^{−1} →w N(0, Σ^{(t1,...,tk)}),

where Σ^{(t1,...,tk)}_{ij} = ti ∧ tj. This verifies condition (i) of Lemma 3.2.

Let 0 = t0 < t1 < · · · < t_{k−1} < tk = 1 be δ-sparse, i.e. min_{1≤i≤k}(ti − t_{i−1}) ≥ δ. Then

w_x(δ) ≤ 3 max_{1≤i≤k} sup_{t_{i−1}≤s≤ti} |x(s) − x(t_{i−1})|.

It follows that

Pn(w_x(δ) ≥ e) ≤ ∑_{1≤i≤k} Pn(sup_{t_{i−1}≤s≤ti} |x(s) − x(t_{i−1})| ≥ e/3).

For simplicity, we choose a particular equispaced δ-sparse set: ti = im/n, 0 ≤ i < k, m ∈ N, tk = 1. Then we must have δ ≤ t2 − t1 = m/n. So, in particular, we can take m = ⌈nδ⌉. Also, 1 > t_{k−1} = (k − 1)m/n. If we choose the maximum possible number of points ti, then 1 ≤ km/n. Then k = ⌈n/m⌉ → 1/δ.
Now note that

sup_{t_{i−1}≤s≤ti} |Xn(s) − Xn(t_{i−1})|
= max_{nt_{i−1}≤j<nti} sup_{0≤α≤1} |α Sj/(σ√n) + (1 − α) S_{j+1}/(σ√n) − S_{nt_{i−1}}/(σ√n)|
= max_{nt_{i−1}≤j<nti} max{ |Sj − S_{nt_{i−1}}|/(σ√n), |S_{j+1} − S_{nt_{i−1}}|/(σ√n) }
= max_{nt_{i−1}≤j≤nti} |Sj − S_{nt_{i−1}}|/(σ√n).

By stationarity of the sequence ξi,

max_{nt_{i−1}≤j≤nti} |Sj − S_{nt_{i−1}}|/(σ√n) =d max_{1≤j≤m} |Sj|/(σ√n),


for any 1 ≤ i < k. For i = k,

max_{nt_{k−1}≤j≤n} |Sj − S_{nt_{k−1}}|/(σ√n) =d max_{1≤j≤m′} |Sj|/(σ√n),

where m′ = n − (k − 1)m ≤ m. All in all, we conclude that

Pn(w_x(δ) ≥ e) ≤ ∑_{1≤i≤k} Pn(sup_{t_{i−1}≤s≤ti} |x(s) − x(t_{i−1})| ≥ e/3)
= ∑_{1≤i≤k} P(sup_{t_{i−1}≤s≤ti} |Xn(s) − Xn(t_{i−1})| ≥ e/3)
= (k − 1) P(max_{1≤j≤m} |Sj|/(σ√n) ≥ e/3) + P(max_{1≤j≤m′} |Sj|/(σ√n) ≥ e/3)
≤ k P(max_{1≤j≤m} |Sj|/(σ√n) ≥ e/3)
= k P(max_{1≤j≤m} |Sj| ≥ eσ√n/3)
≤ 3k max_{1≤j≤m} P(|Sj| ≥ eσ√n/9)   (by Etemadi’s maximal inequality⁴)
≤ (6/δ) max_{1≤j≤m} P(|Sj| ≥ eσ√m/(9√(2δ))),

where the last inequality holds for all large n, since 1/(2δ) < 1/δ = lim_{n→∞} n/m = lim_{n→∞} k < 2/δ.
For any 1 ≤ j ≤ m,

P(|Sj| ≥ eσ√m/(9√(2δ))) = P(|Sj|/(σ√j) ≥ e√m/(9√(2jδ))) ≤ P(|Sj|/(σ√j) ≥ e/(9√(2δ))),

since j ≤ m. By the CLT, |Sj|/(σ√j) →d |Z|. Therefore

P(|Sj|/(σ√j) > e/(9√(2δ))) → P(|Z| > e/(9√(2δ))).

Therefore there exists j0 = j0(e, δ) such that, for all j ≥ j0, one has

P(|Sj|/(σ√j) > e/(9√(2δ))) ≤ 2P(|Z| > e/(9√(2δ))) ≤ 2(9√2)⁴δ² EZ⁴/e⁴ = C1δ²/e⁴.

For j < j0, we have, by Chebyshev’s inequality,

P(|Sj|/(σ√j) ≥ e√m/(9√(2jδ))) ≤ (9√2)² jδ/(e²m) < (9√2)² j0δ/(e²m) ≤ C2 j0/(e²nδ),

for some constant C2 > 0.

⁴ This says that, for independent random variables ξ1, . . . , ξn, one has P(max_{1≤i≤n} |Si| ≥ 3α) ≤ 3 max_{1≤i≤n} P(|Si| ≥ α).


Thus we get, for all large enough n, that

Pn(w_x(δ) ≥ e) ≤ (6/δ) max{ C1δ²/e⁴, C2 j0/(e²nδ) } = max{ C3δ/e⁴, C4 j0/(e²nδ²) },

for constants C3, C4 > 0. It follows that

lim_{δ→0} lim sup_{n→∞} Pn(w_x(δ) ≥ e) = 0.

This verifies condition (ii) of Lemma 3.2. This proves, simultaneously, the existence of W and that Pn →w W.
A random function B on C[0, 1], whose distribution is the Wiener measure, is called a standard Brownian motion on [0, 1]. Thus we have proved that Xn →d B.

Corollary 3.5 (Donsker). Let h : C[0, 1] → R be such that W(D_h) = 0. Then h(Xn) →d h(B).

3.3 Some consequences of Donsker’s theorem


1. Suppose h(x) = max_{t∈[0,1]} x(t). Then h is a continuous functional on C[0, 1], because |h(x) − h(y)| ≤ ‖x − y‖∞. Hence, we conclude that

max_{1≤i≤n} Si/(σ√n) = h(Xn) →d h(B),

whenever the ξi satisfy the assumptions of Theorem 3.4. In particular, take ξ1 to be a Rademacher variable: P(ξ1 = ±1) = 1/2. Then σ² = 1, and we have a simple symmetric random walk (SSRW), for which one can prove, by the reflection principle, that

P(max_{1≤i≤n} Si ≥ a) = 2P(Sn > a) + P(Sn = a),

where a is an integer. Therefore, for α > 0,

P(max_{1≤i≤n} Si/√n > α) = P(max_{1≤i≤n} Si > √n α)
= P(max_{1≤i≤n} Si ≥ a_n)   (where a_n = ⌈√n α⌉ or ⌈√n α⌉ + 1)
= 2P(Sn > a_n) + P(Sn = a_n)
= 2P(Sn/√n > a_n/√n) + P(Sn/√n = a_n/√n)
→ 2P(Z > α).

(Justify this last step.) It follows that

P(max_{1≤i≤n} Si/√n ≤ α) → 1 − 2P(Z > α) = P(|Z| ≤ α).

Thus h(B) =d |Z|, which one can restate as

max_{t∈[0,1]} B(t) =d |B(1)|.
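This identity can be sanity-checked by simulation (a Python sketch added here, not from the original notes): simulate many SSRW paths and compare max_i Si/√n with |Sn|/√n in distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 2000, 20000
steps = rng.choice([-1, 1], size=(reps, n))
S = np.cumsum(steps, axis=1)
# the max includes t = 0, where the path is 0, hence the outer maximum with 0
max_stat = np.maximum(S.max(axis=1), 0) / np.sqrt(n)   # ~ max_t B(t)
abs_end = np.abs(S[:, -1]) / np.sqrt(n)                # ~ |B(1)|
for q in [0.25, 0.5, 0.75, 0.9]:                       # quantiles should roughly agree
    print(q, np.quantile(max_stat, q), np.quantile(abs_end, q))
```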


2. It is possible to find the joint distribution of (m, M, B(1)) using Donsker’s theorem, where m = min_{t∈[0,1]} B(t) and M = max_{t∈[0,1]} B(t). Indeed, if mn = min_{0≤i≤n} Si and Mn = max_{0≤i≤n} Si, then

(1/(σ√n)) (mn, Mn, Sn) →d (m, M, B(1)),

because the functional x ↦ (min_{t∈[0,1]} x(t), max_{t∈[0,1]} x(t), x(1)) is continuous and min_{t∈[0,1]} Xn(t) = mn/(σ√n), max_{t∈[0,1]} Xn(t) = Mn/(σ√n). By using properties of the SSRW, one can derive the limit distribution of (1/(σ√n))(mn, Mn, Sn) to show that, for a < 0 < b and a < x < y < b,

P(a < m ≤ M < b, x < B(1) < y)   (1)
= ∑_{k∈Z} P(x + 2k(b − a) < Z < y + 2k(b − a)) − ∑_{k∈Z} P(2b − y + 2k(b − a) < Z < 2b − x + 2k(b − a)),

where Z ∼ N(0, 1).

3.4 The (standard) Brownian bridge


A (standard) Brownian bridge is a random function B^br in C[0, 1] with the property that

(B^br(t1), . . . , B^br(tk)) =d N(0, Σ^{(t1,...,tk)}),

where Σ^{(t1,...,tk)}_{ij} = ti(1 − tj) for i ≤ j.⁵ One way to construct the standard Brownian bridge is by the following transformation of the standard Brownian motion:

B^br(t) = B(t) − tB(1).

We will denote the distribution of B^br by W^br. The standard Brownian bridge has a useful interpretation: it is essentially a standard Brownian motion whose value at 1 is pinned at 0. Since {B(1) = 0} is a measure zero event, care needs to be taken to define this properly.
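A minimal simulation sketch of this transformation (Python with numpy; our illustration, not part of the notes): generate approximate Brownian paths on a grid, pin them via B(t) − tB(1), and check the variance t(1 − t) empirically at t = 1/2.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 1000, 5000
t = np.arange(1, n + 1) / n
# approximate Brownian motion: scaled Gaussian partial sums on a grid
B = np.cumsum(rng.standard_normal((reps, n)), axis=1) / np.sqrt(n)
bridges = B - t * B[:, [-1]]           # B^br(t) = B(t) - t B(1), pinned at t = 1
print(np.abs(bridges[:, -1]).max())    # 0 up to floating point, by construction
print(np.var(bridges[:, n // 2 - 1]))  # ~ 0.25 = t(1 - t) at t = 1/2
```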
Exercise 3.6. Show that B^br and B(1) are independent.
Lemma 3.7. For e > 0, define a measure on C[0, 1] by P_e(A) = P(B ∈ A | |B(1)| ≤ e). Then P_e →w W^br as e → 0.
Proof. We will use the facts that B^br and B(1) are independent, and that ‖B^br − B‖∞ = |B(1)|. Let h be a bounded uniformly continuous function on C[0, 1]. Note that

|E[h(B) | |B(1)| ≤ e] − E[h(B^br)]|
= |E[h(B) | |B(1)| ≤ e] − E[h(B^br) | |B(1)| ≤ e]|
≤ E[|h(B) − h(B^br)| | |B(1)| ≤ e]
≤ E[w_h(‖B − B^br‖∞) | |B(1)| ≤ e]
≤ w_h(e).

Now let e → 0.
⁵ A stochastic process (Xt) on [0, 1] is called a Gaussian process if all of its finite dimensional distributions are (multivariate) Gaussians. Thus both B and B^br are examples of Gaussian processes.


This characterization of the Brownian bridge is useful for calculating limit distributions of functionals of the Brownian bridge from those of Brownian motion.

As an example, we calculate the distribution of sup_{t∈[0,1]} |B^br(t)|. (This is an important functional, because later we will prove that it is the weak limit of the supremum of the canonical empirical process: √n sup_{t∈[0,1]} |Fn(t) − t| →d sup_{t∈[0,1]} |B^br(t)|.)

Theorem 3.8. Let b > 0. We have

P(sup_{t∈[0,1]} |B^br(t)| ≤ b) = ∑_{k∈Z} (−1)^k e^{−2k²b²}.

Proof. Put a = −b, x = −e, y = e < b in (1) to get

P(−b < m ≤ M < b, −e < B(1) < e)
= ∑_{k∈Z} [Φ(e + 4kb) − Φ(−e + 4kb)] − ∑_{k∈Z} [Φ(e + 2(2k + 1)b) − Φ(−e + 2(2k + 1)b)].

Hence

P(−b < m ≤ M < b | −e < B(1) < e)
= ∑_{k∈Z} [Φ(e + 4kb) − Φ(−e + 4kb)]/[Φ(e) − Φ(−e)]
− ∑_{k∈Z} [Φ(e + 2(2k + 1)b) − Φ(−e + 2(2k + 1)b)]/[Φ(e) − Φ(−e)].

Let e → 0 and use Lemma 3.7 and the mapping theorem to conclude that (why is the interchange of sum and limit justified?)

P(−b < m^br ≤ M^br < b) = ∑_{k∈Z} φ(4kb)/φ(0) − ∑_{k∈Z} φ(2(2k + 1)b)/φ(0)
= ∑_{k∈Z} e^{−2(2k)²b²} − ∑_{k∈Z} e^{−2(2k+1)²b²}
= ∑_{k∈Z} (−1)^k e^{−2k²b²},

where m^br = min_{t∈[0,1]} B^br(t) and M^br = max_{t∈[0,1]} B^br(t). As P(sup_{t∈[0,1]} |B^br(t)| ≤ b) = P(−b ≤ m^br ≤ M^br ≤ b), we are done.
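As a numerical companion (a Python sketch we add here), one can truncate the series and compare it with a Monte Carlo estimate based on simulated bridges; the two should roughly agree, with the discretized supremum slightly undershooting the true one.

```python
import numpy as np

def kolmogorov_cdf(b, kmax=100):
    """Truncation of P(sup_t |B^br(t)| <= b) = sum_{k in Z} (-1)^k exp(-2 k^2 b^2)."""
    k = np.arange(-kmax, kmax + 1)
    return np.sum((-1.0) ** np.abs(k) * np.exp(-2.0 * k**2 * b**2))

rng = np.random.default_rng(3)
n, reps = 2000, 20000
t = np.arange(1, n + 1) / n
B = np.cumsum(rng.standard_normal((reps, n)), axis=1) / np.sqrt(n)
sup_abs = np.abs(B - t * B[:, [-1]]).max(axis=1)   # sup_t of the simulated bridge
for b in [0.5, 1.0, 1.5]:
    print(b, kolmogorov_cdf(b), np.mean(sup_abs <= b))
```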

3.5 Brownian motion on [0, ∞)


Using the standard Brownian bridge, we may construct standard Brownian motion on [0, ∞) via the following transformation:

B̃(t) = (1 + t) B^br(t/(1 + t)), t ≥ 0.

Exercise 3.9. Show that for s, t ∈ [0, ∞), cov(B̃(s), B̃(t)) = s ∧ t.


Question: What would be the right topology on C[0, ∞) for this to make sense? That is, if Ψ is the map taking x ∈ C[0, 1] to Ψ(x)(t) = (1 + t) x(t/(1 + t)) ∈ C[0, ∞), then what topology on C[0, ∞) would make this map Borel measurable, so that one can define Wiener measure on C[0, ∞) as W^br ◦ Ψ^{−1}?

It turns out that the right topology is the topology of uniform convergence on compact sets. This is metrizable, e.g., using the following metric:

ρ(x, y) = ∑_{k≥1} 2^{−k} min{ sup_{t∈[0,k]} |x(t) − y(t)|, 1 }.

Exercise 3.10. Show that C [0, ∞), equipped with ρ, is a Polish space.

Exercise 3.11. Ψ is continuous. In fact, ρ(Ψ(x), Ψ(y)) ≤ C‖x − y‖∞, for some universal constant C > 0.

Thus we can define the Wiener measure on C [0, ∞) as the pushforward W br ◦ Ψ−1
of W br via the map Ψ.
An alternative route is to characterize tightness on (C [0, ∞), ρ) using a suitable
Arzelà-Ascoli theorem and then prove weak convergence of the following random
functions
X̃n(t) = S_{⌊nt⌋}/(σ√n) + (nt − ⌊nt⌋) ξ_{⌊nt⌋+1}/(σ√n), t ≥ 0.
The limit, naturally, would be Brownian motion on C [0, ∞). A treatment of this way
of constructing B̃ can be found in Section 2.4 of Karatzas and Shreve (2012).


4 Weak convergence in D [0, 1]


D[0, 1], or D for short, is the set of functions on [0, 1] that are right continuous and have left limits. These functions are often called càdlàg (an abbreviation for the French continue à droite, limite à gauche), or RCLL (Right Continuous with Left Limits), or CORLOL (Continuous On the Right, Limit On the Left) functions. Examples are I_{[0,a]}(t) or I_{[a,1]}(t). Obviously, C[0, 1] ⊂ D[0, 1]. D[0, 1] is an important space to develop weak convergence theory for. Many natural stochastic processes have discontinuous paths living in D[0, 1], for example various jump processes such as the Poisson process, the empirical process, and so on. Moreover, the random walk process

Xn(t) = S_{⌊nt⌋}/(σ√n)

has sample paths in D[0, 1] (and converges to Brownian motion, as we shall establish in this section).

4.1 The Skorohod topology


We need a suitable topology on D[0, 1]. The uniform metric ‖·‖∞ is too strong to be useful. For example, it fails to capture the closeness of the functions I_{[0,a)}(t) and I_{[0,a+1/n)}(t). Before we introduce a suitable metric on D[0, 1], let us introduce an analogue of the modulus of continuity w_x(δ) for D[0, 1], which will help us express results in a succinct way.
The variation of a function x over a set A ⊂ [0, 1] is defined as

w_x(A) = sup_{u,v∈A} |x(u) − x(v)|.

In this notation,

w_x(δ) = sup_{t∈[0,1−δ]} w_x([t, t + δ]).

Since functions in D[0, 1] may have jumps, w_x(δ) may be arbitrarily large! The problem is that we cannot allow every t in the supremum in the definition of w_x(δ). We have to avoid the jump points.

Exercise 4.1. Construct a sequence of functions xn ∈ D such that lim infn wxn (δ) = ∞, for
any δ > 0.

Recall that a δ-sparse partition σ = {ti} of [0, 1] is a set of points 0 = t0 < t1 < · · · < tk = 1 such that min_i (ti − t_{i−1}) ≥ δ. Define

w′_x(δ) := inf_{{ti} δ-sparse} max_i w_x([t_{i−1}, ti)).

Exercise 4.2. Show that w′_x(δ) is a nondecreasing function of δ.

Lemma 4.3. Let x ∈ D and e > 0. Then, for some δ > 0, there exists a δ-sparse partition {ti} such that

max_i w_x([t_{i−1}, ti)) < e.


Together with Exercise 4.2, this lemma implies that

x ∈ D if and only if lim_{δ→0} w′_x(δ) = 0.

The above lemma also implies that any x ∈ D has at most countably many jump discontinuities and that ‖x‖∞ < ∞.
Given a partition σ = {ti}, consider the map S_σ : D[0, 1] → D[0, 1] taking x to the simple function

S_σ x(t) = ∑_i x(t_{i−1}) I_{[t_{i−1},ti)}(t) + x(1) I_{{1}}(t).

It is clear that

‖x − S_σ x‖∞ ≤ max_i w_x([t_{i−1}, ti)).

Therefore, one can find a sequence of partitions σn such that S_{σn} x → x uniformly. Thus every càdlàg function is Borel measurable.
Let Λ be the group of continuous strictly increasing surjective functions λ on [0, 1]. Clearly, λ(0) = 0, λ(1) = 1. We will think of the λ’s as deformations of the time scale. Going back to the previous example, we want to devise a metric that not only compares the values of I_{[0,a+1/n)}(t) and I_{[0,a)}(t), but also their supports. The λ’s will help us accomplish the latter. Define

d(x, y) = inf_{λ∈Λ} { ‖x − y ◦ λ‖∞ ∨ ‖λ − I‖∞ }.

Exercise 4.4. Check that Λ indeed is a group.


5 More on Empirical Processes



In the last section we proved that the uniform empirical process Un(t) = √n (Fn(t) − t) converges weakly to the standard Brownian bridge (in the Skorohod topology). This is a uniform analogue of the CLT. We will now focus on proving uniform laws for more general empirical processes. Let F be a class of measurable functions, and let the ξi be i.i.d. with ξi ∼ P. The empirical process indexed by F is the stochastic process

(Pn − P) f = Pn f − P f = (1/n) ∑_{i=1}^n f(ξi) − P f.

If we take F = {1_{(−∞,t]}(·) | t ∈ R}, then we recover the uniform empirical process studied earlier.
A classical result in statistics is the Glivenko-Cantelli theorem:

sup_{t∈R} |F̂n(t) − F(t)| →a.s. 0,

which can be thought of as a uniform law of large numbers (because the SLLN only guarantees that F̂n(t) − F(t) →a.s. 0 for each fixed t).
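This uniform convergence is easy to see numerically (a Python sketch we add; scipy is assumed for the population CDF, and standard normal data are our illustrative choice):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
for n in [100, 1000, 10000]:
    x = np.sort(rng.standard_normal(n))
    F = norm.cdf(x)
    # sup_t |F_hat_n(t) - F(t)| is attained at the data points:
    # compare F with the empirical CDF just above and just below each x_i
    ks = max(np.max(np.abs(np.arange(1, n + 1) / n - F)),
             np.max(np.abs(F - np.arange(0, n) / n)))
    print(n, ks)   # decays roughly like 1/sqrt(n)
```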
In this section, we will develop methods to prove such uniform laws for more general empirical processes. To this end, call a class F of measurable functions P-Glivenko-Cantelli if

‖Pn − P‖_F := sup_{f∈F} |Pn f − P f| →a.s. 0.

Because of the classical Glivenko-Cantelli theorem, F = {1_{(−∞,t]}(·) | t ∈ R} is a Glivenko-Cantelli class.

Remark 5.1 (Measurability issues). There is an obvious question whether ‖Pn − P‖_F is measurable. There are some standard techniques to deal with this issue, such as (a) the outer-expectation approach of Hoffmann-Jørgensen, (b) working with separable stochastic processes only, etc. We will henceforth not address measurability issues at all. If you are uncomfortable, assume that the function class F is countable; no measurability issues arise then.

Exercise 5.1. Suppose that P is a discrete probability measure. Then F = {1_A(·) | A is a Borel set} is P-Glivenko-Cantelli.

Exercise 5.2. Show that the above statement is false for absolutely continuous probability
measures on R.

One motivation behind proving uniform laws is to establish consistency of estimators. For example, the median of a univariate distribution F, defined as

med(F) = inf{x | F(x) ≥ 1/2},

is a continuous functional in the Kolmogorov-Smirnov metric δ(F, G) = sup_{t∈R} |F(t) − G(t)| (at every F that is strictly increasing at its median). Thus, by the Glivenko-Cantelli theorem, the sample median med(F̂n) is a consistent estimator of the population median.


5.1 Concentration inequalities


In this section, we will prove two useful concentration inequalities in a series of exercises: Hoeffding’s inequality and the bounded-differences inequality.

A random variable X with mean µ is called sub-Gaussian with parameter σ if

E e^{λ(X−µ)} ≤ e^{σ²λ²/2} for all λ ∈ R.

This essentially says that the tails of X decay at least as fast as those of a Gaussian.

Exercise 5.3. Suppose X is sub-Gaussian with parameter σ. Then

P(X ≥ EX + δ) ≤ e^{−δ²/(2σ²)}.

Exercise 5.4. Suppose X1 and X2 are independent sub-Gaussians with parameters σ1 and σ2 respectively. Then a1X1 + a2X2 is sub-Gaussian with parameter √(a1²σ1² + a2²σ2²). (Note: Independence is not actually necessary, but without it you would get a slightly worse sub-Gaussianity parameter.)

Exercise 5.5 (Hoeffding’s inequality). Suppose the Xi are independent sub-Gaussians with parameters σi, i = 1, . . . , n. Then

P(∑_{i=1}^n ai Xi ≥ E ∑_{i=1}^n ai Xi + δ) ≤ e^{−δ²/(2 ∑_{i=1}^n ai²σi²)}.

Exercise 5.6. Suppose the Xi are zero-mean independent sub-Gaussians with common parameter σ, i = 1, . . . , n. Then

P((1/n) ∑_{i=1}^n Xi ≥ δ) ≤ e^{−nδ²/(2σ²)}.

Exercise 5.7. Suppose η takes values ±1 with probability 1/2 each (such random variables
are called Rademacher variables). Show that η is sub-Gaussian with parameter 1.

Exercise 5.8. Let X be a random variable such that a ≤ X ≤ b almost surely. Show that
X is sub-Gaussian with parameter σ = (b − a)/2. (Hint: Consider Taylor-expanding the
cumulant generating function ψ(λ) = log EeλX of X around 0.)

Using a soft symmetrization argument one can show that bounded random variables are sub-Gaussian with parameter (b − a). Here is how it goes. Let X′ be an independent copy of X and let η be a Rademacher variable independent of both X and X′. We have

E_X e^{λ(X−EX)} = E_X e^{λ(X−E_{X′}X′)} = E_X e^{E_{X′} λ(X−X′)} ≤ E_X E_{X′} e^{λ(X−X′)}   (Jensen).

Now η(X − X′) =d (X − X′). Hence

E_X E_{X′} e^{λ(X−X′)} = E_X E_{X′} E_η e^{λη(X−X′)} ≤ E_X E_{X′} e^{λ²(X−X′)²/2} ≤ e^{λ²(b−a)²/2}.

Expected maxima of sub-Gaussians will appear many times in the later subsections.


Exercise 5.9. Let X1, . . . , Xn be zero-mean sub-Gaussians with common parameter σ. Show that

E max_{1≤i≤n} Xi ≤ √(2σ² log n)

for all n ≥ 1. If one replaces Xi with |Xi|, show that an upper bound of 2√(σ² log n) holds for all n ≥ 2.
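For standard Gaussians (σ = 1) the first bound can be checked by simulation (a Python sketch we add purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
reps = 5000
for n in [10, 100, 1000]:
    X = rng.standard_normal((reps, n))
    emp = X.max(axis=1).mean()      # Monte Carlo estimate of E max_i X_i
    bound = np.sqrt(2 * np.log(n))  # sqrt(2 sigma^2 log n) with sigma = 1
    print(n, emp, bound)            # emp stays below bound; the ratio creeps up to 1
```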
We will now go beyond independence. Let (Yk)_{k≥0} be a martingale with respect to some filtration (Fk)_{k≥0}. Let Dk = Yk − Y_{k−1}, k ≥ 1, be the corresponding sequence of martingale differences. Assume that

E[e^{λDk} | F_{k−1}] ≤ e^{λ²σk²/2}.

(As pointed out by Apratim, this condition implies, via Exercise 1 of Homework 5, that (Dk) is a sequence of martingale differences.)
Exercise 5.10. Show that under the above setup we have

P(∑_{k=1}^n Dk ≥ δ) ≤ e^{−δ²/(2 ∑_{k=1}^n σk²)}.

Exercise 5.11 (Azuma-Hoeffding inequality). Suppose that the martingale differences satisfy Dk ∈ [ak, bk] a.s.

(a) Show that

E[e^{λDk} | F_{k−1}] ≤ e^{λ²(bk−ak)²/8}.

(b) Conclude that

P(∑_{k=1}^n Dk ≥ δ) ≤ e^{−2δ²/∑_{k=1}^n (bk−ak)²}.

A standard way to construct martingales is the Doob construction: Given a random vector X = (X1, . . . , Xn) with independent components and some function f : Rⁿ → R such that E|f(X)| < ∞, let Fk = σ(X1, . . . , Xk) and

Yk = E[f(X) | Fk].

Indeed, by the tower property of conditional expectations,

E[Yk | F_{k−1}] = E[E[f(X) | Fk] | F_{k−1}] = E[f(X) | F_{k−1}] = Y_{k−1}.

Exercise 5.12 (Bounded-differences inequality). Suppose that the function f has the bounded-differences property, i.e., for any 1 ≤ i ≤ n, there is some positive constant ℓi such that

|f(x1, . . . , x_{i−1}, xi, x_{i+1}, . . . , xn) − f(x1, . . . , x_{i−1}, xi′, x_{i+1}, . . . , xn)| ≤ ℓi

for all xi, xi′ ∈ R and all values of the other coordinates.

(a) Show that the corresponding Doob martingale differences Dk satisfy

ak ≤ Dk ≤ bk a.e. for some ak ≤ bk with bk − ak ≤ ℓk.


(b) Hence establish the bounded-differences inequality

P(f(X) ≥ E f(X) + δ) ≤ e^{−2δ²/∑_{k=1}^n ℓk²}.

The bounded-differences inequality may be used to show concentration of (a) U-statistics, (b) the length of the longest common subsequence of two independent random equal-length finite sequences from a common alphabet, (c) Rademacher complexity, and so on.

5.2 Covering numbers and bracketing numbers


Whether or not a class F is Glivenko-Cantelli will depend on the “size” of the class. Let us discuss some notions of size. Let T be a totally bounded set in a metric space (M, ρ). The e-covering number N(e, T, ρ) is the minimum size of an e-net for T. The natural logarithm of N(e, T, ρ) is sometimes called the metric entropy, for statistical mechanical reasons.
Exercise 5.13. Show that the e-covering number of a d-dimensional Euclidean ball of radius R satisfies

N(e, B(x; R), ‖·‖₂) ≤ (1 + 2R/e)^d.
Now let M be a pseudo-metric space of functions. A set of the form [ℓ, u] = {f ∈ M | ℓ ≤ f ≤ u}, with ρ(u, ℓ) < e, is called an e-bracket. The e-bracketing number N_{[]}(e, T, ρ) is the minimum size of a cover of T by e-brackets.

The classical Glivenko-Cantelli theorem may be rephrased in terms of e-bracketing numbers in the L1(P) (semi-)norm.

Theorem 5.14. Let F be a class of measurable functions such that N_{[]}(e, F, ‖·‖_{L1(P)}) < ∞ for any e > 0. Then F is P-Glivenko-Cantelli.

Proof. Choose a cover [ℓi, ui], i = 1, . . . , k, of F by e-brackets. Given f ∈ F, find a bracket [ℓj, uj] enveloping f. Then note that

Pn f − P f ≤ Pn uj − P f = (Pn − P) uj + P(uj − f) ≤ max_j (Pn − P) uj + e.

Similarly,

P f − Pn f ≤ P f − Pn ℓj = −(Pn − P) ℓj + P(f − ℓj) ≤ −min_j (Pn − P) ℓj + e.

Thus, setting ∆_{n,e} := max{ max_j (Pn − P) uj, −min_j (Pn − P) ℓj },

|Pn f − P f| ≤ ∆_{n,e} + e.


The right hand side does not depend on f, and hence serves as a uniform upper bound:

‖Pn − P‖_F ≤ ∆_{n,e} + e.

By the SLLN, (Pn − P) uj →a.s. 0 and (Pn − P) ℓj →a.s. 0, and since we have finitely many functions uj, ℓj, we conclude that ∆_{n,e} →a.s. 0. Thus, for each fixed e > 0,

lim sup_{n→∞} ‖Pn − P‖_F ≤ e a.s.

Choosing a sequence ek → 0 and adjusting the null set accordingly, we conclude that

lim sup_{n→∞} ‖Pn − P‖_F = 0 a.s.

This completes the proof.

Exercise 5.15. Recover the classical Glivenko-Cantelli theorem by exhibiting a cover of F = {1_{(−∞,t]}(·) | t ∈ R} by e-brackets for each e > 0.

We will now discuss a very useful trick called symmetrization, which, coupled with a size bound, will yield a host of Glivenko-Cantelli theorems.

5.3 Symmetrization
Recall that a Rademacher random variable η takes values ±1 with equal probability. Take an i.i.d. vector of such variables η = (η1, . . . , ηd). Clearly, η is a uniformly chosen corner of the d-dimensional hypercube of side 2 centered at the origin. The size (or, perhaps more accurately, the “spread”) of a set A ⊂ Rd can be measured by the maximum inner product of an element of A with such a random Rademacher direction. We define the Rademacher complexity of A as

R(A) := E_η sup_{a∈A} ⟨a, η⟩.

Given a function class F and n i.i.d. random elements ξi, the average size (w.r.t. the distribution of the ξi’s) of F may be measured by the expected Rademacher complexity of the Euclidean set F(ξ) := {(1/n)(f(ξ1), . . . , f(ξn)) | f ∈ F} (intuitively, if F is large, one would expect the Euclidean set F(ξ) to be large as well). We define this to be the Rademacher complexity of F:

R(F) := E_ξ R(F(ξ)).
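A minimal Monte Carlo sketch of this definition (Python; the choice of half-line indicators as the function class is our illustrative example, not from the notes): for fixed data, R(F(ξ)) is estimated by averaging the supremum over random sign vectors.

```python
import numpy as np

def rademacher_complexity(xi, reps, rng):
    """Estimate E_eta sup_f (1/n) sum_i eta_i f(xi_i) for f = 1_{(-inf, t]}, t in R."""
    n = len(xi)
    order = np.argsort(xi)
    total = 0.0
    for _ in range(reps):
        eta = rng.choice([-1.0, 1.0], size=n)
        # sup over t of (1/n) sum_{i: xi_i <= t} eta_i = best prefix sum in sorted order
        prefix = np.concatenate([[0.0], np.cumsum(eta[order])])
        total += prefix.max() / n
    return total / reps

rng = np.random.default_rng(7)
for n in [100, 1000]:
    xi = rng.standard_normal(n)
    print(n, rademacher_complexity(xi, 500, rng))   # decays roughly like 1/sqrt(n)
```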

Theorem 5.16 (Symmetrization). We have

E‖Pn − P‖_F ≤ 2R(F).


Proof. Let ξi′, i = 1, . . . , n, be independent copies of ξi, i = 1, . . . , n. Using Jensen’s inequality, we have

E‖Pn − P‖_F = E_ξ sup_{f∈F} |(1/n) ∑_{i=1}^n f(ξi) − P(f)|
= E_ξ sup_{f∈F} |(1/n) ∑_{i=1}^n (f(ξi) − E_{ξ′} f(ξi′))|
= E_ξ sup_{f∈F} |E_{ξ′} (1/n) ∑_{i=1}^n (f(ξi) − f(ξi′))|
≤ E_ξ sup_{f∈F} E_{ξ′} |(1/n) ∑_{i=1}^n (f(ξi) − f(ξi′))|
≤ E_ξ E_{ξ′} sup_{f∈F} |(1/n) ∑_{i=1}^n (f(ξi) − f(ξi′))|.

Now let the ηi be i.i.d. Rademacher, independent of the ξi’s and the ξi′’s. Then

(ηi (f(ξi) − f(ξi′)))_{1≤i≤n} =d (f(ξi) − f(ξi′))_{1≤i≤n}.

Ignoring measurability issues, we conclude that

sup_{f∈F} |(1/n) ∑_{i=1}^n (f(ξi) − f(ξi′))| =d sup_{f∈F} |(1/n) ∑_{i=1}^n ηi (f(ξi) − f(ξi′))|.

Therefore

E_ξ E_{ξ′} sup_{f∈F} |(1/n) ∑_{i=1}^n (f(ξi) − f(ξi′))| = E_ξ E_{ξ′} E_η sup_{f∈F} |(1/n) ∑_{i=1}^n ηi (f(ξi) − f(ξi′))|
≤ E_ξ E_{ξ′} E_η [ sup_{f∈F} |(1/n) ∑_{i=1}^n ηi f(ξi)| + sup_{f∈F} |(1/n) ∑_{i=1}^n ηi f(ξi′)| ]
= 2R(F).

This concludes the proof.


We end this section with an application of the bounded-differences inequality.

Exercise 5.17. Let F be a class of functions that are uniformly bounded by B. Show that

P(‖Pn − P‖_F ≥ E‖Pn − P‖_F + δ) ≤ e^{−nδ²/(2B²)}.

Conclude that, with probability at least 1 − e^{−nδ²/(2B²)}, we have

‖Pn − P‖_F < E‖Pn − P‖_F + δ ≤ 2R(F) + δ.

In the next subsection, we will learn how to bound the Rademacher complexity R(F) of uniformly bounded function classes of finite VC-dimension.


5.4 Discretization and chaining


We saw objects like E sup_{t∈T} Xt appear frequently in the previous subsection, for example in the definition of Rademacher complexity. In this subsection we will discuss a technique for upper-bounding such objects.

A zero-mean stochastic process (Xt)_{t∈T}, indexed by a (pseudo-)metric space (T, ρ), is called sub-Gaussian w.r.t. ρ if

E e^{λ(Xt − X_{t′})} ≤ e^{λ²ρ²(t,t′)/2}

for all t, t′ ∈ T and λ ∈ R.

Example 5.18. Let T = F, a class of measurable functions (which we will later assume to be uniformly bounded and to have finite VC dimension). Let

X_f = (1/√n) ∑_{i=1}^n ηi f(ξi),

where the ηi are i.i.d. Rademacher and the ξi’s are fixed. One can then check that (X_f)_{f∈F} is sub-Gaussian w.r.t. the pseudo-metric

ρ(f, g) = ‖f − g‖_{Pn} = √( (1/n) ∑_{i=1}^n (f(ξi) − g(ξi))² ).

Example 5.19. Let T = R^{n×p}. Let X be a random n × p matrix with independent zero-mean sub-Gaussian entries with sub-Gaussianity parameter 1. Let X_Θ = ⟨X, Θ⟩_F = trace(XΘ^⊤). Then it is easy to check that (X_Θ)_{Θ∈T} is sub-Gaussian w.r.t. the Frobenius metric

ρ(Θ, Θ′) = ‖Θ − Θ′‖_F.

Now we prove a preliminary result on upper bounding sup_{t∈T} Xt for a sub-Gaussian process (Xt)_{t∈T}.

Theorem 5.20 (Crude discretization). Let (Xt)_{t∈T} be a sub-Gaussian stochastic process with respect to a pseudo-metric ρ on T under which T is totally bounded. Then

E sup_{t∈T} Xt ≤ E sup_{t,t′∈T} (Xt − X_{t′}) ≤ 2 E sup_{t,t′∈T: ρ(t,t′)≤δ} (Xt − X_{t′}) + 4D √(log N(δ, T, ρ)),

where D = sup_{t,t′∈T} ρ(t, t′) is the diameter of T.

Proof. The first inequality holds because our process is zero-mean:

E sup_{t∈T} Xt = E sup_{t∈T} (Xt − X_{t0})   (for some t0 ∈ T)
≤ E sup_{t,t′∈T} (Xt − X_{t′}).


To prove the second inequality, let N = N(δ, T, ρ) and let S = {t1, . . . , tN} be a δ-net of T. For any t, find tj ∈ S which is δ-close to t. We have

±(Xt − X_{t1}) = ±(Xt − X_{tj}) ± (X_{tj} − X_{t1}) ≤ sup_{s,s′∈T: ρ(s,s′)≤δ} (Xs − X_{s′}) + max_j |X_{tj} − X_{t1}|.

Therefore

Xt − X_{t′} = (Xt − X_{t1}) − (X_{t′} − X_{t1}) ≤ 2 sup_{s,s′∈T: ρ(s,s′)≤δ} (Xs − X_{s′}) + 2 max_j |X_{tj} − X_{t1}|.

The above bound, being uniform in t, t′, also works for sup_{t,t′∈T} (Xt − X_{t′}). Taking expectations, we get

E sup_{t,t′∈T} (Xt − X_{t′}) ≤ 2 E sup_{t,t′∈T: ρ(t,t′)≤δ} (Xt − X_{t′}) + 2 E max_j |X_{tj} − X_{t1}|.

Now, because (Xt)_{t∈T} is sub-Gaussian w.r.t. ρ, the random variable X_{tj} − X_{t1} is sub-Gaussian with parameter ρ(tj, t1) ≤ D. Hence, by Exercise 5.9, we get

E max_j |X_{tj} − X_{t1}| ≤ 2 √(D² log N(δ, T, ρ)).

This completes the proof.


Example 5.21. Consider upper bounding E_η sup_{f∈F} |(1/n) ∑_{i=1}^n ηi f(ξi)|. This appears in the definition of the Rademacher complexity of a function class F: for fixed ξi’s we take expectation w.r.t. the ηi’s. Assume that F is uniformly bounded (by B, say) and that it has finite VC-dimension ν.

As we saw before, X_f = (1/√n) ∑_{i=1}^n ηi f(ξi) is sub-Gaussian w.r.t. the pseudo-metric ‖f − g‖_{Pn}. Therefore, by the crude discretization bound, we have

E sup_{f∈F} X_f ≤ 2 E sup_{f,g∈F: ‖f−g‖_{Pn}≤δ} (X_f − X_g) + 4D √(log N(δ, F, ‖·‖_{Pn})).

Now, X_f − X_g = (1/√n) ∑_{i=1}^n ηi (f(ξi) − g(ξi)) ≤ ‖η‖₂ ‖f − g‖_{Pn} by Cauchy-Schwarz, whence

E sup_{f,g∈F: ‖f−g‖_{Pn}≤δ} (X_f − X_g) ≤ δ E‖η‖₂ ≤ δ √(E‖η‖₂²) = δ√n.

It can be shown that

N(δ, F, ‖·‖_{Pn}) ≤ C_{B,ν} δ^{−2(ν−1)},

for some constant C_{B,ν} that only depends on B and ν. Also note that D ≤ 2B. Therefore

E sup_{f∈F} X_f ≤ 2δ√n + 8B √( log C_{B,ν} + 2(ν − 1) log(1/δ) ).


Plugging in δ = 1/√n, we get that

E sup_{f∈F} X_f ≤ C′_{B,ν} √(log n),

for some constant C′_{B,ν} that only depends on B and ν. Coming back to the original object of consideration,

E_η sup_{f∈F} |(1/n) ∑_{i=1}^n ηi f(ξi)| = (1/√n) E sup_{f∈F} |X_f| ≤ (2/√n) E sup_{f∈F} X_f ≤ 2C′_{B,ν} √(log n / n),

where we have used the fact that (X_f)_{f∈F} =d (−X_f)_{f∈F}, so that

E sup_{f∈F} |X_f| = E sup_{f∈F} (X_f ∨ (−X_f))
≤ E ( sup_{f∈F} X_f ∨ sup_{f∈F} (−X_f) )
≤ E ( sup_{f∈F} X_f + sup_{f∈F} (−X_f) )
= 2 E sup_{f∈F} X_f.

In the bound

E_η sup_{f∈F} |(1/n) ∑_{i=1}^n ηi f(ξi)| = O_{B,ν}( √(log n / n) ),

the log n factor is an artifact of the crudeness of our discretization and can be removed via a much more refined way of discretization called chaining.

Example 5.22. Consider upper bounding E (1/√n)‖X‖_op,⁶ where X is as in Example 5.19. Using the variational representation of the operator norm, we have

‖X‖_op = sup_{‖x‖₂=‖y‖₂=1} x^⊤ X y
= sup_{‖x‖₂=‖y‖₂=1} trace(X y x^⊤)
= sup_{‖x‖₂=‖y‖₂=1} ⟨X, x y^⊤⟩_F
≤ sup_{Θ∈T} ⟨X, Θ⟩_F,

⁶ It can be shown that the largest eigenvalue of the scaled Wishart matrix (1/n) X^⊤X converges almost surely to (1 + √γ)² in the asymptotic regime p/n → γ > 0. Here we are using a cheap discretization bound to prove a non-asymptotic version of this.


where T = {Θ ∈ R^{n×p} | rank(Θ) = 1, ‖Θ‖_F = 1}. Consider the stochastic process X_Θ = ⟨X, Θ⟩_F. From Example 5.19, we know that it is sub-Gaussian w.r.t. the Frobenius norm. Therefore, by the crude discretization bound, we get that

E‖X‖_op ≤ 2 E sup_{Θ,Θ′∈T: ‖Θ−Θ′‖_F≤δ} (X_Θ − X_{Θ′}) + 4D √(log N(δ, T, ‖·‖_F)).

From Exercise 5.24,

X_Θ − X_{Θ′} = ⟨X, Θ − Θ′⟩_F ≤ 2‖Θ − Θ′‖_F ‖X‖_op.

Therefore

E sup_{Θ,Θ′∈T: ‖Θ−Θ′‖_F≤δ} (X_Θ − X_{Θ′}) ≤ 2δ E‖X‖_op.

Using Exercise 5.13, it is easy to show that

N(δ, T, ‖·‖_F) ≤ (1 + 2/δ)^{n+p}.

Also, note that D ≤ 2. Therefore

E‖X‖_op ≤ 4δ E‖X‖_op + 8 √( (n + p) log(1 + 2/δ) ).

Re-arranging and choosing δ = δ0, a constant such that 1 − 4δ0 > 0, we get

E‖X‖_op ≤ C √(n + p),

for some universal constant C. Hence

E (1/√n)‖X‖_op ≤ C √(1 + p/n) ≤ C (1 + √(p/n)).

Using Gaussian comparison inequalities one can even prove the above result with C = 1, which is the best possible in light of the asymptotic result.
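A numerical companion to this bound (a Python sketch we add for illustration): for standard Gaussian entries, the operator norm of a single draw already sits close to √n + √p, consistent with the constant C = 1.

```python
import numpy as np

rng = np.random.default_rng(8)
for (n, p) in [(200, 50), (500, 100), (1000, 1000)]:
    X = rng.standard_normal((n, p))
    op = np.linalg.norm(X, 2)                  # largest singular value of X
    print(n, p, op, np.sqrt(n) + np.sqrt(p))   # op is close to (and typically below) this
```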

Exercise 5.23. Show that

‖X‖_op = sup_{rank(Θ)=1, ‖Θ‖_F=1} ⟨X, Θ⟩_F.

Exercise 5.24. Show that

(a) ‖AB‖_op ≤ ‖A‖_op ‖B‖_F.

(b) ⟨A, B⟩_F ≤ min{rank(A), rank(B)} ‖A‖_op ‖B‖_F.


Theorem 5.25 (Chaining). We have

E sup_{t∈T} Xt ≤ E sup_{t,t′∈T} (Xt − X_{t′}) ≤ 2 E sup_{t,t′∈T: ρ(t,t′)≤δ} (Xt − X_{t′}) + 32 ∫_{δ/4}^{D} √(log N(u, T, ρ)) du.

Proof. The first few steps of the proof are the same as those in the proof of the crude discretization bound. We begin with the inequality

E sup_{t,t′∈T} (Xt − X_{t′}) ≤ 2 E sup_{t,t′∈T: ρ(t,t′)≤δ} (Xt − X_{t′}) + 2 E max_j |X_{tj} − X_{t1}|.

We will bound the second term more carefully this time.

Example 5.26. Going back to Example 5.21, if we use chaining, the resulting bound is

E sup_{f∈F} X_f ≤ 2δ√n + 32 ∫_{δ/4}^{D} √( log C_{B,ν} + 2(ν − 1) log(1/u) ) du.

The integral above can be upper-bounded by a constant multiple of ∫₀^D √(log(1/u)) du, which is finite. Hence, choosing δ = 1/√n, we get

E sup_{f∈F} X_f = O_{B,ν}(1),

whence follows the optimal bound

E_η sup_{f∈F} |(1/n) ∑_{i=1}^n ηi f(ξi)| = O_{B,ν}(1/√n).

Exercise 5.27. Show that ∫₀^D √(log(1/u)) du < ∞.

Remark 5.2. Although chaining is able to remove extra log n factors in many problems, there
are situations where it does not produce optimal bounds. Michel Talagrand’s generic chaining
method has the final say—it produces optimal upper bounds.

We end this section by collecting, in the form of a theorem, the results we have proved so far about a uniformly bounded function class of finite VC-dimension.

Theorem 5.28 (Glivenko-Cantelli for uniformly bounded function classes of finite VC-dimension). Suppose F is a function class uniformly bounded by B and with finite VC-dimension ν. Then, with probability at least 1 − e^{−nδ²/(2B²)}, one has

‖Pn − P‖_F ≤ O_{B,ν}(1/√n) + δ.


This is much stronger than just saying that ‖Pn − P‖_F →a.s. 0, which follows from the above theorem by the Borel-Cantelli lemma. Note that if we choose δ² = 2kB²/n, then, with probability at least 1 − e^{−k}, we have

‖Pn − P‖_F = O_{B,ν}( √(k/n) ),

which implies that

‖Pn − P‖_F = O_P(1/√n),

which is in fact the optimal rate.


References
Billingsley, P. (2013). Convergence of probability measures. John Wiley & Sons.

Karatzas, I. and Shreve, S. (2012). Brownian motion and stochastic calculus, volume 113.
Springer Science & Business Media.

Kumaresan, S. (2005). Topology of metric spaces. Alpha Science International Limited.

Munkres, J. R. (1975). Topology: a first course, volume 23. Prentice-Hall, Englewood Cliffs, NJ.
