
Springer Textbook

Yakov G. Sinai

Probability Theory
An Introductory Course

Translated from the Russian


by D. Haughton

With 14 Figures

Springer-Verlag Berlin Heidelberg GmbH


Yakov G. Sinai
Russian Academy of Sciences
L. D. Landau Institute for Theoretical Physics
ul. Kosygina, 2
Moscow 117940, GSP-1, V-334, Russia

Translator:
Dominique Haughton
Department of Mathematics
College of Pure and Applied Science
University of Lowell, One University Avenue
Lowell, MA 01854, USA

Title of the Russian original edition (in two parts):


Kurs teorii veroyatnostej
Publisher MGU, 1985 and 1986

Mathematics Subject Classification (1991): 60-01

ISBN 978-3-540-53348-1

Die Deutsche Bibliothek - CIP-Einheitsaufnahme.


Sinai, Yakov G.: Probability theory: an introductory course / Yakov G. Sinai. Transl. by
D. Haughton.
ISBN 978-3-540-53348-1 ISBN 978-3-662-02845-2 (eBook)
DOI 10.1007/978-3-662-02845-2

This work is subject to copyright. All rights are reserved, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in other ways, and storage in data
banks. Duplication of this publication or parts thereof is permitted only under the pro-
visions of the German Copyright Law of September 9, 1965, in its current version, and per-
mission for use must always be obtained from Springer-Verlag. Violations are liable for
prosecution under the German Copyright Law.
© Springer-Verlag Berlin Heidelberg 1992
Originally published by Springer-Verlag Berlin Heidelberg New York in 1992
Softcover reprint of the hardcover 1st edition 1992

The use of registered names, trademarks, etc. in this publication does not imply, even in the
absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.
Typesetting: Camera ready by author
41/3111-5432 Printed on acid-free paper
Prefaces

Preface to the English Edition


This book grew out of lecture courses that were given to second and third
year students at the Mathematics Department of Moscow State University for
many years. It requires some knowledge of measure theory, and the modern
language of the theory of measurable partitions is frequently used. In several
places I have tried to emphasize some of the connections with current trends
in probability theory and statistical mechanics.
I thank Professor Haughton for her excellent translation.
Ya. G. Sinai May 1992

Preface to First Ten Chapters


The first ten chapters of this book comprise an extended version of the first
part of a required course in Probability Theory, which I have been teaching for
many years to fourth semester students in the Department of Mechanics and
Mathematics at Moscow State University. The fundamental idea is to give a
logically coherent introduction to the subject, while making as little use as
possible of the apparatus of measure theory and Lebesgue integration. To this
end, it was necessary to modify a number of details in the presentation of
some long-established sections.
These chapters cover the concepts of random variable, mathematical expec-
tation and variance, as well as sequences of independent trials, Markov chains,
and random walks on a lattice. Kolmogorov's axioms are used throughout the
text. Several non-traditional topics are also touched upon, including the prob-
lem of percolation, and the introduction of conditional probability through
the concept of measurable partitions. With the inclusion of these topics it is
hoped that students will become actively involved in scientific research at an
early stage.
This part of the book was refereed by B.V. Gnedenko, Member of the
Academy of Sciences of the Ukrainian Soviet Socialist Republic, and N.N.
Chentsov, Doctor of Mathematical and Physical Sciences. I wish to thank
I.S. Sineva for assistance in preparing the original manuscript for publication.
Preface to Chapters 11 - 16

Chapters eleven through sixteen constitute the second part of a course on


Probability Theory for mathematics students in the Department of Mechanics
and Mathematics of Moscow State University. The chapters cover the strong
law of large numbers, the weak convergence of probability distributions, and
the central limit theorem for sums of independent random variables. The notion
of stability, as it relates to the central limit theorem, is discussed from the
point of view of the renormalization group method of statistical physics, as is
a somewhat less traditional topic, the analysis of asymptotic probabilities of
large deviations.
This part of the book was also refereed by B.V. Gnedenko, as well as
by Professor A.N. Shiryaev. I wish to thank M.L. Blank, A. Dzhalilov, E.O.
Lokutsievckaya and I.S. Sineva for their great assistance in preparing the orig-
inal manuscript for publication.

Translator's Preface

The Russian version of this book was published in two parts. Part I, covering
the first ten chapters, appeared in 1985. Part II appeared in 1986, and only
the first six chapters have been translated here.
I would like to thank Jonathan Haughton for typesetting the English ver-
sion in TEX.
Contents

Lecture 1. Probability Spaces and Random Variables .................. 1
1.1 Probability Spaces and Random Variables ......................... 1
1.2 Expectation and Variance for Discrete Random Variables .......... 9

Lecture 2. Independent Identical Trials and the Law of
Large Numbers ...................................................... 15
2.1 Algebras and σ-algebras; Borel σ-algebras ...................... 15
2.2 Heuristic Approach to the Construction of "Continuous"
    Random Variables ............................................... 16
2.3 Sequences of Independent Trials ................................ 18
2.4 Law of Large Numbers for Sequences of Independent
    Identical Trials ............................................... 21
2.5 Generalizations ................................................ 22
2.6 Applications ................................................... 22
2.6.1 Probabilistic Proof of Weierstrass's Theorem ................. 22
2.6.2 Application to Number Theory ................................. 23
2.6.3 Monte Carlo Methods .......................................... 24
2.6.4 Entropy of a Sequence of Independent Trials and
      McMillan's Theorem ........................................... 25
2.6.5 Random Walks ................................................. 27

Lecture 3. De Moivre-Laplace and Poisson Limit Theorems ............ 30
3.1 De Moivre-Laplace Theorems ..................................... 30
3.1.1 Local and Integral De Moivre-Laplace Theorems ................ 30
3.1.2 Application to Symmetric Random Walks ........................ 34
3.1.3 A Problem in Mathematical Statistics ......................... 34
3.1.4 Generalizations of the De Moivre-Laplace Theorem ............. 36
3.2 The Poisson Distribution and the Poisson Limit Theorem ......... 40
3.2.1 Poisson Limit Theorem ........................................ 40
3.2.2 Application to Statistical Mechanics ......................... 41

Lecture 4. Conditional Probability and Independence ................ 43
4.1 Conditional Probability and Independence of Events ............. 43
4.2 Independent σ-algebras and Sequences of Independent Trials ..... 45
4.3 The Gambler's Ruin Problem ..................................... 47

Lecture 5. Markov Chains ........................................... 54
5.1 Stochastic Matrices ............................................ 54
5.2 Markov Chains .................................................. 55
5.3 Non-Ergodic Markov Chains ...................................... 58
5.4 The Law of Large Numbers and the Entropy of a Markov Chain ..... 61
5.5 Application to Products of Positive Matrices ................... 64

Lecture 6. Random Walks on the Lattice ℤ^d ......................... 67

Lecture 7. Branching Processes ..................................... 73

Lecture 8. Conditional Probabilities and Expectations .............. 78

Lecture 9. Multivariate Normal Distributions ....................... 83

Lecture 10. The Problem of Percolation ............................. 89

Lecture 11. Distribution Functions, Lebesgue Integrals and
Mathematical Expectation ........................................... 95
11.1 Introduction .................................................. 95
11.2 Properties of Distribution Functions .......................... 96
11.3 Types of Distribution Functions ............................... 97

Lecture 12. General Definition of Independent Random
Variables and Laws of Large Numbers ............................... 104

Lecture 13. Weak Convergence of Probability Measures on the
Line and Helly's Theorems ......................................... 113

Lecture 14. Characteristic Functions .............................. 120

Lecture 15. Central Limit Theorem for Sums of Independent
Random Variables .................................................. 127

Lecture 16. Probabilities of Large Deviations ..................... 134


Lecture 1. Probability Spaces and Random
Variables

1.1 Probability Spaces and Random Variables

The field of probability theory is somewhat different from the other fields of
mathematics. To understand it, and to explain the meaning of many defini-
tions, concepts and results, we need to draw on real life examples, or examples
from related areas of mathematics or theoretical physics. On the other hand,
probability theory is a branch of mathematics, with its own axioms and meth-
ods, and with close connections with some other areas of mathematics, most
notably with functional analysis, and ordinary and partial differential equa-
tions. At the foundation of probability theory lies abstract measure theory, but
the role played here by measure theory is similar to that played by differential
and integral calculus in the theory of differential equations.
We begin our study of probability theory with its axioms, which were intro-
duced by the great Soviet mathematician and founder of modern probability
theory, A. N. Kolmogorov. The first concept is that of the space of elementary
outcomes. From a general point of view the space of elementary outcomes is
an abstract set. In probability theory, as a rule, it is denoted by Ω and its
points are denoted by ω, and are called elementary events or elementary out-
comes. When Ω is finite or countable, probability theory is sometimes called
discrete probability theory. A major portion of the first half of this course will
be related to discrete probability theory. Subsets C ⊂ Ω are called events.
Moreover the set C = Ω is called the certain event, and the empty set C = ∅
is called the impossible event. One can form unions,
intersections, Cartesian products, complements, differences, etc. of events.
Our first task is that of constructing spaces of elementary outcomes for
various concrete situations, and this is best done by considering several ex-
amples.

Example 1.1. At school many of us have heard that probability theory origi-
nated in the analysis of problems arising from games of chance. The simplest,
although important, example of such a game consists of a single toss of a
coin, which can fall on one of its two sides. Here Ω consists of two points,
Ω = {ω_h, ω_t}, where ω_h (ω_t) means that the toss gave heads (tails).

Remark. In this example the number of possible events C ⊂ Ω is equal to
4 = 2²: Ω, {ω_h}, {ω_t}, ∅.

If we replace the coin by a die then Ω consists of six points, ω₁, ω₂, ω₃,
ω₄, ω₅, ω₆, according to the number showing on the top side after throwing
the die.

Example 1.2. Let us assume that a coin is tossed n times. An elementary
outcome then has the form ω = {a₁, a₂, ..., aₙ}. Each aᵢ takes two values, 1
or 0. We will say that aᵢ = 1 if the i-th toss yielded heads, and aᵢ = 0 if it
yielded tails. Alternatively we could say that ω is a word of length n, written
in the alphabet {0, 1}. The space Ω in this case consists of 2ⁿ points.
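For a reader who wants to experiment, the space Ω = {0, 1}ⁿ of Example 1.2 is easy to enumerate by machine. The following Python sketch (the variable names are ours, not the book's) lists all words of length n and confirms that there are 2ⁿ of them:

```python
from itertools import product

# All elementary outcomes for n tosses of a coin:
# words of length n over the alphabet {0, 1}.
n = 3
omega = list(product([0, 1], repeat=n))
assert len(omega) == 2 ** n  # |Omega| = 2^n
```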

This space Ω is encountered in many problems of probability theory. For
example, one can imagine a complicated system consisting of n identical el-
ements, each of which can be found in two states, which as before will be
denoted by {0, 1}. The state of the whole system can then be described by a
word ω, as above.
In statistical mechanics one encounters models which consist of individual
elementary magnets (spins) with different orientations. We represent them
by an arrow directed upwards (downwards) if the north pole is above
(below):

[Figure: a single spin drawn as an arrow between its poles, ↑ when the north pole is above and ↓ when it is below]

Fig. 1.1.

Then a collection of magnets situated at the points 1, ..., n is a collection of
arrows

↑ ↑ ↓ ↑ ↓ ↓ ↓ ↑ ↓ ↓ ↑
1 1 0 1 0 0 0 1 0 0 1

which we can again codify by a word of length n from {0, 1}.


Several generalizations suggest themselves immediately. First we can as-
sume that each aᵢ takes not two, but several values. In general it is convenient
to assume that aᵢ is an element of an abstract space of values, X. We can rep-
resent X as an alphabet, and ω = {a₁, ..., aₙ} as a word of length n, written
from the alphabet X. If X = (x(1), ..., x(r)) contains r elements, then each
letter aᵢ takes r values and Ω consists of rⁿ points. Imagine, for example, the
following concrete situation. Consider a group of n students, whose birthdays
are known. Assume that none of the participants were born in a leap year.
Then ω = {a₁, ..., aₙ} is the list of n birthdays, where each aᵢ takes one of
365 possible values. By the same token X is the space of all possible birth-
days, and therefore consists of 365 points, and Ω consists of 365ⁿ points. If
X is the unit interval, X = [0, 1], then Ω consists of all choices of n numbers
on the interval [0, 1]. Sometimes the space Ω is written as Ω = X^[1,n], where
[1, n] = {1, 2, 3, ..., n}.
The second generalization concerns a more general interpretation of the
index i in the notation ω = {aᵢ}. For example we could have i = (i₁, i₂),
where i₁ and i₂ are integers, K₁ ≤ i₁ ≤ K₂, L₁ ≤ i₂ ≤ L₂. Denote by I the
set of points i = (i₁, i₂) of the plane with integer coordinates.

Fig. 1.2.

Further assume that each point i can be found in two states, denoted by 0, 1.
The state of the whole system is then ω = {aᵢ}, i ∈ I, where aᵢ takes the two
values 0, 1. Such spaces Ω are encountered in the theory of random fields. X
can also be arbitrary.
We now come to a general definition.

Definition 1.1. Let X be a space of values and I be a space of indices. The
space Ω = X^I is the space of all possible words ω = {aᵢ, i ∈ I}, where aᵢ ∈ X.
Sometimes Ω is called the space of functions from I with values in X.

Example 1.3. In the lottery "6 from 36" we consider a square with 36 cells num-
bered 1 through 36. Each participant must choose six distinct numbers. Here
an individual elementary outcome has the form ω = {a₁, a₂, a₃, a₄, a₅, a₆}
and is a six-element subset of the 36-element set. Now Ω consists of
(36 choose 6) = (36·35·34·33·32·31)/6! points.
We can obtain a variant in the following way. Let X be a finite set and
let us carry out a sampling without replacement of n elements of this set,
i.e. ω = {a₁, a₂, ..., aₙ}, where now aᵢ ≠ aⱼ for i ≠ j. If |X| = r then Ω consists of
r(r − 1)···(r − n + 1) elements.¹

¹ From now on the absolute value of a set will denote its cardinality, that is the
number of its elements.

Other examples of spaces Ω will be given later.
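Both counts in Example 1.3 are easy to check numerically. The snippet below is our illustration, not part of the original text; it uses the standard-library functions `math.comb` and `math.perm`:

```python
from math import comb, perm

# |Omega| for the "6 from 36" lottery: six-element subsets of a 36-element set.
lottery = comb(36, 6)
assert lottery == (36 * 35 * 34 * 33 * 32 * 31) // (1 * 2 * 3 * 4 * 5 * 6)

# Sampling without replacement: r(r-1)...(r-n+1) ordered samples.
r, n = 36, 6
assert perm(r, n) == lottery * 720  # each subset admits 6! = 720 orderings
```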

Let Ω be finite, i.e. |Ω| = m < ∞. We denote by ℱ the collection of all
events C, i.e. all subsets of Ω.

Theorem 1.1. ℱ consists of 2^m events.

Proof. Let C ⊂ Ω. We introduce the function χ_C(ω) on Ω, where

χ_C(ω) = 1 if ω ∈ C, and χ_C(ω) = 0 otherwise.

It is clear that every such function on Ω taking the values 1 and 0 uniquely
defines an event C, namely the set where the function χ_C(ω) is equal to
1. In other words, ℱ is in one-to-one correspondence with the points of X^Ω,
where X = {1, 0}. The number of such points is equal to 2^|Ω| = 2^m. □
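The one-to-one correspondence in this proof can be made concrete: each 0/1-valued function on Ω picks out the event where it equals 1. A small Python sketch (our own illustration, with a three-point Ω chosen arbitrarily):

```python
from itertools import product

# Omega with m = 3 points; each function chi in {0,1}^Omega defines the
# event {w : chi(w) = 1}, and distinct functions give distinct events.
omega = ['a', 'b', 'c']
events = {frozenset(w for w, bit in zip(omega, chi) if bit)
          for chi in product([0, 1], repeat=len(omega))}
assert len(events) == 2 ** len(omega)  # 2^m events, as in Theorem 1.1
```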

Now let Ω be arbitrary.


Definition 1.2. A collection 𝒢 of subsets of Ω is called an algebra if
1) Ω ∈ 𝒢;
2) C ∈ 𝒢 ⇒ Ω \ C ∈ 𝒢;
3) C₁, C₂, ..., Cₖ ∈ 𝒢 ⇒ ⋃_{i=1}^k Cᵢ ∈ 𝒢.
Here are some simple corollaries of this definition.

Corollary 1.1. ∅ ∈ 𝒢.

Proof. Take Ω ∈ 𝒢 and apply 2). □

Corollary 1.2. C₁, ..., Cₖ ∈ 𝒢 ⇒ ⋂_{i=1}^k Cᵢ ∈ 𝒢.

Proof. Indeed Ω \ ⋂_{i=1}^k Cᵢ = ⋃_{i=1}^k (Ω \ Cᵢ) ∈ 𝒢. Consequently ⋂_{i=1}^k Cᵢ ∈ 𝒢. □

The following result is already less trivial.

Theorem 1.2. If the algebra 𝒢 is finite, then there exist subsets A₁, A₂, ...
..., A_r ∈ 𝒢 such that 1) Aᵢ ∩ Aⱼ = ∅ for i ≠ j; 2) ⋃_{i=1}^r Aᵢ = Ω; 3) any event
C ∈ 𝒢 is a union of Aᵢ's.

Remark. The collection of events A₁, ..., A_r defines a partition of Ω. In this
way finite algebras are generated by finite partitions, and conversely. This
statement can be generalized considerably.

Remark. The algebra 𝒢 contains 2^r elements. Indeed, let us define a new space
Ω′ whose points are the sets of the partition A₁, A₂, ..., A_r. Then 𝒢 can be
considered as the collection of subsets of the space Ω′, |Ω′| = r. By Theorem
1.1, 𝒢 contains 2^r elements.

Proof of Theorem 1.2. Let us number the elements of 𝒢 in an arbitrary way:
C₁, C₂, ..., C_s. For any C we set

C¹ = C,   C⁻¹ = Ω \ C.

Consider the sequences b = {b₁, b₂, ..., b_s}, where each bᵢ takes the value +1
or −1, and set

C(b) = ⋂_{i=1}^s Cᵢ^{bᵢ}.

Then all the C(b) ∈ 𝒢 (as follows from 2) and Corollary 1.2).
It is possible that C(b) = ∅. However for any ω ∈ Ω there exists a b such
that ω ∈ C(b). Indeed ω ∈ Cᵢ^{bᵢ} for one of the values bᵢ = ±1 for every i,
1 ≤ i ≤ s. Therefore ω ∈ C(b), and so not all C(b) are empty. If b′ ≠ b″ then
C(b′) ∩ C(b″) = ∅. Indeed b′ ≠ b″ means that bᵢ′ ≠ bᵢ″ for some i. To fix ideas,
assume that bᵢ′ = 1, bᵢ″ = −1. Then in the expression C(b′) we find Cᵢ¹ = Cᵢ,
so C(b′) ⊆ Cᵢ, and in the expression C(b″) we find Cᵢ⁻¹ = Ω \ Cᵢ, therefore
C(b″) ⊆ Ω \ Cᵢ; it follows that C(b′) ∩ C(b″) = ∅. Therefore the sets C(b) are
either empty or non-empty and pairwise disjoint. Let us take, for the Aᵢ, the
non-empty sets C(b). Since each ω is an element of one of the C(b), it belongs
to one of the Aᵢ. This means that ⋃ᵢ Aᵢ = Ω. Finally, each Cⱼ is the union of
those non-empty C(b) for which bⱼ = +1, which gives 3). □
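The intersections C(b) from this proof can be computed directly. The sketch below is our own small example (Ω = {0, ..., 5} and two generating sets, both invented for illustration); it builds the non-empty C(b) and checks that they form a partition of Ω:

```python
from itertools import product

omega = frozenset(range(6))
generators = [frozenset({0, 1, 2}), frozenset({1, 2, 3, 4})]

def C_of_b(b):
    # C(b) = intersection over i of C_i (if b_i = +1) or Omega \ C_i (if b_i = -1)
    result = omega
    for C, sign in zip(generators, b):
        result &= C if sign == +1 else omega - C
    return result

atoms = {a for b in product([+1, -1], repeat=len(generators))
         if (a := C_of_b(b))}      # keep only the non-empty C(b)

# The non-empty C(b) are pairwise disjoint and cover Omega.
assert sum(len(a) for a in atoms) == len(omega)
assert frozenset().union(*atoms) == omega
```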

Definition 1.3. A collection ℱ of subsets of Ω is called a σ-algebra if ℱ is
an algebra and if C₁, ..., Cₖ, ... ∈ ℱ implies ⋃_{i=1}^∞ Cᵢ ∈ ℱ.

As above, we have: C₁, ..., Cₖ, ... ∈ ℱ implies ⋂_{i=1}^∞ Cᵢ ∈ ℱ.
At first sight it seems that the difference between algebras and
σ-algebras is not very large. However this is not the case. In fact, only in
the case of σ-algebras does an interesting theory arise.

Definition 1.4. A measurable space is a pair (Ω, ℱ) where ℱ is a σ-algebra
of subsets of the space Ω.
We now consider the case of a discrete space Ω.
Definition 1.5. Any function ξ = f(ω) with real values is called a random
variable.

For more general spaces Ω, the function f must have further properties,
known as measurability properties (see below).

Notation. We denote random variables by small Greek letters, for example
ξ, η and ζ. If an emphasis is needed on the functional dependence on ω then
we write ξ(ω), ξ = f(ω), and so on.
The random variables form a linear space, since they can be added and
multiplied by constant factors. This space is also a commutative ring, since
random variables can be multiplied.
We denote by ℱ the σ-algebra of all the subsets of Ω (Ω is discrete!), and
we consider an arbitrary sub-algebra ℱ₀ ⊂ ℱ.

Definition 1.6. A random variable ξ = f(ω) is said to be ℱ₀-measurable if
for any a, b, a ≤ b, the set {ω : a ≤ f(ω) < b} ∈ ℱ₀.

In order to understand the meaning of this concept, consider the case
where ℱ₀ is finite. By Theorem 1.2 such an ℱ₀ gives rise to a finite partition
of Ω into subsets A₁, A₂, ..., A_r.

Theorem 1.3. If ξ is ℱ₀-measurable, then it takes constant values on each Aᵢ,
1 ≤ i ≤ r.

Proof. Let dᵢ be the different values of the function f. Their number does not
exceed the number of elements of ℱ₀ (check this!), and therefore is finite. Let us
write them in increasing order: d₁ < d₂ < ... < d_t, and let Uᵢ = {ω : f(ω) =
dᵢ}. It is clear that Uᵢ ∩ Uⱼ = ∅ for i ≠ j and ⋃ᵢ Uᵢ = Ω. We now show that
Uᵢ ∈ ℱ₀. Let us construct intervals [aᵢ, bᵢ) where d_{i−1} < aᵢ < dᵢ < bᵢ < d_{i+1}.
Then

{ω : aᵢ ≤ f(ω) < bᵢ} = Uᵢ,

that is, Uᵢ ∈ ℱ₀. By Theorem 1.2, each Uᵢ can be written
as a union of Aⱼ's, i.e. f(ω) = dᵢ for ω ∈ Aⱼ. Moreover different i give rise to
different elements of the partition. □

We now return to the general case and turn our attention to the central
concept of our theory, the concept of probability. We noted earlier that proba-
bility is a special case of measure. Now consider an arbitrary measurable space
(Ω, ℱ).

Definition 1.7. A probability measure is a function P, defined on ℱ, which
satisfies the following conditions:
1) P(C) ≥ 0 for any C ∈ ℱ;
2) P(Ω) = 1;
3) if Cᵢ ∈ ℱ, i = 1, 2, ..., and Cᵢ ∩ Cⱼ = ∅ for i ≠ j, then

P(⋃_{i=1}^∞ Cᵢ) = Σ_{i=1}^∞ P(Cᵢ).

The number P(C) is called the probability of the event C.

We now discuss the meaning of properties 1)-3). The concrete idea of the
probability of an event or a phenomenon or an occurrence is the frequency
of occurrence of this event, or the "chances" of its taking place. Since the
frequency of occurrence is always non-negative, a probability must be non-
negative. Property 2) means that when we consider an experiment whose
outcome can be any point ω ∈ Ω we are dealing with a situation where some
outcome of the experiment will be observed. Sometimes property 2) is called
the normalization property. Those events C for which P(C) = 1 are called cer-
tain (they have a 100% chance of occurrence). Since Ω = Ω ∪ ∅ and Ω ∩ ∅ = ∅,
P(Ω) = P(Ω ∪ ∅) = P(Ω) + P(∅), and so P(∅) = 0. Those events C for which
P(C) = 0 are called impossible (0% chance of occurrence). The meaning of
property 3) will be developed gradually. It is called the property of count-
able additivity, or σ-additivity, of probability measures. It is fundamental to
general measure theory. Properties 1)-3) show that probabilities are normed
measures.

Definition 1.8. The triplet (n, 1, P) is called a probability space.

We now introduce simple corollaries to Definition 1.8. Let Ω be discrete,
that is Ω = {ωᵢ}. We denote by C_ω the event with only one point: C_ω = {ω} ⊂ Ω.
Then p(ω) = P(C_ω) is the probability of the elementary outcome ω.
It follows easily from properties 1)-3) of a probability that:

1) p(ω) ≥ 0;   2) Σ_ω p(ω) = 1.                                  (1.1)

A collection of numbers which satisfy 1) and 2) is called a probability distri-
bution. In this way every probability P generates a probability distribution
on Ω, and vice versa.

Theorem 1.4. Every probability distribution on Ω generates a probability mea-
sure on the σ-algebra ℱ of subsets of Ω by the formula P(C) = Σ_{ωₖ∈C} p(ωₖ).

Proof. Properties 1) and 2) of Definition 1.7 follow from (1.1). Property 3)
involves the equality

P(⋃ᵢ Cᵢ) = Σ_{ωₖ∈⋃ᵢCᵢ} p(ωₖ) = Σᵢ Σ_{ωₖ∈Cᵢ} p(ωₖ) = Σᵢ P(Cᵢ),

and follows from the fact that the sum of a converging series with positive
terms does not depend on the order of summation. □
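Theorem 1.4 is easy to see in action: any collection of non-negative numbers p(ω) summing to 1 yields an additive set function P(C) = Σ_{ω∈C} p(ω). A minimal Python sketch (the two-point distribution below is a made-up example):

```python
# A probability distribution p on a discrete Omega induces
# P(C) = sum of p(w) over w in C.
p = {'heads': 0.5, 'tails': 0.5}

def P(C):
    return sum(p[w] for w in C)

assert P(p.keys()) == 1.0                                    # normalization
assert P({'heads'}) + P({'tails'}) == P({'heads', 'tails'})  # additivity
```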

Here are several equivalent formulations of σ-additivity.



Theorem 1.5. A function P on ℱ which satisfies properties 1) and 2) of
Definition 1.7, and is additive (i.e. property 3) of Definition 1.7 holds for finite
collections Cᵢ), is σ-additive (satisfies property 3)) if and only if:
1) for any sequence of events Cᵢ ∈ ℱ, Cᵢ ⊆ C_{i+1},

P(⋃ᵢ Cᵢ) = lim_{i→∞} P(Cᵢ);

2) for any sequence of events Cᵢ ∈ ℱ, Cᵢ ⊇ C_{i+1},

P(⋂ᵢ Cᵢ) = lim_{i→∞} P(Cᵢ);

3) for any sequence of events Cᵢ ∈ ℱ, Cᵢ ⊇ C_{i+1}, ⋂ᵢ Cᵢ = ∅,

lim_{i→∞} P(Cᵢ) = 0.

Proof. All three statements are proved in the same way. By way of example
let us prove the last one. Let Cᵢ ∈ ℱ, Cᵢ ⊇ C_{i+1}, ⋂ᵢ Cᵢ = ∅. Now form the
events Bᵢ = Cᵢ \ C_{i+1}. Then Bᵢ ∩ Bⱼ = ∅ for i ≠ j, and Cₖ = ⋃_{i≥k} Bᵢ. From
the σ-additivity property of Definition 1.7 we have P(C₁) = Σ_{i=1}^∞ P(Bᵢ).
Therefore the remainder of the series, Σ_{i=k}^∞ P(Bᵢ) = P(Cₖ), tends to 0 as k → ∞.
Conversely, assume that we have a sequence of sets Cᵢ with Cᵢ ∩ Cⱼ = ∅ for i ≠ j.
Form C = ⋃_{i=1}^∞ Cᵢ. Then C = ⋃_{i=1}^n Cᵢ ∪ ⋃_{i=n+1}^∞ Cᵢ for any n, and by finite
additivity P(C) = Σ_{i=1}^n P(Cᵢ) + P(⋃_{i=n+1}^∞ Cᵢ). The events Bₙ = ⋃_{i=n+1}^∞ Cᵢ
decrease, and ⋂ₙ Bₙ = ∅. Therefore P(Bₙ) → 0 and P(C) = Σ_{i=1}^∞ P(Cᵢ). □

Corollary to the additivity of P. For any collection of sets A₁, ..., Aₙ ∈ ℱ,

P(⋃_{i=1}^n Aᵢ) ≤ Σ_{i=1}^n P(Aᵢ).

Proof. For n = 1 the statement is clear. We proceed by induction. We have

⋃_{i=1}^n Aᵢ = ⋃_{i=1}^{n−1} Aᵢ ∪ Aₙ = ⋃_{i=1}^{n−1} Aᵢ ∪ (Aₙ \ ⋃_{i=1}^{n−1} Aᵢ).

The sets ⋃_{i=1}^{n−1} Aᵢ and Aₙ \ ⋃_{i=1}^{n−1} Aᵢ do not intersect. Therefore

P(⋃_{i=1}^n Aᵢ) = P(⋃_{i=1}^{n−1} Aᵢ) + P(Aₙ \ ⋃_{i=1}^{n−1} Aᵢ)
             ≤ Σ_{i=1}^{n−1} P(Aᵢ) + P(Aₙ \ ⋃_{i=1}^{n−1} Aᵢ),

using the induction hypothesis. Furthermore Aₙ = (Aₙ \ ⋃_{i=1}^{n−1} Aᵢ) ∪ (Aₙ ∩
(⋃_{i=1}^{n−1} Aᵢ)), and by additivity of the probability

P(Aₙ) = P(Aₙ \ ⋃_{i=1}^{n−1} Aᵢ) + P(Aₙ ∩ (⋃_{i=1}^{n−1} Aᵢ)) ≥ P(Aₙ \ ⋃_{i=1}^{n−1} Aᵢ).

Finally we obtain

P(⋃_{i=1}^n Aᵢ) ≤ Σ_{i=1}^{n−1} P(Aᵢ) + P(Aₙ \ ⋃_{i=1}^{n−1} Aᵢ)
             ≤ Σ_{i=1}^{n−1} P(Aᵢ) + P(Aₙ) = Σ_{i=1}^n P(Aᵢ). □

o
Let Ω be finite, |Ω| = N. The probability distribution for which pᵢ = 1/N
is called uniform. Uniform distributions arise naturally in many applications.
For example, return to the problem of the "6 from 36" lottery. The space Ω in
this case was constructed earlier, and consists of (36 choose 6) points. We now reason
as follows. Let us fix a specific set ω̄. The result of the random drawing of the
lottery will give a certain set ω of 6 numbers. It is natural to assume that all
sets are equivalent, and that the probability of each ω is the same and equal
to p(ω) = 1/N = 1/(36 choose 6). We now find the probability of winning. According
to the rules, winning occurs when ω and ω̄ have at least 3 numbers in
common. Let us denote by Cₖ the event consisting of those ω such that ω and
ω̄ have k numbers in common. Then C = ⋃_{k=3}^6 Cₖ consists of the winning
ω's. It is clear that Cᵢ ∩ Cⱼ = ∅ for i ≠ j. Therefore P(C) = Σ_{k=3}^6 P(Cₖ).
Furthermore, P(Cₖ) = Σ_{ωᵢ∈Cₖ} pᵢ = Σ_{ωᵢ∈Cₖ} 1/N = |Cₖ|/N = |Cₖ|/|Ω|. It
is easy to see that |Cₖ| = (6 choose k)·(30 choose 6−k). Therefore, for instance,

P(C₃) = (6 choose 3)·(30 choose 3) / (36 choose 6)
      = [(6·5·4)/(1·2·3)] · [(30·29·28)/(1·2·3)] / [(36·35·34·33·32·31)/(1·2·3·4·5·6)] ≈ 0.0417.

1.2 Expectation and Variance for Discrete Random Variables
We now return to the general situation. Assume, as previously, that the space
Ω is discrete, and consider a random variable ξ = f(ω). Its values form a
countable set X = {xᵢ} with f(ω) ∈ X. Now consider the events Cᵢ = {ω :
f(ω) = xᵢ}. Clearly, Cᵢ ∩ Cⱼ = ∅ for i ≠ j (the random variable takes only
one value for each ω) and ⋃ᵢ Cᵢ = Ω (for each ω some value f(ω) is taken).

Definition 1.9. The collection of numbers pᵢ = p(xᵢ) = P(Cᵢ) defined on the
set X is called the probability distribution of the random variable ξ.

The number pᵢ is the probability that the random variable ξ takes the
value xᵢ. It follows from the properties of probabilities that pᵢ ≥ 0 and Σᵢ pᵢ = 1.

Definition 1.10. The mathematical expectation (m.e.) or the mean value of
the random variable ξ is the number

Eξ = Σ_ω f(ω)p(ω),

which is defined when Σ_ω |f(ω)|p(ω) < ∞. If the latter series diverges, the
mathematical expectation is not defined.

In the case where the m.e. is defined, its value does not depend on the
order of summation. The concept of m.e. is similar to the idea of center of
gravity. In the same way as the center of gravity indicates near which point
the mass is concentrated on average, the mathematical expectation indicates
near which quantity the values of the random variable ξ are concentrated on
average.
Properties of the Mathematical Expectation
1. If Eξ₁ and Eξ₂ are defined, then E(aξ₁ + bξ₂) is defined, and E(aξ₁ +
bξ₂) = aEξ₁ + bEξ₂.
2. If ξ ≥ 0 then Eξ ≥ 0.
3. If ξ = f(ω) ≡ 1, then Eξ = 1.
4. If A ≤ ξ = f(ω) ≤ B then A ≤ Eξ ≤ B.
5. Eξ is defined if and only if Σᵢ |xᵢ|pᵢ < ∞, and then Eξ = Σᵢ xᵢpᵢ.
6. If the random variable η = g(ξ) = g(f(ω)), then

Eη = Σᵢ g(xᵢ)pᵢ,

where Eη is defined if and only if Σᵢ |g(xᵢ)|pᵢ < ∞.
7. Chebyshev's Inequality (1). If ξ ≥ 0 and Eξ < ∞, then for any t > 0

P{ω : f(ω) ≥ t} ≤ Eξ/t.

Proof of Properties 1-6. Property 1 follows from the fact that if ξ₁ = f₁(ω),
ξ₂ = f₂(ω) and Eξ₁ and Eξ₂ are defined, then

Σ_ω |f₁(ω)|p(ω) < ∞,   Σ_ω |f₂(ω)|p(ω) < ∞.

Therefore

Σ_ω |af₁(ω) + bf₂(ω)|p(ω) ≤ Σ_ω (|a||f₁(ω)| + |b||f₂(ω)|)p(ω)
= |a| Σ_ω |f₁(ω)|p(ω) + |b| Σ_ω |f₂(ω)|p(ω) < ∞

and, using the properties of absolutely converging series, we find that

E(aξ₁ + bξ₂) = aEξ₁ + bEξ₂.

Properties 2 and 3 are clear. Properties 1-3 mean that E is a linear (property
1), non-negative (property 2), normed (property 3) functional on the vector
space of random variables. Property 4 follows from the fact that ξ − A ≥ 0,
B − ξ ≥ 0, and therefore E(ξ − A) = Eξ − E(A·1) = Eξ − A ≥ 0 and, analogously,
B − Eξ ≥ 0.
We now prove property 6, since 5 follows from 6 with g(x) = x. First let
Σᵢ |g(xᵢ)|pᵢ < ∞. Since the sum of a series with non-negative terms does not
depend on the order of the terms, the summation of Σ_ω |g(f(ω))|p(ω) can be
carried out in the following way:

Σ_ω |g(f(ω))|p(ω) = Σᵢ Σ_{ω∈Cᵢ} |g(f(ω))|p(ω)
= Σᵢ |g(xᵢ)| Σ_{ω∈Cᵢ} p(ω) = Σᵢ |g(xᵢ)|pᵢ.

Thus the series Σ_ω |g(f(ω))|p(ω) converges if and only if the series Σᵢ |g(xᵢ)|pᵢ
does. If either of these series converges, then the series Σ_ω g(f(ω))p(ω) converges
absolutely, and its sum does not depend on the order of summation. Therefore

Σ_ω g(f(ω))p(ω) = Σᵢ Σ_{ω∈Cᵢ} g(f(ω))p(ω) = Σᵢ g(xᵢ)pᵢ,

and the last series also converges absolutely. □


Proof of Property 7. We have

P{ω : f(ω) ≥ t} = Σ_{ω: f(ω)≥t} p(ω) ≤ Σ_{ω: f(ω)≥t} (f(ω)/t)p(ω)
≤ (1/t) Σ_ω f(ω)p(ω) = (1/t)Eξ.

In the last inequality we used the fact that f ≥ 0. □
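Chebyshev's inequality (1) is easy to verify numerically on any concrete non-negative distribution; the distribution below is a made-up example (our sketch, not from the text):

```python
# A non-negative random variable xi given by its distribution: value -> probability.
dist = {0: 0.5, 1: 0.3, 4: 0.2}
E = sum(x * p for x, p in dist.items())  # E(xi) = 1.1

# Chebyshev (1): P{xi >= t} <= E(xi) / t for every t > 0.
for t in (0.5, 1, 2, 3):
    tail = sum(p for x, p in dist.items() if x >= t)
    assert tail <= E / t
```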


In subsequent lectures we will encounter many problems in which the
mathematical expectation plays a very important role. However we now in-
troduce another numerical characteristic of a random variable.

Definition 1.11. The variance of a random variable ξ is the quantity Var ξ =
E(ξ − Eξ)².

The definition is understood to imply that Eξ is defined. Var ξ character-
izes the amount of variation of the random variable from its mean.
Properties of the Variance
1. Var ξ is finite if and only if Eξ² < ∞, and Var ξ = Eξ² − (Eξ)².
2. Var(aξ + b) = a² Var ξ, where a and b are constants.
3. If A ≤ ξ ≤ B, then Var ξ ≤ ((B − A)/2)².
4. Chebyshev's inequality (2). Let Var ξ < ∞. Then

P{ω : |f(ω) − Eξ| ≥ t} ≤ Var ξ / t².

Proof of Property 1. Let Eξ be defined, |Eξ| < ∞. Assume at first that Eξ² <
∞. Then (ξ − Eξ)² = ξ² − 2(Eξ)ξ + (Eξ)², and from property 1 of the m.e.

Var ξ = Eξ² − E(2(Eξ)ξ) + E((Eξ)²) = Eξ² − 2(Eξ)(Eξ) + (Eξ)² = Eξ² − (Eξ)² < ∞.

Now let Var ξ < ∞. We have ξ² = (ξ − Eξ)² + 2(Eξ)ξ − (Eξ)²,
and from property 1 of the m.e. we have

Eξ² = Var ξ + 2(Eξ)² − (Eξ)² = Var ξ + (Eξ)² < ∞. □
Proof of Property 2. From property 1 of the m.e. we have E(aξ + b) = aEξ + b.
Therefore

Var(aξ + b) = E(aξ + b − E(aξ + b))² = E(aξ − aEξ)²
= E(a²(ξ − Eξ)²) = a² E(ξ − Eξ)² = a² Var ξ. □

Proof of Property 3. Let A ≤ ξ ≤ B. It then follows from property 2 of the
variance that

Var ξ = E(ξ − Eξ)² = E(ξ − (A + B)/2 − (Eξ − (A + B)/2))²
= E(ξ − (A + B)/2)² − (E(ξ − (A + B)/2))² ≤ E(ξ − (A + B)/2)².

We now note that |ξ − (A + B)/2| ≤ (B − A)/2, so (ξ − (A + B)/2)² ≤ ((B − A)/2)².
From property 4 of the m.e. we have

E(ξ − (A + B)/2)² ≤ ((B − A)/2)². □
Proof of Property 4. We use Chebyshev's inequality (1) for the random variable
η = (ξ − Eξ)² ≥ 0. Then

{ω : |ξ − Eξ| ≥ t} = {ω : η ≥ t²}.

Therefore

P{ω : |ξ − Eξ| ≥ t} = P{ω : η ≥ t²} ≤ Eη/t² = Var ξ/t². □
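Properties 1 and 2 of the variance can likewise be checked on a small distribution (again a made-up example, not from the text):

```python
# A random variable xi given by its distribution: value -> probability.
dist = {-1: 0.25, 0: 0.5, 2: 0.25}
E   = sum(x * p for x, p in dist.items())
E2  = sum(x * x * p for x, p in dist.items())
var = sum((x - E) ** 2 * p for x, p in dist.items())
assert abs(var - (E2 - E * E)) < 1e-12       # Var xi = E(xi^2) - (E xi)^2

a, b = 3.0, 7.0                              # Var(a*xi + b) = a^2 Var xi
var_ab = sum((a * x + b - (a * E + b)) ** 2 * p for x, p in dist.items())
assert abs(var_ab - a * a * var) < 1e-12
```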

Now let ξ₁, ξ₂ be two random variables taking the values x₁, x₂, ... and
y₁, y₂, ... respectively.

Definition 1.12. The joint distribution of the random variables ξ₁, ξ₂ is the
collection of numbers pᵢⱼ = p(xᵢ, yⱼ) = P{ω : ξ₁(ω) = xᵢ, ξ₂(ω) = yⱼ}.

Given the joint distribution of ξ₁, ξ₂ one can find the distribution of ξ₁
(ξ₂) by summing the pᵢⱼ with respect to j (i). Moreover, if η = f(ξ₁, ξ₂)
then Eη = Σᵢⱼ f(xᵢ, yⱼ)pᵢⱼ, again under the assumption that the last series
converges absolutely. In particular E(ξ₁·ξ₂) = Σᵢⱼ xᵢyⱼpᵢⱼ.
One can define, in an analogous way, the joint distribution of any finite
number of random variables.

Definition 1.13. The covariance coefficient, or simply the covariance, of the random variables ξ₁, ξ₂ is the number Cov(ξ₁, ξ₂) = E(ξ₁ − m₁)(ξ₂ − m₂), where m_i = Eξ_i for i = 1, 2.

We note that

Cov(ξ₁, ξ₂) = E(ξ₁ − m₁)(ξ₂ − m₂) = E(ξ₁ξ₂ − m₁ξ₂ − m₂ξ₁ + m₁m₂) = Eξ₁ξ₂ − m₁m₂.
Let ξ₁, ξ₂, …, ξ_n be random variables and ζ_n = ξ₁ + ξ₂ + ⋯ + ξ_n. Then, if m_i = Eξ_i, we have Eζ_n = Σ_{i=1}^n m_i and

Var ζ_n = E(Σ_{i=1}^n (ξ_i − m_i))² = Σ_{i=1}^n E(ξ_i − m_i)² + 2 Σ_{i<j} E(ξ_i − m_i)(ξ_j − m_j)

= Σ_{i=1}^n Var ξ_i + 2 Σ_{i<j} Cov(ξ_i, ξ_j).

Definition 1.14. The correlation coefficient of the random variables ξ₁, ξ₂ is the number ρ(ξ₁, ξ₂) = Cov(ξ₁, ξ₂)/√(Var ξ₁ · Var ξ₂).

Theorem 1.6. |ρ(ξ₁, ξ₂)| ≤ 1. If |ρ(ξ₁, ξ₂)| = 1 then ξ₂ = aξ₁ + b, where a and b are constants.

Proof. The statement of the theorem is a simple consequence of the Cauchy–Bunyakovskii inequality. For every t we have E(t(ξ₂ − m₂) + (ξ₁ − m₁))² ≥ 0. Furthermore

E(t(ξ₂ − m₂) + (ξ₁ − m₁))² = t² E(ξ₂ − m₂)² + 2t E(ξ₁ − m₁)(ξ₂ − m₂) + E(ξ₁ − m₁)²
= t² Var ξ₂ + 2t Cov(ξ₁, ξ₂) + Var ξ₁.

Since the latter quadratic polynomial (in t) is non-negative, its discriminant is non-positive, i.e.

(Cov(ξ₁, ξ₂))² ≤ Var ξ₁ · Var ξ₂,

i.e. |ρ(ξ₁, ξ₂)| ≤ 1. If |ρ(ξ₁, ξ₂)| = 1 then for some t₀ we have E(t₀(ξ₂ − m₂) + (ξ₁ − m₁))² = 0, i.e. t₀(ξ₂ − m₂) + (ξ₁ − m₁) = 0 and ξ₂ = m₂ + m₁/t₀ − ξ₁/t₀. Setting a = −1/t₀, b = m₂ + m₁/t₀ we obtain the statement of the theorem. □
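As a numerical illustration of Definitions 1.13–1.14 and Theorem 1.6, the following Python sketch computes the covariance both ways and the correlation coefficient for a hypothetical joint distribution in which ξ₂ = 2ξ₁ + 1 exactly, so that ρ must equal 1:

```python
import math

# Joint distribution of (xi1, xi2) as a dict {(x, y): probability}.
# Illustrative choice: xi2 = 2*xi1 + 1 holds on every atom, so rho = 1.
joint = {(0.0, 1.0): 0.25, (1.0, 3.0): 0.5, (2.0, 5.0): 0.25}

def e(f):
    # mathematical expectation of f(xi1, xi2)
    return sum(p * f(x, y) for (x, y), p in joint.items())

m1, m2 = e(lambda x, y: x), e(lambda x, y: y)
var1 = e(lambda x, y: (x - m1) ** 2)
var2 = e(lambda x, y: (y - m2) ** 2)

# Covariance, from the definition and from Cov = E xi1 xi2 - m1 m2
cov = e(lambda x, y: (x - m1) * (y - m2))
alt = e(lambda x, y: x * y) - m1 * m2

rho = cov / math.sqrt(var1 * var2)
```

For this distribution m₁ = 1, m₂ = 3, cov = 1, and rho = 1, in agreement with the theorem's linear-dependence case (a = 2, b = 1).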
Lecture 2. Independent Identical Trials and
the Law of Large Numbers

2.1 Algebras and σ-algebras; Borel σ-algebras

We now consider some of the most important examples of σ-algebras encountered in probability theory. First we introduce the following general definition.

Definition 2.1. Let A = {A} be an arbitrary family of subsets of Ω. The smallest σ-algebra which contains A is called the σ-algebra generated by A. This σ-algebra will be denoted by F(A).

In other words, F(A) is the intersection of all those σ-algebras G which contain A (A ⊂ G). There is at least one such σ-algebra, namely the σ-algebra of all subsets of Ω.
A less formal description of F(A) is the following. Given sets from A we form finite and infinite unions of those sets. From the subsets obtained we form finite and infinite intersections. Then once again we form unions and intersections, and so forth. The complete process makes use of the concept of transfinite induction, but we will not elaborate on this point.
Assume now that Ω = ℝ¹. Consider the following families of subsets:

1. A₁ is the collection of open intervals A = (a, b);
2. A₂ is the collection of half-open intervals A = [a, b);
3. A₃ is the collection of half-open intervals A = (a, b];
4. A₄ is the collection of closed intervals A = [a, b];
5. A₅ is the collection of left intervals (−∞, a);
6. A₆ is the collection of right intervals (a, ∞);
7. A₇ is the collection of open subsets of ℝ¹;
8. A₈ is the collection of closed subsets of ℝ¹.

Theorem 2.1. F(A₁) = F(A₂) = ⋯ = F(A₈) = F(ℝ¹). The σ-algebra F(ℝ¹) is called the Borel σ-algebra, or the σ-algebra of Borel subsets.

Proof. All the equalities are proved in the same way. Let us prove, for example, that F(A₅) = F(A₁). First, for any a < b we have the equality (−∞, b) \ (−∞, a) = [a, b), so [a, b) ∈ F(A₅). Let us further take an interval (a, b) and a sequence a_n ↓ a with a_n > a. Then ∪_n [a_n, b) = (a, b), and ∪_n [a_n, b) ∈ F(A₅). Therefore (a, b) ∈ F(A₅), i.e. F(A₁) ⊆ F(A₅). One can prove in an analogous way that F(A₁) ⊇ F(A₅). □

Let us now assume that Ω is a metric space.

Definition 2.2. The Borel σ-algebra on Ω is the σ-algebra F(A), where A is the family of open sets of Ω.

Now let Ω be the space of words ω = {x₁, …, x_n} where x_i ∈ X, i.e. Ω = X^[1,n]. We recall that X is an "alphabet". We assume that we have fixed on X a σ-algebra of subsets B.

Definition 2.3. A finite-dimensional cylinder C is a set of the form

C = {ω : x_{i₁} ∈ B₁, …, x_{i_r} ∈ B_r},

where i₁, …, i_r are given indices and B₁, B₂, …, B_r ∈ B.

In what follows, when we talk of a σ-algebra of subsets of Ω = X^[1,n] we will mean the σ-algebra F = F({C}), where {C} is the family of cylinders C. This definition can be extended, without any changes, to spaces Ω of the form Ω = X^∞.
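On a finite Ω the generating process described above needs no transfinite induction: repeatedly closing the family under complements and pairwise unions stabilizes after finitely many steps. A Python sketch of this closure (the sample space and generating family are illustrative choices):

```python
from itertools import combinations

def generated_sigma_algebra(omega, family):
    # Smallest sigma-algebra on a FINITE omega containing every set in
    # family: close under complement and pairwise union until stable.
    omega = frozenset(omega)
    sets = {frozenset(s) for s in family} | {frozenset(), omega}
    while True:
        new = set(sets)
        new |= {omega - s for s in sets}                    # complements
        new |= {s | t for s, t in combinations(sets, 2)}    # unions
        if new == sets:
            return sets
        sets = new

omega = {1, 2, 3, 4}
F = generated_sigma_algebra(omega, [{1}, {1, 2}])
```

Here the atoms of F are {1}, {2}, {3, 4}, so F consists of the 2³ = 8 unions of atoms, and it is closed under complement by construction.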

2.2 Heuristic Approach to the Construction of "Continuous" Random Variables

Earlier we introduced the concept of a probability distribution for a random variable ξ = f(ω) defined on a discrete space Ω. Such a random variable takes a finite or countable number of values x₁, x₂, ….

If the probability of occurrence of the value x_i is equal to p_i then we consider that the point x_i carries the mass p_i. It follows that for any interval [a, b] the probability P{a ≤ ξ ≤ b} = Σ_{i: a ≤ x_i ≤ b} p_i, i.e. the probability that the random variable takes values in the interval [a, b] is equal to the sum of the masses p_i which lie in the interval. Imagine now that the number of values taken by the random variable tends to infinity and that the distances x_{i+1} − x_i as well as the p_i tend to zero, while Σ_{a ≤ x_i ≤ b} p_i converges to a finite value. For example, let −R ≤ x_i ≤ R, x_i = i/N and p_i = p(i/N)·(1/N) + o(1/N), where p(t) is a function on the real line. Then the sum Σ_{a ≤ x_i ≤ b} p_i
behaves like a Riemann sum for the integral ∫_a^b p(t) dt. It is clear that p(t) ≥ 0 and ∫_{−∞}^∞ p(t) dt = 1, since Σ_i p_i = Σ_i [(1/N) p(i/N) + o(1/N)] = 1. Any function p(t) ≥ 0 such that ∫_{−∞}^∞ p(t) dt = 1 is called a probability density.

Definition 2.4. Let (Ω, F, P) be a probability space, and let ξ = f(ω) be a function on Ω. If for any interval [a, b]

P{a ≤ ξ ≤ b} = P{ω : a ≤ f(ω) ≤ b} = ∫_a^b p(t) dt,

we say that the random variable ξ has a probability distribution with density p.

Examples of Densities

1. p(x) = (1/√(2π)) e^(−x²/2), −∞ < x < ∞, is called the normal, or Gaussian, density (with parameters (0, 1));

1′. p(x) = (1/(√(2π)σ)) exp(−(x − m)²/(2σ²)) is the normal density with parameters (m, σ);

2. p(x) = 1/(b − a) for x ∈ [a, b], and p(x) = 0 for x ∉ [a, b], is the uniform density on the interval [a, b];

2′. The random variables which we have considered thus far take real values, i.e. values in ℝ¹. One also encounters more general cases (see below) in which the random variables take values in the space ℝ^d. Let us fix a subset V ⊆ ℝ^d. The uniform distribution on V is the distribution of a random variable ξ with values in V such that for any U ⊆ V

P{ξ ∈ U} = vol U / vol V.

We assume here that 0 < vol V < ∞ (vol(·) denotes the d-dimensional volume of a set). To define the uniform distribution on V one sometimes says that "a point is chosen in the set V at random with the uniform distribution";

3. p(x) = λe^(−λx) for x ≥ 0, and p(x) = 0 for x < 0, is the exponential density;

4. p(x) = 1/(π(1 + x²)), −∞ < x < ∞, is the Cauchy density.
The mathematical expectation of a random variable which has a probability density p(x) is given by the integral Eξ = ∫_{−∞}^∞ x p(x) dx, and this mathematical expectation is defined if the latter integral converges absolutely. In exactly the same way, for a function g, E g(ξ) = ∫_{−∞}^∞ g(x) p(x) dx.

Given two random variables ξ₁, ξ₂, their joint probability density is a function p(x, y) ≥ 0 such that ∫_{−∞}^∞ ∫_{−∞}^∞ p(x, y) dx dy = 1 and such that P{a₁ ≤ ξ₁ ≤ b₁, a₂ ≤ ξ₂ ≤ b₂} = ∫_{a₁}^{b₁} ∫_{a₂}^{b₂} p(x, y) dx dy for any a₁ < b₁, a₂ < b₂. It follows that ξ₁ (respectively ξ₂) has a distribution with density ∫_{−∞}^∞ p(x, y) dy (respectively ∫_{−∞}^∞ p(x, y) dx). Furthermore

E ξ₁ξ₂ = ∫_{−∞}^∞ ∫_{−∞}^∞ xy p(x, y) dx dy,

assuming that the latter integral converges absolutely, and

Cov(ξ₁, ξ₂) = ∫∫ xy p(x, y) dx dy − ∫∫ x p(x, y) dx dy · ∫∫ y p(x, y) dx dy.
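These formulas can be checked numerically. The Python sketch below uses the illustrative joint density p(x, y) = x + y on the unit square, replacing the double integrals by a midpoint-rule grid; for this density the exact covariance is −1/144:

```python
# Midpoint-rule check of the covariance formula for a joint density.
# Illustrative density: p(x, y) = x + y on [0, 1] x [0, 1].
m = 200
h = 1.0 / m
pts = [(i + 0.5) * h for i in range(m)]

def integrate(g):
    # double integral of g(x, y) * p(x, y) over the unit square
    return sum(g(x, y) * (x + y) for x in pts for y in pts) * h * h

total = integrate(lambda x, y: 1.0)   # normalization, should be 1
ex = integrate(lambda x, y: x)        # E xi1, exactly 7/12 here
ey = integrate(lambda x, y: y)        # E xi2
exy = integrate(lambda x, y: x * y)   # E xi1 xi2
cov = exy - ex * ey
```

The midpoint rule is accurate to O(h²) for this smooth integrand, so with m = 200 the computed covariance agrees with −1/144 to several decimal places.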

2.3 Sequences of Independent Trials

Let X = {x^(1), x^(2), …, x^(r)} be a finite set on which a probability distribution p = {p₁, …, p_r} is given, i.e. a collection of p_i ≥ 0 with Σ_{i=1}^r p_i = 1. The individual number p_i is called the probability of the outcome x^(i). We will also consider the p_i as a function given on X, p_i = p(x^(i)). Consider the space of elementary outcomes Ω = X^[1,n], i.e. the space of words ω = {x₁, …, x_n}, where x_k ∈ X. We set p(ω) = Π_{k=1}^n p(x_k). Clearly p(ω) ≥ 0. We show that Σ_ω p(ω) = 1. For n = 1 this follows from the fact that Σ_ω p(ω) = Σ_{x₁∈X} p(x₁) = Σ_{k=1}^r p(x^(k)) = 1. We now argue by induction. The summation Σ_n = Σ_{(x₁,…,x_n)} p(ω) can be carried out in the following way: first fix x₁, …, x_{n−1} and sum over all values of x_n, and then sum the result over x₁, …, x_{n−1}:

Σ_n = Σ_{(x₁,…,x_{n−1})} Π_{k=1}^{n−1} p(x_k) · Σ_{x_n∈X} p(x_n).

The last sum is equal to 1, since Σ_{x_n∈X} p(x_n) = Σ_{k=1}^r p(x^(k)) = 1. So Σ_n = Σ_{n−1}, and by the induction hypothesis Σ_n = 1. It follows that the numbers p(ω) define a probability distribution P on Ω.

Definition 2.5. The probability space (Ω, F, P) is called a sequence of n independent identical trials with r outcomes and with distribution of outcomes p. If |X| = 2 then a sequence of n independent identical trials is called a sequence of identical Bernoulli trials. If X = {1, 0} or X = {1, −1} then 1 is sometimes called "success" or "win", and 0 or −1 "failure" or "loss".

We now study the properties of P. Let us choose indices i₁, i₂, …, i_t and as many outcomes x^(j₁), x^(j₂), …, x^(j_t) (among which there may be repetitions), and consider the cylinder

C = {ω : x_{i₁} = x^(j₁), …, x_{i_t} = x^(j_t)}.

Lemma 2.1. P(C) = p(x^(j₁)) ⋯ p(x^(j_t)).

Proof. P(C) = Σ p(ω) = Σ p(x₁) ⋯ p(x_n), where the summation is carried out over those ω = (x₁, …, x_n) for which x_{i_l} = x^(j_l), 1 ≤ l ≤ t. Then

P(C) = p(x^(j₁)) ⋯ p(x^(j_t)) · Σ_{x_i, i ≠ i₁,…,i_t} Π_{i ≠ i₁,…,i_t} p(x_i).

The last sum is equal to 1. This is proved in the same way as the normalization property Σ p(ω) = 1 was proved before. □

Corollary 2.1. Let C_i ⊆ X, 1 ≤ i ≤ n, be arbitrary subsets of X and let C = {ω : x_i ∈ C_i, 1 ≤ i ≤ n}. Then P(C) = Π_{i=1}^n P(C_i).

Proof. We have

P(C) = Σ_{ω∈C} p(ω) = Σ_{x_i∈C_i, 1≤i≤n} Π_{i=1}^n p(x_i) = Π_{i=1}^n Σ_{x_i∈C_i} p(x_i) = Π_{i=1}^n P(C_i). □
We introduce the random variable ν^(j)(ω), equal to the number of i for which 1 ≤ i ≤ n and x_i = x^(j), and the random variables ξ_i^(j)(ω), where

ξ_i^(j)(ω) = 1 if x_i = x^(j), and ξ_i^(j)(ω) = 0 otherwise.

Then

ν^(j)(ω) = Σ_{i=1}^n ξ_i^(j)(ω).   (2.1)

It is clear that ν^(j)(ω) takes the values 0, 1, …, n.

Theorem 2.2. P{ν^(j) = k} = C(n,k) (p^(j))^k (1 − p^(j))^(n−k), Eν^(j) = np^(j), Var ν^(j) = np^(j)(1 − p^(j)).

Proof. We note that 1 − p^(j) = Σ_{l≠j} p^(l). Let us define B_k = {ω : ν^(j)(ω) = k}. Then

B_k = ∪ B_{k; i₁,…,i_k},

where the union is over all 1 ≤ i₁ < i₂ < ⋯ < i_k ≤ n, and B_{k; i₁,…,i_k} consists of those ω with x_{i_t} = x^(j) for 1 ≤ t ≤ k and x_i ≠ x^(j) for the remaining i. The number of events B_{k; i₁,…,i_k} is clearly equal to C(n,k), and they do not intersect. Furthermore

P(B_{k; i₁,…,i_k}) = Σ p(ω) = (p^(j))^k Σ_{x_i ≠ x^(j), i ≠ i₁,…,i_k} Π_{i ≠ i₁,…,i_k} p(x_i).

In the last sum we can interchange product and summation:

Σ_{x_i ≠ x^(j), i ≠ i₁,…,i_k} Π_{i ≠ i₁,…,i_k} p(x_i) = Π_{i ≠ i₁,…,i_k} Σ_{x ≠ x^(j)} p(x) = (1 − p^(j))^(n−k).

In the same way

P(B_{k; i₁,…,i_k}) = (p^(j))^k (1 − p^(j))^(n−k),

P(B_k) = C(n,k) (p^(j))^k (1 − p^(j))^(n−k).

In order to calculate Eν^(j) and Var ν^(j) we use (2.1) and Lemma 2.1 where, for C, we take the cylinder C = {ω : x_{i₁} = x^(j), x_{i₂} = x^(j)}:

Eν^(j) = Σ_{i=1}^n Eξ_i^(j) = Σ_{i=1}^n P(ξ_i^(j) = 1) = np^(j),

Var ν^(j) = E(ν^(j))² − (Eν^(j))²
= E(Σ_{i=1}^n ξ_i^(j))² − n²(p^(j))²
= np^(j) + n(n − 1)(p^(j))² − n²(p^(j))²
= np^(j)(1 − p^(j)).

Here we used the fact that (ξ_i^(j))² = ξ_i^(j) and that, by Lemma 2.1, E ξ_{i₁}^(j) ξ_{i₂}^(j) = (p^(j))² for i₁ ≠ i₂. □
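Theorem 2.2 can be verified by brute force for a small n: enumerate all words of Ω = X^[1,n], add up p(ω) according to the number of occurrences of a chosen outcome, and compare with the binomial formula. A Python sketch (the alphabet and probabilities are illustrative choices):

```python
from itertools import product
from math import comb, isclose

# Enumerate all words of n independent identical trials over a
# three-letter alphabet and tally the distribution of nu = #{"a"}.
n = 6
probs = {"a": 0.5, "b": 0.3, "c": 0.2}
pj = probs["a"]

dist = [0.0] * (n + 1)          # dist[k] = P{nu = k}
for word in product(probs, repeat=n):
    p_word = 1.0
    for x in word:
        p_word *= probs[x]      # p(omega) = product of outcome probabilities
    dist[word.count("a")] += p_word

mean = sum(k * dist[k] for k in range(n + 1))
var = sum(k * k * dist[k] for k in range(n + 1)) - mean ** 2
```

The tallied dist matches C(n,k) pj^k (1 − pj)^(n−k) for every k, and the mean and variance match n·pj and n·pj·(1 − pj).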

We now choose two numbers p, q ≥ 0 such that p + q = 1.

Definition 2.6. The probability distribution of the random variable ξ which takes integer values 0 ≤ k ≤ n, where P{ξ(ω) = k} = C(n,k) p^k q^(n−k) = C(n,k) p^k (1 − p)^(n−k), is called the binomial distribution with parameters p, q = 1 − p.

Theorem 2.2 shows that the random variable ν^(j) has a binomial distribution with parameters p^(j), 1 − p^(j).
From Theorem 2.2 we now obtain a very important corollary.

2.4 Law of Large Numbers for Sequences of Independent Identical Trials

For any δ > 0

P{ω : |ν^(j)/n − p^(j)| ≥ δ for at least one j, 1 ≤ j ≤ r} → 0 as n → ∞.

This statement follows in a straightforward manner from the Chebyshev inequality. Let A^(j) = {ω : |ν^(j)/n − p^(j)| ≥ δ}. We are interested in P(∪_{j=1}^r A^(j)). We have P(∪_{j=1}^r A^(j)) ≤ Σ_{j=1}^r P(A^(j)). Furthermore, by Chebyshev's inequality,

P(A^(j)) = P{ω : |ν^(j)/n − p^(j)| ≥ δ}
= P{ω : |ν^(j) − np^(j)| ≥ δn}
= P{ω : |ν^(j) − Eν^(j)| ≥ δn}
≤ Var ν^(j)/(δ²n²) = np^(j)(1 − p^(j))/(δ²n²)
= p^(j)(1 − p^(j))/(δ²n) → 0 as n → ∞.

Consequently Σ_{j=1}^r P(A^(j)) → 0 as n → ∞. We will now derive corollaries from the Law of Large Numbers (LLN). We note that the LLN gives a foundation for representing probabilities as limits of frequencies. This can be done if further information is available which allows us to assume that we are dealing with a probability distribution which corresponds to a sequence of independent trials. The LLN is a particular case of a general ergodic theorem on the behavior of arithmetic means of random variables, which holds under much more general assumptions.
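The statement just proved is easy to observe in simulation: for large n, all r frequencies are simultaneously close to their probabilities. A Python sketch (the outcome probabilities, n, and δ are illustrative choices; the seed is fixed for reproducibility):

```python
import random

# Empirical illustration of the LLN for r = 3 outcomes: every frequency
# nu^(j)/n is simultaneously within delta of p^(j) with high probability.
random.seed(0)
outcomes = [0, 1, 2]
p = [0.2, 0.5, 0.3]
n = 200_000
sample = random.choices(outcomes, weights=p, k=n)
freq = [sample.count(j) / n for j in outcomes]

delta = 0.01
ok = all(abs(freq[j] - p[j]) < delta for j in outcomes)
```

Here the typical deviation of each frequency is of order √(p^(j)(1 − p^(j))/n) ≈ 0.001, so δ = 0.01 is exceeded only with negligible probability.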

2.5 Generalizations

The first generalization consists in the fact that we can consider an arbitrary probability space (X, B, P) instead of a finite X, where P is a probability defined on the events B ∈ B. As previously, we set Ω = X^[1,n] and consider the cylinders C = {ω : x_{i₁} ∈ B₁, …, x_{i_k} ∈ B_k}, where B₁, …, B_k ∈ B. Let P(C) = Π_{l=1}^k P(B_l) for any cylinder C. One can prove (see below) that there exists a unique σ-additive set function, defined on the σ-algebra F generated by the cylinders, which takes this form on the cylinders C.

Definition 2.7. The probability space (Ω, F, P) is called a sequence of n independent identical trials with space of outcomes X and probability distribution P for those outcomes.

The second generalization, which we mention only briefly, consists in considering spaces X which depend on the number of the trial. We are then talking about the concept of sequences of independent non-identical trials.

2.6 Applications

2.6.1 Probabilistic Proof of Weierstrass's Theorem

Weierstrass's Theorem (WT), from courses in mathematical analysis, states that any continuous function f(x), x ∈ [0, 1], can be uniformly approximated by polynomials. We give a probabilistic proof of the WT due to S. N. Bernstein.

Definition 2.8. The Bernstein polynomial of degree n for the function f is the polynomial Q_n(x) = Σ_{k=0}^n C(n,k) x^k (1 − x)^(n−k) f(k/n).

Theorem 2.3 (Weierstrass). For any continuous function f

sup_{x∈[0,1]} |Q_n(x) − f(x)| → 0 as n → ∞.

Proof. Let ε > 0. Since the function f is continuous on [0, 1] it is uniformly continuous. Therefore there exists a δ > 0 such that the inequality |x′ − x″| ≤ δ implies |f(x′) − f(x″)| ≤ ε/2. We define M = sup_{x∈[0,1]} |f(x)|.
For any n we have
For any n we have

f(x) − Q_n(x) = f(x) − Σ_{k=0}^n C(n,k) x^k (1 − x)^(n−k) f(k/n)
= Σ_{k=0}^n C(n,k) x^k (1 − x)^(n−k) (f(x) − f(k/n))
= Σ_{k: |k/n−x|<δ} C(n,k) x^k (1 − x)^(n−k) (f(x) − f(k/n))
+ Σ_{k: |k/n−x|≥δ} C(n,k) x^k (1 − x)^(n−k) (f(x) − f(k/n))
= Σ₁ + Σ₂.

The first sum can be estimated as follows:

|Σ₁| ≤ Σ_{k: |k/n−x|<δ} C(n,k) x^k (1 − x)^(n−k) |f(k/n) − f(x)|
≤ (ε/2) Σ_{k: |k/n−x|<δ} C(n,k) x^k (1 − x)^(n−k) ≤ ε/2,

since the C(n,k) x^k (1 − x)^(n−k) are the probabilities of the binomial distribution, and their sum over all k is equal to 1. As for the second sum, we have

|Σ₂| ≤ Σ_{k: |k/n−x|≥δ} C(n,k) x^k (1 − x)^(n−k) |f(k/n) − f(x)|
≤ 2M Σ_{k: |k/n−x|≥δ} C(n,k) x^k (1 − x)^(n−k).

At this point the probabilistic part of the argument begins. We consider a sequence of n independent trials with X = {1, 0} and probabilities p(1) = x, p(0) = 1 − x, and the random variable ν^(1), equal to the number of occurrences of 1 among the n trials. Then P(ν^(1) = k) = C(n,k) x^k (1 − x)^(n−k) (from Theorem 2.2), and Σ_{k: |k/n−x|≥δ} C(n,k) x^k (1 − x)^(n−k) is exactly equal to the probability that |ν^(1)/n − x| ≥ δ. We saw in the proof of the LLN that this probability, by Chebyshev's inequality, is not greater than x(1 − x)/δ²n. Therefore |Σ₂| ≤ 2M x(1 − x)/δ²n. For all sufficiently large values of n this last expression will be smaller than ε/2. Consequently |Σ₁ + Σ₂| ≤ ε for all sufficiently large n. □
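The convergence in Theorem 2.3 can be observed numerically. The Python sketch below evaluates the Bernstein polynomial on a grid for an illustrative continuous (but not differentiable) function; the choice of f and n is hypothetical:

```python
from math import comb

# Q_n(x) = sum_k C(n, k) x^k (1 - x)^(n - k) f(k/n)
def bernstein(f, n, x):
    return sum(comb(n, k) * x**k * (1 - x)**(n - k) * f(k / n)
               for k in range(n + 1))

def f(x):
    # continuous on [0, 1] but not differentiable at 1/2
    return abs(x - 0.5)

n = 400
grid = [i / 100 for i in range(101)]
err = max(abs(bernstein(f, n, x) - f(x)) for x in grid)
```

The worst error occurs near the kink at x = 1/2 and is of order √(x(1 − x)/n) ≈ 0.02 for n = 400, consistent with the Chebyshev estimate in the proof; at the endpoints Q_n reproduces f exactly.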

2.6.2 Application to Number Theory

We again consider the interval [0, 1]. Let us expand every x ∈ [0, 1] into its decimal expansion x = 0.a₁a₂…a_n…, where each digit a_i takes its values from 0 to 9. Let us fix the first n digits a₁…a_n of the decimal expansion. Those numbers x ∈ [0, 1] which have these digits in their first n positions form an interval Δ(a₁…a_n) of length 10^(−n). For different choices of (a₁, a₂, …, a_n) those intervals either do not intersect, or intersect in only one end point.
We consider the sequence of n independent identical trials with space of outcomes {0, 1, 2, …, 9} and probability of each outcome equal to 10^(−1). Then Ω = X^[1,n] consists of the words ω = (a₁…a_n) and p(ω) = 10^(−n). We denote by ν^(j)(x) the number of occurrences of the digit j among the first n digits of the decimal expansion. Let us fix a number δ > 0 and consider those intervals Δ(a₁…a_n) for which |ν^(j)/n − 1/10| ≤ δ, 0 ≤ j ≤ 9. It then follows from the LLN that the sum of the lengths of those intervals converges to 1 as n → ∞. One sometimes says that, in the decimal expansion of typical numbers x, all digits are encountered approximately the same number of times.
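For a concrete number one can count digit frequencies directly. The Python sketch below does this for the first 5000 decimal digits of √2, computed with exact integer arithmetic; note that √2 is only conjectured to be normal, so this is an empirical illustration rather than a consequence of the LLN (which concerns typical x):

```python
from math import isqrt

# First n decimal digits of sqrt(2) after the decimal point:
# isqrt(2 * 10^(2n)) = floor(sqrt(2) * 10^n) = "1414...", drop the "1".
n = 5000
digits = str(isqrt(2 * 10 ** (2 * n)))[1:n + 1]
freq = [digits.count(str(j)) / n for j in range(10)]

delta = 0.03
ok = all(abs(f - 0.1) < delta for f in freq)
```

Empirically every digit frequency is within a few thousandths of 1/10, far inside the illustrative tolerance δ = 0.03.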

2.6.3 Monte Carlo Methods

We assume that we must calculate the d-dimensional integral

I = ∫₀¹ ⋯ ∫₀¹ f(y₁, …, y_d) dy₁ ⋯ dy_d

of a continuous function of d variables. In applications d can take values of order 10 and above. The computation of the usual Riemann sums requires an enormous number of operations, and consequently of computer time. So, already at the origin of the development of computational techniques, a probabilistic method of computing these integrals was conceived which uses much less computation, but sometimes leads to gross errors. A sketch of the method is as follows: let y = (y₁, …, y_d) and assume that n points y^(1), …, y^(n) are chosen independently with the uniform distribution in the d-dimensional unit cube. We form the average

(1/n) Σ_{i=1}^n f(y^(i)).

It turns out that with large probability this average is close to I. The value of d is irrelevant to the proof of this statement, so we will assume that d = 1, i.e. that

I = ∫₀¹ f(y) dy.

First we explain the meaning of the words "n points are chosen independently in the interval [0, 1]"; we denote these points by y₁, …, y_n. This means that we consider a sequence of n identical trials in which X = [0, 1], and in which the distribution of outcomes is the uniform distribution on [0, 1]. The sum (1/n) Σ_{i=1}^n f(y_i) is a random variable defined on the space Ω of words ω = (y₁, …, y_n). We now derive the following statement from the LLN: for any ε > 0

P{ω : |(1/n) Σ_{i=1}^n f(y_i) − I| ≤ ε} → 1 as n → ∞.

For the proof we divide the interval [0, 1] into r equal parts by the points x^(i) = i/r, 0 ≤ i ≤ r. Given ω = (y₁, …, y_n) ∈ X^[1,n], we construct ω′ = (x₁, …, x_n), where x_k = x^(i) if i/r ≤ y_k ≤ (i + 1)/r (in the case of two possibilities we choose the left-hand point). We choose r so large that |f(x_k) − f(y_k)| ≤ ε/3; this is possible since the function f is continuous on [0, 1] and therefore uniformly continuous. It follows that

|(1/n) Σ_{i=1}^n f(y_i) − (1/n) Σ_{i=1}^n f(x_i)| ≤ (1/n) Σ_{i=1}^n |f(y_i) − f(x_i)| ≤ ε/3.

We now introduce on the space Ω′ of words ω′ the probability distribution which assigns to each ω′ a probability p(ω′) equal to the probability of those ω ∈ Ω which correspond to ω′. But it is clear then that if ω′ = (x₁, …, x_n) then

p(ω′) = (1/r)^n.

The last relation shows that we have on the space Ω′ a probability distribution which corresponds to a sequence of n independent trials with space of outcomes X′ = {0, 1/r, 2/r, …, (r − 1)/r} and probability 1/r for each outcome. By the LLN, P{ω′ : |ν^(j)/n − 1/r| ≤ δ, 0 ≤ j ≤ r − 1} → 1 as n → ∞ for any fixed δ. Here ν^(j) = ν^(j)(ω′) is the number of occurrences of the number j/r in the word ω′. We can therefore write

(1/n) Σ_{i=1}^n f(x_i) = Σ_{j=0}^{r−1} (ν^(j)/n) f(x^(j))
= Σ_{j=0}^{r−1} (1/r) f(j/r) + Σ_{j=0}^{r−1} (ν^(j)/n − 1/r) f(j/r).

We choose r so large that, in addition, |(1/r) Σ_{j=0}^{r−1} f(j/r) − I| ≤ ε/3, and we set δ = ε/3rM, where M = max |f(x)|. Then

|Σ_{j=0}^{r−1} (ν^(j)/n − 1/r) f(j/r)| ≤ r·δ·M = ε/3.

Finally we obtain that if r is chosen in such a way that all the requirements mentioned above are satisfied, and if y = (y^(1), …, y^(n)) is such that |ν^(j)/n − 1/r| ≤ δ for all 0 ≤ j ≤ r − 1, then |(1/n) Σ_{i=1}^n f(y^(i)) − I| ≤ ε. It follows from the LLN that the probability of such y converges to 1 as n → ∞.
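A minimal Python sketch of the method for d = 1, with the illustrative integrand f(y) = y², whose exact integral over [0, 1] is 1/3 (seed fixed for reproducibility):

```python
import random

# Monte Carlo estimate of I = integral_0^1 f(y) dy.
# Illustrative integrand; exact value is 1/3.
random.seed(1)

def f(y):
    return y * y

n = 100_000
estimate = sum(f(random.random()) for _ in range(n)) / n
```

The standard deviation of the estimate is of order 1/√n ≈ 0.001 here, which is exactly why the method remains usable when d is large, where Riemann sums are hopeless.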

2.6.4 Entropy of a Sequence of Independent Trials and Macmillan's Theorem

As before, let Ω = X^[1,n] and X = {x^(1), …, x^(r)}, p^(j) = p(x^(j)), 1 ≤ j ≤ r.

Definition 2.9. The entropy of a sequence of independent identical trials is the number h = − Σ_{j=1}^r p^(j) ln p^(j).

Here we assume that 0 ln 0 = 0. Clearly h ≥ 0, and h = 0 only in the degenerate case where one of the probabilities is equal to 1 and the remaining ones are equal to zero. Furthermore max_{p^(j)} h = ln r. Indeed, if all p^(j) > 0 then h = h({p^(j)}) is a differentiable function of the r variables p^(j), subject to the condition Σ_{j=1}^r p^(j) = 1. To find the maximum of this function we use Lagrange multipliers and set L(p^(1), …, p^(r)) = − Σ_{j=1}^r p^(j) ln p^(j) + λ(Σ_{j=1}^r p^(j) − 1). Then 0 = ∂L/∂p^(j) = −ln p^(j) − 1 + λ, i.e. p^(j) = e^(λ−1). So all the p^(j) are equal, p^(j) = 1/r, and consequently, for such a distribution, h = ln r. If some of the probabilities are equal to zero, then the same argument shows that max h for such p^(j) is equal to ln r′, where r′ is the number of non-zero probabilities, and ln r′ < ln r. So max h = ln r, and it is attained for the uniform distribution on X. The meaning of entropy is made clear by Macmillan's Theorem, which we state below.

Theorem 2.4 (Macmillan). For any α > 0, β > 0 there exists a number n₀(α, β) such that for all n > n₀(α, β) there exists in the space Ω = X^[1,n] an event C_n ⊂ Ω which satisfies the following properties:
1. P(C_n) ≥ 1 − α;
2. for every ω ∈ C_n, e^(−n(h+β)) ≤ p(ω) ≤ e^(−n(h−β));
3. e^(n(h−β)) ≤ |C_n| ≤ e^(n(h+β)).

Proof. Let δ > 0, whose value we will specify later, and set

C_n = {ω : |ν^(j)(ω)/n − p^(j)| ≤ δ, 1 ≤ j ≤ r}.

It follows from the LLN that P(C_n) → 1 as n → ∞, and therefore for n ≥ n₀(α, β) we have P(C_n) ≥ 1 − α.
Let us now prove property 2. For ω = {x₁, …, x_n} we have

p(ω) = Π_{i=1}^n p(x_i) = Π_{j=1}^r (p^(j))^(ν^(j)(ω))
= exp(Σ_{j=1}^r ν^(j)(ω) ln p^(j)) = exp(−n(− Σ_{j=1}^r (ν^(j)/n) ln p^(j)))
= exp(−n(− Σ_{j=1}^r p^(j) ln p^(j) − Σ_{j=1}^r (ν^(j)/n − p^(j)) ln p^(j))).

Since ω ∈ C_n,

|Σ_{j=1}^r (ν^(j)/n − p^(j)) ln p^(j)| ≤ δ Σ_{j=1}^r |ln p^(j)|.

We choose δ such that δ Σ_{j=1}^r |ln p^(j)| ≤ β/2. Then

exp(−n(h + β/2)) = exp(−n(− Σ p^(j) ln p^(j) + β/2)) ≤ p(ω)
≤ exp(−n(− Σ p^(j) ln p^(j) − β/2)) = exp(−n(h − β/2)).

This completes the proof of 2. To prove 3 we write

P(C_n) = Σ_{ω∈C_n} p(ω)

and

e^(−n(h+β/2)) |C_n| ≤ Σ_{ω∈C_n} p(ω) ≤ e^(−n(h−β/2)) |C_n|.

It follows that for n large enough

e^(n(h−β)) ≤ (1 − α) e^(n(h−β/2)) ≤ P(C_n) e^(n(h−β/2)) ≤ |C_n| ≤ e^(n(h+β/2)) P(C_n) ≤ e^(n(h+β)). □
The meaning of Macmillan's Theorem (MT) is as follows. Clearly |Ω| = r^n. The MT shows that we can choose in Ω a subset C_n whose number of points is between e^(n(h−β)) and e^(n(h+β)), and whose probability is arbitrarily close to 1. If h is much smaller than ln r, then such a reduction of the space Ω does not change much from the point of view of probability theory, but significantly reduces the space itself. This property plays an important role in many problems in information theory and ergodic theory. In addition, the MT shows that for any non-uniform distribution {p^(j)} the probabilities p(ω) for large n become equal in the limit, i.e. the distribution approaches the uniform distribution, although in a very weak sense.
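The typical set C_n can be computed by brute force for a small alphabet. The Python sketch below (alphabet, probabilities, n, and β are illustrative choices) measures how much probability the words with e^(−n(h+β)) ≤ p(ω) ≤ e^(−n(h−β)) already carry for a modest n:

```python
from itertools import product
from math import log, exp

# Entropy and the "typical set" of words whose probability lies in the
# Macmillan band, by exhaustive enumeration over a 3-letter alphabet.
p = {"a": 0.7, "b": 0.2, "c": 0.1}
h = -sum(q * log(q) for q in p.values())

n, beta = 10, 0.35
lo, hi = exp(-n * (h + beta)), exp(-n * (h - beta))

prob_typical = 0.0   # P(C_n)
count = 0            # |C_n|
for word in product(p, repeat=n):
    pw = 1.0
    for x in word:
        pw *= p[x]
    if lo <= pw <= hi:
        prob_typical += pw
        count += 1
```

Already at n = 10 the band captures around 90% of the probability while containing well under r^n = 3^10 words, a small-scale preview of the reduction the theorem guarantees for large n.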

2.6.5 Random Walks

Let X = {−1, 1}, p(1) = p, p(−1) = q, and consider the sequence of independent identical trials with Ω = X^[1,n]. Here ω = {x₁, …, x_n}, where each x_i = ±1. We set y₀ = 0, y_k = Σ_{i=1}^k x_i, 1 ≤ k ≤ n, and we draw in the (t, y)-plane the broken line y(t) which passes through the points (k, y_k), 0 ≤ k ≤ n (Fig. 2.1).

It is clear that the function y(t) uniquely determines ω. Each y(t) consists of straight segments at angles of ±45°. If we consider y(t) as the graph of a motion, it is clear that in each unit of time the moving point moves along a segment of length 1, but may arbitrarily change the direction of motion. The function y(t) is called the trajectory of the random walk. The trajectory of a random walk can be figuratively described as the trajectory of a drunk person. It is clear that if k is even then y(k) is even, and if k is odd then y(k) is odd. We set P{y(t), 0 ≤ t ≤ n} = p(ω), where p(ω) corresponds to the sequence ω = {x₁, …, x_n} which gave y(t). Then P{y(t), 0 ≤ t ≤ n} = p^(ν(1)) q^(ν(−1)), where ν(1) = ν(1)(ω) is the number of unit segments which rise at an angle of +45°, and ν(−1) = ν(−1)(ω) is the number of unit segments which fall at an angle of −45°. Clearly we have

y(n) = ν(1)(ω) − ν(−1)(ω),   y(n)/n = ν(1)(ω)/n − ν(−1)(ω)/n.

It follows from the LLN that for large n, with large probability, n⁻¹ν(1) is close to p and n⁻¹ν(−1) is close to q. Therefore n⁻¹y(n) with large probability is close to p − q. The quantity p − q is called the mean one-step displacement.

Definition 2.10. A random walk is said to be symmetric if p − q = 0, i.e. if p = q = 1/2.

We now find Ey(n) and Ey²(n) for a symmetric random walk. We use the facts that ν(1)(ω) + ν(−1)(ω) = n, ν(−1)(ω) = n − ν(1)(ω), y(n) = 2ν(1)(ω) − n, and that ν(1) has a binomial distribution with p = q = 1/2. Using Theorem 2.2 we obtain:

Ey(n) = E(2ν(1)(ω) − n) = 2Eν(1)(ω) − n = 2·(n/2) − n = 0,

Var y(n) = Ey²(n) = E(2ν(1)(ω) − n)² = E[4(ν(1)(ω))² − 4nν(1)(ω) + n²]
= 4E(ν(1)(ω))² − 4nEν(1)(ω) + n²
= 4 Var(ν(1)(ω)) + 4(Eν(1)(ω))² − 4nEν(1)(ω) + n²
= 4·n·(1/2)·(1/2) + 4(n/2)² − 4·n·(n/2) + n² = n.

The last relation shows that typical values of y(n) lie in a domain of order O(√n). In the next lecture we will give this statement an accurate form.
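Both moments are easy to confirm by simulation. A Python sketch for the symmetric walk (n, the number of sample trajectories, and the seed are illustrative choices):

```python
import random

# Simulate symmetric random walks of length n and check Ey(n) = 0
# and Ey(n)^2 = n empirically.
random.seed(2)
n, walks = 100, 20_000
ys = [sum(random.choice((-1, 1)) for _ in range(n)) for _ in range(walks)]

mean = sum(ys) / walks
second_moment = sum(y * y for y in ys) / walks
```

With 20 000 walks the sample mean is within a few hundredths of 0 and the sample second moment within a few units of n = 100; note also that every y(n) has the parity of n, as observed in the text.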
Lecture 3. De Moivre-Laplace and Poisson
Limit Theorems

3.1 De Moivre-Laplace Theorems

3.1.1 Local and Integral De Moivre-Laplace Theorems

We again consider a binomial distribution with probabilities p and q, i.e. P_k = C(n,k) p^k q^(n−k), and, to fix ideas, we assume that p > q. The number k takes values from 0 to n, so we have n + 1 probabilities P_k. We now study the question of the behavior of these probabilities as a function of k, for large n. We will see that there is a relatively small domain of values of k (of size √n) where the P_k are comparatively large, and a remaining domain where the P_k are negligible. In order to locate those k for which P_k is large we find k₀ such that P_{k₀} = max_k P_k. We have the relation:

P_{k+1}/P_k = C(n,k+1) p^(k+1) q^(n−k−1) / (C(n,k) p^k q^(n−k)) = [k!(n−k)! / ((k+1)!(n−k−1)!)] · p/q = (n−k)p / ((k+1)q).

We now find for which k the inequality P_{k+1}/P_k ≥ 1 holds. We have

(n−k)p/((k+1)q) ≥ 1,   (n−k)p ≥ q(k+1),   np − q ≥ k.

Conversely, for k > np − q we have P_{k+1}/P_k < 1. So k₀ = [np − q]. It is therefore natural to expect that the larger values of the probability P_k occur around the point k₀ ≈ np.
We now formulate the De Moivre–Laplace Limit Theorem, which makes the previous statement precise. Let A, B be two arbitrary numbers with A < B.

Theorem 3.1 (Local Limit Theorem: De Moivre-Laplace). Let np + A√n ≤ k ≤ np + B√n. Then

P_k = (1/√(2πnpq)) e^(−(k−np)²/(2npq)) (1 + r_n(k)),

where the remainder r_n(k) converges to 0 as n → ∞ uniformly in k, i.e.

max_{np+A√n ≤ k ≤ np+B√n} |r_n(k)| → 0 as n → ∞.

Proof. We use Stirling's formula

r! ∼ √(2πr) r^r e^(−r) as r → ∞.

In addition, we use the following simple statements: under the conditions of the theorem

k/n → p,   (n − k)/n → q as n → ∞.

We now have

P_k = [n!/(k!(n−k)!)] p^k q^(n−k)
∼ √(2πn) n^n e^(−n) / [√(2πk) k^k e^(−k) · √(2π(n−k)) (n−k)^(n−k) e^(−n+k)] · p^k q^(n−k)   (3.1)
= 1/√(2πn·(k/n)·((n−k)/n)) · (n/k)^k (n/(n−k))^(n−k) p^k q^(n−k)
∼ 1/√(2πnpq) · e^(−k ln(k/n) − (n−k) ln((n−k)/n) + k ln p + (n−k) ln q).

We set z = (k − np)/√(npq) and consider the exponent separately:

S = −k ln(k/n) − (n−k) ln((n−k)/n) + k ln p + (n−k) ln q
= −k ln((np + z√(npq))/n) − (n−k) ln((nq − z√(npq))/n) + k ln p + (n−k) ln q
= −k ln(p + z√(pq/n)) − (n−k) ln(q − z√(pq/n)) + k ln p + (n−k) ln q.

By Taylor's formula with the Lagrange form of the remainder we have

ln(p + z√(pq/n)) = ln p + z√q/√(pn) − z²q/(2pn) + K₁(p) z³(pq)^(3/2)/n^(3/2),

ln(q − z√(pq/n)) = ln q − z√p/√(qn) − z²p/(2qn) + K₂(p) z³(pq)^(3/2)/n^(3/2),

where K₁ and K₂ are the coefficients of the remainders, which arise from the third derivatives at intermediate points. We only need the fact that as n → ∞ these coefficients remain bounded, |K₁|, |K₂| ≤ C, where C is a constant that does not depend on n. Then

S = −k ln p − zk√q/√(pn) + z²qk/(2pn) − (n−k) ln q + z(n−k)√p/√(qn) + z²p(n−k)/(2qn)
− K₁(p) z³(pq)^(3/2) k/n^(3/2) − K₂(p) z³(pq)^(3/2) (n−k)/n^(3/2) + k ln p + (n−k) ln q.

The expressions which contain K₁(p) and K₂(p) converge uniformly to 0, since z varies between bounded limits, A ≤ z ≤ B. We now replace k and n − k by their expressions in terms of z, k = np + z√(npq) and n − k = nq − z√(npq). The linear terms give

−z√q(np + z√(npq))/√(pn) + z√p(nq − z√(npq))/√(qn) = −z√(npq) − z²q + z√(npq) − z²p = −z²,

and, since k/n → p and (n − k)/n → q, the quadratic terms give

z²qk/(2pn) + z²p(n−k)/(2qn) → z²q/2 + z²p/2 = z²/2,

uniformly in A ≤ z ≤ B. Hence S = −z²/2 + o(1) uniformly, which yields P_k = (1/√(2πnpq)) e^(−z²/2)(1 + r_n(k)) with max |r_n(k)| → 0. □
In probability theory, statements involving the asymptotic behavior of individual probabilities are sometimes called local limit theorems.

Corollary 3.1. max_k P_k → 0 as n → ∞.

Proof. We saw that max_k P_k is attained at k₀ = [np − q]. For such a k₀

P_{k₀} ∼ 1/√(2πnpq) → 0 as n → ∞,

since z₀ = (k₀ − np)/√(npq) and |z₀| ≤ 2/√(npq) → 0 as n → ∞. □

Theorem 3.2 (Integral Limit Theorem: De Moivre-Laplace). Let A, B be fixed numbers with A < B. Then

lim_{n→∞} Σ_{np+A√(npq) ≤ k ≤ np+B√(npq)} P_k = (1/√(2π)) ∫_A^B e^(−z²/2) dz.

Proof. Again we set z = (k − np)/√(npq). When k takes in turn all integer values such that 0 ≤ k ≤ n, z varies between the limits −√(np/q) ≤ z ≤ √(nq/p) by steps of length 1/√(npq). By Theorem 3.1 we have

Σ_{np+A√(npq) ≤ k ≤ np+B√(npq)} P_k = Σ (1/√(npq)) · (1/√(2π)) e^(−z²/2) (1 + o(1)).

The sum

Σ (1/√(npq)) (1/√(2π)) e^(−z²/2)

is a Riemann sum for the integral (1/√(2π)) ∫_A^B e^(−z²/2) dz, and the remainder o(1) converges to 0 uniformly with respect to all considered values of k. □

The expression on the right-hand side of the statement of the theorem is a probability calculated with the Gaussian density p(z) = (1/√(2π)) e^(−z²/2). This density plays an important role in many problems of probability theory.
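The integral theorem can be checked numerically by comparing the binomial sum with the Gaussian integral, written here through the error function. A Python sketch (n, p, A, B are illustrative choices):

```python
from math import comb, erf, sqrt

def phi(z):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p = 500, 0.6
q = 1.0 - p
A, B = -1.0, 1.5
s = sqrt(n * p * q)
lo, hi = n * p + A * s, n * p + B * s

# P{np + A*sqrt(npq) <= k <= np + B*sqrt(npq)} for Bin(n, p)
binom_sum = sum(comb(n, k) * p**k * q**(n - k)
                for k in range(n + 1) if lo <= k <= hi)

# (1/sqrt(2*pi)) * integral_A^B e^{-z^2/2} dz
gauss = phi(B) - phi(A)
```

For n = 500 the two quantities agree to about one percent; the residual discrepancy is of order max_k P_k ≈ 1/√(2πnpq), consistent with the local theorem.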

Corollary 3.2. For any A

lim_{n→∞} Σ_{k ≤ np+A√(npq)} P_k = (1/√(2π)) ∫_{−∞}^A e^(−z²/2) dz,

lim_{n→∞} Σ_{k ≥ np+A√(npq)} P_k = (1/√(2π)) ∫_A^∞ e^(−z²/2) dz.

We show only the first statement, as the second statement is proved similarly. Let ε > 0 be arbitrary. Since (1/√(2π)) ∫_{−∞}^∞ e^(−z²/2) dz = 1, there exists a B > 0, B > |A|, such that (1/√(2π)) ∫_{−B}^B e^(−z²/2) dz ≥ 1 − ε/4. By Theorem 3.2 we have

lim_{n→∞} Σ_{np−B√(npq) ≤ k ≤ np+B√(npq)} P_k = (1/√(2π)) ∫_{−B}^B e^(−z²/2) dz ≥ 1 − ε/4.

Therefore for all n ≥ n₀(ε)

Σ_{np−B√(npq) ≤ k ≤ np+B√(npq)} P_k ≥ 1 − ε/2.

Consequently, for such n,

Σ_{k < np−B√(npq)} P_k ≤ ε/2.

We therefore have

Σ_{k ≤ np+A√(npq)} P_k = Σ_{k < np−B√(npq)} P_k + Σ_{np−B√(npq) ≤ k ≤ np+A√(npq)} P_k = Σ₁ + Σ₂.

We already showed that Σ₁ ≤ ε/2 for n ≥ n₀(ε). By the theorem we have proved, Σ₂ satisfies

lim_{n→∞} Σ₂ = (1/√(2π)) ∫_{−B}^A e^(−z²/2) dz ≥ (1/√(2π)) ∫_{−∞}^A e^(−z²/2) dz − ε/8,

since by symmetry each tail of the Gaussian integral outside [−B, B] is at most ε/8. Therefore, for n > n₁(ε),

|Σ_{k ≤ np+A√(npq)} P_k − (1/√(2π)) ∫_{−∞}^A e^(−z²/2) dz| ≤ ε.

3.1.2 Application to Symmetric Random Walks


Let y(n) be the coordinate of a random walk point at time n. We have already
seen that y(n) = 21.1(1) (w) - n. Recall that 1.1(1) (w) has a binomial distribution
with p = q = 1/2. Therefore
P(Avn ~ y(n) ~ Bvn) = P(Avn ~ 2v(1)(w) - n ~ Bvn)
= P(Avn/2 + n/2 ~ 1.1(1) (w) ~ Bvn/2 + n/2)

= p( AJn.~.~ + n/2 ~ 1.1(1) (w) ~ BJn.~.~ + n/2)


.... -1-
...j2-i
1B ·
A
e-· /2 dz .

The last relation shows that for any interval [A, B] the probability that
y(n)/√n ∈ [A, B] converges to a positive limit as n → ∞. This means that
for large n, in typical cases (i.e. with positive probability) the random walk
takes values of order √n, i.e. the random walk point at time n has departed
from its initial value by a distance of order √n. Such a feature is typical of
diffusion processes.
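The diffusive √n scaling can be checked numerically. The following sketch (the walk length n, trial count and interval [A, B] are illustrative choices, not from the text) compares the empirical frequency of {A√n ≤ y(n) ≤ B√n} with the Gaussian integral above, evaluated by a simple midpoint rule.

```python
import random
import math

def walk_endpoint(n, rng):
    """Position y(n) of the symmetric random walk after n steps of +-1."""
    return sum(rng.choice((-1, 1)) for _ in range(n))

def gaussian_prob(A, B, steps=10_000):
    """Midpoint-rule integral of the standard Gaussian density over [A, B]."""
    h = (B - A) / steps
    total = sum(math.exp(-((A + (i + 0.5) * h) ** 2) / 2) for i in range(steps))
    return total * h / math.sqrt(2 * math.pi)

rng = random.Random(0)
n, trials, A, B = 100, 10_000, -1.0, 1.0
hits = sum(A * math.sqrt(n) <= walk_endpoint(n, rng) <= B * math.sqrt(n)
           for _ in range(trials))
print(hits / trials, gaussian_prob(A, B))  # the two numbers should be close
```

The small discrepancy that remains for moderate n comes from the lattice nature of y(n) (it only takes values of the parity of n), which the continuous Gaussian approximation ignores.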

3.1.3 A Problem in Mathematical Statistics


We now consider a problem in mathematical statistics which is related to a
sequence of n identical trials with two outcomes. For convenience we call these
outcomes "success" and "failure" and denote them respectively by 1 and O.
The result of n trials therefore has the form ω = {x₁, …, x_n} where each x_i
takes the values 1 or 0 with probabilities p and q = 1 − p respectively, and
Σ_{i=1}^n x_i is the number of successes. A fundamental problem of mathematical
statistics in this case consists in estimating the value of p from a given concrete
ω. For example, if we are dealing with a real-life game situation, and if the
game is fair, i.e. theoretically p = q = 1/2, then an estimator of the difference
p − 1/2 is an estimator of whether cheating has occurred.

On the basis of the law of large numbers, a natural candidate for the
estimator of p is ν⁽¹⁾(ω)/n = (1/n) Σ_{i=1}^n x_i, i.e. the proportion of successes. In
connection with this we run into a general problem of mathematical statistics,
that of the accuracy of a statistical estimator. The usual approach applied to
our problem consists in the following. We start from the fact that a probability
distribution is given on the space Ω for a sequence of independent trials with
an unknown probability p. We choose α > 0 and call it the tolerance level,
and we find a number a(p, n) such that

P{ω : |ν⁽¹⁾(ω)/n − p| ≤ a(p, n)} ≥ 1 − α. (3.2)

The last inequality means that

−a(p, n)n ≤ ν⁽¹⁾(ω) − pn ≤ a(p, n)n,

or

pn − a(p, n)n ≤ ν⁽¹⁾(ω) ≤ pn + a(p, n)n. (3.3)

We now consider the following functions of p, which depend on n:

f₊(p) = pn + a(p, n)n,
f₋(p) = pn − a(p, n)n.

We assume that f±(p) are strictly monotone in p and that there exist inverse
functions g₊(p) and g₋(p), i.e. g₊(f₊(p)) = p and g₋(f₋(p)) = p. It then
follows that equation (3.3) can be rewritten in the following way:

P{ω : g₊(ν⁽¹⁾(ω)) ≤ p ≤ g₋(ν⁽¹⁾(ω))} ≥ 1 − α. (3.4)

The interpretation of (3.4) is that with probability at least 1 − α, p lies between
the limits

g₊(ν⁽¹⁾(ω)) ≤ p ≤ g₋(ν⁽¹⁾(ω)). (3.5)

Resorting to the frequency interpretation of probability we can say that an
estimation of p by way of (3.5) is correct a proportion (1 − α) of the time. In
other words, if we carry out an estimation of p by using a series of n trials N
times, then in approximately N(1 − α) cases the estimation (3.4) is correct, and
in the remaining cases it is incorrect. As we will see, the smaller α is, the worse
the accuracy of the estimator, i.e. the greater is g₋(ν⁽¹⁾(ω)) − g₊(ν⁽¹⁾(ω)).
These circumstances dictate the choice of the number α. If the frequency of
errors must be minimal, because each error can lead to serious consequences
(as for example in the case of medical diagnosis), then α will be chosen very
small, but then the accuracy of the estimator is worse. If an estimation error
is permissible in a given proportion of cases, then we can choose a greater
α, and obtain a tighter interval for p. The prevailing values of α are α =
0.1, 0.05, 0.01, 0.005 and 0.001.
We now find an explicit form for f₊, f₋, g₊ and g₋. An exact solution of
(3.2) is difficult to obtain. Let us assume that n is large enough for us to be
able to use the De Moivre-Laplace Limit Theorem. Then

P{ω : |ν⁽¹⁾(ω)/n − p| ≤ a√(pq)/√n}
= P{ω : −a√(np(1−p)) ≤ ν⁽¹⁾(ω) − np ≤ a√(np(1−p))}
= P{ω : pn − a√(np(1−p)) ≤ ν⁽¹⁾(ω) ≤ pn + a√(np(1−p))}
≈ (1/√(2π)) ∫_{−a}^{a} e^{−z²/2} dz = 1 − α.

Given the number α we can find a = a_α by using tables for the normal
distribution. Then

f₊(p) = pn + a_α √(np(1−p)), f₋(p) = pn − a_α √(np(1−p)).

Let us set f₊(p) = t. We have t − pn = a_α √(np(1−p)).
Analogous calculations hold for f₋(p). If we solve these equations for p we
obtain p as a function of t. The two different choices of roots correspond to
the two functions g±. In the plane (p, t) the equation t² − 2pnt + p²n² −
a_α² np(1−p) = 0 is the equation of an ellipse passing through the points (0,0)
and (1, n). It is convenient to normalize t, that is, to set τ = t/n.

Fig. 3.1.

We then obtain τ² − 2pτ + p² − a_α² p(1−p)/n = 0. It is now clear that the
width of this ellipse is of the order O(a_α/√n). The smaller α is, the greater
a_α is and the wider the ellipse is. The ellipse must be used in the following
way: given ω we find the quantity τ = τ(ω) = ν⁽¹⁾(ω)/n and draw in the
plane (p, τ) a horizontal line at height τ(ω). The line intersects the ellipse at
the two points p₋, p₊. Then p₋ ≤ p ≤ p₊ with probability at least 1 − α.
The fact that the ellipse intersects the "forbidden domain" p < 0 and p > 1
is connected with the fact that the approximation by a Gaussian distribution
turns out to be invalid there.
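The intersection points p₋, p₊ can be computed directly by solving the quadratic above in p. A minimal sketch (the data, 520 successes out of n = 1000 trials, and the quantile a_α = 1.96 for α = 0.05 are illustrative assumptions, not values from the text):

```python
import math

# Rearranging tau^2 - 2*p*tau + p^2 - a^2*p*(1-p)/n = 0 in powers of p gives
#   (1 + a^2/n) p^2 - (2*tau + a^2/n) p + tau^2 = 0,
# whose two roots are the endpoints p_minus, p_plus of the interval for p.

def confidence_interval(successes, n, a=1.96):
    """Return (p_minus, p_plus) from the ellipse construction."""
    tau = successes / n
    A = 1 + a * a / n
    B = -(2 * tau + a * a / n)
    C = tau * tau
    disc = math.sqrt(B * B - 4 * A * C)
    return ((-B - disc) / (2 * A), (-B + disc) / (2 * A))

lo, hi = confidence_interval(520, 1000)
print(round(lo, 3), round(hi, 3))  # an interval bracketing tau = 0.52
```

Note that both roots automatically stay inside [0, 1]; the "forbidden domain" problem mentioned above only concerns the ellipse itself for extreme values of τ.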

3.1.4 Generalizations of the De Moivre-Laplace Theorem

We now consider two generalizations of Theorem 3.1, which will be of use
to us later in our study of higher-dimensional random walks. Let X =
{a₁, a₂; b₁, b₂}, p(a₁) = p(a₂) = p, p(b₁) = p(b₂) = q, 2p + 2q = 1,
and X = {a₁, a₂; b₁, b₂; c₁, c₂}, p(a₁) = p(a₂) = p, p(b₁) = p(b₂) = q,
p(c₁) = p(c₂) = r, 2p + 2q + 2r = 1. We consider the events C_n =
{ω : ν⁽¹⁾(ω) = ν⁽²⁾(ω), ν⁽³⁾(ω) = ν⁽⁴⁾(ω)} in the first case, and C_n = {ω : ν⁽¹⁾(ω) =
ν⁽²⁾(ω), ν⁽³⁾(ω) = ν⁽⁴⁾(ω), ν⁽⁵⁾(ω) = ν⁽⁶⁾(ω)} in the second case.

Theorem 3.3. The following relations hold:

lim_{n→∞} n P(C_n) = 1/(2π√(pq)) in the first case, and

P(C_n) ≤ const/n^{3/2} in the second case,

where const is a constant which depends on p, q and r.

Proof. Since Σ_i ν⁽²ⁱ⁻¹⁾(ω) = Σ_i ν⁽²ⁱ⁾(ω) for ω ∈ C_n, and these two sums add
up to n, we can assume that n is even. In the first case

C_n = ∪_k C_{n,k}

with

C_{n,k} = {ω : ν⁽¹⁾(ω) = ν⁽²⁾(ω) = k, ν⁽³⁾(ω) = ν⁽⁴⁾(ω) = n/2 − k},

and in the second case

C_n = ∪_{k₁,k₂} C_{n,k₁,k₂}

with

C_{n,k₁,k₂} = {ω : ν⁽¹⁾(ω) = ν⁽²⁾(ω) = k₁, ν⁽³⁾(ω) = ν⁽⁴⁾(ω) = k₂,
ν⁽⁵⁾(ω) = ν⁽⁶⁾(ω) = n/2 − k₁ − k₂}.

For each ω ∈ C_{n,k} clearly p(ω) = p^{2k} q^{n−2k}. Therefore P(C_{n,k}) = |C_{n,k}| p^{2k} q^{n−2k}.
In order to calculate |C_{n,k}| we note that |C_{n,k}| is equal to the number
of ways in which we can choose from the set {1, 2, …, n} two subsets with
k elements and two subsets with n/2 − k elements. The first subset can be
chosen in C(n, k) ways. When the first subset is chosen, then for the choice of the
second subset we have C(n−k, k) possibilities. Once we have chosen the first two
subsets, there are C(n−2k, n/2−k) possibilities for the choice of the third one. Finally

|C_{n,k}| = C(n, k) C(n−k, k) C(n−2k, n/2−k)
= n!(n−k)!(n−2k)! / [k!(n−k)! k!(n−2k)! ((n/2−k)!)²] = n! / [(k!)² ((n/2−k)!)²].
By analogous arguments we get, in the second case,

|C_{n,k₁,k₂}| = C(n, k₁) C(n−k₁, k₁) C(n−2k₁, k₂) C(n−2k₁−k₂, k₂) C(n−2k₁−2k₂, n/2−k₁−k₂)
= n! / [(k₁!)² (k₂!)² ((n/2−k₁−k₂)!)²].

We now have

P(C_{n,k}) = n! / [(k!)² ((n/2−k)!)²] · p^{2k} q^{n−2k}
= [n!/((2k)!(n−2k)!)] (2p)^{2k} (2q)^{n−2k} · [(1/2^{2k}) (2k)!/(k!)²] · [(1/2^{n−2k}) (n−2k)!/((n/2−k)!)²].

We note that (1/2^{2k}) (2k)!/(k!)² is the probability of obtaining k ones in a sequence of
2k trials with p = q = 1/2. By Theorem 3.1

(1/2^{2k}) (2k)!/(k!)² ~ 1/√(2π·2k·(1/2)·(1/2)) = 1/√(πk), k → +∞.

We also have

(1/2^{n−2k}) (n−2k)!/((n/2−k)!)² ~ 1/√(π(n/2 − k)), n/2 − k → +∞.

In all cases

(1/2^{2k}) (2k)!/(k!)² · (1/2^{n−2k}) (n−2k)!/((n/2−k)!)² ≤ c/√n

for some absolute constant c. So for arbitrary δ > 0 we now have

P(C_n) = Σ_k P(C_{n,k})
= Σ_k P_{2k} [(1/2^{2k}) (2k)!/(k!)²] [(1/2^{n−2k}) (n−2k)!/((n/2−k)!)²]
= (Σ_{|k/n−p|<δ} + Σ_{|k/n−p|≥δ}) P_{2k} [(1/2^{2k}) (2k)!/(k!)²] [(1/2^{n−2k}) (n−2k)!/((n/2−k)!)²]
= Σ₁ + Σ₂.

Here the P_i denote the probabilities of a binomial distribution with probabilities
2p and 2q. By Chebyshev's inequality Σ₂ satisfies

Σ₂ ≤ (c/√n) Σ_{|k/n−p|≥δ} P_{2k} ≤ (c/√n) Σ_{|2k/n−2p|≥2δ} P_{2k}
≤ (c/√n) · 4npq/(4δ²n²) = cpq/(δ² n^{3/2}),

so that n Σ₂ → 0 as n → ∞.
We can write Σ₁ in the following way:

Σ₁ = Σ_{|k/n−p|<δ} P_{2k} [(1/2^{2k}) (2k)!/(k!)²] [(1/2^{n−2k}) (n−2k)!/((n/2−k)!)²]
≤ (1 + θ) (1/√(πnp)) (1/√(πnq)) Σ_k P_{2k},

where θ = θ(δ) → 0 as δ → 0. The sum Σ_i P_i = 1 and the sum Σ_k P_{2k} → 1/2,
by Theorem 3.1 (we take only half the terms in the Riemann sum; the exact
proof is left to the reader as an exercise). Therefore

lim sup_{n→∞} n Σ₁ ≤ (1 + θ)/(2π√(pq)).

Since δ (and with it θ) was arbitrary, we have proved that

lim sup_{n→∞} n P(C_n) ≤ 1/(2π√(pq)).

A lower bound is obtained similarly: for |k/n − p| < δ
(1/2^{2k}) (2k)!/(k!)² · (1/2^{n−2k}) (n−2k)!/((n/2−k)!)² ≥ (1 − θ) (1/√(πnp)) (1/√(πnq)),

and the fact that Σ_{|k/n−p|<δ} P_{2k} → 1/2 as n → ∞ can again be proved by
using Theorem 3.1.
We now turn to the second case. We have

P(C_n) = Σ_{k₁} [n!(2p)^{2k₁}(2(q+r))^{n−2k₁} / ((k₁!)² (n−2k₁)! 2^{2k₁})]
· Σ_{k₂} [(n−2k₁)! / ((k₂!)² ((n/2−k₁−k₂)!)²)] (q/(2(q+r)))^{2k₂} (r/(2(q+r)))^{n−2k₁−2k₂}.

Here the sum over k₂ is equal to the sum which we estimated in the first case, therefore
for a certain constant c₁

Σ_{k₂} [(n−2k₁)! / ((k₂!)² ((n/2−k₁−k₂)!)²)] (q/(2(q+r)))^{2k₂} (r/(2(q+r)))^{n−2k₁−2k₂}
≤ c₁/(n − 2k₁ + 1).

It follows that

P(C_n) ≤ c₁ Σ_{k₁} n!(2p)^{2k₁}(2(q+r))^{n−2k₁} / [(k₁!)² (n−2k₁)! (n−2k₁+1) 2^{2k₁}]
= c₁ Σ_{k₁} [n!/((2k₁)!(n−2k₁)!)] (2p)^{2k₁} (2(q+r))^{n−2k₁} · (2k₁)!/[(k₁!)² 2^{2k₁}] · 1/(n−2k₁+1).

Moreover, as previously, we can write

(2k₁)!/[(k₁!)² 2^{2k₁}] ≤ c₁/√(k₁+1).

If we denote by P′_i the probabilities of a binomial distribution with probabilities
2p and 2(q+r), we obtain

P(C_n) ≤ c₁² Σ_{k₁} P′_{2k₁} / [√(k₁+1) (n−2k₁+1)].

Also, as in the first case, one can show that the main contribution to this sum
is given by those terms where k₁/n ~ p. For such k₁ we have

1/[√(k₁+1) (n−2k₁+1)] = (1/n^{3/2}) · 1/[√(k₁/n + 1/n) (1 − 2k₁/n + 1/n)]
~ (1/n^{3/2}) · 1/[√p · 2(q+r)].

The desired statement then follows easily. □

Later we consider other methods which lead to similar estimates.

3.2 The Poisson Distribution and the Poisson Limit Theorem

3.2.1 Poisson Limit Theorem

The Poisson probability distribution is the following:


P_k = 0 for k < 0; P_k = e^{−λ} λ^k/k! for k ≥ 0.

Here λ > 0 is the parameter of the Poisson distribution and k takes its values
in the set of integers −∞ < k < ∞. If a random variable ξ has a Poisson
distribution with parameter λ then

Eξ = Σ_{k≥0} k e^{−λ} λ^k/k! = λ.

We now consider a problem in which a Poisson distribution arises. Let X =
{1, 0} and ω = {x₁, …, x_n}, x_i ∈ X. We assume, given on Ω, a probability
distribution which corresponds to a sequence of n identical independent trials
but where p = p(1) depends on n, i.e. p = p_n, in such a way that
lim_{n→∞} np_n = λ. Then Eν⁽¹⁾ = np_n → λ as n → ∞.

Theorem 3.4 (Poisson Limit).

lim_{n→∞} P{ν⁽¹⁾ = k} = e^{−λ} λ^k/k!.

Proof. We have

P(ν⁽¹⁾ = k) = C(n, k) p_n^k (1 − p_n)^{n−k}
= [n(n−1)⋯(n−k+1)/k!] p_n^k e^{(n−k) ln(1−p_n)}.

In our situation k is fixed but n → ∞. Therefore (n−k) ln(1−p_n) ~ −(n−
k)p_n = −p_n n(1 − k/n) → −λ as n → ∞. Furthermore

n(n−1)⋯(n−k+1) p_n^k = (np_n)^k (1 − 1/n)(1 − 2/n)⋯(1 − (k−1)/n) → λ^k as n → ∞.

Finally P(ν⁽¹⁾ = k) → (λ^k/k!) e^{−λ} as n → ∞. □


The condition p_n ~ λ/n implies that the probability of 1 (of success)
is very small and decreases as O(1/n). It also means that Eν⁽¹⁾ ~ λ as
n → ∞ and remains finite, i.e. the total number of ones in a word of length
n on average remains finite and does not increase with n. For that reason the
Poisson Limit Theorem is sometimes called the limit theorem for rare events.
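Theorem 3.4 is easy to check numerically. In the sketch below, λ = 2 and n = 10000 are illustrative choices: the binomial probabilities C(n,k) p_n^k (1−p_n)^{n−k} with p_n = λ/n are compared term by term with the Poisson probabilities e^{−λ} λ^k/k!.

```python
import math

def binom_pmf(n, k, p):
    """P(nu = k) for a binomial distribution: C(n,k) p^k (1-p)^(n-k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    """Poisson probability e^-lam * lam^k / k!."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam, n = 2.0, 10_000
for k in range(5):
    b = binom_pmf(n, k, lam / n)
    q = poisson_pmf(lam, k)
    print(k, round(b, 6), round(q, 6))  # the two columns agree closely
```

Increasing n (with p_n = λ/n) makes the agreement tighter, which is exactly the content of the limit theorem.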

3.2.2 Application to Statistical Mechanics

Here we describe an application of the Poisson Limit Theorem to statistical
mechanics. An ideal gas in a volume V is a system of N non-interacting
particles. We assume that V is a d-dimensional cube centered at 0 and with
sides R, and that R → ∞ and N → ∞ in such a way that the number of
particles in a unit volume remains finite, N/R^d → λ > 0 as R → ∞, N → ∞.
Such a limit transition is sometimes called thermodynamic. The absence of
interaction appears in the assumption that each particle of gas is uniformly
distributed in V independently of the remaining particles. More precisely this
means that X = V and Ω = X^N. Also a uniform distribution is given on
V for which p(Q) = vol Q/R^d, where vol Q is the volume of Q ⊂ V, and a
probability distribution is given on Ω for a sequence of N trials. The domain
Q is now fixed, and we introduce the random variable ν^{(Q)}(ω) equal to the
number of particles which lie in Q.

Theorem 3.5.

lim P(ν^{(Q)}(ω) = k) = [(λ vol Q)^k / k!] e^{−λ vol Q}.

Proof. Let us fix indices i₁, …, i_k for the particles which lie in Q, and let
C_{i₁,…,i_k} = {ω : x_{i_s} ∈ Q, 1 ≤ s ≤ k, x_j ∉ Q for j ≠ i₁, …, i_k}. Then

P(ν^{(Q)}(ω) = k) = Σ_{i₁<…<i_k} P(C_{i₁,…,i_k}).

Each C_{i₁,…,i_k} is a cylinder and for it

P(C_{i₁,…,i_k}) = (vol Q/R^d)^k (1 − vol Q/R^d)^{N−k}.

Therefore

P(ν^{(Q)}(ω) = k) = C(N, k) (vol Q/R^d)^k (1 − vol Q/R^d)^{N−k}.

In other words we have a binomial distribution in which the probability of
1 (success consists in the particle lying in Q) is equal to p_N = vol Q/R^d.
Therefore p_N N = N vol Q/R^d → λ vol Q, and the desired result follows from
Theorem 3.4. □

Theorem 3.5 shows that the number of particles of an ideal gas lying in
a given fixed domain Q, when passing to the limit as previously described,
follows a Poisson distribution with parameter λ vol Q.
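A simulation illustrates the theorem. In the sketch below the dimension, box size and trial counts are invented for illustration; as in the proof, each of the N particles lies in Q independently with probability vol Q/R^d, so we draw one Bernoulli variable per particle instead of d coordinates.

```python
import random
import math

random.seed(1)
d, R, lam = 3, 10.0, 1.0
N = int(lam * R**d)      # N/R^d = lam: here 1000 particles
vol_Q = 2.0              # volume of the fixed sub-domain Q
p = vol_Q / R**d         # P(a single particle lies in Q) = vol Q / R^d

def count_prob(k, trials=4000):
    """Empirical P(nu_Q = k): fraction of gas configurations with k particles in Q."""
    hits = 0
    for _ in range(trials):
        inside = sum(random.random() < p for _ in range(N))
        hits += (inside == k)
    return hits / trials

k = 2
poisson = math.exp(-lam * vol_Q) * (lam * vol_Q)**k / math.factorial(k)
print(count_prob(k), poisson)  # both should be near 0.27
```

Pushing R and N higher (keeping N/R^d = λ) only sharpens the agreement, which is the thermodynamic limit described above.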
Lecture 4. Conditional Probability and Independence

4.1 Conditional Probability and Independence of Events

Let (Ω, ℱ, P) be a probability space and A ∈ ℱ, B ∈ ℱ be two events, with
P(B) > 0.

Definition 4.1. The conditional probability of an event A given an event B
is given by

P(A|B) = P(A ∩ B)/P(B).

The importance of the concept of conditional probability lies in the fact
that for a large number of problems the initial data consist of conditional
probabilities from which one wishes to find properties of ordinary non-conditional
probabilities. In this and the next three lectures we consider examples of such
problems.

It is clear that the conditional probability P(A|B) depends on A and on
B, but this dependence has a very different nature. As a function of A the
conditional probability satisfies the usual properties of probability:
1. P(A|B) ≥ 0;
2. P(Ω|B) = 1;
3. for a sequence of disjoint events {A_i} (i.e. A_i ∩ A_j = ∅ for i ≠ j), with
A = ∪_i A_i,

P(A|B) = Σ_i P(A_i|B).

The nature of the dependence on B arises from the formula of total probability
which we give below. Let {B₁, B₂, …, B_n, …} be a finite or countable
partition of the space Ω, i.e. a collection of sets {B_i} such that B_i ∩ B_j = ∅
for i ≠ j and ∪_i B_i = Ω. We also assume that P(B_i) > 0 for every i. For any A ∈ ℱ
we have

P(A) = Σ_i P(A ∩ B_i) = Σ_i P(A|B_i) P(B_i).
The relation

P(A) = Σ_i P(A|B_i) P(B_i) (4.1)

is called the total probability formula. The nature of this formula is reminiscent
of expressions of the type of a multiple integral written in terms of iterated
integrals. The conditional probability P(A|B_i) plays the role of the inner
integral and the summation over i is the analog of the outer integral.

Sometimes in mathematical statistics the events B_i are called hypotheses,
since the choice of B_i defines a probability distribution on ℱ, and the probabilities
P(B_i) are called prior probabilities (i.e. given before the experiment). We
assume that as a result of the trial an event A occurred and we wish, on the
basis of this, to draw conclusions on which of the hypotheses B_i is most likely.
This estimation is done by the calculation of the probabilities P(B_k|A), which
are sometimes called posterior (after the experiment) probabilities. We have

P(B_k|A) = P(B_k ∩ A)/P(A) = P(A|B_k) P(B_k) / Σ_i P(A|B_i) P(B_i). (4.2)

The relation (4.2) is called Bayes' formula.
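A small numerical instance of formulas (4.1) and (4.2) may be helpful; the two hypotheses and all the numbers below are invented for illustration.

```python
# Two hypotheses ("fair coin", "biased coin") with prior probabilities
# 0.9 and 0.1; the observed event A is "heads".
priors = {"fair": 0.9, "biased": 0.1}        # P(B_k)
likelihood = {"fair": 0.5, "biased": 0.8}    # P(A | B_k)

# Total probability formula (4.1): P(A) = sum_k P(A|B_k) P(B_k)
p_A = sum(priors[k] * likelihood[k] for k in priors)

# Bayes' formula (4.2): posterior P(B_k | A)
posterior = {k: priors[k] * likelihood[k] / p_A for k in priors}
print(p_A)        # 0.53
print(posterior)  # fair: 0.45/0.53 ~ 0.849, biased: 0.08/0.53 ~ 0.151
```

Observing heads thus shifts only a little probability toward the biased hypothesis, since heads is also well explained by the fair coin.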
We now introduce one of the central concepts of probability theory. Assume
that P(B) > 0. We say that the event A does not depend on the event B
if P(A|B) = P(A). It follows from the expression for conditional probability
that P(A ∩ B) = P(A)P(B). This latter relation is taken as the definition
of independence in the general case. This also shows that if P(A) > 0 the
independence of B from A follows from the independence of A from B.

Definition 4.2. The events A, B are said to be independent if P(A ∩ B) =
P(A)·P(B).

Lemma 4.1. The events A and B are independent if and only if the events
in any of the pairs (Ā, B), (A, B̄) or (Ā, B̄) are independent.

Proof. Let us prove for example the equality P(Ā ∩ B̄) = P(Ā)·P(B̄). We have

P(Ā ∩ B̄) = P((Ω \ A) ∩ B̄) = P(B̄) − P(A ∩ B̄)
= P(B̄) − P(A ∩ (Ω \ B)) = P(B̄) − P(A) + P(A ∩ B)
= P(B̄) − P(A) + P(A)P(B) = P(B̄) − P(A)(1 − P(B))
= P(B̄) − P(A)P(B̄) = P(B̄)(1 − P(A)) = P(B̄)P(Ā).

The remaining equalities are proved in an analogous way. □

The statements of Lemma 4.1 can be reformulated differently. Consider
two partitions of the space Ω, namely (A, Ā) and (B, B̄). We saw earlier (Lecture
1) that finite partitions correspond uniquely to finite subalgebras. Let us
denote these subalgebras, in our case, by ℱ_A and ℱ_B. It follows from Lemma
4.1 that P(C₁ ∩ C₂) = P(C₁)·P(C₂) for any C₁ ∈ ℱ_A, C₂ ∈ ℱ_B. We now
directly generalize Definition 4.2.
Definition 4.3. Let ℱ₁ and ℱ₂ be two σ-subalgebras of the σ-algebra ℱ.
Then ℱ₁ and ℱ₂ are said to be independent if for any C₁ ∈ ℱ₁, C₂ ∈ ℱ₂

P(C₁ ∩ C₂) = P(C₁)·P(C₂).

Definition 4.4. Let ℱ₁, ℱ₂, …, ℱ_n be a finite collection of σ-subalgebras of
the σ-algebra ℱ. Then the σ-algebras ℱ_i, 1 ≤ i ≤ n, are said to be jointly
independent if for any C₁ ∈ ℱ₁, C₂ ∈ ℱ₂, …, C_n ∈ ℱ_n

P(C₁ ∩ C₂ ∩ ⋯ ∩ C_n) = P(C₁)·P(C₂)⋯P(C_n).

It is clear that one can find σ-algebras which are pairwise independent
but not jointly independent.

4.2 Independent σ-algebras and sequences of independent trials

Let Ω consist of the words ω = {x₁, …, x_n}, x_i ∈ X, and let P be a probability
distribution on Ω corresponding to a sequence of independent identical trials.
Given 1 ≤ i ≤ n let us introduce the partition ξ_i in which ω′ = {x₁′, …, x_n′}
and ω″ = {x₁″, …, x_n″} lie in the same element of the partition if and only if
x_i′ = x_i″. By the same token an element C_x of the partition ξ_i is given by a
point x ∈ X and consists of those ω for which ω_i = x. The quotient-space Ω/ξ_i
is canonically isomorphic to X, and the σ-algebra ℱ(ξ_i) is isomorphic to the
σ-algebra ℬ of subsets of the space X (see Lecture 2). The induced probability
distribution on (Ω/ξ_i, ℱ(ξ_i)) agrees with P. The definition of a sequence
of independent identical trials means that all σ-algebras ℱ(ξ_i) are jointly
independent and the probability distribution of each of them does not depend
on i (identical trials). In the general case of independent σ-subalgebras
ℱ_i = ℱ(ξ_i) the matter reduces to a sequence of independent identical trials
where the space of values of ω_i is (Ω/ξ_i, ℱ_i, P_i), where P_i is the induced
distribution on the measurable space (Ω/ξ_i, ℱ_i).

Definition 4.5. Let η₁ = f₁(ω), …, η_n = f_n(ω) be a collection of n random
variables. The η₁, …, η_n are said to be jointly independent random variables
if for any Borel subsets C₁, …, C_n

P{η₁ = f₁(ω) ∈ C₁, …, η_n = f_n(ω) ∈ C_n} = ∏_{i=1}^n P(f_i(ω) ∈ C_i).

We now show the meaning of this definition when the random variables
η_i take a finite or countable number of values. Let {a_j^{(i)}} be the values of the
random variable η_i and B_j^{(i)} = {ω : η_i = a_j^{(i)}}. Then for any i the {B_j^{(i)}} form
a partition of the space Ω which generates a σ-algebra that we will denote
by ℱ_i, 1 ≤ i ≤ n. It follows from Definition 4.4 that

P{η₁ = a_{j₁}^{(1)}, …, η_n = a_{j_n}^{(n)}} = ∏_{i=1}^n P{η_i = a_{j_i}^{(i)}}.

The joint independence of random variables is equivalent to the joint
independence of the σ-algebras ℱ_i which correspond to them.

If the random variables η₁, …, η_n have probability densities p₁(x), …, p_n(x)
it then follows from Definition 4.4 that the joint density of (η₁, …, η_n) is the
product p₁(x₁) p₂(x₂) ⋯ p_n(x_n).

Definition 4.5 is of course a special case of Definition 4.4. To see this we take
as the σ-algebra ℱ_i the σ-algebra of the subsets of the form f_i^{−1}(C), where
C is a Borel subset.

Theorem 4.1. Let η₁ and η₂ be independent random variables such that Eη_i
exists, i = 1, 2. Then E(η₁·η₂) exists and E(η₁·η₂) = Eη₁·Eη₂.

We prove this theorem only in the case where η₁ and η₂ take no more than
a countable number of values. Let the values of η₁ be the numbers a₁, a₂, …,
and the values of η₂ be the numbers b₁, b₂, …. It follows from the finiteness
of the expectations Eη₁ and Eη₂ that

Σ_i |a_i| P{η₁ = a_i} < ∞, Σ_j |b_j| P{η₂ = b_j} < ∞.

The product η₁·η₂ takes the value a_i b_j on the set {ω : η₁ = a_i} ∩ {ω : η₂ = b_j}.
Thus, by independence,

E(η₁·η₂) = Σ_i Σ_j a_i b_j P{ω : η₁(ω) = a_i, η₂(ω) = b_j}
= Σ_i a_i P{η₁(ω) = a_i} · Σ_j b_j P{η₂(ω) = b_j} = Eη₁·Eη₂

by the fact that the double series converges absolutely.

The following theorem is proved in exactly the same way.

Theorem 4.2. Let η₁, η₂, …, η_n be jointly independent random variables such
that Eη_i exists for 1 ≤ i ≤ n. Then E(η₁·η₂⋯η_n) exists and E(η₁⋯η_n) =
Eη₁ ⋯ Eη_n.

Corollary. Let η₁, …, η_n be pairwise independent random variables such that
Var η_i < ∞ for 1 ≤ i ≤ n. Then Var(η₁ + ⋯ + η_n) < ∞ and Var(η₁ + ⋯ + η_n) =
Var η₁ + ⋯ + Var η_n.

Proof. In Lecture 1 we obtained the formula

Var(η₁ + ⋯ + η_n) = Σ_{i=1}^n Var η_i + 2 Σ_{i>j} Cov(η_i, η_j).

But by Theorem 4.1, Cov(η_i, η_j) = E(η_i η_j) − Eη_i·Eη_j = 0 for i ≠ j, which
proves the corollary. □
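Both Theorem 4.1 and the corollary are easy to see empirically. The sketch below (sample size and the choice of two independent dice are illustrative) checks that the sample mean of a product is close to the product of means, and that variances add.

```python
import random

random.seed(0)
n = 200_000
# Two independent fair dice eta_1, eta_2, sampled n times.
xs = [random.randint(1, 6) for _ in range(n)]
ys = [random.randint(1, 6) for _ in range(n)]

def mean(v):
    return sum(v) / len(v)

def var(v):
    return mean([t * t for t in v]) - mean(v) ** 2

# E(eta_1 * eta_2) vs E(eta_1) * E(eta_2); both should be near 3.5^2 = 12.25
print(mean([x * y for x, y in zip(xs, ys)]), mean(xs) * mean(ys))
# Var(eta_1 + eta_2) vs Var(eta_1) + Var(eta_2); both near 2 * 35/12
print(var([x + y for x, y in zip(xs, ys)]), var(xs) + var(ys))
```

For dependent variables (e.g. η₂ = η₁) the second identity visibly fails, which is a quick way to see why independence is needed.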

4.3 The Gambler's Ruin Problem


We consider here a typical problem which can be solved by using the total
probability formula. We assume that a series of games takes place in which
the player of interest to us wins each time with probability p and loses with
probability 1 − p, independently of the other games. We denote by x the fortune
of the gambler. We will assume that when the gambler wins his fortune is
increased by 1, i.e. x ↦ x + 1, and when the gambler loses his fortune is
decreased by 1, i.e. x ↦ x − 1. The game stops when the fortune of the
gambler becomes zero (gambler's ruin) or a number a > 0 (gambler's victory),
where a is the sum of the fortunes of the gamblers at the beginning of the
game. Let us denote by z the initial fortune of our gambler. The course of
the game can be represented conveniently by a graph consisting of straight
segments with angles of ±45°, similar to the trajectory of a random walk.
Each graph starts at the point z, 0 < z < a, and ends either at x = a (victory)
or at x = 0 (ruin). It is convenient to assume that after attaining x = 0 or
a the graph continues as a straight horizontal line. These graphs correspond
to the points ω of the space of elementary outcomes Ω. In order to indicate
the dependence on the initial point z we write ω_z and Ω_z. For z = 0 or a
we assume that Ω₀ or Ω_a contain only one element, consisting of a horizontal
line. The typical form of ω is given in Fig. 4.1.
Fig. 4.1.

For ω_z ∈ Ω_z we set p(ω_z) = p^k q^l where k (or l) is the number of segments
which move upwards (or downwards), i.e. with an angle of +45° (or −45°). This
definition does not imply that Σ_{ω_z ∈ Ω_z} p(ω_z) = 1. The violation of this last
equality can be interpreted as the fact that there is a positive probability for
an infinitely long game. Therefore we set

P_z = Σ_{ω_z ∈ Ω_z} p(ω_z), W_z = Σ′ p(ω_z), L_z = Σ″ p(ω_z), 0 < z < a,

where Σ′ (or Σ″) indicates that the summation is carried out over those ω_z
that end with a (or with 0), i.e. W_z is the probability of victory and L_z is the
probability of ruin with an initial fortune of z. For z = 0 or a it follows from
the definition that:

P₀ = 1, W₀ = 0, L₀ = 1;
P_a = 1, W_a = 1, L_a = 0.

Lemma 4.2. The following relations hold:

P_z = pP_{z+1} + qP_{z−1}, 0 < z < a;
W_z = pW_{z+1} + qW_{z−1}, 0 < z < a;
L_z = pL_{z+1} + qL_{z−1}, 0 < z < a.

Proof. We have

P_z = Σ⁽¹⁾ p(ω_z) + Σ⁽²⁾ p(ω_z),

where Σ⁽¹⁾ denotes the sum over those ω_z in which the first game was won
and Σ⁽²⁾ denotes the sum over those ω_z in which the first game was lost.
For ω_z appearing in Σ⁽¹⁾ we have p(ω_z) = p·p(ω_{z+1}), where ω_{z+1} is the graph which
begins at n = 1 with the point z + 1. Clearly any ω_{z+1} ∈ Ω_{z+1} is obtained
this way. Therefore Σ⁽¹⁾ p(ω_z) = p Σ_{ω_{z+1} ∈ Ω_{z+1}} p(ω_{z+1}) = p P_{z+1}. In the
same way it is clear that for ω_z appearing in Σ⁽²⁾ we have p(ω_z) = q·p(ω_{z−1}) and
therefore Σ⁽²⁾ p(ω_z) = q P_{z−1}. This completes the proof of the first relation.
The remaining relations are proved in the same way. □

In what follows we will need a general property of the solutions to the
homogeneous linear equation

R_z = pR_{z+1} + qR_{z−1}, 0 < z < a, (4.3)

subject to the boundary conditions R₀ = c₁, R_a = c₂.

Lemma 4.3. There exists at most one solution to (4.3) which satisfies the
given boundary conditions.

Proof. Assume that, to the contrary, there are two such solutions, R_z^{(1)}, R_z^{(2)},
and let R_z^{(3)} = R_z^{(1)} − R_z^{(2)}. Then R₀^{(3)} = R_a^{(3)} = 0 and

R_z^{(3)} = pR_{z+1}^{(3)} + qR_{z−1}^{(3)}, 0 < z < a.

Let us choose z₀ such that max{R_z^{(3)}, 0 < z < a} = R_{z₀}^{(3)} > 0. Since R_{z₀}^{(3)} ≥
R_{z₀+1}^{(3)} and R_{z₀}^{(3)} ≥ R_{z₀−1}^{(3)}, the equality

R_{z₀}^{(3)} = pR_{z₀+1}^{(3)} + qR_{z₀−1}^{(3)}

is possible if and only if R_{z₀}^{(3)} = R_{z₀+1}^{(3)} = R_{z₀−1}^{(3)}. In exactly the same way
we can go to z₀ ± 2 and so forth. In the end we obtain that R_z^{(3)} is constant
for 0 < z < a. But this constant can only be 0 since R₀^{(3)} = R_a^{(3)} = 0. An
analogous argument can be carried out in the case where min R_z^{(3)} < 0. In the
same way we obtain R_z^{(3)} ≡ 0 for 0 ≤ z ≤ a, i.e. R_z^{(1)} = R_z^{(2)}. □
The function P_z satisfies (4.3) and the boundary conditions P₀ = P_a = 1.
Since P_z ≡ 1 is a solution to (4.3) with such boundary conditions, it follows
from Lemma 4.3 that this solution is unique, i.e. P_z ≡ 1. We have also shown,
in this way, that the probability of an infinitely long game is equal to zero.

Given this fact we can obtain the relations of Lemma 4.2 by using the
total probability formula. Let A_z be the event consisting of all ω_z which lead
to victory, i.e. consisting of those graphs ω_z which end with x = a. Then
W_z = P(A_z). We denote by B₊ (or B₋) the event that the first game was
won (or lost). Clearly B₊ ∩ B₋ = ∅, B₊ ∪ B₋ = Ω_z. Furthermore P(B₊) = p,
since

P(B₊) = Σ⁽¹⁾ p(ω_z) = p·Σ_{ω_{z+1} ∈ Ω_{z+1}} p(ω_{z+1}) = p·P_{z+1} = p.

Similarly

P(B₋) = Σ⁽²⁾ p(ω_z) = q·Σ_{ω_{z−1} ∈ Ω_{z−1}} p(ω_{z−1}) = q.

By the total probability formula

W_z = P(B₊)P(A_z|B₊) + P(B₋)P(A_z|B₋) = pP(A_z|B₊) + qP(A_z|B₋).

From the definition of conditional probability we have

P(A_z|B₊) = P(A_z ∩ B₊)/P(B₊) = p·W_{z+1}/p = W_{z+1}.

Similarly P(A_z|B₋) = W_{z−1}. Therefore the relations of Lemma 4.2 are a
particular case of the total probability formula. □

We now find explicit formulas for W_z and L_z. The method is totally
analogous to the method of solving linear differential equations with constant
coefficients. We first seek particular solutions to (4.3) of the form R_z = λ^z.
Substituting into (4.3) gives λ = pλ² + q. The roots of the last equation have
the form λ₁ = 1, λ₂ = q/p.

First Case. p ≠ q. In this case λ₁ ≠ λ₂. We will look for W_z of the form

W_z = d₁λ₁^z + d₂λ₂^z = d₁ + d₂(q/p)^z,

where d₁ and d₂ will be found from the boundary conditions

W₀ = d₁ + d₂ = 0,
W_a = d₁ + d₂(q/p)^a = 1.

This gives d₁ = −d₂ = 1/(1 − (q/p)^a). We therefore obtain

W_z = (1 − (q/p)^z) / (1 − (q/p)^a).

L_z can now be found from the relation L_z = 1 − W_z, which holds since
P_z ≡ 1 = W_z + L_z.
Second Case. p = q = 1/2. In this case the equation has the form

W_z = (1/2)(W_{z+1} + W_{z−1}),

and also λ₁ = λ₂ = 1. Aside from the particular solution W_z = 1 we can check
immediately that W_z = z is also a solution. We therefore look for a solution
of the form

W_z = d₁ + d₂z.

Taking into account the boundary conditions we obtain W₀ = d₁ = 0 and
W_a = d₂a = 1, i.e. d₂ = 1/a. Thus

W_z = z/a, L_z = 1 − W_z = (a − z)/a.
An intuitive conclusion arises from our equations: the greater the initial
fortune, the greater the probability of winning.
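The explicit formulas for W_z are easy to confirm by simulation; the parameters z = 4, a = 10, p = 0.45 and the trial count below are illustrative choices.

```python
import random

def play(z, a, p, rng):
    """Play one game to the end; return True on victory (z reaches a)."""
    while 0 < z < a:
        z += 1 if rng.random() < p else -1
    return z == a

def victory_prob(z, a, p):
    """W_z from the formulas above: z/a if p = q, else (1-(q/p)^z)/(1-(q/p)^a)."""
    q = 1 - p
    if p == q:
        return z / a
    return (1 - (q / p) ** z) / (1 - (q / p) ** a)

rng = random.Random(0)
z, a, p, trials = 4, 10, 0.45, 50_000
emp = sum(play(z, a, p, rng) for _ in range(trials)) / trials
print(emp, victory_prob(z, a, p))  # empirical and exact values should agree
```

Varying z shows the monotonicity just mentioned: the larger the initial fortune, the larger both the simulated and the exact victory probability.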
We now introduce the random variable τ(ω_z), ω_z ∈ Ω_z, equal to the duration
of the game, i.e. equal to that n = n(ω_z) for which the graph first attains
0 or a.
Lemma 4.4. There exist constants q̄, 0 < q̄ < 1, and c, 0 < c < ∞, such that

P{τ(ω_z) ≥ n} ≤ c·q̄^n.
Proof. Let us denote by D_n the set of graphs f = {x = f(t), 0 ≤ t ≤ n}
which represent the course of the game from time 0 to n and for which the
game at time n is not finished yet. We denote by C_f the set of those ω_z ∈ Ω_z
for which the course of the game from time 0 to n agrees with the function f.
We then have

P_n = P{τ(ω_z) ≥ n} = Σ_{f ∈ D_n} Σ_{ω_z ∈ C_f} p(ω_z).

We first show that Σ_{ω_z ∈ C_f} p(ω_z) = p^k q^{n−k}, where k is the number of games
won between the times 0 and n, i.e. the number of segments in the graph f of
length √2 which move with an angle of +45° (see Fig. 4.2).

Fig. 4.2.

Indeed let f(n) = z₁. Then every graph ω_z ∈ C_f consists of two parts: the
first part describes the function f and the second part can be considered as
an element ω_{z₁} of the space Ω_{z₁}. Correspondingly p(ω_z) = p^k q^{n−k}·p(ω_{z₁}) and

Σ_{ω_z ∈ C_f} p(ω_z) = p^k q^{n−k} Σ_{ω_{z₁} ∈ Ω_{z₁}} p(ω_{z₁}) = p^k q^{n−k}.

We now establish the following inequality: for some q₁, 0 < q₁ < 1,

P_{n+a} ≤ q₁·P_n. (4.4)
Indeed if we denote by f⁽¹⁾ the graphs which correspond to n + a and by
f the graphs which correspond to n, then we have

P_{n+a} = Σ P(C_{f⁽¹⁾}) = Σ_f Σ_{f⁽¹⁾ ⊃ f} P(C_{f⁽¹⁾})
= Σ_f Σ_{f⁽¹⁾ ⊃ f} p^{k₁} q^{(n+a)−k₁} = Σ_f p^k q^{n−k} Σ_{f⁽¹⁾ ⊃ f} p^{k₁−k} q^{a−(k₁−k)}.

Here the notation Σ_{f⁽¹⁾ ⊃ f} denotes a summation with respect to the graphs
f⁽¹⁾ which are extensions of f. We now note that

Σ_{f⁽¹⁾ ⊃ f} p^{k₁−k} q^{a−(k₁−k)} ≤ q₁ < 1,

since in a steps the only possible extensions f⁽¹⁾ are those which do not lead
to the end of the game.

The desired statement follows from (4.4) since for any m, m = ka + r,
0 ≤ r < a,

P_m ≤ q₁·P_{m−a} ≤ q₁²·P_{m−2a} ≤ … ≤ q₁^k·P_r ≤ q₁^k ≤ q₁^{−1}·(q₁^{1/a})^m.

Setting c = q₁^{−1}, q̄ = q₁^{1/a} we obtain the statement of the lemma. □


It follows clearly from Lemma 4.4 that for any k > 0

Eτ^k(ω_z) = Σ τ^k(ω_z) p(ω_z) < ∞.
Lemma 4.5. Let E_z = Eτ(ω_z). Then E_z = pE_{z+1} + qE_{z−1} + 1, 0 < z < a, and E₀ = E_a = 0.

Proof. The last equalities follow from the definition. To prove the first equality
we write

E_z = Σ⁽¹⁾_{ω_{z+1} ∈ Ω_{z+1}} (τ(ω_{z+1}) + 1)·p·p(ω_{z+1})
+ Σ⁽²⁾_{ω_{z−1} ∈ Ω_{z−1}} (τ(ω_{z−1}) + 1)·q·p(ω_{z−1})
= pE_{z+1} + p + qE_{z−1} + q = pE_{z+1} + qE_{z−1} + 1.

Here we used the fact that in the sum Σ⁽¹⁾ (or Σ⁽²⁾) each ω_z can be represented
as a union of the first segment moving with an angle +45° (or −45°)
and of ω_{z+1} ∈ Ω_{z+1} (or ω_{z−1} ∈ Ω_{z−1}), and that each ω_{z+1} (or ω_{z−1}) appears
in such a representation. □
The statement of Lemma 4.3 also applies to E_z. Therefore if we find one
function E_z which satisfies the conditions in Lemma 4.5 that function will be
the one we are looking for.

First Case. p ≠ q. We note that the function E_z = z/(q − p) satisfies the
equation

E_z = pE_{z+1} + qE_{z−1} + 1, (4.5)

but does not satisfy the boundary conditions. Since every function of the
form R_z = d₁ + d₂(q/p)^z satisfies the equation R_z = pR_{z+1} + qR_{z−1}, any
function E_z = z/(q − p) + R_z satisfies (4.5). We choose d₁ and d₂ in order to
satisfy the boundary conditions

E₀ = d₁ + d₂ = 0, d₂ = −d₁,
E_a = a/(q − p) + d₁(1 − (q/p)^a) = 0, d₁ = a/((p − q)(1 − (q/p)^a)).

Finally we obtain

E_z = z/(q − p) + [a/((p − q)(1 − (q/p)^a))]·(1 − (q/p)^z).
Second Case. p = q = 1/2. The equation (4.5) then has the form

E_z = (1/2)(E_{z+1} + E_{z−1}) + 1. (4.6)

The function E_z = −z² satisfies (4.6). As in the first case we look for E_z of
the form

E_z = −z² + d₁ + d₂z.

Then

E₀ = d₁ = 0,
E_a = −a² + d₂a = 0, d₂ = a.

Therefore

E_z = −z² + az = z(a − z).

The function E_z attains its maximum value a²/4 for z = a/2, which is
natural in view of the symmetry p = q = 1/2. In addition this relation shows
that the duration of the game for z ≈ a/2 is of the order a²/4, i.e. of the order
of the square of the length of the interval [0, a]. We have already seen that
such a behavior is typical for random walks.
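The formula E_z = z(a − z) for the fair game can be confirmed by simulation; a = 10, z = 5 and the trial count below are illustrative choices.

```python
import random

def duration(z, a, rng):
    """Number of games tau until the fair walk started at z hits 0 or a."""
    n = 0
    while 0 < z < a:
        z += 1 if rng.random() < 0.5 else -1
        n += 1
    return n

rng = random.Random(0)
a, z, trials = 10, 5, 20_000
emp = sum(duration(z, a, rng) for _ in range(trials)) / trials
print(emp, z * (a - z))  # empirical mean duration vs z*(a-z) = 25
```

Doubling a roughly quadruples the mean duration at the midpoint, the square-length scaling noted in the text.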
Lecture 5. Markov Chains

5.1 Stochastic Matrices

The theory of Markov chains makes use of the theory of so-called stochastic
matrices. We therefore begin with a small digression of a purely algebraic
nature.

Definition 5.1. A matrix Q = ‖q_{ij}‖ of order r × r is said to be stochastic if

1. q_{ij} ≥ 0,
2. Σ_{j=1}^r q_{ij} = 1, for any i.

The conditions under which a matrix is stochastic can be expressed somewhat
differently. A column vector f = (f₁, …, f_r)′ is said to be non-negative
if f_i ≥ 0, 1 ≤ i ≤ r; in this case we write f ≥ 0.

Lemma 5.1. The following statements, a), b.1), b.2) and c), are equivalent:
a) the matrix Q is stochastic;
b.1) for any f ≥ 0 the vector Qf ≥ 0;
b.2) if 1 = (1, …, 1)′ then Q1 = 1, i.e. the vector 1 is an eigenvector of the
matrix Q corresponding to the eigenvalue 1;
c) if μ = (μ₁, …, μ_r) is a probability distribution, i.e. μ_i ≥ 0, Σ_{i=1}^r μ_i = 1,
then μ′ = μQ is also a probability distribution.

Proof. If Q is a stochastic matrix then b.1) and b.2) clearly hold, and therefore a) ⇒ b). We now show that b) ⇒ a). Consider the column vector δ_j = (0, ..., 0, 1, 0, ..., 0)' ≥ 0, with the 1 in position j. Then Qδ_j = (q_1j, q_2j, ..., q_rj)' ≥ 0, i.e. q_ij ≥ 0. Furthermore Q1 = (Σ_j q_1j, Σ_j q_2j, ..., Σ_j q_rj)', and it follows from the equality Q1 = 1 that Σ_j q_ij = 1 for all i; therefore b) ⇒ a).
We now show that a) ⇒ c). If μ' = μQ, then μ'_j = Σ_{i=1}^r μ_i q_ij. Since Q is stochastic we have μ'_j ≥ 0 and Σ_{j=1}^r μ'_j = Σ_j Σ_i μ_i q_ij = Σ_i μ_i Σ_j q_ij = Σ_i μ_i = 1, therefore μ' is also a probability distribution.
Now assume that c) holds. Consider the row vector δ_i = (0, ..., 0, 1, 0, ..., 0), with the 1 in position i, which corresponds to the probability distribution on the set {1, 2, ..., r} concentrated at the point i. Then δ_i Q = (q_i1, q_i2, ..., q_ir) is also a probability distribution. It follows that q_ij ≥ 0 and Σ_{j=1}^r q_ij = 1, i.e. c) ⇒ a). □
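Part c) is easy to illustrate numerically; the 3 × 3 matrix below is a hypothetical example, not from the text. Multiplying a probability row vector by a stochastic matrix yields a probability vector again.

```python
def row_times(mu, Q):
    """Row vector times matrix: (mu Q)_j = sum_i mu_i q_ij."""
    r = len(Q)
    return [sum(mu[i] * Q[i][j] for i in range(r)) for j in range(r)]

Q = [[0.70, 0.30, 0.00],
     [0.10, 0.80, 0.10],
     [0.25, 0.25, 0.50]]   # hypothetical stochastic matrix
mu = [0.2, 0.5, 0.3]       # a probability distribution
mu_new = row_times(mu, Q)
print(mu_new, sum(mu_new))  # components >= 0, total 1 (up to rounding)
```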

Lemma 5.2. If Q' = ‖q'_ij‖ and Q'' = ‖q''_ij‖ are stochastic matrices, then Q''' = Q'·Q'' is also a stochastic matrix. If all the q''_ij > 0, then all the q'''_ij > 0.

Proof. We have

q'''_ij = Σ_{k=1}^r q'_ik · q''_kj.

Therefore q'''_ij ≥ 0. If all q''_kj > 0, then q'''_ij > 0, since q'_ik ≥ 0 and Σ_{k=1}^r q'_ik = 1. Furthermore

Σ_{j=1}^r q'''_ij = Σ_{j=1}^r Σ_{k=1}^r q'_ik q''_kj = Σ_{k=1}^r q'_ik Σ_{j=1}^r q''_kj = Σ_{k=1}^r q'_ik = 1.

This completes the proof of the lemma. □
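Lemma 5.2 can be checked on small examples; the matrices below are hypothetical, not from the text. Since all entries of Q2 are positive, all entries of the product are positive as well, and each row of the product sums to 1.

```python
def mat_mul(A, B):
    """Product of two r x r matrices given as lists of rows."""
    r = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(r)) for j in range(r)]
            for i in range(r)]

Q1 = [[0.5, 0.5, 0.0],
      [0.2, 0.3, 0.5],
      [0.0, 0.0, 1.0]]     # hypothetical stochastic matrix (has zero entries)
Q2 = [[0.1, 0.6, 0.3],
      [0.4, 0.4, 0.2],
      [0.3, 0.3, 0.4]]     # hypothetical strictly positive stochastic matrix
Q3 = mat_mul(Q1, Q2)
print([round(sum(row), 12) for row in Q3])  # every row sums to 1
```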

5.2 Markov Chains

We now return to probability theory. Let Ω be a space of elementary outcomes ω = {ω_0, ω_1, ..., ω_n}, where ω_i ∈ X = {x^(1), x^(2), ..., x^(r)}, 0 ≤ i ≤ n. Moreover let us call ω_0 the state at the origin of time, and ω_i the state at time i. We assume, as given, a probability distribution μ = {μ_1, ..., μ_r} on X and n stochastic matrices P(1), ..., P(n), with P(k) = ‖p_ij(k)‖.

Definition 5.2. A Markov chain with state space X, generated by the initial distribution μ on X and the stochastic matrices P(1), ..., P(n), is the probability distribution P on Ω given by

p(ω) = μ_{ω_0} · p_{ω_0 ω_1}(1) · p_{ω_1 ω_2}(2) · ... · p_{ω_{n-1} ω_n}(n).

The points x^(j) ∈ X, 1 ≤ j ≤ r, are called the states of the Markov chain.

We now check that the last equation defines a probability distribution on Ω. The inequality p(ω) ≥ 0 is clear. It remains to be shown that Σ_{ω∈Ω} p(ω) = 1. We have

Σ = Σ_{ω∈Ω} p(ω) = Σ_{ω_0,...,ω_n} μ_{ω_0} · p_{ω_0 ω_1}(1) · ... · p_{ω_{n-1} ω_n}(n).

We now perform the summation as follows: fix the values ω_0, ω_1, ..., ω_{n-1} and sum over all values of ω_n. Thus

Σ = Σ_{ω_0,...,ω_{n-1}} μ_{ω_0} · p_{ω_0 ω_1}(1) · ... · p_{ω_{n-2} ω_{n-1}}(n-1) · Σ_{ω_n} p_{ω_{n-1} ω_n}(n).

The last sum is equal to 1 by virtue of the fact that P(n) is a stochastic matrix. The remaining sum has the same form as Σ, with n replaced by n-1. We then fix ω_0, ..., ω_{n-2} and sum over all values of ω_{n-1}, and so on. In the end we obtain Σ = Σ_{ω_0} μ_{ω_0} = 1 by the fact that μ is a probability distribution. It follows that Σ = 1.
By similar arguments one can prove the following statement:

P{ω_0 = x^(i_0), ω_1 = x^(i_1), ..., ω_k = x^(i_k)} = μ_{i_0} · p_{i_0 i_1}(1) · ... · p_{i_{k-1} i_k}(k)

for any x^(i_0), ..., x^(i_k), k ≤ n. This equality shows that the induced probability distribution on the space of (k+1)-tuples (ω_0, ..., ω_k) is also a Markov chain, generated by the initial distribution μ and the stochastic matrices P(1), ..., P(k).
The matrix entries p_ij(k) are called the transition probabilities from state x^(i) to state x^(j) at time k.
Assume that P{ω_0 = x^(i_0), ..., ω_{k-2} = x^(i_{k-2}), ω_{k-1} = x^(i)} > 0. We consider the conditional probability P{ω_k = x^(j) | ω_{k-1} = x^(i), ω_{k-2} = x^(i_{k-2}), ..., ω_0 = x^(i_0)}. By definition:

P{ω_k = x^(j) | ω_{k-1} = x^(i), ω_{k-2} = x^(i_{k-2}), ..., ω_0 = x^(i_0)}
= P{ω_0 = x^(i_0), ..., ω_{k-2} = x^(i_{k-2}), ω_{k-1} = x^(i), ω_k = x^(j)} / P{ω_0 = x^(i_0), ..., ω_{k-2} = x^(i_{k-2}), ω_{k-1} = x^(i)}
= [μ_{i_0} · p_{i_0 i_1}(1) · ... · p_{i_{k-2} i}(k-1) · p_ij(k)] / [μ_{i_0} · p_{i_0 i_1}(1) · ... · p_{i_{k-2} i}(k-1)]
= p_ij(k),

and this does not depend on the values i_0, i_1, ..., i_{k-2}. This property is sometimes used as the definition of a Markov chain.

Definition 5.3. A Markov chain is said to be homogeneous if P(k) = P does not depend on k, 1 ≤ k ≤ n.

Homogeneous Markov chains can be understood as a generalization of sequences of independent identical trials. Indeed, if the stochastic matrix P = ‖p_ij‖ is such that all its rows are equal to {p_1, ..., p_r}, where {p_1, ..., p_r} is a probability distribution on X, then a Markov chain with such a matrix P is a sequence of independent identical trials.
In what follows we consider only homogeneous Markov chains. Sometimes we will represent such chains with a graph. The vertices of the graph consist of the points x^(i) ∈ X. The points x^(i) and x^(j) are connected by an oriented edge if p_ij > 0. A collection of states ω_0, ω_1, ..., ω_k which has positive probability can be represented on the graph as a path of length k starting at the point ω_0; p(ω_0, ω_1, ..., ω_k) is the probability of such a path.
Therefore a homogeneous Markov chain can be represented as a probability distribution on the space of paths of length k on the graph, or as a sequence of jumps across the vertices of the graph.
We consider the conditional probabilities P{ω_{t+s} = x^(j) | ω_t = x^(i)}. It is of course assumed here that P{ω_t = x^(i)} > 0. We show that these conditional probabilities do not depend on t and that P{ω_{t+s} = x^(j) | ω_t = x^(i)} = p_ij^(s), where the p_ij^(s) are the elements of the matrix P^s. By Lemma 5.2 the matrix P^s is stochastic.
For s = 1, using the homogeneity of the chain, we have

P{ω_{t+1} = x^(j) | ω_t = x^(i)} = P{ω_{t+1} = x^(j), ω_t = x^(i)} / P{ω_t = x^(i)}
= Σ_{i_0,...,i_{t-1}} P{ω_0 = x^(i_0), ..., ω_{t-1} = x^(i_{t-1}), ω_t = x^(i), ω_{t+1} = x^(j)} / Σ_{i_0,...,i_{t-1}} P{ω_0 = x^(i_0), ..., ω_t = x^(i)}
= Σ_{i_0,...,i_{t-1}} μ_{i_0} p_{i_0 i_1} · ... · p_{i_{t-1} i} · p_ij / Σ_{i_0,...,i_{t-1}} μ_{i_0} p_{i_0 i_1} · ... · p_{i_{t-1} i}
= p_ij,

i.e. we have proved our statement for s = 1. Let us assume that the statement is proved for all s ≤ s_0 and let us prove it for s_0 + 1. We have

P{ω_{s_0+1+t} = x^(j) | ω_t = x^(i)} = Σ_{k=1}^r P{ω_{s_0+1+t} = x^(j) | ω_{s_0+t} = x^(k), ω_t = x^(i)} · P{ω_{s_0+t} = x^(k) | ω_t = x^(i)}.

By the induction hypothesis the last factor is equal to p_ik^(s_0). We show that the first factor is equal to p_kj. Indeed, writing this conditional probability as a ratio of sums over the intermediate states, exactly as in the case s = 1, every term of the numerator has the form

μ_{i_0} p_{i_0 i_1} · ... · p_{i_{t-1} i} · p_{i i_{t+1}} · ... · p_{i_{s_0+t-1} k} · p_kj,

so the common factor p_kj comes out of the sum and the remaining ratio equals 1. Therefore

P{ω_{s_0+1+t} = x^(j) | ω_{s_0+t} = x^(k), ω_t = x^(i)} = p_kj.

By the same token

P{ω_{s_0+1+t} = x^(j) | ω_t = x^(i)} = Σ_{k=1}^r p_ik^(s_0) · p_kj = p_ij^(s_0+1).

This completes the proof of our statement.


The transition probabilities p!;)
are called s-step transition probabilities.

Definition 5.4. The matrix P is said to be ergodic if there exists an So such


t h at Pi;
('0)
> 0 rlor any l.J.
..

By the statement we have just proved this means that in So steps one can,
with positive probability, proceed from any initial state Xli) to any final state
xU) .
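Note that ergodicity does not require P itself to be strictly positive; it is enough that some power of P is. A minimal sketch with a hypothetical 2 × 2 matrix (not from the text): here p_11 = 0, so s_0 = 1 fails, but P^2 is strictly positive and s_0 = 2 works.

```python
def mat_mul(A, B):
    """Product of two r x r matrices given as lists of rows."""
    r = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(r)) for j in range(r)]
            for i in range(r)]

P = [[0.0, 1.0],
     [0.5, 0.5]]           # hypothetical: one zero entry, so s_0 = 1 fails
P2 = mat_mul(P, P)
print(P2)                  # [[0.5, 0.5], [0.25, 0.75]] -- strictly positive
```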

5.3 Non-Ergodic Markov Chains

We now construct examples of non-ergodic Markov chains by giving their graphs.
In the disconnected graph the states split into two sets with no edges between them, so no power of P can have all entries positive. In the periodic graph presented in the figure the states are divided into three pairs, and at each step a transition occurs from one pair to the next. For any given s the transition probability p_ij^(s) ≠ 0 only for the two appropriate values of j. It is clear that periodic graphs are possible with any period.

Fig. 5.1. Disconnected graph

Fig. 5.2. Periodic graph

One can show that the examples we have given, and combinations of them, describe all examples of non-ergodic Markov chains.

Theorem 5.1 (Ergodic Theorem for Markov Chains). Assume as given a Markov chain with an ergodic transition probability matrix P. Then there exists a unique probability distribution π = (π_1, ..., π_r) such that

1. πP = π;
2. lim_{n→∞} p_ij^(n) = π_j for any i.

Proof. Let μ' = {μ'_1, ..., μ'_r} and μ'' = {μ''_1, ..., μ''_r} be two probability distributions on the space X. We set d(μ', μ'') = (1/2) Σ_{i=1}^r |μ'_i − μ''_i|. Then d can be considered as a distance on the space of probability distributions on X, and that space, equipped with this distance, is a complete metric space. We note now that

0 = Σ_{i=1}^r μ'_i − Σ_{i=1}^r μ''_i = Σ_{i=1}^r (μ'_i − μ''_i) = Σ⁺ (μ'_i − μ''_i) − Σ⁺ (μ''_i − μ'_i),

where Σ⁺ from now on will denote summation with respect to those indices i for which the terms are positive. Therefore

d(μ', μ'') = (1/2) Σ_{i=1}^r |μ'_i − μ''_i| = (1/2) Σ⁺ (μ'_i − μ''_i) + (1/2) Σ⁺ (μ''_i − μ'_i) = Σ⁺ (μ'_i − μ''_i).

We will soon use this formula. It is also clear that d(μ', μ'') ≤ 1.
Let μ' and μ'' be two probability distributions on X. By Lemma 5.1, μ'Q and μ''Q are also probability distributions on X, for any stochastic matrix Q.
Let #" and #''' be two probability distributions on X. By Lemma 5.1 #,'Q
and #,"Q are also probability distributions on X, for any stochastic matrix Q.

Lemma 5.3.
a) d(μ'Q, μ''Q) ≤ d(μ', μ'');
b) if all q_ij ≥ a then d(μ'Q, μ''Q) ≤ (1 − a)·d(μ', μ'').

Proof. We have

d(μ'Q, μ''Q) = Σ⁺_j (Σ_{i=1}^r μ'_i q_ij − Σ_{i=1}^r μ''_i q_ij) = Σ⁺_j Σ_{i=1}^r (μ'_i − μ''_i) q_ij.

Furthermore

Σ⁺_j Σ_i (μ'_i − μ''_i) q_ij ≤ Σ⁺_j Σ⁺_i (μ'_i − μ''_i) q_ij = Σ⁺_i (μ'_i − μ''_i) Σ⁺_j q_ij.

The sum Σ⁺_j q_ij ≤ Σ_j q_ij = 1 and this completes the proof of a). We now note that the sum Σ⁺_j cannot be a sum over all indices j. Indeed, if Σ_{i=1}^r μ'_i q_ij > Σ_{i=1}^r μ''_i q_ij for all j, then

Σ_{j=1}^r Σ_{i=1}^r μ'_i q_ij > Σ_{j=1}^r Σ_{i=1}^r μ''_i q_ij,

which is impossible, since each of the sums is equal to 1. This latter fact is easily seen by interchanging the order of summation. Therefore at least one index j is missing in the sum Σ⁺_j q_ij. Thus if all q_ij ≥ a then Σ⁺_j q_ij ≤ 1 − a and d(μ'Q, μ''Q) ≤ (1 − a) Σ⁺_i (μ'_i − μ''_i) = (1 − a) d(μ', μ''). □

Let μ_0 be an arbitrary probability distribution on X and let μ_n = μ_0 P^n. We show that the sequence of probability distributions μ_n is a Cauchy sequence. This means that for any ε > 0 there exists an n_0(ε) such that for any p we have d(μ_n, μ_{n+p}) < ε for n ≥ n_0(ε). Since P is ergodic, all entries of P^{s_0} are at least some a > 0, and by Lemma 5.3 we have

d(μ_n, μ_{n+p}) = d(μ_0 P^n, μ_0 P^{n+p})
≤ (1 − a) d(μ_0 P^{n−s_0}, μ_0 P^{n+p−s_0})
≤ (1 − a)^2 d(μ_0 P^{n−2s_0}, μ_0 P^{n+p−2s_0}) ≤ ...
≤ (1 − a)^m d(μ_0 P^{n−m s_0}, μ_0 P^{n+p−m s_0})
≤ (1 − a)^m,

where n = m s_0 + s_1, 0 ≤ s_1 < s_0. For sufficiently large n we have (1 − a)^m ≤ ε, which completes the proof.
So the sequence μ_n = μ_0 P^n is a Cauchy sequence. Let us set π = lim_{n→∞} μ_n. We have πP = lim_{n→∞} μ_n P = lim_{n→∞} (μ_0 P^n) P = lim_{n→∞} (μ_0 P^{n+1}) = π, i.e. πP = π. We now show that a vector π with such properties is unique. Let there be two vectors π_1 = π_1 P and π_2 = π_2 P. Then π_1 = π_1 P^{s_0}, π_2 = π_2 P^{s_0}. Therefore d(π_1, π_2) = d(π_1 P^{s_0}, π_2 P^{s_0}) ≤ (1 − a) d(π_1, π_2) by the lemma. It follows that d(π_1, π_2) = 0, i.e. π_1 = π_2.
We have thus obtained that for any initial distribution μ_0 the limit lim_{n→∞} μ_0 P^n = π exists and does not depend on the choice of μ_0. Let us choose for μ_0 the probability distribution concentrated at the point x^(i), i.e.

μ_0 = (0, ..., 0, 1, 0, ..., 0),

with the 1 in position i. Then μ_0 P^n is the probability distribution {p_ij^(n)}. Therefore lim_{n→∞} p_ij^(n) = π_j.
This completes the proof of the theorem. □
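The theorem is easy to watch numerically. In the sketch below (hypothetical 2 × 2 matrix, not from the text), iterating μ → μP from two different initial distributions drives both to the same stationary vector, here π = (2/3, 1/3).

```python
def step(mu, P):
    """One application of the stochastic matrix: mu -> mu P."""
    r = len(P)
    return [sum(mu[i] * P[i][j] for i in range(r)) for j in range(r)]

P = [[0.9, 0.1],
     [0.2, 0.8]]           # hypothetical ergodic matrix
mu1, mu2 = [1.0, 0.0], [0.0, 1.0]   # two extreme initial distributions
for _ in range(200):
    mu1, mu2 = step(mu1, P), step(mu2, P)
print([round(x, 6) for x in mu1])   # both converge to pi = (2/3, 1/3)
```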

Remark. Let μ_0 be concentrated at the point x^(i). Then d(μ_0 P^n, π) = d(μ_0 P^n, π P^n) ≤ ... ≤ (1 − a)^m d(μ_0 P^{n−m s_0}, π P^{n−m s_0}) ≤ (1 − a)^m. In this relation we can take m ≥ (n/s_0) − 1. Therefore

d(μ_0 P^n, π) ≤ (1 − a)^{n/s_0 − 1} = (1 − a)^{−1} ((1 − a)^{1/s_0})^n = (1 − a)^{−1} β^n,  β = (1 − a)^{1/s_0} < 1.

In other words, the speed of convergence of p_ij^(n) to the limit π_j is exponential.

Remark. The term "ergodicity" is borrowed from statistical mechanics. In our case it implies that in Markov chains a certain "forgetting" of the initial conditions occurs: the probability distribution of the states of the system at time n becomes, as n → ∞, independent of the initial distribution.

Definition 5.5. A probability distribution π for which π = πP is said to be a stationary distribution of the Markov chain.

We now elucidate the probabilistic meaning of the probabilities π_i.

5.4 The Law of Large Numbers and the Entropy of a Markov Chain

As in the case of a sequence of independent identical trials we introduce the random variable ν^(i)(ω) equal to the number of occurrences of the state x^(i) in the sequence ω = {ω_0, ..., ω_n}, i.e. the number of those k for which ω_k = x^(i). We also introduce the random variables ν^(ij)(ω), equal to the number of those k > 0 such that ω_{k−1} = x^(i), ω_k = x^(j).

Theorem 5.2. As n → ∞, for any ε > 0 we have

P{ |ν^(i)(ω)/n − π_i| ≥ ε } → 0  for 1 ≤ i ≤ r,
P{ |ν^(ij)(ω)/n − π_i p_ij| ≥ ε } → 0  for 1 ≤ i, j ≤ r.

Proof. As previously, we set

χ_k^(i)(ω) = 1 if ω_k = x^(i), and 0 if ω_k ≠ x^(i);
χ_k^(ij)(ω) = 1 if ω_{k−1} = x^(i), ω_k = x^(j), and 0 otherwise.

Then

ν^(i)(ω) = Σ_{k=0}^n χ_k^(i)(ω),  ν^(ij)(ω) = Σ_{k=1}^n χ_k^(ij)(ω).

For an initial distribution {μ_m} we have

E χ_k^(i)(ω) = Σ_{m=1}^r μ_m p_mi^(k).

As k → ∞, p_mi^(k) → π_i. Therefore as k → ∞

E χ_k^(i)(ω) → π_i,  E χ_k^(ij)(ω) → π_i p_ij,

exponentially fast. Consequently

E[(1/n) Σ_{k=1}^n χ_k^(i)(ω)] → π_i,  E[(1/n) Σ_{k=1}^n χ_k^(ij)(ω)] → π_i p_ij.

Therefore for sufficiently large n we have

{ω : |ν^(i)(ω)/n − π_i| ≥ ε} ⊆ {ω : |ν^(i)(ω)/n − (1/n) E ν^(i)| ≥ ε/2},
{ω : |ν^(ij)(ω)/n − π_i p_ij| ≥ ε} ⊆ {ω : |ν^(ij)(ω)/n − (1/n) E ν^(ij)| ≥ ε/2}.

The probabilities of the events on the right-hand side can be estimated by Chebyshev's inequality:

P{ω : |ν^(ij)(ω)/n − (1/n) E ν^(ij)| ≥ ε/2} = P{ω : |ν^(ij)(ω) − E ν^(ij)| ≥ εn/2} ≤ 4 Var ν^(ij) / (ε^2 n^2),

and similarly for ν^(i).

So the matter is reduced to the estimation of Var ν^(i) and Var ν^(ij). If we set m_k^(i) = E χ_k^(i) = Σ_{m=1}^r μ_m p_mi^(k), we have

Var ν^(i) = E (Σ_{k=1}^n (χ_k^(i) − m_k^(i)))^2
= Σ_{k=1}^n E (χ_k^(i) − m_k^(i))^2 + 2 Σ_{k_1 < k_2} E (χ_{k_1}^(i) − m_{k_1}^(i))(χ_{k_2}^(i) − m_{k_2}^(i)).

Since 0 ≤ χ_k^(i) ≤ 1 we have −1 ≤ χ_k^(i) − m_k^(i) ≤ 1, (χ_k^(i) − m_k^(i))^2 ≤ 1, and Σ_{k=1}^n E (χ_k^(i) − m_k^(i))^2 ≤ n. Furthermore

E (χ_{k_1}^(i) − m_{k_1}^(i))(χ_{k_2}^(i) − m_{k_2}^(i)) = E χ_{k_1}^(i) χ_{k_2}^(i) − m_{k_1}^(i) m_{k_2}^(i)
= Σ_{s=1}^r μ_s p_si^(k_1) p_ii^(k_2 − k_1) − m_{k_1}^(i) m_{k_2}^(i) = R_{k_1,k_2}.

By the ergodic theorem

p_mi^(k) = π_i + δ_mi^(k),  |δ_mi^(k)| ≤ c λ^k

for some constants c < ∞ and λ < 1. Therefore |R_{k_1,k_2}| ≤ const · (λ^{k_1} + λ^{k_2 − k_1}).
So Σ_{k_1 < k_2} R_{k_1,k_2} ≤ const · n and consequently Var ν^(i) ≤ const · n. The variance Var ν^(ij) is estimated in an analogous way. □

We now draw a conclusion from this theorem on the entropy of a Markov chain. In the case of a sequence of independent identical trials we have seen that the entropy is equal to the limit as n → ∞ of −(1/n) ln p(ω) for typical ω's, i.e. for ω's which lie in a set with a large probability. In order to use this property to derive a general definition of entropy we need to study the behavior of ln p(ω) for typical ω's in the case of a Markov chain. We have

p(ω) = μ_{ω_0} · Π_{i,j} p_ij^{ν^(ij)(ω)},
ln p(ω) = ln μ_{ω_0} + Σ_{i,j} ν^(ij)(ω) ln p_ij.

From the law of large numbers, for typical ω's we have ν^(ij)(ω)/n ≈ π_i p_ij. Therefore for such ω's we have

−(1/n) ln p(ω) = −(1/n) ln μ_{ω_0} − Σ_{i,j} (ν^(ij)(ω)/n) ln p_ij ≈ −Σ_{i,j} π_i p_ij ln p_ij.

So it is natural in the case of a Markov chain to define the entropy to be equal to h = −Σ_i π_i Σ_j p_ij ln p_ij. It is not difficult to show that for such a definition of h McMillan's Theorem remains true.
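The quantity h = −Σ_i π_i Σ_j p_ij ln p_ij is immediate to compute. In the sketch below (hypothetical example matrices, not from the text), a matrix whose rows are all equal reduces, as in the remark on independent trials above, to the usual one-trial entropy −Σ_j p_j ln p_j.

```python
import math

def entropy(P, pi):
    """h = -sum_i pi_i sum_j p_ij ln p_ij (terms with p_ij = 0 contribute 0)."""
    return -sum(pi[i] * sum(p * math.log(p) for p in P[i] if p > 0)
                for i in range(len(P)))

# All rows equal: independent fair coin flips, so h = ln 2.
h_coin = entropy([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5])
# A genuinely Markov example whose stationary vector is (2/3, 1/3).
h_markov = entropy([[0.9, 0.1], [0.2, 0.8]], [2 / 3, 1 / 3])
print(round(h_coin, 6), round(h_markov, 6))  # h_coin = ln 2
```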
To conclude we prove one more simple property of the vector π which explains the meaning of the term "stationary distribution". Assume that the initial distribution μ for the state ω_0 agrees with π. We show that P(ω_k = x^(j)) = π_j for any k, 0 ≤ k ≤ n. For k = 0 this property follows from the definition. Let us assume that the property is proved for k < k_0. By the total probability formula we have

P(ω_{k_0} = x^(j)) = Σ_{i=1}^r P(ω_{k_0 − 1} = x^(i)) P(ω_{k_0} = x^(j) | ω_{k_0 − 1} = x^(i)) = Σ_{i=1}^r π_i p_ij = π_j

by the equality πP = π. The statement we have proved means that if the initial distribution is π then the probability distribution of any ω_k is given by that very vector π and does not depend on k. Hence the word "stationary".
Markov chains first appeared in the work of the Russian mathematician A. A. Markov. Later on they figured significantly in the interesting and deep theory of Markov processes. The main results here were obtained by A. N. Kolmogorov, A. Ya. Khinchin, Doeblin, K. L. Chung, and others.

5.5 Application to Products of Positive Matrices

Let A = ‖a_ij‖ be a matrix with strictly positive entries, 1 ≤ i, j ≤ r. Then A^n = ‖a_ij^(n)‖ where

a_ij^(n) = Σ_{i_1,...,i_{n-1}} a_{i i_1} a_{i_1 i_2} · ... · a_{i_{n-1} j}.

We shall use the ergodic theorem for Markov chains in order to study the asymptotic behavior of a_ij^(n) as n → ∞.

Lemma 5.4. There exist a positive number λ and column vectors e = (e_1, ..., e_r)', e* = (e*_1, ..., e*_r)' such that

1. e_j > 0, e*_j > 0, 1 ≤ j ≤ r;
2. Σ_{j=1}^r a_ij e_j = λ e_i, 1 ≤ i ≤ r, and Σ_{i=1}^r e*_i a_ij = λ e*_j, 1 ≤ j ≤ r.

Proof. Let us show that for any matrix A with positive entries there exist at least one vector e with positive coordinates and a λ > 0 for which

Σ_{j=1}^r a_ij e_j = λ e_i,  1 ≤ i ≤ r.

Consider the convex set H of vectors h = (h_1, ..., h_r) such that h_i ≥ 0, 1 ≤ i ≤ r, and Σ_{i=1}^r h_i = 1. The matrix A determines a continuous transformation 𝒜 of H into itself through the formula 𝒜h = h', where

h'_i = Σ_{j=1}^r a_ij h_j / Σ_{i=1}^r Σ_{j=1}^r a_ij h_j.

The Brouwer theorem states that any such map has at least one fixed point. If e is this fixed point then 𝒜e = e, i.e.

e_i = Σ_{j=1}^r a_ij e_j / Σ_{i=1}^r Σ_{j=1}^r a_ij e_j.

Letting λ = Σ_{i=1}^r Σ_{j=1}^r a_ij e_j we get the desired result.
Consider the matrix A* = ‖a*_ij‖, a*_ij = a_ji. Then by the first part of Lemma 5.4 one can find λ* and e* such that A* e* = λ* e*, i.e. Σ_{j=1}^r a_ji e*_j = λ* e*_i, and e*_i > 0. The equalities

λ (e, e*) = (Ae, e*) = (e, A* e*) = λ* (e, e*)

show that λ = λ*. □
Using Lemma 5.4, put

p_ij = a_ij e_j / (λ e_i).

It is easy to see that the matrix P is a stochastic matrix with strictly positive entries. The stationary distribution for this matrix is π_i = e_i e*_i, provided that e and e* are normalized in such a way that Σ π_i = Σ e_i e*_i = 1. Indeed

Σ_{i=1}^r π_i p_ij = Σ_{i=1}^r e_i e*_i · a_ij e_j / (λ e_i) = (e_j/λ) Σ_{i=1}^r e*_i a_ij = (e_j/λ) · λ e*_j = e_j e*_j = π_j.

We can rewrite a_ij^(n) as follows:

a_ij^(n) = Σ a_{i i_1} a_{i_1 i_2} · ... · a_{i_{n-1} j} = λ^n (e_i / e_j) Σ p_{i i_1} p_{i_1 i_2} · ... · p_{i_{n-1} j} = λ^n (e_i / e_j) p_ij^(n).

The ergodic theorem for stochastic matrices gives p_ij^(n) → π_j = e_j e*_j as n → ∞. Thus

a_ij^(n) / λ^n → (e_i / e_j) · e_j e*_j = e_i e*_j,

and the convergence in the last expression is exponentially fast. Thus

a_ij^(n) = λ^n (e_i e*_j + o(1)).   (5.1)
Now we can show that the vectors e, e* and the number λ > 0 in Lemma 5.4 are unique. From the last expression,

lim_{n→∞} (1/n) ln a_ij^(n) = ln λ.

Thus λ is uniquely determined by A.
Suppose that there exist two vectors (e*)' and (e*)'' with the needed properties. Choose an arbitrary vector e from Lemma 5.4. Then the existence of the two vectors (e*)' and (e*)'' implies that the matrix P has two stationary distributions, which is impossible. Thus (e*)' = (e*)''. Replacing A by A* we get the same result for the vector e.
Relation (5.1), together with the statements about the uniqueness of e and e*, gives a complete description of the behavior of a_ij^(n) as n → ∞.
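Numerically, λ and e can be found by power iteration, which is relation (5.1) in action: A^n v grows like λ^n, so normalizing at each step recovers λ and the direction of e. A sketch for a hypothetical positive 2 × 2 matrix (not from the text) whose Perron eigenvalue is (5 + √5)/2:

```python
def mat_vec(A, v):
    """Matrix-vector product A v."""
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]

A = [[2.0, 1.0],
     [1.0, 3.0]]          # hypothetical matrix with positive entries
v = [1.0, 1.0]            # any positive starting vector works
lam = 0.0
for _ in range(100):
    w = mat_vec(A, v)
    lam = max(w)          # normalization constant; converges to lambda
    v = [x / lam for x in w]
print(round(lam, 6))      # Perron eigenvalue, approximately (5 + sqrt(5))/2
```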
Lecture 6. Random Walks on the Lattice ℤ^d

The topic of random walks is one of the most important and most developed of those studied in probability theory. Problems from many applications of probability theory are connected to random walks; in fact we have already encountered one of them in the gambler's ruin problem. The theory of Markov chains can be viewed as a theory of random walks on graphs.
In this lecture we study random walks on the lattice ℤ^d. The lattice ℤ^d is understood to be the collection of points z = (z_1, ..., z_d) where the z_i, 1 ≤ i ≤ d, are integers, −∞ < z_i < ∞. A random walk on ℤ^d is a Markov chain whose state space is X = ℤ^d. By the same token we have here an example of a Markov chain with a countable state space. Following the definitions of the previous lecture we construct the graph of the Markov chain, i.e. from each point z ∈ ℤ^d we find those y ∈ ℤ^d for which the probability p_zy of the transition z → y is positive. These transition probabilities must satisfy the relation Σ_y p_zy = 1. As in the case of finite Markov chains the probability distribution of the Markov chain is uniquely determined by the initial distribution μ = {μ_z, z ∈ ℤ^d} and the collection of transition probabilities P = ‖p_zy‖, which now form an infinite stochastic matrix, sometimes called the stochastic operator of the random walk. Often the initial distribution will be taken to be the distribution concentrated at one point, i.e. δ^(z_0) = {δ_{z−z_0}, z ∈ ℤ^d}, where δ_z = 1 for z = 0 and δ_z = 0 for z ≠ 0. This corresponds to considering trajectories of the random walk which begin at the point z_0.

Definition 6.1. A random walk is said to be spatially homogeneous if p_zy = p_{y−z}, where p = {p_z, z ∈ ℤ^d} is a probability distribution on the lattice ℤ^d.

In the sequel, without mentioning it explicitly, we will only consider random walks which are spatially homogeneous. The theory of such random walks is closely connected to sequences of independent identical trials. Indeed, let ω = {ω_0, ω_1, ..., ω_n} be an elementary outcome for the random walk; then p(ω) = μ_{ω_0} p_{ω_0 ω_1} · ... · p_{ω_{n-1} ω_n} = μ_{ω_0} p_{ω_1 − ω_0} · ... · p_{ω_n − ω_{n-1}}. We now find the probability distribution for the increments ω'_1 = ω_1 − ω_0, ..., ω'_n = ω_n − ω_{n-1}. It can be obtained from the previous expression by a summation over ω_0; we simply get p_{ω'_1} · ... · p_{ω'_n}. We arrive at the expression for the probability of a sequence of independent identical trials p_{ω'_1} p_{ω'_2} · ... · p_{ω'_n}. We will make substantial use of this property.
One of the main concepts in the theory of random walks is that of recurrence. Let us take μ = δ^(0), i.e. let us consider those random walks which begin at the point 0. Let ω = {ω_0, ω_1, ..., ω_k} be a trajectory of the random walk, i.e. ω_0 = 0, p_{ω_i − ω_{i-1}} > 0. We assume that ω_i ≠ 0 for 1 ≤ i ≤ k − 1 and that ω_k = 0. In the case of such an ω we can say that the trajectory of the random walk returns to the initial point for the first time at the k-th step. The set of such ω will be denoted by Ω(k). We set, as usual, for ω ∈ Ω(k), p(ω) = p_{ω_1 − ω_0} · ... · p_{ω_k − ω_{k-1}}, and f_k = Σ_{ω∈Ω(k)} p(ω), k > 0. It is also convenient to assume that f_0 = 0.

Definition 6.2. A random walk is said to be recurrent if Σ_{k=1}^∞ f_k = 1. If this sum is less than one the random walk is said to be transient.

The definition of recurrence means that the probability of those trajectories which return to the initial point is equal to 1. We introduce here a general criterion for the recurrence of a random walk. Let Ω̄(k) consist of those ω = {ω_0, ω_1, ..., ω_k} such that ω_0 = ω_k = 0. In this notation it is possible that ω_i = 0 for some i, 1 ≤ i ≤ k − 1. Consequently Ω(k) ⊆ Ω̄(k). We set, as previously, p(ω) = Π_{i=1}^k p_{ω_i − ω_{i-1}} for ω ∈ Ω̄(k), and P_k = Σ_{ω∈Ω̄(k)} p(ω). It is also convenient to assume that P_0 = 1.

Lemma 6.1 (Criterion for Recurrence). A random walk is recurrent if and only if Σ_{k≥0} P_k = ∞.

Proof. We now introduce an important formula which relates {f_k} and {P_k}:

P_k = f_k + f_{k-1} P_1 + ... + f_0 P_k.   (6.1)

We have Ω̄(k) = ∪_{i=1}^k C_i, where C_i = {ω | ω ∈ Ω̄(k), ω_i = 0 and ω_j ≠ 0 for 1 ≤ j < i}. Since the C_i are pairwise disjoint,

P_k = Σ_{i=1}^k Σ_{ω∈C_i} p(ω).

We note furthermore that

Σ_{ω∈C_i} p(ω) = Σ_{ω∈Ω(i)} p_{ω_1 − ω_0} · ... · p_{ω_i − ω_{i-1}} · Σ_{ω∈Ω̄(k−i)} p_{ω_{i+1} − ω_i} · ... · p_{ω_k − ω_{k-1}} = f_i · P_{k−i}.

By the same token, if we recall that f_0 = 0 and P_0 = 1, we have

P_k = Σ_{i=0}^k f_i · P_{k−i},  k ≥ 1,  P_0 = 1.   (6.2)

This completes the proof of (6.1).


We now need to make a diversion and talk about generating functions.
Let {a.. } be an arbitrary bounded sequence, i.e. Ia.. I ~ const. The generating
function of the sequence {a,.} is the power series A(z) = 2: .. >0 a.. z" which
is an analytic function of the complex variable z in the domaIn Izl < 1. An
essential fact for us is that A(z) uniquely determines the sequence {a.. } since
a.. = ~I A(") (0).
Returning to our random walk, we introduce the following generating func-
tions:

Let us multiply the left and right sides of (6.2) by z'< and sum with respect to
k from 0 to 00. We then obtain on the left P(z) and, on the right, as is easily
seen, we obtain 1 + P(z) . F(z), i.e.

P(z) = 1 + P(z) . F(z).


So (6.1) and (6.2) imply that the generating functions are related by a very
simple equation: F(z) = 1 - 1/ P(z). We now note that by Abel's Theorem

t
,.=1
fIe = F(I) = lim F(z) = 1 -
• -+ 1
lim p(1 ) .
.-1 Z

In the latter equalities, and below, z -+ 1 from the left on the real axis.
We first assume that 2::=1 PIc < 00. Then, by Abel's Theorem,

= P(I) = E PIc < 00


00

lim P(z)
..... 1
,.=0
and lim..... dl/P(z)) = 1/2::=0 PIc > O. So E:=o fIe < 1, i.e. the random
walk is transient.
A slightly more complicated argument is used when 2::=0 PIc = 00. We
show that in this case lim..... 1 1/ P(z) = O. Let us fix E > O. We find N =
N(E) such that 2::=0 PIc ~ 2/E. Then for z sufficiently close to 1 we have
2::=0 p,.z'< ~ I/E. Consequently, for such z
1 1
-- < <E.
E:=o p,.z" -
P(z) -

This means that lim..... 1 1/ P(z) = 1/ E:= 0 PIc = O. This completes the proof
of the criterion. 0
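Relation (6.2) also gives a practical way to compute the first-return probabilities f_k from the return probabilities P_k. For the elementary one-dimensional walk P_k = C(k, k/2)/2^k for even k (and 0 for odd k), and the sketch below (not from the text) inverts the convolution; the first values f_2 = 1/2 and f_4 = 1/8 agree with the known first-return distribution.

```python
from math import comb

K = 20
# Exact return probabilities for the elementary 1-dimensional walk.
P = [comb(k, k // 2) / 2 ** k if k % 2 == 0 else 0.0 for k in range(K + 1)]
# Invert (6.2): f_k = P_k - sum_{i=1}^{k-1} f_i P_{k-i}, with f_0 = 0, P_0 = 1.
f = [0.0] * (K + 1)
for k in range(1, K + 1):
    f[k] = P[k] - sum(f[i] * P[k - i] for i in range(1, k))
print(f[2], f[4])  # 0.5 0.125
```

The partial sums of the f_k approach 1 as K grows, in agreement with recurrence in dimension one.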

The advantage of the criterion is that it reduces the question of whether a random walk is recurrent to a concrete problem connected with a sequence of independent identical trials. We now consider some applications of this criterion. Let us introduce the quantity

E = Σ_{z∈ℤ^d} z·p_z.

Clearly E is a d-dimensional vector equal to the mathematical expectation, i.e. the mean value, of the one-step displacements.

Theorem 6.1. Let E ≠ 0 and Σ_{z∈ℤ^d} ‖z‖^4 p_z < ∞. Then the random walk is transient.

Proof. Intuitively the statement should be clear, since it follows from the law of large numbers that ω_n/n should be close to E, and therefore the probability that ω_n = 0 should be small. We now support this fact by an accurate proof.
The probability P_n is the probability that Σ_{k=1}^n ω'_k = 0 for the increments ω'_k. Clearly

P_n ≤ P{ ‖Σ_{k=1}^n ω'_k − nE‖^4 ≥ n^4 ‖E‖^4 }.

Here the probability P is calculated by using the probability distribution which corresponds to the sequence of independent identical trials. By Chebyshev's inequality we have

P{ ‖Σ_{k=1}^n ω'_k − nE‖^4 ≥ n^4 ‖E‖^4 } ≤ E‖Σ_{k=1}^n ω'_k − nE‖^4 / (‖E‖^4 n^4).

We now show that E‖Σ_{k=1}^n ω'_k − nE‖^4 ≤ const·n^2; this implies the inequality P_n ≤ const/n^2, which gives transience. We have

‖Σ_{k=1}^n ω'_k − nE‖^2 = ‖Σ_{k=1}^n (ω'_k − E)‖^2 = Σ_{k_1,k_2=1}^n (ω'_{k_1} − E, ω'_{k_2} − E)

and

‖Σ_{k=1}^n ω'_k − nE‖^4 = Σ_{k_1,k_2,l_1,l_2=1}^n (ω'_{k_1} − E, ω'_{k_2} − E)(ω'_{l_1} − E, ω'_{l_2} − E).

We show that E(ω'_{k_1} − E, ω'_{k_2} − E)(ω'_{l_1} − E, ω'_{l_2} − E) can be different from zero only in the case when no more than two indices among k_1, k_2, l_1 and l_2 are distinct. First we derive from this fact the inequality that we need. The number of such terms is no greater than const·n^2. By the Cauchy-Bunyakovsky inequality we have

|E(ω'_{k_1} − E, ω'_{k_2} − E)(ω'_{l_1} − E, ω'_{l_2} − E)|
≤ E |(ω'_{k_1} − E, ω'_{k_2} − E)| · |(ω'_{l_1} − E, ω'_{l_2} − E)|
≤ E ‖ω'_{k_1} − E‖ · ‖ω'_{k_2} − E‖ · ‖ω'_{l_1} − E‖ · ‖ω'_{l_2} − E‖
= E ‖ω'_k − E‖^2 · ‖ω'_l − E‖^2,
where k and l are the corresponding indices. If k = l we have E‖ω'_k − E‖^4 = Σ_{z∈ℤ^d} ‖z − E‖^4 p_z < ∞, which is easily verified from the conditions of the theorem. If k ≠ l then, as we have seen when studying sequences of independent identical trials, the expectation factorizes:

E ‖ω'_k − E‖^2 ‖ω'_l − E‖^2 = Σ_{ω'_k, ω'_l} p_{ω'_k} p_{ω'_l} ‖ω'_k − E‖^2 ‖ω'_l − E‖^2 = E‖ω'_k − E‖^2 · E‖ω'_l − E‖^2.

The finiteness of the last expression is also easily verified from the conditions of the theorem.
It remains to be checked that when at least three among the indices k_1, k_2, l_1 and l_2 are distinct the corresponding mathematical expectation is equal to 0. In this case at least one index appears only once. Assume that the index that appears only once is, for example, k_1. Then

E(ω'_{k_1} − E, ω'_{k_2} − E)(ω'_{l_1} − E, ω'_{l_2} − E)
= Σ_{ω'_1,...,ω'_n} p_{ω'_1} · ... · p_{ω'_n} Σ_{s,t=1}^d (ω'_{k_1}(s) − E(s))(ω'_{k_2}(s) − E(s))(ω'_{l_1}(t) − E(t))(ω'_{l_2}(t) − E(t)).

Here we denote by ω'_k(s), E(s), 1 ≤ s ≤ d, the components of the vectors ω'_k and E respectively. Since the series converges absolutely we can perform the summation in the last sum in any order desired. Let us fix all ω'_i, i ≠ k_1, and sum with respect to all values of ω'_{k_1}. Since Σ ω'_{k_1}(s) p_{ω'_{k_1}} = E(s) by definition of the vector E, the obtained sum is equal to 0. □

Let e_1, ..., e_d be the unit coordinate vectors and let p_{y−z} = 1/(2d) if y − z = ±e_s, 1 ≤ s ≤ d, and 0 otherwise. Such a random walk is said to be elementary.

Theorem 6.2 (Polya). An elementary random walk is recurrent for d = 1, 2 and transient for d ≥ 3.

It is sometimes said that this theorem is the mathematical foundation for the saying "All roads lead to Rome". We note however that if we also include cosmic paths then this saying is not true.
The proof follows very simply from the criterion. The probability P_n is the probability that Σ_{k=1}^n ω'_k = 0. For d = 1 this probability is investigated in the De Moivre-Laplace local limit theorem; it behaves like const/√n. In the general case the probability decreases as const/n^{d/2}. For d = 2 or 3 these statements were in fact proved in Lecture 3. Indeed, for d = 2 the increments ω'_k take four values, corresponding to ±e_s, s = 1, 2, and the probability of each value is equal to 1/4. Return to 0 occurs when the number of occurrences of e_1 and −e_1 are equal, as well as the number of occurrences of e_2 and −e_2. The probability of such an event was studied in Lecture 3, where it was shown (see Theorem 3.3) that P_n ≈ 2/(πn) for even n as n → ∞. For d = 3 the increment ω'_k takes six values, ±e_s, s = 1, 2, 3, and the probability of each of the values is equal to 1/6. Return to 0 occurs when the number of occurrences of e_1 and −e_1 are equal, and similarly for e_2 and −e_2, and for e_3 and −e_3. It follows from Theorem 3.3 that P_n ≈ const/n^{3/2} in this case, and this implies transience.
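The decay rates P_n ≈ const/n^{d/2} can be checked exactly for small n by convolving the one-step distribution of the elementary walk with itself; the brute-force sketch below (not from the text) does this for d = 1, 2, 3.

```python
def return_probs(d, n_max):
    """Exact [P_1, ..., P_n_max] for the elementary walk on Z^d,
    obtained by repeated convolution of the one-step distribution."""
    steps = []
    for s in range(d):
        for sign in (1, -1):
            e = [0] * d
            e[s] = sign
            steps.append(tuple(e))
    dist = {(0,) * d: 1.0}
    probs = []
    for _ in range(n_max):
        new = {}
        for z, p in dist.items():
            for e in steps:
                y = tuple(zi + ei for zi, ei in zip(z, e))
                new[y] = new.get(y, 0.0) + p / (2 * d)
        dist = new
        probs.append(dist.get((0,) * d, 0.0))
    return probs

P1 = return_probs(1, 12)   # P_n ~ const/sqrt(n): partial sums diverge
P2 = return_probs(2, 12)   # P_n ~ const/n: partial sums diverge (slowly)
P3 = return_probs(3, 12)   # P_n ~ const/n^(3/2): partial sums stay bounded
print(P1[1], P2[1], P3[1])  # P_2 = 1/(2d) in each dimension
```

Odd entries are zero, since an odd number of steps cannot return to 0.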
It follows easily from Polya's theorem that in dimension d = 3 the "typical" trajectories of a random walk go off to infinity as n → ∞. One can ask many questions about the asymptotic properties of such trajectories. For example, for each n consider the unit vector v_n = ω_n/‖ω_n‖; this consists in transposing the random walk to the unit sphere. One question is: for typical trajectories, does lim_{n→∞} v_n exist? This would imply that a typical trajectory goes off to infinity in a given direction. It turns out that this is not the case, and there is no such limit. Furthermore, the vectors v_n are uniformly distributed on the unit sphere. This means that a typical trajectory, when going to infinity, is seen from the origin under a more or less arbitrary angle. Such a phenomenon is possible because the trajectory of a random walk spreads out over a length of order O(√n) in n steps, and therefore manages to move in all directions.
Spatially homogeneous random walks on Z^d are special cases of homogeneous random walks on groups or semigroups. Let G be a countable group and P = {p_g, g ∈ G} be a probability distribution on the group G. We consider a Markov chain in which the space of states is the group G, and the transition probability is p_{x,y} = p_{yx^{-1}}, x ∈ G, y ∈ G. As in the case of the usual lattice Z^d we can formulate a definition of recurrence for a random walk and prove an analogous criterion. The proof of this criterion makes use of limit theorems for random variables with values in the group G. In the case of elementary random walks the answer depends substantially on the group G. For example, if G is taken to be the free group with two generators a and b, and if the probability distribution P is concentrated on the four points a, b, a^{-1} and b^{-1}, then such a random walk is always transient.

There are interesting problems in connection with continuous groups. The groups SL(n, R) of matrices of order n with real elements and determinant 1 arise particularly often. Special methods have been worked out to study random walks on such groups.
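The transience of the elementary random walk on the free group can be seen in a quick simulation. Each uniform step lengthens the reduced word representing the walker's position with probability 3/4 and shortens it with probability 1/4, so the distance from the identity grows linearly. The sketch below is our illustration; representing group elements as reduced words over 'a', 'A', 'b', 'B' is a convention of the experiment, not of the text.

```python
import random

# Generators of the free group F_2: 'a', 'b' and their inverses 'A', 'B'.
INVERSE = {'a': 'A', 'A': 'a', 'b': 'B', 'B': 'b'}

def random_walk_length(n, seed=0):
    """Length of the reduced word after n uniform steps on the free group F_2."""
    rng = random.Random(seed)
    word = []  # reduced word, e.g. ['a', 'B', 'a']
    for _ in range(n):
        g = rng.choice('aAbB')
        if word and word[-1] == INVERSE[g]:
            word.pop()        # g cancels the last letter
        else:
            word.append(g)    # the word stays reduced
    return len(word)

if __name__ == "__main__":
    # Drift of the word length is 3/4 - 1/4 = 1/2 per step,
    # so the walk escapes to infinity at linear speed.
    print(random_walk_length(10_000) / 10_000)
```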
Lecture 7. Branching Processes

Here we consider another example of Markov chains, which arises in the analysis of so-called multiple decay processes for elementary particles. The theory of such processes was introduced in articles by the Soviet mathematicians A. N. Kolmogorov and B. A. Sevast'yanov.

In the simplest case we assume that we have a system of particles of one type, whose evolution is as follows: during a unit of time each particle either dies, or remains unchanged, or splits into several particles of the same type. We also assume that at time 0 one particle is present and that the whole evolution is considered on the time interval [0, n]. As always we begin with the construction of the space Ω^(n) of elementary outcomes. An individual outcome ω^(n) ∈ Ω^(n) describes all consecutive transformations of the particles, and it is convenient to represent it in the form of a genealogical tree. The next figure represents an example of an ω^(3) for n = 3.

Fig. 7.1.

A circle with a cross means that the particle has died (disappeared). Since each particle can split into any number of particles, Ω^(n) in general is countable. Genealogical trees are possible which end with one cross. In this case the whole process stops. We then consider that the genealogical tree is extended by extending each cross into another cross. The number n is called the number of generations. It therefore makes sense to talk of particles of the k-th generation, 0 ≤ k ≤ n.

Let {p_k}_{k=0}^∞ be a probability distribution where p_k gives the probability of a particle splitting into k particles (k = 0 corresponds to death, k = 1 corresponds to the intact preservation of the particle). The probability
distribution on Ω^(n) is constructed from the fact that the individual particles split independently. Formally, this means the following: next to each particle x of the m-th generation (a cross does not represent a particle), 0 ≤ m ≤ n − 1, we write p(x) = p_k if the particle splits into k particles, k = 0, 1, 2, .... Next to each cross we write 1. We now set

p(ω^(n)) = ∏_{x ∈ ω^(n)} p(x).

The fact that the product in the latter expression runs over all particles x up to the (n − 1)-st generation (inclusive) reflects the independence of the splitting process for all particles.

Lemma 7.1. The numbers p(ω^(n)) define a probability distribution on Ω^(n).

Proof. The proof is carried out by induction on n. For n = 1 the statement follows from the fact that the set {p_k}_{k=0}^∞ is a probability distribution. Let us assume that the statement is already proved for a given n. Then

Σ_{ω^(n+1) ∈ Ω^(n+1)} p(ω^(n+1)) = Σ_{ω^(n) ∈ Ω^(n)} Σ_{ω^(n+1) → ω^(n)} p(ω^(n+1)),

where the notation ω^(n+1) → ω^(n) means that ω^(n+1) ∈ Ω^(n+1) was obtained from ω^(n) ∈ Ω^(n). We note now that p(ω^(n+1)) = p(ω^(n)) ∏ p(x), where the last product is carried out over all particles of the n-th generation in ω^(n). If x is a cross then p(x) = 1 by our assumption. If x is a particle then it can split into any number of particles. Therefore

Σ_{ω^(n+1) → ω^(n)} p(ω^(n+1)) = p(ω^(n)) ∏_x Σ_{k=0}^∞ p_k = p(ω^(n)).

By the same token

Σ_{ω^(n+1) ∈ Ω^(n+1)} p(ω^(n+1)) = Σ_{ω^(n) ∈ Ω^(n)} p(ω^(n)) = 1

by the induction hypothesis. □


The space Ω^(n) with the probability distribution {p(ω^(n))} is called the branching process. Let us denote by q_n the probability of those ω^(n) which end only with crosses, i.e. corresponding to the extinction of our process. It is clear that q_{n+1} ≥ q_n.

Definition 7.1. The branching process is said to be degenerate if lim_{n→∞} q_n = 1. In the opposite case the process is said to be non-degenerate.
We now obtain a simple criterion for the degeneracy of the process. Let us introduce the quantity

m = Σ_{k=1}^∞ k p_k,

which represents the mean number of progeny of one particle. We have the following theorem.

Theorem 7.1. Let p_1 < 1. The branching process is degenerate if and only if m ≤ 1.

Remark. The statement of the theorem for m < 1 appears quite natural. The fact that it holds for m = 1 is more surprising.

Proof. As in the previous lecture we use generating functions. Let us introduce the random variable ν^(n)(ω^(n)), equal to the number of particles of the n-th generation in ω^(n). It is clear that ν^(n)(ω^(n)) takes the values 0 (extinction), 1, 2, .... Let us form the generating function φ^(n)(z) = Σ_{s=0}^∞ P{ν^(n)(ω^(n)) = s} z^s. Then q_n = φ^(n)(0). We also set φ(z) = Σ_{k=0}^∞ p_k z^k. We study in more detail the properties of the function φ(z) for 0 ≤ z ≤ 1. Since Σ_{k=0}^∞ p_k = 1, we have lim_{z→1} φ(z) = 1. Moreover φ'(z) = Σ_{k=1}^∞ k p_k z^{k−1} ≥ 0, where the value 0 in the last expression can occur for 0 < z ≤ 1 only in the degenerate case where p_0 = 1, that is, where no splitting occurs and the particle dies immediately. Excluding this case from consideration we find that φ'(z) > 0 for 0 < z ≤ 1, i.e. φ(z) is strictly monotonic. Furthermore φ''(z) = Σ_{k=2}^∞ k(k−1) p_k z^{k−2} ≥ 0, and for 0 < z ≤ 1 equality is possible only in the case where p_0 + p_1 = 1 and p_k = 0 for k > 1, i.e. when φ(z) = p_0 + p_1 z is a linear function. For p_1 = 1, p_0 = 0 its graph is the bisectrix φ(z) = z, and for p_1 < 1 its graph intersects the bisectrix at one point. For p_0 + p_1 < 1 we have φ''(z) > 0 for 0 < z ≤ 1, and therefore φ(z) is a strictly convex function. Since φ(1) = 1 and φ(0) ≤ 1, the graph of φ(z) intersects the bisectrix φ(z) = z at one point at most for which 0 < z < 1.

Lemma 7.2. If m ≤ 1 the graph of φ(z) intersects the bisectrix φ = z for 0 ≤ z ≤ 1 only at the point z = 1. If m > 1 there exists a z_0, 0 < z_0 < 1, such that φ(z_0) = z_0.

Remark. It follows from our previous analysis that the point z_0 is unique.

Remark. The statement of the lemma is obvious for p_k = 0, k ≥ 2, i.e. φ(z) = p_0 + p_1 z.

Proof. We first show that for m ≤ 1 the point z_0 does not exist. Assume that this is not the case, and that for some z_0 < 1 we have φ(z_0) = z_0. By Rolle's Theorem there exists a ξ such that φ(1) − φ(z_0) = 1 − z_0 = φ'(ξ)(1 − z_0), i.e.
φ'(ξ) = 1. But φ''(z) > 0 for 0 < z ≤ 1 (we do not consider the degenerate case here). Therefore φ'(z) is strictly monotonic. Consequently we must have lim_{z→1} φ'(z) > 1, which is impossible, since by Abel's Theorem lim_{z→1} φ'(z) = Σ_{k=1}^∞ k p_k = m ≤ 1.

There remains the case m > 1. Consider φ_1(z) = φ(z) − z. We have φ_1(0) = φ(0) = p_0 ≥ 0 and φ_1(1) = φ(1) − 1 = 0. In the trivial case where p_0 = 0, extinction and, by the same token, degeneracy are impossible. We will therefore assume that φ_1(0) = p_0 > 0. Since φ_1'(z) > 0 in a left neighborhood of z = 1, the function φ_1(z) is negative in a left neighborhood of z = 1. Consequently there exists a z_0 such that φ_1(z_0) = 0, i.e. φ(z_0) = z_0. □
The proof of the theorem now consists in showing that lim_{n→∞} q_n = z_0. We first establish the following relation, which connects φ^(n)(z) and φ^(n+1)(z).

Lemma 7.3. φ^(n+1)(z) = φ^(n)(φ(z)).

Proof. We have

φ^(n+1)(z) = Σ_{s=0}^∞ P{ν^(n+1)(ω^(n+1)) = s} z^s
= Σ_{ω^(n) ∈ Ω^(n)} Σ_{ω^(n+1) → ω^(n)} z^{ν^(n+1)(ω^(n+1))} p(ω^(n+1))
= Σ_{ω^(n) ∈ Ω^(n)} p(ω^(n)) Σ_{ω^(n+1) → ω^(n)} ∏_x p(x) z^{ε(x)}.

Here we use the same relation between p(ω^(n)) and p(ω^(n+1)) as before; the product ∏_x p(x) z^{ε(x)} is carried out over all particles x of the n-th generation, and ε(x) is equal to l in the case where x has split into l particles. Therefore ν^(n+1)(ω^(n+1)) = Σ_x ε(x), where the sum is carried out similarly, over the particles of the n-th generation. In particular, if x is a cross, then p(x) = 1 and ε(x) = 0. Furthermore

Σ_{ω^(n+1) → ω^(n)} ∏_x p(x) z^{ε(x)} = ∏_x Σ_k p_k z^k = (φ(z))^{ν^(n)(ω^(n))},

since ν^(n)(ω^(n)) is equal exactly to the number of particles of the n-th generation. By the same token

φ^(n+1)(z) = Σ_{ω^(n) ∈ Ω^(n)} p(ω^(n)) (φ(z))^{ν^(n)(ω^(n))} = Σ_{s=0}^∞ P{ν^(n)(ω^(n)) = s} (φ(z))^s = φ^(n)(φ(z)). □
For further discussion another relation will be of use, which follows from Lemma 7.3: φ^(n+1)(z) = φ(φ^(n)(z)). For n = 2 both relations agree, since φ^(1)(z) = φ(z). We again argue by induction on n. It follows from Lemma 7.3, from the induction hypothesis, and again from Lemma 7.3, that

φ^(n+1)(z) = φ^(n)(φ(z)) = φ(φ^(n−1)(φ(z))) = φ(φ^(n)(z)).

We now finish the proof of the theorem. We have already seen that q_n = φ^(n)(0) ≤ q_{n+1} = φ^(n+1)(0) ≤ 1. Therefore the limit lim_{n→∞} q_n = lim_{n→∞} φ^(n)(0) = z* ≤ 1 exists. Since φ^(n+1)(0) = φ(φ^(n)(0)), then

φ(z*) = lim_{n→∞} φ(φ^(n)(0)) = lim_{n→∞} φ^(n+1)(0) = z*,

i.e. z* = φ(z*). We now show that z* is the smallest root of the equation z = φ(z). Indeed, let z_0 be the smallest root. We do not consider the trivial case where z_0 = 0, i.e. p_0 = 0. By the strict monotonicity of φ(z) we have

φ(0) = p_0 < φ(z_0) = z_0,
φ^(2)(0) = φ(φ(0)) < φ(φ(z_0)) = φ(z_0) = z_0,

and by induction on n

φ^(n)(0) = φ(φ^(n−1)(0)) < φ(z_0) = z_0,

i.e. all φ^(n)(0) = q_n < z_0. But then lim_{n→∞} φ^(n)(0), being a root of z = φ(z) not exceeding the smallest root z_0, can only be z_0.

We have already seen (Lemma 7.2) that for m ≤ 1 the smallest root of the equation φ(z) = z is z = 1, i.e. the process is degenerate, and for m > 1 there exists a unique root z_0 < 1. □

Problem. Let m < 1. Then according to Theorem 7.1 the branching process is degenerate. Let us construct a new space of elementary outcomes Ω whose points ω are the degenerate genealogical trees. For each ω the probability p(ω) is defined as before. We denote by τ(ω) the number of generations in the tree ω. Show that Eτ(ω) < ∞.
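The proof of Theorem 7.1 also yields a practical recipe: since q_n = φ^(n)(0) and φ^(n+1)(0) = φ(φ^(n)(0)), the extinction probability is the limit of the iteration q ← φ(q) started at 0, i.e. the smallest root of φ(z) = z. A short sketch (the offspring distributions used below are our examples, not the text's):

```python
def extinction_probability(p, iterations=10_000):
    """Smallest root of phi(z) = z for an offspring distribution p = [p0, p1, ...],
    obtained as the limit of q_{n+1} = phi(q_n), q_0 = 0."""
    def phi(z):
        return sum(pk * z**k for k, pk in enumerate(p))
    q = 0.0
    for _ in range(iterations):
        q = phi(q)
    return q

if __name__ == "__main__":
    # p0 = 1/4, p1 = 1/4, p2 = 1/2, so m = 5/4 > 1; phi(z) = z reduces to
    # 2z^2 - 3z + 1 = 0, whose smallest root is 1/2.
    print(extinction_probability([0.25, 0.25, 0.5]))
    # A subcritical example: m = 3/4 <= 1, so the process is degenerate
    # and the extinction probability is 1.
    print(extinction_probability([0.5, 0.25, 0.25]))
```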
Lecture 8. Conditional Probabilities and Expectations

Let (Ω, ℱ, P) be a probability space, C ∈ ℱ and P(C) > 0. The conditional distribution on the set C is a probability distribution defined on the σ-algebra ℱ by

P(A|C) = P(A ∩ C)/P(C),  A ∈ ℱ.

This conditional distribution is in fact concentrated on the subsets of C. Given any random variable ξ, its mathematical expectation, computed with respect to this probability distribution, is the conditional expectation E(ξ|C) given C. If ξ = {C_1, …, C_r} is an arbitrary finite partition of Ω we have the following objects:

1) the conditional distribution, conditional on each C_i, 1 ≤ i ≤ r;
2) the space of elements of the partition Ω/ξ, which consists of r points (the r elements of the partition), and a probability distribution on this space for which each probability is equal to P(C_i).

These two objects are connected by the total probability formula

P(A) = Σ_{i=1}^r P(A|C_i) P(C_i),  (8.1)

which can be interpreted in the following way: first, the conditional measure of A is calculated with respect to each element C_i of our partition. The result is a function of i, i.e. a function on the space Ω/ξ of elements of the partition ξ, and the sum is the integral over that space. Viewed in that way the total probability formula reduces the calculation of a probability to a procedure such as the one used when writing a double integral as an iteration of two integrals. The advantage is that one can often obtain some information or other about the individual terms. The formula

Eη = Σ_{i=1}^r E(η|C_i) P(C_i),  (8.1')

is called the total expectation formula and has exactly the same meaning.

It is now already clear in which direction one should generalize (8.1) and (8.1'). Let ξ be an arbitrary, not necessarily finite, partition of Ω into non-intersecting sets. The elements of the partition ξ will be denoted by C_ξ and
the element of the partition which contains x will be denoted by C_ξ(x). Each partition ξ defines a σ-algebra ℱ(ξ): a set C ∈ ℱ(ξ) if and only if there exists a union C' ∈ ℱ of elements of the partition ξ such that P(C △ C') = 0 (in probability theory, as well as in general measure theory, events which differ by a subset of measure 0 are identified). The space whose points are the elements C_ξ of the partition ξ is called the quotient space Ω/ξ. Since ℱ(ξ) ⊆ ℱ, the probability distribution P induces a probability distribution on Ω/ξ. So to each partition ξ corresponds a new probability space (Ω/ξ, ℱ(ξ), P). This is a natural generalization of 2) above.

Important Example of a Partition. Let {η_n} be a sequence of random variables, n = 0, 1, 2, .... Let us take two numbers n_1, n_2 (n_1 ≤ n_2) and let us fix the values of η_n, n_1 ≤ n ≤ n_2. We obtain a partition ξ_{n_1}^{n_2} in the following way: two points ω', ω'' belong to the same element of the partition ξ_{n_1}^{n_2} if and only if η_n(ω') = η_n(ω'') for all n_1 ≤ n ≤ n_2. One sometimes says that the partition ξ_{n_1}^{n_2} is connected with the behavior of the sequence η_n on the time interval [n_1, n_2]. We can also consider the partition ξ_{n_1} corresponding to the behavior of the sequence η_n for n ≥ n_1.

It is clear that we could also consider finite collections of random variables η_1, …, η_n and obtain partitions by fixing the values of some of the random variables, for example by fixing the values of η_{n_1+1}, …, η_n where n = n_1 + n_2, that is, by fixing the values of the last n_2 random variables.
Now assume that we can introduce, on each C_ξ, a probability distribution P(·|C_ξ) defined on the σ-algebra ℱ such that the total probability formula holds: for each A ∈ ℱ

P(A) = ∫_{Ω/ξ} P(A|C_ξ) dP(C_ξ),  (8.2)

and that the total expectation formula holds: for any random variable η

Eη = ∫_{Ω/ξ} E(η|C_ξ) dP(C_ξ).  (8.2')

We make the following remarks about (8.2) and (8.2'):

1) for a fixed A the quantity P(A|C_ξ) or, for a fixed η, the quantity E(η|C_ξ), is a function on the quotient space Ω/ξ;
2) the notation dP(C_ξ) refers to the induced probability distribution on Ω/ξ;
3) the integral in (8.2) and (8.2') is the general Lebesgue integral with respect to the probability space (Ω/ξ, ℱ(ξ), P).

We also assume that P(A|C_ξ) and E(η|C_ξ) are measurable functions on that space.

Definition 8.1. The partition ξ is said to be measurable if the probability distribution P(·|C_ξ), called the conditional distribution, is defined, except for a subset of elements of the partition of measure 0, and the formulae (8.2) and (8.2') hold.
The probability P(A|C_ξ) is called the conditional probability of the event A given C_ξ, and E(η|C_ξ) is called the conditional expectation. Formula (8.2') is called the total expectation formula.

One can show that if ξ is a measurable partition the system of conditional probabilities is essentially (i.e. up to subsets of measure 0; we do not go into further details here) uniquely defined.

The partitions encountered in most problems of probability theory are measurable. At the same time it is easy to find an example of a non-measurable partition.

Example of a Non-measurable Partition. Let Ω = S^1 be the unit circle, let ℱ be the σ-algebra of Borel subsets of the circle, and let P be the Lebesgue measure on S^1. Let α be a fixed irrational number. Two points ω', ω'' ∈ S^1 belong to the same element of the partition ξ if and only if ω' − ω'' = k_1 + k_2 α for some integers k_1, k_2.

We will not prove here that this partition is non-measurable. We simply note that non-measurable partitions are encountered in many important problems in the theory of non-commutative algebras, and that the problem of their description and classification is of considerable current interest and is the subject of intensive investigation.
We now look at the problem of constructing conditional distributions in one simple, but important, particular case. Let η_1 and η_2 be two random variables with joint density p(x, y). This means that p(x, y) ≥ 0, ∬ p(x, y) dx dy = 1 and

P{η_1(ω) ∈ A, η_2(ω) ∈ B} = ∬_{A×B} p(x, y) dx dy,

where A, B are Borel subsets of the real line. Taking B = ℝ^1 we obtain

P{η_1(ω) ∈ A} = ∫_A p_1(x) dx,

where p_1(x) = ∫_{−∞}^{∞} p(x, y) dy, i.e. η_1 has a distribution with density p_1(x) (from now on we assume that all interchanges of the order of integration are valid). In just the same way η_2 has a distribution with density

p_2(y) = ∫_{−∞}^{∞} p(x, y) dx.

We now find the conditional distribution of η_1, conditional on η_2. This means that we consider the partition ξ_2 obtained by fixing the value η_2 = y. The elements C_{ξ_2} are parameterized by the value y. The quotient space Ω/ξ_2 can be easily constructed. Any set C in ℱ(ξ_2) is a union of elements C_{ξ_2}, i.e. is obtained by finding the set of those values y for which C_{ξ_2} ⊂ C. So ℱ(ξ_2) is in fact the σ-algebra of Borel subsets of the line, together with the subsets of the line which differ from them by sets of measure 0. The induced probability distribution on the line is the probability distribution with density p_2. We set
q(x|y) = p(x, y)/p_2(y) for those x, y such that p_2(y) > 0, and q(x|y) = 0 for other x, y. Then q(x|y) ≥ 0 and ∫ q(x|y) dx = 1 for such y. In other words, q(x|y) is a probability density. For any Borel subset A ⊂ ℝ^1

P{η_1 ∈ A} = ∫_A ∫_{−∞}^{∞} p(x, y) dy dx = ∫_{−∞}^{∞} p_2(y) dy ∫_A q(x|y) dx.

This is formula (8.2). It shows that η_1, for a fixed value y of the random variable η_2, has density q(x|y). By uniqueness of the conditional distribution we have now found the conditional distribution of η_1 for fixed η_2. We represent p(x, y) in the form

p(x, y) = p_2(y) · q(x|y),  (8.3)

where p_2(y) is a probability density and q(x|y) is a probability density in x for every y such that p_2(y) > 0. If such a representation is obtained for a given distribution then, by the same token, we have constructed the system of conditional probabilities.

Relation (8.3) can be directly generalized to the case of several η_i. Namely, let n = n_1 + n_2 and let η_1, …, η_n be n random variables whose joint distribution is given by the density p(t_1, …, t_n). Assume that p is represented in the form

p(x_1, …, x_{n_1}, y_1, …, y_{n_2}) = p_2(y_1, …, y_{n_2}) · q(x_1, …, x_{n_1} | y_1, …, y_{n_2}),  (8.4)

where p_2 is a probability density and q(x_1, …, x_{n_1} | y_1, …, y_{n_2}) is a probability density in the variables x_1, …, x_{n_1} for any y_1, …, y_{n_2} such that p_2(y_1, …, y_{n_2}) > 0. Then p_2 is the joint density of the random variables η_{n_1+1}, …, η_n, and q(x_1, …, x_{n_1} | y_1, …, y_{n_2}) is the joint density of the random variables η_1, …, η_{n_1} for fixed values y_1, …, y_{n_2} of the random variables η_{n_1+1}, …, η_n.
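The factorization (8.3) is easy to check on a concrete density. Take, as our own illustration (not an example from the text), p(x, y) = x + y on the unit square. Then p_2(y) = 1/2 + y and q(x|y) = (x + y)/(1/2 + y), which indeed integrates to 1 in x for every y:

```python
def p(x, y):
    """Joint density p(x, y) = x + y on the unit square [0, 1] x [0, 1]."""
    return x + y

def p2(y):
    # Marginal of the second variable: integral of (x + y) dx over [0, 1].
    return 0.5 + y

def q(x, y):
    # Conditional density q(x|y) = p(x, y) / p2(y).
    return p(x, y) / p2(y)

def integrate_q(y, steps=10_000):
    """Midpoint rule for the integral of q(x|y) dx over [0, 1]."""
    h = 1.0 / steps
    return sum(q((i + 0.5) * h, y) for i in range(steps)) * h

if __name__ == "__main__":
    for y in (0.0, 0.3, 0.9):
        print(round(integrate_q(y), 6))  # each value is 1.0
```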

We now proceed to the most general approach to the construction of conditional probabilities and conditional expectations. Let ℱ' ⊂ ℱ be an arbitrary σ-subalgebra of the σ-algebra ℱ. The question of whether ℱ' can be represented as ℱ' = ℱ(ξ) for a measurable partition ξ is far from trivial. It can be answered in the affirmative under some further assumptions on the probability space (Ω, ℱ, P), such as separability and completeness (in so-called Lebesgue spaces), but in general the answer may be negative in more complicated cases.

Let η ≥ 0 be a non-negative random variable such that ∫ η(ω) dP < ∞.

Definition 8.2. A conditional expectation of the random variable η with respect to the σ-subalgebra ℱ' is a function E(η|ℱ'), measurable with respect to the σ-algebra ℱ', such that for any A ∈ ℱ'

∫_A η(ω) dP = ∫_A E(η|ℱ') dP.

The subtlety in the last equation lies in the fact that on the left-hand side we have a probability distribution given on the whole σ-algebra ℱ, and on the right-hand side a probability distribution given only on the σ-algebra ℱ', i.e. a totally different object.

The existence of the conditional expectation is established by using the Radon–Nikodym Theorem. Specifically, ∫_A η(ω) dP is a σ-additive set function defined on ℱ'. It is absolutely continuous with respect to P, since it follows from P(A) = 0 that ∫_A η(ω) dP(ω) = 0. Then by the Radon–Nikodym Theorem there exists a function g(ω), measurable with respect to ℱ', such that

∫_A η(ω) dP = ∫_A g(ω) dP,  A ∈ ℱ'.

This function is taken to be the conditional expectation. If ℱ' = ℱ(ξ) for a measurable partition ξ, then the measurability of g with respect to ℱ' means that g is a function on the quotient space Ω/ξ, i.e. is constant on the elements of the partition ξ. This is the meaning of conditional expectation.

If we take for η the indicator of the set C, i.e. the random variable χ_C(ω) = 1 for ω ∈ C and χ_C(ω) = 0 for ω ∉ C, we obtain the definition of the conditional probability of the event C with respect to the σ-subalgebra ℱ', denoted by P(C|ℱ').

Given an arbitrary, not necessarily non-negative, random variable η, we may write it in the form η = η_1 − η_2, where η_1 ≥ 0, η_2 ≥ 0. We assume that Eη_1 < ∞, Eη_2 < ∞. The conditional expectation is then E(η|ℱ') = E(η_1|ℱ') − E(η_2|ℱ').

Unfortunately, in the general case it is not possible to construct conditional distributions as a compatible system of σ-additive probability distributions. The further restrictions mentioned above are necessary for this.
Lecture 9. Multivariate Normal Distributions

In this lecture we consider a common example of a multivariate distribution. We begin with the non-degenerate case. Let ξ_1, …, ξ_n be n random variables whose joint distribution is given by the density p(x_1, …, x_n) of the form

p(x) = p(x_1, …, x_n) = C e^{−(1/2)(A(x−m), (x−m))}.  (9.1)

Here C is a normalizing constant, x = (x_1, …, x_n) is an n-dimensional vector, m = (m_1, …, m_n) is also an n-dimensional vector, and A is a symmetric matrix. The density (9.1) is said to be the density of a non-degenerate multivariate normal distribution. The convenience of (9.1) lies in the fact that the density is defined in terms of very simple parameters: an n-dimensional vector m and a symmetric matrix A of order n.

Since p(x) is integrable, p(x) → 0 as x → ∞. This is possible only in the case when A is a positive definite matrix. We now find the value of the constant C. By the normalization condition we have

1/C = ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} e^{−(1/2)(A(x−m),(x−m))} dx_1 ⋯ dx_n.

We first make the change of variables y = x − m:

1/C = ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} e^{−(1/2)(Ay,y)} dy.

We now find an orthogonal matrix S such that S*AS = D, where D is a diagonal matrix with elements d_i on the diagonal. We also have det D = det A. We now make the linear change of variables y = Sz. Then (Ay, y) = (ASz, Sz) = (S*ASz, z) = (Dz, z) (for an orthogonal matrix S* = S^{−1}) and

1/C = ∫ ⋯ ∫ e^{−(1/2)(Dz,z)} dz_1 ⋯ dz_n = (2π)^{n/2} (det D)^{−1/2} = (2π)^{n/2} (det A)^{−1/2}.

Therefore C = (2π)^{−n/2} √(det A).

We now elucidate the probabilistic meaning of the vector m and the matrix A. Let us find the expectation Eξ_i:
Eξ_i = ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} x_i p(x_1, …, x_n) dx_1 ⋯ dx_n
= C ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} x_i e^{−(1/2)(A(x−m),(x−m))} dx_1 ⋯ dx_n
= C ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} (x_i − m_i) e^{−(1/2)(A(x−m),(x−m))} dx_1 ⋯ dx_n + m_i.

Here we made the change of variable x − m = y. The resulting integral is equal to 0, since the integrand is an odd function of y_i. Thus m_i = Eξ_i, i.e. the components of the vector m are the expectations of the random variables ξ_i.
We now find Cov(ξ_i, ξ_j). We have

Cov(ξ_i, ξ_j) = E(ξ_i − m_i)(ξ_j − m_j)
= C ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} (x_i − m_i)(x_j − m_j) e^{−(1/2)(A(x−m),(x−m))} dx
= C ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} y_i y_j e^{−(1/2)(Ay,y)} dy_1 ⋯ dy_n.

As before, we make the change of variable y = Sz or, more explicitly, y_i = Σ_{k=1}^n S_{ik} z_k. Then

Cov(ξ_i, ξ_j) = Σ_{k=1}^n S_{ik} S_{jk} d_k^{−1}.

Here we used the fact that the integral of z_k z_l e^{−(1/2) Σ_i d_i z_i²} dz_1 ⋯ dz_n vanishes for k ≠ l, while for k = l the corresponding one-dimensional factor ∫_{−∞}^{∞} z_k² e^{−(1/2) d_k z_k²} dz_k equals √(2π) d_k^{−3/2}. We now note that Σ_k S_{ik} d_k^{−1} S_{jk} is an element of the matrix S D^{−1} S* = S D^{−1} S^{−1} = A^{−1}. Thus the matrix A^{−1} has an immediate probabilistic meaning: its elements are the covariances of the random variables ξ_i, ξ_j. Let us assume that the covariances of the random variables ξ_i, ξ_j are equal to 0 for i ≠ j. This means that A^{−1} is a diagonal matrix. Then A is also a diagonal matrix, and p(x_1, …, x_n) = p_1(x_1) ⋯ p_n(x_n) is the product of one-dimensional normal densities, i.e. the random variables ξ_1, …, ξ_n are independent. Thus in the case of the multivariate normal distribution, if the covariances are zero, independence follows.
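The identification of A^{−1} with the covariance matrix can be checked by simulation. The sketch below (our illustration; it relies on NumPy's sampler, nothing from the text) draws a large sample from the normal distribution with a given matrix A and compares the empirical mean and covariance with m and A^{−1}.

```python
import numpy as np

# A symmetric positive definite matrix A from (9.1); the covariance
# matrix of the corresponding normal distribution should be A^{-1}.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
m = np.array([1.0, -2.0])

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=m, cov=np.linalg.inv(A), size=200_000)

empirical_mean = samples.mean(axis=0)
empirical_cov = np.cov(samples, rowvar=False)

if __name__ == "__main__":
    print(np.round(empirical_mean, 2))   # close to m
    print(np.round(empirical_cov, 2))    # close to A^{-1}
    print(np.round(np.linalg.inv(A), 2))
```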
In what follows it will be useful to obtain an expression for the characteristic function (the Fourier transform) of a multivariate normal distribution. By this we mean the integral

φ(λ) = φ(λ_1, …, λ_n) = C ∫ ⋯ ∫ e^{i(λ,x) − (1/2)(A(x−m),(x−m))} dx_1 ⋯ dx_n
= e^{i(λ,m)} C ∫ ⋯ ∫ e^{i(λ,y) − (1/2)(Ay,y)} dy_1 ⋯ dy_n.

Again setting y = Sz, we have

φ(λ) = e^{i(λ,m)} C ∫ ⋯ ∫ e^{i(λ,Sz) − (1/2)(Dz,z)} dz_1 ⋯ dz_n
= e^{i(λ,m)} C ∫ ⋯ ∫ e^{i(μ,z) − (1/2)(Dz,z)} dz_1 ⋯ dz_n.

Here μ = S*λ = S^{−1}λ. The last integral splits into a product of one-dimensional integrals. We use the fact that for any μ and d > 0,

(1/√(2π d^{−1})) ∫ e^{iμz − (1/2)dz²} dz = e^{−μ²/(2d)}.

This gives

C ∫ ⋯ ∫ e^{i(μ,z) − (1/2)(Dz,z)} dz_1 ⋯ dz_n = e^{−(1/2) Σ_k μ_k²/d_k} = e^{−(1/2)(D^{−1}μ,μ)} = e^{−(1/2)(S D^{−1} S^{−1} λ, λ)} = e^{−(1/2)(A^{−1}λ,λ)}.

Finally we obtain

φ(λ) = e^{i(λ,m) − (1/2)(A^{−1}λ,λ)}.

By using the last expression we can obtain a definition of a more general multivariate normal distribution which is not necessarily non-degenerate. Specifically, let B = A^{−1}. In the non-degenerate case A > 0 and therefore B exists and B > 0. A general multivariate normal distribution is a distribution whose characteristic function has the form

φ(λ) = e^{i(λ,m) − (1/2)(Bλ,λ)},

where B ≥ 0 and m is an arbitrary n-dimensional vector. One can show that in the degenerate case the probability distribution is concentrated on a plane of smaller dimension; if we introduce variables on this plane by means of a linear transformation and a translation, then the density expressed in these variables has the same form as before.

Now let n = n_1 + n_2. It will be more convenient to use the notation ξ_1, …, ξ_{n_1} and x_1, …, x_{n_1} for the first n_1 random variables and their values, and η_1, …, η_{n_2} and y_1, …, y_{n_2} for the remaining n_2 random variables and their values. Correspondingly m = (m^(1), m^(2)), where m^(1) (respectively m^(2)) is the vector of expectations of the random variables ξ_i (respectively η_j), and the matrix A can be written in the block form

A = ( A^(1)  B* )
    ( B      A^(2) ),

so that (A(x, y), (x, y)) = (A^(1)x, x) + 2(Bx, y) + (A^(2)y, y).

We now write the density of a non-degenerate normal distribution of the


combined random variables ei'
11; in the form

We now find the conditional distribution of the random variables e1' ... en"
conditional on fixed values Y1' .. . ,Yn. , for the random variables 111,·· . , l1n •.
We first assume that m(1) = 0 and m(2) = o. We have

= (21rtn/2VdetA.exp{ -i[(A(1)x,x) + 2(Bx,y) + (A(2)y,y)]}


= (21rtn/2VdetA.exp{-!(AI2)y,y)}
2
1
X exp{ -2 [(A(1) (x - Ly), (x - Ly))] + (Bx,y)
1 1 1
+ "2 (A(1) (x), Ly) + "2 (A(1) Ly, x) - "2 (A(1) Ly, Ly)}.
Here L is an arbitrary n2 xnl matrix, which transforms R!" into Rn,. Further-
more, by symmetry of A(l), the scalar product (A(l)Ly,x) =
(Ly,A(1)x) = (A(l)X,Ly). Therefore
1 1
(Bx,y) + 2(A(1)x,Ly) + 2(A(l)Ly,x)
= (Bx,y) + (A(1)x,Ly) = (Bx,y) + (LoA(1)x,y).
We now choose L such that B + LOA(l) = 0, i.e.
LO = -B(A(1)t l , L = (_(A(l)))-l B*.
Then for A(3) = A(2) + LOA(1) L = A(2) + B(A* )-1 B* we have

p(xl,. .. ,Xn,iYl, ... ,Yn.) = (21rt n/ 2 vdetA


1 1 1
x exp{ -2 (A(2)y,y) - 2(A(1) Ly,Ly) - 2 [(All) (x - Ly), (x - Ly))]}

= (21rt n./2 v'det A(3) exp{ -~ (A(3)y,y)}.(21rt n ,/2

X v'det A(l) exp{ -~ [(A(l) (x - Ly), (x - Ly))]}.


(9.2)
The latter expression is the density of a multivariate normal distribution for
the random variables el' ... ' en, , with vector of expectation Ly and matrix
A(1), and
is the density of a multivariate normal distribution for the random variables η_1, …, η_{n_2} with zero expectation and matrix A^(3). So we have obtained an expression of the type (8.4). In passing we have proved a formula which is well known in linear algebra,

det A = det A^(1) · det(A^(2) − B(A^(1))^{−1}B*),

and also the inequality A^(3) = A^(2) − B(A^(1))^{−1}B* > 0. The main conclusion from equation (9.2) for probability theory is that the conditional distribution of the random variables ξ_1, …, ξ_{n_1}, for fixed values y_1, …, y_{n_2} of the random variables η_1, …, η_{n_2}, is a normal distribution with vector of expectations Ly = −(A^(1))^{−1}B*y and matrix A^(1). In other words, the dependence on the conditions manifests itself only in the vector of conditional expectations.

The previous calculations hold for the case where m^(1) = 0, m^(2) = 0. In order to obtain the result for arbitrary m^(1), m^(2) it suffices to replace x, y by x − m^(1), y − m^(2) in the final answer. We will not go into further details here.
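This conclusion can be cross-checked numerically against the more familiar covariance-based formula: with Σ = A^{−1} partitioned into the same blocks, the classical conditional-mean matrix Σ_{12}Σ_{22}^{−1} must coincide with L = −(A^(1))^{−1}B*. A sketch with a randomly generated positive definite A (our illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 2, 3
n = n1 + n2

# Random symmetric positive definite matrix A, partitioned into blocks
# A = [[A1, Bstar], [B, A2]], so the exponent has the cross term 2(Bx, y).
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
A1, Bstar = A[:n1, :n1], A[:n1, n1:]

# Conditional-mean matrix from the completion of the square: L = -(A1)^{-1} B*.
L = -np.linalg.inv(A1) @ Bstar

# The same matrix obtained from the covariance Sigma = A^{-1}.
Sigma = np.linalg.inv(A)
L_cov = Sigma[:n1, n1:] @ np.linalg.inv(Sigma[n1:, n1:])

if __name__ == "__main__":
    print(np.allclose(L, L_cov))
```

The same experiment also confirms the determinant identity det A = det A^(1) · det(A^(2) − B(A^(1))^{−1}B*).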
We now return to the original distribution (9.1). Let us assume that m = 0. A system of polynomials, which we will call the Hermite polynomials, although as a rule this terminology is in use only for n = 1, is closely connected with the density

p(x) = (2π)^{−n/2} √(det A) e^{−(1/2)(Ax,x)}.

Let us consider an arbitrary monomial x_1^{k_1} x_2^{k_2} ⋯ x_n^{k_n} of degree k = k_1 + k_2 + ⋯ + k_n. A Hermite polynomial for this monomial is a polynomial x_1^{k_1} x_2^{k_2} ⋯ x_n^{k_n} + Q(x), where Q(x) is a polynomial of degree less than k, such that

∫ (x_1^{k_1} x_2^{k_2} ⋯ x_n^{k_n} + Q(x)) Q_1(x) e^{−(1/2)(Ax,x)} dx = 0

for any polynomial Q_1(x) of degree less than k. The following notation is often used for the Hermite polynomial:

x_1^{k_1} x_2^{k_2} ⋯ x_n^{k_n} + Q(x) = : x_1^{k_1} ⋯ x_n^{k_n} : .

If P(x) is a homogeneous polynomial of degree k, its Hermite polynomial :P(x): is defined by linearity. The number k is called the degree of the Hermite polynomial.

Lemma 9.1. The function

P_1(x) = e^{(1/2)(Ax,x)} ∂^k e^{−(1/2)(Ax,x)} / ∂x_1^{k_1} ⋯ ∂x_n^{k_n}

is a Hermite polynomial of degree k.

Proof. For an arbitrary polynomial Q_1 of degree less than k we have

∫ P_1(x) Q_1(x) e^{−(1/2)(Ax,x)} dx_1 ⋯ dx_n = ∫ Q_1(x) (∂^k e^{−(1/2)(Ax,x)} / ∂x_1^{k_1} ⋯ ∂x_n^{k_n}) dx_1 ⋯ dx_n.

We apply k integrations by parts to the last integral; since the degree of Q_1 is less than k we obtain 0. □
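For n = 1 and A = 1 the construction of Lemma 9.1 can be carried out symbolically: e^{x²/2} d^k/dx^k e^{−x²/2} is a polynomial of degree k, orthogonal to all lower-degree polynomials with respect to the weight e^{−x²/2} (it equals the classical Hermite polynomial up to sign). A sketch using SymPy (our illustration):

```python
import sympy as sp

x = sp.symbols('x', real=True)

def hermite_via_derivatives(k):
    """e^{x^2/2} d^k/dx^k e^{-x^2/2}, the n = 1 case of Lemma 9.1."""
    return sp.simplify(sp.exp(x**2 / 2) * sp.diff(sp.exp(-x**2 / 2), x, k))

def is_orthogonal_to_lower_degrees(k):
    """Check the defining orthogonality against all monomials of degree < k."""
    P1 = hermite_via_derivatives(k)
    weight = sp.exp(-x**2 / 2)
    for j in range(k):
        integral = sp.integrate(P1 * x**j * weight, (x, -sp.oo, sp.oo))
        if sp.simplify(integral) != 0:
            return False
    return True

if __name__ == "__main__":
    print(hermite_via_derivatives(3))   # equals 3*x - x**3, i.e. -He_3(x)
    print(is_orthogonal_to_lower_degrees(3))
```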
Lecture 10. The Problem of Percolation

The material of this lecture is not on the program of the examination. It is written in a style close to that of a scientific article, and is concerned with a very current and difficult problem, which at the same time can be formulated in a sufficiently elementary way. The hope is that readers of this lecture can relatively quickly enter this interesting area and pursue their own research.

We begin with the problem of site percolation through the vertices of a two-dimensional lattice. Let V be the set of points (sites) of the plane x = (x_1, x_2) where x_1, x_2 are integers, and |x_1| ≤ L, |x_2| ≤ L. We assume that each site of V can be in two states, vacant and occupied, which we denote by 1 and 0. The state of the whole system V is therefore a function ω = {ω_x} on V, with values in X = {1, 0}, where ω_x describes the state of the point x = (x_1, x_2).

A path in V is a collection of points x^(1), …, x^(r) in V such that ‖x^(i) − x^(i−1)‖ = 1, 1 < i ≤ r. We say that a percolation with respect to 1 from top to bottom takes place through ω if there exists a path x^(1), …, x^(r) such that x^(1) = (x_1^(1), L), x^(r) = (x_1^(r), −L) and ω_{x^(i)} = 1, 1 ≤ i ≤ r.

We can define in analogous ways percolation with respect to 0 from top to bottom, and percolation from left to right. Another modification of the percolation problem can be set up in the following way. Let us consider the set V' of edges of V, i.e. the set of pairs of points b = (x, y), ‖x − y‖ = 1, for x, y ∈ V. As before, we assume that each edge, or bond, can be in two states, 1 (passable bond) or 0 (blocked bond). Instead of ω = {ω_x} we are given ω = {ω_b}, the collection of states for all the bonds. A path is a sequence of bonds b^(1), …, b^(r) such that the origin of b^(i) is the same as the end of b^(i−1), 1 < i ≤ r. We say that percolation (or conductivity) with respect to 1 from top to bottom takes place for ω if there exists a continuous path b^(1), …, b^(r) such that b^(1) begins on the line x_2 = L, b^(r) ends on the line x_2 = −L, and ω_{b^(i)} = 1, 1 ≤ i ≤ r.

The bond percolation problem has an immediate physical interpretation.
Let us assume that the bonds of the lattice are made of two materials, one
conductive and the other non-conductive. Conductivity means that if we put
different potentials on the top and bottom sides of V then a current will flow
along a conductive path. There is an interesting book by A. L. Efros entitled
"Physics and Geometry of Chaos", published in the series "Kvant Library" in
1982, in which many important problems of physics more or less connected
with the property of conductivity are described.
We denote by Ω the space of all possible ω, and by Ω^(perc) the subset
of those ω ∈ Ω for which percolation from top to bottom occurs (in the
site or bond percolation problem). We assume, on a given Ω, a probability
distribution which corresponds to a sequence of independent identical trials.
This means that we are given two numbers, p = p(1) ≥ 0, q = 1 − p =
p(0) ≥ 0, and that we set p(ω) = Π_x p(ω_x) (in the site percolation case) and
p(ω) = Π_b p(ω_b) (in the bond percolation case). Then P^(perc) = P(Ω^(perc)) =
Σ_{ω ∈ Ω^(perc)} p(ω) is the probability of percolation from top to bottom, and it is
a function of L and p.
We now say that percolation from top to bottom occurs for a given p if
lim_{L→∞} P^(perc) = 1. The main problem is then to find the structure of the
set of p's for which percolation occurs. The conclusion, which is difficult to
prove, is that this set is an interval of the form (p_c, 1], and it is not a simple
task to find the critical probability p_c beyond which percolation takes place.
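These definitions translate directly into a small numerical experiment. The following sketch (function names, lattice indexing and parameters are our own choices, not the book's) estimates P^(perc)(L, p) for the site problem: it draws independent configurations in which each site equals 1 with probability p and tests, by breadth-first search, for a path of 1's from the top row to the bottom row.

```python
import random
from collections import deque

def percolates(grid):
    """Percolation with respect to 1 from top to bottom: is there a
    nearest-neighbour path of 1-sites joining the top row to the bottom row?"""
    n = len(grid)
    seen = [[False] * n for _ in range(n)]
    queue = deque((0, j) for j in range(n) if grid[0][j] == 1)
    for _, j in queue:
        seen[0][j] = True
    while queue:
        i, j = queue.popleft()
        if i == n - 1:
            return True           # reached the bottom row
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            a, b = i + di, j + dj
            if 0 <= a < n and 0 <= b < n and grid[a][b] == 1 and not seen[a][b]:
                seen[a][b] = True
                queue.append((a, b))
    return False

def estimate_perc_prob(L, p, trials=200, seed=0):
    """Monte Carlo estimate of P^(perc)(L, p): every site of the
    (2L+1) x (2L+1) lattice is 1 with probability p, independently."""
    rng = random.Random(seed)
    n = 2 * L + 1
    hits = sum(
        percolates([[1 if rng.random() < p else 0 for _ in range(n)]
                    for _ in range(n)])
        for _ in range(trials)
    )
    return hits / trials
```

Running `estimate_perc_prob` for a range of p on a modest lattice shows the estimate climbing from near 0 to near 1 around the empirical critical value p_c ≈ 0.58 quoted at the end of this lecture.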
From now on we deal with site percolation. Our goal is to prove the
following relatively simple statement.

Theorem 10.1. There exists a p_c^(1) < 1 such that for all p ≥ p_c^(1) percolation
with respect to 1 from top to bottom and from left to right takes place.

Proof. Let ω = {ω_x} ∈ Ω and let us first make the following geometric con-
struction. For each x ∈ V with ω_x = 0, let us construct a closed square Q(x)
with smooth corners (see Fig. 10.1), centered at this point, and with sides of
length 3.

(Fig. 10.1: the closed square Q(x) with smooth corners, centered at the site
x = (x_1, x_2). Fig. 10.2: a domain Q(ω) and its connected components.)

We set Q(ω) = ∪_{x: ω_x = 0} Q(x) and by convention we call Q(ω) the domain
occupied by 0. Such a domain Q(ω) can be split into its maximal connected
components Q_1, Q_2, ..., Q_r. We recall that this means that each component
Q_i is a connected set, i.e. any two of its points can be joined by a path which is
contained in that set, and that Q_i cannot be enlarged without losing this
property. Therefore the Q_i do not intersect. Any two components either lie
outside one another, or one component lies in the interior part (the interior)

of another component (see Fig. 10.2, where Q is the shaded domain). A com-
ponent Q_i is said to be exterior if it is not contained in the interior region of
another component. Let us denote by V^(0)(ω) the set of points x = (x_1, x_2)
which lie outside all exterior components, and therefore outside all compo-
nents Q_i, 1 ≤ i ≤ r. The following geometric statement is quite clear, and the
proof of it will be left to the reader: the set V^(0)(ω) is connected, i.e. any two
of its points can be connected by a path. This is due to the fact that V^(0)(ω)
is obtained by removing the exterior components one after the other, as well
as their interior regions, and each such operation preserves connectedness.
It follows from connectedness that percolation from top to bottom with
respect to 1 will occur for ω if V^(0)(ω) intersects the lines x_2 = L and x_2 = −L.
We now show that the probability of each of these events converges to 1 as
L → ∞ if p is sufficiently large. It is sufficient to limit ourselves to the case of
the line x_2 = L. The probability of percolation from left to right is found in
the same way.
Let Q be a given connected set. We introduce the indicator function χ_Q(ω),
which takes the value 1 if Q is one of the components Q_i and 0 otherwise.
Consider a point of the form (x_1, L) ∈ V. We will call this point interior (for
the configuration ω) if there exists a component Q_i such that (x_1, L) is an
element of Q_i or of its interior. It is clear that the points which are not interior
belong to V^(0)(ω). Furthermore

P^(int)(x_1) = Σ_{ω: (x_1, L) is interior} p(ω) = Σ_Q Σ_{ω ∈ Ω} p(ω)·χ_Q(ω) = Σ_Q π_Q,
with π_Q = Σ_{ω ∈ Ω} p(ω)·χ_Q(ω),   (10.1)

where the last sum is carried out over all components Q such that (x_1, L)
belongs to Q or to the interior of Q.

Lemma 10.1. There exist constants c_1, c_2 < ∞ such that π_Q ≤ (c_1 q)^{c_2 |Q|}.

The proof of this lemma will be carried out a little later. We intro-
duce the lexicographic ordering on the points of the plane. This means that
(x_1, x_2) < (x_1', x_2') if either x_1 < x_1', or x_1 = x_1' and x_2 < x_2'. Given a con-
nected set Q, we define its origin to be the smallest point in the sense of the
lexicographic ordering.

Lemma 10.2. The number of connected sets Q, with |Q| = n, which have
their origin at the point (x_1, x_2), is not greater than 16^n.

The proof of this lemma will also be carried out a little later. We now find an
estimate of P^(int). By Lemma 10.1 and (10.1) we have

P^(int)(x_1) ≤ Σ_Q (c_1 q)^{c_2 |Q|},

where the sum is carried out over all components Q which contain the point
(x_1, L). Furthermore, from Lemma 10.2 we have

Σ_Q (c_1 q)^{c_2 |Q|} = Σ_n Σ_{(x_1', x_2')} Σ_{Q: the origin of Q is (x_1', x_2'), |Q| = n} (c_1 q)^{c_2 n}
≤ Σ_{n=1}^∞ (c_1 q)^{c_2 n} · n^2 · 16^n.

Here we used the fact that for a fixed point (x_1, L) and a fixed value |Q| = n
the origin of Q can be one of at most n^2 points.
Let p be large enough and q = 1 − p be small enough so that (c_1 q)^{c_2}·16 < 1.
Then the latter series converges. Let us denote its sum by d(p). It is clear that
d(p) → 0 as p → 1.
We introduce the random variable ξ(ω) equal to the number of interior
points on the line x_2 = L, and the random variables η_{x_1}(ω), where

η_{x_1}(ω) = 1 if (x_1, L) is an interior point, and η_{x_1}(ω) = 0 otherwise.

We clearly have ξ(ω) = Σ_{x_1 = −L}^{L} η_{x_1}(ω), E η_{x_1}(ω) = P^(int)(x_1), and

E ξ(ω) = Σ_{x_1 = −L}^{L} E η_{x_1}(ω) ≤ (2L + 1)·d(p).

This inequality shows that the average number of interior points on the line
X2 = L is relatively small. The latter statement is not sufficient yet to prove
the theorem. We also need to estimate the variance Var ξ.
We have

Var ξ(ω) = E( Σ_{x_1 = −L}^{L} (η_{x_1}(ω) − P^(int)(x_1)) )^2
= Σ_{x_1, x_2 = −L}^{L} E (η_{x_1}(ω) − P^(int)(x_1)) (η_{x_2}(ω) − P^(int)(x_2)).

Furthermore

η_{x_1}(ω) = Σ_Q χ_Q(ω),   P^(int)(x_1) = Σ_Q π_Q,

where the summation in both cases is carried out over the components Q for
which (x_1, L) belongs to Q or to the interior of Q; see (10.1) for the definition
of π_Q. Therefore

m_{x_1 x_2} = E (η_{x_1}(ω) − P^(int)(x_1)) (η_{x_2}(ω) − P^(int)(x_2))
= E ( Σ_{Q_{x_1}} (χ_{Q_{x_1}}(ω) − π_{Q_{x_1}}) · Σ_{Q_{x_2}} (χ_{Q_{x_2}}(ω) − π_{Q_{x_2}}) ).

Here Q_{x_1} (or Q_{x_2}) denote the components which correspond to x_1 (or x_2).
We interchange the operations of summation and expectation and obtain

m_{x_1 x_2} = Σ_{Q_{x_1}, Q_{x_2}} E (χ_{Q_{x_1}}(ω) − π_{Q_{x_1}}) (χ_{Q_{x_2}}(ω) − π_{Q_{x_2}}).   (10.2)

We now note, and this is the main probabilistic part of the proof, that the
random variables χ_{Q_{x_1}} − π_{Q_{x_1}} and χ_{Q_{x_2}} − π_{Q_{x_2}} are independent
if Q_{x_1} ∩ Q_{x_2} = ∅. Therefore

E (χ_{Q_{x_1}}(ω) − π_{Q_{x_1}}) (χ_{Q_{x_2}}(ω) − π_{Q_{x_2}})
= E (χ_{Q_{x_1}}(ω) − π_{Q_{x_1}}) · E (χ_{Q_{x_2}}(ω) − π_{Q_{x_2}}) = 0,

and therefore in (10.2) we can perform the summation only with respect to those
components Q_{x_1} and Q_{x_2} to which both x_1 and x_2 are interior. We denote
such a summation by Σ'. Then

Var ξ(ω) = Σ_{x_1, x_2} m_{x_1 x_2}
= Σ_{x_1, x_2} Σ'_{Q_{x_1}, Q_{x_2}} E (χ_{Q_{x_1}}(ω) − π_{Q_{x_1}}) (χ_{Q_{x_2}}(ω) − π_{Q_{x_2}}).

Since two components either coincide or do not intersect, χ_{Q_{x_1}}(ω)·χ_{Q_{x_2}}(ω) =
χ_{Q_{x_1}}(ω) for Q_{x_1} = Q_{x_2} and 0 for Q_{x_1} ≠ Q_{x_2}. The latter expression therefore
takes the form

Var ξ(ω) = Σ_{x_1, x_2} Σ'_{Q_{x_1}, Q_{x_2}} E χ_{Q_{x_1}}(ω) χ_{Q_{x_2}}(ω)
− Σ_{x_1, x_2} Σ'_{Q_{x_1}, Q_{x_2}} π_{Q_{x_1}} π_{Q_{x_2}}.

The first sum is less than or equal to Σ_{x_1} Σ_{Q_{x_1}} π_{Q_{x_1}} = E ξ ≤ (2L + 1)·d(p).
The second sum is negative. Therefore

Var ξ ≤ (2L + 1)·d(p).
We can now use Chebyshev's inequality in its simple form:

P{ ξ(ω) ≤ 2(2L + 1)·d(p) }
= P{ ξ(ω) − E ξ(ω) ≤ 2(2L + 1)·d(p) − E ξ }
≥ P{ ξ(ω) − E ξ(ω) ≤ (2L + 1)·d(p) }
≥ P{ |ξ(ω) − E ξ(ω)| ≤ (2L + 1)·d(p) }
≥ 1 − Var ξ / ((2L + 1)^2·d^2(p)) ≥ 1 − 1/((2L + 1)·d(p)) → 1

as L → ∞. Let p be large enough so that 2d(p) < 1/2. Then with probability
converging to 1 as L → ∞, more than one half of the points on the line x_2 = L
are non-interior, i.e. belong to V^(0)(ω). □

Lemmas 10.1 and 10.2 must still be proved.

Proof of Lemma 10.1. Assume that a set Q is given and that values ω_x are
fixed for x ∈ Q. The probability of such an event is equal to p^k q^{|Q| − k}, where
k is the number of ones, and |Q| − k is the number of zeroes. We now note
that for a given set Q the number of subsets of Q where a zero appears is less
than or equal to the total number of all subsets of Q, i.e. 2^{|Q|}. Furthermore,
near each one, at a distance less than or equal to √2, lies a 0, since Q is the
union of squares (such as in Fig. 10.1) at the center of each of which lies a 0.
Therefore |Q| − k ≥ (1/9)|Q|. Finally we obtain

π_Q ≤ 2^{|Q|} · q^{(1/9)|Q|} = (c_1 q)^{(1/9)|Q|}.

But this is the statement of the lemma for c_1 = 2^9, c_2 = 1/9. □
Proof of Lemma 10.2. We show that to each connected set Q we can associate
a word (i_1, i_2, ..., i_n) of length n, where each i_j takes at most 16 values, in such
a way that Q is uniquely determined by this word.
Let x be a point in Q. We consider the set R of its neighbors, i.e. those
points y such that d(x, y) ≤ 1 and y ∈ Q. The number of such subsets is no
greater than the number of all subsets of a set with four elements (the set of
all neighbors), which equals 2^4 = 16, so R can be indexed by a number from 1
to 16.
We now consider the point x = (x_1, x_2) which is the origin of Q, and
the set R_1 of its neighbors which belong to Q. We then consider the smallest
point in R_1 which is distinct from x, and we denote by R_2 the set of its
neighbors. Then we consider the smallest of the remaining points, and denote
by R_3 the set of its neighbors, and so forth. In this fashion we obtain a word
(R_1, R_2, ..., R_n) of length n in which each R_i takes at most 16 values. It is
clear that Q is uniquely determined by this word. □

For the bond percolation problem the exact value of the critical probability
is known to be equal to 1/2. This was proved rigorously relatively recently
by the American mathematician H. Kesten (see H. Kesten, "Percolation Theory
for Mathematicians", Birkhäuser, 1982). Many numerical experiments show that
for the site percolation problem on a two-dimensional square lattice, p_c ≈ 0.58.
The Japanese mathematician Y. Higuchi showed that p_c > 1/2. Recently a
Hungarian mathematician, B. Tóth, proved by elementary methods that p_c >
0.503. We also quote the result of the Soviet mathematician L. Mitiushin,
which shows that in the case of the lattice Z^d in a space of sufficiently high
dimension d, for the site percolation problem, there exists an interval of values
of p which contains the value p = 1/2 in its interior, for which percolation is
possible both with respect to 1 and with respect to 0.
Lecture 11. Distribution Functions, Lebesgue
Integrals and Mathematical Expectation

11.1 Introduction
In this lecture we introduce once more the fundamental concepts of probability
theory, which we have encountered already, but in a much more general form
than previously.
We define a probability space as a triplet (Ω, F, P), where Ω is the space of
elementary outcomes, F is a σ-algebra of subsets of Ω, and P is a probability
measure defined on the events C ∈ F.
The choice of the σ-algebra F plays an important role in many problems,
in particular in those problems which are connected to the theory of stochastic
processes. In the cases where Ω is a complete separable metric space it is usual
to take for F the Borel σ-algebra, that is, the smallest σ-algebra which contains
all the open balls, i.e. the sets of the form {ω | ρ(ω, ω_0) < r} = B_r(ω_0). Such a
σ-algebra is denoted by B(Ω).
Let ξ = f(ω) be a real-valued function on Ω.

Definition 11.1. ξ is said to be a random variable if for any Borel set A ⊂ R^1

{ω | f(ω) ∈ A} = f^{-1}(A) ∈ F.

The property given in Definition 11.1 is sometimes called the measurability
of f with respect to the σ-algebra F.
It is easy to check that
1) f^{-1}(R^1) = Ω;
2) f^{-1}(R^1 \ A) = Ω \ f^{-1}(A);
3) f^{-1}(∪_{i=1}^∞ A_i) = ∪_{i=1}^∞ f^{-1}(A_i).
It follows that the class of subsets of the form f^{-1}(A), where A ∈ B(R^1),
is a σ-subalgebra of the σ-algebra F. This σ-algebra is said to be the σ-algebra
generated by the random variable ξ. We will denote it by F_ξ.
We assume that the random variable ξ takes a finite or countable number
of values which we write in a sequence {a_1, a_2, ...}. We set C_i = {ω | ξ =
f(ω) = a_i}. Then
1) C_i ∩ C_j = ∅ for i ≠ j;
2) ∪_i C_i = Ω.

In other words, the events C_i define a partition of the space Ω. It is easy
to see that each C ∈ F_ξ is a union of some of the C_i. One can formally construct
the partition which corresponds to an arbitrary random variable ξ. For this
we introduce the sets C_t = {ω | f(ω) = t}, −∞ < t < ∞. It is clear that the
C_t, for distinct t, do not intersect and that ∪_{−∞ < t < ∞} C_t = Ω. One can show,
under some additional assumptions on the probability space (Ω, F, P), that
this partition is measurable (see Lecture 8). The corresponding probability
distribution on C_t is called the conditional distribution, conditional on the
random variable ξ taking the value t.
We set P_ξ(A) = P(f^{-1}(A)) for every A ∈ B(R^1). It is easy to check that
P_ξ(A) is a probability measure defined on the Borel σ-algebra B(R^1).

Definition 11.2. P_ξ is called the probability distribution of the random vari-
able ξ.

We define A_t = (−∞, t] = {z ∈ R^1 : z ≤ t}.

Definition 11.3. Let P be a probability measure on the line R^1. The distribu-
tion function F of the measure P is a function of the variable t, −∞ < t < ∞,
given by the relation F(t) = P(A_t). If P = P_ξ then F_ξ(t) = P_ξ(A_t) is called
the distribution function of the random variable ξ.

11.2 Properties of Distribution Functions

1) F_ξ(t_1) ≤ F_ξ(t_2) for t_1 < t_2. Indeed A_{t_1} ⊂ A_{t_2} for t_1 < t_2 and therefore
f^{-1}(A_{t_1}) ⊆ f^{-1}(A_{t_2}). It then follows from the properties of probabilities
that P(f^{-1}(A_{t_1})) ≤ P(f^{-1}(A_{t_2})).
2) lim_{t→∞} F_ξ(t) = 1, lim_{t→−∞} F_ξ(t) = 0. We only prove the first state-
ment, as the second statement is proved in the same way. Let us take a
monotone sequence {t_i}, with t_i → ∞ as i → ∞. Then the {A_{t_i}} form a
monotonically increasing sequence of subsets of the line and ∪_i A_{t_i} = R^1.
Therefore the f^{-1}(A_{t_i}) form a monotonically increasing sequence of sub-
sets of Ω with ∪_i f^{-1}(A_{t_i}) = Ω. It follows from the σ-additivity of the
probability P that

lim_{i→∞} F_ξ(t_i) = lim_{i→∞} P_ξ(A_{t_i}) = lim_{i→∞} P(f^{-1}(A_{t_i})) = 1.
3) F_ξ(t) is continuous from the right. Indeed let t_i ↓ t. Then the A_{t_i} decrease
and ∩_i A_{t_i} = A_t. Therefore from the σ-additivity of probabilities we have

lim_{i→∞} F_ξ(t_i) = lim_{i→∞} P(f^{-1}(A_{t_i})) = P(f^{-1}(A_t)) = F_ξ(t).

4) For any t̄ the probability P(ξ = t̄) = F_ξ(t̄) − lim_{t_i → t̄−0} F_ξ(t_i) for any
sequence {t_i}, t_i ↑ t̄. Indeed ∪_i A_{t_i} = (−∞, t̄). Therefore

lim_{i→∞} F_ξ(t_i) = P{ξ < t̄},
F_ξ(t̄) − lim_{t_i → t̄−0} F_ξ(t_i) = P(ξ ≤ t̄) − P(ξ < t̄) = P(ξ = t̄).

It follows from Property 4) that P(ξ = t̄) = 0 if and only if F_ξ is continuous
at the point t̄.
For any half-open interval (s_1, s_2] with s_1 < s_2, we have

P_ξ((s_1, s_2]) = P({ω | s_1 < ξ ≤ s_2})
= P({ω | ξ ≤ s_2}) − P({ω | ξ ≤ s_1}) = F_ξ(s_2) − F_ξ(s_1).

Thus the distribution function F_ξ(t) uniquely determines the values of P_ξ on
the half-open intervals (s_1, s_2]. It follows from the general theory of extensions
of measures that any σ-additive probability measure, defined on the Borel σ-
algebra on the line, is uniquely determined by its values on such half-open
intervals (s_1, s_2] (Carathéodory's Theorem). Consequently the distribution
function F_ξ(t) uniquely determines the probability distribution P_ξ.

11.3 Types of Distribution Functions


Discrete Type. Assume that the random variable ξ takes a finite or countable
number of values a_1, a_2, a_3, ... with probabilities p_1, p_2, p_3, ..., respectively. By
definition F_ξ(t) = Σ_{i: a_i ≤ t} p_i. Such a distribution function is a step function.
The jumps occur at the points a_i and the height of the jump at a_i is equal to p_i.
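The step function F_ξ(t) = Σ_{i: a_i ≤ t} p_i is easy to evaluate directly. A minimal sketch (our own helper; the two-coin example below is purely illustrative):

```python
import bisect

def make_step_cdf(values, probs):
    """Return F(t) = sum of p_i over those atoms a_i with a_i <= t."""
    pairs = sorted(zip(values, probs))
    xs = [a for a, _ in pairs]
    cum, total = [], 0.0
    for _, p in pairs:
        total += p
        cum.append(total)
    def F(t):
        i = bisect.bisect_right(xs, t)   # number of atoms a_i <= t
        return cum[i - 1] if i > 0 else 0.0
    return F

# ξ = number of heads in two fair coin tosses (illustrative choice)
F = make_step_cdf([0, 1, 2], [0.25, 0.5, 0.25])
print(F(-1), F(0), F(1.5), F(2))  # 0.0 0.25 0.75 1.0
```

Note that `bisect_right` makes F right-continuous, with a jump of height p_i at each a_i, in accordance with Properties 1)-4).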
Absolutely Continuous Type (with Respect to the Lebesgue Measure). This type
is that of a distribution function F_ξ(t) for which there exists a non-negative
function p_ξ(t) such that F_ξ(t) = ∫_{−∞}^{t} p_ξ(u) du. In such a case p_ξ(t) is called
the density of the distribution. We already encountered several examples of
densities in earlier lectures. It is clear that such an F_ξ(t) is continuous at every t.
Singular Type (with Respect to the Lebesgue Measure). There are continuous
distribution functions F_ξ(t) which cannot be represented in the form of an
integral with respect to a density.
The so-called Cantor staircase is an example of such a situation. We set
F_ξ(t) ≡ 1/2 for 1/3 ≤ t ≤ 2/3, F_ξ(t) = 1/4 for 1/9 ≤ t ≤ 2/9, and F_ξ(t) = 3/4
for 7/9 ≤ t ≤ 8/9. The construction process can be described inductively in
the following way. At the n-th step we have intervals of length 3^{−n}, where the
function F_ξ(t) is not yet defined, although it is defined at the end-points of
such intervals. Let us divide any such interval into three parts, and set F_ξ(t) to
be constant on the middle interval, and equal to the half-sum of its values at
the end-points of the preceding interval. For all remaining t the function F_ξ(t)
is extended by continuity. The limit function is called the Cantor staircase
(sometimes the Devil's staircase). Singular distribution functions arise more
and more often in applications.
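The limit function can also be computed non-inductively from the ternary expansion of t: replace every ternary digit 2 by 1, truncate after the first digit equal to 1, and read the resulting digit string in binary. A sketch of this standard algorithm (the routine and its truncation depth are our own choices):

```python
def cantor_cdf(t, depth=40):
    """Approximate the Cantor staircase F(t) on [0, 1] via the ternary
    expansion of t: ternary digit 2 becomes binary digit 1, and the
    expansion stops after the first ternary digit 1."""
    if t <= 0.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    result, scale = 0.0, 0.5
    for _ in range(depth):
        t *= 3.0
        digit = int(t)
        t -= digit
        if digit == 1:            # t falls inside a removed middle third
            result += scale
            break
        result += scale * (digit // 2)
        scale *= 0.5
    return result
```

For instance, cantor_cdf(0.5) gives 1/2, cantor_cdf(0.2) gives 1/4, and cantor_cdf(0.8) gives 3/4, matching the plateaus in the inductive construction above.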

Any distribution function can be represented in a unique way as a sum of
distribution functions of the three types above.

(Fig. 11.1: the Cantor staircase, with levels 1/2, 1/4, 1/8, ... .)

In what follows, when talking about distribution functions we will always
mean functions F(x) which satisfy Properties 1)-3), without referring to their
dependence on ξ.
The random variables which we introduced in Definition 11.1 are some-
times called real-valued random variables. It is often useful to consider complex
or vector-valued random variables or, more generally, random variables with
values in an abstract space. We now introduce a corresponding definition. Let
(X, 𝒳) be a measurable space.

Definition 11.4. The transformation ξ = f(ω) : Ω → X is said to be mea-
surable if for any A ∈ 𝒳 the inverse image f^{-1}(A) = {ω | f(ω) ∈ A} ∈ F.

The transformation ξ will be called an X-valued random variable. We
set P_ξ(A) = P({ω | f(ω) ∈ A}) for any A ∈ 𝒳. Then P_ξ is a probability
measure defined on the σ-algebra 𝒳. It is called the probability distribution
corresponding to the transformation ξ.
We now return to real-valued random variables. A random variable is said
to be simple if it takes a finite or countable number of values. Simple random
variables form a vector space over the field of real numbers. The product of
a finite number of simple random variables is a simple random variable. The
quotient of two simple random variables is a simple random variable, provided
that the denominator does not take the value zero.

Theorem 11.1. Any random variable ξ = f(ω) is a monotone limit from
below of simple random variables, i.e. ξ = f(ω) = lim_{n→∞} f_n(ω), where ξ_n =
f_n(ω) are simple random variables and f_n(ω) ↑ f(ω) for every ω. Conversely,
if ξ = f(ω) is a monotone limit from below of simple random variables then
ξ is also a random variable.

Proof. Let ξ_n = f_n(ω) be defined by the relations

f_n(ω) = k·2^{−n}  if  k·2^{−n} < f(ω) ≤ (k + 1)·2^{−n}.

The sequence ξ_n satisfies the requirements of the theorem.
We now prove the converse statement. Given a function ξ = f(ω) we
consider the subsets A ⊂ R^1 for which f^{-1}(A) ∈ F. It is easy to see that the
class of such subsets forms a σ-algebra which we will denote by R_ξ. We prove
that the half-open intervals A_t = (−∞, t] ∈ R_ξ. Indeed, since f_n ↑ f, it is easy
to check the following relation:

f^{-1}(A_t) = ∩_{n=1}^∞ {ω | f_n(ω) ≤ t}.

Since {ω | f_n(ω) ≤ t} ∈ F, f^{-1}(A_t) ∈ F, i.e. A_t ∈ R_ξ. In Lecture 2 we
proved that the smallest σ-algebra which contains all the A_t is B(R^1). Therefore
B(R^1) ⊂ R_ξ. □
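The approximating simple functions f_n of the proof can be written down explicitly. A small sketch (names are ours; it works for any real-valued f, since k ranges over all integers):

```python
import math

def dyadic_approx(f, n):
    """The n-th simple approximation from the proof of Theorem 11.1:
    f_n(w) = k * 2^{-n}  whenever  k * 2^{-n} < f(w) <= (k + 1) * 2^{-n},
    so that f_n < f <= f_n + 2^{-n} and f_n increases to f as n grows."""
    def f_n(w):
        # k = ceil(2^n f(w)) - 1 is the unique integer with k < 2^n f(w) <= k + 1
        return (math.ceil(f(w) * 2 ** n) - 1) * 2.0 ** -n
    return f_n

f = lambda w: w * w          # an arbitrary (hypothetical) function on Ω = R
g3 = dyadic_approx(f, 3)
print(g3(0.84))              # prints 0.625, since 0.625 < 0.84² = 0.7056 <= 0.75
```

Each f_n takes countably many values, so it is simple, and f_n(w) ↑ f(w) pointwise, exactly as the theorem requires.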

Let g(x_1, ..., x_p) be a function of p real variables which is measurable with
respect to the σ-algebra B(R^p).

Theorem 11.2. For any random variables ξ_1 = f_1(ω), ..., ξ_p = f_p(ω) the
function η = g(ξ_1, ..., ξ_p) is also a random variable.

Proof. Using Theorem 11.1 we construct a monotone sequence of simple func-
tions g_n(x_1, ..., x_p) such that g_n ↑ g for any choice of (x_1, ..., x_p). Then η is
a monotone limit of the functions η_n = g_n(ξ_1, ..., ξ_p), which take at most a
countable number of values. It is therefore enough to prove that η_n is a simple
random variable.
We denote the values of η_n by a_1, a_2, ..., and we set A_i =
{(x_1, ..., x_p) ∈ R^p | g_n(x_1, ..., x_p) = a_i}. Then A_i ∈ B(R^p) since the
g_n are measurable with respect to the σ-algebra B(R^p). To complete the
proof of the theorem we must prove that for any A ∈ B(R^p) we have
{ω | (f_1(ω), ..., f_p(ω)) ∈ A} ∈ F. We use the following statement: let A_{t_1,...,t_p} =
{(x_1, ..., x_p) : x_i ≤ t_i for 1 ≤ i ≤ p}; then B(R^p) is the smallest σ-algebra
generated by all sets of the form A_{t_1,...,t_p}. The proof of this statement is carried
out in exactly the same way as for p = 1.
We now argue as in the proof of Theorem 11.1. Denote by R_{ξ_1,...,ξ_p} the
class of those subsets A ⊂ R^p such that {ω | (f_1(ω), ..., f_p(ω)) ∈ A} ∈ F. This
class forms a σ-algebra of subsets of R^p. Also A_{t_1,...,t_p} ∈ R_{ξ_1,...,ξ_p} since

{ω | (f_1(ω), ..., f_p(ω)) ∈ A_{t_1,...,t_p}} = ∩_{i=1}^p {ω | f_i(ω) ≤ t_i} ∈ F.

Consequently B(R^p) ⊆ R_{ξ_1,...,ξ_p} and therefore

{ω | (f_1(ω), ..., f_p(ω)) ∈ A_i} ∈ F,  i = 1, 2, ...

Thus η_n = g_n(ξ_1, ..., ξ_p) is a random variable. □

Corollary 11.1. If ξ_1, ..., ξ_p are random variables and a_1, ..., a_p are real
numbers, then η = a_1 ξ_1 + ... + a_p ξ_p is a random variable.

Corollary 11.2. If ξ_1, ..., ξ_p are random variables then η = ξ_1 × ... × ξ_p is
also a random variable.

Corollary 11.3. If ξ_1 and ξ_2 are random variables, with ξ_2 ≠ 0 for every ω,
then η = ξ_1/ξ_2 is also a random variable.

In order to prove Corollary 11.1 it suffices to take as g in Theorem 11.2
g(x_1, ..., x_p) = a_1 x_1 + ... + a_p x_p. To prove Corollary 11.2 we just need to take
g(x_1, ..., x_p) = x_1 × ... × x_p. And to prove Corollary 11.3 we take g(x_1, x_2) =
x_1/x_2.
The Lebesgue integral plays a central role in probability theory. We briefly
recall how it is constructed. Let ξ = f(ω) be a simple random variable taking
non-negative values, which we denote by a_1, a_2, ..., and let C_i = {ω | f(ω) =
a_i}.

Definition 11.5. The series E ξ = Σ_{i=1}^∞ a_i P(C_i) is said to be the Lebesgue
integral of the random variable ξ.

In probability theory E ξ is called the mathematical expectation of the
random variable ξ. If E ξ < ∞ we say that the random variable ξ has a finite
mathematical expectation. It is clear that
1) E ξ ≥ 0;
2) E 1 = 1, where 1 is the random variable identically equal to 1 on Ω;
3) E(a ξ_1 + b ξ_2) = a E ξ_1 + b E ξ_2 for any a, b > 0;
4) E ξ_1 ≥ E ξ_2 if ξ_1 ≥ ξ_2.
We give a short proof of Property 4). Let the values of the random variable
ξ_i be denoted by a_1^(i), a_2^(i), ..., for i = 1, 2, and let A_k^(i) denote the set on
which the value a_k^(i) is taken. We set B_{j_1, j_2} = A_{j_1}^(1) ∩ A_{j_2}^(2). Then on
the set B_{j_1, j_2}, ξ_1 takes the value a_{j_1}^(1), and ξ_2 takes the value a_{j_2}^(2).
We now have

E ξ_1 = Σ_{j_1, j_2} a_{j_1}^(1) P(B_{j_1, j_2}),  E ξ_2 = Σ_{j_1, j_2} a_{j_2}^(2) P(B_{j_1, j_2}),

and

E ξ_1 − E ξ_2 = Σ_{j_1, j_2} (a_{j_1}^(1) − a_{j_2}^(2)) P(B_{j_1, j_2}) ≥ 0,

since a_{j_1}^(1) ≥ a_{j_2}^(2) on any B_{j_1, j_2}.

Now let ξ = f(ω) be an arbitrary random variable taking non-negative
values. We consider a sequence ξ_n = f_n(ω) of non-negative simple random
variables which converge monotonically to ξ = f(ω) from below, i.e. ξ_{n_1} ≤ ξ_{n_2}
for n_1 ≤ n_2 and f(ω) = lim_{n→∞} f_n(ω) for every ω. It follows from Property

4) of E ξ that the sequence of the E ξ_n is non-decreasing and therefore that
there exists a limit lim_{n→∞} E ξ_n, which is possibly infinite.

Theorem 11.3. The value of lim_{n→∞} E ξ_n does not depend on the choice of
the approximating sequence.

Proof. We first establish the following lemma.

Lemma 11.1. Let η ≥ 0 be a simple random variable, η ≤ ξ, and assume that
ξ = lim_{n→∞} ξ_n, where the ξ_n are non-negative simple random variables,
ξ_{n+1} ≥ ξ_n. Then E η ≤ lim_{n→∞} E ξ_n.

Proof. Let us take an arbitrary ε > 0 and set C_n = {ω | ξ_n(ω) − η(ω) > −ε}.
It then follows from the monotonicity of ξ_n that C_n ⊆ C_{n+1}, and it follows
from the fact that ξ_n ↑ ξ, and from the inequality ξ ≥ η, that ∪_n C_n = Ω.
Therefore P(C_n) → 1 as n → ∞. We introduce the indicator function χ_{C_n},
where

χ_{C_n}(ω) = 1 for ω ∈ C_n, and χ_{C_n}(ω) = 0 for ω ∉ C_n.
Then η_n = η·χ_{C_n} is a simple random variable and η_n ≤ ξ_n(ω) + ε. Therefore
by the monotonicity of E ξ_m we have

E η_n ≤ E ξ_n(ω) + ε ≤ lim_{m→∞} E ξ_m + ε.

Since ε was arbitrary we obtain E η_n ≤ lim_{m→∞} E ξ_m. We still must prove
that lim_{n→∞} E η_n = E η.
We denote by b_1, b_2, ... the values of the random variable η and by B_i the
set where the value b_i is taken, i = 1, 2, .... Then

E η_n = Σ_i b_i P(B_i ∩ C_n).

It is clear that for all i we have lim_{n→∞} P(B_i ∩ C_n) = P(B_i).
Since the series above consists of non-negative terms, and since the conver-
gence is monotone for each i, we have

lim_{n→∞} E η_n = lim_{n→∞} Σ_i b_i P(B_i ∩ C_n)
= Σ_i b_i lim_{n→∞} P(B_i ∩ C_n) = Σ_i b_i P(B_i) = E η.

It is now easy to obtain the independence of lim_{n→∞} E ξ_n from the choice of
the approximating sequence.
Let there be two sequences {ξ_n^(1)}, ξ_{n+1}^(1) ≥ ξ_n^(1), and {ξ_n^(2)},
ξ_{n+1}^(2) ≥ ξ_n^(2), such that lim_{n→∞} ξ_n^(1) = lim_{n→∞} ξ_n^(2) = ξ for every ω.
It follows from Lemma 11.1 that for any k, E ξ_k^(1) ≤ lim_{n→∞} E ξ_n^(2), and
therefore lim_{k→∞} E ξ_k^(1) ≤ lim_{n→∞} E ξ_n^(2). We obtain lim_{n→∞} E ξ_n^(2) ≤
lim_{n→∞} E ξ_n^(1) by exchanging ξ_n^(1) and ξ_n^(2); i.e. lim_{n→∞} E ξ_n^(1) =
lim_{n→∞} E ξ_n^(2). □

The limit lim_{n→∞} E ξ_n is called the mathematical expectation of the ran-
dom variable ξ. We will denote it by the symbol E ξ.


The mathematical expectation thus defined satisfies Properties 1) - 4)
given above.
e
Now let be an arbitrary random variable. We introduce the indicator
functions:
I if e = f(w) ~ 0,
x+ (w) = { 0: if e = f(w) < 0,
e= f(w) < 0,
{I if
ife=f(w)~O.
x-(w)= 0:
Then x+ (w) + x- (w) == 1, e = eX+ + eX- = el - e2, where el = eX+ and
e2 = -eX-. Moreover el ~ 0, e2 ~ 0 so the mathematical expectations Eel
and E e2 have already been defined. .

Definition 11.6. The random variable ξ has a finite mathematical expecta-
tion if E ξ_1 < ∞ and E ξ_2 < ∞. In this case it is equal to E ξ = E ξ_1 − E ξ_2. If
E ξ_1 = ∞ and E ξ_2 < ∞ (E ξ_1 < ∞, E ξ_2 = ∞) then E ξ = ∞ (E ξ = −∞). If
E ξ_1 = ∞ and E ξ_2 = ∞ then E ξ is not defined.

Since |ξ| = ξ_1 + ξ_2, E|ξ| = E ξ_1 + E ξ_2, and so E ξ is finite if and only if
E|ξ| is finite.
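The decomposition ξ = ξχ^+ + ξχ^− = ξ_1 − ξ_2 and the resulting identity E ξ = E ξ_1 − E ξ_2 can be checked on a toy finite probability space (our own illustrative choice of Ω and P, with P uniform):

```python
def split_parts(f):
    """Decompose ξ = f(ω) as ξ = ξ1 − ξ2 with ξ1 = ξ·χ⁺ ≥ 0 and
    ξ2 = −ξ·χ⁻ ≥ 0, as in Definition 11.6."""
    xi1 = lambda w: f(w) if f(w) >= 0 else 0.0
    xi2 = lambda w: -f(w) if f(w) < 0 else 0.0
    return xi1, xi2

# Toy finite Ω with the uniform measure (purely illustrative):
omega = [-2.0, -1.0, 0.5, 3.0]
f = lambda w: w
xi1, xi2 = split_parts(f)
E = lambda g: sum(g(w) for w in omega) / len(omega)
print(E(xi1) - E(xi2), E(f))  # both equal 0.125
```

Since both parts are non-negative, their expectations are always defined, which is exactly why the definition splits ξ this way.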
The mathematical expectation E ξ that we have introduced satisfies all the
properties described in Lecture 1. In particular
1) E(a_1 ξ_1 + a_2 ξ_2) = a_1 E ξ_1 + a_2 E ξ_2 if E ξ_1 and E ξ_2 are finite;
2) E ξ ≥ 0 if ξ ≥ 0;
3) E 1 = 1.
One sometimes says that E ξ is a non-negative normed linear functional
on the linear space of random variables ξ. The space of random variables for
which E|ξ| < ∞ is denoted by L^1(Ω, F, P).
The variance of the random variable ξ is defined to be E(ξ − E ξ)^2, the
n-th order moment is defined to be E ξ^n, and the n-th order central moment
is defined to be E(ξ − E ξ)^n. Given two random variables ξ_1 and ξ_2 their
covariance is defined by Cov(ξ_1, ξ_2) = E(ξ_1 − E ξ_1)(ξ_2 − E ξ_2). The correlation
coefficient of two random variables ξ_1, ξ_2 is defined to be ρ(ξ_1, ξ_2) =
E(ξ_1 − E ξ_1)(ξ_2 − E ξ_2)/√(Var ξ_1 × Var ξ_2).
The space of random variables ξ for which E|ξ|^p < ∞ is denoted by
L^p(Ω, F, P).
It is an essential fact in probability theory that the mathematical expec-
tation, the moments, the variance and so forth of a random variable ξ can be
found if the measure P_ξ is known or if the distribution function F_ξ is known.

Theorem 11.4. Let ξ be a random variable and g(x) be a Borel function on
R^1. Then for the random variable η = g(ξ)

E η = ∫_{−∞}^{∞} g(x) dP_ξ(x).

If P_ξ has a density p_ξ(x) then the latter integral can be written as a Lebesgue
integral E η = ∫_{−∞}^{∞} g(x) p_ξ(x) dx.

Corollary 11.4. E ξ = ∫_{−∞}^{∞} x dP_ξ(x).

Corollary 11.5. E ξ^p = ∫_{−∞}^{∞} x^p dP_ξ(x).

Corollary 11.6. Var ξ = ∫_{−∞}^{∞} (x − E ξ)^2 dP_ξ(x).

Proof of theorem. As usual we first consider the case when g ≥ 0 and takes
a finite or countable number of values g_1, g_2, .... Let C_i = {x | g(x) = g_i} ∈
B(R^1), A_i = f^{-1}(C_i) ∈ F. Then the random variable η takes the values
g_1, g_2, ... and by definition E η = Σ g_i P(A_i). On the other hand, also by
definition, P(A_i) = P_ξ(C_i) and Σ g_i P_ξ(C_i) = ∫ g(x) dP_ξ(x). Therefore the
theorem is proved for simple non-negative g. For an arbitrary non-negative g
the result follows by passage to the limit in a sequence of simple non-negative
functions. In the general case we write g = g_1 − g_2, where g_1, g_2 ≥ 0, and we
use the statement of the theorem separately for g_1 and g_2. □
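The density form of Theorem 11.4 is easy to check numerically, for an illustrative choice of distribution and g that is ours, not the book's: take ξ exponentially distributed with density p_ξ(x) = e^{−x} for x ≥ 0, and g(x) = x², so that E g(ξ) = ∫_0^∞ x² e^{−x} dx = 2.

```python
import math
import random

# Hypothetical example: ξ has density p_ξ(x) = e^{-x} (x >= 0) and g(x) = x².
random.seed(0)
n = 100_000
mc = sum(random.expovariate(1.0) ** 2 for _ in range(n)) / n  # sample mean of g(ξ)

h = 0.001                         # left Riemann sum of ∫_0^40 x² e^{-x} dx
integral = sum((k * h) ** 2 * math.exp(-k * h) * h for k in range(1, 40_000))

print(mc, integral)               # both are close to 2
```

The sample mean estimates E η directly on the probability space, while the Riemann sum approximates ∫ g(x) p_ξ(x) dx on the line; the theorem says the two computations target the same number.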

Given several random variables ξ_1, ..., ξ_p the concept of joint distribution
function is often used: F_{ξ_1,...,ξ_p}(x_1, ..., x_p) = P{ξ_1 ≤ x_1, ..., ξ_p ≤ x_p}. As
in the case where p = 1, the function F_{ξ_1,...,ξ_p} uniquely defines the mea-
sure P_{ξ_1,...,ξ_p} on B(R^p), where P_{ξ_1,...,ξ_p}(A) = P({ω | (ξ_1(ω), ..., ξ_p(ω)) ∈ A}),
A ∈ B(R^p). The joint distribution function uniquely determines the joint dis-
tribution function of any subset of the set of random variables (ξ_1, ..., ξ_p),
and for the random variable η = g(ξ_1, ..., ξ_p)

E η = ∫ ... ∫_{R^p} g(x_1, ..., x_p) dP_{ξ_1,...,ξ_p}(x_1, ..., x_p).


Lecture 12. General Definition of Independent
Random Variables and Laws of Large
Numbers

The definition of independent real-valued random variables was given in Lec-
ture 4. We now extend this definition to the case of more general random
variables.
Let (X, 𝒳) be a measurable space, and ξ_1, ξ_2 be two X-valued random
variables, i.e. ξ_1 = f_1(ω), ξ_2 = f_2(ω) are given as measurable transformations
from Ω to X. Then by analogy with Lecture 4 the random variables ξ_1, ξ_2
are said to be independent if for any A_1, A_2 ∈ 𝒳 we have the equality P{ξ_1 ∈
A_1, ξ_2 ∈ A_2} = P({ω | f_1(ω) ∈ A_1, f_2(ω) ∈ A_2}) = P{ξ_1 ∈ A_1} P{ξ_2 ∈ A_2}. In
the general case let ξ_1 = f_1(ω), ξ_2 = f_2(ω), ..., ξ_n = f_n(ω) be a collection of
n X-valued random variables.

Definition 12.1. The random variables ξ_1, ξ_2, ..., ξ_n are said to be jointly
independent if for any A_1, A_2, ..., A_n ∈ 𝒳

P{ξ_i ∈ A_i, 1 ≤ i ≤ n} = P({ω | f_i(ω) ∈ A_i, 1 ≤ i ≤ n}) = ∏_{i=1}^n P{ξ_i ∈ A_i}.

We now prove three simple statements about the joint independence of
random variables.

Lemma 12.1. Let g_1(x), ..., g_n(x) be real-valued measurable functions on X.
Then the random variables η_1 = g_1(ξ_1), η_2 = g_2(ξ_2), ..., η_n = g_n(ξ_n) are
jointly independent.

Proof. Let A_i, 1 ≤ i ≤ n, be Borel subsets of the line and let A_i' ∈ 𝒳 be defined
as g_i^{-1}(A_i), i.e. A_i' = {x ∈ X | g_i(x) ∈ A_i}. Then

P{η_i ∈ A_i, 1 ≤ i ≤ n} = P{ξ_i ∈ A_i', 1 ≤ i ≤ n}
= ∏_{i=1}^n P{ξ_i ∈ A_i'} = ∏_{i=1}^n P{η_i ∈ A_i}.

In these equalities we used the joint independence of the ξ_i, 1 ≤ i ≤ n. □

Lemma 12.2. Under the conditions of Lemma 12.1, assume that the random
variables η_1, ..., η_m have finite mathematical expectations, i.e. E|η_i| < ∞,
1 ≤ i ≤ m. Then the random variable η = η_1 × ... × η_m has a finite
mathematical expectation and E η = E η_1 × ... × E η_m.

Proof. If the E η_i are finite then by the definition of the Lebesgue integral we
have E|η_i| < ∞, 1 ≤ i ≤ m. Let

χ^+(t) = 1 for t ≥ 0, 0 for t < 0;   χ^−(t) = 1 for t < 0, 0 for t ≥ 0.

Then χ^+ + χ^− ≡ 1, η_i = η_i(χ^+(η_i) + χ^−(η_i)) = η_i χ^+(η_i) + η_i χ^−(η_i) =
η_{i,+} − η_{i,−}, where η_{i,+} ≥ 0 and η_{i,−} ≥ 0. Substituting these expressions into
η and multiplying out, we obtain η as a linear combination of products of the
form η_{1,ε_1} × ... × η_{m,ε_m}, where the ε_i take the values + or −. In each such
product the η_{1,ε_1}, ..., η_{m,ε_m} form a collection of jointly independent random
variables, as follows from Lemma 12.1. Therefore it suffices to prove the lemma
for non-negative random variables.
So let η_i, 1 ≤ i ≤ m, be non-negative.
We introduce the functions g_i^(n)(x), where g_i^(n)(x) = k·2^{−n} if k·2^{−n} ≤
g_i(x) < (k + 1)·2^{−n}. Then g_i^(n)(x) ↑ g_i(x) as n → ∞. Let us form the ran-
dom variables η_i^(n) = g_i^(n)(ξ_i), 1 ≤ i ≤ m. By Lemma 12.1 they are jointly
independent. In addition η_i^(n) ↑ η_i, and η^(n) = ∏_{i=1}^m η_i^(n) ↑ η as n → ∞. By
definition of the Lebesgue integral we have

E η = lim_{n→∞} E η^(n).

Furthermore

L II k;2- n p({wl'7!n) = k;2- n , 1 ~ i ~ n})


m

kll""k,..~Oi=l

L II k;T n II P({wl'7t) = k;2- n })


m m

by independence of the '7!n, • The latter expression is a series with non-negative


terms. We interchange the order of summation and multiplication:

L II II p({wl'7!n) = k;2- n })
m m

k;2- n
kll""km~Oi=l i=1

II L = II E'7!n) II E'7;.
m m m

= k;2- n p({wl'7t' = k;2- n }) ;::-::


i= 1 k i i=1

o
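Lemma 12.2 lends itself to a quick numerical check. The sketch below (added for illustration; the two distributions and the sample size are arbitrary choices, not from the text) compares a Monte Carlo estimate of Eη_1η_2 with the product of the estimated means for two independent random variables.

```python
import random

random.seed(0)
N = 200_000

# Two independent random variables: eta1 uniform on [0, 1], eta2 exponential with mean 1.
samples = [(random.random(), random.expovariate(1.0)) for _ in range(N)]

mean_product = sum(x * y for x, y in samples) / N          # estimates E(eta1 * eta2)
product_of_means = (sum(x for x, _ in samples) / N) * \
                   (sum(y for _, y in samples) / N)         # estimates E(eta1) * E(eta2)

# For independent variables the two estimates agree up to Monte Carlo error
# (the true common value here is 0.5 * 1 = 0.5).
print(mean_product, product_of_means)
```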

The following lemma will be necessary later for the proof of Kolmogorov's
inequality.

Lemma 12.3. Let ξ_1, ..., ξ_n be real-valued jointly independent random
variables, and let g(x_1, ..., x_k), k < n, be a Borel function of k variables. Then
η_1 = g(ξ_1, ..., ξ_k), ξ_{k+1}, ..., ξ_n form a collection of jointly independent random
variables.

Proof. Let A, A_{k+1}, ..., A_n be Borel subsets of the line. We need to prove that
P(η_1 ∈ A, ξ_{k+1} ∈ A_{k+1}, ..., ξ_n ∈ A_n) = P(η_1 ∈ A) ∏_{i=k+1}^n P(ξ_i ∈ A_i). We
introduce A' = {(x_1, ..., x_k) | g(x_1, ..., x_k) ∈ A}. Then A' is a Borel
subset of ℝ^k and we need to prove that

P{(ξ_1, ..., ξ_k) ∈ A', ξ_{k+1} ∈ A_{k+1}, ..., ξ_n ∈ A_n} = P{(ξ_1, ..., ξ_k) ∈ A'} ∏_{i=k+1}^n P(ξ_i ∈ A_i).

If A' is a rectangle, i.e. if A' = A'_1 × A'_2 × ⋯ × A'_k, then the desired equality
follows from the independence of ξ_1, ..., ξ_n. If A' is a finite or countable union
of disjoint rectangles then the desired equality follows from the σ-additivity of
probability measures. As is shown in general measure theory, any Borel subset
of ℝ^k can be approximated by such unions to arbitrary precision, where
precision is measured by the measure of the symmetric difference. □

We now turn to one of the fundamental theorems of probability theory,
the Law of Large Numbers. First we introduce the following general definition.

Definition 12.2. Let {ξ_1, ..., ξ_n, ...}, where ξ_i = f_i(ω), be a sequence of
real-valued random variables. We say that this sequence converges to ξ = f(ω):
- in probability (in measure) if lim_{n→∞} P(|ξ_n − ξ| > ε) = 0 for any ε > 0;
- almost surely if lim_{n→∞} f_n(ω) = f(ω) for almost all ω.

It is clear that almost sure convergence implies convergence in probability,
but the converse is false, as simple examples show (e.g. on Ω = [0, 1] with
Lebesgue measure, the sequence of indicators of the intervals [0, 1/2], [1/2, 1],
[0, 1/4], [1/4, 1/2], ..., converges to 0 in probability but at no point ω).

Let ξ_1, ξ_2, ..., ξ_n, ... be a sequence of random variables with finite
mathematical expectations m_i = Eξ_i, i = 1, 2, .... We form the arithmetic means
ζ_n = (1/n)(ξ_1 + ⋯ + ξ_n) and m̄_n = (1/n)(m_1 + ⋯ + m_n).

Definition 12.3. The sequence of random variables {ξ_i} satisfies
- the Law of Large Numbers if ζ_n − m̄_n converges in probability to 0, i.e.
P(|ζ_n − m̄_n| > ε) → 0 as n → ∞ for any ε > 0;
- the Strong Law of Large Numbers if ζ_n − m̄_n converges to 0 almost surely,
i.e. lim_{n→∞} (ζ_n − m̄_n) = 0 for almost all ω.

If the random variables ξ_i are jointly independent and if Var ξ_i ≤ V =
const, then it follows by Chebyshev's Inequality that the Law of Large Numbers
holds:

P(|ζ_n − m̄_n| > ε) = P(|(ξ_1 + ⋯ + ξ_n) − (m_1 + ⋯ + m_n)| > εn)
≤ Var(ξ_1 + ⋯ + ξ_n)/(ε²n²).

In the case of independent random variables Var(ξ_1 + ⋯ + ξ_n) = Var ξ_1 + ⋯ +
Var ξ_n ≤ Vn (see Lecture 4). Therefore the last expression is less than or equal to
Vn/ε²n² = V/ε²n → 0 as n → ∞. We also have a stronger theorem due to
Khinchin:

Theorem 12.1 (Khinchin). A sequence of jointly independent identically
distributed random variables ξ_i with finite mathematical expectation satisfies the
Law of Large Numbers.
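Khinchin's theorem can be watched at work by simulation (an illustrative sketch, not part of the text; the exponential distribution and the parameters are arbitrary choices): for i.i.d. variables with mean m = 1, the frequency of the event {|average − m| > ε} decays as n grows.

```python
import random

random.seed(1)

def average(n):
    # arithmetic mean of n i.i.d. exponential variables with mean 1
    return sum(random.expovariate(1.0) for _ in range(n)) / n

eps, trials = 0.1, 200
freqs = {}
for n in (10, 100, 2000):
    # estimated P(|average_n - 1| > eps) over repeated experiments
    freqs[n] = sum(abs(average(n) - 1.0) > eps for _ in range(trials)) / trials
print(freqs)
```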

Historically Khinchin's Theorem was one of the first theorems connected
with the Law of Large Numbers. We will not prove it, but we will derive it
from an analogous theorem of Kolmogorov connected with the Strong Law
of Large Numbers, to which we now turn. We need two sufficiently general
statements.

Lemma 12.4 (First Borel-Cantelli). Let {A_n} be a sequence of events on
a probability space (Ω, 𝔉, P) with Σ P(A_n) < ∞. If A = {ω | there exists an
infinite sequence n_i = n_i(ω) for which ω ∈ A_{n_i}, i = 1, 2, ...} then P(A) = 0.

Proof. We write A in the following form:

A = ∩_{k=1}^∞ ∪_{n=k}^∞ A_n.

Therefore P(A) ≤ P(∪_{n=k}^∞ A_n) ≤ Σ_{n=k}^∞ P(A_n) for every k, and the last sum
tends to 0 as k → ∞ since the series Σ P(A_n) converges. □
Lemma 12.5 (Second Borel-Cantelli). Let {A_n} be a sequence of jointly
independent events on a probability space with Σ P(A_n) = ∞, and let A = {ω |
there exists an infinite sequence n_i(ω) such that ω ∈ A_{n_i}}; then P(A) = 1.

Proof. Denoting by Ā_n the complement of A_n, we have Ā = ∪_{k=1}^∞ ∩_{n=k}^∞ Ā_n, so
P(Ā) ≤ Σ_{k=1}^∞ P(∩_{n=k}^∞ Ā_n). By independence of the A_n we have the
independence of the Ā_n, and therefore P(∩_{n=k}^∞ Ā_n) = ∏_{n=k}^∞ (1 − P(A_n)) = 0,
since Σ_{n=1}^∞ P(A_n) = ∞. Hence P(Ā) = 0 and P(A) = 1. □
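The contrast between the two Borel-Cantelli lemmas can be seen numerically over a finite horizon (an illustrative sketch added here; the horizon and the probabilities 1/n² and 1/n are arbitrary choices): when the probabilities are summable only a few of the events occur, while in the divergent case they keep occurring.

```python
import random

random.seed(2)
N = 100_000

# Independent events A_n with P(A_n) = 1/n^2 (summable series)
# versus B_n with P(B_n) = 1/n (divergent series).
count_A = sum(random.random() < 1.0 / n**2 for n in range(1, N + 1))
count_B = sum(random.random() < 1.0 / n for n in range(1, N + 1))

# Up to horizon N the A_n occur only a few times (expected total is
# sum 1/n^2 < pi^2/6, about 1.64), while the expected number of B_n
# that occur grows like log N, about 12 here.
print(count_A, count_B)
```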

Let ξ_1, ξ_2, ... be a sequence of jointly independent random variables which
have finite mathematical expectations and variances, i.e. m_i = Eξ_i, V_i =
Var ξ_i < ∞, i = 1, 2, .... Then

Var(ξ_1 + ⋯ + ξ_n) = Var ξ_1 + ⋯ + Var ξ_n = Σ_{i=1}^n V_i.

Theorem 12.2 (Kolmogorov's Inequality). For any t > 0

P( max_{1≤k≤n} |(ξ_1 + ⋯ + ξ_k) − (m_1 + ⋯ + m_k)| ≥ t ) ≤ (1/t²) Σ_{i=1}^n V_i.

Proof. Let us introduce the events C_k = {ω | |(ξ_1 + ⋯ + ξ_i) − (m_1 + ⋯ + m_i)| < t
for 1 ≤ i < k, |(ξ_1 + ⋯ + ξ_k) − (m_1 + ⋯ + m_k)| ≥ t}, and C = ∪_{k=1}^n C_k. It is clear
that C is the event whose probability appears in Kolmogorov's inequality, and
that the C_k do not intersect. We have

Var(ξ_1 + ⋯ + ξ_n) = ∫_Ω ((ξ_1 + ⋯ + ξ_n) − (m_1 + ⋯ + m_n))² dP

≥ ∫_C ((ξ_1 + ⋯ + ξ_n) − (m_1 + ⋯ + m_n))² dP

= Σ_{k=1}^n ∫_{C_k} ((ξ_1 + ⋯ + ξ_n) − (m_1 + ⋯ + m_n))² dP

= Σ_{k=1}^n [ ∫_{C_k} ((ξ_1 + ⋯ + ξ_k) − (m_1 + ⋯ + m_k))² dP

+ 2 ∫_{C_k} ((ξ_1 + ⋯ + ξ_k) − (m_1 + ⋯ + m_k))
× ((ξ_{k+1} + ⋯ + ξ_n) − (m_{k+1} + ⋯ + m_n)) dP

+ ∫_{C_k} ((ξ_{k+1} + ⋯ + ξ_n) − (m_{k+1} + ⋯ + m_n))² dP ].
The last integral is non-negative. The main point is that the middle integral
is equal to zero. Indeed, set

g(x_1, ..., x_k) = 1 if |(x_1 + ⋯ + x_i) − (m_1 + ⋯ + m_i)| < t for 1 ≤ i < k
and |(x_1 + ⋯ + x_k) − (m_1 + ⋯ + m_k)| ≥ t, and g(x_1, ..., x_k) = 0 otherwise.

Then g(ξ_1, ..., ξ_k) is the indicator function of C_k. Therefore

∫_{C_k} ((ξ_1 + ⋯ + ξ_k) − (m_1 + ⋯ + m_k)) × ((ξ_{k+1} + ⋯ + ξ_n) − (m_{k+1} + ⋯ + m_n)) dP

= Σ_{i=k+1}^n ∫_Ω g(ξ_1, ..., ξ_k)((ξ_1 + ⋯ + ξ_k) − (m_1 + ⋯ + m_k))(ξ_i − m_i) dP.

Applying Lemma 12.3 to g_1(x_1, ..., x_k) = g(x_1, ..., x_k)((x_1 + ⋯ + x_k) − (m_1 +
⋯ + m_k)) we find that the random variables (ξ_i − m_i) and g_1(ξ_1, ..., ξ_k) are
independent for i > k. Therefore by Lemma 12.2 we have
∫_Ω g(ξ_1, ..., ξ_k)((ξ_1 + ⋯ + ξ_k) − (m_1 + ⋯ + m_k))(ξ_i − m_i) dP

= ∫_Ω g_1(ξ_1, ..., ξ_k)(ξ_i − m_i) dP

= ∫_Ω g_1(ξ_1, ..., ξ_k) dP · ∫_Ω (ξ_i − m_i) dP = 0.

Finally

Σ_{i=1}^n V_i ≥ Σ_{k=1}^n ∫_{C_k} ((ξ_1 + ⋯ + ξ_k) − (m_1 + ⋯ + m_k))² dP

≥ t² Σ_{k=1}^n P(C_k) = t² P(C),

i.e. P(C) ≤ (1/t²) Σ_{i=1}^n V_i. □
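Kolmogorov's inequality can be checked by simulation. In the sketch below (illustrative parameters, not from the text) the ξ_i are uniform on [−1, 1], so m_i = 0 and V_i = 1/3, and the right-hand side of the inequality is (n/3)/t².

```python
import random

random.seed(3)

# Monte Carlo check of Kolmogorov's inequality for xi_i uniform on [-1, 1]:
# P(max_k |S_k| >= t) <= (n/3) / t^2, where S_k = xi_1 + ... + xi_k.
n, t, trials = 50, 5.0, 2000
exceed = 0
for _ in range(trials):
    s, running_max = 0.0, 0.0
    for _ in range(n):
        s += random.uniform(-1.0, 1.0)
        running_max = max(running_max, abs(s))
    exceed += running_max >= t

empirical = exceed / trials
bound = (n / 3.0) / t**2          # = 2/3 for these parameters
print(empirical, bound)            # the empirical probability stays below the bound
```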

Theorem 12.3 (First Kolmogorov). A sequence of jointly independent
random variables {ξ_i} such that Σ_{i=1}^∞ (Var ξ_i)/i² < ∞ satisfies the Strong Law
of Large Numbers.

Proof. Let ξ'_i = ξ_i − m_i and ζ_n = (1/n)((ξ_1 + ⋯ + ξ_n) − (m_1 + ⋯ + m_n)) =
(1/n) Σ_{i=1}^n ξ'_i. We need to show that ζ_n → 0 almost everywhere. Let us choose
ε > 0 and form the event B(ε) = {ω | there exists N = N(ω), such that for all
n ≥ N(ω) we have |ζ_n| < ε}. We have

B(ε) = ∪_{N=1}^∞ ∩_{n=N}^∞ {ω | |ζ_n| < ε}.

We define

D_m(ε) = {ω | max_{2^{m−1} < n ≤ 2^m} |ζ_n| ≥ ε}.

By Kolmogorov's inequality we have

P(D_m(ε)) = P( max_{2^{m−1} < n ≤ 2^m} (1/n)|Σ_{i=1}^n ξ'_i| ≥ ε )
≤ P( max_{1 ≤ n ≤ 2^m} |Σ_{i=1}^n ξ'_i| ≥ ε 2^{m−1} )
≤ (4/(ε² 2^{2m})) Σ_{i=1}^{2^m} Var ξ_i.

Therefore

Σ_{m=1}^∞ P(D_m(ε)) ≤ (4/ε²) Σ_{m=1}^∞ 2^{−2m} Σ_{i=1}^{2^m} Var ξ_i
= (4/ε²) Σ_{i=1}^∞ Var ξ_i Σ_{m: 2^m ≥ i} 2^{−2m} ≤ (16/ε²) Σ_{i=1}^∞ (Var ξ_i)/i² < ∞

by assumption. By the Borel-Cantelli Lemma we know that for almost all ω
there exists an integer m_0 = m_0(ω) such that max_{2^{m−1} < n ≤ 2^m} |ζ_n| < ε for all
m ≥ m_0. Therefore P(B(ε)) = 1 for any ε > 0. In particular P(B(1/k)) = 1
and P(∩_k B(1/k)) = 1. But if ω ∈ ∩_{k=1}^∞ B(1/k) then, for any k, ω ∈ B(1/k),
i.e. there exists N = N(ω, k) such that for all n ≥ N(ω, k) we have |ζ_n| < 1/k.
In other words lim_{n→∞} ζ_n = 0 for such ω. □

Theorem 12.4 (Second Kolmogorov). A sequence {ξ_i} of jointly
independent identically distributed random variables with finite mathematical
expectation m = Eξ_i satisfies the Strong Law of Large Numbers.

Proof. By introducing ξ'_i = ξ_i − m we reduce the problem to the case where
m = 0. Let us set

η_i(ω) = ξ_i(ω) if |ξ_i| ≤ i, and η_i(ω) = 0 otherwise.

By the properties of the Lebesgue integral we then have

m_i = Eη_i = ∫_{|x|≤i} x dF_ξ(x) → ∫ x dF_ξ(x) = 0 as i → ∞,

where F_ξ is the common distribution function of the random variables ξ_i. We
have

1 1 1
- (6 + ... + e,,) = - (7]1 + ... + 7],,) - - (ml + ... + m,,)
n n n
1 " 1
~(e. - 7].) + -(ml + ... + m,,).
+ -n L-, n
i= 1

Since m" - t 0 as n - t 00 the latter arithmetic mean converges to 0 as n - t 00.


e.
We now show that with probability 1, "I- 7]. only for a finite number of
values of i, which implies the convergence to zero, almost everywhere, of the
term ;; E~=1 (e. - 7].). For this we once again use the Borel-Cantelli Lemma:

L p{e. "I- 7].} = L p{le.1 ~ i}


00 00

;=1 .=
00
1
00

.=111:=.
00
(12.1)
= ;L)k + I)P{k ~ le.1 < k + I}
11:=0

L kP{ k ~ Ie. I < k + I} + 1.


00

=
11:=0

The last sum is the mathematical expectation of the random variable 7] where

'1 = {k, if k 5: 1~I(W)1 < k + I}.

It is clear that '1 ~ I€1 I. Therefore by the properties of the Lebesgue integral
we have E(I€II- '1) ~ 0 and Elell- E'1 ~ 0, i.e. Elell ~ E'1. We recall that
the inequality IEel < 00 always means that Elel < 00. Thus the series in
(12.1) converges.
We show that the First Kolmogorov Theorem applies to the first sum. We
have

Var η_i = E(η_i − m_i)² ≤ Eη_i² = ∫_{|x|≤i} x² dF_ξ(x).

We use the fact that for k > 1, Σ_{i=k}^∞ 1/i² ≤ Σ_{i=k}^∞ 1/(i(i−1)) =
Σ_{i=k}^∞ (1/(i−1) − 1/i) = 1/(k−1). So the series Σ_{i=1}^∞ (Var η_i)/i² is no greater than

Σ_{i=1}^∞ (1/i²) ∫_{|x|≤i} x² dF_ξ(x) = Σ_{k=1}^∞ ( ∫_{k−1<|x|≤k} x² dF_ξ(x) ) Σ_{i=k}^∞ 1/i²

≤ ( Σ_{i=1}^∞ 1/i² ) ∫_{|x|≤1} x² dF_ξ(x) + Σ_{k=2}^∞ (1/(k−1)) ∫_{k−1<|x|≤k} x² dF_ξ(x)

≤ const + const Σ_{k=2}^∞ ∫_{k−1<|x|≤k} |x| dF_ξ(x) ≤ const E|ξ_1| < ∞. □
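The truncation step of the proof rests on the convergence of the series Σ_{i≥1} P{|ξ_1| ≥ i}, which equals E⌊|ξ_1|⌋ and so lies between E|ξ_1| − 1 and E|ξ_1|. A quick numerical check (an illustration added here; the exponential distribution is chosen only because its tail probabilities have a closed form):

```python
import math

# For an exponential xi with mean 2 we have P(xi >= i) = exp(-i/2), so the
# series sum_{i>=1} P(xi >= i) = E[floor(xi)] can be summed directly and
# compared with the bracket [E xi - 1, E xi].
mean = 2.0
series = sum(math.exp(-i / mean) for i in range(1, 200))   # tail terms beyond i=200 are negligible
print(series, mean)   # the series lies between mean - 1 and mean
```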
The Law of Large Numbers, as well as the Strong Law of Large
Numbers, is related to theorems known as the Ergodic Theorems. These theorems
provide general conditions under which the arithmetic means of random
variables have a limit. Basically these conditions reduce to stationarity, i.e. the
assumption that for a sequence of random variables {ξ_n} the distribution of
any collection of k random variables ξ_{n+1}, ..., ξ_{n+k} does not depend on n.

The statements of both laws of large numbers are that for a sequence of
random variables {ξ_n} the arithmetic mean (1/n) Σ_{i=1}^n ξ_i is close to its
mathematical expectation and therefore asymptotically does not depend on ω, i.e. is
not random. In other words, deterministic regularities appear with high
probability in long series of random variables. In daily life we constantly encounter
the Law of Large Numbers. For example, the fact that in its equilibrium state
a gas occupies all the available volume is a statement of the type of a law of large
numbers.
Lecture 13. Weak Convergence of Probability
Measures on the Line and Helly's Theorems

In the proof of the De Moivre-Laplace Limit Theorem (see Lecture 3) we came
across the following situation. We considered a random variable η_n which takes
values of the form k/√(npq), where k is an integer, with probabilities whose
precise form we do not specify here. The main point is that the random
variable η_n has a discrete distribution which converges in a natural sense to a
continuous distribution with Gaussian density (1/√(2π)) exp(−x²/2)
as n → ∞.

Fig. 13.1.

This form of convergence of distributions is fundamental to limit theorems in
probability theory, which we now consider. We now examine more explicitly
several concepts which are related to this form of convergence.

Let X be a metric space, 𝔅(X) be the σ-algebra of its Borel subsets, and
P_n be a sequence of probability measures on (X, 𝔅(X)).

Definition 13.1. The sequence {P_n} converges weakly to the probability
measure P if, for any bounded continuous function f ∈ C(X),

lim_{n→∞} ∫ f(x) dP_n(x) = ∫ f(x) dP(x).


This form of convergence is sometimes written as P_n ⇒ P. Let us consider
a sequence of real-valued random variables {ξ_n} with corresponding
probability measures on the line {P_{ξ_n}}, i.e. P_{ξ_n}(A) = P({ω | ξ_n ∈ A}). Then {ξ_n}
converges in distribution if the sequence {P_{ξ_n}} converges weakly to P, i.e.
P_{ξ_n} ⇒ P.

In Definition 13.1 we could omit the requirement that P_n and P be
probability measures, i.e. normed measures on X. We then obtain a definition of
weak convergence for arbitrary finite measures given on 𝔅(X). This remark
will be useful below.
To each random variable ξ corresponds a distribution function F_ξ(x) which
uniquely determines the distribution P_ξ. We now express the condition of weak
convergence in terms of distribution functions.

Let P_n ⇒ P and let {F_n} and F be the corresponding distribution functions,
i.e. F_n(x) = P_n((−∞, x]), F(x) = P((−∞, x]). Recall that x is a continuity
point of F if and only if P({x}) = 0.

Theorem 13.1. The sequence P_n ⇒ P if and only if F_n(x) → F(x) at every
continuity point x of the function F.

Proof. Let P_n ⇒ P and let x be a continuity point of F. We consider the
functions f, f_δ^+ and f_δ^− whose graphs are given in Fig. 13.2.

Fig. 13.2.
Formally we have

f(y) = 1 for y ≤ x, and f(y) = 0 for y > x;

f_δ^+(y) = 1 for y ≤ x, = 1 − (y − x)/δ for x < y ≤ x + δ, = 0 for y > x + δ;

f_δ^−(y) = 1 for y ≤ x − δ, = 1 − (y − x + δ)/δ for x − δ < y ≤ x, = 0 for y > x.

The functions f_δ^+ and f_δ^− are continuous and f_δ^− ≤ f ≤ f_δ^+. Using the fact
that x is a continuity point of F we have, for any ε > 0 and n ≥ n_0(ε),

F_n(x) = ∫ f(y) dF_n(y) ≤ ∫ f_δ^+(y) dF_n(y)

≤ ∫ f_δ^+(y) dF(y) + ε/2 ≤ F(x + δ) + ε/2 ≤ F(x) + ε,

if δ is such that |F(x ± δ) − F(x)| ≤ ε/2. On the other hand, for such n we
have

F_n(x) = ∫ f(y) dF_n(y) ≥ ∫ f_δ^−(y) dF_n(y)

≥ ∫ f_δ^−(y) dF(y) − ε/2 ≥ F(x − δ) − ε/2 ≥ F(x) − ε.

In other words, |F_n(x) − F(x)| ≤ ε for all sufficiently large n.
We now prove the converse. Let F_n(x) → F(x) at every continuity point
of F. We prove that

∫ f(x) dF_n(x) → ∫ f(x) dF(x)

for any bounded continuous function f.

Let M = sup_x |f(x)|. The discontinuity points of the function F are those
points x_i whose probability is positive (the probability is equal to the
height of the jump in the graph of F at the point x_i). It follows that the set
of discontinuity points is at most countable.
Let us fix ε > 0. We choose K_1(ε) = K_1 and K_2(ε) = K_2, K_1 < K_2, such that
a1) K_1, K_2 are continuity points of F;
a2) P{(K_2, ∞)} = 1 − F(K_2) ≤ ε/16M;
a3) P{(−∞, K_1]} = F(K_1) ≤ ε/16M.
Since K_1 and K_2 are continuity points of F, for n ≥ n_1(ε) we have

|F_n(K_1) − F(K_1)| ≤ ε/16M,  |F_n(K_2) − F(K_2)| ≤ ε/16M.

Therefore

I_n = ∫ f dF_n − ∫ f dF

= ∫_{(K_1,K_2]} f dF_n − ∫_{(K_1,K_2]} f dF + ∫_{(−∞,K_1]} f dF_n + ∫_{(K_2,∞)} f dF_n

− ∫_{(−∞,K_1]} f dF − ∫_{(K_2,∞)} f dF.
Each of the four latter integrals can be estimated in a simple way: for n ≥ n_1(ε)
we have

|∫_{(−∞,K_1]} f dF_n| ≤ ∫_{(−∞,K_1]} |f| dF_n ≤ M F_n(K_1) ≤ ε/8;

|∫_{(K_2,∞)} f dF_n| ≤ ∫_{(K_2,∞)} |f| dF_n ≤ M(1 − F_n(K_2)) ≤ ε/8;

|∫_{(−∞,K_1]} f dF| ≤ ε/16;  |∫_{(K_2,∞)} f dF| ≤ ε/16.

Therefore

|∫_{(−∞,K_1]} f dF_n + ∫_{(K_2,∞)} f dF_n − ∫_{(−∞,K_1]} f dF − ∫_{(K_2,∞)} f dF|

≤ |∫_{(−∞,K_1]} f dF_n| + |∫_{(K_2,∞)} f dF_n| + |∫_{(−∞,K_1]} f dF| + |∫_{(K_2,∞)} f dF| ≤ 3ε/8.

It now remains to estimate the difference I'_n = ∫_{(K_1,K_2]} f dF_n − ∫_{(K_1,K_2]} f dF.
We choose a collection x_0 = K_1 < x_1 < ⋯ < x_p = K_2 of continuity points
of the function F such that x_{i+1} − x_i ≤ δ, where δ is chosen so that |f(x') −
f(x'')| ≤ ε/4 if K_1 ≤ x', x'' ≤ K_2, |x' − x''| ≤ δ. We define a new function
f^{(δ)} on (K_1, K_2] by

f^{(δ)}(x) = f(x_i)  if x_i < x ≤ x_{i+1},  i = 0, ..., p − 1.

Then |f^{(δ)}(x) − f(x)| ≤ ε/4 and

|∫_{(K_1,K_2]} f dF_n − ∫_{(K_1,K_2]} f^{(δ)} dF_n| ≤ ∫_{(K_1,K_2]} |f − f^{(δ)}| dF_n ≤ ε/4,

|∫_{(K_1,K_2]} f dF − ∫_{(K_1,K_2]} f^{(δ)} dF| ≤ ∫_{(K_1,K_2]} |f − f^{(δ)}| dF ≤ ε/4.
Furthermore

|I'_n| ≤ ε/2 + |∫_{(K_1,K_2]} f^{(δ)} dF_n − ∫_{(K_1,K_2]} f^{(δ)} dF|.

We now note that from the definition

∫_{(K_1,K_2]} f^{(δ)} dF_n = Σ_{i=0}^{p−1} f(x_i)(F_n(x_{i+1}) − F_n(x_i)),

∫_{(K_1,K_2]} f^{(δ)} dF = Σ_{i=0}^{p−1} f(x_i)(F(x_{i+1}) − F(x_i))

and

|∫_{(K_1,K_2]} f^{(δ)} dF_n − ∫_{(K_1,K_2]} f^{(δ)} dF|

= |Σ_{i=0}^{p−1} f(x_i)(F_n(x_{i+1}) − F_n(x_i)) − Σ_{i=0}^{p−1} f(x_i)(F(x_{i+1}) − F(x_i))|

≤ Σ_{i=0}^{p−1} |f(x_i)| (|F_n(x_{i+1}) − F(x_{i+1})| + |F_n(x_i) − F(x_i)|).

We choose n_2(ε) to be large enough so that |F_n(x_i) − F(x_i)| ≤ ε/16Mp for
n ≥ n_2(ε). This is possible because the points x_i are continuity points of F. The
latter expression is then at most M · p · 2 · (ε/16Mp) = ε/8. Putting together
all the estimates we find that |I_n| ≤ ε for n ≥ max(n_1(ε), n_2(ε)). □
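Theorem 13.1 can be made concrete on a simple example (an added sketch, not from the text): the uniform measures on {1/n, 2/n, ..., n/n} converge weakly to the uniform distribution on [0, 1], and correspondingly F_n(x) = ⌊nx⌋/n → x at every point of (0, 1), all of which are continuity points of the limit F(x) = x.

```python
import math

def F_n(x, n):
    # distribution function of the uniform measure on {1/n, 2/n, ..., n/n}
    return min(max(math.floor(n * x) / n, 0.0), 1.0)

# largest deviation |F_n(x) - F(x)| over a grid of continuity points x = k/97
errors = [max(abs(F_n(k / 97, n) - k / 97) for k in range(1, 97))
          for n in (10, 100, 1000)]
print(errors)   # the deviation shrinks roughly like 1/n
```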

Definition 13.2. A family of probability measures {P_α} on a metric space X
is said to be weakly compact if from any sequence P_{α_n}, n = 1, 2, ..., one can
extract a weakly convergent subsequence P_{α_{n_k}}, k = 1, 2, ..., i.e. P_{α_{n_k}} ⇒ P.
Note that it is not required that P ∈ {P_α}.

In what follows we need the following general fact from functional analysis.

Theorem 13.2. Let X be a separable compact metric space, i.e. a complete
separable metric space which is compact. Then any family of measures
μ_α on (X, 𝔅(X)) such that μ_α(X) ≤ const, where const does not depend on
α, is weakly compact.

With the help of this theorem we derive the important Helly's Theorem:

Theorem 13.3 (Helly). A family of probability measures {P_α} on the line
X = ℝ¹ is weakly compact if and only if for any ε > 0 there exists an interval
[−K, K] for which P_α(ℝ¹ \ [−K, K]) ≤ ε for all α.

Proof. Assuming that the latter inequality is satisfied, we first prove the weak
compactness of the family {P_α}. We take a sequence of probability measures
from the family {P_α} which, for simplicity, we write in the form P_n, n =
1, 2, ..., and let X_l = [−l, l]. For any integer l, X_l is a separable compact
metric space, and P_n([−l, l]) ≤ 1 for all n. Therefore for any l we can use the
general theorem from functional analysis that was quoted above.

Considering the interval [−1, 1] we find a subsequence P_{n_i^{(1)}}, i = 1, 2, ...,
such that the restrictions of all the P_{n_i^{(1)}} to [−1, 1] converge weakly to a limit
which we denote by P^{(1)}. We emphasize that P^{(1)} is a measure on [−1, 1]
which is not necessarily normed. We choose a subsequence {n_i^{(2)}} ⊆ {n_i^{(1)}}
such that the restrictions of the P_{n_i^{(2)}} to [−2, 2] converge weakly to a limit P^{(2)}
which is a measure on [−2, 2], and so forth. In other words we construct a
decreasing collection of subsequences {n_i^{(l)}}, i.e. {n_i^{(l+1)}} ⊆ {n_i^{(l)}}, such that
the restrictions of the P_{n_i^{(l)}} to [−l, l] converge to a measure on [−l, l], which we
denote by P^{(l)}. It is essential that these measures be compatible: if A ⊂
[−l_1, l_1] ⊂ [−l_2, l_2] then P^{(l_1)}(A) = P^{(l_2)}(A). This holds since {n_i^{(l+1)}} ⊆
{n_i^{(l)}} for all l. For any Borel set A ∈ 𝔅(ℝ¹) we now set

P(A) = P^{(1)}(A ∩ [−1, 1]) + Σ_{l=2}^∞ P^{(l)}(A ∩ ([−l, l] \ [−l + 1, l − 1])).

We show that P is a probability measure, and that P is the weak limit of
the diagonal subsequence {P_{n_l^{(l)}}}.
It is clear that P(A) ≥ 0. Let us check σ-additivity. Let A = ∪_{i=1}^∞ A_i where
the A_i do not intersect pairwise. Then by the σ-additivity of each measure
P^{(l)}

P(A) = P^{(1)}(A ∩ [−1, 1]) + Σ_{l=2}^∞ P^{(l)}(A ∩ ([−l, l] \ [−l + 1, l − 1]))

= Σ_{i=1}^∞ P^{(1)}(A_i ∩ [−1, 1]) + Σ_{l=2}^∞ Σ_{i=1}^∞ P^{(l)}(A_i ∩ ([−l, l] \ [−l + 1, l − 1]))

= Σ_{i=1}^∞ [ P^{(1)}(A_i ∩ [−1, 1]) + Σ_{l=2}^∞ P^{(l)}(A_i ∩ ([−l, l] \ [−l + 1, l − 1])) ]

= Σ_{i=1}^∞ P(A_i).

It remains to be checked that P(ℝ¹) = 1. It is clear that P(ℝ¹) =
lim_{l→∞} P([−l, l]) ≤ 1. But by assumption for any ε > 0 one can find l(ε) such
that P_{n_i^{(l)}}([−l, l]) ≥ 1 − ε for all i. Consequently P{[−l, l]} = P^{(l)}{[−l, l]} =
lim_{i→∞} P_{n_i^{(l)}}{[−l, l]} ≥ 1 − ε. Therefore P(ℝ¹) = 1.

We now show that P_{n_l^{(l)}} ⇒ P. We take an arbitrary bounded continuous
function f and establish that lim_{l→∞} ∫ f(x) dP_{n_l^{(l)}}(x) = ∫ f(x) dP(x). Let us
fix ε > 0 and choose a k such that P(ℝ¹ \ [−k, k]) ≤ ε/4M, where M =
sup_x |f(x)|. Then |∫_{ℝ¹\[−k,k]} f(x) dP(x)| ≤ ε/4. Furthermore, increasing k if
necessary, we can assume that P_{n_l^{(l)}}(ℝ¹ \ [−k, k]) ≤ ε/4M for all l. Therefore
|∫_{ℝ¹\[−k,k]} f(x) dP_{n_l^{(l)}}(x)| ≤ ε/4. Finally, for l ≥ l_0(ε), we have
|∫_{[−k,k]} f(x) dP_{n_l^{(l)}}(x) − ∫_{[−k,k]} f(x) dP(x)| ≤ ε/2

by the weak convergence of the restrictions of the P_{n_l^{(l)}} to the interval [−k, k].
Finally we get that |∫ f(x) dP_{n_l^{(l)}}(x) − ∫ f(x) dP(x)| ≤ ε for all sufficiently
large l.
We now prove the converse. Let {P_α} be a weakly compact family which
does not satisfy the condition in Helly's Theorem. This means that for some
ε_0 > 0 and for every integer k > 0 there exists α_k such that

P_{α_k}([−k, k]) ≤ 1 − ε_0.

Since {P_α} is weakly compact, one can extract from the sequence
{P_{α_k}} a weakly convergent subsequence. Without loss of generality we will
assume that the whole sequence P_{α_k} ⇒ P, where P is a probability measure.
Let us choose for any k a point β_k, k − 1 < β_k < k, such that ±β_k are
continuity points of P. Then for all m ≥ k we have

P_{α_m}((−β_k, β_k]) ≤ P_{α_m}([−m, m]) ≤ 1 − ε_0.

It follows from Theorem 13.1 that

P([−β_k, β_k]) = F(β_k) − F(−β_k) = lim_{m→∞} [F_{α_m}(β_k) − F_{α_m}(−β_k)]

= lim_{m→∞} P_{α_m}((−β_k, β_k]) ≤ 1 − ε_0.

Here we have denoted by F and F_{α_m} the distribution functions of the
probability measures P and P_{α_m}. Since ℝ¹ = ∪_{k=1}^∞ (−β_k, β_k] it follows from the
last inequality that

P(ℝ¹) = lim_{k→∞} P([−β_k, β_k]) ≤ 1 − ε_0,

which contradicts the assumption that P is a probability measure. □
Helly's Theorem is a particular case of a much more general theorem due
to Yu. V. Prokhorov.

Theorem 13.4 (Prokhorov). Let {P_α} be a family of probability measures
given on the Borel σ-algebra of a complete separable metric space X. This
family is weakly compact if and only if for any ε > 0 there exists a compact
set K ⊂ X such that P_α(K) ≥ 1 − ε for all α.
Lecture 14. Characteristic Functions

In this lecture we consider the Fourier transform for probability measures, and
its properties. Let us begin with the case of measures on the line. Let P be a
probability measure given on 𝔅(ℝ¹).

Definition 14.1. The characteristic function (Fourier transform) of a measure
P is the complex-valued function φ(λ) of the variable λ ∈ ℝ¹ given by

φ(λ) = ∫_{−∞}^∞ e^{iλx} dP(x) = ∫_{−∞}^∞ cos λx dP(x) + i ∫_{−∞}^∞ sin λx dP(x).

If P = P_ξ we will denote the characteristic function by φ_ξ(λ) and will call
it the characteristic function of the random variable ξ. The expression means
that φ_ξ(λ) = E e^{iλξ}. If ξ takes on values a_1, a_2, ... with probabilities p_1, p_2, ...
then

φ_ξ(λ) = Σ_{k=1}^∞ p_k e^{iλa_k}.

If ξ has a probability density p_ξ(x) then

φ_ξ(λ) = ∫_{−∞}^∞ e^{iλx} p_ξ(x) dx.

In analysis, the integral on the right-hand side is called the Fourier transform
of the function p_ξ(x). Therefore in the general case, the notion of the
characteristic function of a measure is a generalization of the Fourier transform.
We now establish some simple properties of characteristic functions.
1. φ(0) = 1. This is clear.
2. |φ(λ)| ≤ 1. Indeed |φ(λ)| = |∫_{−∞}^∞ e^{iλx} dP(x)| ≤ ∫_{−∞}^∞ dP(x) = 1.
3. If η = aξ + b, where a and b are constants, then

φ_η(λ) = e^{iλb} φ_ξ(aλ).

Indeed φ_η(λ) = E e^{iλη} = E e^{iλ(aξ+b)} = e^{iλb} E e^{iλaξ} = e^{iλb} φ_ξ(aλ).
4. If φ_ξ(λ_0) = e^{2πiα} for some λ_0 ≠ 0 then ξ is a discrete random variable
which takes values of the form (2π/λ_0)(α + m), where m is an integer.
Indeed let η = ξ − 2πα/λ_0. Then from property 3 we have

φ_η(λ_0) = e^{−2πiα} φ_ξ(λ_0) = 1.

Furthermore 1 = φ_η(λ_0) = E e^{iλ_0η} = E cos λ_0η + iE sin λ_0η. Since
cos λ_0η ≤ 1, the latter equality means that cos λ_0η = 1 with
probability 1. This is possible only in the case where η takes values of the form
η = 2πm/λ_0, where m is an integer.
5. φ(λ) is uniformly continuous. Indeed let us take ε > 0. We show that there
exists a δ = δ(ε) such that |φ(λ) − φ(λ')| < ε if |λ − λ'| < δ.
We first choose t', t'', with t' < t'', such that P{(t'', ∞)} ≤ ε/6 and
P{(−∞, t')} ≤ ε/6. Then

|φ(λ) − φ(λ')| = |∫ (e^{iλx} − e^{iλ'x}) dP(x)|

≤ |∫_{x<t'} (e^{iλx} − e^{iλ'x}) dP(x)| + |∫_{x>t''} (e^{iλx} − e^{iλ'x}) dP(x)|

+ |∫_{t'≤x≤t''} (e^{iλx} − e^{iλ'x}) dP(x)|.

Each of the first two terms is no greater in modulus than ε/3. Regarding the
third term we have

|∫_{t'≤x≤t''} (e^{iλx} − e^{iλ'x}) dP(x)| = |∫_{t'≤x≤t''} e^{iλx} (1 − e^{i(λ'−λ)x}) dP(x)|

≤ ∫_{t'≤x≤t''} |1 − e^{i(λ'−λ)x}| dP(x).

We set δ = ε/(3 max(|t'|, |t''|)). It then follows from the inequality |1 − e^{ia}| ≤ |a|
that for |λ − λ'| ≤ δ the third term does not exceed ε/3, and therefore
|φ(λ) − φ(λ')| ≤ ε.

Definition 14.2. A complex-valued function f(λ) is said to be positive
semi-definite if, for any λ_1, ..., λ_r, the matrix ‖f(λ_k − λ_l)‖ is positive semi-definite.
This means that the quadratic form Σ f(λ_k − λ_l) c_k c̄_l ≥ 0.

6. Any characteristic function φ(λ) is positive semi-definite. Indeed

Σ_{k,l=1}^r φ(λ_k − λ_l) c_k c̄_l = Σ_{k,l=1}^r ∫ e^{i(λ_k−λ_l)x} c_k c̄_l dP(x)

= ∫ |Σ_{k=1}^r c_k e^{iλ_k x}|² dP(x) ≥ 0.

By Khinchin's Theorem, any positive semi-definite uniformly
continuous function which satisfies the normalization condition φ(0) = 1 is the
characteristic function of a probability measure.

7. Assume that the random variable ξ has an absolute moment of order
k, i.e. E|ξ|^k < ∞. Then φ_ξ(λ) is k times continuously differentiable and
φ_ξ^{(k)}(0) = i^k Eξ^k.
The proof follows from properties of the Lebesgue integral and consists in
checking that the formal differentiation is correct in the equality

φ_ξ^{(k)}(λ) = ∫_{−∞}^∞ (ix)^k e^{iλx} dP_ξ(x).

The last integral is finite, since E|ξ|^k is finite.


In principle all the properties of a probability distribution can be
translated using the terminology of characteristic functions, although it
often turns out to be a difficult matter. Property 7 shows that the smooth-
ness of the characteristic function is connected with the existence of mo-
ments, i.e. with the decrease of P at infinity.

Problem. If the characteristic function 4>(A) is analytic in a neighborhood of


A = 0, then there exists an a > 0 such that P{(-oo, -x)} $ const e- as , and
P{(x,oo)} $ const e-CU:, for every x > o.

The decrease of φ(λ) at infinity is connected with the existence of a density.
For example, if ∫_{−∞}^∞ |φ(λ)| dλ < ∞ then the distribution P has a density given
by

p(x) = (1/2π) ∫_{−∞}^∞ e^{−iλx} φ(λ) dλ.
We now show that one can always recover the measure P from its
characteristic function φ(λ). If φ(λ) is absolutely integrable, then the density
of the distribution can be found by the formula given above. Let us
assume that P is concentrated at the points a_1, a_2, ..., and P{a_i} = p_i. Then
φ(λ) = Σ_i e^{iλa_i} p_i and does not tend to 0 as λ → ∞. In this case

lim_{R→∞} (1/2R) ∫_{−R}^R e^{−iλx} φ(λ) dλ = 0 for x ∉ (a_1, a_2, ...), and = p_i for x = a_i.

Before turning to the general case we will say that an interval [a, b] is a
continuity interval for P if P{a} = P{b} = 0.

Theorem 14.1. For any interval [a, b]

lim_{R→∞} (1/2π) ∫_{−R}^R ((e^{−iλa} − e^{−iλb})/(iλ)) φ(λ) dλ = P{(a, b)} + (1/2)P{a} + (1/2)P{b}.

Proof. By Fubini's Theorem on the interchange of the order of integration in
a two-dimensional Lebesgue integral, since the integrand is bounded we have

(1/2π) ∫_{−R}^R ((e^{−iλa} − e^{−iλb})/(iλ)) φ(λ) dλ
= (1/2π) ∫_{−∞}^∞ dP(x) ∫_{−R}^R ((e^{−iλa} − e^{−iλb})/(iλ)) e^{iλx} dλ.
Furthermore

∫_{−R}^R ((e^{−iλa} − e^{−iλb})/(iλ)) e^{iλx} dλ = ∫_{−R}^R (cos λ(x−a) − cos λ(x−b))/(iλ) dλ

+ ∫_{−R}^R (sin λ(x−a) − sin λ(x−b))/λ dλ.

The first integral is equal to 0 since the integrand is odd. In the second integral
the integrand is even, therefore

∫_{−R}^R ((e^{−iλa} − e^{−iλb})/(iλ)) e^{iλx} dλ = 2 ∫_0^R (sin λ(x−a))/λ dλ − 2 ∫_0^R (sin λ(x−b))/λ dλ.

By a change of variables μ = λ(x−a) in the first integral and μ = λ(x−b) in
the second integral we obtain

2 ∫_0^R (sin λ(x−a))/λ dλ − 2 ∫_0^R (sin λ(x−b))/λ dλ = 2 ∫_{R(x−b)}^{R(x−a)} (sin μ)/μ dμ.
So

(1/2π) ∫_{−R}^R ((e^{−iλa} − e^{−iλb})/(iλ)) φ(λ) dλ = ∫_{−∞}^∞ dP(x) (1/π) ∫_{R(x−b)}^{R(x−a)} (sin μ)/μ dμ.

We now note that, integrating by parts over [π/2, t], for any t > π/2

|∫_0^t (sin μ)/μ dμ| ≤ |∫_0^{π/2} (sin μ)/μ dμ| + |[−(cos μ)/μ]_{π/2}^t| + |∫_{π/2}^t (cos μ)/μ² dμ| ≤ const.

For 0 ≤ t ≤ π/2 the integral ∫_0^t (sin μ)/μ dμ is also bounded in absolute value,
since the integrand is a bounded function. Since the integrand is even, the
latter inequality also holds for t < 0. By Lebesgue's Bounded Convergence
Theorem we have

lim_{R→∞} ∫_{−∞}^∞ dP(x) (1/π) ∫_{R(x−b)}^{R(x−a)} (sin μ)/μ dμ

= ∫_{−∞}^∞ dP(x) lim_{R→∞} (1/π) ∫_{R(x−b)}^{R(x−a)} (sin μ)/μ dμ.

We now obtain the form of this limit for different values of x.

a1) If x > b both integration bounds tend to +∞ and therefore the
limit of the integral is equal to zero.
a2) If x < a, by an analogous argument the limit of the integral is equal to
zero.
a3) If a < x < b

lim_{R→∞} (1/π) ∫_{R(x−b)}^{R(x−a)} (sin μ)/μ dμ = (1/π) ∫_{−∞}^∞ (sin μ)/μ dμ = 1.

If x = a

lim_{R→∞} (1/π) ∫_{−R(b−a)}^0 (sin μ)/μ dμ = (1/π) ∫_{−∞}^0 (sin μ)/μ dμ = 1/2.

If x = b

lim_{R→∞} (1/π) ∫_0^{R(b−a)} (sin μ)/μ dμ = (1/π) ∫_0^∞ (sin μ)/μ dμ = 1/2.

So without any assumption on the interval [a, b]

lim_{R→∞} (1/2π) ∫_{−R}^R ((e^{−iλa} − e^{−iλb})/(iλ)) φ(λ) dλ = P{(a, b)} + (1/2)P{a} + (1/2)P{b}.

If [a, b] is a continuity interval for the measure P then P{a} = P{b} = 0
and P{(a, b)} = P{[a, b]}, so the latter limit is equal to P{[a, b]}. □
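Theorem 14.1 can be applied numerically (an illustrative sketch added here, with arbitrary discretization parameters): for the standard normal law, φ(λ) = e^{−λ²/2}, every interval is a continuity interval, so truncating at a large R and integrating should recover P{[−1, 1]} ≈ 0.6827.

```python
import cmath
import math

# Numerical evaluation of (1/2pi) * integral of (e^{-i lam a} - e^{-i lam b})/(i lam)
# times exp(-lam^2/2) over [-R, R], for the standard normal characteristic function.
a, b, R, steps = -1.0, 1.0, 8.0, 4000
h = 2 * R / steps
total = 0.0
for k in range(steps):
    lam = -R + (k + 0.5) * h           # midpoint rule; avoids the removable point lam = 0
    kernel = (cmath.exp(-1j * lam * a) - cmath.exp(-1j * lam * b)) / (1j * lam)
    total += (kernel * math.exp(-lam**2 / 2)).real * h
estimate = total / (2 * math.pi)
print(estimate)   # close to P{[-1, 1]} for N(0, 1), about 0.6827
```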

Corollary 14.1. If two probability measures have equal characteristic functions
then they are equal.

The proof follows from the fact that by Theorem 14.1 the distribution
functions of the measures must agree at all continuity points. But if two
distribution functions agree at all continuity points then they must agree at all
points.

The statement in the corollary is sometimes referred to as the Unicity
Theorem for characteristic functions.
One of the reasons why characteristic functions are often helpful in
probability theory is that it is easy to obtain from them a criterion for the weak
convergence of probability measures.

Assume that we have a sequence of probability measures {P_n} and a
probability measure P. We denote the corresponding characteristic functions by
{φ_n(λ)} and φ(λ).

Theorem 14.2. P_n ⇒ P if and only if lim_{n→∞} φ_n(λ) = φ(λ) for any λ.

Proof. If P_n ⇒ P, then
φ_n(λ) = ∫ e^{iλx} dP_n(x) = ∫ cos λx dP_n(x) + i ∫ sin λx dP_n(x)

→ ∫ cos λx dP(x) + i ∫ sin λx dP(x) = ∫ e^{iλx} dP(x) = φ(λ),

so one direction in the proof of the theorem is trivial.

To prove the converse statement we establish that the family {P_n} satisfies
the conditions of Helly's Theorem. We will use the following lemma.

Lemma 14.1. Let Q be a probability measure on the line and ψ(λ) be its
characteristic function. Then for any τ > 0

Q{[−2/τ, 2/τ]} ≥ |(1/τ) ∫_{−τ}^τ ψ(λ) dλ| − 1.

Proof. By Fubini's Theorem

(1/2τ) ∫_{−τ}^τ ψ(λ) dλ = (1/2τ) ∫_{−τ}^τ dλ ∫_{−∞}^∞ e^{iλx} dQ(x) = ∫_{−∞}^∞ (sin xτ)/(xτ) dQ(x).

Therefore

|(1/2τ) ∫_{−τ}^τ ψ(λ) dλ| = |∫_{−∞}^∞ (sin xτ)/(xτ) dQ(x)|

≤ |∫_{|x|≤2/τ} (sin xτ)/(xτ) dQ(x)| + |∫_{|x|>2/τ} (sin xτ)/(xτ) dQ(x)|

≤ ∫_{|x|≤2/τ} dQ(x) + ∫_{|x|>2/τ} |sin xτ|/(|x|τ) dQ(x)

≤ Q{[−2/τ, 2/τ]} + (1/2)(1 − Q{[−2/τ, 2/τ]}) = (1/2)Q{[−2/τ, 2/τ]} + 1/2.

Multiplying by 2 we obtain |(1/τ) ∫_{−τ}^τ ψ(λ) dλ| ≤ Q{[−2/τ, 2/τ]} + 1, which is
the desired inequality. □
We now return to the proof of the theorem. Let φ_n(λ) → φ(λ) for each λ
and let ε > 0 be arbitrary. Since φ(0) = 1 and φ(λ) is a continuous function,
there exists τ > 0 such that |φ(λ) − 1| < ε/4 as soon as |λ| < τ. Then

|∫_{−τ}^τ φ(λ) dλ| = |∫_{−τ}^τ (φ(λ) − 1) dλ + 2τ| ≥ 2τ − |∫_{−τ}^τ (φ(λ) − 1) dλ|

≥ 2τ − ∫_{−τ}^τ |φ(λ) − 1| dλ > 2τ − 2τ(ε/4) = 2τ(1 − ε/4).

Therefore

|(1/τ) ∫_{−τ}^τ φ(λ) dλ| > 2 − ε/2.

Since φ_n(λ) → φ(λ) and |φ_n(λ)| ≤ 1, by Lebesgue's Bounded Convergence
Theorem

lim_{n→∞} ∫_{−τ}^τ φ_n(λ) dλ = ∫_{−τ}^τ φ(λ) dλ.

So there exists an N such that for all n ≥ N

|(1/τ) ∫_{−τ}^τ φ_n(λ) dλ| > 2 − ε.

By Lemma 14.1, for such n

P_n([−2/τ, 2/τ]) ≥ |(1/τ) ∫_{−τ}^τ φ_n(λ) dλ| − 1 > 1 − ε.

For each $n < N$ we choose $t_n > 0$ such that $P_n([-t_n, t_n]) > 1 - \varepsilon$. If we set $K = \max\bigl[\frac{2}{\tau}, \max_{1\le n<N} t_n\bigr]$ then we find that $P_n([-K, K]) > 1 - \varepsilon$ for all $n$. By Helly's Theorem the sequence of measures $\{P_n\}$ is weakly compact. So we can choose a weakly convergent subsequence $\{P_{n_j}\}$, $P_{n_j} \Rightarrow \widetilde P$. We now show that $\widetilde P = P$. Let us denote by $\widetilde\phi(\lambda)$ the characteristic function of $\widetilde P$. From the first part of our theorem $\phi_{n_j}(\lambda) \to \widetilde\phi(\lambda)$. On the other hand, by assumption $\phi_{n_j}(\lambda) \to \phi(\lambda)$. So $\phi(\lambda) = \widetilde\phi(\lambda)$. By Corollary 14.1 we have $\widetilde P = P$.
It remains to be established that the whole sequence $P_n \Rightarrow P$. Assume that this is not true. Then for some bounded continuous function $f$ there exists a subsequence $\{n_j\}$ such that

$$\int f(x)\,dP_{n_j}(x) \not\to \int f(x)\,dP(x).$$

We extract from the sequence $\{P_{n_j}\}$ a weakly convergent subsequence. Without loss of generality one can assume that the subsequence is the whole of $\{P_{n_j}\}$ and that $P_{n_j} \Rightarrow \widetilde P$. By the same argument as previously one can show that $\widetilde P = P$ and therefore

$$\int f(x)\,dP_{n_j}(x) \to \int f(x)\,dP(x).$$

Hence the contradiction. □


Remark. One can show that if $\phi_n(\lambda) \to \phi(\lambda)$ for every $\lambda$, then this convergence is uniform on each finite interval in $\lambda$.

Lecture 15. Central Limit Theorem for Sums of Independent Random Variables

In this lecture we consider one of the most important results of probability theory, the Central Limit Theorem. We already encountered this theorem in the particular case of a sequence of Bernoulli trials in the form of the De Moivre-Laplace Limit Theorem (Lecture 3). In its simplest form the Central Limit Theorem deals with sums of independent random variables. Let us assume that a sequence of such random variables $\xi_1, \xi_2, \ldots, \xi_n$ is given. One can think of the $\xi_i$ as the results of a sequence of independent measurements of some physical quantity. Assuming that the strong law of large numbers is applicable to the $\{\xi_i\}$ we write $\xi_i = m_i + \tilde\xi_i$, where $m_i = \mathsf{E}\,\xi_i$, $\mathsf{E}\,\tilde\xi_i = 0$, and we consider the arithmetic mean $\frac1n(\xi_1 + \cdots + \xi_n) = \frac1n(m_1 + \cdots + m_n) + \frac1n(\tilde\xi_1 + \cdots + \tilde\xi_n)$. It follows from the strong law of large numbers that the last term converges to zero with probability 1. The issue of the speed and type of its convergence to zero arises naturally. In the case of a sequence of measurements the answer to this question gives the precision that one can achieve with $n$ measurements. Under some simple further assumptions it follows, from Chebyshev's inequality, that $\frac1n(\tilde\xi_1 + \cdots + \tilde\xi_n)$ takes values of order $O(1/\sqrt n)$. Therefore we consider $\sqrt n\bigl(\frac1n(\tilde\xi_1 + \cdots + \tilde\xi_n)\bigr) = \frac{1}{\sqrt n}(\tilde\xi_1 + \cdots + \tilde\xi_n) = \eta_n$. It turns out, and this is the essence of the Central Limit Theorem, that in general, as $n \to \infty$, the random variables $\eta_n$ (as functions on $\Omega$) do not converge to a limit, but their distributions, as $n \to \infty$, have a limit which, in a well-known sense, is independent of the details of the distributions of the random variables $\xi_i$. We now systematically explore these considerations.

Let $F_1$ and $F_2$ be two distribution functions.
Let FI and F2 be two distribution functions.

Definition 15.1. The convolution of the distribution functions $F_1$ and $F_2$ is defined to be $G(x) = \int_{-\infty}^{\infty} F_1(x-u)\,dF_2(u)$.

The convolution is written as $G = F_1 * F_2$.

Theorem 15.1. Let $\xi_1$ and $\xi_2$ be two independent random variables with distribution functions $F_1$ and $F_2$. Then the random variable $\xi = \xi_1 + \xi_2$ has $G$ as its distribution function.

Proof. Let us denote by $F_{\xi_1,\xi_2}(x_1, x_2)$ the joint distribution function of the random variables $\xi_1$ and $\xi_2$. By definition

$$F_\xi(x) = \mathsf{P}(\{\omega \mid \xi_1 + \xi_2 \le x\}) = \iint_{x_1+x_2\le x} dF_{\xi_1,\xi_2}(x_1, x_2).$$

Since $\xi_1$ and $\xi_2$ are independent we have $F_{\xi_1,\xi_2}(x_1,x_2) = F_1(x_1)F_2(x_2)$ and by Fubini's Theorem

$$F_\xi(x) = \iint_{x_1+x_2\le x} dF_1(x_1)\,dF_2(x_2) = \int dF_2(x_2)\,F_1(x - x_2). \qquad \Box$$
The formula for the convolution can be viewed as a total probability formula. If we fix a value $u$ of the random variable $\xi_2$ then the probability that $\xi_1 + \xi_2 \le x$ under the condition that $\xi_2 = u$ is equal, by the independence of $\xi_1$ and $\xi_2$, to the probability that $\xi_1 \le x - u$, i.e. $F_1(x-u)$.

Assume that the random variable $\xi_1$ takes its values in the set $A_1$, and that the random variable $\xi_2$ takes its values in the set $A_2$. One sometimes says that $F_1$ (or $P_{\xi_1}$) is concentrated on $A_1$, and $F_2$ (or $P_{\xi_2}$) is concentrated on $A_2$. Then $G$ is concentrated on the set $A = A_1 + A_2$, where

$$A = \{x \mid x = x_1 + x_2,\ x_1 \in A_1,\ x_2 \in A_2\}.$$

The set $A$ is called the arithmetic sum of the sets $A_1$ and $A_2$. If $F_1$ and $F_2$ are discrete then $F_1 * F_2$ is also discrete. If the distribution function $F_1$ has a density $p_1(x)$ then $G$ has a density $p(x)$ and

$$p(x) = \int_{-\infty}^{\infty} p_1(x-u)\,dF_2(u).$$
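As an illustration (ours, not in the original text), the density formula can be checked on the simplest example: the convolution of two Uniform(0,1) laws, which gives the triangular density $p(x) = x$ on $[0,1]$ and $p(x) = 2 - x$ on $[1,2]$. The sketch compares a probability computed from this density against direct simulation of $\xi_1 + \xi_2$; sample size and tolerance are arbitrary choices.

```python
import random

# Convolution of two Uniform(0,1) laws: p(x) = ∫ p1(x-u) dF2(u) is triangular on [0,2].
def p(x):
    if 0 <= x <= 1:
        return x
    if 1 < x <= 2:
        return 2 - x
    return 0.0

# probability that xi1 + xi2 lands in [0.9, 1.1], from the density (midpoint rule) ...
exact = sum(p(0.9 + (i + 0.5) * 0.001) * 0.001 for i in range(200))

# ... and from direct simulation of the sum of two independent uniforms
random.seed(0)
n = 200_000
emp = sum(0.9 <= random.random() + random.random() <= 1.1 for _ in range(n)) / n

assert abs(exact - 0.19) < 1e-9   # closed form: 2 * ∫_{0.9}^{1} x dx = 0.19
assert abs(emp - exact) < 0.01
```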

Now let $\xi_1, \ldots, \xi_n$ be a sequence of $n$ independent random variables, and $F_{\xi_1}, \ldots, F_{\xi_n}$ be their distribution functions. Then the distribution function of the sum $\zeta_n = \xi_1 + \cdots + \xi_n$ is the $n$-fold convolution $F_{\zeta_n} = F_{\xi_1} * F_{\xi_2} * \cdots * F_{\xi_n}$.

The operation of convolution, as one can see from its definition, is quite complicated. It turns out that the analysis of convolutions is significantly simplified by using characteristic functions. We denote by $\phi_{\xi_1}(\lambda), \phi_{\xi_2}(\lambda), \ldots, \phi_{\xi_n}(\lambda)$ the characteristic functions of the random variables $\xi_1, \xi_2, \ldots, \xi_n$, and by $\phi_{\zeta_n}(\lambda)$ the characteristic function of $\zeta_n$.

Theorem 15.2. $\phi_{\zeta_n}(\lambda) = \prod_{k=1}^{n} \phi_{\xi_k}(\lambda)$.

Proof. The characteristic function $\phi_{\xi_k}(\lambda) = \mathsf{E}\exp(i\lambda\xi_k)$. Since the random variables $\xi_1, \xi_2, \ldots, \xi_n$ are independent, the random variables $e^{i\lambda\xi_1}, \ldots, e^{i\lambda\xi_n}$, as functions of independent random variables, are also independent. This latter statement for real-valued functions can be found in Lemma 12.1. It can be extended without any modifications to complex-valued functions. Furthermore, in the case of real-valued independent random variables the expectation of their product is equal to the product of their expectations (see Lemma 12.2). This statement can also be extended, without any modifications, to the case of complex-valued random variables. Therefore

$$\phi_{\zeta_n}(\lambda) = \mathsf{E}\,e^{i\lambda\zeta_n} = \mathsf{E}\,e^{i\lambda\sum_{k=1}^n \xi_k} = \mathsf{E}\prod_{k=1}^n e^{i\lambda\xi_k} = \prod_{k=1}^n \phi_{\xi_k}(\lambda). \qquad \Box$$
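The product formula can be confirmed exactly on a discrete example (our illustration, not the book's): for a fair 0/1 coin $\phi(\lambda) = (1+e^{i\lambda})/2$, and the characteristic function of the sum of $n$ such coins, computed directly from the binomial distribution, must equal $\phi(\lambda)^n$.

```python
import cmath, math

# characteristic function of a single fair coin with values 0/1
def phi_coin(lam):
    return (1 + cmath.exp(1j * lam)) / 2

# characteristic function of the sum of n coins, straight from the binomial law
def phi_binom(lam, n):
    return sum(math.comb(n, k) * 0.5**n * cmath.exp(1j * lam * k)
               for k in range(n + 1))

n = 8
for lam in [0.3, 1.0, 2.5]:
    # Theorem 15.2: the char. fn of the sum is the n-th power of phi_coin
    assert abs(phi_coin(lam)**n - phi_binom(lam, n)) < 1e-12
```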
We are now in a position to prove the Central Limit Theorem in the
following important case.

Theorem 15.3. Let $\xi_1, \xi_2, \ldots, \xi_n, \ldots$ be a sequence of independent identically distributed random variables with distribution function $F$, and such that the second moment $\int_{-\infty}^{\infty} x^2\,dF(x) = a < \infty$. We denote the expectation by $m = \mathsf{E}\,\xi_i = \int_{-\infty}^{\infty} x\,dF(x)$ and the variance by $d = a - m^2$. Then as $n \to \infty$ the distribution of the normalized sums

$$\eta_n = \frac{\zeta_n - nm}{\sqrt{nd}} = \frac{1}{\sqrt{nd}}\Bigl(\sum_{k=1}^n \xi_k - nm\Bigr)$$

converges weakly to the Gaussian distribution with density $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$.

Proof. The characteristic function of the Gaussian distribution is

$$\frac{1}{\sqrt{2\pi}}\int \exp(i\lambda x - x^2/2)\,dx = e^{-\lambda^2/2}.$$

The characteristic function of the random variable $\eta_n$ is equal to

$$\phi_{\eta_n}(\lambda) = e^{-i\lambda nm/\sqrt{nd}}\Bigl(\phi\Bigl(\frac{\lambda}{\sqrt{nd}}\Bigr)\Bigr)^n, \qquad (15.1)$$

where $\phi(\lambda)$ is the characteristic function of the distribution $F$ (see Property 3 of characteristic functions, Lecture 14). In order to prove the theorem it suffices to show that $\phi_{\eta_n}(\lambda) \to \exp(-\lambda^2/2)$ for every $\lambda$ (Theorem 14.2).

Since $a < \infty$, for any $\lambda$ the integral $\int e^{i\lambda x} x^2\,dF(x)$ exists. Therefore from the properties of the Lebesgue integral it follows that $\phi(\lambda)$ is twice differentiable, $\phi''(\lambda) = -\int e^{i\lambda x} x^2\,dF(x)$, and $\phi''(\lambda)$ is continuous, in fact uniformly continuous. The proof of this latter property is carried out as the proof of the uniform continuity of characteristic functions (see Property 5 of characteristic functions, Lecture 14). Therefore for small $\lambda$

$$\phi(\lambda) = 1 + im\lambda - \frac{a}{2}\lambda^2 + o(\lambda^2). \qquad (15.2)$$

The last term calls for a further remark. We write the real and imaginary parts of the characteristic function $\phi(\lambda)$ separately and for each part we write a Taylor expansion with a Lagrange remainder up to the second order. The intermediate point is in general different for each part. By the continuity of $\phi''(\lambda)$ and since $\phi''(0) = -a$, the second derivative of the real part converges to $-a$ and that of the imaginary part converges to zero. This leads to equation (15.2). If we substitute it into (15.1) we obtain
$$\phi_{\eta_n}(\lambda) = e^{-i\lambda nm/\sqrt{nd}}\Bigl(1 + \frac{im\lambda}{\sqrt{nd}} - \frac{a\lambda^2}{2nd} + o\Bigl(\frac{1}{n}\Bigr)\Bigr)^n \longrightarrow e^{-\lambda^2/2}$$

for any fixed $\lambda$. □
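A simulation (ours, not part of the text) illustrating Theorem 15.3 for Uniform(0,1) summands, where $m = 1/2$ and $d = 1/12$: the distribution function of $\eta_n$ is compared with the Gaussian value $\Phi(1/2)$ at a single point. The choices of $n$, the number of trials and the tolerance are arbitrary.

```python
import math, random

random.seed(1)
n, trials = 400, 20_000
m, d = 0.5, 1 / 12            # mean and variance of Uniform(0,1)
count = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    eta = (s - n * m) / math.sqrt(n * d)   # normalized sum of Theorem 15.3
    count += (eta <= 0.5)

# Gaussian limit value Phi(0.5) = 0.5*(1 + erf(0.5/sqrt(2))) ≈ 0.6915
gauss = 0.5 * (1 + math.erf(0.5 / math.sqrt(2)))
assert abs(count / trials - gauss) < 0.02
```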


We now quote, without proof, a more general formulation of the Central
Limit Theorem which deals with sequences of independent random variables
with different distributions.

Theorem 15.4 (Lyapunov). Let $\{\xi_k\}$ be a sequence of independent random variables, $m_k = \mathsf{E}\,\xi_k$, $d_k = \operatorname{Var}\xi_k$ and $\mathsf{E}|\xi_k - m_k|^3 = c_k$. If $(1/D_n^{3/2})\sum_{k=1}^n c_k \to 0$ as $n \to \infty$, where $D_n = \sum_{k=1}^n d_k$, then the probability distribution of the normed sums $\eta_n = (1/\sqrt{D_n})\bigl(\sum_{k=1}^n \xi_k - \sum_{k=1}^n m_k\bigr)$ converges weakly to the Gaussian distribution with density $(1/\sqrt{2\pi})e^{-x^2/2}$.

We recall that $D_n = \operatorname{Var}\bigl(\sum_{k=1}^n \xi_k\bigr)$, since for independent random variables the variance of the sum is equal to the sum of the variances. A more general form of the Central Limit Theorem is that with Lindeberg's condition, which is almost necessary. Let $F_k$ be the distribution function of the random variable $\xi_k$. We assume that Lindeberg's condition holds, namely for each $\varepsilon > 0$

$$\frac{1}{D_n}\sum_{k=1}^n \int_{|x-m_k| > \varepsilon\sqrt{D_n}} (x - m_k)^2\,dF_k(x) \to 0 \quad \text{as } n \to \infty.$$
Theorem 15.5. If Lindeberg's condition is satisfied then the limiting distribution (as $n \to \infty$) of the normed sum $\eta_n$ is Gaussian. If in addition a condition of asymptotic smallness is satisfied:

$$\max_{1\le k\le n} \frac{d_k}{D_n} \to 0 \quad \text{as } n \to \infty,$$

then Lindeberg's condition is necessary for the convergence of the distribution of $\eta_n$ to the Gaussian distribution.
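For i.i.d. summands Lyapunov's ratio can be computed in closed form. The sketch below (ours, not the book's) does this for Uniform(0,1), where $c_k = \mathsf{E}|\xi - 1/2|^3 = 1/32$ and $d_k = 1/12$, and checks that the ratio of Theorem 15.4 decays to zero, like $n^{-1/2}$.

```python
# Lyapunov ratio for iid Uniform(0,1): c_k = E|xi - 1/2|^3 = 2*∫_0^{1/2} t^3 dt = 1/32,
# d_k = Var(xi) = 1/12, so D_n = n/12 and the ratio is (n/32)/(n/12)^{3/2} ~ n^{-1/2}.
c, d = 1 / 32, 1 / 12

def lyapunov_ratio(n):
    Dn = n * d
    return (n * c) / Dn**1.5

ratios = [lyapunov_ratio(n) for n in (10, 100, 1000, 10_000)]
# the condition of Theorem 15.4 holds: the ratio is decreasing and tends to 0
assert all(a > b for a, b in zip(ratios, ratios[1:]))
assert ratios[-1] < 0.02
```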

There are quite a few generalizations of the Central Limit Theorem, where
the condition of independence of the random variables is replaced by a con-
dition of weak dependence in one sense or another. Other important general-
izations are those connected with vector-valued random variables.
The appearance and the universality of the Gaussian distribution in the Central Limit Theorem can seem rather enigmatic and surprising. The goal of the following remarks is to elucidate both those circumstances. It will be convenient to consider a sequence of numbers $n_p = 2^p$ and assume that $\mathsf{E}\,\xi_i = 0$. We define the random variable $\gamma_p = 2^{-p/2}\sum_{t=1}^{2^p} \xi_t$, where the $\xi_t$ are identically distributed and have a finite second moment. Then $\gamma_{p+1} = (1/\sqrt2)(\gamma_p' + \gamma_p'')$, where $\gamma_p' = 2^{-p/2}\sum_{t=1}^{2^p} \xi_t$ and $\gamma_p'' = 2^{-p/2}\sum_{t=2^p+1}^{2^{p+1}} \xi_t$. We see that $\gamma_p'$ and $\gamma_p''$ are independent, identically distributed random variables. If $F_p$ is the distribution function of $\gamma_p$ then

$$F_{p+1}(x) = \int_{-\infty}^{\infty} F_p(x\sqrt2 - u)\,dF_p(u). \qquad (15.3)$$

So the sequence $F_p$ can be obtained from $F_0$ by iterations of a non-linear operation obtained from transformation (15.3), which we denote by $T$. This means that if for any distribution function $F$ we set $TF = \int_{-\infty}^{\infty} F(x\sqrt2 - u)\,dF(u)$ we have $F_{p+1} = TF_p$ and $F_p = T^p F_0$. It is easy to check directly that the Gaussian distribution function $G(x) = (1/\sqrt{2\pi})\int_{-\infty}^{x} e^{-u^2/2}\,du$ is a fixed point of the transformation $T$, i.e. $G = TG$. The fact that for a wide class of distribution functions $F_0$ the convergence $F_p \Rightarrow G$ holds is related to the stability of this fixed point. We now note that if we set $d(F) = \int x^2\,dF(x)$ for a distribution function $F$ such that $\int x\,dF(x) = 0$, i.e. $d(F)$ is the variance of the distribution $F$, then

$$d(TF) = d(F). \qquad (15.4)$$

This follows from the fact that the variance of the random variable $\gamma = (1/\sqrt2)(\gamma' + \gamma'')$, where $\gamma'$ and $\gamma''$ are independent, identically distributed random variables, is the same as the variance of $\gamma'$ and $\gamma''$. Equation (15.4) shows that the variance is the "first integral" of the non-linear transformation $T$.
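The doubling iteration $\gamma_{p+1} = (\gamma_p' + \gamma_p'')/\sqrt2$ can be watched numerically (our sketch, not the book's). Starting from $\xi = \pm1$ with variance 1, a few iterations already put the mass of $[-1,1]$ close to the Gaussian value $2\Phi(1)-1 \approx 0.6827$, while the variance, the first integral of $T$, stays equal to 1. The values of $p$, the sample size and the tolerances are arbitrary choices.

```python
import math, random

random.seed(2)

def sample_gamma(p):
    # one draw of gamma_p = 2^{-p/2} * (sum of 2^p independent ±1 signs),
    # i.e. a sample from the p-th iterate T^p F_0 with F_0 the ±1 coin
    k = 2 ** p
    return sum(random.choice((-1, 1)) for _ in range(k)) / math.sqrt(k)

trials = 20_000
vals = [sample_gamma(7) for _ in range(trials)]   # p = 7, i.e. 128 summands

# the variance is preserved along the iteration: d(T^p F_0) = d(F_0) = 1
var = sum(v * v for v in vals) / trials
assert abs(var - 1) < 0.05

# the mass of [-1, 1] is close to the Gaussian value 2*Phi(1)-1 ≈ 0.6827
mass = sum(-1 <= v <= 1 for v in vals) / trials
assert abs(mass - 0.6827) < 0.03
```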
In the general theory of non-linear operators the investigation of the stability of a fixed point begins with an investigation of its stability with respect to the linear approximation. In our case the problem of linear stability arises in the following way. Set $F_\varepsilon = \int_{-\infty}^{x}(1 + \varepsilon h(u))\frac{1}{\sqrt{2\pi}}\exp\bigl(-\frac{u^2}{2}\bigr)\,du$ and write

$$TF_\varepsilon = \int_{-\infty}^{x}\bigl(1 + \varepsilon Lh(u)\bigr)\frac{1}{\sqrt{2\pi}}e^{-u^2/2}\,du + o(\varepsilon).$$

The linear operator $L$ which appears in the last expression is the linearization of the operator $T$ at the point $G$. It is easy to write it explicitly as

$$(Lh)(x) = \frac{2}{\sqrt{2\pi}}\int_{-\infty}^{\infty} h\Bigl(\frac{x+u}{\sqrt2}\Bigr)e^{-u^2/2}\,du.$$
The integral operator that we have written is sometimes called the Gaussian integral operator. One can show that it is self-adjoint for a suitable choice of a scalar product. We now find its spectrum and eigenfunctions. One can show directly that the eigenfunctions of $L$ are the Hermite polynomials $h = P_k(x)$ (see Lecture 9), which correspond to the eigenvalues $\mu_k = 2^{1-k/2}$, $k = 0, 1, 2, 3, \ldots$. We see that $\mu_0, \mu_1 > 1$, $\mu_2 = 1$, and all the remaining eigenvalues are less than 1. In other words, in the directions which correspond to $P_0$ and $P_1$ the operator $T$ is non-stable, while in the direction which corresponds to $P_2$ it is marginal, and in all remaining directions it is stable. We note here that we must consider perturbations $h$ for which

$$\int h(u)e^{-u^2/2}\,du = 0, \qquad \int u\,h(u)e^{-u^2/2}\,du = 0.$$

The first condition corresponds to the fact that the perturbation of G is such
that f~ 00 (1 + fh( u ))e- U 2 /2 du remains a distribution function, and the second
condition means that the expectation of the perturbed distribution function
remains equal to zero. Thus the perturbations are orthogonal to the non-stable
directions. Furthermore since the variance d(F) is the first integral of the
transformation T then the perturbations must leave the variance unchanged:

i.e. h must be orthogonal to the second Hermite polynomial as well. In the


rest of the subspace L has a spectrum which is strictly less than 1. This shows
that the fixed point G is stable for the linear approximation with respect
to perturbations which lie in the class of distribution functions with zero
expectation and fixed variances.
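The eigenvalue claim $LP_k = 2^{1-k/2}P_k$ can be tested numerically (our sketch, assuming the explicit form of $L$ written above and taking for $P_k$ the probabilists' Hermite polynomials, orthogonal with respect to $e^{-x^2/2}$); the quadrature interval and step are arbitrary.

```python
import math

def H(k, x):
    # probabilists' Hermite polynomials: H_0 = 1, H_1 = x, H_{k} = x H_{k-1} - (k-1) H_{k-2}
    if k == 0:
        return 1.0
    if k == 1:
        return x
    return x * H(k - 1, x) - (k - 1) * H(k - 2, x)

def L(h, x, steps=4000, a=10.0):
    # (Lh)(x) = (2/sqrt(2*pi)) * ∫ h((x+u)/sqrt(2)) e^{-u^2/2} du, trapezoid rule
    s, step = 0.0, 2 * a / steps
    for i in range(steps + 1):
        u = -a + i * step
        w = 0.5 if i in (0, steps) else 1.0
        s += w * h((x + u) / math.sqrt(2)) * math.exp(-u * u / 2)
    return 2 / math.sqrt(2 * math.pi) * s * step

for k in range(5):
    mu = 2 ** (1 - k / 2)          # claimed eigenvalue mu_k = 2^{1-k/2}
    for x in (0.0, 0.7, -1.3):
        assert abs(L(lambda t: H(k, t), x) - mu * H(k, x)) < 1e-6
```

In particular $\mu_0 = 2$, $\mu_1 = \sqrt2$, $\mu_2 = 1$, matching the stability discussion above.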
We can now conclude from the general theory of non-linear operators that for distribution functions $F$ with zero expectation and $d(F) = 1$ which lie in a sufficiently small neighborhood of $G$, we have $T^n(F) \Rightarrow G$. The formulations of the Central Limit Theorem given above mean that $G$ is a stable point not only for a small, but also for a large, class of distributions.
A suitable generalization of this argument makes it possible to deal with
the problem of constructing limit distributions for sequences of dependent
random variables. However we will not deal with this matter in any more
detail.
For a large class of problems it is necessary to consider slowly decreasing distributions for which $\int x^2\,dF(x) = \infty$. It turns out that this is where new limit distributions emerge. We now consider a particular case.

Theorem 15.6. Let $\xi_1, \xi_2, \ldots, \xi_n, \ldots$ be a sequence of independent, identically distributed random variables for which the distribution function $F$ has a density $p(x)$ such that
i) $p(x) = p(-x)$; and
ii) $p(x) \sim c/|x|^{\alpha+1}$ for some $\alpha \in (0, 2)$, where $c$ is a constant.
Then the distribution of the normed sum $\eta_n = n^{-1/\alpha}(\xi_1 + \cdots + \xi_n)$ as $n \to \infty$ converges to a limit distribution whose characteristic function is of the form $\exp(-c_1|\lambda|^\alpha)$ for some constant $c_1 > 0$.
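The case $\alpha = 1$ of Theorem 15.6 can be seen in simulation (ours, not the book's): the standard Cauchy density $1/(\pi(1+x^2))$ satisfies i) and ii) with $c = 1/\pi$, its characteristic function is $e^{-|\lambda|}$, and $n^{-1}(\xi_1 + \cdots + \xi_n)$ is again standard Cauchy, so for example $\mathsf{P}(\eta_n \le 1) = 3/4$ for every $n$. Sample sizes and the tolerance are arbitrary.

```python
import math, random

random.seed(3)

def cauchy():
    # inverse-CDF sampling of the standard Cauchy law
    return math.tan(math.pi * (random.random() - 0.5))

n, trials = 50, 40_000
hits = 0
for _ in range(trials):
    # note the normalization n^{-1/alpha} = n^{-1} for alpha = 1
    eta = sum(cauchy() for _ in range(n)) / n
    hits += (eta <= 1.0)

# for a standard Cauchy variable, P(eta <= 1) = 1/2 + arctan(1)/pi = 3/4
assert abs(hits / trials - 0.75) < 0.01
```

Note that the usual $n^{-1/2}$ normalization would send these sums to infinity in distribution: the second moment is infinite, and the Gaussian limit is replaced by a stable law.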

The theory of limit distributions for sums of independent random variables


is one of the most developed branches of probability theory, in which many
remarkable results are known.
Lecture 16. Probabilities of Large Deviations

At the beginning of this course on probability theory we considered the probabilities $\mathsf{P}\bigl(\bigl|\sum_{k=1}^n \xi_k - \sum_{k=1}^n m_k\bigr| \ge t\bigr)$ with $m_k = \mathsf{E}\,\xi_k$ for a sequence of independent random variables $\xi_1, \xi_2, \ldots$, and we estimated those probabilities by Chebyshev's inequality:

$$\mathsf{P}\Bigl(\Bigl|\sum_{k=1}^n \xi_k - \sum_{k=1}^n m_k\Bigr| \ge t\Bigr) \le \frac{\sum_{k=1}^n d_k}{t^2}, \qquad d_k = \operatorname{Var}\xi_k.$$

It follows in particular that if the random variables $\xi_i$ are identically distributed then for some constant $c$ which does not depend on $n$, and with $d = d_k$:
$a_1$) for $t = c\sqrt n$ we have $d/c^2$ on the right hand side;
$a_2$) for $t = cn$ we have $d/(c^2 n)$ on the right hand side.

We know from the Central Limit Theorem that in case $a_1$) as $n \to \infty$ the corresponding probability converges to a positive limit which can be calculated by using the Gaussian distribution. This means that in $a_1$) the order of magnitude of the estimates obtained from Chebyshev's inequality is correct. On the other hand, in case $a_2$) the estimate from Chebyshev's inequality is often too large, and very crude. In this lecture we investigate the possibility of obtaining more precise estimates for the probability in case $a_2$).
Let us consider a sequence of independent identically distributed random variables. We denote their common distribution function by $F$. We make the following basic assumption about $F$: the integral

$$R(\lambda) = \int_{-\infty}^{\infty} e^{\lambda x}\,dF(x) < \infty \qquad (16.1)$$

for all $\lambda$, $-\infty < \lambda < \infty$. This condition is automatically satisfied if all the $\xi_i$ are bounded: $|\xi_i| \le c = \text{const}$. It is also satisfied if the probabilities of large values of the $\xi_i$ decrease faster than exponentially.

We now investigate several properties of the function $R(\lambda)$. From the finiteness of the integral in (16.1) it follows that the derivatives

$$R'(\lambda) = \int_{-\infty}^{\infty} x e^{\lambda x}\,dF(x), \qquad R''(\lambda) = \int_{-\infty}^{\infty} x^2 e^{\lambda x}\,dF(x)$$

exist for all $\lambda$. Let us consider $m(\lambda) = R'(\lambda)/R(\lambda)$. Then


$$m'(\lambda) = \frac{R''(\lambda)}{R(\lambda)} - \Bigl(\frac{R'(\lambda)}{R(\lambda)}\Bigr)^2 = \int_{-\infty}^{\infty} \frac{x^2}{R(\lambda)}e^{\lambda x}\,dF(x) - \Bigl(\int_{-\infty}^{\infty} \frac{x}{R(\lambda)}e^{\lambda x}\,dF(x)\Bigr)^2.$$

We construct a new distribution function $F_\lambda(x) = \frac{1}{R(\lambda)}\int_{(-\infty,x)} e^{\lambda u}\,dF(u)$ for each $\lambda$. Then $m(\lambda) = \int x\,dF_\lambda(x)$ is the expectation, computed with respect to this distribution, and $m'(\lambda)$ is the variance. Therefore $m'(\lambda) > 0$ if $F$ is a non-trivial distribution, i.e. is not concentrated at a point. We exclude this latter case from our further considerations. Since $m'(\lambda) > 0$, $m(\lambda)$ is a monotonically increasing function. We will need some information on $\lim m(\lambda)$ as $\lambda \to \pm\infty$.

We say that $M^+$ is an upper limit in probability for the random variable $\xi_i$ if $\mathsf{P}(\xi > M^+) = 0$, and $\mathsf{P}(M^+ - \varepsilon \le \xi \le M^+) > 0$ for any $\varepsilon > 0$. One can define a concept of lower limit in probability $M^-$ in an analogous way. If $\mathsf{P}(\xi > M) > 0$ (resp. $\mathsf{P}(\xi < M) > 0$) for any $M$, then $M^+ = \infty$ (resp. $M^- = -\infty$). In all remaining cases $M^+$ and $M^-$ are finite.
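The objects $R$ and $m$ are explicit for the simplest bounded variable (our illustration, not the book's): for a fair $\pm1$ coin, $R(\lambda) = \cosh\lambda$ and $m(\lambda) = \tanh\lambda$, which is increasing and tends to $M^\pm = \pm1$, exactly as Lemma 16.1 below predicts.

```python
import math

# Fair coin with values ±1: R(lam) = E e^{lam*xi} = cosh(lam)
def R(lam):
    return 0.5 * (math.exp(lam) + math.exp(-lam))

def Rprime(lam):
    return 0.5 * (math.exp(lam) - math.exp(-lam))   # = sinh(lam)

def m(lam):
    # m(lam) = R'(lam)/R(lam), the mean under the tilted law F_lam
    return Rprime(lam) / R(lam)

# m agrees with tanh and is increasing (m'(lam) is a variance, hence > 0) ...
assert abs(m(1.3) - math.tanh(1.3)) < 1e-12
grid = [-6, -2, -0.5, 0, 0.5, 2, 6]
vals = [m(x) for x in grid]
assert all(a < b for a, b in zip(vals, vals[1:]))

# ... and tends to the upper/lower limits in probability M+ = 1, M- = -1
assert abs(m(30) - 1) < 1e-8 and abs(m(-30) + 1) < 1e-8
```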

Lemma 16.1. $\lim_{\lambda\to\infty} m(\lambda) = M^+$, $\lim_{\lambda\to-\infty} m(\lambda) = M^-$.


Proof. We prove only the first statement, since the second is proved in an analogous way. Assume first that $M^+ = \infty$. We show that $m(\lambda) \ge M$ for any preassigned $M$ if $\lambda > \lambda(M)$. We have

$$R'(\lambda) = \int_{-\infty}^{\infty} x e^{\lambda x}\,dF(x) \ge \int_{2M}^{\infty} x e^{\lambda x}\,dF(x).$$

Furthermore

$$R(\lambda) = \int_{-\infty}^{2M} e^{\lambda x}\,dF(x) + \int_{2M}^{\infty} e^{\lambda x}\,dF(x) \le e^{2\lambda M} + \int_{2M}^{\infty} e^{\lambda x}\,dF(x) = \int_{2M}^{\infty} e^{\lambda x}\,dF(x)\Bigl(1 + \frac{e^{2\lambda M}}{\int_{2M}^{\infty} e^{\lambda x}\,dF(x)}\Bigr).$$

Since $M^+ = \infty$, for any $M$ there exists a half-open interval $(a, a+1]$ with $a > 4M$ such that $F(a+1) - F(a) > 0$. Therefore

$$\int_{2M}^{\infty} e^{\lambda x}\,dF(x) \ge \int_{(a,a+1]} e^{\lambda x}\,dF(x) \ge e^{a\lambda}\bigl(F(a+1) - F(a)\bigr) \ge e^{4M\lambda}\bigl(F(a+1) - F(a)\bigr),$$

and

$$\frac{e^{2\lambda M}}{\int_{2M}^{\infty} e^{\lambda x}\,dF(x)} \le \frac{e^{2\lambda M}}{e^{4\lambda M}\bigl(F(a+1) - F(a)\bigr)} = \frac{e^{-2\lambda M}}{F(a+1) - F(a)}.$$

For $\lambda > \lambda(M)$ the latter expression is less than 1. For such $\lambda$

$$\frac{R'(\lambda)}{R(\lambda)} \ge \frac{\int_{2M}^{\infty} x e^{\lambda x}\,dF(x)}{\Bigl(\int_{2M}^{\infty} e^{\lambda x}\,dF(x)\Bigr)\Bigl(1 + \frac{e^{2\lambda M}}{\int_{2M}^{\infty} e^{\lambda x}\,dF(x)}\Bigr)} \ge \frac{1}{2}\,\frac{\int_{2M}^{\infty} x e^{\lambda x}\,dF(x)}{\int_{2M}^{\infty} e^{\lambda x}\,dF(x)} \ge \frac{1}{2}\cdot 2M = M.$$

To obtain the last inequality we used the fact that $e^{\lambda x}\,dF(x)\big/\int_{2M}^{\infty} e^{\lambda x}\,dF(x)$ generates a probability measure on the half-line $(2M, \infty)$. So the lemma is proved for the case where $M^+ = \infty$.
We now turn to the case of a finite $M^+$. We show that for any $\delta > 0$ there exists a $\lambda(\delta)$ such that $m(\lambda) \ge M^+ - \delta$ if $\lambda > \lambda(\delta)$. As previously

$$R'(\lambda) = \int_{-\infty}^{\infty} x e^{\lambda x}\,dF(x) \ge \int_{M^+-\delta/2}^{\infty} x e^{\lambda x}\,dF(x),$$

$$R(\lambda) = \int_{-\infty}^{M^+-\delta/2} e^{\lambda x}\,dF(x) + \int_{M^+-\delta/2}^{\infty} e^{\lambda x}\,dF(x) \le e^{\lambda(M^+ - \frac12\delta)} + \int_{M^+-\delta/2}^{\infty} e^{\lambda x}\,dF(x) = \int_{M^+-\delta/2}^{\infty} e^{\lambda x}\,dF(x)\Bigl(1 + \frac{e^{\lambda(M^+-\delta/2)}}{\int_{M^+-\delta/2}^{\infty} e^{\lambda x}\,dF(x)}\Bigr).$$

Since $M^+$ is an upper limit in probability we have $F(M^+) - F(M^+ - \tfrac14\delta) > 0$ for any $\delta > 0$. Therefore

$$\frac{e^{\lambda(M^+-\delta/2)}}{\int_{M^+-\delta/2}^{\infty} e^{\lambda x}\,dF(x)} \le \frac{e^{\lambda(M^+-\delta/2)}}{\int_{(M^+-\delta/4,\,M^+]} e^{\lambda x}\,dF(x)} \le \frac{e^{\lambda(M^+-\delta/2)}}{e^{\lambda(M^+-\delta/4)}\bigl(F(M^+) - F(M^+-\tfrac14\delta)\bigr)} = e^{-\lambda\delta/4}\,\frac{1}{F(M^+) - F(M^+-\tfrac14\delta)}.$$

Therefore

$$\frac{R'(\lambda)}{R(\lambda)} \ge \frac{\int_{M^+-\delta/2}^{\infty} x e^{\lambda x}\,dF(x)}{\int_{M^+-\delta/2}^{\infty} e^{\lambda x}\,dF(x)}\cdot\frac{1}{1 + \dfrac{e^{-\lambda\delta/4}}{F(M^+) - F(M^+-\frac14\delta)}}.$$

The first quotient, by the same arguments as above, is greater than or equal to $M^+ - \tfrac12\delta$. The second quotient can be made arbitrarily close to 1 by an appropriate choice of $\lambda$. □

We can now turn directly to an estimate of the probabilities of interest to us. Let $c$ be fixed, and set $m = \mathsf{E}\,\xi_i < c < M^+$. We consider the probability $P_{n,c} = \mathsf{P}(\xi_1 + \cdots + \xi_n > cn)$. Since $c > m$ the probability we are investigating is that of the random variables taking values far away from their mathematical expectation. Such values are called large deviations (from the expectation). Let $\lambda_0$ be such that $m(\lambda_0) = c$. Note that $m = m(0) < c$. Therefore $\lambda_0 > 0$ by the monotonicity of $m(\lambda)$.
Theorem 16.1. $P_{n,c} \le B_n\bigl(R(\lambda_0)e^{-\lambda_0 c}\bigr)^n$, where $B_n \to \tfrac12$ as $n \to \infty$.

Proof. We have

$$P_{n,c} = \int\cdots\int_{x_1+\cdots+x_n > cn} dF(x_1)\cdots dF(x_n) = \bigl(R(\lambda_0)\bigr)^n \int\cdots\int_{x_1+\cdots+x_n > cn} e^{-\lambda_0(x_1+\cdots+x_n)}\,dF_{\lambda_0}(x_1)\cdots dF_{\lambda_0}(x_n)$$
$$\le \bigl(R(\lambda_0)e^{-\lambda_0 c}\bigr)^n \int\cdots\int_{x_1+\cdots+x_n > cn} dF_{\lambda_0}(x_1)\cdots dF_{\lambda_0}(x_n),$$

where we used $dF(x) = R(\lambda_0)e^{-\lambda_0 x}\,dF_{\lambda_0}(x)$ and the fact that $\lambda_0 > 0$ and $x_1 + \cdots + x_n > cn$ imply $e^{-\lambda_0(x_1+\cdots+x_n)} \le e^{-\lambda_0 cn}$. The latter integral is the probability, computed with respect to the distribution $F_{\lambda_0}$, that $\xi_1 + \cdots + \xi_n > cn$. But the distribution $F_{\lambda_0}$ was chosen in such a way that $\mathsf{E}_{\lambda_0}\xi_i = c$. Therefore

$$\int\cdots\int_{x_1+\cdots+x_n > cn} dF_{\lambda_0}(x_1)\cdots dF_{\lambda_0}(x_n) = \mathsf{P}_{\lambda_0}\{\xi_1 + \cdots + \xi_n > cn\} = \mathsf{P}_{\lambda_0}\{\xi_1 + \cdots + \xi_n - nm(\lambda_0) > 0\}$$
$$= \mathsf{P}_{\lambda_0}\Bigl\{\frac{\xi_1 + \cdots + \xi_n - nm(\lambda_0)}{\sqrt{n\,d(\lambda_0)}} > 0\Bigr\} \longrightarrow \frac12$$

as $n \to \infty$, by the Central Limit Theorem. Here we have denoted by $d(\lambda_0)$ the variance of the random variables computed with respect to the distribution $F_{\lambda_0}$. □

The lower estimate turns out to be somewhat worse.

Theorem 16.2. For any $b > 0$ there exists $\rho(b) > 0$ such that

$$P_{n,c} \ge \bigl(R(\lambda_0)e^{-\lambda_0 c}\bigr)^n e^{-\lambda_0 b\sqrt n}\,\rho_n \quad \text{with} \quad \lim_{n\to\infty}\rho_n = \rho(b) > 0.$$

Proof. As in Theorem 16.1

$$P_{n,c} \ge \int\cdots\int_{cn < x_1+\cdots+x_n \le cn + b\sqrt n} dF(x_1)\cdots dF(x_n) = \bigl(R(\lambda_0)\bigr)^n \int\cdots\int_{cn < x_1+\cdots+x_n \le cn + b\sqrt n} e^{-\lambda_0(x_1+\cdots+x_n)}\,dF_{\lambda_0}(x_1)\cdots dF_{\lambda_0}(x_n)$$
$$\ge \bigl(R(\lambda_0)e^{-\lambda_0 c}\bigr)^n e^{-\lambda_0 b\sqrt n} \int\cdots\int_{cn < x_1+\cdots+x_n \le cn + b\sqrt n} dF_{\lambda_0}(x_1)\cdots dF_{\lambda_0}(x_n).$$

The latter integral, as in the case of Theorem 16.1, converges to a positive limit by the Central Limit Theorem. □

In Theorems 16.1 and 16.2 the number $R(\lambda_0)e^{-\lambda_0 c} = r(\lambda_0)$ is involved. It is clear that $r(0) = 1$. We show that $r(\lambda_0) < 1$ for all $\lambda_0 \ne 0$. We have $\ln r(\lambda_0) = \ln R(\lambda_0) - \lambda_0 c = -(\ln R(0) - \ln R(\lambda_0)) - \lambda_0 c$. By Taylor's formula we have

$$\ln R(0) - \ln R(\lambda_0) = -(\ln R(\lambda_0))'\,\lambda_0 + \frac{\lambda_0^2}{2}\bigl(\ln R(\lambda_1)\bigr)'',$$

where $\lambda_1$ is an intermediate point between 0 and $\lambda_0$. Furthermore

$$\bigl(\ln R(\lambda_0)\bigr)' = \frac{R'(\lambda_0)}{R(\lambda_0)} = m(\lambda_0) = c, \qquad \bigl(\ln R(\lambda_1)\bigr)'' > 0,$$

since it is the variance of the random variable $\xi_i$ computed with respect to the distribution $F_{\lambda_1}$. Thus

$$\ln r(\lambda_0) = -\frac{\lambda_0^2}{2}\bigl(\ln R(\lambda_1)\bigr)'' < 0.$$

From Theorems 16.1 and 16.2 we obtain:

Corollary 16.1. $\lim_{n\to\infty}\frac{1}{n}\ln P_{n,c} = \ln r(\lambda_0) < 0$.

Indeed, let $b = 1$ in Theorem 16.2. Then

$$\ln r(\lambda_0) - \frac{\lambda_0}{\sqrt n} + \frac{1}{n}\ln\rho_n \;\le\; \frac{\ln P_{n,c}}{n} \;\le\; \ln r(\lambda_0) + \frac{1}{n}\ln B_n.$$

Passing to the limit as $n \to \infty$ completes the proof. □

Corollary 16.1 shows that the probabilities $P_{n,c}$ decrease exponentially in $n$, i.e. much faster than was given by Chebyshev's inequality.
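Corollary 16.1 can be checked against exact binomial probabilities (our sketch, not the book's): for the fair $\pm1$ coin, $m(\lambda) = \tanh\lambda$, so $\lambda_0 = \operatorname{artanh} c$ and $\ln r(\lambda_0) = \ln\cosh\lambda_0 - \lambda_0 c$. The values of $n$ and the tolerance are arbitrary.

```python
import math

# Fair ±1 coin: R(lam) = cosh(lam), m(lam) = tanh(lam), so lambda_0 = artanh(c)
c = 0.5
lam0 = math.atanh(c)
log_r = math.log(math.cosh(lam0)) - lam0 * c    # ln r(lambda_0), negative

def log_pnc(n):
    # exact (1/n) ln P(xi_1 + ... + xi_n > cn) for the ±1 coin:
    # S_n = 2B - n with B ~ Bin(n, 1/2), and S_n > cn  <=>  B > n(1+c)/2
    thresh = n * (1 + c) / 2
    p = sum(math.comb(n, k) for k in range(n + 1) if k > thresh) / 2 ** n
    return math.log(p) / n

assert log_r < 0
# (1/n) ln P_{n,c} approaches ln r(lambda_0), as Corollary 16.1 states,
# and the approximation improves with n
assert abs(log_pnc(400) - log_r) < abs(log_pnc(100) - log_r)
assert abs(log_pnc(400) - log_r) < 0.02
```

For comparison, Chebyshev's inequality only gives $P_{n,c} \le d/(c^2 n)$, a polynomial rather than an exponential decay in $n$.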