Yakov G. Sinai
Probability Theory
An Introductory Course
With 14 Figures
Translator:
Dominique Haughton
Department of Mathematics
College of Pure and Applied Science
University of Lowell, One University Avenue
Lowell, MA 01854, USA
ISBN 978-3-540-53348-1
This work is subject to copyright. All rights are reserved, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in other ways, and storage in data
banks. Duplication of this publication or parts thereof is permitted only under the
provisions of the German Copyright Law of September 9, 1965, in its current version, and
permission for use must always be obtained from Springer-Verlag. Violations are liable for
prosecution under the German Copyright Law.
© Springer-Verlag Berlin Heidelberg 1992
Originally published by Springer-Verlag Berlin Heidelberg New York in 1992
Softcover reprint of the hardcover 1st edition 1992
The use of registered names, trademarks, etc. in this publication does not imply, even in the
absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.
Typesetting: Camera ready by author
Printed on acid-free paper
Prefaces
Translator's Preface
The Russian version of this book was published in two parts. Part I, covering
the first ten chapters, appeared in 1985. Part II appeared in 1986, and only
the first six chapters have been translated here.
I would like to thank Jonathan Haughton for typesetting the English version in TeX.
Lecture 1. Probability Spaces and Random Variables
The field of probability theory is somewhat different from the other fields of
mathematics. To understand it, and to explain the meaning of many defini-
tions, concepts and results, we need to draw on real life examples, or examples
from related areas of mathematics or theoretical physics. On the other hand,
probability theory is a branch of mathematics, with its own axioms and meth-
ods, and with close connections with some other areas of mathematics, most
notably with functional analysis, and ordinary and partial differential equa-
tions. At the foundation of probability theory lies abstract measure theory, but
the role played here by measure theory is similar to that played by differential
and integral calculus in the theory of differential equations.
We begin our study of probability theory with its axioms, which were intro-
duced by the great Soviet mathematician and founder of modern probability
theory, A. N. Kolmogorov. The first concept is that of the space of elementary
outcomes. From a general point of view the space of elementary outcomes is
an abstract set. In probability theory, as a rule, it is denoted by Ω and its
points are denoted by ω, and are called elementary events or elementary outcomes.
When Ω is finite or countable, probability theory is sometimes called
discrete probability theory. A major portion of the first half of this course will
be related to discrete probability theory. Subsets C ⊆ Ω are called events.
Moreover the set C = Ω is called the certain event, and the set C = ∅
is called the impossible event. One can form unions,
intersections, Cartesian products, complements, differences, etc. of events.
Our first task is that of constructing spaces of elementary outcomes for
various concrete situations, and this is best done by considering several ex-
amples.
Example 1.1. At school many of us have heard that probability theory origi-
nated in the analysis of problems arising from games of chance. The simplest,
although important, example of such a game consists of a single toss of a
coin, which can fall on one of its two sides. Here Ω consists of two points,
Ω = {ω_h, ω_t}, where ω_h (ω_t) means that the toss gave heads (tails).
Remark. In this example the number of possible events C ⊆ Ω is equal to
4 = 2²: Ω, {ω_h}, {ω_t}, ∅.
If we replace the coin by a die then Ω consists of six points, ω_1, ω_2, ω_3,
ω_4, ω_5, ω_6, according to the number showing on the top side after throwing
the die.
Fig. 1.1.
365 possible values. By the same token X is the space of all possible birthdays,
and therefore consists of 365 points, and Ω consists of 365^n points. If
X is the unit interval, X = [0,1], then Ω consists of all choices of n numbers
on the interval [0,1]. Sometimes the space Ω is written as Ω = X^[1,n], where
[1,n] = {1, 2, 3, ..., n}.
The second generalization concerns a more general interpretation of the
index i in the notation ω = {a_i}. For example we could have i = (i_1, i_2),
where i_1 and i_2 are integers, K_1 ≤ i_1 ≤ K_2, L_1 ≤ i_2 ≤ L_2. Denote by I the
set of points i = (i_1, i_2) of the plane with integer coordinates.
Fig. 1.2.
Further assume that each point i can be found in one of two states, denoted by 0 and 1.
The state of the whole system is then ω = {a_i}, i ∈ I, where each a_i takes the two
values 0, 1. Such spaces Ω are encountered in the theory of random fields. X
can also be arbitrary.
We now come to a general definition.
Example 1.9. In the lottery "6 from 36" we consider a square with 36 cells numbered
1 through 36. Each participant must choose six distinct numbers. Here
an individual elementary outcome has the form ω = {a_1, a_2, a_3, a_4, a_5, a_6}
and is a subset of six elements of the set with 36 elements. Now Ω consists of
(36 choose 6) = (36·35·34·33·32·31)/6! points.
We can obtain a variant in the following way. Let X be a finite set and
let us carry out a sampling without replacement of n elements of this set,
i.e. ω = {a_1, a_2, ..., a_n}, where now a_i ≠ a_j for i ≠ j. If |X| = r then Ω consists of
r(r−1)⋯(r−n+1) elements.¹
¹ From now on the absolute value of a set will denote its cardinality, that is the
number of its elements.
Other examples of spaces Ω will be given later.
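The count r(r−1)⋯(r−n+1) of ordered samples without replacement can be checked against a direct enumeration (a small Python sketch; the function name is ours, not the book's):

```python
from itertools import permutations

def falling_factorial(r, n):
    """Number of ordered samples of size n without replacement: r(r-1)...(r-n+1)."""
    count = 1
    for i in range(n):
        count *= r - i
    return count

# Direct enumeration for a small set X with |X| = r = 5 and n = 3.
assert falling_factorial(5, 3) == len(list(permutations(range(5), 3)))  # 5*4*3 = 60
```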
Let Ω be finite, i.e. |Ω| = m < ∞. We denote by 𝔉 the collection of all
events C, i.e. all subsets of Ω. To each event C corresponds its indicator
function
χ_C(ω) = 1, if ω ∈ C;  0, otherwise.
It is clear that every such function on Ω taking the values 1 and 0 uniquely
defines an event C, namely the set C where the function χ_C(ω) is equal to
1. In other words, 𝔉 is in one-to-one correspondence with the points of X^Ω,
where X = {1, 0}. The number of such points is equal to 2^|Ω| = 2^m. ∎
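The one-to-one correspondence between events and indicator functions can be made concrete (a Python sketch; the three-point space below is an arbitrary stand-in for a finite Ω):

```python
from itertools import product

Omega = ["w1", "w2", "w3"]  # a finite space with m = 3 points

# Every function chi: Omega -> {0, 1} determines the event C = {w : chi(w) = 1}.
events = set()
for values in product([0, 1], repeat=len(Omega)):
    C = frozenset(w for w, v in zip(Omega, values) if v == 1)
    events.add(C)

# Distinct indicator functions give distinct events: 2^m of them.
assert len(events) == 2 ** len(Omega)
```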
Corollary 1.1. ∅ ∈ 𝒢.
Proof. Take Ω ∈ 𝒢 and apply 2). ∎
Theorem 1.2. If the algebra 𝒢 is finite, then there exist subsets A_1, A_2, ...,
A_r such that 1) A_i ∩ A_j = ∅ for i ≠ j; 2) ∪_{i=1}^r A_i = Ω; 3) any event
C ∈ 𝒢 is a union of A_i's.
Remark. The collection of events A_1, ..., A_r defines a partition of Ω. In this
way finite algebras are generated by finite partitions, and conversely. This
statement can be generalized considerably.
Remark. The algebra 𝒢 contains 2^r elements. Indeed, let us define a new space
Ω′ whose points are the sets in the partition A_1, A_2, ..., A_r. Then 𝒢 can be
considered as the collection of subsets of the space Ω′, |Ω′| = r. By Theorem
1.1, 𝒢 contains 2^r elements.
C(b) = ∩_{i=1}^s C_i^{b_i}.
Then all the C(b) ∈ 𝒢 (as follows from 2) and Corollary 1.2).
It is possible that C(b) = ∅. However for any ω ∈ Ω there exists a b such
that ω ∈ C(b). Indeed ω ∈ C_i^{b_i} for one of the values b_i = ±1 for every i,
1 ≤ i ≤ s. Therefore ω ∈ C(b), and so not all C(b) are empty. If b′ ≠ b″ then
C(b′) ∩ C(b″) = ∅. Indeed b′ ≠ b″ means that b′_i ≠ b″_i for some i. To fix ideas,
assume that b′_i = 1, b″_i = −1. Then in the expression C(b′) we find C_i^1 = C_i,
so C(b′) ⊆ C_i, and in the expression C(b″) we find C_i^{−1} = Ω \ C_i, therefore
C(b″) ⊆ Ω \ C_i; it follows that C(b′) ∩ C(b″) = ∅. Therefore the sets C(b) are
either empty or non-empty, and they are disjoint. Let us take, for the A_i, the non-empty
intersections. Since each ω is an element of one of the C(b), it belongs to one
of the A_i. This means that ∪_i A_i = Ω. ∎
As above, we have:
For more general spaces n, the function f must have further properties,
known as measurability properties (see below).
Definition 1.6. A random variable ξ = f(ω) is said to be 𝔉₀-measurable if
for any a, b, a ≤ b, the set {ω : a ≤ f(ω) < b} ∈ 𝔉₀.
Proof. Let d_i be the distinct values of the function f. Their number does not
exceed the number of elements of 𝔉₀ (check this!), and therefore is finite. Let us
write them in increasing order: d_1 < d_2 < ... < d_ℓ, and let U_i = {ω : f(ω) = d_i}.
It is clear that U_i ∩ U_j = ∅ for i ≠ j and ∪_i U_i = Ω. We now show that
U_i ∈ 𝔉₀. Let us construct intervals [a_i, b_i) where d_{i−1} < a_i < d_i < b_i < d_{i+1}.
Then {ω : a_i ≤ f(ω) < b_i} = U_i, that is U_i ∈ 𝔉₀. By Theorem 1.2, U_i can be written
as a union of A_j's, i.e. f(ω) = d_i for ω ∈ A_j. Moreover different i give rise to
different elements of the partition. ∎
We now return to the general case and turn our attention to the central
concept of our theory, the concept of probability. We noted earlier that proba-
bility is a special case of measure. Now consider an arbitrary measurable space
(Ω, 𝔉). In the discrete case a probability measure is given by a collection of
numbers p(ω) such that
1) p(ω) ≥ 0;  2) Σ_ω p(ω) = 1.  (1.1)
and follows from the fact that the sum of a converging series with positive
terms does not depend on the order of summation. 0
2) for any sequence of events C_i ∈ 𝔉 with C_i ⊇ C_{i+1},
P(∩_i C_i) = lim_{i→∞} P(C_i);
3) for any sequence of events C_i ∈ 𝔉 with C_i ⊇ C_{i+1} and ∩_i C_i = ∅,
lim_{i→∞} P(C_i) = 0.
Proof. All three statements are proved in the same way. By way of example
let us prove the last one. Let C_i ∈ 𝔉, C_i ⊇ C_{i+1}, ∩_i C_i = ∅. Now form the
events B_i = C_i \ C_{i+1}. Then B_i ∩ B_j = ∅ for i ≠ j, and C_k = ∪_{i≥k} B_i. From
the σ-additivity property of Definition 1.6 we have P(C_1) = Σ_{i≥1} P(B_i).
Therefore the remainder of the series, Σ_{i≥k} P(B_i) = P(C_k) → 0 as k → ∞.
Conversely, assume that we have a sequence of sets C_i, C_i ∩ C_j = ∅ for i ≠ j.
Form C = ∪_{i=1}^∞ C_i. Then C = ∪_{i=1}^n C_i ∪ ∪_{i=n+1}^∞ C_i for any n, and by finite
additivity P(C) = Σ_{i=1}^n P(C_i) + P(∪_{i=n+1}^∞ C_i). The events B_n = ∪_{i=n+1}^∞ C_i
decrease, and ∩_n B_n = ∅. Therefore P(B_n) → 0 and P(C) = Σ_{i=1}^∞ P(C_i). ∎
P(∪_{i=1}^n A_i) = P(∪_{i=1}^{n−1} A_i ∪ (A_n \ ∪_{i=1}^{n−1} A_i))
= P(∪_{i=1}^{n−1} A_i) + P(A_n \ ∪_{i=1}^{n−1} A_i)
≤ P(∪_{i=1}^{n−1} A_i) + P(A_n).
Finally we obtain
P(∪_{i=1}^n A_i) ≤ Σ_{i=1}^n P(A_i). ∎
Let Ω be finite, |Ω| = N. The probability distribution for which p_i = 1/N
is called uniform. Uniform distributions arise naturally in many applications.
For example return to the problem of the "6 from 36" lottery. The space Ω in
this case was constructed earlier, and consists of (36 choose 6) points. We now reason
as follows. Let us fix a specific set w̄. The result of the random drawing of the
lottery will give a certain set ω of 6 numbers. It is natural to assume that all
sets are equivalent, and that the probability of each ω is the same and equal
to p(ω) = 1/N = 1/(36 choose 6). We now find the probability of winning. According
to the rules, winning will occur when ω and w̄ have at least 3 numbers in
common. Let us denote by C_k the event consisting of those ω such that ω and
w̄ have k numbers in common. Then C = ∪_{k=3}^6 C_k consists of the winning
ω's. It is clear that C_i ∩ C_j = ∅ for i ≠ j. Therefore P(C) = Σ_{k=3}^6 P(C_k).
Furthermore, P(C_k) = Σ_{ω∈C_k} p(ω) = Σ_{ω∈C_k} 1/N = |C_k|/N = |C_k|/|Ω|. It
is easy to understand that |C_k| = (6 choose k)·(30 choose 6−k). Therefore, for instance
P(C_3) = (6 choose 3)·(30 choose 3)/(36 choose 6)
= [(6·5·4)/(1·2·3)]·[(30·29·28)/(1·2·3)]·[(1·2·3·4·5·6)/(36·35·34·33·32·31)] ≈ 0.0417.
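These counts are easy to verify numerically (a Python sketch; the variable names are ours):

```python
from math import comb

n_points = comb(36, 6)                      # |Omega| = 1 947 792
p_c3 = comb(6, 3) * comb(30, 3) / n_points  # P(C_3)
# P(C) = sum over k = 3..6 of P(C_k): the probability of winning.
p_win = sum(comb(6, k) * comb(30, 6 - k) for k in range(3, 7)) / n_points
print(round(p_c3, 4), round(p_win, 4))      # -> 0.0417 0.0451
```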
The number p_i is the probability that the random variable ξ takes the
value x_i. It follows from the properties of probabilities that p_i ≥ 0, Σ_i p_i = 1.
Eξ = Σ_ω f(ω)p(ω),
which is defined when Σ_ω |f(ω)|p(ω) < ∞. If the latter series diverges, the
mathematical expectation is not defined.
In the case where the m.e. is defined, its value does not depend on the
order of summation. The concept of m.e. is similar to the idea of center of
gravity. In the same way as the center of gravity indicates near which point
the mass is concentrated on average, the mathematical expectation indicates
near which quantity the values of the random variable ξ are concentrated on
average.
Properties of the Mathematical Expectation
1. If Eξ_1 and Eξ_2 are defined, then E(aξ_1 + bξ_2) is defined, and
E(aξ_1 + bξ_2) = aEξ_1 + bEξ_2.
2. If ξ ≥ 0 then Eξ ≥ 0.
3. If ξ = f(ω) ≡ 1, then Eξ = 1.
4. If A ≤ ξ = f(ω) ≤ B then A ≤ Eξ ≤ B.
5. Eξ is defined if and only if Σ_i |x_i|p_i < ∞, and then Eξ = Σ_i x_i p_i.
6. If the random variable η = g(ξ) = g(f(ω)), then
Eη = Σ_i g(x_i)p_i,
where Eη is defined if and only if Σ_i |g(x_i)|p_i < ∞.
7. Chebyshev's Inequality (1). If ξ ≥ 0 and Eξ < ∞, then for any t > 0
P{ω : f(ω) ≥ t} ≤ Eξ/t.
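Chebyshev's inequality in this form can be verified exhaustively on a small nonnegative distribution (a Python sketch; the values and weights are arbitrary):

```python
# A nonnegative random variable xi with values x_i and probabilities p_i.
values = [0, 1, 2, 5, 10]
probs = [0.3, 0.3, 0.2, 0.15, 0.05]
E = sum(x * p for x, p in zip(values, probs))  # E xi = 1.95

# P{xi >= t} <= E(xi) / t for every t > 0.
for t in (1, 2, 5, 10):
    tail = sum(p for x, p in zip(values, probs) if x >= t)
    assert tail <= E / t + 1e-12
```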
Proof of Properties 1–6. Property 1 follows from the fact that if ξ_1 = f_1(ω),
ξ_2 = f_2(ω) and Eξ_1 and Eξ_2 are defined, then
Σ_ω |af_1(ω) + bf_2(ω)|p(ω) ≤ |a| Σ_ω |f_1(ω)|p(ω) + |b| Σ_ω |f_2(ω)|p(ω) < ∞.
Therefore
E(aξ_1 + bξ_2) = Σ_ω (af_1(ω) + bf_2(ω))p(ω) = aEξ_1 + bEξ_2.
Properties 2 and 3 are clear. Properties 1–3 mean that E is a linear (property
1) non-negative (property 2) normed (property 3) functional on the vector
space of random variables. Property 4 follows from the fact that ξ − A ≥ 0,
B − ξ ≥ 0 and therefore E(ξ − A) = Eξ − A·E1 = Eξ − A ≥ 0 and, analogously,
B − Eξ ≥ 0.
We now prove property 6, since 5 follows from 6 for g(x) = x. First let
Σ_i |g(x_i)|p_i < ∞. Since the sum of a series with non-negative terms does not
depend on the order of the terms, the summation of Σ_ω |g(f(ω))|p(ω) can be carried out in the
following way (here C_i = {ω : f(ω) = x_i}):
Σ_ω |g(f(ω))|p(ω) = Σ_i Σ_{ω∈C_i} |g(f(ω))|p(ω) = Σ_i |g(x_i)| Σ_{ω∈C_i} p(ω) = Σ_i |g(x_i)|p_i.
Thus the series Σ_ω |g(f(ω))|p(ω) converges if and only if the series Σ_i |g(x_i)|p_i
does. If any of those series converges then the series Σ_ω g(f(ω))p(ω) converges
absolutely, and its sum does not depend on the order of summation. Therefore
Σ_ω g(f(ω))p(ω) = Σ_i Σ_{ω∈C_i} g(f(ω))p(ω) = Σ_i g(x_i)p_i = Eη.
Proof of Property 1. Let Eξ be defined, |Eξ| < ∞. Assume at first that Eξ² <
∞. Then (ξ − Eξ)² = ξ² − 2(Eξ)ξ + (Eξ)² and from property 1 of the m.e.
Var ξ = Eξ² − E(2(Eξ)ξ) + E((Eξ)²) = Eξ² − 2(Eξ)(Eξ) + (Eξ)² = Eξ² −
(Eξ)² < ∞. Now let Var ξ < ∞. We have ξ² = (ξ − Eξ)² + 2(Eξ)ξ − (Eξ)²,
and from property 1 of the m.e. we have
Eξ² = Var ξ + 2(Eξ)² − (Eξ)² = Var ξ + (Eξ)² < ∞. ∎
Proof of Property 2. From property 1 of the m.e. we have E(aξ + b) = aEξ + b.
Therefore
Var(aξ + b) = E(aξ + b − E(aξ + b))² = E(aξ − aEξ)²
= E(a²(ξ − Eξ)²) = a² E(ξ − Eξ)² = a² Var ξ. ∎
Now let ξ_1, ξ_2 be two random variables taking the values x_1, x_2, ... and
y_1, y_2, ... respectively.
Definition 1.12. The joint distribution of the random variables ξ_1, ξ_2 is the
collection of numbers p_ij = p(x_i, y_j) = P{ω : ξ_1(ω) = x_i, ξ_2(ω) = y_j}.
Given the joint distribution of ξ_1, ξ_2 one can find the distribution of ξ_1
(ξ_2) by summing the p_ij with respect to j (i). Moreover, if η = f(ξ_1, ξ_2)
then Eη = Σ_{i,j} f(x_i, y_j)p_ij, again under the assumption that the last series
converges absolutely. In particular E ξ_1ξ_2 = Σ_{i,j} x_i y_j p_ij.
One can define, in an analogous way, the joint distribution of any finite
number of random variables.
We note that Cov(ξ_1, ξ_2) = E(ξ_1 − m_1)(ξ_2 − m_2), where m_1 = Eξ_1 and
m_2 = Eξ_2. Furthermore
(Cov(ξ_1, ξ_2))² ≤ Var ξ_1 · Var ξ_2,
i.e. |ρ(ξ_1, ξ_2)| ≤ 1. If |ρ(ξ_1, ξ_2)| = 1 then for some t_0 we have E(t_0(ξ_2 − m_2) +
(ξ_1 − m_1))² = 0, i.e. t_0(ξ_2 − m_2) + (ξ_1 − m_1) = 0 and ξ_2 = m_2 + (m_1/t_0) −
(ξ_1/t_0). Setting a = −1/t_0, b = m_2 + (m_1/t_0) we obtain the statement of the
theorem. ∎
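The bound |ρ(ξ_1, ξ_2)| ≤ 1 can be checked on any concrete joint distribution (a Python sketch; the table p_ij below is arbitrary):

```python
# Joint distribution p_ij = P(xi1 = x_i, xi2 = y_j).
xs = [0, 1]
ys = [-1, 2]
P = [[0.1, 0.4],   # rows indexed by x_i, columns by y_j
     [0.3, 0.2]]

p1 = [sum(row) for row in P]                              # distribution of xi1
p2 = [sum(P[i][j] for i in range(2)) for j in range(2)]   # distribution of xi2
m1 = sum(x * p for x, p in zip(xs, p1))
m2 = sum(y * p for y, p in zip(ys, p2))
cov = sum(P[i][j] * (xs[i] - m1) * (ys[j] - m2)
          for i in range(2) for j in range(2))
var1 = sum(p * (x - m1) ** 2 for x, p in zip(xs, p1))
var2 = sum(p * (y - m2) ** 2 for y, p in zip(ys, p2))
rho = cov / (var1 * var2) ** 0.5
assert -1 <= rho <= 1
```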
Lecture 2. Independent Identical Trials and
the Law of Large Numbers
In other words we can say that 𝔉(𝒜) is the intersection of all those σ-algebras
𝒢 which contain 𝒜 (𝒜 ⊆ 𝒢). There is at least one such σ-algebra,
namely the σ-algebra of all subsets of Ω.
A less formal definition of 𝔉(𝒜)
consists in the following. Given sets from 𝒜 we form finite and infinite unions of
those sets. From the obtained subsets we form finite and infinite intersections.
Then once again we form unions and intersections and so forth. The complete
process makes use of the concept of transfinite induction, but we will not
elaborate on this point.
Assume now that Ω = ℝ¹. Consider the following families of subsets:
Proof. All the equalities are proved in the same way. Let us prove for example
that 𝔉(𝒜₅) = 𝔉(𝒜₄). First for any a < b we have the equality (−∞, b) \
(−∞, a) = [a, b) ∈ 𝔉(𝒜₅). Let us further take an interval (a, b) and a sequence
a_n ↓ a. Then ∪_n [a_n, b) = (a, b) and ∪_n [a_n, b) ∈ 𝔉(𝒜₅). Therefore (a, b) ∈
𝔉(𝒜₅), i.e. 𝔉(𝒜₄) ⊆ 𝔉(𝒜₅). One can prove in an analogous way that 𝔉(𝒜₄) ⊇
𝔉(𝒜₅). ∎
where i_1, ..., i_r are given indices and B_1, B_2, ..., B_r ∈ 𝔅.
If the probability of occurrence of the value x_i is equal to p_i then we
consider that the point x_i carries the mass p_i. It follows that for any interval
[a, b] the probability P{a ≤ ξ ≤ b} = Σ_{i: a ≤ x_i ≤ b} p_i, i.e. the probability that
the random variable takes values in the interval [a, b] is equal to the sum of the
masses p_i which lie in the interval. Imagine that the number of values taken by
the random variable converges to infinity and that the distances x_{i+1} − x_i as
well as the p_i converge to zero, while Σ_{a ≤ x_i ≤ b} p_i converges to a finite value.
For example let −R ≤ x_i ≤ R, x_i = i/N and p_i = p(i/N)·(1/N) + o(1/N), where p(t)
is a function on the real line. Then the sum
Σ_{a ≤ x_i ≤ b} p_i
behaves like a Riemann sum for the integral ∫_a^b p(t) dt. It is clear that p(t) ≥ 0
and ∫_{−∞}^{∞} p(t) dt = 1, since Σ_i p_i = Σ_i (1/N)p(i/N) + o(1) = 1. Any function
p(t) ≥ 0 such that ∫_{−∞}^{∞} p(t) dt = 1 is called a probability density. If
P{a ≤ ξ ≤ b} = ∫_a^b p(t) dt for all a ≤ b,
we say that the random variable ξ has a probability distribution with density
p.
Examples of Densities
1. p(x) = (1/√(2π)) e^{−x²/2}, −∞ < x < ∞, is called the normal density or the
Gaussian density (with parameters (0, 1));
1′. p(x) = (1/(√(2π)σ)) exp(−(x − m)²/(2σ²)) is the normal density with parameters (m, σ);
2. p(x) = 1/(b − a) for x ∈ [a, b], and 0 for x ∉ [a, b], is the uniform density on [a, b].
More generally, the uniform distribution on a domain V ⊆ ℝ^d is defined, for U ⊆ V, by
P{ξ ∈ U} = vol U / vol V.
We assume here that 0 < vol V < ∞ (vol(·) represents the d-dimensional
volume of a set). Sometimes to define the uniform distribution on V one
says "a point is chosen in the set V at random with the uniform distribution";
3. p(x) = λe^{−λx} for x ≥ 0, and 0 for x < 0, is the exponential density;
4. p(x) = 1/(π(1 + x²)), −∞ < x < ∞, is the Cauchy density.
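Each of these densities integrates to 1, which can be confirmed with a crude numerical integral (a Python sketch; the midpoint rule and the cutoffs are our choices):

```python
from math import exp, pi, sqrt

def integrate(f, a, b, n=200000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

gauss = lambda x: exp(-x * x / 2) / sqrt(2 * pi)   # normal (0, 1)
uniform = lambda x: 1.0                            # uniform on [0, 1]
expo = lambda x: 2.0 * exp(-2.0 * x)               # exponential, lambda = 2
cauchy = lambda x: 1 / (pi * (1 + x * x))          # Cauchy

g1 = integrate(gauss, -10, 10)     # tails beyond +-10 are negligible
u1 = integrate(uniform, 0, 1)
e1 = integrate(expo, 0, 20)
c1 = integrate(cauchy, -1e4, 1e4)  # heavy tails: cutoff error ~ 2/(pi * 1e4)
assert abs(g1 - 1) < 1e-6 and abs(u1 - 1) < 1e-9
assert abs(e1 - 1) < 1e-6 and abs(c1 - 1) < 1e-3
```

Note how slowly the Cauchy integral converges: its tails decay only like 1/x², which is also why this density has no mathematical expectation.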
The mathematical expectation of a random variable which has a probability
density p(x) is given by the integral Eξ = ∫_{−∞}^{∞} x p(x) dx, and this
mathematical expectation is defined if the latter integral converges absolutely. In
exactly the same way
Eg(ξ) = ∫_{−∞}^{∞} g(x)p(x) dx,
assuming that the latter integral converges absolutely, and
Cov(ξ_1, ξ_2) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy p(x, y) dx dy − m_1 m_2,
where p(x, y) is the joint density of ξ_1, ξ_2.
We now study the properties of P. Let us choose indices i_1, i_2, ..., i_t and as
many outcomes x^{(j_1)}, x^{(j_2)}, ..., x^{(j_t)} (among which there may be repetitions)
and consider the cylinder
C = {ω : x_{i_s} = x^{(j_s)}, 1 ≤ s ≤ t}.
Proof. P(C) = Σ p(ω) = Σ p(x_1)⋯p(x_n), where the summation is carried
out over those ω = (x_1, ..., x_n) for which x_{i_s} = x^{(j_s)}, 1 ≤ s ≤ t. Then
P(C) = p(x^{(j_1)})⋯p(x^{(j_t)}) · Σ_{x_i, i ≠ i_1,...,i_t} ∏_{i ≠ i_1,...,i_t} p(x_i).
The last sum is equal to 1. This is proved in the same way as the normalization
property Σ_ω p(ω) = 1 was proved before. ∎
Similarly, for sets C_1, ..., C_n ⊆ X, summation over the coordinates factorizes:
Σ_{ω : x_i ∈ C_i, 1 ≤ i ≤ n} p(ω) = ∏_{i=1}^n Σ_{x_i ∈ C_i} p(x_i) = ∏_{i=1}^n P(C_i). ∎
We introduce the random variable ν^{(j)}(ω), equal to the number of i for
which 1 ≤ i ≤ n and x_i = x^{(j)}, and the random variables ξ_i^{(j)}(ω), where
ξ_i^{(j)}(ω) = 1, if x_i = x^{(j)};  0, otherwise.
Then
ν^{(j)}(ω) = Σ_{i=1}^n ξ_i^{(j)}(ω).
Proof. We note that 1 − p^{(j)} = Σ_{s≠j} p^{(s)}. Let us define B_k = {ω : ν^{(j)}(ω) = k}.
Then
B_k = ∪ B_{k; i_1,...,i_k},
where 1 ≤ i_1 < i_2 < ... < i_k ≤ n and x_{i_t} = x^{(j)}, 1 ≤ t ≤ k, x_i ≠ x^{(j)} for the
remaining i. The number of events B_{k; i_1,...,i_k} is clearly equal to (n choose k), and they
do not intersect. Furthermore
P(B_{k; i_1,...,i_k}) = Σ p(ω) = (p^{(j)})^k Σ_{x_i ≠ x^{(j)}, i ≠ i_1,...,i_k} ∏_{i ≠ i_1,...,i_k} p(x_i)
= (p^{(j)})^k (1 − p^{(j)})^{n−k}.
In order to calculate Eν^{(j)} and Var ν^{(j)} we use (2.1) and Lemma 2.1 where,
for C, we take the cylinder C = {ω : x_{i_1} = x^{(j)}, x_{i_2} = x^{(j)}}:
Eν^{(j)} = np^{(j)},
Var ν^{(j)} = E(ν^{(j)})² − (Eν^{(j)})² = np^{(j)}(1 − p^{(j)}).
Theorem 2.2 shows that the random variable ν^{(j)} has a binomial distribution
with parameters p^{(j)}, 1 − p^{(j)}.
From Theorem 2.2 we now obtain a very important corollary.
2.5 Generalizations
The first generalization consists in the fact that we can consider an arbitrary
probability space (X, B, P) instead of a finite X, where P is a probability
defined on the events B ∈ 𝔅. As previously, we set Ω = X^{[1,n]} and consider
the cylinders C = {ω : x_{i_1} ∈ B_1, ..., x_{i_k} ∈ B_k}, where B_1, ..., B_k ∈ 𝔅.
Let P(C) = ∏_{l=1}^k P(B_l) for any cylinder C. One can prove (see below)
that there exists a unique σ-additive set function, defined on the σ-algebra
generated by the cylinders, which takes this form on the cylinders C.
2.6 Applications
f(x) − Q_n(x) = f(x) − Σ_{k=0}^n (n choose k) x^k (1−x)^{n−k} f(k/n)
= Σ_{k=0}^n (n choose k) x^k (1−x)^{n−k} (f(x) − f(k/n))
= Σ_1 + Σ_2,
where Σ_1 is taken over those k with |k/n − x| < δ and Σ_2 over those k with
|k/n − x| ≥ δ. The first sum can be estimated as follows:
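The polynomial Q_n above is the Bernstein polynomial of f; its uniform convergence to a continuous f is easy to observe numerically (a Python sketch; the function names and the test function are our choices):

```python
from math import comb

def bernstein(f, n, x):
    """Q_n(x) = sum over k of C(n, k) x^k (1 - x)^(n - k) f(k / n)."""
    return sum(comb(n, k) * x ** k * (1 - x) ** (n - k) * f(k / n)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)  # continuous but not differentiable at 1/2
grid = [i / 100 for i in range(101)]
err = max(abs(f(x) - bernstein(f, 200, x)) for x in grid)
assert err < 0.05  # already uniformly close at n = 200
```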
We again consider the interval [0,1]. Let us expand every x ∈ [0,1] into its
decimal expansion x = 0.a_1a_2...a_n..., where each digit a_i takes its values
from 0 to 9. Let us fix the n first digits a_1...a_n of the decimal expansion.
Those numbers x ∈ [0,1] which have these digits in their n first positions form
an interval Δ(a_1...a_n) of length 10^{−n}. For different choices of (a_1, a_2, ..., a_n)
those intervals either do not intersect, or intersect at only one end point.
We consider the sequence of n independent identical trials, with space of
outcomes {0, 1, 2, ..., 9} and probability of each outcome equal to 10^{−1}. Then
Ω = X^{[1,n]} consists of the words ω = (a_1...a_n) and p(ω) = 10^{−n}. We denote by
ν^{(j)}(x) the number of occurrences of the digit j among the n first digits of
the decimal expansion. Let us fix a number δ > 0 and consider those intervals
Δ(a_1...a_n) for which |ν^{(j)}/n − 1/10| ≤ δ, 0 ≤ j ≤ 9. It then follows from the
LLN that the sum of the lengths of those intervals converges to 1 as n → ∞.
One sometimes says that, in the decimal expansion of typical numbers x, all
digits are encountered approximately the same number of times.
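The digit frequencies ν^{(j)}/n are simple to compute for any given x (a Python sketch; note that floating point limits the number of usable digits):

```python
from collections import Counter

def digit_frequencies(x, n):
    """Frequencies nu_j / n of each digit j among the first n decimal digits of x in [0, 1)."""
    digits = []
    for _ in range(n):
        x *= 10
        d = int(x)
        digits.append(d)
        x -= d
    counts = Counter(digits)
    return [counts[j] / n for j in range(10)]

# Fractional part of pi as a stand-in for a "typical" number.
freqs = digit_frequencies(0.141592653589793, 15)
assert abs(sum(freqs) - 1) < 1e-9
```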
It turns out that with large probability this average is close to I. The value
of d is irrelevant to the proof of this statement, so we will assume that d = 1,
i.e. that
I = ∫_0^1 f(y) dy.
First we explain the meaning of the words "n points are chosen independently
in the interval [0,1]"; we denote the points by y_1, ..., y_n. This means
that we consider a sequence of n identical trials in which X = [0,1], and in
which the distribution of outcomes is the uniform distribution on [0,1]. The
sum (1/n) Σ_{i=1}^n f(y_i) is a random variable defined on the space Ω of words
ω = (y_1, ..., y_n). We now derive the following statement from the LLN:
P{ω : |(1/n) Σ_{i=1}^n f(y_i) − I| ≤ ε} → 1 as n → ∞.
For the proof we divide the interval [0,1] into r equal parts by the points
x^{(i)} = i/r, 0 ≤ i ≤ r. Given ω = (y_1, ..., y_n) ∈ X^{[1,n]}, we construct ω′ =
(x_1, ..., x_n), where x_k = x^{(i)} if i/r ≤ y_k ≤ (i+1)/r (in the case of two
possibilities we choose the left hand side point). We now take r so large that
|f(x_k) − f(y_k)| ≤ ε/3. This is possible since the function f is continuous
on [0,1] and therefore is uniformly continuous. It follows that
|(1/n) Σ_{i=1}^n f(y_i) − (1/n) Σ_{i=1}^n f(x_i)| ≤ ε/3.
We now introduce on the space Ω′ of words ω′ the probability distribution
which assigns to each ω′ a probability p(ω′) equal to the probability of those
ω ∈ Ω which correspond to ω′. But it is clear then that if ω′ = (x_1, ..., x_n)
then
p(ω′) = (1/r)^n.
The last relation shows that we have on the space Ω′ a probability distribution
which corresponds to a sequence of n independent trials with space of outcomes
X′ = {0, 1/r, 2/r, ..., (r−1)/r} and probability 1/r for each outcome.
By the LLN P{ω′ : |ν^{(j)}/n − 1/r| ≤ δ, 0 ≤ j ≤ r−1} → 1 as n → ∞ for any
fixed δ. Here ν^{(j)} = ν^{(j)}(ω′) is the number of occurrences of the number j/r
in the word ω′. We can therefore write
(1/n) Σ_{i=1}^n f(x_i) = Σ_{j=0}^{r−1} (ν^{(j)}/n) f(x^{(j)})
= Σ_{j=0}^{r−1} (1/r) f(j/r) + Σ_{j=0}^{r−1} (ν^{(j)}/n − 1/r) f(j/r).
We choose r such that |(1/r) Σ_{j=0}^{r−1} f(j/r) − I| ≤ ε/3, and we set δ = ε/(3rM),
where M = max |f(x)|. Then
|Σ_{j=0}^{r−1} (ν^{(j)}/n − 1/r) f(j/r)| ≤ r·δ·M = ε/3.
Finally we obtain that if r is chosen in such a way that all the requirements
mentioned above are satisfied, and if y = (y^{(1)}, ..., y^{(n)}) is such that |ν^{(j)}/n −
1/r| ≤ δ for all 0 ≤ j ≤ r−1, then |(1/n) Σ_{i=1}^n f(y^{(i)}) − I| ≤ ε. It follows
from the LLN that the probability of such y converges to 1 as n → ∞.
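The statement just proved is the basis of Monte Carlo integration: the average of f at n independent uniform points approximates I = ∫₀¹ f(y) dy. A Python sketch (the test function and the seed are our choices):

```python
import random

def monte_carlo_integral(f, n, seed=0):
    """Estimate of the integral of f over [0, 1] as an average over n uniform points."""
    rng = random.Random(seed)
    return sum(f(rng.random()) for _ in range(n)) / n

f = lambda y: y * y  # exact integral over [0, 1]: 1/3
estimate = monte_carlo_integral(f, 100000)
assert abs(estimate - 1 / 3) < 0.01  # error of order 1/sqrt(n)
```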
As before, let Ω = X^{[1,n]} and X = {x^{(1)}, ..., x^{(r)}}, p^{(j)} = p(x^{(j)}), 1 ≤ j ≤ r.
Theorem 2.4 (Macmillan). For any α > 0, β > 0, there exists a number
n_0(α, β) such that for all n > n_0(α, β) there exists in the space Ω = X^{[1,n]} an
event C_n ⊆ Ω which satisfies the following properties:
1. P(C_n) ≥ 1 − α;
2. for every ω ∈ C_n, e^{−n(h+β)} ≤ p(ω) ≤ e^{−n(h−β)};
3. e^{n(h−β)} ≤ |C_n| ≤ e^{n(h+β)}.
Here h = −Σ_{j=1}^r p^{(j)} ln p^{(j)} is the entropy of the distribution {p^{(j)}}.
Proof. Let δ > 0, whose value we will specify later, and set
C_n = {ω : |ν^{(j)}(ω)/n − p^{(j)}| ≤ δ, 1 ≤ j ≤ r}.
It follows from the LLN that P(C_n) → 1 as n → ∞, and therefore for n ≥
n_0(α, β) we have P(C_n) ≥ 1 − α.
Let us now prove property 2. For ω = (x_1, ..., x_n) we have
p(ω) = ∏_{i=1}^n p(x_i) = ∏_{j=1}^r (p^{(j)})^{ν^{(j)}(ω)}
= exp(Σ_{j=1}^r ν^{(j)}(ω) ln p^{(j)}) = exp(−n(−Σ_{j=1}^r (ν^{(j)}(ω)/n) ln p^{(j)})).
Since ω ∈ C_n,
|Σ_{j=1}^r (ν^{(j)}(ω)/n − p^{(j)}) ln p^{(j)}| ≤ δ Σ_{j=1}^r |ln p^{(j)}|.
For ω ∈ C_n we then have
e^{−n(h+β/2)} ≤ p(ω) ≤ e^{−n(h−β/2)},
and
e^{−n(h+β/2)} |C_n| ≤ Σ_{ω∈C_n} p(ω) ≤ e^{−n(h−β/2)} |C_n|. ∎
The meaning of Macmillan's Theorem (MT) is as follows: Clearly |Ω| = r^n.
MT shows that we can choose, in Ω, a subset C_n for which the number of points
is between e^{n(h−β)} and e^{n(h+β)}, and whose probability is arbitrarily close to 1. If h is much
smaller than ln r, then such a reduction of the space Ω does not change much
from the point of view of probability theory, but significantly reduces the space
itself. This property plays an important role in many problems in information
theory and ergodic theory. In addition the MT shows that for any arbitrary
non-uniform distribution {p^{(j)}} the probabilities p(ω) for large n become equal
in the limit, i.e. the distribution approaches the uniform distribution, although
in a very weak sense.
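The mechanism behind the theorem is visible in a tiny computation: for a word whose empirical frequencies match {p^{(j)}} exactly, p(ω) equals e^{−nh} precisely (a Python sketch; the distribution is arbitrary):

```python
from math import exp, log

p = [0.5, 0.25, 0.25]               # one-step distribution
h = -sum(pj * log(pj) for pj in p)  # entropy h = -sum p_j ln p_j

n = 40
counts = [20, 10, 10]               # nu_j = n * p_j exactly
prob_word = 1.0
for pj, c in zip(p, counts):
    prob_word *= pj ** c            # p(omega) = product of p_j ** nu_j

# p(omega) = exp(sum nu_j ln p_j) = e^{-n h} for such a "typical" word.
assert abs(prob_word - exp(-n * h)) < 1e-12 * prob_word
```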
Let X = {−1, 1}, p(1) = p, p(−1) = q, and consider the sequence of independent
identical trials with Ω = X^{[1,n]}. Here ω = (x_1, ..., x_n) where each
x_i = ±1. We set y_0 = 0, y_k = Σ_{i=1}^k x_i, 1 ≤ k ≤ n, and we draw in the plane
(t, y) the broken line y(t) which passes through the points (k, y_k), 0 ≤ k ≤ n.
Fig. 2.1.
It is clear that the function y(t) uniquely defines ω. Each y(t) consists of
straight segments, at angles of ±45°. If we consider y(t) as the graph of a
motion, it is clear that during each unit of time the moving point moves along a
segment of length 1, but may change the direction of motion arbitrarily. The function
y(t) is called the trajectory of the random walk. The trajectory of a random
walk can be figuratively described as the trajectory of a drunk person. It is
clear that if k is even then y(k) is even, and if k is odd y(k) is odd. We
set P{(y(t), 0 ≤ t ≤ n)} = p(ω), where p(ω) corresponds to the sequence
ω = (x_1, ..., x_n) which gave y(t). Then P{(y(t), 0 ≤ t ≤ n)} = p^{ν^{(1)}} q^{ν^{(−1)}},
where ν^{(1)} = ν^{(1)}(ω) is the number of unit segments which move at
an angle of +45°, and ν^{(−1)} = ν^{(−1)}(ω) is the number of unit segments which
move at an angle of −45°. Clearly we have
y(n) = ν^{(1)}(ω) − ν^{(−1)}(ω),  y(n)/n = ν^{(1)}(ω)/n − ν^{(−1)}(ω)/n.
It follows from the LLN that for large n, with large probability, n^{−1}ν^{(1)} is
close to p and n^{−1}ν^{(−1)} is close to q. Therefore n^{−1}y(n) with large probability
is close to p − q. The quantity p − q is called the mean one-step displacement.
We now find Ey(n) and Ey²(n) for a symmetric random walk. We use the
fact that ν^{(1)}(ω) + ν^{(−1)}(ω) = n, ν^{(−1)}(ω) = n − ν^{(1)}(ω), y(n) = 2ν^{(1)}(ω) − n,
and ν^{(1)} has a binomial distribution with p = q = 1/2. Using Theorem 2.1 we
obtain:
Ey(n) = E(2ν^{(1)}(ω) − n) = 2Eν^{(1)}(ω) − n = 2·(n/2) − n = 0,
Var y(n) = Ey²(n) = E(2ν^{(1)}(ω) − n)² = E[4(ν^{(1)}(ω))² − 4nν^{(1)}(ω) + n²]
= 4E(ν^{(1)}(ω))² − 4nEν^{(1)}(ω) + n²
= 4Var(ν^{(1)}(ω)) + 4(Eν^{(1)}(ω))² − 4nEν^{(1)}(ω) + n² = 4·(n/4) + n² − 2n² + n² = n.
The last relation shows that typical values of y(n) lie in a domain of order
O(√n). In the next lecture we will give this statement an accurate form.
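A simulation confirms Ey(n) = 0 and Var y(n) = n for the symmetric walk (a Python sketch; the sample sizes and seed are our choices):

```python
import random

def endpoint(n_steps, rng):
    """Final position y(n) of a symmetric +-1 random walk."""
    return sum(rng.choice((-1, 1)) for _ in range(n_steps))

rng = random.Random(0)
n, n_walks = 100, 5000
ends = [endpoint(n, rng) for _ in range(n_walks)]
mean = sum(ends) / n_walks
var = sum(y * y for y in ends) / n_walks - mean ** 2

assert abs(mean) < 1.0    # E y(n) = 0
assert abs(var - n) < 15  # Var y(n) = n, up to sampling error
```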
Lecture 3. De Moivre-Laplace and Poisson
Limit Theorems
as n → ∞.
ln(q − z√(pq/n)) = ln q − z√p/√(qn) − z²p/(2qn) + K_2(p)·z³·(pq)^{3/2}/n^{3/2},
where K_1 and K_2 are the coefficients of the remainders which arise from the
third derivatives at intermediate points. We only need the fact that as n → ∞
these coefficients remain bounded, |K_1|, |K_2| ≤ C, where C is a constant that
does not depend on n. ∎
In probability theory statements involving the asymptotic behavior of in-
dividual probabilities are sometimes called local limit theorems.
Proof. We saw that max_k P_k is attained for k_0 = [np − q]. For such a k_0,
P_{k_0} ∼ 1/√(2πnpq) → 0 as n → ∞,
since z_0 = (k_0 − np)/√(npq) and |z_0| ≤ 2/√(npq) → 0 as n → ∞. ∎
Proof. Again we set z = (k − np)/√(npq). When k takes all integer values such
that 0 ≤ k ≤ n in turn, z varies between the limits −√(np/q) ≤ z ≤ √(nq/p)
by steps of length 1/√(npq). By Theorem 3.1 we have
P_k ∼ (1/√(2πnpq)) e^{−z²/2}.
The sum of these probabilities therefore behaves like a Riemann sum.
The expression on the right hand side of the statement of the theorem is a
probability calculated with the Gaussian density p(z) = (1/√(2π)) e^{−z²/2}. This
density plays an important role in many problems of probability theory.
lim_{n→∞} Σ_{k ≤ np+A√(npq)} P_k = (1/√(2π)) ∫_{−∞}^{A} e^{−z²/2} dz,
lim_{n→∞} Σ_{k ≥ np+A√(npq)} P_k = (1/√(2π)) ∫_{A}^{+∞} e^{−z²/2} dz.
We show only the first statement, as the second statement is proved similarly.
Let ε > 0 be arbitrary. Since (1/√(2π)) ∫_{−∞}^{∞} e^{−z²/2} dz = 1, there exists a
B > 0 such that (1/√(2π)) ∫_{−B}^{B} e^{−z²/2} dz = 1 − ε/4. By Theorem 3.1 we have
Σ_{np−B√(npq) ≤ k ≤ np+B√(npq)} P_k ≥ 1 − ε/2,
Σ_{k < np−B√(npq)} P_k ≤ ε/2.
We therefore have
Σ_{k ≤ np+A√(npq)} P_k = Σ_{k < np−B√(npq)} P_k + Σ_{np−B√(npq) ≤ k ≤ np+A√(npq)} P_k.
The last relation shows that for any interval [A, B] the probability that
y(n)/√n ∈ [A, B] converges to a positive limit as n → ∞. This means that
for large n, in typical cases (i.e. with positive probability) the random walk
takes values of order √n, i.e. the random walk point at time n has departed
from its initial value by a distance of order √n. Such a feature is typical of
diffusion processes.
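The local De Moivre–Laplace approximation P_k ≈ (1/√(2πnpq)) e^{−z²/2} is already accurate at moderate n (a Python sketch; the parameters are arbitrary):

```python
from math import comb, exp, pi, sqrt

n, p = 1000, 0.3
q = 1 - p
k = 320
z = (k - n * p) / sqrt(n * p * q)

exact = comb(n, k) * p ** k * q ** (n - k)           # binomial P_k
approx = exp(-z * z / 2) / sqrt(2 * pi * n * p * q)  # Gaussian approximation
assert abs(exact - approx) / exact < 0.05            # within a few percent
```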
or
pn − a(p, n)·n ≤ ν^{(1)}(ω) ≤ pn + a(p, n)·n. (3.3)
We now consider the following functions of p, which depend on n:
Given the number α we can find a = a_α by using tables for the normal
distribution. Then
Fig. 3.1.
lim_{n→∞} n·P(C_n) = 1/(2π√(pq)) in the first case, and
P(C_n) ≤ const/n^{3/2} in the second case,
with
For each ω ∈ C_{n,k} clearly p(ω) = p^{2k} q^{n−2k}. Therefore P(C_{n,k}) = |C_{n,k}|·p^{2k}·q^{n−2k}.
In order to calculate |C_{n,k}| we note that |C_{n,k}| is equal to the number
of ways in which we can choose from the set {1, 2, ..., n} two subsets with
k elements and two subsets with n/2 − k elements. The first subset can be
chosen in (n choose k) ways. When the first subset is chosen then for the choice of the
second subset we have (n−k choose k) possibilities. Once we have chosen the first two
subsets, there are (n−2k choose n/2−k) possibilities for the choice of the third one.
We now have
P(C_{n,k}) = n!/((k!)²((n/2−k)!)²) · p^{2k} q^{n−2k}
= n!/((2k)!(n−2k)!) · (2p)^{2k}(2q)^{n−2k} · [(1/2^{2k})·(2k)!/(k!)²] · [(1/2^{n−2k})·(n−2k)!/(((n/2)−k)!)²].
By Stirling's formula
(1/2^{2k})·(2k)!/(k!)² ∼ 1/√(2π·2k·(1/2)·(1/2)) = 1/√(πk),  k → +∞.
We also have
(1/2^{n−2k})·(n−2k)!/(((n/2)−k)!)² ∼ 1/√(π(n/2 − k)).
In all cases
[(1/2^{2k})·(2k)!/(k!)²] · [(1/2^{n−2k})·(n−2k)!/(((n/2)−k)!)²] ≤ c/√n.
Therefore
P(C_n) = Σ_k P(C_{n,k}) = Σ_k p_{2k} [(1/2^{2k})·(2k)!/(k!)²]·[(1/2^{n−2k})·(n−2k)!/(((n/2)−k)!)²]
= (Σ_{|k/n−p|<δ} + Σ_{|k/n−p|≥δ}) p_{2k} [(1/2^{2k})·(2k)!/(k!)²]·[(1/2^{n−2k})·(n−2k)!/(((n/2)−k)!)²]
= Σ_1 + Σ_2.
Here the Pi denote the probabilities of a binomial distribution with probabil-
ities 2p and 2q. By Chebyshev's inequality L2 satisfies
E2 ~ In L
IIe/n-pl~6
P21e ~ In L
121e/n-2pl~26
P21e
c 4nqp cq
<---
2 =
2 -- , E 2 'n-+O as n-+oo.
- Vn 40 n
3 2 n /
We can write Σ₁ in the following way:
\Sigma_1 = \sum_{|k/n-p|<\delta} p_{2k} \left( \frac{1}{2^{2k}} \frac{(2k)!}{(k!)^2} \right) \left( \frac{1}{2^{n-2k}} \frac{(n-2k)!}{((n/2-k)!)^2} \right)
\approx (1 + \theta) \frac{1}{\sqrt{\pi n p}} \frac{1}{\sqrt{\pi n q}} \sum_k p_{2k},
where θ = θ(δ) → 0 as δ → 0. The sum Σ_i p_i = 1 and the sum Σ_k p_{2k} → 1/2,
by Theorem 3.1 (we take only half the terms in the Riemann sum; the exact
proof is left to the reader as an exercise). Therefore
n \Sigma_1 \approx (1 + \theta) \frac{1}{2\pi\sqrt{pq}}.
Since θ was arbitrary, we have proved that
Here Σ_{k₁} is equal to the sum which we estimated in the first case, therefore
for a certain constant c₁
\frac{1}{2^{2k_1}} \frac{(2k_1)!}{(k_1!)^2} \le \frac{c_1}{\sqrt{k_1}}.
Also, as in the first case, one can show that the main contribution to this sum
is given by those terms where k₁/n ≈ p. For such k₁ we have
\frac{1}{\sqrt{k_1+1}} \cdot \frac{1}{n-2k_1+1} = \frac{1}{n^{3/2}} \cdot \frac{1}{\sqrt{k_1/n + 1/n}} \cdot \frac{1}{1 - 2k_1/n + 1/n}
\sim \frac{1}{n^{3/2}} \cdot \frac{1}{\sqrt{2p}\, 2(q+r)}.
The desired statement then follows easily. □
Later we consider other methods which lead to similar estimates.
Here λ > 0 is the parameter of the Poisson distribution and k takes its values
in the set of integers 0 ≤ k < ∞. If a random variable ξ has a Poisson
distribution with parameter λ then
Proof. We have
P(\nu^{(n)} = k) = \binom{n}{k} p_n^k (1 - p_n)^{n-k}.
In our situation k is fixed but n → ∞. Therefore (n − k) \ln(1 − p_n) ∼ −(n −
k)p_n = −p_n n(1 − k/n) → −λ as n → ∞. Furthermore
Theorem 3.5.
\lim P(\nu^{(Q)}(\omega) = k) = \frac{(\lambda \,\mathrm{vol}\, Q)^k}{k!} \, e^{-\lambda \,\mathrm{vol}\, Q}.
Proof. Let us fix indices i₁, …, i_k for the particles which lie in Q, and let
C_{i_1,…,i_k} = {ω | x_{i_s} ∈ Q, 1 ≤ s ≤ k, x_j ∉ Q for j ≠ i₁, …, i_k}. Then
Therefore
Theorem 3.5 shows that the number of particles of an ideal gas lying in
a given fixed domain Q, when passing to the limit as previously described,
follows a Poisson distribution with parameter λ·vol Q.
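The limit in Theorem 3.5 can be illustrated numerically. A sketch under assumed parameters (λ = 2, a one-dimensional "box" of volume 50, vol Q = 1; all names and numbers are ours): scatter n = λ·vol uniform particles and compare the empirical distribution of the count in Q with the Poisson probabilities.

```python
import math
import random

def count_distribution(lam=2.0, volume=50.0, vol_q=1.0, n_trials=4000, seed=1):
    """Empirical distribution of the number of uniform particles falling in [0, vol_q)."""
    rng = random.Random(seed)
    n = int(lam * volume)                      # total number of particles
    freq = {}
    for _ in range(n_trials):
        k = sum(rng.uniform(0.0, volume) < vol_q for _ in range(n))
        freq[k] = freq.get(k, 0) + 1
    return {k: v / n_trials for k, v in freq.items()}

def poisson(k, mean):
    return math.exp(-mean) * mean ** k / math.factorial(k)

freqs = count_distribution()
for k in range(5):
    # empirical frequency vs Poisson probability with mean lam * vol_q = 2
    print(k, round(freqs.get(k, 0.0), 3), round(poisson(k, 2.0), 3))
```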
Lecture 4. Conditional Probability and
Independence
The nature of the dependence on B arises from the formula of total prob-
ability which we give below. Let {B₁, B₂, …, B_n, …} be a finite or countable
partition of the space Ω - i.e. a collection of sets {B_i} such that B_i ∩ B_j = ∅
for i ≠ j and ∪_i B_i = Ω. We also assume that P(B_i) > 0 for every i. For any
A ∈ \mathcal{F} we have
P(A) = \sum_i P(A \cap B_i) = \sum_i P(A \mid B_i) P(B_i). \qquad (4.1)
The relation (4.1) above is called the total probability formula. The nature of
this formula is reminiscent
of expressions of the type of a multiple integral written in terms of iterated
integrals. The conditional probability P(AIBi ) plays the role of the inner
integral and the summation over i is the analog of the outer integral.
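As a toy illustration (the numbers below are invented, not from the text), the total probability formula in code:

```python
# Hypothetical setup: a coin is fair with probability 0.9 (hypothesis B1)
# and two-headed with probability 0.1 (hypothesis B2); A = "the toss is heads".
p_b = {"B1": 0.9, "B2": 0.1}
p_a_given_b = {"B1": 0.5, "B2": 1.0}

# Total probability formula: P(A) = sum_i P(A | B_i) P(B_i).
p_a = sum(p_a_given_b[b] * p_b[b] for b in p_b)
print(p_a)  # 0.9 * 0.5 + 0.1 * 1.0 = 0.55
```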
Sometimes in mathematical statistics the events B_i are called hypotheses,
since the choice of B_i defines a probability distribution on \mathcal{F}, and the proba-
bilities P(B_i) are called prior probabilities (i.e. given before the experiment).
We assume that as a result of the trial an event A occurred and we wish, on
the basis of this, to draw conclusions on which of the hypotheses B_i is most
likely. This estimation is done by the calculation of the probabilities P(B_k|A),
which sometimes are called posterior (after the experiment) probabilities. We have
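The computation referred to here is Bayes' formula, P(B_k|A) = P(A|B_k)P(B_k) / Σ_i P(A|B_i)P(B_i). A numerical sketch with invented urn probabilities:

```python
# Two hypothetical urns: B1 chosen with probability 0.3, B2 with 0.7;
# a white ball (event A) is drawn with probability 0.5 from B1, 0.2 from B2.
p_b = {"B1": 0.3, "B2": 0.7}
p_a_given_b = {"B1": 0.5, "B2": 0.2}

p_a = sum(p_a_given_b[b] * p_b[b] for b in p_b)            # total probability, 0.29
posterior = {b: p_a_given_b[b] * p_b[b] / p_a for b in p_b}
print(posterior)  # prior 0.3 for B1 rises to 0.15/0.29 ~ 0.52 after observing A
```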
Lemma 4.1. The events A and B are independent if and only if the events
in any of the pairs (A, B̄), (Ā, B) or (Ā, B̄) are independent.
Proof. Let us prove for example the equality P(A ∩ B̄) = P(A)·P(B̄). We have
It is clear that one can find σ-algebras which are pairwise independent
but not jointly independent.
Definition 4.5. Let η₁ = f₁(ω), …, η_n = f_n(ω) be a collection of n random
variables. The η₁, …, η_n are said to be jointly independent random variables
if for any Borel subsets C₁, …, C_n
P\{\eta_1 = f_1(\omega) \in C_1, \ldots, \eta_n = f_n(\omega) \in C_n\} = \prod_{i=1}^{n} P(f_i(\omega) \in C_i).
We now show the meaning of this definition when the random variables
η_i take a finite or countable number of values. Let {a_j^{(i)}} be the values of the
random variable η_i and B_j^{(i)} = {ω | η_i = a_j^{(i)}}. Then for any i the {B_j^{(i)}} form
a partition of the space Ω which generates a σ-algebra that we will denote
by \mathcal{F}_i, 1 ≤ i ≤ n. It follows from Definition 4.4 that
P\{\eta_1 = a_{j_1}^{(1)}, \ldots, \eta_n = a_{j_n}^{(n)}\} = \prod_{i=1}^{n} P\{\eta_i = a_{j_i}^{(i)}\}.
Definition 4.5 is of course a special case of Definition 4.4. To see this we take
as the σ-algebra \mathcal{F}_k the σ-algebra of the subsets of the form f_k^{-1}(C), where
C is a Borel subset.
Theorem 4.1. Let η₁ and η₂ be independent random variables such that Eη_i
exists, i = 1, 2. Then E(η₁·η₂) exists and E(η₁·η₂) = Eη₁·Eη₂.
We prove this theorem only in the case where η₁ and η₂ take no more than
a countable number of values. Let the values of η₁ be the numbers a₁, a₂, …,
and the values of η₂ be the numbers b₁, b₂, …. It follows from the finiteness
of the expectations Eη₁ and Eη₂ that
\sum_i |a_i| P\{\eta_1 = a_i\} < \infty, \qquad \sum_j |b_j| P\{\eta_2 = b_j\} < \infty.
The product η₁·η₂ takes the value a_i b_j on the set {ω | η₁ = a_i} ∩ {ω | η₂ = b_j}.
Thus, by independence,
E(\eta_1 \eta_2) = \sum_{i,j} a_i b_j P\{\eta_1 = a_i\} P\{\eta_2 = b_j\}
= \sum_i a_i P\{\eta_1(\omega) = a_i\} \cdot \sum_j b_j P\{\eta_2(\omega) = b_j\} = E\eta_1 \cdot E\eta_2.
Theorem 4.2. Let η₁, η₂, …, η_n be jointly independent random variables such
that Eη_i exists for 1 ≤ i ≤ n. Then E(η₁·η₂ ⋯ η_n) exists and E(η₁ ⋯ η_n) =
Eη₁ ⋯ Eη_n.
Corollary. Let η₁, …, η_n be pairwise independent random variables such that
Var η_i < ∞ for 1 ≤ i ≤ n. Then Var(η₁ + ⋯ + η_n) < ∞ and Var(η₁ + ⋯ + η_n) =
Var η₁ + ⋯ + Var η_n.
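A Monte Carlo sanity check of Theorem 4.1 and the Corollary; the two discrete distributions below are arbitrary choices of ours, not from the text:

```python
import random

# Check E(eta1 * eta2) = E eta1 * E eta2 and Var(eta1 + eta2) = Var eta1 + Var eta2
# for two independent discrete random variables.
rng = random.Random(7)
vals1, probs1 = [0, 1, 3], [0.2, 0.5, 0.3]   # E = 1.4, Var = 1.24
vals2, probs2 = [-1, 2], [0.4, 0.6]          # E = 0.8, Var = 2.16

n = 100_000
xs = rng.choices(vals1, probs1, k=n)
ys = rng.choices(vals2, probs2, k=n)         # drawn independently of xs

mean_prod = sum(x * y for x, y in zip(xs, ys)) / n
sums = [x + y for x, y in zip(xs, ys)]
mean_sum = sum(sums) / n
var_sum = sum((s - mean_sum) ** 2 for s in sums) / n

print(round(mean_prod, 2))   # close to 1.4 * 0.8 = 1.12
print(round(var_sum, 2))     # close to 1.24 + 2.16 = 3.40
```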
But
Fig. 4.1.
P_z = \sum_{\omega_z \in \Omega_z} p(\omega_z), \qquad 0 < z < a,
W_z = {\sum}' p(\omega_z), \qquad L_z = {\sum}'' p(\omega_z),
where Σ′ (or Σ″) indicates that the summation is carried out for those ω_z
that end with a (or with 0), i.e. W_z is the probability of victory and L_z is the
probability of ruin with an initial fortune of z. For z = 0 or a it follows from
the definition that:
P_0 = 1, \quad W_0 = 0, \quad L_0 = 1,
P_a = 1, \quad W_a = 1, \quad L_a = 0.
Proof. We have
P_z = {\sum}^{(1)} p(\omega_z) + {\sum}^{(2)} p(\omega_z),
where Σ^{(1)} denotes the sum over those ω_z in which the first game was won
and Σ^{(2)} denotes the sum over those ω_z in which the first game was lost.
For ω_z ∈ Σ^{(1)} we have p(ω_z) = p·p(ω_{z+1}), where ω_{z+1} is the graph which
begins at n = 1 with the point z + 1. Clearly any ω_{z+1} ∈ Ω_{z+1} is obtained
this way. Therefore Σ^{(1)} p(ω_z) = p \sum_{\omega_{z+1} \in \Omega_{z+1}} p(\omega_{z+1}) = p·P_{z+1}. In the
same way it is clear that for ω_z ∈ Σ^{(2)} we have p(ω_z) = q·p(ω_{z−1}) and
therefore Σ^{(2)} p(ω_z) = q·P_{z−1}. This completes the proof of the first relation.
The remaining relations are proved in the same way. □
Lemma 4.3. There exists at most one solution to (4.3) which satisfies the
given boundary conditions.
Proof. Assume that, to the contrary, there are two such solutions, R_z^{(1)}, R_z^{(2)},
and let R_z^{(3)} = R_z^{(1)} − R_z^{(2)}. Then R_0^{(3)} = R_a^{(3)} = 0 and
R_z^{(3)} = p R_{z+1}^{(3)} + q R_{z-1}^{(3)}, \qquad 0 < z < a.
Let us choose z₀ such that max{R_z^{(3)}, 0 < z < a} = R_{z_0}^{(3)} > 0. Since R_{z_0}^{(3)} ≥
R_{z_0+1}^{(3)} and R_{z_0}^{(3)} ≥ R_{z_0−1}^{(3)}, the equality
R_{z_0}^{(3)} = p R_{z_0+1}^{(3)} + q R_{z_0-1}^{(3)}
is possible if and only if R_{z_0}^{(3)} = R_{z_0+1}^{(3)} = R_{z_0−1}^{(3)}. In exactly the same way
we can go to z₀ ± 2 and so forth. In the end we obtain that R_z^{(3)} = constant
for 0 < z < a. But this constant can only be 0 since R_0^{(3)} = R_a^{(3)} = 0. An
analogous argument can be carried out in the case where min R_z^{(3)} < 0. In the
same way we obtain R_z^{(3)} ≡ 0 for 0 ≤ z ≤ a, i.e. R_z^{(1)} = R_z^{(2)}. □
The function P_z satisfies (4.3) and the boundary conditions P_0 = P_a = 1.
Similarly
We now find explicit formulas for W_z and L_z. The method is totally anal-
ogous to the method of solving linear differential equations with constant
coefficients. We first seek particular solutions to (4.3) of the form R_z = λ^z.
Substituting into (4.3) gives λ = pλ² + q. The roots of the last equation have
the form λ₁ = 1, λ₂ = q/p.
First Case. p ≠ q. In this case λ₁ ≠ λ₂. We will look for W_z of the form
W_z = d₁ + d₂(q/p)^z. The boundary conditions give
W_0 = d_1 + d_2 = 0,
W_a = d_1 + d_2 \left(\frac{q}{p}\right)^a = 1.
Second Case. p = q = 1/2. In this case the equation has the form
W_z = \frac{1}{2}(W_{z+1} + W_{z-1}),
and also λ₁ = λ₂ = 1. Aside from the particular solution W_z = 1 we can check
immediately that W_z = z is also a solution. We therefore look for a solution
of the form
W_z = d_1 + d_2 z.
Taking into account the boundary conditions we obtain
W_z = \frac{z}{a}, \qquad L_z = 1 - W_z = \frac{a-z}{a}.
An intuitive conclusion arises from our equations: the greater the initial for-
tune, the greater the probability of winning.
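The formulas for W_z can be checked by simulation; the following sketch (function and parameter names are ours) plays the game to completion many times:

```python
import random

def win_probability(z, a, p, n_games=20_000, seed=3):
    """Fraction of simulated ruin games started at fortune z that reach a before 0."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_games):
        x = z
        while 0 < x < a:
            x += 1 if rng.random() < p else -1
        wins += (x == a)
    return wins / n_games

# Symmetric case: W_z = z / a.
print(win_probability(3, 10, 0.5))   # close to 3/10
# Asymmetric case: W_z = (1 - (q/p)^z) / (1 - (q/p)^a).
print(win_probability(3, 10, 0.6))   # close to (1 - (2/3)**3) / (1 - (2/3)**10)
```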
We now introduce the random variable τ(ω_z), ω_z ∈ Ω_z, equal to the dura-
tion of the game, i.e. equal to that n = n(ω_z) for which the graph first attains
0 or a.
Lemma 4.4. There exist constants q, 0 < q < 1, and c, 0 < c < ∞, such that
We first show that \sum_{\omega_z \in C_f} p(\omega_z) = p^k q^{n-k}, where k is the number of games
won between the times 0 and n, i.e. the number of segments in the graph f of
length √2 which move with an angle of +45° (see Fig. 4.2).
Fig. 4.2.
Indeed let f(n) = z₁. Then every graph ω_z ∈ C_f consists of two parts: the
first part describes the function f and the second part can be considered as
an element ω_{z₁} of the space Ω_{z₁}. Correspondingly p(ω_z) = p^k q^{n-k} · p(ω_{z₁}) and
We now establish the following inequality: for some ql, 0 < ql < 1,
(4.4)
Here the notation \sum_{f^{(1)} \supset f} denotes a summation with respect to the graphs
f^{(1)} which are extensions of f. We now note that
\sum_{f^{(1)} \supset f} p^{k_1 - k} q^{a - (k_1 - k)} \le q_1 < 1,
since in a steps the only possible extensions f^{(1)} are those which do not lead
to the end of the game.
The desired statement follows from (4.4) since for any m, m = ka + r,
0 ≤ r < a,
Proof. The last equalities follow from the definition. To prove the first equality
we write
Here we used the fact that in the sum Σ^{(1)} (or Σ^{(2)}) each ω_z can be repre-
sented as a union of the first segment moving with an angle +45° (or −45°)
and of ω_{z+1} ∈ Ω_{z+1} (or ω_{z−1} ∈ Ω_{z−1}), and that each ω_{z+1} (or ω_{z−1}) appears
in such a representation. □
The statement of Lemma 4.3 also applies to E_z. Therefore if we find one
function E_z which satisfies the conditions in Lemma 4.5 that function will be
the one we are looking for.
First Case. p ≠ q. We note that the function E_z = z/(q − p) satisfies the
equation
E_z = p E_{z+1} + q E_{z-1} + 1, \qquad (4.5)
but does not satisfy the boundary conditions. Since every function of the
form R_z = d₁ + d₂(q/p)^z satisfies the equation R_z = pR_{z+1} + qR_{z−1}, then any
function E_z = z/(q − p) + R_z satisfies (4.5). We choose d₁ and d₂ in order to
satisfy the boundary conditions
E_0 = d_1 + d_2 = 0, \qquad d_2 = -d_1,
E_a = \frac{a}{q-p} + d_1 \left(1 - \left(\frac{q}{p}\right)^a\right) = 0, \qquad d_1 = \frac{a}{(p-q)\left(1 - \left(\frac{q}{p}\right)^a\right)}.
Finally we obtain
Second Case. p = q = 1/2. The equation (4.5) then has the form
E_z = \frac{1}{2}(E_{z+1} + E_{z-1}) + 1. \qquad (4.6)
The function E_z = −z² satisfies (4.6). As in the first case we look for E_z of
the form:
E_z = -z^2 + d_1 + d_2 z.
Then
E_0 = d_1 = 0,
E_a = -a^2 + d_2 a = 0, \qquad d_2 = a.
Therefore
E_z = -z^2 + az = z(a - z).
The function E_z attains its maximum value a²/4 for z = a/2, which is
natural in view of the symmetry p = q = 1/2. In addition this relation shows
that the duration of the game for z ≈ a/2 is of the order a²/4, i.e. of the order
of the square of the length of the interval [0, a]. We have already seen that
such a behavior is typical for random walks.
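The duration formula Eτ = z(a − z) in the symmetric case can also be verified by simulation (a sketch; names and parameters ours):

```python
import random

def mean_duration(z, a, n_games=20_000, seed=5):
    """Average duration of a symmetric (p = q = 1/2) ruin game started at z."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_games):
        x, t = z, 0
        while 0 < x < a:
            x += rng.choice((-1, 1))
            t += 1
        total += t
    return total / n_games

print(mean_duration(5, 10))   # close to E tau = 5 * (10 - 5) = 25
```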
Lecture 5. Markov Chains
The theory of Markov chains makes use of the theory of so-called stochastic
matrices. We therefore begin with a small digression of a purely algebraic
nature.
Lemma 5.1. The following statements, a), b.1), b.2) and c), are equivalent:
a) the matrix Q is stochastic;
b.1) for any vector f ≥ 0 we have Qf ≥ 0;
b.2) if 1 = (1, …, 1)′ then Q1 = 1, i.e. the vector 1 is an eigenvector of the
matrix Q corresponding to the eigenvalue 1;
c) if μ = (μ₁, …, μ_r) is a probability distribution, i.e. μ_i ≥ 0, Σ_{i=1}^r μ_i = 1,
then μ′ = μQ is also a probability distribution.
Proof. If Q is a stochastic matrix then b.1) and b.2) clearly hold and there-
fore a) ⇒ b). We now show that b) ⇒ a). Consider the column vector
δ_j = (0, …, 0, 1, 0, …)′ ≥ 0 with the 1 in the j-th place. Then Qδ_j =
(q_{1j}, q_{2j}, …, q_{rj})′ ≥ 0, i.e. q_{ij} ≥ 0.
Furthermore Q1 = (Σ_j q_{1j}, Σ_j q_{2j}, …, Σ_j q_{rj})′ and it follows from the equal-
ity Q1 = 1 that Σ_j q_{ij} = 1 for all i, therefore b) ⇒ a).
We now show that a) ⇒ c). If μ′ = μQ, then μ′_j = Σ_{i=1}^r μ_i q_{ij}. Since Q
is stochastic we have μ′_j ≥ 0 and Σ_{j=1}^r μ′_j = Σ_j Σ_i μ_i q_{ij} = Σ_i Σ_j μ_i q_{ij} =
Σ_i μ_i Σ_j q_{ij} = Σ_i μ_i = 1, therefore μ′ is also a probability distribution.
Now assume that c) holds. Consider the row vector δ_i = (0, …, 0, 1, 0, …, 0),
with the 1 in the i-th place, which corresponds to the probability distribution
on the set (1, 2, …, r) which is concentrated at the point i. Then δ_i Q =
(q_{i1}, q_{i2}, …, q_{ir}) is also a probability distribution. It follows that q_{ij} ≥ 0
and Σ_{j=1}^r q_{ij} = 1, i.e. c) ⇒ a). □
Lemma 5.2. If Q′ = ‖q′_{ij}‖, Q″ = ‖q″_{ij}‖ are stochastic matrices, then Q‴ =
Q′·Q″ is also a stochastic matrix. If all the q″_{ij} > 0 then all the q‴_{ij} > 0.
Proof. We have
q'''_{ij} = \sum_{k=1}^{r} q'_{ik} \, q''_{kj},
\sum_{j=1}^{r} q'''_{ij} = \sum_{j=1}^{r} \sum_{k=1}^{r} q'_{ik} q''_{kj} = \sum_{k=1}^{r} q'_{ik} \sum_{j=1}^{r} q''_{kj} = \sum_{k=1}^{r} q'_{ik} = 1.
Definition 5.2. A Markov chain with state space X, generated by the ini-
tial distribution μ on X and the stochastic matrices P(1), …, P(n), is a
probability distribution P on Ω which satisfies
We now perform the summation as follows: fix the values w₀, w₁, …, w_{n−1}.
The last sum is equal to 1 by virtue of the fact that P(n) is a stochastic matrix.
The remaining sum has the same form as Σ, where n has been replaced by
n − 1. We then fix w₀, …, w_{n−2} and sum over all values of w_{n−1}, and so on.
In the end we obtain Σ = \sum_{w_0} \mu_{w_0} = 1 by the fact that μ is a probability
distribution. It follows that Σ = 1.
By similar arguments one can prove the following statement:
P\{w_0 = x^{(i_0)}, w_1 = x^{(i_1)}, \ldots, w_k = x^{(i_k)}\} = \mu_{i_0} \, p_{i_0 i_1}(1) \cdots p_{i_{k-1} i_k}(k)
for any x^{(i_0)}, …, x^{(i_k)}, k ≤ n. This equality shows that the induced probabil-
ity distribution on the space of (k + 1)-tuples (w₀, …, w_k) is also a Markov
chain, generated by the initial distribution μ, and the stochastic matrices
P(1), …, P(k).
The matrix entries p_{ij}(k) are called the transition probabilities from state
x^{(i)} to state x^{(j)} at time k.
Assume that P{w₀ = x^{(i_0)}, …, w_{k−2} = x^{(i_{k−2})}, w_{k−1} = x^{(i)}} > 0.
We consider the conditional probability P{w_k = x^{(j)} | w_{k−1} = x^{(i)}, w_{k−2} =
x^{(i_{k−2})}, …, w₀ = x^{(i_0)}}. By definition:
P\{w_k = x^{(j)} \mid w_{k-1} = x^{(i)}, w_{k-2} = x^{(i_{k-2})}, \ldots, w_0 = x^{(i_0)}\} = p_{ij}(k)
and does not depend on the values i₀, i₁, …, i_{k−2}. This property is sometimes
used as the definition of a Markov chain.
\frac{\sum_{i_0, \ldots, i_{t-1}} \mu_{i_0} \, p_{i_0 i_1} \cdots p_{i_{t-1} i} \, p_{ij}}{\sum_{i_0, \ldots, i_{t-1}} \mu_{i_0} \, p_{i_0 i_1} \cdots p_{i_{t-1} i}} = p_{ij},
i.e. we have proved our statement for s = 1. Let us assume that the statement
is proved for all s ≤ s₀ and let us prove it for s₀ + 1. We have
P\{w_{s_0+1+t} = x^{(j)} \mid w_t = x^{(i)}\}
= \sum_{k=1}^{r} P\{w_{s_0+1+t} = x^{(j)} \mid w_{s_0+t} = x^{(k)}, w_t = x^{(i)}\} \, P\{w_{s_0+t} = x^{(k)} \mid w_t = x^{(i)}\}.
By the induction hypothesis the last factor is equal to p_{ik}^{(s_0)}. We show that
the first factor is equal to p_{kj}. Indeed we have
P(w_{s_0+1+t} = x^{(j)} \mid w_{s_0+t} = x^{(k)}, w_t = x^{(i)})
= \frac{P(w_{s_0+1+t} = x^{(j)}, w_{s_0+t} = x^{(k)}, w_t = x^{(i)})}{P(w_{s_0+t} = x^{(k)}, w_t = x^{(i)})} = p_{kj},
where in the numerator and denominator the summation is carried out over
the indices i_0, \ldots, i_{t-1}, i_{t+1}, \ldots, i_{s_0+t-1}.
By the statement we have just proved this means that in s₀ steps one can,
with positive probability, proceed from any initial state x^{(i)} to any final state
x^{(j)}.
One can show that the examples we have given, and combinations of them,
describe all examples of non-ergodic Markov chains.
1. πP = π;
2. \lim_{n \to \infty} p_{ij}^{(n)} = \pi_j.
Proof. Let μ′ = {μ′₁, …, μ′_r}, μ″ = {μ″₁, …, μ″_r} be two probability distribu-
tions on the space X. We set d(μ′, μ″) = (1/2) Σ_{i=1}^r |μ′_i − μ″_i|. Then d can be
written as
d(\mu', \mu'') = {\sum}^{+} (\mu'_i - \mu''_i),
where Σ⁺ from now on will denote summation with respect to those indices
i for which the terms are positive. Therefore
Lemma 5.3.
a. d(μ′Q, μ″Q) ≤ d(μ′, μ″);
b. if all q_{ij} ≥ α then d(μ′Q, μ″Q) ≤ (1 − α)·d(μ′, μ″).
Proof. We have
d(\mu' Q, \mu'' Q) = {\sum_j}^{+} \left( \sum_i (\mu'_i - \mu''_i) \, q_{ij} \right).
Furthermore
{\sum_j}^{+} \sum_i (\mu'_i - \mu''_i) \, q_{ij} \le {\sum_j}^{+} {\sum_i}^{+} (\mu'_i - \mu''_i) \, q_{ij} = {\sum_i}^{+} (\mu'_i - \mu''_i) {\sum_j}^{+} q_{ij}.
The sum Σ⁺_j q_{ij} ≤ Σ_j q_{ij} = 1 and this completes the proof of a). We now note
that the sum Σ⁺_j cannot be a sum over all indices j. Indeed if Σ_{i=1}^r μ′_i q_{ij} >
Σ_{i=1}^r μ″_i q_{ij} for all j, then
\sum_{j=1}^{r} \sum_{i=1}^{r} \mu'_i q_{ij} > \sum_{j=1}^{r} \sum_{i=1}^{r} \mu''_i q_{ij},
which is impossible, since each of the sums is equal to 1. This latter fact is
easily seen by interchanging the order of summation. Therefore at least one
index j is missing in the sum Σ⁺_j q_{ij}. Thus if all q_{ij} ≥ α then Σ⁺_j q_{ij} ≤ 1 − α
and d(μ′Q, μ″Q) ≤ (1 − α) Σ⁺_i (μ′_i − μ″_i) = (1 − α) d(μ′, μ″). □
where n = m s₀ + s₁, 0 ≤ s₁ < s₀. For sufficiently large n we have (1 − α)^m ≤ ε,
which completes the proof.
So the sequence μ_n = μ₀Pⁿ is a Cauchy sequence. Let us set π =
lim_{n→∞} μ_n. We have πP = lim_{n→∞} μ_n P = lim_{n→∞} (μ₀Pⁿ)P =
lim_{n→∞} (μ₀P^{n+1}) = π, i.e. πP = π. We now show that a vector π with
such properties is unique. Let there be two vectors π₁ = π₁P, π₂ = π₂P.
Then π₁ = π₁P^{s₀}, π₂ = π₂P^{s₀}. Therefore d(π₁, π₂) = d(π₁P^{s₀}, π₂P^{s₀}) ≤
(1 − α)d(π₁, π₂) by the lemma. It follows that d(π₁, π₂) = 0, i.e. π₁ = π₂.
We have thus obtained that for any initial distribution μ₀ the limit
lim_{n→∞} μ₀Pⁿ = π exists and does not depend on the choice of μ₀. Let us
choose for μ₀ the probability distribution which is concentrated at the point
x^{(i)}, i.e.
Remark. Let μ₀ be concentrated at the point x^{(i)}. Then d(μ₀Pⁿ, π) =
d(μ₀Pⁿ, πPⁿ) ≤ ⋯ ≤ (1 − α)^m d(μ₀P^{n−ms₀}, πP^{n−ms₀}) ≤ (1 − α)^m. In this
relation we can take m ≥ (n/s₀) − 1. Therefore
Definition 5.5. A probability distribution π for which π = πP is said to be
a stationary distribution of the Markov chain.
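The convergence to the stationary distribution can be watched numerically: iterating μ → μP drives any initial distribution toward π. A sketch with an arbitrary 3-state stochastic matrix of ours:

```python
def step(mu, P):
    """One application mu -> mu P of a stochastic matrix to a row vector."""
    r = len(P)
    return [sum(mu[i] * P[i][j] for i in range(r)) for j in range(r)]

P = [[0.5, 0.3, 0.2],         # example stochastic matrix with positive entries
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]

mu = [1.0, 0.0, 0.0]          # start concentrated at the first state
for _ in range(200):
    mu = step(mu, P)

pi = mu
print([round(x, 4) for x in pi])            # numerically stationary vector
print([round(x, 4) for x in step(pi, P)])   # pi P = pi: the same vector
```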
w_k = x^{(i)}. We also introduce the random variables ν^{(ij)}(ω), equal to the
number of those k > 0 such that w_{k−1} = x^{(i)}, w_k = x^{(j)}. The law of large
numbers asserts that
\frac{\nu^{(i)}(\omega)}{n} \to \pi_i \quad \text{for } 1 \le i \le r, \qquad
\frac{\nu^{(ij)}(\omega)}{n} \to \pi_i p_{ij} \quad \text{for } 1 \le i, j \le r.
We have
E\chi_k^{(i)}(\omega) = \sum_{m=1}^{r} \mu_m p_{mi}^{(k)},
E\chi_k^{(i)}(\omega) \to \pi_i, \qquad E\chi_k^{(ij)}(\omega) \to \pi_i p_{ij},
E\left[\frac{1}{n} \sum_{k=1}^{n} \chi_k^{(ij)}(\omega)\right] \to \pi_i p_{ij}.
Therefore for sufficiently large n we have
The probabilities of the events on the right hand side can be estimated by
Chebyshev's inequality:
So the matter is reduced to the estimation of Var ν^{(i)} and Var ν^{(ij)}. If we set
m_k^{(i)} = E\chi_k^{(i)} = \sum_{m=1}^{r} \mu_m p_{mi}^{(k)},
we have
\mathrm{Var}\,\nu^{(i)} = \sum_{k,l=1}^{n} E(\chi_k^{(i)} - m_k^{(i)})(\chi_l^{(i)} - m_l^{(i)}) = \sum_{k,l=1}^{n} d_{kl}, \qquad |d_{kl}| \le c \lambda^{|k-l|},
p(\omega) = \mu_{w_0} \prod_k p_{w_{k-1} w_k}.
From the law of large numbers, for typical ω's we have ν^{(ij)}(ω)/n ≈ π_i p_{ij}.
P\{w_k = x^{(j)}\} = \sum_{i=1}^{r} \pi_i p_{ij} = \pi_j
by the equality πP = π. The statement we have proved means that if the
initial distribution is π then the probability distribution for any w_k is given
by that very vector π and does not depend on k. Hence the word "stationary".
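The law of large numbers for Markov chains is easy to observe in simulation. For the 2-state example below (entries chosen by us) the stationary vector solving πP = π is π = (2/7, 5/7), and the empirical fraction of time spent in state 0 approaches π₀:

```python
import random

P = [[0.5, 0.5],
     [0.2, 0.8]]              # example chain; stationary pi = (2/7, 5/7)

rng = random.Random(11)
state, n = 0, 200_000
visits = [0, 0]
for _ in range(n):
    visits[state] += 1
    state = 0 if rng.random() < P[state][0] else 1

print(visits[0] / n, 2 / 7)   # the two numbers should be close
```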
Markov chains first appeared in the work of the Russian mathematician
A. A. Markov. Later on they appeared significantly in the interesting and
deep theory of Markov processes. The main results here were obtained by
A. N. Kolmogorov, A. Ya. Khinchin, Doeblin, K. L. Chung, and others.
We shall use the ergodic theorem for Markov chains in order to study the
asymptotic behavior of a_{ij}^{(n)} as n → ∞.
1. e_j > 0, e^*_j > 0, 1 ≤ j ≤ r;
2. \sum_{j=1}^{r} a_{ij} e_j = \lambda e_i, \; 1 \le i \le r; \quad \text{and} \quad \sum_{i=1}^{r} e^*_i a_{ij} = \lambda e^*_j, \; 1 \le j \le r.
Proof. Let us show that for any matrix A with positive entries there exist at
least one vector e with positive coordinates, and λ > 0, for which
\sum_{j=1}^{r} a_{ij} e_j = \lambda e_i, \qquad 1 \le i \le r.
The Brouwer Theorem states that any such map has at least one fixed
point. If e is this fixed point then Ae = λe, i.e.
\sum_{j=1}^{r} a_{ij} e_j = \lambda e_i.
It is easy to see that the matrix P is a stochastic matrix with strictly positive
entries. The stationary distribution for this matrix is π_i = e_i e^*_i provided that
e and e* are normalized in such a way that Σ π_i = Σ e_i e^*_i = 1. Indeed
a_{ij}^{(n)} = \sum a_{i i_1} a_{i_1 i_2} \cdots a_{i_{n-1} j}.
The ergodic theorem for stochastic matrices gives p_{ij}^{(n)} → π_j = e_j e^*_j as n → ∞.
Thus
\frac{a_{ij}^{(n)}}{\lambda^n} \to e_i \pi_j e_j^{-1} = e_i e_j^*. \qquad (5.1)
Now we can show that the vectors e, e* and λ > 0 in Lemma 5.4 are
unique. From the last expression,
\lim_{n \to \infty} \frac{1}{n} \ln a_{ij}^{(n)} = \ln \lambda.
Thus λ is uniquely determined by A.
Suppose that there exist two vectors (e*)' and (e*)" with the needed prop-
erties. Choose an arbitrary vector e from Lemma 5.4. Then the existence of
the two vectors (e*)' and (e*)" implies that the matrix P has two stationary
distributions, which is impossible. Thus (e*)' = (e*)". Replacing A by A * we
get the same result for the vector e.
Relation (5.1), together with the statements about the uniqueness of e and e*,
gives a complete description of the behavior of a_{ij}^{(n)} as n → ∞.
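A power-iteration sketch of Lemma 5.4 (the matrix entries are invented): iterating v → Av with normalization produces λ and the eigenvector e. For the symmetric 2×2 matrix below, λ = (5 + √5)/2 can be checked by hand from the characteristic polynomial λ² − 5λ + 5 = 0.

```python
import math

A = [[2.0, 1.0],
     [1.0, 3.0]]

def apply(A, v):
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]

v = [1.0, 1.0]
lam = 0.0
for _ in range(100):
    w = apply(A, v)
    lam = max(w)              # normalization factors converge to lambda
    v = [x / lam for x in w]  # v converges (up to scale) to the vector e

print(lam, (5 + math.sqrt(5)) / 2)   # both ~ 3.618
```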
Lecture 6. Random Walks on the Lattice ℤᵈ
The topic of random walks is one of the most important and most developed
of those studied in probability theory. Problems from a lot of applications
of probability theory are connected to random walks. In fact we have already
encountered one of them in the gambler's ruin problem. The theory of Markov
chains can be viewed as a theory of random walks on graphs.
In this lecture we study random walks on the lattice ℤᵈ. The lattice ℤᵈ is
understood to be the collection of points z = (z₁, …, z_d) where −∞ < z_i <
∞ are integers and 1 ≤ i ≤ d. A random walk on ℤᵈ is a Markov chain whose
state space is X = ℤᵈ. By the same token we have here an example of a Markov
chain with a countable state space. Following the definitions of the previous
lecture we construct the graph of the Markov chain - i.e. from each point z ∈
ℤᵈ we find those y ∈ ℤᵈ for which the probability p_{zy} of the transition z → y is
positive. These transition probabilities must satisfy the relation Σ_y p_{zy} = 1.
As in the case of finite Markov chains the probability distribution of the
Markov chain is uniquely determined by the initial distribution μ = {μ_z, z ∈
ℤᵈ} and the collection of transition probabilities P = ‖p_{zy}‖ which now form
an infinite stochastic matrix, sometimes called the stochastic operator of the
random walk. Often the initial distribution will be taken to be the distribution
which is concentrated at one point, i.e. δ^{(z₀)} = {δ_{z₀−z}, z ∈ ℤᵈ} where δ_z = 1
for z = 0, δ_z = 0 for z ≠ 0. This corresponds to considering trajectories of the
random walk which begin at the point z₀.
Proof. We now introduce an important formula which relates {f_k} and {p_k}:
p_k = f_k + f_{k-1} \cdot p_1 + \cdots + f_0 \cdot p_k. \qquad (6.1)
We have Ω̃^{(k)} = ∪_{i=1}^k C_i, where C_i = {ω | ω ∈ Ω̃^{(k)}, w_i = 0 and w_j ≠ 0,
1 ≤ j < i}. Since the C_i are pairwise disjoint
\sum_{\omega \in C_i} p(\omega) = \sum_{\omega \in \Omega^{(i)}} p_{w_1 - w_0} \cdots p_{w_i - w_{i-1}} \sum_{\omega \in \Omega^{(k-i)}} p_{w_{i+1} - w_i} \cdots p_{w_k - w_{k-1}} = f_i \cdot p_{k-i}.
By the same token, if we recall that f₀ = 0, p₀ = 1 we have
p_k = \sum_{i=0}^{k} f_i \cdot p_{k-i}, \qquad p_0 = 1. \qquad (6.2)
Let us multiply the left and right sides of (6.2) by z^k and sum with respect to
k from 0 to ∞. We then obtain on the left P(z) and, on the right, as is easily
seen, we obtain 1 + P(z)·F(z), i.e. P(z) = 1 + P(z)F(z). Therefore
\sum_{k=1}^{\infty} f_k = F(1) = \lim_{z \to 1} F(z) = 1 - \lim_{z \to 1} \frac{1}{P(z)}.
In the latter equalities, and below, z → 1 from the left on the real axis.
We first assume that \sum_{k=0}^{\infty} p_k < ∞. Then, by Abel's Theorem,
\lim_{z \to 1} P(z) = \sum_{k=0}^{\infty} p_k
and \lim_{z \to 1} (1/P(z)) = 1/\sum_{k=0}^{\infty} p_k > 0. So \sum_{k=0}^{\infty} f_k < 1, i.e. the random
walk is transient.
A slightly more complicated argument is used when \sum_{k=0}^{\infty} p_k = ∞. We
show that in this case \lim_{z \to 1} 1/P(z) = 0. Let us fix ε > 0. We find N =
N(ε) such that \sum_{k=0}^{N} p_k ≥ 2/ε. Then for z sufficiently close to 1 we have
\sum_{k=0}^{N} p_k z^k ≥ 1/ε. Consequently, for such z
\frac{1}{P(z)} \le \frac{1}{\sum_{k=0}^{N} p_k z^k} \le \varepsilon.
This means that \lim_{z \to 1} 1/P(z) = 1/\sum_{k=0}^{\infty} p_k = 0. This completes the proof
of the criterion. □
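The dimension dependence behind this criterion can be seen by Monte Carlo (a rough sketch; parameters ours): in d = 1 the simple symmetric walk returns to 0 almost surely, while in d = 3 the return probability is bounded away from 1 (by Pólya's theorem it is about 0.34).

```python
import random

def return_frequency(d, n_steps=1000, n_walks=500, seed=13):
    """Fraction of simple random walks on Z^d that return to 0 within n_steps."""
    rng = random.Random(seed)
    returned = 0
    for _ in range(n_walks):
        pos = [0] * d
        for _ in range(n_steps):
            axis = rng.randrange(d)
            pos[axis] += rng.choice((-1, 1))
            if not any(pos):
                returned += 1
                break
    return returned / n_walks

print(return_frequency(1))   # close to 1: the walk is recurrent
print(return_frequency(3))   # well below 1: the walk is transient
```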
Theorem 6.1. Let E ≠ 0 and \sum_{z \in \mathbb{Z}^d} \|z\|^4 p_z < ∞. Then the random walk is
transient.
Proof. Intuitively the last statement should be clear, since it follows from the
law of large numbers that w_n/n should be close to E and that the probability
that w_n = 0 should be small. We now support this fact by an accurate proof.
The probability
p_n = P\left\{ \sum_{k=1}^{n} w'_k = 0 \right\},
where the w'_k are the independent increments of the walk.
Clearly
We now show that E\|\sum_{k=1}^{n} w'_k − nE\|^4 ≤ const·n²; this implies the inequality
p_n ≤ const/n², which gives transience. We have
\left\| \sum_{k=1}^{n} w'_k - nE \right\|^4 = \left\| \sum_{k=1}^{n} (w'_k - E) \right\|^4
and
\left\| \sum_{k=1}^{n} w'_k - nE \right\|^4 = \sum_{k_1,k_2,l_1,l_2=1}^{n} (w'_{k_1} - E, w'_{k_2} - E)(w'_{l_1} - E, w'_{l_2} - E).
The finiteness of the last sum is also easily verified from the conditions of the
theorem.
It remains to be checked that when at least three among the indices
k₁, k₂, l₁ and l₂ are distinct the corresponding mathematical expectation is
equal to 0. In this case at least one index appears only once. Assume that the
index that appears only once is, for example, k₁. Then
\sum_{s=1}^{d} \big(w'_{k_1}(s) - E(s)\big)\big(w'_{k_2}(s) - E(s)\big)\big(w'_{l_1}(s) - E(s)\big)\big(w'_{l_2}(s) - E(s)\big).
Let e₁, …, e_d be the unit coordinate vectors and let p_{zy} = 1/2d if y = z ± e_s,
1 ≤ s ≤ d.
off to infinity in a given direction. It turns out that this is not the case, and
there is no such limit. Furthermore the vectors Vn are uniformly distributed on
the unit sphere. This means that a typical trajectory, when going to infinity, is
seen from the origin under a more or less arbitrary angle. Such a phenomenon
is possible because of the fact that the trajectory of a random walk spreads
out over a length of order O(√n) in n steps, and therefore manages to move
in all directions.
Spatially homogeneous random walks on ℤᵈ are special cases of homoge-
neous random walks on groups or sub-groups. Let G be a countable group and
P = {p_g, g ∈ G} be a probability distribution on the group G. We consider a
Markov chain in which the space of states is the group G, and the transition
probability is p_{xy} = p_{yx^{-1}}, x ∈ G, y ∈ G. As in the usual lattice ℤᵈ we can
formulate a definition for the recurrence of a random walk and prove an anal-
ogous criterion. The proof of this criterion makes use of limit theorems for
random variables with values in a group G. In the case of elementary random
walks the answer depends substantially on the group G. For example if G is
taken to be the free group with two generators, a and b, and if the probability
distribution P is concentrated on the four points a, b, a⁻¹ and b⁻¹, then such
a random walk will always be transient.
There are interesting problems in connection with continuous groups. The
groups SL(n, ℝ) of matrices of order n with real elements, and determinant
1, arise particularly often. Special methods have been worked out to study
random walks on such groups.
Lecture 7. Branching Processes
Here we consider another example of Markov chains, which arises in the anal-
ysis of so-called multiple decay processes for elementary particles. The theory
of such processes was introduced in articles by the Soviet mathematicians
A. N. Kolmogorov and B. A. Sevast'yanov.
In the simplest case we assume that we have a system of particles of
one type, whose evolution is as follows: during a unit of time each particle
either dies or remains unchanged or splits into several particles of the same
type. We also assume that at time 0 one particle is present and that the whole
evolution is considered on the time interval [0, n]. As always we begin with the
construction of the space Ω^{(n)} of elementary outcomes. An individual outcome
ω^{(n)} ∈ Ω^{(n)} describes all consecutive transformations of the particles, and it
is convenient to represent it in the form of a genealogical tree. The next figure
represents an example of an ω^{(3)} for n = 3.
Fig. 7.1.
A circle with a cross means that the particle has died (disappeared). Since
each particle can split into any number of particles, Ω^{(n)} in general is count-
able. Genealogical trees are possible which end with one cross. In this case the
whole process stops. We then consider that the genealogical tree is extended
by extending each cross into another cross. The number n is called the num-
ber of generations. It therefore makes sense to talk of particles of the k-th
generation, 0 ≤ k ≤ n.
Let {p_k}₀^∞ be a probability distribution where each probability p_k gives the
probability of a particle splitting into k particles (k = 0 corresponds to death,
k = 1 corresponds to the intact preservation of the particle). The probability
distribution on Ω^{(n)} is constructed from the fact that the individual particles
split independently. Formally, this means the following: next to each particle x
of the m-th generation (a cross does not represent a particle), 0 ≤ m ≤ n − 1,
we write p(x) = p_k if the particle splits into k particles, k = 0, 1, 2, …. Next
to each cross we write 1. We now set
p(\omega^{(n)}) = \prod_{x \in \omega^{(n)}} p(x).
The fact that in the latter expression the product over all particles x up to
the (n − 1)-st generation (inclusive) appears, reflects the independence of the
splitting process for all particles.
where the notation ω^{(n+1)} → ω^{(n)} means that ω^{(n+1)} ∈ Ω^{(n+1)} was obtained
from ω^{(n)} ∈ Ω^{(n)}. We note now that p(ω^{(n+1)}) = p(ω^{(n)}) ∏ p(x), where the
last product is carried out over all particles of the n-th generation in ω^{(n)}.
If x is a cross then p(x) = 1 by our assumption. If x is a particle then it can
split into any number of particles. Therefore
\sum_{\omega^{(n+1)} \to \omega^{(n)}} p(\omega^{(n+1)}) = p(\omega^{(n)}) \prod_x \sum_{k=0}^{\infty} p_k = p(\omega^{(n)}).
We now obtain a simple criterion for the degeneracy of the process. Let
us introduce the quantity
m = \sum_{k=0}^{\infty} k \, p_k,
which represents the mean number of progeny of one particle. We have the
following theorem.
Theorem 7.1. Let p₁ < 1. The branching process is degenerate if and only if
m ≤ 1.
Remark. The statement of the theorem for m < 1 appears quite natural. The
fact that it holds for m = 1 is more surprising.
Lemma 7.2. If m ≤ 1 the graph of φ(z) intersects the bisectrix φ = z for
0 ≤ z ≤ 1 only at the point z = 1. If m > 1 there exists a z₀, 0 < z₀ < 1, such
that φ(z₀) = z₀.
Remark. It follows from our previous analysis that the point z₀ is unique.
Proof. We first show that for m ≤ 1 the point z₀ does not exist. Assume that
this is not the case, and that for some z₀ < 1 we have φ(z₀) = z₀. By Rolle's
Theorem there exists ξ such that φ(1) − φ(z₀) = 1 − z₀ = φ′(ξ)(1 − z₀), i.e.
φ′(ξ) = 1. But the function φ″(z) > 0 for 0 < z ≤ 1 (we do not consider the
Proof. We have
\phi^{(n+1)}(z) = \sum_{\omega^{(n+1)}} p(\omega^{(n+1)}) \, z^{\nu^{(n+1)}(\omega^{(n+1)})} = \sum_{\omega^{(n)} \in \Omega^{(n)}} \sum_{\omega^{(n+1)} \to \omega^{(n)}} p(\omega^{(n)}) \prod_x p(x) z^{\varepsilon(x)}.
Here we use the same relation between p(ω^{(n)}) and p(ω^{(n+1)}) as before; the
product ∏_x p(x)z^{ε(x)} is carried out over all particles of the n-th generation,
and ε(x) is equal to l only in the case where x has split into l particles.
Therefore ν^{(n+1)}(ω^{(n+1)}) = Σ_x ε(x), where the sum is carried out similarly,
over the particles of the n-th generation. In particular, if x is a cross, then
p(x) = 1 and ε(x) = 0. Furthermore
\sum_{\omega^{(n+1)} \to \omega^{(n)}} \prod_x p(x) z^{\varepsilon(x)} = \prod_x \sum_k p_k z^k = (\phi(z))^{\nu^{(n)}(\omega^{(n)})},
since ν^{(n)}(ω^{(n)}) is equal exactly to the number of particles of the n-th gen-
eration. By the same token
\phi^{(n+1)}(z) = \sum_{\omega^{(n)} \in \Omega^{(n)}} p(\omega^{(n)}) \cdot (\phi(z))^{\nu^{(n)}(\omega^{(n)})} = \phi^{(n)}(\phi(z)). \qquad \square
For further discussion another relation will be of use, which follows from
Lemma 7.3: φ^{(n+1)}(z) = φ(φ^{(n)}(z)). For n = 2 both relations agree since
φ^{(1)}(z) = φ(z). We again argue by induction over n. It follows from Lemma
7.3 and from the induction hypothesis, and again from Lemma 7.3, that
We now finish the proof of the theorem. We have already seen that q_n =
φ^{(n)}(0) ≤ q_{n+1} = φ^{(n+1)}(0) ≤ 1. Therefore lim_{n→∞} q_n = lim_{n→∞} φ^{(n)}(0) =
z̄₀ ≤ 1 exists. Since φ^{(n+1)}(0) = φ(φ^{(n)}(0)), then
\bar{z}_0 = \phi(\bar{z}_0),
i.e. z̄₀ = φ(z̄₀). We now show that z̄₀ is the smallest root of the equation
z₀ = φ(z₀). Indeed let z₀ be the smallest root. We do not consider the trivial
case where z₀ = 0, i.e. p₀ = 0. By the strong monotonicity of φ(z) we have
\phi(0) = p_0 < \phi(z_0) = z_0,
\phi^{(2)}(0) = \phi(\phi(0)) < \phi(\phi(z_0)) = \phi(z_0) = z_0,
and by induction on n
Problem. Let m < 1. Then according to Theorem 7.1 the branching process
is degenerate. Let us construct a new space of elementary outcomes Ω whose
points ω are the degenerate genealogical trees. For each ω the probability p(ω)
is defined as before. We denote by τ(ω) the number of generations in the tree
ω. Show that Eτ(ω) < ∞.
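The proof above suggests an algorithm: the extinction probability is the limit of the iteration q_{n+1} = φ(q_n), q₀ = 0, and equals the smallest root of z = φ(z). A sketch with invented offspring laws: for p₀ = 1/4, p₂ = 3/4 we have m = 3/2 > 1, and z = 1/4 + (3/4)z² has smallest root 1/3.

```python
def extinction_probability(p, n_iter=200):
    """Iterate q_{n+1} = phi(q_n) with q_0 = 0 for offspring law p = [p_0, p_1, ...]."""
    phi = lambda z: sum(pk * z ** k for k, pk in enumerate(p))
    q = 0.0
    for _ in range(n_iter):
        q = phi(q)
    return q

print(extinction_probability([0.25, 0.0, 0.75]))          # converges to 1/3
print(extinction_probability([0.5, 0.0, 0.5], 100_000))   # critical m = 1: tends to 1, slowly
```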
Lecture 8. Conditional Probabilities and
Expectations
P(A) = \sum_{i=1}^{r} P(A \mid C_i) P(C_i), \qquad (8.1)
which can be interpreted in the following way: first, the conditional measure of A is calculated with respect to each element C_i of our partition. The result is a function of i, i.e. a function on the space Ω/ξ of elements of the partition ξ, and the sum is the integral with respect to that space. Viewed in that way, the total probability formula reduces the calculation of a probability to a procedure such as the one used when writing a double integral as an iteration of two integrals. The advantage is that one can often obtain some information or other about the individual terms. The formula

  Eη = Σ_{i=1}^{r} E(η|C_i) P(C_i)                    (8.1')

is called the total expectation formula and has exactly the same meaning.
It is now already clear in which direction one should generalize (8.1) and (8.1'). Let ξ be an arbitrary, not necessarily finite, partition of Ω into non-intersecting sets. The elements of the partition ξ will be denoted by C_ξ, and the element of the partition which contains x will be denoted by C_ξ(x). Each partition ξ defines a σ-algebra F(ξ): a set C ∈ F(ξ) if and only if there exists a union C' of elements of the partition ξ such that P(C △ C') = 0 (in probability theory, as well as in general measure theory, events which differ by a subset of measure 0 are identified). The space whose points are the elements C_ξ of the partition ξ is called the quotient space Ω/ξ. Since F(ξ) ⊆ F, the probability distribution P induces a probability distribution on Ω/ξ. So to each partition ξ corresponds a new probability space (Ω/ξ, F(ξ), P). This construction is a natural generalization of the finite case.
Important Example of a Partition. Let {η_n} be a sequence of random variables, n = 0, 1, 2, .... Let us take two numbers n_1, n_2 (n_1 ≤ n_2) and let us fix the values of η_n, n_1 ≤ n ≤ n_2. We obtain a partition ξ_{n_1}^{n_2} in the following way: two points ω', ω'' belong to the same element of the partition ξ_{n_1}^{n_2} if and only if η_n(ω') = η_n(ω'') for all n_1 ≤ n ≤ n_2. One sometimes says that the partition ξ_{n_1}^{n_2} is connected with the behavior of the sequence η_n on the time interval [n_1, n_2]. We can also consider the partition ξ_{n_1} corresponding to the behavior of the sequence η_n for n ≥ n_1.

It is clear that we could consider finite collections of random variables η_1, ..., η_n and obtain partitions by fixing the values of some of the random variables, for example by fixing the values of η_{n_1+1}, ..., η_n where n = n_1 + n_2, that is, by fixing the values of the last n_2 random variables.
Now assume that we can introduce, on each C_ξ, a probability distribution P(·|C_ξ) defined on the σ-algebra F such that the total probability formula holds: for each A ∈ F

  P(A) = ∫_{Ω/ξ} P(A|C_ξ) dP.                    (8.2')
where A, B are Borel subsets of the real line. Taking B = ℝ¹ we obtain

  P(η_1 ∈ A) = ∫_A p_1(x) dx,

where p_1(x) = ∫_{−∞}^{∞} p(x, y) dy, i.e. η_1 has a distribution with density p_1(x) (from now on we assume that all interchanges of the order of integration are valid). In just the same way η_2 has a distribution with density p_2(y) = ∫_{−∞}^{∞} p(x, y) dx. Set

  q(x|y) = p(x, y)/p_2(y) for those x, y such that p_2(y) > 0, and 0 for other x, y.

Then q(x|y) ≥ 0 and ∫ q(x|y) dx = 1 for such y. In other words, q(x|y) is a probability density. For any Borel subset A ⊂ ℝ¹

  P(η_1 ∈ A | η_2 = y) = ∫_A q(x|y) dx.

This is formula (8.2). It shows that η_1, for a fixed value y of the random variable η_2, has density q(x|y). By unicity of the conditional distribution we have now found the conditional distribution of η_1 for fixed η_2. We represent p(x, y) in the form

  p(x, y) = p_2(y) · q(x|y),                    (8.3)

where p_2(y) is a probability density and q(x|y) is a probability density in x for every y such that p_2(y) > 0. If such a representation is obtained for a given distribution then, by the same token, we have constructed the system of conditional probabilities.
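Representation (8.3) is easy to check numerically. The sketch below is an illustration added here (not from the text), with an assumed bivariate normal joint density p(x, y) with correlation r; it computes p_2(y_0) and the conditional density q(x|y_0) on a grid, verifies that q integrates to 1, and, for this particular p, recovers the conditional mean r·y_0.

```python
import math

# Assumed joint density (my example): standard bivariate normal, correlation r.
r = 0.6
def p(x, y):
    z = (x*x - 2*r*x*y + y*y) / (1 - r*r)
    return math.exp(-z/2) / (2*math.pi*math.sqrt(1 - r*r))

# Grid over x for a fixed y0; crude midpoint Riemann sums stand in for integrals.
y0, a, b, n = 1.0, -8.0, 8.0, 4000
dx = (b - a) / n
xs = [a + (i + 0.5)*dx for i in range(n)]

p2_y0 = sum(p(x, y0) for x in xs) * dx          # p2(y0) = ∫ p(x, y0) dx
q = [p(x, y0) / p2_y0 for x in xs]              # q(x|y0) = p(x, y0)/p2(y0)

mass = sum(q) * dx                              # ∫ q(x|y0) dx = 1
mean = sum(x*qx for x, qx in zip(xs, q)) * dx   # conditional mean, here r*y0
print(mass, mean)
```

The normalization ∫ q(x|y_0) dx = 1 holds by construction, which is exactly why q(·|y_0) qualifies as a conditional density.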
Relation (8.3) can be directly generalized to the case of several η_i. Namely, let n = n_1 + n_2 and let η_1, ..., η_n be n random variables whose joint distribution is given by the density p(t_1, ..., t_n). Assume that p is represented in the form

  p(x_1, ..., x_{n_1}, y_1, ..., y_{n_2}) = p_2(y_1, ..., y_{n_2}) · q(x_1, ..., x_{n_1} | y_1, ..., y_{n_2}),    (8.4)

where p_2 is a probability density and q(x_1, ..., x_{n_1} | y_1, ..., y_{n_2}) is a probability density in the variables x_1, ..., x_{n_1} for any y_1, ..., y_{n_2} such that p_2(y_1, ..., y_{n_2}) > 0. Then p_2 is the joint density of the random variables η_{n_1+1}, ..., η_n, and q(x_1, ..., x_{n_1} | y_1, ..., y_{n_2}) is the joint density of the random variables η_1, ..., η_{n_1} for fixed values y_1, ..., y_{n_2} of the random variables η_{n_1+1}, ..., η_n.
The subtlety in the last equation lies in the fact that on the left hand side we have a probability distribution given on the whole σ-algebra F, and on the right hand side a probability distribution given only on the σ-algebra F', i.e. a totally different object.

The existence of the conditional expectation is established by using the Radon–Nikodym Theorem. Specifically, ∫_A η(ω) dP is a σ-additive set function defined on F'. It is absolutely continuous with respect to P, since it follows from P(A) = 0 that ∫_A η(ω) dP(ω) = 0. Then by the Radon–Nikodym Theorem there exists a function g(ω) which is measurable with respect to F' and such that

  ∫_A η(ω) dP = ∫_A g(ω) dP,   A ∈ F'.
Let c = (2π)^{−n/2} √det A be the normalizing constant of the density, so that

  c ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} e^{−½(A(x−m), (x−m))} dx_1 ⋯ dx_n = 1,

and let A = S D S*, where S is an orthogonal matrix and D = diag(d_1, ..., d_n) with all d_k > 0. In the variables z defined by x − m = Sz the quadratic form becomes Σ_k d_k z_k². Then

  Eξ_i = c ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} x_i e^{−½(A(x−m), (x−m))} dx_1 ⋯ dx_n
       = c ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} (x_i − m_i) e^{−½(A(x−m), (x−m))} dx_1 ⋯ dx_n + m_i = m_i,

since the integral of the odd function (x_i − m_i) e^{−½(A(x−m), (x−m))} is equal to zero. Furthermore

  Cov(ξ_i, ξ_j) = E(ξ_i − m_i)(ξ_j − m_j)
       = c ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} (x_i − m_i)(x_j − m_j) e^{−½(A(x−m), (x−m))} dx_1 ⋯ dx_n
       = Σ_{k=1}^{n} s_{ik} s_{jk} d_k^{−1}.

Here we used the fact that ∫ z_k z_l e^{−½ Σ_j d_j z_j²} dz_1 ⋯ dz_n = 0 for k ≠ l, while for k = l the integration over z_k contributes a factor √(2π) d_k^{−3/2}. We now note that Σ_k s_{ik} d_k^{−1} s_{jk} is an element of the matrix S D^{−1} S* = S D^{−1} S^{−1} = A^{−1}. Thus the matrix A^{−1} has an immediate probabilistic meaning: its elements are the covariances of the random variables ξ_i, ξ_j. Let us assume that the covariances of the random variables ξ_i, ξ_j are equal to 0 for i ≠ j. This means that A^{−1} is a diagonal matrix. Then A is also a diagonal matrix, and p(x_1, ..., x_n) = p_1(x_1) ⋯ p_n(x_n) is the product of one-dimensional normal densities, i.e. the random variables ξ_1, ..., ξ_n are independent. Thus in the case of the multivariate normal distribution, if the covariances reduce to zero, independence follows.
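The identity Cov(ξ_i, ξ_j) = (A^{−1})_{ij} can be checked by simulation. In the NumPy sketch below (an illustration added here; the particular matrix A and mean vector m are assumptions), we sample from the normal law whose covariance is A^{−1} and compare the empirical mean and covariance with m and A^{−1}.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])              # a positive definite matrix A
m = np.array([1.0, -2.0])

cov = np.linalg.inv(A)                  # the density e^{-(A(x-m),(x-m))/2}
                                        # has covariance matrix A^{-1}
xs = rng.multivariate_normal(m, cov, size=200_000)

emp_mean = xs.mean(axis=0)              # ≈ m
emp_cov = np.cov(xs, rowvar=False)      # ≈ A^{-1}
print(emp_mean, emp_cov, sep="\n")
```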
In what follows it will be useful to obtain an expression for the characteristic function (the Fourier transform) of a multivariate normal distribution. By this we mean the integral

  φ(λ) = ∫ e^{i(λ, x)} p(x_1, ..., x_n) dx_1 ⋯ dx_n,   λ = (λ_1, ..., λ_n).

After the same change of variables x − m = Sz the computation reduces to one-dimensional integrals of the form

  (1/√(2π d^{−1})) ∫ e^{iμz} e^{−½ d z²} dz = e^{−μ²/(2d)}.

This gives

  ∫ e^{i(λ, x−m)} p(x) dx_1 ⋯ dx_n = e^{−½(S D^{−1} S^{−1} λ, λ)} = e^{−½(A^{−1} λ, λ)}.

Finally we obtain

  φ(λ) = e^{i(λ, m) − ½(A^{−1} λ, λ)}.
We now find the conditional distribution of the random variables ξ_1, ..., ξ_{n_1} conditional on fixed values y_1, ..., y_{n_2} of the random variables η_1, ..., η_{n_2}. We first assume that m^(1) = 0 and m^(2) = 0. We have (9.2), and also the inequality A^(3) = A^(2) + B (A^(1))^{−1} B* > 0. The main conclusion from equation (9.2) for probability theory is that the conditional distribution of the random variables ξ_1, ..., ξ_{n_1} for fixed values y_1, ..., y_{n_2} of the random variables η_1, ..., η_{n_2} is a normal distribution with vector of expectations Ly = −(A^(1))^{−1} B* y and matrix A^(1). In other words, the dependence on the conditions manifests itself only in the vector of conditional expectations.

The previous calculations hold for the case where m^(1) = 0, m^(2) = 0. In order to obtain the result for arbitrary m^(1), m^(2) it suffices to replace x, y by x − m^(1), y − m^(2) in the final answer. We will not go into further details here.
We now return to the original distribution (9.1). Let us assume that m = 0. A system of polynomials, which we will call the Hermite polynomials, although as a rule this terminology is in use only for n = 1, is closely connected with the density

  p(x) = (2π)^{−n/2} √det A · e^{−½(Ax, x)}.

Let us consider an arbitrary monomial x_1^{k_1} x_2^{k_2} ⋯ x_n^{k_n} of degree k = k_1 + k_2 + ⋯ + k_n. A Hermite polynomial for this monomial is a polynomial x_1^{k_1} x_2^{k_2} ⋯ x_n^{k_n} + Q(x), where Q(x) is a polynomial of degree less than k, such that

  ∫ (x_1^{k_1} x_2^{k_2} ⋯ x_n^{k_n} + Q(x)) Q_1(x) e^{−½(Ax, x)} dx = 0

for any polynomial Q_1(x) of degree less than k. The following notation is often used for the Hermite polynomial:

  x_1^{k_1} x_2^{k_2} ⋯ x_n^{k_n} + Q(x) = :x_1^{k_1} ⋯ x_n^{k_n}:.

If P(x) is a homogeneous polynomial of degree k, its Hermite polynomial :P(x): is defined by linearity. The number k is the degree of the Hermite polynomial.

Lemma 9.1.

  e^{½(Ax, x)} · ∂^k e^{−½(Ax, x)} / ∂x_1^{k_1} ⋯ ∂x_n^{k_n} = P(x),
1982 in which a lot of important problems in physics are described which are more or less connected with the property of conductivity.

We denote by Ω the space of all possible ω, and by Ω^(perc) the subset of those ω ∈ Ω for which percolation from top to bottom occurs (in the site or bond percolation problem). We assume, on a given Ω, a probability distribution which corresponds to a sequence of independent identical trials. This means that we are given two numbers, p = p(1) ≥ 0, q = 1 − p = p(0) ≥ 0, and that we set p(ω) = Π_x p(ω_x) (in the site percolation case) and p(ω) = Π_b p(ω_b) (in the bond percolation case). Then P^(perc) = P(Ω^(perc)) = Σ_{ω ∈ Ω^(perc)} p(ω) is the probability of percolation from top to bottom, and it is a function of L and p.

We now say that percolation from top to bottom occurs for a given p if lim_{L→∞} P^(perc) = 1. The main problem is then to find the structure of the set of p's for which percolation occurs. The conclusion, which is difficult to prove, is that this set is an interval of the form (p_c, 1], and it is not a simple task to find the critical probability p_c beyond which percolation takes place.
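The dependence of the crossing probability on p is easy to explore by simulation. The following Python sketch is an illustration added here, not part of the text: it estimates the probability of top-to-bottom site percolation through an L × L box by breadth-first search over the open sites; L, the trial counts and the sample values of p are all assumptions.

```python
import random
from collections import deque

def percolates(L, p, rng):
    """Top-to-bottom crossing through open sites of an L x L grid."""
    open_site = [[rng.random() < p for _ in range(L)] for _ in range(L)]
    seen = [[False]*L for _ in range(L)]
    dq = deque((0, j) for j in range(L) if open_site[0][j])
    for _, j in dq:
        seen[0][j] = True
    while dq:                            # BFS from the top row
        i, j = dq.popleft()
        if i == L - 1:
            return True
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            a, b = i + di, j + dj
            if 0 <= a < L and 0 <= b < L and open_site[a][b] and not seen[a][b]:
                seen[a][b] = True
                dq.append((a, b))
    return False

rng = random.Random(0)
L, trials = 40, 200
for p in (0.45, 0.59, 0.75):
    hits = sum(percolates(L, p, rng) for _ in range(trials))
    print(p, hits / trials)   # small well below p_c ≈ 0.59, near 1 above it
```

The sharp change of the crossing frequency around p ≈ 0.59 is the finite-size shadow of the critical probability discussed above.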
From now on we deal with site percolation. Our goal is to prove the following relatively simple statement.

Theorem 10.1. There exists a p_c^(1) < 1 such that for all p ≥ p_c^(1) percolation with respect to 1 from top to bottom and from left to right takes place.

Proof. Let ω = {ω_x} ∈ Ω and let us first make the following geometric construction. For each x ∈ V where ω_x = 0, let us construct a closed square Q(x) with smoothed corners (see Fig. 10.1), centered at this point and with sides of length 3.
Fig. 10.1. (The square Q(x) centered at the point x = (x_1, x_2).)  Fig. 10.3.
We set Q(ω) = ∪_{x : ω_x = 0} Q(x) and by convention we call Q(ω) the domain occupied by 0. Such a domain Q(ω) can be split into its maximal connected components Q_1, Q_2, ..., Q_r. We recall that this means that each component Q_i is a connected set, i.e. any two of its points can be joined by a path which is contained in that set, and that Q_i cannot be enlarged without losing this property. Therefore the Q_i do not intersect. Any two components either lie outside one another, or one component lies in the interior part (the interior) of another component (see Fig. 10.2, where Q is the shaded domain). A component Q_i is said to be exterior if it is not contained in the interior region of another component. Let us denote by V^(0)(ω) the set of points x = (x_1, x_2) which lie outside all exterior components, and therefore outside all components Q_i, 1 ≤ i ≤ r. The following geometric statement is quite clear, and its proof will be left to the reader: the set V^(0)(ω) is connected, i.e. any two of its points can be connected by a path. This is due to the fact that V^(0)(ω) is obtained by removing the exterior components one after the other, as well as their interior regions, and each such operation preserves connectedness.

It follows from connectedness that percolation from top to bottom with respect to 1 will occur for ω if V^(0)(ω) intersects the lines x_2 = L and x_2 = −L. We now show that the probability of each of these events converges to 1 as L → ∞ if p is sufficiently large. It is sufficient to limit ourselves to the case of the line x_2 = L. The probability of percolation from left to right is found in the same way.
Let Q be a given connected set. We introduce the indicator function χ_Q(ω), which takes the value 1 if Q is one of the components Q_i and 0 otherwise. Consider a point of the form (x_1, L) ∈ V. We will call this point interior (for the configuration ω) if there exists a component Q_i such that (x_1, L) is an element of Q_i or of its interior. It is clear that the points which are not interior belong to V^(0)(ω). Furthermore

  P^(int)(x_1) = Σ_{ω : (x_1, L) interior} p(ω) = Σ_{ω ∈ Ω} Σ_Q p(ω) χ_Q(ω) = Σ_Q π_Q,  with  π_Q = Σ_{ω ∈ Ω} p(ω) χ_Q(ω),    (10.1)

where the last sum is carried out over all components Q such that (x_1, L) belongs to Q or to the interior of Q.
The proof of these lemmas will be carried out a little later. We introduce the lexicographic ordering on the points of the plane. This means that (x_1, x_2) < (x_1', x_2') if either x_1 < x_1', or x_1 = x_1' and x_2 < x_2'. Given a connected set Q we define its origin to be the smallest point in the sense of the lexicographic ordering.

Lemma 10.2. The number of connected sets Q with |Q| = n which have their origin at the point (x_1, x_2) is not greater than 16^n.

The proof of this lemma will be carried out a little later. We now find an estimate of P^(int). By Lemma 10.1 and (10.1) we have
  P^(int) ≤ Σ_Q (c_1 q)^{c_2 |Q|},

where the sum is carried out over all components Q which contain the point (x_1, L) in Q or in its interior. Furthermore, from Lemma 10.2 we have

  P^(int) ≤ Σ_{n ≥ 1} n² · 16^n · (c_1 q)^{c_2 n}.

Here we used the fact that for a fixed point (x_1, L) and a fixed value |Q| = n, the origin of Q can be one of at most n² points.

Let p be large enough and q = 1 − p be small enough so that (c_1 q)^{c_2} · 16 < 1. Then the latter series converges. Let us denote its sum by d(p). It is clear that d(p) → 0 as p → 1.
We introduce the random variable ξ(ω) equal to the number of interior points on the line x_2 = L, and the random variables η_{x_1}(ω), where η_{x_1}(ω) = 1 if the point (x_1, L) is interior for the configuration ω and η_{x_1}(ω) = 0 otherwise. We clearly have ξ(ω) = Σ_{x_1 = −L}^{L} η_{x_1}(ω), Eη_{x_1}(ω) = P^(int)(x_1), and

  Eξ = Σ_{x_1 = −L}^{L} P^(int)(x_1) ≤ (2L + 1) d(p).

This inequality shows that the average number of interior points on the line x_2 = L is relatively small. The latter statement is not yet sufficient to prove the theorem. We also need to estimate the variance Var ξ. We have

  Var ξ = E( Σ_{x_1} (η_{x_1} − Eη_{x_1}) )² = Σ_{x_1, x_2} E(η_{x_1} − Eη_{x_1})(η_{x_2} − Eη_{x_2}).    (10.2)

Furthermore

  η_{x_1}(ω) = Σ_Q χ_Q(ω),   P^(int)(x_1) = Σ_Q π_Q,

where the summation in both cases is carried out over the components Q for which (x_1, L) belongs to Q or to the interior of Q; see (10.1) for the definition of π_Q. Therefore
we now note, and this is the main probabilistic part of the proof, that the random variables χ_{Q_{x_1}} − π_{Q_{x_1}} and χ_{Q_{x_2}} − π_{Q_{x_2}} are independent if Q_{x_1} ∩ Q_{x_2} = ∅. Therefore

  E(χ_{Q_{x_1}}(ω) − π_{Q_{x_1}})(χ_{Q_{x_2}}(ω) − π_{Q_{x_2}}) = E(χ_{Q_{x_1}}(ω) − π_{Q_{x_1}}) · E(χ_{Q_{x_2}}(ω) − π_{Q_{x_2}}) = 0,

and therefore in (10.2) we can perform the summation only with respect to those components Q_{x_1} and Q_{x_2} to which both x_1 and x_2 are interior. We denote such a summation by Σ'. Since two components either agree or do not intersect, Eχ_{Q_{x_1}}(ω) χ_{Q_{x_2}}(ω) = 1 for Q_{x_1} = Q_{x_2} and 0 for Q_{x_1} ≠ Q_{x_2}. The latter expression therefore reduces to a sum over pairs of points x_1, x_2 interior to one and the same component; estimating it as before, with the help of Lemmas 10.1 and 10.2, one finds that it does not exceed const · (2L + 1), so that Var ξ = o(L²) as L → ∞. Since Eξ ≤ (2L + 1) d(p), Chebyshev's inequality gives

  P( ξ ≥ (2L + 1)/2 ) ≤ Var ξ / ( (2L + 1)(1/2 − d(p)) )² → 0

as L → ∞. Let p be large enough so that 2d(p) < 1/2. Then with probabilities converging to 1 as L → ∞, more than one half of the points on the line x_2 = L are non-interior, i.e. belong to V^(0)(ω). □
Proof of Lemma 10.1. Assume that a set Q is given and that values ω_x are fixed for x ∈ Q. The probability of such an event is equal to p^k q^{|Q|−k}, where k is the number of ones and |Q| − k is the number of zeroes. We now note that for a given set Q the number of subsets of Q where a zero appears is less than or equal to the total number of all subsets of Q, i.e. 2^{|Q|}. Furthermore, near each one, at a distance less than or equal to √2, lies a 0, since Q is the union of squares (such as in Fig. 10.1) at the center of each of which lies a 0. Therefore |Q| − k ≥ (1/9)|Q|. Finally we obtain

  π_Q ≤ 2^{|Q|} · q^{(1/9)|Q|} = (c_1 q)^{(1/9)|Q|}.

But this is the statement of the lemma for c_1 = 2^9, c_2 = 1/9. □
Proof of Lemma 10.2. We show that to each connected set Q we can associate a word (i_1, i_2, ..., i_n) of length n, where each i_s takes at most 16 values, in such a way that Q is uniquely determined by this word.

Let x be a point in Q. We consider the set R of its neighbors, i.e. those points y such that d(x, y) ≤ 1 and y ∈ Q. The number of such subsets is no greater than the number of all subsets of a set with four elements (the set of all neighbors), which equals 2^4 = 16, so R can be indexed by a number from 1 to 16.

We now consider the point x = (x_1, x_2) which is the origin of Q, and the set R_1 of its neighbors which belong to Q. We then consider the smallest point in R_1 which is distinct from x, and we denote by R_2 the set of its neighbors. Then we consider the smallest of the remaining points, and denote by R_3 the set of its neighbors, and so forth. In this fashion we obtain a word (R_1, R_2, ..., R_n) of length n in which each R_s takes at most 16 values. It is clear that Q is uniquely determined by this word. □
For the bond percolation problem the exact value of the critical probability is known to be equal to 1/2. This was proved rigorously relatively recently by the American mathematician H. Kesten (see H. Kesten, "Percolation Theory for Mathematicians", Birkhäuser, 1982). Many numerical experiments show that for the site percolation problem on a two-dimensional square lattice, p_c ≈ 0.58. The Japanese mathematician Y. Higuchi showed that p_c > 1/2. Recently a Hungarian mathematician, B. Tóth, proved by elementary methods that p_c > 0.503. We also quote the result of the Soviet mathematician L. Mitiushin, which shows that in the case of the lattice ℤ^d in a space of sufficiently high dimension d, for the site percolation problem, there exists an interval of values of p which contains the value p = 1/2 in its interior, for which percolation is possible both with respect to 1 and with respect to 0.
Lecture 11. Distribution Functions, Lebesgue Integrals and Mathematical Expectation

11.1 Introduction

In this lecture we introduce once more the fundamental concepts of probability theory, which we have encountered already, but in a much more general form than previously.

We define a probability space as a triplet (Ω, F, P), where Ω is the space of elementary outcomes, F is a σ-algebra of subsets of Ω and P is a probability measure defined on the events C ∈ F.

The choice of the σ-algebra F plays an important role in many problems, in particular in those problems which are connected to the theory of stochastic processes. In the cases where Ω is a complete separable metric space it is usual to take for F the Borel σ-algebra, that is, the smallest σ-algebra which contains all the open balls, i.e. the sets of the form {ω | ρ(ω, ω_0) < r} = B_r(ω_0). Such a σ-algebra is denoted by B(Ω).

Let ξ = f(ω) be a real-valued function on Ω.

Definition 11.1. ξ is said to be a random variable if for any Borel set A ⊂ ℝ¹

  {ω | f(ω) ∈ A} = f^{−1}(A) ∈ F.
2) ∪_i C_i = Ω.
1) F_ξ(t_1) ≤ F_ξ(t_2) for t_1 < t_2. Indeed A_{t_1} ⊂ A_{t_2} for t_1 < t_2 and therefore P_ξ(A_{t_1}) ≤ P_ξ(A_{t_2}).

2) lim_{t→∞} F_ξ(t) = 1. Indeed, take a monotone sequence {t_i} with t_i → ∞ as i → ∞. Then the {A_{t_i}} form a monotonically increasing sequence of subsets of the line and ∪_i A_{t_i} = ℝ¹. Therefore the f^{−1}(A_{t_i}) form a monotonically increasing sequence of subsets of Ω with ∪_i f^{−1}(A_{t_i}) = Ω. It follows from the σ-additivity of the probability P that

  lim_{i→∞} F_ξ(t_i) = lim_{i→∞} P_ξ(A_{t_i}) = lim_{i→∞} P(f^{−1}(A_{t_i})) = 1.

3) F_ξ(t) is continuous from the right. Indeed, let t_i ↓ t̄. Then the A_{t_i} decrease and ∩_i A_{t_i} = A_{t̄}. Therefore from the σ-additivity of probabilities we have

  lim_{i→∞} F_ξ(t_i) = lim_{i→∞} P(f^{−1}(A_{t_i})) = P(f^{−1}(A_{t̄})) = F_ξ(t̄).

4) For any t̄ the probability P(ξ = t̄) equals F_ξ(t̄) − lim_{t_i→t̄−0} F_ξ(t_i) for any sequence {t_i}, t_i ↑ t̄. Indeed ∪_i A_{t_i} = (−∞, t̄). Therefore

  lim_{i→∞} F_ξ(t_i) = P(ξ < t̄),
  F_ξ(t̄) − lim_{t_i→t̄−0} F_ξ(t_i) = P(ξ ≤ t̄) − P(ξ < t̄) = P(ξ = t̄).
Fig. 11.1.
Theorem 11.2. For any random variables ξ_1 = f_1(ω), ..., ξ_p = f_p(ω) the function η = g(ξ_1, ..., ξ_p) is also a random variable.

Corollary 11.1. If ξ_1, ..., ξ_p are random variables and a_1, ..., a_p are real numbers, then η = a_1 ξ_1 + ⋯ + a_p ξ_p is a random variable.

Corollary 11.2. If ξ_1, ..., ξ_p are random variables then η = ξ_1 × ⋯ × ξ_p is also a random variable.

Corollary 11.3. If ξ_1 and ξ_2 are random variables, with ξ_2 ≠ 0 for every ω, then η = ξ_1/ξ_2 is also a random variable.
takes the value a_{j_1}^(1) and ξ_2 takes the value a_{j_2}^(2). We now have

  Eξ_1 − Eξ_2 = Σ_{j_1, j_2} (a_{j_1}^(1) − a_{j_2}^(2)) P(B_{j_1, j_2}) ≥ 0,

since a_{j_1}^(1) ≥ a_{j_2}^(2) on any B_{j_1, j_2}.
Theorem 11.3. The value of lim_{n→∞} Eξ_n does not depend on the choice of the approximating sequence.

Proof. We first establish the following lemma.

Lemma 11.1. Let η ≥ 0 be a simple random variable, η ≤ ξ, and assume that ξ = lim_{n→∞} ξ_n, where the ξ_n are non-negative simple random variables, ξ_{n+1} ≥ ξ_n. Then Eη ≤ lim_{n→∞} Eξ_n.

Proof. Let us take an arbitrary ε > 0 and set C_n = {ω | ξ_n(ω) − η(ω) > −ε}. It then follows from the monotonicity of ξ_n that C_n ⊆ C_{n+1}, and it follows from the fact that ξ_n ↑ ξ and from the inequality ξ ≥ η that ∪_n C_n = Ω. Therefore P(C_n) → 1 as n → ∞. We introduce the indicator function χ_{C_n}, where

  χ_{C_n}(ω) = 1 for ω ∈ C_n, 0 for ω ∉ C_n.

Then η_n = η · χ_{C_n} is a simple random variable and η_m ≤ ξ_m(ω) + ε. Therefore by the monotonicity of Eξ_m we have

  Eη_m ≤ Eξ_m + ε ≤ lim_{m→∞} Eξ_m + ε.

Since ε was arbitrary we obtain Eη_m ≤ lim_{m→∞} Eξ_m. We still must prove that lim_{k→∞} Eη_k = Eη.

We denote by b_1, b_2, ... the values of the random variable η and by B_i the set where the value b_i is taken, i = 1, 2, .... Then

  Eη_k = Σ_i b_i P(B_i ∩ C_k) → Σ_i b_i P(B_i) = Eη  as k → ∞,

since P(C_k) → 1, and hence Eη ≤ lim_{m→∞} Eξ_m. □
It is now easy to obtain the independence of lim_{n→∞} Eξ_n from the choice of the approximating sequence. Let there be two sequences {ξ_n^(1)}, ξ_{n+1}^(1) ≥ ξ_n^(1), and {ξ_n^(2)}, ξ_{n+1}^(2) ≥ ξ_n^(2), such that lim_{n→∞} ξ_n^(1) = lim_{n→∞} ξ_n^(2) = ξ for every ω. It follows from Lemma 11.1 that for any k, Eξ_k^(1) ≤ lim_{n→∞} Eξ_n^(2), and therefore lim_{n→∞} Eξ_n^(1) ≤ lim_{n→∞} Eξ_n^(2). By exchanging ξ_n^(1) and ξ_n^(2) we obtain the opposite inequality, i.e. lim_{n→∞} Eξ_n^(1) = lim_{n→∞} Eξ_n^(2).

The limit lim_{n→∞} Eξ_n is called the mathematical expectation of the random variable ξ.

Definition 11.6. The random variable ξ has a finite mathematical expectation if Eξ_1 < ∞ and Eξ_2 < ∞. In this case it is equal to Eξ = Eξ_1 − Eξ_2. If Eξ_1 = ∞ and Eξ_2 < ∞ (Eξ_1 < ∞, Eξ_2 = ∞) then Eξ = ∞ (Eξ = −∞). If Eξ_1 = ∞ and Eξ_2 = ∞ then Eξ is not defined.

Since |ξ| = ξ_1 + ξ_2, E|ξ| = Eξ_1 + Eξ_2, and so Eξ is finite if and only if E|ξ| is finite.
The mathematical expectation Eξ that we have introduced satisfies all the properties described in Lecture 1. In particular

1) E(a_1 ξ_1 + a_2 ξ_2) = a_1 Eξ_1 + a_2 Eξ_2 if Eξ_1 and Eξ_2 are finite;
2) Eξ ≥ 0 if ξ ≥ 0;
3) E1 = 1.

One sometimes says that Eξ is a non-negative normed linear functional on the linear space of random variables. The space of random variables ξ for which E|ξ| < ∞ is denoted by L¹(Ω, F, P).

The variance of the random variable ξ is defined to be E(ξ − Eξ)², the n-th order moment is defined to be Eξ^n, and the n-th order centralized moment is defined to be E(ξ − Eξ)^n. Given two random variables ξ_1 and ξ_2 their covariance is defined by Cov(ξ_1, ξ_2) = E(ξ_1 − Eξ_1)(ξ_2 − Eξ_2). The correlation coefficient of two random variables ξ_1, ξ_2 is defined to be ρ(ξ_1, ξ_2) = E(ξ_1 − Eξ_1)(ξ_2 − Eξ_2)/√(Var ξ_1 × Var ξ_2).

The space of random variables ξ for which E|ξ|^p < ∞ is denoted by L^p(Ω, F, P).

It is an essential fact in probability theory that the mathematical expectation, the moments, the variance and so forth of a random variable ξ can be found if the measure P_ξ is known or if the distribution function F_ξ is known.
If P_ξ has a density p_ξ(x) then the latter integral can be written as a Lebesgue integral Eη = ∫_{−∞}^{∞} g(x) p_ξ(x) dx.

Proof of theorem. As usual we first consider the case when g ≥ 0 and takes a finite or countable number of values g_1, g_2, .... Let C_i = {x | g(x) = g_i} ∈ B(ℝ¹), A_i = f^{−1}(C_i) ∈ F. Then the random variable η takes the values g_1, g_2, ... and by definition Eη = Σ g_i P(A_i). On the other hand, also by definition, P(A_i) = P_ξ(C_i) and Σ g_i P_ξ(C_i) = ∫ g(x) dP_ξ(x). Therefore the theorem is proved for simple non-negative g. For an arbitrary non-negative g the result follows by passage to the limit in a sequence of simple non-negative functions. In the general case we write g = g_1 − g_2 where g_1, g_2 ≥ 0 and we use the statement of the theorem separately for g_1 and g_2. □
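The theorem can be illustrated numerically: for a random variable with a density, Eη = Eg(ξ) can be computed either as the integral ∫ g(x) p_ξ(x) dx or by averaging g over samples of ξ. The sketch below is an example added here (ξ standard normal and g(x) = x² are assumptions); both routes should give Eξ² = 1.

```python
import math
import random

g = lambda x: x * x                     # E g(ξ) = Eξ² = 1 for ξ ~ N(0, 1)
p_xi = lambda x: math.exp(-x*x/2) / math.sqrt(2*math.pi)

# ∫ g(x) p_ξ(x) dx by a midpoint Riemann sum on [-8, 8].
n, a, b = 200_000, -8.0, 8.0
dx = (b - a) / n
integral = sum(g(a + (i + 0.5)*dx) * p_xi(a + (i + 0.5)*dx)
               for i in range(n)) * dx

# E g(ξ) as a sample mean.
rng = random.Random(5)
mc = sum(g(rng.gauss(0.0, 1.0)) for _ in range(100_000)) / 100_000
print(integral, mc)                     # both close to 1
```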
Definition 12.1. The random variables ξ_1, ξ_2, ..., ξ_n are said to be jointly independent if for any Borel sets A_1, A_2, ..., A_n

  P(ξ_1 ∈ A_1, ..., ξ_n ∈ A_n) = P(ξ_1 ∈ A_1) ⋯ P(ξ_n ∈ A_n).

Lemma 12.1. Let g_1(x), ..., g_n(x) be real-valued measurable functions on X. Then the random variables η_1 = g_1(ξ_1), η_2 = g_2(ξ_2), ..., η_n = g_n(ξ_n) are jointly independent.

Proof. Let A_i, 1 ≤ i ≤ n, be Borel subsets of the line and let A_i' be defined as g_i^{−1}(A_i), i.e. A_i' = {x ∈ X | g_i(x) ∈ A_i}. Then

  P(η_1 ∈ A_1, ..., η_n ∈ A_n) = P(ξ_1 ∈ A_1', ..., ξ_n ∈ A_n') = Π_{i=1}^n P(ξ_i ∈ A_i') = Π_{i=1}^n P(η_i ∈ A_i). □
Lemma 12.2. Under the conditions of Lemma 12.1, assume that the random variables η_1, ..., η_m are such that |Eη_i| < ∞, 1 ≤ i ≤ m. Then the random variable η = η_1 × ⋯ × η_m has a finite mathematical expectation and Eη = Eη_1 × ⋯ × Eη_m.

Proof. If |Eη_i| < ∞ then by the definition of the Lebesgue integral we have E|η_i| < ∞, 1 ≤ i ≤ m. Let

  χ⁺(t) = 1 for t ≥ 0 and 0 for t < 0;   χ⁻(t) = 1 for t < 0 and 0 for t ≥ 0.

Then χ⁺ + χ⁻ = 1 and η_i = η_i(χ⁺(η_i) + χ⁻(η_i)) = η_i χ⁺(η_i) + η_i χ⁻(η_i) = η_{i,+} − η_{i,−}, where η_{i,+} = η_i χ⁺(η_i) ≥ 0 and η_{i,−} = −η_i χ⁻(η_i) ≥ 0. Substituting these expressions into η and multiplying out, we obtain η as a linear combination of products of the form η_{1,ε_1} × ⋯ × η_{m,ε_m}, where the ε_i take the values + or −. In each such product the η_{1,ε_1}, ..., η_{m,ε_m} form a collection of jointly independent random variables, as follows from Lemma 12.1. Therefore it suffices to prove the lemma for non-negative random variables.

So let η_i, 1 ≤ i ≤ m, be non-negative. We introduce the functions g_i^(n)(x), where g_i^(n)(x) = k 2^{−n} if k 2^{−n} ≤ g_i(x) < (k + 1) 2^{−n}. Then g_i^(n)(x) ↑ g_i(x) as n → ∞. Let us form the random variables η_i^(n) = g_i^(n)(ξ_i), 1 ≤ i ≤ m. By Lemma 12.1 they are jointly independent. In addition η_i^(n) ↑ η_i, and η^(n) = Π_{i=1}^m η_i^(n) ↑ η as n → ∞. By definition of the Lebesgue integral we have Eη^(n) → Eη and Eη_i^(n) → Eη_i. Furthermore

  Eη^(n) = Σ_{k_1, ..., k_m ≥ 0} Π_{i=1}^m k_i 2^{−n} · Π_{i=1}^m P({ω | η_i^(n) = k_i 2^{−n}})
         = Π_{i=1}^m Σ_{k_i ≥ 0} k_i 2^{−n} P({ω | η_i^(n) = k_i 2^{−n}}) = Π_{i=1}^m Eη_i^(n) → Π_{i=1}^m Eη_i. □
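Lemma 12.2 is easy to test by simulation. Below is a minimal sketch added here (the particular distributions are assumptions): for independent η_1 uniform on [0, 2] and η_2 exponential with mean 1, the sample mean of η_1 η_2 is compared with the product of the sample means.

```python
import random

rng = random.Random(42)
n = 200_000
xs = [rng.uniform(0.0, 2.0) for _ in range(n)]   # Eη_1 = 1
ys = [rng.expovariate(1.0) for _ in range(n)]    # Eη_2 = 1

mean_of_product = sum(x * y for x, y in zip(xs, ys)) / n
product_of_means = (sum(xs) / n) * (sum(ys) / n)
print(mean_of_product, product_of_means)         # both close to 1
```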
The following lemma will be necessary later for the proof of Kolmogorov's inequality.

Lemma 12.3. Let ξ_1, ..., ξ_n be real-valued jointly independent random variables, and let g(x_1, ..., x_k) be a Borel function of k variables. Then η_1 = g(ξ_1, ..., ξ_k), ξ_{k+1}, ..., ξ_n form a collection of jointly independent random variables.

Proof. Let A, A_{k+1}, ..., A_n be Borel subsets of the line. We need to prove that P(η_1 ∈ A, ξ_{k+1} ∈ A_{k+1}, ..., ξ_n ∈ A_n) = P(η_1 ∈ A) Π_{i=k+1}^n P(ξ_i ∈ A_i). We introduce A' = {(x_1, ..., x_k) | g(x_1, ..., x_k) ∈ A}. Then A' is a Borel subset of ℝ^k and we need to prove that

  P((ξ_1, ..., ξ_k) ∈ A', ξ_{k+1} ∈ A_{k+1}, ..., ξ_n ∈ A_n) = P((ξ_1, ..., ξ_k) ∈ A') Π_{i=k+1}^n P(ξ_i ∈ A_i).
Definition 12.2. Let {ξ_1, ..., ξ_n, ...}, where ξ_i = f_i(ω), be a sequence of real-valued random variables. We say that this sequence converges to ξ = f(ω):
- in probability (in measure) if lim_{n→∞} P(|ξ_n − ξ| > ε) = 0 for any ε > 0;
- almost surely if lim_{n→∞} f_n(ω) = f(ω) for almost all ω.

Theorem 12.1 (Khinchin). A sequence of jointly independent identically distributed random variables ξ_i with finite mathematical expectation satisfies the Law of Large Numbers.
  A = ∩_{k=1}^∞ ∪_{n=k}^∞ A_n.  □

Lemma 12.5 (Second Borel–Cantelli). Let {A_n} be a sequence of jointly independent events on a probability space with Σ P(A_n) = ∞, and let A = {ω | there exists an infinite sequence n_i(ω) such that ω ∈ A_{n_i}}; then P(A) = 1.

Proof. We have Ā = ∪_{k=1}^∞ ∩_{n=k}^∞ Ā_n. Then P(Ā) ≤ Σ_{k=1}^∞ P(∩_{n=k}^∞ Ā_n). By independence of the A_n we have the independence of the Ā_n, and therefore P(∩_{n=k}^∞ Ā_n) = Π_{n=k}^∞ (1 − P(A_n)) = 0, since Σ_{n=1}^∞ P(A_n) = ∞. □
Proof. Let us introduce the events C_k = {ω | |(ξ_1 + ⋯ + ξ_i) − (m_1 + ⋯ + m_i)| < t for 1 ≤ i < k, |(ξ_1 + ⋯ + ξ_k) − (m_1 + ⋯ + m_k)| ≥ t} and C = ∪_{k=1}^n C_k. It is clear that C is the event whose probability appears in Kolmogorov's inequality, and that the C_k do not intersect. We have

  Var(ξ_1 + ⋯ + ξ_n) ≥ ∫_C ((ξ_1 + ⋯ + ξ_n) − (m_1 + ⋯ + m_n))² dP
    = Σ_{k=1}^n ∫_{C_k} ((ξ_1 + ⋯ + ξ_n) − (m_1 + ⋯ + m_n))² dP
    = Σ_{k=1}^n [ ∫_{C_k} ((ξ_1 + ⋯ + ξ_k) − (m_1 + ⋯ + m_k))² dP
      + 2 ∫_{C_k} ((ξ_1 + ⋯ + ξ_k) − (m_1 + ⋯ + m_k)) ((ξ_{k+1} + ⋯ + ξ_n) − (m_{k+1} + ⋯ + m_n)) dP
      + ∫_{C_k} ((ξ_{k+1} + ⋯ + ξ_n) − (m_{k+1} + ⋯ + m_n))² dP ].

The last integral is non-negative. The main point is that the middle integral is equal to zero. Indeed, the indicator of C_k can be written as a Borel function g(ξ_1, ..., ξ_k), and the middle integral equals

  Σ_{i=k+1}^n ∫ g(ξ_1, ..., ξ_k) ((ξ_1 + ⋯ + ξ_k) − (m_1 + ⋯ + m_k)) (ξ_i − m_i) dP,

where each term vanishes by Lemmas 12.2 and 12.3, since E(ξ_i − m_i) = 0. Finally

  Var(ξ_1 + ⋯ + ξ_n) ≥ Σ_{k=1}^n ∫_{C_k} ((ξ_1 + ⋯ + ξ_k) − (m_1 + ⋯ + m_k))² dP ≥ t² Σ_{k=1}^n P(C_k) = t² P(C),

i.e. P(C) ≤ Var(ξ_1 + ⋯ + ξ_n)/t².
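What makes Kolmogorov's inequality stronger than Chebyshev's is that it bounds the maximum over all the partial sums. A quick Monte Carlo sanity check, added here as an illustration (±1 steps and the numeric choices are assumptions):

```python
import random

rng = random.Random(7)
n, t, trials = 50, 15.0, 20_000   # Var ξ_i = 1, so Var S_n = n

hits = 0
for _ in range(trials):
    s, mx = 0.0, 0.0
    for _ in range(n):
        s += rng.choice((-1.0, 1.0))  # ξ_i = ±1, m_i = 0
        mx = max(mx, abs(s))
    hits += mx >= t

lhs = hits / trials                   # P(max_k |S_k| >= t)
rhs = n / t**2                        # Var(S_n) / t²
print(lhs, "<=", rhs)
```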
Proof. Let ξ_i' = ξ_i − m_i and ζ_n = (1/n)((ξ_1 + ⋯ + ξ_n) − (m_1 + ⋯ + m_n)) = (1/n) Σ_{i=1}^n ξ_i'. We need to show that ζ_n → 0 almost everywhere. Let us choose ε > 0 and form the event B(ε) = {ω | there exists N = N(ω) such that for all n ≥ N(ω) we have |ζ_n| < ε}. We have

  B(ε) = ∪_{N=1}^∞ ∩_{n=N}^∞ {ω : |ζ_n| < ε}.

We define

  P_m = P( max_{2^{m−1} < n ≤ 2^m} |Σ_{i=1}^n ξ_i'| ≥ ε 2^{m−1} ).

By Kolmogorov's inequality, applied to the random variables ξ_1', ..., ξ_{2^m}',

  P_m ≤ (4/(ε² 2^{2m})) Σ_{i=1}^{2^m} Var ξ_i.

Therefore

  Σ_m P_m ≤ (4/ε²) Σ_m 2^{−2m} Σ_{i=1}^{2^m} Var ξ_i ≤ (16/ε²) Σ_{i=1}^∞ (Var ξ_i)/i² < ∞,

and by the first Borel–Cantelli lemma only finitely many of these events occur with probability 1; since n > 2^{m−1} in the m-th block, this means that |ζ_n| < ε for all sufficiently large n, i.e. P(B(ε)) = 1 for every ε > 0. □

We now turn to the case of identically distributed ξ_i with E|ξ_1| < ∞, and introduce the truncated random variables η_i, where η_i = ξ_i if |ξ_i| ≤ i and η_i = 0 otherwise, with m_i = Eη_i. Then

  (1/n)(ξ_1 + ⋯ + ξ_n) = [(1/n)(η_1 + ⋯ + η_n) − (1/n)(m_1 + ⋯ + m_n)] + (1/n) Σ_{i=1}^n (ξ_i − η_i) + (1/n)(m_1 + ⋯ + m_n).

Consider the series

  Σ_{i=1}^∞ P{ξ_i ≠ η_i} = Σ_{i=1}^∞ P{|ξ_i| ≥ i} = Σ_{i=1}^∞ Σ_{k=i}^∞ P{k ≤ |ξ_1| < k + 1}    (12.1)

  ≤ Σ_{k=0}^∞ (k + 1) P{k ≤ |ξ_1| < k + 1}.

The last sum is the mathematical expectation of the random variable η which takes the value k + 1 on the event {k ≤ |ξ_1| < k + 1}. It is clear that η ≤ |ξ_1| + 1. Therefore by the properties of the Lebesgue integral we have Eη ≤ E|ξ_1| + 1. We recall that the inequality |Eξ| < ∞ always means that E|ξ| < ∞. Thus the series in (12.1) converges.
We show that the First Kolmogorov Theorem applies to the first sum. We have

  Σ_{i=1}^∞ (Var η_i)/i² ≤ Σ_{i=1}^∞ (Eη_i²)/i² = Σ_{i=1}^∞ (1/i²) ∫_{|x| ≤ i} x² dF_ξ(x).

We use the fact that for k > 1

  Σ_{i ≥ k} 1/i² ≤ Σ_{i ≥ k} 1/(i(i − 1)) = Σ_{i ≥ k} (1/(i − 1) − 1/i) = 1/(k − 1).

So the latter series is no greater than

  ∫_{|x| ≤ 1} x² dF_ξ(x) · Σ_{i=1}^∞ 1/i² + Σ_{k=2}^∞ (1/(k − 1)) ∫_{k−1 ≤ |x| ≤ k} x² dF_ξ(x)
    = const + const · Σ_{k=2}^∞ ∫_{k−1 ≤ |x| ≤ k} |x| dF_ξ(x) ≤ const + const · E|ξ_1| < ∞.
We do not specify the details of this expression here. The main point is that the random variable η_n has a discrete distribution which converges in a natural sense to a continuous distribution with Gaussian density (1/√(2π)) e^{−x²/2} as n → ∞.

Fig. 13.1.
Theorem 13.1. The sequence P_n ⇒ P if and only if F_n(x) → F(x) at every continuity point x of the function F.
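Theorem 13.1 can be watched in action for normalized sums of ±1 steps, whose laws converge weakly to the standard normal law; since the limit distribution function Φ is continuous everywhere, F_n(x) → Φ(x) at every x. The sketch below is an illustration added here (all numeric choices are assumptions).

```python
import math
import random

def F_n(x, n, trials=4000, rng=random.Random(3)):
    """Empirical distribution function of S_n / sqrt(n) for ±1 steps."""
    count = 0
    for _ in range(trials):
        s = sum(rng.choice((-1, 1)) for _ in range(n))
        count += (s / math.sqrt(n)) <= x
    return count / trials

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-1.0, 0.5, 1.0):
    print(x, F_n(x, 100), Phi(x))     # the two columns approach each other
```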
Fig. 13.2.

Formally we have

  f(y) = 1 for y ≤ x, 0 for y > x;

  f_δ⁺(y) = 1 for y ≤ x; 1 − (y − x)/δ for x < y ≤ x + δ; 0 for y > x + δ;

  f_δ⁻(y) = 1 for y ≤ x − δ; 1 − (y − x + δ)/δ for x − δ < y ≤ x; 0 for y > x.

The functions f_δ⁺ and f_δ⁻ are continuous and f_δ⁻ ≤ f ≤ f_δ⁺. Using the fact that x is a continuity point of F we have, for any ε > 0 and n ≥ n_0(ε),
  F_n(x) = ∫ f(y) dF_n(y) ≤ ∫ f_δ⁺(y) dF_n(y) ≤ ∫ f_δ⁺(y) dF(y) + ε/2 ≤ F(x + δ) + ε/2 ≤ F(x) + ε,

if δ is such that |F(x ± δ) − F(x)| ≤ ε/2. On the other hand, for such n we have

  F_n(x) ≥ ∫ f_δ⁻(y) dF_n(y) ≥ ∫ f_δ⁻(y) dF(y) − ε/2 ≥ F(x − δ) − ε/2 ≥ F(x) − ε.

In other words, |F_n(x) − F(x)| ≤ ε for all sufficiently large n.

We now prove the converse. Let F_n(x) → F(x) at every continuity point of F. We prove that

  ∫ f dF_n → ∫ f dF for every bounded continuous function f.
Therefore

  I_n = | ∫ f dF_n − ∫ f dF |
      ≤ | ∫_{(K_1, K_2]} f dF_n − ∫_{(K_1, K_2]} f dF | + | ∫_{(−∞, K_1]} f dF_n | + | ∫_{(K_2, ∞)} f dF_n |
        + | ∫_{(−∞, K_1]} f dF | + | ∫_{(K_2, ∞)} f dF |.

Each of the four latter integrals can be estimated in a simple way (M = sup_y |f(y)|): for n ≥ n_1(ε) we have

  | ∫_{(−∞, K_1]} f dF_n | ≤ ∫_{(−∞, K_1]} |f| dF_n ≤ M F_n(K_1) ≤ ε/8;
  | ∫_{(K_2, ∞)} f dF_n | ≤ ∫_{(K_2, ∞)} |f| dF_n ≤ M (1 − F_n(K_2)) ≤ ε/8,

and similarly for the two integrals with respect to F. For the first term we choose points x_0 = K_1 < x_1 < ⋯ < x_p = K_2 which are continuity points of F and such that the oscillation of f on each interval (x_i, x_{i+1}] is small; then

  | ∫_{(K_1, K_2]} f dF_n − ∫_{(K_1, K_2]} f dF |
    ≤ | Σ_{i=0}^{p−1} f(x_i)(F_n(x_{i+1}) − F_n(x_i)) − Σ_{i=0}^{p−1} f(x_i)(F(x_{i+1}) − F(x_i)) | + ε/4,

and since F_n(x_i) → F(x_i) at each of the finitely many points x_i, the right hand side is small for all sufficiently large n.
In what follows we need the following general fact from functional analysis.
With the help of this theorem we derive the important Reily's Theorem:
Proof. Assuming that the latter inequality is satisfied, we first prove the weak
compactness of the family {P_α}. We take a sequence of probability measures
from the family {P_α} which, for simplicity, we write in the form P_n, n =
1, 2, ..., and let X_l = [−l, l]. For any integer l, X_l is a separable, compact
metric space, and P_n([−l, l]) ≤ 1 for all n. Therefore for any l we can use the
general theorem from functional analysis that was quoted above.
\[
P(A) = P^{(1)}\bigl(A\cap[-1,1]\bigr)
+ \sum_{l=2}^{\infty} P^{(l)}\bigl(A\cap\bigl([-l,l]\setminus[-l+1,\,l-1]\bigr)\bigr).
\]
We now prove the converse. Let {P_α} be a weakly compact family which
does not satisfy the condition in Helly's Theorem. This means that for some
ε₀ > 0 and for every integer k > 0 there exists α_k such that
Since {P_α} is weakly compact, one can extract from the sequence
{P_{α_k}} a weakly convergent subsequence. Without loss of generality we will
assume that the whole sequence P_{α_k} ⇒ P, where P is a probability mea-
sure. Let us choose for any k a point β_k, k − 1 < β_k < k, such that ±β_k are
continuity points of P. Then for all m ≥ k we have
In this lecture we consider the Fourier transform for probability measures, and
its properties. Let us begin with the case of measures on the line. Let P be a
probability measure given on ℬ(ℝ¹).
If P = P_ξ we will denote the characteristic function by φ_ξ(λ) and will call
it the characteristic function of the random variable ξ. The expression means
that φ_ξ(λ) = E e^{iλξ}. If ξ takes on values a₁, a₂, ... with probabilities p₁, p₂, ...
then
\[
\phi(\lambda) = \sum_{k=1}^{\infty} p_k\, e^{i\lambda a_k}.
\]
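These definitions are easy to check numerically. The sketch below uses a hypothetical three-point distribution of my own choosing, plus a Riemann-sum approximation for the standard Gaussian density, whose characteristic function is known to be exp(−λ²/2):

```python
import numpy as np

# Discrete case: xi takes the (hypothetical) values a_k with probabilities p_k,
# so phi(lam) = sum_k p_k * exp(i * lam * a_k).
a = np.array([0.0, 1.0, 3.0])
p = np.array([0.5, 0.3, 0.2])
lam = 0.7
phi_discrete = np.sum(p * np.exp(1j * lam * a))
assert abs(phi_discrete) <= 1.0          # |phi| <= 1 always, phi(0) = 1

# Density case: for the standard Gaussian density the characteristic function
# is exp(-lam**2 / 2); approximate the integral by a Riemann sum.
dx = 0.001
x = np.arange(-10.0, 10.0, dx)
dens = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
phi_gauss = np.sum(dens * np.exp(1j * lam * x)) * dx
assert abs(phi_gauss - np.exp(-lam**2 / 2)) < 1e-4
```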
If ξ has a probability density p_ξ(x) then φ_ξ(λ) = ∫_{−∞}^{∞} e^{iλx} p_ξ(x) dx.
Each of the first two terms is no greater in modulus than ε/3. Regarding the
third term we have
\[
\sum_{k,l=1}^{t} \phi(\lambda_k-\lambda_l)\,c_k\bar c_l
= \sum_{k,l=1}^{t} \int e^{i(\lambda_k-\lambda_l)x}\,c_k\bar c_l\,dP(x)
= \int \Bigl|\sum_{k=1}^{t} c_k\, e^{i\lambda_k x}\Bigr|^2 dP(x) \ge 0.
\]
By Khinchin's Theorem, any positive semi-definite uniformly contin-
uous function which satisfies the normalization condition φ(0) = 1 is the
characteristic function of a probability measure.
122 Lecture 14. Characteristic Functions
7. Assume that the random variable ξ has an absolute moment of order
k, i.e. E|ξ|^k < ∞. Then φ_ξ(λ) is k times continuously differentiable and
\[
\phi_\xi^{(k)}(0) = i^k\, E\xi^k.
\]
The proof follows from properties of the Lebesgue integral and consists in
checking that the formal differentiation is correct in the equality
φ_ξ′(λ) = ∫ i x e^{iλx} dF_ξ(x).
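Property 7 can be verified numerically by finite differences: the derivatives of φ_ξ at 0 recover the moments up to powers of i. The distribution below is a hypothetical example of my own, not one from the text:

```python
import numpy as np

# Hypothetical discrete law: xi takes values a_k with probabilities p_k.
a = np.array([-1.0, 0.0, 2.0])
p = np.array([0.3, 0.4, 0.3])

def phi(lam):
    """Characteristic function phi(lam) = sum_k p_k exp(i lam a_k)."""
    return np.sum(p * np.exp(1j * lam * a))

h = 1e-5
# phi'(0) = i E(xi): central difference vs. the exact first moment 0.3
d1 = (phi(h) - phi(-h)) / (2 * h)
assert abs(d1 - 1j * np.sum(p * a)) < 1e-8

# phi''(0) = i^2 E(xi^2) = -E(xi^2): second central difference vs. -1.5
d2 = (phi(h) - 2 * phi(0.0) + phi(-h)) / h**2
assert abs(d2 + np.sum(p * a**2)) < 1e-3
```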
Before turning to the general case we will say that an interval [a, b] is a con-
tinuity interval for P if P{a} = P{b} = 0.
\[
\lim_{R\to\infty}\frac{1}{2\pi}\int_{-R}^{R}\frac{e^{-i\lambda a}-e^{-i\lambda b}}{i\lambda}\,\phi(\lambda)\,d\lambda
= P\{(a,b)\} + \frac12 P\{a\} + \frac12 P\{b\}.
\]
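The inversion formula can be checked numerically. For a measure without atoms the right-hand side is just P{(a,b)}; below the measure is the standard Gaussian, whose characteristic function is exp(−λ²/2), and the cutoff R, grid size, and interval are arbitrary choices of mine:

```python
import numpy as np
from math import erf, sqrt, pi

a, b, R = -1.0, 2.0, 40.0                 # continuity interval and cutoff
N = 80000
dlam = 2 * R / N
lam = -R + (np.arange(N) + 0.5) * dlam    # midpoint grid avoids lambda = 0

# phi(lam) = exp(-lam**2 / 2) for the standard Gaussian measure P
integrand = (np.exp(-1j * lam * a) - np.exp(-1j * lam * b)) / (1j * lam) \
            * np.exp(-lam**2 / 2)
approx = np.real(np.sum(integrand) * dlam) / (2 * pi)

Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))   # Gaussian distribution function
assert abs(approx - (Phi(b) - Phi(a))) < 1e-3  # P{(-1,2)} = Phi(2) - Phi(-1)
```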
Furthermore
\[
\int_{-R}^{R}\frac{e^{-i\lambda a}-e^{-i\lambda b}}{i\lambda}\,e^{i\lambda x}\,d\lambda
= \int_{-R}^{R}\frac{\cos\lambda(x-a)-\cos\lambda(x-b)}{i\lambda}\,d\lambda
+ \int_{-R}^{R}\frac{\sin\lambda(x-a)-\sin\lambda(x-b)}{\lambda}\,d\lambda.
\]
The first integral is equal to 0 since the integrand is odd. In the second integral
the integrand is even, therefore
By a change of variables μ = λ(x − a) in the first integral and μ = λ(x − b) in
the second integral we obtain
\[
2\int_0^R \frac{\sin\lambda(x-a)}{\lambda}\,d\lambda
- 2\int_0^R \frac{\sin\lambda(x-b)}{\lambda}\,d\lambda
= 2\int_{R(x-b)}^{R(x-a)} \frac{\sin\mu}{\mu}\,d\mu.
\]
So
\[
\frac{1}{2\pi}\int_{-R}^{R}\frac{e^{-i\lambda a}-e^{-i\lambda b}}{i\lambda}\,\phi(\lambda)\,d\lambda
= \int_{-\infty}^{\infty} dP(x)\,\frac{1}{\pi}\int_{R(x-b)}^{R(x-a)}\frac{\sin\mu}{\mu}\,d\mu.
\]
We now note that for any t > π/2, integrating by parts over [π/2, t],
\[
\Bigl|\int_0^t \frac{\sin\mu}{\mu}\,d\mu\Bigr|
\le \Bigl|\int_0^{\pi/2}\frac{\sin\mu}{\mu}\,d\mu\Bigr|
+ \Bigl|\Bigl[-\frac{\cos\mu}{\mu}\Bigr]_{\pi/2}^{t}\Bigr|
+ \Bigl|\int_{\pi/2}^{t}\frac{\cos\mu}{\mu^2}\,d\mu\Bigr|
\le \int_0^{\pi/2}\frac{\sin\mu}{\mu}\,d\mu + \frac{2}{\pi} + \frac1t
+ \int_{\pi/2}^{\infty}\frac{d\mu}{\mu^2}
\le \text{const}.
\]
Hence, passing to the limit under the integral sign,
\[
= \int_{-\infty}^{\infty} dP(x)\,\lim_{R\to\infty}\frac{1}{\pi}\int_{R(x-b)}^{R(x-a)}\frac{\sin\mu}{\mu}\,d\mu.
\]
If a < x < b then
\[
\lim_{R\to\infty}\frac{1}{\pi}\int_{R(x-b)}^{R(x-a)}\frac{\sin\mu}{\mu}\,d\mu
= \frac{1}{\pi}\int_{-\infty}^{\infty}\frac{\sin\mu}{\mu}\,d\mu = 1;
\]
if x = a then
\[
\lim_{R\to\infty}\frac{1}{\pi}\int_{-R(b-a)}^{0}\frac{\sin\mu}{\mu}\,d\mu
= \frac{1}{\pi}\int_{-\infty}^{0}\frac{\sin\mu}{\mu}\,d\mu = \frac12;
\]
and if x = b then
\[
\lim_{R\to\infty}\frac{1}{\pi}\int_{0}^{R(b-a)}\frac{\sin\mu}{\mu}\,d\mu
= \frac{1}{\pi}\int_{0}^{\infty}\frac{\sin\mu}{\mu}\,d\mu = \frac12.
\]
(For x < a or x > b both limits of integration tend to +∞ or to −∞ together, and the limit is 0.)
So without any assumption on the interval [a, b]
\[
\lim_{R\to\infty}\frac{1}{2\pi}\int_{-R}^{R}\frac{e^{-i\lambda a}-e^{-i\lambda b}}{i\lambda}\,\phi(\lambda)\,d\lambda
= P\{(a,b)\} + \frac12 P\{a\} + \frac12 P\{b\}.
\]
The proof follows from the fact that by Theorem 14.1 the distribution
functions of the measures must agree at all continuity points. But if two dis-
tribution functions agree at all continuity points then they must agree at all
points.
The statement in the corollary is sometimes referred to as the Unicity
Theorem for characteristic functions.
One of the reasons why characteristic functions are often helpful in probability
theory is that it is easy to obtain a criterion for the weak convergence of
probability measures from them.
Assume that we have a sequence of probability measures {P_n} and a prob-
ability measure P. We denote the corresponding characteristic functions by
{φ_n(λ)} and φ(λ).
Theorem 14.2. P_n ⇒ P if and only if lim_{n→∞} φ_n(λ) = φ(λ) for any λ.
\[
\phi_n(\lambda) = \int e^{i\lambda x}\,dP_n(x)
= \int \cos\lambda x\,dP_n(x) + i\int \sin\lambda x\,dP_n(x)
\]
Lemma 14.1. Let Q be a probability measure on the line and ψ(λ) be its
characteristic function. Then for any τ > 0
\[
Q\Bigl\{\Bigl[-\frac{2}{\tau},\,\frac{2}{\tau}\Bigr]\Bigr\}
\ge \Bigl|\frac{1}{\tau}\int_{-\tau}^{\tau}\psi(\lambda)\,d\lambda\Bigr| - 1.
\]
Therefore
\[
\Bigl|\int_{-\tau}^{\tau}\phi(\lambda)\,d\lambda\Bigr|
= \Bigl|\int_{-\tau}^{\tau}\bigl(\phi(\lambda)-1\bigr)\,d\lambda + 2\tau\Bigr|
\ge 2\tau - \Bigl|\int_{-\tau}^{\tau}\bigl(\phi(\lambda)-1\bigr)\,d\lambda\Bigr|.
\]
Therefore
\[
\Bigl|\frac{1}{\tau}\int_{-\tau}^{\tau}\phi(\lambda)\,d\lambda\Bigr| > 2 - \frac{\varepsilon}{2}.
\]
Since φ_n(λ) → φ(λ) and |φ_n(λ)| ≤ 1, by Lebesgue's Bounded Convergence
Theorem
\[
\lim_{n\to\infty}\frac{1}{\tau}\int_{-\tau}^{\tau}\phi_n(\lambda)\,d\lambda
= \frac{1}{\tau}\int_{-\tau}^{\tau}\phi(\lambda)\,d\lambda > 2 - \frac{\varepsilon}{2}.
\]
Therefore, for all sufficiently large n,
\[
P_n\Bigl(\Bigl[-\frac{2}{\tau},\frac{2}{\tau}\Bigr]\Bigr)
\ge \Bigl|\frac{1}{\tau}\int_{-\tau}^{\tau}\phi_n(\lambda)\,d\lambda\Bigr| - 1 > 1 - \varepsilon.
\]
For each of the finitely many remaining n we choose t_n > 0 such that P_n([−t_n, t_n]) > 1 − ε.
If we set K = max[2/τ, max_n t_n] then we find that P_n{[−K, K]} > 1 − ε for all n.
By Helly's Theorem the sequence of measures {P_n} is weakly compact.
So we can choose a weakly convergent subsequence {P_{n_j}}, P_{n_j} ⇒ P̃. We
now show that P̃ = P. Let us denote by φ̃(λ) the characteristic function of
P̃. From the first part of our theorem φ_{n_j}(λ) → φ̃(λ). On the other hand, by
assumption φ_n(λ) → φ(λ). So φ(λ) = φ̃(λ). By Corollary 14.1 we have P̃ = P.
It remains to be established that the whole sequence Pn ~ P. Assume
that this is not true. Then for some bounded continuous function f there
exists a subsequence {nj} such that
In this lecture we consider one of the most important results of probability the-
ory, the Central Limit Theorem. We already encountered this theorem in the
particular case of a sequence of Bernoulli trials in the form of the De Moivre-
Laplace Limit Theorem (Lecture 3). In its simplest form the Central Limit
Theorem deals with sums of independent random variables. Let us assume that
a sequence of such random variables ξ₁, ξ₂, ..., ξ_n is given. One can think of
the ξ_i as the results of a sequence of independent measurements of some phys-
ical quantity. Assuming that the strong law of large numbers is applicable to
the {ξ_i} we write ξ_i = m_i + ξ̂_i, where m_i = Eξ_i, Eξ̂_i = 0, and we consider the
arithmetic mean (1/n)(ξ₁ + ⋯ + ξ_n) = (1/n)(m₁ + ⋯ + m_n) + (1/n)(ξ̂₁ + ⋯ + ξ̂_n). It
follows from the strong law of large numbers that the last term converges to
zero with probability 1. The issue of the speed and type of its convergence to
zero arises naturally. In the case of a sequence of measurements the answer to
this question gives the precision that one can achieve with n measurements.
Under some simple further assumptions it follows, from Chebyshev's inequal-
ity, that (1/n)(ξ̂₁ + ⋯ + ξ̂_n) takes values of order O(1/√n). Therefore we consider
√n · (1/n)(ξ̂₁ + ⋯ + ξ̂_n) = (1/√n)(ξ̂₁ + ⋯ + ξ̂_n) = η_n. It turns out, and this is the
essence of the Central Limit Theorem, that in general, as n → ∞ the ran-
dom variables η_n (as functions on Ω) do not converge to a limit, but their
distributions, as n → ∞, have a limit which, in a well-known sense, is inde-
pendent of the details of the distributions of the random variables ξ_i. We now
systematically explore these considerations.
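Before developing the theory, the phenomenon can be watched in a simulation: normed sums of centered uniform variables (here also divided by the standard deviation so that Var η_n = 1) are already very close to Gaussian. The sample sizes below are arbitrary choices of mine:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
n, trials = 200, 10000
# xi_i uniform on [0,1]: m_i = 1/2, variance 1/12.  Form the normed sums
# eta_n = (xi_hat_1 + ... + xi_hat_n) / sqrt(n * Var(xi_1)).
xi = rng.random((trials, n))
eta = (xi - 0.5).sum(axis=1) / sqrt(n / 12)

Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))  # standard Gaussian distribution
for x in (-1.0, 0.0, 1.5):
    # Empirical distribution function of eta_n vs. the Gaussian limit
    assert abs((eta <= x).mean() - Phi(x)) < 0.02
```

The details of the uniform distribution have disappeared: only its mean and variance matter, which is exactly the universality asserted above.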
Let FI and F2 be two distribution functions.
Proof. Let us denote by F_{ξ₁,ξ₂}(x₁, x₂) the joint distribution function of the
random variables ξ₁ and ξ₂. By definition
128 Lecture 15. Central Limit Theorem
Since ξ₁ and ξ₂ are independent we have F_{ξ₁,ξ₂}(x₁, x₂) = F₁(x₁) F₂(x₂) and
by Fubini's Theorem
Assume that the random variable ξ₁ takes its values in the set A₁, and
that the random variable ξ₂ takes its values in the set A₂. One sometimes
says that F₁ (or P_{ξ₁}) is concentrated on A₁, and F₂ (or P_{ξ₂}) is concentrated
on A₂. Then G is concentrated on the set A = A₁ + A₂, where
If F₁ has a density p₁(x) then G has a density p(x) and
\[
p(x) = \int_{-\infty}^{\infty} p_1(x-y)\,dF_2(y).
\]
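A discretized convolution reproduces this density formula. For two Uniform[0,1] summands (my own illustrative choice) the result should approximate the triangular density on [0, 2]:

```python
import numpy as np

dx = 0.001
x = np.arange(0.0, 1.0, dx)
p1 = np.ones_like(x)            # density of Uniform[0,1]
p2 = np.ones_like(x)            # second summand also Uniform[0,1]

# p(x) = integral p1(x - y) p2(y) dy, approximated by a discrete convolution
p = np.convolve(p1, p2) * dx
grid = np.arange(len(p)) * dx   # support is approximately [0, 2]

assert abs(p.sum() * dx - 1.0) < 1e-6      # total mass 1
assert abs(p[len(x)] - 1.0) < 0.01         # peak of the triangle near x = 1
assert abs(p[len(x) // 2] - 0.5) < 0.01    # height 1/2 at x = 0.5
```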
Proof. The characteristic function φ_{ξ_k}(λ) = E exp(iλξ_k). Since the random
variables ξ₁, ξ₂, ..., ξ_n are independent, the random variables e^{iλξ₁}, ..., e^{iλξ_n},
as functions of independent random variables, are also independent. This lat-
ter statement for real-valued functions can be found in Lemma 12.1. It can
be extended without any modifications to complex-valued functions. Further-
more, in the case of real-valued independent random variables the expectation
of their product is equal to the product of their expectations (see Lemma 12.2).
This statement can also be extended, without any modifications, to the case
of complex-valued random variables. Therefore
\[
\phi_{\xi_1+\cdots+\xi_n}(\lambda) = E\,e^{i\lambda(\xi_1+\cdots+\xi_n)}
= \prod_{k=1}^{n}\phi_{\xi_k}(\lambda). \qquad\square
\]
We are now in a position to prove the Central Limit Theorem in the
following important case.
Proof. The characteristic function of the Gaussian distribution is given by the integral
\[
\frac{1}{\sqrt{2\pi}}\int \exp\Bigl(i\lambda x - \frac{x^2}{2}\Bigr)\,dx = e^{-\lambda^2/2}.
\]
The characteristic function of the random variable η_n is equal to
\[
\phi_{\eta_n}(\lambda) = \Bigl(\phi\Bigl(\frac{\lambda}{\sqrt n}\Bigr)\Bigr)^{n},
\tag{15.1}
\]
\[
\phi\Bigl(\frac{\lambda}{\sqrt n}\Bigr) = 1 - \frac{\lambda^2}{2n} + o\Bigl(\frac{1}{n}\Bigr).
\tag{15.2}
\]
The last term calls for a further remark. We write the real and imaginary
parts of the characteristic function φ(λ) separately and for each part we write
a Taylor expansion with a Lagrange remainder up to the second order. The
intermediate point is in general different for each part. By the continuity of
φ″(λ) and since φ″(0) = −1, the second derivative of the real part converges
to −1 and that of the imaginary part converges to zero. This leads to equation
(15.2). If we substitute it into (15.1) we obtain
There are quite a few generalizations of the Central Limit Theorem, where
the condition of independence of the random variables is replaced by a
condition of weak dependence in one sense or another. Other important
generalizations are those connected with vector-valued random variables.
The appearance and the universality of the Gaussian distribution in the
Central Limit Theorem can seem rather enigmatic and surprising. The goal of
(15.3)
This follows from the fact that the variance of the random variable γ =
(1/√2)(γ′ + γ″), where γ′ and γ″ are independent, identically distributed
random variables, is the same as the variance of γ′ and γ″. Equation (15.4)
shows that the variance is the "first integral" of the non-linear transformation
T.
In the general theory of non-linear operators the investigation of the sta-
bility of a fixed point begins with an investigation of its stability with respect
to the linear approximation. In our case the problem on linear stability arises
in the following way. Set
\[
F_\varepsilon = \int_{-\infty}^{x}\bigl(1+\varepsilon h(u)\bigr)\,\frac{1}{\sqrt{2\pi}}\exp\Bigl(-\frac{u^2}{2}\Bigr)\,du
\]
and write
\[
TF_\varepsilon = \int_{-\infty}^{x}\bigl(1+\varepsilon Lh(u)\bigr)\,\frac{1}{\sqrt{2\pi}}\,e^{-u^2/2}\,du + o(\varepsilon).
\]
The linear operator L which appears in the last expression is the linearization
of the operator T at the point G. It is easy to write it explicitly as
the eigenfunctions P_k(x) (see Lecture 9), which correspond to the eigenvalues μ_k = 2^{1−k/2}, k =
0, 1, 2, 3, .... We see that μ₀, μ₁ > 1, μ₂ = 1, and all the remaining eigenvalues
are less than 1. In other words, in directions which correspond to P₀ and P₁
the operator T is non-stable, while in the direction which corresponds to P₂
it is marginal and in all remaining directions it is stable. We note here that
we must consider perturbations h for which
\[
\int h(u)\,e^{-u^2/2}\,du = 0, \qquad \int u\,h(u)\,e^{-u^2/2}\,du = 0.
\]
The first condition corresponds to the fact that the perturbation of G is such
that ∫_{−∞}^{∞} (1 + εh(u)) e^{−u²/2} du remains a distribution function, and the second
condition means that the expectation of the perturbed distribution function
remains equal to zero. Thus the perturbations are orthogonal to the non-stable
directions. Furthermore, since the variance d(F) is the first integral of the
transformation T, the perturbations must leave the variance unchanged:
Theorem 15.6. Let ξ₁, ξ₂, ..., ξ_n, ... be a sequence of independent, identi-
cally distributed random variables for which the distribution function F has a
density p(x) such that
i) p(x) = p(−x); and
ii) p(x) ~ c/|x|^{α+1} for some α ∈ (0, 2), where c is a constant.
Then the distribution of the normed sum η_n = n^{−1/α}(ξ₁ + ⋯ + ξ_n) as
n → ∞ converges to a limit distribution whose characteristic function is of
the form exp(−c₁|λ|^α) for some constant c₁ > 0.
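For α = 1 the tail c/|x|^{α+1} is that of the Cauchy law, and in that case the normed sum n^{−1/α}(ξ₁ + ⋯ + ξ_n) is again exactly standard Cauchy, so the stable limit can be seen directly in a simulation (sample sizes are my own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 200, 20000
xi = rng.standard_cauchy((trials, n))   # density ~ (1/pi)/x^2, i.e. alpha = 1
eta = xi.sum(axis=1) / n                # n**(-1/alpha) * (xi_1 + ... + xi_n)

# The standard Cauchy law has quartiles at +-1, so P(|eta| <= 1) = 1/2;
# note the Gaussian norming by sqrt(n) would be wrong for these tails.
frac = (np.abs(eta) <= 1.0).mean()
assert abs(frac - 0.5) < 0.02
```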
We know from the Central Limit Theorem that in case a₁) as n → ∞ the
corresponding probability converges to a positive limit which can be calcu-
lated by using the Gaussian distribution. This means that in a₁) the order of
magnitude of the estimates obtained from Chebyshev's inequality is correct.
On the other hand in the case a₂) the estimate from Chebyshev's inequality
is often too large, and very crude. In this lecture we investigate the possibility
of obtaining more precise estimates for the probability in case a₂).
Let us consider a sequence of independent identically distributed random
variables. We denote their common distribution function by F. We make the
following basic assumption about F: the integral
\[
R(\lambda) = \int_{-\infty}^{\infty} e^{\lambda x}\,dF(x) < \infty
\tag{16.1}
\]
for all λ, −∞ < λ < ∞. This condition is automatically satisfied if all the ξ_i
are bounded: |ξ_i| ≤ c = const. It is also satisfied if the probabilities of large
values of the ξ_i decrease faster than exponentially.
We now investigate several properties of the function R(λ). From the
finiteness of the integral in (16.1) it follows that the derivatives
R^{(k)}(λ) = ∫ x^k e^{λx} dF(x) exist for all k. Set m(λ) = R′(λ)/R(λ); then
\[
m'(\lambda) = \frac{R''(\lambda)}{R(\lambda)} - \Bigl(\frac{R'(\lambda)}{R(\lambda)}\Bigr)^2
= \int_{-\infty}^{\infty} \frac{x^2}{R(\lambda)}\,e^{\lambda x}\,dF(x)
- \Bigl(\int_{-\infty}^{\infty} \frac{x}{R(\lambda)}\,e^{\lambda x}\,dF(x)\Bigr)^2.
\]
We construct a new distribution function F_λ(x) = (1/R(λ)) ∫_{(−∞,x)} e^{λu} dF(u) for
each λ. Then m(λ) = ∫ x dF_λ(x) is the expectation, computed with respect to
this distribution, and m′(λ) is the variance. Therefore m′(λ) > 0 if F is a non-
trivial distribution, i.e. is not concentrated at a point. We exclude this latter
case from our further considerations. Since m′(λ) > 0, m(λ) is a monotonically
increasing function. We will need some information on lim m(λ) as λ → ±∞.
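For a concrete two-point law ξ = ±1 with probability ½ (a hypothetical example of mine) one has R(λ) = cosh λ and m(λ) = tanh λ, so the monotonicity of the tilted mean, and its limits as λ → ±∞, are visible directly:

```python
import numpy as np

a = np.array([-1.0, 1.0])   # hypothetical two-point law: xi = +-1
p = np.array([0.5, 0.5])    # with probability 1/2 each, so R(lam) = cosh(lam)

def m(lam):
    """Mean of the tilted distribution F_lambda, i.e. m(lam) = R'(lam)/R(lam)."""
    w = p * np.exp(lam * a)              # unnormalized tilted weights e^{lam x}
    return float(np.sum(a * w) / np.sum(w))

vals = [m(l) for l in np.linspace(-3.0, 3.0, 13)]
assert all(u < v for u, v in zip(vals, vals[1:]))   # m is strictly increasing
assert abs(m(0.8) - np.tanh(0.8)) < 1e-12           # here m(lam) = tanh(lam)
# m(lam) tends to the upper/lower limits of the law as lam -> +-infinity:
assert m(20.0) > 0.999 and m(-20.0) < -0.999
```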
We say that M⁺ is an upper limit in probability for the random variable
ξ if P(ξ > M⁺) = 0, and P(M⁺ − ε ≤ ξ ≤ M⁺) > 0 for any ε > 0. One
can define a concept of lower limit in probability M⁻ in an analogous way. If
P(ξ > M) > 0 (P(ξ < M) > 0) for any M, then M⁺ = ∞ (M⁻ = −∞). In
all remaining cases M⁺ and M⁻ are finite.
\[
R'(\lambda) = \int_{-\infty}^{\infty} x\,e^{\lambda x}\,dF(x)
\ge \int_{2M}^{\infty} x\,e^{\lambda x}\,dF(x).
\]
Furthermore
\[
R(\lambda) \le e^{2\lambda M} + \int_{2M}^{\infty} e^{\lambda x}\,dF(x)
= \int_{2M}^{\infty} e^{\lambda x}\,dF(x)\,\Bigl(1 + \frac{e^{2\lambda M}}{\int_{2M}^{\infty} e^{\lambda x}\,dF(x)}\Bigr).
\]
Since M⁺ = ∞, for any M there exists a half-open interval (a, a + 1],
a > 4M, such that F(a + 1) − F(a) > 0. Therefore
\[
\int_{2M}^{\infty} e^{\lambda x}\,dF(x)
\ge \int_{(a,a+1]} e^{\lambda x}\,dF(x)
\ge e^{a\lambda}\bigl(F(a+1)-F(a)\bigr).
\]
For λ > λ(M) the ratio e^{2λM} / ∫_{2M}^{∞} e^{λx} dF(x) is therefore less than 1 (since a > 4M). For such λ
\[
\frac{R'(\lambda)}{R(\lambda)}
\ge \frac{\int_{2M}^{\infty} x\,e^{\lambda x}\,dF(x)}
{\bigl(\int_{2M}^{\infty} e^{\lambda x}\,dF(x)\bigr)\Bigl(1+\frac{e^{2\lambda M}}{\int_{2M}^{\infty} e^{\lambda x}\,dF(x)}\Bigr)}
\ge \frac12\cdot\frac{\int_{2M}^{\infty} x\,e^{\lambda x}\,dF(x)}{\int_{2M}^{\infty} e^{\lambda x}\,dF(x)}
\ge \frac12\cdot 2M = M.
\]
136 Lecture 16. Probabilities of Large Deviations
To obtain the last inequality we used the fact that
e^{λx} dF(x) / ∫_{2M}^{∞} e^{λx} dF(x) generates a
probability measure on the half-line (2M, ∞). So the lemma is proved for the
case where M⁺ = ∞.
We now turn to the case of a finite M⁺. We show that for any δ > 0 there
exists a λ(δ) such that m(λ) ≥ M⁺ − δ if λ > λ(δ). As previously,
\[
R(\lambda) \le \int_{M^+-\delta/2}^{\infty} e^{\lambda x}\,dF(x)\,
\Bigl(1 + \frac{e^{\lambda(M^+-\delta/2)}}{\int_{M^+-\delta/2}^{\infty} e^{\lambda x}\,dF(x)}\Bigr).
\]
Therefore
\[
\frac{R'(\lambda)}{R(\lambda)}
\ge \frac{\int_{M^+-\delta/2}^{\infty} x\,e^{\lambda x}\,dF(x)}{\int_{M^+-\delta/2}^{\infty} e^{\lambda x}\,dF(x)}
\cdot\frac{1}{1+\dfrac{e^{-\lambda\delta/4}}{F(M^+)-F(M^+-\delta/4)}}.
\]
The first quotient, by the same arguments as above, is greater than or equal
to M⁺ − ½δ. The second quotient can be made arbitrarily close to 1 by an
appropriate choice of λ. □
Proof. We have
\[
P\{\xi_1+\cdots+\xi_n > cn\}
= \int\cdots\int_{x_1+\cdots+x_n > cn} dF(x_1)\cdots dF(x_n),
\]
which, after passing to the tilted distribution F_{λ₀}, we compare with
\[
\int\cdots\int_{x_1+\cdots+x_n > cn} dF_{\lambda_0}(x_1)\cdots dF_{\lambda_0}(x_n).
\]
The latter integral is the probability, computed with respect to the distribution
F_{λ₀}, that ξ₁ + ⋯ + ξ_n > cn. But the distribution F_{λ₀} was chosen in such a
way that E_{λ₀} ξ_i = c. Therefore
\[
\int\cdots\int_{x_1+\cdots+x_n > cn} dF_{\lambda_0}(x_1)\cdots dF_{\lambda_0}(x_n)
= P_{\lambda_0}\{\xi_1+\cdots+\xi_n > cn\}
\]
Theorem 16.2. For any b > 0 there exists ρ(b) = ρ > 0 such that
\[
P\{\xi_1+\cdots+\xi_n > cn\}
\ge \bigl(R(\lambda_0)\,e^{-\lambda_0 c}\bigr)^{n}\,e^{-\lambda_0 b\sqrt n}\,p_n
\]
with
\[
\lim_{n\to\infty} p_n = \rho(b) > 0.
\]
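The exponential factor (R(λ₀)e^{−λ₀c})^n can be checked against an exact computation for the coin-flip law ξ_i = ±1 with probability ½ (my own example), where R(λ) = cosh λ, m(λ) = tanh λ, and hence λ₀ = artanh(c). The particular n and c below are arbitrary; the factor is an upper bound of Chernoff type and, per the theorem, also captures the true exponential order:

```python
from math import comb, cosh, exp, atanh

n, c = 20, 0.5
lam0 = atanh(c)                            # solves m(lam0) = tanh(lam0) = c
rate = (cosh(lam0) * exp(-lam0 * c)) ** n  # (R(lam0) * e^{-lam0 c})^n

# Exact: xi_1 + ... + xi_n > cn  <=>  at least 16 of the 20 steps equal +1
exact = sum(comb(n, k) for k in range(16, n + 1)) / 2 ** n

assert exact <= rate       # the exponential factor is an upper bound here
assert rate / exact < 50   # and it captures the right order of magnitude
```

Chebyshev's inequality would only give a bound of order 1/(c²n) here, enormously larger than the true probability, which is the point of the sharper large-deviation estimates.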