
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.265/15.070J                                                      Fall 2013
Lecture 1                                                          9/4/2013

Metric spaces and topology

Content. Metric spaces and topology. Polish space. Arzela-Ascoli Theorem.
Convergence of mappings. Skorohod metric and Skorohod space.

1 Metric spaces. Open, closed and compact sets


When we discuss the probability theory of random processes, the underlying sample
spaces and σ-field structures become quite complex. It helps to have a unifying
framework for discussing both random variables and stochastic processes, as
well as their convergence, and such a framework is provided by metric spaces.

Definition 1. A metric space is a pair (S, ρ) of a set S and a function ρ :
S × S → R₊ such that for all x, y, z ∈ S the following holds:

1. ρ(x, y) = 0 if and only if x = y.
2. ρ(x, y) = ρ(y, x) (symmetry).
3. ρ(x, z) ≤ ρ(x, y) + ρ(y, z) (triangle inequality).
Examples of metric spaces include S = R^d with ρ(x, y) = (Σ_{1≤j≤d} (x_j − y_j)²)^{1/2},
or ρ = Σ_{1≤j≤d} |x_j − y_j|, or ρ = max_{1≤j≤d} |x_j − y_j|. These metrics are also called
the L2, L1 and L∞ norms, and we write ‖x − y‖₁, ‖x − y‖₂, ‖x − y‖∞, or simply
‖x − y‖. More generally one can define ‖x − y‖_p = (Σ_{1≤j≤d} |x_j − y_j|^p)^{1/p}, p ≥ 1.
The case p = ∞ essentially corresponds to ‖x − y‖∞.

Problem 1. Show that Lp is not a metric when 0 < p < 1.


Another important example is S = C[0, T ], the space of continuous functions
x : [0, T ] → R^d, with ρ(x, y) = ρ_T(x, y) = sup_{0≤t≤T} ‖x(t) − y(t)‖, where ‖ · ‖
can be taken as any of the Lp or L∞ norms. We will usually concentrate on the case
d = 1, in which case ρ(x, y) = ρ_T(x, y) = sup_{0≤t≤T} |x(t) − y(t)|. The space
C[0, ∞) is also a metric space under ρ(x, y) = Σ_{n∈N} 2^{−n} min(ρ_n(x, y), 1),
where ρ_n is the metric defined on C[0, n]. (Why did we have to use the min
operator in the definition above?) We call ρ_T and ρ the uniform metric. We will
also write ‖x − y‖_T or ‖x − y‖ instead of ρ_T, ρ.

Problem 2. Establish that ρ_T and ρ defined above on C[0, ∞) are metrics.
Prove also that if x_n → x in metric ρ for x_n, x ∈ C[0, ∞), then the restrictions
x'_n, x' of x_n, x onto [0, T ] satisfy x'_n → x' w.r.t. ρ_T.
Finally, let us give an example of a metric space from graph theory. Let
G = (V, E) be an undirected graph on nodes V and edges E. Namely, each
element (edge) of E is a pair of nodes (u, v), u, v ∈ V. For every two nodes u
and v, which are not necessarily connected by an edge, let ρ(u, v) be the length
of a shortest path connecting u with v. Then it is easy to see that ρ is a metric
on the finite set V (provided G is connected, so that every pair of nodes is
joined by some path).
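This graph metric is easy to verify computationally. Below is a small sketch (the BFS helper `graph_metric` and the 4-cycle example are illustrative choices of ours, assuming a connected graph so all distances are finite):

```python
from collections import deque

def graph_metric(n, edges):
    """Shortest-path distance (number of edges) between all node pairs, via BFS."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = {}
    for s in range(n):
        d = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in d:
                    d[w] = d[u] + 1
                    q.append(w)
        for t, dt in d.items():
            dist[(s, t)] = dt
    return dist

# A 4-cycle: 0-1-2-3-0
rho = graph_metric(4, [(0, 1), (1, 2), (2, 3), (3, 0)])

# Check the three metric axioms on all pairs/triples of nodes
nodes = range(4)
assert all((rho[(u, v)] == 0) == (u == v) for u in nodes for v in nodes)
assert all(rho[(u, v)] == rho[(v, u)] for u in nodes for v in nodes)
assert all(rho[(u, w)] <= rho[(u, v)] + rho[(v, w)]
           for u in nodes for v in nodes for w in nodes)
```

On the 4-cycle, opposite nodes are at distance 2 and neighbors at distance 1, as expected.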
Definition 2. A sequence x_n ∈ S is said to converge to a limit x ∈ S (we write
x_n → x) if lim_n ρ(x_n, x) = 0. A sequence x_n ∈ S is Cauchy if for every ε > 0
there exists n_0 such that for all n, n' > n_0, ρ(x_n, x_{n'}) < ε. A metric space is
defined to be complete if every Cauchy sequence converges to some limit x.
The space R^d is a complete space under all three metrics L1, L2, L∞. The
space Q of rational points in R is not complete (why?). A subset A ⊂ S is
called dense if for every x ∈ S there exists a sequence of points x_n ∈ A such
that x_n → x. The set of rational values in R is dense and is countable. The set of
irrational points in R is dense but not countable. The set of points (q_1, . . . , q_d) ∈
R^d such that q_i is rational for all 1 ≤ i ≤ d is a countable dense subset of R^d.
Definition 3. A metric space is defined to be separable if it contains a dense
countable subset A. A metric space S is defined to be a Polish space if it is
complete and separable.
We just realized that R^d is Polish. Is there a countable dense subset of
C[0, T ] or C[0, ∞), namely are these spaces Polish as well? The answer is
yes, but we will get to this later.
Problem 3. Given a set S, consider the metric defined by ρ(x, x) = 0, ρ(x, y) =
1 for x ≠ y. Show that (S, ρ) is a metric space. Suppose S is uncountable. Show
that S is not separable.

Given x ∈ S and r > 0, define a ball with radius r to be B(x, r) = {y ∈ S :
ρ(x, y) ≤ r}. A set A ⊂ S is defined to be open if for every x ∈ A there exists

ε > 0 such that B(x, ε) ⊂ A. A set A is defined to be closed if A^c = S \ A is
open. The empty set is assumed to be open and closed by definition. It is easy
to check that any union of open sets is open and any intersection of closed sets is
closed. Also it is easy to check that a finite intersection of open sets is open and a
finite union of closed sets is closed. Every interval (a, b) is open, and every countable
union of intervals ∪_{i≥1}(a_i, b_i) is an open set. Show that a union of closed
intervals [a_i, b_i], i ≥ 1 is not necessarily closed.
For every set A define its interior A^o as the union of all open sets U ⊂ A.
This set is open (check). For every set A define its closure Ā as the intersection
of all closed sets V ⊃ A. This set is closed. For every set A define its boundary
∂A as Ā \ A^o. Examples of open sets are open balls B_o(x, r) = {y ∈ S :
ρ(x, y) < r} ⊂ B(x, r) (check this). A set K ⊂ S is defined to be compact
if every sequence x_n ∈ K contains a converging subsequence x_{n_k} → x with
x ∈ K. It can be shown that K ⊂ R^d is compact if and only if K is closed and
bounded (namely sup_{x∈K} ‖x‖ < ∞; this applies to any Lp metric). Prove that
every compact set is closed.
Proposition 1. Given a metric space (S, ρ), a set K is compact iff every cover
of K by open sets contains a finite subcover. Namely, if U_r, r ∈ R is a (possibly
uncountable) family of sets such that K ⊂ ∪_r U_r, then there exists a finite subset
r_1, . . . , r_m ∈ R such that K ⊂ ∪_{1≤i≤m} U_{r_i}.

We skip the proof of this fact, but it can be found in any book on topology.
Problem 4. Give an example of a closed bounded set K C[0, T ] which is not
compact.
Definition 4. Given two metric spaces (S_1, ρ_1), (S_2, ρ_2), a mapping f : S_1 →
S_2 is defined to be continuous in x ∈ S_1 if for every ε > 0 there exists δ > 0 such
that f (B(x, δ)) ⊂ B(f (x), ε). Equivalently, for every y such that ρ_1(x, y) < δ
we must have ρ_2(f (x), f (y)) < ε. And again equivalently, if for every sequence
x_n ∈ S_1 converging to x ∈ S_1 it is also true that f (x_n) converges to f (x).
A mapping f is defined to be continuous if it is continuous in every x ∈ S_1.
A mapping is uniformly continuous if for every ε > 0 there exists δ > 0 such
that ρ_1(x, y) < δ implies ρ_2(f (x), f (y)) < ε.

Problem 5. Show that f is a continuous mapping if and only if for every open
set U ⊂ S_2, f^{−1}(U) is an open set in S_1.
Proposition 2. Suppose K ⊂ S_1 is compact. If f : K → R^d is continuous then
it is also uniformly continuous. Also there exists x_0 ∈ K satisfying ‖f (x_0)‖ =
sup_{x∈K} ‖f (x)‖, for any norm ‖ · ‖ = ‖ · ‖_p.

Proof. This is where the alternative characterization of compactness provided in
Proposition 1 is useful. Fix ε > 0. For every x ∈ K find δ = δ(x) such that
f (B(x, δ(x))) ⊂ B(f (x), ε). This is possible by continuity of f . Then K ⊂
∪_{x∈K} B_o(x, δ(x)/2) (recall that B_o is the open version of B). Namely, we
have an open cover of K. By Proposition 1, there exists a finite subcover
K ⊂ ∪_{1≤i≤k} B_o(x_i, δ(x_i)/2). Let δ = min_{1≤i≤k} δ(x_i). This value is positive
since k is finite. Consider any two points y, z ∈ K such that ρ_1(y, z) < δ/2.
We just showed that there exists i, 1 ≤ i ≤ k such that ρ_1(x_i, y) ≤ δ(x_i)/2.
By the triangle inequality ρ_1(x_i, z) < δ(x_i)/2 + δ/2 ≤ δ(x_i). Namely both y
and z belong to B_o(x_i, δ(x_i)). Then f (y), f (z) ∈ B_o(f (x_i), ε). By the triangle
inequality we have ‖f (y) − f (z)‖ ≤ ‖f (y) − f (x_i)‖ + ‖f (z) − f (x_i)‖ < 2ε.
We conclude that for every two points y, z such that ρ_1(y, z) < δ/2 we have
‖f (y) − f (z)‖ < 2ε. The uniform continuity is established. Notice that in this
proof the only property of the target space R^d we used is that it is a metric space.
In fact, this part of the proposition is true if R^d is replaced by any metric space
(S_2, ρ_2).
Now let us show the existence of x_0 ∈ K satisfying ‖f (x_0)‖ = sup_{x∈K} ‖f (x)‖.
First let us show that sup_{x∈K} ‖f (x)‖ < ∞. If this is not true, identify a
sequence x_n ∈ K such that ‖f (x_n)‖ → ∞. Since K is compact, there exists
a subsequence x_{n_k} which converges to some point y ∈ K. Since f is
continuous, then f (x_{n_k}) → f (y), but this contradicts ‖f (x_n)‖ → ∞. Thus
sup_{x∈K} ‖f (x)‖ < ∞. Find a sequence x_n satisfying lim_n ‖f (x_n)‖ = sup_{x∈K} ‖f (x)‖.
Since K is compact there exists a converging subsequence x_{n_k} → x_0. Again
using continuity of f we have f (x_{n_k}) → f (x_0). But ‖f (x_{n_k})‖ → sup_{x∈K} ‖f (x)‖.
We conclude ‖f (x_0)‖ = sup_{x∈K} ‖f (x)‖.
We mentioned that the sets in R^d which are compact are exactly the bounded
closed sets. What about C[0, T ]? We will need a characterization of compact sets
in this space later when we analyze tightness properties and the construction of a
Brownian motion.

Given x ∈ C[0, T ] and δ > 0, define w_x(δ) = sup_{s,t:|s−t|≤δ} |x(t) − x(s)|. The
quantity w_x(δ) is called the modulus of continuity. Since [0, T ] is compact, by
Proposition 2 every x ∈ C[0, T ] is uniformly continuous on [0, T ]. This may be
restated as: for every ε > 0, there exists δ > 0 such that w_x(δ) < ε.
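The modulus of continuity can be approximated on a finite grid; a minimal sketch (the grid resolution and the test function x(t) = t are arbitrary choices of ours):

```python
def modulus_of_continuity(x, T, delta, grid=1000):
    """Approximate w_x(delta) = sup_{|s-t| <= delta} |x(t) - x(s)| on a uniform grid."""
    ts = [T * k / grid for k in range(grid + 1)]
    vals = [x(t) for t in ts]
    window = round(delta / T * grid)  # number of grid steps within distance delta
    w = 0.0
    for i in range(len(ts)):
        for j in range(i + 1, min(i + window + 1, len(ts))):
            w = max(w, abs(vals[j] - vals[i]))
    return w

# For x(t) = t on [0, 1], w_x(delta) = delta exactly
w = modulus_of_continuity(lambda t: t, T=1.0, delta=0.1)
assert abs(w - 0.1) < 1e-9
```

As delta shrinks, w_x(delta) shrinks with it for any fixed continuous x, which is exactly the uniform continuity statement above.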
Theorem 1 (Arzela-Ascoli Theorem). A set A ⊂ C[0, T ] is compact if and
only if it is closed and

    sup_{x∈A} |x(0)| < ∞,                    (1)

and

    lim_{δ→0} sup_{x∈A} w_x(δ) = 0.          (2)

Proof. We only show that if A is compact then (1) and (2) hold. The converse is
established using a similar type of mathematical analysis/topology arguments.

We already know that if A is compact it needs to be closed. The assertion
(1) follows from Proposition 2. We now show (2). For any s, t ∈ [0, T ] we have

    |y(t) − y(s)| ≤ |y(t) − x(t)| + |x(t) − x(s)| + |x(s) − y(s)| ≤ |x(t) − x(s)| + 2‖x − y‖.

Similarly we show that |x(t) − x(s)| ≤ |y(t) − y(s)| + 2‖x − y‖. Therefore for
every δ > 0,

    |w_x(δ) − w_y(δ)| ≤ 2‖x − y‖.            (3)

Check that (2) is equivalent to

    lim_n sup_{x∈A} w_x(1/n) = 0.            (4)

Suppose A is compact but (4) does not hold. Then we can find a subsequence
x_{n_i} ∈ A, i ≥ 1 such that w_{x_{n_i}}(1/n_i) ≥ c for some c > 0. Since A is
compact there is a further subsequence of x_{n_i} which converges to some x ∈ A.
To ease the notation we denote this subsequence again by x_{n_i}. Thus
‖x_{n_i} − x‖ → 0. From (3) we obtain

    |w_x(1/n_i) − w_{x_{n_i}}(1/n_i)| ≤ 2‖x − x_{n_i}‖ < c/2

for all i larger than some i_0. This implies that

    w_x(1/n_i) ≥ c/2,                        (5)

for all sufficiently large i. But x is continuous on [0, T ], which implies it is
uniformly continuous, as [0, T ] is compact. This contradicts (5).

2 Convergence of mappings

Given two metric spaces (S_1, ρ_1), (S_2, ρ_2), a sequence of mappings f_n : S_1 →
S_2 is defined to be point-wise converging to f : S_1 → S_2 if for every x ∈ S_1 we
have ρ_2(f_n(x), f (x)) → 0. A sequence f_n is defined to converge to f uniformly
if

    lim_n sup_{x∈S_1} ρ_2(f_n(x), f (x)) = 0.

Also, given K ⊂ S_1, the sequence f_n is said to converge to f uniformly on K if
the restriction of f_n, f onto K gives a uniform convergence. A sequence f_n is said
to converge to f uniformly on compact sets (u.o.c.) if f_n converges uniformly to f
on every compact set K ⊂ S_1.
Problem 6. Let S_1 = [0, ∞) and let S_2 be arbitrary. Show that f_n converges to
f uniformly on compact sets if and only if for every T > 0

    lim_n sup_{0≤t≤T} ρ_2(f_n(t), f (t)) = 0.

Point-wise convergence does not imply uniform convergence even on compact
sets. Consider x_n(t) = nt for t ∈ [0, 1/n], = n(2/n − t) for t ∈ [1/n, 2/n],
and = 0 for t ∈ [2/n, 1]. Then x_n converges to the zero function point-wise but
not uniformly. Moreover, if f_n is continuous and f_n converges to f point-wise,
this does not imply in general that f is continuous. Indeed, let f_n(x) = 1/(nx + 1),
x ∈ [0, 1]. Then f_n converges to 0 point-wise everywhere except x = 0, where it
converges to 1. The limiting function is discontinuous. However, uniform
convergence implies continuity of the limit, as we are about to show.
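The tent-function example above can be checked numerically; a quick sketch (the particular values of n and t are arbitrary choices of ours):

```python
def x_n(n, t):
    """Tent function from the text: 0 at t = 0, peak 1 at t = 1/n, back to 0 at t = 2/n."""
    if t <= 1.0 / n:
        return n * t
    if t <= 2.0 / n:
        return n * (2.0 / n - t)
    return 0.0

# Point-wise: for each fixed t, x_n(t) -> 0 (for large n the support [0, 2/n] misses t) ...
assert all(x_n(10**6, t) == 0.0 for t in [0.0, 0.01, 0.3, 1.0])

# ... yet sup_t |x_n(t) - 0| = 1 for every n (attained at t = 1/n), so not uniform:
assert all(abs(x_n(n, 1.0 / n) - 1.0) < 1e-12 for n in [10, 100, 1000])
```

The peak simply escapes any fixed point as n grows, while its height never decreases.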
Proposition 3. Suppose f_n : S_1 → S_2 is a sequence of continuous mappings
which converges uniformly to f . Then f is continuous as well.

Proof. Fix x ∈ S_1 and ε > 0. There exists n_0 such that for all n > n_0,
sup_z ρ_2(f_n(z), f (z)) < ε/3. Fix any such n > n_0. Since, by assumption, f_n
is continuous, there exists δ > 0 such that ρ_2(f_n(x), f_n(y)) < ε/3 for all
y ∈ B_o(x, δ). Then for any such y we have

    ρ_2(f (x), f (y)) ≤ ρ_2(f (x), f_n(x)) + ρ_2(f_n(x), f_n(y)) + ρ_2(f_n(y), f (y)) < 3ε/3 = ε.

This proves continuity of f .
Theorem 2. The spaces C[0, T ], C[0, ∞) are Polish.

Problem 7. Use Proposition 3 (or anything else useful) to prove that C[0, T ] is
complete.

That C[0, T ] has a dense countable subset can be shown via approximations
by polynomials with rational coefficients (we skip the details).
3 Skorohod space and Skorohod metric

The space C[0, ∞) equipped with the uniform metric will be convenient when we
discuss Brownian motion and its applications later in the course, since Brownian
motion has continuous samples. Many important processes in practice, however,
including queueing processes, storage, manufacturing, supply chain, etc.,
are not continuous, due to the discrete quantities involved. As a result we need to
deal with probability concepts on spaces of not necessarily continuous functions.

Denote by D[0, ∞) the space of all functions x on [0, ∞) taking values in R,
or in general any metric space (S, ρ), such that x is right-continuous and has left
limits. Namely, for every t_0, lim_{t↑t_0} x(t) and lim_{t↓t_0} x(t) exist, and
lim_{t↓t_0} x(t) = x(t_0). As an example, think about a process describing the number
of customers in a branch of a bank. This process is described as a piece-wise
constant function. We adopt a convention that at a moment when a customer
arrives/departs, the number of customers is identified with the number of customers
right after the arrival/departure. This makes the process right-continuous. It also
has left limits, since it is piece-wise constant.
Similarly, define D[0, T ] to be the space of right-continuous functions on
[0, T ] with left limits. We will write shortly RCLL. On D[0, T ] and D[0, ∞) we
would like to define a metric which measures some proximity between the functions
(processes). We can try to use the uniform metric again. Let us consider
the following two processes x, y ∈ D[0, T ]. Fix θ ∈ [0, T ) and δ > 0 such that
θ + δ < T and define x(z) = 1{z ≥ θ}, y(z) = 1{z ≥ θ + δ}. We see that x
and y coincide everywhere except for a small interval [θ, θ + δ). It makes sense
to assume that these processes are close to each other. Yet ‖x − y‖_T = 1.

Thus the uniform metric is inadequate. For this reason Skorohod introduced the
so-called Skorohod metric. Before we define the Skorohod metric let us discuss
the idea behind it. The problem with the uniform metric was that the two processes
x, y described above were close to each other in the sense that one is a perturbed
version of the other, where the amount of perturbation is δ. In particular, consider
the following piece-wise linear function λ : [0, T ] → [0, T ] given by

    λ(t) = (θ/(θ + δ)) t,                               t ∈ [0, θ + δ];
    λ(t) = θ + ((T − θ)/(T − θ − δ)) (t − (θ + δ)),     t ∈ [θ + δ, T ].

We see that x(λ(t)) = y(t). In other words, we rescaled the time axis [0, T ] by
a small amount and made y close to (in fact identical to) x. This motivates the
following definition. From here on we use the following notations: x ∧ y stands
for min(x, y) and x ∨ y stands for max(x, y).
Definition 5. Let Λ be the space of strictly increasing continuous functions λ
from [0, T ] onto [0, T ]. The Skorohod metric on D[0, T ] is defined by

    ρ_s(x, y) = inf_{λ∈Λ} ( ‖λ − I‖ ∨ ‖x − y ∘ λ‖ ),

for all x, y ∈ D[0, T ], where I is the identity transformation, and ‖ · ‖ is
the uniform metric on D[0, T ].

Thus, per this definition, the distance between x and y is less than ε if there
exists λ ∈ Λ such that sup_{0≤t≤T} |λ(t) − t| < ε and
sup_{0≤t≤T} |x(t) − y(λ(t))| < ε.
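The step-function example above can be made concrete numerically. A small sketch with the illustrative choices θ = δ = 0.25 and the piece-wise linear time change λ described earlier: the uniform distance stays at 1, while ‖λ − I‖ ∨ ‖x ∘ λ − y‖ ≤ δ, so ρ_s(x, y) ≤ δ.

```python
T, theta, delta = 1.0, 0.25, 0.25                   # theta + delta < T as in the text

x = lambda z: 1.0 if z >= theta else 0.0            # x(z) = 1{z >= theta}
y = lambda z: 1.0 if z >= theta + delta else 0.0    # y(z) = 1{z >= theta + delta}

def lam(t):
    """Time change with lam(theta + delta) = theta and lam(T) = T."""
    if t <= theta + delta:
        return theta / (theta + delta) * t
    return theta + (T - theta) / (T - theta - delta) * (t - (theta + delta))

grid = [T * k / 10000 for k in range(10001)]
unif = max(abs(x(t) - y(t)) for t in grid)          # uniform distance: stuck at 1
warp = max(abs(lam(t) - t) for t in grid)           # ||lam - I|| <= delta
match = max(abs(x(lam(t)) - y(t)) for t in grid)    # ||x o lam - y||: exactly 0

assert unif == 1.0 and match == 0.0 and warp <= delta + 1e-9
```

So the Skorohod metric sees these two jumps as close, exactly as intended.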
Problem 8. Establish that ρ_s is a metric on D[0, T ].
Proposition 4. The Skorohod metric and the uniform metric are equivalent on
C[0, T ], in the sense that for x_n, x ∈ C[0, T ], the convergence x_n → x holds
under the Skorohod metric if and only if it holds under the uniform metric.

Proof. Clearly ‖x − y‖ ≥ ρ_s(x, y). So convergence under the uniform metric
implies convergence under the Skorohod metric. Suppose now ρ_s(x_n, x) → 0. We
need to show ‖x_n − x‖ → 0.

Consider any sequence λ_n ∈ Λ such that ‖λ_n − I‖ → 0 and ‖x ∘ λ_n − x_n‖ → 0.
Such a sequence exists since ρ_s(x_n, x) → 0 (check). We have

    ‖x − x_n‖ ≤ ‖x − x ∘ λ_n‖ + ‖x ∘ λ_n − x_n‖.

The second summand on the right-hand side converges to zero by the choice of
λ_n. Also since λ_n converges to I uniformly, and x is continuous on [0, T ] and
therefore uniformly continuous on [0, T ], then ‖x − x ∘ λ_n‖ → 0.
4 Additional reading materials

Billingsley [1], Appendix M1-M10.

References
[1] P. Billingsley, Convergence of probability measures, Wiley-Interscience
publication, 1999.

MIT OpenCourseWare
http://ocw.mit.edu

15.070J / 6.265J Advanced Stochastic Processes


Fall 2013

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.265/15.070J                                                      Fall 2013
Lecture 2                                                          9/9/2013

Large Deviations for i.i.d. Random Variables

Content. Chernoff bound using exponential moment generating functions.
Properties of moment generating functions. Legendre transforms.

1 Preliminary notes

The Weak Law of Large Numbers tells us that if X_1, X_2, . . . is an i.i.d. sequence
of random variables with mean μ = E[X_1] < ∞, then for every ε > 0

    P(|(X_1 + . . . + X_n)/n − μ| > ε) → 0,    as n → ∞.
But how quickly does this convergence to zero occur? We can try to use the
Chebyshev inequality, which says

    P(|(X_1 + . . . + X_n)/n − μ| > ε) ≤ Var(X_1)/(nε²).

This suggests a decay rate of order 1/n if we treat Var(X_1) and ε as constants.
Is this an accurate rate? Far from so ...

In fact, if a higher moment of X_1 is finite, for example E[X_1^{2m}] < ∞, then
using a similar bound one can show that the decay rate is at least 1/n^m (exercise).
The goal of the large deviations theory is to show that in many interesting cases
the decay rate is in fact exponential: e^{−cn}. The exponent c > 0 is called the
large deviations rate, and in many cases it can be computed explicitly or
numerically.

2 Large deviations upper bound (Chernoff bound)

Consider an i.i.d. sequence with a common probability distribution function
F (x) = P(X ≤ x), x ∈ R. Fix a value a > μ, where μ is again the expectation
corresponding to the distribution F . We consider the probability that the average
of X_1, . . . , X_n exceeds a. The WLLN tells us that this happens with probability
converging to zero as n increases, and now we obtain an estimate on this
probability. Fix a positive parameter θ > 0. We have

    P(Σ_{1≤i≤n} X_i / n > a) = P(Σ_{1≤i≤n} X_i > na)
                             = P(e^{θ Σ_{1≤i≤n} X_i} > e^{θna})
                             ≤ E[e^{θ Σ_{1≤i≤n} X_i}] / e^{θna}     (Markov inequality)
                             = E[Π_i e^{θX_i}] / (e^{θa})^n.

But recall that the X_i's are i.i.d. Therefore E[Π_i e^{θX_i}] = (E[e^{θX_1}])^n.
Thus we obtain an upper bound

    P(Σ_{1≤i≤n} X_i / n > a) ≤ (E[e^{θX_1}] / e^{θa})^n.            (1)

Of course this bound is meaningful only if the ratio E[e^{θX_1}]/e^{θa} is less than
unity. We recognize E[e^{θX_1}] as the moment generating function of X_1 and
denote it by M (θ). For the bound to be useful, we need E[e^{θX_1}] to be at least
finite. If we could show that this ratio is less than unity, we would be done:
exponentially fast decay of the probability would be established.
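As a concrete illustration of the bound (1) (this Bernoulli example is ours, not one worked in the text): for fair coin flips X_i ∈ {0, 1} and threshold a = 0.75, the optimized ratio M(θ)/e^{θa} is below 1, so the bound decays exponentially, far faster than the Chebyshev O(1/n) rate.

```python
import math

# Chernoff bound for fair-coin averages: X_i in {0, 1}, mean 1/2, threshold a = 0.75
n, a = 100, 0.75
M = lambda th: (1 + math.exp(th)) / 2     # M(th) = E[e^{th X_1}] for Bernoulli(1/2)

# minimize M(th)/e^{th a} over th > 0 by a crude grid search
ratio = min(M(th) / math.exp(th * a) for th in (k / 1000 for k in range(1, 5000)))
bound = ratio ** n                         # (M(th)/e^{th a})^n

# exact tail P(sum X_i > a n) for comparison
exact = sum(math.comb(n, k) for k in range(int(a * n) + 1, n + 1)) / 2 ** n

assert exact <= bound      # the bound is valid ...
assert bound < 1e-5        # ... and is already tiny at n = 100
```

The optimal θ here is near log 3, and the resulting per-step ratio is about 0.877.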
Similarly, suppose we want to estimate

    P(Σ_{1≤i≤n} X_i / n < a),

for some a < μ. Fixing now a negative θ < 0, we obtain

    P(Σ_{1≤i≤n} X_i / n < a) = P(e^{θ Σ_{1≤i≤n} X_i} > e^{θna}) ≤ (M (θ)/e^{θa})^n,

and now we need to find a negative θ such that M (θ) < e^{θa}. In particular, we
need to focus on θ for which the moment generating function is finite. For this
purpose let D(M ) ≜ {θ : M (θ) < ∞}. Namely, D(M ) is the set of values θ
for which the moment generating function is finite. Thus we call D(M ) the
domain of M .
3 Moment generating function. Examples and properties

Let us consider some examples of computing the moment generating functions.

Exponential distribution. Consider an exponentially distributed random
variable X with parameter λ. Then

    M (θ) = ∫_0^∞ e^{θx} λe^{−λx} dx = λ ∫_0^∞ e^{−(λ−θ)x} dx.

When θ < λ this integral is equal to λ [−(λ − θ)^{−1} e^{−(λ−θ)x}]_0^∞ = λ/(λ − θ).
But when θ ≥ λ, the integral is infinite. Thus the moment generating function is
finite iff θ < λ, and then M (θ) = λ/(λ − θ). In this case the domain of the
moment generating function is D(M ) = (−∞, λ).
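The closed form λ/(λ − θ) can be sanity-checked by numerical integration; a quick sketch with the arbitrary choices λ = 2, θ = 0.5:

```python
import math

lam, th = 2.0, 0.5                   # need th < lam for the integral to converge
f = lambda x: lam * math.exp(-lam * x) * math.exp(th * x)

# trapezoid rule on [0, 40]; the integrand decays like e^{-1.5 x}, so the tail is negligible
N, L = 200000, 40.0
h = L / N
mgf = h * (f(0) / 2 + sum(f(k * h) for k in range(1, N)) + f(L) / 2)

assert abs(mgf - lam / (lam - th)) < 1e-6   # matches lam/(lam - th) = 4/3
```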

Standard Normal distribution. When X has the standard Normal distribution,
we obtain

    M (θ) = E[e^{θX}] = (1/√(2π)) ∫ e^{θx} e^{−x²/2} dx
          = (1/√(2π)) ∫ e^{−(x² − 2θx + θ²)/2 + θ²/2} dx
          = e^{θ²/2} (1/√(2π)) ∫ e^{−(x−θ)²/2} dx.

Introducing the change of variables y = x − θ we obtain that the remaining
integral is (1/√(2π)) ∫ e^{−y²/2} dy = 1 (integral of the density of the standard
Normal distribution). Therefore M (θ) = e^{θ²/2}. We see that it is always
finite and D(M ) = R.

In retrospect it is not surprising that in this case M (θ) is finite for all θ.
The density of the standard Normal distribution decays like e^{−x²/2} and
this is faster than the exponential growth e^{θx}. So no matter how large
θ is, the overall product is finite.
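Again the closed form can be sanity-checked numerically (θ = 1.3 is an arbitrary choice of ours):

```python
import math

th = 1.3
phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # standard Normal density

# integrate e^{th x} phi(x) over [-20, 20]; tails beyond are negligible
N, L = 100000, 20.0
h = 2 * L / N
mgf = h * sum(math.exp(th * (-L + k * h)) * phi(-L + k * h) for k in range(N + 1))

assert abs(mgf - math.exp(th ** 2 / 2)) < 1e-6   # matches e^{th^2/2}
```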
Poisson distribution. Suppose X has a Poisson distribution with parameter
λ. Then

    M (θ) = E[e^{θX}] = Σ_{m≥0} e^{θm} e^{−λ} λ^m / m!
          = e^{−λ} Σ_{m≥0} (λe^θ)^m / m!
          = e^{λ(e^θ − 1)}

(where we use the formula Σ_{m≥0} t^m/m! = e^t). Thus again D(M ) = R.
This again has to do with the fact that λ^m/m! decays at a rate similar to
1/m!, which is faster than any exponential growth rate e^{θm}.

We now establish several properties of the moment generating functions.

Proposition 1. The moment generating function M (θ) of a random variable X
satisfies the following properties:

(a) M (0) = 1. If M (θ) < ∞ for some θ > 0 then M (θ') < ∞ for all
θ' ∈ [0, θ]. Similarly, if M (θ) < ∞ for some θ < 0 then M (θ') < ∞ for
all θ' ∈ [θ, 0]. In particular, the domain D(M ) is an interval containing
zero.

(b) Suppose (θ_1, θ_2) ⊂ D(M ). Then M (θ) as a function of θ is differentiable
in θ for every θ_0 ∈ (θ_1, θ_2), and furthermore,

    dM (θ)/dθ |_{θ=θ_0} = E[X e^{θ_0 X}] < ∞.

Namely, the order of the differentiation and expectation operators can be
exchanged.
Proof. Part (a) is left as an exercise. We now establish part (b). Fix any θ_0 ∈
(θ_1, θ_2) and consider the θ-indexed family of random variables

    Y_θ ≜ (exp(θX) − exp(θ_0 X)) / (θ − θ_0).

Since (d/dθ) exp(θx) = x exp(θx), almost surely Y_θ → X exp(θ_0 X) as
θ → θ_0. Thus to establish the claim it suffices to show that convergence
of expectations holds as well, namely lim_{θ→θ_0} E[Y_θ] = E[X exp(θ_0 X)], and
E[X exp(θ_0 X)] < ∞. For this purpose we will use the Dominated Convergence
Theorem. Namely, we will identify a random variable Z such that |Y_θ| ≤ Z
almost surely for all θ in some interval (θ_0 − ε, θ_0 + ε), and E[Z] < ∞.

Fix ε > 0 small enough so that (θ_0 − ε, θ_0 + ε) ⊂ (θ_1, θ_2). Let Z =
ε^{−1} exp(θ_0 X + ε|X|). Using the Taylor expansion of the exp(·) function, for
every θ ∈ (θ_0 − ε, θ_0 + ε), we have

    Y_θ = exp(θ_0 X) [ X + (1/2!)(θ − θ_0)X² + (1/3!)(θ − θ_0)²X³ + · · · + (1/n!)(θ − θ_0)^{n−1}X^n + · · · ],

which gives

    |Y_θ| ≤ exp(θ_0 X) [ |X| + (1/2!)|θ − θ_0||X|² + · · · + (1/n!)|θ − θ_0|^{n−1}|X|^n + · · · ]
         ≤ exp(θ_0 X) [ |X| + (1/2!)ε|X|² + · · · + (1/n!)ε^{n−1}|X|^n + · · · ]
         = exp(θ_0 X) ε^{−1} (exp(ε|X|) − 1)
         ≤ exp(θ_0 X) ε^{−1} exp(ε|X|)
         = Z.

It remains to show that E[Z] < ∞. We have

    E[Z] = ε^{−1} E[exp(θ_0 X + εX)1{X ≥ 0}] + ε^{−1} E[exp(θ_0 X − εX)1{X < 0}]
         ≤ ε^{−1} E[exp(θ_0 X + εX)] + ε^{−1} E[exp(θ_0 X − εX)]
         = ε^{−1} M (θ_0 + ε) + ε^{−1} M (θ_0 − ε)
         < ∞,

since ε was chosen so that (θ_0 − ε, θ_0 + ε) ⊂ (θ_1, θ_2) ⊂ D(M ). This completes
the proof of the proposition.
Problem 1.
(a) Establish part (a) of Proposition 1.

(b) Construct an example of a random variable for which the corresponding
interval is trivial: {0}. Namely, M (θ) = ∞ for every θ ≠ 0.

(c) Construct an example of a random variable X such that D(M ) = [θ_1, θ_2]
for some θ_1 < 0 < θ_2. Namely, the domain D(M ) is a non-zero length
closed interval containing zero.
Now suppose the i.i.d. sequence X_i, i ≥ 1 is such that 0 ∈ (θ_1, θ_2) ⊂
D(M ), where M is the moment generating function of X_1. Namely, M is finite
in a neighborhood of 0. Let a > μ = E[X_1]. Applying Proposition 1, let us
differentiate the ratio M (θ)/e^{θa} with respect to θ at θ = 0:

    d(M (θ)/e^{θa})/dθ |_{θ=0} = (E[X_1 e^{θX_1}]e^{θa} − ae^{θa} E[e^{θX_1}]) / e^{2θa} |_{θ=0} = μ − a < 0.

Note that M (θ)/e^{θa} = 1 when θ = 0. Therefore, for sufficiently small positive
θ, the ratio M (θ)/e^{θa} is smaller than unity, and (1) provides an exponential
bound on the tail probability for the average of X_1, . . . , X_n.

Similarly, if a < μ, the ratio M (θ)/e^{θa} < 1 for sufficiently small negative
θ.
We now summarize our findings.

Theorem 1 (Chernoff bound). Given an i.i.d. sequence X_1, . . . , X_n, suppose
the moment generating function M (θ) is finite in some interval (θ_1, θ_2) ∋ 0. Let
a > μ = E[X_1]. Then there exists θ > 0 such that M (θ)/e^{θa} < 1 and

    P(Σ_{1≤i≤n} X_i / n > a) ≤ (M (θ)/e^{θa})^n.

Similarly, if a < μ, then there exists θ < 0 such that M (θ)/e^{θa} < 1 and

    P(Σ_{1≤i≤n} X_i / n < a) ≤ (M (θ)/e^{θa})^n.
How small can we make the ratio M (θ)/exp(θa)? We have some freedom
in choosing θ as long as E[e^{θX_1}] is finite. So we could try to find θ which
minimizes the ratio M (θ)/e^{θa}. This is what we will do in the rest of the lecture.
The surprising conclusion of the large deviations theory is very often that such
a minimizing value θ* exists and is tight. Namely, it provides the correct decay
rate! In this case we will be able to say

    P(Σ_{1≤i≤n} X_i / n > a) ≈ exp(−I(a)n),

where I(a) = −log(M (θ*)/e^{θ*a}).

4 Legendre transforms

Theorem 1 gave us a large deviations bound (M (θ)/e^{θa})^n, which we rewrite as
e^{−n(θa − log M (θ))}. We now study in more detail the exponent θa − log M (θ).

Definition 1. The Legendre transform of a random variable X is the function

    I(a) ≜ sup_{θ∈R} (θa − log M (θ)).
Let us go over the examples of some distributions and compute their corresponding
Legendre transforms.

Exponential distribution with parameter λ. Recall that M (θ) = λ/(λ − θ)
when θ < λ and M (θ) = ∞ otherwise. Therefore

    I(a) = sup_{θ<λ} (θa − log λ + log(λ − θ)),

since the expression is −∞ for θ ≥ λ. Setting the derivative of g(θ) = θa −
log λ + log(λ − θ) equal to zero we obtain the equation a − 1/(λ − θ) = 0, which
has the unique solution θ = λ − 1/a. For the boundary cases, we have
θa − log λ + log(λ − θ) → −∞ when either θ → −∞ or θ → λ (check).
Therefore

    I(a) = a(λ − 1/a) − log λ + log(λ − λ + 1/a)
         = λa − 1 − log λ + log(1/a)
         = λa − 1 − log λ − log a.

The large deviations bound then tells us that when a > 1/λ

    P(Σ_{1≤i≤n} X_i / n > a) ≤ e^{−(λa − 1 − log λ − log a)n}.

Say λ = 1 and a = 1.2. Then the approximation gives us e^{−(0.2 − log 1.2)n}.

Note that we can obtain an exact expression for this tail probability. Indeed,
X_1, X_1 + X_2, . . . , X_1 + X_2 + · · · + X_n, . . . are the event times of a Poisson
process with rate λ = 1. Therefore we can compute the probability
P(Σ_{1≤i≤n} X_i > 1.2n) exactly: it is the probability that the Poisson process
has at most n − 1 events before time 1.2n. Thus

    P(Σ_{1≤i≤n} X_i / n > 1.2) = P(Σ_{1≤i≤n} X_i > 1.2n)
                               = Σ_{0≤k≤n−1} ((1.2n)^k / k!) e^{−1.2n}.

It is not at all clear how revealing this expression is. In hindsight, we
know that it is approximately e^{−(0.2 − log 1.2)n}, obtained via large deviations
theory.
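The agreement between the exact Poisson-process expression and the large deviations exponent can be seen numerically; a small sketch with λ = 1, a = 1.2 and the arbitrary choice n = 2000 (the (1/n) log of the exact tail sits within an O(log n / n) correction of −I(a)):

```python
import math

lam, a, n = 1.0, 1.2, 2000
I = lam * a - 1 - math.log(lam) - math.log(a)    # rate function: 0.2 - log 1.2

# exact tail via the Poisson process identity:
# P(X_1 + ... + X_n > a n) = P(at most n-1 events of a Poisson(a n) count)
mu = a * n
log_terms = [k * math.log(mu) - mu - math.lgamma(k + 1) for k in range(n)]
m = max(log_terms)
log_exact = m + math.log(sum(math.exp(t - m) for t in log_terms))  # log-sum-exp

# the exact exponent approaches the large deviations rate
assert abs(-log_exact / n - I) < 0.01
```

The log-sum-exp trick is needed because the individual Poisson terms underflow at this n.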
Standard Normal distribution. Recall that M (θ) = e^{θ²/2} when X_1 has
the standard Normal distribution. The expected value μ = 0. Thus we fix
a > 0 and obtain

    I(a) = sup_θ (θa − θ²/2) = a²/2,

achieved at θ = a. Thus for a > 0, the large deviations theory predicts
that

    P(Σ_{1≤i≤n} X_i / n > a) ≈ e^{−(a²/2)n}.

Again we could compute this probability directly. We know that Σ_{1≤i≤n} X_i / n
is distributed as a Normal random variable with mean zero and variance
1/n. Thus

    P(Σ_{1≤i≤n} X_i / n > a) = √(n/(2π)) ∫_a^∞ e^{−t²n/2} dt.

After a little bit of technical work one could show that this integral is
dominated by its part around a, namely ∫_a^{a+ε}, which is further approximated
by the value of the function itself at a, namely √(n/(2π)) e^{−(a²/2)n}. This is
consistent with the value given by the large deviations theory. Simply the
lower order magnitude term √(n/(2π)) disappears in the approximation on the
log scale.

Poisson distribution. Suppose X has a Poisson distribution with parameter
λ. Recall that in this case M (θ) = e^{λ(e^θ − 1)}. Then

    I(a) = sup_θ (θa − λ(e^θ − 1)).

Setting the derivative to zero we obtain θ = log(a/λ), and I(a) = a log(a/λ) −
(a − λ). Thus for a > λ, the large deviations theory predicts that

    P(Σ_{1≤i≤n} X_i / n > a) ≈ e^{−(a log(a/λ) − a + λ)n}.

In this case as well we can compute the large deviations probability explicitly.
The sum X_1 + · · · + X_n of Poisson random variables is also a Poisson random
variable with parameter λn. Therefore

    P(Σ_{1≤i≤n} X_i > an) = Σ_{m>an} ((λn)^m / m!) e^{−λn}.

But again it is hard to infer a more explicit rate of decay using this expression.
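The closed form for I(a) can be checked against a brute-force maximization over θ; a small sketch with the arbitrary choices λ = 1, a = 2:

```python
import math

lam, a = 1.0, 2.0
closed_form = a * math.log(a / lam) - (a - lam)   # a log(a/lam) - (a - lam)

# brute-force sup of theta*a - lam*(e^theta - 1) over a theta grid on [-5, 5]
numeric = max(th * a - lam * (math.exp(th) - 1)
              for th in (k / 10000 for k in range(-50000, 50001)))

assert abs(numeric - closed_form) < 1e-6
```

The grid maximum lands at θ ≈ log(a/λ) = log 2, matching the derivative computation above.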
5 Additional reading materials

Chapter 0 of [2]. This is a non-technical introduction to the field which
describes the motivation and various applications of the large deviations theory.
Soft reading.

Chapter 2.2 of [1].
References

[1] A. Dembo and O. Zeitouni, Large deviations techniques and applications,
Springer, 1998.

[2] A. Shwartz and A. Weiss, Large deviations for performance analysis,
Chapman and Hall, 1995.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.265/15.070J                                                      Fall 2013
Lecture 3                                                          9/11/2013

Large Deviations Theory. Cramér's Theorem

Content.
1. Cramér's Theorem.
2. Rate function and properties.
3. Change of measure technique.

1 Cramér's Theorem

We have established in the previous lecture that under some assumptions on
the Moment Generating Function (MGF) M (θ), an i.i.d. sequence of random
variables X_i, 1 ≤ i ≤ n with mean μ satisfies P(S_n ≥ a) ≤ exp(−nI(a)),
where S_n = n^{−1} Σ_{1≤i≤n} X_i, and I(a) ≜ sup_θ (θa − log M (θ)) is the
Legendre transform. The function I(a) is also commonly called the rate function
in the theory of Large Deviations. The bound implies

    lim sup_n (1/n) log P(S_n ≥ a) ≤ −I(a),

and we have indicated that the bound is tight. Namely, ideally we would like to
establish the limit

    lim_n (1/n) log P(S_n ≥ a) = −I(a).
Furthermore, we might be interested in more complicated rare events, beyond
the interval [a, ∞). For example, the likelihood P(S_n ∈ A) for some set
A ⊂ R not containing the mean value μ. The Large Deviations theory says that,
roughly speaking,

    lim_n (1/n) log P(S_n ∈ A) = −inf_{x∈A} I(x),        (1)

but unfortunately this statement is not precisely correct. Consider the following
example. Let X be an integer-valued random variable, and A = {m/p : m ∈
Z, p an odd prime}. Then for odd prime n, we have P(S_n ∈ A) = 1; but for
n = 2^k, we have P(S_n ∈ A) = 0. As a result, the limit
lim_n (1/n) log P(S_n ∈ A) in this case does not exist.
The sense in which the identity (1) holds is given by Cramér's Theorem below.

Theorem 1 (Cramér's Theorem). Given a sequence of i.i.d. real valued random
variables X_i, i ≥ 1 with a common moment generating function M (θ) =
E[exp(θX_1)], the following holds:

(a) For any closed set F ⊂ R,

    lim sup_n (1/n) log P(S_n ∈ F ) ≤ −inf_{x∈F} I(x).

(b) For any open set U ⊂ R,

    lim inf_n (1/n) log P(S_n ∈ U ) ≥ −inf_{x∈U} I(x).

We will prove the theorem only for the special case when D(M ) = R
(namely, the MGF is finite everywhere) and when the support of X is the entire
R. Namely, for every K > 0, P(X > K) > 0 and P(X < −K) > 0. For
example, a Gaussian random variable satisfies this property.
To see the power of the theorem, let us apply it to the tail of S_n. In the
following section we will establish that I(x) is a non-decreasing function on the
interval [μ, ∞). Furthermore, we will establish that if it is finite in some interval
containing x, it is also continuous at x. Thus fix a ≥ μ and suppose I is finite in
an interval containing a. Taking F to be the closed set [a, ∞) with a > μ, we
obtain from part (a)

    lim sup_n (1/n) log P(S_n ∈ [a, ∞)) ≤ −min_{x≥a} I(x) = −I(a).

Applying the second part of Cramér's Theorem, we obtain

    lim inf_n (1/n) log P(S_n ∈ [a, ∞)) ≥ lim inf_n (1/n) log P(S_n ∈ (a, ∞))
                                        ≥ −inf_{x>a} I(x)
                                        = −I(a).

Thus in this special case indeed the large deviations limit exists:

    lim_n (1/n) log P(S_n ≥ a) = −I(a).

The limit is insensitive to whether the inequality is strict, in the sense that we
also have

    lim_n (1/n) log P(S_n > a) = −I(a).

2 Properties of the rate function I

Before we prove this theorem, we will need to establish several properties of
I(x) and M (θ).

Proposition 1. The rate function I satisfies the following properties:

(a) I is a convex non-negative function satisfying I(μ) = 0. Furthermore, it is
an increasing function on [μ, ∞) and a decreasing function on (−∞, μ].
Finally, I(x) = sup_{θ≥0} (θx − log M (θ)) for every x ≥ μ, and I(x) =
sup_{θ≤0} (θx − log M (θ)) for every x ≤ μ.

(b) Suppose in addition that D(M ) = R and the support of X_1 is R. Then
I is a finite continuous function on R. Furthermore, for every x ∈ R we
have I(x) = θ_0 x − log M (θ_0), for some θ_0 = θ_0(x) satisfying

    x = M'(θ_0)/M (θ_0).                      (2)
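For the standard Normal, part (b) can be verified directly: M(θ) = e^{θ²/2} gives M'(θ)/M(θ) = θ, so θ_0(x) = x and I(x) = x²/2. A quick numeric sketch (the grid-search bounds are arbitrary choices of ours):

```python
import math

M = lambda th: math.exp(th ** 2 / 2)   # MGF of the standard Normal

def I_numeric(x, lo=-20.0, hi=20.0, steps=100000):
    """Brute-force sup_theta (theta*x - log M(theta)) on a grid."""
    thetas = (lo + (hi - lo) * k / steps for k in range(steps + 1))
    return max(th * x - math.log(M(th)) for th in thetas)

# theta_0 solving x = M'(th)/M(th) = th is th = x, giving I(x) = x^2/2
for x in [-1.5, 0.0, 0.7, 2.0]:
    assert abs(I_numeric(x) - x ** 2 / 2) < 1e-4
```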

Proof of part (a). Convexity is due to the fact that I(x) is a point-wise supremum.
Precisely, consider λ ∈ (0, 1):

    I(λx + (1 − λ)y) = sup_θ [θ(λx + (1 − λ)y) − log M (θ)]
                     = sup_θ [λ(θx − log M (θ)) + (1 − λ)(θy − log M (θ))]
                     ≤ λ sup_θ (θx − log M (θ)) + (1 − λ) sup_θ (θy − log M (θ))
                     = λI(x) + (1 − λ)I(y).

This establishes the convexity. Now since M (0) = 1, then I(x) ≥ 0 · x −
log M (0) = 0 and the non-negativity is established. By Jensen's inequality, we
have that

    M (θ) = E[exp(θX_1)] ≥ exp(E[θX_1]) = exp(θμ).

Therefore, log M (θ) ≥ θμ, namely, θμ − log M (θ) ≤ 0, implying I(μ) = 0 =
min_{x∈R} I(x).

Furthermore, if x > μ, then for θ < 0 we have θx − log M (θ) ≤ θ(x − μ) < 0.
This means that sup_θ (θx − log M (θ)) must be equal to sup_{θ≥0} (θx −
log M (θ)). Similarly we show that when x < μ, we have I(x) = sup_{θ≤0} (θx −
log M (θ)).

Next, the monotonicity follows from convexity. Specifically, the existence
of real numbers μ ≤ x < y such that I(x) > I(y) ≥ I(μ) = 0 violates
convexity (check). This completes the proof of part (a).
Proof of part (b). For any K > 0 we have

    lim inf_{θ→∞} (1/θ) log M(θ) = lim inf_{θ→∞} (1/θ) log ∫ exp(θx) dP(x)
        ≥ lim inf_{θ→∞} (1/θ) log ∫_K^∞ exp(θx) dP(x)
        ≥ lim inf_{θ→∞} (1/θ) log (exp(θK) P([K, ∞)))
        = K + lim inf_{θ→∞} (1/θ) log P([K, ∞))
        = K    (since supp(X1) = R, we have P([K, ∞)) > 0).

Since K is arbitrary,

    lim inf_{θ→∞} (1/θ) log M(θ) = ∞.

Similarly,

    lim inf_{θ→−∞} (1/|θ|) log M(θ) = ∞.

Therefore, writing θx − log M(θ) = θ(x − (1/θ) log M(θ)), we have that for each x,

    lim_{|θ|→∞} (θx − log M(θ)) = −∞.

From the previous lecture we know that M(θ) is differentiable (hence continuous).
Therefore the supremum of θx − log M(θ) is achieved at some finite value
θ0 = θ0(x), namely,

    I(x) = θ0 x − log M(θ0) < ∞,

where θ0 is found by setting the derivative of θx − log M(θ) to zero. Namely,
θ0 must satisfy (2). Since I is a finite convex function on R it is also continuous
(verify this). This completes the proof of part (b).
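The supremum defining I can also be evaluated numerically. The sketch below is our own illustration, not an example from the notes: it uses Bernoulli(p) increments, for which the supremum has the well-known closed form x log(x/p) + (1−x) log((1−x)/(1−p)) for 0 < x < 1, and checks a crude grid search against it.

```python
import math

# Numerical Legendre transform I(x) = sup_theta (theta*x - log M(theta)) for
# Bernoulli(p), where M(theta) = 1 - p + p*e^theta.  The grid search is a
# sketch, not a robust optimizer; bracket and resolution are ad hoc choices.

def I_numeric(x, p, lo=-30.0, hi=30.0, steps=200000):
    log_M = lambda t: math.log(1 - p + p * math.exp(t))
    best = -float("inf")
    for k in range(steps + 1):
        t = lo + (hi - lo) * k / steps
        best = max(best, t * x - log_M(t))
    return best

def I_exact(x, p):
    # relative entropy between Bernoulli(x) and Bernoulli(p)
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

p, x = 0.3, 0.6
print(I_numeric(x, p), I_exact(x, p))   # the two values should nearly agree
```

At x = μ = p the supremum is attained at θ = 0 and equals zero, matching Proposition 1(a).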
3 Proof of Cramér's Theorem

Now we are equipped to prove Cramér's Theorem.

Proof of Cramér's Theorem, Part (a). Fix a closed set F ⊂ R. Let a_+ =
min{x ∈ [μ, +∞) ∩ F} and a_- = max{x ∈ (−∞, μ] ∩ F}. Note that a_+ and
a_- exist since F is closed. If a_+ = μ then I(a_+) = I(μ) = 0 = min_{x∈R} I(x). Note
that log P(S_n ∈ F) ≤ 0, and statement (a) follows trivially. Similarly, if
a_- = μ, we also have statement (a). Thus, assume a_- < μ < a_+. Then
    P(S_n ∈ F) ≤ P(S_n ∈ [a_+, ∞)) + P(S_n ∈ (−∞, a_-]).

Define

    x_n ≜ P(S_n ∈ [a_+, ∞)),    y_n ≜ P(S_n ∈ (−∞, a_-]).

We already showed that

    P(S_n ≥ a_+) ≤ exp(−n(θa_+ − log M(θ))),    ∀ θ ≥ 0,

from which we have

    (1/n) log P(S_n ≥ a_+) ≤ −(θa_+ − log M(θ)),    ∀ θ ≥ 0,
    (1/n) log P(S_n ≥ a_+) ≤ −sup_{θ≥0} (θa_+ − log M(θ)) = −I(a_+).

The second equality in the last equation is due to the fact that the supremum
in I(a_+) is achieved at θ ≥ 0, which was established as a part of Proposition 1.
Thus, we have

    lim sup_{n→∞} (1/n) log P(S_n ≥ a_+) ≤ −I(a_+).    (3)

Similarly, we have

    lim sup_{n→∞} (1/n) log P(S_n ≤ a_-) ≤ −I(a_-).    (4)
Applying Proposition 1 we have I(a_+) = min_{x≥a_+} I(x) and I(a_-) = min_{x≤a_-} I(x).
Thus

    min{I(a_+), I(a_-)} = inf_{x∈F} I(x).    (5)

From (3)-(5), we have that

    lim sup_{n→∞} (1/n) log x_n ≤ −inf_{x∈F} I(x),    lim sup_{n→∞} (1/n) log y_n ≤ −inf_{x∈F} I(x),    (6)

which implies that

    lim sup_{n→∞} (1/n) log(x_n + y_n) ≤ −inf_{x∈F} I(x)

(you are asked to establish the last implication as an exercise). We have
established

    lim sup_{n→∞} (1/n) log P(S_n ∈ F) ≤ −inf_{x∈F} I(x).    (7)

The proof of the upper bound in statement (a) is complete.
Proof of Cramér's Theorem, Part (b). Fix an open set U ⊂ R. Fix ε > 0 and
find y such that I(y) ≤ inf_{x∈U} I(x) + ε. It is sufficient to show that

    lim inf_{n→∞} (1/n) log P(S_n ∈ U) ≥ −I(y),    (8)

since it will imply

    lim inf_{n→∞} (1/n) log P(S_n ∈ U) ≥ −inf_{x∈U} I(x) − ε,

and since ε > 0 was arbitrary, it will imply the result.
Thus we now establish (8). Assume y > μ. The case y < μ is treated
similarly. Find θ0 = θ0(y) such that

    I(y) = θ0 y − log M(θ0).

Such θ0 exists by Proposition 1. Since y > μ, then again by Proposition 1 we
may assume θ0 ≥ 0.
We will use the change-of-measure technique to obtain the lower bound. For
this, let X0 be a random variable defined by

    P(X0 ≤ z) = (1/M(θ0)) ∫_{−∞}^z exp(θ0 x) dP(x).
Now,

    E[X0] = (1/M(θ0)) ∫ x exp(θ0 x) dP(x) = M′(θ0)/M(θ0) = y,

where the second equality was established in the previous lecture, and the last
equality follows by the choice of θ0 and Proposition 1. Since U is open we can
find δ > 0 small enough so that (y − δ, y + δ) ⊂ U. Thus, we have

    P(S_n ∈ U) ≥ P(S_n ∈ (y − δ, y + δ))
        = ∫_{|n^{−1} Σ_i x_i − y| < δ} dP(x_1) ⋯ dP(x_n)
        = ∫_{|n^{−1} Σ_i x_i − y| < δ} exp(−θ0 Σ_i x_i) M^n(θ0) ∏_{1≤i≤n} M^{−1}(θ0) exp(θ0 x_i) dP(x_i).    (9)
Since θ0 is non-negative, we obtain the bound

    P(S_n ∈ (y − δ, y + δ)) ≥ exp(−θ0 yn − θ0 δn) M^n(θ0) ∫_{|n^{−1} Σ_i x_i − y| < δ} ∏_{1≤i≤n} M^{−1}(θ0) exp(θ0 x_i) dP(x_i).

However, we recognize the integral on the right-hand side of the inequality
above as the probability that the average n^{−1} Σ_{1≤i≤n} Y_i of n i.i.d. random
variables Y_i, 1 ≤ i ≤ n, distributed according to the distribution of X0, belongs
to the interval (y − δ, y + δ). Recall, however, that E[Y_i] = E[X0] = y (this is
how X0 was designed). Thus by the Weak Law of Large Numbers, this probability
converges to unity. As a result,

    lim_n (1/n) log ∫_{|n^{−1} Σ_i x_i − y| < δ} ∏_{1≤i≤n} M^{−1}(θ0) exp(θ0 x_i) dP(x_i) = 0.
We obtain

    lim inf_n (1/n) log P(S_n ∈ U) ≥ −θ0 y − θ0 δ + log M(θ0)
                                   = −I(y) − θ0 δ.

Recalling that θ0 depends on y only and sending δ to zero, we obtain (8). This
completes the proof of part (b).
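The change of measure used above can be made concrete. In the sketch below (a toy discrete distribution of our own choosing, not from the notes), the tilted law dP_θ/dP(x) = exp(θx)/M(θ) is constructed, and θ0 is chosen by bisection so that the tilted mean equals the target y, mirroring equation (2).

```python
import math

# Exponential tilting for a toy pmf: the tilted weights are p_i * e^{theta x_i}
# normalized by M(theta).  The tilted mean is increasing in theta, so bisection
# finds the theta_0 with tilted mean y; support and y are illustrative choices.

xs = [0.0, 1.0, 2.0, 3.0]
ps = [0.4, 0.3, 0.2, 0.1]            # original pmf, mean = 1.0

def tilted_mean(theta):
    w = [p * math.exp(theta * x) for p, x in zip(ps, xs)]
    M = sum(w)                       # M(theta)
    return sum(x * wi for x, wi in zip(xs, w)) / M

def find_theta(y, lo=-20.0, hi=20.0, iters=200):
    for _ in range(iters):
        mid = (lo + hi) / 2
        if tilted_mean(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

y = 2.2                              # target mean inside the support's hull
theta0 = find_theta(y)
print(theta0, tilted_mean(theta0))   # tilted mean should be (approximately) y
```

Since y exceeds the original mean, the resulting θ0 is positive, exactly as in the proof where y > μ forces θ0 ≥ 0.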

MIT OpenCourseWare
http://ocw.mit.edu

15.070J / 6.265J Advanced Stochastic Processes


Fall 2013

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.265/15.070J, Fall 2013
Lecture 4, 9/16/2013

Applications of the large deviation technique

Content.
1. Insurance problem
2. Queueing problem
3. Buffer overflow probability

1 Safety capital for an insurance company

Consider an insurance company which needs to decide on the amount of capital
S0 it needs to hold to avoid cash flow problems. Suppose the insurance premium
per month is a fixed (non-random) quantity C > 0. Suppose the claims
are i.i.d. random variables A_N ≥ 0 for the times N = 1, 2, …. Then the capital
at time N is S_N = S0 + Σ_{n=1}^N (C − A_n). The company wishes to avoid the
situation where the cash flow S_N is negative. Thus it needs to decide on the
capital S0 so that P(∃N, S_N ≤ 0) is small. Obviously this involves a tradeoff
between the smallness ε and the amount S0. Let us assume that the upper bound
ε = 0.001, namely 0.1%, is acceptable (in fact this is pretty close to the banking
regulation standards). We have

    P(∃N, S_N ≤ 0) = P(min_N (S0 + Σ_{n=1}^N (C − A_n)) ≤ 0)
                   = P(max_N Σ_{n=1}^N (A_n − C) ≥ S0).

If E[A1] ≥ C, we have P(max_N Σ_{n=1}^N (A_n − C) ≥ S0) = 1. Thus, the
interesting case is E[A1] < C (negative drift), and the goal is to determine the
starting capital S0 such that

    P(max_N Σ_{n=1}^N (A_n − C) ≥ S0) ≤ ε.
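This ruin probability is easy to estimate by simulation. The sketch below is an illustration under assumptions of our own (exponential claims with mean 1, premium C = 1.5, finite horizon, early truncation of deeply negative paths); it only exhibits the qualitative exponential decay in S0 that the theory below makes precise.

```python
import random

# Monte Carlo sketch of P(max_N sum_{n<=N}(A_n - C) >= S0) for i.i.d.
# exponential claims with mean 1 and premium C = 1.5 (negative drift).
# Horizon, truncation level and sample size are demo choices; truncating
# deeply negative walks introduces a negligible bias here.

random.seed(0)

def ruin_prob(s0, C=1.5, horizon=2000, trials=10000):
    ruined = 0
    for _ in range(trials):
        walk = 0.0
        for _ in range(horizon):
            walk += random.expovariate(1.0) - C
            if walk >= s0:           # capital S0 is exhausted
                ruined += 1
                break
            if walk < -50.0:         # far below 0: recovery is very unlikely
                break
    return ruined / trials

probs = {s0: ruin_prob(s0) for s0 in (2.0, 5.0, 10.0)}
print(probs)   # decays roughly geometrically in S0
```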

2 Buffer overflow in a queueing system

The following model is a variant of the classical so-called GI/GI/1 queueing
system. In application to communication systems this queueing system consists of
a single server, which processes some C > 0 number of communication packets
per unit of time. Here C is a fixed deterministic constant. Let A_n be the random
number of packets arriving at time n, and Q_n be the queue length at time n
(assume Q0 = 0). By recursion, we have that

    Q_N = max(Q_{N−1} + A_N − C, 0)
        = max(Q_{N−2} + A_{N−1} + A_N − 2C, A_N − C, 0)
        = max( max_{1≤n≤N} Σ_{k=1}^n (A_{N−k+1} − C), 0 ).

Notice that, in the distributional sense, we have

    Q_N =d max( max_{1≤n≤N} Σ_{k=1}^n (A_k − C), 0 ).

In steady state, i.e. N = ∞, we have

    Q = max( max_{n≥1} Σ_{k=1}^n (A_k − C), 0 ).

Our goal is to design the size of the queue length storage (buffer) B, so that
the likelihood that the number of packets in the queue exceeds B is small. In
communication applications this is important since every packet not fitting into
the buffer is dropped. Thus the goal is to find a buffer size B > 0 such that

    P(Q ≥ B) = P(max_{n≥1} Σ_{k=1}^n (A_k − C) ≥ B) ≤ ε.

If E[A1] ≥ C, we have P(Q ≥ B) = 1. So the interesting case is E[A1] < C
(negative drift).
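The recursion above can be simulated directly. The sketch below is our own illustration (Poisson(1) arrivals, C = 2, run length and levels B are demo choices) of the Lindley-type recursion and the resulting overflow frequencies.

```python
import math
import random

# Sketch of the recursion Q_N = max(Q_{N-1} + A_N - C, 0) with Poisson(1)
# arrivals and service capacity C = 2 (negative drift, E[A] = 1 < C).  We
# estimate the stationary overflow frequency P(Q >= B) from one long path.

random.seed(1)

def poisson(lam=1.0):
    # Knuth's method, adequate for small lam in a sketch
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

C, T = 2, 200000
q, counts = 0, {2: 0, 4: 0, 6: 0}
for _ in range(T):
    q = max(q + poisson() - C, 0)    # queue length recursion
    for B in counts:
        if q >= B:
            counts[B] += 1

freqs = {B: counts[B] / T for B in sorted(counts)}
print(freqs)   # decreasing in B, roughly geometrically
```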
3 Buffer overflow probability

We see that in both situations we need to estimate

    P(max_{n≥1} Σ_{k=1}^n (A_k − C) ≥ B).

We will do this asymptotically as B → ∞.

Theorem 1. Given an i.i.d. sequence A_n ≥ 0 for n ≥ 1 and C > E[A1],
suppose

    M(θ) = E[exp(θA)] < ∞  for θ in some interval [0, θ0).

Then

    lim_{B→∞} (1/B) log P(max_{n≥1} Σ_{k=1}^n (A_k − C) ≥ B) = −θ*,

where θ* = sup{θ > 0 : M(θ) < exp(Cθ)}.
Observe that since A_n is non-negative, the MGF E[exp(θA_n)] is finite for
θ < 0. Thus it is finite in an interval containing θ = 0, and applying the result
of Lecture 2 we can take the derivative of the MGF at zero. Then

    (d/dθ) M(θ)|_{θ=0} = E[A],    (d/dθ) exp(Cθ)|_{θ=0} = C.

Since E[A_n] < C, there exists θ > 0 small enough so that M(θ) < exp(Cθ),
and thus the set of θ > 0 for which this is the case is non-empty (see Figure 1).

[Figure 1: Illustration of the existence of θ such that M(θ) < exp(Cθ).]
The theorem says that, roughly speaking,

    P(max_n Σ_{k=1}^n (A_k − C) ≥ B) ≈ exp(−θ* B),

when B is large. Thus given ε we select B such that exp(−θ* B) ≤ ε, and we can
set B = (1/θ*) log(1/ε).

Example. Let A be a random variable uniformly distributed in [0, a] and C = 2.
Then the moment generating function of A is

    M(θ) = ∫_0^a exp(θt) a^{−1} dt = (exp(aθ) − 1)/(aθ).

Then

    θ* = sup{θ > 0 : M(θ) < exp(Cθ)} = sup{θ > 0 : (exp(aθ) − 1)/(aθ) < exp(2θ)}.

Case 1: a = 3. We have θ* = sup{θ > 0 : (exp(3θ) − 1)/(3θ) < exp(2θ)}, i.e.
θ* = 1.54078.
Case 2: a = 4. We have that {θ > 0 : (exp(4θ) − 1)/(4θ) < exp(2θ)} = ∅, since
E[A] = 2 = C.
Case 3: a = 2. We have that {θ > 0 : (exp(2θ) − 1)/(2θ) < exp(2θ)} = R_+,
and thus θ* = ∞, which implies that P(max_n Σ_{k=1}^n (A_k − C) ≥ B) = 0 by
Theorem 1 (indeed, here A ≤ C almost surely).
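The value quoted in Case 1 can be reproduced numerically. The sketch below finds the positive root of (exp(aθ) − 1)/(aθ) = exp(2θ) by bisection; the bracket [1e-6, 10] is an ad hoc choice that happens to contain the root.

```python
import math

# theta* = sup{theta > 0 : M(theta) < exp(C theta)} for A ~ U[0, a], C = 2.
# h(theta) > 0 exactly when M(theta) < exp(C*theta); since log M is convex
# with slope E[A] < C at zero, the set {h > 0, theta > 0} is an interval
# (0, theta*), so bisection on the sign of h is valid.

def h(theta, a=3.0, C=2.0):
    return math.exp(C * theta) - (math.exp(a * theta) - 1.0) / (a * theta)

lo, hi = 1e-6, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if h(mid) > 0:
        lo = mid
    else:
        hi = mid
theta_star = (lo + hi) / 2
print(round(theta_star, 5))   # 1.54078, matching Case 1
```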

Proof of Theorem 1. We will first prove an upper bound and then a lower bound.
Combining them yields the result. For the upper bound, we have that

    P(max_n Σ_{k=1}^n (A_k − C) ≥ B) ≤ Σ_{n=1}^∞ P(Σ_{k=1}^n (A_k − C) ≥ B)
        ≤ Σ_{n=1}^∞ P((1/n) Σ_{k=1}^n A_k ≥ C + B/n)
        ≤ Σ_{n=1}^∞ exp(−n(θ(C + B/n) − log M(θ)))    (∀ θ > 0)
        = exp(−θB) Σ_{n≥1} exp(−n(θC − log M(θ))).

Fix any θ such that θC > log M(θ); the inequality above gives

    P(max_n Σ_{k=1}^n (A_k − C) ≥ B) ≤ exp(−θB) Σ_{n≥0} exp(−n(θC − log M(θ)))
        = exp(−θB)[1 − exp(−(θC − log M(θ)))]^{−1}.

Simplification of the inequality above gives

    (1/B) log P(max_n Σ_{k=1}^n (A_k − C) ≥ B) ≤ −θ + (1/B) log([1 − exp(−(θC − log M(θ)))]^{−1}),

hence

    lim sup_{B→∞} (1/B) log P(max_n Σ_{k=1}^n (A_k − C) ≥ B) ≤ −θ,  for every θ with M(θ) < exp(Cθ).

Taking the supremum over all such θ yields the upper bound −θ*.
Next, we will derive the lower bound. For every n,

    P(max_n Σ_{k=1}^n (A_k − C) ≥ B) ≥ P(Σ_{k=1}^n (A_k − C) ≥ B).

Fix t > 0 and take n = Bt (we ignore the integrality of Bt, which is immaterial
asymptotically). Then

    P(max_n Σ_{k=1}^n (A_k − C) ≥ B) ≥ P(Σ_{k=1}^{Bt} (A_k − C) ≥ B).

Then, we have

    lim inf_{B→∞} (1/B) log P(max_n Σ_{k=1}^n (A_k − C) ≥ B)
        ≥ lim inf_{B→∞} (1/B) log P(Σ_{k=1}^{Bt} (A_k − C) ≥ Bt/t)
        = t lim inf_{B→∞} (1/(Bt)) log P(Σ_{k=1}^{Bt} (A_k − C) ≥ Bt/t)
        = t lim inf_{n→∞} (1/n) log P(Σ_{k=1}^n (A_k − C) ≥ n/t)
        = t lim inf_{n→∞} (1/n) log P((1/n) Σ_{k=1}^n A_k ≥ C + 1/t)
        ≥ −t inf_{x>C+1/t} I(x)    (by Cramér's theorem).

Since t > 0 is arbitrary, we conclude that

    lim inf_{B→∞} (1/B) log P(max_n Σ_{k=1}^n (A_k − C) ≥ B) ≥ −inf_{t>0} t inf_{x>C+1/t} I(x).

We claim that

    inf_{t>0} t inf_{x>C+1/t} I(x) = inf_{t>0} t I(C + 1/t).    (1)

Indeed, let x* = inf{x : I(x) = ∞} (possibly x* = ∞). If x* ≤ C, then
I(C + 1/t) = ∞ for every t, and both sides of (1) are infinite. Suppose C < x*.
If t is such that C + 1/t ≥ x*, then inf_{x>C+1/t} I(x) = ∞, so it does not make
sense to consider such t. Now for C + 1/t < x*, the function I is convex,
non-decreasing and finite on [E[A1], x*). Therefore it is continuous on [E[A1], x*),
which gives that

    inf_{x>C+1/t} I(x) = I(C + 1/t),

and the claim follows. Thus, we obtain

    lim inf_{B→∞} (1/B) log P(max_n Σ_{k=1}^n (A_k − C) ≥ B) ≥ −inf_{t>0} t I(C + 1/t).

An exercise in HW 2 shows that sup{θ > 0 : M(θ) < exp(Cθ)} = inf_{t>0} t I(C + 1/t),
which matches the upper bound and completes the proof.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.265/15.070J, Fall 2013
Lecture 5, 9/16/2013

Extension of LD to R^d and dependent processes. Gärtner-Ellis Theorem

Content.
1. Large Deviations in many dimensions
2. Gärtner-Ellis Theorem
3. Large Deviations for Markov chains

1 Large Deviations in R^d

Most of the developments in this lecture follow the book by Dembo and Zeitouni [1].
Let X_n ∈ R^d be i.i.d. random variables and A ⊂ R^d. Let S_n = Σ_{1≤i≤n} X_i.
The large deviations question is now regarding the existence of the limit

    lim_{n→∞} (1/n) log P(S_n/n ∈ A).

Given θ ∈ R^d, define M(θ) = E[exp((θ, X1))], where (·, ·) represents the inner
product of two vectors: (a, b) = Σ_i a_i b_i. Define I(x) = sup_{θ∈R^d} ((θ, x) −
log M(θ)), where again I(x) = ∞ is a possibility.

Theorem 1 (Cramér's theorem in multiple dimensions). Suppose M(θ) < ∞
for all θ ∈ R^d. Then
(a) for every closed set F ⊂ R^d,

    lim sup_{n→∞} (1/n) log P(S_n/n ∈ F) ≤ −inf_{x∈F} I(x);

(b) for every open set U ⊂ R^d,

    lim inf_{n→∞} (1/n) log P(S_n/n ∈ U) ≥ −inf_{x∈U} I(x).

Unfortunately, the theorem does not hold in full generality, and an additional
condition such as M(θ) < ∞ for all θ is needed. Known counterexamples
are somewhat involved and can be found in a paper by Dinwoodie [2],
which builds on an earlier work of Slaby [5]. The difficulty arises because there is
no longer a notion of monotonicity of I(x) as a function of the vector x. This
is not the tightest condition, and more general conditions are possible; see [1].
The proof of the theorem is skipped and can be found in [1].
Let us consider an example of application of Theorem 1. Let X_n =d N(0, Σ),
where d = 2 and

    Σ = ( 1    1/2
          1/2  1  ),    F = {(x1, x2) : 2x1 + x2 ≥ 5}.

Goal: prove that the limit lim_{n→∞} (1/n) log P(S_n/n ∈ F) exists, and compute it.
By the upper bound part,

    lim sup_{n→∞} (1/n) log P(S_n/n ∈ F) ≤ −inf_{x∈F} I(x).

We have

    M(θ) = E[exp((θ, X))].

Letting =d denote equality in distribution, we have

    (θ, X) =d N(0, θ^T Σ θ) = N(0, θ1² + θ1θ2 + θ2²),

where θ = (θ1, θ2). Thus

    M(θ) = exp(½(θ1² + θ1θ2 + θ2²)),
    I(x) = sup_{θ1,θ2} (θ1 x1 + θ2 x2 − ½(θ1² + θ1θ2 + θ2²)).

Let

    g(θ1, θ2) = θ1 x1 + θ2 x2 − ½(θ1² + θ1θ2 + θ2²).

From ∂g/∂θ_j = 0, we have that

    x1 − θ1 − ½θ2 = 0,    x2 − θ2 − ½θ1 = 0,

from which we have

    θ1 = (4/3)x1 − (2/3)x2,    θ2 = (4/3)x2 − (2/3)x1.

Then

    I(x1, x2) = (2/3)(x1² + x2² − x1 x2).

So we need to find

    inf_{x1,x2} (2/3)(x1² + x2² − x1 x2)    s.t. 2x1 + x2 ≥ 5    (x ∈ F).

This becomes a non-linear optimization problem. We apply the Karush-Kuhn-Tucker
conditions for min f s.t. g ≤ 0:

    ∇f + λ∇g = 0,    (1)
    λg = 0,    λ ≥ 0,    (2)

with f(x) = (2/3)(x1² + x2² − x1 x2) and g(x) = 5 − 2x1 − x2, which gives

    ((4/3)x1 − (2/3)x2, (4/3)x2 − (2/3)x1) − λ(2, 1) = 0,    λ(2x1 + x2 − 5) = 0.

If 2x1 + x2 − 5 ≠ 0, then λ = 0 and further x1 = x2 = 0. But this violates
2x1 + x2 ≥ 5. So we have 2x1 + x2 − 5 = 0, which implies x2 = 5 − 2x1.
Thus, we have a one-dimensional unconstrained minimization problem:

    min_{x1} (2/3)(x1² + (5 − 2x1)² − x1(5 − 2x1)) = min_{x1} (2/3)(7x1² − 25x1 + 25),

which gives x1 = 25/14, x2 = 5 − 2x1 = 10/7 and I(x1, x2) = 25/14 ≈ 1.79. Thus

    lim sup_{n→∞} (1/n) log P(S_n/n ∈ F) ≤ −25/14.

Applying the lower bound part of Cramér's Theorem, we obtain

    lim inf_{n→∞} (1/n) log P(S_n/n ∈ F) ≥ lim inf_{n→∞} (1/n) log P(S_n/n ∈ F°)    (3)
        ≥ −inf_{2x1+x2>5} I(x1, x2)    (4)
        = −25/14    (by continuity of I).

Combining, we obtain

    lim_{n→∞} (1/n) log P(S_n/n ∈ F) = lim inf_{n→∞} (1/n) log P(S_n/n ∈ F)
        = lim sup_{n→∞} (1/n) log P(S_n/n ∈ F)
        = −25/14 ≈ −1.79.
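The constrained minimization can be cross-checked numerically. The sketch below scans the active boundary 2x1 + x2 = 5 on a grid; the bracket and resolution are arbitrary choices of ours.

```python
# Minimize I(x1, x2) = (2/3)*(x1^2 + x2^2 - x1*x2) over the half-plane
# 2*x1 + x2 >= 5.  Since I is convex with unconstrained minimum at the
# origin (outside F), the constrained minimum sits on the boundary
# x2 = 5 - 2*x1, so a one-dimensional grid scan suffices for a sketch.

def I(x1, x2):
    return (2.0 / 3.0) * (x1 * x1 + x2 * x2 - x1 * x2)

best_x1, best_val = None, float("inf")
steps = 400000
for k in range(steps + 1):
    x1 = -10.0 + 20.0 * k / steps
    val = I(x1, 5.0 - 2.0 * x1)        # restrict to the active constraint
    if val < best_val:
        best_x1, best_val = x1, val

print(best_x1, 5 - 2 * best_x1, best_val)   # minimizer and minimum value
```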

2 Gärtner-Ellis Theorem

The Gärtner-Ellis Theorem deals with large deviations events when the sequence
X_n is not necessarily independent. One immediate application of this theorem
is large deviations for Markov chains, which we will discuss in the following
section.

Let X_n be a sequence of not necessarily independent random variables in
R^d. Then in general for S_n = Σ_{1≤k≤n} X_k the identity E[exp((θ, S_n))] =
(E[exp((θ, X1))])^n does not hold. Nevertheless there exists a broad set of
conditions under which the large deviations bounds hold. Thus consider a general
sequence of random variables Y_n ∈ R^d which stands for (1/n)S_n in the i.i.d.
case. Let Λ_n(θ) = (1/n) log E[exp(n(θ, Y_n))]. Note that for the i.i.d. case

    Λ_n(θ) = (1/n) log E[exp(n(θ, n^{−1}S_n))]
           = (1/n) log M^n(θ)
           = log M(θ)
           = log E[exp((θ, X1))].

Loosely speaking, the Gärtner-Ellis Theorem says that when the convergence

    Λ_n(θ) → Λ(θ)    (5)

takes place for some limiting function Λ, then under certain additional technical
assumptions, the large deviations principle holds for the rate function

    I(x) ≜ sup_θ ((θ, x) − Λ(θ)).    (6)
Formally:

Theorem 2. Given a sequence of random variables Y_n, suppose the limit Λ(θ) in
(5) exists for all θ ∈ R^d. Furthermore, suppose Λ(θ) is finite and differentiable
everywhere on R^d. Then the following large deviations bounds hold for I defined
by (6):

    lim sup_{n→∞} (1/n) log P(Y_n ∈ F) ≤ −inf_{x∈F} I(x),  for any closed set F ⊂ R^d;
    lim inf_{n→∞} (1/n) log P(Y_n ∈ U) ≥ −inf_{x∈U} I(x),  for any open set U ⊂ R^d.

As for Theorem 1, this is not the most general version of the theorem. The
version above is established as exercise 2.3.20 in [1]. More general versions can
be found there as well.

Can we use a Chernoff type argument to get an upper bound? In one dimension,
for θ > 0 we have

    P(Y_n ≥ a) = P(exp(nθY_n) ≥ exp(nθa)) ≤ exp(−n(θa − Λ_n(θ))).

So we can get an upper bound of the form

    −sup_{θ≥0} (θa − Λ_n(θ)).

In the i.i.d. case we used the fact that sup_{θ≥0} (θa − log M(θ)) = sup_θ (θa − log M(θ))
when a > μ = E[X]. But now we are dealing with the multidimensional case,
where such an identity does not make sense.
3 Large Deviations for finite state Markov chains

Let X_n be a finite state Markov chain with state space Ω = {1, 2, …, N}. The
transition matrix of this Markov chain is P = (P_{i,j}, 1 ≤ i, j ≤ N). We assume
that the chain is irreducible. Namely, there exists m > 0 such that P^{(m)}_{i,j} > 0 for
all pairs of states i, j, where P^{(m)} denotes the m-th power of P representing the
transition probabilities after m steps. Our goal is to derive the large deviations
bounds for the empirical means of the Markov chain. Namely, let f : Ω → R^d
be any function and let Y_n = f(X_n). Our goal is to derive the large deviations
bound for (1/n)S_n, where S_n = Σ_{1≤k≤n} Y_k. For this purpose we need to recall
the Perron-Frobenius Theorem.

Theorem 3. Let B = (B_{i,j}, 1 ≤ i, j ≤ N) denote a non-negative irreducible
matrix. Namely, B_{i,j} ≥ 0 for all i, j and there exists m such that all the elements
of B^m are strictly positive. Then B possesses an eigenvalue ρ, called the
Perron-Frobenius eigenvalue, which satisfies the following properties.

1. ρ > 0 is real.
2. For every eigenvalue λ of B, |λ| ≤ ρ, where |λ| is the modulus of the
(possibly complex) λ.
3. The left and right eigenvectors of B corresponding to ρ are unique up to a
constant multiple and have strictly positive components.

This theorem can be found in many books on linear algebra, for example [4].
The following corollary to the Perron-Frobenius Theorem shows that, essentially,
the rate of growth of the sequence of matrices B^n is ρ^n. Specifically:
Corollary 1. For every vector α = (α_j, 1 ≤ j ≤ N) with strictly positive
elements, the following holds:

    lim_n (1/n) log Σ_{1≤j≤N} B^n_{i,j} α_j = lim_n (1/n) log Σ_{1≤j≤N} B^n_{j,i} α_j = log ρ.

Proof. Let v be the right Perron-Frobenius eigenvector. Since all α_j and v_j are
strictly positive, we have

    (min_j α_j / max_j v_j) Σ_j B^n_{i,j} v_j ≤ Σ_j B^n_{i,j} α_j ≤ (max_j α_j / min_j v_j) Σ_j B^n_{i,j} v_j.

Therefore,

    lim_n (1/n) log Σ_{1≤j≤N} B^n_{i,j} α_j = lim_n (1/n) log Σ_{1≤j≤N} B^n_{i,j} v_j
        = lim_n (1/n) log(ρ^n v_i)
        = log ρ.

The second identity is established similarly, using the left eigenvector.
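The growth rate in Corollary 1 is easy to observe numerically. The sketch below uses an arbitrary positive 2x2 matrix and vector of our own choosing, and plain power iteration rather than a library eigensolver.

```python
import math

# For a positive matrix B, sum_j (B^n)_{i,j} alpha_j grows like rho^n, so
# (1/n) log(...) -> log(rho).  B, alpha and n below are demo choices.

B = [[0.5, 2.0],
     [1.0, 0.3]]
alpha = [1.0, 1.0]

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

# Power iteration for the Perron-Frobenius eigenvalue of B.
v = [1.0, 1.0]
for _ in range(500):
    w = mat_vec(B, v)
    s = max(w)
    v = [x / s for x in w]
rho = max(mat_vec(B, v))   # since v is normalized to have max entry 1

# Growth rate of B^n alpha for a moderately large n.
n, u = 60, alpha[:]
for _ in range(n):
    u = mat_vec(B, u)
growth = math.log(u[0]) / n   # (1/n) log sum_j (B^n)_{0,j} alpha_j

print(growth, math.log(rho))  # the two numbers nearly coincide
```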
Now, given a Markov chain X_n, a function f : Ω → R^d and a vector θ ∈ R^d,
consider the modified matrix P_θ = (e^{(θ, f(j))} P_{i,j}, 1 ≤ i, j ≤ N). Then P_θ is an
irreducible non-negative matrix, since P is such a matrix. Let ρ(P_θ) denote its
Perron-Frobenius eigenvalue.

Theorem 4. The sequence (1/n)S_n = (1/n) Σ_{1≤k≤n} f(X_k) satisfies the large
deviations bounds with rate function I(x) = sup_{θ∈R^d} ((θ, x) − log ρ(P_θ)).
Specifically, for every state i0, every closed set F ⊂ R^d and every open set
U ⊂ R^d, the following holds:

    lim sup_{n→∞} (1/n) log P(n^{−1}S_n ∈ F | X0 = i0) ≤ −inf_{x∈F} I(x),
    lim inf_{n→∞} (1/n) log P(n^{−1}S_n ∈ U | X0 = i0) ≥ −inf_{x∈U} I(x).
Proof. We will show that the sequence of functions Λ_n(θ) = (1/n) log E[e^{(θ,S_n)}]
has a limit which is finite and differentiable everywhere. Given the starting
state i0 we have

    log E[e^{(θ,S_n)}] = log Σ_{i1,…,in} P_{i0,i1} P_{i1,i2} ⋯ P_{i_{n−1},i_n} e^{Σ_{1≤k≤n} (θ, f(i_k))}
                      = log Σ_{1≤j≤N} P_θ^n(i0, j),

where P_θ^n(i, j) denotes the (i, j)-th entry of the matrix P_θ^n. Taking α_j = 1 and
applying Corollary 1, we obtain

    lim_n Λ_n(θ) = log ρ(P_θ).

Thus the Gärtner-Ellis Theorem can be applied, provided the differentiability of
log ρ(P_θ) with respect to θ can be established. Perron-Frobenius theory in fact
can be used to show that such differentiability indeed takes place. Details can be
found in the book by Lancaster [3], Theorem 7.7.1.
References
[1] A. Dembo and O. Zeitouni, Large deviations techniques and applications,
Springer, 1998.
[2] I. H. Dinwoodie, A note on the upper bound for i.i.d. large deviations, The
Annals of Probability 19 (1991), no. 4, 1732-1736.
[3] Peter Lancaster and Miron Tismenetsky, Theory of matrices, vol. 2, Academic
Press, New York, 1969.
[4] E. Seneta, Non-negative matrices and Markov chains, Springer-Verlag, New
York, 1981.
[5] M. Slaby, On the upper bound for large deviations of sums of i.i.d. random
vectors, The Annals of Probability (1988), 978-990.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.265/15.070J, Fall 2013
Lecture 6, 9/23/2013

Brownian motion. Introduction

Content.
1. A heuristic construction of a Brownian motion from a random walk.
2. Definition and basic properties of a Brownian motion.

Historical notes
- 1765: Jan Ingenhousz's observations of carbon dust in alcohol.
- 1828: Robert Brown observed that pollen grains suspended in water perform
a continual swarming motion.
- 1900: Bachelier's work "The theory of speculation" on using Brownian
motion to model stock prices.
- 1905: Einstein and Smoluchowski: physical interpretation of Brownian
motion.
- 1920s: Wiener's concise mathematical description.

Construction of a Brownian motion from a random walk

The developments in this lecture follow closely the book by Resnick [3].
In this section we provide a heuristic construction of a Brownian motion
from a random walk. The derivation below is not a proof. We will provide a
rigorous construction of a Brownian motion when we study the weak convergence
theory.

The large deviations theory predicts exponential decay of the probabilities
P(Σ_{1≤i≤n} X_i > an) when a > μ = E[X1] and E[e^{θX1}] is finite. Naturally, the
decay is slower the closer a is to μ. We considered only the case when a was a
constant. But what if a is a function of n: a = a_n? The Central Limit Theorem
tells us that the decay disappears when a_n − μ ≈ 1/√n. Recall:

Theorem 1 (CLT). Given an i.i.d. sequence (X_n, n ≥ 1) with E[X1] = μ and
var[X1] = σ², for every constant a,

    lim_n P( (Σ_{1≤i≤n} X_i − nμ)/(σ√n) ≤ a ) = (1/√(2π)) ∫_{−∞}^a e^{−t²/2} dt.

Now let us look at a sequence of partial sums S_n = Σ_{1≤i≤n} (X_i − μ). For
simplicity assume μ = 0, so that we look at S_n = Σ_{1≤i≤n} X_i. Can we say
anything about S_n as a function of n? In fact, let us make it a function of a real
variable t ∈ R_+ and rescale it by √n as follows. Define

    B_n(t) = Σ_{1≤i≤⌊nt⌋} X_i / √n

for every t ≥ 0.
Denote by N(μ, σ²) the distribution of a normal r.v. with mean μ
and variance σ².
1. For every fixed 0 ≤ s < t, by the CLT the distribution of

       Σ_{⌊ns⌋<i≤⌊nt⌋} X_i / (σ √(⌊nt⌋ − ⌊ns⌋))

   converges to the standard normal distribution as n → ∞. Ignoring the
   difference between ⌊nt⌋ − ⌊ns⌋ and n(t − s), which is at most 1, and
   writing

       B_n(t) − B_n(s) = √(t − s) · Σ_{ns<i≤nt} X_i / √(n(t − s)),

   we obtain that B_n(t) − B_n(s) converges in distribution to N(0, σ²(t − s)).

2. Fix t1 < t2 and consider B_n(t1) = Σ_{1≤i≤nt1} X_i/√n and B_n(t2) − B_n(t1) =
   Σ_{nt1<i≤nt2} X_i/√n. The two sums contain different elements of the sequence
   X1, X2, …. Since the sequence is i.i.d., B_n(t1) and B_n(t2) − B_n(t1) are
   independent. Namely, for every x1, x2,

       P(B_n(t1) ≤ x1, B_n(t2) − B_n(t1) ≤ x2) = P(B_n(t1) ≤ x1) P(B_n(t2) − B_n(t1) ≤ x2).

   This generalizes to any finite collection of increments B_n(t1), B_n(t2) −
   B_n(t1), …, B_n(t_k) − B_n(t_{k−1}). Thus B_n(t) has independent increments.

3. Given a small ε, for every t the difference B_n(t + ε) − B_n(t) converges
   in distribution to N(0, σ²ε). When ε is very small this difference is very
   close to zero with probability approaching 1 as n → ∞. Namely, B_n(t)
   is increasingly close to a continuous function as n → ∞.

4. B_n(0) = 0 by definition.
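The rescaled walk above can be simulated directly. The following sketch is our own illustration, using fair ±1 coin flips (mean 0, variance 1); it checks empirically that the increment B_n(t) − B_n(s) has variance close to t − s. Grid size and sample count are demo choices.

```python
import random

# B_n(t) = (1/sqrt(n)) * sum_{i <= nt} X_i with X_i = +/-1 fair coin flips.
# We estimate the mean and variance of B_n(t) - B_n(s) over many sample paths.

random.seed(2)

n, trials = 400, 4000
s_idx, t_idx = 100, 300            # s = 0.25, t = 0.75, so t - s = 0.5
incs = []
for _ in range(trials):
    walk, bs = 0.0, 0.0
    for i in range(1, t_idx + 1):
        walk += random.choice((-1.0, 1.0))
        if i == s_idx:
            bs = walk / n ** 0.5   # B_n(s)
    incs.append(walk / n ** 0.5 - bs)

m = sum(incs) / trials
v = sum((z - m) ** 2 for z in incs) / trials
print(round(m, 3), round(v, 3))    # mean near 0, variance near 0.5
```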

Definition

The Brownian motion is the limit B(t) of B_n(t) as n → ∞. We will delay
answering questions about the existence of this limit, as well as the sense in which
we take this limit (remember we are dealing with processes and not values here),
till future lectures, and for now simply postulate the existence of a process
satisfying the properties above. In the definition below, for every continuous
function ω ∈ C[0, ∞) we let B(t, ω) denote ω(t). This notation is more consistent
with the standard convention of denoting Brownian motion by B and its value at
time t by B(t).

Definition 1 (Wiener measure). Given Ω = C[0, ∞), the Borel σ-field B defined
on C[0, ∞) and any value σ > 0, a probability measure P satisfying the following
properties is called the Wiener measure:
1. P(B(0) = 0) = 1.
2. P has the independent increments property. Namely, for every 0 ≤ t1 <
⋯ < t_k < ∞ and x1, …, x_{k−1} ∈ R,

    P(ω ∈ C[0, ∞) : B(t2, ω) − B(t1, ω) ≤ x1, …, B(t_k, ω) − B(t_{k−1}, ω) ≤ x_{k−1})
        = ∏_{2≤i≤k} P(ω ∈ C[0, ∞) : B(t_i, ω) − B(t_{i−1}, ω) ≤ x_{i−1}).

3. For every 0 ≤ s < t the distribution of B(t) − B(s) is normal N(0, σ²(t − s)).
In particular, the variance is a linear function of the length of the time
increment t − s, and the increments are stationary.

The stochastic process B described by this probability space (C[0, ∞), B, P)
is called Brownian motion. When σ = 1, it is called the standard Brownian
motion.

Theorem 2 (Existence of the Wiener measure). For every σ ≥ 0 there exists a
unique Wiener measure.
(What is the Wiener measure when σ = 0?) As we mentioned before, we delay
the proof of this fundamental result. For now just assume that the theorem holds
and study the properties.

Remarks:
- In the future we will not be explicitly writing samples ω when discussing
Brownian motion. Also, when we say B(t) is a Brownian motion, we
understand it both as a Wiener measure or simply a sample of it, depending
on the context. There should be no confusion.
- It turns out that for any given σ such a probability measure is unique. On
the other hand, note that if B(t) is a Brownian motion, then −B(t) is also
a Brownian motion. Simply check that all of the conditions of the Wiener
measure hold. Why is there no contradiction?
- Sometimes we will consider a Brownian motion which does not start at
zero: B̂(0) = x for some value x ≠ 0. We may define this process as
B̂(t) = x + B(t), where B is a Brownian motion.

Problem 1.
1. Let Ω be the space of all (not necessarily continuous) functions ω : R_+ → R.
   (i) Construct an example of a stochastic process in Ω which satisfies
       conditions (a)-(c) of the Brownian motion, but such that every path
       is almost surely discontinuous.
   (ii) Construct an example of a stochastic process in Ω which satisfies
       conditions (a)-(c) of the Brownian motion, but such that every path
       is almost surely discontinuous in every point t ∈ [0, 1].
   HINT: work with the Brownian motion.
2. Suppose B(t) is a stochastic process defined on the set of all (not necessarily
continuous) functions x : R_+ → R satisfying properties (a)-(c) of
Definition 1. Prove that for every t ≥ 0, lim_n B(t + 1/n) = B(t) almost
surely.

Problem 2. Let A_R be the set of all homogeneous linear functions x(t) = at,
where a varies over all values a ∈ R, and let B(t) denote the standard Brownian
motion. Prove that P(B ∈ A_R) = 0.

Properties

We now derive several properties of a Brownian motion. We assume that B(t)
is a standard Brownian motion.

Joint distribution. Fix 0 < t1 < t2 < ⋯ < t_k. Let us find the joint distribution
of the random vector (B(t1), …, B(t_k)). Given x1, …, x_k ∈ R, let us find
the joint density of (B(t1), B(t2), …, B(t_k)) at (x1, …, x_k). It is equal to
the joint density of (B(t1), B(t2) − B(t1), …, B(t_k) − B(t_{k−1})) at (x1, x2 −
x1, …, x_k − x_{k−1}), which by the independent Gaussian increments property of
the Brownian motion is equal to

    ∏_{i=0}^{k−1} (2π(t_{i+1} − t_i))^{−1/2} exp(−(x_{i+1} − x_i)²/(2(t_{i+1} − t_i))),

with the convention t0 = x0 = 0.
Differential property. For any s > 0, B_s(t) ≜ B(t + s) − B(s), t ≥ 0,
is a Brownian motion. Indeed, B_s(0) = 0 and the process has independent
increments B_s(t2) − B_s(t1) = B(t2 + s) − B(t1 + s), which have a Gaussian
distribution with variance t2 − t1.

Scaling. For every c, cB(t) is a Brownian motion with variance σ² = c².
Indeed, continuity and the stationary independent increments properties, as well
as the Gaussian distribution of the increments, follow immediately. The variance
of the increment cB(t2) − cB(t1) is c²(t2 − t1).

For every c > 0, B(t/c) is a Brownian motion with variance 1/c. Indeed, the
process is continuous. The increments are stationary, independent, with Gaussian
distribution. For every t1 < t2, by definition of the standard Brownian motion,
the variance of B(t2/c) − B(t1/c) is (t2 − t1)/c = (1/c)(t2 − t1).

Combining these two properties, we obtain that √c · B(t/c) is also a standard
Brownian motion.

Covariance. Fix 0 ≤ s ≤ t. Let us compute the covariance:

    Cov(B(t), B(s)) = E[B(t)B(s)] − E[B(t)]E[B(s)]
        = E[B(t)B(s)]    (a)
        = E[(B(s) + B(t) − B(s))B(s)]
        = E[B²(s)] + E[B(t) − B(s)]E[B(s)]    (b)
        = s + 0 = s.

Here (a) follows since E[B(t)] = 0 for all t, and (b) follows since by definition
of the standard Brownian motion we have E[B²(s)] = s, and by the independent
increments property we have E[(B(t) − B(s))B(s)] = E[(B(t) − B(s))(B(s) −
B(0))] = E[B(t) − B(s)] E[B(s) − B(0)] = 0, since increments have a zero
mean Gaussian distribution.
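The covariance identity Cov(B(t), B(s)) = min(s, t) can be checked empirically. The sketch below simulates standard Brownian motion on a grid via independent Gaussian increments; grid size and sample count are demo choices.

```python
import random

# Empirical check of Cov(B(t), B(s)) = s for s <= t, simulating B on a grid.

random.seed(3)

n, trials = 200, 10000
s_idx, t_idx = 60, 160             # s = 0.3, t = 0.8
dt = 1.0 / n

samples = []
for _ in range(trials):
    b, bs = 0.0, 0.0
    for i in range(1, t_idx + 1):
        b += random.gauss(0.0, dt ** 0.5)   # B(i/n) - B((i-1)/n) ~ N(0, dt)
        if i == s_idx:
            bs = b                          # B(s)
    samples.append((bs, b))

ms = sum(p[0] for p in samples) / trials
mt = sum(p[1] for p in samples) / trials
cov = sum((p[0] - ms) * (p[1] - mt) for p in samples) / trials
print(round(cov, 3))               # should be close to s = 0.3
```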
Time reversal. Given a standard Brownian motion B(t), consider the process
B^(1)(t) defined by B^(1)(t) = tB(1/t) for all t > 0 and B^(1)(0) = 0. In other
words, we reverse time by the transformation t → 1/t. We claim that B^(1)(t) is
also a standard Brownian motion.

Proof. We need to verify properties (a)-(c) of Definition 1 plus continuity. The
continuity at any point t > 0 follows immediately, since 1/t is a continuous
function and B is continuous by assumption; therefore tB(1/t) is continuous for
all t > 0. The continuity at t = 0 is the most difficult part of the proof, and we
delay it till the end. For now let us check (a)-(c).
(a) follows since we defined B^(1)(0) to be zero.
We delay (b) till we establish normality in (c).
(c) Take any s < t. Write tB(1/t) − sB(1/s) as

    tB(1/t) − sB(1/s) = (t − s)B(1/t) + sB(1/t) − sB(1/s).

The distribution of B(1/s) − B(1/t) is Gaussian with zero mean and variance
1/s − 1/t, since B is a standard Brownian motion. By the scaling property, the
distribution of sB(1/t) − sB(1/s) is zero mean Gaussian with variance s²(1/s − 1/t).
The distribution of (t − s)B(1/t) is zero mean Gaussian with variance (t − s)²(1/t),
and it is independent from sB(1/t) − sB(1/s) by the independent increments
properties of the Brownian motion. Therefore tB(1/t) − sB(1/s) is zero mean
Gaussian with variance

    s²(1/s − 1/t) + (t − s)²(1/t) = t − s.

This proves (c).
We now return to (b). Take any t1 < t2 < t3. We established in (c) that all
the differences B^(1)(t2) − B^(1)(t1), B^(1)(t3) − B^(1)(t2), and B^(1)(t3) − B^(1)(t1) =
(B^(1)(t3) − B^(1)(t2)) + (B^(1)(t2) − B^(1)(t1)) are zero mean Gaussian with
variances t2 − t1, t3 − t2 and t3 − t1, respectively. In particular, the variance of
B^(1)(t3) − B^(1)(t1) is the sum of the variances of B^(1)(t3) − B^(1)(t2) and
B^(1)(t2) − B^(1)(t1). This implies that the covariance of the summands is zero.
Moreover, from part (c) it is not difficult to establish that B^(1)(t3) − B^(1)(t2)
and B^(1)(t2) − B^(1)(t1) are jointly Gaussian. Recall that two jointly Gaussian
random variables are independent if and only if their covariance is zero.
It remains to prove the continuity at zero of B^(1)(t). We need to show the
continuity almost surely, so that the zero measure set corresponding to the
samples ω ∈ C[0, ∞) where the continuity does not hold can be thrown away.
Thus, we need to show that the probability measure of the set

    A = {ω ∈ C[0, ∞) : lim_{t→0} tB(1/t, ω) = 0}

is equal to unity.
We will use the Strong Law of Large Numbers (SLLN). First set t = 1/n
and consider tB(1/t) = B(n)/n. Because of the independent Gaussian increments
property, B(n) = Σ_{1≤i≤n} (B(i) − B(i−1)) is the sum of n i.i.d. standard
normal random variables. By the SLLN we have then B(n)/n →
E[B(1) − B(0)] = 0 a.s. We showed convergence to zero along the sequence
t = 1/n almost surely. Now we need to take care of the other values of t or,
equivalently, values s ∈ [n, n + 1). For any such s we have

    |B(s)/s − B(n)/n| ≤ |B(s)/s − B(n)/s| + |B(n)/s − B(n)/n|
        ≤ |B(n)| |1/s − 1/n| + (1/n) sup_{n≤s≤n+1} |B(s) − B(n)|
        ≤ |B(n)|/n² + (1/n) sup_{n≤s≤n+1} |B(s) − B(n)|.

We know from the SLLN that B(n)/n → 0 a.s. Hence also

    B(n)/n² → 0    (1)

a.s. Now consider the second term and set Z_n = sup_{n≤s≤n+1} |B(s) − B(n)|.
We claim that for every ε > 0,

    P(Z_n/n > ε i.o.) = P(ω ∈ C[0, ∞) : Z_n(ω)/n > ε i.o.) = 0,    (2)

where i.o. stands for "infinitely often". Suppose (2) was indeed the case. The
equality means that for almost all samples ω the inequality Z_n(ω)/n > ε happens
for at most finitely many n. This means exactly that for almost all ω (that
is, a.s.) Z_n(ω)/n → 0 as n → ∞. Combining with (1) we would conclude that
a.s.

    sup_{n≤s≤n+1} |B(s)/s − B(n)/n| → 0

as n → ∞. Since we already know that B(n)/n → 0, we would conclude that
a.s. lim_{s→∞} B(s)/s = 0, and this means almost sure continuity of B^(1)(t) at
zero.
It remains to show (2). We observe that due to the independent stationary
increments property, the distribution of Zn is the same as that of Z1. This is
the distribution of the maximum of the absolute value of a standard Brownian
motion over the interval [0, 1]. In the following lecture we will show that this
maximum has finite expectation: E[|Z1|] < ∞. On the other hand,

    ε⁻¹ E[|Z1|] = ∫_0^∞ P(|Z1| > εx) dx = Σ_{n=0}^∞ ∫_n^{n+1} P(|Z1| > εx) dx
                ≥ Σ_{n=0}^∞ P(|Z1| > (n + 1)ε).

We conclude that the sum on the right-hand side is finite. Since the Zn are
identically distributed,

    Σ_{n≥1} P(|Zn|/n > ε) = Σ_{n≥1} P(|Z1|/n > ε) = Σ_{n≥1} P(|Z1| > nε) < ∞.

Thus the sum on the left-hand side is finite. Now we use the Borel-Cantelli
Lemma to conclude that (2) indeed holds.
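The SLLN step used above is easy to see numerically. The following sketch (an illustration, not part of the notes; the sample sizes are arbitrary choices) builds B(1), ..., B(n) from i.i.d. standard normal increments and watches B(n)/n shrink toward zero:

```python
import random

# Illustration of the SLLN step: B(n) is a sum of n i.i.d. N(0, 1)
# increments, so B(n)/n -> E[B(1) - B(0)] = 0 almost surely.
rng = random.Random(0)

def brownian_at_integers(n, rng):
    """Return the list [B(1), ..., B(n)] for one sampled path."""
    b, path = 0.0, []
    for _ in range(n):
        b += rng.gauss(0.0, 1.0)  # independent Gaussian increment
        path.append(b)
    return path

path = brownian_at_integers(100_000, rng)
ratios = {n: path[n - 1] / n for n in (100, 10_000, 100_000)}
print(ratios)  # B(n)/n shrinks as n grows (its standard deviation is 1/sqrt(n))
```

Since B(n) ~ N(0, n), the ratio B(n)/n has standard deviation n^{−1/2}, which is the rate visible in the printed values.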

Additional reading materials

- Sections 6.1 and 6.4 from Chapter 6 of Resnick's book "Adventures in Stochastic Processes" in the course packet.
- Durrett [2], Section 7.1.
- Billingsley [1], Chapter 8.

References
[1] P. Billingsley, Convergence of probability measures, Wiley-Interscience, 1999.
[2] R. Durrett, Probability: theory and examples, Duxbury Press, second edition, 1996.
[3] S. Resnick, Adventures in stochastic processes, Birkhäuser Boston, Inc., 1992.

MIT OpenCourseWare
http://ocw.mit.edu

15.070J / 6.265J Advanced Stochastic Processes


Fall 2013

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.265/15.070J                                    Fall 2013
Lecture 7                                        9/25/2013

The Reflection Principle. The Distribution of the Maximum. Brownian motion with drift

Content.
1. Quick intro to stopping times
2. Reflection principle
3. Brownian motion with drift

1 Technical preliminary: stopping times

Stopping times are, loosely speaking, rules by which we interrupt the process
without looking at the process after it was interrupted. For example, "sell your
stock the first time it hits $20 per share" is a stopping rule. Whereas "sell your
stock one day before it hits $20 per share" is not a stopping rule, since we do
not know the day (if any) when it hits this price.
Given a stochastic process {Xt}_{t≥0} with t ∈ Z+ or t ∈ R+, a random variable
T is called a stopping time if for every time t the event {T ≤ t} is completely
determined by the history {Xs}_{0≤s≤t}.
This is not a formal definition. The formal definition will be given later when
we study filtration. Then we will give the definition in terms of the underlying
(Ω, F, P). For now, though, let us just adopt this loose definition.
2 The Reflection principle. The distribution of the maximum

The goal of this section is to obtain the distribution of

    M(t) ≜ sup_{0≤s≤t} B(s)
for any given t. Surprisingly, the resulting expression is very simple and follows
from one of the key properties of the Brownian motion: the reflection principle.
Given a > 0, define

    Ta = inf{t : B(t) = a},

the first time when the Brownian motion hits level a. When no such time exists
we define Ta = ∞, although we now show that Ta is finite almost surely.
Proposition 1. Ta < ∞ almost surely.
Proof. Note that if B hits some level b ≥ a almost surely, then by continuity
and since B(0) = 0, it hits level a almost surely. Therefore, it suffices to prove
that lim sup_{t→∞} B(t) = ∞ almost surely. This in turn will follow from
lim sup_{n→∞} B(n) = ∞ almost surely.
Problem 1. Prove that lim sup_{n→∞} B(n) = ∞ almost surely.

The differential property of the Brownian motion suggests that

    B(Ta + s) − B(Ta) = B(Ta + s) − a    (1)

is also a Brownian motion, independent from B(t), t ≤ Ta. The only issue here
is that Ta is a random time, while the differential property was established for
fixed times t. It turns out (we do not prove this) that the differential property
also holds for the random time Ta, since it is a stopping time and is finite almost
surely. The first claim is an immediate consequence of the definition: we can
determine whether Ta ≤ t by looking at the path B(u), 0 ≤ u ≤ t. The almost sure
finiteness was established in Proposition 1. The property (1) is called the strong
independent increments property of the Brownian motion.
Theorem 1 (The reflection principle). Given a standard Brownian motion
B(t), for every a ≥ 0,

    P(M(t) ≥ a) = 2 P(B(t) ≥ a) = (2/√(2πt)) ∫_a^∞ e^{−x²/(2t)} dx.

Proof. We have

    P(B(t) ≥ a) = P(B(t) ≥ a, M(t) ≥ a) + P(B(t) ≥ a, M(t) < a).    (2)
Note, however, that P(B(t) ≥ a, M(t) < a) = 0, since M(t) ≥ B(t). Now

    P(B(t) ≥ a, M(t) ≥ a) = P(B(t) ≥ a | M(t) ≥ a) P(M(t) ≥ a)
                          = P(B(t) ≥ a | Ta ≤ t) P(M(t) ≥ a).

We have that B(Ta + s) − a is a Brownian motion. Conditioned on Ta ≤ t,

    P(B(t) ≥ a | Ta ≤ t) = P(B(Ta + (t − Ta)) − a ≥ 0 | Ta ≤ t) = 1/2,

since the Brownian motion satisfies P(B(t) ≥ 0) = 1/2 for every t. Applying
this identity, we obtain

    P(B(t) ≥ a) = (1/2) P(M(t) ≥ a).

This establishes the required identity.
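The theorem lends itself to a quick Monte Carlo sanity check. The sketch below (an illustration, not part of the notes; grid size, path count and tolerance are arbitrary choices) simulates Brownian paths on a fine grid, estimates P(M(1) ≥ a), and compares with 2P(B(1) ≥ a):

```python
import math
import random

# Estimate P(M(1) >= a) by simulation and compare with the
# reflection-principle value 2 * P(B(1) >= a) = 2 * (1 - Phi(a)).
rng = random.Random(1)
a, n_steps, n_paths = 1.0, 500, 4000
dt = 1.0 / n_steps

hits = 0
for _ in range(n_paths):
    b, running_max = 0.0, 0.0
    for _ in range(n_steps):
        b += rng.gauss(0.0, math.sqrt(dt))  # B(t + dt) - B(t) ~ N(0, dt)
        running_max = max(running_max, b)
    hits += running_max >= a

estimate = hits / n_paths
phi = 0.5 * (1 + math.erf(a / math.sqrt(2)))  # standard normal CDF at a
exact = 2 * (1 - phi)
print(estimate, exact)  # close, up to Monte Carlo and discretization error
```

The discrete grid can only undershoot the continuous supremum, so the estimate sits slightly below the exact value; refining the grid shrinks that bias.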
We now establish the joint probability distribution of M(t) and B(t).

Proposition 2. For every a > 0, y ≥ 0,

    P(M(t) ≥ a, B(t) ≤ a − y) = P(B(t) > a + y).    (3)
Proof. We have

    P(B(t) > a + y) = P(B(t) > a + y, M(t) ≥ a) + P(B(t) > a + y, M(t) < a)
                    = P(B(t) > a + y, M(t) ≥ a)
                    = P(B(Ta + (t − Ta)) − a > y | M(t) ≥ a) P(M(t) ≥ a).

But since B(Ta + (t − Ta)) − a, by the differential property, is also a Brownian
motion, then, by symmetry,

    P(B(Ta + (t − Ta)) − a > y | M(t) ≥ a)
      = P(B(Ta + (t − Ta)) − a < −y | M(t) ≥ a)
      = P(B(t) < a − y | M(t) ≥ a).

We conclude

    P(B(t) > a + y) = P(B(t) < a − y | M(t) ≥ a) P(M(t) ≥ a)
                    = P(B(t) < a − y, M(t) ≥ a).

We now compute the Laplace transform of the hitting time Ta.

Proposition 3. For every λ > 0,

    E[e^{−λTa}] = e^{−√(2λ) a}.
Proof. We first compute the density of Ta. We have

    P(Ta ≤ t) = P(M(t) ≥ a) = 2 P(B(t) ≥ a)
              = (2/√(2πt)) ∫_a^∞ e^{−x²/(2t)} dx = 2(1 − N(a/√t)),

where N is the standard normal distribution function. By differentiating with
respect to t we obtain that the density of Ta is given by

    (a/√(2π)) t^{−3/2} e^{−a²/(2t)}.
Therefore

    E[e^{−λTa}] = ∫_0^∞ e^{−λt} (a/√(2π)) t^{−3/2} e^{−a²/(2t)} dt.

Computing this integral is a boring exercise in calculus. We just state the result,
which is e^{−√(2λ) a}.
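That calculus exercise can at least be checked numerically. The sketch below (not part of the notes; the truncation point and step size are arbitrary choices) integrates e^{−λt} against the density of Ta with a midpoint rule and compares with e^{−√(2λ)a}:

```python
import math

# Numerically verify E[exp(-lam * T_a)] = exp(-sqrt(2 * lam) * a)
# by integrating exp(-lam * t) against the density of T_a.
a, lam = 1.0, 0.5

def density(t):
    # density of T_a: (a / sqrt(2*pi)) * t**(-3/2) * exp(-a**2 / (2*t))
    return a / math.sqrt(2 * math.pi) * t ** -1.5 * math.exp(-a * a / (2 * t))

# midpoint rule on (0, 400]; the integrand vanishes rapidly at both ends,
# so the truncation error is negligible
h = 1e-3
transform = sum(math.exp(-lam * t) * density(t) * h
                for t in ((k + 0.5) * h for k in range(int(400 / h))))

closed_form = math.exp(-math.sqrt(2 * lam) * a)
print(transform, closed_form)  # here sqrt(2 * lam) * a = 1, so both near exp(-1)
```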
3 Brownian motion with drift

So far we considered a Brownian motion which is characterized by zero mean
and some variance parameter σ². The standard Brownian motion is the special
case σ = 1.
There is a natural way to extend this process to a non-zero mean process
by considering Bμ(t) = μt + σB(t), given a standard Brownian motion B(t).
Some properties of Bμ(t) follow immediately. For example, given s < t, the
increments Bμ(t) − Bμ(s) have mean μ(t − s) and variance σ²(t − s). Also,
by the Time Reversal property of B (see the previous lecture) we know that
lim_{t→∞} (1/t)B(t) = 0 almost surely. Therefore, almost surely,

    lim_{t→∞} Bμ(t)/t = μ.

When μ < 0 this means that M(∞) ≜ sup_{t≥0} Bμ(t) < ∞ almost surely. On
the other hand M(∞) ≥ 0 (why?).
Our goal now is to compute the probability distribution of M(∞).

Theorem 2. For every μ < 0, the distribution of M(∞) is exponential with
parameter 2|μ|/σ². Namely, for every x ≥ 0,

    P(M(∞) > x) = e^{−2|μ|x/σ²}.

The direct proof of this result can be found in Section 6.8 of Resnick's
book [3]. The proof consists of two parts. We first show that the distribution
of M(∞) is exponential. Then we compute its parameter.
Later on we will study an alternative proof based on the optional stopping
theory for martingale processes.
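A Monte Carlo illustration of the theorem (a sketch under arbitrary discretization choices, not part of the notes): for μ = −1 and σ = 1, M(∞) should be Exp(2), so its mean should be about 1/2.

```python
import math
import random

# Simulate B_mu(t) = mu * t + sigma * B(t) on a grid and record the running
# maximum; for mu < 0 the theorem gives M(inf) ~ Exp(2 * |mu| / sigma**2).
rng = random.Random(2)
mu, sigma, dt, horizon, n_paths = -1.0, 1.0, 0.005, 10.0, 2000
n_steps = int(horizon / dt)

maxima = []
for _ in range(n_paths):
    b, running_max = 0.0, 0.0
    for _ in range(n_steps):
        b += mu * dt + sigma * rng.gauss(0.0, math.sqrt(dt))
        running_max = max(running_max, b)
    maxima.append(running_max)

mean_max = sum(maxima) / n_paths
print(mean_max)  # should be near sigma**2 / (2 * abs(mu)) = 0.5
```

By t = 10 the drift has pushed the path far below zero, so truncating the horizon barely affects the supremum; the grid slightly undershoots the continuous-time maximum, which biases the estimate a little below 1/2.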
4 Additional reading materials

- Sections 6.5 and 6.8 from Chapter 6 of Resnick's book "Adventures in Stochastic Processes".
- Sections 7.3 and 7.4 in Durrett [2].
- Billingsley [1], Section 9.
References
[1] P. Billingsley, Convergence of probability measures, Wiley-Interscience, 1999.
[2] R. Durrett, Probability: theory and examples, Duxbury Press, second edition, 1996.
[3] S. Resnick, Adventures in stochastic processes, Birkhäuser Boston, Inc., 1992.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.265/15.070J                                    Fall 2013
Lecture 8                                        9/30/2013

Quadratic variation property of Brownian motion

Content.
1. Unbounded variation of a Brownian motion.
2. Bounded quadratic variation of a Brownian motion.

1 Unbounded variation of a Brownian motion

Any sequence of values 0 < t0 < t1 < · · · < tn < T is called a partition Π =
(t0, . . . , tn) of the interval [0, T]. Given a continuous function f : [0, T] → R,
its total variation is defined to be

    LV(f) ≜ sup_Π Σ_{1≤k≤n} |f(tk) − f(tk−1)|,

where the supremum is taken over all possible partitions Π of the interval [0, T]
for all n. A function f is defined to have bounded variation if its total variation
is finite.

Theorem 1. Almost surely, no path of a Brownian motion has bounded variation,
for every T ≥ 0. Namely, for every T,

    P(ω : LV(B(ω)) < ∞) = 0.
The main tool is the following result from real analysis, which we do not
prove: if a function f has bounded variation on [0, T], then it is differentiable
almost everywhere on [0, T]. We will now show that quite the opposite is true.

Proposition 1. Brownian motion is almost surely nowhere differentiable. Specifically,

    P(∀ t ≥ 0 : lim sup_{h→0} |(B(t + h) − B(t))/h| = ∞) = 1.

Proof. Fix T > 0, M > 0 and consider A(M, T) ⊂ C[0, ∞), the set of all
paths ω ∈ C[0, ∞) such that there exists at least one point t ∈ [0, T] with

    lim sup_{h→0} |(B(t + h) − B(t))/h| ≤ M.

We claim that P(A(M, T)) = 0. This implies P(∪_{M≥1} A(M, T)) = 0, which
is what we need. Then we take a union of the sets A(M, T) with increasing T
and conclude that B is almost surely nowhere differentiable on [0, ∞). If ω ∈
A(M, T), then there exist t ∈ [0, T] and n such that |B(s) − B(t)| ≤ 2M|s − t|
for all s ∈ (t − 2/n, t + 2/n). Now define An ⊂ C[0, ∞) to be the set of all paths
ω such that for some t ∈ [0, T],

    |B(s) − B(t)| ≤ 2M|s − t|

for all s ∈ (t − 2/n, t + 2/n). Then

    An ⊂ An+1    (1)

and

    A(M, T) ⊂ ∪_n An.    (2)

Fix ω ∈ An and the corresponding t. Find k = max{j : j/n ≤ t}. Define

    Yk = max{|B((k+2)/n) − B((k+1)/n)|, |B((k+1)/n) − B(k/n)|,
             |B(k/n) − B((k−1)/n)|}.

In other words, consider the maximum increment of the Brownian motion over
these three short intervals. We claim that Yk ≤ 6M/n for every path ω ∈ An.
To prove the required bound on Yk we first consider

    |B((k+2)/n) − B((k+1)/n)| ≤ |B((k+2)/n) − B(t)| + |B(t) − B((k+1)/n)|
                              ≤ 2M · (2/n) + 2M · (1/n)
                              ≤ 6M/n.

The other two differences are analyzed similarly.


Now consider the event Bn, the set of all paths ω such that Yk(ω) ≤ 6M/n
for some 0 ≤ k ≤ Tn. We showed that An ⊂ Bn. We claim that lim_n P(Bn) =
0. Combining this with (1), we conclude P(An) = 0. Combining with (2), this
will imply that P(A(M, T)) = 0 and we will be done.
Now, to obtain the required bound on P(Bn), we note that, since the increments
of a Brownian motion are independent and identically distributed,

    P(Bn) ≤ Σ_{0≤k≤Tn} P(Yk ≤ 6M/n)
          ≤ Tn P(max{|B(3/n) − B(2/n)|, |B(2/n) − B(1/n)|,
                     |B(1/n) − B(0)|} ≤ 6M/n)
          = Tn [P(|B(1/n)| ≤ 6M/n)]³.    (3)
Finally, we analyze this probability. We have

    P(|B(1/n)| ≤ 6M/n) = P(|B(1)| ≤ 6M/√n).

Since B(1) has the standard normal distribution, its density at any point is
at most 1/√(2π), and therefore this probability is at most 2(6M)/√(2πn).
We conclude that the expression in (3) is, ignoring constants, O(n(1/√n)³) =
O(1/√n), and thus converges to zero as n → ∞. We proved lim_n P(Bn) =
0.
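Theorem 1 can also be seen numerically (a sketch, not from the notes; grid depth and levels are arbitrary choices): sample one path on a fine dyadic grid and compute Σ|ΔB| over coarser and finer sub-partitions. The sum keeps growing as the mesh shrinks, roughly like √n:

```python
import math
import random

# Sum of |B(t_k) - B(t_{k-1})| over dyadic partitions of [0, 1]; it grows
# like sqrt(n) as the mesh is refined, consistent with LV(B) = infinity a.s.
rng = random.Random(3)
n = 2 ** 14
dt = 1.0 / n

path = [0.0]
for _ in range(n):
    path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))

def variation(path, step):
    pts = path[::step]  # keep every step-th grid point
    return sum(abs(b - a) for a, b in zip(pts, pts[1:]))

coarse, mid, fine = variation(path, 256), variation(path, 16), variation(path, 1)
print(coarse, mid, fine)  # increases as the partition is refined
```

With m intervals the expected sum is √m · √(2/π), so each factor-of-16 refinement multiplies it by about 4.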
2 Bounded quadratic variation of a Brownian motion

Even though Brownian motion is nowhere differentiable and has unbounded
total variation, it turns out that it has bounded quadratic variation. This observation
is the cornerstone of Ito calculus, which we will study later in this course.
We again start with partitions Π = (t0, . . . , tn) of a fixed interval [0, T],
but now consider instead

    Q(Π, B) ≜ Σ_{1≤k≤n} (B(tk) − B(tk−1))²,

where we take (without loss of generality) t0 = 0 and tn = T. For every
partition Π define

    Δ(Π) = max_{1≤k≤n} |tk − tk−1|.

Theorem 2. Consider an arbitrary sequence of partitions Πi, i = 1, 2, . . ..
Suppose lim_i Δ(Πi) = 0. Then

    lim_i E[(Q(Πi, B) − T)²] = 0.    (4)

Suppose in addition lim_i i²Δ(Πi) = 0 (that is, the resolution Δ(Πi) converges
to zero faster than 1/i²). Then almost surely

    Q(Πi, B) → T.    (5)

In words, the standard Brownian motion has almost surely finite quadratic variation,
which is equal to T.
Proof. We will use the following fact. Let Z be a standard normal random
variable. Then E[Z⁴] = 3 (cute, isn't it?). The proof can be obtained using
Laplace transforms of normal random variables or integration by parts, and we
skip the details.
Fix a partition Π = (t0, . . . , tn) and let ξk = (B(tk) − B(tk−1))² − (tk − tk−1).
Then, by the independent Gaussian increments property of Brownian motion,
ξk, 1 ≤ k ≤ n, is a sequence of independent zero-mean random variables. We have

    Q(Π) − T = Σ_{1≤k≤n} ξk.

Now consider the second moment of this difference:

    E[(Q(Π) − T)²] = Σ_{1≤k≤n} E[(B(tk) − B(tk−1))⁴]
      − 2 Σ_{1≤k≤n} E[(B(tk) − B(tk−1))²] (tk − tk−1) + Σ_{1≤k≤n} (tk − tk−1)².

Using the E[Z⁴] = 3 property, this expression becomes

    Σ_{1≤k≤n} 3(tk − tk−1)² − 2 Σ_{1≤k≤n} (tk − tk−1)² + Σ_{1≤k≤n} (tk − tk−1)²
      = 2 Σ_{1≤k≤n} (tk − tk−1)²
      ≤ 2Δ(Π) Σ_{1≤k≤n} (tk − tk−1)
      = 2Δ(Π) T.

Now, if lim_i Δ(Πi) = 0, then the bound converges to zero as well. This establishes
the first part of the theorem.
To prove the second part, identify a sequence εi → 0 such that Δ(Πi) = εi/i².
By assumption, such a sequence exists. By Markov's inequality,

    P((Q(Πi) − T)² > 2εi) ≤ E[(Q(Πi) − T)²]/(2εi) ≤ 2Δ(Πi)T/(2εi) = T/i².    (6)

Since Σ_i T/i² < ∞, the sum of probabilities in (6) is finite. Then, applying
the Borel-Cantelli Lemma, the probability that (Q(Πi) − T)² > 2εi for
infinitely many i is zero. Since εi → 0, this exactly means that, almost surely,
lim_i Q(Πi) = T.
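The contrast between Theorems 1 and 2 is easy to see in simulation (a sketch, not from the notes; partition size and tolerance are arbitrary choices): over a fine uniform partition of [0, 1], Σ(ΔB)² sits near T = 1 while Σ|ΔB| is enormous.

```python
import math
import random

# Quadratic variation concentrates near T; total variation blows up.
rng = random.Random(4)
T, n = 1.0, 20_000
dt = T / n

increments = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n)]
quadratic_variation = sum(x * x for x in increments)  # mean T, sd sqrt(2/n)
total_variation = sum(abs(x) for x in increments)     # grows like sqrt(n)
print(quadratic_variation, total_variation)
```

The standard deviation of the quadratic variation is √(2/n) ≈ 0.01 here, matching the 2Δ(Π)T bound in the proof.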
3 Additional reading materials

- Sections 6.11 and 6.12 of Chapter 6 in Resnick's book [1].

References
[1] S. Resnick, Adventures in stochastic processes, Birkhäuser Boston, Inc., 1992.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.265/15.070J                                    Fall 2013
Lecture 9                                        10/2/2013

Conditional expectations, filtration and martingales

Content.
1. Conditional expectations
2. Martingales, sub-martingales and super-martingales

1 Conditional Expectations

1.1 Definition

Recall how we define conditional expectations. Given a random variable X and
an event A, we define E[X|A] = E[X 1{A}]/P(A).
Also, we can consider conditional expectations with respect to random variables.
For simplicity, say Y is a simple random variable on Ω taking values
y1, y2, . . . , yn with probabilities

    P(ω : Y(ω) = yi) = pi.

Now we define the conditional expectation E[X|Y] as a random variable which
takes value E[X|Y = yi] with probability pi, where E[X|Y = yi] should be
understood as the expectation of X conditioned on the event {ω : Y(ω) = yi}.
It turns out that one can define conditional expectation with respect to a
σ-field. This notion includes both conditioning on events and conditioning on
random variables as special cases.
Definition 1. Given Ω, two σ-fields G ⊂ F on Ω, and a probability measure P
on (Ω, F), suppose X is a random variable measurable with respect to F but not
necessarily with respect to G, and suppose X has a finite L1 norm (that is,
E[|X|] < ∞). The conditional expectation E[X|G] is defined to be a random
variable Y which satisfies the following properties:
(a) Y is measurable with respect to G.
(b) For every A ∈ G, we have E[X 1{A}] = E[Y 1{A}].


For simplicity, from now on we write Z ∈ F to indicate that Z is measurable
with respect to F. Also, let F(Z) denote the smallest σ-field with respect
to which Z is measurable.

Theorem 1. The conditional expectation E[X|G] exists and is unique.

Uniqueness means that if Y′ ∈ G is any other random variable satisfying
conditions (a), (b), then Y′ = Y a.s. (with respect to the measure P). We will prove
this theorem using the notion of the Radon-Nikodym derivative, the existence of
which we state without proof below. But before we do this, let us develop
some intuition behind this definition.
1.2 Simple properties

- Consider the trivial case when G = {∅, Ω}. We claim that the constant
value c = E[X] is E[X|G]. Indeed, any constant function is measurable
with respect to any σ-field, so (a) holds. For (b), we have E[X 1{Ω}] =
E[X] = c and E[c 1{Ω}] = E[c] = c; and E[X 1{∅}] = 0 and E[c 1{∅}] = 0.

- As the other extreme, suppose G = F. Then we claim that X = E[X|G].
Condition (b) trivially holds. Condition (a) also holds because of
the equality between the two σ-fields.
- Let us go back to our example of conditional expectation with respect to
an event A ⊂ Ω. Consider the associated σ-field G = {∅, A, A^c, Ω}
(we established in the first lecture that this is indeed a σ-field). Consider
a random variable Y : Ω → R defined as

    Y(ω) = E[X|A] = E[X 1{A}]/P(A) ≜ c1 for ω ∈ A,

and

    Y(ω) = E[X|A^c] = E[X 1{A^c}]/P(A^c) ≜ c2 for ω ∈ A^c.

We claim that Y = E[X|G]. First, Y ∈ G. Indeed, assume for simplicity
c1 < c2. Then {ω : Y(ω) ≤ x} = ∅ when x < c1, = A for c1 ≤ x < c2,
and = Ω when x ≥ c2. Thus Y ∈ G. Then we need to check the equality
E[X 1{B}] = E[Y 1{B}] for every B = ∅, A, A^c, Ω, which is
straightforward to do. For example, say B = A. Then

    E[X 1{A}] = E[X|A] P(A) = c1 P(A).

On the other hand, we defined Y(ω) = c1 for all ω ∈ A. Thus

    E[Y 1{A}] = c1 E[1{A}] = c1 P(A),

and the equality checks.
- Suppose now G corresponds to some partition A1, . . . , Am of the sample
space Ω. Given X ∈ F, using a similar analysis, we can check that
Y = E[X|G] is a random variable which takes value E[X|Aj] on each
Aj, for j = 1, 2, . . . , m. You will recognize that this is one of our
earlier examples, where we considered conditioning on a simple random
variable Y to get E[X|Y]. In fact, this generalizes as follows:
given two random variables X, Y : Ω → R, suppose both ∈ F. Let
G = G(Y) ⊂ F be the σ-field generated by Y. We define E[X|Y] to be
E[X|G].
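The partition case is concrete enough to compute by machine. The toy example below (my own illustration with made-up numbers, not from the notes) builds Y = E[X|G] for a three-atom partition and checks the defining property (b) exactly:

```python
from fractions import Fraction as F

# Conditional expectation with respect to the sigma-field generated by a
# finite partition: Y is constant on each atom, equal to E[X | atom].
omega = [0, 1, 2, 3, 4, 5]
p = {0: F(1, 6), 1: F(1, 6), 2: F(1, 12), 3: F(1, 12), 4: F(1, 6), 5: F(1, 3)}
x = {0: 3, 1: -1, 2: 4, 3: 4, 4: 0, 5: 7}
partition = [{0, 1}, {2, 3, 4}, {5}]

def expect(f, event):
    """E[f 1{event}] under the measure p."""
    return sum(f[w] * p[w] for w in event)

y = {}
for atom in partition:
    value = expect(x, atom) / expect({w: 1 for w in omega}, atom)
    for w in atom:
        y[w] = value  # Y takes the value E[X | atom] on the whole atom

# Property (b): E[X 1{A}] = E[Y 1{A}] for every A in the sigma-field;
# checking it on the atoms suffices, since every such A is a union of atoms.
for atom in partition:
    assert expect(x, atom) == expect(y, atom)
print(y)
```

Exact rational arithmetic (`fractions.Fraction`) makes the check an identity rather than a floating-point approximation.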
1.3 Proof of existence

We now give a proof sketch of Theorem 1.

Proof. Given two probability measures P1, P2 defined on the same (Ω, F), P2
is defined to be absolutely continuous with respect to P1 if for every set A ∈ F,
P1(A) = 0 implies P2(A) = 0.
The following theorem is the main technical part of our proof. It involves
the familiar idea of change of measures.

Theorem 2 (Radon-Nikodym Theorem). Suppose P2 is absolutely continuous
with respect to P1. Then there exists a non-negative random variable Y : Ω →
R+ such that for every A ∈ F,

    P2(A) = E_{P1}[Y 1{A}].

The function Y is called the Radon-Nikodym (RN) derivative and is sometimes
denoted dP2/dP1.

Problem 1. Prove that Y is unique up to measure zero. That is, if Y′ is also an
RN derivative, then Y = Y′ a.s. w.r.t. P1 and hence P2.

We now use this theorem to establish the existence of conditional expectations.
Thus we have G ⊂ F, P is a probability measure on F, and X is
measurable with respect to F. We will only consider the case X ≥ 0 with
E[X] < ∞. We also assume that X is not identically zero, so that E[X] > 0.
Consider a new probability measure P2 on G defined as follows:

    P2(A) = EP[X 1{A}]/EP[X],  A ∈ G,

where we write EP in place of E to emphasize that the expectation operator is
with respect to the original measure P. Check that this is indeed a probability
measure on (Ω, G). Now P also induces a probability measure on (Ω, G). We
claim that P2 is absolutely continuous with respect to P: indeed, if P(A) = 0,
then the numerator is zero. By the Radon-Nikodym Theorem there then exists
Z, measurable with respect to G, such that for any A ∈ G,

    P2(A) = EP[Z 1{A}].

We now take Y = Z EP[X]. Then Y satisfies condition (b) of being a
conditional expectation, since for every set B ∈ G,

    EP[Y 1{B}] = EP[X] EP[Z 1{B}] = EP[X 1{B}].

The second part, corresponding to the uniqueness property, is proved similarly
to the uniqueness of the RN derivative (Problem 1).
2 Properties

Here are some additional properties of conditional expectations.

Linearity. E[aX + Y |G] = aE[X|G] + E[Y |G].

Monotonicity. If X1 ≤ X2 a.s., then E[X1|G] ≤ E[X2|G]. The proof idea is similar
to the one you need to use for Problem 1.

Independence.
Problem 2. Suppose X is independent from G. Namely, for every measurable
A ⊂ R and B ∈ G, P({X ∈ A} ∩ B) = P(X ∈ A)P(B). Prove that E[X|G] =
E[X].

Conditional Jensen's inequality. Let φ be a convex function with E[|X|], E[|φ(X)|] <
∞. Then φ(E[X|G]) ≤ E[φ(X)|G].

Proof. We use the following representation of a convex function, which we do
not prove (see Durrett [1]). Let

    A = {(a, b) ∈ Q² : ax + b ≤ φ(x), ∀x}.

Then φ(x) = sup{ax + b : (a, b) ∈ A}.
Now we prove Jensen's inequality. For any pair of rationals (a, b) ∈ A
satisfying the bound above, we have, by linearity and monotonicity, E[φ(X)|G] ≥
aE[X|G] + b a.s., implying E[φ(X)|G] ≥ sup{aE[X|G] + b : (a, b) ∈ A} =
φ(E[X|G]) a.s.
Tower property. Suppose G1 ⊂ G2 ⊂ F. Then E[E[X|G1]|G2] = E[X|G1] and
E[E[X|G2]|G1] = E[X|G1]. That is, "the smaller field wins".

Proof. By definition, E[X|G1] is G1-measurable. Therefore it is G2-measurable.
Then the first equality follows from the fact that E[X|G] = X when X ∈ G, which
we established earlier. Now fix any A ∈ G1. Denote E[X|G1] by Y1 and E[X|G2]
by Y2. Then Y1 ∈ G1, Y2 ∈ G2, and

    E[Y1 1{A}] = E[X 1{A}],

simply by the definition of Y1 = E[X|G1]. On the other hand, we also have
A ∈ G2. Therefore

    E[X 1{A}] = E[Y2 1{A}].

Combining the two equalities, we see that E[Y2 1{A}] = E[Y1 1{A}] for every
A ∈ G1. Therefore E[Y2|G1] = Y1, which is the desired result.
An important special case is when G1 is the trivial σ-field {∅, Ω}. We obtain
that for every σ-field G,

    E[E[X|G]] = E[X].
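The tower property can be checked exactly on a toy pair of nested partition-generated σ-fields (my own illustration with made-up numbers, not from the notes):

```python
from fractions import Fraction as F

# Verify E[E[X | G2] | G1] = E[X | G1] when G1 is coarser than G2,
# both generated by partitions of a six-point sample space.
p = {w: F(1, 6) for w in range(6)}   # uniform probability measure
x = {0: 1, 1: 5, 2: 2, 3: 2, 4: 9, 5: 3}
coarse = [{0, 1, 2, 3}, {4, 5}]      # atoms generating G1
fine = [{0, 1}, {2, 3}, {4}, {5}]    # atoms generating G2 (refines G1)

def cond_exp(f, partition):
    """E[f | sigma-field generated by the partition], as a function on Omega."""
    out = {}
    for atom in partition:
        mass = sum(p[w] for w in atom)
        value = sum(f[w] * p[w] for w in atom) / mass
        for w in atom:
            out[w] = value
    return out

y2 = cond_exp(x, fine)               # E[X | G2]
towered = cond_exp(y2, coarse)       # E[E[X | G2] | G1]
direct = cond_exp(x, coarse)         # E[X | G1]
assert towered == direct             # "the smaller field wins"
print(direct)
```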

3 Filtration and martingales

3.1 Definition

A family of σ-fields {Ft} is defined to be a filtration if Ft1 ⊂ Ft2 whenever
t1 ≤ t2. We will consider only the two cases t ∈ Z+ and t ∈ R+. A stochastic
process {Xt} is said to be adapted to a filtration {Ft} if Xt ∈ Ft for every t.

Definition 2. A stochastic process {Xt} adapted to a filtration {Ft} is defined
to be a martingale if
1. E[|Xt|] < ∞ for all t.
2. E[Xt|Fs] = Xs for all s < t.

When the equality in 2 is replaced with ≤, the process is called a supermartingale.
When it is replaced with ≥, the process is called a submartingale.
Suppose we have a stochastic process {Xt} adapted to a filtration {Ft}, and
suppose for some s′ < s < t we have E[Xt|Fs] = Xs and E[Xs|Fs′] = Xs′.
Then, using the Tower property of conditional expectations,

    E[Xt|Fs′] = E[E[Xt|Fs]|Fs′] = E[Xs|Fs′] = Xs′.

This means that when the stochastic process {Xn} is in discrete time, it suffices
to check E[Xn+1|Fn] = Xn for all n in order to make sure that it is a martingale.
3.2 Simple examples

1. Random walk. Let Xn, n = 1, 2, . . ., be an i.i.d. sequence with mean μ
and variance σ² < ∞. Let Fn be the Borel σ-algebra on Rⁿ. Then
Sn − nμ, where Sn = Σ_{1≤k≤n} Xk, is a martingale. Indeed, Sn is adapted to
Fn, and

    E[Sn+1 − (n + 1)μ | Fn] = E[Xn+1 − μ + Sn − nμ | Fn]
      = E[Xn+1 − μ | Fn] + E[Sn − nμ | Fn]
      = E[Xn+1 − μ] + Sn − nμ    (a)
      = Sn − nμ.

Here in (a) we used the fact that Xn+1 is independent from Fn and Sn ∈ Fn.

2. Random walk squared. Under the same setting, suppose in addition
μ = 0. Then Sn² − nσ² is a martingale. The proof of this fact is very
similar.
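Both examples can be verified exhaustively for fair ±1 coin flips (μ = 0, σ² = 1). The sketch below (an illustration, not from the notes) conditions on every possible history of the first n flips and averages over the next flip:

```python
from fractions import Fraction as F
from itertools import product

# For i.i.d. fair flips X_i = +-1 (mu = 0, sigma^2 = 1), check pointwise that
# E[M_{n+1} | F_n] = M_n for M_n = S_n and for M_n = S_n**2 - n.
n = 5
for history in product((-1, 1), repeat=n):
    s = sum(history)
    # the next flip is +1 or -1 with probability 1/2, independently of F_n
    avg_s = F(1, 2) * (s + 1) + F(1, 2) * (s - 1)
    avg_sq = F(1, 2) * ((s + 1) ** 2 - (n + 1)) \
        + F(1, 2) * ((s - 1) ** 2 - (n + 1))
    assert avg_s == s           # S_n is a martingale
    assert avg_sq == s * s - n  # S_n**2 - n is a martingale
print("checked", 2 ** n, "histories")
```

Conditioning on Fn amounts to fixing the history, which is why a loop over all 2ⁿ histories constitutes a complete check at time n.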

4 Additional reading materials

- Durrett [1], Sections 4.1 and 4.2.

References
[1] R. Durrett, Probability: theory and examples, Duxbury Press, second edition, 1996.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.265/15.070J                                    Fall 2013
Lecture 10                                       10/4/2013

Martingales and stopping times

Content.
1. Martingales and properties.
2. Stopping times and Optional Stopping Theorem.

1 Martingales

We continue with studying examples of martingales.

- Brownian motion. A standard Brownian motion B(t) is a martingale on
C[0, ∞), equipped with the Wiener measure, with respect to the filtration
Bt, t ∈ R+, defined as follows. Let Ct be the Borel σ-field on C[0, t]
generated by the open and closed sets with respect to the sup norm ‖f‖ =
max_{z∈[0,t]} |f(z)|. For every set A ⊂ C[0, t], consider its extension Ā ⊂
C[0, ∞), the set of all f ∈ C[0, ∞) such that the restriction of f onto
[0, t] belongs to A. Then define Bt to be the σ-field of all extensions Ā of
sets A ∈ Ct. Informally, Bt is the σ-field representing the information about the
path f available by observing it during the time interval [0, t]. Naturally,
Bs ⊂ Bt ⊂ B when s ≤ t, where B is the Borel σ-field of the entire space
C[0, ∞).
We now check that B(t) is indeed a martingale. Fix 0 ≤ s < t. We have

    E[B(t)|Bs] = E[B(t) − B(s) + B(s)|Bs]
               = E[B(t) − B(s)|Bs] + E[B(s)|Bs]
               = 0 + B(s)
               = B(s),

where we use the fact that, due to the independent increments property,
B(t) − B(s) is independent from Bs, implying E[B(t) − B(s)|Bs] =
E[B(t) − B(s)] = 0; and B(s) ∈ Bs, implying E[B(s)|Bs] = B(s).

- Brownian motion with drift. Now consider a Brownian motion with drift
μ and standard deviation σ. That is, consider Bμ(t) = μt + σB(t), where
B is the standard Brownian motion. It is straightforward to show that
Bμ(t) − μt is a martingale. Also, it is simple to see that (Bμ(t) − μt)² − σ²t
is also a martingale.

- Wald's martingale. Suppose Xn, n ∈ N, is an i.i.d. sequence with
E[X1] = μ, such that the moment generating function M(θ) ≜ E[e^{θX1}]
is finite for some θ > 0. Let Sn = Σ_{1≤k≤n} Xk. Then

    Zn = exp(θSn)/Mⁿ(θ)

is a martingale. It is straightforward to check this.

2 Properties of martingales

Note that if Xn is a submartingale, then −Xn is a supermartingale, and vice versa.

Proposition 1. Suppose Xn is a martingale and φ is a convex function such that
E[|φ(Xn)|] < ∞. Then φ(Xn) is a submartingale.

Proof. We apply the conditional Jensen's inequality:

    E[φ(Xn+1)|Fn] ≥ φ(E[Xn+1|Fn]) = φ(Xn).

As a corollary, we obtain that if Xn is a martingale and E[|Xn|^p] < ∞
for all n, for some p ≥ 1, then |Xn|^p is a submartingale. Note that if Xn were a
submartingale and φ non-decreasing (and convex), the same result would apply.

Definition 1. A sequence of random variables Hn is defined to be predictable if
Hn ∈ Fn−1.

Here is a somewhat artificial example. Let Xn be adapted to a filtration Fn.
Then Hn = Xn−1, with H0 = H1 = X0, is predictable: we simply played with
indices. Instead, now consider Hn = E[Xn|Fn−1]. By the definition of conditional
expectations, Hn ∈ Fn−1, so it is predictable.
Theorem 1 (Doob's decomposition). Every submartingale Xn, n ≥ 0, adapted
to a filtration Fn can be written in a unique way as Xn = Mn + An, where Mn
is a martingale and An is an a.s. non-decreasing sequence, predictable with
respect to Fn.

Proof. Set A0 = 0 and define An recursively by

    An = An−1 + E[Xn|Fn−1] − Xn−1.

Let Mn = Xn − An. By induction, we have that An ∈ Fn−1, since An−1 ∈
Fn−2 ⊂ Fn−1 and E[Xn|Fn−1], Xn−1 ∈ Fn−1. Therefore An is predictable. Note
also that An − An−1 = E[Xn|Fn−1] − Xn−1 ≥ 0, since Xn is a submartingale, so
An is non-decreasing. We now need to show that Mn is a martingale. To check
that E[|Mn|] < ∞, it suffices to show the same for An, since by assumption Xn
is a submartingale and therefore E[|Xn|] < ∞. We establish finiteness of
E[|An|] by induction, for which it suffices to have finiteness of E[|E[Xn|Fn−1]|];
this follows from the conditional Jensen's inequality, which gives |E[Xn|Fn−1]| ≤
E[|Xn| | Fn−1], and the tower property, which gives E[E[|Xn| | Fn−1]] = E[|Xn|] <
∞. We now establish that E[Mn|Fn−1] = Mn−1. We have

    E[Mn|Fn−1] = E[Xn − An|Fn−1]
               = E[Xn|Fn−1] − An
               = Xn−1 − An−1
               = Mn−1.

Thus Mn is indeed a martingale. This completes the proof of the existence part.
To prove uniqueness, assume that Xn = M′n + A′n is any such decomposition.
Then

    E[Xn|Fn−1] = E[M′n + A′n|Fn−1] = M′n−1 + A′n = Xn−1 − A′n−1 + A′n,

since, by assumption, M′n is a martingale and A′n is predictable. Then we see
that A′n satisfies the same recursion as An, implying A′n = An. It then immediately
follows that M′n = Mn.
The concept of predictable sequences is useful when we want to model a
gambling scheme, or in general any controlled stochastic process, where we
observe the state Xn−1 of the process at time t = n − 1, make a decision
at time t = n, and observe the realization Xn at time t = n. This realization
might actually depend on our decision. In the special case of gambling, suppose
Xn is your wealth at time n if in each round you bet one dollar; in particular,
Xn − Xn−1 = ±1. If you instead bet an amount Hm in round m, then your net
gain at time n is Σ_{1≤m≤n} Hm(Xm − Xm−1). Modelling the outcomes as an i.i.d.
coin-toss sequence with the usual product σ-field on {0, 1}^∞, we have that Xn
is adapted to the σ-field on {0, 1}ⁿ and Hn is predictable.
Is there a scheme which allows us to beat the system? In fact, yes. This
is the famous "double the bet" strategy. We make H1 = 1 and Hn = 2Hn−1
if Xn−1 − Xn−2 = −1, and stop once Xn−1 − Xn−2 = 1. Assuming we lose the
first k rounds and win in round k + 1, our net gain is −1 − 2 − · · · − 2^{k−1} + 2^k = 1.

If P(Xn−1 − Xn−2 = 1) > 0 and the games have independently distributed
outcomes, then we a.s. win a dollar at some random time.
As we will see, this was possible because we did not put a bound on how
long we are allowed to play or on how much we are allowed to lose. We now
establish that placing either of these conditions makes beating the gambling system
impossible.
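The scheme itself is a few lines of code. The sketch below (an illustration, not from the notes; the round cap is only a safeguard) plays it against a fair coin and always nets exactly one dollar:

```python
import random

# "Double the bet": double the stake after every loss, stop at the first win.
def double_the_bet(rng, max_rounds=10_000):
    bet, net = 1, 0
    for _ in range(max_rounds):
        if rng.random() < 0.5:     # fair coin: win this round
            return net + bet       # -1 - 2 - ... - 2**(k-1) + 2**k = 1
        net -= bet                 # lose the current stake
        bet *= 2                   # and double it
    raise RuntimeError("round cap reached (probability 2**-10000)")

rng = random.Random(5)
gains = [double_the_bet(rng) for _ in range(1000)]
print(set(gains))  # every run nets exactly 1
```

Note the interim losses: before the winning round the player is down 2^k − 1 dollars, which is exactly the unbounded exposure the next results rule out.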
For now we establish the following result.
For now we establish the following result.

Theorem 2. Suppose Xn is a supermartingale and Hn ≥ 0 is predictable. Then
Zn = Σ_{1≤m≤n} Hm(Xm − Xm−1) is also a supermartingale.

Proof. We have

    E[Zn+1|Fn] = E[Hn+1(Xn+1 − Xn)|Fn] + E[Σ_{1≤m≤n} Hm(Xm − Xm−1)|Fn].

Since H is predictable, Hn+1 ∈ Fn, implying that the first summand is equal to

    Hn+1 E[Xn+1 − Xn|Fn] ≤ 0,

where the inequality follows since Xn is a supermartingale and Hn+1 ≥ 0. On
the other hand,

    Σ_{1≤m≤n} Hm(Xm − Xm−1) = Zn ∈ Fn,

implying that its conditional expectation is E[Zn|Fn] = Zn. This completes the proof.
3 Stopping times

In the example above, showing how to create a winning gambling system, notice
that part of the strategy was to stop the first time we win. Thus the game
is interrupted at a random time, which depends on the observed outcomes. We
now formalize this using the concept of stopping times. The end goal of this
section is to establish the following result: if the gambling strategy involves a
stopping time which is bounded, then our expected gain is non-positive.

Definition 2. Given a filtration {Ft}_{t∈T} on a sample space Ω, a random
variable τ : Ω → T is called a stopping time if the event {τ ≤ t} = {ω ∈ Ω :
τ(ω) ≤ t} ∈ Ft for every t.

Here we consider again the two cases T = N and T = R+. Note that we do
not need to specify a probability measure here: the concept of stopping times is
framed only in terms of measurability with respect to σ-fields.

- Consider the filtration corresponding to a sequence of random variables
X1, . . . , Xn, . . .. Namely, Ω = R^∞ and Fn is the Borel σ-field of Rⁿ. Fix
some real value x. Given any sample ω = (ω1, . . . , ωn, . . .) ∈ R^∞, define

    τ(ω) = inf{n : Σ_{1≤k≤n} ωk ≥ x}.

Namely, τ(ω) is the smallest index at which the sum of the components
is at least x. Then τ is a stopping time. Indeed, the event τ ≤ n is
completely specified by the portion ω1, . . . , ωn of the sample. In particular,

    {ω : τ(ω) ≤ n} = ∪_{1≤i≤n} {ω : Σ_{1≤k≤i} ωk ≥ x}.

Each of the events in the union on the right-hand side is measurable with
respect to Fn. This example is the familiar Wald's stopping time: τ is the
smallest n for which a random walk X1 + · · · + Xn ≥ x, except that we
do not say that the random sequence is i.i.d. and, in fact, we do
not say anything about the probability law of the sequence at all.

- Consider the standard Wiener measure on C([0, ∞)), that is, consider the
standard Brownian motion. Then, given a > 0, Ta = inf{t : B(t) = a}
is a stopping time with respect to the filtration Bt, t ∈ R+, where Bt is
the filtration described in the beginning of the lecture. This is the familiar
hitting time of the Brownian motion.
The main result in terms of stopping times that we wish to establish is as
follows.

Theorem 3. Suppose Xn is a supermartingale and τ is a stopping time which
is a.s. bounded: τ ≤ M a.s. for some constant M. Then E[Xτ] ≤ E[X0]. In other
words, if there exists a bound on the number of rounds for betting, then the
expected net gain is non-positive, provided that in each round the expected gain
is non-positive.

This theorem will be established as a result of several short lemmas.

Lemma 1. Suppose τ is a stopping time corresponding to the filtration Fn.
Then the sequence of random variables Hn = 1{τ ≥ n} is predictable.

Proof. Hn is a random variable which takes values 0 and 1. Note that the event
{Hn = 0} = {τ < n} = {τ ≤ n − 1}. Since τ is a stopping time, the
event {τ ≤ n − 1} ∈ Fn−1. Thus Hn is predictable.

Given two real values a, b, we use a ∧ b to denote min(a, b).

Corollary 1. Suppose Xn is a supermartingale and τ is a stopping time. Then
Yn = X_{τ∧n} is also a supermartingale.

Proof. Define Hn = 1{τ ≥ n}. Observe that

    Σ_{1≤m≤n} Hm(Xm − Xm−1) = −H0X0 + Σ_{0≤m≤n−1} Xm(Hm − Hm+1) + HnXn.    (1)

Note, H0 = 1{τ ≥ 0} = 1 and Hm − Hm+1 = 1{τ ≥ m} − 1{τ ≥ m + 1} =
1{τ = m}. Therefore, the expression on the right-hand side of (1) is equal to
X_{τ∧n} − X0. By Theorem 2, the left-hand side of (1) is a supermartingale. We
conclude that Yn = X_{τ∧n} is a supermartingale.
Now we are ready to obtain our end result.

Proof of Theorem 3. The process $Y_n = X_{\tau \wedge n}$ is a supermartingale by Corollary 1. Therefore

$$\mathbb{E}[Y_M] \le \mathbb{E}[Y_0].$$

But $Y_M = X_{\tau \wedge M} = X_\tau$ and $Y_0 = X_{\tau \wedge 0} = X_0$. We conclude $\mathbb{E}[X_\tau] \le \mathbb{E}[X_0]$. This concludes the proof of Theorem 3.
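To make Theorem 3 concrete, here is a small Monte Carlo sketch (an illustration only, not part of the lecture): we simulate a supermartingale built from Gaussian steps with negative drift, stop it at a bounded stopping time, and check that the average stopped value does not exceed $\mathbb{E}[X_0] = 0$. The drift, the cap $M$, and the threshold $x$ are arbitrary choices.

```python
import random

random.seed(0)

def stopped_value(mu=-0.1, M=50, x=3.0):
    # Supermartingale: partial sums of steps with mean mu <= 0.
    # Stopping time: first n with X_n >= x, capped at M, so tau <= M a.s.
    X = 0.0
    for _ in range(M):
        X += random.gauss(mu, 1.0)
        if X >= x:
            break
    return X

trials = 20000
est = sum(stopped_value() for _ in range(trials)) / trials
print(est)  # Theorem 3 predicts E[X_tau] <= E[X_0] = 0
```

Replacing the negative drift by zero turns the process into a martingale, in which case the stopped mean should be close to zero rather than strictly below it.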
4 Additional reading materials


Durrett [1] Chapter 4.

References
[1] R. Durrett, Probability: theory and examples, Duxbury Press, second edition, 1996.

MIT OpenCourseWare
http://ocw.mit.edu

15.070J / 6.265J Advanced Stochastic Processes


Fall 2013

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070J
Lecture 11

Fall 2013
10/9/2013

Martingales and stopping times II

Content.
1. Second stopping theorem.
2. Doob-Kolmogorov inequality.
3. Applications of stopping theorems to hitting times of a Brownian motion.

Second stopping theorem

In the previous lecture we established that no gambling scheme can produce a positive gain in expectation if there is a limit on the number of rounds the game is played. We now establish a similar result assuming that there is a limit on the amount of wealth possessed by the player.
Theorem 1. Suppose $X_n$ is a supermartingale that is uniformly bounded, that is, $|X_n| \le M$ a.s. for some $M$. Suppose $\tau$ is a stopping time which is finite a.s. Then $\mathbb{E}[X_\tau] \le \mathbb{E}[X_0]$. If, in addition, $X_n$ is a martingale, then $\mathbb{E}[X_\tau] = \mathbb{E}[X_0]$.

The gambling interpretation of this theorem is as follows. Suppose we tried to use the "double the stakes" algorithm, which we know guarantees winning a dollar when there are no restrictions. But now suppose that there is a limit on how negative our wealth can go (for example, the amount of debt allowed). Say this limit is $M$. Consider the modified process $Y_n = X_{\tau \wedge n}$. Then from our description $-M \le Y_n \le 1 < M$. Also, we remember from the previous lecture that $Y_n$ is a supermartingale. Thus it is a bounded supermartingale. Theorem 1 then tells us that $\mathbb{E}[Y_\tau] = \mathbb{E}[X_\tau] \le \mathbb{E}[X_0]$, namely the scheme does not work anymore. (Remember that without the restriction, the scheme produces wealth $X_\tau = 1$ with probability one; in particular, $\mathbb{E}[X_\tau] = 1 > X_0 = 0$.)
Proof of Theorem 1. Observe that $\mathbb{E}[|X_\tau|] \le M < \infty$. Consider $Y_n = X_{\tau \wedge n}$. Then $Y_n$ converges to $X_\tau$ a.s. as $n \to \infty$. Since $|Y_n| \le M$ a.s., then, using the Bounded Convergence Theorem, $\lim_n \mathbb{E}[Y_n] = \mathbb{E}[X_\tau]$. On the other hand, we established in Corollary 1 in the previous lecture that $Y_n$ is a supermartingale. Therefore $\mathbb{E}[Y_n] \le \mathbb{E}[Y_0] = \mathbb{E}[X_0]$. Combining, we obtain $\mathbb{E}[X_\tau] \le \mathbb{E}[X_0]$.
2 Doob-Kolmogorov inequality

We now establish a technical but very useful result, which is an analogue of the Markov/Chebyshev bounds on random variables.

Theorem 2 (Doob-Kolmogorov inequality). Suppose $X_n$ is a non-negative submartingale adapted to $\{\mathcal{F}_n\}$ and $\epsilon > 0$. Then for every $n \in \mathbb{N}$

$$\mathbb{P}\Big(\max_{1 \le m \le n} X_m \ge \epsilon\Big) \le \frac{\mathbb{E}[X_n^2]}{\epsilon^2}.$$

If $X_n$ is a martingale, then the non-negativity condition can be dropped.


The convenience of this result is that we can bound the worst-case deviation of a submartingale using its value at the end of the time interval.

Proof. Using Jensen's inequality we established that if $X_n$ is a martingale then $|X_n|$ is a submartingale. Since $|X_n|$ is non-negative, the second part follows from the first.

To establish the first part, consider the events $A = \{\max_{1 \le m \le n} X_m < \epsilon\}$ and $B_m = \{\max_{1 \le i \le m-1} X_i < \epsilon, \, X_m \ge \epsilon\}$. Namely, $A$ is the event that the submartingale never reaches $\epsilon$ and $B_m$ is the event that it does so at time $m$ for the first time. We have $\Omega = A \cup \bigcup_{1 \le m \le n} B_m$ and the events $A, B_m$ are mutually exclusive. Then

$$\mathbb{E}[X_n^2] = \mathbb{E}[X_n^2 1\{A\}] + \sum_{1 \le m \le n} \mathbb{E}[X_n^2 1\{B_m\}] \ge \sum_{1 \le m \le n} \mathbb{E}[X_n^2 1\{B_m\}].$$

Note

$$\mathbb{E}[X_n^2 1\{B_m\}] = \mathbb{E}[(X_n - X_m + X_m)^2 1\{B_m\}] = \mathbb{E}[(X_n - X_m)^2 1\{B_m\}] + 2\mathbb{E}[(X_n - X_m) X_m 1\{B_m\}] + \mathbb{E}[X_m^2 1\{B_m\}].$$

The first of the summands is non-negative. The last is at least $\epsilon^2 \mathbb{P}(B_m)$, since on the event $B_m$ we have $X_m \ge \epsilon$. We now analyze the second term, and here we use the tower property:

$$\mathbb{E}[(X_n - X_m) X_m 1\{B_m\}] = \mathbb{E}\big[\mathbb{E}[(X_n - X_m) X_m 1\{B_m\} \mid \mathcal{F}_m]\big] = \mathbb{E}\big[X_m 1\{B_m\} \, \mathbb{E}[X_n - X_m \mid \mathcal{F}_m]\big] \ge 0,$$

where the second equality follows since $X_m 1\{B_m\} \in \mathcal{F}_m$, and the last inequality follows since $X_n$ is a submartingale and $X_m 1\{B_m\} \ge \epsilon > 0$ on $B_m$ and $= 0$ on $B_m^c$. We conclude that

$$\mathbb{E}[X_n^2] \ge \sum_{1 \le m \le n} \epsilon^2 \mathbb{P}(B_m) = \epsilon^2 \mathbb{P}(\cup_m B_m) = \epsilon^2 \mathbb{P}\Big(\max_{1 \le m \le n} X_m \ge \epsilon\Big).$$

This concludes the proof.
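As a quick numerical sanity check (an illustration only, not part of the lecture), one can compare both sides of the Doob-Kolmogorov inequality for the simple symmetric random walk, a martingale, using the second part of Theorem 2. The parameters $n$ and $\epsilon$ are arbitrary choices.

```python
import random

random.seed(1)

n, eps, trials = 100, 25.0, 20000
exceed, second_moment = 0, 0.0
for _ in range(trials):
    X, running_max = 0, 0
    for _ in range(n):
        X += random.choice((-1, 1))
        running_max = max(running_max, abs(X))
    exceed += running_max >= eps
    second_moment += X * X

lhs = exceed / trials                      # P(max_m |X_m| >= eps)
rhs = (second_moment / trials) / eps ** 2  # E[X_n^2] / eps^2
print(lhs, rhs)  # the inequality predicts lhs <= rhs
```

Here $\mathbb{E}[X_n^2] = n$, so the right-hand side is roughly $n/\epsilon^2 = 0.16$, while the empirical probability of the maximum exceeding $\epsilon$ is considerably smaller, illustrating that the bound holds with room to spare.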


Corollary 1. Suppose $X_n$ is a martingale and $p \ge 1$. Then for every $\epsilon > 0$

$$\mathbb{P}\Big(\max_{1 \le m \le n} |X_m| \ge \epsilon\Big) \le \frac{\mathbb{E}[|X_n|^p]}{\epsilon^p}.$$

Proof. The proof of the general case is more complicated, but when $p \ge 2$ we almost immediately obtain the result. Using conditional Jensen's inequality we know that $|X_n|$ is a submartingale, as $|\cdot|$ is a convex function. It is also non-negative. The function $x^{p/2}$ is convex and increasing when $p \ge 2$ and $x \ge 0$. Recall from the previous lecture that this implies $|X_n|^{p/2}$ is also a submartingale. Applying Theorem 2 we obtain

$$\mathbb{P}\Big(\max_{1 \le m \le n} |X_m| \ge \epsilon\Big) = \mathbb{P}\Big(\max_{1 \le m \le n} |X_m|^{p/2} \ge \epsilon^{p/2}\Big) \le \frac{\mathbb{E}[|X_n|^p]}{\epsilon^p}.$$

An analogue of this corollary holds in continuous time when the process is continuous. We just state this result without proving it.

Theorem 3. Suppose $\{X_t\}_{t \in \mathbb{R}_+}$ is a martingale which has a.s. continuous sample paths. Then for every $p \ge 1$, $T > 0$, $\epsilon > 0$,

$$\mathbb{P}\Big(\sup_{0 \le t \le T} |X_t| \ge \epsilon\Big) \le \frac{\mathbb{E}[|X_T|^p]}{\epsilon^p}.$$

Applications to hitting times of a Brownian motion

We now use the martingale theory and optional stopping theorems to derive some properties of hitting times of a Brownian motion. Our setting is either a standard Brownian motion $B(t)$ or a Brownian motion with drift $B_\mu(t) = \mu t + \sigma B(t)$. In both cases the starting value is assumed to be 0. We fix $a < 0 < b$ and ask the question: what is the probability that $B_\mu(t)$ hits $a$ before $b$? For simplicity we write $B$ instead of $B_\mu$, but keep in mind that we talk about Brownian motion with drift.

We define

$$T_a = \inf\{t : B_\mu(t) = a\}, \quad T_b = \inf\{t : B_\mu(t) = b\}, \quad T_{ab} = \min(T_a, T_b).$$

In Lecture 6, Problem 1 we established that when $B$ is standard, $\limsup_t B(t) = \infty$ a.s. Thus $T_b < \infty$ a.s. By symmetry, $T_a < \infty$ a.s. Now we ask: what is the probability $\mathbb{P}(T_{ab} = T_a)$? We will use the optional stopping theorems established before. The only issue is that we are now dealing with continuous-time processes. The derivations of stopping theorems require more details (for example, defining predictable sequences is trickier). We skip the details and just assume that the optional stopping theorems apply in our case as well.
The case of the standard Brownian motion is the simplest.
The case of the standard Brownian motion is the simplest.

Theorem 4. Let $T_a, T_b, T_{ab}$ be defined with respect to the standard Brownian motion $B(t)$. Then

$$\mathbb{P}(T_{ab} = T_a) = \frac{|b|}{|a| + |b|}.$$

Proof. Recall that $B$ is a martingale. Observe that $T_{ab}$ defines a stopping time: the event $\{T_{ab} \le t\} \in \mathcal{B}_t$ (whether $T_{ab} \le t$ is determined completely by the path of the Brownian motion up to time $t$). Therefore, by Corollary 1 in the previous lecture, $Y_t \triangleq B(t \wedge T_{ab})$ is also a martingale. Note that it is a bounded martingale, since its absolute value is at most $\max(|a|, |b|)$. Theorem 1 applied to $Y_t$ then implies that $\mathbb{E}[Y_{T_{ab}}] = \mathbb{E}[B(T_{ab})] = \mathbb{E}[Y_0] = \mathbb{E}[B(0)] = 0$. On the other hand, when $T_{ab} = T_a$ we have $B(T_{ab}) = B(T_a) = a$ and, conversely, when $T_{ab} = T_b$ we have $B(T_{ab}) = B(T_b) = b$. Therefore

$$\mathbb{E}[B(T_{ab})] = a \, \mathbb{P}(T_{ab} = T_a) + b \, \mathbb{P}(T_{ab} = T_b) = -|a| \, \mathbb{P}(T_{ab} = T_a) + |b| \, \mathbb{P}(T_{ab} = T_b).$$

Since $\mathbb{P}(T_{ab} = T_a) + \mathbb{P}(T_{ab} = T_b) = 1$, then, combining with the fact $\mathbb{E}[B(T_{ab})] = 0$, we obtain

$$\mathbb{P}(T_{ab} = T_a) = \frac{|b|}{|a| + |b|}, \qquad \mathbb{P}(T_{ab} = T_b) = \frac{|a|}{|a| + |b|}.$$
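Theorem 4 is easy to test numerically (a sketch with arbitrarily chosen levels $a, b$; not part of the lecture). The simple symmetric random walk is also a martingale, and the same optional stopping argument gives the same ruin probability $|b|/(|a|+|b|)$ for hitting integer levels:

```python
import random

random.seed(2)

def hits_a_first(a=-3, b=2):
    # Symmetric +/-1 walk started at 0, run until it reaches a or b.
    X = 0
    while a < X < b:
        X += random.choice((-1, 1))
    return X == a

trials = 20000
p = sum(hits_a_first() for _ in range(trials)) / trials
print(p)  # theory: |b| / (|a| + |b|) = 2/5 = 0.4
```

The walk terminates with probability one because the interval $(a, b)$ is finite, mirroring the fact that $T_{ab} < \infty$ a.s. for the Brownian motion.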

We now consider the more difficult case, when the drift of the Brownian motion $\mu \ne 0$. Specifically, assume $\mu < 0$. Recall that in this case $\lim_t B(t) = -\infty$ a.s., so $T_{ab} \le T_a < \infty$ a.s. Again we want to compute $\mathbb{P}(T_{ab} = T_a)$.

We fix the drift $\mu < 0$, the variance $\sigma^2 > 0$, and consider $q(\lambda) = \mu \lambda + \frac{1}{2} \sigma^2 \lambda^2$.
Proposition 1. For every $\lambda$, the process $V_\lambda(t) = e^{\lambda B(t) - q(\lambda) t}$ is a martingale.

Proof. We first need to check that $\mathbb{E}[|V_\lambda(t)|] < \infty$. We leave it as an exercise. We have for every $0 \le s < t$

$$\mathbb{E}[V_\lambda(t) \mid \mathcal{B}_s] = \mathbb{E}[e^{\lambda(B(t) - B(s))} e^{-q(\lambda)(t-s)} e^{\lambda B(s) - q(\lambda) s} \mid \mathcal{B}_s] = e^{-q(\lambda)(t-s)} \, \mathbb{E}[e^{\lambda(B(t) - B(s))}] \, V_\lambda(s),$$

where the second equality follows from the independent increments property of the Brownian motion and from the fact that $e^{\lambda B(s) - q(\lambda) s}$ is $\mathcal{B}_s$-measurable. Since $B(t) - B(s) \stackrel{d}{=} N(\mu(t-s), \sigma^2(t-s))$, then $\mathbb{E}[e^{\lambda(B(t) - B(s))}]$ is the Laplace transform of this normal r.v., which is

$$e^{\mu \lambda (t-s) + \frac{1}{2} \lambda^2 \sigma^2 (t-s)} = e^{q(\lambda)(t-s)}.$$

Combining, we obtain $\mathbb{E}[V_\lambda(t) \mid \mathcal{B}_s] = V_\lambda(s)$. Therefore $V_\lambda(t)$ is a martingale.


Now that we know that $V_\lambda(t)$ is a martingale, we can try to apply the optional stopping theorem. For that we need a stopping time, and we will use $T_{ab}$. We also need conditions under which the expected value at the stopping time equals the expected value at time zero. We use the following observation. Suppose $\lambda > 0$ is such that $q(\lambda) \le 0$. Then $0 \le V_\lambda(t \wedge T_{ab}) \le e^{\lambda b}$ a.s. Indeed, the left side of the inequality follows trivially from the non-negativity of $V_\lambda$. For the right-hand side, observe that for $t \le T_{ab}$ we have $V_\lambda(t) \le \max(e^{-\lambda |a|}, e^{\lambda b}) = e^{\lambda b}$, and the assertion follows. Thus $V_\lambda$ is an a.s. bounded martingale. Again we use Theorem 1 to conclude that

$$\mathbb{E}[V_\lambda(T_{ab})] = V_\lambda(0) = 1.$$

We now set $\lambda = -2\mu/\sigma^2 > 0$. Then $q(\lambda) = 0$. Note that

$$V_\lambda(T_{ab}) 1\{T_{ab} = T_a\} = e^{-\frac{2\mu a}{\sigma^2}} 1\{T_{ab} = T_a\}, \qquad V_\lambda(T_{ab}) 1\{T_{ab} = T_b\} = e^{-\frac{2\mu b}{\sigma^2}} 1\{T_{ab} = T_b\}. \qquad (1)$$

The previous identity gives

$$1 = \mathbb{E}[V_\lambda(T_{ab})] = e^{-\frac{2\mu a}{\sigma^2}} \mathbb{P}(T_{ab} = T_a) + e^{-\frac{2\mu b}{\sigma^2}} \mathbb{P}(T_{ab} = T_b).$$

From this and using $\mu < 0$ (so that $-2\mu a/\sigma^2 = -2|\mu||a|/\sigma^2$ and $-2\mu b/\sigma^2 = 2|\mu||b|/\sigma^2$), we recover

$$\mathbb{P}(T_{ab} = T_b) = \frac{1 - e^{-\frac{2\mu a}{\sigma^2}}}{e^{-\frac{2\mu b}{\sigma^2}} - e^{-\frac{2\mu a}{\sigma^2}}} = \frac{1 - e^{-\frac{2|\mu||a|}{\sigma^2}}}{e^{\frac{2|\mu||b|}{\sigma^2}} - e^{-\frac{2|\mu||a|}{\sigma^2}}}.$$

Compared with the driftless case, the probability of hitting $b$ first is exponentially tilted. Now let us take $a \to -\infty$. The events $A_a = \{T_{ab} = T_a\}$ are monotone: $A_{a'} \subset A_a$ for $a' < a < 0$. Therefore

$$\mathbb{P}\Big(\bigcap_{a < 0} A_a\Big) = \lim_{a \to -\infty} \mathbb{P}(A_a) = \lim_{a \to -\infty} \left(1 - \frac{1 - e^{-\frac{2|\mu||a|}{\sigma^2}}}{e^{\frac{2|\mu||b|}{\sigma^2}} - e^{-\frac{2|\mu||a|}{\sigma^2}}}\right) = 1 - e^{-\frac{2|\mu||b|}{\sigma^2}}.$$

But what is the event $\cap_{a < 0} A_a$? Since Brownian motion has continuous paths, this event is exactly the event that the Brownian motion never hits the positive level $b$, that is, the event $\sup_{t \ge 0} B(t) < b$. We conclude that when the drift of the Brownian motion is negative,

$$\mathbb{P}\Big(\sup_{t \ge 0} B(t) \ge b\Big) = e^{-\frac{2|\mu||b|}{\sigma^2}}.$$

Recall from Lecture 6 that we already established this fact directly from the properties of the Brownian motion: the supremum of a Brownian motion with a negative drift has an exponential distribution with parameter $2|\mu|/\sigma^2$.
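The exponential tail of the supremum can also be checked by simulation (an illustration with arbitrarily chosen $\mu$, $\sigma$, $b$; not part of the lecture). A discretized path slightly underestimates the continuous supremum, so we only check rough agreement with $e^{-2|\mu| b/\sigma^2}$:

```python
import math
import random

random.seed(3)

mu, sigma, b = -1.0, 1.0, 1.0
dt, T, trials = 0.02, 15.0, 4000
steps = int(T / dt)
theory = math.exp(-2 * abs(mu) * b / sigma ** 2)  # about 0.1353

hits = 0
for _ in range(trials):
    S = 0.0
    for _ in range(steps):
        S += random.gauss(mu * dt, sigma * math.sqrt(dt))
        if S >= b:
            hits += 1
            break

print(hits / trials, theory)  # discretization biases the estimate slightly low
```

Truncating time at $T$ is harmless here: by time $T$ the drift has pushed the path far below $b$, so later excursions to $b$ contribute negligibly.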
4 Additional reading materials
Durrett [1] Chapter 4.
Grimmett and Stirzaker [2] Section 7.8.
References
[1] R. Durrett, Probability: theory and examples, Duxbury Press, second edition, 1996.
[2] G. R. Grimmett and D. R. Stirzaker, Probability and random processes,
Oxford University Press, 2005.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070J
Lecture 11-Additional material

Fall 2013
10/9/2013

Martingale Convergence Theorem

Content.
1. Martingale Convergence Theorem
2. Doob's Inequality Revisited
3. Martingale Convergence in Lp
4. Backward Martingales. SLLN Using Backward Martingales
5. Hewitt-Savage 0-1 Law
6. De Finetti's Theorem

1 Martingale Convergence Theorem


Theorem 1 (Doob). Suppose $X_n$ is a super-martingale which satisfies

$$\sup_n \mathbb{E}[|X_n|] < \infty.$$

Then, almost surely, $X_\infty = \lim_n X_n$ exists and is finite in expectation. That is, define $X_\infty = \limsup_n X_n$. Then $X_n \to X_\infty$ a.s. and $\mathbb{E}[|X_\infty|] < \infty$.
Proof. The proof relies on Doob's Upcrossing Lemma. For that, consider

$$\Omega_0 = \{\omega : X_n(\omega) \text{ does not converge to a limit in } \mathbb{R}\}
= \{\omega : \liminf_n X_n(\omega) < \limsup_n X_n(\omega)\}
= \bigcup_{a < b:\, a, b \in \mathbb{Q}} \{\omega : \liminf_n X_n(\omega) < a < b < \limsup_n X_n(\omega)\}, \qquad (1)$$

where $\mathbb{Q}$ is the set of rational values. Let $U_N[a, b](\omega)$ be the largest $k$ such that the following holds: there exist

$$0 \le s_1 < t_1 < \cdots < s_k < t_k \le N$$

such that

$$X_{s_i}(\omega) < a < b < X_{t_i}(\omega), \qquad 1 \le i \le k.$$

That is, $U_N[a, b]$ is the number of up-crossings of $[a, b]$ up to time $N$. Clearly, $U_N[a, b](\omega)$ is non-decreasing in $N$. Let $U[a, b](\omega) = \lim_N U_N[a, b](\omega)$. Then (1) can be re-written as

$$\Omega_0 = \bigcup_{a < b:\, a, b \in \mathbb{Q}} \{\omega : U[a, b](\omega) = \infty\} = \bigcup_{a < b:\, a, b \in \mathbb{Q}} \Omega_{a,b}. \qquad (2)$$

Doob's upcrossing lemma proves that $\mathbb{P}(\Omega_{a,b}) = 0$ for every $a < b$. Then we have from (2) that $\mathbb{P}(\Omega_0) = 0$. Thus, $X_n(\omega)$ converges in $[-\infty, \infty]$ a.s. That is, $X_\infty = \lim_n X_n$ exists a.s.

Now,

$$\mathbb{E}[|X_\infty|] = \mathbb{E}[\liminf_n |X_n|] \le \liminf_n \mathbb{E}[|X_n|] \le \sup_n \mathbb{E}[|X_n|] < \infty,$$

where we have used Fatou's Lemma in the first inequality. Thus, $X_\infty$ is in $L^1$; in particular, $X_\infty$ is finite a.s. This completes the proof of Theorem 1 (with Doob's upcrossing Lemma and its application $\mathbb{P}(\Omega_{a,b}) = 0$ remaining to be proved).
Lemma 1 (Doob's Upcrossing). Let $X_n$ be a super-MG. Let $U_N[a, b]$ be the number of upcrossings of $[a, b]$ until time $N$, with $a < b$. Then

$$(b - a) \, \mathbb{E}[U_N[a, b]] \le \mathbb{E}[(X_N - a)^-],$$

where

$$(X_N - a)^- = \begin{cases} a - X_N, & \text{if } X_N \le a \\ 0, & \text{otherwise.} \end{cases}$$

Proof. Define a predictable sequence $C_n$ as follows:

$$C_1(\omega) = \begin{cases} 1, & \text{if } X_0(\omega) < a \\ 0, & \text{otherwise.} \end{cases}$$

Inductively,

$$C_n(\omega) = \begin{cases} 1, & \text{if } C_{n-1}(\omega) = 1 \text{ and } X_{n-1}(\omega) \le b \\ 1, & \text{if } C_{n-1}(\omega) = 0 \text{ and } X_{n-1}(\omega) < a \\ 0, & \text{otherwise.} \end{cases}$$

By definition, $C_n$ is predictable. The sequence $C_n$ has the following property. If $X_0 < a$ then $C_1 = 1$. Then the sequence $C_n$ remains equal to 1 until the first time $X_n$ exceeds $b$. It then remains zero until the first time $X_n$ becomes smaller than $a$, at which point it switches back to 1, etc. If instead $X_0 \ge a$, then $C_1 = 0$ and it remains zero until the first time $X_n$ becomes smaller than $a$, at which point $C_n$ switches to 1 and then continues as above. Consider

$$Y_n = (C \cdot X)_n = \sum_{1 \le k \le n} C_k (X_k - X_{k-1}).$$

We claim that

$$Y_N(\omega) \ge (b - a) \, U_N[a, b](\omega) - (X_N(\omega) - a)^-.$$

Let $U_N[a, b] = k$. Then there are $0 \le s_1 < t_1 < \cdots < s_k < t_k \le N$ such that $X_{s_i}(\omega) < a < b < X_{t_i}(\omega)$, $i = 1, \ldots, k$. By definition, $C_{s_i + 1} = 1$ for all $i \ge 1$. Further, $C_{t+1}(\omega) = 1$ for $s_i + 1 \le t + 1 \le l_i$, where $l_i \le t_i$ is the smallest time $t \ge s_i$ such that $X_t(\omega) > b$. Without loss of generality, assume that $s_1 = \min\{n : X_n < a\}$. Let $s_{k+1} = \min\{n > t_k : X_n(\omega) < a\}$. Then,

$$Y_N(\omega) = \sum_{j \le N} C_j(\omega) (X_j(\omega) - X_{j-1}(\omega))
= \sum_{1 \le i \le k} \sum_{s_i \le t \le l_i - 1} C_{t+1}(\omega) (X_{t+1}(\omega) - X_t(\omega)) + \sum_{t \ge s_{k+1}} C_{t+1}(\omega) (X_{t+1}(\omega) - X_t(\omega))$$

(because on all other time intervals $C_t(\omega) = 0$)

$$= \sum_{1 \le i \le k} (X_{l_i}(\omega) - X_{s_i}(\omega)) + X_N(\omega) - X_{s_{k+1}}(\omega),$$

where the term $X_N(\omega) - X_{s_{k+1}}(\omega)$ is defined to be zero if $s_{k+1} > N$. Now, $X_{l_i}(\omega) - X_{s_i}(\omega) \ge b - a$. Furthermore, if $X_N(\omega) \ge X_{s_{k+1}}(\omega)$ then

$$X_N(\omega) - X_{s_{k+1}}(\omega) \ge 0.$$

Otherwise, since $X_{s_{k+1}}(\omega) < a$,

$$X_N(\omega) - X_{s_{k+1}}(\omega) \ge X_N(\omega) - a \ge -(X_N(\omega) - a)^-.$$

Therefore we have

$$Y_N(\omega) \ge U_N[a, b] (b - a) - (X_N(\omega) - a)^-,$$

as claimed.
Now, as we have established earlier, $Y_n = (C \cdot X)_n$ is a super-MG, since $C_n \ge 0$ is predictable. That is,

$$\mathbb{E}[Y_N] \le \mathbb{E}[Y_0] = 0.$$

Combining with the claim,

$$(b - a) \, \mathbb{E}[U_N[a, b]] \le \mathbb{E}[(X_N - a)^-].$$

This completes the proof of Doob's Lemma.
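A quick empirical check of the upcrossing inequality (an illustration only, not part of the lecture): for the symmetric random walk, a martingale, count completed upcrossings of $[a, b] = [-2, 2]$ over $N$ steps and compare $(b-a)\mathbb{E}[U_N]$ with $\mathbb{E}[(X_N - a)^-]$. The choices of $N$, $a$, $b$ are arbitrary.

```python
import random

random.seed(4)

def upcrossings(path, a, b):
    # Count completed upcrossings: a visit strictly below a followed
    # later by a visit strictly above b.
    count, armed = 0, False
    for x in path:
        if not armed and x < a:
            armed = True
        elif armed and x > b:
            count += 1
            armed = False
    return count

N, a, b, trials = 200, -2, 2, 20000
lhs_sum, rhs_sum = 0.0, 0.0
for _ in range(trials):
    path, X = [], 0
    for _ in range(N):
        X += random.choice((-1, 1))
        path.append(X)
    lhs_sum += (b - a) * upcrossings(path, a, b)
    rhs_sum += max(a - X, 0)  # (X_N - a)^-

print(lhs_sum / trials, rhs_sum / trials)  # lemma: first <= second
```

For the integer-valued walk each completed upcrossing actually travels from $\le a-1$ to $\ge b+1$, so the empirical gap between the two sides is comfortably positive.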
Next, we wish to use this to prove $\mathbb{P}(\Omega_{a,b}) = 0$.

Lemma 2. For any $a < b$, $\mathbb{P}(\Omega_{a,b}) = 0$.

Proof. By definition $\Omega_{a,b} = \{\omega : U[a, b] = \infty\}$. Now by Doob's Lemma

$$(b - a) \, \mathbb{E}[U_N[a, b]] \le \mathbb{E}[(X_N - a)^-] \le \sup_n \mathbb{E}[|X_n|] + |a| < \infty.$$

Now, $U_N[a, b] \nearrow U[a, b]$. Hence, by the Monotone Convergence Theorem, $\mathbb{E}[U_N[a, b]] \nearrow \mathbb{E}[U[a, b]]$. That is, $\mathbb{E}[U[a, b]] < \infty$. Hence, $\mathbb{P}(U[a, b] = \infty) = 0$.
2 Doob's Inequality

Theorem 2. Let $X_n$ be a sub-MG and let $\bar{X}_n = \max_{0 \le m \le n} X_m^+$. Given $\epsilon > 0$, let $A = \{\bar{X}_n \ge \epsilon\}$. Then

$$\epsilon \, \mathbb{P}(A) \le \mathbb{E}[X_n 1(A)] \le \mathbb{E}[X_n^+].$$


Proof. Define the stopping time

$$N = \min\{m : X_m \ge \epsilon \text{ or } m = n\}.$$

Thus, $\mathbb{P}(N \le n) = 1$. Now, by the Optional Stopping Theorem we have that $X_{N \wedge n}$ is a sub-MG. But $X_{N \wedge n} = X_N$. Thus

$$\mathbb{E}[X_N] \le \mathbb{E}[X_n]. \qquad (3)$$

We have

$$\mathbb{E}[X_N] = \mathbb{E}[X_N 1(A)] + \mathbb{E}[X_N 1(A^c)] = \mathbb{E}[X_N 1(A)] + \mathbb{E}[X_n 1(A^c)], \qquad (4)$$

since $N = n$ on $A^c$. Similarly,

$$\mathbb{E}[X_n] = \mathbb{E}[X_n 1(A)] + \mathbb{E}[X_n 1(A^c)]. \qquad (5)$$

From (3)-(5), we have

$$\mathbb{E}[X_N 1(A)] \le \mathbb{E}[X_n 1(A)]. \qquad (6)$$

But, since $X_N \ge \epsilon$ on $A$,

$$\epsilon \, \mathbb{P}(A) \le \mathbb{E}[X_N 1(A)]. \qquad (7)$$

From (6) and (7),

$$\epsilon \, \mathbb{P}(A) \le \mathbb{E}[X_n 1(A)] \le \mathbb{E}[X_n^+ 1(A)] \le \mathbb{E}[X_n^+]. \qquad (8)$$

Suppose $X_n$ is a non-negative sub-MG. Then

$$\mathbb{P}\Big(\max_{0 \le k \le n} X_k \ge \epsilon\Big) \le \frac{1}{\epsilon} \mathbb{E}[X_n].$$

If it were a MG, then we also obtain

$$\mathbb{P}\Big(\max_{0 \le k \le n} X_k \ge \epsilon\Big) \le \frac{1}{\epsilon} \mathbb{E}[X_n] = \frac{1}{\epsilon} \mathbb{E}[X_0].$$

Lp maximal inequality and Lp convergence

Theorem 3. Let $X_n$ be a sub-MG and suppose $\mathbb{E}[(X_n^+)^p] < \infty$ for some $p > 1$. Then

$$\mathbb{E}\Big[\big(\max_{0 \le k \le n} X_k^+\big)^p\Big]^{1/p} \le q \, \mathbb{E}[(X_n^+)^p]^{1/p},$$

where $\frac{1}{q} + \frac{1}{p} = 1$. In particular, if $X_n$ is a MG then $|X_n|$ is a sub-MG and hence

$$\mathbb{E}\Big[\big(\max_{0 \le k \le n} |X_k|\big)^p\Big]^{1/p} \le q \, \mathbb{E}[|X_n|^p]^{1/p}.$$

Before we state the proof, an important application is

Theorem 4. If $X_n$ is a martingale with $\sup_n \mathbb{E}[|X_n|^p] < \infty$ where $p > 1$, then $X_n \to X_\infty$ a.s. and in $L^p$, where $X_\infty = \limsup_n X_n$.
We will first prove Theorem 4 and then Theorem 3.

Proof (Theorem 4). Since $\sup_n \mathbb{E}[|X_n|^p] < \infty$, $p > 1$, by the MG convergence theorem we have that $X_n \to X_\infty$ a.s., where $X_\infty = \limsup_n X_n$. For $L^p$ convergence, we will use the $L^p$-inequality of Theorem 3. That is,

$$\mathbb{E}\Big[\big(\sup_{0 \le m \le n} |X_m|\big)^p\Big] \le q^p \, \mathbb{E}[|X_n|^p].$$

Now, $\sup_{0 \le m \le n} |X_m| \nearrow \sup_{0 \le m} |X_m|$. Therefore, by the Monotone Convergence Theorem we obtain

$$\mathbb{E}\Big[\sup_{0 \le m} |X_m|^p\Big] \le q^p \sup_n \mathbb{E}[|X_n|^p] < \infty.$$

Thus, $\sup_{0 \le m} |X_m| \in L^p$. Now,

$$|X_n - X_\infty| \le 2 \sup_{0 \le m} |X_m|.$$

Therefore, by the Dominated Convergence Theorem, $\mathbb{E}[|X_n - X_\infty|^p] \to 0$.


Proof (Theorem 3). We will use truncation to prove the result. Let $M$ be the truncation parameter, and set $\bar{X}_n = \max_{0 \le k \le n} X_k^+$ and $\bar{X}_{n,M} = \min(\bar{X}_n, M)$. Now, consider the following:

$$\mathbb{E}[(\bar{X}_{n,M})^p] = \int_0^\infty p \lambda^{p-1} \, \mathbb{P}(\bar{X}_{n,M} \ge \lambda) \, d\lambda \le \int_0^\infty p \lambda^{p-1} \frac{1}{\lambda} \mathbb{E}[X_n^+ 1(\bar{X}_{n,M} \ge \lambda)] \, d\lambda.$$

The above inequality follows from

$$\mathbb{P}(\bar{X}_{n,M} \ge \lambda) = \begin{cases} 0, & \text{if } M < \lambda \\ \mathbb{P}(\bar{X}_n \ge \lambda), & \text{if } M \ge \lambda \end{cases}$$

and Theorem 2. By an application of Fubini's theorem for non-negative integrands, we have

$$\mathbb{E}[(\bar{X}_{n,M})^p] \le p \, \mathbb{E}\Big[X_n^+ \int_0^{\bar{X}_{n,M}} \lambda^{p-2} \, d\lambda\Big] = \frac{p}{p-1} \mathbb{E}[X_n^+ (\bar{X}_{n,M})^{p-1}] \le \frac{p}{p-1} \mathbb{E}[(X_n^+)^p]^{1/p} \, \mathbb{E}[(\bar{X}_{n,M})^{(p-1)q}]^{1/q}, \qquad (9)$$

where the last step is Holder's inequality. Here $\frac{1}{q} = 1 - \frac{1}{p}$, so $q(p-1) = p$ and $\frac{p}{p-1} = q$. Thus, we can simplify (9) to

$$\mathbb{E}[(\bar{X}_{n,M})^p] \le q \, \mathbb{E}[(X_n^+)^p]^{1/p} \, \mathbb{E}[(\bar{X}_{n,M})^p]^{1/q}.$$

That is,

$$\|\bar{X}_{n,M}\|_p^p \le q \, \|X_n^+\|_p \, \|\bar{X}_{n,M}\|_p^{p/q}, \qquad \text{hence} \qquad \|\bar{X}_{n,M}\|_p^{p(1 - \frac{1}{q})} \le q \, \|X_n^+\|_p.$$

Since $p(1 - \frac{1}{q}) = 1$, we conclude $\|\bar{X}_{n,M}\|_p \le q \, \|X_n^+\|_p$; letting $M \to \infty$ and using the Monotone Convergence Theorem gives the claim. (The truncation is what ensures all the quantities above are finite.)


4 Backward Martingale

Let $\mathcal{F}_n$, $n \le 0$, be an increasing sequence of $\sigma$-algebras: $\cdots \subset \mathcal{F}_{-3} \subset \mathcal{F}_{-2} \subset \mathcal{F}_{-1} \subset \mathcal{F}_0$. Let $X_n$ be $\mathcal{F}_n$-adapted, $n \le 0$, and

$$\mathbb{E}[X_{n+1} \mid \mathcal{F}_n] = X_n, \qquad n < 0.$$

Then $X_n$ is called a backward MG.

Theorem 5. Let $X_n$ be a backward MG. Then $\lim_{n \to -\infty} X_n = X_{-\infty}$ exists a.s. and in $L^1$.

Compare with the standard MG convergence results:

(a) We need $\sup_n \mathbb{E}[|X_n|] < \infty$, or a non-negative MG, in Doob's convergence theorem, which gives a.s. convergence but not $L^1$ convergence.

(b) For $L^1$ convergence we need uniform integrability (UI). And it is necessary, because if $X_n \to X_\infty$ a.s. and in $L^1$, then there exists $X \in \mathcal{F}$ s.t. $X_n = \mathbb{E}[X \mid \mathcal{F}_n]$; and hence $X_n$ is UI.

Proof of Theorem 5. Recall the proof of Doob's convergence theorem. Let

$$\Omega_0 := \{\omega : X_n(\omega) \text{ does not converge to a limit in } [-\infty, \infty]\}
= \{\omega : \liminf_n X_n(\omega) < \limsup_n X_n(\omega)\}
= \bigcup_{a < b:\, a, b \in \mathbb{Q}} \{\omega : \liminf_n X_n(\omega) < a < b < \limsup_n X_n(\omega)\}
= \bigcup_{a < b:\, a, b \in \mathbb{Q}} \Omega_{a,b}.$$

Now, let $U_n[a, b]$ be the number of upcrossings of $[a, b]$ by $X_n, X_{n+1}, \ldots, X_0$, as $n \to -\infty$. By the upcrossing inequality, it follows that

$$(b - a) \, \mathbb{E}[U_n[a, b]] \le \mathbb{E}[|X_0|] + |a|.$$

Since $U_n[a, b] \nearrow U[a, b]$, by the Monotone Convergence Theorem we have

$$\mathbb{E}[U[a, b]] < \infty \implies \mathbb{P}(\Omega_{a,b}) = 0.$$

This implies $X_n$ converges a.s.

Now, $X_n = \mathbb{E}[X_0 \mid \mathcal{F}_n]$. Therefore, $X_n$ is UI. This implies $X_n \to X_{-\infty}$ in $L^1$.
Theorem 6. Let $X_{-\infty} = \lim_{n \to -\infty} X_n$ and $\mathcal{F}_{-\infty} = \cap_n \mathcal{F}_n$. Then $X_{-\infty} = \mathbb{E}[X_0 \mid \mathcal{F}_{-\infty}]$.

Proof. Recall $X_n = \mathbb{E}[X_0 \mid \mathcal{F}_n]$. If $A \in \mathcal{F}_{-\infty} \subset \mathcal{F}_n$, then $\mathbb{E}[X_n; A] = \mathbb{E}[X_0; A]$. Now,

$$|\mathbb{E}[X_n; A] - \mathbb{E}[X_{-\infty}; A]| = |\mathbb{E}[X_n - X_{-\infty}; A]| \le \mathbb{E}[|X_n - X_{-\infty}|; A] \le \mathbb{E}[|X_n - X_{-\infty}|] \to 0 \quad \text{(by Theorem 5).}$$

Hence, $\mathbb{E}[X_{-\infty}; A] = \mathbb{E}[X_0; A]$. Thus, $X_{-\infty} = \mathbb{E}[X_0 \mid \mathcal{F}_{-\infty}]$.
Theorem 7. Let $\mathcal{F}_n \searrow \mathcal{F}_{-\infty}$ and $Y \in L^1$. Then $\mathbb{E}[Y \mid \mathcal{F}_n] \to \mathbb{E}[Y \mid \mathcal{F}_{-\infty}]$ a.s. and in $L^1$.

Proof. $X_n = \mathbb{E}[Y \mid \mathcal{F}_n]$ is a backward MG by definition. Therefore, $X_n \to X_{-\infty}$ a.s. and in $L^1$. By Theorem 6, $X_{-\infty} = \mathbb{E}[X_0 \mid \mathcal{F}_{-\infty}] = \mathbb{E}[Y \mid \mathcal{F}_{-\infty}]$. Thus, $\mathbb{E}[Y \mid \mathcal{F}_n] \to \mathbb{E}[Y \mid \mathcal{F}_{-\infty}]$.

Strong Law of Large Numbers

Theorem 8 (SLLN). Let $\xi_i$ be i.i.d. with $\mathbb{E}[|\xi_i|] < \infty$, and let $S_n = \xi_1 + \cdots + \xi_n$. Then $\frac{S_n}{n} \to \mathbb{E}[\xi_1]$ a.s. and in $L^1$.

Set $X_{-n} = \frac{S_n}{n}$ and $\mathcal{F}_{-n} = \sigma(S_n, \xi_{n+1}, \xi_{n+2}, \ldots)$. Then

$$\mathbb{E}[X_{-n} \mid \mathcal{F}_{-(n+1)}] = \mathbb{E}\Big[\frac{S_n}{n} \,\Big|\, \mathcal{F}_{-(n+1)}\Big] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[\xi_i \mid S_{n+1}] = \mathbb{E}[\xi_1 \mid S_{n+1}] = \frac{1}{n+1} S_{n+1} = X_{-(n+1)},$$

where we used the symmetry of the $\xi_i$ given $S_{n+1}$. Then $X_{-n}$ is a backward MG.
Proof. By Theorems 5 and 7, we have $X_{-n} \to X_{-\infty}$ a.s. and in $L^1$, with $X_{-\infty} = \mathbb{E}[\xi_1 \mid \mathcal{F}_{-\infty}]$. Now $\mathcal{F}_{-\infty}$ is contained in $\mathcal{E}$ (the exchangeable $\sigma$-algebra). By the Hewitt-Savage 0-1 law (proved next), $\mathcal{E}$ is trivial. That is, $\mathbb{E}[\xi_1 \mid \mathcal{F}_{-\infty}]$ is a constant, and taking expectations shows that this constant is $\mathbb{E}[\xi_1]$. Thus,

$$X_{-\infty} = \lim_{n \to \infty} \frac{S_n}{n} = \mathbb{E}[\xi_1].$$
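As a trivial numerical illustration of this conclusion (not part of the lecture), the running average of i.i.d. draws settles near the mean; the exponential distribution and sample size are arbitrary choices:

```python
import random

random.seed(5)

n = 200000
s = sum(random.expovariate(0.5) for _ in range(n))  # i.i.d. draws, mean 2
print(s / n)  # SLLN: S_n / n should be close to E[xi_1] = 2
```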

Hewitt-Savage 0-1 Law

Theorem 9. Let $X_1, X_2, \ldots$ be i.i.d. and let $\mathcal{E}$ be the exchangeable $\sigma$-algebra:

$$\mathcal{E}_n = \{A : \pi_n A = A \text{ for all } \pi_n \in S_n\}; \qquad \mathcal{E} = \cap_n \mathcal{E}_n.$$

If $A \in \mathcal{E}$, then $\mathbb{P}(A) \in \{0, 1\}$.

Proof. The key to the proof is the following lemma.

Lemma 3. Let $X_1, X_2, \ldots$ be i.i.d. and define

$$A_n(\varphi) = \frac{1}{(n)_k} \sum_{(i_1, \ldots, i_k)} \varphi(X_{i_1}, \ldots, X_{i_k}),$$

where the sum is over ordered $k$-tuples of distinct indices in $\{1, \ldots, n\}$ and $(n)_k = n(n-1)\cdots(n-k+1)$ is their number. If $\varphi$ is bounded, then

$$A_n(\varphi) \to \mathbb{E}[\varphi(X_1, \ldots, X_k)] \quad \text{a.s.}$$

Proof. $A_n(\varphi) \in \mathcal{E}_n$ by definition. So

$$A_n(\varphi) = \mathbb{E}[A_n(\varphi) \mid \mathcal{E}_n] = \frac{1}{(n)_k} \sum_{(i_1, \ldots, i_k)} \mathbb{E}[\varphi(X_{i_1}, \ldots, X_{i_k}) \mid \mathcal{E}_n] = \frac{1}{(n)_k} \sum_{(i_1, \ldots, i_k)} \mathbb{E}[\varphi(X_1, \ldots, X_k) \mid \mathcal{E}_n] = \mathbb{E}[\varphi(X_1, \ldots, X_k) \mid \mathcal{E}_n]. \qquad (10)$$

Let $\mathcal{F}_{-n} = \mathcal{E}_n$. Then $\mathcal{F}_{-n} \searrow \mathcal{F}_{-\infty} = \mathcal{E}$. Then, for $Y = \varphi(X_1, \ldots, X_k)$, $\mathbb{E}[Y \mid \mathcal{F}_{-n}]$ is a backward MG. Therefore,

$$\mathbb{E}[Y \mid \mathcal{F}_{-n}] \to \mathbb{E}[Y \mid \mathcal{F}_{-\infty}] = \mathbb{E}[\varphi(X_1, \ldots, X_k) \mid \mathcal{E}].$$

Thus,

$$A_n(\varphi) \to \mathbb{E}[\varphi(X_1, \ldots, X_k) \mid \mathcal{E}]. \qquad (11)$$

We want to show that $\mathbb{E}[\varphi(X_1, \ldots, X_k) \mid \mathcal{E}]$ is in fact the constant $\mathbb{E}[\varphi(X_1, \ldots, X_k)]$. First, we show that $\mathbb{E}[\varphi(X_1, \ldots, X_k) \mid \mathcal{E}]$ is measurable with respect to $\sigma(X_{k+1}, X_{k+2}, \ldots)$, using that $\varphi$ is bounded. Then, we show that if $\mathbb{E}[X \mid \mathcal{G}] \in \mathcal{F}$ where $X$ is independent of $\mathcal{F}$, then $\mathbb{E}[X \mid \mathcal{G}]$ is constant, equal to $\mathbb{E}[X]$. This will complete the proof of the Lemma.
First step: consider $A_n(\varphi)$. It has $(n)_k$ terms, of which $k (n-1)_{k-1}$ contain $X_1$. Therefore, the contribution $T_n(1)$ of the terms containing $X_1$ satisfies

$$|T_n(1)| = \Big|\frac{1}{(n)_k} \sum_{(i_1, \ldots, i_k) \ni 1} \varphi(X_{i_1}, \ldots, X_{i_k})\Big| \le \frac{1}{(n)_k} \, k (n-1)_{k-1} \, \|\varphi\|_\infty = \frac{k}{n} \|\varphi\|_\infty \to 0 \text{ as } n \to \infty. \qquad (12)$$

Let $A_n^1(\varphi) = A_n(\varphi) - T_n(1)$. Then we have $A_n^1(\varphi) \to \mathbb{E}[\varphi(X_1, \ldots, X_k) \mid \mathcal{E}]$ from (11) and (12). Since $A_n^1(\varphi)$ does not depend on $X_1$, the limit $\mathbb{E}[\varphi(X_1, \ldots, X_k) \mid \mathcal{E}]$ is independent of $X_1$. Similarly, repeating the argument for $X_2, \ldots, X_k$, we obtain that

$$\mathbb{E}[\varphi(X_1, \ldots, X_k) \mid \mathcal{E}] \in \sigma(X_{k+1}, X_{k+2}, \ldots).$$


Second step: if $\mathbb{E}[X^2] < \infty$, $\mathbb{E}[X \mid \mathcal{G}] \in \mathcal{F}$, and $X$ is independent of $\mathcal{F}$, then $\mathbb{E}[X \mid \mathcal{G}] = \mathbb{E}[X]$.

Proof. Let $Y = \mathbb{E}[X \mid \mathcal{G}]$. Since $Y \in \mathcal{F}$ and $X$ is independent of $\mathcal{F}$, we have

$$\mathbb{E}[XY] = \mathbb{E}[X] \mathbb{E}[Y] = \mathbb{E}[Y]^2,$$

since $\mathbb{E}[Y] = \mathbb{E}[X]$. Now, by the definition of conditional expectation, for any $Z \in \mathcal{G}$, $\mathbb{E}[XZ] = \mathbb{E}[YZ]$. Hence, for $Z = Y$, we have $\mathbb{E}[XY] = \mathbb{E}[Y^2]$. Thus,

$$\mathbb{E}[Y^2] = \mathbb{E}[Y]^2 \implies \mathrm{Var}(Y) = 0 \implies Y = \mathbb{E}[Y] \text{ a.s.} \qquad (13)$$

This completes the proof of the Lemma.


Now we complete the proof of the Hewitt-Savage law. We have proved that $A_n(\varphi) \to \mathbb{E}[\varphi(X_1, \ldots, X_k)]$ a.s. for every bounded $\varphi$ depending on finitely many components.

By the first step, $\mathcal{E}$ is independent of $\mathcal{G}_k = \sigma(X_1, \ldots, X_k)$, and this is true for all $k$. Now $\cup_k \mathcal{G}_k$ is a $\pi$-system whose generated $\sigma$-algebra contains $\mathcal{E}$. Therefore, $\mathcal{E}$ is independent of $\sigma(\cup_k \mathcal{G}_k) \supset \mathcal{E}$. Thus, every $A \in \mathcal{E}$ is independent of itself. Hence,

$$\mathbb{P}(A \cap A) = \mathbb{P}(A) \, \mathbb{P}(A) \implies \mathbb{P}(A) \in \{0, 1\}.$$

De Finetti's Theorem

Theorem 10. Let $X_1, X_2, \ldots$ be an exchangeable sequence, that is, for any $n$ and $\pi_n \in S_n$, $(X_1, \ldots, X_n) \stackrel{d}{=} (X_{\pi_n(1)}, \ldots, X_{\pi_n(n)})$. Then, conditional on $\mathcal{E}$, the variables $X_1, X_2, \ldots$ are i.i.d.

Proof. As in the proof of the Hewitt-Savage law and its Lemma, define $A_n(\varphi) = \frac{1}{(n)_k} \sum_{(i_1, \ldots, i_k)} \varphi(X_{i_1}, \ldots, X_{i_k})$. Then, due to exchangeability,

$$A_n(\varphi) = \mathbb{E}[A_n(\varphi) \mid \mathcal{E}_n] = \mathbb{E}[\varphi(X_1, \ldots, X_k) \mid \mathcal{E}_n] \to \mathbb{E}[\varphi(X_1, \ldots, X_k) \mid \mathcal{E}] \qquad (14)$$

by the backward MG convergence theorem. Since $X_1, X_2, \ldots$ may not be i.i.d., $\mathcal{E}$ can be nontrivial; therefore, the limit need not be constant. Consider bounded $f : \mathbb{R}^{k-1} \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$. Let $I_{n,k}$ be the set of all ordered $k$-tuples of distinct indices $1 \le i_1, \ldots, i_k \le n$. Then

$$(n)_{k-1} A_n(f) \cdot n A_n(g) = \sum_{i \in I_{n,k-1}} f(X_{i_1}, \ldots, X_{i_{k-1}}) \sum_{m \le n} g(X_m)
= \sum_{i \in I_{n,k}} f(X_{i_1}, \ldots, X_{i_{k-1}}) g(X_{i_k}) + \sum_{i \in I_{n,k-1}} f(X_{i_1}, \ldots, X_{i_{k-1}}) \sum_{j=1}^{k-1} g(X_{i_j}). \qquad (15)$$
Let $\varphi_j(X_1, \ldots, X_{k-1}) = f(X_1, \ldots, X_{k-1}) g(X_j)$ for $1 \le j \le k-1$, and $\varphi(X_1, \ldots, X_k) = f(X_1, \ldots, X_{k-1}) g(X_k)$. Then (15) reads

$$(n)_{k-1} A_n(f) \cdot n A_n(g) = (n)_k A_n(\varphi) + (n)_{k-1} \sum_{j=1}^{k-1} A_n(\varphi_j).$$

Dividing by $(n)_k$, we have

$$\frac{n}{n - k + 1} A_n(f) A_n(g) = A_n(\varphi) + \frac{1}{n - k + 1} \sum_{j=1}^{k-1} A_n(\varphi_j).$$

Letting $n \to \infty$ and using (14) together with the fact that $\|f\|_\infty, \|g\|_\infty < \infty$, we have

$$\mathbb{E}[f(X_1, \ldots, X_{k-1}) \mid \mathcal{E}] \, \mathbb{E}[g(X_1) \mid \mathcal{E}] = \mathbb{E}[f(X_1, \ldots, X_{k-1}) g(X_k) \mid \mathcal{E}]. \qquad (16)$$

Thus, using (16) inductively, for any collection of bounded functions $f_1, \ldots, f_k$,

$$\mathbb{E}\Big[\prod_{i=1}^k f_i(X_i) \,\Big|\, \mathcal{E}\Big] = \prod_{i=1}^k \mathbb{E}[f_i(X_i) \mid \mathcal{E}].$$

Message: given the symmetry assumption, and hence exchangeable statistics, the underlying random variables become conditionally i.i.d.!

A nice example: let $X_i$ be exchangeable r.v.s taking values in $\{0, 1\}$. Then there exists a distribution on $[0, 1]$ with distribution function $F$ s.t.

$$\mathbb{P}(X_1 + \cdots + X_n = k) = \binom{n}{k} \int_0^1 \theta^k (1 - \theta)^{n-k} \, dF(\theta)$$

for all $n$. That is, the sequence is a mixture of i.i.d. sequences.
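This mixture identity can be sanity-checked numerically (an illustration with an arbitrary two-point mixing distribution $F$; not part of the lecture). We draw $\theta \sim F$, then $n$ i.i.d. coins with bias $\theta$, and compare the empirical frequency of $k$ ones with the binomial probability averaged over $F$:

```python
import math
import random

random.seed(6)

thetas, weights = (0.2, 0.8), (0.5, 0.5)  # F: two-point mixing distribution
n, k, trials = 5, 2, 50000

def sample_sum():
    theta = random.choices(thetas, weights=weights)[0]  # theta ~ F
    return sum(random.random() < theta for _ in range(n))  # then i.i.d. coins

emp = sum(sample_sum() == k for _ in range(trials)) / trials
exact = sum(w * math.comb(n, k) * th ** k * (1 - th) ** (n - k)
            for th, w in zip(thetas, weights))
print(emp, exact)  # exact is about 0.128 for these parameters
```

Marginally the coins are exchangeable but not independent: observing early ones shifts belief about $\theta$ and hence about later coins, exactly the situation De Finetti's theorem describes.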




MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070J
Lecture 12

Fall 2013
10/21/2013

Martingale Concentration Inequalities and Applications

Content.
1. Exponential concentration for martingales with bounded increments
2. Concentration for Lipschitz continuous functions
3. Examples in statistics and random graph theory

1 Azuma-Hoeffding inequality

Suppose $X_n$ is a martingale wrt the filtration $\mathcal{F}_n$ such that $X_0 = 0$. The goal of this lecture is to obtain bounds of the form $\mathbb{P}(|X_n| \ge \delta n) \le \exp(-\Theta(n))$ under some conditions on $X_n$. Note that since $\mathbb{E}[X_n] = 0$, the deviation from zero is the right regime to look for rare events. It turns out that an exponential bound of the form above holds under the very simple assumption that the increments of $X_n$ are bounded. The theorem below is known as the Azuma-Hoeffding Inequality.

Theorem 1 (Azuma-Hoeffding Inequality). Suppose $X_n$, $n \ge 1$, is a martingale such that $X_0 = 0$ and $|X_i - X_{i-1}| \le d_i$, $1 \le i \le n$, almost surely for some constants $d_i$, $1 \le i \le n$. Then, for every $t > 0$,

$$\mathbb{P}(|X_n| > t) \le 2 \exp\Big(-\frac{t^2}{2 \sum_{i=1}^n d_i^2}\Big).$$

Notice that in the special case when $d_i = d$, we can take $t = xn$ and obtain an upper bound $2 \exp(-x^2 n / (2d^2))$, which is of the form promised above. Note that this is consistent with the Chernoff bound for the special case when $X_n$ is the sum of i.i.d. zero-mean terms, though it is applicable only in the special case of a.s. bounded increments.
Proof. $f(x) \triangleq \exp(\lambda x)$ is a convex function in $x$ for any $\lambda \in \mathbb{R}$. Then we have $f(d_i) = \exp(\lambda d_i)$ and $f(-d_i) = \exp(-\lambda d_i)$. Using convexity we have that when $|x/d_i| \le 1$

$$\exp(\lambda x) = f(x) = f\Big(\frac{1}{2}\Big(\frac{x}{d_i} + 1\Big) d_i + \frac{1}{2}\Big(1 - \frac{x}{d_i}\Big)(-d_i)\Big)
\le \frac{1}{2}\Big(\frac{x}{d_i} + 1\Big) f(d_i) + \frac{1}{2}\Big(1 - \frac{x}{d_i}\Big) f(-d_i)
= \frac{f(d_i) + f(-d_i)}{2} + \frac{f(d_i) - f(-d_i)}{2 d_i} \, x. \qquad (1)$$

Further, for every $a$,

$$\frac{\exp(a) + \exp(-a)}{2} = \frac{1}{2} \sum_{k \ge 0} \frac{a^k}{k!} + \frac{1}{2} \sum_{k \ge 0} \frac{(-1)^k a^k}{k!} = \sum_{k \ge 0} \frac{a^{2k}}{(2k)!} \le \sum_{k \ge 0} \frac{a^{2k}}{2^k k!} \quad \text{(because } 2^k k! \le (2k)!\text{)}
= \sum_{k \ge 0} \frac{(a^2/2)^k}{k!} = \exp\Big(\frac{a^2}{2}\Big). \qquad (2)$$

Combining (1) with (2) applied to $a = \lambda d_i$, we conclude that for every $x$ such that $|x/d_i| \le 1$

$$\exp(\lambda x) \le \exp\Big(\frac{\lambda^2 d_i^2}{2}\Big) + \frac{\exp(\lambda d_i) - \exp(-\lambda d_i)}{2 d_i} \, x. \qquad (3)$$

We now turn to our martingale sequence $X_n$. For every $t > 0$ and every $\lambda > 0$ we have

$$\mathbb{P}(X_n \ge t) = \mathbb{P}(\exp(\lambda X_n) \ge \exp(\lambda t)) \le \exp(-\lambda t) \, \mathbb{E}[\exp(\lambda X_n)] = \exp(-\lambda t) \, \mathbb{E}\Big[\exp\Big(\lambda \sum_{1 \le i \le n} (X_i - X_{i-1})\Big)\Big],$$

where $X_0 = 0$ was used in the last equality. Applying the tower property of conditional expectation we have

$$\mathbb{E}\Big[\exp\Big(\lambda \sum_{1 \le i \le n} (X_i - X_{i-1})\Big)\Big] = \mathbb{E}\Big[\mathbb{E}\Big[\exp(\lambda (X_n - X_{n-1})) \exp\Big(\lambda \sum_{1 \le i \le n-1} (X_i - X_{i-1})\Big) \,\Big|\, \mathcal{F}_{n-1}\Big]\Big].$$

Now, since $X_i$, $i \le n-1$, are measurable wrt $\mathcal{F}_{n-1}$, then

$$\mathbb{E}\Big[\exp(\lambda (X_n - X_{n-1})) \exp\Big(\lambda \sum_{1 \le i \le n-1} (X_i - X_{i-1})\Big) \,\Big|\, \mathcal{F}_{n-1}\Big]
= \exp\Big(\lambda \sum_{1 \le i \le n-1} (X_i - X_{i-1})\Big) \, \mathbb{E}[\exp(\lambda (X_n - X_{n-1})) \mid \mathcal{F}_{n-1}]$$
$$\le \exp\Big(\lambda \sum_{1 \le i \le n-1} (X_i - X_{i-1})\Big) \Big(\exp\Big(\frac{\lambda^2 d_n^2}{2}\Big) + \frac{\exp(\lambda d_n) - \exp(-\lambda d_n)}{2 d_n} \, \mathbb{E}[X_n - X_{n-1} \mid \mathcal{F}_{n-1}]\Big),$$

where (3) was used in the last inequality. The martingale property implies $\mathbb{E}[X_n - X_{n-1} \mid \mathcal{F}_{n-1}] = 0$, and we have obtained the upper bound

$$\mathbb{E}\Big[\exp\Big(\lambda \sum_{1 \le i \le n} (X_i - X_{i-1})\Big)\Big] \le \mathbb{E}\Big[\exp\Big(\lambda \sum_{1 \le i \le n-1} (X_i - X_{i-1})\Big)\Big] \exp\Big(\frac{\lambda^2 d_n^2}{2}\Big).$$
Iterating further, we obtain the following upper bound on $\mathbb{P}(X_n \ge t)$:

$$\mathbb{P}(X_n \ge t) \le \exp(-\lambda t) \exp\Big(\frac{\lambda^2 \sum_{1 \le i \le n} d_i^2}{2}\Big).$$

Optimizing over the choice of $\lambda$, we see that the tightest bound is obtained by setting $\lambda = t / \sum_i d_i^2 > 0$, leading to the upper bound

$$\mathbb{P}(X_n \ge t) \le \exp\Big(-\frac{t^2}{2 \sum_i d_i^2}\Big).$$

A similar approach using $\lambda < 0$ gives, for every $t > 0$,

$$\mathbb{P}(X_n \le -t) \le \exp\Big(-\frac{t^2}{2 \sum_i d_i^2}\Big).$$

Combining, we obtain the required result.
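A quick simulation check of Theorem 1 (an illustration only, not part of the lecture), for the $\pm 1$ random walk where $d_i = 1$ and the threshold $t$ is an arbitrary choice:

```python
import math
import random

random.seed(7)

n, t, trials = 100, 30, 20000
bound = 2 * math.exp(-t ** 2 / (2 * n))  # Azuma-Hoeffding with d_i = 1

exceed = 0
for _ in range(trials):
    X = sum(random.choice((-1, 1)) for _ in range(n))
    exceed += abs(X) > t

print(exceed / trials, bound)  # empirical tail should sit below the bound
```

For this example the true tail (roughly a Gaussian tail at three standard deviations) is an order of magnitude below the Azuma bound, which reflects that the bound only exploits increment boundedness, not the exact increment distribution.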
2 Application to Lipschitz continuous functions of i.i.d. random variables

Suppose $X_1, \ldots, X_n$ are independent random variables. Suppose $g : \mathbb{R}^n \to \mathbb{R}$ is a function and $d_1, \ldots, d_n$ are constants such that for any two vectors $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$

$$|g(x_1, \ldots, x_n) - g(y_1, \ldots, y_n)| \le \sum_{i=1}^n d_i \, 1\{x_i \ne y_i\}. \qquad (4)$$

In particular, when a vector $x$ changes value only in its $i$-th coordinate, the amount of change in the function $g$ is at most $d_i$. As a special case, consider the set of vectors $x = (x_1, \ldots, x_n)$ such that $|x_i| \le c_i$, and suppose $g$ is Lipschitz continuous with constant $K$. Namely, for every $x, y$, $|g(x) - g(y)| \le K|x - y|$, where $|x - y| = \sum_i |x_i - y_i|$. Then for any two such vectors

$$|g(x) - g(y)| \le K|x - y| \le K \sum_i 2 c_i \, 1\{x_i \ne y_i\},$$

and therefore this fits into the previous framework with $d_i = 2 K c_i$.


Theorem 2. Suppose $X_i$, $1 \le i \le n$, are independent and the function $g : \mathbb{R}^n \to \mathbb{R}$ satisfies (4). Then for every $t \ge 0$

$$\mathbb{P}(|g(X_1, \ldots, X_n) - \mathbb{E}[g(X_1, \ldots, X_n)]| > t) \le 2 \exp\Big(-\frac{t^2}{2 \sum_i d_i^2}\Big).$$
Proof. Let $\mathcal{F}_i$ be the $\sigma$-field generated by the variables $X_1, \ldots, X_i$: $\mathcal{F}_i = \sigma(X_1, \ldots, X_i)$. For convenience, we also set $\mathcal{F}_0$ to be the trivial $\sigma$-field consisting of $\emptyset, \Omega$, so that $\mathbb{E}[Z \mid \mathcal{F}_0] = \mathbb{E}[Z]$ for every r.v. $Z$. Let $M_0 = \mathbb{E}[g(X_1, \ldots, X_n)]$, $M_1 = \mathbb{E}[g(X_1, \ldots, X_n) \mid \mathcal{F}_1]$, ..., $M_n = \mathbb{E}[g(X_1, \ldots, X_n) \mid \mathcal{F}_n]$. Observe that $M_n$ is simply $g(X_1, \ldots, X_n)$, since $X_1, \ldots, X_n$ are measurable wrt $\mathcal{F}_n$. By the tower property,

$$\mathbb{E}[M_n \mid \mathcal{F}_{n-1}] = \mathbb{E}[\mathbb{E}[g(X_1, \ldots, X_n) \mid \mathcal{F}_n] \mid \mathcal{F}_{n-1}] = M_{n-1}.$$

Thus, $M_i$ is a martingale. Since the $X_i$'s are independent, $M_i$ is a r.v. which on any vector $x = (x_1, \ldots, x_n)$ takes the value

$$M_i = \int_{x_{i+1}, \ldots, x_n} g(x_1, \ldots, x_n) \, d\mathbb{P}(x_{i+1}) \cdots d\mathbb{P}(x_n)$$

(and in particular only depends on the first $i$ coordinates of $x$). Similarly,

$$M_{i+1} = \int_{x_{i+2}, \ldots, x_n} g(x_1, \ldots, x_n) \, d\mathbb{P}(x_{i+2}) \cdots d\mathbb{P}(x_n).$$

Thus

$$|M_{i+1} - M_i| = \Big|\int_{x_{i+2}, \ldots, x_n} \Big(g(x_1, \ldots, x_n) - \int_{x_{i+1}} g(x_1, \ldots, x_n) \, d\mathbb{P}(x_{i+1})\Big) \, d\mathbb{P}(x_{i+2}) \cdots d\mathbb{P}(x_n)\Big|
\le \int_{x_{i+2}, \ldots, x_n} d_{i+1} \, d\mathbb{P}(x_{i+2}) \cdots d\mathbb{P}(x_n) = d_{i+1},$$

where the inner difference is at most $d_{i+1}$ by property (4). This derivation represents a simple idea: $M_i$ and $M_{i+1}$ only differ by averaging out $X_{i+1}$ in $M_i$. Now, defining $\tilde{M}_i = M_i - M_0 = M_i - \mathbb{E}[g(X_1, \ldots, X_n)]$, we have that $\tilde{M}_i$ is also a martingale with differences bounded by $d_i$, but with the additional property $\tilde{M}_0 = 0$. Applying Theorem 1, we obtain the required result.
3 Two examples

We now consider two applications of the concentration inequalities developed in the previous sections. Our first example concerns the convergence of empirical distributions to the true distribution of random variables. Specifically, suppose we have a distribution function $F$ and an i.i.d. sequence $X_1, \ldots, X_n$ with distribution $F$. From the sample $X_1, \ldots, X_n$ we can build the empirical distribution function $F_n(x) = \frac{1}{n} \sum_{1 \le i \le n} 1\{X_i \le x\}$. Namely, $F_n(x)$ is simply the frequency of observing values at most $x$ in our sample. We should realize that $F_n$ is a random function, since it depends on the sample $X_1, \ldots, X_n$. An important theorem called the Glivenko-Cantelli theorem says that $\sup_{x \in \mathbb{R}} |F_n(x) - F(x)|$ converges to zero almost surely and in expectation, the latter meaning of course that $\mathbb{E}[\sup_{x \in \mathbb{R}} |F_n(x) - F(x)|] \to 0$. Proving this result is beyond our scope. However, applying the martingale concentration inequality, we can bound the deviation of $\sup_{x \in \mathbb{R}} |F_n(x) - F(x)|$ around its expectation. For convenience let $L_n = L_n(X_1, \ldots, X_n) = \sup_{x \in \mathbb{R}} |F_n(x) - F(x)|$, which is commonly called the empirical risk in the statistics and machine learning fields. We need to bound $\mathbb{P}(|L_n - \mathbb{E}[L_n]| > t)$. Observe that $L_n$ satisfies property (4) with $d_i = 1/n$. Indeed, changing one coordinate $X_i$ to some $X_i'$ changes $F_n$ by at most $1/n$, and thus the same applies to $L_n$. Applying Theorem 2 we obtain

$$\mathbb{P}(|L_n - \mathbb{E}[L_n]| > t) \le 2 \exp\Big(-\frac{t^2}{2n(1/n)^2}\Big) = 2 \exp\Big(-\frac{t^2 n}{2}\Big).$$

Thus, we obtain a large deviations type bound on the difference $L_n - \mathbb{E}[L_n]$.
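A small simulation (an illustration only, not part of the lecture) with $U[0,1]$ samples: we estimate $L_n$ over many samples, use the empirical mean in place of $\mathbb{E}[L_n]$, and confirm that deviations of size $t$ are far rarer than the bound $2\exp(-t^2 n / 2)$. The values of $n$ and $t$ are arbitrary choices.

```python
import math
import random

random.seed(8)

n, trials, t = 200, 2000, 0.15
bound = 2 * math.exp(-t ** 2 * n / 2)  # about 0.21

ln_values = []
for _ in range(trials):
    xs = sorted(random.random() for _ in range(n))
    # For F = U[0,1], sup_x |F_n(x) - x| is attained at the sample points.
    ln = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
    ln_values.append(ln)

mean_ln = sum(ln_values) / trials
freq = sum(abs(l - mean_ln) > t for l in ln_values) / trials
print(mean_ln, freq, bound)  # freq should be (much) smaller than bound
```

The typical size of $L_n$ itself is of order $1/\sqrt{n}$, so its fluctuations around the mean are far smaller than the chosen $t$, and the empirical exceedance frequency is essentially zero.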


For our second example we turn to combinatorial optimization on random graphs. We will use the so-called Max-Cut problem as an example, though the approach works for many other optimization and constraint satisfaction problems as well. Consider a simple undirected graph $G = (V, E)$. $V$ is the set of nodes, denoted by $1, 2, \ldots, n$, and $E$ is the set of edges, which we describe as a list of pairs $(i_1, j_1), \ldots, (i_{|E|}, j_{|E|})$, where $i_1, \ldots, i_{|E|}, j_1, \ldots, j_{|E|}$ are nodes. The graph is undirected, which means that the edges $(i_1, j_1)$ and $(j_1, i_1)$ are identical. We can also represent the graph as an $n \times n$ zero-one matrix $A$, where $A_{i,j} = 1$ if $(i, j) \in E$ and $A_{i,j} = 0$ otherwise. Then $A$ is a symmetric matrix, namely $A^T = A$, where $A^T$ is the transpose of $A$. A cut in this graph is a partition of the nodes into two groups, encoded by a function $\sigma : V \to \{0, 1\}$. The value $MC(\sigma)$ of the cut associated with $\sigma$ is the number of edges between the two groups. Formally, $MC(\sigma) = |\{(i, j) \in E : \sigma(i) \ne \sigma(j)\}|$. Clearly $MC(\sigma) \le |E|$. At the same time, a random assignment $\sigma(i) = 0$ with probability $1/2$ and $\sigma(i) = 1$ with probability $1/2$ gives a cut with expected value $\mathbb{E}[MC(\sigma)] = (1/2)|E|$. In fact, there is a simple algorithm to construct such a cut explicitly. Now denote by $MC(G)$ the maximum possible value of the cut: $MC(G) = \max_\sigma MC(\sigma)$. Thus $1/2 \le MC(G)/|E| \le 1$. Further, suppose we delete an arbitrary edge from the graph $G$ and obtain a new graph $G'$. Observe that in this case $MC(G') \ge MC(G) - 1$: the Max-Cut value either stays the same or goes down by at most one. Similarly, when we add an edge, the Max-Cut value increases by at most one. Putting this together, if we replace an arbitrary edge $e \in E$ by a different edge $e'$ and leave all the other edges intact, the value of the Max-Cut changes by at most one.
Now suppose the graph G = G(n, dn) is a random Erdos-Renyi graph with
|E| = dn edges. Specifically, suppose we choose the edges E_1, . . . , E_{dn}
uniformly at random from the total set of (n choose 2) possible edges, independently
across the dn choices. Denote by MC_n the value of the maximum cut MC(G(n, dn)) on
this random graph. Since the graph is random, MC_n is a random
variable. Furthermore, as we have just established, d/2 ≤ MC_n/n ≤ d. One
of the major open problems in the theory of random graphs is computing the
scaling limit of E[MC_n]/n as n → ∞. However, we can easily obtain bounds
on the concentration of MC_n around its expectation, using the Azuma-Hoeffding
inequality. For this goal, think of the random edges E_1, . . . , E_{dn} as i.i.d. random
variables in the space {1, 2, . . . , (n choose 2)} corresponding to the set of all possible
edges on n nodes. Let g(E_1, . . . , E_{dn}) = MC_n. Observe that indeed g is a
function of dn i.i.d. random variables. By our observation, replacing one edge
E_i by a different edge E_i' changes MC_n by at most one. Thus we can apply
Theorem 2, which gives

    P(|MC_n − E[MC_n]| ≥ t) ≤ 2 exp(−t² / (2dn)).

In particular, taking t = rn, where r > 0 is a constant, we obtain a large
deviations type bound 2 exp(−r²n / (2d)). Taking instead t = r√n, we obtain a
Gaussian type bound 2 exp(−r² / (2d)). Namely, MC_n = E[MC_n] + O(√n) with high
probability. This is a meaningful concentration around the mean since, as we have
discussed above, E[MC_n] = Θ(n).
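For small n the bounds d/2 ≤ MC_n/n ≤ d can be verified directly by brute force. The sketch below samples dn distinct edges (sampling without replacement, a slight simplification of the i.i.d. model in the notes) and maximizes over all 2^{n−1} cuts; all names are illustrative.

```python
import itertools
import random

def max_cut(n, edges):
    """Brute-force Max-Cut: try all 2^(n-1) two-colorings (node 0 fixed)."""
    best = 0
    for bits in itertools.product((0, 1), repeat=n - 1):
        side = (0,) + bits
        best = max(best, sum(1 for (a, b) in edges if side[a] != side[b]))
    return best

rng = random.Random(1)
n, d = 10, 2                                  # |E| = dn = 20 edges
all_pairs = list(itertools.combinations(range(n), 2))
edges = rng.sample(all_pairs, d * n)          # dn distinct uniform edges
mc = max_cut(n, edges)
print(mc / n)                                 # should lie in [d/2, d]
```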

MIT OpenCourseWare
http://ocw.mit.edu

15.070J / 6.265J Advanced Stochastic Processes


Fall 2013

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070J
Lecture 13

Fall 2013
10/23/2013

Concentration Inequalities and Applications

Content. Talagrand's inequality and its applications.

1   Talagrand's inequality

Let (Ω_i, F_i, μ_i), i = 1, . . . , n, be probability spaces. Let μ = μ_1 ⊗ · · · ⊗ μ_n be the
product measure on X = Ω_1 × · · · × Ω_n, and let x = (x_1, . . . , x_n) ∈ X be a point in
this product space.

Hamming distance over X:

    d(x, y) = |{1 ≤ i ≤ n : x_i ≠ y_i}| = Σ_{i=1}^n 1{x_i ≠ y_i}.

a-weighted Hamming distance over X, for a ∈ R^n_+:

    d_a(x, y) = Σ_{i=1}^n a_i 1{x_i ≠ y_i}.

Also |a| = √(Σ_i a_i²).

Convex distance from a set: for a set A ⊆ X and x ∈ X,

    D^c_A(x) = sup_{|a|=1} d_a(x, A),   where d_a(x, A) = inf{d_a(x, y) : y ∈ A}.

Theorem 1 (Talagrand). For every measurable non-empty set A and product measure μ,

    ∫ exp( (1/4) D^c_A(x)² ) dμ ≤ 1/μ(A).

In particular,

    μ({D^c_A ≥ t}) ≤ (1/μ(A)) exp(−t²/4).

2   Applications of Talagrand's Inequality

2.1   Concentration of Lipschitz functions

Let F : X → R, for the product space X = Ω_1 × · · · × Ω_n, be such that for every x ∈ X
there exists a = a(x) ∈ R^n_+ with |a| = 1 so that for every y ∈ X,

    F(x) ≤ F(y) + d_a(x, y).                                        (1)

Why is every 1-Lipschitz function essentially of the form (1)?
Consider a 1-Lipschitz function f : X → R such that

    |f(x) − f(y)| ≤ Σ_i |x_i − y_i|   (with |·| defined on Ω_i) for all x, y ∈ X.

Let d_i = max_{x,y} |x_i − y_i|. We assume d_i is bounded for all i. Then

    |f(x) − f(y)| ≤ Σ_i |x_i − y_i| ≤ Σ_i 1{x_i ≠ y_i} d_i.

Therefore,

    f(x) ≤ f(y) + √(Σ_i d_i²) · Σ_i (d_i/√(Σ_j d_j²)) 1{x_i ≠ y_i} = f(y) + ‖d‖_2 d_a(x, y),

with a_i = d_i/‖d‖_2. Thus F(x) = f(x)/‖d‖_2 satisfies (1), where ‖d‖_2 = √(Σ_i d_i²).
Let A = {F ≤ m}. For the vector a = a(x) from (1) we have |a| = 1, and hence by the
definition of D^c_A(x),

    D^c_A(x) = sup_{a':|a'|=1} d_{a'}(x, A) ≥ d_a(x, A) = inf_{y∈A} d_a(x, y).

Now for any y ∈ A, by definition F(y) ≤ m. Then, by (1),

    F(x) ≤ F(y) + d_a(x, y) ≤ m + D^c_A(x),

which implies {F ≥ m + r} ⊆ {D^c_A(x) ≥ r}. By Talagrand's inequality, for
any r ≥ 0,

    P({F ≥ m + r}) ≤ P({D^c_A ≥ r}) ≤ (1/P(A)) exp(−r²/4).

That is,

    P({F ≤ m}) P({F ≥ m + r}) ≤ exp(−r²/4).                         (2)

The median m_F of F is precisely such that

    P(F ≤ m_F) ≥ 1/2,   P(F ≥ m_F) ≥ 1/2.

Choosing m = m_F and m = m_F − r in (2), we obtain

    P(F ≥ m_F + r) ≤ 2 exp(−r²/4),   P(F ≤ m_F − r) ≤ 2 exp(−r²/4).

Thus,

    P(|F − m_F| ≥ r) ≤ 4 exp(−r²/4).                                (3)
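The concentration (3) can be illustrated by simulation. A minimal sketch, using the specific choice F(x) = (Σ_i x_i)/√n on {0,1}^n (which satisfies (1) with a_i = 1/√n, since f(x) = Σ_i x_i is coordinate-Lipschitz with d_i = 1); the names below are ours.

```python
import math
import random

def F(x):
    """F(x) = (sum of coordinates)/sqrt(n), a function of the form (1)
    with a_i = 1/sqrt(n)."""
    return sum(x) / math.sqrt(len(x))

rng = random.Random(0)
n, trials = 100, 2000
vals = sorted(F([rng.randint(0, 1) for _ in range(n)]) for _ in range(trials))
median = vals[trials // 2]          # empirical median m_F

results = {}
for r in (2.5, 3.0):                # radii where 4*exp(-r^2/4) < 1
    freq = sum(1 for v in vals if abs(v - median) >= r) / trials
    results[r] = (freq, 4 * math.exp(-r * r / 4))
print(results)
```

For each r the empirical deviation frequency is far below the bound 4exp(−r²/4), since F has standard deviation 1/2 here.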

2.2   Further application: linear functions

Consider independent random variables Y_1, . . . , Y_n on some probability space
(Ω, F, P). Let the constants (u_i, v_i), 1 ≤ i ≤ n, be such that u_i ≤ Y_i ≤ v_i.
Set Z = sup_{t∈T} ⟨t, Y⟩ = sup_{t∈T} Σ_{i=1}^n t_i Y_i, where T is some finite, countable or
compact set of vectors in R^n_+. We are interested in

    σ² = sup_{t∈T} Σ_i t_i² (v_i − u_i)².

We wish to apply (3) to this setting by choosing

    F(x) = sup_{t∈T} ⟨t, x⟩,

where x ∈ X and X = Π_{i=1}^n [u_i, v_i]. Given that T is compact, F(x) = ⟨t*(x), x⟩
for some t* = t*(x) ∈ T, for each given x. Then

    F(x) = Σ_{i=1}^n t*_i x_i
         ≤ Σ_i t*_i y_i + Σ_i |t*_i| |y_i − x_i|
         ≤ Σ_i t*_i y_i + Σ_i |t*_i|(v_i − u_i) 1{y_i ≠ x_i}        (let d_i = |t*_i|(v_i − u_i))
         ≤ sup_{t∈T} ⟨t, y⟩ + ( Σ_i (d_i/‖d‖_2) 1{y_i ≠ x_i} ) ‖d‖_2
         = F(y) + d_a(x, y) ‖d‖_2    (with a_i = d_i/‖d‖_2)
         ≤ F(y) + σ d_a(x, y),                                      (4)

where ‖d‖_2 ≤ σ = sup_{t∈T} √(Σ_i t_i²(v_i − u_i)²). By (4), f = σ^{−1} F satisfies (1),
so (3) can be applied to f:

    P(|f − m_f| ≥ r) ≤ 4 exp(−r²/4).

Let r = ε/σ; then P(|f − m_f| ≥ ε/σ) ≤ 4 exp(−ε²/(4σ²)). That is,

    P(|F − m_F| ≥ ε) ≤ 4 exp(−ε²/(4σ²)).

Now, assuming the vector t = 0 belongs to T (so that F ≥ 0),

    E[F] = ∫_0^∞ P(F ≥ s) ds
         ≤ ∫_0^{m_F} 1 ds + ∫_0^∞ P(F ≥ m_F + ε) dε
         ≤ m_F + ∫_0^∞ 2 exp(−ε²/(4σ²)) dε
         = m_F + 2√π σ,

using ∫_0^∞ exp(−ε²/(4σ²)) dε = √π σ. Thus,

    |E[F] − m_F| ≤ 2√π σ.

2.3   A more intricate application

Longest increasing subsequence. Let X_1, . . . , X_n be points in [0, 1] chosen
independently (so that their joint law is a product measure). Let L_n = L_n(X_1, . . . , X_n)
be the length of the longest increasing subsequence. (Note that L_n(·) is not obviously
Lipschitz.) Talagrand's inequality implies its concentration.

Lemma 1. Let m_n be the median of L_n. Then for any r > 0, we have

    P(L_n ≥ m_n + r) ≤ 2 exp(−r²/(4(m_n + r))),

    P(L_n ≤ m_n − r) ≤ 2 exp(−r²/(4 m_n)).

Proof. Let us start by establishing the first inequality. Select A = {L_n ≤ m_n}.
Clearly, by definition, P(A) ≥ 1/2. Fix x such that L_n(x) > m_n (i.e. x ∉ A), and
consider any y ∈ A. Now, let the set I ⊆ [n] be the indices that give rise to a longest
increasing subsequence in x: if I = {i_1, . . . , i_p}, then x_{i_1} < x_{i_2} < · · · < x_{i_p},
and p = L_n(x) is the maximum length of any such increasing subsequence of x. Let
J = {i ∈ I : x_i ≠ y_i} for the given y. Since I∖J is an index set that corresponds
to an increasing subsequence of y (for i ∈ I∖J we have x_i = y_i, and I is the index
set of an increasing subsequence of x), we have, using the fact that L_n(y) ≤ m_n as
y ∈ A,

    |I∖J| ≤ m_n.

That is,

    L_n(x) = |I| ≤ |I∖J| + |J|
           ≤ L_n(y) + Σ_{i∈I} 1(x_i ≠ y_i)
           ≤ L_n(y) + √(L_n(x)) · [ (1/√(L_n(x))) Σ_{i=1}^n 1(i ∈ I) 1(x_i ≠ y_i) ].

Define

    a_i(x) = 1/√(L_n(x))   if i ∈ I,
    a_i(x) = 0             otherwise.

Then |a| = 1, since |I| = L_n(x) by definition, and hence

    L_n(x) ≤ L_n(y) + √(L_n(x)) d_a(x, y) ≤ m_n + √(L_n(x)) D^c_A(x).

Equivalently,

    D^c_A(x) ≥ (L_n(x) − m_n) / √(L_n(x)).

For x such that L_n(x) ≥ m_n + r, the right-hand side above is minimized when L_n(x) =
m_n + r (it is increasing in L_n(x)). Therefore, we have

    D^c_A(x) ≥ r / √(m_n + r).

That is,

    L_n(x) ≥ m_n + r   implies   D^c_A(x) ≥ r/√(m_n + r),

for A = {L_n ≤ m_n}. Putting these together, we have

    P(L_n ≥ m_n + r) ≤ P( D^c_A ≥ r/√(m_n + r) ) ≤ (1/P(A)) exp(−r²/(4(m_n + r))).

But P(A) = P(L_n ≤ m_n) ≥ 1/2, so we have that

    P(L_n ≥ m_n + r) ≤ 2 exp(−r²/(4(m_n + r))).

To establish the lower bound, repeat the argument above with x such that L_n(x) ≥
s + u and A = {L_n ≤ s}. Then we obtain

    D^c_A(x) ≥ u/√(s + u).

Select s = m_n − r, u = r. Then, whenever x is such that L_n(x) ≥ s + u = m_n,
and for A = {L_n ≤ s} = {L_n ≤ m_n − r},

    D^c_A(x) ≥ r/√(m_n).

Thus,

    P(L_n ≥ m_n) ≤ P( D^c_A ≥ r/√(m_n) ) ≤ (1/P(L_n ≤ m_n − r)) exp(−r²/(4 m_n)).

Since P(L_n ≥ m_n) ≥ 1/2, this implies

    P(L_n ≤ m_n − r) ≤ 2 exp(−r²/(4 m_n)).

This completes the proof.
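Lemma 1 can be illustrated numerically. The sketch below computes L_n by patience sorting (a standard O(n log n) algorithm, not discussed in the notes) and checks that large upward deviations from the empirical median are rarer than the bound predicts; all helper names are ours.

```python
import bisect
import math
import random

def lis_length(xs):
    """Length of the longest strictly increasing subsequence via patience
    sorting: tails[k] = smallest possible tail of an increasing subsequence
    of length k+1."""
    tails = []
    for x in xs:
        k = bisect.bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)

rng = random.Random(0)
n, trials = 400, 300
vals = sorted(lis_length([rng.random() for _ in range(n)])
              for _ in range(trials))
m = vals[trials // 2]                 # empirical median, close to 2*sqrt(n)
r = 3 * math.sqrt(m)                  # a deviation at the sqrt(m_n) scale
upper = sum(1 for v in vals if v >= m + r) / trials
bound = 2 * math.exp(-r * r / (4 * (m + r)))
print(m, upper, bound)
```

(The true fluctuations of L_n are of the much smaller order n^{1/6}, so the empirical frequency is essentially zero at this scale.)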


3   Proof of Talagrand's Inequality

Preparation. Given a set A and x ∈ X, recall D^c_A(x) = sup_{a∈R^n_+, |a|=1} d_a(x, A),
where d_a(x, A) = inf_{y∈A} d_a(x, y). Let

    U_A(x) = {s ∈ {0,1}^n : ∃ y ∈ A with s ≥ 1(x ≠ y)} ⊇ {1(x ≠ y) : y ∈ A},

where 1(x ≠ y) denotes the vector (1{x_i ≠ y_i}, 1 ≤ i ≤ n), and let

    V_A(x) = Convex-hull(U_A(x)) = { Σ_{s∈U_A(x)} α_s s : Σ_s α_s = 1, α_s ≥ 0 for all s ∈ U_A(x) }.

Thus,

    x ∈ A  ⟹  1(x ≠ x) = 0 ∈ U_A(x)  ⟹  0 ∈ V_A(x).

It can therefore be checked that:

Lemma 2.

    D^c_A(x) = d(0, V_A(x)) = inf_{y∈V_A(x)} |y|.

Proof. (i) D^c_A(x) ≤ inf_{y∈V_A(x)} |y|: since inf_{y∈V_A(x)} |y| is achieved (V_A(x) is
compact), let z be such that |z| = inf_{y∈V_A(x)} |y|. Now for any a ∈ R^n_+ with |a| = 1,

    inf_{y∈V_A(x)} a · y ≤ a · z ≤ |a||z| = |z|.

Since inf_{y∈V_A(x)} a · y is a linear program, the minimum is achieved at an
extreme point. That is, there exists s ∈ U_A(x) such that

    inf_{y∈V_A(x)} a · y = inf_{s∈U_A(x)} a · s = inf_{y∈A} d_a(x, y) = d_a(x, A).

Since this is true for all a, it follows that

    sup_{|a|=1, a∈R^n_+} inf_{y∈A} d_a(x, y) ≤ |z| = inf_{y∈V_A(x)} |y|.

(ii) D^c_A(x) ≥ inf_{y∈V_A(x)} |y|: let z be the point achieving the minimum of |y|² in
V_A(x). Due to the convexity of the objective f(y) = |y|² = Σ_i y_i² and of the domain,
we have ∇f(z) · (y − z) ≥ 0 for any y ∈ V_A(x), and ∇f(z) = 2z. Therefore the
condition implies

    (y − z) · z ≥ 0  ⟹  y · z ≥ z · z = |z|²  ⟹  y · (z/|z|) ≥ |z|.

Thus, for a = z/|z| ∈ R^n_+, |a| = 1, we have that

    inf_{y∈V_A(x)} a · y ≥ |z|.

But for any given a, inf_{y∈V_A(x)} a · y = inf_{s∈U_A(x)} a · s = d_a(x, A), as explained
before. That is, sup_{a:|a|=1} d_a(x, A) ≥ |z| = inf_{y∈V_A(x)} |y|. This completes the
proof.
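Lemma 2 can be sanity-checked on a toy example. Below U_A(x) consists of two indicator patterns (A contains points differing from x in coordinate 1 resp. coordinate 2); both sides of the identity are approximated on grids and agree near 1/√2. All names here are illustrative, not from the notes.

```python
import math

def d_a(a, patterns):
    """d_a(x, A): since a >= 0, the inf over the convex hull equals the min
    over the extreme points (the indicator patterns)."""
    return min(sum(ai * si for ai, si in zip(a, s)) for s in patterns)

U = [(1, 0), (0, 1)]     # U_A(x) for this toy example

# Left side of Lemma 2: min |y| over V_A(x) = conv(U), a segment; 1-D grid.
min_norm = min(math.hypot(lam, 1 - lam)
               for lam in (k / 10000 for k in range(10001)))

# Right side: sup over unit vectors a = (cos t, sin t), t in [0, pi/2]; grid.
sup_da = max(d_a((math.cos(t), math.sin(t)), U)
             for t in (math.pi / 2 * k / 10000 for k in range(10001)))

print(min_norm, sup_da)   # both close to 1/sqrt(2)
```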

Now we are ready to establish Talagrand's inequality. The proof is by
induction on n. Consider n = 1 and a given set A. Now,

    D^c_A(x) = sup_{a∈R_+, |a|=1} inf_{y∈A} a · 1(x ≠ y) = 0 for x ∈ A,  1 for x ∉ A.

Then,

    ∫ exp((D^c_A)²/4) dP = ∫_A exp(0) dP + ∫_{A^c} exp(1/4) dP
                          = P(A) + e^{1/4}(1 − P(A))
                          = e^{1/4} − (e^{1/4} − 1) P(A)
                          ≤ 1/P(A).                                  (5)

For the last inequality, let f(x) = e^{1/4} − (e^{1/4} − 1)x and g(x) = 1/x; f is a
decreasing linear function and g is a decreasing convex function, with f(1) = g(1) = 1.
By convexity, g(x) ≥ g(1) + g'(1)(x − 1) = 2 − x on (0, 1], while f(x) ≤ 2 − x there
since f(0) = e^{1/4} < 2 and f(1) = 1. Thus, the result is established for n = 1.
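The one-dimensional bound (5) can be checked numerically on a grid of measures p = μ(A); a minimal sketch (names ours):

```python
import math

def lhs(p):
    """Integral of exp((D_A^c)^2/4) dP for n = 1: D = 0 on A, D = 1 off A,
    with P(A) = p. Equals p + e^{1/4}(1 - p)."""
    return p + math.exp(0.25) * (1 - p)

# Check e^{1/4} - (e^{1/4} - 1)p <= 1/p across a grid of p in (0, 1].
ok = all(lhs(p) <= 1 / p + 1e-12 for p in (k / 1000 for k in range(1, 1001)))
print(ok)
```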
Induction hypothesis. Suppose the result holds for some n. For ease of exposition
we shall assume Ω_1 = Ω_2 = · · · = Ω_n = · · · = Ω.

Let A ⊆ Ω^{n+1}. Let B be its projection onto Ω^n, and let A(ω) be the section
of A along ω ∈ Ω; for x ∈ Ω^n write z = (x, ω) ∈ Ω^{n+1}. We observe the
following. If s ∈ U_{A(ω)}(x), then (s, 0) ∈ U_A(z): for some y ∈ Ω^n such that
(y, ω) ∈ A we have s ≥ 1(x ≠ y), and therefore (s, 0) ≥ (1(x ≠ y), 1(ω ≠ ω)) =
1(z ≠ (y, ω)), where (y, ω) ∈ A. Further, if t ∈ U_B(x), then (t, 1) ∈ U_A(z). This is
because of the following: B = {x̂ ∈ Ω^n : (x̂, ω̂) ∈ A for some ω̂}. Now if
t ∈ U_B(x), then there exists y ∈ B such that t ≥ 1(x ≠ y), and there exists ω̂ such
that (y, ω̂) ∈ A. Then (t, 1) ≥ (1(x ≠ y), 1(ω ≠ ω̂)) = 1(z ≠ (y, ω̂)).

Given this, it follows that if ξ ∈ V_{A(ω)}(x), η ∈ V_B(x), and λ ∈ [0, 1], then
(λξ + (1 − λ)η, 1 − λ) ∈ V_A(z). Recall that D^c_A(z)² = inf_{y∈V_A(z)} |y|², so

    D^c_A(z)² ≤ (1 − λ)² + |λξ + (1 − λ)η|² ≤ (1 − λ)² + λ|ξ|² + (1 − λ)|η|²,    (6)

using the convexity of | · |². Therefore,

    D^c_A(z)² ≤ (1 − λ)² + λ inf_{ξ∈V_{A(ω)}(x)} |ξ|² + (1 − λ) inf_{η∈V_B(x)} |η|²
              = (1 − λ)² + λ D^c_{A(ω)}(x)² + (1 − λ) D^c_B(x)².

By Hölder's inequality and the induction hypothesis, for each fixed ω,

    ∫_{Ω^n} exp( D^c_A((x, ω))²/4 ) dP(x)
      ≤ exp( (1−λ)²/4 ) ∫_{Ω^n} exp( λ D^c_{A(ω)}(x)²/4 ) exp( (1−λ) D^c_B(x)²/4 ) dP(x)
      ≤ exp( (1−λ)²/4 ) ( ∫_{Ω^n} exp( D^c_{A(ω)}(x)²/4 ) dP(x) )^λ ( ∫_{Ω^n} exp( D^c_B(x)²/4 ) dP(x) )^{1−λ}

(using E[XY] ≤ E[X^p]^{1/p} E[Y^q]^{1/q} with p = 1/λ, q = 1/(1−λ))

      ≤ exp( (1−λ)²/4 ) (1/P(A(ω)))^λ (1/P(B))^{1−λ}      (by the induction hypothesis)
      = exp( (1−λ)²/4 ) (1/P(B)) ( P(A(ω))/P(B) )^{−λ}.              (7)

(7) is true for any λ ∈ [0, 1], so for the tightest upper bound we shall optimize over λ.

Claim: for any u ∈ [0, 1], inf_{λ∈[0,1]} exp( (1−λ)²/4 ) u^{−λ} ≤ 2 − u.

Therefore, (7) reduces to

    ∫_{Ω^n} exp( D^c_A((x, ω))²/4 ) dP(x) ≤ (1/P(B)) ( 2 − P(A(ω))/P(B) ).

Therefore,

    ∫_{Ω^{n+1}} exp( D^c_A((x, ω))²/4 ) dP(x) dμ(ω)
      ≤ ∫ (1/P(B)) ( 2 − P(A(ω))/P(B) ) dμ(ω)
      = (1/P(B)) ( 2 − (P ⊗ μ)(A)/P(B) )
      ≤ 1/(P ⊗ μ)(A),                                               (8)

since u(2 − u) ≤ 1 for all u ∈ R, applied with u = (P ⊗ μ)(A)/P(B).
This completes the proof of Talagrand's inequality.


Claim: u(2 − u) ≤ 1 for all u. Indeed, with f(u) = u(2 − u) we have f'(u) = 2 − 2u,
so the maximum is attained at u = 1 and max_u f(u) = f(1) = 1.

Proof of the claim that inf_{λ∈[0,1]} exp( (1−λ)²/4 ) u^{−λ} ≤ 2 − u:

If u ≥ e^{−1/2}, take λ = 1 + 2 log u ∈ [0, 1]. Then 1 − λ = −2 log u, so
(1 − λ)²/4 = log² u, while u^{−λ} = e^{−λ log u} = e^{−log u − 2 log² u}. Thus,

    exp( (1−λ)²/4 ) u^{−λ} = exp( log² u − 2 log² u − log u ) = exp( −log u − log² u ).

We have

    e^{−1/2} ≤ u ≤ 1  ⟹  −1/2 ≤ log u ≤ 0  ⟹  0 ≤ −log u ≤ 1/2,  0 ≤ log² u ≤ 1/4,

and for f(x) = x − x² on [0, 1/2] we have f'(x) = 1 − 2x ≥ 0, so with x = −log u,

    −log u − log² u = x − x² ≤ 1/2 − 1/4 = 1/4.

One can check directly that e^{x − x²} ≤ 2 − e^{−x} for x ∈ [0, 1/2], which is exactly
exp( −log u − log² u ) ≤ 2 − u.

If u ≤ e^{−1/2}, take λ = 0; then exp( (1−λ)²/4 ) u^{−λ} = e^{1/4} ≤ 2 − u, since
2 − u ≥ 2 − e^{−1/2} > e^{1/4}.
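The optimization claim above is easy to verify numerically over grids of u and λ; a minimal sketch (a small numerical tolerance absorbs grid discretization error):

```python
import math

def best(u, grid=1000):
    """Min over a lambda-grid of exp((1-lam)^2/4) * u^(-lam), lam in [0,1]."""
    return min(math.exp((1 - k / grid) ** 2 / 4) * u ** (-(k / grid))
               for k in range(grid + 1))

# Check inf_lambda exp((1-lam)^2/4) u^(-lam) <= 2 - u for u in (0, 1].
ok = all(best(k / 500) <= 2 - k / 500 + 1e-6 for k in range(1, 501))
print(ok)
```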


MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070J
Lecture 14

Fall 2013
10/28/2013

Introduction to Ito calculus.

Content.
1. Spaces L2 , M2 , M2,c .
2. Quadratic variation property of continuous martingales.

1   Doob-Kolmogorov inequality. Continuous time version

Let us establish the following continuous time version of the Doob-Kolmogorov
inequality. We use RCLL as an abbreviation for right-continuous functions with left
limits.

Proposition 1. Suppose X_t ≥ 0 is an RCLL sub-martingale. Then for every
T, x ≥ 0,

    P( sup_{0≤t≤T} X_t ≥ x ) ≤ E[X_T²]/x².

Proof. Consider any sequence of partitions Π_n = {0 = t^n_0 < t^n_1 < · · · < t^n_{N_n} =
T} such that Δ(Π_n) = max_j |t^n_{j+1} − t^n_j| → 0. Additionally, suppose that the
sequence Π_n is nested, in the sense that for every n_1 ≤ n_2, every point in Π_{n_1} is
also a point in Π_{n_2}. Let X^n_t = X_{t^n_j}, where j = max{i : t^n_i ≤ t}. Then X^n_t is a
sub-martingale adapted to the same filtration (notice that this would not be the
case if we instead chose the right ends of the intervals). By the discrete version of
the D-K inequality (see previous lectures), we have

    P( max_{j≤N_n} X_{t^n_j} ≥ x ) = P( sup_{t≤T} X^n_t ≥ x ) ≤ E[X_T²]/x².

By the RCLL property, sup_{t≤T} X^n_t → sup_{t≤T} X_t a.s. Indeed, fix ε > 0 and
find t_0 = t_0(ω) such that X_{t_0} ≥ sup_{t≤T} X_t − ε. Find n large enough and
j = j(n) such that t^n_{j(n)−1} ≤ t_0 ≤ t^n_{j(n)}. Then t^n_{j(n)} → t_0 as n → ∞.
By right-continuity of X, X_{t^n_{j(n)}} → X_{t_0}. This implies that for sufficiently
large n, sup_{t≤T} X^n_t ≥ X_{t^n_{j(n)}} ≥ X_{t_0} − ε ≥ sup_{t≤T} X_t − 2ε, and the a.s.
convergence is established. On the other hand, since the sequence Π_n is nested, the
sequence sup_{t≤T} X^n_t is non-decreasing. By continuity of probabilities, we obtain
P(sup_{t≤T} X^n_t ≥ x) → P(sup_{t≤T} X_t ≥ x).
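A quick discrete illustration of the inequality (using the discrete-time version it is built from): for a simple ±1 random walk S_k, the process |S_k| is a non-negative submartingale, and P(max_{k≤n} |S_k| ≥ x) ≤ E[S_n²]/x² = n/x². A simulation sketch (names ours):

```python
import random

def max_abs_walk(n, rng):
    """Running max of |S_k| for a simple +/-1 random walk with n steps."""
    s, m = 0, 0
    for _ in range(n):
        s += rng.choice((-1, 1))
        m = max(m, abs(s))
    return m

rng = random.Random(0)
n, trials, x = 400, 2000, 60
freq = sum(1 for _ in range(trials) if max_abs_walk(n, rng) >= x) / trials
bound = n / x ** 2           # E[S_n^2]/x^2
print(freq, bound)
```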

2   Stochastic processes and martingales

Consider a probability space (Ω, F, P) and a filtration (F_t, t ∈ R_+). We assume
that all zero-measure events are added to F_0. Namely, for every A ⊆ Ω such
that A ⊆ A' for some A' ∈ F with P(A') = 0, the set A also belongs
to F_0. A filtration is called right-continuous if F_t = ∩_{ε>0} F_{t+ε}. From now
on we consider exclusively right-continuous filtrations. A stochastic process
X_t adapted to this filtration is a measurable function X : [0, ∞) × Ω → R
such that X_t ∈ F_t for every t. Denote by L_2 the space of processes s.t. the
Riemann integral ∫_0^T X_t(ω) dt exists a.s. and moreover E[∫_0^T X_t² dt] < ∞ for
every T > 0. This implies P(ω : ∫_0^T |X_t(ω)| dt < ∞, ∀T) = 1.

Let M_2 consist of square integrable right-continuous martingales with left
limits (RCLL). Namely, E[X_t²] < ∞ for every X ∈ M_2 and t ≥ 0. Finally,
M_{2,c} ⊆ M_2 is a further subset of processes consisting of a.s. continuous
processes. For each T > 0 we define a norm on M_2 by ‖X‖ = ‖X‖_T =
(E[X_T²])^{1/2}. Applying the sub-martingale property of X_t², we have E[X²_{T_1}] ≤
E[X²_{T_2}] for every T_1 ≤ T_2.

A stochastic process Y_t is called a version of X_t if for every t ∈ R_+, P(X_t =
Y_t) = 1. Notice, this is weaker than saying P(X_t = Y_t, ∀t) = 1.

Proposition 2. Suppose (X_t, F_t) is a submartingale and t → E[X_t] is a con-
tinuous function. Then there exists a version Y_t of X_t which is RCLL.

We skip the proof of this fact.

Proposition 3. M_2 is a complete metric space (w.r.t. ‖ · ‖), and M_{2,c} is a closed
subspace of M_2.
Proof. We need to show that if X^(n) ∈ M_2 is Cauchy, then there exists X ∈
M_2 with ‖X^(n) − X‖ → 0.

Assume X^(n) is Cauchy. Fix t ≤ T. Since X^(n) − X^(m) is a martingale
as well, E[(X^(n)_t − X^(m)_t)²] ≤ E[(X^(n)_T − X^(m)_T)²]. Thus X^(n)_t is Cauchy as
well. We know that the space L_2 of random variables with finite second moment
is closed. Thus for each t there exists a r.v. X_t s.t. E[(X^(n)_t − X_t)²] → 0 as
n → ∞. We claim that since X^(n)_t ∈ F_t and X^(n) is RCLL, (X_t, t ≥ 0) is
adapted to F_t as well (exercise). Let us show it is a martingale. First, E[|X_t|] < ∞,
since in fact E[X_t²] < ∞. Fix s < t and A ∈ F_s. Since each X^(n)_t is a
martingale, E[X^(n)_t 1(A)] = E[X^(n)_s 1(A)]. We have

    E[X_t 1(A)] − E[X_s 1(A)] = E[(X_t − X^(n)_t) 1(A)] − E[(X_s − X^(n)_s) 1(A)].

We have E[|X_t − X^(n)_t| 1(A)] ≤ E[|X_t − X^(n)_t|] ≤ (E[(X_t − X^(n)_t)²])^{1/2} → 0
as n → ∞. A similar statement holds for s. Since the left-hand side does not
depend on n, we conclude E[X_t 1(A)] = E[X_s 1(A)], implying E[X_t|F_s] = X_s,
namely X_t is a martingale. Since E[X_t] = E[X_0] is constant and therefore
continuous as a function of t, there exists a version of X_t which is RCLL.
For simplicity we denote it by X_t as well. We constructed a process X_t ∈ M_2
s.t. E[(X^(n)_t − X_t)²] → 0 for all t ≤ T. This proves completeness of M_2.

Now we deal with closedness of M_{2,c}. Since X^(n)_t − X_t is a martingale,
(X^(n)_t − X_t)² is a submartingale, and it is RCLL, so the submartingale
inequality applies. Fix ε > 0. By the submartingale inequality we have

    P( sup_{t≤T} |X^(n)_t − X_t| > ε ) ≤ (1/ε²) E[(X^(n)_T − X_T)²] → 0,

as n → ∞. Then we can choose a subsequence n_k such that

    P( sup_{t≤T} |X^(n_k)_t − X_t| > 1/k ) ≤ 1/2^k.

Since 1/2^k is summable, by the Borel-Cantelli Lemma we have sup_{t≤T} |X^(n_k)_t −
X_t| → 0 almost surely: P({ω : sup_{t≤T} |X^(n_k)_t(ω) − X_t(ω)| → 0}) = 1.
Recall that a uniform limit of continuous functions is continuous as well
(first lecture). Thus X_t is continuous a.s. As a result X_t ∈ M_{2,c} and M_{2,c} is
closed.
3   Doob-Meyer decomposition and quadratic variation of processes in M_{2,c}

Consider a Brownian motion B_t adapted to a filtration F_t. Suppose this filtration
makes B_t a strong Markov process (for example, F_t is generated by B itself).
Recall that both B_t and B_t² − t are martingales, and also B ∈ M_{2,c}. Finally,
recall that the quadratic variation of B over any interval [0, t] is t. There is a
generalization of these observations to processes in M_{2,c}. For this we need to
recall the following result.
Theorem 1 (Doob-Meyer decomposition). Suppose (X_t, F_t) is a continuous
non-negative sub-martingale. Then there exist a continuous martingale M_t and
an a.s. non-decreasing continuous process A_t with A_0 = 0, both adapted to F_t,
such that X_t = A_t + M_t. The decomposition is unique in the almost sure sense.

The proof of this theorem is skipped. It is obtained by an appropriate discretization
and passing to limits. The discrete version of this result we did earlier. See
[1] for details.

Now suppose X_t ∈ M_{2,c}. Then X_t² is a continuous non-negative submartingale
and thus the D-M theorem applies. The part A_t in the unique decomposition of
X_t² is called the quadratic variation of X_t (we will shortly justify this name) and is
denoted ⟨X⟩_t.

Theorem 2. Suppose X_t ∈ M_{2,c}. Then for every t > 0 the following convergence
in probability takes place:

    lim_{n→∞ : Δ(Π_n)→0} Σ_{0≤j≤n−1} (X_{t_{j+1}} − X_{t_j})² = ⟨X⟩_t,

where the limit is over all partitions Π_n = {0 = t_0 < t_1 < · · · < t_n = t} and
Δ(Π_n) = max_j |t_j − t_{j−1}|.
Proof. Fix s < t. Let X ∈ M_{2,c}. We have

    E[(X_t − X_s)² − (⟨X⟩_t − ⟨X⟩_s) | F_s]
      = E[X_t² − 2X_tX_s + X_s² − (⟨X⟩_t − ⟨X⟩_s) | F_s]
      = E[X_t² | F_s] − 2X_s E[X_t | F_s] + X_s² − E[⟨X⟩_t | F_s] + ⟨X⟩_s
      = E[X_t² − ⟨X⟩_t | F_s] − X_s² + ⟨X⟩_s
      = 0.

Thus for every s < t ≤ u < v, by conditioning first on F_u and using the tower
property, we obtain

    E[ ((X_t − X_s)² − (⟨X⟩_t − ⟨X⟩_s)) ((X_v − X_u)² − (⟨X⟩_v − ⟨X⟩_u)) ] = 0.   (1)

The proof of the following lemma is an application of various carefully placed
tower properties and is omitted. See [1], Lemma 1.5.9, for details.
Lemma 1. Suppose X ∈ M_2 satisfies |X_s| ≤ M a.s. for all s ≤ t. Then for
every partition 0 = t_0 ≤ · · · ≤ t_n = t,

    E[ ( Σ_j (X_{t_{j+1}} − X_{t_j})² )² ] ≤ 6M⁴.

Lemma 2. Suppose X ∈ M_{2,c} satisfies |X_s| ≤ M a.s. for all s ≤ t. Then

    lim_{Δ(Π_n)→0} E[ Σ_j (X_{t_{j+1}} − X_{t_j})⁴ ] = 0,

where Π_n = {0 = t_0 < · · · < t_n = t}, Δ(Π_n) = max_j |t_{j+1} − t_j|.


Proof. We have

    Σ_j (X_{t_{j+1}} − X_{t_j})⁴ ≤ Σ_j (X_{t_{j+1}} − X_{t_j})² · sup{|X_r − X_s|² : |r − s| ≤ Δ(Π_n)}.

Applying the Cauchy-Schwarz inequality and Lemma 1, we obtain

    ( E[ Σ_j (X_{t_{j+1}} − X_{t_j})⁴ ] )² ≤ E[ ( Σ_j (X_{t_{j+1}} − X_{t_j})² )² ] · E[ sup{|X_r − X_s|⁴ : |r − s| ≤ Δ(Π_n)} ]
                                          ≤ 6M⁴ E[ sup{|X_r − X_s|⁴ : |r − s| ≤ Δ(Π_n)} ].

Now X(ω) is a.s. continuous and therefore uniformly continuous on [0, t].
Therefore, a.s. sup{|X_r − X_s|² : |r − s| ≤ Δ(Π_n)} → 0 as Δ(Π_n) → 0.
Also |X_r − X_s| ≤ 2M a.s. Applying the Bounded Convergence Theorem, we obtain
that E[sup{|X_r − X_s|⁴ : |r − s| ≤ Δ(Π_n)}] converges to zero as well, and
the result is obtained.
We now return to the proof of the theorem. We first assume |X_s| ≤ M
and ⟨X⟩_s ≤ M a.s. for s ∈ [0, t]. We have (using a telescoping sum)

    E[ ( Σ_j (X_{t_{j+1}} − X_{t_j})² − ⟨X⟩_t )² ] = E[ ( Σ_j ( (X_{t_{j+1}} − X_{t_j})² − (⟨X⟩_{t_{j+1}} − ⟨X⟩_{t_j}) ) )² ].

When we expand the square, the terms corresponding to cross products with
j_1 ≠ j_2 disappear due to (1). Thus the expression is equal to

    E[ Σ_j ( (X_{t_{j+1}} − X_{t_j})² − (⟨X⟩_{t_{j+1}} − ⟨X⟩_{t_j}) )² ]
      ≤ 2 E[ Σ_j (X_{t_{j+1}} − X_{t_j})⁴ ] + 2 E[ Σ_j (⟨X⟩_{t_{j+1}} − ⟨X⟩_{t_j})² ].

The first term converges to zero as Δ(Π_n) → 0 by Lemma 2.

We now analyze the second term. Since ⟨X⟩_t is a.s. non-decreasing,

    Σ_j (⟨X⟩_{t_{j+1}} − ⟨X⟩_{t_j})² ≤ ( Σ_j (⟨X⟩_{t_{j+1}} − ⟨X⟩_{t_j}) ) · sup_{0≤s≤r≤t} {⟨X⟩_r − ⟨X⟩_s : |r − s| ≤ Δ(Π_n)}.

Thus the expectation is upper bounded by

    E[ ⟨X⟩_t · sup_{0≤s≤r≤t} {⟨X⟩_r − ⟨X⟩_s : |r − s| ≤ Δ(Π_n)} ].            (2)

Now ⟨X⟩_t is a.s. continuous, and thus the supremum term converges to zero a.s.
as n → ∞. On the other hand, a.s. ⟨X⟩_t (⟨X⟩_r − ⟨X⟩_s) ≤ 2M². Thus, using the
Bounded Convergence Theorem, we obtain that the expectation in (2) converges
to zero as well. We conclude that in the bounded case |X_s|, ⟨X⟩_s ≤ M on [0, t],
the quadratic variation of X_s over [0, t] converges to ⟨X⟩_t in the L_2 sense. This
implies convergence in probability as well.
It remains to analyze the general (unbounded) case. Introduce stopping
times T_M for every M ∈ R_+ as follows:

    T_M = min{t : |X_t| ≥ M or ⟨X⟩_t ≥ M}.

Consider X^M_t ≜ X_{t∧T_M}. Then X^M ∈ M_{2,c} and is a.s. bounded. Further,
since X_t² − ⟨X⟩_t is a martingale, X²_{t∧T_M} − ⟨X⟩_{t∧T_M} is a bounded martingale.
Since the Doob-Meyer decomposition is unique, we see that ⟨X⟩_{t∧T_M} is indeed
the unique non-decreasing component of the stopped martingale X_{t∧T_M}. There
is a subtlety here: X^M_t is a continuous martingale and therefore it has its own
quadratic variation ⟨X^M⟩_t, the unique non-decreasing a.s. process such that
(X^M_t)² − ⟨X^M⟩_t is a martingale. It is a priori not obvious that ⟨X^M⟩_t is the
same as ⟨X⟩_{t∧T_M}, the quadratic variation of X_t stopped at T_M. But due to the
uniqueness of the D-M decomposition, it is.

Fix ε > 0, t ≥ 0 and find M large enough so that P(T_M < t) < ε/2. This is
possible since X_t and ⟨X⟩_t are continuous processes. Now we have

    P( |Σ_j (X_{t_{j+1}} − X_{t_j})² − ⟨X⟩_t| > ε )
      ≤ P( |Σ_j (X_{t_{j+1}} − X_{t_j})² − ⟨X⟩_t| > ε, t ≤ T_M ) + P(T_M < t)
      = P( |Σ_j (X_{t_{j+1}∧T_M} − X_{t_j∧T_M})² − ⟨X⟩_{t∧T_M}| > ε, t ≤ T_M ) + P(T_M < t)
      ≤ P( |Σ_j (X_{t_{j+1}∧T_M} − X_{t_j∧T_M})² − ⟨X⟩_{t∧T_M}| > ε ) + P(T_M < t).

We already established the result for bounded martingales and quadratic variations.
Thus, there exists δ = δ(ε) > 0 such that, provided Δ(Π) < δ, we have

    P( |Σ_j (X_{t_{j+1}∧T_M} − X_{t_j∧T_M})² − ⟨X⟩_{t∧T_M}| > ε ) < ε/2.

We conclude that for Π = {0 = t_0 < t_1 < · · · < t_n = t} with Δ(Π) < δ, we
have

    P( |Σ_j (X_{t_{j+1}} − X_{t_j})² − ⟨X⟩_t| > ε ) < ε.
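For Brownian motion, Theorem 2 says that the partition sums of squared increments converge to ⟨B⟩_t = t. A simulation sketch of this (Gaussian increments on a uniform grid; names ours):

```python
import math
import random

def bm_path(t, n, rng):
    """A Brownian path sampled on a uniform grid of n steps over [0, t]."""
    dt = t / n
    path = [0.0]
    for _ in range(n):
        path.append(path[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return path

def quadratic_variation(path):
    """Sum of squared increments along the grid."""
    return sum((b - a) ** 2 for a, b in zip(path, path[1:]))

rng = random.Random(0)
t = 1.0
qv = quadratic_variation(bm_path(t, 20000, rng))
print(qv)   # close to <B>_t = t = 1
```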

Additional reading materials


Chapter I. Karatzas and Shreve [1]

References
[1] I. Karatzas and S. E. Shreve, Brownian motion and stochastic calculus,
Springer, 1991.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070J
Lecture 15

Fall 2013
10/30/2013

Ito integral for simple processes

Content.
1. Simple processes. Ito isometry
2. First 3 steps in constructing Ito integral for general processes

1   Ito integral for simple processes. Ito isometry

Consider a Brownian motion B_t adapted to some filtration F_t such that (B_t, F_t)
is a strong Markov process. As an example, we can take the filtration generated
by the Brownian motion itself. Our goal is to give meaning to expressions of
the form ∫ X_t dB_t = ∫ X_t(ω) dB_t(ω), where X_t is some stochastic process
which is adapted to the same filtration as B_t. We will primarily deal with the
case X ∈ L_2, although it is possible to extend the definitions to more general
processes using the notion of local martingales. As in the case of usual integration,
the idea is to define ∫ X_t(ω) dB_t(ω) as some kind of a limit of (random) sums
Σ_j X_{t_j}(ω)(B_{t_{j+1}}(ω) − B_{t_j}(ω)) and show that the limit exists in some appropriate
sense. As X_t we can take all kinds of processes, including B_t itself. For example,
we will show that ∫_0^T B_t dB_t makes sense and equals (1/2)B_T² − (1/2)T.

Definition 1. A process X ∈ L_2 = L_2(Ω, F, (F_t), P) is called simple if there
exists a countable partition Π : 0 = t_0 < · · · < t_n < · · · with lim_n t_n = ∞
such that X_t(ω) = X_{t_j}(ω) for all t ∈ [t_j, t_{j+1}), j = 0, 1, 2, . . ., for all ω ∈ Ω.
The subspace of simple processes is denoted by L_2^0.
We assume the partition is such that t_j → ∞ as j → ∞. It is important to
note that we assume that the partition does not depend on ω. Thus not every
piece-wise constant process is a simple process. (Give an example of a piece-wise
constant process which is not simple.) Note that since X_t ∈ F_t, we have
X_{t_j} ∈ F_{t_j} for each j. As an example of a simple process, fix any partition
and a process X_t ∈ L_2, and consider the process X̂_t(ω) defined by X̂_t(ω) =
X_{t_j}(ω), where t_j is defined by t ∈ [t_j, t_{j+1}). In the definition it is important
that X̂_t = X_{t_j} and not X_{t_{j+1}}. Observe that the latter is not necessarily adapted
to (F_t)_{t∈R_+}.
Given a simple process X and t, define its integral by

    I_t(X(ω)) = Σ_{0≤j≤n−1} X_{t_j}(ω)(B_{t_{j+1}}(ω) − B_{t_j}(ω)) + X_{t_n}(ω)(B_t(ω) − B_{t_n}(ω)),

where n = max{j : t_j ≤ t}. Observe that I_t(X) is an a.s. continuous function
of t (as B_t is a.s. continuous).
Theorem 1. The following properties hold for I_t(X):

    I_t(X + Y) = I_t(X) + I_t(Y),                                    (1)

    E[I_t²(X)] = E[ ∫_0^t X_s² ds ]   [Ito isometry],                (2)

    I_t(X) ∈ M_{2,c},                                                (3)

    E[(I_t(X) − I_s(X))² | F_s] = E[ ∫_s^t X_u² du | F_s ],   0 ≤ s < t ≤ T.   (4)

Notice that (4) is a generalization of the Ito isometry. We only prove the Ito
isometry; the proof of (4) follows along the same lines.
Proof. Define t_n = t for convenience. We begin with (1). Let {t_j^1} and {t_j^2}
be the partitions corresponding to the simple processes X and Y. Consider the
partition {t_j} obtained as a union of these two partitions. For each t_j which belongs
to the second partition but not the first, define X_{t_j} = X_{t_i^1}, where t_i^1 is the
largest point of the first partition not exceeding t_j. Do a similar thing for Y. Observe
that now X_t = X_{t_j} for t ∈ [t_j, t_{j+1}). The linearity of the Ito integral then follows
straight from the definition.

Now for (2) we have

    E[I_t²(X)] = Σ_{0≤j_1,j_2≤n−1} E[ X_{t_{j_1}} X_{t_{j_2}} (B_{t_{j_1+1}} − B_{t_{j_1}})(B_{t_{j_2+1}} − B_{t_{j_2}}) ].

When j_1 < j_2 we have

    E[ X_{t_{j_1}} X_{t_{j_2}} (B_{t_{j_1+1}} − B_{t_{j_1}})(B_{t_{j_2+1}} − B_{t_{j_2}}) ] = 0,

which we obtain by conditioning on F_{t_{j_2}}, using the tower property and observing
that all of the random variables involved except for B_{t_{j_2+1}} are measurable
with respect to F_{t_{j_2}} (recall that F_{t_{j_1}} ⊆ F_{t_{j_2}}).

Now when j_1 = j_2 = j we have

    E[ X_{t_j}² (B_{t_{j+1}} − B_{t_j})² ] = E[ X_{t_j}² E[(B_{t_{j+1}} − B_{t_j})² | F_{t_j}] ]
                                         = E[ X_{t_j}² (t_{j+1} − t_j) ].

Combining, we obtain

    E[I_t²(X)] = Σ_j E[ X_{t_j}² (t_{j+1} − t_j) ] = E[ Σ_j X_{t_j}² (t_{j+1} − t_j) ] = E[ ∫_0^t X_s² ds ].

Let us show (3). We already know that the process I_t(X) is continuous. From
the Ito isometry it follows that E[I_t²(X)] < ∞. It remains to show that it is a
martingale. Thus fix s < t. Define t_n = t and j_0 = max{j : t_j ≤ s}. Then

    E[I_t(X) | F_s] = E[ Σ_{j≤n−1} X_{t_j}(B_{t_{j+1}} − B_{t_j}) | F_s ]
      = E[ Σ_{j≤j_0−1} X_{t_j}(B_{t_{j+1}} − B_{t_j}) | F_s ] + E[ X_{t_{j_0}}(B_s − B_{t_{j_0}}) | F_s ]
        + E[ X_{t_{j_0}}(B_{t_{j_0+1}} − B_s) | F_s ] + E[ Σ_{j>j_0} X_{t_j}(B_{t_{j+1}} − B_{t_j}) | F_s ]
      = E[ Σ_{j≤j_0−1} X_{t_j}(B_{t_{j+1}} − B_{t_j}) | F_s ] + E[ X_{t_{j_0}}(B_s − B_{t_{j_0}}) | F_s ]
      = I_s(X)

(think about justifying the last two equalities).
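The Ito isometry (2) can be checked by Monte Carlo for a concrete simple process, e.g. X_t = B_{t_j} on [t_j, t_{j+1}) over a uniform partition (the same process used in the next lecture). A sketch, with names of our choosing:

```python
import math
import random

def one_sample(T, k, rng):
    """One sample of (I_T(X), integral of X^2 dt) for the simple process
    X_t = B_{t_j} on [t_j, t_{j+1}), uniform partition with k intervals."""
    dt = T / k
    b = 0.0       # current B_{t_j}
    ito = 0.0     # running sum X_{t_j} (B_{t_{j+1}} - B_{t_j})
    xsq = 0.0     # running integral of X_t^2 dt
    for _ in range(k):
        db = rng.gauss(0.0, math.sqrt(dt))
        ito += b * db
        xsq += b * b * dt
        b += db
    return ito, xsq

rng = random.Random(0)
T, k, trials = 1.0, 50, 4000
samples = [one_sample(T, k, rng) for _ in range(trials)]
mean_Isq = sum(i * i for i, _ in samples) / trials
mean_xsq = sum(x for _, x in samples) / trials
print(mean_Isq, mean_xsq)   # both near T^2 (k-1)/(2k), i.e. about 0.49
```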

2   Constructing the Ito integral for general square integrable processes

The idea for defining the Ito integral ∫ X dB for general processes in L_2 is to
approximate X by simple processes X^(n) and define ∫ X dB as a limit of ∫ X^(n) dB,
which we have already defined.

For this purpose we need to show that we can indeed approximate X with
simple processes appropriately. We do this in 3 steps.

Step 1.

Proposition 1. Suppose X ∈ L_2 is an a.s. bounded continuous process, in the
sense that ∃ M s.t. P(ω : sup_{t≥0} |X_t(ω)| ≤ M) = 1. Then for every T > 0 there
exists a sequence of simple processes X^n ∈ L_2^0 such that

    lim_n E[ ∫_0^T (X_t^n − X_t)² dt ] = 0.                          (5)

Proof. Fix a sequence of partitions Π_n = {t_j^n} of [0, T] such that Δ(Π_n) =
max_j (t_{j+1}^n − t_j^n) → 0 as n → ∞. Given the process X, consider the modified
process X_t^n = X_{t_j^n} for all t ∈ [t_j^n, t_{j+1}^n). This process is simple and is
adapted to F_t. Since X is a.s. continuous, a.s. X_t(ω) = lim_n X_t^n(ω) (notice that
we are using the left-continuity part of continuity). We conclude that the sequence of
measurable functions X^n : [0, T] × Ω → R a.s. converges to X : [0, T] × Ω → R.
On the other hand, P(ω : sup_{t≤T} |X_t^n(ω)| ≤ M) = 1. Using the Bounded
Convergence Theorem, the a.s. convergence extends to integrals:
E[∫_0^T (X_t^n − X_t)² dt] → 0.
Step 2.

Proposition 2. Suppose X ∈ L_2 is a bounded, but not necessarily continuous,
process: |X| ≤ M a.s. For every T > 0, there exists a sequence of a.s. bounded
continuous processes X^n such that

    lim_n E[ ∫_0^T (X_t^n − X_t)² dt ] = 0.                          (6)

Proof. We use a certain regularization trick to turn a bounded process into a
bounded continuous approximation. Let X_t^n = n ∫_{t−1/n}^t X_s ds (with X_s = 0
for s < 0). We have |X^n| ≤ n(1/n)M = M and |X_{t'}^n − X_t^n| ≤ 2n|t' − t|M
(verify this), implying that X_t^n is a.s. bounded continuous. Since X_t is a.s.
Riemann integrable, for almost all ω the set of discontinuity points of X_t(ω) has
measure zero, and for all continuity points t, by the Fundamental Theorem of
Calculus, we have lim_n X_t^n(ω) = X_t(ω). We conclude that X^n : [0, T] × Ω → R
converges a.s. to X on the same domain. Applying the Bounded Convergence
Theorem, we obtain the result.
Step 3.

Proposition 3. Suppose X ∈ L_2. For every T > 0 there exists a sequence of
a.s. bounded processes X^n ∈ L_2 such that

    lim_n E[ ∫_0^T (X_t^n − X_t)² dt ] = 0.                          (7)

Proof. Define X^n by X_t^n = X_t when −n ≤ X_t ≤ n, X_t^n = −n when
X_t < −n, and X_t^n = n when X_t > n. We have X^n → X a.s. with respect to both
ω and t ∈ [0, T]. Also |X_t^n| ≤ |X_t|, implying

    ∫_0^T (X_t^n − X_t)² dt ≤ 2 ∫_0^T (X_t^n)² dt + 2 ∫_0^T X_t² dt ≤ 4 ∫_0^T X_t² dt.

Since E[∫_0^T X_t² dt] < ∞, applying the Dominated Convergence Theorem we
obtain the result.

Exercise 1. Establish (7) by applying instead the Monotone Convergence Theorem.

Additional reading materials


Karatzas and Shreve [1].
Øksendal [2], Chapter III.

References
[1] I. Karatzas and S. E. Shreve, Brownian motion and stochastic calculus,
Springer, 1991.
[2] B. Øksendal, Stochastic differential equations, Springer, 1991.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070J
Lecture 16

Fall 2013
11/04/2013

Ito integral. Properties

Content.
1. Denition of Ito integral
2. Properties of Ito integral

1   Ito integral. Existence

We continue with the construction of the Ito integral. Combining the results of
Propositions 1-3 from the previous lecture, we have proved the following result.

Proposition 1. Given any process X ∈ L_2, there exists a sequence of simple
processes X_n ∈ L_2^0 such that

    lim_n E[ ∫_0^T (X_n(t) − X(t))² dt ] = 0.                        (1)

Now, given a process X ∈ L_2, we fix any sequence of simple processes
X_n ∈ L_2^0 which satisfies (1) for a given T. Recall that we have already defined
the Ito integral I_t(X^n) for simple processes.

Proposition 2. Suppose a sequence of simple processes X^n satisfies (1). There
exists a process Z_t ∈ M_{2,c} satisfying lim_n E[(Z_t − I_t(X^n))²] = 0 for all 0 ≤
t ≤ T. This process is unique a.s. in the following sense: if X̂^n is another
sequence of simple processes satisfying (1) and Ẑ_t is the corresponding limit, then
P(Z_t = Ẑ_t, ∀t ∈ [0, T]) = 1.
Proof. Fix T > 0. Applying the linearity of I_t(X) and the Ito isometry,

    E[(I_T(X_m) − I_T(X_n))²] = E[I_T²(X_m − X_n)]
      = E[ ∫_0^T (X_m(t) − X_n(t))² dt ]
      ≤ 2 E[ ∫_0^T (X(t) − X_m(t))² dt ] + 2 E[ ∫_0^T (X(t) − X_n(t))² dt ].

But since the sequence X_n satisfies (1), it follows that the sequence I_T(X^n) is
Cauchy in the L_2 sense. Recall from Theorem 1 of the previous lecture that each
I_t(X^n) is a continuous square integrable martingale: I_t(X^n) ∈ M_{2,c}. Applying
Proposition 3 of Lecture 14, which states that M_{2,c} is a closed space, there exists
a limit Z_t, t ∈ [0, T], in M_{2,c} satisfying E[(Z_T − I_T(X^n))²] → 0. The same
applies to every t ≤ T, since (Z_t − I_t(X^n))² is a submartingale.

It remains to show that such a process Z_t is unique. If Ẑ_t is a limit of some
sequence X̂^n satisfying (1), then by the submartingale inequality, for every ε > 0
we have P(sup_{t≤T} |Z_t − Ẑ_t| ≥ ε) ≤ E[(Z_T − Ẑ_T)²]/ε². But

    E[(Z_T − Ẑ_T)²] ≤ 3E[(Z_T − I_T(X^n))²] + 3E[(I_T(X^n) − I_T(X̂^n))²]
                        + 3E[(I_T(X̂^n) − Ẑ_T)²],

and the right-hand side converges to zero. Thus E[(Z_T − Ẑ_T)²] = 0. It follows
that Z_t = Ẑ_t a.s. on [0, T]. Since T was arbitrary, we obtain an a.s. unique limit
on R_+.
Now we can formally state the definition of the Ito integral.

Definition 1 (Ito integral). Given a stochastic process X_t ∈ L_2 and T > 0, its
Ito integral I_t(X), t ∈ [0, T], is defined to be the unique process Z_t constructed
in Proposition 2.

We have defined the Ito integral as a process which is defined only on a finite
interval [0, T]. With a little bit of extra work it can be extended to a process
I_t(X) defined for all t ≥ 0, by taking T → ∞ and taking appropriate limits.
Details can be found in [1] and are omitted, as we will deal exclusively with Ito
integrals defined on a finite interval.
2  Ito integral. Properties

2.1  Simple example

Let us compute the Ito integral for the special case X_t = B_t. We will do this directly from the definition. Later on we will develop calculus rules for computing the Ito integral in many interesting cases.
We fix a sequence of partitions Π_n : 0 = t_0 < ⋯ < t_n = T and consider B_t^n = B_{t_j}, t ∈ [t_j, t_{j+1}). Assume that lim_n Δ(Π_n) = 0, where Δ(Π_n) = max_j |t_{j+1} − t_j|. We first show that this is sufficient for having
lim_n E[∫_0^T (B_t − B_t^n)^2 dt] = 0.    (2)

Indeed,

∫_0^T (B_t − B_t^n)^2 dt = Σ_{j=0}^{n−1} ∫_{t_j}^{t_{j+1}} (B_t − B_{t_j})^2 dt.

We have

E[∫_{t_j}^{t_{j+1}} (B_t − B_{t_j})^2 dt] = ∫_{t_j}^{t_{j+1}} E[(B_t − B_{t_j})^2] dt = ∫_{t_j}^{t_{j+1}} (t − t_j) dt = (t_{j+1} − t_j)^2 / 2,

implying

E[∫_0^T (B_t − B_t^n)^2 dt] = (1/2) Σ_{j=0}^{n−1} (t_{j+1} − t_j)^2 ≤ (1/2) Δ(Π_n) Σ_{j=0}^{n−1} (t_{j+1} − t_j) = (1/2) Δ(Π_n) T → 0,

as n → ∞. Thus (2) holds.
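The estimate above is easy to see numerically. The following sketch (our illustration, assuming NumPy; not part of the notes) simulates Brownian paths on a fine grid, freezes them on a coarse partition, and compares the Monte Carlo average of ∫_0^T (B_t − B_t^n)^2 dt with the value (1/2)Δ(Π_n)T computed above.

```python
import numpy as np

# Our numerical illustration (not part of the notes): for a uniform partition
# of [0, T] into `blocks` pieces of mesh Delta = T/blocks, the derivation above
# gives E[ int_0^T (B_t - B_t^n)^2 dt ] = blocks * Delta^2 / 2 = T * Delta / 2.
rng = np.random.default_rng(0)

T, fine, n_paths, blocks = 1.0, 1000, 2000, 10
dt = T / fine
dB = rng.normal(0.0, np.sqrt(dt), size=(n_paths, fine))
# Brownian paths sampled at the left endpoints of the fine grid (B_0 = 0)
B = np.hstack([np.zeros((n_paths, 1)), np.cumsum(dB, axis=1)])[:, :-1]

step = fine // blocks
Bn = np.repeat(B[:, ::step], step, axis=1)   # B^n: frozen at coarse left endpoints
integral = ((B - Bn) ** 2).sum(axis=1) * dt  # Riemann sum for int (B - B^n)^2 dt

expected = T * (T / blocks) / 2              # = 0.05 for these parameters
print(integral.mean(), expected)
```

Refining the coarse partition (increasing `blocks`) drives both quantities to zero, as the proposition asserts.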


Thus we need to compute the L2 limit of

I_T(B^n) = Σ_j B_{t_j} (B_{t_{j+1}} − B_{t_j})

as n → ∞. We use the identity

B_{t_{j+1}}^2 − B_{t_j}^2 = (B_{t_{j+1}} − B_{t_j})^2 + 2 B_{t_j}(B_{t_{j+1}} − B_{t_j}),

implying

B^2(T) − B^2(0) = Σ_{j=0}^{n−1} (B_{t_{j+1}}^2 − B_{t_j}^2) = Σ_{j=0}^{n−1} (B_{t_{j+1}} − B_{t_j})^2 + 2 Σ_{j=0}^{n−1} B_{t_j}(B_{t_{j+1}} − B_{t_j}).

But recall the quadratic variation property of Brownian motion:

lim_n Σ_{j=0}^{n−1} (B_{t_{j+1}} − B_{t_j})^2 = T

in L2 (recall that the only requirement for this convergence was that Δ(Π_n) → 0). Therefore, also in L2,

Σ_{j=0}^{n−1} B_{t_j}(B_{t_{j+1}} − B_{t_j}) → (1/2) B^2(T) − T/2.
We conclude:

Proposition 3. The following identity holds:

I_T(B) = ∫_0^T B_t dB_t = (1/2) B^2(T) − T/2.

Further, recall that since B_t ∈ M_{2,c}, it admits a unique Doob-Meyer decomposition B_t^2 = t + M_t, where t = ⟨B⟩_t is the quadratic variation of B_t and M_t is a continuous martingale. Thus we recognize M_t to be 2 I_t(B).
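Proposition 3 can be checked on simulated paths. The sketch below (ours, assuming NumPy) evaluates the left-endpoint sum Σ_j B_{t_j}(B_{t_{j+1}} − B_{t_j}) on a fine uniform partition and compares it with B_T^2/2 − T/2; the two agree up to the fluctuation of the quadratic variation around T.

```python
import numpy as np

# Our sanity check of Proposition 3: on a fine partition, the Ito sum equals
# (B_T^2 - sum_j (Delta B_j)^2)/2, and the quadratic variation concentrates
# around T, so the sum is close to B_T^2/2 - T/2.
rng = np.random.default_rng(1)

T, n = 1.0, 100_000
dt = T / n
dB = rng.normal(0.0, np.sqrt(dt), size=n)
B = np.concatenate([[0.0], np.cumsum(dB)])   # B_{t_0}, ..., B_{t_n}

ito_sum = np.sum(B[:-1] * dB)                # left-endpoint (Ito) sum
closed_form = B[-1] ** 2 / 2 - T / 2
print(ito_sum, closed_form)
```

The left endpoint is essential: evaluating the sum at right endpoints would shift the answer by the quadratic variation, i.e., by approximately T.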
2.2  Properties

We already know that I_t(X) ∈ M_{2,c}; in particular it is a continuous martingale. Let us establish additional properties, some of which are generalizations of Theorem 2.2 from the previous lecture.
Proposition 4. The following properties hold for I_t(X):

I_t(αX + βY) = α I_t(X) + β I_t(Y),  α, β ∈ R,    (3)

E[(I_t(X) − I_s(X))^2 | F_s] = E[∫_s^t X_u^2 du | F_s],  0 ≤ s < t ≤ T.    (4)

Furthermore, the quadratic variation of I_t(X) on [0, T] is ∫_0^T X_t^2 dt.

Proof. The proof of (3) is straightforward and is skipped. We now prove (4). Fix any set A ∈ F_s. We need to show that

E[(I_t(X) − I_s(X))^2 1(A)] = E[1(A) ∫_s^t X_u^2 du].

Fix a sequence X_n ∈ L02 satisfying (1). Then

E[(I_t(X) − I_s(X))^2 1(A)] = E[(I_t(X) − I_t(X_n))^2 1(A)] + E[(I_t(X_n) − I_s(X_n))^2 1(A)] + E[(I_s(X_n) − I_s(X))^2 1(A)]
+ 2E[(I_t(X) − I_t(X_n))(I_t(X_n) − I_s(X_n)) 1(A)]
+ 2E[(I_t(X) − I_t(X_n))(I_s(X_n) − I_s(X)) 1(A)]
+ 2E[(I_t(X_n) − I_s(X_n))(I_s(X_n) − I_s(X)) 1(A)].

But E[(I_t(X) − I_t(X_n))^2 1(A)] → 0 since E[(I_t(X) − I_t(X_n))^2] → 0 (definition of the Ito integral). Similarly E[(I_s(X) − I_s(X_n))^2 1(A)] → 0. Applying the Cauchy-Schwarz inequality,

|E[(I_t(X) − I_t(X_n))(I_t(X_n) − I_s(X_n)) 1(A)]| ≤ (E[(I_t(X) − I_t(X_n))^2])^{1/2} (E[(I_t(X_n) − I_s(X_n))^2])^{1/2} → 0

from the definition of I_t(X). Similarly we show that all the other terms with a factor 2 in front converge to zero.
By property (2.6) of Theorem 2.2, previous lecture, we have

E[(I_t(X_n) − I_s(X_n))^2 1(A)] = E[1(A) ∫_s^t (X_u^n)^2 du].

Now

|E[1(A) ∫_s^t (X_u^n)^2 du] − E[1(A) ∫_s^t X_u^2 du]| = |E[1(A) ∫_s^t (X_u^n − X_u)(X_u^n + X_u) du]|
≤ E[∫_s^t |(X_u^n − X_u)(X_u^n + X_u)| du]
≤ E^{1/2}[∫_s^t (X_u^n − X_u)^2 du] · E^{1/2}[∫_s^t (X_u^n + X_u)^2 du],

where the Cauchy-Schwarz inequality was used in the last step. Now the first term in the product converges to zero by assumption (1), and the second is uniformly bounded in n (exercise). The assertion then follows.
Now we prove the last part. Applying Proposition 3 from Lecture 1, it suffices to show that I_t^2(X) − ∫_0^t X_s^2 ds is a martingale, since then by the uniqueness of the Doob-Meyer decomposition we must have ⟨I(X)⟩_t = ∫_0^t X_s^2 ds. But note that (4) is equivalent to

E[I_t^2(X) − I_s^2(X) | F_s] = E[I_t^2(X) | F_s] − I_s^2(X) = E[∫_s^t X_u^2 du | F_s]
= E[∫_0^t X_u^2 du | F_s] − E[∫_0^s X_u^2 du | F_s]
= E[∫_0^t X_u^2 du | F_s] − ∫_0^s X_u^2 du.

Namely,

E[I_t^2(X) | F_s] − E[∫_0^t X_u^2 du | F_s] = I_s^2(X) − ∫_0^s X_u^2 du;

namely, I_t^2(X) − ∫_0^t X_s^2 ds is indeed a martingale.
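Property (4) taken at s = 0 is the Ito isometry E[I_T(X)^2] = E[∫_0^T X_u^2 du]. A Monte Carlo sketch (ours, assuming NumPy) for X = B, where both sides equal T^2/2:

```python
import numpy as np

# Our Monte Carlo sketch of the isometry in (4) with s = 0: for X = B and
# T = 1, E[I_T(B)^2] = E[ int_0^T B_u^2 du ] = T^2/2 = 0.5.
rng = np.random.default_rng(2)

T, n, n_paths = 1.0, 500, 10_000
dt = T / n
dB = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
B = np.hstack([np.zeros((n_paths, 1)), np.cumsum(dB, axis=1)])

ito = np.sum(B[:, :-1] * dB, axis=1)                # I_T(B), one value per path
lhs = np.mean(ito ** 2)                             # E[I_T(B)^2]
rhs = np.mean(np.sum(B[:, :-1] ** 2, axis=1) * dt)  # E[int_0^T B_u^2 du]
print(lhs, rhs)
```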

3  Additional reading materials

Karatzas and Shreve [1].
Øksendal [2], Chapter III.

References

[1] I. Karatzas and S. E. Shreve, Brownian motion and stochastic calculus, Springer, 1991.
[2] B. Øksendal, Stochastic differential equations, Springer, 1991.

MIT OpenCourseWare
http://ocw.mit.edu

15.070J / 6.265J Advanced Stochastic Processes


Fall 2013

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070J
Lecture 17

Fall 2013
11/13/2013

Ito process. Ito formula

Content.
1. Ito process and functions of Ito processes. Ito formula.
2. Multidimensional Ito formula. Integration by parts.

1  Ito process

Observe that trivially ∫_0^t dB_s = B_t. In the previous lecture we computed

I_t(B) = ∫_0^t B_s dB_s = (1/2) B_t^2 − t/2,

or

B^2(t) = 2 ∫_0^t B(s) dB(s) + t.    (1)

Observe also that we cannot have t = ∫_0^t X_s dB_s for some process X ∈ L2, as the Ito integral is a martingale, but t is not. Thus we see that by applying a functional operation to a process which is an Ito integral we do not necessarily get another Ito integral. But there is a natural generalization of the Ito integral to a broader family, which makes taking functional operations closed within the family.
Definition 1. An Ito process (or stochastic integral) is a stochastic process on (Ω, F, P) adapted to F_t which can be written in the form

X_t = X_0 + ∫_0^t U_s ds + ∫_0^t V_s dB_s,    (2)

where U, V ∈ L2. As a shorthand notation, we will write (2) as

dX_t = U_t dt + V_t dB_t.

Thus B_t^2 is an Ito process: B_t^2 = ∫_0^t ds + 2 ∫_0^t B_s dB_s, or d(B_t^2) = dt + 2 B_t dB_t. Note the difference from the usual differentiation rule d(x^2) = 2x dx: the additional term dt arises because Brownian motion B is not differentiable and instead has quadratic variation.

Notation. Given an Ito process dX_t = U_t dt + V_t dB_t, let us introduce the notation (dX_t)^2, which stands for V_t^2 dt. Equivalently, (dX_t)^2 is (dX_t)(dX_t), computed using the rules dt·dt = dt·dB_t = dB_t·dt = 0 and dB_t·dB_t = dt.

2  Ito formula

We now introduce the most important formula of Ito calculus.

Theorem 1 (Ito formula). Let X_t be an Ito process dX_t = U_t dt + V_t dB_t. Suppose g(x) ∈ C^2(R) is a twice continuously differentiable function (in particular, all second derivatives are continuous functions). Suppose g(X_t) ∈ L2. Then Y_t = g(X_t) is again an Ito process and

dY_t = (∂g/∂x)(X_t) dX_t + (1/2)(∂^2 g/∂x^2)(X_t)(dX_t)^2.

Using the notational convention for dX_t = U_t dt + V_t dB_t and (dX_t)^2, we can rewrite the Ito formula as

dY_t = [(∂g/∂x)(X_t) U_t + (1/2)(∂^2 g/∂x^2)(X_t) V_t^2] dt + (∂g/∂x)(X_t) V_t dB_t.

Thus, we see that the space of Ito processes is closed under twice continuously differentiable transformations.
Proof sketch of Theorem 1. We will do this for a very special case. We assume that the derivatives ∂g/∂x, ∂^2 g/∂x^2, as well as U and V, are all bounded simple processes. The general case is then obtained by approximating U and V by bounded simple processes, in a way similar to how we defined the Ito integral.
Let Π_n : 0 = t_0 < t_1 < ⋯ < t_n = t be a sequence of partitions such that Δ(Π_n) → 0. We use the notation ΔB(t_j) = B(t_{j+1}) − B(t_j), ΔX(t_j) = X(t_{j+1}) − X(t_j). Using a Taylor expansion of g we obtain


g(X(t)) = g(X(0)) + Σ_{j<n} [g(X(t_{j+1})) − g(X(t_j))]
= g(X(0)) + Σ_{j<n} (∂g/∂x)(X(t_j)) ΔX(t_j) + (1/2) Σ_{j<n} (∂^2 g/∂x^2)(X(t_j)) Δ^2 X(t_j) + Σ_{j<n} o(Δ^2 X(t_j)).

Now, we have

ΔX(t_j) = X(t_{j+1}) − X(t_j) = U(t_j)(t_{j+1} − t_j) + V(t_j)(B(t_{j+1}) − B(t_j)).
Thus, we obtain

Σ_{j<n} (∂g/∂x)(X(t_j)) ΔX(t_j) = Σ_{j<n} (∂g/∂x)(X(t_j)) [U(t_j)(t_{j+1} − t_j) + V(t_j)(B(t_{j+1}) − B(t_j))].

We argue that in L2 the convergences

Σ_{j<n} (∂g/∂x)(X(t_j)) U(t_j)(t_{j+1} − t_j) → ∫_0^t (∂g/∂x)(X(s)) U(s) ds

and

Σ_{j<n} (∂g/∂x)(X(t_j)) V(t_j)(B(t_{j+1}) − B(t_j)) → ∫_0^t (∂g/∂x)(X(s)) V(s) dB(s)

take place. For the first convergence, let us fix any sample ω. Then this convergence follows directly from the definition of the Riemann integral, since Δ(Π_n) → 0. Thus we have a.s. convergence. Since by our assumptions the integrated variables are bounded, the Bounded Convergence Theorem (applied to the uniform distribution on [0, t], just as in Proposition 1, Lecture 12) implies convergence in L2. To prove the second convergence, consider the simple process ĝ defined to be (∂g/∂x)(X(t_j)) for all t ∈ [t_j, t_{j+1}). Then the left-hand side is the Ito integral of ĝ(s)V(s). Then, by the definition of the Ito integral, the convergence to the right-hand side holds if the following convergence takes place:

lim_n E[∫_0^t (ĝ(s)V(s) − (∂g/∂x)(X(s)) V(s))^2 ds] = 0.

It is a simple exercise in analysis to show this and we skip the details.


Now consider Σ_{j<n} (∂^2 g/∂x^2)(X(t_j)) Δ^2 X(t_j) and identify its L2 limit. We have

Σ_{j<n} (∂^2 g/∂x^2)(X(t_j)) Δ^2 X(t_j) = Σ_{j<n} (∂^2 g/∂x^2)(X(t_j)) U^2(t_j)(t_{j+1} − t_j)^2
+ 2 Σ_{j<n} (∂^2 g/∂x^2)(X(t_j)) U(t_j) V(t_j)(t_{j+1} − t_j) ΔB(t_j)
+ Σ_{j<n} (∂^2 g/∂x^2)(X(t_j)) V^2(t_j) Δ^2 B(t_j).

We now analyze these terms when Δ(Π_n) → 0. Recall our assumption that ∂^2 g/∂x^2 and U are bounded, say by some constant C > 0. Therefore the first sum converges to zero provided Δ(Π_n) → 0. To analyze the second sum, we square it and take the expected value:

E[(2 Σ_{j<n} (∂^2 g/∂x^2)(X(t_j)) U(t_j) V(t_j)(t_{j+1} − t_j) ΔB(t_j))^2]
= 4 Σ_{j<n} E[((∂^2 g/∂x^2)(X(t_j)) U(t_j) V(t_j))^2] (t_{j+1} − t_j)^3,

where to obtain this equality we first condition on the field F_{t_j}, note that the expected values of all cross products vanish, and use E[Δ^2 B(t_j) | F_{t_j}] = t_{j+1} − t_j. Again, since the second derivative and U, V are bounded, the entire term converges to zero provided that Δ(Π_n) → 0. In order to analyze the last sum, the following result is needed.
Problem 1 (Generalized quadratic variation). Suppose a(s) ∈ H2 and Π_n : 0 = t_0 < ⋯ < t_n = t is a sequence of partitions satisfying Δ(Π_n) → 0 as n → ∞. Then the following convergence occurs in L2:

lim_n Σ_j a(t_j)(B(t_{j+1}) − B(t_j))^2 = ∫_0^t a(s) ds.

HINT: use the same approach that we used in establishing the quadratic variation of B.
Using this result we establish that the last sum converges in L2 to

∫_0^t (∂^2 g/∂x^2)(X(s)) V^2(s) ds.

It remains to analyze Σ_j o(Δ^2 X(t_j)), and using similar techniques it can be shown that this term vanishes in L2 norm as Δ(Π_n) → 0. Putting all of this together, we conclude that g(X(t)) is approximated in the L2 sense by

g(X(0)) + ∫_0^t (∂g/∂x)(X(s)) dX(s) + (1/2) ∫_0^t (∂^2 g/∂x^2)(X(s)) V^2(s) ds.

But recall that V^2(s) ds = (dX(s))^2. Making this substitution, we complete the derivation of the Ito formula.
Let us apply Theorem 1 to several examples.
Exercise 1. Verify that in all of the examples below the underlying processes
are in L2 .
Example 1. Let us re-derive formula (1) using the Ito formula. Since B_t = ∫_0^t dB_s is an Ito process and g(x) = (1/2)x^2 is twice continuously differentiable, by the Ito formula we have

d((1/2)B_t^2) = dg(B_t) = (∂g/∂x) dB_t + (1/2)(∂^2 g/∂x^2)(dB_t)^2
= B_t dB_t + (1/2)(dB_t)^2
= B_t dB_t + dt/2,

which matches (1).
Example 2. Let us apply the Ito formula to B_t^4. We obtain

d(B_t^4) = 4 B_t^3 dB_t + (1/2)·12 B_t^2 dt = 4 B_t^3 dB_t + 6 B_t^2 dt,

namely, written in integral form,

B_t^4 = 4 ∫_0^t B_s^3 dB_s + 6 ∫_0^t B_s^2 ds.

Taking expectations of both sides and recalling that the Ito integral is a martingale, we obtain

E[∫_0^t B_s^2 ds] = (1/6) E[B_t^4],

which we find to be (1/6)·3t^2 = t^2/2, as the 4-th moment of a zero-mean normal distribution with standard deviation σ is 3σ^4. Recall an earlier exercise where you were asked to compute E[∫_0^t B_s^2 ds] directly. We see that Ito calculus is useful even in computing conventional integrals.
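A Monte Carlo sketch (ours, assuming NumPy) of the identities in this example at t = 1: the Ito term has mean near zero, the ds term has mean near 3, and E[B_1^4] is near 3.

```python
import numpy as np

# Our Monte Carlo check of Example 2 at t = 1:
#   B_1^4 ~ 4*int_0^1 B^3 dB + 6*int_0^1 B^2 ds,
# where the Ito term has mean ~ 0 (martingale), the ds term has mean
# ~ 6*(1/2) = 3, and E[B_1^4] = 3 (fourth moment of a standard normal).
rng = np.random.default_rng(3)

t, n, n_paths = 1.0, 500, 10_000
dt = t / n
dB = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
B = np.hstack([np.zeros((n_paths, 1)), np.cumsum(dB, axis=1)])

ito_term = 4 * np.sum(B[:, :-1] ** 3 * dB, axis=1)   # 4 * int B^3 dB per path
ds_term = 6 * np.sum(B[:, :-1] ** 2, axis=1) * dt    # 6 * int B^2 ds per path
print(np.mean(B[:, -1] ** 4), np.mean(ito_term), np.mean(ds_term))
```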
3  Multidimensional Ito formula

There is a very useful analogue of the Ito formula in many dimensions. We state this result without proof. Before turning to the formula we need to extend our discussion to the case of Ito processes with respect to many dimensions, as so far we have considered Ito integrals and Ito processes with respect to just one Brownian motion. Thus suppose we have a vector of d independent Brownian motions B_t = (B_{i,t}, 1 ≤ i ≤ d), t ∈ R_+. A stochastic process X_t is defined to be an Ito process with respect to B_t if there exist U_t ∈ L2 and V_{i,t} ∈ L2, 1 ≤ i ≤ d, such that dX_t = U_t dt + Σ_i V_{i,t} dB_{i,t}, in the sense explained above. The definition naturally extends to the case when X_t is a vector of processes.
Theorem 2. Suppose dX_t = U_t dt + V_t dB_t, where the vector U = (U_1, . . . , U_d) and the matrix V = (V_{11}, . . . , V_{dd}) have L2 components and B is the vector of d independent Brownian motions. Let g(x) be a twice continuously differentiable function from R^d into R. Then Y_t = g(X_t) is also an Ito process and

dY_t = Σ_{i=1}^d (∂g/∂x_i)(X_t) dX_{i,t} + (1/2) Σ_{i,j=1}^d (∂^2 g/∂x_i ∂x_j)(X_t) dX_{i,t} dX_{j,t},

where dX_{i,t} dX_{j,t} is computed using the rules dt·dt = dt·dB_i = dB_i·dt = 0, dB_i·dB_j = 0 for all i ≠ j, and (dB_i)^2 = dt.
Let us now do a quick example illustrating the use of the Ito formula. Consider g(t, B_t) = e^{tB_t}. We will use the Ito formula to find its differential. Since both t and B_t are Ito processes and g(t, x) = e^{tx} is a twice continuously differentiable function g : R^2 → R, the formula applies. We have ∂g/∂t = x e^{tx}, ∂^2 g/∂t^2 = x^2 e^{tx}, ∂g/∂x = t e^{tx}, ∂^2 g/∂x^2 = t^2 e^{tx}. Then we can find its Ito representation using the Ito formula as

d(e^{tB_t}) = e^{tB_t} B_t dt + (1/2) e^{tB_t} B_t^2 (dt)^2 + t e^{tB_t} dB_t + (1/2) t^2 e^{tB_t} (dB_t)^2
= e^{tB_t}(B_t + (1/2) t^2) dt + t e^{tB_t} dB_t.
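As a sanity check (ours, assuming NumPy), taking expectations in this representation kills the dB term, so E[e^{tB_t}] = 1 + E[∫_0^t e^{sB_s}(B_s + s^2/2) ds]. Both sides can be compared with the exact value e^{t^3/2}, which follows since tB_t is N(0, t^3).

```python
import numpy as np

# Our check: the dB-part of e^{tB_t} is a mean-zero Ito integral, hence
#   E[e^{tB_t}] = 1 + E[ int_0^t e^{sB_s}(B_s + s^2/2) ds ],
# and the exact value is E[e^{tB_t}] = e^{t^3/2}; at t = 1 this is ~ 1.6487.
rng = np.random.default_rng(4)

t, n, n_paths = 1.0, 500, 20_000
dt = t / n
s = np.arange(n) * dt                          # left endpoints of the grid
dB = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))
B = np.hstack([np.zeros((n_paths, 1)), np.cumsum(dB, axis=1)])

drift = np.sum(np.exp(s * B[:, :-1]) * (B[:, :-1] + s ** 2 / 2), axis=1) * dt
lhs = np.mean(np.exp(t * B[:, -1]))            # E[e^{t B_t}]
rhs = 1 + np.mean(drift)
print(lhs, rhs, np.exp(t ** 3 / 2))
```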

Problem 2. Use the Ito formula to find ∫ e^{B_t} dB_t. In other words, you need to represent this integral in terms of expressions not involving dB_t (as we did for ∫ B_t dB_t).

Suppose f is a continuously differentiable function. Let us use the Ito formula to find ∫_0^t f_s dB_s and derive the integration by parts formula. In other words, we look at the special simple case when X is a deterministic process, i.e., X_s = f_s a.s. First we observe that f ∈ L2. Indeed, it is differentiable and therefore continuous. This implies that f is bounded on any finite interval, and therefore E[∫_0^t f_s^2 ds] = ∫_0^t f_s^2 ds < ∞.

Introduce g(t, x) = f_t x. We find that ∂g/∂t = x (df/dt), ∂g/∂x = f_t, ∂^2 g/∂t^2 = x (d^2 f/dt^2), and the second-order partial derivatives with respect to x vanish. Therefore, using the Ito formula, we obtain

d(g(t, B_t)) = (df/dt) B_t dt + (d^2 f/dt^2) B_t (dt)^2 + f_t dB_t + 0 = (df/dt) B_t dt + f_t dB_t.

This implies (since g(0, B(0)) = 0)

∫_0^t f_s dB_s = f_t B_t − ∫_0^t B_s (df/ds) ds.

This does look like integration by parts.


It turns out that a more general version of the integration by parts formula holds in Ito calculus. We start by recalling the definition of the Stieltjes integral. We are given a function g which has bounded variation and another function f, which we assume for simplicity is continuous. We define the Stieltjes integral ∫_0^t f_s dg_s as an appropriate limit of the sums Σ_j f_{t_j}(g_{t_{j+1}} − g_{t_j}), where Π_n : 0 = t_0 < ⋯ < t_n = t is a sequence of partitions with Δ(Π_n) = max_j (t_{j+1} − t_j) → 0. We skip the formalities of the construction of such a limit. They are similar to (and simpler than) those of the Ito integral.
Now let us state, without proof, the integration by parts theorem.

Theorem 3. Suppose f_s is a continuous function on [0, t] with bounded variation. Then

∫_0^t f_s dB_s = f_t B_t − ∫_0^t B_s df_s,

where the second integral is the Stieltjes integral.
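On a fixed partition, the left-endpoint (Ito) sum for ∫ f dB and the right-endpoint (Stieltjes) sum for ∫ B df are linked by exact summation by parts, so the identity in Theorem 3 can be checked pathwise to machine precision, for example with f_s = s (a sketch of ours, assuming NumPy):

```python
import numpy as np

# Our pathwise illustration of the integration by parts theorem for f_s = s
# on [0, 1]: by discrete summation by parts,
#   sum_j f_{t_j}(B_{t_{j+1}} - B_{t_j}) + sum_j B_{t_{j+1}}(f_{t_{j+1}} - f_{t_j})
#     = f_t B_t - f_0 B_0,
# which is exact on every partition (here B_0 = 0).
rng = np.random.default_rng(5)

t, n = 1.0, 10_000
grid = np.linspace(0.0, t, n + 1)
f = grid                                            # f_s = s
B = np.concatenate([[0.0], np.cumsum(rng.normal(0.0, np.sqrt(t / n), n))])

lhs = np.sum(f[:-1] * np.diff(B))                   # ~ int_0^t f_s dB_s
rhs = f[-1] * B[-1] - np.sum(B[1:] * np.diff(f))    # f_t B_t - int_0^t B_s df_s
print(lhs, rhs)
```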

Is there a generalization of integration by parts when f is not necessarily deterministic? The answer is positive, but, unfortunately, it does not reduce an Ito process to a process not involving the Brownian component dB.

Theorem 4. Let X_t, Y_t be Ito processes. Then X_t Y_t is also an Ito process:

X_t Y_t = X(0)Y(0) + ∫_0^t X_s dY_s + ∫_0^t Y_s dX_s + ∫_0^t dX_s dY_s.

Proof. The proof is a direct application of the multidimensional Ito formula to the function g(X_t, Y_t) = X_t Y_t.
4  Additional reading materials

Øksendal [1], Chapter IV.

References

[1] B. Øksendal, Stochastic differential equations, Springer, 1991.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070J
Lecture 19

Fall 2013
11/20/2013

Applications of Ito calculus to nance

Content.
1. Trading strategies
2. Black-Scholes option pricing formula

1  Security price processes, trading strategies and arbitrage

As early as 1900, Louis Bachelier proposed using Brownian motion as a model for security prices. As insightful as it was, this model is somewhat restrictive. For example, it allows the stock price to be negative.
A more appropriate and accepted model is to assume that stock prices follow a general Ito process X_t = X_0 + ∫_0^t α_s ds + ∫_0^t σ_s dB_s, where α, σ are adapted processes. Suppose we make several investment decisions with this stock. In particular, at times t_1 < t_2 < ⋯ < t_k we decide to hold θ_{t_j} stocks of this security. What would be our gain/loss at some later time t = t_{k+1} > t_k? It is simply Σ_{j≤k} θ_{t_j}(X_{t_{j+1}} − X_{t_j}). It makes sense to assume that our trading strategy θ is an adapted process (including the possibility that θ is deterministic), since otherwise it would mean we can predict the market. Thus we can think of θ as a simple adapted process. The simplicity means that we make only k trading decisions. But there is no reason to bound the number of trading decisions a priori (unless we take trading fees into account). So we may think of θ as an arbitrary adapted process θ_t ∈ F_t. For technical reasons we do assume that θ ∈ L2. This is needed so that we exclude the doubling-strategy pathology which, as we know, may bring a positive gain with probability one.
Definition 1. A trading strategy θ_t is an adapted process in L2. The gain produced by the trading strategy θ_t during the time interval [0, T] is defined to be

∫_0^T θ_t dX_t = ∫_0^T θ_t α_t dt + ∫_0^T θ_t σ_t dB_t.    (1)

Given a vector of securities X̄_t = (X_t^1, . . . , X_t^m), a vector of trading strategies θ̄_t = (θ_t^1, . . . , θ_t^m) into the securities X̄_t is defined to be self-financing if for every time instance t

θ̄_t · X̄_t = θ̄_0 · X̄_0 + ∫_0^t θ̄(s) · dX̄(s)    (2)
= Σ_{1≤j≤m} θ_0^j X_0^j + Σ_{1≤j≤m} ∫_0^t θ_j(s) dX_j(s).    (3)
This definition simply means that whatever we have at any time, we invest. This is easier to understand when we have a simple strategy, that is θ ∈ L02. Then we trade at times 0 = t_0 < t_1 < ⋯ < t_n = t. We start with portfolio θ̄_0 and buy θ̄_0 · X̄_0 worth of dollars of the securities. At time t_1 our portfolio is worth θ̄_0 · X̄_0 + θ̄_0 · (X̄_{t_1} − X̄_0) ≜ W_{t_1}. We create some other portfolio θ̄_{t_1} with the condition that θ̄_{t_1} · X̄_{t_1} = W_{t_1} (self-financing). At time t_2 our portfolio is worth

W_{t_1} + θ̄_{t_1} · (X̄_{t_2} − X̄_{t_1}) = θ̄_0 · X̄_0 + θ̄_0 · (X̄_{t_1} − X̄_0) + θ̄_{t_1} · (X̄_{t_2} − X̄_{t_1}) ≜ W_{t_2}.

Then we create a portfolio θ̄_{t_2} with the condition θ̄_{t_2} · X̄_{t_2} = W_{t_2}, and so on. In the end we obtain

θ̄_t · X̄_t = W_t = θ̄_0 · X̄_0 + Σ_{j≤n−1} θ̄_{t_j} · (X̄_{t_{j+1}} − X̄_{t_j}) = θ̄_0 · X̄_0 + ∫_0^t θ̄(s) · dX̄(s).

We do not require that θ ≥ 0. Having a negative θ < 0 essentially means short-selling the security.
One of the basic assumptions of classical financial models is that the security market does not allow arbitrage. An arbitrage is an opportunity to obtain a positive gain without any risk over some time period [0, T]. Mathematically, we define it as follows.

Definition 2. A vector of trading strategies θ̄_t in securities X̄_t is defined to be an arbitrage if θ̄_0 · X̄_0 < 0 and θ̄_T · X̄_T ≥ 0 a.s., or θ̄_0 · X̄_0 ≤ 0 and θ̄_T · X̄_T > 0 (where we say Z > 0 to mean Z ≥ 0 and P(Z > 0) > 0).

We see that in either case an arbitrage creates an opportunity to gain money without any losses. In the first case we gain −θ̄_0 · X̄_0 > 0 with probability one. In the second case we gain θ̄_T · X̄_T with a positive probability. It is a basic assumption in finance theory that market prices are such that there does not exist a trading strategy creating an arbitrage opportunity. As it turns out, under technical assumptions, this is equivalent to the existence of a change of measure such that, with respect to the new measure, X is a martingale.
2  Black-Scholes option pricing formula

For the purposes of this section we assume that we are dealing with two securities:
1. A stock, whose time-t price S_t follows a geometric Brownian motion
S_t = x exp(αt + σB_t)
for some constants α, σ; and
2. A bond, whose price β_t at time t is given as
β_t = β_0 exp(rt)
for some constants β_0, r.
Both of these are Ito processes. The stock process, as we find using the Ito formula, satisfies

dS = S(α dt + σ dB) + (1/2) S σ^2 (dB)^2 = S(μ dt + σ dB),    (4)

where μ = α + σ^2/2. The process β is simply a deterministic process satisfying dβ = rβ dt. μ is called the instantaneous rate of return, σ is called the volatility, and r is called the risk-free interest rate.
In addition, suppose we have the following derivative security, called an option. Specifically, we will consider a so-called European call option, which is parametrized by a certain strike price K and maturity time T. A European call option gives its owner the right to buy the stock S at time T for the price K. Thus the payoff to the owner of the option is max(S_T − K, 0) at time T (the option will not be exercised when the price S_T < K, to avoid a loss). The main question is: what should be the price of this option at time 0? One would expect the fair price of the option to depend on the owner's/seller's perception of the risk of the stock. The main insight of Black and Scholes was to realize that the price of the option is tied to the price of the stock in a deterministic way if there is to be no arbitrage.
The reason for this deterministic dependence has to do with the fact that one can simply replicate the option by carefully constructing a portfolio consisting of a bond and a stock. Another surprising aspect of the Black-Scholes result was that the portfolio and the price can be computed in a very explicit form.
We first present the Black-Scholes formula and then discuss its relevance to option pricing.
The Black-Scholes formula. We are given a strike price K, time instance T, interest rate r and volatility σ. Define

z = [log(x/K) + (r + σ^2/2)(T − t)] / (σ √(T − t))

and

C(x, t) = x N(z) − e^{−r(T−t)} K N(z − σ √(T − t)),    (5)

where N is the standard normal cumulative distribution function. We now state the Black-Scholes theorem.

Theorem 1 (Black-Scholes option pricing theorem). If there is no arbitrage, then the price of an option with strike K and maturity time T must be equal to C(S_0, 0). In general, the price of this option at time t must be C(S_t, t).
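Formula (5) is straightforward to implement; the sketch below (ours; N computed via math.erf) evaluates C(x, t) and checks it at a commonly quoted reference point.

```python
import math

# Our direct implementation of (5); N is the standard normal cdf, computed
# here with math.erf. The value C(100, 0) ~ 10.45 for K = 100, T = 1,
# r = 0.05, sigma = 0.2 is a standard reference point for this formula.
def norm_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def black_scholes_call(x: float, t: float, K: float, T: float,
                       r: float, sigma: float) -> float:
    """Price C(x, t) of a European call with strike K and maturity T."""
    tau = T - t
    if tau <= 0.0:
        return max(x - K, 0.0)          # at maturity, the price is the payoff
    z = (math.log(x / K) + (r + sigma ** 2 / 2) * tau) / (sigma * math.sqrt(tau))
    return x * norm_cdf(z) - math.exp(-r * tau) * K * norm_cdf(z - sigma * math.sqrt(tau))

print(black_scholes_call(100.0, 0.0, K=100.0, T=1.0, r=0.05, sigma=0.2))
```

For x much larger than K the price approaches x − e^{−r(T−t)}K, and at t = T it reduces to max(x − K, 0), matching the extreme cases examined in the text.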
Thus, as the theorem claims, the current price of the option is uniquely determined by the current price of the stock and the market parameters. Notice that while the price does depend on the volatility σ of the price process, it does not depend on μ = α + σ^2/2, the instantaneous rate of return. This might seem very surprising, as presumably an option on a stock which has a higher rate of return should have a higher price. The explanation is that the absence of arbitrage prevents two stocks with the same volatility from having different rates of return. To see this, observe the following simple fact.

Lemma 1. Suppose σ = 0, i.e., the stock is riskless. Then no arbitrage implies μ = r.

This lemma is pretty much self-evident. If μ = α > r we can make money by investing in the stock and short-selling the same amount of bonds. Say we buy one dollar worth of stock and sell one dollar worth of bonds. Then at time zero our net investment is zero, but at time t the worth of our portfolio is e^{μt} − e^{rt} > 0. Similarly, when μ < r we can buy bonds and sell stock. Thus it must be (as is obvious) that μ = r. That is, no arbitrage places a constraint on the combinations of the mean rate of return μ and the volatility σ.

Before we discuss how to establish the Black-Scholes formula (we will only sketch the proof), let us discuss its behavior in various extreme cases. Say again σ = 0. In this case it must be that μ = r. What is the worth of the option at time t = 0? Suppose the price of the stock at time zero is x. At time T the option pays, with probability one, the amount max(x e^{rT} − K, 0). If x e^{rT} ≥ K, then its worth at time t = 0 is exactly x − e^{−rT} K, as we can create a net gain of x e^{rT} − K at time T by investing x − e^{−rT} K into stock or bonds (which are equivalent since μ = r). Therefore the right price of this option at time zero is exactly x − e^{−rT} K. But if x e^{rT} < K, that is x − K e^{−rT} < 0, then with probability one the option does not pay anything. Therefore it is worth zero.
Let us see whether this matches what the Black-Scholes formula predicts. We first compute the limit of z as σ → 0. In this case the limit of z is

lim_{σ→0} [log(x/K) + (r + σ^2/2)T] / (σ √T) = +∞

when log(x/K) + rT > 0, and

lim_{σ→0} [log(x/K) + (r + σ^2/2)T] / (σ √T) = −∞

when log(x/K) + rT < 0. The degenerate case log(x/K) + rT = 0 is a tie-breaking case and we do not interpret it. The two conditions are exactly x − e^{−rT} K > 0 and x − e^{−rT} K < 0. In the first case the Black-Scholes formula gives

C(x, 0) = x N(∞) − e^{−rT} K N(∞) = x − e^{−rT} K.

In the second case it gives

C(x, 0) = x N(−∞) − e^{−rT} K N(−∞) = 0.

This is consistent with our findings.
Let us now look at a different approximation, as t → T, but no longer assuming σ = 0. In this case we have the limit

lim_{t→T} [log(x/K) + (r + σ^2/2)(T − t)] / (σ √(T − t)) = +∞

when x > K, and the limit is −∞ when x < K. Also, z − σ √(T − t) approaches +∞ when x > K and −∞ when x < K. This means that

lim_{t→T} C(x, t) = x N(∞) − K N(∞) = x − K

when x > K, and = 0 when x < K. This is again consistent with common sense. As the strike time T approaches, the uncertainty about the stock gradually disappears and its worth is x − K when x > K and 0 when x < K; namely, it is worth exactly max(x − K, 0), which is its payoff upon maturity.
Proof sketch for the Black-Scholes Theorem. We will show that there exists a self-financed trading strategy a_t, b_t for trading stocks S_t and bonds β_t such that a_T S_T + b_T β_T = max(S_T − K, 0); that is, the value of the portfolio at time T is exactly the payoff max(S_T − K, 0) of the option at time T. Since there is no arbitrage, the price of the option at time t = 0 must be exactly a_0 S_0 + b_0 β_0.
We will first assume that the right price C(S_t, t) of the option at time t is "nice". Specifically, the corresponding function C(x, t) is twice continuously differentiable. Later on, when we actually find C, we simply verify that this is indeed the case. For now, we make this assumption and try to infer the function C as well as the self-financed strategies a and b. We want to find a self-financed trading strategy a, b such that

a_t S_t + b_t β_t = C(S_t, t).    (6)

How can we find it? We will find an Ito representation of a_t S_t + b_t β_t and match it with the Ito representation of C.
Using the Ito formula we know that C(S_t, t) is again an Ito process, and using (4),

dC = (∂C/∂t) dt + (∂C/∂x) dS + (1/2)(∂^2 C/∂x^2)(dS)^2
= [(∂C/∂t) + (∂C/∂x) μS + (1/2)(∂^2 C/∂x^2) σ^2 S^2] dt + (∂C/∂x) σS dB.    (7)

From the self-financing condition (3) we need to have

a_t S_t + b_t β_t = a_0 S_0 + b_0 β_0 + ∫_0^t a_s dS_s + ∫_0^t b_s dβ_s,

or, in differential form,

d(aS + bβ) = a dS + b dβ = (aμS + bβr) dt + aσS dB.    (8)

Now we would like to match this with (7). Let us set a = ∂C/∂x. This way we match the dB multipliers and make aμS = (∂C/∂x) μS. Now the equality (6) then suggests taking

b_t = (1/β_t)(C(S_t, t) − a_t S_t) = (1/β_t)(C(S_t, t) − (∂C/∂x) S_t).    (9)

On the other hand, we need to match the remaining dt coefficients in (8) and (7):

bβr = (∂C/∂t) + (1/2)(∂^2 C/∂x^2) σ^2 S^2,

which using (9) leads to

rC(S_t, t) − r(∂C/∂x) S_t = (∂C/∂t) + (1/2)(∂^2 C/∂x^2) σ^2 S^2.

This defines a partial differential equation on C. To this equation we add the boundary condition C(x, T) = max(x − K, 0) (as this is the only arbitrage-free price of the option at time T). It turns out (and quite miraculously so) that this PDE indeed has an explicit solution, of the form (5). In retrospect we check that this solution is twice continuously differentiable.
In order to make this proof rigorous, we would just take the guessed solution, plug it in, and check that the implied solution for a and b makes them a self-financed trading strategy. Our proof approach was more in line with how this formula was discovered.
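As a numerical cross-check (ours), one can verify by finite differences that the closed form (5) satisfies this PDE: the residual rC − rxC_x − C_t − (1/2)σ^2 x^2 C_xx should vanish up to discretization error.

```python
import math

# Our finite-difference cross-check that the closed form (5) satisfies the PDE
#   r C - r x C_x = C_t + (1/2) sigma^2 x^2 C_xx
# derived above, away from the boundary t = T.
K, T, r, sigma = 100.0, 1.0, 0.05, 0.2

def N(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def C(x, t):
    tau = T - t
    z = (math.log(x / K) + (r + sigma ** 2 / 2) * tau) / (sigma * math.sqrt(tau))
    return x * N(z) - math.exp(-r * tau) * K * N(z - sigma * math.sqrt(tau))

x, t, h = 100.0, 0.5, 1e-3
C_x = (C(x + h, t) - C(x - h, t)) / (2 * h)             # central differences
C_xx = (C(x + h, t) - 2 * C(x, t) + C(x - h, t)) / h ** 2
C_t = (C(x, t + h) - C(x, t - h)) / (2 * h)
residual = r * C(x, t) - r * x * C_x - C_t - 0.5 * sigma ** 2 * x ** 2 * C_xx
print(residual)   # close to zero
```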
3  Additional reading materials

Duffie, Dynamic Asset Pricing Theory [1].

References

[1] D. Duffie, Dynamic asset pricing theory, Princeton University Press, 2001.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070J
Lecture 20

Fall 2013
11/25/2013

Introduction to the theory of weak convergence

Content.
1. σ-fields on metric spaces.
2. Kolmogorov σ-field on C[0, T].
3. Weak convergence.

In the first two sections we review some concepts from measure theory on metric spaces. Then in the last section we begin the discussion of the theory of weak convergence, by stating and proving the important Portmanteau Theorem, which gives four equivalent definitions of weak convergence.
1  Borel σ-fields on metric spaces

We consider a metric space (S, ρ). To discuss probability measures on metric spaces we first need to introduce σ-fields.

Definition 1. The Borel σ-field B on S is the σ-field generated by the set of all open sets U ⊂ S.

Lemma 1. Suppose S is Polish. Then every open set U ⊂ S can be represented as a countable union of balls B(x, r).

Proof. Since S is Polish, we can identify a countable set x_1, . . . , x_n, . . . which is dense. For each x_i ∈ U, since U is open we can find a radius r_i such that B(x_i, r_i) ⊂ U. Fix a constant M > 0. Consider r̂_i = min(sup{r : B(x_i, r) ⊂ U}, M) > 0 and set r_i = r̂_i/2. Then ∪_{i : x_i ∈ U} B(x_i, r_i) ⊂ U. In order to finish the proof we need to show that the equality ∪_{i : x_i ∈ U} B(x_i, r_i) = U holds. Consider any x ∈ U. There exists 0 < r ≤ M such that B(x, r) ⊂ U. Since the set (x_i) is dense, there exists x_k such that ρ(x_k, x) < r/4. Then, by the triangle inequality, B(x_k, 3r/4) ⊂ B(x, r) ⊂ U. This means, by the definition of r̂_k and r_k, that r_k ≥ 3r/8. But x ∈ B(x_k, 3r/8), since the distance ρ(x_k, x) ≤ r/4. This shows x ∈ B(x_k, r_k). Since x was arbitrary, U ⊂ ∪_{i : x_i ∈ U} B(x_i, r_i).
From this lemma we obtain immediately:

Corollary 1. Suppose S is Polish. Then B is generated by a countable collection of balls B(x, r), x ∈ S, r ≥ 0.
2  Kolmogorov σ-field on C[0, T]

Now let us focus on C[0, T] and the Borel σ-field B on it. For each t ≥ 0 define a projection π_t : C[0, T] → R as π_t(x) = x(t). Observe that π_t is a uniformly continuous mapping. Indeed,

|π_t(x) − π_t(y)| = |x(t) − y(t)| ≤ ‖x − y‖.

This immediately implies uniform continuity.
The family of projection mappings π_t gives rise to an alternative σ-field.

Definition 2. The Kolmogorov σ-field K on C[0, T] is the σ-field generated by π_t^{−1}(B), t ∈ [0, T], B ∈ B(R), where B(R) is the Borel σ-field of R.

It turns out (and this will be useful) that the two σ-fields are identical:

Theorem 1. The Kolmogorov σ-field K is identical to the Borel σ-field B of C[0, T].

Proof. First we show that K ⊂ B. Since π_t is continuous, for every open set U ⊂ R the preimage π_t^{−1}(U) is open in C[0, T]. This applies to all open intervals U. Thus each π_t^{−1}(U) ∈ B. This shows K ⊂ B.
Now we show the other direction. Since C[0, T] is Polish, then by Corollary 1 it suffices to check that every ball B(x, r) ∈ K. Fix x ∈ C[0, T], r ≥ 0. For each rational q ∈ [0, T], consider B_q ≜ π_q^{−1}([x(q) − r, x(q) + r]). This is the set of all functions y such that |y(q) − x(q)| ≤ r. Consider ∩_q B_q. As a countable intersection, this is a set in K. We claim that ∩_q B_q = B, where B denotes the ball B(x, r). This implies the result.
To establish the claim, note that B ⊂ B_q for each q. Now suppose y ∉ B. Namely, for some t ∈ [0, T] we have |y(t) − x(t)| = r + δ > r. Find a sequence of rational values q_n converging to t. By the continuity of x, y we have x(q_n) → x(t), y(q_n) → y(t). Therefore for all sufficiently large n we have |y(q_n) − x(q_n)| > r. This means y ∉ B_{q_n}.

Weak convergence

We now turn to a very important concept of weak convergence or convergence


of probability measures. Recall the convergence in distribution of r.v. Xn X.
Now consider random variables X : S which take values in some metric
space (S, ). Again we dene X to be a random variable if X is a measurable
transformation. We would like to give a meaning to Xn X. In order to do
this we rst dene convergence of probability measures.
Given a metric space (S, ρ) and the corresponding Borel σ-field B, suppose we have a sequence P, P_n, n = 1, 2, . . . of probability measures on (S, B). Let C_b(S) denote the set of all continuous bounded real-valued functions on S. In particular, every function X ∈ C_b(S) is measurable.
Theorem 2 (Portmanteau Theorem). The following conditions are equivalent.
1. lim_n E_{P_n}[X] = E_P[X] for every X ∈ C_b(S).
2. For every closed set F ⊂ S, lim sup_n P_n(F) ≤ P(F).
3. For every open set U ⊂ S, lim inf_n P_n(U) ≥ P(U).
4. For every set A ∈ B such that P(∂A) = 0, the convergence lim_n P_n(A) = P(A) holds.
What is the meaning of this theorem? Ideally we would like to say that a sequence of measures P_n converges to a measure P if P_n(A) → P(A) for every set A. However, this turns out to be too restrictive. In some sense the fourth part of the theorem is the meaningful one; the second and third are technical, and the first is a very useful implication.
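As a concrete illustration of why condition 4 restricts to sets with P(∂A) = 0, consider the point masses P_n = δ_{1/n} and P = δ_0. The following numerical sketch is ours (not part of the notes) and assumes NumPy; the test function is an arbitrary bounded continuous choice.

```python
import numpy as np

# Point masses: P_n concentrated at 1/n, P concentrated at 0.
# Weak convergence holds: E_{P_n}[f] = f(1/n) -> f(0) for bounded continuous f.
f = lambda x: np.cos(x)                 # arbitrary bounded continuous test function
vals = [f(1.0 / n) for n in (1, 10, 100, 1000)]
print(vals)                             # approaches f(0) = 1.0

# But for A = (0, inf), whose boundary {0} carries P-mass 1,
# P_n(A) = 1 for every n while P(A) = 0: condition 4 fails exactly
# because P(boundary of A) != 0.
Pn_A = [1.0 for n in (1, 10, 100, 1000)]   # 1/n lies in (0, inf) for every n
P_A = 0.0
print(Pn_A, P_A)
```

So P_n(A) → P(A) can fail for a boundary-charged set even though E_{P_n}[f] → E_P[f] for every bounded continuous f.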
Definition 3. A sequence of measures P_n is said to converge weakly to P if one of the four equivalent conditions in Theorem 2 holds.
Proof. (a) ⇒ (b). Consider a closed set F. For any ε > 0, define F^ε = {x ∈ S : ρ(x, F) ≤ ε}, where the distance from a point to a set is defined as the infimum of the distances to its points. Let us first show that ∩_ε F^ε = F. The inclusion ∩_ε F^ε ⊃ F is immediate. For the other side, consider any x ∉ F. Since S \ F is open, B(x, r) ⊂ S \ F for some r > 0. This means x ∉ F^r, and the assertion is established.
Invoking the continuity theorem, the equality ∩_ε F^ε = F implies

lim_{ε→0} P(F^ε) = P(F).   (1)

Define X_ε : S → [0, 1] by X_ε(x) = (1 − ρ(x, F)ε^{−1})^+. Then X_ε = 1 on F and X_ε = 0 outside of F^ε. Specifically,

1{F} ≤ X_ε ≤ 1{F^ε},

which implies

P_n(F) ≤ E_{P_n}[X_ε] ≤ P_n(F^ε),    P(F) ≤ E_P[X_ε] ≤ P(F^ε).

Note that X_ε is a continuous bounded function. Therefore, by assumption, lim_n E_{P_n}[X_ε] = E_P[X_ε]. Combining, we obtain

lim sup_n P_n(F) ≤ lim sup_n E_{P_n}[X_ε] = lim_n E_{P_n}[X_ε] = E_P[X_ε] ≤ P(F^ε).

Finally, using (1) we conclude that lim sup_n P_n(F) ≤ P(F).


(b) ⇔ (c). This part follows immediately by observing that F = U^c is a closed set and P(F) = 1 − P(U), P_n(F) = 1 − P_n(U); the same observation gives (c) ⇒ (b).
(c) ⇒ (d). Given any set A, consider its closure Ā ⊃ A and interior A° ⊂ A. Applying (b) and (c), we have

P(Ā) ≥ lim sup_n P_n(Ā) ≥ lim sup_n P_n(A) ≥ lim inf_n P_n(A) ≥ lim inf_n P_n(A°) ≥ P(A°).

Now if P(∂A) = P(Ā \ A°) = 0, then we obtain equality throughout and, in particular, P_n(A) → P(A).
(d) ⇒ (a). Let X ∈ C_b(S). We may assume w.l.o.g. that 0 ≤ X ≤ 1. Then

E_{P_n}[X] = ∫_0^1 P_n(X > t) dt,    E_P[X] = ∫_0^1 P(X > t) dt.

Since X is continuous, for any set A(t) = {x ∈ S : X(x) > t}, its boundary satisfies ∂A(t) ⊂ {X = t} (exercise). Observe that the probability of the set {X = t} is zero except for countably many points t. This can be obtained by looking at the sets of the form {X = t} with probability weight at least 1/m and taking a union over m. Thus for almost all t the set A(t) is a continuity set, i.e., P(∂A(t)) = 0, and hence P_n(A(t)) → P(A(t)). Since also 0 ≤ P_n(A(t)) ≤ 1, then using the Bounded Convergence Theorem we obtain

∫_0^1 P_n(A(t)) dt → ∫_0^1 P(A(t)) dt.

This implies the result.



Using the definition of weak convergence of measures we can define weak convergence of metric space valued random variables.
Definition 4. Given a probability space (Ω, F, P) and a metric space (S, ρ), a sequence of measurable transformations X_n : Ω → S is said to converge weakly or in distribution to a transformation X : Ω → S, if the probability measures on (S, ρ) generated by X_n converge weakly to the probability measure generated by X on (S, ρ).
4 Additional reading materials

Notes distributed in the class, Chapter 5.
Billingsley [1] Chapter 1, Section 2.

References
[1] P. Billingsley, Convergence of probability measures, Wiley-Interscience
publication, 1999.

MIT OpenCourseWare
http://ocw.mit.edu

15.070J / 6.265J Advanced Stochastic Processes


Fall 2013

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070J
Lecture 21

Fall 2013
11/27/2013

Functional Law of Large Numbers. Construction of the Wiener Measure

Content.
1. Additional technical results on weak convergence
2. Functional Strong Law of Large Numbers
3. Existence of Wiener measure (Brownian motion)

1 Additional technical results on weak convergence

Given two metric spaces S_1, S_2 and a measurable function f : S_1 → S_2, suppose S_1 is equipped with some probability measure P. This induces a probability measure on S_2, denoted by Pf^{−1} and defined by Pf^{−1}(A) = P(f^{−1}(A)) for every measurable set A ⊂ S_2. Then for any random variable X : S_2 → R, its expectation E_{Pf^{−1}}[X] is equal to E_P[X(f)]. (Convince yourself that this is the case by looking at the special case when X is a simple function.)
Theorem 1 (Mapping Theorem). Suppose P_n ⇒ P for a sequence of probability measures P, P_n on S_1 and suppose f : S_1 → S_2 is continuous. Then P_n f^{−1} ⇒ Pf^{−1} on S_2.
Proof. We use the Portmanteau theorem, in particular the characterization of weak convergence via bounded continuous functions. Thus let g : S_2 → R be any bounded continuous function. Since it is continuous, it is also measurable, and thus it is also a random variable defined on (S_2, B_2), where B_2 is the Borel σ-field on S_2. We have

E_{P_n f^{−1}}[g] = E_{P_n}[g(f)].

Since g is bounded continuous, the composition g(f) is also bounded continuous. Therefore, by the Portmanteau theorem,

E_{P_n}[g(f)] → E_P[g(f)] = E_{Pf^{−1}}[g].

Definition 1. A sequence of probability measures P_n on a metric space S is defined to be tight if for every ε > 0 there exist n_0 and a compact set K ⊂ S such that P_n(K) > 1 − ε for all n > n_0.
Theorem 2 (Prohorov's Theorem). Suppose the sequence P_n is tight. Then it contains a weakly convergent subsequence P_{n(k)} ⇒ P.
The converse of this theorem is also true, but we will not need it. We do not prove Prohorov's Theorem; the proof can be found in [1]. Recall that the Arzela-Ascoli Theorem provides a characterization of compact sets in C[0, T]. We can use it now for a characterization of tightness.
Proposition 1. Suppose a sequence of measures P_n on C[0, T] satisfies the following conditions:
(i) There exists a ≥ 0 such that lim_n P_n(|x(0)| > a) = 0.
(ii) For each ε > 0, lim_{δ→0} lim sup_n P_n({x : w_x(δ) > ε}) = 0.
Then the sequence P_n is tight.

Proof. Fix ε > 0. From (i) we can find â and n_0 large enough so that P_n(|x(0)| > â) < ε for all n > n_0. For every m ≤ n_0 we can also find a_m large enough so that P_m(|x(0)| > a_m) < ε. Take a = max(â, max_{m≤n_0} a_m). Then P_n(|x(0)| > a) < ε for all n. Let B = {x : |x(0)| ≤ a}. We just showed P_n(B) ≥ 1 − ε.
Similarly, for every k > 0 we can find δ_k and n_k such that P_n(w_x(δ_k) > ε/2^k) < ε/2^k for all n > n_k. For every fixed n ≤ n_k we can find a small enough δ_n > 0 such that P_n(w_x(δ_n) > ε/2^k) < ε/2^k, since by the uniform continuity of x we have ∩_{δ>0}{w_x(δ) > ε/2^k} = ∅ a.s. Let δ̄_k = min(δ_k, min_{n≤n_k} δ_n). Let B_k = {x : w_x(δ̄_k) ≤ ε/2^k}. Since P_n(B_k^c) < ε/2^k, then P_n(∪_k B_k^c) < ε and P_n(∩_k B_k) ≥ 1 − ε, for all n. Therefore P_n(B ∩ ∩_k B_k) ≥ 1 − 2ε for all n. The set K = B ∩ ∩_k B_k is closed (check) and satisfies the conditions of the Arzela-Ascoli Theorem. Therefore it is compact.
2 Functional Strong Law of Large Numbers (FSLLN)

We are about to establish two very important limit results in the theory of stochastic processes. In probability theory two cornerstone theorems are the (Weak or Strong) Law of Large Numbers and the Central Limit Theorem. These theorems have direct analogues in the theory of stochastic processes: the Functional Strong Law of Large Numbers (FSLLN) and the Functional Central Limit Theorem (FCLT), also known as the Donsker Theorem. The second theorem contains in it the fact that the Wiener measure exists.

We first describe the setup. Consider a sequence of i.i.d. random variables X_1, X_2, . . . , X_n, . . .. We assume that E[X_1] = 0, E[X_1^2] = σ^2. We can view each realization of an infinite sequence (X_n(ω)) as a sample in the product space R^∞ equipped with the product type σ-field F and the probability measure P_{i.i.d.} induced by the probability distribution of X_1.
Define S_n = Σ_{1≤k≤n} X_k. Fix an interval [0, T] and for each n ≥ 1 and t ∈ [0, T] consider the following function:

N_n(t) = S_⌊nt⌋(ω)/n + (nt − ⌊nt⌋) X_{⌊nt⌋+1}(ω)/n.   (1)

This is a piecewise linear continuous function in C[0, T].


Theorem 3 (Functional Strong Law of Large Numbers (FSLLN)). Given an i.i.d. sequence (X_n), n ≥ 1, with E[X_1] = 0, E[|X_1|] < ∞, for every T > 0 the sequence of functions N_n : [0, T] → R converges to zero almost surely. Namely,

P(‖N_n(ω)‖_T → 0) = P( sup_{0≤t≤T} |N_n(t, ω)| → 0) = 1.

As we see, just as the SLLN, the FSLLN holds without any assumptions on the variance of X_1, that is, even if σ^2 = ∞.
Here is another way to state the FSLLN. We may consider the functions N_n defined on the entire half-line [0, ∞) using the same defining identity (1). Recall that the sets [0, T] are compact in R. An equivalent way of stating the FSLLN is: N_n converges to zero almost surely, uniformly on compact sets.
Proof. Fix ε > 0 and T > 0. By the SLLN, for almost all realizations of a sequence X_1(ω), X_2(ω), . . . there exists n_0(ω) such that for all n > n_0(ω),

|S_n(ω)|/n < ε/T.

We let M(ω) = max_{1≤m≤n_0(ω)} |S_m(ω)|. We claim that for n > M(ω)/ε there holds

sup_{0≤t≤T} |N_n(t)| ≤ ε.

We consider two cases. Suppose t ∈ [0, T] is such that ⌊nt⌋ > n_0(ω). Then

|N_n(t)| ≤ max( |S_⌊nt⌋(ω)|/n, |S_{⌊nt⌋+1}(ω)|/n ).

We have

|S_⌊nt⌋(ω)|/n = (|S_⌊nt⌋(ω)|/⌊nt⌋) · (⌊nt⌋/n) ≤ (ε/T) · t ≤ ε.

Using a similar bound on |S_{⌊nt⌋+1}(ω)|, we obtain |N_n(t)| ≤ ε. Suppose now t is such that ⌊nt⌋ ≤ n_0(ω). Then

|N_n(t)| ≤ M(ω)/n < ε,

since, by our choice, n > M(ω)/ε. We conclude sup_{0≤t≤T} |N_n(t)| ≤ ε for all n > M(ω)/ε. This concludes the proof.
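The statement can be observed by simulation. The following sketch is ours (not part of the notes) and assumes NumPy; centered exponential increments are an arbitrary mean-zero choice, and we use the fact that the piecewise linear N_n attains its supremum over [0, T] at the breakpoints t = k/n.

```python
import numpy as np

rng = np.random.default_rng(0)

def fslln_sup(n, T=1.0):
    """Return sup_{0 <= t <= T} |N_n(t)| for one sample path, where
    N_n(t) = S_{floor(nt)}/n + (nt - floor(nt)) X_{floor(nt)+1}/n.
    N_n is piecewise linear, so it suffices to scan |S_k|/n at the
    breakpoints k <= floor(nT) + 1."""
    m = int(n * T) + 1
    X = rng.exponential(1.0, size=m) - 1.0   # i.i.d., mean zero
    S = np.cumsum(X)
    return float(np.abs(S).max() / n)

for n in (10**2, 10**4, 10**6):
    print(n, fslln_sup(n))   # sup norm shrinks toward 0 as n grows
```

The printed supremum norm decays toward zero, as Theorem 3 asserts.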
3 Wiener measure

The FSLLN was a simpler functional limit theorem. Here we consider instead a Gaussian scaling of the random walk S_n and establish the existence of the Wiener measure (Brownian motion) as well as the FCLT. Thus suppose we have a sequence of i.i.d. random variables X_1, . . . , X_n with mean zero and finite variance σ^2 < ∞. Instead of the function (1) consider the following function:

N_n(t, ω) = S_⌊nt⌋(ω)/(σ√n) + (nt − ⌊nt⌋) X_{⌊nt⌋+1}(ω)/(σ√n),   n ≥ 1, t ∈ [0, T].   (2)

This is again a piecewise linear continuous function. Then for each n we obtain a mapping

ψ_n : R^∞ → C[0, T].

Of course, for each n, the mapping ψ_n depends only on the first ⌊nT⌋ + 1 coordinates of the samples in R^∞.
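For concreteness, here is a small sketch (ours, assuming NumPy) of how ψ_n acts on a finite sample: it evaluates (2) on a grid of time points. The function name donsker_path is our own.

```python
import numpy as np

def donsker_path(X, n, t_grid, sigma):
    """Evaluate N_n(t) = S_{floor(nt)}/(sigma*sqrt(n))
       + (nt - floor(nt)) * X_{floor(nt)+1}/(sigma*sqrt(n))
    at the points of t_grid, per definition (2).
    X must contain at least floor(n * max(t_grid)) + 1 increments."""
    S = np.concatenate(([0.0], np.cumsum(X)))   # S[k] = X_1 + ... + X_k
    k = np.floor(n * t_grid).astype(int)
    frac = n * t_grid - k
    return (S[k] + frac * X[k]) / (sigma * np.sqrt(n))

rng = np.random.default_rng(7)
n = 1000
X = rng.normal(size=n + 1)                  # sigma = 1 increments
t_grid = np.linspace(0.0, 1.0, 201)
path = donsker_path(X, n, t_grid, sigma=1.0)
print(path[0])                              # N_n(0) = 0.0
```

The returned array is the continuous piecewise linear path sampled on the grid; refining the grid recovers N_n exactly, since N_n is linear between the breakpoints k/n.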
Lemma 1. Each mapping ψ_n is measurable.
Proof. Here is where it helps to know that the Kolmogorov σ-field is identical to the Borel σ-field on C[0, T], that is, Theorem 1.4 from the previous lecture. Indeed, it now suffices to show that ψ_n^{−1}(A) is measurable for each set A of the form A = π_t^{−1}((−∞, y]), as these sets generate the Kolmogorov/Borel σ-field. Each set ψ_n^{−1}(π_t^{−1}((−∞, y])) is the set of all realizations ω such that

N_n(t, ω) = (Σ_{1≤k≤m} X_k(ω))/(σ√n) + (nt − m) X_{m+1}(ω)/(σ√n) ≤ y,

where m = ⌊nt⌋. This defines a measurable subset of R^{m+1} × R^∞. One way to see this is to observe that the function f : R^{m+1} → R defined by f(x_1, . . . , x_{m+1}) = (Σ_{1≤k≤m} x_k)/(σ√n) + (nt − m) x_{m+1}/(σ√n) is continuous and therefore measurable. We conclude that ψ_n is measurable for each n.

Thus each ψ_n induces a probability measure on C[0, T], which we denote by P_n. This probability measure is defined by

P_n(A) = P_{i.i.d.}(ψ_n^{−1}(A)) = P_{i.i.d.}(N_n(ω) ∈ A).

We now establish the principal result of this lecture: the existence of the Wiener measure, namely, the existence of a Brownian motion.
Theorem 4 (Existence of Wiener measure). The sequence of measures P_n has a weak limit P* which satisfies the property of the Wiener measure on C[0, T].
The proof of this fact is quite involved and we give only its scheme, skipping some technical results. First let us outline the main steps in the proof. In the previous lecture we considered the projection mappings π_t : C[0, T] → R. Similarly, for any collection 0 ≤ t_1 < · · · < t_k we can consider π_{t_1,...,t_k}(x) = (x(t_1), . . . , x(t_k)) ∈ R^k.
1. We first show that the sequence of measures P_n on C[0, T] is tight. We use this to argue that there exists a subsequence P_{n(k)} which converges to some measure P*.
2. We show that P* satisfies the properties of the Wiener measure. For this purpose we look at the projected measures P* π_{t_1,...,t_k}^{−1} on R^k and show that these give a joint Gaussian distribution, the kind arising in a Brownian motion (that is, the joint distribution of (B(t_1), B(t_2), . . . , B(t_k))). At this point the existence of the Wiener measure is established.
3. We then show that in fact the full weak convergence P_n ⇒ P* holds.
Proof sketch. We begin with the following technical and quite delicate result about random walks.

Lemma 2. The following identity holds for random walks S_n = X_1 + · · · + X_n:

lim_{λ→∞} lim sup_n λ^2 P(max_{k≤n} |S_k| ≥ λσ√n) = 0.   (3)
Note that this is indeed a very subtle result. We could try to use the submartingale inequality, since |S_k| is a submartingale. It will give

P(max_{k≤n} |S_k| ≥ λσ√n) ≤ E[S_n^2]/(λ^2 σ^2 n) = 1/λ^2.

So after multiplying by λ^2 we do not obtain convergence to zero. On the other hand, note that if the random variables have a finite fourth moment E[X_n^4] < ∞, then the result follows from the submartingale inequality by considering S_n^4 in place of S_n^2 (exercise). The proof of this lemma is based on the following fact:
Proposition 2 (Etemadi's Inequality, [1]). For every α > 0,

P(max_{k≤n} |S_k| ≥ 3α) ≤ 3 max_{k≤n} P(|S_k| ≥ α).

Proof. Let B_k be the event {|S_k| ≥ 3α, |S_j| < 3α, j < k}. Then

P(max_{k≤n} |S_k| ≥ 3α) ≤ P(|S_n| ≥ α) + Σ_{k≤n} P(B_k ∩ {|S_n| < α})
≤ P(|S_n| ≥ α) + Σ_{k≤n} P(B_k ∩ {|S_n − S_k| > 2α})
= P(|S_n| ≥ α) + Σ_{k≤n} P(B_k) P(|S_n − S_k| > 2α)
≤ P(|S_n| ≥ α) + max_{k≤n} P(|S_n − S_k| ≥ 2α)
≤ P(|S_n| ≥ α) + max_{k≤n} ( P(|S_n| ≥ α) + P(|S_k| ≥ α) )
≤ 3 max_{k≤n} P(|S_k| ≥ α).
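Etemadi's inequality lends itself to a quick empirical check. The following Monte Carlo sketch is ours (not part of the notes) and assumes NumPy; Gaussian steps are chosen purely for convenience.

```python
import numpy as np

rng = np.random.default_rng(1)

# Empirical comparison of the two sides of Etemadi's inequality:
# P(max_{k<=n} |S_k| >= 3a)  <=  3 max_{k<=n} P(|S_k| >= a)
n, trials, a = 50, 20000, 3.0
X = rng.normal(size=(trials, n))
S = np.cumsum(X, axis=1)                       # S[:, k-1] = S_k per trial
lhs = (np.abs(S).max(axis=1) >= 3 * a).mean()  # empirical left-hand side
rhs = 3 * (np.abs(S) >= a).mean(axis=0).max()  # empirical right-hand side
print(lhs, rhs)                                # lhs should not exceed rhs
```

The right-hand side is typically far from tight (it may even exceed 1), but the ordering holds up to Monte Carlo noise.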

Now we can prove Lemma 2.
Proof of Lemma 2. Applying Etemadi's Inequality,

P(max_{k≤n} |S_k| ≥ λσ√n) ≤ 3 max_{k≤n} P(|S_k| ≥ (1/3)λσ√n).

Fix ε > 0. Let Φ denote the cumulative standard normal distribution function. Find λ_0 large enough so that 2λ^2(1 − Φ(λ/3)) < ε/3 for all λ ≥ λ_0. Fix any such λ. By the CLT we can find n_0 = n_0(λ) large enough so that

P(|S_n| ≥ (1/3)λσ√n) ≤ 2(1 − Φ(λ/3)) + ε/(3λ^2)

for all n ≥ n_0, implying λ^2 P(|S_n| ≥ (1/3)λσ√n) ≤ 2ε/3.
Now fix any n ≥ 27n_0/ε and any k ≤ n. If k ≥ n_0, then from the derived bound we have

λ^2 P(|S_k| ≥ (1/3)λσ√n) ≤ λ^2 P(|S_k| ≥ (1/3)λσ√k) ≤ 2ε/3.

On the other hand, if k ≤ n_0 then

λ^2 P(|S_k| ≥ (1/3)λσ√n) ≤ λ^2 E[S_k^2]/((1/9)λ^2 σ^2 n) = 9k/n ≤ 9n_0/n ≤ ε/3.

We conclude that for all n ≥ 27n_0/ε,

λ^2 max_{1≤k≤n} P(|S_k| ≥ (1/3)λσ√n) ≤ 2ε/3,

from which we obtain

λ^2 lim sup_n P(max_{k≤n} |S_k| ≥ λσ√n) ≤ 2ε.

Since ε > 0 was arbitrary, we obtain the result.


The next result, which we also do not prove, says that the property (3) implies tightness of the sequence of measures P_n on C[0, T].
Lemma 3. The following convergence holds for every ε > 0:

lim_{δ→0} lim sup_n P(w_{N_n}(δ) ≥ ε) = 0.   (4)

As a result the sequence of measures P_n is tight.
Proof. Observe that

w_{N_n}(δ) ≤ max_{i≤j≤⌊nT⌋+1 : j−i≤⌈nδ⌉} (1/(σ√n)) |Σ_{i<k≤j} X_k|.

Exercise 1. Use this to finish the proof of the lemma. Hint: partition the interval [0, T] into intervals of length δ and use Lemma 2.

Let us now see how Lemma 3 implies tightness. We use the characterization given by Proposition 1. First, for any positive a, P_n(|x(0)| > a) = P(|N_n(0)| > a) = 0, since N_n(0) = 0. Now P_n(w_x(δ) > ε) = P(w_{N_n}(δ) > ε). From the first part of the lemma we know that the double convergence (4) holds. This means that condition (ii) of Proposition 1 holds as well.
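The modulus of continuity w_x(δ) is easy to compute on a sampled path. The following sketch is ours (not part of the notes) and assumes NumPy 1.20+ for sliding_window_view; the function name modulus is our own.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def modulus(x, delta, dt):
    """Modulus of continuity w_x(delta) = sup_{|s-t| <= delta} |x(s) - x(t)|
    for a path sampled on a uniform grid of spacing dt."""
    h = max(int(round(delta / dt)), 1) + 1        # grid points per delta-window
    win = sliding_window_view(x, min(h, len(x)))  # all windows of that width
    return float((win.max(axis=1) - win.min(axis=1)).max())

rng = np.random.default_rng(6)
n = 10**4
# N_n on the grid t = k/n (sigma = 1), as in definition (2)
path = np.concatenate(([0.0], np.cumsum(rng.normal(size=n)))) / np.sqrt(n)
for delta in (0.1, 0.01, 0.001):
    print(delta, modulus(path, delta, 1.0 / n))   # shrinks as delta -> 0
```

For a fixed large n the printed values decrease with δ, which is exactly the behavior that (4) quantifies uniformly in n.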
We now return to the construction of the Wiener measure. Lemma 3 implies that the sequence of probability measures P_n on C[0, T] is tight. Therefore, by Prohorov's Theorem, it contains a weakly convergent subsequence P_{n(k)} ⇒ P*.
Proposition 3. P* satisfies the property of the Wiener measure.
Proof. Since P* is defined on the space of continuous functions, the continuity of every sample is immediate. We need to establish the independence of the increments and the fact that the increments are stationary Gaussian. Thus we fix 0 ≤ t_1 < · · · < t_k and (y_1, . . . , y_k) ∈ R^k. To preserve the continuity of notation, we still denote elements of C[0, T] by x, x(t) or x(ω, t), whereas before we used the notations ω, B(ω), B(t, ω).
Consider the random vector π_{t_1}(N_n) = N_n(t_1). This is simply the random variable

S_{⌊nt_1⌋}(ω)/(σ√n) + (nt_1 − ⌊nt_1⌋) X_{⌊nt_1⌋+1}(ω)/(σ√n).

The second term in the sum converges to zero in probability. The first term we rewrite as

(S_{⌊nt_1⌋}(ω)/(σ√⌊nt_1⌋)) · (√⌊nt_1⌋/√n),

and by the CLT it converges to a normal N(0, t_1) distribution. Similarly, consider

N_n(t_2) − N_n(t_1) = (Σ_{nt_1<m≤nt_2} X_m(ω))/(σ√n) + (nt_2 − ⌊nt_2⌋) X_{⌊nt_2⌋+1}(ω)/(σ√n) − (nt_1 − ⌊nt_1⌋) X_{⌊nt_1⌋+1}(ω)/(σ√n).

Again by the CLT we see that it converges to a normal N(0, t_2 − t_1) distribution. Moreover, the joint distribution of (N_n(t_1), N_n(t_2) − N_n(t_1)) converges to the joint distribution of two independent normals with zero mean and variances t_1, t_2 − t_1. Namely, (N_n(t_1), N_n(t_2)) converges in distribution to (Z_1, Z_1 + Z_2), where Z_1, Z_2 are independent normal random variables with zero mean and variances t_1, t_2 − t_1.
By a similar token, we see that the random vector π_{t_1,...,t_k}(N_n) = (N_n(t_1), N_n(t_2), . . . , N_n(t_k)) converges in distribution to (Z_1, Z_1 + Z_2, . . . , Z_1 + · · · + Z_k), where Z_j, 1 ≤ j ≤ k, are independent zero mean normal random variables with variances t_1, t_2 − t_1, . . . , t_k − t_{k−1}.
On the other hand, the distribution of π_{t_1,...,t_k}(N_n) is

P_n π_{t_1,...,t_k}^{−1} = P_{i.i.d.} ψ_n^{−1} π_{t_1,...,t_k}^{−1}.

Since P_{n(k)} ⇒ P* and π_{t_1,...,t_k} is continuous, then, applying the Mapping Theorem (Theorem 1), we conclude that P_{n(k)} π_{t_1,...,t_k}^{−1} ⇒ P* π_{t_1,...,t_k}^{−1}. Combining these two facts, we conclude that the probability measure P* π_{t_1,...,t_k}^{−1} is the probability measure of (Z_1, Z_1 + Z_2, . . . , Z_1 + · · · + Z_k) (where again the Z_j are independent normal, etc.). What does this mean? This means that when we select x ∈ C[0, T] according to the probability measure P* and look at its projection π_{t_1,...,t_k}(x) = (x(t_1), . . . , x(t_k)), the probability distribution of this random vector is the distribution of (Z_1, Z_1 + Z_2, . . . , Z_1 + · · · + Z_k). This means that x has independent increments with zero mean Gaussian distribution and variances t_1, t_2 − t_1, . . . , t_k − t_{k−1}. This is precisely the property we needed to establish for P* in order to argue that it is indeed the Wiener measure.
This concludes the proof of Proposition 3, the fact that P* is the Wiener measure.
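The finite-dimensional convergence just established can be observed numerically. This sketch is ours (not part of the notes) and assumes NumPy; it deliberately uses non-Gaussian increments, uniform on [−√3, √3] so that σ = 1, and checks the variances and independence of the increments of N_n.

```python
import numpy as np

rng = np.random.default_rng(2)

# Marginals used in Proposition 3: N_n(t1) ~ N(0, t1),
# N_n(t2) - N_n(t1) ~ N(0, t2 - t1), and the two are independent.
n, trials, sigma = 1000, 10000, 1.0
t1, t2 = 0.3, 0.8
k1, k2 = int(round(n * t1)), int(round(n * t2))

X = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(trials, n))  # mean 0, var 1
S = np.cumsum(X, axis=1)
Nt1 = S[:, k1 - 1] / (sigma * np.sqrt(n))
Nt2 = S[:, k2 - 1] / (sigma * np.sqrt(n))

print(np.var(Nt1))                    # ~ t1 = 0.3
print(np.var(Nt2 - Nt1))              # ~ t2 - t1 = 0.5
print(np.cov(Nt1, Nt2 - Nt1)[0, 1])   # ~ 0: independent increments
```

Even though the step distribution is uniform, the Gaussian variances t_1 and t_2 − t_1 and the vanishing covariance emerge, exactly as the CLT argument predicts.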
We also need to show that the convergence P_n ⇒ P* holds. For this purpose we will show that P* is unique; in this case the convergence holds. Indeed, suppose otherwise: there exists a subsequence n(k) such that P_{n(k)} does not converge weakly to P*. Then we can find a bounded continuous r.v. X such that E_{P_{n(k)}}[X] does not converge to E_{P*}[X]. Then we can find ε_0 > 0 and a subsequence n(k_i) of n(k) such that |E_{P_{n(k_i)}}[X] − E_{P*}[X]| ≥ ε_0 for all i. By tightness we can find a further subsequence n(k_{i_j}) of n(k_i) which converges weakly to some limiting probability measure P̃. But we have seen that every such weak limit has to be a Wiener measure, which is unique. Namely P̃ = P*. This is a contradiction, since |E_{P_{n(k_{i_j})}}[X] − E_{P*}[X]| ≥ ε_0.

It remains to show the uniqueness of the Wiener measure. This follows again from the fact that the Kolmogorov σ-field coincides with the Borel σ-field on C[0, T]. But the properties of the Wiener measure (independent Gaussian increments with variances given by the lengths of the time increments) uniquely define the probability on the generating sets obtained via the projections π_{t_1,...,t_k} : C[0, T] → R^k. This concludes the proof of uniqueness.

4 Applications
Theorem 4 has applications beyond the existence of Wiener measure. Here is
one of them.
Theorem 5. The following convergence holds:

max_{1≤k≤n} S_k/(σ√n) ⇒ sup_{0≤t≤1} B(t),   (5)

where B is the standard Brownian motion. As a result, for every y > 0,

lim_n P(max_{1≤k≤n} S_k/(σ√n) ≥ y) = 2(1 − Φ(y)),   (6)

where Φ is the standard normal distribution function.

Proof. Take T = 1. The function g(x) = sup_{0≤t≤1} x(t) is a continuous function on C[0, 1] (check this). Since by Theorem 4 P_n ⇒ P*, then, by the Mapping Theorem, g(N_n) ⇒ g(B), where B is a standard Brownian motion random sample from the Wiener measure P*. But g(N_n) = sup_{0≤t≤1} N_n(t) = max_{1≤k≤n} S_k/(σ√n). We conclude that (5) holds. To prove the second part we note that the set A = {x ∈ C[0, 1] : sup_{0≤t≤1} x(t) = y} has P*(A) = 0; recall that the maximum sup_{0≤t≤1} B(t) of a Brownian motion has a density. Also note that A is the boundary of the set {x : sup_{0≤t≤1} x(t) ≥ y}. Therefore, by the Portmanteau theorem, (6) holds.
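Theorem 5 lends itself to a direct Monte Carlo check. The sketch below is ours (not part of the notes) and assumes NumPy; Φ is computed through math.erf, and a small discretization bias of order 1/√n is expected.

```python
import math
import numpy as np

rng = np.random.default_rng(3)

# Empirical check of (5)-(6): the running maximum of a random walk,
# scaled by sigma*sqrt(n), behaves like sup of Brownian motion on [0, 1].
n, trials, y, sigma = 2000, 5000, 1.0, 1.0
S = np.cumsum(rng.normal(size=(trials, n)), axis=1)
emp = float((S.max(axis=1) / (sigma * math.sqrt(n)) >= y).mean())

Phi = lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))  # standard normal CDF
exact = 2.0 * (1.0 - Phi(y))    # limiting value 2(1 - Phi(1)), about 0.317
print(emp, exact)
```

The empirical frequency sits slightly below the limit (the discrete walk misses part of the continuous supremum) and approaches it as n grows.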

5 Additional reading materials


Billingsley [1] Chapter 2, Section 8.
References
[1] P. Billingsley, Convergence of probability measures, Wiley-Interscience
publication, 1999.



MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070J
Lecture 22

Fall 2013
12/09/2013

Skorohod Mapping Theorem. Reflected Brownian Motion

Content.
1. G/G/1 queueing system
2. One dimensional reflection mapping
3. Reflected Brownian Motion (RBM)

1 Technical preliminaries

We use the following technical facts. If x is Lipschitz continuous, then it has a derivative everywhere except on a measure-zero set of points of [0, ∞), and x(t) = ∫_0^t x'(u) du. Note, however, that if x is a realization of a Brownian motion, then it is nowhere differentiable and we cannot write this integral representation. Also, if x is continuous and non-decreasing, then it has a derivative almost everywhere, and if in addition x is absolutely continuous (as will be the case below), then again x(t) = ∫_0^t x'(u) du.
2 G/G/1 queueing system

We consider the following G/G/1 queueing system consisting of a single server which works at unit speed. The G/G/1 notation stands for general arrival / general service times / single server type queueing system (as opposed to, for example, M/M/1, which stands for Poisson arrival / exponential service time / single server type systems). Initially we assume there are Q(0) jobs in the system. There is an arrival process into the system: the times between successive arrivals form an i.i.d. sequence u_n ≥ 0, n ≥ 1, of random variables. We let λ = 1/E[u_1] denote the arrival rate. Namely, 1/λ is the mean interarrival time. We let U(n) = Σ_{1≤i≤n} u_i and let

A(t) = sup{n : U(n) ≤ t}

denote the counting process describing the arrival process.

The time it takes to process the i-th job arriving to the server (including the first Q(0) jobs) is also assumed to be an i.i.d. sequence v_n ≥ 0, n ≥ 1, of random variables with service rate μ = 1/E[v_1]. Also let V(n) = Σ_{1≤i≤n} v_i and let

S(t) = sup{n : V(n) ≤ t}

denote the counting process describing the service process.
Denote by D(t) the cumulative departure process from the queueing system, i.e., the number of jobs that departed up to time t. If the server was working all the time, then we would have D(t) = S(t), but occasionally the server runs out of jobs. So let B(t) denote the total cumulative time that the server was working on jobs during [0, t]. Naturally B ≥ 0, B is non-decreasing and B(t) ≤ t. Then D(t) = S(B(t)). Note that we can also write

B(t) = ∫_0^t 1{Q(s) > 0} ds.

The jobs which have not been processed yet form a queue. Denote by Q(t) the length of the queue at time t. Then we naturally obtain

Q(t) = Q(0) + A(t) − D(t) = Q(0) + A(t) − S(B(t)).   (1)

(1)

In addition we consider the workload process Z(t): this represents the amount of time it takes to process all the jobs present in the system at time t (the queue plus possibly a job being served), assuming no further jobs arrive after time t. Loosely speaking, this is the time it takes to clear the system after we shut the door. Note that

Z(t) = V(Q(0) + A(t)) − B(t)   (2)

(convince yourself that this is indeed the case).
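Sampled at arrival epochs, the workload obeys the classical Lindley recursion W_{k+1} = (W_k + v_k − u_{k+1})^+, a standard consequence of the dynamics above. The simulation sketch below is ours (not part of the notes) and assumes NumPy; exponential interarrival and service times are an arbitrary concrete choice, used to illustrate the heavy traffic effect λ ≈ μ.

```python
import numpy as np

rng = np.random.default_rng(4)

def gg1_workload(lam, mu, n_jobs):
    """Workload seen by successive arrivals in a G/G/1 queue via the
    Lindley recursion W_{k+1} = (W_k + v_k - u_{k+1})^+, starting empty."""
    u = rng.exponential(1.0 / lam, size=n_jobs)   # interarrival times
    v = rng.exponential(1.0 / mu, size=n_jobs)    # service times
    W = np.zeros(n_jobs)
    for k in range(n_jobs - 1):
        W[k + 1] = max(W[k] + v[k] - u[k + 1], 0.0)
    return W

W = gg1_workload(lam=0.9, mu=1.0, n_jobs=200000)
print(W.mean())   # heavy traffic (lam near mu) => large typical workload
```

Rerunning with λ further below μ produces a much smaller mean workload, which is the qualitative phenomenon the heavy traffic approximation below quantifies.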


Finally, it is also useful to consider the cumulative idling process

I(t) = t − B(t) = ∫_0^t 1{Q(s) = 0} ds.   (3)

Recall our earlier notations: C[0, ∞) is the space of continuous functions on [0, ∞) and D[0, ∞) is the space of right-continuous functions with left limits on the same domain.
Problem 1. Show that Q, Z ∈ D[0, ∞) and B, I ∈ C[0, ∞).

The main question we ask for the G/G/1 system is to characterize the behavior of the queue length Q(t) and the workload Z(t) as a process or in steady state (t = ∞), when a steady state exists, assuming that we know the distribution of the interarrival and service times. Ideally, we would like to compute the distribution of Q(t), Z(t). But this is, with some exceptions, infeasible. Thus the focus will be on obtaining approximations. As we will see, we can get good approximations when the system is in the so-called heavy traffic regime. This is the regime when λ ≈ μ. Then the typically observed queue length will be very large and, as we will see, can be approximated by a certain reflected Brownian motion, for which both the time-dependent and the steady-state distribution are known.
3 One dimensional reflection (Skorohod) mapping

Introduce

X(t) = V(Q(0) + A(t)) − t.   (4)

Then X(t) represents the total work that arrived into the system up to time t, minus t. Then we can rewrite (2) as

Z(t) = X(t) + I(t) ≥ 0.   (5)

Observe that I(t) is a non-decreasing piecewise linear continuous process (it increases at rate 1 when there are no jobs in the system, and stays constant when there are jobs in the system). In particular,

dI(t) ≥ 0   (6)

whenever the derivative is defined. Moreover, since the cumulative idling time can increase only when there are no jobs in the system, then

∫_0^∞ Z(t) dI(t) = 0.   (7)

There is a good reason to write the equations (5), (6), (7): these equations turn out to define Z, I uniquely from X, and we now establish this fact in a more general setting. We say that x ∈ D[0, ∞) has no downward jumps if at every point of discontinuity t_0 we have lim_{t↑t_0} x(t) ≤ x(t_0).
Problem 2. Show that X has no downward jumps.
Theorem 1 (Reflection (Skorohod) Mapping Theorem). Given x ∈ D[0, ∞) with x(0) ≥ 0 and no downward jumps, there exists a unique pair y ∈ C[0, ∞), z ∈ D[0, ∞) such that

1. z(t) = x(t) + y(t) ≥ 0, t ∈ R_+;
2. dy(t) ≥ 0;
3. z(t) dy(t) = 0 for all t ∈ R_+.
This pair is uniquely defined by

y(t) = sup_{0≤s≤t} (−x(s))^+,   (8)
z(t) = x(t) + sup_{0≤s≤t} (−x(s))^+.   (9)

Moreover, the mappings φ : D[0, ∞) → C[0, ∞), ψ : D[0, ∞) → D[0, ∞) given by y = φ(x), z = ψ(x) are Lipschitz continuous with respect to the uniform norm ‖ · ‖_T for every T > 0. We call z the reflected process of x and y the regulator of x.
Intuitively, the role of the regulator process y is to provide just enough pushing to ensure that the modified process z = x + y remains non-negative, given a (possibly negative) process x.
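On a sampled path, formulas (8)-(9) amount to a running maximum of (−x)^+. A minimal sketch (ours, assuming NumPy):

```python
import numpy as np

def skorohod_reflection(x):
    """One-dimensional reflection map on a sampled path x[0..m]:
    y(t) = sup_{s<=t} (-x(s))^+  (the regulator, formula (8)),
    z(t) = x(t) + y(t)           (the reflected path, formula (9))."""
    y = np.maximum.accumulate(np.maximum(-x, 0.0))
    z = x + y
    return y, z

x = np.array([1.0, 0.5, -0.5, -1.0, 0.0, -2.0, 1.0])
y, z = skorohod_reflection(x)
print(y)   # [0.  0.  0.5 1.  1.  2.  2. ]
print(z)   # [1.  0.5 0.  0.  1.  0.  3. ]
```

Note how y is non-decreasing and increases only at instants where z hits 0, which is conditions 2 and 3 of the theorem in discrete form.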
Proof. We begin by verifying that the solutions given in the theorem are valid. Clearly z ≥ 0 and y is non-decreasing, implying dy ≥ 0. Let us show that y ∈ C[0, ∞). Fix t. If x is continuous at t, then the continuity of y at t is immediate, as the supremum of a continuous function is a continuous function. Suppose x is discontinuous at t, namely it has an upward jump. Let x^− = lim_{t'↑t} x(t'). Then x(t) > x^− and, since x is right-continuous, x(t') > x^− for all t' in some interval [t, t + δ]. This means y(t') = y(t) on [t, t + δ]. Also, since x^− is the left limit of x at t, for every ε > 0 there exists δ such that |x(t') − x^−| < ε for t' ∈ [t − δ, t]. This means |y(t') − y(t)| < ε for t' ∈ [t − δ, t], and the proof of continuity is complete. Finally, we check that z dy = 0. Suppose dy(t) > 0. Then we claim that x(t) = inf_{0≤s≤t} x(s) (the inf is achieved at t). Indeed, otherwise x(t) > inf_{0≤s≤t} x(s) and, since x can only have upward jumps, x(t + δ) > inf_{0≤s≤t} x(s) for all sufficiently small δ. As a result y(t + δ) = y(t) for all sufficiently small δ, contradicting dy(t) > 0. We conclude x(t) = inf_{0≤s≤t} x(s), which implies z(t) = 0 and z(t) dy(t) = 0.
We now establish the uniqueness of the solution. Suppose there exists an additional solution y' ∈ C[0, ∞), z' ∈ D[0, ∞) satisfying the properties of the reflection mapping for the same input process x. Note that z − z' = y − y' is a difference of two continuous non-decreasing processes. As a result it is differentiable almost everywhere and

(1/2)(z(t) − z'(t))^2 = ∫_0^t (z(s) − z'(s)) (d/ds)(z(s) − z'(s)) ds
= ∫_0^t (z(s) − z'(s)) ( (d/ds) y(s) − (d/ds) y'(s) ) ds.

But, by assumption, z dy = z' dy' = 0, and we obtain the expression

−∫_0^t ( z(s) (d/ds) y'(s) + z'(s) (d/ds) y(s) ) ds ≤ 0.

We conclude z(t) = z'(t). It is then immediate that y(t) = y'(t).
It remains to show that the mappings y = φ(x) and z = ψ(x) are Lipschitz continuous. That is, we need to show that for some constants C_1, C_2 > 0 and every pair x, x' ∈ D[0, T] we have

‖φ(x) − φ(x')‖_T ≤ C_1 ‖x − x'‖_T,    ‖ψ(x) − ψ(x')‖_T ≤ C_2 ‖x − x'‖_T.

Denote ‖x − x'‖_T by Δ ≥ 0. For any t ∈ [0, T] we have

φ(x)(t) − φ(x')(t) = sup_{0≤s≤t} (−x(s))^+ − sup_{0≤s≤t} (−x'(s))^+
≤ sup_{0≤s≤t} (−x'(s) + Δ)^+ − sup_{0≤s≤t} (−x'(s))^+
≤ sup_{0≤s≤t} (−x'(s))^+ + Δ − sup_{0≤s≤t} (−x'(s))^+
= Δ.

Similarly, we show φ(x')(t) − φ(x)(t) ≤ Δ. This proves the Lipschitz continuity of φ with constant C_1 = 1. The Lipschitz continuity of ψ then follows immediately with constant C_2 = 2.
The reflection mapping also satisfies the following important memoryless property: if we consider the process starting from some time t_0, its reflection is the same as if we had started at time 0.
Proposition 1. Given t_0 > 0, x ∈ D[0, ∞) and the reflection y = φ(x), z = ψ(x) of x, consider the modified process x̃(t) = z(t_0) + x(t_0 + t) − x(t_0), t ≥ 0. Then ψ(x̃)(t) = z(t_0 + t) and φ(x̃)(t) = y(t_0 + t) − y(t_0), t ≥ 0.
Problem 3. Establish this proposition.
4 Reflected Brownian motion

We will see later on that when the G/G/1 queueing system is in heavy traffic, the process X(t) defined in (4) is approximated well by a Brownian motion. For now assume that it is in fact a Brownian motion. Note that X(t) is a process whose distribution we know in principle, since it is directly linked to the arrival and service processes. Thus if we can find the reflection of X with respect to the Skorohod mapping, we obtain an approximation of the workload process Z(t).
Definition 1. A (one-dimensional) Reflected Brownian Motion (RBM) is the process Z = ψ(B) obtained by the Skorohod mapping (φ, ψ) when the input process is a Brownian motion B(t), B(0) ≥ 0. When B has drift μ and variance σ^2, we also write Z = RBM(μ, σ^2).
Knowing the forms (8), (9) of the reflection mapping allows us to say something about the distribution of the reflected process Z when the input process is a Brownian motion.
Theorem 2. A Reflected Brownian Motion Z(t) = RBM(μ, σ^2) converges in distribution to some limiting random variable Z(∞) iff μ < 0. In this case Z(∞) is exponentially distributed with parameter 2|μ|/σ^2.
It should not be surprising that we get a limiting exponential distribution when the drift is negative and no limiting distribution when the drift is non-negative. After all, we have established these facts for the maximum of a Brownian motion M(t) = sup_{0≤s≤t} B(s). We just make the appropriate adjustments.
Proof. First assume B(0) = 0. Observe that in this case sup_{0≤s≤t} (−B(s))^+ = sup_{0≤s≤t} (−B(s)). Then

P(Z(t) ≤ z) = P(B(t) + sup_{0≤s≤t} (−B(s)) ≤ z).

Consider the process B̃_t(u) = B(t) − B(t − u), 0 ≤ u ≤ t, and B̃_t(u) = B(u), u ≥ t. Observe that it is also a Brownian motion with drift μ and variance σ^2 (this is similar to a problem on the midterm and some of the homework problems). Therefore

P(B(t) + sup_{0≤s≤t} (−B(s)) ≤ z) = P( sup_{0≤s≤t} (B(t) − B(s)) ≤ z)
= P( sup_{0≤s≤t} B̃_t(s) ≤ z)
= P( sup_{0≤s≤t} B(s) ≤ z).
Now we just recall basic properties of a Brownian motion. When μ ≥ 0 we have sup_{s≥0} B(s) = ∞ with probability one. Therefore P(sup_{0≤s≤t} B(s) ≤ z) → 0 as t → ∞, and a limiting distribution does not exist.
When μ < 0, we know that P(sup_{s≥0} B(s) ≥ z) = exp(2μz/σ^2), thus

lim_t P( sup_{0≤s≤t} B(s) ≤ z) = 1 − exp(2μz/σ^2),

which is the distribution function of an exponential random variable with parameter 2|μ|/σ^2.


Suppose now B(0) = w > 0. We let τ = inf{t : B(t) = 0}, including the possibility τ = ∞. When μ ≤ 0 we know that τ < ∞ a.s. Then from the strong Markov property of a Brownian motion we know that B(τ + t) is a Brownian motion starting at the origin. Applying Proposition 1, Z(τ + t) is the corresponding reflected Brownian motion, implying

lim_t P(Z(t) ≤ z) = lim_t P(Z(τ + t) ≤ z),

which is zero when μ = 0 and 1 − exp(2μz/σ^2) when μ < 0.
Suppose now μ > 0. Then we simply use the fact Z(t) ≥ B(t), which follows from the non-negativity of Y(t), and lim_t P(B(t) ≥ z) = 1 when μ > 0, to conclude lim_t P(Z(t) ≥ z) = 1.
This concludes the proof.
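The statement of Theorem 2 can be checked by simulating the reflected path directly from formulas (8)-(9). The Monte Carlo sketch below is ours (not part of the notes) and assumes NumPy; B is approximated on an Euler grid, so a small discretization bias in the supremum is expected.

```python
import numpy as np

rng = np.random.default_rng(5)

# RBM(mu, sigma^2) with mu < 0, B(0) = 0:
# Z(t) = B(t) + sup_{s<=t} (-B(s))^+, compared with Exp(2|mu|/sigma^2).
mu, sigma, T, steps, paths = -1.0, 1.0, 10.0, 2000, 5000
dt = T / steps
dB = mu * dt + sigma * np.sqrt(dt) * rng.normal(size=(paths, steps))
B = np.cumsum(dB, axis=1)
Z = B[:, -1] + np.maximum((-B).max(axis=1), 0.0)   # reflected value at time T

# Exp(2|mu|/sigma^2) has mean sigma^2 / (2|mu|)
print(Z.mean(), sigma**2 / (2 * abs(mu)))
```

With μ = −1, σ = 1 the theoretical stationary mean is 0.5; the simulated mean comes in slightly below it because the discrete grid undershoots the continuous supremum.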
5 Additional reading materials


Chapter 6 from the book by Chen and Yao, Fundamentals of Queueing Networks [1].

References
[1] H. Chen and D. Yao, Fundamentals of queueing networks: Performance,
asymptotics and optimization, Springer-Verlag, 2001.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070
Problem Set 1

Fall 2013
due 9/16/2013

Problem 1. Consider the space C[0, T] of continuous functions x : [0, T] → R, endowed with the uniform metric ρ(x, y) = ‖x − y‖ = sup_{0≤t≤T} |x(t) − y(t)|. Construct an example of a closed bounded set K ⊂ C[0, T] which is not compact. (A set K ⊂ C[0, T] is bounded if there exists a large enough r such that K ⊂ B(0, r), where 0 is the function which is identically zero on [0, T].)

Problem 2. Given two metric spaces (S₁, ρ₁), (S₂, ρ₂), show that a function f : S₁ → S₂ is continuous if and only if for every open set O ⊂ S₂, f⁻¹(O) is an open subset of S₁.

Problem 3. Establish that the space C[0, T] is complete with respect to the uniform metric ‖x − y‖, and that the space D[0, T] is complete with respect to the Skorohod metric.

Problem 4. Problem 1 from Lecture 2. In addition to parts (a)-(c), construct an example of a random variable X with a finite mean and a number x₀ > E[X] such that I(x₀) < ∞, but I(x) = ∞ for all x > x₀. Here I is the Legendre transform associated with the random variable X.

Problem 5. Establish the following fact, which we have used in proving the upper bound part of Cramér's theorem for general closed sets F: given two strictly positive sequences xₙ, yₙ > 0, show that if lim supₙ (1/n) log xₙ ≤ I and lim supₙ (1/n) log yₙ ≤ I, then lim supₙ (1/n) log(xₙ + yₙ) ≤ I.
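The mechanism behind this fact is simply xₙ + yₙ ≤ 2 max(xₙ, yₙ), so (1/n) log(xₙ + yₙ) exceeds the larger of the two normalized logarithms by at most (log 2)/n. A small numerical illustration (the two sequences below are arbitrary choices, specified through their logarithms to avoid overflow):

```python
import math

def log_sum(log_x, log_y):
    """Stable log(x + y) computed from log x and log y."""
    m = max(log_x, log_y)
    return m + math.log(math.exp(log_x - m) + math.exp(log_y - m))

I = 0.5
# log x_n = I*n (so (1/n) log x_n = I exactly); log y_n = I*n - sqrt(n)
ratios = [log_sum(I * n, I * n - math.sqrt(n)) / n for n in range(1, 5001)]
# (1/n) log(x_n + y_n) never exceeds I by more than (log 2)/n,
# and for large n it is essentially back at I
worst_excess = max(r - I for r in ratios)  # attained at small n
limsup_gap = max(ratios[4000:]) - I        # tail behaviour
```

The excess over I is largest at n = 1 and vanishes along the tail, matching the lim sup statement.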
Problem 6. Suppose M(θ) < ∞ for all θ. Show that I(x) is a strictly convex function.
Hint. Give a direct proof of convexity of I and see where the inequality may turn into an equality. You may use the following fact which we have established in class: for every x there exists θ₀ such that x = M'(θ₀)/M(θ₀).


MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070
Problem Set 2

Fall 2013
due 9/30/2013

Problem 1. The following identity was used in the proof of Theorem 1 in Lecture 4: sup{θ > 0 : M(θ) < exp(Cθ)} = inf_{t>0} t I(C + 1/t) (see the proof for details). Establish this identity.
Hint: Establish the convexity of log M(θ) in the region where M(θ) is finite. Letting θ* = sup{θ > 0 : M(θ) < exp(Cθ)}, use the convexity property above to argue that log M(θ) ≥ Cθ* + (d log M(θ*)/dθ)(θ − θ*). Use this property to finish the proof of the identity.

Problem 2. This problem concerns the rate of convergence to the limits for the large deviations bounds. Namely, how quickly does (1/n) log P(n⁻¹Sₙ ∈ A) converge to −inf_{x∈A} I(x), where Sₙ is the sum of n i.i.d. random variables? Of course the question is relevant only to the cases when this convergence takes place.
(a) Let Sₙ be the sum of n i.i.d. random variables Xᵢ, 1 ≤ i ≤ n, taking values in R. Suppose the moment generating function M(θ) = E[exp(θX)] is finite everywhere. Let a ≥ E[X]. Recall that we have established in class that in this case the convergence limₙ (1/n) log P(n⁻¹Sₙ ≥ a) = −I(a) takes place. Show that in fact there exists a constant C such that

|(1/n) log P(n⁻¹Sₙ ≥ a) + I(a)| ≤ C/n,

for all sufficiently large n. Namely, the rate of convergence is at least as fast as O(1/n).
Hint: Examine the lower bound part of Cramér's Theorem.
(b) Show that the rate O(1/n) cannot be improved.
Hint: Consider the case a = E[X].

Problem 3. Let Xₙ be an i.i.d. sequence with a Gaussian distribution N(0, 1) and let Yₙ be an i.i.d. sequence with a uniform distribution on [−1, 1]. Prove that the limit

lim_{n→∞} (1/n) log P( ((1/n) Σ_{1≤i≤n} Xᵢ)² + ((1/n) Σ_{1≤i≤n} Yᵢ)² ≥ 1 )

exists and compute it numerically. You may use MATLAB (or any other software of your choice), and an approximate numeric answer is acceptable.

Problem 4. Let Zₙ be the set of all length-n sequences of 0s and 1s such that (a) a 1 is always followed by a 0 (so, for example, 001101 is not a legitimate sequence, but 001010 is); (b) the percentage of 0s in the sequence is at least 70%. (Namely, if Xₙ(z) is the number of zeros in a sequence z ∈ Zₙ, then Xₙ(z)/n ≥ 0.7 for every z ∈ Zₙ.) Assuming that the limit limₙ (1/n) log |Zₙ| exists, compute it. You may use MATLAB (or any other software of your choice) to assist you with computations, and an approximate numerical answer is acceptable. You need to provide details of your reasoning.
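For the numeric part, |Zₙ| can be counted exactly by dynamic programming over (last bit, number of zeros), and the empirical growth rate (1/n) log |Zₙ| can be compared with a candidate limit. A Python sketch (the comparison value log((1+√5)/2) ≈ 0.4812 is the answer suggested by the Markov chain argument in the solutions; n = 400 is an arbitrary choice):

```python
import math

def count_Zn(n, frac=0.7):
    """Count length-n 0/1 sequences with no '11' substring and at least
    ceil(frac*n) zeros, via dynamic programming over (last bit, #zeros)."""
    dp = {(0, 1): 1, (1, 0): 1}  # length-1 prefixes: "0" and "1"
    for _ in range(n - 1):
        new = {}
        for (last, zeros), c in dp.items():
            key0 = (0, zeros + 1)            # appending 0 is always allowed
            new[key0] = new.get(key0, 0) + c
            if last == 0:                    # a 1 may only follow a 0
                key1 = (1, zeros)
                new[key1] = new.get(key1, 0) + c
        dp = new
    need = math.ceil(frac * n)
    return sum(c for (last, zeros), c in dp.items() if zeros >= need)

n = 400
rate = math.log(count_Zn(n)) / n
golden = (1 + math.sqrt(5)) / 2  # candidate: rate -> log(golden) ~ 0.4812
```

The exact-counting check agrees with the asymptotic rate to within finite-n corrections.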

Problem 5. Problem 1 from Lecture 6.

Problem 6. Problem 2 from Lecture 6.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070
Problem Set 3

Fall 2013
due 10/23/2013

Problem 1. Let B be a standard Brownian motion. Show that P(lim sup_{t→∞} B(t) = ∞) = 1.
Problem 2. (a) Consider the following sequence of partitions Πₙ, n = 1, 2, ..., of [0, T] given by tᵢ = iT/n, 0 ≤ i ≤ n. Prove that the quadratic variation of a standard Brownian motion almost surely converges to T: limₙ Q(Πₙ, B) = T a.s., even though Σₙ Δ(Πₙ) = Σₙ T/n = ∞.
(b) Suppose now the partition is generated by drawing n independent random values tₖ = Uₖ, 1 ≤ k ≤ n, uniformly from [0, T] and independently from the Brownian motion. Prove that limₙ Q(Πₙ, B) = T a.s. Note, almost sure is with respect to the probability space of both the Brownian motion and the uniform sampling.
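For part (a), the convergence can be illustrated numerically: over the uniform partition the quadratic variation is a sum of n i.i.d. squared Gaussian increments, each with mean T/n. A Python sketch (n, T and the seed are arbitrary choices):

```python
import random, math

def quadratic_variation(n, T=1.0, seed=7):
    """Quadratic variation of a simulated Brownian path over the uniform
    partition t_i = iT/n: the sum of squared increments."""
    rng = random.Random(seed)
    sd = math.sqrt(T / n)  # increments B(t_{i+1}) - B(t_i) are N(0, T/n)
    return sum(rng.gauss(0.0, sd) ** 2 for _ in range(n))

T = 2.0
qv = quadratic_variation(200000, T=T)  # should be close to T
```

The sum concentrates around T with standard deviation of order T·sqrt(2/n).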
Problem 3. Suppose an integrable random variable X is independent from a σ-field G ⊂ F. Namely, for every measurable A ⊂ R and B ∈ G, P({X ∈ A} ∩ B) = P(X ∈ A)P(B). Prove that E[X|G] = E[X].
Problem 4. Consider an asymmetric simple random walk Q(t) on Z given by P(Q(t + 1) = x + 1 | Q(t) = x) = p and P(Q(t + 1) = x − 1 | Q(t) = x) = 1 − p for some 0 < p < 1.
1. Construct a function of the state φ(x), x ∈ Z, such that φ(Q(t)) is a martingale.
2. Suppose Q(0) = z > 0 and p > 1/2. Compute the probability that the random walk never hits 0, in terms of z and p.
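Both parts can be sanity-checked by simulation: the answer obtainable from the martingale φ(x) = ((1−p)/p)ˣ together with optional stopping is that the walk started at z never hits 0 with probability 1 − ((1−p)/p)ᶻ. A Monte Carlo sketch in Python (trial count, horizon and seed are arbitrary; with upward drift, surviving a long finite horizon is a proxy for "never"):

```python
import random

def survive_prob(z, p, trials=5000, horizon=500, seed=3):
    """Monte Carlo estimate of P(walk started at z never hits 0)."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        x = z
        for _ in range(horizon):
            x += 1 if rng.random() < p else -1
            if x == 0:
                break
        else:  # never hit 0 within the horizon
            survived += 1
    return survived / trials

z, p = 3, 0.6
estimate = survive_prob(z, p)
exact = 1 - ((1 - p) / p) ** z  # candidate answer via phi(x) = ((1-p)/p)^x
```

With z = 3, p = 0.6 the candidate value is 1 − (2/3)³ ≈ 0.704, and the simulated frequency should be close to it.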
Problem 5. On a probability space (Ω, F, P) consider a sequence of random variables X₁, X₂, ..., Xₙ and σ-fields F₁ ⊂ ... ⊂ Fₙ ⊂ F such that E[Xⱼ | F_{j−1}] = X_{j−1} and E[Xⱼ²] < ∞.
1. Prove directly (without using Jensen's inequality) that E[Xⱼ²] ≥ E[X²_{j−1}] for all j = 2, ..., n. Hint: consider (Xⱼ − X_{j−1})².
2. Suppose Xₙ = X₁ almost surely. Prove that in this case X₁ = ... = Xₙ almost surely.

Problem 6. The purpose of this exercise is to extend some of the stopping time theory to processes which are (semi)-continuous. Suppose Xₜ is a continuous-time submartingale adapted to Fₜ, t ∈ R₊, and T is a stopping time taking values in R₊ ∪ {∞}. Suppose additionally that Xₜ is a.s. a right-continuous function with left limits (RCLL).
(a) Suppose there exists a countably infinite strictly increasing sequence tₙ ∈ R₊, n ≥ 0, such that P(T ∈ {tₙ, n ≥ 0} ∪ {∞}) = 1. Emulate the proof for discrete time processes to show that X_{t∧T}, t ∈ R₊, is a submartingale.
(b) Given a general stopping time T taking values in R₊ ∪ {∞}, consider a sequence of r.v. Tₙ defined by Tₙ(ω) = k/2ⁿ, k = 1, 2, ..., if T(ω) ∈ [(k − 1)/2ⁿ, k/2ⁿ), and Tₙ(ω) = ∞ if T(ω) = ∞. Establish that Tₙ is a stopping time for every n.
(c) Suppose the submartingale Xₜ is in L₂, namely E[Xₜ²] < ∞ for all t. Show that X_{T∧t} is a submartingale as well.
Hint: Use part (b), the Doob-Kolmogorov inequality and the Dominated Convergence Theorem.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070
Problem Set 4

Fall 2013
due 11/13/2013

Problem 1. Consider the longest increasing subsequence problem. Denoting by Lₙ the length of a longest increasing subsequence of a random permutation, as we have discussed in the lecture, it is known that the limit α = limₙ E[Lₙ]/√n exists. Namely, E[Lₙ] is asymptotically α√n + o(√n). We used Talagrand's concentration inequality to obtain concentration of Lₙ around its median mₙ. Show that in fact it must be the case that limₙ mₙ/√n = α. Namely, the median is also asymptotically α√n + o(√n).

Problem 2. Given a graph G with node set V and edge set E, a set of nodes I ⊂ V is called an independent set if there is no edge between any two nodes in I. Let Z be the total number of independent sets in G, which is at least one (since we assume that the empty set is an independent set) and at most 2^|V|. Let G = G(n, d/n) be the Erdős-Rényi graph and let Zₙ be the corresponding (random) number of independent sets. Establish the following concentration inequality for log Zₙ around its expectation E[log Zₙ]:

P(|log Zₙ − E[log Zₙ]| ≥ t) ≤ 2 exp(−t²/(Cn))

for some constant C which does not depend on n.


Problem 3. Exercise 1 from Lecture 15.
Problem 4. Suppose Xⁿ ∈ L₂ is a sequence of processes converging to a process X ∈ L₂ in the sense E[∫₀ᵗ (Xⁿₛ − Xₛ)² ds] → 0 as n → ∞. Recall that as a part of the proof of Proposition 4 in Lecture 16 we needed to show that E[∫₀ᵗ (Xⁿₛ + Xₛ)² ds] is uniformly bounded, namely supₙ E[∫₀ᵗ (Xⁿₛ + Xₛ)² ds] < ∞. Establish this fact.
Problem 5. The goal of this exercise is to show that much of Ito calculus can be generalized to integration with respect to an arbitrary continuous square integrable martingale. Let Mₜ ∈ M₂,c and Xₜ ∈ L₂.
(a) Define ∫₀ᵗ Xₛ dMₛ for simple processes. Show that the resulting process is a martingale and establish the Ito isometry for it.
(b) Given an arbitrary X ∈ L₂, show that if Xⁿ is a sequence of simple processes such that limₙ E[∫₀ᵗ (Xⁿₛ − Xₛ)² d⟨M⟩ₛ] = 0, then the sequence ∫₀ᵗ Xⁿₛ dMₛ is Cauchy for every t in the L₂ sense. Use this to define ∫₀ᵗ Xₛ dMₛ for any process X ∈ L₂ and establish the Ito isometry for the Ito integral. Here the integration d⟨M⟩ₛ is understood in the Stieltjes sense. You do not need to prove existence of processes Xⁿₜ satisfying the requirement above (unless you would like to).


MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070J
Homework 1 Solution

Fall 2013

Problem 1

Let fₙ(t) = (t/T)ⁿ for t ∈ [0, T]. Then fₙ ∈ C[0, T] for n = 1, 2, .... Let K = {fₙ, n = 1, 2, ...}. In order to prove K is closed, it suffices to prove that C[0, T]\K is open. For any f ∈ C[0, T]\K, assume infₙ ‖f − fₙ‖ = 0. Then there exists a subsequence {fₙᵢ, i = 1, 2, ...} such that ‖f − fₙᵢ‖ → 0 as i → ∞. Thus we have

f(t) = 0 if 0 ≤ t < T, and f(T) = 1.

However, such an f is not continuous, and thus f ∉ C[0, T], a contradiction. Therefore infₙ ‖f − fₙ‖ > 0, and there exists an ε > 0 such that B(f, ε) ⊂ C[0, T]\K; namely, C[0, T]\K is open. Also, we have

‖fₙ‖ ≤ 1, n = 1, 2, ...,

so K is also bounded. For two consecutive functions fₙ and fₙ₊₁ we have

‖fₙ − fₙ₊₁‖ = sup_{0≤t≤T} (t/T)ⁿ(1 − t/T) = (n/(n+1))ⁿ · 1/(n+1) > (1/3) · 1/(n+1)

for n = 1, 2, .... Then ∪_{n≥1} B_o(fₙ, (1/3) · 1/(n+1)) is an open cover of K, but in this case we cannot find a finite subcollection ∪_{i=1}^m B_o(fₙᵢ, (1/3) · 1/(nᵢ+1)), with m finite, which covers K. Thus, K is not compact.

Problem 2

Proof. (⇒) Suppose f : S₁ → S₂ is continuous. For any open set O ⊂ S₂, we have f⁻¹(O) ⊂ S₁. For a fixed x ∈ f⁻¹(O), we have f(x) ∈ O. Since O is open, there exists an ε > 0 such that B_o(f(x), ε) ⊂ O. Since f is continuous, there exists a δ > 0 such that f(B_o(x, δ)) ⊂ B_o(f(x), ε) ⊂ O. Thus B_o(x, δ) ⊂ f⁻¹(O). That is, f⁻¹(O) is open.
(⇐) Suppose f⁻¹(O) is open in S₁ for every open set O ⊂ S₂. Fix x ∈ S₁ and ε > 0; then B_o(f(x), ε) is an open set in S₂, so f⁻¹(B_o(f(x), ε)) is an open set in S₁ containing x. Hence there exists a δ > 0 such that B_o(x, δ) ⊂ f⁻¹(B_o(f(x), ε)), which yields f(B_o(x, δ)) ⊂ B_o(f(x), ε). Thus, f is continuous.

Problem 3

Proof. Suppose f₁, f₂, ... is a Cauchy sequence in C[0, T] with the uniform metric ‖x − y‖. For any ε > 0, there exists an N > 0 such that for any n₁, n₂ > N we have

ε > ‖f_{n₁} − f_{n₂}‖ = sup_{t∈[0,T]} |f_{n₁}(t) − f_{n₂}(t)| ≥ |f_{n₁}(t) − f_{n₂}(t)|, for any t ∈ [0, T].

Thus, for a fixed t ∈ [0, T], f₁(t), f₂(t), ... is a Cauchy sequence in R. Since R is complete, we have fₙ(t) → f(t) as n → ∞ for some limit function f.
Next, we need to show that f is continuous on [0, T]. For t₁ ∈ [0, T] and any δ > 0, there exist an m large enough and an η > 0 such that for every t₂ satisfying |t₁ − t₂| < η we have

|f(t₁) − f_m(t₁)| < δ/3   (by convergence of {fₙ(t)} for a fixed t),
|f(t₂) − f_m(t₂)| < δ/3   (by the uniform convergence of {fₙ}),
|f_m(t₁) − f_m(t₂)| < δ/3  (by the continuity of f_m).

By the triangle inequality, we have

|f(t₁) − f(t₂)| ≤ |f(t₁) − f_m(t₁)| + |f_m(t₁) − f_m(t₂)| + |f(t₂) − f_m(t₂)| < δ.

Hence f is continuous on [0, T], which completes the proof.

Problem 4

Proof. Part a.
M(0) = E[e⁰] = 1. If M(θ) < ∞ for some θ > 0, then for any θ' ∈ (0, θ] we have

M(θ') = E[e^{θ'X}] = ∫ e^{θ'x} dP(x) ≤ ∫_{x≥0} e^{θx} dP(x) + 1 ≤ M(θ) + 1 < ∞.

Likewise, if M(θ) < ∞ for some θ < 0, then for any θ' ∈ [θ, 0) we have

M(θ') = E[e^{θ'X}] = ∫ e^{θ'x} dP(x) ≤ ∫_{x≤0} e^{θx} dP(x) + 1 ≤ M(θ) + 1 < ∞.

Part b.
Suppose X has the Cauchy distribution, i.e. its density function is

f_X(x) = 1/(π(1 + x²)).

However, for any θ ≠ 0, lim_{|x|→+∞} exp(|θ||x|)/x² = +∞, and the function exp(θx)/(1 + x²) is therefore not integrable (at +∞ when θ > 0 and at −∞ when θ < 0).

Part c.
Let X be a random variable with the following probability density function:

f_X(x) = A exp(x − √(−x)) if x ≤ 0, and A exp(−x − √x) if x > 0,

where A = 1.10045... is a normalizing constant. For θ ∈ [−1, 1] it is readily verified that

M(θ) = E[exp(θX)] = A ∫_{−∞}^0 exp((1 + θ)x − √(−x)) dx + A ∫_0^∞ exp(−(1 − θ)x − √x) dx

is finite, while for any θ outside of [−1, 1], M(θ) is not finite.

Part d.
Consider a Bernoulli random variable X ~ Be(1/2), so E[X] = 1/2. X satisfies all requirements with x₀ = 1. We compute M(θ) = (1 + e^θ)/2, which is finite for all θ. The rate function is

I(x) = sup_θ {θx − log(1 + exp(θ)) + log 2}.

We differentiate the expression above and obtain

d/dθ (θx − log(1 + exp(θ)) + log 2) = x − exp(θ)/(1 + exp(θ)).

Solving for θ, we obtain exp(θ) = x/(1 − x). For 0 ≤ x < 1 the equation admits the solution θ = log(x/(1 − x)) and the rate function is I(x) = x log x + (1 − x) log(1 − x) + log 2. Let x = 1. Then θ − log(1 + exp(θ)) ≤ θ − log(exp(θ)) ≤ 0, so that {θ − log(1 + exp(θ)) + log 2} is bounded by log 2 and admits a finite supremum, and I(1) is finite. For x > 1 (and θ > 0),

θx − log((1 + exp(θ))/2) ≥ θx − log(exp(θ)) = θ(x − 1).

Taking θ → +∞, we obtain I(x) = +∞.
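As a numerical sanity check of the Bernoulli rate function in part (d), the following Python sketch (the grid bounds and spacing are arbitrary choices) compares a grid-search evaluation of the supremum with the closed form x log x + (1 − x) log(1 − x) + log 2:

```python
import math

def I_grid(x):
    """Evaluate I(x) = sup_theta (theta*x - log((1 + e^theta)/2)) for
    X ~ Be(1/2) by a grid search over theta in [-20, 20], step 0.01."""
    return max(
        t / 100.0 * x - math.log((1 + math.exp(t / 100.0)) / 2)
        for t in range(-2000, 2001)
    )

def I_closed(x):
    """Closed form x log x + (1 - x) log(1 - x) + log 2, for 0 < x < 1."""
    return x * math.log(x) + (1 - x) * math.log(1 - x) + math.log(2)

# largest discrepancy over x = 0.1, 0.2, ..., 0.9
gap = max(abs(I_grid(k / 10.0) - I_closed(k / 10.0)) for k in range(1, 10))
```

The grid evaluation slightly under-approximates the supremum, so the gap is tiny but nonzero.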

Problem 5

Proof. Consider two strictly positive sequences xₙ > 0 and yₙ > 0. Since lim supₙ (log xₙ)/n ≤ I and lim supₙ (log yₙ)/n ≤ I, for any ε₁ > 0 there exists an N such that for any n > N we have

sup_{m≥n} (log x_m)/m ≤ I + ε₁ and sup_{m≥n} (log y_m)/m ≤ I + ε₁,

which yields

max{ sup_{m≥n} (log x_m)/m, sup_{m≥n} (log y_m)/m } ≤ I + ε₁.

Thus, for any m ≥ n,

max{ (log x_m)/m, (log y_m)/m } ≤ I + ε₁,

and taking the supremum over m ≥ n on both sides gives

sup_{m≥n} max{ (log x_m)/m, (log y_m)/m } ≤ I + ε₁.

For any ε₂ > 0 we can choose n large enough such that

(log 2)/n + sup_{m≥n} max{ (log x_m)/m, (log y_m)/m } ≤ I + ε₁ + ε₂.

Since x_m + y_m ≤ 2 max{x_m, y_m}, this gives

sup_{m≥n} (log(x_m + y_m))/m ≤ sup_{m≥n} max{ (log(2x_m))/m, (log(2y_m))/m } ≤ (log 2)/n + sup_{m≥n} max{ (log x_m)/m, (log y_m)/m } ≤ I + ε₁ + ε₂.

As n → ∞ we can make ε₁ and ε₂ both approach 0. Thus

lim sup_n sup_{m≥n} (log(x_m + y_m))/m ≤ I,

that is, lim supₙ (log(xₙ + yₙ))/n ≤ I.

Problem 6

Proof. For any x₁, x₂ and λ ∈ [0, 1], let x = λx₁ + (1 − λ)x₂, and observe

I(x) = I(λx₁ + (1 − λ)x₂) = sup_θ (θ(λx₁ + (1 − λ)x₂) − log M(θ))
     = sup_θ (λ(θx₁ − log M(θ)) + (1 − λ)(θx₂ − log M(θ)))
     ≤ λ sup_θ (θx₁ − log M(θ)) + (1 − λ) sup_θ (θx₂ − log M(θ))
     = λI(x₁) + (1 − λ)I(x₂).    (1)

If I(x) is not strictly convex, there exist x₁ ≠ x₂ and λ ∈ (0, 1) such that

sup_θ {λ(θx₁ − log M(θ)) + (1 − λ)(θx₂ − log M(θ))} = λ sup_θ {θx₁ − log M(θ)} + (1 − λ) sup_θ {θx₂ − log M(θ)}.

However, we know that for every x ∈ R there exists θ₀ ∈ R such that I(x) = θ₀x − log M(θ₀). Moreover, θ₀ satisfies

x = M'(θ₀)/M(θ₀).

Let θ₀ ∈ R be such that M'(θ₀)/M(θ₀) = λx₁ + (1 − λ)x₂. We have

I(x) = θ₀(λx₁ + (1 − λ)x₂) − log M(θ₀) = λ(θ₀x₁ − log M(θ₀)) + (1 − λ)(θ₀x₂ − log M(θ₀)).

Clearly, if either θ₀x₁ − log M(θ₀) < sup_θ (θx₁ − log M(θ)) or θ₀x₂ − log M(θ₀) < sup_θ (θx₂ − log M(θ)), then the equality does not hold. Therefore θ₀ also achieves the maximum for both x₁ and x₂. By first order conditions, we also obtain

M'(θ₀)/M(θ₀) = x₁ and M'(θ₀)/M(θ₀) = x₂,

which implies that x₁ = x₂ and thus gives a contradiction.


MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070J
Homework 2 Solution

Fall 2013

Problem 1

Proof. Lecture note 4 has shown that the set {θ > 0 : M(θ) < exp(Cθ)} is nonempty. Let

θ* := sup{θ > 0 : M(θ) < exp(Cθ)}.

If θ* = ∞, which implies that M(θ) < exp(Cθ) holds for all θ > 0, we have

inf_{t>0} t I(C + 1/t) = inf_{t>0} sup_{θ∈R} {t(θC − log M(θ)) + θ} = ∞ = θ*.

Consider the case in which θ* is finite. According to the definition of I(C + 1/t), we have

I(C + 1/t) ≥ θ*(C + 1/t) − log M(θ*),

and therefore, since log M(θ*) ≤ Cθ*,

inf_{t>0} t I(C + 1/t) ≥ inf_{t>0} [t(θ*C − log M(θ*)) + θ*] ≥ θ*.    (1)

Next, we establish the convexity of log M(θ) on {θ ∈ R : M(θ) < ∞}. For two θ₁, θ₂ ∈ {θ ∈ R : M(θ) < ∞} and 0 < λ < 1, Hölder's inequality gives

E[exp((λθ₁ + (1 − λ)θ₂)X)] ≤ E[exp(θ₁X)]^λ · E[exp(θ₂X)]^{1−λ}.

Taking logarithms on both sides gives

log M(λθ₁ + (1 − λ)θ₂) ≤ λ log M(θ₁) + (1 − λ) log M(θ₂).

By the convexity of log M(θ) and the boundary equality log M(θ*) = Cθ*, we have log M(θ) ≥ Cθ* + (M'(θ*)/M(θ*))(θ − θ*), and hence

θ(C + 1/t) − log M(θ) ≤ θ(C + 1/t) − Cθ* − (M'(θ*)/M(θ*))(θ − θ*).

Thus we have

inf_{t>0} t sup_{θ∈R} [θ(C + 1/t) − log M(θ)] ≤ inf_{t>0} t sup_{θ∈R} [θ(C − M'(θ*)/M(θ*) + 1/t) + θ*(M'(θ*)/M(θ*) − C)].    (2)

Then we establish the fact that M'(θ*)/M(θ*) ≥ C. If not, then there exists a sufficiently small h > 0 such that

(log M(θ*) − log M(θ* − h))/h < C,

which implies

log M(θ* − h) > log M(θ*) − Ch = C(θ* − h), i.e. M(θ* − h) > exp(C(θ* − h)),

which contradicts the definition of θ*. Since M'(θ*)/M(θ*) ≥ C, the infimum on the right-hand side of (2) is attained at t > 0 such that C + 1/t − M'(θ*)/M(θ*) = 0: for this t the coefficient of θ vanishes, the supremum equals θ*(M'(θ*)/M(θ*) − C) = θ*/t, and therefore

inf_{t>0} t sup_{θ∈R} [θ(C + 1/t) − log M(θ)] ≤ t · (θ*/t) = θ*,

that is,

inf_{t>0} t I(C + 1/t) ≤ θ*.    (3)

From (1) and (3), we have the result inf_{t>0} t I(C + 1/t) = θ*.

Problem 2 (Based on Tetsuya Kaji's Solution)

(a). Let θ₀ be the one satisfying I(a) = θ₀a − log M(θ₀) and let ε be a small positive number. Following the proof of the lower bound of Cramér's theorem, we have

(1/n) log P(n⁻¹Sₙ ≥ a) ≥ (1/n) log P(n⁻¹Sₙ ∈ [a, a + ε))
                       ≥ −I(a) − θ₀ε + (1/n) log P(n⁻¹Ŝₙ − a ∈ [0, ε)),

where Ŝₙ = Y₁ + ... + Yₙ and the Yᵢ, 1 ≤ i ≤ n, are i.i.d. random variables following the tilted distribution P(Yᵢ ≤ z) = M(θ₀)⁻¹ ∫_{−∞}^z exp(θ₀x) dP(x). Recall that

P(n⁻¹Ŝₙ − a ∈ [0, ε)) = P( Σ_{i=1}^n (Yᵢ − a)/√n ∈ [0, √n ε) ).

By the CLT, setting ε = O(n^{−1/2}) gives

P(n⁻¹Ŝₙ − a ∈ [0, ε)) = Θ(1).

Thus we have

(1/n) log P(n⁻¹Sₙ ≥ a) + I(a) ≥ −θ₀ε + (1/n) log P(n⁻¹Ŝₙ − a ∈ [0, ε)) = O(n^{−1/2}).

Combining the result with the upper bound (1/n) log P(n⁻¹Sₙ ≥ a) ≤ −I(a), we have

|(1/n) log P(n⁻¹Sₙ ≥ a) + I(a)| ≤ C/√n.

(b). Take a = μ := E[X]. It is obvious that P(n⁻¹Sₙ ≥ μ) → 1/2 as n → ∞. Recalling that I(μ) = 0, we have

|(1/n) log P(n⁻¹Sₙ ≥ μ) + I(μ)| = Θ(1/n).

Namely, this bound cannot be improved.

Problem 3

For any n ≥ 0, define a point Mₙ in R² by

x_{Mₙ} = (1/n) Σ_{i≤n} Xᵢ and y_{Mₙ} = (1/n) Σ_{i≤n} Yᵢ.

Let B₀(1) be the open ball of radius one in R². From these definitions, we can rewrite

P( ((1/n) Σ_{i=1}^n Xᵢ)² + ((1/n) Σ_{i=1}^n Yᵢ)² ≥ 1 ) = P(Mₙ ∉ B₀(1)).

We will apply Cramér's Theorem in R²:

lim_{n→∞} (1/n) log P( ((1/n) Σ_{i=1}^n Xᵢ)² + ((1/n) Σ_{i=1}^n Yᵢ)² ≥ 1 ) = − inf_{(x,y)∈B₀(1)^c} I(x, y),

where

I(x, y) = sup_{(θ₁,θ₂)∈R²} (θ₁x + θ₂y − log M(θ₁, θ₂)), with M(θ₁, θ₂) = E[exp(θ₁X + θ₂Y)].

Note that since X and Y are presumed independent, log M(θ₁, θ₂) = log M_X(θ₁) + log M_Y(θ₂), with M_X(θ₁) = E[exp(θ₁X)] and M_Y(θ₂) = E[exp(θ₂Y)]. We can easily compute that

M_X(θ) = exp(θ²/2) and M_Y(θ) = E[e^{θY}] = (1/2) ∫_{−1}^{1} e^{θy} dy = (e^θ − e^{−θ})/(2θ).

Since x and y are decoupled in the definition of I(x, y), we obtain I(x, y) = I_X(x) + I_Y(y) with

I_X(x) = sup_{θ₁} g₁(x, θ₁) = sup_{θ₁} (θ₁x − θ₁²/2) = x²/2,
I_Y(y) = sup_{θ₂} g₂(y, θ₂) = sup_{θ₂} (θ₂y − log((e^{θ₂} − e^{−θ₂})/(2θ₂))).

Since for all y, θ₂ we have g₂(y, θ₂) = g₂(−y, −θ₂), it follows that I_Y(y) = I_Y(−y) for all y.

Since I_X(x) is increasing in |x| and I_Y(y) is increasing in |y|, the infimum over B₀(1)^c is attained on the circle x² + y² = 1, which can be reparametrized as a one-dimensional search over an angle φ. Optimizing over φ, we find that the minimum of I(x, y) is attained at x = ±1, y = 0, and that the value is equal to 1/2. We obtain

lim_{n→∞} (1/n) log P( ((1/n) Σ_{i=1}^n Xᵢ)² + ((1/n) Σ_{i=1}^n Yᵢ)² ≥ 1 ) = −1/2.
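The minimization over the circle can be checked numerically. The following Python sketch (grid sizes and bounds are arbitrary choices) evaluates I_X(cos φ) + I_Y(sin φ) over a grid of angles, with I_Y itself approximated by a grid search over θ:

```python
import math

def I_Y(y, t_max=50.0, steps=2000):
    """Rate function of Uniform[-1, 1]: sup_t (t*y - log(sinh(t)/t)),
    approximated by a grid search (I_Y is even in y, so try +-t)."""
    best = 0.0  # value of the objective at t = 0
    for k in range(1, steps + 1):
        t = t_max * k / steps
        log_m = math.log((math.exp(t) - math.exp(-t)) / (2 * t))
        best = max(best, t * y - log_m, -t * y - log_m)
    return best

# I(x, y) = x^2/2 + I_Y(y); minimize over the circle x = cos(phi), y = sin(phi)
vals = [math.cos(phi) ** 2 / 2 + I_Y(math.sin(phi))
        for phi in (math.pi * k / 200 for k in range(201))]
min_val = min(vals)  # should be 1/2, attained at y = 0 (phi = 0 or pi)
```

The grid search confirms the minimum value 1/2, i.e. the large deviations limit −1/2.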
Problem 4

We denote Yn the set of all length-n sequences which satisfy condition (a). The
rst step of our method will be to construct a Markov Chain with the following
properties:
For every n 0, and any sequence (X1 , ..., Xn ) generated by the Markov
Chain, (X1 , X2 , ..., Xn ) belongs to Yn .
For every n 0, and every (x1 , ..., xn ) Yn , (x1 , ..., xn ) has positive
probability, and all sequences of Yn are almost equally likely.
Consider a general Markov chain with two states (0, 1) and general transition probabilities (P₀₀, P₀₁; P₁₀, P₁₁). We immediately realize that if P₁₁ > 0, sequences with two consecutive ones do not have zero probability (in particular, for n = 2, the sequence (1, 1) has probability π(1)P₁₁). Therefore we set P₁₁ = 0 (and thus P₁₀ = 1), and verify that this enforces the first condition.
Let now P₀₀ = p, P₀₁ = 1 − p, and let us find p such that all sequences are almost equiprobable. What is the probability of a sequence (X₁, ..., Xₙ)?
Every 1 in the sequence (X₁, ..., Xₙ) necessarily transited from a 0, with probability (1 − p). Zeroes in the sequence (X₁, ..., Xₙ) can come either from another 0, in which case they contribute a p to the joint probability, or from a 1, in which case they contribute a 1. Denote by N₀ and N₁ the numbers of 0s and 1s in the sequence (X₁, ..., Xₙ). Since each 1 of the sequence transits to a 0 of the sequence, there are N₁ zeroes which contribute a probability of 1, and thus N₀ − N₁ zeroes contribute a probability of p. This is only almost correct, though, since we have to account for the initial state X₁ and the final state Xₙ. By choosing for initial distribution π(0) = p and π(1) = 1 − p, the above reasoning applies correctly to X₁.
Our last problem is when the last state is 1, in which case that 1 does not produce a 1-to-0 transition, and the number of zero-zero transitions is therefore N₀ − N₁ + 1. In summary, under the assumptions given above, we have:

P(X₁, ..., Xₙ) = (1 − p)^{N₁} p^{N₀ − N₁} when Xₙ = 0, and (1 − p)^{N₁} p^{N₀ − N₁ + 1} when Xₙ = 1.

Since N₀ + N₁ = n, we can rewrite (1 − p)^{N₁} p^{N₀ − N₁} as (1 − p)^{N₁} p^{n − 2N₁}, or equivalently as ((1 − p)/p²)^{N₁} pⁿ. We conclude

P(X₁, ..., Xₙ) = ((1 − p)/p²)^{N₁} pⁿ when Xₙ = 0, and ((1 − p)/p²)^{N₁} p^{n+1} when Xₙ = 1.

We conclude that if (1 − p)/p² = 1, sequences will be almost equally likely. This equation has positive solution p = (√5 − 1)/2 ≈ 0.6180, which we take in the rest of the problem (trivia: 1/p = φ, the golden ratio). The steady state distribution of the resulting Markov chain can be easily computed to be π = (π₀, π₁) = (1/(2 − p), (1 − p)/(2 − p)) ≈ (0.7236, 0.2764). We also obtain the almost-equiprobable condition:

P(X₁, ..., Xₙ) = pⁿ when Xₙ = 0, and p^{n+1} when Xₙ = 1.

We now relate this Markov chain to the problem at hand. Note the following: log |Zₙ| = log(|Zₙ|/|Yₙ|) + log |Yₙ|, and therefore

lim_n (1/n) log |Zₙ| = lim_n (1/n) log |Yₙ| + lim_n (1/n) log(|Zₙ|/|Yₙ|).

Let us compute first lim_n (1/n) log |Yₙ|. This is easily done using our Markov chain. Fix n ≥ 0, and observe that since our Markov chain only generates sequences which belong to Yₙ, we have

1 = Σ_{(X₁,...,Xₙ)∈Yₙ} P(X₁, ..., Xₙ).

Note that for any (X₁, ..., Xₙ) ∈ Yₙ we have pⁿ⁺¹ ≤ P(X₁, ..., Xₙ) ≤ pⁿ, and so we obtain

pⁿ⁺¹ |Yₙ| ≤ 1 ≤ pⁿ |Yₙ|,

and so

φⁿ ≤ |Yₙ| ≤ φⁿ⁺¹, i.e. n log φ ≤ log |Yₙ| ≤ (n + 1) log φ,

which gives lim_n (1/n) log |Yₙ| = log φ.


We now consider the term |Zₙ|/|Yₙ|. The above reasoning shows that, intuitively, 1/|Yₙ| is the probability of each of the (almost) equally likely sequences (X₁, ..., Xₙ), and |Zₙ| is the number of such sequences with at least 70% zeroes. Basic probability reasoning gives that the ratio is therefore, up to constants, the probability that a random sequence (X₁, ..., Xₙ) has at least 70% zeroes. Let us first prove this formally, and then compute the said probability. Denote by G(X₁, ..., Xₙ) the fraction of zeroes of the sequence (X₁, ..., Xₙ). Then, for k = 0.7,

P(G(X₁, ..., Xₙ) ≥ k) = Σ_{(X₁,...,Xₙ)∈Zₙ} P(X₁, ..., Xₙ).

Reusing the same idea as previously, pⁿ⁺¹ |Zₙ| ≤ P(G(X₁, ..., Xₙ) ≥ k) ≤ pⁿ |Zₙ|, and combining with the bounds φⁿ ≤ |Yₙ| ≤ φⁿ⁺¹ we obtain

p · P(G(X₁, ..., Xₙ) ≥ k) ≤ |Zₙ|/|Yₙ| ≤ p⁻¹ · P(G(X₁, ..., Xₙ) ≥ k).

Taking logs, we obtain

lim_n (1/n) log P(G(X₁, ..., Xₙ) ≥ k) = lim_n (1/n) log(|Zₙ|/|Yₙ|).

We will use large deviations for Markov chains to compute that probability. First note that G(X₁, ..., Xₙ) is the same as (1/n) Σᵢ F(Xᵢ), where F(0) = 1 and F(1) = 0. By Miller's Theorem, we obtain that for any k,

lim_n (1/n) log P( Σᵢ F(Xᵢ) ≥ nk ) = − inf_{x≥k} I(x),

with

I(x) = sup_θ (θx − log ρ(θ)),

where ρ(θ) is the largest eigenvalue of the matrix

M(θ) = [ p exp(θ)   1 − p ]
       [ exp(θ)     0     ].

The characteristic equation is λ² − (p exp(θ))λ − (1 − p) exp(θ) = 0, whose largest solution is

ρ(θ) = ( p exp(θ) + √(p² exp(2θ) + 4(1 − p) exp(θ)) ) / 2.

The rate function of the Markov chain is I(x) = sup_θ (θx − log ρ(θ)). Since the mean of F under the steady state distribution is π₀ ≈ 0.7236, which is above 0.7, the minimum is min_{x≥0.7} I(x) = I(π₀) = 0. Thus lim_n (1/n) log(|Zₙ|/|Yₙ|) = 0, and we conclude

lim_n (1/n) log |Zₙ| = log φ ≈ 0.4812.

In general, for k ≤ π₀ we will have

lim_n (1/n) log |Zₙ(k)| = log φ ≈ 0.4812,

and for k > π₀,

lim_n (1/n) log |Zₙ(k)| = log φ − sup_θ (θk − log ρ(θ)).
Problem 5

1(i)

Consider a standard Brownian motion B, and let U be a uniform random variable over [1/2, 1]. Let

W(t) = B(t) when t ≠ U, and W(U) = 0.

With probability 1, B(U) is not zero, and therefore lim_{t→U} W(t) = lim_{t→U} B(t) = B(U) ≠ 0 = W(U), and W is not continuous at U. For any finite collection of times t = (t₁, ..., tₙ) and real numbers x = (x₁, ..., xₙ), denote W(t) = (W(t₁), ..., W(tₙ)). Then

P(W(t) ≤ x) = P(U ∉ {tᵢ, 1 ≤ i ≤ n}) P(W(t) ≤ x | U ∉ {tᵢ, 1 ≤ i ≤ n})
            + P(U ∈ {tᵢ, 1 ≤ i ≤ n}) P(W(t) ≤ x | U ∈ {tᵢ, 1 ≤ i ≤ n}).

Note that P(U ∈ {tᵢ, 1 ≤ i ≤ n}) = 0 and P(W(t) ≤ x | U ∉ {tᵢ, 1 ≤ i ≤ n}) = P(B(t) ≤ x), and thus the process W has exactly the same distributional properties as B (Gaussian process, independent and stationary increments with zero mean and variance proportional to the size of the interval).

1(ii)

Let X be a Gaussian random variable (mean 0, standard deviation 1), and denote by Q_X the set {q + X, q ∈ Q}, where Q is the set of rational numbers. Let

W(t) = B(t) when t ∉ Q_X\{0}, and W(t) = B(t) + 1 when t ∈ Q_X\{0}.

Through the exact same argument as in 1(i), W has the same distributional properties as B (this is because Q_X, just like {tᵢ, 1 ≤ i ≤ n}, has measure zero for a random variable with density).
However, note that for any t > 0,

|t − (X + ⌊(t − X)10ⁿ⌋/10ⁿ)| ≤ 10⁻ⁿ,

proving that limₙ (X + ⌊(t − X)10ⁿ⌋/10ⁿ) = t. However, for any n, X + ⌊(t − X)10ⁿ⌋/10ⁿ ∈ Q_X, and so

limₙ W(X + ⌊(t − X)10ⁿ⌋/10ⁿ) = B(t) + 1 ≠ B(t).

This proves W is surely discontinuous everywhere.
1(iii)

Let t ≥ 0, and consider the event Eₙ = {|B(t + 1/n) − B(t)| > ε}. Since B(t + 1/n) − B(t) is equal in distribution to n^{−1/2} N, where N is a standard normal, by Chebyshev's inequality we have

P(Eₙ) = P(n^{−1/2} |N| > ε) = P(|N| > ε n^{1/2}) = P(N⁴ > ε⁴ n²) ≤ 3/(ε⁴ n²).

Since Σₙ P(Eₙ) ≤ Σₙ 3/(ε⁴ n²) < ∞, by the Borel-Cantelli lemma there almost surely exists N₀ such that for all n ≥ N₀, |B(t + 1/n) − B(t)| ≤ ε, proving limₙ B(t + 1/n) = B(t) almost surely.
Problem 6

The event B ∈ A_R is included in the event B(2) − B(1) = B(1) − B(0), and thus

P(B ∈ A_R) ≤ P(B(2) − B(1) = B(1) − B(0)) = 0,

since the probability that two atomless, independent random variables are equal is zero (easy to prove using conditional probabilities).


Pset #3 Solutions
Problem 1.

Problem 2(a)

Problem 3.

Problem 4.

2.

Problem 5.
1.

2.

Problem 6.
a).

b).

c).


6.265-15.070 Advanced Stochastic Processes


Take-home final exam.
Date distributed: December 15
Date due: December 18
100 points total

Problem 1 (15 points) Suppose Xᵢ is an i.i.d. zero mean sequence of random variables with an everywhere finite moment generating function M(θ) = E[exp(θX₁)]. Argue the existence of and express the following large deviations limit

lim_{n→∞} (1/n) log P( Σ_{1≤i≠j≤n} Xᵢ Xⱼ > n² z )

in terms of M(θ) and z.
Problem 2 (15 points) Establish the following identity directly from the definition of the Ito integral:

∫₀ᵗ s dBₛ = tBₜ − ∫₀ᵗ Bₛ ds.

Hint: Think about the integral ∫₀ᵗ Bₛ ds as a limit of the sums Σᵢ B_{tᵢ₊₁}(tᵢ₊₁ − tᵢ).

Problem 3 (15 points) Given a stochastic process Xₜ, the so-called Stratonovich integral ∫₀ᵗ Xₛ ∘ dBₛ of X with respect to a Brownian motion Bₜ is defined as the L₂ limit of

Σ_{0≤i≤n−1} X_{t̄ᵢ} (B_{tᵢ₊₁} − B_{tᵢ}),

over a sequence of partitions Πₙ = {0 = t₀ < t₁ < ... < tₙ = t} with resolution Δ(Πₙ) → 0 as n → ∞, when such a limit exists. Here t̄ᵢ = (tᵢ + tᵢ₊₁)/2. Compute the Stratonovich integral ∫₀ᵗ Bₛ ∘ dBₛ and compare it with the Ito integral ∫₀ᵗ Bₛ dBₛ.
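The discrepancy between the two integrals of B dB can be seen numerically: the left-endpoint (Ito) sums approach (Bₜ² − t)/2, while the midpoint (Stratonovich) sums approach Bₜ²/2, a gap of t/2. A Monte Carlo sketch in Python (step count and seed are arbitrary; the midpoint value of B is generated by splitting each increment into two half-steps):

```python
import random, math

def bdB_sums(t=1.0, n=20000, seed=11):
    """Ito sum (left endpoint) and Stratonovich sum (midpoint value of B)
    for the integral of B dB over [0, t], along one simulated path."""
    rng = random.Random(seed)
    half_sd = math.sqrt(t / (2 * n))  # B moves in two half-steps per interval
    b = 0.0
    ito = strat = 0.0
    for _ in range(n):
        mid = b + half_sd * rng.gauss(0.0, 1.0)    # B at the midpoint
        new = mid + half_sd * rng.gauss(0.0, 1.0)  # B at t_{i+1}
        ito += b * (new - b)
        strat += mid * (new - b)
        b = new
    return ito, strat, b

t = 1.0
ito, strat, b_t = bdB_sums(t)
# Ito integral: (B_t^2 - t)/2 ; Stratonovich integral: B_t^2/2
```

Along a single path the two sums differ by roughly t/2, illustrating the extra quadratic-variation term in the Ito integral.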
Problem 4 (10 points) Given a sequence of real values xₙ ∈ R, n ≥ 1, consider the associated sequence of probability measures Pₙ = δ_{xₙ}. Show that Pₙ is tight if and only if the set {xₙ, n ≥ 1} is bounded. Show that if Pₙ = δ_{xₙ} converges weakly to some measure P, then there exists a limit limₙ xₙ = x and furthermore P = δₓ.
Problem 5 (15 points) Recall from probability theory that a sequence of random variables Xₙ : Ω → R is said to converge to X in distribution if F_{Xₙ}(x) → F_X(x) for every point x such that F_X is continuous at x. Here F_{Xₙ} and F_X denote the cumulative distribution functions of Xₙ and X respectively.


(a) Each random variable Z : Ω → R induces a probability measure P_Z on R equipped with the Borel σ-field, defined by P_Z(B) = P(Z ∈ B). Thus from Xₙ and X we obtain sequences of probability measures Pₙ and P on R. Show that Pₙ converges weakly to P (Pₙ ⇒ P) if and only if Xₙ converges to X in distribution. Namely, the two notions of convergence are identical for random variables.
Hint: for one direction you might find it useful to use the Skorohod Representation Theorem (which you would need to find/recall on your own) and the relationship between almost sure convergence and convergence in distribution.
(b) Suppose Xₙ is a sequence of random variables which converges in distribution to X, all defined on the same probability space Ω. Suppose the sequence of random variables Yₙ on Ω converges to zero in distribution. Establish that Xₙ + Yₙ converges weakly to X.
Problem 6 (15 points) Exercise 1 from Lecture 21. (Refer to the lecture note for the
statement of the problem).
Problem 7 (15 points) Suppose Xₙ, n ≥ 1 is an i.i.d. sequence with zero mean and variance σ². Suppose β > 0 is a fixed constant. Compute the following triple limit

lim_{z→∞} z⁻¹ lim_{t→∞} lim_{n→∞} log P( max_{1≤k≤nt} ( n^{−1/2} Σ_{1≤i≤k} Xᵢ − βk/n ) ≥ z ).

MIT OpenCourseWare
http://ocw.mit.edu

15.070J / 6.265J Advanced Stochastic Processes


Fall 2013

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

MASSACHUSETTS INSTITUTE OF TECHNOLOGY


6.265/15.070J

Fall 2013

Midterm Solutions
Problem 1

Suppose a random variable X is such that P(X > 1) = 0 and P(X > 1 − ε) > 0 for every ε > 0. Recall that the large deviations rate function is defined to be I(x) = sup_θ (θx − log M(θ)) for every real value x, where M(θ) = E[exp(θX)], for every real value θ.
(a) Show that I(x) = ∞ for every x > 1.
Since P(X > 1) = 0, we have for θ ≥ 0

M(θ) = ∫ exp(θx) dP_X(x) ≤ exp(θ) P(X ≤ 1) = exp(θ).

We therefore obtain log M(θ) ≤ θ, and conclude that for x > 1

I(x) = sup_θ (θx − log M(θ)) ≥ sup_{θ≥0} θ(x − 1) = ∞.

(b) Show that I(x) < ∞ for every E[X] ≤ x < 1.
Since x ≥ E[X], we have that

I(x) = sup_θ (θx − log M(θ)) = sup_{θ≥0} (θx − log M(θ)).

Now, take any ε > 0 such that x < 1 − ε, and note that for θ ≥ 0

M(θ) = ∫ exp(θx) dP_X(x) ≥ ∫_{1−ε}^{1} exp(θx) dP_X(x) ≥ exp(θ(1 − ε)) P(X > 1 − ε).

Therefore log M(θ) ≥ θ(1 − ε) + log P(X > 1 − ε), and we obtain

I(x) = sup_{θ≥0} (θx − log M(θ)) ≤ sup_{θ≥0} (θ(x − (1 − ε)) − log P(X > 1 − ε)) = −log P(X > 1 − ε) < ∞.

(c) Suppose lim_{ε→0} P(1 − ε ≤ X ≤ 1) = 0. Show that I(1) = ∞.
For any ε > 0 and θ ≥ 0,

M(θ) = ∫ exp(θx) dP_X(x) = ∫_{x<1−ε} exp(θx) dP_X(x) + ∫_{1−ε}^{1} exp(θx) dP_X(x)
     ≤ exp(θ(1 − ε)) P(X < 1 − ε) + exp(θ) P(1 − ε ≤ X ≤ 1)
     ≤ exp(θ(1 − ε)) (1 + P(1 − ε ≤ X ≤ 1)(exp(θε) − 1))
     ≤ exp(θ(1 − ε)) (1 + P(1 − ε ≤ X ≤ 1) exp(θε)).

Let f(θ, ε) denote the quantity above. For any θ ≥ 0 and ε > 0 we have M(θ) ≤ f(θ, ε), and we obtain that

I(1) ≥ sup_{θ≥0, ε>0} (θ − log f(θ, ε)).

Take θ(ε) = (1/ε) log(1/P(1 − ε ≤ X ≤ 1)), so that 1 + P(1 − ε ≤ X ≤ 1) exp(θ(ε)ε) = 2. We obtain log f(θ(ε), ε) = θ(ε)(1 − ε) + log 2, and so θ(ε) − log f(θ(ε), ε) = θ(ε)ε − log 2. Finally, note that θ(ε)ε = log(1/P(1 − ε ≤ X ≤ 1)) goes to ∞ as ε goes to zero.

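As a sanity check on parts (a)-(c), the rate function can be evaluated numerically for a concrete distribution satisfying the hypotheses. The sketch below is a numerical illustration of my own, not part of the original solution: it takes X ~ Uniform[0, 1], for which M(θ) = (e^θ − 1)/θ, P(X > 1) = 0, P(X > 1 − ε) = ε > 0, and P(1 − ε ≤ X ≤ 1) = ε → 0, and approximates the supremum over θ on a finite grid.

```python
import math

def log_mgf_uniform(theta):
    """log M(theta) for X ~ Uniform[0, 1], where M(theta) = (e^theta - 1)/theta."""
    if abs(theta) < 1e-12:
        return 0.0  # M(0) = 1
    if theta > 30:
        return theta - math.log(theta)  # e^theta - 1 ~ e^theta; avoids overflow
    return math.log((math.exp(theta) - 1.0) / theta)

def rate(x, theta_max, step=0.1):
    """Approximate sup_theta (theta*x - log M(theta)) over a grid [-50, theta_max]."""
    n_lo, n_hi = int(-50 / step), int(theta_max / step)
    return max(k * step * x - log_mgf_uniform(k * step) for k in range(n_lo, n_hi + 1))

# (a) For x = 1.1 > 1 the supremum diverges: it keeps growing as theta_max grows.
print(rate(1.1, 500.0))
# (b) For x = 0.9 < 1 the rate is finite, bounded by -log P(X > 1 - eps) with eps = 0.05.
print(rate(0.9, 500.0), -math.log(0.05))
# (c) At x = 1 the supremum also diverges, but only logarithmically in theta_max.
print(rate(1.0, 500.0))
```

Enlarging the grid shows the dichotomy directly: rate(1.1, ·) grows linearly in theta_max, while rate(0.9, ·) is already converged.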
Problem 2

Recall the following one-dimensional version of the Large Deviations Principle
for finite-state Markov chains. Given an N-state Markov chain X_n, n ≥ 0 with
transition matrix P_{i,j}, 1 ≤ i, j ≤ N and a function f : {1, ..., N} → R, the
sequence S_n/n = Σ_{1≤i≤n} f(X_i)/n satisfies the Large Deviations Principle with the
rate function I(x) = sup_θ (θx − log ρ(P_θ)), where ρ(P_θ) is the Perron-Frobenius
eigenvalue of the matrix P_θ = (e^{θf(j)} P_{i,j}, 1 ≤ i, j ≤ N).

Suppose P_{i,j} = π_j for some probability vector π_j ≥ 0, 1 ≤ j ≤ N,
Σ_j π_j = 1. Namely, the observations X_n for n ≥ 1 are i.i.d. with the probability
mass function given by π. In this case we know that the large deviations
rate function for the i.i.d. sequence f(X_n), n ≥ 1 is described by the moment
generating function of f(X_n), n ≥ 1. Establish that the two large deviations
rate functions are identical, and thus the LDP for Markov chains in this case is
consistent with the LDP for i.i.d. processes.

Proof. We have that P_θ is the rank-one matrix all of whose rows equal

(exp(θf(1))π_1, ..., exp(θf(N))π_N).

Let v = [1, ..., 1]^T and M(θ) = E[exp(θf(X_1))] = Σ_{1≤j≤N} exp(θf(j))π_j. Then
we have that

P_θ v = M(θ) v.

Since P_θ has rank one, M(θ) is its only nonzero eigenvalue, and since M(θ) > 0
it is the Perron-Frobenius eigenvalue of P_θ. Thus, we have

I(x) = sup_θ (θx − log ρ(P_θ)) = sup_θ (θx − log M(θ)).

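The rank-one structure used in this proof is easy to verify numerically. The following sketch (the three-state f, π, and θ are illustrative values of my own, not from the problem) builds P_θ for the i.i.d. case, checks P_θ v = M(θ) v for v = (1, ..., 1)^T, and confirms by power iteration that M(θ) is the Perron-Frobenius eigenvalue.

```python
import math

# Illustrative data for a three-state chain with P_{i,j} = pi_j (assumed values).
f = [1.0, -0.5, 2.0]
pi = [0.2, 0.5, 0.3]
theta = 0.7
N = len(f)

# Every row of P_theta equals (exp(theta f(j)) pi_j)_j when P_{i,j} = pi_j.
row = [math.exp(theta * fj) * pj for fj, pj in zip(f, pi)]
P_theta = [row[:] for _ in range(N)]

# M(theta) = E[exp(theta f(X_1))] is exactly the common row sum.
M = sum(row)

# Check P_theta v = M(theta) v for v = (1, ..., 1)^T.
v = [1.0] * N
Pv = [sum(P_theta[i][j] * v[j] for j in range(N)) for i in range(N)]
assert all(abs(Pv[i] - M * v[i]) < 1e-12 for i in range(N))

# Power iteration recovers the Perron-Frobenius eigenvalue; for a rank-one
# matrix with positive entries it stabilizes after a single step.
x = [1.0, 2.0, 3.0]
for _ in range(50):
    y = [sum(P_theta[i][j] * x[j] for j in range(N)) for i in range(N)]
    norm = max(abs(c) for c in y)
    x = [c / norm for c in y]
lam = sum(P_theta[0][j] * x[j] for j in range(N)) / x[0]
print(lam, M)
```
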
Problem 3

(a) Suppose X_n, n ≥ 0 is a martingale such that the distribution of X_n is
identical for all n and the second moment of X_n is finite. Establish that X_n =
X_0 almost surely for all n.

Proof. Let F_n be a filtration to which the martingale X_n, n ≥ 0 is adapted.
Then for any n ≥ 1, by the tower property, we have

E[(X_n − X_0)²] = E[X_n²] + E[X_0²] − 2E[X_0 X_n]
= E[X_n²] + E[X_0²] − 2E[E[X_0 X_n | F_0]]
= E[X_n²] + E[X_0²] − 2E[X_0 E[X_n | F_0]]
= E[X_n²] + E[X_0²] − 2E[X_0²]
= 0,

where the next-to-last step uses the martingale property E[X_n | F_0] = X_0, and the
last step uses the fact that X_n, n ≥ 0 all have the same distribution, so that
E[X_n²] = E[X_0²] < ∞. Thus, we have X_n = X_0 almost surely.
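The key step above is E[X_0 X_n] = E[X_0²], which uses only the tower property and so holds for any square-integrable martingale. A quick Monte Carlo sketch (the toy martingale X_n = Z + ξ_1 + ... + ξ_n with independent mean-zero ξ_i is my own choice, not from the problem; it does not have identical marginals) illustrates this orthogonality-of-increments identity numerically.

```python
import random

random.seed(0)

N_TRIALS = 100_000
N_STEPS = 5

sum_x0_sq = sum_x0_xn = sum_incr_sq = sum_xn_sq = 0.0
for _ in range(N_TRIALS):
    z = random.gauss(0.0, 1.0)  # X_0 = Z
    xn = z + sum(random.gauss(0.0, 1.0) for _ in range(N_STEPS))  # X_n
    sum_x0_sq += z * z
    sum_x0_xn += z * xn
    sum_incr_sq += (xn - z) ** 2
    sum_xn_sq += xn * xn

e_x0_sq = sum_x0_sq / N_TRIALS   # approx E[X_0^2] = 1
e_x0_xn = sum_x0_xn / N_TRIALS   # approx E[X_0 X_n]; tower property gives E[X_0^2]
e_incr = sum_incr_sq / N_TRIALS  # approx E[(X_n - X_0)^2] = N_STEPS
e_xn_sq = sum_xn_sq / N_TRIALS   # approx E[X_n^2] = 1 + N_STEPS

print(e_x0_xn, e_x0_sq)           # nearly equal
print(e_incr, e_xn_sq - e_x0_sq)  # nearly equal: E[(X_n - X_0)^2] = E[X_n^2] - E[X_0^2]
```

In the solved problem the marginals are identical, so E[X_n²] − E[X_0²] = 0 and the same identity forces E[(X_n − X_0)²] = 0.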
(b) An urn contains two white balls and one black ball at time zero. At each time
t = 1, 2, ... exactly one ball is added to the urn. Specifically, if at time t ≥ 0
there are W_t white balls and B_t black balls, the ball added at time t + 1 is white
with probability W_t/(W_t + B_t) and is black with the remaining probability
B_t/(W_t + B_t). In particular, since there were three balls at the beginning, and
at every time t ≥ 1 exactly one ball is added, then W_t + B_t = t + 3, t ≥ 0. Let T
be the first time when the proportion of white balls is exactly 50% if such a time
exists, and T = ∞ if this is never the case. Namely T = min{t : W_t/(W_t + B_t) = 1/2}
if the set of such t is non-empty, and T = ∞ otherwise. Establish an upper bound
P(T < ∞) ≤ 2/3.

Proof. First, we will establish that X_t = W_t/(W_t + B_t), t ≥ 0 is a martingale.
Let F_t be a filtration to which X_t is adapted. Then we have that

E[X_{t+1} | F_t] = (W_t/(W_t + B_t)) · (1 + W_t)/(1 + W_t + B_t) + (B_t/(W_t + B_t)) · W_t/(1 + W_t + B_t)
= W_t(1 + W_t + B_t)/((W_t + B_t)(1 + W_t + B_t))
= W_t/(W_t + B_t) = X_t.

Thus X_t = W_t/(W_t + B_t), t ≥ 0 is a martingale. Since |X_{t∧T}| ≤ 1, the optional
stopping theorem gives that X_T is an almost surely well-defined random variable and
E[X_T] = E[X_0]. Thus, we have

(1/2) P(T < ∞) + α (1 − P(T < ∞)) = E[X_T] = E[X_0] = 2/3,   (1)

where α ∈ [0, 1] is the expected value of the limit of the fraction W_t/(W_t + B_t)
as t → ∞ conditioned on the event {T = ∞}; this limit exists by the martingale
convergence theorem. By (1), we have that P(T < ∞) < 1 (otherwise the left-hand
side of (1) would equal 1/2), thus

α = (2/3 − (1/2) P(T < ∞))/(1 − P(T < ∞)).

Since α ≤ 1, this gives

2/3 − (1/2) P(T < ∞) ≤ 1 − P(T < ∞),

and therefore P(T < ∞) ≤ 2/3.

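The bound P(T < ∞) ≤ 2/3 can also be checked by simulation. The sketch below (a rough Monte Carlo illustration of my own; the truncation horizon is an assumption, not part of the solution) runs the urn forward and records whether the white fraction ever hits 1/2. Truncating at a finite horizon can only underestimate P(T < ∞), so the estimate should sit below 2/3.

```python
import random

random.seed(1)

def hits_half(t_max):
    """Simulate the urn from (W, B) = (2, 1); return True if W == B by time t_max."""
    w, b = 2, 1
    for _ in range(t_max):
        if random.random() < w / (w + b):
            w += 1
        else:
            b += 1
        if w == b:
            return True
    return False

n_trials = 20_000
p_hat = sum(hits_half(300) for _ in range(n_trials)) / n_trials
print(p_hat)  # estimates P(T <= 300) <= P(T < inf) <= 2/3
```

Note that W = B is only possible at odd t (since W_t + B_t = t + 3 must be even), which the simulation respects automatically.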

6.265-15.070
Midterm Exam
Date: October 30, 2013
6pm-8pm
80 points total

Problem 1 (30 points). Suppose a random variable X is such that P(X > 1) = 0 and
P(X > 1 − ε) > 0 for every ε > 0. Recall that the large deviations rate function is defined to
be I(x) = sup_θ (θx − log M(θ)) for every real value x, where M(θ) = E[exp(θX)], for every
real value θ.
(a) Show that I(x) = ∞ for every x > 1.
(b) Show that I(x) < ∞ for every E[X] ≤ x < 1.
(c) Suppose lim_{ε→0} P(1 − ε ≤ X ≤ 1) = 0. Show that I(1) = ∞.

Problem 2 (20 points) Recall the following one-dimensional version of the Large Devia
tions Principle for finite-state Markov chains. Given an N-state Markov chain X_n, n ≥ 0
with transition matrix P_{i,j}, 1 ≤ i, j ≤ N and a function f : {1, . . . , N} → R, the se
quence S_n/n = Σ_{1≤i≤n} f(X_i)/n satisfies the Large Deviations Principle with the rate function
I(x) = sup_θ (θx − log ρ(P_θ)), where ρ(P_θ) is the Perron-Frobenius eigenvalue of the matrix
P_θ = (e^{θf(j)} P_{i,j}, 1 ≤ i, j ≤ N).
Suppose P_{i,j} = π_j for some probability vector π_j ≥ 0, 1 ≤ j ≤ N, Σ_j π_j = 1. Namely,
the observations X_n for n ≥ 1 are i.i.d. with the probability mass function given by π. In
this case we know that the large deviations rate function for the i.i.d. sequence f(X_n), n ≥ 1
is described by the moment generating function of f(X_n), n ≥ 1. Establish that the two large
deviations rate functions are identical, and thus the LDP for Markov chains in this case is
consistent with the LDP for i.i.d. processes.
HINT: consider the left-eigenvector of P_θ corresponding to the eigenvalue ρ(P_θ).

Problem 3 (30 points) The following two parts can be done independently.
(a) Suppose X_n, n ≥ 0 is a martingale such that the distribution of X_n is identical for all
n and the second moment of X_n is finite. Establish that X_n = X_0 almost surely for all
n.


(b) An urn contains two white balls and one black ball at time zero. At each time t =
1, 2, . . . exactly one ball is added to the urn. Specifically, if at time t ≥ 0 there are
W_t white balls and B_t black balls, the ball added at time t + 1 is white with probability
W_t/(W_t + B_t) and is black with the remaining probability B_t/(W_t + B_t). In particular,
since there were three balls at the beginning, and at every time t ≥ 1 exactly one ball
is added, then W_t + B_t = t + 3, t ≥ 0. Let T be the first time when the proportion of
white balls is exactly 50% if such a time exists, and T = ∞ if this is never the case.
Namely T = min{t : W_t/(W_t + B_t) = 1/2} if the set of such t is non-empty, and T = ∞
otherwise. Establish an upper bound P(T < ∞) ≤ 2/3.
Hint: Show that W_t/(W_t + B_t) is a martingale and use the Optional Stopping Theorem.

