Chapter 3

Conditional expectation

We will define conditional expectation E(X|A), where X is a r.v. and A is a sub-σ-algebra.
This will be used to define martingales later.

3.1 Basic concepts

Given (Ω, F, P), let A, B ∈ F with P(B) > 0, and let X, Y be two r.v.s.

• P(A) is a measure of uncertainty of an event A.

e.g., what is the chance of rain tomorrow?

• Conditional probability P(A|B) is an updated probability of A after getting new information B:

P(A|B) = P(A ∩ B) / P(B).

e.g., It has not rained for the past week; what is the chance of rain tomorrow?

• Without any information about X, the “best” guess of X is

– the population mean µ = EX (for a not-so-heavy-tailed distribution), since

EX = arg min_C E(C − X)².

– the population median m = F⁻¹(1/2) (for a heavy-tailed distribution), since

m = arg min_C E|C − X|.

– the population mode (for a really heavy-tailed distribution):

mode = arg max_C f(C).

e.g. The salary of a fresh university graduate is $10,000.

• Given some extra information B, the updated “best” guess of X is

– the sub-population (or sectional) mean µ_B = E(X|B) (for a not-so-heavy-tailed distribution), since

µ_B = arg min_C E{(C − X)² | B}.

Note: E(X|B) makes a better guess of X than EX in the L² sense in that

E[E(X|B) − X]² ≤ E[EX − X]².

– the sub-population (or sectional) median m_B = F⁻¹(1/2 | B) (for a heavy-tailed distribution), since

m_B = arg min_C E{|C − X| | B}.

– the sub-population (or sectional) mode (for a really heavy-tailed distribution):

mode_B = arg max_C f(C|B).

e.g. The salary of a fresh university graduate is $15,000, if (s)he works in fin-tech industry.

3.2 Examples of conditional probability

3.3 Definitions of E(X|Y), E(X|σ(Y)) and E(X|A)

Given (Ω, F, P), let A, B ∈ F with P(B) > 0, and let X, Y be two r.v.s.

• The conditional expectation of X given B (P(B) > 0):

E(X|B) = E(X I_B) / P(B),

where I_B(ω) = 1 if ω ∈ B, and 0 otherwise.

• Taking B = {Y = y}, we get

E(X|Y = y) = Σ_{i=1}^∞ x_i P(X = x_i | Y = y),   if X is a discrete r.v.,
           = ∫_{−∞}^{∞} x f_{X|Y}(x|y) dx,       if X is a continuous r.v.

An example

Example 3.1 Let X ∼ U(0, 1] with pdf f_X(x) = I_{(0,1]}(x). Let

A_i = {X ∈ ((i − 1)/n, i/n]},   i = 1, 2, ..., n.

Clearly, Ω = Σ_{i=1}^n A_i and P(A_i) = 1/n. From the above definition, we get

E(X|A_i) = ∫_{A_i} x f_X(x) dx / P(A_i) = n ∫_{(i−1)/n}^{i/n} x dx = (n/2) x² |_{(i−1)/n}^{i/n} = (2i − 1)/(2n).

The "best" guess for X, given the A_i's (or σ(A_1, ..., A_n)), is E(X|A_i) = (2i − 1)/(2n), i.e.,

X_n = E(X|A_1) = 1/(2n)          if A_1 occurs (with prob. n⁻¹)
    = E(X|A_2) = 3/(2n)          if A_2 occurs (with prob. n⁻¹)
    ..........
    = E(X|A_n) = (2n − 1)/(2n)   if A_n occurs (with prob. n⁻¹).

These are the updated expectations on the new space σ(A_1, ..., A_n). For example, when n = 5, X_n is a r.v. taking values 0.1, 0.3, 0.5, 0.7, 0.9 with equal probabilities 1/5.

Here is a graphical illustration: [figure: X on (0, 1] together with the step function E(X|σ(A_1, ..., A_n)), which equals (2i − 1)/(2n) on each A_i].
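As a sanity check on Example 3.1 (not part of the original notes), here is a minimal NumPy sketch: it simulates X ∼ U(0, 1], groups the draws by the events A_i, and compares the within-group averages with the closed form (2i − 1)/(2n); the choice n = 5 matches the numerical illustration above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                        # number of intervals A_1, ..., A_n
x = rng.uniform(0.0, 1.0, size=1_000_000)    # draws from X ~ U(0, 1]

for i in range(1, n + 1):
    in_Ai = (x > (i - 1) / n) & (x <= i / n)   # the event A_i = {X in ((i-1)/n, i/n]}
    mc = x[in_Ai].mean()                       # Monte Carlo estimate of E(X | A_i)
    exact = (2 * i - 1) / (2 * n)              # closed form from Example 3.1
    print(f"E(X|A_{i}): simulated {mc:.4f}, exact {exact:.4f}")
```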

Definition 3.1 Let h(y) = E(X|Y = y). Then define E(X|Y ) to be

Z := h(Y ).

Theorem 3.1 Let Y be a discrete r.v. taking values {y_1, ..., y_m} (m could be ∞). Then,

• Z is σ(Y)-measurable (i.e., it depends only on Y).

• E_A X = E_A Z, i.e., E[(X − Z) I_A] = 0, for all A ∈ σ(Y).

Proof. Note that σ(Y) = σ({Y = y_1}, ..., {Y = y_m}).

• {Z ≤ t} = {h(Y) ≤ t} = ∪_{k : h(y_k) ≤ t} {Y = y_k} ∈ σ(Y). So h(Y) is σ(Y)-measurable.

• For A_i = {Y = y_i}, we have

E(h(Y) I_{A_i}) = E(h(y_i) I_{A_i}) = h(y_i) P(A_i) = E(X|A_i) P(A_i) = E(X I_{A_i}).   (3.1)

For general A ∈ σ(Y) = σ(A_1, ..., A_m), it can be shown that A = ∪_{i∈I} A_i, where I ⊂ {1, ..., m}. Then,

E(X I_A) = E(X Σ_{i∈I} I_{A_i}) = Σ_{i∈I} E(X I_{A_i}) = Σ_{i∈I} E[h(Y) I_{A_i}]   (from (3.1))
         = E(h(Y) Σ_{i∈I} I_{A_i}) = E_A h(Y).

For a discrete r.v. Y, knowing Y is equivalent to knowing σ(Y). So we define

E(X|σ(Y)) =: E(X|Y).

Alternatively, the two properties in Theorem 3.1 can be used to define E(X|σ(Y)) for all r.v.s Y (discrete or otherwise).

Definition 3.2 If E|X| < ∞, we define E(X|σ(Y)) to be any r.v. Z such that:

• Z is σ(Y)-measurable, i.e., {Z ≤ t} ∈ σ(Y), ∀t ∈ R.

• E_A X = E_A Z, i.e., E[(X − Z) I_A] = 0, for all A ∈ σ(Y).

This definition can be generalized even further by replacing σ(Y) with a general sub-σ-algebra A.

Definition 3.3 Given (Ω, F, P) and a sub-σ-algebra A ⊂ F. If E|X| < ∞, we define E(X|A) to be any r.v. Z such that

1. Z is A-measurable (i.e., Z depends on A);

2. E_A X = E_A Z, i.e., E[(X − Z) I_A] = 0, for all A ∈ A.
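To make the two conditions of Definition 3.3 concrete, here is a small simulation sketch (an illustration with an arbitrarily chosen joint distribution, not taken from the notes). With A = σ(Y) for a discrete Y, the candidate Z = E(X|Y) is a function of Y (condition 1), and condition 2, E[(X − Z) I_A] = 0, only needs to be checked on the atoms A = {Y = y}:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000
y = rng.integers(0, 3, size=N)          # a discrete Y taking the values 0, 1, 2
x = y + rng.normal(0.0, 1.0, size=N)    # X depends on Y plus noise, so E(X|Y = v) is about v

# Z = E(X | sigma(Y)) is built as a function of Y: constant on each atom {Y = v}.
atom_means = np.array([x[y == v].mean() for v in range(3)])
z = atom_means[y]                        # condition 1: Z is sigma(Y)-measurable

# Condition 2: E[(X - Z) I_A] = 0 for every A in sigma(Y);
# it suffices to check it on the atoms A = {Y = v}.
for v in range(3):
    print(f"E[(X - Z) I_(Y={v})] = {np.mean((x - z) * (y == v)):.2e}")
```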

3.4 Existence and uniqueness of E(X|A)

Let

µ(A) := P(A) and ν(A) := E_A X.

Then µ and ν are two σ-finite measures on (Ω, A). Then E_A X = E_A Z can be rewritten as

ν(A) = ∫_A Z dµ.

Then the result follows from the Radon-Nikodym Theorem as shown below.

Theorem 3.2

• E(X|A) is the Radon-Nikodym derivative of the (signed) measure ν(A) := E_A X w.r.t. P.

• Thus, E(X|A) exists, and is unique (by the Radon-Nikodym Theorem).

Proof. Let A ∈ A, and ν(A) := E_A X.

(A) Suppose X ≥ 0. Then

– P and ν are two σ-finite measures on (Ω, A) (Exercise).

– ν is dominated by P, i.e., ν ≪ P.
  Proof. If P(A) = E I_A = 0, then I_A = 0 a.s., hence ν(A) = E(X I_A) = 0.

By the Radon-Nikodym Theorem, there exists a unique Z =: dν/dP such that

(a) Z is A-measurable;

(b) ν(A) =: E_A X = ∫_A Z dP = E_A Z, ∀A ∈ A.

(B) Suppose E|X| < ∞.

Write X = X⁺ − X⁻, where X⁺ = max(0, X) and X⁻ = −min(0, X). Apply (A) to X⁺ and X⁻, respectively, to get the desired result.
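The Radon-Nikodym construction is easy to see on a finite sample space. The following toy sketch (my own illustration with an arbitrary P, X and partition, not from the notes) builds ν(A) = E(X I_A), computes Z = dν/dP atom by atom as ν(atom)/P(atom), and checks the defining property ν(A) = ∫_A Z dP; the resulting Z is exactly the conditional expectation E(X|A):

```python
import numpy as np

# Omega = {0, ..., 5}, P given by point masses, X a fixed function on Omega,
# and A the sigma-algebra generated by the partition {0,1}, {2,3}, {4,5}.
p = np.array([0.1, 0.2, 0.1, 0.2, 0.3, 0.1])     # P({omega})
x = np.array([1.0, 3.0, -2.0, 4.0, 0.5, 2.5])    # X(omega)
partition = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

# nu(A) := E(X I_A); the Radon-Nikodym derivative Z = d(nu)/dP is constant on each atom:
# Z = nu(atom) / P(atom), i.e., the conditional expectation of X given that atom.
z = np.empty_like(x)
for atom in partition:
    z[atom] = np.sum(x[atom] * p[atom]) / np.sum(p[atom])

# Check the defining property nu(A) = integral_A Z dP on every atom (hence on all of A).
for atom in partition:
    nu_A = np.sum(x[atom] * p[atom])
    int_Z = np.sum(z[atom] * p[atom])
    print(f"atom {atom.tolist()}: nu(A) = {nu_A:.4f}, integral_A Z dP = {int_Z:.4f}")
```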

3.5 Geometric interpretation of E(X|A)

Define L²(Ω, G, P) = {r.v.s W on (Ω, G, P) : EW² < ∞}. Then E(X|A) has a very nice interpretation.

Theorem 3.3 We have

E(X|A) = arg min_{Y ∈ L²(Ω, A, P)} E(X − Y)².

Hence, E(X|A) is the projection of X onto L²(Ω, A, P), i.e., the closest point in this subspace to X.

Proof. Note that

E(X − Y)² = E[(X − E(X|A)) + (E(X|A) − Y)]²
          = E(X − E(X|A))² + E(E(X|A) − Y)² + 2C,

where C = 0 (to be shown later). Therefore,

E(X − Y)² = E(X − E(X|A))² + E(E(X|A) − Y)² ≥ E(X − E(X|A))²,

and the equality is attained at Y = E(X|A) ∈ A.

It remains to show C = 0:

C = E[(X − E(X|A))(E(X|A) − Y)]
  = E{E[(X − E(X|A))(E(X|A) − Y)|A]}
  = E{[E(X|A) − Y] E[(X − E(X|A))|A]}
  = E{(E(X|A) − Y) × 0}
  = 0.
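A quick simulation sketch of Theorem 3.3 (not from the notes; the joint law of (X, Y) below is an arbitrary choice with E(X|Y) = Y²): among several σ(Y)-measurable candidates g(Y), the conditional mean attains the smallest value of E(X − g(Y))².

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500_000
y = rng.normal(0.0, 1.0, size=N)
x = y**2 + rng.normal(0.0, 1.0, size=N)   # so E(X | Y) = Y^2

candidates = {
    "E(X|Y) = Y^2": y**2,                 # the projection of X onto L^2(sigma(Y))
    "Y": y,                               # other sigma(Y)-measurable guesses
    "1 + Y": 1.0 + y,
    "EX (constant)": np.full(N, np.mean(x)),
}
for name, g in candidates.items():
    print(f"{name:>15}:  E(X - g(Y))^2 ~ {np.mean((x - g) ** 2):.4f}")
# The conditional mean gives the smallest mean squared error, as Theorem 3.3 asserts.
```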

Alternative geometric definition of E(X|A)

One can define E(X|A) by using the above projection in L²(Ω, A, P), and then use a truncation idea to generalize to L¹(Ω, A, P).

1. Consider everything in L²(Ω, A, P). We have

E_A Z = E_A X, ∀A ∈ A
⟺ E[(X − Z) I_A] = 0, ∀A ∈ A
⟺ E[(X − Z) Y] = 0, ∀Y ∈ L²(Ω, A, P)
⟺ (X − Z) ⊥ Y, ∀Y ∈ L²(Ω, A, P)
   (Here ⊥ means perpendicular, not independent.)
⟺ Z = arg min_{Y ∈ L²(Ω, A, P)} E(X − Y)².

2. Next consider everything in L¹(Ω, A, P). Define

Z_n = E(X⁺ ∧ n | A) − E(X⁻ ∧ n | A),

and Z = lim_n Z_n.

We no longer have the concept of projection in L¹(Ω, A, P), but it can be shown that

E_A Z = E_A X, ∀A ∈ A.

Although it is tempting to define E(X|A) using this geometric argument, it is simpler to define it by a more abstract definition, as we did before.

3.6 Properties of E(X|A)

Recall that, in order to show that E(W |A) = Z, it suffices to show that
(a) Z ∈ A;

(b) E_A W = E_A Z, for any A ∈ A.
Properties of E(X|A) are very much like their unconditional counterparts. Here are some examples; a small simulation check of several of them is sketched after the list.

1. (Linearity). E(aX + bY |A) = aE(X|A) + bE(Y |A).


Proof. Take Z := aE(X|A) + bE(Y|A) and W = aX + bY.

(a) Clearly Z ∈ A since E(X|A) ∈ A and E(Y|A) ∈ A.

(b) For all A ∈ A, we have

E_A(aX + bY) = aE_A X + bE_A Y = aE_A(E(X|A)) + bE_A(E(Y|A))
             = E_A(aE(X|A) + bE(Y|A)) = E_A Z.

2. (Monotonicity). If X < Y, then E(X|A) ≤ E(Y|A).

[Equivalently, if Y > 0, then E(Y|A) ≥ 0.]
Proof. Since X < Y, we have E_A[E(X|A)] = E_A X ≤ E_A Y = E_A[E(Y|A)]. Take
A = {ω : E(X|A) − E(Y|A) ≥ ε > 0} ∈ A. Then P(A) = 0, ∀ε > 0, because otherwise

E_A[E(X|A)] − E_A[E(Y|A)] = E_A[E(X|A) − E(Y|A)] ≥ E_A ε = εP(A) > 0,

which is a contradiction.

3. (Jensen's inequality). If φ is convex, E|X| < ∞ and E|φ(X)| < ∞, then

φ(E(X|A)) ≤ E(φ(X)|A).

Proof. Take µ_A = E(X|A) ∈ A. Since φ is convex, there exists ρ_{µ_A} such that

φ(x) ≥ φ(µ_A) + ρ_{µ_A}(x − µ_A), ∀x ∈ R. (6.2)

Putting x = X and taking conditional expectations, we get

E(φ(X)|A) ≥ E(φ(µ_A)|A) + E[ρ_{µ_A}(X − µ_A)|A]
          = φ(µ_A) + ρ_{µ_A}(µ_A − µ_A)
          = φ(µ_A) = φ(E(X|A)).

4. (Perfect information). If X ∈ A, then E(X|A) = X. In particular, E(C|A) = C.

That is, if we know X, then our best "guess" of X is itself.
Proof. Take Z = X; clearly,

(a) Z ∈ A by assumption;
(b) for all A ∈ A, we have E_A(Z) = E_A(X).

5. (Partial information: taking out what is known.) If X ∈ A, and E|Y| < ∞ and
E|XY| < ∞, then E(XY|A) = XE(Y|A).
Proof. Let Z = XE(Y|A).

(a) Clearly Z ∈ A since X, E(Y|A) ∈ A.

(b) For all A ∈ A, we need to show that

E_A Z =: E_A[XE(Y|A)] = E_A(XY). (6.3)

Proof of (b). We prove this result in the most typical manner.

(i) First suppose that X = I_B with B ∈ A. Since A ∈ A, we have A ∩ B ∈ A. Therefore, (6.3) holds since

E_A[I_B E(Y|A)] = E_{A∩B}[E(Y|A)] = E_{A∩B}(Y) = E_A(I_B Y).

(ii) We can extend (i) to simple X by linearity.

(iii) We can extend (ii) to nonnegative X and Y by the monotone convergence theorem.

(iv) We can extend (iii) to general r.v.s X, Y by splitting them into positive and negative parts.

6. (No information). If X is independent of A, then E(X|A) = E(X).

i.e., if we don't know anything about X, then our best "guess" of X is EX.
Proof. Take Z = EX; clearly,

(a) Clearly EX ∈ A.
(Any constant C is A-measurable as {C ≤ t} is either ∅ or Ω.)
(b) For all A ∈ A, we have

E_A(Z) = E_A(EX) = (EX) E_A 1 = (EX) E I_A = E(X I_A) = E_A X,

where the second-to-last equality follows from the independence of X and A ∈ A.

7. (Tower property). If A_1 ⊂ A_2 ⊂ F, then

(i) E(X|A_1) = E[E(X|A_1)|A_2];

(ii) E(X|A_1) = E[E(X|A_2)|A_1].

That is, the smaller σ-algebra always wins.

Proof.

(i) Notice that E(X|A_1) ∈ A_1 ⊂ A_2, so (i) follows from Property 4 (perfect information).

(ii) Let Y = E(X|A_2) and Z = E(X|A_1). We need to show that Z = E[Y|A_1].
(a) Clearly Z ∈ A_1.
(b) For all A ∈ A_1, we also have A ∈ A_2, and so E_A Z = E_A X = E_A Y.

8. (Average of local averages is the global average, or double expectation rule.)

EX = E(E(X|A)).
Proof. Note that E_A X = E_A(E(X|A)) whenever A ∈ A. Take A = Ω.
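Here is the simulation check promised before the list (a rough sketch with an assumed toy distribution, not part of the notes). With nested σ-algebras A_1 = σ(Y_1) ⊂ A_2 = σ(Y_2) generated by discrete labels, empirical conditional expectations are just within-atom averages, and properties 3 (Jensen), 7 (tower) and 8 (double expectation) can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1_000_000
y2 = rng.integers(0, 6, size=N)          # fine label: A_2 = sigma(Y2)
y1 = y2 // 2                             # coarse label: A_1 = sigma(Y1), a sub-sigma-algebra of A_2
x = np.sin(y2) + rng.normal(0.0, 1.0, size=N)

def cond_mean(values, labels):
    """Empirical E(values | labels): replace each point by the average over its atom."""
    out = np.empty_like(values, dtype=float)
    for v in np.unique(labels):
        out[labels == v] = values[labels == v].mean()
    return out

e_x_a1 = cond_mean(x, y1)                # E(X | A_1)
e_x_a2 = cond_mean(x, y2)                # E(X | A_2)

# Property 8 (double expectation): E[E(X|A_1)] = EX.
print("EX          :", x.mean())
print("E[E(X|A_1)] :", e_x_a1.mean())

# Property 7 (tower): E[E(X|A_2)|A_1] = E(X|A_1).
print("max |E[E(X|A_2)|A_1] - E(X|A_1)| =",
      np.max(np.abs(cond_mean(e_x_a2, y1) - e_x_a1)))

# Property 3 (Jensen with phi(x) = x^2): (E(X|A_1))^2 <= E(X^2|A_1) on every atom.
print("min of E(X^2|A_1) - (E(X|A_1))^2 =",
      np.min(cond_mean(x**2, y1) - e_x_a1**2))
```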

Appendix: An introduction to probability

Sets

Definition 3.4 Ω is a sample space.

• A set is a collection of elements in Ω, denoted by A, B, ...

• A class (or family) is a collection of sets of Ω, denoted by

A, B, C, D, E, F, G, ...

• ω ∈ A : ω is an element of A.

• A ⊂ B : the set A is contained in the set B.

• The empty set ∅ ⊂ A for any set A.

Definition 3.5 Define

Aᶜ = {ω : ω ∉ A} (complement)
A ∪ B = {ω : ω ∈ A or ω ∈ B} (union)
A ∩ B = {ω : ω ∈ A and ω ∈ B} (intersection)
∪_{n=1}^∞ A_n = {ω : ω ∈ A_n for some n}
∩_{n=1}^∞ A_n = {ω : ω ∈ A_n for all n}.

Probability measure and random variables

A probability is a (set) function from a σ-algebra to [0, 1]. It is defined on σ-algebras, since the power set P(Ω) can be too large unless Ω is finite or countable.

Definition: A class F is a σ-algebra if

• Ω ∈ F,

• Aᶜ ∈ F whenever A ∈ F,

• ∪_{n=1}^∞ A_n ∈ F whenever A_n ∈ F, n ≥ 1.

Definition: P is a probability measure on (Ω, F) if

• ∀A ∈ F, 0 ≤ P(A) ≤ 1,

• P(Ω) = 1,

• P(Σ_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n) for disjoint A_n ∈ F (here Σ denotes a disjoint union).

(Ω, F, P) is called the probability space.

Definition 3.6 A random variable X : Ω → R is a measurable function, i.e.,

{ω : X(ω) ≤ r} ∈ F.

Example 3.2 Toss a coin. We can construct a probability space (Ω, F, P):

Ω = {H, T}, where "H = Head" and "T = Tail",
F = {all possible subsets of Ω} = {∅, {H}, {T}, Ω},
P(∅) = 0, P({H}) = 1/2, P({T}) = 1/2, P(Ω) = 1.

Here ω = H or T, and |F| = 2² = 4. Let

X = X(ω) = I{ω = H} = I{the toss is "Head"}.

It can be shown that X is a random variable (r.v.), since

{X = 0} = {ω : X(ω) = 0} = {T} ∈ F,   {X = 1} = {ω : X(ω) = 1} = {H} ∈ F.

Expectation

Definition 3.7 Let X be a r.v. on (Ω, A, P).

• For a simple r.v. X = Σ_{i=1}^n a_i I_{A_i} with Σ_{i=1}^n A_i = Ω, A_i ∈ A, define

EX = Σ_{i=1}^n a_i P(A_i).

• For X ≥ 0, there exist simple nonnegative r.v.s X_n(ω) ↗ X(ω) for every ω. We define

EX = lim_{n→∞} EX_n ≤ ∞

(see the sketch following this definition).

• For a general r.v. X, if either EX⁺ < ∞ or EX⁻ < ∞, then define

EX = EX⁺ − EX⁻,

where X⁺ = max{X, 0} = X I_{X ≥ 0}, X⁻ = max{−X, 0} = −X I_{X ≤ 0}.
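Here is the sketch referred to above (an illustration with an assumed distribution, not from the notes): for X ∼ Exponential(1), so that the exact answer EX = 1 is known, the truncated dyadic approximations X_n = min(⌊2ⁿX⌋/2ⁿ, n) are simple r.v.s, and EX_n = Σ_k a_k P(A_k) increases to EX as n grows.

```python
import numpy as np

cdf = lambda t: 1.0 - np.exp(-t)        # CDF of the Exponential(1) distribution, so EX = 1

for n in range(1, 7):
    # X_n takes the value k/2^n on the k-th dyadic interval of length 2^-n (k < n*2^n),
    # and the value n on {X > n}; boundaries do not matter since X is continuous.
    k = np.arange(0, n * 2**n)
    probs = cdf((k + 1) / 2**n) - cdf(k / 2**n)      # P(X in the k-th dyadic interval)
    e_xn = np.sum((k / 2**n) * probs) + n * (1.0 - cdf(n))
    print(f"n = {n}:  E(X_n) = {e_xn:.6f}")
# E(X_n) increases with n and converges to EX = 1.
```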

Properties of expectations

Assume that X, Y, X_1, ..., X_n below are all r.v.s on (Ω, A, P).

• (Absolute integrability). EX is finite if and only if E|X| is finite.

• (Linearity). E(aX + bY) = aEX + bEY.

• (σ-additivity). If A = Σ_{i=1}^∞ A_i, then E_A X = Σ_{i=1}^∞ E_{A_i} X.

• (Positivity). If X ≥ 0 a.s., then EX ≥ 0.

• (Monotonicity). If X_1 ≤ X ≤ X_2 a.s., then EX_1 ≤ EX ≤ EX_2.

• (Mean value theorem). If a ≤ X ≤ b a.s. on A ∈ A, then aP(A) ≤ E_A X ≤ bP(A).

• (Modulus inequality). |EX| ≤ E|X|.

• (Fatou's Lemma). If X_n ≥ 0 a.s., then E(lim inf_n X_n) ≤ lim inf_n EX_n.

• (Monotone Convergence Theorem). If 0 ≤ X_n ↗ X, then lim_n EX_n = EX = E lim_n X_n.

• (Dominated Convergence Theorem). If X_n → X a.s., |X_n| < Y a.s. for all n, and EY < ∞, then lim_n EX_n = EX = E lim_n X_n.

• (Integration term by term). If Σ_{n=1}^∞ E|X_n| < ∞, then Σ_{n=1}^∞ |X_n| < ∞ a.s., so that Σ_{n=1}^∞ X_n converges a.s., and E(Σ_{n=1}^∞ X_n) = Σ_{n=1}^∞ EX_n.

Convergence Concepts

Definition 3.8 Let X, X_1, X_2, ... be r.v.s on (Ω, A, P). We say that

1. X_n → X a.s. if P(lim_{n→∞} X_n = X) = P({ω : lim_{n→∞} X_n(ω) = X(ω)}) = 1.

2. X_n → X in rth mean (r > 0) if lim_{n→∞} E|X_n − X|^r = 0.

3. X_n → X in prob. if lim_{n→∞} P(|X_n − X| > ε) = 0, for all ε > 0.

4. X_n → X in distribution if lim_{n→∞} F_{X_n}(x) = F_X(x) for all continuity points of F_X(x).

Difference between a.s. convergence and convergence in probability

Conceptually, the mode of almost sure convergence keeps track of the values of the random variables at the same sample point as n increases. It requires the convergence of X_n(ω) at almost all sample points. The mode of convergence in probability requires that the event on which X_n and X differ by more than a fixed amount shrinks in probability. This event can be different when n = 10 and when n = 11, and so on. Because of this, convergence in probability does not imply convergence of X_n(ω) for any ω ∈ Ω. The following classical example is a vivid way to illustrate this point.

Example. Let Ω = [0, 1]. Let F be the classical Borel σ-algebra on [0, 1] and P be the uniform probability measure.

For m = 0, 1, 2, ... and k = 0, 1, ..., 2^m − 1, let

X_{2^m + k}(ω) = I{k 2^{−m} < ω ≤ (k + 1) 2^{−m}}.

In plain words, we have defined a sequence of random variables made of indicator functions on intervals of shrinking length 2^{−m}. Yet the union of every 2^m such intervals completely covers the sample space [0, 1] as m increases.

It is seen that

P(|X_n| > 0) ≤ 2^{−m},

where m = ⌊log n / log 2⌋ ≥ log n / log 2 − 1. Hence as n → ∞, P(|X_n| > 0) → 0. This implies X_n → 0 in probability.

At the same time, the sequence

X_1(ω), X_2(ω), X_3(ω), ...

contains infinitely many 0's and 1's for any ω. Thus none of these sequences converges. In other words,

P({ω : X_n(ω) converges}) = 0.

Hence, X_n does not converge to 0 in the mode of "almost surely".
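A short numerical illustration of this example (not part of the notes): for one fixed sample point ω, the indices n with X_n(ω) = 1 keep recurring (exactly one per block 2^m ≤ n < 2^{m+1}), so X_n(ω) does not converge, even though P(|X_n| > 0) = 2^{−m} → 0.

```python
import numpy as np

def X(n, w):
    """The 'typewriter' variable X_n(w) = I{k 2^-m < w <= (k+1) 2^-m}, n = 2^m + k."""
    m = int(np.floor(np.log2(n)))        # n = 2^m + k with 0 <= k < 2^m
    k = n - 2**m
    return 1.0 if k * 2.0**(-m) < w <= (k + 1) * 2.0**(-m) else 0.0

w = 0.37                                  # one fixed sample point
ones = [n for n in range(1, 5000) if X(n, w) == 1.0]
print("indices n <= 5000 with X_n(w) = 1:", ones)     # never stops hitting 1
print("but P(|X_n| > 0) at n = 4096 is 2^-12 =", 2.0**-12)
```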

Relationships between modes of convergence

Theorem 3.4

1. X_n → X a.s. or in L^r (r ≥ 1) ⟹ X_n →_p X ⟹ X_n →_d X.
2. If r > s > 0, then X_n → X in L^r ⟹ X_n → X in L^s.
3. No other implications hold in general.

We also have some partial converses to the above results, i.e., converse results with some
additional assumptions. Here are a few examples.

Theorem 3.5 X_n →_d C ⟺ X_n →_p C, where C is a constant.

Theorem 3.6 (Lebesgue Dominated Convergence Theorem) If X_n →_p X, |X_n| ≤ Y a.s. for all n, and EY^r < ∞ for r > 0, then X_n → X in L^r, which in turn implies that EX_n^r → EX^r.

Theorem 3.7 (Dominated convergence: a.s. convergence implies convergence in mean) If X_n → X a.s., P(|X_n| ≤ Y) = 1 for all n, and EY^r < ∞ for r > 0, then X_n → X in L^r.

Theorem 3.8 (Convergence in probability sufficiently fast implies a.s. convergence) If Σ_{n=1}^∞ P(|X_n − X| > ε) < ∞ for all ε > 0, then X_n → X a.s.

Theorem 3.9 (Sequences converging in probability contain a.s. convergent subsequences) If X_n →_p X, then there exist non-random integers n_1 < n_2 < ... such that X_{n_i} → X a.s.

Theorem 3.10 (Skorokhod's representation theorem) Suppose that X_n →_d X. Then there exist r.v.s Y and {Y_n, n ≥ 1} on ((0, 1), B_{(0,1)}, P = λ_{(0,1)}, the Lebesgue (uniform) measure) s.t.

(1) Y_n and Y have the same d.f.s as X_n and X. That is, X_n =_d Y_n, X =_d Y.

(2) Y_n → Y a.s. as n → ∞.

Definition 3.9 A sequence of r.v.s {Y_n} is uniformly integrable (u.i.) if

lim_{C→∞} sup_n E{|Y_n| I_{|Y_n| > C}} = 0.

Theorem 3.11 (Vitali's Theorem) Suppose that X_n →_p X, and E|X_n|^r < ∞ for all n (i.e., X_n ∈ L^r). Then the following three statements are equivalent.

(i) {|X_n|^r} is u.i.;

(ii) X_n → X in L^r, and E|X|^r < ∞;

(iii) E|X_n|^r → E|X|^r < ∞.

Some more useful theorems are given below.

Theorem 3.12 (Continuous mapping theorem) Let X_1, X_2, ... and X be k-dim random vectors, and g : R^k → R be continuous. Then

(a). X_n → X a.s. ⟹ g(X_n) → g(X) a.s.

(b). X_n →_p X ⟹ g(X_n) →_p g(X).

(c). X_n →_d X ⟹ g(X_n) →_d g(X).

Theorem 3.13 (Slutsky's Theorem) Let X_n →_d X, Y_n →_p C (i.e., Y_n →_d C). Then

(a). X_n + Y_n →_d X + C.

(b). X_n Y_n →_d CX.

(c). X_n / Y_n →_d X/C if C ≠ 0.

Skorokhod’s
representation
theorem
a.s.
-
Xn X
6AA
AA
AA fast enough or
AA subsequence
AA
AA
AA Len on
AA
AA
AA
AA
AA
AA P d
u.i. fast X - X > Xn - X
n
enough

X=C

u.i. - necessary and sufficient


DCT - sometimes too strong

? r
L-
L.TT
Xn X<

u.i.

26
Radon-Nikodym theorem

Definition 3.10 Given two measures µ and ν on (Ω, F), we say that ν is absolutely continuous w.r.t. µ, written as ν ≪ µ (i.e., ν is dominated by µ), if

µ(A) = 0 implies ν(A) = 0.

Theorem 3.14 (Radon-Nikodym Theorem) If ν ≪ µ and µ is σ-finite, there exists a unique F-measurable function f ≥ 0 so that

ν(A) = ∫_A f dµ, ∀A ∈ F.

(f is called the "Radon-Nikodym derivative", often written as dν/dµ.)

Remark 3.1 Clearly, ν ≪ µ and µ is σ-finite ⟺ ∃ f ≥ 0 : ν(A) = ∫_A f dµ.

σ-algebra generated by r.v.s

Definition. Given (Ω, F, P). A is a subset of F. Let X be a r.v.

1. σ(A) is the smallest σ-algebra containing A.

2. σ(X) := σ(∪_{B∈B} {X ∈ B}) = σ(X⁻¹(B)), where B is the Borel σ-algebra.

3. σ(X_1, ..., X_n) := σ(σ(X_1) ∪ ... ∪ σ(X_n)). It is easy to see

σ(X_1) ⊂ σ(X_1, X_2) ⊂ ...... ⊂ σ(X_1, ..., X_n).

Example 3.3 Toss a coin twice. Then we can construct a probability space (Ω, F, P), where

Ω = {HH, HT, TH, TT},
F = {all possible subsets of Ω}
  = {∅, {HH}, {HT}, {TH}, {TT}, {HH, HT}, ...., Ω},   |F| = 2⁴ = 16,
P(HH) = P(HT) = P(TH) = P(TT) = 1/4.

Let X = I{first toss is Head}, Y = I{second toss is Head}. Now

{X = 0} = {ω : X(ω) = 0} = {TH, TT} ∈ F,

{X = 1} = {ω : X(ω) = 1} = {HH, HT} ∈ F.

Now

σ(X) = σ(X = 0, X = 1) = σ({X = 1}ᶜ, {X = 1}) = σ({X = 1})
     = σ({HH, HT}) = {∅, Ω, {HH, HT}, {TH, TT}},
σ(Y) = σ(Y = 0, Y = 1) = σ({Y = 1}ᶜ, {Y = 1}) = σ({Y = 1})
     = σ({HH, TH}) = {∅, Ω, {HH, TH}, {HT, TT}}.

Similarly, it can also be shown that

σ(X, Y) = σ({X = 1}, {Y = 1}) = .... = F.

σ(X) and σ(Y) contain information about possible outcomes for the first and second tosses, respectively. Knowing both σ(X) and σ(Y) (i.e., information about both tosses), given by σ(X, Y), we get F.

More on σ-fields generated by a discrete r.v.

Theorem 3.15 Let X be a discrete r.v. taking distinct values {x_i, 1 ≤ i ≤ n} (where n could be ∞) and let A_i = {ω : X(ω) = x_i}. Then,

• {A_i, i ≥ 1} constitutes a disjoint partition of Ω.

• Choose C = {A_1, A_2, ..., A_n}; then

σ(C) = σ(A_1, A_2, ..., A_n) = σ(A_0, A_1, ..., A_n) = {∪_{i∈I} A_i : I ⊂ {0, 1, ..., n}},

where A_0 = ∅. (Hint: show that the RHS forms a σ-algebra.)
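A small sketch of Theorem 3.15 on a toy partition (my own example, not from the notes): σ(C) is produced by listing all unions of atoms, and the hint can be checked directly by verifying closure under complements and unions.

```python
from itertools import combinations

# Partition C = {A_1, A_2, A_3} of Omega = {1, ..., 6}; atoms as frozensets.
atoms = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
omega = frozenset().union(*atoms)

# sigma(C) = { union of A_i over i in I : I subset of {0, 1, ..., n} }, with A_0 = empty set.
sigma_C = set()
for r in range(len(atoms) + 1):
    for subset in combinations(atoms, r):
        sigma_C.add(frozenset().union(*subset))

print("|sigma(C)| =", len(sigma_C), "= 2^3")
# Check it really is a sigma-algebra: contains Omega and the empty set,
# and is closed under complements and (finite) unions.
assert omega in sigma_C and frozenset() in sigma_C
assert all(omega - A in sigma_C for A in sigma_C)
assert all(A | B in sigma_C for A in sigma_C for B in sigma_C)
print("closed under complements and unions: True")
```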

3.7 Exercises
1. Let Var(X|A) = E(X²|A) − (E(X|A))². Show that

Var(X) = E(Var(X|A)) + Var(E(X|A)).

Remark 3.2 From the exercise, we get Var(X) ≥ Var(E(X|A)). That is, smoothing by local averaging (i.e., E(X|A)) reduces variance.

2. Show that if X and Y are r.v.s with E(Y|A) = X and EX² = EY² < ∞, then X = Y a.s. (i.e., P(X = Y) = 1). [Hint: Work out E(X − Y)².]
