Conditional Expectation
Now we review the discrete case. This was section 2.5 in the book. In some sense it
is simpler than the continuous case. Everything comes down to the very first definition
involving conditioning. For events A and B,

P(A|B) = P(A ∩ B) / P(B)
assuming that P (B) > 0. If X is a discrete RV, the conditional density of X given the
event B is
f(x|B) = P(X = x|B) = P(X = x, B) / P(B)
and the conditional expectation of X given B is
E[X|B] = Σ_x x f(x|B)
The partition theorem says that if Bn is a partition of the sample space then
E[X] = Σ_n E[X|Bn] P(Bn)
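Both definitions above are easy to machine-check. Here is a minimal sketch (the fair-die example is hypothetical, not from the notes) that computes E[X|B] from the conditional pmf and confirms the partition theorem in exact rational arithmetic:

```python
from fractions import Fraction as F

# X is a fair die roll; partition the sample space into evens and odds.
pmf = {x: F(1, 6) for x in range(1, 7)}
partition = [{2, 4, 6}, {1, 3, 5}]

def cond_exp(pmf, B):
    """E[X|B] = sum_x x f(x|B) with f(x|B) = P(X = x, B)/P(B)."""
    pB = sum(p for x, p in pmf.items() if x in B)
    return sum(x * p / pB for x, p in pmf.items() if x in B)

# Partition theorem: E[X] = sum_n E[X|Bn] P(Bn)
EX = sum(x * p for x, p in pmf.items())
total = sum(cond_exp(pmf, B) * sum(p for x, p in pmf.items() if x in B)
            for B in partition)
print(EX, total)  # both 7/2
```

Here E[X|B1] = 4 and E[X|B2] = 3, each weighted by 1/2, recovering E[X] = 7/2.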
Now suppose that X and Y are discrete RV’s. If y is in the range of Y then Y = y is
an event with nonzero probability, so we can use it as the B in the above. So f(x|Y = y)
is defined. We can change the notation to make it look like the continuous case and write
f (x|Y = y) as fX|Y (x|y). Of course it is given by
fX|Y(x|y) = P(X = x, Y = y) / P(Y = y) = fX,Y(x, y) / fY(y)
This looks identical to the formula in the continuous case, but it is really a different
formula. In the above fX,Y and fY are pmf’s; in the continuous case they are pdf’s. With
this notation we have
E[X|Y = y] = Σ_x x fX|Y(x|y)
Now consider the map ω → E[X|Y = y], where y = Y(ω).
So this is a random variable. It is usually written as E[X|Y ]. In our example ω is mapped
to (y − 1)/5 where y = Y (ω). So ω is mapped to (Y (ω) − 1)/5. So the random variable
E[X|Y ] is just (Y − 1)/5. Note that E[X|Y ] is a function of Y . This will be true in
general.
We try another conditional expectation in the same example: E[X²|Y]. Again, given
Y = y, X has a binomial distribution with n = y − 1 trials and p = 1/5. The variance of
such a random variable is np(1 − p) = (4/25)(y − 1). So

E[X²|Y = y] − (E[X|Y = y])² = (4/25)(y − 1)
Using what we found before,
E[X²|Y = y] − ((1/5)(y − 1))² = (4/25)(y − 1)
And so
E[X²|Y = y] = (1/25)(y − 1)² + (4/25)(y − 1)
Thus
E[X²|Y] = (1/25)(Y − 1)² + (4/25)(Y − 1) = (1/25)(Y² + 2Y − 3)
Once again, E[X²|Y] is a function of Y.
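Since E[X²|Y = y] should be the second moment of a Binomial(y − 1, 1/5) random variable, we can confirm the formula above directly from the binomial pmf, in exact arithmetic:

```python
from fractions import Fraction as F
from math import comb

def binom_second_moment(n, p):
    """E[X^2] for X ~ Binomial(n, p), computed directly from the pmf."""
    return sum(x * x * comb(n, x) * p**x * (1 - p)**(n - x)
               for x in range(n + 1))

p = F(1, 5)
for y in range(2, 8):
    formula = F(1, 25) * (y - 1)**2 + F(4, 25) * (y - 1)
    assert binom_second_moment(y - 1, p) == formula
print("E[X^2|Y=y] formula confirmed for y = 2..7")
```

This is just the identity E[X²] = np(1 − p) + (np)² specialized to n = y − 1 and p = 1/5.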
Intuition: E[X|Y] is the function of Y that best approximates X. This is a vague
statement since we have not said what “best” means. We consider two extreme cases.
First suppose that X is itself a function of Y , e.g., Y 2 or eY . Then the function of Y that
best approximates X is X itself. (Whatever best means, you can’t do any better than
this.) The other extreme case is when X and Y are independent. In this case, knowing Y
tells us nothing about X. So we might expect that E[X|Y ] will not depend on Y . Indeed,
we have
Theorem 1 Let X, Y, Z be random variables, a, b ∈ R, and g : R → R. Assuming all the
following expectations exist, we have
(i) E[a|Y ] = a
(ii) E[aX + bZ|Y ] = aE[X|Y ] + bE[Z|Y ]
(iii) E[X|Y ] ≥ 0 if X ≥ 0.
(iv) E[X|Y ] = E[X] if X and Y are independent.
(v) E[E[X|Y ]] = E[X]
(vi) E[Xg(Y )|Y ] = g(Y )E[X|Y ]. In particular, E[g(Y )|Y ] = g(Y ).
(vii) E[X|Y, g(Y )] = E[X|Y ]
(viii) E[E[X|Y, Z]|Y ] = E[X|Y ]
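Before the proofs, here is a quick exact check of property (v) on a small joint pmf. The particular probabilities are arbitrary (a hypothetical example, not from the notes):

```python
from fractions import Fraction as F

# A small joint pmf for (X, Y); the values are arbitrary.
joint = {(0, 0): F(1, 8), (0, 1): F(1, 4), (1, 0): F(3, 8), (1, 1): F(1, 4)}

def E_X_given_Y(y):
    """E[X | Y = y] computed from the joint pmf."""
    pY = sum(p for (x, yy), p in joint.items() if yy == y)
    return sum(x * p / pY for (x, yy), p in joint.items() if yy == y)

# Property (v): E[E[X|Y]] = E[X]
EX = sum(x * p for (x, y), p in joint.items())
tower = sum(E_X_given_Y(y) * sum(p for (xx, yy), p in joint.items() if yy == y)
            for y in (0, 1))
assert tower == EX
print(EX)  # 5/8
```

The weighted average of E[X|Y = 0] = 3/4 and E[X|Y = 1] = 1/2 indeed recovers E[X] = 5/8.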
Partial proofs: The first three are not hard to prove, and we leave them to the reader.
Consider (iv). We prove the continuous case and leave the discrete case to the reader.
If X and Y are independent, then fX|Y(x|y) = fX(x) for every y, so E[X|Y = y] = ∫ x fX(x) dx = E[X], and hence E[X|Y] = E[X].
Consider (v). Suppose that the random variables are discrete. We need to compute
the expected value of the random variable E[X|Y ]. It is a function of Y and it takes on
the value E[X|Y = y] when Y = y. So by the law of the unconscious statistician,
E[E[X|Y]] = Σ_y E[X|Y = y] P(Y = y)
By the partition theorem this is equal to E[X]. So in the discrete case, (v) is really the
partition theorem in disguise. The same is true in the continuous case.
Consider (vi). We must compute E[Xg(Y )|Y = y]. Given that Y = y, the possible
values of Xg(Y ) are xg(y) where x varies over the range of X. The probability of the
value xg(y) given that Y = y is just P (X = x|Y = y). So
E[Xg(Y)|Y = y] = Σ_x x g(y) P(X = x|Y = y)
  = g(y) Σ_x x P(X = x|Y = y) = g(y) E[X|Y = y]
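The computation in the proof of (vi) can be replayed on a small joint pmf. The pmf and the choice g(y) = y² are arbitrary (hypothetical, not from the notes):

```python
from fractions import Fraction as F

# Joint pmf for (X, Y) (arbitrary values) and g(y) = y^2.
joint = {(1, 2): F(1, 6), (2, 2): F(1, 3), (1, 3): F(1, 4), (3, 3): F(1, 4)}
g = lambda y: y * y

def cond(h, y):
    """E[h(X, Y) | Y = y] computed from the joint pmf."""
    pY = sum(p for (x, yy), p in joint.items() if yy == y)
    return sum(h(x, yy) * p / pY for (x, yy), p in joint.items() if yy == y)

for y in (2, 3):
    lhs = cond(lambda x, yy: x * g(yy), y)   # E[X g(Y) | Y = y]
    rhs = g(y) * cond(lambda x, yy: x, y)    # g(y) E[X | Y = y]
    assert lhs == rhs
print("property (vi) verified on the example")
```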
Consider (viii). Again, we only consider the discrete case. We need to compute
E[E[X|Y, Z]|Y = y]. E[X|Y, Z] is a random variable. Given that Y = y, its possible
values are E[X|Y = y, Z = z] where z varies over the range of Z. Given that Y = y, the
probability that E[X|Y, Z] = E[X|Y = y, Z = z] is just P(Z = z|Y = y). Hence,
E[E[X|Y, Z]|Y = y] = Σ_z E[X|Y = y, Z = z] P(Z = z|Y = y)
  = Σ_z Σ_x x P(X = x|Y = y, Z = z) P(Z = z|Y = y)
  = Σ_{z,x} x [P(X = x, Y = y, Z = z) / P(Y = y, Z = z)] · [P(Z = z, Y = y) / P(Y = y)]
  = Σ_{z,x} x P(X = x, Y = y, Z = z) / P(Y = y)
  = Σ_x x P(X = x, Y = y) / P(Y = y)
  = Σ_x x P(X = x|Y = y)
  = E[X|Y = y]
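The chain of equalities for (viii) holds for any discrete joint distribution; here is a sketch that checks the first line numerically on an arbitrary non-uniform pmf on {0, 1}³ (a hypothetical example):

```python
from fractions import Fraction as F
from itertools import product

# An arbitrary non-uniform joint pmf on {0,1}^3 for (X, Y, Z); weights 1..8.
triples = list(product((0, 1), repeat=3))
joint = {t: F(w, 36) for t, w in zip(triples, range(1, 9))}

def P(pred):
    """P(A) for the event A = {pred holds}."""
    return sum(p for t, p in joint.items() if pred(t))

def E_X(pred):
    """E[X | A] for the event A = {pred holds}."""
    return sum(t[0] * p for t, p in joint.items() if pred(t)) / P(pred)

for y in (0, 1):
    # lhs = sum_z E[X | Y=y, Z=z] P(Z=z | Y=y)
    lhs = sum(E_X(lambda t: t[1] == y and t[2] == z)
              * P(lambda t: t[1] == y and t[2] == z) / P(lambda t: t[1] == y)
              for z in (0, 1))
    rhs = E_X(lambda t: t[1] == y)   # E[X | Y = y]
    assert lhs == rhs
print("property (viii) verified on the example")
```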
Example: Let X and Y be independent; each is uniformly distributed on [0, 1]. Let
Z = X + Y . Find E[Z|X], E[X|Z], E[XZ|X], E[XZ|Z].
We start with the easy ones.
E[Z|X] = E[X + Y|X] = E[X|X] + E[Y|X] = X + E[Y] = X + 1/2
where we have used the independence of X and Y, property (iv), and the special case
of (vi). Using property (vi), E[XZ|X] = X E[Z|X] = X(X + 1/2).
Now we do the hard one: E[X|Z]. We need the joint pdf of X and Z. So we do a
change of variables. Let W = X, Z = X + Y . This is a linear transformation, so the
Jacobian will be a constant. We postpone computing it. We need to find the image of
the square 0 ≤ x, y ≤ 1 under this transformation. Look at the boundaries. Since it is a
linear transformation, the four edges of the square will be mapped to line segments. To
find them we can just compute where the four corners of the square are mapped.
(x, y) = (0, 0) → (w, z) = (0, 0)
(x, y) = (1, 0) → (w, z) = (1, 1)
(x, y) = (0, 1) → (w, z) = (0, 1)
(x, y) = (1, 1) → (w, z) = (1, 2)
So the image of the square is the parallelogram with vertices (0, 0), (1, 1), (0, 1) and (1, 2).
The joint density of W and Z will be uniform on this region. Let A denote the interior
of the parallelogram. Since it has area 1, we conclude
fW,Z (w, z) = 1((w, z) ∈ A)
Note that we avoided computing the Jacobian. Now we can figure out what fX|Z (x|z)
is. We must consider two cases. First suppose 0 ≤ z ≤ 1. Given Z = z, X is uniformly
distributed between 0 and z. So E[X|Z = z] = z/2. In the other case 1 ≤ z ≤ 2. Then
X is uniformly distributed between z − 1 and 1. So E[X|Z = z] = (z − 1 + 1)/2 = z/2.
So in both cases E[X|Z = z] = z/2. Thus E[X|Z] = Z/2. Finally, we have E[XZ|Z] =
ZE[X|Z] = Z 2 /2.
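A Monte Carlo sketch of the hard case: sampling (X, Z) and averaging X over thin bins of Z should give roughly z/2 in each bin, in both the 0 ≤ z ≤ 1 and 1 ≤ z ≤ 2 regimes:

```python
import random

# Check E[X | Z] = Z/2 for X, Y i.i.d. Uniform[0,1], Z = X + Y:
# bin the samples by z and compare the mean of X in each bin with z/2.
random.seed(0)
samples = [(x := random.random(), x + random.random())
           for _ in range(200_000)]

for lo, hi in [(0.4, 0.5), (0.9, 1.0), (1.5, 1.6)]:
    xs = [x for x, z in samples if lo <= z < hi]
    mid = (lo + hi) / 2
    print(f"z in [{lo},{hi}): mean X = {sum(xs)/len(xs):.3f}, z/2 = {mid/2:.3f}")
```

The bin means agree with z/2 up to Monte Carlo noise, in both regimes.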
We get a small check on our answers using property (v). We found E[Z|X] = X + 1/2.
So its mean is E[X] + 1/2 = 1/2 + 1/2 = 1. Property (v) says E[E[Z|X]] = E[Z] =
E[X] + E[Y ] = 1/2 + 1/2 = 1.
We also found E[X|Z] = Z/2. Thus the mean of this random variable is E[Z]/2 = 1/2.
And property (v) says it should be E[X] = 1/2.
Example (random sums): Let Xn be an i.i.d. sequence with mean µ and variance σ 2 .
Let N be a RV that is independent of all the Xn and takes on the values 1, 2, 3, · · ·. Let
SN = Σ_{j=1}^{N} Xj

Note that the number of terms in the sum is random. We will find E[SN|N] and E[SN²|N]
and use them to compute the mean and variance of SN.
Given that N = n, SN is a sum with a fixed number of terms:

Σ_{j=1}^{n} Xj

This sum has mean nµ, so E[SN|N = n] = nµ, i.e., E[SN|N] = Nµ. By property (v),
E[SN] = E[Nµ] = µ E[N].
A.4 Conditional expectation as a “best approximation”
We have said that E[X|Y ] is the function of Y that best approximates X. In this section
we make this precise. We will assume we have discrete random variables. The main result
of this section is true in the continuous case as well.
We need to make precise what it means to “be a function of Y .” In the discrete case
we can simply say that X is a function of Y if there is a function g : R → R such that
X = g(Y ). (In the continuous case we need some condition that g be nice.) We start
with a characterization of functions of Y: a random variable Z is a function of Y precisely
when Z is constant on each event Bn = {Y = yn}, where the yn are the possible values of Y.

Proof: Let yn be the possible values of Y and Bn = {Y = yn}. Any function h(Y) is of the form

h(Y) = Σ_n xn 1Bn

where 1Bn is the random variable that is 1 on Bn and 0 on its complement. Conversely,
given a random variable of this form, define g(yn) = xn, and for y not in the range of Y
define g(y) = 0. (It doesn’t matter what we define it to be there.) Then Σ_n xn 1Bn = g(Y).
We think of E[(X − h(Y))²] = E[(X − Σ_n xn 1Bn)²] as a function of all the xn and try to
find its minimum. It is quadratic in each xn with a positive coefficient on xn², so the
minimum occurs at the critical point given by setting all the partial derivatives with
respect to the xn equal to 0.
∂/∂xj E[(X − Σ_n xn 1Bn)²] = −2 E[(X − Σ_n xn 1Bn) 1Bj] = 2 (xj P(Bj) − E[X 1Bj])

using 1Bn 1Bj = 0 for n ≠ j and 1Bj² = 1Bj. Setting this to 0 gives xj = E[X 1Bj] / P(Bj).
It is easy to check that

E[X|Y = yj] = E[X 1Bj] / P(Bj)

so the minimizing choice is xj = E[X|Y = yj]. In other words, the function of Y that best
approximates X in the mean-square sense is E[X|Y].
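The minimization argument can be checked numerically: on each slice {Y = y}, no constant beats the conditional mean in mean-square error. A sketch with an arbitrary joint pmf (a hypothetical example):

```python
from fractions import Fraction as F

# Arbitrary joint pmf for (X, Y); minimize E[(X - c)^2 1_{Y=y}] over c.
joint = {(0, 0): F(1, 8), (2, 0): F(3, 8), (1, 1): F(1, 4), (3, 1): F(1, 4)}

def mse_on_event(c, y):
    """E[(X - c)^2 1_{Y=y}] -- the y-slice of E[(X - h(Y))^2]."""
    return sum((x - c) ** 2 * p for (x, yy), p in joint.items() if yy == y)

for y in (0, 1):
    pY = sum(p for (x, yy), p in joint.items() if yy == y)
    cond_mean = sum(x * p for (x, yy), p in joint.items() if yy == y) / pY
    # scan a grid of candidate constants; none beats the conditional mean
    best = min(mse_on_event(F(k, 10), y) for k in range(0, 31))
    assert mse_on_event(cond_mean, y) <= best
print("E[X|Y=y] minimizes the mean-square error on each slice")
```

Here the conditional means are 3/2 on {Y = 0} and 2 on {Y = 1}, and every other constant on the grid gives a strictly larger error.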