Notes on the Law of Iterated Expectations
Cyrus Samii

The law of iterated expectations (LIE) is the foundation of many derivations and theorems in applied statistics. I provide the basic statement of the law and then illustrate how it applies to some important results. A basic statement is as follows:
$$ E(Y) = E(E(Y \mid X)). $$
Here is a proof for the continuous case:
$$
\begin{aligned}
E(E(Y \mid X)) &= \int E(Y \mid X = x)\, f_X(x)\, dx \\
&= \int \left[ \int y\, f_{Y \mid X}(y \mid x)\, dy \right] f_X(x)\, dx \\
&= \iint y\, f_{X,Y}(x, y)\, dx\, dy \\
&= \int y \left[ \int f_{X,Y}(x, y)\, dx \right] dy \\
&= \int y\, f_Y(y)\, dy = E(Y),
\end{aligned}
$$
and for the discrete case,
$$
\begin{aligned}
E(E(Y \mid X)) &= \sum_x E(Y \mid X = x)\, p(x) = \sum_x \sum_y y\, p(y \mid x)\, p(x) \\
&= \sum_x \sum_y y\, p(x, y) = \sum_y y \sum_x p(x, y) = \sum_y y\, p(y) = E(Y).
\end{aligned}
$$
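The discrete-case identity can also be checked numerically. Below is a minimal Python sketch (the demonstration later in this note uses R); the joint pmf values here are made up purely for illustration:

```python
import numpy as np

# A hypothetical joint pmf p(x, y) over x in {0, 1, 2} and y in {0, 1};
# entries sum to 1, rows give the joint mass for each value of x.
p_xy = np.array([[0.10, 0.15],
                 [0.20, 0.05],
                 [0.30, 0.20]])
y_vals = np.array([0.0, 1.0])

p_x = p_xy.sum(axis=1)             # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]  # conditional p(y | x)

cond_mean = p_y_given_x @ y_vals   # E(Y | X = x), one entry per x
lhs = cond_mean @ p_x              # E(E(Y | X)) = sum_x E(Y | X = x) p(x)
rhs = y_vals @ p_xy.sum(axis=0)    # E(Y) computed from the marginal of y

print(lhs, rhs)                    # the two agree
```

The weighted average of the within-`x` conditional means reproduces the unconditional mean exactly, as the summation identity above requires.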
Example 1. Suppose $Y \mid X \sim N(2X, 1)$ and $X \sim \text{Multinomial}(.25, .25, .5)$ over the values $\{1, 2, 3\}$. This defines a joint distribution for $Y$ and $X$ (since $f(Y \mid X) f(X) = f(X, Y)$). Take a large number of draws from this joint distribution. Average the $Y$ realizations. Then, bin the $Y$ realizations by their associated $X$ values, take the average of the $Y$ realizations within each bin, and finally take the probability-weighted average over the bins. This should equal the mean of the $Y$s. Here is a demonstration with a finite (though rather large) sample using R:
> n <- 5000
> p <- c(.25,.25,.5)
> x <- t(rmultinom(n,1,p))%*%c(1,2,3)
> y <- rnorm(n, mean=2*x, sd=1)
> mean(y)
[1] 4.500761
> tapply(y, x, mean)
       1        2        3
1.994880 3.973878 6.033394
> tapply(y, x, mean)%*%as.matrix(p)
         [,1]
[1,] 4.508887
As the example illustrates, $E(Y \mid X)$ is itself a random variable. In the example, we have

$$
E(Y \mid X) = \begin{cases} 2 & \text{if } X = 1 \\ 4 & \text{if } X = 2 \\ 6 & \text{if } X = 3, \end{cases}
$$
where the probability of any of those $E(Y \mid X)$ realizations is equal to the probability of realizing $X = 1$ vs. $X = 2$ vs. $X = 3$, respectively. Thus, $E(Y \mid X)$ is random with respect to the variation in $X$, and so the outer expectation is taken with respect to this source of random variation, namely the density of $X$.

Iterated expectations can be taken with nested conditioning sets. For example, we have
$$ E(Y \mid X) = E(E(Y \mid X, Z) \mid X). $$
The proof is no more complicated than what we have above: just substitute in the appropriate conditional densities (e.g., replace the density of $Y$ with that of $Y \mid X$). What is important is the statistical interpretation, which typically looks at the conditioning sets in terms of their information content. In these terms, the conditioning sets are nested: $X \subset (X, Z)$. In a sense, the variability in $E(Y \mid X, Z)$ is more "fine grained" than the variability in $E(Y \mid X)$, given the additional "resolution" contributed by the information on $Z$. The outer expectation retains the variability with respect to $X$ but marginalizes over $Z$. That is, $E(Y \mid X)$ is "taking an average" over the fine grain that we would observe if we could condition on $(X, Z)$ but cannot, because we only have $X$. Hence the equality.
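The nested-conditioning version can be seen in a simulation. The following is a Python sketch under hypothetical assumptions: $X$ and $Z$ are independent coin flips and $Y$ depends on both. Averaging the fine-grained means $E(Y \mid X, Z)$ over $Z$ within each $X$ bin recovers the coarse means $E(Y \mid X)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.integers(0, 2, n)
z = rng.integers(0, 2, n)
y = 2 * x + 3 * z + rng.normal(0, 1, n)

# Coarse conditional mean E(Y | X = k), computed directly.
coarse_direct = [y[x == k].mean() for k in (0, 1)]

# Fine-grained means E(Y | X = k, Z = j), then marginalize over Z within each X bin.
coarse_via_lie = []
for k in (0, 1):
    total = 0.0
    for j in (0, 1):
        mask = (x == k) & (z == j)
        w = mask.sum() / (x == k).sum()  # Pr(Z = j | X = k) in the sample
        total += w * y[mask].mean()      # weight the fine-grained mean
    coarse_via_lie.append(total)

print(coarse_direct, coarse_via_lie)     # the two pairs agree
```

Within the sample the agreement is exact (up to floating-point error), since a bin mean is identically the proportion-weighted average of its sub-bin means.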
Example 2 (Martingales). Suppose a random walk, $Y_t = Y_{t-1} + u_t$, where $u_t \sim \text{iid}(0, \sigma_t^2)$. Suppose we have an information set, $\Omega_t = \{u_t, u_{t-1}, \ldots\}$. Then, with $\Omega_{t-1}$ the value of $Y_{t-1}$ is fully determined, and we are able to form an expectation about $Y_t$ as follows:

$$ E(Y_t \mid \Omega_{t-1}) = E(Y_{t-1} + u_t \mid \Omega_{t-1}) = Y_{t-1}. $$
$\{Y_t, \Omega_t\}$ is an example of a martingale process. Now, what would be $E(Y_t \mid \Omega_{t-2})$, our expectation of the price at time $t$ given information only up to $t-2$? Since $\Omega_{t-2} \subset \Omega_{t-1}$, by application of LIE,

$$ E(Y_t \mid \Omega_{t-2}) = E(E(Y_t \mid \Omega_{t-1}) \mid \Omega_{t-2}) = E(Y_{t-1} \mid \Omega_{t-2}) = Y_{t-2}. $$

As before, the outer expectation is "averaging" over the finer-grain $\Omega_{t-1}$ information to generate an expectation given our coarser $\Omega_{t-2}$ information. By induction, we see that $E(Y_t \mid \Omega_{t-k}) = Y_{t-k}$, which is why a martingale such as this is sometimes called an "honest" process.
Example 3 (Propensity score theorem). The propensity score theorem is an important result for methods used to correct for problems of unequal sample selection or treatment assignment.[1] Each unit $i$ is assigned a binary treatment, $D_i \in \{0, 1\}$. Suppose each unit $i$ has potential outcomes $Y_{0i}$ and $Y_{1i}$, which are observed when $D_i = 0$ and $D_i = 1$, respectively. Suppose as well that each unit can be characterized by covariates, $X_i$, such that "conditional independence" holds: $\{Y_{0i}, Y_{1i}\} \perp D_i \mid X_i$. That is, for units with common $X_i$, treatment assignment is random (or "unconfounded") with respect to potential outcomes. Finally, let $p(X_i) = \Pr(D_i = 1 \mid X_i)$. The propensity score theorem states that under these conditions, $\{Y_{0i}, Y_{1i}\} \perp D_i \mid p(X_i)$. The proof is based on a straightforward application of LIE with nested conditioning:
$$
\begin{aligned}
\Pr(D_i = 1 \mid \{Y_{0i}, Y_{1i}\}, p(X_i)) &= E(D_i \mid \{Y_{0i}, Y_{1i}\}, p(X_i)) \\
&= E(E(D_i \mid \{Y_{0i}, Y_{1i}\}, p(X_i), X_i) \mid \{Y_{0i}, Y_{1i}\}, p(X_i)) && \text{(by LIE)} \\
&= E(E(D_i \mid \{Y_{0i}, Y_{1i}\}, X_i) \mid \{Y_{0i}, Y_{1i}\}, p(X_i)) \\
&= E(E(D_i \mid X_i) \mid \{Y_{0i}, Y_{1i}\}, p(X_i)) && \text{(by conditional independence)} \\
&= E(p(X_i) \mid \{Y_{0i}, Y_{1i}\}, p(X_i)) = p(X_i).
\end{aligned}
$$
Thus, under conditional independence, $\Pr(D_i = 1 \mid \{Y_{0i}, Y_{1i}\}, p(X_i)) = p(X_i) = \Pr(D_i = 1 \mid p(X_i))$. Since $\Pr(D_i = 1 \mid p(X_i))$ completely characterizes the distribution of $D_i$ conditional on $p(X_i)$, we can conclude that $D_i \perp \{Y_{0i}, Y_{1i}\} \mid p(X_i)$. The usefulness of the propensity score theorem is that it reduces the dimensionality of the problem of correcting for non-random selection or treatment assignment. If $X_i$ is of high dimension, then conditioning on $X_i$ directly (e.g., by creating stratification cells) could be hard due to sparse data over values of $X_i$.
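The dimension-reduction payoff can be illustrated with a small simulation. The Python sketch below uses hypothetical values: a covariate with three levels, two of which share the same propensity score, so that those two levels collapse into a single stratum in which treatment is assigned at a common rate:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300_000
x = rng.integers(1, 4, n)                # covariate X_i in {1, 2, 3}
pscore = np.where(x == 3, 0.7, 0.3)      # p(X_i): 0.3 for X in {1, 2}, 0.7 for X = 3
d = (rng.random(n) < pscore).astype(int)  # D_i | X_i ~ Bernoulli(p(X_i))

# Within the p(X) = 0.3 stratum, Pr(D = 1) is the same whether X = 1 or X = 2,
# so conditioning on the scalar p(X) suffices instead of conditioning on X itself.
print(d[x == 1].mean(), d[x == 2].mean(), d[pscore == 0.3].mean())
```

All three printed treatment rates sit near 0.3, as the theorem predicts: once the scalar $p(X_i)$ is held fixed, the particular value of $X_i$ carries no further information about $D_i$.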
[1] P. Rosenbaum and D. B. Rubin, 1983, "The central role of the propensity score in observational studies for causal effects," Biometrika, 70(1):41-55. The proof here follows the presentation in J. D. Angrist and J. S. Pischke, 2008, Mostly Harmless Econometrics, Princeton: Princeton University Press, pp. 80-81.