where the probability of any of those $E(Y \mid X)$ realizations is equal to the probability of realizing $X = 1$ vs. $X = 2$ vs. $X = 3$, respectively. Thus, $E(Y \mid X)$ is random with respect to the variation in $X$, and so the outer expectation is taken with respect to this source of random variation, namely the density of $X$.

Iterated expectations can be taken with nested conditioning sets. For example, we have
$$E(Y \mid X) = E(E(Y \mid X, Z) \mid X).$$
The proof is no more complicated than what we have above: just substitute in the appropriate conditional densities (e.g., replace $f_Y$ with $f_{Y \mid X}$). What is important is the statistical interpretation, which typically looks at the conditioning sets in terms of their information content. In these terms, the conditioning sets are nested: $X \subseteq (X, Z)$. In a sense, the variability in $E(Y \mid X, Z)$ is more “fine grained” than the variability in $E(Y \mid X)$, given the additional “resolution” contributed by the information on $Z$. The outer expectation retains the variability with respect to $X$ but marginalizes over $Z$. That is, $E(Y \mid X)$ is “taking an average” over the fine grain that we would observe if we could condition on $(X, Z)$ but that we cannot do because we only have $X$. Hence the equality.
Example 2 (Martingales): Suppose a random walk, $Y_t = Y_{t-1} + u_t$, where $u_t \sim \mathrm{iid}(0, \sigma^2)$. Suppose we have an information set, $I_t = \{u_t, u_{t-1}, \ldots\}$. Then, with $I_{t-1}$, the value of $Y_{t-1}$ is fully determined, and we are able to form an expectation about $Y_t$ as follows:
$$E(Y_t \mid I_{t-1}) = E(Y_{t-1} + u_t \mid I_{t-1}) = Y_{t-1}.$$
$\{Y_t, I_t\}$ is an example of a martingale process. Now, what would be $E(Y_t \mid I_{t-2})$, our expectation of the price at time $t$ given information only up to $t - 2$? Since $I_{t-2} \subseteq I_{t-1}$, by application of LIE,
$$E(Y_t \mid I_{t-2}) = E(E(Y_t \mid I_{t-1}) \mid I_{t-2}) = E(Y_{t-1} \mid I_{t-2}) = Y_{t-2}.$$
As before, the outer expectation is “averaging” over the finer-grained $I_{t-1}$ information to generate an expectation given our coarser $I_{t-2}$ information. By induction, we see that $E(Y_t \mid I_{t-k}) = Y_{t-k}$, which is why a martingale such as this is sometimes called an “honest” process.
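A short simulation illustrates this property; it is a sketch under assumptions of ours (the horizon $t$, lag $k$, and shock scale are arbitrary choices). It fixes one realized history up to time $t - k$, simulates many continuations of the walk out to time $t$, and checks that the average of the simulated $Y_t$ values matches $Y_{t-k}$.

```python
import numpy as np

# Check that E(Y_t | I_{t-k}) = Y_{t-k} for a random walk Y_t = Y_{t-1} + u_t.
# Horizon, lag, and sigma are arbitrary choices for this illustration.
rng = np.random.default_rng(1)
t, k, sigma = 50, 3, 1.0

# One realized history up to time t - k: the information set I_{t-k}.
history = np.cumsum(rng.normal(scale=sigma, size=t - k))
y_past = history[-1]                       # Y_{t-k}, known at time t - k

# Many possible continuations of the walk from t - k out to t.
n_paths = 200_000
future_shocks = rng.normal(scale=sigma, size=(n_paths, k))
y_t = y_past + future_shocks.sum(axis=1)   # Y_t = Y_{t-k} + sum of k shocks

print(f"Y_(t-k)               = {y_past:.4f}")
print(f"mean of simulated Y_t = {y_t.mean():.4f}")   # approximately Y_(t-k)
```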
Example 3 (Propensity score theorem): The propensity score theorem is an important result for methods used to correct for problems of unequal sample selection or treatment assignment.¹ Each unit $i$ is assigned a binary treatment, $D_i \in \{0, 1\}$. Suppose each unit $i$ has potential outcomes $Y_{0i}$ and $Y_{1i}$, which are observed when $D_i = 0$ and $D_i = 1$, respectively. Suppose as well that each unit can be characterized by covariates, $X_i$, such that “conditional independence” holds: $\{Y_{0i}, Y_{1i}\} \perp\!\!\!\perp D_i \mid X_i$. That is, for units with common $X_i$, treatment assignment is random (or “unconfounded”) with respect to potential outcomes. Finally, let $p(X_i) = \Pr(D_i = 1 \mid X_i)$. The propensity score theorem states that under these conditions, $\{Y_{0i}, Y_{1i}\} \perp\!\!\!\perp D_i \mid p(X_i)$. The proof is based on a straightforward application of LIE with nested conditioning:
$$\begin{aligned}
\Pr(D_i = 1 \mid \{Y_{0i}, Y_{1i}\}, p(X_i)) &= E(D_i \mid \{Y_{0i}, Y_{1i}\}, p(X_i)) \\
&= E\big(E(D_i \mid \{Y_{0i}, Y_{1i}\}, p(X_i), X_i) \mid \{Y_{0i}, Y_{1i}\}, p(X_i)\big) && \text{(by LIE)} \\
&= E\big(E(D_i \mid \{Y_{0i}, Y_{1i}\}, X_i) \mid \{Y_{0i}, Y_{1i}\}, p(X_i)\big) && \text{(since $p(X_i)$ is a function of $X_i$)} \\
&= E\big(E(D_i \mid X_i) \mid \{Y_{0i}, Y_{1i}\}, p(X_i)\big) && \text{(by conditional independence)} \\
&= E\big(p(X_i) \mid \{Y_{0i}, Y_{1i}\}, p(X_i)\big) = p(X_i).
\end{aligned}$$
Thus, under conditional independence, $\Pr(D_i = 1 \mid \{Y_{0i}, Y_{1i}\}, p(X_i)) = p(X_i) = \Pr(D_i = 1 \mid p(X_i))$. Since $\Pr(D_i = 1 \mid p(X_i))$ completely characterizes the distribution of $D_i$ conditional on $p(X_i)$, we can conclude that $D_i \perp\!\!\!\perp \{Y_{0i}, Y_{1i}\} \mid p(X_i)$. The usefulness of the propensity score theorem is that it reduces the dimensionality of the problem of correcting for nonrandom selection or treatment assignment. If $X_i$ is of high dimension, then conditioning on $X_i$ directly (e.g., by creating stratification cells) could be hard due to sparse data over values of $X_i$.
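The balancing property the theorem delivers can be seen in a simulation; the confounded data-generating process and the logistic form for $p(X_i)$ below are assumptions of ours for illustration. Treated and control units differ sharply in mean $Y_{0i}$ overall, but the difference nearly vanishes within narrow bins of the propensity score.

```python
import numpy as np

# Illustrative check of the propensity score theorem's balancing property.
# Assumed DGP: X confounds treatment, and p(X) takes a logistic form.
rng = np.random.default_rng(2)
n = 500_000
X = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-X))          # true propensity score Pr(D = 1 | X)
D = rng.random(n) < p                 # treatment assignment, confounded by X
Y0 = X + rng.normal(size=n)           # potential outcome under control

# Unconditionally, Y0 differs by treatment status (selection bias) ...
print(f"overall: E[Y0|D=1] = {Y0[D].mean():.3f}, E[Y0|D=0] = {Y0[~D].mean():.3f}")

# ... but within narrow propensity-score bins the difference (nearly) vanishes.
for lo in (0.2, 0.4, 0.6):
    cell = (p >= lo) & (p < lo + 0.05)
    t_mean = Y0[cell & D].mean()
    c_mean = Y0[cell & ~D].mean()
    print(f"p in [{lo:.2f}, {lo + 0.05:.2f}): "
          f"E[Y0|D=1] = {t_mean:.3f}, E[Y0|D=0] = {c_mean:.3f}")
```

Binning on the score is only an approximation to conditioning on its exact value, so small within-bin gaps remain at finite bin width.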
¹ P. Rosenbaum and D. B. Rubin, 1983, “The central role of the propensity score in observational studies for causal effects,” Biometrika, 70(1):41–55. The proof here follows the presentation in J. D. Angrist and J. S. Pischke, 2008, Mostly Harmless Econometrics, Princeton: Princeton University Press, pp. 80–81.