Causal Inference Special Topic

Behavioural Finance
Prospect Theory
Candidate Number: 1031077

University of Oxford
May, 2019
Contents
1 Introduction 1
2 Expected Utility Theory 1

2.1 Subjective Probability Biases . . . . . . . . . . . . . . . . . . . . . . 2
2.2 Doubt On Utility Function . . . . . . . . . . . . . . . . . . . . . . . . 2
3 Prospect Theory 5
3.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Treatment Assignment Mechanism . . . . . . . . . . . . . . . . . . . 8
4 Neyman Repeated Sampling Approach 9
5 Bayesian Inference 12
5.1 Rubin Causal Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
5.2 Posterior predictive distribution of missing potential outcomes . . . . 12
5.3 Posterior distribution of causal effects . . . . . . . . . . . . . . . . . . 13
5.4 Model-based inference . . . . . . . . . . . . . . . . . . . . . . . . . . 13
6 Application in NSW example 14

6.1 The Lalonde NSW experimental job-training data . . . . . . . . . . . 14
6.2 Model 1: Assuming independence between potential outcomes . . . . 16
6.3 Model 2: Assuming constant individual treatment effects . . . . . . . 17
6.4 Model 3: Model-based imputation with covariates . . . . . . . . . . . 19
6.5 Model 4: Two-Part Model . . . . . . . . . . . . . . . . . . . . . . . . 20
7 Conclusion 22
8 Future Work 23
A Appendix 24
A.1 Summary Statistics of Causal Estimands . . . . . . . . . . . . . . . . 24
A.2 Pair plots of parameters . . . . . . . . . . . . . . . . . . . . . . . . . 25
A.3 MATLABStan Code . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
References 32
1 Introduction
Expected Utility Theory is criticized by more and more people. Some argue the
underlying assumptions of the theory do not hold, some prove predictions do not
match behaviors through experimental or empirical test and some combine both of
criticism[1]. Enormous alternative theories have been proposed. Among them, the
most representative one is Prospect Theory, which was proposed by Daniel Kahneman
and Amos Tversky in 1979. Prospect Theory, corrective critique of Expected Utility
Theory, successfully illustrated many unexplained anomalies in economics, which had
a huge impact on later behavioral finance research. Behavioral finance attempts to
change the standard analytical paradigm based on Efficient Market Hypothesis for
decades, analyzing the market behaviors from the perspective of human being[2].
The main goal of this report is that fully discuss improvements of Prospect The-
ory, compared of Expected Utility Theory(EUT). In Section 2, we will quick review
the EUT, and discuss the limitations of it. In Section 3, Prospect Theory will be
introduced in detail. In Section 4, we will mainly focus on the improvements of the
Prospect Theory. In Section 5, we will discuss some limitations and extensions of
Prospect Theory.
2 Expected Utility Theory

The study of decision-making under the risk is an essential part of any asset price
problem in financial economics. In this field, Expected Utility Theory used to occupy
the predominance. The main objective of the theory is to maximize the decision-
makers utility through weighting the various possible outcomes. More specifically, we
could use the utility function u to describe the prospect of possible outcome for the
investors. Consider the possible outcomes xi , than the utility will be given as u(xi )
(u is a function on the terminal wealth i.e. xi = w0 + ∆w, w0 is the initial wealth, ∆w
is wealth change) with the corresponding probability pi (where p1 + p2 + ... + pn =
1). The expected utility is defined as U (X) = ni=1 pi ∗ u(xi ). It is noteworthy that
P
investors are risk advise (i.e. utility function u is concave) in most of cases. One of
the basic assumption of Expected Utility Theory is that each individual is rational.
However, does it is really true?
1
2.1 Subjective Probability Biases
In 1974, Kahneman and Tversky first questioned the rationality assumption from
a psychological perspective. They argue that investors face the Representativeness
bias, Availability bias, Anchoring Effect and etc [3].
(1) Representativeness Bias
Kahneman and Tversky (1974) state that people tend to categorize events accord-
ing to traditional or similar circumstances, and overtrust the possibility of historical
repetition when assessing the probability [3]. Bondt and Thaler (1985) further argue
that Investors will be overly pessimistic (optimistic) for the losers (winners) of the
past stock market, resulting in large differences in stock prices and fundamental prices
[4].
(2) Availability Bias
Kahneman and Tversky (1974) state events that are easily reminiscent can lead
to the misconception that they often occur. They believe that people tend to overes-
timate the probability of an event occurring under the availability bias [3].
(3) Anchoring Effect
Kahneman and Tversky (1974) states the final answer proposed by people is sig-
nifyingly impacted by the initial values. For example, the high school students are
divided into two group. Given 5 mins, they are required to offer an estimation of
1 × 2 × 3 × 4 × 5 × 6 × 7 × 8. The group 1 and group 2 are given by ascending and
descending order. The result is following:
Group Given Median Of Estimation
1 1×2×3×4×5×6×7×8 512
2 8×7×6×5×4×3×2×1 2250
Obviously, group 1 offer the overall smaller values, which is influence by the initial
value 1 and vice versa [3].
2.2 Doubt On Utility Function

In Expected Utility Theory, the terminal utility is weighted the utility of possible
outcomes, in this section, we will introduce some examples that peoples preference
violates the theory. Maurice Allais was the first person who found the violation in
1953 [5]. Kahneman and Tversk classified the violations into 3 aspects: Certainty
2
Effect, Reflection Effect and Isolation Effect in 1979 [6]. In this section, we are going
to introduce each of the effect in detail, based on survey data.
§ Certainty Effect
A 2500(0.33) 2400(0.66) 0(0.01) 18%

Test 1
B 2400(1.00) 82%
C 2500(0.33) 0(0.67) 83%
Test 2
D 2400(0.34) 0(0.67) 17%
Table 1: Problem 1
Table 1 shows the uncertainty outcomes with probability and peoples preference.
Each test people are required to choose prospect. Without loss generality, we de-
fine u(0) = 0. According to Expected Utility Theory, we can derive u(2400) >
0.33u(2500) + 0.66u(2400), from Test 1, equivalently 0.34u(2400) > 0.33u(2500).
However, we obtains 0.33u(2500) > 0.34u(2400) from Test 2, which contradicts with
each other. Summarily, when the outcomes change from certain gain to probable
gain, the desirability will be reduced [6].
A 4000(0.80) 0(0.20) 80%

Test 3
B 3000(1.00) 20%
C 4000(0.20) 0(0.80) 65%
Test 4
D 3000(0.25) 0(0.75) 35%
Table 2: Problem 2
This example was introduced by Allais [5]. In fact, C can be represented as

A with possibility 0.25, denote as (A, 0.25). Similarly, D can be written as (B,
0.25). From Table 2, B is preferred than A. According to Expected Utility Theory, D
should be preferred than C, but statistics shows opposite direction. This problem is
known as Allais Paradox. From the problem, we may propose the property 1 of the
utility: if (y, pq) is equivalent to (x, p), than (y, pqr) is preferred than (x, pr), given
0 < p, q, r < 1. We will formally prove this property in Section4.
§ Reflection Effect
3
A -4000(0.80) 0(0.20) 92%
Test 30
B -3000(1.00) 8%
C -4000(0.20) 0(0.80) 42%
Test 40
D -3000(0.25) 0(0.75) 58%
Table 3: Problem 3
Comparing between Table 2 and Table 3, we could conclude decision makers have
two opposite behavioral preference under the uncertainty gains and losses. That is,
when facing the gains, investors act as risk aversion, but when facing the losses,
investors act as risk seeking, which also violates the assumption of overall concave
utility function of Expected Utility Theory.
§ Isolation Effect
Kahneman and Tversk argue that the inconsistence of preference may be at-
tributed to the way of decomposing prospects based on individuals preference. When
the decomposition is not unique, the preference will vary [6]. Consider the following
two-stage game, in first stage, each individual has 0.75 chance to be eliminated and
0.25 chance to move into next stage. In stage 2, the individual face two choice: (1)
(4000, 0.8), (2) (3000, 1). From the perspective of the entire game, the result show in
Table 4. The outcome 4000 with probability 0.2 (0.25 × 0.8). The individuals pref-
erence is conflicted with Test 4 in Table 2. It is due to the individual tends to ignore
the first stage and only focus on the second stage, when facing the muti-stage prob-
lem. Then applying the certainty effect, the individual tends to choose B. Kahneman
and Tversk indicate that investors make the choice between prospects are not based
completely on the probability in final states, which again violates the assumption of
Expected Utility Theory [6].
A 4000(0.20) 0(0.80) 22%

Test 5
B 3000(0.25) 0(0.75) 78%
Table 4: Problem 4
Consider another game, group 1 and 2 are initially given 1000 and 2000 respec-
tively. The choices they facing and the test preference are showed in Table 5. From
the perspective of the final stage, A = (2000, 0.5; 1000, 0.5) = C and B = (1500) = D.
Again, the investors choices are inconsistent. Investors seems ignore the initial bonus,
because it is common in both opinions in each problem. Thus, we may propose that
4
the change in wealth is the core part estimate the utility, rather than the final wealth
states [6].
A 1000(0.50) 0(0.50) 16%

Test 6
B 500(1.00) 84%
C -1000(0.50) 0(0.50) 69%
Test 7
D -500(1.00) 31%
Table 5: Problem 5
3 Prospect Theory
Under assumptions of traditional financial theory, investors maximize their expected
utility based on the terminal wealth level with corresponding probability. These
decision-making processes are built on the Efficient Market Hypothesis, where each
investor is rational and has the same level information of market. However, investors
preference is influenced by various factors, such as economic situation, knowledge
level, psychological quality and so on, when facing the real choices in financial market.
Also, we find Expected Utility Theory is not sufficient to explain the preference in
real-world decisions. Kahneman and Tversk proposed Prospect Theory in 1979, which
revised the traditional decision-making framework and emphasize that investors are
affected by many factors under uncertainty environment.
Prospect Theory deals with the simple decision problems with monetary outcomes
and states probabilities. Generally, the theory divides the decision-making process
into two stages. The first stage is Editing, a preliminary analysis of the outcomes
and probabilities, and then transform them into edited prospects. This stage some-
times also accompanied by the simplification of these prospects. The second stage is
evaluation, where investors choose the optimal prospect through evaluating them [6].
Next, we will explain each stage in detail.
5
Notationally, let W be the variable that indicates the treatment level, to which
the N unit may receive: W = 1 for Nt treated units and W = 0 for (N − Nt ) control
units. Also let Yi (1) denote the value of a treated unit, and Yi (0) denote the value of a
control unit. In the example described above, Yi (0) and Yi (1) represent the pain level
of headache. The individual treatment effect can be determined by the difference of
two potential outcomes, Yi (1) − Yi (0), for any unit i. The fundamental problem of
causal inference [7] faced by statisticians is that, only one potential outcome can be
observed, and the other being missing for each unit, and it can be described by the
expression:
Yobs = W × Y (1) + (1 − W ) × Y (0) (1)

Ymis = (1 − W ) × Y (1) + W × Y (0) (2)
The data tabulated below illustrates the fundamental problem of causal inference
that we face:
Unit W X Y(0) Y(1) Individual causal effect
1 0 8 5 3* −2*
2 1 15 7* 9 +2*
3 0 26 7 5* −2*
4 1 35 8* 9 +1*
5 0 42 5 4* −1*
True Averages 6.4* 6* -0.4*
Observed 5.7 9
where values marked with ∗ are unobserved, and X denotes the covariates that may
affect the effect of the treatment on the outcomes (i.e., age of a person in the study of
effect of aspirin). The array of values of X,Y(0) and Y(1) represents the ”science”
that we want to learn, which does not change how we assign the treatment to the
unit.
However, in the real situation we would always observe a table as follows:
6
Unit W Y(0) Y(1) Individual causal effect
1 0 5 ? ?
2 1 ? 9 ?
3 0 7 ? ?
4 1 ? 9 ?
5 0 5 ? ?
True Averages ? ? ?
Observed 5.7 9
Therefore, the Rubin causal model reduces the potential outcome framework to a
missing data problem. From the table above, one may draw the conclusion that
taking aspirin is bad for reducing the level of headache pain, by simply looking at the
observed values, (i.e., Y (1) = 9 > Y (0) = 5.7). But we ”know” from the science that
the treatment on average has positive impact on patients’ medical condition. This
motivates us to develop an imputation method to fill the unobserved values, and later
in Section 5 we will discuss how Bayesian approach is implemented in this problem.
Identifying the individual causal effect is not possible in general since there is
always a piece of information missing. Hence we draw our attention to the aggre-
gated causal effect which is the average treatment effect (ATE) over a population of
individuals, and it can be formulated as follows:
τ = E[Y (1)] − E[Y (0)] (3)
3.1 Assumptions
There are two major assumptions [8] need to be made which are often adopted in the
causal analysis.
Assumption 1 No interference among units.
Assumption 2 No hidden version of treatment and control.
Assumption 1 implies that neither Y (0) nor Y (1) can be affected by what actions
other units received. Assumption 2 assumes that the ”science” is not affected by how
we learn it, no matter it is randomized experiments or observational studies. These
two assumptions constitute a critical assumption called ”Stable Unit Treatment Value
7
Assumption” (SUTVA) [8].
Under SUTVA, estimation of causal effect can be found from the matrix of science
 
X1 Y1 (0) Y1 (1)
 .. 

 . 

(X,Y(0),Y(1)) =  Xi Yi (0) Yi (1)  (4)
 
 .. 
.
 
 
XN YN (0) YN (1)
Since labelling of the N -row array is a random permutation, it suggests that every row
is exchangeable with others. To evaluate the causal effect, we must have some units
with Y (1) being observed and some units with Y (0) being observed. It motivates
us to study a probabilistic mechanism that determines which units receive treatment
and which units receive control.
3.2 Treatment Assignment Mechanism

The assignment mechanism determines the conditional probability for each vector of
assignments W , given the science matrix {X, Y (0), Y (1)}:
Pr(W |X, Y (0), Y (1)) (5)
In a completely randomized experiment with N units, a fixed number of Nt units are

chosen to receive treatment. Thus the assignment mechanism by definition gives the
conditional probability as follows:
−1 N
N X
Pr(W |X, Y (0), Y (1)) = for all W such that Wi = Nt (6)
Nt i=1
Without any knowledge about how the missing data is created during the process,
it is generally impossible to infer meaningful things from the missing data. In a
completely randomized experiment, the treatment assignment is often seen as being
unconfounded with potential outcomes. Two key assumptions we need to make in
a completely randomized experiment are ”overlap” and ”unconfoundedness” [9]:
Assumption 3 (Overlap)
0 < Pr(Wi = 1|Xi = x) < 1
8
Assumption 4 (Unconfoundedness) The assignment mechanism is independent of
all potential outcomes including both observed and unobserved. Hence it implies:
Pr(W |X, Y (0), Y (1)) = Pr(W |X) (7)
such that
(Yi (1), Yi (0)) Wi |Xi
|=
where the conditional independence notation, A B|C was first used by Dawid [10],
|=
denoting that A and B are independent conditional on C. The combination of these
two assumptions constitute the strong ignorability. Given this, the assignment mech-
anism can be considered to be ignorable in the imputation process of missing potential
outcomes.
4 Neyman Repeated Sampling Approach

Before introducing Bayesian method, we need to discuss an important frequentist
approach that provides a good estimate for the average treatment effect. Neyman [11]
studies the estimation of the average treatment effect, using the distribution induced
by randomization (i.e., the assignment vector W). His approach is to develop an
unbiased estimator for average treatment effect, and use repeated sampling to obtain
its mean and variance for constructing the confidence interval. We refer the repeated
sampling to as the procedure that samples from both the population of units and the
randomization distribution. In Neyman’s approach, it is difficult to derive the exact
randomization distribution of statistics of which we are interested in, since we do not
start with the assumption that allows us to determine the missing potential outcomes.
However, we may still derive good estimators of aspects of this distribution with first
and second moments. Throughout this section, we maintain the previous stability
assumption, SUTVA.
We begin with the finite-sample average treatment effect, which can be calculated
as follows:
N N
1 X 1 X
τf s = τi = {Yi (1) − Yi (0)} = Ȳ1 − Ȳ0 (8)
N i=1 N i=1
PN
where Ȳ1 and Ȳ0 are the finite sample of potential outcomes, ȲW ;W ={0,1} = i=1 Yi (W )/N .
A natural and unbiased estimator for a completely randomized experiment is proposed
by Neyman, to be defined as the difference in the average observed outcomes between
9
units that receive treatment and control:
τ̂ = Ȳ1obs − Ȳ0obs (9)

1
where Ȳ1obs = Yiobs and Ȳ0obs = N1c i:Wi =1 Yiobs are the sample means of
P P
Nt i:Wi =0
the observed outcomes under treatment or control. The sampling variance of the
Neyman estimator with N units in a completely randomized experiment is
St2 Sc2 Sτ2
var(τ̂ ) = + − (10)
Nt Nc N
where Sc2 and St2 are the sample variances of potential outcomes, Yi (0) and Yi (1), Sτ2
is the sample variance of the individual treatment effects, defined as:
N
1 X
Sc2 = (Yi (0) − Ȳ0 )2 (11)
N − 1 i=1
N
1 X
St2 = (Yi (1) − Ȳ1 )2 (12)
N − 1 i=1
N
1 X
Sτ2 = (Yi (1) − Yi (0) − τf s )2 (13)
N − 1 i=1
Since the pair of potential outcomes for the same unit is never observed simultane-
ously, it is generally impossible to identify Sτ2 from the observed data, thus we need
to develop an estimator for this sampling variance given by Equation 10. If the treat-
ment effects are constant and additive across the population (Yi (1) − Yi (0) = τf s ),
the third component of Equation 10 vanishes, and the variance can be approximated
by an unbiased estimator [9] as follows:
neyman s2c s2t

V̂ = + (14)
Nc Nt
where s2c and s2t are unbiased estimators for Sc2 and St2 , and they are defined as follows:
1 X
s2c = (Y obs − Ȳ0obs )2 (15)
Nc − 1 i:W =0 i
i
1 X
s2t = (Y obs − Ȳ1obs )2 (16)
Nt − 1 i:W =1 i
i
Neyman’s estimator has been widely used in randomization-based causal inference

[9], even the assumption of addictive treatment effect does not always hold. Since the
third component, Sτ2 , is non-negative, Neyman’s estimator tends to overestimate the
true variance, var(τ̂ ):
E[V̂ neyman ] ≥ var(τ̂ ) (17)
10
Hence, the confidence intervals generated with the Neyman’s estimator will have
larger coverage than their nominal coverage in large samples. Neyman reviewed in
Stats. Sci (1990) [12] formulated the following as a confidence interval with frequentist
coverage probabilities of Ȳ1 − Ȳ0 :
√
Ȳ1obs − Ȳ0obs ± 1.96 V neyman (18)
The effect of covariates are ignored by the asymptotic confidence interval above,
the interval tends to be too conservative for rerandomization, that is an assignment
mechanism that checks balances between covariates at the time of randomization. It
is suggested by Morgan and Rubin [13] that more accurate results can be obtained
by using regression adjustment methods since the results can better capture the true
sampling variance of the treatment effect. As a result of simple random sampling, the
super-population ATE, τsp , is equal to the expected value of the finite sample ATE,
τsp , and the estimator Vsp for sampling variance of super-population is identical to
the conservative Neyman’s estimator [9], V neyman .
It is worth mentioning Fisher’s sharp null hypothesis, [14] where Fisher attempted
to prove the following hypothesis is incorrect using randomization distribution:
H0 : Yi (1) = Yi (0) ∀i = 1, · · · , N (19)
It is being ”sharp” because under the null hypothesis of no treatment effects for any
unit, all potential outcomes are known. Under such hypothesis, the assumption of
constant additive treatment effect is satisfied for Neyman’s estimator, and its estimate
for the sampling variance would be unbiased. However, neither Fisher’s nor Neyman’s
approaches addressed the problem on which interventions should be taken to the
unseen units.
In the upcoming section, we will introduce a model-based approach which is more
flexible compared to Neyman repeated sampling approach.
11
5 Bayesian Inference
5.1 Rubin Causal Model
A framework that defines causal effect by potential outcomes with explicit proba-
bilistic assignment mechanism, is called ”Rubin Causal Model”, first given that name
by Holland [15]. The model views all statistical inference problems for causal ef-
fects as missing data problems for which the missing data is created via probabilistic
assignment mechanism [16].
The essence of Rubin causal model is that it beautifully separates the true underly-
ing model of science, Pr(X, Y (0), Y (1)) from the assignment mechanism, Pr(W |X, Y (0), Y (1)).
Combining them together, the model specifies a joint distribution for all quanti-
ties, Pr(W, X, Y (0), Y (1)), that allows simulation-based computational methods (i.e.,
Markov Chain Monte Carlo) to be implemented.
5.2 Posterior predictive distribution of missing potential out-

comes
First, consider the posterior predictive distribution of Ymis that can be computed via
Bayes’ theorem:
Pr(X, Y (0), Y (1))Pr(W |X, Y (0), Y (1))
Pr(Ymis |X, Yobs , W ) = R (20)
Pr(X, Y (0), Y (1))Pr(W |X, Y (0), Y (1))dYmis
With igorable treatment assignment, Equation 20 (by cancelling out Pr(W |X, Yobs ))
becomes:
Pr(X, Y (0), Y (1))
Pr(Ymis |X, Yobs , W ) = R (21)
(Pr(X, Y (0), Y (1)))dYmis
Equation 21 implies that all we need to model is the joint distribution of the under-
lying data (science), Pr(Y (0), Y (1), X).
Since the science array (X, Y(0), Y(1)) is row exchangeable and by de Finetti’s
theorem [17] [18], it is reasonable to consider the unit-level distributions of (X, Y(0), Y(1))
to be independent and identically distributed (i.i.d.) given parameters θ:
N
Z Y
Pr(X, Y (0), Y (1)) = [ Pr(Xi , Yi (0), Yi (1)|θ)]Pr(θ)dθ (22)
i=1
where θ is a parameter with prior distribution Pr(θ). Furthermore, we can factor the
conditional distribution of (X, Yi (0), Yi (1)) into:
Pr(X, Yi (0), Yi (1)|θ) = Pr(Xi |θX )Pr(Yi (0), Yi (1)|Xi , θY |X ) (23)
12
where θY |X is the parameter of the conditional distribution, Pr(Yi (0), Yi (1)|Xi ). Sim-
ilarly, θX is the parameter of the marginal distribution, Pr(Xi ). We assume θY |X
θX are both functions of θ and a priori independent. Rubin [19] pointed out that
Equation 23 creates a connection between fundamental theory and the practice of
applying independent and identical models.
5.3 Posterior distribution of causal effects

After obtaining the posterior predictive distribution for the missing potential out-
comes, Pr(Ymis |X, Yobs , W ), it is straightforward calculate the posterior for causal
effect. By repeatedly drawing values of Ymis from the distribution and calculating
the corresponding causal effect,τ , it can generate the posterior distribution for causal
effect, τ . Thus the causal inference can be perceived completely as a missing data
problem by multiply-imputing [20] the missing potential outcomes to construct the
posterior distribution for the causal estimands.
5.4 Model-based inference

From Bayesian perspective, the primary objective is to derive the posterior predictive
distribution of the missing potential outcomes, Pr(Ymis |Yobs , X, W ), which can be
further used to calculate the posterior distribution of causal effect, Pr(τ |Yobs , X, W ).
We formulate 4 main steps [9] for this Bayesian approach in a competely randomized
experiment.
1. Derive Pr(Ymis |Yobs , X, W, θ)
2. Derive the posterior distribution for θ, Pr(θ|X, Yobs , W )
3. Combine the two distributions to get the posterior distribution

of Ymis , Pr(Ymis |X, Yobs , W )
4. Derive the posterior distribution of causal effect, τ , Pr(τ |X, W, Yobs )

using method described in Section 4.3.
13
6 Application in NSW example
6.1 The Lalonde NSW experimental job-training data
The data we will be analyzing, to illustrate our methods, come from a randomized
evaluation of the National Support Work (NSW) project. It is a transitional and
subsidized work experience program for groups of people who suffer from longstanding
employment problems [21]. This dataset has been extensively analyzed in the field
of econometrics by Heckman and Hotz [22], Dehejia and Wahba [23], and Smith and
Todd [24].
Table 6 summarizes 12 variables that were collected from units may or may not
be assigned to the job training program. In the size of N = 445 people, Nt = 185 of
them were assigned to the treatment group and Nc = 260 were assigned to the control
group. The treatment variable of interest is treat, an indicator variable for whether
or not being assigned in the job-training program. The outcome variable of interest
is re78, a variable represents the post-program labor market experiences, earnings in
1978. The other 10 variables are covariates that may affect the outcome variable at
different levels such as age, educ, married, etc that have a certain degree of influence
on the outcome. Summary statistics for them are presented in the table below:
Table 6: Summary Statistics for NSW Dataset
Variable Description mean s.d. min max Type
treat treatment status 0.42 0.53 0 1 treatment

age age in years 25.37 7.10 17 55.0 covariate
educ education 10.20 1.79 3 16.0 covariate
married being now or ever before married 0.17 0.37 0 1.0 covariate
nodegr being a high school dropout 0.78 0.41 0 1.0 covariate
black being African American 0.83 0.37 0 1.0 covariate
re74 pre-training earnings in 1974 2.10 5.36 0 39.6 covariate
u74 earnings in 1974 being zero 0.73 0.44 0 1.0 covariate
re75 pre-training earnings in 1975 1.38 3.15 0 25.1 covariate
u75 earnings in 1975 being zero 0.65 0.48 0 1.0 covariate
hisp being Hispanic/Latino 0.10 0.30 0 1.0 covariate
post-program labor market earnings
re78 5.30 6.63 0 60.3 outcome
in 1978
14
When applying the methodology we described above to the real-world data, the ma-
jor problem researchers often encountered is to formalize the dependence between
two potential outcomes since the researchers never observe both potential outcomes
simultaneously. Thus, we build several models for the joint distribution of two poten-
tial outcomes and study the sensitivity to the choice of models. For first 3 models, we
choose normal distribution as prior for all the parameters of our interest. It is worth
noting that the inference is insensitive to the specification of prior distribution with
ignorable treatment assignment [7] [25].
We begin with the simplest model where we assume independence between two
potential outcomes with no covariates. One of our goals is to investigate how sensi-
tive the results are to the assumption of independence between potential outcomes,
hence the second model we would like to consider is for which the potential outcomes,
Yi (0) and Yi (1), are perfectly correlated (i.e. ρ = 1) and covariances of both poten-
tial outcomes are equal. Equivalently, the individual treatment effects are constant.
The third model we consider assumes independence between potential outcomes but
with covariates and we want to study whether adding pre-treatment covariates will
improve the imputation process or not. Finally, we incorporate the assumption of
independence, the presence of covariates and adjustment based on dataset feature
into our fourth model, and compare the result with the one obtained via Neyman
repeated sampling approach.
In addition to finite-sample ATE and super-population ATE, we further define
3 finite-sample causal estimands, which are quantile treatment effects (QTE), for
the 0.25, 0.50 and 0.75 quantiles. These QTEs represent the difference between
pth quantiles of the potential outcomes where some values are observed and some
are imputed,τqte = Qp (Yi (1)) − Qp (Yi (0)). These estimands are generally used for
evaluating the effect of treatment at the top or bottom of the distribution.
We start by thinking about the joint distribution of Yi (0) and Yi (1) in the general
case without covariate, defined as follows:
! ! !!
Yi (0) µc σc2 ρσc σt
|θ ∼ Normal , (24)
Yi (1) µt ρσt σc σt2
The posterior distribution of model parameters is obtained by combining the prior

distribution with likelihood function. Conditional on assignment W and parameters
θ, the likelihood function of Yiobs followed from Equation 1 is given as follows:
Pr(Yiobs |W, θ) ∼ Normal(Wi · µt + (1 − Wi ) · µc , Wi σt2 + (1 − Wi ) · σc2 ) (25)
15
where µc and µt are the posterior means of Yi (0) and Yi (1), and σc2 and σt2 are the
posterior variances of Yi (0) and Yi (1). If we set µc = α as the mean of potential
outcomes under control, then µt equals to α + τ as the mean of potential outcomes
after treatment. Furthermore, we assign the same Gaussian prior with zero mean
and variance of 1002 for all parameters: α, τ, σc2 , and σt2 . The standard deviation of
the prior distribution (i.e., 100) has to be large enough to cover the scale of outcome
variable ranges from 0 to 60.3 (see Table 6):
α, τ, σc2 , σt2 ∼ Normal(0, 1002 ) (26)
After we obtained the posterior distributions of parameters by combining the likeli-

hood and prior, we can then derive the sampling distribution for the missing potential
outcomes given treatment assignment vector, observed potential outcomes and pa-
rameters, Pr(Y mis |Y obs , W, θ). For units receiving treatment (i.e., Wi = 1), we can
impute the missing control potential outcomes from the conditional distribution of
Yi (0) given Yi (1). Similarly, we can impute the missing treatment potential outcomes
from the conditional distribution of Yi (1) given Yi (0) for units receiving control (i.e.,
Wi = 0). Given that the joint distribution is a bivariate normal distribution, the
conditional distribution can be easily found as follows:
σt
Pr(Yi (1)|Yi (0), Wi = 0, θ) ∼ Normal(µt + ρ · · (Yi (0) − µc ), σt2 (1 − ρ2 )) (27)
σc
σc
Pr(Yi (0)|Yi (1), Wi = 1, θ) ∼ Normal(µc + ρ · · (Yi (1) − µt ), σc2 (1 − ρ2 )) (28)
σt
6.2 Model 1: Assuming independence between potential out-

comes
Model 1 assumes independence between two potential outcomes, Yi (0) and Yi (1),
hence the correlation coefficient, ρ, equals to 0 from Equation 24:
! ! !!
Yi (0) µc σc2 0
|θ ∼ Normal , (29)
Yi (1) µt 0 σt2
We use Stan program (see Appendix A.3) in Matlab to sequentially draw the posterior
predictive distribution of missing potential outcomes and the posterior distribution
of the parameters. For summary statistics of model 1, please see Table 9 in Appendix
A.1.
All 5 causal estimands are presented as follows:
16
(a) Finite-Sample ATE vs Super-Population (b) Quantile treatment effects for 0.25, 0.50
ATE and 0.5 quantiles
From Figure 1a, we observe that the posterior standard deviation of super-population
ATE is considerably greater than that of finite-sample ATE. From Figure 1b, one
should notice that the posterior distribution of τ0.25 does not look like a normal
distribution, it is because over 31% of the outcome variable, re78, is zero and this
causes the irregularity on the behavior of normal distribution. Figure 1b also provides
some knowledge on the heterogeneity of treatment effect, that is how treatment might
affect different experimental subjects at different levels. For instance, being assigned
to the job-training program provides more benefits for subjects at the high end of the
earnings distribution.
6.3 Model 2: Assuming constant individual treatment effects

In order to investigate the sensitivity of result to the dependence between two poten-
tial outcomes, we consider the most conservative case where we assume the individual
treatment effects are constant across all units, and this suggests that the two potential
outcomes should be perfectly correlated (i.e., ρ = 1.0). Thus the variances for two
potential outcomes are equal as follows:
σc2 = σt2 = σ 2 ∼ Normal(0, 1002 ) (30)
Correspondingly, the joint distribution of two potential outcomes conditional on pa-

rameters is: ! ! !!
Yi (0) µc σ2 σ2
|θ ∼ Normal , (31)
Yi (1) µt σ2 σ2
The variance of first 15 example individual treatment effects of Model 1 and Model
2 are displayed below:
17
(a) Model 1: independence between two (b) Model 2: constant individual treatment
potential outcomes effects
Figure 2: First 15 example individual treatment effects of 2 models
From Figure 2, one observes that there are considerably large variations among pos-
terior means, with large uncertainty intervals in Model 1. However, the posterior
means are the same with minimal uncertainty interval in Model 2. It implies that the
majority of uncertainty in the imputation process arises from when multiply-imputing
[20] the missing potential outcomes.
The table below provides some statistics of causal estimands of our interest for
Model 2, generated by Stan program:
Table 7: Summary statistics of the causal estimands
mean se mean sd 10% 50% 90% n eff Rhat
α 4.536 0.012 0.403 4.018 4.527 5.046 1109 1

τ 1.834 0.020 0.763 1.019 1.848 2.596 991 1
σ 6.602 0.006 0.222 6.320 6.596 6.890 1396 1
τfs 1.834 0.020 0.763 1.019 1.848 2.596 991 1
τ0.25 1.834 0.020 0.763 1.019 1.848 2.596 991 1
τ0.50 1.834 0.020 0.763 1.019 1.848 2.596 991 1
τ0.75 1.834 0.020 0.763 1.019 1.848 2.596 991 1
18
where se mean represents the Monte Carlo standard error, n eff represents the
effective sample size and Rhat is R-hat statistics.
It is assumed there is no treatment effect heterogeneity across the units, hence all
QTE estimands are equal as expected. We notice that the finite sample ATE is equal
to the super-population ATE in this case. It suggests that Model 2 does not only
provide a ”conservative” estimate of the posterior variance for the finite sample ATE,
but also give an unbiased estimate of the super-population ATE in the worst-case
scenario.
6.4 Model 3: Model-based imputation with covariates

Model 1 and Model 2 only discuss cases without covariates in which the posterior
means of missing control outcomes are centred around α, and means of missing treat-
ment outcomes are centred around α+τ . By considering the pre-treatment covariates,
we add Xi βc and Xi βt to the posterior means of two potential outcomes as follows:
! ! !!
Yi (0) α + Xi βc σc2 0
|θ ∼ Normal , (32)
Yi (1) α + Xi βt + τ 0 σt2
where βc and βt are vectors of the slope coefficients for the units receiving control
and treatment respectively. We define βinter to be the difference between these two
vectors (i.e., βinter = βt − βc ), βinter can be obtained by adding an interaction term
between X and W. We assign same prior distribution with mean of 0 and variance of
100 to these two parameters, and again we assume they are prior independent from
the other parameters:
β, βinter ∼ Normal(0, 100) (33)
The rest of the procedures for model-based approach with covariates are similar to
the one without covariates. We can derive the conditional distribution of missing
potential outcome for unit i given observed potential outcomes, assignment vector,
covariates and parameters are given as follows:
Yimis |Yobs , W, X, θ ∼ Normal(Wi · Xi βc + (1 − Wi ) · Xi βt , Wi · σt2 + (1 − Wi ) · σt2 ) (34)
The following figure displays the posterior predictive distribution of the first 15 miss-
ing control outcomes, and for summary statistics for the causal estimands of our
interest, please see Table 10 in Appendix A.1.
19
(a) Model 1: without covariates (b) Model 3: with covariates
Figure 3: Distributions of first 15 missing control potential outcomes
We observe that the distributions of missing potential outcomes are almost the same
in Model 1 while there is greater variation among posterior means in Model 3. Adding
pre-treatment covariates provides more information about the prediction of missing
data, therefore this improves the imputations of missing potential outcomes.
However, all 3 models discussed above indirectly assume continuity of the distri-
bution of potential outcomes while there is a high percentage of zeros in the observed
outcomes (i.e., approximately 31%). Hence the results we obtained are implausible.
This motivates us to construct a model that fits a conditional distribution only to
those strictly positive values of potential outcomes.
6.5 Model 4: Two-Part Model

In the two-part model (or zero-inflated model), we derive two parts of the conditional
distribution for both Yi (0) and Yi (1). First, we can formulate the probability of
20
positive values of Yi (0) and Yi (1) using logistic regression as follows:
exp (Xi γc )
Pr(Yi (0) > 1|Xi , Wi , θ) = (35)
1 + exp(Xi γc )
exp (Xi γt )
Pr(Yi (1) > 1|Xi , Wi , θ) = (36)
1 + exp(Xi γt )
where γc and γt denote the slope coefficients for the control and treated units respec-
tively. Similar as before, we define γinter to be the difference between the two vectors
(i.e., γinter = γt − γc ). We assign a t-distribution prior to these parameters with 5
degrees of freedom and zero means:
γc , γinter ∼ t0 (5) (37)
Second, we assume the logarithm of the potential outcome conditional on the positive
values of outcome is normally distributed:
ln (Yi (0))|Yi (0) > 0, Xi , Wi , θ ∼ Normal(Xi , βc , σc2 ) (38)

ln (Yi (1))|Yi (1) > 0, Xi , Wi , θ ∼ Normal(Xi , βt , σt2 ) (39)
The rest of procedures of this model is similar to Model 3. (For Stan program of this
model, please see Appendix A.3).
By using Equation 9 and 14 of Neyman repeated sampling approach, and adjusting
with 10 covariates via linear regression methods, we can derive the point estimate
and standard error of τ . The statistics of Neyman and Bayesian approach can be
summarized as follows:
Table 8: Point and interval estimates for the population average treatment effect τ
Method Neyman’s approach Model 1 Model 3 Model 4
Point estimate 1.67 1.77 1.55 1.60

Standard error of τ 0.64
Standard deviation of τ 0.69 0.68 0.74
95% confidence interval 0.41 - 2.92
95% credible interval 0.45 - 3.08 0.24 - 2.86 0.15 - 3.05
Model 1 and Model 3 provide nearly the same standard deviation of population
ATE, it indicates that the level of uncertainty is determined by the independence
assumption between two potential outcomes instead of the presence of the covariates.
The point estimates obtained, 1.67 via Neyman’s repeated sampling and 1.77,1.55
21
and 1.60 via Bayesian approach are not small in magnitude relative to the mean and
standard deviation of the re78. Therefore, we can conclude that being assigned to
the job-training program brings the positive effect on people’s earnings in 1978.
By comparing the interval estimates of τ from the Neyman and Bayesian approach,
we see that there is a minor difference between the four intervals relative to the scale
of re78. From Neyman repeated sampling analysis, we are 95% confident that the
super-population ATE of treat = 1 on re78 is between 0.41 − 2.92. Similarly, we are
95% confident that the credible interval for τ with Bayesian Model 1 is from 0.45 to
3.08, with model 3 is from 0.24 to 2.86, and with Model 4 is from 0.15 − 3.05.
In this example, one obvious advantage of Neyman approach is that it avoids the
need to specify the prior distribution, Pr(θ), for the parameters that govern the joint
distribution of two potential outcomes. However, it also comes with the price for
heavily relying on large sample approximations, so that it can justify the proposed
frequentist confidence interval. It should be noted that in large samples, the presumed
benefit of Neyman approach vanishes because the practical implications of the choice
of prior is limited by the Bernstein-Von Mises Theorem [26].
7 Conclusion
In this paper, We have introduced the fundamental of causal inference and described
how Rubin causal model turns a potential outcome framework into a missing data
problem. Based on the stability assumption, SUTVA, we also introduced the Neyman
repeated sampling approach, which is a frequentist method that provides an unbi-
ased estimator for point estimate of ATE and constructs an interval estimator for
it. Furthermore, we formulated 4 main steps from Bayesian perspective to derive the
posterior predictive distributions of missing potential outcomes and interested causal
estimands.
Based on the theoretical results obtained in Section 3 and Section 4, we built 4
different models for NSW dataset from the simplest one to the more sophisticated
one, to assess how each factor affected the results of the models. From simulation
result of Model 1, we observed that a substantial amount of zeros can induce a
non-normal distribution of 0.25 quantile QTE. By investigating the conservative case
in Model 2, we determined the bound of super-population ATE where we assumed
the potential outcomes are perfectly correlated. We preceded by considering the
presence of covariates in Model 3 and showed that adding pre-treatment covariates can
improve the imputations of missing potential outcomes. However, all 3 models assume
22
continuity of distribution of potential outcomes which contradicts to the feature of
the dataset. We thereby built the most sophisticated model by incorporating all
the considerations into Model 4, which provided the most plausible result among all
4 Bayesian models. We also realized that the Bayesian approach can be relatively
more flexible than frequentist approach (i.e., Neyman’s approach) as it can easily
accommodate with a wide variety of causal estimands such as QTEs.
8 Future Work
There are many real-world causal inference problems that can be tackled more flexibly
with Bayesian approach than with frequentist approach. For instance, when there are
unintended missing data (e.g. patient dropout) in the dataset, Bayesian techniques
such as multiple imputation or Expectation-Maximization algorithm for imputing the
missing data, are well matched with the Bayesian approach to causal inference ([27],
Section 3 in [28]). However, the way the missing data imputed may have non-trivial
impact on the causal analysis [29]. More study in the understanding of such impact
may be valuable to investigate.
Alternatively, we can explore the situation when the units (e.g. people) refuse to
comply the active treatment, but take the control treatment instead. This compli-
cation is often called noncompliance, a very active research field for application of
Bayesian approaches in causal inference. In the case described above, the sensitivity
to the prior assumption can be very severe. A possible direction of research can be
revealing the sensitivity and developing the prior restrictions.
23
A Appendix
A.1 Summary Statistics of Causal Estimands
Table 9: Summary Statistics for Model 1
α 4.554 0.008 0.352 4.104 4.562 5.000 1848 0.999

τ 1.776 0.017 0.693 0.877 1.778 2.686 1708 0.999
σc 5.505 0.005 0.247 5.190 5.495 5.825 2058 0.999
σt 7.911 0.009 0.413 7.390 7.897 8.460 1943 1.000
τfs 1.784 0.013 0.512 1.140 1.778 2.451 1524 1.000
τ0.25 0.607 0.009 0.358 0.000 0.637 1.060 1684 1.001
τ0.50 1.633 0.014 0.581 0.927 1.624 2.399 1720 1.000
τ0.75 3.088 0.016 0.653 2.274 3.086 3.889 1704 1.000
Table 10: Summary Statistics for Model 3
α 4.577 0.007 0.345 4.141 4.575 5.021 2535 1.000

τ 1.556 0.014 0.677 0.686 1.546 2.441 2240 1.000
σc 5.434 0.004 0.251 5.114 5.421 5.773 3270 0.999
σt 7.708 0.008 0.431 7.179 7.693 8.278 2682 1.001
τfs 1.551 0.012 0.516 0.914 1.541 2.222 1893 1.000
τ0.25 0.416 0.007 0.341 0.000 0.485 0.819 2340 1.001
τ0.50 1.411 0.012 0.553 0.730 1.369 2.138 1998 1.000
τ0.75 2.899 0.016 0.655 2.074 2.895 3.628 1700 1.000
24
A.2 Pair plots of parameters
Figure 4: MCMC Pairs of parameters for Model 1
25
26
A.3 MATLABStan Code
The following code is used to build Stan program for Model 1 and Model 4 for NSW
project. The code for Model 2 and Model 3 are not included here since they are
straightforward and not very informative.
The code for Model 1:
data {
int<lower=0> N; %% size of sample
vector[N] y; %% potential outcome
vector[N] w; %% treatment assignment
}
parameters {
real alpha; %% intercept
real tau; %% super-population ATE
real<lower=0> sigma_c; %% standard deviation for control units
real<lower=0> sigma_t; %% standard deviation for treated units
}
model {
alpha ~ normal(0, 100); %% priors all have normal distributions

tau ~ normal(0, 100);
sigma_c ~ normal(0, 100);
sigma_t ~ normal(0, 100);
y ~ normal(alpha + tau*w, sigma_t*w + sigma_c*(1 - w)); %% likelood function

}
generated quantities{
real y0[N]; %% potential outcome for W = 0
real y1[N]; %% potential outcome for W = 1
real tau_unit[N]; %% individual treatment effect
real tau_fs; %% finite sample ATE
real tau_qte25; %% QTE at 0.25
for (n in 1:N){
real mu_c = alpha;
27
real mu_t = alpha + tau;
if(w[n] == 1){
y0[n] = normal_rng(mu_c, sigma_c);
y1[n] = y[n];
}else{
y0[n] = y[n];
y1[n] = normal_rng(mu_t, sigma_t);
}
tau_unit[n] = y1[n] - y0[n];
}
tau_fs = mean(tau_unit);
tau_qte25 = quantile(to_vector(y1), 0.25) - quantile(to_vector(y0), 0.25);
}
The code for Mode 4:
data {
int<lower=0> N; %% size of sample
int<lower=0> N_cov; %% NO. of covariates
real y[N]; %% continuous potential outcome y
int<lower=0,upper=1> y_pos[N]; %% flag for y > 0
int<lower=0,upper=1> z[N]; %% treatment assignment
real<lower=-1,upper=1> rho; %% correlation coeffs
vector[N_cov] x[N]; %% vector of covariates
vector[N_cov] xz_inter[N]; %% introduced interaction terms
parameters {
%% two part model

real alpha_bin; %% binary intercept
row_vector[N_cov] beta_bin; %% coefficients for vector of covariates
28
row_vector[N_cov] beta_inter_bin; %% coeffs for interaction terms
real tau_bin; %% treatment effect
%% Continuous model
real alpha_cont; %% continuous intercept
row_vector[N_cov] beta_cont; %% coefficients for vector of covariates
row_vector[N_cov] beta_inter_cont; %% coeffs for interaction terms
real tau_cont; %% treatment effect for continuous part
real<lower=0> sigma_t; %% standard deviation for treated units

real<lower=0> sigma_c; %% standard deviation for control units
}
model {
%% Priors
alpha_bin ~ student_t(5, 0, 2.5); %% student distribution with 5 dofs
beta_bin ~ student_t(5, 0, 2.5);
beta_inter_bin ~ student_t(5, 0, 2.5);
tau_bin ~ student_t(5, 0, 2.5);
%% Priors for the Continuous model

alpha_cont ~ normal(0, 100); %% normal distribution same as before
sigma_c ~ normal(0, 100);
sigma_t ~ normal(0, 100);
beta_cont ~ normal(0, 100);
beta_inter_cont ~ normal(0, 100);
tau_cont ~ normal(0, 100);
for (n in 1:N) {
%% binary model
y_pos[n] ~ bernoulli_logit(alpha_bin + beta_bin * x[n] +
beta_inter_bin * xz_inter[n] + tau_bin * z[n]); %% likelihood function
%% Continuous Sub-Model for positive values

if(y_pos[n] == 1)
29
log(y[n]) ~ normal(alpha_cont + beta_cont * x[n] +
beta_inter_cont * xz_inter[n] + tau_cont * z[n], sigma_t*z[n] +
sigma_c*(1-z[n]));
}
}
generated quantities{
real y0_bar;
real y1_bar;
real tau_samp;
real tau_qte25;
real tau_qte50;
real tau_qte75;
{
real y0[N];
real y1[N];
real tau_ind[N];
int y_pred_pos;
for(n in 1:N){
%% predicted success
real theta_c = inv_logit(alpha_bin + beta_bin * x[n]);
real theta_t = inv_logit(alpha_bin + beta_bin * x[n] +
beta_inter_bin * x[n] + tau_bin);
real mu_c = alpha_cont + beta_cont * x[n]; %% log mean

real mu_t = alpha_cont + beta_cont * x[n] +
beta_inter_cont * x[n] + tau_cont;
if(z[n] == 1){
30
y_pred_pos = bernoulli_rng(theta_c);
if(y_pred_pos == 0){
y0[n] = 0;
}else{
y0[n] = exp(normal_rng(mu_c +
rho*(sigma_c/sigma_t)*(y[n] - mu_t), sigma_c*sqrt(1 - rho^2)));
}
y1[n] = y[n]; %% missing data imputations for y1

}else{
y0[n] = y[n]; %% missing data imputations for y0
y_pred_pos = bernoulli_rng(theta_t);
if(y_pred_pos == 0){
y1[n] = 0;
}else{
y1[n] = exp(normal_rng(mu_t +
rho*(sigma_t/sigma_c)*(y[n] - mu_c), sigma_t*sqrt(1 - rho^2)));
}
}
tau_ind[n] = y1[n] - y0[n];
}
y0_bar = mean(y0); %% store mean values

y1_bar = mean(y1);
tau_samp = mean(tau_ind);

}
}
31
References
[1] Borch K. H. Economics of uncertainty, notes for lectures 10. critiques of expected
utility theory. University of Princeton, 2009.
[2] Causi L. G. Theories of investor behaviour: From the efficient market hypothesis
to behavioural finance. 2017.
[3] Amos Tversky and Daniel Kahneman. Judgment under uncertainty: Heuristics
and biases. science, 185(4157):1124–1131, 1974.
[4] John Y Campbell, John J Champbell, John W Campbell, Andrew W Lo, Andrew
Wen-Chuan Lo, and Archie Craig MacKinlay. The econometrics of financial
markets. princeton University press, 1997.
[5] Maurice Allais. Le comportement de l’homme rationnel devant le risque: critique

des postulats et axiomes de l’école américaine. Econometrica: Journal of the
Econometric Society, pages 503–546, 1953.
[6] Daniel Kahneman and Amos Tversky. Prospect theory: An analysis of decision
under risk. Econometrica, 47(2):263–292, 1979.
[7] Rubin D.B. Bayesian inference for causal effects: The role of randomization.
Ann. Statist., 6(1):34–58, 01 1978.
[8] Rubin D.B. Randomization analysis of experimental data: The fisher random-
ization test. Journal of the American Statistical Association, 75(371):575–582,
1980.
[9] Imben G.W. and Rubin D.B. Neyman’s Repeated Sampling Approach to Com-
pletely Randomized Experiments, page 83112. Cambridge University Press, 2015.
[10] Dawid A. P. Conditional independence in statistical theory. Journal of the Royal

Statistical Society. Series B (Methodological), 41(1):1–31, 1979.
[11] Neyman J. S., Dabrowska D. M., and Speed T. P. On the application of probabil-
ity theory to agricultural experiments. essay on principles. section 9. Statistical
Science, 5(4):465–472, 1990.
32
[12] Neyman J. On the two different aspects of the representative method: The
method of stratified sampling and the method of purposive selection. Journal of
the Royal Statistical Society, 97(4):558–625, 1934.
[13] Morgan K. L. and Rubin D. B. Rerandomization to improve covariate balance

in experiments. The Annals of Statistics, 40(2):1263–1282, 2012.
[14] Enderlein G. Fisher, ronald a.: The design of experiments. eighth edition. oliver
and boyd, edinburgh 1966. xvi + 248 s., 5 abb., 39 tab., brosch. preis s 15.
Biometrische Zeitschrift, 11(2):139–139, 1969.
[15] Holland P. W. Statistics and causal inference. Journal of the American Statistical
Association, 81(396):945–960, 1986.
[16] Rubin D.B. Estimating causal effects of treatments in randomized and nonran-
domized studies. Journal of Educational Psychology, 66, 10 1974.
[17] Bruno de Finetti. Foresight: Its Logical Laws, Its Subjective Sources, pages 134–
174. Springer New York, New York, NY, 1992.
[18] Reinhardt H. Theory of probability: A critical introductory treatment, vol. 2

(bruno de finetti). SIAM Review, 20(1):200–201, 1978.
[19] O’Hagan A., West M., Rubin D.B., Wang X., Yin L., and Zell E. Bayesian causal
inference: Approaches to estimating the effect of treating hospital type on cancer
survival in sweden using principal stratification, 08 2018.
[20] Rubin D.B. Multiple Imputation for Nonresponse in Surveys. Wiley, 1987.
[21] LaLonde R. J. Evaluating the econometric evaluations of training programs with

experimental data. The American Economic Review, 76(4):604–620, 1986.
[22] Heckman J. J. and Robb R. Alternative methods for evaluating the impact of
interventions: An overview. Journal of Econometrics, 30(1):239 – 267, 1985.
[23] Dehejia R. H. and Wahba S. Causal effects in nonexperimental studies: Reeval-

uating the evaluation of training programs. Journal of the American Statistical
Association, 94(448):1053–1062, 1999.
[24] Smith J. A. and Todd P. E. Reconciling conflicting evidence on the perfor-

mance of propensity-score matching methods. The American Economic Review,
91(2):112–118, 2001.
33
[25] Rosenbaum P. R. and Rubin D. B. Sensitivity of bayes inference with data-
dependent stopping rules. The American Statistician, 38(2):106–109, 1984.
[26] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and
Probabilistic Mathematics. Cambridge University Press, 1998.
[27] Ding P. and Li F. Causal inference: A missing data perspective. Statist. Sci.,
33(2):214–237, 05 2018.
[28] Gelman A., Carlin J.B, Stern H.S., Dunson D.B., Vehtari A., and Rubin D.B.
Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statis-
tical Science. Taylor & Francis, 2013.
[29] Mitra R. and Reiter J. P. A comparison of two methods of estimating propen-

sity scores after multiple imputation. Statistical Methods in Medical Research,
25(1):188–204, 2016. PMID: 22687877.
34

Causal Inference Special Topic

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Causal Inference Special Topic

Uploaded by

Copyright:

Available Formats

Behavioural Finance

Candidate Number: 1031077

2 Expected Utility Theory 1

4 Neyman Repeated Sampling Approach 9

6 Application in NSW example 14

2 Expected Utility Theory

Group Given Median Of Estimation

2.2 Doubt On Utility Function

A 2500(0.33) 2400(0.66) 0(0.01) 18%

A 4000(0.80) 0(0.20) 80%

This example was introduced by Allais [5]. In fact, C can be represented as

A 4000(0.20) 0(0.80) 22%

A 1000(0.50) 0(0.50) 16%

Yobs = W × Y (1) + (1 − W ) × Y (0) (1)

Unit W X Y(0) Y(1) Individual causal effect

τ = E[Y (1)] − E[Y (0)] (3)

Assumption 1 No interference among units.

Assumption 2 No hidden version of treatment and control.

3.2 Treatment Assignment Mechanism

Pr(W |X, Y (0), Y (1)) (5)

In a completely randomized experiment with N units, a fixed number of Nt units are

0 < Pr(Wi = 1|Xi = x) < 1

Pr(W |X, Y (0), Y (1)) = Pr(W |X) (7)

4 Neyman Repeated Sampling Approach

τ̂ = Ȳ1obs − Ȳ0obs (9)

neyman s2c s2t

Neyman’s estimator has been widely used in randomization-based causal inference

H0 : Yi (1) = Yi (0) ∀i = 1, · · · , N (19)

5.2 Posterior predictive distribution of missing potential out-

Pr(X, Yi (0), Yi (1)|θ) = Pr(Xi |θX )Pr(Yi (0), Yi (1)|Xi , θY |X ) (23)

5.3 Posterior distribution of causal effects

5.4 Model-based inference

1. Derive Pr(Ymis |Yobs , X, W, θ)

2. Derive the posterior distribution for θ, Pr(θ|X, Yobs , W )

3. Combine the two distributions to get the posterior distribution

4. Derive the posterior distribution of causal effect, τ , Pr(τ |X, W, Yobs )

Table 6: Summary Statistics for NSW Dataset

Variable Description mean s.d. min max Type

treat treatment status 0.42 0.53 0 1 treatment

The posterior distribution of model parameters is obtained by combining the prior

Pr(Yiobs |W, θ) ∼ Normal(Wi · µt + (1 − Wi ) · µc , Wi σt2 + (1 − Wi ) · σc2 ) (25)

α, τ, σc2 , σt2 ∼ Normal(0, 1002 ) (26)

After we obtained the posterior distributions of parameters by combining the likeli-

6.2 Model 1: Assuming independence between potential out-

6.3 Model 2: Assuming constant individual treatment effects

σc2 = σt2 = σ 2 ∼ Normal(0, 1002 ) (30)

Correspondingly, the joint distribution of two potential outcomes conditional on pa-

Figure 2: First 15 example individual treatment effects of 2 models

Table 7: Summary statistics of the causal estimands

mean se mean sd 10% 50% 90% n eff Rhat

α 4.536 0.012 0.403 4.018 4.527 5.046 1109 1

6.4 Model 3: Model-based imputation with covariates

Yimis |Yobs , W, X, θ ∼ Normal(Wi · Xi βc + (1 − Wi ) · Xi βt , Wi · σt2 + (1 − Wi ) · σt2 ) (34)

Figure 3: Distributions of first 15 missing control potential outcomes

6.5 Model 4: Two-Part Model

γc , γinter ∼ t0 (5) (37)

ln (Yi (0))|Yi (0) > 0, Xi , Wi , θ ∼ Normal(Xi , βc , σc2 ) (38)

Method Neyman’s approach Model 1 Model 3 Model 4

Point estimate 1.67 1.77 1.55 1.60

Table 9: Summary Statistics for Model 1

mean se mean sd 10% 50% 90% n eff Rhat

α 4.554 0.008 0.352 4.104 4.562 5.000 1848 0.999

y ~ normal(alpha + tauw, sigma_tw + sigma_c*(1 - w)); %% likelood function